AlphaFind: Machine Learning and Clustering Enable Proteome-Wide Fast 3D Structure Similarity Search
Procházka et al. recently reported AlphaFind, which employs a machine learning model to discover the tertiary structures most similar to a given protein within the AlphaFold (AF) database, which holds structures predicted by AlphaFold 2 (AF2).
AlphaFind attempts to overcome the limitations of existing protein structure search tools such as Foldseek, 3D-SURFER, and the Dali server. The Dali server and 3D-SURFER do not scale well to large protein structure datasets. Foldseek does not cover the entire AF database: it searches a pre-clustered subset of 52 million structures out of more than 200 million. In addition, Foldseek focuses on local interactions between residues and their neighbors, which limits its use for similarity search.
The Protein Data Bank (PDB) has accumulated more than 200,000 experimentally determined protein structures over seven decades. These data were used to train the AF2 model, which in turn was used to predict, with high accuracy, the more than 200 million protein structures housed in the AF database. Such a massive amount of structural data requires fast methods to organize, explore, and use it efficiently.
AlphaFind is a protein structure search tool. It extracts 3D features from each protein, represents the structure with a previously reported compact embedding method, and combines data clustering with a machine learning model to identify the structures most similar to a given query.
The input to AlphaFind is a UniProt ID, PDB ID, or relevant gene ID for a given protein, while the output is a set of proteins similar to the query.
When given a query, AlphaFind performs the following sequence of steps:
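The idea of searching embeddings through clusters can be illustrated with a minimal sketch. This is not AlphaFind's implementation: the 2D vectors, the protein names, the fixed centroids, and the `n_probe` parameter are all hypothetical, and a plain nearest-centroid rule stands in for the machine learning model that picks candidate buckets.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_buckets(embeddings, centroids):
    """Assign each embedding to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for protein_id, vec in embeddings.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(vec, centroids[i]))
        buckets[nearest].append(protein_id)
    return buckets

def search(query, embeddings, centroids, buckets, n_probe=1, k=2):
    """Return the k closest proteins found in the n_probe nearest buckets."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: distance(query, centroids[i]))
    candidates = [pid for i in ranked[:n_probe] for pid in buckets[i]]
    candidates.sort(key=lambda pid: distance(query, embeddings[pid]))
    return candidates[:k]

# Hypothetical 2D embeddings (real embeddings have many more dimensions).
embeddings = {
    "prot_A": (1.9, 0.0),
    "prot_B": (2.1, 0.0),
    "prot_C": (4.0, 0.5),
    "prot_D": (0.0, 0.5),
}
centroids = [(0.0, 0.0), (4.0, 0.0)]  # assumed to come from a prior clustering step
buckets = build_buckets(embeddings, centroids)

query = (2.05, 0.0)
fast = search(query, embeddings, centroids, buckets, n_probe=1)
exhaustive = search(query, embeddings, centroids, buckets, n_probe=2)
print(fast)        # candidates from the single nearest bucket
print(exhaustive)  # candidates from all buckets
```

In this toy setup the query sits near a bucket boundary, so probing one bucket misses `prot_A` even though it is the second-closest protein overall. Probing more buckets recovers it at the cost of more distance computations, which is the general trade-off behind any approximate search of this kind.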
1️⃣ Converting the input into a UniProt ID
2️⃣ Identifying the associated candidate proteins
3️⃣ Calculating global and local similarity
4️⃣ Retrieving metadata for query and results from AF database
5️⃣ Superposing and visualizing pairs of input and output using NGL viewer, with results also linked to Mol*
6️⃣ Optionally expanding the search results
7️⃣ Downloading the search results.
While AlphaFind is an incredible resource, it does have some limitations. AlphaFind was developed on top of the older version 3 of the AF database, prior to the release of version 4. Because it trades precision for a lower computational load, the results returned by AlphaFind for a given query are approximate and may not always contain all of the most similar structures. Also, AlphaFind weighs all segments of an AF2-predicted structure equally and does not distinguish between structured and unstructured regions (i.e. high- and low-confidence regions), which can bias search results.
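Readers who want to gauge how much of a returned structure is low confidence can inspect its pLDDT scores, which AlphaFold models store in the B-factor column of their PDB files. The sketch below is not part of AlphaFind. The helper names, the toy records, and the threshold of 70 (a commonly used cutoff for "confident" residues) are assumptions made for illustration.

```python
def ca_plddt(pdb_lines):
    """Read per-residue pLDDT from the B-factor column (61-66) of CA atoms."""
    scores = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

def confident_fraction(scores, threshold=70.0):
    """Fraction of residues whose pLDDT meets the (assumed) threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

def toy_ca_record(serial, res_seq, plddt):
    """Build a fixed-column PDB ATOM record for a CA atom with dummy coordinates."""
    return (f"ATOM  {serial:5d}  CA  ALA A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")

# Hypothetical four-residue model: two confident and two low-confidence residues.
toy_model = [toy_ca_record(i, i, p)
             for i, p in enumerate([95.0, 88.0, 45.0, 30.0], start=1)]

scores = ca_plddt(toy_model)
print(confident_fraction(scores))
```

A low fraction suggests that a large part of the structure is predicted with low confidence, so a similarity hit involving that structure deserves a closer look.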
References
Paper: AlphaFind: discover structure similarity across the proteome in AlphaFold DB
GitHub: https://github.com/Coda-Research-Group/AlphaFind
Web app: https://alphafind.fi.muni.cz/search
Manual: https://github.com/Coda-Research-Group/AlphaFind/wiki/Manual