Every protein in our cells has evolved to fold into a specific structure to perform its functions. However, determining these structures experimentally is often challenging, in some cases taking years. Recently, machine-learning algorithms have been designed to predict a protein’s structure directly from its amino acid sequence in minutes to hours. Since the release of the first of these algorithms, AlphaFold and RoseTTAFold, several more have been developed. These have been complemented by tools that leverage the outputs to give structural context to biochemical data, screen for novel protein–protein interactions or even help solve experimental structures. In addition, several public resources have incorporated the predictions into their databases, making the data open to all. Here, we provide a user-focused perspective on machine-learning protein structure prediction, covering some of the popular applications and highlighting caveats. Used effectively, predictive programs offer the potential to speed up research and guide experimental design.

Proteins are biological machines that have evolved over billions of years to perform molecular tasks. Each protein is composed of the same 20 amino acid building blocks (also known as residues), combined in different ways to produce diverse nanometre-sized structures, from the long contractile filaments in our muscles to the enzymes that catalyse chemical reactions to power our cells. The 3D structure of a protein is specific to its function, e.g. an enzyme must fold in a precise manner such that its catalytic residues can properly engage a substrate. Therefore, characterizing a protein’s structure is often an important step in understanding its biological role, as well as providing insight into the impact of disease-causing mutations.

One of the holy grails of structural biology has been to computationally determine a protein’s structure directly from its primary sequence, since experimental methods are often time consuming and costly. Since 1994, progress towards this goal has been evaluated in the international experiment known as the Critical Assessment of Structure Prediction (CASP). At CASP13 in 2018, AlphaFold from Google DeepMind took a major step forward, followed by further developments at CASP14 in 2020 with AlphaFold2 as well as RoseTTAFold from the Institute for Protein Design. These groups harnessed the pattern recognition power of machine-learning algorithms, training them on the vast repository of experimentally determined structures in the Protein Data Bank (PDB). The resulting algorithms were able to produce highly accurate predictions of proteins they had not previously encountered.

Since CASP14, several machine-learning structure predictors (hereafter called ML-SPs) have emerged. In one form or another, the algorithms are all trained to infer which amino acids within a protein sequence are close in 3D space by examining if they have followed similar evolutionary paths. Depending on the implementation, ML-SPs can be split into two categories. The first is multiple sequence alignment structure predictors (MSA-SPs), which include AlphaFold2 and RoseTTAFold. These algorithms generate an MSA from the input sequence and extract the important co-evolutionary signals to predict a structure. The second category, protein language model structure predictors (pLM-SPs), includes ESMFold and OmegaFold. These are designed to embed the evolutionary relationships of interacting amino acids directly into their algorithm, allowing them to predict structure from sequence without generating an MSA each time.
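To make the idea of a co-evolutionary signal concrete, the following minimal sketch computes the mutual information between two columns of a toy alignment. This is a classical statistic used here purely for illustration: it is an assumption of this example, not the calculation performed inside AlphaFold2, RoseTTAFold or any other ML-SP, and the four-sequence alignment and the function name are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical alignment: column 0 is fully conserved, columns 1 and 3
# change together (K with E, R with D), column 2 varies independently.
toy_msa = ["AKGE", "AKSE", "ARGD", "ARSD"]

print(column_mutual_information(toy_msa, 1, 3))  # covarying pair
print(column_mutual_information(toy_msa, 1, 2))  # independent pair
print(column_mutual_information(toy_msa, 0, 1))  # conserved column carries no signal
```

In this toy case only the covarying pair of columns scores above zero, which is the kind of pattern that hints two residues may be close in 3D space. Real alignments need far more sequences and corrections for shared ancestry before such a signal becomes meaningful.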

Both MSA-SPs and pLM-SPs provide an accessible first glance into a protein’s shape in the absence of experimental data. The predictions are typically accompanied by an overall score (e.g. predicted template modelling score; pTM from AlphaFold2). To effectively interpret a prediction, however, it is crucial to evaluate all the calculated metrics, which are described in original publications as well as in several excellent guides (see Further Reading).

The simplest quality metric is an estimate of the positional accuracy of each residue. For example, the predicted local distance difference test (pLDDT) is a per-residue score that can be easily mapped onto a structure to display the confidence of local features like α-helices (Figure 1a). A less intuitive metric is the predicted aligned error (PAE), a measure of the confidence in the relative position of a pair of amino acids in the structure, plotted for all pairs in the PAE plot (Figure 1b). Confidently predicted interactions between amino acids have low PAEs (i.e., low error). As a result, low-PAE areas of the plot correspond to structured domains, interacting peptides and protein–protein interfaces. Visualization tools like ChimeraX and PAE Viewer provide a convenient way to interpret the PAE plot in the context of the structure. Note that the PAE plot can be asymmetric, as the PAE is influenced by the local conformation of each residue in the pair (e.g., residue 2 is confident relative to residue 1, but not vice versa). This may suggest that the orientation of two interacting residues or interfaces is less clear.
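As a minimal sketch of how a PAE matrix can be summarized numerically, the example below averages the PAE within and between two groups of residues and measures the asymmetry for one residue pair. The 4 × 4 matrix, its values and the function names are hypothetical, and which index corresponds to the aligned residue depends on the convention of the tool that produced the file.

```python
def mean_pae(pae, rows, cols):
    """Average PAE (in Å) over the block defined by row and column indices."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def pae_asymmetry(pae, i, j):
    """Difference between the two PAE entries for one residue pair."""
    return pae[i][j] - pae[j][i]


# Hypothetical PAE matrix: residues 0-1 form domain A, residues 2-3 domain B.
toy_pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 21.0, 19.0],
    [8.0, 9.0, 0.5, 1.5],
    [10.0, 9.0, 1.5, 0.5],
]
domain_a = [0, 1]
domain_b = [2, 3]

print(mean_pae(toy_pae, domain_a, domain_a))  # low error within domain A
print(mean_pae(toy_pae, domain_a, domain_b))  # high error, A rows against B columns
print(mean_pae(toy_pae, domain_b, domain_a))  # different value for the reverse block
print(pae_asymmetry(toy_pae, 0, 2))           # asymmetry for a single pair
```

Here the within-domain block is low while the two off-diagonal blocks differ from each other, mirroring the asymmetry described above. Such summaries complement, rather than replace, inspecting the full plot alongside the structure.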

Figure 1

AlphaFold prediction of DLIC1. (a) Prediction of DLIC1 (UniProt ID: Q9Y6G9), coloured by pLDDT, taken from the AlphaFold Protein Structure Database (AF-Q9Y6G9-F1). The Ras-like domain (D1) and helices (H1–H4) are labelled. Note that the pLDDT score is non-linear. (b) PAE plot for panel a, highlighting low-error PAE regions corresponding to D1 and H1–H4.

Although ML-SPs were initially designed to predict one protein at a time, several have been expanded to predict interactions between proteins, notably AlphaFold2 and RoseTTAFold2. While these have been able to predict novel protein complexes, they are not perfect: interactions can be missed (false negatives) or confidently predicted despite contrary experimental evidence (false positives). An example of the latter is an interaction between a short α-helix in dynein’s light intermediate chain (DLIC1) and the N-terminus of RILPL2, which is predicted by AlphaFold2 (Figure 2) but does not occur in vitro (Celestino et al., 2022). This false positive may arise from the conservation of RILPL2’s interacting domain with its homologue RILP, which does bind dynein. This highlights the importance of experimental data in validating the results from ML-SPs.

Figure 2

False positive prediction of RILPL2:DLIC1. (a) Prediction of two copies of RILPL2 residues 1–127 (UniProt ID: Q969X0) and two copies of DLIC1 residues 440–462 (UniProt ID: Q9Y6G9) using ColabFold v2.3. The structure is coloured by pLDDT. (b) PAE plot for the RILPL2–DLIC1 prediction.

Another caveat in interpreting ML-SP results is that they tend to predict one conformation of a given protein. Most proteins alter their conformation both subtly and dramatically in response to different triggers such as nucleotide state or ligand binding, and many have highly flexible regions connecting ordered domains. While it is difficult to encourage ML-SPs to generate different conformations, there has been notable success from subsampling and clustering the MSAs (e.g., using ACE and AF-Cluster). For now, ML-SPs only offer a glimpse into the conformational landscape that a protein might explore.
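The general idea of subsampling an MSA can be sketched as follows. This example uses plain random subsampling, which is an assumption made for illustration and is not the procedure implemented by ACE or AF-Cluster; the function name, the `depth` parameter and the placeholder sequences are all hypothetical.

```python
import random


def subsample_msa(msa, depth, n_samples, seed=0):
    """Return n_samples shallow alignments of at most `depth` sequences.

    The first sequence is treated as the query and is always kept;
    the remaining rows are drawn at random without replacement.
    """
    rng = random.Random(seed)
    query, homologues = msa[0], msa[1:]
    k = min(max(depth - 1, 0), len(homologues))
    return [[query] + rng.sample(homologues, k) for _ in range(n_samples)]


# Hypothetical alignment: one query followed by 20 placeholder homologues.
toy_alignment = ["QUERYSEQ"] + [f"HOMOLOG{i:02d}" for i in range(20)]

for sample in subsample_msa(toy_alignment, depth=5, n_samples=3, seed=1):
    print(sample)
```

Each shallow alignment would then be passed to the predictor separately, and the resulting models compared to see whether different subsets of sequences favour different conformations.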

Several resources have incorporated ML-SP results into their databases, providing easy access to the wider scientific community. AlphaFold2 (in partnership with the EMBL European Bioinformatics Institute) and ESMFold have both released their own databases containing structure predictions for millions of proteins across many organisms. The AlphaFold database has been further incorporated into DALI and FoldSeek, where you can conveniently search for structurally homologous proteins, as well as UniProt, a central resource of protein information.

These resources are valuable, but one must interpret the predictions with care. Proteins in the public databases are currently predicted as monomers, despite many forming obligate oligomers. In addition, there are cases where experimentally derived structures exist for homologues. These can be more insightful than the provided prediction but are currently not readily displayed. Importantly, PAEs cannot be represented easily on a structure like the pLDDT, so the latter is used to colour the model. Due to these caveats, researchers cannot rely solely on UniProt for detailed structural insights. The obligate dimer Hook2 demonstrates some of these limitations. Hook2’s AlphaFold entry in UniProt is monomeric (Figure 3a), showing its C-terminus (H4, H5 and H6 in Figure 3a) wrapped around a main α-helix. The PAE plot provided in the AlphaFold database (but not in UniProt) shows that the conformation of the C-terminus is not predicted confidently (Figure 3b). Running a separate prediction with two copies of Hook2 further shows that the long α-helices fold together into coiled coils (Figure 3c, d).

Figure 3

Hook2 monomer and dimer predictions. (a) Hook2 (UniProt ID: Q96ED9) monomer prediction from the AlphaFold database (AF-Q96ED9-F1), which used AlphaFold Monomer v2.0. (b) PAE plot for the Hook2 monomer. (c) Hook2 dimeric prediction using AlphaFold v2.3. (d) PAE plot for the Hook2 dimer. Domains (D1) and helices (H1–H6) are labelled in panels a–c.

Most projects, notably those investigating protein–protein interactions, require us to use ML-SPs ourselves rather than rely on database predictions. Fortunately, it is now relatively easy to predict the structure of your favourite protein or complex. ML-SPs are available on various webservers (e.g., Robetta) and have also been conveniently implemented as Google Colab notebooks, which allow users to run predictions online using Google’s computational resources. The development of rapid MSA generation by MMseqs2, implemented in ColabFold, has cut down prediction time, making it trivial to run multiple predictions. On Colab, sequences with fewer than ~2000 amino acids can be predicted, a limitation set by the memory of the graphics processing unit (GPU) that is allocated. Local installations can increase the limit to >3000 amino acids when using the latest commercial GPUs. Practically, this limit can be managed by splitting long sequences into smaller fragments harbouring the domains or interfaces in question.
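When domain boundaries are not yet known, one simple fallback is to tile a long sequence with overlapping windows, as in the minimal sketch below. The window size, the overlap and the function name are assumptions of this example; choosing fragments around known domains or interfaces, as suggested above, is generally preferable.

```python
def split_sequence(sequence, max_length, overlap):
    """Tile a sequence with overlapping windows.

    Returns a list of (start, end, fragment) tuples, where start and end
    are 1-based, inclusive residue numbers.
    """
    if max_length <= 0 or not 0 <= overlap < max_length:
        raise ValueError("require max_length > 0 and 0 <= overlap < max_length")
    if not sequence:
        return []
    step = max_length - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + max_length, len(sequence))
        fragments.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return fragments


# Hypothetical 10-residue sequence split into windows of 4 with an overlap of 2.
for first, last, fragment in split_sequence("MKTAYIAKQR", max_length=4, overlap=2):
    print(first, last, fragment)
```

The overlap ensures that a domain or interface straddling a window boundary still appears intact in at least one fragment, provided it is shorter than the overlap.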

Unsurprisingly, structural biologists were among the first to add ML-SPs into their workflows. In X-ray crystallography, ML-SPs have been used to address the long-standing ‘phase problem’, whereby the loss of the X-ray’s phase during diffraction data collection prevents structure determination. Phase information is traditionally recovered using laborious experimental techniques or by using molecular replacement (MR), which requires a previously-solved, closely-related structure as a template. Predictions from ML-SPs can now serve as templates in cases where structures are not available, as demonstrated for the Escherichia coli lipoprotein PqiC (Cooper et al., 2024). However, even minor deviations from the X-ray structure, such as kinks in the long helices of a prediction, can pose problems for MR (McCoy et al., 2022).

Cryo-EM techniques, including single particle analysis (SPA) and cryo-electron tomography (cryo-ET), have also incorporated ML-SPs into their workflows. In SPA, thousands of images are averaged to produce a high-resolution 3D map of a protein or complex. In some cases, ML-SPs can provide a good model to fit into the low-resolution regions of the map. In a recent cryo-EM study of the microtubule motor dynein complexed with its regulator LIS1 and its cofactor dynactin, AlphaFold2 revealed the molecular details of a previously unreported interaction (Figure 4) (Singh et al., 2024). The AlphaFold2 prediction of dynactin bound to LIS1 fits well into an ambiguous region in the map, revealing the critical residues that were subsequently validated to be important for dynein function.

Figure 4

AlphaFold2 reveals a LIS1–dynactin interface. (a) Dynein-LIS1-dynactin complex (PDB: 8PQW). Region modelled using the AlphaFold2 prediction is coloured, with other regions in grey and white. The inset shows the LIS1–dynactin interaction and the residues chosen for mutation-based validation (yellow). (b) PAE plot highlighting the LIS1–dynactin interaction in dashed boxes.

In contrast to SPA, cryo-ET can provide structural information for entire cellular regions, albeit often at resolutions where identifying proteins is difficult. Here, ML-SPs can be used to assign unknown components within large assemblies de novo. For example, in the microtubules of mammalian sperm, researchers searched through 21,000 AlphaFold predictions to find those that precisely matched their unassigned regions, validating their findings using proteomics (Chen et al., 2023).

Other structural techniques such as NMR and cross-linking mass spectrometry (XL-MS) generate information about the distances between amino acids in a protein or complex. When an experimental structure is lacking, ML-SPs can provide models to help interpret these data. In AlphaLink2, XL-MS data have been directly integrated into the AlphaFold2 algorithm as an additional parameter. Including even a few cross-links between pairs of amino acids improves prediction performance. The ability to incorporate more experimental data into ML-SPs is an exciting direction for the field.
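A simple way to use a predicted model to interpret cross-linking data is to measure the distance between the cross-linked residues in the model, as sketched below. This is a post hoc check and not how AlphaLink2 integrates cross-links into the prediction itself. The coordinates, residue numbers and the 30 Å cutoff are hypothetical; the appropriate cutoff depends on the cross-linker used.

```python
from math import dist


def check_crosslinks(ca_coords, crosslinks, max_distance=30.0):
    """Measure Cα–Cα distances for cross-linked residue pairs in a model.

    ca_coords maps residue number to an (x, y, z) tuple in Å.
    Returns (residue_i, residue_j, distance, satisfied) for each pair.
    """
    results = []
    for res_i, res_j in crosslinks:
        distance = dist(ca_coords[res_i], ca_coords[res_j])
        results.append((res_i, res_j, distance, distance <= max_distance))
    return results


# Hypothetical Cα coordinates and cross-linked residue pairs.
toy_ca = {
    10: (0.0, 0.0, 0.0),
    55: (12.0, 5.0, 0.0),
    120: (40.0, 30.0, 0.0),
}
toy_links = [(10, 55), (10, 120)]

for res_i, res_j, distance, satisfied in check_crosslinks(toy_ca, toy_links):
    print(res_i, res_j, round(distance, 1), satisfied)
```

A cross-link that is violated in the model does not necessarily mean the prediction is wrong: it may instead point to flexibility, an alternative conformation or an oligomeric contact that the model does not include.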

The speed of ML-SPs presented another exciting opportunity: to discover novel protein complexes by screening a large number of multimeric predictions. In yeast, novel interactors could be found from a screen of ~8 million protein pairs, performed using a combination of RoseTTAFold and AlphaFold (Humphreys et al., 2021). These analyses have been streamlined with tools like AlphaPulldown, a modified AlphaFold2 workflow that is optimized for large-scale screens. This strategy can also help validate protein interactions identified using biochemical techniques.

A challenge with large screens is ranking the results to find the interactions that are promising. Instead of manually inspecting each prediction, confidence metrics can be calculated that specifically assess the interaction interfaces. Many metrics can be used, including those that interpret the prediction quality at the interaction interface (iPTM, LIS), those that assess the interacting atomic models (PI-score, DockQ) or a combination of both (pDockQ2, mpDockQ). While these scores are effective at identifying candidates, closer inspection and further validation are important for the reasons described above.
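As a minimal sketch of ranking a screen, the example below sorts predictions by the weighted score that AlphaFold-Multimer uses to rank its own models, 0.8 × ipTM + 0.2 × pTM (Evans et al., 2021). The protein pair names and score values are hypothetical, and any of the other interface metrics mentioned above could be substituted as the sort key.

```python
def ranking_confidence(iptm, ptm):
    """Weighted confidence used to rank AlphaFold-Multimer models."""
    return 0.8 * iptm + 0.2 * ptm


def rank_predictions(predictions):
    """Sort predictions from most to least confident."""
    return sorted(
        predictions,
        key=lambda p: ranking_confidence(p["iptm"], p["ptm"]),
        reverse=True,
    )


# Hypothetical results from a small screen of protein pairs.
toy_screen = [
    {"pair": "bait-preyA", "iptm": 0.85, "ptm": 0.80},
    {"pair": "bait-preyB", "iptm": 0.30, "ptm": 0.70},
    {"pair": "bait-preyC", "iptm": 0.60, "ptm": 0.65},
]

for prediction in rank_predictions(toy_screen):
    score = ranking_confidence(prediction["iptm"], prediction["ptm"])
    print(prediction["pair"], round(score, 2))
```

A ranking like this only prioritizes candidates for closer inspection; as the RILPL2–DLIC1 example shows, a confident score does not guarantee that an interaction is real.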

So far, we have highlighted some established uses of ML-SPs. As the field progresses, new algorithms are being released regularly to address challenges not solvable by the original ones. For example, a single mutation that would break a protein’s function often has little impact on the prediction, since the surrounding conservation overrides the mutated residue. Making adjustments to the MSA input might be the solution, but this is still being explored. Similarly, protein predictions from divergent organisms, such as trypanosome parasites, have benefited from MSA curation. On the other hand, those that evolve very rapidly, such as antibodies, lack the conservation that is necessary to accurately predict their structure. To this end, exciting progress is being made by modifying existing ML-SPs, such as RoseTTAFold2.

Another upcoming milestone is the inclusion of non-protein molecules into a protein prediction. These are often essential to a biological process, such as the ATP in an enzyme or the RNAs that make up ribosomes. Algorithms such as AlphaFill can computationally fit ligands into gaps that current ML-SPs leave behind in the predicted protein structures. However, newer ML-SPs can incorporate other molecules directly into their pipeline, such as RoseTTAFold’s extension to nucleic acids (RoseTTAFold2NA). More recently, RoseTTAFold-All-Atom has made further headway towards this milestone by including proteins, nucleic acids, small molecules and ions.

The scientific advancements provided by the emergence of ML-SPs cannot be overstated. At their best, ML-SPs are relatively low-cost tools that generate hypotheses, accelerating the pace of discovery. As the algorithms evolve and computational resources improve, it will become trivial to predict difficult sequences or complicated biomolecular assemblies, and even create designer proteins. Provided that we are judicious in our use of ML-SPs, they will remain an indispensable tool in our scientific arsenal.

Further Reading

  • For a comprehensive reference list, including references for the algorithms and metrics described here, please visit the Lau lab website (https://laulab.web.ox.ac.uk/resources).

  • AlphaFold, A practical guide. https://www.ebi.ac.uk/training/online/courses/alphafold/. DOI: 10.6019/TOL.AlphaFold-w.2024.00001.1

  • Baek, M., DiMaio, F., Anishchenko, I. et al. (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871–876. DOI: 10.1126/science.abj8754

  • Celestino, R., Gama, J.B., Castro-Rodrigues, A.F. et al. (2022) JIP3 interacts with dynein and kinesin-1 to regulate bidirectional organelle transport. J. Cell Biol., 221, e202110057. DOI: 10.1083/jcb.202110057

  • Chen, Z., Shiozaki, M., Haas, K.M. et al. (2023) De novo protein identification in mammalian sperm using in situ cryoelectron tomography and AlphaFold2 docking. Cell, 186, 5041–5053.e19. DOI: 10.1016/j.cell.2023.09.017

  • Cooper, B.F., Ratkevičiūtė, G., Clifton, L.A. et al. (2024) An octameric PqiC toroid stabilises the outer-membrane interaction of the PqiABC transport system. EMBO Rep., 25, 82–101. DOI: 10.1038/s44319-023-00014-4

  • Evans, R., O’Neill, M., Pritzel, A. et al. (2021) Protein complex prediction with AlphaFold-Multimer. bioRxiv, 10.04.463034. DOI: 10.1101/2021.10.04.463034

  • Humphreys, I.R., Pei, J., Baek, M. et al. (2021) Computed structures of core eukaryotic protein complexes. Science, 374, eabm4805. DOI: 10.1126/science.abm4805

  • Jumper, J., Evans, R., Pritzel, A. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. DOI: 10.1038/s41586-021-03819-2

  • Lin, Z., Akin, H., Rao, R. et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379, 1123–1130. DOI: 10.1126/science.ade2574

  • Mirdita, M., Schütze, K., Moriwaki, Y. et al. (2022) ColabFold: making protein folding accessible to all. Nat. Methods, 19, 1–4. DOI: 10.1038/s41592-022-01488-1

  • McCoy, A.J., Sammito, M.D. and Read, R.J. (2022) Implications of AlphaFold2 for crystallographic phasing by molecular replacement. Acta Cryst. D, 78, 1–13. DOI: 10.1107/S2059798321012122

  • Singh, K., Lau, C.K., Manigrasso, G. et al. (2024) Molecular mechanism of dynein-dynactin complex assembly by LIS1. Science, 383, eadk8544. DOI: 10.1126/science.adk8544

We thank Andrew Carter for sharing the false positive example of DLIC1 and RILPL2, and thank Kashish Singh, Alex Fellows and Giulia Manigrasso for critically reading the manuscript.

Sami Chaaban did his graduate studies at McGill University, Montreal, where he worked on microtubule structure and dynamics during his PhD in the Brouhard lab. As a postdoc in the Carter lab at the MRC LMB in Cambridge, he is using structural biology techniques to explore how complex cellular machinery is built, including new cryo-EM methods to analyse the sparse and flexible dynein motors as they move along microtubules.

Giedrė Ratkevičiūtė completed her PhD at the University of Birmingham under the guidance of Dr Timothy Knowles, investigating lipid transport systems in predatory bacteria. During the final year of her PhD, she conducted research within the lab of Professor Jose Maria Carazo at the CNB-CSIC (Madrid), focusing on cryo-EM data processing and protein flexibility. She later joined Dr Clinton Lau's lab at the University of Oxford, where she is currently engaged in the structural studies of malarial microtubule-binding proteins. Twitter: @GRatkeviciute

Clinton Lau studied malaria adhesion proteins during his DPhil in the Higgins Lab at the Department of Biochemistry, University of Oxford. He then joined Andrew Carter’s group at the MRC LMB in Cambridge, to examine dynein complexes using cryo-EM and TIRF microscopy. When AlphaFold2 was released, Clinton and Sami Chaaban collated knowledge about AlphaFold2 at the LMB, leading discussion groups on how to use AlphaFold2 effectively. He recently set up his lab back in the Department of Biochemistry in Oxford, funded by a Wellcome CDA Fellowship to explore microtubule-binding proteins in the malaria parasite, Plasmodium falciparum. Email: [email protected]. Twitter: @CKYLau

Author notes

These authors contributed equally to this article.

Published by Portland Press Limited under the Creative Commons Attribution License 4.0 (CC BY-NC-ND)