Google DeepMind’s AlphaFold-2 Solves the Decades-Long Protein Structure Prediction Problem
M.M Mohamed Mufassirin
- Introduction
Proteins are the most important macromolecules in living cells. They consist of amino-acid monomers joined together by peptide bonds. The initial polypeptide chain lacks any stable structure; interactions between its amino acid residues then produce a well-defined three-dimensional (3D) structure. Proteins perform many tasks within living organisms, including carrying oxygen (haemoglobin), fighting infection (antibodies), signalling between cells (insulin), contracting muscles (actin and myosin) and driving metabolism (enzymes). However, a protein’s function depends on the three-dimensional structure that it folds into in a particular environment. Misfolded proteins are responsible for many diseases, such as Alzheimer’s disease, cystic fibrosis and mad cow disease.
The landmark study by Anfinsen in the 1970s showed that the tertiary structure of a protein is dependent on its amino acid sequence (Anfinsen, 1973). Since then, understanding the protein sequence-structure-function paradigm has become a cornerstone of modern biomedical studies. Due to significant efforts in genome sequencing over the last four decades, the number of known nucleotide sequences in the GenBank database (Sayers et al., 2021) has grown to over 2600 million as of 2021. Of these nucleotide sequences, approximately 200 million have been translated into the corresponding amino acid sequences and deposited in UniProt (Bairoch et al., 2005). Despite this impressive accumulation of data, the amino acid sequences themselves provide only limited insight into the biological function of each protein, as function is essentially determined by three-dimensional structure (Pearce & Zhang, 2021).
An interesting exception is intrinsically disordered proteins, which are estimated to make up roughly 30% of the human proteome and may be functional despite lacking well-defined tertiary structures (Deiana, Forcelloni, Porrello, & Giansanti, 2019). However, even intrinsically disordered proteins may undergo disorder-to-order transitions and adopt tertiary structures upon binding to their partners and performing their biological functions.
Scientists have long dreamed of simply predicting a protein’s shape from its amino acid sequence, an ability that would open a world of insights into the workings of life. The problem has been open for 50 years, and many researchers have struggled with it. Now, however, a practical solution is within reach.
This article describes an Artificial Intelligence (AI) system created by Google’s AI offshoot DeepMind, which has made a major leap towards solving the protein structure prediction (PSP) problem. DeepMind’s AlphaFold program (Senior et al., 2020) outperformed about 100 other teams in a biennial protein structure prediction challenge called the Critical Assessment of Structure Prediction (CASP). CASP is a competition to determine and advance the state of the art in modelling protein structure from amino acid sequence. Every two years, participants are invited to submit models for a set of proteins whose experimental structures are not yet public. Independent assessors then compare the models with the experimental structures. Scientists at DeepMind started working on AlphaFold four years ago to solve this protein folding problem. DeepMind has now announced that the latest version, AlphaFold 2 (Jumper et al., 2021), is even more accurate, and CASP has recognised it as a breakthrough.
- Protein structure prediction
Every protein can fold into a specific three-dimensional structure, known as its native structure, which is the conformation with the lowest possible free energy. However, the native structure is known for only a small fraction of proteins. In general, laboratory techniques such as Nuclear Magnetic Resonance (NMR) and X-ray crystallography are used to determine a protein’s structure. Unfortunately, these approaches are not always feasible because they are expensive, slow, failure-prone and operationally burdensome (Comellas & Rienstra, 2013). Consequently, the gap between the massive number of already-sequenced proteins and the number of known native structures is large. With recent advances in high-performance computing, researchers have turned their attention to developing computational solutions to this protein structure prediction (PSP) problem.
There are two general computational approaches to predicting the structure of a protein: template-based modelling, in which the previously determined structure of a related protein is used to model the unknown structure of the target; and template-free modelling, which does not rely on global similarity to a structure in the Protein Data Bank (PDB) and hence can be applied to proteins with novel folds. Historically, the methods applied in these two approaches have been quite distinct, with template-based modelling focusing on detecting, and aligning the target to, a related protein of known structure, and template-free modelling relying on large-scale conformational sampling and the application of physics-based energy functions. Recently, however, the line between these approaches has begun to blur, as template-based methods have incorporated energy-guided model refinement, and template-free methods have employed machine learning and fragment-based sampling approaches to exploit the information in the structural database (Kuhlman & Bradley, 2019).
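The choice between the two approaches can be illustrated with a minimal Python sketch. It is not how any CASP method works: it assumes the target and the candidate templates are already aligned to equal length, it uses plain percent identity where real pipelines use far more sensitive profile-based searches, and the 30% cut-off, the function names and the example sequences are all hypothetical choices made for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of non-gap aligned columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def choose_strategy(target, templates, threshold=30.0):
    """Pick a modelling route from the best template identity.

    `templates` maps a template name to its pre-aligned sequence.
    `threshold` is an assumed cut-off, not a value from the article.
    """
    best_name, best_identity = None, 0.0
    for name, sequence in templates.items():
        identity = percent_identity(target, sequence)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_name is not None and best_identity >= threshold:
        return "template-based", best_name, best_identity
    return "template-free", None, best_identity


if __name__ == "__main__":
    # Hypothetical toy sequences, already aligned to the same length.
    library = {"known_fold_A": "MKTAYLAK", "known_fold_B": "GGSPQWER"}
    print(choose_strategy("MKTAYIAK", library))
```

With the toy library above, the target shares seven of eight residues with `known_fold_A`, so the sketch selects the template-based route. A target with no sufficiently similar template would fall through to template-free modelling.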
- Achievement of AlphaFold-2 on the Protein Structure Prediction Problem
DeepMind, Google’s AI offshoot, has built an Artificial Intelligence (AI) system that has made significant progress in determining a protein’s 3D shape from its amino-acid sequence, one of biology’s most challenging problems. DeepMind’s software, AlphaFold, outperformed about 100 other teams in the biennial protein structure prediction contest known as the Critical Assessment of Structure Prediction (CASP).
DeepMind scientists have been working on AlphaFold (Senior et al., 2020) for four years to solve the problem of protein structure folding. AlphaFold was trained on the sequences and structures of over a hundred thousand proteins that were painstakingly mapped out by scientists worldwide. DeepMind has now stated that the most recent version, AlphaFold 2 (Jumper et al., 2021), is even more accurate, and CASP has recognized it as a breakthrough.
Life sciences and medicine would benefit greatly from the ability to accurately predict protein structures from their amino-acid sequences. It would vastly accelerate efforts to understand the building blocks of cells, allowing for faster and more advanced drug development. In several cases, AlphaFold’s structure predictions were difficult to distinguish from those determined using “gold standard” experimental approaches such as X-ray crystallography and, more recently, cryo-electron microscopy (cryo-EM). AlphaFold is not expected to eliminate the need for these time-consuming and expensive procedures, but it will allow scientists to investigate living things in new ways.
AlphaFold’s predictions were submitted under the name “group 427”. The team used around 128 TPUv3 cores (roughly equivalent to 100-200 GPUs) over the course of a few weeks, which is substantial in absolute terms but a relatively modest amount of computation compared with the largest state-of-the-art models used in machine learning today. AlphaFold’s predictions of 350,000 protein structures include the 20,000 in the human proteome and those of so-called model organisms used in scientific research.
An AlphaFold prediction helped determine the structure of a bacterial protein that Lupas’s laboratory had been trying to solve for many years. Lupas’s team had previously collected raw X-ray diffraction data, but needed information about the protein’s shape to turn these Rorschach-like patterns into a structure. Various tricks and other prediction methods had failed to supply that information. After Lupas’s laboratory had spent ten years trying everything, the model from group 427 (AlphaFold) yielded the structure in half an hour.
Overall, teams in the 2020 CASP predicted structures more accurately than in the previous CASP, although much of that progress can be attributed to AlphaFold. On protein targets deemed moderately challenging, other teams’ best performance usually amounted to 75 on a 100-point accuracy scale, whereas AlphaFold achieved around 92.4 on the same targets. This is transformational for our understanding of how life works, since proteins are the basic building blocks of living things.
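The text above does not name the 100-point scale. CASP’s headline measure is the Global Distance Test total score (GDT_TS), so the sketch below assumes that this is the scale meant. GDT_TS averages the percentage of residues whose predicted C-alpha atom lies within 1, 2, 4 and 8 angstroms of its experimental position. The real procedure searches over many superpositions of the model onto the experimental structure. This simplified Python version assumes the two coordinate sets are already superimposed, and the function name and toy coordinates are hypothetical.

```python
import math


def gdt_ts_like(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score on a 0-100 scale.

    `model` and `reference` are equal-length lists of (x, y, z) C-alpha
    coordinates in angstroms, assumed to be superimposed already.
    """
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    # Distance between each predicted residue and its experimental position.
    distances = [math.dist(m, r) for m, r in zip(model, reference)]
    # Fraction of residues within each distance cut-off.
    fractions = [sum(d <= c for d in distances) / len(distances)
                 for c in cutoffs]
    # Average the four fractions and express the result as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Toy four-residue example: deviations of 0.5, 1.5, 3.0 and 10.0 angstroms.
    experimental = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 0.5), (1.0, 0.0, 1.5), (2.0, 0.0, 3.0), (3.0, 0.0, 10.0)]
    print(gdt_ts_like(predicted, experimental))  # prints 56.25
```

Under this definition, a score near 90 means that almost every residue sits within a couple of angstroms of its experimental position.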
- Conclusion
To conclude, in terms of impact, if not epistemologically, what DeepMind has just done may become one of the most important scientific achievements of this century. The long-desired capacity to predict a protein’s structure from its sequence and from the availability of related sequences will unlock applications that span the life and medical sciences, from fundamental biology to pharmaceuticals. The possibilities are incredible.
DeepMind’s success also raises questions that matter to us as a scientific community. It prompts deeper questions about how we conduct and disseminate research, and about whether the community as a whole, which collectively holds greater resources and accumulated knowledge, is working as efficiently as it could. DeepMind is better funded and more efficiently organised than most individual research groups. We must also consider our duty as scientists to ensure that science stays open and that publicly supported research continues to benefit the public.
References
Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181(4096), 223-230.
Kuhlman, B., & Bradley, P. (2019). Advances in protein structure prediction and design. Nature Reviews Molecular Cell Biology, 20(11), 681-697.
Pearce, R., & Zhang, Y. (2021). Toward the solution of the protein structure prediction problem. Journal of Biological Chemistry, 100870.
Sayers, E. W., Cavanaugh, M., Clark, K., Pruitt, K. D., Schoch, C. L., Sherry, S. T., & Karsch-Mizrachi, I. (2021). GenBank. Nucleic Acids Research, 49(D1), D92-D96.
Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., … & Yeh, L. S. L. (2005). The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl_1), D154-D159.
Deiana, A., Forcelloni, S., Porrello, A., & Giansanti, A. (2019). Intrinsically disordered proteins and structured proteins with intrinsically disordered regions have different functional roles in the cell. PLoS ONE, 14(8), e0217889.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., … & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706-710.
Comellas, G., & Rienstra, C. M. (2013). Protein structure determination by magic-angle spinning solid-state NMR, and insights into the formation, structure, and stability of amyloid fibrils. Annual Review of Biophysics, 42, 515-536.