Large language models (LLMs) have revolutionized information processing and are now being applied to complex problems in molecular biosciences. LLMs excel at learning patterns in biological sequences, enabling applications such as predicting the function of DNA regulatory elements, analysing chromatin modifications and predicting the effects of genetic variants. In drug discovery, LLMs are being used to optimize drug properties, design novel molecules and even automate compound synthesis. However, realizing the full potential of LLMs in the molecular biosciences requires addressing challenges such as designing effective learning objectives, generating high-quality, domain-specific datasets and building trust in AI-assisted decision-making. As these challenges are addressed, LLMs hold immense potential for advancing research and driving ground-breaking discoveries in molecular biosciences.
Brief history and success of LLMs
Large language models (LLMs) like BERT, LLaMA and the GPT series have revolutionized how we process information, enabling applications such as text generation, text summarization and question answering. At their core, LLMs use neural networks called transformers to predict the next word, character or symbol given a preceding sequence of such tokens. When models become very large (billions or hundreds of billions of parameters, trained over petabytes of text in the case of text LLMs), important artificial intelligence capabilities emerge.
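To make the core training objective concrete, here is a minimal, purely illustrative sketch of next-token prediction, using simple word counts in place of a transformer (the corpus and all names are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy illustration of the next-token objective LLMs are trained on:
# estimate P(next symbol | preceding context) from data, then predict by
# picking the most probable continuation. Real LLMs replace these counts
# with a transformer holding billions of parameters.
corpus = "the cell divides. the cell grows. the gene encodes a protein."
counts = defaultdict(Counter)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent continuation seen in the training data.
    following = counts.get(word)
    return following.most_common(1)[0][0] if following else None

print(predict_next("the"))  # -> 'cell' (seen twice, vs 'gene' once)
```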
In the medical field, LLMs’ impact stems from their near- or at-human-level performance on a variety of tasks, such as answering medical questions. Given the breadth of potential applications of LLMs, the concept of foundation models has emerged in the field of AI. Foundation models are generic models that encapsulate vast, compact and dense human knowledge and can be used as computational ‘LEGO bricks’. Like stacking building blocks, many downstream applications can be constructed on top of them, making many modelling workflows feasible, efficient and scalable.
Inspired by LLM success stories, biomedical researchers have started tailoring these models to tackle complex biological problems. This effort has led to a variety of biomolecular LLMs – applied to understanding cellular behaviours, helping discover new drugs and helping create effective therapeutic strategies. In what follows, we review these different applications of LLMs.
Understanding cellular behaviour using LLMs
LLMs have significantly advanced our understanding of the inner workings of cells and of cellular behaviour. Examples of new applications include predicting the function of DNA elements that control gene expression and forecasting the pathogenicity of protein variants in human disease-related genes.
LLMs have shown particular promise in genome analysis. Modelling the genome requires exactly what LLMs excel at: learning patterns of correlation in DNA observed within and across many genomes and then using what has been learned to predict the next nucleotides given a preceding sequence. LLMs can be specifically developed to analyse long-range correlations in DNA sequences, which can be used for practical applications such as relating core gene regulatory elements to behavioural patterns of cells (Figure 1).
Figure 1. LLMs trained on a diverse collection of genomic datasets can be used to explore the roles of model scale and data diversity in task performance. Downstream predictions build on pretrained embeddings extracted from various layers of the transformer model. This predictive power can help with understanding, for example, the dynamics of how DNA is transcribed into RNA and translated into proteins, unlocking new clinical applications.
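As a rough illustration of this next-nucleotide objective, the toy sketch below fits a fixed-order Markov model to DNA strings and uses it to score a new sequence; a real genomic LLM replaces the count table with a transformer attending over far longer contexts (all sequences and parameter choices here are invented):

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of the idea behind DNA language models: learn how likely
# each nucleotide is given its preceding context, then use those
# probabilities to score new sequences.
K = 3  # context length (real models attend over thousands of bases)

def train(sequences, k=K):
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            table[seq[i - k:i]][seq[i]] += 1
    return table

def log_likelihood(seq, table, k=K, alpha=1.0):
    # Sum of log P(base | context) with add-one smoothing over A/C/G/T.
    total = 0.0
    for i in range(k, len(seq)):
        ctx, base = seq[i - k:i], seq[i]
        counts = table[ctx]
        total += math.log((counts[base] + alpha) / (sum(counts.values()) + 4 * alpha))
    return total

model = train(["ATGGCGTACGTTAGCATGGCGTACG", "ATGGCGTTCGTTAGC"])
print(log_likelihood("ATGGCGTACG", model))
```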
For example, LLMs can identify regulatory elements, such as promoters, enhancers and potential binding sites for transcription factors, and also predict their function. Another related application is the analysis of chromatin modifications. Investigators have trained genomic foundation models on sequences of up to one million nucleotides. Learning on longer DNA sequences led to state-of-the-art performance on various genomic benchmarks while using fewer parameters and less pretraining data than earlier models. For example, these models showed the best overall performance among several competing methods on the challenging problem of predicting the locations of epigenetic marks.
LLMs can also be used to predict the effects of both coding and non-coding variants in the human genome. For example, DNA language models can be trained on whole-genome sequence alignments across multiple species. By leveraging evolutionary information from aligned sequences of related species and assessing the sequence context of conserved genomic regions, these LLMs can effectively assess whether a new variant is likely to be deleterious.
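One common recipe for such variant scoring, sketched below under the assumption that a trained DNA language model exposes a sequence-scoring function, is to compare the model's likelihood of the reference sequence with that of the variant-carrying sequence; the scoring function here is a deliberately trivial placeholder, not any published model:

```python
# Hedged sketch of likelihood-ratio variant scoring: a large negative delta
# suggests the variant disrupts patterns conserved across genomes.
# `sequence_log_likelihood` is a stand-in for a trained DNA language model.

def sequence_log_likelihood(seq: str) -> float:
    # Placeholder: reward occurrences of the motif 'GCG', purely for illustration.
    return float(seq.count("GCG"))

def variant_delta_score(ref_seq: str, pos: int, alt_base: str) -> float:
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return sequence_log_likelihood(alt_seq) - sequence_log_likelihood(ref_seq)

ref = "TTGCGATTGCGA"
print(variant_delta_score(ref, 3, "A"))  # disrupts one 'GCG' -> negative delta
```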
Outside of genomic applications, the fast-growing number and diversity of single-cell RNAseq profiles present a unique opportunity to create new foundation models that capture the complex interactions between genes, the RNAs and proteins they express, and their links to important phenotypes such as cell viability. Models such as GeneFormer have shown surprising accuracy in predicting which genes are essential to a cell’s survival. Another single-cell foundation model, scGPT, exemplifies how foundation models can elevate our knowledge of cell biology through tasks such as genetic perturbation prediction and gene network inference. Accurate perturbation prediction allows scientists to hypothesize the potential impact of gene editing or drug interventions on cellular health, while gene network inference offers a blueprint of how genes regulate one another and contribute to cellular phenotypes. As these models continue to evolve and incorporate more data, they will likely become an indispensable part of the biotechnology toolkit, driving forward the fields of cellular engineering and disease understanding.
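The snippet below sketches the typical downstream workflow, with random vectors standing in for the cell embeddings a model such as scGPT or GeneFormer would produce:

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of a common downstream use of single-cell foundation models:
# embed each cell's expression profile as a vector, then cluster the
# embeddings to recover putative cell types. Synthetic Gaussian blobs
# stand in for real model embeddings.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 64)),  # pretend cell type A
    rng.normal(loc=2.0, scale=0.3, size=(50, 64)),  # pretend cell type B
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels[:5], labels[-5:])  # cells from each block land in separate clusters
```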
Using LLMs to enable drug discovery’s AI revolution
Drug discovery, or the ability to perturb cellular and molecular networks for therapeutic use, is witnessing a transformation fuelled by AI. Here too, LLMs have started demonstrating their potential to play a key role. They have been used to optimize drug properties, design new molecules and even automate the compound synthesis process. A pivotal factor behind the use of LLMs for drug discovery is the growing availability of ultra-large-scale chemical databases, e.g., PubChem, ChEMBL and ZINC (Figure 2).
Figure 2. LLMs have been applied successfully at various stages of the drug development pipeline: they harness diverse information about drugs from chemical datasets to facilitate molecular design, lead optimization, chemical synthesis and clinical development.
Designing drug structures in silico with desired physicochemical properties and target specificities has always been a difficult problem because of the huge space of possible molecular structures and their complex interactions with multiple proteins. GPT models are now capable of optimizing multiple properties simultaneously in molecular design, outperforming traditional machine learning approaches, which often struggle to generate molecules that meet a comprehensive set of property constraints. GPT models can also be used to generate novel and diverse molecules. One method that combined GPT models with reinforcement learning achieved a success rate as high as 64% in modifying molecules to reach a desired quantitative estimate of drug-likeness (QED) value.
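The sketch below illustrates the reward computation at the heart of such reinforcement-learning fine-tuning, using RDKit's implementation of QED; the candidate SMILES strings stand in for model-generated molecules, and this is a minimal sketch rather than any published pipeline:

```python
from rdkit import Chem
from rdkit.Chem import QED

# Each generated SMILES string is parsed and scored with the quantitative
# estimate of drug-likeness (QED); in an RL setup, this score would be fed
# back to the generator as its reward.
candidates = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, phenol, aspirin

def qed_reward(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    return QED.qed(mol) if mol is not None else 0.0  # invalid SMILES get zero reward

for smi in candidates:
    print(f"{smi}: QED = {qed_reward(smi):.2f}")
```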
LLMs can even be used to generate entirely new molecules. Techniques like self-supervised learning, which allows models to understand and represent molecular structures without direct task-specific instructions, are proving invaluable. Ground-breaking research, including studies on RoBERTa, Grover, Chemformer and BARTSmiles, has demonstrated these models’ ability to accurately create compounds with favourable solubility and solvation profiles, which affect a molecule’s absorption, distribution, metabolism and excretion (ADME) and toxicity profiles. Together, these characteristics determine a drug’s performance and dosage and are therefore of utmost importance in drug design.
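As a toy illustration of the masking objective that underlies this kind of self-supervised pretraining, the snippet below hides random characters of a SMILES string to form the targets a model would learn to reconstruct from context (the predictive model itself is omitted):

```python
import random

# Build masked-token training pairs from a SMILES string, character-level.
# A molecular language model would be trained to recover the hidden tokens
# from the surrounding chemical context.
random.seed(0)
smiles = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, targets = smiles.copy(), {}
for i in random.sample(range(len(smiles)), k=3):
    targets[i] = masked[i]
    masked[i] = "[MASK]"
print("".join(masked))
print(targets)  # positions and tokens the model must reconstruct
```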
Nowadays, GPT-4’s multimodal processing capabilities and new APIs allow it to be integrated with specialized chemistry tools, improving the autonomous research capabilities of LLMs. LLM-powered chemistry engines have been designed that integrate multiple agents into an intelligent system. For example, when asked to synthesize a particular molecule, various specialized agents can collaborate: some devise experimental protocols for the required reactions, while others write and run the code that directs laboratory robots to perform those protocols. This collaborative model makes the whole system more flexible and efficient, allowing it to design its own experiments to synthesize novel drug molecules that better meet users’ needs.
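A minimal sketch of this multi-agent pattern follows; `call_llm` is a placeholder for any chat-completion API, and all agent roles and names are hypothetical rather than drawn from a specific published system:

```python
# One LLM-backed agent drafts an experimental protocol, another turns it
# into robot instructions, and a coordinator passes messages between them.

def call_llm(role: str, prompt: str) -> str:
    # Placeholder for an actual LLM call (e.g., a chat-completion request).
    return f"[{role} output for: {prompt[:40]}...]"

def synthesize(target_molecule: str) -> str:
    protocol = call_llm("planner", f"Propose a synthesis protocol for {target_molecule}")
    robot_code = call_llm("coder", f"Write liquid-handler code for protocol: {protocol}")
    return robot_code

print(synthesize("aspirin"))
```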
Combining cellular behaviour and chemistry, LLMs are starting to be used to predict the efficacy of drug combinations and to identify potential adverse drug–drug interactions (DDIs). These advancements could revolutionize treatment strategies for complex diseases, making therapy safer and more effective. Indeed, drug combination therapy is increasingly recognized as an effective strategy for managing complex diseases like cancer, infectious diseases and neurological disorders, often yielding better outcomes than single-agent therapy. However, identifying effective drug combinations experimentally is challenging and time-consuming because of the vast number of potential combinations. A major hurdle in computational prediction of effective drug combinations, particularly for rare diseases with scarce data, is the limited transferability of most computational methods to new tumour or tissue types. Methods such as CancerGPT are trained on PubMed abstracts and can successfully predict drug combinations for rare cancer types such as liver, soft tissue and urinary tract cancers.
Drug safety is a critical concern in the medical community. When different drugs are combined, there can be a decrease in treatment effectiveness or even risks to human health due to DDIs. Predicting these interactions is crucial for safe drug combination therapies. Juhi and colleagues evaluated GPT-4’s ability to gather information about DDIs, achieving a moderate accuracy rate of 50–60% when compared with two clinical tools. This underscores that general-purpose LLMs like GPT models have limited effectiveness in predicting and explaining DDIs. To improve upon this, Zhu and colleagues developed TextDDI, a domain-specific language model. This model features an ‘information selector’ that generates concise and relevant prompts for LLMs to facilitate DDI predictions, enhancing the model’s accuracy in identifying potential DDIs.
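The snippet below illustrates the idea behind such an information selector, with simple keyword scoring standing in for the learned component and invented drug descriptions as input; it is a conceptual sketch, not the actual TextDDI implementation:

```python
# Instead of dumping full drug descriptions into the prompt, keep only the
# sentences most relevant to interaction mechanisms before asking an LLM
# to predict a DDI.
KEYWORDS = {"cyp3a4", "inhibitor", "metabolism", "qt", "plasma"}

def select_sentences(description: str, top_k: int = 2) -> list[str]:
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: sum(w in s.lower() for w in KEYWORDS),
                    reverse=True)
    return scored[:top_k]

doc = ("Ketoconazole is an antifungal. It is a strong CYP3A4 inhibitor. "
       "It is taken orally. It raises plasma levels of co-administered drugs.")
context = " ".join(select_sentences(doc))
prompt = f"Context: {context}\nQuestion: does ketoconazole interact with simvastatin?"
print(prompt)
```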
Challenges and the road ahead
The blending of language models with biological and chemical data opens up new avenues for exploring, evaluating and applying biological process models and drug discovery techniques. However, fully realizing the potential of LLMs presents significant challenges and demands careful consideration. A crucial part of this journey involves defining effective learning objectives for LLMs. Despite their advanced architecture, LLMs sometimes perform worse than simpler models. For example, current single-cell LLMs produce cell embeddings, yet when evaluated for cell-type clustering, models like GeneFormer and scGPT performed worse than baseline strategies, limiting their utility for biological discovery. This paradox, noted by Kedzierska and colleagues, serves as a reminder that our ambitions for LLMs can only be realized by scrutinizing the quality of the embeddings they produce and by designing more challenging fine-tuning objectives.
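A typical evaluation of this kind, sketched below on synthetic data, clusters cell embeddings and compares the assignments with annotated cell types using the adjusted Rand index (ARI); running it on foundation-model embeddings versus a simple baseline (e.g., PCA of raw counts) reveals which representation better separates cell types:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Cluster embeddings, then score agreement with known cell-type labels.
# Synthetic blobs stand in for embeddings from a single-cell model.
rng = np.random.default_rng(1)
true_types = np.repeat([0, 1, 2], 40)
embeddings = rng.normal(size=(120, 32)) + true_types[:, None] * 1.5
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print(f"ARI = {adjusted_rand_score(true_types, pred):.2f}")  # ~1.0 = clusters match types
```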
LLMs rely heavily on data. The datasets currently available for training in the biological and chemical domains are relatively small compared with those for general-purpose LLMs, leading to impressive performance on open benchmarks but underperformance in practical applications. The creation of high-quality, domain-specific datasets, through both human labelling and unsupervised learning, is essential for expanding the scope of knowledge these models can learn from.
Beyond the technical hurdles, building clinicians’ trust in AI-assisted decision-making is critical for the entry of AI into the clinic. Although LLMs are often viewed as ‘black boxes’, they actually capture complex data correlations. Developing augmented language models (ALMs) with improved reasoning and reliability mechanisms could help mitigate trust issues. The research focus should perhaps be placed on reliable model evaluation: AI researchers can incorporate validation processes, including iterative feedback from domain experts and verification through biology experiments, to avoid reasoning errors or misleading outputs from LLMs.
In conclusion, LLMs have shown immense potential in advancing research and applications in molecular biosciences. By leveraging these models, we can unravel complex cellular behaviours, accelerate drug discovery and develop more effective therapeutic strategies. However, realizing the full potential of LLMs requires addressing challenges such as designing effective learning objectives, generating high-quality, domain-specific datasets and building trust in AI-assisted decision-making. As we continue to explore and refine the use of LLMs in this field, we can look forward to a future where AI and human expertise work hand in hand to drive ground-breaking discoveries and improve human health.
Further reading
Moor, M., Banerjee, O., Abad, Z.S.H. et al. (2023) Foundation models for generalist medical artificial intelligence. Nature 616, 259–265. doi: 10.1038/s41586-023-05881-4
Theodoris, C.V., Xiao, L., Chopra, A. et al. (2023) Transfer learning enables predictions in network biology. Nature 618, 616–624. doi: 10.1038/s41586-023-06139-9
Cui, H., Wang, C., Maan, H. et al. (2024) scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 1–11. doi: 10.1038/s41592-024-02201-0
Boiko, D.A., MacKnight, R., Kline, B. et al. (2023) Autonomous chemical research with large language models. Nature 624, 570–578. doi: 10.1038/s41586-023-06792-0
Li, T., Shetty, S., Kamath, A. et al. (2024) CancerGPT for few shot drug pair synergy prediction using large pretrained language models. npj Digit. Med. 7, 40. doi: 10.1038/s41746-024-01024-9
Kedzierska, K.Z., Crawford, L., Amini, A.P. et al. (2023) Assessing the limits of zero-shot foundation models in single-cell biology. bioRxiv 2023.10.16.561085. doi: 10.1101/2023.10.16.561085
Vert, J.-P. (2023) How will generative AI disrupt data science in drug discovery? Nat. Biotechnol. 41, 750–751. doi: 10.1038/s41587-023-01789-6
Author information
Chengqi Xu obtained her BSc (Hons) from the University of Edinburgh and her MSc from Cornell University. Chengqi is currently a PhD candidate at Weill Cornell Medicine, supervised by Dr Elemento. She is working on designing novel AI algorithms to expedite translational research by exploring therapeutically effective drug combinations.
Olivier Elemento, PhD, is the Director of the Englander Institute for Precision Medicine (EIPM) and professor in the Department of Physiology and Biophysics. Dr Elemento is the recipient of several awards including the NSF CAREER Award, the Hirschl Trust Career Scientist Award, the Walter B Wriston Award and the Daedalus Award. Dr Elemento has published over 450 scientific papers in the area of genomics, epigenomics, computational biology and drug discovery. Under Dr Elemento’s leadership, EIPM’s primary mission is to uncover the molecular roots of disease using genomic sequencing, informatics and other technologies, empowering WCM investigators to expand their research focus, taking the precision medicine methodology they pioneered in cancer and applying it to other areas of research including cardiovascular disease, lung disease, diabetes and neurological disease. Email: [email protected].