The first protein structures revealed a complex web of weak interactions stabilising the three-dimensional shape of the molecule. Small molecule ligands were then found to exploit these same weak binding events to modulate protein function or act as substrates in enzymatic reactions. As the understanding of ligand–protein binding grew, it became possible to firstly predict how and where a particular small molecule might interact with a protein, and then to identify putative ligands for a specific protein site. Computer-aided drug discovery, based on the structure of target proteins, is now a well-established technique that has produced several marketed drugs. We present here an overview of the various methodologies being used for structure-based computer-aided drug discovery and comment on possible future developments in the field.
Computational methods of drug discovery have a basis in the very earliest structural studies of protein molecules. The first structure of a protein, that of sperm whale myoglobin, initially at 6 Å resolution and then refined to 2 Å resolution, revealed a complex network of interactions between the amino acids that was far more difficult to understand than the elegant simplicity seen in the base-pairing of DNA. Understanding how proteins interact with ligands had to wait for several more years; the analysis of the structure of lysozyme and comparison with inhibitor–lysozyme complexes in 1965 immediately revealed that there is structural specificity in ligand–protein binding: known lysozyme inhibitors all bound to the same site on the protein, whereas closely related molecules that could not inhibit the enzyme did not bind to a particular site. Details at the atomic level of ligand–protein complexes soon followed, demonstrating that the theoretical models for how enzymes bound substrates and allosteric modulators were an accurate picture of complex formation.
These early structural insights made it clear that an understanding of ligand–protein interactions could also provide the basis for the rational design of new molecules with an increased affinity for their binding site. Knowledge of the atomic-level contacts between protein residues and ligand atoms should allow selective modification of the chemical structure of the ligand, perhaps improving the complementarity of the shape of the ligand to the binding site of the protein, introducing additional favourable interactions or removing interactions that would inhibit binding. One of the first attempts to do this was the design of novel compounds intended to mimic the action of 2,3-diphosphoglycerate (DPG, Figure 1) on haemoglobin, based on the crystal structure of the DPG–deoxyhaemoglobin complex. Carried out using a hand-made physical model of the protein, this involved measuring distances between amino acid side-chains and identifying possible electrostatic interactions in the DPG pocket. A set of three compounds, with chemical structures unrelated to that of DPG, was designed, synthesised and shown to induce a shift in the oxygen dissociation curve of haemoglobin in a manner closely following that of the original ligand. The success of this first attempt at using a protein structure to design novel ligands indicated the great promise of this approach for drug design.
Figure 1. Chemical structures of the compounds listed in this review.
Coupled with improvements in computer graphics, such as the release of the Evans and Sutherland Picture Systems in the 1970s, the translation of these early manual approaches to drug design into practical computational processes for the rational development of new drugs was a logical step. Among the first programs for protein structure-based drug design was GRID, created by one of the authors (Dr Peter Goodford) of the DPG analogue study. GRID calculates the energy of interaction for a series of probes (water, methyl, amine nitrogen, carboxyl oxygen and hydroxyl, representing a subset of likely atom types in a ligand) with the atoms of a protein structure, providing a three-dimensional potential map for each probe in the volume of space around the protein. Graphical representations of the potential maps for each probe are then displayed overlaid with the structure of the protein, ‘…so that energy and shape can be considered simultaneously when designing drugs.’ Commercialisation of molecular modelling and drug design occurred at almost the same time as the publication of GRID, with companies such as Biosym Incorporated, Tripos Incorporated and Molecular Discovery Ltd. coming into existence in the mid-1980s.
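The idea of a probe potential map can be illustrated with a minimal sketch. This is not the GRID energy function: it assumes a simple 12-6 Lennard-Jones term plus a Coulomb term with a constant dielectric, and the two "protein atoms", their parameters, the probe charge and the function names are all hypothetical values chosen only to make the example run.

```python
import math

# Hypothetical toy "protein": (x, y, z, partial charge, LJ well depth, LJ r_min).
# Real programs read these from a structure file and a parameter set.
ATOMS = [
    (0.0, 0.0, 0.0, -0.5, 0.15, 3.4),
    (3.0, 0.0, 0.0, 0.4, 0.10, 3.8),
]

COULOMB_CONSTANT = 332.06  # kcal Å mol^-1 e^-2


def probe_energy(point, probe_charge, atoms, dielectric=4.0):
    """Sum a 12-6 Lennard-Jones and a Coulomb term over all atoms (assumed form)."""
    total = 0.0
    for x, y, z, charge, well_depth, r_min in atoms:
        # Clamp very short distances so the energy stays finite on top of an atom.
        r = max(math.dist(point, (x, y, z)), 0.5)
        ratio = (r_min / r) ** 6
        lennard_jones = well_depth * (ratio * ratio - 2.0 * ratio)
        coulomb = COULOMB_CONSTANT * probe_charge * charge / (dielectric * r)
        total += lennard_jones + coulomb
    return total


def energy_map(atoms, probe_charge, lo, hi, spacing):
    """Evaluate the probe at every point of a regular cubic lattice."""
    n = int(round((hi - lo) / spacing)) + 1
    axis = [lo + i * spacing for i in range(n)]
    return {
        (x, y, z): probe_energy((x, y, z), probe_charge, atoms)
        for x in axis for y in axis for z in axis
    }


# One map per probe type; here a single positively charged probe.
grid = energy_map(ATOMS, probe_charge=0.3, lo=-4.0, hi=7.0, spacing=1.0)
best_point = min(grid, key=grid.get)  # most favourable lattice point for this probe
```

Contouring such a map at a chosen energy level and overlaying it on the protein is what allows energy and shape to be inspected together.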
The field of computational drug design has expanded enormously since these early exploratory efforts. Several marketed drugs now exist that were invented using structure-based drug design techniques, the first being the neuraminidase inhibitor Relenza™ (Figure 1), a treatment for influenza modelled on the structure of the sialic acid–neuraminidase complex [11,12]. Protein structures are now used routinely at several points in the drug development process (Figure 2), from assessing the ‘druggability’ of a target through initial hit identification and design, to checking for potential off-target effects. It should be noted, however, that information arising from structure-based computer-aided drug discovery is essentially a prediction and remains as such until confirmed using appropriate experimental techniques (e.g. biological screening of compounds identified in an in silico screen or chemical synthesis and assay of de novo designed molecules). Before we outline how protein structures are employed for drug discovery and some of the computational approaches used at each point in the process, we will touch very briefly on the experimental methods used to obtain atomic level protein structures and how to access them.
Figure 2. Typical computational drug discovery workflow.
The worldwide repository for protein structures is the Protein Data Bank (PDB, www.rcsb.org, www.wwpdb.org, [13–15]) and at the time of preparing this mini review (8 June 2018), there were 141 010 data entries, covering 44 278 unique protein sequences. The PDB was created in 1971 and originally housed at the Brookhaven National Laboratory (Long Island, New York, U.S.A.); free access to the data it contains has increasingly driven the drug discovery process. The conceptualisation of a common file format to store protein atomic coordinate data, combined with the development of the Brookhaven Raster Display (BRAD) molecular graphics system to visualise proteins in three dimensions and the SEARCH program to remotely access the database, ultimately resulted in the birth of the PDB as we know it today. The overwhelming majority of the protein structures are determined by X-ray crystallography; the database currently holds 124 299 protein structures determined using this experimental technique. A relatively minor proportion of the protein structures (∼8%) arise from nuclear magnetic resonance (NMR) techniques, with 400–600 structures being added annually since the peak in 2007 when 965 structures were deposited. Since the first electron microscopy (EM) protein structure was deposited in the database in 1977, the number of structures determined by EM, and more recently cryo-EM (cryogenic electron microscopy), has increased exponentially, with 555 structures deposited in 2017 bringing the total to 2123 (∼1.5% of entries in the PDB). While cryo-EM has been successfully employed to study soluble proteins down to the size of haemoglobin (64 kDa), it is through its increasingly successful application to challenging targets such as membrane receptors [17–19] that the contribution of this technique to the field of protein structure-based computer-aided drug design is set to sky-rocket.
It is, of course, not always possible to start a computational drug discovery project with an experimentally derived protein structure, but the wealth of data in the PDB means that useful homology models of many proteins of interest can be prepared from a suitable template protein structure [20–24].
It must always be remembered that the protein structures in the PDB are three-dimensional model representations of the experimental data. The quality of the model depends upon the resolution, accuracy, completeness and interpretation of the experimental dataset. For example, in a protein structure obtained by X-ray crystallography, there may be disordered side-chains and the interpretation of the electron density in these regions is largely dependent upon the crystallographer building the model. Submission of the experimental data to the PDB was initially voluntary; however, by 2008 it was compulsory to deposit crystallographic structure factors and NMR restraints along with the atomic coordinates. Thus, it is prudent to always read the PDB file header information and download the experimental data to evaluate the quality of the protein model prior to commencing drug design work.
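As a rough illustration of reading header information, the sketch below pulls the experimental method, resolution and free R value out of legacy PDB-format text. The header excerpt is invented for the example (the entry code XXXX is a placeholder), the function name is hypothetical, and real files contain many more records and variations than are handled here.

```python
import re

# Invented excerpt in legacy PDB format; not taken from a real entry.
SAMPLE_HEADER = """\
HEADER    HYDROLASE                               01-JAN-00   XXXX
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK   3   FREE R VALUE                     : 0.231
ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 20.00           N
"""


def summarise_header(text):
    """Collect a few quality indicators from the records preceding the coordinates."""
    info = {"method": None, "resolution": None, "r_free": None}
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            break  # header records end where the coordinate section starts
        if line.startswith("EXPDTA"):
            info["method"] = line[6:].strip()
        elif line.startswith("REMARK   2 RESOLUTION."):
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                info["resolution"] = float(match.group(1))
        elif info["r_free"] is None:
            # Require the colon directly after the label so that related lines
            # such as "FREE R VALUE TEST SET SIZE" are not picked up.
            match = re.match(r"REMARK   3   FREE R VALUE\s+:\s*(\d*\.\d+)", line)
            if match:
                info["r_free"] = float(match.group(1))
    return info


summary = summarise_header(SAMPLE_HEADER)
```

A summary like this is only a first check; inspecting the electron density around the site of interest still requires the deposited experimental data.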
Site identification and validation
Too often it is forgotten that a protein crystal structure is not only a model, but is also simply a snapshot of the protein, a low-energy conformation trapped at a single time point. Proteins are dynamic: whole domains rotate and translate, loops flap around and residue side-chains move. Proteins have active and inactive states (or open and closed states in the case of channels and transporters), and the structures of these low-energy states (or conformations) can usually be captured by X-ray crystallography, cryo-EM or NMR. Some proteins also have clearly identifiable intermediate states which can be trapped experimentally; a text search for the key words ‘intermediate state’ identified ∼430 available X-ray crystallography and cryo-EM structures in the PDB. Increasingly, molecular dynamics simulations are being used, with varying degrees of success, to predict intermediate protein states on conformational landscapes connecting stable, low-energy end states [26–28]. Ligand (or protein)-binding sites can be targeted by small molecules using a variety of computational techniques (described below). One of the first steps is to identify which protein state is appropriate to target; for example, targeting the ‘DFG-out’ or inactive conformation of the c-Abl kinase domain of the Bcr-Abl oncogenic fusion protein resulted in the highly successful drug Glivec™ (also known as Gleevec™, imatinib and STI-571, Figure 1), approved for use in treating chronic myelogenous leukaemia (CML), gastrointestinal stromal tumours and other cancers.
Having decided on the protein conformation or state to use, the next step is to identify ‘druggable’ sites on the protein, i.e. binding site(s) to target with small molecules. If ligand–protein complex structure(s) already exist for a protein target with biologically relevant molecules or approved drugs, then the binding site is not only identified but also validated. Drug-/ligand-binding sites come in a variety of shapes, with different characteristics and locations (Figure 3); the ideal site is a concave pocket lined by many hydrogen bond donors and acceptors, and a few hydrophobic side-chains [30,31]. Some binding sites are relatively flat and seemingly featureless, as in the case of many protein–protein interaction (PPI) interfaces. However, at the interface, we often find ‘hotspot’ or key residues which account for a high percentage of the interaction energy and it is possible to target these residues with small molecules [32–34]. Detailed descriptions of computational methods or tools to characterise binding sites have been recently published [31,35,36]; comprehensive lists (with web links) are also available on many structure-based drug design websites (e.g. https://www.click2drug.org/).
Figure 3. Different types of druggable ligand-binding sites in proteins.
In the majority of cases, the choice of binding site may be an obvious one; for example, the orthosteric agonist site of a receptor or the catalytic site of an enzyme. However, sometimes greater selectivity and/or potency is achieved by targeting a unique allosteric or cryptic binding site. A cryptic binding site (or pocket) is a site on a protein that is only revealed upon the binding of a ligand (or another protein). Such sites can be identified experimentally (e.g. by analysing ligand-bound and unbound structures of the same protein) and computationally (e.g. by molecular dynamics methods, flexible ligand docking and hot spot residue prediction) [37–40]. The interest in targeting cryptic binding sites has led to the development of more automated web-based computational tools to identify or predict their presence in unbound protein structures, like the TRAPP webserver and CryptoSite. A recent example of a compound designed to target an allosteric pocket is the clinical candidate asciminib (ABL001, Figure 1), currently in Phase I and III clinical trials for the treatment of CML, Philadelphia positive acute lymphoblastic leukaemia and advanced solid tumours (NCT03106779, NCT02081378, NCT03292783, [43,44]). Asciminib binds into a small pocket on the Bcr-Abl kinase N-terminal domain usually occupied by myristate (Figure 3). The exploitation of a cryptic pocket in HIV integrase led to the development of Isentress™ (raltegravir, Figure 1), an antiretroviral drug used in the treatment of HIV/AIDS.
In silico screening and ligand docking
Over the last ∼30 years, the techniques have moved from manually (or interactively) docking single compounds into a pharmaceutically validated protein binding site to now docking in silico libraries containing millions of compounds. The development of the individual technologies that are distilled into what is referred to as in silico or virtual screening (new algorithms, parallelisation of algorithms, improvements in molecular mechanics force fields, incorporation of quantum mechanical features, increased CPU power and the use of graphics processing units for computation, ability to handle extremely large datasets electronically, high resolution graphic displays etc. [26,46–51]), combined with ever decreasing costs for computer hardware, has resulted in the methodology becoming entrenched in the drug discovery process in both industry and academia. In silico screening is now routinely used to enrich compound libraries prior to high-throughput screening campaigns [21,33,47,52–57]. Owing to the number of in-depth reviews on this subject published over the last ten years (e.g. [21,49,51,58–62]), the method is only described briefly here.
The key steps in protein structure-based in silico screening are: (1) preparation of the protein structure(s); (2) compound library preparation; (3) compound docking strategy selection and (4) analysis of results. Some of the factors to consider, or tasks to complete, when preparing the protein structure are to add in missing residues or atoms, decide whether structurally or functionally important water molecules should be included in the target site, and assign protonation/tautomer states to amino acid side-chains. When preparing an in silico compound library, it is critical that the nature of the protein target is taken into consideration. For example, when targeting a PPI some of the traditional drug-like physicochemical compound selection parameters may be less relevant [33,63–67]. For a central nervous system protein target, a compound library with a bias towards physicochemical properties enabling the compounds to cross the blood–brain barrier would be essential. Other factors to consider during the compound library preparation are whether to generate stereoisomers, which protonation and tautomer states to include, and whether to filter out compounds known to show non-specific inhibition or those with other unwanted chemical properties [69–71]. In the case of natural product compound libraries being used to explore new chemical scaffolds, it may be that none of the standard compound filters are appropriate. Any confirmed ligands for a protein target should be included in the compound library as positive controls, while any compounds known to be non-binders can be included as negative controls.
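The library preparation choices described above can be sketched as a simple filter. Everything here is illustrative: the `Compound` fields, the compound names and the property cut-offs are hypothetical placeholders rather than recommended values, and in practice the properties would be computed by a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    mol_weight: float          # Da, assumed precomputed
    polar_surface_area: float  # Å^2, assumed precomputed
    promiscuous: bool = False  # flagged as a likely non-specific inhibitor
    control: str = ""          # "positive", "negative" or "" for ordinary entries


def prepare_library(compounds, max_weight, max_psa):
    """Keep controls unconditionally; filter the rest on target-specific limits."""
    kept = []
    for compound in compounds:
        if compound.control:
            kept.append(compound)  # controls bypass the property filters
        elif compound.promiscuous:
            continue               # drop known non-specific inhibitors
        elif compound.mol_weight <= max_weight and compound.polar_surface_area <= max_psa:
            kept.append(compound)
    return kept


LIBRARY = [
    Compound("known_ligand", 520.0, 140.0, control="positive"),
    Compound("known_non_binder", 300.0, 60.0, control="negative"),
    Compound("cmpd_1", 310.0, 70.0),
    Compound("cmpd_2", 610.0, 85.0),
    Compound("cmpd_3", 280.0, 55.0, promiscuous=True),
]

# Placeholder limits standing in for a blood-brain-barrier-biased selection.
cns_like = prepare_library(LIBRARY, max_weight=450.0, max_psa=90.0)
```

Passing the limits as arguments reflects the point made above: the appropriate filters depend on the target, and for a PPI or a natural product library they might be relaxed or dropped altogether.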
Once the compound library is prepared, different docking strategies must be considered: whether to target the selected site in a single protein structure or, if available, multiple structures; what level of protein and compound flexibility to use; which docking algorithm and scoring functions are most appropriate; and whether the presence of specific interactions seen for known inhibitors should be enforced. Once the library screen is complete, the task of analysis begins: typically the compounds are ranked by docking score and the highest ranking compounds are clustered by chemical similarity to provide a chemotype ranking rather than simply a compound ranking. This is particularly helpful if a docked library contains several closely related compounds that all score well against a target site, swamping the compound ranking list but providing a single entry in a chemotype ranking list. Compounds are then purchased, or medicinal chemistry undertaken, and the affinity and/or activity of the in silico ligands evaluated in in vitro and possibly in vivo assays. Examples of protein targets, software, methodology used and hit compounds can be found in the reviews cited above.
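The difference between a compound ranking and a chemotype ranking can be shown with a small sketch. It assumes that a lower docking score is better, represents each compound by a set of hypothetical substructure feature identifiers, and uses a simple greedy leader clustering on Tanimoto similarity; the scores, feature sets and threshold are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def chemotype_ranking(hits, threshold=0.5):
    """Greedy leader clustering of (name, score, features) hits.

    Hits are visited from best to worst score, so the first member of each
    cluster is its best-scoring representative.
    """
    clusters = []
    for hit in sorted(hits, key=lambda h: h[1]):
        for cluster in clusters:
            if tanimoto(hit[2], cluster[0][2]) >= threshold:
                cluster.append(hit)
                break
        else:
            clusters.append([hit])  # no similar leader found: new chemotype
    return clusters


# Invented docking results: three close analogues (A1-A3) and two other scaffolds.
HITS = [
    ("A1", -10.2, frozenset({1, 2, 3, 4})),
    ("A2", -10.0, frozenset({1, 2, 3, 5})),
    ("A3", -9.8, frozenset({1, 2, 3, 4, 5})),
    ("B1", -9.5, frozenset({7, 8, 9})),
    ("C1", -8.0, frozenset({10, 11})),
]

clusters = chemotype_ranking(HITS)
leaders = [cluster[0][0] for cluster in clusters]
```

In the plain compound ranking the three analogues occupy the top three places, whereas the chemotype ranking gives them a single entry and lets the other two scaffolds surface.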
The initial focus of in silico screening was the identification of non-covalently bound compounds. However, with the FDA approval of several covalent drugs (e.g. Sabril™ in the treatment of epilepsy and Incivek™ in the treatment of hepatitis C virus, Figure 1), there is renewed interest in applying the process to the discovery of covalently bound compounds. Modifications have been made to existing docking algorithms, such as Gold and Autodock, in an attempt to account for the covalent linkage between compound and protein, while new algorithms, such as DOCKTITE and CovalentDock, have also been developed [75–78]. The reader is directed to an excellent review on the subject by De Cesco et al.
De novo ligand design
Identification of potential ligands for a target site on a protein through in silico screening has one drawback: the process can only identify compounds that already exist in the library used for screening. However, the number of potential drug-like compounds, a ‘chemical space’ estimated to be of the order of 10⁶⁰ distinct molecules, is vastly greater than the ∼10⁸ molecules that have been synthesised, let alone the standard in silico library size in the 10⁶ range. As an alternative, de novo ligand design combines information about the target site on the protein with computational chemistry to build new ligands in situ, selecting functional groups on the basis of their interaction with the target site and the geometry and chemistry of the compound as it is assembled [81,82]. This means that a much wider region of drug-like chemical space can be sampled directly, potentially leap-frogging some of the initial hit to lead development process while also identifying unique, and therefore valuable, compound structures.
While the principle in each case is the same, a range of different approaches exist for the derivation of ligand-binding sites during de novo design. New molecules are assembled in the target site on an atom-by-atom or fragment-by-fragment basis, where each fragment is a larger subsection of a drug-like molecule, such as an aromatic ring or methyl group (Figure 4). A method of scoring the addition of each new group to the growing molecule is used to rank a potential change. Depending on the software used, the scoring process may be rule-based, rely on spatial probe maps derived from a program such as GRID, employ a force field or make use of knowledge-based or empirical scoring functions. The de novo design process can be interactive, allowing a medicinal chemist to design new compounds by hand, or automated such that the program generates a list of potential ligands independently. In both cases, ensuring synthetic tractability of the designed compounds is key; this relies on the medicinal chemist's direct input during manual design and on increasingly sophisticated computational synthesis tools during automated design. A recent review of chemistry-driven de novo design summarises the software packages available with examples of their application, such as the successful development of both BACE-1 and dihydroorotate dehydrogenase inhibitors using SPROUT.
Figure 4. Overview of the process of de novo drug design.
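Fragment-by-fragment assembly with a score for each addition can be sketched as a greedy loop. This is a deliberately simplified, rule-based toy and not the algorithm of any named program: the sub-pocket labels, the fragment table, the complementarity rule and the size penalty are all hypothetical, and geometry and bonding are ignored entirely.

```python
# Which fragment feature complements each sub-pocket feature (assumed rule).
COMPLEMENT = {"donor": "acceptor", "acceptor": "donor", "hydrophobe": "hydrophobe"}

# Hypothetical fragment library: name -> (feature offered, heavy-atom count).
FRAGMENTS = {
    "hydroxyl": ("donor", 1),
    "carbonyl": ("acceptor", 2),
    "methyl": ("hydrophobe", 1),
    "phenyl": ("hydrophobe", 6),
}


def score_addition(subpocket_feature, fragment, size_penalty=0.1):
    """Reward a complementary fragment and penalise each heavy atom added."""
    feature, heavy_atoms = FRAGMENTS[fragment]
    match = 1.0 if COMPLEMENT[subpocket_feature] == feature else -1.0
    return match - size_penalty * heavy_atoms


def grow_ligand(subpockets):
    """Greedily choose the best-scoring fragment for each sub-pocket in turn."""
    ligand, total = [], 0.0
    for feature in subpockets:
        best = max(FRAGMENTS, key=lambda name: score_addition(feature, name))
        ligand.append(best)
        total += score_addition(feature, best)
    return ligand, total


# A target site described as an ordered list of sub-pocket features.
designed, design_score = grow_ligand(["acceptor", "hydrophobe", "donor"])
```

A greedy choice at each step is the simplest possible search strategy; it cannot revisit earlier additions, and it says nothing about whether the assembled molecule could be synthesised.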
As well as building new compounds from scratch, the de novo design approach can be applied to an existing ligand–protein complex with the known ligand used as a seed structure for the design process. The software then identifies bioisosteres for replaced sections of the molecule (Figure 4). Retaining part of a bound compound in its crystallographically determined position guarantees a starting point for de novo design that is located correctly, while maintaining the advantage of novel chemical space. A single compound structure can potentially provide multiple substructural starting fragments to which de novo design rebuilding steps can be applied.
De novo design can also be used to remove undesirable features, such as off-target effects, from a ligand. Comparison of the structure of a developmental drug in complex with both its target and an off-target protein can potentially identify vectors of modification that will reduce binding to the undesired protein. An excellent example of this use of structure-based design is in the development of Bcl-2 family inhibitors as cancer therapeutics, where a co-crystal of the lead compound with human serum albumin (HSA) allowed the design of new molecules with reduced HSA affinity but unchanged binding to their true target.
Enzyme transition state inhibitors
Distinct from intermediate states, all enzymes have transition states that exist on a femtosecond (10⁻¹⁵ s) to picosecond (10⁻¹² s) timescale [89,90]. Given that this timescale is of the same order as bond vibrations, it is extremely difficult to study these transient states experimentally. They have been observed in the gas phase using femtosecond laser technologies, but they are more commonly studied using a combination of biochemical kinetic isotope effects and computational chemistry [89,90]. NMR and mass spectrometry techniques have been developed to analyse kinetic isotope effect data [91–93]. X-ray absorption spectroscopy currently operates at the femtosecond timescale and is moving towards the attosecond (10⁻¹⁸ s) realm; application of this powerful technique to enzyme reactions could lead to major advances in the observation of transition states [94,95]. Currently enzyme transition states can be modelled using computational chemistry methods, such as quantum mechanical calculations to interpret the kinetic isotope effect data [96,97], transition path sampling to find transition states in complex systems in a more unbiased way [98,99] or molecular dynamics simulations [100,101]. Compounds can then be designed to mimic the geometry and electrostatics of the transition state species. Such compounds are known as transition state analogues or inhibitors and they typically display a slow onset of enzyme inhibition and then slow inhibitor release. It is only possible to develop transition state inhibitors for enzymes where the features of the transition state can be mimicked by stable chemistry; for example, the hydride transfer transition state of dehydrogenases would be extremely difficult to exploit.
The first X-ray crystal structures of transition state analogues in complex with their enzymes were solved in 1970 [102,103]. Over the ensuing 48 years, this number has increased by several hundred and a text search of the PDB identified ∼360 structure entries with the key words ‘transition state inhibitor’. A variety of computational methods have been applied to transition state inhibitor–enzyme crystal complexes to design new inhibitors with improved potency and selectivity, many of which are in the clinic or undergoing clinical trials. One example is the first-in-class renin inhibitor Tekturna™ (Figure 1), approved by the FDA in 2007 for primary hypertension. Structural analysis of transition state inhibitor–renin crystal complexes combined with interactive molecular modelling led to the development of Tekturna™ (also known as Rasilez™ and aliskiren). A more recent example is DADMe-Immucillin-H (Figure 1), a transition state inhibitor of human purine nucleoside phosphorylase. Immucillin-H–purine nucleoside phosphorylase crystal complexes were used as the starting point to design achiral inhibitors utilising molecular dynamics and quantum chemistry in the design process [89,90]. Under the names BCX4208 and Ulodesine, DADMe-Immucillin-H completed Phase II clinical trials in 2012 for gout (NCT01407874, NCT01265264).
While each technique has its unique shortcomings, there are some proteins that remain difficult to target through computational drug design regardless of the method used. Clearly where a protein has a poorly defined structure, as is the case of intrinsically disordered proteins that may only adopt transitory structures in complex with a physiologically relevant partner, structure-based design is problematic. Computational approaches are being developed to tackle such proteins, but they remain a challenging target. As mentioned above, PPIs are another hard target for computational drug discovery. The typical PPI interface is a relatively large, flat, featureless surface; the exact opposite of a desirable target site for ligand design (see above and Figure 3). The importance of PPIs in many physiological processes means that efforts are still made to target them for drug design, but particular care needs to be taken.
Computational techniques for structure-based drug discovery have come a long way from the initial attempts using manual fitting of wood, wire and plastic models. The field, however, is not static and there is a range of areas in which active development promises additional ways in which ligand–protein complexes will provide insights for the process of drug discovery, from initial hit discovery to preclinical candidate selection.
One area where there has been substantial progress is the determination of accurate binding energies for the formation of ligand–protein complexes [107–109]. To date, the binding affinity of ligands has been calculated either through some form of scoring function, whether knowledge-based or empirical, or through a relatively simple force field. While fine-tuning of these over time has meant that useful ranking information can be generated from these simplistic tools, an accurate measure of affinity with an error less than 2 kcal mol⁻¹ is much harder to achieve. Using full atom molecular dynamics simulations, an estimate of the total energy of the system, including protein, ligand, solvent etc., can be calculated as a function of the position of all the atoms. In simple terms, repetition of this calculation for extended simulations with the ligand both bound to the protein and free in solution allows an estimate of the energy of binding to be computed. Accuracy in the calculation of binding affinity is limited by the length of the simulation, and hence by the computational power available and the patience of the researcher. Even more accurate predictions can now be made where there is a known ligand with an experimentally determined affinity. The ligand structure can be computationally morphed into a new ligand of interest during the simulation. This has particular applicability in medicinal chemistry campaigns where a compound core may be decorated with a range of functional groups by a team of chemists; experimental binding data need only be determined for one of the class of compounds, allowing the rest to be accurately (±1 kcal mol⁻¹) determined computationally.
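To put the quoted error margins in context, the sketch below uses the standard relationship ΔG = RT ln(Kd) (Kd relative to a 1 M standard state) to convert between dissociation constants and binding free energies, and to combine a reference ligand's measured affinity with a computed relative free energy. The temperature, the reference Kd and the ΔΔG value are assumptions chosen for illustration; the function names are hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal mol^-1 K^-1
TEMPERATURE = 298.15        # K, assumed


def dg_from_kd(kd_molar, temperature=TEMPERATURE):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return GAS_CONSTANT * temperature * math.log(kd_molar)


def kd_from_dg(dg, temperature=TEMPERATURE):
    """Dissociation constant (mol/L) from a binding free energy in kcal/mol."""
    return math.exp(dg / (GAS_CONSTANT * temperature))


def predict_kd(reference_kd, ddg):
    """Predicted Kd of a new ligand from a measured reference and a computed ddG."""
    return kd_from_dg(dg_from_kd(reference_kd) + ddg)


def fold_error(dg_error, temperature=TEMPERATURE):
    """Multiplicative uncertainty in Kd implied by an error in the free energy."""
    return math.exp(dg_error / (GAS_CONSTANT * temperature))


reference_dg = dg_from_kd(1e-9)            # an invented 1 nM reference ligand
analogue_kd = predict_kd(1e-9, ddg=-1.0)   # analogue computed to bind 1 kcal/mol tighter
```

At room temperature an error of 1 kcal mol⁻¹ corresponds to roughly a five-fold uncertainty in Kd, and 2 kcal mol⁻¹ to roughly thirty-fold, which is why the tighter margin matters when ranking close analogues.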
Another area of active development is in silico prediction of absorption, metabolism, distribution, elimination and toxicity (ADME-Tox). Prior to research compounds moving into the clinic and being tested in people, they go through a process of preclinical development, where off-target effects and toxicity are explored in tissue culture and non-human animals. Animal testing is a major ethical concern, so computational models that can replace some of the live animal testing are attractive. Computer models based on the chemical structure of potential drugs can be used to predict many of the important ADME-Tox properties of the compounds [112–116]. With the availability of more and more structures for important proteins in human metabolism, it is also now possible to explore the interaction of a compound with a potential metabolic or toxicity target directly. For example, plasma protein binding is an important factor in the distribution and elimination of drugs. As noted previously for Bcl-2 family inhibitors, knowledge of the details of ligand–plasma protein complex structures can be exploited to alter the affinity of the complex, thus altering the ADME parameters. Recently, the cryo-EM structure of the potassium channel protein hERG became available, allowing rationalisation of years of toxicity models for this important cardiac ion channel. As further structures of relevant human proteins become available, the ability to make ADME-Tox decisions on compound design based on specific ligand–protein complex models will continue to grow.
With the growth in computing power, there has also been a growth in the power of artificial intelligence (AI) techniques for problem solving. Machine learning algorithms, in which a training set of known results is used to optimise decision-making processes independently of direct human intervention, can produce systems that solve extremely complex problems very efficiently, including some where algorithms seemed impossible, such as the game of Go [119,120]. These deep learning approaches can be applied to drug discovery as well; they are currently being applied to all aspects of computational chemistry and in silico ligand identification, including de novo design, docking and in silico screening. While such techniques hold great promise, they are currently restricted by the inherent multi-feature optimisation required in drug design and the limited availability of appropriate training data. As more work is put into machine learning and AI techniques for computational drug design, it can be expected that appropriate datasets for training will be developed in parallel.
Perhaps, as such techniques are perfected, a drug discovery programme will one day consist of simply handing a protein structure to a dedicated AI and collecting the clinical candidate from the automated medicinal chemistry laboratory next door. The inherent complexity of both the computational problem in identifying a potential drug and the synthetic challenge in creating it means that such an idyllic process is many years from being realised — there will be jobs for human drug discovery and development teams for the foreseeable future!
absorption metabolism distribution elimination and toxicity
chronic myelogenous leukaemia
enzyme-linked immunosorbent assay
Förster resonance energy transfer
human serum albumin
high-throughput chemical screening assays
nuclear magnetic resonance
Protein Data Bank
T.L.N. and C.J.M. wrote the paper. All authors refined the manuscript. M.W.P. supervised the work.
This work was partly supported by a grant from the Australian Cancer Research Foundation to M.W.P. Funding from the Victorian Government Operational Infrastructure Support Scheme to St Vincent's Institute is acknowledged. M.W.P. is a National Health and Medical Research Council of Australia Research Fellow.
We thank all current and past members of the Parker laboratory and our collaborators for their contributions to our structure-based drug discovery efforts. We particularly thank Biota Pharmaceuticals for providing the opportunity to us to develop a structure-based drug discovery laboratory.
The Authors declare that there are no competing interests associated with the manuscript.