Divergent evolution of proteins reflects both selectively advantageous and neutral amino acid substitutions. In the present article, we examine restraints on sequence, which arise from selectively advantageous roles for structure and function and which lead to the conservation of local sequences and structures in families and superfamilies. We analyse structurally aligned members of protein families and superfamilies in order to investigate the importance of the local structural environment of amino acid residues in the acceptance of amino acid substitutions during protein evolution. We show that solvent accessibility is the most important determinant, followed by the existence of hydrogen bonds from the side-chain to main-chain functions and the nature of the element of secondary structure to which the amino acid contributes. Polar side chains whose hydrogen-bonding potential is satisfied tend to be more conserved than their unsatisfied or non-hydrogen-bonded counterparts, and buried and satisfied polar residues tend to be significantly more conserved than buried hydrophobic residues. Finally, we discuss the importance of functional restraints in the form of interactions of proteins with other macromolecules in assemblies or with substrates, ligands or allosteric regulators. We show that residues involved in such functional interactions are significantly more conserved and have differing amino acid substitution patterns.
An understanding of protein evolution requires not only knowledge of genomes, protein sequences, structures and functions, but also an understanding of selective pressures at the level of the whole organism and the role of the protein in cells and whole-organism systems [1,2].
Insights into the relationship of protein structure, function and evolution began to emerge nearly 50 years ago as protein structures were determined for which there were multiple sequences. For example, insulin sequences from Fred Sanger in the 1950s ([3,4] and see  for a review), together with the three-dimensional structure from Dorothy Hodgkin a decade later [6,7], provided clues about the impacts of amino acid substitutions on tertiary structure and precursor activation, on quaternary interactions at dimer and hexamer interfaces, and on the putative receptor-binding region. They demonstrated that amino acid substitutions were accepted during evolution in a way that satisfied restraints arising from structure and function. Thus the core of the protein tended to be relatively conserved (Figure 1), and residues in helices and strands were substituted in ways that maintained the overall stabilities of these secondary structures. Most interestingly, a glycine residue with a positive φ main-chain torsion angle, which allowed the chain to change direction sharply, was conserved in all insulins. Amino acids substituted at positions involved in dimer formation retained their hydrophobic character in all species except the hystricomorphs. The conservation of B10 histidine in most mammalian, fish and bird insulins was evidence of restraints arising from the existence of a hexamer.
Figure 1. Evolution of insulins as demonstrated by sequence alignment and structural superimposition
The insulin structure also provided evidence of restraints from functional interactions. Residues in a patch mainly on the surface of the monomer appeared to have greater restraints on their substitution than could be explained by retention of the structure of insulin throughout evolution; this observation provided the clues about restraints in evolution arising from function, in this case the binding of insulin with its receptor.
Much of the sequence variation appeared to be selectively neutral; the accepted amino acids were able to fulfil the same structural and functional roles. Some amino acid changes, such as those occurring in the hystricomorph insulins, had been proposed to be the result of selectively neutral substitutions. However, these substitutions proved to be consistent with loss of the ability to dimerize and stabilization of the monomeric form. This presumably resulted from a change of storage form, possibly related to zinc availability, and was therefore probably also selectively advantageous.
Thus the analyses of insulins, along with parallel work on globins, lysozymes and serine proteinases, provided strong evidence for the conservation of tertiary structure during evolution, and emphasized the importance of considering restraints from protein interactions, in this case in terms of oligomers and receptor activation. They underlined the importance of local environment in the acceptance of amino acid substitutions during protein evolution.
Subsequent analyses of more divergent members of superfamilies showed that these general features could still be preserved while allowing considerable sequence divergence, so that sequence similarities were in the twilight zone (20–35% identity) and below. Indeed, analyses of existing protein structures and sequences provided an approach not only to understanding the structure and function of proteins, but also to solving the inverse folding problem. Such analyses enabled sequence profiles to be generated for each protein family and for its more distant relatives, the protein superfamily. For insulin, this led to the identification of somatomedin C (insulin-like growth factor 1), relaxin and later many other molecules as members of the insulin superfamily.
Similar analyses led to the recognition of other distant superfamilies and at the same time to the identification of further restraints on amino acid substitutions arising from local structural features. Thus, in the family of βγ-crystallins, which have four copies of a Greek-key motif in each protomer, a buried serine residue that is hydrogen-bonded to a main-chain amide function stabilizes the folding of the loop joining strands a and b over the globular domain in each motif and leads to the most conserved sequence pattern. This motif was later identified in Myxococcus xanthus spore coat protein S, both proteins probably having been selected for their stability.
A more celebrated example was the evolution of the aspartic proteinases, for which two gene duplication and fusion events were predicted and a symmetrical ancestral dimer was proposed [14,15]. The buried threonine residue next to the catalytic aspartate residue in each Asp-Thr-Gly motif, which forms an important hydrogen bond to a buried main-chain amide function, appears to be a critical structural restraint and an easily recognizable sequence motif. A similar symmetrical dimer was later predicted and identified in retroviral proteases from Rous sarcoma virus and HIV, where the Asp-Thr-Gly sequence plays a similar role and where retention of the dimeric form, as a regulatory requirement in activation, appears to have provided an evolutionary selective advantage to the viruses.
Taken together, the aspartic proteinases and crystallins implied that, apart from solvent accessibility and secondary structure, side-chain hydrogen-bonding, particularly to main-chain functions in buried positions, might also be strong restraints on sequence variation.
Tertiary structural restraints
Such analyses of superfamilies led to the idea that propensities for amino acids and their substitution patterns [18,19] might be systematically defined in terms of local structural environments. Solvent accessibility of the side chain and occurrence in regular secondary structures were local environments used by most groups [17,20,21]. Two further classes of local environment were added to these by Overington et al.: (i) amino acids with a positive φ main-chain torsion angle (learnt from the B8 glycine residue of insulin); and (ii) amino acids with side chains that formed hydrogen bonds to main-chain or other side-chain functions (inspired by the conserved serine and threonine residues of the crystallins and aspartic proteinases).
Although buried main-chain functions achieve hydrogen-bond satisfaction through the formation of secondary structures in most positions in the core, buried polar residues often carry out this role. Interestingly, those amino acid residues with polar side chains whose hydrogen-bonding potential is satisfied tend to be more conserved than their unsatisfied or non-hydrogen-bonded counterparts, particularly when buried. Indeed, such buried and satisfied polar residues are significantly more conserved in sequence identity than individual hydrophobic residues such as leucine in similar solvent-inaccessible local environments.
An ESST (environment-specific substitution table; http://www-cryst.bioc.cam.ac.uk/esst) describes the substitution of amino acids as a function of structural environments that restrict the allowable substitutions. The combination of environmental descriptors for solvent accessibility, secondary structure and side-chain hydrogen-bonding gives 64 environments in this model, each with its own substitution matrix and a distinct pattern of amino acid substitution. Figures 2(A)–2(D) demonstrate that amino acid substitution patterns are influenced by local structural environment. In particular, a solvent-inaccessible environment (Figure 2B) restricts the possible substitution of amino acids most strongly, enhancing the diagonal of the substitution matrix, but secondary structure and the existence of side-chain hydrogen bonds also lead to different substitution patterns. The relative importance of these has been demonstrated by an analysis of distances among the 64 tables, each characterized by a different set of restraints, followed by PCA (principal component analysis) based on a matrix of substitution profiles for all 64 environments over 441 (21×21) possible substitutions (Figures 2E and 2F). In Figure 2(E), PCA divides the 64 environments by solvent accessibility, which corresponds to the first principal component (PC1). As expected, we observed that, for all 21 amino acids, the degree of residue conservation in solvent-inaccessible regions is much higher than in solvent-accessible regions (see Supplementary Figure S1 at http://www.biochemsoctrans.org/bst/037/bst0370727add.htm). Figures 2(G) and 2(H) show that the three types of hydrogen bond form a clear hierarchy: the eight hydrogen-bonding environments are divided first by the existence of a hydrogen bond from a side chain to a main-chain amide (N/n) and then by one to a main-chain carbonyl (O/o), which correspond to the first and second principal components respectively.
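The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to build the published tables: the category labels, the function name `substitution_tables` and the use of `J` as the 21st residue symbol are all hypothetical, chosen only so that the counts match the 64 environments and 441 (21×21) substitutions mentioned in the text.

```python
from collections import Counter, defaultdict
from itertools import product

# Assumed labels: the article names these categories but not these identifiers.
ACCESSIBILITY = ("accessible", "inaccessible")
SECONDARY = ("helix", "strand", "coil", "positive_phi")
# Three yes/no flags: side-chain hydrogen bond to main-chain amide,
# to main-chain carbonyl, and to another side chain (2 * 2 * 2 = 8 states).
HBONDS = tuple(product((False, True), repeat=3))

# 2 accessibility classes * 4 conformations * 8 hydrogen-bond states = 64.
ENVIRONMENTS = [(acc, ss, hb)
                for acc in ACCESSIBILITY
                for ss in SECONDARY
                for hb in HBONDS]

# 21 symbols so that 21 * 21 = 441; 'J' as the extra symbol is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYJ"


def substitution_tables(observations, pseudocount=0.0):
    """Turn observed substitutions into one probability table per environment.

    observations: iterable of (environment, residue, aligned_residue), where
    the environment is that of `residue` in a protein of known structure.
    Returns {environment: {residue: {aligned_residue: probability}}}.
    """
    counts = defaultdict(Counter)
    for env, src, dst in observations:
        counts[env][(src, dst)] += 1

    tables = {}
    for env, pair_counts in counts.items():
        table = {}
        for src in ALPHABET:
            total = sum(pair_counts[(src, dst)] for dst in ALPHABET)
            total += pseudocount * len(ALPHABET)
            if total == 0:
                continue  # residue never observed in this environment
            table[src] = {dst: (pair_counts[(src, dst)] + pseudocount) / total
                          for dst in ALPHABET}
        tables[env] = table
    return tables
```

In such a table the diagonal entry (for example `table["S"]["S"]`) is the fraction of aligned positions at which the residue is unchanged, so a more restrictive environment shows up as a heavier diagonal.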
Figure 2. Different amino acid substitution patterns and the relative importance of structural restraints
Restraints from functional interactions with other proteins
As shown in the case of the insulins, functional interactions with other macromolecules can also provide restraints on the acceptance of amino acid substitutions. For protein–protein interactions, two structural environments can be defined for interfacial residues: (i) interface core for residues with relative accessibilities less than 7%; and (ii) interface periphery for those with relative accessibilities greater than 7% (Figure 3A). Residue propensities can be calculated as the relative proportion of each residue type in each of the interfacial accessibility structural environments and for buried and exposed non-interface residue environments (Figure 3B). Propensities for the majority of residues at the interface core and periphery are intermediate between those for the protein core and exposed surface, with the interface periphery most similar to the exposed protein surface and the interface core most similar to the protein core. The exceptions to this are methionine, glycine, alanine, histidine, tryptophan, tyrosine and arginine. Of these, alanine and glycine, the two smallest residues, are disfavoured at the interface periphery. Histidine and arginine, two residues that can carry a positive charge, are favoured at the periphery; in fact, this is the structural environment in which these residues are most enriched. Arginine is capable of multiple types of favourable interaction: it can simultaneously form up to five hydrogen bonds and a salt bridge through the positive charge carried on its guanidinium group. Tryptophan, tyrosine and methionine, the three largest hydrophobic residues that can engage in a range of interactions, are all favoured at the interface core, in agreement with the observations of Ofran and Rost. The enrichment of aromatic tyrosine may be explained by its contribution to the hydrophobic effect without a large entropic penalty, owing to the side chain having few rotatable bonds, as well as by the hydrogen-bonding capacity of its 4-hydroxy group.
Tryptophan has a very large aromatic side chain that can mediate aromatic π-interactions, act as hydrogen-bond donor and form extensive hydrophobic contacts.
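The environment definitions and the propensity calculation above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, a residue at exactly 7% is assigned to the periphery by assumption (the text gives only "less than" and "greater than"), and the same 7% cutoff is assumed for separating buried from exposed non-interface residues, which the text does not state. Whether a residue lies at the interface is taken as an input rather than computed.

```python
from collections import Counter

CUTOFF_PERCENT = 7.0  # relative accessibility threshold given in the text


def classify(relative_accessibility, at_interface, cutoff=CUTOFF_PERCENT):
    """Assign one of four accessibility environments to a residue.

    relative_accessibility is a percentage; at_interface is decided elsewhere.
    Reusing the cutoff for non-interface residues is an assumption.
    """
    below = relative_accessibility < cutoff
    if at_interface:
        return "interface_core" if below else "interface_periphery"
    return "buried" if below else "exposed"


def propensities(residues):
    """Relative proportion of each residue type within each environment.

    residues: iterable of (residue_type, relative_accessibility, at_interface).
    Returns {environment: {residue_type: fraction of that environment}}.
    """
    per_environment = {}
    for residue_type, accessibility, at_interface in residues:
        env = classify(accessibility, at_interface)
        per_environment.setdefault(env, Counter())[residue_type] += 1

    return {env: {aa: n / sum(counts.values()) for aa, n in counts.items()}
            for env, counts in per_environment.items()}
```

For example, `propensities([("W", 2.0, True), ("W", 4.0, True), ("A", 1.0, True)])` places all three residues in the interface core and reports tryptophan as two-thirds of that environment.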
Figure 3. Definition of interface residues and their propensities
Increasing the number of descriptors captures the residue environments more accurately. However, the combinatorics of environment definitions can result in a large number of environments such that the available alignment data may be partitioned into individual environments that are only sparsely populated. A total of 48 protein–protein interaction-specific ESSTs can be derived using a combination of the four categories of interface accessibility environment, four categories of main-chain conformation and secondary structure (helix, strand, coil and positive φ main-chain torsion angle), two categories of intermolecular hydrogen-bonding (bonded and unbonded) and two categories of intramolecular hydrogen-bonding (bonded and unbonded). The main determinant appears to be the interface accessibility environment (results not shown). As in tertiary interactions, intramolecular hydrogen-bonding status is a further strong determinant (results not shown).
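The combinatorics can be checked by enumeration. The full product of the four descriptors listed above is 4 × 4 × 2 × 2 = 64, which is larger than the stated total of 48. One reading that reproduces 48, offered here as an assumption rather than something the text states, is that residues outside the interface cannot form intermolecular hydrogen bonds, so the "bonded" intermolecular category applies only to the two interface environments. All labels below are hypothetical.

```python
from itertools import product

ACCESS = ("interface_core", "interface_periphery", "buried", "exposed")
INTERFACE = {"interface_core", "interface_periphery"}
CONFORMATION = ("helix", "strand", "coil", "positive_phi")
BONDED = (False, True)

# Every combination of the four descriptors, with no restriction applied.
# Tuple layout: (access, conformation, intermolecular_hb, intramolecular_hb).
full_product = list(product(ACCESS, CONFORMATION, BONDED, BONDED))


def is_populated(environment):
    """Assumed rule: only interface residues can have intermolecular bonds."""
    access, _conformation, intermolecular, _intramolecular = environment
    return access in INTERFACE or not intermolecular


populated = [env for env in full_product if is_populated(env)]

print(len(full_product), len(populated))
```

Under this assumption the interface environments contribute 2 × 4 × 2 × 2 = 32 tables and the non-interface environments 2 × 4 × 1 × 2 = 16. The same enumeration makes the sparsity problem concrete: each added binary descriptor doubles the number of tables among which a fixed set of alignments must be divided.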
Functional restraints from protein–nucleic acid interactions
The mechanism and nature of protein–nucleic acid recognition are believed to differ from those of protein–protein interaction. Hence the restraints arising from protein–nucleic acid interaction should also be considered separately. Protein–nucleic acid interfaces, which arise as a consequence of functional restraints, are often conserved and show distinctive amino acid propensities owing to the polar nature of DNA/RNA (Figure 3C).
To describe the differences between amino acid substitution patterns arising from structural restraints and those under functional restraints of nucleic acid binding, new sets of restraints have been incorporated into ESSTs to represent the nature of protein–nucleic acid interactions. Hence, residues involved in intermolecular interactions with nucleic acids are classified further into three types: (i) hydrogen bond; (ii) water-mediated hydrogen bond; and (iii) van der Waals contact. A total of 128 ESSTs were created using four categories of secondary structure (as for tertiary interactions), and two categories of each of solvent accessibility, hydrogen-bonding to nucleic acid, water-mediated hydrogen-bonding to nucleic acid and van der Waals contact to nucleic acid.
By measuring distances among the new sets of ESSTs and constructing phylogenetic trees using the distance matrices, the residues interacting with nucleic acids have shown distinct substitution patterns when compared with the other sites (S. Lee and T.L. Blundell, unpublished work). The new ESSTs were also tested using the sequence–structure homology recognition program FUGUE, to compare the recognition performance with the conventional substitution tables. Significant improvements were achieved in both recognition performance and alignment accuracy using the new substitution tables.
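The first step of that comparison, a distance matrix over substitution tables, might look like the sketch below. The text does not say which distance measure was used, so the Euclidean distance between flattened tables is purely an assumption, and the function names are hypothetical. Each table is represented here as a dictionary from (residue, aligned residue) pairs to probabilities.

```python
from math import sqrt


def table_distance(table_a, table_b):
    """Euclidean distance between two flattened substitution tables.

    Each table maps (residue, aligned_residue) -> probability; a pair
    missing from one table is treated as probability 0.0.
    The choice of Euclidean distance is an assumption for illustration.
    """
    pairs = set(table_a) | set(table_b)
    return sqrt(sum((table_a.get(p, 0.0) - table_b.get(p, 0.0)) ** 2
                    for p in pairs))


def distance_matrix(tables):
    """Pairwise distances for a {name: table} mapping.

    Returns (names, matrix) with rows and columns in sorted name order.
    """
    names = sorted(tables)
    matrix = [[table_distance(tables[a], tables[b]) for b in names]
              for a in names]
    return names, matrix
```

A matrix of this kind is symmetric with a zero diagonal, which is the form that distance-based tree-building methods take as input.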
Residues involved in functional interactions with substrates, ligands and multi-component assemblies are clearly more conserved and have differing amino acid substitution patterns. They therefore need to be excluded when calculating substitution tables for amino acid residues that are under restraints from tertiary interactions alone. However, amino acids some distance away can also be under restraint, particularly from the need to maintain precise arrangements of catalytic residues and cofactor-binding interactions. Conversely, amino acid substitutions may affect substrate- and drug-binding specificity, a factor evident in the analysis of kinase drugs (Figure 4).
Figure 4. The kinase fold and drug discovery
Protein Evolution: Sequences, Structures and Systems: Biochemical Society Focused Meeting to commemorate the 200th Anniversary of Charles Darwin's birth held at the Wellcome Trust Conference Centre, Cambridge, U.K., 26–27 January 2009. Organized and Edited by Roman Laskowski (EMBL-EBI, Hinxton, U.K.), Michael Sternberg (Imperial College London, U.K.) and Janet Thornton (EMBL-EBI, Hinxton, U.K.).
Abbreviations used: ESST, environment-specific substitution table; PCA, principal component analysis.
S.L. thanks Juok Cho for statistical analysis.
T.L.B. thanks the Wellcome Trust for support of our structural studies of protein assemblies. C.L.W. and G.R.J.B. thank the Biotechnology and Biological Sciences Research Council for a studentship. D.T. thanks the Royal Thai Government for funding the study of inhibitor selectivity in protein kinase. S.L. thanks Mogam Science Scholarship Foundation for partial funding for the study of protein–nucleic acid interactions.