The evolution of protein function appears to involve alternating periods of conservative evolution and of relatively rapid change. Evidence for such episodic evolution, consistent with some theoretical expectations, comes from the application of increasingly sophisticated models of evolution to large sequence datasets. We present here some of the recent methods to detect functional shifts, using amino acid or codon models. Both provide evidence for punctual shifts in patterns of amino acid conservation, including the fixation of key changes by positive selection. Although a link to gene duplication, a presumed source of functional changes, has been difficult to establish, this episodic model appears to apply to a wide variety of proteins and organisms.
Protein-coding genes are modified by random mutations, which may then be affected by natural selection. If a mutation produces a phenotype that decreases fitness, purifying selection will probably remove it from the population. If the mutation increases fitness, it may be fixed by positive selection (also called diversifying or Darwinian selection). This mechanism can be described quantitatively by substitution models. Most such models assume that the replacement rate of amino acids is identically distributed over the protein. This assumption is not very realistic, as amino acids in active sites or in interaction surfaces are likely to be under stronger selective pressure than elsewhere in the structure. Many substitution models also assume that the replacement rate is constant over time, whereas it is likely that evolution passes through periods of acceleration and deceleration, depending, for example, on environmental selective pressures. A model of episodic evolution taking into account these variations in time and space was proposed by Gillespie in his book The Causes of Molecular Evolution as early as 1991, but data and methods to test it were lacking. Of special interest, mutations fixed by positive selection can promote new functions and cause differences between species. Novelties can also arise by duplication (local or whole-genome duplication), in which selective pressures could act to preserve both duplicates [2,3].
Amino acid protein evolution
During the last decade, various models of episodic amino acid evolution have been implemented to detect functional divergence between groups of protein sequences [4,5]. The ‘covarion’-type models, also known as Type I functional divergence or heterotachy, detect differential rates (acceleration or deceleration) at each amino acid position. For example, a serine residue can be fully conserved in all orthologues in one clade, while a mixture of alanine, aspartate and lysine is found in the orthologues of another clade. This type of asymmetry has been observed in several duplicated genes, such as the receptor tyrosine kinase in teleost fish, or the groEL family in Chlamydiae, suggesting the evolution of a new function in one duplicate (‘neofunctionalization’). Penn et al. developed a more general model that detects such changes anywhere in a tree, without specifying a priori a branch of interest. This model detected more than 200 shifts among seven subtypes of 182 HIV-1 genomes. Another type of functional divergence occurs when a site is conserved in one clade, and also conserved in the other clade, but with different physicochemical properties in each clade. This has been called CBD (constant-but-different) or Type II functional divergence. Gribaldo et al. suggested that CBD is a better predictor of functional divergence, as they found more divergence in paralogues than in orthologues with CBD, but not with covarions, in haemoglobins. Similarly, Seoighe et al. analysed many domains from Pfam and found a slight excess of CBD in branches following duplication, relative to speciation. We note that different implementations of covarion and CBD models can lead to different results, depending on parameters such as the grouping of amino acids into classes (hydrophobic, polar, etc.). Another feature of amino acids in protein studies, which is less studied because it requires very large datasets for reliable inference, is co-evolution between sites.
Amino acids under co-evolution are predicted to lie close together in the three-dimensional structure, and different methods have been developed to detect them [15–17]. Interestingly, such co-evolution could cause false-positive predictions in functional analysis (in both amino acid and codon models), as a mutation could be compensated for by another mutation nearby in the three-dimensional protein structure, adding no novelty to the protein function.
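The two site patterns described above (Type I, where a site is conserved in one clade but variable in the other, and Type II or CBD, where a site is conserved in both clades but with residues of different physicochemical classes) can be illustrated with a deliberately naive sketch. It only counts residue frequencies in one alignment column per clade. It is not the algorithm of DIVERGE, SHIFT-FINDER or BADASP, which rely on phylogeny-aware statistics. The residue grouping, the 0.9 conservation threshold and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, coarse physicochemical grouping (an assumption of this sketch;
# published tools use their own groupings, which can change the results).
_GROUPS = {
    "hydrophobic": "AVLIMFWCGP",
    "polar": "STNQY",
    "positive": "KRH",
    "negative": "DE",
}
AA_CLASS = {aa: name for name, residues in _GROUPS.items() for aa in residues}


def major_residue(column, threshold=0.9):
    """Return (most common residue, whether it reaches the threshold frequency).

    `column` is a non-empty string: one alignment column restricted to one clade.
    """
    residue, count = Counter(column).most_common(1)[0]
    return residue, count / len(column) >= threshold


def classify_site(clade_a, clade_b, threshold=0.9):
    """Label one alignment column split into two clades."""
    res_a, cons_a = major_residue(clade_a, threshold)
    res_b, cons_b = major_residue(clade_b, threshold)
    if cons_a and cons_b:
        if res_a == res_b:
            return "conserved"
        if AA_CLASS.get(res_a) != AA_CLASS.get(res_b):
            return "type II (CBD)"
        return "conserved (same class)"
    if cons_a or cons_b:
        # Conserved in exactly one clade: rate shift between clades.
        return "type I (covarion-like)"
    return "variable"


if __name__ == "__main__":
    # Serine conserved in one clade, A/D/K mixture in the other (example from the text).
    print(classify_site("SSSSSS", "ADKADK"))
    # Aspartate in one clade, lysine in the other: constant but different.
    print(classify_site("DDDDDD", "KKKKKK"))
```

Changing the grouping in `_GROUPS` changes which sites are called CBD, which is one reason why different implementations can disagree.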
In a preliminary study of the impact of functional divergence in amino acids between paralogues and orthologues (R.A. Studer and M. Robinson-Rechavi, unpublished work), we used SHIFT-FINDER (http://sites.univ-provence.fr/evol/phylogenomics-lab/PageWeb/SHIFT-FINDER.htm), a reimplementation of the algorithm originally implemented in DIVERGE [7,19], to detect covarions; DIVERGE itself is a GUI (graphical user interface)-only software, poorly adapted to large-scale studies. We applied SHIFT-FINDER to 139 families of fish-specific duplicate genes. Comparing paralogues, we found, on average, ∼8% of covarion sites per family (with at least 95% posterior probability). We expected to find less functional divergence between orthologues in the same gene families, but, on the contrary, we found ∼12% of covarion sites between teleost fish and tetrapods. It is possible that the variation in paralogues has an impact on the estimation of variation between orthologues, which include duplicated co-orthologues. We therefore made the same comparison in 374 families of single-copy orthologues (no duplication in any species), and again found more sites under functional divergence (∼10%) than between paralogues. We then used BADASP to detect sites presenting CBD changes. In 163 families of fish-specific duplicates, we found ∼2.5% of sites with CBD between paralogues, but again similar differences were found between tetrapod–fish orthologues (2.3–3.5%). Although these results are preliminary, and need further investigation into potential limitations or bias of the methods, they suggest that paralogues do not diverge more than orthologues in protein function.
Codon models to study selective pressure
Moving to the level of nucleotides allows a more precise estimation of evolutionary pressure, and offers, notably, the potential to discriminate between relaxation of purifying selection and increase in positive selection, which may be linked to protein function [4,21–23]. The evolutionary pressure between two sequences can be computed by comparing the number of non-synonymous substitutions per non-synonymous site (dN or Ka) to the number of synonymous substitutions per synonymous site (dS or Ks). If synonymous changes are neutral, dS is an estimate of the combined effect of mutation rates and divergence time, whereas dN includes both of these parameters plus selective pressure. Thus the dN/dS ratio (ω) provides a measure of evolutionary pressure, and of whether the protein has been evolving under purifying selection (ω<1), neutral evolution (ω=1) or positive selection (ω>1). More sophisticated models than simple averaging of dN and dS over pairs of sequences have been developed, notably in the laboratory of Ziheng Yang, and implemented in the PAML package. Other models are available in tools such as FitModel or Selecton. The Branch model of PAML estimates different ω values per branch, but averages over sites, thus assuming that selective pressure is the same over the whole protein. This model may be interesting for the study of genes containing a very strong change in selective pressure in a specific clade. In an orthogonal manner, the Site models estimate different ω values per site, averaging over branches, i.e. assuming that selective pressure does not vary over time. This model is useful to detect positions in the structure that are continually adapting to evolutionary pressure, as in immunity or viral genes (e.g. MHC or HIV proteins). Assuming that, in many cases, evolutionary pressure changes over both time and space, Yang and Nielsen developed branch-site models. The associated test was improved further by Zhang et al. and Zhang, to avoid false positives. These models classify amino acid positions into four different categories. Two describe sites for which selective pressure does not change over time, either under purifying selection (K0, ω0<1) or under neutral evolution (K1, ω1=1). The two other categories are sites potentially evolving under positive selection only in one or more selected branches (foreground branches ω2≥1), and evolving in all other branches under purifying selection (K2a, background branches ω0<1) or neutral evolution (K2b, background branches ω1=1). Interestingly, this model should be able to describe covarion-like evolution (i.e. K2b sites) and CBD-like evolution (i.e. K2a sites). This branch-site model has been implemented in databases [33,34]. It would be interesting in the future to investigate the level of consistency between amino acid and codon methods. Of note, recent research on codon models is moving towards convergence with amino acid models, by integrating information such as physicochemical amino acid properties.
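As a minimal sketch of how the quantities above are typically interpreted, the following Python snippet maps an ω estimate onto the three regimes and computes a likelihood ratio test (LRT) between a null and an alternative codon model. The text does not specify the test statistic. The use of 2ΔlnL compared with a χ² distribution with one degree of freedom is an assumption of this sketch, commonly made when the null model fixes ω2=1 on the foreground branch. The log-likelihood values are hypothetical and would normally come from a tool such as PAML.

```python
import math


def selection_regime(omega, tol=1e-9):
    """Interpret a dN/dS ratio: <1 purifying, =1 neutral, >1 positive."""
    if omega > 1.0 + tol:
        return "positive selection"
    if omega < 1.0 - tol:
        return "purifying selection"
    return "neutral evolution"


def lrt_chi2_df1(lnl_null, lnl_alt):
    """Likelihood ratio test assuming a chi-square null with 1 degree of freedom.

    Returns (statistic, p_value). For 1 d.f. the chi-square survival function
    is exactly erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)  # clamp small negative values from optimisation noise
    return stat, math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    print(selection_regime(0.2))
    # Hypothetical log-likelihoods for the null and alternative branch-site models.
    stat, p = lrt_chi2_df1(lnl_null=-1000.0, lnl_alt=-997.0)
    print(f"2*deltaLnL = {stat:.2f}, p = {p:.4f}")
```

A significant LRT only indicates that a class of sites with ω2>1 on the foreground branch fits the data better. Identifying which sites belong to that class requires the posterior probabilities reported by the codon-model software.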
Evidence of positive selection has been reported for several categories of genes, including immunity, arms race, reproduction and digestion. Positive selection is also expected in duplicated genes, under neofunctionalization, or in some cases under subfunctionalization through escape from adaptive conflict. Recent studies have used the branch-site models to analyse the evolutionary mode of genes of interest, allowing the testing of these expectations with increased rigor. We will describe some of these studies, and show the link between codon models and functional evolution.
Rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase) is the central enzyme that fixes CO2 molecules during photosynthesis. Two major types of photosynthetic pathway exist in plants: C3 and C4. C3 plants are the most common (e.g. rice, spinach or potato), whereas C4 plants are found in warm, dry and saline climates (e.g. sugar cane or maize). The photosynthesis mechanism is slightly different, and, interestingly, the C4 Rubisco appears adapted to different concentrations of CO2. This co-evolutionary mechanism has been fixed by positive selection, and it appears that only a few sites are responsible for the observed difference in kinetics. In particular, an alanine residue mostly conserved in C3 plants shifted to a serine residue many times independently in C4 plants belonging to distant plant lineages. This shift could alter the movement of a loop and thus alter the kinetic properties; it has been detected by branch-site codon models. A more dramatic shift in function has been revealed by a branch-site model in fungi, in the branch separating feruloyl esterase from other lipases. Three of the sites under positive selection are directly implicated in the switch of function, as indicated by site-directed mutagenesis. Interestingly, such models can also refute hypotheses of positive selection. Such a hypothesis was originally suggested for the duplicate genes FLO and DEF, responsible for floral development in plants, but a hypothesis of relaxed selection is favoured instead.
These codon models can also be applied in large-scale studies, but special attention must be given to methodological issues. The first is the possibility of saturation in synonymous sites. This can be addressed by using either closely related sequences, or good sampling to ‘break’ long branches. The second is potential error in the MSA (multiple sequence alignment). In small-scale analyses, alignments can be verified by hand, but in large-scale studies involving hundreds of alignments, this is not a realistic approach. We recommend using at least one of the state-of-the-art MSA tools, and selecting only well-aligned blocks [40,41]. Finally, it is recommended to correct for multiple testing.
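The text does not say which multiple-testing correction to use. As one possible illustration, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to a list of per-gene or per-branch P-values, such as those obtained from likelihood ratio tests. The choice of this procedure, the function name and the example P-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (q-values), in input order.

    Each P-value is scaled by m / rank (rank 1 = smallest), then a running
    minimum from the largest rank downwards enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing P-value
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical P-values from four independent tests.
    pvals = [0.01, 0.04, 0.03, 0.20]
    qvals = benjamini_hochberg(pvals)
    for p, q in zip(pvals, qvals):
        print(f"p = {p:.3f}  q = {q:.3f}")
```

Tests are then called significant when their adjusted value falls below the chosen false discovery rate, rather than when the raw P-value falls below the nominal threshold.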
There have been many scans for positive selection in genes, using different methods (reviewed in ). In the present article, we focus on large analyses using branch-site models, which are most relevant to the episodic evolution model . Most such studies have been conducted in mammals, and have usually reported relatively low levels of positive selection. Thus a study of human–mouse–pig gene trios found between 2.0% (pig) and 3.0% (human) of genes with evidence of branch-site-specific positive selection ; of note, they only found one gene out of 1120 with a branch model. Larger-scale scans (≈13000 genes) in mammals found positive selection to be associated with functional categories such as sensory perception, signalling or immune response , and to be more frequent in chimpanzee than in human evolution . These results have been confirmed in the largest scan yet of mammalian genomes , comparing ≈16500 genes in six species (to the exclusion of duplicated genes), which also reported differences in the gene categories under positive selection in rodents and primates, and a tendency for positively selected genes to be more tissue-specific.
With the help of five sequenced fish genomes, we were able to extend such analyses to a deeper evolutionary time, spanning bony vertebrate evolution. We analysed 881 families of vertebrate genes containing either strict singleton genes or genes with duplication in fishes. As many theoretical models predict a shift in selective pressure between duplicates, we analysed the branch separating the two copies of teleost fish genes. We found that 36% of genes exhibited positive selection, at a few sites per gene (3%). But we also found ∼60% of branches separating teleost fish and tetrapods under positive selection, concerning again few sites (5%). This is consistent with our preliminary results on covarions and CBD on amino acids. In any case, it appears that episodic positive selection has affected at least 77% of the genes we studied, and these represent the most conservative subset of the genome.
Overall, positive selection between orthologues does not appear to be rare. This could result in changes of function between species, and alter the biochemical properties of proteins, leading to erroneous conclusions in drug discovery analysis. Moreover, it seems that genes implicated in pathologies such as cancers, schizophrenia and Alzheimer's disease are overrepresented among genes that have evolved under positive selection.
Episodic selection as a major mechanism in protein evolution
The episodic model of evolution [1,51] appears to be an important concept for understanding protein evolution. It is thus of great interest to identify such changes in gene families. The main challenge is to define criteria to predict in silico whether there is a biological change in function or not. The studies we present provide only a measure of divergence between homologous genes. It is probable that some diverging sites cause strong functional divergence (differences in substrate, or in interaction partners), whereas, for others, the effect is more subtle (optimization for temperature, change in the rate of the enzymatic reaction). Transferring annotations by homology can be misled by such changes. It has been reported that most enzymatic domains have differences between homologues directly in their active sites, thus possibly moving the protein to another enzymatic class. Such a shift has already been observed between two highly conserved proteins with 98% identity, and thus can depend on very few amino acid changes. As more genomes from various clades are sequenced, increasing the power of sequence analyses, we hope to see an increase in studies using this type of method (covarion, CBD and branch-site models), thus clarifying the evolution of protein function.
Protein Evolution: Sequences, Structures and Systems: Biochemical Society Focused Meeting to commemorate the 200th Anniversary of Charles Darwin's birth held at the Wellcome Trust Conference Centre, Cambridge, U.K., 26–27 January 2009. Organized and Edited by Roman Laskowski (EMBL-EBI, Hinxton, U.K.), Michael Sternberg (Imperial College London, U.K.) and Janet Thornton (EMBL-EBI, Hinxton, U.K.).
We thank P.-A. Christin for helpful discussions.
We acknowledge funding from Etat de Vaud and Swiss National Science Foundation [grant number 116798].