The main features of the triple helical structure of collagen were deduced in the mid-1950s from fibre X-ray diffraction of tendons. Yet the resulting models could offer only an average description of the molecular conformation. A critical advance came about 20 years later with the chemical synthesis of sufficiently long and homogeneous peptides with collagen-like sequences. The availability of these collagen model peptides resulted in a large number of biochemical, crystallographic and NMR studies that have revolutionized our understanding of collagen structure. High-resolution crystal structures from collagen model peptides have provided a wealth of data on collagen conformational variability, interaction with water, collagen stability and the effects of interruptions. Furthermore, a large increase in the number of structures of collagen model peptides in complex with domains from receptors or collagen-binding proteins has shed light on the mechanisms of collagen recognition. In recent years, collagen biochemistry has escaped the boundaries of natural collagen sequences. Detailed knowledge of collagen structure has opened the field for protein engineers who have used chemical biology approaches to produce hyperstable collagens with unnatural residues, rationally designed collagen heterotrimers, self-assembling collagen peptides, etc. This review summarizes our current understanding of the structure of the collagen triple helical domain (COL×3) and gives an overview of some of the new developments in collagen molecular engineering aiming to produce novel collagen-based materials with superior properties.
In its most widely used meaning, the term ‘collagen’ refers to the main structural protein in connective tissues such as skin, bone, tendons or cartilage. This cable-like protein forms elongated fibrils that provide mechanical stability to animals. Bone or skin collagen is mainly type I collagen, the most abundant protein in mammals, but in fact there are many types of collagens in any given animal. Vertebrates have at least 45 distinct collagen genes that account for a family of 28 collagen proteins [2–5]. Invertebrates have their own collagenomes (used here to refer to the collection of collagens and collagen-like proteins from a given organism or taxonomic group), which contribute to the structural integrity of the animals. Thus, the cuticle of the Caenorhabditis elegans nematode is predominantly made of cross-linked collagens that are encoded by more than 170 collagen genes [6–8]. Collagens are already present in the most primitive animals, such as sponges, and therefore are considered intrinsic to the evolution of metazoans [9,10].
All collagens are trimeric proteins formed by association of three polypeptide chains. These can be identical (homotrimer collagens) or different (heterotrimer collagens). Different genes code for distinct collagen chains, and only chains from the same collagen type associate with each other (there are no trimers made from chains of different collagen types). Vertebrate collagen types are designated with roman numerals (I–XXVIII) in the chronological order of their discovery and the individual genetically different chains for each type are named with the letter α and an arabic numeral. Thus, the cartilage-predominant type II collagen is a homotrimer of three α1(II) chains, whereas the bone-predominant type I collagen is a heterotrimer made of two α1(I) chains and one α2(I) chain [1–5].
Collagen types are broadly classified by their function, domain architecture and supramolecular organization. The main fibril-forming collagens (I, II, III, V and XI) account for 80–90% of the human collagens and are the principal source of tensile strength in animal skin, bones, cartilage, blood vessels, etc. Type IV collagen forms chicken-wire networks in basement membranes. Other collagen types form hexagonal lattices (VIII and X), beaded filaments (VI), anchoring fibrils (VII) or have transmembrane domains [1–5]. A number of additional proteins that share structural characteristics with the 28 collagen types are collectively classified as ‘collagen-like’ proteins rather than ‘collagens’, for reasons that are not entirely clear. The human collagenome includes collagen-like proteins such as acetylcholinesterase, adiponectin, collectins, C1q, class A scavenger receptors and others.
All collagens and collagen-like proteins share a structural motif that defines the entire superfamily: the collagen triple helix. Each collagen has one or more occurrences of this particular triple helical domain (hereafter referred to as COL×3 domain), and its length varies across different collagen types. The mature form of the major fibrillar collagens is essentially a long COL×3 domain of just over 1000 amino acids and about 300 nm in length. In other collagen types, however, the COL×3 domain represents less than 10% of the number of amino acids of the mature protein. Triple helical COL×3 domains are easily recognizable at the genomic level by a conspicuous ─Gly-X-Y─ repetitive sequence in which every third amino acid position is occupied by a glycine residue (Gly, G) and the X and Y positions are often occupied by proline residues (Pro, P). In animals, Pro residues in the Y position are often modified post-translationally to 4-hydroxyproline (Hyp, O). These sequence requirements reflect specific structural characteristics of the collagen triple helix, as will be reviewed later.
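The Gly-X-Y signature described above lends itself to a simple sequence scan. The following is a minimal sketch, not a published method: the function names `longest_gly_xy_run` and `imino_fraction` are hypothetical, and the sequence is assumed to be a one-letter string in which Hyp is written as `O`, as in the peptide names used in this review.

```python
def longest_gly_xy_run(sequence):
    """Return (start_index, n_triplets) for the longest Gly-X-Y run.

    A run is a stretch of consecutive, complete triplets that each begin
    with 'G'. Indices are 0-based; ties go to the earliest run.
    """
    seq = sequence.upper()
    best_start, best_len = 0, 0
    for start in range(len(seq)):
        n = 0
        pos = start
        # Walk forward three residues at a time while Gly leads each triplet.
        while pos + 3 <= len(seq) and seq[pos] == "G":
            n += 1
            pos += 3
        if n > best_len:
            best_start, best_len = start, n
    return best_start, best_len


def imino_fraction(run):
    """Fraction of X and Y positions in a Gly-X-Y run occupied by Pro (P) or Hyp (O)."""
    xy = [res for i, res in enumerate(run.upper()) if i % 3 != 0]
    if not xy:
        return 0.0
    return sum(res in ("P", "O") for res in xy) / len(xy)
```

For a toy sequence such as `"MKT" + "GPO" * 5 + "AAA"`, the scan reports a run of five triplets starting at index 3. A real domain search would also need to tolerate the interruptions found in many non-fibrillar collagens, which this sketch does not attempt.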
EARLY FIBRE DIFFRACTION STUDIES: DISCOVERY OF THE COLLAGEN TRIPLE HELIX
Tendons are primarily built from type I collagen, which represents 70–80% of their dry weight. Furthermore, the structure of tendon is such that the collagen fibres are essentially aligned to the long axis of the tendon. This structural alignment is critical to provide the tensile strength required to connect bone to muscle while allowing for the mechanical mobility of the body. Thus, the earliest structural information on collagen was obtained from fibre X-ray diffraction photographs of stretched tendons. Given the sparse amount of information provided by these images (Figure 1A), it is remarkable that an essentially correct structure was eventually obtained after a few unsuccessful attempts at model building (see [18,19] for detailed historical accounts). Early models were based on single polypeptide chains containing cis or both cis and trans peptide bonds [17,20]. In 1951, Pauling and Corey proposed a model for collagen as part of their classic series of papers on the stereochemistry of polypeptide structures. From density measurements and average residue weight, they concluded that the molecule of collagen consisted of three polypeptide chains. A triple helical structure with all peptide bonds in trans was first proposed in 1954 by Ramachandran and Kartha. The same authors soon modified this model by applying the concept of ‘coiled-coil’, where the three chains were wound around the common central axis, and introducing a vertical staggering following a superhelical path. The stereochemistry of this structural model (which became known as the ‘Madras triple helix’) was then refined by Rich and Crick [24,25] and Cowan et al. In parallel, physicochemical studies with soluble collagen confirmed that collagen molecules were rather rigid and rod-like, with a length of 3000 Å (1 Å = 0.1 nm) and a diameter of 13.6 Å, and composed of three polypeptide chains which became separated upon temperature increase [18,27–29].
Figure 1. Examples of collagen X-ray diffraction
The models obtained from the combination of fibre diffraction, amino acid composition and physicochemical data already described the essential features of the collagen triple helix as we understand it today. The discovery had its controversy, in particular about the recognition of Ramachandran's contribution to the elucidation of the correct structure [30–32]. An interesting account of the historical events is given in Ramachandran's biography by Sarma. Remarkably, the discussions about the stereochemistry of the collagen models led eventually to the publication of the Ramachandran map.
AVERAGE STRUCTURE OF THE COLLAGEN TRIPLE HELIX
Tendons used in fibre diffraction experiments have a degree of heterogeneity. Their oriented fibres can at best be classified as polycrystalline and more realistically as non-crystalline due to the large amount of disorder in their structure. Thus, the models obtained from fibre diffraction were necessarily average structures of the repeating unit of the collagen triple helix. Linked-atom least-squares refinement of early models against fibre diffraction data from highly stretched partially dehydrated kangaroo tail tendon produced probably the best average structure at the time.
The overall structure of the collagen molecule is a right-handed triple helix of three individual polypeptide strands, each in a left-handed helical conformation with three residues per turn and 3₂ helical symmetry known as polyproline II or polyglycine II (PPII/PGII) [37–40]. The three strands are supercoiled around each other in a right-handed manner, and a ladder of intermolecular backbone N─H···O═C hydrogen bonds links adjacent strands (Figure 2). These hydrogen bonds are transversal to the helical axis, as opposed to the ubiquitous α-helices where hydrogen bond directions are roughly parallel to the helical axis. To form this triple helical assembly two things need to happen: the three strands must be staggered by one residue with respect to each other (an approximate rise of 2.9 Å in the direction of the helical axis), and every third residue in each strand must be placed near the common helical axis, due to the resulting close packing of the three strands. This can only be achieved if the smallest amino acid Gly occurs at that position, which explains the repetitive ─Gly-X-Y─ sequence seen in the primary structure of COL×3 domains. The one-residue staggering in the direction of the axis also has important consequences, as the three strands become topologically non-equivalent. This contrasts with the situation in, for instance, trimeric α-helical coiled coils, where the three strands are at the same axial level. A useful notation to distinguish the three strands as trailing, intermediate and leading (Figure 2) was introduced much later when it became important to describe the different interactions between each of the three collagen strands and an integrin receptor domain.
Figure 2. Different representations of a collagen triple helix with sequence (POG)10
It is illustrative to stop for a moment on the PGII structure to visualize the intimate relation between collagen structure and sequence. The PGII/PPII conformation has 3-fold helical symmetry (3₁ or 3₂ for an achiral poly(G) sequence, obligate 3₂ for any other protein sequence). This symmetry places peptide bonds in three directions separated by 120°. Each polyglycine chain in the PGII structure is connected to its six closest neighbours via hydrogen bonds roughly perpendicular to the helix axis, resulting in a continuously connected hexagonal array. This continuous structure, also observed in synthetic copolymers of Gly residues with ω-amino acids [42,43], would hardly be compatible with a biological macromolecule that needs to avoid aggregation. In the consensus collagen sequence, only every third residue is Gly, which means that an individual chain can act as hydrogen bonding donor in only one direction, whereas it can still receive in any of the three directions (imino acids cannot donate hydrogen bonds and any side chain other than Gly would interfere with hydrogen bonding in that particular direction). This hydrogen bonding topology allows for the formation of a self-contained trimeric assembly, which ultimately is the basis for the triple helical structure of collagen. Interestingly, the constraints imposed by this self-contained structure, together with the high content of imino acids and Gly residues, prevent fibrillar collagen molecules from forming continuous harmful aggregate structures such as amyloid fibrils [44,45].
HIGH-RESOLUTION STRUCTURES FROM COLLAGEN MODEL PEPTIDES
The average structure of collagen was reviewed periodically without increasing significantly its level of detail. Synthetic polypeptides with a collagen-like structure, such as poly(PGP) or poly(GPO), did not improve the information obtained from native sources [19,46]. Collagen in general was considered ‘non-crystallizable’ and it was not until the introduction of solid-phase synthesis of collagen model peptides that this perception changed. Physicochemical analysis confirmed that these collagen peptides were trimeric in dilute aqueous solution, formed triple helical structures in which the three chains were parallel and in register, and showed sharp thermal transitions corresponding to the denaturation of the triple helices [47–53]. Chemically synthesized collagen peptides such as (PPG)10 or (POG)10 are homogeneous in molecular mass, have defined length and chemical composition, and thus are amenable to producing single crystals (Figure 1B). To date, more than 60 structures of collagen model peptides on their own or in complex with other proteins have been deposited in the PDB (Supplementary Table S1) with atomic or near-atomic resolution. These structures have confirmed the general features of the triple helical structure derived from fibre diffraction of tendons (with some important differences discussed later), and have provided a wealth of data on the conformational variability of the collagen triple helix, molecular details of the interaction between collagen and water or effects of collagen interruptions. They have also helped to clarify the role of Hyp and other residue types in collagen stability and resolved a number of long-ongoing controversies about the topology of hydrogen bonding in the collagen triple helix.
The first successful crystals and high-resolution X-ray diffraction data were obtained by Okuyama et al. [50,54] with the (PPG)10 peptide. The highly repetitive structure of this peptide and a columnar arrangement of the molecules in the unit cell prevented a complete determination of the entire triple helix. Thus, an average atomic structure for the repeating unit of an infinite helix model was determined. The first full-length crystal structure was obtained with the Gly→Ala peptide (POG)4-POA-(POG)5 (PDB code 1CAG) . In this peptide, the change of one central Gly residue to alanine (Ala, A) was sufficient to break the columnar arrangement seen in the (PPG)10 structure and facilitated structural determination. Better quality average structures for (PPG)10 and (POG)10/11 were later obtained (Supplementary Table S1), and the problem of highly repetitive sequences was eventually solved with very-high-resolution crystals for the peptides (PPG)9 (PDB code 3AH9), (PPG)10 (PDB code 1K6F) and (POG)9 (PDB code 3B0S) [56–58]. In the meantime, collagen model peptides containing amino acids other than Pro/Hyp were crystallized and information about side-chain conformation, water-mediated hydrogen bonding or conformational variability was finally obtained.
SEQUENCE DEPENDENCE OF COLLAGEN HELICAL SYMMETRY
Fibre diffraction images of tendons were interpreted with average triple helical collagen structures (all repeating units were considered to have the same conformation and to be related by some helical symmetry operator). One of the helical parameters measured experimentally from fibre diffraction diagrams is the magnitude of the unit twist |κ| (Table 1) and values between 107° (3.36 units/turn) and 110° (3.27 units/turn) were reported [36,46,59]. In order to describe the helical symmetry as a ratio of small integers, a rounded-off value of ten repeating units in three turns (10/3, 10₇ or 3.33 units per turn) was adopted, corresponding to a magnitude of the unit twist |κ|=108° [25,36,46]. The different helical parameters for the individual strands and superhelix of the 10₇ (or 10/3) model are given in Table 1; co-ordinates of a 10₇ molecular model obtained from fibre diffraction are available in PDB format in the Supplementary Online Data (file Fraser1979.pdb).
Table 1

| Parameter | 10/3 helix | 7/2 helix | Variable helix* |
|---|---|---|---|
| Unit twist (t) | 36° | 51.43° | 42.4° |
| Unit height (h) | 8.95 Å | 8.61 Å | 8.64 Å |
| Helical pitch† | 89.5 Å | 60.3 Å | ∼73.4 Å |
| Triplets per turn (360/t) | 10 | 7 | ∼8.5 |
| Residues per turn | 30 | 21 | ∼25.5 |
| Unit twist (κ) | −108° | −102.86° | −106° |
| Unit height (τ) | 2.98 Å | 2.87 Å | 2.88 Å |
| Helical repeat | 29.8 Å | 20.1 Å | ∼49.6 Å |
| Units per turn (360/κ) | 3.33 | 3.50 | 3.40 |
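The entries of Table 1 appear to be linked by simple arithmetic, which the sketch below makes explicit. The relations used (t = 360° + 3κ, h = 3τ, pitch = h × 360/t) are inferred here from the tabulated values rather than stated in the text, so they should be read as an assumption. Likewise, the symmetry symbols are interpreted as "N units in M right-handed turns", which reproduces κ = −108° for the 10/3 helix and κ ≈ −102.86° for the 7/2 helix. The function names are hypothetical.

```python
def screw_twist_deg(n, m):
    """Unit twist in degrees, mapped to (-180, 180], for N units in M right-handed turns.

    Assumed reading of the symmetry symbol: 10 units in 7 right-handed turns
    is the same helix as 10 units in 3 left-handed turns (negative twist).
    """
    twist = (360.0 * m / n) % 360.0
    return twist - 360.0 if twist > 180.0 else twist


def derived_parameters(kappa_deg, tau):
    """Superhelix parameters from the unit twist kappa (degrees, negative)
    and unit height tau (angstroms). Relations inferred from Table 1."""
    t = 360.0 + 3.0 * kappa_deg      # superhelical twist per Gly-X-Y triplet
    h = 3.0 * tau                    # axial rise per Gly-X-Y triplet
    triplets_per_turn = 360.0 / t
    return {
        "units_per_turn": 360.0 / abs(kappa_deg),
        "t": t,
        "h": h,
        "triplets_per_turn": triplets_per_turn,
        "residues_per_turn": 3.0 * triplets_per_turn,
        "pitch": h * triplets_per_turn,
    }
```

With κ = −108° and τ = 2.98 Å this gives t = 36°, 30 residues per turn and a pitch of about 89.4 Å, matching the 10/3 column to within rounding. With κ = −102.86° and τ = 2.87 Å it gives a pitch of about 60.3 Å, as in the 7/2 column.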
However, the X-ray diffraction of (PPG)10 crystals was clearly indicative of a tighter symmetry, with 7 units in two turns (7/2, 7₅ or 3.5 units per turn) [50,54] (helical parameters in Table 1). Both 10₇ and 7₅ symmetries were compatible with the fibre diffraction data and this observation led Okuyama et al. [60–62] to propose a new structural model for collagen based on the 7₅ (or 7/2) helix. Many of the crystal structures of collagen model peptides determined to date show 7₅ helical symmetry, particularly those with an amino acid sequence predominantly or entirely made of imino acids (Pro, Hyp and alternative modifications of Pro; collagen peptides with this type of sequence are referred to here as ‘imino-saturated’). However, the crystal structure of a peptide with a sequence of human type III collagen (POG)3-ITGARGLAG-(POG)4 (PDB code 1BKV, peptide T3-785) shows distinct helical symmetry for the imino-saturated (POG)n zones (7₅) and the central, imino acid-free ITGARGLAG zone (close to 10₇) (Figure 3) [63,64]. Similarly, the crystal structure of a peptide with a much longer type III collagen sequence GPIGPOGPRGNRGERGSEGSOGHOGσMO-(GPO)2-AOGPCCGG (PDB code 3DMW, peptide T3-991, σM selenomethionine) shows that the central imino acid-free zone and the imino acid-rich flanks are close to 10₇ and 7₅ helices respectively.
Figure 3. Helical twist in the structure of collagen
The differences between a 7₅ and a 10₇ helix are minimal at the level of one individual three-amino-acid-repeating unit (Figure 3D). However, they become important in the context of very long COL×3 domains such as those in fibrillar collagens. Thus, it is important to obtain a reasonably accurate figure for the average unit twist of a COL×3 domain, so that the conformation of the entire domain is adequately represented. Clarification of the sequence-dependence of collagen helical symmetry has been obtained using a novel method in which the repeating unit of the collagen triple helix is defined as a triplet of residues, each from a different chain, sitting approximately at the same vertical level. The unit twist κ is obtained from the helical operation that relates two consecutive such triplets. Variations of local conformation in the high-resolution crystal structures of collagen peptides were visualized by plotting κ against their trimeric sequence (Figure 4). This analysis showed that imino-saturated zones have an average κ of −103° (corresponding to a 7₅ helix), whereas imino acid-free zones approach an average κ of −108° (corresponding to a 10₇ helix). Similarly, the average unit height τ increases from 2.84 Å in imino-saturated zones to 2.90 Å in imino acid-free zones. Interestingly, zones with intermediate imino acid content show average κ and τ values in between those corresponding to the 7₅ and 10₇ helices.
Figure 4. Variation of the unit twist κ as a function of the collagen amino acid sequence
It has been suggested that the unit twist κ between two consecutive triplets changes gradually through the COL×3 domains and is related to the number of imino acids in these two triplets. Analysis of collagen sequences in terms of the expected values of κ for every pair of consecutive triplets predicts an average value κ=−106° for the COL×3 domains of fibrillar collagens I, II or III, and similar values can be obtained for uninterrupted COL×3 domains in other collagens (Figure 4C). This average value of κ corresponds to a helical symmetry intermediate between the 10₇ and 7₅ helices, and reflects the fact that in many collagens the most common pairs of consecutive triplets contain zero to two imino acids (GP0-GP0, GP0-GP1, GP1-GP1), whereas imino-acid rich pairs (three or four imino acids, GP1-GP2, GP2-GP2) are less frequent (Figure 4). There are no simple rational helices corresponding to a collagen triple helix with κ=−106°. Its average conformation can be approximately described by helices with exotic integer combinations such as 17₁₂: a left-handed superhelix of 17 units in five turns made of three supercoiled right-handed 17₂ helices (Table 1).
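The idea of predicting an average κ from the imino acid content of consecutive triplets can be illustrated with a short sketch. Two simplifications are made here and both are assumptions, not the published procedure: the Gly-X-Y triplets of a single chain stand in for the cross-chain triplets used in the original analysis (only reasonable for homotrimers), and κ is interpolated linearly between −108° (no imino acids in the pair) and −103° (four imino acids), the two end values quoted above. The intermediate values of the published scale are not given in this review, and all names below are hypothetical.

```python
from collections import Counter

IMINO_ACIDS = {"P", "O"}        # Pro and Hyp, one-letter codes as used in the text
KAPPA_IMINO_FREE = -108.0       # degrees, average quoted for imino acid-free zones
KAPPA_IMINO_SATURATED = -103.0  # degrees, average quoted for imino-saturated zones


def split_triplets(chain):
    """Split a chain written in the Gly-X-Y frame into triplets."""
    chain = chain.upper()
    if len(chain) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    trips = [chain[i:i + 3] for i in range(0, len(chain), 3)]
    if any(t[0] != "G" for t in trips):
        raise ValueError("every triplet must start with Gly")
    return trips


def imino_count(triplet):
    """Number of imino acids (0 to 2) in the X and Y positions of a triplet."""
    return sum(res in IMINO_ACIDS for res in triplet[1:])


def pair_classes(chain):
    """Tally consecutive triplet pairs by class, e.g. 'GP0-GP2'."""
    trips = split_triplets(chain)
    return Counter(
        "GP%d-GP%d" % tuple(sorted((imino_count(a), imino_count(b))))
        for a, b in zip(trips, trips[1:])
    )


def estimate_average_kappa(chain):
    """Average kappa over consecutive triplet pairs (linear interpolation, assumed)."""
    trips = split_triplets(chain)
    pairs = list(zip(trips, trips[1:]))
    if not pairs:
        raise ValueError("need at least two triplets")
    span = KAPPA_IMINO_SATURATED - KAPPA_IMINO_FREE
    total = 0.0
    for a, b in pairs:
        n_imino = imino_count(a) + imino_count(b)
        total += KAPPA_IMINO_FREE + span * n_imino / 4
    return total / len(pairs)
```

Under these assumptions an imino-saturated sequence such as `"GPO" * 10` returns −103°, and a sequence mixing imino-rich and imino-free triplets returns an intermediate value, which is the qualitative behaviour the text describes.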
There is no question that imino-saturated collagen peptides show a 7₅ structure, and there is consensus that the conformational restraints from imino acids favour such a tight helix, with 3.5 residues per turn. When the proportion of non-imino acids increases, the conformation of collagen peptides relaxes into a less tightly wound helix which approaches 3.33 residues per turn. On average, many COL×3 domains will have an intermediate number of residues per turn. Thus, if a monotonous helix is used to represent the entire conformation of a COL×3 domain, a helical model matching its average κ should be considered rather than a pure 7₅ or 10₇ model. In the absence of high-resolution structures of collagen fibrils or native long collagen molecules, it is still unknown whether these predictions are accurate. Nevertheless, Orgel et al. conclude from their analysis of fibre X-ray diffraction data that different triple helical symmetries occur along type I and II collagen molecules in their native (fibrillar tissue) environment. They discuss that other factors such as local helix dissociation or molecular packing must have an effect on the local symmetry of the collagen triple helix, in addition to any intrinsic sequence-related preferences towards a 10₇ or 7₅ symmetry.
COLLAGEN HYDROGEN BONDING AND THE WATER-BONDED STRUCTURE
The pattern of hydrogen bonding between collagen strands differs from that of common secondary structure elements, such as α-helices or β-sheets, in that not all peptide bonds from the repetitive ─Gly-X-Y─ polypeptide chain can form main-chain to main-chain hydrogen bonds. Early fibre diffraction models already deduced the correct interstrand N─H···O═C connectivity where the N─H groups of the Gly residues act as donors and the C═O groups of the residues in the X position of the following strand act as acceptors (so-called Rich and Crick II hydrogen bonding topology) [25,46]. This topology means that only one interstrand hydrogen bond can form per Gly-X-Y triplet. Structurally, these interstrand hydrogen bonds form a ladder along the triple helix where the steps change orientation, following a helical path as they connect different pairs of strands in turn (Figure 2).
Nevertheless, hydrogen exchange experiments on collagen suggested that only one of the three amide hydrogens per triplet exchanged at the high rates expected for a peptide group freely exposed to the surrounding solvent. The remaining two amide hydrogens exchanged much more slowly, although they could be distinguished in two groups. The first group (one amide hydrogen per triplet) showed very low rates of exchange that changed little upon temperature increase. The second group (about 0.7 hydrogens per triplet) showed low rates of exchange that increased rapidly with the increase in temperature [68–70]. These observations suggested that about 1.7 amide groups per every three residues in collagen were involved in hydrogen bonding, but the two interactions were not equivalent. Deformation of the triple helix to account for a second hydrogen bond per triplet is not possible due to stereochemical reasons. Thus, in order to resolve the dilemma of the two slow amide hydrogens, Ramachandran and Chandrasekharan proposed that when the residue in the X position was not an imino acid, a second ‘hydrogen bond’ could be formed through a water molecule between the N─H group of the amino acid in the X position and the C═O group of the Gly residue on the previous strand. Therefore, this water-bridged hydrogen bonding would reverse that of the direct interstrand hydrogen bonds. Ramachandran et al. also noted a few years later that the hydroxy groups of Hyp residues could form an additional hydrogen bond to the bridging water molecule.
Confirmation of this ‘water-bonded’ model for collagen was not really possible from fibre diffraction data alone, but water molecules around collagen became clearly visible in the high-resolution crystal structures of collagen model peptides. The first evidence of water-mediated hydrogen bonding between collagen strands was observed in the Gly→Ala peptide, where the disruption introduced by the Gly substitution was overcome through water molecules connecting the three strands around the sites of interruption [55,73]. Further structures of the T3-785, EKG (PDB code 1QSU) and other collagen model peptides confirmed the existence of single water molecules connecting the strands of uninterrupted collagen triple helices whenever a free N─H group occurs at the amino acid in the X position, in the manner postulated in the water-bonded model for collagen (Figure 5). These water bridges will be referred to here as ζ1 bridges, following a naming scheme suggested for the different types of collagen-water hydrogen bonding topology. Water molecules in ζ1 bridges often appear as particularly well-defined spheres in the electron density maps (Figure 5A) and have crystallographic temperature factors similar to those of the surrounding atoms from the peptide chains. Some ζ1 bridges are stabilized by additional hydrogen bonding from the hydroxy group of Hyp residues (ζ1–γ1 bridges, in the naming scheme above). For this to occur the Hyp residues must be placed two positions C-terminal to the C═O acceptor group (Figure 5). In ζ1–γ1 bridges, the water-binding sites created by the three groups on the collagen triple helix have a remarkably ideal geometry, with the water molecules adopting a tetrahedral co-ordination and with the three hydrogen bonds showing ideal distances and angles. The fourth co-ordination position is usually occupied by another water molecule.
Figure 5. Water-mediated hydrogen bonding in collagen
Most high-resolution crystal structures of collagen model peptides to date are disproportionately rich in imino acids, many with imino-saturated stretches at both ends. This is mainly for design reasons, to ensure the stability of relatively short triple helices. However, a few structures containing non-imino acids in the X positions have been determined (Supplementary Table S1). In peptide structures with resolution 2.0 Å or better, there are 56 observed ζ1 bridges out of 66 possible (85%). Of the remaining ten cases, there is one unusual two-water ζ2 bridge and the other positions are disrupted or occupied by other side chains or lattice interactions. Of the water molecules involved in ζ1 bridges, 38 form hydrogen bonds with hydroxy groups of Hyp residues (ζ1–γ1 bridges) out of 39 possible (97%). The T3-785 crystal structure also shows two instances of Thr hydroxy groups playing similar roles to the Hyp ones in γ1 bridges.
NMR studies of the T3-785 and (POG)10 peptides where specific positions had been enriched with 15N-labelled amide groups showed that the Gly and Leu (X position) amide protons exchanged very slowly and slowly respectively, whereas the Ala (Y position) amide protons exchanged rapidly with the solvent. The Gly amide protons at the centre of the (POG)10 peptide exchanged even more slowly than the corresponding Gly amide protons in the middle of the T3-785 peptide . The rates of exchange observed on these peptides were consistent with the classic experiments of hydrogen exchange on native collagen. The combination of the NMR data and the X-ray structural information of the T3-785 peptide suggests that the water-mediated hydrogen bonding in the X position slows the hydrogen exchange almost in the same manner as a direct hydrogen bond. It is therefore likely that these water-mediated hydrogen bonds contribute to the local stability of the triple helix and help to maintain the triple helical conformation in regions where there are no imino acids. Even with this second set of hydrogen bonds, regions of triple helix conformation without imino acids will be more flexible and dynamic than regions where all X and Y positions are occupied by Pro and Hyp residues respectively. Water-mediated hydrogen bonds should be less resistant towards increases in temperature than direct interstrand hydrogen bonds. This would agree with the classic observations of the effect of temperature on collagen hydrogen exchange and the different behaviour of the two populations of slow-exchanging protons [68,70].
Figure 4 also shows all possible positions for water-mediated hydrogen bonding (ζ1 and ζ1–γ1 bridges) on the sequence of the COL×3 domain of type III collagen. There are a total of 740 possible such positions, with some very long stretches where there are no imino acids in the X position. Thus, the triple helical structure with two hydrogen bonding connections per triplet (one mediated through water) is potentially far more common than the structure with strictly one hydrogen bond per triplet (assuming that most of these water bridges do actually form). The mature type III collagen protein is a trimer of three α1(III) chains, each with 1068 amino acids (after cleavage of the N- and C-terminal propeptides, UniProt P02461). Thus, the number of possible water-mediated hydrogen bonds is equivalent to 0.69 per every three amino acids of mature type III collagen, which would be consistent with the 0.7 amide protons per triplet that have been shown to exchange slowly with the solvent, but not as slowly as the one amide proton per triplet involved in direct interstrand hydrogen bonding.
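The arithmetic behind the figure of 0.69 can be written out as a short sketch. It assumes that the 740 positions refer to the whole trimer, which is a reading of the text rather than something it states explicitly, and that a candidate ζ1 site is any X position not occupied by Pro or Hyp. The function names are hypothetical.

```python
def zeta1_sites_per_chain(chain):
    """Count X positions (second residue of each Gly-X-Y triplet) that are not
    Pro (P) or Hyp (O), i.e. positions with a free N-H group.

    Assumes the chain is written in the Gly-X-Y frame, starting at Gly.
    """
    chain = chain.upper()
    return sum(1 for i in range(1, len(chain), 3) if chain[i] not in ("P", "O"))


def bridges_per_triplet(n_sites, residues_per_chain, n_chains=3):
    """Possible water bridges per three residues of the assembled molecule."""
    total_triplets = n_chains * residues_per_chain / 3
    return n_sites / total_triplets
```

With the numbers quoted above, `bridges_per_triplet(740, 1068)` evaluates to about 0.693, in line with the 0.69 stated in the text. On the short fragment `"GPOGITGARGLAGPO"`, `zeta1_sites_per_chain` counts three candidate X positions (Ile, Ala and Leu).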
HIGHLY STRUCTURED HYDRATION NETWORKS AROUND COLLAGEN MODEL PEPTIDES
Water has always been considered an intrinsic component of collagen with a role in maintaining the conformation of the native collagen molecule. Tendons contain tightly bound water, and dehydration increases their mechanical stiffness. Early X-ray fibre diffraction experiments showed evidence of wide-range structural changes when tendons were dehydrated [18,19,76]. These findings have been corroborated through the years. A recent study has shown that water removal from tendons shortens the collagen molecules and fibrils and this shortening translates into tensile forces much larger than those achievable from muscle contraction alone. Several studies on oriented hydrated tendons have been conducted using a variety of techniques (Raman and infrared spectroscopies, calorimetric, dielectric measurements, dynamic mechanical spectroscopy and NMR). They have provided evidence for ordering of water molecules in collagen fibrils that differs from bulk water [16,78–85]. Spectroscopic analyses suggest different groups of water molecules, ranging from strongly bound waters with correlation times ≥1 ns to relatively free waters with rotational correlation times on the 10⁻¹⁰ s scale [81,82,85]. Before high-resolution structures became available, the groups of water molecules were interpreted on the basis of the water-bonded model proposed from fibre diffraction studies, plus additional hydration layers around the collagen molecules.
The crystal structure of the Gly→Ala peptide showed an extensive network of ordered water molecules surrounding the triple helical structure (PDB code 1CGD) . Many of these water molecules were located in positions consistent with hydrogen bonding to the C═O groups of the main chain and the OH groups of the Hyp residue side chains, as indicated by distances and angles between the atoms visible in the crystal structures. Most water molecules showed additional hydrogen bonding (on a stereochemical basis) to other water molecules, which in turn were connected to other waters and eventually back to the appropriate groups on the same peptide or on a different peptide on the crystalline lattice. An elaborate network of bridges was described and categorized and a predominance of water-bonded motifs with partial pentagonal geometries was noted. The hydroxy groups of the Hyp residues acted as linchpins for the water networks. Later structural determinations have confirmed the existence of these extensive water networks around the triple helical structures, and the same topologies of collagen-water hydrogen bonding have been observed [86,87]. In particular, some of the water bridges involving the side chains of the Hyp residues are ubiquitous in these crystal structures (Figure 6). The extent of water ordering in different structures is dependent on the particular amino acid sequence and the packing of the triple helices in the lattice, and parallel columnar arrangements of collagen model peptides with a high content of imino acids seem particularly effective at inducing large structuring of the surrounding water molecules (Figure 6).
Ordered water networks between neighbouring triple helices in crystal structures of collagen peptides
One of the key observations is that the triple helices in the crystal lattices have hardly any direct hydrogen bonding or hydrophobic contact between their side chains (Figure 6). In most crystal structures only a few contacts between side chains extending towards the neighbouring chains are observed (Leu residues in (POG)4-(LOG)2-(POG)4, PDB code 2DRX; Asp, Glu, Lys residues in (POG)3-PKG-E/DOG-(POG)3, PDB code 3T4F, 3U29; Phe, Gln, Arg residues in (GPO)3-GPRGQOGVMGFO-(GPO)3, PDB code 4DMT; etc.) [88–90]. This situation contrasts with what is usually observed in globular protein crystals, which build the crystal lattices through clusters of surface residues involved in direct protein–protein interactions. In collagen peptide crystals, the water networks surrounding the triple helices seem to act as cushions or spacers between them. The packing arrangements seen in peptide crystal structures (Figure 6B) are often reminiscent of the expected arrangements of collagen triple helices in fibrils. Furthermore, distances between triple helices in the crystal lattices are similar to the lateral packing distances between collagen molecules in tendon and other tissues examined by fibre X-ray diffraction. Thus, the lateral interactions seen in crystals are probably representative of what occurs in fibrillar collagen assemblies (with the caveat that crystals usually have parallel and antiparallel triple helices on the same lattice), and their interaxial distances are maintained by water molecules that connect adjacent helices. Leikin et al.  have reported the existence of hydration forces between collagen triple helices that are completely consistent with the observed hydration network. At short interaxial spacings (less than 16.8 Å) the forces are repulsive, whereas at longer spacings the forces are attractive. 
The repulsive forces start at a distance larger than the diameter of a single triple helix, and result from the compression of the water layer bound to the collagen model. The attractive forces are hydrophilic in nature and result from the dynamic network of water molecules that interconnect laterally the triple helices by hydrogen bonding (Figure 6C) [83,92]. This hydration network seems to have a role in directing assembly of collagen fibrils and maintaining their geometry.
HYDROXYPROLINE AND COLLAGEN STABILITY: RING PROPENSITY AND CHAIN PREORGANIZATION
Collagens are unique among animal proteins for their high content of Hyp, in particular the 4(R)-hydroxyproline stereoisomer (Hyp4R, O). This is produced via post-translational modification of Pro residues of individual collagen strands by the enzyme prolyl-4-hydroxylase (P4H) (EC 1.14.11.2). P4H uses Pro, 2-oxoglutarate and O2 as substrates and produces succinate, Hyp4R and CO2. The enzyme is very specific: it hydroxylates Pro residues in the Y position of individual polypeptide chains with repeating ─Gly-X-Y─ sequence, and does not act on Pro residues that are already in a triple helical conformation . Some collagens also contain 3(S)-hydroxyproline (Hyp3S, O3S), a rare post-translational modification by a different enzyme, prolyl-3-hydroxylase (P3H) (EC 1.14.11.7) . Hyp3S will be discussed separately later.
At a very basic level, the imino acids Pro and Hyp stabilize the PPII conformation via stereochemical restrictions imposed by the imino acid rings. The Ramachandran plot for Pro shows a very limited choice of φ and ψ conformational angles, and one of the main regions corresponds to the PPII conformation . This type of stabilization, often referred to as preorganization, is of entropic nature. In this case it decreases the entropic cost of collagen folding by favouring extended PPII-like chain conformations in the unfolded state. These are closer to the final folded state (triple helix) than if they were in a completely random conformation . However, Hyp has an additional stabilizing effect, as demonstrated by the differences in thermal stability between hydroxylated (Tm=43°C) and non-hydroxylated (Tm=27°C) human type I collagen [96–98]. Importantly, lack of prolyl hydroxylation in animals results in collagens that are not stable at physiological temperature, and removal of P4H activity is lethal for both vertebrate and invertebrate animal models [99,100]. The very first collagen model peptides provided further evidence for the effect of Hyp on thermal stability: (POG)10 had a temperature of denaturation 30°C higher than that of the homologous peptide (PPG)10 . The impact of Hyp and many related Pro derivatives has since been studied extensively using collagen model peptides, and several reviews of these studies have been published [44,101–105]. Biochemical knowledge obtained from these studies has opened a myriad of engineering possibilities, and the chemical biology of proline modifications is now an area of intense activity. A summary of the current state of the field follows.
Crystal structures of collagen model peptides show a distinct preference for the conformation of the imino acid rings, depending on their position in the collagen chain. Proline rings can adopt two states, defined by their φ and χ1 angles (interdependent in imino acids). The Cγ-endo (down) conformation (Figure 7) is characterized by positive χ1 angles close to 25° and values of φ close to −75°. The Cγ-exo (up) conformation shows negative χ1 angles close to −20° and values of φ close to −60° [56,106]. These values of φ match closely the mean observed values of the φ conformational angles in the collagen triple helix: −72±6° for the X position, −59±4° for the Y position (averages from collagen peptide crystal structures at resolution 1.5 Å or better). This means that imino acids are effectively preorganized to fit into the collagen conformation without any significant strain . The first crystal structure of a collagen model peptide with Hyp residues (PDB code 1CGD) showed that Hyp residues (all in the Y position) had a strong preference for the Cγ-exo conformation, whereas Pro residues (all in the X position) had a clear preference for the Cγ-endo conformation . The same trend was observed in several crystal structures of (PPG)9/10 peptides (PDB codes 1A3I, 1A3J, 1G9W, 1ITT and 1K6F) where the Cγ-exo conformation was favoured for Pro residues in the Y position and the Cγ-endo conformation was preferred for Pro residues in the X position [56,106–108]. These positional preferences [87,109] are the consequence of the differences in the φ conformational angle at each position on the collagen triple helix, and have been largely confirmed in all peptide structures determined to date (Supplementary Table S1). 
On the other hand, crystal structures of amino acids and small peptides show a clear intrinsic preference for the Cγ-exo conformation in Hyp residues, whereas Pro residues show a mixture of Cγ-endo and Cγ-exo conformations with a 2:1 preference for the former [106,110]. These observations led Zagari and co-workers  to propose the propensity-based hypothesis, by which replacing the Pro residues in the Y position with Hyp would stabilize the triple helix by reducing the conformational freedom of the unfolded state and preorganizing the unfolded chain towards the conformation of the folded state. This stabilization, of entropic nature, means that Hyp residues are more effective than Pro residues in the Y position as they are better preorganized to the conformational angles required at that position in the folded state (the triple helix). The propensity-based hypothesis would explain why the stabilizing effect of Hyp is very stereospecific and only the Hyp4R diastereoisomer is found in the Y position of natural collagens. Accordingly, the collagen model peptide with swapped imino acid positions, (OPG)10, does not form stable triple helices , and neither does the peptide (PαOG)10 with allo-hydroxyproline residues (Hyp4S, αO) in the Y position . Crystal structures of the host–guest peptides (PPG)4-PαOG-(PPG)4 (PDB code 1X1K) and (PPG)4-OPG-(PPG)4 (PDB code 3A0A) show the Hyp4S and Hyp4R residues on the central triplets following the positional preference for the ring conformation, which is opposite to their own intrinsic preference (Figure 7). Both peptides are thus destabilized with respect to the parent peptide (PPG)9 in a manner consistent with the propensity-based hypothesis [87,113,114].
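The ring states and positional preferences described above can be expressed as a small lookup. The following is a minimal sketch, assuming that the sign of χ1 alone is enough to label a ring as Cγ-endo or Cγ-exo; that cutoff and the function names are illustrative assumptions, not a published criterion.

```python
# Positional preference in the triple helix, as described in the text:
# Cγ-endo (chi1 near +25°) in the X position, Cγ-exo (chi1 near -20°) in Y.
PREFERRED_PUCKER = {"X": "endo", "Y": "exo"}


def classify_pucker(chi1_deg: float) -> str:
    """Label a proline-type ring from the sign of chi1.

    Assumption: positive chi1 means Cγ-endo (down), anything else Cγ-exo (up).
    """
    return "endo" if chi1_deg > 0 else "exo"


def matches_position(chi1_deg: float, position: str) -> bool:
    """True when the observed pucker agrees with the preference for X or Y."""
    return classify_pucker(chi1_deg) == PREFERRED_PUCKER[position]
```

With this rule, a shallow Cγ-endo ring (for example χ1 = 16°) still counts as matching the X position, so the sketch captures only the direction of the pucker, not how distorted it is.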
Preferences for ring conformation in proline and several of its derivatives
Raines and co-workers [44,115–118] have developed the concept further and used it to engineer new hyperstable collagen model peptides with synthetic proline derivatives. The Cγ-exo preference for Hyp can be explained by a stereoelectronic gauche effect (Figure 7). Similarly, other electronegative groups such as fluorine in 4(R)-fluoroproline or chlorine in 4(R)-chloroproline will favour this conformation (Table 2). Steric effects can also preorganize Pro derivatives to preferred ring conformations. A methyl group will prefer an equatorial position in a Pro ring and thus 4(R)-methylproline will favour a Cγ-endo conformation (Figure 7), whereas 4(S)-methylproline will favour a Cγ-exo conformation (Table 2) (see  for a comprehensive review). The torsion angles for the Cγ-exo conformation match the torsion angles for the Y position in the collagen triple helix. Therefore, replacing a Pro residue in the Y position with a derivative preorganized in the Cγ-exo conformation (Table 2) introduces an entropic stabilization, and the fluorinated collagen model peptide (PfP4RG)10 has one of the highest thermal stabilities reported to date (Tm=91°C, ). In contrast, when the electronegative groups are in the 4(S) position the favoured conformation is the Cγ-endo (Table 2). Replacing a Pro residue in the X position with a derivative preorganized in the Cγ-endo conformation will also introduce an entropic stabilization, although other unfavourable factors may overturn this stabilization (discussed below). Generally, however, the effect of Pro substitution in the X position is less pronounced than that of replacing Pro in the Y position: Pro residues already have a preference for the Cγ-endo conformation, whereas Cγ-exo-favouring derivatives in the Y position reverse that preference and so have a larger impact on the entropic stabilization . Opposite effects are expected if Cγ-exo Pro derivatives are placed in the X position or if Cγ-endo Pro derivatives are placed in the Y position; in these cases a destabilization of the triple helix would occur. If the electronegative group is also a hydrogen bonding donor the Cγ-endo conformation could be further stabilized by intramolecular hydrogen bonding to the C═O group of the substituted imino acid, as shown in Figure 7 for 4(S)-aminoproline under acidic conditions (Amp4S+) (Table 2) . However, such a mechanism has been reported as destabilizing for the collagen triple helix when in the X position, due to a weakening interference with the interstrand hydrogen bonds [119–122].
An additional stereoelectronic effect termed n → π* interaction  has been linked to collagen stabilization by Hyp and other similarly substituted Pro derivatives. This weak interaction occurs between the lone pairs of the oxygen atom of a peptide bond and the empty π* orbital of the C═O group in the next peptide bond. The preferred value for the ψ angle in the PPII conformation (145°) and in the Y position of the collagen triple helix (151±4°) is geometrically ideal for these n → π* interactions, and thus this effect will be potentially larger for the imino acids that favour the Cγ-exo conformation, such as Hyp4R (Table 2) [44,124]. Although the stabilization energy of n → π* interactions is modest (less than 1 kcal/mol, weaker than a ‘weak hydrogen bond’) , their main impact is to displace the equilibrium between the trans and cis forms of the peptide bond, Ktrans/cis, in imino acid residues . Unfolded collagen strands will show a mixture of Pro residues with cis and trans peptide bonds, and all of them must be set to trans in order to build the triple helical structure. Thus, favouring trans peptide bonds in the unfolded state should contribute to an entropic stabilization of collagen. Nevertheless, replacement of Pro-Pro residues from a (PPG)10 peptide with a Pro-trans-Pro alkene isostere shows that the cis–trans isomerization effects on the stability of the triple helix are limited .
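To make the role of Ktrans/cis concrete, the sketch below converts an equilibrium constant into a free energy difference and into the probability that every imino acid peptide bond in a strand is trans. It assumes standard equilibrium thermodynamics (ΔG° = −RT ln K) and, as a simplification, that the peptide bonds isomerize independently of each other; the function names and any K values passed in are hypothetical, not data from the studies cited.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def delta_g_trans_cis(k_trans_cis: float, temp_k: float = 298.15) -> float:
    """Standard free energy (kcal/mol) of the cis -> trans equilibrium."""
    return -R_KCAL * temp_k * math.log(k_trans_cis)


def fraction_trans(k_trans_cis: float) -> float:
    """Population of the trans isomer for a single peptide bond."""
    return k_trans_cis / (1.0 + k_trans_cis)


def all_trans_probability(k_trans_cis: float, n_bonds: int) -> float:
    """Chance that all n_bonds are trans, assuming independent bonds."""
    return fraction_trans(k_trans_cis) ** n_bonds
```

Because the all-trans probability is a product over every imino acid in the strand, even a small shift in Ktrans/cis per residue compounds along the chain, which is the sense in which favouring trans bonds in the unfolded state reduces the entropic cost of folding.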
Combining the conformational preferences of different Pro derivatives allows for the engineering of largely preorganized collagens with improved thermal stability. The rationale for the design is to use derivatives that favour the Cγ-endo conformation in the X position and derivatives that favour the Cγ-exo conformation in the Y position (Table 2). Numerous collagen model peptides engineered with Pro residues replaced are consistent with this entropic stabilization (Supplementary Table S2; see  for a comprehensive list). For instance, fluorine is more electronegative than oxygen and Flp4R has a stronger preference for the Cγ-exo conformation than Hyp4R. Thus Flp4R in the Y position should be more effective than Hyp4R in stabilizing collagen, whereas Flp4R in the X position should be destabilizing. Accordingly, the peptide (PfP4RG)10 is much more stable thermally than the reference peptide (POG)10 (Supplementary Table S2) , whereas the peptide (fP4RPG)10 does not form triple helices .
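The design rationale can be written as a simple consistency check. This is a minimal sketch that encodes only the ring preferences stated in the text and the propensity rule itself; the dictionary and function names are illustrative, and the result is a statement about the rule, not a prediction of measured stability (several peptides are reported to deviate from it).

```python
# Intrinsic ring preferences as stated in the text (Table 2 is not reproduced).
RING_PREFERENCE = {
    "Pro": "endo",    # weak preference, about 2:1 endo over exo
    "Hyp4R": "exo",
    "Flp4R": "exo",
    "Hyp4S": "endo",
    "Flp4S": "endo",
    "Mep4R": "endo",
    "Mep4S": "exo",
}

# Pucker required by each position of the Gly-X-Y triplet.
POSITION_REQUIRES = {"X": "endo", "Y": "exo"}


def propensity_prediction(residue: str, position: str) -> str:
    """Return 'favourable' when ring preference and position agree."""
    if RING_PREFERENCE[residue] == POSITION_REQUIRES[position]:
        return "favourable"
    return "unfavourable"
```

For example, the rule marks Flp4R as favourable in Y and unfavourable in X, in line with (PfP4RG)10 and (fP4RPG)10, but it also marks Hyp4S as favourable in X, which experiment does not bear out.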
These design principles have been demonstrated elegantly with the synthesis, biochemical characterization and crystal structure determination of the collagen model peptide (mP4RfP4RG)7 (PDB code 3IPN) . This peptide has a large thermal stability compared with the unstable reference peptide (PPG)7 (Supplementary Table S2). The main component of its stability is entropic, as expected from the preorganization of the chain conformation consequence of the choice of proline derivatives for each position. This preorganization has not altered the structure of the triple helix, which closely resembles that of the peptides (PPG)10 or (POG)10.
For all its conceptual simplicity, the propensity model does not explain all cases and an increasing number of ‘exceptions’ to the rule have been reported (Supplementary Table S2). For instance Hyp4S does not stabilize the helix at any position. Its preferred Cγ-endo conformation should make it suitable for incorporation in the X position and yet the peptide (αOPG)10 does not form triple helices  and the peptide (αOPG)15 is less stable than its reference peptide (PPG)15 [120,129] (Supplementary Table S2). Incorporation of Flp4S into the X position of peptides (fP4SPG)7 and (fP4SPG)10 or Flp4R into the Y position of peptides (PfP4RG)7 and (PfP4RG)10 is stabilizing. Preorganization of both X and Y positions should bring extraordinary stability. Yet, peptides with both positions occupied by Flp, (fP4SfP4RG)7, (fP3SfP4RG)7 and (fP4SfP4RG)10 either do not form triple helices or are destabilized [130,131] (Supplementary Table S2). Different explanations have been put forward to rationalize these exceptions: unfavourable steric interactions, interference of intramolecular hydrogen bonds with the interstrand N─H···O═C hydrogen bonding, etc. [106,120,131].
The anomaly of Hyp4S is illustrative. When Hyp4S is in the X position its hydroxy group points to the inside of the collagen triple helix. Thus, it was suggested that steric clashes with the neighbouring chains would prevent the formation of a stable triple helix . However, Flp4S is very stabilizing in the X position [128,132] and the size difference between the ─OH and ─F groups is too small to account for the discrepancy in stability. An alternative mechanism of destabilization would result from the formation of an intramolecular hydrogen bond between the ─OH and C═O groups of Hyp4S (similar to the case shown for Amp4S+ in Figure 7). In the X position, the C═O group is already involved in N─H···O═C hydrogen bonding. Accepting a second hydrogen bond would weaken this interstrand interaction . The crystal structure of a host–guest peptide with two triplets Hyp4S-Pro-Gly (PDB code 3B2C)  shows conformational diversity for the Hyp4S residues: about 30% adopt the Cγ-exo conformation, going against both intrinsic and positional preferences of Hyp4S in the X position; the remaining 70% adopts a distorted (shallow) Cγ-endo conformation with χ1 values averaging 16° (compared with the average from high-resolution peptide crystal structures, χ1=26±8°). These distortions probably avoid unfavourable steric interactions with the other chains, and are thought to cause the marked decrease in thermal stability of the peptide (POG)4-(αOPG)2-(POG)4 (Tm=49°C) with respect to the reference peptide (POG)10 (Tm=62°C) . Concerning the proposed intramolecular hydrogen bond, the average distance between the ─OH group and the carbonyl oxygen is 3.15 (±0.13) Å, whereas the average angle between the hydrogen bond and the C═O bond is 76 (± 3)°. These geometrical parameters indicate a weak, if any, hydrogen bonding interaction and suggest that, at least for the (POG)4-(αOPG)2-(POG)4 peptide, there will be little interference with the interstrand N─H···O═C hydrogen bonding.
Unexpected results have also been obtained when one Pro-Flp4R-Gly unit is embedded within a Pro-Hyp-Gly context. Thus, (POG)3-PfP4RG-(POG)4 is slightly less stable (Tm=44°C) than the parent peptide (POG)8 (Tm=47°C) despite introducing in the Y position the highly stabilizing Flp4R . On the other hand, adding Hyp to both the X and Y positions produces results completely opposite to those expected. The peptide (OOG)10 with Hyp4R in both X and Y positions goes against the propensity model and yet it is slightly more stable than the reference (POG)10. The related peptide (POG)3-OOG-(POG)4 has the same thermal stability (Tm=47°C) as the parent peptide (POG)8. In contrast, the peptide (O4SOG)10 with Hyp4S in the X position and Hyp4R in the Y position follows the propensity model and yet is much less stable than the reference (POG)10  (Supplementary Table S2). Possible reasons for these discrepancies in the case of Hyp-rich peptides are discussed in the next section.
Current knowledge from structural and biochemical analysis of collagen peptides with different proline derivatives has led to relatively high levels of success in predicting the changes in collagen stability and folding kinetics when using these derivatives. This knowledge is guiding ongoing efforts on the design of new collagen peptides with functionalizable groups [135–138] or the engineering of novel collagen peptide mimetics and collagen-based biomaterials [122,139–144]. One example of application is the development of pH-dependent triple helices through synthetic ionizable Pro derivatives such as aminoproline (Amp) residues [119,145–147] or carboxylate-modified prolines [148,149]. Amp-based collagen model peptides have a complex stability profile due to the convergence of several factors like the intrinsic ring preferences for Amp4S/Amp4R+, the possibility of intramolecular hydrogen bonding and the different stereoelectronic properties of the neutral (─NH2) and protonated (─NH3+) forms of the amino group (Table 2). At neutral and acid pH, the amino group is protonated: Amp4S+ prefers to adopt a Cγ-endo conformation, possibly enforced by an intramolecular transannular hydrogen bond (Figure 7). On the other hand, the uncharged amino group at basic pH will favour the Cγ-exo conformation for Amp4S, in a similar way to Mep4S. This difference forms the basis of a pH-dependent conformational switch between the two conformations, and thus (POG)3-aP4SPG-(POG)3 is more stable at pH 11 (Tm=33°C) than at pH 3 (Tm=13°C) . The related peptide (POG)3-PaP4SG-(POG)3 is more stable at pH 11 (Tm=44°C) due to the uncharged Amp4S preferring the Cγ-exo conformation and being in the Y position . Partially conflicting results have been obtained using repeating peptides with six triplets of Pro, Amp and Gly residues (Supplementary Table S2). 
These data suggest that both protonated (pH 3) and neutral Amp4R (pH 12) are stabilizing in the Y position, whereas Amp4S+ and Amp4R+ are both stabilizing in the X position when protonated (pH 3), but not when neutral [145–147].
HYDROXYPROLINE AND COLLAGEN STABILIZATION: ENTHALPY CONTRIBUTION
The mechanism of stabilization discussed above is essentially of entropic nature: the chain conformation in the unfolded state is primed to a conformation that is closer to the folded state and therefore reduces the overall entropic cost of going from three individual chains to one single triple helix. However, collagen stability has a large enthalpic component, in contrast to what occurs with other proteins, and the stabilization of collagen by Hyp has been correlated with an increase in that enthalpy . An early thermodynamic analysis of the denaturation of the model peptides (POG)10 and (PPG)10 demonstrated that (POG)10 has a much higher denaturation temperature than (PPG)10 and also that it shows larger enthalpy and entropy changes for the coil → triple helix transition . The same findings have been obtained in more recent studies [128,150].
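The relationship between these quantities can be summarized with a simplified two-state treatment. The expressions below assume temperature-independent ΔH and ΔS and ignore the concentration dependence of trimer association, so they are a sketch of the reasoning and not the analysis used in the cited studies.

```latex
\Delta G(T) = \Delta H - T\,\Delta S,
\qquad
\Delta G(T_m) = 0 \;\Longrightarrow\; T_m = \frac{\Delta H}{\Delta S}
```

For the coil → triple helix transition both ΔH and ΔS are negative. Under these assumptions, a peptide such as (POG)10 can show larger enthalpy and entropy changes than (PPG)10 and still have the higher Tm, provided the enthalpy change grows proportionally more than the entropy change.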
What is the origin of the large enthalpy of collagen? In the context of protein stability, a high enthalpy of stabilization is usually associated with more (or stronger, or both) hydrogen bonds, more efficient molecular packing leading to stronger van der Waals interactions, stronger electrostatic interactions, etc. Given the non-globular structure of the collagen triple helix and the absence of charge effects between Pro and Hyp, it would follow that the enthalpic component of collagen stabilization by Hyp mainly relates to an increase in hydrogen bonding. However, the structure of the triple helix, which places the hydroxy groups of Hyp at the periphery, does not allow for it. The only obvious alternative is significant hydrogen bonding interaction with water surrounding the triple helix and the formation of water bridges connecting the chains. Different models of hydrogen bonding interactions between the main-chain C═O groups, the Hyp side chains, and water molecules around the triple helix have been proposed in the past, including the water-bonded structure when non-imino acids occupy the X position [71,72,151]. As already discussed, the crystal structure of the Gly→Ala peptide revealed an extensive repetitive network of water molecules, interconnecting the triple helical structure through hydrogen bonding with the C═O groups of the peptide bonds and the ─OH groups of the Hyp residues . Water networks with the same hydrogen bonding topology have been observed in subsequent, higher-resolution crystal structure determinations of peptides with a significant proportion of POG triplets (Supplementary Table S1). In these structures, the ─OH groups of Hyp side chains are central to the connectivity of the hydration networks (Figure 6). However, it is not possible to assess from crystal structures whether the contribution of the hydrogen bonding between Hyp and water is ultimately stabilizing.
There has been a degree of controversy about the physical significance of these water networks with respect to collagen stability. The large body of evidence in support of the propensity-based hypothesis and the stereoelectronic effects (see above) has offered a convincing case for collagen stability, while apparently dismissing any role of water in the structure and stability of the triple helix. In particular, the entropic cost of engaging many water molecules in hydrogen bonding interactions is thought to largely overcome any stabilizing enthalpic interaction . Nevertheless, neither the propensity-based hypothesis nor stereoelectronic effects offer a reasonable explanation for the enthalpic stabilization of natural collagen. The (OOG)10 peptide is one of the exceptions to the propensity-based predictions. It forms a triple helix that is slightly more stable (Tm =65°C) than (POG)10, despite placing Cγ-exo favouring Hyp4R in the X position. Crystal structures of the (OOG)10 peptide (PDB code 1WZB)  and the closely related (GOO)10 peptide (PDB code 1YM8)  show that essentially all Hyp residues adopt the Cγ-exo conformation irrespective of their X or Y position, and yet the resulting triple helices have the same helical and hydrogen bonding patterns as other collagen model peptides.
To gain insight into this anomaly, Kobayashi and co-workers [130,150] performed differential scanning calorimetry analyses on (OOG)10 and other peptides. They concluded that the increase in stability with respect to (PPG)10 seen in the different peptides could be enthalpy-dominant or entropy-dominant, and that the mechanisms of collagen stabilization by Flp and Hyp would differ (Figure 8). Thus, the increased stability of (PfP4RG)10 with respect to (PPG)10 would be entropically driven, consistent with the propensity-based model, stereoelectronic effects, and cis/trans equilibrium of the peptide bonds discussed above. In contrast, the increased stability of (POG)10 with respect to (PPG)10 would be enthalpically driven, with a significant contribution from hydrogen bonding to the hydration network surrounding the triple helix, in addition to the entropic factors mentioned above . The (OOG)10 peptide showed smaller enthalpy and entropy changes than those seen in (PPG)10 and (POG)10 (Figure 8). The degree of hydration of (OOG)10 and (POG)10 is very similar in the crystalline state, but (OOG)10 appears to be more hydrated in the unfolded state. This would explain the reduction in enthalpy and entropy differences between folded and unfolded state of (OOG)10 with respect to (PPG)10 or (POG)10 .
Comparison of thermodynamic parameters for the triple helix → coil transition of different collagen peptides normalized at the equilibrium transition temperature of (PPG)10 
COLLAGEN AND 3-HYDROXYPROLINE
The other post-translational modification of Pro residues is prolyl-3-hydroxylation, a rare modification that so far has only been reported in collagen proteins. Hyp3S was originally discovered on type I collagen more than 50 years ago , but its biological function is still being elucidated . Hyp3S is a quantitatively minor modification, with only a few residues per α chain in collagen types I, II, III, IV and V/XI [155–159]. It occurs only at specific sites, always in the X position of triplets with Hyp4R in the Y position. Hyp3S sites are highly conserved, and show variable occupancy in a given tissue and pronounced specificity across tissues or developmental stages . For example, 3-hydroxylation of a C-terminal (GPP)n motif in type I collagen is unique to tendon and absent from skin and bone. The levels of Hyp3S in this motif appear to be regulated during development [157,159]. Hyp3S must be important for the self-assembly of collagen supramolecular structures as several genetic variants with defective prolyl-3-hydroxylation have been linked to recessive forms of osteogenesis imperfecta and severe myopia (reviewed in [94,160]).
All of these findings have contributed to a renewed interest in the role of Hyp3S in collagen structure and function. Hyp3S has a preference for the Cγ-endo conformation (Table 2) and, according to the propensity model, it should provide stabilization in the X position. Initial studies produced conflicting results, as the peptide (GO3SO)10 appeared not to form a triple helix at all, whereas the host–guest peptide (GPO)3-GO3SO-(GPO)4 was slightly more stable than the related (GPO)8 . On the other hand, the triple helical structure of the peptide (GPO)3-(GO3SO)2-(GPO)4 (PDB code 2G66)  was virtually identical to that of other (PPG)n or (POG)n peptides, with all of the Hyp3S residues in the expected Cγ-endo conformation and no evidence for any destabilizing interaction. Later work by the same group established that Hyp3S does slightly stabilize the collagen triple helix in the X position of Gly-X-Hyp4R triplets  and that this stabilization is mainly entropic, as would be expected from the propensity model.
Nevertheless, the very low frequency of Hyp3S residues in collagen (compared with more than 100 Hyp4R positions per collagen chain) argues against a main role for Hyp3S in simply conferring additional stability to the triple helix . Other functions have been proposed where Hyp3S side chains may act as specific points for collagen interaction, with other molecules of collagen or with collagen-binding proteins, and the evidence is slowly emerging. The regular spacing of several Hyp3S sites in collagen types I and II may indicate a possible role in fibril assembly . Also, Hyp3S sites may overlap with fibril surface binding sites for small leucine-rich proteoglycans [94,160]. The (GPO)3-(GO3SO)2-(GPO)4 structure (PDB code 2G66) shows that the ─OH groups from Hyp3S residues point away from the triple helix. This orientation would be consistent with a role in directing specific intermolecular hydrogen bonding interactions to other proteins or to other collagen triple helices [156,162]. A recent report shows that Hyp3S residues in type IV collagen prevent its interaction with platelet-specific glycoprotein VI, thus blocking the initiation of platelet aggregation . All of these studies point towards biological functions for Hyp3S based on molecular interactions. Interestingly, evolution seems to have selected different roles for the two hydroxylated Pro residues: Hyp4R as a widespread non-specific modification to increase thermal stability throughout; and Hyp3S as a specific localized modification to provide sites for molecular recognition.
COLLAGEN STABILIZATION BY OTHER AMINO ACID SIDE CHAINS
Over the last decade, several studies have reported the existence of numerous open reading frames with the hallmark (Gly-X-Y)n collagen sequence in prokaryotic and viral genomes [165–169]. Some of these bacterial and viral collagen-like proteins have been produced using recombinant techniques, and the formation of triple helical structures confirmed [169–172]. Sequence analyses of non-metazoan collagenomes reveal amino acid compositions quite different from those of vertebrates or invertebrates (Table 3, Figure 9). In particular, the proportion of Pro residues in the Y position is very low in all prokaryotic and viral collagenomes, probably due to the lack of prolyl hydroxylation in these organisms. Absence of Hyp residues appears to be compensated for by an increase in other residues at specific positions, depending on the taxonomic groups (Table 3, Figure 9): Pro residues in the X position (Escherichia coli, phages, Bacillus); charged residues in the X and Y positions (Mimivirus, E. coli, Streptococcus); Thr residues in the Y position (Bacillus); Gln residues in the Y position (phages, Bacillus, Streptococcus, E. coli); or Ala residues in the X position (Bacillus). The biochemical characterization of these prokaryotic collagens shows thermal stabilities close to those of vertebrate collagens [169–173], indicating that other mechanisms of stabilization independent of prolyl hydroxylation are at play. Prokaryotic collagens have rapidly generated interest as possible sources of collagen-based biomaterials with designed properties [172,174,175], and thus it is important to elucidate these Hyp-independent mechanisms of collagen stabilization as they may open new avenues for collagen engineering.
Effect of amino acid side chains other than Pro and Hyp on the stability of the collagen triple helix
Brodsky and co-workers [176–179] have carried out a systematic study of the effect of different amino acids in the X and Y positions on the stability of collagen. Most peptides with a simple repetitive (Gly-X-Y)n sequence are not stable at room temperature unless X and Y are imino acids. Thus, the usual approach to investigate the contribution of specific triplets to collagen stability is the design of host–guest peptides with sequences such as Ac-(GPO)3-(GXAXB)-(GPO)4-GG-NH2 or Ac-(GPO)3-(GXAXB-GXCXD)-(GPO)3-GG-NH2, where the destabilizing effect of the ‘guest’ triplets GXAXB or GXAXB-GXCXD is measured by comparison to the reference peptide Ac-(GPO)8-GG-NH2 (the most stable of the series). From these stability data, an empirical algorithm for the prediction of collagen stability from its amino acid sequence was developed  (Collagen Stability Calculator, http://compbio.cs.princeton.edu/csc). This algorithm can be used to predict the melting temperature of short peptides with an architecture similar to the host–guest series above, but also to produce a profile of relative stability for longer collagen sequences that can be useful to identify regions of high or low stability along a given collagen sequence .
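The host–guest architecture described above is regular enough to generate programmatically. The sketch below builds the notation and the expanded one-letter sequence for one or two guest triplets, assuming the pattern given in the text (three N-terminal GPO triplets, eight triplets in total, a C-terminal GG, with O standing for Hyp); the function names are hypothetical and the code does not reproduce the Collagen Stability Calculator.

```python
TOTAL_TRIPLETS = 8   # (GPO)3 + guests + remaining (GPO)n, as in the text
N_TERMINAL_HOST = 3


def _check_guests(guests: list[str]) -> int:
    """Validate guest triplets and return the number of C-terminal GPO units."""
    for triplet in guests:
        if len(triplet) != 3 or triplet[0] != "G":
            raise ValueError(f"guest triplet must be Gly-X-Y: {triplet!r}")
    n_tail = TOTAL_TRIPLETS - N_TERMINAL_HOST - len(guests)
    if not guests or n_tail < 1:
        raise ValueError("expected one to four guest triplets")
    return n_tail


def host_guest_notation(guests: list[str]) -> str:
    """Compact notation, e.g. Ac-(GPO)3-(GER)-(GPO)4-GG-NH2."""
    n_tail = _check_guests(guests)
    return f"Ac-(GPO)3-({'-'.join(guests)})-(GPO){n_tail}-GG-NH2"


def host_guest_sequence(guests: list[str]) -> str:
    """Expanded one-letter sequence without the terminal capping groups."""
    n_tail = _check_guests(guests)
    return "GPO" * N_TERMINAL_HOST + "".join(guests) + "GPO" * n_tail + "GG"
```

Keeping the total number of triplets fixed means every guest peptide has the same length as the all-GPO reference, so differences in melting temperature can be attributed to the guest triplets alone.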
Table 3 (column headings only): Human (X/Y), Mouse, Fish, Worm, Fly, E. coli, Streptococcus, Bacillus, Phages, Mimiviridae
The least destabilizing residues in the host–guest analysis of single GXAXB triplets were Pro, Glu, Ala, Lys, Arg, Gln and Asp for the X position and Hyp, Arg, Met, Ile, Gln and Ala for the Y position. Aromatic residues Trp, Tyr and Phe were very destabilizing in either position, and Gly was also very destabilizing, more in the X position. Thus, the least destabilizing triplet without imino acids was GER and the most destabilizing observed experimentally was GGF . There is no clear explanation at the molecular level for many of these effects. For instance, it is noteworthy that Ala residues are relatively favoured at either the X or Y positions of the triple helix, without an obvious reason other than being small and not interfering. The case of Arg in the Y position seems clearer: Arg side chains can form hydrogen bonds to main-chain C═O groups on the following strand, as seen in the crystal structure of the T3-785 peptide (PDB code 1BKV) (Figure 9) . Other peptides with Arg residues in the Y position show the same interaction for some but not all their side chains (PDB code 1Q7D, 3DMW, 4AXY, 4DMT and 4GYX) [65,90,180–182]. Analysis of the side-chain conformation of Arg residues in collagen peptides shows a preferred tt conformation for the χ1 and χ2 torsion angles that differs from that most commonly observed in globular proteins. In particular, the tttt or ttgg conformations for χ1 to χ4 are observed in Arg side chains where Nε or Nη1 respectively form a hydrogen bond to the main-chain C═O group on the Y position from the following strand .
Extension of the host–guest analysis to the two-triplet case GXAXBGXCXD demonstrated that interstrand electrostatic interactions between residues of opposite charge can be very stabilizing, as shown by host–guest peptides with GPKGEO and GPKGDO sequences. Crystal structures of peptides with these sequences (PDB codes 3T4F and 3U29) confirmed the formation of interstrand ion pairs and additional hydrogen bonding involving the Lys side chains. The KGE and KGD motifs appear to be more common in metazoan collagen sequences than expected from the individual residue frequencies, and positional preference analyses of non-metazoan collagens show a high proportion of charged residues (Glu/Asp in the X position, Lys/Arg in the Y position) for E. coli, Streptococcus or viral collagenomes (Table 3, Figure 9A). Biochemical analysis of Scl2, a collagen-like protein from Streptococcus pyogenes, indicates a high degree of electrostatic stabilization of the triple helix that is a consequence of its relatively high proportion of charged amino acids. Similarly, collagen-like proteins from E. coli genomes contain a high proportion of charged amino acids plus a very high preference for Pro in the X position (Table 3, Figure 9A). The combination of both effects may contribute to the observed stability of the triple helices of these proteins. Some prokaryotic or viral collagen-like sequences show stretches of repetitive sequences that alternate charged residues of opposite sign, (KGE)n or (KGD)n, and are free of imino acids (one predicted collagen-like sequence from white spot shrimp virus contains 33 consecutive KGE repeats, Q8QTH5). Formation of repetitive interstrand ion pairs as shown in Figure 9B could be essential for maintaining the triple helical structure of these atypical collagen-like sequences.
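The statement that KGE or KGD motifs are more common than expected from individual residue frequencies can be illustrated with a simple observed-versus-expected count. The sketch below is one naive way to do this and is not the method of the cited study. It assumes an uninterrupted (Gly-X-Y)n string in one-letter code, treats the Y residue of one triplet and the X residue of the next as independent when computing the expectation, and uses a made-up toy sequence. The name `motif_enrichment` is hypothetical.

```python
def motif_enrichment(seq, y_res="K", x_res="E"):
    """Count Y-Gly-X motifs spanning consecutive triplets.

    Returns (observed, expected). The expectation assumes that the residue
    in the Y position of one triplet is independent of the residue in the
    X position of the next triplet.
    """
    if len(seq) % 3 != 0 or any(seq[i] != "G" for i in range(0, len(seq), 3)):
        raise ValueError("expected an uninterrupted (Gly-X-Y)n sequence")
    triplets = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    n = len(triplets)
    observed = sum(
        1 for a, b in zip(triplets, triplets[1:]) if a[2] == y_res and b[1] == x_res
    )
    f_y = sum(t[2] == y_res for t in triplets) / n  # fraction of Y positions with y_res
    f_x = sum(t[1] == x_res for t in triplets) / n  # fraction of X positions with x_res
    expected = (n - 1) * f_y * f_x                  # n - 1 triplet junctions
    return observed, expected


# Made-up toy sequence with two KGE motifs.
observed, expected = motif_enrichment("GPKGEO" * 2 + "GPO" * 2)
print(observed, round(expected, 3))
```

An observed count well above the expected value across many sequences would be the kind of signal described in the text, although any real analysis would need a proper statistical treatment.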
Other mechanisms for collagen stabilization involving Thr residues have been reported. The cuticle collagen of the deep sea hydrothermal vent worm Riftia pachyptila has a very low proportion of imino acid residues, a high content of Thr residues in the Y position (>18%) and a thermal stability comparable to that of metazoan collagens. Biochemical analysis and partial sequencing showed that the Thr residues are glycosylated with galactose saccharides. These glycosylated Thr residues are responsible for the thermal stability of the cuticle collagen in place of Hyp. Thus, the synthetic peptide (GPT)10 does not form a triple helical structure, but modification of its Thr residues with β-D-galactose induces triple helical formation (Supplementary Table S2). Interestingly, the peptide (GOT)10 does form a stable triple helix (Tm=19°C), which becomes much more stable after Thr glycosylation (Tm=55°C). The combination of Hyp residues in the X position and glycosylated Thr residues in the Y position seems to stabilize several invertebrate cuticle collagens, and the crystal structures of the peptides (PPG)4-O(S/T/V)G-(PPG)4 (PDB codes 3ADM, 3A1H and 3A0M) suggest that this stabilization is due to van der Waals and dipole–dipole interactions between the Hyp residues in the X position and the S/T/V residues in the Y position of the preceding chain. Some prokaryotic collagenomes (Bacillus and also Clostridium and other firmicutes) also show a clear preference for Thr residues in the Y position (Figure 9A), and glycosylation of BclA, the collagen-like protein from Bacillus anthracis, has been confirmed. However, a recombinant version of BclA produced in E. coli, and thus unlikely to contain the specific glycosylation of native BclA, still formed a collagen triple helix with a thermal stability of 37°C for the collagen-like region alone.
Electrostatic interactions have been used in the design of novel heterotrimer collagens. When charged collagen peptides with repetitive sequences such as (PRG)10 or (EOG)10 are mixed together with neutral peptides like (POG)10, they can self-assemble into heterotrimers with 2:1 or 1:1:1 stoichiometries. Some of these heterotrimers have stabilities comparable to the reference (POG)10 homotrimer (Supplementary Table S2). Interstrand electrostatic interactions between the charged side chains of the individual peptides provide the mechanism of stabilization, and neither the cationic nor the anionic peptides are able to form stable homotrimer triple helices. The NMR structure of the heterotrimer triple helix [(PKG)10:(DOG)10:(POG)10] (PDB code 2KLW) shows a single-register triple helix where (PKG)10 is the trailing strand, (DOG)10 is the middle strand and (POG)10 is the leading strand, with multiple interstrand salt-bridge interactions between the Lys and Asp side chains (Figure 9D). By optimization of the amino acid sequence and the ratio of charged and Hyp residues on each chain, Hartgerink and co-workers [193–196] have produced several collagen peptide designs with better control of the resulting triple helix, favouring heterotrimer assembly and disfavouring the formation of homotrimers.
Similar principles could be applied to the design of heterotrimer collagens based on optimization of steric interactions. Thus, neither the (fP4SfP4RG)7 nor the (PPG)7 peptide can form stable homotrimer helices, yet the heterotrimer [(fP4SfP4RG)7]2:(PPG)7 is stable at room temperature (Supplementary Table S2). Molecular models of the homo- and hetero-trimers suggest that the (PPG)7 strand can relax some of the unfavourable steric interactions between fluorine atoms of the neighbouring (fP4SfP4RG)7 strands. It can be predicted that favourable combinations of steric and electrostatic interactions will lead to the development of codes for heterotrimer assemblies of collagen peptides with designed sequences [44,131].
The relatively simple 3D structure of the collagen triple helix means that any molecular recognition motif on the COL×3 domains of collagens must be essentially linear. Furthermore, such recognition motifs must occur on the framework of a highly repetitive structure imposed by the requirement of the (Gly-X-Y)n sequence. Use of collagen model peptides has been essential to investigate these questions and in recent years several crystal structure determinations have clarified the mechanisms of collagen recognition by other proteins (Supplementary Table S1). Collagen-binding sites have been mapped on the sequences of COL×3 domains and on approximate 2D and 3D models of collagen fibrils based on fibre X-ray diffraction [197–200]. These maps help in the identification of functional domains along the collagen fibril, providing insight into their accessibility to recognition by ligands and the impact of collagen mutations on major ligand-binding sites.
Binding to collagens by cell-surface receptors and extracellular matrix proteins is critical for their biological function. For example, collagens regulate cell behaviour (adhesion, migration and proliferation) through their interaction with specific cellular receptors: collagen-binding integrins, collagen-binding immune receptors and discoidin domain receptors [9,201–203]. The development of collagen ‘toolkits’ by Farndale and co-workers [204,205] has been extremely important for the discovery of the collagen sequence motifs recognized by these receptors and other collagen-binding molecules. The toolkit approach consists of synthesizing a library of collagen peptides with overlapping sequences covering the entire length of long COL×3 domains, such as those from type II and type III collagen. This library is then tested for binding against the different collagen-binding proteins or receptors and the collagen recognition motifs are identified from the sequences of the peptides that show binding [204,205]. As a case example, an early version of this method identified the GFOGER sequence as the collagen recognition motif for integrin α2β1 [206,207]. The structure of the peptide (GPO)2-GFOGER-(GPO)3 co-crystallized with the integrin α2 I-domain (PDB code 1DZI) elucidated the molecular basis of the interaction (Figure 10) [41,208]. Additional integrin-binding motifs with sequences similar to GFOGER and with different affinities and specificities have been discovered through a combination of methods and confirmed with synthetic peptides [205,209–211]. Two additional structures of integrin α1 and α2 I-domains in complex with peptides containing GLOGEN or GFOGER motifs have been determined by solution NMR and crystallography (PDB codes 2M32 and 4BJ3) [212,213]. 
In all of these structures, the Glu residue from the OGE triplet completes the co-ordination of a divalent cation in the integrin I-domain (Figure 10), a mechanism usually seen in integrin interactions with their ligands [41,208].
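The toolkit approach described above amounts to tiling a long COL×3 sequence with overlapping peptide-sized windows, so that any short recognition motif falls entirely within at least one peptide. The following is a minimal sketch of that tiling. The function name, the default window length and overlap, and the toy domain are illustrative assumptions and are not values taken from the text. Both parameters are kept as multiples of three so that every peptide stays in Gly-X-Y register.

```python
def toolkit_peptides(col_domain, guest_len=27, overlap=9):
    """Split a COL×3 sequence into overlapping segments in triplet register.

    Hypothetical helper: `guest_len` and `overlap` are illustrative defaults.
    The last segment may be shorter than `guest_len`.
    """
    if guest_len % 3 or overlap % 3 or not 0 <= overlap < guest_len:
        raise ValueError("guest_len and overlap must be multiples of 3, with overlap < guest_len")
    step = guest_len - overlap
    peptides = []
    start = 0
    while True:
        peptides.append(col_domain[start:start + guest_len])
        if start + guest_len >= len(col_domain):
            break  # the current segment already reaches the end of the domain
        start += step
    return peptides


# Made-up toy domain containing the GFOGER and GVMGFO motifs ('O' = Hyp).
domain = "GFOGERGVMGFO" + "GPO" * 16
for number, peptide in enumerate(toolkit_peptides(domain), start=1):
    print(number, peptide, "GFOGER" in peptide)
```

With an overlap of nine residues, any motif of up to nine residues that straddles a window boundary is still fully contained in one of the two neighbouring peptides.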
Structures of complexes between collagen model peptides and collagen-binding domains
Collagen toolkits helped to identify the sequence GVMGFO as an interaction motif for three different unrelated proteins: von Willebrand factor (a plasma protein involved in haemostasis), discoidin domain receptor 2 (DDR2, a receptor tyrosine kinase that regulates cell behaviour and extracellular matrix remodelling) and osteonectin (also called SPARC/BM40, an extracellular calcium-binding matrix glycoprotein associated with tissue remodelling and bone mineralization) [214–216]. Crystal structures for each interaction have been reported (PDB codes 4DMU, 2WUH and 2V53) [90,217,218]. In these crystals, the GVMGFO motifs bind to amphiphilic specificity pockets on the surface of the collagen-binding proteins (Figure 10) in addition to other areas of contact specific to each protein. It has been suggested that the convergence of binding mechanisms to the same GVMGFO motif is related to the unique hydrophobic knob created by two large hydrophobic residues separated by Gly, and to the relative rarity of such a motif in the sequences of COL×3 domains of fibrillar collagens.
Other structures of collagen peptides in complex with collagen-binding proteins have been recently determined. The collagen-specific chaperone heat-shock protein 47 (Hsp47) is essential for the correct assembly and maturation of collagen triple helices in the endoplasmic reticulum . Crystal structures of Hsp47 in complex with collagen peptides containing a PRG triplet show a 2:1 stoichiometry (Figure 10; PDB codes 4AU2, 4AU3 and 3ZHA) where two Hsp47 molecules bind in a head-to-head fashion to two sites on the homotrimer peptides. Each Hsp47 molecule makes extensive contacts with the leading or trailing strands. The Arg residues from these strands form salt bridges with Asp residues from separate Hsp47 molecules (Figure 10E) .
Structural analyses of collagen recognition by other domains include crystal and NMR structures of collagen peptides in complex with Fab fragments from arthritogenic autoantibodies (PDB codes 2Y5T and 4BKL) [220,221], CUB domains from molecules of the innate immune system (PDB codes 3POB and 4LOR) [222,223], metalloproteinase domains (PDB codes 4AUO and 2MQS) [224,225] or the collagen-binding protein CNA from Staphylococcus aureus (PDB code 2F6A) . The Fab–collagen complexes show how antibodies recognize epitopes from type II collagen in its native triple helical conformation. In these structures the peptides fill completely the binding clefts with extensive van der Waals, direct and water-mediated hydrogen bonding and salt-bridge interactions between collagen and Fab residues [220,221]. The CUB–collagen complexes show an interesting interaction where one Lys residue from the collagen peptide forms salt bridges with acidic side chains involved in the co-ordination of a Ca2+ ion, reminiscent of the interaction between collagen and α1/α2 integrin I-domains , although the Lys residue does not participate in metal co-ordination.
Most complexes between collagen triple helical peptides and their binding partners show the triple helix placed ‘on top’ of the binding domain (Figures 10A and 10B), rather than being surrounded by it. Two complexes show 2:1 stoichiometries [182,213], where one triple helical peptide forms a complex with two partners exploiting the repetition of binding sites on homotrimer collagens (Figure 10E). A special case is the structure of a collagen peptide with the collagen-binding protein CNA from Staphylococcus aureus where the triple helix is ‘hugged’ by the different subdomains of the bacterial protein . Thus, binding often involves predominantly two of the three strands engaging in van der Waals, hydrogen bonding and salt-bridge interactions with residues on the surface of the binding partner. Nevertheless, participation of key residues from different chains demonstrates the need for a triple helical structure in these interactions. Some of the collagen triple helices show appreciable bending upon complex formation, whereas others remain quite straight (Figure 10). It is possible that lattice interactions also have an effect on either bending the triple helices or keeping them straight. When the structure of the collagen peptide on its own is also available, it is possible to compare whether there are more subtle structural variations upon binding, as for instance a variation in the internal superhelical twist (Figure 4). However, the resolution of the complexes is often significantly lower than that of the isolated peptides and no general conclusions about changes in superhelical twist can be extracted.
In complexes where there is sufficient resolution to visualize the positions of bound water molecules (PDB codes 1DZI, 2WUH, 2Y5T and 3POB), the water-bonded model is preserved: water molecules bound to the amide groups of amino acids at the X position form additional hydrogen bonds to the carbonyl groups of Gly residues on the preceding chain and, if available, to hydroxy groups from Hyp residues (ζ1 or ζ1–γ1 bridges, as in Figure 5) . These water molecules and hydrogen bonding topology are preserved even in the regions of contact between the collagen triple helix and its binding partner, and their network of hydrogen bonds often extends to residues on the surface of the partner (Supplementary Figure S1). Conservation of these water molecules even upon complex formation reinforces the idea that they are an intrinsic feature of the triple helical structure of collagen . Even when water molecules are not visible due to the low resolution of the structures, it is safe to assume that the hydration positions most consistently observed in high-resolution structures may remain occupied upon complex formation. Modelling waters into these positions often suggests additional water-mediated hydrogen bonding between the collagen triple helix and its partner. Molecular dynamics simulations of water molecules around collagen–ligand complexes and their separate components have been used to identify structured water molecules that mediate interactions between the two partners . These water molecules show significant residence times and binding energies, and maintain hydrogen bonding topologies equivalent to those of the isolated collagen triple helix. These molecular dynamics simulations strongly support the notion that hydration on the surface of the triple helix also contributes to collagen molecular recognition .
C─H···O═C HYDROGEN BONDING
Conventional views on hydrogen bonding state that both the donor and acceptor atoms must be highly electronegative (O, N, F). There is evidence, however, that hydrogen bonds are possible where one or both of the atoms or groups involved have moderate or low electronegativity. These weak hydrogen bonds can occur, for instance, with carbon atoms acting as donors or aromatic rings acting as acceptors . The possibility of hydrogen bonding from Cα atoms to carbonyl groups (Cα─H···O═C) in proteins was already a subject of interest at the time of the fibre diffraction studies on collagen, and it was suggested for the PGII and collagen structures on the basis of model building [39,46,229]. The crystal structure of the Gly→Ala peptide provided experimental evidence for two repetitive patterns of Cα─H···O═C hydrogen bonding in the collagen triple helix  (Figure 11). One set of hydrogen bonds occurs between the Cα of the amino acid in the Y position and the C═O group in the X position of the following strand, acting as companions to the conventional N─H···O═C interstrand hydrogen bonding. This tandem topology is exactly the same as seen ubiquitously in the β-sheet structure . A second set occurs between the Cα of Gly and the C═O groups of Gly and the amino acid in the X position both from the preceding strand, in opposite direction to the N─H···O═C bonds. The geometry of the Gly-Cα─H···O═C interactions is indicative of a bifurcated and three-centred hydrogen bonding configuration (Figure 11B), which is a characteristic feature in crystal structures with deficiency of hydrogen bonding donor groups .
Interstrand hydrogen bonding in the collagen triple helix
Weak hydrogen bonds are not so weak after all. The strength of Cα─H···OH2 hydrogen bonds has been calculated as about half of that of the water dimer , and ab initio quantum calculations suggest that N─H···O and Cα─H···O hydrogen bonds have a similar contribution to the stability of β-sheets . Thus, given their large numbers in proteins, Cα─H···O hydrogen bonds in particular should be considered as important factors in protein structure and stability. They have also been reported to contribute to the stability of protein–protein interfaces and to macromolecular recognition [235–237]. In the collagen case, interstrand Cα─H···O═C hydrogen bonds are another characteristic feature of the collagen triple helix. They alleviate the problem of shortage of available hydrogen bonding donors (only one direct N─H···O═C hydrogen bond per triplet) and define an ensemble of stabilizing interactions at the inner core of the triple helix.
Structural analysis of Cα─H···O hydrogen bonding offers strategic paths for molecular engineering. The concept of isostructurality exploits the replacement of weaker C─H donor groups with other chemical groups with stronger hydrogen bonding strength, such as N─H. This strategy has been recently applied to collagen successfully. In an elegant study, the introduction of an aza-Gly residue in the middle of typical collagen model peptides resulted in hyperstability and faster folding of the triple helix . Molecular dynamics simulations suggest that substitution of the CH2 group with a NH group maintained the same overall hydrogen bonding, with the C═O groups of aza-Gly and Pro residues from the previous chain acting as acceptors (Figure 11C).
The ─Gly-X-Y─ repetitive pattern is not strictly conserved in the majority of collagens and collagen-like proteins. Interruptions result from substitution of the invariant Gly residues or from removing or adding additional residues in between Gly residues, thus breaking the repetitive sequence. Although these interruptions seem to be an intrinsic part of collagen structure, they are not tolerated in the long COL×3 domains of fibrillar collagens. Missense mutations in these domains where a single Gly residue is replaced with another amino acid result in connective tissue disorders such as osteogenesis imperfecta [1,239], and their clinical severity depends on the substitute residue, the local environment of the substitution and its proximity to collagen interaction sites [101,240,241]. It is thought that interruptions in the COL×3 domains of fibrillar collagens are incompatible with the correct formation of the supramacromolecular fibril assemblies, whereas in non-fibrillar collagens they may have some functional role .
Interruptions of the ─Gly-X-Y─ pattern can range from a few residues to sequences long enough to include entire domains, although in the latter case they are not referred to as collagen interruptions. The boundary between these two scenarios is necessarily blurred and could be arbitrarily placed at around 15–20 residues. A nomenclature system (Table 4) has been developed for classification purposes [242,243]. Thus, a typical collagen sequence ─Gly-X-Y-Gly-Z-Gly-X-Y-Gly─ where a single residue is missing is termed a G1G interruption. Single Gly→Z substitutions in the middle of a collagen sequence result in the altered sequence ─Gly-X-Y-Gly-X-Y-Z-X-Y-Gly-X-Y-Gly─, which is equivalent to a G5G interruption. Interruptions in heterotrimer collagens can be classified as commensurate or incommensurate, depending on the relative register of the ─Gly-X-Y─ pattern at both sides of the interruption (Figure 12). A census of collagen interruptions in the collagenomes from prokaryotes to humans has been recently reported. This census shows a predominance of shorter interruptions in general, but with significant differences in their distribution across different taxonomic groups (Figure 12).
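Reading the nomenclature above as GnG, where n is the number of residues between the two Gly residues that flank the break in the ─Gly-X-Y─ register, the classification can be sketched as a simple scan along the sequence. The code below is a heuristic illustration under that reading and is not the procedure used for the published census. The function name is hypothetical, sequences are one-letter strings with `O` for Hyp, and the register is assumed to start at the first Gly and to resume at the first Gly that is again followed by a Gly three residues later. A Gly occupying an X or Y position near an interruption could mislead this rule.

```python
def classify_interruptions(seq):
    """Return (index of the Gly preceding the break, 'GnG' label) for each interruption.

    Heuristic sketch: follows the Gly-X-Y register from the first Gly and,
    where it breaks, looks for the next Gly that restarts a Gly-X-Y-Gly step.
    """
    found = []
    i = seq.find("G")
    while i != -1 and i + 3 < len(seq):
        if seq[i + 3] == "G":
            i += 3  # register intact, move on to the next triplet
            continue
        # Register broken after the Gly at i: find where the pattern resumes.
        j = next(
            (k for k in range(i + 1, len(seq) - 3) if seq[k] == "G" and seq[k + 3] == "G"),
            -1,
        )
        if j == -1:
            break  # no resumption found before the end of the sequence
        found.append((i, f"G{j - i - 1}G"))
        i = j
    return found


# Model peptides mentioned in this review, written in one-letter code.
gly_to_ala = "POG" * 3 + "POGPOAPOG" + "POG" * 4   # Gly→Ala peptide
hyp_minus = "POG" * 3 + "POGPG" + "POG" * 5        # Hyp− peptide
gaavm = "GPO" * 4 + "GAAVMGPO" + "GPO" * 3         # GAAVM peptide

for name, seq in [("Gly→Ala", gly_to_ala), ("Hyp−", hyp_minus), ("GAAVM", gaavm)]:
    print(name, classify_interruptions(seq))
```

For these three peptides the scan reports one G5G, one G1G and one G4G interruption respectively, matching the classifications given in the text.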
Interruptions in the collagen triple helix
Collagen interruptions must impose a local discontinuity or distortion on the triple helical structure with consequences for its stability at the site of the interruption. Again, collagen model peptides are a useful vehicle to address these questions biochemically. Only two crystal structures of collagen peptides with interruptions have been determined so far. The Gly→Ala peptide (PDB codes 1CAG and 1CGD), with sequence (POG)3-POGPOAPOG-(POG)4, is an example of a G5G interruption (the POAPO segment between the flanking Gly residues). The replacement of a consensus Gly with an Ala residue results in a small bulge at the Ala site, replacement of three N─H···O═C hydrogen bonds with water-mediated hydrogen bonds (ζ1 bridges) and local untwisting at the replacement site (Figure 13) [55,66,73,244]. The single Gly→Ala replacement brings a dramatic decrease in thermal stability compared with the parent peptide (POG)10. The Hyp− peptide (PDB code 1EI8), with sequence (POG)3-POGPG-(POG)5, is an example of a G1G interruption where one residue has been removed from the consensus collagen triplet. The structure of this peptide shows local disruption of the triple helical structure and unusual Gly-N─H···O═C-Gly hydrogen bonding, with consequences for the spatial relation between the segments at either side of the interruption. The pattern of hydrogen bonding topology is resumed at both sides of the interruption, but the two sets of hydrogen bonds are out of phase [66,243,246]. From the hydrogen bonding topologies observed in these crystal structures it is possible to define ideal cases where the hydrogen bonding connectivity, including water-mediated hydrogen bonds, would be maintained with minimal disruptions to the structure. However, NMR studies combined with molecular dynamics simulations suggest a more complex scenario where other unusual hydrogen bonding arrangements may occur [241,242,247–250].
These studies also show that the conformation of the triple helix is more flexible at the sites of interruption, and even the crystal structures show evidence of higher flexibility at these specific sites (higher temperature B-factors) . Thus, it is very likely that interrupted collagens in solution have a more dynamic structure and may have local conformations, hydrogen bonding and water-mediated bridges that differ from the ones seen in only two crystal structures of interrupted collagen peptides determined so far. More 3D structures, both from crystallographic and NMR analyses, are needed to achieve a more comprehensive understanding of collagen conformation at the sites of interruption.
Structural impact of interruptions
The G1G interruption had an even greater negative impact on the stability of the Hyp− peptide than the corresponding G5G interruption in the Gly→Ala peptide . Yet, the peptide of sequence (GPO)4-GAAVMGPO-(GPO)3 (GAAVM peptide), with a G4G interruption, is stable at room temperature even though this G4G interruption is commensurate with the G1G interruption of the Hyp− peptide . Recently, two peptides with a G5G interruption where the same residues are arranged in different order, (GPO)5-GPOALOG/GLOAPOG-PO-(GPO)3, showed significant differences in stability, conformation and flexibility . It follows that not all interruption types have the same impact on collagen stability and that the actual sequences in or around the interruption are also important. Intriguingly, sequence conservation analyses on collagen interruptions show certain patterns of preferences for the residues inside or surrounding the interruptions [242,243]. Understanding the intricate relation between collagen interruptions, their sequence preferences and the impact on stability, conformation and flexibility is critical to understand their biological role, but there are also practical considerations. Interruption sites could be incorporated into engineered collagen-based polymers to introduce flexibility points or sites for interaction with specific receptors. Furthermore, it has been suggested that collagen interruptions may impose some preference on the chain register adopted by heterotrimer collagens . Some support for this hypothesis has been obtained from the recent analysis of a heterotrimer peptide modelling a commensurate interruption in type IV collagen, with two chains with a G1G interruption (GVG) and one chain with a G4G interruption (GISLKG). This study confirms the predicted register although the observed hydrogen bonding topology may be different from the one originally suggested . 
The study provides yet another example that it may be possible to control the formation of collagen heterotrimers using information embedded in the collagen sequence alone. Thus, collagen interruptions as a whole should not be simply considered as ‘obstacles’ in the sense commonly associated with the pathological mutations of fibrillar collagens, but as elements that can have functional roles in guiding collagen heterotrimer chain register or in influencing the formation of supramolecular arrangements specific to the interrupted collagens.
CONCLUDING REMARKS AND FUTURE DIRECTIONS
Our understanding of the structure and stability of the collagen triple helix has improved significantly since the early fibre diffraction models of the mid-1950s. High-resolution structural determinations and biochemical characterization of collagen model peptides have largely established the relative contributions to stability of every amino acid type, at least for homotrimer triple helices. Non-natural proline derivatives have led to the synthesis of hyperstable collagens and have improved our understanding of the fundamental mechanisms of collagen stabilization by hydroxyproline.
Collagen triple helical domains outside metazoa have amino acid compositions very different from those in vertebrates. They demonstrate that there are alternative mechanisms for building stable collagen triple helices in the absence of prolyl hydroxylation. These mechanisms have been used for the design of stable collagen mimics and have opened avenues for the engineering of heterotrimer collagens and self-assembling synthetic collagen peptides. Prokaryotic collagens will facilitate the development of bacterial expression systems for cost-effective production of stable collagen-based polymers. These recombinant collagens could be designed with sequence motifs targeting specific collagen receptors or collagen-binding proteins, with the aim of producing artificial extracellular matrices for biomedical applications.
Use of peptide libraries has enormously advanced our understanding of collagen molecular recognition. Given the largely linear structure of the collagen domain, it has become possible to map several ligand recognition sites on 2D representations of the collagen fibril. Furthermore, the atomic details of the interaction between collagen triple helices and several collagen-binding proteins or receptors are now known. This information will be invaluable to improve our understanding of collagen biology.
Challenges remain at the basic molecular level to understand the triple helical conformation at the sites of interruption, which are very varied and numerous in non-fibrillar collagens. The effect of incommensurate interruptions on molecular structure and chain register is largely unknown, and the possible impact of commensurate interruptions on heterotrimer chain selection deserves investigation. Interruptions could also be exploited as sites of local flexibility in recombinant collagen-based biomaterials.
Finally, a main challenge is the development of methods for the synthesis of hierarchical collagen assemblies that mimic collagen fibrils, networks, basement membranes or other supramolecular organizations. This will require a much deeper understanding of the mechanisms of collagen assembly and organization and is critical for future tissue regeneration interventions. Connective tissues differ in composition and organization. Thus, it is essential to elucidate the mechanisms that lead to the generation of different structures with distinct features and functional properties of each tissue. In particular, our understanding of the dynamic interactions of collagen with other biomolecules has to improve. These advances will also result in a better understanding of collagen pathologies and will make possible a new generation of biomimetic biomaterials.
CNA, collagen-binding protein from Staphylococcus aureus
COL×3, collagen triple helical domain
DDR2, discoidin domain receptor 2
GP1, GP2, triple helical steps with zero, one or two imino acids
Hsp47, heat-shock protein 47
Scl1 and Scl2, Streptococcus collagen-like protein 1 and 2
Tm, melting (denaturation) temperature