We have previously demonstrated that the genes of SCPs (semen coagulum proteins) and the WFDC (whey acidic protein four-disulfide core)-type protease inhibitor elafin are homologous in spite of lacking similarity between their protein products. This led to the discovery of a locus on human chromosome 20, encompassing genes of the SCPs, SEMG1 (semenogelin I) and SEMG2, and 14 genes containing the sequence motif that is characteristic of WFDC-type protease inhibitors. We have now identified additional genes at the locus that are similarly organized, but which give rise to proteins containing the motif of Kunitz-type protease inhibitors. Here, we discuss the evolution of genes encoding SCPs and describe mechanisms by which they and genes with Kunitz motifs might have evolved from genes with WFDC motifs. We also demonstrate an expansion of the WFDC locus by 0.6 Mb in the cow. The region, which seems to be specific to ruminants, contains several genes and pseudogenes with Kunitz motifs, one of which is the much-studied BPTI (bovine pancreatic trypsin inhibitor).
The canonical protease inhibitors are a heterogeneous collection of usually small proteins that inhibit serine proteases by what is known as the standard mechanism, i.e. the inhibitor binds to the catalytic site of the protease by mimicking a substrate. Examples of canonical inhibitors in mammals are PSTI (pancreatic secretory trypsin inhibitor), BPTI (bovine pancreatic trypsin inhibitor) and SLPI (secretory leucocyte protease inhibitor), which carry sequence motifs of Kazal, Kunitz and WFDC (whey acidic protein four-disulfide core)-type inhibitors respectively.
We have previously shown that 14 genes encompassing the sequence motif of WFDC-type protease inhibitors are clustered at a locus on the long arm of human chromosome 20. The genes were discovered as the direct result of an earlier finding that the gene of the WFDC-type protease inhibitor elafin is related to the genes of SCPs (semen coagulum proteins), in spite of lacking similarity between their protein products. In the present paper, we discuss the evolution of SCPs and their relation to WFDC-type protease inhibitors. We also argue that the genes of Kunitz-type protease inhibitors at the WFDC locus are related to those of WFDC inhibitors and that an expansion in the number of Kunitz genes has occurred in the bovine genome.
Evolution of mammalian SCPs
In many mammals, the ejaculatory mixing of epididymal fluid, containing spermatozoa, and secretions from the accessory sex glands, i.e. the seminal vesicles and the prostate, results in the formation of a semen coagulum. In rodents, the coagulum is stabilized by isopeptide cross-links, formed by a prostate-secreted transglutaminase, to generate a copulatory vaginal plug, whereas in humans the coagulum is rapidly liquefied by proteolysis of the coagulum-forming proteins by the kallikrein-related peptidase, prostate-specific antigen. The SCPs are secreted by the seminal vesicles at very high concentration, but some mammals, notably carnivores, do not have seminal vesicles and consequently cannot generate a semen coagulum.
The human SCPs, SEMG1 (semenogelin 1) and SEMG2, with molecular masses of 50 and 63 kDa, are reported to contain homologous, but poorly defined, tandem repeats of 60 amino acid residues [6,7]. The proteins are conserved in most primates, but vary considerably in size owing to varying numbers of tandem repeats and the frequent occurrence of premature stop codons [8,9].
The seminal vesicle secretions of murine rodents contain six major proteins, Svs1–Svs6, all of which, except Svs1, originate from a chromosomal locus that is homologous with the human SEMG locus. Like SEMG1 and SEMG2, the genes of Svs2–Svs6 consist of three exons, of which the first encodes the signal peptide, the second encodes most of the secreted protein, and the third encompasses 3′ non-translated nucleotides only [12–15]. The first and third exons of the human and rodent genes, containing regulatory nucleotides, are conserved, whereas the second exon is not. The exception is that the splice acceptor and donor sites of the second exon and a few flanking nucleotides are conserved in the genes of the major SCPs in humans and murine rodents [13,16]. The predominating murine SCP, Svs2, consists of imperfect tandem repeats, similar to what is found in SEMG1 and SEMG2, but the repeat unit, reported to be 13 amino acid residues, is shorter and has a different amino acid sequence. It appears that the second exon has evolved independently since the separation of the primate and murine lineages by internal expansion, initially through imperfect duplication of short repeats, followed by duplication of larger regions, which may have contained several shorter repeats.
The major SCP in the guinea pig, SVP-1, is derived from a precursor that originates from a gene with a makeup similar to that of the genes of the human and murine SCPs. The guinea pig gene has a large first intron, the 5′-end of which is similar in sequence not only to the first intron but also to several hundred base pairs of coding nucleotides in the second exon of SEMG1 and SEMG2, which indicates that a new transcript was generated in the guinea pig by de novo selection of a splice site. The same mechanism was also suggested to be behind the evolution of Svp4–Svp6.
Homology between elafin and semen coagulum genes
Following unfruitful attempts in the mid-1990s to identify genes with homology with the SEMG genes by DNA hybridization, focus was shifted to the then rapidly accumulating number of novel gene sequences. DNA sequence databases were probed at low stringency with nucleotide sequences from the first and the last exon of SCP genes, as, based on our previous findings, these should be the most conserved parts of the genes. This led to the discovery that the elafin gene (PI3) and the SCP genes are homologous. PI3 has the same organization as the SCP genes, consisting of three exons, with nucleotides encoding the signal peptide and the secreted protein on two separate exons and a third exon consisting of 3′ non-translated nucleotides only. The fraction of conserved nucleotides is approximately 65% in both exons 1 and 3, whereas no conservation was observed in exon 2, which in PI3 encodes a WFDC domain that is a high-affinity inhibitor of the serine protease elastase. Another conserved feature of the genes is that the split of the codon at the splice junction between the first and second exon is always of phase 1, which means that the first nucleotide of the codon is located on exon 1 and the two remaining nucleotides of the codon are in exon 2.
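The notion of a phase 1 junction can be made concrete with a minimal sketch. It assumes only that the number of coding nucleotides lying upstream of the junction is known; the function names and the short sequences are hypothetical illustrations and are not taken from PI3 or the SEMG genes.

```python
def intron_phase(coding_nt_before_junction: int) -> int:
    """Phase (0, 1 or 2) of the intron that follows this many coding nucleotides.

    Phase 0: the junction falls between two codons.
    Phase 1: one nucleotide of the split codon lies upstream of the junction.
    Phase 2: two nucleotides of the split codon lie upstream of the junction.
    """
    if coding_nt_before_junction < 0:
        raise ValueError("nucleotide count must be non-negative")
    return coding_nt_before_junction % 3


def split_codon(coding_exon1: str, coding_exon2: str) -> tuple[int, str, str]:
    """Return the phase and the two parts of the codon split by the junction."""
    phase = intron_phase(len(coding_exon1))
    if phase == 0:
        return phase, "", ""
    return phase, coding_exon1[-phase:], coding_exon2[: 3 - phase]


# Hypothetical toy sequences: 7 coding nucleotides on exon 1, i.e. two
# complete codons (ATG, AAG) plus the first nucleotide of a third codon.
phase, upstream_part, downstream_part = split_codon("ATGAAGG", "CTTGTAA")
print(phase, upstream_part, downstream_part)
```

With these toy inputs the split codon is GCT, with G on exon 1 and CT on exon 2, which is the phase 1 arrangement described in the text.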
The WFDC domain in the elafin precursor is preceded by a region of short N-terminal repeats. These hexapeptide repeats are reported to be transglutaminase substrates which can function to covalently link the elafin precursor, by some authors denoted trappin-2, to protein or poly-amine substrates by way of isopeptide bonds [21,22]. The repeat units vary considerably between different mammals, both in sequence and number, in a way that is very reminiscent of the species variation of repeats in SCPs [8,9]. We therefore hypothesize that the SCPs have evolved from a gene that was created by a duplication of PI3. A subsequent selective pressure on the transglutaminase substrate moiety expanded it in size, whereas at the same time the lack of selection pressure removed the WFDC domain in a purifying process (Figure 1).
Figure 1. Formation of SCPs from the transglutaminase substrate moiety of elafin
A locus of genes encoding WFDC motifs
The study that demonstrated the homology between genes of SCPs and elafin also identified SLPI as a second homologue, based on conserved nucleotide sequences in the first exon and a conserved size of the last exon. SLPI differs slightly from PI3 by having two exons encoding WFDC domains, as the result of an exon duplication, and a few coding nucleotides on the last exon. When DNA sequences from the Human Genome Project became available around the turn of the millennium, it became clear that SEMG1 and SEMG2 were flanked on the chromosome by PI3 and SLPI. Because of this, genomic DNA sequences surrounding the SEMG locus were analysed to identify novel genes using the conserved nucleotides coding for the signal peptide of SCPs. This led to the discovery of WFDC12, initially called huWAP2, but no more genes were identified by this approach. Instead, a number of genes were identified by searching the locus for WFDC motifs: in total, we identified 14 genes in a region of 0.68 Mb on human chromosome 20q12-q13. Four of the genes are clustered with the SEMG genes at a sublocus of 145 kb with a more centromeric location, which is separated by 215 kb from the remaining genes, clustered at a sublocus of 318 kb with a more telomeric location.
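As a quick consistency check, the three segment sizes quoted above can be summed and compared with the stated size of the whole region. The sketch below uses only the figures given in the text; the dictionary keys are descriptive labels introduced here for illustration.

```python
# Segment sizes in kb, as quoted in the text for human chromosome 20q12-q13.
SEGMENTS_KB = {
    "centromeric sublocus": 145,
    "intervening region": 215,
    "telomeric sublocus": 318,
}

total_kb = sum(SEGMENTS_KB.values())  # 145 + 215 + 318
total_mb = total_kb / 1000            # convert kb to Mb

print(f"Total span: {total_kb} kb, i.e. about {total_mb:.2f} Mb")
```

The sum is 678 kb, which agrees with the reported region of 0.68 Mb.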
Analysis of WFDC genes in non-primate mammals shows that the WFDC locus is conserved in murine rodents, with some exceptions where gene duplication or gene silencing has affected the number of active genes. Many of the genes have evolved rapidly since the separation of the primate and murine lineages, as indicated by relatively low sequence conservation in the range 50–70%, but there are also contrary examples, e.g. the terminal domains of human and mouse WFDC3 carry more than 80% identical amino acid residues. The conservation can also vary considerably between domains in those proteins which have multiple domains, e.g. the first domain of WFDC5 is only 63% conserved, whereas the second domain is 81% conserved. It has been suggested that the rapid evolution could be driven by putative functions in reproduction, as indicated by the expression of several WFDC genes in the male reproductive tract. A second force that has been suggested to drive the evolution of the WFDC genes is interaction with micro-organisms. This hypothesis is based on the demonstrated anti-microbial properties displayed by some WFDC proteins and the observed accelerated evolution, which then could reflect adaptation to differences in the commensal microbial flora in different animals.
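Conservation figures of the kind quoted above are typically derived from pairwise alignments. The following is a minimal sketch of one common convention, in which identity is counted over aligned columns and columns containing a gap are ignored. This convention is an assumption made here for illustration, as the text does not state how the percentages were calculated, and the short sequences are hypothetical rather than real WFDC domains.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues between two pre-aligned sequences.

    Columns in which either sequence has a gap are excluded from the count
    (an assumed convention; other definitions of identity exist).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical toy alignment: 6 of 8 ungapped columns are identical.
print(percent_identity("CPKGKCAE", "CPRGKCSE"))
```

Applied separately to each domain of a multi-domain protein, a calculation of this kind would yield per-domain values comparable in form to those given for WFDC5.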
Similar organization of genes encoding WFDC and Kunitz motifs
In the paper in which we described the WFDC locus, we noted that three WFDC genes, WFDC6, SPINLW1 and WFDC8, also carried exons that coded for Kunitz domains, i.e. they give rise to molecules with properties of both WFDC- and Kunitz-type protease inhibitors. It is not unlikely that the Kunitz motif was first picked up by a WFDC gene by exon shuffling and that subsequent gene duplications then generated the three genes, which are located in tandem on the chromosome. The strong evolutionary forces, which have resulted in accelerated evolution of some WFDC genes and the creation of the SCPs, might also have formed genes with Kunitz motifs only, as loss of selection pressure could have eliminated the WFDC motif from a gene with dual motifs.
Our search of the WFDC locus for Kunitz motifs identified six genes, three of which were the genes with dual WFDC and Kunitz motifs described above, but there were also three novel genes with Kunitz motifs only (Figure 2). The novel genes SPINT3–SPINT5 are organized similarly to the WFDC genes and give rise to secreted proteins consisting of a single Kunitz domain (A. Clauss, M. Persson, H. Lilja and Å. Lundwall, unpublished work). In a panel of cDNA from 26 different tissues, significant transcript levels of SPINT3 and SPINT4 were detected only in the epididymis (A. Clauss, M. Persson, H. Lilja and Å. Lundwall, unpublished work). None of the human tissue samples contained significant levels of SPINT5 transcript. In fact, the structure of SPINT5 was determined by sequencing transcripts of the mouse orthologue, as low transcript levels prohibited this being done with the human cDNA samples (A. Clauss, M. Persson, H. Lilja and Å. Lundwall, unpublished work).
Figure 2. The WFDC locus on human chromosome 20q12-13
There are orthologues of the novel Kunitz genes in the mouse genome, but the primary structures of their protein products are less than 60% conserved when compared with their human counterparts, which is indicative of accelerated evolution (Å. Lundwall, unpublished work). It is not obvious from the rapidly evolving sequences whether the genes have the same origin, or whether they were founded by an ancestor of the present-day genes with both WFDC and Kunitz motifs. The origin of the genes can presumably be resolved by thorough phylogenetic analysis and identification of phylogenetic intermediates, when sequences from a broader spectrum of mammalian species become available in the near future. However, the chromosomal location and organization of the genes are strong circumstantial evidence in favour of the hypothesis that the WFDC genes are related to the Kunitz genes in a similar way as they are to the SCP genes.
An expanded region of genes with Kunitz motifs in ruminants
BPTI is one of the most studied of all proteins, from both structural and mechanistic points of view [25,26]. Like SPINT3, SPINT4 and SPINT5, the protein consists of a single secreted Kunitz domain. This similarity encouraged us to analyse whether BPTI is related to the novel putative Kunitz inhibitors. A search of the bovine genome showed that BPTI is indeed located within the bovine WFDC locus, but, surprisingly, the gene is located approximately 60 kb upstream of WFDC2, in a region which in the human genome is located between the centromeric and telomeric subloci. An extended search of the bovine WFDC locus showed that it contains a region of 617 kb that is not present in the human genome (Figure 3). The region encompasses 29 exons with Kunitz motifs, 16 of which appear to be functional, whereas the remaining 13 presumably are pseudo-exons (Å. Lundwall, unpublished work). Several of the motifs are associated with genes that, like BPTI, give rise to a single secreted Kunitz domain, e.g. BSTI (bovine spleen trypsin inhibitor). At least five motifs are associated with genes that give rise to proteins that are expressed in trophoblasts. The proteins, denoted TKDP-1–TKDP-5, consist of a C-terminal Kunitz domain and an N-terminal domain of approximately 80 amino acid residues, which may be present in one or more copies [28,29]. The primary structure of the N-terminal domain shows no resemblance to other sequence motifs and seems to evolve rapidly. The function of the proteins is not known, but homologues have been detected in trophoblasts not only from cow but also from sheep [28,29].
Figure 3. A sublocus with Kunitz motifs at the bovine WFDC locus
The clustering of similarly organized WFDC genes at a locus suggests that they have evolved from an ancestor by multiple gene duplications. Most of the genes are expressed with some specificity in the epididymis, indicating that the duplications also included conserved gene regulatory sequences of importance in the male reproductive tract. The simple architecture of the genes can be regarded as an expression cassette into which sexual selection has introduced new coding information, i.e. the SCP and the Kunitz motif, while the gene regulatory sequences have been preserved. Thus future studies of genes at the WFDC locus can provide unique insights into molecular mechanisms behind sexual selection and gene evolution.
Structure and Function of Whey Acidic Protein (WAP) Four-Disulfide Core (WFDC) Proteins: An Independent Meeting held at Robinson College, Cambridge, U.K., 12–14 April 2011. Organized and Edited by Colin Bingle (Sheffield, U.K.), Judith Hall (Newcastle, U.K.), Cliff Taggart (Queen's University Belfast, U.K.) and Annapurna Vyakarnam (King's College London, U.K.).
This work was supported by grants from MAS Cancer Foundation and Alfred Österlund's Foundation.