Structure of SPH (self-incompatibility protein homologue) proteins: a widespread family of small, highly stable, secreted proteins

SPH (self-incompatibility protein homologue) proteins are a large family of small, disulfide-bonded, secreted proteins, initially found in the self-incompatibility response in the field poppy (Papaver rhoeas), but now known to be widely distributed in plants, many of which contain multiple members of this protein family. Using the Origami strain of Escherichia coli, we expressed one member of this family, SPH15 from Arabidopsis thaliana, as a folded thioredoxin fusion protein and purified it from the cytosol. The fusion protein was cleaved and characterised by analytical ultracentrifugation, circular dichroism and nuclear magnetic resonance (NMR) spectroscopy. This showed that SPH15 is monomeric and temperature stable, with a β-sandwich structure. The four strands in each sheet have the same topology as the unrelated proteins: human transthyretin, bacterial TssJ and pneumolysin, with no discernible sequence similarity. The NMR-derived structure was compared with a de novo model, made using a new deep learning algorithm based on co-evolution/correlated mutations, DeepCDPred, validating the method. The DeepCDPred de novo method and homology modelling to SPH15 were then both used to derive models of the 3D structure of the three known PrsS proteins from P. rhoeas, which have only 15-18% sequence identity to SPH15. The DeepCDPred method gave models with lower discrete optimised protein energy scores than the homology models. Three loops at one end of the poppy structures are postulated to interact with their respective pollen receptors to instigate programmed cell death in pollen tubes.


Introduction
Plants express a large number and variety of small, disulfide-bonded, secreted peptides, often acting in defence roles, or in cell signalling. The former group of peptides includes plant defensins, thionins, snakins and protease inhibitors, whereas the latter group is important for the control of cell differentiation, such as in the development of seeds, and also for sexual reproduction. While these families of proteins are widespread, their genes have often been missed using conventional genomic algorithms, due to their small open reading frames and high sequence diversity [1,2]. Newer bioinformatics algorithms have found many more of these proteins than identified originally [3,4]. While some members of these families of proteins have been expressed, the presence of multiple disulfide bonds can make them hard to overexpress and purify in soluble form in many host organisms.
One family of proteins that has yet to be structurally characterised is the SPH (self-incompatibility protein homologue) family. The S (self-incompatibility) proteins, now known as PrsS (Papaver rhoeas stigma S-determinant) proteins, were discovered in the field poppy, P. rhoeas [5]. Self-incompatibility (SI) is used to prevent self-fertilisation in plants. In poppy, this is controlled by a multi-allelic S-locus [...]
[...] 50 µg/ml ampicillin, 15 µg/ml kanamycin and 12.5 µg/ml tetracycline. Expression of the protein was induced with 1 mM IPTG at mid-log growth phase. Then, after 30 min, the temperature was reduced to 15°C and the cultures were incubated overnight. The fusion protein was extracted from the cells and purified on a Ni-NTA column using an imidazole gradient in buffer containing 50 mM sodium phosphate (pH 8.0) and 300 mM NaCl, as described recently [19]. It was dialysed into 20 mM Tris-HCl (pH 7.5), 100 mM NaCl, 2 mM CaCl₂, followed by cleavage with recombinant enterokinase, at room temperature for 2-3 days.
The cleaved proteins were dialysed into Ni-NTA buffer and re-applied onto the Ni-NTA column. The flow-through from the column, containing SPH15, was kept, and any uncleaved protein and thioredoxin-His₆ fusion eluted from the column using imidazole in Ni-NTA buffer. Alternatively, the cleaved proteins were dialysed into 20 mM sodium phosphate buffer (pH 7.6), 100 mM NaCl, 0.1 mM EDTA and loaded onto a phosphocellulose P11 column, as described previously [19]. The flow-through samples contained thioredoxin-His₆, while the SPH15 protein was eluted using a 100-600 mM NaCl gradient in the same buffer.
The SPH15 protein was concentrated by ultrafiltration with a 10 kDa cut-off filter and dialysed into the appropriate buffer for subsequent experiments. The concentration of protein was estimated from the absorbance at 280 nm, using a molar absorbance coefficient of 18,700 M⁻¹ cm⁻¹, based on its amino acid composition [20].
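The concentration estimate follows from the Beer-Lambert law, c = A280/(ε·l). The following is a minimal sketch in Python, assuming a 1 cm pathlength (not stated in the text) and an illustrative absorbance reading; the function name is hypothetical and only the coefficient of 18,700 M⁻¹ cm⁻¹ comes from the Methods.

```python
def molar_concentration(a280, epsilon=18700.0, path_cm=1.0):
    """Beer-Lambert law: c = A / (epsilon * l), returned in mol/l.

    epsilon is the molar absorbance coefficient quoted in the text;
    path_cm = 1.0 is an assumption, not a value from the study.
    """
    if epsilon <= 0 or path_cm <= 0:
        raise ValueError("epsilon and path length must be positive")
    return a280 / (epsilon * path_cm)


# Illustrative reading only (not a measured value from the study).
conc_molar = molar_concentration(0.187)
print(f"{conc_molar * 1e6:.1f} uM")
```

Because the coefficient is derived from the amino acid composition, the result is an estimate; a shorter pathlength simply scales the answer through `path_cm`.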

Analytical ultracentrifugation
AUC (analytical ultracentrifugation) experiments were done using a Beckman ProteomeLab XL-I analytical ultracentrifuge with an AN50 Titanium rotor at 20°C. Three SPH15 samples, at concentrations of 0.047, 0.028 and 0.005 mM, were used in buffer containing 25 mM sodium phosphate at pH 7.4 and 0.05 mM EDTA. These were spun at 45,000 rpm for 20 h. Absorbance measurements at 280 nm were taken at 10 min intervals and analysed using SEDFIT [21], with the partial specific volume (υ) and buffer density (ρ) set to 0.7327 cm³/g and 1.0017 g/ml, respectively.

Figure 1. (a) Consensus sequence for SPH proteins (top), sequence alignment of SPH15 with PrsS sequences (centre) and secondary structure prediction (bottom). Top: consensus sequence for 759 proteins homologous to SPH15 and to the three PrsS proteins, from HHblits [14], plotted by WebLogo [15]. Centre: sequence alignment of the SPH domain of SPH15, PrsS1, PrsS3 and PrsS8, from HHblits [14]. For PrsS1, the sequence shown starts at residue 4 and is truncated by seven residues, while for PrsS3 and PrsS8 the sequences start at residue 5 of the mature peptide and are truncated by five and six residues, respectively. Bottom: secondary structure prediction of SPH proteins from SPIDER2 [16], based on the consensus sequence above. (b) A maximum-likelihood phylogenetic tree constructed using SPH protein sequences from A. thaliana and PrsS protein sequences from P. rhoeas (red). Sequences from the BLAST analysis were aligned using MUSCLE [13]. Model selection and tree construction were performed using MEGA7 [12] and the WAG+G model of evolution. Node support values shown on the tree were obtained from 300 bootstrap replicates. Classes are based on the numbers of cysteines (Class I, dark blue and cyan: four cysteines; Class II, purple and blue: five/six cysteines; Class III, green and brown: two cysteines; Class IV, yellow: three cysteines) and the sequence of hydrophilic loop 2 (A: K/RXXD; B: heterogeneous sequence). SPH1, SPH4, SPH8 and SPH15, and the PrsS proteins are outlined with a red box.

Circular dichroism spectroscopy
The circular dichroism (CD) spectrum of SPH15 protein at 0.0095 mM, in 25 mM sodium phosphate buffer (pH 7.4), 0.05 mM EDTA, was measured over the range of 190-280 nm, in a 1 mm pathlength cuvette, using a JASCO J-810 CD spectrophotometer at room temperature. The spectrum of the buffer, taken under the same conditions, was subtracted from that of the protein. The secondary structure was analysed using the CDSSTR method [22] in Dichroweb [23].
The CD measurement of a 0.0095 mM SPH15 sample in the buffer above was monitored at 215 nm over a temperature range of 30-90°C at 0.1°C intervals, in a 2 mm cuvette. Far-UV CD spectra of the protein were taken at 5°C intervals. No correction for buffer was applied. The intensity at 215 nm (y) vs. temperature (x) was fitted to a Hill plot with a linear slope, a (eqn 1), and the midpoint (EC50) was determined by non-linear regression in SigmaPlot 12.
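Equation 1 is not reproduced in this text. As a hedged illustration only, one common way to write a Hill curve with a linear baseline slope is y = y0 + a·x + (ymax − y0)·xⁿ/(EC50ⁿ + xⁿ); the exact parameterisation used by the authors may differ. The sketch below evaluates this assumed form; every parameter value is invented for illustration.

```python
def hill_with_slope(x, y0, ymax, ec50, n, a):
    """Assumed model: linear baseline (y0 + a*x) plus a Hill sigmoid.

    At x == ec50 the sigmoidal term equals half of (ymax - y0),
    which is what makes ec50 the transition midpoint.
    """
    sigmoid = (ymax - y0) * x**n / (ec50**n + x**n)
    return y0 + a * x + sigmoid


# Hypothetical parameters: midpoint at 75 (degrees C), steep transition.
params = dict(y0=-10.0, ymax=-2.0, ec50=75.0, n=40.0, a=0.01)
for temp in (30.0, 75.0, 90.0):
    print(temp, round(hill_with_slope(temp, **params), 3))
```

In a real analysis the parameters would be obtained by non-linear regression against the measured ellipticity; this block only shows the shape of the function being fitted.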

NMR spectroscopy
The NMR spectra of ¹⁵N-, ¹³C-labelled SPH15 were taken and assigned using triple resonance methods, as described recently [19]. Distance restraints were derived from 3D NOESY-¹⁵N HSQC in H₂O and NOESY-¹³C HSQC in D₂O, both with 100 ms mixing time, acquired on a Varian 800 MHz spectrometer with a room temperature TXI probe using a ∼1 mM protein sample in 20 mM sodium phosphate buffer (pH 5.2), 50 mM NaCl, 0.1 mM EDTA. An additional, higher resolution, NOESY-¹³C HSQC spectrum was acquired using non-linear sampling, on a Bruker 600 MHz spectrometer, with a TXI cryoprobe.
Backbone φ and ψ torsion angles were estimated from Cα, Cβ, C′, N and Hα chemical shifts using the program DANGLE [24]. Hydrogen-bond restraints were based on slowly exchanging amides identified in ¹H-¹⁵N HSQC spectra taken at a series of intervals up to several weeks after lyophilisation of the protein and dissolving it in D₂O. Once the basic fold of the protein had been determined, disulfide bond restraints between the two pairs of cysteine residues were added to the structural calculations.
Structures were calculated using the program ARIA 2.3 [25] interfaced to CNS 1.2 [26]. A total of 200 structures were calculated using ambiguous and unambiguous NOE assignments. Assignment ambiguity was reduced during successive iterations in ARIA. Automatic NOE assignments from ARIA were checked manually and the protocol was repeated until there were no validation errors. Finally, the 20 structures with the lowest energies were refined in water. The family of structures was validated using the program PROCHECK [27].

Structure prediction of PrsS proteins
DeepCDPred [28] was used to generate structures for SPH15, PrsS1, PrsS3 and PrsS8. DeepCDPred uses deep learning to predict contacts and distances between residues, based on many inputs including amino acid profile, secondary structure prediction from SPIDER2 [16] and amino acid co-evolution couplings. The couplings are predicted from (i) mutual information [29], (ii) mean-field direct coupling analysis [30], (iii) QUIC [31], (iv) pseudo-likelihood direct coupling analysis [32] and (v) statistical potential [33]. In addition, the program uses the number of amino acids in the target protein, the number of homologous sequences in the MSA built by HHblits [14] and an estimate of the number of non-redundant sequences in the alignment. DeepCDPred [28] also has a β-sheet prediction algorithm that provides hydrogen-bonding restraints between strands. These restraints, together with restraints to enforce the secondary structure prediction from SPIDER2 [16], are fed into the protein structure modelling program, AbinitioRelax, which is from the Rosetta suite [34]. Three-residue and nine-residue structure fragments required by AbinitioRelax were generated by the Perl script make_fragments.pl from the Rosetta suite. One hundred candidate structures were generated using the protocol. The structure with the lowest Rosetta energy score was chosen as the final model and the TM (template modelling) score [35] estimated, using a predictor that is part of DeepCDPred. The best de novo (DeepCDPred) model for each protein was further refined using ReFOLD [36].
MODELLER [37] was used to model the poppy proteins, based on the structure of SPH15, with two different methods of sequence alignment. In the first method, HHblits [14] was used to align each of the proteins to SPH15, based on the sequence alone. In the second method, a structural alignment was determined from the structural predictions from DeepCDPred for each protein with the NMR structure of SPH15, using TM-align [35]. For this second method, MODELLER was used with the SPH15 template either with or without the predicted contact restraints from DeepCDPred. For each of the three proteins, 500 models were calculated by each of these three comparative modelling methods and their DOPE (discrete optimised protein energy) scores [38] were calculated and compared with those from the de novo DeepCDPred structure.
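Of the co-evolution measures listed, mutual information is the simplest to illustrate. The sketch below computes the mutual information between two alignment columns from observed residue frequencies. It is a textbook calculation on toy columns, not the DeepCDPred implementation, and it omits the corrections (sequence weighting, pseudocounts, background subtraction) that practical pipelines usually apply.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    Each column is a string with one residue per sequence in the MSA.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        joint = count / n
        marginal_product = (counts_i[a] / n) * (counts_j[b] / n)
        mi += joint * math.log2(joint / marginal_product)
    return mi


# Toy columns (hypothetical): two positions across six sequences.
covarying = mutual_information("AAVVLL", "DDEEKK")
invariant = mutual_information("AAVVLL", "DDDDDD")
print(covarying, invariant)
```

Perfectly co-varying columns give a high score, while a column that never changes carries no information about its partner, which is why fully conserved positions yield no coupling signal.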

Results and discussion
The sequence alignment of the SPH domain of SPH15 from A. thaliana and the three PrsS proteins from P. rhoeas is shown in Figure 1a, together with the consensus from 759 sequences from HHblits [14] and the secondary structure prediction, showing nine β-strands, from SPIDER2 [16]. The original aim was to determine the structure of one of the PrsS proteins; however, expression of the mature PrsS1 and PrsS3 proteins in E. coli gave inclusion bodies, which, after solubilisation, gave very low yields of protein. Our attention, therefore, turned to the SPH proteins from A. thaliana.

Phylogenetic analysis of SPH proteins in Arabidopsis thaliana
An iterative BLAST search of A. thaliana, using the PrsS1 amino acid sequence as the initial search sequence, followed by successive BLAST searches with identified SPH amino acid sequences, yielded 92 members of the SPH protein family, more than in the Pfam database [8] but fewer than in SPADA [3]. The 92 proteins were grouped into four classes, based on the number of cysteines. Maximum-likelihood phylogenetic analysis of these sequences with the PrsS proteins of P. rhoeas, using MEGA7 [12], shows that all of these proteins are evolutionarily related and can be resolved into the same four classes (Figure 1b). Attempts to include other potential SPH proteins identified using SPADA [3] gave trees that were poorly resolved, with very poor bootstrap results, that displaced members of the original 92 identified proteins that share an obvious identity. This suggests that these additional proteins may have evolved independently; alternatively, the sequence has diverged so far as to render phylogenetic analysis for the entire SPH cohort including these additional proteins impossible. In contrast, P. rhoeas PrsS proteins and homologues from Selaginella moellendorffii and Physcomitrella patens identified using BLAST do not disrupt the phylogenetic tree generation (data not shown). The PrsS proteins of P. rhoeas group within the Class I subfamily (Figure 1b), containing four cysteines in strands 2, 5 and 7, and in loop 6 (Figure 1a). Class II proteins have either two additional cysteines, in strands 8 and 9, or are truncated in strand 9 and so have only five cysteines. Class III proteins have only two cysteines in strands 2 and 5, whereas Class IV proteins have three cysteines on strands 2 and 5 and in loop 6. These classes fit the phylogenetic analysis well and it is evident from the analysis that Class II proteins are derived from those in Class I, whereas Class IV proteins (including SPH15) cluster within the Class III proteins. The minor classification (IA, IB, etc.) refers to the sequence of hydrophilic loop 2 where those in subclass A have the motif K/RXXD, whereas those in subclass B are heterogeneous.

SPH15 protein expression and purification
SPH1 in Class IA, SPH4 in Class IIA, SPH8 in Class IIIB and SPH15 in Class IV were selected for initial trials for expression in E. coli as they are relatively distant from each other on phylogenetic analyses and represent the different patterns of conserved cysteine residues observed in sequence alignments (Figure 1a). Initial attempts to express the mature proteins in E. coli using the methods used for the poppy proteins also led to inclusion bodies, but SPH15 gave some soluble protein.
Since the cause of the insolubility might be the lack of formation of disulfide bonds in normal strains of E. coli, expression trials moved to expression from pET32b in the AD494 strain of E. coli, which contains a knockout mutation of the thioredoxin reductase (trxB) gene, thereby enhancing disulfide bond formation. The section of the SPH15 gene that encodes the predicted mature protein was cloned between the NcoI and XhoI sites of pET32b(+), thus expressing a thioredoxin-His₆-SPH15 fusion, with an enterokinase cleavage site between the His tag and SPH15. This approach, with both overexpression of thioredoxin and the trxB mutation, greatly improved the yield of soluble SPH15 protein. The yield was increased further by using the Origami strain of E. coli, which contains mutations in both thioredoxin reductase and glutathione reductase [18].
The fusion protein was purified using a nickel affinity column for the His₆ tag (Figure 2a, i), which could also be used for separation of the two proteins produced after cleavage with enterokinase. For large volumes, after enterokinase cleavage, the proteins were separated on a phosphocellulose cation exchange column. SPH15 contains a large number of basic residues and has a predicted pI of 9.7, and so binds to the column at neutral pH, while His-tagged thioredoxin has a predicted pI of 4.7 and flows through the column without binding (Figure 2a, ii). The SPH15 produced has no signal sequence but contains three additional amino acids, AMG, at its N-terminus.

SPH15 protein characterisation and 3D structure
The molecular mass and sedimentation coefficient of SPH15 were determined by AUC (Figure 2b). The sedimentation coefficient calculated, 1.68 S, is slightly smaller than expected for a spherical protein of a similar mass, indicating a slightly extended shape (f/f0 = 1.26), and gave an estimated molecular mass of 14.7 kDa, compared with the calculated mass of 13.5 kDa. The sedimentation coefficient was constant for the three concentrations measured, indicating that the protein does not aggregate in the 5-47 µM concentration range and is monomeric.
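The relation between sedimentation coefficient, mass and frictional ratio can be checked with the Svedberg equation, s = M(1 − υρ)/(N_A·f), where f = (f/f0)·6πηR0 and R0 is the radius of the anhydrous sphere of the same mass. The sketch below is a back-of-the-envelope check, not the SEDFIT analysis. The partial specific volume and density are taken from the Methods in the units cm³/g and g/ml, while the viscosity of 0.01002 poise (water at 20°C) is an assumption.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol


def sedimentation_coefficient(mass, vbar, rho, eta, f_ratio=1.0):
    """Svedberg equation in CGS units, returned in Svedberg units (1e-13 s).

    mass: g/mol; vbar: cm^3/g; rho: g/ml; eta: poise; f_ratio: f/f0.
    """
    # Radius (cm) of the anhydrous sphere with the same mass and volume.
    r0 = (3.0 * mass * vbar / (4.0 * math.pi * AVOGADRO)) ** (1.0 / 3.0)
    # Stokes frictional coefficient of that sphere (g/s).
    f0 = 6.0 * math.pi * eta * r0
    s_seconds = mass * (1.0 - vbar * rho) / (AVOGADRO * f_ratio * f0)
    return s_seconds / 1e-13


# eta = 0.01002 poise is assumed (pure water at 20 degrees C).
sphere = sedimentation_coefficient(14700, 0.7327, 1.0017, 0.01002)
with_shape = sedimentation_coefficient(14700, 0.7327, 1.0017, 0.01002, 1.26)
print(round(sphere, 2), round(with_shape, 2))
```

Under these assumptions, combining the estimated mass of 14.7 kDa with f/f0 = 1.26 gives a value close to the reported 1.68 S, which shows that the three quoted numbers are mutually consistent.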
The CD spectrum of SPH15 in the far-UV shows a minimum at 217.5 nm and a maximum at 198 nm, consistent with it having β-sheet secondary structure (Figure 2c). Analysis of the secondary structure using the CDSSTR method [22] in Dichroweb [23] indicates that it contains 47% β-strand, 24% turn and 28% disordered structure. CD spectra were measured at a series of temperatures. The signal at 215 nm, indicating the extent of secondary structure, showed an initial slow decrease in ellipticity, and then rapid denaturation over ∼10°C, with a midpoint at 75°C (Figure 2d), showing that the protein unfolds co-operatively and is thermally very stable.
The ¹H-¹⁵N HSQC spectrum of SPH15 (Figure 3a) is well resolved, with a single peak for each amino acid (105 peaks out of 108 expected), showing that the protein is folded and has a single conformation. The NMR spectrum of the protein was assigned by standard triple resonance methods [19]. The secondary ¹³C, ¹⁵N and ¹H chemical shifts [39] confirmed that the protein has mainly β-sheet conformation, as predicted [1] and shown by the CD spectrum. The structure was determined from interproton distance restraints estimated from NOESY spectra, complemented by analysis of the H-D exchange rates of the backbone NH groups. Many of the NH signals remain visible for several days after dissolving the protein in D₂O, showing the high persistence of the NH bonds (Figure 3b). The disulfide-bonding pattern predicted previously from sequence alignments [1] was confirmed by NOEs between the Cβ protons of the bonded cysteine residues. Figure 4a shows an overlay of the backbone structures of the 20 lowest energy structures calculated from the NMR data, with a ribbon diagram of the lowest energy structure in two orientations in Figure 4b. The structure is generally well determined, with the backbone RMSD of the top 20 structures from the mean structure of 1.6 Å for the β-strands; however, some of the loops are less well-defined, leading to an overall RMSD of the backbone of 2.3 Å (Table 1). In particular, residues 48-51 and 75-80, at the tips of the longer loops, loop 4 and loop 6, respectively, have backbone RMSDs >4 Å, possibly due to flexibility.
SPH15 contains eight β-strands arranged in two, four-stranded sheets, as a sandwich. The strands are mainly in agreement with the secondary structure prediction from the MSA, shown in Figure 1a. The last strand predicted in Figure 1 contains only three residues, with a central proline, and is irregular, but shown in some of the models. In the 3D structure, two of the strands in one sheet, namely strands 1 and 7, are parallel, while the other strands are all antiparallel (Figure 4b,c). There is a large hydrophobic core containing residues from each strand (Figure 4c), many of which are conserved as hydrophobic residues in the consensus sequence (Figure 1a). In SPH15, Cys 1, at the N-terminus, is bonded to Cys 81, in loop 6, keeping strand 1 close to strand 7, while Cys 21, on strand 2, is bonded to Cys 55, on strand 5, these being neighbouring strands within the same sheet (Figure 4c). While the protein is overall highly basic, with 21 positively charged residues and 12 acidic residues, the charges are distributed over the whole of the surface of the protein (Figure 4d). The overall shape of SPH15 is similar to that of protein domains with a Greek key fold, such as the immunoglobulin constant domain; however, the topology of the strands differs as, in the latter, all strands are antiparallel, and the single disulfide bond is between the two sheets.

Comparison of SPH15 protein structure with other proteins of known structure
Comparison of the structure of SPH15 with other proteins in the protein structure database using DALI [40] gave, as hits, many proteins with a Greek key topology. The top hits containing domains with the same topology as SPH15 were the membrane-binding domains (domain 4) of the cytotoxic proteins pneumolysin from Streptococcus pneumoniae [41,42] and perfringolysin from Clostridium perfringens [43], with Z-scores >6.5 and small RMSDs (2.8 Å) to SPH15, despite negligible sequence identity (∼8-12%) (Figure 5a). A bacterial type VI secretion protein, TssJ, from a pathogenic strain of E. coli [44] was also identified as containing a domain with the same topology, with a Z-score of 6.4 and an RMSD of 3.1 Å (Figure 5b). One entire protein identified with the same topology as SPH15 was human transthyretin [45], with a Z-score of ∼6 and RMSD ∼3.5 Å (Figure 5c). Again, neither TssJ nor transthyretin shows any discernible sequence similarity to SPH15. The proteins identified have very different functions and oligomeric states, and interact with their partners using different surfaces in each case. Perfringolysin and pneumolysin belong to a group of toxins from Gram-positive organisms that kill cells by forming large circular aggregates that make holes in eukaryotic membranes [46,47] (Figure 5a). TssJ is an outer membrane lipoprotein that forms part of the outer shell of the type VI secretion system from E. coli and is hence also an important virulence factor [44]. The syringe-like type VI secretion complex contains 10 copies of TssJ and its partners [48]. In contrast, transthyretin is a homotetramer that is involved in the transport of thyroid hormone and, separately, using a different interface, transport of retinol-binding protein bound to retinoic acid [49,50] (Figure 5c). Given the differences in oligomeric state and interaction surfaces in each case, little can be deduced from these proteins about how S-proteins may function.

Figure 5. (a) Pneumolysin domain 4, from PDB 5CR8 [42], perfringolysin full monomer, from 1PFO [46], pneumolysin prepore complex, from 2BK2 (from cryoEM) [47]. (b) TssJ from PDB 4Y7O [48] and the monomer of the TssJ-TssM complex from PDB 3RX9 [44]. The dotted line connects residues in TssJ that were not observed in the structure of the complex. (c) Transthyretin monomer and the transthyretin-thyroxine dimer interface, both from PDB 2ROX [49], and the transthyretin homotetramer complexed with retinol-binding protein from PDB 1RLB [50]. Each monomer/domain is shown as a ribbon structure, coloured in rainbow colours from blue to red, from N- to C-terminus, with strand 1 in a similar orientation to SPH15 in Figure 4a. In the complexes, one monomer is coloured in rainbow colours and, where relevant, other monomers of the same protein/domain are coloured red; the partner proteins are coloured blue.

De novo structure prediction of SPH15
The only proteins in the SPH family with known functions and interaction partners are three poppy PrsS proteins, PrsS1, PrsS3 and PrsS8, involved in the SI response. The corresponding receptor proteins, PrpS1, PrpS3 and PrpS8 (P. rhoeas, pollen S-determinant) [51], are ∼20 kDa transmembrane proteins. The three secreted PrsS proteins show ∼60% sequence identity and ∼75% sequence similarity to each other, with a similar level of sequence identity between the receptor proteins, across the length of the sequences. While SPH15 and the PrsS proteins clearly belong to the same protein family, they are in different classes within the family and there is only 15-18% sequence identity between them, rather small for reliable homology modelling (Figure 1a,b). However, the relatively large number of proteins in this family enabled us to test a newly developed de novo structure prediction method, DeepCDPred [28], based on co-evolution/correlated mutation, to model SPH15. We then used this method to predict structures of the three poppy proteins.
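Percentage identity figures such as those quoted above are computed from a pairwise alignment. The sketch below counts identical residues over the columns where both sequences have a residue. This is only one of several conventions (others divide by the alignment length or by the shorter sequence), and the text does not state which was used. The sequences are toy strings, not SPH15 or PrsS.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # gap columns are excluded from the denominator
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


# Toy alignment (hypothetical): 8 comparable columns, 6 identical.
print(percent_identity("ACDE-FGHIK", "ACNEQFG-LK"))
```

At identities in the 15-18% range, small changes in the alignment or in the choice of denominator can shift the value noticeably, which is one reason such alignments are a weak basis for homology modelling.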
The de novo prediction of the structure of SPH15 using the DeepCDPred method was made solely from the MSA of the protein and homologous sequences, independently of the NMR measurements. The method uses deep learning to predict contacts and distances between residues and then produces a series of structural models. Figure 6a shows the de novo structure of SPH15 with the lowest DOPE score [38], in a similar orientation to that of the NMR-derived structure in Figure 4a. The two structures are very similar with a TM score of 0.62 and an RMSD of 3.34 Å for all Cα atoms and 2.91 Å for the Cα atoms of residues in the β-strands. One major difference between the two structures is that in the de novo structure the β-strands are slightly longer, with more regular geometry, hence showing the short ninth strand, antiparallel to strand 8. The de novo structure also shows one turn of an α helix in loop 4, not seen in the NMR measurements.
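For reference, the TM score is conventionally defined as (1/L)·Σ 1/(1 + (d_i/d0)²), with d0 = 1.24·(L − 15)^(1/3) − 1.8 Å, where L is the length of the reference protein and d_i are the Cα-Cα distances of aligned residue pairs. The sketch below evaluates this sum for one fixed superposition. Programs such as TM-align additionally search for the superposition that maximises the score, which is not attempted here, and the distances used are invented.

```python
def tm_score(distances, target_length):
    """TM score for a fixed superposition.

    distances: C-alpha distances (Angstrom) of the aligned residue pairs.
    target_length: number of residues in the reference structure.
    """
    if target_length <= 15:
        raise ValueError("target too short for the d0 formula")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("target too short for the d0 formula")
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Hypothetical 100-residue target: 80 well-superposed pairs, 20 poor ones.
example = tm_score([1.0] * 80 + [8.0] * 20, 100)
print(round(example, 2))
```

Because each pair contributes at most 1/L, a few badly placed loop residues lower the score only slightly, whereas the same residues can dominate an RMSD. This is why a fold can score above 0.5 while the all-atom RMSD exceeds 3 Å.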
To examine the precision of the de novo structure prediction, the five models with the lowest Rosetta energies were aligned and the differences between their Cα positions were determined and compared with the restraint strength at each position derived from the co-evolution map (Figure 6b). The differences in structures are below 3 Å, apart from the N-terminal three amino acids, and the regions 75-85 and 100-104, where there are no restraints. For the top 10 predicted SPH15 structures, the pairwise RMSD of the Cα atoms for residues in the β-strands is 3.1 Å, and it is 5.3 Å for all Cα atoms (Table 2), somewhat larger than the RMSDs between the top 20 NMR structures (Table 1). The lack of restraints between residues 75-85 could arise if this region of the protein was conserved across homologues, or if it were flexible so that amino acid substitutions in one part of the loop do not allow prediction of substitutions in other parts of the loop. The latter is consistent with the NMR structure, where the residues at the tip of loop 6 are poorly constrained.

Figure 6. (a) Ribbon representation of the de novo structure prediction of SPH15 from DeepCDPred with the lowest DOPE score in the same orientation as that of the NMR-derived structure in Figure 4a. The structure is coloured in rainbow colours, from blue to red, from N- to C-terminus. (b) Differences in Cα positions of the top five models of SPH15 from DeepCDPred, compared with the restraint strength. The five de novo SPH15 structures with the lowest energies were aligned and the average distance between the corresponding Cα atoms was calculated (black circles, solid lines). The restraint strength is the sum of the predictions of contacts at each position (white circles, dashed lines). The regions with more restraints are more converged (lower average distance).

Structure prediction of the poppy SI proteins, PrsS1, PrsS3 and PrsS8
The similarity of the fold of the de novo prediction to the NMR-determined structure, and the ability to estimate errors in the structure predictions, gave us confidence to use this method to model each of the related proteins, PrsS1, PrsS3 and PrsS8, separately. To improve the structural predictions, 3-7 amino acids at the N- and C-termini of each of the proteins were not included in the calculations. These amino acids give no restraints and are likely to be highly flexible. This left the core of ∼110 amino acids for each protein shown in Figure 1a, for which the structure was calculated, without disulfide bond constraints. As each protein has a different length and sequence, each has a slightly different set of homologous proteins within the SPH family. For each, over 900 homologous sequences were found using HHblits [14], with 759 sequences found in common in homology searches of the three proteins and SPH15. The different set of homologous proteins gives a slightly different co-evolution contact map for each PrsS protein, but, given the large number of common sequences, the contact maps and the overall predicted secondary structures are similar.
As a comparison with the DeepCDPred method, we also used MODELLER [37] to model the poppy proteins based on the NMR structure of SPH15. Two different methods of sequence alignment to the NMR structure of SPH15 were used: either one based solely on the HHblits sequence alignment [14], as in Figure 1a, or an alignment using the structure prediction of each of the proteins calculated from DeepCDPred, with the program TM-align [35]. The TM alignment was then used in MODELLER with or without the predicted contact restraints from DeepCDPred. For each of the three PrsS proteins, 500 models were calculated in MODELLER by each of these three comparative modelling methods. The DOPE score [38] of each model was calculated and compared with the DOPE scores of 100 models from the de novo DeepCDPred calculations (Figure 7). For each of the three comparative modelling methods, MODELLER gave similar structures with a narrow range of DOPE scores. In contrast, the de novo DeepCDPred models had a much wider range of DOPE scores, and a few of the structures had poor DOPE scores, but the mean DOPE scores were all more negative (better) than those from MODELLER. In all three proteins, the lowest DOPE scores for the DeepCDPred models were very much lower than those of the lowest energy models from MODELLER. For PrsS1 and PrsS8, the entire interquartile range of scores for the DeepCDPred models was lower than the lower quartile of all the other methods, suggesting more stable structures. The de novo DeepCDPred model with the lowest DOPE score was taken to be the most representative structure for each protein and is shown in Figure 8.
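The interquartile comparison described above can be stated precisely: the upper quartile of one score distribution lies below the lower quartile of the other. The sketch below illustrates this test with Python's standard library. The score lists are invented to mimic a wide de novo distribution and a narrow comparative-modelling distribution and are not data from the study.

```python
import statistics


def quartiles(scores):
    """Return (lower quartile, median, upper quartile)."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    return q1, q2, q3


def iqr_entirely_below(scores_a, scores_b):
    """True if the upper quartile of A is below the lower quartile of B."""
    return quartiles(scores_a)[2] < quartiles(scores_b)[0]


# Hypothetical DOPE-like scores (more negative is better).
de_novo = [-9800, -9500, -9300, -9100, -8900, -8700, -8500, -6500]
comparative = [-8200, -8150, -8100, -8080, -8050, -8000, -7950, -7900]

print(iqr_entirely_below(de_novo, comparative))
print(min(de_novo) < min(comparative))
```

The toy de novo set contains one poor outlier (-6500), showing how a distribution can have a wider range and still have its interquartile range entirely below that of the narrower set.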
The topology of all three PrsS proteins predicted from DeepCDPred (Figure 8) is the same as that of SPH15; the hydrophobic core is maintained, but the exact lengths and regularity of the β-strands and the orientation of the loops vary. In particular, as for the SPH15 structures, loop 6, between strands 6 and 7, is poorly constrained and varies in the models, as does the length and orientation of the C-terminal region, which has a longer final β-strand, strand 9, particularly in PrsS1, than in SPH15. Table 2 shows the pairwise distance distribution Table 2 RMSD between pairs of top 10 de novo calculated structures of SPH15, PrsS1, PrsS3 and PrsS8 Pairwise RMSDs between top 10 de novo DeepCDPred models, with the lowest DOPE scores. Lower half, bold, RMSD between all Cα atoms. Upper half, italics, RMSD for residues in β-strands only (as defined by the multiple sequence alignment in Figure 1). The diagonal (grey boxes) shows the RMSD for 10 models within the same protein, whereas off-diagonal numbers show the RMSDs between 10 models of each protein. 5.9 ± 0.8 Å 6.0 ± 1.3 Å 6.2 ± 1.5 Å 3.5 ± 0.9 Å 4.9 ± 1.1 Å between the models with the 10 lowest DOPE scores, both within predictions of the same protein and comparing the models of the different poppy proteins and SPH15. The average pairwise RMSDs between models of the same protein (4.9-6.6 Å for all atoms) and between models of the different proteins (5.9-6.6 Å) are similar. This suggests that the four proteins are very similar in structure to each other, within error, despite the disulfide-bonding pattern in SPH15 (Class IV) being different from that in the poppy proteins (Class I) ( Figure 1). One of the disulfide bonds in SPH15, that between Cys 22 in strand 2 and Cys56 in strand 5, is conserved throughout all the SPH proteins and the two cysteines are close in all four structures. 
Cys82 in loop 6 is conserved in all four proteins, but, in the poppy proteins and other Class I proteins, it is disulfide bonded to a residue in strand 7, rather than to Cys1, as found in SPH15. These disulfide bonds were added to the de novo calculations of the PrsS proteins as further constraints and shown to be compatible with the overall structures. The addition of the disulfide bonds had no effect on the DOPE score of the best model for PrsS1 and PrsS8, but gave a small 4% decrease in the DOPE score for PrsS3, with an equivalent improvement in the DOPE score for SPH15.
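The within-protein and between-protein RMSD averages reported in Table 2 can be illustrated with a minimal sketch. It assumes that each model is a list of Cα coordinates of equal length, that residues are already matched one to one, and that the models are already superposed; the superposition procedure is not described in this section, so that step is left out. All names and the toy coordinates are hypothetical.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean, stdev


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.
    Assumes the two models are already superposed."""
    if len(a) != len(b):
        raise ValueError("models must contain the same number of atoms")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return sqrt(squared / len(a))


def within_ensemble(models):
    """Mean and SD of the RMSD over all unordered pairs in one ensemble."""
    values = [rmsd(a, b) for a, b in combinations(models, 2)]
    return mean(values), stdev(values)


def between_ensembles(models_a, models_b):
    """Mean and SD of the RMSD over all pairs drawn from two ensembles."""
    values = [rmsd(a, b) for a, b in product(models_a, models_b)]
    return mean(values), stdev(values)


# Toy two-atom "models" for illustration only (not real structures).
m1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
m2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
m3 = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]

print(within_ensemble([m1, m2, m3]))
print(between_ensembles([m1], [m2, m3]))
```

Restricting the coordinate lists to residues in β-strands before calling these functions would give the strand-only comparison described for the upper half of Table 2.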

Interactions of PrsS proteins with their receptors
In a previous study [52], the predicted surface loops of PrsS1 were mutated at one or more sites, the proteins were purified and refolded from inclusion bodies, and their ability to inhibit pollen germination was assayed. The mutations with the largest effect were at D79, at the tip of loop 6 (Figure 8a,b): mutation of this residue to either G or H (as found in PrsS3 and PrsS8, respectively) totally abrogated activity. Mutation of either of the other aspartate residues in the loop (D77, D78), which are conserved in the other alleles, to His also removed PrsS1 activity, but the double mutant D77E/D78E was as active as wild type, suggesting that the negative charge on this loop is important. The only other single mutation tested that showed any effect was D27H, in loop 2, at a residue conserved in all three poppy sequences. This mutant showed only 78% of wild-type activity when assayed at low protein concentration (25 mg/ml), but was as active as wild type at 75 mg/ml. Loop 2 is on the same side of the protein as loop 6 and both contain several aspartate and other charged residues at their tips in all alleles (Figures 1a and 8a,b). These loops are separated by loop 4, which contains mainly charged residues, but with two central phenylalanine residues at its tip in PrsS1 (sequence REDFFH). Similar hydrophobic residues are found at the tip of this loop in the other alleles (Figure 8). The mutation F49M in PrsS1, to the residue found in PrsS3, had little effect on activity, suggesting that the aromatic ring here is not essential for activity; however, both F and M are hydrophobic, so the importance of hydrophobicity for activity was not tested. A single mutation in loop 1 (N15H) of PrsS1, two double conservative mutations in loop 3 (T36S/S37D and H39Q/D40E) and a single mutation in loop 8 (D99H) showed no effect.
A single conservative mutation in loop 5, K63R, also showed no effect, but a triple mutant K63R/E64K/T65G showed only 58% of the activity of wild-type protein at 25 mg/ml, less than the single mutant in loop 2, D27H, but much higher than D79G.
However, our structural model shows that T65 is at the beginning of strand 6 of PrsS1, rather than in a loop, so this triple mutation may affect protein folding rather than simply receptor binding.
Overall, these mutagenesis studies, together with the structural models, suggest that loops 1, 3, 5 and 8 have little effect on the activity of PrsS1, while loop 6 at the tip of the protein interacts with the receptor, probably alongside loops 2 and 4 (Figure 8b). Given its length and flexibility, loop 6 is the loop most likely to interact with the receptor and confer specificity. It is also highly divergent in sequence across the SPH protein family. Loops 2 and 6 are largely charged, and aspartate residues in these loops have been shown to affect activity. In contrast, the tip of loop 4 contains two hydrophobic Phe residues, with charged groups on either side. The exposure of such hydrophobic residues on the outside of proteins is unusual, and they are good candidates for intermolecular interactions with hydrophobic partners.
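The mutagenesis results summarised above can be tabulated by loop to make the reasoning explicit. The following is a minimal sketch: the mutants and loop assignments are those given in the text, whereas the four effect categories, the severity ranking and the function name are simplifications introduced here for illustration.

```python
# Effect categories (an assumed simplification), ordered from weakest
# to strongest loss of activity.
SEVERITY = {"none": 0, "little": 1, "reduced": 2, "abolished": 3}

# (mutant, loop, effect) for PrsS1, as described in the text [52].
MUTANTS = [
    ("N15H", 1, "none"),
    ("D27H", 2, "reduced"),            # 78% of wild type at the lower concentration
    ("T36S/S37D", 3, "none"),
    ("H39Q/D40E", 3, "none"),
    ("F49M", 4, "little"),
    ("K63R", 5, "none"),
    ("K63R/E64K/T65G", 5, "reduced"),  # 58%; T65 is in strand 6 in the model
    ("D77H", 6, "abolished"),
    ("D78H", 6, "abolished"),
    ("D77E/D78E", 6, "none"),          # charge-conserving, as active as wild type
    ("D79G", 6, "abolished"),
    ("D79H", 6, "abolished"),
    ("D99H", 8, "none"),
]


def strongest_effect_by_loop(mutants):
    """Return, for each loop, the most severe effect seen for any mutant."""
    strongest = {}
    for _name, loop, effect in mutants:
        current = strongest.get(loop, "none")
        if SEVERITY[effect] >= SEVERITY[current]:
            strongest[loop] = effect
    return strongest


for loop, effect in sorted(strongest_effect_by_loop(MUTANTS).items()):
    print(f"loop {loop}: {effect}")
```

Note that the loop 5 entry is flagged as reduced only because of the triple mutant, which may reflect a folding defect rather than a loss of receptor binding, as discussed above.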
The receptor proteins PrpS1, S3 and S8 are highly hydrophobic, 20 kDa proteins, that have been shown to localise to the plasma membrane of the pollen tube. Interestingly, no homologues to these sequences were found using HHblits; however, secondary structure predictions suggest that they contain six transmembrane helices, with an extracellular, extended, domain of ∼35 amino acids. The central, 15 amino acid segment of this extracellular loop of the PrpS1 protein, namely DQKWVVAFGTAAICD, has been synthesised, and shown to interact with purified PrsS1, in a slot blot, whereas a randomised peptide of the same composition did not [51]. The same peptide could block the inhibition of germination of pollen by PrsS1 protein, and, surprisingly, was allele specific, despite this part of the PrpS1 receptor protein being similar to the other alleles. Hence, this peptide is likely to bind to the PrsS1 protein. This peptide is largely hydrophobic in the central region (underlined residues). We postulate that the central hydrophobic amino acids of the receptor interact with the exposed F49 and F50 residues in loop 4, while the charged residues within the receptor may interact with the charged residues on this loop and loops 2 and 6.
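The hydrophobic character of the central region of this peptide can be illustrated with a sliding-window hydropathy profile. This is a minimal sketch that uses the Kyte-Doolittle scale and a five-residue window; both choices are assumptions made here for illustration and are not taken from the analysis in this paper.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy of each window of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# The 15-residue PrpS1 extracellular loop segment quoted in the text.
PEPTIDE = "DQKWVVAFGTAAICD"
WINDOW = 5

profile = hydropathy_profile(PEPTIDE, WINDOW)
for start, score in enumerate(profile):
    print(f"{PEPTIDE[start:start + WINDOW]}  {score:+.2f}")
```

On this scale the N-terminal window (DQKWV) is hydrophilic, while the windows covering the central residues score as hydrophobic, in line with the description of the peptide above.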

Conclusion
The SPH family of proteins is extremely widespread in dicotyledonous plants and is thought to be involved in a large variety of signalling pathways. Arabidopsis thaliana contains 92 core members of this family that are evolutionarily related and have sequence resemblance to Papaver PrsS proteins. Other core members of the SPH family are found only in dicotyledonous plants and the lower plants, Lycopodiopsida (Selaginella, spike moss) and Bryophyta (Physcomitrella, spreading earthmoss), and have not been identified in monocotyledonous plants, despite numerous attempts using BLAST. This specific phylogenetic distribution suggests that, like the additional SPADA-identified homologues in Arabidopsis, proteins identified in fungi and animals and placed in Pfam group PF05938 may be the result of independent evolution or divergence to the point at which little sequence homology remains.
The SPH proteins are highly stable and have a β-sandwich structure, with 8-9 β-strands in a topology distinct from that found in most other proteins to date. The different classes of SPH proteins are evolutionarily related and have distinct disulfide-bonding patterns that can be readily accommodated within the proposed structure. All the proteins have a disulfide bond between the neighbouring strands 2 and 5. The PrsS proteins and the other Class I proteins have four cysteines, with an additional disulfide bond between strand 7 and loop 6. Class II proteins have an additional disulfide bond between the adjacent strands 8 and 9, whereas Class III proteins have only the one conserved disulfide bond. SPH15 is a Class IV protein, with three cysteines within the conserved sequence and one additional cysteine at residue 1, outside of this, giving rise to a disulfide bond between residue 1 and loop 6. The classes have been subclassified on the sequence of the hydrophilic loop 2: proteins in subclass A have the motif K/RXXD, while those in subclass B are heterogeneous. PrsS1 and PrsS8 have KXXD in this loop, whereas PrsS3 contains E as the first amino acid of the motif but still contains the conserved D.
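The loop 2 motif used for subclassification can be written as a simple pattern match. This is a minimal sketch of a strict K/RXXD test only; the function name and the four-residue example strings are hypothetical, and the real loop 2 sequences should be taken from the alignment in Figure 1.

```python
import re

# K or R, then any two residues, then D (the K/RXXD motif of subclass A).
SUBCLASS_A_MOTIF = re.compile(r"[KR]..D")


def has_subclass_a_motif(loop2_sequence):
    """True if the loop 2 sequence contains a strict K/RXXD match."""
    return SUBCLASS_A_MOTIF.search(loop2_sequence.upper()) is not None


# Hypothetical loop 2 fragments for illustration (NOT the real sequences).
examples = {"KSGD": None, "RTAD": None, "ESGD": None}
for fragment in examples:
    examples[fragment] = has_subclass_a_motif(fragment)
    print(fragment, examples[fragment])
```

A fragment beginning with E, as described for PrsS3, fails the strict test even though the conserved D is present, so a looser pattern would be needed if such sequences are to be grouped with the KXXD-containing proteins.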
From the limited mutagenesis data, involving this conserved D residue in loop 2 and the larger effects of mutation in loop 6 [52], together with the structural analysis presented here, we speculate that loops 2, 4 and 6, on one face of the protein, interact with the receptor PrpS1 to mediate programmed cell death. However, other unrelated proteins with the same strand topology, such as transthyretin and TssJ, show that a range of interaction interfaces is possible (Figure 5). Thus, this family of proteins may have evolved to act as a versatile and stable scaffold to display a variety of peptides in the loops, each interacting with a different receptor. As such, in addition to their broad roles in cell signalling, the SPH family may be a useful scaffold for synthetic biology applications.

Database Deposition
The NMR-derived 3D co-ordinates of SPH15 have been deposited in the Protein Data Bank under PDB ID code 6G7G.