Thousands of unannotated small and alternative open reading frames (smORFs and alt-ORFs, respectively) have recently been revealed in mammalian genomes. While hundreds of mammalian smORF- and alt-ORF-encoded proteins (SEPs and alt-proteins, respectively) affect cell proliferation, the overwhelming majority of smORFs and alt-ORFs remain uncharacterized at the molecular level. Complicating the task of identifying the biological roles of smORFs and alt-ORFs, the SEPs and alt-proteins that they encode exhibit limited sequence homology to protein domains of known function. Experimental techniques for the functionalization of these gene classes are therefore required. Approaches combining chemical labeling and quantitative proteomics have greatly advanced our ability to identify and characterize functional SEPs and alt-proteins in high throughput. In this review, we briefly describe the principles of proteomic discovery of SEPs and alt-proteins, then summarize how these technologies interface with chemical labeling for identification of SEPs and alt-proteins with specific properties, as well as in defining the interactome of SEPs and alt-proteins.
Introduction
Small open reading frames (smORFs) shorter than 100 codons were previously excluded by genome annotation consortia in order to minimize false positives due to random ORF background in eukaryotic genomes [1]. Similarly, mammalian ORFs initiating at non-AUG start codons [2] or overlapping [3, 4] known protein coding sequences (the latter termed alternative ORFs, or alt-ORFs, which encode alternative proteins or alt-proteins), were not annotated [5]. However, thousands of smORFs and alt-ORFs are translated in mammalian cells [6, 7]. These previously unannotated genes are found in noncoding RNAs, in 5′ and 3′ untranslated regions of mRNAs, and in frame-shifted ORFs overlapping annotated protein-coding sequences [8]. Hundreds of mammalian smORF-encoded polypeptides (SEPs, also termed microproteins, small proteins, and micropeptides) regulate cell proliferation [9, 10], and the biological roles of several dozen human SEPs and alt-proteins have been defined at the molecular level [11]. Dysregulation of or mutations in multiple human smORFs have been shown to promote cancer [12]. However, due to their short lengths and lack of primary sequence homology to proteins of known function, the majority of smORFs and alt-ORFs remain uncharacterized [13].
Figure 1. Overview of experimental methods for analysis of SEPs and alt-proteins.
(A) Schematic workflow of RIBO-seq for detection of smORFs and alt-ORFs. Ribosome protected RNA fragments are purified and sequenced. Those smORFs and alt-ORFs exhibiting three-nucleotide periodicity [7, 14] (representative ribosome profiling data shown) are most likely to be translated. (B) Schematic workflow of proteomics for detection of SEPs and alt-proteins. The small proteins are enriched by size selection (representative Coomassie-stained SDS–PAGE gel image shown), followed by proteomics, and the data are searched against a custom database containing canonical proteins as well as all candidate smORFs and alt-ORFs. The annotated tryptic peptide sequences are discarded, and the unannotated tryptic peptide-spectral matches can be validated and mapped to putative SEPs and alt-proteins.
Three high-throughput technologies have been developed for the identification of smORFs and alt-ORFs [5]: comparative genomics [15], ribosome profiling (RIBO-seq) [7, 9, 16–19], and proteomics coupled with transcriptomic/translatomic databases [13, 20, 21]. RIBO-seq [22, 23] and proteomics [24–26] can both be leveraged in quantitative mode to measure changes in SEP and alt-protein translation under different conditions, providing a degree of functional information. However, only proteomics can directly detect SEPs and alt-proteins at the protein level. Importantly, the combination of protein labeling technologies with proteomics can identify SEPs and alt-proteins with specific physical and chemical properties, a level of information inaccessible to genetic methods.
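As an illustration of the three-nucleotide periodicity criterion used to judge translation in RIBO-seq data, the following minimal Python sketch computes the fraction of ribosome P-site positions that fall in each reading frame of a candidate ORF. The function name, the coordinate convention, and the toy positions are assumptions made for illustration and are not taken from any published pipeline; real analyses additionally require read-length-specific P-site offsets and statistical testing.

```python
from collections import Counter


def frame_fractions(p_site_positions, orf_start, orf_end):
    """Fraction of P-site positions in each reading frame of an ORF.

    p_site_positions: 0-based transcript coordinates of inferred P-sites
    orf_start, orf_end: 0-based, half-open ORF interval [orf_start, orf_end)
    Frame 0 is the frame of the ORF's first nucleotide.
    """
    counts = Counter(
        (pos - orf_start) % 3
        for pos in p_site_positions
        if orf_start <= pos < orf_end
    )
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[frame] / total for frame in range(3))


# Hypothetical toy data: most footprints sit in frame 0 of the candidate ORF.
positions = [10, 13, 16, 19, 22, 25, 11, 28, 31, 15]
fractions = frame_fractions(positions, orf_start=10, orf_end=40)
print(fractions)
```

A translated ORF is expected to show a strong excess of footprints in a single frame, whereas background signal tends to distribute more evenly across the three frames.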
In this review, we summarize the principles of proteomic discovery of SEPs and alt-proteins, then discuss how these platforms can be utilized downstream of chemical labeling and enrichment to profile unannotated SEPs and alt-proteins that exhibit chemical reactivity, regulated synthesis, and subcellular localizations, as well as to identify interaction partners of SEPs and alt-proteins (Figure 1).
Proteomic identification of SEPs and alt-proteins
In this section, we describe the basics of proteomic approaches for smORF and alt-ORF discovery. SEP and alt-protein proteomics has recently been reviewed [27, 28], and detailed protocols are available [29], so we provide only a brief overview here.
Due to their short lengths, SEPs and alt-proteins typically generate only one or a few detectable tryptic peptides for liquid chromatography–tandem mass spectrometry (LC–MS/MS) detection, whereas larger proteins generate many peptides [13, 30]. To increase detection sensitivity for SEPs and alt-proteins, small proteins must be enriched after proteome extraction, using methods such as organic solvent extraction [31], peptide gel extraction [30], molecular weight cutoff filtration [13], or solid phase extraction [32]. The samples are then enzymatically digested [33] and analyzed by LC–MS/MS; top-down LC–MS/MS identification of small proteins has also been reported [27]. Finally, MS/MS data must be searched against a database containing both annotated and unannotated proteins, such as in silico transcriptome translations [34, 35], RIBO-seq-derived translatomes [36, 37], or computationally predicted alt-ORF databases like OpenProt [38]. Peptide-spectral matches to annotated proteins and contaminants must next be discarded. Several methods for discarding matches to canonical proteins are available; our laboratory reported a script that removes peptides that are exact matches to annotated human protein sequences, followed by BLAST of each candidate peptide to ensure that it differs from canonical proteins by at least two amino acids, and finally by manual validation of MS/MS spectral quality [13, 29]. An alternative approach is to filter peptide-spectral matches using PepQuery [39], which can exclude false positives due to isobaric amino acids and to post-translational or chemical modifications of canonical peptides. It is critical to note that searches of expanded databases, coupled with the uncertainty of individual peptide-spectral matches, can lead to a high rate of false-positive identifications in SEP and alt-protein proteomic searches [40], which must be excluded by the experimenter, as previously described [13, 29].
After the exclusion of canonical proteins, the remaining peptide-spectral matches are candidate hits that can be mapped to SEPs and alt-proteins. Experimental (molecular) validation is ultimately required to confirm a novel SEP or alt-protein.
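The digestion and canonical-exclusion steps described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the script reported in [13, 29] and not PepQuery: the function names, the 7–30 residue peptide length window, the toy sequences, and the ungapped sliding-window comparison (used here as a crude stand-in for a BLAST search) are all hypothetical choices. Isoleucine is collapsed to leucine because the two residues are isobaric.

```python
import re


def tryptic_peptides(protein, min_len=7, max_len=30):
    """In silico trypsin digest: cleave after K or R, but not before P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def normalize(seq):
    """Treat isobaric isoleucine and leucine as the same residue."""
    return seq.upper().replace("I", "L")


def min_mismatches(peptide, protein):
    """Fewest mismatches between a peptide and any ungapped window of a protein."""
    n = len(peptide)
    windows = (protein[i:i + n] for i in range(len(protein) - n + 1))
    return min((sum(a != b for a, b in zip(peptide, w)) for w in windows), default=n)


def filter_unannotated(peptides, canonical_proteins, min_diff=2):
    """Keep peptides differing from every canonical protein by >= min_diff residues."""
    canon = [normalize(p) for p in canonical_proteins]
    return [
        pep for pep in peptides
        if all(min_mismatches(normalize(pep), prot) >= min_diff for prot in canon)
    ]


# Hypothetical toy sequences, for illustration only.
candidate_sep = "MAGLSKPTEVDLRAIQWYGK"
canonical = ["MSTALQWYGKLLDE"]

peptides = tryptic_peptides(candidate_sep)
novel = filter_unannotated(peptides, canonical)
print(peptides)
print(novel)
```

In this toy case the 20-residue candidate yields only two peptides in the detectable length range, and one of them is discarded because it matches a canonical sequence once isoleucine and leucine are treated as equivalent. Surviving peptides would still require manual inspection of spectral quality and experimental validation.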
Principles of chemical proteomics in profiling of SEPs and alt-proteins
Approaches for chemical labeling of cellular proteins with ‘handles’ for proteomic analysis based on specific functions or properties can broadly be placed into three categories: (1) methods that reveal the reactivity of amino acid residues using chemically tuned probe molecules (e.g. activity-based protein profiling, or ABPP [41]); (2) metabolic labeling methods (e.g. bio-orthogonal non-canonical amino acid tagging, or BONCAT [42]) that install bio-orthogonal handles into cellular proteins for purification and analysis; and (3) proximity labeling approaches [43] that label cellular proteins in accordance with their subcellular localizations and/or interactions. In the following sections, we discuss how these tools have been adapted for analysis of SEPs and alt-proteins.
Reactive cysteine profiling for identification of nucleophilic SEPs
Proteomic identification of nucleophilic cysteine residues — a feature of the active sites of hydrolases and other enzyme classes, as well as many other proteins — using electrophilic probes was first developed for profiling of canonical proteins over a decade ago [44, 45]. Since then, this technique has been utilized to identify cysteine oxidation [46, 47] and post-translational modification [48] events, as well as to develop covalent ligands and inhibitors to cysteine-containing proteins [49, 50]. While SEP monomers are generally too small (<100 amino acids) to fold into enzyme-like structures with complex active sites, it is nonetheless possible that SEPs could harbor nucleophilic cysteine residues. To test this hypothesis, Saghatelian et al. [51] developed a modified method for reactive cysteine profiling in the small proteome. They began with isolating the peptidome by cell lysis and size selection using a 30 kDa molecular weight cutoff filter, then incubated the peptidome with an iodoacetamide (IA)-alkyne probe to label reactive cysteine-containing small proteins, which were then captured using Click chemistry with a biotin derivative and streptavidin pulldown (Figure 2A). The reactive small cysteinome thus captured was subjected to proteomics with custom database searching to identify 16 unannotated, nucleophilic SEPs in K562 cells. While the cellular functions of the SEPs identified in this work were not elucidated, this study provided the first evidence that chemical labeling can reveal reactive smORFs and alt-ORFs. It is possible that additional reactive SEPs remain to be discovered using new workflows, for example using varied size selection methods and electrophilic probes [52].
Figure 2. Chemical proteomic workflows to profile SEPs and alt-proteins.
(A) Schematic workflow of chemical proteomic profiling of reactive cysteine-containing SEPs and alt-proteins [51]. Small proteins were selected with a 30 kDa filter, followed by alkylation with iodoacetamide alkyne, which can specifically label reactive cysteine residues, to generate a handle for further Click chemistry-based enrichment. Coupled with custom smORF/alt-ORF database searching, nucleophilic cysteine-containing SEPs and alt-proteins were identified. (B) Schematic workflow of bioorthogonal non-canonical amino acid tagging (BONCAT)-based proteomic profiling of SEPs and alt-proteins [53]. An unnatural amino acid (uAA) bearing a bioorthogonal azide moiety was metabolically incorporated into newly synthesized proteins, then the labeled small proteins were size-selected in solution with a C8 column. On-bead Click chemistry captures newly synthesized small proteins and enables removal of unlabeled proteins that did not undergo active synthesis during the labeling period, followed by trypsin digestion and proteomics to identify unannotated SEPs and alt-proteins [53]. (C) Schematic workflow of proximity labeling-based proteomic profiling of SEPs and alt-proteins [54]. TurboID, an engineered biotin ligase, biotinylates lysine residues on proximal proteins within the same subcellular region. To map SEPs and alt-proteins to subcellular compartments, TurboID was expressed in the compartment of interest as a genetic fusion to a localization sequence or protein. All proteins biotinylated by TurboID were enriched by streptavidin pulldown, followed by size selection for small proteins and in-gel digestion (representative Coomassie-stained SDS–PAGE gel image shown). Finally, unannotated SEPs and alt-proteins were identified by mass spectrometry.
Metabolic labeling of newly synthesized SEPs and alt-proteins
Unnatural amino acids (uAA) that are nearly isosteric to the proteinogenic amino acids can be incorporated into proteins via the cellular protein synthesis machinery [55]. Briefly, the uAA is supplied to living cells, and cellular proteins synthesized during the labeling period incorporate the uAA at all codons corresponding to its natural analog. When the unnatural amino acid bears a bio-orthogonal functional group, the labeled proteome can be captured via Click chemistry with a biotin analog for proteomic identification. One such method, bio-orthogonal noncanonical amino acid tagging, or BONCAT, was developed in 2006 to enable identification of newly synthesized proteins in cells and neurons [42]. An approach for direct detection of newly synthesized SEPs and alt-proteins using a modified BONCAT workflow was recently reported by our group [53] (Figure 2B). Briefly, after uAA labeling, small proteins are enriched using a C8 column [32]. Labeled small proteins are then captured with cyclooctyne-derivatized beads [56], circumventing loss of small proteins and peptides during biotin/streptavidin-based capture [42]. On-bead digest is performed prior to LC–MS/MS analysis and custom database searching. This method revealed 22 SEPs, alternative proteins and N-terminal extensions of annotated proteins [53], stress-regulated translation of nine unannotated small proteins, and cell-cycle regulated synthesis of the alt-protein MINAS-60. We note that the sensitivity of the method is likely currently limited by covalent retention of the uAA residue-containing peptide on the beads. Development of capture-and-release strategies may solve this problem in the future.
Proximity biotinylation reveals subcellular localizations of SEPs and alt-proteins
Proximity labeling is a term describing a suite of technologies that utilize engineered enzymes (e.g. APEX/APEX2 [57–59] or BioID [60]/TurboID [61]) or photocatalysts (e.g. µMap [62]) to locally generate a reactive probe with a short half-life. These high-energy intermediates, which include phenoxy radicals (APEX/APEX2), esters (BioID/TurboID), and carbenes (µMap), can react with residues on nearby proteins and other biomolecules (aromatic sidechains, amines, and C-H bonds, respectively), or, over longer distances, with solvent, which quenches them. The probes bear a biotin or bioorthogonal moiety that permits isolation and proteomic identification of labeled proteins after cell lysis, thus retaining spatial information of where the identified proteins were localized in the intact cell. Depending on the half-life of the reactive probe generated, each proximity labeling technology is associated with a characteristic labeling radius, on the order of tens to hundreds of nanometers [63], a range spanning the sizes of protein complexes to organelles. These technologies have been applied to mapping subcellular and extracellular proteomes in cells and in vivo by targeting the enzyme or catalyst to specific regions of the cell [57, 64–66], as well as to identify protein–protein [43, 67] and protein–RNA [68, 69] interactions.
Because bioinformatic analyses of SEPs and alt-proteins are challenging due to their short lengths, it is difficult to predict their subcellular localizations. At the same time, SEPs and alt-proteins can exhibit specific subcellular localizations via binding to interaction partners despite lacking canonical localization signal sequences. For example, NBDY localizes to P-bodies as a result of its interaction with the mRNA decapping complex [70, 71], and alt-RPL36 partially localizes to endoplasmic reticulum (ER)-plasma membrane junctions due to its interaction with TMEM24 [72]. MRI/CYREN localizes to the nucleus as a result of its interaction with Ku [73–75]. (Since the size limit for passive diffusion through the nuclear pore is ∼30 kDa [76], most SEPs should be able to freely transit between the nucleus and cytoplasm, and can be retained at sites of interactions.) Along the same lines, multiple secreted SEPs have been reported, but some do not bear classical signal sequences for secretion [77]. Counterexamples exist, like MINAS-60, which bears a poly-arginine motif near its C-terminus that may behave like a nucleolar localization signal [53, 78]. SEPs and alt-proteins localized to the ER [79], mitochondria [80–84], Golgi [9], and plasma membrane [85, 86] have also been reported in diverse species including yeast, insects, mouse and human, and some, but not all, of these SEPs and alt-proteins bear signal or transmembrane sequences that could allow prediction of the organelle in which they function. Furthermore, while some SEPs are conserved from flies to human (e.g. sarcolamban), and others are conserved in mammals (e.g. NBDY, PIGBOS [79], and many others), conservation of their subcellular localization has rarely been confirmed experimentally. Taken together, the rules that govern SEP and alt-protein subcellular localization may be complex, and experimental techniques may be required to improve the mapping of SEPs and alt-proteins.
Our group adapted TurboID proximity labeling for subcellular mapping of SEPs and alt-proteins, in a pipeline we termed MicroID [54]. To modify previously reported TurboID proteomic workflows for high-sensitivity detection of SEPs and alt-proteins, after proximity biotinylation and streptavidin enrichment, eluted proteins were separated on a Tricine gel, enabling isolation of the low molecular weight (2–25 kDa) proteome. In-gel digestion was followed by mass spectrometry and proteomic identification of unannotated tryptic peptides using a three-frame translated RNA-seq database. A total of 154 unannotated SEPs and alt-proteins were thus identified in the nucleus, nucleolus, nuclear envelope, and chromatin of HEK 293T cells via quantitative comparison to an untargeted (whole-cell) TurboID control. MicroID can also be applied in vivo: nuclear-targeted TurboID identified 96 SEPs and alt-proteins in multiple mouse tissues. In the future, application of MicroID in additional subcellular and extracellular regions, adaptation of other proximity labeling technologies with different residue specificity and labeling radii, and exploration of additional in vivo models may identify the localizations and physiological relevance of many more SEPs and alt-proteins.
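To make the idea of a three-frame translated database concrete, the following Python sketch translates a transcript in its three forward reading frames and collects AUG-initiated ORFs that end at a stop codon. It is a minimal illustration under stated assumptions rather than the MicroID database-building procedure: the function names, the `min_len` cutoff, and the toy transcript are hypothetical, and non-AUG start codons are ignored for simplicity. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide sequence in frame 0; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_orfs(transcript, min_len=2):
    """Return (frame, protein) for AUG-initiated ORFs that end in a stop codon.

    Only the first in-frame AUG of each stop-delimited segment is used, and
    segments without a terminating stop codon are dropped.
    """
    orfs = []
    for frame in range(3):
        protein = translate(transcript[frame:])
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_len:
                orfs.append((frame, segment[start:]))
    return orfs


# Hypothetical toy transcript with overlapping ORFs in two different frames.
transcript = "GCATGAAATGACCATGGCTTAA"
orfs = three_frame_orfs(transcript)
print(orfs)
```

The toy transcript yields short ORFs in two overlapping frames, which is the situation that gives rise to alt-ORFs. A real database would be built from assembled RNA-seq transcripts and would apply a length cutoff suited to the experiment.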
Chemical approaches to identify interaction partners of SEPs and alt-proteins
Many SEPs and alt-proteins bind to and regulate the functions of canonical proteins [87, 88]. Identifying interaction partners of SEPs and alt-proteins has therefore proven important in understanding their cellular roles. However, for low-affinity or transient interactions, as well as for low-abundance or membrane-localized interaction partners, detection of SEP/alt-protein binding events can be challenging. Several chemical approaches to improve the identification of SEP and alt-protein interactions have been reported.
In a seminal study, APEX2 was utilized for the discovery of SEP interaction partners. APEX2 is an engineered ascorbate peroxidase that can activate biotin-phenol in the presence of hydrogen peroxide to generate a phenoxy radical, which cross-links to aromatic residues within a ∼20 nm labeling radius due to a 1 ms half-life of the radical [57]. Therefore, fusion of APEX2 to a SEP or alt-protein is expected to preferentially label their close interaction partners with biotin, enabling their purification and identification. Chu and colleagues [89] demonstrated this principle (Figure 3A) by fusing APEX2 to the SEP CYREN/MRI-2, which was previously reported to interact with Ku70/80 [74]. APEX2-mediated biotinylation demonstrated superior quantitative enrichment of known CYREN interaction partners over nonspecific proteins as compared with traditional co-immunopurification. Subsequently, APEX2 tagging was employed to show that the previously uncharacterized C11ORF98 SEP interacts with nucleolar proteins, suggesting that it may function in this organelle. While this study successfully demonstrated that proximity labeling can identify SEP–protein interactions with superior signal to noise relative to non-covalent pull-downs, it is important to note that proximity labeling enzymes are several times larger than most SEPs, and fusions may interfere with SEP localization, interactions and functions.
Figure 3. General workflow of chemical labeling methods to identify interaction partners of SEPs and alt-proteins.
(A) Schematic of APEX2 fusion-based proteomics to identify interaction partners of SEPs and alt-proteins. APEX2 is fused to the SEP of interest, and is expressed in cells. In the presence of biotin-phenol and H2O2, APEX2 generates biotin phenoxy radicals, which can label proximal proteins at aromatic amino acid sidechains, enabling their purification with streptavidin and identification with proteomics. (B) Schematic of photo-crosslinking uAA-based proteomics to identify interaction partners of SEPs and alt-proteins. A uAA such as AbK, a lysine analog bearing a diazirine photo-cross-linker, is incorporated into the SEP of interest using amber codon suppression technology in cells; the SEP is also fused to an epitope tag for purification of cross-linked complexes. After photo-irradiation, the diazirine is converted to a reactive carbene, which can insert into C-H bonds on proximal proteins to form a covalent bond. Immunopurification of the cross-linked complexes followed by trypsin digest and mass spectrometry enables identification of interaction partners. (C) Schematic of chemical cross-linking for identification of SEP interaction partners. Bivalent compounds containing two reactive chemical groups bridged by a linker are added to cells or lysates, enabling formation of covalent bonds to nucleophilic amino acid side chains on two interacting proteins, trapping them in a complex. Subsequent trypsin digest, followed by proteomics with smORF/alt-ORF database searching, enables global discovery of SEP and alt-protein complexes in an unbiased manner.
To circumvent artefacts due to tag size, Koh et al. [90] devised an approach to incorporate a photo-cross-linking amino acid into a specific position of a SEP (Figure 3B). This strategy, unnatural or noncanonical amino acid mutagenesis [91], utilizes an evolved mutant of Methanosarcina barkeri pyrrolysyl aminoacyl transfer RNA (tRNA) synthetase, along with an engineered variant of the M. barkeri tRNA that can be charged with an unnatural amino acid, and which bears an amber suppressor anticodon. This system is orthogonal in bacterial and eukaryotic cells, enabling specific incorporation of the unnatural amino acid uniquely at an amber codon strategically placed within the coding sequence of the target protein (in this case, a SEP). The amber suppressor system was used to genetically incorporate a diazirine amino acid, AbK, into multiple SEPs of unknown function. Photoactivation of AbK generates a reactive carbene intermediate, which can insert into C-H and unsaturated C-C bonds. A SEP bearing an AbK residue can be crosslinked to bound proteins; pulldown of the SEP via an affinity tag then permits digest and LC–MS/MS identification of cross-linked interaction partners. Using this approach, Koh and colleagues identified the interactomes of seven SEPs, including one that interacts with histone H2B and localizes to chromatin. This approach represents the minimal tag size that can be genetically incorporated for interaction partner identification, but presents the possible limitation that it requires overexpression of SEP coding sequences programmed with an amber codon. Despite the challenges, two subsequent studies applying unnatural amino acid mutagenesis to fluorescence imaging of SEPs and alt-proteins [72, 92] suggest that amber codon suppression may continue to find broader utility for minimally perturbative SEP and alt-protein labeling.
A third approach that enables covalent capture and identification of interaction partners without the need for overexpression or genetic fusion is cross-linking mass spectrometry (XL-MS) [93] (Figure 3C). In this approach, cells or lysates are treated with a membrane-permeable crosslinker, such as a bis-succinimidyl ester, which reacts covalently with two groups (e.g. amines) on interacting proteins, and can contain a linker that fragments (such as sulfoxide) in the mass spectrometer. The cross-linked peptides can then be identified from their unique fragmentation spectra via shotgun proteomics and mapped to protein complexes via analysis with specialized software. This method is global and unbiased, and requires no enrichment of the proteins under study, offering a particularly high-throughput avenue to identify SEP and alt-protein interactions. Two studies have demonstrated that the XL-MS pipeline can be interfaced with proteogenomic database searching. In the first study [94], HeLa cell XL-MS data were reanalyzed against a database containing >200 000 predicted human alternative proteins in addition to the annotated human proteome. Nearly 300 candidate interactions between alt-proteins and canonical proteins were identified, including a candidate interaction between Alt-ATAD2 and ribosomal protein L10. While molecular validation was not provided, the study is notable for its quantitative comparison of database searches: searching the expanded database vs. the human proteome database decreased sensitivity for detection of annotated proteins, suggesting that near-matches to theoretical sequences can decrease peptide-spectral match scores, even for known proteins. In follow-up work [95], the same group subjected glioma cells to a timecourse of forskolin treatment, which promotes signaling associated with epithelial-to-mesenchymal transition.
The cell extracts were subjected to XL-MS and complexes between alt-proteins and canonical proteins were dynamically mapped through the treatment. Again, >200 cross-links assigned to alt-protein-canonical protein interactions were identified, and the alt-proteins observed were dynamic at varied timepoints between 0 and 48 h of treatment. Candidate alt-protein-containing cross-links were identified with components of the translation machinery, the cytoskeleton and cell motility machinery. Of particular interest were three alt-proteins putatively interacting with tropomyosin 4: Alt-TRNAU1AP, Alt-EPHA5, and Alt-MAP2. Direct molecular evidence for the cellular expression of the alt-proteins implicated in these analyses will be of importance in the future; in addition, the simultaneity or exclusivity and cellular/phenotypic consequences of these candidate interactions will be interesting to probe using biochemical methods. These studies demonstrate that alt-proteins form complexes inside cells, an idea supported by similar observations in a recent reanalysis of a global interactomics dataset [96]. Overall, proximity labeling, photo-cross-linking, and chemical cross-linking offer complementary advantages to advance our understanding of the interactions of SEPs and alt-proteins.
Discussion
The study of SEPs and alt-proteins has begun to offer new insights into human biology and disease, and it is imperative to determine the molecular and organismal roles of thousands of these novel gene products that are currently uncharacterized.
Proteomics, particularly when coupled with chemical labeling, can provide information about the presence and abundance of SEPs and alt-proteins in the cell as well as their physical, chemical and functional properties. Such approaches have already been developed to report SEP and alt-protein reactivity, synthesis, localization, and interactions. In the future, there are clear avenues to improve on existing workflows for SEP and alt-protein functional proteomics, including application of new small protein extraction and enrichment methods up- or downstream of chemical labeling protocols. In addition, adaptation of other chemical labeling methods for nascent proteins, post-translational modifications and subcellular localizations, as well as analysis of secreted SEPs and alt-proteins, holds the potential to provide new insights into the functions of these species. Finally, while dysregulation and mutation of SEPs and alt-proteins contribute to human diseases, chemical labeling and proteomic analysis of SEPs and alt-proteins has not yet been applied in the disease context; doing so should provide key insights into the molecular mechanism of SEPs and alt-proteins in disease.
However, several key challenges remain. First, the proteoform [97] diversity of SEPs and alt-proteins is likely to be complex, as it is for canonical proteins. While multiple isoforms of some SEPs, like CYREN/MRI [74], have been documented, it is likely that alternative splicing of many other SEP and alt-protein transcripts generates isoform diversity that is currently almost entirely unexplored. Furthermore, while phosphorylation of multiple SEPs and alt-proteins has been reported [71, 72], few other SEP/alt-protein post-translational modifications have been examined, and the stoichiometry and site occupancy of post-translational modifications on intact SEP/alt-protein proteoforms is essentially unknown. While these questions are challenging to address even for canonical proteins, an understanding of the functional repertoire of SEPs and alt-proteins will require an accounting of all of their intact proteoforms in the cell. Intact proteoform and modification state diversity are impossible to fully map using bottom-up proteomics. Top-down proteomics is uniquely suited to addressing this question, and current efforts to advance this technology for SEP analysis offer an optimistic outlook for SEP proteoform identification in the future.
A second major issue for the field is the trade-off between sensitivity and false positive identifications. Proteomics experiments typically detect fewer alt-proteins and SEPs than ribosome profiling, likely a result of the stochastic nature of data-dependent acquisition, reliance on one or few tryptic peptides for detection of SEPs, and the challenge of detecting some classes of proteins [24] using bottom-up proteomics. However, a more fundamental challenge arises from peptide-spectral matching via database search. Expanded translatome or transcriptome databases are required for proteomic identification of SEPs and alt-proteins. These databases are several times larger than the canonical proteome, and contain many entries that do not correspond to bona fide cellular SEPs and alt-proteins; adding variable post-translational or chemical modifications to the search increases the size of the theoretical database yet again. As a result, experimentally acquired spectra can be falsely matched to ORFs that are not actually expressed [40, 98], a problem further exacerbated when modifications are included. At the same time, Fournier and colleagues showed that true positive identifications of canonical proteins are also decreased when searching expanded databases with stringent false discovery rates, perhaps due to near-matches to theoretical sequences that are not the reverse of real protein sequences but are present in the decoy database [95]. To solve these problems in the future, it will be essential to develop cell type-specific translatome resources curated for expressed SEPs and alt-proteins in addition to the annotated proteome or, better yet, to update the human proteome annotation with SEPs and alt-proteins with evidence for expression.
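The interplay between database size and stringent false discovery rate (FDR) filtering described above can be illustrated with a minimal target-decoy sketch. It is a toy illustration under stated assumptions, not the procedure used in any of the cited studies: the function name `accepted_targets`, the `(score, is_decoy)` input format, and the simple decoys/targets FDR estimate are all chosen here for illustration, and the scores are invented.

```python
from typing import Iterable, Optional, Tuple


def accepted_targets(
    psms: Iterable[Tuple[float, bool]], max_fdr: float
) -> Tuple[Optional[float], int]:
    """Return (score cutoff, number of target matches accepted).

    `psms` is a hypothetical list of (score, is_decoy) pairs, one per
    peptide-spectrum match. FDR at each cutoff is estimated as
    decoys / targets among matches scoring at or above that cutoff
    (an assumed, simplified estimator).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best: Tuple[Optional[float], int] = (None, 0)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the most permissive cutoff that still satisfies the FDR limit.
        if targets and decoys / targets <= max_fdr:
            best = (score, targets)
    return best


# Invented scores for a search against a small database.
small_db = [(10, False), (9, False), (8, False), (7, True),
            (6, False), (5, False), (4, True), (3, False)]

# The same spectra searched against an expanded database: a larger search
# space is modeled here as one additional high-scoring decoy match.
large_db = small_db + [(8.5, True)]

print(accepted_targets(small_db, max_fdr=0.25))
print(accepted_targets(large_db, max_fdr=0.25))
```

In this toy input, a single extra high-scoring decoy reduces the accepted target matches from five to two at the same FDR limit, which mirrors the qualitative loss of true positive identifications when expanded databases are searched with stringent thresholds.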
Perspectives
Small and alternative open reading frames (smORFs and alt-ORFs) were previously excluded from the human genome annotation, but are now known to encode thousands of small proteins with potential biological functionality and disease relevance. Identification and functional study of these unannotated small proteins represent a major opportunity to gain insights into biology.
Chemical labeling coupled with proteomic identification has begun to provide insight into the synthesis, localization, reactivity and interactions of smORF- and alt-ORF-encoded proteins; some of these properties are challenging or impossible to investigate using genomic or computational technologies.
Proteomic detection of smORF- and alt-ORF-encoded proteins still faces challenges including false positives, false negatives, and intact proteoform mapping, which can be addressed with improved databases and mass spectrometric technologies in the future.
Competing Interests
The authors declare that there are no competing interests associated with the manuscript.
Open Access
Open access for this article was enabled by the participation of Yale University in an all-inclusive Read & Publish agreement with Portland Press and the Biochemical Society under a transformative agreement with Individual.
Acknowledgements
This work was supported by a Mark Foundation for Cancer Research Emerging Leader Award, a Paul G. Allen Frontiers Group Distinguished Investigator Award, and a Sloan Research Fellowship (FG-2022-18417) (to S.A.S.). X.C. was supported by the Shanghai Pujiang Program (22PJ1402600), and in part by a Rudolph J. Anderson postdoctoral fellowship from Yale University. K.H.L. was supported in part by an NIH Pathway to Independence Award (4 R00DK129712-03).
Author notes
These authors contributed equally to this work.