Long noncoding RNAs (lncRNAs) have emerged as prominent regulators of gene expression in eukaryotes. The identification of lncRNA orthologs is essential in efforts to decipher their roles across model organisms, as homologous genes tend to have similar molecular and biological functions. The relatively high sequence plasticity of lncRNA genes compared with protein-coding genes, makes the identification of their orthologs a challenging task. This is why comparative genomics of lncRNAs requires the development of specific and, sometimes, complex approaches. Here, we briefly review current advancements and challenges associated with four levels of lncRNA conservation: genomic sequences, splicing signals, secondary structures and syntenic transcription.
Long noncoding RNAs (lncRNAs) represent an abundant class of transcripts longer than 200 bases that do not encode proteins. In ENSEMBL 102 , there are 82 760 human lncRNAs originating from 28 480 genes; thus, they outnumber protein-coding genes (n=22 803). Despite indisputable progress in accumulating lncRNA-related data, relatively little is known about their functions, with no more than a few thousands of them subjected to detailed functional studies. To date, lncRNAs have been linked to virtually all cellular processes, including transcription, splicing, translation, and the cell cycle [2–6], and have also been implicated in human diseases, such as malignant transformation, while a number of them represent diagnostic and prognostic biomarkers for cancer [7,8]. lncRNAs participate in different steps of gene expression. For instance, many modulate the act of transcription, e.g. through promoter modifications, creating a permissive chromatin environment or binding transport factors to inhibit the nuclear localization of specific transcription factors [9,10].
The heterogeneity in the modes of lncRNA action poses a huge challenge to functional studies and for distinguishing which are bona fide functional. Here, comparative genomics, conservation studies in particular, may be helpful. First, evolutionary conservation is often associated with functionality, while the presence of orthologs with established roles makes the predictions stronger. Second, the level and type of lncRNA conservation may be used to assign lncRNAs to a hypothetical functional domain, thus facilitating subsequent functional studies. For instance, conservation of exon sequences points to functions exerted through mature lncRNA molecules, such as lncRNA:RNA base pairings that are associated with RNA processing and stability [11,12]. On the other hand, the evidence accumulated thus far strongly suggests that for most lncRNAs it is not their sequences but the act of transcription that is associated with their biologically relevant functions. This is especially true for so-called antisense lncRNAs [13,14], whose function is to modulate the expression of sense partners. Because of the nature of the underlying lncRNA:protein interactions with low sequence specificity, there are virtually no constraints upon the conservation of sequences of these lncRNAs. They do, however, often exhibit positional conservation; i.e., they occupy the same or neighboring genomic loci in closely related species.
In this review, we consider four dimensions of lncRNA conservation from the perspective of the genome. The abovementioned conservation of primary sequences (Figure 1A) and positional conservation of lncRNAs (Figure 1B) are complemented by insight into their splice sites and other splicing signals (Figure 1C), and their secondary structures (Figure 1D): we consider the pros and cons of existing methodologies and their algorithmic challenges, discuss relevant studies and pinpoint some novel or emerging approaches, such as the application of alignment-free metrics and the conservation of promoter sequences.
Different levels of lncRNAs conservation
We are not covering the issue of conservation of lncRNA expression, processing, and tissue-specificity but one should keep in mind this represents another layer of complexity in conservation and functional studies. For example, it has been demonstrated that positionally conserved lncRNAs in human embryonic stem cells are in most cases spliced and then exported to the cytoplasm, while their mouse counterparts are predominantly unspliced and retained in the nucleus, leading to species-specific functions of lncRNAs in pluripotency maintenance . Generally, the most conserved lncRNAs tend to be broadly expressed and they display significantly higher expression levels than those with high evolutionary rates [16,17].
In recent years, several software programs based on sequence homology have been proposed for use in lncRNA conservation analysis [18–20]. In particular, alignment-based algorithms (e.g., BLASTN or BLAT) are widely used to search for similarities in primary sequences between lncRNAs. Such comparisons are performed utilizing solely lncRNA sequences or, additionally, genomes of interest. For example, Marques et al. compared the substitutions in mouse and rat transcriptional initiation regions and demonstrated that enhancer-associated lncRNAs generally display lower conservation than other classes of lncRNAs . In another study, Hezroni et al. applied BLASTN to create a network of sequence similarities, comparing long intergenic RNAs (lincRNAs; autonomously transcribed lncRNAs that do not overlap annotated coding genes) from sea urchins and 16 vertebrates and demonstrated that most lincRNAs are lineage-specific . A number of studies have revealed that only a small fraction of lncRNAs exhibit primary sequence conservation similar to that of protein-coding genes , with some well-studied examples among them. For example, phylogenetic analysis comparing 20 mammals showed high evolutionary conservation of MALAT1 lncRNA , known to be a modulator of gene expression and involved in tumor growth and metastasis. Rapid evolution of lncRNAs has resulted in weak sequence conservation not only within animals but also in plants. It has been shown that fewer than 2% of the lncRNAs in A. thaliana are conserved across plants . Although the general conservation level of lncRNAs is very low, there is a special group of lncRNAs dubbed transcribed-ultraconserved regions (T-UCRs) that are encoded by a subset of ultraconserved genomic regions and are highly conserved in humans, rats, and mice . Typically, they are detected as transcribed by microarray experiments and not present in most RNA-seq based reconstructions of polyadenylated transcripts . Recent studies highlight the importance of T-UCRs in the regulation of genes expression. For example, uc.372 promotes the expression of genes related to lipid synthesis and uptake by binding pri-miR-195/pri-miR-4668 and by suppressing maturation of mir-195/mir-4668 .
Because the full sequences of many lncRNAs are poorly preserved, analyses of shorter regions have been proposed, leading to the discovery of short-sequence homologies (microhomologies) [18,27]. It has been shown that these microhomologous regions are repeated within an lncRNA and can act as protein binding sites [28,29]. However, another solution has been proposed by Noviello et al., who successfully applied alignment-free string similarity metrics for the identification of lncRNA homologs based on their promoter sequences . Kirk et al. have also developed an aligment-free approach called SEEKR. They found that lncRNAs with related functions have similar k-mer profiles despite lacking linear homology . Recently, Ross et al. have introduced a novel technique that complements the SEEKR algorithm, LncLOOM. This method efficiently detects both known and novel functional elements in lncRNAs . Interestingly, a growing body of evidence shows that promoter regions in lncRNAs are more conserved than the remainder of their sequences [29,32]. Moreover, Derrien et al. compared the conservation level of different regions of lncRNA genes and observed that promoters have been conserved to almost the same extent as protein-coding genes . In addition, a number of studies have shown that exons of lncRNAs are better conserved than introns [34,35]. On the other hand, Pegueroles and Gabaldón analyzed sets of human lncRNAs, demonstrating that although lncRNA exons are enriched in conserved elements, they are not conserved to a greater extent than introns .
Conservation of splicing signals
Because of the overall weak primary sequence conservation, studies of other lncRNA features, such as their gene structures, may shed light on the evolution and functional potential of these molecules. In general, the conservation of splicing patterns has been widely studied, as it refers directly to the preservation of the exon–intron structure, considered a hallmark of functional transcripts. Despite the fact that lncRNAs seem to be less efficiently spliced than pre-mRNAs [37–40], noncoding exons are reported to universally originate from alternative splicing , and distinct lncRNA isoforms have the potential to act in a divergent manner [42–44].
The most common approach to splicing conservation studies requires a set of high-quality splice sites in species of interest, multiple genome alignments and, finally, statistical testing to detect and measure conservation. For example, Ponjavic et al.  evaluated splice site conservation with a chi-square test to compare signals from bona fide splice sites against a set of background dinucleotides, not associated with splicing. They indicated that intronic terminal dinucleotides in mouse lncRNAs are significantly more conserved in humans and rats than dinucleotides that are not associated with splice-site signals. A similar strategy was applied by Chernikova et al. , who analyzed the evolutionary conservation of dinucleotides at 5′ and 3′ ends of introns using pairwise comparison between human and mouse lincRNAs and multiple sequence alignments for 15 animal species. Moreover, the researchers analyzed the conservation of the exon–intron structures using a parsimony approach. The analysis was performed using the DNAPARS program with positions of mouse or human introns as a reference for gene structures. Interestingly, some of them have been conserved for over 100 million years. However, a substantial fraction of lincRNAs introns are not conserved – a high turnover rate of lincRNAs introns in the Glires lineage was reported .
Machine learning or statistical models are implemented in conservation studies as well. For instance, Rose et al. trained a support vector machine model to detect and rate donor and acceptor splice site candidates in multiple sequence alignments . They also used MaxEntScan , a tool that scores splice site sequences in terms of their similarity to canonical splice sites. MaxEntScan is based on maximum entropy distribution and is considered to offer the most unbiased approximation for modeling short sequence motifs . Remarkably, it considers dependencies between both nonadjacent and adjacent positions and is used extensively in analyses of lncRNA splice-site conservation [50–52]. One such study was performed on vertebrate lncRNAs , in which multiple sequence alignment together with transcriptomics data were utilized to determine the homologous positions in splice sites, leading to the construction of a comparative map of splice sites. The authors showed that in conserved human transcripts, 87% of splice sites are present in other species, including mice, rats, cows, and dogs, yet most of the novel splice sites originated during primate evolution. MaxEntScan has also been used in recent studies on lncRNA splicing conservation in 16 Brassica species . These studies revealed that nearly 18% of lincRNAs display splicing conservation in at least one exon in Brassicaceae plant family members. In comparison with that of vertebrates, the level of lincRNA conservation measured by gene structure is significantly lower in plants.
The splicing conservation may also involve short sequence motifs recognized by proteins associated with splicing regulation. Examples of this motif are purine-rich hexamers that are recognized by serine arginine-rich proteins (SR proteins). These so-called exonic splice enhancers (ESEs) were explored, inter alia, by Schüler et al. ; these authors showed that experimentally confirmed human hexamers (ESEs) evolve considerably slower than nonenhancer sequences in lncRNAs. Another group implemented the RESCUE-ESE algorithm  to identify hexamers that are significantly enriched or depleted within exonic sequences relative to their flanking intronic sequences . The authors reported that evolutionary constraints are more concentrated near intron–exon boundaries. Moreover, these regions in lncRNAs contain a high density of ESEs, which are conserved across mammals . Despite the fact that both of the aforementioned studies indicate the action of purifying selection to preserve exonic splicing enhancers within lncRNAs, the biological interpretation is difficult, particularly because splicing of lncRNAs is generally inefficient [37,39,40]. More in-depth insight is provided by experimental evidence showing that lncRNAs are unable to bind SR proteins securely  and by studies showing the diverse dynamics of intron excision across lncRNAs .
Secondary structure conservation
Diversified functions of lncRNAs, including participation in post-transcriptional regulation, are often associated with their ability to recognize and bind molecular targets, where the secondary and tertiary structures of related RNA molecules may play pivotal roles. As a result, studying the structural architecture of lncRNAs may help to determine their exact modes of action. The possibility of these molecules to maintain certain structures was reported, for example by Smith et al. in a genome-wide study of mammalian evolutionarily conserved structures. Smith et al. identified millions of putative functional RNAs, some of which overlap with annotated ncRNAs . Another report pointed to a set of conserved brain lncRNAs being significantly enriched with predicted RNA secondary structures . Furthermore, Pegueroles and Gabaldón  observed that positions in paired lncRNA regions that participate in the formation of folds show lower probabilities of having SNPs. Thus secondary structures impose limits on changes to lncRNA sequences (which may lead to degeneration of the structural motifs), suggesting their functional relevance. Yang and Zhang showed, using data from a parallel analysis of RNA structure (PARS), that lncRNAs are substantially less folded than mRNAs . These results were surprising as it is assumed that lncRNAs often require a secondary structure to perform their functions, while the mRNAs do not. However, the authors pointed out that it is difficult to clearly explain the observations since analyses may be biased by insufficient representation of functional lncRNAs, for example. On the other hand, strong secondary structures may not be preferred as they could hinder base-pairings that are at the core of many RNAs functions.
Although the secondary structure conservation of lncRNAs is lower than that of messenger RNAs or ribosomal RNAs [20,59], some functional lncRNAs with experimentally confirmed structural motifs, including Cyrano , Xist [61,62], MALAT1 [63,64], SRA  and GAS5 , have been found. For instance, recently published studies on MALAT1 in vertebrates revealed an evolutionarily conserved core consisting of numerous helices . At the center of the functional core of Cyrano is a cloverleaf structure maintained for more than 400 million years, enriched in several protein binding sites and masking a target site for miR-7 . The aforementioned studies were based on a similar approach: secondary structures are experimentally determined by chemical and/or enzymatic probing and then used in comparative sequence analyses. Interestingly, resolved secondary structures may be used in the identification of lncRNAs in other species, as shown in studies on COOLAIR ; its secondary structure was successfully used to predict the corresponding exons in evolutionarily divergent Brassicaceae species.
Computational methods for the detection of conserved secondary structures are quite well established, such as those based on covariance models (CMs), which are able to automatically learn statistical models derived from multiple RNA alignments. CMs are implemented with popular homology search tools, e.g., Infernal  or CMfinder . CMfinder was applied in research on vertebrate genomes, which were screened for conserved RNA structures (CRSs) . The authors pinpointed some structurally conserved lncRNAs and noted that CRS density decreases from the 5′ to the 3′ end of lncRNAs. Another tool based on covariation models is R-scape, which initially showed limited utility on eukaryotic lncRNAs, as it found no statistically significant support for the proposed secondary structures in some of the lncRNAs with experimentally verified structures . However, recent studies have shown that R-scape efficiently detects conserved secondary structures when appropriately parameterized . Notably, since CMs need to be trained on sets of homologous sequences that are not always available, attempts have been made to utilize algorithms comparing structures without sequence alignments, e.g., BEAGLE , which was used to successfully study the secondary structure similarities in four different Caenorhabditis worm species .
Since it is difficult to identify orthologous lncRNAs by sequence similarity [74,75] or the conservation of secondary structures , the evolutionarily conserved position in the genome (synteny) may be particularly useful. Syntenic lncRNAs are found either in the same genomic region across compared species, as determined by whole genome alignment (WGA), or are found between syntenic protein-coding genes. Syntenic lncRNAs, also referred to as syntologs, represent loci with conserved transcription that typically exhibit little or no sequence similarity. In other words, syntenic transcription may generate RNAs whose expression itself, rather than the sequence, is required for lncRNA function. Indeed, a growing body of evidence shows that for most human lncRNAs, their transcription alone, rather than the production of the mature RNA molecules, is of biological importance. Good examples are natural antisense transcripts (NATs), which originate from the opposite genomic strand compared with their sense partners. It is estimated that as many as 70% of human genes display antisense transcription [13,14], with NATs modulating their expression and processing in a number of ways [76,77]. The prevailing mechanism is the recruitment of complex epigenetic machinery that results in histone modifications and subsequent transcriptional deregulation of target genes . Antisense lncRNAs may also affect expression of their sense counterparts through transcriptional intereference (TI) mechanisms. One such example is yeast SER3 protein coding gene being overlapped by the SRG1 lncRNA, whose transcription increases nucleosome density at the SER3 promoter . Antisense lncRNAs may also trigger methylation at GC-rich genomic regions that are often associated with vertebrate genomes. An example is α-thalassemia, where antisense RNA LUC7L represses expression of HBA2 (α-Globin) by triggering methylation of its CpG islands . More functions played by NATs are reviewed elsewhere [76,77]. Examples of syntologs with relatively degenerate sequences are those referred to as topological anchor point RNAs (tapRNAs), whose roles are linked to chromatin looping and topology, especially in the neighborhood of developmental genes . They act by binding the DNA in cis; hence, they maintain complementarity irrespective of how much the counterpart DNA sequence is changed, indicating little or no evolutionary constraints upon their sequences. It should be noted, however, that particular fragments of these transcripts might be under stricter evolutionary constraints. tapRNAs contain conserved sequence domains recognized by transcription factors and RNA-binding proteins with zinc finger domains, leading to their elevated overall sequence conservation compared with most lncRNAs .
Although the search for syntenic lncRNAs proved to be an effective way of obtaining orthologous lncRNAs [82,83], there are substantial issues when calling syntologs. In general, whether two lncRNAs found within syntenic genomic blocks are orthologous remains in question, as they tend to occupy nonoverlapping conserved fragments of genomic DNA or are even found in a relatively large, nonconserved genomic region between two syntenic blocks (or two orthologous protein-coding genes). The syntenic lncRNAs might also show conservation limited to short parts of their sequences (e.g. within one exon out of many), which further sows the seeds of doubt whether they should be called orthologs. Inference pipelines, such as slncky , also fail to unanimously call lncRNA homologs in cases where more than one lncRNA can be found at neighboring genomic loci or when lncRNA and its genomic neighborhood are enriched with repetitive elements. In these cases, a single orthologous pair is selected based on statistical analysis but with no guarantee that the best solution is provided. For example, to reduce reporting alignments that may be driven by repetitive elements, slncky aligns each lncRNA to the shuffled intergenic regions and seeks to establish a null distribution and to determine the empirical 5% threshold for significant alignment scores. Second, the search for syntologs relies heavily on the quality of WGAs, which project expressed lncRNA loci that correspond to loci in other species. As a result, although the approach is less dependent on evolutionary distance than other sequence-based analyses, it works best with closely related species. Nevertheless, Herrera-Ubeda et al. managed to find syntenic lncRNAs between humans and lancelets using LincOFinder, a new pipeline that identifies conserved lincRNAs between evolutionarily distant species by means of microsynteny analyses . The pipeline considers only the clusters formed by one upstream gene, the lincRNA and one downstream gene and led to the discovery of 32 clusters, with only 16 found to have bona fide orthologous lincRNA after manual inspection.
Notably, an efficient search for syntologous lncRNAs requires high-quality annotations for both species of interest, which are not always available but are essential in different algorithm steps. In particular, protein-coding gene annotations are used to resolve spurious conservation relationships, but the ability to detect lncRNA orthologs relies mostly on extensive annotations of lncRNAs in a given species. As mentioned earlier, available lncRNA datasets are scarce for most species, while model organisms possess multiple and rather poorly overlapping catalogs of lncRNAs, as they are populated with fundamentally different experimental and in silico methods. Keeping this in mind, in our recent research on conserved lncRNAs across primates, we used our custom pipeline built on the basis of slncky, and we assembled custom lncRNA annotations based on data from publicly available RNA-Seq repositories . Our analysis yielded over 78 000 expressed human lncRNAs, and from 2054 to 18 226 were conserved across individual primate species. Interestingly, lncRNAs showing only positional conservation represent as many as 66.78% of the conserved human lncRNAs in the great apes, while syntologs make up only 42.37% of the ultraconserved lncRNAs, defined as those found in all eleven primate species considered in the study.
Current challenges in comparative genomics of lncRNAs
When deciphering the evolution of lncRNAs, the use of comparative genomics is hindered by the poor conservation of sequences and secondary structures. Therefore, specialized computational approaches tailored to the specific characteristics of lncRNA evolution are needed. Some of the most frequently used approaches are briefly described above. The genomic context of lncRNAs represents another layer of complexity. As many of lncRNAs are expressed in antisense to protein-coding loci, are located in introns or represent alternative isoforms of protein-coding genes, it is virtually impossible to investigate their evolution independent of their associated protein-coding genes. This is why most studies are focusing on lincRNAs, but their evolution does not necessarily reflect that of all lncRNAs. Another major obstacle is extensive bias in the quality of annotations between compared species. First, in addition to model organisms, available lncRNA catalogs are quite scarce (compare 82 760 lncRNAs for humans and 1778 for chimps in ENSEMBL 102), which is often coupled with the lack of fully assembled genomes and diversified RNA-Seq data, both of which are typically required to build comprehensive sets of expressed lncRNAs. On the other hand, model organisms, such as humans, possess highly divergent datasets of lncRNAs across public resources, which largely stems from the various computational criteria being used, in some situations leading to barely overlapping sets of lncRNAs based on the same input data . Additionally, since lncRNAs are most often identified from RNA-Seq data, some lncRNAs with highly specific spatiotemporal expression patterns might remain undiscovered; similarly, nonpolyadenylated lncRNAs that prevailingly remain elusive when oligo(dT) protocols for reverse transcription are applied. Reliable analysis of lncRNA conservation, orthologs across species and evolution in general depend on the establishment of standard criteria for their search and annotation, along with the growth of genomic and transcriptomic data for nonmodel organisms.
Comparative genomics of lncRNAs typically involves the formulation of dedicated algorithms and the use of specialized software, in addition to those used in previous studies on protein-coding genes.
From the perspective of genomics, there are four different levels of lncRNA conservation, and in general, lncRNA orthologs are not required to show sequence similarities.
Comparative genomics of lncRNAs is severely hampered by limited availability of quality annotations, and further progress in the field is dependent on expansion of lncRNA catalogs across species.
The authors declare that there are no competing interests associated with the manuscript.
This work was supported by the National Science Centre [grant number 2017/27/N/NZ1/00417 (to E.W.)].
All authors wrote and edited the manuscript. E.W. and M.R.K. prepared the figure. All authors read and approved the final manuscript.
These authors contributed equally to this work.