Non-coding transcripts play an important role in gene expression regulation in all species, including budding and fission yeast. Such regulatory transcripts include intergenic ncRNA (non-coding RNA), 5′ and 3′ UTRs, introns and antisense transcripts. In the present review, we discuss advantages and limitations of recently developed sequencing techniques, such as ESTs, DNA microarrays, RNA-Seq (RNA sequencing), DRS (direct RNA sequencing) and TIF-Seq (transcript isoform sequencing). We provide an overview of methods applied in yeast and how each of them has contributed to our knowledge of gene expression regulation and transcription.
Three decades ago, studies of gene expression regulation were based mostly on biological and biochemical analyses of a specific gene. Advances in experimental technology enabled the rapid development of genome-wide approaches involving quantitative transcriptome sequencing and subsequent bioinformatic and statistical analyses. These offer a much deeper and more general understanding of transcriptional regulation. Regulatory elements include transcribed, but not translated, nucleic acid regions such as intergenic ncRNA (non-coding RNA), 5′ and 3′ UTRs of a gene, introns and antisense transcripts.
In the present review, we focus on RNA polymerase II-transcribed regulatory elements in the fission yeast Schizosaccharomyces pombe and the budding yeast Saccharomyces cerevisiae. These two commonly used model organisms have contributed substantially to the comprehension of eukaryotic genome function. Investigative techniques of the yeast genome have undergone a dramatic evolution. Starting from the analysis of individual genes, followed by DNA microarrays, ESTs, RNA-Seq (RNA sequencing) [4,5], DRS (direct RNA sequencing) and TIF-Seq (transcript isoform sequencing), each technical advance has produced larger and more detailed datasets.
Experimental approaches employed to determine transcribed genomic regions
Expressed sequence tags
ESTs are partial reads of cDNA sequences. They make it possible to evaluate gene expression as a function of cell cycle stage and cell type. Isolated eukaryotic mRNA is reverse-transcribed into cDNA, which is inserted into a suitable vector (Figure 1A). Inserted fragments are sequenced by priming their ends in randomly selected clones. Bioinformatic analysis removes low-quality sequence reads and contaminating vector sequence. Individual EST reads can vary greatly in length, up to 800 nt. Although ESTs have proved to be useful in studying gene expression, several disadvantages have pushed them to the sideline of currently used techniques. In particular, low numbers of full-length ESTs, chimaeric sequences due to cDNA template switching, internal cDNA priming events and low-quality sequences at the EST ends (reviewed in ) led to the development of further approaches to study gene expression.
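The clean-up step described above (removing vector sequence and low-quality read ends) can be sketched as follows. This is a minimal illustration only: the function name, the quality threshold, the minimum length and the idea of matching a known vector prefix exactly are all assumptions made for the example, not the procedure used in any of the cited studies.

```python
def clean_est(read, quals, vector_prefix, min_qual=20, min_len=50):
    """Trim vector sequence and low-quality ends from one EST read.

    `read` is a base string, `quals` a list of per-base quality scores.
    All thresholds and the vector prefix are hypothetical values.
    Returns the trimmed sequence, or None if too little remains.
    """
    # Remove a leading vector fragment if the read starts with it.
    if vector_prefix and read.startswith(vector_prefix):
        read = read[len(vector_prefix):]
        quals = quals[len(vector_prefix):]

    # Walk inwards from both ends until a base passes the quality cutoff.
    start, end = 0, len(read)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1

    trimmed = read[start:end]
    return trimmed if len(trimmed) >= min_len else None
```

Real pipelines use alignment-based vector screening rather than an exact prefix match; the sketch only shows the order of operations.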
Figure 1. Schematic illustration of experimental techniques
DNA microarrays became one of the most popular tools to measure transcriptional activity and genotype in multiple genomic regions. Fluorescently labelled cDNA or cRNA (complementary RNA) can hybridize to microscopic spots on a surface, each containing one specific DNA probe (Figure 1B). A laser-scanning microscope detects the fluorescent signal corresponding to sample–probe hybridization [12,13]. Microarrays enable parallelization of data acquisition, allowing many genes to be compared in a single experiment. A major challenge of these microarrays is the elimination of cross-hybridization. Additionally, microarrays depend on specific probes to detect the expression of a particular gene. ESTs, in this case, are helpful for probe design.
The more recent tiling arrays are independent of gene annotation and can probe for any genomic sequence. This allows the representation of non-repetitive DNA at various sequence resolutions, thus enabling the discovery of novel transcripts and regulatory elements. High-density oligonucleotide tiling arrays allowed the re-annotation of gene boundaries and the estimation of levels of coding and non-coding transcripts in S. cerevisiae and S. pombe.
DNA tiling arrays could fail to identify short exons and precise UTR boundaries. They may give high false-positive rates of transcribed regions and cannot detect post-transcriptionally modified sequences. RNA-Seq is a more recent quantitative method allowing estimation of gene expression levels, where cDNA reads are mapped back to genomic regions. Unique DNA adapters are attached to both ends of sheared DNA fragments (<500 nt, Figure 1C), allowing their selection on beads on a slide in an adapter-dependent manner. Early RNA-Seq protocols used adapter ligation to double-stranded cDNA, which caused loss of information about transcriptional directionality. Strand-specific RNA-Seq protocols were therefore developed later (reviewed in ). RNA-Seq enables the investigation of unique sequences, even if there is only a difference of one nucleotide, and can detect and quantify DNA transcription at much lower levels than DNA microarrays. In addition, sequenced junctions between poly(A) tail and transcript allow a precise mapping of the 3′ ends. Nagalakshmi et al. used RNA-Seq in S. cerevisiae, and Wilhelm et al., Rhind et al. and Marguerat et al. used RNA-Seq (with different protocols) in S. pombe to define transcribed genomic regions.
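The use of reads that span the transcript–poly(A) junction to locate 3′ ends can be illustrated with a short sketch. The function name and the minimum tail length are assumptions chosen for the example; the sketch assumes reads are already oriented in the sense of the transcript.

```python
def split_polya_tail(read, min_tail=8):
    """Split a sense-oriented read into transcript body and poly(A) tail.

    Returns (body, tail_length) when the read ends in at least `min_tail`
    adenines, otherwise None. `min_tail` is a hypothetical cutoff.
    """
    body = read.rstrip("A")            # strip the trailing run of A's
    tail_len = len(read) - len(body)
    if tail_len < min_tail or not body:
        return None
    # The last base of `body` approximates the cleavage position once the
    # body has been aligned to the genome (alignment is not shown here).
    return body, tail_len
```

Because a transcript that itself ends in one or more adenines is indistinguishable from the tail, the junction can only be placed to within that run, and genomically encoded A-rich stretches would have to be filtered after alignment.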
Direct RNA sequencing
The major advantage of the DRS technology is transcriptome sequencing using small amounts of RNA without cDNA conversion, eliminating several experimental artefacts associated with it. DRS provides the opportunity to map precise poly(A) sites, giving insight into alternative polyadenylation and heterogeneity. Moreover, any polyadenylated RNA is observable, such as a small fraction of rRNAs and snoRNAs (small nucleolar RNAs).
DRS employs a single-molecule sequencing approach. Polyadenylated mRNAs are captured on the surface of a poly(dT)-coated flow cell. The poly(dT) primers initiate sequencing steps, using the exposed chain as a template to construct a complementary strand, reading each new base identity at its addition (Figure 1D). The protruding poly(A) tail is filled with overabundant dTTP. The complementary strand is constructed from fluorescent nucleotides containing a VT (virtual terminator) group, preventing any further nucleotide addition. If a nucleotide presented to the RNA strand is complementary to the next unpaired nucleotide in the template, it will be ‘captured’ and produce a fluorescent signal. The solution is then washed away, and the VT is cleaved from the incorporated nucleotide. The process is repeated multiple times, with a fluorescent signal indicating the growing chain.
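The cyclic chemistry described above can be mimicked with a toy simulation. It assumes that one nucleotide type is presented per step in a fixed order (the order `CTAG` is an arbitrary assumption), that the template is given in 3′-to-5′ direction starting just after the filled poly(A) tail, and that the virtual terminator limits incorporation to one base per step.

```python
# DNA nucleotide that pairs with each RNA template base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}

def simulate_drs(template_3to5, cycle_order="CTAG", n_rounds=10):
    """Return the bases called while copying an RNA template.

    In each step one nucleotide type is offered. It is incorporated (and a
    signal recorded) only if it pairs with the next template base; the
    virtual terminator then blocks further addition until the next step.
    """
    called = []
    pos = 0  # index of the next unpaired template base
    for _ in range(n_rounds):
        for nt in cycle_order:
            if pos >= len(template_3to5):
                return "".join(called)
            if COMPLEMENT[template_3to5[pos]] == nt:
                called.append(nt)  # fluorescent signal observed
                pos += 1           # wash, cleave VT, move on
    return "".join(called)
```

For the template `GGAU`, a single round calls only one `C`, because the terminator prevents the second `G` from being copied in the same step; the second round then completes the read as `CCTA`.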
The above techniques allow sequencing of transcript fractions. The problem remains to map 3′ ends to their corresponding 5′ ends. A novel technique called TIF-Seq allows simultaneous sequencing of both transcript ends (Figure 1E). It enables the observation of full-length transcripts and hence transcript and protein heterogeneity via different combinations of 5′ and 3′ ends.
mRNA molecules with capped 5′ ends and polyadenylated 3′ ends are transcribed into full-length cDNA. The 5′ end is tagged with biotin. The cDNA molecules are circularized and fragmented, but the biotin allows the capture of the 5′ end–3′ end junction on beads. The sequencing of the captured molecules is performed with a standard DNA-Seq library generation and paired-end sequencing.
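Once both ends of each molecule are known, transcript isoforms can be tallied as combinations of a 5′ and a 3′ position. The sketch below assumes that each read pair has already been aligned and reduced to a tuple of chromosome, strand, 5′ end and 3′ end; the function name and the tuple layout are illustrative choices.

```python
from collections import Counter, defaultdict

def summarize_isoforms(pairs):
    """Count end-pair combinations from TIF-Seq-like data.

    `pairs` is an iterable of (chrom, strand, five_prime, three_prime)
    tuples, one per sequenced molecule (hypothetical input format).
    Returns (counts, ends_by_start): the number of molecules supporting
    each isoform, and the set of 3' ends observed for each 5' end.
    """
    counts = Counter(pairs)
    ends_by_start = defaultdict(set)
    for chrom, strand, five_prime, three_prime in counts:
        ends_by_start[(chrom, strand, five_prime)].add(three_prime)
    return counts, dict(ends_by_start)
```

A 5′ end associated with several 3′ ends in `ends_by_start` corresponds to the kind of end-combination heterogeneity that paired-end mapping is designed to reveal.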
Identification of ncRNA
High-density tiling arrays were used to detect ncRNA in the entire budding yeast genome. A large number of novel transcripts were detected in intergenic as well as in promoter regions, in both sense and antisense directions. It was suggested that the transcripts mapping to promoter regions have a regulatory role. We describe below CUTs (cryptic unstable transcripts) and SUTs (stable unannotated transcripts), which were mapped via RNA-Seq in combination with 3′ long SAGE (serial analysis of gene expression) [22,23].
CUTs are short RNA molecules that overlap with promoter regions and are degraded soon after their synthesis. They are possibly a by-product of bidirectional transcription, initiated from nucleosome-free regions at promoters. The less abundant sense CUTs are assumed to aid suppression of gene expression. The proximal TSS (transcription start site) is preferred under repressive conditions. If it belongs to the CUT, it can repress transcriptional initiation of the downstream gene. Alternatively, CUTs can be involved in transcriptional interference. Some CUTs share a TSS with the downstream gene and cause premature transcription termination, repressing the gene's expression.
Further investigation of CUTs via microarray data and Northern blots revealed that, although many CUTs are degraded in the nucleus, they can also be exported to the cytoplasm, where decapping and 5′-to-3′ exonucleolytic digestion cause their degradation. Moreover, some of them enter translation, although they do not encode a functional protein.
PASs (polyadenylation signals) in the 3′ UTR of a gene are necessary for the cleavage of the nascent mRNA and subsequent addition of the poly(A) tail. The poly(A) tail is responsible for mRNA stability, nuclear-to-cytoplasmic export and translation of the mRNA. Early cloning experiments have shown that the PASs in fission and budding yeast are homologous and possibly differ from those of higher eukaryotes. A broader understanding of polyadenylation was achieved by the alignment of full-length ESTs to the S. cerevisiae genome and its subsequent analysis for regulatory motifs. This allowed the classification of PASs into a tripartite signal consisting of the FUE (far-upstream element, relative to the cleavage site), the NUE (near-upstream element) and the cleavage site. A probabilistic model for the poly(A) site revealed widespread alternative polyadenylation and cleavage sites internal to the coding region. It also predicted longer 3′ UTRs than initially identified by ESTs.
Furthermore, data from a high-density oligonucleotide tiling array allowed the re-annotation of gene boundaries and the estimation of coding and non-coding transcript levels in S. cerevisiae. This is complemented by further RNA-Seq analysis, which can map precisely positions of poly(A) sites (in S. cerevisiae and S. pombe [5,19]). However, more 3′ than 5′ ends are detected, and the reads capturing the transcript–poly(A) junction are still considered underrepresented. DRS avoids this problem by poly(A) priming; nevertheless, this also gives rise to some problems similar to those of RNA-Seq, e.g. degradation product capture (in S. cerevisiae). Pervasive heterogeneity and alternative polyadenylation have been confirmed, also within ORFs. TIF-Seq additionally identified cases in which a stop codon occurs in the RNA but is absent from the DNA. The motif AATAAA, which is the most frequent NUE in humans, has been established to be the most significant NUE in budding yeast.
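The relationship between a cleavage site and an upstream NUE motif such as AATAAA can be illustrated by a simple window search. The window of 10–30 nt upstream of the cleavage site is an assumption made for the example, as are the function and parameter names; real analyses typically use position weight matrices or probabilistic models rather than exact matching.

```python
def find_nue(seq, cleavage_idx, motif="AATAAA", window=(10, 30)):
    """Return start positions of `motif` upstream of a cleavage site.

    Only motif occurrences lying entirely between `window[1]` and
    `window[0]` nucleotides upstream of `cleavage_idx` are reported.
    The window bounds are hypothetical, not taken from the literature.
    """
    lo = max(0, cleavage_idx - window[1])
    hi = max(0, cleavage_idx - window[0])
    region = seq[lo:hi]

    hits = []
    i = region.find(motif)
    while i != -1:
        hits.append(lo + i)            # convert to coordinates in `seq`
        i = region.find(motif, i + 1)  # allow overlapping matches
    return hits
```

Running such a scan over every mapped poly(A) site, and comparing the hit frequency with that of shuffled sequences, would be one way to ask how strongly a candidate motif is associated with cleavage.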
Another genome-wide experimental analysis, based on all available annotated 3′ UTRs, revealed that the most active terminators in budding yeast have shorter half-lives and higher mRNA expression levels. Even though miRNAs have not been identified in yeast, the weakest terminators comprise longer 3′ UTRs, suggesting the presence of elements that actively degrade mRNA. This provides an exciting stepping stone for further analysis.
The significantly shorter 5′ UTRs [15,17], similarly to the 3′ UTRs, were re-annotated and characterized using emerging technologies in the yeast genomes. Although RNA-Seq fails to determine the 5′ end at nucleotide resolution and can only approximate it due to sharp signal transition, TIF-Seq also identified variation in 5′ UTRs. Moreover, previously considered short regulatory ORFs (‘uORFs’ or upstream ORFs) were re-annotated as short coding regions.
CAGE (cap analysis gene expression) is a popular method to map TSSs. However, it has not been applied to yeast species. Alternatively, a newly developed method called 5′ SAGE was used in S. cerevisiae. In S. pombe, the TSS has been reported to lie 25–40 nt from the TATA element. In contrast, 5′ SAGE maps the TSS for S. cerevisiae 50–125 nt away. The TSS consensus sequence was determined to be A(Arich)5NPyA(A/T)NN(Arich)6, with the underlined letter representing the first transcribed nucleotide. Moreover, 5′ SAGE identifies 24 genes with regulatory uORFs in their 5′ UTRs. These findings were improved by a large-scale cDNA analysis, which revealed that the vast majority of genes have two or more TSSs. These were classified as either having a single dominant TSS or multiple modestly used TSSs. The cDNA analysis also revealed at least one uORF in 2415 5′ UTRs and introns in 32 5′ UTRs.
Recent small-scale experiments in S. cerevisiae (including 5′ RACE) demonstrate that alternative TSS selection can have a significant impact on translational activity and possibly act as a switch between coding and regulatory RNA production, as observed in S. pombe. This effect remains to be analysed on a genome-wide scale.
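The notion of a uORF within a 5′ UTR can be made concrete with a small scanner. The sketch below uses a deliberately simple definition, assumed for illustration: an ATG followed by an in-frame stop codon that lies entirely within the supplied 5′ UTR sequence. uORFs that extend into the main coding region are not reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(utr5):
    """Return (start, end) index pairs of uORFs in a 5' UTR DNA sequence.

    `start` is the index of the A in ATG and `end` is one past the last
    base of the stop codon. The definition used is an assumption of this
    sketch, not the criterion applied in the cited studies.
    """
    uorfs = []
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        # Read codon by codon, in frame with the ATG, until a stop is found.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs
```

Applying such a scan to each alternative TSS of a gene would show how TSS choice can include or exclude a uORF from the transcript.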
High-density tiling arrays can only detect sequences present in the genome; therefore, RNA-Seq is better suited for annotation of splice sites, either by mapping the reads to predicted exon junctions or by de novo annotation via splitting the RNA-Seq reads into two parts. RNA-Seq revealed a high number of alternative splicing events, although not in the form of exon skipping or alternative exon incorporation. Transcripts can be spliced on the basis of growth conditions in S. pombe; interestingly, the unspliced variant may be the protein-coding isoform. The splicing efficiency (i.e. ratio of spliced/unspliced transcripts) varies between genes and growth conditions. It has been determined that spliced or unspliced products affect predicted protein sequences, and there is alternative splicing between vegetative growth and heat-shock response in S. cerevisiae. The mapping of a significant fraction of DRS reads to introns (or exons) indicates a dynamic interplay between polyadenylation and splicing, hence diversifying the organism's transcriptome and probably proteome.
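The splicing efficiency defined above as the ratio of spliced to unspliced transcripts can be computed from read counts, for example from reads spanning an exon–exon junction versus reads spanning an exon–intron boundary. How reads are assigned to the two classes is left out here, and the handling of zero counts is an assumption of this sketch.

```python
def splicing_efficiency(spliced_reads, unspliced_reads):
    """Return the spliced/unspliced ratio for one intron.

    Returns None when there is no coverage at all, and infinity when
    only spliced reads are observed (both conventions are assumptions).
    """
    if unspliced_reads == 0:
        return float("inf") if spliced_reads > 0 else None
    return spliced_reads / unspliced_reads
```

Comparing this ratio for the same intron across growth conditions gives a simple per-intron measure of condition-dependent splicing; in practice, counts would first be normalized for sequencing depth and the number of mappable positions.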
DRS identified widespread antisense transcription in >60% of budding yeast genes. Additionally, co-ordination between sense and antisense polyadenylated transcript levels plays a role in gene expression regulation. Genes that are expressed at low levels in budding yeast show a positive correlation between sense and antisense transcripts, whereas highly expressed genes demonstrate a negative correlation. In contrast, in fission yeast, tiling array analysis suggests that most antisense transcripts are not polyadenylated. This method also detected a likely propensity of highly expressed genes and histone-depleted regions to show elevated antisense transcription. The antisense transcripts generally occur more frequently in the 3′ UTRs than in the 5′ UTRs, which implies that they may also arise from overlapping 3′ UTRs, as shown by RNA-Seq. However, the identification of antisense transcription using any method that involves reverse transcription of RNA to cDNA requires caution, as this can cause secondary mispriming. Generally, it seems that antisense transcription is lower than sense transcription [15,19]. Also, some antisense transcripts are part of lncRNAs (long ncRNAs). These long antisense transcripts, as well as some long intergenic transcripts, are present at less than one copy per cell, which could reflect their tight repression at transcriptional, post-transcriptional or chromatin levels.
An important difference between S. cerevisiae and S. pombe is the lack of an RNAi pathway in budding yeast. The RNAi pathway in fission yeast and in higher eukaryotes is induced by dsRNA formation. dsRNA is processed into short RNA molecules (siRNAs), which target nascent mRNA via complementarity or induce heterochromatin formation and consequent gene silencing at the transcriptional level. Multiple protein complexes, including Dicer, are involved in the RNAi pathway. Small RNA libraries are prepared by size-selecting molecules, followed by RNA-Seq, which determines sequence content. It is apparent that RNAi in S. pombe acts jointly with the exosome to repress developmentally regulated genes and retrotransposons. Furthermore, similar analysis in Dicer-knockout cells revealed Dicer-independent priRNAs (primary small RNAs). It appears that priRNAs originate from degradation of abundant transcripts, and they target antisense transcripts arising from bidirectional transcription of DNA repeats.
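The comparison of sense and antisense levels in weakly and strongly expressed genes can be sketched as a correlation computed separately for two expression classes. The expression cutoff, the input format (one pair of sense and antisense levels per gene) and the use of Pearson correlation are assumptions for this illustration; the cited analysis may have used a different statistic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; None if undefined (too few points, no variance)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)

def sense_antisense_correlation(genes, cutoff):
    """Correlate sense and antisense levels below and above `cutoff`.

    `genes` is a list of (sense_level, antisense_level) pairs; `cutoff`
    is a hypothetical sense-expression threshold separating the classes.
    """
    low = [(s, a) for s, a in genes if s < cutoff]
    high = [(s, a) for s, a in genes if s >= cutoff]
    return {
        "low": pearson([s for s, _ in low], [a for _, a in low]),
        "high": pearson([s for s, _ in high], [a for _, a in high]),
    }
```

With real data, a positive value for the low class and a negative value for the high class would correspond to the pattern reported for budding yeast.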
Over the years, increasingly sophisticated experimental and analytical tools have given scientists the opportunity to gain a deeper insight into gene regulation elements. These may range from the production of regulatory ncRNAs, which cause a gradient of genetic repression and even gene silencing, up to coding transcript variation. Transcript diversity and functionality may depend on the protein sequence [17,19], translational activity or deletion/retention of binding sites for RNA-binding proteins. Sequencing resources have not yet reached their limits. Their depth and precision grow with newly developing techniques.
In the present review, we have summarized a few significant experimental techniques and regulatory factors in several regions on the genome. Moreover, we have provided a simple overview of RNA polymerase II-transcribed regulatory elements. Many other regulatory elements such as rRNA, tRNA, spliceosomal RNA, snoRNA, telomerase RNA, signal recognition particle RNA, RNA components of RNase P and RNase MRP and mitochondrial RNA can also be detected by the techniques described. To date, many effects in gene expression are not fully classified, and possibly only vaguely predicted. We anticipate that the rapid development and reduction in the costs of high-throughput sequencing, which dramatically increases the available genomic and transcriptome sequence resources in many organisms, will lead to the discovery of more regulatory RNAs. These resources, together with the powerful tools of comparative genomics, have the potential to bring new insights into the functional dynamics of gene expression regulation.
The 7th International Fission Yeast Meeting: Pombe 2013: An Independent Meeting/EMBO Conference held at University College London, London, U.K., 24–29 June 2013. Organized and Edited by Jürg Bähler (University College London, U.K.) and Jacqueline Hayles (Cancer Research UK London Research Institute, U.K.).
CUT, cryptic unstable transcript
DRS, direct RNA sequencing
priRNA, primary small RNA
SAGE, serial analysis of gene expression
snoRNA, small nucleolar RNA
SUT, stable unannotated transcript
TIF-Seq, transcript isoform sequencing
TSS, transcription start site
We thank Eleanor White for valuable comments on our paper.
This work was supported by the Engineering and Physical Sciences Research Council (to M.S.), a L’Oreal/UNESCO Woman in Science UK and Ireland award (to M.G.) and a Medical Research Council Career Development Award (to M.G.).