A beginner's guide to single-cell transcriptomics

Beginner's Guides each cover a key technique and offer the scientifically literate but not necessarily expert audience a background briefing on the underlying science of a technique that is (or will be) widely used in molecular bioscience. The series covers a mixture of techniques, including some that are well established amongst a subset of our readership but not necessarily familiar to those in different specialisms, or the reverse. This is our Beginner's Guide to single-cell transcriptomics.

Hojae Lee, Hengshi Yu and Joshua Welch (University of Michigan, USA)

Why single-cell transcriptomics?
The cell is the fundamental unit of biological life. The DNA sequence, or genotype, is nearly identical across all cells within a multicellular organism, but cellular phenotypes vary widely across space and time. For example, the morphology and function of a neuron are vastly different from those of a red blood cell.
Measuring the transcriptome (the set of RNA transcripts expressed in a biological sample) helps to decipher how a single genome encodes myriad phenotypic properties of cells. As readers of The Biochemist will know, different cells 'execute' different portions of the genome through gene expression, in which RNA molecules are transcribed from DNA and subsequently translated into proteins. Because of the complementary base-pairing properties of DNA and RNA, it is possible to determine the identities and quantities of RNA transcripts using high-throughput sequencing, whereas unbiased quantification of protein abundance is much more challenging. Thus, transcriptomic measurements serve as a quantifiable intermediate phenotype indicating cellular functional properties and latent cellular states.
Traditional transcriptomic measurements are performed 'in bulk', yielding the average expression profile of an entire population of cells. However, such average measurements mask important cellular heterogeneity and obscure the link between genotype and phenotype. Complex tissues such as the brain are a mixture of highly diverse cells, containing many distinct cell types. If you think of each cell type as a fruit, the brain is like a fruit salad, and a bulk transcriptomic measurement is like a fruit smoothie. To understand how genetic mutations cause neurological disease, we need to determine how each mutation changes gene expression within specific cell types. But doing this with bulk transcriptomic measurements is quite challenging, like trying to identify which ingredient in a fruit smoothie is rotten.
Thus, single-cell transcriptomics (scRNA-seq for short) is an essential tool for understanding mixed cell populations. Single-cell resolution allows us to determine the 'taste' of each individual ingredient in the smoothie, providing a comprehensive catalogue of cell types in complex tissues such as the brain. In heterogeneous diseased tissues, such as tumours, scRNA-seq provides a clear picture of how the gene expression of each diseased cell differs from that of the healthy cells. Single-cell resolution is also crucial in rapidly changing populations such as differentiating stem cells, enabling quantification of the continuous expression changes that drive cell fate specification.

How the technology works
The first step in scRNA-seq is cell dissociation, which creates a suspension of single cells from solid tissues so that individual cells can be isolated for analysis. There are two main types of dissociation methods: mechanical and enzymatic. Mechanical dissociation involves physical manipulation by pipetting, vortexing or shearing and works best in loosely associated tissues. Enzymatic dissociation treats tissues with proteolytic enzymes, such as trypsin or collagenase, that digest the proteins holding cells together. This method is effective at separating cells but can modify cell-surface proteins and may influence downstream cell measurements in some cases. Different cell types respond very differently to dissociation; for example, neurons are more fragile and prone to lysis during dissociation than glial cells. The choice of dissociation method is an important consideration, because it may affect cell viability and potentially bias the sample towards specific cell populations.

The single-cell suspension is then used to isolate individual cells. Early single-cell experiments performed cell isolation manually by pipetting or laser capture microdissection, which are labour-intensive and low throughput. The crucial development of microfluidic approaches for cell isolation enabled rapid simultaneous isolation of thousands of cells. Microfluidic devices use controlled fluid flow to guide cells through channels within a microfabricated chip. Currently, the most popular cell isolation method is droplet microfluidics, in which a stream of cells in suspension is merged with a stream of oil to form tiny aqueous droplets. Droplet-based microfluidics offers high throughput (up to thousands of cells per second) and low cost due to nanolitre-sized volumes for reagents. By flowing the droplet stream much faster than the cell stream, cells can be loaded so that most droplets that capture a cell contain exactly one, though many droplets will inevitably be empty and some will contain more than one cell. A crucial part of droplet microfluidic technology is the use of microscopic barcoded beads, each containing many copies of a unique DNA sequence attached to the bead surface. Each droplet contains a barcoded bead along with chemical reagents for cell lysis and complementary DNA (cDNA) library construction (Figure 1A).
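The trade-off between empty droplets and droplets holding several cells can be illustrated with a small calculation. The sketch below assumes that the number of cells per droplet follows a Poisson distribution, a common modelling assumption that is not stated in this guide, and the loading rate of one cell per ten droplets is purely hypothetical.

```python
import math


def droplet_occupancy(mean_cells_per_droplet):
    """Return P(empty), P(exactly one cell), P(two or more cells).

    Assumes cell counts per droplet are Poisson distributed.
    """
    lam = mean_cells_per_droplet
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multi = 1.0 - p_empty - p_single
    return p_empty, p_single, p_multi


# Hypothetical loading rate: on average one cell per ten droplets.
p_empty, p_single, p_multi = droplet_occupancy(0.1)

# Among droplets that captured at least one cell, the share holding exactly one.
singlet_fraction = p_single / (p_single + p_multi)

print(f"empty={p_empty:.3f} single={p_single:.3f} multiple={p_multi:.4f}")
print(f"singlets among occupied droplets: {singlet_fraction:.3f}")
```

Under this assumed model, diluting the cell stream lowers the share of multi-cell droplets at the price of more empty ones, which is why dilute loading is tolerable when droplets are cheap to produce.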
Because a single cell contains a tiny amount of RNA, the RNA must be converted into cDNA and chemically amplified to create a cDNA library. Each cell is lysed and its transcripts are captured using reverse-transcription primers containing oligo-dT sequences, which hybridize with poly(A)-tails. Because this process happens to each cell in isolation (within a single droplet), unique DNA barcodes can be added to identify and tag all molecules originating from the same cell. Each RNA molecule is then reverse transcribed into cDNA, which is amplified using polymerase chain reaction (PCR), as shown in Figure 1B. The primers also carry short random DNA tags called unique molecular identifiers (UMIs), which label each captured RNA molecule before amplification. PCR copies of the same molecule therefore share a UMI and can be recognized later, which mitigates quantification errors due to different amounts of amplification among transcripts. Some cDNA library construction protocols use a mechanism called template switching to create full-length cDNA copies of transcripts, while other protocols capture only the last few hundred bases upstream of the 3′ transcript ends.

The cDNA library can then be sequenced (Figure 1C) using a next-generation sequencing platform such as the Illumina NextSeq or a single-molecule sequencer such as the Oxford Nanopore MinION. Because a single cDNA library often contains data from thousands of cells, the number of sequencing reads per cell is usually much lower than the number of reads in a bulk transcriptomic experiment.

Computational analysis of single-cell transcriptomic data
Computational analysis is essential for proper processing and interpretation of scRNA-seq data. Key analysis steps include demultiplexing, gene expression quantification, data filtering, data normalization and applying statistical techniques to identify important biological differences among cells (Figure 2).
Cell barcode sequences enable demultiplexing: matching each transcript to its cell source. The sequencer outputs each transcript's cDNA sequence, including the cell barcode and unique molecular identifier introduced during library preparation. The demultiplexing process takes into account sequencing errors introduced into the barcodes, collapsing barcodes that differ by one or two nucleotides into a single consensus sequence corresponding to each cell. The result of demultiplexing is a set of sequencing reads, each assigned to a single cell barcode.
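The barcode-collapsing idea can be sketched in a few lines. The greedy strategy below (most abundant barcodes become consensus sequences, and rarer barcodes within a small Hamming distance are folded into them) is an illustrative assumption rather than the method of any particular pipeline, and the barcode sequences and read names are invented.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_barcodes(barcodes, max_mismatches=1):
    """Map every observed barcode to a consensus barcode.

    Barcodes are visited from most to least abundant. A barcode joins the
    first existing consensus within max_mismatches; otherwise it becomes a
    new consensus sequence itself.
    """
    counts = Counter(barcodes)
    consensus = []
    mapping = {}
    for bc, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for ref in consensus:
            if len(ref) == len(bc) and hamming(ref, bc) <= max_mismatches:
                mapping[bc] = ref
                break
        else:
            consensus.append(bc)
            mapping[bc] = bc
    return mapping


# Toy reads as (cell barcode, read name); "AAAACCCG" carries one sequencing error.
reads = [("AAAACCCC", "r1"), ("AAAACCCC", "r2"), ("AAAACCCG", "r3"),
         ("GGGGTTTT", "r4"), ("GGGGTTTT", "r5"), ("AAAACCCC", "r6")]

mapping = collapse_barcodes([bc for bc, _ in reads])

# Group reads by their corrected cell barcode.
reads_by_cell = {}
for bc, read in reads:
    reads_by_cell.setdefault(mapping[bc], []).append(read)

print(reads_by_cell)
```

Raising `max_mismatches` to 2 tolerates more sequencing errors but increases the chance of merging two genuinely different cells, so the threshold depends on how far apart the true barcodes are.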
Demultiplexed reads are used to estimate the expression level of each gene in each cell. Estimating gene expression requires aligning each read to the genome using sophisticated computational approaches that are optimized for speed and memory efficiency. Once the genomic location of each read is determined, the total number of transcripts from each gene can be counted, using the unique molecular identifiers to prevent double-counting multiple PCR copies of the same transcript. This process yields a matrix of integer counts indicating how many transcripts of each gene are expressed within each cell.
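The counting step can be sketched as follows, assuming alignment has already assigned each read to a gene. The cell names, gene names and identifier sequences are invented, and real pipelines additionally correct sequencing errors within the identifiers, which this minimal example does not attempt.

```python
from collections import defaultdict


def count_umis(aligned_reads):
    """Build {cell: {gene: count}}, counting each (cell, gene, UMI) only once."""
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for cell, gene, umi in aligned_reads:
        key = (cell, gene, umi)
        if key in seen:
            # A PCR copy of a molecule that has already been counted.
            continue
        seen.add(key)
        counts[cell][gene] += 1
    return counts


def to_matrix(counts, cells, genes):
    """Arrange the counts as a cells-by-genes matrix of integers."""
    return [[counts.get(c, {}).get(g, 0) for g in genes] for c in cells]


# Toy aligned reads as (cell barcode, gene, unique molecular identifier).
aligned_reads = [
    ("cell1", "GeneA", "ACGT"),
    ("cell1", "GeneA", "ACGT"),  # PCR duplicate of the read above
    ("cell1", "GeneA", "TTGA"),
    ("cell1", "GeneB", "CCAT"),
    ("cell2", "GeneB", "ACGT"),
    ("cell2", "GeneB", "ACGT"),  # PCR duplicate
]

matrix = to_matrix(count_umis(aligned_reads), ["cell1", "cell2"], ["GeneA", "GeneB"])
print(matrix)
```

Six reads collapse to four molecules here, because two of the reads are amplification copies sharing a cell, gene and identifier with an earlier read.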
Robust analysis of scRNA-seq data requires filtering the data to remove several types of artefacts, including doublets (a single barcode containing transcripts from two cells) and low-quality cell samples. Doublets often yield a higher number of transcripts than single cells, and frequently contain marker genes from multiple distinct cell types. Low-quality cell samples may result from dead cells or cell debris. Such cells often contain degraded RNA, which leads to a low fraction of mapped reads during read alignment. Low-quality cells may also contain higher percentages of ribosomal RNA and mitochondrial RNA, which are less vulnerable to degradation than messenger RNA. Thus, these samples may also be filtered by their high expression values of ribosomal and mitochondrial RNAs. Removing doublets and low-quality cells helps to keep these artefacts out of downstream analyses.
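A simple threshold-based filter along these lines might look like the sketch below. The thresholds, field names and cell records are hypothetical and would need tuning for each dataset; dedicated doublet-detection methods use more than a transcript-count cut-off, and only the mitochondrial fraction is checked here for brevity.

```python
def qc_filter(cells, min_counts=500, max_counts=20000, max_mito_fraction=0.2):
    """Return the barcodes of cells that pass simple quality thresholds."""
    kept = []
    for cell in cells:
        total = cell["total_counts"]
        mito_fraction = cell["mito_counts"] / total if total else 1.0
        if total < min_counts:
            continue  # too few transcripts: likely debris or a dead cell
        if total > max_counts:
            continue  # unusually many transcripts: possible doublet
        if mito_fraction > max_mito_fraction:
            continue  # mitochondrial RNA dominates: likely degraded
        kept.append(cell["barcode"])
    return kept


# Invented per-cell summaries.
cells = [
    {"barcode": "cellA", "total_counts": 3000, "mito_counts": 150},
    {"barcode": "cellB", "total_counts": 120, "mito_counts": 10},
    {"barcode": "cellC", "total_counts": 45000, "mito_counts": 900},
    {"barcode": "cellD", "total_counts": 2500, "mito_counts": 1200},
]

kept = qc_filter(cells)
print(kept)
```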
Next, the gene expression matrix is normalized to make the expression levels comparable across both cells and genes. The number of transcripts captured per cell varies due to random technical effects, so the gene expression matrix must be normalized so that each cell has the same total number of transcripts. If this step is not performed, the differences in total transcript counts per cell will overwhelm true biological variation. Each entry of the expression matrix is also log-transformed to down-weight extreme values, and each gene is scaled to ensure that genes expressed at different levels are comparable.
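The three normalization steps can be written compactly with NumPy. This is a minimal sketch under stated assumptions: the target of 10,000 transcripts per cell is an arbitrary illustrative choice, one is added before taking the logarithm so that zero counts stay defined, and the count matrix is invented.

```python
import numpy as np


def normalize(counts, target_sum=10_000):
    """Normalize a cells-by-genes count matrix in three steps.

    1. Rescale every cell to the same total (target_sum).
    2. Log-transform with log(1 + x) to down-weight extreme values.
    3. Centre and scale every gene to zero mean and unit variance.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    scaled = counts / totals * target_sum
    logged = np.log1p(scaled)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # leave genes with no variation unscaled
    return (logged - mean) / std


# Toy matrix: the first two cells have identical composition but were
# captured at a tenfold different depth.
counts = [[10, 0, 90],
          [1, 0, 9],
          [50, 50, 0]]

z = normalize(counts)
print(np.round(z, 3))
```

After normalization the first two cells become identical, which shows how the procedure removes differences in capture depth while keeping differences in composition.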

The most interesting part of scRNA-seq analysis is applying computational and statistical techniques to explore biological differences among the cells sequenced. An important first step in such exploration is dimensionality reduction, which transforms the high-dimensional data matrix (in which each gene represents a different 'dimension') into a low-dimensional space, where each dimension represents an important axis of variation among the cells. For example, one axis may correspond to a cell-cycle stage, another to differentiation progress and a third to cell type. A good dimensionality reduction technique constructs the low-dimensional space so that nearby cells in the original space are closely grouped in the low-dimensional space. Dimensionality reduction brings several benefits: it mitigates experimental noise and extracts robust and interpretable features, facilitating more accurate discovery of differences among cells. Commonly used dimensionality reduction techniques in scRNA-seq analysis include principal component analysis (PCA), independent component analysis (ICA), non-negative matrix factorization (NMF) and deep neural networks. Each technique calculates the low-dimensional space in a slightly different manner, and different approaches each have advantages in certain settings. Another important motivation for dimensionality reduction is data visualization: representing the differences among cells as a 2D or 3D picture amenable to human interpretation (Figure 2). Two different approaches, t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP), are frequently used to visualize single-cell data.
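As an illustration of the simplest of these techniques, the sketch below computes PCA through a singular value decomposition in NumPy. The simulated matrix (40 cells, 50 genes, with the second half of the cells expressing ten genes more strongly) is invented for the example, and the function name `pca` is ours rather than a library call.

```python
import numpy as np


def pca(matrix, n_components=2):
    """Project a cells-by-genes matrix onto its leading principal components.

    Returns the per-cell scores, the gene loadings of each component and
    the fraction of total variance each component explains.
    """
    x = np.asarray(matrix, dtype=float)
    x = x - x.mean(axis=0)  # centre every gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, vt[:n_components], explained[:n_components]


# Simulated normalized data: two groups of 20 cells measured over 50 genes.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(40, 50))
data[20:, :10] += 3.0  # the second group expresses ten genes more strongly

scores, loadings, explained = pca(data, n_components=2)
print(scores.shape, np.round(explained, 3))
```

In this toy dataset the first component captures the difference between the two groups, so 50 gene measurements per cell are summarized by a single informative coordinate.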
The existence of multiple cell types is one important type of cellular variation that scRNA-seq can be used to discover. Distinct cell types result in multiple 'clusters', in which a group of cells each express a similar set of genes that is also distinct from the set of genes expressed in other cell groups. A variety of algorithms can detect such clusters, using nothing but the low-dimensional representation of the gene expression matrix. Once clusters are detected, statistical approaches can identify the unique genes that are differentially expressed among different clusters. These genes provide biomarkers for each cell type and give insight into functional properties and regulatory mechanisms. The ability to discover such cell types and properties in a completely unbiased fashion, making no assumptions about the cells present in the sample and using no information other than gene expression, is one of the key advantages of scRNA-seq.
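Clustering followed by marker discovery can be sketched with k-means, chosen here only because it is short to write; it is one of many possible clustering algorithms and is not singled out by this guide. The embedding and expression values are simulated, and ranking genes by the difference in mean expression is a crude stand-in for a proper statistical test.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means clustering; returns a cluster label per point and the centres."""
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centre.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Simulated low-dimensional embedding with two well-separated groups of cells.
rng = np.random.default_rng(1)
embedding = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                       rng.normal(5.0, 0.5, size=(20, 2))])

# Simulated expression of three genes; gene 0 is elevated in the second group.
expression = rng.poisson(1.0, size=(40, 3)).astype(float)
expression[20:, 0] += 10.0

labels, centers = kmeans(embedding, k=2)

# Rank genes by the difference in mean expression between the two clusters.
in_last = labels == labels[-1]
mean_diff = expression[in_last].mean(axis=0) - expression[~in_last].mean(axis=0)
top_marker = int(np.argmax(np.abs(mean_diff)))
print(labels, top_marker)
```

Note that the clustering uses only the embedding, while the marker search returns to the expression values, mirroring the two-stage procedure described above.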

Recent discoveries using the technology
Researchers are increasingly using scRNA-seq for scientific discovery across a range of biological systems. Broadly, the technology is being used to identify all of the cell types within healthy tissues, characterize cell-type-specific differences between healthy and diseased tissue, and model gene expression changes during dynamic processes such as differentiation (Figure 3).
Recent studies used scRNA-seq to profile hundreds of thousands of cells from the mouse brain, identifying hundreds of transcriptionally distinct groups of cells and creating a detailed census of cell types. Additionally, researchers discovered a new lung cell type, the ionocyte, using scRNA-seq. This was an important discovery because ionocytes turn out to be the primary cells that express CFTR, the gene responsible for the disease cystic fibrosis. Recognizing the translational importance of cataloguing cell types, the Human Cell Atlas project is using scRNA-seq to map all cell types within the human body, with the ultimate goal of improving the diagnosis, monitoring and treatment of human disease.
scRNA-seq also detects the differences between healthy and diseased tissues. Recent studies compared single-cell expression data from brains of people with and without autism spectrum disorder and identified cell-type-specific gene expression changes between control and affected individuals. Similar efforts have revealed insights into Alzheimer's disease and inflammatory bowel disease. Such studies represent an important first step towards understanding the pathways of aberrant gene expression that underlie a variety of diseases.

In addition to identifying discrete cell types, scRNA-seq reveals trajectories of continuous gene expression changes that cells undergo during dynamic processes such as differentiation and reprogramming. Because each cell's expression profile reflects one moment in the process, it is possible to replay the steps of the process by stitching together similar cells, somewhat like reconstructing a movie from a set of its frames presented in random order. Approaches of this kind can yield valuable biological insights, such as identifying the type of stem cell progenitor for each mature cell type and determining the genes that indicate when a multipotent cell commits to one particular fate. Remarkably, a recent study showed that scRNA-seq can predict the future states of differentiating cells by using the ratio between spliced and unspliced RNA transcripts to calculate RNA velocity, a vector that indicates the direction and speed at which a cell is moving through gene expression space. Visualizing these vectors reveals where each cell is heading along a trajectory. The combination of scRNA-seq data and computational modelling promises to transform our understanding of cell differentiation and reprogramming.
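The intuition behind RNA velocity can be illustrated for a single gene. The sketch below assumes a simplified steady-state model that is not spelled out in this guide: at equilibrium the unspliced count is proportional to the spliced count with slope gamma, so a cell with more unspliced RNA than expected is switching the gene on and a cell with less is switching it off. The counts are invented and the function names are ours.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin of unspliced versus spliced counts."""
    numerator = sum(s * u for s, u in zip(spliced, unspliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator if denominator else 0.0


def velocity(spliced, unspliced, gamma):
    """Per-cell velocity for one gene: observed minus expected unspliced counts."""
    return [u - gamma * s for s, u in zip(spliced, unspliced)]


# Invented counts for one gene in four cells assumed to be at steady state.
steady_spliced = [10, 20, 30, 40]
steady_unspliced = [5, 10, 15, 20]
gamma = fit_gamma(steady_spliced, steady_unspliced)

# Two further cells with the same spliced count but different unspliced counts.
v = velocity([20, 20], [15, 5], gamma)
print(gamma, v)
```

A positive value suggests that expression of the gene is about to rise in that cell and a negative value that it is about to fall; combining such values across many genes gives the vector described above.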