Single-cell genome sequencing of individual archaeal and bacterial cells is a vital approach to decipher the genetic makeup of uncultured microorganisms. With this review, we describe single-cell genome analysis with a focus on the unique properties of single-cell sequence data and with emphasis on quality assessment and assurance.
The planet's biological diversity is overwhelmingly microbial. However, much of this diversity has evaded detection through traditional microbiological approaches, largely as a result of our inability to cultivate most microorganisms in a laboratory setting. Since the development of molecular-based, cultivation-independent tools, we have witnessed a burst in the detection of previously elusive microbial taxa. This was initially driven by the widespread adoption of high-throughput 16S rRNA gene sequencing, with studies now spanning ecological gradients and, in some cases, cross-biome comparisons [2,3]. However, PCR-based 16S rRNA gene surveys are subject to the constraints inherent to any single-gene survey, and in many instances they would have missed entire clades [4,5]. More recently, the genomes of novel phylogenetic groups have been uncovered with single-cell [6–8] and metagenomic sequencing techniques [4,9].
Single-cell genomics (Figure 1) and metagenomics are two techniques that provide access to microbial genomes without the requirement of cultivation. Sequencing all DNA from a bulk sample, also known as metagenomics, has become a powerful technique with which hundreds and sometimes thousands of genomes can be extracted from an individual environmental sample. Alternatively, single-cell genomics has more recently emerged as an approach that provides genomic information for an individual cell [10–12]. This simplifies some of the challenges associated with metagenome assembly and provides a direct link between the genome and any additional cellular DNA, such as phages or plasmids (Figure 2). For example, single-cell genomics has uniquely linked viruses with their host cells in uncultivated clades of bacteria [13,14] and revealed organismal interactions in protists by associating single-cell protist DNA with intracellular bacterial and ssDNA viral sequences.
A schematic representation of the single-cell workflow with a focus on the analysis following sequencing.
Tetranucleotide principal component analysis (top) and GC content analysis (bottom) of target SAGs (blue) alongside additional contaminating sequence (red) and integrated phage sequences (green).
While the preparation of single-cell genomes, or ‘single amplified genomes’ (SAGs), is technically challenging, advances in isolation techniques, sequencing technologies and bioinformatics capabilities have greatly increased throughput and data quality. The analysis of SAG sequence data typically includes the following discrete steps: quality assurance of raw reads, genome assembly using a single-cell-specific assembler, automated and/or manual contaminant identification and removal, annotation, genome quality inspection and categorization according to the minimum information about a single amplified genome (MISAG) standards, and database submission (Figure 1). These individual steps can be assembled into a semi-automated workflow.
In this review, we focus on the unique properties of single-cell sequence data, make recommendations for data handling, including raw data quality control, suggest SAG-specific assembly tools, discuss important contamination identification and removal procedures and, finally, review standards for reporting and submission of SAGs to the public repositories.
Properties of single-cell sequence data
The generation of single-cell genome sequences includes the following major steps: sample preservation and preparation, single-cell isolation, cell lysis, whole genome amplification (WGA), library preparation, sequencing and data analysis [10–12,17] (Figure 1). Given the extremely low yield of DNA from a single microbial cell (∼1–6 fg), laboratory cleanliness needs to be one of the main considerations when preparing single cells for sequencing. The target DNA should be free of contaminating DNA molecules, as even the most minuscule amount of contaminant DNA will co-amplify during the WGA step and will be difficult to remove since single-cell assemblers now include low-coverage regions. Although WGA can be a source of contaminant DNA, this step is essential because libraries cannot yet be prepared with DNA from a single cell. Alternative methods for WGA have been under rapid development over the last several years [20–22], yet multiple displacement amplification (MDA) remains the most commonly used and dependable method for WGA for bacteria and archaea. However, biases typical of MDA include high coverage variation, the production of chimeric sequences and a shift in overall GC content. It is largely these biases that contribute to the downstream challenges associated with the analysis of single-cell genomic data.
Quality assurance of single-cell sequence data
Genomes produced from single cells pose distinct challenges due to the chimeric, biased and potentially contaminated nature of the underlying data, as discussed above. SAG sequence data thus require thorough quality control and specialized data handling.
Read-level quality assessment
Assessing the quality of a single-cell genomic dataset typically begins with a cleanup step prior to assembly. Such read-level quality assessment includes read trimming, quality filtering and read-based contamination identification and removal (Figure 1). For example, adapters are removed and reads are filtered to retain only those above a specific base call quality score. Reads should also be checked against a microbial contaminant database specific to the laboratory where the SAGs were generated, in addition to genomes of microbial organisms that have been identified as common contaminants in the literature, such as Pseudomonas, Delftia and skin-associated bacteria such as Propionibacterium, Streptococcus and Staphylococcus. It is also good practice to map reads against human, dog and cat databases. Tools for read-based decontamination include DeconSeq and modules from the BBTools bioinformatics package (https://sourceforge.net/projects/bbmap/). These tools can map reads against a sequence database of common contaminants and then remove the resulting hits from the dataset.
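The read-level cleanup described above can be illustrated with a minimal Python sketch. It is not the algorithm used by DeconSeq or BBTools, which rely on read mapping; here a simple shared k-mer screen stands in for mapping, reverse complements are ignored, and all function names and thresholds (`min_mean_q`, `min_len`, `max_contam`, `k`) are hypothetical choices made for illustration only.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from uncompressed 4-line FASTQ records."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            yield buffer[0], buffer[1], buffer[3]
            buffer = []


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def kmer_set(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contaminant_fraction(seq: str, contaminant_kmers: Set[str], k: int) -> float:
    """Fraction of a read's distinct k-mers that occur in the contaminant set."""
    read_kmers = kmer_set(seq, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & contaminant_kmers) / len(read_kmers)


def clean_reads(
    records: Iterable[FastqRecord],
    contaminant_kmers: Set[str],
    k: int = 21,              # hypothetical k-mer size
    min_mean_q: float = 20.0,  # hypothetical quality cutoff
    min_len: int = 50,         # hypothetical length cutoff
    max_contam: float = 0.5,   # hypothetical contaminant k-mer cutoff
) -> List[FastqRecord]:
    """Drop short, low-quality and contaminant-like reads."""
    kept = []
    for header, seq, qual in records:
        if len(seq) < min_len or mean_phred(qual) < min_mean_q:
            continue
        if contaminant_fraction(seq, contaminant_kmers, k) > max_contam:
            continue
        kept.append((header, seq, qual))
    return kept
```

In practice, the contaminant k-mer set would be built from the laboratory-specific and literature-derived contaminant genomes mentioned above, and a dedicated mapper would replace the k-mer screen.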
As discussed above, MDA leads to highly uneven coverage. Variability in coverage can be normalized, which is beneficial to assemblers, as normalization decreases runtime and memory requirements. However, the normalization step is becoming increasingly unnecessary as single-cell-specific assemblers are now publicly available, such as SPAdes and IDBA-UD. These assembly algorithms make use of multiple coverage cutoffs as opposed to a single coverage threshold, resulting in the inclusion of a larger fraction of the data when compared with traditional assemblers. These approaches avoid reconstructing a string of k-mers with static read coverage thresholds; SPAdes uses k-bimers to build a topology of coverage and lengths before assigning a sequence, and IDBA-UD iteratively adjusts coverage thresholds. In addition, these assemblers can use the reads from either end of a chimera without requiring direct linkage between them.
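To make the idea of coverage normalization concrete, the following Python sketch down-samples reads in a digital-normalization style: a read is kept only while the median abundance of its k-mers, counted over the reads already kept, is below a target. This is one possible normalization strategy chosen for illustration, not necessarily the one implemented by any tool named in this review; the function name and the `k` and `target` parameters are hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def normalize_coverage(reads: Iterable[str], k: int = 21, target: int = 20) -> List[str]:
    """Keep a read only if its k-mers are not yet covered `target` times."""
    counts: Counter = Counter()  # k-mer abundances among the reads kept so far
    kept: List[str] = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer abundance approximates the coverage this read already has.
        if median(counts[km] for km in read_kmers) < target:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

Reads from over-amplified regions are discarded once the target is reached, whereas reads from poorly amplified regions are all retained, which flattens the coverage distribution without removing low-coverage sequence.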
Assessment of poolmate cross-contamination
The quality control described above does not take into account multiplex sequencing of single-cell genomes on high-throughput platforms. Depending on the capacity of the sequencing facility, SAG library preparation can take place in multi-welled plates, generating barcoded libraries for multiplexed sequencing of library pools. Multiplexing samples, particularly MDA-amplified samples with their biased coverage, for sequencing on the Illumina platform can, however, cause significant ‘bleed over’ between poolmates. For example, Sinha et al. showed that 5–10% of reads can be assigned to the wrong sample owing to low levels of index-free primers present in the multiplexed pool, when using the HiSeq platform. More stringent library cleanup procedures, the use of dual indexes and quality filtering are all methods that can reduce this effect. Poolmate cross-contamination should also be detected during sequence analysis, since even a low fraction of cross-talk between multiplexed libraries (e.g. 0.01%) can have large effects on the assembly, a consequence of the uneven amplification coverage of WGA methods like MDA. To assess poolmate cross-contamination, it is good practice to map the reads of a given library to all assemblies across a plate, in an all-vs-all fashion. CrossBlock is a module available in the BBTools software package (https://sourceforge.net/projects/bbmap/) that performs this type of analysis. The program compares the coverage of contigs from one library to the coverage of all other libraries in a pool. However, this approach is only applicable when library pools contain different organisms, as contigs shared between highly similar organisms would be falsely flagged as contaminants.
To our knowledge, CrossBlock is the only currently available tool specifically designed for this analysis, though other similar approaches have been performed: for example, searching the contigs of a genome against all other genomes in the pool using blastn and removing those contigs that match above user-defined identity and length thresholds.
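The blastn-based alternative can be sketched in a few lines of Python. The sketch assumes that an all-vs-all search has already been run with tabular output (`-outfmt 6`, whose first four columns are query ID, subject ID, percent identity and alignment length) and that each contig ID carries its library name as a prefix separated by `|`. That naming convention, the function name and the default thresholds are hypothetical and would need to be adapted to the dataset at hand.

```python
from typing import Iterable, Set


def flag_cross_hits(
    blast_lines: Iterable[str],
    min_identity: float = 95.0,  # hypothetical identity threshold (%)
    min_length: int = 500,       # hypothetical alignment length threshold (bp)
) -> Set[str]:
    """Return query contigs that match a contig from a different library."""
    flagged: Set[str] = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        # Assumed ID convention: "<library>|<contig>".
        q_lib = qseqid.split("|", 1)[0]
        s_lib = sseqid.split("|", 1)[0]
        if q_lib == s_lib:
            continue  # hits within the same library are not cross-contamination
        if pident >= min_identity and length >= min_length:
            flagged.add(qseqid)
    return flagged
```

A flagged contig is only a candidate: as noted above, pools containing closely related organisms will produce such matches legitimately, so the direction of contamination and the final decision to remove a contig still require inspection, for example by comparing coverage between libraries.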
Contig-level quality assurance
Following assembly, small contigs are removed, as these are more likely to contain assembly errors. At the U.S. Department of Energy's Joint Genome Institute, contigs <2 kb in length are removed from all SAG datasets. After removal of small contigs, screening for additional contaminating contigs originating from organismal DNA not representative of the target cell is typically performed (Figure 2). Identification and removal of contaminants following assembly can be performed with many currently available semi-automated and automated tools. Generally, assembly-based contaminant screening tools scan for outlying features of an SAG, including unusual 16S rRNA genes and protein-coding genes, abnormal k-mer frequencies and/or variation in GC content (Figure 2). These features can be identified interactively within the IMG interface (tutorial available here: https://img.jgi.doe.gov/er/doc/SingleCellDataDecontamination.pdf) and within the recently developed analysis and visualization platform, Anvi'o. Both tools provide interactive platforms, facilitating the removal of contaminating contigs from an assembly based on outlying genomic signatures. Anvi'o and another recently developed software package, CheckM, estimate genome completeness and contamination based on the presence of single-copy marker genes. ProDeGe and acdc are additional tools that perform automated contamination screens, returning separate fasta files for clean and contaminant contigs. These tools can be used in combination with tools like CheckM and Anvi'o, especially on large sets of SAGs where manual curation is challenging. As such, automated screening methods can be performed on SAGs with high contamination estimates, then checked for completeness and contamination using CheckM and/or Anvi'o, followed by additional rounds of cleanup if necessary.
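As a simple illustration of the first two screening steps, the Python sketch below removes contigs shorter than 2 kb and then flags contigs whose GC content deviates strongly from the length-weighted mean of the assembly. It is a deliberately crude stand-in for the tools named above; the function names and the `max_gc_dev` cutoff are hypothetical, and a flagged contig is a candidate for manual inspection rather than a confirmed contaminant.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {contig name: sequence}."""
    contigs: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in contigs.items()}


def gc_fraction(seq: str) -> float:
    """GC fraction over unambiguous bases (A, C, G, T)."""
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def screen_contigs(
    contigs: Dict[str, str],
    min_len: int = 2000,       # 2 kb length cutoff, as described in the text
    max_gc_dev: float = 0.10,  # hypothetical GC deviation cutoff
) -> Tuple[Dict[str, str], List[str]]:
    """Drop short contigs, then flag GC outliers among the remainder."""
    kept = {n: s for n, s in contigs.items() if len(s) >= min_len}
    total = sum(len(s) for s in kept.values())
    if total == 0:
        return kept, []
    mean_gc = sum(gc_fraction(s) * len(s) for s in kept.values()) / total
    flagged = [n for n, s in kept.items() if abs(gc_fraction(s) - mean_gc) > max_gc_dev]
    return kept, flagged
```

Because the mean itself is shifted by any contaminant sequence, a more robust implementation might use the median GC content or combine GC with the other signals listed above.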
Figure 2 displays a schematic of a target SAG (blue) that is contaminated with contigs derived from another cell (red) (panel A), the same SAG following contaminant removal (panel B) and a different target SAG (blue) exhibiting features that could be flagged as contamination (panel C) (i.e. false positives). The situation depicted in Figure 2C can happen when an SAG contains rRNA genes with variable nucleotide composition and/or an integrated phage that is embedded within a contig that contains regions of deviating tetramer composition (Figure 2C). Because all automated methods produce false positives and negatives, we highly recommend manual evaluation of all SAGs prior to making biological inferences and submitting to the public databases.
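The tetramer signal referred to here can be computed directly. The following Python sketch builds a tetranucleotide frequency profile for each contig and reports its Euclidean distance from the pooled profile of the whole assembly. It is an assumption-laden simplification: forward-strand tetramers only (tools commonly merge reverse complements), no dimensionality reduction such as the principal component analysis shown in Figure 2, and hypothetical function names throughout.

```python
from itertools import product
from math import sqrt
from typing import Dict, List

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_counts(seq: str) -> List[int]:
    """Count the 256 tetranucleotides in a sequence, skipping ambiguous bases."""
    seq = seq.upper()
    counts = [0] * len(TETRAMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    return counts


def to_frequencies(counts: List[int]) -> List[float]:
    """Convert counts to relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def tetra_distances(contigs: Dict[str, str]) -> Dict[str, float]:
    """Euclidean distance of each contig's profile from the pooled profile."""
    per_contig = {name: tetra_counts(seq) for name, seq in contigs.items()}
    pooled = [0] * len(TETRAMERS)
    for counts in per_contig.values():
        pooled = [p + c for p, c in zip(pooled, counts)]
    reference = to_frequencies(pooled)
    return {
        name: sqrt(sum((f - r) ** 2 for f, r in zip(to_frequencies(counts), reference)))
        for name, counts in per_contig.items()
    }
```

A large distance marks a contig for inspection only. As panel C illustrates, rRNA operons and integrated phages can yield deviating profiles within a genuine target genome, which is why such scores should support, not replace, manual evaluation.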
Genome quality reporting
To avoid making biological inferences with contaminated SAGs, it is critical to confirm and report SAG quality before performing comparative genome analysis. Reporting SAG quality also informs other researchers who retrieve SAG data from public databases for their own analyses. For quality reporting, we suggest following the MISAG standards. These are simple standards that require a minimal set of mandatory genome quality criteria such as the reporting of basic metadata, assembly statistics, and genome completeness and contamination estimates. Additional mandatory reporting standards include fields specific to laboratory production (e.g. cell isolation, cell lysis and WGA), taxonomic identification of SAGs, identification of ribosomal RNA genes and software used for assembly and contamination detection and removal. We strongly suggest following these guidelines, as the criteria outlined in MISAG will be valuable for future comparative genomic studies: users of public databases can filter genomes based on the genome quality required for a particular downstream analysis.
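Quality categorization from completeness and contamination estimates can be expressed as a small decision function. The Python sketch below is illustrative only: the numeric thresholds are not stated in this review and are given here as we understand the draft-genome categories of the published standard, so they should be treated as assumptions and verified against the MISAG publication before use. The function name, the category labels and the omission of the "finished" category are likewise simplifications.

```python
# Thresholds are ASSUMPTIONS; confirm against the published MISAG standard.
HIGH_MIN_COMPLETENESS = 90.0    # percent; high quality assumed to require more than this
HIGH_MAX_CONTAMINATION = 5.0    # percent; high quality assumed to require less than this
HIGH_MIN_TRNAS = 18             # assumed minimum tRNA count for high quality
MEDIUM_MIN_COMPLETENESS = 50.0  # percent; medium quality assumed to require at least this
DRAFT_MAX_CONTAMINATION = 10.0  # percent; medium and low quality assumed to require less


def misag_category(
    completeness: float,
    contamination: float,
    has_all_rrnas: bool,
    trna_count: int,
) -> str:
    """Assign a draft-quality label from marker-gene-based estimates."""
    if contamination >= DRAFT_MAX_CONTAMINATION:
        return "unclassified"
    if (
        completeness > HIGH_MIN_COMPLETENESS
        and contamination < HIGH_MAX_CONTAMINATION
        and has_all_rrnas
        and trna_count >= HIGH_MIN_TRNAS
    ):
        return "high-quality draft"
    if completeness >= MEDIUM_MIN_COMPLETENESS:
        return "medium-quality draft"
    return "low-quality draft"
```

The completeness and contamination inputs would come from a marker-gene-based estimator such as CheckM, and the rRNA and tRNA inputs from the annotation step.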
Downstream single-cell genomic analysis
Once a single-cell genome is curated, it can be analyzed together with additional genomic sequences to place it into a larger evolutionary, ecological and functional context. For example, phylogenetically informative genes such as the 16S rRNA gene and sets of conserved protein-coding marker genes have been used to assess intra- and inter-phylum-level relationships of microbial dark matter lineages [7,42]. When closely related isolate genomes are unavailable, SAGs can be used as reference genomes to recruit metagenomic reads for quantifying abundance patterns across temporal and spatial gradients [7,43–45]. SAGs have further shown utility in the analysis of recombination frequencies in bacterial populations, such as freshwater bacteria of the SAR11 clade, and for the determination of the overall genetic heterogeneity within discrete populations in honey bee gut symbionts and wild Prochlorococcus. Unlike metagenome-assembled genomes (MAGs), single-cell datasets are particularly powerful in linking phage sequences to their host [13,14] (represented schematically in Figure 2C) or deciphering eukaryote multipartite associations. As such, single-cell sequence data offer a broad array of downstream analyses, depending on the research questions to be addressed.
Single-cell sequencing of individual bacterial and archaeal cells is becoming an important tool available to the microbiologist as single-cell sequencing is highly complementary to other approaches including traditional culture-based approaches and metagenomic sequencing. Single-cell sequencing has demonstrated its utility across disciplines including microbial ecology, evolutionary biology, agriculture and medicine. With this review, we provide suggestions for single-cell analysis workflows going from raw sequence data to the submission of single-cell genomes to public databases. As technical advancements continue and bioinformatic tools are refined, our ability to resolve whole microbial communities down to the genetic differences defining individual strains will improve and, undoubtedly, benefit from the production and analysis of DNA sequences originating from an individual cell.
Single-cell genome sequencing has become an important complement to metagenomics, facilitating the direct extraction of genomes from environmental samples in the absence of cultivation, yet requiring amplification of the DNA.
Due to the unique nature of the resulting single-cell sequence data, it is of value to outline recommendations for the analysis of single-cell genomes, specifically describing a start-to-finish pipeline, from the assessment of read and contig quality to database submission.
Thoughtful consideration and execution of each step in a single-cell genome analysis pipeline is critically important for the reporting and deposition of single-cell genomes to the public databases.
R.M.B., D.F.R.D. and T.W. wrote the paper.
The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under contract no. DE-AC02-05CH11231. DFRD was supported under the LBNL Microbes to Biomes LDRD entitled ‘Tackling microbial-mediated plant carbon decomposition using “function-driven” genomics’.
The Authors declare that there are no competing interests associated with the manuscript.