Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
Genome sequencing enables discovery of nearly a complete genome sequence of an individual. While the first draft for human genome cost $2.7 billion in 2003 [1,2], the cost of genome sequencing has decreased at a rate faster than that of Moore's Law [3,4], and it has become considerably inexpensive as it currently costs less than $1000 to sequence a human genome . Given the rapid decrease in cost and its ability to detect nearly all genetic variants in an individual genome, genome sequencing has become very popular in several fields of genetics such as clinical genetics [6,7], cancer genetics [8,9], population genetics [10,11], and genetic studies for complex diseases and traits [12,13]. Currently, there are several ongoing large-scale whole-genome sequencing (WGS) studies aimed at identifying genetic variants influencing a diverse range of human diseases and traits such as the Trans-Omics for Precision Medicine (TOPMed) , Genome Sequencing Program (GSP) , and whole-genome sequencing in psychiatric disorders (WGSPD) . Each of these large-scale WGS studies involves at least 10 000 individuals while some have collected well over 100 000 individuals.
A primary goal of these sequencing-based studies is the identification of genetic elements that vary among individuals. These genetic elements, such as single nucleotide variants (SNVs) and small insertions or deletions (indels), associated with a disease or trait, may provide clues for understanding the genetic basis of a disease or trait and for identifying possible therapeutic targets. One key obstacle in these analyses is the technical noise associated with sequencing technologies. Specifically, technical biases from the sequencing process can introduce issues such as incorrectly reported sequences. Several platforms exist for sequencing the human genome, and each has unique technical biases, such as differing sequencing error rates . These biases must be accounted for to accurately identify genetic elements from sequencing data and distinguish true genetic variation from technical noise.
This review will discuss computational approaches that have been developed to analyze genome sequencing data or next-generation sequencing (NGS) data. Specifically, it will focus on the two major topics in statistical genetics: (1) variant calling and (2) quality control (QC) of NGS data. Variant calling is a procedure to obtain genetic variants from NGS data; a variant is a position or a locus in the genome that differs from a reference genome. Variant calling is one of the most important steps in an analysis of NGS data because one major goal of genetic studies is to identify genetic variants that influence a phenotype of interest, and hence, it is important to discover as many true variants as possible in an analysis. QC is the next step after the variant calling procedure, and its goal is to filter out individuals and genetic variants with poor sequencing quality and to improve the quality of variant calls. Correct variant calls are critical in downstream analyses such as an association test to reduce false-positive findings. This review will discuss common practices used in variant calling and QC in genetic studies and some of the challenges in those procedures.
Variant calling process
This review will focus on the germline variant caller on SNVs and indels. Germline variants are present in gametes and can be passed onto offspring, while somatic variants are found in non-germline cells and cannot be inherited. Variant callers for structural variants (SVs) and somatic variants will be discussed briefly. As there are many different variant callers and pipelines, we will discuss one based on the genome analysis toolkit (GATK) pipeline [18–20] developed by the Broad Institute as it is one of the most widely used variant callers and pipelines; it is the primary variant caller for several large-scale WGS studies such as WGSPD, the Alzheimer's disease sequencing project , and the Genome Aggregation Database (gnomAD) . This pipeline consists of several computational steps (Figure 1), which will be discussed below.
Generation of raw NGS data from a biological sample.
Map reads to a reference genome
When a sequencing machine processes a genome, it generates many reads, which are a short fragment of the genome, typically in length of 100 bp for short read sequencers . Depending on the type of sequencing (e.g. whole-exome sequencing vs. WGS) and coverage (e.g. low coverage vs. high coverage), millions or billions of reads can be generated per genome. Information on reads (their sequences and quality scores) is stored in FASTQ files. We then need to find a location of the genome where each read originated from since this information is not retained during the sequencing procedure. For this purpose, we map reads to a reference genome with a known complete sequence. This procedure aims to identify a location in a reference genome where each read matches in sequence while tolerating some mismatches. There are many read mappers or aligners [23–25] such as BWA-MEM , a method that is often used in the GATK pipeline. BWA-MEM uses a Burrows–Wheeler transform-based algorithm  to map reads. Read mapping is one of the most time-consuming steps in the variant calling process. For high-coverage WGS, this step may use more than half of total computational time required for the entire process . This read mapping information is stored in the binary alignment map (BAM) files.
Mark duplicates and recalibrate base quality scores
These two procedures are additional steps to improve the quality of variant calling in subsequent steps. The mark duplicates procedure marks reads that are duplicates, which reduces PCR duplication artifacts, and it is implemented in the Picard software. Identification of duplicate reads is important for downstream analyses that assume the measurements of a particular genomic position in a sample are independent. In addition, PCR duplicates can overrepresent certain sequences and can lead to false positives in variant calling if these sequences harbor an error. The Base Quality Recalibration Score step recalibrates a quality score associated with each base of a read, which is implemented in GATK. This processing step is meant to refine the quality score estimates produced by a sequencing machine which may be inaccurate. The GATK method utilizes known SNVs and covariates from the observed sequencing data, such as the original quality score, positions within reads, and nucleotide context, to more accurately model quality scores. Quality scores are important for variant calling, since they provide a measure of the confidence we have in a variant detected in a sequencing read; therefore, accurate quality scores are essential for these analyses. We apply the previous read alignment and these two procedures to sequence data of each individual and generate analysis-ready reads that can be used for a variant caller. These methods utilize read group information that typically indicates which sets of reads were generated from a single sequencing run. The purpose of this information is to allow for the detection of technical artifacts arising from multiple sequencing runs. We note that an indel realignment procedure was part of the GATK pipeline, which performs local realignment to correct potential mapping errors around indels. This process is no longer necessary after the introduction of HaplotypeCaller that performs a haplotype assembly, which will be discussed next.
Given a BAM file that contains mapping information of sequenced reads of a target genome that we want to sequence, the goal of a variant caller is to identify variants or positions of the genome that vary among samples. To achieve this goal, HaplotypeCaller performs a local de novo assembly by building a De Bruijn-like graph and identifies both SNVs and indels. Once it identifies those variants, the next step is to identify a genotype of each variant, which is to infer an allele for each chromosome. HaplotypeCaller calculates the likelihood of each possible genotype based on the read information and identifies the most likely genotype. The output file of HaplotypeCaller can be either a variant call format (VCF) file or a genomic VCF (gVCF) file. A VCF file contains information on all genetic variants detected and corresponding genotypes for individuals as well as several quality scores and depth information for genotypes. While one may include BAM files of multiple individuals and call variants together with HaplotypeCaller, which will generate a multi-sample VCF file, it is recommended that HaplotypeCaller is applied to a BAM file of each individual to generate a gVCF file for large sample sizes. A gVCF file contains not only genetic variants but also non-variant sites. Multiple gVCF files can be joint-genotyped using a GenotypeGVCFs command in GATK, which is more efficient than joint genotyping multiple BAM files using HaplotypeCaller. To further improve the efficiency, one may split many gVCF files into multiple batches where each batch of gVCF files can be combined into a multi-sample gVCF file using a CombineGVCFs command in GATK. One can then apply the GenotypeGVCFs to several multi-sample gVCF files from those batches and perform joint genotyping on a large number of individuals simultaneously. More recently, the GenomicsDB format was developed for efficient storage of variants and variant retrieval in GATK version 4. However, VCF files remain widely used.
Other variant callers and their performance
We focused on methods such as BWA-MEM and GATK due to their current dominance in variant calling pipelines for human NGS data. However, a number of other read aligners and variant callers for SNVs and indels have been developed, and several studies have compared their performance using either simulated or real genome sequencing data [29–33]. In those studies, FreeBayes  and Samtools  are variant callers that are most frequently evaluated in addition to GATK. FreeBayes is a haplotype-based variant caller for SNVs and indels which uses a Bayesian statistical framework, and Samtools also uses a Bayesian method to detect SNVs and indels. Results of those studies that compared the performance of different variant callers are somewhat discordant as some studies found that Samtools and Freebayes perform better than GATK, while other studies found the opposite results. This could be due to different genome sequencing data tested, different variant calling pipelines, or different software versions for variant callers in those studies. A more recent variant calling method, DeepVariant, utilizes deep learning to call variants from images of aligned short-reads . A recent study found that the accuracy of SNV calling is similar across DeepVariant and GATK ; however, they observed improved precision in indel calling. As stated before, GATK remains the most widely used tool for variant calling from human genome sequencing data.
Challenges in variant calling
The key challenge in variant calling is distinguishing true genetic variation from technical artifacts. These artifacts can arise from the sequencing process or variant calling algorithms. The preprocessing steps discussed here (duplicate identification and base quality score recalibration) are some ways to address technological biases. False positives arising from variant calling algorithms are also a concern, especially for variants that are difficult to identify from short-read sequencing, such as SV. It is notoriously difficult to identify SVs accurately from short-read sequencing as each read may not span an entire SV. Numerous methods have been developed for SV calling that use different sources of information such as BreakDancer , cn.MOPs , CNVNator , DELLY , GenomeSTRiP , Hydra , LUMPY , and BreakSeq , but no single SV algorithm can identify all types of SVs with high accuracy . Hence, recently a few methods have been developed to combine the results of multiple SV callers using an overlap approach or a machine learning algorithm and improve both precision and recall of SV detection [47–49]. In addition, while this review focuses on germline variant callers, there are also many variant callers for somatic variants such as those that differ between tumor and normal samples from the same individuals .
Another challenge in variant calling is its computational time; it may take a couple of weeks to perform the GATK variant calling on high-coverage WGS data of one individual using one CPU core . Although some procedures in the GATK variant calling pipeline such as BWA-MEM support multiple threads to improve the computational efficiency, not all procedures support multiple threads and they become a bottleneck in the variant calling process. To overcome this challenge, a few computational approaches have been developed to speed up the GATK pipeline [28,51]. Those approaches divide a genome into smaller regions such as chromosomes or regions with fixed length (e.g. 10 Mbp) and utilize a high-performance cluster or a cloud-computing resource such as Amazon Web Service or Google Cloud to simultaneously call variants in those regions and improve the efficiency of the overall variant calling process. In addition to these approaches, fast variant callers such as Playtypus , Strelka2 , and Fuwa  have recently been proposed.
There are two types of QC; one is individual-level QC that identifies problematic individuals and the other is variant-level QC that identifies genetic variants with poor sequencing quality. These two procedures can be performed independently, or one procedure can be performed after the other procedure is performed (e.g. one may perform variant-level QC after removing problematic individuals if there are many such individuals).
Genotype missing rate
This metric indicates the proportion of genotypes that an individual is missing. One may measure genotype missing rate after setting genotypes with low genotype quality (GQ) to missing; for a VCF generated with the GATK pipeline, genotypes with GQ ≤ 20 (or higher) may be set to missing. If an individual has high genotype missing rate (>5% or >10%), it may indicate low coverage or poor sequencing quality, and hence, this individual needs to be removed from the analysis.
Genotype concordance to microarray data
Studies may have already collected microarray genotype data for sequenced individuals, and in this case, one may compare genotypes between microarray and sequencing over SNVs present in both the platforms. It is expected that genotype concordance rate between the two genotyping platforms would be high (>99%). If the rate is ∼50%, it may indicate sample swap (mixed up samples), and if the rate is below 90%, it may indicate contamination, which is discussed below. SNVs that are present in both microarray and sequencing are used to calculate genotype concordance rate, and it is important to pay attention to strand issues. For example, an SNV may have A and G alleles in microarray while it has T and C alleles in sequencing due to different strands genotyped or sequenced between microarray and sequencing. Also, as microarray data are often generated much earlier than sequencing, reference human genome builds may be different such as microarray data in hg18 and sequencing data in hg19. LiftOver can be performed to change the positions of SNVs in microarray data to match the reference genome builds of sequencing data.
During sample preparation and manipulation for sequencing, DNA from multiple individuals may be present in the same library, which represents contamination of DNA. It is important to detect this contamination and remove individuals who are heavily contaminated. One approach to detect contamination is using the genotype concordance rate as described before because contaminated individuals may have lower genotype concordance rate to microarray data. Another approach is to use a software called VerifyBamID  that calculates a contamination level of each sequenced individual using either sequencing data only or both sequencing and microarray data.
These measures include a variety of statistics describing each sequenced individual. For example, they are the number of SNVs/indels (known/novel), transition/transversion ratio (Ti/TV) (known/novel), the number of singletons, and the number of multi-allelic variants that each individual carries. The known number of SNVs or known Ti/Tv ratio is calculated from SNVs present in dbSNP while the novel number of SNVs or novel Ti/Tv ratio is calculated from those not present in dbSNP. In a homogeneous population, any individual whose statistics show outlier patterns may have sequencing problems and may need to be removed. The Ti/Tv ratio is expected to be around two for WGS data; the known Ti/Tv ratio may be slightly above two while the novel Ti/Tv ratio may be lower than two (e.g. 1.4–1.8). Very low novel Ti/Tv ratio may indicate problems with sequencing.
Identical-by-descent (IBD) analysis
This analysis identifies a pair of individuals who are duplicates (e.g. sequenced twice or twins) or who are related (e.g. parent/child relationship and siblings). One often uses PLINK  ‘--genome’ option to calculate IBD estimates quantified with a value between every pair of individuals or one may use other software to calculate relationship among pairs of individuals such as KING . When calculating value in PLINK, it is important to use a set of independent SNVs, since SNVs in high linkage disequilibrium (LD) will lead to common haplotypes that will be identified as IBD though they were not due to a recent common ancestor; one may perform LD-pruning with ‘--indep’ or ‘--indep-pairwise’ option. If value is close to 1, it suggests duplicate samples, and hence, one of the duplicates will need to be removed. For value close to 0.5, it suggests the first degree of the relationship while value close to 0.25 suggests the second degree of relationship. For case-control studies where only unrelated individuals are expected, one individual from each pair of related individuals needs to be removed. The individual in these pairs who has the disease or trait of interest is often prioritized. For family-based studies, a study may use values to check whether a known pedigree structure is consistent with the pedigree structure inferred from values. If there is inconsistency, it is important to correct the issue by checking whether there is an error in the known pedigree structure or whether wrong individuals are sequenced.
Principal component analysis
Principal component analysis (PCA) is often applied to identify the ethnicity of sequenced individuals (Figure 2). One of the most widely used tools for PCA is EIGENSTRAT . Using the reference data set such as 1000 Genomes data set , principal components (PCs) estimated from PCA cluster sequenced individuals according to their ethnicities. One may draw PCA plots using these PCs (e.g. PC1 vs. PC2), and these plots will allow a study to identify outliers in terms of ethnicity. Those outliers may need to be removed from further analysis. When performing PCA, it is important to perform LD-pruning and remove related individuals since both local LD structure and direct relatedness can be a stronger source of variation in the genetic data than the population-level ancestry differences that are of interest. For family-based studies, founders who are unrelated may be included in PCA.
Principal component analysis (PCA) of 1000 Genomes data set.
One can check whether sex inferred from sequencing data using X chromosome is consistent with known sex from sample annotation using ‘--check-sex’ option in PLINK. If there is inconsistency, one may need to check possible sample swap.
There are two main types of variant-level QC; one is a filtering approach based on several filters and the other is a classification approach using a machine learning model. One may use a combination of both by including only variants that pass both types of QC.
This QC calculates several statistics about each variant and removes it if it fails one of the filters. This approach has been widely used in GWAS based on microarray data [60,61]. The main advantage of this approach is its simplicity, and the main disadvantage is that the threshold for each filter is usually arbitrarily determined. The following filters are often used.
Genotype missing rate: Similar to genotype missing rate for each individual, one may also calculate the missing rate for each variant and remove it if it is too high (e.g. >5% or >10%).
Hardy–Weinberg Equilibrium (HWE) P-value: This P-value measures the deviation from HWE that compares the frequency of observed genotypes from sequencing data and expected the frequency of genotypes from HWE for each SNV. SNVs with low HWE P-values (e.g. <1 × 104) will need to be removed since they significantly deviate from HWE. It is important to note that HWE P-values need to be estimated in a homogeneous population that consist of unrelated and healthy controls. If there are individuals from different populations or related individuals, minor allele frequency estimation for HWE P-values may not be accurate. In addition, genotypes among case individuals may not follow HWE, so their genotypes should not be included in the estimation of HWE P-values.
Genotype concordance rate to microarray data: If microarray data are available, one may calculate genotype concordance rate between sequencing and microarray data for each SNV and remove those with low concordance rate. This filter, however, can only test those SNVs present in both sequencing and microarray data.
Mendelian error rate: In family-based studies where there are trios or pairs of sequenced individuals, one may calculate Mendelian error rate of each SNV that indicates whether transmission of alleles from parents to offspring follows Mendelian inheritance patterns. SNVs with high Mendelian error rate will need to be removed (Figure 3).
Allele balance of heterozygous calls (ABHet): This statistic measures the balance between reads supporting a heterozygous genotype. Specifically, it is calculated as the number of reads with a reference (or alternative) allele from an individual divided by the total number of reads from the individual for a heterozygous genotype (e.g. A/C and G/T). Ideally, ABHet should be near 0.5, and a study may remove an SNV with many individuals who have heterozygous genotypes with ABHet values much greater or smaller than 0.5 (Figure 3).
This approach attempts to determine whether a specific variant has high or low sequencing quality using a machine learning model. One example is Variant Quality Score Recalibration (VQSR) from GATK that uses a Gaussian mixture model . To train their model, it needs a training set that contains true positives (variants that are highly validated to be true), and it uses SNVs found in known databases such as HapMap , Omni 2.5M SNP Chip, and 1000 Genomes . After training the model and applying it to sequence data, a study may filter out variants depending on its desired sensitivity level. One potential issue with the classification approach is that the known databases may not be accurate, which may cause inaccurate classification of variants. Another issue is that those databases may not be available for certain species other than human. Lastly, the classification approach based on VQSR may require more computational resources than the filtering approach.
Quality control measures for genetic variants.
Even with these preprocessing and QC methods, false positives can still occur. There are several experiments for validating identified genetic variants. For example, PCR amplification of regions containing a putative SNV, indel, or SV breakpoints can be verified by Sanger sequencing. This sequencing method is slow and expensive compared with NGS; however, it has lower error rates and, furthermore, the targeted PCR amplification yields high coverage for validating a variant. As another example, fluorescence in situ hybridization can be used to validate SVs, such as CNVs. This method utilizes fluorescent probes that hybridize with genomic sequences and can be visualized at the chromosomal level.
Genome sequencing has become increasingly popular in several fields of genetics due to the rapid decrease in sequencing cost and its ability to discover nearly a complete genome sequence of an individual genome.
Variant calling from sequencing data involves many processing steps to accurately identify genetic variants. The GATK best practice pipeline is one of the most widely used variant calling approaches, which consists of read alignment, duplicate read identification, base quality score recalibration, and HaplotypeCaller.
Quality control (QC) is performed after a variant calling to detect and remove individuals and genetic variants with sequencing errors.
Individual-level QC removes individuals with high genotype missing rate, low genotype concordance rate to microarray data, contamination, outlier patterns in sequencing statistics, or wrong sex. It also identifies related individuals using the IBD analysis and population outliers using PCA.
Variant-level QC can be performed using the traditional filtering approach and/or the classification approach using machine learning algorithms. The filtering approach uses several filters such as genotype missing rate, HWE P-values, genotype concordance rate to microarray data, Mendelian error rate, and ABHet.
allele balance of heterozygous
Alzheimer's Disease Sequencing Project
Binary Alignment Map
fluorescence in situ hybridization
Genome Analysis Toolkit
Genome Sequencing Program
principal component analysis
single nucleotide variants
variant call format
Variant Quality Score Recalibration
Whole-Genome Sequencing in Psychiatric Disorders
B.J. was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1650604. J.H.S. was supported by the National Institute of Environmental Health Sciences (NIEHS) grant [K01ES028064], National Institute of Neurological Disorders and Stroke (NINDS) [R01NS102371], and the National Science Foundation grant [#1705197].
The Authors declare that there are no competing interests associated with the manuscript.