Hereditary cerebellar ataxias are a heterogeneous group of progressive neurological disorders that are disproportionately caused by repeat expansions (REs) of short tandem repeats (STRs). Genetic diagnosis of RE disorders such as ataxias is difficult because the current gold standard for diagnosis is repeat-primed PCR assays or Southern blots, neither of which is scalable or readily available for all STR loci. In the last five years, significant advances have been made in our ability to detect STRs and REs in short-read sequencing data, especially whole-genome sequencing (WGS). Given the increasing reliance on genomics in the diagnosis of rare diseases, the use of established RE detection pipelines for RE disorders is now a highly feasible and practical first-step alternative to molecular testing methods. In addition, many new pathogenic REs have been discovered in recent years by utilising WGS data. Collectively, genomes are an important resource for further advancements in both the discovery and diagnosis of REs that cause ataxia and will lead to much needed improvement in diagnostic rates for patients with hereditary ataxia.
Introduction
Hereditary cerebellar ataxias are a heterogeneous group of progressive neurological disorders characterised by significant morbidity and mortality. The prevalence of hereditary ataxia ranges from 1.5 to 4.9 cases per 100 000 individuals [1]. While primarily affecting adults, ataxias can also have prenatal, childhood and adolescent onset. Ataxia is primarily a gait movement disorder and usually presents with dysarthria, dysmetria, and impaired oculomotor control. Other frequently co-occurring symptoms include parkinsonism, dementia, dystonia, chorea, vestibulopathy, sleep disorders, peripheral neuropathy, pyramidal symptoms such as weakness and spasticity, ocular abnormalities such as nystagmus or oculomotor apraxia and deafness [2].
The genetics of hereditary ataxias is uniquely complex. While sometimes caused by de novo or inherited rare and deleterious mutations, hereditary ataxias are most commonly associated with repeat expansion (RE) of short tandem repeats (STRs) [3]. STRs (also known as microsatellites) are repetitive elements of DNA in which motifs 2 to 6 base pairs (bp) in length are repeated in tandem. STR lengths are highly variable between individuals due to their inherent instability; however, most variation in STR length is benign. There are currently over 50 REs known to cause disease, of which 18 cause ataxias (summarised in Table 1).
Year of discovery | Disorder name | Inheritance type | chr | gene | location | pathogenic motif | normal repeat range | pathogenic range | mechanism | discovery method | citation |
---|---|---|---|---|---|---|---|---|---|---|---|
1993 | SCA1 | AD | 6p22 | ATXN1 | exon | CAG | 6–38 | ≥39–88 | polyQ, RNA? | linkage; expansion screening | [57] |
1994 | SCA3 | AD | 14q32 | ATXN3 | exon | CAG | 12–44 | ≥55–87 | polyQ, RNA (foci) | linkage; cloning | [58] |
1994 | DRPLA | AD | 12p13.31 | ATN1 | exon | CAG | 3–35 | ≥48–93 | polyQ, RNA? | linkage; expansion screening | [59,60] |
1996 | SCA2 | AD | 12q24 | ATXN2 | exon | CAG | 13–31 | ≥32–500 | polyQ, RNA? | linkage; cloning | [47] |
1996 | SCA7 | AD | 3p21 | ATXN7 | exon | CAG | 4–33 | ≥37–460 | polyQ, RNA? | linkage; cloning | [61] |
1996 | FRDA | AR | 9q21.11 | FXN | intron | GAA | 5–34 | ≥66–1300 | gene silencing | linkage; expansion screening | [62] |
1997 | SCA6 | AD | 19p13 | CACNA1A | exon | CAG | 4–18 | ≥20–33 | polyQ, RNA? | linkage; expansion screening | [63] |
1999 | SCA17 | AD | 6q27 | TBP | exon | CAG or CAG/CAA | 25–40 | ≥43–66 | polyQ, RNA? | linkage; candidate gene analysis | [36] |
1999 | SCA8 | AD | 13q21 | ATXN8 (ATXN8OS) | 3'UTR | CAG/CTG | 15–50 | >74–250 | RNA (foci), RAN | linkage; cloning | [64] |
1999 | SCA12 | AD | 5q31 | PPP2R2B | 5'UTR | CAG | 4–32 | ≥43–78 | RAN (polyG)? | linkage; repeat expansion detection | [65] |
2000 | SCA10 | AD | 22q13 | ATXN10 | intron | ATTCT/ATTGT | 10–32 | >280–4500 | RNA (foci) | linkage; expansion screening | [66] |
2009 | SCA31 | AD | 16q22 | BEAN1 (TK2) | intron | (TAAAA), TGGAA/TAGAA | ? | ≥110–760 | RNA (foci, PS), RAN | linkage | [67] |
2011 | SCA36 | AD | 20p13 | NOP56 | intron | GGCCTG | 5–14 | ≥650–2500 | RNA (foci) | linkage; expansion screening | [68] |
2017 | SCA37 | AD | 1p32 | DAB1 | intron | (ATTTT), ATTTC | 7–400 (ATTTT) | ≥31–75 (ATTTC) | RNA | linkage; expansion screening | [69] |
2019 | CANVAS | AR | 4p14 | RFC1 | intron | AAGGG, ACAGG, AAAGG-AAGGG-AAAGG | - | ≥400–2000 | unknown | linkage; WGS | [19,32] |
2019 | GDPAG | AR | 2q32.2 | GLS | 5'UTR | CAG | 8–16 | ≥680–1400 | gene silencing | candidate gene analysis | [43] |
2023 | SCA27B | AD | 13q33.1 | FGF14 | intron | AAG | 10–250 | ≥300 | reduced gene expression | linkage; WGS | [21,34] |
2023 | - | AD | 16q22.1 | THAP11 | exon | CAG | 20–38 | ≥45–100 | PolyQ | linkage; LRS | [35] |
Table adapted from review paper [70].
Historically, the discovery and diagnosis of RE disorders has been made difficult by the repetitive nature of the DNA sequence. In the era of clinical exomes and genomes for diagnosis of genetic disorders, the gold standard for diagnosis of ataxias caused by REs remains repeat-primed PCR assays or Southern blots, neither of which is scalable or readily available for all STR loci. Testing is also limited to the most common REs. Diagnosis of RE disorders is further complicated by variable phenotypes which can overlap with other, more common disorders such as Parkinson's disease or amyotrophic lateral sclerosis (ALS) [4].
Detection and diagnosis of RE disorders
The advent of whole exome and genome sequencing (WES/WGS) has accelerated the diagnosis and discovery of rare genetic diseases. Accessibility of WES and, more recently, WGS for clinical diagnosis of rare diseases is constantly increasing [5]; however, the screens conducted by clinical genomics bioinformatics pipelines remain limited mostly to SNVs and small indels. In recent years it has become increasingly feasible to detect and accurately size REs using WGS, and to a lesser extent, WES. Early iterations of analysis tools struggled with the repetitive nature of STRs, but this hurdle has been largely overcome since the development of catalogue-based methods such as ExpansionHunter [6], gangSTR [7], STRetch [8] and exSTRa [9], and more recently, visualisation tools such as REViewer [10]. ExpansionHunter, gangSTR and STRetch provide an exact genotype for STRs shorter than the read length (typically 150 bp) or an estimated size for longer STRs, although STRetch has a higher computational burden because it requires an alternative, augmented reference genome. In contrast, exSTRa is an outlier method that determines whether a specific STR is expanded compared with other individuals. These methods are catalogue based, i.e. they will only screen pre-defined STRs and motifs, and have been used with great success to diagnose pathogenic REs in disease cohorts [4,11–13]. For example, we recently diagnosed SCA36 in a multigenerational Australian pedigree using ExpansionHunter and exSTRa [14]. SCA36 is a rare form of ataxia caused by an intronic GGCCTG RE in NOP56, with no clinical test available in Australia. Diagnosis of SCA36 was made within five days of receiving WGS data. In addition, REViewer is an important tool for visually confirming the composition of REs and can be used to identify interruptions in the motif and to eliminate false positive findings.
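The catalogue-based logic described above can be illustrated with a minimal Python sketch. It is not the implementation of ExpansionHunter or any other tool: the `Locus` class, the `CATALOGUE` dictionary and the `classify` function are hypothetical names introduced for illustration, and the thresholds are transcribed from Table 1 for four loci only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    gene: str
    motif: str
    normal_max: int      # upper end of the normal repeat range in Table 1
    pathogenic_min: int  # lower end of the pathogenic range in Table 1


# Hypothetical mini-catalogue; values transcribed from Table 1.
CATALOGUE = {
    "SCA1": Locus("ATXN1", "CAG", 38, 39),
    "SCA2": Locus("ATXN2", "CAG", 31, 32),
    "SCA3": Locus("ATXN3", "CAG", 44, 55),
    "SCA36": Locus("NOP56", "GGCCTG", 14, 650),
}


def classify(disorder: str, repeats: int) -> str:
    """Compare a genotyped repeat count with the tabulated ranges for one locus."""
    locus = CATALOGUE[disorder]  # only pre-defined loci can be screened
    if repeats >= locus.pathogenic_min:
        return "pathogenic range"
    if repeats <= locus.normal_max:
        return "normal range"
    return "intermediate"


print(classify("SCA1", 40))
print(classify("SCA3", 50))
```

A locus that is absent from the catalogue raises a `KeyError`, which mirrors the limitation that catalogue-based tools screen only pre-defined STRs. For long non-coding expansions the genotyped size is an estimate, so a call below the pathogenic threshold at such loci would still warrant visual review.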
One recent study from the UK's 100 000 Genomes Project validated RE screening in WGS compared with PCR for neurological disease cohorts [4]. Compared with PCR, WGS was able to correctly classify expanded alleles with 97.3% sensitivity (95% CI 94.2–99.0) and 99.6% specificity (99.1–99.9) for thirteen pathogenic loci. Screening of WGS from 11 631 patients with suspected genetic neurological disorders with ExpansionHunter yielded 81 pathogenic REs. Follow-up analysis with PCR confirmed that 68 were in the pathogenic range, representing an 84% true positive rate. Many of these diagnoses were made in people who did not present with typical symptoms, including children. This included REs for SCA2 in patients with early onset Parkinson's disease and ALS, SCA3 in a complex Parkinson's disease patient, and a SCA1 diagnosis in a hereditary spastic paraplegia patient. This study demonstrates that WGS, which is increasingly generated in both clinical and research settings, is a critical tool for the diagnosis of RE disorders, and is a rapid alternative to the long diagnostic odyssey associated with consecutive testing with PCR and Southern blots. In addition, it highlights the heterogeneity of RE disorder phenotypes as well as the issue of underdiagnosis. RE disorders, especially ataxias, are very rare, and evidence suggests that they are being misdiagnosed as more well-known neurological conditions. For example, a recent study of REs in ALS/FTD identified enrichment for multiple REs in the pathogenic range (SCA1, DM1 and DM2), and others in the intermediate range (including SCA2, SCA17 and Huntington's disease), highlighting the heterogeneity of the RE disorders and potential for misdiagnosis [15]. Furthermore, REs in RFC1, which cause CANVAS, have also been identified as a common cause of idiopathic neuropathy [16–18].
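The 84% figure quoted above is the proportion of WGS-flagged expansions that PCR confirmed. A short Python sketch of that arithmetic follows; the function and variable names are illustrative, and only the counts 81 and 68 come from the study as reported here.

```python
def positive_predictive_value(confirmed: int, flagged: int) -> float:
    """Fraction of flagged calls that were confirmed by the reference assay."""
    if flagged <= 0:
        raise ValueError("flagged must be a positive count")
    return confirmed / flagged


flagged_by_wgs = 81    # pathogenic REs called by ExpansionHunter
confirmed_by_pcr = 68  # calls confirmed in the pathogenic range by PCR

ppv = positive_predictive_value(confirmed_by_pcr, flagged_by_wgs)
false_positives = flagged_by_wgs - confirmed_by_pcr

print(f"confirmed fraction: {ppv:.3f}; unconfirmed calls: {false_positives}")
```

The 13 unconfirmed calls are the reason the text recommends orthogonal validation and visual inspection before reporting.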
While there is demonstrated utility in detecting REs in WGS data, there are some challenges that still need to be addressed. Exonic REs are generally short and can be genotyped with high accuracy; however, non-exonic expansions can be very large and the true RE size may be substantially underestimated by tools such as ExpansionHunter [4]. For some loci, such as NOP56 (SCA36), REs are easily detected in WGS [14]. Some REs, such as AAGGG and ACAGG in CANVAS, are easily detected but difficult to accurately size [19,20]. However, these motifs are not common, and their presence in a homozygous or compound heterozygous state would indicate CANVAS despite an underestimated allele size.
In contrast, FGF14-GAA (SCA27B), which has a pathogenic threshold of >300 repeats, is commonly observed with 50–250 repeats in the general population. ExpansionHunter is unable to accurately size REs larger than ∼100 repeats at this locus and will severely underestimate true positive expansions [21]. SCA37 is uniquely difficult to detect with short-read sequencing techniques because the pathogenic TTTCA motif is embedded deep within the expanded reference TTTTA motif, and it cannot be detected with ExpansionHunter [22] (Figure 1). In addition, false positives are common for some loci, such as the RE that causes Fragile X Syndrome in FMR1 [4]. Use of visualisation tools such as REViewer is essential for identifying false positives. GnomAD now has an STR catalogue for 19 241 genomes (v3.1), which includes 59 known pathogenic REs, but not newly discovered REs such as SCA27B or the putative THAP11-CAG. The gnomAD STR catalogue is a useful resource as it contains variations of motifs at specific, known RE loci and has REViewer plots available for all individuals, which researchers can compare with their own datasets to potentially mitigate the risk of false positives.
Genotyping of STRs in short-read sequencing data.
(A) An exact genotype can be determined for STRs that are shorter than the read length (typically 150 bp for modern short-read sequencing). (B) For STRs that are longer than the read length, genotypes are estimated; the longer the STR, the more likely that the size will be underestimated. (C) Pathogenic insertions, such as a side-by-side insertion of a pathogenic motif adjacent to a benign motif (such as in FAME), are easily detected but can be difficult to accurately genotype. (D) In contrast, pathogenic motifs embedded within benign motifs, such as in SCA37, can be difficult to detect with short-read sequencing data.
Collectively, these data demonstrate that although some loci such as SCA27B and SCA37 require an RE-specific approach, most REs are easily detected in WGS with a single, uniform approach. This approach now needs to be embedded in standard clinical genomics pipelines, with validation via current PCR or Southern blot-based assays or, in the future, most likely with long-read sequencing methods, especially for the larger REs. Such an approach will yield significant benefits for patients, their families and health care systems.
Recent discoveries of REs
In addition to facilitating diagnosis of known RE disorders, short-read sequencing has also played a key role in the discovery of novel REs in recent years. Historically, the discovery of pathogenic REs was slow, as discovery relied heavily on linkage analysis and molecular methods such as expansion screening and DNA cloning (Figure 2, Table 1). The first pathogenic REs were discovered in 1991 [23,24]. The early discoveries were biased towards coding regions; however, over time there has been a boom in the discovery of non-coding pathogenic REs.
Timeline of the discovery of RE disorders.
Ataxias are shown in blue; non-ataxia disorders are shown in grey. Circles represent non-coding loci and squares represent coding loci. The green shading indicates the start of the rapid discovery of pathogenic REs facilitated by short-read sequencing.
The first pathogenic RE discovered with WGS was the hexanucleotide GGGGCC RE in C9ORF72 which causes ALS and frontotemporal dementia (FTD). This was discovered due to a well-powered linkage analysis and a highly significant GWAS hit at chromosome 9p21 for ALS [25] and FTD [26] which highlighted the genomic region requiring further examination. However, this discovery used manual inspection of the reads and extensive prior knowledge, and was not scalable.
In 2018, the discovery of the RE that causes FAME1 had major repercussions for the discovery of REs. The FAMEs are a group of epilepsies with very distinct phenotypes and tight linkage regions which remained unsolved for decades, until the discovery of the TTTTA/TTTCA RE in SAMD12 in FAME1 [27]. This RE was discovered by a number of methods which included WGS. The identification of this pathogenic motif had a flow-on effect: within only two years, FAME2, FAME3 and FAME4 were all solved using a mixture of WGS, long-read sequencing and traditional RE detection methods, often with prior information from mapping efforts [28–30]. All were caused by the TTTTA/TTTCA motif in different genes. Of note, the first TTTTA/TTTCA disorder reported was SCA37, which was discovered a year before FAME1. It is the only TTTTA/TTTCA RE to date known to cause ataxia rather than epilepsy. It is not clear why this specific motif causes FAME in some instances and SCA37 in others; however, it has been postulated that cell-specific gene expression may play a role. Ataxia genes generally have elevated expression in the cerebellum (Figure 3) when compared with a randomly chosen set of genes of the same size from the human genome. Given the importance of Purkinje cell degeneration in the cerebellum in ataxia pathology [31], we postulate that elevated expression in these cells might contribute to the disease phenotype.
Ataxia genes are preferentially expressed in the cerebellum.
Cerebellar gene expression, presented as the mean transcripts per million plus one, was obtained from GTEx (v8 RNASeQCv1.1.9). Mean cerebellar expression levels of ataxia genes are shown compared with a control gene set as a box plot. An ataxia gene list consisting of 372 genes was curated based on genes from OMIM, PanelApp UK and PanelApp Australia (curated 2023-03-22). The control set comprises 372 randomly selected non-ataxia genes.
In 2019, another major breakthrough was published: the discovery of the recessively inherited AAGGG RE in RFC1 [19,32]. This discovery was made simultaneously by two teams. The first paper published was by a UK team who relied on strong linkage analysis to manually screen the read data from WGS using a visual inspection approach. This was only practical due to a highly significant and narrow linkage region that could be manually screened [32]. The second paper, published by our team, used a novel tool called ExpansionHunter Denovo (EHDN) [33] and was the first time an RE was discovered using an unbiased method with a genome-wide approach [19]. Prior to the publication of EHDN, all RE detection tools were catalogue based, i.e. they could only screen for expansions of a pre-defined list of STRs (location and motif). EHDN is a catalogue-free method that leverages the lower read quality inherent to regions of repetitive DNA to rapidly identify ‘anchored’ and repeat-rich reads amongst aligned, unaligned and misaligned reads. These reads are anchored to a genomic location by their high-quality read pair, which aligns uniquely to the flanking DNA. Using this method, we discovered the AAGGG RE in CANVAS, a non-reference motif that was not present in STR catalogues at the time and was therefore not detectable with catalogue-based methods.
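The idea of anchoring repeat-rich reads by their uniquely aligned mates can be shown with a toy Python sketch. This is a simplified illustration of the concept only, not EHDN's algorithm: the `ReadPair` class, both functions, the 0.8 repeat-fraction cut-off, the mapping-quality cut-off of 50, the 1 kb bins and the placeholder contig name `chrN` are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    seq: str         # sequence of the candidate repeat read
    mate_chrom: str  # contig on which the mate aligned
    mate_pos: int    # alignment position of the mate
    mate_mapq: int   # mapping quality of the mate


def is_repeat_rich(seq: str, motif: str, min_fraction: float = 0.8) -> bool:
    """True if non-overlapping copies of the motif cover most of the read."""
    if not seq:
        return False
    return seq.count(motif) * len(motif) / len(seq) >= min_fraction


def anchored_repeat_reads(pairs, motif, min_mapq=50, bin_size=1000):
    """Count repeat-rich reads per anchor bin, keeping confidently placed mates only."""
    counts = Counter()
    for pair in pairs:
        if is_repeat_rich(pair.seq, motif) and pair.mate_mapq >= min_mapq:
            anchor_bin = pair.mate_pos // bin_size * bin_size
            counts[(pair.mate_chrom, anchor_bin)] += 1
    return counts


# Hypothetical read pairs; coordinates are arbitrary placeholders.
pairs = [
    ReadPair("AAGGG" * 30, "chrN", 12_300, 60),
    ReadPair("AAGGG" * 30, "chrN", 12_750, 60),
    ReadPair("ACGT" * 37, "chrN", 12_500, 60),  # not repeat-rich
    ReadPair("AAGGG" * 30, "chrM", 5, 0),       # mate not confidently placed
]
print(anchored_repeat_reads(pairs, "AAGGG"))
```

Because the motif is supplied by the data rather than by a catalogue, a non-reference motif such as AAGGG can surface in this kind of scan, and bins with an excess of anchored repeat reads in cases relative to controls would be candidates for follow-up.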
Since 2018, 15 new pathogenic REs have been discovered, nine of which were published in 2019 and most of which relied on WGS for discovery. Three of these REs cause ataxia, including the recent discovery SCA27B, which is caused by a GAA RE in FGF14 that was discovered with EHDN by two groups simultaneously [21,34]. Like CANVAS, SCA27B is a surprisingly common form of adult onset ataxia, which facilitated its discovery.
Recently, a study used long-read sequencing to identify a coding polyglutamine (polyQ) RE in the gene THAP11, in a five-generation Chinese family with autosomal dominant ataxia [35]. They report a repeat length of 45–55 repeats in adult-onset patients and one RE of 100 repeats in a childhood-onset patient, indicating anticipation can occur at this locus. This is the first discovery of a coding ataxia RE since SCA17 in 1999 [36].
Discovering new REs
While short-read sequencing has been critical to the discovery of many REs in recent years, there are challenges that need to be addressed. One of the key difficulties for RE discovery is the lack of large STR genotype databases sourced from ancestry-diverse populations. GnomAD, which contains 76 156 genomes of diverse ancestries (v3.1.2 data set, GRCh38), has been pivotal for genomics of rare disease describing SNP and indel frequencies, but such extensive catalogues do not exist for STRs.
Some smaller scale databases exist, mainly based on analysis of the 1000 Genomes Project [37]. For example, the Illumina genome-wide polymorphic STR catalogue was generated using STR-finder, a tool that infers STRs from short-read sequencing at population levels [6]. This was generated using 2504 unrelated individuals from the 1000 Genomes Project and contains 175 000 curated and polymorphic STRs, and is an important STR resource [37]. While the 1000 Genomes Project is a stratified sample that sought to maximise genetic diversity, it is too small to sufficiently capture the rare but crucial STR variation across populations that is needed for the discovery of rare pathogenic variants. More recently, a population reference panel of STR variation was generated using 3550 individuals from the 1000 Genomes Project and H3Africa cohorts, identifying over 1.7 million STR loci [38]. These catalogues are useful resources for STR discovery, especially for exonic STRs which are generally smaller, easier to genotype, and more evolutionarily conserved.
Despite these difficulties, there are strategies that can be implemented to improve the discovery of pathogenic REs in ataxia. First, while there are thousands of different motifs, only a small number have been associated with disease. For example, all known coding REs to date are CAG (polyQ), GCN (polyA) or CCG, and a first pass analysis might focus specifically on these motifs within coding regions. There are over 6000 known CGG sites, 93% of which are highly variable between individuals, and could potentially be candidates for pathogenic REs [39]. Likewise, motifs such as AAG, TTTCA, GGCCTG and others can be prioritised for intronic STRs.
Second, STRs in genes already known to cause ataxia should be prioritised — CACNA1A, FGF14, RFC1 and GLS are all examples of genes that can cause ataxia by point mutations/indels, structural variants and also STRs [40–43]. Furthermore, expression in the cerebellum is known to be important for cerebellar ataxia — genes with no expression in the cerebellum can be excluded from analysis as a useful initial filter (Figure 3).
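The two prioritisation strategies above can be combined into a simple ranking, sketched here in Python. The scoring scheme, the `Candidate` class, the 1.0 TPM cut-off and the example expression values are all assumptions for illustration; only the motif lists and the four gene names are taken from the text. Motifs are assumed to be already normalised to the same strand and frame as the listed patterns.

```python
from dataclasses import dataclass

CODING_MOTIFS = ("CAG", "GCN", "CCG")         # N matches any base
INTRONIC_MOTIFS = ("AAG", "TTTCA", "GGCCTG")
KNOWN_ATAXIA_GENES = {"CACNA1A", "FGF14", "RFC1", "GLS"}


@dataclass(frozen=True)
class Candidate:
    gene: str
    motif: str
    region: str            # "exon" or "intron"
    cerebellum_tpm: float  # hypothetical cerebellar expression value


def motif_matches(motif: str, pattern: str) -> bool:
    """Compare a motif with a pattern, treating N in the pattern as a wildcard."""
    return len(motif) == len(pattern) and all(
        p in ("N", m) for m, p in zip(motif, pattern)
    )


def priority(candidate: Candidate, min_tpm: float = 1.0) -> int:
    """Return 0 for genes not expressed in cerebellum, otherwise a score of 1 to 3."""
    if candidate.cerebellum_tpm < min_tpm:
        return 0  # excluded by the expression filter
    score = 1
    patterns = CODING_MOTIFS if candidate.region == "exon" else INTRONIC_MOTIFS
    if any(motif_matches(candidate.motif, p) for p in patterns):
        score += 1  # motif class previously associated with disease
    if candidate.gene in KNOWN_ATAXIA_GENES:
        score += 1  # gene already known to cause ataxia
    return score


candidates = [
    Candidate("FGF14", "AAG", "intron", 20.0),
    Candidate("GENE_X", "GCA", "exon", 5.0),   # hypothetical gene
    Candidate("GENE_Y", "CAG", "exon", 0.0),   # hypothetical gene
]
ranked = sorted(candidates, key=priority, reverse=True)
print([(c.gene, priority(c)) for c in ranked])
```

A graded score rather than a hard filter keeps lower-priority candidates available for review, which may matter when the motif lists are known to be incomplete.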
Non-catalogue methods such as EHDN are widely used for the discovery of novel REs, and other non-catalogue tools have recently been published. These include STRling [44], which is the first tool to report the coordinates of novel STR expansions to base pair accuracy, and superSTR, which is a rapid non-reference-based method that identifies expanded motifs [45] and thus can be applied to RNAseq data and to organisms without reference genomes. In addition, STRling can detect REs that are smaller than the read length, which is especially important for exonic REs, which are often pathogenic under 150 bp. In contrast, EHDN cannot detect STRs shorter than the read length and is therefore biased against detecting exonic REs.
There is still utility in catalogue-based methods such as ExpansionHunter as a primary discovery tool. In contrast with intronic STRs, which can be hundreds to thousands of repeats in length, exonic STRs are usually very short, and even small increases in length can be pathogenic. For example, in SCA2, alleles of 32–36 repeats are incompletely penetrant, and alleles of 37 repeats or more are fully penetrant [46,47]. However, 37 repeats span 111 base pairs, and this would not be detected with tools such as EHDN. Given that exonic STRs are generally highly conserved and well characterised, using a reference-based method such as ExpansionHunter, which can accurately size shorter REs, with a catalogue such as the Illumina polymorphic STR catalogue is a good approach to capturing exonic REs.
REs that resemble SCA37, in which the pathogenic motif is embedded deep within the reference motif, are difficult to discover with short-read sequencing, as reads from the STR cannot be uniquely mapped to the locus. However, the expansion of the reference motif can be detected with tools such as ExpansionHunter and EHDN, in which case such loci can be shortlisted for further investigative work, such as long-read sequencing. Tools such as superSTR may also identify enrichment of pathogenic repeats among the unmapped reads.
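The arithmetic behind the SCA2 example can be made explicit with a short Python sketch. The function names are illustrative, and the 150 bp read length is the typical value quoted earlier in the text rather than a fixed property of any tool.

```python
READ_LENGTH_BP = 150  # typical short-read length quoted in the text


def repeat_span_bp(motif: str, repeats: int) -> int:
    """Length in base pairs of an uninterrupted repeat tract."""
    return len(motif) * repeats


def shorter_than_read(motif: str, repeats: int, read_length: int = READ_LENGTH_BP) -> bool:
    """True if the repeat tract alone is shorter than one read."""
    return repeat_span_bp(motif, repeats) < read_length


print(repeat_span_bp("CAG", 37))         # fully penetrant SCA2 allele
print(shorter_than_read("CAG", 37))
print(shorter_than_read("GGCCTG", 650))  # SCA36 pathogenic threshold
```

A 37-repeat CAG allele occupies 111 bp and so sits below the read length, which is why a method that looks only for reads filled with repeat sequence would miss it. In practice a spanning read also needs flanking sequence on both sides, so the largest exactly sizable allele is somewhat smaller than the read length itself.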
Interruptions
Further complexities arise in the form of interrupted motifs; however, the impact of these interruptions is poorly understood. Motif interruptions refer to occasional interruptions in the motif sequence, and are distinct from motif changes, in which the existing motif changes to a repetitive stretch of another motif. Generally, pathogenic REs are pure and the motif is not interrupted. Uninterrupted motifs are more unstable, and more susceptible to expansion during gametogenesis [48,49]. In addition, there are reports that interruptions change the disease phenotype. An example is SCA2, in which pure CAG expansions greater than 33 repeats in ATXN2 cause ataxia, but expansions interrupted by sporadic CAA motifs may cause Parkinson's disease (PD) instead [50,51]. However, the research is conflicting and multiple studies do not find an association between PD and interrupted ATXN2 motifs [52], possibly due to differences between familial and sporadic PD and to ethnic diversity. In addition, severe neurodegeneration of the dopaminergic substantia nigra is often observed in SCA2 and in rare cases can cause parkinsonism, raising the possibility that people with SCA2 are being misdiagnosed with PD. The mechanism is unclear, as both CAA and CAG code for glutamine and there are reports of interrupted CAG motifs in ATXN2 with symptoms consistent with SCA2 [53]. RNA toxicity may be impacted, although this remains poorly understood [54,55].
Current RE detection methods do not detect interruptions. However, motif purity can be checked using REViewer, provided that the interruption is not embedded so deeply within the STR that it cannot be identified. The accumulation of large genomics disease cohorts makes large-scale screening of motif purity increasingly accessible and may help address concerns regarding the impact of interruptions on disease progression.
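A motif purity check of the kind described above can be sketched in a few lines of Python. This is a simplified illustration and not how REViewer works: it assumes the repeat sequence has already been extracted, starts in frame with the motif, and contains substitutions only (no insertions or deletions). The function names and the example allele are hypothetical.

```python
def find_interruptions(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (unit index, unit sequence) for each motif-length unit that differs from the motif."""
    k = len(motif)
    usable = len(seq) - len(seq) % k  # ignore a trailing partial unit
    units = [seq[i:i + k] for i in range(0, usable, k)]
    return [(i, unit) for i, unit in enumerate(units) if unit != motif]


def purity(seq: str, motif: str) -> float:
    """Fraction of complete units that match the motif exactly."""
    n_units = len(seq) // len(motif)
    if n_units == 0:
        return 0.0
    return 1 - len(find_interruptions(seq, motif)) / n_units


# Hypothetical CAG tract with two CAA interruptions.
allele = "CAG" * 8 + "CAA" + "CAG" * 4 + "CAA" + "CAG" * 8
print(find_interruptions(allele, "CAG"))
print(round(purity(allele, "CAG"), 3))
```

Applied across a cohort, a summary such as `purity` could be tabulated alongside repeat length to explore whether interrupted alleles differ in phenotype, subject to the caveat that short reads may not span interruptions deep within long expansions.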
Conclusion
Advances in methodologies for screening of REs in WGS data have resulted in significant progress in the discovery and diagnosis of these complex genetic variations in ataxia. However, these tools remain under-utilised in clinical diagnosis pipelines. Genomics is now widely used in the clinical setting, and patients will continue to miss out on rapid genetic diagnosis until such pipelines are implemented as a first-line screen for REs in ataxia. In addition, WGS is increasingly being generated for disease cohorts in a research setting, including ataxia, which will facilitate further discoveries of pathogenic REs. Although significant strides have been made using short-read sequencing, the nature of long REs means that short-read WGS will always have its limitations. Long-read sequencing has the potential to address these limitations as the read length is sufficient to completely capture the RE within a single read. However, this is still an emerging technology and has limitations, including affordability, scalability and technical issues (reviewed in [56]), and will likely not displace short-read genomics in the short term. As costs continue to decrease and the technology continues to improve, short-read sequencing, with support from long-read sequencing, will continue to be critical for the discovery and diagnosis of pathogenic REs in coming years.
Summary
Multiple studies have shown that short-read sequencing data can be used to report whether an individual has a pathogenic RE. This information can be used to suggest further clinical investigation and to improve diagnostic rates of RE disorders.
In recent years, we have witnessed the success of utilising short-read sequencing data for the discovery of REs, which includes the discovery of pathogenic REs that cause CANVAS and SCA27B, two very common causes of late-onset ataxia.
RE detection pipelines are ready for application in clinical pipelines where genomics is increasingly being used for genetic diagnosis.
The rapid accumulation of both short- and long-read genome data in the research setting and the continual improvement of RE detection methods suggest that further RE discoveries are likely in the coming years.
Competing Interests
The authors declare that there are no competing interests associated with the manuscript.
Funding
H.R. was supported by an NHMRC Emerging Leadership 1 grant (1194364) and M.B. was supported by an NHMRC Leadership 1 grant (1195236). Additional funding was provided by the Independent Research Institute Infrastructure Support Scheme and the Victorian State Government Operational Infrastructure Program.
Open Access
Open access for this article was enabled by the participation of University of Melbourne in an all-inclusive Read & Publish agreement with Portland Press and the Biochemical Society under a transformative agreement with CAUL.
Author Contributions
H.R. planned, wrote and edited the manuscript. M.F.B. reviewed and edited the manuscript. M.B. planned, reviewed and edited the manuscript.
Acknowledgements
We acknowledge the work and contribution of Liam G Fearnley.