Tandem repeat DNA sequences constitute a significant proportion of the human genome. While previously considered to be functionally inert, these sequences are now broadly accepted as important contributors to genetic diversity. However, the polymorphic nature of these sequences can lead to expansion beyond a gene-specific threshold, causing disease. More than 50 pathogenic repeat expansions have been identified to date, many of which have been discovered in the last decade as a result of advances in sequencing technologies and associated bioinformatic tools. Commonly utilised diagnostic platforms including Sanger sequencing, capillary array electrophoresis, and Southern blot are generally low throughput and are often unable to accurately determine repeat size, composition, and epigenetic signature, which are important when characterising repeat expansions. The rapid advances in bioinformatic tools designed specifically to interrogate short-read sequencing and the development of long-read single molecule sequencing are enabling a new generation of high throughput testing for repeat expansion disorders. In this review, we discuss some of the challenges surrounding the identification and characterisation of disease-causing repeat expansions and the technological advances that are poised to translate the promise of genomic medicine to individuals and families affected by these disorders.
Introduction
Genetic technologies are transforming healthcare by empowering genomic medicine, an emerging discipline which utilises genetic data to improve clinical care and patient outcomes. Genomic medicine can provide diagnostic certainty, key information for prognosis, counselling, and reproductive planning. It can lead to the development and delivery of improved treatments targeted to disease mechanisms. However, genetic technologies have had limited success in bringing the power of genomic medicine to individuals with disorders caused by pathogenic repeat expansions (RE). Repetitive DNA sequences or short tandem repeats (STRs) of 2–6 bp make up ∼2% of the genome and are inherently unstable, thereby facilitating genetic plasticity and evolutionary development [1]. RE disorders manifest when a segment of repetitive DNA is expanded beyond a gene-specific threshold and typically cause conditions with significant neurodevelopmental, neuromuscular and neurological outcomes. There are over 50 RE disorders, which encompass conditions of major clinical significance including Huntington disease, fragile X syndrome and hereditary cerebellar ataxias (Figure 1) [2]. Collectively, RE disorders are the most common genetic conditions seen by neurologists with burden of disease being significantly underestimated [3]. Although the number of identified pathogenic REs is increasing, there is compelling clinical and genetic evidence that many are still to be identified [2]. An issue of current unmet need is that diagnostic testing is only widely available for a minority of the known RE disorders. While short-read next generation sequencing (NGS) has revolutionised the diagnosis of disorders caused by single nucleotide polymorphisms (SNP) or copy number variants (CNV), NGS is yet to be widely used in RE diagnostics [4]. Instead, clinical testing for RE disorders commonly utilises outdated low-throughput platforms, most of which were developed in the last century. 
The experience at our centre, and globally, is that diagnostic yield for the subset of RE disorders tested utilising capillary array electrophoresis, Sanger sequencing and Southern blot typically ranges from 10% to 30% [5]. These tools are very effective when assessing pathogenic CAG coding sequence REs that underlie spinocerebellar ataxia (SCA) 1, 2, 3, 6, and 7, where repeat sizes do not exceed the capabilities of the platform [6–8]. However, they are typically less effective for larger (>300 repeats) and more complex pathogenic REs. The rapid advancement in both short-read whole genome sequencing (WGS) and single molecule long-read sequencing (LRS) platforms provides an opportunity to redefine the approach for RE identification and discovery, providing genome-wide variant calling and significantly improved determination of RE size and composition [9,10].
Pathogenic REs have been identified in both coding and non-coding regions and contain a diverse range of pathogenic motifs.
Exonic REs are generally shorter and of lower complexity, falling into two categories, poly-alanine (Poly-A) or poly-glutamine (Poly-Q). The 5′ untranslated region (UTR) contains G-C rich trinucleotide repeats which result in methylation-mediated alteration in transcriptional activity. REs in the 3′ UTR often cause synthesis of toxic RNA or polypeptides that form both intra- and inter-cellular aggregates. Conversely, unconventional pentanucleotide and hexanucleotide REs, located within introns, contain more complex motifs due to their polymorphic nature and often cause RNA toxicity and repeat-associated non-AUG (RAN) translation. Created with BioRender.com.
The convergence of large datasets generated by short-read WGS with specialised bioinformatic tools including but not limited to ExpansionHunter (EH), exSTRa, STRetch, TREDPARSE, and superSTR [11–16] has transformed genome-wide approaches to analysing STRs within known disease-associated genes. These tools can interrogate both aligned and misaligned reads to identify possible REs at known and characterised loci. This analysis has expanded the capabilities of short-read WGS beyond the more conventional curation of SNPs, indels, and CNVs, yet these RE analysis tools have not been implemented broadly by diagnostic providers, despite the evidence for clinical utility [9]. In practice, uptake of these tools has been much more rapid in the research context. For example, we recently identified SCA36 as the genetic basis of ataxia in a multi-generational Australian pedigree after a >10-year diagnostic odyssey [17]. Similarly, the use of these tools in a multi-generational family study of autism spectrum disorder identified a heterozygous intronic CCTG expansion in CNBP, a known cause of myotonic dystrophy type 2 (DM2) [18]. The identification of novel RE loci using short-read WGS has similarly benefitted from new bioinformatic tools and methodologies. ExpansionHunter Denovo (EHdn) was the first tool developed to search genome-wide for novel repeat expansions in short-read data [19] and has been instrumental in the identification of novel RE in RFC1 and FGF14 causing monogenic disorders [20,21] and also novel REs associated with neurodevelopmental disorders [22]. More recently, STRling was developed with a similar novel RE detection capability [23].
However, the recent trajectory of novel RE discovery suggests that the ‘low hanging fruit’ is essentially exhausted and future discoveries are likely to involve complex repeat sequences with multiple pathogenic motifs. For example, complex pentanucleotide REs cause cerebellar ataxia (SCA10 [24], SCA37 [25], CANVAS [20,26]) and familial adult myoclonic epilepsy (FAME [27]). These structures challenge the capability of short-read technology and are leading to a rapid evolution of long-read tools and methodologies. Here, we discuss some of the complexities surrounding pathogenic RE discovery and characterisation and the technological hurdles that need to be addressed.
Structural considerations for repeat expansion detection
STRs have been implicated in the regulation of important cellular processes including genetic plasticity, gene transcription and translation [1]. However, STRs can be difficult to characterise due to unconventional structural conformations and variability in motif size and composition. The challenge for detecting large REs and determining pathogenicity arises when the sequence includes interruptions, the DNA forms unconventional shapes, changes in promoter methylation within CpG islands affect transcription, or complex repeat motifs cause conflicting functional consequences (Figure 2). Most of these complex changes cannot be detected using conventional techniques and require lengthy investigations.
Structural abnormalities within DNA due to RE mutations can lead to multiple intracellular dysfunctions.
RE length and motif are generally considered the primary determinants of pathogenicity; however, characterisation often requires multiple independent tests to interrogate different attributes. Long hairpin structures can contribute to expansion with increased strand fragility and sequestration of RNA-binding splicing factors. DNA repeat tract interruption with an alternate motif can lead to abnormal hairpin structures and strand strengthening which can affect pathogenicity. Increased repeats containing methylated cytosine (5-mC and 5-hmC) within a 5′ UTR or CpG island can be an effective mechanism for gene silencing. This can lead to detrimental consequences if protein down-regulation causes disease or can be beneficial if toxic aggregate formation is avoided. Repeat-associated non-AUG (RAN) translation of an intronic repeat tract can lead to multiple functional and inert gene products. Transcription of these REs can produce toxic RNA foci, and translation can yield RAN proteins in all possible reading frames that can either be catabolised by intracellular proteasomes or form insoluble monomers and polymers. Created with BioRender.com.
Flanking regions surrounding REs can contain sequences promoting strand integrity and stability. Recent gene discovery programs following phenotype-driven data have utilised both WGS and LRS platforms, capturing data from both intergenic and intragenic regions to identify regions of interest. Using this approach, a GAA expansion was discovered in intron 1 of FGF14 isoform 1b causing SCA27B [21,28]. This common genetic cause of ataxia had remained hidden, despite SNPs and CNVs in FGF14 being shown to cause a rare form of episodic ataxia over 20 years ago (SCA27A [29]). Detailed analysis of flanking regions surrounding the RE uncovered a common 17 bp deletion-insertion variant that correlated with increased DNA stability and shorter allele length [30]. In a cohort of 2191 individuals with mixed genetic lineage, alleles containing the common deletion-insertion variant almost exclusively had less than 30 repeats, while those containing the reference sequence GAAGAAAGAAA(GAA)n demonstrated a higher degree of instability and meiotic variability. The presence of this variant suggests an inherited mechanism or secondary structure that stabilises this region, when other GAA tracts are known for being prone to slippage [31]. Therefore, looking beyond the RE into the surrounding intronic regions identified a key feature providing information about the likelihood of expansion of that allele.
Repeat sequences containing alternating motifs have historically been difficult to identify, as flanking PCR cannot differentiate an interrupted repeat tract when separated on a gel. These interruptions of alternating motifs in large REs can stabilise the DNA strand and reduce disease penetrance, as seen in the polymorphic CTG repeat in DMPK causing myotonic dystrophy type 1 (DM1) [32]. Investigation into a family-specific phenotype of DM1 co-segregating with Charcot-Marie-Tooth neuropathy, encephalopathic attacks, and early hearing loss found the RE did not maintain a pure CTG motif. Instead, interruptions were present at the 3′-end creating an alternating CGG•CCG that formed a stable hairpin structure [33]. The high G-C content of the 3′ end interruptions made PCR amplification difficult in this region, causing ambiguous clinical testing results and false negatives as RE length in DM1 is inversely correlated with age of onset [34]. The 3′ end interruption was retained through the maternal germline and reduced somatic instability; however, the 5′ CTG tract was still unstable, resulting in expansion and toxic RNA gain of function [33,35]. Discovery of this complex interruption required vectorette PCR, a technique that utilises large amounts of input DNA, restriction enzyme libraries to digest fragments, and separation of fragments by gel electrophoresis [33]. Cloning and sequencing were still required to confirm this unusual repeat motif. Identifying this unique sequence was necessary for understanding the unusual phenotype in this family; however, unless multiple PCR techniques are being used to routinely investigate polymorphic REs, many individuals/families with similar complex repeat tracts would likely remain uncharacterised. The recent developments in LRS support in-depth investigation of RE sequences and can improve pathogenic thresholding, as complex patterns of alternating motifs can easily be identified and segregated from those with pure repeat tracts.
Alternatively, the CRISPR/Cas9 system has been adapted to target and enrich genes of interest to thoroughly interrogate complex RE tracts on LRS platforms. In a study of 202 males with X-linked dystonia-parkinsonism (XDP), small post-zygotic deletions were observed within AGAGGG repeats at the 5′ end causing a mosaic of divergent sequence lengths [36]. Using CRISPR/Cas9, enrichment of the canonical retrotransposon insertion, SINE-VNTR-Alu-5′-(AGAGGG)n, in TAF1 revealed a partial repeat motif of 3–4 nucleotides offsetting the sequence, the most common being (AGAGGG)2AGG(AGAGGG)n and (AGAGGG)2AGGG(AGAGGG)n. These deletions are postulated to improve sequence stability, resulting in a shorter overall repeat length that influences age of onset. Additionally, post-zygotic modifications at this locus suggest that blood-derived DNA is not necessarily the most appropriate source of genomic DNA to determine RE pathogenicity [37]. The use of CRISPR/Cas9 to target and enrich TAF1 with ONT LRS enabled patient-to-patient comparisons at a nucleotide level that ultimately led to discovery of these unique deletions and the repeat pattern [36].
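To make the idea of a partial motif offsetting the repeat concrete, the following is a minimal sketch of how a read-derived repeat tract could be split into pure runs of a primary motif and the bases that interrupt them. The function name `annotate_tract` and the toy sequence (with an arbitrary copy number of 5 standing in for n) are hypothetical and chosen only for illustration; the sketch assumes an error-free sequence and is not the method used in the cited studies.

```python
import re


def annotate_tract(seq, motif):
    """Split a repeat tract into pure motif runs and interruptions.

    Returns a list of tuples: ("repeat", copy_count) for each uninterrupted
    run of `motif`, and ("interruption", bases) for anything in between.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    parts = []
    pos = 0
    for match in pattern.finditer(seq):
        if match.start() > pos:
            # Bases that are not part of a pure motif run
            parts.append(("interruption", seq[pos:match.start()]))
        copies = (match.end() - match.start()) // len(motif)
        parts.append(("repeat", copies))
        pos = match.end()
    if pos < len(seq):
        parts.append(("interruption", seq[pos:]))
    return parts


# Toy tract modelled on the (AGAGGG)2AGG(AGAGGG)n pattern, with n = 5
tract = "AGAGGG" * 2 + "AGG" + "AGAGGG" * 5
print(annotate_tract(tract, "AGAGGG"))
# [('repeat', 2), ('interruption', 'AGG'), ('repeat', 5)]
```

Real long reads contain sequencing errors, so a practical tool would need to tolerate mismatches and indels rather than rely on exact matching as this sketch does.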
Epigenetic gene silencing
DNA methylation is a commonly studied epigenetic mechanism due to its importance in gene expression and is being applied as a method for determining RE pathogenicity. Methylation models focus on cytosine and covalent bonding of a methyl group on the pyrimidine ring, 5-methylcytosine (5mC), or the oxidised derivative, 5-hydroxymethylcytosine (5hmC) [38,39]. These derivatives can influence gene expression by interrupting the un-methylated phosphate-linked cytosine-guanine (CpG) islands, which act as a mark for transcription factors within gene promoter regions [40,41]. Promoter hypermethylation of CpG islands results in gene silencing, as seen in fragile X syndrome (FXS) where an expansion of methylated CGG triplet repeats in FMR1 causes a decrease in translated FMRP [42]. Alternatively, intronic expansions can cause gene silencing independent of the promoter region as seen in Friedreich ataxia [43]. The region directly upstream of the GAAn expansions found within intron 1 of FXN is hypermethylated while the region downstream is hypomethylated, leading to transcriptional down-regulation of FXN. This leads to a frataxin deficiency and disease symptoms of progressive ataxia of the limbs and gait, dysarthria, and cardiomyopathy [43,44].
Recently, GGC expansions in NOTCH2NLC have been associated with complex and divergent phenotypes of dementia-dominant neuronal intranuclear inclusion disease (NIID), oculopharyngodistal myopathy type 3 (OPDM3), and hereditary essential tremor 6 (ETM6) [45]. In the context of OPDM3, a repeat size range of GGC128–198 was defined as pathogenic, whereas asymptomatic parents contained an allele with a higher number of repeats (GGC>300) [46]. LRS investigation of 5mC methylation indicated the promoter region of NOTCH2NLC was hypermethylated in the unaffected parents, resulting in significantly lower blood mRNA expression [46]. These observations suggest that the larger expansion (GGC>300) is associated with DNA methylation resulting in gene silencing and reduced pathogenicity. The variable methylation signatures of NOTCH2NLC and multiple diseases associated with this RE highlight the potential requirement for 5mC and 5hmC testing for the diagnosis of NIID and OPDM3.
Determining methylation status has largely relied on DNA bisulfite sequencing, EPIC array and Southern blot, which are standalone tests that require additional time and expense. An added benefit of LRS platforms like ONT and PacBio HiFi is the ability to assess 5mC and 5hmC within existing experimental protocols. Complementary methylation callers, including Megalodon and Nanopolish [47,48], have been developed specifically for LRS, enabling single test assessment of RE size, composition and methylation profile. In addition, methylation status can be directly linked to the reads containing the RE, allowing discrimination of allele-specific expression.
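The linkage between methylation and individual reads can be sketched as follows: once each read spanning the locus has both a repeat count and a methylation fraction, reads can be grouped by allele and summarised separately. The data, the threshold of 100 repeats, and the function name `methylation_by_allele` are invented for illustration; the output formats of real methylation callers are not reproduced here.

```python
from statistics import mean


def methylation_by_allele(reads, expanded_threshold):
    """Average per-read methylation separately for each allele class.

    `reads` is an iterable of (repeat_count, methylated_fraction) pairs,
    one per read spanning the repeat. Reads with at least
    `expanded_threshold` repeats are assigned to the expanded allele.
    """
    groups = {"expanded": [], "non_expanded": []}
    for repeat_count, methylated_fraction in reads:
        key = "expanded" if repeat_count >= expanded_threshold else "non_expanded"
        groups[key].append(methylated_fraction)
    # None marks an allele class with no supporting reads
    return {key: (mean(values) if values else None) for key, values in groups.items()}


# Hypothetical reads: two from a short allele, two from a large expansion
reads = [(25, 0.0), (27, 0.25), (310, 1.0), (305, 0.75)]
print(methylation_by_allele(reads, expanded_threshold=100))
```

Grouping by a single size threshold is a simplification; where both alleles are expanded or somatic mosaicism is present, haplotype phasing or clustering of read lengths would be needed instead.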
Pentanucleotide repeats
Pentanucleotide expansions are commonly located within, or adjacent to, intronic transposable Alu elements, which are DNA sequences that contain an RNA polymerase III promoter and influence gene expression via multiple mechanisms including polyadenylation and splicing [49]. These highly polymorphic transposable elements are recognised as drivers of genetic evolution, and have more recently been associated with pathogenic expansion of small repeat tracts in the reference sequence [51]. Discovery of novel REs within intronic regions becomes increasingly difficult as complexity increases, for example when both pathogenic and non-pathogenic motifs are present. For instance, the pure ATTTT motif found in DAB1 has been reported with up to 400 repeats in unaffected individuals; however, an ATTTC interruption deep in the repeat tract is found exclusively in individuals with the SCA37 phenotype [25]. Identification and characterisation of this ATTTC motif is extremely important for diagnosis as the number of repeats is inversely correlated with age of onset, yet detection may be limited if utilising short-read WGS [25].
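As an illustration of what motif-level characterisation of such a tract involves, the sketch below cuts a sequence into consecutive 5 bp units and run-length encodes them, so that a block of ATTTC embedded within ATTTT becomes visible along with its copy number. The function name `run_length_units` and the toy sequence are hypothetical, and the sketch assumes an error-free, in-frame tract, which real sequencing data will not guarantee.

```python
from itertools import groupby


def run_length_units(tract, unit_len=5):
    """Run-length encode a repeat tract read in fixed-size units.

    The tract is cut into consecutive units of `unit_len` bases starting
    at the first base, and adjacent identical units are collapsed into
    (unit, copy_count) pairs. A trailing partial unit is kept as-is.
    """
    units = [tract[i:i + unit_len] for i in range(0, len(tract), unit_len)]
    return [(unit, sum(1 for _ in group)) for unit, group in groupby(units)]


# Toy tract: ATTTC copies flanked by the ancestral ATTTT motif
tract = "ATTTT" * 4 + "ATTTC" * 3 + "ATTTT" * 2
print(run_length_units(tract))
# [('ATTTT', 4), ('ATTTC', 3), ('ATTTT', 2)]
```

Because the units are read in a fixed frame, a single inserted or deleted base would shift every downstream unit; a production tool would need alignment-aware motif calling to cope with that.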
The polymorphic nature of intronic pentanucleotide expansions can result in somatic mutations that challenge consensus size determination and complicate delineation of RE composition. For example, the ancestral motif of ATTCT in ATXN10 is commonly found with 10–32 repeats. However, reduced penetrance and mild symptom presentation in some individuals with SCA10 has prompted a large intermediate range, with the fully penetrant pathogenic threshold being considered >800 repeats [24,52]. Interruptions to the ATTCT tract with the mutant ATCCTn or ATTCCn motif stabilise the RE but are also associated with increased symptom severity and a higher risk of developing seizures [24,52]. Amplification-free targeted sequencing and optical genome mapping (OGM) have been used to confirm somatic mosaicism in patient-derived DNA, containing the ancestral motif and mutant ATCCT or ATTCC [52]. Indeed, OGM is emerging as a powerful platform to delineate complex structural variation, including repeat expansions [53]. However, current clinical diagnostic testing protocols for SCA10 do not utilise these technologies, instead opting for fragment analysis via capillary array electrophoresis and Southern blot. Therefore, repeat motif variation deep in the repeat tract of SCA10, which may provide relevant information regarding allele pathogenicity, cannot be characterised with current clinical testing protocols.
Discovery of the pentanucleotide RE in RFC1 causing cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS) was facilitated by careful phenotyping, combining the neuro-otologic features of cerebellar ataxia with bilateral vestibulopathy (CABV) syndrome with accompanying sensory neuropathy [54]. Family studies assessing syndromic features initially separated CANVAS from the SCAs by a recessive inheritance pattern, leading to the independent gene discovery using two different approaches [55]. Cortese et al. [26] used non-parametric linkage analysis to delineate a 4p14 region of interest before manually assessing WGS for variants. Comparison between affected and unaffected family members identified biallelic expansion of between 400-2000 AAGGG repeats, affirming the recessive inheritance pattern. Rafehi et al. [20], by contrast, defined a linkage region on chromosome 4 and used EHdn software to identify biallelic AAGGG expansion in short-read WGS data. The latter study included 30 individuals with both the CANVAS phenotype and biallelic AAGGG expansions, with one heterozygous individual expected to have a second variant in trans [20]. Notably, the same study proposed a core ancestral haplotype that likely originated with a single founder effect 25 000 years ago [20].
Additional pathogenic motifs have since been reported at this locus, with individuals exhibiting the same core CANVAS phenotype, albeit with slight variations [56]. An ACAGGn repeat was identified in cohorts of Asia Pacific and Japanese ethnicity and a mixed AAAGG10–25AAGGGnAAAGG4–6 was found in individuals with Māori heritage [57,58]. Moreover, using CRISPR/Cas9 targeted LRS, novel motifs of AGGGCn, AAGGCn, and AGAGGn were validated as pathogenic, while AAAGGGn, and AAGAGn appear to be benign. These results emphasise the highly polymorphic nature of this region and the benefit of using targeted LRS for accurate characterisation of the RE causing CANVAS [59].
Familial adult myoclonic epilepsy (FAME)
FAME is an autosomal-dominant disorder characterised by cortical myoclonic tremor and epilepsy that can be triggered by photic stimulation, alcohol consumption, sleep deprivation, or emotional stress [60]. While the core symptoms are consistent across the six FAME disorders described to date (FAME1, 2, 3, 4, 6, and 7), clinical heterogeneity within the condition has made it challenging to distinguish this phenotype from more frequently occurring, clinically overlapping disorders [61]. Discovery of SAMD12 causing FAME1 was driven by linkage analysis that identified a 30 Mb region on chromosome 8 [62], with 6 families (1 Chinese and 5 Japanese) sharing the same core haplotype that contained an expanded TTTTA/TTTCA motif [27]. Subsequently, five intronic REs with the same motif and associated with the FAME phenotype have been identified in STARD7, MARCHF6, YEATS2, TNRC6A, and RAPGEF2. While these genes are all highly expressed in the brain, they are located on different chromosomes and encode functionally dissimilar proteins (Table 1). Pathogenic REs in the FAMEs commonly contain both ancestral TTTTAn and mutant TTTCAn motifs and can span between 2.2 and 18.4 kb [27,63,64]. However, genetic diagnosis can be difficult as unaffected individuals can also have large benign repeat tracts that contain only the ancestral motif [27,65]. In affected individuals, pathogenic TTTCAn repeats have been found deep within long repeat tracts composed of the ancestral TTTTAn, effectively hidden from PCR and short-read WGS technologies [27].
FAME | Gene | Chr | Repeat (GRCh38) | Canonical sequence | Pathogenic RE | Ref |
---|---|---|---|---|---|---|
1 | SAMD12 | 8 | 118366816–118366918 | (TTTTA)7(TTA)(TTTTA)13–exp | TTTTA/TTTCA440–3680 | [27] |
2 | STARD7 | 2 | 96197067–96197121 | TTTA(TTTTA)9–20TTTT | TTTTA/TTTCA661–735 | [63] |
3 | MARCHF6 | 5 | 10356347–10356411 | TTTTTTATTTA(TTTTA)10–30TTTT | TTTTA/TTTCA660–2800 | [64] |
4 | YEATS2 | 3 | 183712188–183712222 | TTTTATGTTC(TTTTA)7TTTTTT | TTTTA/TTTCA>192 | [70] |
6 | TNRC6A | 16 | 24613440–24613529 | (TTTTA)8 | TTTTA/TTTCAexp | [27] |
7 | RAPGEF2 | 4 | 159342527–159342616 | (TTTTA)5(TATTA)(TTTTA)12–14 | TTTTA/TTTCAexp | [27] |
Clinical application
The technological advancements facilitating RE discovery and characterisation have evolved rapidly, and can now identify complex STR sequences in a time- and cost-effective manner. The development of short-read NGS has proven to be a dramatic improvement on existing diagnostic tools and methods, providing the ability to screen many known RE loci simultaneously and identify novel RE/genes [9]. Despite these advantages, uptake of WGS for routine diagnostic screening of REs remains limited. For the most part, diagnostic service providers are yet to establish the bioinformatic tools that will enable them to fully utilise the power of short-read WGS for RE screening. However, even once short-read WGS is implemented in diagnostic laboratories, the diagnostic process will often still require two steps: firstly, to screen for REs in known genes, and secondly, to determine the size and sequence composition. Currently, no single protocol or technology can reliably fulfil both requirements, in part due to the unstable and polymorphic nature of REs. Clinical laboratories currently focus on a subset of more predictable REs, using dated techniques like capillary array electrophoresis and Southern blot [66].
RE discovery
The current standard for RE discovery utilises short-read WGS, which has the capability to obtain genome-wide data, encompassing coding and non-coding sequences. Pairing this technology with advanced bioinformatic tools allows for accurate examination of novel regions and grading of candidate genes. However, this methodology can have reduced effectiveness for some REs. For example, while EHdn was instrumental in the identification of the FGF14 RE in SCA27B (GAA-FGF14 ataxia), subsequent attempts to size the GAA RE using EH and exSTRa were ineffective, with both programs consistently under-estimating repeat size [21]. Repeat sizing with capillary array electrophoresis was effective up to ∼GAA350; however, larger expansions were often beyond detection limits. Utilising LRS in this situation provides consensus sequencing across the RE to determine allele sizing and motif without the need for PCR amplification.
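One general constraint on sizing large expansions from short reads is simple arithmetic: a read can only directly count the repeat units it spans. The sketch below computes how many copies of a motif fit inside a single read once some unique flanking sequence is reserved on each side for anchoring. The 150 bp read length, the 10 bp flank, and the function name `max_spanned_units` are assumptions chosen for illustration and are not taken from the cited study; the tools named above use more sophisticated models than this.

```python
def max_spanned_units(read_len, motif_len, min_flank):
    """Largest number of whole motif copies one read can fully span.

    `min_flank` bases of unique sequence are reserved on each side of
    the repeat so that the read can be anchored to the locus.
    """
    usable = read_len - 2 * min_flank
    return max(usable // motif_len, 0)


# Assumed values: 150 bp reads, GAA (3 bp) motif, 10 bp flank per side
print(max_spanned_units(read_len=150, motif_len=3, min_flank=10))  # (150 - 20) // 3 = 43

# For comparison, a tract of 350 GAA units is 350 * 3 = 1050 bp long
print(350 * 3)  # 1050
```

Under these assumptions a single read spans only a few dozen GAA units, so any larger allele must be sized indirectly, whereas a long read covering the whole tract can be counted directly.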
The introduction of LRS has revolutionised the interpretation of known pathogenic REs where characterisation has historically been impacted by experimental limitations such as PCR amplification errors, and genetic factors like somatic variability [52]. For example, RE size and sequence interruptions influence phenotype severity and age of onset; however, repeat-primed PCR can only interrogate known motifs, making it difficult to determine RE composition. The use of LRS has improved the capacity to interrogate these variable sequence motifs, allowing for annotation of interruptions rather than just reporting their occurrence [52]. Long repeat sequences, like those found in CANVAS, the FAMEs, SCA10 and SCA37, cannot always be accurately sized by capillary array electrophoresis, as size limits prohibit analysis of large PCR fragments.
Recent studies have demonstrated the utility of the ONT LRS platforms for targeted sequencing of REs, using open-source software Readfish [67]. This approach allows simultaneous interrogation of a user-defined gene list, rather than analysis of the entire genome, for example targeted analysis of pathogenic RE loci including the FAMEs, ataxias and OPDMs [10]. The obvious limitation of this approach is the inability to retrospectively reanalyse sequence data for novel genes, an approach that has been shown to have clinical utility for short-read WGS data [68]. Conversely, LRS platforms like PacBio HiFi and ONT will continue to improve read accuracy, eventually being used to call single variants and polymorphic REs with precision. These evolving technologies will vastly improve RE characterisation and redefine current knowledge, although transition from research to diagnostic application will require cost effective, multiplex capable solutions that are accurately reproducible [69].
Summary
The repetitive sequence that comprises pathogenic repeat expansions is difficult to analyse and resolve using currently established diagnostic technologies.
Complex non-coding pathogenic repeat expansions, exemplified by those underlying CANVAS and the FAMEs, highlight the challenge of novel repeat expansion discovery.
Future application of long-read single molecule sequencing methods offers considerable advantages in terms of throughput and accurate determination of size and motif composition.
Competing Interests
The authors declare that there are no competing interests associated with the manuscript.
Funding
This work was supported in part by the Australian Government National Health and Medical Research Council (GNT2001513), the Medical Research Future Fund (MRF2007677), Serp Hills Foundation and JTM Foundation. Additional funding was provided by the Independent Research Institute Infrastructure Support Scheme and the Victorian State Government Operational Infrastructure Program. KCD and GT are supported by an Australian Research Training Program Scholarship.
Open Access
Open access for this article was enabled by the participation of University of Melbourne in an all-inclusive Read & Publish agreement with Portland Press and the Biochemical Society under a transformative agreement with CAUL.
Abbreviations
- CANVAS: cerebellar ataxia, neuropathy, and vestibular areflexia syndrome
- FAME: familial adult myoclonic epilepsy
- LRS: long-read sequencing
- NGS: next generation sequencing
- OGM: optical genome mapping
- ONT: Oxford Nanopore Technologies
- OPDM3: oculopharyngodistal myopathy type 3
- RE: repeat expansion
- SCA: spinocerebellar ataxia
- STR: short tandem repeat
- WGS: whole genome sequencing