DNA is a fundamentally important molecule for all cellular organisms due to its biological role as the store of hereditary, genetic information. On the one hand, genomic DNA is very stable, both in chemical and biological contexts, and this assists its genetic functions. On the other hand, it is also a dynamic molecule, and constant changes in its structure and sequence drive many biological processes, including adaptation and evolution of organisms. DNA genomes contain significant amounts of repetitive sequences, which have divergent functions in the complex processes that involve DNA, including replication, recombination, repair, and transcription. Through their involvement in these processes, repetitive DNA sequences influence the genetic instability and evolution of DNA molecules and they are located non-randomly in all genomes. Mechanisms that influence such genetic instability have been studied in many organisms, including within human genomes where they are linked to various human diseases. Here, we review our understanding of short, simple DNA repeats across a diverse range of bacteria, comparing the prevalence of repetitive DNA sequences in different genomes. We describe the range of DNA structures that have been observed in such repeats, focusing on their propensity to form local, non-B-DNA structures. Finally, we discuss the biological significance of such unusual DNA structures and relate this to studies where the impacts of DNA metabolism on genetic stability are linked to human diseases. Overall, we show that simple DNA repeats in bacteria serve as excellent and tractable experimental models for biochemical studies of their cellular functions and influences.
Simple DNA repeats
DNA molecules are the store of genetic information for all cellular organisms. The arrangement of individual bases in the DNA sequences of an organism, its genome, is specific to that organism, and the elucidation of massive numbers of genome sequences has shaped our understanding of the phylogenetic tree of life. The organization of sequences in any genome is critical for its function and, from the earliest days of genome sequence analysis, it was recognized that natural DNA molecules contain a wide array of repeating sequences. In fact, this was particularly important in many genomic studies because it is challenging to obtain accurate sequence data for such regions. Repeat sequences of ∼1–6 base pairs (bp) in their unit structure are termed simple repeating sequences, due to their sequence being less complex (‘simpler’) than random sequences [4,5]. Such simple sequences are often called microsatellites and the term ‘short tandem repeats’ is also used frequently in the literature. Although most base sequences will be found within double-stranded DNA molecules, within this review we generally refer to sequences via a single strand, given in the 5′-3′ direction.
Simple repeating sequences can be distinguished by their sequence motif and base composition [4–7]. The various sequence motifs consist of different lengths of the repeat unit, such as mono-, di-, tri-, or tetranucleotide repeats. For example, mononucleotide repeats are tracts of a single nucleotide in the sequence. Within repeating units there is some redundancy within DNA sequences, e.g. (CT)n also contains (TC)m, where ‘n’ and ‘m’ refer to numbers of repeats (see Figure 1). Depending on the sequences that flank the repeat, ‘n’ and ‘m’ may be equal, or they may differ by 1. Importantly, DNA molecules have a directionality associated with them, with the 5′- and 3′-ends usually containing terminal phosphate and terminal hydroxyl groups, respectively. Following the convention of writing sequences in a 5′-3′ direction and the antiparallel arrangement of complementary chains in double-stranded DNA molecules, there are just two options for mononucleotide repeats (A/T or C/G base pairs) and four different types of dinucleotide repeats: (AT)n, (GT)n, (GA)n, and (GC)n. A similar analysis of trinucleotide repeats identifies ten different repeat sequences. Classical examples of microsatellites consist of an uninterrupted sequence of tandem repeats of the same motif (Figure 1). When one or more bases interrupt the repeat array, the microsatellite is termed ‘interrupted’ (also sometimes called ‘imperfect’). Juxtapositions of two types of repeat (called ‘compound’ or sometimes ‘composite’ microsatellites) also occur frequently in genomes (Figure 1).
Nomenclature to illustrate variations of microsatellite repeats.
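The counts of distinct repeat types quoted above (two mononucleotide, four dinucleotide, and ten trinucleotide repeats) can be checked by enumeration. The following is a minimal sketch, assuming that two motifs belong to the same repeat type when one is a rotation of the other or of its reverse complement, and that motifs which are themselves repeats of a shorter unit (such as AA) are excluded. The function names are illustrative and do not come from the cited studies.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def is_primitive(motif: str) -> bool:
    """True if the motif is not a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def canonical(motif: str) -> str:
    """Alphabetically smallest rotation of the motif or its reverse complement."""
    forms = []
    for m in (motif, reverse_complement(motif)):
        forms.extend(m[i:] + m[:i] for i in range(len(m)))
    return min(forms)


def repeat_classes(unit_length: int) -> list[str]:
    """List one canonical representative per distinct repeat type."""
    motifs = ("".join(p) for p in product("ACGT", repeat=unit_length))
    return sorted({canonical(m) for m in motifs if is_primitive(m)})


for k in (1, 2, 3):
    classes = repeat_classes(k)
    print(k, len(classes), classes)
```

In this representation the class written `AC` corresponds to (GT)n in the text, `AG` to (GA)n, and `CG` to (GC)n, because each class is labelled by its alphabetically smallest member.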
Some repetitive elements are referred to as ‘inverted repeats’ because the rules of complementary base pairing mean that their sequence is the same when the complementary strand is read in its 5′-3′ direction (Figure 2A) . Since inverted repeats will occur on both strands at the specific location, they can adopt a specific structure referred to as a cruciform (Figure 2B) — see below for more details. Such sequences are targets for many architectural and regulatory proteins and their importance has been demonstrated for several basic biological processes. As we discuss below, such processes may be regulated by the formation of specific types of localized DNA structures at these sequences.
Inverted repeat DNA sequences can adopt different types of three-dimensional structure.
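The definition of an inverted repeat given above translates directly into a sequence test: the sequence must equal its own reverse complement. The sketch below assumes a perfect inverted repeat with no spacer between the two arms; natural inverted repeats often contain a central spacer, which this simple check does not handle.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def is_inverted_repeat(seq: str) -> bool:
    """True if both strands read identically in the 5'-3' direction."""
    return seq == reverse_complement(seq)


for name, seq in [("(AT)5", "AT" * 5), ("(CG)5", "CG" * 5), ("(AG)5", "AG" * 5)]:
    print(name, is_inverted_repeat(seq))
```

With these inputs, (AT)5 and (CG)5 pass the test while (AG)5 does not, so only the first two have the symmetry needed for cruciform extrusion.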
Prevalence of DNA repeats in bacterial genomes
Advances in DNA sequencing technologies have generated massive numbers of genome sequences for prokaryotes due to their relatively small size and ease of experimental manipulation. Most genome sequences are deposited in databases that make them publicly available. One such archive is the genome database at the National Center for Biotechnology Information (NCBI), which contained DNA sequences from over two hundred thousand bacteria (206 445 as of 13/09/2019).
One of the first sequenced and best characterized bacterial genomes is that of Escherichia coli, which comprises 4.6 million base pairs with 4288 annotated protein-coding genes, seven ribosomal RNA operons, and 86 transfer RNA genes. It is clear that there is a massive variation in phenotypes of bacteria, which is reflected in the huge variety of sizes and types of sequences found within their genomes. The vast majority of bacterial genomes are circular, consisting usually of large chromosomes and small plasmids. However, this is not always the case and there are notable examples of bacteria that harbour linear genomes, including some that are industrially important, such as Streptomyces coelicolor [12,13]. Indeed, there is vastly more evolutionary divergence among bacteria than is found among all other organisms on earth. Many of the examples discussed in this review refer to E. coli because that system allows good correlation between bioinformatics and laboratory-based biological studies, but representative details from other organisms are discussed as appropriate.
All DNA genomes contain amounts of repetitive sequences that are larger than expected for random distribution of bases, but the percentage of repetitive sequences varies greatly across different organisms. For example, while the genome of E. coli contains only 0.7% of repeats in non-coding regions , at least 50% of the human genome is repetitive or repeat-derived . As discussed in more detail below, through their involvement in DNA metabolism, repetitive DNA sequences have a dramatic influence on the genetic instability and evolution of genomes and organisms. These factors are some of the major forces that drive the increased prevalence of repeats within genomes compared with what would be expected if all bases were distributed randomly.
While simple DNA repeats are over-represented in the human genome and, generally, in eukaryotic genomes, in bacteria they are less common and are often subjected to negative selection. However, significant differences in the amounts of simple DNA repeats exist, even among closely related species, as shown in mycoplasma. An algorithm was developed to search specifically for tandem repeats. Refinement of these approaches has led to computer-based analyses of whole microbial genome sequences that reveal overrepresentation of several simple DNA repeats. Such screening of the genome sequence of E. coli strain K12 identified thousands of tandem simple sequence repeat tracts, with motifs ranging from 1 to 6 nucleotides. In addition to simple microsatellites, the repeats identified include transposable genetic elements.
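To illustrate the kind of screen described above, the following minimal sketch scans a sequence for perfect tandem repeats with unit lengths of 1 to 6 nucleotides. It is not the published algorithm: the scanning strategy, the `min_copies` threshold, and the function names are assumptions chosen for illustration, and interrupted or compound tracts are not merged.

```python
def is_primitive(motif: str) -> bool:
    """True if the motif is not a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_tandem_repeats(seq: str, max_unit: int = 6, min_copies: int = 4):
    """Return (start, unit, copies) for each perfect tandem repeat tract."""
    hits = []
    for k in range(1, max_unit + 1):
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            if not is_primitive(unit):
                i += 1
                continue
            # Count how many consecutive copies of the unit follow position i.
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * k  # jump past the tract just reported
            else:
                i += 1
    return sorted(hits)


toy = "GGC" + "A" * 8 + "CGT" + "CA" * 5 + "TTG"
print(find_tandem_repeats(toy))
```

On this toy sequence the function reports one mononucleotide tract (eight A residues starting at position 3) and one dinucleotide tract (five copies of CA starting at position 14), using zero-based positions.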
Comprehensive analyses of DNA sequence frequencies in various genomes have been published in the genome composition database (GCD) . The genome-wide analysis of E. coli strain K12 already referred to shows a significant excess of mono- and trinucleotide repeats only . The presence of the mononucleotide repeats is unequal for the two types and differs according to the GC contents of individual organisms . For example, the GC content of E. coli K12 strain is 50.79%, but 93% of the mononucleotide repeats in its genome are formed by A (or T, its complement), both in open reading frames (ORFs) and in non-coding regions . Similarly, the distribution of dinucleotide repeats in the genome of E. coli strain K12 is not random, with the (CG)n motif being very abundant in coding regions (49.1% of all dinucleotide repeats, compared with 17.3% expected).1
1 The expected frequencies referred to here were determined by observing those in 10 computer-generated genomes constructed by random ordering of nucleotides according to their overall frequencies in the genome, with departures tested using parametric statistics.
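The randomization described in this footnote can be sketched as follows. The code assumes that "random ordering of nucleotides" means shuffling the observed sequence, which preserves base composition exactly; the tract-counting rule (at least three consecutive copies of a motif), the fixed seed, and the function names are illustrative assumptions, and the parametric test is not reproduced.

```python
import random
import re
from collections import Counter


def shuffled_genome(seq: str, rng: random.Random) -> str:
    """Return the same bases in a random order (composition is preserved)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def count_tracts(seq: str, motif: str, min_copies: int = 3) -> int:
    """Count non-overlapping tracts with at least min_copies of the motif."""
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_copies},}}")
    return len(pattern.findall(seq))


def expected_tracts(seq: str, motif: str, n_genomes: int = 10, seed: int = 0) -> float:
    """Mean tract count over n_genomes shuffled versions of the sequence."""
    rng = random.Random(seed)
    counts = [count_tracts(shuffled_genome(seq, rng), motif) for _ in range(n_genomes)]
    return sum(counts) / n_genomes


toy = "CG" * 20 + "AT" * 20
print("observed:", count_tracts(toy, "CG"))
print("expected from shuffles:", expected_tracts(toy, "CG"))
```

Comparing the observed count with the shuffled mean gives the observed-versus-expected contrast reported in the text, here on a toy sequence rather than a real genome.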
Similar analyses of repeats with larger unit lengths also showed that not all combinations are equally distributed in genomes. In E. coli strain K12 the maximum observed repeat length is four for tetranucleotide repeats, there are no pentanucleotide repeats and only three hexanucleotide repeats. Furthermore, the frequencies of repeats with a specific motif of three or more bases were not distributed equally across all possible combinations. Most notably, among 52 examples of tetranucleotide repeats, (TGGC)n (and its complement (GCCA)n) occurred 21 times in coding sequences. The finding that the E. coli genome is rich in (TGGC)n has been attributed to the activity of very short patch repair, which corrects T : G mismatches to C : G, thus increasing GC dinucleotide content in the genome.
The length and type of simple repeat sequences also vary significantly in different locations of genomes. For example, simple repeats that are rich in G bases on one strand (and C bases on the other) are often located at the ends of chromosomes. Known as telomeres, these repeats have been best characterized in the genomes of eukaryotes , but they also occur in some bacteria [12,13].
Analyses of short simple repeats among different strains of E. coli show that the number of repeats is polymorphic. Determination of the size of repeat tracts can be used to identify different strains, provided that the potential for size variation within short repeats is taken into account. This approach can quickly diagnose the presence of different strains of bacteria, allowing identification of those that may be pathogenic, as demonstrated with E. coli [25,26], Staphylococcus aureus, Mycobacterium leprae, and many others.
DNA structures formed by DNA repeats
DNA molecules, including those containing repetitive sequences, mostly form the two-stranded, right-handed helical B-form structure . This structure maximizes the thermodynamic stability of the molecule and is crucial for fundamental biological processes that store, replicate, and transcribe genetic information. Nevertheless, various alternative (non-B) structures can also occur in DNA. These structures are usually characterized by the occurrence of single-stranded regions (loops) and/or sites of disrupted base pair stacking (junctions between continuous B-form DNA and the alternative structure). Since the disruption of hydrogen bonds and stacking interactions represents a loss of enthalpic contribution to the free energy of the molecule, any transition from B-form DNA to an alternative structure requires an input of energy. An alternative structure can be favoured if there are alterations to the sequence of one strand, for example when the complementary strand is absent or present in a sub-stoichiometric amount (as in the structure depicted in Figure 3B). However, some environmental (and cellular) conditions promote the formation of alternative structures due to their improved thermodynamic stability compared with B-form DNA under the given conditions. This type of situation occurs for some types of repetitive DNA sequences in vitro, with increasing evidence that such structures also exist within cells (see below). The types of the structure adopted by repetitive DNA sequences — and their thermodynamic stabilities — are influenced by the length and type of bases within the repeat. Furthermore, topological stress, which is inherent to the majority of DNA molecules inside cells, is another important factor that influences local DNA structures. Typically, DNAs in bacterial cells exist as negatively supercoiled molecules, which can lead to destabilization of right-handed, double-helical DNA [29,30]. 
In the presence of suitable nucleotide sequences, certain levels of negative superhelical stress can be locally absorbed via the transition from the B-form DNA to an open local structure. This can assist the formation of non-B-DNA structures, as shown in vitro for various types of repeats [31–34]. There is particularly strong evidence that higher levels of negative supercoiling increase the extent of cruciform formation in dinucleotide repeats. This has been confirmed for (AT)n sequences in vitro and in E. coli [29,35]. Variations in levels of DNA superhelicity naturally occur in vivo in ‘active’ regions of the genome, where processes that involve unravelling of the DNA double helix take place, such as transcription, replication, and recombination.
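The way a local structural transition absorbs superhelical stress can be made concrete with some simple arithmetic. The sketch below uses the standard textbook approximation that relaxed B-DNA has about 10.5 bp per helical turn, and assumes that a region extruded into a cruciform no longer contributes any twist to the duplex. The plasmid size, superhelical density, and repeat length are hypothetical values chosen only for illustration.

```python
HELICAL_REPEAT_BP = 10.5  # assumed bp per turn for relaxed B-form DNA


def relaxed_linking_number(length_bp: int) -> float:
    """Linking number of the relaxed circular molecule (Lk0 = N / h)."""
    return length_bp / HELICAL_REPEAT_BP


def linking_difference(length_bp: int, sigma: float) -> float:
    """Linking difference from superhelical density: dLk = sigma * Lk0."""
    return sigma * relaxed_linking_number(length_bp)


def turns_absorbed(extruded_bp: int) -> float:
    """Negative superhelical turns relaxed when extruded_bp leave the duplex."""
    return extruded_bp / HELICAL_REPEAT_BP


def residual_sigma(length_bp: int, sigma: float, extruded_bp: int) -> float:
    """Approximate superhelical density remaining after the transition."""
    remaining = linking_difference(length_bp, sigma) + turns_absorbed(extruded_bp)
    return remaining / relaxed_linking_number(length_bp)


# Hypothetical example: a 3000 bp plasmid at sigma = -0.06 carrying (AT)17.
print(round(linking_difference(3000, -0.06), 2))  # about -17.14 turns
print(round(turns_absorbed(34), 2))               # about 3.24 turns
print(round(residual_sigma(3000, -0.06, 34), 4))  # about -0.0487
```

Under these assumptions, extruding a 34 bp cruciform relaxes roughly three of the seventeen negative superhelical turns, which is why such transitions become more favourable as negative supercoiling increases.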
Ribbon scheme of localized non-B-DNA structures.
Due to complementary base pairing in double-stranded DNA, mononucleotide repeats are inherently homopurine on one strand and homopyrimidine on the other. While A tracts are prone to DNA bending, homopurine/homopyrimidine tracts, in general, are able to form triplex structures (Figure 3A). Mononucleotide repeats naturally possess mirror symmetry, which is a feature favouring triplex structures via the formation of Hoogsteen triads, as shown in Figure 4. Hoogsteen hydrogen bonding occurs between the purine-rich strand of the duplex DNA and either a pyrimidine-rich or a purine-rich third strand. Pyrimidine-rich third strand interactions are stabilized by Hoogsteen hydrogen bonds that are favoured at low pH, which promotes the cytosine protonation required for Hoogsteen pairing. In contrast, purine-rich third strand interactions form reverse-Hoogsteen hydrogen bonds, which do not require acidic pH and are stabilized by bivalent cations.
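The two sequence features mentioned above, homopurine/homopyrimidine composition and mirror symmetry, are straightforward to test on one strand. This is a minimal sketch: the minimum tract length of 10 bases is an arbitrary illustrative threshold rather than a value taken from the literature, and the function names are hypothetical.

```python
import re


def homopurine_homopyrimidine_tracts(seq: str, min_len: int = 10):
    """Return (start, tract) for runs made only of purines or only of pyrimidines."""
    pattern = re.compile(rf"[AG]{{{min_len},}}|[CT]{{{min_len},}}")
    return [(m.start(), m.group()) for m in pattern.finditer(seq)]


def is_mirror_repeat(seq: str) -> bool:
    """True if the sequence reads the same forwards and backwards on one strand."""
    return seq == seq[::-1]


toy = "ACGTC" + "GAAGAAGAAGAA" + "TCATGC"
print(homopurine_homopyrimidine_tracts(toy))
print(is_mirror_repeat("A" * 12), is_mirror_repeat("GAAGAAG"))
```

Note that mirror symmetry compares a strand with its own reverse, whereas the inverted-repeat test compares it with its reverse complement; the two properties favour different structures (triplexes and cruciforms, respectively).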
Watson–Crick and Hoogsteen hydrogen bonds in triplex DNA molecules.
Mononucleotide repeats can also undergo strand slipping transitions, resulting in extrusion of a hairpin (Figure 3B) or a pair of hairpins that are separated from each other (Figure 3C). The proclivity to strand slipping is a common feature of simple repeats, playing a crucial role in their change in size during replication [31,37]. Conditions for good thermodynamic stability of hairpins have been well characterized in vitro for trinucleotide repeats such as (CGG)n, (CAG)n, and (CTG)n, even though these contain base mispairs or wobble pairs, such as T•T, A•A, or G•G [38,39].
For dinucleotide repeats the length observed in typical microsatellites varies from 5 to 50 repeats. Importantly, while all dinucleotide sequences are direct repeats, some are also inverted repeats (e.g. (AT)n and (CG)n), whereas others are not (e.g. (AG)n and (AC)n). This is significant because those that are inverted repeats are able to form cruciform structures (Figures 2 and 3D). At the same time, these sequences are composed of (purine–pyrimidine)n motifs that are capable of forming a segment of left-handed, Z-form, double helix under certain conditions .
Tandem repeats involving Gn blocks and mononucleotide repeats consisting of G-tracts are able to form quadruplex structures (Figure 3E). Such structures are typically formed when four G nucleotides can be brought together in a planar arrangement to form guanine quartets involving Hoogsteen G–G pairing (see Figure 4) and are usually stabilized by the presence of monovalent cations in the middle of each G-quartet. Note that the presence of G-tracts on one strand means that C-tracts must be present on the complementary strand, and such sequences can adopt other non-B-DNA structures, such as the i-motif, which we discuss in more detail below.
A strikingly wide range of sequences have been demonstrated to form stable G-quadruplexes under different environmental conditions [37,41]. Not all of these sequences are classically considered as simple DNA repeats, but G-quadruplexes can be formed by various types of short repeats of G bases within longer sequences. Some of the sequences that can form G-quadruplexes are simple microsatellite sequences, such as trinucleotide and hexanucleotide repeats [42,43]. Other sequences that are more complex in base composition can also form G-quadruplexes, but they all contain G-tracts that are repeated with specific periodicities. Within any particular sequence that can form G-quadruplexes the bases that separate the G-tracts may be different in type and number and, thus, they represent a complicated type of interrupted repeat tract (see Figure 1). In general, longer G-tracts and shorter interruptions form more stable G-quadruplexes, although the size of the loop also affects the type of folding seen in stable quadruplexes. Importantly, the likelihood of G-quadruplexes forming in genomes varies dramatically in different locations of DNA molecules. For example, simple repeats that are rich in G bases are often found at telomeric ends of chromosomes and there is significant evidence that such sequences form complexes of proteins specifically bound to four-stranded structures. Telomeres have been best characterized in the genomes of eukaryotes, including humans, but they also occur in some bacteria [12,13,22].
Non-B-DNA structures are also able to form within sequences that would not typically be able to form significant levels of base pairing. For example, mononucleotide Cn sequences and repeats with Cn blocks are able to form hairpins (Figure 3B) and i-motif structures (Figure 3F) under conditions allowing the formation of hemi-protonated C+/C base pairs [47,48]. Following similar arguments to those presented above for G-quadruplexes, sequences that can form i-motifs are not all classically considered as simple DNA repeats. However, all of these sequences do contain C tracts that are repeated with specific periodicities and, thus, are relevant to topics discussed in this review. The i-motif structures require four C-rich stretches, which can be provided by four distinct strands, by two hairpins each carrying two cytosine stretches, or by a single strand with four cytosine stretches [49,50]. Recent observations have indicated that it is possible to achieve stable i-motifs at physiological pH without the use of crowding agents, if there are at least five cytosine bases per tract [48,51].
Trinucleotide repeat sequences also adopt many of the structures described above, depending on the environmental conditions and the type of sequence. For example, they can form slipped-stranded DNA and hairpins, and (CGG)n repeats have also been shown to form G-quadruplexes under specific conditions [52,53]. R-loops (Figure 3G) are another altered structure, which can be thermodynamically stable in (CAG)n and (GAA)n [54,55]. Major structures formed by (GAA)n are triplexes in which the third strand can be derived from either the pyrimidine strand or the purine strand [56,57]. One related structure that has particularly high thermodynamic stability in these sequences has been referred to as ‘sticky DNA’ because of the way it brings together multiple triplexes.
Thus, many molecular and biochemical studies demonstrate that simple repeating DNA sequences form a wide array of non-B-DNA structures in vitro. Whether such structures influence biological processes and consequences are questions that have been addressed in different cell types, including several bacteria, as we now discuss.
Biochemical and cellular impacts of simple repeat sequences in bacteria
Within the highly complex environment in cells, various local structures in long, genomic DNA molecules appear to serve as markers of the location of specific activities or functions. Examples of the types of cellular functions that they are involved in are highlighted in Figure 5. The biological relevance of these types of non-B-DNA motifs in recombination, replication, and the regulation of gene expression has long been proposed . Furthermore, several studies have demonstrated the important role of non-B-DNA structures in the context of gene regulation in bacteria [30,60,61]. For example, cruciforms have been shown to be important for dynamic genome organization , and for replication of the circular molecules of genomes, plasmids, mitochondrial DNAs , and chloroplast DNAs . Cruciforms are targets for many architectural and regulatory proteins  and their importance has been demonstrated for the regulation of transcription of some genes .
Suggested biological roles of simple DNA repeats.
Three-stranded triplex structures can be formed in a range of simple repeats, and structures of many different types have been characterized . Genomic loci containing motifs that can form triplexes are significantly more likely to undergo genome rearrangement compared with control sites, as demonstrated in certain Enterobacteria and Cyanobacteria species . A systematic search of 5246 different bacterial plasmids and genomes for intra-strand triplex motifs was conducted and the results summarized in the ITxF database . This database points to the importance of these types of sequences (and their potential to form non-B-DNA structures) in influencing the genetic stability of bacterial genomes.
Several bioinformatics tools have been developed to identify potential quadruplex sequences in genomes, such as QGRS Mapper and G4Hunter. In another example, the ProQuad database developed simple rules for G-quadruplex forming patterns and used them to assess the occurrence of repeating G-tracts and their association with different genomic regions. This initially identified potential quadruplex sequences within the genomes of 146 bacterial species, and an updated database, QuadBase2, mined motifs across genes and their promoter sequences in 1719 prokaryotes. This database can be used to identify the number and location of repeats within large genome sequences. As an example, we use this to identify potential quadruplex forming sequences in the genome of E. coli K12 strain, highlighting 69 sequences, 37 in the plus strand and 28 in the minus strand (Figure 6). A separate genome-wide analysis of 18 microbes indicated enrichment of G-quadruplex DNA motifs in putative promoters, with detailed analysis in E. coli suggesting a global role for them in ‘turning-on’ transcription during certain growth phases. Along with in vitro data demonstrating that quadruplexes are bound by some proteins [46,73], these findings point towards physiological functions for G-quadruplexes. In this respect, it is significant that genomes with high G + C content are more able to form four-stranded structures with relatively high thermodynamic stability [37,74]. There is increasing evidence that these types of structures provide opportunities to regulate DNA metabolism in bacteria [51,75–77]. The genome of the bacterium Paracoccus denitrificans PD1222 has a relatively high G + C content (∼67%) and a range of biophysical, molecular, and microbiological studies show that targeting of four-stranded structures can be controlled under cellular conditions, allowing regulation of expression of some genes [48,78–81].
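A pattern-based search of the kind described above can be sketched with a regular expression. The rule used here, four runs of at least three G bases separated by loops of 1 to 7 bases, is a widely used folding heuristic and is an assumption for this example; it is not necessarily the rule applied by ProQuad, QuadBase2, QGRS Mapper, or G4Hunter. Both strands are scanned, and minus-strand hits are reported with coordinates on the plus strand.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumed heuristic: G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def find_g4_motifs(seq: str):
    """Return (strand, start, motif) for non-overlapping matches on both strands."""
    hits = [("+", m.start(), m.group()) for m in G4_PATTERN.finditer(seq)]
    minus = reverse_complement(seq)
    for m in G4_PATTERN.finditer(minus):
        # Convert the position on the minus strand to plus-strand coordinates.
        hits.append(("-", len(seq) - m.end(), m.group()))
    return hits


toy = "ATAT" + "GGGTTAGGGTTAGGGTTAGGG" + "ATAT"
print(find_g4_motifs(toy))
print(find_g4_motifs(reverse_complement(toy)))
```

Counting the `+` and `-` entries separately gives per-strand totals of the kind quoted for E. coli K12, although the numbers obtained will depend on the exact rule and genome version used.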
Potential quadruplex forming sequences are dispersed throughout the Escherichia coli genome.
Scientific interest in the genetic stability of simple DNA repeats took on much wider significance when it was recognized that length changes within them are linked to human diseases and disorders. In the 1990s, genetic instability of microsatellites was identified as a useful diagnostic tool for some types of cancer and is associated with some hereditary neurological disorders in humans [54,82–84]. Much effort has been put into analyzing cellular mechanisms that lead to genetic instabilities of trinucleotide repeats, aiming to understand why some are more prevalent in human disorders, the most common of which are CAG, CTG, CGG, and GAA. Recent molecular studies have confirmed that other simple repeats are also important for human diseases [54,58]. These links have driven many studies that focus on DNA repeats in bacteria where it is often more tractable to conduct genetic analyses.
Different models have been proposed to explain genetic instabilities observed in simple repeats. Many of them involve DNA synthesis, including DNA replication, and various types of DNA repair and recombination [7,31,33,82,84–86]. Extensive experiments using E. coli confirmed that length changes in plasmid-based DNA trinucleotide repeats are affected by replication. The observations are consistent with known biochemical properties of replication forks and lead to suggestions that the sequence within the repeat influences the thermodynamic stability of unusual structures in the DNA [31,33,84,86,87]. Other processes acting on DNA can impact on mechanisms by which DNA synthesis influences the genetic stability of simple repeats in E. coli. For example, transcription of DNA mononucleotide repeats blocked their subsequent replication , and transcription into trinucleotide repeats in plasmids influenced the frequency of deletions to the repeat [89–91]. These experiments highlight that interactions between different processes acting on DNA combine to influence their genetic instability. Interactions may be particularly relevant for processes that use similar proteins, such as DNA polymerases in DNA replication and repair.
The link to DNA repair systems has intriguing roles in relation to genetic instabilities of simple DNA repeats because some of them recognize any aspect of genome structure that is different from the standard base pairs and double helix, including non-B-DNA structures [33,74,92]. All cells contain proteins that recognize and repair such genome alterations, protecting genomic integrity by different pathways, which include mismatch repair (MMR), nucleotide and base excision repair, and the repair of double- and single-strand breaks [83,93,94]. Generally, the DNA repair pathways and their proteins are well conserved, which means that there is much to be gained from studies of these systems in simpler experimental models, such as bacteria [95–97]. As described below, experiments using bacteria, particularly different strains of mycobacteria, have been very useful for understanding how DNA repair systems influence the genetic stability of simple DNA repeats.
An important physiological role for some DNA repair pathways is to prevent significant changes to the type and number of bases within the genome. However, the genetic instabilities observed within DNA repeats indicate that modifications to the size of the genome are not always repaired. Possibly, cells may not be able to repair some types of length changes to repeats due to non-recognition of certain structures or inaccessibility of DNA processed by some events. Alternatively, mutations in repair proteins may induce length alterations to repeats. Numerous studies show that the impact of DNA repair pathways on repeat tract stability is complex [84,85]. Importantly, some non-B-DNA structures are identified as modifications to be removed, at least in some contexts or under certain conditions.
MMR and nucleotide excision repair (NER) are fundamental cellular systems involved in maintaining genomic integrity [82,83,85,93,94]. MMR is able to detect and replace mismatched base pairs that are introduced during inaccurate DNA synthesis. Without such repair, these mismatched base pairs are a source of mutations within genomes. Upon inactivation of MMR, increased heterogeneities have been observed at simple repetitive DNA (e.g. mono- and dinucleotides) in bacteria [82,87], suggesting that the genetic instability of simple repeats reflects an increased rate of mutation throughout the whole genome. Due to this phenomenon, such deficiencies within DNA repair systems have been termed the ‘mutator phenotype’. Generally, NER systems recognize a wide range of lesions and damage due to distortion of the DNA double helix, and unusual DNA structures that could form in repeat tracts are likely to be activators of NER [83,93,94]. Studies in E. coli observed that NER proteins influenced the genetic stability of long plasmid-based DNA trinucleotide repeats in a complex fashion [33,82,87]. Associations between defective MMR and NER and elevated microsatellite instability are linked to some human diseases, and are particularly strong for hereditary nonpolyposis colorectal cancer.
In contrast with their usual cellular functions, the excision repair systems can enhance the genetic instabilities of DNA repeats since they provide opportunities for non-B-DNA structures to form on single-stranded regions that are exposed as the damage is excised from the DNA helix. Therefore, the repair processes themselves can lead to further consequences, such as addition or deletion of bases, which would be observed as genetic instability [82,85,93,94]. Furthermore, abundant evidence demonstrates that unusual DNA structures may be recognized as ‘damaged DNA’ by DNA repair systems, sometimes leading to the deletion of the sequence [92,99,100]. To reduce such potential problems, cells also use enzymes, such as DNA helicases, to resolve unusual DNA structures. For example, the RecQ helicases are capable of unwinding G-quadruplex DNA, and this family of enzymes is conserved and essential for genomic stability in organisms from E. coli to humans [102,103].
Genetic instabilities within mono- and dinucleotide repeats increase for longer runs of consecutive repeats and, therefore, are decreased by interruptions in the repeat sequence . These observations are consistent with the hypothesis that slipped-strand mispairing during DNA synthesis generates misaligned intermediates. Such parameters are intrinsic to the DNA repeat, but flanking sequences also influence the genetic stability of simple repeat sequences. These observations suggest that many aspects of DNA metabolism affect the genetic stability of all microsatellite sequences.
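Because instability is described above as depending on the length of the uninterrupted run rather than on the total tract size, a useful quantity to extract from a sequence is the longest pure run of a given motif. The following is a minimal sketch with a hypothetical function name; it counts only perfect, in-register copies of the motif.

```python
import re


def longest_pure_run(seq: str, motif: str) -> int:
    """Return the largest number of consecutive, uninterrupted motif copies."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


pure = "CA" * 12
interrupted = "CA" * 6 + "T" + "CA" * 6
print(longest_pure_run(pure, "CA"))         # 12
print(longest_pure_run(interrupted, "CA"))  # 6
```

A single interrupting base halves the longest pure run in this toy example, which illustrates why interrupted tracts offer fewer opportunities for slipped-strand mispairing than pure tracts of the same overall size.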
Through their effects on DNA metabolism, repetitive DNA sequences have a dramatic influence on the genetic instability and evolution of genomes and organisms. The high levels of genetic instability of repetitive DNA sequences may act to promote the evolution of genomic sequences [84,104]. It has been suggested that length changes to simple repeats can normally be tolerated because they do not have dramatic consequences for the organism in question and that deleterious consequences occur only at extreme length changes [104,105], as described for the trinucleotide repeat diseases. However, it is clear that simple DNA repeats in bacteria represent hypermutable loci associated with reversible changes in the number of repeats [2,106]. Variability of the length of simple DNA repeats can lead to increased antigenic variation in the pathogen population. Such length changes have been clearly demonstrated in bacteria, where this property means that simple DNA repeats can act as prerequisites for bacterial phase variation and adaptation, providing clear evidence that length variations to repeat tracts are used as a means of modulating gene expression. For example, in some bacteria, such as Haemophilus influenzae, the susceptibility of microsatellites to reversible length changes is used to control specific genes that allow environmental adaptation [104,108]. Thus, the hypermutable repeat sequence allows the bacterium to respond swiftly to changes in environmental conditions and adapt to different situations [104,109]. Such variability in repeat tracts can even affect the virulence of some bacteria, as seen in H. influenzae and Neisseria meningitidis [110,111]. Variation in the overall size of the repetitive domains was detected even among bacteria sub-cultured from a single colony, highlighting that the altered size of the repeat was intrinsic to the sequence.
From the earliest studies of natural DNAs, it became clear that repetitive DNA sequences are common, leading to expectations that there must be biological reasons to explain this. The advent of large numbers of genome sequences has reinforced these observations, but biologists continue to assess the full biological significance of repetitive regions of genomes. Different aspects of DNA metabolism influence genetic instabilities within these sequences, and many of the studies that have improved knowledge have originated in bacteria, where the experiments are most tractable. An important corollary of the results from such studies is that many of the biochemical pathways are found in all organisms, meaning that many of the conclusions are relevant to all organisms.
Genetic instabilities of simple repeats may be mediated by many biochemical processes, including DNA replication-based slipped-strand mispairing, small slipped-register DNA synthesis, tandem duplications, and gene conversion-recombination processes. These processes may occur independently or in concert with each other and/or other DNA metabolic processes such as MMR, NER, DNA polymerase proofreading, SOS repair, and transcription. It is also clear that the structural properties of the simple repeats (hairpin loop formation, slipped structures, triplexes, etc.) play a consequential role in their genetic instabilities. The involvement of unusual DNA structures may occur because they are inherent within simple repeats inside cells, or because enzymes manipulating DNA may promote their formation. Either way, the presence of unusual structures within simple repeats is likely to influence the interaction of the DNA with proteins, which, in turn, facilitates the genetic instability of simple repeats.
Rapid progress in obtaining and interpreting genome information will continue to extend knowledge about the genetic variations that exist for simple repeating DNA sequences across all organisms. In this review, we have summarized the current understanding obtained from biochemical and cellular studies of such repeat sequences in bacteria. A combination of these different experiments in bacteria will provide further insight into the biological impacts of simple DNA repeats, including an improved understanding of their roles in bacterial metabolism (with possible impact on the treatment of bacterial pathogens) as well as in a range of human diseases.
All authors were involved in the planning and writing of the manuscript. Figures were prepared by V.B. and R.P.B.
This work was supported by the Czech Science Foundation (18-15548S) and the European Union Horizon 2020 research and innovation programme under grant agreement no. 692068, for the BISON project and by the SYMBIT project reg. no. CZ.02.1.01/0.0/0.0/15_003/0000477 financed from the ERDF. V.B. and M.F. acknowledge institutional support from the Czech Academy of Sciences (68081707).
Open access for this article was enabled by the participation of University of East Anglia in an all-inclusive Read & Publish pilot with Portland Press and the Biochemical Society under a transformative agreement with Jisc.
We thank colleagues at the University of East Anglia, Institute of Biophysics and Central European Institute of Technology, Masaryk University in Brno, the Czech Republic for discussions that assisted the development of this manuscript.
The authors declare that there are no competing interests associated with the manuscript.