Naturally occurring DNA is encoded by the four nucleobases adenine, cytosine, guanine and thymine. Yet minor chemical modifications to these bases, such as methylation, can significantly alter DNA function, and more drastic changes, such as replacement with unnatural base pairs, could expand its function. In order to realize the full potential of DNA in therapeutic and synthetic biology applications, our ability to ‘write’ long modified DNA in a controlled manner must be improved. This review highlights methods currently used for the synthesis of moderately long chemically modified nucleic acids (up to 1000 bp), their limitations and areas for future expansion.
The ordered and addressable arrangement of the four nucleobases adenine (A), cytosine (C), guanine (G) and thymine (T) in DNA allows our cells to store approximately 1–2 GB of data in 7 pg of material. The accessibility of these data is carefully controlled via chemical modification of the nucleobases, double helical DNA formation and compaction, as well as numerous DNA–protein interactions. Of these mechanisms, chemical modifications are of particular interest in that they increase the information storage capacity of DNA and provide functional handles for external manipulation. For example, controlled DNA methyltransferase-mediated methylation of cytosine modulates gene expression, whereas uncontrolled reactive oxygen species can oxidize nucleobases, inducing mutations upon polymerase replication. Top-down studies have developed our understanding of these processes; however, our ability to exploit this knowledge in synthetic biology applications, and to expand upon it using bottom-up approaches for therapeutic applications, is constrained by our inability to ‘read and write’ modified DNA efficiently.
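The storage and mass figures quoted above can be checked with a rough back-of-the-envelope calculation. The sketch below assumes a diploid human genome of about 6.4 × 10^9 bp, 2 bits per base pair (four possible bases) and an average mass of about 650 g/mol per base pair. These values are commonly used approximations and are not taken from this review.

```python
AVOGADRO = 6.022e23  # molecules per mole

def genome_capacity_gb(base_pairs: float, bits_per_bp: float = 2.0) -> float:
    """Information content in gigabytes (1 GB = 1e9 bytes)."""
    return base_pairs * bits_per_bp / 8 / 1e9

def genome_mass_pg(base_pairs: float, g_per_mol_bp: float = 650.0) -> float:
    """Mass of double-stranded DNA in picograms (1 g = 1e12 pg)."""
    return base_pairs * g_per_mol_bp / AVOGADRO * 1e12

DIPLOID_BP = 6.4e9  # assumed diploid genome size, not from the text

print(f"capacity: {genome_capacity_gb(DIPLOID_BP):.1f} GB")
print(f"mass:     {genome_mass_pg(DIPLOID_BP):.1f} pg")
```

Under these assumptions the estimate falls within the 1–2 GB and roughly 7 pg quoted above. Any additional capacity contributed by chemical modifications is not counted here.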
In the past decade, remarkable progress has been made in both of the above aspects for unmodified nucleic acids. Millions of DNA strands can now be sequenced in parallel using Illumina and Ion Torrent next generation sequencing platforms, whereas whole genomes such as that of Mycoplasma mycoides can be reconstructed from oligonucleotides. Yet of the two processes, our ability to write still lags behind our ability to read nucleic acid sequences. Single base resolution of modified nucleobases, such as 5-methyl-, 5-hydroxymethyl-, 5-formyl- and 5-carboxy-C, is now possible, and with the rapid development of single molecule polymerase-independent nanopore sequencing this number may yet expand considerably. On the other hand, the in vitro synthesis of highly ordered modified DNA, whether it contains (for example) amino acid side chains, epigenetic modifications, electrochemical markers or fluorophore tags, is still in its infancy, partly due to the reliance of gene synthesis on clean-up and amplification by PCR. Consequently, the focus of this review is to evaluate the current state of modified DNA synthesis (up to 1000 bp) and highlight the challenges that have yet to be overcome.
Automated solid-phase phosphoramidite-based synthesis allows the routine and cheap production of oligonucleotides of approximately 100 bases. The principal chemical steps involved in each cycle were optimized by the early 1990s and have remained largely unchanged since (Figure 1) [6,7]. Stepwise coupling of phosphoramidites allows unparalleled control of DNA modification; however, this comes at the cost of imperfect coupling (approximately 98.5–99.5% per step) and uncontrolled mutagenesis due to chemical exposure. Furthermore, the modifications to be introduced must be compatible with the mildly acidic, strongly basic and oxidizing conditions used in the various synthetic steps. This can be particularly challenging when multiple modifications, each with their own unique deprotection requirements, are used. In combination, these factors limit the length and yield of the oligonucleotide (at a coupling efficiency of 99%, the maximum yield of a 100-mer is 36.6%) as well as the number and type of modifications that can be incorporated (Figure 2).
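The yield figure quoted above follows from compounding the per-step coupling efficiency over every cycle. The following is a minimal sketch, assuming each coupling succeeds independently with the same probability and counting one coupling per base (as the 100-mer example implies). The function name is illustrative only.

```python
def full_length_fraction(coupling_efficiency: float, n_couplings: int) -> float:
    """Maximum fraction of chains that reach full length after n_couplings
    steps, assuming every step has the same independent success probability."""
    return coupling_efficiency ** n_couplings

# Range of coupling efficiencies quoted in the text, for a 100-mer.
for eff in (0.985, 0.99, 0.995):
    print(f"efficiency {eff:.3f}: {full_length_fraction(eff, 100):.1%} full length")
```

Because the loss is exponential in length, a small change in per-step efficiency has a large effect on the yield of long oligonucleotides. This is why the practical ceiling sits at a few hundred bases at most.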
Figure 1. Standard automated solid-phase phosphoramidite-based oligonucleotide synthesis
Figure 2. Examples of base modified phosphoramidites, triphosphates and unnatural base pairs
Efforts to optimize each of the four steps in the phosphoramidite oligonucleotide synthesis cycle have resulted in a diverse range of reagents and conditions. For example, tetrazole, benzylthiotetrazole (BTT) or ethylthiotetrazole (ETT) coupling reagents could be used, each with their own optimal concentrations, incubation times and cycle repetitions; protecting groups could be removed in concentrated ammonium hydroxide, 0.1 M NaOH(aq) or methanolic potassium carbonate solution at temperatures ranging from 25 to 55°C for 1–18 h. Yet the gains in purity and yield due to changes in these specific parameters are challenging to quantify since they are likely to be interdependent and result in deletions, insertions or point mutations. Until recently, these errors have been difficult to characterize in an accurate and cost-effective manner.
On the other hand, evaluation of depurination or incomplete detritylation, both of which impede the synthesis of full-length oligonucleotides and are linked to acidity, is more straightforward due to techniques for size separation of oligonucleotides and spectroscopic analysis. Alternative detritylation approaches include replacement of the Brønsted acid trichloroacetic acid (TCA) with chelating Lewis acids such as zinc bromide or mildly acidic sodium acetate buffer (10 mM, pH 3.5) [11,12]. Unfortunately, the former requires use of the explosive nitromethane for rapid detritylation and the latter takes approximately 30 min for near quantitative conversion, reducing their viability in commercial applications. Depurination-inhibiting protecting groups such as amidines that decrease the pKa of the N7 (or N1) position of purines have also been explored but have not been widely adopted, possibly due to slower deprotection rates. Finally, with the advent of high-throughput microarray oligonucleotide synthesizers with capacity for parallel synthesis of 12000–1000000 oligonucleotides, variations in platform design have given rise to the largest innovations in detritylation chemistry. Electrochemically generated acids allow deprotection in approximately 5 s, whereas photolabile 5’-protecting groups avoid the use of acid altogether. For inkjet printed microarrays, it was found that active quenching of the acid by the oxidizing solution significantly reduced depurination. However, these gains are tempered by losses in spatial resolution of reagents in high-throughput synthesis compared with conventional column synthesis, which ultimately constrains the maximum length of modified oligonucleotide synthesis to approximately 150–200 bases.
Therefore, the production of larger modified DNA must rely upon the assembly of smaller components. In this regard, the simplest method is to create the desired sequence by PCR amplification methods using a modified dNTP in place of the naturally occurring one. A key requirement for such dNTPs is that they must be both incorporated and read by polymerases. As a result, modifications that do not disturb the Watson–Crick base pairing face of the nucleobases are necessary; the C5 position of pyrimidines and the C7 position of 7-deazapurines are usually chosen for functionalization due to protrusion of the modification into the major groove of DNA, a structural feature that is more readily tolerated by polymerases (based on biochemical experiments and X-ray crystallographic data) [17,18]. Of these two positions, the C5 pyrimidine modifications are more synthetically accessible and consequently more widely reported (Figure 2) [19,20]. C8 purine modifications have been reported but their incorporation efficiency is noticeably worse [21,22]. It should also be noted that the (de)stabilizing properties of the modification must be carefully modulated for very large DNA duplex synthesis since most modifications, with the exception of C5-alkynyl pyrimidines and C7-alkynyl 7-deazapurines [23,24], disrupt DNA duplex thermal stability by up to 1–2°C per incorporation.
Correct optimization of PCR extension time (typically lengthened) and temperature (typically lowered) is important for modified DNA synthesis but the choice of polymerase is most critical. Family B polymerases, such as KOD XL, Vent (exo-) and Pwo, are generally superior to family A polymerases [19,25]. Indeed, for Cy3 and Cy5 fluorophore labelled dNTPs, genetically evolved family B Pfu variants are required for full incorporation of the modification by PCR. The properties of the resultant 1000 bp product neatly illustrate the fascinating behaviour of heavily modified DNA, in that conventional small molecule DNA intercalation was no longer possible and solubility was higher in organic rather than aqueous solvents. The latter point does, however, highlight that it would be favourable to control the density of modifications, which is sometimes achieved through the use of a mixture of modified and unmodified dNTPs. This is most evident when bulky modifications must be incorporated sequentially along a template; this results in some templates being more ‘modifiable’ than others.
Perhaps the main limitation of this method is that it is an ‘all or nothing’ approach. If all A, C, G or T sites of the product are replaced by a modified nucleotide (X), then the product is a defined single species; if there is a mixture of modifications introduced via a mixture of a natural dNTP and the unnatural dXTP for any of the A, C, G or T sites, then a range of products is obtained with varying modification content. For some applications, such as synthesis of fluorescent in situ hybridization probes, the mixed products could be of benefit as this enables custom mixtures of multiple fluorophores that give rise to unique optical output [27,28]. Indeed, for aptamer (nucleic acid antibody equivalents) selection, this would be ideal as fully modified sequences may not have sufficient structural diversity for target binding. Yet the problem lies in identifying the position of the modified base and, if more than one round of aptamer selection is required, its site-specific re-introduction into an enriched subpopulation of sequences.
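The spread of products obtained from a mixed natural/modified dNTP pool can be illustrated with a simple binomial model. This is only a sketch: it assumes that every eligible site incorporates the modified nucleotide independently with the same probability `p_modified`, whereas real polymerases show sequence-dependent and modification-dependent bias. The function name and parameters are hypothetical.

```python
from math import comb

def modification_distribution(n_sites: int, p_modified: float) -> list[float]:
    """Probability of obtaining exactly k modified sites (k = 0..n_sites),
    assuming independent incorporation with probability p_modified per site."""
    return [
        comb(n_sites, k) * p_modified**k * (1 - p_modified) ** (n_sites - k)
        for k in range(n_sites + 1)
    ]

# Example: 20 eligible sites, 30% chance of modification at each site.
dist = modification_distribution(20, 0.3)
most_likely = max(range(len(dist)), key=dist.__getitem__)
print(f"most likely number of modified sites: {most_likely}")
print(f"fraction fully modified: {dist[-1]:.2e}")
```

Even in this idealized model, each modification count k covers many positional isomers (`comb(n_sites, k)` of them). This is the identification problem described above.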
Finally, if single PCR products are required, given that polymerases have evolved to recognize only the naturally occurring bases/bp (AT, TA, CG, GC), only four modifications can be made per PCR generated product. This premise has been challenged in recent years through the use of unnatural bp. To date, three unnatural bp have been reported with very high stability and replication fidelity (>99.8%) under defined conditions (Figure 2) [29–31]. Selectivity is maintained either by alternative hydrogen bonding patterns or predominantly hydrophobic interactions. The ‘expanded genetic alphabet’ is particularly intriguing due to its promise of enhanced data storage but has so far been used mainly for aptamer selection, where its presence was vital for function of the identified aptamer; substitution of the modified base for its natural base analogue resulted in a >100-fold decrease in affinity of the aptamer for the target [32,33]. This effect is most likely due to the unnatural hydrophobic interactions these bases generate as opposed to the expanded sequence space. Although these studies did not demonstrate that control unmodified sequences fail to generate aptamers of similar affinity, the importance of hydrophobic ‘protein-like’ interactions has been established for C5-modified pyrimidine sequences against a range of protein targets. Although the dNaM–d5SICS bp has been shown to be biocompatible in Escherichia coli, caution must be taken as the dNTPs required for its replication cannot be synthesized natively in cells.
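The ‘enhanced data storage’ promised by an expanded alphabet can be quantified as the maximum information per sequence position, log2 of the number of distinct letters. The sketch below is illustrative only and assumes that every letter is equally usable at every position. The six-letter case corresponds to adding one unnatural base pair to A, C, G and T.

```python
from math import log2

def bits_per_position(alphabet_size: int) -> float:
    """Upper bound on information per position for an alphabet of the given
    size, assuming all letters are equally likely and freely placeable."""
    return log2(alphabet_size)

for letters in (4, 6, 8):
    print(f"{letters}-letter alphabet: {bits_per_position(letters):.3f} bits per position")
```

The gain per added base pair is modest (about 0.58 bits per position when going from four to six letters). This is consistent with the observation above that functional benefits so far appear to stem from new chemistry rather than from the expanded sequence space.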
The above PCR-based methods cannot be used to produce chemically modified DNA containing specific modifications at predefined loci. For such site-specific incorporation of modifications in long DNA, the ideal reaction would involve direct stepwise ligation of shorter chemically synthesized modified oligonucleotides. This would bypass the inability of polymerases to detect modifications that are not present on the Watson–Crick hydrogen bonding face, and would allow for a much wider range of modifications, including those to sugar and phosphate. To this end, DNA ligases can be used; however, the templated assembly of >1000 bp genes using this method has been restricted to unmodified oligonucleotides [36–38]. This is surprising given the growing interest in the epigenome and in particular the role that oxidized forms of 5-methyl-C play in it. A reason for this could be ligation inefficiency; post-ligation enrichment of the desired gene by PCR or cloning is often used, which would erase all modifications introduced. Yet, with optimization, ligation methods could feasibly be used to synthesize large DNA constructs with highly ordered and predetermined arrangements of modifications. Precedent for this comes from templated assembly of 5’ phosphorylated trimers and pentamers bearing peptide moieties using T4 DNA ligase [40,41]. Products of 50–150 bases were synthesized (10–50 independent ligations) with efficiencies reaching approximately 98%. Promisingly, these products could be read by Vent (exo-) polymerase and re-synthesized by T4 DNA ligase, enabling their use in inhibitor identification by in vitro selection.
In this context, chemical ligation is appealing in that small molecule chemical reactions can be highly efficient, and should be more agnostic than ligase enzymes in terms of the substrates to be ligated. The first example of this was a condensation reaction using cyanogen bromide, whereas later methods involved the spontaneous displacement of a tosylate/iodo group from the 5’-position of oligonucleotides using 3’ thiophosphorylated oligonucleotides [43,44]. Although promising, these methods failed to gain wider adoption due to the lack of controlled ligation, the instability of components in water and toxicity. To address this, bifunctional 5’-azide 3’-alkyne modified oligonucleotides for use in copper-catalysed cycloaddition ‘click’ chemistry were developed in our group (Figure 3) [45,46]. The resultant triazole linkage, although not a natural phosphodiester, is remarkably biocompatible and non-toxic. Successive refinement of the structure based on the hypothesis that the triazole acts as a hydrogen bond acceptor, which was later supported by NMR studies, in combination with efforts to reduce the synthetic burden of monomer synthesis, resulted in a second generation version that is faithfully replicated. This is the first fully modified nucleic acid backbone that is compatible with DNA and RNA polymerases in vitro and in mammalian cells. As proof of principle, a 300-base oligonucleotide containing two triazole linkages was prepared. Further work is ongoing to adapt this methodology for larger modified DNA synthesis.
Figure 3. Unnatural biocompatible ‘click’ ligation of oligonucleotides
In principle, all of the above methodologies can be used in conjunction with post-synthetic modification of DNA. In effect, this reduces the chemical or enzymatic synthetic burden by reducing monomer complexity. Simple small functional handles, such as aldehydes, linear alkynes, cycloalkynes, azides, carboxylic acids or primary amines, can be introduced as nucleobase modifications during chemical or enzymatic synthesis as phosphoramidites and triphosphates, before being subsequently reacted to generate the final desired modification. These handles could also be generated de novo using methyltransferases; requiring a recognition motif typically up to 6 bp in length, these enzymes have been used to transfer moieties larger than methyl groups through the use of modified S-adenosyl methionine substrates [50,51]. A second pre-labelling step is sometimes required to access handles such as azides due to their chemical instability during solid-phase oligonucleotide synthesis. Of the numerous coupling methods that have been established, there are to date just a few orthogonal reactions [52,53] (e.g. amide formation, inverse Diels–Alder reaction, copper-catalysed/strain-promoted Huisgen [2+3] cycloaddition and oxime formation). As a consequence, selective protection of different moieties is required for the introduction of multiple modifications, with one of the best examples being the two different levels of terminal alkyne protection used for three independent copper ‘click’ reactions by solid-phase synthesis. Despite its synthetic ease, the main drawbacks to post-synthetic functionalization are incomplete or slow reactions depending upon reactant concentrations, partly limited by solubility issues. Hence, the methodology is most efficiently used with solid-phase immobilized oligonucleotides, where the reactants can be used in excess in a suitable solvent and easily removed.
The synthesis of highly modified DNA is constrained by the limited ability of polymerases to recognize modifications to the nucleobase, phosphate or sugar. Consequently, the chemical complexity of polymerase generated DNA will always be limited by the number of different dNTPs that are available and the number of different bases or bp that can be read and written by the polymerase enzyme. If locus/sequence specific modifications to DNA are required, the problem becomes intractable. On the other hand, for the generation of random libraries, the use of modified dNTPs has clear advantages but is currently handicapped by our limited ability to decode the modifications and the low/inconsistent yields of triphosphate (dNTP) synthesis; novel high yielding methodologies are required. For ordered and patterned synthesis of modified DNA, it is evident that a combination of chemical and enzymatic synthetic methods should be used. The high number of coupling steps achievable by solid-phase synthesis (up to 150 with appreciable final yields) highlights that chemical coupling can be exceptionally efficient, but that it does have a finite limit. This is partly due to the architecture of the solid supports used conventionally. Therefore, to achieve long modified DNA synthesis, a compromise must be made between oligonucleotide length, purity and the number of ligation reactions. The issue of oligonucleotide purity is particularly important given that post-assembly error correction methods currently used in PCR-based gene synthesis will have detrimental effects when synthesizing modified DNA; they can remove modifications and replace them with unmodified sections of DNA. Therefore, it is essential that solid-phase automated oligonucleotide synthesis, in addition to ligation methods, be optimized if the synthesis of very long chemically modified DNA constructs is to become a routine exercise in the future.
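The compromise between oligonucleotide length, purity and number of ligations can be made concrete with a toy calculation. The sketch below is not a validated model: it assumes a 1000-base target split into equal fragments, one coupling per base at 99% efficiency, and independent ligations at 98% efficiency (the approximate values quoted earlier in this review). It reports the two competing quantities side by side rather than combining them into one yield.

```python
import math

def assembly_tradeoff(target_len: int, oligo_len: int,
                      coupling_eff: float = 0.99,
                      ligation_eff: float = 0.98):
    """Return (n_oligos, n_ligations, full_length_fraction, ligation_yield)
    for assembling target_len bases from fragments of oligo_len bases."""
    n_oligos = math.ceil(target_len / oligo_len)
    n_ligations = n_oligos - 1
    # Crude purity of each fragment before any purification step.
    full_length_fraction = coupling_eff ** oligo_len
    # Yield if every ligation must succeed independently.
    ligation_yield = ligation_eff ** n_ligations
    return n_oligos, n_ligations, full_length_fraction, ligation_yield

for length in (50, 100, 150):
    n, lig, purity, lig_yield = assembly_tradeoff(1000, length)
    print(f"{length:>3}-mers: {n:>2} oligos, {lig:>2} ligations, "
          f"full-length fraction {purity:.2f}, ligation yield {lig_yield:.2f}")
```

Shorter fragments are purer but need more ligations, and longer fragments reverse the balance. Where the optimum lies depends on purification and on the real efficiencies, which this sketch does not attempt to capture.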
This work was supported by the U.K. BBSRC through the grants ‘Next generation DNA synthesis’ [grant number BB/M025624/1] and ‘Extending the boundaries of nucleic acid chemistry’ [grant number BB/J001694/1].
Synthetic Biology UK 2015: Held at Kingsway Hall Hotel, London, U.K., 1–3 September 2015