Functional biology and biotechnology of thermophilic viruses

Abstract
Viruses have developed sophisticated biochemical and genetic mechanisms to manipulate and exploit their hosts. Enzymes derived from viruses have been essential research tools since the first days of molecular biology. However, most viral enzymes that have been commercialized are derived from a small number of cultivated viruses, which is remarkable considering the extraordinary diversity and abundance of viruses revealed by metagenomic analysis. Given the explosion of new enzymatic reagents derived from thermophilic prokaryotes over the past 40 years, those obtained from thermophilic viruses should be equally potent tools. This review discusses the still-limited state of the art regarding the functional biology and biotechnology of thermophilic viruses with a focus on DNA polymerases, ligases, endolysins, and coat proteins. Functional analysis of DNA polymerases and primase-polymerases from phages infecting Thermus, Aquificaceae, and Nitratiruptor has revealed new clades of enzymes with strong proofreading and reverse transcriptase capabilities. Thermophilic RNA ligase 1 homologs have been characterized from Rhodothermus and Thermus phages, with both commercialized for circularization of single-stranded templates. Endolysins from phages infecting Thermus, Meiothermus, and Geobacillus have shown high stability and unusually broad lytic activity against Gram-negative and Gram-positive bacteria, making them targets for commercialization as antimicrobials. Coat proteins from thermophilic viruses infecting Sulfolobales and Thermus strains have been characterized, with diverse potential applications as molecular shuttles. To gauge the scale of untapped resources for these proteins, we also document over 20,000 genes encoded by uncultivated viral genomes from high-temperature environments that encode DNA polymerase, ligase, endolysin, or coat protein domains.


Introduction
Life in high-temperature environments poses challenges that are met by adaptations that increase the stability of all macromolecules, including proteins. The same properties that make thermophilic proteins vital to their thermophilic hosts, namely intrinsic stability and activity at high temperatures, also offer important advantages over their mesophilic counterparts for industrial and molecular biology applications. As a classic example, Taq polymerase, isolated from Thermus aquaticus, was employed in the 1980s to substitute for Escherichia coli DNA polymerase in the polymerase chain reaction (PCR) [1]; the stability of Taq polymerase under the conditions required to thermally denature DNA improved the practicality and reduced the cost of PCR, and was critical for its rapid expansion as a cornerstone of modern molecular biology, disease diagnostics, forensics, and genetic genealogy, among other technologies [2]. High temperatures also decrease nucleic acid secondary structures, off-target base-pairing, and nonspecific protein-protein and ligand-protein interactions, thereby improving the efficiency and fidelity of a wide variety of biochemical interactions. Thermophily also increases compatibility with a variety of industrial and molecular biology applications, including better performance in viscous solutions, which become more fluid at higher temperature, and increased volatility of biofuels [3,4]. In addition, thermophilic enzymes are typically more stable than mesophilic enzymes [5], which can increase shelf-life, enhance stability under a variety of extreme conditions, and simplify purification schemes; for example, heat purification of thermostable enzymes is a simple method for isolating recombinant thermophilic proteins from crude lysates of mesophilic expression systems [6][7][8].
Even though biotechnology has long relied on enzymes from thermophilic prokaryotes, those from thermophilic viruses that infect them remain remarkably underexplored. All viral genomes encode key enzymes that are necessary for the biology of the virus, including those involved in diversion of host resources for viral genome replication, transcription and translation, evasion of host immunity, packaging, and egress from the host cell [9,10]. Genes encoding these functions occur at much higher frequencies in viral DNA than in host DNA and high recombination rates within viruses promote biochemical innovations. In all, these biochemical innovations provide excellent targets for biotechnological commercialization.
Over the last two decades, rapid progress has been made on the cultivation and molecular biology of thermophilic and hyperthermophilic host-virus pairs, especially among novel archaeal viruses infecting the thermoacidophilic order Sulfolobales and to a lesser extent thermophilic Thermoproteales, both belonging to the phylum Thermoproteota (synonym Crenarchaeota) [11,12]. These archaeal viruses belong to the International Committee on Taxonomy of Viruses (ICTV) families of Lipothrixviridae, Rudiviridae, Tristomaviridae, Turriviridae, Ampullaviridae, Bicaudaviridae, Spiraviridae, Fuselloviridae, Guttaviridae, Clavaviridae, and Globuloviridae [11,13]. Parallel research on thermophilic bacteriophages has largely focused on those infecting the genera Thermus, Meiothermus, Geobacillus, and Rhodothermus [14]. Most of these bacteriophages belong to the class Caudoviricetes (recently reclassified based on genomic information and not morphology [15]), while others represent novel families not yet placed in higher taxonomic ranks or are largely unclassified (e.g., unclassified myo- and siphoviruses φYS40, G20c, and RM378). These viruses replicate at high temperatures and are often stable at temperatures exceeding the optimal growth temperatures of their host thermophiles or hyperthermophiles [16][17][18]. However, viral genomes are also universally enriched in poorly annotated genes [19], with tens of thousands of poorly annotated small-gene families being discovered recently [20]. Together, these genes represent a vast, underexplored resource with potential to contribute innumerable advances in biotechnology and biomedicine.
This paper highlights a relatively small number of functionally characterized proteins from thermophilic bacteriophages and archaeal viruses and their potential roles in biotechnology, focusing on proteins with potential applications in DNA synthesis, nucleotide modifications and repair, cell lysis, and nano-trafficking (Figure 1 and Table 1). We also provide an up-to-date accounting of putative proteins of biotechnological interest in uncultivated viral genomes (UViGs) from thermal environments and discuss key opportunities to explore these proteins for biotechnology purposes (

DNA polymerases
Thermophilic DNA polymerases have been a focal point of development and commercialization for biotechnology companies due to their versatility, with uses ranging from molecular diagnostics to next-generation DNA sequencing. In 1988, Taq polymerase, a family-A DNA polymerase (PolA) from T. aquaticus, was optimized for PCR due to its thermophily, with the important caveat that it has a high error rate of about one mutation per 20,000 base pairs [21,22]. Bacteriophage φ29 DNA polymerase is a mesophilic enzyme with lower error rates [23] that allows for isothermal amplification with proofreading, high processivity, and strand-displacement capabilities, but it only functions at low temperatures. Identifying thermophilic viral DNA polymerases that resemble Taq polymerase but can proofread, or a suitable thermophilic alternative to φ29 polymerase, would improve the capabilities of existing applications. Thus far, some thermophilic viral DNA polymerases have been identified that have high-fidelity proofreading domains, strand-displacement activity, or reverse-transcriptase activity [22,24,25,26].
In addition to these cultivated viruses, a prophage-encoded DNA polymerase within the chromosome of Thermus antranikianii was characterized and shown to have strong strand-displacement activity, similar to φ29, but many of the amplification products were highly branched, non-specific DNA molecules. Thus, this polymerase is not suitable as a thermophilic alternative to φ29 [24], but it does show that thermophilic strand displacement is possible. A thermophilic polymerase with properties similar to φ29 (high fidelity, high processivity, and strong strand-displacement activity) would enable high-fidelity, long-range PCR and would be of considerable biotechnology interest.
Aside from work on cultivated viruses, bioinformatic [36,37] and functional screens [25,26] for DNA polymerases in viral metagenomic DNA from diverse terrestrial hot springs revealed a group of 3′-5′ proofreading exonuclease and DNA polymerase (3′ exo/pol)-encoding genes within metagenomic fragments and UViGs assigned to the putative genus 'Pyrovirus', which is predicted to infect genera within the Aquificaceae. Comparative phylogenetics showed these unusual polAs spread by horizontal gene transfer among thermophilic viruses, their Aquificota hosts, other diverse bacteria (although only temporarily retained), and the proto-apicoplast that became a symbiotic partner of an ancestor to the eukaryotic phylum Apicomplexa [38]; yet, only the viral enzymes encode N-terminal helicase domains (DUF927). The interdomain lateral gene transfers of these large and unique polAs suggest they may be associated with dispersal of diversity-generating mechanisms between geothermal and moderate-temperature biomes [38]. An engineered fusion of a 'Pyrovirus' PolA enzyme with the Sulfolobus solfataricus Sso7d DNA binding protein [39], called PyroPhage 3173 PolA, was shown to have three novel characteristics, namely (i) reverse transcriptase (RT) activity, (ii) DNA polymerase strand-displacement activity, and (iii) thermostability, which enabled RT-PCR and reverse transcription loop-mediated isothermal amplification (RT-LAMP) [25,26]. PyroPhage 3173 PolA had an error rate approximately 10 times lower than Taq polymerase, and the enzyme was commercialized by Lucigen Corporation (acquired by LGC, Middleton, Wisconsin). Amino acid swapping with polA genes from related UViGs, and domain swapping with the Taq polymerase 5′-3′ exonuclease, resulted in a recombinant enzyme, called Magma DNA polymerase, that was more thermophilic, more accurate (as low as 1 error in 10⁶ nucleotides), and performed better in reverse-transcriptase PCR applications [40].
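To put the error rates quoted in this section in perspective, the short Python sketch below estimates the fraction of error-free product molecules for a given amplicon. It is a simplified illustration and not a method from the cited studies: it assumes errors are independent and Poisson-distributed, that each product molecule descends from a fixed number of template duplications, and that "approximately 10 times lower than Taq" can be read as 1 error per 200,000 bases. The function name and the example amplicon length and duplication count are hypothetical.

```python
import math

# Per-base error rates taken from the text (errors per nucleotide copied).
# The PyroPhage value is an assumption: "approximately 10 times lower than Taq".
ERROR_RATES = {
    "Taq": 1 / 20_000,
    "PyroPhage 3173 PolA": 1 / 200_000,
    "Magma (best case)": 1e-6,
}


def error_free_fraction(rate, length_bp, duplications):
    """Poisson estimate of the fraction of product molecules without errors.

    rate          -- errors per nucleotide per copying event
    length_bp     -- amplicon length in base pairs
    duplications  -- assumed number of copying events in each molecule's lineage
    """
    expected_errors = rate * length_bp * duplications
    return math.exp(-expected_errors)


if __name__ == "__main__":
    # Hypothetical example: a 1,000 bp amplicon after 20 duplications.
    for name, rate in ERROR_RATES.items():
        fraction = error_free_fraction(rate, length_bp=1_000, duplications=20)
        print(f"{name}: {fraction:.1%} of molecules expected to be error-free")
```

Under these assumptions the benefit of a lower per-base error rate grows with amplicon length and cycle number, which is why proofreading matters most for long-range amplification.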
Viral DNA-directed primase-polymerase-like proteins are predicted to have additional roles in DNA and/or RNA priming, as well as damage-tolerant DNA polymerase activity [41], and at least one such unusual thermophilic enzyme from unclassified deep-sea vent phage NrS-1, infecting Nitratiruptor sp. SB155-2 [42], was shown to be functional [43,44]. This enzyme has features found in DNA polymerases (DNA-dependent polymerization), primases (primer-free DNA strand synthesis initiation), helicases (strand displacement), and RNA polymerases (RNA-dependent polymerization), and could be useful in several potential applications, such as primer-free, isothermal whole-genome amplification.

Ligases
DNA and RNA ligases catalyze the formation of phosphodiester bonds between 5′-phosphate and 3′-hydroxyl groups, with activity on DNA or RNA, respectively [45]. These enzymes serve critical functions in vivo, including DNA replication and recombination, somatic generation of immune diversity, nucleic acid editing, and DNA/RNA repair (Table 1). Biotechnology applications of ligases include decades-old technologies such as construction of recombinant plasmids or viruses, but also emerging technologies such as library preparation for DNA and microRNA sequencing [46], single nucleotide polymorphism diagnostics using the ligase chain reaction [47], and synthetic gene construction via Gibson assembly, a cornerstone technology of modern synthetic biology [48].
Comparatively less work has been done to characterize ligases from thermophilic viruses, and all published work to date has focused on moderately thermophilic ATP-dependent RNA ligase 1 enzymes. In 2003, a thermophilic homolog of T4 RNA ligase 1 from Rhodothermus phage RM378, an unclassified myovirus, was shown to have optimal activity at 64 °C [19,61]. This enzyme could substitute for T4 RNA ligase 1 in RNA ligase-mediated rapid amplification of cDNA ends (RLM-RACE) at 60 °C and was patented and commercialized for that purpose. In 2005, another thermostable RNA ligase 1 from the unclassified virus Thermus phage Ph2119, TS2126 RNA ligase, was also characterized [19,62]. The TS2126 RNA ligase had ∼30 times higher specific activity compared with T4 RNA ligase in phosphatase protection assays, with a temperature optimum of 70-75 °C, and it was also more effective at ssDNA ligation. This enzyme complexes with adenylated donors rapidly but ligates more slowly, which strongly favors intramolecular ligations. This property has been exploited for 5′ preadenylylation of DNA oligonucleotide adapters during cDNA library preparation [63] and for circularization of single-stranded DNA or RNA templates for rolling-circle replication or rolling-circle transcription experiments. Kits for the latter produce virtually no linear or circular concatemers and have been trademarked as CircLigase™ ssDNA Ligase and CircLigase™ II ssDNA Ligase by Epicentre (acquired by Illumina, Madison, WI, U.S.A.).

Endolysins
All viruses require mechanisms to escape infected host cells following replication and assembly, and enzyme systems for this purpose are as diverse as the cell envelopes of their host prokaryotes [64][65][66]. Endolysins of mesophilic bacteriophages are hard to purify, have limited stability and activity under industrial conditions, and are typically highly specific, with many showing activity only against one host species or just a group of strains [67,68]. These shortcomings limit their use as antimicrobials, yet some evidence suggests these limitations may be overcome by their thermophilic counterparts. Currently, several native endolysins from thermophilic viruses and thermophilic recombinant endolysins have been investigated for broad-range applications (Table 1).
Two endolysins from unclassified cultivated phages infecting T. scotoductus strain MAT2119, phage vB_Tsc2631 and phage Ph2119, were shown to be homologs of the endolysins of phages T3 and T7. These endolysins lysed Thermus strains, Deinococcus radiodurans, and also Gram-negative mesophiles such as E. coli, Salmonella enterica, Serratia marcescens, and Pseudomonas fluorescens [69,70]. The Ph2119 endolysin retained 87% activity at 95 °C, and the endolysin from vB_Tsc2631 retained 65% activity at 95 °C. The endolysin from vB_Tsc2631 has seven charged arginine residues near the N-terminus, making the endolysin act like polycationic antibacterial peptides that form pores in cell membranes, allowing the catalytic center of the endolysin to interact with the peptidoglycan underneath [71]. Arginine is also more thermostable than other positively charged amino acids because its functional group mimics guanidinium [71,72]. The active site of the vB_Tsc2631 endolysin is also believed to bind Zn²⁺ ions for catalytic function and structural stability. These properties are natural advantages necessary for thermophily that overcome many of the obstacles that limit the utility of mesophilic endolysins as antimicrobials.
The endolysin of Oshimavirus TSP4 (also known as Thermus phage TSP4), TSPphg, was expressed in E. coli, purified, and shown to reduce Staphylococcus aureus infections in mice, offering promise for clinical treatments for bacterial infections [73]. Further in vitro testing showed antimicrobial activity against Gram-negative S. enterica, Klebsiella pneumoniae, and E. coli, and Gram-positive Bacillus subtilis. The broad antimicrobial activity of TSPphg may be due to strong interactions with peptidoglycan, owing to the six positively charged amino acids near the N-terminus, similar to the endolysin of vB_Tsc2631 [70,73].
Another broad-range endolysin, MMPphg, was found in Meiothermus phage MMP17, an unclassified myovirus. Purified MMPphg has optimal activity at 65-70 °C and lyses both Gram-positive and Gram-negative bacteria, including E. coli, S. aureus, S. enterica, and Shigella dysenteriae, and eight different antibiotic-resistant strains of K. pneumoniae [74]. The C-terminus of the lysin also contains six positively charged amino acids. The MMP17 endolysin was later artificially fused with TSPphg and the recombinant enzyme, MLTphg, showed higher antimicrobial activity in vitro than either individual endolysin [75].
The other two functionally characterized endolysins are from viruses infecting Gram-positive thermophiles. GVE2, an unclassified siphovirus that infects Geobacillus sp. E263, encodes an endolysin believed to interact with a host eukaryotic-type ABC transporter to lyse host cells at temperatures from 55 to 90 °C [76,77]. Because this endolysin is the first known to interact with ABC transporter proteins, it was further investigated as a potential antimicrobial [78]. The catalytic domain of the GVE2 endolysin was fused to peptidoglycan-binding domains from endolysins of several different Clostridium perfringens viruses. The resulting chimeric endolysins could operate at up to 70 °C, making these enzymes a potential antibiotic treatment for animals that can be added to their heat-sterilized feed [78].
Recently, an endolysin from Saundersvirus Tp84 (also known as Geobacillus virus TP-84) was investigated as a potential disinfectant for surfaces at high temperatures [79]. This endolysin had activity throughout the temperature range of 30-70 °C and inhibited biofilm formation by Pseudomonas aeruginosa, Streptococcus pyogenes, and S. aureus. Extensive human safety testing is recommended for all endolysins to ensure they are safe for consumption if used as additives, and further testing on the long-term stability of these endolysins would be required to evaluate their potential use [78,79].
Phage depolymerases, including both hydrolases and lyases, have been investigated for their ability to degrade polysaccharides or lipids depending on the host's envelope [80][81][82]. Recent research has suggested that a cocktail of endolysins and envelope depolymerases would produce a greater antimicrobial effect on biofilm-forming bacteria such as P. aeruginosa [81,83,84]; however, investigation of thermophilic envelope depolymerases remains sparse. Although several genes from thermophilic phages are annotated as encoding some form of envelope depolymerase [84], evidence for expression of these enzymes is limited to the formation of halos around clear plaques in eight thermophilic Geobacillus phages, including TP-84 [79,84]. Given the relative lack of functional studies on the efficacy of thermophilic depolymerases and endolysin cocktails, these enzymes are a notable target for research and development of possible biotechnological applications.

Coat proteins
Viruses consist of nucleic acids encapsulated by protein coats (capsids) that comprise numerous copies of one or more coat protein subunits. As part of the virion, capsids protect nucleic acids [85,86], serve as vehicles for transport that can target specific cells [85], and mediate introduction of nucleic acids into host cells during infection [85,86]. As these particles typically have natural tropism toward certain cell types, and are often resistant to immune defense systems, coat proteins are excellent candidates as nano-traffickers in biomedicine. These coat proteins self-assemble [60,85,87] and inclusion of certain protein domains within them can result in highly specific targeted delivery of compounds [85,88,89].
A variety of molecules can be encapsulated by capsids, resulting in virus or virus-like nanoparticles [19,59,85,90]. These nanoparticles can be used in biomedicine for delivery of contrast agents for medical imaging [85,88,89,91], anticancer or antimicrobial drugs [85,88,89,91], antigen-presenting platforms for vaccines [85,87,91], or engineered shuttles for genetic material in gene therapy [85,91,92]. Additionally, these capsids can also serve as nanoreactors [85,89,90] and can be used in non-medical applications like transport of inorganic compounds and production of nanomaterials [85,93]. However, limitations of capsids from mesophilic viruses infecting animals, plants, or mesophilic microorganisms include their limited chemical and physical stability [85,94], challenges purifying nucleic acid-free capsids from infected host cells or expression hosts [90], and residual immunogenicity [85,89,94], which may result in clearance of nanoparticles from the system before achieving the desired effect [85,94]. Several of these shortcomings can be alleviated by capsids derived from thermophilic viruses (Table 1).
The first thermophilic viral capsids tested for chemical and physical stability in different solvents, and for availability of ligand attachment sites, were those from Icerudivirus SIRV2 (also known as Sulfolobus islandicus rod-shaped virus 2) [95]. To assess the stability of SIRV2 particles, the structural integrity and infectivity of virions were assessed following incubations in DMSO and ethanol. SIRV2 particles remained intact and infective for 6 days in 20% ethanol, 20% DMSO, or 50% DMSO, and remained intact in up to 50% ethanol [95]. Given this high stability in DMSO, a solvent commonly used for bioconjugation applications, the availability of ligand attachment sites was evaluated through biotinylation of the SIRV2 particles using different compounds to identify reactive carboxylates, carbohydrates, and amines [95]. Amine reactivity was found only in the minor coat protein subunits, located at the ends of the rod-shaped virions, whereas reactive carboxylates and carbohydrates were found in both minor and major coat protein subunits. A broad variety of functional groups can therefore be conjugated to these viral nanoparticles, including spatially selective bioconjugation to only the minor coat protein subunits [95,96].
More recently, an unclassified virus in the Bicaudaviridae, Sulfolobus monocaudavirus 1 (SMV1), was tested extensively for potential biotechnology applications, especially as a potential nano-trafficker for biomedical use. SMV1 particles were treated with ethanol, DMSO, simulated gastric fluid, and simulated intestinal fluid solutions to assess stability through S. islandicus plaque assays, and particles remained infective for up to 6 days [94]. Subsequently, SMV1 particles were passed through the gastrointestinal tracts of mice or incubated with human intestinal organoids, where limited immune responses were elicited, and no SMV1 particles were detected in off-target organs or tissues [94]. Overall, SMV1 particles fared better than Inovirus M13KE, an E. coli phage used for comparison, in both the mice and organoids [94], showing great promise as both molecular delivery systems and antigen-presentation platforms.

Expansion of bioprospecting through viral metagenomes
Most research to date has focused only on viruses infecting cultivated thermophilic archaea and bacteria [14,19], thus limiting the overall breadth of our understanding of the thermophilic virome. Given the limited diversity of cultivated thermophilic prokaryotes [97,98] and their viruses [14], and the extremely limited number of biochemically characterized proteins from thermophilic viruses, we propose that UViGs represent a vast resource for the biotechnology sector. To begin to evaluate the potential resource, we searched for Pfams that are diagnostic of the four protein groups discussed here (DNA polymerases, ligases, endolysins, and coat proteins) in the Integrated Microbial Genomes/Virus (IMG/VR) v4 database, which contains over 5.5 million high-confidence viral genome contigs from a wide range of biomes [99], focusing on marine and terrestrial geothermal systems (Table 2; Supplementary Material). This search revealed >20,000 potential matches to these four protein groups from >185,000 UViGs, with the largest number coming from marine hydrothermal systems, followed by high-temperature terrestrial geothermal systems, and the largest protein category being coat proteins, followed by polymerases, endolysins, and ligases. For example, we identified over 5,000 putative polymerases; the vast diversity of polymerase architectures driven by adaptation to thermal environments [88] is ripe for biotechnology exploration.
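As a rough illustration of how a Pfam-based tally of this kind can be organized, the Python sketch below counts distinct genes per protein group from a list of domain hits. It is not the pipeline used for Table 2: the input layout (UViG identifier, gene identifier, Pfam identifier), the function name, and the Pfam identifiers are all hypothetical placeholders, and a real search would substitute the accessions listed in the Supplementary Material.

```python
from collections import defaultdict

# Hypothetical mapping from Pfam identifier to protein group.
# These are placeholder strings, NOT real Pfam accessions.
PFAM_TO_GROUP = {
    "PF_POL_EXAMPLE": "DNA polymerase",
    "PF_LIG_EXAMPLE": "ligase",
    "PF_LYS_EXAMPLE": "endolysin",
    "PF_COAT_EXAMPLE": "coat protein",
}


def tally_genes(hits, pfam_to_group=PFAM_TO_GROUP):
    """Count distinct genes per protein group.

    hits is an iterable of (uvig_id, gene_id, pfam_id) tuples. A gene with
    several diagnostic domains from the same group is counted only once.
    """
    genes_by_group = defaultdict(set)
    for uvig_id, gene_id, pfam_id in hits:
        group = pfam_to_group.get(pfam_id)
        if group is not None:  # skip domains outside the four groups
            genes_by_group[group].add((uvig_id, gene_id))
    return {group: len(genes) for group, genes in genes_by_group.items()}


if __name__ == "__main__":
    # Made-up example hits, for illustration only.
    example_hits = [
        ("uvig_1", "gene_1", "PF_POL_EXAMPLE"),
        ("uvig_1", "gene_1", "PF_POL_EXAMPLE"),  # duplicate hit, same gene
        ("uvig_1", "gene_2", "PF_COAT_EXAMPLE"),
        ("uvig_2", "gene_1", "PF_UNRELATED"),
    ]
    print(tally_genes(example_hits))
```

Counting genes rather than raw domain hits avoids inflating the totals for multi-domain proteins, although a gene carrying domains from two different groups would still be counted once in each.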
Despite the vast resources available in UViGs, there are currently some limitations. First, given the genetic and biochemical diversity encoded by UViGs, a vast diversity of biotechnologically useful functions resides in poorly annotated genes that are difficult to bioprospect based on sequence similarity. This hidden resource could be systematically explored using artificial intelligence platforms, including those examining protein folds, which are more highly conserved than primary sequence information [100]. Another limitation is the systematic focus on dsDNA viruses due to predominant library preparation methods used for viral metagenomics, which unfortunately exclude six of the seven groups of the Baltimore classification system [101,102]. This heavy focus on dsDNA viruses ignores many novel architectures of undiscovered viruses and certainly biases our understanding of the thermophilic virome, although known recombination between natural virus populations with different types of genomes may relieve this limitation to a degree [103,104]. Groups such as the RNA Virus Discovery Consortium have deposited many more RNA viral metagenomes into IMG/VR [99] through RNA extraction and reverse transcription prior to or during library preparation, with advancements to date in marine [105,106], sediment [107], and terrestrial ecosystems [108,109], including thermal springs and others [107,110]. Bioinformatic pipelines like VirSorter are also increasing accuracy and providing support for RNA viruses [111]. A separate problem is the identification and classification of metagenomic contigs as UViGs in the first place, which can lead to both false negatives and false positives, although community standards have been developed to improve communication of UViG quality, with UViGs categorized as high quality (>90% completeness), medium quality (50-90% completeness), low quality (<50% completeness), and unsure quality (>120% or no completeness estimate) [99,112]. 
Despite these challenges, we contend that UViGs provide an immense and poorly explored resource for bioprospecting the global thermophile virome.
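The UViG quality tiers described above can be expressed as a simple rule. The Python sketch below is a minimal illustration, assuming completeness is supplied as a percentage or as None when no estimate exists. The function name is hypothetical, and the handling of the exact boundary values (50% and 90% are treated as medium quality) is an assumption, since the thresholds quoted in the text do not specify it.

```python
def uvig_quality(completeness):
    """Assign a quality tier to a UViG from its estimated completeness (%).

    Thresholds follow the tiers quoted in the text; boundary handling at
    exactly 50% and 90% is an assumption made for this sketch.
    """
    if completeness is None or completeness > 120:
        return "unsure"   # no estimate, or implausibly over-complete
    if completeness > 90:
        return "high"
    if completeness >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    for value in (97.5, 72.0, 12.3, 135.0, None):
        print(value, "->", uvig_quality(value))
```

A filter of this kind could be applied before bioprospecting so that, for example, only high-quality genomes are used when complete gene neighborhoods matter.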
In recognition of this resource, projects focused on exploring the sequence coverage of the virosphere are seeing increasing support. This is evident in community-driven sequencing efforts supported by the Joint Genome Institute (e.g., OSTI 1488193, Award 503441), implementation of analysis tools for viruses in collaborative cyberinfrastructure, like CyVerse [113], and the RNA Virus Discovery Consortium [110]. Additionally, from 2016 to 2020, the European Union funded the Virus-X project (Viral Metagenomics for Innovation Value) at €8 million. These projects have expanded the sequence coverage of the global virosphere, expressed and characterized novel proteins [114][115][116], analyzed crystal structures of expressed proteins to aid in functional identification [117], and improved methods to identify and interpret UViGs [111,113], including algorithms to identify host-virus pairs [118] and to improve annotation of uncharacterized viral genes through protein clustering [119,120]. These advancements show not only that thermophilic viral enzymes are an expanding topic of importance for biotechnology, but also that infrastructure and data mining tools are improving to better support the ever-expanding UViG dataset.

Summary
• Viral proteins, particularly polymerases, ligases, endolysins and coat proteins, provide a bountiful but underutilized toolbox for the biotechnology industry to explore.
• Applications involving thermophilic viral proteins provide several benefits that overcome some of the shortcomings of their mesophilic counterparts.
• Bioprospecting of genomes from uncultivated viruses provides a vast and underexplored resource that overcomes the primary impediment of cultivability.

Competing Interests
David Mead is an employee of Varigen Biosciences, doing business as Varizymes, and has worked on commercializing thermophilic enzymes.

Funding
Funding was provided by a grant from the Human Genome Research program at the National Institutes of Health [grant number 1R43HG012181-01]. Additional funding was provided by the UNLV Office of Economic Development and the Troesh Center for Entrepreneurship and Innovation.

Author Contribution
All authors discussed the topic and outline and contributed to the bibliography. All authors contributed to the first draft, which was edited by all authors. R.K.D. and M.P. conducted bioinformatics searches of UViGs from geothermal environments and relevant pfams and compiled data.