Nucleotide composition plays a crucial role in the structure, function and recognition of RNA molecules. During infection, virus RNA is exposed to multiple endogenous proteins that detect local or global compositional biases and interfere with virus replication. Recent advancements in RNA:protein mapping technologies have enabled the identification of general RNA-binding preferences in the human proteome at basal level and in the context of virus infection. In this review, we explore how cellular proteins recognise nucleotide composition in virus RNA and the impact these interactions have on virus replication. Protein-binding G-rich and C-rich sequences are common examples of how host factors detect and limit infection, and, in contrast, viruses may have evolved to purge their genomes from such motifs. We also give examples of how human RNA-binding proteins inhibit virus replication, not only by destabilising virus RNA, but also by interfering with viral protein translation and genome encapsidation. Understanding the interplay between cellular proteins and virus RNA composition can provide insights into host–virus interactions and uncover potential targets for antiviral strategies.
A fundamental property of nucleic acids is the ability to store and transmit information via the sequence of nucleotides they contain. Coding sequences are composed of an array of codons that are translated by ribosomes to produce polypeptides. This flow of information is universal and can be found both in living organisms and the viruses that infect them. Due to the degeneracy of the genetic code, with 61 codons coding for 20 amino acids, organisms can develop and sustain coding biases. These biases can manifest as an overrepresentation or scarcity of specific codons, codon pairs or nucleotide combinations in protein-coding sequences. Such biases can influence gene expression, by impacting mRNA stability or translation efficiency, or even help sequences evade detection by the immune system. As a result, viruses are constrained by host-induced pressures on coding sequences, and some viruses may have nucleotide compositions that have been selected to optimise replication in specific hosts.
Nucleotide composition refers to the relative proportions of the four nucleotides that make up DNA or RNA molecules, named here by their bases: adenine, guanine, cytosine and thymine/uracil. Additionally, composition of nucleic acids may also refer to the overall GC content – which is the percentage of guanine and cytosine bases in a given sequence – or to the frequency of certain dinucleotides. At the most basic level, nucleotide composition affects the physical and chemical properties of the nucleic acid molecule, such as its melting temperature, stability, bendability and the propensity to form complex secondary and tertiary structures. For example, melting temperatures of DNA molecules increase with GC content, while G-rich RNA sequences have the tendency to form a stable secondary structure known as a G-quadruplex (G4). Nucleotide composition also affects mutagenesis of nucleic acids after insults such as oxidation and UV-light damage. Beyond its effects on physicochemical properties of nucleic acids, nucleotide composition governs many aspects of cell biology, primarily by modulating how proteins interact with nucleic acids. Some of these interactions are conserved in eukaryotes over millions of years of evolution and they are important for many biological functions.
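The definitions above (relative base proportions and GC content) can be illustrated with a minimal Python sketch. The function names `nucleotide_composition` and `gc_content` are hypothetical and chosen only for illustration; the sketch assumes an unambiguous sequence and simply ignores any character other than A, C, G and U/T.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Return the relative frequency of A, C, G and U in a sequence.

    Hypothetical helper: DNA input is accepted by treating T as U,
    and characters outside A/C/G/U are ignored (an assumption).
    """
    seq = seq.upper().replace("T", "U")
    counts = Counter(base for base in seq if base in "ACGU")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or U/T bases")
    return {base: counts[base] / total for base in "ACGU"}


def gc_content(seq):
    """Return GC content as a fraction between 0 and 1."""
    comp = nucleotide_composition(seq)
    return comp["G"] + comp["C"]
```

Multiplying the value returned by `gc_content` by 100 gives the percentage form used in the text.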
Viruses, as obligate intracellular pathogens, are exposed to thousands of proteins in cells that recognise nucleotide composition. Viruses are under increased selective pressure since they (1) must hijack cellular machinery to replicate while (2) avoiding recognition by the immune system but (3) maintaining recognition of their own nucleic acids to ensure genome packaging. In this review, we focus on how various cellular proteins recognise nucleotide composition in virus RNA, and how advancements in RNA:protein mapping technologies have enabled the identification of RNA-binding profiles for many RNA-binding proteins (RBPs). We focus mostly on proteins whose activity leads to reduced virus replication; however, as discussed, many proteins inhibit infection by one virus but aid the replication of others. Canonical virus RNA sensors, such as retinoic acid-inducible gene I (RIG-I)-like receptors (RLRs) and some Toll-like receptors (TLRs), display certain compositional biases in the types of molecules they recognise; e.g., RIG-I preferentially binds U-rich sequences with interspersed cytosine nucleotides, while TLR7 preferentially binds to uridine-containing sequences. However, since these receptors primarily recognise double-stranded RNA or differentially localised single-stranded RNA, and their properties have been reviewed elsewhere, in this review we will focus on emerging evidence for other cellular proteins that recognise nucleotide composition in virus RNA.
RNA-binding preferences of human proteins
Over the past decades, the development of novel techniques to determine with high precision what sequence motifs RBPs recognise has significantly changed our understanding of molecular biology. One such approach is crosslinking and immunoprecipitation (CLIP) coupled with RNA sequencing. CLIP methods, such as PAR-CLIP, HITS-CLIP and iCLIP, combine UV cross-linking of RNA–protein complexes with immunoprecipitation. UV cross-linking covalently links the RBP to the bound RNA molecules, preserving their interactions even after stringent washing steps. After immunoprecipitation, cross-linked RNA:RBP complexes are purified, and the bound RNA is sequenced to identify binding sites. Another approach to determine RNA-binding preferences is systematic evolution of ligands by exponential enrichment (SELEX) [16,17]. SELEX is an iterative process used to identify RNA sequences that bind to a specific RBP. It involves generating a large pool of random RNA sequences and subjecting them to multiple rounds of selection and amplification. The RNA sequences that bind to the protein of interest are enriched with each round, and through subsequent sequencing and analysis, the binding motifs can be deduced. These techniques have been further developed to generate high-throughput approaches that can be used to assess RNA-binding properties of hundreds of proteins [18–22]. Consequently, many of the RNA-binding preferences of human proteins are now known. Functional examination of such proteins has shown that their activity directly impacts RNA stability, RNA subcellular localisation and the regulation of splicing. Some of these RBPs also regulate virus infection and we will further discuss their role in controlling virus replication.
An orthogonal approach for the identification of human proteins that interact with virus RNA involves RNA-capture techniques, such as hybridisation purification of RNA–protein complexes followed by mass spectrometry (HyPR-MS) and comprehensive identification of RNA-binding proteins by mass spectrometry (ChIRP-MS). In such techniques, the interaction between human proteins and virus RNA is stabilised by a cross-linking step, such as treatment with formaldehyde or UV radiation, followed by incubation with biotinylated oligos targeting virus RNA. RNA:protein complexes are then isolated with streptavidin-conjugated reagents. These approaches have already been applied to multiple viruses including HIV-1 [23,26], SARS-CoV-2 [27,28], Zika virus and ebolavirus. While these approaches have been applied to identify novel host–virus interactions, both ChIRP-MS and HyPR-MS are prone to variability. Nevertheless, at least one protein family seems to be represented across multiple experiments and during infection with different viruses: the heterogeneous nuclear ribonucleoproteins (hnRNPs). This family includes many RNA-binding proteins and we discuss the activity of several members of this family in the sections below.
Large-scale RBP mapping experiments have highlighted the distribution of RNA-binding preferences of the human proteome. While the discovery of RNA-binding motifs is inherently difficult and may be dependent on sequence context and RNA structure, many of the human RBPs queried bind to motifs containing stretches of the same nucleotide. For example, the RBPs FUS and EWSR1 preferentially bind G-rich motifs, while hnRNPC and TIA-1 recognise U-rich sequence motifs. To identify preferred motifs bound by RNA-binding proteins, motif discovery algorithms, such as MEME, HOMER or GLAM2, are frequently used. These algorithms analyse large sets of RNA sequences and aim to find patterns or conserved motifs that are statistically enriched in the binding sites of RBPs. However, motif discovery algorithms are limited in their ability to detect longer or complex motifs. Algorithms that search for short motifs may miss more intricate binding patterns, which could be relevant for RBPs with specific binding preferences. These algorithms are usually adequate at retrieving motifs containing repeated mononucleotides in tandem; nevertheless, the RNA-binding profile of certain RBPs may be better explained when nucleotides are interspersed. Indeed, there is evidence that the preference for mononucleotide-rich RNA-binding motifs is also observed when these nucleotides are interspersed. One example is the protein hnRNPK, whose RNA-binding profile was initially described as the C-rich motif 5′-CCCCCC-3′ but for which a motif with interspersed cytosines, such as 5′-CNCNCNCNNNCC-3′ (Figure 1), yielded higher enrichment scores. There are also examples of interspersed motif recognition in other antiviral proteins. This indicates that some RBPs may recognise overall sequence composition better than linear motifs. However, interspersed patterns have not been extensively studied and further investigation is required to better determine exact binding motifs.
Schematic representation of in tandem and interspersed recognition modes of mononucleotides
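The distinction between in tandem and interspersed motifs can be made concrete by scanning a sequence for both hnRNPK patterns quoted above. The following is a minimal sketch, not the scoring method used in the cited work: it only counts exact occurrences (overlaps included), interprets N as any of A, C, G or U, and uses the hypothetical helper name `count_sites`.

```python
import re

# In tandem motif 5'-CCCCCC-3'; the lookahead lets overlapping sites be counted.
TANDEM_C = re.compile(r"(?=CCCCCC)")

# Interspersed motif 5'-CNCNCNCNNNCC-3', with N read as any of A, C, G, U.
INTERSPERSED_C = re.compile(r"(?=C[ACGU]C[ACGU]C[ACGU]C[ACGU]{3}CC)")


def count_sites(pattern, seq):
    """Count (possibly overlapping) occurrences of a compiled motif pattern.

    Hypothetical helper: DNA input is accepted by treating T as U.
    """
    seq = seq.upper().replace("T", "U")
    return sum(1 for _ in pattern.finditer(seq))
```

A sequence such as `CACACACAAACC` contains one interspersed site and no in tandem site, which shows how a search restricted to contiguous runs would miss it.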
Stretches of the same mononucleotide or overall representation of that base along an RNA molecule may have the propensity to form certain types of RNA structures. G-rich sequences, for instance, have the tendency to form G-quadruplexes, which in turn can be detected by RBPs. Therefore, proteins may have evolved to sense nucleotide composition of RNA by scanning RNA molecules for such structures rather than linear motifs. Due to non-Watson–Crick base pairing, G:U and C:U interactions are possible in RNA, and therefore the overall composition of guanine, cytosine and uridine in RNA correlates with more stable secondary structure. Endogenous proteins may have evolved to recognise specific structures formed by these nucleotide pairs. Intriguingly, a feature observed across many large-scale RNA-binding mapping experiments is the scarcity of A-rich motifs (Figure 2) [18–22]. Exceptions to this include proteins involved in recognition of polyA tails, such as PABPC1. It is unlikely that such bias against A-rich motifs is due to technical limitations, since this bias is observed across many types of techniques including several CLIP protocols and SELEX experiments. A possible explanation is the underlying role of structure in the target RNA molecule. A-rich RNA molecules tend to form less stable secondary structures, and stable structure may be important for the preservation of RNA:RBP binding sites. It is unclear if the scarcity of RNA-binding proteins that recognise A-rich motifs is driven by the lack of such motifs within coding sequences. PolyA tracks encoding lysine residues have been found to reduce translation and mRNA stability by stalling ribosomes, imposing a selective pressure against such motifs within open reading frames. While polyA-binding proteins predominantly bind to 3′ UTRs, it has been proposed that binding to other regions of the mRNA molecule may also occur.
PABPC1, for instance, limits its own expression by binding to an A-rich sequence in the 5′ UTR of its mRNA. Since polyA-binding proteins are relatively abundant in the cell, A-rich motifs present within coding sequences may have been disfavoured and selected against during the evolution of eukaryotic genomes. Nevertheless, a definitive investigation of the distribution of A-rich sequences within mRNA molecules – including a statistical analysis of their underrepresentation in sequence motifs across multiple CLIP experiments – is missing and requires further examination.
Summary of binding motifs of human RNA-binding proteins
This bias against A-rich motifs may also impact the evolution of RNA viruses. Indeed, a recent study that looked at the overall nucleotide composition of both positive-sense and negative-sense single-stranded RNA viruses showed that the majority of viral coding sequences tend to have high adenosine frequency at the cost of cytosine representation. This bias was not found in coding sequences from bacteriophages or human genomes. The authors of this study postulated that perhaps this bias towards A-richness is a consequence of the type of amino acids encoded by these sequences and, specifically, a bias against amino acids preferably displayed by the major histocompatibility complex (MHC). In line with this hypothesis, analysis of the small peptides that are loaded onto the MHC revealed that these peptides tend to be encoded by A/G-poor coding sequences. While this is a plausible explanation in the context of mammalian immune systems, viruses that infect plants – which are organisms that lack MHC systems – also present similar A-richness [40–42]. Since A-rich motifs are often depleted in CLIP/SELEX experiments of human RBPs [18–20,22], an alternative explanation for the bias towards adenosine in RNA viruses is the mere scarcity of human proteins that recognise such motifs, thereby reducing the chances of detection and manipulation of virus RNA by the immune system. Nevertheless, this hypothesis has not been tested and requires experimental investigation. Codon choice relies heavily on the third position of the codon, also known as the wobble position. In human cells, codons that contain G or C in the wobble position are associated with mRNAs that are more stably expressed while A/T nucleotides have a negative impact on RNA stability. This distinction may also regulate the expression of functionally distinct genes; for example, genes involved in cell cycle regulation are enriched in codons with A or T in the wobble position.
In single-stranded RNA viruses, the third position of the codons tends to be an A or a T, suggesting that viruses may also use this layer of information to regulate gene expression. In composition studies of virus genomes, the adenosine bias is frequently accompanied by the suppression of cytosines. Analysis of SARS-CoV-2 genomes, for example, showed a mutagenic bias to reduce cytosine and increase uridine residues [45,46]. Suppression of cytosine residues has also been observed in other viruses such as HIV-1, and artificially increasing cytosine content in HIV-1 leads to replication defects. The mechanisms underlying variation of mononucleotide composition in virus RNA are not well understood, but we postulate that they may be explained – at least in part – by the nucleotide-binding preferences of cellular proteins. In the next few sections, we summarise how human RBPs sensing nucleotide composition affect the replication of many viruses.
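The wobble-position bias described above can be quantified by tabulating base frequencies at the third position of each codon. The sketch below is illustrative only: the function name `third_position_composition` is hypothetical, and it assumes the input is a coding sequence that starts in frame at its first nucleotide, with any trailing partial codon ignored.

```python
from collections import Counter


def third_position_composition(cds):
    """Return base frequencies at codon position 3 of an in-frame sequence.

    Assumptions (not from the source text): the reading frame starts at
    index 0, T is treated as U, and incomplete trailing codons are skipped.
    """
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3       # drop any partial codon at the end
    thirds = cds[2:usable:3]               # every third base: indices 2, 5, 8, ...
    counts = Counter(base for base in thirds if base in "ACGU")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no complete codons found")
    return {base: counts[base] / total for base in "ACGU"}
```

Summing the `A` and `U` entries gives the fraction of codons with A or U (T in DNA notation) in the wobble position, which can then be compared between viral and host coding sequences.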
Proteins binding G-rich RNA
G-rich motifs seem to be overrepresented in RNA:RBP mapping experiments [18,22,36], and indeed, many proteins that bind G-rich sequences have been previously identified. An example of such a protein is the cellular nucleic acid-binding protein (CNBP). CNBP contains multiple small zinc fingers (also known as zinc knuckles) of the CCHC-type (Figure 3) through which it can bind both RNA and DNA. CNBP's binding motif was initially described in sterol regulatory elements found in the promoter regions of the low-density lipoprotein receptor and was thought to consist of 5′-GTG(G/C)GGTG-3′. Subsequent studies performing in vitro binding assays and yeast one-hybrid screens identified G-rich consensus sequence motifs as preferred DNA binding sites of CNBP [48,50]. Similar experiments identified G-rich motifs as preferred RNA binding sites of CNBP. Using PAR-CLIP, Benhalevy and colleagues found that most targets of CNBP in human cells were mRNA coding sequences containing G-rich motifs, most of which were found to form G-quadruplex (G4) structures. An RNA G4 is a type of secondary RNA structure in which four G residues, either within the same molecule or between multiple RNA molecules, pair with each other forming a square planar structure called a guanine tetrad. Such structures can be very stable, due to the multiple hydrogen bonds formed between G residues and base-stacking interactions. The presence of such structures in mRNA has been linked to reduced protein translation, probably due to halted ribosome movement at the G-quadruplex site [52,53]. Previous studies showed that CNBP binds RNA G4 structures in vitro and that CNBP promotes the translation of mRNA containing G4 structures in human cells. The current understanding of the activity of CNBP on endogenous RNA molecules suggests that CNBP binds to G-rich sequences and prevents the formation of G4 structures; however, it may also aid the unwinding of G4 structures.
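Candidate G4-forming regions are often flagged computationally before any structural validation. As a minimal sketch, the code below uses a widely used sequence heuristic that is not taken from the studies discussed here: four runs of at least three guanines separated by loops of one to seven nucleotides. A match indicates only sequence potential, not that a G4 actually folds, and the function name `find_putative_g4` is hypothetical.

```python
import re

# Heuristic pattern (an assumption, not the method of the cited studies):
# four G-tracts of >= 3 guanines, separated by loops of 1-7 nucleotides.
PUTATIVE_G4 = re.compile(
    r"G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}"
)


def find_putative_g4(seq):
    """Return (start, end) spans of non-overlapping putative G4 motifs.

    Coordinates are 0-based and end-exclusive; T is treated as U.
    """
    seq = seq.upper().replace("T", "U")
    return [match.span() for match in PUTATIVE_G4.finditer(seq)]
```

Because the pattern considers a single strand, scanning both the sense and antisense sequences of a virus genome would require calling the function on each strand separately.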
Intriguingly, the coding sequence and structures of the CNBP's zinc fingers closely resemble those of the zinc fingers found in the HIV-1 nucleocapsid. So much so that CNBP's zinc fingers can functionally replace the zinc fingers of HIV-1 nucleocapsid without impacting virus replication . To our knowledge, no association between CNBP and HIV-1 RNA/DNA has been demonstrated. Nevertheless, the similarity in protein topology and functional replacement of the zinc knuckles suggest similar modes of recognition. A study on the recognition of HIV-1 RNA by Gag (a precursor protein of the nucleocapsid) showed preferential binding to A/G-rich regions of the HIV-1 genome . Many G4 structures have been identified in HIV-1 genomes, both in RNA and DNA states [57,58], and indeed – akin to the activity of CNBP - HIV-1 nucleocapsid protein unfolds G4 structures in virus RNA . Crucially, a previous study argued that the A-rich nature of the HIV-1 RNA genome – a constant feature despite high mutation rates – may contribute to the ‘self-recognition’ of virus RNA by viral proteins, which is important for virus genome packaging and virion formation . Further investigation on the effects of CNBP on HIV-1 replication may elucidate novel virus-host interactions.
Topological organisation of RNA-binding proteins
Furthermore, a recent study showed that CNBP binds to SARS-CoV-2 RNA. This has been corroborated by another study that demonstrated that CNBP directly interacts with both sense and antisense strands of SARS-CoV-2 RNA genomes. Bezzi and colleagues showed that CNBP-binding sites in SARS-CoV-2 RNAs form G4 structures in vitro, and that CNBP binding to G4 promotes RNA unfolding. Other groups have identified multiple G4 structures in SARS-CoV-2 RNA. Abrogation of CNBP expression in human cells led to increased SARS-CoV-2 RNA levels following infection. Similarly, depletion of CNBP led to increased virus replication while CNBP overexpression reduced virus replication. CNBP-knockout mice were also more susceptible to infection. However, the mechanism by which CNBP may inhibit virus replication is still poorly understood. While binding of CNBP to G4 structures in human mRNA molecules generally promotes translation, CNBP expression reduced the levels of both virus RNA and virus protein. A recent study proposed that binding of CNBP to virus RNA may prevent encapsidation of viral RNA by the virus nucleoprotein (Figure 4). Wrapping of virus RNA by the nucleocapsid protein is a common process in the lifecycle of many viruses and it is essential for the packaging of virus genomes into virions. In many viruses, including SARS-CoV-2, this process is now believed to occur through the formation of RNA-protein condensates in the cytoplasm via phase separation. Indeed, expression of CNBP disrupted the formation of liquid-phase condensates between virus RNA and the nucleoprotein. This process seemed to be specific for viral RNA:protein interactions since the addition of CNBP did not affect the formation of condensates when using an unrelated RNA. CNBP may also have indirect effects on virus replication, since this protein can also modulate the innate immune response.
The expression of CNBP promoted the production of interleukin-6, a proinflammatory cytokine, in response to LPS. CNBP was shown to become phosphorylated upon influenza virus infection, followed by translocation to the nucleus where it binds to the IFNB promoter region. Consequently, IFN-β levels were lower in CNBP-deficient mice. This is in line with previous findings that suggest that CNBP is important for the control of protozoan and bacterial infections [65,67]. Nevertheless, the antiviral mechanism of CNBP is still not fully understood and additional experiments are required to determine how this protein limits virus replication.
Proposed antiviral mechanism of CNBP
Another antiviral protein that recognises G-rich RNA sequences is the Fused in Sarcoma (FUS) protein. FUS expression is up-regulated upon treatment with type I interferon, hinting at a role in controlling viral infections. During coxsackievirus B3 infection, FUS has been shown to directly bind to the virus genome. Consequently, ablation of its expression led to increased virus RNA levels and IRES-mediated virus translation. Similarly, FUS has also been described to inhibit the reactivation of Kaposi’s sarcoma-associated herpesvirus by inhibiting viral gene expression. Nonetheless, it is still unclear how the activity of FUS can lead to decreased virus replication. FUS can bind both DNA and RNA and it has been implicated in various aspects of RNA metabolism, including transcriptional regulation, splicing, RNA transport and translation. It belongs to the FET protein family, which also includes the EWSR1 and TAF15 proteins that, coincidentally, also recognise G-rich motifs. Mutations in the FUS gene have also been linked to several neurodegenerative diseases, including amyotrophic lateral sclerosis and frontotemporal lobar degeneration. In these diseases, FUS protein accumulates abnormally in the cytoplasm of neurons, leading to impaired RNA metabolism and neurodegeneration. The FUS protein comprises several functional domains, including an N-terminal transcriptional activation domain that contains a QGSY-rich region, multiple nucleic-acid-binding domains (Figure 3) such as an RNA-recognition motif (RRM) domain, three Arg-Gly-Gly repeat regions (known as RGG1-3), a zinc-finger motif and a highly conserved C-terminal nonclassical nuclear localisation signal that interacts with the nuclear transport receptor transportin 1. Based on its domain topology, FUS interaction with its target RNA may be complex since both the RRM domain and the zinc finger can interact with RNA.
RNA-binding sites of FUS were first identified using SELEX, which yielded the motif 5′-GGUG-3′. Still, RNAs lacking the GGUG motif retain the ability to interact with FUS. More recently, using multiple CLIP-based approaches, several studies have shown that FUS binds to G-rich sequences in RNA with a consensus sequence of 5′-GGGGGG-3′ [18,22]. Structural analysis revealed a bipartite mode of recognition of RNA in which the zinc finger recognises a G-rich motif while the RRM recognises RNA secondary structure. Coupling sequence-specific recognition with structural recognition may explain the diverse impact of FUS in RNA biology and recognition of virus RNA.
Proteins binding C-rich RNA
Poly-cytosine binding proteins (PCBPs) are RBPs that recognise C-rich nucleic acids. This family of proteins includes PCBP1, PCBP2 and hnRNPK, and they all have functions in mRNA stability, RNA nuclear traffic and splicing. Some of these proteins are also induced upon interferon stimulation and in response to virus infection, suggesting a potential role in controlling virus replication. PCBP1 and PCBP2 have been implicated in the replication of numerous viruses, and while they share nearly 90% protein sequence identity, their effects on virus replication are distinct. PCBP2 contains three KH domains, which are domains known to be involved in the recognition of both RNA and DNA. Initially, PCBP2 was thought to bind C/U-rich motifs in RNA, while interacting with T-rich DNA molecules in vitro. More recently, FAST-iCLIP experiments confirmed that endogenous targets of PCBP2 are C/U-rich. The fate of RNA molecules bound by PCBP1/2 can be diverse: while PCBP1 has been shown to increase the stability of many mRNAs, including α-globin and collagen-α1 [81,82], it can also inhibit translation of other genes. Comparably, the expression of PCBP2 also has opposing effects on virus replication. In human cells, ablation of PCBP2, but not PCBP1, enhanced vesicular stomatitis virus (VSV) replication. Conversely, overexpression of PCBP2 but not PCBP1 was found to reduce VSV replication. This effect is likely due to increased viral RNA stability in the absence of PCBP2 and it does not seem to interfere with viral ribonucleoprotein complexes. PCBP1, in turn, was found to inhibit the replication of HIV-1. While knockdown of both PCBP1 and PCBP2 increases virus gene expression, only overexpression of PCBP1 reduced viral RNA levels. In particular, the C-terminal KH domain of PCBP1 is responsible for this activity. In contrast, many reports indicate that the expression of PCBP2 has a positive impact on the replication of many positive-sense single-stranded RNA viruses.
Several studies demonstrated that PCBP2 directly binds to the 5′ untranslated regions of poliovirus and hepatitis A and C viruses [86–89]. Attempts to map this interaction showed that PCBP2 binds to C/U-rich regions of the hepatitis C virus (HCV) genome. While the positive effects of the expression of PCBP2 on virus replication may be due to direct binding to virus RNA, PCBP2 may also boost virus replication by down-regulating innate immune responses. Indeed, PCBP2 was shown to negatively regulate innate immune signalling: PCBP2 promotes the degradation of mitochondrial antiviral signalling protein (MAVS) and, upon infection with VSV and Sendai virus, knockdown of PCBP2 increases MAVS-mediated signalling. PCBP2 is also implicated in the cGAS-STING signalling pathway. Ablation of PCBP2 increases cGAS-STING-mediated signalling after herpes simplex virus 1 (HSV-1) infection, while overexpression of PCBP2 has the opposite effect. PCBP2 interacts with cGAS and prevents its condensation in the cytoplasm, while PCBP1 knockout reduces cGAS-mediated signalling in response to HSV-1 infection. PCBP1 facilitates the binding of DNA to cGAS and directly interacts with HSV-1 and HIV-1 DNA, primarily through its KH1 domain, promoting DNA-induced aggregation of cGAS. PCBP1 and PCBP2 thus exhibit both antiviral and proviral activities, and interactions with different protein partners may explain their divergent roles in regulating immune responses and viral infections.
Another member of the poly-cytosine binding protein family is the heterogeneous nuclear ribonucleoprotein K (hnRNPK), which was originally identified as a component of the heterogeneous nuclear ribonucleoprotein complex . This polycytidine-binding protein is involved in a variety of cellular processes, including chromatin remodelling, transcriptional regulation, splicing and RNA translation. It interacts with RNA and DNA and has been shown to act as both a transcriptional activator and a repressor when assembled on DNA [94,95]. hnRNPK has been found to affect mRNA stability and alternative splicing through its direct binding to mRNA [96,97]. The hnRNPK preference towards C-rich RNA sequences has been demonstrated in multiple experiments [18,22] and, as discussed previously, interspersed cytosine residues seem to be preferentially bound by this protein . Consequently, frequency of cytosine present in human RNA molecules correlates with the interaction with hnRNPK in cells . hnRNPK has also been shown to directly interact with virus RNA. hnRNPK binds to HCV RNA and blocks production of infectious particles . Ablation of hnRNPK expression increased HCV replication, and reconstitution was shown to restore suppression of virus infection. In this study, the expression of hnRNPK did not affect the levels of virus RNA. Instead, and akin to the proposed mechanism of action of CNBP, hnRNPK’s antiviral mechanism is thought to be through interference of virus assembly. hnRNPK has also been shown to limit virus replication by targeting viral proteins for proteasomal degradation . Therefore, the antiviral mechanism of hnRNPK may be more complex than previously thought and requires further examination.
hnRNPK has also been shown to directly bind to the HIV-1 RNA genome, where it binds to a C-rich region of a stem-loop structure located in the env gene region. Immunoprecipitation of HIV-1 RNA led to the identification of hnRNPK, and its overexpression altered HIV-1 splicing, which is crucial for efficient HIV-1 replication. The HIV-1 genome is initially transcribed as a large precursor RNA molecule which contains specific splice sites. One essential splice site in HIV-1 pre-mRNA is the major splice donor site located upstream of the viral env gene. Alternative splicing at this site results in the production of two main classes of mRNA molecules: partially spliced and fully spliced RNAs. While the unspliced mRNA serves as the genomic RNA that is packaged into virions and contains genetic information for the production of structural proteins (Gag and Pol) and the enzymes required for viral replication (reverse transcriptase, integrase, and protease), spliced RNAs encode the envelope and multiple regulatory viral proteins, including Rev and Tat. These proteins play critical roles in regulating viral gene expression and enhancing viral replication. hnRNPK was shown to bind directly to the A7 acceptor site, which is present in all HIV-1-derived RNAs, and strongly inhibit splicing. How hnRNPK inhibits splicing is not yet understood, but it has been suggested that binding of hnRNPK stabilises unspliced RNA molecules. In line with this idea, another group found that hnRNPK is enriched in HIV-1 unspliced RNAs but not in partially or fully spliced RNA molecules. hnRNPK has also been implicated in splicing of other viruses, including influenza A virus (IAV).
At least two IAV segments are subjected to splicing during virus replication: the NS segment, whose unspliced isoform produces the non-structural protein 1 (NS1) while the spliced form encodes for NS2; and the M segment, whose spliced and unspliced forms produce an ion channel (M2) and a matrix protein (M1), respectively . Additional spliced isoforms of M have also been identified . Depletion of hnRNPK led to abnormal mRNA ratios between M1 and M2 RNA and reduced virus replication . hnRNPK was shown to directly bind IAV RNAs in a C-rich region, consistent with the bias for cytosine in endogenous RNAs . Considering that many virus genomes are C-poor, future research into proteins that recognise C-rich RNA sequences may explain composition biases in circulating viruses.
Proteins binding U-rich RNA
U-rich RNA sequences are crucial for RNA metabolism, primarily through uridylation of mRNA molecules. Uridylation refers to the addition of uridine monophosphate residues to RNA molecules. It is a post-transcriptional modification that mostly occurs at the 3′ end of RNA molecules, however, internal uridylation sites have also been identified in trypanosomes [107,108]. Therefore, multiple proteins have evolved to recognise U-rich RNA. One example of an RBP protein with preference for U-rich RNA sequences is the T-cell-restricted intracellular antigen 1 (TIA-1). Multiple studies using CLIP-based approaches have shown that TIA-1 directly binds to U-rich RNA sequences [18,22,109]. TIA-1 is ubiquitously expressed but it is present at higher levels in immune cells and has been extensively studied for its involvement in cellular stress responses, immune regulation and RNA granule dynamics [110,111]. TIA-1 contains three RRMs, which enable interaction with specific RNA sequences, and a low-complexity C-terminal domain . RRM2 is thought to be the main RNA recognition domain binding to U-rich sequences, while RRM1 does not bind to RNA and RRM3 binds to RNA with no specificity . TIA-1 is known to impact mRNA stability and translation in the cell. Upon binding to target mRNAs, TIA-1 recruits several nucleases and decapping enzymes that metabolise bound RNA . It also participates in the formation of stress granules during cellular stress responses and inhibits the translation of target mRNAs [113,114]. It has been shown that during VSV infection, TIA-1 is enriched in intracellular structures that resemble stress granules . Indeed, knockdown of TIA-1 leads to increase viral gene expression, promoting VSV replication. Similarly, during rabies virus infection, TIA-1 also has an antiviral effect, since cells lacking TIA-1 expression sustain higher replication . 
This seems to be true for some positive-sense single-stranded RNA viruses as well: during infection with tick-borne encephalitis virus (TBEV), a flavivirus, TIA-1 is recruited to virus replication sites where it directly binds viral RNA. As with VSV and rabies virus infection, knockdown of TIA-1 led to increased TBEV replication. TIA-1 also binds virus RNA and limits the replication of other viruses, such as enterovirus D68 and red-spotted grouper nervous necrosis virus [118,119]. In contrast, TIA-1 seems to have a positive effect on the replication of other viruses, such as enterovirus A71, West Nile virus and dengue virus [120–122]. These virus-specific outcomes may be explained by different localisation of TIA-1 upon infection and, consequently, different interaction partners; however, further investigation is required to assess this hypothesis.
The protein hnRNPL is also known to bind U-rich RNA sequences [18,19]. hnRNPL contains four RRMs (termed RRM1-4), an N-terminal glycine-rich region and a flexible, proline-rich domain between RRM2 and RRM3. In human cells, hnRNPL inhibits retrotransposition and regulates RNA splicing. hnRNPL inhibits foot-and-mouth disease virus replication by binding to the IRES. This interaction occurs through the domains RRM3 and RRM4. While this interaction does not impact viral protein translation, it leads to reduced viral RNA levels. One possible antiviral mechanism is that hnRNPL blocks the interaction of viral proteins, such as the virus RNA polymerase, with viral RNA.
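As an illustration of what a U-rich region means in practice, the following is a minimal sketch of a sliding-window scan that flags stretches of a sequence enriched in one nucleotide. The function name `nucleotide_rich_windows`, the window length and the threshold are hypothetical choices made for this example; they are not taken from the CLIP studies cited above and do not model the binding specificity of TIA-1 or hnRNPL.

```python
def nucleotide_rich_windows(seq, base="U", window=10, threshold=0.7):
    """Return (start, fraction) for every window enriched in `base`.

    Illustrative only: `window` and `threshold` are arbitrary values,
    not experimentally derived binding parameters.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    # Treat DNA and RNA input alike by converting T to U.
    seq = seq.upper().replace("T", "U")
    base = base.upper().replace("T", "U")
    hits = []
    if len(seq) < window:
        return hits
    # Count the base in the first window, then update the count
    # incrementally as the window slides one position at a time.
    count = seq[:window].count(base)
    for start in range(len(seq) - window + 1):
        if start > 0:
            count -= int(seq[start - 1] == base)
            count += int(seq[start + window - 1] == base)
        fraction = count / window
        if fraction >= threshold:
            hits.append((start, fraction))
    return hits


if __name__ == "__main__":
    # Toy sequence with a 10-nt uridine tract flanked by mixed sequence.
    toy = "ACGACGACGA" + "UUUUUUUUUU" + "ACGACGACGA"
    for start, fraction in nucleotide_rich_windows(toy):
        print(start, fraction)
```

A scan of this kind only reports local composition; whether a flagged window is actually bound by a protein would still need to be established experimentally.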
Proteins recognising dinucleotide composition
Aside from mononucleotide biases, virus genomes also maintain biased dinucleotide composition. In vertebrate RNA viruses, one of the most pervasive dinucleotide biases is the low frequency of CpG dinucleotides, which resembles the CG-suppressed human genome. The paucity of CG dinucleotides in the human genome is thought to be a consequence of the activity of DNA methyltransferases that catalyse the conversion of cytosine to 5-methylcytosine in a CpG context; methylated cytosines are naturally prone to spontaneous deamination, resulting in a C-to-T mutation. While virus RNA is not a substrate for human DNA methyltransferases, CpG suppression in RNA viruses is prevalent and is found to be essential for their replication in human cells; such examples include HIV-1 and influenza virus. A possible explanation for this selection pressure is that viruses have evolved to evade endogenous sensors that detect CpGs in virus RNA. One of the best characterised dinucleotide composition sensors is the zinc-finger antiviral protein (ZAP), which detects CpG dinucleotides and inhibits the replication of a broad range of viruses. ZAP contains four CCCH-type zinc finger motifs (ZnF1-4) at its N-terminus, forming the RNA-binding domain, which is crucial for its antiviral activity. The rest of the protein comprises a WWE domain and a poly(ADP-ribose) polymerase (PARP)-like domain [130,131]. While the PARP-like domain seems to be largely dispensable for antiviral activity, the WWE domain may enhance ZAP’s function by binding to poly(ADP-ribose). There are at least four isoforms of ZAP; however, the most abundant isoforms are the long isoform (ZAP-L) and the short isoform (ZAP-S), whose expression is controlled by an early polyadenylation site. ZAP specifically interacts with single-stranded RNA, but it may also interact with structured RNA elements, such as stem loops.
ZAP directly binds CpG dinucleotides through its RNA-binding domain, where the highly basic ZnF2 forms a binding pocket specific for accommodating CG dinucleotide bases of the target RNA [128,136,137].
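CpG suppression is commonly quantified as an observed/expected ratio: the frequency of the CG dinucleotide divided by the product of the C and G mononucleotide frequencies, with values below 1 indicating under-representation. The following is a minimal sketch of that calculation; the function name `dinucleotide_odds_ratio` is hypothetical, and the toy sequences are made up for illustration and are not real virus sequences.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq, dinuc="CG"):
    """Observed/expected ratio of one dinucleotide in a sequence.

    ratio = f(XY) / (f(X) * f(Y)), where f(XY) is computed over all
    overlapping dinucleotide positions. Returns NaN if either
    mononucleotide is absent from the sequence.
    """
    # Treat DNA and RNA input alike by converting T to U.
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    if len(dinuc) != 2:
        raise ValueError("dinuc must be exactly two nucleotides")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    # Count overlapping occurrences of the dinucleotide.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    f_xy = observed / (n - 1)
    f_x = mono[dinuc[0]] / n
    f_y = mono[dinuc[1]] / n
    if f_x == 0 or f_y == 0:
        return float("nan")
    return f_xy / (f_x * f_y)


if __name__ == "__main__":
    # Two toy sequences with identical mononucleotide composition
    # but different CG content.
    print(dinucleotide_odds_ratio("ACGU" * 10))
    print(dinucleotide_odds_ratio("ACUG" * 10))
```

Because the two toy sequences share the same base composition, any difference in the ratio reflects only how the nucleotides are arranged, which is the property a dinucleotide sensor such as ZAP is thought to respond to.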
ZAP acts through two main mechanisms that require the recruitment of different cofactors: upon interaction with viral RNA, ZAP can inhibit viral protein translation and target viral RNA for degradation [138–141]. To block translation, ZAP interacts with the translation initiation factor eIF4A, disrupting the eIF4A–eIF4G association, which halts translation. TRIM25, an E3 ubiquitin ligase, is also important for translation inhibition by ZAP in the context of alphavirus infection, since ablation of TRIM25 impairs ZAP-dependent repression of viral translation. TRIM25 comprises four domains: an N-terminal RING domain, which catalyses the transfer of ubiquitin, a B-box domain, a coiled-coil domain and a carboxy-terminal PRY/SPRY domain responsible for substrate binding. TRIM25 interacts with the RNA-binding domain of ZAP, potentially through its SPRY domain, and both its RING domain and coiled-coil domain are important for the antiviral activity of ZAP [143,145,146]. Aside from translation inhibition, ZAP also recruits cofactors responsible for RNA degradation; for example, in the context of CG-rich HIV-1 infection, the putative endonuclease KHNYN is recruited by ZAP to cleave viral RNA via its NYN endonuclease domain. ZAP can also recruit helicases, such as the p72 DEAD box RNA helicase, as well as the RNA processing exosome component hRrp46p, that catalyse the degradation of the target RNA [148,149]. Exploiting the antiviral activity of ZAP by introducing CpG dinucleotides in virus genomes has also been shown to be an effective strategy for the generation of live-attenuated vaccines [33,150]. In addition to its antiviral activity, ZAP also regulates endogenous RNAs; for instance, ZAP inhibits the retrotransposition of Long Interspersed Element-1 (LINE-1) by binding to LINE-1 RNA and mediating its degradation, which prevents its accumulation in the cytoplasm [151,152].
Notably, TRIM25 is itself an RNA-binding protein. Previous studies have shown that TRIM25 may bind to RNA through two potential motifs: the 7K sequence in the L2 linker region located between the coiled-coil and the PRY/SPRY domains, and residues located within the SPRY domain [153,154]. However, whether the RNA-binding activity of TRIM25 is important for the antiviral activity of ZAP is not yet known. Nevertheless, the RNA-binding activity of TRIM25 may be important in a ZAP-independent context; for example, while dengue virus is not sensitive to ZAP, its replication is inhibited by TRIM25 in a manner that is dependent on its RNA-binding properties. TRIM25 was also initially implicated in the activation of RIG-I during virus RNA sensing. However, recent evidence suggests that Riplet/RNF135, not TRIM25, is the E3 ubiquitin ligase required for the activation of RIG-I in the presence of double-stranded RNA [156,157]. Further investigation is required to determine if TRIM25 senses specific motifs enriched in virus RNA.
Emerging evidence suggests that some human proteins detect local or global compositional biases in virus RNA. These proteins seem to regulate not only virus replication but also multiple aspects of RNA metabolism. Indeed, mutations found in many of the genes encoding such proteins contribute to several human diseases [71,125]. These proteins may have evolved to detect stretches of mononucleotides, either in tandem or interspersed, in endogenous RNA molecules but were later co-opted to control virus infections. Supporting this hypothesis is the observation that many of these proteins are up-regulated upon virus infection or upon the induction of innate immune signalling [91,158,159]. Intriguingly, some of the sensors discussed here were also shown to be negative regulators of the immune response. Antiviral proteins that act on virus RNA have been suggested to affect the composition of co-expressed genes, and the evolution of the antiviral state is likely constrained by the activity of many up-regulated RBPs. While this review focuses on RNA viruses, it is likely that such proteins also impact the replication of DNA viruses by direct interaction with viral DNA and mRNA. A common theme we observed when reviewing the evidence for compositional sensors is that most of the proteins discussed above were found to interact with both RNA and DNA, recognising similar motifs in both molecules. While structural studies that define the molecular details of RNA recognition are missing for many of the proteins reviewed, we hypothesise that such interactions occur primarily through recognition of the nucleobases rather than through interactions with the ribose/deoxyribose sugars or the phosphodiester bonds. A mode of recognition like the one described may be suitable for both molecules. We also discussed how the same protein can be antiviral in some contexts and promote virus replication in others.
Indeed, endogenous substrates of these RBPs can also have opposite fates in the cell, where proteins can promote mRNA stability of certain targets and RNA degradation of others. While a plausible explanation is the variation in interaction partners, another explanation could be that recognition of the target RNA may be more complex than a single RNA-binding motif. For example, structural studies of FUS interacting with target RNAs suggest a bipartite mode of recognition. Indeed, complex modes of RNA recognition that depart from standard single-motif recognition models may explain how different target RNAs can have different fates. To address this problem, the emerging field of ribolinguistics [1,161–163] – which studies how coding biases, synonymous mutations and compositional patterns encoded in RNA sequences impact multiple aspects of RNA biology and virology – will play a pivotal role. The further development of RNA:protein mapping approaches, as well as motif discovery algorithms capable of identifying non-linear binding motifs, will pave the way to understanding how complex RNA:protein interactions determine the fate of virus infections.
The authors declare that there are no competing interests associated with the manuscript.
Open access for this article was enabled by the participation of Imperial College London in an all-inclusive Read & Publish agreement with Portland Press and the Biochemical Society under a transformative agreement with JISC.
CRediT Author Contribution
Raymon Lo: Conceptualization, Writing—original draft, Writing—review & editing. Daniel Gonçalves-Carneiro: Conceptualization, Writing—original draft, Writing—review & editing.