Comparative genomics and interactomics of polyadenylation factors for the prediction of new parasite targets: Entamoeba histolytica as a working model

Abstract Protein–protein interactions (PPI) play a key role in predicting the function of a target protein and drug ability to affect an entire biological system. Prediction of PPI networks greatly contributes to determine a target protein and signal pathways related to its function. Polyadenylation of mRNA 3′-end is essential for gene expression regulation and several polyadenylation factors have been shown as valuable targets for controlling protozoan parasites that affect human health. Here, by using a computational strategy based on sequence-based prediction approaches, phylogenetic analyses, and computational prediction of PPI networks, we compared interactomes of polyadenylation factors in relevant protozoan parasites and the human host, to identify key proteins and define potential targets for pathogen control. Then, we used Entamoeba histolytica as a working model to validate our computational results. RT-qPCR assays confirmed the coordinated modulation of connected proteins in the PPI network and evidenced that silencing of the bottleneck protein EhCFIm25 affects the expression of interacting proteins. In addition, molecular dynamics simulations and docking approaches allowed to characterize the relationships between EhCFIm25 and Ehnopp34, two connected bottleneck proteins. Interestingly, the experimental identification of EhCFIm25 interactome confirmed the close relationships among proteins involved in gene expression regulation and evidenced new links with moonlight proteins in E. histolytica, suggesting a connection between RNA biology and metabolism as described in other organisms. Altogether, our results strengthened the relevance of comparative genomics and interactomics of polyadenylation factors for the prediction of new targets for the control of these human pathogens.


Introduction
Gene expression regulation allows cells to adapt and survive to environmental stimuli by switching on or switching off specific protein coding genes. After gene transcription in the nucleus, the 5 -end of pre-mRNA is covered by the cap structure, introns are eliminated through splicing and 3 -end is cleaved and protected by a poly(A) tail. Then, mature mRNA is exported to the cytoplasm where it is translated to

Building host and pathogens polyadenylation interactome
Sequences of human and parasite polyadenylation factors were used for predicting protein-protein interaction (PPI) networks from STRING database (https://string-db.org/) and interactomes were analyzed with the Network Analyser plugin of Cytoscape v3.8.2 (https://cytoscape.org/) using the most stringent criteria. Refinement steps were conducted to remove nonessential nodes and false-positive unreliable interaction, node and edge weight were computed for each protein and significantly lower node and edge weight were discarded [23]. To compare PPI networks between host and pathogens, proteins were clustered into two connected modules. The first module was assigned to PPI among polyadenylation machinery, and the second module corresponded to PPI related to other processes. Proteins having the highest number of interactions were considered as hubs, and proteins connecting both modules in the network were classified as bottlenecks [24]. The functional enrichment analysis was carried out using stringApp from Cytoscape v3.8.2; Gene Ontology (GO) terms, KEGG Pathways, STRING clusters, and Interior domains were used to determine the overrepresented molecular function and biological processes of parasites and human PPI networks. Overrepresented molecular function and biological processes were chosen with a FDR < 0.01.

Analysis of RNA-seq datasets
RNA-seq data information of parasite polyadenylation factors were obtained from transcriptomic resources in VEu-PathBD (https://veupathdb.org/veupathdb/app) using experimental datasets corresponding to E. histolytica trophozoites growing in normal conditions versus subjected to serum starvation for 24 h or replenished with serum for 2 h, following starvation [27], and virulent versus non-virulent trophozoites [28]. We also included data about polyadenylation factors of Plasmodium vivax growing in different microenvironments and temperatures [29], hypnozoites versus mixed cells [30] and sporozoites in different stages [31], as well as two wild-type strains of Trypanosoma brucei (MITat 1.2, clone 221a) and mutant strains [32]. Heatmap and hierarchical clustering were visualized using R statistical software packages.

Molecular modeling and docking
The 3D structures of EhCFIm25 and EhNopp34 proteins (sequences C4M2T1 and C4LU58 from Uniprot, respectively) were generated using an approach of distance-based protein structure prediction by deep learning with the RAPTOR-X server [33] and homology modeling with the SWISS-MODEL server [34]. The 3D models were refined with the ReFOLD server [35] and their stereochemical quality was assessed by Verify3D [36] and PROCHECK [37] servers. The refined models of EhCFIm25 and EhNopp34 were submitted to molecular dynamics (MD) simulations through the GROMACS suite [38] version 5.1, using the OPLS all-atom force-field [39]. For this, monomeric structures were independently solvated in a dodecahedral box with its nearest edge 1.0 nm away from the protein, and the TIP3P explicit water model was used for all simulations. Sodium and chloride ions (for EhCFIm25 and EhNopp34 simulations, respectively) were added for system neutralization and all electrostatic interactions were calculated through the Particle Mesh Ewald (PME) approach. Energy minimization was performed using the steepest descent algorithm for 5000 steps. Then, a restrained MD simulation of 1000 ps was performed to allow the solvent to relax; the peptide atoms were harmonically restrained to their position in the model with a force constant of 1000 kJ/mol/nm 2 . All simulations were performed at 300 K and 1 atm pressure. The free MD run was carried  out for 100 ns with the same pressure-and temperature-coupling constants as the restrained run. All steps of the simulations were performed using periodic boundary conditions. The stability and conformational changes of the trajectory for EhCFIm25 and EhNopp34 structures were characterized by analyzing the root mean square deviation (RMSD), that quantifies how much a structure diverges from another, and therefore may indicate the stability of the protein structure during simulation or may also reflect high flexibility of different regions of the protein structure, the root-mean-square-fluctuation (RMSF) that reveals which regions of the structure are the most mobile and the radius of gyration (R g ), which indicates a measure of a protein compactness, all calculated by tools included in the GROMACS software. Finally, the protein-protein docking study was conducted with the LZerD webserver [40] using the average structures of EhCFIm25 and EhNopp34 obtained from the last 75 and 80 ns of the MD simulation, respectively, which are the times from which both simulations converged. Then, the dimeric structure with the highest frequency of members inside each cluster and the best score was used to perform a second docking. The best output structure was chosen considering the frequency of members inside each cluster and the best multi-score ranking from LZerD server.

RNA extraction and real-time qRT-PCR
Total RNA (1 μg) of E. histolytica trophozoites cultured under different conditions as described above was obtained by the TRIzol reagent (Invitrogen) according to manufacturer instructions and used to validate PPI networks predictions and RNA seq datasets analysis through Real-Time qRT-PCR. For this, cDNA was synthesized using the SuperScript III Reverse Transcriptase (Invitrogen) according to manufacturer's instructions in a GeneQ Thermal Cycler (BIOER, Hangzhou, China). The qPCR assay was completed using the SensiFAST™ SYBR Hi-ROX Kit (Bioline) with specific primers for selected genes (Table 1). All reactions were performed in a StepOne real-time PCR system (Applied Biosystems) with the following steps: enzyme activation at 95 • C for 2 min, denaturation at 95 • C for 5 s, and annealing/extension at 60 • C for 30 s (40 cycles). Experiments were performed by triplicate with three biological samples and the relative expression of mRNA was determined by the 2 − Ct method using data of the EhRNAPII gene for normalization [44].

Cross-linking immunoprecipitation (CLIP) assays
Entamoeba histolytica trophozoites (3 × 10 5 ) harvested in exponential growth phase were transferred to 25 cm 2 cell culture flasks and grown at 37 • C for 24 h. Then, parasites were exposed to irradiation in a 254 nm UV-transilluminator (BioDoc-It Imaging System (UVP)) for 30 min at 4 • C [45] and cytoplasmic (CE) and nuclear (NE) extracts were obtained using the NE-PER Kit (Thermo Scientific™) according to manufacturer's instructions. Proteins were quantified by the Pierce™ BCA Protein Assay Kit and their integrity was confirmed by SDS-PAGE analysis. The absence of cross contamination between NE and CE was assessed by Western blot experiments using anti-EhPC4 (1:1000 dilution) as a nuclear marker [46], anti-EhPSP (1: 1000 dilution) as a cytoplasmic marker [47], and anti-HsCFIm25 polyclonal antibodies (GeneTex, GTX115535) (1:1,500 dilution), followed by goat anti-rabbit IgG horseradish peroxidase secondary antibody (Invitrogen, A16096) (1:10,000 dilution), and proteins were revealed with the Immobilon Western Chemiluminescent HRP Substrate (MILLIPORE). Immunoprecipitation (IP) assays were performed using Dynabeads protein A/G (Invitrogen) following manufacturer recommendations. NE (100 μg) were incubated during 2 h at 4 • C with Rabbit IgG Isotype (10 μg) and magnetic beads (10 μl) (Invitrogen) to remove non-specific interactions. On the other hand, Dynabead magnetic beads (25 μl) were coupled with 10 μg of anti-HsCFIm25 polyclonal antibody (GeneTex, GTX115535) and anti-EhCFIm25 polyclonal antibody, previously produced and validated by our group [8] for 30 min at room temperature with1X PBS 0.02% Tween-20. Then, beads were washed three times with 1× PBS 0.02% Tween-20 and beads-antibodies complex was incubated with precleared NE (100 μg) for 2 h at 37 • C followed by three washes with wash buffer (20 mM Tris, pH 7.5; 0.5M NaCl and 0.05% Tween 20). Finally, magnetic beads-antibodies-proteins complex was resuspended in elution buffer (0.1 M glycine, pH 2.0) for 2 min at room temperature to dissociate the complex; the tube was placed on the magnet and the supernatant containing eluted antibody and bound proteins were collected.
Proteins corresponding to eight independent experiments with each antibody were pooled (150 μg) and delivered to the Laboratorio Nacional de Servicios Experimentales (LaNSE), CINVESTAV (Mexico), for protein identification by mass spectrometry analysis LC-ESI-HDMSE as described [48,49]. Briefly, the proteomic parameters included: trypsin for protein digestion and one missed cleavage allowed; automatic peptide and fragment tolerance, minimum fragment ion matches per peptide, 2; minimum fragment ion matches per protein, 5; minimum peptide matches per protein, 1; and a false discovery rate (FDR) ≤ 4%. Raw files containing MS and MS/MS spectra were quantified by PROTEINLYNX GLOBAL SERVER (PLGS) v3.0.3 software against E. histolytica (Strain: ATCC 30459/ HM-1:IMSS, downloaded from UniProt, 7959 protein sequences, January 3, 2022) concatenated with reverse database and the protein identification was considered significant with the probability score of 95 and 99%. Mass spectrometry analysis was performed twice from the pool of eight IP samples to assess the reproducibility of the data, and only proteins identified in both replicates were considered. The mass spectrometry proteomic data have been deposited to the ProteomeXchange Consortium via the PRIDE [50] partner repository with the dataset identifier PXD033620.

Statistical analysis
Data of gene expression were analyzed by Prism-GraphPad software using the paired Student's t-test to compare stress, virulence and EhCFIm25 silencing conditions with standard growth condition as control. P<0.05 was considered as statistically significant.

Comparative analysis of human and parasite polyadenylation machineries
Polyadenylation machineries have been poorly studied in protozoan parasites. Moreover, most polyadenylation factors have not been experimentally annotated (low-level of analysis) in genome databases. Here, we conducted a BLAST approach using human proteins as queries and identify potential orthologs in several protozoan pathogens that affect human health. As shown in Table 2, BLAST results indicated that E. histolytica and A. castellani have the largest set of polyadenylation factors with 15 proteins, followed by T. vaginalis and L. mexicana with 14 proteins, while G. lamblia has the smallest one, with only six proteins. All parasites conserve at least one subunit of the four main complexes, CPSF, CstF, CFIm y CFIIm (except for G. lamblia that lacks both CF complexes). Only E. histolytica and T. vaginalis have the six subunits of CPSF (CPSF1, CPSF2, CPSF3, CPSF4, FIP1, and WDR33), the other pathogens lack one to four of them, and WDR33 (11/13) and CPSF160 (10/13) are the three more conserved subunits across this set of parasites. Three polypeptides of CstF (CSTF1, CSTF2, and CSTF3) were identified in A. castellani and Plasmodium spp, while T. vaginalis, L. mexicana, T. cruzi, B. microti, and G. lamblia seem to have only one subunit. Except for G. lamblia, all protozoa have the CFIm25 protein, and only A. castellani and T. vaginalis have the larger subunits of CFIm. Similarly, no apparent CLP1 and PCF11 (CFIIm) homologue were found in G. lamblia, and both subunits were only found in E. histolytica, A. castellani and L. mexicana. Interestingly, PAP that is fundamental for poly(A) tail synthesis is present in all parasites and PABP2 is only missing in C. cayatenesis and G. lamblia. Lastly, several accessory polyadenylation factors were also predicted in some parasites, but Symplekin (SYMPK) was only  detected in L. mexicana. Additionally, we identified distinct genes for several polyadenylation factors in two other protozoan parasites; thus, two genes correspond to CLP1 in T. vaginalis, and to CPSF30 in T. cruzi. In contrast, the searches for PAPα, PAPβ, and PAPγ variants, as well as PP1α (PPP1CA), PP1β (PPP1CB) and PP1γ (PPP1CC) isoforms identified the same gene sequence in parasites (Supplementary Table S1).
To better characterize the parasite polyadenylation machineries and assess their conservation, we next compared protein length and amino acid identity. Most polyadenylation factors have a similar size than the corresponding human proteins (Supplementary Figure S1). However, CPSF160 is significantly increased in size in P. vivax, P. falciparum, and T. gondi, as well as CstF77 in P. vivax, P. falciparum, P. berghei, and T. gondi, and PAP in A. castellani. In contrast, FIP1, WDR33, CFIm59, PCF11, and RBPP6 of all parasites correspond to ∼25-50% of the size of their human counterparts; the size of CstF64 is also reduced in E. histolytica, T. vaginalis, and A. castellani. In agreement with the BLAST search strategy, most parasite proteins share at least 20% identity with human counterparts (Supplementary Figure S2). Notably, PP1 isoforms were the most conserved factors with 70-90% identity in T. gondi, L. mexicana, T. brucei brucei, T. cruzi, and G. lamblia.
We also constructed the phylogenetic tree of parasite using human proteins as an external control group. The complete cladogram shown in Figure 1 revealed that some parasite proteins do not share a close relationship with their human homologue, since they present a polyphyletic origin in most pathogens. Notably, the poly(A) polymerases PA-POLA (E. histolytica) and PAPOLAB (T. vaginalis) share a polyphyletic relationship between them and are excluded from the main phylogenetic tree, suggesting that they evolved at a different time from the rest of the polyadenylation machineries. Interestingly, CFIm25 (CPSF1) of E. histolytica, shows a paraphyletic origin from the rest of the phylogenetic tree, suggesting that it does not share relationship with the human protein. Surprisingly, the human CstF50 (CSTF1) is closely related with CFIm25 (CPSF5) of E. histolytica and T. vaginalis, while the human CFIm25 (CPSF5) protein is related with CstF50 (CSTF1) of Plasmodium species, suggesting that these parasites proteins share similar functions. The human CPSF30 (CPSF4) protein is closely related to the CLP1 and CPSF30 (CPSF4) proteins in most parasites, as well the human proteins CFIm68 (CPSF6), CFIm59 (CPSF7), and CstF64 (CSTF2) with the pathogen proteins CstF64 (CSTF2), indicating that these proteins are efficient enough to remain throughout evolutionary history. In addition, the human SYMPK showed monophyletic grouping with CFIm68 (CPSF6), CFIm59 (CPSF7), PCF11, and CstF64 (CSTF2), revealing a positive selection of this protein on the history evolution of the polyadenylation. On the other hand, the human FIP1L1, PPP1CA, PPP1CB, PPP1CC, and WDR33 share monophyletic clustering with FIP1L1 and PPP1CA of E. histolytica and T. vaginalis and WDR33 of Plasmodium vivax and T. vaginalis, respectively. Other parasite proteins share monophyletic relationships with the homologue human

Human and parasite polyadenylations networks have different central hubs
The fact that some parasite polyadenylation factors have evolved in a different way than human proteins suggest that their function and their relationships with other proteins could be different. Moreover, the absence of factors may have an impact on protein function and their relationships with other proteins. Therefore, computational models of human and pathogens PPI networks were predicted using information available from STRING database to identify highly connected proteins (central hubs) for mRNA polyadenylation network. As shown in Figure 2A, the human network was clustered in two different PPI modules, specifically the polyadenylation module (external nodes) and the module corresponding to proteins that interact with polyadenylation factors (internal nodes). Interestingly, polyadenylation factors have relationship with proteins related to genetic information processing, as well as gene expression and protein modification. CPSF1 (CPSF160) represents the central hub of the network, establishing numerous connections (high degree of nodes) with polyadenylation factors and various proteins of the second module, suggesting that CPSF1 is likely to have a relevant role on the global network function via multiple interactions. Parasite networks contain the same two PPI modules; in addition to interactions within the polyadenylation module, polyadenylation factors also have relationships with proteins involved in gene expression regulation events. However, the central hubs of PPI networks are different in parasites and their human host. Specifically, some parasites networks display central hubs in polyadenylation proteins including PAP (EHI 012040; Q51D88) in E. histolytica, WDR33 (XP 001277014.1; A2D7P9) in T. vaginalis, and CPSF30 (XP 804858.1; Q4CRV4) in T. cruzi. In contrast, in other parasites networks, central hubs are proteins that interact with the polyadenylation module, such as the SSU ribosomal protein S7P (GSB 12981; V6TPX5) in G. lamblia, the small nuclear ribonucleoprotein SmD1 (AAZ12376; D6XKM9) in T. brucei, the nucleic acid binding protein (PBANKA 093980; A0A509AKK6), the RNA helicase proteins (PFF0100w; A0A024WC77 and PVX 113270; A5K1L9) in P. berghei, P. falciparum, and P. vivax, respectively, as well as the splicing protein Mago nashi (TGME49 067420; A0A125YHL8) in T. gondii (Figure 3 and Supplementary Figure S3).
In the last two decades, our group has focused on the study of mRNA polyadenylation in E. histolytica [6][7][8][9][10][11][12]. Therefore, we used E. histolytica as a working model to perform several in vitro experiments to support computational predictions. First, we performed qRT-PCR assays to evaluate the expression of five genes, CPSF5 (CFIm25), CPSF1 (CPSF160), PAP, CLP1, and RBPP6, selected from the heat map and corresponding to proteins found in both modules of the PPI network ( Figure 4B-D). As expected, all genes corresponding to interacting proteins were modulated in trophozoites grown in stress condition (serum-deprivation and heat shock) and in highly virulent parasites in comparison with controls. Interestingly, PAP was the most up-regulated gene in response to stress conditions, while CPSF5 (CFIm25), CPSF1 (CPSF160), and CLIP1 were reduced. The modulation of the RBPP6 gene depended on the stress condition. In virulent parasites, RBPP6 was the most overexpressed gene, followed by PAP, while the expression of other genes was reduced.

E. histolytica polyadenylation network: central hub and bottleneck validation
As described above, PAP is a central hub in the PPI network of our working model E. histolytica. Interestingly, the node of the polyadenylation module with the highest betweenness is the polyadenylation factor EhCFIm25 (EHI-077110) that is known to be essential for parasite survival [11,12]; this bottleneck protein establishes interactions with nine proteins of the polyadenylation module and three proteins of the second module. Notably, EhCFIm25 The EhRNAPII gene mRNA expression was used as normalization control for all qRT-PCR assays. Data corresponding to stress (serum starvation and heat shock) and virulent conditions were expressed as mean + − SD (n=3) and compared with the control condition using the paired Student's t-test; *P<0.05, **P<0.01, and ***P<0.001. interacts with Ehnopp34 (EHI 06680) that is the bottleneck protein in the second module (being connected to 28 nodes on this module and three of the polyadenylation module) and EhPAP (EHI 012040), the central hub protein in the interactome ( Figure 5A-C). In contrast, the bottleneck protein that connects the polyadenylation module to the second module through its interactions with many proteins is PP1CC in the human PPI network. Notably, PP1CC interacts with two important proteins, BRCA1 and TP53, that act as a tumor suppressor proteins and interestingly, BRCA1 is a bottleneck protein in the second module ( Figure 2B,C). The fact that human and parasite polyadenylation factors establish relationships with distinct proteins could be useful for target identification in E. histolytica.
Because of their high betweenness centrality in PPI networks, bottleneck proteins regulate the flow of signaling information and therefore, represent central points for communication in an interaction network [51]. To confirm the relevance of EhCFIm25 as a bottleneck protein, we evaluated the effect of EhCFIm25 silencing on mRNA expression of relevant proteins identified in E. histolytica PPI network, particularly the hub protein EhPAP (EHI 012040), the bottleneck protein Ehnopp34 (EHI 068680), as well as EhCPSF1 (EHI 106110) and EhCLP1 (EHI-008100) that are connected to EhCFIm25 and EhPAP in the polyadenylation module, and EhNSA2 (EHI 099760), EhsnRNPF (EHI 060400) and EhMyb (EHI 000550) that are connected to Ehnopp34 in the other module ( Figure 5D). Interestingly, the expression of these relevant was affected in trophozoites lacking EhCFIm25 in comparison with untreated cells, which confirms the importance of the bottleneck protein EhCFIm25 in regulating the flow of expression of these connected genes. Notably, EhCPSF1, EhCLIP1, Ehnopp34, and EhNSA2 were down-regulated, while EhPAP, EhsnRNPF, and EhMyb presented an increased expression. None of the genes were modulated in the gfp-dsRNA condition used as an additional control, confirming that the effects are due to EhCFIm25 silencing by specific dsRNA as previously reported [11].

EhCFIm25-Ehnopp43 interaction in E. histolytica: docking and model validation
To assess the reliability of predicted PPI in E. histolytica, we investigated the interaction of bottleneck proteins in each module, EhCFIm25 and Ehnopp4. First, models with full sequence amino acids of each protein were generated. The RAPTOR-X server generated a full-length amino acid model of EhCFIm25 and a truncate EhNopp34 model (residues 19 to 151), while the SWISS-MODEL generated truncate models for EhCFIm25 (residues 29 to 236) and EhNopp34 (residues 1 to 108). The EhCFIm25 model generated by RAPTOR-X was selected for further analysis. To generate a complete sequence model of EhNopp34, the models obtained in RAPTOR-X and SWISS-MODEL were superposed (the RMSD of the Cα from residues 19 to 96 between both models was of 1.17Å), and the atomic coordinates of amino acids 1 to 19 of the model obtained in SWISS-MODEL, were merged with the coordinates of the model obtained in RAPTOR-X using the coot software [52]. The refined models of EhCFIm25 and EhNopp34 showed 91% and 85% of residues in the most favored regions in the Ramachandran plot, 8.5% and 13.6% in additional allowed regions, and 0.4% and 0.7% in generously allowed regions, respectively, suggesting that the models obtained had suitable stereochemistry quality for MD and docking analyses. The EhCFIm25 model possesses the characteristic α/β/α Nudix fold of this protein family ( Figure 6A). A superposition of Cα of the Nudix domain and Nudix box of EhCFIm25 (residues 92 to 230 and 137 to 161, respectively) and HsCFIm25 (residues 69 to 202 and 108 to 130, respectively) (PDB: 3Q2T) showed a RMSD of only 1.9Å for the Nudix domain and 1.4Å for the Nudix box, suggesting essentially the same folding ( Figure 6B). We did not find any report about nopp34 crystallographic structure, but we observed that the predicted tridimensional structure of Ehnopp34 has the same domain of ∼90 amino acids folded into folded into β-1-α1-β2-β3-α2 fold than the RRM domain of HsCFIm68 present in the 3Q2T crystallographic structure of the human CFIm68/CFIm25/RNA complex. This folding similarity was confirmed when Cα of EhNopp34 RRM domain were superimposed with RRM domain of HsCFIm68 from the 3Q2T crystal complex (RMSD: 6Å) or with other proteins containing RRM domains, such as the human RBM7 protein (RMSD: 1.9Å; PDB: 5IQQ), the UPB1 of Trypanosoma cruzi (RMSD: 5.6Å; PDB: 1U6F) or the RNA15 RRM domain of Saccharomyces cerevisiae (RMSD: 5.1Å; PDB: 2X1B) ( Figure 6B).
In the next step, we performed MD simulation of complete proteins and trajectories analysis of MD showed that the RMSD for Cα was around 13Å for EhCFIm25, and 16.5Å for EhNopp34, from 30 and 25 ns of simulation, respectively, that is the time in which each protein structure reached the convergence of the simulation. Interestingly, the analysis of the data extracted from trajectories corresponding to residues of the RRM and Nudix domains showed that the RMSD values for the RRM domain remained around 2Å during the 100 ns of the simulation, while the RMSD profile of the Nudix domain appeared to be parallel to the RMSD curve of the full protein, although with lower values (approximately 4Å). The main contributions to RMSD variations can be attributed to the N-terminal segment of EhCFIm25 (residues 1 to 23) and the N-and C-terminal segments of EhNopp34 (residues 1 to 18 and 97 to 155, respectively), judging by the structures extracted at different times of the DM ( Figure 6C). Congruently, the central region of both proteins showed minor RMSF values, suggesting that the N-and C-terminal regions of EhCFIm25 and EhNopp34 have high flexibility in comparation to the core which presents a highly conserved folding ( Figure 6D). Furthermore, the Rg showed a difference of approximately 3.3Å between the full-length convergence structures and the folded core of both proteins, which reinforces the evidence that terminal segments of these proteins have high flexibility ( Figure 6E). Finally, to investigate the possible association between EhCFIm25 and EhNopp34, we used the molecular docking approach. With the LZerD server, a first dimeric molecule was obtained (one EhCFIm25 molecule and one EhNopp34 molecule), representative of a cluster of solutions and with the best score value. The interface between EhCFIm25 and  EhNopp24 structures is formed by 36 amino acid residues of EhCFIm25 and 38 residues of EhNopp34 ( Figure 7A). The binding energy of this dimer, estimated in the Prodigy server [53] was of −10 kcal mol −1 , which is in the range of energies expected for protein-protein complexes. In mammals, CFIm25 has been reported to form homodimers [54] or heterotetramers with larger CFIm subunits [55]. Since EhCFIm25 is the only subunit found in E. histolytica and considering the similarity previously described between EhNopp34 and HsCFIm68 RRM domains, we decided to evaluate whether a heterodimer formed by EhCFIm25 and EhNopp34 could form a heterotetramer. The construction of the tetrameric molecule was performed from a dimeric molecule through a second docking, resulting in a tetrameric molecule in which the EhCFIm25 chains formed an interface, with an estimated binding energy of −6.9 kcal mol −1 , while there were no interactions between EhNopp34 molecules ( Figure 7B). The amino acids that formed the interface between both heterodimers are localized at the C-terminal of EhCFIm25, similar to the interface observed in the reported tetrameric structure of the HsCFIm25-HsCFIm68 complex (PDB 3Q2T) ( Figure 7C) Interestingly, the binding energies of the human complex, estimated with the Prodigy server (−8.2 kcal mol −1 for the HsCFIm25-HsCFIm68 dimer and −13.4 kcal mol −1 for the HsCFIm25 dimer interface), agree with those observed in the Entamoeba complexes. The tetrameric model of EhCFIm25-EhNopp34 resembles the HsCFIm25-HsCFIm68 complex; however, the higher length of the complete model structures induces that the arrangement of each monomer of both EhCFIm25 and EhNopp34 is slightly different in comparison with the HsCFIm25-HsCFIm68 complex. Importantly, the RNA-binding sites of both proteins remain available in the obtained tetramer model ( Figure 7C).

Proteomic analysis of EhCFIm25 interactome
In an attempt to experimentally identify EhCFIm25 interactome, we conducted a proteomic analysis of E. histolytica nuclear proteins that were immunoprecipitated by two different anti-CFIm25 antibodies. The quality of NE and CE was previously assessed by SDS-PAGE since migration profiles displayed clear differences ( Figure 8A). Additionally, in Western blot assays, the nuclear marker EhPC4 was only detected in NE, while the cytosolic marker EhPSP was only found in CE, which indicates the clear separation between NE and CE, and a good enrichment in nuclear proteins. The EhCFIm25 protein was mainly detected in NE, but also in CE as previously described [8]. Moreover, results confirmed the recognition of parasite CFIm25 protein by heterologous antibody, confirming that it can be used in CLIP assays ( Figure 8B).
As described above, two distinct antibodies were used for CLIP assays followed by protein identification by mass spectrometry analysis LC-ESI-HDMSE. Detailed information about protein identification is shown in Table 3. Although both strategies allowed the identification of distinct EhCFIm25 interacting proteins, some features are conserved, such as the presence of (1) proteins of the predicted interactome of polyadenylation factors shown in Figure 3, particularly the WD repeat protein and ribosomal proteins in the second module of the PPI network; (2) proteins assigned to nuclear functions (S1 RNA binding domain-containing protein, elongation factor 1-α, NEDD4-binding protein 2); and (3) proteins related to metabolism pathways that typically take place in the cytoplasm, namely glyceraldehyde-3-phosphate dehydrogenase and pyruvate phosphate dikinase that contribute to energy generation [56], as well as peroxiredoxin and thioredoxin that participate in the thiol-dependent redox metabolism and the ubiquitous oxidoreductase system with antioxidant and redox regulatory roles, respectively [57].

Discussion
It has been demonstrated that altering mRNA polyadenylation represents a valuable strategy for protozoan parasite control [11][12][13][14][15][16][17], but more information about the relationship between parasite factors and the human counterparts is required to better identify specific parasite proteins as drug targets. The present work combining computational methods and experimental assays was centered around two major research questions in relation to biology system studies: (1) do polyadenylation factors conserve the same relationships and protein connections in human and parasites? and (2) do polyadenylation factors and interacting proteins share a correlated expression?, whose answers allowed us to propose potential targets against parasites that still affect human health worldwide at the beginning of the XXI century.
Complementing the description of some polyadenylation factors in several protozoan parasites [6,[58][59][60], our protein sequence similarity searches for polyadenylation factors in parasites suggested that polyadenylation machineries present a large variability among protozoa. Interesting, E. histolytica and A. castellani conserve 15 of the 23 main and associated polyadenylation proteins described in human, while G. lamblia only has six, as we previously reported [6]. Importantly, the conservation of CPSF160, CFSP73, CFSP30, WDR33, CstF50, CstF64, CFIm25, CLP1, PAP, PABP, and RBBP6 in almost all parasites suggest these proteins may have fundamental roles in mRNA 3 end formation. Notably, WDR33 recognizes the polyA signal in human cells [61], CFIm25 defines the cleavage site [62,63], CstF73 performs RNA cleavage [64] and PAP catalyzes the poly tail elongation [65]. The absence of CPSF73 in A. castellani, T. brucei brucei, T. cruzi and C. cayetanensis (and other important proteins in G. lamblia that has the smallest machinery) suggests that other yet unknown proteins may have independently evolved to perform the corresponding functions in parasites. Importantly, the fact that the E. histolytica CFIm25 is evolutionary related to the human CstF50 suggests that it may be able to functionally replace the missing CstF50 in the parasite, making it a very relevant protein in the polyadenylation process. This assumption agrees with our previous reports showing that the absence of EhCFIm25 produces parasite death [11,12].
Besides amino acids sequence similarity, the comparison of amino acids number is another way to characterize the conservation of homologous proteins [66]. Interestingly, there are some parasite proteins that do not follow the general rule of protein length increase throughout evolution. For example, the size of CPSF160, CstF77 and PAP is increased in several parasites, in comparison to their human homologues, which adds another piece to the evolution of polyadenylation factors. Phylogenetic analyses also provide valuable data to understand the behavior of a protein, in comparison with other biological models studied, such the human polyadenylation machinery. Our analyses generally support the evolutionary relationships among parasite and human polyadenylation factors. Despite their sequence similarity, some proteins, do no share a common origin with human homologues, which suggests that their formation has been most likely mediated by different evolutionary events. It is possible that these differences could affect protein function and PPI in each system. Congruently, these differences have an impact on the organization and topology of predicted PPI networks although their sequence similarity suggests that homologous proteins conserve the same functions in human and parasites. The grade of connectivity for each node in the networks and functional annotations have dissimilarity between species. These findings denoted extensive species-specific network renewing that can likely be associated to specific protein evolution in each system.
To date, PPI networks related to mRNA polyadenylation have been poorly described in protozoan parasites. The architecture of predicted PPI networks confirms the link between the polyadenylation process and other events related to gene expression regulation in both human and parasites. These relationships between proteins that participate in different steps of gene mRNA processing and export have been experimentally demonstrated. For example, distinct groups have showed the interaction of members of the human CFIm complex with the polyadenylation factors PAP, PABP, and FIP1 [67][68][69], hClp1, U2AF, and snRNP U1 involved in splicing [63,70,71,72], the cap binding protein CBP20 [55], as well as Thoc5, Mex67-Mtr2, and Tap-p15 that participate in mRNA nuclear export [73]. In E. histolytica, we reported the interaction between CFIm25 and PAP [8]. Considering that grouping proteins with functional annotations contribute to a better identification of central functions and therefore lethal proteins [74], our results clearly evidenced that proteins involved in different steps of gene expression are prospective parasite targets.
Interestingly, the significant differences in PPI networks corresponding to parasites and the human host can be exploited to specifically target a pathogen protein. Network 'hubs' are highly connected proteins with many PPIs (high node degree); they are therefore likely to exert a higher influence on network function via multiple interactions [51]. The polyadenylation factor CPSF160 presents the highest degree of connectivity in the human PPI network, while central hubs are represented by PAP, WDR33, CPSF30 in E. histolytica, T. vaginalis, and T. cruzi, respectively. Additionally, central hubs do not correspond to polyadenylation factors in the other parasites. Computational predictions of PPI networks have allowed the identification of key proteins for cell survival [74] and potential drug targets against parasites [75]. Considering that proteins with a high degree of connectivity should be the most essential proteins [76,77], our results suggest that targeting PAP, WDR33, and CPSF30 could have an effect on parasite survival. Interestingly, some of these proteins have already been described as biochemical target. Notably, Hericks et al. reported that CPSF30 depletion affect polyadenylation and survival in T. brucei [13]. Moreover, CFIm25, CPSF30 (CPSF4), and CPSF73 (CPSF3) have been identified as potential therapeutic targets in E. histolytica, T. brucei, T. gondii, and P. falciparum [11][12][13][14][15][16]. Further experimental assays are required to validate the biological importance of central hubs proteins predicted here. Finally, it has also been described that interacting proteins have a correlated expression [51]. Congruently, we demonstrated the modulated expression of parasite polyadenylation proteins in response to diverse growth conditions, which suggests their functional interaction in agreement with PPI networks and strengthens the impact of targeting polyadenylation factors on parasite biology and survival.
Network 'bottlenecks' are proteins with high betweenness centrality; they regulate the flow of signaling information and therefore, represent central points for communication in an interaction network [51]. In the human PPI network, the connection between the bottleneck proteins PP1CC and BRCA1, agrees with a previous report showing that the tumor suppressor BRCA1 protein interacts with the polyadenylation protein CstF50 to promote deadenylation activity of the poly(A)-specific ribonuclease PARN during DNA damage, leading to RNA degradation [78]. In E. histolytica taken as a working model, the identification of different bottleneck proteins and connections between both PPI modules is a relevant finding that could help to detect parasite targets. Indeed, the silencing of the bottleneck protein CFIm25 largely affected gene expression in genes of both PPI modules, including Ehnopp34, the bottleneck protein of the second module. Additionally, changes in the expression of the central hub protein PAP and interacting polyadenylation factors EhCFIm25, EhCPSF1, EhCLP1, and EhRBPP6, confirms the relevance of polyadenylation genes for parasite survival, growth, and stress adaptation. Congruently with the relevance of these proteins with high betweenness centrality in the PPI network, we previously demonstrated that EhCFIm25 silencing inhibits parasite survival and virulence. Notably, we hypothesized that the greater effect of EhCFIm25 capture by specific RNA aptamers versus EhCFIm25 gene silencing on E. histolytica proliferation and death was related to the concomitant capture of EhCFIm25 interacting proteins [11,12]. With the present results, we corroborate that the removal or capture of the bottleneck node CFIm25 broke down communication links between both modules of the PPI networks, affecting the expression of relevant connected proteins, which resulted to be fatal for the pathogen.
By correlating the consequences of the absence of individual yeast proteins with the number of their PPI, it has been shown that highly connected proteins with a central role in the network organization and topology, i.e. hubs and bottleneck proteins, are potentially three times more essential than proteins with a reduced number of PPI [76]. Using an integrative network approach, Alam et al. recently identified 86 hub proteins in tuberculosis and noncommunicable diseases and demonstrated the relation between drugs and these apparently unrelated targets and pathways, supporting the assumption that the most highly connected proteins in the cell are the most important for its survival [79]. The positive correlation between lethality and connectivity in PPI network therefore represents an interesting strategy to identify pathogen proteins that could represent new therapeutic targets for parasite controls. The identification of EhPAP as a hub protein suggests that its removal would strongly affect E. histolytica survival, as we observed for EhCFIm25. Further experimental assays are required to confirm the relevance of EhPAP as a new therapeutic target.
The reliability of the E. histolytica PPI was assessed by two complementary approaches: a molecular docking study of the EhCFIm25-Ehnopp34 interaction and a CLIP assay followed coupled to a proteomic analysis. The 3D models of EhCFIm25 and Ehnopp34 have been recently published in the AlphaFold platform [80,81]. Interestingly, the Nudix and RRM domains of EhCFIm25 and Ehnopp34 fold in a very similar way in our relaxed EhCFIm25 and Ehnopp34 models with RMSD values of 0.485 and 0.797Å, respectively. In contrast, the most flexible regions, namely the Nand C-terminal of EhCFIm25 and the C-terminal of Ehnopp34 show significant differences with RMSD values up to 2.545Å. However, their prediction is highly uncertain according to the trust report of AlphaFold (<60% in average for these regions). Therefore, we consider that the optimization of EhCFIm5 and Ehnopp43 structures by molecular dynamics simulations gave us better models for docking study. The connection between EhCFIm25 and Ehnopp34 was supported by two complementary computational analysis, the PPI network prediction, and the docking simulation. However, it is necessary to perform specific experiments to confirm this interaction. Nopp proteins are nucleolar phosphoproteins that shuttle between nucleus and cytoplasm. Nopp34, also known as NIFK or MKI67IP, interacts with the forkhead-associated domain of human Ki67. It presents an RNA recognition motif (RRM) at the N-terminus and a FHA Ki67 binding domain at the C-terminus, suggesting that it could be participating in mitotic chromosome organization [82]. The Ehnopp34 sequence contains only half the residues compared with that of the Hsnopp34, which has 293 residues; thus, it lacks the C-terminus, which has been shown to bind to FHA domains [52,82]. A sequence comparison between Ehnopp34 and Hsnopp34 showed 52.3% similarity between the RRM domains, and 32.2% identity (data not shown). The lack of the C-terminal domain in Ehnopp34 suggests that it might participate in different functions than Hsnopp34, in which the RRM domain is relevant. This domain has been found in proteins with diverse functions, such as pre-mRNA polyadenylation, for example, CstF-64 [83], splicing [84], mRNA stability [85], RNA editing [86], among other functions. Interestingly, it has been proposed that RRM domain-containing proteins could also participate in the recruitment of other proteins for the formation of RNA processing complexes. As discussed, larger CFIm components have not been identified in the amoeba machinery compared with the heterotetrameric complex described for humans, suggesting that other proteins could be supplying the function of the missing subunits. Study of the human CFIm25-CFIm68 complex showed that RNA association to the CFIm25 subunit is enhanced by the presence of the CFIm68 subunit, possibly facilitating RNA looping through the RRM domain so that the nucleotide sequences that are recognized by the CFIm25 subunit can be positioned in the binding pocket [55]. The role of RRM domains in this process appears to be fundamental, suggesting that proteins with RRM domains that have the ability to associate with the CFIm25 subunit could favor RNA binding to the binding pocket present in the Nudix domain of CFIm25. Our molecular docking results suggest that Ehnopp34 could bind to EhCFIm25 in a similar way to that reported for the human CFIm68 subunit, suggesting that this protein could facilitate the association of RNA to the EhCFIm25 monomers of the complex in E. histolytica.
The proteomic analysis of nuclear proteins interacting with EhCFIm25 led to the identification of the proteins related to the same pathways previously described in the E. histolytica PPI network. Particularly, IP and bioinformatics approaches identified proteins related to (1) ribosomes biogenesis (IP: 60S ribosomal protein L1, ribosomal protein S24; PPI network: 60S ribosomal protein, ribosome assembly factor mrt4, ribosome production factor 2 homolog, ribosome biogenesis protein NSA2 homolog); (2) WD domain contaning proteins; and (3) Transcription (IP: NEDD4-binding protein 2, elongation factor 1-α; PPI:tTranscription initiation factor SPT5, RNA-binding protein, transcription initiation protein SPT4). Therefore, the proteomic analysis validates our computational approach and confirms the relationship among the different events of gene expression regulation in E. histolytica. Additionally, the interaction of EhCFIm25 with Cys-peroxiredoxin and 60S ribosomal protein agrees with a previous report in P. falciparum that evidenced interactions between Cys-peroxiredoxins and proteins involved in translation [87]. Both antibodies used in the CLIP assays are polyclonal antibodies; therefore, it is possible that they bind to several and distinct epitopes in EhCFIm25, affecting its interaction with other proteins, which may explain the immunoprecipitation of different EhCFIm25-interacting proteins in both experiments; this is also an opportunity to expand the interactome.
The fact that EhCFIm25 was not detected in the CLIP assays is likely to be related to the protein identification protocol. Mass spectrometry works by measuring the relative intensity of the proteins; in such analyses, intensity peak ratios are treated as abundance ratios of the respective molecule. However, the technique has its limitations when studying protein complexes. Since the signal intensities of analytes of similar size can significantly distort the ratio of peaks due to the unequal signal response of the different analytes, this can make it difficult to analyze when trying to identify large molecular complex [88]. Therefore, because of the number of proteins bound to the complex, the EhCFIm25 signal was distorted, and the MS immediately discarded its identification. The absence of EhCFIm25 and other expected nuclear proteins, such as EhPAP that has been demonstrated to interact with EhCFIn25 [8] can also be related to the low abundance of true nuclear protein and their under-representation in proteomic studies by high-throughput techniques, since as we said above, the most abundant proteins determine the limit of detection for the less abundant ones. Thus, we cannot consider data of our CLIP assays as a complete illustration of EhCFIm25 interactome. It is possible that other experimental strategies, such as yeast two-hybrids system, aptoprecipitation with anti-EhCFIm25 aptamers, or in vivo UV-CLIP in EhCFIm25 overexpressing trophozoites, would provide a better understanding of the nuclear proteins interacting with EhCFIm25 in E. histolytica.
Surprisingly, the proteomic analysis of EhCFIm25 interactome also identified proteins that are typically considered as cytoplasmic. A similar observation has been previously reported for other nuclear proteomes [89,90]. The high abundance of these supposedly cytoplasmic proteins in the nuclear fraction suggest they might have moonlight functions. Congruently, several of the proteins related to energy metabolism, cell redox homeostasis and response to protein folding stress, have been described as moonlight proteins with multiple functions in distinct cellular compartments in E. histolytica. Thus, thioredoxin regulates the activity of the UDP-glucose pyrophosphorylase through post-translational modification [91]. Thioredoxin was detected in both the cytoplasm and the nucleus of trophozoites, where it plays distinct roles in a two-step mechanism of redox regulation of transcription factor NF-κB [92]. Peroxiredoxin was localized in the nucleus and cytoplasm of E. moshkovskii, trophozoites [93], while it was found in the nucleus and the membrane in E. histolytica [94]. GAPDH participates in telomere maintenance in the nucleus of human lung carcinoma cells [95]; under stresses, the translocation of glyceraldehyde-3-phosphate dehydrogenase from cytoplasm to the nucleus is regulated by acetylation [96]. In E. histolytica, GAPDH is modified by ADP-ribosylation and secreted to the extracellular medium where it may play an important role in ameba survival or in interaction with host cells or molecules [97]. This first attempt to identify EhCFIm25 interactome confirms the existence of close relationships among proteins involved in gene expression regulation and evidenced new links between gene expression and metabolism though connections with moonlight proteins in E. histolytica. Interestingly, enzymes that participate in biochemical pathways generating energy, reducing power and biosynthetic intermediates necessary for cell survival, have also been involved in 'moonlighting' as RNA-binding proteins in mammalians cells. Some authors have proposed the REM (RNA-enzyme and metabolite) network hypothesis in which RNA is considered as a key molecule that assembles all the enzymes of a particular metabolic pathway to promote the metabolic activity. It has also been suggested that these moonlight proteins could connect intermediary metabolism with RNA biology and posttranscriptional gene regulation [98,99]. In this work, parasite cultures were UV cross-linked prior to obtain nuclear proteins and perform immunoprecipitation assays with anti-EhCFIm25 antibodies, to ensure precipitation of RNA-binding proteins, including EhCFIm25 and proteins interacting to EhCFIm25. Immunoprecipitates also contain RNA molecules, then proteins identified by LC-ESI-HDMSE analyses can include other RNA-binding proteins in close proximity with EhCFIm25. We hypothesize that some of these proteins may correspond to moonlight proteins described above. However, additional experiments are required to confirm these assumptions in E. histolytica.

Conclusion
Taken altogether, our results established the high potential of computational approaches based on comparative genomics and interactomics to predict PPI networks and identify key proteins for network function, which may open a new horizon for the characterization of new therapeutic targets in protozoan parasites that affect human health, such as EhCFIm25 in E. histolytica. Additionally, they confirmed the existence of important relationships between mRNA polyadenylation and other molecular events involved in gene expression regulation in E. histolytica. Finally, the possible role of RNA as a scaffolding molecule that connects RNA biology and posttranscriptional gene regulation with intermediary metabolism requires a specific attention to better understand E. histolytica biology.

Data Availability
The mass spectrometry proteomic data have been deposited to the ProteomeXchange Consortium via the PRIDE [50] partner repository with the dataset identifier PXD033620.