Comparative genomics analysis of statistically significant genomic islands of Helicobacter pylori strains for better understanding the disease prognosis

Abstract Bacterial virulence factors are often located in their genomic islands (GIs). Helicobacter pylori, a highly diverse organism is reported to be associated with several gastrointestinal diseases like, gastritis, gastric cancer (GC), peptic ulcer, duodenal ulcer (DU) etc. A novel similarity score (Sm)-based comparative analysis with GIs of 50 H. pylori strains revealed clear idea of the various factors which promote disease progression. Two putative pathogenic GIs in some of the H. pylori strains were identified. One GI, having a putative labile enterotoxin and other dynamin-like proteins (DLPs), is predicted to increase the release of toxin by membrane vesicular formation. Another island contains a virulence-associated protein D (vapD) which is a component of a type-II toxin–antitoxin system (TAs), leads to enhance the severity of the H. pylori infection. Besides the well-known virulence factors like Cytotoxin-associated gene A (CagA) and vacA, several GIs have been identified which showed to have direct or indirect impact on H. pylori clinical outcomes. One such GI, containing lipopolysaccharide (LPS) biosynthesis genes was revealed to be directly connected with disease development by inhibiting the immune response. Another collagenase-containing GI worsens ulcers by slowing down the healing process. GI consisted of fliD operon was found to be connected to flagellar assembly and biofilm production. By residing in biofilms, bacteria can avoid antibiotic therapy, resulting in chronic infection. Along with well-studied CagA and vacuolating toxin A (vacA) virulent genes, it is equally important to study these identified virulence factors for better understanding H. pylori-induced disease prognosis.


Introduction
Helicobacter pylori, formerly known as Campylobacter pylori, is a Gram-negative, microaerophilic bacteria that is part of the gastric microbiota in over 50% of population worldwide.
The International Agency for Research on Cancer (IARC) classified it as a Class I carcinogen in 1994, nearly 20 years after its discovery, stating that people having H. pylori infection have six times higher risk of developing gastric cancer (GC) than those who do not [1].
Colonization of H. pylori in stomach is commonly acquired in early age, mostly by person-toperson transmission through fluidic or aerosolic transfer mediated by the faeco-oral route [2,3]. H. pylori infection leads a variety of upper gastrointestinal disorders, such as chronic gastritis, peptic ulcer disease (PUD), gastric mucosa-associated lymphoid tissue (MALT) lymphoma, and GC [4][5][6]. Colonization persists throughout life unless treated with antibiotics. Unfortunately, the spike in antibiotic resistance is clearly affecting the response to the therapy and there is currently no preventive vaccine measure. Different strains of the H. pylori were found to have different level of virulence, while some of the strains were found to be non-virulent. So, to understand the disease prognosis in a better way and to develop new therapeutics to fight against this bacterium, it is important to study H. pylori in strain level.
Strain specific virulence factors are generally located in the Genomic Island (GI) region of particular pathogenic strains. Large genomic regions consisting of multiple horizontally transferred genes (HTGs) are collectively termed as genomic islands (GIs) [7]. These are crucial for bacterial evolution, adaptability, and diversification as they contain genes that code for a wide range of proteins [8]. In 1990 Hacker et al. identified virulent genes in uropathogenic strains of Escherichia coli. This identified virulence factor was not present in beneficial strains of E. coli. This cluster of genes was referred to as pathogenicity islands (PAI). It has been proposed that possible vaccine candidates are located within PAIs [9].
Several other reports suggested apart from carrying virulent factors, GIs can have clusters of genes related to special biological functions such as, genes required for adaptation in a special environment, genes encoding proteins linked to metabolic processes, antimicrobial resistance genes. Based on the function of the genes present on GIs, they can be differentiated as metabolic islands (MIs), resistance islands (REIs) etc [10,11].
Despite the fact that H. pylori infection is a leading cause of GC, the majority of people who carry the bacteria in their stomachs never get diagnosed with GC. It was also found that some of the normal individuals showed to have higher abundance of H. pylori. They neither showed any symptoms nor had any discomfort. H. pylori showed to be the part of normal pylori is a highly diverse organism. Studying H. pylori at strain level may shed light to understand the disease prognosis. In the present study the GIs were thoroughly studied in fifty H. pylori strains in order to find strain-specific novel virulent factors and other Islands.
The comparative study with the GIs among fifty H. pylori strains identified GIs with virulence-associated genes in addition to the known GIs. Further characterization of the newly detected islands identified the possible function of these GIs. Several other factors were also identified which were found to be associated either directly or indirectly with the H. pylori induced disease prognosis.

Comparative analysis of GIs in Helicobacter pylori genomes
Fifty completely sequenced strains of H. pylori were downloaded from ftp://ftp.ncbi.nlm.nih.gov/. All of these strains have common hosts i.e. human. Geographical distribution and their disease association were listed in Supplementary Table 1

and 2.
Design-Island-II (https://www.isical.ac.in/~rchatterjee/Design-Island.htm), a de novo algorithm for GI predication was run on 50 H. pylori strains to predict the statistically significant GIs. Islands that were ≥10kb in length were annotated as GIs. All HTGs within these GIs were identified from the gene annotation data using an in-house perl script. For comparison of the GIs among different strains, we used cd-hit suite (http://weizhonglab.ucsd.edu/cdhit-web-server/) to assign unique Cluster numbers (CLS) to the HTGs.
Sequences with 90% identity were given identical cluster (CLS) numbers.

Similarity Score (Sm) based visualization tool
To identify the similarity of a particular GI among different H pylori strains, we introduced a similarity score (Sm): where, A and B correspond to number of genes within the given GIs as identified by Design-Island-II. Similarity score (Sm) calculates the number of commons genes present in the two GIs, divided by the number of total genes present in the two GIs. We calculated Sm score for all the identified GIs in fifty H. pylori strains. Sm value-based heat map were generated for all GIs to get an idea about the similarities or dissimilarities among the fifty H. pylori strains. Hierarchical clustering on the Sm score and heat maps were generated for all GIs using R packages (gplots and heatmap.2).

Core and Accessory Island
A cut-off value of Sm ≥ 50% was used to annotate a GI as shared between two strains. GIs present in all 50 strains, were classified as core GIs. These were named as Core_1, Core_2, etc. Accessory GIs were defined as GIs found in 2 to 49 H. pylori strains and termed as Acc_1, Acc_2, Acc_3 and so on. The presence and absence of GIs across all the strains were denoted as 1 and 0. Kendall's Tau correlation coefficient was calculated among all 50 H. pylori strains considering the presence or absence of accessory islands.

Annotation of hypothetical proteins
A large number of genes within many accessory islands were annotated as hypothetical proteins. Functional characterization of these unknown proteins was performed using a conserved domain search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and psi blast (NCBI) against PDB database with default parameters until convergence. The query sequences were searched against CDD v3.19 -58235 PSSMs database, using 0.010 as expect value. To understand the conservation of the homolog in the organisms and their evolutionary relationships, multiple sequence alignment and phylogenetic analysis were carried out using Molecular Evolutionary Genetic Analysis (MEGA) [12].

Protein-protein interaction and Homology modeling
The probable biological role of the uncharacterized proteins is predicted by studying the protein-protein interaction using STRINGv10 [13]. Protein three-dimensional structures were generated using an automatic fold recognition server Phyre2 [14]. The PDB IDs of the templates used for homology modeling of putative enterotoxin, vapD and DUF3240 family protein were 4AUR (Confidence: 99.9%; Coverage: 95%), 3UI3 (Confidence: 100.0%; Coverage:96%) and 3CE8 (Confidence: 93.5%; Coverage: 75%) respectively. Clashes, rotamers and Ramachandran analysis were done by Molprobity [15]. Fpocket2 and CSA programs were used to detect pocket and catalytic site of the hypothetical protein respectively [16,17].Visualization and modification of the 3D model were carried out in PyMOL [18].

Docking
Protein-peptide interactions are essential in various biological processes involving signaling, cellular localization, etc. Twenty-six probable complexes were generated using a proteinprotein docking server ClusPro [19]. The complex that had the highest binding energy and lowest free energy was considered. In a protein complex some of the residues take part in the interaction. BIPSPI was used to predict partner-specific binding sites in the protein-protein complex [20]
These methods were applied on fifty H. pylori strains to identify the GIs and their locations in the genome. The length of the predicted GIs were compared with that of the GIs predicted by Design-Island-II. If at least 50% of the length of a GI, predicted by any other tool was overlapped with our predicted GIs, it was determined as detected by both methods.

Identification of GIs across fifty H. pylori strains
Design-Island-II, a modified version of Design-Island [26] was used to detect statistically

Comparative analysis of GIs in Helicobacter pylori
Forty-one GIs identified in all 50 strains were subjected to comparative analysis among all H. pylori strains. Among these, seven GIs were present in all 50 strains and were termed as core GI. Three core GIs (Core_1-Core_3) were found to have Sm ≥ 80%. The rest four core GIs (Core_4 -7) had Sm ≤ 80%. Thirty four accessory GIs (namely, Acc_1 -Acc_34) were present in 2 to 49 H. pylori strains with (Sm ≤ 80%) ( Figure 1C). The plausible functions of all the forty one GIs were predicted (Supplementary Table 3). Two of the seven core GIs were mainly enriched with ribosomal proteins. In terms of composition, many highly expressed genes, like ribosomal proteins, deviate from the genomic background [27,28].
Thus, they were identified as GIs despite the fact that they were part of the host genome.
Various studies suggested eliminating ribosomal proteins containing GIs, as these are false positive in terms of identification [29,30].

Antibiotic Resistance Island
Core_3 was found to be present in all H. pylori strains with an average S m =83.1%. The heatmap gave a lucid pictorial illustration of the similarity of Core_3 GI among strains (Supplementary Figure 1A). The length of Core_3 was ~17Kb and 14 HTGs were present on it. This gene cluster was the entire nuo operon (NADH: ubiquinone oxidoreductase. This operon contains essential genes with unique genomic properties, such as different oligonucleotide frequencies and slightly greater GC% than tissue-specific genes differentiate this region from rest of the genome [31,32]. Thus, this nuo operon was identified as GI.

Niche Adaptation Island (Urease)
The urease gene cluster (Core_5), which was adopted horizontally, became one of the most important genetic elements for the survival of helicobacter strains. The GI was ~17.5Kb in length and consisted of 16 HTGs. Genes encoding H. pylori urease are located as a single 6.13kb gene cluster that consists of seven contiguous genes ureA, ureB, ureI, ureE, ureF, ureG and ureH, are necessary for the synthesis of an active enzyme [33,34]. The average Sm was found to be 58%. The heatmap with the Sm of Urease Island showed three distinct clusters (

PAI containing Collagenase
Core_6, a 13Kb long GI, was identified with genes that promote the virulence of pylori infection. This GI was present in all strains, whether intact, fragmented, or missing in part.
However, a U32 family peptidase was present in all strains. Other important HTGs of this GI were a DNA-binding response regulator (racR), peptide chain release factor RF2 (prfB), flagellar biosynthetic protein FliR (fliR), fructose-bisphosphate aldolase (fbaB), etc. The average Sm, calculated on the basis of gene content was found to be 64.5%. Thirty-six strains showed >80% similarity and formed the main cluster in the hierarchical clustering (Supplementary Figure 3A). U32 family peptidase present in all the strains is actually a collagenase. Gastric epithelium extracellular matrix is mostly composed of collagen type I and III. Type I collagen synthesis is increased in the areas surrounding gastric ulcers. Thus, ulcer healing process is promoted [35]. However, the release of collagenase from H. pylori Core_7 was found to contain fliD operon along with some other HTGs, like, DNA-damage induced multidrug efflux protein (dinF), N-carbamoylputrescine amidase (cpa), 2', 3'-cyclicnucleotide 2'-phosphodiesterase (ymdA) etc. This GI was ~12Kb long and was present in all strains. The fliD operon not only promotes H. pylori colonization in the human stomach by forming flagella, but it also facilitates biofilm formation [36]. The bacteria can evade antibiotic therapy by living in a biofilms, leading to chronic infection [37]. This operon was intact across all the fifty H. pylori strains; with 75% of average Sm score (Supplementary Figure 4A). The difference in adjacent gene cluster led to the formation of two major groups (Supplementary Figure 4B). Core_7 is involved in H. pylori colonization and biofilm formation, which contributes to the disease's chronicity. As fliD is present in all strains and crucial for H. pylori survival, it can be a potential target for new drug or vaccine development.

Lipopolysaccharide (LPS) Biosynthesis Island
Acc_17 GI was ~13Kb in length and contained ten HTGs present on it, like, lipid-Adisaccharide synthase (lpxB), hof-family outer membrane protein (hofC, hofD), CDPdiacylglycerol diphosphatase (cdh) etc. This GI was found in 41 strains, with an average Sm of 55.3%. The hierarchical clustering in the heatmap divided the strains into two major clusters ( Figure 3A). Synteny of this GI also represented that the gene cluster was not intact in all strains, and it was fragmented or some genes were missing ( Figure 3B). This GI was found to be associated with lipopolysaccharide (LPS) synthesis. Lipid A-core and the O antigen polysaccharide make up LPS which serves as a key component on the surface of H.
pylori. Lewis antigens are present in the O polysaccharides which imitate the glycan structures produced by human cells. Lewis antigens interact with human dendritic cells, generating an immune response that contributes to H. pylori pathogenicity [38]. As LPS is involved in producing H. pylori induced virulence, this GI may indirectly influence the pathogenesis of the strain.

Cytotoxin associated gene A (CagA) PAI
Cag is an exhaustively studied pathogenicity island in H. pylori. Comparative analysis of the cagPAI (Acc_25) identified ~80% of the strains have 70-100% of resemblance among them (Supplementary Figure 5A). It was ~40kb in length and consisted of thirty two genes. Three strains (H. pylori 2017, 2018 and 908) showed ~90% similarity among themselves however differed from the other strains. These three strains, reportedly associated with Duodenal Ulcer (DU), belong to the same geographical location i.e. Africa. While five Cag negative strains  Figure 5B). Gene composition analysis of the cagPAI suggested additive or reductive evolution in some of the H. pylori strains.

Lysine Biosynthesis Island
Another biosynthetic island (Acc_14) was identified in thirty-three H. pylori strains. The GI ( Figure 5C).The homology models of the putative toxin and its binding pocket is presented in Figure 5D & 5E.

PAI containing Toxin-Antitoxin system (TAs)
Design-Island-II identified a PAI (Acc_33), consisted of a Toxin-Antitoxin system (TAs) in 29 of the studied strains. This GI contains a virulence-associated protein D (vapD), Cobaltzinc-cadmium resistance protein, cation system protein, and six proteins with unknown function. These hypothetical proteins were found to be either dynamin or DUF3240 family protein.
VapD is a toxin with purine-specific endoribonuclease activity whereas DUF3240 family protein, with unknown function, binds to its own regulatory region. The toxin and antitoxin were found to be present on the GI of all strains; however, the neighboring genes  Figure 6D-E.

Prediction of GIs with other GI prediction tools
Design-Island-II detected a total of thirty-four accessory GIs in fifty strains of H. pylori. Out of which twenty-three GIs were also identified by at least one of the other five GI prediction tools (Supplementary Table 4). One GI (Acc_25) was predicted by all five methods, while thirteen GIs were predicted by three methods (PredictBias [21], Zisland Explorer [22], and MSGIP [23]), and five GIs were predicted by two, and four GIs were predicted by two GI prediction tools (Supplementary Table 4). Two newly identified putative pathogenic GIs, Acc_31 and Acc_33 were detected by four and three GI prediction tools respectively. Acc_31 (labile enterotoxin containing PAI) was detected in all five strains by these four methods, while Acc_33 (Toxin-Antitoxin system PAI) were detected in twenty-nine strains by the Design-Island-II and MSGIP. PredicBias, on the other hand, identified Acc_31 in twenty-one strains. A synteny analysis revealed that the Toxin-Antitoxin system were present in all strains, but PredictBias was unable to identify this GI, probably because to the lack of surrounding gene content (Supplementary Figure 9). Eleven accessory GIs were identified only by Design-Island-II. Among these, four GIs (Acc_9, Acc_15, Acc_26, and Acc_34) contained ribosomal proteins along with some HTGs of miscellaneous functions. Many highly expressed genes, such as ribosomal proteins, diverge from the genetic background in terms of composition. As a result, despite the fact that they were essential genes, they were identified as GIs. Several studies have proposed that GI-containing ribosomal proteins be eliminated since they are false positives in terms of identification [29,30]. and H + . The HCO 3 maintains the periplasmic pH close to 6.1, assisting H. pylori to avoid acidic pH [46]. The H + ions, responsible for imparting acidity is also taken up by NH 3 in the periplasmic region, another way of acid acclimation [47]. As a result, H. pylori combat the deadly effect of the acidic pH. CagPAI is a well-studied virulence factor in H. pylori [48].

Discussion
The presence of the cagA gene, though not a virulence determinant, makes a strain more virulent by eliciting the production of interleukin (IL)-8 by gastric epithelial cells [49].
CagPAI encodes cagA an onco-protein which gets transferred to the epithelial cells through T4SS. VacA is known to form vacuole, facilitates H. pylori colonize and a pore-forming toxin. The recruitment of inflammatory cells by VacA promotes ulceration. VacA mutant strains were found to produce stomach ulcers less likey than wild-type strains in gerbils [50].
CagA, and VacA gene containing GIs have direct impact on GC prognosis, after successful colonisation of H. pylori in the human stomach mucosa with the help of urease gene cluster containing GI. Thus these three GIs play significant role in the development of H. pyloriinfection mediated illnesses.
After the successful colonization H. pylori damages the insulating mucosal barrier of the stomach and duodenum, allowing acid to penetrate the delicate lining beneath. The acid and germs both damage the lining, resulting in a sore or ulcer. The rapid regeneration of the epithelium layer is not the only mechanism for healing injury; fibrillar collagens of types I and III have been reported to be over expressed [35]. These simultaneous healing processes are found in gastric or duodenal ulcer. Core_6 was found to contain a U32 family protein, which is a collagenase in nature. This protein degrades the collagen found in the extracellular matrix of the stomach mucosa, as well as the collagen involved in ulcer recovery. By delaying the healing process, it thus increases the severity of ulcer. Thus, Core_6 was found to be associated with promoting ulceration.
Acc_20 was found to contain SCOT complex associated with citric acid cycle and acxC, a subunit of acxABC responsible for acetone carboxylation. While H. pylori lacks succinate dehydrogenase, a key enzyme in Kreb's cycle, the SCOT complex acts as an alternative for energy metabolism. On the other hand acxABC promotes acetone utilization. Thus, these two are beneficial for H. pylori survival. Mutation is acxABC led to reduce the colonization significantly in the stomach of mice [51]. So, Acc_20 plays an important role by providing an alternative path for energy production and assisting H. pylori to colonize in gastric mucosa.
LPS biosynthesis genes such hofC, hofD, and lpxB were found to be present as an accessory island. LPS is present in the cellular membrane of all Gram-negative bacteria and plays a crucial role in pathogenesis. LPS has three distinct parts, a core composed of oligosaccharide, lipid A and O antigen. Lipid A is able to trigger fatal immune reaction even in very low concentration, thus known as endotoxin [52]. The core part is found to be conserved in interaction signal down regulates the inflammatory cascade [54]. This suppression of the immune response facilitates a chronic H. pylori infection. Thus this GI is also associated with the disease prognosis indirectly. GC is multi factorial disease which includes genetic and environmental factors. The high salt in-taking diet habit was found to be linked with increase the virulence of H. pylori. The virulence cagA +ve strains were found to be enhanced with heavy salted diet in mice. It elevated the H. pylori colonization in the gastric mucosa and led to greater depletion of parietal cell in mice, thus facilitating the development of GC [41]. This result was observed when similar experiment was done with cagA -ve strains. So, rich salt diet promotes the GC prognosis in H. pylori infected people. Acc_14 was discovered to have a number of genes whose over expression has been linked to a high-salt diet, for example, amiE, lysA. The exact function of these proteins in GC prognosis not yet understood.
The colonization potential of H. pylori is critically influenced by flagellar motility. The fliD mutant strain was completely non-motile, was unable to colonize the stomach mucosa of host mice [55]. The fliD operon is linked to flagellar assembly and biofilm formation was found on Core_7. Many bacteria employ flagella to assist swarming, adhesion, and biofilm formation. In biofilms, H. pylori may be able to withstand environmental challenges such stomach acidity and ROS produced by phagocytic cells. The bacteria can evade antibiotic therapy by living in a biofilms, leading to chronic infection [37]. Flagellar hook-associated protein 2 (fliD) was found to be present across all the H. pylori strains and associated with essential function like flagellar biosynthesis and biofilm formation. It can be considered as a good target for development of a new therapeutic to combat with the H. pylori infection.
Characterization of accessory GI (Acc_31) led to the identification of a putative labile enterotoxin. Phylogenetic analysis suggested that this toxin was horizontally acquired from another helical-shaped, Gram-negative bacterium Campylobacter jejuni, which is the most common cause of food poisoning in Europe and the United States [56]. Further analysis showed that this toxin interacts with ligase tilS, an essential protein for viability. TilS along with dus work in tRNA processing. Apart from this, tilS co-express with ddl which is involved in the peptidoglycan biosynthesis pathway. Thus, putative toxin-tilS interaction indirectly helps in tRNA processing, cell wall biosynthesis and vancomycin resistance.
Generally, enterotoxins showed a marked effect in the gastrointestinal tract by increasing the chloride permeability of the intestinal mucosal cells [57][58][59][60]. potentially through vesicle secretion [61]. The cellular location of this putative toxin is in the periplasmic region. We can vouch the novel toxin along with other dynamin-like proteins present on the GI involve in vesicle formation, membrane fusion, thus lead to increase toxin release.
Toxin-antitoxin genes are often inherited through HGT [62]. Three Fukui strains (H. pylori F30, H. pylori F16 and H. pylori F32) contain type II TAs which belong to Vap family and encoded by bicistronic operon. VapD was present in 60% of the H. pylori genomes [63].
Design-Island-II identified these TAs in 29 strains. The lengths of the GI in 26 strains were <10Kb, due to deletion of neighboring gene content. Therefore, we studied this GI only in the afore-mentioned strains. Docking of vapD and its antitoxin forms a stable complex with minimum free energy. The antitoxin binds to vapD in the post-translational stage and inhibits its expression. The exact biological role of the vapD protein is yet to be established. Several reports suggest VapD in Haemophilus influenzae, acts as a toxin and in Rhodococcus equi helps in survival through acid tolerance [64,65]. In adenocarcinoma gastric cells and gastric biopsies, vapD-antitoxin expression values were higher than cagA and vacA cytotoxin genes.
Another report showed treatment of biopsy samples with chloramphenicol and kanamycin induced the expression of TAs. These suggested that vapD is a virulent factor, protects the bacterium by aiding biofilm formation [66,67]. It helps the bacterium to survive in a hostile

Competing Interests
The author declares that there are no competing interests associated with the manuscript.