Identification of pivotal genes associated with the prognosis of gastric carcinoma through integrated analysis

Abstract Purpose: Detecting and diagnosing gastric cancer (GC) during its early period remains greatly difficult. Our analysis was performed to detect core genes correlated with GC and explore their prognostic values. Methods: Microarray datasets from the Gene Expression Omnibus (GEO) (GSE54129) and The Cancer Genome Atlas (TCGA)-stomach adenocarcinoma (STAD) datasets were applied for common differentially co-expressed genes using differential gene expression analysis and Weighted Gene Co-expression Network Analysis (WGCNA). Functional enrichment analysis and protein–protein interaction (PPI) network analysis of differentially co-expressed genes were performed. We identified hub genes via the CytoHubba plugin. Prognostic values of hub genes were explored. Afterward, Gene Set Enrichment Analysis (GSEA) was used to analyze survival-related hub genes. Finally, the tumor-infiltrating immune cell (TIC) abundance profiles were estimated. Results: Sixty common differentially co-expressed genes were found. Functional enrichment analysis implied that cell–cell junction organization and cell adhesion molecules were primarily enriched. Hub genes were identified using the degree, edge percolated component (EPC), maximal clique centrality (MCC), and maximum neighborhood component (MNC) algorithms, and serpin family E member 1 (SERPINE1) was highly associated with the prognosis of GC patients. Moreover, GSEA demonstrated that extracellular matrix (ECM) receptor interactions and pathways in cancers were correlated with SERPINE1 expression. CIBERSORT analysis of the proportion of TICs suggested that CD8+ T cell and T-cell regulation were negatively associated with SERPINE1 expression, showing that SERPINE1 may inhibit the immune-dominant status of the tumor microenvironment (TME) in GC. Conclusions: Our analysis shows that SERPINE1 is closely correlated with the tumorigenesis and progression of GC. Furthermore, SERPINE1 acts as a candidate therapeutic target and prognostic biomarker of GC.


Introduction
Gastric cancer (GC) is an aggressive solid tumor malignancy with approximately 27600 estimated new cases and 11010 estimated deaths in 2020, which causes a huge socioeconomic burden to patients and their families [1]. Many lifestyle factors have been researched, with behavioral factors such as tobacco use, alcohol consumption and Helicobacter pylori (HP) infection and are regarded as risk factors contributing to the development of GC [2]. Currently, the therapeutic strategies for GC include surgery, radiotherapy, neoadjuvant chemotherapy and immunotherapy, and the survival rate for patients with early GC is nearly 90% [3]. However, it remains very difficult to detect and diagnose this cancer during its early stage, leading to an apparent reduction in survival rates after diagnosis [4]. Therefore, it is extremely necessary to detect candidate diagnostic and prognostic indicators and therapeutic targets for GC patients.
With the rapid progress of genomic technology, gene expression profiles are usually used through bioinformatics methods to explore the underlying molecular mechanisms of tumors and find cancer-specific indicators [5]. As a significant algorithm, Weighted Gene Co-expression Network Analysis (WGCNA) is widely applied to further understand gene co-expression networks and gene functions [6]. Based on WGCNA, modules of highly correlated genes correlated with the traits of samples are found, providing valuable insights into predicting the potential function of co-expressed genes [7]. Moreover, differential gene expression analysis is commonly adopted to analyze transcriptomics datasets, and this is conducive to exploring the underlying biological and molecular mechanisms of tumors and detecting quantitative differences between the gene expression levels of experimental and control cohorts [8].
The two approaches mentioned above are adopted in our analysis for the better capability of discriminating highly related genes. First, we acquired gene expression profiles of GC from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. Second, WGCNA and differential gene expression analyses were applied to detect common differentially co-expressed genes. Then, we conducted functional enrichment analysis, protein-protein interaction (PPI) analysis and survival analysis for potential biomarkers related to the occurrence, development and prognosis of GC. Subsequently, the relationship of survival-related hub genes with clinical data was explored. Next, we performed gene set enrichment analysis (GSEA) of survival-related hub genes. Finally, the tumor-infiltrating immune cell (TIC) abundance profiles were estimated using the TCGA-stomach adenocarcinoma (STAD) dataset.

Materials and methods
The specific steps of how data were downloaded, hub genes were selected and TIC profiles of GC were estimated are illustrated in Figure 1. Each step is shown clearly in the subsequent subsections.

Datasets download
We downloaded gene expression profiles of GC from the GEO and TCGA databases. GSE54129 was acquired from the GEO database and included 111 GC and 21 normal tissues. The GSE54129 platform was GPL570 [HG-U133 Plus 2] Affymetrix Human Genome U133 Plus 2.0 Array. Based on the annotation file provided by the manufacturer, probes would be transferred to gene symbols, probes without corresponding gene symbols would be removed, and average values would be taken when one gene corresponded to multiple probes. Consequently, we obtained 21655 genes for our next analysis. On the other hand, the gene expression dataset and clinical data of GC were acquired from the TCGA database (https://portal.gdc.cancer.gov/). A total of 407 samples from the TCGA were obtained, consisting of 375 GC and 32 normal tissues, as well as RNA-Seq count data of 19646 genes. Using the Illumina HiSeq 2000 platform, all of the data were generated and annotated to a reference transcript set of the human hg38 gene standard track. Moreover, the edgeR package tutorial revealed that genes with low read counts usually played an insignificant role in the next analysis [9]. Hence, we deleted genes with a count per million (CPM) < 1 from our study. Next, we applied the rpkm function in the edgeR package for further filtering. As a consequence, a total of 15085 genes were acquired for our next analysis.

Identification of key co-expression modules using WGCNA
The gene co-expression networks of GSE54129 and TCGA-STAD datasets were established via the WGCNA package [6]. WGCNA found highly related genes and aggregated these genes into the same co-expression module related to clinical traits. To establish the scale-free networks, soft-powers β = 8 (Figure 2A,B) and 3 ( Figure 3A,B) were applied to the GSE54129 and TCGA-STAD datasets, respectively. Afterward, we created the adjacency matrix based on the following formula: aij = |Sij| β (aij: adjacency matrix between gene i and gene j, Sij: similarity matrix which was made by Pearson's correlation coefficient of all gene pairs, and β: soft-power value), and then this matrix was converted into topological overlap matrix (TOM) and the corresponding dissimilarity (1-TOM). The hierarchical clustering dendrogram of the 1-TOM matrix was built to aggregate genes with similar expressions into the same co-expression module. Subsequently, the module-trait relations between modules and clinical traits were explored for functional modules from the co-expression network. Thus, the module with the largest correlation coefficient was considered the potential module that was highly associated with clinical traits, and this module was used in the next analysis.

Identification of differentially co-expressed genes
The limma package was usually adopted to conduct differential gene expression analyses of microarray and RNA-Seq datasets [10]. To detect differentially expressed genes (DEGs) between GC and non-tumorous tissues, the limma package was used in the differential expression analysis of the GSE54129 and TCGA-STAD datasets. To reduce the false discovery rate (FDR), the P-value was adjusted via the Benjamini-Hochberg (BH) approach. |logFC|>1 and adj. P<0.05 were used as the selection criteria for DEGs. We took the intersection of genes among two lists of DEGs and two lists of co-expressed genes for common genes to improve the capability of discriminating closely related genes, and the genes from this intersection were adopted for candidate biomarkers related to the prognosis of GC.

Functional enrichment analyses
Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses constituted functional enrichment analyses of the differentially co-expressed genes. To explore the biological functions of differentially co-expressed genes, GO and KEGG pathway analyses were conducted through the clusterProfiler [11] package. GO is an essential bioinformatics tool that is usually adopted to annotate genes and analyze their biological processes [12]. KEGG was conductive to understanding high-level functions and biological systems from large-scale molecular datasets [13]. P<0.05 was considered significantly different.

PPI network construction and hub gene identification
The PPI network of these differentially co-expressed genes was established using the Search Tool for the Retrieval of Interacting Genes (STRING) [14]. Cytoscape was adopted to build the visual network of molecular interactions with a combined score >0.15 [15]. The Molecular Complex Detection (MCODE) plugin was applied to detect closely correlated modules from the PPI network [16]. The most significant gene module of this PPI network was visualized and displayed through the MCODE plug-in. The filtering criteria were as follows: MCODE score > 5, node score cutoff = 0.2, degree cutoff = 2, k-score = 2 and max depth = 100. Additionally, the degree, edge percolated component (EPC), maximal clique centrality algorithm (MCC), and maximum neighborhood component (MNC) algorithms were useful methods of selecting hub genes from PPI networks [17]. We calculated the degree, EPC, MCC and MNC scores of all nodes of the PPI network via the CytoHubba plugin. The top 10 nodes with the highest degree, EPC, MCC and MNC scores were selected, and we took the intersection of outcomes of the four algorithms. To increase the reliability of hub genes, their overlapping genes were believed to be hub genes related to GC.

Survival analyses of hub genes
To screen the prognostic values of these hub genes, we conducted univariate Cox regression analysis of the overall survival (OS) via the survival package using the TCGA-STAD dataset. We also conducted disease-free survival (DFS) analysis of hub genes in The Gene Expression Profiling Interactive Analysis (GEPIA) database (http://gepia.cancerpku.cn/) [18]. STAD patients with missing follow-up data were excluded from this analysis, and the remaining patients in the TCGA-STAD dataset were classified into two cohorts according to the median expression values of these hub genes. Log-rank P<0.05 was considered statistically significant.

The correlation network and the relations of hub genes with clinical data
The correlation network of these hub genes was built using the igraph package in R software. The criterion of filtering was a cutoff >0. 10. We also explored the relationship of the hub genes with clinical information (including tumor grade, individual cancer stages and nodal metastasis status) among STAD patients based on the UALCAN database (http://ualcan.path.uab.edu/) [19].

GSEA of survival-related hub genes
GSEA is a computing approach that can recognize if the priorly defined gene set has statistical significance and concordant differences between two biological states [20]. We stratified STAD patients into two groups in light of the median expression values of survival-related hub genes. Afterward, the effects of their expression on many gene sets were explored to detect related KEGG pathways through the molecular signatures database (MSigDB) (c2.cp.kegg.all.v7.1.symbols.gmt) [21]. The permutation of every analysis was one thousand times. The selection criteria were as follows: |Normalized enrichment score (NES)| > 1, nominal (NOM) P-value <0.05 and FDR q-value <0.25.

TIC abundance profile
We adopted the CIBERSORT computational approach to estimate the TIC abundance profiles among all STAD samples, which was followed by quality filtering [22]. Hence, 149 STAD samples with P<0.05 were included for the TIC profile in STAD samples and correlation analysis of 22 kinds of immune cells. We still explored the correlation of TIC proportion with the expression level of serpin family E member 1 (SERPINE1) in STAD.

Selection of pivotal co-expression modules using WGCNA
To detect the functional module in STAD, two gene co-expression networks were established through the WGCNA package based on the GSE54129 and TCGA-STAD datasets. We detected eight modules in the GSE54129 dataset ( Figure 2C) and 14 modules in the TCGA-STAD dataset ( Figure 3C). Afterward, two heatmaps screened the relation between these modules and clinical traits (normal and STAD) in the GSE54129 ( Figure 2D) and TCGA-STAD datasets ( Figure 3D). The black module of GSE54129 and the turquoise module of the TCGA-STAD dataset were positively associated with normal tissues (black module in GSE54129: r = 0.85, P=7e-38, and turquoise module in the TCGA-STAD dataset: r = 0.34, P=1e-12).

Selection of differentially co-expressed genes
The two heatmaps showed the expression levels of 50 up-regulated and 50 down-regulated genes in the GSE54129 ( Figure 4A) and TCGA-STAD datasets ( Figure 4B). The volcano plots displayed that 1793 DEGs in the GSE54129 dataset ( Figure 4C) and 2694 DEGs in the TCGA-STAD dataset ( Figure 4D) were conspicuously dysregulated between STAD and non-tumorous tissues. Figure 4E clearly revealed that the intersection of two lists of DEGs (Supplementary Tables S1 and S2) and two lists of co-expressed genes (Supplementary Tables S3 and S4) contained a list of 60 genes (Supplementary Table S5), which were used for the next analysis.

Functional enrichment analyses
To obtain further insights into the candidate biological functions, we conducted GO and KEGG pathway analyses of the 60 differentially co-expressed genes. According to the results of GO analysis, cell-cell junction organization, positive regulation of morphogenesis of the epithelium and epidermis development were significantly enriched ( Figure  5A). In addition, KEGG pathway analysis of the 60 differentially co-expressed genes demonstrated that complement and coagulation cascades, cell adhesion molecules and cytokine-cytokine receptor interactions were mainly enriched ( Figure 5B).

PPI network construction and hub gene identification
The PPI network of the 60 differentially co-expressed genes with 58 nodes and 227 edges was vividly illustrated ( Figure  6A). Using the MCODE plugin, we found the most significant module of this PPI network, which contained 14 nodes and 52 edges ( Figure 6B). Subsequently, the degree, EPC, MCC and MNC scores of the top 10 nodes were calculated through the CytoHubba plugin, and we took the intersection of results of the four algorithms ( Figure 6C) to improve the reliability of hub genes. A total of eight genes (SERPINE1, OCLN, SHH, SERPINA1, PLAU, MMP3, ELF3 and PLAUR) were considered hub genes (Table 1).

Survival analysis of hub genes
To explore the prognostic values of the eight hub genes in STAD, we conducted univariate Cox regression analysis of OS using the TCGA-STAD dataset. The higher expression of SERPINE1 was observed to be closely associated with shorter OS time among patients with STAD (hazard ratio (HR) = 1.27, 95% confidence interval (CI): 1.139-1.420, P=3.07E-05, Figure 7A). Moreover, the higher expression of SERPINE1 was significantly correlated with worse DFS among STAD patients (HR = 1.70, P=0.0046) ( Figure 7B) using the GEPIA database.

The correlation network and the relationship of hub genes with clinical data
The correlation network of these hub genes is presented in Figure 7C through the igraph package, and the shade of the color represents the magnitude of the correlation coefficients. Furthermore, we observed that the overexpression of SERPINE1 was highly correlated with poorer tumor grades ( Figure 7D), higher individual cancer stages ( Figure  7E), and more serious nodal metastasis status ( Figure 7F) among patients with STAD.

GSEA of survival-related hub genes
According to the outcomes of GSEA, we found that focal adhesion, extracellular matrix (ECM) receptor interaction, leukocyte transendothelial migration, regulation of actin cytoskeleton, MAPK signaling pathway, and pathways in     cancers were highly associated with the expression of SERPINE1 ( Figure 8A-J). In detail, the outcomes of GSEA of SERPINE1 in STAD are represented in Table 2 using the TCGA-STAD dataset.

Correlation of SERPINE1 with the proportion of TICs
To explore the relationship of SERPINE1 expression with the immune microenvironment, we analyzed the proportion of tumor-infiltrating immune subsets based on CIBERSORT. Twenty-two types of TICs in STAD samples are illustrated in Figure 9. The difference analysis showed ten types of TICs associated with SERPINE1 expression ( Figure  10A), while the correlation analyses revealed ten kinds of TICs associated with SERPINE1 expression ( Figure 10B). The intersection of the outcomes of difference and correlation analyses demonstrated that nine kinds of TICs were associated with the expression of SERPINE1 ( Figure 10C). Namely, five types of TICs (memory B cells, CD8 + T cells, follicular helper T cells, regulatory T cells (Tregs) and resting mast cells) were negatively associated with SERPINE1 expression, while four types of TICs (M0 macrophages, activated mast cells, neutrophils and resting NK cells) were

Discussion
As an aggressive solid-tumor malignancy with high incidence, GC has caused huge socioeconomic pressure on GC patients and their families. Currently, progression in the early diagnosis and prediction of the prognosis of GC remains limited. Hence, it is essential to detect tumor-specific indicators to monitor progression and predict the prognosis of GC patients. In our analysis, 60 differentially co-expressed genes were selected through integrated bioinformatics analysis based on the GSE54129 and TCGA-STAD datasets. Functional annotation analysis illustrated that cell-cell junction organization, positive regulation of morphogenesis of an epithelium, and cell adhesion molecules were primarily enriched. The intersection of genes using the degree, EPC, MCC and MNC algorithms were regarded as hub genes that were closely correlated with GC. Afterward, we found that SERPINE1 was highly correlated with the survival of patients with GC, and SERPINE1 expression was correlated with tumor grades, individual stages and N stages among GC patients. GSEA revealed that ECM receptor interaction, leukocyte transendothelial migration and pathways in cancers were associated with SERPINE1 expression. Finally, we found that the expression pattern of SERPINE1 influenced the immune activity of the TME in GC. As a serine protease inhibitor, SERPINE1 can inhibit urokinase-type plasminogen (uPA) and tissue-type plasminogen activator (tPA) [22]. The plasminogen activator inhibitor-1 (PAI-1) encoded by SERPINE1 was an important regulator of the uPA system, which was supposed to inhibit the activation of uPA, but paradoxical studies revealed that SERPINE1 might play a role in promoting the carcinogenesis of many cancers [23]. Many investigations have implied that SERPINE1 is overexpressed in various cancers, including GC [24], head and neck cancer [25] and colorectal cancer [26], and this overexpression correlated with the aggressiveness and invasiveness of tumor cells. Additionally, the down-regulation of SERPINE1 suggested a suppressive effect on the phenotype of glioma tumor cells by activating the p53 signaling pathway and suppressing the adverse invasion and metastasis of nasopharyngeal carcinoma cells in vitro [27][28]. Xu et al. suggested that the overexpression of SERPINE1 was associated with poor prognosis in GC, and SERPINE1 was closely correlated with epithelial-to-mesenchymal transition (EMT) using qRT-PCR methods [29]. Similarly, Yang et al. indicated that SERPINE1 was apparently up-regulated in GC tissues, high expression of SERPINE1 was correlated with worse survival among patients with GC, and low expression of SERPINE1 had an inhibitory effect on the EMT process, while its overexpression had the reverse results [30]. In addition, based on the bioinformatics analysis of 293 GC and 196 normal tissues, Liao et al. found that SERPINE1 was remarkably overexpressed in GC tissues, and the overexpression of SERPINE1 was correlated with worse survival in GC patients [31]. SERPINE1 was reported to be up-regulated in STAD tissues compared with normal tissues, and the overexpression of SERPINE1 was closely associated with worse prognosis among STAD patients [32]. In addition, SERPINE1 was reported to have excellent prognostic value among patients with lower grade glioma [33]. SERPINE1 was observed to be up-regulated in glioma tissues using immunohistochemistry, and high expression of SERPINE1 was associated with reduced DFS and OS in glioma patients [34]. Considering these studies and our findings, we could conclude that SERPINE1 may closely participate in the carcinogenesis and development of GC, and SERPINE1 might act as an important biomarker of diagnosis, monitoring progression and predicting prognosis in GC patients.
Though some studies [29,30,35] have been published previously, our study has some correspondimg advantages compared with the similar studies. (1) Xu B et al. [29] just used the GEO database for the differential gene expression analysis, and another important database (TCGA) didn't be used. Yang JD et al. [30] just used the TCGA database for analysis, and the GEO database didn't be used. But we applied GEO and TCGA databases to find co-expressed differential genes instead of just one database, which may increase the credibility of our study. (2) Three of them [29,30,35] just performed differential gene expression analysis, and WGCNA didn't be carried out. But WGCNA and differential gene expression analysis were carried out in our study at the same time, which may enhance the reliability of our findings. (3) They didn't explore the correlation of SERPINE1 and tumor-infiltrating immune subsets in GC, but our study performed this, which increases novel content compared with them [29,30,35].
Admittedly, there were some limitations in our study. (1) though the GSE50129 and TCGA-STAD datasets had multiple samples of GC and nontumorous tissues, only the two datasets were used and analyzed in our analysis. (2) Although we used integrated bioinformatics methods to detect candidate prognostic biomarkers in GC, it might not be very precise for patients with different GC grades and stages. (3) We did not validate our findings by performing experiments, which would be an important issue that deserves further research in the future.

Conclusion
This analysis was conducted to detect hub genes that may be correlated to the carcinogenesis and progression of GC using differential gene expression analysis and WGCNA. Eight hub genes were identified through the degree, EPC, MCC and MNC algorithms, and SERPINE1 was closely associated with the prognosis of GC patients. Furthermore, SERPINE1 may be a potential therapeutic and prognostic indicator in GC patients. Nonetheless, more studies are required to further validate and explore the biological relations among these differentially co-expressed genes in GC.