Construction of a five-gene prognostic model based on immune-related genes for the prediction of survival in pancreatic cancer

Abstract Purpose: To identify differentially expressed immune-related genes (DEIRGs) and construct a model with survival-related DEIRGs for evaluating the prognosis of patients with pancreatic cancer (PC). Methods: Six microarray gene expression datasets of PC from the Gene Expression Omnibus (GEO) and Immunology Database and Analysis Portal (ImmPort) were used to identify DEIRGs. RNA sequencing and clinical data from The Cancer Genome Atlas Program-Pancreatic Adenocarcinoma (TCGA-PAAD) database were used to establish the prognostic model. Univariate, least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analyses were applied to determine the final variables of the prognostic model. The median risk score was used as the cut-off value to classify samples into low- and high-risk groups. The prognostic model was further validated using an internal validation set of TCGA and an external validation set of GSE62452. Results: In total, 142 DEIRGs were identified from six GEO datasets, 47 were survival-related DEIRGs. A prognostic model comprising five genes (i.e., ERAP2, CXCL9, AREG, DKK1, and IL20RB) was established. High-risk patients had poor survival compared with low-risk patients. The 1-, 2-, 3-year area under the receiver operating characteristic (ROC) curve of the model reached 0.85, 0.87, and 0.93, respectively. Additionally, the prognostic model reflected the infiltration of neutrophils and dendritic cells. The expression of most characteristic immune checkpoints was significantly higher in the high-risk group versus the low-risk group. Conclusions: The five-gene prognostic model showed reliably predictive accuracy. This model may provide useful information for immunotherapy and facilitate personalized monitoring for patients with PC.


Introduction
Pancreatic cancer (PC) is a highly malignant cancer and the seventh-leading cause of mortality worldwide [1]. It is predicted that PC will become the second-leading cause of cancer-related death by the year 2030 in the USA [2]. Thus far, surgical therapy is the only curative strategy for resectable PC. However, only 10% of the patients are able to undergo standard resection at diagnosis due to the presence of atypical symptoms and the lack of effective imaging examination and diagnostic biomarkers in the early stage of disease [3]. At the time of diagnosis, most patients present with unresectable disease, characterized by nodal metastases, vascular invasion, or distant metastases [4]. Nevertheless, even patients who undergo surgical resection may not achieve satisfactory survival. Therefore, early detection and development of novel therapeutic strategies are urgently warranted to improve the survival of patients with PC.
The immune system plays a pivotal role in tumorigenesis and the progression of human malignancy [5]. A growing body of evidence suggests that the use of immunotherapy could result in favorable outcomes in cancer therapy. Blockade of immune checkpoints has shown substantial survival benefit for patients with several types of cancer, such as hepatocellular carcinoma, renal cell carcinoma, and melanoma [6][7][8]. Tumor cells could escape recognition and elimination by the immune system, induce immune tolerance, and promote their own growth and metastasis by secretion of immunosuppressive cytokines and regulation of the expression of immunoregulatory molecules [9,10] . Previous studies highlighted that the immunerelated genes (IRGs) were associated with the prognosis of several types of cancer [11][12][13]. However, few studies investigated the role of IRGs in PC. Hence, the identification of genes with prognostic potential and construction of an effective predictive model may be useful for individualized management and assessment of prognosis in patients with PC.
In the present study, we utilized the Robust Rank Aggregation (RRA) method to identify differentially expressed genes (DEGs) between pancreatic tumors and Downloaded from http://portlandpress.com/bioscirep/article-pdf/doi/10.1042/BSR20204301/915297/bsr-2020-4301.pdf by guest on 19 June 2021 adjacent normal tissues using six microarray datasets obtained from the Gene Expression Omnibus (GEO). Subsequently, univariate Cox regression was employed to identify survival-related differentially expressed immune-related genes (DEIRGs).
Furthermore, least absolute shrinkage and selection operator (LASSO) Cox regression and multivariate Cox regression analyses were utilized to construct a prognostic model comprising survival-related DEIRGs. The median risk score calculated by the model was used to classify patients into high-and low-risk groups. The association between the model and immune cell infiltration was investigated. In addition, the expression of immune checkpoints in the low-and high-risk groups was compared.
The aimed of this study was to identify the survival-related biomarkers and therapeutic targets, establish a predictive model, and provide a basis for immunotherapy in patients with PC.

Gene expression datasets
Six gene expression datasets (i.e., GSE15471, GSE60979, GSE62165, GSE71989, GSE91035, GSE102238) of PC were obtained from the GEO. All datasets met the following criteria: 1) included tumor and adjacent tissues of human PC; 2) comprised case and control groups; 3) contained >20 samples. Detailed information regarding these datasets is listed in Supplementary Table 1. A total of 170 PC sample profiles with available survival data were downloaded The Cancer Genome Atlas-Pancreatic Adenocarcinoma (TCGA-PAAD) dataset. A training dataset with 102 samples and an internal validation dataset with 68 samples were randomly generated from the TCGA-PAAD dataset in a ratio of 3:2. In addition, a microarray dataset (GSE62452) containing 64 samples with survival data was obtained from the GEO for external validation. The characteristics of the training and the validation datasets are listed in Table 1. The Limma package was utilized to identify DEGs between tumor tissues and adjacent normal tissues of each dataset in the R platform (v3.6.1) [14]. The RobustRankAggreg package, which is based on the RRA method, was employed to normalize multiple datasets and conduct gene integration analysis for the identification of the most significant DEGs [15]. Genes with |log2 Fold Change| >1 and adjusted P-value <0.05 were selected as significant DEGs.

Screening for survival-related DEIRGs
The IRG list (1,811 genes) was obtained from the Immunology Database and Analysis Portal (ImmPort) [16]. DEIRGs were obtained by intersecting the IRG list and DEG list identified from the six GEO datasets. Subsequently, we performed univariate Cox regression analysis of DEIRGs to identify survival-related DEIRGs. Genes with Pvalue < 0.01 were selected as survival-related DEIRGs.

Functional enrichment analysis of DEIRGs
Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were carried out by the Clusterprofiler package to investigate the potential function of DEIRGs [17]. Adjusted P value less than 0.05 was selected as the cut-off criteria for GO terms and KEGG pathways.

Construction of a prognostic model
In the training set, LASSO regression through the glmnet package was utilized to determine the most powerful prognostic genes among the survival-related DEIRGs [18]. Next, multivariate Cox stepwise regression was applied to determine the best prognostic model. Subsequently, a prognostic model was established using a linear combination of the relative gene expression values (Expi) and coefficient (βi) generated in the multivariate Cox regression. The risk score calculation formula is as follows: Risk score = Exp1 × β1+ Exp2 × β2 +……+ Expn × βn. The median risk score was used as the cut-off value to classify the PC samples into low-and high-risk Downloaded from http://portlandpress.com/bioscirep/article-pdf/doi/10.1042/BSR20204301/915297/bsr-2020-4301.pdf by guest on 19 June 2021 groups. Kaplan-Meier curves analysis and log-rank test were performed to identify differences in survival. Time-dependent receiver operating characteristic (ROC) curve analysis was applied to evaluate the predictive ability of the prognostic model via the timeROC package [19]. Furthermore, we performed univariate and multivariate Cox regression analyses to investigate the independent factors between the risk score and clinical parameters, including age, sex, T stage, N stage, AJCC stage, and histologic grade. In addition, we performed gene set enrichment analysis (GSEA) utilizing the "ClusterProfiler" package to investigate the significantly enriched pathways between high-and low-risk groups in TCGA-PAAD dataset. We retrieved KEGG gene sets (c2.cp.kegg.v7.0.symbols.gmt) by using the msigdbr package.

Correlation analysis of risk score and immune cells infiltration
Tumor Immune Estimation Resource (TIMER, cistrome.shinyapps.io/timer) is a comprehensive analytical web tool, which includes 10,897 samples across 32 cancer types from TCGA to estimate the abundance of six tumor-infiltrating immune cells between low-and high-risk group aiming to provide information to optimize immunotherapeutic strategies for patients with PC.

Statistical analysis
R software v3.6.1 (www.r-project .org) was used for statistical analyses in the present study. The Limma package was used to obtain DEGs. Kaplan-Meier curves analysis and log-rank test were performed using the survival package. Time-dependent ROC curve analysis was applied via the timeROC package. Univariate and multivariate Cox regression were employed to determine independent factors for OS. Spearman correlation analysis was applied to investigate the correlation between risk score and immune cells infiltration. Difference of expression of immune checkpoints in Lowand high-risk Group were compared using Wilcox test. Analysis results with P value <0.05 indicated statistical significance.

Identification of DEGs
A total of 985 DEGs, including 619 upregulated genes and 366 downregulated genes were identified. The top 20 upregulated and downregulated DEGs are shown in Figure 1A. We obtained a total of 142 DEIRGs by intersecting the DEG list with the IRG list ( Figure 1B).

GO and KEGG enrichment analysis for DEIRGs
The following GO categories were enriched: biological process (BP), cellular component (CC), and molecular function (MF) (Figures 2A-C). The results showed that the significantly enriched terms were defense response to other organism (BP),

Construction and internal validation of prognostic Model
We obtained 47 survival-related DEIRGs via univariate Cox regression analysis (Supplementary Table 3). LASSO Cox regression was applied to narrow down the number of relevant genes in the training dataset. The LASSO coefficient profiles of 47 survival-related DEIRGs are presented in Figure 3A. We obtained 12 genes with minimum partial likelihood deviance according to 10-fold cross-validation results ( Figure 3B). Furthermore, the best prognostic model with the smallest Akaike Information Criterion was identified via multivariate Cox stepwise regression analysis.
Finally, a prognostic model involving five genes, namely endoplasmic reticulum aminopeptidases 2 (ERAP2), amphiregulin (AREG), C-X-C motif chemokine ligand 9 (CXCL9), dickkopf-1 (DKK1) and interleukin-20 receptor subunit beta (IL-20RB) was constructed. Figure 3C shows that CXCL9, DKK1, and IL-20RB exhibit the characteristics of independent prognostic factors in the training dataset. The prognostic risk score for each patient was calculated as follows: Risk score = The patients in the training dataset were divided into low-and high-risk groups applying the median risk score as the cut-off criteria. The Kaplan-Meier curve shows that high-risk patients had a significant worse overall survival (OS) than low-risk group patients ( Figure 4A). A time-dependent ROC curve was generated, and the area under the ROC curve (AUC) was calculated to assess the predictive ability of the model. In the training dataset, the 1-, 2-, and 3-year AUCs were 0.85, 0.87, and 0.93 ( Figure 4B), respectively. The risk score distribution and the expression of the five genes in the training dataset are shown in Figures 4C and 4D. The prognostic model was further validated using the internal validation dataset. Similarly, patients with higher risk scores were associated with worse OS ( Figure 4E). In the internal validation dataset, the 1-, 2-, and 3-year AUCs were 0.79, 0.74, and 0.8, respectively Downloaded from http://portlandpress.com/bioscirep/article-pdf/doi/10.1042/BSR20204301/915297/bsr-2020-4301.pdf by guest on 19 June 2021 Bioscience Reports. This is an Accepted Manuscript. You are encouraged to use the Version of Record that, when published, will replace this version. The most up-to-date-version is available at https://doi.org/10.1042/BSR20204301 ( Figure 4F). Figures 4G and 4H demonstrate the distribution of the risk score and a heatmap of the five gene expression data in the validation dataset, respectively. This indicates that this prognostic model is able to predict the OS of patients with PC in TCGA cohort.

External validation of the prognostic model in GEO dataset
A GEO dataset (GSE62452) of PC with survival data was used as an external validation dataset to assess the predictive capability of the prognostic model. The risk score of each patient in the dataset was calculated using the formula of the model, and all patients were divided into high-and low-risk groups according to the median risk score. The OS observed in the high-risk group patients was significantly worse than that recorded in the low-risk group. In the external validation dataset, the 1-, 2-, and 3-year AUCs were 0.6, 0.75, and 0.77, respectively. External validation further confirmed the stable and accurate prognostic value of the present model in PC

Evaluation of the independence of the prognostic model
To investigate the independent predictive ability of the prognostic model, we performed univariate and multivariate Cox regression analyses for the relationship between risk score and clinicopathological characteristics. The results showed that age, N stage, histologic grade, and risk score were associated with worse prognosis ( Figure   5). Meanwhile, age and risk score were independent prognostic factors for OS.
Moreover, higher risk scores were associated with advanced grade of disease in TCGA-PAAD dataset (Supplementary Figure 1F).

Correlation between the risk score and immune cell infiltration
We investigated the correlation between the risk score and the abundance of six tumor infiltrating immune cell subsets (i.e., B cells, CD4 T cells, CD8 T cells, macrophages, neutrophils, and dendritic cells). The results shown that the risk score was positively

Differences between the expression of immune checkpoints in the low-and highrisk groups
We compared the relative expression of most characteristic immune checkpoints between the low-and high-risk groups. The expression of PDCD1, PDCD-L1, CTLA4, CD80, CD86, TIM3, VISTA, and TIGIT was significantly higher in the highrisk group compared with the low-risk group, indicating that immunosuppression may contribute to worse OS in high-risk patients (Figure 7).

GSEA analysis result
We performed GSEA to further investigate the different functional phenotype between the high-and low-risk groups. The results are listed in Supplementary Table 4 Previous studies have shown that the TME of PC was infiltrated by immunosuppressive cells, but not effector lymphocytes [61,62]. Moreover, PC was characterized by a low proportion of tumor/stroma ratio in the tumor mass [63].
Notably, the stromal area of the TME was the main site of immune cell infiltration, which contributes to the poor outcome of PC [64]. In the present study, the correlation between the risk score and six subtypes of tumor-infiltrating immune cells was investigated. We observed that our prognostic model was positively associated with the infiltration of dendritic and neutrophil cells. According to the considerable research on immune checkpoints conducted in recent decades, immunotherapy has shown great curative potential for several types of cancer, such as hepatocellular carcinoma, melanoma, and bladder cancer [65][66][67]. Binding of PDCD1LG1 to its Interestingly, the expression levels of PDCD1, PDCD1-L1, CTLA4, CD80, TIM3, VISTA, and TIGIT in the high-risk group were notably higher than those measured in the low-risk group. These results indicated that the immunotherapeutic strategy of immune checkpoint blockade may be more effective for high-risk patients. GSEA revealed that the significantly enriched pathways in the high-risk group were associated with immune-related responses and tumorigenesis.
However, limitations in the present study should be realized. As our study was driven from statistics analysis of retrospective data, multicenter clinical trials and prospective research are required to further assess and validate this prognostic model.

Conclusion
In this study, survival-related DEIRGs were identified, and a five-gene prognostic model with reliable predictive accuracy was constructed. The risk score calculated from the model is strongly correlated with immune cell infiltration in tumors and the expression of immune checkpoints. This model may provide new insight into the individualization of anti-tumor therapy and facilitate clinical monitoring of PC.

Competing interests
The authors declare no potential conflicts of interest.

Ethics approval and consent to participate
The ethical approval did not refer to this study and this study did not need the informed consent.