Identification of core genes and clinical roles in pregnancy-associated breast cancer based on integrated analysis of different microarray profile datasets

More women are delaying child-birth. Thus, the diagnosis of pregnancy-associated breast cancer (PABC) will continue to increase. The aim of this study was to identify core candidate genes of PABC, and the relevance of the genes on the prognosis of PABC. GSE31192 and GSE53031 microarray profile datasets were downloaded from the Gene Expression Omnibus database and differentially expressed genes were analyzed using the R package and GEO2R tool. Then, Gene Ontology and Kyoto Encyclopedia of Gene and Genome pathway enrichment analyses were performed using the Database for Annotation, Visualization, and Integrated Discovery. Moreover, the Search Tool for the Retrieval of Interacting Genes and the Molecular Complex Detection Cytoscape software plug-in were utilized to visualize protein–protein interactions and to screen candidate genes. A total of 239 DEGs were identified in PABC, including 101 up-regulated genes mainly enriched in fatty acid activation and the fibroblast growth factor signaling pathway, while 138 down-regulated genes particularly involved in activation of DNA fragmentation factor and apoptosis-induced DNA fragmentation. Fourteen hub genes with a high degree of connectivity were selected, including CREB1, ARF3, UBA5, SIAH1, KLHL3, HECTD1, MMP9, TRIM69, MEX3C, ASB6, UBE2Q2, FBXO22, EIF4A3, and PXN. Overall survival (OS) analysis of core candidate genes was performed using the Gene Expression Profiling Interactive Analysis and UALCAN websites. High ASB6 expression was associated with worse OS of PABC patients. Molecular subtypes and menopause status were also associated with worse OS for PABC patients. In conclusion, ASB6 could be a potential predictor and therapeutic target in patient with PABC.

Although breast cancer is the most frequently diagnosed malignancy among women and is increasing in incidence, PABC has remained a rare event in the last decade [5,8,10,11]. However, PABC has increased more recently as the government policy in China allows the birth of a second child and given the trend of  Different color areas stand for up-regulated and down-regulated genes, which were defined with P < 0.05 and |log FC| > 1 as the cut-off criterion.
women in postponing pregnancy [12][13][14][15]. While PABC formation and progression has been well-studied, the biology of PABC is still not clear. The poor prognosis of PABC has been attributed to a higher risk for nodal involvement, higher hormonal milieu, and more frequent delay in diagnosis [3,[16][17][18][19][20]. However, several studies found no major differences in the expression of hormone receptors and a more common association with the high expression of potentially relevant cancer targets, such as insulin growth factor, Wnt/β-catenin and RANK ligand, and low prevalence of tumor-infiltrating lymphocytes [21][22][23][24][25]. There have been considerable efforts in managing breast cancer that develops during pregnancy. Yet, very little progress has been in understanding of the biology of this disease. This challenging scenario in PABC demands knowledge of the causes and potential molecular mechanisms of PABC, and molecular biomarkers for the early diagnosis and individualized therapy.
Gene microarray profiling is widely used in life sciences. This approach can quickly detect all the expressed genes within the same tissue and reveals differentially expressed genes (DEGs) [26]. The use of the gene chip can uncover thousands of DEGs associated with different biological processes, different signal pathways, or molecular functions of cancer. The popularity of gene microarray profiling has generated a huge amount of data that is available in public databases. These data can be scrutinized to integrate and re-analyze gene expression profiles, provide new methods, and obtain valuable clues for the disease of interest.
There are many reports that integrate gene expression profiles on breast cancer, however, hard to see it in PABC [27][28][29]. In this study, we screened two microarray profile datasets, GSE31192 [30] and GSE53031 [31], from the National Center for Biotechnology Information Gene Expression Omnibus database (NCBI-GEO). Then, we filtered DEGs using combined GEO2R and R package analyses, and performed Gene Ontology (GO) and Kyoto Encyclopedia of Gene and Genome (KEGG) pathway enrichment analyses of DEGs with the Database for Annotation, Visualization, and Integrated Discovery (DAVID). The protein-protein interaction (PPI) network of DEGs developed with the Search Tool for the Retrieval of Interacting Genes (STRING) and top modular analysis with Cytoscape software were used to identify core candidate genes in PABC. The association of the core candidate genes with overall survival (OS) was assessed using the Gene Expression Profiling Interactive Analysis (GEPIA) and UALCAN websites. The bioinformatics analyses revealed a number of DEGs. In particular, ASB6 (ankyrin repeat and suppressor of cytokine signaling (SOCS) box containing 6) could be potentially considered as a biomarker in PABC.

Acquisition microarray datasets
Gene expression microarray data were obtained from the GSE31192 and GSE53031 microarray profile datasets in the NCBI-GEO database (http://www.ncbi.nlm.nih.gov/geo/), which is a public and freely available platform. GSE31192 is based on the GPL570 platform of the Affymetrix Human Genome U133 Plus 2.0 Array, which includes 20 PABC samples containing six cancer epithelial tissues, six cancer stroma tissues, three normal epithelial tissues, and five normal stroma tissues, and 13 non-PABC control samples that contained four cancer epithelial tissues, four cancer stroma tissues, three normal epithelial tissues, and two normal stroma tissues. GSE53031 is based on the GPL13667 platform of the Affymetrix Human Genome U219 Array, and contains 54 PABC tissues and 113 non-PABC control tissues.

Detection processing of DEGs
GEO2R (Online: https://www.ncbi.nlm.nih.gov/geo/geo2r/) is an online tool that can compare two or more groups of samples in the GEO series datasets. A P-value < 0.05 and |log FC| > 1 were set as the cut-off criteria. For the integrated analysis of the high-throughput functional genomic expression data, the raw data were processed in R according to the |log FC| > 1 and q-value < 0.05 criteria. The combination of GEO2R and R was used to select DEGs. The biological coefficient of variation, which was the square-root of the dispersions, was set to 0.01 following the suggestion in the R manual and Volcano plots was constructed [32]. Datasets that separated up-regulated and down-regulated DEGs were uploaded into FunRich version 3.1.3, a free software for functional enrichment analysis of gene sets that contains additional features [33]. Venn diagram of the analysis results plotted commonly changed DEGs using the P < 0.05 and |log FC| > 1 criteria.

GO and KEGG pathway analysis of DEGs
GO and KEGG pathway enrichment analyses were performed to detect the biological functions and potential pathway of DEGs. GO is a commonly used method to annotate genes and gene products, and identifies characteristic biological attributes for high-throughput genome or transcriptome data [34]. KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances [35,36]. GO and KEGG analyses were applied through the DAVID (Online: https://david.ncifcrf.gov/), a website bioinformatics resource that can provide tools for the gene annotation, visualization, and integrated discovery function [37]. The cut-off criterion was P < 0.05.

PPI network and module analyses
Search Tool for the Retrieval of Interacting Genes (STRING, https://string-db.org) is an online tool designed to evaluate PPIs [38]. The DEGs involved in the PPI network were identified using STRING, with confidence network edges and a medium confidence of 0.400 as the product criteria. The Tab Separated Values format document was downloaded as a simple tabular text output. Cytoscape v3.6.0 was used to construct the PPI network and analyze the interactions of the candidate DEGs encoding proteins [39]. The Molecular Complex Detection (MCODE) software application was used to screen the top modules in the PPI network with a degree cut-off = 2, node score cut-off = 0.2, k-core = 2, and maximum depth = 100. The corresponding proteins in the central nodes and highly degree were potential core proteins encoded by key candidate genes that have important physiological regulatory functions.

OS analysis of core candidate genes in PABC
GEPIA (http://gepia.cancer-pku.cn/) is a web-based tool that rapidly delivers customizable functionalities based on TCGA and GTEx data. Patient survival analysis can be conducted [40]. UALCAN (Online: http://ualcan.path.uab. edu/index.html) is a user-friendly, interactive web resource used to analyze cancer transcriptome data. UALCAN allows users to identify up-and down-regulated expressed genes in individual cancer types, as well as to estimate the effect of gene expression level and clinicopathologic features on patient survival [41]. Presently, GEPIA was used to determine OS related to core candidate genes that had been identified. The effects of the expression of the core candidate genes on PABC OS according to molecular subtypes and menopause status were assessed by UALCAN and P-values were calculated.

Identification of DEGs in PABC
NCBI-GEO is a free database of microarray profiles and next-generation sequencing data. The database was used to obtain PABC, non-PABC, and normal or adjacent stroma tissue gene expression profiles of the GSE31192 and GSE53031 datasets. A total of 1712 and 2642 DEGs were respectively up-regulated in these two datasets, while 2171 and 1388 DEGs were respectively down-regulated using P<0.05 and |logFC|>1 as the cut-off criteria (Figures 1 and  2). After integrated bioinformatic analysis, 239 common DEGs were identified from the two profile datasets. These included 101 up-regulated genes and 138 down-regulated genes in PABC, compared with non-PABC ( Figure 2 and Table 1).

DEGs PPI network and modular analysis
Using the STRING online database, a total of 139 DEGs (54 up-regulated and 85 down-regulated genes) of the 239 commonly altered DEGs were filtered into the PPI network complex. One hundred of the 239 DEGs did not fall into the PPI network. Based on the STRING output and Cytoscape software analysis, 54 nodes and 131 edges were identified in up-regulated DEGs and 85 nodes and 188 edges in down-regulated DEGs. The PPI network complex defaulter filter degree was between 1 and 20 inclusive ( Figure 4A).
According to the degree of genes and the module was selected using the MCODE Cytoscape plug-in, the most significant fourteen node degree genes identified using a filtering degree >7 as the criterion were CREB1, ARF3, UBA5, SIAH1, KLHL3, HECTD1, MMP9, TRIM69, MEX3C, ASB6, UBE2Q2, FBXO22, EIF4A3, and PXN ( Figure  4B,C). The proteins of these central nodes might be the core proteins and the genes could be key candidate genes with important roles in the progress of PABC and prognosis. They could be potential therapeutic targets.

Core candidate genes OS analysis
The prognostic information of the 14 core candidate genes is freely available in http://gepia.cancer-pku.cn. Of the 14 identified genes, the high expression of ASB6 was associated with worse OS for PABC patients according to molecular subtypes and menopause status. Compared with low expression of ASB6 in triple negative PABC, high expression of ASB6 was associated with worse OS for PABC patients, while the opposite result was obtained in the Her-2 overexpression subtype ( Figure 6A). High expression of ASB6 was associated with poor OS in pre-menopausal PABC patients. By contrast, low expression of ASB6 was associated with a poor prognosis in peri-menopausal patients ( Figure 6B). No statistical significance of ASB6 expression levels was evident according to luminal subtype and post-menopause status.

Discussion
The prevalence of PABC is increasing, and the situation will become serious as the policy of planned birth has changed in China. Early diagnosis, prevention, and personalized therapy of PABC are vital and require more knowledge of the disease. This study integrated two microarray profile datasets and utilized bioinformatics methods to identify 239 common DEGs. The genes comprised 101 up-regulated and 138 down-regulated genes. The 239 DEGs were divided into biological process, molecular function, and cellular component groups by GO and further clustered based on KEGG pathways with significant enrichment analysis. The up-regulated DEGs were mainly enriched in the immune response, peptide binding, plasma membrane, and fatty acid activation. The influences of fatty acids on cancer have been extensively studied, with variable influences apparent in different cancers [42][43][44][45][46]. The down-regulated DEGs were mainly enriched in the macromolecule catabolic process, small conjugating protein ligase activity, nuclear lumen, and activation of DNA fragmentation factor. Interestingly, the signaling pathway enrichment analysis indicated that epidermal growth factor receptor and protein kinase B, which is closely related to cell growth, proliferation, and apoptosis, play a critical role in cancer [47][48][49][50][51][52]. By reviewing the previous studies, Harvell et al. and Ma et al. found that PABC stroma tissue differentially expressed immune response genes, involved in angiogenesis and extracellular matrix deposition, associated with high-grade tumors [30,53]. Azim et al. and Muller et al. observed that CXCR4, belong to G protein-coupled receptor family of cell surface molecules, increase the metastatic potential of breast cancer cells during pregnancy [31,54]. These are consisted with our GO and KEGG enrichment results, will help us to further understand this cohort of patients.
Among the presently identified CREB1, ARF3, UBA5, SIAH1, KLHL3, HECTD1, MMP9, TRIM69, MEX3C, ASB6,  UBE2Q2, FBXO22, EIF4A3, and PXN were identified as core genes that displayed a high degree of connectivity in the PPI network. These genes were expressed at higher or lower levels in PABC tissues. The up-regulation of ASB6 was significantly associated with worse OS of PABC patients. The same result was detected in triple negative molecular subtypes and pre-menopausal patients. Further analyses implicated ASB6 as having a central role in the biological process of PABC and in the prognosis. ASB6 might be a potentially valuable therapeutic molecular target.
ASB6 belongs to a family of ankyrin repeat proteins that contain a C-terminal SOCS box motif. ASB6 can couple to an adaptor protein containing PH and SH2 domains that interacts with insulin receptor to execute glucose transport [55,56]. Accumulating evidence suggests that the SOCS box acts as a bridge between specific substrate-binding domains and more generic proteins that comprise a large family of E3 ubiquitin protein ligases, which mediated the ubiquitination and subsequent proteasomal degradation of target proteins [57]. ASB6 up-regulation by Areca nut extracts has been identified in normal keratinocytes and oral cancer cells, and is highly associated with a worse prognosis of oral squamous cell carcinoma [58]. The authors also reported differential expression of ASB6 and its influence on prognosis, which are consistent with our findings. Whether these observations with ASB6 in two different types of cancer extend to other cancers remains to be determined.
In conclusion, this study identified DEGs through integrated bioinformatical analysis to find potential core candidate genes involved in PABC. ASB6 was identified as potentially having a pivotal role in the occurrence, development, and prognosis of PABC. ASB6 is thus implicated as a novel biomarker and potential therapeutic target.

Funding
This study was supported by the National Natural Science Foundation of China [grant numbers 81372843, 81502518, 81472472 and 81502300].