Co-expression of key gene modules and pathways of human breast cancer cell lines

Breast cancer (BC) is the most common leading cause of cancer-related death in women worldwide. Gene expression profiling analysis for human BCs has been studied previously. However, co-expression analysis for BC cell lines is still devoid to date. The aim of the study was to identify key pathways and hub genes that may serve as a biomarker for BC and uncover potential molecular mechanism using weighted correlation network analysis. We analyzed microarray data of BC cell lines (GSE 48213) listed in the Gene Expression Omnibus database. Gene co-expression networks were used to construct and explore the biological function in hub modules using the weighted correlation network analysis algorithm method. Meanwhile, Gene ontology and KEGG pathway analysis were performed using Cytoscape plug-in ClueGo. The network of the key module was also constructed using Cytoscape. A total of 5000 genes were selected, 28 modules of co-expressed genes were identified from the gene co–expression network, one of which was found to be significantly associated with a subtype of BC lines. Functional enrichment analysis revealed that the brown module was mainly involved in the pathway of the autophagy, spliceosome, and mitophagy, the black module was mainly enriched in the pathway of colorectal cancer and pancreatic cancer, and genes in midnightblue module played critical roles in ribosome and regulation of lipolysis in adipocytes pathway. Three hub genes CBR3, SF3B6, and RHPN1 may play an important role in the development and malignancy of the disease. The findings of the present study could improve our understanding of the molecular pathogenesis of breast cancer.


Introduction
Breast cancer (BC) is one of the most common types of cancer in women [1]. This disease largely affects women in their 40-60s. BC accounts for 29% of all female cancer-related cancer worldwide that contributed the second most cancer now, after lung cancer both in developing and developed countries [2]. At present, the morbidity and mortality of BC are continuously growing. In 2017, approximately 63,990 new cases and 14,440 deaths were predicted to be associated with kidney cancer in the U.S.A. [3]. Nevertheless, BC is a complex disease with respect to molecular alterations, cellular composition, and clinical outcome six subtypes were defined approximately a decade ago based on transcriptional characteristics, and were designated luminal A, luminal B, ERBB2-enriched, basal-like, Claudin-low, and normal-like [4]. However, growing body studies indicated that BC is heterogeneous cancer in various aspects including clinical-pathological, molecular, and cellular heterogeneity [5]. In order to improve the prognosis and decrease mortality and morbidity of BC, diagnostic biomarkers are critical for early detection and risk stratification of BC, which could help us to choose the proper treatment. Weighted gene co-expression network analysis (WGCNA), a comprehensive collection of R functions, is a prevalent and useful method in the correlation network analysis and the identification of disease-related gene modules and key genes that contribute to phenotypic traits [6,7]. WGCNA approach has provided functional interpretation tools in systems biology. Unlike conventional microarray-based expression profiling method, WGCA allows a global interpretation of gene expression data by constructing gene networks based on similarities in expression profiles amongst samples. To better understand and explore the complex mechanisms of BC cell lines, WGCNA method would be the best choice to study the disease. WGCNA has been successfully performed in various types of cancer, such coronary heart disease [8], uveal melanoma [9], gastric adenocarcinoma [10], prostate cancer [11], and lung cancer [12]. However, the analysis of microarray-based gene expression data by the WGCNA has so far not been applied to compare the co-expression network of BC cell lines.
In the present study, the WGCNA method was applied to analyze a gene expression dataset of the BC cell line to identify biologically relevant modules associated with BC cell line subtypes.

Expression analysis of microarray data of BC cell line samples
Gene expression profiles of breast cell lines were accessible from the Gene Expression Omnibus (GEO) database using the accession number GSE 48216 [13]. The microarray gene data included 56 BC cell lines, which composed of five subtypes (basal, claudin-low, luminal, non-malignant, and unknown). Annotated files of microarray platform (GPL 10999) were also downloaded from GEO. Annotation information of microarray data was employed to match probes with corresponding gene information. The mean value was used to consider the expression level if one gene detected by multiple probes. We select the top 25% variance gene expression data as our study object. A matrix of pairwise correlations between all pairs of genes across all selected samples was constructed. sample outlier. Second, WGCNA algorithm was used to calculate the correlations amongst genes across samples, and the appropriate soft threshold power was chosen and the standard scale-free network was established by the condition of scale independent as >0.8. Third, module identification was carried out with the dynamic tree cut method by hierarchically clustering the genes using 1-topological overlap matrix (TOM) as the distance measure with a deep split value of 2 and minimum module size (minClusterSize) of 50 for the resulting dendrogram. Highly similar modules were clustered and merged with a height cutoff of 0.25. Finally, WGCNA algorithm was then used to construct the co-expression modules and extract the gene information in each module.

Establishment of module related trait relationships
The correlation between gene expression modules and each phenotype of BC cell lines was determined by WGCNA. Spearman's correlation analysis was carried out to determine the association individual module and subtype of BC cell lines, which was the most relevant module between the module eigengene (ME) and clinical traits. Each subtype of BC has a strongly related module, which can be used as its signature. Gene significance (GS) was calculated for each expression profiles, module membership (MM) was defined as the correlation of expression profile and each MM. The GS versus MM plot was also drawn.

Interaction analysis of co-expression modules of BC
Interaction relationships amongst different co-expression modules were imputed by WGCNA. Heatmap tool package in R software was used in the evaluation of the strength of the relationship. We correlate the clustering coefficient with connectivity by the module in the unweighted network. The heatmap of gene expression profiles in the individual module was also shown.

Functional enrichment analysis of genes in the hub modules
The number of genes in the critical modules was put in the order from high to low gene significance scores. Then, functional enrichment analysis was conducted on these genes in these modules. GO Biological Process term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted using Cytoscape plug-in ClueGo [15]. Functional enrichment analysis was based on the cut-off value of P-value <0.05.

Module network visualization and hub genes
The interested module was constructed using the Cytoscape software, and Cytoscape plug-in molecular complex detection (MCODE) was used to analyze the most notable clustering module.

The Kaplan-Meier plotter
The prognostic value of hub gene was evaluated using an online database, Kaplan-Meier plotter (www.kmplot.com) [16], which integrated gene expression data with survival information of 4142 clinical BC patients [17]. For overall survival analysis, the log-rank test was employed to compare the survival difference between the BC group and non-tumor group. A two-sided P-value of <0.05 was regarded as statistically significant.

Validation of the hub gene expression with TCGA and CCLE database
To verify the hub genes from the TCGA database, the expression statuses of hub genes in the TCGA BRCA cohort were revealed for validation. The diagnostic performance was evaluated using receiver operating characteristic (ROC) curves. The hub genes in a different type of cancers and cell lines were determined through analysis in CCLE database (https://portals.broadinstitute.org/ccle/home), which is publicly a large-scale genomic dataset of human cancer cell lines to facilitate discovery and identification from genome-wide expression analyses [18]. P-value, less than 0.05, was regarded as significant. All analyses were performed using R/Bioconductor (version 3.3.2).

Collection of gene expression data
The raw gene expression data (GSE 48213) were downloaded from the GEO data repository (http://www.ncbi.nlm. nih.gov/geo/). The dataset contained a total of 56 BC cell lines, consisting 27 luminal, 14 basal, six claudin-low, one Kbluc, five non-malignant, and three of unknown subtype. The detailed information is shown in Table 1. The sequencing platform was GPL10999 Illumina Genome Analyzer IIx (Homo sapiens). The gene expression matrix was downloaded from the GEO database. The mean value was used to represent the expression level if one gene detected by multiple probes. As a result, a total of 36,953 expression data of genes was obtained. To select the most varying genes, the top highest 25% variance expression profiles were chosen to conduct the WGCNA analysis.

Construction of co-expression module of BC cell lines
Co-expression modules were constructed by the expression values of 5000 genes in 56 BC cell lines using the WGCNA algorithm. On the whole, the sample hierarchical clustering plot in each sample was divided into two clusters using the flashClust tool package of WGCNA algorithm method. As can be seen in Figure 1, 56 BC samples were divided into two clusters, and we fail to detect outlier samples. Power values were screened out by WGCNA algorithm in the construction of co-expression modules. When the soft threshold power β was set at 6, the independence degree was up to 0.8 and the average connectivity degree was higher ( Figure 2). After highly similar modules were merged, a total of 28 co-expression modules were identified in the co-expression network (  Supplementary Table S1.

Correlation between modules and each subtype of BC cell lines in co-expression networks
The module-trait relationships were established by WGCNA algorithm. Modules with higher Spearman's correlation coefficients are also the most important elements of modules associated with the trait. From Figure 4, the modules related to BC cell line classification were observed, including the five categories of 'basal' , 'claudin-low' , 'luminal' , 'non-malignant' , and 'unknown' . The correlation between ME and traits shows that some modules are more important than others in the disease. A scatterplot of gene significance versus MM was also plotted. We found that the luminal phenotype highly correlated with the brown module (Spearmen correlation coefficients: 0.86, P<0.01), with a significantly significant correlation. The basal phenotype is highly related to the black module (Cor: 0.52, P<0.01). Claudin-low subtype highly correlated with darkred module (Cor: 0.67, P<0.01). Non-malignant subtype was highly associated with midnightblue module (Cor: 0.74, P<0.01) ( Figure 5). The module with the greatest association with

Interaction analysis of co-expression modules and trait
We further analyzed the interaction amongst 28 co-expression modules. On the whole, no significant difference amongst different modules was observed, suggesting the relative independence of gene expression in these modules ( Figure 6). The higher scale independence amongst these modules was also detected. Connectivity of eigengenes analysis was performed in order to evaluate the interactions amongst these constructed co-expression modules ( Figure  7). The dendrogram plot also indicated that the brown module was highly associated with luminal BC.

Functional enrichment analysis of critical modules
To explore the potential biological function of critical modules, we performed the functional enrichment analysis of GO and KEGG category. There was a significant difference in the biological processes that different modules were enriched in Figure 8 ( Table 2). The results of GO analysis for brown module revealed that GO categories enriched in the biological process (n=25), cellular component (n=3), and molecular function (n=3). Genes in the brown module were largely enriched in GO:0034248 (regulation of cellular amide metabolic process), GO:0010608 (post-transcriptional regulation of gene expression), and GO:0043487 (regulation of RNA stability). The GO enrichment categories are exhibited in Figure 8A. A total of 15 KEGG pathways focussing on the biological pathways    were enriched, including autophagy, spliceosome, and mitophagy. The KEGG pathway enrichment categories are exhibited in Figure 9A. Gene in the black module was mainly enriched in GO:0000184 (nuclear-transcribed mRNA catabolic process, nonsense-mediated decay), GO:0033613 (activating transcription factor binding), GO:0022626 (cytosolic ribosome) ( Figure 8B). Two KEGG pathways focussing on the biological pathways were enriched in the black module, including colorectal cancer and pancreatic cancer ( Figure 9B). Gene in the dark red module was mainly enriched in GO:0003678 (DNA helicase activity), GO:0004003 (ATP-dependent DNA helicase activity), and GO:0070603 (SWI/SNF superfamily-type complex) ( Figure 8C). However, genes in dark red module did not significantly enrich any pathways. Genes in midnight blue module were enriched in GO:0043628 (ncRNA 3 -end processing), GO:0006734 (NADH metabolic process), and GO:0045047(protein targeting to ER) ( Figure 8D). Genes in midnight blue module played critical roles in ribosome and regulation of lipolysis in adipocytes pathway ( Figure 9C).

Module visualization and hub genes
We chose gene pairs in the module which weight >0.05 and co-expression network was constructed using Cytoscape ( Figure 10). The sub-network was also identified via Cytoscape plug-in MCODE. In total, three sub-co-expression networks (modules 1, 2, and 3) with score >6 were detected by MCODE ( Figure 11). Each score was 34.979, 8.889, and 7.579, respectively. The three sub-co-expression networks belong to black module, midnightblue, and brown. Amongst these genes, CBR3, SF3B6, and RHPN1 were the hub gene in the sub-network.

Validation of hub genes
The Kaplan-Meier curve and log-rank test analyses revealed that the mRNA level of three hub genes was significantly associated with OS (P<0.05) ( Figure 12) in all BC patients. The BC patients with higher mRNA levels of CBR3 and   SF3B6 or lower mRNA level of RHPN1 were predicted to have a better OS. The three hub gene expression level was extracted in TCGA BRCA cohort. Compared with normal samples, the three hub gene expression level was higher in the tumor samples ( Figure 13). We employed ROC curves to evaluate the prognostic power of three hub genes. The combined AUC for the three biomarkers prognostic model was 0.927 (0.907−0.947) ( Figure 14). The CCLE database analysis demonstrated that the two hub gene expression level of CBR3 and RHPN1 ranked higher in a variety of cancer cell line ( Figure 15).

Discussion
BC is still one of the most common and malignant tumors by complex molecular and cellular heterogeneity [5]. This disease largely influences women in their 40-60s. It is the second most cancer now, just after lung cancer, the principal cause of death from cancer amongst women both in developing and developed countries [2,3]. BC cell lines may mirror many of the molecular characteristics of the tumors from which they were derived, and are therefore a useful preclinical model in which to explore strategies for predictive marker development. However, the mechanisms of critical pathways and their interactions involved in the occurrence and development of BC, remain largely unknown. Therefore, understanding of BC biology may provide clinicians with new tools that can be used for the treatment of this disease. Many successful efforts were employed to examine the complex molecular and cellular nature of BC through basic genetic and molecular study, and various novel genes that are involved in BC have been identified [19][20][21]. Although  [22]. Due to rapid technological breakthroughs of genome-wide sequencing, the research of clinical issues and related pathological mechanisms in various cancers has been developed [23]. Weighted gene co-expression network analysis (WGCNA) is a method frequently used in the co-expression module correlation analysis by microarray samples. WGCNA is also a powerful tool to examine the potential gene correlation structures within the gene expression data. Therefore, the WGCNA method was employed to construct robust gene co-expression networks. These modules were established in terms of large-scale gene expression profiles and the distinction of centrally located genes (hub genes) that drive key cellular signaling pathways [24]. The WGCNA approach has provided functional interpretation tools in systems biology and led to new insights into the molecular and pathological mechanisms in several diseases [25,26]. There are no reports applying WGCNA to systematically identify gene co-expression networks associated with BC cell lines. To fulfill this gap, the WGCNA was applied to construct a co-expression network amongst each phenotype of BC cell line and calculated module-trait correlations based on one public microarray datasets (GSE78000), which included 36 samples and 22,615 genes. A total of 28 co-expression modules were constructed by WGCNA method, which helps clinicians to detect the association between BC cell line transcriptome and phenotype. We found four co-expression modules that relate to phenotype (basal, claudin-low, luminal, unknown, and non-malignant). Using the enrichment and functional analysis of Cytoscape plug-in ClueGO, we found that the KEGG pathway of targetted genes in the brown module was mainly involved in the pathway of the autophagy, spliceosome, and mitophagy, the black module was mainly enriched in the pathway of colorectal cancer and pancreatic cancer, and genes in midnightblue module played critical roles in ribosome and regulation of lipolysis in adipocytes pathway. Three hub genes CBR3, SF3B6, and RHPN1 may play an important role in the development and malignancy of BC. The genes stood out for their association with the diffuse histological subtype.
CBR3 (carbonyl reductase 3) is a protein-coding gene. Amongst its related pathways are metabolism and cytochrome P450 -arranged by substrate type. GO annotations related to this gene include oxidoreductase activity and carbonyl reductase (NADPH) activity. An important paralog of this gene is CBR1. At present, CBR3 has become potential biomarkers in uterine leiomyomas disease [27], type 2 diabetes [28], oral squamous cell carcinomas [29], and BC [30]. CBR3 mRNA expression could regulate human cancer cell lines by the Nrf2 pathway [31]. CBR3 is also proven a novel target gene of inflammatory stimuli [32]. Hertz et al. demonstrate that SNPs in CBR3 was correlated with chronic cardiotoxicity in a cohort of BC patients treated with anthracyclines [33].
SF3B6 (splicing factor 3b subunit 6) is a protein-coding gene. Amongst its related pathways are gene expression and mRNA splicing -major pathway. GO annotations related to this gene include nucleic acid binding and nucleotide binding. Siebring-van et al. identified that silencing SF3A3 yielded much stronger cytotoxicity to non-small cell lung cancer cells than to lung fibroblasts, suggesting that the gene could represent useful therapeutic targets [34]. RHPN1 (Rhophilin Rho GTPase binding protein 1) is a protein-coding gene. Amongst its related pathways are ectoderm differentiation and RHO GTPases activate rhotekin and rhophilins. An important paralog of this gene is RHPN2. Lal MA reported that Rhophilin-1 is a key regulator of the podocyte cytoskeleton, and is essential for glomerular filtration [35].
In this work, our findings demonstrated that black module and brown module, midnight blue module, and dark red module were regarded as the most critical module in the development and pathological phenotype of BC. And the hub gene CBR3, SF3B6, and RHPN1 were found to be significantly in three modules, which may play as the potential diagnostic and prognostic biomarkers of BC, However, the biological function of these three genes needs to be further validated with more experiment.

Availability of data and material
Data availability could be obtained from TCGA website.