A 23-gene prognostic classifier for prediction of recurrence and survival for Asian breast cancer patients

Abstract We report a 23-gene classifier profiled from Asian women, with the primary purpose of assessing its clinical utility for improved stratification of relapse risk in breast cancer patients from Asian cohorts within 10 years following mastectomy. Four hundred and twenty-two breast cancer patients who underwent mastectomy were used to train the classifier with a logistic regression model. A subset of 197 patients entered the post-mastectomy follow-up study, in which patterns of recurrence and survival were analyzed with respect to expression of the gene classifier, age at diagnosis, tumor stage and lymph node status over 5- and 10-year follow-up periods. Metastasis to lymph nodes (N2-N3, with N0 as the reference; N2 vs. N0 hazard ratio: 2.02 (1.05–8.70), N3 vs. N0 hazard ratio: 4.32 (1.41–13.22) at 5 years) and gene expression of the 23-gene panel (P=0.06 at 5 years and P=0.02 at 10 years, log-rank test) were found to have significant discriminatory effects on the risk of relapse (HR (95% CI): 2.50 (0.95–6.50)). Furthermore, in the subgroup analysis of N0-N1 and T1-T2 patients, the survival curves showed shorter survival for patients with higher risk scores. The study provides robust evidence of the effectiveness of the 23-gene classifier, which could be used to determine the risk of a relapse event (locoregional and distant recurrence) in Asian patients, leading to a meaningful reduction in chemotherapy recommendations.


Introduction
Breast cancer is the second leading cause of cancer deaths in women, second only to lung cancer, resulting in 1 death among 38 women each year (6.8%) [1]. Breast cancer remains a very heterogeneous disease comprising a wide range of morphologic and biological features, clinical behavior, and treatment responses [2]. A patient's prognosis is often judged by the available survival and recurrence rates, usually calculated from diagnosis. The risk of breast cancer recurrence, whether local or distant, is an important criterion in deciding the course of treatment for patients. Twenty to thirty percent of patients with early breast cancer are susceptible to relapse [3]. However, understanding the factors that affect recurrence, and identifying the genetic and histological factors that characterize subgroups of patients who remain at increased risk of recurrence long after completing the standard course of treatment, during the first decade following diagnosis, remains an unmet clinical need.
The risk of recurrence and outcomes have traditionally been stratified with regard to tumor subtype [4,5]. Other characteristics such as tumor size, tumor grade, nodal status, and age at diagnosis affect disease progression [6]. There is also a hereditary predisposition to breast cancer that accounts for approximately 5-10% of all breast cancers [7]. Breast cancer susceptibility gene mutations and the accurate estimation of their probabilities provide vital information for the genetic counselling of breast cancer patients [8]. Multiple factors, such as socioeconomic, epidemiological and genetic etiology, play roles in tumor behavior, cancer subtype and the prognosis of patients, and have been observed to vary among different racial/ethnic groups [9,10]. Mortality rates due to breast cancer also vary across patients of different ancestry. Prior studies have shown that breast cancer risk and prognosis manifest differently between non-Asian and Asian Triple-negative Breast Cancer (TNBC) subtypes [11]. Such racial differences could be attributed to fundamental differences in epidemiological and genetic risk factors between populations, which might be responsible for the underlying mechanisms leading to population-specific risk levels. Taking the effects of ancestry differences into consideration could provide additional understanding of patient prognosis, thereby leading to better and more appropriate treatment decisions. This provides a strong rationale for conducting breast cancer studies specifically in women from Asian cohorts.
Multiple genetic association studies have accrued over the years and have added to gene-based knowledge of breast cancer. High-penetrance breast cancer susceptibility genes, such as BRCA1 and BRCA2, explain only a small fraction of breast cancers in the general population because of their low carrier rates [12]. Most of the identified genetic factors are associated with only a small to moderate increase in breast cancer risk and therefore, cumulatively, may explain only a small proportion of the heritability of breast cancer risk. Over the years, several assays have been developed to conduct breast cancer recurrence score tests [13][14][15][16][17]. They provide an indication of the activity of a set of candidate breast cancer-specific genes to predict the probability of the tumor's growth and spread, and thereby act as biomarkers for treatment decisions following surgery. This is particularly important when deciding whether a patient requires chemotherapy, so that the patient is neither overtreated nor undertreated. Oncotype DX [14], EndoPredict [13], and RecurIndex [15][16][17] are genomic tests of this kind that use potential breast cancer genetic predictors and are suitable for patients recently diagnosed with early-stage, estrogen receptor-positive (ER+) or progesterone receptor-positive (PR+), and human epidermal growth factor receptor 2-negative (HER2-) breast cancer. These tests typically support treatment decisions by using multigene expression profiles to predict breast cancer recurrence or distant metastasis within a follow-up period (maximum 10 years) after diagnosis. Such tests have been effective in preventing overtreatment in early breast cancer patients.
In our prior studies, we reported a 34-gene set and an 18-gene classifier that could separate high-risk patients from low-risk patients with respect to loco-regional recurrence after mastectomy [15][16][17]. In addition to gene expression profiling of tumors for predicting clinical outcomes in breast cancer patients, regional lymph node status and tumor pathological staging may provide surrogate information for events such as metastasis or relapse [18]. Racial and ethnic disparities are evident in underlying genetic and biological factors that might influence disease incidence and prognosis. Hence, in the present study, we report a 23-gene classifier with the primary purpose of assessing its clinical utility for improved stratification of relapse risk after mastectomy. The 23 genes derive from our published studies, in which we identified a metagene in the Asian population for the prediction of loco-regional recurrence [15]. We subsequently developed and validated gene subsets from that metagene for the prediction of loco-regional recurrence and distant metastasis [16,17]. Building on these, we develop and validate a new subset to predict the relapse of breast cancer in the Asian population. The gene set is derived from genomic profiling of Asian women and predicts the risk level (high/low) of relapse within up to 10 years post mastectomy following initial diagnosis. The present study further establishes the efficacy of the 23 discriminatory genes, along with pathological indicators such as tumor stage (T stage) and lymph node status (N stage), in predicting survival [15].

Study population
For this Asian population study, gene expression data from a total of 422 patients were obtained from publicly available Gene Expression Omnibus (GEO) datasets. The first dataset, GSE20685 [19], consists of gene expression profiles from 312 prospectively enrolled patients diagnosed with breast cancer and treated between 1991 and 2004 at the Koo Foundation Sun-Yat-Sen Cancer Center (KFSYSCC), and an additional 15 lobular breast carcinoma samples collected between 1999 and 2004. The second dataset, GSE45255 [20], consists of 1954 annotated breast tumors with corresponding clinical-pathological data, including distant metastasis-free survival, gathered from Singapore and Europe, out of which the 95 samples of Singapore origin are included in the present study. Characteristics such as age at diagnosis (years), tumor stage (T1 (stage1), T2 (stage2), T3 (stage3), T4 (stage4)), and N stage (lymph node status: N0, N1, N2, N3) were recorded for each of the samples. Treatment-related status (neoadjuvant chemotherapy) was also obtained. All women in the present study were treated with either breast-conserving therapy or mastectomy. Patients were classified by tumor and lymph node stage, and eligible patients met the following inclusion criteria: (1) invasive carcinoma of the breast, (2) clinical stages T1-T4, (3) lymph node status N0-N3 [19,20]. In addition, 197 of the 422 patients were entered into the follow-up studies. Remaining patients were excluded if they had (1) no  Figure 1 summarizes the workflow implemented in the present study. A previously identified classifier of 23 prognosis-related genes is used to predict overall recurrence based on the binary expression status of the genes.

Model building
A logistic model was built using all study samples, and a leave-one-out cross-validation (LOOCV) [21] procedure was conducted to check the accuracy and test the preliminary performance of the 23-gene set. LOOCV provides an almost unbiased estimate of generalization performance [22]. It consists of training the model on n - 1 samples and evaluating the model selection criterion on the remaining 1 sample. This procedure is repeated for all n combinations of n - 1 samples, and the accuracy is then calculated to judge the model performance.
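The LOOCV procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code: the gradient-descent fitting routine, the 0.5 classification threshold, the function names and the one-feature toy data are all assumptions introduced for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=500):
    # Illustrative stand-in for a logistic regression fit:
    # batch gradient descent on the mean log-loss.
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, x, threshold=0.5):
    # The 0.5 threshold is an assumption; the text does not state one.
    w, b = model
    p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
    return 1 if p >= threshold else 0

def loocv_accuracy(X, y):
    # Hold out each sample once, train on the other n - 1, test on the held-out one.
    correct = 0
    for i in range(len(X)):
        model = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += int(predict(model, X[i]) == y[i])
    return correct / len(X)

# Toy, well-separated data (hypothetical values, one feature per sample).
toy_X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_accuracy = loocv_accuracy(toy_X, toy_y)
```

Because each sample is predicted by a model that never saw it, the resulting accuracy reflects out-of-sample behavior, at the cost of n separate model fits.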

Model training and testing
Logistic regression was used to predict the recurrence of breast cancer, with the expression of the 20 genes as the predictor. The model was built in R with a binomial outcome variable and a logit link function. The binary response is recurrence (y = 1) or disease-free (y = 0). The best-fit logistic regression model is selected through model training, which yields the optimal values of the parameters that govern the model. The model is trained using a supervised learning method: it is fit with 50% of the total samples as training samples, using the input vector of x's (gene expression of 23 genes) as the predictor, and the predicted y value (predicted high or predicted low risk) is then compared with the respective observed status (observed relapse or disease-free) of each patient. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted.
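For reference, a binomial model with a logit link of the kind described above can be written as follows. The symbols are notational assumptions introduced here, not taken from the source: k is the number of gene predictors, x_j the expression of gene j, β_0 the intercept, β_j the fitted coefficients and c a classification threshold.

```latex
\operatorname{logit}\bigl(P(y = 1 \mid x)\bigr)
  = \ln\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j ,
\qquad
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-\beta_0 - \sum_{j=1}^{k} \beta_j x_j\bigr)} ,
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{high risk}) & \text{if } P(y = 1 \mid x) \ge c, \\
0 \ (\text{low risk}) & \text{otherwise.}
\end{cases}
```

Training estimates the coefficients by maximum likelihood on the training half; the threshold c is a separate choice that trades sensitivity against specificity.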
Once the model is trained, it is tested to determine how accurately it will perform in practice. The remaining 50% of the total samples are used as the test dataset to provide an unbiased evaluation of the final model fit on the training dataset. The model performance is evaluated through the confusion matrix, while the clinical performance is judged through metrics such as sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). The clinical performance metrics and the confusion matrix were obtained with the R package "DescTools" (https://cran.r-project.org/web/packages/DescTools/index.html). All model training and testing were conducted in R version 3.6.3.
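As an illustration of how the metrics named above follow from a confusion matrix, the sketch below computes them from observed and predicted labels. It uses plain Python rather than the DescTools package; the function names and the example labels are hypothetical.

```python
def confusion_counts(observed, predicted):
    # Labels: 1 = relapse / predicted high risk, 0 = disease-free / predicted low risk.
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == 1 and obs == 1:
            tp += 1
        elif pred == 1 and obs == 0:
            fp += 1
        elif pred == 0 and obs == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def _ratio(numerator, denominator):
    # Return NaN instead of raising when a margin of the table is empty.
    return numerator / denominator if denominator else float("nan")

def clinical_metrics(observed, predicted):
    tp, fp, tn, fn = confusion_counts(observed, predicted)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # relapses correctly flagged as high risk
        "specificity": _ratio(tn, tn + fp),  # disease-free patients correctly flagged as low risk
        "ppv": _ratio(tp, tp + fp),          # predicted high risk who did relapse
        "npv": _ratio(tn, tn + fn),          # predicted low risk who stayed disease-free
    }

# Hypothetical labels for eight patients.
example_observed = [1, 1, 1, 0, 0, 0, 0, 0]
example_predicted = [1, 0, 0, 0, 0, 0, 0, 1]
example_metrics = clinical_metrics(example_observed, example_predicted)
```

Sensitivity and specificity are conditioned on the observed status, whereas PPV and NPV are conditioned on the prediction and therefore also depend on how common relapse is in the cohort.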

Survival analysis
Cox proportional hazards regression models were used to assess the prognostic significance of age at diagnosis, pathological tumor stage, N-stage, and the 23-gene classifier. The R packages survminer (https://cran.r-project.org/web/packages/survminer/index.html) and survival (https://cran.r-project.org/web/packages/survival/survival.pdf) were used to conduct all survival analyses. The disease-free interval is displayed by the Kaplan-Meier method, and the log-rank test is used to determine any statistically significant differences in survival between the indicated groups. Comparative analyses between groups were performed using Chi-squared tests for categorical variables and t-tests for numeric variables. Statistical significance was accepted for P < 0.05. Both univariate and multivariable Cox proportional hazards analyses were performed for age, T and N staging, and gene expression profiles, for both 5- and 10-year follow-up data, to obtain hazard ratios (HRs) with 95% confidence intervals (CIs) and P-values. Finally, a subgroup analysis using Cox proportional hazards tests, stratified by tumor stage T1-T2 and N-stage N0-N1, respectively, was conducted to estimate whether these factors had any significant effect in predicting the survival of patients within a 10-year follow-up period from the initial diagnosis.
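To make the Kaplan-Meier step concrete, the following sketch implements the product-limit estimator in plain Python. It is an illustration only and does not reproduce the survival or survminer packages; the function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the disease-free probability.

    times  : follow-up time for each patient
    events : 1 if relapse was observed at that time, 0 if censored
    Returns a list of (time, estimated probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            relapses += data[i][1]
            leaving += 1
            i += 1
        if relapses > 0:
            # Patients censored at time t still count as at risk at t.
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up (years) for five patients; 0 marks censoring.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients contribute to the number at risk until their last follow-up and then drop out without lowering the curve, which is what allows patients with different follow-up lengths to be combined in one estimate.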

Patient demographics
A total of 327 patients from the Gene Expression Omnibus (GEO) dataset GSE20685 [19], who underwent modified radical mastectomy or breast-conserving surgery plus dissection of axillary nodes, were included in the present study. Clinical information on the follow-up treatments for these patients included radiotherapy for 141 patients, adjuvant chemotherapy for 232 patients, and/or hormonal therapy for 224 patients. Neoadjuvant chemotherapy was administered to 31 patients with locally advanced disease. Similar clinical information for the 95 patients from the Singapore cohort (GSE45255) [20] was available as tamoxifen monotherapy (ER+), chemotherapy, and/or neoadjuvant chemotherapy status. The demographic features of the 422 (total) patients included in the present study, such as age at diagnosis, N stage (N0, N1, N2, N3), tumor stage (T1, T2, T3, T4), recurrence (relapse or disease-free) and follow-up status, are summarized in Table 1. To further determine the recurrence and survival rates of the patients, 5- and 10-year follow-up studies were conducted on a total of 197 patients, because there were 7 patients without follow-up and 7 events happened before follow-up. The details are shown in Figure 1. The demographic details of the follow-up patient sample, with age at diagnosis, tumor stage, N stage, and recurrence status, are displayed in Table 2.

Training and testing performance of the 23 gene classifier
Gene expression profiles for each of the genes in the 23-gene panel are displayed in Figure 2 for patients with and without recurrence, respectively. All genes showed high median gene expression (log2 expression >7) both in recurrence-free patients and in patients with recurrent breast cancer. The expression profiles showed ACTB, PTI1, and RPLPO to be consistently high across patients with and without recurrence. The predictive power of the gene classifier is established through the accuracy, sensitivity, specificity, PPV, and NPV of the fitted logistic regression model for patients at high risk versus low risk of recurrence. Table 3A,B summarizes the confusion matrix for predicted and observed recurrence risks (high/low) in patients from the training and testing data. The model achieved a training accuracy of 78.7% (Table 3A) and a testing accuracy of 73.9% (Table 3B). The ability of the model to correctly classify a high-risk patient was 23.6% (training sensitivity) and 15.7% (testing sensitivity); however, the probability of correctly classifying a low-risk individual (specificity) was 96.9% (training) and 92.5% (testing). Further, the PPV and NPV of the classifier reached 70.6% and 79.4% for the training data, whereas they reached only 40% and 77.5%, respectively, for the testing data. Table 2 summarizes the demographics by prediction-model classification for 5-year (median follow-up: 5.00 [4.29-5.00]) and 10-year recurrence (median follow-up: 6.28 [4.29-9.4]), respectively. 197 patients were retained for the 5/10-year follow-up study, of which 19 were predicted to be at high risk of recurrence, with a mean age of   This indicates that patients with high risk scores had higher relapse rates than those with low risk scores, and that there was a significant difference in relapse-free survival between the high- and low-risk groups.
To examine how relapse is affected by each of the factors (genetic, age at diagnosis, T and N stage), univariate and multivariate Cox proportional hazards test results were studied. The results of the univariate Cox proportional hazards test for the effects of covariates on patient prognosis between the high-risk and low-risk groups show that the difference in risk of relapse is not attributable to age at diagnosis or tumor stage (Table 4). However, metastasis to lymph node (N2 and  (Table 4).

Subgroup analysis
To evaluate the ability of the 23-gene classifier to predict recurrence in early-stage patients, tumor stages T1 and T2 and N stages N0 and N1 were chosen for a further subgroup analysis. The findings from the Cox proportional hazards analysis in the previous step, along with the fact that the most frequent groups of patients in the 10-year follow-up study were those with T stage 1 and 2 (89% of the total patients in the follow-up study) and N stage 0 and 1 (78% of the total follow-up patients, Table 2), led us to focus on these subgroups. In addition, a larger tumor burden and higher metastatic risk restrict the benefit the classifier may provide, while a lower tumor burden in N1 might help to improve survival. Figure 4 displays the survival curves for the subgroup analysis, obtained through Kaplan-Meier analysis, which showed that patients with high risk scores had shorter survival than those with low risk scores. There was no significant difference in overall survival between high- and low-risk patients within the N0-N1 subgroup (P=0.13), and a marginally significant difference between high- and low-risk patients within the T1-T2 subgroup (P=0.057).

Discussion
Gene expression profiling of tumors is a promising and widely used new strategy for predicting clinical outcomes in breast cancer patients. Past studies have established that the clinical responses of patients are correlated with specific molecular 'portraits' [23], and that gene signatures can potentially distinguish subgroups of patients with different prognoses or responses to different treatment regimens [24]. In the present study, we therefore try to establish the effectiveness of a gene classifier as a predictor of recurrence, post adjuvant chemotherapy, in breast cancer patients from the Asian population. The gene classifier consists of a cluster of 23 correlated genes that is used to predict patient risk stratification (high risk or low risk) for relapse within 10 years from diagnosis, where the model parameters are first trained to obtain the best-fit model. Once a best-fit model is obtained, it is tested on a separate dataset to evaluate the feasibility of the model. The classifier is designed to be used for decision making through a logistic regression model. It is a major challenge to validate gene classifiers with samples that are independent from those that were used to develop them [25]. Increased sensitivity, the ability to correctly identify people at high risk of recurrence, usually comes at the expense of reduced specificity (meaning more false positives). Likewise, high specificity, when a test does a good job of ruling out people at low risk of recurrence, usually means that the test has lower sensitivity (more false negatives). A false-negative result might give false reassurance to the patient, and a false-positive result might send patients to unnecessary and expensive medical procedures. Therefore, whatever the individual values of accuracy, sensitivity, specificity, PPV and NPV, careful consideration of the interplay among these measures of diagnostic accuracy is vital for reaching a reliable conclusion.
It is not enough to make decisions based only on gene expression, as classification can be affected by many other factors that drive the prognosis of a complex disease such as breast cancer. Other criteria such as tumor size, nodal status and age at diagnosis, along with genetic etiology, affect disease prognosis, relapse or disease-free status, and survival. We conducted a 5- and 10-year follow-up, taking into consideration the role of each prognostic factor in predicting the course of the disease and the response to different treatment strategies. Differences in recurrence risk between prognostic subgroups were observed to decline over time, as were differences in disease-free survival between subgroups. All results indicate that conditional survival and recurrence stratified by gene panel and lymph node status give patients better insight into prognosis. The significant P-values and overall hazard ratios also confirmed the success of the gene classifier in discriminating patients at high risk from patients at low risk of metastasis.
The 23-gene classifier, of which three genes were housekeeping genes and were not included, conforms mostly to subsets of patients of Asian origin. Pooling the expression of all of the genes in the classifier, ACTB, BLM, BUB1B, CCR1, CKAP5, CLCA2, DDX39, DTX2, ERBB2, ESR1, MKI67, OBSL1, PGR, PHACTR2, PIM1, PTI1, RCHY1, RPLPO, SF3B5, STIL, TFRC, TPX2, and YWHAB, indicates an increased likelihood of breast cancer [26][27][28]; however, membership in a prognostic gene list is not necessarily indicative of a gene's importance in cancer pathology. Nevertheless, each of the genes listed has been associated with one or more cancer-related prognoses. The scoring algorithm was formulated by assigning each gene a weight according to the logistic regression and then combining the weighted values into a single gene score, which was compared against a cut-off to classify patients. PHACTR2 is associated with up-regulated and down-regulated functional gene networks of migratory breast tumor cells [29], TPX2 suppresses activation of the p53 pathway in breast cancer [30], and DDX39 is involved in embryogenesis, spermatogenesis, and cellular growth and division. Other genes may have a protective effect on the prognosis of breast cancer: BLM has a key role in homologous recombination repair, telomere maintenance, and DNA replication [28]. The molecular function of SF3B5 involves RNA binding and splicing factor binding. The gene panel is consistent with our earlier study [15]. Moreover, extracting biological meaning from whole-genome molecular profiling remains a significant challenge.
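The scoring scheme described above (per-gene weights combined into one score that is compared against a cut-off) can be sketched as below. This is only an illustration: the function names, the three-gene subset, the weights, the intercept and the cut-off are invented for the example and are not the fitted values from this study.

```python
def gene_score(expression, weights, intercept=0.0):
    # Weighted sum of expression values over the genes that carry a weight.
    # A gene missing from `expression` raises KeyError rather than being skipped.
    return intercept + sum(w * expression[gene] for gene, w in weights.items())

def risk_group(score, cutoff):
    # Scores at or above the cut-off are labelled high risk.
    return "high" if score >= cutoff else "low"

# Hypothetical weights (signs and magnitudes invented for illustration only).
example_weights = {"TPX2": 0.8, "MKI67": 0.5, "BLM": -0.4}
example_intercept = -8.0  # hypothetical
example_cutoff = 0.0      # hypothetical

# Hypothetical log2 expression values for two patients.
patient_a = {"TPX2": 9.0, "MKI67": 8.0, "BLM": 7.5}
patient_b = {"TPX2": 7.0, "MKI67": 7.0, "BLM": 9.0}

score_a = gene_score(patient_a, example_weights, example_intercept)
score_b = gene_score(patient_b, example_weights, example_intercept)
group_a = risk_group(score_a, example_cutoff)
group_b = risk_group(score_b, example_cutoff)
```

Collapsing the panel into a single score keeps the decision rule to one threshold, so the trade-off between sensitivity and specificity can be tuned by moving the cut-off without refitting the weights.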
There have also been concerns relating to the design and validation of gene classifiers, as small numbers of patient samples are usually used to derive classifiers and there is very little overlap among the gene signatures of the various classifiers. The Oncotype DX Breast Cancer Assay, developed by Genomic Health, is a 21-gene assay that can predict disease recurrence and response to chemotherapy in ER-positive, HER2-negative, early-stage breast cancer. EndoPredict combines prognostic information from an 8-gene analysis with tumor size and the patient's nodal status. Both tests have been developed to guide treatment decisions on whether women should receive chemotherapy in addition to anti-hormone treatments. A prior study has reported that the 8-gene signature from the EndoPredict breast cancer test might be more accurate at predicting the recurrence of breast cancer than the recurrence score of Oncotype DX [31]. The 23-gene set was selected from and validated on several datasets and was inferred to be the most stable gene set for predicting the relapse of breast cancer. The performance of the gene classifier in this study is very encouraging, as the survival benefit was more significant in the follow-up and subgroup analyses. However, our predictor still needs to be compared with other predictors, because patients who are correctly predicted to be at low risk can be spared the unwanted side effects that are common with chemotherapy and can have a better quality of life.
Gene expression profiling of tumors appears to be a promising new strategy for predicting clinical outcomes in breast cancer patients.

Conclusion
The present study aims to determine the risk of recurrence in breast cancer patients from Asia by utilizing a 23-gene prognostic signature. It is designed to validate the gene signature through a 10-year follow-up study (recurrence and overall survival), to further define the prognosis of breast cancer, and to assess which patients will benefit from chemotherapy. The study provides robust evidence of the clinical utility of the 23-gene classifier and shows that it can successfully distinguish between patients at high and low risk of relapse. Such information offers clinicians potential molecular tools to help with the selection of therapeutic strategies, such as extending adjuvant endocrine therapy or omitting adjuvant chemotherapy in patients where toxic effects are particularly deleterious or when this treatment is fundamentally not needed.

Data Availability
The data used to support this study are included in the manuscript. The datasets are available in the NCBI GEO database under the accession numbers GSE20685 and GSE45255.