Methylation-to-expression feature models of breast cancer accurately predict overall survival, distant-recurrence free survival, and pathologic complete response in multiple cohorts

Methylation-to-expression feature models of breast cancer accurately predict overall survival, distant-recurrence free survival, and pathologic complete response in multiple cohorts

Play all audios:

Loading...

ABSTRACT Prognostic biomarkers serve a variety of purposes in cancer treatment and research, such as prediction of cancer progression, and treatment eligibility. Despite growing interest in


multi-omic data integration for defining prognostic biomarkers, validated methods have been slow to emerge. Given that breast cancer has been the focus of intense research, it is amenable to


studying the benefits of multi-omic prognostic models due to the availability of datasets. Thus, we examined the efficacy of our methylation-to-expression feature model (M2EFM) approach to


combining molecular and clinical predictors to create risk scores for overall survival, distant metastasis, and chemosensitivity in breast cancer. Gene expression, DNA methylation, and


clinical variables were integrated via M2EFM to build models of overall survival using 1028 breast tumor samples and applied to validation cohorts of 61 and 327 samples. Models of distant


recurrence-free survival and pathologic complete response were built using 306 samples and validated on 182 samples. Despite different populations and assays, M2EFM models validated with


good accuracy (C-index or AUC ≥ 0.7) for all outcomes and had the most consistent performance compared to other methods. Finally, we demonstrated that M2EFM identifies functionally relevant


genes, which could be useful in translating an M2EFM biomarker to the clinic. SIMILAR CONTENT BEING VIEWED BY OTHERS MULTI-OMICS ANALYSES IDENTIFY _HSD17B4_ METHYLATION-SILENCING AS A


PREDICTIVE AND RESPONSE MARKER OF HER2-POSITIVE BREAST CANCER TO HER2-DIRECTED THERAPY Article Open access 23 September 2020 MACHINE LEARNING APPROACHES REVEAL METHYLATION SIGNATURES


ASSOCIATED WITH PEDIATRIC ACUTE MYELOID LEUKEMIA RECURRENCE Article Open access 06 May 2025 INTEGRATION OF 101 MACHINE LEARNING ALGORITHM COMBINATIONS TO UNVEIL M6A/M1A/M5C/M7G-ASSOCIATED


PROGNOSTIC SIGNATURE IN COLORECTAL CANCER Article Open access 18 February 2025 INTRODUCTION The extreme heterogeneity of breast cancer makes one-size-fits all treatments difficult to achieve


and although a number of targeted therapies are available1 personalized therapies work only for a limited number of patients. Better methods are needed to rapidly identify patients that can


benefit from specific treatments, reveal new targets for historically difficult to treat patients, and provide accurate prognoses to patients to enable better life planning. Large scale


collection of cancer genomic data, such as The Cancer Genome Atlas (TCGA)2 and the International Cancer Genome Consortium3, and expression repositories such as NCBI’s Gene Expression


Omnibus4 have enabled developments in prognostic and predictive cancer models. In addition to improved understanding of how cancer disrupts normal cellular function, such data have led to


the development of biomarker panels, such as the PAM50 risk of recurrence score (rorS)5 and the 21 gene Oncotype DX6, which are both now used for clinical decision making in the treatment of


breast cancer7. Furthermore, to the extent that such tests are able to predict the occurrence of distant metastases, they may enable early prevention strategies to be put in place. Despite


such promising developments, the goals of precision medicine remain elusive due to specific limitations in molecular biomarker studies. These include limited understanding of the optimal


genomic features to utilize, lack of validation on external data, and inconsistent use of best practices for creating biomarkers with the best chances of entering clinical practice8,9. There


has been growing interest in the integration of molecular and traditional clinicopathological markers to achieve higher levels of prognostic power and utility and small but significant


gains in prognostic accuracy have been observed through this approach5,10. To the extent that these sources are independently prognostic, they should result in higher accuracy than either


source alone and better understanding of prognostic features by revealing holistic relationships in the data. As yet, few prognostic or predictive models are designed to take advantage of


multiple data types. In this work, we apply our own previously developed data integration approach for combining gene expression and DNA-methylation data to model the impact of regulatory


disruption on breast cancer prognosis and prediction while also incorporating prognostic clinical variables11. This study overcomes prior limitations by creating a model that can identify


biologically plausible targets, using best practices in assessing the model8,9, and by validating the actual model (not just the gene signature) in multiple, external datasets, representing


different demographics and disease distributions. METHODS M2EFM Previously, we introduced a data-integrated modeling approach we call Methylation-to-Expression Feature Model (M2EFM)11 and


applied it in the case of clear cell renal cell carcinoma (ccRCC). The basis of this approach is to pre-filter loci to those that are differentially methylated between matched tumor and


adjacent, pathologically normal tissue and to associate the methylation status of those loci with significant differences in gene expression in the tumor samples. The differentially


methylated loci are then called m2eQTL (for methylation-to-expression QTL) and the genes are called m2eGenes. An initial model is built from m2eQTL and m2eGenes with the largest effects to


generate risk scores for all samples using Cox-Ridge regression (using the exponential function of the weighted sum of the model features). A final model is built from the risk scores in


conjunction with clinical variables and a final risk score is generated from the weighted sum of the features of this model. Here we build M2EFM models of overall survival (OS), distant


recurrence-free survival (DRFS), and pathologic complete response (pCR) in breast cancer, where OS is defined as time from initial diagnosis to death from any cause, DRFS is defined as time


from diagnosis to distant recurrence, death from cancer, or death from unknown cause, and pCR is defined as a patient being recorded as having a pathologic complete response or a residual


cancer burden class of RCB 0/I as in Hatzis, _et al_.12. DATA A full description of data pre-processing appears in the Supplementary Methods. For this work, we supplied a list of probes to


M2EFM from a study of CpG loci involved in epigenetic field effects in breast cancer based on data available in GEO (GSE69914). The loci were delineated CpG sites exhibiting subtle


differences in methylation in samples from breast tissue from non-diseased healthy subjects compared to samples from tumor-adjacent normal breast cancer cases, that are also shown to exhibit


differential methylation in tumor tissue itself13. Therefore, they represent regions that may be involved in early loss of regulation in breast cancer. Aside from the probe list, five


different breast cancer cohorts were used in this work. The first dataset is referred to as TCGA, was used to train and test an OS model for breast cancer, and consisted of gene expression


and DNA methylation profiles created by TCGA, using the Illumina HiSeq 2000 and Illumina Infinium HumanMethylation 450 platforms, respectively. Methylation data were functionally


normalized14 using the _minfi_ package15 in R. Expression data were TDM normalized, which makes the distributions between microarray and RNA-seq datasets similar, making them more comparable


to the validation datasets16. The samples are described in Table S1. After pre-processing, we retained 1028 samples. The second dataset is referred to as TERUNUMA, is used for external


validation of the OS model, and contains 61 tumor samples with clinical and survival data, downloaded from ArrayExpress (E-GEOD-39004)17. These data were assayed on the Affymetrix GeneChip


Human Gene 1.0 ST Array. They were normalized and then batch corrected in conjunction with the following dataset. A description of the samples is in Table S2. The third dataset is referred


to as KAO, is used as a second external validation dataset for the OS model, and contains 327 tumor samples, downloaded from ArrayExpress (E-GEOD-20685)18. These data were assayed on the


Affymetrix GeneChip Human Genome U133 Plus 2.0. A description of the samples appears in Table S2. The fourth dataset is referred to as HATZIS1, was used to train and test a model of DRFS and


another model of pCR, and includes HER2-negative breast cancer cases with neo-adjuvant treatment by taxane-anthracycline (followed by endocrine therapy for ER-positive cases)12. These data


were assayed on the Affymetrix Human Genome U133A Array. They were downloaded from GEO (GSE25055), and were normalized and then batch corrected in conjunction with the following dataset. A


description of the samples appears in Table S3. After pre-processing we retained 306 samples. The final dataset is referred to as HATZIS2, was used for external validation of both the DRFS


and pCR models of (HER2-negative) breast cancer, and includes HER2-negative breast cancer cases with neo-adjuvant treatment by taxane-anthracycline (followed by endocrine therapy for


ER-positive cases)12. These data were assayed on the Affymetrix Human Genome U133A Array. They were downloaded from GEO (GSE25065). A description of the samples appears in Table S3. After


pre-processing we retained 182 samples. In order to reduce categories with very few samples and the number of variables in the model, all AJCC tumor stage classifications were reduced to


simple Stage I, II, III, or IV. EXPERIMENTAL DESIGN Three different types of models were built: (1) a model of overall survival (OS), (2) a model of distant recurrence-free survival (DRFS),


(3) a model of pathologic complete response (pCR) to neoadjuvant taxane/anthracycline treatment. For all model types, we built 100 different models using different random splits of the data,


divided into 70% training and 30% testing data sets, to assess how consistently a good model could be found given different training data. We assessed the performance of the models built


with and without clinical variables, with clinical variables alone, and separately to include external validation data. The external validation data could only be tested on models without


DNA methylation values, because methylation data were not available (an overview is shown in Fig. 1). For external validation, a final model using all training data was also built. M2EFM OS


and DRFS models were compared with Cox-Ridge regression, using the _glmnet_ package19 for R, without pre-selection of genes or probes (although probes were pre-filtered by median absolute


deviation to reduce them to a number of the most variable probes roughly equal to the number of genes). The M2EFM pCR model was compared to a logistic-Ridge regression model, again using


_glmnet_. Furthermore, the M2EFM OS and DRFS models were compared to a model incorporating the PAM50 risk of recurrence score (rorS)5 with clinical variables. Finally, the M2EFM model was


compared to a model built from a gene signature obtained by the network clinical association (NCA) algorithm20, which is also based on data integration. Performance of the OS and DRFS models


was evaluated using concordance or C-index, a measure of how likely it is, in any given pair of individuals, that the individual with the higher risk score has the event first. For the pCR


model, the performance was measured using area under the receiver operating characteristic curve (AUC). Significance tests for comparisons used the two-tailed Wilcoxon signed-rank test.


Clinical variables were chosen using the least absolute shrinkage and selection operator (LASSO) for each outcome. Candidate variables included AJCC stage, patient age at diagnosis, ER


status, PR status, HER2 status, and PAM50 subtype (calculated using the _genefu_ package for R)21. Selected variables include AJCC stage and patient age at diagnosis for all outcomes,


although stage was reduced to simply presence or absence of stage III cancer for DRFS and pCR outcomes due to the absence of stage IV cases and small number of stage I cases for these


outcomes. The m2eGenes were used to analyze the functional relevance of our approach using the online tool WEB-based GEne SeT AnaLysis Toolkit (WebGestalt 2017)22. Using WebGestalt, we


performed protein network topology-based analysis and enrichment analysis for GO biological process terms. Furthermore, we tested the enrichment of m2eGenes in a database of genes with


causal cancer connections, the Catalog of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census23, using Fisher’s exact test. Finally, we tested the enrichment of genes in our signature in


genes known to be targeted by breast cancer drugs listed in the DrugBank database24, again using Fisher’s exact test. DATA AVAILABILITY The datasets generated and/or analyzed during the


current study are available in the following repositories: * TCGA: UCSC Cancer Genomics Browser * Terunuma: ArrayExpress (E-GEOD-39004) * Kao: ArrayExpress (E-GEOD-20685) * Hatzis1: GEO


(GSE25055) * Hatzis2: GEO (GSE25065) * Cancer Genes: COSMIC Cancer Gene Census * Cancer Drug Targets: DrugBank M2EFM is available as a package for R at https://github.com/jeffreyat/M2EFM.


RESULTS We applied M2EFM to the TCGA dataset to select dysregulated probes and genes for our models. The lists of probes and genes selected by M2EFM are shown in Table S4. OVERALL SURVIVAL –


META-DIMENSIONAL MODELS WITHOUT EXTERNAL VALIDATION DATA Results of the meta-dimensional double-integrated model, (Fig. 1 and fully described in the Supplemental Methods) are shown in Fig. 


2A and Table S5, allowing a comparison to how frequently equivalent strength results could be obtained from random gene sets. Across the 100 random splits of the training and testing data,


the double integrated model containing methylation values from m2eQTL, expression values from m2eGenes, and clinical covariates (M2EFM Meth + Exp + Clin) had the highest median C-index at


0.790, followed by the model incorporating PAM50 rorS values for the samples and clinical variables (rorS + Clin) at 0.775, then the clinical variables only model (Clin) at 0.753, and


finally the Cox-Ridge regression with an additional regression integrating unpenalized clinical variables (Cox Meth + Exp + Clin) at 0.728. The C-index scores across the splits were


statistically significantly greater for the M2EFM Meth + Exp + Clin model than rorS + Clin, Clin, or Cox Meth + Exp + Clin (p = 1.84e-08, 1.26e-12, and 3.23e-16 respectively). For models


built from only the molecular risk score, M2EFM Meth + Exp had the highest median C-index although scores tended to be lower than models incorporating clinical data. M2EFM Meth + Exp results


were significantly greater than rorS and Cox Meth + Exp (p = 3.44e-11, and 1.13e-02 respectively). OVERALL SURVIVAL - EXPRESSION ONLY MODELS WITH EXTERNAL VALIDATION DATA To allow external


validation of our models using other datasets we next focused on breast cancer OS, building models using only expression data, with the m2eQTL discovered previously. In such cases, we


substitute the _cis_-m2eGenes (genes proximal to the differentially methylated loci) for the associated methylation values. We also added a model built from a gene signature discovered using


NCA (except for two genes: _HES5_ and _NKIRAS1_, which were not in our profiles). The median C-index for each model on each dataset is shown in Table S6, and the distribution of C-indices


is shown in Fig. 2B. On TCGA, the median C-index for rorS + Clin was the highest (0.762), followed closely by M2EFM Exp + Clin (0.759) and NCA Exp + Clin (0.757), with the other models


further behind. The differences were not significant for rorS + Clin compared to M2EFM Exp + Clin (p = 6.76e-02) but were for NCA Exp + Clin, Clin, and Cox Exp + Clin (p = 3.99e-02,


3.26e-11, and 5.80e-13 respectively). However, the M2EFM model was the most consistent performer over all the validation data. On the Terunuma external validation data, M2EFM Exp + Clin


validated better than any other model, with a median C-index of 0.721, followed by Cox Exp + Clin at 0.720, while the other models trailed behind. The M2EFM Exp + Clin C-index scores were


not significantly greater than Cox Exp + Clin (p = 8.17e-01), but were significantly greater than rorS + Clin (p < 2.20e-16) and NCA Exp + Clin (p < 2.20e-16). In the Kao data, the


best performer was Cox Exp (0.726), followed by Cox Exp + Clin (0.725), then M2EFM Exp + Clin (0.697), followed by the other models. The C-index scores for Cox Exp were significantly


stronger than M2EFM Exp + Clin (p = 3.14e-16), but M2EFM Exp + Clin scores were still significantly stronger than either rorS + Clin (p < 2.20e-16) or NCA Exp + Clin (p < 2.20e-16).


FULL TRAINING DATA MODEL FOR OVERALL SURVIVAL We used the full training data set (TCGA), to build a final overall survival model and validated it on Terunuma and Kao. For Terunuma, the


C-index was 0.724, 95% CI [0.499, 0.874]. For Kao, the C-index was 0.703, 95% CI [0.582, 0.800]. These scores were very similar to their median C-index scores. The hazard ratios for this


model are given in Table S7. We further divided the samples into three groups based on their risk score, with low risk below the 25th percentile, medium risk from the 25th to the 75th


percentile, and high risk above the 75th percentile, and created Kaplan-Meier plots of the survival of the risk groups in each dataset (Fig. 3). Hazard ratios comparing risk groups for the


risk score are shown in Table 1. We generated calibration curves for the proportional hazards models based on the full training data model (Fig. S1), using the _pec_ package for R25, as well


as the integrated Brier score (IBS)26, a quantitative measure of calibration or prediction error. For M2EFM Exp + Clin, the IBS on Kao was 0.060 and on Terunuma was 0.166. For Cox Exp + 


Clin, the IBS on Kao was 0.055 and on Terunuma was 0.161, indicating that the Cox Exp + Clin models were better calibrated over all time points. ER STATUS Most prognostic markers for breast


cancer have limited utility for patients with ER-negative tumors27. Unfortunately, ER status was not annotated in the Kao data, and there were too few samples in Terunuma to effectively test


the prognostic utility of our marker on ER-negative patients. Therefore, we inferred the ER, PR, and HER2 status of samples in Kao by clustering samples using expression probes associated


with each receptor status using a Gaussian mixture model to capture each bimodal distribution28. We then assessed the relative prognostic power of the M2EFM models in samples that varied by


ER or triple negative status (Table 2). The overall C-index was 0.703, but for ER-positive patients it was 0.665, for ER-negative patients it was 0.727, and for triple negative patients it


was 0.682. SUBTYPES Given that the covariates in our model are unpenalized, we needed to select a limited number of clinical covariates to avoid overfitting the model. This resulted in


models without PAM50 intrinsic subtypes, ER, PR, or HER2 status, which are all known to be associated with breast cancer prognosis. Therefore, we regressed the M2EFM molecular risk score


(calculated for all TCGA samples) on these variables and calculated the variance inflation factor (VIF), a commonly used measure of the multicollinearity between variables. The VIF was


approximately 1.20. A VIF between 1 and 5 is generally regarded to imply only moderate multicollinearity. Therefore, the molecular risk score from M2EFM contains prognostic information


independent of breast cancer subtype. DISTANT RECURRENCE-FREE SURVIVAL (DRFS) MODELS On both the testing data and the external validation data, M2EFM was the most prognostic model of DRFS


(Fig. 4 and Table S8). On the testing data, the M2EFM Exp + Clin had a median C-index score of 0.722, followed by rorS + Clin (0.707), then M2EFM Exp (0.705), and Cox Exp + Clin (0.701),


with the other models below 0.7. In the external validation data, M2EFM Exp + Clin and M2EFM Exp were the most prognostic at 0.752 and 0.741 respectively, although Cox Exp + Clin and Cox Exp


were strong at 0.735 and 0.729 respectively. Finally, an M2EFM Exp + Clin model trained on the full training data (Hatzis1) achieved a C-index of 0.766, 95% CI [0.604, 0.875] on the


external validation data (Hatzis2). The hazard ratios for this model are given in Table S9. Calibration curves at 3 years are shown in Fig. S2. The IBS score for M2EFM Exp + Clin on Hatzis2


was 0.079. For Cox Exp + Clin it was 0.083, demonstrating that the M2EFM Exp + Clin model is better calibrated. NEGATIVE PREDICTIVE VALUE FOR NODE-NEGATIVE CASES One interest in genomic


cancer biomarkers is for determining cases with a low risk of recurrence who may not benefit from adjuvant chemotherapy29,30. For these cases, it is important that the marker has a high


negative predictive value (NPV). Therefore, we assessed the NPV of the M2EFM model in node-negative samples using the _timeROC_ package for R, which uses a time-dependent AUC and can handle


censored survival times. A cutoff for the risk score was determined in Hatzis1 that equalized the sensitivity and specificity in predicting DRFS for node-negative cases at 3 years (the


median follow-up in these data). This cutoff was applied to node-negative cases in Hatzis2 to determine the positive predictive value (PPV) and NPV. At the 3-year follow up point, the PPV


was 0.49 and the NPV was 0.95. We predicted 39 node-negative cases would not have a distant metastasis by 3 years and we were correct for 36. We predicted 12 node-negative cases would have a


distant metastasis and were correct for 7. PATHOLOGIC COMPLETE RESPONSE (PCR) MODELS The final set of models are for pCR following neoadjuvant treatment with Taxane/Anthracycline in


HER2-negative breast cancer. Results are shown in Fig. 5 and Table S10. For these data, the clinical models were not very informative. The reference logistic-Ridge regression model performed


best on the testing data, with Logistic Exp + Clin scoring a median AUC of 0.767 and Logistic Exp at 0.763 compared to M2EFM Exp + Clin with a median AUC of 0.739 and M2EFM Exp at 0.733.


Nevertheless, M2EFM performed best on the external validation data, with M2EFM Exp + Clin at a median AUC of 0.716 and M2EFM Exp at 0.727 compared to Logistic Exp + Clin at a median AUC of


0.707 and Logistic Exp at 0.709. Finally, an M2EFM Exp + Clin model trained on the full training data (Hatzis1) achieved an AUC of 0.720, 95% CI [0.639, 0.802] on the external validation


data (Hatzis2). The odds ratios for this model are given in Table S11. Calibration curves appear in Fig. S3. The mean Brier score for M2EFM Exp + Clin on Hatzis2 was 0.184 and for Cox Exp + 


Clin it was 0.202, showing that M2EFM Exp + Clin is better calibrated. FUNCTIONAL ANALYSIS PROTEIN NETWORK TOPOLOGY-BASED ANALYSIS Of the 115 m2eGenes, 55 were found to interact in a


sub-network of protein-protein interactions using WebGestalt. The sub-network included 65 genes in all and was enriched for 184 biological process terms, demonstrating that many of the genes


in our signature are functionally related. The top result was for lymphocyte activation (p = 2.50e-12), which included 23 genes from the sub-network and is about 8.96 times more overlap


than would be expected by chance. The sub-network is shown in Fig. S4, visualized using Cytoscape31. BIOLOGICAL PROCESS, DISEASE, AND DRUG ENRICHMENT Using Gene Ontology enrichment testing,


our gene set shows a clear enrichment for immune functions (Fig. 6). The results further demonstrate that expression tends to be repressed in these pathways in the samples with the highest


M2EFM risk scores and more moderately repressed in samples with medium M2EFM risk scores. Many of the genes in our gene signature are already known to be associated with breast cancer (such


as _ESR1_32 and _EGFR_33). We calculated the enrichment of genes in our signature in the Catalog of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census23, downloaded on 9/13/2016.


Enrichment would suggest a causal link between genes in our signature and the progression of the disease. There were 601 genes in the catalog, 515 of which were in our gene expression


profiles. The enrichment for causally implicated cancer genes in our gene signature was about 4 times greater than expected by chance (OR: 3.87, p = 6.09e-06). The following 18 genes used in


our M2EFM models appear in the Cancer Gene Census: _AFF3, BCL11A, CD79A, CD79B, EGFR, ERBB4, ESR1, FOXA1, GATA3, IRF4, ITK, LCK, MYB, POU2AF1, PRF1, PTPRC, RET, TNFRSF17_. We also examined


the known targets of FDA approved breast cancer drugs using the DrugBank database version 5.024. There were 43 targets among the genes in our expression profiles. Of these, 4 genes were


found in our signature (_EGFR, ESR1, MAPT_, and _PRKCB_). Thus, our signature was enriched for known targets of breast cancer treatments 10 times more than would be expected by chance (OR:


10.00, p = 1.03e-03). These drugs included paclitaxel (_MAPT_), raloxifene (_ESR1_), toremifene (_ESR1_), fulvestrant (_ESR1_), trastuzumab (_EGFR_), tamoxifen (_ESR1_, _PRKCB_), docetaxel


(_MAPT_), and lapatinib (_EGFR_). DISCUSSION In this work, we have tried to address common shortcomings in molecular biomarker studies, have applied the REMARK guidelines in our


assessment34, and validated a data-integrated approach to creating prognostic and predictive cancer models. The M2EFM approach shows remarkable promise for developing prognostic and


predictive cancer biomarkers using both clinical and molecular data. For overall survival, distant recurrence free survival, and pathologic complete response in breast cancer it achieved a


median C-index or AUC of at least 0.7 in all internal and external validation datasets (including 5 different cohorts). The fact that our model can reliably attain this level of


discrimination across platforms and populations is promising indeed. Unsurprisingly, one of our comparison methods, a modified Cox-Ridge approach, performed about as well as M2EFM in terms


of discriminative power. M2EFM depends on Cox-Ridge itself, and the Cox-Ridge models contain a superset of the predictors in M2EFM. Nevertheless, M2EFM models in many cases outperform the


Cox-Ridge models, and more importantly, M2EFM models depend on only 115 genes, as opposed to 10990 in the Cox-Ridge models. Therefore, clinical use for Cox-Ridge would require a


transcriptome-wide gene expression assay, and the situation would be even more extreme if the model with methylation values was used. Additionally, for DRFS and pCR, M2EFM was better


calibrated than Cox-Ridge, suggesting that Cox-Ridge would not be as useful clinically. We also demonstrated that M2EFM has a high NPV for node-negative cases, while maintaining a meaningful


PPV, suggesting M2EFM might be useful for identifying patients who could avoid adjuvant chemotherapy. Finally, M2EFM has functional relevance, increasing the possibility that it could serve


as the basis for clinical tests that determine eligibility for treatment. Thus, using three different outcomes and five different data sets, we maintained a consistent level of accuracy. In


conjunction with our prior study of M2EFM applied to ccRCC11, these results consistently show that integrating both molecular and clinical data results in more prognostic models than either


data type alone. Our results strongly suggest that a data-integrated approach that models genes affected by regulatory changes in cancer is more reliable than many extant approaches.


Furthermore, our results suggest our method builds a model that is prognostic in the case of ER-negative samples and possibly triple negative samples (although unfortunately few samples were


available to test this hypothesis). Given the lack of approved prognostic indicators for ER-negative patients, partly due to its greater proliferative rate and the reliance of many tests on


proliferative genes27, and the greater difficulty in treating triple negative disease, a marker, such as ours, that is largely independent of proliferation is desirable27,35,36. Our results


also show that our approach does more than build a reliable prognostic model: it identifies biologically relevant pathways and, importantly, possible therapeutic targets and genes involved


in cancer progression. Our gene signature contains genes that are functionally involved in cancer and that are targeted by current breast cancer treatments. Perhaps more interestingly, given


recent advances in immunotherapy37,38, we note the repression of immune related genes in our signature for patients predicted to be at high risk. There has been growing interest in finding


signatures that are independent of proliferation and ER status that might also be predictive of possible immune therapies27. Although we provide evidence of the efficacy of M2EFM, there


remains one limitation before attempting to translate these results to the clinic: calibration of the model to a single platform. To obtain external validation data, we performed


cross-platform normalization. Having established the utility of the M2EFM approach, it remains to train the model on a high-quality dataset generated on the platform that will be used for


any clinical assays. CONCLUSIONS M2EFM can predict risk based on data integrated from multiple sources, taking advantage of tried-and-true prognostic factors while incorporating data on


relevant relationships between molecular data types at the individual level. For breast cancer, this resulted in models based on several genes with known functional associations to cancer


progression and which are targeted by currently available drugs, and therefore, may result in a useful tool for clinical decision support. This also suggests that other genes in the model


should be examined as possible targets, because many of the genes are functionally connected. Given the strength of our findings, we believe that further development of this technique and


application to other cancers is warranted. REFERENCES * Cho, S. H., Jeon, J. & Kim, S. I. Personalized Medicine in Breast Cancer: A Systematic Review. _J Breast Cancer_ 15, 265–272,


https://doi.org/10.4048/jbc.2012.15.3.265 (2012). Article  PubMed  PubMed Central  Google Scholar  * Weinstein, J. N. _et al_. The Cancer Genome Atlas Pan-Cancer analysis project. _Nat


Genet_ 45, 1113–1120, https://doi.org/10.1038/ng.2764 (2013). Article  PubMed  PubMed Central  Google Scholar  * Zhang, J. J. _et al_. International Cancer Genome Consortium Data Portal-a


one-stop shop for cancer genomics data. _Database-Oxford_ https://doi.org/10.1093/database/bar026 (2011). * Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene


expression and hybridization array data repository. _Nucleic Acids Res_ 30, 207–210, https://doi.org/10.1093/nar/30.1.207 (2002). Article  CAS  PubMed  PubMed Central  Google Scholar  *


Parker, J. S. _et al_. Supervised risk predictor of breast cancer based on intrinsic subtypes. _J Clin Oncol_ 27, 1160–1167, https://doi.org/10.1200/JCO.2008.18.1370 (2009). Article  PubMed


  PubMed Central  Google Scholar  * Paik, S. _et al_. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. _N Engl J Med_ 351, 2817–2826,


https://doi.org/10.1056/NEJMoa041588 (2004). Article  CAS  PubMed  Google Scholar  * Coates, A. S. _et al_. Tailoring therapies-improving the management of early breast cancer: St Gallen


International Expert Consensus on the Primary Therapy of Early Breast Cancer 2015. _Ann Oncol_ 26, 1533–1546, https://doi.org/10.1093/annonc/mdv221 (2015). Article  CAS  PubMed  PubMed


Central  Google Scholar  * Kern, S. E. Why Your New Cancer Biomarker May Never Work: Recurrent Patterns and Remarkable Diversity in Biomarker Failures. _Cancer Res_ 72, 6097–6101,


https://doi.org/10.1158/0008-5472.Can-12-3232 (2012). Article  CAS  PubMed  PubMed Central  Google Scholar  * Yuan, Y. _et al_. Assessing the clinical utility of cancer genomic and proteomic


data across tumor types. _Nat Biotechnol_ 32, 644−+, https://doi.org/10.1038/nbt.2940 (2014). Article  PubMed  PubMed Central  Google Scholar  * Ritchie, M. D., Holzinger, E. R., Li, R. W.,


Pendergrass, S. A. & Kim, D. Methods of integrating data to uncover genotype-phenotype interactions. _Nat Rev Genet_ 16, 85–97, https://doi.org/10.1038/nrg3868 (2015). Article  CAS 


PubMed  Google Scholar  * Thompson, J. A. & Marsit, C. J. A Methylation-to-Expression Feature Model for Generating Accurate Prognostic Risk Scores and Identifying Disease Targets in


Clear Cell Kidney Cancer. _Pac Symp Biocomput_ 22, 509–520, https://doi.org/10.1142/9789813207813_0047 (2017). PubMed  Google Scholar  * Hatzis, C. _et al_. A Genomic Predictor of Response


and Survival Following Taxane-Anthracycline Chemotherapy for InvasiveBreast Cancer. _Jama-J Am Med Assoc_ 305, 1873–1881, https://doi.org/10.1001/jama.2011.593 (2011). Article  CAS  Google


Scholar  * Teschendorff, A. E. _et al_. DNA methylation outliers in normal breast tissue identify field defects that are enriched in cancer. _Nat Commun_ 7


https://doi.org/10.1038/ncomms10478 (2016). * Fortin, J. P. _et al_. Functional normalization of 450k methylation array data improves replication in large cancer studies. _Genome Biol_ 15


https://doi.org/10.1186/s13059-014-0503-2 (2014). * Aryee, M. J. _et al_. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.


_Bioinformatics_ 30, 1363–1369, https://doi.org/10.1093/bioinformatics/btu049 (2014). Article  CAS  PubMed  PubMed Central  Google Scholar  * Thompson, J. A., Tan, J. & Greene, C. S.


Cross-platform normalization of microarray and RNA-seq data for machine learning applications. _Peerj_ 4, https://doi.org/10.7717/peerj.1621 (2016). * Terunuma, A. _et al_. MYC-driven


accumulation of 2-hydroxyglutarate is associated with breast cancer prognosis. _J Clin Invest_ 124, 398–412, https://doi.org/10.1172/Jci71180 (2014). Article  CAS  PubMed  Google Scholar  *


Kao, K. J., Chang, K. M., Hsu, H. C. & Huang, A. T. Correlation of microarray-based breast cancer molecular subtypes and clinical outcomes: implications for treatment optimization. _Bmc


Cancer_ 11, https://doi.org/10.1186/1471-2407-11-143 (2011). * Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. _J


Stat Softw_ 33, 1–22 (2010). Article  PubMed  PubMed Central  Google Scholar  * Martinez-Ledesma, E., Verhaak, R. G. W. & Trevino, V. Identification of a multi-cancer gene expression


biomarker for cancer clinical outcomes using a network-based algorithm. _Sci Rep-Uk_ 5, https://doi.org/10.1038/srep11966 (2015). * Gendoo, D. M. A. _et al_. Genefu: an R/Bioconductor


package for computation of gene expression-based signatures in breast cancer. _Bioinformatics_ 32, 1097–1099, https://doi.org/10.1093/bioinformatics/btv693 (2016). Article  CAS  PubMed 


Google Scholar  * Wang, J., Duncan, D., Shi, Z. & Zhang, B. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. _Nucleic Acids Res_ 41, W77–W83,


https://doi.org/10.1093/nar/gkt439 (2013). Article  PubMed  PubMed Central  Google Scholar  * Forbes, S. A. _et al_. COSMIC: exploring the world’s knowledge of somatic mutations in human


cancer. _Nucleic Acids Res_ 43, D805–D811, https://doi.org/10.1093/nar/gku1075 (2015). Article  CAS  PubMed  Google Scholar  * Wishart, D. S. _et al_. DrugBank: a comprehensive resource for


_in silico_ drug discovery and exploration. _Nucleic Acids Res_ 34, D668–D672, https://doi.org/10.1093/nar/gkj067 (2006). Article  CAS  PubMed  Google Scholar  * Mogensen, U. B., Ishwaran,


H. & Gerds, T. A. Evaluating Random Forests for Survival Analysis Using Prediction Error Curves. _J Stat Softw_ 50, 1–23 (2012). Article  PubMed  PubMed Central  Google Scholar  * Graf,


E., Schmoor, C., Sauerbrei, W. & Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. _Stat Med_ 18, 2529–2545 (1999). Article  CAS  PubMed 


Google Scholar  * Gyorffy, B. _et al_. 3 Multigene prognostic tests in breast cancer: past, present, future. _Breast Cancer Research_ 17 https://doi.org/10.1186/s13058-015-0514-2 (2015). *


Lehmann, B. D. _et al_. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. _J Clin Invest_ 121, 2750–2767,


https://doi.org/10.1172/Jci45014 (2011). Article  CAS  PubMed  PubMed Central  Google Scholar  * Duffy, M. J. _et al_. Clinical use of biomarkers in breast cancer: Updated guidelines from


the European Group on Tumor Markers (EGTM). _Eur J Cancer_ 75, 284–298, https://doi.org/10.1016/j.ejca.2017.01.017 (2017). Article  CAS  PubMed  Google Scholar  * Loncaster, J. _et al_.


Impact of Oncotype DX breast Recurrence Score testing on adjuvant chemotherapy use in early breast cancer: Real world experience in Greater Manchester, UK. _Ejso-Eur J Surg Onc_ 43, 931–937,


https://doi.org/10.1016/j.ejso.2016.12.010 (2017). Article  CAS  Google Scholar  * Shannon, P. _et al_. Cytoscape: a software environment for integrated models of biomolecular interaction


networks. _Genome Res_ 13, 2498–2504, https://doi.org/10.1101/gr.1239303 (2003). Article  CAS  PubMed  PubMed Central  Google Scholar  * Masuda, H. _et al_. Role of epidermal growth factor


receptor in breast cancer. _Breast Cancer Res Treat_ 136, 331–345, https://doi.org/10.1007/s10549-012-2289-9 (2012). Article  CAS  PubMed  Google Scholar  * Lauring, J. & Wolff, A. C.


Evolving Role of the Estrogen Receptor as a Predictive Biomarker: ESR1 Mutational Status and Endocrine Resistance in Breast Cancer. _J Clin Oncol_ 34, 2950-+,


https://doi.org/10.1200/Jco.2016.68.4720 (2016). Article  PubMed  Google Scholar  * McShane, L. M. _et al_. REporting recommendations for tumor MARKer prognostic studies (REMARK). _Breast


Cancer Res Tr_ 100, 229–235, https://doi.org/10.1007/s10549-006-9242-8 (2006). Article  Google Scholar  * Shao, F., Sun, H. & Deng, C. X. Potential therapeutic targets of triple-negative


breast cancer based on its intrinsic subtype. _Oncotarget_ 8, 73329–73344, https://doi.org/10.18632/oncotarget.20274 (2017). PubMed  PubMed Central  Google Scholar  * Collignon, J.,


Lousberg, L., Schroeder, H. & Jerusalem, G. Triple-negative breast cancer: treatment challenges and solutions. _Breast Cancer (Dove Med Press)_ 8, 93–107,


https://doi.org/10.2147/BCTT.S69488 (2016). Google Scholar  * Liu, Z., Li, M., Jiang, Z. & Wang, X. A Comprehensive Immunologic Portrait of Triple-Negative Breast Cancer. _Transl Oncol_


11, 311–329, https://doi.org/10.1016/j.tranon.2018.01.011 (2018). Article  CAS  PubMed  Google Scholar  * Tolba, M. F. & Omar, H. A. Immunotherapy, an evolving approach for the


management of triple negative breast cancer: Converting non-responders to responders. _Crit Rev Oncol Hematol_. https://doi.org/10.1016/j.critrevonc.2018.01.005 (2018). PubMed  Google


Scholar  Download references ACKNOWLEDGEMENTS The analyses described in this report were supported by NIH grants R01ES022223, P30CA138292, P30ES019776, and R01DE022772. AUTHOR INFORMATION


AUTHORS AND AFFILIATIONS * Department of Biostatistics, University of Kansas Medical Center, Kansas City, USA Jeffrey A. Thompson * Department of Epidemiology, Geisel School of Medicine at


Dartmouth College, Hanover, USA Brock C. Christensen * Department of Environmental Health, Rollins School of Public Health at Emory University, Atlanta, USA Carmen J. Marsit Authors *


Jeffrey A. Thompson View author publications You can also search for this author inPubMed Google Scholar * Brock C. Christensen View author publications You can also search for this author


inPubMed Google Scholar * Carmen J. Marsit View author publications You can also search for this author inPubMed Google Scholar CONTRIBUTIONS J.T. and C.M. conceived of and designed the


study. J.T. collected the data, performed the analyses, and drafted the article. J.T., C.M. and B.C. performed critical revision. All authors read and approved the final manuscript.


CORRESPONDING AUTHOR Correspondence to Jeffrey A. Thompson. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER'S NOTE:


Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. ELECTRONIC SUPPLEMENTARY MATERIAL SUPPLEMENTARY MATERIAL RIGHTS AND


PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any


medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The


images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not


included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly


from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Thompson, J.A.,


Christensen, B.C. & Marsit, C.J. Methylation-to-Expression Feature Models of Breast Cancer Accurately Predict Overall Survival, Distant-Recurrence Free Survival, and Pathologic Complete


Response in Multiple Cohorts. _Sci Rep_ 8, 5190 (2018). https://doi.org/10.1038/s41598-018-23494-0 Download citation * Received: 13 December 2017 * Accepted: 13 March 2018 * Published: 26


March 2018 * DOI: https://doi.org/10.1038/s41598-018-23494-0 SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get shareable link Sorry, a


shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative