Play all audios:
ABSTRACT Prognostic biomarkers serve a variety of purposes in cancer treatment and research, such as prediction of cancer progression, and treatment eligibility. Despite growing interest in
multi-omic data integration for defining prognostic biomarkers, validated methods have been slow to emerge. Given that breast cancer has been the focus of intense research, it is amenable to
studying the benefits of multi-omic prognostic models due to the availability of datasets. Thus, we examined the efficacy of our methylation-to-expression feature model (M2EFM) approach to
combining molecular and clinical predictors to create risk scores for overall survival, distant metastasis, and chemosensitivity in breast cancer. Gene expression, DNA methylation, and
clinical variables were integrated via M2EFM to build models of overall survival using 1028 breast tumor samples and applied to validation cohorts of 61 and 327 samples. Models of distant
recurrence-free survival and pathologic complete response were built using 306 samples and validated on 182 samples. Despite different populations and assays, M2EFM models validated with
good accuracy (C-index or AUC ≥ 0.7) for all outcomes and had the most consistent performance compared to other methods. Finally, we demonstrated that M2EFM identifies functionally relevant
genes, which could be useful in translating an M2EFM biomarker to the clinic. SIMILAR CONTENT BEING VIEWED BY OTHERS MULTI-OMICS ANALYSES IDENTIFY _HSD17B4_ METHYLATION-SILENCING AS A
PREDICTIVE AND RESPONSE MARKER OF HER2-POSITIVE BREAST CANCER TO HER2-DIRECTED THERAPY Article Open access 23 September 2020 MACHINE LEARNING APPROACHES REVEAL METHYLATION SIGNATURES
ASSOCIATED WITH PEDIATRIC ACUTE MYELOID LEUKEMIA RECURRENCE Article Open access 06 May 2025 INTEGRATION OF 101 MACHINE LEARNING ALGORITHM COMBINATIONS TO UNVEIL M6A/M1A/M5C/M7G-ASSOCIATED
PROGNOSTIC SIGNATURE IN COLORECTAL CANCER Article Open access 18 February 2025 INTRODUCTION The extreme heterogeneity of breast cancer makes one-size-fits all treatments difficult to achieve
and although a number of targeted therapies are available1 personalized therapies work only for a limited number of patients. Better methods are needed to rapidly identify patients that can
benefit from specific treatments, reveal new targets for historically difficult to treat patients, and provide accurate prognoses to patients to enable better life planning. Large scale
collection of cancer genomic data, such as The Cancer Genome Atlas (TCGA)2 and the International Cancer Genome Consortium3, and expression repositories such as NCBI’s Gene Expression
Omnibus4 have enabled developments in prognostic and predictive cancer models. In addition to improved understanding of how cancer disrupts normal cellular function, such data have led to
the development of biomarker panels, such as the PAM50 risk of recurrence score (rorS)5 and the 21 gene Oncotype DX6, which are both now used for clinical decision making in the treatment of
breast cancer7. Furthermore, to the extent that such tests are able to predict the occurrence of distant metastases, they may enable early prevention strategies to be put in place. Despite
such promising developments, the goals of precision medicine remain elusive due to specific limitations in molecular biomarker studies. These include limited understanding of the optimal
genomic features to utilize, lack of validation on external data, and inconsistent use of best practices for creating biomarkers with the best chances of entering clinical practice8,9. There
has been growing interest in the integration of molecular and traditional clinicopathological markers to achieve higher levels of prognostic power and utility and small but significant
gains in prognostic accuracy have been observed through this approach5,10. To the extent that these sources are independently prognostic, they should result in higher accuracy than either
source alone and better understanding of prognostic features by revealing holistic relationships in the data. As yet, few prognostic or predictive models are designed to take advantage of
multiple data types. In this work, we apply our own previously developed data integration approach for combining gene expression and DNA-methylation data to model the impact of regulatory
disruption on breast cancer prognosis and prediction while also incorporating prognostic clinical variables11. This study overcomes prior limitations by creating a model that can identify
biologically plausible targets, using best practices in assessing the model8,9, and by validating the actual model (not just the gene signature) in multiple, external datasets, representing
different demographics and disease distributions. METHODS M2EFM Previously, we introduced a data-integrated modeling approach we call Methylation-to-Expression Feature Model (M2EFM)11 and
applied it in the case of clear cell renal cell carcinoma (ccRCC). The basis of this approach is to pre-filter loci to those that are differentially methylated between matched tumor and
adjacent, pathologically normal tissue and to associate the methylation status of those loci with significant differences in gene expression in the tumor samples. The differentially
methylated loci are then called m2eQTL (for methylation-to-expression QTL) and the genes are called m2eGenes. An initial model is built from m2eQTL and m2eGenes with the largest effects to
generate risk scores for all samples using Cox-Ridge regression (using the exponential function of the weighted sum of the model features). A final model is built from the risk scores in
conjunction with clinical variables and a final risk score is generated from the weighted sum of the features of this model. Here we build M2EFM models of overall survival (OS), distant
recurrence-free survival (DRFS), and pathologic complete response (pCR) in breast cancer, where OS is defined as time from initial diagnosis to death from any cause, DRFS is defined as time
from diagnosis to distant recurrence, death from cancer, or death from unknown cause, and pCR is defined as a patient being recorded as having a pathologic complete response or a residual
cancer burden class of RCB 0/I as in Hatzis, _et al_.12. DATA A full description of data pre-processing appears in the Supplementary Methods. For this work, we supplied a list of probes to
M2EFM from a study of CpG loci involved in epigenetic field effects in breast cancer based on data available in GEO (GSE69914). The loci were delineated CpG sites exhibiting subtle
differences in methylation in samples from breast tissue from non-diseased healthy subjects compared to samples from tumor-adjacent normal breast cancer cases, that are also shown to exhibit
differential methylation in tumor tissue itself13. Therefore, they represent regions that may be involved in early loss of regulation in breast cancer. Aside from the probe list, five
different breast cancer cohorts were used in this work. The first dataset is referred to as TCGA, was used to train and test an OS model for breast cancer, and consisted of gene expression
and DNA methylation profiles created by TCGA, using the Illumina HiSeq 2000 and Illumina Infinium HumanMethylation 450 platforms, respectively. Methylation data were functionally
normalized14 using the _minfi_ package15 in R. Expression data were TDM normalized, which makes the distributions between microarray and RNA-seq datasets similar, making them more comparable
to the validation datasets16. The samples are described in Table S1. After pre-processing, we retained 1028 samples. The second dataset is referred to as TERUNUMA, is used for external
validation of the OS model, and contains 61 tumor samples with clinical and survival data, downloaded from ArrayExpress (E-GEOD-39004)17. These data were assayed on the Affymetrix GeneChip
Human Gene 1.0 ST Array. They were normalized and then batch corrected in conjunction with the following dataset. A description of the samples is in Table S2. The third dataset is referred
to as KAO, is used as a second external validation dataset for the OS model, and contains 327 tumor samples, downloaded from ArrayExpress (E-GEOD-20685)18. These data were assayed on the
Affymetrix GeneChip Human Genome U133 Plus 2.0. A description of the samples appears in Table S2. The fourth dataset is referred to as HATZIS1, was used to train and test a model of DRFS and
another model of pCR, and includes HER2-negative breast cancer cases with neo-adjuvant treatment by taxane-anthracycline (followed by endocrine therapy for ER-positive cases)12. These data
were assayed on the Affymetrix Human Genome U133A Array. They were downloaded from GEO (GSE25055), and were normalized and then batch corrected in conjunction with the following dataset. A
description of the samples appears in Table S3. After pre-processing we retained 306 samples. The final dataset is referred to as HATZIS2, was used for external validation of both the DRFS
and pCR models of (HER2-negative) breast cancer, and includes HER2-negative breast cancer cases with neo-adjuvant treatment by taxane-anthracycline (followed by endocrine therapy for
ER-positive cases)12. These data were assayed on the Affymetrix Human Genome U133A Array. They were downloaded from GEO (GSE25065). A description of the samples appears in Table S3. After
pre-processing we retained 182 samples. In order to reduce categories with very few samples and the number of variables in the model, all AJCC tumor stage classifications were reduced to
simple Stage I, II, III, or IV. EXPERIMENTAL DESIGN Three different types of models were built: (1) a model of overall survival (OS), (2) a model of distant recurrence-free survival (DRFS),
(3) a model of pathologic complete response (pCR) to neoadjuvant taxane/anthracycline treatment. For all model types, we built 100 different models using different random splits of the data,
divided into 70% training and 30% testing data sets, to assess how consistently a good model could be found given different training data. We assessed the performance of the models built
with and without clinical variables, with clinical variables alone, and separately to include external validation data. The external validation data could only be tested on models without
DNA methylation values, because methylation data were not available (an overview is shown in Fig. 1). For external validation, a final model using all training data was also built. M2EFM OS
and DRFS models were compared with Cox-Ridge regression, using the _glmnet_ package19 for R, without pre-selection of genes or probes (although probes were pre-filtered by median absolute
deviation to reduce them to a number of the most variable probes roughly equal to the number of genes). The M2EFM pCR model was compared to a logistic-Ridge regression model, again using
_glmnet_. Furthermore, the M2EFM OS and DRFS models were compared to a model incorporating the PAM50 risk of recurrence score (rorS)5 with clinical variables. Finally, the M2EFM model was
compared to a model built from a gene signature obtained by the network clinical association (NCA) algorithm20, which is also based on data integration. Performance of the OS and DRFS models
was evaluated using concordance or C-index, a measure of how likely it is, in any given pair of individuals, that the individual with the higher risk score has the event first. For the pCR
model, the performance was measured using area under the receiver operating characteristic curve (AUC). Significance tests for comparisons used the two-tailed Wilcoxon signed-rank test.
Clinical variables were chosen using the least absolute shrinkage and selection operator (LASSO) for each outcome. Candidate variables included AJCC stage, patient age at diagnosis, ER
status, PR status, HER2 status, and PAM50 subtype (calculated using the _genefu_ package for R)21. Selected variables include AJCC stage and patient age at diagnosis for all outcomes,
although stage was reduced to simply presence or absence of stage III cancer for DRFS and pCR outcomes due to the absence of stage IV cases and small number of stage I cases for these
outcomes. The m2eGenes were used to analyze the functional relevance of our approach using the online tool WEB-based GEne SeT AnaLysis Toolkit (WebGestalt 2017)22. Using WebGestalt, we
performed protein network topology-based analysis and enrichment analysis for GO biological process terms. Furthermore, we tested the enrichment of m2eGenes in a database of genes with
causal cancer connections, the Catalog of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census23, using Fisher’s exact test. Finally, we tested the enrichment of genes in our signature in
genes known to be targeted by breast cancer drugs listed in the DrugBank database24, again using Fisher’s exact test. DATA AVAILABILITY The datasets generated and/or analyzed during the
current study are available in the following repositories: * TCGA: UCSC Cancer Genomics Browser * Terunuma: ArrayExpress (E-GEOD-39004) * Kao: ArrayExpress (E-GEOD-20685) * Hatzis1: GEO
(GSE25055) * Hatzis2: GEO (GSE25065) * Cancer Genes: COSMIC Cancer Gene Census * Cancer Drug Targets: DrugBank M2EFM is available as a package for R at https://github.com/jeffreyat/M2EFM.
RESULTS We applied M2EFM to the TCGA dataset to select dysregulated probes and genes for our models. The lists of probes and genes selected by M2EFM are shown in Table S4. OVERALL SURVIVAL –
META-DIMENSIONAL MODELS WITHOUT EXTERNAL VALIDATION DATA Results of the meta-dimensional double-integrated model, (Fig. 1 and fully described in the Supplemental Methods) are shown in Fig.
2A and Table S5, allowing a comparison to how frequently equivalent strength results could be obtained from random gene sets. Across the 100 random splits of the training and testing data,
the double integrated model containing methylation values from m2eQTL, expression values from m2eGenes, and clinical covariates (M2EFM Meth + Exp + Clin) had the highest median C-index at
0.790, followed by the model incorporating PAM50 rorS values for the samples and clinical variables (rorS + Clin) at 0.775, then the clinical variables only model (Clin) at 0.753, and
finally the Cox-Ridge regression with an additional regression integrating unpenalized clinical variables (Cox Meth + Exp + Clin) at 0.728. The C-index scores across the splits were
statistically significantly greater for the M2EFM Meth + Exp + Clin model than rorS + Clin, Clin, or Cox Meth + Exp + Clin (p = 1.84e-08, 1.26e-12, and 3.23e-16 respectively). For models
built from only the molecular risk score, M2EFM Meth + Exp had the highest median C-index although scores tended to be lower than models incorporating clinical data. M2EFM Meth + Exp results
were significantly greater than rorS and Cox Meth + Exp (p = 3.44e-11, and 1.13e-02 respectively). OVERALL SURVIVAL - EXPRESSION ONLY MODELS WITH EXTERNAL VALIDATION DATA To allow external
validation of our models using other datasets we next focused on breast cancer OS, building models using only expression data, with the m2eQTL discovered previously. In such cases, we
substitute the _cis_-m2eGenes (genes proximal to the differentially methylated loci) for the associated methylation values. We also added a model built from a gene signature discovered using
NCA (except for two genes: _HES5_ and _NKIRAS1_, which were not in our profiles). The median C-index for each model on each dataset is shown in Table S6, and the distribution of C-indices
is shown in Fig. 2B. On TCGA, the median C-index for rorS + Clin was the highest (0.762), followed closely by M2EFM Exp + Clin (0.759) and NCA Exp + Clin (0.757), with the other models
further behind. The differences were not significant for rorS + Clin compared to M2EFM Exp + Clin (p = 6.76e-02) but were for NCA Exp + Clin, Clin, and Cox Exp + Clin (p = 3.99e-02,
3.26e-11, and 5.80e-13 respectively). However, the M2EFM model was the most consistent performer over all the validation data. On the Terunuma external validation data, M2EFM Exp + Clin
validated better than any other model, with a median C-index of 0.721, followed by Cox Exp + Clin at 0.720, while the other models trailed behind. The M2EFM Exp + Clin C-index scores were
not significantly greater than Cox Exp + Clin (p = 8.17e-01), but were significantly greater than rorS + Clin (p < 2.20e-16) and NCA Exp + Clin (p < 2.20e-16). In the Kao data, the
best performer was Cox Exp (0.726), followed by Cox Exp + Clin (0.725), then M2EFM Exp + Clin (0.697), followed by the other models. The C-index scores for Cox Exp were significantly
stronger than M2EFM Exp + Clin (p = 3.14e-16), but M2EFM Exp + Clin scores were still significantly stronger than either rorS + Clin (p < 2.20e-16) or NCA Exp + Clin (p < 2.20e-16).
FULL TRAINING DATA MODEL FOR OVERALL SURVIVAL We used the full training data set (TCGA), to build a final overall survival model and validated it on Terunuma and Kao. For Terunuma, the
C-index was 0.724, 95% CI [0.499, 0.874]. For Kao, the C-index was 0.703, 95% CI [0.582, 0.800]. These scores were very similar to their median C-index scores. The hazard ratios for this
model are given in Table S7. We further divided the samples into three groups based on their risk score, with low risk below the 25th percentile, medium risk from the 25th to the 75th
percentile, and high risk above the 75th percentile, and created Kaplan-Meier plots of the survival of the risk groups in each dataset (Fig. 3). Hazard ratios comparing risk groups for the
risk score are shown in Table 1. We generated calibration curves for the proportional hazards models based on the full training data model (Fig. S1), using the _pec_ package for R25, as well
as the integrated Brier score (IBS)26, a quantitative measure of calibration or prediction error. For M2EFM Exp + Clin, the IBS on Kao was 0.060 and on Terunuma was 0.166. For Cox Exp +
Clin, the IBS on Kao was 0.055 and on Terunuma was 0.161, indicating that the Cox Exp + Clin models were better calibrated over all time points. ER STATUS Most prognostic markers for breast
cancer have limited utility for patients with ER-negative tumors27. Unfortunately, ER status was not annotated in the Kao data, and there were too few samples in Terunuma to effectively test
the prognostic utility of our marker on ER-negative patients. Therefore, we inferred the ER, PR, and HER2 status of samples in Kao by clustering samples using expression probes associated
with each receptor status using a Gaussian mixture model to capture each bimodal distribution28. We then assessed the relative prognostic power of the M2EFM models in samples that varied by
ER or triple negative status (Table 2). The overall C-index was 0.703, but for ER-positive patients it was 0.665, for ER-negative patients it was 0.727, and for triple negative patients it
was 0.682. SUBTYPES Given that the covariates in our model are unpenalized, we needed to select a limited number of clinical covariates to avoid overfitting the model. This resulted in
models without PAM50 intrinsic subtypes, ER, PR, or HER2 status, which are all known to be associated with breast cancer prognosis. Therefore, we regressed the M2EFM molecular risk score
(calculated for all TCGA samples) on these variables and calculated the variance inflation factor (VIF), a commonly used measure of the multicollinearity between variables. The VIF was
approximately 1.20. A VIF between 1 and 5 is generally regarded to imply only moderate multicollinearity. Therefore, the molecular risk score from M2EFM contains prognostic information
independent of breast cancer subtype. DISTANT RECURRENCE-FREE SURVIVAL (DRFS) MODELS On both the testing data and the external validation data, M2EFM was the most prognostic model of DRFS
(Fig. 4 and Table S8). On the testing data, the M2EFM Exp + Clin had a median C-index score of 0.722, followed by rorS + Clin (0.707), then M2EFM Exp (0.705), and Cox Exp + Clin (0.701),
with the other models below 0.7. In the external validation data, M2EFM Exp + Clin and M2EFM Exp were the most prognostic at 0.752 and 0.741 respectively, although Cox Exp + Clin and Cox Exp
were strong at 0.735 and 0.729 respectively. Finally, an M2EFM Exp + Clin model trained on the full training data (Hatzis1) achieved a C-index of 0.766, 95% CI [0.604, 0.875] on the
external validation data (Hatzis2). The hazard ratios for this model are given in Table S9. Calibration curves at 3 years are shown in Fig. S2. The IBS score for M2EFM Exp + Clin on Hatzis2
was 0.079. For Cox Exp + Clin it was 0.083, demonstrating that the M2EFM Exp + Clin model is better calibrated. NEGATIVE PREDICTIVE VALUE FOR NODE-NEGATIVE CASES One interest in genomic
cancer biomarkers is for determining cases with a low risk of recurrence who may not benefit from adjuvant chemotherapy29,30. For these cases, it is important that the marker has a high
negative predictive value (NPV). Therefore, we assessed the NPV of the M2EFM model in node-negative samples using the _timeROC_ package for R, which uses a time-dependent AUC and can handle
censored survival times. A cutoff for the risk score was determined in Hatzis1 that equalized the sensitivity and specificity in predicting DRFS for node-negative cases at 3 years (the
median follow-up in these data). This cutoff was applied to node-negative cases in Hatzis2 to determine the positive predictive value (PPV) and NPV. At the 3-year follow up point, the PPV
was 0.49 and the NPV was 0.95. We predicted 39 node-negative cases would not have a distant metastasis by 3 years and we were correct for 36. We predicted 12 node-negative cases would have a
distant metastasis and were correct for 7. PATHOLOGIC COMPLETE RESPONSE (PCR) MODELS The final set of models are for pCR following neoadjuvant treatment with Taxane/Anthracycline in
HER2-negative breast cancer. Results are shown in Fig. 5 and Table S10. For these data, the clinical models were not very informative. The reference logistic-Ridge regression model performed
best on the testing data, with Logistic Exp + Clin scoring a median AUC of 0.767 and Logistic Exp at 0.763 compared to M2EFM Exp + Clin with a median AUC of 0.739 and M2EFM Exp at 0.733.
Nevertheless, M2EFM performed best on the external validation data, with M2EFM Exp + Clin at a median AUC of 0.716 and M2EFM Exp at 0.727 compared to Logistic Exp + Clin at a median AUC of
0.707 and Logistic Exp at 0.709. Finally, an M2EFM Exp + Clin model trained on the full training data (Hatzis1) achieved an AUC of 0.720, 95% CI [0.639, 0.802] on the external validation
data (Hatzis2). The odds ratios for this model are given in Table S11. Calibration curves appear in Fig. S3. The mean Brier score for M2EFM Exp + Clin on Hatzis2 was 0.184 and for Cox Exp +
Clin it was 0.202, showing that M2EFM Exp + Clin is better calibrated. FUNCTIONAL ANALYSIS PROTEIN NETWORK TOPOLOGY-BASED ANALYSIS Of the 115 m2eGenes, 55 were found to interact in a
sub-network of protein-protein interactions using WebGestalt. The sub-network included 65 genes in all and was enriched for 184 biological process terms, demonstrating that many of the genes
in our signature are functionally related. The top result was for lymphocyte activation (p = 2.50e-12), which included 23 genes from the sub-network and is about 8.96 times more overlap
than would be expected by chance. The sub-network is shown in Fig. S4, visualized using Cytoscape31. BIOLOGICAL PROCESS, DISEASE, AND DRUG ENRICHMENT Using Gene Ontology enrichment testing,
our gene set shows a clear enrichment for immune functions (Fig. 6). The results further demonstrate that expression tends to be repressed in these pathways in the samples with the highest
M2EFM risk scores and more moderately repressed in samples with medium M2EFM risk scores. Many of the genes in our gene signature are already known to be associated with breast cancer (such
as _ESR1_32 and _EGFR_33). We calculated the enrichment of genes in our signature in the Catalog of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census23, downloaded on 9/13/2016.
Enrichment would suggest a causal link between genes in our signature and the progression of the disease. There were 601 genes in the catalog, 515 of which were in our gene expression
profiles. The enrichment for causally implicated cancer genes in our gene signature was about 4 times greater than expected by chance (OR: 3.87, p = 6.09e-06). The following 18 genes used in
our M2EFM models appear in the Cancer Gene Census: _AFF3, BCL11A, CD79A, CD79B, EGFR, ERBB4, ESR1, FOXA1, GATA3, IRF4, ITK, LCK, MYB, POU2AF1, PRF1, PTPRC, RET, TNFRSF17_. We also examined
the known targets of FDA approved breast cancer drugs using the DrugBank database version 5.024. There were 43 targets among the genes in our expression profiles. Of these, 4 genes were
found in our signature (_EGFR, ESR1, MAPT_, and _PRKCB_). Thus, our signature was enriched for known targets of breast cancer treatments 10 times more than would be expected by chance (OR:
10.00, p = 1.03e-03). These drugs included paclitaxel (_MAPT_), raloxifene (_ESR1_), toremifene (_ESR1_), fulvestrant (_ESR1_), trastuzumab (_EGFR_), tamoxifen (_ESR1_, _PRKCB_), docetaxel
(_MAPT_), and lapatinib (_EGFR_). DISCUSSION In this work, we have tried to address common shortcomings in molecular biomarker studies, have applied the REMARK guidelines in our
assessment34, and validated a data-integrated approach to creating prognostic and predictive cancer models. The M2EFM approach shows remarkable promise for developing prognostic and
predictive cancer biomarkers using both clinical and molecular data. For overall survival, distant recurrence free survival, and pathologic complete response in breast cancer it achieved a
median C-index or AUC of at least 0.7 in all internal and external validation datasets (including 5 different cohorts). The fact that our model can reliably attain this level of
discrimination across platforms and populations is promising indeed. Unsurprisingly, one of our comparison methods, a modified Cox-Ridge approach, performed about as well as M2EFM in terms
of discriminative power. M2EFM depends on Cox-Ridge itself, and the Cox-Ridge models contain a superset of the predictors in M2EFM. Nevertheless, M2EFM models in many cases outperform the
Cox-Ridge models, and more importantly, M2EFM models depend on only 115 genes, as opposed to 10990 in the Cox-Ridge models. Therefore, clinical use for Cox-Ridge would require a
transcriptome-wide gene expression assay, and the situation would be even more extreme if the model with methylation values was used. Additionally, for DRFS and pCR, M2EFM was better
calibrated than Cox-Ridge, suggesting that Cox-Ridge would not be as useful clinically. We also demonstrated that M2EFM has a high NPV for node-negative cases, while maintaining a meaningful
PPV, suggesting M2EFM might be useful for identifying patients who could avoid adjuvant chemotherapy. Finally, M2EFM has functional relevance, increasing the possibility that it could serve
as the basis for clinical tests that determine eligibility for treatment. Thus, using three different outcomes and five different data sets, we maintained a consistent level of accuracy. In
conjunction with our prior study of M2EFM applied to ccRCC11, these results consistently show that integrating both molecular and clinical data results in more prognostic models than either
data type alone. Our results strongly suggest that a data-integrated approach that models genes affected by regulatory changes in cancer is more reliable than many extant approaches.
Furthermore, our results suggest our method builds a model that is prognostic in the case of ER-negative samples and possibly triple negative samples (although unfortunately few samples were
available to test this hypothesis). Given the lack of approved prognostic indicators for ER-negative patients, partly due to its greater proliferative rate and the reliance of many tests on
proliferative genes27, and the greater difficulty in treating triple negative disease, a marker, such as ours, that is largely independent of proliferation is desirable27,35,36. Our results
also show that our approach does more than build a reliable prognostic model: it identifies biologically relevant pathways and, importantly, possible therapeutic targets and genes involved
in cancer progression. Our gene signature contains genes that are functionally involved in cancer and that are targeted by current breast cancer treatments. Perhaps more interestingly, given
recent advances in immunotherapy37,38, we note the repression of immune related genes in our signature for patients predicted to be at high risk. There has been growing interest in finding
signatures that are independent of proliferation and ER status that might also be predictive of possible immune therapies27. Although we provide evidence of the efficacy of M2EFM, there
remains one limitation before attempting to translate these results to the clinic: calibration of the model to a single platform. To obtain external validation data, we performed
cross-platform normalization. Having established the utility of the M2EFM approach, it remains to train the model on a high-quality dataset generated on the platform that will be used for
any clinical assays. CONCLUSIONS M2EFM can predict risk based on data integrated from multiple sources, taking advantage of tried-and-true prognostic factors while incorporating data on
relevant relationships between molecular data types at the individual level. For breast cancer, this resulted in models based on several genes with known functional associations to cancer
progression and which are targeted by currently available drugs, and therefore, may result in a useful tool for clinical decision support. This also suggests that other genes in the model
should be examined as possible targets, because many of the genes are functionally connected. Given the strength of our findings, we believe that further development of this technique and
application to other cancers is warranted. REFERENCES * Cho, S. H., Jeon, J. & Kim, S. I. Personalized Medicine in Breast Cancer: A Systematic Review. _J Breast Cancer_ 15, 265–272,
https://doi.org/10.4048/jbc.2012.15.3.265 (2012). Article PubMed PubMed Central Google Scholar * Weinstein, J. N. _et al_. The Cancer Genome Atlas Pan-Cancer analysis project. _Nat
Genet_ 45, 1113–1120, https://doi.org/10.1038/ng.2764 (2013). Article PubMed PubMed Central Google Scholar * Zhang, J. J. _et al_. International Cancer Genome Consortium Data Portal-a
one-stop shop for cancer genomics data. _Database-Oxford_ https://doi.org/10.1093/database/bar026 (2011). * Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. _Nucleic Acids Res_ 30, 207–210, https://doi.org/10.1093/nar/30.1.207 (2002). Article CAS PubMed PubMed Central Google Scholar *
Parker, J. S. _et al_. Supervised risk predictor of breast cancer based on intrinsic subtypes. _J Clin Oncol_ 27, 1160–1167, https://doi.org/10.1200/JCO.2008.18.1370 (2009). Article PubMed
PubMed Central Google Scholar * Paik, S. _et al_. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. _N Engl J Med_ 351, 2817–2826,
https://doi.org/10.1056/NEJMoa041588 (2004). Article CAS PubMed Google Scholar * Coates, A. S. _et al_. Tailoring therapies-improving the management of early breast cancer: St Gallen
International Expert Consensus on the Primary Therapy of Early Breast Cancer 2015. _Ann Oncol_ 26, 1533–1546, https://doi.org/10.1093/annonc/mdv221 (2015). Article CAS PubMed PubMed
Central Google Scholar * Kern, S. E. Why Your New Cancer Biomarker May Never Work: Recurrent Patterns and Remarkable Diversity in Biomarker Failures. _Cancer Res_ 72, 6097–6101,
https://doi.org/10.1158/0008-5472.Can-12-3232 (2012). Article CAS PubMed PubMed Central Google Scholar * Yuan, Y. _et al_. Assessing the clinical utility of cancer genomic and proteomic
data across tumor types. _Nat Biotechnol_ 32, 644−+, https://doi.org/10.1038/nbt.2940 (2014). Article PubMed PubMed Central Google Scholar * Ritchie, M. D., Holzinger, E. R., Li, R. W.,
Pendergrass, S. A. & Kim, D. Methods of integrating data to uncover genotype-phenotype interactions. _Nat Rev Genet_ 16, 85–97, https://doi.org/10.1038/nrg3868 (2015). Article CAS
PubMed Google Scholar * Thompson, J. A. & Marsit, C. J. A Methylation-to-Expression Feature Model for Generating Accurate Prognostic Risk Scores and Identifying Disease Targets in
Clear Cell Kidney Cancer. _Pac Symp Biocomput_ 22, 509–520, https://doi.org/10.1142/9789813207813_0047 (2017). PubMed Google Scholar * Hatzis, C. _et al_. A Genomic Predictor of Response
and Survival Following Taxane-Anthracycline Chemotherapy for InvasiveBreast Cancer. _Jama-J Am Med Assoc_ 305, 1873–1881, https://doi.org/10.1001/jama.2011.593 (2011). Article CAS Google
Scholar * Teschendorff, A. E. _et al_. DNA methylation outliers in normal breast tissue identify field defects that are enriched in cancer. _Nat Commun_ 7
https://doi.org/10.1038/ncomms10478 (2016). * Fortin, J. P. _et al_. Functional normalization of 450k methylation array data improves replication in large cancer studies. _Genome Biol_ 15
https://doi.org/10.1186/s13059-014-0503-2 (2014). * Aryee, M. J. _et al_. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.
_Bioinformatics_ 30, 1363–1369, https://doi.org/10.1093/bioinformatics/btu049 (2014). Article CAS PubMed PubMed Central Google Scholar * Thompson, J. A., Tan, J. & Greene, C. S.
Cross-platform normalization of microarray and RNA-seq data for machine learning applications. _Peerj_ 4, https://doi.org/10.7717/peerj.1621 (2016). * Terunuma, A. _et al_. MYC-driven
accumulation of 2-hydroxyglutarate is associated with breast cancer prognosis. _J Clin Invest_ 124, 398–412, https://doi.org/10.1172/Jci71180 (2014). Article CAS PubMed Google Scholar *
Kao, K. J., Chang, K. M., Hsu, H. C. & Huang, A. T. Correlation of microarray-based breast cancer molecular subtypes and clinical outcomes: implications for treatment optimization. _Bmc
Cancer_ 11, https://doi.org/10.1186/1471-2407-11-143 (2011). * Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. _J
Stat Softw_ 33, 1–22 (2010). Article PubMed PubMed Central Google Scholar * Martinez-Ledesma, E., Verhaak, R. G. W. & Trevino, V. Identification of a multi-cancer gene expression
biomarker for cancer clinical outcomes using a network-based algorithm. _Sci Rep-Uk_ 5, https://doi.org/10.1038/srep11966 (2015). * Gendoo, D. M. A. _et al_. Genefu: an R/Bioconductor
package for computation of gene expression-based signatures in breast cancer. _Bioinformatics_ 32, 1097–1099, https://doi.org/10.1093/bioinformatics/btv693 (2016). Article CAS PubMed
Google Scholar * Wang, J., Duncan, D., Shi, Z. & Zhang, B. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. _Nucleic Acids Res_ 41, W77–W83,
https://doi.org/10.1093/nar/gkt439 (2013). Article PubMed PubMed Central Google Scholar * Forbes, S. A. _et al_. COSMIC: exploring the world’s knowledge of somatic mutations in human
cancer. _Nucleic Acids Res_ 43, D805–D811, https://doi.org/10.1093/nar/gku1075 (2015). Article CAS PubMed Google Scholar * Wishart, D. S. _et al_. DrugBank: a comprehensive resource for
_in silico_ drug discovery and exploration. _Nucleic Acids Res_ 34, D668–D672, https://doi.org/10.1093/nar/gkj067 (2006). Article CAS PubMed Google Scholar * Mogensen, U. B., Ishwaran,
H. & Gerds, T. A. Evaluating Random Forests for Survival Analysis Using Prediction Error Curves. _J Stat Softw_ 50, 1–23 (2012). Article PubMed PubMed Central Google Scholar * Graf,
E., Schmoor, C., Sauerbrei, W. & Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. _Stat Med_ 18, 2529–2545 (1999). Article CAS PubMed
Google Scholar * Gyorffy, B. _et al_. 3 Multigene prognostic tests in breast cancer: past, present, future. _Breast Cancer Research_ 17 https://doi.org/10.1186/s13058-015-0514-2 (2015). *
Lehmann, B. D. _et al_. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. _J Clin Invest_ 121, 2750–2767,
https://doi.org/10.1172/Jci45014 (2011). Article CAS PubMed PubMed Central Google Scholar * Duffy, M. J. _et al_. Clinical use of biomarkers in breast cancer: Updated guidelines from
the European Group on Tumor Markers (EGTM). _Eur J Cancer_ 75, 284–298, https://doi.org/10.1016/j.ejca.2017.01.017 (2017). Article CAS PubMed Google Scholar * Loncaster, J. _et al_.
Impact of Oncotype DX breast Recurrence Score testing on adjuvant chemotherapy use in early breast cancer: Real world experience in Greater Manchester, UK. _Ejso-Eur J Surg Onc_ 43, 931–937,
https://doi.org/10.1016/j.ejso.2016.12.010 (2017). Article CAS Google Scholar * Shannon, P. _et al_. Cytoscape: a software environment for integrated models of biomolecular interaction
networks. _Genome Res_ 13, 2498–2504, https://doi.org/10.1101/gr.1239303 (2003). Article CAS PubMed PubMed Central Google Scholar * Masuda, H. _et al_. Role of epidermal growth factor
receptor in breast cancer. _Breast Cancer Res Treat_ 136, 331–345, https://doi.org/10.1007/s10549-012-2289-9 (2012). Article CAS PubMed Google Scholar * Lauring, J. & Wolff, A. C.
Evolving Role of the Estrogen Receptor as a Predictive Biomarker: ESR1 Mutational Status and Endocrine Resistance in Breast Cancer. _J Clin Oncol_ 34, 2950-+,
https://doi.org/10.1200/Jco.2016.68.4720 (2016). Article PubMed Google Scholar * McShane, L. M. _et al_. REporting recommendations for tumor MARKer prognostic studies (REMARK). _Breast
Cancer Res Tr_ 100, 229–235, https://doi.org/10.1007/s10549-006-9242-8 (2006). Article Google Scholar * Shao, F., Sun, H. & Deng, C. X. Potential therapeutic targets of triple-negative
breast cancer based on its intrinsic subtype. _Oncotarget_ 8, 73329–73344, https://doi.org/10.18632/oncotarget.20274 (2017). PubMed PubMed Central Google Scholar * Collignon, J.,
Lousberg, L., Schroeder, H. & Jerusalem, G. Triple-negative breast cancer: treatment challenges and solutions. _Breast Cancer (Dove Med Press)_ 8, 93–107,
https://doi.org/10.2147/BCTT.S69488 (2016). Google Scholar * Liu, Z., Li, M., Jiang, Z. & Wang, X. A Comprehensive Immunologic Portrait of Triple-Negative Breast Cancer. _Transl Oncol_
11, 311–329, https://doi.org/10.1016/j.tranon.2018.01.011 (2018). Article CAS PubMed Google Scholar * Tolba, M. F. & Omar, H. A. Immunotherapy, an evolving approach for the
management of triple negative breast cancer: Converting non-responders to responders. _Crit Rev Oncol Hematol_. https://doi.org/10.1016/j.critrevonc.2018.01.005 (2018). PubMed Google
Scholar Download references ACKNOWLEDGEMENTS The analyses described in this report were supported by NIH grants R01ES022223, P30CA138292, P30ES019776, and R01DE022772. AUTHOR INFORMATION
AUTHORS AND AFFILIATIONS * Department of Biostatistics, University of Kansas Medical Center, Kansas City, USA Jeffrey A. Thompson * Department of Epidemiology, Geisel School of Medicine at
Dartmouth College, Hanover, USA Brock C. Christensen * Department of Environmental Health, Rollins School of Public Health at Emory University, Atlanta, USA Carmen J. Marsit Authors *
Jeffrey A. Thompson View author publications You can also search for this author inPubMed Google Scholar * Brock C. Christensen View author publications You can also search for this author
inPubMed Google Scholar * Carmen J. Marsit View author publications You can also search for this author inPubMed Google Scholar CONTRIBUTIONS J.T. and C.M. conceived of and designed the
study. J.T. collected the data, performed the analyses, and drafted the article. J.T., C.M. and B.C. performed critical revision. All authors read and approved the final manuscript.
CORRESPONDING AUTHOR Correspondence to Jeffrey A. Thompson. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER'S NOTE:
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. ELECTRONIC SUPPLEMENTARY MATERIAL SUPPLEMENTARY MATERIAL RIGHTS AND
PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The
images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Thompson, J.A.,
Christensen, B.C. & Marsit, C.J. Methylation-to-Expression Feature Models of Breast Cancer Accurately Predict Overall Survival, Distant-Recurrence Free Survival, and Pathologic Complete
Response in Multiple Cohorts. _Sci Rep_ 8, 5190 (2018). https://doi.org/10.1038/s41598-018-23494-0 Download citation * Received: 13 December 2017 * Accepted: 13 March 2018 * Published: 26
March 2018 * DOI: https://doi.org/10.1038/s41598-018-23494-0 SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get shareable link Sorry, a
shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative