Artificial intelligence unravels interpretable malignancy grades of prostate cancer on histology images


ABSTRACT

Malignancy grading of prostate cancer (PCa) is fundamental for risk stratification, patient counseling, and treatment decision-making. Deep learning has shown potential to improve


the expert consensus for tumor grading, which relies on the Gleason score/grade grouping. However, the core problem of interobserver variability for the Gleason grading system remains


unresolved. We developed a novel grading system for PCa and utilized artificial intelligence (AI) and multi-institutional international datasets from 2647 PCa patients treated with radical


prostatectomy with a long follow-up of ≥10 years for biochemical recurrence and cancer-specific death. Through survival analyses, we evaluated the novel grading system and showed that AI


could develop a tumor grading system with four risk groups independent from and superior to the current five grade groups. Moreover, AI could develop a scoring system that reflects the risk


of castration resistant PCa in men who have experienced biochemical recurrence. Thus, AI has the potential to develop an effective grading system for PCa interpretable by human experts.


INTRODUCTION

Prostate cancer (PCa) is one of the most prevalent malignant diseases


in males and exhibits diverse cancer aggressiveness and prognosis1. When PCa is diagnosed, usually by biopsy, the pathological examination of cancer differentiation and dissemination status


is a key determinant for selecting appropriate treatments2. Currently, pathologists grade PCa malignancy based on the modified Gleason grading system, originally established in the 1960s3.


The first version of the Gleason grading system was based on five tissue patterns (labeled 1–5) that identified different transformation conditions of prostatic tissues according to tissue


architecture, growth, and glandular features3,4. This grading system produces a score that considers two identical or different patterns to grade PCa differentiation, and the order in which


patterns are added differs according to tissue sampling (biopsy core vs. whole prostate)3,4. PCa grading was further refined after patterns 1 and 2 were mostly identified as benign with the


identification of basal cells by immunohistochemistry, and some of those patterns 1 and 2 were reclassified as Gleason pattern 3 as well5,6. In 2016, Epstein et al. proposed a modified


version of the Gleason grading system that included five grade groups (GGs) instead of nine different Gleason scores (such as 3 + 3, 4 + 3, and 5 + 3) to achieve a more concise prognostic


stratification according to biochemical recurrence (BCR) rates7. Despite strong prognostic capacities and continual revisions since its introduction8, GG reproducibility has remained limited


because of interobserver variability in grading and quantification, leading to grade inconsistency even among expert pathologists, thus increasing the potential risk of treatment delay or


suboptimal treatment choice9,10. Contemporary studies have highlighted the great potential of artificial intelligence (AI) in improving GG consistency and achieving accuracy comparable to


expert levels11,12,13. However, these studies likely inherited the limitations of the current grading system as the histological ground truth is based on evaluations from a small group of


expert pathologists, which is not necessarily reflective of the global pathology community (social and cognitive biases) or grading correctness14. To bypass these reproducibility


limitations, we applied AI to develop a novel recurrence prediction system based on long-term PCa prognosis instead of interobserver-based histology. We relied on the tissue microarray (TMA)


framework of the Canadian Prostate Cancer Biomarker Network (CPCBN) initiative of the Terry Fox Research Institute; this initiative implemented thoroughly validated techniques to ensure the


collection of representative samples of PCa from radical prostatectomy (RP) specimens15. In this study, we developed a calibrated and interpretable algorithm for predicting PCa outcomes in


multiple independent cohorts that could eventually be integrated into existing prognostic and predictive nomograms.

RESULTS

SURVIVAL MODELING

To establish a novel system for predicting


recurrence, we initially investigated a multicenter population (CPCBN, _n_ = 1489) in which the overall BCR probability was 33.1% (_n_ = 493). The median time to BCR events was 26


(interquartile range [IQR], 8–52) months; in contrast, the median follow-up was 109 (76–141) months in patients without BCR events. The development and first external validation sets (CPCBN


cohort) were not statistically different with respect to pathological tumor (pT) stage, pathological nodal (pN) status, and GG (Supplementary Table S1). Among 600 patients in the development


set, 225 (37.5%) experienced recurrence during follow-up (median follow-up, 91 [42–123] months); in contrast, among 889 patients in the first external validation set, 268 (30.1%) had BCR


(median follow-up, 75 [43–116] months). Figure 1 summarizes the study methodology using histology images as data input, the confidence scores for BCR as output, and the binarized recurrence


status as the ground truth for model development and evaluation. The Supplementary Materials include cohort descriptions for all datasets included in this study (Supplementary Tables S1–S3).


In the first external validation set, the BCR model demonstrated a c-index of 0.682 ± 0.018 and a generalized concordance probability of 0.927 (95% CI: 0.891–0.952). The AUROC for the BCR


model was 0.714 (95% CI: 0.673–0.752). Using a cutoff of 0.5 for the BCR confidence score, the sensitivity was 50.0% and the specificity was 83.2%. The precision and recall of the BCR model


at a 0.5 threshold were 56.3% and 50.0%, respectively.
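For readers who want to reproduce this style of evaluation on their own predictions, the sketch below computes AUROC, sensitivity, specificity, and precision at the 0.5 cutoff, plus a concordance index, using scikit-learn and lifelines. The data are synthetic placeholders and the snippet is not part of the study's code; it only illustrates the metric definitions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix
from lifelines.utils import concordance_index

# Synthetic stand-ins for per-case BCR scores, event flags, and follow-up times.
rng = np.random.default_rng(0)
n = 889                                        # size of the first external validation set
bcr_event = rng.binomial(1, 0.30, n)           # 1 = BCR observed, 0 = censored
bcr_score = np.clip(0.35 * bcr_event + rng.normal(0.3, 0.15, n), 0, 1)
months = rng.integers(6, 140, n)               # follow-up / time to BCR in months

# AUROC against the binarized recurrence status
auroc = roc_auc_score(bcr_event, bcr_score)

# Sensitivity, specificity, and precision at the 0.5 cutoff used in the text
pred = (bcr_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(bcr_event, pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
precision = tp / (tp + fp)

# Concordance: higher BCR scores should correspond to earlier events
c_index = concordance_index(months, -bcr_score, bcr_event)
print(round(auroc, 3), round(sensitivity, 3), round(specificity, 3),
      round(precision, 3), round(c_index, 3))
```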


The calibration plot demonstrated good correlation between the predicted BCR probability (BCR score) and the observed 10-year BCR-free survival rate (Supplementary Figure S1). Our novel model revealed a better effect size (hazard ratio) and a higher generalized concordance probability than the classical models ResNet16,


VGG-1617, and EfficientNet18, which were trained on the same development set for BCR prognosis. EfficientNet and the novel model provided the lowest AIC and BIC. A non-nested partial


likelihood ratio test revealed that EfficientNet did not fit better than the novel model. Importantly, our novel BCR model had between 8- and 32-times fewer feature maps in the last


convolutional layer for BCR prediction (before being fully connected) and a parameter capacity 125-, 54-, or 24-times smaller than the models mentioned above (Supplementary Table S4). We


observed no performance benefits from using image patches at ×20 or ×40 object magnifications, the attention aggregation layer, or the Cox deep convolutional model concept (Supplementary


Table S5). The results of the CHAID analysis are shown in Supplementary Figure S2. Based on the BCR scores estimated by our model and CHAID, BCR scores ≤5% were considered low risk, BCR


scores between 6% and 42% were low intermediate, BCR scores between 43% and 74% were high intermediate, and BCR scores ≥75% were high risk.
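The CHAID-derived cutoffs amount to a simple lookup from the BCR score (expressed as a percentage) to the four risk groups. The helper below is an illustrative sketch using the cutoffs quoted above; it is not the authors' implementation, and the function name is hypothetical.

```python
def bcr_risk_group(bcr_score_percent: float) -> str:
    """Map a BCR confidence score (in %) to the four CHAID-derived risk groups.

    Illustrative helper only; cutoffs follow the text (<=5, 6-42, 43-74, >=75).
    """
    if bcr_score_percent <= 5:
        return "low"
    if bcr_score_percent <= 42:
        return "low intermediate"
    if bcr_score_percent <= 74:
        return "high intermediate"
    return "high"

# Example: a case with a predicted BCR probability of 0.58 (58%)
print(bcr_risk_group(58))  # -> "high intermediate"
```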


RECURRENCE-FREE SURVIVAL

We conducted univariate and multivariable Cox regression analyses on the CPCBN and PROCURE cohorts to assess the prognostic value of the novel risk classification system for PCa recurrence (Supplementary


Tables S6 and S7). The results showed that the BCR score was an independent prognostic factor for recurrence, along with PSA level, tumor stage, GG, and surgical margin status. The novel


risk classification system showed a better model fit and was superior to GG (Table 1). No significant multicollinearity between variables was identified (VIF < 2), indicating that the correlation between variables (GG and the novel risk group) is negligibly small. The survival rates varied across the novel risk groups in both cohorts, as shown in Fig. 2A, B (see


Supplementary Table S8 for 3-, 5-, and 10-year BCR-free survival rates). The survival rates for GG are shown in the Supplementary section for comparison (Supplementary Tables S9 and S10 and


Figures S3 and S4). The estimated power for the BCR survival analysis in this study was determined to be ≥99% at an alpha level of 5% for each cohort.
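The multicollinearity check reported above (VIF < 2 between GG and the novel risk group) could be reproduced along the following lines with statsmodels. The data frame, column names, and values below are synthetic placeholders, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor table: grade group, novel risk group, and tumor stage.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "gg": rng.integers(1, 6, 500),                 # grade group 1-5
    "novel_risk_group": rng.integers(1, 5, 500),   # four novel risk groups
    "pt_stage": rng.integers(2, 4, 500),           # pT2 vs pT3, coded numerically
})

X = sm.add_constant(df)  # VIF is computed against the other predictors plus an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # values below 2 would indicate negligible multicollinearity
```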


CANCER-SPECIFIC SURVIVAL

This study examined cancer-specific survival using the novel risk classification system in three cohorts: the CPCBN, PROCURE, and PLCO cohorts. In the CPCBN cohort, the novel score was a significant prognostic factor for cancer-specific mortality, along with tumor stage; in contrast, GG was not an independent prognostic factor (Supplementary Table S11). In the PROCURE Quebec Prostate Cancer


Biobank (PROCURE cohort), the novel risk score was an independent prognostic factor, along with the nodal stage; in contrast, the tumor stage was insignificant (Supplementary Table S12).


Supplementary Table S13 summarizes the results of the Cox regression analyses of the PLCO cohort, further validating the independent prognostic value of the risk score for cancer-specific


mortality using whole-slide images. In the CPCBN and PROCURE cohorts, the multivariate Cox regression model with novel risk groups fit well, similar to the full model. However, the model


with GG fit the data poorly (Table 2). In the PLCO cohort, both the GG and the risk-group models fit poorly compared with the full model, and the difference in the goodness-of-fit between the model


with GG and the model with risk groups was insignificant. No significant multicollinearity between variables was identified (VIF < 2). The estimated power for the cancer-specific survival analysis in


this study was determined to be ≥95% at an alpha level of 5% for each cohort. The Fine-Gray competing risk regression analyses further validated the independent prognostic value of our novel


risk groups for cancer-specific mortality on external validation sets (Supplementary Tables S14–S16). The Kaplan–Meier curves for cancer-specific survival according to risk classification


in the three external validation sets showed significant differences among the risk groups (Fig. 2C–E). Supplementary Table S17 summarizes cancer-specific survival rates across the three


cohorts and shows a distinct separation of survival rates among the risk groups 10 or 15 years after RP. The low-risk group of the novel grading system had no PCa-related deaths in any of


the three cohorts; in contrast, the GG in the current grading system included patients who died owing to PCa in two of the three cohorts. PLCO cohort analysis showed that the number of


slides per case and its correlation with the risk score did not significantly affect the prognostic value (Supplementary Table S18). Additional information on survival probabilities,


Kaplan–Meier curves for the GG, Gleason score groups, and the PCa pathological stage is provided in Supplementary Tables S19–S21 and Supplementary Figures S5–S8 for comparison.
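Kaplan–Meier curves stratified by risk group, as in Fig. 2C–E, can be produced with lifelines; the sketch below uses synthetic survival data with an administrative censoring horizon and a log-rank comparison across the four groups. It only illustrates the analysis style and is not the study's plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

# Synthetic survival table: one row per case with risk group, follow-up, and event flag.
rng = np.random.default_rng(2)
n = 800
order = ["low", "low int.", "high int.", "high"]
groups = rng.choice(order, n)
hazard = {"low": 0.002, "low int.": 0.005, "high int.": 0.012, "high": 0.025}
time = np.array([rng.exponential(1 / hazard[g]) for g in groups]).clip(max=180)
event = (time < 180).astype(int)  # administrative censoring at 15 years (180 months)

ax = plt.subplot(111)
for g in order:
    mask = groups == g
    KaplanMeierFitter(label=g).fit(time[mask], event[mask]).plot_survival_function(ax=ax)

res = multivariate_logrank_test(time, groups, event)
print("log-rank p-value:", res.p_value)
```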


CASTRATION-RESISTANT PROSTATE CANCER

Castration-resistant prostate cancer (CRPC) occurs when PCa progresses despite therapy-induced castrate conditions. The current study assessed the occurrence of CRPC in men experiencing biochemical recurrence and its association with our novel scoring and grading systems. Figure 3 shows that the


proportion of CRPC increases with the risk group in men with biochemical recurrence in the PROCURE cohort. In support of this observation, we found a significant correlation between risk group and the development of CRPC (Kendall’s rank correlation tau: 0.22; z = 4.2277; _p_ < 0.0001). Moreover, we identified that the low-risk group had no CRPC cases and that all CRPC cases


(100%) were found in the intermediate or high-risk groups. Multivariate Cox regression analysis showed that the novel risk score was an independent prognosticator for CRPC development


whereas pT, pN, and surgical margin status were not (Table 3).
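The rank correlation reported above can be computed with scipy's Kendall tau on an ordinal coding of the risk groups versus the CRPC status. The data below are synthetic and purely illustrative; only the analysis pattern matches the text.

```python
import numpy as np
from scipy.stats import kendalltau

# Ordinal risk group (1 = low ... 4 = high) and CRPC status for men with BCR.
# Synthetic illustration of the correlation analysis; not the study data.
rng = np.random.default_rng(3)
risk_group = rng.integers(1, 5, 300)
p_crpc = np.array([0.0, 0.05, 0.15, 0.30])[risk_group - 1]  # no CRPC in the low-risk group
crpc = rng.binomial(1, p_crpc)

tau, p_value = kendalltau(risk_group, crpc)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.4g}")
```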


INTERPRETABILITY

Table 4 shows the concordance between the five pathologists and the novel risk classification. This table summarizes the synergistic efforts between AI and pathologists in defining a novel grading system for PCa. Although the pathologists were completely blinded to the novel risk classification and the clinicopathological information, we found a striking alignment between the pathologists and the risk classification in sorting image groups. Although our system does not rely on pattern proportions like the GG, and despite the absence of significant collinearity between our novel risk group and GG, the image group representing the low-risk group mostly included Gleason pattern 3; in contrast, the


high-risk group included Gleason patterns 4 and 5, with Gleason pattern 3 being almost absent. The pathologists found a mixture of Gleason patterns 3 and 4 in the intermediate group, with a


trend in favor of Gleason pattern 4 in the high-intermediate group. Figure 4 illustrates examples of the histopathological gradient of glandular architecture distortion, and the Supplementary section includes information on accessing the image groups. The in-depth evaluation of 64 representative features revealed that the 23rd representative feature, in particular, showed


two distinct distributions (different variances) according to the risk groups and the recurrence status (Levene test, _P_ < 0.0001). According to the histogram and bimodal (one-vs-other)


distribution comparisons, the feature distribution for the low- or high-risk group was noticeably more distinguishable than the feature distribution for the low- or high-intermediate groups (Fig. 4).
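The variance comparison described above can be run with scipy's Levene test on any extracted feature. The sketch below assumes a hypothetical matrix of penultimate-layer features (one row per case, 64 columns) and binarized recurrence labels; the values are synthetic and only illustrate the test.

```python
import numpy as np
from scipy.stats import levene

# Hypothetical penultimate-layer features plus recurrence labels (synthetic stand-ins).
rng = np.random.default_rng(4)
labels = rng.binomial(1, 0.33, 1000)
features = rng.normal(0.0, 1.0, (1000, 64))
features[:, 22] += labels * rng.normal(1.5, 1.0, 1000)  # make feature #23 label-dependent

# Levene test on the 23rd representative feature (index 22): do the two
# recurrence groups show different variances?
stat, p_value = levene(features[labels == 0, 22], features[labels == 1, 22])
print(f"Levene W = {stat:.2f}, p = {p_value:.4g}")
```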


The evaluation of image patches selected according to the feature distribution (dominant red range for the low- and high-risk groups, the overlapping range for the intermediate groups) revealed a


histopathology pattern gradient across the risk groups (Fig. 4). Supplementary Figures S9–S13 provide the distribution patterns for 64 feature representations stratified by recurrence status


and risk groups. Gleason pattern 5 was observed in the lower intermediate risk group (31% for CPCBN and 27% for PROCURE) and, predominantly, in the higher intermediate/high-risk groups (67% for CPCBN and 73%


for PROCURE). GG2 (3 + 4) predominantly belonged to intermediate risk groups, accounting for 76% in PROCURE and 80% in CPCBN. Within these intermediate risk groups, GG2 was predominantly


found in the lower intermediate risk group, making up 76% in CPCBN and 88% in PROCURE.

DISCUSSION

In this study, we developed and externally validated a novel grading system for PCa that was


superior to the existing grading systems. We demonstrated that AI could be a helpful tool for generating a well-calibrated grading system interpretable by human experts, including risk


stratification groups with distinct survival probabilities that enable communication between domain experts and between patients and experts when making clinical decisions7,19,20. A


well-calibrated deep learning model significantly mitigates the usual concerns of overconfidence and enables the interpretation of the model’s prediction as scores21,22. Lastly, risk


stratification further enables the exploration of common histopathologic patterns by risk scores7,19,20. Previous AI efforts have focused on replicating grading systems using supervised


learning. Bulten et al. reported a deep learning model trained with the semi-automatic region-level annotation technique and slide-level annotations to show a Cohen’s quadratic kappa score


(κquad) of 0.918 (95% CI 0.891–0.941)11. Similarly, Ström et al. developed an ensemble of deep learning models trained with automatically generated region-level annotations from pen marks


and slide-level annotations, yielding a linear-weighted kappa score (κlin) of 0.83 (ref. 23). A recent study proposed a weakly supervised deep learning model that leveraged only the global Gleason


score of whole-slide images during training to grade patch-pixel-level patterns and perform slide-level scoring accurately24. The authors reported an average improvement on Cohen’s quadratic


kappa score (κquad) of approximately 18% compared to full supervision for the patch-level Gleason grading task24. Similarly, another study reported that the use of the AI-assisted method


was associated with significant improvements in the concordance of PCa grading and quantification between pathologists: pathologists 1 and 2 had 90.1% agreement using the AI-assisted method


vs. 84.0% agreement using the manual method (_p_ < 0.001)25. Despite these results being promising, the current grading system still suffers from reader dependency, and any AI-based


solution developed to improve the interrater agreement for tumor grading will apply to a closed network of human readers with associated social and cognitive biases. To address these


integral notions of AI design, our grading system was calibrated with different risk groups independent of human readers. Our approach also overcomes the challenges of interpreting an


AI-designed grading system as human readers can identify pattern trends in our grading system. Finally, our novel grading system accurately facilitated PCa grading at the clinically relevant


case level using a limited number of representative PCa tissues (three to four small regions representing the index PCa on an RP specimen) or a fully representative slide from an RP


specimen. Previous studies have explored the potential of digital biomarkers or AI-based Gleason grading systems for survival prediction and prognosis in PCa. For instance, a recent


nested case-control study developed a prognostic biomarker for BCR using ResNet-50D26 and a TMA cohort, and the time to recurrence was utilized to label the histology images27. Wulczyn et


al. proposed an AI-based Gleason grading system for PCa-specific mortality based on Inception12-derived architecture28. Yamamoto et al. utilized deep autoencoders29 to extract key features


that were then fed into a second machine learning model (regression and support vector machine30) to predict the BCR status for PCa at fixed follow-up time points (Year 1 and 5)31. Other


studies also utilized multimodal data (molecular feature and histology) for prognosis in different cancers32,33. Overall, these studies set the ground for further survival analyses using AI;


however, they were limited by the post hoc explanation of their black box models that is not necessarily reflective of interpretable, clinically relevant well-validated algorithms34,35,36.


Our novel grading system is also prognostic for the development of CRPC, which represents an advanced progression stage of PCa with poor outcomes that no longer fully responds to androgen deprivation therapy and consequently continues to progress37,38. Our data demonstrate the potential use of our novel grading system as a clinical tool to identify cases at high risk


of CRPC development and accordingly propose a risk-adapted personalized surveillance strategy. One of the most important aspects to consider when developing tools for clinical


decision-making is practicality and clinical utility. Our novel model was calibrated to predict 10-year BCR-free survival probability and facilitate model interpretation. It should also be


noted that the standard prognostic factors for PCa are all obtained during diagnosis or treatment without accounting for any time information. Accordingly, we integrated this important


aspect into our novel prediction system and selected model architectures for comparison based on recent surveys for medical imaging39,40 and the PANDA Challenge41 for PCa. Similarly, because


c-index and ROC curves are not ideal for comparing prognostic models, we utilized the partial LR test, AIC, and BIC to identify which model configuration fits better and provides a superior


prognostic performance42. The novel prediction system presented in this study does not rely on Cox models to calculate risk scores and determine risk groups. In this study, Cox models were


used only to evaluate the accuracy and clinical utility of the grading system. This study applied the Gleason grading system for nomenclature and ontology to describe the histopathological


contents of each group as it is widely accepted as a communication terminology for histopathological changes in PCa among domain experts (including urologists, pathologists, and


oncologists), despite their interrater limitations. Although there was some unsurprising overlap between our risk scores and the GG, the risk groups provided significantly different


interpretations of the GG patterns. Furthermore, our analysis revealed no significant evidence of multicollinearity among various parameters, including the grade group (GG) and the risk


groups. This suggests that the variables we considered in our study are independent and not significantly correlated with each other. We limited the sampling dimension to 0.6 mm (utilizing


TMA cores) while evaluating the interpretability of our novel grading system. This restriction enabled us to improve the readability of the histological content associated with the risk


score. Our TMA cohorts were assembled through a meticulous process involving rigorous protocols and quality control components to ensure the sampling of representative prostate cancer


tissues for each respective case15,43. We adopted the definition proposed by Rudin for interpretable AI34, which obeys a domain-specific set of constraints so that human experts can better


understand it. Interpretable AI necessitates a calibrated model, a requirement that aligns with its importance in clinical decision-making, whereas post hoc explanation of a black box model


does not necessarily equate to interpretable AI34,44. Moreover, within the domain of deep neural networks, the model generalization primarily arises from the presence of a substantial


inductive bias intrinsic to their architectural design; notably, deep neural networks demonstrate behavior that closely approximates Bayesian principles, as substantiated by prior


research45,46,47,48. This specific property strengthens our assumption that bimodal distributions linked to the corresponding risk groups are observable for certain features within the


penultimate fully connected layer, as demonstrated in Fig. 4 and Supplementary Figures S9 to S13; such alteration in the bimodal distribution across different risk groups provides insights


into the model’s inference and the feature distributions resulting from the input images. Although our results are robust, and our novel grading system does not rely on GG nor pattern


proportions, whether it can overcome sampling errors, tissue fragmentation, degradation, or artifacts caused by prostate biopsy and/or poor RP tissue quality is unknown. We did not evaluate


our grading system on the biopsy materials for survival modeling as a sampling effect (evident from the increase in PCa on RP) and the effects of time or intermediate events (such as cancer


progression) until treatment (such as RP) were difficult to control in the experimental setting. In contrast, these effects were easier to control with RP specimens, and it was previously


demonstrated that TMAs, corresponding biopsy samples, and RP specimens were comparable with respect to GG15,43. The selection strategy for whole-slide images (WSIs) or tissue microarray (TMA) sampling in


the current cohorts was determined exclusively by the study organizers before the initiation of the current study. Thus, our strategy mitigated the observer bias by ensuring that data


collectors were not involved in the data analysis process of the current study. Although we did not have control over the WSI or TMA sampling and case selection process for the current study,


our power analyses indicate that the sample size we have is adequate to execute our study. Moreover, the TMA cohorts were primarily designed for biomarker validation, specifically to assess


the effectiveness of biomarkers in predicting or prognosing survival outcomes. The selection of TMA samples accordingly followed predetermined criteria set by the study organizers to ensure


accurate representation and robust validation while mitigating the selection bias15,43. To mitigate potential bias from interobserver variability in labeling histopathological image groups,


we requested explanations from pathologists to better understand the factors influencing their decisions. This approach aimed to improve transparency and provide insights into the potential


sources of bias in the interpretation of histopathological images. Finally, our AI-based grading system was not developed to detect PCa; therefore, additional models to detect PCa are


required for a fully automated grading system. This study introduced and validated a novel grading system resulting from the synergy between AI and domain knowledge. Future research should


focus on identifying the application boundaries of our novel grading system in a real-world setting, including its possible integration into existing nomograms used to predict prognosis and


treatment response.

METHODS

DATA COHORTS

In this study, we adopted a study design focused on the analysis of independent retrospective cohorts. The patients in the CPCBN were randomly


divided into development and validation sets based on their institutions. The development cohort included 600 RP cases from two institutions in the CPCBN framework15,43. Each center received


ethical approval from their Institutional Review Board (IRB) for biobanking activity and for their contributions to the CPCBN. CTRNet standards were followed for quality assurance and


ensured appropriate handling of human tissue49. The first external validation set, the CPCBN cohort, included 889 RP cases from three different institutions within the CPCBN framework,


anonymized to minimize bias and excluding the institutions used in the development set to avoid potential label leakage. The second cohort included 16 digital TMA scans of 897 patients from


the PROCURE cohort50,51. The study has been approved by the McGill University Institutional Review Board (study number A01-M04-06A). Lastly, the 1502 H&E-stained whole-slide images from


861 RP cases in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial (NCT00339495; PLCO cohort) were used52,53. Only cores or representative slides from the RP index


lesion were used to develop and validate the malignancy grading system for PCa. Access to the PLCO data set was approved through the National Cancer Institute Cancer Data Access System.


Informed consent was obtained from all subjects involved in the study and managed by the respective organizers. The current study was conducted in accordance with the Declaration of


Helsinki, and the respective study organizers were responsible for obtaining the ethical approval. The Supplementary Methods details TMA construction and histological images of these cohorts


as well as their exclusion and inclusion criteria.

CLINICOPATHOLOGICAL INFORMATION

Histological images of PCa, clinicopathological information, and longitudinal follow-up data were


available for all cases. Clinicopathological data included age at diagnosis, preoperative prostate-specific antigen (PSA) measurements, RP TNM classification, and RP GG for all patients at


the RP and TMA core sample levels. Tumor staging was based on the 2002 TNM classification54 and grading according to the 2016 WHO/ISUP consensus55. All data were available from the


corresponding framework and study trial. The clinicopathological information was obtained through a meticulous chart review process, involving the extraction of data and the data quality


control from the electronic health records (EHR) of each participating hospital.

FOLLOW-UP AND ENDPOINTS

Most patients were regularly followed after RP to identify BCR, defined as two


consecutive increases in serum PSA levels above 0.2 ng/mL, PSA persistence (failure to fall below 0.1 ng/mL), initiation of salvage or adjuvant treatment, and cancer-specific death. BCR


status (non-BCR vs. BCR) and cancer-specific death status were documented during the follow-up period. Non-BCR cases or cancer survivors with incomplete follow-up duration were censored at


the date of last follow-up for survival analyses. The occurrence of castration-resistant prostate cancer (CRPC) during the follow-up period was additionally documented.
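As a toy illustration of the BCR definition used for labeling, the helper below flags two consecutive post-operative PSA values above 0.2 ng/mL or PSA persistence above 0.1 ng/mL. It is a simplified sketch with a hypothetical function name, not the study's data-curation code, and it deliberately ignores salvage/adjuvant treatment and death, which were handled from the clinical records.

```python
from typing import Sequence

def has_biochemical_recurrence(psa_series_ng_ml: Sequence[float]) -> bool:
    """Simplified BCR flag from a chronological series of post-RP PSA values.

    Captures two of the criteria in the text: two consecutive values above
    0.2 ng/mL, or PSA persistence (failure to fall below 0.1 ng/mL).
    Salvage/adjuvant treatment and cancer-specific death are not modeled here.
    """
    psa = list(psa_series_ng_ml)
    if not psa:
        return False
    # PSA persistence: the PSA never drops below 0.1 ng/mL after surgery
    if min(psa) >= 0.1:
        return True
    # Two consecutive measurements above 0.2 ng/mL
    return any(a > 0.2 and b > 0.2 for a, b in zip(psa, psa[1:]))

print(has_biochemical_recurrence([0.05, 0.04, 0.25, 0.31]))  # True
print(has_biochemical_recurrence([0.05, 0.04, 0.06]))        # False
```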


MODEL DEVELOPMENT

The development cohort was further divided into training and in-training validation sets, with the largest single-institution cohort used as the training set. Gleason patterns were utilized


to ensure consistent histological appearance in circular cores with a diameter of approximately 0.6 mm. Gleason patterns 3 + 3 and 4 + 4 were specifically used to evaluate homogeneous cores


to ensure consistency in the histological appearance. These patterns were selected to determine the minimum and maximum ranges of architectural tissue alteration within the circular cores.


In contrast, cores with Gleason pattern 4 + 3 were considered to represent heterogeneous cores, indicating an intermediate stage of architectural alteration of the tissue. The selection of


Gleason pattern 3 cores was limited to cases without recurrence during follow-up to ensure a clean pattern. Images including Gleason pattern 5 were intentionally excluded from the training


set. By removing pattern 5 and 3 + 4 from the training set, we aimed to encourage the model to learn and rely on other distinguishing features that are indicative of different malignancy


patterns other than the Gleason pattern system (quasi zero-shot learning). As a result, the model development process accounted for tissue appearance and distortion variations independent of


the current Gleason grading system. The study employed a neural architecture search using PlexusNET and grid search to find the best model architecture for BCR prediction56. The Adam optimization


algorithm and cross-entropy loss function were used to train the models57. The optimal architecture was selected based on a 3-fold cross-validation performance. The resulting model was


trained on the entire training set with early stopping and triangular cyclical learning rates applied to mitigate overfitting. Model performance was evaluated at the case level using


confidence scores and metrics such as AUROC and Heagerty’s c-index58,59,60. Tile-level predictions were aggregated to determine core- or slide-level predictions, and case-level predictions


were estimated by averaging core- or slide-level predictions. In parallel, we repeated the same steps using ResNet-50RS16,61, VGG-1617, and EfficientNet18, as these represent


state-of-the-art or classical architectures (SOTA)16,17,18,61, and we then assessed the effect sizes (i.e., hazard ratio) for each model for BCR prognosis at case level. In a similar manner,


we tested the performance benefits of using image patches at ×20 or ×40 object magnification, using the Cox deep convolutional neural network concept as described by Katzman et al.62 or the


attention aggregation layer63 instead of the global average pooling for our survival modeling. The risk classification model for BCR was constructed using the chi-square automatic


interaction detector (CHAID) algorithm64, with probability cutoffs identified on the development set and validated on external validation sets.
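The training configuration described in this section (Adam optimizer, cross-entropy loss, early stopping, and triangular cyclical learning rates) could be wired up in Keras roughly as follows. This is a minimal sketch: `build_model` is a placeholder stand-in for the PlexusNET-derived architecture, the hyperparameters are assumptions, and none of it reproduces the authors' code.

```python
import math
from tensorflow import keras

def triangular_clr(base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular cyclical learning rate as a per-batch schedule (sketch)."""
    def schedule(batch):
        cycle = math.floor(1 + batch / (2 * step_size))
        x = abs(batch / step_size - 2 * cycle + 1)
        return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
    return schedule

class CyclicalLR(keras.callbacks.Callback):
    """Apply the triangular schedule at the start of every training batch."""
    def __init__(self, schedule):
        super().__init__()
        self.schedule, self.batch = schedule, 0
    def on_train_batch_begin(self, batch, logs=None):
        self.batch += 1
        keras.backend.set_value(self.model.optimizer.learning_rate,
                                self.schedule(self.batch))

def build_model():
    # Placeholder tile classifier; the actual architecture came from the
    # PlexusNET neural architecture search described in the text.
    return keras.Sequential([
        keras.layers.Conv2D(16, 3, activation="relu", input_shape=(512, 512, 3)),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

model = build_model()
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=[keras.metrics.AUC()])
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True),
    CyclicalLR(triangular_clr()),
]
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```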


MODEL EVALUATION

In the development and external validation cohorts, confidence scores for BCR (BCR scores) were generated for all cases. Prognostic classification and accuracy were measured using AUROC, Heagerty’s C-index, and


generalized concordance probability. The goodness-of-fit was assessed using Akaike information criterion (AIC) and Bayesian information criterion (BIC)65,66,67. Calibration plots were


created for external validation of the BCR model to evaluate its interpretability. Harrell’s “resampling model calibration” algorithm was applied to assess model calibration68,69. BCR


predictions were compared to corresponding Kaplan–Meier survival estimates within 10 years. Univariate and multivariate weighted Cox regression analyses were conducted on external validation


cohorts using Schemper et al.’s method to provide unbiased hazard ratio estimates, even in cases of non-proportional hazards70. Parameters included age at diagnosis, surgical margin status,


preoperative serum levels of PSA, pT stage, pN stage, GG, and BCR confidence scores. Parameters significant in the univariate analysis were included in the multivariate Cox regression


analysis to identify independent prognostic factors for BCR. Cox regression models were used for cancer-specific survival to examine the prognostic value of the novel score/grading system,


including GG, tumor stage, and the novel score/grading system. In addition, we performed Fine-Gray competing risk regression analyses for cancer-specific mortality, while


considering other competing causes of death reported in the death certificates. Kaplan–Meier survival estimates were used to approximate the BCR and cancer-specific survival probabilities


for GG and the novel risk classification. Nested partial likelihood ratio tests were conducted to compare different Cox regression model configurations (only categorical variables) and


determine the best model for prognosis71. The best-performing grading system (novel grading vs. GG) was chosen based on the lowest changes in partial likelihood ratio and _p_-values. The AIC


and BIC values were compared among the Cox regression models, with the best model having the lowest values. Pearson correlation coefficient was calculated to assess the correlation between


the risk score and slide number. The variance inflation factor (VIF) was utilized to assess the multicollinearity level between the GG, novel grading, and tumor stage. Here, we built two


logistic regression models for 10-year BCR and cancer-specific death prediction. A VIF below 2 indicates negligible multicollinearity between these prediction variables. To ensure the


robustness, reliability, and adequate sample size of our study, we conducted a power calculation for Cox proportional hazards regression. Specifically, we evaluated the statistical power of


our analysis considering GG and risk score groups to prognose BCR or cancer-specific mortality using powerSurvEpi72.
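The model-comparison step described in this section (Cox regression, concordance, and AIC) could be set up with lifelines along the lines below. The data frame and columns are hypothetical, and the study's weighted Cox regression (Schemper et al.) is not reproduced; a plain CoxPHFitter serves here only as an illustrative stand-in.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical analysis table: follow-up, event flag, and candidate predictors.
rng = np.random.default_rng(5)
n = 900
df = pd.DataFrame({
    "months": rng.exponential(80, n).clip(1, 180),
    "bcr": rng.binomial(1, 0.3, n),
    "psa": rng.lognormal(2.0, 0.6, n),
    "pt_stage": rng.integers(2, 4, n),
    "gg": rng.integers(1, 6, n),
    "novel_risk_group": rng.integers(1, 5, n),
})

def fit_cox(columns):
    cph = CoxPHFitter()
    cph.fit(df[["months", "bcr"] + columns], duration_col="months", event_col="bcr")
    return cph

model_gg = fit_cox(["psa", "pt_stage", "gg"])
model_novel = fit_cox(["psa", "pt_stage", "novel_risk_group"])

# Lower AIC and higher concordance indicate the better-fitting configuration.
for name, m in [("GG", model_gg), ("novel risk group", model_novel)]:
    print(name, "c-index:", round(m.concordance_index_, 3),
          "partial AIC:", round(m.AIC_partial_, 1))
```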


HUMAN INTERPRETABILITY

The first external validation set (CPCBN) core images were grouped according to their risk classification. Five experienced genitourinary pathologists with over 10 years of expertise were asked to review and sort randomly labeled image


groups based on tumor differentiation. Furthermore, these senior pathologists were asked to explain their decisions in sorting the image groups, although no specific instructions on how to do so were given. Pathologists were blinded to the corresponding clinicopathological and follow-up information to mitigate recall bias and survivorship bias. Each pathologist was


individually approached via email to perform the assigned task while the image groups were randomly sorted before sharing them with each pathologist; no communication between pathologists


specific to this task was permitted to avoid confirmation bias. No time limit was set for the task. To assess the inter-rater agreement between a pathologist and our novel


risk groups, we utilized a percent agreement based on the proportion of correctly labeled risk groups out of the total number of risk groups under the assumption that the probability for a


random agreement in sorting the entire grouped images between a single pathologist and the novel risk classification model is <5% and therefore negligible.
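The percent-agreement measure used here is simply the proportion of risk groups a pathologist labeled in the same order as the model. A minimal sketch, with hypothetical labels:

```python
# Percent agreement between one pathologist's sorting of the blinded image
# groups and the model's risk groups; labels below are hypothetical.
model_order = ["low", "low intermediate", "high intermediate", "high"]
pathologist_order = ["low", "low intermediate", "high intermediate", "high"]

agreement = sum(a == b for a, b in zip(model_order, pathologist_order)) / len(model_order)
print(f"Percent agreement: {agreement:.0%}")  # 100% when the sorting matches exactly
```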


SOFTWARE

Model development and analyses were performed with Albumentations73, Keras 2.674, TensorFlow 2.675, Python™ 3.8, SPSS® 23, and the R statistical package system (R Foundation for Statistical Computing, Vienna,


Austria).

DATA AVAILABILITY

Due to data transfer agreements and data privacy issues, the data cannot be made openly available.

CODE AVAILABILITY

An abstract version of the code can be obtained


from https://github.com/oeminaga/AI_PCA_GRADE.

CHANGE HISTORY

* 08 July 2024: A Correction to this paper has been published: https://doi.org/10.1038/s44303-024-00026-2

REFERENCES

*


Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer Statistics, 2021. _CA Cancer J. Clin._ 71, 7–33 (2021). Article  PubMed  Google Scholar  * Mottet, N. et al.


EAU-EANM-ESTRO-ESUR-SIOG guidelines on prostate cancer—2020 update. Part 1: screening, diagnosis, and local treatment with curative intent. _Eur. Urol._ 79, 243–262 (2021). Article  CAS 


PubMed  Google Scholar  * Gleason, D. F. In _Urologic Pathology. the Prostate_ Vol. 171 (1977). * Gleason, D. F. & Mellinger, G. T. Prediction of prognosis for prostatic adenocarcinoma


by combined histological grading and clinical staging. _J. Urol._ 111, 58–64 (1974). Article  CAS  PubMed  Google Scholar  * Epstein, J. I., Allsbrook, W. C. Jr., Amin, M. B., Egevad, L. L.


& Committee, I. G. The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma. _Am. J. Surg. Pathol._ 29, 1228–1242


(2005). Article  PubMed  Google Scholar  * Epstein, J. I., Srigley, J., Grignon, D., Humphrey, P. & Otis, C. Recommendations for the reporting of prostate carcinoma. _Virchows Arch._


451, 751–756 (2007). Article  PubMed  Google Scholar  * Epstein, J. I. et al. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. _Eur. Urol._ 69,


428–435 (2016). Article  PubMed  Google Scholar  * Varma, M., Shah, R. B., Williamson, S. R. & Berney, D. M. 2019 Gleason grading recommendations from ISUP and GUPS: broadly concordant


but with significant differences. _Virchows Archiv._ 478, 813–815 (2021). Article  PubMed  Google Scholar  * Allsbrook, W. C. Jr et al. Interobserver reproducibility of Gleason grading of


prostatic carcinoma: general pathologist. _Hum. Pathol._ 32, 81–88 (2001). Article  PubMed  Google Scholar  * Ozkan, T. A. et al. Interobserver variability in Gleason histological grading of


prostate cancer. _Scand J. Urol._ 50, 420–424 (2016). Article  CAS  PubMed  Google Scholar  * Bulten, W. et al. Automated deep-learning system for Gleason grading of prostate cancer using


biopsies: a diagnostic study. _Lancet Oncol._ 21, 233–241 (2020). Article  PubMed  Google Scholar  * Nagpal, K. et al. Development and validation of a deep learning algorithm for Gleason


grading of prostate cancer from biopsy specimens. _JAMA Oncol._ 6, 1372–1380 (2020). Article  PubMed  Google Scholar  * Pantanowitz, L. et al. An artificial intelligence algorithm for


prostate cancer diagnosis in whole slide images of core needle biopsies: a blinded clinical validation and deployment study. _Lancet Digit Health_ 2, e407–e416 (2020). Article  PubMed 


Google Scholar  * Burchardt, M. et al. Interobserver reproducibility of Gleason grading: evaluation using prostate cancer tissue microarrays. _J. Cancer Res. Clin. Oncol._ 134, 1071–1078


(2008). Article  CAS  PubMed  Google Scholar  * Ouellet, V. et al. The Terry Fox Research Institute Canadian Prostate Cancer Biomarker Network: an analysis of a pan-Canadian multi-center


cohort for biomarker validation. _BMC Urol_. 18, 78 (2018). Article  PubMed  PubMed Central  Google Scholar  * Bello, I. et al. Revisiting resnets: improved training and scaling strategies.


_Adv. Neural Inf. Process. Syst._ 34, 22614–22627 (2021). Google Scholar  * Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at


https://arxiv.org/abs/1409.1556 (2014). * Tan, M. & Le, Q. Efficientnet: rethinking model scaling for convolutional neural networks. In _International Conference on Machine Learning_


6105–6114 (PMLR, 2019). * Sanda, M. G. et al. Clinically localized prostate cancer: AUA/ASTRO/SUO guideline. Part I: risk stratification, shared decision making, and care options. _J. Urol._


199, 683–690 (2018). Article  PubMed  Google Scholar  * Roobol, M. J. & Carlsson, S. V. Risk stratification in prostate cancer screening. _Nat. Rev. Urol._ 10, 38–48 (2013). Article 


CAS  PubMed  Google Scholar  * Huang, Y., Li, W., Macheret, F., Gabriel, R. A. & Ohno-Machado, L. A tutorial on calibration measurements and calibration models for clinical prediction


models. _J. Am. Med. Inform. Assoc._ 27, 621–633 (2020). Article  PubMed  PubMed Central  Google Scholar  * Vaicenavicius, J. et al. Evaluating model calibration in classification. In _The


22nd International Conference on Artificial Intelligence and Statistics_ 3459–3467 (PMLR, 2019). * Strom, P. et al. Artificial intelligence for diagnosis and grading of prostate cancer in


biopsies: a population-based, diagnostic study. _Lancet Oncol._ 21, 222–232 (2020). Article  PubMed  Google Scholar  * Silva-Rodriguez, J., Colomer, A., Dolz, J. & Naranjo, V.


Self-learning for weakly supervised Gleason grading of local patterns. _IEEE J. Biomed. Health Inform._ 25, 3094–3104 (2021). Article  PubMed  Google Scholar  * Huang, W. et al. Development


and validation of an artificial intelligence–powered platform for prostate cancer grading and quantification. _JAMA Netw. Open_ 4, e2132554–e2132554 (2021). Article  PubMed  PubMed Central 


Google Scholar  * He, T. et al. Bag of tricks for image classification with convolutional neural networks. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern


Recognition_ 558–567 (2019). * Pinckaers, H. et al. Predicting biochemical recurrence of prostate cancer with artificial intelligence. _Commun. Med._ 2, 64 (2022). Article  PubMed  PubMed


Central  Google Scholar  * Wulczyn, E. et al. Predicting prostate cancer specific-mortality with artificial intelligence-based Gleason grading. _Commun. Med._ 1, 1–8 (2021). Article  Google


Scholar  * Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. _AIChE J._ 37, 233–243 (1991). Article  CAS  Google Scholar  * Vapnik, V. _The Nature


of Statistical Learning Theory_ (Springer Science & Business Media, 1999). * Yamamoto, Y. et al. Automated acquisition of explainable knowledge from unannotated histopathology images.


_Nat. Commun._ 10, 5642 (2019). Article  CAS  PubMed  PubMed Central  Google Scholar  * Chen, R. J. et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning.


_Cancer Cell_ 40, 865–878.e866 (2022). Article  CAS  PubMed  PubMed Central  Google Scholar  * Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using


convolutional networks. _Proc. Natl Acad. Sci._ 115, E2970–E2979 (2018). Article  CAS  PubMed  PubMed Central  Google Scholar  * Rudin, C. Stop explaining black box machine learning models


for high stakes decisions and use interpretable models instead. _Nat. Mach. Intell._ 1, 206–215 (2019). Article  PubMed  PubMed Central  Google Scholar  * Saporta, A. et al. Benchmarking


saliency methods for chest X-ray interpretation. _Nat. Mach. Intell._ 4, 867–878 (2022). Article  Google Scholar  * Arun, N. et al. Assessing the trustworthiness of saliency maps for


localizing abnormalities in medical imaging. _Radiol. Artif. Intell._ 3, e200267 (2021). Article  PubMed  PubMed Central  Google Scholar  * Saad, F., Bögemann, M., Suzuki, K. & Shore, N.


Treatment of nonmetastatic castration-resistant prostate cancer: focus on second-generation androgen receptor inhibitors. _Prostate Cancer Prostatic. Dis._ 24, 323–334 (2021). Article 


PubMed  PubMed Central  Google Scholar  * Saad, F. et al. 2022 Canadian Urological Association (CUA)-Canadian Uro Oncology Group (CUOG) guideline: Management of castration-resistant prostate


cancer (CRPC). _Can. Urol. Assoc. J._ 16, E506–E515 (2022). Article  PubMed  PubMed Central  Google Scholar  * Kim, H. E. et al. Transfer learning for medical image classification: a


literature review. _BMC Med. Imaging_ 22, 69 (2022). Article  PubMed  PubMed Central  Google Scholar  * Morid, M. A., Borjali, A. & Del Fiol, G. A scoping review of transfer learning


research on medical image analysis using ImageNet. _Comput. Biol. Med._ 128, 104115 (2021). Article  PubMed  Google Scholar  * Bulten, W. et al. Artificial intelligence for diagnosis and


Gleason grading of prostate cancer: the PANDA challenge. _Nat. Med._ 28, 154–163 (2022). Article  CAS  PubMed  PubMed Central  Google Scholar  * Harrell, F. E. _Regression Modeling


Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis_ (Springer, 2001). * Leyh-Bannurah, S. R. et al. A multi-institutional validation of gleason score


derived from tissue microarray cores. _Pathol. Oncol. Res._ 25, 979–986 (2019). Article  CAS  PubMed  Google Scholar  * Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of


current approaches to explainable artificial intelligence in health care. _Lancet Digit. Health_ 3, e745–e750 (2021). Article  CAS  PubMed  Google Scholar  * Mingard, C., Valle-Pérez, G.,


Skalse, J. & Louis, A. A. Is SGD a Bayesian sampler? Well, almost. _J. Mach. Learn. Res._ 22, 3579–3642 (2021). Google Scholar  * Valle-Perez, G., Camargo, C. Q. & Louis, A. A. Deep


learning generalizes because the parameter-function map is biased towards simple functions. Preprint at https://arxiv.org/abs/1805.08522 (2018). * Mingard, C. et al. Neural networks are a


priori biased towards boolean functions with low entropy. Preprint at https://arxiv.org/abs/1909.11522 (2019). * Wenzel, F. et al. How good is the Bayes posterior in deep neural networks


really? Preprint at https://arxiv.org/abs/2002.02405 (2020). * Matzke, E. A. et al. Certification for biobanks: the program developed by the Canadian Tumour Repository Network (CTRNet).


_Biopreserv. Biobank_ 10, 426–432 (2012). Article  PubMed  Google Scholar  * Wissing, M. et al. Optimization of the 2014 Gleason grade grouping in a Canadian cohort of patients with


localized prostate cancer. _BJU Int._ 123, 624–631 (2019). Article  PubMed  Google Scholar  * Brimo, F. et al. Strategies for biochemical and pathologic quality assurance in a large


multi-institutional biorepository; The experience of the PROCURE Quebec Prostate Cancer Biobank. _Biopreserv. Biobank_ 11, 285–290 (2013). Article  PubMed  PubMed Central  Google Scholar  *


Team, P. P., Gohagan, J. K., Prorok, P. C., Hayes, R. B. & Kramer, B.-S. The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the National Cancer Institute:


history, organization, and status. _Controll. Clin. Trials_ 21, 251S–272S (2000). Article  Google Scholar  * Andriole, G. L. et al. Mortality results from a randomized prostate-cancer


screening trial. _N. Engl. J. Med._ 360, 1310–1319 (2009). Article  CAS  PubMed  PubMed Central  Google Scholar  * Greene, F. L. et al. _AJCC Cancer Staging Handbook: TNM Classification of


Malignant Tumors_ (Springer Science & Business Media, 2002). * Egevad, L., Delahunt, B., Srigley, J. R. & Samaratunga, H. International Society of Urological Pathology (ISUP) grading


of prostate cancer—an ISUP consensus on contemporary grading. _APMIS_ 124, 433–435 (2016). Article  PubMed  Google Scholar  * Eminaga, O. et al. PlexusNet: a neural network architectural


concept for medical image classification. _Comp. Biol. Med._ 154, 106594 (2023). * Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at


https://arxiv.org/abs/1412.6980 (2014). * Heller, G. & Mo, Q. Estimating the concordance probability in a survival analysis with a discrete number of risk groups. _Lifetime Data Anal._


22, 263–279 (2016). Article  PubMed  Google Scholar  * Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk


prediction procedures with censored survival data. _Stat Med._ 30, 1105–1117 (2011). Article  PubMed  PubMed Central  Google Scholar  * Heagerty, P. J. & Zheng, Y. Survival model


predictive accuracy and ROC curves. _Biometrics_ 61, 92–105 (2005). Article  PubMed  Google Scholar  * He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition.


In _Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition_ 770–778 (2016). * Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox


proportional hazards deep neural network. _BMC Med. Res. Methodol._ 18, 24 (2018). Article  PubMed  PubMed Central  Google Scholar  * Touvron, H. et al. Augmenting Convolutional networks


with attention-based aggregation. Preprint at https://arxiv.org/abs/2112.13692 (2021). * Kass, G. V. An exploratory technique for investigating large quantities of categorical data. _J. R.


Stat. Soc. Ser. C Appl. Stat._ 29, 119–127 (1980). Google Scholar  * Sakamoto, Y., Ishiguro, M. & Kitagawa, G. _Akaike Information Criterion Statistics._ Vol. 81, 26853 (D. Reidel,


1986). * Vrieze, S. I. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).


_Psychol. Methods_ 17, 228 (2012). Article  PubMed  PubMed Central  Google Scholar  * Neath, A. A. & Cavanaugh, J. E. The Bayesian information criterion: background, derivation, and


applications. _Wiley Interdiscip. Rev. Comput. Stat._ 4, 199–203 (2012). Article  Google Scholar  * Harrell, F. E. Regression modeling strategies. _Bios_ 330, 14 (2017). Google Scholar  *


Harrell Jr, F. E., Harrell Jr, M. F. E. & Hmisc, D. Package ‘rms’. Vanderbilt University 229, Q8 (2017). * Schemper, M., Wakounig, S. & Heinze, G. The estimation of average hazard


ratios by weighted Cox regression. _Stat. Med._ 28, 2473–2489 (2009). Article  PubMed  Google Scholar  * Cox, D. R. Partial likelihood. _Biometrika_ 62, 269–276 (1975). Article  Google


Scholar  * Qiu, W. et al. Package ‘powerSurvEpi’ (2009). * Buslaev, A. et al. Albumentations: fast and flexible image augmentations. _Information_ 11, 125 (2020). Article  Google Scholar  *


Gulli, A. & Pal, S. _Deep learning with Keras_ (Packt Publishing Ltd, 2017). * Abadi, M. et al. Tensorflow: a system for large-scale machine learning. In _12th {USENIX} Symposium on


Operating Systems Design and Implementation ({OSDI} 16)_ 265–283 (2016).

ACKNOWLEDGEMENTS

The Canadian Prostate Cancer Biomarker Network (CPCBN) acknowledges


contributions to its biobank from several Institutions across Canada: Centre Hospitalier de l’Université de Montreal (CHUM), Centre Hospitalier Universitaire de Quebec (CHUQ), McGill


University Health Centre (MUHC), University Health Network (UHN), and University of British Columbia/Vancouver Coastal Health Authority. D.T. receives salary support from the FRQS (Clinical


Research Scholar, Junior 2). The CRCHUM and CRCHUQc-UL receive support from the FRQS. The authors thank Mrs. Véronique Barrès, Mrs. Gabriela Fragoso, and Mrs. Liliane Meunier of the


molecular pathology core facility of the Centre de Recherche du Centre Hospitalier de l’Université de Montréal for performing the sections, immunohistochemistry, and slide scanning and the


facility core for image analysis with the Visiopharm image software. Access to Dr. Féryel Azzi’s expertise is possible thanks to the TransMedTech Institute and its primary funding partner,


the Canada First Research Excellence Fund. We acknowledge the contribution of PLCO study trial to this study providing histology images and corresponding clinical data. Finally, external


validation TMAs for this research project were obtained from the PROCURE Biobank. This biobank is the result of a collaboration between the Centre Hospitalier de l’Université de Montréal


(CHUM), the CIUSSS de l’Estrie-CHUS, the CHU de Québec-Université Laval and the Research Institute of the McGill University Health Center, with funds from PROCURE and its partners. We thank


the organization of the PLCO study for sharing the whole-slide images and the corresponding clinical information. AUTHOR INFORMATION Author notes * These authors jointly supervised this


work: Dominique Trudel, Sami-Ramzi Leyh-Bannurah. AUTHORS AND AFFILIATIONS * AI Vobis, Palo Alto, CA, USA Okyaz Eminaga * Division of Urology, Department of Surgery, Centre Hospitalier de


l’Université de Montréal, University of Montreal, Montréal, QC, H2X 0A9, Canada Fred Saad & Pierre I. Karakiewicz * Cancer Prognostics and Health Outcomes Unit, University of Montreal


Health Center, Montreal, QC, Canada Fred Saad, Zhe Tian & Pierre I. Karakiewicz * Centre de recherche du Centre Hospitalier de l’Université de Montréal (CRCHUM), 900 Saint-Denis,


Montréal, QC, H2X 0A9, Canada Fred Saad, Pierre I. Karakiewicz, Véronique Ouellet, Feryel Azzi & Dominique Trudel * Institut du cancer de Montréal, 900 Saint-Denis, Montréal, QC, H2X


0A9, Canada Fred Saad, Véronique Ouellet, Feryel Azzi & Dominique Trudel * University of Münster, Münster, Germany Ulrich Wolffgang * Department of Pathology and Cellular Biology,


Université de Montréal, 2900 Boulevard Édouard-Montpetit, Montreal, QC, H3T 1J4, Canada Feryel Azzi & Dominique Trudel * Department of Pathology, Centre Hospitalier de l’Université de


Montréal (CHUM), 1051 Sanguinet, Montreal, QC, H2X 0C1, Canada Feryel Azzi & Dominique Trudel * Institute of Pathology, St. Franziskus-Hospital, Muenster, Germany Tilmann Spieker *


Department of Pathology, University of Muenster, Muenster, Germany Tilmann Spieker * Institute of Pathology, Elbe Klinikum Stade, Academic Teaching Hospital of The University Medical Center


Hamburg-Eppendorf (UKE), Hamburg, Germany Burkhard M. Helmke * Martini-Klinik Prostate Cancer Center, University Hospital Hamburg-Eppendorf, Hamburg, Germany Markus Graefen & Sami-Ramzi


Leyh-Bannurah * Department of Computer Science, University of Muenster, Muenster, Germany Xiaoyi Jiang * Department of Radiation Oncology - Radiation Physics, Stanford University School of


Medicine, Stanford, CA, USA Lei Xing * Prostate Center Northwest, Department of Urology, Pediatric Urology and Uro-Oncology, St. Antonius-Hospital, Gronau, Germany Jorn H. Witt & 


Sami-Ramzi Leyh-Bannurah

AUTHORS

Okyaz Eminaga, Fred Saad, Zhe Tian, Ulrich Wolffgang, Pierre I. Karakiewicz, Véronique Ouellet, Feryel Azzi, Tilmann Spieker, Burkhard M. Helmke, Markus Graefen, Xiaoyi Jiang, Lei Xing, Jorn H. Witt, Dominique Trudel, Sami-Ramzi Leyh-Bannurah

CONTRIBUTIONS

O.E., D.T., and S.-R.L.-B. designed the study, D.T. and F.A. evaluated the histological slides, O.E. performed the


model development, validation, and data visualization. Z.T. critically reviewed the statistical methods. X.J. and L.X. critically reviewed the deep learning methods. Z.T., U.W., F.S.,


P.I.K., V.O., F.A., B.M.H., J.H.W., D.T., M.G., X.J., L.X., and S.-R.L.-B. critically reviewed and evaluated the methods and results. O.E. drafted the manuscript. S.-R.L.-B. and D.T. edited


and revised the manuscript. S.-R.L.-B. and D.T. supervised the study. F.S., P.I.K., D.T., and S.-R.L.-B. supervised the oncologic and clinical aspects. L.X. and X.J. supervised the aspects


for computer science, artificial intelligence, pattern recognition, and computer vision. All authors reviewed and approved the manuscript. CORRESPONDING AUTHORS Correspondence to Okyaz


Eminaga or Sami-Ramzi Leyh-Bannurah. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains


neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

RIGHTS AND PERMISSIONS

OPEN ACCESS This


article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as


you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party


material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s


Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

CITE THIS ARTICLE

Eminaga, O., Saad, F., Tian, Z. _et al._ Artificial intelligence unravels interpretable malignancy grades of prostate cancer on histology images. _npj Imaging_ 2, 6 (2024). https://doi.org/10.1038/s44303-023-00005-z

* Received: 28 September 2023 * Accepted: 18 December 2023 * Published: 06 March 2024 * DOI: https://doi.org/10.1038/s44303-023-00005-z

