Fast intratumor heterogeneity inference from single-cell sequencing data

Fast intratumor heterogeneity inference from single-cell sequencing data

Play all audios:

Loading...

ABSTRACT We introduce HUNTRESS, a computational method for mutational intratumor heterogeneity inference from noisy genotype matrices derived from single-cell sequencing data, the running


time of which is linear with the number of cells and quadratic with the number of mutations. We prove that, under reasonable conditions, HUNTRESS computes the true progression history of a


tumor with high probability. On simulated and real tumor sequencing data, HUNTRESS is demonstrated to be faster than available alternatives with comparable or better accuracy. Additionally,


the progression histories of tumors inferred by HUNTRESS on real single-cell sequencing datasets agree with the best known evolution scenarios for the associated tumors. Access through your


institution Buy or subscribe This is a preview of subscription content, access via your institution ACCESS OPTIONS Access through your institution Access Nature and 54 other Nature Portfolio


journals Get Nature+, our best-value online-access subscription $29.99 / 30 days cancel any time Learn more Subscribe to this journal Receive 12 digital issues and online access to articles


$99.00 per year only $8.25 per issue Learn more Buy this article * Purchase on SpringerLink * Instant access to full article PDF Buy now Prices may be subject to local taxes which are


calculated during checkout ADDITIONAL ACCESS OPTIONS: * Log in * Learn about institutional subscriptions * Read our FAQs * Contact customer support SIMILAR CONTENT BEING VIEWED BY OTHERS


FASTCLONE IS A PROBABILISTIC TOOL FOR DECONVOLUTING TUMOR HETEROGENEITY IN BULK-SEQUENCING SAMPLES Article Open access 08 September 2020 CLONESIG CAN JOINTLY INFER INTRA-TUMOR HETEROGENEITY


AND MUTATIONAL SIGNATURE ACTIVITY IN BULK TUMOR SEQUENCING DATA Article Open access 09 September 2021 HAPLOTYPE-AWARE ANALYSIS OF SOMATIC COPY NUMBER VARIATIONS FROM SINGLE-CELL


TRANSCRIPTOMES Article 26 September 2022 DATA AVAILABILITY Our study makes use of two publicly available datasets introduced in previous studies16,17. For the leukemia dataset, single-cell


and bulk sequencing data have been deposited at NCBI BioProject ID PRJNA648656 and SNP array data at NCBI GEO ID GSE156934. For the HGSOC dataset, the single-cell FASTQs have been deposited


in the European Genome-phenome Archive under accession no. EGA: EGAS00001003190. The OV2295 datasets are available at Zenodo20. Source data for the performance results for Figs. 1 and 2 are


available in Supplementary Table 1. Simulated data generated and used in this study for obtaining the results shown in Extended Data Figs. 1–10 are available at Zenodo21. Source data are


provided with this paper. CODE AVAILABILITY The open-source implementation of HUNTRESS is available at Zenodo22. CHANGE HISTORY * _ 16 SEPTEMBER 2022 In the version of this article initially


published, the email address shown for Salem Malikić was incorrect and has been amended in the HTML and PDF versions of the article. _ REFERENCES * Kuipers, J., Jahn, K. & Beerenwinkel,


N. Advances in understanding tumour evolution through single-cell sequencing. _Biochim. Biophys. Acta_ 1867, 127–138 (2017). Google Scholar  * Schwartz, R. & Schäffer, A. A. The


evolution of tumour phylogenetics: principles and practice. _Nat. Rev. Genet._ 18, 213–229 (2017). Article  Google Scholar  * Jahn, K., Kuipers, J. & Beerenwinkel, N. Tree inference for


single-cell data. _Genome Biol._ 17, 86 (2016). Article  Google Scholar  * Ross, E. M. & Markowetz, F. OncoNEM: inferring tumor evolution from single-cell sequencing data. _Genome Biol._


17, 69 (2016). Article  Google Scholar  * Zafar, H., Tzen, A., Navin, N., Chen, K. & Nakhleh, L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models.


_Genome Biol._ 18, 178 (2017). Article  Google Scholar  * Zafar, H., Navin, N., Chen, K. & Nakhleh, L. Siclonefit: Bayesian inference of population structure, genotype and phylogeny of


tumor clones from single-cell genome sequencing data. _Genome Res._ 29, 1847–1859 (2019). Article  Google Scholar  * Wu, Y. Accurate and efficient cell lineage tree inference from noisy


single cell data: the maximum likelihood perfect phylogeny approach. _Bioinformatics_ 36, 742–750 (2020). Google Scholar  * Malikic, S., Jahn, K., Kuipers, J., Sahinalp, S. C. &


Beerenwinkel, N. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. _Nat. Commun._ 10, 2750 (2019). Article  Google Scholar  * Malikić, S.,


Mehrabadi, F. R., Azer, E. S., Ebrahimabadi, M. H. & Sahinalp, S. C. Studying the history of tumor evolution from single-cell sequencing data by exploring the space of binary matrices.


_J. Comput. Biol._ 28, 857–879 (2021). Article  MathSciNet  Google Scholar  * El-Kebir, M. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error.


_Bioinformatics_ 34, i671–i679 (2018). Article  Google Scholar  * Malikic, S. et al. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of


single-cell and bulk sequencing data. _Genome Res._ 29, 1860–1877 (2019). Article  Google Scholar  * Edrisi, M., Zafar, H. & Nakhleh, L. in _19th International Workshop on Algorithms in


Bioinformatics_ (_WABI 2019_), Vol. 143 of _Leibniz International Proceedings in Informatics_ (_LIPIcs_) (eds Huber, K. T. & Gusfield, D.) 22:1–22:13 (National Science Foundation, 2019).


* Sadeqi Azer, E. et al. PhISCS-BnB: a fast branch and bound algorithm for the perfect tumor phylogeny reconstruction problem. _Bioinformatics_ 36, i169–i176 (2020). Article  Google Scholar


  * Ciccolella, S. et al. gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data. _BMC Bioinformatics_ 21, 413 (2020). Article  Google


Scholar  * Azer, E. S., Ebrahimabadi, M. H., Malikić, S., Khardon, R. & Sahinalp, S. C. Tumor phylogeny topology inference via deep learning. _iScience_ 23, 101655 (2020). Article 


Google Scholar  * Laks, E. et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. _Cell_ 179, 1207–1221 (2019). Article  Google Scholar  *


Morita, K. et al. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. _Nat. Commun._ 11, 5327 (2020). Article  Google Scholar  * Singer, J., Kuipers,


J., Jahn, K. & Beerenwinkel, N. Single-cell mutation identification via phylogenetic inference. _Nat. Commun._ 9, 5144 (2018). Article  Google Scholar  * Gusfield, D. Efficient


algorithms for inferring evolutionary trees. _Networks_ 21, 19–28 (1991). Article  MathSciNet  Google Scholar  * McPherson, A. W. Clonal decomposition and DNA replication states defined by


scaled single cell genome sequencing. Zenodo (2019); https://doi.org/10.5281/zenodo.3445364 * Malikic, S., Mehrabadi, F. R. & Kizilkale, C. Fast intratumor heterogeneity inference from


single-cell sequencing data (simulated data – extended data figures). Zenodo (2022); https://doi.org/10.5281/zenodo.6829082 * Kizilkale, C., Buluc, A. & Rashidi, F. PASSIONLab/HUNTRESS:


HUNTRESS. Zenodo (2022); https://doi.org/10.5281/zenodo.6803392 Download references ACKNOWLEDGEMENTS This work is supported in part by the Intramural Research Program of the National


Institutes of Health, National Cancer Institute (to F.R.M., E.P.-G., K.L.M., M.P.L., C.-P.D., G.M., S.C.S. and S.M.) and utilized the computational resources of the NIH Biowulf


high-performance computing cluster (http://hpc.nih.gov) and Gurobi (http://www.gurobi.com) to solve some optimization problems. Additionally, F.R.M. was supported in part by Indiana U. Grand


Challenges Precision Health Initiative. C.K. and A.B. were supported by the Advanced Scientific Computing Research (ASCR) Program of the Department of Energy Office of Science under


contract no. DE-AC02- 05CH11231. AUTHOR INFORMATION Author notes * Erfan Sadeqi Azer Present address: Google LLC, Sunnyvale, CA, USA * These authors contributed equally: Can Kizilkale, Farid


Rashidi Mehrabadi. AUTHORS AND AFFILIATIONS * Department of Electrical Engineering and Computer Sciences UC Berkeley, Berkeley, CA, USA Can Kızılkale & Aydın Buluç * Computational


Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA Can Kızılkale & Aydın Buluç * Cancer Data Science Laboratory, Center for Cancer Research, National Cancer


Institute, National Institutes of Health, Bethesda, MD, USA Farid Rashidi Mehrabadi, S. Cenk Sahinalp & Salem Malikić * Department of Computer Science, Indiana University, Bloomington,


IN, USA Farid Rashidi Mehrabadi, Erfan Sadeqi Azer & Funda Ergün * Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes


of Health, Bethesda, MD, USA Eva Pérez-Guijarro, Kerrie L. Marie, Maxwell P. Lee, Chi-Ping Day & Glenn Merlino Authors * Can Kızılkale View author publications You can also search for


this author inPubMed Google Scholar * Farid Rashidi Mehrabadi View author publications You can also search for this author inPubMed Google Scholar * Erfan Sadeqi Azer View author


publications You can also search for this author inPubMed Google Scholar * Eva Pérez-Guijarro View author publications You can also search for this author inPubMed Google Scholar * Kerrie L.


Marie View author publications You can also search for this author inPubMed Google Scholar * Maxwell P. Lee View author publications You can also search for this author inPubMed Google


Scholar * Chi-Ping Day View author publications You can also search for this author inPubMed Google Scholar * Glenn Merlino View author publications You can also search for this author


inPubMed Google Scholar * Funda Ergün View author publications You can also search for this author inPubMed Google Scholar * Aydın Buluç View author publications You can also search for this


author inPubMed Google Scholar * S. Cenk Sahinalp View author publications You can also search for this author inPubMed Google Scholar * Salem Malikić View author publications You can also


search for this author inPubMed Google Scholar CONTRIBUTIONS S.C.S. and A.B. jointly initiated and supervised the project. The algorithmic approach was developed by C.K. HUNTRESS was


implemented by C.K. and was tested by F.R.M. Theorem 1, related lemmas and their proofs are by S.C.S., S.M., F.E. and C.K. Experimental results on real data and external simulators are by


F.R.M. The internal simulator was developed and the simulated data were generated by S.M. The experimental results on the internal simulator are by C.K. and F.R.M. The bulk of the paper was


written by S.M., S.C.S., F.R.M., C.K. and F.E. with feedback from all co-authors. E.P.-G., K.L.M., M.P.L., E.S.A., C.-P.D. and G.M. contributed to the interpretation of the results and


biological implications. CORRESPONDING AUTHORS Correspondence to S. Cenk Sahinalp or Salem Malikić. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. PEER


REVIEW PEER REVIEW INFORMATION _Nature Computational Science_ thanks Yufeng Wu, Hamim Zafar and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.


Primary Handling Editor: Fernando Chirigati, in collaboration with the _Nature Computational Science_ team. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with


regard to jurisdictional claims in published maps and institutional affiliations. EXTENDED DATA EXTENDED DATA FIG. 1 A RUNNING TIME ASSESSMENT OF HUNTRESS ON SIMULATED DATA WITH NO FALSE


POSITIVES, IN COMPARISON TO SCISTREE, SPHYR, AND PHISCS-BNB, AS WELL AS ITS SLOWER BUT MORE GENERAL VARIANTS, PHISCS-I, AND PHISCS-B. All numbers on y-axis are in log10 scale. Here n, m and


fn, respectively, denote the number of cells, the number of mutations and the false negative error rate in single-cell data. For each setting of n and m, we report the distribution of the


running time for each tool - over 10 distinct trees of tumor progression, with false negative error rates of 0.05, 0.1 and 0.2. Each tool was run with a time limit of 8 hours (those cases


that exceed the time limit are not included here). The corresponding accuracy measures for each setting are shown in Extended Data Figure 2 (ancestor descendant accuracy measure) and


Extended Data Figure 3 (different-lineages accuracy measure). Source data EXTENDED DATA FIG. 2 COMPARISON OF ANCESTOR-DESCENDANT (AD) ACCURACY MEASURES FOR HUNTRESS, PHISCS-BNB, SCISTREE AND


SPHYR ON SIMULATED DATA WITH NO FALSE POSITIVES. Here n, m and fn, respectively, denote the number of cells, the number of mutations and the false negative error rate in single-cell


sequencing. For each setting of n and m, we report the ancestor-descendent (AD) accuracy measure for each tool with respect to the ground truth. The experiments were performed over 10


distinct trees of tumor progression, using a false negative error rate of 0.05, 0.1 or 0.2. Each tool was run with a time limit of 8 hours (those cases that exceed the time limit are not


included here). Note that we have not included results of PhISCS-I and PhISCS-B as their accuracy values are identical to that of PhISCS-BnB (on these instances on which they completed the


task) due to the same underlying objective function and optimality guarantee that they all provide. Source data EXTENDED DATA FIG. 3 COMPARISON OF DIFFERENT-LINEAGES (DL) ACCURACY MEASURE


DISTRIBUTIONS FOR HUNTRESS, PHISCS-BNB, SCISTREE AND SPHYR ON SIMULATED DATA WITH NO FALSE POSITIVES. Here n, m and fn, respectively, denote the number of cells, the number of mutations and


the false negative error rate for single-cell sequencing. For each setting of n and m, we report the different-lineages (DL) accuracy measure for each tool with respect to the ground truth.


The experiments were performed over 10 distinct trees of tumor progression, with a false negative error rate varying across 0.05, 0.1 and 0.2. Each tool was run with a time limit of 8 hours


(those cases that exceed the time limit are not included here). Note that we have not included results of PhISCS-I and PhISCS-B separately as their accuracy values match those of PhISCS-BnB


(on these instances on which they completed the task) due to the same underlying objective function and optimality guarantee that they all provide. Source data EXTENDED DATA FIG. 4


ANCESTOR-DESCENDANT (AD) ACCURACY MEASURE DISTRIBUTIONS FOR HUNTRESS, SCISTREE AND SPHYR ON SIMULATED DATA WITH FALSE POSITIVES, FALSE NEGATIVES AND MISSING ENTRIES. Here n, m, fn, fp and na


respectively, denote the number of cells, the number of mutations, the false negative, false positive error and missing entry rates in single-cell sequencing data. For each setting we


report the distribution for each tool over 10 distinct trees of tumor progression. Each tool was allowed to run with a time limit of 48 hours (those cases that exceed the time limit are not


included here). Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 5


DIFFERENT LINEAGES (DL) ACCURACY MEASURE DISTRIBUTIONS FOR HUNTRESS, SCISTREE AND SPHYR ON SIMULATED DATA WITH FALSE POSITIVES, FALSE NEGATIVES AND MISSING ENTRIES. Here n, m, fn, fp and na


respectively, denote the number of cells, the number of mutations, the false negative, false positive error and missing entry rates in single-cell sequencing data. For each setting we report


the distribution for each tool over 10 distinct trees of tumor progression. Each tool was allowed to run with a time limit of 48 hours (those cases that exceed the time limit are not


included here). Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 6 RUNNING


TIME DISTRIBUTIONS FOR HUNTRESS, SCISTREE AND SPHYR ON SIMULATED DATA WITH FALSE POSITIVES, FALSE NEGATIVES AND MISSING ENTRIES. Here n, m, fn, fp and na respectively, denote the number of


cells, the number of mutations, the false negative, false positive error and missing entry rates in single-cell sequencing data. For each setting we report the distribution for each tool


over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time limit of 48 hours (those cases that exceed the time limit are not included here).


Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 7 ANCESTOR-DESCENDANT


(AD) AND DIFFERENT LINEAGES (DL) ACCURACY MEASURES AS WELL AS RUNNING TIME DISTRIBUTIONS FOR HUNTRESS, SCISTREE AND SPHYR ON SIMULATED DATASETS WITH DOUBLETS. Here n=1000, m=300, fn=0.2 or


0.05, fp=0.001, na=0.05, and the doublet rate is set to 0.03. Each distribution is over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time


limit of 48 hours. Note that each (violin) plot shows the maximum, center and minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 8


ANCESTOR-DESCENDANT (AD) AND DIFFERENT LINEAGES (DL) ACCURACY MEASURES, AS WELL AS THE RUNNING TIME DISTRIBUTIONS FOR HUNTRESS ON LARGE SIMULATED DATASETS. In these simulations we set


n=5000, m=500, fn=0.05 or 0.2, fp=0.001 and na=0.05. For each setting, the distributions are reported over 10 distinct trees of tumor progression. Note that because ScisTree could not finish


any of the tasks within the time limit of 48 hours and SPhyR failed to generate any output, they are not presented in the figure. Note that each (violin) plot shows the maximum, center and


minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 9 ANCESTOR-DESCENDANT (AD) AND DIFFERENT LINEAGES (DL) ACCURACY MEASURES AS WELL AS


RUNNING TIME DISTRIBUTIONS FOR HUNTRESS, SCISTREE AND SPHYR ON SIMULATED DATASETS WITH A HIGH(ER) FALSE POSITIVE RATE OF FP=0.003. Here n=1000, m=300, fn=0.2 or 0.05, and na=0.05. Each


distribution is over 10 distinct trees of tumor progression. For any given task each tool was allowed to run with a time limit of 48 hours. Note that each (violin) plot shows the maximum,


center and minimum value of the data depicted, together with the probability density. Source data EXTENDED DATA FIG. 10 RUNNING TIME, ANCESTOR-DESCENDANT AND DIFFERENT LINEAGES ACCURACY


MEASURE DISTRIBUTIONS FOR HUNTRESS AND SPHYR ON SIMULATIONS WITH PARAMETERS SIMILAR TO THOSE OBSERVED IN THE AML DATASET (THE TAPESTRI PLATFORM). Here n=5000, m=50, fn=0.2 or 0.05, fp=0.01


or 0.003 and na=0.1. All simulated data used in this figure were generated by the simulator developed for OncoNEM. For any given task each tool was allowed to run with a time limit of 48


hours. Note that each (box) plot shows the maximum, center and minimum value, as well as the median half of the data depicted. Source data SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION


Supplementary Figs. 1 and 2, Tables 1–7, Algorithms 1–5, proofs and discussions. SOURCE DATA SOURCE DATA EXTENDED DATA FIG. 1 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 2


Statistical source data. SOURCE DATA EXTENDED DATA FIG. 3 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 4 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 5 Statistical


source data. SOURCE DATA EXTENDED DATA FIG. 6 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 7 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 8 Statistical source data.


SOURCE DATA EXTENDED DATA FIG. 9 Statistical source data. SOURCE DATA EXTENDED DATA FIG. 10 Statistical source data. RIGHTS AND PERMISSIONS Reprints and permissions ABOUT THIS ARTICLE CITE


THIS ARTICLE Kızılkale, C., Rashidi Mehrabadi, F., Sadeqi Azer, E. _et al._ Fast intratumor heterogeneity inference from single-cell sequencing data. _Nat Comput Sci_ 2, 577–583 (2022).


https://doi.org/10.1038/s43588-022-00298-x Download citation * Received: 09 June 2021 * Accepted: 14 July 2022 * Published: 08 September 2022 * Issue Date: September 2022 * DOI:


https://doi.org/10.1038/s43588-022-00298-x SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get shareable link Sorry, a shareable link is not


currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative