Resolving the complexity of the human genome using single-molecule sequencing

ABSTRACT

The human genome is arguably the most complete mammalian reference assembly [1,2,3], yet more than 160 euchromatic gaps remain [4,5,6] and aspects of its structural variation remain poorly understood ten years after its completion [7,8,9]. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing [10]. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read
sequencing technology.

ACCESSION CODES

PRIMARY ACCESSIONS

SEQUENCE READ ARCHIVE

* SRP040522
* SRP044331
* SRX533609

DATA DEPOSITS

All underlying SMRT WGS read data have been released within the NCBI Sequence Read Archive (SRA) under


accession SRX533609 and may also be accessed as part of the full collection of SMRT data sets (NCBI SRA accession SRP040522). Illumina WGS data for CHM1 are available in the NCBI SRA under accession SRP044331; finished BAC and fosmid clone inserts sequenced using SMRT data are also available (GenBank accessions in Supplementary Table 35). For the purpose of mapping and annotation, we developed a patched GRCh37 reference genome, including a track hub for upload into the UCSC Genome Browser. A complete list of all inaccessible regions of the human genome and a database of


heterochromatic and subtelomeric sequence reads that could not be assembled are available at (http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation).

REFERENCES

1. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. _Nature_ 491, 56–65 (2012)
2. The International HapMap Project Consortium. The International HapMap Project. _Nature_ 426, 789–796 (2003)
3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. _Nature_ 431, 931–945 (2004)
4. Kurahashi, H. et al. Molecular cloning of a translocation breakpoint hotspot in 22q11. _Genome Res._ 17, 461–469 (2007)
5. Genovese, G. et al. Using population admixture to help complete maps of the human genome. _Nature Genet._ 45, 406–414 (2013)
6. Bovee, D. et al. Closing gaps in the human genome with fosmid resources generated from multiple individuals. _Nature Genet._ 40, 96–101 (2008)
7. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. _Nature_ 470, 59–65 (2011)
8. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. _Cell_ 143, 837–847 (2010)
9. Eichler, E. E., Clark, R. A. & She, X. An assessment of the sequence gaps: unfinished business in a finished human genome. _Nature Rev. Genet._ 5, 345–354 (2004)
10. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. _Science_ 323, 133–138 (2009)
11. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. _BMC Bioinformatics_ 13, 238 (2012)
12. Lee, H. & Schatz, M. C. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. _Bioinformatics_ 28, 2097–2105 (2012)
13. Myers, E. W. et al. A whole-genome assembly of _Drosophila_. _Science_ 287, 2196–2204 (2000)
14. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. _Nature Methods_ 10, 563–569 (2013)
15. Huddleston, J. et al. Reconstructing complex regions of genomes using long-read sequencing technology. _Genome Res._ 24, 688–696 (2014)
16. Kimelman, A. et al. A vast collection of microbial genes that are toxic to bacteria. _Genome Res._ 22, 802–809 (2012)
17. Lander, E. S. et al. Initial sequencing and analysis of the human genome. _Nature_ 409, 860–921 (2001)
18. Venter, J. C. et al. The sequence of the human genome. _Science_ 291, 1304–1351 (2001)
19. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. _Nature_ 464, 704–712 (2010)
20. Kong, A. et al. A high-resolution recombination map of the human genome. _Nature Genet._ 31, 241–247 (2002)
21. Stewart, C. et al. A comprehensive map of mobile element insertion polymorphisms in humans. _PLoS Genet._ 7, e1002236 (2011)
22. Steinberg, K. M. et al. Single haplotype assembly of the human genome from a hydatidiform mole. _Genome Res._ (in press)
23. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. _Comput. Appl. Biosci._ 11, 615–619 (1995)
24. Jurka, J., Klonowski, P., Dagman, V. & Pelton, P. CENSOR–a program for identification and elimination of repetitive elements from DNA sequences. _Comput. Chem._ 20, 119–121 (1996)
25. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-3.0 http://www.repeatmasker.org (1996–2010)
26. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. _Genome Biol._ 11, R119 (2010)
27. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. _Bioinformatics_ 21, 1859–1875 (2005)

ACKNOWLEDGEMENTS

We thank D. Alexander, D. Church and A. Klammer for discussions,


K. Mohajeri and L. Harshman for technical assistance and T. Brown for assistance in manuscript preparation. This work was supported, in part, by US National Institutes of Health (NIH) grants HG002385 and HG007497 to E.E.E. M.Y.D. is supported by the US National Institute of Neurological Disorders and Stroke (award K99NS083627). E.E.E. is an investigator of the Howard Hughes


Medical Institute.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Department of Genome Sciences, University of Washington School of Medicine, Seattle, 98195, Washington, USA Mark J. P. Chaisson, John Huddleston, Megan Y. Dennis, Peter H. Sudmant, Maika Malig, Fereydoun Hormozdiari, Richard Sandstrom, John A. Stamatoyannopoulos & Evan E. Eichler
* Howard Hughes Medical Institute, University of Washington, Seattle, 98195, Washington, USA John Huddleston & Evan E. Eichler
* Dipartimento di Biologia, Università degli Studi di Bari ‘Aldo Moro’, Bari 70125, Italy Francesca Antonacci
* Department of Pathology, University of Pittsburgh, Pittsburgh, 15261, Pennsylvania, USA Urvashi Surti
* Pacific Biosciences of California, Inc., Menlo Park,


94025, California, USA Matthew Boitano, Jane M. Landolin, Michael W. Hunkapiller & Jonas Korlach

CONTRIBUTIONS


E.E.E., M.J.P.C., M.Y.D., J.H. and J.K. designed experiments; M.M. prepared DNA; M.M. and M.B. prepared libraries and generated sequence data; P.H.S., J.H. and M.Y.D. identified clones for


sequencing; J.H., P.H.S., M.Y.D., F.H. and M.J.P.C. performed bioinformatics analyses; M.Y.D., F.A. and M.M. performed targeted sequencing of clones; M.J.P.C. designed algorithms and


pipelines for mapping SMRT sequence data and detection of structural variants; M.W.H., U.S., R.S. and J.A.S. provided access to critical resources; J.M.L. deposited SMRT sequence data into


SRA; M.J.P.C., J.H. and E.E.E. wrote the manuscript.

CORRESPONDING AUTHOR

Correspondence to Evan E. Eichler.

ETHICS DECLARATIONS

COMPETING INTERESTS

M.B., J.M.L., M.W.H. and J.K. are employees of Pacific Biosciences, Inc., a company commercializing DNA sequencing technologies; E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and was formerly an SAB member of Pacific Biosciences, Inc. (2009–2013) and SynapDx Corp. (2011–2013); and M.J.P.C. is a former employee of Pacific Biosciences, Inc.

EXTENDED DATA FIGURES AND TABLES

EXTENDED DATA FIGURE 1 SEQUENCE CONTENT OF GAP CLOSURES. A–C, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37; examples of the organization of these


regions are shown using Miropeats for chromosome 4 (GRCh37, chr4:59724333–59804333) (A), chromosome 11 (GRCh37, chr11:87673378–87753378) (B), and chromosome X (GRCh37,


chrX:143492324–143572324) (C). Dotplots show the architecture of the degenerate STRs with the core motif highlighted below. Shared sequence motifs between blocks are indicated by colour.
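
The dot plots described for Extended Data Figure 1 can be illustrated with a small sketch. The Python below is not the authors' tooling; it simply records every pair of positions in a sequence that share an identical k-mer and reports the most common diagonal offset, which for a tandem repeat approximates the motif length. The function names, the k-mer size and the toy sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def self_dotplot(seq, k=4):
    """Return sorted (i, j) pairs, i < j, whose k-mers are identical.

    Each pair is one dot in a self-comparison dot plot (hypothetical helper).
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for hits in positions.values():
        # Every pair of occurrences of the same k-mer yields one dot.
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                dots.append((hits[a], hits[b]))
    return sorted(dots)


def dominant_offset(dots):
    """Most frequent diagonal offset (j - i); ties go to the smaller offset."""
    counts = defaultdict(int)
    for i, j in dots:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts, key=lambda d: (counts[d], -d))


if __name__ == "__main__":
    toy_repeat = "ACGGT" * 6  # assumed toy sequence, not from the paper
    print(dominant_offset(self_dotplot(toy_repeat, k=4)))
```

For a perfect 5-bp repeat such as the toy sequence, the dominant offset is 5. In a degenerate repeat some dots drop out where the motif has mutated, but the main diagonals typically remain visible. The pairwise loop is quadratic in the number of k-mer occurrences, so this is only suitable for short illustrative sequences.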


EXTENDED DATA FIGURE 2 VARIANT DETECTION PIPELINE. At every variant locus, we collected the full-length reads that overlap the locus, performed _de novo_ assembly using the Celera assembler,


and called a consensus using Quiver after remapping reads used in the assembly as well as reads flanking the assembly (yellow reads) to increase consensus quality at the boundaries of the


assembly. BLASR is used to align the assembly consensus sequences to the reference, and insertions and deletions in the alignments are output as variants. Reads spanning a deletion event


within a single alignment are shown as bars connected by a solid line, and double hard-stop reads spanning a larger deletion event and split into two separate alignments of the same read are


shown as a dotted line.

EXTENDED DATA FIGURE 3 GENOME DISTRIBUTION OF CLOSED GAPS AND INSERTIONS. Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per


5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap


sequences (red), inversions (green), and complex events (blue).

EXTENDED DATA FIGURE 4 CONFIRMATION OF COMPLEX INSERTIONS IN ADDITIONAL GENOMES. Top, genotypes of polymorphic complex


regions using read depth of unique _k_-mers (blue: present; white: absent). Bottom, extended examples of complex insertion events: alignment to chimpanzee panTro4 reference (dark blue);


existing human reference hg19 (light teal); inserted sequence (dark teal). The bottom rows show repeat annotations, with darker hues for repeats overlapping the inserted region.

EXTENDED DATA FIGURE 5 INVERSION VALIDATION BY BAC-INSERT SEQUENCING. Inversions detected by alignment of single long reads were validated by sequencing clones from the CHM1 BAC library (CHORI17), in


which end mappings to GRCh37 spanned the putative inversions. Inversions were validated by aligning the corresponding BAC sequences to GRCh37 with Miropeats. Shared sequence between the


BACs and GRCh37 is shown in black; inversion events are indicated in red.

EXTENDED DATA FIGURE 6 CHM1 CLONE-BASED ASSEMBLY OF THE HUMAN 10Q11 GENOMIC REGION. A, The clone-based assembly is


composed primarily of BACs from the CH17 library as shown in the tiling path below the internal repeat structure of the region. Coloured arrows indicate large segmental duplications with


homologous sequences connected by coloured lines (Miropeats). Genes annotated from alignment of RefSeq messenger RNA sequences with GMAP [27] are shown. B, Miropeats comparisons of the 10q11 clone-based assembly against the corresponding sequence from GRCh37, with gaps shown in red, highlight the degree to which the reference was misassembled.

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION This file contains Supplementary Methods, Text and Data, Supplementary Figures 1–29, Supplementary Tables 1–35 and additional references. Tables shown in this file represent views of the full tables given in the Supplementary Tables file. (PDF 5107 kb)

SUPPLEMENTARY TABLES This file contains the full table values for the Supplementary Tables 1–35 (see


separate Supplementary information file). (XLSX 442 kb)

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Chaisson, M., Huddleston, J., Dennis, M. _et al._ Resolving the complexity of the human genome using single-molecule sequencing. _Nature_ 517, 608–611 (2015). https://doi.org/10.1038/nature13907

* Received: 03 July 2014
* Accepted: 30 September 2014
* Published: 10 November 2014
* Issue Date: 29 January 2015
* DOI: https://doi.org/10.1038/nature13907
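
ILLUSTRATIVE SKETCH: OUTPUTTING INDELS FROM ALIGNMENTS

The last step of the variant detection pipeline (Extended Data Figure 2) outputs insertions and deletions found in the alignments of assembly consensus sequences to the reference. The Python below is a minimal sketch of that idea, not the authors' pipeline. It assumes the alignments are available as SAM-style CIGAR strings with a 0-based reference start, and the 50-bp minimum length, the function name and the example CIGAR are assumptions chosen for illustration.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(ref_start, cigar, min_len=50):
    """Return (type, ref_pos, length) for each indel of at least min_len bases.

    ref_start is the 0-based reference coordinate where the alignment begins.
    min_len=50 is an assumed threshold, not a value taken from the text above.
    """
    ref_pos = ref_start
    calls = []
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_len:
            # Inserted bases sit between reference positions; report the breakpoint.
            calls.append(("insertion", ref_pos, length))
        elif op == "D" and length >= min_len:
            calls.append(("deletion", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return calls


if __name__ == "__main__":
    # Hypothetical alignment: a 120-bp insertion, a 75-bp deletion, a 10-bp insertion.
    for call in indels_from_cigar(1000, "500M120I300M75D200M10I50M"):
        print(call)
```

In the hypothetical example, the 120-bp insertion is reported at reference position 1500 and the 75-bp deletion at position 1800, while the 10-bp insertion falls below the assumed threshold and is skipped. Larger deletions that split one read into two separate alignments, as the figure legend describes, would need additional logic that compares the two alignments of the same read.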