Play all audios:
ABSTRACT Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the
genomes of 12 _Drosophila_ species for the _de novo_ discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary
signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous
unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of
miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif
instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. You
have full access to this article via your institution. Download PDF SIMILAR CONTENT BEING VIEWED BY OTHERS STRUCTURAL AND FUNCTIONAL CHARACTERIZATION OF A PUTATIVE DE NOVO GENE IN
_DROSOPHILA_ Article Open access 12 March 2021 POPULATION-SCALE LONG-READ SEQUENCING UNCOVERS TRANSPOSABLE ELEMENTS ASSOCIATED WITH GENE EXPRESSION VARIATION AND ADAPTIVE SIGNATURES IN
_DROSOPHILA_ Article Open access 12 April 2022 THE ORIGIN AND STRUCTURAL EVOLUTION OF DE NOVO GENES IN _DROSOPHILA_ Article Open access 27 January 2024 MAIN The sequencing of the human
genome and the genomes of dozens of other metazoan species has intensified the need for systematic methods to extract biological information directly from DNA sequence. Comparative genomics
has emerged as a powerful methodology for this endeavour1,2. Comparison of few (two–four) closely related genomes has proven successful for the discovery of protein-coding genes3,4,5, RNA
genes6,7, miRNA genes8,9,10,11 and catalogues of regulatory elements3,4,12,13,14. The resolution and discovery power of these studies should increase with the number of
genomes15,16,17,18,19,20, in principle enabling the systematic discovery of all conserved functional elements. The fruitfly _Drosophila melanogaster_ is an ideal system for developing and
evaluating comparative genomics methodologies. Over the past century, _Drosophila_ has been a pioneering model in which many of the basic principles governing animal development and
population biology were established21. In the past decade, the genome sequence of _D. melanogaster_ provided one of the first systematic views of a metazoan genome22, and the ongoing effort
by the FlyBase and Berkeley Drosophila Genome Project (BDGP) groups established a systematic high-quality genome annotation23,24,25. Moreover, the fruitfly benefits from extensive
experimental resources26,27,28, which enable novel functional elements to be systematically tested and used in the evaluation of genetic screens29,30. The fly research community has
sequenced, assembled and annotated the genomes of 12 _Drosophila_ species22,31,32 at a range of evolutionary distances from _D. melanogaster_ (Fig. 1a, b). The analysis of these genomes was
organized around two complementary aims. The first, described in an accompanying paper32, was to understand the evolution of genes and chromosomes on the _Drosophila_ phylogeny, and how it
relates to speciation and adaptation. The second goal, described here, was to develop general comparative methodologies to discover and refine functional elements in _D. melanogaster_ using
the 12 genomes, and to investigate the scaling of discovery power and its implications for studies in vertebrates (Fig. 1c). Here, we report genome-wide alignments of the 12 species
(Supplementary Information 1), and the systematic discovery of euchromatic functional elements in the _D. melanogaster_ genome. We predict and refine thousands of protein-coding exons, RNA
genes and structures, miRNAs, pre- and post-transcriptional regulatory motifs and regulatory targets. We validate many of these elements using complementary DNA (cDNA) sequencing, human
curation, small RNA sequencing, and correlation with experimentally supported transcription factor and miRNA targets. In addition, our analysis leads to several specific biological findings,
listed below. • We predict 123 novel polycistronic transcripts, 149 genes with apparent stop-codon readthrough and several candidate programmed frameshifts, with potential roles in
regulation, localization and function of the corresponding protein products. • We make available the first systematic prediction of general RNA genes and structures (non-coding RNAs
(ncRNAs)) in _Drosophila_, including several structures probably involved in translational regulation and adenosine-to-inosine RNA editing (A-to-I editing). • We present comparative and
experimental evidence that some miRNA loci yield multiple functional products, from both hairpin arms or from both DNA strands, thereby increasing the versatility and complexity of
miRNA-mediated regulation. • We provide further comparative evidence for miRNA targeting in protein-coding exons. • We report an initial network of pre- and post-transcriptional regulatory
targets in _Drosophila_ on the basis of individual high-confidence motif occurrences. COMPARATIVE GENOMICS AND EVOLUTIONARY SIGNATURES Although multiple closely related genomes provide
sufficient neutral divergence for recognition of functional regions in stretches of highly conserved nucleotides16,17,33, measures of nucleotide conservation alone do not distinguish between
different types of functional elements. Moreover, functional elements that tolerate abundant ‘silent’ mutations, such as protein-coding exons and many regulatory motifs, might not be
detected when searching on the basis of strong nucleotide conservation. Across many genomes spanning larger evolutionary distances, the information in the patterns of sequence change reveals
evolutionary signatures (Fig. 2) that can be used for systematic genome annotation. Protein-coding regions show highly constrained codon substitution frequencies34 and insertions and
deletions that are heavily biased to be multiples of three3 (Fig. 2a). RNA genes and structures tolerate substitutions that preserve base pairing35,36 (Fig. 2b). MicroRNA hairpins show a
characteristic conservation profile with high conservation in the stem and mutations in loop regions10,11 (Fig. 2c). Finally, regulatory motifs are marked by high levels of genome-wide
conservation3,4,12,13,14, and post-transcriptional motifs show strand-biased conservation12 (Fig. 2d, e). We find that these signatures can be much more precise for genome annotation than
the overall level of nucleotide conservation (for example, Fig. 3a). REVISITING THE PROTEIN-CODING GENE CATALOGUE The annotation of protein-coding genes remains difficult in metazoan genomes
owing to short exons and complex gene structures with abundant alternative splicing. Comparative information has improved computational gene predictors5, but their accuracy still falls far
short of well-studied gene catalogues such as the FlyBase annotation, which combines computational gene prediction37, high-throughput experimental data38,39,40,41,42 and extensive manual
curation23. Recognizing this, we set out not only to produce an independent computational annotation of protein-coding genes in the fly genome, but also to assess and refine its already
high-quality annotations43. Our analyses of _D. melanogaster_ coding genes are based on two independent evolutionary signatures unique to protein-coding regions (Fig. 2a): (1) reading frame
conservation (RFC)3, which observes the tendency of nucleotide insertions and deletions to preserve the codon reading frame; and (2) codon substitution frequencies (CSF, see Supplementary
Methods 2a), which observes mutational biases towards synonymous codon substitutions and conservative amino acid changes, similar to the non-synonymous/synonymous substitution ratio
_K_A/_K_S34 and other methods44,45,46. ASSESSING AND REFINING EXISTING GENE ANNOTATIONS We first assessed the 13,733 euchromatic genes in FlyBase47 release 4.3. Using the above measures, we
defined tests that ‘confirmed’ genes supported by the evolutionary evidence, ‘rejected’ genes inconsistent with protein-coding selection, or ‘abstained’ for genes that were not aligned or
with ambiguous comparative evidence (Supplementary Methods 2a). Of the 4,711 genes with descriptive names, we confirmed 97%, rejected 1% and abstained for 2%, whereas the same criteria
applied to 15,000 random non-coding regions ≥300 nucleotides rejected 99% of candidates and confirmed virtually none (Table 1). Together, these results illustrate the high sensitivity and
specificity of our criteria. Applying the same criteria to the 9,022 genes lacking a descriptive name (genes designated only by a CG identifier, referred to hereafter as CGid-only genes),
our tests accepted 87%, rejected 5% (414 genes) and abstained for 8%. This provides strong evidence that most CGid-only genes encode proteins, but also suggests that they may be less
constrained20,32 and/or may include incorrect annotations. Indeed, on manual review, 222 (54%) of the 414 rejected CGid-only genes were re-categorized as non-protein-coding or deleted (of
which 55 were due to genomically primed clones), 73 (18%) were flagged as being of uncertain quality, and the remaining 119 (29%) were kept unchanged (Fig. 3b). Some of these are probably
rapidly evolving protein-coding genes, but others may also prove to be non-protein-coding genes or spurious; in fact, none of these had any functional gene ontology (GO) annotation48. In
addition, we proposed specific corrections and adjustments to hundreds of existing transcript models, including translation start site adjustments (Supplementary Fig. 2b), alternative splice
boundaries (Supplementary Fig. 2b), recent nonsense mutations (Supplementary Fig. 2c) and alternative translational reading frames43. IDENTIFYING NEW GENES AND EXONS To predict new
protein-coding exons, we integrated our metrics into a probabilistic algorithm that determines an optimal segmentation of the genome into protein-coding and non-coding regions (Fig. 3a) on
the basis of whole-genome sequence alignments of the 12 fly species (Supplementary Methods 2a). Our genome-wide search predicted 1,193 new protein-coding exons, mostly in euchromatic regions
annotated as intergenic (43%), intronic (26%), or 5′/3′ untranslated region (UTR; 23%) in FlyBase annotation release 4.3. We manually reviewed 928 of these predictions according to FlyBase
standards23 (Supplementary Methods 2a), leading to 142 new gene models (incorporating 192 predictions) and 438 revised gene models (incorporating 562 predictions) (Fig. 3b). In parallel, we
tested 184 predictions (126 intergenic, 58 intronic) by directed cDNA sequencing using inverse polymerase chain reaction (inverse PCR) of circularized full-length clones49,50,51 (Fig. 3c),
which validated 120 targeted predictions (65%) and an additional 42 predictions not directly targeted but contained within the recovered transcripts. Predictions in intergenic regions
yielded 88 full-length cDNAs, providing evidence for 50 new genes and modification of 39 gene models. Predictions within introns of existing annotations yielded 32 full-length cDNAs, of
which only 18 (56%) represent new splice variants of the surrounding gene, whereas the remaining 14 revealed nested or interleaved gene structures. This provides additional evidence that
such complex gene structures are not rare in _Drosophila_23. Overall, 83% of the 948 predicted exons that we assessed by manual curation or cDNA sequencing were incorporated into FlyBase,
resulting in 150 new genes and modifications to hundreds of existing gene models. Finally, the 245 predictions that we did not assess were in non-coding regions of existing transcript
models, or were already included in FlyBase independent of our study. In an independent analysis52, we predicted 98 new genes on the basis of inferred homology to predicted genes in the
informant species32, of which 63% matched the above predictions. DISCOVERING UNUSUAL FEATURES OF PROTEIN-CODING GENES Our analysis also predicted an abundance of unusual protein-coding genes
that call for follow-up experimental investigation. First, we found open reading frames with clear protein-coding signatures and conserved start and stop sites on the transcribed strand of
annotated UTRs, indicative of polycistronic transcripts23,53,54. These include 73% of 115 annotated dicistronic transcripts and 135 new candidate cistrons of 123 genes (Supplementary Fig.
2b). Second, we predicted that 149 genes undergo stop codon readthrough, with protein-coding selection continuing past a deeply conserved stop codon (Fig. 3d), in some cases for hundreds of
amino acids. It is unlikely that these genes are selenoproteins, as they appear to lack SECIS elements that direct selenocysteine recoding55,56,57,58. Other mechanisms may instead be at
work, such as regulation of ribosomal release factors59, A-to-I editing39,60,61, alternative splicing, or other less-characterized mechanisms62. In fact, these genes are significantly
enriched in neuronal proteins (_P_ = 10-4), which frequently undergo A-to-I editing63. Third, we found four genes in which CSF signatures abruptly shift from one reading frame to another in
the absence of nearby intron–exon boundaries or insertions and deletions (Fig. 3e). These are suggestive of conserved ‘programmed’ frameshifts64, which are thought to be rare in eukaryotes.
Overall, our results affected over 10% of protein-coding genes, and will be available in future releases of FlyBase. They also suggest that several types of unusual protein-coding gene
structure may be more prevalent in the fly than previously appreciated. RNA GENES AND STRUCTURES Several comparative approaches to RNA gene identification have been developed6,7,65 that
recognize their characteristic properties: compensatory double substitutions of paired nucleotides (for example, A·U↔C·G), structure-preserving single-nucleotide mutations involving G·U base
pairs (G·U↔G·C and G·U↔A·U), and few nucleotide substitutions disrupting functional base pairs (Fig. 2b). To predict new structures, we applied EvoFold7 in highly conserved segments of the
12 _Drosophila_ species and focused on high-stringency candidates with strong support by compensatory changes (Supplementary Methods 4). Our search led to 394 predictions, recovering 68
known RNA structures (primarily transfer RNA genes) in 0.02% of the genome (570-fold enrichment). The novel candidates consisted of 177 structures in intergenic regions (54%), 103 in introns
(32%), 36 in 3′ UTRs (11%) and 10 in 5′ UTRs (3%). In addition, we predicted 200 structures in protein-coding regions (Supplementary Methods 3). Notably, 75% of 3′ UTR structures and 80% of
5′ UTR structures were predicted on the transcribed strand, suggesting that they are frequently part of the messenger RNA. In contrast, only 47% of intronic structures are on the
transcribed strand, suggesting that they are largely independent of the surrounding genes. KNOWN AND NOVEL TYPES OF RNA GENES Of the 177 predicted intergenic structures, 30 were detected in
a tiling-array expression study42. This fraction (17%) is significantly above that for all conserved intergenic regions (12%, _P_ = 0.007), but lower than that of known intergenic ncRNAs
(21%), suggesting that these candidates may be of lower abundance, temporally or spatially constrained, or might include false positives. Two predictions were expressed throughout
development, one extending the annotation of a previously reported but uncharacterized ncRNA66 and the other probably representing a novel type of ncRNA. The predictions also included nine
novel H/ACA-box small nucleolar RNA candidates in introns of ribosomal genes, known to frequently contain small nucleolar RNAs that guide post-transcriptional base modifications of ncRNAs67.
LIKELY A-TO-I EDITING STRUCTURES Many of the 48 intronic candidates on the transcribed strand and many of the 200 hairpins in coding sequence are probably involved in A-to-I editing or
post-transcriptional regulation (Fig. 4a). Hairpins in coding sequence were associated with 11 of the 157 known editing sites (120-fold enrichment) and both intronic and coding-sequence
hairpins showed a strong enrichment for ion-channel genes (6%, _P_ = 0.007 and 10%, _P_ = 2 × 10-12, respectively), known to be frequent editing targets. Editing is known to occur at
multiple sites in the same gene63, and we find an additional 10 hairpins in known editing targets, as well as 40 additional hairpins clustered in 18 genes not previously known to be edited
(for example _huntingtin_68, which harbours four predicted hairpins, more than any other gene). Intronic predictions also showed the highest abundance of compensatory substitutions: for
example, _Resistant to dieldrin_ (Fig. 2b) contained a 26-base-pair (bp) intronic hairpin flanked by exons known to be edited69 with a striking 16 compensatory changes, _lodestar_ showed one
hairpin with 11 compensatory changes, and _Inverted repeat-binding protein_ showed one hairpin with 10 compensatory substitutions (Fig. 4b). LIKELY REGULATORY UTR STRUCTURES We predicted 38
structures in 3′ UTRs, a density twofold higher than the genomic average, whereas fewer than 10 such examples are currently known70. A considerable fraction of these lies in regulatory
genes (14 out of 38; _P_ = 10-4), including several transcriptional regulators (for example, _cas_, _spen_ and _Alh_), the tyrosine phosphatase _PTP-ER_ and the translation initiation factor
_eIF3-S8_. This suggests that many regulatory genes may themselves be regulated post-transcriptionally through these structures. 3′ UTR structures were also enriched for genes involved in
mRNA localization (3 out of 38, _P_ = 2.7 × 10-4), including _oo18 RNA-binding protein_ (_orb_) and _staufen_ (_stau_), both of which contain double-stranded RNA-binding domains, are
involved in axis specification during oogenesis, and interact with the mRNA of maternal effect protein _oskar_. The hairpin in _orb_ is known to be important for mRNA transport and
localization71, whereas the highly similar _stau_ hairpin has not been previously described to our knowledge. The ten structures found in 5′ UTRs probably contain binding sites for factors
that regulate translation. For example, the fly homologue of yeast ribosomal protein _RPL24_ contains a hairpin structure overlapping its start codon (Fig. 4c). This is interesting in light
of high conservation upstream of the start codon in yeast ribosomal proteins3,4, and findings that ribosomal proteins bind to their mRNAs and control translation in prokaryotes72,73.
CONSERVED RNA STRUCTURES IN _ROX2_ RECRUIT MSL In an independent study74, we searched for conserved regions in the non-coding _roX1_ and _roX2_ (_RNA on the X_) genes to gain insights into
their function. Both RNAs are components of the MSL (Male-specific lethal) complex and are crucial for dosage compensation in male flies, inducing lysine 16 acetylation of histone H4,
leading to upregulation of hundreds of genes on the X chromosome75. We identified several stem-loop structures with repeated sequence motifs (for example, GUUNUACG), and found that tandem
repeats of one of these were sufficient to recruit MSL complexes to the X chromosome and to induce acetylation of lysine 16 of histone H4. Although this structure could not fully rescue
roX-deficient males, our results suggest that it mediates MSL recruitment during _roX2_-dependent chromatin modification and dosage compensation, illustrating the power of evolutionary
evidence for directing experimental studies. PREDICTION AND CHARACTERIZATION OF MIRNA GENES Focusing on specific classes of RNA genes markedly increases the accuracy of RNA gene prediction,
reviewed in refs 35, 76 and illustrated here for _Drosophila_ miRNA genes. The common biogenesis and function of miRNAs77 lead to evolutionary and structural signatures (Fig. 2c) that can be
used for their systematic _de novo_ discovery8,9,10,11. Using such signatures in the 12 fly genomes (Supplementary Methods 4a, b), we predicted 101 miRNAs78 (Supplementary Table 4d), which
include 60 of the 74 verified Rfam miRNAs (81%), while spanning less than 0.006% of the fly genome (13,500-fold nucleotide enrichment). Comparison of our predictions with high-throughput
sequencing data of short RNA libraries from different stages and tissues of _D. melanogaster_78,79 revealed that 84 of the 101 predictions (83%), including 24 of the 41 novel predictions
(59%), were authentic miRNA genes (Fig. 5a and Supplementary Table 4d). An independent computational method79 had 20 of its 45 novel predictions validated when used across six _Drosophila_
species. Additional candidates may represent genuine miRNAs whose temporal or spatial expression pattern does not overlap with the surveyed libraries. Several of the validated miRNAs were on
the transcribed strand of introns or clustered with other miRNAs. For example, _mir-11_ and _mir-998_ (the vertebrate homologue of which, _mir-29_, has been implicated in cancer80) were
both found in the last intron of _E2f_, and might be involved in cell-cycle regulation (Fig. 5b). Notably, two predictions overlapped exons of previously annotated protein-coding genes that
were independently rejected above (Fig. 5c), providing an explanation for the previously observed transcripts of these annotations and highlighting the importance of specific signatures for
genome annotation. High-throughput sequencing data discovered an additional 50 miRNAs not found computationally79,81, thereby illustrating the limitations of purely computational approaches.
Some of these had precursor structures not seen previously for animal miRNAs, including unusually long hairpins79 and hairpins corresponding to short introns (mirtrons)81,82. The remaining
were often less broadly conserved or showed unusual conservation properties. SIGNATURES FOR MATURE MIRNA ANNOTATION The exact position of 5′ cleavage of mature miRNAs is important, because
it dictates the core of the target recognition sequence83,84,85. This leads to unique structural and evolutionary signatures, including direct signals, present at the 5′ cleavage site, and
indirect signals, stemming from the relationship of miRNAs with their target genes (Supplementary Methods 4a, c). Combined into a computational framework78, these signatures predicted the
exact start position in 47 of the 60 cloned Rfam miRNAs (78%), and were within 1 bp in 51 cases (85%). The method disagreed with the previous annotation in 9 of the 14 Rfam miRNAs that were
not previously cloned, of which 6 were confirmed by sequencing reads78,79, leading to marked changes in the inferred target spectrum (Fig. 5d). Prediction accuracy was significantly lower
(41% exact, 61% within 1 nucleotide) for novel miRNAs, which, however, also showed less accurate processing _in vivo_78,79. NEW INSIGHTS INTO MIRNA FUNCTION AND BIOGENESIS We predicted
targets for all conserved miRNAs identified by high-throughput sequencing79 searching for conserved matches to the seed region (similar to ref. 86) evaluated using the branch length score
(Supplementary Methods 5a), a new scoring scheme described below. Whereas the resulting miRNA targeting network changed substantially79, we found that the novel and revised miRNAs shared
many of their predicted targets with previously known miRNAs, resulting in a denser network with increased potential for combinatorial regulation78,79. For ten miRNA hairpins, the mature
miRNA and the corresponding miRNA star sequence (miRNA*, the small RNA from the opposite arm of the hairpin) both appeared to be functional: both reached high computational scores and were
frequently sequenced78,79, often exceeding the abundance of many mature miRNAs (Supplementary Table 4e). The Hox miRNA _mir-10_ showed a particularly striking example of a functional star
sequence (Fig. 5e): both arms showed abundant reads, high scores and highly conserved Hox gene targets78,79, suggesting a key role in Hox regulation. In addition, for 20 miRNA loci, the
anti-sense strand also folded into a high-scoring hairpin suggestive of a functional miRNA78 (Supplementary Table 4f). Indeed, sequencing reads confirmed that four of these anti-sense
hairpins are processed into small RNAs _in vivo_79. Thus, a single genomic miRNA locus may produce up to four miRNAs, each with distinct targets. REGULATORY MOTIF DISCOVERY AND
CHARACTERIZATION Regulatory motifs recognized by proteins and RNAs to control gene expression have been difficult to identify due to their short length, their many weakly specified
positions, and the varying distances at which they can act87,88. Recent studies have shown that comparative genomics of a small number of species can be used for motif discovery3,4,12,13,14,
on the basis of hundreds of conserved instances across the genome (Fig. 2d). Many related genomes should lead to increased discovery power, but also pose new challenges, arising from
sequencing, assembly, or alignment artefacts, and from movement or loss of motif instances in individual species. To account for the unique properties of regulatory motifs, we developed a
phylogenetic framework to assess the conservation of each motif instance across many genomes89. Briefly, we searched for motif instances in each of the aligned genomes, and based on the set
of species that contained them, we evaluated the total branch length over which the _D. melanogaster_ motif instance appears to be conserved (Supplementary Methods 5a, b), which we call the
branch length score (BLS). We used BLS for the discovery of novel motifs (this section) and for the prediction of individual functional motif instances (next section). PREDICTED MOTIFS
RECOVER KNOWN REGULATORS To discover motifs, we estimated the conservation level of candidate sequence patterns with a motif excess conservation (MEC) score compared to overall conservation
levels in promoters, UTRs, introns, protein-coding exons and intergenic regions (Supplementary Methods 5a). Our search in regions with roles in pre-transcriptional regulation resulted in 145
distinct motifs (Table 2), obtained by collapsing variants across 83 motifs discovered in promoters, 35 in enhancers, 20 in 5′ UTRs, 35 in core promoters, 30 in introns and 84 in the
remaining intergenic regions. Motifs discovered in each region showed similar properties and large overlap: 66 (46%) were discovered independently in at least two regions and 40 (28%) in at
least three, consistent with shared regulatory elements in these regions90. The 145 discovered motifs match 40 (46%) of the 87 known transcription factors in _Drosophila_ (Supplementary
Table 5c) compared to 8% expected at random (_P_ = 1 × 10-20). Several of the non-discovered known motifs are involved in early anterior–posterior segmentation of the embryo, consistent with
reports that they are largely non-conserved91; indeed, 74% of these did not exceed the conservation expected by chance in promoter regions. Other non-discovered motifs often lacked
characteristics expected for transcription factor motifs, suggesting that some may be spurious: 49% were unusually long (>10 nucleotides) compared to 23% of recovered ones, and showed
only one or a few total instances genome-wide, suggestive of individual regulatory sites rather than motifs. TISSUE-SPECIFIC AND FUNCTIONAL ENRICHMENT OF NOVEL MOTIFS The discovered motifs
showed strong signals with respect to embryonic expression patterns (Fig. 6a). Overall, 75 (52%) were either enriched or depleted in genes expressed in at least one tissue, compared to 59%
of known motifs and 3% of random controls. Motif depletion may represent either specific repressors for individual tissues, or activators excluded from these tissues. Motif depletion was
found more generally in ubiquitously expressed genes (30% of discovered and 34% of known motifs compared with 1% expected at random), similar to findings for _in vivo_ binding sites92, and
probably reflecting less complex regulation. We also found significant motif enrichment in groups of genetically interacting genes (collected by FlyBase) that often function in common
developmental contexts or signalling pathways, genes of metabolic pathways (Kyoto Encyclopedia of Genes and Genomes, KEGG93), and genes with shared functions (GO). In total, 68% of
discovered and 70% of known motifs were enriched or depleted in one of the functional categories (14% random). Noteworthy examples include motif ME93 (GCAACA), which was more highly enriched
in neuroblasts (_P_ = 4 × 10-12) than either of the two well-known regulators of neuroblast development, _prospero_ and _asense_ (_P_ = 4 × 10-5 and 2 × 10-7, respectively). Similarly,
motifs ME89 (CACRCAC), ME11 (MATTAAWNATGCR) and ME117 (MAAMNNCAA) were highly enriched in malpighian tubule (_P_ = 4 × 10-7), trachea (_P_ = 4 × 10-5) and surface glia (6 × 10-7),
respectively, in each case ranking above motifs for factors known to be important in these tissues (Supplementary Table 5c). These presumably correspond to as-yet-unknown regulators for
these tissues. EXCLUSION, CLUSTERING AND POSITIONAL CONSTRAINTS A large number of motifs were depleted in coding sequence (57% of discovered versus 57% of known and 10% of random motifs, _P_
= 3 × 10-18) and in 3′ UTRs (30% versus 22% and 0%, _P_ = 4 × 10-11), suggesting specific exclusion similar to _in vivo_ binding92. Many of the intergenic or intronic instances occurred in
clusters, a property of motifs that has been used to identify enhancer elements91,94,95,96. We assessed increased conservation of motifs when found near other instances of the same motif
(whether conserved or not, to correct for regional conservation biases), and found significant multiplicity for 19% of the discovered motifs (compared to 24% of known and 4% of random
motifs). In addition, 15 of the discovered motifs (10%) were significantly enriched near transcription start sites (compared to 14% of known and 1% of random motifs). Several were enriched
at precise positions and preferred orientations (Fig. 6b), including close matches to several known core promoter motifs involved in transcription initiation97. For example, ME5 (STATAWAWR),
which matches the TATA-box motif, displayed a sharp peak on the transcribed strand, 27 nucleotides upstream of the transcription start site. Similarly, ME120 (TCAGTT), corresponding to the
known initiator motif (Inr) strongly peaked directly on the transcription start site, and ME54 (RCGYRCGY), which matches a known downstream promoter element (DPE), peaked 30 nucleotides
downstream of the transcription start site. REGULATORY MOTIFS INVOLVED IN POST-TRANSCRIPTIONAL REGULATION We also used BLS/MEC to discover motifs involved in post-transcriptional regulation,
and developed methods to distinguish motifs acting at the DNA level, motifs acting at the RNA level and motifs stemming from protein-coding codon biases (Supplementary Methods 5a). Motifs
acting post-transcriptionally at the RNA level generally showed highly asymmetric conservation12, as functional instances can only occur on the transcribed strand. Indeed, 71 of 90 motifs
(79%) discovered in 3′ UTRs showed strand-specific conservation (compared with only 3% of 5′ UTR motifs and 5% of intron motifs, suggesting that these act primarily in pre-transcriptional
regulation). Overall, 33 motifs discovered in 3′ UTRs were complementary to the 5′ end of Rfam miRNAs, recovering 72% of known miRNAs (68% of 5′ unique miRNA families). An additional 21
motifs matched to 5′ ends of novel miRNAs predicted above, of which 12 were validated experimentally78,79, and 3 motifs matched uniquely to miRNA star sequences, all of which were abundantly
expressed _in vivo_ (Supplementary Table 4e). We found 33 additional motifs in 3′ UTRs that were apparently not associated with miRNAs. MO40 (TGTANWTW) closely matches the Puf-family
Pumilio motif98. MO32 (AATAAA) corresponds to the polyadenylation signal and displays both very strong conservation and a sharply defined distance preference with respect to the end of the
annotated 3′ UTR (_P_ = 10-69). Finally, several motifs (for example, MO24 = TAATTTAT; MO94 = TTATTTT) are variants of known AU-rich elements, which are known to mediate mRNA instability and
degradation99. MICRORNA TARGETING IN PROTEIN-CODING REGIONS Protein-coding regions can also harbour functional regulatory motifs, such as exonic splicing regulatory elements100. However,
motif conservation is difficult to assess within protein-coding regions because of the overlapping selective pressures. Indeed, the most highly conserved nucleotide sequence patterns of
length seven (7mers) in coding sequence showed strong reading-frame-biased conservation, suggesting that they reflect protein-coding constraints rather than regulatory roles at the DNA or
RNA level (Fig. 6c). MicroRNA motifs, which function at the RNA level, instead showed high conservation in all three reading frames, suggesting that they are specifically selected within
coding regions for their RNA-level function. Indeed, previous studies have shown that miRNA motifs in coding regions are preferentially conserved in vertebrates86, that they can lead to
repression in experimental assays101,102, and that they are avoided in genes co-expressed with the miRNA103. Frame-invariant conservation allows us to demonstrate the coding-region targeting
of individual miRNAs, and also enables the _de novo_ discovery of miRNA motifs in coding regions. Using frame-invariant conservation, we recovered 11 miRNA motifs within the top 20
coding-region motifs (Supplementary Table 5g), whereas using overall conservation required several hundred candidates to recover 11 miRNA motifs. Moreover, 7mers complementary to different
positions in the mature miRNA show a distinctive conservation pattern indicative of functional targeting in coding regions (Fig. 6d) and similar to that found in 3′ UTRs12,83 (correlation
coefficient 0.96). Finally, 6mers complementary to miRNA 5′ ends were depleted in coding exons of anti-target genes (Supplementary Fig. 5f), similar to findings for these genes’ 3′
UTRs103,104. Overall, these results, together with findings in vertebrates86,101,102,103, suggest that important miRNA targets have been overlooked by many target prediction methods105 that
have traditionally focused exclusively on 3′ UTR sequences. PREDICTION OF INDIVIDUAL REGULATOR BINDING SITES Previous methods for regulatory motif discovery3,4,12,13,14 integrated
conservation information over hundreds of motif instances across the genome, leading to an exceedingly clear signal for motif discovery even if many of these instances are only marginally
conserved. In contrast, the reliable identification of individual motif instances has been hampered by lack of neutral divergence and would require many related genomes15,16,17,18,19. In the
absence of such data, previous studies have relied on motif clustering91,94,95,96 or other sequence characteristics106 to predict regulatory targets or regions. With the availability of the
12 fly genomes, we inferred high-confidence instances of regulatory motifs by mapping the BLS of each motif instance to a confidence value (Supplementary Methods 5a). This value represents
the probability that a motif instance is functional, on the basis of the conservation level of appropriate control motifs evaluated in the same type of region (promoter, 3′ UTR, coding, and
so on). Because the number of conserved instances decreases much more rapidly for control motifs than for real motifs, the many genomes allowed us to reach high confidence values for many
transcription factors and miRNAs, even at relatively modest BLS thresholds (Fig. 2e). CONSERVED MOTIF INSTANCES IDENTIFY FUNCTIONAL _IN VIVO_ TARGETS We found that increasing confidence
levels selected for functional instances for both transcription factor and miRNA motifs: the normalized fraction of transcription factor motif instances within promoter regions rose from 20%
to 90%; that of miRNA motif instances within 3′ UTRs rose from 20% to 90%; and the fraction of miRNA motif instances on the transcribed strand of 3′ UTRs rose from 50% (uniform) to 100%
(Fig. 7a); in each case selecting the regions and strands where the motifs are known to be functional. We further assessed how predicted motif instances compared with _in vivo_ targets in
promoter regions, defined experimentally (without comparative information). We used a set of high-confidence direct CrebA targets107 and three genome-wide chromatin immunoprecipitation
(ChIP) data sets for Snail, Mef2 and Twist92,108,109, and in each case found that the enrichment between conserved motif instances and known _in vivo_ regions increased sharply for
increasing confidence values (Fig. 7b). We also found that a large fraction of motif instances in experimentally determined target regions was conserved (Fig. 7c): 76% of motif instances in
direct CrebA targets and 90% of motif instances in experimentally supported miRNA targets104,110 were recovered at 60% confidence. Although many of the miRNA targets stem from comparative
predictions and are expected to be well conserved, their high recovery rate illustrates the increased sensitivity of the BLS measure compared to perfect conservation (Supplementary Fig. 7d).
Similar results were found for motifs in known enhancers that were determined to be bound by ChIP (‘ChIP-bound’): 65% of Mef2 motifs, 65% of Snail motifs and 25% of Twist motifs were
conserved (Fig. 7c). CHIP-DETERMINED AND CONSERVATION-DETERMINED TARGETS SHOW SIMILAR ENRICHMENT To determine whether ChIP-bound motifs that lack conservation are biologically meaningful, we
studied their enrichment in muscle gene promoters. We found that motifs that were both bound and evolutionarily conserved showed very strong correlation with muscle genes for all three
factors: Mef2 showed eightfold enrichment, Twist showed sevenfold enrichment and Snail, a mesodermal repressor, showed threefold depletion for muscle genes. However, when only non-conserved
sites were considered, the correlation dropped significantly to 1–2-fold for all three factors, suggesting that non-conserved ChIP-bound sites may be of decreased biological significance
(Fig. 7d). We also used the correlation with muscle genes to compare ChIP-on-chip and evolutionary conservation as two complementary methods for target identification (Fig. 7d). We found
that the enrichment of conservation-inferred targets was consistently higher than the enrichment of ChIP-inferred targets for each of the three factors. Finally, we assessed the functional
significance of motif instances that were only found by the conservation approach, specifically excluding those in ChIP-bound regions, and found that these were also enriched in the same
functional categories as ChIP-bound sites with comparable or higher functional correlations (Fig. 7d). This suggests that the additional conserved instances are indeed functional, probably
reflecting the higher coverage of conservation-based approaches, which are not restricted to the experimental conditions surveyed, or that they may be bound _in vivo_ yet missed by
ChIP-on-chip technology111,112. In an independent study113 we compared several strategies for the prediction of motif instances and _cis_-regulatory modules and found that using the 12 fly
genomes led to substantial improvements. In another study, we reported the recovery of conserved motifs for several known regulators, including _Suppressor of Hairless_, in genes of the
_Enhancer of split_ complex114. A REGULATORY NETWORK OF _D. MELANOGASTER_ AT 60% CONFIDENCE Having established the accuracy of conserved motif instances, we present an initial regulatory
network for _D. melanogaster_ at 60% confidence (Supplementary Fig. 5i), containing 46,525 regulatory connections between 67 transcription factors and 8,287 genes, and 3,662 connections
between 81 cloned miRNAs (clustered in 49 families with unique seed sequences) and 2,003 genes. The distribution of predicted sites per target gene is highly non-uniform and indicative of
varying levels of regulatory control. Genes with the highest number of sites appeared to be enriched in morphogenesis, organogenesis, neurogenesis and a variety of tissues, whereas
ubiquitously expressed genes and maternal genes with housekeeping functions had the fewest sites104. Interestingly, transcription factors appeared to be more heavily targeted than other
genes, both by transcription factors (10 sites versus 5.5 on average, _P_ = 10-15) and by miRNAs (2.3 versus 1.8 miRNAs, _P_ = 5 × 10-5). Moreover, genes with many transcription factor sites
also had many miRNA sites, and conversely, genes with few transcription factor sites also had few miRNA sites (_P_ = 10-4 and _P_ = 7 × 10-3, respectively). Several of the predicted
regulatory connections have independent experimental support (Supplementary Table 5h), including direct regulation of _achaete_ by Hairy115, of _giant_ by Bicoid116, of _Enhancer of split_
complex genes by Suppressor of Hairless117, and of _bagpipe_ by Tinman (known to cooperate in mesoderm induction and heart specification118). More generally, when tissue-specific expression
data were available, we found that on average 46% of all targets were co-expressed with their factor in at least one tissue (Supplementary Fig. 5i), which is significantly higher than
expected by chance (_P_ = 2 × 10-3). SCALING OF COMPARATIVE GENOMICS POWER Theoretical considerations and pilot studies on selected genomic regions showed that the discovery power of
comparative methods scales with the number and phylogenetic distance of the species compared16,17,18,19,20,45,119,120. We extended these analyses by investigating the scaling of genome-wide
discovery power using evolutionary signatures for each class of functional elements (Fig. 8), on the basis of the recovery of known elements using different subsets of informant species (at
a fixed stringency). We found that recovery consistently increased with the total number of informant species, and that multi-species comparisons outperformed pairwise comparisons within the
same phylogenetic clade. When we examined subsets of informants with similar total branch length (for example, several close species versus one distant species), multi-species comparisons
sometimes performed better (protein-coding exons, ncRNAs), comparably (motifs), or worse (miRNAs) than pairwise comparisons. This complex relationship between total branch length and actual
discovery power probably reflects imperfect genome assemblies/alignments, characteristics of each class of functional elements, and the specific methods we used. For example, ncRNA discovery
probably benefits from observing more compensatory changes across more genomes, whereas miRNA discovery may be more sensitive to artefacts in low-coverage genomes, given the expected high
conservation of miRNA arms. As expected, longer elements were easier to discover than shorter elements. Long protein-coding exons (>300 nucleotides) were recovered at very high rates even
with few species at close distances (leaving little room for improvement with additional species). In contrast, more informant species and larger distances were crucial for recovering short
exons, miRNAs and regulatory motifs. Notably, the optimal evolutionary distance for pairwise comparisons to _D. melanogaster_ also seemed to depend on element length: for long
protein-coding exons, the best pairwise informant was the closely related _D. erecta_, for exons of intermediate lengths _D. ananassae_, and for the shortest exons the distant _D.
willistoni_ (Supplementary Table 7a). Distant species were also optimal for other classes of short elements (ncRNAs, miRNAs and motifs, Fig. 8b–d). This suggests that a small number of
species at close evolutionary distances may generally allow the discovery of long elements, possibly including clade-specific elements, whereas short clade-specific elements may not be
reliably detectable without many genomes at close distances. Finally, we investigated the effect of alignment choice on our results (Supplementary Fig. 8). We found high similarity between
different alignment strategies for longer elements (>93% agreement for exons), whereas shorter elements showed larger discrepancies between alignments (81% and 59% agreement for miRNA and
motif instances, respectively). Although factors such as genome size, repeat density, pseudogene abundance and physiological differences might confound a simple analogy to the vertebrate
phylogeny based on neutral branch length (Fig. 1c), our results suggest that comparisons spanning marsupials, birds and reptiles may prove surprisingly useful for biological signal discovery
in the human genome. DISCUSSION Our results demonstrate the potential of comparative genomics for the systematic characterization of functional elements in a complete genome. Even in a
species as intensely studied as _D. melanogaster_, our methods predicted several thousand new functional elements, including protein-coding genes and exons, novel RNA genes and structures,
miRNA genes, regulatory motifs, and regulator targets. Our novel predictions have overwhelming statistical support, often surpassing that of known functional elements, and are additionally
supported by experimental evidence in hundreds of cases. The common underlying methodology in this study has been the recognition of specific evolutionary signatures associated with each
class of functional elements, which can be much more informative for genome annotation than overall measures of nucleotide conservation. These signatures are general and are immediately
relevant to the analysis of the human genome and more generally of any species. In addition to the many new elements, we gained specific biological insights and formulated hypotheses that we
hope will guide follow-up experiments. We found 149 genes with potential translational readthrough, showing protein-like evolution downstream of a highly conserved stop codon, and possibly
encoding additional protein domains or peptides specific to certain developmental contexts. We also found several candidate programmed frameshifts, which might be part of regulatory circuits
(as for _ODC_/_Oda_ 64) or help expand the diversity of protein products generated from one mRNA, similar to their role in prokaryotes121. We also presented evidence of miRNA processing
from both arms of a miRNA hairpin and from both DNA strands of a miRNA locus in some cases, potentially leading to as many as four functional miRNAs per locus. As miRNA/miRNA* pairs are
expressed from a single precursor and thus co-regulated, whereas sense/anti-sense pairs are expressed from distinct promoters, the use of both arms or both strands provides compelling
general building blocks for higher-level miRNA-mediated regulation. The newly discovered elements did not dramatically increase the total number of annotated nucleotides. Known and predicted
elements explain 42% of nucleotides in phastCons elements33, compared to 35.5% for previous annotations (Supplementary Fig. 6), an 18% increase (mostly owing to conserved motif instances).
The remaining phastCons elements and independent estimates based on transcriptional activity42 would suggest that a much higher fraction of the genome may be functional (Supplementary Fig.
6). Although it is possible that these estimates are artificially high and that we are in fact converging on a complete annotation of the fly genome, they might instead indicate that much
remains to be discovered, which may require the recognition of as-yet-unknown classes of functional elements with distinct evolutionary signatures. Our results also allowed us to compare and
contrast evolutionary and experimental methods for the recovery of functional elements, particularly for the identification of regulator targets. We found that comparative genomics resulted
in many functionally meaningful sites for transcription factors Mef2, Twist and Snail outside ChIP-bound regions, probably representing targets from diverse conditions not surveyed
experimentally. Similarly, ChIP resulted in many additional sites outside those recovered by comparative genomics: some of these may have been replaced by functionally equivalent
non-orthologous sequence, rendering them apparently non-conserved in sequence alignments122,123,124; others may have species- or lineage-specific roles, thus lacking sufficient signal for
their comparative detection; finally, some bound sites may be biochemically active yet selectively neutral125. It is worth noting, however, that ChIP-bound motifs that were not conserved
showed decreased enrichment in muscle/mesoderm development where the factors are known to act, suggesting that potential lineage-specific roles may lie outside the regulators’ conserved
functions. To resolve these questions, comparative genomics studies would benefit greatly from experimental studies in several related species in parallel. Overall, comparative genomics and
species-specific experimental studies provide complementary approaches to biological signal discovery. Comparative studies help pinpoint evolutionarily selected functional elements across
diverse conditions, whereas experimental studies reveal stage- and tissue-specific information, as well as species-specific sites. Ultimately, their integration is a necessary step towards a
comprehensive understanding of animal genomes. METHODS SUMMARY The Methods are described in Supplementary Information, with more details found in the cited companion papers for each
section. The sections of the Supplementary Methods are arranged in the same order as the manuscript to facilitate cross-referencing, with an index on the first page to aid navigation.
REFERENCES * Miller, W., Makova, K. D., Nekrutenko, A. & Hardison, R. C. Comparative genomics. _Annu. Rev. Genomics Hum. Genet._ 5, 15–56 (2004) Article CAS PubMed Google Scholar *
Ureta-Vidal, A., Ettwiller, L. & Birney, E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. _Nature Rev. Genet._ 4, 251–262 (2003) Article CAS PubMed Google Scholar
* Kellis, M. et al. Sequencing and comparison of yeast species to identify genes and regulatory elements. _Nature_ 423, 241–254 (2003) Article ADS CAS PubMed Google Scholar * Cliften,
P. et al. Finding functional features in _Saccharomyces_ genomes by phylogenetic footprinting. _Science_ 301, 71–76 (2003) Article ADS CAS PubMed Google Scholar * Brent, M. R. Genome
annotation past, present, and future: how to define an ORF at each locus. _Genome Res._ 15, 1777–1786 (2005) Article CAS PubMed Google Scholar * Washietl, S., Hofacker, I. L. &
Stadler, P. F. Fast and reliable prediction of noncoding RNAs. _Proc. Natl Acad. Sci. USA_ 102, 2454–2459 (2005) Article ADS CAS PubMed PubMed Central Google Scholar * Pedersen, J. S.
et al. Identification and classification of conserved RNA secondary structures in the human genome. _PLoS Comput. Biol._ 2, e33 (2006) Article ADS CAS PubMed PubMed Central Google
Scholar * Lim, L. P. et al. The microRNAs of _Caenorhabditis elegans_ . _Genes Dev._ 17, 991–1008 (2003) Article CAS PubMed PubMed Central Google Scholar * Lim, L. P. et al. Vertebrate
microRNA genes. _Science_ 299, 1540 (2003) Article CAS PubMed Google Scholar * Lai, E. C., Tomancak, P., Williams, R. W. & Rubin, G. M. Computational identification of _Drosophila_
microRNA genes. _Genome Biol._ 4, R42 (2003) Article PubMed PubMed Central Google Scholar * Berezikov, E. et al. Phylogenetic shadowing and computational identification of human microRNA
genes. _Cell_ 120, 21–24 (2005) Article CAS PubMed Google Scholar * Xie, X. et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several
mammals. _Nature_ 434, 338–345 (2005) Article ADS CAS PubMed PubMed Central Google Scholar * Ettwiller, L. et al. The discovery, positioning and verification of a set of
transcription-associated motifs in vertebrates. _Genome Biol._ 6, R104 (2005) Article PubMed PubMed Central CAS Google Scholar * Chan, C. S., Elemento, O. & Tavazoie, S. Revealing
posttranscriptional regulatory elements through network-level conservation. _PLoS Comput. Biol._ 1, e69 (2005) Article ADS PubMed PubMed Central CAS Google Scholar * Boffelli, D. et
al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. _Science_ 299, 1391–1394 (2003) Article CAS PubMed Google Scholar * Cooper, G. M. et al.
Distribution and intensity of constraint in mammalian genomic sequence. _Genome Res._ 15, 901–913 (2005) Article CAS PubMed PubMed Central Google Scholar * Margulies, E. H., Blanchette,
M., Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. _Genome Res._ 13, 2507–2518 (2003) Article CAS PubMed PubMed Central Google
Scholar * Thomas, J. W. et al. Comparative analyses of multi-species sequences from targeted genomic regions. _Nature_ 424, 788–793 (2003) Article ADS CAS PubMed Google Scholar *
Eddy, S. R. A model of the statistical power of comparative genome sequence analysis. _PLoS Biol._ 3, e10 (2005) Article PubMed PubMed Central CAS Google Scholar * Bergman, C. M. et al.
Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. _Genome Biol._ 3, RESEARCH0086 (2002) Article PubMed PubMed Central
Google Scholar * Rubin, G. M. & Lewis, E. B. A brief history of _Drosophila_’s contributions to genome research. _Science_ 287, 2216–2218 (2000) Article CAS PubMed Google Scholar *
Adams, M. D. et al. The genome sequence of _Drosophila melanogaster._ . _Science_ 287, 2185–2195 (2000) Article PubMed Google Scholar * Misra, S. et al. Annotation of the _Drosophila
melanogaster_ euchromatic genome: a systematic review. _Genome Biol._ 3, RESEARCH0083 (2002) Article PubMed PubMed Central Google Scholar * Celniker, S. E. & Rubin, G. M. The
_Drosophila melanogaster_ genome. _Annu. Rev. Genomics Hum. Genet._ 4, 89–117 (2003) Article CAS PubMed Google Scholar * Ashburner, M. & Bergman, C. M. _Drosophila melanogaster_: a
case study of a model genomic sequence and its consequences. _Genome Res._ 15, 1661–1667 (2005) Article CAS PubMed Google Scholar * Matthews, K. A., Kaufman, T. C. & Gelbart, W. M.
Research resources for _Drosophila_: the expanding universe. _Nature Rev. Genet._ 6, 179–193 (2005) Article CAS PubMed Google Scholar * Venken, K. J., He, Y., Hoskins, R. A. &
Bellen, H. J. P[acman]: a BAC transgenic platform for targeted insertion of large DNA fragments in _D. _ _melanogaster_ . _Science_ 314, 1747–1751 (2006) Article ADS CAS PubMed Google
Scholar * Dietzl, G. et al. A genome-wide transgenic RNAi library for conditional gene inactivation in _Drosophila_ . _Nature_ 448, 151–156 (2007) Article ADS CAS PubMed Google Scholar
* Spradling, A. C. et al. The Berkeley _Drosophila_ Genome Project gene disruption project: Single P-element insertions mutating 25% of vital _Drosophila_ genes. _Genetics_ 153, 135–177
(1999) Article CAS PubMed PubMed Central Google Scholar * St Johnston, D. The art and design of genetic screens: _Drosophila melanogaster_ . _Nature Rev. Genet._ 3, 176–188 (2002)
Article CAS PubMed Google Scholar * Richards, S. et al. Comparative genome sequencing of _Drosophila_ pseudoobscura: chromosomal, gene, and cis-element evolution. _Genome Res._ 15, 1–18
(2005) Article CAS PubMed PubMed Central Google Scholar * Drosophila 12 Genomes Consortium Evolution of genes and genomes on the _Drosophila_ phylogeny. _Nature_ doi:
10.1038/nature06341 (this issue) (2007) * Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. _Genome Res._ 15, 1034–1050 (2005) Article CAS
PubMed PubMed Central Google Scholar * Nekrutenko, A., Makova, K. D. & Li, W. H. The KA/KS ratio test for assessing the protein-coding potential of genomic regions: an empirical and
simulation study. _Genome Res._ 12, 198–202 (2002) Article CAS PubMed PubMed Central Google Scholar * Eddy, S. R. Computational genomics of noncoding RNA genes. _Cell_ 109, 137–140
(2002) Article CAS PubMed Google Scholar * Bompfuenewerer, A. F. et al. Evolutionary patterns of non-coding RNAs. _Theor. Biosci._ 123, 301–369 (2004) Article CAS Google Scholar *
Reese, M. G. et al. Genome annotation assessment in _Drosophila melanogaster_ . _Genome Res._ 10, 483–501 (2000) Article CAS PubMed PubMed Central Google Scholar * Rubin, G. M. et al. A
_Drosophila_ complementary DNA resource. _Science_ 287, 2222–2224 (2000) Article CAS PubMed Google Scholar * Stapleton, M. et al. A _Drosophila_ full-length cDNA resource. _Genome
Biol._ 3, RESEARCH0080 (2000). * Hild, M. et al. An integrated gene annotation and transcriptional profiling approach towards the full gene content of the _Drosophila_ genome. _Genome Biol._
5, R3 (2003) Article CAS PubMed PubMed Central Google Scholar * Yandell, M. et al. A computational and experimental approach to validating annotations and gene predictions in the
_Drosophila melanogaster_ genome. _Proc. Natl Acad. Sci. USA_ 102, 1566–1571 (2005) Article ADS CAS PubMed PubMed Central Google Scholar * Manak, J. R. et al. Biological function of
unannotated transcription during the early development of _Drosophila melanogaster_ . _Nature Genet._ 38, 1151–1158 (2006) Article CAS PubMed Google Scholar * Lin, M. F. et al.
Revisiting the protein-coding gene catalog of _Drosophila melanogaster_ using twelve fly genomes. _Genome Res._ doi: 10.1101/gr.6679507 (in the press) * Yang, Z. & Bielawski, J. P.
Statistical methods for detecting molecular adaptation. _Trends Ecol. Evol._ 15, 496–503 (2000) Article CAS PubMed PubMed Central Google Scholar * Mignone, F., Grillo, G., Liuni, S.
& Pesole, G. Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis. _Nucleic Acids Res._ 31, 4639–4645 (2003)
Article CAS PubMed PubMed Central Google Scholar * Zhang, L., Pavlovic, V., Cantor, C. R. & Kasif, S. Human-mouse gene identification by comparative evidence integration and
evolutionary analysis. _Genome Res._ 13 (6A). 1190–1202 (2003) Article CAS PubMed PubMed Central Google Scholar * Crosby, M. A. et al. FlyBase: genomes by the dozen. _Nucleic Acids
Res._ 35 (Database issue). D486–D491 (2007) * Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. _Nature Genet._ 25, 25–29 (2000) Article
CAS PubMed Google Scholar * Ochman, H., Ajioka, J. W., Garza, D. & Hartl, D. L. Inverse polymerase chain reaction. _Bio/Technology_ 8, 759–760 (1990) CAS Google Scholar * Hoskins,
R. A. et al. Rapid and efficient cDNA library screening by self-ligation of inverse PCR products (SLIP). _Nucleic Acids Res._ 33, e185 (2005) Article PubMed PubMed Central CAS Google
Scholar * Wan, K. H. et al. High-throughput plasmid cDNA library screening. _Nature Protocols_ 1, 624–632 (2006) Article CAS PubMed Google Scholar * Hahn, M. W., Han, M. V. & Han,
S.-G. Gene family evolution across 12 _Drosophila_ genomes. _PLoS Genet_ 3, e197 (2007) Article PubMed PubMed Central CAS Google Scholar * Andrews, J. et al. The stoned locus of
_Drosophila melanogaster_ produces a dicistronic transcript and encodes two distinct polypeptides. _Genetics_ 143, 1699–1711 (1996) Article CAS PubMed PubMed Central Google Scholar *
Brogna, S. & Ashburner, M. The Adh-related gene of _Drosophila melanogaster_ is expressed as a functional dicistronic messenger RNA: multigenic transcription in higher organisms. _EMBO
J._ 16, 2023–2031 (1997) Article CAS PubMed PubMed Central Google Scholar * Hatfield, D. L. & Gladyshev, V. N. How selenium has altered our understanding of the genetic code. _Mol.
Cell. Biol._ 22, 3565–3576 (2002) Article CAS PubMed PubMed Central Google Scholar * Kryukov, G. V. et al. Characterization of mammalian selenoproteomes. _Science_ 300, 1439–1443 (2003)
Article ADS CAS PubMed Google Scholar * Copeland, P. R. Regulation of gene expression by stop codon recoding: selenocysteine. _Gene_ 312, 17–25 (2003) Article CAS PubMed PubMed
Central Google Scholar * Castellano, S. et al. _In silico_ identification of novel selenoproteins in the _Drosophila melanogaster_ genome. _EMBO Rep._ 2, 697–702 (2001) Article CAS
PubMed PubMed Central Google Scholar * von der Haar, T. & Tuite, M. F. Regulated translational bypass of stop codons in yeast. _Trends Microbiol._ 15, 78–86 (2007) Article CAS
PubMed Google Scholar * Luo, G. X. et al. A specific base transition occurs on replicating hepatitis delta virus RNA. _J. Virol._ 64, 1021–1027 (1990) Article CAS PubMed PubMed Central
Google Scholar * Casey, J. L. & Gerin, J. L. Hepatitis D virus RNA editing: specific modification of adenosine in the antigenomic RNA. _J. Virol._ 69, 7593–7600 (1995) Article CAS
PubMed PubMed Central Google Scholar * Steneberg, P. et al. Translational readthrough in the hdc mRNA generates a novel branching inhibitor in the _Drosophila_ trachea. _Genes Dev._ 12,
956–967 (1998) Article CAS PubMed PubMed Central Google Scholar * Bass, B. L. RNA editing by adenosine deaminases that act on RNA. _Annu. Rev. Biochem._ 71, 817–846 (2002) Article CAS
PubMed Google Scholar * Ivanov, I. P. et al. The _Drosophila_ gene for antizyme requires ribosomal frameshifting for expression and contains an intronic gene for snRNP Sm D3 on the
opposite strand. _Mol. Cell. Biol._ 18, 1553–1561 (1998) Article CAS PubMed PubMed Central Google Scholar * Eddy, S. R. Non-coding RNA genes and the modern RNA world. _Nature Rev.
Genet._ 2, 919–929 (2001) Article CAS PubMed Google Scholar * Yuan, G. et al. RNomics in _Drosophila melanogaster_: identification of 66 candidates for novel non-messenger RNAs. _Nucleic
Acids Res._ 31, 2495–2507 (2003) Article CAS PubMed PubMed Central Google Scholar * Lestrade, L. & Weber, M. J. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box
snoRNAs. _Nucleic Acids Res._ 34 (Database issue). D158–D162 (2006) Article CAS PubMed Google Scholar * Bier, E. _Drosophila_, the golden bug, emerges as a tool for human genetics.
_Nature Rev. Genet._ 6, 9–23 (2005) Article CAS PubMed Google Scholar * Hoopengardner, B., Bhalla, T., Staber, C. & Reenan, R. Nervous system targets of RNA editing identified by
comparative genomics. _Science_ 301, 832–836 (2003) Article ADS CAS PubMed Google Scholar * Mignone, F. et al. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the
untranslated regions of eukaryotic mRNAs. _Nucleic Acids Res._ 33 (Database issue). D141–D146 (2005) Article CAS PubMed Google Scholar * Cohen, R. S., Zhang, S. & Dollar, G. L. The
positional, structural, and sequence requirements of the _Drosophila_ TLS RNA localization element. _RNA_ 11, 1017–1029 (2005) Article CAS PubMed PubMed Central Google Scholar *
Allemand, F. et al. _Escherichia coli_ ribosomal protein L20 binds as a single monomer to its own mRNA bearing two potential binding sites. _Nucleic Acids Res._ 35, 3016–3031 (2007) Article
CAS PubMed PubMed Central Google Scholar * Okumura, T., Matsumoto, A., Tanimura, T. & Murakami, R. An endoderm-specific GATA factor gene, dGATAe, is required for the terminal
differentiation of the _Drosophila_ endoderm. _Dev. Biol._ 278, 576–586 (2005) Article CAS PubMed Google Scholar * Park, S. W. et al. An evolutionarily conserved domain of roX2 RNA is
sufficient for induction of H4-Lys16 acetylation on the _Drosophila_ X chromosome. _Genetics_ (in the press) * Park, Y. & Kuroda, M. I. Epigenetic aspects of X-chromosome dosage
compensation. _Science_ 293, 1083–1085 (2001) Article CAS PubMed Google Scholar * Berezikov, E., Cuppen, E. & Plasterk, R. H. Approaches to microRNA discovery. _Nature Genet._ 38
(Suppl 1). S2–S7 (2006) Article CAS PubMed Google Scholar * Bartel, D. P. MicroRNAs: genomics, biogenesis, mechanism, and function. _Cell_ 116, 281–297 (2004) Article CAS PubMed
Google Scholar * Stark, A. et al. Systematic discovery and characterization of fly microRNAs using 12 _Drosophila_ genomes. _Genome Res._ 10.1101/gr.6593807 (in the press) * Ruby, J. G. et
al. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of _Drosophila_ microRNAs. _Genome Res._ 10.1101/gr.6597907 (in the press) * Pekarsky, Y. et al.
Tcl1 expression in chronic lymphocytic leukemia is regulated by miR-29 and miR-181. _Cancer Res._ 66, 11590–11593 (2006) Article CAS PubMed Google Scholar * Ruby, J. G., Jan, C. H.
& Bartel, D. P. Intronic microRNA precursors that bypass Drosha processing. _Nature_ 448, 83–86 (2007) Article ADS CAS PubMed PubMed Central Google Scholar * Okamura, K. et al. The
mirtron pathway generates microRNA-class regulatory RNAs in _Drosophila_ . _Cell_ 130, 89–100 (2007) Article CAS PubMed PubMed Central Google Scholar * Lewis, B. P. et al. Prediction
of mammalian microRNA targets. _Cell_ 115, 787–798 (2003) Article CAS PubMed Google Scholar * Stark, A., Brennecke, J., Russell, R. B. & Cohen, S. M. Identification of _Drosophila_
microRNA targets. _PLoS Biol._ 1, E60 (2003) Article PubMed PubMed Central Google Scholar * Lai, E. C. Micro RNAs are complementary to 3′ UTR sequence motifs that mediate negative
post-transcriptional regulation. _Nature Genet._ 30, 363–364 (2002) Article CAS PubMed Google Scholar * Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often
flanked by adenosines, indicates that thousands of human genes are microRNA targets. _Cell_ 120, 15–20 (2005) Article CAS PubMed Google Scholar * Tompa, M. Identifying functional
elements by comparative DNA sequence analysis. _Genome Res._ 11, 1143–1144 (2001) Article CAS PubMed Google Scholar * Stormo, G. D. DNA binding sites: representation and discovery.
_Bioinformatics_ 16, 16–23 (2000) Article CAS PubMed Google Scholar * Kheradpour, P., Stark, A., Roy, S. & Kellis, M. Reliable prediction of regulator targets using 12 _Drosophila_
genomes. _Genome Res._ 10.1101/gr.7090407 (in the press) * Stathopoulos, A. & Levine, M. Genomic regulatory networks and animal development. _Dev. Cell_ 9, 449–462 (2005) Article CAS
PubMed Google Scholar * Schroeder, M. D. et al. Transcriptional control in the segmentation gene network of _Drosophila_ . _PLoS Biol._ 2, e271 (2004) Article PubMed PubMed Central CAS
Google Scholar * Zeitlinger, J. et al. Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the _Drosophila_ embryo. _Genes
Dev._ 21, 385–390 (2007) Article CAS PubMed PubMed Central Google Scholar * Kanehisa, M. et al. The KEGG resource for deciphering the genome. _Nucleic Acids Res._ 32 (Database issue).
D277–D280 (2004) Article CAS PubMed PubMed Central Google Scholar * Berman, B. P. et al. Exploiting transcription factor binding site clustering to identify cis-regulatory modules
involved in pattern formation in the _Drosophila_ genome. _Proc. Natl Acad. Sci. USA_ 99, 757–762 (2002) Article ADS CAS PubMed PubMed Central Google Scholar * Markstein, M. et al. A
regulatory code for neurogenic gene expression in the _Drosophila_ embryo. _Development_ 131, 2387–2394 (2004) Article CAS PubMed Google Scholar * Philippakis, A. A. et al.
Expression-guided _in silico_ evaluation of candidate cis regulatory codes for _Drosophila_ muscle founder cells. _PLoS Comput. Biol._ 2, e53 (2006) Article ADS PubMed PubMed Central CAS
Google Scholar * Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. _Annu. Rev. Biochem._ 72, 449–479 (2003) Article CAS PubMed Google Scholar * Gerber, A. P. et
al. Genome-wide identification of mRNAs associated with the translational regulator PUMILIO in _Drosophila melanogaster_ . _Proc. Natl Acad. Sci. USA_ 103, 4487–4492 (2006) Article ADS CAS
PubMed PubMed Central Google Scholar * Zubiaga, A. M., Belasco, J. G. & Greenberg, M. E. The nonamer UUAUUUAUU is the key AU-rich sequence motif that mediates mRNA degradation.
_Mol. Cell. Biol._ 15, 2219–2230 (1995) Article CAS PubMed PubMed Central Google Scholar * Fairbrother, W. G., Yeh, R. F., Sharp, P. A. & Burge, C. B. Predictive identification of
exonic splicing enhancers in human genes. _Science_ 297, 1007–1013 (2002) Article ADS CAS PubMed Google Scholar * Kloosterman, W. P., Wienholds, E., Ketting, R. F. & Plasterk, R. H.
Substrate requirements for let-7 function in the developing zebrafish embryo. _Nucleic Acids Res._ 32, 6284–6291 (2004) Article CAS PubMed PubMed Central Google Scholar * Grimson, A.
et al. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. _Mol. Cell_ 27, 91–105 (2007) Article CAS PubMed PubMed Central Google Scholar * Farh, K. K. et al.
The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. _Science_ 310, 1817–1821 (2005) Article ADS CAS PubMed Google Scholar * Stark, A. et al. Animal microRNAs
confer robustness to gene expression and have a significant impact on 3′ UTR evolution. _Cell_ 123, 1133–1146 (2005) Article CAS PubMed Google Scholar * Rajewsky, N. microRNA target
predictions in animals. _Nature Genet._ 38, (suppl. 1)S8–S13 (2006) Article CAS PubMed Google Scholar * Elnitski, L. et al. Distinguishing regulatory DNA from neutral sites. _Genome
Res._ 13, 64–72 (2003) Article CAS PubMed PubMed Central Google Scholar * Abrams, E. W. & Andrew, D. J. CrebA regulates secretory activity in the _Drosophila_ salivary gland and
epidermis. _Development_ 132, 2743–2758 (2005) Article CAS PubMed Google Scholar * Sandmann, T. et al. A temporal map of transcription factor activity: mef2 directly regulates target
genes at all stages of muscle development. _Dev. Cell_ 10, 797–807 (2006) Article CAS PubMed Google Scholar * Sandmann, T. et al. A core transcriptional network for early mesoderm
development in _Drosophila melanogaster_ . _Genes Dev._ 21, 436–449 (2007) Article CAS PubMed PubMed Central Google Scholar * Sethupathy, P., Corda, B. & Hatzigeorgiou, A. G.
TarBase: A comprehensive database of experimentally supported animal microRNA targets. _RNA_ 12, 192–197 (2006) Article CAS PubMed PubMed Central Google Scholar * Lee, T. I. et al.
Control of developmental regulators by Polycomb in human embryonic stem cells. _Cell_ 125, 301–313 (2006) Article CAS PubMed PubMed Central Google Scholar * Boyer, L. A. et al. Core
transcriptional regulatory circuitry in human embryonic stem cells. _Cell_ 122, 947–956 (2005) Article CAS PubMed PubMed Central Google Scholar * Aerts, S., van Helden, J., Sand, O.
& Hassan, B. Fine-tuning enhancer models to predict transcriptional targets across multiple genomes. _PLoS ONE_ 2, (11)e1115 (2007) Article ADS PubMed PubMed Central CAS Google
Scholar * Maeder, M., Polansky, B., Robson, B. & Eastman, D. Phylogenetic footprinting analysis in the upstream regulatory regions of the _Drosophila_ Enhancer of split genes.
_Genetics_ (in the press) * Van Doren, M. et al. Negative regulation of proneural gene activity: hairy is a direct transcriptional repressor of achaete. _Genes Dev._ 8, 2729–2742 (1994)
Article CAS PubMed Google Scholar * Kraut, R. & Levine, M. Spatial regulation of the gap gene giant during _Drosophila_ development. _Development_ 111, 601–609 (1991) Article CAS
PubMed Google Scholar * Bailey, A. M. & Posakony, J. W. Suppressor of hairless directly activates transcription of enhancer of split complex genes in response to Notch receptor
activity. _Genes Dev._ 9, 2609–2622 (1995) Article CAS PubMed Google Scholar * Yin, Z. & Frasch, M. Regulation and function of tinman during dorsal mesoderm induction and heart
specification in _Drosophila_ . _Dev. Genet._ 22, 187–200 (1998) Article CAS PubMed Google Scholar * Margulies, E. H. et al. An initial strategy for the systematic identification of
functional elements in the human genome by low-redundancy comparative sequencing. _Proc. Natl Acad. Sci. USA_ 102, 4795–4800 (2005) Article ADS CAS PubMed PubMed Central Google Scholar
* Margulies, E. H., Chen, C. W. & Green, E. D. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. _Trends Genet._ 22, 187–193
(2006) Article CAS PubMed Google Scholar * Farabaugh, P. J. Programmed translational frameshifting. _Annu. Rev. Genet._ 30, 507–528 (1996) Article CAS PubMed Google Scholar * Odom,
D. T. et al. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. _Nature Genet._ 39, 730–732 (2007) Article CAS PubMed Google Scholar * Ludwig,
M. Z. & Kreitman, M. Evolutionary dynamics of the enhancer region of even-skipped in _Drosophila_ . _Mol. Biol. Evol._ 12, 1002–1011 (1995) CAS PubMed Google Scholar * Ludwig, M. Z.
et al. Functional evolution of a cis-regulatory module. _PLoS Biol._ 3, e93 (2005) Article PubMed PubMed Central CAS Google Scholar * The ENCODE Project Consortium Identification and
analysis of functional elements in 1% of the human genome by the ENCODE pilot project. _Nature_ 447, 799–816 (2007) Article ADS PubMed Central CAS Google Scholar * Kent, W. J. et al.
The human genome browser at UCSC. _Genome Res._ 12, 996–1006 (2002) Article CAS PubMed PubMed Central Google Scholar Download references ACKNOWLEDGEMENTS We thank the National Human
Genome Research Institute (NHGRI) for continued support. A.S. was supported in part by the Schering AG/Ernst Schering Foundation and in part by the Human Frontier Science Program
Organization (HFSPO). P.K. was supported in part by a National Science Foundation Graduate Research Fellowship. J.S.P. thanks B. Raney and R. Baertsch, and the Danish Medical Research
Council and the National Cancer Institute for support. J.B. thanks the Schering AG/Ernst Schering Foundation for a postdoctoral fellowship. L.Parts thanks J. Vilo. S.R. was supported by a
HHMI-NIH/NIBIB Interfaces Training Grant and thanks T. Lane and M. Werner-Washburne. D.H., D.P.B., G.J.H. and T.C.K. are Investigators of the Howard Hughes Medical Institute, and B.P.,
J.G.R., E.H. and J.B. are affiliated with these investigators. J.W.C. and S.E.C. were supported by the NHGRI. M.K. was supported by start-up funds from the MIT Electrical Engineering and
Computer Science Laboratory, the Broad Institute of MIT and Harvard, and the MIT Computer Science and Artificial Intelligence Laboratory, and by the Distinguished Alumnus (1964) Career
Development Professorship. AUTHOR CONTRIBUTIONS Organizing committee: Manolis Kellis, William Gelbart, Doug Smith, Andrew G. Clark, Michael E. Eisen, Thomas C. Kaufman; protein-coding gene
prediction: Michael F. Lin, Ameya N. Deoras, Mira V. Han, Matthew W. Hahn, Donald G. Gilbert, Michael Weir, Michael Rice, Manolis Kellis; manual curation of protein-coding genes: Madeline A.
Crosby, Harvard FlyBase curators, William M. Gelbart; validation of protein-coding genes: Joseph W. Carlson, Berkeley Drosophila Genome Project, Susan E. Celniker; non-coding RNA gene
prediction: Jakob S. Pedersen, David Haussler, Yongkyu Park, Seung-Won Park, Manolis Kellis; microRNA gene prediction: Alexander Stark, Pouya Kheradpour, Leopold Parts, Manolis Kellis;
microRNA cloning and sequencing: Julius Brennecke, Emily Hodges, Gregory J. Hannon; microRNA target prediction: Alexander Stark, J. Graham Ruby, Manolis Kellis, Eric C. Lai, David P. Bartel;
motif identification: Alexander Stark, Pouya Kheradpour, Manolis Kellis; motif instance prediction: Alexander Stark, Pouya Kheradpour, Sushmita Roy, Morgan L. Maeder, Benjamin J. Polansky,
Bryanne E. Robson, Deborah A. Eastman, Stein Aerts, Bassem Hassan, Jacques van Helden, Manolis Kellis; genome alignments: Angie S. Hinrichs, W. James Kent, Anat Caspi, Lior Pachter, Colin N.
Dewey, Benedict Paten; phylogeny and branch length estimation: Matthew D. Rasmussen, Manolis Kellis; final manuscript preparation: Alexander Stark, Michael F. Lin, Pouya Kheradpour, Jakob
Pedersen, Manolis Kellis. AUTHOR INFORMATION Author notes * Alexander Stark, Michael F. Lin, Pouya Kheradpour and Jakob S. Pedersen: These authors contributed equally to this work. * Lists
of participants and affiliations appear at the end of the paper. AUTHORS AND AFFILIATIONS * The Broad Institute, Massachusetts Institute of Technology and Harvard University, Cambridge,
Massachusetts 02140, USA , Alexander Stark, Michael F. Lin & Manolis Kellis * Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, USA ,
Alexander Stark, Michael F. Lin, Pouya Kheradpour, Matthew D. Rasmussen, Ameya N. Deoras & Manolis Kellis * Department of Molecular Biology, The Bioinformatics Centre, University of
Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark, Jakob S. Pedersen * Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA ,
Jakob S. Pedersen, Angie S. Hinrichs, Benedict Paten, W. James Kent & David Haussler * Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK ,
Leopold Parts & Benedict Paten * Institute of Computer Science, University of Tartu, Estonia Leopold Parts * BDGP, LBNL, 1 Cyclotron Road MS 64-0119, Berkeley, California 94720, USA ,
Joseph W. Carlson & Susan E. Celniker * FlyBase, The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA , Madeline A. Crosby &
William M. Gelbart * Department of Computer Science, University of New Mexico, Albuquerque, New Mexico 87131, USA, Sushmita Roy * Department of Biology, MIT, Cambridge, Massachusetts 02139,
USA, J. Graham Ruby & David P. Bartel * Whitehead Institute, Cambridge, Massachusetts 02142, USA , J. Graham Ruby & David P. Bartel * Cold Spring Harbor Laboratory, Watson School of
Biological Sciences, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA , Julius Brennecke, Emily Hodges & Gregory J. Hannon * University of California, San Francisco/University of
California, Berkeley Joint Graduate Group in Bioengineering, Berkeley, California 97210, USA , Anat Caspi * EMBL Nucleotide Sequence Submissions, European Bioinformatics Institute, Wellcome
Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK , Benedict Paten * Department of Cell Biology and Molecular Medicine, G-629, MSB, 185 South Orange Avenue, UMDNJ-New Jersey Medical
School, Newark, New Jersey 07103, USA, Seung-Won Park & Yongkyu Park * Department of Biology and School of Informatics, Indiana University, Indiana 47405, USA Mira V. Han & Matthew
W. Hahn * Department of Biology, Connecticut College, New London, Connecticut 06320, USA, Morgan L. Maeder, Benjamin J. Polansky, Bryanne E. Robson & Deborah A. Eastman * Department of
Molecular and Developmental Genetics, Laboratory of Neurogenetics, VIB, 3000 Leuven, Belgium, Stein Aerts & Bassem Hassan * Department of Human Genetics, K. U. Leuven School of Medicine,
3000 Leuven, Belgium Stein Aerts & Bassem Hassan * Department de Biologie Moleculaire, Universite Libre de Bruxelles, 1050 Brussels, Belgium Jacques van Helden * Department of Biology,
Indiana University, Bloomington, Indiana 47405, USA, Donald G. Gilbert & Thomas C. Kaufman * Department of Mathematics and Computer Science, Wesleyan University, Middletown, Connecticut
06459, USA, Michael Rice * Biology Department, Wesleyan University Middletown, Connecticut 06459, USA Michael Weir * Department of Biostatistics and Medical Informatics, University of
Wisconsin-Madison, Madison, Wisconsin 53706, USA, Colin N. Dewey * Department of Mathematics, University of California at Berkeley, Berkeley, California 94720, USA, Lior Pachter * Department
of Computer Science, University of California at Berkeley, Berkeley, California 94720, USA, Lior Pachter * Department of Developmental Biology, Memorial Sloan-Kettering Cancer Center, New
York, New York 10021, USA, Eric C. Lai * Department of Molecular and Cell Biology, Graduate Group in Biophysics, and Center for Integrative Genomics, University of California, Berkeley,
California 94720, USA, Michael B. Eisen * Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA, Michael B. Eisen * Department of Molecular Biology
and Genetics, Cornell University, Ithaca, New York 14853, USA, Andrew G. Clark * Agencourt Bioscience Corporation, 500 Cummings Center, Suite 2450, Beverly, Massachusetts 01915, USA ,
Douglas Smith * The Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA, William M. Gelbart * FlyBase, The Biological Laboratories, Harvard
University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA., Madeline A. Crosby, Beverley B. Matthews, Andrew J. Schroeder, L. Sian Gramates, Susan E. St Pierre, Margaret Roark,
Kenneth L. Wiley Jr, Rob J. Kulathinal, Peili Zhang, Kyl V. Myrick, Jerry V. Antone & William M. Gelbart * BDGP, LBNL, 1 Cyclotron Road MS 64-0119, Berkeley, California 94720, USA.,
Joseph W. Carlson, Charles Yu, Soo Park, Kenneth H. Wan & Susan E. Celniker Authors * Alexander Stark View author publications You can also search for this author inPubMed Google Scholar
* Michael F. Lin View author publications You can also search for this author inPubMed Google Scholar * Pouya Kheradpour View author publications You can also search for this author
inPubMed Google Scholar * Jakob S. Pedersen View author publications You can also search for this author inPubMed Google Scholar * Leopold Parts View author publications You can also search
for this author inPubMed Google Scholar * Joseph W. Carlson View author publications You can also search for this author inPubMed Google Scholar * Madeline A. Crosby View author publications
You can also search for this author inPubMed Google Scholar * Matthew D. Rasmussen View author publications You can also search for this author inPubMed Google Scholar * Sushmita Roy View
author publications You can also search for this author inPubMed Google Scholar * Ameya N. Deoras View author publications You can also search for this author inPubMed Google Scholar * J.
Graham Ruby View author publications You can also search for this author inPubMed Google Scholar * Julius Brennecke View author publications You can also search for this author inPubMed
Google Scholar * Emily Hodges View author publications You can also search for this author inPubMed Google Scholar * Angie S. Hinrichs View author publications You can also search for this
author inPubMed Google Scholar * Anat Caspi View author publications You can also search for this author inPubMed Google Scholar * Benedict Paten View author publications You can also search
for this author inPubMed Google Scholar * Seung-Won Park View author publications You can also search for this author inPubMed Google Scholar * Mira V. Han View author publications You can
also search for this author inPubMed Google Scholar * Morgan L. Maeder View author publications You can also search for this author inPubMed Google Scholar * Benjamin J. Polansky View author
publications You can also search for this author inPubMed Google Scholar * Bryanne E. Robson View author publications You can also search for this author inPubMed Google Scholar * Stein
Aerts View author publications You can also search for this author inPubMed Google Scholar * Jacques van Helden View author publications You can also search for this author inPubMed Google
Scholar * Bassem Hassan View author publications You can also search for this author inPubMed Google Scholar * Donald G. Gilbert View author publications You can also search for this author
inPubMed Google Scholar * Deborah A. Eastman View author publications You can also search for this author inPubMed Google Scholar * Michael Rice View author publications You can also search
for this author inPubMed Google Scholar * Michael Weir View author publications You can also search for this author inPubMed Google Scholar * Matthew W. Hahn View author publications You can
also search for this author inPubMed Google Scholar * Yongkyu Park View author publications You can also search for this author inPubMed Google Scholar * Colin N. Dewey View author
publications You can also search for this author inPubMed Google Scholar * Lior Pachter View author publications You can also search for this author inPubMed Google Scholar * W. James Kent
View author publications You can also search for this author inPubMed Google Scholar * David Haussler View author publications You can also search for this author inPubMed Google Scholar *
Eric C. Lai View author publications You can also search for this author inPubMed Google Scholar * David P. Bartel View author publications You can also search for this author inPubMed
Google Scholar * Gregory J. Hannon View author publications You can also search for this author inPubMed Google Scholar * Thomas C. Kaufman View author publications You can also search for
this author inPubMed Google Scholar * Michael B. Eisen View author publications You can also search for this author inPubMed Google Scholar * Andrew G. Clark View author publications You can
also search for this author inPubMed Google Scholar * Douglas Smith View author publications You can also search for this author inPubMed Google Scholar * Susan E. Celniker View author
publications You can also search for this author inPubMed Google Scholar * William M. Gelbart View author publications You can also search for this author inPubMed Google Scholar * Manolis
Kellis View author publications You can also search for this author inPubMed Google Scholar CONSORTIA HARVARD FLYBASE CURATORS * Madeline A. Crosby * , Beverley B. Matthews * , Andrew J.
Schroeder * , L. Sian Gramates * , Susan E. St Pierre * , Margaret Roark * , Kenneth L. Wiley Jr * , Rob J. Kulathinal * , Peili Zhang * , Kyl V. Myrick * , Jerry V. Antone * & William
M. Gelbart BERKELEY DROSOPHILA GENOME PROJECT * Joseph W. Carlson * , Charles Yu * , Soo Park * , Kenneth H. Wan * & Susan E. Celniker CORRESPONDING AUTHOR Correspondence to Manolis
Kellis. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing financial interests. SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION The file contains extensive
Supplementary Information. (PDF 5008 kb) RIGHTS AND PERMISSIONS Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Stark, A., Lin, M., Kheradpour, P. _et al._ Discovery of
functional elements in 12 _Drosophila_ genomes using evolutionary signatures. _Nature_ 450, 219–232 (2007). https://doi.org/10.1038/nature06340 Download citation * Received: 21 July 2007 *
Accepted: 04 October 2007 * Issue Date: 08 November 2007 * DOI: https://doi.org/10.1038/nature06340 SHARE THIS ARTICLE Anyone you share the following link with will be able to read this
content: Get shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative