A gene-rich fraction analysis of the _Passiflora edulis_ genome reveals highly conserved microsyntenic regions with two related Malpighiales species

ABSTRACT _Passiflora edulis_ is the most widely cultivated species of passionflowers, cropped mainly for industrialized juice production and fresh fruit consumption. Despite its commercial

importance, little is known about the genome structure of _P. edulis_. To fill in this gap in our knowledge, a genomic library was built, and over 100 large inserts have now been completely sequenced.

Sequencing data were assembled from long sequence reads, and structural sequence annotation resulted in the prediction of about 1,900 genes, providing data for subsequent functional

analysis. The richness of repetitive elements was also evaluated. Microsyntenic regions of _P. edulis_ common to _Populus trichocarpa_ and _Manihot esculenta_, two related Malpighiales

species with fully sequenced genomes available, were examined. Overall, gene order was well conserved, with some disruptions of collinearity identified as rearrangements, such as inversion

and translocation events. The microsynteny level observed between the _P. edulis_ sequences and the compared genomes is surprising, given the long divergence time that separates them from

the common ancestor. _P. edulis_ gene-rich segments are more compact than those of the other two species, even though its genome is much larger. This study provides a first accurate gene set

for _P. edulis_, opening the way for new studies on the evolutionary issues in Malpighiales genomes.

INTRODUCTION The Passifloraceae family belongs to the Malpighiales order and is a member of the Rosids clade, according to classical

and molecular phylogenetic analysis. The family consists of 700 species, classified in 16 genera. The majority of species belong to the genus _Passiflora_ (~530 species), popularly known as

passion fruits1. This genus is widely distributed in tropical and subtropical regions of the Neotropics. Approximately 150 species are native to Brazil, which is acknowledged to be an

important centre of diversity2. Among the American tropical species of _Passiflora_, 60 fruit-bearing species are marketed for human consumption. Moreover, several species and hybrids have

been produced for ornamental purposes (see www.passiflora.it)3, and pharmacologists have found that passion fruit vines contain bioactive compounds that are used in traditional folk

medicines as anxiolytics and antispasmodics4. _Passiflora edulis_ is the major species of passionflowers grown for fresh fruit consumption and juice production in climates ranging from cool

subtropical (purple variety) to warm tropical (yellow variety). Species grown particularly in Brazil include _P. edulis_ (sour passion fruit) and _P. alata_ (sweet passion fruit). Because of

the quality of its fruit and yield for processing into commercial juices, _P. edulis_ is grown in 90% of the commercial orchards. The most recent agricultural production survey showed that

58,089 hectares were planted with passion fruits, yielding 838,444 tons per year5. _P. edulis_ is a diploid (_2n_ = 18)6, self-incompatible species7,8, with perfect, insect-pollinated

flowers. Over the last two decades, our research group has carried out studies for estimating the genetic parameters of experimental populations9, as well as constructing genetic maps10,11

and mapping quantitative loci associated with the response to _Xanthomonas axonopodis_ infection12. Munhoz and co-workers were able to determine which gene expression patterns were

significantly modulated during the _P. edulis_-_X. axonopodis_ interaction13. Despite its commercial success, little is known about the genome structure of _P. edulis_. The genome size has

been estimated at ~1,230 Mb (1C DNA content = 1.258 pg by flow cytometric analysis)14. To fill in this gap in our knowledge, a large-insert genomic BAC (Bacterial Artificial Chromosome)

library was built and denoted Ped-B-Flav (https://cnrgv.toulouse.inra.fr/library/genomic_resource/Ped-B-Flav). It contains 83,000 clones, which are kept at the National Centre for Plant

Genomic Resources (CNRGV: cnrgv.toulouse.inra.fr) at INRA in Toulouse, France. In addition, previous studies provided initial insights into the _P. edulis_ genome using BAC-end sequence

(BES) data as a major resource15, and described the structural organization of the plant’s chloroplast genome, which differs from that of various Malpighiales species due to rearrangement

events16. Although based on small-sized sequences, BAC-end sequences can be mapped to intervals of sequenced related genomes17 in order to identify collinear microsyntenic regions as a

preliminary step towards selecting clones for full sequencing, which can be done with high accuracy using the single-molecule real-time (SMRT) sequencing (Pacific Biosciences). This method

produces long, unbiased sequences that, in turn, facilitate subsequent assembly18, a critical step in plants due to the high proportion of repetitive sequences throughout their genomes19.

Most of the projects aimed at obtaining a draft or a complete plant genome were performed using large-insert based sequencing methods20,21 to allow estimation of the number of genes, and

abundance of transposable elements and microsatellites. In the functional part of the genome in particular, the annotation of large-inserts can provide an arsenal of biological information

to facilitate comparison against databases and, in addition, to determine the distribution of BAC inserts relative to related genomes in order to examine the degree of synteny between them

and gain insights into evolutionary relationships22,23. In this scenario, study of the _P. edulis_ genome was continued using the large-insert BAC library and the SMRT

sequencing platform to completely sequence over 100 inserts of BAC clones. These clones were pre-selected based on BES microsynteny results and probes homologous to transcripts from a

subtractive library of _P. edulis_ in response to _Xanthomonas axonopodis_ infection, which allowed us to obtain a gene-rich fraction of this genome. The repetitive content, predicted genes,

and coding sequences were annotated. Also, microsyntenic regions of _P. edulis_ common to _Populus trichocarpa_ (Salicaceae, 485 Mb24) and _Manihot esculenta_ (Euphorbiaceae, 742 Mb25), two

related Malpighiales species with available fully sequenced and well-annotated genomes, were identified. MATERIAL AND METHODS BAC SELECTION AND DNA PREPARATION BAC clones were selected from

the findings of Santos _et al_.15, which provide an initial overview of the _P. edulis_ genome using BAC-end sequence (BES) data as a major resource. The results of comparative mapping

between _P. edulis’_ BES and the reference genomes of _Arabidopsis thaliana_, _Populus trichocarpa_ and _Vitis vinifera_ were also used to choose BAC clones for sequencing. In addition,

based on BES functional annotation results, the BAC-inserts with coding sequences (CDS) in one or both BESs were also selected. A second selection procedure was performed after screening the

genomic library using the probes homologous to _P. edulis_ transcripts described previously13. Briefly, the authors used suppression subtractive hybridization to construct two cDNA libraries

enriched for transcripts induced and repressed by _Xanthomonas axonopodis_, respectively, 24 h after inoculation with a highly virulent bacterial strain. The homologous probes were prepared

via PCR, using as a template the genomic DNA from ‘IAPAR-123’, the accession used to construct the Ped-B-Flav BAC library. Specific primers were used to generate a single amplicon (200 to

600 bp in size) for each probe gene sequence. The ‘DecaLabel DNA Labeling Kit’ (Fermentas) was used for radiolabeling the probes. The amplification products were then purified with ‘Illustra

ProbeQuant™ G-50 Micro Columns’ (GE Healthcare). The library was previously gridded onto macroarrays in which 41,472 clones were double-spotted on each 22 × 22 cm nylon membrane. These

membranes were submerged in a bath of SSC (Saline-Sodium Citrate) solution (6×, 17 min., 50 °C); incubated overnight (68 °C) in hybridization buffer [6× SSC, 5× Denhardt’s Solution, 0.5%

(w/v) SDS (Sodium Dodecyl Sulfate)]; hybridized with denatured probes (10 min, 95 °C; 1 min., cooled on ice); and washed twice in buffer 1 [2× SSC, 0.1% (w/v) SDS] (15 min., 50 °C) and

buffer 2 [0.5× SSC, 0.1% (w/v) SDS] (30 min., 50 °C). Next, the hybridized membranes were placed in a film cassette for 24 h; radioactive signals were detected using a PhosphorImager™ and

Storm 820 scanner (Amersham Biosciences) and analyzed using HDFR3 software, to identify the positive clones. Each positive clone was individually validated by PCR. In order to estimate

insert sizes, the preserved cultures were scraped and a positive single colony of each BAC was grown in a 96-well plate (overnight, 37 °C) containing 1200 µL of LB medium with chloramphenicol

(12.5 µg/mL) and glycerol (6%). DNAs were then isolated using a NucleoSpin® 96 Flash (Macherey-Nagel) BAC DNA purification kit, digested with 5 U of FastDigest™ _Not_I enzyme (Fermentas) and

size-fractionated by pulsed-field gel electrophoresis (PFGE; 6 V cm−1, 5 to 15 s switch time, 16 h run time, 12.5 °C) in a Chef Mapper XA Chiller System 220 V (BioRad), followed by ethidium bromide staining and visualization.

The insert sizes were determined by comparison with PFGE standard size markers. To prepare the DNA for sequencing, 1 µL of the above cultures was allowed

to regrow in 20 mL of LB medium (plus 12.5 µg/mL chloramphenicol at 37 °C overnight) under shaking (250 rpm). The cultures were then mixed in pools, at a maximum of 20 clones per pool. DNA

extraction was performed using the Nucleobond Xtra Midi Plus kit (Macherey-Nagel) according to the manufacturer’s instructions. DNA SEQUENCING AND ASSEMBLY FROM LONG SEQUENCE READS

Approximately 5 µg of each pool was used for the construction of a SMRT library based on the standard Pacific Biosciences (San Francisco, CA, USA) preparation protocol for 10-kb libraries.

Each pool was sequenced in one SMRT Cell using P6 polymerase in combination with C4 chemistry, following the manufacturer’s standard operating procedures and using the PacBio RS II long-read

sequencer. Reads were assembled by the hierarchical genome assembly process (HGAP workflow)26, as implemented in the SMRT® Analysis software suite v2.2.0. Reads were first

aligned with BLASR, the PacBio long-read aligner27, against the complete genome of _Escherichia coli_, strain K12, substrain DH10B (GenBank: CP000948.1). The _E. coli_ reads, as well as low

quality reads (those shorter than 500 bp or with a read quality below 0.80), were removed from the data set. Filtered reads were then preassembled to yield long, highly accurate sequences.

To perform this step, the shortest and the longest reads were separated from each other, and errors were corrected by mapping single-pass reads to the longest reads (seed reads), which represent the

longest portion of the read length distribution. Next, sequences were filtered against vector (BAC) sequences, and the Celera assembler was used to assemble the data and obtain draft assemblies.
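The read-filtering criteria described above (a minimum read length of 500 bp and a minimum read quality of 0.80) can be sketched as follows. This is only an illustrative sketch and not the SMRT Analysis implementation: the `Read` record and the `passes_filter` and `filter_reads` helpers are hypothetical names, and the code assumes that each read already carries a length and a read quality between 0 and 1. Removal of _E. coli_ and vector reads is not covered.

```python
from dataclasses import dataclass

MIN_READ_LENGTH = 500    # bp, threshold stated in the text
MIN_READ_QUALITY = 0.80  # read quality threshold stated in the text


@dataclass
class Read:
    """Hypothetical minimal read record used only for this sketch."""
    name: str
    length: int      # read length in bp
    quality: float   # read quality, assumed to lie between 0 and 1


def passes_filter(read, min_length=MIN_READ_LENGTH, min_quality=MIN_READ_QUALITY):
    """Keep a read only if it meets both the length and the quality threshold."""
    return read.length >= min_length and read.quality >= min_quality


def filter_reads(reads):
    """Return the reads that pass both thresholds, preserving input order."""
    return [r for r in reads if passes_filter(r)]


if __name__ == "__main__":
    example = [
        Read("long_good", 12000, 0.86),
        Read("too_short", 350, 0.90),
        Read("low_quality", 8000, 0.75),
    ]
    print([r.name for r in filter_reads(example)])
```

Both conditions must hold for a read to be kept, which matches the usual reading of "minimum length" and "minimum quality" as independent lower bounds.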

The last step was performed in order to significantly reduce the remaining indels and base substitution errors in the draft assembly. The Quiver algorithm was used for this purpose. This

quality-aware consensus algorithm uses rich quality scores (Quality Value, QV), where QV is a per-base estimate of base accuracy. QV scores over 20 indicate very good data, with at most a 1%

error probability. Finally, Quiver polishes the assembly for final consensus26. Once the refined assembly was obtained, each BAC-insert sequence was individualized by matching the end

sequences to the pool of assembled sequences using BLAST. Read coverage was assessed by aligning the raw reads on the assembled sequences with BLASR. IDENTIFICATION AND ANNOTATION OF

REPETITIVE SEQUENCES Eukaryotic genomes contain a substantial portion of repetitive elements which are organized into three main classes: dispersed repeats (mostly transposable elements and

retrotransposed genes), local repeats (tandem repeats and simple sequence repeats or microsatellites) and segmental duplications (duplicated genomic fragments)28. It is highly recommended to

identify and mask repetitive regions before gene prediction. Otherwise, unmasked repeats can produce spurious BLAST alignments, resulting in false evidence for gene annotations29. The v2.2

REPET package was used for _de novo_ detection and annotation of transposable elements (TEs). The annotation process starts with self-alignment of the sequences by all-by-all comparison.

Matching sequences are then grouped into clusters, each corresponding to a given family. A consensus for each family is created, and each consensus is classified according to the

structures and domains present. The last step entails annotating TE copies30,31. The resulting elements were then compared with sequences deposited in the Viridiplantae section of the

Repbase repeat database32. They were classified by PASTEC, a tool for classifying TEs by searching for structural features and similarities33 and implementing the hierarchical classification

system proposed previously34. Repeat masking was subsequently performed with RepeatMasker Open-3.0 (ref. 35) using the library generated by REPET and the Repbase Viridiplantae dataset32. MISA36 was used to

search for microsatellites, requiring at least 10 repeat units for mono-, 5 for di-, and 3 for tri-, tetra-, penta- or hexanucleotide motifs. Composite

microsatellites were also identified. They are formed by multiple, adjacent, repetitive motifs. Hence, a microsatellite is considered composite if it has a maximum interruption of 10 bp

between motifs37,38. GENE PREDICTION AND FUNCTIONAL ANNOTATION Evidence-driven gene prediction was performed based on gene models of _Arabidopsis thaliana_ and _Theobroma cacao_ and using

the following software: Augustus39, GlimmerHMM40, GeneMark.hmm41, and SNAP42. _Ab initio_ gene finding was performed with the BRAKER pipeline43. Protein homology detection and potential

intron resolution were performed with Exonerate44 against the annotated genomes of _Populus trichocarpa_, _Salix purpurea_, _Ricinus communis_ and _Manihot esculenta_, downloaded from

the Phytozome website45. These species are among the plant genomes with the highest number of top hits for _P. edulis_15. Additionally, a _P. edulis_ RNA-seq library (see details below) was

used to support gene model predictions. PASA46 was used to produce alignment assemblies based on overlapping transcript alignments from _P. edulis_ RNA-seq data. The results were combined by

EVidence Modeler software47, and PASA was used to update the EVidence Modeler consensus predictions, adding UTR annotations and models for alternatively spliced isoforms. Exon-intron

boundaries were manually examined using GenomeView48 and adjusted where necessary. RNA-seq reads (2 × 100 bp; Illumina HiSeq2000) were trimmed based on quality (Phred quality score >20).
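For reference, the Phred scale used for the quality threshold above relates a quality score Q to a per-base error probability P through P = 10^(−Q/10), so a score of 20 corresponds to a 1% error probability. The short sketch below only illustrates this standard conversion; the function names are hypothetical and it does not reproduce the trimming logic of any particular tool.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability = {phred_to_error_prob(q):.4%}")
```

The same relation applies to the QV scores reported for the assembled BAC inserts, since QV is also a Phred-scaled estimate.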

Contaminants, remaining adapters, and short sequences (<50 bp) were removed using SeqyClean v1.9.9 (ref. 49). Total RNA-seq assembly was performed with Trinity50. In brief, RNA-seq reads were derived

from three libraries (each replicated three times) of shoot apexes of juvenile, vegetative and reproductive adult plants of _P. edulis_, constructed with the aim of performing comparisons of

these three developmental stages (Dornelas M.C. _et al_., unpublished data). Functional annotation of the predicted gene sequences was performed using Blast2GO v3.2 tools51 for assigning

ontological terms in accordance with BLASTX results (e-value cut-off of 1 × 10−6). In addition, protein signature recognition was performed using the InterProScan tool52. MICROSYNTENY

ANALYSIS The 20 _P. edulis_ BAC-inserts with the highest number of annotated genes were used for the identification of potential microsyntenic regions between _P. edulis_ and _Populus

trichocarpa_ (Salicaceae), and _P. edulis_ and _Manihot esculenta_ (Euphorbiaceae), two related Malpighiales species with entirely sequenced and well-annotated genomes. _P. edulis_ coding

sequences were compared with these two genome sequences, available in the Phytozome database45 using BLASTN. Based on the phylogenetic relationships among the Malpighiales species, we chose

_P. trichocarpa_ because it is the closest species to _P. edulis_. Taxonomically speaking, Passifloraceae appears as a sister group to Salicaceae. On the other hand, _M. esculenta_ is the

most distant species from _P. edulis_ among those Malpighiales with fully sequenced and well-annotated genomes. To consider two genes as orthologs, the alignment had to show an e-value <

10−10 and coverage >50%. After identifying the orthologs, microsyntenic regions were defined. These are regions with more than four pairs of orthologous genes. All gene positions in the

microsyntenic regions were recorded to construct comparative graphs. The analysis was carried out in JBrowse (Phytozome v12.1 platform)45 to locate the genes in each _P. edulis_

microsyntenic region and their counterparts in the _P. trichocarpa_ and _M. esculenta_ genomes. The initial and final positions of the orthologous genes and chromosome identification were used as a basis for

constructing comparative graphs. Using the GenomeView browser48, each of the microsyntenic regions was visualized and confirmed. Finally, comparative graphs were constructed using a graphics

application. RESULTS BAC SELECTION, SEQUENCING AND ASSEMBLY A total of 66 BAC inserts were selected for complete sequencing based on our previous BAC-end sequencing results15, and 46 were

selected using probes homologous to transcripts of _P. edulis_53 (Supplementary Table S1). Thus, in total, 112 BAC inserts from the _P. edulis_ genomic library were sequenced. The sequencing

process resulted in 571,565 high quality reads, ranging from 500 to 46,831 bp in length. Sequences were between 24,316 and 142,456 bp in length, corresponding to their respective band sizes

resolved by PFGE. The high quality of the long reads (QV > 47) and high coverage of the contigs (on average 278×) are indications of the reliability of our data (Supplementary Table S2),

leading to the conclusion that all inserts were completely sequenced and assembled. The assembly, gene models, and genome browser are available at

https://genomevolution.org/coge/GenomeInfo.pl?gid=52053. The sequencing method was of sufficient quality to provide a single contig per insert, with only two exceptions; in the assembly

process, insert sequences Pe101K14 and Pe141H13 had overlapping regions that resulted in a single contig of 172,337 bp; similarly, Pe20N3 and Pe64C12 resulted in a single contig of 114,997

bp. In addition, of the 112 BAC insert sequences, three corresponded to organelle DNA, and therefore these sequences were not included. Thus, 107 sequences were subjected to annotation,

totaling 10,401,671 bp (10.4 Mb) corresponding to approximately 1.0% of the _P. edulis_ genome. GC content across this genome fraction was 41.09%, and in the CDS 46.49%. GENE

REPRESENTATIVENESS, STRUCTURE AND FUNCTIONAL ANNOTATION Structural sequence annotation resulted in the prediction of 1,883 genes ranging from 153 to 24,687 bp in length, with an average of

2,448 bp. These gene sequences represented 44% of the total sequenced nucleotides, corresponding to 4,608,830 bp. Intergenic regions covered from 0 (overlapped genes) to 92,497 bp, with a

mean length of 3,184 bp. Between 3 and 36 predicted genes were identified per sequenced insert, with an average of 17.6 predicted genes per insert (Table 1, Supplementary Table S3). Taking

into account the estimated size of the _P. edulis_ genome (~1,230 Mb), the high number of genes identified herein (1,883) endorses the efficiency of the strategy used for selecting

BAC-inserts that were supposedly gene-rich. One third of the genes (631) had no introns. The remainder (1,252) had up to 50 introns. A total of 6,122 introns (ranging from 26 to 7,869 bp in

length) and 8,005 exons (ranging from 3 to 6,249 bp) were recognized. CDS ranged from 153 to 14,583 bp in length, totaling 1,985,892 bp, with a mean of 1,054 bp. A total of 61 were

insert-end sequences and therefore incomplete gene sequences. According to the RNA-seq read alignment results, 252 genes exhibited more than one transcript (Supplementary Table S3),

including glutamine synthetase leaf enzyme, chloroplastic (6 transcripts), ultraviolet-B receptor UVR8, a protein responsive to UV-B (5), the auxin response factor (2), an abscisic acid

insensitive protein (2) and an ethylene receptor protein (2). Of the 1,883 predicted genes, 1,502 showed significant levels of similarity (e-values < 1 × 10−6) to plant proteins

(Supplementary Table S3) according to the Blast2GO results. The top hits for this large fraction of genes (~80%) were from _Jatropha curcas_ (298), _Populus trichocarpa_ (275), _Populus

euphratica_ (232) and _Ricinus communis_ (212). These results were expected, since among all available plant genomes, these species are phylogenetically close to _P. edulis_, and all belong

to the Malpighiales order. Functional annotation resulted in 3,178 ontological terms assigned to 1,191 genes. These GO terms were related to several processes, and are usually classified

into three broad categories (known as level 1): biological process, molecular function and cellular component. The distribution of level 2 terms within each of these major categories is

shown in Fig. 1 and matches the results of BES annotation15. Regarding the 46 regions selected using probes homologous to transcripts induced and repressed by _X. axonopodis_ infection, none

of the functional categories related to plant defense were found to be overrepresented. However, protein signatures related to plant immunity and defense functions were identified. The

serine/threonine-protein kinase active site (32 genes), and the leucine-rich repeat domain, L domain-like (27 genes) were among the most represented signatures (Table 2). In total, automated

searches for protein signatures recognized 1,383 signatures in 1,488 genes of _P. edulis_: 783 domains, 453 protein families, 125 sites and 22 repeats (Table 2). Most of these signatures

(769) were taken from the Pfam database54, and the remainder from SuperFamily (239)55 and Smart (223)56. RICHNESS OF TRANSPOSABLE ELEMENTS AND MICROSATELLITES The search for transposable

elements resulted in the identification of 250 TEs that, in turn, were automatically classified as Class I (retrotransposons) and Class II (DNA transposons), and in terms of order33. These

TEs represented 17.6% of total data, corresponding to 1,830,620 bp. Class I was prevalent with 96.4% (241/250) retrotransposons (Table 3). These TEs were preferentially hosted in intergenic

regions (70.4%, 176/250); 74 TEs were found within genes, including 70 exonic TEs, and only four were located in introns. The LTR (Long Terminal Repeat) retrotransposon was the most frequent

order, and accounted for 75.1% (181/241) of retrotransposons, corresponding to 1,418,389 bp or 13.6% (1,418,389 bp/10,401,671 bp) of all sequence data. The other orders of Class I were

poorly represented, but note that LARDs (Large Retrotransposon Derivatives) accounted for 36 elements (Table 3). Only 3.6% (9/250) of TEs were of Class II, the majority (6) classified as TIR

(Terminal Inverted Repeats) (Table 3). The search for microsatellites resulted in the identification of 11,020 simple sequence repeats (SSR), representing 1.05% of all sequence data

(109,695 bp/10,401,671 bp). In CDS (1,985,806 bp) there were 1,762 SSRs (~16% of the total). Taking into account all sequence data, 106 SSRs were found every 100 kb (one SSR every 0.94 kb).
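As a rough illustration of the search criteria given in the Methods (at least 10 repeat units for mononucleotides, 5 for dinucleotides and 3 for tri- to hexanucleotides, with compound SSRs separated by at most 10 bp) and of the density figures reported here, the sketch below uses a simplified regular-expression search. It is not MISA itself; `find_ssrs`, `group_compound` and `ssr_density_per_100kb` are hypothetical helper names, and the toy sequence is invented for the example.

```python
import re

# Minimum number of repeat units per motif length (thresholds stated in the text).
MIN_REPEATS = {1: 10, 2: 5, 3: 3, 4: 3, 5: 3, 6: 3}


def is_redundant(motif):
    """Return True if the motif is a repetition of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, n_units) tuples; 'end' is exclusive."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # One motif of length k followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_redundant(motif):
                hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


def group_compound(hits, max_gap=10):
    """Group SSRs separated by at most max_gap bp; groups with more than one SSR are compound."""
    groups = []
    for hit in hits:
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def ssr_density_per_100kb(n_ssrs, total_bp):
    """Number of SSRs per 100 kb of sequence."""
    return n_ssrs / total_bp * 100_000


if __name__ == "__main__":
    toy = "GG" + "A" * 12 + "CCGG" + "AT" * 6 + "GCC" + "AAG" * 4 + "C"
    print(find_ssrs(toy))
    # Values reported in the text: 11,020 SSRs in 10,401,671 bp of sequence.
    print(round(ssr_density_per_100kb(11020, 10401671), 1))
```

With the reported totals, the density function gives about 106 SSRs per 100 kb for all sequence data and about 89 per 100 kb for the CDS (1,762 SSRs in 1,985,806 bp).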

Analyzing the CDS region, 89 SSRs were found every 100 kb (one SSR every 1.12 kb); hence, the frequency of SSRs was slightly lower in the CDS region (~1.2×, 1.12 kb/0.94 kb). Our estimates

were about 10× higher in SSR density than those reported previously15 using _P. edulis_ BES data as a major resource (10.8 SSRs every 100 kb or one SSR every 9.25 kb). Microsatellite sequences were grouped according to

motif, and all possible classes of repeats were found, with trinucleotides the most prevalent in both data sources. Compound SSRs accounted for 17.4% (1,919/11,020) of all SSRs, and 15.7%

(278/1,762) of these SSRs were found in CDS (Fig. 2A). Among the mononucleotides, the A/T motif far surpassed the number of G/C motifs. The most frequent dinucleotides were AT/AT (49.3%),

followed by AG/CT (35.4%), which were prevalent in CDS (74%). Among the trinucleotides, AAG/CTT were the most frequent in both data sources (~23%). Other occurrences (tetra-, penta- and

hexanucleotides) are shown in Fig. 2B. MICROSYNTENY ANALYSIS RESULTS The following 20 _P. edulis_ BAC-inserts were used for microsynteny analysis: Pe101K14 + 141H13 (36), Pe185D11 (36),

Pe164B18 (29), Pe214H11 (29), Pe164D9 (28), Pe186E19 (28), Pe43L2 (27), Pe164K17 (26), Pe215I8 (26), Pe84I14 (25), Pe84M23 (25), Pe93M2 (25), Pe171P13 (25), Pe207D11 (25), Pe93N7 (24),

Pe108C16 (24), Pe173B16 (24), Pe185J16 (24), Pe198H23 (24) and Pe212I1 (24). These regions were found to contain the highest number of annotated genes (given in parentheses) and account for

2,243,840 bp, encompassing 534 genes (Table 1). Microsynteny analysis showed that 18 of the 20 _P. edulis_ regions contained syntenic _P. trichocarpa_ chromosomal regions, and 15 _P. edulis_

regions had syntenic _M. esculenta_ chromosomal regions (Figs 3−7, S1−S13). In some comparisons, the microsyntenic region of _P. edulis_ had the opposite orientation with respect to the

chromosomes of both (see Fig. 3) or one of the species compared. The 18 _P. edulis_ regions span 1,702,975 bp and contain 406 genes. They matched syntenic segments of _P. trichocarpa_

chromosomes that span 7,137,451 bp and contain 966 genes, including 501 orthologs (Table 4). Ten of the syntenic regions of _P. edulis_ have orthologous genes that are duplicated in _P.

trichocarpa_ chromosomes. Interestingly, a continuous region in _P. edulis_ (Pe214H11) is syntenic to segments of _P. trichocarpa_ chromosome 4, and these segments are separated by 1.4 Mb.

The same is true for segments of chromosome 9, separated by 1.2 Mb (Fig. 4). Other large segments of the _P. trichocarpa_ chromosome 4 are also missing in the corresponding _P. edulis_

syntenic region (Fig. 7). These presumably relate to deletion events that occurred in _P. edulis_. Average gene length in _P. edulis_ (2,785 bp) is slightly lower than that of _P.

trichocarpa_ (3,290 bp). However, the average intergenic spacer length in _P. trichocarpa_ (8,694 bp) is more than four times that of _P. edulis_ (1,871 bp) (Supplementary Table S4). The gene order is

conserved in most of the syntenic regions, but rearrangements were observed. On comparing _P. edulis_ with _P. trichocarpa_, two typical inversion events in the gene order were recognized

(Supplementary Figs S3 and S6). Moreover, two adjacent genes in _P. trichocarpa_ chromosome 1 were found to be inverted, and also interrupted in the _P. edulis_ syntenic region (Fig. 6).
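The compactness of the _P. edulis_ regions relative to their _P. trichocarpa_ counterparts can be checked with simple arithmetic on the figures reported above (spans, gene counts and intergenic spacer lengths). The sketch below is only a worked example; `genes_per_mb` is a hypothetical helper name.

```python
def genes_per_mb(n_genes, span_bp):
    """Gene density expressed as genes per megabase of sequence."""
    return n_genes / span_bp * 1_000_000


# Values reported in the text for the 18 syntenic regions.
PE_SPAN_BP, PE_GENES = 1_702_975, 406   # P. edulis regions
PT_SPAN_BP, PT_GENES = 7_137_451, 966   # matching P. trichocarpa segments

pe_density = genes_per_mb(PE_GENES, PE_SPAN_BP)
pt_density = genes_per_mb(PT_GENES, PT_SPAN_BP)
span_ratio = PT_SPAN_BP / PE_SPAN_BP
intergenic_ratio = 8_694 / 1_871  # average intergenic spacer lengths (bp)

if __name__ == "__main__":
    print(f"P. edulis: {pe_density:.1f} genes/Mb")
    print(f"P. trichocarpa: {pt_density:.1f} genes/Mb")
    print(f"Span ratio: {span_ratio:.2f}; intergenic ratio: {intergenic_ratio:.2f}")
```

On these numbers, the _P. edulis_ regions carry roughly 238 genes per Mb against roughly 135 genes per Mb in the matching _P. trichocarpa_ segments, and the _P. trichocarpa_ segments are about 4.2 times longer.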

Finally, it is worth noting the occurrence of particular gene duplications within the syntenic regions involving two to seven copies. Figure 4 shows two _P. edulis_ genes (8th and 22nd) that

have four copies in _P. trichocarpa_ chromosome 9. In the comparison with _M. esculenta_, the 15 regions of _P. edulis_ span 1,392,795 bp and contain 348 genes, matching syntenic segments

of _M. esculenta_ chromosomes that span 5,053,254 bp and contain 633 genes, including 365 orthologs (Table 5). Eleven of the syntenic regions of _P. edulis_ contain orthologous genes that

are duplicated in _M. esculenta_ chromosomes. The average _P. edulis_ gene length (2,641 bp) is slightly lower than that of _M. esculenta_ (3,886 bp). However, the average intergenic spacer

length in _M. esculenta_ (6,777 bp) was more than three times that of _P. edulis_ (1,850 bp) (Supplementary Table S4). Gene order is also conserved in most of the syntenic regions, but rearrangements were recognized in

genes of both _P. edulis_ and _M. esculenta_ (Figs S1, S2, S6, S7). The occurrence of particular gene duplications within syntenic regions involving two to five copies was also detected.

Figure 3 shows three copies of a _P. edulis_ gene (18th) arranged in tandem on chromosome 13 of _M. esculenta_ and two copies in tandem on chromosome 12, totaling five copies. The 2nd gene

within the _P. edulis_ microsyntenic region is also duplicated in _M. esculenta_ chromosome 12. In terms of specific genes, note that a single copy of the gene encoding a KIN1-related

stress-induced protein was found in _P. edulis_ but there are seven orthologous copies in _P. trichocarpa_ chromosome 4 and three in chromosome 17 (Supplementary Fig. S2). Moreover, five

copies in tandem of the gene encoding an endo-1,3 1,4-beta-D-glucanase were found in _P. edulis_, but no orthologs were found in _P. trichocarpa_ and _M. esculenta_. Finally, four copies in

tandem of the salicylic acid-binding protein 2-like gene were found in _P. edulis_: an orthologous copy was found in chromosome 4 and three in chromosome 9 of _P. trichocarpa_, but only one

copy was found in chromosome 17 of _M. esculenta_ (Supplementary Fig. S1). There is a higher degree of comparative microsynteny between _P. edulis_ and _P. trichocarpa_ than between _P.

edulis_ and _M. esculenta_. The number of genes is considerably higher in most _P. trichocarpa_ and _M. esculenta_ chromosomes compared to that found in _P. edulis_ microsyntenic regions

(Tables 4 and 5). The highest level of synteny conservation was found between Pe173B16 and _P. trichocarpa_ chromosome 9, with 29 orthologous, collinear gene pairs (Table 4; Fig. 7), and

between Pe185D11 and _M. esculenta_ chromosome 12, with 27 orthologous, collinear gene pairs (Table 5; Fig. 3). DISCUSSION Despite great advances in genome sequencing, the process of

sequencing a plant genome is still laborious, due primarily to the size and complexity of genome regions which pose a challenge when it comes to sequencing and assembly. For instance,

_Passiflora_ species are extensively diversified in morphological terms, with genome sizes ranging from 207 Mb to 2.15 Gb14 and there are no draft genomes for any passion fruits, even the

most cultivated species, _P. edulis_. In this study, a gene-rich fraction of the _P. edulis_ genome was sequenced and assembled from long sequence reads, allowing us to obtain 10.4 Mb of

highly curated data. Almost half of the sequence data (44%) corresponded to predicted _P. edulis_ genes, and annotation revealed several functional categories and protein domains. Interestingly, the most

frequent domain was retrotransposon gag, associated with transcripts of the LTR retrotransposon, followed by the kinase domains. This abundance was to be expected, since kinases belong to a

superfamily of proteins with copies in the hundreds or thousands and are components of all cellular functions. These proteins use ATP γ-phosphate to phosphorylate serine and threonine or

tyrosine residues from other proteins57. Note that, to date, information on _Passiflora_ nuclear genes in databases is extremely scarce. This means that obtaining gene-based

probes for selecting new regions for whole sequencing is practically impossible. The structural and functional annotation of 1,883 genes provides a significant set of high quality gene

sequences that can be used in many other studies on _Passiflora_ (see Supplementary Table S3). Transposable elements (TEs) are highly widespread in plant genomes, accounting for 14% of the

_Arabidopsis thaliana_ genome58, up to 80% of the maize genome59 and 17.6% of all _P. edulis_ sequences. The vast majority are retroelements that belong to Class I (96.4%), and especially to

the LTR order. This abundance is very similar to that previously reported15 from an analysis of ~10,000 BAC-end sequences (BES; 18.5% TEs, 94.1% Class I TEs, the majority belonging to the LTR order), so this pattern

is expected to hold across the _P. edulis_ genome. On examining high quality genomes, several authors have stated that the spread of TEs (mostly retrotransposons) is the main driver of genome size variation

in plants. This is particularly true of LTR retrotransposons, owing to their replicative (copy-and-paste) transposition mechanism. LTR retrotransposons are found mainly in centromeric regions, playing an important role in chromatin structure

maintenance, centromere performance and the regulation of host gene expression60,61,62. The content of LTR elements in _P. edulis_ is comparable to that identified in related Malpighiales

species with completely sequenced genomes, although the abundance of TEs is highly variable. This variation is to be expected and is indicative of particular TE-driven evolutionary

processes60. For instance, ~42% of the _P. trichocarpa_ genome consists of transposable elements (although only 12.9% of the sequences could be classified as known TEs), the majority

belonging to the LTR order (~60%). These figures relate to the draft genome of _P. trichocarpa_24, and the authors state that this genome could contain even more non-classified LTRs. In _R.

communis_, approximately 50% of the genome consists of transposable elements, and LTRs were the most abundant, making up ~16% of the genome63, close to the value observed in _P. edulis_

(13.6%), although the genome size of this species is ~3.8× larger than that of _R. communis_. Finally, in _Manihot esculenta_, ~25.7% of the genome consists of transposable elements, and LTR

is also the most represented order among classified TEs, forming ~11% of the genomic sequences25. In this case, the genome report was based on 65% of an assembled genome of the domesticated

variety. In terms of microsatellite abundance, ~1.0% of all _P. edulis_ sequences consisted of SSRs, with trinucleotide repeats prevalent (55.6%), even in CDS (93.8%). Microsatellite

abundance generally varies from one genome region to another, but trinucleotides are usually overrepresented in coding sequences, due to selection pressures against mutations that may alter

the reading frames64. The _P. edulis_ results corroborate the findings of a pioneer study65 to the effect that trinucleotide repeats are significantly more abundant in the

expressed regions of plant genomes. Recently, a total of 1,300 perfect microsatellite sites were described in _P. edulis_ genomic regions (with minimum 15× coverage as a cut off; Illumina

paired-end reads) that were selected for marker development and _Passiflora_ diversity analysis66. In this significant sample, the prevalence of tri-, tetra- and dinucleotides was found to

be 41.0%, 36.4% and 22.6%, respectively. In the _P. trichocarpa_ genome, the predominance of mono- (69.8%), di- (19.5%) and trinucleotides (9.0%) decreased stepwise as the motif length

increased (mono- to hexanucleotide repeats); 98% of _P. trichocarpa_ mononucleotides consist of A/T motifs and only 2% of C/G motifs. The same applies to _P. edulis_ (Fig. 2B). For di- and

trinucleotides, the most frequent motifs were AT/AT (60.5%) and AAT/ATT (48.2%). In terms of coding sequences, 90.3% and 76.6% of the mono- and dinucleotides consist respectively of A/T and

AG/CT motifs. Trinucleotides consist mainly of AAG/CTT, ACC/GGT and AGG/CCT motifs (~20% of each), and the frequencies of tetra-, penta- and hexanucleotides were very low67. In _M.

esculenta_, 37.4% of all SSRs corresponded to dinucleotides, and tri- and pentanucleotides were found in the same proportion (~24%); within the coding sequences, tri- and hexanucleotides

accounted for 95.6%. AT/AT and AAT/ATT were the most common di- and trinucleotide motifs (~23% and ~12%, respectively) and, as in _P. edulis_, AG/CT and AAG/CTT were the most prevalent in

coding sequences (~4% and ~23%, respectively)68. In the _R. communis_ genome, most of the SSRs found were also dinucleotides (70.4%), followed by trinucleotides (24.9%). AT/TA was the most

frequent motif among dinucleotides (75.3%) and AAT/TTA among trinucleotides (71%)69. Clearly, the particular occurrence of certain motifs in plant genomes and in different genome regions is

due to selection pressure during evolution70,71, and structural and functional genome attributes, like GC content and codon usage bias, may be responsible for the unique content and

distribution patterns of microsatellites72,73. Notably, there are several benefits that can be derived from the knowledge we have generated. First, a draft sequencing of the _Passiflora

edulis_ nuclear genome, especially of a gene-rich fraction, provides a platform for functional analysis and development of genomic tools in applied passion fruit improvement. Our work also

represents a first step towards full sequencing of the _P. edulis_ genome. Moreover, wild _Passiflora_ species harbor a variety of characteristics that determine their ecological importance

and adaptability. The availability of gene sequences could help researchers test for the presence of gene variants or polymorphisms in different environments. This is also possible for

cultivated species. Gene prediction has yielded around 1,900 genes, and functional annotation has associated genes with plant immunity and defense functions (Supplementary Table S3).
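The microsatellite motif classes discussed above (A/T, AG/CT, AAT/ATT, AAG/CTT and so on) group each repeat unit with its rotations and its reverse complement. The following is a minimal sketch of how perfect SSRs might be detected and assigned to such classes. It is not the pipeline used in this study: the function names and the minimum repeat thresholds in `MIN_REPEATS` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: names and thresholds are assumptions, not the study's settings.

# Hypothetical minimum repeat counts per motif length (1 = mono ... 6 = hexanucleotide).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(_COMPLEMENT)[::-1]


def motif_class(motif):
    """Label a motif by its smallest rotation over both strands, e.g. CTT -> AAG/CTT."""
    motif = motif.upper()
    candidates = []
    for strand in (motif, reverse_complement(motif)):
        candidates += [strand[i:] + strand[:i] for i in range(len(strand))]
    canonical = min(candidates)
    return f"{canonical}/{reverse_complement(canonical)}"


def _is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return all(motif != motif[:d] * (n // d) for d in range(1, n) if n % d == 0)


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeats) for each perfect tandem repeat, by motif length."""
    seq = seq.upper()
    hits = []
    for k, threshold in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and _is_primitive(motif):
                # Count how many times the motif is repeated in tandem from position i.
                repeats = 1
                while seq[i + repeats * k:i + (repeats + 1) * k] == motif:
                    repeats += 1
                if repeats >= threshold:
                    hits.append((i, motif, repeats))
                    i += repeats * k  # jump past the repeat that was just recorded
                    continue
            i += 1
    return hits


# Toy sequence with one mono-, one di- and one trinucleotide repeat.
example = "GG" + "A" * 12 + "C" + "AG" * 7 + "GC" + "CTT" * 5 + "G"
for start, motif, repeats in find_ssrs(example):
    print(start, motif_class(motif), repeats)
```

Grouping by the smallest rotation across both strands means that a CTT repeat read on one strand and an AAG repeat read on the other fall into the same class, which is the convention implied by labels such as AAG/CTT in the text.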

Taxonomically speaking, the genus is subdivided into four subgenera: three clades were recognized as monophyletic (_Astrophea_, _Decaloba_, and _Passiflora_), but the position of

_Deidamioides_ remained unresolved, as this particular clade was found to be paraphyletic. Therefore, gene sequences could be used in phylogenetic analysis to obtain accurate evolutionary

information. By providing information on the levels of synteny conservation and rearrangements within the microcollinear regions (inverted and translocated segments, deletion and gene

duplication events), this study will help confirm the relationships between a _Passiflora_ species and related Malpighiales, with important taxonomical implications. Our previous

phylogenetic analyses based on the available chloroplast genomes of members of the four families that compose the Malpighiales order indicated that the Passifloraceae are more closely

related to the Salicaceae than to the Euphorbiaceae16. This proximity is confirmed herein by microsynteny analysis, underscoring the importance of using comparative genomic

approaches as an additional resource for elucidating the phylogenetic relationships in the families that compose the Malpighiales order, one of the largest of flowering plants. Although _P.

edulis_ microsyntenic regions were compared with whole genomes of _P. trichocarpa_ (Salicaceae) and _M. esculenta_ (Euphorbiaceae), i.e. species that belong to different taxonomic

families, the analysis showed that overall gene order was well conserved. The level of microsynteny observed between the majority of _P. edulis_ BAC inserts and these genomes is surprising,

given the long divergence time that separates them from the common ancestor of the Malpighiales, some 100 million years ago74. The event of whole genome duplication (WGD) in _P. trichocarpa_

occurred about 60−65 million years ago and reached around 92% of its genome24. On the other hand, _M. esculenta_ has undergone a paleo-genome duplication event, and a number of its genes

were found to have only two copies25,75. This may be related to the loss of one of the homologous copies in _M. esculenta_ owing to selection pressure that restored the single-copy state of

genes that impair fitness when present in multiple copies76. The genome size of _P. edulis_ is estimated at ~1.23 Gb, considerably larger than the estimated genome sizes of _P. trichocarpa_

(~485 Mb)24 and _M. esculenta_ (~742 Mb)25. These differences raise the question: did an ancestor of the passionflowers undergo genome duplication? Possibly. According to cytogenetic

studies, the basic chromosome number in the genus _Passiflora_ is _x_ = 6, with several species containing secondary numbers, as in the case of _P. edulis_ (_x_ = 9). These species with

secondary chromosome numbers are possibly of polyploid origin77,78. Nevertheless, there is evolutionary evidence indicating _x_ = 12 as the basic chromosome number, since _x_ = 6 was

reported to occur only in the subgenus _Decaloba_. In primitive _Passiflora_ species, such as those of the _Astrophea_ subgenus, _x_ = 12, and the same applies to other species of the

Passifloraceae family78,79. This suggests that descending dysploidy events may have occurred in the _Passiflora_ (_x_ = 9) and _Decaloba_ (_x_ = 6) subgenera, lending weight to the

hypothesis that genome duplication occurred in an ancestor of the Passifloraceae. In fact, the diploid numbers _2n_ = 12, 18, 24 and 72 have been reported for _Passiflora_

species80. An examination of the microsyntenic regions shows that the _P. edulis_ gene-rich segments are more compact than those of the species compared, even though its genome is about three

times the size of that of _P. trichocarpa_, and almost twice the size of the _M. esculenta_ genome. The limited sampling of the _P. edulis_ genome analyzed herein does not account for these

apparently contradictory attributes regarding the compactness of gene regions and genome sizes. Further studies are required to elucidate the abundance of repetitive DNA (including TEs)

associated with gene-poor regions and/or the occurrence of large heterochromatin blocks in _P. edulis_81,82. Finally, wide variations in genome size occur within the genus _Passiflora_14

indicating that genome duplication, DNA sequence acquisition and loss throughout the evolution of the genus (favoring species disruption) have occurred since its diversification from the

common ancestor about 38 million years ago83. CONCLUSION The outcome of this research was a unique set of high quality sequence data on a gene-rich fraction of the _Passiflora edulis_

genome, describing gene content and abundance of repetitive elements. The structural and functional annotations of 1,883 genes of _P. edulis_ are detailed. It is proposed that there is a

relatively high degree of conservation in gene regions of _P. edulis_, _Populus trichocarpa_ and _Manihot esculenta_, according to our microsynteny analysis results. Collinear orthologous

genes are shown to be prevalent, although some disruptions of collinearity have occurred due to rearrangements (inversion, translocation events) within microsyntenic regions. Interestingly,

even though the _P. edulis_ genome is much larger than those of _P. trichocarpa_ (3×) and _M. esculenta_ (2×), which evolved by polyploidy, the _P. edulis_ gene-rich segments are much more

compact. In this study the first steps have been taken, but further studies are required to elucidate the abundance of repetitive DNA associated with gene-poor regions and/or the occurrence

of large heterochromatin blocks in _P. edulis_, in order to contribute to our understanding of the evolutionary issues that these genomes raise. REFERENCES
* Ulmer, T. & MacDougal, J. M. Passiflora: passionflowers of the world. _Timber Press_ 430p (2004).
* Bernacci, L. C. _et al_. Passifloraceae. _Lista de espécies da flora do Brasil_ (2014). Available at: http://reflora.jbrj.gov.br/jabot/floradobrasil/FB182 (Accessed: 15th November 2017).
* Abreu, P. P. _et al_. Passion flower hybrids and their use in the ornamental plant market: perspectives for sustainable development with emphasis on Brazil. _Euphytica_ 166, 307–315 (2009).
* Deng, J., Zhou, Y., Bai, M., Li, H. & Li, L. Anxiolytic and sedative activities of _Passiflora edulis_ f. _flavicarpa_. _J. Ethnopharmacol._ 128, 148–153 (2010).
* IBGE. _Produção Agrícola Municipal: culturas temporárias e permanentes_. 42 (2015).
* Cuco, S. M., Vieira, M. L. C., Mondin, M. & Aguiar-Perecin, M. L. R. Comparative karyotype analysis of three _Passiflora_ L. species and cytogenetic characterization of somatic hybrids. _Caryologia_ 58, 220–228 (2005).
* Madureira, H. C., Pereira, T. N. S., Da Cunha, M. & Klein, D. E. Histological analysis of pollen-pistil interactions in sour passion fruit plants (_Passiflora edulis_ Sims). _Biocell_ 36, 83–90 (2012).
* Suassuna, T., de, M. F., Bruckner, H., de Carvalho, R. & Borem, A. Self-incompatibility in passionfruit: evidence of gametophytic-sporophytic control. _Theor. Appl. Genet._ 106, 298–302 (2003).
* Moraes, M. C., Gerald, I. O., Matta, F. P. & Vieira, M. L. C. Genetic and phenotypic parameter estimates for yield and fruit quality traits from a single wide cross in yellow passion fruit. _HortScience_ 40, 1978–1981 (2005).
* Carneiro, M. S. _et al_. RAPD-based genetic linkage maps of yellow passion fruit (_Passiflora edulis_ Sims. f. _flavicarpa_ Deg.). _Genome_ 45, 670–678 (2002).
* Oliveira, E. J. _et al_. An integrated molecular map of yellow passion fruit based on simultaneous maximum-likelihood estimation of linkage and linkage phases. _J. Am. Soc. Hortic. Sci._ 133, 35–41 (2008).
* Lopes, R. _et al_. Linkage and mapping of resistance genes to _Xanthomonas axonopodis_ pv. _passiflorae_ in yellow passion fruit. _Genome_ 49, 17–29 (2006).
* Munhoz, C. F. _et al_. Analysis of plant gene expression during passion fruit-_Xanthomonas axonopodis_ interaction implicates lipoxygenase 2 in host defence. _Ann. Appl. Biol._ 167, 135–155 (2015).
* Yotoko, K. S. C. _et al_. Does variation in genome sizes reflect adaptive or neutral processes? New clues from _Passiflora_. _PLoS One_ 6, e18212 (2011).
* Santos, A. _et al_. Begin at the beginning: A BAC-end view of the passion fruit (Passiflora) genome. _BMC Genomics_ 15, 816 (2014).
* Cauz-Santos, L. A. _et al_. The Chloroplast Genome of _Passiflora edulis_ (Passifloraceae) Assembled from Long Sequence Reads: Structural Organization and Phylogenomic Studies in Malpighiales. _Front. Plant Sci_. 8 (2017).
* Huddleston, J. _et al_. Reconstructing complex regions of genomes using long-read sequencing technology. _Genome Res._ 24, 688–696 (2014).
* VanBuren, R. _et al_. Single-molecule sequencing of the desiccation-tolerant grass _Oropetium thomaeum_. _Nature_ 527, 508–511 (2015).
* Mayer, K. F. X. _et al_. A physical, genetic and functional sequence assembly of the barley genome. _Nature_ 491, 711–716 (2012).
* Li, F. _et al_. Genome sequence of cultivated Upland cotton (_Gossypium hirsutum_ TM-1) provides insights into genome evolution. _Nat. Biotechnol._ 33, 524–530 (2015).
* Buyyarapu, R. _et al_. BAC-Pool Sequencing and Analysis of Large Segments of A12 and D12 Homoeologous Chromosomes in Upland Cotton. _PLoS One_ 8, e76757 (2013).
* Ming, R. _et al_. The pineapple genome and the evolution of CAM photosynthesis. _Nat. Genet._ 47, 1435 (2015).
* de Setta, N. _et al_. Building the sugarcane genome for biotechnology and identifying evolutionary trends. _BMC Genomics_ 15, 540 (2014).
* Tuskan, G. A. _et al_. The genome of black cottonwood, _Populus trichocarpa_ (Torr. & Gray). _Science_ 313, 1596–1604 (2006).
* Wang, W. _et al_. Cassava genome from a wild ancestor to cultivated varieties. _Nat. Commun._ 5, 5110 (2014).
* Chin, C.-S. _et al_. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. _Nat Meth_ 10, 563–569 (2013).
* Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. _BMC Bioinformatics_ 13, 238 (2012).
* Bao, Z. & Eddy, S. R. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. _Genome Res._ 12, 1269–1276 (2002).
* Yandell, M. & Ence, D. A beginner’s guide to eukaryotic genome annotation. _Nat Rev Genet_ 13, 329–342 (2012).
* Quesneville, H. _et al_. Combined evidence annotation of transposable elements in genome sequences. _PLoS Comput. Biol._ 1, e22 (2005).
* Flutre, T., Duprat, E., Feuillet, C. & Quesneville, H. Considering Transposable Element Diversification in De Novo Annotation Approaches. _PLoS One_ 6, e16526 (2011).
* Jurka, J. _et al_. Repbase Update, a database of eukaryotic repetitive elements. _Cytogenet. Genome Res._ 110, 462–467 (2005).
* Hoede, C. _et al_. PASTEC: an automatic transposable element classification tool. _PLoS One_ 9, e91929 (2014).
* Wicker, T. _et al_. A unified classification system for eukaryotic transposable elements. _Nat. Rev. Genet._ 8, 973–982 (2007).
* Smit, A., Hubley, R. & Green, P. RepeatMasker Open-3.0. (2010).
* Aggarwal, R. K. _et al_. Identification, characterization and utilization of EST-derived genic microsatellite markers for genome analyses of coffee and related species. _Theor. Appl. Genet._ 114, 359–72 (2007).
* Oliveira, E. J., Pádua, J. G., Zucchi, M. I., Vencovsky, R. & Vieira, M. L. C. Origin, evolution and genome distribution of microsatellites. _Genet. Mol. Biol._ 29, 294–307 (2006).
* Vieira, M. L. C., Santini, L., Diniz, A. L. & Munhoz, C. de F. Microsatellite markers: what they mean and why they are so useful. _Genet. Mol. Biol._ 39, 312–328 (2016).
* Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. _Nucleic Acids Res._ 32, W309–W312 (2004).
* Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. _Bioinformatics_ 20, 2878–2879 (2004).
* Borodovsky, M. & Lomsadze, A. Eukaryotic Gene Prediction Using GeneMark.hmm-E and GeneMark-ES. _Curr. Protoc. Bioinformatics, Unit_ 4, 610 (2011).
* Korf, I. Gene finding in novel genomes. _BMC Bioinformatics_ 5, 59 (2004).
* Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. _Bioinformatics_ 32, 767–769 (2016).
* Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. _BMC bioinformatics_ 6, 31 (2005).
* Goodstein, D. M. _et al_. Phytozome: a comparative platform for green plant genomics. _Nucleic Acids Res._ 40, D1178–D1186 (2012).
* Haas, B. J. _et al_. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. _Nucleic Acids Res._ 31, 5654–5666 (2003).
* Haas, B. J. _et al_. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. _Genome Biol._ 9, R7–R7 (2008).
* Abeel, T., Van Parys, T., Saeys, Y., Galagan, J. & Van de Peer, Y. GenomeView: a next-generation genome browser. _Nucleic Acids Res._ 40, e12–e12 (2012).
* Zhbannikov, I. Y., Hunter, S. S., Foster, J. A. & Settles, M. L. SeqyClean. _Proc. 8th ACM Int. Conf. Bioinformatics, Comput. Biol. Heal. Informatics (ACM-BCB ’17)_ 407–416, https://doi.org/10.1145/3107411.3107446 (2017).
* Haas, B. J. _et al_. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. _Nat. Protoc._ 8, 1494–1512 (2013).
* Conesa, A. _et al_. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. _Bioinformatics_ 21, 3674–3676 (2005).
* Jones, P. _et al_. InterProScan 5: genome-scale protein function classification. _Bioinformatics_ 30, 1236–1240 (2014).
* Munhoz, C. F. _et al_. Analysis of plant gene expression during passion fruit-_Xanthomonas axonopodis_ interaction implicates lipoxygenase 2 in host defence. _Ann. Appl. Biol._ 167, 135–155 (2015).
* Finn, R. D. _et al_. Pfam: the protein families database. _Nucleic Acids Res._ 42, D222–D230 (2014).
* Gough, J., Karplus, K., Hughey, R. & Chothia, C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. _J. Mol. Biol._ 313, 903–919 (2001).
* Letunic, I. & Bork, P. 20 years of the SMART protein domain annotation resource. _Nucleic Acids Res._ 46, D493–D496 (2017).
* Lehti-Shiu, M. D. & Shiu, S.-H. Diversity, classification and function of the plant protein kinase superfamily. _Philos. Trans. R. Soc. B Biol. Sci._ 367, 2619–2639 (2012).
* Kaul, S. _et al_. Analysis of the genome sequence of the flowering plant _Arabidopsis thaliana_. _Nature_ 408, 796–815 (2000).
* Schnable, P. S. _et al_. The B73 Maize Genome: complexity, diversity, and dynamics. _Science_ 326, 1112–1115 (2009).
* El Baidouri, M. & Panaud, O. Comparative Genomic Paleontology Across Plant Kingdom Reveals The Dynamics Of TE-driven Genome Evolution. _Genome Biology and Evolution_ 5, 954–965 (2013).
* Zhao, M. & Ma, J. Co-evolution of plant LTR-retrotransposons and their host genomes. _Protein Cell_ 4, 493–501 (2013).
* Tenaillon, M. I., Hollister, J. D. & Gaut, B. S. A triptych of the evolution of plant transposable elements. _Trends Plant Sci._ 15, 471–478 (2010).
* Chan, A. P. _et al_. Draft genome sequence of the oilseed species _Ricinus communis_. _Nat. Biotechnol._ 28, 951–956 (2010).
* Xu, J. _et al_. Development and characterization of simple sequence repeat markers providing Genome-Wide coverage and high resolution in Maize. _DNA Res._ 20, 497–509 (2013).
* Morgante, M., Hanafey, M. & Powell, W. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. _Nat. Genet._ 30, 194–200 (2002).
* Araya, S. _et al_. Microsatellite marker development by partial sequencing of the sour passion fruit genome (_Passiflora edulis_ Sims). _BMC Genomics_ 18, 549 (2017).
* Sonah, H. _et al_. Genome-wide distribution and organization of Microsatellites in plants: An insight into marker development in _Brachypodium_. _PLoS One_ 6 (2011).
* Vásquez, A. & López, C. In silico genome comparison and distribution analysis of simple sequences repeats in cassava. _Int. J. Genomics_ 2014 (2014).
* Tan, M. _et al_. Developing and characterising Ricinus communis SSR markers by data mining of whole-genome sequences. _Mol. Breed._ 34, 893–904 (2014).

* Hancock, J. M. J. _Microsatellites_ and other simple sequences: genomic context and mutational mechanisms, in _Microsatellites: evolution and applications 1_ (eds Goldstein, D. B. & Schlötterer, C.) 3–9, https://doi.org/10.1038/mt.2008.186 (Oxford University Press, 1999).
* Ellegren, H. Microsatellites: simple sequences with complex evolution. _Nat. Rev. Genet._ 5, 435–45 (2004).
* Chakraborty, R., Kimmel, M., Stivers, D. N., Davison, L. J. & Deka, R. Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. _Proc. Natl. Acad. Sci. USA_ 94, 1041–1046 (1997).
* Whittaker, J. C. _et al_. Likelihood-based estimation of microsatellite mutation rates. _Genetics_ 164, 781–787 (2003).
* Magallon, S. & Castillo, A. Angiosperm diversification through time. _Am. J. Bot._ 96, 349–365 (2009).
* Prochnik, S. _et al_. The Cassava Genome: current progress, future directions. _Trop. Plant Biol._ 5, 88–94 (2012).
* De Smet, R. _et al_. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. _Proc. Natl. Acad. Sci._ 110, 2898–2903 (2013).
* De Melo, N. F., Cervi, A. C. & Guerra, M. Karyology and cytotaxonomy of the genus _Passiflora_ L. (Passifloraceae). _Plant Syst. Evol._ 226, 69–84 (2001).
* De Melo, N. F. & Guerra, M. Variability of the 5 S and 45 S rDNA sites in _Passiflora_ L. species with distinct base chromosome numbers. _Ann. Bot._ 92, 309–316 (2003).
* Hansen, A. K. _et al_. Phylogenetic Relationships and Chromosome Number Evolution in Passiflora. _Syst. Bot._ 31, 138–150 (2006).
* Magalhães Souza, M., Santana Pereira, T. N. & Carneiro Vieira, M. L. Cytogenetic studies in some species of _Passiflora_ L. (Passifloraceae): A review emphasizing Brazilian species. _Brazilian Arch. Biol. Technol._ 51, 247–258 (2008).
* Kim, S. _et al_. Genome sequence of the hot pepper provides insights into the evolution of pungency in _Capsicum_ species. _Nat. Genet._ 46, 270–278 (2014).
* Willing, E.-M. _et al_. Genome expansion of _Arabis alpina_ linked with retrotransposition and reduced symmetric DNA methylation. _Nat. Plants_ 1, 14023 (2015).
* Muschner, V. C., Zamberlan, P. M., Bonatto, S. L. & Freitas, L. B. Phylogeny, biogeography and divergence times in Passiflora (Passifloraceae). _Genet. Mol. Biol._ 35, 1036–1043 (2012).

ACKNOWLEDGEMENTS We would like to thank GATC Biotech (http://www.gatc-biotech.com) for providing DNA sequencing services,

and Mr. Steve Simmons for proofreading the manuscript. This work was supported by the following Brazilian institutions: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, grant

no. 2014/25215-2, postdoctoral and doctoral fellowships awarded to CFM, grant no. 2013/11196-3 and LAC-S, grant no. 2017/04216-9, respectively), Conselho Nacional de Desenvolvimento

Científico e Tecnológico (CNPq, scholarship awarded to ZPC) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES, scholarship awarded to ACER). AUTHOR INFORMATION Author

notes * Carla Freitas Munhoz, Zirlane Portugal Costa, Luiz Augusto Cauz-Santos and Alina Carmen Egoávil Reátegui contributed equally. AUTHORS AND AFFILIATIONS * Departamento de Genética,

Escola Superior de Agricultura “Luiz de Queiroz”, Universidade de São Paulo, 13418-900, Piracicaba, Brazil Carla Freitas Munhoz, Zirlane Portugal Costa, Luiz Augusto Cauz-Santos, Alina

Carmen Egoávil Reátegui & Maria Lucia Carneiro Vieira * Institut National de la Recherche Agronomique (INRA), Centre National de Ressources Génomique Végétales, 31326, Castanet-Tolosan,

France Nathalie Rodde, Stéphane Cauet & Hélène Bergès * Departamento de Biologia Vegetal, Instituto de Biologia, Universidade Estadual de Campinas, 13083-862, Campinas, Brazil Marcelo

Carnier Dornelas * INRA, UCA, UMR 1095, GDEC, 63000, Clermont-Ferrand, France Philippe Leroy * Departamento de Tecnologia, Faculdade de Ciências Agrárias e Veterinárias, Universidade

Estadual Paulista, 14884-900, Jaboticabal, Brazil Alessandro de Mello Varani CONTRIBUTIONS N.R. worked on probe

preparation and membrane hybridization as well as BAC DNA extraction, and S.C. worked on assembly of the PacBio long read sequences, assisted by H.B. at CNRGV, France. C.F.M., Z.P.C. and

L.A.C.-S. performed all bioinformatics analysis, including sequence prediction and annotation of genes and repetitive elements. M.C.D. provided information on RNA-seq libraries. A.M.V.

constructed a bioinformatics pipeline especially for _P. edulis_ sequences. P.L. assisted with sequence data analysis. C.F.M. and A.C.E.R. worked on microsynteny analysis. M.L.C.V. conceived

the study, provided assistance in the interpretation of the results, and wrote the final version of the manuscript. All authors read and approved the final manuscript. CORRESPONDING AUTHOR

Correspondence to Maria Lucia Carneiro Vieira. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER'S NOTE: Springer

Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. ELECTRONIC SUPPLEMENTARY MATERIAL SUPPLEMENTARY FIGURES S1-S13 SUPPLEMENTARY

TABLES S1 AND S2 SUPPLEMENTARY TABLES S3 AND S4 RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,

sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative

Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated

otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds

the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE CITE THIS ARTICLE Munhoz, C.F., Costa, Z.P., Cauz-Santos, L.A. _et al._ A gene-rich fraction analysis of the _Passiflora edulis_ genome reveals highly

conserved microsyntenic regions with two related Malpighiales species. _Sci Rep_ 8, 13024 (2018). https://doi.org/10.1038/s41598-018-31330-8 * Received: 23 April 2018 *

Accepted: 14 August 2018 * Published: 29 August 2018 * DOI: https://doi.org/10.1038/s41598-018-31330-8