De novo assembly and functional annotation of Myrciaria dubia fruit transcriptome reveals multiple metabolic pathways for L-ascorbic acid biosynthesis

Castro et al. BMC Genomics (2015) 16:997 DOI 10.1186/s12864-015-2225-6

RESEARCH ARTICLE Open Access

Juan C. Castro1,2*, J. Dylan Maddox3, Marianela Cobos4, David Requena5,6, Mirko Zimic5,6, Aureliano Bombarely7, Sixto A. Imán8, Luis A. Cerdeira1 and Andersson E. Medina1

* Correspondence: juanccgomez@yahoo.es
1Unidad Especializada de Biotecnología, Centro de Investigaciones de Recursos Naturales de la Amazonía (CIRNA), Universidad Nacional de la Amazonía Peruana (UNAP), Pasaje Los Paujiles S/N, San Juan Bautista, Iquitos, Perú
2Círculo de Investigación en Plantas con Efecto en Salud (FONDECYT N° 010–2014), Lima, Perú
Full list of author information is available at the end of the article

© 2015 Castro et al.

Abstract

Background: Myrciaria dubia is an Amazonian fruit shrub that produces numerous bioactive phytochemicals, but is best known for its high L-ascorbic acid (AsA) content in fruits. Pronounced variation in AsA content has been observed both within and among individuals, but the genetic factors responsible for this variation are largely unknown. The goals of this research, therefore, were to assemble, characterize, and annotate the fruit transcriptome of M. dubia in order to reconstruct metabolic pathways and determine if multiple pathways contribute to AsA biosynthesis.

Results: In total 24,551,882 high-quality sequence reads were de novo assembled into 70,048 unigenes (mean length = 1150 bp, N50 = 1775 bp). Assembled sequences were annotated using BLASTX against public databases such as TAIR, GR-protein, FB, MGI, RGD, ZFIN, SGN, WB, TIGR_CMR, and JCVI-CMR with 75.2 % of unigenes having annotations. Of the three core GO annotation categories, biological processes comprised 53.6 % of the total assigned annotations, whereas cellular components and molecular functions comprised 23.3 % and 23.1 %, respectively. Based on the KEGG pathway assignment of the functionally annotated transcripts, five metabolic pathways for AsA biosynthesis were identified: animal-like pathway, myo-inositol pathway, L-gulose pathway, D-mannose/L-galactose pathway, and uronic acid pathway. All transcripts coding for enzymes involved in the ascorbate-glutathione cycle were also identified. Finally, we used the assembly to identify 6,314 genic microsatellites and 23,481 high quality SNPs.

Conclusions: This study describes the first next-generation sequencing effort and transcriptome annotation of a non-model Amazonian plant that is relevant for AsA production and other bioactive phytochemicals. Genes encoding key enzymes were successfully identified and metabolic pathways involved in biosynthesis of AsA, anthocyanins, and other metabolic pathways have been reconstructed. The identification of these genes and pathways is in agreement with the empirically observed capability of M. dubia to synthesize and accumulate AsA and other important molecules, and adds to our current knowledge of the molecular biology and biochemistry of their production in plants. By providing insights into the mechanisms underpinning these metabolic processes, these results can be used to direct efforts to genetically manipulate this organism in order to enhance the production of these bioactive phytochemicals. The accumulation of AsA precursor and discovery of genes associated with their biosynthesis and metabolism in
M. dubia is intriguing and worthy of further investigation. The sequences and pathways produced here present the genetic framework required for further studies. Quantitative transcriptomics in concert with studies of the genome, proteome, and metabolome under conditions that stimulate production and accumulation of AsA and their precursors are needed to provide a more comprehensive view of how these pathways for AsA metabolism are regulated and linked in this species.

Keywords: Camu-camu, Metabolic pathway reconstruction, Next-generation sequencing, Plant vitamin C metabolism

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Myrciaria dubia (Kunth) McVaugh “camu-camu” is a diploid Amazonian plant species with 2n = 22 chromosomes [1] that produces numerous bioactive phytochemicals [2–6], but is best known for its high Vitamin C (L-ascorbic acid) content in fruits [7], which can contain as much as 2 g of L-ascorbic acid (AsA) per 100 g of fruit pulp [8], which is equivalent to 50 times the AsA content of orange juice [9]. Pronounced variation in AsA content among different tissue types in the same individual and among individuals has been observed [10], but the genetic factors responsible for AsA content variation in this species are largely unknown.

Results from our research group have demonstrated that M. dubia possesses the capability for AsA biosynthesis in several tissues (unpublished data), and that the large variation of this bioactive molecule in the leaves and fruit pulp and peel is likely due, in part, to differential gene expression and enzyme activities in the D-mannose/L-galactose pathway [11]. In other plant species, radiolabelling, mutant analysis, and transgenic manipulation have provided evidence for the occurrence of multiple metabolic pathways of AsA biosynthesis [12, 13]. It is therefore reasonable to hypothesize that AsA pool size in M. dubia is also the result of multiple metabolic pathways and that their identification and understanding may ultimately explain the large variation observed in AsA content.

Recent advances in high-throughput next-generation sequencing and bioinformatics tools have been used successfully to reveal the transcriptome and identify metabolic pathways in several plant species [14–18]. In this study, we present the sequencing, assembly, and annotation of the fruit transcriptome of M. dubia in order to reconstruct metabolic pathways and identify those associated with AsA biosynthesis.

Results

Illumina paired end sequencing and de novo assembly

A total of 25,787,070 raw sequencing reads of 100 bp were generated from a 200 bp insert library. After raw reads were filtered and cleaned, 24,551,882 (95.2 %) high-quality reads were used to assemble the fruit transcriptome of M. dubia. A total of 70,048 unigenes were generated in the meta-assembly with a length between 200 and 8059 bp, mean length of 1150 bp, and a N50 of 1775 bp (Fig. 1). The Illumina paired-end reads have been submitted to the Short Read Archive [SRA: SRR1630824].

Fig. 1 Distribution of unigene lengths after de novo transcriptome assembly of fruits from M. dubia

Functional annotation and metabolic pathway assignments

All unigenes were included in the annotation process. This process included best BLASTX match selection, Gene Ontology ID assignment, enzyme code assignments and InterPro domains calculation. BLASTX comparison with the NCBI nonredundant (nr) database revealed 52,707 (75.2 %) unigenes with annotations (Fig. 2). A significant amount of mapping data (93.0 % of unigenes with mapping information) was derived from the UniProtKB database followed by TAIR and GR_protein. Additional databases (i.e., FB, MGI, RGD, ZFIN, SGN, WB, TIGR_CMR, and JCVI-CMR) were searched but did not contribute to the mapping process. The top five species that contributed the greatest number of gene annotations from BLASTX were Vitis vinifera, Theobroma cacao, Populus trichocarpa, Prunus persica, and Ricinus communis (Fig. 3).

Fig. 2 Distribution of Blast2GO three-step processes including BLASTX, mapping, and annotation of de novo M. dubia fruit transcriptome meta-assembly

Fig. 3 Top-hits species distribution based on BLASTX alignments in the M. dubia fruit transcriptome meta-assembly

Of the three core GO annotation categories, biological processes (BP) comprised 53.6 % of the total assigned annotations, whereas cellular components (CC) and molecular functions (MF) comprised 23.3 % and 23.1 %, respectively. The GO terms with the largest number of assigned sequences in the BP category were ATP binding (7,741; 3.5 %), zinc ion binding (4,467; 2.0 %), DNA binding (3,682; 1.7 %), and sequence-specific DNA binding transcription factor activity (2,986; 1.3 %) (Fig. 4). For CC the terms with the most sequences were nucleus (10,891; 11.3 %), plasma membrane (9,153; 9.5 %), mitochondrion (6,562; 6.8 %), and cytosol (5,545; 5.8 %). In the MF category the terms with the most sequences were oxidation-reduction process (4,445; 4.7 %), serine family amino acid metabolic process (3,309; 3.5 %), regulation of transcription, DNA-dependent (3,293; 3.5 %), and protein phosphorylation (2,336; 2.4 %). In total, KEGG maps for more than 160 metabolic pathways were generated. The full pathway list and the KEGG maps are available as Additional file 1: Table S1 and Additional file 2: Figure S1, respectively.

Fig. 4 Gene Ontology classifications of assembled sequences. Numbers indicate the number of sequences associated with the particular GO term

L-ascorbic acid biosynthesis and recycling

Although the metabolic pathways for AsA biosynthesis and recycling are known for several plant species, the existing knowledge of these pathways and enzymes involved in M. dubia is limited. Based on the KEGG pathway assignment (map00010, map00051, map00053, map00500, map00520, and map00562) of the functionally annotated sequences and local blast search of the de novo meta-assembly transcriptome, transcripts coding for the D-mannose/L-galactose pathway were found. This pathway involves the generation of AsA from D-mannose-1-phosphate (Fig. 5). GDP-D-mannose synthesis from D-mannose-1-phosphate and GTP is catalyzed by GDP-D-mannose pyrophosphorylase (E.C. 2.7.7.13). GDP-D-mannose is converted to GDP-L-galactose by a reversible double epimerization, catalyzed by GDP-mannose-3′,5′-epimerase (E.C. 5.1.3.18). GDP-L-galactose is then broken down by GDP-L-galactose:hexose-1-phosphate guanyltransferase (E.C. 2.7.7.69) to L-galactose-1-phosphate, which is subsequently hydrolyzed to L-galactose and inorganic phosphate by L-galactose-1-phosphate phosphatase (E.C. 3.1.3.25). L-galactose is then oxidized to L-galactono-1,4-lactone by the NAD-dependent L-galactose dehydrogenase (E.C. 1.1.1.316). Finally, L-galactono-1,4-lactone is oxidized to AsA by L-galactono-1,4-lactone dehydrogenase (E.C. 1.3.2.3).

Four additional AsA biosynthetic pathways were also identified. The first is the animal-like pathway. In this pathway D-glucuronic acid is generated from D-glucose via the intermediates D-glucose-1-phosphate, UDP-D-glucose, UDP-D-glucuronic acid, and D-glucuronic acid-1-phosphate. D-glucuronic acid is then converted to L-gulonic acid by glucuronate reductase (E.C. 1.1.1.2), which is converted to L-gulono-1,4-lactone; from this compound AsA is generated by L-gulono-1,4-lactone oxidase/dehydrogenase (E.C. 1.1.3.8). The second alternative pathway uses myo-inositol as a precursor of AsA. In this pathway, D-glucuronic acid, an intermediate of the animal-like pathway, can be generated from myo-inositol by inositol oxygenase (E.C. 1.13.99.1). The third is the L-gulose pathway. In this putative pathway, the first metabolic intermediary (GDP-L-gulose) is generated from GDP-D-mannose by action of GDP-mannose-3′,5′-epimerase (E.C. 5.1.3.18); subsequently GDP-L-gulose is transformed to L-gulono-1,4-lactone through four sequential biochemical reactions. Finally, the fourth is the uronic acid pathway. In this pathway pectin-derived D-galacturonic acid is metabolized to AsA by an inversion pathway. The enzyme D-galacturonic acid reductase (E.C. 1.1.1.365) reduces D-galacturonic acid to L-galactonic acid, which in turn is spontaneously converted to L-galactono-1,4-lactone. This compound is the substrate of the L-galactono-1,4-lactone dehydrogenase enzyme (E.C. 1.3.2.3).

We also identified all transcripts coding for enzymes involved in the recycling pathway (i.e., ascorbate-glutathione cycle). When AsA is oxidized to monodehydroascorbate by ascorbate peroxidase (E.C. 1.11.1.11), it can be reduced to AsA by monodehydroascorbate reductase (E.C. 1.6.5.4) or it can disproportionate non-enzymatically to AsA and dehydroascorbate (DHA). DHA can also be reduced to AsA by dehydroascorbate reductase (E.C. 1.8.5.1), using glutathione (GSH) as the reductant, which is then converted to oxidized glutathione (GSSG). Finally, this compound is reduced by glutathione reductase (E.C. 1.8.1.7) using NADPH as the reductant.

Fig. 5 L-ascorbic acid biosynthesis and recycling pathways reconstructed based on the meta-assembly and annotation of the M. dubia fruit transcriptome

Discovery of molecular markers

All unigenes (70,048) of the meta-assembly were used to mine potential genic simple sequence repeats (genic-SSR) or microsatellites that were defined as di- to hexa-nucleotide motifs with a minimum of five repetitions. A total of 30 motifs with simple sequence repeats were identified in 6,314 (9.0 %) unigenes, but 287 genic-SSRs were included in unigenes without a match in the nr database (Additional file 3: Table S2). The di-nucleotide repeat AG/TC was the most abundant type (91.0 %), followed by other di- (4.4 %) and tri-nucleotide repeats (3.9 %; Fig. 6). The tetra-, penta-, and hexa-nucleotide repeats together exhibited the lowest frequency (0.7 %). Using Primer 3 Software we were able to design primers for 3,240 unigenes containing SSRs with product size ranging from 100 to 450 bp (Additional file 4: Table S3).

A total of 73,889 putative SNPs were also discovered, although only 23,481 met the selection criteria for robustness. The majority of these SNPs were bi-allelic (23,464) and only 17 were tri-allelic. These SNPs were found in 5,587 unigenes of which 5,226 (93.5 %) could be annotated with a GO term (Additional file 5: Table S4). The transition substitutions (33.5 % G↔A, and 33.6 % T↔C; totalling 67 %) were high compared to
transversions (6.7 % C↔A, 9.9 % G↔C, 7.5 % T↔A, and 8.9 % T↔G; totalling 33 %) with an observed transition over transversion ratio of approximately 2.0.

Several of the genes involved in AsA metabolism proved to be polymorphic as evidenced by SNP discovery. For example, the D-mannose/L-galactose pathway mannose-1-phosphate guanylyltransferase (E.C. 2.7.7.13) contained >20 SNPs, GDP-mannose-3′,5′-epimerase (E.C. 5.1.3.18) had 13 SNPs, whereas L-galactono-1,4-lactone dehydrogenase (E.C. 1.3.2.3) only had 5 SNPs. The animal-like pathway UTP:glucose-1-phosphate uridylyltransferase (E.C. 2.7.7.9) contained 7 SNPs. In the uronic acid pathway pectin esterase (E.C. 3.1.1.11) and galacturan-1,4-alpha-galacturonidase (E.C. 3.2.1.15) showed more than 20 and 14 SNPs, respectively. Finally, in the ascorbate-glutathione pathway the unigenes monodehydroascorbate reductase (E.C. 1.6.5.4) and glutathione reductase (E.C. 1.8.1.7) contained 2 and 3 SNPs, respectively (Additional file 6: Table S5).

Fig. 6 Frequency distribution of simple sequence repeats based on motif types identified in the M. dubia fruit transcriptome meta-assembly

Discussion

Illumina paired end sequencing and de novo assembly

Fruit transcriptome sequencing of M. dubia with Illumina paired end sequencing technology and de novo assembly with the meta-assembly bioinformatics strategy were able to produce more than 24 million high-quality reads, and ~70,000 assembled unigenes with high N50 value (Fig. 1). Similar strategies have been widely utilized for successful de novo transcriptome sequencing and assembly of fruits in other plant species such as Ananas comosus [19], Capsicum annuum [20], Litchi chinensis [17], Mangifera indica [21], Momordica cochinchinensis [22], Pyrus bretschneideri [23], and Vaccinium spp. [24]. Additionally, analogous approaches were effectively used for sequencing and de novo assembly of the transcriptome in various tissues of several non-model plant species without a reference genome [25–35]. The novel assembly methods [36–39] have made short read assembly a cost-effective and reliable tool for gene discovery and molecular marker development in non-model plant species.

Functional annotation and metabolic pathway assignments

In the present study we annotated 75.2 % of the assembled transcriptome, leaving 17,341 unigenes unidentified. Similar results with a large number of unidentified sequences have been reported for other non-model organisms [17, 19–24, 26]. These unidentified sequences are likely to correspond to non-coding RNAs; short sequences lacking informative domains for conclusive annotation; or novel and/or specific genes of M. dubia that have not been previously characterized or that code for orphan enzymes (i.e., unannotated gene sequences). The latter are all sequence-lacking enzymatic activities described in the literature and often catalogued in the EC database [40]. According to Sorokina et al. [41], 22.4 % of enzymatic activities from 5,096 ECs were orphans, and a large proportion of pathways (87 % in KEGG and 36 % in MetaCyc) contain at least one orphan activity. Of the pathways containing a mix of orphan and non-orphan activities in KEGG and MetaCyc, an average of 26.0 % and 39.5 % of the reactions per pathway correspond to orphan enzymes, respectively. Consequently, most metabolic pathways are still not entirely resolved at the gene level, which restricts in silico reconstructions of metabolic pathways.

Two additional problems that create challenges for automated reconstructions of metabolic pathways are the number of misannotations in large public databases and the variation in metabolic pathways. First, Schnoes et al. [42], investigating the prevalence of annotation error in primary and secondary large public protein databases commonly used today, found that the manually curated database Swiss-Prot shows the lowest annotation error levels. The other two protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 25 % in the enolase superfamily to over 60 % in the HAD superfamily. Second, the availability of sequenced genomes has revealed the diversity of biochemical solutions to similar chemical problems, because the pathway enzymes first discovered in model organisms are often not universally conserved [43]. For example, the tetrahydrofolate biosynthesis pathway and enzymes are not universal and alternate solutions are found for most steps, making this pathway and others like it a challenge for automatic annotation in many genomes [43].

Due to these limitations several metabolic pathways reconstructed from the annotated unigenes of M. dubia show gaps or missing genes. Consequently, comparative genomics, enzymatic, metabolomic, and structural analyses will be required to fill these pathway gaps [40, 44–46]. Despite such limitations, it was possible to completely reconstruct several KEGG pathways with the M. dubia transcriptome. The well represented pathways discovered in this study included L-ascorbic acid biosynthesis and recycling, phenylpropanoid biosynthesis, flavonoid biosynthesis, anthocyanin biosynthesis, pentose phosphate pathway, glutathione metabolism, plant pathogen interaction, biosynthesis of plant hormones, amino acid biosynthesis and degradation, and circadian rhythm (Additional file 1: Table S1 and Additional file 2: Figure S1). In conclusion, while transcriptomic analysis is not a substitute for detailed gene and pathway studies, it does provide a broad overview of the important metabolic processes from which to efficiently build hypotheses that can guide future detailed studies on improving our understanding of L-ascorbic acid metabolism and the accumulation of other bioactive phytochemicals in this plant species.

L-ascorbic acid biosynthesis and recycling

Plants can possess a total of five metabolic pathways for AsA biosynthesis. These metabolic pathways are the animal-like pathway [47], the myo-inositol pathway [48], the L-gulose pathway [49], the D-mannose/L-galactose pathway [11], and the uronic acid pathway [12]. All these pathways were identified in the fruit transcriptome of M. dubia (Fig. 5) and have also been documented in other plant species. For example, analysis of the Citrus sinensis fruit transcriptome indicated that genes from four of the five biosynthetic pathways (all but the animal-like pathway) are expressed [50], whereas expressed sequence tags from fruits and other tissues of four Actinidia species (A. arguta, A. chinensis, A. deliciosa, and A. eriantha) indicated that the myo-inositol, D-mannose/L-galactose, and uronic acid pathways are active [51]. In contrast, analyses of the fruit transcriptomes of Ziziphus jujuba, Myrica rubra, and Ananas comosus were only able to identify the D-mannose/L-galactose pathway as active [19, 52, 53]. From an evolutionary perspective, however, it is possible that many or even most of these metabolic pathways for AsA biosynthesis are conserved in plants, because AsA is one of the most abundant low molecular weight antioxidants of plants and plays an essential role in the detoxification of reactive oxygen species. In addition, AsA is important in plant development, hormone and light signaling, cell cycle, death, and cell expansion, pathogen responses, and as a cofactor for several key enzymes [13, 47, 54, 55]. Consequently, the differential activation of one or more, but not all five, metabolic pathways from the above examples via gene expression depends on several factors (e.g., organ/tissue type, development stage, physiological condition, environmental factors) and does not necessarily indicate their absence.

The L-gulose pathway is derived from the D-mannose/L-galactose pathway by action of the GDP-mannose-3′,5′-epimerase (E.C. 5.1.3.18) and evidence exists that it is active in plant cells, because both L-gulose and L-gulono-1,4-lactone serve as precursors of L-ascorbic acid biosynthesis [49], but molecular and biochemical studies of it are limited. To date, genic sequences coding for enzymes catalysing four consecutive biochemical reactions are unknown: GDP-L-gulose → L-gulose-1-phosphate → L-gulose → L-gulonic acid → L-gulono-1,4-lactone (or L-gulose → L-gulono-1,4-lactone) → L-ascorbic acid. Except for L-gulono-1,4-lactone dehydrogenase, enzyme activities for these biochemical reactions also were not tested. Enzymatic activity of L-gulono-1,4-lactone dehydrogenase has been observed in cytosolic and mitochondrial fractions of the leaves and fruit pulp and peel of M. dubia [56] and tubers of Solanum tuberosum [49]. In addition, the genome of Arabidopsis thaliana contains genes (i.e., At1g32300, At2g46740, At2g46750, At2g46760, At5g11540, and At5g56490) that are closely related to the rat L-gulono-1,4-lactone oxidase. Some of these genes could be coding for enzymes responsible for the conversion of L-gulono-1,4-lactone to AsA [49]. Also, one putative unigene, probably coding for this enzyme, was identified in our assembled transcriptome (E.C. 1.1.3.8).

Some of the biochemical reactions of the L-gulose pathway, however, can likely be catalysed by promiscuous enzymes; promiscuity is an inherent property of many enzymes catalysing analogous biochemical reactions [57]. Indeed, some enzymes of the D-mannose/L-galactose pathway catalysing similar reactions of the L-gulose pathway have shown promiscuity. For example, the enzyme GDP-L-galactose:hexose 1-phosphate guanyltransferase (E.C. 2.7.7.69) has a low Km value (10 μM and 4.4 μM), a high turnover rate kcat (64 s⁻¹ and 23 s⁻¹), and a similar specificity constant kcat/Km (6.3 × 10⁶ s⁻¹ M⁻¹ and 5.7 × 10⁶ s⁻¹ M⁻¹) with the GDP-L-galactose and GDP-D-glucose substrates, respectively [58]. Furthermore, the enzyme L-galactose dehydrogenase catalyses the oxidation of both L-galactose and L-gulose substrates in Spinacia oleracea [59]. Consequently, it is necessary to conduct additional research to further our understanding of these genes, enzymes and the contribution of these metabolic pathways to AsA biosynthesis and accumulation in fruits, other organs and tissues of M. dubia.

Based on the KEGG pathway assignments, we identified transcripts coding for all enzymes of the ascorbate-glutathione or “Foyer-Halliwell-Asada” pathway (Fig. 5). The enzymes of this metabolic pathway have been localized in several compartments of the plant cells, such as the cytosol, mitochondria, peroxisomes, and chloroplast [60, 61]. This distribution of the ascorbate-glutathione pathway components is attributable to its vital role, since this pathway is recognized to be a key player in H2O2 metabolism and AsA recycling [54]. AsA recycling requires a continuous supply of GSH and NADPH. The pathways supplying these reductant molecules lie outside of the AsA biosynthetic machinery. GSH is supplied by two processes. The first is de novo biosynthesis in two ATP-dependent reactions catalyzed by γ-glutamylcysteine synthetase and glutathione synthetase [62]. The second biochemical process is catalyzed by glutathione reductase (E.C. 1.8.1.7), which uses NADPH to reduce GSSG to GSH [54]. Moreover, there are various sources of the essential reductant NADPH. The first and principal source is the oxidative pentose phosphate pathway [63]. The second significant source includes L-malate:NADP oxidoreductase (E.C. 1.1.1.40), which catalyzes the oxidative decarboxylation of L-malate to yield pyruvate, CO2 and NADPH in the presence of a bivalent cation [64]. Finally, NADPH is generated in photosynthetic cells (i.e. immature fruit peel of M. dubia) primarily from the light reactions of photosynthesis [65]. Therefore, genetic manipulations that increase the availability of GSH and NADPH for AsA recycling, through up-regulation/over-expression of related genes identified here, could be promising approaches to increase the yield of AsA in M. dubia and other plant species.

Discovery of molecular markers

It is well-known that genic-SSR markers have numerous applications, such as functional genomics, association mapping, diversity analysis, genome mapping, transferability and comparative mapping, marker assisted selection breeding, and other applications [66]. Nevertheless, only eight genic-SSR markers have been developed for M. dubia until now [67, 68], limiting the applications previously mentioned. However, by mining the assembled fruit transcriptome of M. dubia in this research, it was possible to identify a large number of unigenes containing SSR motifs (primers were designed for 3,240 unigenes) that would be appropriate for developing a comprehensive set of genic-SSR markers, which will need experimental validation. In conclusion, the genic-SSR markers identified in the assembled transcriptome database represent a significant addition to the limited set of markers available in M. dubia and it will be feasible to conduct marker assisted gene mapping for important agronomical traits (e.g., L-ascorbic acid and anthocyanin accumulation, fruit size) or biological processes (e.g., seed germination, growth and development, disease resistance, etc.).

Our results regarding genic-SSR markers are largely similar to other plant studies, but differences do exist. First, the percentage of unigenes containing SSRs (9.0 % in M. dubia) is comparable to reports for this species from Brazil with 10.9 % [67] as well as for Ipomoea batatas with 7.3 % [32], Cajanus cajan with 7.7 % [69], Capsicum annuum with 7.7 % [15], and Sesamum indicum with 8.9 % [28]. Differences, however, existed when compared to Apium graveolens with 6.2 % [70] and the monocot Phoenix dactylifera with 16.0 % [71]. Second, regarding the distribution of the perfect repeat motif types, tri-nucleotide repeats have generally been observed to have the highest frequency in cereals and other plant species [71–73]. However, here, as in a previous study on M. dubia [67] and other plant species [14, 28, 69, 70, 74], the most abundant repeat motif type was di-nucleotide repeats, followed by tri-nucleotide repeats. Of the thirty motifs, (AG/TC)n showed the highest frequency (91.0 %), which is in agreement with other plant species [14, 28, 32, 71, 74]. As in monocot [74] and other dicot plants [28, 32, 75] the (AAG/TTC)n motif in M. dubia was the most abundant of the tri-nucleotide repeat motifs. This triplet codes for lysine, which is commonly found in the exons of plants [75]. This finding is consistent with Katti et al. [76] who showed that expansions of codon repeats corresponding to small hydrophilic amino acids are tolerated more, while strong selection pressures probably eliminate codon repeats encoding hydrophobic and basic amino acids.

In addition, in our dataset some unigenes containing genic-SSR were lacking functional annotations. These unidentified unigenes probably correspond to untranslated (UTR) regions. Several studies have shown that SSR frequency is high in the 5′ UTR regions of plant transcripts [77–80], suggesting that SSRs located in this genic region can potentially act as factors in regulating gene expression at the transcriptional or translational levels [78, 81]. Consequently, these insights are likely to play a significant role in selecting SSR loci to be used in molecular breeding programs of M. dubia.

Sequencing a pool of cDNA using next-generation sequencing technologies and appropriate mining software allows for rapid and inexpensive SNP discovery within genes in non-model plants without a reference genome. Our transcriptome dataset contained a large number of high quality SNPs (>23,000) and marks the highest number of SNP markers discovered to date from M. dubia using transcriptome sequencing. While the majority of SNPs were bi-allelic (>99.9 %), an insignificant fraction showed tri-allelic (0.072 %) polymorphisms. These results are in agreement with the diploid nature of the M. dubia genome [1]. Similar low levels of tri-allelic SNPs also were reported in other plant species such as Brassica napus with 0.029–0.06 % [82, 83] and Manihot esculenta with 0.52 % [84], whereas tri-allelic SNPs were not detected in Sesamum indicum [85] and Hevea brasiliensis [86]. Nevertheless, the switchgrass Panicum virgatum possesses a substantial number of tri-allelic SNPs (15 %), which is consistent with the polyploid condition of the genome of this species [87]. Although in principle any of the four nucleotide bases can be present at each position of a sequence, SNPs are frequently bi-allelic. One possible explanation is the low frequency of single nucleotide substitutions (5.0–30.0 × 10⁻⁹) at the nuclear genes of plants [88]. Consequently, the probability of two or three independent mutations occurring at a single position is very low. Another important cause for the prevalence of bi-allelic SNPs is attributable to a clear bias in the mutation mechanism that results in a prevalence of transitions over transversions (67 % vs 33 % in our data set). One probable explanation for this is the high frequency of spontaneous deamination of 5-methyl cytosine to thymine in the CpG dinucleotides [89].

Although SNPs are less polymorphic than SSR markers, they easily compensate for this drawback by being abundant and amenable to high- and ultra-high-throughput automation [90]. Consequently, this large collection of SNP markers could facilitate genetic applications in M. dubia such as genetic diversity and characterization, linkage mapping, high-density quantitative trait locus analysis, association studies, map-based cloning, marker-assisted plant breeding, and functional genomics.

Conclusions

This study describes the first next-generation sequencing effort and transcriptome annotation of a non-model Amazonian plant that is relevant for AsA production and other bioactive phytochemicals. Genes encoding key enzymes were successfully identified and metabolic pathways involved in biosynthesis of AsA, anthocyanins, and other metabolic pathways have been reconstructed. The identification of these genes and pathways is in agreement with the empirically observed capability of M. dubia to synthesize and accumulate AsA and other important molecules, and adds to our current knowledge of the molecular biology and biochemistry of their production in plants. By providing insights into the mechanisms underpinning these metabolic processes, these results can be used to direct efforts to genetically manipulate this organism in order to enhance the production of these bioactive phytochemicals.

The accumulation of AsA precursor and discovery of genes associated with their biosynthesis and metabolism in M. dubia is intriguing and worthy of further investigation. The sequences and pathways produced here present the genetic framework required for further studies. Quantitative transcriptomics in concert with studies of the genome, proteome, and metabolome under conditions that stimulate production and accumulation of AsA and their precursors are needed to provide a more comprehensive view of how these pathways for AsA metabolism are regulated and linked in this species.

Methods

Plant material

Unripe (60 days after anthesis) and ripe fruits (70 days after anthesis) were randomly collected from 10 different accessions (one plant per accession) from the M. dubia germplasm bank (03°57′17″S, 73°24′55″W) at the Instituto Nacional de Innovación Agraria of Peru, region Loreto. Established approximately 20 years ago, this germplasm bank consists of 43 representative accessions of genetic variability of M. dubia from the eight major river basins of the Loreto Region (Nanay, Itaya, Napo, Ucayali, Putumayo, Curaray, Tigre and Amazonas). Immediately after harvesting, samples were stored at −80°C until further use. A graphical representation of our work flow is provided in Additional file 7: Figure S2.

Total RNA isolation and cDNA synthesis

Total RNA was isolated from seeds and fruit pulp and peel from each of the 10 plants using the CTAB method, solvent extractions, and DNase treatment as described by Castro et al. [91]. RNA samples were chosen and pooled equally for cDNA library construction and sequencing if the OD ratio A260/A280 > 1.9, A260/A230 > 2.0, and the samples were not degraded as assessed by formaldehyde denaturing gel electrophoresis [92].

cDNA library construction and sequencing

Illumina sequencing was performed at Macrogen’s sequencing service according to the manufacturer’s instructions (Illumina Inc., San Diego, CA, USA). First, mRNA with a poly(A) tail was isolated from 20 μg of pooled total RNA using Sera-mag magnetic oligo (dT) beads (Illumina). To avoid priming bias, the purified mRNA was first fragmented into small pieces (100–400 bp) using divalent cations at 94°C for 5 minutes. With random hexamer primers (Illumina), the double-stranded cDNA was synthesized using the SuperScript double-stranded cDNA synthesis kit (Invitrogen, CA). The synthesized cDNA was subjected to end-repair and phosphorylation, and then the repaired cDNA fragments were 3′ adenylated with Klenow Exo- (3′ to 5′ exo minus, Illumina). Illumina paired-end adapters were ligated to the ends of these 3′-adenylated cDNA fragments. To select the proper templates for downstream enrichment, products from the ligation reaction were gel (2 % agarose) purified and excised. Fifteen cycles of PCR amplification were carried out to enrich the purified cDNA template using PCR primers PE 1.0 and 2.0 (Illumina) with phusion DNA polymerase. Finally, the cDNA library was constructed with a 200 bp insertion fragment.

SNPs were identified by first mapping the filtered reads to the final meta-assembly using BWA v0.7.10 [103] and then mpileup of SAMtools v1.1 [104] was used to detect SNP sites. Only SNPs with quality scores >20 and coverage depth >20 were labeled as high quality. The locations of the SNPs in unigenes were predicted with TransDeco-
After validation on an Agilent Technologies der (http://transdecoder.github.io/) and snpEff v3.1 [105] 2100 Bioanalyzer, the library was sequenced using an Illu- was used to predict the effects of SNPs on genes. mina HiSeq™ 2000 (Illumina Inc., San Diego, CA, USA). Additional files Data filtering and de novo assembly Prior to assembly raw sequencing reads were filtered Additional file 1: Table S1. KEEG Pathway list in transcriptome of M. and trimmed with Trimmomatic v0.32 [93] using the fol- dubia. (CSV 6 kb) lowing steps: (1) leading and trailing bases of low quality Additional file 2: Figure S1. KEEG Pathway Maps in transcriptome of or N bases were removed, (2) the 3′ end was cut and re- M. dubia. (PDF 7038 kb) moved if the quality score of a 4 bp wide sliding window Additional file 3: Table S2. Genic-SSR characteristics of M. dubia. (CSV 330 kb) dropped below 15, and (3) remaining reads less than 36 Additional file 4: Table S3. Primers characteristics for genic-SSR of bp and singletons were removed. M. dubia. (CSV 1694 kb) Cleaned reads were de novo assembled following the Additional file 5: Table S4. SNPs in unigenes of M. dubia. (XLSX 2653 kb) multiple k-mer approach of Melicher et al. [94]. Briefly, we Additional file 6: Table S5. Summary of SNPs effect in M. dubia. used Velvet v1.2.10 [36] and Oases v0.2.08 [37] to produce (CSV 576 bytes) assembles of k-mer lengths 21, 25, 29, 33, and 37. An add- Additional file 7: Figure S2. Flow chart of methods used. (PNG 495 kb) itional assembly was produced with Trinity v20140717 [38] using the default settings at a k-length of 25. The tran- Competing interests scripts from the 6 assemblies were then pooled and re- The authors declare that they have no competing interests. assembled using CAP3 [39] to produce a meta-assembly. 
Authors’ contributions JCC conceived the study, participated in the study design, obtained funds Functional annotation, and metabolic pathway for the research, coordinated activities, and participated in the preparation of assignments the manuscript. JDM, DR, MZ, and AB performed the bioinformatics analysis To elucidate the potential functions of gene transcripts and participated in the preparation of the manuscript. MC participated in the study design, supervised samples preparation, and helped to draft the we utilized the web tool FastAnnotator [95], which inte- manuscript. LAC and AEM participated in botanical sample collection, in grates the well-established annotation tools Blast2GO the isolation of total RNA and quality analysis for Illumina sequencing. [96], PRIAM [97], and RPS BLAST [98] to assign Gene SAI participated in the study design, supervised botanical samples collection, and helped to draft the manuscript. All authors read and Ontology (GO) terms, Enzyme Commission numbers approved the final manuscript. (EC numbers), and functional domains to query se- quences (cut-off Evalue ≤ 10−6). To determine metabolic Acknowledgements pathways, sequences assigned ECs from FastAnnotator This research was supported by grants from Universidad Nacional de la Amazonía Peruana. We also thank Dr. Jorge L. Marapara for his help with the were mapped to the Kyoto Encyclopedia of Genes and infrastructure and equipment of Unidad Especializada de Biotecnología and Genomes (KEGG) metabolic pathway database [99]. To Instituto Nacional de Innovación Agraria (INIA) - San Roque-Iquitos for access to further enrich the pathway annotation and to identify the germplasm collection of Myrciaria dubia. J. Dylan Maddox was supported by NSF International Fellowship OISE-1159178 during a portion of this research. 
the BRITE functional hierarchies, sequences were also submitted to the KEGG Automatic Annotation Server Author details 1 (KAAS) [100] with the single-directional best hit infor- Unidad Especializada de Biotecnología, Centro de Investigaciones de Recursos Naturales de la Amazonía (CIRNA), Universidad Nacional de la mation method selected. Amazonía Peruana (UNAP), Pasaje Los Paujiles S/N, San Juan Bautista, Iquitos, Perú. 2Círculo de Investigación en Plantas con Efecto en Salud (FONDECYT N ° 010–2014), Lima, Perú. 3Discovery of molecular markers Pritzker Laboratory for Molecular Systematics and Evolution, The Field Museum of Natural History, Chicago, IL, USA. Unigenes were mined for genic simple sequence repeats 4Laboratorio de Biotecnología y Bioenergética, Universidad Científica del (genic-SSR) with MSATCOMMANDER v1.0.8 [101] and Perú (UCP), Av. Abelardo Quiñones km 2.5, San Juan Bautista, Iquitos, Perú. 5 primers designed with the integrated Primer 3 Software Laboratorio de Bioinformática y Biología Molecular, Laboratorios de Investigación y Desarrollo (LID), Facultad de Ciencias, Universidad Peruana [102] using the default setting for both programs, except Cayetano Heredia (UPCH), Av. Honorio Delgado 430, San Martín de Porres, that only perfect repeats (i.e., di-, tri-, tetra-, penta-, hex- Lima, Perú. 6FARVET S.A.C. Carretera Panamericana Sur N° 766 Km 198.5, 7 anucleotides) were selected and mononucleotide repeats Chincha Alta, Ica, Perú. Department of Horticulture, Virginia Tech, Blacksburg, VA 24061, USA. 8Área de Conservación de Recursos and complex SSR types were excluded. Only those Fitogenéticos, Instituto Nacional de Innovación Agraria (INIA), Calle San genic-SSRs with ≥ 5 repeats were retained. Roque 209, Iquitos, Perú. Castro et al. BMC Genomics (2015) 16:997 Page 11 of 13 Received: 11 December 2014 Accepted: 17 November 2015 22. Hyun TK, Rim Y, Jang H-J, Kim CH, Park J, Kumar R, et al. 
De novo transcriptome sequencing of Momordica cochinchinensis to identify genes involved in the carotenoid biosynthesis. Plant Mol Biol. 2012;79(4–5):413–27. 23. Xie M, Huang Y, Zhang Y, Wang X, Yang H, Yu O, et al. Transcriptome References profiling of fruit development and maturation in Chinese white pear 1. Ribeiro da Costa I, Forni-Martins ER. Chromosome studies in species of (Pyrus bretschneideri Rehd). BMC Genomics. 2013;14:823. Eugenia, Myrciaria and Plinia (Myrtaceae) from south-eastern Brazil. Aust J 24. Li X, Sun H, Pei J, Dong Y, Wang F, Chen H, et al. De novo sequencing and Bot. 2006;54(4):409–15. comparative analysis of the blueberry transcriptome to discover putative 2. Ueda H, Kuroiwa E, Tachibana Y, Kawanishi K, Ayala F, Moriyasu M. Aldose genes related to antioxidants. Gene. 2012;511(1):54–61. reductase inhibitors from the leaves of Myrciaria dubia (H. B. & K.) McVaugh. 25. Que Y, Su Y, Guo J, Wu Q, Xu L. A global view of transcriptome dynamics Phytomedicine Int J Phytother Phytopharm. 2004;11(7–8):652–6. during Sporisorium scitamineum challenge in sugarcane by RNA-seq. 3. Zanatta CF, Cuevas E, Bobbio FO, Winterhalter P, Mercadante AZ. PloS One. 2014;9(8), e106476. Determination of anthocyanins from camu-camu (Myrciaria dubia) by 26. Wang Z, Hu H, Goertzen LR, McElroy JS, Dane F. Analysis of the Citrullus HPLC-PDA, HPLC-MS, and NMR. J Agric Food Chem. 2005;53(24):9531–5. colocynthis transcriptome during water deficit stress. PloS One. 2014;9(8), e104657. 4. Akachi T, Shiina Y, Kawaguchi T, Kawagishi H, Morita T, Sugiyama K. 27. Lai Z, Lin Y. Analysis of the global transcriptome of longan (Dimocarpus 1-methylmalate from camu-camu (Myrciaria dubia) suppressed longan Lour.) embryogenic callus using Illumina paired-end sequencing. D-galactosamine-induced liver injury in rats. Biosci Biotechnol Biochem. BMC Genomics. 2013;14:561. 2010;74(3):573–8. 28. Wei W, Qi X, Wang L, Zhang Y, Hua W, Li D, et al. Characterization of the 5. 
Fracassetti D, Costa C, Moulay L, Tomás-Barberán FA. Ellagic acid derivatives, sesame (Sesamum indicum L.) global transcriptome using Illumina paired- ellagitannins, proanthocyanidins and other phenolics, vitamin C and end sequencing and development of EST-SSR markers. BMC Genomics. antioxidant capacity of two powder products from camu-camu fruit 2011;12:451. (Myrciaria dubia). Food Chem. 2013;139(1–4):578–88. 29. Liu J, Mei D, Li Y, Huang S, Hu Q. Deep RNA-Seq to unlock the gene bank 6. Inoue T, Komoda H, Uchida T, Node K. Tropical fruit camu-camu of floral development in Sinapis arvensis. PloS One. 2014;9(9), e105775. (Myrciaria dubia) has anti-oxidative and anti-inflammatory properties. 30. Shi C-Y, Yang H, Wei C-L, Yu O, Zhang Z-Z, Jiang C-J, et al. Deep J Cardiol. 2008;52(2):127–32. sequencing of the Camellia sinensis transcriptome revealed candidate genes 7. Justi KC, Visentainer JV, de Souza N E, Matsushita M. Nutritional composition for major metabolic pathways of tea-specific compounds. BMC Genomics. and vitamin C stability in stored camu-camu (Myrciaria dubia) pulp. 2011;12:131. Arch Latinoam Nutr. 2000;50(4):405–8. 31. Ge X, Chen H, Wang H, Shi A, Liu K. De novo assembly and annotation of 8. Imán S, Bravo L, Sotero V, Oliva C. Contenido de vitamina C en frutos de Salvia splendens transcriptome using the Illumina platform. PloS One. camu camu Myrciaria dubia (H.B.K) Mc Vaugh, en cuatro estados de 2014;9(3), e87693. maduración, procedentes de la Colección de Germoplasma del INIA Loreto, 32. Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, et al. De novo assembly Perú. Sci Agropecu. 2011;2(3):123–30. and characterization of root transcriptome using Illumina paired-end 9. Klimczak I, Małecka M, Szlachta M, Gliszczyńska-Świgło A. Effect of storage sequencing and development of cSSR markers in sweet potato on the content of polyphenols, vitamin C and the antioxidant activity of (Ipomoea batatas). BMC Genomics. 2010;11:726. orange juices. J Food Compos Anal. 
2007;20(3–4):313–22. 33. Liu X, Lu Y, Yuan Y, Liu S, Guan C, Chen S, et al. De novo transcriptome of 10. Castro JC, Gutiérrez F, Acuña C, Cerdeira LA, Tapullima A, Marianela C, et al. Brassica juncea seed coat and identification of genes for the biosynthesis of Variación del contenido de vitamina C y antocianinas en Myrciaria dubia flavonoids. PloS One. 2013;8(8), e71110. “camu-camu.”. Rev Soc Quím Perú. 2013;79(4):319–30. 34. Li H, Dong Y, Yang J, Liu X, Wang Y, Yao N, et al. De novo transcriptome of 11. Wheeler GL, Jones MA, Smirnoff N. The biosynthetic pathway of vitamin C safflower and the identification of putative genes for oleosin and the in higher plants. Nature. 1998;393(6683):365–9. biosynthesis of flavonoids. PloS One. 2012;7(2), e30987. 12. Valpuesta V, Botella MA. Biosynthesis of L-ascorbic acid in plants: new 35. Wang S, Wang X, He Q, Liu X, Xu W, Li L, et al. Transcriptome analysis of the pathways for an old antioxidant. Trends Plant Sci. 2004;9(12):573–7. roots at early and late seedling stages using Illumina paired-end 13. Gallie DR. L-ascorbic acid: a multifunctional molecule supporting plant sequencing and development of EST-SSR markers in radish. Plant Cell Rep. growth and development. Scientifica. 2013;2013:795964. 2012;31(8):1437–47. 14. Li D, Deng Z, Qin B, Liu X, Men Z. De novo assembly and characterization 36. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly of bark transcriptome using Illumina sequencing and development of using de Bruijn graphs. Genome Res. 2008;18(5):821–9. EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.). 37. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA- BMC Genomics. 2012;13:192. seq assembly across the dynamic range of expression levels. Bioinformatics. 15. Ashrafi H, Hill T, Stoffel K, Kozik A, Yao J, Chin-Wo SR, et al. De novo 2012;28(8):1086–92. assembly of the pepper transcriptome (Capsicum annuum): a benchmark for 38. 
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. in silico discovery of SNPs. SSRs and candidate genes. BMC Genomics. Trinity: reconstructing a full-length transcriptome without a genome from 2012;13:571. RNA-Seq data. Nat Biotechnol. 2011;29(7):644–52. 16. Natarajan P, Parani M. De novo assembly and transcriptome analysis of five 39. Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome major tissues of Jatropha curcas L. using GS FLX titanium platform of 454 Res. 1999;9(9):868–77. pyrosequencing. BMC Genomics. 2011;12:191. 40. Hanson AD, Pribat A, Waller JC, de Crécy-Lagard V. ‘Unknown’ proteins and 17. Li C, Wang Y, Huang X, Li J, Wang H, Li J. De novo assembly and ‘orphan’ enzymes: the missing half of the engineering parts list – and how characterization of fruit transcriptome in Litchi chinensis Sonn and analysis to find it. Biochem J. 2010;425(1):1–11. of differentially regulated genes in fruit in response to shading. 41. Sorokina M, Stam M, Médigue C, Lespinet O, Vallenet D. Profiling the BMC Genomics. 2013;14:552. orphan enzymes. Biol Direct. 2014;9:10. 18. Guo X, Li Y, Li C, Luo H, Wang L, Qian J, et al. Analysis of the Dendrobium 42. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public officinale transcriptome reveals putative alkaloid biosynthetic genes and databases: misannotation of molecular function in enzyme superfamilies. genetic markers. Gene. 2013;527(1):131–8. PLoS Comput Biol. 2009;5(12), e1000605. 19. Ong WD, Voo L-YC, Kumar VS. De novo assembly, characterization and 43. De Crécy-Lagard V. Variations in metabolic pathways create challenges for functional annotation of pineapple fruit transcriptome through massively automated metabolic reconstructions: examples from the tetrahydrofolate parallel sequencing. PloS One. 2012;7(10), e46937. synthesis pathway. Comput Struct Biotechnol J. 2014;10(16):41–50. 20. Martínez-López LA, Ochoa-Alejo N, Martínez O. Dynamics of the chili 44. 
Zhao S, Kumar R, Sakai A, Vetting MW, Wood BM, Brown S, et al. Discovery pepper transcriptome during fruit development. BMC Genomics. of new enzymes and metabolic pathways by using structure and genome 2014;15:143. context. Nature. 2013;502(7473):698–702. 21. Wu H, Jia H, Ma X, Wang S, Yao Q, Xu W, et al. Transcriptome and 45. Kumar R, Zhao S, Vetting MW, Wood BM, Sakai A, Cho K, et al. Prediction proteomic analysis of mango (Mangifera indica Linn) fruits. J Proteomics. and biochemical demonstration of a Catabolic pathway for the 2014;105:19–30. osmoprotectant proline betaine. MBio. 2014;5(1):e00933–13. Castro et al. BMC Genomics (2015) 16:997 Page 12 of 13 46. Bradbury LMT, Niehaus TD, Hanson AD. Comparative genomics approaches 70. Fu N, Wang Q, Shen H-L. De novo assembly, gene annotation and marker to understanding and manipulating plant metabolism. Curr Opin development using Illumina paired-end transcriptome sequences in celery Biotechnol. 2013;24(2):278–84. (Apium graveolens L.). PloS One. 2013;8(2):e57686. 47. Smirnoff N, Wheeler GL. Ascorbic acid in plants: biosynthesis and function. 71. Zhao Y, Williams R, Prakash CS, He G. Identification and characterization of Crit Rev Biochem Mol Biol. 2000;35(4):291–314. gene-based SSR markers in date palm (Phoenix dactylifera L.). BMC Plant Biol. 48. Lorence A, Chevone BI, Mendes P, Nessler CL. myo-inositol oxygenase offers 2012;12:237. a possible entry point into plant ascorbate biosynthesis. Plant Physiol. 2004; 72. Kantety RV, La Rota M, Matthews DE, Sorrells ME. Data mining for simple 134(3):1200–5. sequence repeats in expressed sequence tags from barley, maize, rice, 49. Wolucka BA, Van Montagu M. GDP-mannose 3′,5′-epimerase forms GDP-L- sorghum and wheat. Plant Mol Biol. 2002;48(5–6):501–10. gulose, a putative intermediate for the de novo biosynthesis of vitamin C in 73. Varshney RK, Thiel T, Stein N, Langridge P, Graner A. In silico analysis on plants. J Biol Chem. 2003;278(48):47483–90. 
frequency and distribution of microsatellites in ESTs of some cereal species. 50. Xu Q, Chen L-L, Ruan X, Chen D, Zhu A, Chen C, et al. The draft genome of Cell Mol Biol Lett. 2002;7(2A):537–46. sweet orange (Citrus sinensis). Nat Genet. 2013;45(1):59–66. 74. Ting N-C, Zaki NM, Rosli R, Low E-TL, Ithnin M, Cheah S-C, et al. SSR mining 51. Crowhurst RN, Gleave AP, MacRae EA, Ampomah-Dwamena C, Atkinson RG, in oil palm EST database: application in oil palm germplasm diversity Beuning LL, et al. Analysis of expressed sequence tags from Actinidia: studies. J Genet. 2010;89(2):135–45. applications of a cross species EST database for gene discovery in the areas 75. Li Y-C, Korol AB, Fahima T, Nevo E. Microsatellites within genes: structure, of flavor, health, color and ripening. BMC Genomics. 2008;9:351. function, and evolution. Mol Biol Evol. 2004;21(6):991–1007. 52. Li Y, Xu C, Lin X, Cui B, Wu R, Pang X. De novo assembly and 76. Katti MV, Ranjekar PK, Gupta VS. Differential distribution of simple sequence characterization of the fruit transcriptome of Chinese Jujube (Ziziphus jujuba repeats in eukaryotic genome sequences. Mol Biol Evol. 2001;18(7):1161–7. Mill.) Using 454 pyrosequencing and the development of novel tri- 77. Morgante M, Hanafey M, Powell W. Microsatellites are preferentially nucleotide SSR markers. PLoS One. 2014;9(9):e106438. associated with nonrepetitive DNA in plant genomes. Nat Genet. 53. Feng C, Chen M, Xu C, Bai L, Yin X, Li X, et al. Transcriptomic analysis of 2002;30(2):194–200. Chinese bayberry (Myrica rubra) fruit development and ripening using 78. Fujimori S, Washio T, Higo K, Ohtomo Y, Murakami K, Matsubara K, et al. A RNA-Seq. BMC Genomics. 2012;13:19. novel feature of microsatellites in plants: a distribution gradient along the 54. Foyer C, Graham N. Ascorbate and glutathione: the heart of the Redox Hub. direction of transcription. FEBS Lett. 2003;554(1–2):17–22. Plant Physiol. 2011;155(1):2–18. 79. 
Zhang L, Yuan D, Yu S, Li Z, Cao Y, Miao Z, et al. Preference of simple 55. Gest N, Gautier H, Stevens R. Ascorbate as seen through plant evolution: the sequence repeats in coding and non-coding regions of Arabidopsis thaliana. rise of a successful molecule? J Exp Bot. 2013;64(1):33–53. Bioinformatics. 2004;20(7):1081–6. 56. Tapullima A. Biochemical characterization of L-galactono/L-gulono-1,4- 80. Grover A, Aishwarya V, Sharma PC. Biased distribution of microsatellite lactone dehydrogenase from Myrciaria dubia (Kunth) McVaugh “camu camu. motifs in the rice genome. Mol Genet Genomics. 2007;277(5):469–80. ”. Theses Bachelor of Pharmaceutical Chemistry: Universidad Nacional de la 81. Zhao Z, Guo C, Sutharzan S, Li P, Echt CS, Zhang J, et al. Genome-wide Amazonía Peruana; 2013. analysis of tandem repeats in plants and green algae. G3 (Bethesda). 57. Babtie A, Tokuriki N, Hollfelder F. What makes an enzyme promiscuous? 2014;4(1):67–78. Curr Opin Chem Biol. 2010;14(2):200–7. 82. Huang S, Deng L, Guan M, Li J, Lu K, Wang H, et al. Identification of 58. Linster CL, Gomez TA, Christensen KC, Adler LN, Young BD, Brenner C, et al. genome-wide single nucleotide polymorphisms in allopolyploid crop Arabidopsis VTC2 encodes a GDP-L-galactose phosphorylase, the last Brassica napus. BMC Genomics. 2013;14:717. unknown enzyme in the Smirnoff-Wheeler pathway to ascorbic acid in 83. Dalton-Morgan J, Hayward A, Alamery S, Tollenaere R, Mason AS, Campbell E, plants. J Biol Chem. 2007;282(26):18879–85. et al. A high-throughput SNP array in the amphidiploid species Brassica napus 59. Mieda T, Yabuta Y, Rapolu M, Motoki T, Takeda T, Yoshimura K, et al. shows diversity in resistance genes. Funct Integr Genomics. 2014;14(4):643–55. Feedback inhibition of spinach l-galactose dehydrogenase by l-ascorbate. 84. Pootakham W, Shearman JR, Ruang-Areerate P, Sonthirod C, Sangsrakru D, Plant Cell Physiol. 2004;45(9):1271–9. Jomchai N, et al. Large-scale SNP discovery through RNA sequencing and 60. 
Palma JM, Jiménez A, Sandalio LM, Corpas FJ, Lundqvist M, Gómez M, SNP genotyping by targeted enrichment sequencing in Cassava et al. Antioxidative enzymes from chloroplasts, mitochondria, and (Manihot esculenta Crantz). PloS One. 2014;9(12), e116028. peroxisomes during leaf senescence of nodulated pea plants. J Exp Bot. 85. Wei L, Miao H, Li C, Duan Y, Niu J, Zhang T, et al. Development of SNP and 2006;57(8):1747–58. InDel markers via de novo transcriptome assembly in Sesamum indicum L. 61. Locato V, de Pinto MC, De Gara L. Different involvement of the Mol Breed. 2014;34(4):2205–17. mitochondrial, plastidial and cytosolic ascorbate-glutathione redox enzymes 86. Salgado LR, Koop DM, Pinheiro DG, Rivallan R, Le Guen V, Nicolás MF, et al. in heat shock responses. Physiol Plant. 2009;135(3):296–306. De novo transcriptome analysis of Hevea brasiliensis tissues by RNA-seq and 62. Noctor G, Arisi A-CM, Jouanin L, Kunert KJ, Rennenberg H, Foyer CH. screening for molecular markers. BMC Genomics. 2014;15:236. Glutathione: biosynthesis, metabolism and relationship to stress tolerance 87. Ersoz ES, Wright MH, Pangilinan JL, Sheehan MJ, Tobias C, Casler MD, et al. explored in transformed plants. J Exp Bot. 1998;49(321):623–47. SNP discovery with EST and NextGen sequencing in switchgrass 63. Kruger NJ, von Schaewen A. The oxidative pentose phosphate pathway: (Panicum virgatum L.). PloS One. 2012;7(9):e44112. structure and organisation. Curr Opin Plant Biol. 2003;6(3):236–46. 88. Wolfe KH, Li WH, Sharp PM. Rates of nucleotide substitution vary greatly 64. Drincovich MF, Casati P, Andreo CS. NADP-malic enzyme from plants: a among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad ubiquitous enzyme involved in different metabolic pathways. FEBS Lett. Sci U S A. 1987;84(24):9054–8. 2001;490(1–2):1–6. 89. Ehrlich M, Zhang X-Y, Inamdar NM. Spontaneous deamination of cytosine 65. Kramer DM, Avenson TJ, Edwards GE. 
Dynamic flexibility in the light and 5-methylcytosine residues in DNA and replacement of 5- reactions of photosynthesis governed by both electron and proton transfer methylcytosine residues with cytosine residues. Mutat Res Genet Toxicol. reactions. Trends Plant Sci. 2004;9(7):349–57. 1990;238(3):277–86. 66. Varshney RK, Graner A, Sorrells ME. Genic microsatellite markers in plants: 90. Mammadov J, Aggarwal R, Buyyarapu R, Kumpatla S. SNP markers and their features and applications. Trends Biotechnol. 2005;23(1):48–55. impact on plant breeding. Int J Plant Genomics. 2012;2012:728398. 67. Rojas S, Rodrigues D, Lima M, Fhilo S. Desenvolvimento e mapeamento de 91. Gómez JCC, Reátegui ADCE, Flores JT, Saavedra RR, Ruiz MC, Correa SAI. microssatélites gênicos (EST-SSRs) de camu-camu (Myrciaria dubia [H.B.K.] Isolation of high-quality total RNA from leaves of Myrciaria dubia “camu McVaug. Rev Corpoica. 2008;9(1):1421. camu.”. Prep Biochem Biotechnol. 2013;43(6):527–38. 68. Rojas S, Yuyama K, Clement C, Ossamu E. Diversidade genética em acessos 92. Sambrook J, Fritsch EF, Maniatis T. Molecular cloning: a laboratory manual. do banco de germoplasma de camu-camu (Myrciaria dubia [H.B.K.] Second: Cold Spring Harbor Laboratory Press; 1989. McVaugh) do INPA usando marcadores microssatélites (EST-SSR). Rev 93. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina Corpoica. 2011;12(1):51–64. sequence data. Bioinformatics. 2014;30(15):2114–20. 69. Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, et al. 94. Melicher D, Torson AS, Dworkin I, Bowsher JH. A pipeline for the de novo Development of genic-SSR markers by deep transcriptome sequencing in assembly of the Themira biloba (Sepsidae: Diptera) transcriptome using a pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biol. 2011;11:17. multiple k-mer length approach. BMC Genomics. 2014;15:188. Castro et al. BMC Genomics (2015) 16:997 Page 13 of 13 95. 
Chen T-W, Gan R-CR, Wu TH, Huang P-J, Lee C-Y, Chen Y-YM, et al. FastAnnotator–an efficient transcript annotation web tool. BMC Genomics. 2012;13 Suppl 7. 96. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21(18):3674–6. 97. Claudel-Renard C, Chevalet C, Faraut T, Kahn D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res. 2003;31(22):6633–9. 98. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res. 2013;41(Database issue):D348–52. 99. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. 100. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35(Web Server issue):W182–5. 101. Faircloth BC. msatcommander: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol Ecol Resour. 2008;8(1):92–4. 102. Rozen S, Skaletsky H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000;132:365–86. 103. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. 104. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. 1000 genome project data processing subgroup: the sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. 105. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92. 
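Illustrative note on the genic-SSR criteria. The selection rules stated under "Discovery of molecular markers" (perfect di- to hexanucleotide repeats, mononucleotide repeats excluded, at least 5 repeats) can be sketched with a regular expression. This is a minimal sketch and not the MSATCOMMANDER implementation; the function name `find_perfect_ssrs` and the example sequence are hypothetical and chosen only for illustration.

```python
import re


def find_perfect_ssrs(sequence, min_repeats=5):
    """Return (motif, repeat_count, start) for perfect di- to hexanucleotide SSRs.

    Assumption: a simple left-to-right regex scan, which is only an
    approximation of what a dedicated SSR miner does.
    """
    # A 2-6 base motif (lazy, so the shortest repeating unit is preferred)
    # followed by at least (min_repeats - 1) further identical copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # A motif such as "AA" is really a mononucleotide run; exclude it.
            continue
        repeats = len(match.group(0)) // len(motif)
        hits.append((motif, repeats, match.start()))
    return hits


# Hypothetical unigene fragment: (AG)6, (CTT)5 and a poly-A run of 12 bases.
unigene = "GGC" + "AG" * 6 + "CCA" + "CTT" * 5 + "G" + "A" * 12
ssrs = find_perfect_ssrs(unigene)
for motif, repeats, start in ssrs:
    print(f"({motif}){repeats} at position {start}")
```

The lazy quantifier makes the scan report the shortest repeating unit, so a run such as ATATATATAT is counted as a dinucleotide repeat rather than as a longer compound motif.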
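Illustrative note on SNP filtering and classification. The text labels SNPs as high quality when the quality score is >20 and the coverage depth is >20, and it distinguishes bi-allelic from tri-allelic sites and transitions from transversions. The sketch below shows one way those rules could be applied to already-parsed SNP records. The record layout (`ref`, `alts`, `qual`, `depth`) and the example values are assumptions made for illustration; they are not the SAMtools or snpEff output format, and the records are not data from the study.

```python
from collections import Counter

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_high_quality(snp, min_qual=20, min_depth=20):
    """Thresholds as stated in the text: quality > 20 and depth > 20."""
    return snp["qual"] > min_qual and snp["depth"] > min_depth


def classify(snp):
    """Label a SNP as multi-allelic, a transition, or a transversion."""
    alleles = {snp["ref"], *snp["alts"]}
    if len(alleles) > 2:
        return "multi-allelic"
    return "transition" if frozenset(alleles) in TRANSITIONS else "transversion"


# Hypothetical records, for illustration only.
snps = [
    {"ref": "A", "alts": ["G"], "qual": 50.0, "depth": 35},
    {"ref": "C", "alts": ["T"], "qual": 99.0, "depth": 120},
    {"ref": "A", "alts": ["C"], "qual": 30.0, "depth": 22},
    {"ref": "G", "alts": ["A", "T"], "qual": 45.0, "depth": 60},
    {"ref": "T", "alts": ["C"], "qual": 18.0, "depth": 80},  # fails quality
    {"ref": "G", "alts": ["C"], "qual": 60.0, "depth": 20},  # fails depth
]

high_quality = [snp for snp in snps if is_high_quality(snp)]
counts = Counter(classify(snp) for snp in high_quality)
print(len(high_quality), dict(counts))
```

Both thresholds are strict inequalities, so a site with depth exactly 20 is not retained; this follows the ">20" wording in the text.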