DGTF: A Database of Grape Transcription Factors

in Journal of the American Society for Horticultural Science

Abstract

The database of grape transcription factors (DGTF) is a plant transcription factor (TF) database comprehensively collecting and annotating grape (Vitis L.) TF. The DGTF contains 1423 putative grape TF in 57 families. These TF were identified from the predicted wine grape (Vitis vinifera L.) proteins from the grape genome sequencing project by means of a domain search. The DGTF provides detailed annotations for individual members of each TF family, including sequence feature, domain architecture, expression information, and orthologs in other plants. Cross-links to other public databases make its annotations more extensive. In addition, some other transcriptional regulators were also included in the DGTF. It contains 202 transcriptional regulators in 10 families.

Transcription factors (TF) are identified by their affinity for specific motifs in promoters, upstream regulatory elements, or enhancer regions of target genes (Riechmann et al., 2000). These factors bind specifically to their DNA-binding sites near target genes and then activate or repress gene transcription (Zhang, 2003). It is essential to identify and characterize TF on a genome-wide level to understand their biological function and to explore the mechanisms of transcriptional regulation. Recently, some databases of eukaryotic transcription factors have become available on the web. TRANSFAC (Matys et al., 2003), DBD (Wilson et al., 2008), PlnTFDB (Riaño-Pachón et al., 2007), AGRIS (Palaniswamy et al., 2006), DATF (Guo et al., 2005), DRTF (Guo et al., 2006), and DPTF (Zhu et al., 2007) have provided some comprehensive information on TF.

Grape is one of the most important horticultural crops in the world, being used for the production of wine and juice, and as fresh and dried fruit (Tinlot and Rousseau, 1993). Therefore, it is necessary to develop genomic tools to accelerate the acquisition of knowledge about its important agronomic characteristics such as resistance to diseases, tolerance to abiotic stress, and maturation and quality of the fruit. Given the importance of TF in the life cycle of plants, identification and annotation of the TF in grape will improve the understanding of these agronomic characteristics at the level of gene expression and regulation. With the completion of the grape genome sequence (Jaillon et al., 2007), the entire complement of genes coding for TF can be identified and described. A database of grape transcription factors (DGTF) will provide a resource for researchers to explore the expression and function of TF of grape.

Materials and Methods

Source datasets.

We downloaded 30,434 wine grape mRNAs (Vitis_vinifera_mRNA_v1.fa), the corresponding protein sequences (Vitis_vinifera_peptide_v1.fa), and the gene annotations in general feature format (GFF; Wellcome Trust Sanger Institute, 2007) (Vitis_vinifera_annotation_v1.gff) from Genoscope (Center National de Séquençage, Evry, France).

Data processing.

TF can be identified and grouped into families based on their DNA-binding domains (Riaño-Pachón et al., 2007; Riechmann et al., 2000). In some families, a TF contains a single domain, which is sufficient to assign its membership. In other families, however, a TF may contain more than one DNA-binding domain; likewise, some domains are shared by different TF families. To identify and classify TF correctly, we constructed a rule for the identification and classification of each TF based on the literature (Riaño-Pachón et al., 2007; Riechmann et al., 2000) and on the grape-specific combination of domains in each grape TF (data not shown). The rule is depicted as a graph on the DGTF web site (for details, see the DGTF Help page). The pipeline we developed for the identification, classification, and annotation of TF is shown in Fig. 1. In the first step, we collected hidden Markov models (HMM) of domains that occur in TF from the Pfam database (version 22.0; Finn et al., 2006). For families without a DNA-binding domain HMM available in Pfam, new HMM were created in-house based on alignments of TF from Arabidopsis [Arabidopsis thaliana (L.) Heynh.]. These HMM were then searched against the 30,434 predicted proteins from Genoscope using the hmmsearch program (Eddy, 1998). An E-value cut-off of 0.01 was used in the search, and all significant hits were kept. For the NOZZLE and SAP (STERILE APETALA) families, each of which has only a single member in Arabidopsis, a BLAST (Altschul et al., 1997) search was run for the homolog in grape, and the E-value cut-offs were inspected. The rule was then implemented in a Perl script to classify proteins into different families.
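The classification step described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Perl script: the rule table, the family and domain names in it, and the function names are assumptions chosen for the example, and the actual DGTF rules are those shown on the DGTF Help page. Only the E-value cut-off of 0.01 comes from the text; treating a hit exactly at the cut-off as significant is also an assumption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.01  # cut-off stated in the text

# Illustrative rules only (assumption): each family lists domains that must be
# present and domains that must be absent. This is NOT the DGTF rule set.
FAMILY_RULES = {
    "RAV": {"required": {"AP2", "B3"}, "forbidden": set()},
    "AP2-EREBP": {"required": {"AP2"}, "forbidden": {"B3"}},
    "ABI3-VP1": {"required": {"B3"}, "forbidden": {"AP2"}},
}


def significant_domains(hits, cutoff=E_VALUE_CUTOFF):
    """Map each protein ID to the set of domains hit at or below the cut-off.

    `hits` is an iterable of (protein_id, domain_name, e_value) tuples, for
    example parsed from domain search output.
    """
    domains = defaultdict(set)
    for protein_id, domain, e_value in hits:
        if e_value <= cutoff:
            domains[protein_id].add(domain)
    return dict(domains)


def classify(domains_by_protein, rules=FAMILY_RULES):
    """Assign a family to every protein that satisfies exactly one rule."""
    families = {}
    for protein_id, domains in domains_by_protein.items():
        matches = [
            name
            for name, rule in rules.items()
            if rule["required"] <= domains and not (rule["forbidden"] & domains)
        ]
        if len(matches) == 1:
            families[protein_id] = matches[0]
    return families
```

Listing forbidden domains alongside required ones is one way to handle domains shared between families: a protein carrying both illustrative domains matches only the combined rule, and proteins that match no rule or several rules are left unassigned for manual inspection.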

Fig. 1.

Pipeline for the identification, classification, and annotation of transcription factors (TF). The pipeline starts with the complete collection of predicted proteins from Genoscope (Center National de Séquençage, Evry, France). Hidden Markov models (HMM) of DNA-binding domains are collected or created. These HMM are then used to search against all of the predicted proteins by using the hmmsearch program, and all significant hits are kept. A Perl script produces a list of putative TF grouped into different families according to a rule we developed for family classification. A BLAST search against expressed sequence tags (EST) of UniGene (National Center for Biotechnology Information, Bethesda, MD) is performed to obtain the UniGene entry corresponding to each TF. Finally, domain structures are also identified and annotated by hmmpfam, and orthologs of each TF in other plants are detected using best-reciprocal BLAST hits.

Citation: Journal of the American Society for Horticultural Science J. Amer. Soc. Hort. Sci. 133, 3; 10.21273/JASHS.133.3.459

To obtain the UniGene (National Center for Biotechnology Information, Bethesda, MD) entry corresponding to each TF, a BLAST search against expressed sequence tags (EST) of UniGene was performed (E-value < 10^-10, identity > 90%). The accession numbers of matched EST and the UniGene ID for each TF were recorded. For each TF, links to various external public sequence databases, Pfam, PlnTFDB, SIMAP (Rattei et al., 2008), and NCBI (National Center for Biotechnology Information, Bethesda, MD; Wheeler et al., 2006) were collected. GFF was used for drawing gene structure. Domain structures were also identified and annotated by hmmpfam (Eddy, 1998). Orthologs of each TF in Arabidopsis, rice (Oryza sativa L.), and poplar (Populus trichocarpa Torr. & Gray) were detected using best-reciprocal BLAST hits. In addition, the multiple sequence alignment of the DNA-binding domains of TF in each family was created by CLUSTAL W (Thompson et al., 1994). The neighbor-joining phylogenetic tree for each family was also constructed based on the alignment of predicted amino acid sequences.
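The two BLAST-based steps in this paragraph can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the tuple layout of the hits are hypothetical, ranking hits by bit score is an assumption (the text does not say how the best hit was chosen), and only the EST thresholds (E-value below 10^-10, identity above 90%) come from the text.

```python
def passes_est_filter(e_value, percent_identity):
    """Apply the EST match thresholds given in the text."""
    return e_value < 1e-10 and percent_identity > 90.0


def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query.

    `hits` is an iterable of (query_id, subject_id, bit_score) tuples.
    On a tie, the hit seen first is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Pair sequences that are each other's best hit in both search directions.

    `forward_hits` come from searching species A against species B, and
    `reverse_hits` from searching species B against species A.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {q: s for q, s in forward.items() if reverse.get(s) == q}
```

A grape protein is reported as having an ortholog only when the best hit of its best hit points back to it, so a protein whose top match prefers a different grape protein is left unpaired.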

In addition to TF, some other types of transcriptional regulators were also included in the DGTF. These transcriptional regulators contain binding domains for: ARID (A-T rich interaction domain), HMG (high mobility group), MBF1 (multiprotein bridging factor 1), SNF2 (named after the Saccharomyces cerevisiae Hansen protein SNF2), Aux/IAA, Jumonji, PHD (plant homeodomain), DDT (DNA-binding homeobox and different TF), LUG (LEUNIG), and SET [named after three Drosophila melanogaster Meigen genes involved in epigenetic processes, Su(var), E(z) and trithorax].

Results and Discussion

Identification and classification of putative grape TF.

Using the pipeline we developed, 1423 putative TF in grape were identified and classified into 57 families. These putative TF were identified from 30,434 proteins that were predicted by the grape genome sequencing project. Several resources were used to build the gene models automatically with GAZE (Jaillon et al., 2007). Although our pipeline can improve the reliability and accuracy of gene prediction, some proteins may be annotated incorrectly. Therefore, users should validate any TF from our database that are going to be used in further research. In addition, 202 other transcriptional regulators were identified and classified into 10 families.

Annotation of putative grape TF.

A UniGene entry is a set of transcript sequences that appear to come from the same transcription locus, together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location (Wheeler et al., 2006). In this study, 55% of the TF matched UniGene clusters. The corresponding UniGene entry for each TF was stored as annotation information, which may provide valuable expression information for further analysis of TF. A similarity search found orthologs in Arabidopsis, rice, and poplar for 98% of the 1423 grape TF.

Implementation and user interface.

The DGTF allows users to start their data-mining by browsing a list of families (Cai, 2008). Clicking on a family name shows a summary page for the family, including the family description extracted from the literature, the list of protein names, the multiple sequence alignment of the DNA-binding domains, and the neighbor-joining phylogenetic tree. Detailed information for each TF can be accessed by clicking the protein name or by entering the protein name into a search form at the top of the page. Alternatively, users can search the DGTF by running a BLAST search against the sequences in the DGTF.

The DGTF provides detailed information on individual TF, including chromosomal location, predicted molecular weight and pI (isoelectric point), gene structure, gene feature, corresponding UniGene, and expression profile. Detailed information on the EST in each UniGene entry is also available. For each TF, the information provided in the DGTF is linked to various external public sequence databases: Pfam, PlnTFDB, SIMAP, NCBI, and the grape genome browser in Genoscope. All the coding sequences (CDS), protein sequences, and GFF can be downloaded through the DGTF website for further analysis.
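As an example of the kind of further analysis the downloadable GFF supports, the sketch below collects the coordinates of one feature type per gene, which is the information needed to draw a gene structure. It is a minimal Python illustration: the function names are hypothetical, and grouping by the raw text of the ninth (attributes) column is an assumption, because the layout of that column in the Genoscope file is not described in the text. The nine tab-separated columns and the 1-based inclusive coordinates follow the general GFF convention.

```python
from collections import defaultdict


def parse_gff_line(line):
    """Parse one GFF line into a dict, or return None for lines to skip.

    Comment lines, blank lines, and lines without exactly nine tab-separated
    columns are skipped in this sketch.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # GFF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def features_by_attribute(lines, feature_type="CDS"):
    """Group sorted (start, end) pairs of one feature type by attribute text."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_gff_line(line)
        if record is not None and record["type"] == feature_type:
            grouped[record["attributes"]].append((record["start"], record["end"]))
    return {key: sorted(spans) for key, spans in grouped.items()}


def total_length(spans):
    """Sum the lengths of inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in spans)
```

Applied to the lines of a GFF file, `features_by_attribute` returns one sorted list of intervals per gene label, and `total_length` gives the combined length of those intervals.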

Future plans.

The DGTF is the first database that comprehensively collects and annotates grape TF based on genome-wide data. It will be a useful resource for research on grape transcriptional regulation. As more data and information on the genome sequence become available, we will maintain and update the DGTF regularly. Furthermore, as soon as new plant genomes become available, the methods described in this study will be applied to them.

Literature Cited

  • Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

  • Cai, B. 2008. The database of grape transcription factors. List of transcription factor families. 15 Mar. 2008. <http://www.yaolab.sh.cn/dgtf.html>.

  • Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755-763.

  • Finn, R.D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S.R., Sonnhammer, E.L.L., Bateman, A. 2006. Pfam: Clans, web tools and services. Nucleic Acids Res. 34:247-251.

  • Guo, A., He, K., Liu, D., Bai, S., Gu, X., Wei, L., Luo, J. 2005. DATF: A database of Arabidopsis transcription factors. Bioinformatics 21:2568-2569.

  • Guo, A., He, K., Liu, D., Bai, S., Gu, X., Wei, L., Luo, J. 2006. DRTF: A database of rice transcription factors. Bioinformatics 22:1286-1287.

  • Jaillon, O., Aury, J.M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., Poulain, J., Bruyère, C., Billault, A., Segurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., Scarpelli, C., Artiguenave, F., Pè, M.E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.F., Weissenbach, J., Quétier, F., Wincker, P., French-Italian Public Consortium for Grapevine Genome Characterization. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463-468.

  • Matys, V., Fricke, E., Geffers, R., Gößling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov, B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., Wingender, E. 2003. TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31:374-378.

  • Palaniswamy, S.K., James, S., Sun, H., Lamb, R.S., Davuluri, R.V., Grotewold, E. 2006. AGRIS and AtRegNet: A platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiol. 140(3):818-829.

  • Rattei, T., Tischler, P., Arnold, R., Hamberger, F., Krebs, J., Krumsiek, J., Wachinger, B., Stümpflen, V., Mewes, W. 2008. SIMAP: Structuring the network of protein similarities. Nucleic Acids Res. 36:289-292.

  • Riaño-Pachón, D.M., Ruzicic, S., Dreyer, I., Mueller-Roeber, B. 2007. PlnTFDB: An integrative plant transcription factor database. BMC Bioinformatics 8:1-10.

  • Riechmann, J.L., Heard, J., Martin, G., Reuber, L., Jiang, C.Z., Keddie, J., Adam, L., Pineda, O., Ratcliffe, O.J., Samaha, R.R., Creelman, R., Pilgrim, M., Broun, P., Zhang, J.Z., Ghandehari, D., Sherman, B.K., Yu, G.L. 2000. Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 290:2105-2110.

  • Thompson, J.D., Higgins, D.G., Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

  • Tinlot, R., Rousseau, M. 1993. The state of viticulture in the world and the statistical information in 1992. Bulletin de l'O.I.V. 66:861-946.

  • Wellcome Trust Sanger Institute. 2007. GFF: An exchange format for feature description. 15 Mar. 2008. <http://www.sanger.ac.uk/Software/formats/GFF/>.

  • Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L.Y., Helmberg, W., Kapustin, Y., Kenton, D.L., Khovayko, O., Lipman, D.J., Madden, T.L., Maglott, D.R., Ostell, J., Pruitt, K.D., Schuler, G.D., Schriml, L.M., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Suzek, T.O., Tatusov, R., Tatusova, T.A., Wagner, L., Yaschenko, E. 2006. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 34:173-180.

  • Wilson, D., Charoensawan, V., Kummerfeld, S.K., Teichmann, S.A. 2008. DBD: Taxonomically broad transcription factor predictions: New content and functionality. Nucleic Acids Res. 36:88-92.

  • Zhang, J.Z. 2003. Overexpression analysis of plant transcription factors. Curr. Opin. Plant Biol. 6(5):430-440.

  • Zhu, Q.H., Guo, A.Y., Gao, G., Zhong, Y.F., Xu, M., Huang, M., Luo, J. 2007. DPTF: A database of poplar transcription factors. Bioinformatics 23:1307-1308.


Contributor Notes

The research was supported by the Shanghai Project for ISTC (055407068), by the Shanghai Subject Chief Scientist (06XD14017), by the Project of the key laboratory of Shanghai (05dz223266-07dz22011), by the National Natural Science Foundation (30471258-30670179), and by the 863 Program (2006AA10Z117-06Z358).

Corresponding author. E-mail: yaoquanhong_sh@yahoo.com.cn.


