mGene.web: A Web Service for Accurate Computational Gene Finding

mGene.web is a web service for the genome-wide prediction of protein-coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures, including untranslated regions, in an increasing number of organisms. mGene.web additionally allows users to train the system for other organisms at the push of a button, which greatly accelerates the annotation of newly sequenced genomes. The system is built in a highly modular way, so that individual components of the framework, such as the promoter prediction tool or the splice site predictor, can be used on their own. The underlying gene finding system mGene is based on discriminative machine learning techniques, and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).

The web service has been described in [1] and can be found here: http://galaxy.raetschlab.org. mGene.web is an interface to the mGene gene finding system (see http://mgene.org and [3] for more details) that is available for download at http://mgene.org/download.

Main Features

Simple one-step procedure to train an ab initio gene predictor for a new organism based on a FASTA and a GFF3 (or GTF) file.
Gene prediction for a growing list of organisms from a given FASTA file using pretrained mGene instances.
Easy access to the signal predictions, e.g. for splice sites, transcription start sites, etc.
Integration of externally provided signal or content predictions/tracks into the mGene gene finder.
High accuracy of mGene's gene and signal predictions.
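Since training takes only a FASTA file and a GFF3 annotation as input, it can be useful to check locally that the two files agree before uploading them. The following is a minimal sketch and not part of mGene.web: the function names `fasta_lengths` and `check_gff3` are hypothetical, and the code assumes standard nine-column, tab-separated GFF3 with 1-based inclusive coordinates.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length} for FASTA given as an iterable of lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_gff3(lines, lengths):
    """Return a list of human-readable problems found in GFF3 feature lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append(f"line {n}: expected 9 columns, found {len(cols)}")
            continue
        seqid = cols[0]
        try:
            start, end = int(cols[3]), int(cols[4])
        except ValueError:
            problems.append(f"line {n}: non-numeric coordinates")
            continue
        if seqid not in lengths:
            problems.append(f"line {n}: sequence '{seqid}' not found in FASTA")
        elif not 1 <= start <= end <= lengths[seqid]:
            problems.append(
                f"line {n}: coordinates {start}-{end} outside 1-{lengths[seqid]}"
            )
    return problems
```

With real files, open file handles can be passed directly, for example `fasta_lengths(open("genome.fa"))` (the file name is a placeholder). Summing the returned lengths gives the total genome size, which can be compared against the size range the service is intended for.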

Disclaimer

Getting Started

A number of examples with easy step by step explanations can be found here.

Workflows

mGene.web has a very flexible modular setup. Using the Galaxy workflow system, we can pass this strength on to web service users without complicated and confusing parameterization procedures. User-defined pipelines can be built from modules via a graphical workflow editor and can be shared among users. We provide a number of predefined workflows that combine different modules of our system to perform common tasks. More information on the different workflows and links for importing them into your Galaxy environment can be found here.

Figure 1. Workflow to train a complete gene finder with mGene.web. The workflow only requires a FASTA file with the genomic sequence and a GFF3 file with a set of genes on which the system can be trained. It outputs a trained mGene Predictor (TmGP) as well as a text file that contains the estimated prediction accuracy. When the workflow mGeneTrain is used, all intermediate results are also returned (not shown).

Libraries

Using the mechanisms of the Galaxy-framework [2], we offer pretrained signal, content and gene-structure predictors for an increasing number of organisms. To obtain such a classifier please follow the simple steps described here.

These classifiers come with meta-information such as the number of training examples and the performance on an appropriate holdout set.

Prediction Results

We tested the web service tools for a large number of organisms, including:

Drosophila melanogaster

Arabidopsis thaliana

Caenorhabditis elegans

Aspergillus nidulans

Tetraodon nigroviridis

Anopheles gambiae

A detailed list of the results for gene, signal and content predictions can be found here.

Gene State Model

mGene.web is intended to be applicable to a wide range of organisms. We therefore use a more general state model than in the original version of mGene, which was developed for gene prediction in nematodes. As a consequence, predicting trans-splicing and operons is not supported in the mGene.web model. We also excluded poly-A signals, as there is usually no training data available. However, we do model splicing of UTRs. The complete model applied in the GeneTrain tool is depicted in Figure 2.

Figure 2. State model used in mGene.web, represented in two levels of detail. States are drawn as ovals, transitions between states as arrows. States are associated with one or more signal detectors, transitions with content detectors and length features. For simplification, states that share all their outgoing transitions are drawn in boxes with right-pointing triangles (e.g. AccStop and Stop). The outgoing transitions are represented only once for all the grouped states. (A) Each segmentation has to start in the Begin state and stop in the End state. The model allows for 0, 1 or more detected genes. To handle long intergenic segments we have introduced an additional dummy state intergenic-long (marked as small circles with 'i'). The gene model is shown with all details for the untranslated regions. For technical reasons we have introduced several mixed states, namely AccTis with the consensus AG|ATG and AccStop with a consensus GT|TAA. Blue transitions are associated with UTR content detectors, red transitions with intron content detectors and black ones with intergenic content detectors. (B) Transitions and states in the coding region. Green arrows are associated with coding exon and frame content detectors, red ones with intron detectors. The splice sites in the coding regions are split to propagate the frame information across an intron to the next exon and to avoid the creation of in-frame stop codons by splicing.
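To make the notion of a mixed state more concrete, the sketch below lists the positions in a sequence where a two-part consensus such as AG|ATG (AccTis) occurs, i.e. where an acceptor dinucleotide is immediately followed by a start codon. This is only an illustration under simplifying assumptions: the signal detectors in mGene are trained classifiers, not exact string matches, and the function names `find_consensus` and `acc_tis_candidates` are hypothetical.

```python
def find_consensus(seq, left, right):
    """Return 0-based positions of the first base after the '|' boundary
    for every occurrence of `left` immediately followed by `right`."""
    seq = seq.upper()
    motif = (left + right).upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos + len(left))
        # Continue one base further so overlapping occurrences are found too.
        pos = seq.find(motif, pos + 1)
    return hits


def acc_tis_candidates(seq):
    """Candidate positions for the mixed AccTis state (consensus AG|ATG)."""
    return find_consensus(seq, "AG", "ATG")
```

Other two-part consensus strings from the caption can be passed to `find_consensus` in the same way. In a real gene finder each candidate position would then be scored by the corresponding signal detectors rather than accepted outright.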

Contact

In case of comments, problems, questions etc. feel free to contact

Gunnar Raetsch

Gabriele Schweikert

Jonas Behr

References

[1]	Schweikert G, Behr J, Zien A, Zeller G, Ong CS, Sonnenburg S, and Rätsch G (2009). mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research, Web Server Issue.

[2]	Giardine B, Riemer C, Hardison R, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, et al. (2005). Galaxy: a platform for interactive large-scale genome analysis. Genome Research 15:1451–1455.

[3]	Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Krüger N, Sonnenburg S, and Rätsch G (2009). mGene: Accurate Computational Gene Finding with Application to Nematode Genomes. Genome Research 19:2133–2143.
