Gene Prediction Exercise

1. Gene Prediction Manual
A. Gene annotation
Step 1. Accessing the EMBL database to retrieve the gene
Go to EMBL database
Select Nucleotide sequences
Type sequence entry name HS307871
Press Go button
Click on EmblEntry link
Have a look at the different entry fields: detect the mRNA and CDS exons
Click on Text Entry link to see the plain text formatted output
This is the sequence in FASTA format
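The two things retrieved in Step 1, the FASTA sequence and the CDS exon coordinates from the feature table, can be handled with a few lines of Python. This is only a minimal sketch: the record below is a made-up toy sequence (not the real HS307871 entry), the function names are hypothetical, and the location parser assumes a simple forward-strand join(...) location without complement() or remote references.

```python
import re

def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks).upper()

def parse_join(location):
    """Turn a location such as 'join(3..8,13..18)' into a list of
    1-based, inclusive (start, end) exon coordinates."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def splice(sequence, exons):
    """Concatenate the exon segments (1-based, inclusive) of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in exons)

# Toy record for illustration only, NOT the real HS307871 entry.
toy_fasta = """>TOY1 hypothetical example
ccATGGCCgtagCTGTAAtt
"""
header, seq = parse_fasta(toy_fasta)
exons = parse_join("join(3..8,13..18)")
cds = splice(seq, exons)
print(header, exons, cds)
```

With the real entry, the same three calls would take the plain text FASTA output and the CDS location string copied from the feature table.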
B. Exploring ab initio gene prediction

Step 2. Running geneid
Connect to the geneid server
Paste the FASTA sequence
Choose geneid output format
Run geneid with different parameters:

1. Searching signals: Select acceptors, donors, start and stop codons. Look for them in the real annotation of the sequence
2. Searching exons: Select All exons and try to find the real ones
3. Finding genes: You do not need to select any option (default behaviour). Compare the predicted gene with the real gene
Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871
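To get a feeling for why a signal search returns so many candidates, the four signal types can be located with plain string matching. This sketch is an assumption-laden simplification and not how geneid works internally: it only looks for the consensus GT (donor), AG (acceptor), ATG (start) and TAA/TAG/TGA (stop) on the forward strand, without any scoring. The function name and the toy sequence are hypothetical.

```python
def find_signals(seq):
    """Return 1-based positions of naive candidate signals on the forward strand."""
    seq = seq.upper()
    signals = {"start": [], "stop": [], "donor": [], "acceptor": []}
    for i in range(len(seq) - 1):
        two = seq[i:i + 2]
        three = seq[i:i + 3]
        if two == "GT":
            signals["donor"].append(i + 1)      # first base of a possible intron
        if two == "AG":
            signals["acceptor"].append(i + 2)   # last base of a possible intron
        if three == "ATG":
            signals["start"].append(i + 1)      # first base of the codon
        if three in ("TAA", "TAG", "TGA"):
            signals["stop"].append(i + 1)       # first base of the codon
    return signals

# Toy sequence for illustration only.
toy = "CCATGGCCGTAGCTGTAATT"
for kind, positions in find_signals(toy).items():
    print(kind, positions)
```

Even in this 20-base toy sequence most of the hits are not real signals, which is why the real annotation is needed to tell them apart.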
Step 3. Running other genefinders
Since several alternative programs are available to analyze a DNA sequence, we can run each of them and look for the parts that their predictions have in common.
1. GENSCAN:
Connect to the GENSCAN server
Paste DNA sequence
Press Run Genscan button
Compare annotations and predictions
2. FGENESH:
Connect to Softberry homepage
On the left frame, select GENE FINDING in Eukaryota
Select the program FGENESH
Paste DNA sequence
Press Search button
Compare annotations and predictions
3. GRAIL:
Connect to GrailEXP homepage
Activate Perceval Exon Candidates box
Paste DNA sequence
Press Go! button
Check the results
Compare annotations and predicted exons
NOTE: The first exon is missed in every prediction, and the programs have some trouble detecting the donor site of exon 5. Detecting start codons is a serious weakness of current gene finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.
Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence HS307871
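The comparison between annotations and predictions in Step 3 can also be done numerically once the exon coordinates have been copied from each output. The sketch below computes exon-level sensitivity and specificity (an exon is correct only if both boundaries match) and lists the exons shared by all programs. The coordinates and program names are hypothetical placeholders, not the HS307871 results.

```python
def exon_accuracy(annotated, predicted):
    """Exon-level sensitivity and specificity: an exon counts as correct
    only when both boundaries match the annotation exactly."""
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp

def common_exons(predictions):
    """Exons reported with identical coordinates by every program."""
    sets = [set(exons) for exons in predictions.values()]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical coordinates for illustration only.
annotation = [(100, 180), (300, 420), (600, 710), (900, 1000)]
predictions = {
    "programA": [(300, 420), (600, 710), (900, 1000)],
    "programB": [(310, 420), (600, 710), (900, 1000)],
    "programC": [(300, 420), (600, 700), (900, 1000), (1200, 1250)],
}
for name, exons in predictions.items():
    sn, sp = exon_accuracy(annotation, exons)
    print(f"{name}: Sn={sn:.2f} Sp={sp:.2f}")
print("shared by all:", common_exons(predictions))
```

In this toy data no program finds the first annotated exon, which mirrors the situation described in the note above.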
C. Using EST/cDNA homology information

Step 4. Using GrailEXP
Connect to GrailEXP homepage
Activate Galahad EST/mRNA/cDNA Alignments box
Select GrailEXP database (RefSeq/HTDB/dbEST/EGAD/Riken)
Activate exon assembly: Gawain Gene Models
Paste DNA sequence
Press Go! button
Check the results: predictions and supporting information
Compare the annotations, the ab initio GRAIL prediction and the five predicted alternatively spliced variants
Figure 3. Comparison between the EMBL annotation, the genes predicted ab initio by Grail and the five alternative predictions supported by EST information in the sequence HS307871
Step 5. Using other gene finding programs + alignment of transcripts

Using blastn, we can search the database est_human for ESTs that will support the predictions. Filter this output to select non-overlapping ESTs that could together form a complete cDNA sequence (see Figure 4). Moreover, ESTs that are not split into two or more pieces when aligned to the genomic sequence (that is, ESTs that do not span a pair of splice sites) should be rejected.
Connect to the FGENESH-C server (on Gene finding with similarity menu)
Paste the sequence HS307871
Paste the cDNA sequence or EST you have selected
Press the search button
Notice that the predicted gene will necessarily be supported by homology information, so it will likely be mapped only in the genomic region overlapping your EST query.
Figure 4. Best human ESTs in the alignment mapped on the genomic sequence HS307871
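The EST filtering described at the beginning of Step 5 can be sketched as a two-pass selection: first discard ESTs that align as a single block, then keep a set of ESTs that do not overlap each other. The manual does not say how to choose among overlapping ESTs, so picking the best-scoring one first is an assumption made here. The hit names, scores and coordinates are hypothetical, and each hit's blocks are assumed to be sorted by genomic position.

```python
def select_ests(hits):
    """Keep spliced ESTs and pick a non-overlapping subset.

    Each hit is (name, score, blocks), where blocks is a list of
    (start, end) genomic segments covered by the alignment.
    """
    # 1. Reject ESTs that align as a single block (no intron spanned).
    spliced = [h for h in hits if len(h[2]) >= 2]
    # 2. Greedy choice: best score first, skip anything overlapping a kept EST.
    chosen = []
    for name, score, blocks in sorted(spliced, key=lambda h: -h[1]):
        start, end = blocks[0][0], blocks[-1][1]
        if all(end < s or start > e for _, s, e in chosen):
            chosen.append((name, start, end))
    return sorted(chosen, key=lambda c: c[1])

# Hypothetical blastn hits for illustration only.
hits = [
    ("est1", 410, [(100, 180), (300, 420)]),
    ("est2", 395, [(350, 420), (600, 710)]),   # overlaps est1
    ("est3", 520, [(150, 900)]),               # unspliced: rejected
    ("est4", 300, [(600, 710), (900, 1000)]),
]
selected = select_ests(hits)
print(selected)
```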
D. Using protein homology information

Step 6. Spliced alignment
Spliced alignment is very useful when we have additional information (a putative homologous protein sequence) about
the content of the sequence. Thus, gene prediction is guided by fitting the protein sequence into the best splice sites
predicted in the genomic sequence.
Open the NCBI blast server
Choose blastx program (genomic query versus protein database)
Paste the genomic sequence and press the Blast! and Format! buttons
Select the first protein and display its FASTA sequence. Obviously, it is the real protein annotated in the genomic sequence.
Open genewise web server to use this protein to predict the best gene structure
Paste both protein and genomic sequences and run the program
Compare the predicted gene (at the end of the file) with the annotations: look for the splice sites at the ends of each intron to check that the exon boundaries are correct
Figure 5. Best HSPs, obtained using blastx, representing protein homologues similar to the genomic sequence HS307871
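The last check in Step 6, looking at the splice sites to confirm the exon boundaries, can be sketched as follows. The code assumes forward-strand exons given as 1-based inclusive coordinates and only tests for the canonical GT...AG intron ends, so non-canonical introns would be flagged. The function name, sequence and coordinates are hypothetical.

```python
def check_introns(genome, exons):
    """For each intron between consecutive exons (1-based, inclusive,
    forward strand), report whether it starts with GT and ends with AG."""
    report = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = genome[left_end:right_start - 1].upper()
        report.append((left_end + 1, right_start - 1,
                       intron.startswith("GT"), intron.endswith("AG")))
    return report

# Toy sequence for illustration only: exons in upper case, intron in lower case.
genome = "ccATGGCCgtagCTGTAAtt"
good = check_introns(genome, [(3, 8), (13, 18)])
shifted = check_introns(genome, [(3, 7), (13, 18)])  # exon 1 ends one base early
print(good)
print(shifted)
```

A boundary that is off by even one base, as in the second call, moves the intron start away from the GT dinucleotide and is reported as False.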
Step 7. Spliced alignment using homologous proteins

From the blastx output, choose several homologous proteins and run genewise again for each one separately. Observe how the accuracy improves the closer the homologue is to the original human protein:
Homo sapiens
Ovis aries
Mus musculus
Rattus norvegicus
Danio rerio
Drosophila melanogaster
Drosophila virilis
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Figure 6. Graphical comparison of the real gene annotation and different genewise predictions using different homologous proteins for the gene uroporphyrinogen decarboxylase (URO-D)
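The gain in accuracy observed in Step 7 can be quantified at the nucleotide level, which also gives credit to partially correct exons. This is a minimal sketch under stated assumptions: the coordinates are hypothetical, and the labels "close homologue" and "distant homologue" are placeholders rather than actual genewise results for any of the species listed.

```python
def covered(exons):
    """Set of genomic positions covered by a list of (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}

def nucleotide_accuracy(annotated, predicted):
    """Fraction of annotated coding bases that are predicted (Sn) and
    fraction of predicted coding bases that are annotated (Sp)."""
    real, pred = covered(annotated), covered(predicted)
    tp = len(real & pred)
    sn = tp / len(real) if real else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp

# Hypothetical coordinates for illustration only.
annotation = [(101, 200), (301, 400)]
runs = {
    "close homologue": [(101, 200), (301, 400)],
    "distant homologue": [(151, 200), (301, 350)],
}
for label, exons in runs.items():
    sn, sp = nucleotide_accuracy(annotation, exons)
    print(f"{label}: Sn={sn:.2f} Sp={sp:.2f}")
```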
Step 8. Using protein homology information: GenomeScan

Protein homology information can also be used to favour the ab initio predicted exons that are supported by blastx HSPs, as GenomeScan and geneid do, thereby improving the final prediction.
GenomeScan:
Connect to the GenomeScan web server
Retrieve the protein from the previous blast search
Paste both genomic and protein sequences
Press the button GenomeScan
Check the results. It seems that the first exon has not been detected even when using homology information. This is because blast programs require a minimal word length.
Figure 6. GenomeScan output: first exon is not correctly predicted probably due to blast length restrictions
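The word length restriction can be illustrated with a toy seeding step: a hit can only start where the translated query and the protein share a whole word, so an exon that contributes fewer residues than the word length cannot produce a seed on its own. This sketch uses exact word matches and a word size of 3 purely as assumptions for illustration. Real blast programs use scored neighbourhood words, and their word size depends on the program and its settings. The peptide strings are hypothetical.

```python
def shared_words(translated_exon, protein, word_size=3):
    """Exact words of length word_size shared by two peptide strings."""
    words = {translated_exon[i:i + word_size]
             for i in range(len(translated_exon) - word_size + 1)}
    return sorted(w for w in words if w in protein)

# Hypothetical peptide, not the real protein of HS307871.
protein = "MKTAYIAKQR"
short_exon = shared_words("MK", protein)      # shorter than one word
longer_exon = shared_words("AYIAK", protein)
print(short_exon)
print(longer_exon)
```

The short exon yields no words at all, so no HSP can support it, whatever its similarity to the protein.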
E. Using a genome annotation browser

Step 9. Golden path archive:
Open the UCSC Genome Bioinformatics Site
Select the blat link to locate the genomic coordinates of our sequence
Paste the DNA sequence in FASTA format (HS307871)
Submit the file
Click over the first hit: (browser link)
Compare the graphical annotation with the EMBL entry of the gene
Analyze these different sets of output options:

Genes and Gene Prediction Tracks,
mRNA and EST Tracks
Figure 7. (a) UCSC genome browser representation of the region containing the gene uroporphyrinogen decarboxylase (UROD). (b) UCSC genome browser representation of the context (100 kbp) region around the gene uroporphyrinogen decarboxylase (URO-D).
F. Results
Here you can find the solutions to every exercise:
EMBL annotation
EMBL annotation (plain text)
FASTA sequence
geneid results: signals
geneid results: exons
geneid results: genes

GENSCAN results
FGENESH results
GRAIL results
GrailEXP results
Blastn + human ESTs results
Blastx + protein results
Genewise (human protein)
Genewise (ovis protein)
Genewise (mouse protein)
Genewise (rat protein)
Genewise (Danio rerio protein)
Genewise (Drosophila melanogaster protein)
Genewise (Drosophila virilis protein)
Genewise (yeast protein)
Genewise (fission yeast protein)
GenomeScan results
