Lab 6. Gene finding and genome annotation
In this lab we will attempt to describe one
gene in detail by examination of genomic sequence. I extracted
a random 150Kb sequence
from chromosome 1 of sorghum (Sorghum
bicolor),
a grain crop whose genome has recently been sequenced, and selected an
interesting 17Kb region. We will apply ab initio
methods to find putative
exons
and other gene signals; then we will evaluate the quality of these
predictions
using EST sequences from sorghum; then we'll broaden our survey to ESTs
from other plants; and finally we will compare our results with the
publicly available annotations and explore the information available
from other genomes. In the process we will learn some
features of the Argo,
JGI,
and VISTA
genome
browsers.
Questions
to be addressed in your report are in this color. The only questions you need to
answer, however, are in italics.
- Ab initio gene
prediction. Our 17Kb sequence in FASTA format can be
found here.
Right-click on the link, save the sequence to a
file,
and navigate to the FGENESH
server. In the Organism
group
of buttons, select Monocot
plants (Corn, Rice, Wheat, Barley)
and
click Search. View and save
for later reference
the PDF file from
the resulting page. Also copy the text (starting with FGENESH 2.6) and save it to a text
file.
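Before submitting the saved file to the gene finders, it can help to confirm that it really is a single FASTA record of about the expected length. The following is a minimal Python sketch, not part of the lab materials; the function name `read_fasta` and the demo record are invented for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {h: "".join(parts) for h, parts in records.items()}


# Invented toy record; for the real file, pass an open file handle instead.
demo = [">toy_seq invented example", "ACGTAC", "GGTA", ""]
records = read_fasta(demo)
for name, seq in records.items():
    print(name, len(seq))
```

For the lab sequence you would expect exactly one record with a length of roughly 17,000 bases.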
- I've written a Perl
script
to convert FGENESH output to GFF2 format, a simple text format that
describes genomic features. Right-click on the link to save the script
to your computer. Type
perl fgenscan_to_gff2.pl
and press Enter
to make the program print out instructions for usage. To run the
program, you'll
need the file name as the first argument and the letter "F" as the
second, since the script is also written to convert GENSCAN output and
will be used below for that. Note: if you are working on this exercise
outside the lab, the script
won't work as downloaded, because it requires a special Perl module. In that case,
create the directory C:\Perl\lib\CN
and place this file in it.
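To make the target format concrete: a GFF2 record is a single line of nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, group). The Python sketch below only illustrates that layout. It is not the Perl conversion script, and the sequence name, coordinates, score, and group label are invented.

```python
def gff2_line(seqname, source, feature, start, end,
              score=".", strand="+", frame=".", group=""):
    """Return one tab-separated GFF2 record (nine columns)."""
    if start > end:
        raise ValueError("GFF start must not exceed end")
    fields = [seqname, source, feature, str(start), str(end),
              str(score), strand, str(frame), group]
    return "\t".join(fields)


# Hypothetical exon; every value here is made up for illustration.
example = gff2_line("sorghum_17kb", "FGENESH", "exon",
                    1200, 1450, 12.5, "+", 0, "gene_1")
print(example)
```

Opening one of your converted .gff2 files in a text editor and comparing it against this column order is a quick way to check that the conversion worked.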
- Take
a little time to work out the notation in the FGENESH
graphical
output and make sure you understand what gene features correspond to
the various glyphs (symbolic images). The full key will be found
partway down this
page.
- Next submit your sequence to the GENSCAN server, selecting
Maize as the model
organism. Copy the output from the WWW page; select from the top of the
text down to the end of the gene predictions; don't select the Click messages or the predicted
peptides under them. Paste into a text editor and save. To generate a .gff2 file from this output, run the
fgenscan2gff2.pl
script again, this time using "G" as the second argument.
- Submit your sequence to the GeneID
server, this time using the Rice
model. Why
isn't there a sorghum gene-prediction model for these gene finders?
What would be required to develop one? Leave the settings at
their default values. When the output appears, you will see summaries
of gene models starting
with the comment
# Gene 1 (Forward). 1 exons. 6 aa.
Score = 0.95
Copy
the
text from this line to the end, and paste it to a text document. Save
as a text file with the suffix .gff2.
This will not need to be converted because it's already in the proper
format.
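Once you have .gff2 files from several gene finders, a quick per-gene tally makes them easier to compare. The Python sketch below assumes that the ninth column holds the gene (group) label and that lines starting with # are comments. The sample lines are invented and are not real GeneID output.

```python
from collections import defaultdict


def summarize_gff2(lines):
    """Return {group: (n_features, min_start, max_end)} for GFF2 lines."""
    spans = defaultdict(lambda: [0, None, None])
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        # Split on whitespace at most 8 times so the group column stays whole
        cols = line.split(None, 8)
        if len(cols) < 8:
            continue  # not a feature line
        start, end = int(cols[3]), int(cols[4])
        group = cols[8] if len(cols) > 8 else "(no group)"
        rec = spans[group]
        rec[0] += 1
        rec[1] = start if rec[1] is None else min(rec[1], start)
        rec[2] = end if rec[2] is None else max(rec[2], end)
    return {g: tuple(v) for g, v in spans.items()}


# Invented sample lines for illustration only.
sample = [
    "# Gene 1 (Forward). 2 exons.",
    "seq\tgeneid\tFirst\t100\t200\t1.5\t+\t0\tgene_1",
    "seq\tgeneid\tTerminal\t300\t450\t2.0\t+\t1\tgene_1",
    "seq\tgeneid\tSingle\t900\t1200\t3.1\t-\t0\tgene_2",
]
summary = summarize_gff2(sample)
for gene, (n, lo, hi) in summary.items():
    print(gene, n, lo, hi)
```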
- Aligning genomic DNA with expressed
sequences from sorghum. Navigate to the NCBI BLAST
page, and click the Basic
BLAST/Nucleotide link. On the new page, use the Browse button in the Enter Query Sequence section to
locate your sequence. In the Choose Search
Set section, select the Others
(nr, etc.) button. In the Database
dropdown menu, choose Expressed
sequence tags (est). In the Organism
box, enter Sorghum bicolor and
accept the suggested text completion. Click the BLAST button.
- In the page that appears, look at the graphical display
(can you estimate the number of
expressed genes in your sequence?), then click the Formatting options link and uncheck
the Graphical Overview box,
since we want only text output. Uncheck the Linkout box, click the Download link, and in the box that
appears, click on Text. A
file-save dialog will appear. Choose All Files as the file type, and enter a file name
ending with the suffix .bn.out.
(If you accidentally save the file with
a .txt suffix, change the suffix
so that our genome-browser software, Argo, will recognize the file as
BLAST output.)
- Using genome browsers. The
Argo
software should be installed on your lab computer; if you are working
on a different machine, you'll have to download it. It is just a large .jar
file and requires no installation. It requires Java, which is installed
on the lab computers.
- Start Argo by double-clicking on the file name or icon. I'm
assuming you use Windows, but the experiment should
work
just as well on Linux/Unix or OS X. If Argo doesn't start up in a few
seconds,
open a command-line window and type java
-version to make sure you have at least Java 1.4. If you don't, update your
Java
version; if you do, use the command-line window to navigate to the Argo
directory
and type java -jar argo.jar.
- In Argo, choose File/Open
Sequence
File and load your sequence. When prompted as to how Argo should
interpret
it, choose the FASTA option.
When
the Sequence Range dialog
comes up,
accept the full range. At the bottom of this dialog you will also see a
button
labeled Track Table.... You
could
use this to load your tracks, but for now we'll wait and do this in
another
way. Click OK.
- Now choose File/Load
Tracks...
and use this dialog to load all your .gff2
and .bn.out files, one after
the
other. Note that when you load BLAST
output you'll be asked whether you want to use the subject coordinates
to draw features. Click No. The coordinate system you want is that
of your genomic query sequence, not that of each of the ESTs.
When
you dismiss the Track Table, all tracks will be drawn in Argo's Sequence View window. Note that
you can always view, remove, and add tracks via Edit/Track Table.
- Finding homology with other plants. Submit
your gene sequence to NCBI
BLAST. Proceed as in step 6, but leave the Organism box empty, and in the Entrez query box enter
Viridiplantae NOT sorghum [organism]
so that we don't get any of the same BLAST hits that we got from the
search against the sorghum ESTs alone. (Viridiplantae means green plants).
Run the search, save the output as text with suffix .bn.out,
and load into Argo.
- Note that in Argo, clicking on any feature shows its
description in the lower left-hand Inspector
Panel.
- Let's view an alignment of this area of the sorghum genome
with the genomes of some other plants. Navigate to the VISTA Genome Browser
page. From the Clade dropdown
menu select Plant and
from the Genome menu, Sorghum. In the Position field enter Chr_1:18878351-18895351, the
coordinates defining our genomic sequence. Then click Go.
- To work out what the display means, you'll want to click on
the Help menu. The Help page will open in your WWW
browser, not in the VISTA viewer.
- Click on the Browser
button in the toolbar at the top of the VISTA display. The JGI (Joint Genome Institute) browser
will open, showing the same genomic region and many other tracks. About
halfway down you'll see a track for the Sbi1.4 computationally predicted community gene
model set. These predictions, and the sorghum-genome annotations
used by VISTA, were done by JGI as described on this page. Note also that
to the left of each track is a row of tiny buttons, including one
labeled i for "information".
Clicking on it will give you a window that explains the track.
- So that we could view the annotations in the Argo display
too, I retrieved the .gff3
file using the Download
data
link at the top of the page (you don't need to do this; it is
included only for your information). From this annotation file I copied the
rows describing features that start and end between the above
coordinates and saved them into this file,
which you'll need to download and save.
- We'll need to translate the coordinates to local ones by
subtracting the genome start position, so here's a simple Excel trick
for doing it. Open the file, find a blank cell, enter the number
18878351, and
press Enter. Now select the
cell and Copy it. Next select
all of the cells in the 4th and 5th columns, representing the feature
start and end positions. Choose Edit/Paste Special and, in the Operation section of the dialog,
choose Subtract. Now click OK and note that all of the
coordinates have been adjusted. Delete the number you entered in the
blank cell. Choose File/Save as...
and save your file as tab-separated text. Then change the suffix
manually from .txt to .gff3. Load this file into Argo like
the others.
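If you prefer not to use Excel, the same coordinate shift can be done with a short script. The Python sketch below subtracts an offset from columns 4 and 5 of each tab-separated feature line and passes comment lines through unchanged. The function name and the sample feature line are invented for illustration.

```python
def shift_gff_coordinates(lines, offset):
    """Subtract offset from the start and end columns of GFF lines."""
    shifted = []
    for line in lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        if not stripped or stripped.startswith("#") or len(cols) < 5:
            shifted.append(stripped)  # keep comments and odd lines as-is
            continue
        cols[3] = str(int(cols[3]) - offset)  # feature start
        cols[4] = str(int(cols[4]) - offset)  # feature end
        shifted.append("\t".join(cols))
    return shifted


REGION_START = 18878351  # the offset used in the Excel step above

# Invented feature line; the real rows come from the downloaded file.
demo = ["##gff-version 3",
        "Chr_1\tJGI\tgene\t18880000\t18880500\t.\t-\t.\tID=toy_gene"]
local = shift_gff_coordinates(demo, REGION_START)
print("\n".join(local))
```

Like the Excel step, this subtracts 18878351 itself, so a feature starting exactly at the region start would get local coordinate 0. Because the offset is a parameter, it is easy to try a different convention if the features look shifted by one base in Argo.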
- Interpreting genome-annotation
displays. You
now have a lot of information on your screen: the
gene-prediction-program graphical outputs, the Argo, VISTA, and JGI
browser displays, and all the features you can explore and links
that you can follow. For example, in VISTA, click on one of the
contours (VISTA calls them curves) and then click the i button next to the Browsers button in the toolbar. You'll need to
answer the following questions for your report. As usual, be
informative but concise!
- What is a CNS? Find an example and describe or present
the evidence.
- On the JGI browser you'll find a Repeats track at the bottom. Select one of the repeats and find out how
it was named (for example, 26262 2286 4836) and how it was identified.
- In the VISTA display there are thick red and gray lines
under the curves. What do these
represent? Choose one and give all the information about it that you
can get from the plot.
- In the VISTA Control
Panel at left, you'll see a select/add
dropdown menu. If Maize BACs
are not already one of the tracks (click the Show organism name checkbox to see
labels on the tracks), add this track. Why
does the section of the curve from positions 18,890,847 to 18,894,464
(you can identify these coordinates by hovering over the curves and
watching the Control Panel)
show coloring that you
don't see in the other tracks, such as the Oryza sativa (rice) one?
- In the Argo display, in several regions such as around
the 6.5 and 16 Kb positions there are many sorghum ESTs aligned to the
genome sequence, yet neither the ab
initio gene finders nor the annotation pipeline has annotated
these regions as exons. Why not?
Click on one of the glyphs and study the DNA alignment in the Inspector
Panel (make sure the Properties
tab is selected). If you scroll down in the window you'll note that
Argo provides an alignment score, but this number doesn't tell us much.
Consider whether the alignment itself gives any clues to
help you answer this question. Do the two sequences match perfectly, or
are there many SNPs, and what might this mean? You may also want to
look at several EST
alignments with regions that were
annotated as exons. Finally, try this: Select one of the ESTs in these
groups. From the Inspector panel, copy its GenBank ID, which will look
something like this: gb|CF074151.1.
Take this ID to the NCBI home
page, under the Search
menu select Nucleotide, paste
your ID into the for box, and
click Go. When the record for
this sequence appears, click the FASTA
link next to Format:, and when
the FASTA sequence appears, select and copy it. Now go to the Phytozome site and click
the BLAST Genome button near
the top of the page. Paste your sequence into the Query Sequence box and click the Run BLAST button. What information can you get from the
results?
- In Argo, why do the
EST arrow glyphs all point to the right, while the gene-model glyphs
point to the left?
- There appear to be three gene models shown in these
displays. Can anything be inferred
about the function of these genes? What, in your understanding, is the
strength of the evidence?
- What is KOG (in the
JGI browser)? Does it give any information about the genomic region
we're studying?
- What is a scaffold, in the context of the JGI browser?
- As you did for
question d), add another track to the VISTA display: RankVISTA for Maize BACs. Explain the resulting display.
- Identify one of each
of these features in one of the gene models in our sequence: initial
and final exons, transcription start sites, and UTRs. Give the actual
DNA sequences and coordinates of the features that were used by the ab
initio gene finders to recognize
these; for example, intron
donor and acceptor sites, start and stop codons, and TSS motif.
Argo's Zoom/Zoom to Bases
option will be useful for this question.
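For the last question, a small script can help you confirm the signals you find by eye in Argo. The Python sketch below extracts the first and last two bases of an intron from 1-based inclusive coordinates and checks them against the canonical GT...AG pattern, reverse-complementing first for genes on the minus strand. The toy sequence and coordinates are invented. The gene finders themselves use richer statistical models of these sites, so treat this only as a sanity check.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_signals(seq, start, end, strand="+"):
    """Return (donor, acceptor) dinucleotides of an intron.

    start and end are 1-based inclusive coordinates on the given sequence.
    """
    intron = seq[start - 1:end]
    if strand == "-":
        intron = reverse_complement(intron)
    return intron[:2].upper(), intron[-2:].upper()


def is_canonical(donor, acceptor):
    """True for the common GT...AG intron boundary pattern."""
    return donor == "GT" and acceptor == "AG"


# Invented toy gene: exon (1-6), intron (7-19), exon (20-25)
toy = "ATGGCC" + "GTAAGTTTTTCAG" + "GGCTAA"
donor, acceptor = intron_signals(toy, 7, 19)
print(donor, acceptor, is_canonical(donor, acceptor))
```

For a gene model drawn pointing left in Argo, pass strand="-" with the coordinates as displayed on the forward sequence.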