PLPTH 613
Bioinformatics Applications
Spring 2009

Lab 6. Gene finding and genome annotation

In this lab we will attempt to describe one gene in detail by examining its genomic sequence. I extracted a random 150 kb sequence from chromosome 1 of sorghum (Sorghum bicolor), a grain crop whose genome has recently been sequenced, and selected an interesting 17 kb region. We will apply ab initio methods to find putative exons and other gene signals; then we will evaluate the quality of these predictions using EST sequences from sorghum; then we will broaden our survey to ESTs from other plants; and finally we will compare our results with the publicly available annotations and explore the information available from other genomes. In the process we will learn some features of the Argo, JGI, and VISTA genome browsers.
Questions to be addressed in your report are in this color. The only questions you need to answer, however, are in italics.
  1. Ab initio gene prediction. Our 17 kb sequence in FASTA format will be found here. Right-click on the link, save the sequence to a file, and navigate to the FGENESH server. In the Organism group of buttons, select Monocot plants (Corn, Rice, Wheat, Barley) and click Search. View the PDF file from the resulting page and save it for later reference. Also copy the text (starting with FGENESH 2.6) and save it to a text file.
  2. I've written a Perl script to convert FGENESH output to GFF2 format, a simple text format that describes genomic features. Right-click on the link to save the script to your computer. Type

    perl fgenscan_to_gff2.pl

    and press Enter to make the program print out instructions for usage. To run the program, give the name of your saved FGENESH output file as the first argument and the letter "F" as the second, since the script is also written to convert GENSCAN output and will be used below for that. Note: if you are working this exercise outside the lab, the script won't work at first, because it requires a special Perl module. To fix this, create the directory C:\Perl\lib\CN and place this file in it.
  3. Take a little time to work out the notation in the FGENESH graphical output and make sure you understand what gene features correspond to the various glyphs (symbolic images). The full key will be found partway down this page.
  4. Next submit your sequence to the GENSCAN server, selecting Maize as the model organism. Copy the output from the WWW page; select from the top of the text down to the end of the gene predictions; don't select the Click messages or the predicted peptides under them. Paste into a text editor and save. To generate a .gff2 file from this output, run the Perl script from step 2 again, this time using "G" as the second argument.
  5. Submit your sequence to the GeneID server, this time using the Rice model, and leave the other settings at their default values. Why isn't there a sorghum gene-prediction model for these gene finders? What would be required to develop one? When the output appears, you will see summaries of gene models starting with the comment
    # Gene 1 (Forward). 1 exons. 6 aa. Score = 0.95
    Copy the text from this line to the end, and paste it to a text document. Save as a text file with the suffix .gff2. This will not need to be converted because it's already in the proper format.
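For reference, a GFF2 line has nine tab-separated fields: sequence name, source, feature type, start, end, score, strand, frame, and an optional free-text attribute field. The Python sketch below is not part of the lab materials; the function name parse_gff2_line and the example line are made up for illustration. It shows one way to check that a .gff2 file is well formed, assuming the fields are separated by tabs, and it skips comment lines such as the "# Gene 1 ..." summaries in the GeneID output.

```python
# Hypothetical helper for sanity-checking GFF2 files; not the course's Perl script.
GFF2_COLUMNS = ["seqname", "source", "feature", "start",
                "end", "score", "strand", "frame"]


def parse_gff2_line(line):
    """Parse one GFF2 line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields, got %d"
                         % len(fields))
    record = dict(zip(GFF2_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The ninth (attribute/group) field is optional in GFF2.
    record["attributes"] = fields[8] if len(fields) > 8 else ""
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")
    if record["strand"] not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    return record


# Made-up example line (not real output from any gene finder).
example = "seq17kb\tFGENESH\texon\t1201\t1450\t12.5\t-\t0\tgene_1"
exon = parse_gff2_line(example)
print(exon["feature"], exon["start"], exon["end"], exon["strand"])
```

Running a check like this over each file before loading it makes it easier to spot a stray header line or a space-separated row that a genome browser might silently reject.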
  6. Aligning genomic DNA with expressed sequences from sorghum. Navigate to the NCBI BLAST page, and click the Basic BLAST/Nucleotide link. On the new page, use the Browse button in the Enter Query Sequence section to locate your sequence. In the Choose Search Set section, select the Others (nr, etc.) button. In the Database dropdown menu, choose Expressed sequence tags (est). In the Organism box, enter Sorghum bicolor and accept the suggested text completion. Click the BLAST button.
  7. In the page that appears, look at the graphical display (can you estimate the number of expressed genes in your sequence?). Then click the Formatting options link and uncheck the Graphical Overview box, since we want only text output. Uncheck the Linkout box, click the Download link, and in the box that appears, click on Text. A file-save dialog will appear. Choose All Files as the file type, and enter a file name ending with the suffix .bn.out. (If you accidentally save it as a .txt file, change the suffix so that our genome-browser software, Argo, will recognize the file as BLAST output.)
  8. Using genome browsers. The Argo software should be installed on your lab computer; if you are working on a different machine, you'll have to download it. It is just a large .jar file and requires no installation. It requires Java, which is installed on the lab computers.
  9. Start Argo by double-clicking on the file name or icon. I'm assuming you use Windows, but the exercise should work just as well on Linux/Unix or OS X. If Argo doesn't start up in a few seconds, open a command-line window and type java -version to make sure you have at least Java 1.4. If you don't, update your Java version; if you do, use the command-line window to navigate to the Argo directory and type java -jar argo.jar.
  10. In Argo, choose File/Open Sequence File and load your sequence. When prompted as to how Argo should interpret it, choose the FASTA option. When the Sequence Range dialog comes up, accept the full range. At the bottom of this dialog you will also see a button labeled Track Table.... You could use this to load your tracks, but for now we'll wait and do this in another way. Click OK.
  11. Now choose File/Load Tracks... and use this dialog to load all your .gff2 and .bn.out files, one after the other. Note that when you load BLAST output you'll be asked whether you want to use the subject coordinates to draw features. Click No. The coordinate system you want is that of your genomic query sequence, not that of each of the ESTs. When you dismiss the Track Table, all tracks will be drawn in Argo's Sequence View window. Note that you can always view, remove, and add tracks via Edit/Track Table.
  12. Finding homology with other plants. Submit your gene sequence to NCBI BLAST. Proceed as in step 6, but leave the Organism box empty, and in the Entrez query box enter

    Viridiplantae NOT sorghum [organism]

    so that we don't get any of the same BLAST hits that we got from the search against the sorghum ESTs alone (Viridiplantae means green plants). Run the search, save the output as text with the suffix .bn.out, and load it into Argo.
  13. Note that in Argo, clicking on any feature shows its description in the lower left-hand Inspector Panel.
  14. Let's view an alignment of this area of the sorghum genome with the genomes of some other plants. Navigate to the VISTA Genome Browser page. From the Clade dropdown menu select Plant and from the Genome menu, Sorghum. In the Position field enter Chr_1:18878351-18895351, the coordinates defining our genomic sequence. Then click Go.
  15. To work out what the display means, you'll want to click on the Help menu. The Help page will open in your WWW browser, not the VISTA viewer.
  16. Click on the Browser button in the toolbar at the top of the VISTA display. The JGI (Joint Genome Institute) browser will open, showing the same genomic region and many other tracks. About halfway down you'll see a track for the Sbi1.4 computationally predicted community gene model set. These predictions, and the sorghum-genome annotations used by VISTA, were done by JGI as described on this page. Note also that to the left of each track is a row of tiny buttons, including one labeled i for "information". Clicking on it will give you a window that explains the track.
  17. So that we could view the annotations in the Argo display too, I retrieved the .gff3 file using the Download data link at the top of the page (you don't need to do this; it is included just for your information). From this annotation file I copied the rows describing features that start and end between the above coordinates and saved them into this file, which you'll need to download and save.
  18. We'll need to translate the coordinates to local ones by subtracting the genome start position, so here's a simple Excel trick for doing it. Open the file, find a blank cell, enter the number 18878351, and press Enter. Now select the cell and Copy it. Next select all of the cells in the 4th and 5th columns, representing the feature start and end positions. Choose Edit/Paste Special and, in the Operation section of the dialog, choose Subtract. Now click OK and note that all of the coordinates have been adjusted. Delete the number you entered in the blank cell. Choose File/Save as... and save your file as tab-separated text. Then change the suffix manually from .txt to .gff3. Load this file into Argo like the others.
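If you'd rather not use Excel, the same subtraction can be scripted. The sketch below is one possible way to do it in Python. The function names and the file names in the usage comment are hypothetical, and it assumes a tab-separated annotation file with the start and end positions in columns 4 and 5, as described above. The offset is a parameter, so it can be adjusted if your features turn out to be shifted by one base relative to the sequence in Argo.

```python
# Illustrative alternative to the Excel Paste Special trick; names are hypothetical.
GENOME_START = 18878351  # the value subtracted in the Excel procedure above


def shift_gff_line(line, offset=GENOME_START):
    """Subtract offset from columns 4 and 5 (start, end) of one GFF line."""
    if not line.strip() or line.startswith("#"):
        return line  # leave comments and blank lines untouched
    fields = line.rstrip("\n").split("\t")
    fields[3] = str(int(fields[3]) - offset)
    fields[4] = str(int(fields[4]) - offset)
    return "\t".join(fields) + "\n"


def shift_gff_file(in_path, out_path, offset=GENOME_START):
    """Write a copy of in_path with all feature coordinates shifted."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(shift_gff_line(line, offset))


# Example usage (file names are placeholders):
# shift_gff_file("annotations_genomic.gff3", "annotations_local.gff3")
```

Scripting the step also avoids the manual rename from .txt to .gff3, since the output file is written with whatever name you give it.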
  19. Interpreting genome-annotation displays. You now have a lot of information on your screen: the gene-prediction-program graphical outputs, the Argo, VISTA, and JGI browser displays, and all the features you can explore and links that you can follow. For example, in VISTA, click on one of the contours (VISTA calls them curves) and then click the i button next to the Browsers button in the toolbar. You'll need to answer the following questions for your report. As usual, be informative but concise!
    1. What is a CNS? Find an example and describe or present the evidence.
    2. On the JGI browser you'll find a Repeats track at the bottom. Select one of the repeats and find out how it was named (for example, 26262 2286 4836) and how it was identified.
    3. In the VISTA display there are thick red and gray lines under the curves. What do these represent? Choose one and give all the information about it that you can get from the plot.
    4. In the VISTA Control Panel at left, you'll see a select/add dropdown menu. If Maize BACs are not already one of the tracks (click the Show organism name checkbox to see labels on the tracks), add this track. Why does the section of the curve from positions 18,890,847 to 18,894,464 (you can identify these coordinates by hovering over the curves and watching the Control Panel) show coloring that you don't see in the other tracks, such as the Oryza sativa (rice) one?
    5. In the Argo display, in several regions such as around the 6.5 and 16 Kb positions there are many sorghum ESTs aligned to the genome sequence, yet neither the ab initio gene finders nor the annotation pipeline has annotated these regions as exons. Why not? Click on one of the glyphs and study the DNA alignment in the Inspector Panel (make sure the Properties tab is selected). If you scroll down in the window you'll note that Argo provides an alignment score, but this number doesn't tell us much. Consider whether the alignment itself gives any clues to help you answer this question. Do the two sequences match perfectly, or are there many SNPs, and what might this mean? You may also want to look at several EST alignments with regions that were annotated as exons. Finally, try this: Select one of the ESTs in these groups. From the Inspector panel, copy its GenBank ID, which will look something like this:  gb|CF074151.1. Take this ID to the NCBI home page, under the Search menu select Nucleotide, paste your ID into the for box, and click Go. When the record for this sequence appears, click the FASTA link next to Format:, and when the FASTA sequence appears, select and copy it. Now go to the Phytozome site and click the BLAST Genome button near the top of the page. Paste your sequence into the Query Sequence box and click the Run BLAST button. What information can you get from the results?
    6. In Argo, why do the EST arrow glyphs all point to the right, while the gene-model glyphs point to the left?
    7. There appear to be three gene models shown in these displays. Can anything be inferred about the function of these genes? What, in your understanding, is the strength of the evidence?
    8. What is KOG (in the JGI browser)? Does it give any information about the genomic region we're studying?
    9. What is a scaffold, in the context of the JGI browser?
    10. As you did for question 4, add another track to the VISTA display: RankVISTA for Maize BACs. Explain the resulting display.
    11. Identify one of each of these features in one of the gene models in our sequence: initial and final exons, transcription start sites, and UTRs. Give the actual DNA sequences and coordinates of the features that were used by the ab initio gene finders to recognize these; for example, intron donor and acceptor sites, start and stop codons, and TSS motif. Argo's Zoom/Zoom to Bases option will be useful for this question.
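When looking for the signals in the last question, it can help to check candidate positions programmatically as well as by eye. The following Python sketch is only an illustration; the function names are invented for this example and are not part of Argo or any of the gene finders. It extracts the first and last two bases of an intron in transcript orientation, so that the canonical GT donor and AG acceptor can be checked on either strand, and it tests whether a coding sequence begins with ATG and ends with a stop codon. Coordinates are assumed to be 1-based and inclusive on the forward strand, as in GFF.

```python
# Illustrative helpers for inspecting gene signals; all names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_signals(genome, intron_start, intron_end, strand="+"):
    """Return (donor, acceptor) dinucleotides in transcript orientation.

    intron_start and intron_end are 1-based, inclusive, forward-strand
    coordinates of the intron itself (not of the flanking exons).
    """
    intron = genome[intron_start - 1:intron_end]
    if strand == "-":
        intron = reverse_complement(intron)
    return intron[:2].upper(), intron[-2:].upper()


def is_canonical_intron(genome, intron_start, intron_end, strand="+"):
    """True if the intron starts with GT and ends with AG."""
    return intron_signals(genome, intron_start, intron_end, strand) == ("GT", "AG")


def cds_has_start_and_stop(cds):
    """True if a spliced coding sequence begins with ATG and ends with a stop codon."""
    cds = cds.upper()
    return cds.startswith("ATG") and cds[-3:] in STOP_CODONS


# Toy sequence, not taken from the sorghum region.
toy = "AAAGTCCCCAGTTT"
print(intron_signals(toy, 4, 11, "+"))
```

For a gene model on the reverse strand, pass strand="-" with the same forward-strand coordinates shown in the browser; the function reverse-complements the intron before reading off the two ends.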