PLPTH 613
Bioinformatics Applications
Spring 2009
Schedule
Research project
K-State Online

Lab 10. Extracting biological meaning from expression data

Purposes of this exercise

  1. To use Gene Ontology assignments to find overrepresented functional classes in gene lists
  2. To become acquainted with a software tool (ermineJ) for gene-enrichment analysis of expression data
NOTES:
  1. You'll also need to work with Microsoft Excel, so open that too.
  2. Questions for the report are, as usual, in this color.

GO analysis of our wheat expression experiment

In this exercise, several tiresome data-preparation steps have been done for you in advance with Perl scripts and Excel operations. This makes microarray-data analysis using free, non-integrated tools look easier than it really is! But there are many tools, and I have perhaps not selected those that you will find easiest to use with your own data..
  1. Affymetrix has managed to assign GO annotations to only about 1660 of the 62,000 genes on the wheat GeneChip. I downloaded these annotations from Affymetrix's NetAffx pages, but you should not, because the developers of a powerful tool called Blast2GO have assigned annotations to 18,000 of the genes. Explain how they've done this.
  2. We will run a program called ermineJ to perform gene-enrichment tests on our results. The program requires three files and can display a fourth:
    1. a file describing the GO hierarchy. Download it to your computer. It's a gzipped (compressed) file. You needn't unzip it.
    2. a file containing the GO annotations of the genes on your chip. I prepared this for you, using a custom Perl script and Microsoft Excel. To the Blast2GO annotations, which were supplied with gene descriptions but not gene symbols, I added the gene-symbol annotations provided by Affymetrix.. Download it and open it with Excel to see its format: just Probe, Gene symbol, Gene title, and GO categories.
    3. a file containing the probe names and resistance P values from the 2-way ANOVA that we did in Lab 9. Download it and open it with Excel to see its format. All I did was delete all columns but ID_REF and adj p value. Note that I included ALL of the genes, not just the "significant" ones.
    4. To have ermineJ display the expression data for GO-classified gene sets, we can load the raw data, the same text file we used for MeV.
  3.  We're now ready for the analysis. Go the the ermineJ download page, but don't download the software. We'll run it with WebStart, right on the WWW. Click on one of the run ermineJ links (or this one!) to start the program, and wait while the Java files are downloaded. If you see a security prompt, allow the program to run.
  4. The first thing you'll see is a dialog asking for two files. For the GO XML file, browse to the file you retrieved in step 2a. For the Probe annotation file, browse to the file you retrieved in step 2b. After you've entered these, click Start.
  5. Browse through the window that appears. What has ermineJ done with the two files, to make this display? Be sure to examine the Tree tab too. Realize that so far, we know only the frequencies of genes in the various GO classes. We don't know anything about our experimental results because we haven't supplied the statistics yet.
  6. Right-click on one of the categories in the Description column, and select View/modify this gene set. Note the correspondence of the members in the set to the numbers of rows appearing in the popup list. A gene set is defined by ermineJ as any grouping of genes defined by criteria other than the data currently being analyzed. So the only gene sets we have are identical to GO categories, which is reasonable here, since we haven't loaded any expression data yet.
  7. First we'll try an Over-Representation Analysis, which ermineJ calls ORA and which operates only on ranks of your expression statistics, not on their values. Choose Analysis/Run analysis and make sure the ORA radio button is selected. Click Next.
  8. Browse to the file you retrieved in step 2c, which ermineJ calls your Gene score file.
  9. For the Raw data file, load the file you retrieved in step 2d.
  10. Now Click Next through the steps labeled 3, 4, and 5 in the dialog.
  11. When you get to step 6, note that the analysis expects more significant genes to have larger statistical values. Since we've used P values, which behave in the opposite way, we'll have ermineJ transform these by taking their negative log; thus 0.001 will become 3.0. You need not change the default settings, which assume that you're submitting P values. Click Finish.
  12. Your table should now have a new column of P values. Be sure you understand that these are not the P values from your original MeV analysis, but rather for gene enrichment. What probability do they represent? The colors represent FDRs, as explained here. Hold the cursor over a P value entry to see the FDR, which ermineJ also calls the Corrected p.
  13. Double-click on a P value and wait a few seconds for the display of the raw expression values and the MeV-calculated p values. The columns of text are explained here. Notice how few of the genes in the GO category (gene set) you are examining may have low expression p values (high gene scores). Expand the heat-map display horizontally with the slider so that you can view the treatment labels. Do you see any patterns that might correspond to the labels -- for example, xRxx vs. xSxx, or xxxB vs. xxxD?
  14. The ermineJ documentation recommends using the ORA only when you have a natural threshold for gene selection, which we don't here -- we just accepted a default supplied by the program. In other words, ORA will perform its enrichment analysis using only genes whose expression scores pass the chosen threshold. But the enrichment P values are strongly dependent on this threshold, which may make us a bit uneasy.
  15. An alternative, Gene Score Resampling (GSR), takes into account the distribution of gene scores within the gene sets, using a resampling procedure to arrive at an empirical threshold. As nearly as I can determine from the documentation, the algorithm samples many random gene sets of the same size as the one being tested and records a distribution of their composite enrichment scores, against which the observed score is compared. Let's try it.
  16. Choose Analysis/Run analysis and make sure the GSR button is clicked. Then continue clicking Next through all six steps, accepting the defaults. In step 5, what is meant by "genes appearing more than once", or "replicates"?
  17. A new column of P values has appeared to the right in our output table. You can sort the columns of the table by clicking in the column headers. Note that the ORA and GSR P values don't share the same rank order; that is, a gene set that is best under one may not be best under the other. How well do the two columns correspond?
  18. Of course, the whole purpose of an analysis is to discover biological truth. So if you know anything about plant disease resistance, you should be thinking about the GO categories that seem to be enriched in the experiment we analyzed, and whether they seem to fit the plant response patterns you're familiar with!
  19. If you would like to explore GO enrichment analysis further with our wheat dataset, I've saved a different gene-score list for you from the MeV analysis we did on Tuesday. This one contains the P values for the interaction of resistance with tissue rather than the resistance main effect. To load it into ermineJ to replace the first list, you need only choose Analysis/Set gene score file... and browse to the downloaded file. The other three files stay the same and needn't be reloaded into ermineJ.
  20. Now proceed from step 16. You may also like to explore the Correlation and ROC analyses.