Lab 10. Extracting biological meaning from expression data
Purposes of this exercise
- To use Gene Ontology assignments to find overrepresented
functional classes in gene lists
- To become acquainted with a software tool (ermineJ) for
gene-enrichment analysis of expression data
NOTES:
- You'll also need to work with Microsoft Excel, so open that too.
- Questions for the report are, as usual,
in this color.
GO analysis of our wheat
expression experiment
In this exercise, several tiresome data-preparation steps have been
done for you in advance with Perl scripts and Excel operations. This
makes microarray-data analysis using free, non-integrated tools look
easier than it really is! But there are many tools, and I have perhaps
not selected those that you will find easiest to use with your own
data..
- Affymetrix has managed to assign GO annotations to only about
1660 of the 62,000 genes on the wheat GeneChip. I downloaded
these annotations from Affymetrix's NetAffx pages, but you should not, because the
developers of a
powerful tool called Blast2GO have assigned
annotations to 18,000 of the genes. Explain how they've done this.
- We will run a program called ermineJ to perform
gene-enrichment tests on our results. The program requires three files
and can display a fourth:
- a
file describing the GO hierarchy. Download it to your computer.
It's a gzipped (compressed) file. You needn't unzip it.
- a
file containing the GO annotations of the genes on your chip. I
prepared this for you, using a custom Perl script and Microsoft Excel.
To the Blast2GO
annotations, which were supplied with gene descriptions but not
gene symbols, I added the gene-symbol
annotations provided by Affymetrix.. Download it and open it with Excel
to see its format: just Probe,
Gene symbol, Gene title, and GO categories.
- a file
containing the probe names and resistance P values from the 2-way
ANOVA that we did in Lab 9. Download
it and open it with Excel to see its format. All I did was
delete all columns but ID_REF
and adj p value. Note that I
included ALL of the genes, not just the "significant" ones.
- To have ermineJ
display the expression data for GO-classified gene sets, we can load
the raw data,
the same text file we used for MeV.
- We're now ready for the analysis. Go the the ermineJ
download page, but don't download the software. We'll run it with
WebStart, right on the WWW. Click on one of the run ermineJ links (or this
one!) to
start the program, and wait while the Java files are downloaded. If you
see a security prompt, allow the program to run.
- The first thing you'll see is a dialog asking for two
files. For
the GO XML file, browse to the
file you retrieved in step 2a. For the Probe
annotation file, browse to the file you retrieved in
step 2b. After you've entered these, click Start.
- Browse through the window that appears. What has ermineJ done with the two files, to make this display?
Be sure to examine the Tree
tab too. Realize that so far,
we know only the frequencies of genes in the various GO classes. We
don't know anything about our experimental results because we haven't
supplied the statistics yet.
- Right-click on one of the categories in the Description column, and select View/modify this gene set. Note the correspondence of the
members in the set to the numbers of rows appearing in the popup list. A gene
set is defined by ermineJ as any
grouping of genes defined by criteria
other than the data currently being analyzed. So the only gene sets we have are identical to GO categories, which is
reasonable here, since we haven't loaded any expression data yet.
- First we'll try an Over-Representation
Analysis, which ermineJ calls ORA
and which operates only on ranks
of your expression statistics, not on their values.
Choose Analysis/Run analysis
and make sure the ORA radio
button is selected. Click Next.
- Browse to the file you retrieved in step 2c, which ermineJ calls
your Gene score file.
- For the Raw data file,
load the file you retrieved in step 2d.
- Now Click Next through
the steps labeled 3, 4,
and 5 in the dialog.
- When you get to step 6, note that the analysis expects more
significant genes to have larger
statistical values. Since we've used P values, which behave in the
opposite way, we'll have ermineJ transform
these by taking their
negative
log; thus 0.001 will become 3.0. You need not change the default
settings, which assume that you're submitting P values. Click Finish.
- Your table should now have a new column of P values. Be sure you understand that these
are not the P values from your original MeV analysis, but rather
for gene enrichment. What probability do they represent? The colors
represent FDRs, as explained
here. Hold the cursor over a P value entry to see the FDR, which ermineJ also calls the Corrected p.
- Double-click on a P value and wait a few seconds for the display
of the raw expression values and the MeV-calculated p values. The
columns of text are explained here.
Notice how few of the genes in the GO category (gene set) you are
examining may have low expression p values (high gene scores). Expand the heat-map display
horizontally with the slider so that you can view the treatment labels.
Do you see any patterns that might correspond to the labels -- for
example, xRxx vs. xSxx, or xxxB vs. xxxD?
- The ermineJ
documentation recommends using the ORA
only when
you have a natural threshold for gene selection, which we don't
here -- we just accepted a default supplied by the program. In other
words, ORA will perform its
enrichment analysis using only genes whose expression scores pass
the chosen threshold. But the enrichment P values are strongly
dependent on this threshold, which may make us a bit uneasy.
- An alternative, Gene
Score Resampling (GSR), takes into account the distribution of gene
scores within the gene sets, using a resampling procedure to arrive at
an empirical threshold. As nearly as I can determine from the
documentation, the algorithm samples many random gene sets of the same
size as the one being tested and records a distribution of their
composite enrichment scores, against which the observed score is
compared. Let's try it.
- Choose Analysis/Run analysis
and make sure the GSR button
is clicked. Then continue clicking Next
through all six steps, accepting the defaults. In step 5, what is
meant by "genes appearing more than once", or "replicates"?
- A new column of P values has appeared to the right in our output
table. You can sort the columns of the table by clicking in the column
headers. Note that the ORA and
GSR P values don't share
the same rank order; that is, a gene set that is best under one may not
be best under the other. How
well do the two columns correspond?
- Of course, the whole purpose of an
analysis is to discover biological truth. So if you know anything about
plant disease resistance, you should be thinking about the GO
categories that seem to be enriched in the experiment we analyzed, and
whether they seem to fit the plant response patterns you're familiar
with!
- If you
would like to explore GO enrichment analysis further with our wheat
dataset,
I've saved a different gene-score list for you from the MeV analysis we
did on Tuesday. This one contains the P values for the
interaction of resistance with tissue rather than the resistance
main effect. To load it into ermineJ
to replace the first list, you need only choose Analysis/Set gene score file... and
browse to the downloaded file. The other three files stay the same and
needn't be reloaded into ermineJ.
- Now proceed from step 16. You may also like to explore the Correlation and ROC analyses.
|