Lab 9. Expression data analysis I
Purposes of this exercise
- To use ANOVA for discovering differentially expressed genes
in a
multifactor experimental design.
- To compare results with those from simpler tests (t tests for single factors). Does
it make a difference how you analyze your data?
- To get acquainted with the TMEV analysis package.
- To get acquainted with GEO, NCBI's public microarray
archive.
- To show that you can do quite a lot of microarray analysis
with
Microsoft Excel, including easily computing FDRs.
First, start TMEV 4.3 on your computer by either finding the icon on
the desktop or navigating to directory C:\MEV_4.3
and double-clicking on
the MS-DOS Batch File icon.
Two windows should
appear. We will work with the TIGR Multiple Array Viewer window.
NOTE: You will also need
to
work with Microsoft Excel, so open that too.
1. Reanalyzing a microarray dataset from GEO, and computing
FDR
- NCBI's Gene
Expression Omnibus
is a large public repository of microarray experimental data. We will
retrieve an experiment conducted by researchers in the KSU Plant
Pathology Department and reanalyze the data. The experiment employed
the Affymetrix
Wheat GeneChip Array.
These are not spotted microarray data. However, they could in principle
have been collected with spotted-array technology, and the statistical
lessons we will learn are the same. The
data we will use have already been normalized.
- The data are represented by GEO accessions GSM143490
through
GSM143526.
In the Query/GEO Accession
box, enter
GSM143490 and click Go.
- In the resulting page, you'll see information about this
"sample",
which corresponds to the data from one array. Before we retrieve the
data,
scroll down to find the Platform link
to GPL3802, and click on it to find out what microarray was used for
this
experiment.
- We would like to work with all the arrays at once, and GEO
manages
this by grouping together in a "series" the arrays that are supposed to
belong together. Look below the Platform
link and follow the Series
link
to GSE6227.
- You will need to know that pairs of isolines or near-isogenic lines are two inbred
lines that are genetically almost identical except for a genomic region
surrounding one gene. They are created from a cross of a "donor"
parent to a "recurrent" parent, followed by repeated backcrossing of
the
progeny to the recurrent parent, under selection for the targeted gene
(here,
the Lr34 rust-resistance
gene).
The purpose is to ensure that the only genetic difference between the
two lines (the
R and the S isoline) is their different alleles at this gene.
In this experiment, two isoline pairs were tested, made independently
with wheat cultivars
Thatcher and Jupateco.
- Note: Lr34 has been cloned since this
experiment was done, and the gene (an ABC transporter) is actually one
of the genes on the wheat Affy chip. It will be interesting to see
whether it is differentially expressed in these data!
- Examine the Summary field of the Series
page. We'll
work with this dataset in some detail, so be sure that you understand
the
design and goal of the experiment. To see the descriptions of all 12
treatment
combinations (each replicated 3x for a total of 36 samples),
on the GEO Series page click the Show all... link under
the Samples (36) label. The variables are: isoline,
resistance
vs. susceptibility, inoculation with pathogen vs. mock
inoculation,
and tissue where inoculation was done (leaf distal part, or
tip, vs.
base). I have composed these into this
table showing each treatment combination and added row numbers. We
will
need these in order to create our statistical tests.
- To save time, I've prepared the data file for you.
Right-click on this
link
and save it to your computer. (If you were going to retrieve the data
from GEO, you would click
on Series matrix file(s), and
then
on GSE6227_series_matrix.txt.gz
to download the file).
- Now in TMEV's Multiple Array Viewer window, choose File/Load
Data. When the Expression File Loader window appears, make
sure that the file type showing is Tab Delimited, Multiple Sample
Files; if not, select this type from the Select File Loader menu at top. Then
use the Browse button to
locate your file, and open it.
- In the Expression Table at right, you'll need to
click on the top value in the second column so that TMEV knows where
the data
start. Then click the Load button.
- We now have 36 columns, each representing one of the 36
samples. There are
several analyses we could perform on these rich data,
but let's repeat one of those reported by the experimenters. They asked
whether
the Jupateco resistant isoline showed different expression from the
susceptible, when both isolines were mock-inoculated at the leaf base
-- that is, when the other three of the four variables in the
experiment were
held constant. The simplest way to perform this is with a t test, for each of the 61,127
genes, comparing the three replicates for either of these two treatment
combinations. For brevity, I'll call these classes henceforth.
- Under the Statistics
button in the top toolbar, select TTEST.
When the dialog comes up, click the Between
subjects tab at the top, since we will be comparing expression
responses
between classes.
- For each gene, TMEV's t
test will compare its mean expression value over all samples in group A
against its mean over all samples in group B. Notice the interface that
TMEV offers for assigning samples to these groups. It would be annoying
for us to conduct our tests by clicking on these little buttons in this
window.
- Fortunately TMEV allows us to compose a "grouping" file in
a
simple format. Example JRMB-JSMB
specifies a test for samples 1, 2, and 3 against samples 7, 8, and 9.
Refer
to the design
list and
you'll see that this is a contrast of Jupateco R vs.
S isolines, mock-inoculated on leaf base. We can write this as
JRMB-JSMB.
As you can see, to compose the grouping file you simply write 36
tab-separated
numbers, with a "1" for the samples in class A, a "2" for those in
class
B, and a "3" for the samples that are not to be tested.
- After
right-clicking on the JRMB-JSMB
link to save the grouping file to your computer, click the Load Grouping button and find your
file. After
TMEV has read it in, click on the tab P-Value
Parameters and notice that there are two options for assigning P
values. For now we will leave this setting at the default, p-values based on t-distribution. Click
OK
to run the analysis. In a few seconds the results will appear in the
panel at left. Clicking on any of the little "hinge" icons allows you
to open up the various items.
- Click on the Volcano
icon and examine the plot. Does it look like the plot we saw in
lecture? Why or why not, do you think?
- In the left (analysis) panel, explore the results under TTests (1). It's best to click only
on Significant Genes labels,
since the figures for the Non-significant
Genes take a long time to appear and don't show much. If you
click on Cluster Information/Results
you'll see how many genes were declared significant in this test; 174. What criterion
did TMEV use to declare significance? You can determine this in two
ways.
- Choose Table
Views/Significant
Genes. Right-click in the table and choose Save Cluster to save this table to
an
appropriately named text file (it's best if you create a new directory
for
these result files).
- While
the original authors performed and reported some of the tests described
above, we can do better by taking the experimental design into account.
We aren't mainly interested in the effect of resistance under leaf-base
inoculation independently from its effect under leaf-tip inoculation --
we want to know its effect after
inoculation site has been accounted for. That means that we
should somehow combine these tests. We really should fit all the
effects in the same linear model (and could do it in R or SAS), but
TMEV provides only a two-way ANOVA. So let's make the reasonable
assumption that the Jupateco and Thatcher isolines can be combined, and
use resistance status and tissue as the two factors to test. We'll
consider only the mock-inoculation (M) treatment, since
this was applied to both isolines.
- Choose Two-factor ANOVA from the Statistics dropdown menu. In the
first dialog, you'll be asked to provide the names of factors and the
number of levels for each, so enter names "Resistance" and "Tissue" for
the factor names and 2 for each of the levels, and click Next. Be sure you understand what the two levels of each
of these factors are! (This is not a question for the report).
- You will now see a dialog with an interface for assigning
samples
to factors and levels (which TMEV calls "groups"). Notice that the top
left panel lets you classify samples into Resistance groups 1 and 2 and the
top right panel, into Tissue
groups 1 and 2.
- I hope you see that in this test we will include 8 classes,
each
with 3 replicate samples: JRMB, TRMB, JSMB, TSMB, JRMD, TRMD, JSMD,
TSMD. All have the M for the
mock-inoculation treatment. To save time I've again created a grouping
file.
After you Load this, examine
the sample assignments in the dialog window and convince yourself that
they make sense. For
your report, create the grouping design for the following test. Suppose
you've found that there is no differential gene response depending on
the inoculation tissue, and have decided to test the effects of mock
vs. real inoculation and resistance vs. susceptibility, using a 2-way
ANOVA.
- Accept TMEV's default parameters for the ANOVA and click OK.
After a while the results will appear in the Analysis Panel at left of
the MultiArrayViewer main screen.
- Click on Table Views/Resistance Significant. Take
a
minute to understand the tables. The three columns of raw and adjusted
P
values refer to the three effects tested for each gene (the main
effects of your two factors, and their interaction effect). To see the
full column
labels, you can drag the cell boundaries in the column header, just as
in
Excel. Notice that all the values in the first column of P values,
labeled Adj p-values (Resistance), are < 0.01, whereas the
values in
the other columns are mostly much higher. Now, as in step 17,
right-click
and Save Cluster to save to a file. In next week's lab we'll
also need the Resistance
Nonsignificant cluster, but don't bother to save it today
because your computer won't retain it. It will be provided in the lab.
- Let's
now see how easy it is to calculate a FDR. Open your file in Excel
(it's tab-delimited, so Excel will know how to open it).
You'll find it easier to work with this large table if you divide the
screen into four panels. Do this by dragging the
horizontal and vertical dividers (found at the right and top ends of
the horizontal and vertical scroll bars) to the middles of their scroll
bars.
- Select all the cells in your table. You can do this by clicking with
the mouse in the top left
cell (ID_REF)
and releasing the mouse button. Then use the mouse to scroll your
bottom right panel both horizontally and vertically until the bottom
right data cell of the table comes
into view. Hold down the Shift key, and click with the mouse in that
cell.
- Choose Data/Sort...
In
the dialog, be sure that the My list
has header row button is checked. Then choose to sort ascendingly on column Adj. p values (Resistance). This is
the ranking step we saw in lecture.
- Insert a new row at the top of your worksheet by selecting
the
rows in the left margin of the worksheet and choosing Insert/Row. Now insert four new
columns after the first P-value column, by selecting the four columns
following it and choosing Insert/Columns.
In the top cells of these columns, type (or just copy and paste!) the
entries shown below. The 61127 is the
number of genes on the array and 0.1 is the target FDR threshold (the q
value) that we'll start with.
|
61127 |
0.1 |
|
| Rank |
Fraction |
FDR |
Accepted |
- In the Rank
column,
enter the rank of every gene in the table. Do this by typing a 1 in the
first cell under the heading Rank,
and pressing Enter. Then
select the cell, Shift-click in the bottom cell of the column, and
choose Edit/Fill/Series and
click OK.
- Now we'll enter some formulas. To make this easier, change
the
reference style of this worksheet by selecting menu item Tools/Options, then choosing the General tab and clicking on R1C1 reference style, then OK.
- In the first cell of the Fraction
column, enter this formula (copy and paste it if you like)
=RC[-1]/R1C
and hit Enter.
In
the first cell of the FDR column,
enter formula
=RC[-1]*R1C
and hit Enter. In the first
cell of the Accepted column,
enter formula
=N(RC[-1]>=RC[-4])
and hit Enter. Now select
these three cells, Shift-click on the bottom right cell of the three
columns, and choose Edit/Fill/Down
(or just type Ctrl-D).
- Take a minute to think about this. Remember the decision
formula
from lecture: p < (i/N) * q. The Rank column holds i.
The Fraction column holds i/N.
The FDR column holds (i/N)
* q. Finally, the Accepted column shows the result of
the tested inequality: a 1 if
the p value of your gene
passes the threshold and a 0
if not. If you've done everything the way I did it, you should have
about 15 1s in your column. If
you were reporting these results you'd say that these genes were found
significantly DE at a q value of 0.1.
- You may be wondering what to do if a gene with 0 (fails the
FDR
test) has a lower p
value (is ranked higher in the gene list) than an accepted gene. The
answer is that you accept all the genes above the lowest-ranked gene in
the list.
- Notice that we've prepared the formulas so that the FDR
stats
will be updated if we change the parameters. What happens if you replace your 0.1
q value with 0.2, or 0.01? The
original analysts reported that 60 genes were differentially expressed.
Can you find the q value to which this
number corresponds?
- The Lr34 probe is TaAffx.77499.1.S1_at
(information kindly supplied by Dr. Bob Bowden, KSU Plant Pathology,
who also originated the expression data we're using in this exercise).
Can you locate this in your MeV output? Is there any evidence of
differential expression across the conditions that were tested in this
experiment? Can you propose an explanation for your observation, based
on the Science
paper (a PDF is also provided for you in KSOL)?
2. Comparing the ANOVA to the t test results
- Let's see whether we have any differentially expressed
genes in
common between the simpler t test and the more elaborate ANOVA.
After all, they are testing one factor in common under mock
inoculation. Open both of your saved files in Excel.
- We will use the Excel function vlookup() to see
whether
some of the same genes are common to both tests. (It would usually be
easier to do this with a dedicated microarray software package, but we
can't learn everything in one lab!).
- In one of your two output files, insert a blank column. You
can
do this by clicking on the header of column 1 (or A) and choosing Insert/Column.
In the first cell, type the label Match. In the second empty
cell (corresponding to the first row of output data), type
= vlookup(
and, without hitting the Enter key, click with
the
mouse in the second cell of the ID_Ref column. Now type a
comma. You should now have
= vlookup(RC[2],
or, if you're using A1 cell references,
= vlookup(B2,
- Still without hitting Enter, navigate to the
worksheet of
your other result file. Click in the first cell of the ID_Ref
column in that worksheet, then scroll down and, holding down the Shift
key, click in the last cell. Your formula in the formula bar will now be
=
vlookup(RC[2],[other_output_file]other_worksheet!R2C2:R175C2
- Finish by typing
, 1, false)
- and hit Enter. You should see in the cell where you
entered the formula either an ID equal to the one in the ID_Ref
column, or the letters #N/A.
- What have you just done? You've told Excel to try to find
the
first ID in the first column of the second table, and return the value
in column 1 of that table if it found an exact match with the first ID.
In this case the second table has only one column, so you're returning
the very value that
you matched -- but this vlookup() function is very useful for
extracting
data from a multiple-column table too.
- Copy
the formula to the rest of the cells in your column as follows: select
the cell containing your formula, then Shift-click on the last cell in
the same column (the last row of the table). Choose Edit/Fill/Down
or just type Ctrl-D. This will run the same lookup on each gene
of your first result file.
- While these cells are still selected, choose Edit/Copy
and
then Paste Special. In the dialog, choose to Paste Values
only. This will get rid of the formulas. Now choose Edit/Replace,
and
replace #N/A with nothing.
- You can now select the entire table, then choose Data/Sort...,
then choose to sort on your Match column. This will place all
the matches
at the top where you can see them. How well do the
two lists match? Can you draw any conclusion from this observation?
|