PLPTH 613
Bioinformatics Applications
Spring 2009
Home page
Schedule
Research project
K-State Online

Lab 9. Expression data analysis I

Purposes of this exercise

  1. To use ANOVA for discovering differentially expressed genes in a multifactor experimental design.
  2. To compare results with those from simpler tests (t tests for single factors). Does it make a difference how you analyze your data?
  3. To get acquainted with the TMEV analysis package.
  4. To get acquainted with GEO, NCBI's public microarray archive.
  5. To show that you can do quite a lot of microarray analysis with Microsoft Excel, including easily computing FDRs.
First, start TMEV 4.3 on your computer by either finding the icon on the desktop or navigating to directory C:\MEV_4.3 and double-clicking on the MS-DOS Batch File icon. Two windows should appear. We will work with the TIGR Multiple Array Viewer window.

NOTE: You will also need to work with Microsoft Excel, so open that too.

1. Reanalyzing a microarray dataset from GEO, and computing FDR

  1. NCBI's Gene Expression Omnibus is a large public repository of microarray experimental data. We will retrieve an experiment conducted by researchers in the KSU Plant Pathology Department and reanalyze the data. The experiment employed the Affymetrix Wheat GeneChip Array. These are not spotted microarray data. However, they could in principle have been collected with spotted-array technology, and the statistical lessons we will learn are the same. The data we will use have already been normalized.
  2. The data are represented by GEO accessions GSM143490 through GSM143526. In the Query/GEO Accession box, enter GSM143490 and click Go.
  3. In the resulting page, you'll see information about this "sample", which corresponds to the data from one array. Before we retrieve the data, scroll down to find the Platform link to GPL3802, and click on it to find out what microarray was used for this experiment.
  4. We would like to work with all the arrays at once, and GEO manages this by grouping together in a "series" the arrays that are supposed to belong together. Look below the Platform link and follow the Series link to GSE6227.
  5. You will need to know that pairs of isolines or near-isogenic lines are two inbred lines that are genetically almost identical except for a genomic region surrounding one gene. They are created from a cross of a "donor" parent to a "recurrent" parent, followed by repeated backcrossing of the progeny to the recurrent parent, under selection for the targeted gene (here, the Lr34 rust-resistance gene). The purpose is to ensure that the only genetic difference between the two lines (the R and the S isoline) is their different alleles at this gene. In this experiment, two isoline pairs were tested, made independently with wheat cultivars Thatcher and Jupateco.
  6. Note: Lr34 has been cloned since this experiment was done, and the gene (an ABC transporter) is actually one of the genes on the wheat Affy chip. It will be interesting to see whether it is differentially expressed in these data!
  7. Examine the Summary field of the Series page. We'll work with this dataset in some detail, so be sure that you understand the design and goal of the experiment. To see the descriptions of all 12 treatment combinations (each replicated 3x for a total of 36 samples), on the GEO Series page click the Show all... link under the Samples (36) label.  The variables are: isoline, resistance vs. susceptibility, inoculation with pathogen vs. mock inoculation, and tissue where inoculation was done (leaf distal part, or tip, vs. base). I have composed these into this table showing each treatment combination and added row numbers. We will need these in order to create our statistical tests.
  8. To save time, I've prepared the data file for you. Right-click on this link and save it to your computer. (If you were going to retrieve the data from GEO, you would click on Series matrix file(s), and then on GSE6227_series_matrix.txt.gz to download the file).
  9. Now in TMEV's Multiple Array Viewer window, choose File/Load Data. When the Expression File Loader window appears, make sure that the file type showing is Tab Delimited, Multiple Sample Files; if not, select this type from the Select File Loader menu at top. Then use the Browse button to locate your file, and open it.
  10. In the Expression Table at right, you'll need to click on the top value in the second column so that TMEV knows where the data start. Then click the Load button.
  11. We now have 36 columns, each representing one of the 36 samples. There are several analyses we could perform on these rich data, but let's repeat one of those reported by the experimenters. They asked whether the Jupateco resistant isoline showed different expression from the susceptible, when both isolines were mock-inoculated at the leaf base -- that is, when the other three of the four variables in the experiment were held constant. The simplest way to perform this is with a t test, for each of the 61,127 genes, comparing the three replicates for either of these two treatment combinations. For brevity, I'll call these classes henceforth.
  12. Under the Statistics button in the top toolbar, select TTEST. When the dialog comes up, click the Between subjects tab at the top, since we will be comparing expression responses between classes.
  13. For each gene, TMEV's t test will compare its mean expression value over all samples in group A against its mean over all samples in group B. Notice the interface that TMEV offers for assigning samples to these groups. It would be annoying for us to conduct our tests by clicking on these little buttons in this window.
  14. Fortunately TMEV allows us to compose a "grouping" file in a simple format. Example JRMB-JSMB specifies a test for samples 1, 2, and 3 against samples 7, 8, and 9. Refer to the design list and you'll see that this is a contrast of Jupateco R vs. S isolines, mock-inoculated on leaf base. We can write this as JRMB-JSMB. As you can see, to compose the grouping file you simply write 36 tab-separated numbers, with a "1" for the samples in class A, a "2" for those in class B, and a "3" for the samples that are not to be tested.
  15. After right-clicking on the JRMB-JSMB link to save the grouping file to your computer, click the Load Grouping button and find your file. After TMEV has read it in, click on the tab P-Value Parameters and notice that there are two options for assigning P values. For now we will leave this setting at the default, p-values based on t-distribution. Click OK to run the analysis. In a few seconds the results will appear in the panel at left. Clicking on any of the little "hinge" icons allows you to open up the various items.
  16. Click on the Volcano icon and examine the plot. Does it look like the plot we saw in lecture? Why or why not, do you think?
  17. In the left (analysis) panel, explore the results under TTests (1). It's best to click only on Significant Genes labels, since the figures for the Non-significant Genes take a long time to appear and don't show much. If you click on Cluster Information/Results you'll see how many genes were declared significant in this test; 174. What criterion did TMEV use to declare significance? You can determine this in two ways.
  18. Choose Table Views/Significant Genes. Right-click in the table and choose Save Cluster to save this table to an appropriately named text file (it's best if you create a new directory for these result files).
  19. While the original authors performed and reported some of the tests described above, we can do better by taking the experimental design into account. We aren't mainly interested in the effect of resistance under leaf-base inoculation independently from its effect under leaf-tip inoculation -- we want to know its effect after inoculation site has been accounted for. That means that we should somehow combine these tests. We really should fit all the effects in the same linear model (and could do it in R or SAS), but TMEV provides only a two-way ANOVA. So let's make the reasonable assumption that the Jupateco and Thatcher isolines can be combined, and use resistance status and tissue as the two factors to test. We'll consider only the mock-inoculation (M) treatment, since this was applied to both isolines.
  20. Choose Two-factor ANOVA from the Statistics dropdown menu. In the first dialog, you'll be asked to provide the names of factors and the number of levels for each, so enter names "Resistance" and "Tissue" for the factor names and 2 for each of the levels, and click Next. Be sure you understand what the two levels of each of these factors are! (This is not a question for the report).
  21. You will now see a dialog with an interface for assigning samples to factors and levels (which TMEV calls "groups"). Notice that the top left panel lets you classify samples into Resistance groups 1 and 2 and the top right panel, into Tissue groups 1 and 2.
  22. I hope you see that in this test we will include 8 classes, each with 3 replicate samples: JRMB, TRMB, JSMB, TSMB, JRMD, TRMD, JSMD, TSMD. All have the M for the mock-inoculation treatment. To save time I've again created a grouping file. After you Load this, examine the sample assignments in the dialog window and convince yourself that they make sense. For your report, create the grouping design for the following test. Suppose you've found that there is no differential gene response depending on the inoculation tissue, and have decided to test the effects of mock vs. real inoculation and resistance vs. susceptibility, using a 2-way ANOVA.
  23. Accept TMEV's default parameters for the ANOVA and click OK. After a while the results will appear in the Analysis Panel at left of the MultiArrayViewer main screen.
  24. Click on Table Views/Resistance Significant. Take a minute to understand the tables. The three columns of raw and adjusted P values refer to the three effects tested for each gene (the main effects of your two factors, and their interaction effect). To see the full column labels, you can drag the cell boundaries in the column header, just as in Excel. Notice that all the values in the first column of P values, labeled Adj p-values (Resistance), are < 0.01, whereas the values in the other columns are mostly much higher. Now, as in step 17, right-click and Save Cluster to save to a file. In next week's lab we'll also need the Resistance Nonsignificant cluster, but don't bother to save it today because your computer won't retain it. It will be provided in the lab.
  25. Let's now see how easy it is to calculate a FDR. Open your file in Excel (it's tab-delimited, so Excel will know how to open it). You'll find it easier to work with this large table if you divide the screen into four panels. Do this by dragging the horizontal and vertical dividers (found at the right and top ends of the horizontal and vertical scroll bars) to the middles of their scroll bars.
  26. Select all the cells in your table. You can do this by clicking with the mouse in the top left cell (ID_REF) and releasing the mouse button. Then use the mouse to scroll your bottom right panel both horizontally and vertically until the bottom right data cell of the table comes into view. Hold down the Shift key, and click with the mouse in that cell.
  27. Choose Data/Sort... In the dialog, be sure that the My list has header row button is checked. Then choose to sort ascendingly on column Adj. p values (Resistance). This is the ranking step we saw in lecture.
  28. Insert a new row at the top of your worksheet by selecting the rows in the left margin of the worksheet and choosing Insert/Row. Now insert four new columns after the first P-value column, by selecting the four columns following it and choosing Insert/Columns. In the top cells of these columns, type (or just copy and paste!) the entries shown below. The 61127 is the number of genes on the array and 0.1 is the target FDR threshold (the q value) that we'll start with.

    61127 0.1
    Rank Fraction FDR Accepted
  29. In the Rank column, enter the rank of every gene in the table. Do this by typing a 1 in the first cell under the heading Rank, and pressing Enter. Then select the cell, Shift-click in the bottom cell of the column, and choose Edit/Fill/Series and click OK.
  30. Now we'll enter some formulas. To make this easier, change the reference style of this worksheet by selecting menu item Tools/Options, then choosing the General tab and clicking on R1C1 reference style, then OK.
  31. In the first cell of the Fraction column, enter this formula (copy and paste it if you like)
    =RC[-1]/R1C
    and hit Enter. In the first cell of the FDR column, enter formula
    =RC[-1]*R1C
    and hit Enter. In the first cell of the Accepted column, enter formula
    =N(RC[-1]>=RC[-4])
    and hit Enter. Now select these three cells, Shift-click on the bottom right cell of the three columns, and choose Edit/Fill/Down (or just type Ctrl-D).
  32. Take a minute to think about this. Remember the decision formula from lecture: p < (i/N) * q. The Rank column holds i. The Fraction column holds i/N. The FDR column holds (i/N) * q. Finally, the Accepted column shows the result of the tested inequality: a 1 if the p value of your gene passes the threshold and a 0 if not. If you've done everything the way I did it, you should have about 15 1s in your column. If you were reporting these results you'd say that these genes were found significantly DE at a q value of 0.1.
  33. You may be wondering what to do if a gene with 0 (fails the FDR test) has a lower p value (is ranked higher in the gene list) than an accepted gene. The answer is that you accept all the genes above the lowest-ranked gene in the list.
  34. Notice that we've prepared the formulas so that the FDR stats will be updated if we change the parameters. What happens if you replace your 0.1 q value with 0.2, or 0.01? The original analysts reported that 60 genes were differentially expressed. Can you find the q value to which this number corresponds?
  35. The Lr34 probe is TaAffx.77499.1.S1_at (information kindly supplied by Dr. Bob Bowden, KSU Plant Pathology, who also originated the expression data we're using in this exercise). Can you locate this in your MeV output? Is there any evidence of differential expression across the conditions that were tested in this experiment? Can you propose an explanation for your observation, based on the Science paper (a PDF is also provided for you in KSOL)?

2. Comparing the ANOVA to the t test results

  1. Let's see whether we have any differentially expressed genes in common between the simpler t test and the more elaborate ANOVA. After all, they are testing one factor in common under mock inoculation. Open both of your saved files in Excel.
  2. We will use the Excel function vlookup() to see whether some of the same genes are common to both tests. (It would usually be easier to do this with a dedicated microarray software package, but we can't learn everything in one lab!).
  3. In one of your two output files, insert a blank column. You can do this by clicking on the header of column 1 (or A) and choosing Insert/Column. In the first cell, type the label Match. In the second empty cell (corresponding to the first row of output data), type

    = vlookup(

    and, without hitting the Enter key, click with the mouse in the second cell of the ID_Ref column. Now type a comma. You should now have

    = vlookup(RC[2],

    or, if you're using A1 cell references,

    = vlookup(B2,

  4. Still without hitting Enter, navigate to the worksheet of your other result file. Click in the first cell of the ID_Ref column in that worksheet, then scroll down and, holding down the Shift key, click in the last cell. Your formula in the formula bar will now be

    = vlookup(RC[2],[other_output_file]other_worksheet!R2C2:R175C2

  5. Finish by typing

    , 1, false)

  6. and hit Enter. You should see in the cell where you entered the formula either an ID equal to the one in the ID_Ref column, or the letters #N/A.
  7. What have you just done? You've told Excel to try to find the first ID in the first column of the second table, and return the value in column 1 of that table if it found an exact match with the first ID. In this case the second table has only one column, so you're returning the very value that you matched -- but this vlookup() function is very useful for extracting data from a multiple-column table too.
  8. Copy the formula to the rest of the cells in your column as follows: select the cell containing your formula, then Shift-click on the last cell in the same column (the last row of the table). Choose Edit/Fill/Down or just type Ctrl-D. This will run the same lookup on each gene of your first result file.
  9. While these cells are still selected, choose Edit/Copy and then Paste Special. In the dialog, choose to Paste Values only. This will get rid of the formulas. Now choose Edit/Replace, and replace #N/A with nothing.
  10. You can now select the entire table, then choose Data/Sort..., then choose to sort on your Match column. This will place all the matches at the top where you can see them. How well do the two lists match? Can you draw any conclusion from this observation?