PLPTH 613
Bioinformatics Applications
Spring 2009


Schedule
Research project
K-State Online

Lab 2. Data manipulation without programming

Note: There's no separate problem set this week. You'll turn in the results of this lab as your homework.
Some of the solutions we demonstrated in lecture may be helpful for these problems, especially problem 2..

1. Calculating statistics on tables

The data file for this exercise describes the results of randomly sequencing many short pieces ofShort-read alignments in EagleView DNA from a sorghum cultivar, using a "next-generation" technology we'll study in a later week. Because the genome is represented many times in the DNA sample, some regions get sequenced more than once. Analysts can quantify this "coverage depth" by first aligning the sequencing "reads" against the full genome sequence of a different sorghum cultivar that has already been sequenced (known as the reference genome. In sorghum, this is accession BTx623 and is described here), and then simply counting, at each position of the reference sequence, the alignments that overlap it. Alignments look like this (the reference sequence is shown just above the horizontal line). Can you see why the coverage depth at the first base on the left edge would be counted as 25?

But this deep coverage is NOT the rule across the genome. Unless you sequence enough DNA to cover the genome many times over, you'll find that a lot of the genome doesn't get hit by a read and has coverage 0. Then a plot of coverage depth across 30 Mb might look like this:
Short-read coverage distribution plot
We will do more with these data later, but for now let's use Excel to characterize them. On opening the file with Microsoft Excel, you'll see a note describing the content. It's a good habit to write such descriptions inside your own data files, especially if someone else may want to work with them.

In the following exercises, you must use formulas and relative references, as shown in lecture, to answer the questions. For now it isn't the answers that are important, but the efficiency with which you calculate them. Except in question 3, you must use formulas that need to be written only once, and are then copied or filled into the other cells in which they are needed. You may introduce extra rows above the table (use menu Insert/Rows) to hold any calculated results.

On each chromosome,
  1. how many total bases are described in its column? Use the Excel formula sum().
  2. how many total bases were covered by at least one alignment?
  3. what proportion of all bases were covered by at least one alignment?
For each accession, answer the same questions, but calculate over all chromosomes. Use columns to the right of the table. This is the only time you'll need to use more than one formula (three different sum() formulas).
  1. total bases
  2. bases covered to depth of 1 or more
  3. proportion covered at least once

  4. For each depth in each accession, calculate the proportion of bases covered to that depth. Use columns to the right of those you've used so far.
  5. Calculate the average coverage depth for each accession. If you're not sure how to do this, here's the idea: add up all the depths at every covered position and divide by the total bases covered at least once, a number you calculated in 5). How do you "add up all the depths"? Example: if  92664 bases are covered to depth 13, the total depth contribution of these bases is 13 * 92664 = 1204632. Again, remember, write only ONE formula and fill it into the cells needed for the new columns to the right. Be sure to label the columns resulting from your calculations, so that any reader can understand them.
What to turn in: Turn in the table with your insertions highlighted like this. Be sure to place labels over the statistics you've calculated, so that an analyst can see quickly what they represent.

2. Merging and comparing tables

In this exercise we prepare a file for a gene-enrichment analysis we'll do later on the results of a microarray experiment. We start with a file of 18,000 gene probes having the format shown in the table below, where the first column holds the probe name, the second describes the gene whose expression the probe measures, and the third column contains Gene Ontology (GO) identifiers. As we'll see later, these correspond to specific molecular roles that have been postulated for the gene. We won't dwell on the details for now.

TaAffx.108608.1.S1_at
dash complex subunit dad3
GO:0031134 GO:0042729 GO:0007052
Ta.6141.2.S1_x_at
alpha-l-arabinofuranosidase c-terminus family expressed
GO:0046373 GO:0046556 GO:0009117
TaAffx.23169.2.S1_at
had superfamily (subfamily ig)5-nucleotidase
GO:0005739 GO:0008253

The important thing is that we need to add a fourth column to this file, containing the gene symbols for these genes, and these symbols are in another file in this format:
Ta.10047.1.S1_s_at
WHAP6 /// WHAP7 /// WHAP9 CCAAT-box transcription factor complex WHAP6 /// CCAAT-box transcription factor complex WHAP7 /// CCAAT-box transcription factor complex WHAP9 GO:0006355  GO:0005634  GO:0003700 
Ta.10047.2.S1_at WHAP9 CCAAT-box transcription factor complex WHAP9 GO:0006355  GO:0005634  GO:0003700 
Ta.10048.3.S1_at --- ---
Ta.1005.1.S1_at --- ---
Ta.10053.1.A1_at --- ---
Ta.10054.1.S1_at ABFB ABA response element binding factor GO:0006355  GO:0005634  GO:0003677  GO:0003700  GO:0043565  GO:0046983 

Here, the first column contains probe names; the second, gene symbols (some unknown and replaced with ---); the third, gene descriptions, and the fourth, GO identifiers.
So why don't we just use this file for the analysis instead of the first one? You can't tell from these small samples, but in fact the second file contains GO annotation for only 1660 of the 61,290 probes on the microarray (all of which are listed in the file).

Your task is to obtain from this second file the gene symbols for the 18,000 probes in the first file, and add them to the rows containing those probes in the first file. Do NOT turn in either file as your solution. Just describe what you did, in such a way that anyone familiar with basic Excel operations could follow your directions to accomplish this task.

3. Using relative references

In lecture, we showed how the expected values in a frequency table could be filled out with a single Excel formula. Here is the table.
  1. Calculate the full chi-square statistic from these data, recalling that the value for each cell is (O - E)2/E and the sum of these values over all cells is the statistic.
  2. Use Excel function chidist() to calculate the P value for this statistic, recalling that the degrees of freedom for a m x n table are (m - 1)(n - 1).
  3. Verify this P value with Excel function chitest(), which requires you to select both the observed and the expected tables.
  4. Finally, verify the chi-square statistic with function chiinv() applied to the P value.
This file should be turned in. Be sure to label the tables and statistics that you add to the worksheet.