Lab 2. Data manipulation without programming
Note: There's no separate problem set this week. You'll turn in the results of this lab as your homework.
Some of the solutions we demonstrated in lecture may be helpful for these problems, especially problem 2..
1. Calculating statistics on tables
The data file for this
exercise
describes the results of randomly sequencing many short pieces of DNA
from a sorghum cultivar, using a "next-generation" technology we'll
study in a later week. Because the genome is represented many times in
the DNA sample, some regions get sequenced more than once. Analysts can
quantify this "coverage depth" by first aligning the sequencing "reads"
against the full genome sequence of a different sorghum cultivar that
has already been sequenced (known
as the reference genome. In sorghum, this is accession BTx623 and is described here), and then
simply counting, at each position of the
reference sequence, the alignments that overlap it. Alignments look
like this (the reference sequence is shown just above the
horizontal line). Can you see why the coverage depth at the first base
on the left edge would be counted as 25?
But
this deep coverage is NOT the rule across the genome. Unless you
sequence enough DNA to cover the genome many times over, you'll find
that a lot of the genome doesn't get hit by a read and has coverage 0.
Then a plot of coverage depth across 30 Mb might look like this:

We will do more with these data later, but for now let's use Excel to
characterize them. On opening the file with Microsoft Excel, you'll see
a note describing the content. It's a good habit to write such
descriptions inside your own data files, especially if someone else may
want to work with them.
In the following exercises, you must use formulas and relative
references, as shown in lecture, to answer the questions. For now it
isn't the
answers that are important, but the efficiency with which you
calculate them. Except in question 3, you must use formulas that need
to be
written only once, and are
then copied or filled into the other cells
in which they are needed. You may introduce extra rows above the table
(use menu Insert/Rows) to hold
any calculated results.
On each chromosome,
- how many total bases are described in its column? Use the
Excel formula sum().
- how many total bases were covered by at least one alignment?
- what proportion of all bases were covered by at least one
alignment?
For each accession, answer the same questions, but calculate over all
chromosomes. Use columns to the right of the table. This is the only
time you'll need to use
more than one formula (three different sum()
formulas).
- total bases
- bases covered to depth of 1 or more
- proportion covered at least once
- For each depth in each accession,
calculate the proportion of bases covered to that depth. Use columns to
the right of those you've
used so far.
- Calculate
the average coverage depth for each accession. If you're not sure how
to do this, here's the idea: add up all the depths at every covered
position and divide by the total bases covered at least once, a number
you calculated in 5). How do you "add up all the depths"? Example:
if 92664 bases are covered to depth 13, the total depth
contribution of these bases is 13 * 92664 = 1204632. Again, remember,
write only ONE formula and fill it into the cells needed for the new
columns to the right. Be sure to label the columns resulting from your
calculations, so that any reader can understand them.
What to turn in: Turn in
the table with your insertions highlighted
like this. Be sure to place labels over the statistics you've
calculated, so that an analyst can see quickly what they represent.
2. Merging and comparing tables
In this exercise we prepare a file for a gene-enrichment analysis we'll
do later on the results of a microarray experiment. We start with a
file of 18,000 gene
probes having the format shown in the table below,
where the first column holds the probe name, the second describes the
gene whose expression the probe measures, and the third column contains
Gene Ontology (GO) identifiers. As we'll see later, these correspond to
specific molecular roles that have been postulated for the gene. We
won't dwell on the details for now.
TaAffx.108608.1.S1_at
|
dash
complex subunit dad3
|
GO:0031134
GO:0042729
GO:0007052
|
Ta.6141.2.S1_x_at
|
alpha-l-arabinofuranosidase
c-terminus family expressed
|
GO:0046373
GO:0046556
GO:0009117
|
TaAffx.23169.2.S1_at
|
had
superfamily (subfamily
ig)5-nucleotidase
|
GO:0005739
GO:0008253
|
The important thing is that we need to add a fourth column to this
file, containing the gene symbols for these genes, and these symbols
are in another
file in this format:
Ta.10047.1.S1_s_at
|
WHAP6
/// WHAP7 ///
WHAP9 |
CCAAT-box
transcription factor complex WHAP6 /// CCAAT-box transcription factor
complex WHAP7 /// CCAAT-box transcription factor complex WHAP9 |
GO:0006355 GO:0005634 GO:0003700 |
| Ta.10047.2.S1_at |
WHAP9 |
CCAAT-box
transcription factor complex WHAP9 |
GO:0006355 GO:0005634 GO:0003700 |
| Ta.10048.3.S1_at |
--- |
--- |
|
| Ta.1005.1.S1_at |
--- |
--- |
|
| Ta.10053.1.A1_at |
--- |
--- |
|
| Ta.10054.1.S1_at |
ABFB |
ABA response element
binding factor |
GO:0006355 GO:0005634 GO:0003677 GO:0003700 GO:0043565 GO:0046983 |
Here, the first column contains probe names; the second, gene symbols
(some unknown and replaced with ---); the third, gene descriptions, and
the fourth, GO identifiers.
So why don't we just use this file for the
analysis instead of the first one? You can't tell from these small
samples, but in fact the second file contains GO annotation for only
1660 of the 61,290 probes on the microarray (all of which are listed in
the file).
Your task is to obtain from this second
file the gene symbols for the 18,000 probes in the first file, and
add them to the rows containing those probes in the first file. Do NOT turn in either file as your
solution. Just describe what you did, in such a way that anyone
familiar with basic Excel operations could follow your directions to
accomplish this task.
3. Using relative references
In lecture, we showed how the expected values in a frequency table could be filled out with a single Excel formula. Here is the table.
- Calculate the full chi-square statistic from these data, recalling that the value for each cell is (O - E)2/E and the sum of these values over all cells is the statistic.
- Use Excel function chidist() to calculate the P value for this statistic, recalling that the degrees of freedom for a m x n table are (m - 1)(n - 1).
- Verify this P value with Excel function chitest(), which requires you to select both the observed and the expected tables.
- Finally, verify the chi-square statistic with function chiinv() applied to the P value.
This file should be turned in. Be sure to label the tables and statistics that you add to the worksheet.
|