Lab 7. Genomic SNP discovery
In this lab we'll use recently-acquired high-throughput sequence data
for discovery of single-nucleotide polymorphisms in the sorghum genome.
We will tabulate alignment coverage and SNPs between two genomes, ask
how well the coverage depth conforms to statistical expectation, and
explore
genome-browser displays of SNPs and haplotypes in mouse, one of the few
animals in which genomic diversity has been deeply characterized with
SNPs.
Here is a useful tip: when
navigating from one site to another with your browser, never use the
left mouse button to click on a hyperlink unless you are permanently
finished with the current page. Instead, use the right button to Open link in new tab (at least, in
the Firefox browser). Then you
can return to the current page without using your browser's Back button.
Background
As described in lecture, we submitted genomic DNA from two sorghum
accessions for sequencing by Solexa technology, which yielded many
sequencing reads of 36 bases.
(This was in 2008. The technology is rapidly improving and read length
has
increased greatly since then). Our plan was to align these reads to the available Tx623 sorghum genome sequence.
From this we could identify mismatches, which are not necessarily true
SNPs but could have been due to sequencing error. We received a total
of 3 genome
equivalents of bases, but found that for only about a third of these
reads could a unique best alignment
be found. Recall that such an alignment is needed for identifying and
mapping SNPs.
An obvious question was: how much sequencing would be necessary for
discovery of any given number of SNPs? We will test our observed
coverage-depth distribution against theory and try to predict this.
Another question addressed a method that we adopted to enrich our
sequence reads in expressed genes: we created our genomic library by
digestion with enzyme HpaII,
which cuts preferentially in nonmethylated DNA, which in turn is more
common in genic than nongenic regions of the genome. Was this method
effective?
Much of the work in processing the data is done with custom-written
Perl scripts and the alignment program, SOAP, that is invoked by one of
the scripts. Running these scripts doesn't make a very instructive
exercise; one sequencing run can take half an hour or more to process
and nothing useful is visible on the screen. So we will survey the
appearance of the data and results at various stages and then use
Microsoft Excel to examine some of it.
1. Understanding the alignment pipeline, and testing observed
against expected alignment coverage
- The
raw read files from a recent Illumina Genome Analyzer run approached 1
Gb each in size and described around 6M reads each, looking like this
on disk:
-rw-r--r--
1 jcn jcn 573096334 2008-11-24 09:27 sb1_seq.txt
-rw-r--r--
1 jcn jcn 770254244 2008-11-24 12:14 sb3_seq.txt
-rw-r--r--
1 jcn jcn 798418112 2008-11-24 09:51 sb4_seq.txt
-rw-r--r--
1 jcn jcn 905819832 2008-11-24 11:53 sb5_seq.txt
-rw-r--r--
1 jcn jcn 686960926 2008-11-24 11:55 sb6_seq.txt
-rw-r--r--
1 jcn jcn 609726796 2008-11-24 12:11 sb7_seq.txt
-rw-r--r--
1 jcn jcn 655719202 2008-11-24 12:13 sb8_seq.txt
- The data in one such file looks like this:
@HWI-EAS339_30H7TAAXX:1:1:1663:1004
GTAATGTTGTGCTCCAAATAGTCTACCTCTGACGGG
+HWI-EAS339_30H7TAAXX:1:1:1663:1004
]]WS]]]]]]]]]]]IWS]PWY]]PXWXXZVGFVVV
@HWI-EAS339_30H7TAAXX:1:1:1532:76
GGTGGATAATGTTCATTTGGGCTGCTTTATTAGTTA
+HWI-EAS339_30H7TAAXX:1:1:1532:76
^^^^^^^]]^^^^^]^^^\]Y]]Y][[[P[[SS[[S
There are four lines per read. The first and third contain labels, the
second contains the read itself, and the fourth gives a quality score
for each base in FASTQ format, which represents numbers as ASCII
characters in a way we need not investigate. A quality score, which
you'll remember from Sanger sequencing and the Phred program,
quantifies the confidence we have in the base call.
- A script now extracts the reads and converts them into
FASTA format, like this:
>1-1-1663-1004
GTAATGTTGTGCTCCAAATAGTCTACCTCTGACGGG
>1-1-1532-76
GGTGGATAATGTTCATTTGGGCTGCTTTATTAGTTA
- Another script runs the SOAP alignment program against the
sorghum genome sequence, giving output looking like this and showing (just for
your information):
- id of read
- sequence of read
- quality (here not used)
- number of equal best hits. Reads with no hits are
ignored
- a/b: a flag meaningful only for paired-end alignment
- read length
- +/-: strand to which read was aligned
- chr: id of reference sequence;
- position of first bp on the reference, counted from 1
- mismatch types:
0: exact match.
1 or 2: RefAllele->Offset QueryAllele Qual
Example: 2
A->10T30 C->13A32 means there are two
mismatches,
one at position +10 and the other at position +13 of the reference.
The reference alleles are A and C, while the called query
allele and its quality are T,30 and A,32.
- Next a script is run to compute the coverage-depth distribution of
these reads, and this is what we will examine more closely. We are
interested in the alignment depth of our reads, since these provide the
strength of evidence for any mismatch with the reference genome. In
particular, we would like to cover about 0.05 of the genome to a depth
of at least 10 reads in 8 sorghum accessions, and would prefer deeper
coverage so that each accession could be represented by several reads. Why do we want each accession to be
represented by several reads?
- All of the data used in this lab will be found in this Excel workbook.
Use the labeled tabs at the bottom of the workbook to select the
worksheets.
- You should recall Lab 2, where we
computed statistics on coverage tables. Now we will examine coverage
results from three experiments, presented in the first three worksheets
of your data workbook:
- 300,000 32-base reads were randomly sampled from a
simulated genome of 30,000,000 bases and aligned back to the same
"genome".
- The same was done with 30 Mb of sorghum chromosome 1.
- A similar number of reads from the real data that
uniquely aligned to the 30-Mb region of chromosome 1 was identified.
A 1988 result by Lander and Waterman predicts that the proportion P(d)
of a region sequenced to coverage c that is covered to depth d has a Poisson distribution:

Don't
worry: this exercise requires no probability, but a little statistics.
We'll be
fitting linear or nonlinear regressions in Excel and will need to do
some simple algebra with the results. Recall that in lab
exercise 2.7 we calculated this proportion for each
value of d in our data, while
in exercise 2.8 we calculated the coverage in the process of
calculating the average coverage. This means that we know the values to
put on the left-hand-side of the equation (the proportions P(d))
as well as the values of c
and d for the right-hand side.Your task is to determine how well
the data from these three experiments match the values predicted by the
equation.
- Here are two ways to
do this. Both ways
should be applied to all three data sets. You may need
the Excel functions exp() and fact().
Note that once you make the calculations for one, you can just copy and
paste the tables into the other two, in the same positions relative to
the data.
- Using the equation and your known values of c and d, calculate an expected proportion
for each depth. Plot this (using Excel's Insert/Chart, selecting an
appropriate chart type) against d.
In case you don't know how to make an Excel chart of one variable
against another, the easiest way is to place the variables in
adjacent columns and select them, along with the column names, before
starting the Chart Wizard.
Likewise, plot the observed proportions against d on the same chart, and compare
the curves.
- We can linearize the relationship as follows: multiply
both
sides by d!, and then take
the natural logarithm of both sides. This gives
ln P(d) + ln(d!)
= d ln(c) - c
This should tell you that if you plot the
left-hand values against d,
you should get a straight line with slope ln(c) and y-intercept -c. So calculate the left-hand
values using the observed data
and do this. You can make Excel calculate a regression by
clicking in the plot and selecting menu item Chart/Add trendline. In the setup
dialog you'll see an Options
tab that allows you to display the equation on the chart.
- In your report,
comment on your
results for the three experimental cases. You should try to interpret
deviations from the expected curve in method 1. (Hint: if you
understand what coverage depth means, this will be a one-sentence
answer!). It will help to
remember that the alignments reported by SOAP are unique alignments, and that reads
sampled from the approximately two thirds of the sorghum genome that
consists of repetitive elements will likely not align uniquely and thus
not be represented in the coverage-depth distribution. For method 2, you should comment on
the appropriateness of the regression fit, in particular determining
whether the calculated slope and intercept values are consistent with
the coverage you also calculated.
- For this lab, your
Excel files should be hyperlinked from your HTML report. To do this for
a file, create subdirectory data
in the directory containing your report, and place your Excel file in
it. Enter in your report some suitable text for your hyperlink, select
it, and choose Insert/Link
(Ctrl-L). In the Link Properties
dialog, use Choose File... to
navigate to your Excel file, which will be in data/<your file>. Click OK. Don't forget to submit both
report and Excel file.
- While you're looking at these data, notice how much deeper
the alignment depths go in the real data than in the simulated genomic
data. This is visible in the second figure in Lab
2.
We believe that these strange "read stacks" are artefacts of PCR
amplification used for preparation of the sequencing libraries for the
Illumina machine. That's why I avoid using them in the statistical
calculations.
2.
Predicting the amount of sequencing needed for adequate SNP calling
NOTE: You may work on this
problem OR problem 3. If you do both (which of course I hope you will),
you'll get extra credit. Problem 2 requires more Excel plotting, and
some mathematical intuition, both of them useful for anyone wanting to
work with bioinformatics. Problem 3 requires some genetic understanding
and intuition but no arithmetic.
- The preceding experiment should
have revealed that we
can't rely directly on the Lander-Waterman result to predict the depth
distribution for real data. So we
will next try to estimate empirically the amount of coverage we will
need, by
observing how our coverage-depth distribution changes as we add more
sequence. (The need for this
kind of calculation is getting overtaken by sequencing-technology
advances
being made almost monthly. In a couple of years, coverage is likely to
be so cheap that the operational way to answer any such question is to
keep sequencing until you have enough depth.)
- Recently we acquired one lane's worth of sequence from each
of
eight sorghum accessions; seven of these files appear in the listing in
step 1 above. From the alignment output
for these files, considering for simplicity only alignments to sorghum
chromosome 1, I calculated what I call cocoverage.
That is, first I determined the coverage-depth distribution for
accession 3; then for accessions 3 and 1 combined; then for 3, 1, and 4
combined, and so on up to all 8 accessions. The results are in
worksheet 4 of your data workbook. What is the approximate length of
chromosome 1 in bases?
- Notice the increasing trends within each row, from left to
right. Suppose that our target is to add sequence runs (which will add
more columns) until the number of bases covered to depth 10 or higher
is at least 0.05 of the length of chromosome 1. How many further runs will we need
to add?
To answer this, I will suggest a simple approach, but feel free to use
your own ideas,
especially if you know how to use Excel's more advanced tools, such as
the Solver.
- The
raw base frequencies and the proportions they represent are related by
a constant (the total length of the chromosome), so it makes no
difference which is used, and you may take your choice. Try selecting
the table and using Excel to graph it as a line chart. You will want to
omit the zero-depth row, since its values will dominate the scale of
the chart. In step 2 of the Chart Wizard, you will see the option to
plot series in rows or columns. Try both, and decide which is more
meaningful for your purposes.
- You may wish to add up the frequencies for depths of 10 and
above, since we really don't care about the trend at each depth. (In
fact, I suggest replacing the frequencies in each row with the
cumulative frequencies for all higher depths). You may find it useful
to transform the table (say, by calculating the logarithms) in hopes of
making the trends linear. If you can't do this, you can still use
Excel's option to fit a nonlinear curve, which you'll find in the Trendline dialog described in step 8. Once you've fitted a curve you think is
promising,
use its equation to try to predict the coverage that will result in the
desired number or proportion of bases at your target depth. Be
aware that confident extrapolation requires some assurance that the
curve that fits the data is the right one -- which we have really no
way of knowing, though it's more likely to be logarithmic than
polynomial. For your report,
you know by now what is wanted: describe
concisely how you arrived at your answer and provide sufficient detail
of arithmetical calculations that the reader could reproduce them. Don't
describe Excel operations in the depth of detail that is given in these
lab instructions! Include your
charts in the worksheets, with informative captions on the axes and an
informative title in each chart.
3. Viewing SNP and haplotype data
Optional if you have worked
problem 2, but I encourage you to go through it for a view of online
resources and approaches to these data.
- The
exercises so far in this lab deal only with alignment coverage, not
with SNPs. While many SNPs have been identified in the sorghum data,
their analysis has not proceeded far enough to make what I consider an
interesting and instructive exercise. We will instead examine tools
developed for a more extensive study in mouse, at the NIEHS Mouse
Haplotype Analysis site.
In PLPTH 612 you should have covered array-based resequencing for SNP
detection, the approach used for collecting these data. Note that, as
in our sorghum experiment, there is a reference
genotype, in this case the inbred mouse strain C57BL/6J.
This and 15 other "classical" strains were derived from four wild mouse
species, and the experiment was carried out to characterize the
inheritance and distribution, in the classical strains, of haplotype
blocks derived from these wild progenitors.
- There
are two haplotype browsers: the Mouse
Ancestry Mapper and the Mouse
Haplotype Block Viewer. You can open
either by clicking on the Haplotype
Analysis link at the top of the page and then, in either the Mouse Ancestry Mapper section or
the Mouse Haplotype Block Viewer
section, clicking on View Haplotypes.
Its features are explained here.
Explain what
information is given by the colors in the main panel, and what it means
to select a different strain under one of the menus at the top. While
this display is very attractive, it's not clear how much we can
learn from studying it alone. The site also includes an installation of
the Generic Genome Browser that allows us to look at other features of
the mouse genome. So, in a new tab of your browser, open the Mouse
Genome Browser page and, in the Landmark
or Region box under Search,
enter Chr08. Let's explore
some curious features.
- In the two tracks under the Cytobands track, you'll see that the
profiles of SNPs/KB and Assay Coverage Percentage both drop
to the baseline and go flat. Click on one of these regions (I chose the
right-hand one) to zoom in to a 50 Kb window, and then copy the
coordinates from the Landmark or
Region box. Go back to the Haplotype
Block Viewer that you already opened, enter these coordinates
into the text box labeled Change
the order of strains and/or enter position or search term, and
click Submit. When the display
appears, you'll see solid color in each of the horizontal bars
representing mouse strains. Use the zoom buttons at left to zoom out to
20M, and observe the region. It occupies about 2 Mb of sequence, from
about 56Mb to 58Mb. Find the same region in the Ancestry Browser. What does the coloring in this
region indicate?
- Apparently we can't find from the
resources at this site why there are no SNPs in these regions. Let's
look at a different
genome browser (actually just a different installation of the same
Generic Genome Browser) at the Jackson Lab's Mouse Genome Informatics
(MGI) database site. Enter your coordinates in the Landmark or Region box (note: for this browser you must omit
the Chr in front of the
coordinates. Just enter the chromosome number 8, followed by the rest)
and click Search to update the
display.
- Let's look for SNPs in the region. Scroll
down to the Tracks section and
locate the Polymorphisms
subsection at the bottom. Click the SNP
checkbox and then on any Update Image
button you see. Some red SNP glyphs should appear in the lowest track
in the Details section.
Right-click on one to open the link in a new tab. You'll be taken to
NCBI's dbSNP, where you'll find details about this SNP. Note in the Handle|Submitter ID section that the
SNP came from Perlegen, so we can guess that the resequencing study is
the source of many of the SNPs in the MGI database.
- Your final task is to find out and/or form some reasonable
hypothesis, based on online resources and your genetic knowledge, why
there are no SNPs in the chromosome-8 region(s) we have been looking at.
Is there even any sequence? Any contigs? I can recommend some
directions for exploration. In the Integrated
Maps section on this page, you can follow the chromosome 8 link. Once the new map
page opens, you can click on a Map
element link in the Search
results section. On the NCBI Map Viewer page that now opens,
you'll see a panel on the left side containing links to mouse resources
of various kinds. In your
report include hyperlinks to the two or three pages that gave you your
answers, if you find any.
|