PLPTH 613
Bioinformatics Applications
Spring 2009


Home page
Schedule
Research project
K-State Online

Lab 7. Genomic SNP discovery

In this lab we'll use recently-acquired high-throughput sequence data for discovery of single-nucleotide polymorphisms in the sorghum genome. We will tabulate alignment coverage and SNPs between two genomes, ask how well the coverage depth conforms to statistical expectation, and explore genome-browser displays of SNPs and haplotypes in mouse, one of the few animals in which genomic diversity has been deeply characterized with SNPs.

Here is a useful tip: when navigating from one site to another with your browser, never use the left mouse button to click on a hyperlink unless you are permanently finished with the current page. Instead, use the right button to Open link in new tab (at least, in the Firefox browser). Then you can return to the current page without using your browser's Back button.

Background

As described in lecture, we submitted genomic DNA from two sorghum accessions for sequencing by Solexa technology, which yielded many sequencing reads of 36 bases. (This was in 2008. The technology is rapidly improving and read length has increased greatly since then). Our plan was to align these reads to the available Tx623 sorghum genome sequence. From this we could identify mismatches, which are not necessarily true SNPs but could have been due to sequencing error. We received a total of 3 genome equivalents of bases, but found that for only about a third of these reads could a unique best alignment be found. Recall that such an alignment is needed for identifying and mapping SNPs.

An obvious question was: how much sequencing would be necessary for discovery of any given number of SNPs? We will test our observed coverage-depth distribution against theory and try to predict this. Another question addressed a method that we adopted to enrich our sequence reads in expressed genes: we created our genomic library by digestion with enzyme HpaII, which cuts preferentially in nonmethylated DNA, which in turn is more common in genic than nongenic regions of the genome. Was this method effective?

Much of the work in processing the data is done with custom-written Perl scripts and the alignment program, SOAP, that is invoked by one of the scripts. Running these scripts doesn't make a very instructive exercise; one sequencing run can take half an hour or more to process and nothing useful is visible on the screen. So we will survey the appearance of the data and results at various stages and then use Microsoft Excel to examine some of it.

1. Understanding the alignment pipeline, and testing observed against expected alignment coverage

  1. The raw read files from a recent Illumina Genome Analyzer run approached 1 Gb each in size and described around 6M reads each, looking like this on disk:

    -rw-r--r-- 1 jcn jcn 573096334 2008-11-24 09:27 sb1_seq.txt
    -rw-r--r-- 1 jcn jcn 770254244 2008-11-24 12:14 sb3_seq.txt
    -rw-r--r-- 1 jcn jcn 798418112 2008-11-24 09:51 sb4_seq.txt
    -rw-r--r-- 1 jcn jcn 905819832 2008-11-24 11:53 sb5_seq.txt
    -rw-r--r-- 1 jcn jcn 686960926 2008-11-24 11:55 sb6_seq.txt
    -rw-r--r-- 1 jcn jcn 609726796 2008-11-24 12:11 sb7_seq.txt
    -rw-r--r-- 1 jcn jcn 655719202 2008-11-24 12:13 sb8_seq.txt

  2. The data in one such file looks like this:

    @HWI-EAS339_30H7TAAXX:1:1:1663:1004
    GTAATGTTGTGCTCCAAATAGTCTACCTCTGACGGG
    +HWI-EAS339_30H7TAAXX:1:1:1663:1004
    ]]WS]]]]]]]]]]]IWS]PWY]]PXWXXZVGFVVV
    @HWI-EAS339_30H7TAAXX:1:1:1532:76
    GGTGGATAATGTTCATTTGGGCTGCTTTATTAGTTA
    +HWI-EAS339_30H7TAAXX:1:1:1532:76
    ^^^^^^^]]^^^^^]^^^\]Y]]Y][[[P[[SS[[S

    There are four lines per read. The first and third contain labels, the second contains the read itself, and the fourth gives a quality score for each base in FASTQ format, which represents numbers as ASCII characters in a way we need not investigate. A quality score, which you'll remember from Sanger sequencing and the Phred program, quantifies the confidence we have in the base call.
  3. A script now extracts the reads and converts them into FASTA format, like this:

    >1-1-1663-1004
    GTAATGTTGTGCTCCAAATAGTCTACCTCTGACGGG
    >1-1-1532-76
    GGTGGATAATGTTCATTTGGGCTGCTTTATTAGTTA

  4. Another script runs the SOAP alignment program against the sorghum genome sequence, giving output looking like this and showing (just for your information):

    1. id of read
    2. sequence of read
    3. quality (here not used)
    4. number of equal best hits. Reads with no hits are ignored
    5. a/b: a flag meaningful only for paired-end alignment
    6. read length   
    7. +/-: strand to which read was aligned   
    8. chr: id of reference sequence;
    9. position of first bp on the reference, counted from 1
    10. mismatch types:
      0:  exact match.
    11. 1 or 2: RefAllele->Offset QueryAllele Qual
    Example: 2 A->10T30    C->13A32 means there are two mismatches, one at position +10 and the other at position +13 of the reference. The reference alleles are A and C, while the called query allele and its quality are T,30 and A,32.

  5. Next a script is run to compute the coverage-depth distribution of these reads, and this is what we will examine more closely. We are interested in the alignment depth of our reads, since these provide the strength of evidence for any mismatch with the reference genome. In particular, we would like to cover about 0.05 of the genome to a depth of at least 10 reads in 8 sorghum accessions, and would prefer deeper coverage so that each accession could be represented by several reads. Why do we want each accession to be represented by several reads?
  6. All of the data used in this lab will be found in this Excel workbook. Use the labeled tabs at the bottom of the workbook to select the worksheets.
  7. You should recall Lab 2, where we computed statistics on coverage tables. Now we will examine coverage results from three experiments, presented in the first three worksheets of your data workbook:
    1. 300,000 32-base reads were randomly sampled from a simulated genome of 30,000,000 bases and aligned back to the same "genome".
    2. The same was done with 30 Mb of sorghum chromosome 1.
    3. A similar number of reads from the real data that uniquely aligned to the 30-Mb region of chromosome 1 was identified.

    A 1988 result by Lander and Waterman predicts that the proportion P(d) of a region sequenced  to coverage c that is covered to depth d has a Poisson distribution:

    Poisson expression

    Don't worry: this exercise requires no probability, but a little statistics. We'll be fitting linear or nonlinear regressions in Excel and will need to do some simple algebra with the results. Recall that in lab exercise 2.7 we calculated this proportion for each value of d in our data, while in exercise 2.8 we calculated the coverage in the process of calculating the average coverage. This means that we know the values to put on the left-hand-side of the equation (the proportions P(d)) as well as the values of c and d for the right-hand side.Your task is to determine how well the data from these three experiments match the values predicted by the equation.
  8. Here are two ways to do this. Both ways should be applied to all three data sets. You may need the Excel functions exp() and fact(). Note that once you make the calculations for one, you can just copy and paste the tables into the other two, in the same positions relative to the data.
    1. Using the equation and your known values of c and d, calculate an expected proportion for each depth. Plot this (using Excel's Insert/Chart, selecting an appropriate chart type) against d. In case you don't know how to make an Excel chart of one variable against another, the easiest way is to place the variables in adjacent columns and select them, along with the column names, before starting the Chart Wizard. Likewise, plot the observed proportions against d on the same chart, and compare the curves.
    2. We can linearize the relationship as follows: multiply both sides by d!, and then take the natural logarithm of both sides. This gives

      ln P(d) + ln(d!) = d ln(c) - c

      This should tell you that if you plot the left-hand values against d, you should get a straight line with slope ln(c) and y-intercept -c. So calculate the left-hand values using the observed data and do this. You can make Excel calculate a regression by clicking in the plot and selecting menu item Chart/Add trendline. In the setup dialog you'll see an Options tab that allows you to display the equation on the chart.
  9. In your report, comment on your results for the three experimental cases. You should try to interpret deviations from the expected curve in method 1. (Hint: if you understand what coverage depth means, this will be a one-sentence answer!). It will help to remember that the alignments reported by SOAP are unique alignments, and that reads sampled from the approximately two thirds of the sorghum genome that consists of repetitive elements will likely not align uniquely and thus not be represented in the coverage-depth distribution. For method 2, you should comment on the appropriateness of the regression fit, in particular determining whether the calculated slope and intercept values are consistent with the coverage you also calculated.
  10. For this lab, your Excel files should be hyperlinked from your HTML report. To do this for a file, create subdirectory data in the directory containing your report, and place your Excel file in it. Enter in your report some suitable text for your hyperlink, select it, and choose Insert/Link (Ctrl-L). In the Link Properties dialog, use Choose File... to navigate to your Excel file, which will be in data/<your file>. Click OK. Don't forget to submit both report and Excel file.
  11. While you're looking at these data, notice how much deeper the alignment depths go in the real data than in the simulated genomic data. This is visible in the second figure in Lab 2. We believe that these strange "read stacks" are artefacts of PCR amplification used for preparation of the sequencing libraries for the Illumina machine. That's why I avoid using them in the statistical calculations.

2. Predicting the amount of sequencing needed for adequate SNP calling

NOTE: You may work on this problem OR problem 3. If you do both (which of course I hope you will), you'll get extra credit. Problem 2 requires more Excel plotting, and some mathematical intuition, both of them useful for anyone wanting to work with bioinformatics. Problem 3 requires some genetic understanding and intuition but no arithmetic.
  1. The preceding experiment should have revealed that we can't rely directly on the Lander-Waterman result to predict the depth distribution for real data. So we will next try to estimate empirically the amount of coverage we will need, by observing how our coverage-depth distribution changes as we add more sequence. (The need for this kind of calculation is getting overtaken by sequencing-technology advances being made almost monthly. In a couple of years, coverage is likely to be so cheap that the operational way to answer any such question is to keep sequencing until you have enough depth.)
  2. Recently we acquired one lane's worth of sequence from each of eight sorghum accessions; seven of these files appear in the listing in step 1 above. From the alignment output for these files, considering for simplicity only alignments to sorghum chromosome 1, I calculated what I call cocoverage. That is, first I determined the coverage-depth distribution for accession 3; then for accessions 3 and 1 combined; then for 3, 1, and 4 combined, and so on up to all 8 accessions. The results are in worksheet 4 of your data workbook. What is the approximate length of chromosome 1 in bases?
  3. Notice the increasing trends within each row, from left to right. Suppose that our target is to add sequence runs (which will add more columns) until the number of bases covered to depth 10 or higher is at least 0.05 of the length of chromosome 1. How many further runs will we need to add? To answer this, I will suggest a simple approach, but feel free to use your own ideas, especially if you know how to use Excel's more advanced tools, such as the Solver.
  4. The raw base frequencies and the proportions they represent are related by a constant (the total length of the chromosome), so it makes no difference which is used, and you may take your choice. Try selecting the table and using Excel to graph it as a line chart. You will want to omit the zero-depth row, since its values will dominate the scale of the chart. In step 2 of the Chart Wizard, you will see the option to plot series in rows or columns. Try both, and decide which is more meaningful for your purposes.
  5. You may wish to add up the frequencies for depths of 10 and above, since we really don't care about the trend at each depth. (In fact, I suggest replacing the frequencies in each row with the cumulative frequencies for all higher depths). You may find it useful to transform the table (say, by calculating the logarithms) in hopes of making the trends linear. If you can't do this, you can still use Excel's option to fit a nonlinear curve, which you'll find in the Trendline dialog described in step 8. Once you've fitted a curve you think is promising, use its equation to try to predict the coverage that will result in the desired number or proportion of bases at your target depth. Be aware that confident extrapolation requires some assurance that the curve that fits the data is the right one -- which we have really no way of knowing, though it's more likely to be logarithmic than polynomial. For your report, you know by now what is wanted: describe concisely how you arrived at your answer and provide sufficient detail of arithmetical calculations that the reader could reproduce them. Don't describe Excel operations in the depth of detail that is given in these lab instructions! Include your charts in the worksheets, with informative captions on the axes and an informative title in each chart.

3. Viewing SNP and haplotype data

Optional if you have worked problem 2, but I encourage you to go through it for a view of online resources and approaches to these data.
  1. The exercises so far in this lab deal only with alignment coverage, not with SNPs. While many SNPs have been identified in the sorghum data, their analysis has not proceeded far enough to make what I consider an interesting and instructive exercise. We will instead examine tools developed for a more extensive study in mouse, at the NIEHS Mouse Haplotype Analysis site. In PLPTH 612 you should have covered array-based resequencing for SNP detection, the approach used for collecting these data. Note that, as in our sorghum experiment, there is a reference genotype, in this case the inbred mouse strain C57BL/6J. This and 15 other "classical" strains were derived from four wild mouse species, and the experiment was carried out to characterize the inheritance and distribution, in the classical strains, of haplotype blocks derived from these wild progenitors.
  2. There are two haplotype browsers: the Mouse Ancestry Mapper and the Mouse Haplotype Block Viewer. You can open either by clicking on the Haplotype Analysis link at the top of the page and then, in either the Mouse Ancestry Mapper section or the Mouse Haplotype Block Viewer section, clicking on View Haplotypes. Its features are explained here. Explain what information is given by the colors in the main panel, and what it means to select a different strain under one of the menus at the top. While this display is very attractive, it's not clear how much we can learn from studying it alone. The site also includes an installation of the Generic Genome Browser that allows us to look at other features of the mouse genome. So, in a new tab of your browser, open the Mouse Genome Browser page and, in the Landmark or Region box under Search, enter Chr08. Let's explore some curious features.
  3. In the two tracks under the Cytobands track, you'll see that the profiles of SNPs/KB and Assay Coverage Percentage both drop to the baseline and go flat. Click on one of these regions (I chose the right-hand one) to zoom in to a 50 Kb window, and then copy the coordinates from the Landmark or Region box. Go back to the Haplotype Block Viewer that you already opened, enter these coordinates into the text box labeled Change the order of strains and/or enter position or search term, and click Submit. When the display appears, you'll see solid color in each of the horizontal bars representing mouse strains. Use the zoom buttons at left to zoom out to 20M, and observe the region. It occupies about 2 Mb of sequence, from about 56Mb to 58Mb. Find the same region in the Ancestry Browser. What does the coloring in this region indicate?
  4. Apparently we can't find from the resources at this site why there are no SNPs in these regions. Let's look at a different genome browser (actually just a different installation of the same Generic Genome Browser) at the Jackson Lab's Mouse Genome Informatics (MGI) database site. Enter your coordinates in the Landmark or Region box (note: for this browser you must omit the Chr in front of the coordinates. Just enter the chromosome number 8, followed by the rest) and click Search to update the display.
  5. Let's look for SNPs in the region. Scroll down to the Tracks section and locate the Polymorphisms subsection at the bottom. Click the SNP checkbox and then on any Update Image button you see. Some red SNP glyphs should appear in the lowest track in the Details section. Right-click on one to open the link in a new tab. You'll be taken to NCBI's dbSNP, where you'll find details about this SNP. Note in the Handle|Submitter ID section that the SNP came from Perlegen, so we can guess that the resequencing study is the source of many of the SNPs in the MGI database.
  6. Your final task is to find out and/or form some reasonable hypothesis, based on online resources and your genetic knowledge, why there are no SNPs in the chromosome-8 region(s) we have been looking at. Is there even any sequence? Any contigs? I can recommend some directions for exploration. In the Integrated Maps section on this page, you can follow the chromosome 8 link. Once the new map page opens, you can click on a Map element link in the Search results section. On the NCBI Map Viewer page that now opens, you'll see a panel on the left side containing links to mouse resources of various kinds. In your report include hyperlinks to the two or three pages that gave you your answers, if you find any.