PLPTH 613
Bioinformatics Applications
Spring 2009

Lab 8. QTL analysis

In this lab we will do a practical analysis of a real data set. We will use a public software package called QGene that is developed here at K-State. Our goals will be to explore the dataset in an orderly way that will prove valuable to you in later analyses.
  1. QGene has been installed on your lab computers in directory C:\PLPTH_613. Double-click on the qgene.exe launcher to start the Java program.
  2. The real data set we will use represents a rice recombinant-inbred-line (RIL) progeny of about 300 RILs. It was grown in the states of Arkansas and Louisiana in the summers of 2006 and 2007 and evaluated for many traits. The main purpose of the study was to identify QTLs for whole milling yield, often called head-rice yield or HRY. This is the proportion of rice grains that retain at least 75% of their length after milling. Milling is the series of operations that removes husk and bran from "rough" rice (rice with the husk on, just as harvested from the field) to yield white rice. Milling tends to cause rice breakage, and rice processors pay much less for broken rice, so an understanding of the genetics of HRY would be of value to the rice industry. The parents of our cross were Cypress and LaGrue, both high-quality long-grain rice cultivars, but with Cypress known to give consistently high HRY. So we hope to find HRY QTLs in the RILs.
  3. Download this file to your working directory and open it with a text editor to examine the format. You'll find it explained on this QGene manual page. Note that each marker row contains, besides the marker data, the chromosome and map position in which the marker is to be placed by QGene.
  4. In QGene, choose File/Load data and navigate to your data file to load it. A window called the DataManager will appear. In the left panel is a list of the data files currently loaded (only one, so far) and in the right panel you may view trait, map, and marker data by clicking on the three tabs. Click on the Marker data tab. You can zoom the image by holding down the Control key and turning the mouse wheel. By examining the top left colored squares and the marker data in your text editor, determine which colors represent genotypes A, B, H, and missing, and describe concisely how you found out.
  5. Examine the genetic linkage map. In interval mapping, QTLs of small effect are difficult to detect in wide intervals between markers. Although there is no exact equation that I know of -- especially since QTL detection power also depends on population size and other factors -- we may say roughly that intervals no wider than 15 cM are desired. On this basis, about what proportion of small-effect QTLs might fail to be detected in this population? As usual, indicate concisely the quantitative basis for your answer.
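One rough way to put a number on the question in step 5 is sketched below in Python. It assumes, purely for illustration, that QTLs are spread evenly along the map, so that the share of QTLs at risk equals the share of total map length lying in intervals wider than the cutoff. The function name and the marker positions are hypothetical and are not taken from the lab data file.

```python
def fraction_in_wide_intervals(chromosomes, max_width_cm=15.0):
    """Share of total map length that lies in marker intervals wider than max_width_cm.

    chromosomes maps a chromosome name to a list of marker positions in cM.
    """
    total = 0.0
    wide = 0.0
    for positions in chromosomes.values():
        ordered = sorted(positions)
        # Walk over adjacent marker pairs on this chromosome.
        for left, right in zip(ordered, ordered[1:]):
            width = right - left
            total += width
            if width > max_width_cm:
                wide += width
    return wide / total if total else 0.0


# Hypothetical map positions (cM), for illustration only.
example_map = {
    "chr1": [0.0, 10.0, 30.0, 40.0],
    "chr2": [0.0, 5.0, 25.0],
}
print(round(fraction_in_wide_intervals(example_map), 3))
```

With the real map you would fill in the positions shown in the Map tab of the Data Manager. Treating every QTL in a wide interval as undetectable is a crude upper bound, so state that assumption in your answer.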
  6. Examine the trait data. You can view thumbnail histograms (if you have a small thumb!) in the Data Manager. Are there any apparent outliers? Outliers are data points that do not follow any obvious trend over the remainder of the data in a distribution. One example is seen in trait HD_LA_07. Let's look at this distribution in more detail. From the main QGene window, select menu item Analysis/Trait analysis and then in the trait list in the left panel, select HD_LA_07.
  7. In the larger histogram, we can see that all of the heading dates in Louisiana 2007 are clustered at the left end of the distribution around 85, but one of the RILs seems to have headed about 40 days after the others. The absence of any other RILs in between makes this data point look suspicious. Do we have any other evidence? Look at the table on the right and scroll over to the column labeled HD_LA_07. Click in the column header to sort on the values; you may have to click a second time to sort in descending order. Notice that the same RIL, in Louisiana 2006, showed a heading date of 82 days, and in Arkansas (AR) in both years, headed in only 79 days. Not only this, but it was harvested (HARV_LA_07) at 111 days, 13 days before heading. Nonsense! We'll plan to delete it, by changing the value to missing data. Since we can't do this from inside QGene, we'll have to modify the data file and then reload it.
  8. To save effort, identify any scary data points in other traits, and make notes of them. Did you elect to remove any more points? In case you're wondering whether this is statistically OK to do: yes, as long as you do it conservatively and have a good reason to suspect error. With 299 data points, if the removal of one or two outliers radically changes the QTL analysis, there is something wrong with the data anyway.
  9. Just to see what difference this cleaning step made, we'll create another trait that's identical except for this data point. Open your data file with a text editor, find the row for trait HD_LA_07, and duplicate it. Rename one of the rows HD_LA_07_trimmed, and in this row, replace the outlying value with the missing-data symbol, which is just a period. Use this opportunity to replace any other outliers you found in any of the traits. Save your file (you need not change the name).
  10. Close the Trait Analysis window, locate the Data Manager again (you can always find it with Window/Data manager), select the dataset name in the Data panel, and press your Delete key. Now reload the data file. In the Traits tab of the Data Manager, note the drastic change in the trait distribution that you made by removing one data point! Below, we'll find out whether the QTL analysis of this trait has changed.
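The consistency check used in step 7 (a line cannot be harvested before it has headed) can also be run over all lines at once. The Python sketch below is only an illustration: the function names are hypothetical, the trait values are assumed to have already been read into lists with None for missing data, and the only figures taken from the lab are the suspicious pair of 124 days to heading and 111 days to harvest.

```python
def flag_inconsistent(heading, harvest):
    """Return the indexes of lines whose harvest date is not later than heading date."""
    flagged = []
    for index, (hd, hv) in enumerate(zip(heading, harvest)):
        if hd is None or hv is None:
            continue  # missing data cannot be checked
        if hv <= hd:
            flagged.append(index)
    return flagged


def trim(values, flagged):
    """Copy a trait, replacing the flagged entries with None (missing)."""
    drop = set(flagged)
    return [None if index in drop else value for index, value in enumerate(values)]


# Third entry mirrors the suspicious RIL from step 7; the others are made up.
hd_la_07 = [85, 84, 124, 86]
harv_la_07 = [118, 117, 111, 119]

suspects = flag_inconsistent(hd_la_07, harv_la_07)
hd_la_07_trimmed = trim(hd_la_07, suspects)
print(suspects, hd_la_07_trimmed)
```

A check like this only finds candidates. Whether to set a value to missing is still a judgment call that should rest on a good reason to suspect error, as discussed in step 8.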
  11. Examine the marker genotype segregation. From the main window, select Analysis/QTL mapping, or just type Ctrl-R. The Analysis window that now appears is the window where most of your QGene work will be done. Click on the + icon next to Genotype segregation analysis in the panel on the right, and then click in the checkboxes for the first three entries, aa, Aa, and AA. These statistics represent, for each marker, the proportions of informative RILs carrying each of these three marker genotypes (where an informative RIL is one whose genotype is non-missing). Click in the Chromosomes panel at left, and type Ctrl-A to select all of the chromosomes. Now, even though this analysis doesn't depend on trait data, you'll need to click on a trait in the Traits panel to view the plot.
  12. Drag the right window edge and the internal dividers to expand the black plot window, and use the Vertical zoom and Horizontal zoom sliders to shrink the plot to a convenient size. In your report you will need to comment concisely on any pattern of segregation distortion and any instances of extreme distortion. To identify markers of interest and find the corresponding genotype frequencies, also check the boxes f(aa), f(Aa), and f(AA). Finally, check the Chi^2 and -log10(P(Chi^2)) boxes. Choose File/Export text to clipboard, open a new Excel worksheet, select a cell, and choose Paste. Don't include this table in your report; just use it to support your observations about marker segregation in this data set. These lines are claimed to be RI lines, inbred to near-homozygosity at all loci. Is this plausible? Based on a loss of half of the heterozygosity at every selfing generation following the F1, how many generations of inbreeding does the marker segregation pattern suggest that they have really undergone? Note that in the Excel table, you can sum the frequency columns to calculate an overall estimate of the three genotype frequencies.
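The halving rule in step 12 says that t generations of selfing after the F1 leave an expected heterozygous fraction of (1/2)^t, so t can be estimated as -log2 of the observed heterozygote frequency. The Python sketch below works this through. The function names are hypothetical and the genotype counts are invented for illustration, so substitute the totals from your own Excel table.

```python
import math


def overall_frequencies(n_aa, n_het, n_AA):
    """Overall genotype frequencies from counts pooled over all markers."""
    total = n_aa + n_het + n_AA
    return n_aa / total, n_het / total, n_AA / total


def expected_heterozygosity(generations):
    """Expected heterozygous fraction after the given number of selfings of an F1."""
    return 0.5 ** generations


def selfing_generations(heterozygosity):
    """Invert the halving rule: number of selfing generations implied by f(Aa)."""
    if not 0.0 < heterozygosity <= 1.0:
        raise ValueError("heterozygosity must be in (0, 1]")
    return -math.log2(heterozygosity)


# Hypothetical pooled counts, NOT the lab's real data.
f_aa, f_het, f_AA = overall_frequencies(1500, 200, 1500)
generations = selfing_generations(f_het)
print(f"f(Aa) = {f_het:.4f}, about {generations:.1f} selfing generations after the F1")
```

In this made-up case f(Aa) is 0.0625, which is (1/2)^4, so the lines would correspond to four selfings of the F1, that is, the F5 generation. Genotyping errors scored as heterozygotes would bias such an estimate toward fewer generations.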
  13. Examine trait correlations. Before we start looking at the QTL plots, let's find out a little more about the traits. Open the Trait analysis window as in step 6 above. Note that the trait RIL isn't really a trait! It just lists the numbers of the RILs, as a convenience for matching them with their trait values. In the Trait list at left, select all the traits except for RIL and the untrimmed HD_LA_07 (you can select or deselect individual traits by Ctrl-clicking on them) and choose Analysis/Compare traits. A correlation table will appear, though the column labels are obscured owing to a bug in QGene or Java. Choose File/Export text to clipboard and paste into another Excel worksheet.
  14. In this data set I've included only four traits: HD (heading date, measured from emergence of the rice plant from the soil), HARV (harvest date, measured similarly), GFD (HARV - HD, as you can verify from the trait data shown in the Trait analysis window), and HRM (head-rice yield, measured as the percent of milled kernels that remained unbroken). The rice lines were not harvested all on the same date, as would be the practice on a farm where only one cultivar was being grown. Rather, each line was harvested when its grain appeared ripe (yellow color reaching the base of the panicle). In examining the correlations, pay close attention to the correlations of our target trait HRM with HD in both AR and LA locations. Do you observe anything of possible interest?
  15. Conducting the QTL analysis. Now let's have a look at the QTL plots. Return to the QTL analysis window and deselect the check boxes in the Segregation analysis. Select the two traits HD_LA_07 and HD_LA_07_trimmed, and in the right-hand panel open up the Simple IM tree and click the LOD and Add effect boxes. You should already know that the LOD statistic quantifies the evidence for the presence of a QTL at each tested position. In contrast, the additive effect represents the expected change in phenotype when one parental allele is substituted with the opposite allele at the tested QTL.
  16. To make the plot more manageable, I usually remove chromosomes showing little evidence for QTLs -- using a rough LOD threshold of about 3. You can do this by Ctrl-clicking on the names of these chromosomes in the left-hand list. Also the default line pattern for LOD contours is a dot-dashed line, while I prefer a solid line. To change this, scroll the right-hand analysis-list panel to the left and double-click on the icon next to LOD. In the Choose a pattern dialog that comes up, you can click on any desired line pattern and then click OK to accept it. If you wish to change the color of a line, you can do it similarly, using the Traits panel: double-click on your desired trait and choose a new color from the palette that appears.
  17. You should be able to see some differences in the QTL profiles of your original and trimmed traits. How important are these? Select the other three HD traits in order to view all QTL profiles at once. Note that the profiles have many common features in both locations and years, which is not surprising for this highly heritable trait. In particular, note the chromosome-7 and chromosome-3 peaks, which appear to have been those most affected by our trait editing. Before we test the significance of these peaks with permutation, look at the additive-effect plot below the LOD plot. This is how you determine the parental source of the influential alleles at the QTLs you declare. For example, on chromosome 2 the Add profile dips below 0 to the LaG(rue) side of the plot. This means that in this region, alleles from the LaGrue parent increased the trait values. For this trait, this means that they delayed heading. The numbers on the Y axis at left are in units of the trait itself and are expressed per allele. Thus at the chromosome-10 QTL we see that the LaG homozygote heads 4.5 days later than the Cyp(ress) homozygote. Be sure you understand this essential feature of QTL analysis. Can you give a reasonable estimate of the difference in heading date between the Cypress and LaGrue parents, based on the sums of QTL effects?
  18. Let's calculate a plausible acceptance threshold for our SIM analysis, keeping in mind that we do not intend to use SIM for the final analysis. Deselect trait HD_LA_07, leaving only HD_LA_07_trimmed selected. Choose Resampling/Permutation and expand the permutation window that appears, so that the QTL plot is visible. Leave settings at their defaults (1000 iterations can be computed very quickly for SIM) and click the Start button at lower left. Be aware of what is happening. At each iteration, the trait values are randomly reassigned among the RILs, which breaks any real association between trait and markers. The software calculates a QTL scan across the entire genome (not just the chromosomes shown in your plot) and records the highest LOD score found on any chromosome. These 1000 maximum values are then sorted in ascending order, and the 950th value is taken as the 0.05 threshold, that is, the LOD that the genome-wide maximum is expected to exceed by chance in only 5% of analyses when no QTL is present. Likewise the 0.01 threshold is taken as the 990th value. These numbers are shown at top of the window (although QGene doesn't always print the alpha 0.01 and alpha 0.05 labels legibly) and are also represented by horizontal threshold lines in the plot.
  19. Do any QTL peaks appear significant that would have been rejected in the analysis of the untrimmed trait?
  20. Do we really need 1000 iterations to arrive at a reliable permutation threshold? For slower interval-mapping methods it will be useful to know the minimum number of iterations that will give satisfactory results. Test a range of iteration numbers from 50 to 10,000, and record the thresholds. What do you recommend as the minimum number of permutations to use? When you are finished with this study, close the permutation window and return to the QTL-analysis window.
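The permutation procedure of steps 18 to 20 can be sketched in a few lines of Python. This is not QGene's implementation: as a stand-in for interval mapping it uses a simple single-marker LOD, computed as (n/2) * log10(RSS without the marker / RSS with the marker), and the marker genotypes and trait values are simulated. All function names are hypothetical.

```python
import math
import random


def marker_lod(trait, genotypes):
    """Single-marker LOD from residual sums of squares without and with the marker."""
    n = len(trait)
    grand_mean = sum(trait) / n
    rss_null = sum((y - grand_mean) ** 2 for y in trait)
    rss_marker = 0.0
    for genotype in set(genotypes):
        values = [y for y, g in zip(trait, genotypes) if g == genotype]
        class_mean = sum(values) / len(values)
        rss_marker += sum((y - class_mean) ** 2 for y in values)
    if rss_null == 0.0:
        return 0.0
    if rss_marker == 0.0:
        return math.inf
    return (n / 2.0) * math.log10(rss_null / rss_marker)


def permutation_maxima(trait, markers, iterations, rng):
    """Shuffle the trait repeatedly, recording the genome-wide maximum LOD each time."""
    shuffled = list(trait)
    maxima = []
    for _ in range(iterations):
        rng.shuffle(shuffled)
        maxima.append(max(marker_lod(shuffled, g) for g in markers))
    return maxima


def permutation_threshold(maxima, alpha):
    """Value at rank ceil((1 - alpha) * N) of the maxima sorted in ascending order."""
    ordered = sorted(maxima)
    rank = math.ceil(round((1.0 - alpha) * len(ordered), 9))
    return ordered[max(rank, 1) - 1]


# Simulated data: 60 lines, 20 unlinked markers, a trait with no QTL at all.
rng = random.Random(613)
markers = [[rng.choice("AB") for _ in range(60)] for _ in range(20)]
trait = [rng.gauss(85.0, 3.0) for _ in range(60)]

for iterations in (50, 200, 1000):
    maxima = permutation_maxima(trait, markers, iterations, rng)
    print(iterations,
          round(permutation_threshold(maxima, 0.05), 2),
          round(permutation_threshold(maxima, 0.01), 2))
```

Running the loop several times with different seeds shows how much the thresholds wander at each iteration count, which is the same comparison step 20 asks you to make inside QGene. Note that with only 50 iterations the 0.01 threshold is simply the largest value observed.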
  21. Considering trait correlations in assessing QTL results. We'll now examine QTLs for HD, HARV, and GFD. Recall the relationship between these traits, described above in step 14. With the SIM LOD and Add effect analyses still selected, select all the chromosomes again, and in the Traits list select the four HARV and four HD traits. Again, clean up the view by deselecting the chromosomes that appear to carry no QTLs. By selecting and deselecting appropriate traits, identify QTLs that appear to be shared by both HARV and HD and QTLs that are specific to only one of the two traits. In doing this, realize that it's most sensible to compare the two traits within the same year and location, where they will be subject to the same environmental effects.
  22. We might think that any QTL that would increase days to heading would also increase days to harvest, based on the simple arithmetical relationship between these two quantities. Do you observe substantial HD QTLs that are not also strong HARV QTLs for the same year and location? Can you propose any biological or genetic explanation for this?
  23. Let's try two ways to separate these two traits for exploring their genetic basis. Recall that GFD is simply the difference between harvest and heading date. Concisely describe the relationship of GFD QTLs with those of HARV and HD, and again propose an explanation for any trait-specific QTLs.
  24. From the main window's Analysis menu, select Trait analysis, and in this window select Analysis/Regress traits. In the Trait regression dialog, select as a Response trait HARV_LA_07, and as a Factor trait HD_LA_07_trimmed. In the Residual trait name box, enter the new trait name HARV_on_HD_LA_07, and click OK. Dismiss the dialog that appears, showing the regression results, close the Trait analysis window, and return to the QTL window. You will need to close this window and reopen it in order to view the new trait that has been added to the trait list. Again select all chromosomes and the SIM LOD and Add effect options, and finally select both your new trait and GFD_LA_07, whose profiles you will find to match closely.
  25. What did we just do? We used regression to find out how much of HARV could be explained by HD (for LA in 2007) and subtracted this, so that all variation that remained (the residuals from regression) was independent of HD. Why do you think that the residuals trait shows a QTL plot very close to that of GFD? (This question takes a little statistical intuition, so do your best on it). However, for your future QTL work, keep in mind that this regression method may be used for traits that are measured in different units and have no such simple relationship as the two we selected. You should always be alert to the possibility that the "QTLs" that you seem to be seeing for one trait are really QTLs for a different trait that is the underlying cause of variation in the first trait.
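For readers who want to see the arithmetic behind the residual trait of steps 24 and 25, here is a minimal Python sketch of an ordinary least-squares regression of one trait on another. It is an illustration only and may differ from how QGene handles details such as missing data. The function name is hypothetical and the trait values are invented.

```python
def residual_trait(response, factor):
    """Residuals from a least-squares regression of response on factor.

    Lines with a missing value (None) in either trait get a missing residual.
    """
    complete = [(x, y) for x, y in zip(factor, response)
                if x is not None and y is not None]
    mean_x = sum(x for x, _ in complete) / len(complete)
    mean_y = sum(y for _, y in complete) / len(complete)
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [None if x is None or y is None else y - (intercept + slope * x)
            for x, y in zip(factor, response)]


# Invented heading and harvest dates (days) for six lines; one HD value is missing.
hd = [80, 85, 79, 90, None, 83]
harv = [115, 118, 112, 126, 120, 117]

harv_on_hd = residual_trait(harv, hd)
print([None if r is None else round(r, 2) for r in harv_on_hd])
```

By construction the residuals average zero and are uncorrelated with the factor trait, which is the sense in which the new trait is "independent of HD". Comparing the residuals with HARV - HD for the same lines is a useful way to approach the question in step 25.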
  26. QTL analysis of HRM, our target trait. Examine the QTL profiles of trait HRM in AR and in LA. You'll see immediately that they are quite different. This trait is known to be strongly influenced by growing environment. Comment on the consistency of HRM QTL profiles within and across years and locations. How does it relate to the results in your correlation table? Note: include at least one plot in your report for this lab. QGene allows you to save graphics files with File/Save image to file. This requirement is mostly to give you practice in including images in HTML pages. Follow the directions in Lab 1; be sure that your image link is images/my_image.jpg and upload the image file to KSOL along with your HTML report.
  27. Plot HD and HRM QTL profiles together, for corresponding years and locations. Do the profiles appear to correspond in some environments? Which? What is the relationship between the additive effects for HD and HRM? How does this relationship correspond to the correlations in your correlation table?
  28. How much of HRM can be explained by HD and HARV, and how does this correspond to the growing environment? For this I'm not asking for a quantitative answer. I merely ask you to find out, as in step 24, what happens to HRM QTLs after we remove the influence of HD, HARV, or both (you can select more than one trait as a factor in the Trait regression dialog).
  29. Summarize your observations about the relationship of head-rice yield to heading date and harvest date in AR and LA, and suggest any genetic or biological explanation you can think of. Do you see any QTL evidence of the generally high and stable HRY for which rice producers favor the Cypress cultivar? Can you suggest any experimental approach to obtain more evidence?
  30. If you wish to play with QGene and these data some more, I suggest using the Single-trait multiple IM and Multiple-trait MLE analyses. However, including these here would make it too elaborate for one lab exercise.