Lab 8. QTL analysis
In this lab we will do a practical analysis of a real data set. We will use a public software package called QGene, which is
developed here at K-State. Our goal will be to explore the data set
in an orderly way that will prove valuable to you in later analyses.
- QGene has been installed on your lab computers in directory
C:\PLPTH_613.
Double-click on the qgene.exe
launcher to start the Java program.
- The real data set we will use represents a rice
recombinant-inbred-line (RIL) progeny of about 300 RILs. It was grown
in the states of Arkansas and Louisiana in the summers of 2006 and 2007 and
evaluated for many traits. The main purpose of the study was to
identify QTLs for whole milling yield, often called head-rice yield or
HRY. This is the proportion of rice grains that retain at least 75% of
their length after milling. Milling is the series of operations that
removes husk and bran from "rough" rice (rice with the husk on, just as
harvested from the field) to yield white rice. Milling tends to cause
rice breakage, and rice processors pay much less for broken rice, so
that an understanding of the genetics of HRY would be of value to the
rice industry. The parents of our cross were Cypress and LaGrue, both
high-quality medium-grain rice cultivars, but with Cypress known to
give consistently high HRY. So we hope to find HRY QTLs in the RILs.
- Download this file
to your working directory and open it with a text editor to examine the
format. You'll find it explained on this
QGene manual page. Note that each marker row contains, besides the
marker data, the chromosome and map position in which the marker is to
be placed by QGene.
- In QGene, choose File/Load data and navigate to your data file to load it. A window called the DataManager
will appear. In the left panel is a list of the data files currently
loaded (only one, so far) and in the right panel you may view trait,
map, and marker data by clicking on the three tabs. Click on the Marker data tab. You can zoom the image by holding down the Control key and turning the mouse wheel. By examining the top left colored squares and the marker data in your text editor, determine which colors represent genotypes A, B, H, and missing, and describe concisely how you found out.
- Examine the genetic linkage map.
In interval mapping, QTLs of small effect are difficult to detect in
wide intervals between markers. Although there is no exact equation
that I know of -- especially since QTL detection power also depends on
population size and other factors -- we may say roughly that intervals
no wider than 15 cM are desired. On this basis, about what proportion of small-effect QTLs might fail to be detected in this population? As usual, indicate concisely the quantitative basis for your answer.
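If you would like to check your map-based estimate outside QGene, here is a minimal Python sketch of one possible quantitative basis: the share of total map length that lies in intervals wider than 15 cM. The function name and the marker positions are made up for illustration and are not taken from the lab data set. Treating "share of map length" as a stand-in for "share of QTLs" assumes that QTLs are spread evenly along the map.

```python
def wide_interval_fraction(positions_by_chrom, max_cm=15.0):
    """Fraction of total map length lying in intervals wider than max_cm."""
    total = 0.0
    wide = 0.0
    for positions in positions_by_chrom.values():
        ordered = sorted(positions)
        # Walk through adjacent marker pairs on this chromosome
        for left, right in zip(ordered, ordered[1:]):
            width = right - left
            total += width
            if width > max_cm:
                wide += width
    return wide / total if total else 0.0


# Hypothetical marker positions in cM (NOT the lab data)
example_map = {
    "1": [0.0, 10.0, 30.0, 40.0],
    "2": [0.0, 5.0, 15.0, 45.0],
}
fraction = wide_interval_fraction(example_map)
print(f"{fraction:.2f} of the map length is in intervals wider than 15 cM")
```

To use it on the real map, you would type in the positions shown in the Map tab of the Data Manager, one list per chromosome.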
- Examine the trait data.
You can view thumbnail histograms (if you have a small thumb!) in the
Data Manager. Are there any apparent outliers? Outliers are data points
that do not follow any obvious trend over the remainder of the data in
a distribution. One example is seen in trait HD_LA_07.
Let's look at this distribution in more detail. From the main QGene window, select menu item
Analysis/Trait analysis and then in the trait list in the left panel,
select HD_LA_07.
- In the larger histogram, we can see that all
of the heading dates in Louisiana 2007 are clustered at the left
end of the distribution around 85, but one of the RILs seems to have
headed about 40 days after the others. The absence of any other RILs in
between makes this data point look suspicious. Do we have any other
evidence? Look at the table on the right and scroll over to the column
labeled HD_LA_07.
Click in the column header to sort on the values; you may have to click
a second time to sort in descending order. Notice that the same RIL, in
Louisiana 2006, showed a heading date of 82 days, and in Arkansas (AR)
in both years, headed in only 79 days. Not only this, but it was
harvested (HARV_LA_07) at 111 days, 13 days before it supposedly headed. Nonsense!
We'll plan to delete it, by changing the value to missing data. Since
we can't do this from inside QGene, we'll have to modify the data file
and then reload it.
- To save effort, identify any scary data points in other traits, and make notes of them. Did you elect to remove any more points?
In case you're wondering whether this is statistically OK to do: yes,
as long as you do it conservatively and have a good reason to suspect
error. With 299 data points, if the removal of one or two outliers
radically changes the QTL analysis, there is something wrong with the
data anyway.
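Scanning histograms by eye is the approach used in this lab, but a simple numerical screen can help you avoid missing a point. The sketch below is only one possible rule, not something QGene does: it flags values that lie more than a chosen number of scaled median absolute deviations (MAD) from the median. The cutoff of 5, the function name, and the heading dates are all assumptions chosen for illustration. A flagged value still needs a good reason, such as the impossible harvest date above, before you remove it.

```python
import statistics


def flag_outliers(values, cutoff=5.0):
    """Return indices of values more than `cutoff` scaled MADs from the median.

    Missing values are represented by None and are skipped.
    """
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    if mad == 0:
        return []  # no spread to scale by; nothing can be flagged
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [
        i
        for i, v in enumerate(values)
        if v is not None and abs(v - med) / scale > cutoff
    ]


# Hypothetical heading dates (NOT the lab data); None marks missing data
heading = [84, 85, 86, 83, 85, 87, 84, None, 86, 124]
print(flag_outliers(heading))
```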
- Just
to see what difference this cleaning step made, we'll create another
trait that's identical except for this data point. Open your data file
with a text editor, find the row for trait HD_LA_07, and duplicate it. Rename one of the rows HD_LA_07_trimmed,
and in this row, replace the outlying value with the missing-data
symbol, which is just a period. Use this opportunity to replace any
other outliers you found in any of the traits. Save your file (you need not change the name).
- Close the Trait Analysis window, locate the Data Manager again (you can always find it with Window/Data manager), select the dataset name in the Data panel, and press your Delete key. Now reload the data file. In the Traits
tab of the Data Manager, note the drastic change in the trait
distribution that you made by removing one data point! Below, we'll
find out whether the QTL analysis of this trait has changed.
- Examine the marker genotype segregation. From the main window, select Analysis/QTL mapping, or just type Ctrl-R. The Analysis window that now appears is the window where most of your QGene work will be done. Click on the + icon next to Genotype segregation analysis in the panel on the right, and then click in the checkboxes for the first three entries, aa, Aa, and AA.
These statistics represent, for each marker, the proportions of
informative RILs carrying each of these three marker genotypes (where
an informative RIL is one whose genotype is non-missing). Click in the Chromosomes
panel at left, and type Ctrl-A to select all of the chromosomes. Now,
even though this analysis doesn't depend on trait data, you'll need to
click on a trait in the Traits panel to view the plot.
- Drag the right window edge and the internal dividers to expand the black plot window, and use the Vertical zoom and Horizontal zoom sliders to shrink the plot to a convenient size. In
your report you will need to comment concisely on any pattern of
segregation distortion and any instances of extreme distortion.
To identify markers of interest and find the corresponding genotype frequencies, also check the boxes f(aa),
f(Aa), and f(AA).
Finally, check the Chi^2 and -log10(P(Chi^2)) boxes.
Choose File/Export text to clipboard, open a new Excel worksheet, select a cell, and choose
Paste.
Don't include this table in
your report; just use it to support your
observations about marker segregation in this data set. These lines are
claimed to be RI lines, inbred to near-homozygosity at all loci. Is
this plausible? Based on a loss of half of the heterozygosity at every
selfing generation following the F1, how many
generations of inbreeding does the marker segregation pattern suggest
that they have really undergone? Note that in the Excel table, you can
sum the frequency columns to calculate an overall estimate of the three
genotype frequencies.
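The halving rule mentioned above can be turned into a small calculation. If the F1 is fully heterozygous and each selfing generation halves the heterozygosity, the expected heterozygote frequency after t selfing generations is (1/2)^t, so t = -log2(H). The Python sketch below applies this, together with a 1-df chi-square test for 1:1 segregation of the two homozygous classes. The chi-square shown here is my assumption of a reasonable test and is not necessarily the one QGene computes; the input numbers are hypothetical.

```python
import math


def selfing_generations(het_freq):
    """Selfing generations after the F1 implied by a heterozygote frequency."""
    return -math.log2(het_freq)


def chi_square_1to1(n_aa, n_AA):
    """Chi-square (1 df) for 1:1 segregation of the two homozygous classes."""
    expected = (n_aa + n_AA) / 2
    return ((n_aa - expected) ** 2 + (n_AA - expected) ** 2) / expected


def chi_square_p_1df(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical values (NOT the lab data)
generations = selfing_generations(0.0625)  # overall heterozygote frequency
chi2 = chi_square_1to1(120, 180)           # homozygote counts at one marker
p_value = chi_square_p_1df(chi2)
print(generations, chi2, -math.log10(p_value))
```

In practice you would plug in the overall f(Aa) that you summed in your Excel table.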
- Examine trait correlations. Before we start looking at the QTL plots, let's find out a little more about the traits. Open the Trait analysis window as in the "Examine the trait data" step above. Note that the trait RIL
isn't really a trait! It just lists the numbers of the RILs, as a
convenience for matching them with their trait values. In the Trait list at left, select all the traits except for RIL and the untrimmed HD_LA_07 (you can select or deselect individual traits by Ctrl-clicking on them) and choose Analysis/Compare traits. A correlation table will appear, though the column labels are obscured owing to a bug in QGene or Java. Choose File/Export text to clipboard and paste into another Excel worksheet.
- In this data set I've included only four traits: HD (heading date, measured from emergence of the rice plant from the soil), HARV (harvest date, measured similarly), GFD (HARV - HD, as you can verify from the trait data shown in the Trait analysis window), and HRM (head-rice
yield, measured as the percent of milled kernels that
remained unbroken). The rice lines were not harvested all on the same
date, as would be the practice on a farm where only one cultivar was
being grown. Rather, each line was harvested when its grain appeared
ripe (yellow color reaching the base of the panicle). In examining the
correlations, pay close attention
to the correlations of our target trait HRM with HD in both AR and LA locations. Do you observe anything of possible interest?
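For reference, each entry in a correlation table of this kind is a Pearson correlation coefficient between two trait columns. The Python sketch below computes one such coefficient using only the RILs that have values for both traits. Whether QGene treats missing values in exactly this pairwise way is an assumption on my part, and the two trait columns are made-up numbers.

```python
import math


def pearson_r(x, y):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical trait columns (NOT the lab data); None marks missing data
trait_a = [80.0, 84.0, None, 92.0, 96.0]
trait_b = [62.0, 60.0, 58.0, None, 55.0]
r = pearson_r(trait_a, trait_b)
print(round(r, 3))
```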
- Conducting the QTL analysis. Now let's have a look at the QTL plots. Return to the QTL analysis window and deselect the check boxes in the Segregation analysis. Select the two traits HD_LA_07 and HD_LA_07_trimmed, and in the right-hand panel open up the Simple IM tree and click the LOD and Add effect boxes. You should already know that the LOD
statistic quantifies the evidence for the presence of a QTL at each
tested position. In contrast, the additive effect represents the
expected change in phenotype when one parental allele is substituted
with the opposite allele at the tested QTL.
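To make these two statistics concrete, here is a minimal Python sketch that computes them at a single marker by regression of the trait on marker genotype. This is single-marker analysis, a simpler relative of the interval mapping that QGene performs, so treat it as an illustration of the ideas and not as QGene's algorithm. The LOD is computed as (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the marker in the model. The sign convention for the additive effect (A homozygote minus B homozygote, halved) is an assumption, and the data are hypothetical.

```python
import math


def single_marker_scan(genotypes, phenotypes):
    """Return (LOD, additive effect) at one marker for homozygous lines.

    genotypes: 'A' or 'B' per line; anything else is treated as missing.
    phenotypes: trait values, with None for missing data.
    Assumes both genotype classes are present and RSS1 is nonzero.
    """
    pairs = [
        (g, y)
        for g, y in zip(genotypes, phenotypes)
        if g in ("A", "B") and y is not None
    ]
    n = len(pairs)
    grand_mean = sum(y for _, y in pairs) / n
    means = {}
    for cls in ("A", "B"):
        values = [y for g, y in pairs if g == cls]
        means[cls] = sum(values) / len(values)
    # RSS under the null model (one overall mean)
    rss0 = sum((y - grand_mean) ** 2 for _, y in pairs)
    # RSS under the marker model (one mean per genotype class)
    rss1 = sum((y - means[g]) ** 2 for g, y in pairs)
    lod = (n / 2) * math.log10(rss0 / rss1)
    additive = (means["A"] - means["B"]) / 2  # effect per allele
    return lod, additive


# Hypothetical data (NOT the lab data)
genos = list("AAAABBBB")
days = [80.0, 82.0, 81.0, 83.0, 86.0, 88.0, 87.0, 89.0]
lod, additive = single_marker_scan(genos, days)
print(round(lod, 2), additive)
```

Note that the additive effect is per allele, so the two homozygotes differ by twice this value.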
- To make the plot more
manageable, I usually remove chromosomes showing little evidence for QTLs
-- using a rough LOD threshold of about 3. You can do this by
Ctrl-clicking on the names of these chromosomes in the left-hand list.
Also, the default line pattern for LOD contours is a dot-dashed line,
while I prefer a solid line. To change this, scroll the right-hand
analysis-list panel to the left and double-click on the icon next to LOD. In the Choose a pattern
dialog that comes up, you can click on any desired line pattern and
then click OK to accept it. If you wish to change the color of a line,
you can do it similarly, using the Traits panel: double-click on your desired trait and choose a new color from the palette that appears.
- You should be able to
see some differences in the QTL profiles of your original and trimmed
traits. How important are these? Select the other three HD
traits in order to view all QTL profiles at once. Note that the
profiles have many common features in both locations and years, which
is not surprising for this highly heritable trait. In particular, note
the chromosome-7 and chromosome-3 peaks, which appear to have been
those most affected by our trait editing. Before we test the
significance of these peaks with permutation, look at the
additive-effect plot below the LOD plot. This is how you determine the
parental source of the influential alleles at the QTLs you declare. For
example, on chromosome 2 the Add profile dips below 0 to the
LaG(rue) side of the plot. This means that in this region, alleles from the LaGrue
parent increased the trait values. For this trait, this means that they
delayed heading. The numbers on the Y axis at left are in units of the
trait itself and are expressed per allele.
Thus at the chromosome-10 QTL we see that the LaG homozygote heads
4.5 days later than the Cyp(ress) homozygote. Be
sure you understand this essential feature of QTL analysis. Can you
give a reasonable estimate of the difference in heading date between
the Cypress and LaGrue parents, based on the sums of QTL effects?
- Let's
calculate a plausible acceptance threshold for our SIM analysis, though
keeping in mind that we do not intend to use SIM for the final
analysis. Deselect trait HD_LA_07, leaving only HD_LA_07_trimmed selected. Choose Resampling/Permutation
and expand the permutation window that appears, so that the QTL plot is
visible. Leave settings at their defaults (1000 iterations can be
computed very quickly for SIM) and click the Start
button at lower left. Be aware of what is happening. At each iteration,
the trait values are randomly reassigned among the RILs, which breaks any real association between trait and genotype. The software calculates a QTL scan
across the entire genome (not just the chromosomes shown in your plot)
and records
the highest LOD score found on any chromosome. These 1000 maximum
values are then sorted in ascending order, and the 950th value is
taken as the 0.05 threshold: the LOD that the genome-wide maximum exceeds
in only 5% of permuted data sets, in which no true QTL can be present. Likewise the 0.01 threshold is
the 990th value. These numbers are shown at top of the
window (although QGene doesn't always print the alpha 0.01 and alpha 0.05 labels legibly) and are also represented by horizontal threshold lines in the plot.
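The permutation procedure just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not QGene's implementation: the "scan" uses the absolute difference between genotype-class means at each marker as a simple stand-in for LOD, and the markers and trait values are randomly generated. What matters is the logic: shuffle the trait, scan the whole genome, keep only the maximum, and read the thresholds from the sorted maxima.

```python
import random


def make_toy_scan(marker_genotypes):
    """Build a scan function returning one test statistic per marker.

    The statistic is the absolute difference between genotype-class means,
    a stand-in for LOD chosen only to keep this sketch short.
    """
    def scan(trait):
        stats = []
        for marker in marker_genotypes:
            a = [y for g, y in zip(marker, trait) if g == 0]
            b = [y for g, y in zip(marker, trait) if g == 1]
            if a and b:
                stats.append(abs(sum(a) / len(a) - sum(b) / len(b)))
            else:
                stats.append(0.0)
        return stats
    return scan


def permutation_thresholds(trait, genome_scan, n_iter=1000,
                           alphas=(0.05, 0.01), seed=1):
    """Genome-wide thresholds from the maxima of permuted-trait scans."""
    rng = random.Random(seed)
    shuffled = list(trait)
    maxima = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # break any trait-genotype association
        maxima.append(max(genome_scan(shuffled)))  # genome-wide maximum only
    maxima.sort()  # ascending order
    thresholds = {}
    for alpha in alphas:
        n_exceed = round(alpha * n_iter)  # e.g. 50 of 1000 for alpha = 0.05
        # e.g. the 950th value in ascending order when n_iter is 1000
        thresholds[alpha] = maxima[n_iter - n_exceed - 1]
    return thresholds


# Randomly generated toy data: 20 markers scored on 50 lines (NOT the lab data)
rng = random.Random(0)
toy_markers = [[rng.randint(0, 1) for _ in range(50)] for _ in range(20)]
toy_trait = [rng.gauss(85.0, 3.0) for _ in range(50)]
thresholds = permutation_thresholds(toy_trait, make_toy_scan(toy_markers))
print(thresholds)
```

Changing n_iter and seed in this sketch is a quick way to see how much a threshold wobbles from run to run.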
- Do any QTL peaks appear significant that would have been rejected in the analysis of the untrimmed trait?
- Do
we really need 1000 iterations to arrive at a reliable permutation
threshold? For slower interval-mapping methods it will be useful to
know the minimum number of iterations that will give satisfactory
results. Test a range of iteration numbers from 50 to 10,000,
and record the thresholds. What do you recommend as the minimum number
of permutations to use? When you are finished with this study, close the permutation window and return to the QTL-analysis window.
- Considering trait correlations in assessing QTL results. We'll now examine QTLs for HD, HARV, and GFD. Recall the relationship between these traits, described above in the step that lists the four traits. With the SIM LOD and Add effect analyses still selected, select all the chromosomes again, and in the Traits list select the four HARV and four HD traits.
Again, clean up the view by deselecting the chromosomes that appear to
carry no QTLs. By selecting and deselecting appropriate traits, identify QTLs that appear to be shared by both HARV and HD and QTLs that are specific to only one of the two traits.
In doing this, realize that it's most sensible to compare the two
traits within the same year and location, where they will be subject to
the same environmental effects.
- We
might think that any QTL that would increase days to heading would also
increase days to harvest, based on the simple arithmetical relationship
between these two quantities. Do
you observe substantial HD QTLs that are not also strong HARV QTLs for
the same year and location? Can you propose any biological or genetic
explanation for this?
- Let's try two ways to separate these two traits for exploring their genetic basis.
Recall that GFD is simply the difference between harvest and heading date.
Concisely describe the relationship of GFD QTLs with those of HARV and HD, and again propose an explanation for any trait-specific QTLs.
- From the main window's Analysis menu, select Trait analysis, and in this window select Analysis/Regress traits. In the Trait regression dialog, select as a Response trait HARV_LA_07, and as a Factor trait HD_LA_07_trimmed. In the Residual trait name box, enter the new trait name HARV_on_HD_LA_07, and click OK. Dismiss the dialog that appears, showing the regression results, close the Trait analysis window, and
return to the QTL window. You will need to close this window and reopen it in
order to view the new trait that has been added to the trait list.
Again select all chromosomes and the SIM LOD and Add effect options, and finally select both your new trait and GFD_LA_07, whose profiles you will find to match closely.
- What did we just do? We used regression to find out how much of HARV could be explained by HD
(for LA in 2007) and subtracted this, so that all variation that
remained (the residuals from regression) was independent of HD. Why
do you think that the residuals trait shows a QTL plot very close to
that of GFD? (This question takes a little statistical intuition, so do
your best on it.) For your
future QTL work, keep in mind that this regression method may also be used
for traits that are measured in different units and have no such simple
relationship as the two we selected. You should always be alert to the
possibility that the "QTLs" that you seem to be seeing for one trait
are really QTLs for a different trait that is the underlying cause of
variation in the first trait.
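For those who want to see the arithmetic behind a residual trait, here is a minimal Python sketch of simple linear regression of a response trait on one factor trait, returning the residuals. I am assuming that QGene's Regress traits dialog does ordinary least squares of this kind; the function name and the trait values are hypothetical. The final check shows the key property of the residuals, namely that they are uncorrelated with the factor trait. The interpretation question above is left to you.

```python
def regression_residuals(response, factor):
    """Ordinary least-squares fit of response on factor.

    Returns (slope, intercept, residuals). Assumes no missing values.
    """
    n = len(response)
    mean_x = sum(factor) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in factor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(factor, response))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(factor, response)]
    return slope, intercept, residuals


# Hypothetical heading and harvest dates in days (NOT the lab data)
hd = [80.0, 84.0, 88.0, 92.0, 96.0]
harv = [118.0, 120.0, 127.0, 128.0, 135.0]
slope, intercept, resid = regression_residuals(harv, hd)

# The residuals carry no linear information about the factor trait
mean_hd = sum(hd) / len(hd)
cov_with_factor = sum((x - mean_hd) * r for x, r in zip(hd, resid))
print(slope, cov_with_factor)
```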
- QTL analysis of HRM, our target trait. Examine the QTL profiles of trait HRM
in AR and in LA. You'll see immediately that they are quite different.
This trait is known to be strongly influenced by growing environment. Comment
on the consistency of HRM QTL profiles within and across years and
locations. How does it relate to the results in your correlation table? Note: include at least one plot in your report for this lab -- QGene allows you to save graphics files with File/Save image to file.
This requirement is mostly to give you practice in including images in
HTML pages. Follow the directions in Lab 1; be sure that your
image link is images/my_image.jpg and upload the image file to KSOL along with your HTML report.
- Plot HD and HRM QTL profiles together, for corresponding years and locations. Do
the profiles appear to correspond in some environments? Which? What is
the relationship between the additive effects for HD and HRM? How does
this relationship correspond to the correlations in your correlation
table?
- How much of HRM can be explained by HD and HARV, and how does this correspond to the growing environment? For this I'm not asking for a quantitative answer. I merely ask you to find out, as in the trait-regression step above,
what happens to HRM QTLs after we remove the influence of HD, HARV, or
both (you can select more than one trait as a factor in the Trait regression dialog).
- Summarize
your observations about the relationship of head-rice yield to heading
date and harvest date in AR and LA, and suggest any genetic or
biological explanation you can think of. Do you see any QTL evidence of
the generally high and stable HRY for which rice producers favor the
Cypress cultivar? Can you suggest any experimental approach to obtain
more evidence?
- If you wish to play with QGene and these data some more, I suggest using the Single-trait multiple IM and Multiple-trait MLE analyses. However, including these here would make this exercise too elaborate for one lab.