Nelson lab => software

Page last updated 11/20/05MATLAB plot of linkage in SBSBS

MatLink: a MATLAB utility for estimating genetic linkage in exotic line-cross mating designs

Requirements

This software requires the commercial program MATLAB (The MathWorks, Inc.) and at least one of its add-ons, the Symbolic Math Toolbox. The Statistics Toolbox is also used for goodness-of-fit tests on single-locus segregation, but isn't needed for linkage calculation.

Motivation

While the most informative mating designs for constructing a linkage map employ early-generation populations such as BC1, F2, and DH and advanced selfed or intercrossed lines such as RILs and AILs, some researchers choose to collect genotype data on lines advanced by a series of operations that may include a mix of backcrossing, selfing, and (more rarely) haploid-doubling. But how do you make a map with data from, say, a BC2F3 population? Conventional mapping programs don't have genetic models for such a design, and trying to fool them by telling them that the data are from, say, an F2 design is unlikely to produce a trustworthy map.

Basic ideas behind linkage estimation when you can't count recombinants

Linkage estimation and multipoint map assembly (the ordering problem) are separable. I'll outline -- skipping most of the mathematical detail -- how linkage may be estimated in any generation in a progeny advanced from the F1 of a line cross by any combination of the three breeding operations listed above.

Maximum-likelihood estimation (MLE) of linkage between two loci A and B begins with the construction of an equation expressing the likelihood of the data as a function of the data values and the unknown variable (here rAB) to be estimated. The data are the observed counts of genotype combinations at the two loci. These combinations fall into 25 classes at most, reaching this maximum in a design whose last operation is a selfing and in which dominant and codominant markers were used for genotyping.

aabb    -
Aabb   
-
AAbb    -
aaBb    -
AaBb    aABb
AABb   
-
aaBB    -
AaBB    -
AABB    -
aab_    -
a_bb    -
aaB_    -
A_bb    -
Aab_    -
a_Bb    -
AaB_    -
A_Bb    -
AAb_    -
a_BB    -
AAB_    -
A_BB    -
a_b_    -
A_b_    -
a_B_    -
A_B_    -

I've placed the fully informative genotypes in the first 9 rows, and lumped together the phase-indistinguishable classes (the two double heterozygotes). We must work out the likelihood of occurrence of each class in the mating design we're using. We can do this by hand using a Punnett square, but once we get beyond the first generation or two, the tedium of multiplying polynomials and collecting all their terms suggests finding a systematic way. Incidentally, the genotype combinations in some pairs of classes are symmetrical and will always have the same likelihood, but it turns out wrong to lump them together.

States, operators, and symbolic instead of numerical calculation

A useful idea is to define a state vector showing at any given generation the relative proportions of each two-locus class, as a function of r. To apply a B, S, or D operation to this vector, we premultiply it by an operator matrix, which is just a matrix of transition probabilities (also functions of r) between all possible true parent genotypes (represented by column labels) and progeny genotypes (row labels). Here is the F1 state vector, SF1, with genotype labels at the left:

aabb    0
Aabb    0
AAbb    0
aaBb    0
AaBb    1
aABb    0
AABb    0
aaBB    0
AaBB    0
AABB    0

and here is the BC operator matrix, OBC, whose rows and columns follow the same labeling scheme:

1    1/2   0   1/2  (1-r)/2  r/2     0    0    0    0
0    1/2   1    0    r/2     (1-r)/2 1/2  0    0    0
0    0     0    0    0       0       0    0    0    0
0    0     0   1/2   r/2    (1-r)/2  0    1    1/2  0
0    0     0    0   (1-r)/2  r/2     1/2  0    1/2  1
0    0     0    0    0       0       0    0    0    0
0    0     0    0    0       0       0    0    0    0
0    0     0    0    0       0       0    0    0    0
0    0     0    0    0       0       0    0    0    0
0    0     0    0    0       0       0    0    0    0

The S and D operators can be found in this MATLAB M-file. Now, to find the state vector for, say, a BC1F5 population, we need only carry out the matrix multiplication

SBC1F5 = OS x OS x OS x OSOBC x SF1

For this we can use a symbolic mathematics package such as Mathematica, Maple, or MATLAB's Symbolic Toolbox and a simple code loop. But we can also do it rather easily in any language, as student Roby Joehanes showed me with a quickly written Java class. For the BC1F5 design, the final state vector comes out to 1/64 *

  47-8*r^9+40*r^8-96*r^7+144*r^6-156*r^5+140*r^4-112*r^3+78*r^2-47*r
   16*r^9-80*r^8+192*r^7-288*r^6+296*r^5-216*r^4+112*r^3-40*r^2+10*r
             -8*r^9+40*r^8-96*r^7+144*r^6-140*r^5+76*r^4-38*r^2+37*r
   16*r^9-80*r^8+192*r^7-288*r^6+296*r^5-216*r^4+112*r^3-40*r^2+10*r
 2-16*r^9+80*r^8-192*r^7+288*r^6-312*r^5+264*r^4-168*r^3+72*r^2-18*r
         -16*r^9+80*r^8-192*r^7+288*r^6-280*r^5+168*r^4-56*r^3+8*r^2
    16*r^9-80*r^8+192*r^7-288*r^6+296*r^5-216*r^4+112*r^3-40*r^2+8*r
             -8*r^9+40*r^8-96*r^7+144*r^6-140*r^5+76*r^4-38*r^2+37*r
    16*r^9-80*r^8+192*r^7-288*r^6+296*r^5-216*r^4+112*r^3-40*r^2+8*r
  15-8*r^9+40*r^8-96*r^7+144*r^6-156*r^5+140*r^4-112*r^3+78*r^2-45*r

which is just a set of 10 9th-degree polynomials in r representing probabilities of occurrence of each genotype class, conditional on r. A second glance shows the symmetry; there are only 7 distinct classes, and the middle two will be lumped together too because they're indistinguishable phase variants of the same genotype (double heterozygote).

Solution by maximum likelihood

All we need from these expressions is the coefficients, which we can extract into a matrix. We also differentiate each row expression with respect to r, and make a second matrix of the coefficients of the derivatives, which we'll need for maximizing the log likelihood and computing the linkage information carried by a marker pair. For mating designs ending in a selfing, where dominant markers have been used in our mapping experiment, we need to expand these matrices to include expressions for the other 15 (incompletely informative) genotype classes listed above, by premultiplying them with a permutation matrix. Once we have the matrices for the mating design we're interested in, we can construct and solve the likelihood equation for any given two-locus segregation. We end up with tables of two-point linkage and information or LOD estimates for an entire marker set.

Solution by EM algorithm

Another way to estimate linkage between two loci starts with a guess for r (for example, 0.25), which is plugged into a set of expressions (again in r) to yield the estimated number nc of recombinants between the loci. Now the quotient nc/N (N being the number of individuals in the population) gives us an updated estimate for r, and we continue alternating Expectation and Maximization steps until our estimate stops changing. You can find helpful material, referring to an F2 design and explaining how to treat dominant markers, in Liu's Statistical Genomics (p. 172). My adaptation of this algorithm to our case of a sequence of mating operations relied on the observation that N is the number of recombinationally informative individuals only in the first generation following the F1. In any subsequent generation, since recombination between two loci is observable only in parents doubly heterozygous for those loci, the proportions of informative individuals must be computed as the sum of proportions of the two double-heterozygote classes in the state vector for that generation. Code isn't shown here, but we still use the genotype probability matrix described above, plus other matrices for accumulating crossover probabilities and normalizing pooled likelihoods. EM may converge a bit slower than the MLE method and so far in my implementation gives qualitatively similar but not identical estimates.

More on coefficient matrices

The attached table gives all coefficient matrices computed (using this recursive program) for up to 5 generations of any combination of selfing or backcrossing, as well as of their derivatives. Variants that include haploid-doubling are equally easy to compute, but since such designs are rarely used, I didn't bother. And since we can quickly compute any desired matrix on demand, we don't need to precalculate and store it, so these matrices are shown only for demonstration.

We can also use these state coefficient matrices to produce the 3 x 3 genotype transition matrices used in the Jiang/Zeng Markov chain method for estimating QTL genotype distributions from flanking markers.

Using the program

The full source code isn't given here, but it's available on request.

If you do have the MATLAB packages needed, place all the files in your workspace, prepare your three data files, and run the M-file LINKAGE_MAIN.m. Examine the sample files to see the format. You must use the genotype codes 1 through 6 to represent aa, AA, Aa, a_, A_, and - (missing). In a design including backcrossing, the recurrent parent is always the aa. Marker data and marker names must be in separate files, and you must provide a simple info file containing only four strings on the same line: the name, type, and size of your population, and the number of markers used in your experiment. Population type is written as a string that uses only the letters "b", "s", and "d"; for example sbsbbs describes a population selfed to the F2, then backcrossed, selfed, backcrossed twice, and selfed a final time.

Caution: linkage estimation assumes that no selection affecting allele frequencies of the targeted markers has taken place. If there's been selection, estimates may be wrong -- maybe badly.

Some results

The figures show MATLAB-generated plots of linkage and total information or LOD per marker pair, computed by MatLink on data simulated with QGene for s, sbsbs, bssss, and sssss designs, using 100 individuals and a 120-cM chromosome carrying 40 markers. The s (i. e. F2) is represented by two datasets, one with only codominant markers and the other with 10% a_ and 25% A_ markers. These plots are best viewed inside MATLAB, where you can rotate them. Note that markers are plotted in their true order; otherwise the plot would be a real mess.
F2 linkage plot for 40 codominant markers F2 information, 40 markers, 100 plants

F2 linkage (left) and Fisher information (right)

F2 linkage LOD, 40 codominant markers F2 information, 35% dominant markers, 100 plants

F2 LOD (left) and F2 (with 35% dominant markers) information (right)

Note the big drops in information at points where dominant markers are involved. Even for tightly linked marker pairs (those on the diagonal) the linkage MLE for dominant markers in repulsion phase (one showing genotypes AA and a_ and the other A_ and aa) can approach 0.5, as can be seen on the linkage plot below in the form of spikes on the diagonal. In any design with a lot of heterozygosity still hanging around, you'd split the data into two sets, with either set containing the codominant markers and the dominant markers of one type, and make two maps.
Linkage in f2 design with 35% dominant markers Information in sbsbs design

F2 linkage (100 plants, 35% dominant markers) (left) and sbsbs information (right)

Notice the decline of information content in the advanced-generation design involving backcrossing. Compare with the F6 design below.
Linkage in sbsbs design LOD plot of linkage in sbsbs design

sbsbs linkage (left) and LOD (right)

f6 linkage plot f6 information plot

sssss (F6) linkage (left) and information (right)

The information plot shows that selfing to advanced generations doesn't lose information as do designs involving backcrossing -- reasonably enough, since allele frequencies are maintained in selfing and phase uncertainty diminishes, while the ultimate endpoint in backcrossing is progeny containing no linkage information because they match the recurrent parent.

Acknowledgments

Thanks to several different colleagues who inquired about mapping in advanced-generation designs, some of whom provided data sets; student Zhigang Guo who was the first in the lab to use MATLAB and inspired me to look into its capabilities, most recently the Symbolic Math Toolbox; and, farther back, to Steve Tanksley whose ideas (and elaboration on previous ideas) have inspired many researchers to try advanced-backcross designs for mapping.