While the most informative mating designs for constructing a linkage map
employ early-generation populations such as BC1, F2,
and DH and advanced selfed or intercrossed lines such as RILs and AILs,
some researchers choose to collect genotype data on lines advanced by a
series of operations that may include a mix of backcrossing, selfing, and
(more rarely) haploid-doubling. But how do you make a map with data from,
say, a BC2F3 population? Conventional mapping programs
don't have genetic models for such a design, and trying to fool them by telling
them that the data are from, say, an F2 design is unlikely to
produce a trustworthy map.
Linkage estimation and multipoint map assembly (the ordering problem)
are separable. I'll outline -- skipping most of the mathematical detail --
how linkage may be estimated in any generation in a progeny advanced from
the F1 of a line cross by any combination of the three breeding
operations listed above.
A useful idea is to define a state vector showing at any given generation the relative proportions of each two-locus class, as a function of r. To apply a B, S, or D operation to this vector, we premultiply it by an operator matrix, which is just a matrix of transition probabilities (also functions of r) between all possible true parent genotypes (represented by column labels) and progeny genotypes (row labels). Here is the F1 state vector, SF1, with genotype labels at the left:
aabb 0The S and D operators can be found in this MATLAB M-file. Now, to find the state vector for, say, a BC1F5 population, we need only carry out the matrix multiplication
SBC1F5 = OS x OS x OS x OS x OBC x SF1For this we can use a symbolic mathematics package such as Mathematica, Maple, or MATLAB's Symbolic Toolbox and a simple code loop. But we can also do it rather easily in any language, as student Roby Joehanes showed me with a quickly written Java class. For the BC1F5 design, the final state vector comes out to 1/64 *
47-8*r^9+40*r^8-96*r^7+144*r^6-156*r^5+140*r^4-112*r^3+78*r^2-47*rwhich is just a set of 10 9th-degree polynomials in r representing
probabilities of occurrence of each genotype class, conditional on r. A
second glance shows the symmetry; there are only 7 distinct classes, and
the middle two will be lumped together too because they're indistinguishable
phase variants of the same genotype (double heterozygote).
All we need from these expressions is the coefficients, which we can extract into
a matrix. We also differentiate each row expression with respect to
r, and make a second matrix of the coefficients of the derivatives,
which we'll need for maximizing the log likelihood and computing the linkage
information carried by a marker pair. For mating designs ending in a selfing,
where dominant markers have been used in our mapping experiment, we need
to expand these matrices to include expressions for the other 15 (incompletely
informative) genotype classes listed above, by premultiplying them with
a permutation
matrix. Once we have the matrices for the mating design we're interested
in, we can construct
and solve the likelihood equation for any given two-locus segregation.
We end up with tables of two-point linkage and information or LOD estimates
for an entire marker set.
Another way to estimate linkage between two loci starts with a guess for
r (for example, 0.25), which is plugged into a set of expressions (again
in r) to yield the estimated number nc of recombinants
between the loci. Now the quotient nc/N (N being
the number of individuals in the population) gives us an updated estimate
for r, and we continue alternating Expectation and Maximization
steps until our estimate stops changing. You can find helpful material, referring
to an F2 design and explaining how to treat dominant markers,
in Liu's Statistical Genomics (p. 172). My adaptation of this algorithm
to our case of a sequence of mating operations relied on the observation
that N is the number of recombinationally informative individuals
only in the first generation following the F1. In any subsequent
generation, since recombination between two loci is observable only in parents
doubly heterozygous for those loci, the proportions of informative individuals
must be computed as the sum of proportions of the two double-heterozygote
classes in the state vector for that generation. Code isn't shown here, but
we still use the genotype probability matrix described above, plus other matrices for
accumulating crossover probabilities and normalizing pooled likelihoods.
EM may converge a bit slower than the MLE method and so far in my implementation
gives qualitatively similar but not identical estimates.
The attached table gives all coefficient
matrices computed (using this recursive
program) for up to 5 generations of any combination of selfing or backcrossing,
as well as of their derivatives. Variants that include haploid-doubling
are equally easy to compute, but since such designs are rarely used, I didn't
bother. And since we can quickly compute any desired matrix on demand, we
don't need to precalculate and store it, so these matrices are shown only
for demonstration.
We can also use these state coefficient matrices to produce
the 3 x 3 genotype transition matrices used in the Jiang/Zeng Markov
chain method for estimating QTL genotype distributions from flanking markers.
If you do have the MATLAB packages needed, place all the files in your workspace, prepare your three data files, and run the M-file LINKAGE_MAIN.m. Examine the sample files to see the format. You must use the genotype codes 1 through 6 to represent aa, AA, Aa, a_, A_, and - (missing). In a design including backcrossing, the recurrent parent is always the aa. Marker data and marker names must be in separate files, and you must provide a simple info file containing only four strings on the same line: the name, type, and size of your population, and the number of markers used in your experiment. Population type is written as a string that uses only the letters "b", "s", and "d"; for example sbsbbs describes a population selfed to the F2, then backcrossed, selfed, backcrossed twice, and selfed a final time.
Caution: linkage estimation assumes that no selection affecting allele
frequencies of the targeted markers has taken place. If there's been selection,
estimates may be wrong -- maybe badly.
Note the big drops in information at points where dominant markers are
involved. Even for tightly linked marker pairs (those on the diagonal) the
linkage MLE for dominant markers in repulsion phase (one showing genotypes
AA and a_ and the other A_ and aa) can approach
0.5, as can be seen on the linkage plot below in the form of spikes on the
diagonal. In any design with a lot of heterozygosity still hanging around,
you'd split the data into two sets, with either set containing the codominant
markers and the dominant markers of one type, and make two maps.