# Multilocus tetrasomic linkage analysis using hidden Markov chain model

Edited* by Wen-Hsiung Li, University of Chicago, Chicago, IL, and approved January 19, 2010 (received for review July 28, 2009)

## Abstract

The availability of reliable genetic linkage maps is crucial for functional and evolutionary genomic analyses. Established theory and methods of genetic linkage analysis have made map construction a routine exercise in diploids. However, many evolutionarily, ecologically, and/or agronomically important species are autopolyploids, with autotetraploidy being a typical example. These species undergo much more complicated chromosomal segregation and recombination at meiosis than diploids. In addition, there is evidence of polyploidy-induced and highly dynamic changes in the structure of the genome. These polysomic characteristics make the theory and methods of diploid linkage analysis inappropriate for these species and expose a gap in the theory and methodology of tetraploid map construction. This paper presents a theoretical model and statistical framework for multilocus linkage analysis in autotetraploids for use with dominant and/or codominant DNA molecular markers. The theory and methods incorporate the essential features of allele segregation and recombination under tetrasomic inheritance and address the major challenges in statistical modeling and marker data analysis. We validated the method and explored its statistical properties through an intensive simulation study and demonstrated its utility by analyzing AFLP and SSR marker data from an outbred autotetraploid potato population.

Polyploidization, the simultaneous duplication of all genes in a genome, has been ubiquitous in plant evolutionary history, with ≈70% of angiosperms being polyploids (1). Allopolyploids display disomic inheritance and their genetic analysis follows the same principles as diploids, whereas autopolyploids display more complicated polysomic inheritance. Among autopolyploids exist important autotetraploids including agricultural crops, notably cultivated potato, and aquaculture animals such as Atlantic salmon and trout, for which active genome projects are underway. These projects will generate abundant genetic polymorphisms and, thus, enable construction of genetic maps, which are essential for dissecting quantitative trait loci (QTL) and, in turn, for improving efficiency of breeding programs through marker assisted selection.

Genetic linkage maps exist or are rapidly becoming available for humans and almost all important diploid animal and plant species, providing an essential starting point for an insightful genomic investigation. In sharp contrast, progress in autopolyploid linkage analysis, the theoretical kernel for map construction, lags far behind. In autopolyploids, multivalent meiotic pairing of homologous chromosomes, followed by crossing over between the locus and spindle attachment, may cause double reduction, in which sister chromatids enter into the same gamete (2). Recombination frequency between a pair of loci can be as high as 0.75 under a tetrasomic model (compared to 0.5 in diploids), and double reduction can occur at a frequency of 25%, indicating a remarkable difference in the pattern of gene segregation and recombination. The evolution of polyploid genomes is also an extremely dynamic process involving extensive genetic and epigenetic changes (3). Thus, it is dangerous to approximate genetic analysis of a polyploid from that of its diploid relative.

The last decade witnessed the development of both statistical methods for linkage analysis and molecular markers for constructing genetic maps in polyploids (4–11). However, many of these methods assume that homologous chromosomes undergo bivalent pairing in meiosis, and all of them fail to adequately address the most crucial features of polysomic inheritance. Consequently, they are unsuitable for modeling and analyzing marker data from real experiments. More recently, a full statistical framework for autotetrasomic linkage analysis has been developed by Luo et al. (8, 12) that takes appropriate account of the essential features of autotetrasomic inheritance. That method is based on two-locus linkage analysis, although it is well documented in diploids that multilocus analysis, which incorporates information from all partially informative markers simultaneously, effectively increases statistical efficiency (13, 14). Marker data from polyploid, particularly outbred, populations are characteristically partially informative, i.e., the genotype cannot be definitively inferred from the corresponding phenotype. Unlike in diploids, codominant markers may also not be fully informative because the same allele may exist in multiple copies. In tetrasomic linkage analysis, a genetic marker is fully informative only if there are eight distinct alleles segregating between two parental genotypes. Such markers are practically nonexistent in tetraploids; thus, it is more crucial for genetic linkage analysis to be carried out on a multilocus basis in polyploids than in diploids. This study develops the theory and method for multilocus linkage analysis with both dominant and codominant markers in outbred autotetraploid populations, both filling a gap in the subject of genetic linkage analysis and providing useful analytical tools for autotetraploid genetic analyses including map construction.

## Results

We developed and presented the theory and statistical method for tetrasomic multilocus linkage analysis in *Methods* below. To test the efficiency and to explore statistical properties of the algorithms, we carried out an intensive simulation study using the computer simulation programs described (12). The simulation mimics the multiple-locus gametogenesis of an autotetraploid individual whose meiosis involves either bivalent or quadrivalent pairing of homologous chromosomes.

### Simulation-Based Numerical Analysis.

Our computer program was designed with the flexibility to simulate different models, although the present study considers a linkage group of 10 marker loci and a quadrivalent chromosome pairing model. The simulation program generated phenotype data at the marker loci for a full-sib family of 200 individuals from crossing a pair of parental autotetraploids, corresponding to three models of marker allele inheritance. Models I and II represent codominant and dominant markers, respectively, and Model III represents a mixture of the two types. A fixed set of simulated values was used for the coefficients of double reduction (α) at the marker loci and for the recombination frequencies (*r*) between them (Table 1). Marker allele “O” represents a null allele at the codominant marker loci or a recessive allele at the dominant marker loci.

The multilocus linkage analysis focuses on calculating the distribution of genotypes at two linked loci by using genetic information from all loci in the linkage group. It thus differs from the two-locus tetrasomic linkage analysis (12), which uses information at two loci only. Based on the genotypic distribution, statistical inference is carried out to test for significance and to calculate the maximum likelihood estimates (MLEs) of the model parameters, α and *r*. Table 2 shows the advantage of the multilocus over the two-locus algorithm in statistical inference over 100 repeat simulations under three models of parental inheritance. Variances of the estimates from the codominant markers (Model I) are consistently much smaller than those from the dominant markers (Model II), reflecting the fact that codominant marker phenotypes are much more informative in regard to the underlying genotype. Under Model I, the multilocus algorithm estimated the model parameters markedly more precisely than the two-locus algorithm, reflecting its efficiency in the use of map information. This pattern was also true under a model of bivalent chromosome pairing (*SI Methods*). Although the multilocus method with dominant markers estimated the coefficients of double reduction with consistently smaller variances, it did not clearly outperform the two-locus method in the precision of the recombination frequency estimates. This observation may be explained by the mainly simplex dominant allele at all markers and the highly dispersed linkage phase of the dominant allele in the parental genotype G_{1} (Model II), which largely limit the extra gain in information from the multilocus analysis. However, the multilocus analysis clearly outperforms the two-locus analysis across the entire linkage group when a codominant marker is present (Model III).

Two further advantages of the multilocus linkage analysis stem from the calculation of the likelihood of the linkage group through Eq. **8** (*Methods*). First, this property enables a direct evaluation of the likelihood for any map order of markers and provides a statistical basis to infer the most likely map order. We compared the least-squares estimates of map order with the true map order in repeated simulations and found that 20% of the map order estimates differed from the true order. The likelihood value obtained from the multilocus analysis supports inference of the true marker order over the biased one (Fig. S1*A*), indicating its improved efficiency in distinguishing the optimal map order. Second, it provides a direct evaluation of the likelihoods of alternative linkage phases at multiple marker loci simultaneously (Fig. S1*B*), enabling discrimination between the true linkage phase and the biased prediction from the two-locus method.

### Case Study of Map Construction in Autotetraploid Potato.

To demonstrate the ability of the tetrasomic multilocus method to analyze real experimental data, we implemented the algorithm to construct genetic linkage maps of 197 AFLP (dominant) and 4 microsatellite (codominant) DNA markers (16) scored on 228 offspring from a cross between two parental lines of cultivated potato (*Solanum tuberosum*). Fig. S2 illustrates the assembly of the markers into 11 linkage groups by using the least-squares method implemented in JoinMap. The multilocus linkage analysis significantly improved the parameter estimates and, hence, the likelihood of the entire map for 10 linkage groups (Table S1). The greatest improvement in likelihood occurred for linkage groups 6 and 5, which contained 3 and 2 codominant markers, respectively. Thus, the multilocus analysis uses the information from only a few codominant markers to improve parameter estimation for the entire linkage group beyond the improvement possible with dominant markers alone, and a greater improvement is likely when more codominant markers are available.

## Discussion

This article presents a theoretical model for multilocus linkage analysis in autotetraploid species and a unique hidden Markov chain statistical method for the construction of genetic linkage maps with dominant and/or codominant molecular markers. The statistical method allows modeling and analyzing of the major complexities of DNA marker data such as incomplete information of marker phenotype with regard to underlying genotype, dominant inheritance of marker alleles, and the presence of null alleles. Moreover, the theory and method enable the use of complete map information to substantially improve precision and accuracy of the linkage map, thus filling a theoretical gap in genetic linkage analysis.

We tested the validity of the statistical method and explored its statistical properties by analyzing simulation datasets as well as marker data from a full-sib family of autotetraploid potato and their parental cultivars. Both accuracy and precision of the estimates of the coefficients of double reduction and recombination frequencies are markedly improved in the multilocus linkage analysis in comparison with the corresponding two-locus analysis, even though the codominant markers used are only partially informative. Significant improvement in statistical reliability can be obtained by incorporating only a single SSR marker through the multilocus analysis. The multilocus method confers the additional advantage of providing a direct calculation of the likelihood for any given linkage map, thus achieving a step forward in the computational challenge of statistically inferring linkage order and linkage phase (17). Although active research efforts have been invested in developing theory and statistical methods for mapping QTL in tetraploid species (4, 11, 15, 18–21), none of these has been built on a rigorous autotetrasomic model. In addition to the tools for genetic map construction, this study provides both theory and method to calculate the conditional probability distribution of genotypes at any test position given the phenotype of its linked genetic markers, a key step for QTL analysis in autotetraploid species.

## Methods

### Model and Notation.

We consider segregation and recombination of alleles at *m* genetic marker loci with a given order, M_{1}, M_{2}, …, M_{m}, in a full-sib family from crossing two autotetraploid parental individuals. Let *r*_{i} (*i* = 1, 2, …, *m*−1) be the recombination frequency for the *i*th marker interval, flanked by markers M_{i} and M_{i+1}, and α_{i} (*i* = 1, 2, …, *m*) be the coefficient of double reduction at the *i*th marker. The parents, together with a random sample of *n* offspring individuals, are scored at the *m* genetic marker loci. Each individual is described by a vector of phenotypes and a vector of genotypes at the *m* marker loci, where a phenotype is the pattern of gel bands observed for any PCR-based DNA molecular marker. For a given phenotype, there may be up to six genotypes that are compatible with the phenotype (7). We therefore record, for each phenotype at each locus, the number of genotypes compatible with it, index the individual compatible genotypes by a subscript, and use a dedicated notation to state that a genotype is compatible with a phenotype.
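The notion of compatibility can be made concrete with a small enumeration. The following is a minimal sketch under assumptions that are not taken from the paper: a phenotype is modeled as the set of visible bands, allele dosage is invisible, and a null allele `O` may be present. The function name `compatible_genotypes` is hypothetical. Under these assumptions a two-band pattern is compatible with six genotypes, which agrees with the upper bound quoted above, although the derivation in ref. 7 may differ.

```python
from itertools import combinations_with_replacement

NULL = "O"  # assumed null allele: contributes no band to the gel pattern


def compatible_genotypes(bands, allow_null=True):
    """List tetraploid genotypes whose band pattern equals `bands`.

    A genotype is a tuple of four alleles. Its phenotype is taken to be the
    set of non-null alleles it carries (dosage is assumed not to be visible).
    """
    bands = frozenset(bands)
    alleles = sorted(bands) + ([NULL] if allow_null else [])
    result = []
    for genotype in combinations_with_replacement(alleles, 4):
        if frozenset(genotype) - {NULL} == bands:
            result.append(genotype)
    return result


if __name__ == "__main__":
    for pattern in [("A",), ("A", "B"), ("A", "B", "C"), ("A", "B", "C", "D")]:
        genotypes = compatible_genotypes(pattern)
        print(pattern, len(genotypes), genotypes)
```

Calling `compatible_genotypes(("A", "B"), allow_null=False)` shows how the ambiguity shrinks when a null allele can be ruled out.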

### Theoretical Analysis.

For marker locus M_{k}, let M_{j} and M_{l} (1 ≤ *j* ≤ *k* ≤ *l* ≤ *m*) be the two most adjacent fully informative markers at which the phenotype provides full information of the underlying genotype. In autotetraploids, fully informative markers, at which there are eight distinct marker alleles segregating between the two parental individuals, are very rare; thus *j* is usually 1 and *l* usually takes *m*, i.e., all marker information is taken into account.

### Prediction of Single-Locus Genotypic Distribution Under the Multilocus Model.

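The conditional probability of the genotype of individual *i* at M_{k}, based on the marker phenotypes of the individual at marker loci from M_{j} to M_{l}, can be calculated from Eq. **1**, which combines the prior probability of the genotype at M_{k} with the left and right conditional probabilities contributed by the flanking markers. Eq. **1** can also be written in matrix form, using the component-wise product of vectors and the matrix transpose (T).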
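Because the displayed equations are not available in this copy, the following is only a generic reconstruction of what such a posterior looks like in a hidden Markov chain, not the paper's Eq. **1** itself. All symbols are assumptions introduced here: π_k(g) is the prior probability of genotype g at M_{k}, c_k(g) is 1 if g is compatible with the observed phenotype at M_{k} and 0 otherwise, and L_k(g) and R_k(g) are the probabilities of the phenotypes to the left and to the right of M_{k} given genotype g at M_{k}.

```latex
% Generic posterior at locus k given phenotypes p_j, ..., p_l (assumed notation)
\[
\Pr(g_k = g \mid p_j, \dots, p_l)
  = \frac{\pi_k(g)\, c_k(g)\, L_k(g)\, R_k(g)}
         {\sum_{g'} \pi_k(g')\, c_k(g')\, L_k(g')\, R_k(g')}
\]
% The same statement in vector form, with \circ the component-wise product
\[
\mathbf{q}_k
  = \frac{\boldsymbol{\pi}_k \circ \mathbf{c}_k \circ \mathbf{L}_k \circ \mathbf{R}_k}
         {\mathbf{1}^{\mathrm{T}} \left( \boldsymbol{\pi}_k \circ \mathbf{c}_k \circ \mathbf{L}_k \circ \mathbf{R}_k \right)}
\]
```

The factorization relies only on the Markov property: given the genotype at M_{k}, the phenotypes on its left, at the locus itself, and on its right are conditionally independent.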

To calculate the prior probabilities of genotypes at marker M_{k} under tetrasomic inheritance, one needs to model double reduction in gametogenesis. For example, if A_{1}A_{2}A_{3}A_{4} represent four copies of marker alleles at locus M_{k} and the coefficient of double reduction at this locus is α_{k}, there are 100 possible genotypes at this locus. We developed a computer-based approach to sort the genotypes into three different groups according to the number of double reduction gametes involved (0, 1, or 2) and to calculate their probabilities (17).
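The count of 100 genotypes and the three groups can be checked by direct enumeration. This is a minimal sketch, assuming that each parent carries four distinguishable allele copies, that both parents share the same coefficient of double reduction, and that the standard tetrasomic gamete frequencies apply: each of the six gametes with two different copies has probability (1 − α)/6 and each of the four double reduction gametes has probability α/4. The function names and the allele labels for the second parent are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def gamete_distribution(parent, alpha):
    """Return {gamete: probability} for a parent with four distinct allele copies."""
    dist = {}
    for pair in combinations(parent, 2):
        dist[pair] = (1 - alpha) / 6      # two different copies
    for copy in parent:
        dist[(copy, copy)] = alpha / 4    # double reduction: two sister copies
    return dist


def offspring_prior(parent1, parent2, alpha):
    """Return {(gamete1, gamete2): (probability, number of double reduction gametes)}."""
    g1 = gamete_distribution(parent1, alpha)
    g2 = gamete_distribution(parent2, alpha)
    prior = {}
    for a, pa in g1.items():
        for b, pb in g2.items():
            n_dr = int(a[0] == a[1]) + int(b[0] == b[1])
            prior[(a, b)] = (pa * pb, n_dr)
    return prior


if __name__ == "__main__":
    alpha = Fraction(1, 10)  # illustrative value only
    prior = offspring_prior(("A1", "A2", "A3", "A4"), ("B1", "B2", "B3", "B4"), alpha)
    print("genotypes:", len(prior))
    print("group sizes:", dict(Counter(n for _, n in prior.values())))
    print("total probability:", sum(p for p, _ in prior.values()))
```

Exact fractions are used so that the total probability can be compared with 1 without rounding error.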

To calculate elements of the right and the left conditional probabilities defined above, we used the Markov property of the genotype distribution at linked loci, *i*.*e*., the genotype of an individual at marker M_{k}, given its genotype at M_{k−1} or M_{k+1}, is independent of the genotype at any other marker locus (see *SI Methods* for detail). The transition probability of a genotype at marker M_{k} given the genotype at marker M_{k−1} defines a transition probability matrix of genotypes at M_{k} given genotypes at M_{k−1}; each row of this matrix holds the transition probabilities from one genotype at marker M_{k−1} to all possible genotypes at marker M_{k}. The right conditional probabilities can then be expressed in matrix form as a product involving these transition matrices, which are stochastic matrices, and a vector of ones. Similarly, the left conditional probabilities can be written in terms of the transition probabilities of the genotype at marker M_{h−1} given the genotype at marker M_{h}, as detailed in *SI Methods*.
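The recursions described in this subsection can be illustrated on a toy chain. The sketch below is not the tetrasomic model: the state space, the transition matrices in `trans`, the first-locus prior `prior0`, and the 0/1 compatibility indicators in `compat` are all hypothetical inputs, and the reverse transition is obtained from the forward one by Bayes' rule. A brute-force sum over all genotype paths is included as a check on the recursions.

```python
from itertools import product


def marginals(prior0, trans):
    """Propagate the first-locus prior through the chain to get a prior per locus."""
    priors = [list(prior0)]
    for T in trans:
        prev = priors[-1]
        priors.append([sum(prev[a] * T[a][b] for a in range(len(prev)))
                       for b in range(len(T[0]))])
    return priors


def right_conditionals(trans, compat):
    """R[k][g] = Pr(phenotypes at loci after k | genotype g at locus k)."""
    m = len(compat)
    R = [None] * m
    R[m - 1] = [1.0] * len(compat[m - 1])
    for k in range(m - 2, -1, -1):
        T = trans[k]
        R[k] = [sum(T[g][b] * compat[k + 1][b] * R[k + 1][b] for b in range(len(T[g])))
                for g in range(len(T))]
    return R


def left_conditionals(priors, trans, compat):
    """L[k][g] = Pr(phenotypes at loci before k | genotype g at locus k)."""
    L = [[1.0] * len(compat[0])]
    for k in range(1, len(compat)):
        T = trans[k - 1]
        row = []
        for g in range(len(priors[k])):
            # Reverse transition Pr(a at k-1 | g at k) via Bayes' rule.
            total = sum(priors[k - 1][a] * T[a][g] * compat[k - 1][a] * L[k - 1][a]
                        for a in range(len(T)))
            row.append(total / priors[k][g] if priors[k][g] > 0 else 0.0)
        L.append(row)
    return L


def single_locus_posterior(k, prior0, trans, compat):
    """Posterior genotype distribution at locus k given all phenotypes."""
    priors = marginals(prior0, trans)
    L = left_conditionals(priors, trans, compat)
    R = right_conditionals(trans, compat)
    w = [priors[k][g] * compat[k][g] * L[k][g] * R[k][g] for g in range(len(priors[k]))]
    z = sum(w)
    return [x / z for x in w]


def brute_force_posterior(k, prior0, trans, compat):
    """Same posterior by summing over every genotype path (check only)."""
    n_states, m = len(prior0), len(compat)
    w = [0.0] * n_states
    for path in product(range(n_states), repeat=m):
        p = prior0[path[0]]
        for i in range(m - 1):
            p *= trans[i][path[i]][path[i + 1]]
        for i in range(m):
            p *= compat[i][path[i]]
        w[path[k]] += p
    z = sum(w)
    return [x / z for x in w]


if __name__ == "__main__":
    T = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    compat = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
    print(single_locus_posterior(1, [0.5, 0.3, 0.2], [T, T], compat))
```

The recursive version costs a number of operations linear in the number of loci, whereas the brute-force check grows exponentially and is only practical for very short chains.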

### Prediction of Two-Locus Genotypic Distribution Under the Multilocus Model.

The conditional probability distribution of genotypes at markers M_{k} and M_{k+1}, given the phenotypes of all markers from M_{j} to M_{l}, can be expressed in terms of the prior genotype probabilities, the transition probabilities between the two markers, and the left and right conditional probabilities; the resulting expression holds for any individual *i* and any marker *k*. For any offspring individual in the population and any pair of marker loci, this joint distribution can be expressed as a sparse matrix. Calculation of the transition probability under the two alternative models of either bivalent or quadrivalent homologous chromosome pairing in tetraploid meiosis is detailed in *SI Methods*.
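A joint posterior for two adjacent loci can be sketched with the usual forward and backward recursions of a hidden Markov chain. This is an illustration on a hypothetical toy chain, not the paper's tetrasomic transition model; `prior0`, `trans`, and `compat` are assumed inputs, and storing only the nonzero entries in a dictionary is one simple way to mimic the sparse-matrix representation mentioned above.

```python
def forward(prior0, trans, compat):
    """f[k][g] = Pr(phenotypes at loci 0..k, genotype g at locus k)."""
    f = [[prior0[g] * compat[0][g] for g in range(len(prior0))]]
    for k, T in enumerate(trans):
        prev = f[-1]
        f.append([compat[k + 1][b] * sum(prev[a] * T[a][b] for a in range(len(prev)))
                  for b in range(len(T[0]))])
    return f


def backward(trans, compat):
    """b[k][g] = Pr(phenotypes at loci after k | genotype g at locus k)."""
    m = len(compat)
    b = [None] * m
    b[m - 1] = [1.0] * len(compat[m - 1])
    for k in range(m - 2, -1, -1):
        T = trans[k]
        b[k] = [sum(T[g][h] * compat[k + 1][h] * b[k + 1][h] for h in range(len(T[g])))
                for g in range(len(T))]
    return b


def pair_posterior(k, prior0, trans, compat):
    """Sparse joint posterior {(g, h): prob} for genotypes at loci k and k+1."""
    f = forward(prior0, trans, compat)
    b = backward(trans, compat)
    T = trans[k]
    weights = {}
    for g in range(len(T)):
        for h in range(len(T[g])):
            w = f[k][g] * T[g][h] * compat[k + 1][h] * b[k + 1][h]
            if w > 0.0:
                weights[(g, h)] = w   # keep only genotype pairs that remain possible
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}


if __name__ == "__main__":
    T = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    compat = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
    for pair, prob in sorted(pair_posterior(0, [0.5, 0.3, 0.2], [T, T], compat).items()):
        print(pair, round(prob, 4))
```

Summing the joint posterior over the genotype at the second locus recovers the single-locus posterior at the first, which is a convenient consistency check.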

### The Maximum Likelihood Estimation.

Based on the hidden Markov chain model, we can formulate the likelihood of the model parameters α and *r* given the phenotype data at the *m* marker loci (Eq. **8**), which is derived in detail in *SI Methods*. We established a recursive relationship between the coefficients of double reduction at linked loci (17), so the MLE of the coefficient of double reduction at one locus can be calculated from the MLEs of the coefficient at the linked locus and of the recombination frequency between them.
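The likelihood of a hidden Markov chain is usually evaluated with a scaled forward recursion, summed on the log scale over offspring. The sketch below shows that mechanism only. It assumes a toy two-state chain whose transition matrix depends on a single switch probability `r`, which is a stand-in for, not a version of, the tetrasomic transition model; the function names and the data layout (`family` as a list of per-individual compatibility indicators) are hypothetical.

```python
import math


def transition(r):
    """Toy two-state transition matrix with switch probability r (stand-in model)."""
    return [[1.0 - r, r], [r, 1.0 - r]]


def individual_log_likelihood(prior0, trans, compat):
    """log Pr(phenotypes of one individual) using the scaled forward recursion."""
    f = [prior0[g] * compat[0][g] for g in range(len(prior0))]
    loglik = 0.0
    for k in range(len(compat)):
        if k > 0:
            T = trans[k - 1]
            f = [compat[k][b] * sum(f[a] * T[a][b] for a in range(len(f)))
                 for b in range(len(T[0]))]
        scale = sum(f)
        if scale == 0.0:
            return float("-inf")  # phenotype impossible under the model
        loglik += math.log(scale)
        f = [x / scale for x in f]  # rescale to avoid numerical underflow
    return loglik


def family_log_likelihood(prior0, rs, family):
    """Sum of individual log-likelihoods for interval parameters rs."""
    trans = [transition(r) for r in rs]
    return sum(individual_log_likelihood(prior0, trans, compat) for compat in family)


if __name__ == "__main__":
    family = [
        [[1, 0], [1, 0], [1, 0]],  # fully informative, no switch
        [[1, 0], [1, 1], [0, 1]],  # middle locus uninformative
        [[0, 1], [0, 1], [1, 1]],  # last locus uninformative
    ]
    for r in (0.05, 0.1, 0.2, 0.4):
        print(r, round(family_log_likelihood([0.5, 0.5], [r, r], family), 4))
```

Evaluating this function for the transition matrices implied by different marker orders or linkage phases gives directly comparable log-likelihoods, which is the use described in the Results.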

Consider the joint phenotype of the *i*th offspring individual at markers M_{k} and M_{k+1}. This joint phenotype is usually only partially informative about the recombinant and double reduction status of the constituent gametes. Each joint marker genotype compatible with the phenotype, from a possible 100 under the tetrasomic model, is itself usually not fully informative: it is an aggregate of several fully informative genotypes whose gamete constituency is known, with probabilities given by Eq. **7** for given α and *r*. Thus, the probability of the aggregate genotype can be calculated by summing over all possible compatible fully informative genotypes. The probability of a fully informative genotype can also be expressed in a general form in which *u*_{k} = 0, 1, 2 stands for the number of double reduction gametes and *v*_{k} = 0, 1, …, 4 for the number of recombinant chromosomes in the fully informative genotype, whereas *c*_{k} is a constant which can be calculated as described (17). The probability of observing the individual phenotype follows from these genotype probabilities.

We developed an EM algorithm (22) to calculate the MLEs of the model parameters for all marker intervals, *k* = 1, 2, …, *m*−1. The iterative algorithm is initiated from the MLEs of α and *r* obtained from the two-locus analysis. At iteration *t*, the probability distribution for all fully informative genotypes at markers (M_{k}, M_{k+1}) is calculated from the HMM analysis; the next iteration, *t* + 1, is then completed via an expectation (E) step and a maximization (M) step. The E step calculates, for the *i*th individual's aggregate genotype, the probability of carrying double reduction gametes at locus M_{k} and the probability of recombinant chromosomes. The M step updates the estimates of the model parameters from these probabilities.
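The E and M steps can be sketched for a single marker interval. This is an illustration under a strong assumption, since the paper's update equations are not available here: each fully informative genotype is given a weight of the form c · α^u · (1 − α)^(2 − u) · r^v · (1 − r)^(4 − v), where u is the number of double reduction gametes and v the number of recombinant chromosomes. Conditioning on the other markers through the HMM is omitted, and all names and the data layout are hypothetical.

```python
import math


def weight(config, alpha, r):
    """Assumed weight of a fully informative genotype, config = (c, u, v)."""
    c, u, v = config
    return c * alpha ** u * (1 - alpha) ** (2 - u) * r ** v * (1 - r) ** (4 - v)


def log_likelihood(data, alpha, r):
    """Observed-data log-likelihood; each individual is a list of compatible configs."""
    return sum(math.log(sum(weight(cfg, alpha, r) for cfg in configs)) for configs in data)


def em_step(data, alpha, r):
    """One EM iteration: expected counts of u and v, then closed-form updates."""
    exp_u = exp_v = 0.0
    for configs in data:
        w = [weight(cfg, alpha, r) for cfg in configs]
        z = sum(w)
        # E step: posterior-weighted counts for this individual.
        exp_u += sum(wi * cfg[1] for wi, cfg in zip(w, configs)) / z
        exp_v += sum(wi * cfg[2] for wi, cfg in zip(w, configs)) / z
    n = len(data)
    # M step: two gametes and four chromosomes are counted per individual.
    return exp_u / (2 * n), exp_v / (4 * n)


def em(data, alpha=0.1, r=0.2, tol=1e-10, max_iter=1000):
    """Iterate EM steps from interior starting values until the updates stabilize."""
    for _ in range(max_iter):
        new_alpha, new_r = em_step(data, alpha, r)
        done = abs(new_alpha - alpha) < tol and abs(new_r - r) < tol
        alpha, r = new_alpha, new_r
        if done:
            break
    return alpha, r


if __name__ == "__main__":
    data = [
        [(1.0, 0, 0), (0.5, 0, 1)],
        [(1.0, 1, 2), (1.0, 0, 1)],
        [(2.0, 0, 0)],
        [(1.0, 0, 1), (1.0, 0, 3)],
    ]
    alpha_hat, r_hat = em(data)
    print(alpha_hat, r_hat, log_likelihood(data, alpha_hat, r_hat))
```

When every individual has exactly one compatible configuration, a single step returns the simple proportions of double reduction gametes and recombinant chromosomes, which is a useful sanity check on the updates.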

The likelihood function given by Eq. **8** increases as the iterative algorithm is repeated, and the sequential parameter estimates converge to the MLEs. These MLEs are then used to calculate the log-likelihood ratio statistic for each parameter, which follows approximately a χ^{2} distribution with 1 degree of freedom and can be used to test the significance of the parameters α and *r* for the marker interval (M_{k}, M_{k+1}).
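The significance test can be illustrated with the standard likelihood ratio calculation. The sketch assumes the statistic is twice the difference between the maximized log-likelihood and the log-likelihood with the parameter fixed at a null value; which null value the authors used is not stated here, and the function names are hypothetical. The χ^{2} tail probability for 1 degree of freedom is computed with the complementary error function from the standard library.

```python
import math


def lrt_statistic(loglik_full, loglik_null):
    """Twice the log-likelihood difference between the fitted and the null model."""
    return 2.0 * (loglik_full - loglik_null)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    # A chi-square(1) variable is a squared standard normal, so
    # Pr(X > x) = Pr(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    stat = lrt_statistic(loglik_full=-512.3, loglik_null=-516.9)  # made-up values
    print(stat, chi2_sf_1df(stat))
```

The log-likelihood values in the usage lines are invented for illustration and are not results from the paper.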

The multilocus linkage analysis presented here focuses on quadrivalent chromosome pairing only. There is no major difficulty in extending the statistical framework to bivalent pairing (Tables S2–S4) or to the case of mixed bivalent and quadrivalent pairings (*SI Methods*).

## Acknowledgments

We thank two anonymous reviewers and the acting editor for their critical and constructive comments, which have helped improve the presentation of the paper. We thank Dr. John Bradshaw for allowing us to analyze the potato marker data. The analyses were programmed in Fortran 90 (available upon request from the corresponding author). The research was supported by research grants from Biotechnology and Biological Science Research Council UK and Natural Science Foundation of China (to Z.W.L.).

## Footnotes

^{1}To whom correspondence may be addressed. E-mail: z.luo{at}bham.ac.uk or zwluo{at}fudan.edu.cn.

Author contributions: M.J.K. and Z.W.L. designed research; L.J.L. and Z.W.L. performed research; L.J.L. and Z.W.L. contributed new reagents/analytic tools; L.J.L., L.W., and Z.W.L. analyzed data; and L.J.L., M.K., and Z.W.L. wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

This article contains supporting information online at www.pnas.org/cgi/content/full/0908477107/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- (1) Masterson J
- (14) Lander ES, Green P
- (17) Luo ZW, Zhang RM, Kearsey MJ
- (22) Dempster A, Laird N, Rubin D

## Article Classifications

- Biological Sciences
- Genetics