Principles and Methods of QTL Mapping. Courseware (.ppt)
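Chapter 3  Principles and Methods of QTL Mapping
16:20

What is a QTL?
  A quantitative trait locus (QTL) is a chromosomal segment that affects a quantitative trait.
  QTL mapping is a method for locating the genes underlying a quantitative trait on the chromosomes.
  QTL and QTLs.

Why map QTL?
  It provides a route to basic knowledge about how the genes underlying quantitative traits behave and interact, which allows more realistic models of phenotypic variation, selection response and evolutionary processes to be built.
  Marker information can be incorporated into genetic evaluation to assist artificial selection programmes, mainly through MAS (marker-assisted selection) and MAI (marker-assisted introgression).
  It makes positional cloning of genes possible, so that the molecular mechanisms behind existing quantitative variation can be studied and the frequency of favourable alleles can be raised further by direct molecular intervention.

Basic principles of QTL mapping
  The basic principle of QTL mapping is to associate measured genetic variation with phenotypic variation.
  The choice of population, of the individuals to phenotype and of the individuals to genotype are the key considerations in every QTL mapping design.
  Every QTL mapping design requires LD between marker alleles and QTL alleles.

Key points of QTL mapping

Section 1  LA mapping (linkage analysis mapping)
  Linkage analysis only considers the linkage disequilibrium that exists within families, which can extend for tens of cM and is broken down by recombination after only a few generations. Examples are the BC and F2 designs.

Single-marker analysis
  $m$ is the overall mean; $a$ and $d$ are the additive and dominance effects; $r$ is the recombination rate between the marker and the QTL.
  $\Pr(Qq \mid Aa) = \Pr(AaQq) / \Pr(Aa)$, where $\Pr(Qq \mid Aa)$ is the conditional probability of QTL genotype Qq given that the individual has marker genotype Aa, $\Pr(AaQq)$ is the joint probability of the marker and QTL genotypes, and $\Pr(Aa)$ is the marginal probability of the marker genotype.

Marker and QTL probabilities for a backcross population derived from inbred lines
  Mean phenotypic difference between marker genotypes:

Drawbacks of single-marker analysis
  Single-marker analysis uses marker means, so it cannot give separate estimates of the QTL effect and of the recombination frequency between QTL and marker. It therefore cannot tell a large-effect QTL loosely linked to the marker from a small-effect QTL tightly linked to it.

Interval mapping
  Lander and Botstein (1989) proposed a QTL mapping method that uses all pairs of adjacent markers.
  In principle the method can separate the effect of a QTL from its position.
  The method needs a genetic map with a sufficient number of markers and known distances between adjacent markers.

Haldane mapping function
  $r = \frac{1}{2}\left(1 - e^{-2M}\right)$
  $M$ is the genetic distance in Morgans (1 M = 100 cM).
  It assumes that exchanges of genetic material during meiosis occur randomly and independently along the chromosome.

Marker and QTL probabilities

Data analysis
  $y_{ij}$ is the trait record of individual $i$ with QTL genotype $j$.
  $m_j$ is the expected value of individuals with QTL genotype $j$ (e.g. $m + a$ or $m + d$).
  $e_{ij}$ is the random error, with $e_{ij} \sim N(0, \sigma^2)$, so that $y_{ij} \sim N(m_j, \sigma^2)$.

Maximum likelihood analysis
  The likelihood function for the backcross example above is:
  $L = \prod_{i=1}^{N} \sum_{j} \Pr(Q_j \mid A_i, B_i)\, f(y_i \mid Q_j)$, with $f$ the normal density $N(m_j, \sigma^2)$.
  $Q_j$ is the genotype at the QTL; $A_i$ and $B_i$ are the genotypes of individual $i$ at marker loci A and B; $N$ is the number of backcross individuals.

  Likelihood ratio test (LRT): $LRT = -2\ln\left(L_{reduced} / L_{full}\right)$
  $L_{reduced}$ is the likelihood under the null hypothesis of no segregating QTL; $L_{full}$ is the likelihood with one segregating QTL.
  LOD test: $LOD = \log_{10}\left(L_{full} / L_{reduced}\right)$

Least squares analysis
  The least squares model for the backcross example above is:
  Parameters to estimate: either the means of the two QTL genotypes, or the overall mean and the difference in effect between the two genotypes.
  Significance test: $F = MSQ / RMS$
  MSQ is the mean square explained by QTL genotype in the fitted model; RMS is the residual mean square of the fitted model.

Comparison of LS and ML
  LS uses only the marker-mean information; the variation within marker genotype classes is not used. ML uses all available information, including the marker genotypes and the trait distribution.
  LS is computationally simple and can be run in standard software (SAS). ML is computationally much harder and needs specialised software, especially when it is extended to very complex models.

  Comparison of the likelihood ratio test and the F test:
  For a single QTL with normally distributed residuals, the LS and ML estimates are identical.
  In the general case the relationship becomes:
  Most QTL mapping analyses show that LS gives results extremely close to those of ML.

Genome scan
  The strength of interval mapping is that the whole marked genome can be scanned.
  QTL mapping is carried out across the entire genome. Within each interval the conditional probabilities of the QTL genotypes are computed from the flanking markers. The data are then analysed interval by interval with least squares or maximum likelihood, and a test statistic (F ratio or LRT) is computed for each interval. The position with the largest test statistic is the most likely position of the QTL, and the QTL effect estimated at that position is the best estimate of the effect.

Multiple testing problem
  If many independent null hypotheses are tested and all of them are known beforehand to be true, then the probability of at least one spurious significant result (false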
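positive) is $1 - (1 - \alpha)^n$.

Bonferroni correction
  $\alpha' = 1 - (1 - \alpha)^{1/n} \approx \alpha / n$

Permutation test
  The phenotypes are randomly shuffled relative to the marker genotypes, which removes any association between marker genotype and phenotype.
  After every shuffle the QTL analysis is repeated across the whole genome.
  Over many shuffles this gives the distribution of the LRT statistic under the null hypothesis of no QTL.

  Steps of the permutation test:

FDR (false discovery rate)
  $P_{(j)} \le \frac{j}{m}\,\alpha$    (1)
  $\alpha$ is the declared FDR (such as 0.05).
  $j$ is the largest rank that satisfies formula (1).
  $m$ is the number of markers.

FDR (false discovery rate) method
  Sort the p-values of all marker intervals in ascending order.

LOD drop support interval
  If a QTL is detected at a particular position, a test can be carried out on the position of the QTL.
  The null hypothesis is that the QTL lies at the estimated peak position; the alternative is that the QTL lies at a distance $d$ from the peak.
  The test statistic is twice the difference in log-likelihood of the full QTL model between the peak position and the position $d$ map units away from the peak. In large samples it approximately follows a $\chi^2$ distribution with 1 degree of freedom.
  A confidence interval for the QTL position can therefore be obtained by moving away from the peak until the test statistic has dropped by a given amount.

  For example:
  A 95% confidence interval for the QTL corresponds to a drop of 3.84 in the test statistic.
  A 1-LOD drop corresponds to a 97% confidence interval.
  A 2-LOD drop corresponds to a 99.8% confidence interval.

Bootstrap confidence interval
  1. For a population of size $n$, draw $n$ records with replacement (some records are drawn several times and others not at all).
  2. Analyse the sample and estimate the QTL position.
  3. Repeat steps 1 and 2, for example 200 times or more.
  4. Discard the most extreme 2.5% of the QTL position estimates in each tail of the distribution.
  5. The remaining 95% give the estimated confidence interval.

Confidence interval of the estimated QTL position

Predicted confidence interval
  The length of the confidence interval depends on sample size, QTL effect and marker density. For a high-density marker map, Darvasi and Soller (1997) gave a predicted approximate 95% confidence interval (in cM):
  $n$ is the sample size; $a$ and $d$ are the standardised additive and dominance effects (in units of the genotypic standard deviation).

Statistical power

Why calculate detection power?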
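  For a given sample size, calculate the QTL effect that can be detected.
  For a given QTL effect, estimate the population size needed to detect the QTL.
  Compare different population designs for detecting a particular QTL.

Theory for calculating power with a completely linked marker
  Type I error ($\alpha$): the probability of rejecting the null hypothesis when it is true.
  Type II error ($\beta$): the probability of accepting the null hypothesis when it is false.
  Statistical power is defined as: $1 - \beta$

Statistical errors
  [Figure: distributions P(T) of the test statistic T under H0 and HA, with the critical value marked]

                 Rejection of H0                              Nonrejection of H0
  H0 true        Type I error at rate $\alpha$                Nonsignificant result
  HA true        Significant result, POWER = $(1 - \beta)$    Type II error at rate $\beta$

Impact of alpha
  [Figure: P(T) against T, with the critical value marked]

Impact of effect size, N
  [Figure: P(T) against T, with the critical value marked]

Important factors affecting detection power
  Population type; sample size; QTL effect; genome size; marker density; significance threshold; type of analysis.

Calculating power with a completely linked marker
  For crosses between inbred lines, the power of QTL detection is calculated from single-marker t tests and F tests.

  F2 design:
  BC design: $t_{BC}^2 = \frac{n\,(a \pm d)^2}{4\sigma_e^2}$, so that $n_{BC} = \frac{4\,t^2 \sigma_e^2}{(a \pm d)^2}$ (the sign of $d$ was lost in extraction; it depends on which parent is the recurrent parent).

  For reasonable sample sizes and small QTL effects, the required t value is:

  Sample size for BC (the column of effect sizes in terms of $a$ and $d$ was not recovered): 672, 128, 42, 11, 6

  The ratio of the required sample sizes for the BC and F2 designs is:
  The significance threshold needed for a genome scan is lower for BC than for F2. BC: F2:
  $\sigma_e^2$ may be lower for BC than for F2.

  Taking the change in threshold between the two designs into account:

  If linkage is incomplete ($r > 0$) and single-marker analysis is used:
  If linkage is incomplete ($r > 0$) and interval mapping is used:

  To increase the power of QTL detection, either the number of genotyped individuals or the marker density can be increased. The balance between the two depends on the ratio of the cost of genotyping a marker to the cost of phenotyping an individual.

Ways to increase detection power
  Increase the sample size.
  Increase the effect size. This can be done by choosing a population structure or sample that is rich in segregating QTL, for example through progeny testing.

Population designs for fine mapping QTL

Recombinant inbred lines (RIL)
  Recombinant inbred lines are derived by inbreeding from an F2 population.
  An RIL only has to be genotyped once, yet many traits can be measured well on it (clonal lines).
  The key property of RIL is that more recombination has occurred than in an F2, and that quantitative traits can be measured accurately by using line means.
  RIL can only map additive QTL.
  RIL are slow and difficult to produce.

Advanced intercross lines (AIL)
  An AIL starts from an F2 population whose offspring are intercrossed for a further number of generations (similar to RIL, but by outcrossing rather than inbreeding).
  An AIL improves the precision of QTL positions beyond what was achieved by QTL mapping in the F2 population.
  Any trait can be measured in an AIL, but genotyping concentrates on the regions of interest.
  The key property of an AIL is that it creates additional recombination events in the target region, which is similar to enlarging the F2 population.

  [Figure: semi-random intercrossing, generations P, F1, F2, F3, ..., Ft]
  An AIL has to be kept at a certain population size.
  Relative to an F2, an AIL increases recombination by a factor of approximately $t/2$; the confidence interval is:
  An AIL can map a few or several QTL to 15 cM.

  Requires only 2 generations.
  Requires very large samples.
  [Figure: LOD against position (cM) for lesion density, BC and SPh-BC; Paigen et al.]

Recombinant progeny testing
  Males that are recombinant in an interval of interest are progeny tested to check which QTL allele was retained.
  Requires only 3 generations.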
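  Efficient for dominant effects.
  Requires a large sample.

Interval-specific congenic strains
  Interval-specific congenic strains are produced by a series of backcrosses and intercrosses.
  [Figure: crossing scheme with P1, RI and P2, giving F1,1 and F1,2, then F2,1 and F2,2]
  Each selected RIL is backcrossed to each parent, and the BC1 is then selfed and grown out for phenotyping and genotyping in the QTL region. Because the QTL was previously mapped to this region, the BC to one of the parents will segregate while the other will not, which indicates whether the gene controlling the QTL is above or below the breakpoint. The overlapping results of the various RILs will narrow the QTL interval.
  [Figure labels: F2,1, F2,2, C57L, AKR, AKXL-16, D2MIT64, D2MIT200, P = 0.41, P = 0.02; B. Taylor, A. Darvasi]

Section 2  QTL mapping in outbred populations with a specific structure

Differences between inbred and outbred populations
  Outbred populations also contain some inbreeding.
  The main feature of an outbred population is that no deliberate effort is made to mate relatives and so create inbreeding (random mating).
  The main difference from an inbred population is that genetic variation is still segregating within an outbred population.
  In an outbred population the association between QTL and marker is specific to a family rather than population-wide.
  In an outbred population there is additional genetic variance between families.

Basic strategies for QTL mapping in outbred populations
  Look for QTL between outbred populations that differ from each other.
  Look for QTL that are segregating within one population.
  In practice:
  Use genetic markers to trace inheritance from parents to offspring and obtain the probabilities of the possible genotypes at every position in the genome.
  Associate the phenotypic data with the genotype probabilities within families.

Outbred line crosses
  A cross between outbred lines is similar to the F2 case, except that the cross is now between two outbred lines or breeds that differ in economic traits.
  All three generations (grandparents, F1 and F2) are genotyped at multiple marker loci, but only the F2 individuals are phenotyped.
  For the QTL genotypes of F2 individuals (QQ, Qq, qQ and qq), it cannot be distinguished which of the two alleles came from the father and which from the mother.
  In crosses between outbred populations, additive effects, dominance effects and parent-of-origin effects (imprinting) have to be considered.

Half-sib populations
  Livestock populations contain very large half-sib families.
  The principle of QTL mapping in half-sib families is to associate the phenotypes of half-sib offspring with the probabilities of the alleles they inherited from the common parent.
  Parents and half-sib offspring are all genotyped; only the offspring are phenotyped.

Single half-sib family
  Single marker
  For a single half-sib family the only requirement is that the common parent is heterozygous at a marker locus.
  The mean phenotypic difference between offspring carrying the two alternative alleles at that locus can then be observed.
  The difference can be tested for significance with a t test.
  A single half-sib family with a single marker is similar to the BC design.

  Multiple markers
  With multiple markers, the linkage phase between the markers of the common parent is unknown.
  The "haplotypes" of the parent have to be reconstructed.
  The most likely parental "haplotypes" are obtained from the gametes of the parent and the genotypes of the offspring.
  Following the calculation for the BC, and given the parental "haplotypes", the conditional probability that each HS offspring inherited a particular parental gamete is calculated.
  The phenotypes are regressed on the estimated conditional probabilities to obtain the difference between the QTL alleles, which is tested for significance with a t test.

Sire haplotype reconstruction
  Identify the informative markers for each HS offspring, that is, the markers at which the common parent is heterozygous and the transmitted allele is unambiguous.
  Consider the pairs of adjacent markers at which a given common parent is heterozygous.
  Count the offspring for which the alleles at both adjacent loci can be identified as inherited from the common parent.
  Establish the linkage phase of the marker loci with the expectation-maximisation (EM) algorithm, on the basis of minimising the number of recombinations.

Multiple half-sib families
  A marker contrast across the offspring of the four parents above would not detect the QTL, because the difference between M and m is zero. A nested analysis within families should therefore be used.

  Single marker
  For multiple families with a single marker, a nested ANOVA can be used, with the marker effect nested within family:

Granddaughter design (GDD)
  Weller et al.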
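  (1990) introduced the granddaughter design for QTL mapping in half-sib families.
  The design requires a three-generation pedigree of sires, sons and daughters; sires and sons are genotyped and granddaughters are phenotyped.
  The QTL is mapped by comparing the mean phenotypes of the daughters of sons that inherited the two alternative alleles of the sire at a locus.
  The advantage of the GDD is that fewer individuals have to be genotyped for the same detection power.
  Data for a GDD are relatively easy to collect because of the AI system for bulls.

  The analysis usually uses the daughter yield deviations (DYD) of the sons.
  The daughter design model can therefore be used, with ANOVA and regression.
  If the number of daughters per son varies widely, the DYD have to be weighted.

  NCP for the daughter design:
  NCP for the granddaughter design:
  Once the non-centrality parameter (NCP) is calculated, power is derived as the probability that a non-central variate exceeds the threshold from a central distribution.
  GDD is generally much more powerful than a daughter design.

Full-sib families
  One or more large full-sib families are not available in the great majority of species, but the power to detect QTL in them is high. Possible reasons are:
  A full-sib family has two marker contrasts, one for the father and one for the mother.
  Compared with half sibs, the expected marker contrast in full sibs includes both additive and dominance variance.

Sib pairs and nuclear families (paired design)
  In most species large full-sib or half-sib families are unlikely to be available.
  How can QTL be mapped in such populations?
  One design is to collect unrelated sib pairs or nuclear families.
  The QTL effect is then treated as a random effect, and similarity in phenotype between sibs is associated with similarity in the alleles they carry.

Section 3  Variance component QTL mapping

Model and test statistic
  An example of a linear mixed model for a single QTL analysis is:

  Variance components can be estimated using maximum likelihood or restricted maximum likelihood (REML). The log-likelihood function is:
  The assumed mean and variance structure of the observations:
  Q is the IBD matrix:

  The distribution of the test statistic is, asymptotically, a mixture of zero (with probability $\frac{1}{2}$) and a $\chi^2$ with 1 degree of freedom (also with probability $\frac{1}{2}$).

  The advantage of this likelihood-based approach.
  The full maximum likelihood approach simultaneously estimates the IBD probabilities and the variance components, in a combined segregation analysis and linkage analysis framework.
  "distribution method"
  "expectation method"

  So why is QTL mapping in general pedigrees not used more frequently, in particular in large, deep pedigrees?
  IBD estimation in large pedigrees.
  The unavailability of (user-friendly) software for the variance component estimation part of the analysis.
  A finite budget.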
  The unavailability of DNA samples from most ancestors.

IBD estimation

Perfect marker
  As in the case of sib pairs, IBD sharing using a fully informative marker is straightforward, because we can simply count the number of alleles that two relatives share by descent.
  At a location linked to a perfect marker, IBD probabilities can be calculated from the observed IBD probability at the marker, the average relationship between individuals, and the recombination rate between the marker and the putative QTL position.

The general case: missing data and non-informative markers
  The marker information in complex pedigrees is often incomplete.
  Unknown linkage phases, non-informative markers and/or missing marker genotypes complicate the calculation of Q.
  The methods for calculating Q are: recursive algorithms, correlation-based algorithms and simulation-based algorithms.

Implementation in Loki
  The multiple-site segregation sampler in Loki is a cleverly designed Gibbs sampler with batch updating.
  The sampling distribution is the probability of the segregation indicators across n loci at the ith segregation, conditional on all other segregation indicators and the observed marker data.

  A two-step strategy to sample
  The first step involves moving through the genome, calculating locus by locus the cumulative probabilities for $S_{ij}$.
  The second step involves moving back down the genome, sampling $S_{ij}$ from a univariate density that is a function of the associated cumulative probability, the previously sampled segregation indicator ($S_{i,j+1}$) and the recombination rate between loci $j$ and $j+1$.

Introduction to Loki
  Loki was originally designed for multipoint linkage analysis in general pedigrees using MCMC methods. It has since been modified for IBD probability calculation.
  The user supplies Loki with the pedigree structure, marker genotypes, marker positions and the QTL positions for which the IBD matrices are to be calculated.
  Dependent chains of IBD probabilities are then obtained for each QTL position.
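A dependent chain of this kind can be summarised by a running estimate of the IBD probability. The block below is a minimal Python sketch of that idea and is not Loki's code: the two-state toy chain, the `stay` parameter and the function names `simulate_ibd_chain` and `running_mean` are assumptions made only for illustration.

```python
import random


def simulate_ibd_chain(n_iter, p_ibd=0.5, stay=0.9, seed=1):
    """Toy dependent chain of 0/1 IBD indicators (hypothetical, NOT Loki's sampler).

    With probability `stay` the previous state is kept; otherwise a fresh
    state is drawn with P(IBD) = p_ibd, so the long-run mean is p_ibd.
    """
    rng = random.Random(seed)
    state = 1 if rng.random() < p_ibd else 0
    chain = []
    for _ in range(n_iter):
        if rng.random() >= stay:
            # Redraw the state independently of the previous one.
            state = 1 if rng.random() < p_ibd else 0
        chain.append(state)
    return chain


def running_mean(chain):
    """Running estimate of the IBD probability after each iteration."""
    total = 0
    out = []
    for k, s in enumerate(chain, start=1):
        total += s
        out.append(total / k)
    return out


chain = simulate_ibd_chain(10_000)
estimate = running_mean(chain)
# Estimates after 100, 1,000 and 10,000 iterations.
print(estimate[99], estimate[999], estimate[-1])
```

Because successive samples are correlated, the running estimate settles more slowly than it would for independent draws, which is why the number of iterations matters.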
  Convergence is determined by monitoring the IBD probabilities over the iteration number. Once the probabilities stabilise, the sampler is deemed to have reached convergence.

Variance component estimation
  After the IBD probabilities have been calculated, there are two difficulties in estimating variance components by ML (REML).
  Firstly, the IBD matrix is a completely general symmetric matrix and does not have an obvious inverse.
  Secondly, the IBD matrix is likely to be singular.

  Why are the IBD matrices often singular?
  The reason is that two relatives can share 0 or 100% of their alleles IBD, which can cause a dependency in the matrix of IBD probabilities.
  Suppose the genotypes of the parents are M1M2 and M3M4. If the progeny have genotypes M1M3 and M2M4 (a), or M1M3 and M1M3 (b), then the resulting IBD matrix is:
  [Matrices for cases (a) and (b) were not recovered]

  If the maximisation algorithm is based upon the complete matrix $V$ (or $V^{-1}$), then there should not be a problem.
  If the maximisation is based upon an algorithm that requires $Q^{-1}$, then using genomic positions which are slightly distant from the markers will give a positive-definite Q.

Implementation example
  Visscher et al. (1999) used the combination of an MCMC sampling approach and REML variance component estimation to map a QTL for bipolar disorder (manic depression) in a human pedigree.
  The pedigree size was 168, over 4 generations, and 143 individuals had a phenotypic score. The incidence of major recurrent depression (unipolar disorder) and bipolar disorder was 17/143 and 11/143.
  A small segment of chromosome 4 was considered because this region had previously shown linkage to bipolar disorder using a parametric linkage analysis, and 11 microsatellite markers were scored spanning 26 cM.
  IBD probabilities were estimated using Loki, using 10,000 samples.
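The singularity of IBD matrices discussed earlier in this section can be checked numerically. The block below is a small worked sketch under stated assumptions: the 4 x 4 layout (two unrelated parents followed by two full sibs), the parent-offspring entries of 1/2 and the helper names `determinant` and `ibd_matrix` are illustrative and are not taken from the slides.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination on Fractions."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot: the matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def ibd_matrix(sib_sharing):
    """Assumed layout: parent 1, parent 2, sib 1, sib 2.

    Entries are proportions of alleles shared IBD at one position:
    1 on the diagonal, 0 between the parents, 1/2 between parent and
    offspring, and `sib_sharing` (0, 1/2 or 1) between the two sibs.
    """
    h = Fraction(1, 2)
    s = Fraction(sib_sharing)
    return [
        [1, 0, h, h],
        [0, 1, h, h],
        [h, h, 1, s],
        [h, h, s, 1],
    ]


for sharing in (0, Fraction(1, 2), 1):
    print(sharing, determinant(ibd_matrix(sharing)))
```

Under these assumptions the determinant is 0 when the sibs share 0 or 100% of their alleles IBD and 1/4 when they share half, so only the intermediate case gives an invertible matrix.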
  REML was used to estimate 81 variance components, with an algorithm based upon the complete (co)variance matrix V, to avoid the problem of singular IBD matrices.

Section 4  LD (linkage disequilibrium) mapping

What is LD?
  Linkage disequilibrium is a measure of association between alleles at different loci.
  Suppose we have two bi-allelic loci, A and B, with allele frequencies $p_{A1}$ and $p_{A2}$, and $p_{B1}$ and $p_{B2}$, respectively.
  LE: $p_{A1B1} = p_{A1}\,p_{B1}$
  LD: $p_{A1B1} \ne p_{A1}\,p_{B1}$

Measures of LD for bi-allelic markers
  1. Falconer and Mackay (1996) and Lynch and Walsh (1998), for bi-allelic loci:
     $D = p_{A1B1} - p_{A1}\,p_{B1}$
  2. Another measure of LD is:
     $D' = D / D_{max}$
     When $D > 0$, $D_{max}$ is the smaller of $p_{A1}p_{B2}$ and $p_{A2}p_{B1}$.
     When $D < 0$, $D_{max}$ is the smaller of $p_{A1}p_{B1}$ and $p_{A2}p_{B2}$.
     $D'$ ranges from -1 to +1, whereas $|D'|$ ranges from 0 to 1.
     Whenever one of the four haplotype frequencies is zero, $|D'| = 1$.
  3. For bi-allelic markers, another useful measure is (Hill and Robertson, 1968):
     $r^2 = \frac{D^2}{p_{A1}\,p_{A2}\,p_{B1}\,p_{B2}}$
     $N r^2$ is the $\chi^2$ test statistic for independence as calculated from a 2x2 contingency table.
     A statistical test of LD using the $r^2$ statistic is therefore straightforward.

Measures of LD for multi-allelic markers
  Hedrick (1987):
  $D' = \sum_{i=1}^{k} \sum_{j=1}^{l} p_{Ai}\,p_{Bj}\,|D'_{ij}|$, with $|D'_{ij}| = |D_{ij}| / D_{ij}^{max}$ and $D_{ij} = p_{AiBj} - p_{Ai}\,p_{Bj}$
  $k$ and $l$ are the numbers of alleles at loci A and B.
  $p_{Ai}$ and $p_{Bj}$ are the population frequencies of allele $i$ at locus A and allele $j$ at locus B.
  $|D'_{ij}|$ is the absolute value of the normalised measure.
  $p_{AiBj}$ is the estimated population frequency of the haplotype $A_iB_j$.
  $D_{ij}^{max}$ is the maximum amount of disequilibrium possible between allele $i$ at locus A and allele $j$ at locus B.
  The corresponding multi-allelic measure of the squared correlation is:

Linkage disequilibrium vs. gametic phase disequilibrium
  The term linkage disequilibrium appears to imply that the loci have to be linked. However, this is not the case, because an association between alleles can exist even if the loci are unlinked. Examples:
  Two populations with unequal allele frequencies are mixed.
  Non-random mating.
  The case of an F1 population.
  Selection.
  A better term for LD is gametic phase disequilibrium, which is used in textbooks such as Falconer and Mackay (1996) and Lynch and Walsh (1998).

D' or $r^2$?
  Hedrick (1987) stated that a good measure of disequilibrium should have the following properties:
  A simple biological interpretation.
  Statistical tests should be possible.
  Be directly related mathematically to evolutionary factors such as recombination, selection, genetic drift, gene flow, etc.
  Be standardised to allow comparisons across loci or populations.

Dynamics of LD
  There are a number of evolutionary forces that create LD, including mutation, admixture (crossbreeding), genetic drift, inbreeding, founder effects and selection.
  The main force that destroys LD is recombination.

LD mapping
  mapping requires a mark