Today's post covers Chapter 5 of the collection Clinical Bioinformatics (a laboratory guide to clinical bioinformatics): Bioinformatics Challenges in Genome-Wide Association Studies (GWAS)
De R., Bush W.S., Moore J.H. (2014) Bioinformatics Challenges in Genome-Wide Association Studies (GWAS). In: Trent R. (eds) Clinical Bioinformatics. Methods in Molecular Biology (Methods and Protocols), vol 1168. Humana Press, New York, NY
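A mind-map summary
One of the authors, Prof. Jason H. Moore, works at the Geisel School of Medicine at Dartmouth. His research covers biostatistics, epidemiology, and genomics; he developed the SPARCoC software and also wrote a book, Computational Methods for Genetics of Complex Traits (2010), which I'll track down once I can afford it...
OK, back to the chapter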
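Abstract: The chapter reviews the basic concepts of GWAS, the technologies used to capture genetic variation, the missing heritability problem, efficient study design, reducing the bias introduced into datasets, and how to make use of new resources such as electronic medical records
Key words: Data imputation, Epistasis, Electronic medical records, Filtering, Gene–gene interactions, GWAS, Meta-analysis, Missing heritability, Replication
I. Introduction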
GWAS rests on the Common Disease-Common Variant (CD-CV) hypothesis: common diseases (such as type 2 diabetes, rheumatoid arthritis, or essential hypertension) are caused in part by genetic variations that are also common in the population.
How SNP effect size relates to disease heritability: If common variants have a small effect size but common diseases show a strong inheritance in families (high heritability), then almost by definition the disease must be influenced by multiple genetic factors.
The missing heritability problem: GWAS has had limited success in detecting genetic variants that account for a large portion of the heritability of any common disease trait. As an example, the authors cite a breast cancer study in which the two loci identified explain only 5.9% of the familial risk of breast cancer.
*One cause is epistasis (epistatic interactions). Biological epistasis refers to the physical interactions between biomolecules that are influenced by multiple genetic variants. Statistical epistasis is the term for the nonadditive interactions between multiple genes, each of which affects disease susceptibility, and the environment.
*Possible remedies: 1) Designing our studies to search for nonlinear interactions amongst SNPs. 2) Using methods such as meta-analysis and data imputation to increase our statistical power. 3) Establishing strict criteria for defining phenotypes
II. Materials
The chapter introduces the Illumina and Affymetrix genotyping platforms and the use of Electronic Medical Records; I skip this part here
III. Methods
1 Basic concepts:
SNP: single base pair changes in the DNA sequence, have now become the modern unit of genetic variation
MAF: the frequency of the less common allele is referred to as minor allele frequency
LD: Linkage disequilibrium is a measure of correlation between SNP alleles at one site and the specific alleles carried at variant sites nearby. It is quantified with D′ or r²
Haplotype: a particular combination of alleles along a chromosome
tag SNPs: SNPs in strong LD with other variants surrounding them; these are the ones that end up being selected
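To make the two LD measures concrete, here is a minimal sketch that computes D, D′, and r² from haplotype and allele frequencies using the standard textbook formulas. The function name `ld_stats` and the example frequencies are made up for illustration and do not come from the chapter; D′ is reported as an absolute value.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab : frequency of the haplotype carrying allele A (SNP 1) and allele B (SNP 2)
    p_a  : frequency of allele A at SNP 1
    p_b  : frequency of allele B at SNP 2
    """
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies: allele A only ever occurs together with allele B.
    d, d_prime, r2 = ld_stats(p_ab=0.2, p_a=0.2, p_b=0.5)
    print(f"D={d:.3f}  D'={d_prime:.3f}  r2={r2:.3f}")
```

In this made-up example D′ is 1 while r² is only 0.25, which is why r² is usually the more informative measure when asking how well one SNP can stand in for another.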
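2 Study design:
(1) Case–Control vs. Quantitative
Case–control studies usually have a binary outcome, such as case/control or affected/unaffected. If a SNP allele is more frequent in cases than in controls, this suggests the SNP is associated with increased disease risk. Quantitative studies assess quantitative or continuous traits measured as numeric values (e.g. HDL, LDL) and ask whether SNP or allele frequencies are associated with the quantitative trait.
(2) Standardizing Phenotype Criteria
A standardized definition of the phenotype is very important, especially in multi-institution collaborations. Misclassifying a patient from case to control in a case–control study can sometimes be far more damaging than recording a wrong value in a quantitative study.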
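(3) Testing for an Association (the key part)
1) Preparation
Choosing a suitable method: association tests can relate either alleles (allelic) or genotypes (genotypic) to the phenotype; depending on the situation, pick a dominant, recessive, or additive model for the analysis
Adjusting the dataset: use regression to adjust for covariates so that false positives are avoided
Population substructure: one of the important covariates. Ethnic-specific SNPs may show up to be associated with a trait due to population stratification; STRUCTURE or EIGENSTRAT can be used to analyze it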
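The choice between dominant, recessive, and additive models comes down to how a genotype is turned into a number before testing. The sketch below uses the conventional encoding in terms of the count of minor alleles (0, 1, or 2); the function name `encode_genotype` is hypothetical and not taken from the chapter or from any particular package.

```python
def encode_genotype(minor_allele_count: int, model: str) -> int:
    """Encode a genotype (0, 1 or 2 copies of the minor allele) under a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        # Each extra copy of the minor allele shifts risk by the same amount.
        return minor_allele_count
    if model == "dominant":
        # One copy is enough to show the effect.
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        # Two copies are required to show the effect.
        return 1 if minor_allele_count == 2 else 0
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for model in ("additive", "dominant", "recessive"):
        print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

The encoded value is what would then enter a contingency table or a regression as the predictor.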
2) Single-locus vs. multi-locus
For binary traits in case–control studies, a contingency table method or logistic regression is commonly used.
*A contingency table summarizes the number of individuals within each genotypic group for a single biallelic SNP. It searches for a deviation from the null hypothesis that there is no association between the phenotype and genotype, e.g. with the chi-square test or Fisher's exact test in SAS, SPSS, Stata, or Microsoft Excel.
*Logistic regression is an extension of linear regression where the phenotypic outcome studied is transformed using a logistic function. This method predicts the probability of an individual having a case status, given their genotype class. It is the more widely used of the two because it allows covariate adjustment
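As a small illustration of the contingency table approach, the sketch below runs a Pearson chi-square test (without continuity correction) on a 2x2 table of allele counts in cases and controls. The counts are invented, and the function name `chi_square_2x2` is hypothetical; the p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Rows are cases and controls; columns are minor and major allele counts.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Invented allele counts: 100 cases and 100 controls, two alleles each.
    chi2, p = chi_square_2x2(a=60, b=140, c=40, d=160)
    print(f"chi2={chi2:.3f}  p={p:.4f}")
```

With small expected counts, Fisher's exact test is the usual alternative; it is not shown here.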
For quantitative traits, Analysis of Variance (ANOVA) is commonly used. It assumes that 1) the trait is normally distributed, 2) the variance of the trait is the same within each group, and 3) that the groups are independent. For single-SNP analysis, ANOVA tests the null hypothesis that the mean trait value does not differ between genotype groups.
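The F statistic behind a one-way ANOVA across the three genotype groups can be sketched in a few lines. The function name `anova_f` and the trait values are made up for illustration; turning F into a p-value needs the F distribution, which is left to a statistics package and not attempted here.

```python
from statistics import fmean


def anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """One-way ANOVA: return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = fmean(x for g in groups for x in g)
    group_means = [fmean(g) for g in groups]

    # Variation of group means around the grand mean (explained by genotype).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of individuals around their own group mean (unexplained).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Invented trait values for genotypes AA, Aa and aa.
    f_stat, df_b, df_w = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
    print(f"F={f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```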
PLINK is a commonly used program for GWAS analysis; it is powerful and easy to use, and can run tests under the allelic or inheritance models, or by using the Cochran-Armitage test (a contingency table method).
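For readers who want to see what the Cochran-Armitage trend test computes, here is a minimal standalone sketch using the common textbook form of the statistic with additive weights (0, 1, 2). It is not PLINK's code; the function name `cochran_armitage` and the genotype counts are invented for illustration.

```python
import math


def cochran_armitage(cases: list[int], controls: list[int],
                     weights: tuple[int, ...] = (0, 1, 2)) -> tuple[float, float]:
    """Cochran-Armitage trend test on genotype counts.

    cases, controls : counts of individuals with 0, 1, 2 copies of the minor allele
    weights         : scores per genotype; (0, 1, 2) corresponds to an additive trend
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls

    sum_t_r = sum(t * r for t, r in zip(weights, cases))
    sum_t_n = sum(t * m for t, m in zip(weights, totals))
    sum_t2_n = sum(t * t * m for t, m in zip(weights, totals))

    numerator = n * (n * sum_t_r - n_cases * sum_t_n) ** 2
    denominator = n_cases * n_controls * (n * sum_t2_n - sum_t_n ** 2)
    if denominator == 0:
        raise ValueError("test is undefined for these counts")

    chi2 = numerator / denominator
    # Survival function of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = cochran_armitage(cases=[40, 40, 20], controls=[60, 30, 10])
    print(f"chi2={chi2:.3f}  p={p:.4f}")
```

Unlike the allelic chi-square test, the trend test works on genotype counts and does not assume Hardy-Weinberg equilibrium.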
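Because analyzing single SNPs one at a time within a linear modeling framework is one source of the missing heritability problem mentioned earlier, multi-locus analysis is needed: more holistic approaches that recognize the complex landscape of the genotype–phenotype relationship and examine nonlinear interactions between genetic variants throughout the genome. The biggest challenge here is that handling 500,000 SNPs consumes enormous computational resources, so specific filtering methods are needed to ease the computational burden.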
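A quick back-of-the-envelope calculation shows why exhaustive multi-locus analysis is so expensive. The sketch below simply counts SNP pairs and triples for 500,000 SNPs; the Bonferroni-style threshold at the end is an illustrative assumption added here, not a figure from the chapter.

```python
import math

n_snps = 500_000

# Number of distinct SNP-SNP pairs and three-way combinations to test.
pairs = math.comb(n_snps, 2)
triples = math.comb(n_snps, 3)

print(f"pairwise tests:  {pairs:,}")        # 124,999,750,000
print(f"three-way tests: {triples:.3e}")

# If every pair were tested, a Bonferroni-style threshold at alpha = 0.05 would be:
print(f"per-test threshold for all pairs: {0.05 / pairs:.1e}")
```

Roughly 1.25 x 10^11 pairwise tests is already demanding, and three-way combinations are out of reach without filtering.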
A typical single-SNP GWAS analysis first filters on MAF/LD values (which still leaves about 300,000 SNPs), and then applies a significance threshold to pick out main-effect markers (single SNPs strongly associated with the disease)
Another filtering approach checks whether markers interact within a given pathway or protein family: the dataset can also be filtered so that only those multi-marker interactions will be examined that fit within a certain biological context such as a biological pathway, protein family, and group of genes or proteins involved in a certain molecular function.
For example, the Biofilter algorithm combines biomedical knowledge from multiple public repositories with statistical methods such as logistic regression or the multifactor dimensionality reduction (MDR) method to analyze SNP–SNP combinations.
3) Post-analysis corrections
The p-value is defined as the probability of observing a test statistic that is equal to or greater than the observed test statistic, if the null hypothesis is true. Related: the problem with p-values
Multiple-testing correction methods commonly used in GWAS:
*The Bonferroni correction
*Adjusting the False Discovery Rate (FDR)
*Using permutation testing to adjust the significance threshold, with PLINK, PRESTO, or PERMORY
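The first two corrections in this list are easy to sketch. The code below shows a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for controlling the FDR; the function names and the p-values are made up for illustration, and permutation testing is not covered.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure; returns a reject flag per input p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    threshold = bonferroni_threshold(0.05, len(pvals))
    print("Bonferroni rejects:", [p <= threshold for p in pvals])
    print("BH (FDR) rejects: ", benjamini_hochberg(pvals, q=0.05))
```

On these invented p-values Bonferroni keeps only the smallest one while Benjamini-Hochberg keeps the two smallest, which reflects the usual trade-off: FDR control is less conservative. With one million tests, the Bonferroni threshold at 0.05 works out to 5 x 10^-8.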
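(4) Replicating results
The sole purpose of replication is to evaluate the initial positive findings of a GWAS and confirm their validity and credibility
1) Statistical Replication
Statistical replication requires the following:
*A sufficiently large sample size. This is critical because of the winner's curse (the effect seen in the GWAS study sample is overestimated, i.e. higher than the true effect in the population)
*Replication must be carried out in an independent dataset from the same population, and should use the same criteria to define the phenotype in question
*Because GWAS markers are chosen based on LD patterns, the aim should be to replicate a genomic region, not necessarily the specific SNP found in the original study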
2) Meta-analysis
Meta-analysis is a statistical method for combining several different studies to provide one summary result; it aims to examine the effect of the same allele across all studies (provided all studies test the same hypothesis). Heterogeneity can be assessed with Cochran's Q or the I² statistic
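A minimal sketch of how per-study results are combined and how heterogeneity is measured, assuming a fixed-effect inverse-variance model (the chapter does not prescribe a particular model). The function name `fixed_effect_meta` and the effect sizes are invented for illustration.

```python
import math


def fixed_effect_meta(betas: list[float], ses: list[float]) -> tuple[float, float, float, float]:
    """Inverse-variance fixed-effect meta-analysis of one allele's effect.

    betas : per-study effect estimates (e.g. log odds ratios) for the same allele
    ses   : per-study standard errors
    Returns (pooled_beta, pooled_se, cochran_q, i_squared_percent).
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching lists with at least two studies")

    # More precise studies (smaller standard error) get more weight.
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    pooled_se = math.sqrt(1.0 / w_sum)

    # Cochran's Q: weighted squared deviations of study effects from the pooled effect.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributed to heterogeneity rather than chance.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled_beta, pooled_se, q, i_squared


if __name__ == "__main__":
    beta, se, q, i2 = fixed_effect_meta(betas=[0.1, 0.3], ses=[0.1, 0.1])
    print(f"pooled beta={beta:.3f}  se={se:.4f}  Q={q:.2f}  I2={i2:.0f}%")
```

When Q or I² indicates substantial heterogeneity, a random-effects model is the usual next step; it is not shown here.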
3) Data Imputation
The imputation procedure makes use of the known LD and haplotype patterns in reference panels to estimate genotypes for SNPs that were not directly genotyped within a study. Commonly used tools are BimBam, IMPUTE, MaCH, and Beagle (all based on haplotype phasing algorithms, which estimate the contiguous set of alleles that lie on a specific chromosome)
IV. Outlook
As the content of genotyping chips, cohort sizes, and biobanks grow even larger, the challenges of data manipulation, quality control, strong study design, and strict phenotypic definitions grow more complex. Hence, moving forward human geneticists will have to develop bioinformatics infrastructure and expertise to overcome such challenges. Most importantly, scientists will have to combine their bioinformatics efforts with genetics, biochemistry and cell biology to confirm the functional consequence and biological relevance of the genotype–phenotype associations that are identified.
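This chapter gives a concise, big-picture account of the basic concepts and principles of GWAS analysis in clinical medicine and of how to choose and use association models. In particular, it points out the shortcomings of current GWAS and how to avoid bias in practice. If you are learning GWAS, I suggest reading this introduction first, then looking up unfamiliar terms and the usage of common software according to your own level. You are also welcome to read my other Jianshu post on basic GWAS analysis
GWAS has been around for more than a decade and has played an important role, but many problems remain (see Further reading) and there is still plenty of room for improvement. As the authors say at the end, in Future Directions: ‘Ultimately, the translation of GWAS findings into clinical practice will rely upon correct assumptions regarding the genetic architecture of complex traits especially in the context of gene–gene and gene–environment interactions.’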
References:
See the original chapter
Further reading: