Dragon star Day 1 关于测序技术、NGS数据格式、变异检测

Dragon star Day 1 20190729

关于测序技术、NGS数据格式、变异检测

中英混杂3000字超长补丁,一只酸菜鱼疯狂填补知识盲区的笔记


Dragonstar2019 by Kai Wang

  1. Genomic technologies in disease studies
  2. NGS data formats and variant calling

Part Ⅰ Genomic technologies in disease studies

1 Types of genetic variation

  • SNVs (Single Nucleotide Variants) - 单核苷酸变异

  • Indel (Insertion or deletion < 50 bp)

  • SV (Structural Variants) - 结构变异

    can be balanced or unbalanced

    • Balanced events do not involve gain or loss of genetic materials

    • Inversions and translocations

      • Complex SVs (several types together)
    • Unbalanced events:

    • deletions/insertions/duplications (Deletions and duplications are two subtypes of CNVs (Copy Number Variants)

      • Chromosomal aneuploidies (such as trisomy 21 21三体) are extreme cases of unbalanced SV.
  • Allele frequency and effect size - 等位基因频率与效应量/效应值

    Manolio et al, Nature, 2009

    Genome-wide association studies (GWAS) are effective in detecting common alleles that contribute to the inherited component of common multifactorial diseases. Typically, the alleles identified by this approach have modest effect sizes that cannot fully account for disease susceptibility. This discrepancy may exist because it is hard to identify rare alleles with a low to modest penetrance using GWAS. Penetrance is a measure of the proportion of individuals in a population carrying a particular allele that expresses the related phenotype. In contrast to multifactorial diseases, Mendelian diseases have a high penetrance and very rare allele frequency.

    McCarthy, Mark I., et al. "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nature reviews genetics 9.5 (2008): 356.

    统计学中,效应值(Effect size)是量化现象强度的数值。[1]效应值实际的统计量包括了二个变数间的相关程度、回归模型中的回归系数、不同处理间平均值的差异……等等。无论哪种效应值,其绝对值越大表示效应越强,也就是现象越明显。效应值与特效检验的概念是互补的。在估算统计检定力、需要的样本数与进行元分析时,效应值经常扮演重要角色。

    https://zh.wikipedia.org/wiki/效应值

    Effect size is a simple way of quantifying the difference between two groups that has many advantages over the use of tests of statistical significance alone. Effect size emphasises the size of the difference rather than confounding this with sample size.*

    Coe R. It's the effect size, stupid: What effect size is and why it is important[J]. 2002.

2 Revolution: single-molecule long-read sequencing

单分子长读长测序

2.1 Oxford Nanopore Sequencing

测量电流

The DNA/RNA is sequenced when it is going though a protein pore.

  • DNA (or RNA)
  • Motor protein
  • Adapter sequence

The nucleotides in the DNA/RNA block the ionic current and induce changes of current, which can be measured.

2.2 PacBio Single-molecule real-time (SMRT) sequencing

测量荧光

  • SMRTbell library: 环状

  • Adapter: 环状

  • Types of SMRT sequencing reads

    • Continuous Long Reads (CLR)

      Long inserts so that the polymerase can synthesize along a single strand

    • Circular Consensus Sequencing (CCS, HiFi)

      Short inserts, so polymerase can continue around the entire SMRTbell multiple times and generate multiple sub-reads from the same single molecule

    • 存在由于技术造成的 error insertion

      Insertions tend to be more than deletions and substitutions.

2.3 Linked-read Sequencing

依靠barcode

By adding a unique barcode to every short read generated from an individual molecule, the short reads are linked together.

10X Genomics 公司的 linked reads 技術本質上是將 barcode 序列引入長序列片段,透過將長片段分配到不同的油滴微粒中,利用 GemCode 平台對長片段序列進行擴增引入 barcode 序列以及定序接頭引物,然後將序列打斷成適合定序大小的片段進行定序,通過 barcode 序列資訊追踪來自每個大片段DNA模板的多個 Reads,從而獲得大片段的遺傳資訊。通過 linked reads 結合常規二代定序組裝得到的Scaffold,可建構準確度更長的Scaffold。

http://toolsbiotech.blog.fc2.com/blog-entry-37.html

What are molecular barcodes?
The concept of molecular barcodes is that each original DNA fragment, within the same sample, is attached to a unique sequence barcode.

https://pdfs.semanticscholar.org/310b/3bac42989485c98406848217418ff22c22e7.pdf

  • Linked read use molecular barcoding to preserve long-range information

  • Short read pairs (2 x150 bp) are generated using barcode-containing primers.

  • Short reads that contains the same barcode and within a certain distance can be linked together to “reconstruct” the original long DNA fragment.

2.4 Single-molecule optical mapping (Bionano Genomics)

依靠限制性内切酶位点,测量光谱

Optical mapping

By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that sequence.

https://en.wikipedia.org/wiki/Optical_mapping

Part Ⅱ NGS data formats and variant calling

1 Basic Concepts in NGS

  • Insert – the DNA portion that is used for sequencing *note: 和突变中的insertion不同

    Insert size is the length of the DNA (or RNA) that you want to sequence and that is "inserted" between the adapters (so adapters excluded).

    https://www.biostars.org/p/95803/

  • Read – the part of the insert that is sequenced 读长

  • Single End – a sequencing procedure by which the insert is sequenced from one end only 单端

  • Paired End – a sequencing procedure by which the insert is sequenced from both ends 双端

2 Data formats

2.1 The Rawset of Raw Data

Typically: images 根据不同碱基发射出不同颜色的荧光

base calling: call nucleotides at each base of each read.

2.2 NGS data formats

FASTQ: The raw sequence data format

Millions of short reads from unknown genetic locations.

  • Basic Qualities: a "Phred" scale

    BQ = -10log10(ε) where ε is the probability of an error.

    Phred: Phil's revised editor by Phil Green

    https://www.pnas.org/content/101/39/13991>

    Base quality conversion

    Nowadays, we settled on using quality scores on the original Sanger format,(Phred+33).

    需要知道测序年份/公司,来确定Phred

    https://en.wikipedia.org/wiki/FASTQ_format

  • BED: Genomic region format

    • The first 3 required BED fields are:

      1. chrom: "chr1"

      2. chromStart: "0"

      3. chromEnd: "3"

      只包含这三种信息的BED: Minimal BED format. So-called BED3 format.

    • 再加上:

      1. labels: "foo", "bar", "biz"...
      2. scores: "3.1"
      3. strands: "+", "-"

      This is so-called BED6 format.

  • GFF: annotates one line per feature

  • BAM/SAM: Genome alignment format

    • SAM: Sequence Alignment/Map format (tab-delimited text file).
    • BAM: The binary equivalent of a SAM file, which stores the same data in a compressed binary representation
    https://www.samformat.info/sam-format-flag
  • CRAM

    和 BAM/SAM 类似的另一种格式,无损压缩,适用于大型人群基因组/外显子组测序项目,如 UK Biobank.

    • CRAM was designed to be an efficient reference-based alternative to SAM/BAM file formats
    • Better lossless compression than BAM, but also allow for controlled loss of BAM data
  • VCF: Variant Call Format

    • A gold standard for describing variants.

    • One locus per line, and it may contain more than one mutations, but most lines contain one variant only.

    • 主要分为:

      • the header line

        包括:CHROM POS ID REF ALT QUAL FILTER INFO

        存在genotype时,还有:FORMAT

      • the INFO line

        用分号分隔的keys, 用等号定义可选的值

      • the genotypes

        定义数据类型及顺序

  • gVCF (Genomic VCF):

    • the basic format specification is the same as for a regular VCF, but gVCF contains extra information.
    • gVCF储存了变异及非变异位点的测序信息,适用于临床分析

2.3 Formats use different coordinate systems, which adds confusion

不同文件格式对染色体坐标的起始位置定义不同,造成了一定的困扰

BED: 0-based, half-open
GFF: 1-based, fully closed
SAM: 1-based, fully closed
BAM: 0-based, half-open
VCF: 1-based, fully closed

3 Visualization of genomic data: IGV

强大的IGV

IGV: a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

4 Coverage

4.1 What is coverage?

  • The depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.

    测序深度可理解为:(read长度/read数量)/单倍体基因组长度

  • The breadth of coverage is the percentage of target bases that have been sequenced for a given number of times.

  • The accuracy of variant calling is affected by sequence quality, uniformity of coverage and the threshold of false-discovery rate that is used.

    What is variant calling?

    Variant calling is the process by which we identify variants from sequence data (Figure 11).

    1. Carry out whole genome or whole exome sequencing to create FASTQ files.
    2. Align the sequences to a reference genome, creating BAM or CRAM files.
    3. Identify where the aligned reads differ from the reference genome and write to a VCF file.

    Identify where the aligned reads differ from the reference genome and write to a VCF file.

    Figure 11 A CRAM file aligned to a reference genomic region as visualised in Ensembl. Differences are highlighted in red in the reads, and will be called as variants.

    reads中标红的碱基为variants.

    https://www.ebi.ac.uk/training/online/course/human-genetic-variation-i-introduction-2019/variant-identification-and-analysis

4.2 Coverage: how many reads we need to cover the genome?

  • Depth of coverage model

    鸟枪法测序中,genome size G, read size S, N reads

    某个read出现在某个间隙(interval)的概率为:p=L/G

    D: 出现在与read同等长度的间隙的read的数量

    D ~ Binomial(N, L/G),D与N, p 符合二项分布

    即:b(x; n, P) = nCx* Px * (1 - P) n-x**

    b(D; N, L/G) = NCD* (L/G)D * (1 - L/G) N-D**

    Let L = S, D is also the number of reads that cover the last position of the interval → D is depth of coverage.

    因为 S << G, N的值特别大,depth可近似理解为泊松分布:

    D ~ Poisson(λ)
    λ= SN/G (average depth of coverage)

    P(D=k)=\frac{e^{-\lambda}\lambda^k}{k!}

  • Fraction of genome that are covered

    根据上述公式可以计算:

    Given λ=40, the fraction of genome that are covered more than 30x (D>30) is: 0.938.

4.3 Overdispersion

  • Main cause of overdispersion

    GC bias and other technical factors lead to systematic bias in coverage, resulting in overdispersion.

  • How to model overdispersion

    • Ideal situation (Poisson distrition):
      Var(D) = μ

    • Gamma-Poisson is equivalent to Negative Binomial, which is a commonly used model for dealing with overdispersion in count data:
      Var(D) = μ+ μ^2/k

    • 过离散模型也适用于其他因素,如 biological noise

4.4 Question on coverage

Why do we need average 30-50x in a typical WGS experiment, and 100x in WES?

WGS is less biased compared to WES. We do not need as much depth to call a variant confidently. Check the following publications for more detailed information.

" Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. "

Article Variant detection sensitivity and biases in whole genome and...

Shu-Hong Lin

https://www.researchgate.net/post/Why_is_NGS_coverage_of_seemingly_less_complex_exome_sequences_higher_than_that_of_whole_genome_sequencing

"WGS is less biased compared to WES."的原因:WES的样品首先需要经过PCR

Why 30X WGS beats 100X WES for variant coverage

https://www.variantyx.com/variantyx-posts/why-30x-wgs-beats-100x-wes-for-variant-coverage/

5 General strategy for variant calling

5.1 Possible Genotypes

  1. 当只有参考基因组时,各种情形的概率均为1

    P(reads|A/A, read mapped) = P(C observed|A/A, read mapped) = 1.0

    P(reads|A/C, read mapped) = P(C observed|A/C, read mapped) = 1.0

    P(reads|C/C, read mapped) = P(C observed|C/C, read mapped) = 1.0

  2. 假定error rate为0.01

  3. 当第1条read的该位点为C

    P(reads|A/A, read mapped) = 0.01

    P(reads|A/C, read mapped) = 0.5

    P(reads|C/C, read mapped) = 1 - 0.01 = 0.99

  4. 当第2条read的该位点为C

    P(reads|A/A, read mapped) = 0.012 = 0.0001

    P(reads|A/C, read mapped) = 0.52 = 0.25

    P(reads|C/C, read mapped) = 0.992 = 0.9801

  5. 当第3条read的该位点为C

    P(reads|A/A, read mapped) = 0.013 = 0.000001

    P(reads|A/C, read mapped) = 0.53 = 0.125

    P(reads|C/C, read mapped) = 0.993 = 0.970299

  6. 当第4条read的该位点为A

    P(reads|A/A, read mapped) = 0.013 * 0.99 = 0.00000099

    P(reads|A/C, read mapped) = 0.54 = 0.0625

    P(reads|C/C, read mapped) = 0.993 * 0.01 = 0.00970299

  7. 当第5条read的该位点为A

    P(reads|A/A, read mapped) = 0.013 * 0.992 = 0.00000098

    P(reads|A/C, read mapped) = 0.55 = 0.03125

    P(reads|C/C, read mapped) = 0.993 * 0.012 = 0.0000970299 ≈ 0.000097

总结出贝叶斯公式:

Combine these likelihoods with a prior incorporating information from other individuals and flanking sites to assign a genotype.

5.2 From Sequence to Genotype: Individual Based Prior

Individual Based Prior: Evry site has 1/1000 probability of varying.

Ingredients That Go Into Prior

  • Most sites don’t vary

    P(non-reference base) ~ 0.001

  • When a site does vary, it is usually heterozygous
    P(non-reference heterozygote) ~ 0.001 * 2/3
    P(non-reference homozygote) ~ 0.001 * 1/3

  • Mutation model
    Transitions account for most variants (C↔T or A↔G)
    Transversions account for minority of variants

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf

因此,各位点的先验概率为:

Prior(A/A) = 0.001 * 1/3 ≈ 0.00033 0.00034?

Prior(A/C) = 0.001 * 2/3 ≈ 0.00067 0.00066?

Prior(C/C) = 1 - 0.001 = 0.999

琢磨了两个小时并把google搜索翻到了第十页,依然没看懂这里的后验概率是怎么算的,如果有统计大神看到这里,求赐教🙏

5.3 From Sequence to Genotype: Population Based Prior

Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20

因此,各位点的先验概率为:

Prior(A/A) = 0.2 * 0.2 = 0.04

Prior(A/C) = 1 - 0.04 - 0.64 = 0.32

Prior(C/C) = 0.8 * 0.8 = 0.64

同样的迷惑:

5.4 Sequence Based Genotype Calls

  • Individual Based Prior
    • Assumes all sites have an equal probability of showing polymorphism
    • Specifically, assumption is that about 1/1000 bases differ from reference
    • If reads where error free and sampling Poisson …
    • … 14x coverage would allow for 99.8% genotype accuracy
    • … 30x coverage of the genome needed to allow for errors and clustering
  • Population Based Prior
    • Uses frequency information obtained from examining other individuals
    • Calling very rare polymorphisms still requires 20-30x coverage of the genome
    • Calling common polymorphisms requires much less data

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 194,088评论 5 459
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 81,715评论 2 371
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 141,361评论 0 319
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 52,099评论 1 263
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 60,987评论 4 355
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 46,063评论 1 272
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 36,486评论 3 381
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 35,175评论 0 253
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 39,440评论 1 290
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 34,518评论 2 309
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 36,305评论 1 326
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,190评论 3 312
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 37,550评论 3 298
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 28,880评论 0 17
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,152评论 1 250
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 41,451评论 2 341
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 40,637评论 2 335

推荐阅读更多精彩内容