全基因组 - 人类基因组变异分析（PacBio) （5）-- pbsv

以Pacific Biosciences （PacBio）公司为代表的第三代单分子实时荧光测序平台为以测序为基础的科学研究带来了革命性的突破。目前该技术广泛应用于基因组Denovo组装、全长转录本检测、宏基因组，基因组重测序等多个方向，并且在染色体结构变异（Structure Variation, SV）的检测中有着不可替代的优势。前面我们讲了PacBio三代测序数据的类型、预处理、比对和SNPs/INDELs变异检测等基本内容。本期我们就继续沿着分析流程图一起看看基于比对结果检测染色体结构变异（SV）分析方法。

一、染色体结构变异

染色体结构变异（Structure Variation, SV），指基因组上发生的长度大于50bp的大片段插入(Insertion, INS)、缺失(Deletion, DEL)、倒位(Inversion, INV)、易位(Translocation)、重复（Duplication, DUP）等类型的变异，其中占比最大的就是大片段的插入和缺失（图1）。插入缺失很好理解就是，多了一段或者少了一段DNA序列；重复就是有一段区域的序列重复出现；倒位就是序列翻转了一下，如本来那个位置该是AATTG的，结果变成了GTTAA；易位的话就是序列位置的变化，又进一步分为染色体内易位和染色体间易位。据统计，基因组结构变异可能导致的遗传性疾病已经超过1,000种，对于每个人来讲其基因组都有至少20,000个的结构变异，这些变异带来的影响或许比SNVs或InDels带来的影响更大。

图1. 结构变异的类型

二、三代测序在结构变异检测上的优势

如果是仅仅研究像SNPs/INDELs一样的很小的变异，二代测序就能够胜任；但是如果要研究很大的结构变异（>50bp），则二代测序的短读长很难识别变异位点，尤其是复杂的高CG含量的区域，而且仅仅依靠算法很难克服技术上的缺陷。三代测序的长读长能够很有效的跨越覆盖识别出结构变异位点，得到结构变异的全貌，轻松测通基因组上的复杂重复区域。通过三代测序技术，在人类基因组中发现了数万个结构变异，而这些变异通常无法通过二代测序技术进行识别（图2）。

图2. 三代测序之于二代测序的优势

从Medhat Mahmoud等人 (5) 的研究中可以看出（图3B），在人类中，三代长度长测序比二代短读长测序在结构变异检测数量上平均高2-4倍。

图3. 三代测序技术在结构变异检测方面明显优于二代测序

三、SV检测相关软件

目前，对于PacBio三代测序平台结构变异检测的软件有PBSV, Sniffles, cuteSV、PAV、PBHoney、SMRT-SV、SVIM 等 (图4，图5，图6）。

图4. 适合于PacBio数据的结构变异检测软件

PBHoney，文章表于2014年 (7)，软件最后的更新停留在了2017年，所以已经不推荐了。
SMRT-SV, 第一个版本由Chaisson et al. 发表于2015年 (8)，早已经不再更新。SMRT-SV2由Audano et al.发表于2019年 (9)，替代SMRT-SV。现在SMRT-SV2也已经停止更新了，并在其github网站上推荐以下工具：

PAV: https://github.com/EichlerLab/pav
pbsv: https://github.com/PacificBiosciences/pbsv
Sniffles: https://github.com/fritzsedlazeck/Sniffles
SVIM: https://github.com/eldariont/svim
检查了一下，SVIM在2021年6月18号已经停止更新。

DeBreak，文章于2023年发表在Nature Communication上，github上更新到 2022年12月1号(version 1.2）。从文章中的数据看来，DeBreak 在模拟数据（图5）和实测数据HG002（图6）中recall, precision和 F1 score上优于同类软件。

图5. DeBreak在模拟数据上的表现

图6. DeBreak在HG002数据上的表现

PAV，Phased Assembly Variant Caller，文章于2021年发表于Science (11)，github更新到2023年10月13号(version 2.3.4)。
PBSV，PacBio官方开发的结构变异软件，github上更新至2023年3月14日（version 2.9.0）。
Sniffles2，版本一于2018年发表于Nature methods（12），Sniffles2于2022年发表于BioRxiv（13），github更新到2023年7月14号（version 2.2）
cuteSV，文章于2020年发表于Genome Biology (14)， github上更新至2023年11月14号（version 2.1）。

综上所述，现在常用以及保持更新的有以下这些，PacBio官方的PBSV、Sniffles2、cuteSV，DeBreak 和 PAV了 （此处可能总结的不全面，欢迎留言补充）。

前面咱们是用PacBio官方的比对软件pbmm2，由于pbmm2生成的 .bam文件MD tag 格式有些特别，无法用于后续的sniffles进行SV检测。所以这里我们使用pbsv进行后续SV检测。后面我会针对minimap2+Sniffles2组合以及minimap2+cuteSV再出一期教程，有时间也会尝试一下PAV和DeBreak。

四、pbsv具体安装和使用

主页网址：https://github.com/PacificBiosciences/pbsv

pbsv is a suite of tools to call and analyze structural variants in diploid genomes from PacBio single molecule real-time sequencing (SMRT) reads.

pbsv calls insertions, deletions, inversions, duplications, and translocations. Both single-sample calling and joint (multi-sample) calling are provided.

软件安装

#pvsv v.2.9.0, conda 安装
$ conda install -c bioconda pbsv

工作流程

整个工作流程从序列比对到获得结构变异一共分三步（图7）：

PacBio reads和参考基因组进行比对，从.subreads.bam / .ccs.fastq到比对排序后的.bam文件。
寻找结构变异的特征，.bam到.svsig.gz。
获得单个或者所有样本的结构变异和基因型，.svsig.gz到.vcf

图7.pbsv软件工作流程

具体分析命令

数据我们还是使用德系犹太人家系：HG002（子）、HG003（父）、HG004（母），具体参考全基因组 - 人类基因组变异分析（PacBio) （3）-- pbmm2

1. Align PacBio reads to a reference genome 序列比对。

# 以CCS bam input为示例
$ pbmm2 align ref.fa movie1.ccs.bam ref.movie1.bam --sort --preset CCS --sample sample1

#实际运行
#HG002_1
$ pbmm2 align --preset CCS --sort -j 24 -J 4 \
 --log-level INFO --log-file pbmm2.HG002_1.log --sample HG002_1 \
Human_ref/GRCh38.mmi m84011_220902_175841_s1.hifi_reads.bam HG002_1.align.bam 

#HG003
$ pbmm2 align --preset CCS --sort -j 24 -J 4 \
 --log-level INFO --log-file pbmm2.HG003.log --sample HG003 \
Human_ref/GRCh38.mmi m84010_220919_235306_s2.hifi_reads.bam HG003.align.bam 

#HG004
$ pbmm2 align --preset CCS --sort -j 24 -J 4 \
 --log-level INFO --log-file pbmm2.HG004.log --sample HG004 \
Human_ref/GRCh38.mmi m84010_220919_232145_s1.hifi_reads.bam HG004.align.bam

2. Discover signatures of structural variation 寻找结构变异的特征。

举例：

$ pbsv discover ref.movie1.bam ref.sample1.svsig.gz
$ pbsv discover ref.movie2.bam ref.sample2.svsig.gz

# optionally index svsig.gz to allow random access via `pbsv call -r`
tabix -c '#' -s 3 -b 4 -e 4 ref.sample1.svsig.gz
tabix -c '#' -s 3 -b 4 -e 4 ref.sample2.svsig.gz

It is highly recommended to provide one tandem repeat annotation .bed file of your reference to pbsv discover via --tandem-repeats. This increases sensitivity and recall. Feel free to use the following for human SV calling: GRCh38 or hs37d5/hg19. 加上提供的重复串联序列区域注释文件可以提高敏感度和召回率。

实际运行：

#HG002_1
$ nohup pbsv discover --tandem-repeats Human_ref/human_GRCh38_no_alt_analysis_set.trf.bed HG002_1.align.bam HG002_1.svsig.gz &

#HG003
$ nohup pbsv discover --tandem-repeats Human_ref/human_GRCh38_no_alt_analysis_set.trf.bed HG003.align.bam HG003.svsig.gz &

#HG004
$ nohup pbsv discover --tandem-repeats Human_ref/human_GRCh38_no_alt_analysis_set.trf.bed HG004.align.bam HG004.svsig.gz &

3. Call structural variants and assign genotypes 获得单个或者所有样本的结构变异和基因型。

#示例，如果输入序列是CCS序列，加上--ccs选项
$ pbsv call ref.fa ref.sample1.svsig.gz ref.sample2.svsig.gz ref.var.vcf

#实际运行，Joint calling

nohup pbsv call -j 24 --ccs Human_ref/GRCh38.fa HG002_1.svsig.gz HG003.svsig.gz HG004.svsig.gz HG.SV.vcf &

至此，我们已经获得了三个样本包含结构变异信息的vcf格式文件了。

pbsv call 帮助文档，除上述示例中的参数，PBSV还有诸多筛选参数，包括输出的SV类型、长度、基因型等等，需要大家根据自己的实际项目情况，进行探索，选择合适参数。比如用 -t 可以指定特定结构变异类型，-r 可以指定基因组区域。全部参数说明如下图所示：

pbsv call - Call structural variants from SV signatures and assign genotypes (SVSIG to VCF).

Usage:
  pbsv call [options] <ref.fa|xml> <ref.in.svsig.gz|fofn> <ref.out.vcf>

  ref.fa|xml                                  STR   Reference genome assembly against which to call variants.
  ref.in.svsig.gz|fofn                        STR   SV signatures from one or more samples.
  ref.out.vcf                                 STR   Variant call format (VCF) output.

Basic Options:
  -r,--region                                 STR   Limit discovery to this reference region: CHR|CHR:START-END.
  --hifi,--ccs                                      Use options optimized for HiFi reads: -S 0 -P 10.

Variant Options:
  -t,--types                                  STR   Call these SV types: "DEL", "INS", "INV", "DUP", "BND".
                                                    [DEL,INS,INV,DUP,BND]
  -m,--min-sv-length                          STR   Ignore variants with length < N bp. [20]
  --max-ins-length                            STR   Ignore insertions with length > N bp. [15K]
  --max-dup-length                            STR   Ignore duplications with length > N bp. [1M]

SV Signature Cluster Options:
  --cluster-max-length-perc-diff              INT   Do not cluster signatures with difference in length > P%. [25]
  --cluster-max-ref-pos-diff                  STR   Do not cluster signatures > N bp apart in reference. [200]
  --cluster-min-basepair-perc-id              INT   Do not cluster signatures with basepair identity < P%. [10]

Consensus Options:
  -x,--max-consensus-coverage                 INT   Limit to N reads for variant consensus. [20]
  -s,--poa-scores                             STR   Score POA alignment with triplet match,mismatch,gap. [1,-2,-2]
  --min-realign-length                        STR   Consider segments with > N length for re-alignment. [100]

Call Options:
  -A,--call-min-reads-all-samples             INT   Ignore calls supported by < N reads total across samples. [3]
  -O,--call-min-reads-one-sample              INT   Ignore calls supported by < N reads in every sample. [3]
  -S,--call-min-reads-per-strand-all-samples  INT   Ignore calls supported by < N reads per strand total across samples
                                                    [1]
  -B,--call-min-bnd-reads-all-samples         INT   Ignore BND calls supported by < N reads total across samples [2]
  -P,--call-min-read-perc-one-sample          INT   Ignore calls supported by < P% of reads in every sample. [20]
  --preserve-non-acgt                               Preserve non-ACGT in REF allele instead of replacing with N.

Genotyping:
  --gt-min-reads                              INT   Minimum supporting reads to assign a sample a non-reference
                                                    genotype. [1]

Annotations:
  --annotations                               FILE  Annotate variants by comparing with sequences in fasta.Default
                                                    annotations are ALU, L1, SVA.
  --annotation-min-perc-sim                   INT   Annotate variant if sequence similarity > P%. [60]

Variant Filtering Options:
  --min-N-in-gap                              STR   Consider >= N consecutive "N" bp as a reference gap. [50]
  --filter-near-reference-gap                 STR   Flag variants < N bp from a gap as "NearReferenceGap". [1K]
  --filter-near-contig-end                    STR   Flag variants < N bp from a contig end as "NearContigEnd". [1K]

  -h,--help                                         Show this help and exit.
  --version                                         Show application version and exit.
  -j,--num-threads                            INT   Number of threads to use, 0 means autodetection. [0]
  --log-level                                 STR   Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL).
                                                    [WARN]
  --log-file                                  FILE  Log to a file, instead of stderr.

Copyright (C) 2004-2023     Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.

五、结构变异结果说明

以50bp为临界，不论是小突变（SNV/INDEL）还是大的染色体结构变异（SV），通常情况下生成的结果文件均为.vcf文件格式，也是信息分析中比较常见的一种格式，百度里也有详尽的结果文件格式说明。图8为HG002，HG003和HG004三个样本joint calling结构变异后的部分结果展示。

图8. vcf文件展示

参考文献

神灯宝典之PB三代重测序分析实录（一）
神灯宝典之三代重测序分析实录（二）
三代测序时代的临床科研
三代重测序到底能干什么？
Mahmoud, M., Gobet, N., Cruz-Dávalos, D. I., Mounier, N., Dessimoz, C., & Sedlazeck, F. J. (2019). Structural variant calling: the long and the short of it. Genome Biology.
giab_tools_methods
English, Adam C., William J. Salerno, and Jeffrey G. Reid. "PBHoney: identifying genomic variants via long-read discordance and interrupted mapping." BMC Bioinformatics 15 (2014).
Chaisson, Mark JP, John Huddleston, Megan Y. Dennis, Peter H. Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci et al. "Resolving the complexity of the human genome using single-molecule sequencing." Nature (2015). SMRT-sv
Audano, P. A., Sulovari, A., Graves-Lindsay, T. A., Cantsilieris, S., Sorensen, M., Welch, A. E.,
Dougherty, M. L., Nelson, B. J., Shah, A., Dutcher, S. K., et al. (2019). Characterizing the major structural variant alleles of the human genome. Cell.
Chen, Yu, Amy Y. Wang, Courtney A. Barkley, Yixin Zhang, Xinyang Zhao, Min Gao, Mick D. Edmonds, and Zechen Chong. "Deciphering the exact breakpoints of structural variations using long sequencing reads with DeBreak." Nature Communications (2023).
Ebert et al., “Haplotype-Resolved Diverse Human Genomes and Integrated Analysis of Structural Variation,” Science. 2021.
Sedlazeck, Fritz J., Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt Von Haeseler, and Michael C. Schatz. "Accurate detection of complex structural variations using single-molecule sequencing." Nature methods (2018).
Smolka, Moritz, Luis F. Paulin, Christopher M. Grochowski, Dominic W. Horner, Medhat Mahmoud, Sairam Behera, Ester Kalef-Ezra et al. "Comprehensive structural variant detection: from mosaic to population-level." BioRxiv (2022).
Jiang, Tao, Yongzhuang Liu, Yue Jiang, Junyi Li, Yan Gao, Zhe Cui, Yadong Liu, Bo Liu, and Yadong Wang. "Long-read-based human genomic structural variation detection with cuteSV." Genome biology (2020).