现在我们已经具备采用pysam模块统计比对信息的知识了,这里我们编写一个程序来统计BAM/SAM文件的信息并使用samtools flagstat
进行验证。
NA12891_CEU_sample.bam文件下载
程序代码如下:
import pysam
import sys
from collections import Counter ## 1
if len(sys.argv) < 2: ## 2
sys.exit("usage: alnsta.py bamfile")
bamfile = pysam.AlignmentFile(sys.argv[1]) ## 3
stat = Counter()
for read in bamfile:
stat['total'] += 1
stat['qcfail'] += int(read.is_qcfail) ## 4
stat['paired'] += int(read.is_paired)
stat['read1'] += int(read.is_read1)
stat['read2'] += int(read.is_read2)
if read.is_unmapped:
stat['unmapped'] += 1
continue ## 5
stat['mapped'] += 1
if read.mapping_quality <= 30:
stat['mapping quality <= 30'] += 1
stat["proper pair"] += int(read.is_proper_pair)
output_order = ("total", "mapped", "unmapped", "paired", "read1", "read2", "proper pair", "qcfail", "mapping quality <= 30") ## 6
for key in output_order:
format = (key, stat[key], 100*float(stat[key]/stat['total']))
sys.stdout.write(("%s: %d (%0.2f%%)\n") % format) ## 7
解释:
## 1,Counter对象在统计信息的时候很常用,类似于一个字典
## 2,简单判断是否存在输入文件名
## 3,pysam.AlignmentFile用于读取BAM/SAM文件,自动判断文件类型不需要指定
## 4,根据read的属性判断read是否符合条件,使用int函数将逻辑值转为0,1值
## 5,如果一个read没有比对到参考基因组上,跳过下面的内容统计
## 6,由于Counter对象存储的元素没有顺序,这里指定输出顺序
## 7,通过控制台以给定的格式顺序输出内容
该代码的运行结果为:
$ python3 alnsta.py NA12891_CEU_sample.bam
total: 636207 (100.00%)
mapped: 630875 (99.16%)
unmapped: 5332 (0.84%)
paired: 435106 (68.39%)
read1: 217619 (34.21%)
read2: 217487 (34.18%)
proper pair: 397247 (62.44%)
qcfail: 0 (0.00%)
mapping quality <= 30: 90982 (14.30%)
使用samtools flagstat
进行检验:
$ samtools flagstat NA12891_CEU_sample.bam
636207 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
29826 + 0 duplicates
630875 + 0 mapped (99.16% : N/A)
435106 + 0 paired in sequencing
217619 + 0 read1
217487 + 0 read2
397247 + 0 properly paired (91.30% : N/A)
424442 + 0 with itself and mate mapped
5332 + 0 singletons (1.23% : N/A)
5383 + 0 with mate mapped to a different chr
2190 + 0 with mate mapped to a different chr (mapQ>=5)
可见我们的统计结果基本都是正确的,不过mapping quality <= 30数目没有统计到。我们可以这样验证:mapping quality <= 30数目 = 能够比对到参考基因组的read数目 - 比对质量大于30的read数目。
由于unmap的flag数字为4:
$ samtools flags UNMAP
0x4 4 UNMAP
所以mapped read数目为(注:这里-F
表示取反):
$ samtools view -c -F 4 NA12891_CEU_sample.bam
630875
通过-q
参数可以查看比对质量大于30的read数目:
$ samtools view -c -q 31 NA12891_CEU_sample.bam
539893
综合这两个信息,可知read比对质量小于等于30的read数目为:630875 - 539893 = 90982,与脚本的结果一致。