Tags (space-separated): annovar annotation
[TOC]
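Introduction
Reference: http://annovar.openbioinformatics.org/en/latest/
ANNOVAR performs functional annotation of SNVs and CNVs; wANNOVAR is currently dedicated to SNV annotation. It offers three main annotation modes, plus some extras:
- gene-based annotation: determines whether an SNV or CNV changes the protein-coding sequence or the amino acids. Supported gene definition systems include RefSeq, UCSC, ENSEMBL, GENCODE, AceView and others.
- region-based annotation: identifies which chromosomal region a variant falls in, e.g. predicted transcription factor binding sites, segmental duplication (SD) regions, GWAS hits...
- filter-based annotation: identifies variants recorded in specific databases, e.g. whether a variant is reported in dbSNP, its frequency in 1000 Genomes (1KG) or ExAC, and its SIFT/PolyPhen/LRT/MutationTaster/MutationAssessor scores...
- other functions: batch retrieval of nucleotide sequences for specified regions, and retrieval of candidate genes for Mendelian diseases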
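Download and installation
- ANNOVAR is written in Perl, so it runs as a standalone program on any system that has Perl installed. Unpack it and it is ready to use.
- Downloading requires registration with a .edu email address: http://www.openbioinformatics.org/annovar/annovar_download_form.php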
tar xvfz annovar.latest.tar.gz
cd annovar
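The ANNOVAR package ships with a few commonly used databases in the humandb/ directory. For any other annotation, download the required database into humandb/ with the -downdb option.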
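Main program structure
ANNOVAR program layout
│ annotate_variation.pl # main program: downloads databases and runs the three annotation types
│ coding_change.pl # infers the resulting protein sequence
│ convert2annovar.pl # converts various formats to .avinput
│ retrieve_seq_from_fasta.pl # used to build transcript sets for other species yourself
│ table_annovar.pl # annotation program that can run all three annotation types in one go
│ variants_reduction.pl # for more flexibly customized filter-annotation pipelines
│
├─example # example files
│
└─humandb # human annotation databases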
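Database download
Annotation depends on databases: without the corresponding annotation files, nothing can be annotated (obviously!).
It is best to download the latest annotation databases for the matching genome build.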
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
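# -buildver: the genome build to use
# -webfrom annovar: download from the ANNOVAR repository; if the database is not hosted there, omit this option and it is downloaded from UCSC instead
# refGene: database name
# humandb/: directory to download into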
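Several databases are usually needed, so the download command above can be generated rather than typed repeatedly. The following is a minimal sketch, not part of ANNOVAR: `downdb_command` is a hypothetical helper, and the `hosted` mapping is an assumption that has to be checked against the official database list for each database.

```python
import shlex


def downdb_command(database, buildver="hg19", from_annovar=True, dbdir="humandb/"):
    """Return the annotate_variation.pl argument list that downloads one database.

    Hypothetical helper: it only assembles the options described above.
    """
    cmd = ["perl", "annotate_variation.pl", "-buildver", buildver, "-downdb"]
    if from_annovar:
        # Pass -webfrom annovar only for databases hosted in the ANNOVAR repository.
        cmd += ["-webfrom", "annovar"]
    cmd += [database, dbdir]
    return cmd


# Assumption: whether a database is hosted by ANNOVAR must be looked up on the
# official site; the two entries here are only an illustration.
hosted = {"refGene": True, "cytoBand": False}

for db, from_annovar in hosted.items():
    print(shlex.join(downdb_command(db, from_annovar=from_annovar)))
```

The printed lines can be pasted into a shell, or each argument list can be passed to `subprocess.run` from the ANNOVAR directory.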
Databases downloaded so far: refGene, ensGene, cytoBand, avsnp138, exac03, 1000g2015aug, clinvar_20170905, dbnsfp30a, avsnp147
- avsnp138: provides rs IDs
- dbnsfp30a: whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, DANN, fitCons, PhyloP and SiPhy scores from dbNSFP version 3.0a
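Input files
Two input formats are supported:
- VCF file: specify with -vcfinput
- avinput:
  - each line represents one variant
  - the first five columns are, in order: chromosome, start position, end position, the reference nucleotides, the observed nucleotides
  - reference nucleotides: may be set to 0 when unknown
  - insertions, deletions and block substitutions fit the same five columns, with - standing for a null nucleotide: in the reference column for an insertion, in the observed column for a deletion
  - remaining columns: optional; if present, they are copied unchanged to the output files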
[root@localhost example]# more ex1.avinput
1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated
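To make the column layout concrete, here is a small sketch that parses avinput lines and classifies each variant by where the `-` placeholder appears. It is an illustration only, not how ANNOVAR itself reads the file; `Variant`, `parse_avinput_line` and `variant_kind` are names made up for this example, and the sample lines are taken from ex1.avinput above.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    extra: str  # optional trailing columns, kept verbatim


def parse_avinput_line(line):
    """Split one avinput line into its five mandatory columns plus the rest."""
    fields = line.rstrip("\n").split(None, 5)  # split on spaces or tabs
    if len(fields) < 5:
        raise ValueError("avinput needs at least 5 columns: " + repr(line))
    chrom, start, end, ref, alt = fields[:5]
    extra = fields[5] if len(fields) > 5 else ""
    return Variant(chrom, int(start), int(end), ref, alt, extra)


def variant_kind(v):
    """Classify a variant from its reference/observed columns."""
    if v.ref == "-":
        return "insertion"
    if v.alt == "-":
        return "deletion"
    if len(v.ref) == 1 and len(v.alt) == 1:
        return "SNV"
    return "block substitution"


sample = [
    "1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15",
    "1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion",
    "1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion",
    "1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution",
]
for line in sample:
    v = parse_avinput_line(line)
    print(v.chrom, v.start, variant_kind(v))
```

The four sample lines come out as SNV, deletion, insertion and block substitution, matching their comments.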
Annotation
- All in one step: table_annovar.pl can run all three annotation types at once
perl table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt
# -remove: remove all temporary files
# -operation: g, gene-based; gx, gene-based with cross-reference annotation (from the -xref argument); r, region-based; f, filter-based
# -nastring: when no annotation is available, print . instead
# -csvout: write comma-separated output; without it the default tab-separated output is used
# -xref: whether a known genetic disease is caused by defects in this gene (this information was supplied in the example/gene_xref.txt file on the command line); this option can be left out
Where (see the official website for the operation type of each database):
g, gene-based, for databases such as refGene and ensGene
r, region-based, for databases such as cytoBand
f, filter-based, for databases such as exac03, avsnp147 and dbnsfp30a
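Because -protocol and -operation are two parallel comma-separated lists, it is easy to get them out of step. A minimal sketch of one way to keep them paired: `table_annovar_command` is a hypothetical helper written for this note, and it only assembles options that appear in the command above.

```python
import shlex

VALID_OPERATIONS = {"g", "gx", "r", "f"}


def table_annovar_command(avinput, annotations, out="myanno",
                          buildver="hg19", dbdir="humandb/"):
    """Build a table_annovar.pl argument list from (database, operation) pairs."""
    for database, operation in annotations:
        if operation not in VALID_OPERATIONS:
            raise ValueError("unknown operation %r for %s" % (operation, database))
    # Both lists are derived from the same pairs, so they always line up.
    protocol = ",".join(database for database, _ in annotations)
    operation = ",".join(op for _, op in annotations)
    return ["perl", "table_annovar.pl", avinput, dbdir,
            "-buildver", buildver, "-out", out, "-remove",
            "-protocol", protocol, "-operation", operation,
            "-nastring", "."]


cmd = table_annovar_command(
    "example/ex1.avinput",
    [("refGene", "g"), ("cytoBand", "r"), ("exac03", "f")],
)
print(shlex.join(cmd))
```

Keeping each database next to its operation type in one list means a database cannot be added without also stating how it should be used.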
- Running the three annotation types separately: annotate_variation.pl
gene-based
perl annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/
The result files are written to example/: ex1.avinput.variant_function and ex1.avinput.exonic_variant_function
ex1.avinput.variant_function
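Column 1: the functional category of the variant, e.g. intergenic, intronic, UTR5, UTR3 or exonic (the coding consequence of exonic variants, such as nonsynonymous SNV or frameshift deletion, is reported in the exonic_variant_function file)
Column 2: gene symbol; the NM_ accession in parentheses is the refGene transcript ID
Remaining columns: the content of the input file ex1.avinput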
[root@localhost example]# head ex1.avinput.variant_function
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
ex1.avinput.exonic_variant_function
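Column 1: line number of the variant in the input file
Column 2: effect on the coding sequence, e.g. frameshift, nonsynonymous
Column 3: the affected genes or transcripts; the NM_ accession is the refGene transcript ID
Remaining columns: same as the input file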
[root@localhost example]# head ex1.avinput.exonic_variant_function
line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10 nonsynonymous SNV ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
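The third column packs one gene:transcript:exon:cDNA change:protein change entry per affected transcript, separated by commas. A short sketch of how it could be unpacked follows. `parse_transcript_changes` is a hypothetical helper, and it assumes every entry has exactly these five colon-separated parts, which holds for the sample rows above but may not hold for every annotation.

```python
def parse_transcript_changes(column3):
    """Split column 3 of an exonic_variant_function row into one dict per transcript."""
    changes = []
    for entry in column3.split(","):
        if not entry:
            continue  # the column ends with a trailing comma
        parts = entry.split(":")
        if len(parts) != 5:
            continue  # assumption: skip anything not in the five-part form
        gene, transcript, exon, cdna, protein = parts
        changes.append({
            "gene": gene,
            "transcript": transcript,
            "exon": exon,
            "cdna": cdna,
            "protein": protein,
        })
    return changes


column3 = ("NOD2:NM_022162:exon4:c.C2104T:p.R702W,"
           "NOD2:NM_001293557:exon3:c.C2023T:p.R675W,")
for change in parse_transcript_changes(column3):
    print(change["gene"], change["transcript"], change["protein"])
```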
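When processing these files with awk, set the field separator to \t. If it is not set, spaces are also treated as separators, so values that contain spaces (such as "nonsynonymous SNV") are split and the columns shift.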
[root@localhost example]# head ex1.avinput.exonic_variant_function|awk -F '\t' '{print $2}'
nonsynonymous SNV
nonsynonymous SNV
nonsynonymous SNV
nonsynonymous SNV
frameshift insertion
frameshift deletion
frameshift deletion
stoploss
stopgain
frameshift substitution
[root@localhost example]# head ex1.avinput.exonic_variant_function|awk '{print $2}'
nonsynonymous
nonsynonymous
nonsynonymous
nonsynonymous
frameshift
frameshift
frameshift
stoploss
stopgain
frameshift
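The same pitfall exists in any language that splits on whitespace by default. A minimal Python sketch of the difference, assuming the row is tab-separated as in the real output file (the `line` below is rebuilt by hand from the line9 row shown earlier):

```python
# One row of ex1.avinput.exonic_variant_function, rebuilt with explicit tabs.
line = ("line9\tnonsynonymous SNV\t"
        "IL23R:NM_144701:exon9:c.G1142A:p.R381Q,\t"
        "1\t67705958\t67705958\tG\tA")

by_tab = line.split("\t")    # like awk -F '\t'
by_whitespace = line.split() # like awk with the default separator

print(by_tab[1])         # second column, kept whole
print(by_whitespace[1])  # only the first word of the second column
print(len(by_tab), len(by_whitespace))
```

Splitting on tabs yields 8 fields, while splitting on whitespace yields 9, because "nonsynonymous SNV" is broken in two and every later column moves one position to the right.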
region-based
perl annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
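Identifies the cytogenetic band of each variant, e.g. 1p36.33
The result file is written to example/: ex1.avinput.hg19_cytoBand
Column 1: the database name, cytoBand
Column 2: the cytogenetic band, e.g. 1p21.1
Remaining columns: same as the input file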
[root@localhost example]# more ex1.avinput.hg19_cytoBand
cytoBand 1p36.33 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
cytoBand 1p36.33 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
cytoBand 1p36.31 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
cytoBand 1q23.3 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
filter-based
perl annotate_variation.pl -filter -dbtype exac03 -buildver hg19 example/ex1.avinput humandb/
The result files are written to example/: ex1.avinput.hg19_exac03_filtered (variants not reported in exac03) and ex1.avinput.hg19_exac03_dropped (variants reported in exac03, together with their allele frequencies)
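The filtered/dropped split can be pictured as a simple lookup. The sketch below illustrates the idea only and is not ANNOVAR's implementation: `split_by_database` is a hypothetical function, and both the variants and the frequency in `known` are invented toy values.

```python
def split_by_database(variants, known):
    """Partition variants into (filtered, dropped) by exact lookup in a database.

    variants: tuples of (chrom, start, end, ref, alt)
    known:    dict mapping the same tuples to an allele frequency
    """
    filtered, dropped = [], []
    for variant in variants:
        if variant in known:
            # Reported in the database: goes to "dropped" with its frequency.
            dropped.append((known[variant], variant))
        else:
            # Not reported: survives the filter.
            filtered.append(variant)
    return filtered, dropped


# Toy data, invented for illustration only (not real ExAC content).
variants = [
    ("chrT", 100, 100, "A", "G"),
    ("chrT", 200, 201, "TC", "-"),
]
known = {("chrT", 100, 100, "A", "G"): 0.25}

filtered, dropped = split_by_database(variants, known)
print("filtered:", filtered)
print("dropped:", dropped)
```

The naming can look backwards at first: "dropped" holds the variants that were found in the database, while "filtered" holds those that passed through because the database has no record of them.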