2023-07-21 变异位点(snp,indel)检测-GATK,snpEff

软件安装

unzip ~/software/gatk-4.4.0.0.zip -d /opt/biosoft
echo 'PATH=$PATH:/opt/biosoft/gatk-4.4.0.0/' >> ~/.bashrc
source ~/.bashrc
mkdir -p /home/train/07.variants_calling
cd /home/train/07.variants_calling

# 准备输入的BAM文件和参考基因组文件
ln -s ~/06.reads_aligment/bowtie2/*.sam ./
samtools sort -@ 4 -O BAM -o V1.bam V1.sam
samtools sort -@ 4 -O BAM -o V2.bam V2.sam #准备bam文件
java -jar /opt/biosoft/picard-tools/picard.jar MarkDuplicates I=V1.bam O=V1.MD.bam M=V1.metrics #上机之前有由于PCR扩增导致序列重复,故进行标记重复操作
java -jar /opt/biosoft/picard-tools/picard.jar MarkDuplicates I=V2.bam O=V2.MD.bam M=V2.metrics
samtools index V1.MD.bam
samtools index V2.MD.bam

ln -s ~/00.incipient_data/data_for_genome_assembling/assemblies_of_Malassezia_sympodialis/Malassezia_sympodialis.genome_V01.fasta genome.fasta
samtools faidx genome.fasta #准备基因组文件
java -jar /opt/biosoft/picard-tools/picard.jar CreateSequenceDictionary R=genome.fasta O=genome.dict

# 使用 GATK HaplotypeCaller 进行 SNP/InDel calling
mkdir -p /home/train/07.variants_calling/GATK
cd /home/train/07.variants_calling/GATK
# 准备 bam 文件和基因组文件
ln -s ../*bam* .
ln -s ../genome.* .
[train@MiWiFi-R3P-srv GATK]$ ln -s ../*bam* .
ln -s ../genome.* .
[train@MiWiFi-R3P-srv GATK]$ ls
genome.dict  genome.fasta  genome.fasta.fai  V1.bam  V1.MD.bam  V1.MD.bam.bai  V2.bam  V2.MD.bam  V2.MD.bam.bai

# 运行HaplotypeCaller分别对单个样品进行variants分析
gatk HaplotypeCaller -R genome.fasta -I V1.MD.bam -ERC GVCF -O V1.g.vcf --pcr-indel-model CONSERVATIVE --sample-ploidy 2 --min-base-quality-score 10 --kmer-size 10 --kmer-size 25
# real  2m13.747s
# user  4m26.141s
# sys   0m1.111s
#遇到报错
[train@MiWiFi-R3P-srv GATK]$ gatk HaplotypeCaller -R genome.fasta -I V1.MD.bam -ERC GVCF -O V1.g.vcf --pcr-indel-model CONSERVATIVE --sample-ploidy 2 --min-base-quality-score 10 --kmer-size 10 --kmer-size 25
Using GATK jar /opt/biosoft/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/biosoft/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar HaplotypeCaller -R genome.fasta -I V1.MD.bam -ERC GVCF -O V1.g.vcf --pcr-indel-model CONSERVATIVE --sample-ploidy 2 --min-base-quality-score 10 --kmer-size 10 --kmer-size 25
Error: LinkageError occurred while loading main class org.broadinstitute.hellbender.Main
    java.lang.UnsupportedClassVersionError: org/broadinstitute/hellbender/Main has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0
#解决方法
[train@MiWiFi-R3P-srv GATK]$ which java
/usr/bin/java
[train@MiWiFi-R3P-srv GATK]$ ls /opt/sysoft/j
jdk-20.0.1/   jre1.7.0_05/  jre1.8.0_371/ 
[train@MiWiFi-R3P-srv GATK]$ ls /opt/sysoft/jdk-20.0.1/bin/
jar        java   javadoc  jcmd      jdb        jdeps  jhsdb   jinfo  jmap  jpackage  jrunscript  jstack  jstatd      keytool      serialver
jarsigner  javac  javap    jconsole  jdeprscan  jfr    jimage  jlink  jmod  jps       jshell      jstat   jwebserver  rmiregistry
[train@MiWiFi-R3P-srv GATK]$ echo 'PATH=/opt/sysoft/jdk-20.0.1/bin/:$PATH' >> ~/.bashrc
[train@MiWiFi-R3P-srv GATK]$ source ~/.bashrc
[train@MiWiFi-R3P-srv GATK]$ which java
/opt/sysoft/jdk-20.0.1/bin/java
gatk HaplotypeCaller -R genome.fasta -I V2.MD.bam -ERC GVCF -O V2.g.vcf --pcr-indel-model CONSERVATIVE --sample-ploidy 2 --min-base-quality-score 10 --kmer-size 10 --kmer-size 25
# real  2m26.303s
# user  5m44.855s
# sys   0m1.106s
#多样品计算
[train@MiWiFi-R3P-srv ~]$ for i in `ls *.MD.bam`
> do
>     i=${i/.MD.bam/}
>     echo "gatk HaplotypeCaller -R genome.fasta -I $i.MD.bam -ERC GVCF -O $i.g.vcf --pcr-indel-model CONSERVATIVE --sample-ploidy 2 --min-base-quality-score 10 --kmer-size 10 --kmer-size 2"
> done > command.gatk_HaplotupeCaller.list
[train@MiWiFi-R3P-srv ~]$ ParaFly -c command.gatk_HaplotupeCaller.list -CPU 20

#可以使用samtools对变异位点进行观察
samtools tview V1.MD.bam genome.fasta

# 运行CombineGVCFs将多个GVCF文件进行整合
gatk CombineGVCFs -R genome.fasta -O combined.g.vcf -V V1.g.vcf -V V2.g.vcf
# real  0m16.789s
# user  0m33.000s
# sys   0m0.811s
 #可以批量进行
[train@MiWiFi-R3P-srv GATK]$ ls V?.g.vcf | perl -pe 's/^/-V /' | perl -pe 's/\n/ /'
-V V1.g.vcf -V V2.g.vcf 
[train@MiWiFi-R3P-srv GATK]$ gatk CombineGVCFs -R genome.fasta -O combined.g.vcf `ls V?.g.vcf | perl -pe 's/^/-V /' | perl -pe 's/\n/ /'`
Using GATK jar /opt/biosoft/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/biosoft/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar CombineGVCFs -R genome.fasta -O combined.g.vcf -V V1.g.vcf -V V2.g.vcf
09:24:04.472 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/biosoft/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:24:04.511 INFO  CombineGVCFs - ------------------------------------------------------------
09:24:04.514 INFO  CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.4.0.0
09:24:04.514 INFO  CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
09:24:04.514 INFO  CombineGVCFs - Executing as train@MiWiFi-R3P-srv on Linux v5.14.0-284.11.1.el9_2.x86_64 amd64
09:24:04.515 INFO  CombineGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v20.0.1+9-29
09:24:04.515 INFO  CombineGVCFs - Start Date/Time: July 21, 2023, 9:24:04 AM CST
09:24:04.515 INFO  CombineGVCFs - ------------------------------------------------------------
09:24:04.515 INFO  CombineGVCFs - ------------------------------------------------------------
09:24:04.516 INFO  CombineGVCFs - HTSJDK Version: 3.0.5
09:24:04.516 INFO  CombineGVCFs - Picard Version: 3.0.0
09:24:04.516 INFO  CombineGVCFs - Built for Spark Version: 3.3.1
09:24:04.516 INFO  CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:24:04.516 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:24:04.516 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:24:04.517 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:24:04.517 INFO  CombineGVCFs - Deflater: IntelDeflater
09:24:04.517 INFO  CombineGVCFs - Inflater: IntelInflater
09:24:04.517 INFO  CombineGVCFs - GCS max retries/reopens: 20
09:24:04.517 INFO  CombineGVCFs - Requester pays: disabled
09:24:04.517 INFO  CombineGVCFs - Initializing engine
09:24:04.614 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/train/07.variants_calling/GATK/V1.g.vcf
09:24:04.636 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/train/07.variants_calling/GATK/V2.g.vcf
09:24:04.713 INFO  CombineGVCFs - Done initializing engine
09:24:04.731 INFO  ProgressMeter - Starting traversal
09:24:04.732 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
09:24:04.804 WARN  ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location MS01Contig01:1208 the annotation MLEAC=[1, 0] was not a numerical value and was ignored
09:24:14.734 INFO  ProgressMeter -  MS01Contig04:576896              0.2               1056000        6336000.0
09:24:21.235 INFO  ProgressMeter -   MS01Contig11:54282              0.3               1884473        6851383.4
09:24:21.235 INFO  ProgressMeter - Traversal complete. Processed 1884473 total variants in 0.3 minutes.
09:24:21.251 INFO  CombineGVCFs - Shutting down engine
[July 21, 2023, 9:24:21 AM CST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 0.28 minutes.
Runtime.totalMemory()=220200960



# 运行GenotypeGVCFs鉴定joint-called variants
gatk GenotypeGVCFs -R genome.fasta -O variants.raw.vcf -V combined.g.vcf --sample-ploidy 2
# real  0m14.088s
# user  0m25.331s
# sys   0m0.668s

# 运行VariantFiltration对variants结果进行hard filtering。理论上,更好的filtering方法是根据已有的准确的variants位点,通过机器学习的方法来进行variant quality score recalibration (VQSR)。该方法使用的命令是VariantRecalibrator。适用与大样本数,大数据,不适合小样本,简化基因组测序和转录组测序等。
gatk VariantFiltration -R genome.fasta -O variants.filter.vcf -V variants.raw.vcf --filter-name FilterQual --filter-expression "QUAL < 30.0" --filter-name FilterQD --filter-expression "QD < 13.0" --filter-name FilterMQ --filter-expression "MQ < 20.0" --filter-name FilterFS --filter-expression "FS > 20.0" --filter-name FilterMQRankSum --filter-expression "MQRankSum < -3.0" --filter-name FilterReadPosRankSum --filter-expression "ReadPosRankSum < -3.0" --filter-name FilterBaseQRankSum --filter-expression "BaseQRankSum < -3.0"
#--filter-expression "QUAL < 30.0" #mapping质量直小于30
#--filter-expression "QD < 13.0" #过滤基因组重复序列区域
#--filter-expression "FS > 20.0" #剔除链的偏好性位点
#--filter-expression "ReadPosRankSum < -3.0" #偏离平均值过多的位点,不符合正态分布的位点
--filter-name FilterMQRankSum #筛选掉与参考基因组不一致的位点
#-filter-name FilterBaseQRankSum #筛选掉与参考
# real  0m3.596s
# user  0m5.314s
# sys   0m0.233s
[train@MiWiFi-R3P-srv GATK]$ grep -v "#" variants.filter.vcf |wc -l
7632
杂合率=7632/基因组大小
获得snp位点和indel的vcf
[train@MiWiFi-R3P-srv GATK]$ gatk SelectVariants -V variants.raw.vcf -select-type SNP -O snp.vcf
[train@MiWiFi-R3P-srv GATK]$ gatk SelectVariants -V variants.raw.vcf -select-type INDEL -O indel.vcf




grep -v -P "\tFilter" variants.filter.vcf > variants.vcf

根据INDEL信息批量设计引物

mkdir -p /home/train/07.variants_calling/INDEL_primer_design
cd /home/train/07.variants_calling/INDEL_primer_design
ln -s ../genome.* .
gatk SelectVariants -R genome.fasta -V ../GATK/variants.vcf -O INDELs.vcf --select-type-to-include INDEL 
ln -s ~/05.genome_feature_analysis/SSR_detecting_and_primer_design/p3_settings_file .
VCF_InDel_primer3.pl INDELs.vcf genome.fasta --CPU 4 --p3_setting_file p3_settings_file --gff3_out VCF_InDel_primer3.gff3 > VCF_InDel_primer3.out
# real  3m0.606s
# user  21m44.174s
# sys   0m0.546s

SnpEff注释
软件安装

# installing snpEff (http://snpeff.sourceforge.net/)
#wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip -P ~/software/
unzip ~/software/snpEff_latest_core.zip -d /opt/biosoft/
gff3ToGtf.pl程序在下面这个软件里
[train@MiWiFi-R3P-srv SnpEff]$ tar zxf ~/software/geta-2.6.1.tar.gz -C /opt/biosoft/
[train@MiWiFi-R3P-srv SnpEff]$ echo 'PATH=$PATH:/opt/biosoft/geta-2.6.1/bin' >> ~/.bashrc
[train@MiWiFi-R3P-srv SnpEff]$ source ~/.bashrc

mkdir -p /home/train/07.variants_calling/SnpEff
cd /home/train/07.variants_calling/SnpEff
#构建SnpEff数据库
mkdir -p /opt/biosoft/snpEff/data/malassezia_sympodialis/
cp ~/00.incipient_data/data_for_genome_assembling/assemblies_of_Malassezia_sympodialis/Malassezia_sympodialis.genome_V01.fasta /opt/biosoft/snpEff/data/malassezia_sympodialis/sequences.fa #准备基因组文件
gff3_remove_UTR.pl /home/train/00.incipient_data/data_for_gene_prediction_and_RNA-seq/Malassezia_sympodialis_V01.bestGeneModels.gff3 > Malassezia_sympodialis.gff3 #准备gff3文件,去除utr区域,大型动植物基因组也可以不用去除
gff3ToGtf.pl ~/00.incipient_data/data_for_genome_assembling/assemblies_of_Malassezia_sympodialis/Malassezia_sympodialis.genome_V01.fasta Malassezia_sympodialis.gff3 > /opt/biosoft/snpEff/data/malassezia_sympodialis/genes.gtf
perl -p -i -e 's/^\s*$//' /opt/biosoft/snpEff/data/malassezia_sympodialis/genes.gtf #标准的gtf没有空行,故将其去除
cp /home/train/00.incipient_data/data_for_gene_prediction_and_RNA-seq/Malassezia_sympodialis_V01.CDS.fasta /opt/biosoft/snpEff/data/malassezia_sympodialis/cds.fa
cp /home/train/00.incipient_data/data_for_gene_prediction_and_RNA-seq/Malassezia_sympodialis_V01.protein.fasta /opt/biosoft/snpEff/data/malassezia_sympodialis/protein.fa

echo "malassezia_sympodialis.genome : malassezia sympodialis" >> /opt/biosoft/snpEff/snpEff.config
java -jar /opt/biosoft/snpEff/snpEff.jar build -c /opt/biosoft/snpEff/snpEff.config -gtf22 -v malassezia_sympodialis

#运行SnpEff注释程序
java -Xmx2G -jar /opt/biosoft/snpEff/snpEff.jar eff -csvStats variants.SnpEff.csv -s variants.SnpEff.html -c /opt/biosoft/snpEff/snpEff.config -v -ud 500 malassezia_sympodialis ~/00.incipient_data/data_for_variants_calling/variants.vcf > variant.SnpEff.vcf
# real  0m6.132s
# user  0m11.224s
# sys   0m0.369s
[train@MiWiFi-R3P-srv SnpEff]$ ls
Malassezia_sympodialis.genome_V01.fasta  Malassezia_sympodialis.gff3  variant.SnpEff.vcf  variants.SnpEff.csv  variants.SnpEff.genes.txt  variants.SnpEff.html
[train@MiWiFi-R3P-srv SnpEff]$ firefox variants.SnpEff.html 

如何去除indel位点?gatk
如何统计某个变异位点频率?https://blog.csdn.net/upyours00/article/details/106910961
筛选连锁不平衡位点?plink

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 193,968评论 5 459
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 81,682评论 2 371
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 141,254评论 0 319
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 52,074评论 1 263
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 60,964评论 4 355
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 46,055评论 1 272
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 36,484评论 3 381
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 35,170评论 0 253
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 39,433评论 1 290
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 34,512评论 2 308
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 36,296评论 1 325
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,184评论 3 312
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 37,545评论 3 298
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 28,880评论 0 17
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,150评论 1 250
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 41,437评论 2 341
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 40,630评论 2 335

推荐阅读更多精彩内容