2021/03/12
Short Read Mapping for Exome Sequencing
software:
1. FastQC (v0.11.9)
was used to assess the read quality
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2. BWA (v0.7.17)
was used for read mapping
http://bio-bwa.sourceforge.net/
git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
3. picard (v2.25.0)
was used to manipulate alignment files and generate quality control statistics
http://picard.sourceforge.net/
# 检查Java版本 Java 1.8
java -version
# 测试安装
## 测试是否可以运行Picard工具,在终端应用程序中运行以下命令,并提供picard.jar文件的完整路径
java -jar /path/to/picard.jar -h
## 或设置为快捷方式的环境变量(此处使用$PICARD)
java -jar $PICARD -h
# 调用方式
java jvm-args -jar picard.jar PicardToolName OPTION1=value1 OPTION2=value2...
4. verifyBamID (Version 1.1.3)
was used to test sample contamination
http://genome.sph.umich.edu/wiki/VerifyBamID
仅在Ubuntu(9.10-13.10)x86和x64平台上进行
tar xvf verifyBamID-#.#.#.tar.gz
cd verifyBamID-1.1.0
make cloneLib (if ../libStatGen does not exist)
make
Executable: ./bin/verifyBamID
任何遗传分析的关键步骤是验证所生成的数据是否符合预期。verifyBamID检查BAM文件中的读数是否与特定样品的先前基因型匹配。此外,它仅从总体等位基因频率中检测可能的样品混合物,这在无法获得基因型数据时特别有用。
使用将观察到的序列读数与假设的真实基因型相关联的数学模型,verifyBamID会尝试确定序列读数是与特定个体匹配还是更可能被污染(包括一小部分的外源DNA)(源自密切相关的个体),或衍生自完全不同的个人。
# Here is a typical command line:
verifyBamID --vcf [input.vcf] --bam [input.bam] --out [output.prefix] --verbose --ignoreRG
where
[input.bam] is a BAM (Binary Alignment Map) file of a sequence reads
[input.vcf] is input VCF file containing individual genotypes or AF or AC/AN fields in the INFO field. gzipped VCF is also allowed.
[outPrefix] is output prefix of output files - [outPrefix].{selfRG,selfSM,bestRG,bestSM,depthRG,depthSM} will be created.
详细文档
5.GATK(4.2.0.0)
was used to post-process the alignment files
http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page
工具文档索引
6. SAMtools (v 1.11)
http://samtools.sourceforge.net/
git clone git://github.com/samtools/samtools.git
7. BEDtools (v2.30.0)
http://code.google.com/p/bedtools/
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make
# 报告两个BED文件中功能之间的碱基对重叠
bedtools intersect -a reads.bed -b genes.bed
# 报告A中与B中NO条目重叠的条目
bedtools intersect -a reads.bed -b genes.bed -v
8. VCFtools (v0.1.16)
https://vcftools.github.io/downloads.html
# 下载
git clone https://github.com/vcftools/vcftools.git
# 更新
git pull https://github.com/vcftools/vcftools.git master
data files
1、最新人类基因组参考序列
(1)在人类遗传学研究案例中,建议使用1000个基因组项目所采用的参考序列。
ftp://ftp.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
(2)NCBI
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
2、由1000个基因组项目测序的样本NA20322的外显子组测序数据示例:
https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA20322/
(Illumina GAII)过滤后数据:
SRR070556_1.filt.fastq.gz,SRR070556_2.filt.fastq.gz,
SRR070839_1.filt.fastq.gz, SRR070839_2.filt.fastq.gz.
3、Agilent SureSelect_All_Exon_V2
(https://earray.chem.agilent.com/earray/,使用ELID:S0293689)
下载 ANNOTATION (targets) and BED (baits) files
4、下载基因组分析工具包(GATK)的资源包
method
照搬[原文], 待尝试
Short Read Mapping for Exome Sequencing
DOI:10.1007/978-1-62703-514-9_6
日期:2013-01-01
1、准备输入文件
(1)genome(reference genome)、gatk、exnome(annotation、bed)、fastq(NA20322)、jar(picard 和 GATK packages 的java文件)
(2)使用bedtools 的 maskFastaFromBed掩盖参考序列Y染色体上的假常染色体区域。
guzip genome/human_g1k_v37.fasta.gz
cat <<EOF | awk ’BEGIN{OFS¼"\t"}{print $1,$2,$3,$4}’ >
/tmp/mask.bed
Y 10001 2649520 PAR1
Y 59034050 59363566 PAR2
EOF
maskFastaFromBed -fi genome/human_g1k_v37.fasta
-bed /tmp/mask.bed \
-fo genome/b37.fasta
(3)运行BWA对蒙面基因组参考进行索引
bwa index -p genome/b37 -a bwtsw genome/b37.fasta
发现在基因组文件夹中创建了以下8个文件:
b37.amb, b37.ann, b37.bwt, b37.pac, b37.rbwt, b37.rpac, b37.rsa, b37.sa.
(4)为参考创建FASTA文件索引和字典
samtools faidx genome/b37.fasta
java -Xmx2g -jarjar/CreateSequenceDictionary.jar \
REFERENCE=genome/b37.fasta OUTPUT=genome/b37.dict GENOME _ASSEMBLY=b37
结果会有两个文件:b37.fasta.fai 、b37.fasta.dict,被用于一些 picard 程序
(5)运行快速QC评估读取质量,确定文件类型和基本质量分数编码
mkdir fastqc
fastqc --outdir fastqc/ fastq/*.gz
(6)把 target and bait definitions 转换为 picard interval list format
awk ’BEGIN{OFS="\t"}NR>1{gsub("chr","",$1);
print $0}’ \
exome/ANNOTATION/SureSelect_All_Exon_V2_-
with_annotation.hg19.bed \
> exome/targets.bed
cat genome/b37.dict > exome/targets.interval_list
awk ’BEGIN{OFS="\t"}NR>1{print $1,$2+1,$3,"+",$4}’ \
exome/targets.bed >> exome/targets.interval_list
cat genome/b37.dict > exome/baits.interval_list
awk ’BEGIN{OFS="\t"}{gsub("chr","",$1); print
$1,$2+1,$3,$6,$4}’ \
exome/BED/029368_D_BED_20111101.bed \
>> exome/baits.interval_list
2、mapping reads to reference
(1)
for RUN in SRR070556 SRR070839; do
mkdir -p mapping/$RUN
bwa aln -t 6 -q 10 -f mapping/$RUN/${RUN}_1.sai
genome/b37 \
fastq/$_1.filt.fastq.gz
bwa aln -t 6 -q 10 -f mapping/$RUN/${RUN}_2.sai
genome/b37 \
fastq/$_2.filt.fastq.gz
done
(2)创建比对sam
for RUN in SRR070556 SRR070839; do
bwa sampe -f mapping/${RUN}/$RUN.sam \-r
"@RG\tID:$RUN\tSM:NA20322\tLB:NA20322\tPL:
ILLUMINA" genome/b37 \
mapping/${RUN}/${RUN}_1.sai
mapping/${RUN}/${RUN}_2.sai \
fastq/${RUN}_1.filt.fastq.gz
fastq/${RUN}_2.filt.fastq.gz
done
(3)修复SAM输出中的mate信息,并转换为二进制格式
for RUN in SRR070556 SRR070839; do
java -XX:ParallelGCThreads¼2 -Xmx2g -jar
jar/FixMateInformation.jar \
INPUT¼mapping/$RUN/$RUN.sam
OUTPUT¼mapping/$RUN/$RUN.fixed.bam \
TMP_DIR¼mapping/$RUN/tmp
VALIDATION_STRINGENCY¼SILENT
done
(4)重新排序contigs,contigs内部按其映射的坐标对排列的读取进行排序
for RUN in SRR070556 SRR070839; do
java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/
ReorderSam.jar \
INPUT=mapping/$RUN/$RUN.fixed.bam \
OUTPUT=mapping/$RUN/$RUN.reordered.bam \
REFERENCE=genome/b37.fasta
TMP_DIR=mapping/$RUN/tmp \
VALIDATION_STRINGENCY=SILENT
java -XX:ParallelGCThreads=2 -Xmx2g -jar
jar/SortSam.jar \
INPUT=mapping/$RUN/$RUN.reordered.bam \
OUTPUT=mapping/$RUN.sorted.bam
TMP_DIR=mapping/$RUN/tmp \
SORT_ORDER=coordinate CREATE_INDEX=true \
VALIDATION_STRINGENCY=SILENT
Done
此步骤还将为每个已排序的BAM文件创建索引文件(.bai)
(5)将两个lane级别的对齐合并到样本级别
java -XX:ParallelGCThreads=2 -Xmx2g -jar
jar/MergeSamFiles.jar \
INPUT=mapping/SRR070556.sorted.bam
INPUT=mapping/SRR070 839.sorted.bam \
OUTPUT=mapping/NA20322.merged.bam CREATE_INDEX=true \
SORT_ORDER=coordinate USE_THREADING=true \
TMP_DIR=mapping/tmp VALIDATION_STRINGENCY=SILENT
对NA20322.merge.bam进行排序,并为样本索引BAM文件。
3、在aligned reads上的后处理
(1)标记从对齐读取中检测到的重复片段
mkdir sam_qc
java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/
MarkDuplicates.jar \
INPUT=mapping/NA20322.merged.bam
OUTPUT=mapping/NA20322.dupmarked.bam \
CREATE_INDEX=true
METRICS_FILE=sam_qc/NA20322.duplication_
metrics \
TMP_DIR=mapping/tmp VALIDATION_STRINGENCY=SILENT
此步骤还在sam_qc目录中生成重复指标。
(2)GATK重比对indel
java -XX:ParallelGCThreads=2 -Xmx4g -jar jar/
GenomeAnalysisTK.jar \
-l INFO -et STDOUT -T RealignerTargetCreator \
-R gatk/b37.fasta -L exome/targets.interval_
list \
-I mapping/NA20322.dupmarked.bam \
--known gatk/1000G_phase1.indels.b37.vcf \
--known
gatk/Mills_and_1000G_gold_standard.indels.b37.
vcf \
-o mapping/NA20322.realign.interval_list
java -XX:ParallelGCThreads=2 -Xmx4g -jar jar/
GenomeAnalysisTK.jar \
-l INFO -et STDOUT -T IndelRealigner \
-R gatk/b37.fasta -I mapping/NA20322.dupmarked.
bam \
-o mapping/NA20322.realigned.bam \
-targetIntervals
mapping/NA20322.realign.interval_list \
--knownAlleles gatk/1000G_phase1.indels.b37.
vcf \
--knownAlleles gatk/Mills_and_1000G_gold_
standard.indels.b37.vcf \
--consensusDeterminationModel USE_READS
(3)重新校准基础质量分数
java -Xmx4g -XX:ParallelGCThreads=2 -jar jar/
GenomeAnalysisTK.jar \
-l INFO -et STDOUT -T CountCovariates -nt 4 \
-R gatk/b37/b37.fasta -knownSites
gatk/b37/dbsnp_135.b37.vcf \
-I mapping/NA20322.realigned.bam --default_
platform Illumina \
-cov ReadGroupCovariate -cov QualityScoreCovariate \
-cov CycleCovariate -cov DinucCovariate \
-recalFile mapping/NA20322.pre_recal.csv
java -Xmx4g -XX:ParallelGCThreads=2 -jar jar/
GenomeAnalysisTK.jar \
-l INFO -et STDOUT -T TableRecalibration \
-R gatk/b37/b37.fasta --default_platform Illumina \
-I mapping/NA20322.realigned.bam -o mapping/
NA20322.recal.bam \
-recalFile mapping/NA20322.pre_recal.csv
由此产生的NA20322.recal.bam文件现在适合于变体调用。 在进行下游分析之前,我们首先需要确保数据质量。
4、Quality Control on Alignment
(1)从已处理的对准文件中收集各种质量控制统计信息。
java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/
CollectMultipleMetrics.jar \
INPUT=mapping/NA20322.recal.bam
OUTPUT=sam_qc/NA20322 \
REFERENCE_SEQUENCE=exome/b37.fasta \
PROGRAM=CollectAlignmentSummaryMetrics \
PROGRAM=QualityScoreDistribution \
PROGRAM=MeanQualityByCycle \
PROGRAM=PROGRAM=CollectInsertSizeMetrics \
VALIDATION_STRINGENCY=SILENT
(2)评估目标区域上的覆盖深度
java -XX:ParallelGCThreads=2 -Xmx2g -jar jar/
CalculateHsMetrics.jar \
INPUT=mapping/NA20322.recal.bam \
OUTPUT=sam_qc/NA20322.hybrid_selection_-
metrics \
BAIT_INTERVALS=exome/baits.interval_list \
TARGET_INTERVALS=exome/targets.interval_list \
PER_TARGET_COVERAGE=sam_qc/NA20322.per_target_coverage \
REFERENCE_SEQUENCE=exome/b37/b37.fasta VALIDATION_STRINGENCY=SILENT
(3)测试测序样本是否符合他/她自己的身份
样本信息可以从项目FTP下载,样本NA20322属于ASW群体,因此我们首先在外显子试剂盒的常染色体靶点内提取ASW群体的SNP基因型。
awk ’$2=="ASW"{print $1}’ k1g/phase1_integrated_calls.20101123.ALL.panel \
> k1g/ASW.keep
awk ’$1 ~ /^[0-9]+$/’ exome/targets.bed > exome/
targets.auto.bed
vcftools --vcf gatk/1000G_omni2.5.b37.vcf --keep
k1g/ASW.keep \
--remove-filtered-all --bed exome/targets.
auto.bed \
--recode --out k1g/ASW.exome
(4)运行verifyBamID以检测样本交换或污染
verifyBamID --vcf k1g/ASW.exome.recode.vcf \
--bam mapping/NA20322.recal.bam --out sam_qc/
idcheck --verbose \
--maxDepth 2000 --precise --minQ 20 --maxQ 100 --
minMapQ 17
CHIPMIX和FREEMIX的统计数据都小于0.001,表明样品与身份相符,没有其他样品污染的证据。