0. Preface
The scripts here lean heavily on ChatGPT and copy-paste.
Every step has actually been run, but since this was written while experimenting, some errors are inevitable.
I have not bothered polishing the formatting and other details.
This post got locked again.
1. File processing
In the script provided with this paper (https://doi.org/10.1186/s12711-020-00599-7), quality control is run separately for each population. As to whether you should QC first and then split by population, or split first and then QC, I personally lean toward splitting first, since an SCI paper has addressed the question directly. That said, normal papers rarely go into this level of detail; as long as the results match expectations, either is fine.
1. Notes on filtering
Note that although the author of the paper mentioned in my previous post advised against MAF and LD filtering, most of the literature I have seen only mentions skipping LD pruning (https://doi.org/10.1186/s12711-022-00695-w); papers that skip the MAF filter are rarer (https://doi.org/10.1186/s12711-020-00599-7), though that may simply be because not filtering on MAF goes unmentioned. HWE is almost never discussed, but the script above explicitly skips it.
2. Extract files per population
Create a pop file containing one population name per line, plus a matching individual-ID file for each population (e.g. NEM.id). This step takes quite a while.
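As a sketch of that file prep: assuming a hypothetical two-column table samples.txt (tab-separated, sample ID then population name; the file name and layout are my assumption, not from the original), pop and the per-population .id files can be generated like this:

```shell
# Hypothetical input: samples.txt with "sampleID<TAB>population" per line
printf 'S1\tNEM\nS2\tNEM\nS3\tTAN\n' > samples.txt
# One unique population name per line -> pop
cut -f2 samples.txt | sort -u > pop
# One ID file per population (NEM.id, TAN.id, ...); appends, so clear old .id files before re-running
awk -F'\t' '{f=$2".id"; print $1 >> f}' samples.txt
```

A quick `wc -l *.id` afterwards should match the per-population sample counts.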
for i in `cat pop`
do
mkdir ${i}
cd ${i}
bcftools view -S ../$i.id -Ov ../../all.beforeQC.vcf > $i.NC.vcf # remember to use an absolute path for the VCF; not sure why NC-style (RefSeq) chromosome names are needed here
perl /public/home/ywf/universal/NC/replace.pl $i.NC.vcf /public/home/ywf/universal/NC/NC-chr-sheep.txt $i.vcf # convert chromosome names
gzip $i.NC.vcf
cd ../
done
3. Convert to PLINK format and run QC
for i in `cat pop`
do
cd ${i}
plink --allow-extra-chr --chr-set 26 --vcf $i.vcf --make-bed --double-id --out beforeQC$i
plink --allow-extra-chr --chr-set 26 --bfile beforeQC$i --geno 0.05 --make-bed --out $i # --mind should be applied too in theory, but at this stage it should not remove any individuals anyway
plink --allow-extra-chr --chr-set 26 --bfile $i --recode vcf-iid --out QC # convert to VCF; only needed for the Lauto calculation later, skip if not needed
plink --allow-extra-chr --chr-set 26 --bfile $i --recode --out $i --noweb # convert to map/ped; only needed for the Lauto calculation later, skip if not needed
# only one of the VCF and map conversions is needed
cd ../
done
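After the QC loop, a quick sanity check on how many SNPs and individuals survive per population can save surprises later. A minimal sketch (the demo layout below is made up; in a real run the directories and files already exist, so only the for loop is needed):

```shell
# Made-up minimal layout: one directory per population holding $i.bim / $i.fam
mkdir -p NEM
printf '1 s1 0 100 A G\n1 s2 0 200 A G\n' > NEM/NEM.bim
printf 'f1 i1 0 0 0 -9\n' > NEM/NEM.fam
echo NEM > pop
# Count post-QC SNPs (bim rows) and individuals (fam rows) per population
for i in $(cat pop); do
  echo "$i: $(wc -l < $i/$i.bim) SNPs, $(wc -l < $i/$i.fam) individuals"
done > qc.summary.txt
cat qc.summary.txt
```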
2. Calculating ROH
2.1. The basic run
!!! Without the "rm" this eats a lot of disk space; make sure you have enough free space, or do one run first to see what gets deleted.
for i in `cat pop`
do
cd $i
plink --allow-extra-chr --chr-set 26 --bfile $i \
--homozyg-density 10 \
--homozyg-gap 100 \
--homozyg-kb 100 \
--homozyg-snp 10 \
--homozyg-window-het 1 \
--homozyg-window-missing 5 \
--homozyg-window-snp 50 \
--homozyg-window-threshold 0.05 \
--homozyg group \
--homozyg-verbose \
--out $i
rm *.verbose
cd ../
done
For the actual parameter values, the best advice is to look up previous papers on the same species. The meaning of each parameter, and how some of them are calculated, is covered here: https://www.jianshu.com/p/4704282b0c69
A few papers seen so far: 《Runs of homozygosity in Swiss goats reveal genetic changes associated with domestication and modern selection》《Whole genome sequencing revealed genetic diversity, population structure, and selective signature of Panou Tibetan sheep》《Genomics Confirm an Alarming Status of the Genetic Diversity of Belgian Red and Belgian White Red Cattle》
2.2. Going further: putting the parameter calculation inside the loop
This script is from ChatGPT.
for i in `cat pop`
do
cd $i
cat /public/home/ywf/23-0314Northeast-merino/0526/0621examine/diversity/hohe/$i.hwe|awk '{sum+=$7} END {print sum/NR}' >$i.HOAverage.txt # the HO calculation itself is not covered here; just make sure every population's .hwe file sits in one directory with sensible names
read HOAverage < $i.HOAverage.txt
a=$HOAverage
b=$(wc -l < $i.bim) # number of SNPs after QC
c=$(wc -l < $i.fam) # number of individuals in the population
# compute --homozyg-window-snp
l=$(bc -l <<< "(l(0.05/($b*$c))/l(1-$a))")
if [ -z "$l" ]; then # bc never prints "NA"; an empty result means the formula failed, so fall back to the default
l=50
else
l=$(echo "scale=0; $l / 1" | bc) # truncate to an integer
fi
# window-snp done
# compute the window threshold
d=$(echo "scale=3; 3/$l" | bc ) # the 3 here is $Nout+1; Nout is a general variable (default 2 in that script): "The number of outer SNPs you don't want to allow for ROH-analysis (calculation of windowthreshold)"
#计算ROH
plink --allow-extra-chr --chr-set 26 --bfile $i \
--homozyg-density 10 \
--homozyg-gap 100 \
--homozyg-kb 100 \
--homozyg-snp 10 \
--homozyg-window-het 1 \
--homozyg-window-missing 5 \
--homozyg-window-snp $l \
--homozyg-window-threshold $d \
--homozyg group \
--homozyg-verbose \
--out $i
rm *.verbose
unset a b c d
cd ../
done
3. Calculating Froh
3.1. Calculating Lauto
There are two ways to compute Lauto: the total length of the reference genome, or the sum over chromosomes of the distance from the first SNP to the last SNP. Either works and does not affect comparisons within your own paper, but if you want to compare against other papers, be sure to use the same definition (most SCI papers I have read use the second).
3.1.1. Summing per-chromosome lengths, using this R script
3.1.1.1. A single population
The original script names this a map file, but it is actually a bim file.
#Load map file, determine amount of chromosomes (no sex chromosomes!), and length of genome (in kb) covered by SNPs (sum of distance between first and last SNP per chromosome)
#load bim file: determine number of chromosomes
map<-read.table("test.beagle.map",sep="",head=F)
colnames(map)<-c("CHR","SNP_name","POS","BP","A1","A2")
map_chr<-max(map$CHR)
#Calculate length in bp per chromosome and length of total genome covered (in kb)
genome_length=0
#for loop per chromosome
for (ks in 1:map_chr){
#map file of chromosome i
tmp_map<-map[map$CHR==ks,]
#determine genome length of this chromosome and add to total genome length
genome_length<-genome_length+((max(tmp_map$BP)-min(tmp_map$BP))/1000) #divide by thousand to get kb
}
genome_length
Note that the result is in kb.
3.1.1.2. The loop version
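The same per-chromosome sum can also be done straight from the .bim with a single awk pass, which avoids loading the file into R. A sketch (the demo .bim below is made up; point awk at your real file instead):

```shell
# Made-up .bim: CHR SNP cM BP A1 A2
printf '1 s1 0 100000 A G\n1 s2 0 900000 A G\n2 s3 0 50000 A G\n2 s4 0 250000 A G\n' > demo.bim
# Sum over chromosomes of (last BP - first BP), reported in kb
len=$(awk '{if(!($1 in min)||$4<min[$1])min[$1]=$4; if($4>max[$1])max[$1]=$4}
           END{s=0; for(chr in min)s+=(max[chr]-min[chr])/1000; print s}' demo.bim)
echo "$len kb"
```

Here (900000-100000)/1000 + (250000-50000)/1000 = 1000 kb.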
library("data.table")
setwd(dir="/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/")
lines <- readLines("pop")
for (i in lines) {
map <- fread(file=paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/", i, ".bim"))
colnames(map)<-c("CHR","SNP_name","POS","BP","A1","A2")
map_chr<-max(map$CHR)
#Calculate length in bp per chromosome and length of total genome covered (in kb)
genome_length=0
#for loop per chromosome
for (ks in 1:map_chr){
#map file of chromosome i
tmp_map<-map[map$CHR==ks,]
#determine genome length of this chromosome and add to total genome length
genome_length<-genome_length+((max(tmp_map$BP)-min(tmp_map$BP))/1000) #divide by thousand to get kb
}
write.table(genome_length,paste("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/",i,"/Lauto.txt",sep=""))
}
Same caveats as above; I suggest tallying the results in Excel afterwards.
3.1.2. Going further: using the total ROH length of a fully homozygous individual as Lauto
The script itself is fine, but in practice, if the ped file is too large it crashes with:
*** caught segfault ***
address (nil), cause 'unknown'
The script uses a lot of memory, so use it with caution.
There is probably a better alternative, since this is essentially just one fully homozygous individual; working on the VCF directly might be better.
3.1.2.1. Single-population R script
library("data.table")
#Create a dummy animal: one individual with the same map file as in the analysis, but with completely homozygous genotype
dummy.map.basis <- fread(file="NFS.map")
dummy.ped.basis <- fread(file="NFS.ped")
fam<- fread(file="NFS.fam")
#make large file with only A's to construct homozygous dummy animal
dummy_basis<-t(as.matrix((rep("A",30000000))))
#make dummy pedigree: 6 columns + 2*NR of SNPs (all SNPs A)
#First six columns are always equal to fam file!
dummy_ped<-c((as.matrix(fam[1,1:6])),dummy_basis[c(7:ncol(dummy.ped.basis))])
dummy_ped[1:20]
dummy.ped.basis<-as.matrix(t(dummy_ped))
#Write dummy pedigree and map
write.table(x=as.matrix(dummy.ped.basis),file="DummyGenotypes_ROH_basis.ped",row.names=FALSE,col.names=FALSE,sep=" ",quote=FALSE,na="-9")
write.table(x=as.matrix(dummy.map.basis),file="DummyGenotypes_ROH_basis.map",row.names=FALSE,col.names=FALSE,sep="\t",quote=FALSE,na="-9")
Then recompute ROH using these two files and the same ROH parameters as before.
dummy.hom_VPF <- fread(file="./Data/DummyGenotypes_ROH_basis.hom.indiv",fill=TRUE,header=TRUE)
dummy_length<-dummy.hom_VPF$KB[1]
dummy_length
3.1.2.2. The loop version
Build the map and ped files:
library("data.table")
setwd(dir="/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/")
lines <- readLines("pop")
for (i in lines) {
dummy.map.basis <- fread(file=paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/", i, ".map"))
dummy.ped.basis <- fread(file=paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/", i, ".ped"))
fam <- fread(file = paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/", i, ".fam"))
dummy_basis <- t(as.matrix((rep("A", 30000000))))
dummy_ped<-c((as.matrix(fam[1,1:6])),dummy_basis[c(7:ncol(dummy.ped.basis))])
dummy.ped.basis<-as.matrix(t(dummy_ped))
write.table(x=as.matrix(dummy.ped.basis),file=paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/DummyGenotypes_ROH_basis.ped"),row.names=FALSE,col.names=FALSE,sep=" ",quote=FALSE,na="-9") # write into each population's directory, otherwise every iteration overwrites the same file
write.table(x=as.matrix(dummy.map.basis),file=paste0("/public/home/ywf/23-0314Northeast-merino/0526/0621examine/ROHtest/", i, "/DummyGenotypes_ROH_basis.map"),row.names=FALSE,col.names=FALSE,sep="\t",quote=FALSE,na="-9")
}
With the map and ped files ready, switch back to Linux and continue:
for i in `cat pop`
do
cd ${i}
plink --allow-extra-chr --chr-set 26 --file DummyGenotypes_ROH_basis --make-bed --out DummyGenotypes_ROH_basis # convert to bed; ROH can apparently also be computed straight from the map/ped
cat /public/home/ywf/23-0314Northeast-merino/0526/0621examine/diversity/hohe/$i.hwe|awk '{sum+=$7} END {print sum/NR}' >$i.HOAverage.txt # the rest is the same as above
read HOAverage < $i.HOAverage.txt
a=$HOAverage
b=$(wc -l < $i.bim) # number of SNPs after QC
c=$(wc -l < $i.fam) # number of individuals in the population
# compute l
l=$(bc -l <<< "(l(0.05/($b*$c))/l(1-$a))")
if [ -z "$l" ]; then # an empty result means the formula failed; fall back to the default
l=50
else
l=$(echo "scale=0; $l / 1" | bc) # truncate to an integer
fi
# l done
# compute the threshold
d=$(echo "scale=3; 3/$l" | bc )
# compute ROH
plink --allow-extra-chr --chr-set 26 --bfile $i \
--homozyg-density 10 \
--homozyg-gap 100 \
--homozyg-kb 100 \
--homozyg-snp 10 \
--homozyg-window-het 1 \
--homozyg-window-missing 5 \
--homozyg-window-snp $l \
--homozyg-window-threshold $d \
--out $i
unset a b c d
cd ../
done
Column 5 of the DummyGenotypes_ROH_basis.hom.indiv file in each directory is that fully homozygous individual's total ROH length, i.e. "Lauto".
From an actual run, the full SNP-covered chromosome length is: 2655157.717 kb
The fully homozygous individual gives: 2650080 kb
3.1.3. Computing Lauto from a fully homozygous individual's VCF
3.1.3.1. A single population
Build the VCF file:
awk -F '\t' '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' QC.vcf> prefix.vcf
head -n 31 prefix.vcf > head.prefix.vcf # this number depends on how many header lines precede the first site
tail -n +32 prefix.vcf > pos.prefix.vcf
b=$(wc -l < pos.prefix.vcf) # count lines with wc
echo "NFS-0" > NFS.0 # write the sample name first, then build the genotype column
for ((i=1; i<$b; i++)) # loop b-1 times
do
echo "0/0" >> NFS.0 # append one 0/0 line
done
paste pos.prefix.vcf NFS.0 > 0.pos # merge the files
cat head.prefix.vcf 0.pos > 0.vcf # note: use the paste output from the line above
From here on it is the same as above: convert formats and compute ROH.
3.1.3.2. The loop version
You will need to adjust the file names, the head line count, and the HO file path.
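A compact sketch of the same construction: counting the '##' lines with grep avoids hard-coding the 31, cut -f1-9 replaces the awk column print (and leaves '##' lines untouched), and a single awk pass replaces the slow echo loop. The tiny QC.vcf below is made up; with real data only the lines from cut onward are needed.

```shell
# Made-up stand-in for QC.vcf: two ## header lines, the #CHROM row, two variants
printf '##fileformat=VCFv4.2\n##source=demo\n' > QC.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> QC.vcf
printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\n1\t200\t.\tC\tT\t.\t.\t.\tGT\t1/1\n' >> QC.vcf
cut -f1-9 QC.vcf > prefix.vcf              # keep the first 9 columns
n=$(grep -c '^##' prefix.vcf)              # header lines before the #CHROM row
head -n "$n" prefix.vcf > head.prefix.vcf
tail -n +"$((n+1))" prefix.vcf > pos.prefix.vcf
awk 'NR==1{print "NFS-0"; next}{print "0/0"}' pos.prefix.vcf > NFS.0  # sample name on the #CHROM row, then one 0/0 per site
paste pos.prefix.vcf NFS.0 > NFS.0.pos
cat head.prefix.vcf NFS.0.pos > 0.vcf
```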
for i in `cat pop`
do
cd ${i}
awk -F '\t' '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' QC.vcf> prefix.vcf # QC here is the extra VCF made during format conversion above
head -n 31 prefix.vcf > head.prefix.vcf # this number depends on how many lines precede the sample names; VCFs produced by the commands above should all have the same header length
tail -n +32 prefix.vcf > pos.prefix.vcf
b=$(wc -l < pos.prefix.vcf) # count lines with wc
echo "test-0" > test.0 # write the sample name first, then build the genotype column
for ((k=1; k<$b; k++)) # loop b-1 times; somewhat slow
do
echo "0/0" >> test.0 # append one 0/0 line
done
paste pos.prefix.vcf test.0 > test.0.pos # merge the files
cat head.prefix.vcf test.0.pos > 0.vcf
plink --allow-extra-chr --chr-set 26 -vcf 0.vcf --make-bed --double-id --out 0
cat /public/home/ywf/23-0314Northeast-merino/0526/0621examine/diversity/hohe/$i.hwe|awk '{sum+=$7} END {print sum/NR}' >$i.HOAverage.txt # from here on, same as above
read HOAverage < $i.HOAverage.txt
a=$HOAverage
b=$(wc -l < $i.bim) # number of SNPs after QC; $i is used here to keep the parameters consistent
c=$(wc -l < $i.fam) # number of individuals in the population
# compute l
l=$(bc -l <<< "(l(0.05/($b*$c))/l(1-$a))")
if [ -z "$l" ]; then # an empty result means the formula failed; fall back to the default
l=50
else
l=$(echo "scale=0; $l / 1" | bc) # truncate to an integer
fi
# l done
# compute the threshold
d=$(echo "scale=3; 3/$l" | bc ) # the 3 here is $Nout+1; Nout is a general variable (default 2 in that script): "The number of outer SNPs you don't want to allow for ROH-analysis (calculation of windowthreshold)"
# compute ROH
plink --allow-extra-chr --chr-set 26 --bfile 0 \
--homozyg-density 10 \
--homozyg-gap 100 \
--homozyg-kb 100 \
--homozyg-snp 10 \
--homozyg-window-het 1 \
--homozyg-window-missing 5 \
--homozyg-window-snp $l \
--homozyg-window-threshold $d \
--out 0
unset a b c d
cd ../
done
The total ROH length computed from this file (i.e. Lauto) is: 2654800 kb
Using the full SNP-covered chromosome length directly: 2655387.385 kb
3.2. Tallying each individual's Lroh and Froh
for i in `cat pop`
do
cd $i
awk 'NR>1' $i.hom.indiv >> ../LROH.txt # skip the header row; column 5 (KB) is each individual's total ROH length
cd ../
done
awk '{print $0, $5/x}' LROH.txt > Froh.indiv.txt # replace x with your Lauto value
I suggest doing the rest of the calculation in Excel, where you can also plot the Lroh and Froh distributions; mind the units.
4. Counting ROH by length class
1. Convert the hom file to a tab-delimited file (this step may not be strictly necessary)
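Instead of hand-editing the Lauto value into the command, it can also be read from a file and passed to awk with -v. A sketch with made-up stand-ins for LROH.txt and Lauto.txt (KB is column 5, Lauto is in kb):

```shell
# Made-up per-individual table (column 5 = total ROH length in kb) and Lauto in kb
printf 'F1 I1 0 10 53000 5300\nF2 I2 0 8 26500 3312.5\n' > LROH.txt
echo 2650000 > Lauto.txt
# Append Froh = KB / Lauto as a new column
awk -v lauto="$(cat Lauto.txt)" '{print $0, $5/lauto}' LROH.txt > Froh.indiv.txt
cat Froh.indiv.txt
```

Here 53000/2650000 = 0.02 and 26500/2650000 = 0.01.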
for i in `cat pop`;
do
cd $i
cat $i.hom | tr -s ' ' '\t' > $i.tab.hom
cd ../
done
2. This script labels each ROH segment by length class; you can set your own class boundaries.
for i in `cat pop`;
do
cd $i
awk '$9>=5000{print $0"\tF"} $9>=4000&&$9<5000{print $0"\tE"} $9>=3000&&$9<4000{print $0"\tD"} $9>=2000&&$9<3000{print $0"\tC"} $9>=1000&&$9<2000{print $0"\tB"} $9<1000{print $0"\tA"}' $i.tab.hom > $i.class.txt
cd ../
done
3. Count the ROH in each class
This script is slow in practice (roughly 4 minutes per population); optimize it if you can.
for i in `cat pop`;
do
cd $i
declare -A count # declare an associative array to store the counts
declare -A sum # declare another associative array to store the sums
count=([A]=0 [B]=0 [C]=0 [D]=0 [E]=0 [F]=0) # initialize the counts to 0
sum=([A]=0 [B]=0 [C]=0 [D]=0 [E]=0 [F]=0) # initialize the sums to 0
while read line # read each line
do
col9=$(echo $line | awk '{print $9}') # extract the ninth column using awk
col14=$(echo $line | awk '{print $14}') # extract the fourteenth column using awk
if [[ $col14 =~ [A-F] ]] # if the fourteenth column is one of A-F
then
count[$col14]=$((count[$col14]+1)) # increase the corresponding count by 1
sum[$col14]=$(echo "${sum[$col14]}+$col9" | bc) # increase the corresponding sum by the value of the ninth column using bc
fi
done < $i.class.txt # change to your own file
echo "Category Count Sum" > count.txt # write the header to count.txt
for key in ${!count[@]} # iterate over the keys of the array
do
echo "$key ${count[$key]} ${sum[$key]}" >> count.txt # append the key, the value and the sum to count.txt
done
cd ../
done
!!! The header row also gets labelled F, so the real count of class-F long segments is the reported count minus 1, and the class-F total length needs 99 subtracted.
Do the final tallying in Excel.
5. Computing ROH overlap regions (this can be run for the target populations only)
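The minutes-long while-read loop above can be replaced with a single awk pass; since this version skips the header row (NR>1), the "subtract 1 from F" correction is also unnecessary. The demo class file below is made up (column 9 = KB, column 14 = class letter, mirroring the classified file's layout):

```shell
# Made-up class file: header row plus two classified ROH segments
printf 'FID IID PHE CHR SNP1 SNP2 POS1 POS2 KB NSNP DENSITY PHOM PHET F\n' > demo.class.txt
printf 'f1 i1 0 1 a b 1 2 1500 10 x x x B\n' >> demo.class.txt
printf 'f1 i1 0 1 a b 1 2 6000 40 x x x F\n' >> demo.class.txt
# One pass: count segments and sum KB per class, skipping the header
awk 'NR>1 && $14 ~ /^[A-F]$/ {cnt[$14]++; sum[$14]+=$9}
     END{print "Category Count Sum"; for(k in cnt) print k, cnt[k], sum[k]}' demo.class.txt > count.txt
cat count.txt
```

The per-class output order is arbitrary (awk associative arrays are unordered), which does not matter for tallying in Excel.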
for i in `cat pop`;
do
cd $i
cat $i.hom.overlap|grep "CON" > $i.con.txt
cd ../
done
This lets you inspect the overlap regions; sort them by how many individuals carry each region and take the top 1% (usually enough).
In 《Runs of homozygosity in Swiss goats reveal genetic changes associated with domestication and modern selection》, ROH segments present in 80% of individuals within a population are defined as hotspot regions.
6. A more rigorous way to identify hotspot regions
This script comes from: https://doi.org/10.1186/s12711-020-00599-7
1. A single population
1.1. Manhattan plot
1.1.1. Data preparation
R
library("data.table")
ROH <- fread(file="NFS.hom",fill=TRUE,header=TRUE) # .hom, and .hom.summary below, are both outputs of the ROH run
populationx.ROH.summary<- fread(file="NFS.hom.summary",fill=TRUE,header=TRUE)
populationx.ROH.summary$N_animals_ROH<-populationx.ROH.summary$UNAFF+populationx.ROH.summary$AFF
populationx.ROH.summary$Perc_animals_ROH<-round(populationx.ROH.summary$N_animals_ROH/(number of animals)*100,2) # replace (number of animals) with the population size, e.g. populationx.ROH.summary$Perc_animals_ROH<-round(populationx.ROH.summary$N_animals_ROH/20*100,2)
SNP_results<-populationx.ROH.summary[,c("CHR","SNP","BP","N_animals_ROH","Perc_animals_ROH")]
SNP_results$population<-"NFS" # write the population name here
write.table(SNP_results,file="SNP_results_ROH_population.txt",sep=" ",row.names = F,col.names = T,quote=FALSE)
populationx.ROH.summary$jitter_mean<-jitter(populationx.ROH.summary$N_animals_ROH) # add small random jitter to the per-SNP animal counts so points do not overlap or line up in the plot
populationx.ROH.summary$Perc_animals_ROH<-populationx.ROH.summary$jitter_mean/(number of animals)*100 # replace (number of animals) as above
1.1.2. The Manhattan plot
library(qqman)
png(filename="Incidence_SNP_ROH_allChr_density.png",width = 1300,height = 1000, res=300) # 300 dpi; it still feels slightly rough, tweak the exact settings as needed
manhattan(populationx.ROH.summary,chr="CHR",bp="BP",snp="SNP",p="Perc_animals_ROH",logp=F,cex = 1, cex.axis = 0.8,
ylim=c(0,100),
ylab="Percentage of SNP incidence in ROH",suggestiveline=F,genomewideline = F,
col=c("#99A8B4FF","darkblue") )
mytitle = "NFS" # population name
mysubtitle = paste("N=",(number of animals)) # replace as above
mtext(side=3, line=2, at=0.00, adj=0, cex=1.5, mytitle)
mtext(side=3, line=1, at=0.00, adj=0, cex=1, mysubtitle)
dev.off()
1.1.3.Manhattan plot expressed as percentage of animals per chromosome
pdf("perc_SNP_ROH_perChr_density.density_.pdf",width = 5) # plotting all chromosomes this way yields one huge, dozens-of-pages PDF
for (i in 1:26){
tmp<-populationx.ROH.summary[populationx.ROH.summary$CHR==i,]
if (nrow (tmp)>0){
manhattan(tmp,chr="CHR",bp="BP",snp="SNP",p="Perc_animals_ROH",logp=F,
cex = 1, cex.axis = 0.8,
ylim=c(-0.5,100),ylab="Percentage of SNP incidence in ROH",suggestiveline=F,genomewideline = F,
col=c("darkblue") )
mytitle = "NFS"
mysubtitle = paste("N=",(number of animals))
mtext(side=3, line=2, at=0.00, adj=0, cex=1.5, mytitle)
mtext(side=3, line=1, at=0.00, adj=0, cex=1, mysubtitle)
}
}
dev.off()
This script does give me a PDF, but it seems to contain very little.
1.2. Identifying ROH hotspot regions
roh_inc_all_snps<-populationx.ROH.summary
#Determine dataset with only ROH-island snps based on criterium of Gorssen et al; (2019)
#Only keep top SNP per bin (1mb): this avoids different weighting (1000 snps per bin vs 10 snps per bin)
# replace with your chromosome count
for (z in 1:26){
#Make empty vector to fill in bins of 1Mb per chromsome
fill_vector<-data.frame(matrix(nrow=0,ncol=10))
#store chromosome number
bin_fill<-roh_inc_all_snps[roh_inc_all_snps$CHR==z,]
#How many bins of 1 MB are there?
bins<-max(bin_fill$BP)%/%(1000000)
#Bin for loop
for (r in 0:bins){#bin loop
#Print chromosome and bin
print(paste("CHR ",z," bin from ",r*1," MB to ",r*1+1," MB",sep=""))
xmin<-r*1000000
xmax<-r*1000000+1000000
#Look at highest percentage of ROH incidence within specific bin
SNP_highest_incidence_in_bin<-roh_inc_all_snps[roh_inc_all_snps$CHR==z & roh_inc_all_snps$BP>xmin & roh_inc_all_snps$BP<xmax,]
#Order SNP_highest_incidence_in_bin based on ROH_incidence
SNP_highest_incidence_in_bin<-SNP_highest_incidence_in_bin[order(SNP_highest_incidence_in_bin$Perc_animals_ROH,decreasing = T),]
SNP_highest_incidence_in_bin$bin_begin<-xmin/1000000
SNP_highest_incidence_in_bin$bin_end<-xmax/1000000
#Write SNP with highest ROH-incidence in vector
fill_vector[r+1,]<-SNP_highest_incidence_in_bin[1,]
}
assign(paste("CHR_",z,sep=""),fill_vector)
} # end of loop; this takes a while
#Bind all chromosomes together
All_chr<-CHR_1#Start with chromosome 1
#for loop to bind other chromosomes
for (z in 2:26){
All_chr<-rbind(All_chr,get(paste("CHR_",z,sep=""))) # replace the chromosome count as needed
}
#adapt colnames
colnames(All_chr)<-c(colnames(roh_inc_all_snps),"Bin_begin","Bin_end")
#Remove na rows
All_chr<-na.omit(All_chr)
#Calculate ROH island threshold
threshold<-All_chr
#Make columns character
threshold$CHR<-as.character(threshold$CHR)
threshold$SNP<-as.character(threshold$SNP)
#add population name
threshold$population<- "NFS" # population name
#===
#Calculate z-scores (z-score= (observation-mean)/sd)
#===
threshold$Z_score <- (threshold$Perc_animals_ROH - mean(threshold$Perc_animals_ROH)) / sd(threshold$Perc_animals_ROH)
#Calculate p-value based on z-score table
threshold$p_value<-pnorm(threshold$Z_score)
#Determine p-value to put thresthold: what is the % of SNPs in ROH above which we call a SNP in an ROH-island
p<-min(threshold$Perc_animals_ROH[threshold$p_value>=0.999]) # 0.999 is used here
#if threshold is below 30%: make 30% # a region must cover at least 30% of individuals to count as a hotspot
if(p<30){p<-30}
#if threshold is above 80%: make 80% # the threshold is capped at 80%, so any region in more than 80% of individuals counts as a hotspot
if(p>80){p<-80}
#Extract SNPs in ROH island: SNPs with % of ROH larger or equal than p
snps_in_ROH_island<-threshold[threshold$Perc_animals_ROH>=p,]
#Write summary file (if there are any snps in roh island)
if(nrow(snps_in_ROH_island)>0){
write.table(snps_in_ROH_island,file="SNPs_in_ROH_island_population.csv", row.names = F,col.names = T,na = "",quote = F,sep = ",")
}
The ROH regions these SNPs fall in are the ROH hotspot regions? I will double-check the literature, but this should basically be right.
From the original paper: ROH islands were defined as regions where SNPs had a P-value for ROH incidence higher than a population specific threshold.
#Make manhattan plot with threshold visualized
threshold$CHR<-as.numeric(threshold$CHR)
roh_inc_all_snps$CHR<-as.numeric(roh_inc_all_snps$CHR)
#Save manhattan plot as png
png(filename="Incidence_SNP_ROH_allChr_1snp_per_mb_ROH_island_threshold.png",width = 1300,height = 1000, res=300)
#Manhattan plot
manhattan(populationx.ROH.summary,chr="CHR",bp="BP",snp="SNP",p="Perc_animals_ROH",logp=F,
cex = 1, cex.axis = 0.8,
ylim=c(0,101),ylab="Percentage of SNP incidence in ROH",suggestiveline=F,genomewideline = p,
col=c("#99A8B4FF","darkblue") )
mytitle = "NFS" # population name
mysubtitle = paste("N=",(number of animals)) # replace with the number of individuals
mtext(side=3, line=2, at=0.00, adj=0, cex=1.5, mytitle)
mtext(side=3, line=1, at=0.00, adj=0, cex=1, mysubtitle)
dev.off()
This plot is ugly; I will polish it later. Tired for now.