记录下小编自己可能用到的小脚本
(1)fasta转phylip格式
用法:python fa2phy.py seq.fasta seq.phy
import sys
file = open(sys.argv[1],"r")
out = open(sys.argv[2],"w")
for line in file:
line = line.strip()
if line.startswith(">"):
out.write(line[1:] + "\t")
else:
out.write(line + "\n")
file.close()
out.close()
(2)fastq转fasta;根据ID提取fasta中的序列(需安装seqtk软件)
seqtk seq -a seq.fastq >seq.fasta
seqtk subseq seq.fasta ID.list >need.fasta
(3)从gff3文件中提取一系列需要的gff3信息
**用法:sh get-gff3.sh ID.list **
while read line
do
`grep -w $line EVM.final.all.transcripts.gene.gff3 >>EVM.final.all.transcripts.gene.new.gff3`
done<$1
(4)根据某一文件某列ID从另一个文件中提取有该列ID的信息(使用最频繁)
awk 'NR==FNR{a[$1]=$0}NR!=FNR{if(a[$2]){print $0"\t"a[$2]}else{print $0}}' file1 file2
file1文件如下:
EVM0003758 Cgyptg000001l_1G000010
EVM0027620 Cgyptg000001l_1G000020
EVM0001766 Cgyptg000001l_1G000030
EVM0031262 Cgyptg000001l_1G000040
EVM0027288 Cgyptg000001l_1G000050
EVM0007809 Cgyptg000001l_1G000060
EVM0018470 Cgyptg000001l_1G000070
EVM0030012 Cgyptg000001l_1G000080
EVM0007598 Cgyptg000001l_1G000090
EVM0016935 Cgyptg000001l_1G000100
file2文件如下:
Cgyptg000001l_1G000030
Cgyptg000001l_1G000050
Cgyptg000001l_1G000070
Cgyptg000001l_1G000080
脚本后
Cgyptg000001l_1G000030 EVM0001766 Cgyptg000001l_1G000030
Cgyptg000001l_1G000050 EVM0027288 Cgyptg000001l_1G000050
Cgyptg000001l_1G000070 EVM0018470 Cgyptg000001l_1G000070
Cgyptg000001l_1G000080 EVM0030012 Cgyptg000001l_1G000080
(5)统计一个文件中,每一行各个元素的个数(例如每行可能有A B C D四个元素)
**用法:python count.py file ID.list #ID.list为含有A B C D的文件list
import sys
from collections import Counter
file = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
out = open(sys.argv[3],"w")
for line in file:
line = line.strip().split("\t")
cou = Counter(line[2:])
for line2 in file2:
line2 = line2.strip()
if line2 not in cou.keys():
cou[line2] = "0"
else:
pass
out.write(line[0] + "\t" + str(line[1]) + "\t" + str(cou["A"]) + "\t" + str(cou["B"]) + "\t" + str(cou["C"]) + "\t" + str(cou["D"]) + "\n")
file.close()
file2.close()
out.close()
(6)划窗计算pi跟theta w,分两步,首先vcftools划窗计算pi,然后用awk计算theta w
vcftools --vcf sample_snp.vcf --window-pi 10000 --out samlpe.windowed.pi
tail -n +2 samlpe.windowed.pi | awk -F$'\t' -v popn=5 'BEGIN{for(z=1; z < 2*popn; ++z) w_p+=1/z;}{w=$4/w_p/($3-$2+1); a_w += w; a_pi += $5}END{print a_w/NR, a_pi/NR;}' #这里的 popn为vcf文件中样本个数
(7)提取某fasta文件中某ID的某区间序列,例如提取染色体chr1的100-200位置序列
samtools faidx input.fa chr1:100-200 > chr1.fa
(8)提取vcf文件中某区间的vcf文件,例如提取vcf文件中,染色体Chr01 1-1000bp区间的vcf文件
vcftools --vcf vcf --chr Chr01 --from-bp 1 --to-bp 1000 --recode --out Chr01
(9)PDF、svg转png(需安装有convert)
convert -resize 3600 -density 300 -quality 600 file.pdf file.png
convert -resize 3600 -density 300 -quality 600 file.svg file.png
(10)mummer比对
nucmer -t 12 --mum -p ref_seq ref.genome.fa seq.genome.fa #比对
delta-filter -i 90 -l 10000 -r -q ref_seq.delta -1> ref_seq.filter.delta.10k #过滤
mummerplot --png -p ref_seq.10k ref_seq.filter.delta.10k #作图
mummerplot -p ref_seq.filter.delta.10k -Q ref.ID -R seq.ID --png #按照选中的ID作图
show-coords ref_seq.filter.delta.10k >ref_seq.filter.delta.10k.coords #文件格式化