参考链接:
用法大全:
官方说明:https://bedtools.readthedocs.io/en/latest/content/overview.html
Bedtools使用:http://blog.genesino.com/2018/04/bedtools/
bedtools 用法大全:https://www.jianshu.com/p/6c3b87301491
2019-和刘小泽一起跟着官网学bedtools:https://www.jianshu.com/p/2efcb6f8f55d
bedtools使用教程详解:
安装:
wget https://github.com/arq5x/bedtools2/archive/v2.25.0.tar.gz
tar zxvf v2.25.0
cd bedtools2-2.25.0/
make
cd bin/
export PATH=$PWD:$PATH
特定功能:
使用bedtools划分各种条件的bin dplyr-pandas-bedtools 分组变量处理
dplyr-pandas-bedtools 分组变量处理
bedtools开发的目的是为了快速,灵活的比较大量的基因组特征(genomic features)。而genomic features通常使用Browser Extensible Data (BED) 或者 General Feature Format (GFF)文件表示,用UCSC Genome Browser进行可视化比较。
例如,bedtools可以进行取intersect(交集), merge(并集), count(计数), complement(补集),以及用来对广泛使用的基因组文件格式,例如BAM, BED, GFF/GTF, VCF等进行基因组区间的转换。单个的工具设计的目的是应对简单的任务,复杂的分析能通过组合多个bedtools工具操作实现。同时,该工具允许控制输出结果的呈现形式。最初的bedtools版本支持单独的6列BED文件。但是,如今增加了对序列比对BAM文件的支持。以及GFF文件的特征,BED文件。以及VCF文件。这些工具是相当快速的,并且即使是大的数据集也可以在数秒内完成任务。
BEDTools主要使用BED格式的前三列,即:
chrom: 染色体信息
start: genome feature的起始位点,从0开始
end: genome feature的终止位点,至少为1
一般常用物种的genome file在BEDTools安装目录的/genome里面
BEDPE格式是其自定义的一种新的格式,为了简洁的描述不连续的genome features,例如结构变异和双端测序比对
注意:
start1和start2起始坐标第一个碱基都为0,所以start=9, end=20表示碱基跨度是从第10位到第20位
chrom1或者chrom2用.表示unknown;start1,end1,start2,end2用-1表示unknown
bedtools 的39个子命令列表(按照字母排序)
Utility | Description |
---|---|
annotate | Annotate coverage of features from multiple files. |
bamtobed | Convert BAM alignments to BED (& other) formats. |
bamtofastq | Convert BAM records to FASTQ records. |
bed12tobed6 | Breaks BED12 intervals into discrete BED6 intervals. |
bedpetobam | Convert BEDPE intervals to BAM records. |
bedtobam | Convert intervals to BAM records. |
closest | Find the closest, potentially non-overlapping interval. |
cluster | Cluster (but don’t merge) overlapping/nearby intervals. |
complement | Extract intervals not represented by an interval file. |
coverage | Compute the coverage over defined intervals. |
expand | Replicate lines based on lists of values in columns. |
fisher | Calculate Fisher statistic b/w two feature files. |
flank | Create new intervals from the flanks of existing intervals. |
genomecov | Compute the coverage over an entire genome. |
getfasta | Use intervals to extract sequences from a FASTA file. |
groupby | Group by common cols. & summarize oth. cols. (~ SQL “groupBy”) |
igv | Create an IGV snapshot batch script. |
intersect | Find overlapping intervals in various ways. |
jaccard | Calculate the Jaccard statistic b/w two sets of intervals. |
links | Create a HTML page of links to UCSC locations. |
makewindows | Make interval “windows” across a genome. |
map | Apply a function to a column for each overlapping interval. |
maskfasta | Use intervals to mask sequences from a FASTA file. |
merge | Combine overlapping/nearby intervals into a single interval. |
multicov | Counts coverage from multiple BAMs at specific intervals. |
multiinter | Identifies common intervals among multiple interval files. |
nuc | Profile the nucleotide content of intervals in a FASTA file. |
overlap | Computes the amount of overlap from two intervals. |
pairtobed | Find pairs that overlap intervals in various ways. |
pairtopair | Find pairs that overlap other pairs in various ways. |
random | Generate random intervals in a genome. |
reldist | Calculate the distribution of relative distances b/w two files. |
tools/sample | Sample random records from file using reservoir sampling. |
shift | Adjust the position of intervals. |
shuffle | Randomly redistribute intervals in a genome. |
slop | Adjust the size of intervals. |
sort | Order the intervals in a file. |
tools/spacing | Sample random records from file using reservoir sampling. |
tools/split | Split a file into multiple files with equal records or base pairs. |
subtract | Remove intervals based on overlaps b/w two files. |
tag | Tag BAM alignments based on overlaps with interval files. |
unionbedg | Combines coverage intervals from multiple BEDGRAPH files. |
window | Find overlapping intervals within a window around an interval. |
bedtools 的41个子命令列表(按照功能排序)
区域注释,如peak注释,peak分布分析,peak与调控元件交集等。
区域合并,如求算多样品peak合集,或合并重叠区域
区域互补,如得到非基因区
利用比对结果对测序广度和深度评估
多样品peak相似性计算,评估ChIP类区域结果的样品相似性。
bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
求区域之间的交集,可以用来注释peak,计算reads比对到的基因组区域
不同样品的peak之间的peak重叠情况。
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
寻找最近但可能不重叠的区域
coverage Compute the coverage over defined intervals.
计算区域覆盖度
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
合并重叠或相接的区域
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
获得互补区域
subtract Remove intervals based on overlaps b/w two files.
计算区域差集
slop Adjust the size of intervals.
调整区域大小,如获得转录起始位点上下游3 K的区域
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
排序,部分命令需要排序过的bed文件
random Generate random intervals in a genome.
获得随机区域,作为背景集
shuffle Randomly redistrubute intervals in a genome.
根据给定的bed文件获得随机区域,作为背景集
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
提取给定位置的FASTA序列
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
计算数据集相似性
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
用于生成一个脚本,批量捕获IGV截图
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
把给定区域划分成指定大小和间隔的小区间 (bin)
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
分组结算,不只可以用于bed文件。
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs
- stdin 和 - 用法和xargs 不一样
Part1:Genome arithmetic
1.intersect
可以计算两个或者多个BED/BAM/VCF/GFF文件中基因组坐标位置的交集(overlap),根据参数不同,可以得到不同的结果。
求区域之间的交集,可以用来注释peak,计算reads比对到的基因组区域, 不同样品的peak之间的peak重叠情况
- Usage:
bedtools intersect [OPTIONS] -a <FILE> \
-b <FILE1, FILE2, ..., FILEN>
- 计算两个bed交集区域
$ cat A.bed
chr1 10 20
chr1 30 40
$ cat B.bed
chr1 15 20
$ bedtools intersect -a A.bed -b B.bed
chr1 15 20
2.window
与bedtools intersect相似,窗口搜索A和B中的重叠特性。然而,窗口添加了A中每个特性上游和下游的指定大小(默认为1000)的碱基对。
- Usage:
bedtools window [OPTIONS] [-a|-abam] -b <BED/GFF/VCF></pre>
- 相当于将A区域进行扩增一定区间,再和B比较。
$ cat A.bed
chr1 100 200
$ cat B.bed
chr1 500 1000
chr1 1300 2000
$ bedtools window -a A.bed -b B.bed
chr1 100 200 chr1 500 1000</pre>
3.closest
类似于相交,最接近的搜索是在A和B中重叠的特征。如果B中没有一个特征与A中的当前特征重叠,则最接近将报告最接近的特征(即距A的起点或终点最小的基因组距离)。 例如,人们可能想找到哪个是与显着GWAS多态性最接近的基因。 请注意,最接近将报告重叠特征为最接近-即,它不限于最接近的非重叠特征。 以下标志性的“备忘单”总结了最接近的工具提供的各种选项提供的功能。
寻找最近但可能不重叠的区域: 简单来说就是从A.bed 中找到B.bed 中最近的区间,比如修饰一个基因,最近的enhancer 位置.
- Usage:
bedtools closest [OPTIONS] -a <FILE> \
-b <FILE1, FILE2, ..., FILEN></pre>
- 比如:
$ cat a.bed
chr1 10 20 a1 1 -
$ cat b.bed
chr1 7 8 b1 1 -
chr1 15 25 b2 2 +
$ bedtools closest -a a.bed -b b.bed
chr1 10 20 a1 1 - chr1 15 25 b2 2 +</pre>
4.coverage
计算区域覆盖度:计算一个bed下,每一个region 和另一个bed 文件的交集。
Chromosome ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BED FILE A *************** *************** ****** **************
BED File B ^^^^ ^^^^ ^^ ^^^^^^^^^ ^^^ ^^ ^^^^
^^^^^^^^ ^^^^^ ^^^^^ ^^
Result [ N=3, 10/15 ] [ N=1, 2/15 ] [N=1,6/6] [N=6, 12/14 ]</pre>
- Usage:
bedtools coverage [OPTIONS] -a <FILE> \
-b <FILE1, FILE2, ..., FILEN></pre>
- 例子:统计A区间每一个元素和B文件交集情况
$ cat A.bed
chr1 0 100
chr1 100 200
chr2 0 100
$ cat B.bed
chr1 10 20
chr1 20 30
chr1 30 40
chr1 100 200
$ bedtools coverage -a A.bed -b B.bed
chr1 0 100 3 30 100 0.3000000
chr1 100 200 1 100 100 1.0000000
chr2 0 100 0 0 100 0.0000000</pre>
5.map
为每个重叠间隔对列应用一个函数。
- Usage:
bedtools map [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf></pre>
- 例子:和A 区间有重叠的B中元素作为一个分组,运用函数计算最大值或者平均值等等。
$ cat a.bed
chr1 10 20 a1 1 +
chr1 50 60 a2 2 -
chr1 80 90 a3 3 -
$ cat b.bed
chr1 12 14 b1 2 +
chr1 13 15 b2 5 -
chr1 16 18 b3 5 +
chr1 82 85 b4 2 -
chr1 85 87 b5 3 +
$ bedtools map -a a.bed -b b.bed
chr1 10 20 a1 1 + 12
chr1 50 60 a2 2 - .
chr1 80 90 a3 3 - 5
6.genomecov
染色体和全基因组覆盖度计算
- Usage:
bedtools genomecov [OPTIONS] [-i|-ibam] -g (iff. -i)</pre>
- example:
$ cat A.bed
chr1 10 20
chr1 20 30
chr2 0 500
$ cat my.genome
chr1 1000
chr2 500
$ bedtools genomecov -i A.bed -g my.genome
chr1 0 980 1000 0.98
chr1 1 20 1000 0.02
chr2 1 500 500 1
genome 0 980 1500 0.653333
genome 1 520 1500 0.346667
# name 覆盖次数 覆盖碱基数 总碱基数 覆盖度
# 同时计算单染色体和全基因组覆盖度</pre>
7.merge
https://www.jianshu.com/p/2efcb6f8f55d
许多数据集的基因组feature坐标经常是连续的(比如ChIPseq的结果),就像下图的蓝色部分
于是可以把这些连续的基因组小区间连接起来,拼成一个连续的大区间
- Usage:
bedtools merge [OPTIONS] -i <BED/GFF/VCF/BAM></pre>
- 例子:
需要拼接的输入文件(bed/gff/vcf)必须是排序(sort)过的【不sort会报错🙅♂️】
先按染色体,再按起始位点,这样保证merge的算法执行起来非常顺畅,而基本不需要消耗内存再次加工
以exon.bed为例,展示merge的作用
注意看:第3行和第4行的区间是有重叠的,因此它们可以进行merge
bedtools merge -i exons.bed | head -10</pre>
可以看到,merge之后原来的第3行(13220 - 14409
)和第4行(14361 - 14829
)坐标合并成了13220 - 14829
[图片上传失败...(image-de7f16-1586285151489)]
8.cluster
类似merge 功能,合并重叠及其靠近的区间
- Usage:
bedtools cluster [OPTIONS] -i <BED/GFF/VCF></pre>
- example: 默认情况下合并有重叠的区域(1bp)
$ cat A.bed
chr1 100 200
chr1 180 250
chr1 250 500
chr1 501 1000
$ bedtools cluster -i A.bed
chr1 100 200 1
chr1 180 250 1
chr1 250 500 1
chr1 501 1000 2
9.complement
https://www.jianshu.com/p/2efcb6f8f55d
bedtools complement 实现反选.
给定一个feature坐标信息文件,我们如果不关心其中标记的区间,而是想看看有哪些区间不在这个文件中.
例如,有一个ChIP-seq的peaks信息,现在想知道有哪些区域没有被抗体结合,就可以用complement
- Usage:
bedtools complement -i <BED/GFF/VCF> -g <GENOME></pre>
-
example:
现在有了外显子的bed文件,通过反选,我们就能获得内含子或基因间区的坐标。那么既然是反选,除了exon.bed,还要有一个总体的范围(也就是基因组各个染色体的长度信息)【用
-g
指定】
# 首先要确定整体的操作范围
head -10 genome.txt
chr1 249250621
chr10 135534747
chr11 135006516
chr11_gl000202_random 40103
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
# 实现反选,选出不属于exon的区域
bedtools complement -i exons.bed -g genome.txt | head -10
chr1 0 11873
chr1 12227 12612
chr1 12721 13220
chr1 14829 14969
chr1 15038 15795
chr1 15947 16606
chr1 16765 16857
chr1 17055 17232
chr1 17368 17605
chr1 17742 17914</pre>
10 .subtract
https://www.jianshu.com/p/cb079a393661
从A中去掉B
- Usage:
bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF></pre>
- example:
$ cat A.bed
chr1 10 20
chr1 100 200
$ cat B.bed
chr1 0 30
chr1 180 300
$ bedtools subtract -a A.bed -b B.bed
chr1 100 180</pre>
- 比如提取TSS ±2kb 区域外,peak 区间.为计算ROC做准备.
具体代码:https://github.com/Helab-bioinformatics/itChIP/blob/master/07_ROC.R
11.slop
awk '{OFS="\t" print $1,$2-<slop>,$3+<slop>}'
类似.
bedtools slop将限制染色体的大小(即没有不能小于 0和结束不能大于染色体长度)。
- Usage:
bedtools slop [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]</pre>
- example :
$ cat A.bed
chr1 5 100
chr1 800 980
$ cat my.genome
chr1 1000
$ bedtools slop -i A.bed -g my.genome -b 5
chr1 0 105
chr1 795 985
$ bedtools slop -i A.bed -g my.genome -l 2 -r 3
chr1 3 103
chr1 798 983
12.flank
将区间进行左右两侧扩充.
- Usage:
bedtools flank [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]</pre>
- example:
$ cat A.bed
chr1 100 200
chr1 500 600
$ cat my.genome
chr1 1000
$ bedtools flank -i A.bed -g my.genome -b 5
chr1 95 100
chr1 200 205
chr1 495 500
chr1 600 605
$ bedtools flank -i A.bed -g my.genome -l 2 -r 3
chr1 98 100
chr1 200 203
chr1 498 500
chr1 600 603</pre>
13. sort
排序,部分命令需要排序过的bed文件
- Usage:
bedtools sort [OPTIONS] -i <BED/GFF/VCF></pre>
- example:
cat A.bed
chr1 800 1000
chr1 80 180
chr1 1 10
chr1 750 10000
sortBed -i A.bed
chr1 1 10
chr1 80 180
chr1 750 10000
chr1 800 1000</pre>
14.random
bedtools random将以BED6格式生成一组随机的间隔。可以指定应该生成的间隔的数目(-n)和大小(-l)。
- Usage:
bedtools random [OPTIONS] -g <GENOME></pre>
- example:
$ bedtools random -g hg19.genome
chr2 87536758 87536858 1 100 -
chrX 46051735 46051835 2 100 +
chr18 5237041 5237141 3 100 -
chr12 45809998 45810098 4 100 +
chrX 42034890 42034990 5 100 -
chr10 77510935 77511035 6 100 -
chr3 39844278 39844378 7 100 -
chr6 101012700 101012800 8 100 +
chr12 38123482 38123582 9 100 +
chr7 88508598 88508698 10 100 -
$ bedtools random -g hg19.genome
chr3 141987850 141987950 1 100 +
chr5 137643331 137643431 2 100 +
chr2 155523858 155523958 3 100 -
chr5 147874094 147874194 4 100 +
chr1 71838335 71838435 5 100 -
chr8 71154323 71154423 6 100 -
chr2 133240474 133240574 7 100 +
chr9 131495427 131495527 8 100 +
chrX 125952943 125953043 9 100 +
chr3 59685545 59685645 10 100 +</pre>
15.shuffle
bedtools shuffle将在基因组文件中定义的基因组中,随机排列bed文件中区域在基因组位置。
根据给定的bed文件获得随机区域,作为背景集
- Usage:
bedtools shuffle [OPTIONS] -i <BED/GFF/VCF> -g <GENOME></pre>
- example: 默认情况下,bedtools shuffle将在随机染色体上的随机位置上重新定位输入BED文件中的每个特性。每个特征的大小和链被保留。
$ cat A.bed
chr1 0 100 a1 1 +
chr1 0 1000 a2 2 -
$ cat my.genome
chr1 10000
chr2 8000
chr3 5000
chr4 2000
$ bedtools shuffle -i A.bed -g my.genome
chr4 1498 1598 a1 1 +
chr3 2156 3156 a2 2 -</pre>
16.sample
从文件中随机取样记录.
Summary: Take sample of input file(s) using reservoir sampling algorithm.
Usage: bedtools sample [OPTIONS] -i <bed/gff/vcf/bam>
WARNING: The current sample algorithm will hold all requested sample records in memory prior to output. The user must ensure that there is adequate memory for this.
17.spacing
报告文件中间隔之间的间隔长度
Summary: Report (last col.) the gap lengths between intervals in a file.
Usage: bedtools spacing [OPTIONS] -i <bed/gff/vcf/bam>
-bed If using BAM input, write output as BED.
-header Print the header from the A file prior to results.
-nobuf Disable buffered output. Using this option will cause each line of output to be printed as it is generated, rather than saved in a buffer. This will make printing large output files noticeably slower, but can be useful in conjunction with other software tools and scripts that need to process one line of bedtools output at a time.
-iobuf Specify amount of memory to use for input buffer. Takes an integer argument. Optional suffixes K/M/G supported. Note: currently has no effect with compressed files.
Notes: (1) Input must be sorted by chrom,start (sort -k1,1 -k2,2n for BED). (2) The 1st element for each chrom will have NULL distance. ("."). (3) Distance for overlapping intervals is -1 and 0 for adjacent intervals.
Example:
$ cat test.bed
chr1 0 10
chr1 10 20
chr1 19 30
chr1 35 45
chr1 100 200 </pre>
$ bedtools spacing -i test.bed chr1 0 10 . chr1 10 20 0 chr1 19 30 -1 chr1 35 45 5 chr1 100 200 55
18.annotate
bedtools可以对一个BED / VCF / GFF文件进行注释,并具有从多个其他BED / VCF / GFF文件中观察到的覆盖范围和重叠数。 通过这种方式,它允许人们通过一个命令询问一个feature与其他多个feature类型的重合程度。
- Usage
bedtools annotate [OPTIONS] -i <BED/GFF/VCF> -files FILE1 FILE2 FILE3 ... FILEn</pre>
- 计算输入的bed文件和其他多个文件交集个数.
chr1 100 200 nasty 1 -
chr2 500 1000 ugly 2 +
chr3 1000 5000 big 3 -
$ cat genes.bed
chr1 150 200 geneA 1 +
chr1 175 250 geneB 2 +
chr3 0 10000 geneC 3 -
$ cat conserve.bed
chr1 0 10000 cons1 1 +
chr2 700 10000 cons2 2 -
chr3 4000 10000 cons3 3 +
$ cat known_var.bed
chr1 0 120 known1 -
chr1 150 160 known2 -
chr2 0 10000 known3 +
$ bedtools annotate -counts -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1 100 200 nasty 1 - 2 1 2
chr2 500 1000 ugly 2 + 0 1 1
chr3 1000 5000 big 3 - 1 1 0</pre>
Part2:Multi-way file comparisons
19. multiinter
标识多个bed文件之间的公共区间。
Summary: Identifies common intervals among multiple BED/GFF/VCF files.
Usage: bedtools multiinter [OPTIONS] -i FILE1 FILE2 .. FILEn Requires that each interval file is sorted by chrom/start.
20.unionbedg
unionbedg将多个BEDGRAPH文件组合成单个文件,这样就可以直接比较多个样本的覆盖率(如基因型)
- Usage:
bedtools unionbedg [OPTIONS] -i FILE1 FILE2 FILE3 ... FILEn</pre>
- example:
cat 1.bg
chr1 1000 1500 10
chr1 2000 2100 20
cat 2.bg
chr1 900 1600 60
chr1 1700 2050 50
cat 3.bg
chr1 1980 2070 80
chr1 2090 2100 20
cat sizes.txt
chr1 5000
bedtools unionbedg -i 1.bg 2.bg 3.bg
chr1 900 1000 0 60 0
chr1 1000 1500 10 60 0
chr1 1500 1600 0 60 0
chr1 1700 1980 0 50 0
chr1 1980 2000 0 50 80
chr1 2000 2050 20 50 80
chr1 2050 2070 20 0 80
chr1 2070 2090 20 0 0
chr1 2090 2100 20 0 20</pre>
Part3 : Paired-end manipulation
21. pairtobed
找出以各种方式重叠区间的对
Summary: Report overlaps between a BEDPE file and a BED/GFF/VCF file.
Usage: bedtools pairtobed [OPTIONS] -a <bedpe> -b <bed/gff/vcf>
22.pairtopair
找出以各种方式重叠的配对。
pairToPair比较两个BEDPE文件以查找重叠,其中A中BEDPE特征的每个末端与B中特征的末端重叠。例如,使用pairToPair,可以在两个文件中筛选出完全相同的不一致双末端对齐方式。 这可能表明(除其他事项外)不一致的对表明每个文件/样本中的结构都相同。
- Usage:
pairToPair [OPTIONS] -a <BEDPE> -b <BEDPE></pre>
- example: 默认情况下,如果两端都与BEDPE B文件中的特征重叠,则将报告A中的BEDPE特征。 如果存在两个BEDPE文件的链信息,则进一步要求两端的重叠都在同一链上。 这样,原本重叠的(就基因组位置而言)F / R比对将不与R / R比对匹配。
Chromosome ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BEDPE A >>>>>.................................>>>>>
BEDPE B <<<<<.............................>>>>>
Result
BEDPE A >>>>>.................................>>>>>
BEDPE B >>>>>.............................>>>>>
Result >>>>>.................................>>>>></pre>
Part 4 : Format conversion
23.bamtobed
bedtools bamtobed是一种比较实用转换程序,可将BAM格式的序列比对转换为BED,BED12和/或BEDPE记录。保存了位置信息,同时节约了空间.
- Usage
bedtools bamtobed [OPTIONS] -i <BAM></pre>
- 比如将bam 转换成bed6 格式
$ bedtools bamtobed -i reads.bam | head -3
chr7 118970079 118970129 TUPAC_0001:3:1:0:1452#0/1 37 -
chr7 118965072 118965122 TUPAC_0001:3:1:0:1452#0/2 37 +
chr11 46769934 46769984 TUPAC_0001:3:1:0:1472#0/1 37 -
24.bedtobam
将bed 转换成bam 文件
- Usage:
bedToBam [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> > <BAM></pre>
- example:
head -5 rmsk.hg18.chr21.bed
chr21 9719768 9721892 ALR/Alpha 1004 +
chr21 9721905 9725582 ALR/Alpha 1010 +
chr21 9725582 9725977 L1PA3 3288 +
chr21 9726021 9729309 ALR/Alpha 1051 +
chr21 9729320 9729809 L1PA3 3897 -
bedToBam -i rmsk.hg18.chr21.bed -g human.hg18.genome > rmsk.hg18.chr21.bam
samtools view rmsk.hg18.chr21.bam | head -5
ALR/Alpha 0 chr21 9719769 255 2124M * 0 0 * *
ALR/Alpha 0 chr21 9721906 255 3677M * 0 0 * *
L1PA3 0 chr21 9725583 255 395M * 0 0 * *
ALR/Alpha 0 chr21 9726022 255 3288M * 0 0 * *
L1PA3 16 chr21 9729321 255 489M * 0 0 * *</pre>
25.bamtofastq
bedtools bamtofastq是一个转换工具,用于从BAM格式的序列比对中提取FASTQ记录。
- Usage
bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
- 比如下面例子
$ bedtools bamtofastq -i NA18152.bam -fq NA18152.fq
$ head -8 NA18152.fq
@NA18152-SRR007381.35051
GGAGACATATCATATAAGTAATGCTAGGGTGAGTGGTAGGAAGTTTTTTCATAGGAGGTGTATGAGTTGGTCGTAGCGGAATCGGGGGTATGCTGTTCGAATTCATAAGAACAGGGAGGTTAGAAGTAGGGTCTTGGTGACAAAATATGTTGTATAGAGTTCAGGGGAGAGTGCGTCATATGTTGTTCCTAGGAAGATTGTAGTGGTGAGGGTGTTTATTATAATAATGTTTGTGTATTCGGCTATGAAGAATAGGGCGAAGGGGCCTGCGGCGTATTCGATGTTGAAGCCTGAGACTAGTTCGGACTCCCCTTCGGCAAGGTCGAA
+
<<<;;<;<;;<;;;;;;;;;;;;<<<:;;;;;;;;;;;;;;;;::::::;;;;<<;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<<<;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<;;;;;:;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<;;;;;;;;;;<<<<<<<<;;;;;;;;;:;;;;;;;;;;;;;;;;;;;:;;;;8;;8888;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;8966689666666299866669:899
@NA18152-SRR007381.637219
AATGCTAGGGTGAGTGGTAGGAAGTTTTTTCATAGGAGGTGTATGAGTTGGTCGTAGCGGAATCGGGGGTATGCTGTTCGAATTCATAAGAACAGGGAGGTTAGAAGTAGGGTCTTGGTGACAAAATATGTTGTATAGAGTTCAGGGGAGAGTGCGTCATATGTTGTTCCTAGGAAGATTGTAGTGGTGAGGGTGTTTATTATAATAATGTTTGTGTATTCGGCTATGAAGAATAGGGCGAAGGGGCCTGCGGCGTATTCGATGTTGAAGCCTGAGACTAGTTCGGACTCCCCTTCCGGCAAGGTCGAA
+
<<<<<<<<<<;;<;<;;;;<<;<888888899<;;;;;;<;;;;;;;;;;;;;;;;;;;;;;;;<<<<<;;;;;;;;;<;<<<<<;;;;;;;;;;;;;<<<<;;;;;;;:::;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<<;;;;;;;;;;;;;;;;;;;;;;;<;;;;;;;;;;;;;;;;;;;;;;<888<;<<;;;;<<<<<<;;;;;<<<<<<<<;;;;;;;;;:;;;;888888899:::;;8;;;;;;;;;;;;;;;;;;;99;;99666896666966666600;96666669966</pre>
26.bedpetobam
Summary: Converts feature records to BAM format.
Usage: bedpetobam [OPTIONS] -i <bed/gff/vcf> -g <genome>
Options: -mapq Set the mappinq quality for the BAM records. (INT) Default: 255
-ubam Write uncompressed BAM output. Default writes compressed BAM.
Notes: (1) BED files must be at least BED4 to create BAM (needs name field).
27.bed12tobed6
bed12ToBed6是一种方便的工具,它可以将BED12中的bed特征(即堆叠型的bed特征,如基因)转换为离散的bed特征。例如,对于一个有六个外显子的基因,bed12ToBed6会产生六个单独的BED6特征(即,每个外显子对应一个)。
- Usage:
bed12ToBed6 [OPTIONS] -i <BED12></pre>
- 基因多个外显子转换成每一个外显子单独一行
head data/knownGene.hg18.chr21.bed | tail -n 3
chr21 10079666 10120808 uc002yiv.1 0 - 10081686 1 0 1 2 0 6 0 8 0 4 528,91,101,215, 0,1930,39750,40927,
chr21 10080031 10081687 uc002yiw.1 0 - 10080031 1 0 0 8 0 0 3 1 0 2 200,91, 0,1565,
chr21 10081660 10120796 uc002yix.2 0 - 10081660 1 0 0 8 1 6 6 0 0 3 27,101,223,0,37756,38913,
head data/knownGene.hg18.chr21.bed | tail -n 3 | bed12ToBed6 -i stdin
chr21 10079666 10080194 uc002yiv.1 0 -
chr21 10081596 10081687 uc002yiv.1 0 -
chr21 10119416 10119517 uc002yiv.1 0 -
chr21 10120593 10120808 uc002yiv.1 0 -
chr21 10080031 10080231 uc002yiw.1 0 -
chr21 10081596 10081687 uc002yiw.1 0 -
chr21 10081660 10081687 uc002yix.2 0 -
chr21 10119416 10119517 uc002yix.2 0 -
chr21 10120573 10120796 uc002yix.2 0 -</pre>
Part5 : Fasta manipulation
28.getfasta
https://www.jianshu.com/p/6c3b87301491
根据坐标区域来从基因组里面提取fasta序列
- Usage
$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF></pre>
- example:
参考:# BED/GFF/VCF +reference --> fasta
bedtools getfasta -fi ~/biosoft/bowtie/hg19_index/hg19.fa -bed ../macs14_results/highQuality_summits.bed -fo highQuality.fa
bedtools getfasta -fi ~/biosoft/bowtie/hg19_index/hg19.fa -bed ../macs14_results/highQuality_peaks.bed -fo highQuality.fa
脚本里面用的是bed格式来记录坐标区域,参考基因组用-fi参数指定具体位置,输出的fasta序列文件用-fo参数指定
tips: 有用的三个参数
-s Force strandedness. If the feature occupies the antisense,
strand, the sequence will be reverse complemented.
- By default, strand information is ignored.
-name Use the name field for the FASTA header
-tab Write output in TAB delimited format.
- Default is FASTA format.
-s
: 当提取数据,区分正负链时候,需要添加-s 参数,自动提取此区间负链反向
的序列(方向互补)
-name
: 提取bed文件第四列区间name,为fasta 文件输出注释行.
-tab
: fasta文件名称和序列以tab 分割。
29.maskfasta
和getfasta 相反,屏蔽区间.
- Usage
$ bedtools maskfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA></pre>
- example:
$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
$ cat test.bed
chr1 5 10
$ bedtools maskfasta -fi test.fa -bed test.bed -fo test.fa.out
$ cat test.fa.out
>chr1
AAAAANNNNNCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG</pre>
30.nuc
分析FASTA文件中,bed 区间对应的核苷酸含量
Summary: Profiles the nucleotide content of intervals in a fasta file.
Usage: bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
Part6 : BAM focused tools
31.multicov
https://www.jianshu.com/p/6c3b87301491
提供的每个bed间隔,它报告来自每个BAM文件的重叠对齐的单独计数。类似功能 featurecount/deeptools multiBamSummary
- Usage:
bedtools multicov [OPTIONS] -bams BAM1 BAM2 BAM3 ... BAMn -bed <BED/GFF/VCF></pre>
- example:
对RNA-seq的比对文件中的比对到各个基因的reads进行计数。**
# 例子:
bedtools multicov -bams aln1.bam aln2.bam aln3.bam -bed ivls-of-interest.bed
# ivls-of-interest.bed这个文件是必须的,可能需要自己制作,其实用gtf文件也可以的,如下:
chr1 0 10000 ivl1
chr1 10000 20000 ivl2
chr1 20000 30000 ivl3
chr1 30000 40000 ivl4
输出结果前三列是坐标,第四列是基因名,跟我们的bed文件一样,只是最后三列是三个样本的计数,是添加上来的!
chr1 0 10000 ivl1 100 2234 0
chr1 10000 20000 ivl2 123 3245 1000
chr1 20000 30000 ivl3 213 2332 2034
chr1 30000 40000 ivl4 335 7654 0</pre>
32. tag
Tag BAM alignments based on overlaps with interval files.
Summary: Annotates a BAM file based on overlaps with multiple BED/GFF/VCF files on the intervals in -i.
Usage: bedtools tag [OPTIONS] -i <BAM> -files FILE1 .. FILEn -labels LAB1 .. LABn
Part 7 :Statistical relationships
33.jaccard
检测两个数据之间的相关性
引入一个新的bedtools工具jaccard
,它会计算一个杰卡德相似性系数
结果是0.0 to 1. 0的值,数越小相关性越小
# 检测同一个样本的不同数据【系数是0.50637】
bedtools jaccard \
-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
-b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
intersection union-intersection jaccard n_intersections
81269248 160493950 0.50637 130852
# 再看不同的样本的不同数据【系数是0.170995】
bedtools jaccard \
-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
-b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed
intersection union-intersection jaccard n_intersections
28076951 164197278 0.170995 73261</pre>
##### 另外,还能分析更多的样本之间相关性
这个就看(官网[http://quinlanlab.org/tutorials/bedtools/bedtools.html](https://links.jianshu.com/go?to=http%3A%2F%2Fquinlanlab.org%2Ftutorials%2Fbedtools%2Fbedtools.html))翻到最底部
![img](https://upload-images.jianshu.io/upload_images/9376801-806a9711daa5ebfe.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/878/format/webp)
34.reldist
计算两个文件的相对距离
总结两组基因组区间之间相似性的传统方法是基于相交区间的数量或比例。 但是,这种测量很大程度上看不到两组之间的空间相关性,尽管间距或邻近度一致,但相交很少见(例如,增强子和转录起始位点很少重叠,但与两组随机数相比,它们彼此之间的距离更近 间隔)。 Favorov等人[1]提出了一种相对距离度量标准,该度量描述了一组中每个间隔与另一组中两个最近间隔之间的相对距离分布(请参见上图)。 如果两组之间没有空间相关性,则可以期望相对距离在0到0.5的相对距离之间均匀分布。 但是,如果间隔趋于比偶然预期的要近得多,则观察到的相对距离的分布将朝较低的相对距离值(例如,下图)移动。
[1] Exploring Massive, Genome Scale Datasets with the GenometriCorr Package.
Favorov A, Mularoni L, Cope LM, Medvedeva Y, Mironov AA, et al. (2012)
PLoS Comput Biol 8(5): e1002529\. doi:10.1371/journal.pcbi.1002529</pre>
Usage:
bedtools reldist [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF></pre>
example:
$ bedtools reldist \
-a data/refseq.chr1.exons.bed.gz \
-b data/
aluY.chr1.bed.gz
0.00 164 43408 0.004
0.01 551 43408 0.013
0.02 598 43408 0.014
0.03 637 43408 0.015
0.04 793 43408 0.018
0.05 688 43408 0.016
0.06 874 43408 0.020
0.07 765 43408 0.018
0.08 685 43408 0.016
0.09 929 43408 0.021
0.10 876 43408 0.020
0.11 959 43408 0.022
0.12 860 43408 0.020
0.13 851 43408 0.020
0.14 903 43408 0.021
0.15 893 43408 0.021
0.16 883 43408 0.020
0.17 828 43408 0.019
0.18 917 43408 0.021
0.19 875 43408 0.020
0.20 897 43408 0.021
0.21 986 43408 0.023
0.22 903 43408 0.021
0.23 944 43408 0.022
0.24 904 43408 0.021
0.25 867 43408 0.020
0.26 943 43408 0.022
0.27 933 43408 0.021
0.28 1132 43408 0.026
0.29 881 43408 0.020
0.30 851 43408 0.020
0.31 963 43408 0.022
0.32 950 43408 0.022
0.33 965 43408 0.022
0.34 907 43408 0.021
0.35 884 43408 0.020
0.36 965 43408 0.022
0.37 944 43408 0.022
0.38 911 43408 0.021
0.39 939 43408 0.022
0.40 921 43408 0.021
0.41 950 43408 0.022
0.42 935 43408 0.022
0.43 919 43408 0.021
0.44 915 43408 0.021
0.45 934 43408 0.022
0.46 843 43408 0.019
0.47 850 43408 0.020
0.48 1006 43408 0.023
0.49 937 43408 0.022</pre>
35.fisher
对2个文件之间的重叠/唯一区间进行费舍尔的精确测试。
Given a pair of input files -a and -b in the usual BedTools parlance:
$ cat a.bed
chr1 10 20
chr1 30 40
chr1 51 52
$ cat b.bed
chr1 15 25
chr1 51 52</pre>
And a genome of 500 bases:
$ echo -e "chr1\t500" > t.genome</pre>
We may wish to know **if the amount of overlap between the 2 sets of intervals is more than we would expect given their coverage and the size of the genome**. We can do this with `fisher` as:
$ bedtools fisher -a a.bed -b b.bed -g t.genome
# Number of query intervals: 3
# Number of db intervals: 2
# Number of overlaps: 2
# Number of possible intervals (estimated): 37
# phyper(2 - 1, 3, 37 - 3, 2, lower.tail=F)
# Contingency Table Of Counts
#_________________________________________
# | in -b | not in -b |
# in -a | 2 | 1 |
# not in -a | 0 | 34 |
#_________________________________________
# p-values for fisher's exact test
left right two-tail ratio
1 0.0045045 0.0045045 inf</pre>
Part8 : Miscellaneous tools
36.overlap
和intersect 功能类似,将两个输入文件合并了,但是通过参数来指定那几列进行比较
Usage:
overlap [OPTIONS] -i <input> -cols s1,e1,s2,e2</pre>
| Option | Description |
| --- | --- |
| **-i** | Input file. Use “stdin” for pipes. |
| **-cols** | Specify the columns (1-based) for the starts and ends of the features for which you’d like to compute the overlap/distance. The columns must be listed in the following order: *start1,end1,start2,end2* . |
example:
windowBed -a A.bed -b B.bed -w 10
chr1 10 20 A chr1 15 25 B
chr1 10 20 C chr1 25 35 D
## 指定2,3 列与6,7 列比较
windowBed -a A.bed -b B.bed -w 10 | overlap -i stdin -cols 2,3,6,7
chr1 10 20 A chr1 15 25 B 5
chr1 10 20 C chr1 25 35 D -5</pre>
37.igv
用于生成一个脚本,批量捕获IGV截图
Summary: Creates a batch script to create IGV images at each interval defined in a BED/GFF/VCF file.
Usage: bedtools igv [OPTIONS] -i <bed/gff/vcf>
38. links
创建一个链接到UCSC的HTML页面
Usage:
linksBed [OPTIONS] -i <BED/GFF/VCF> > <HTML file></pre>
| Option | Description |
| --- | --- |
| **-base** | The “basename” for the UCSC browser. *Default: [http://genome.ucsc.edu](http://genome.ucsc.edu)* |
| **-org** | The organism (e.g. mouse, human). *Default: human* |
| **-db** | The genome build. *Default: hg18* |
example: **linksBed** creates links to the public UCSC Genome Browser.
head -3 genes.bed
chr21 9928613 10012791 uc002yip.1 0 -
chr21 9928613 10012791 uc002yiq.1 0 -
linksBed -i genes.bed -base http://mirror.uni.edu -org mouse -db mm9 > genes.html</pre>
39.makewindows
https://www.jianshu.com/p/7d47d8074bba
把给定区域划分成指定大小和间隔的小区间 (bin)
- 参考染色体大小文件
chrom.size
chr1 100
chr2 150
- 将染色体划分为20bp为一个bin的区间
$ bedtools makewindows -g chrom.size -w 20
chr1 0 20
chr1 20 40
chr1 40 60
chr1 60 80
chr1 80 100
chr2 0 20
chr2 20 40
chr2 40 60
chr2 60 80
chr2 80 100
chr2 100 120
chr2 120 140
chr2 140 150
40.groupby
https://www.jianshu.com/p/548d370b75a4
分组结算,不只可以用于bed文件。
以某一列进行分组,运用不同的函数
Usage
bedtools groupby [OPTIONS] -i <input> -g <group columns> -c <op. column> -o <operation>
41. expand
将某一列是逗号分隔的,差分成多行显示.
Summary: Replicate lines in a file based on columns of comma-separated values.
Usage: bedtools expand -c [COLS] Options: -i Input file. Assumes "stdin" if omitted.
-c Specify the column (1-based) that should be summarized.
* Examples:
$ cat test.txt
chr1 10 20 1,2,3 10,20,30
chr1 40 50 4,5,6 40,50,60
$ bedtools expand test.txt -c 5
chr1 10 20 1,2,3 10
chr1 10 20 1,2,3 20
chr1 10 20 1,2,3 30
chr1 40 50 4,5,6 40
chr1 40 50 4,5,6 50
chr1 40 50 4,5,6 60
$ bedtools expand test.txt -c 4,5
chr1 10 20 1 10
chr1 10 20 2 20
chr1 10 20 3 30
chr1 40 50 4 40
chr1 40 50 5 50
chr1 40 50 6 60</pre>
42. split
将一个文件分割成多个具有相同记录或基对的文件
Summary: Split a Bed file.
Usage: bedtools split [OPTIONS] -i <bed> -n number-of-files
Options:
-i|--input (file) BED input file (req'd).
-n|--number (int) Number of files to create (req'd).
-p|--prefix (string) Output BED file prefix.
-a|--algorithm (string) Algorithm used to split data.
so all files contain the ~ same number of bases
* simple : route records such that each split file has
approximately equal records (like Unix split).</pre>
-h|--help Print help (this screen). -v|--version Print version.
Note: This programs stores the input BED records in memory.
小结:
bedtools 很强大,后面用到了再对细节进行补充
有很多不足地方,希望大家留言指正~~~