可以使用samtools fastq命令从bam文件中提取特定的fastq序列。
使用帮助
samtools fastq
Usage: samtools fastq [options...] <in.bam>
Options:
-0 FILE write reads designated READ_OTHER to FILE
-1 FILE write reads designated READ1 to FILE
-2 FILE write reads designated READ2 to FILE
note: if a singleton file is specified with -s, only
paired reads will be written to the -1 and -2 files.
-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-n don't append /1 and /2 to the read name
-N always append /1 and /2 to the read name
-O output quality in the OQ tag if present
-s FILE write singleton reads designated READ1 or READ2 to FILE
-t copy RG, BC and QT tags to the FASTQ header line
-T TAGLIST copy arbitrary tags to the FASTQ header line
-v INT default quality score if not given in file [1]
-i add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
-c compression level [0..9] to use when creating gz or bgzf fastq files
--i1 FILE write first index reads to FILE
--i2 FILE write second index reads to FILE
--barcode-tag TAG Barcode tag [default: BC]
--quality-tag TAG Quality tag [default: QT]
--index-format STR How to parse barcode and quality tags
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.
Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.
Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set
or both unset.
Run 'samtools flags' for more information on flag codes and meanings.
The index-format string describes how to parse the barcode and quality tags, for example:
i14i8 the first 14 characters are index 1, the next 8 characters are index 2
n8i14 ignore the first 8 characters, and use the next 14 characters for index 1
If the tag contains a separator, then the numeric part can be replaced with '*' to mean
'read until the separator or end of tag', for example:
n*i* ignore the left part of the tag until the separator, then use the second part
of the tag as index 1
Examples:
To get just the paired reads in separate files, use:
samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam
To get all non-supplementary/secondary reads in a single file, redirect the output:
samtools fastq -F 0x900 in.bam > all_reads.fq
Flags对应的意义如下:
BAM比对文件中flags对应的意义 - 简书 (jianshu.com)
以上可以知道数字4可以将未比对上序列的和比对上序列分开。
数字64或128可以将read1和read2区分开。
这是示例序列的比对结果:
36599658 reads; of these:
36599658 (100.00%) were paired; of these:
36567223 (99.91%) aligned concordantly 0 times
32435 (0.09%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
36567223 pairs aligned concordantly 0 times; of these:
182 (0.00%) aligned discordantly 1 time
----
36567041 pairs aligned 0 times concordantly or discordantly; of these:
73134082 mates make up the pairs; of these:
73133169 (100.00%) aligned 0 times
913 (0.00%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.09% overall alignment rate
提取所有reads
双末端测序
ls *.bam |while read id
do
samtools fastq -@ 4 -N $id -1 ${id%%_*}_1.fq -2 ${id%%_*}_2.fq
gzip ${id%%_*}_1.fq ${id%%_*}_2.fq
done
单末端测序
ls *.bam |while read id
do
samtools fastq -@ 2 -N $id > ${id%%_*}.fq
gzip ${id%%_*}.fq
done
如果要提取特定状态的reads,需要使用f(提取)或F(过滤)参数+flags(BAM比对文件中flags对应的意义 - 简书 (jianshu.com)
),例如:-f 2为提取两条reads都比对上的reads;-F 4:提取所有比对上的reads(包含双末端中只有单条比对上的reads);-F 8过滤两条reads都没比对上的reads。
有些状态的reads不能通过一个步骤就将其提取;例如配对的两条reads只有其中一条比对上,我想提取另一条。可以分为两步提取:
samtools view -f 4 -b in.bam | samtools fastq -F 8 -1 read1-1.fq -2 read2-2.fq
或者两条reads只有其中一条比对上,我想提取比对上的这一条
samtools view -F 4 -b in.bam | samtools fastq -F 2 -1 read1-1.fq -2 read2-2.fq