bed基因注释

日常瞎掰

  工欲善其事必先利其器,好用的工具对做事来说很重要,可以让你做起事来收到事半功倍的效果。前段时间分析数据时,需要给intervals做基因注释,需要操作gtf文件,最后发现一款非常好用的python包 — pyranges。该包可以很轻松地将bedgtfgff3等格式文件读取为PyRanges对象,该对象有点类似潘大师的DataFrame,熟悉潘大师的同学应该能体会到DataFrame操作数据的快乐。同样,皮软杰斯内置操作intervals的方法也绝对能给你带来无语伦比的畅快。废话不多说,下面来感受一下使用皮软杰斯做注释基因有多方便。

读取文件

  read_bed方法可以用来读取bed格式的文件,read_gtfread_gff3分别用来读取gtfgff3文件:

import pyranges as pr

bed = pr.read_bed('data.bed')
bed
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

gtf = pr.read_gtf('gencode.vM24.annotation.gtf')
gtf
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
| Chromosome   | Source     | Feature     | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type      | +15   |
| (category)   | (object)   | (object)    | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)       | ...   |
|--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|
| chr1         | HAVANA     | gene        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | transcript  | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | exon        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | ENSEMBL    | gene        | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA          | ...   |
| ...          | ...        | ...         | ...       | ...       | ...        | ...          | ...        | ...                  | ...            | ...   |
| chrY         | ENSEMBL    | CDS         | 90838871  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | start_codon | 90839174  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | stop_codon  | 90838868  | 90838871  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | UTR         | 90838868  | 90838871  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding | ...   |
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

  上面说过PyRanges对象类似潘大师的DataFrame,所以可以用类似的方法来过滤数据,如下面只保留gtf里面Feature属性为gene的行:

gene = gtf[gtf.Feature == 'gene']
gene
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type            | +15   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)             | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|
| chr1         | HAVANA     | gene       | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC                  | ...   |
| chr1         | ENSEMBL    | gene       | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA                | ...   |
| chr1         | HAVANA     | gene       | 3252756   | 3253236   | .          | +            | .          | ENSMUSG00000102851.1 | processed_pseudogene | ...   |
| chr1         | HAVANA     | gene       | 3466586   | 3513553   | .          | +            | .          | ENSMUSG00000089699.1 | lncRNA               | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                  | ...                  | ...   |
| chrY         | HAVANA     | gene       | 90603500  | 90605864  | .          | -            | .          | ENSMUSG00000099619.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90665345  | 90667625  | .          | -            | .          | ENSMUSG00000099399.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90752426  | 90755467  | .          | -            | .          | ENSMUSG00000095366.2 | lncRNA               | ...   |
| chrY         | ENSEMBL    | gene       | 90838868  | 90839177  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding       | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

基因注释

  皮软杰斯里面有nearestk_nearest两种方法可以用。通过函数名我们就可以猜到这两个函数都可以用来查找interval1集合里每一个区间在interval2集合里面距离最近的区间,默认是返回有交集的区间,而k_nearest可以指定返回的区间个数。通过代码看一下区别:

bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])
bed
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int32) |   (int32) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |   3074300 |   3074322 | +            |
+--------------+-----------+-----------+--------------+

bed_test.nearest(gtf,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |     3073252 |   3074322 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.nearest(gene,overlap=False,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |     3102015 |   3102125 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.k_nearest(gene,k=5,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_b |     End_b | Score      | Strand_b     | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |   (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3073252 |   3074322 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |   3102015 |   3102125 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3205900 |   3671498 | .          | -            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3252756 |   3253236 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3365730 |   3368549 | .          | -            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+

  从上面的代码可知有无overlap=False参数的区别,以及nearestk_nearest函数的区别。当然,还有一些比较实用的参数如strandedness:"same",指定链是否相同;how:"upstream"、 "downstream"分别指定选取最近区间的方位,nb_cpu:指定使用的cpu数。除此之外,k_nearest还有一个参数ties可以指定有距离相同的区间时如何返回,可选值为"first"、"last"、"different"。
  suffix参数指定当两个集合有相同列名时,后一个集合列名会添加的字段,默认是'_b'。不过,好像k_nearest函数有BUG这个参数无效,但并不影响结果可以忽略。
  从上面可以看出k_nearest函数功能上略胜一筹,注释基因时用此函数相对更自由一些:

anno = bed.k_nearest(gene, ties='different')
anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]
anno
+--------------+-----------+-----------+--------------+----------------------+
| Chromosome   | Start     | End       | Strand       | gene_id              |
| (category)   | (int32)   | (int32)   | (category)   | (object)             |
|--------------+-----------+-----------+--------------+----------------------|
| chr1         | 212609534 | 212609559 | +            | ENSMUSG00000102307.1 |
| chr1         | 169887529 | 169887554 | +            | ENSMUSG00000103110.1 |
| chr1         | 216711011 | 216711036 | +            | ENSMUSG00000102307.1 |
| chr1         | 144227079 | 144227104 | +            | ENSMUSG00000100389.1 |
| ...          | ...       | ...       | ...          | ...                  |
| chrY         | 15224235  | 15224260  | -            | ENSMUSG00000102835.5 |
| chrY         | 13517892  | 13517917  | -            | ENSMUSG00000102174.1 |
| chrY         | 8010951   | 8010976   | -            | ENSMUSG00000099861.1 |
| chrY         | 7405376   | 7405401   | -            | ENSMUSG00000118447.1 |
+--------------+-----------+-----------+--------------+----------------------+

anno.to_csv("data_anno.txt", sep="\t", header=True)

结束语

  pyranges包的功能还是挺多的,不仅包含区间的交集、差集,还可以进行一些统计运算如JaccardFisher等。虽然,目前有不少现成的软件如homerchipseeker可以做基因注释,很多时候我们可以直接使用这些软件即可,但pyranges还是值得学习收藏一下,也许做个性化数据处理的时候使用它会来得更为方便些。


往期回顾

scanpy踩坑实录
差异基因密度分布
R绘图配色总结
saddleplot | A/B compartments
双曲线火山图一键拿捏

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 199,271评论 5 466
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 83,725评论 2 376
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 146,252评论 0 328
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 53,634评论 1 270
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 62,549评论 5 359
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 47,985评论 1 275
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,471评论 3 390
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,128评论 0 254
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,257评论 1 294
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,233评论 2 317
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,235评论 1 328
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,940评论 3 316
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,528评论 3 302
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,623评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,858评论 1 255
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,245评论 2 344
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 41,790评论 2 339

推荐阅读更多精彩内容