采用Trinotate对拼接结果进行注释

一：开始之前需要准备三个东西：

第一：安装软件

1.Trinity（用于生成拼接后的转录本fasta文件）

2.TransDecoder（用于预测转录本的蛋白编码区域）

3.SQLite （用于整合数据库数据，我下的是Trinotate_v3.sqlite）

4.NCBI BLAST+(用于比对Blast库）

5.HMMER/PFAM（用于通过HMMER工具注释蛋白质结构域）

第二：构建所需数据库

## sqlite3 --help cd

$~/biosoft mkdir Trinotate && cd Trinotate

# http://trinotate.github.io/

# https://github.com/Trinotate/Trinotate/releases

$wget https://github.com/Trinotate/Trinotate/archive/v3.0.1.tar.gz -O Trinotate.v3.0.1.tar.gz

$tar zxvf Trinotate.v3.0.1.tar.gz

$~/biosoft/Trinotate/Trinotate-3.0.1/Trinotate -h

$wget https://data.broadinstitute.org/Trinity/Trinotate_v3_RESOURCES/Pfam-A.hmm.gz

$wget https://data.broadinstitute.org/Trinity/Trinotate_v3_RESOURCES/uniprot_sprot.pep.gz

$wget https://data.broadinstitute.org/Trinity/Trinotate_v3_RESOURCES/Trinotate_v3.sqlite.gz -O Trinotate.sqlite.gz

$gunzip Trinotate.sqlite.gz gunzip uniprot_sprot.pep.gz

$makeblastdb -in uniprot_sprot.pep -dbtype prot

$gunzip Pfam-A.hmm.gz

$hmmpress Pfam-A.hmm

最后生成一下文件：

-rw-rw-r--  1 spider spider 1348848383 Mar  7  2016Pfam-A.hmm 数据库文件

-rw-rw-r--  1 spider spider  304256653 Apr 13 10:02 Pfam-A.hmm.h3f

-rw-rw-r--  1 spider spider    1124460 Apr 13 10:02 Pfam-A.hmm.h3i

-rw-rw-r--  1 spider spider  558372538 Apr 13 10:02 Pfam-A.hmm.h3m

-rw-rw-r--  1 spider spider  656774195 Apr 13 10:02 Pfam-A.hmm.h3p

drwxrwxr-x 10 spider spider       4096 Apr 11  2016 Trinotate-3.0.1

-rw-rw-r--  1 spider spider   29102393 Apr 12 11:46 Trinotate.v3.0.1.tar.gz

-rw-rw-r--  1 spider spider  359515136 Mar  7  2016 Trinotate_v3.sqlite

-rw-------  1 spider spider        317 Apr 12 22:07 nohup.out

-rw-rw-r--  1 spider spider  232512443 Mar  7  2016uniprot_sprot.pep 数据库文件

-rw-rw-r--  1 spider spider   70142548 Apr 12 22:07 uniprot_sprot.pep.phr

-rw-rw-r--  1 spider spider    4404496 Apr 12 22:07 uniprot_sprot.pep.pin

-rw-rw-r--  1 spider spider  197023228 Apr 12 22:07 uniprot_sprot.pep.psq

第三：需要我们用TransDecoder构建得到了转录本的编码蛋白的蛋白序列（我的如下）

准备就绪就开始了。

二：序列比对

接下来的工作就是通过之前下载配置好的Blast+和HMMER进行比对和预测（当然两者工具采用的算法会有差异）

$nohup blastx -query /data1/spider/ytbiosoft/data/trinity.all/trinity_out_dir_all.Trinity.fasta -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6 & blastx比对

$nohup blastp -query /data1/spider/ytbiosoft/data/trinity.all/TransDecoder.LongOrfs/trinity_out_dir_all.Trinity.fasta.transdecoder.pep -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastp.outfmt6 & blastp比对

这条命令都是采用的Blast比对算法进行的

还有一种是HMMER基于隐马尔科夫链的算法进行的。

看一下hmmscan的帮助文档

$hmmscan -h

# hmmscan :: search sequence(s) against a profile database

# HMMER 3.2.1 (June 2018); http://hmmer.org/

# Copyright (C) 2018 Howard Hughes Medical Institute.

# Freely distributed under the BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Usage: hmmscan [-options] <hmmdb> <seqfile>

Basic options:

  -h : show brief help on version and usage

Options controlling output:

  -o <f>           : direct output to file <f>, not stdout

  --tblout <f>     : save parseable table of per-sequence hits to file <f>

  --domtblout <f>  : save parseable table of per-domain hits to file <f>

  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>

  --acc            : prefer accessions over names in output

  --noali          : don't output alignments, so output is smaller

  --notextw        : unlimit ASCII text output line width

  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

Options controlling reporting thresholds:

  -E <x>     : report models <= this E-value threshold in output  [10.0]  (x>0)

  -T <x>     : report models >= this score threshold in output

  --domE <x> : report domains <= this E-value threshold in output  [10.0]  (x>0)

  --domT <x> : report domains >= this score cutoff in output

Options controlling inclusion (significance) thresholds:

  --incE <x>    : consider models <= this E-value threshold as significant

  --incT <x>    : consider models >= this score threshold as significant

  --incdomE <x> : consider domains <= this E-value threshold as significant

  --incdomT <x> : consider domains >= this score threshold as significant

Options for model-specific thresholding:

  --cut_ga : use profile's GA gathering cutoffs to set all thresholding

  --cut_nc : use profile's NC noise cutoffs to set all thresholding

  --cut_tc : use profile's TC trusted cutoffs to set all thresholding

Options controlling acceleration heuristics:

  --max    : Turn all heuristic filters off (less speed, more power)

  --F1 <x> : MSV threshold: promote hits w/ P <= F1  [0.02]

  --F2 <x> : Vit threshold: promote hits w/ P <= F2  [1e-3]

  --F3 <x> : Fwd threshold: promote hits w/ P <= F3  [1e-5]

  --nobias : turn off composition bias filter

Other expert options:

  --nonull2     : turn off biased composition score corrections

  -Z <x>        : set # of comparisons done, for E-value calculation

  --domZ <x>    : set # of significant seqs, for domain E-value calculation

  --seed <n>    : set RNG seed to <n> (if 0: one-time arbitrary seed)  [42]

  --qformat <s> : assert input <seqfile> is in format <s>: no autodetection

  --cpu <n>     : number of parallel CPU workers to use for multithreads  [2]

开始用hmmer算法比对：

$hmmscan --cpu 4 --domtblout TrinotatePFAM.out Pfam-A.hmm /data1/spider/ytbiosoft/data/trinity.all/TransDecoder.LongOrfs/trinity_out_dir_all.Trinity.fasta.transdecoder.pep > pfam.log

所以最后总共会生成2个基于blast的文件和一个基于hmmer的文件：

三：比对完之后我们需要把数据整合导入到SQL数据库中，进行统一管理。

首先得到的结果放到Trinotate SQLite数据库中进行合并。因此我们需要把以下的几个结果进行Load到SQLite db中，在我们这个工作中的SQLite db就是Trinotate_v3.sqlite

1.转录本和蛋白数据（就是上面比对用到的*.fasta文件和*..transdecoder.pep文件）

2.Blast的结果（这个包括蛋白比对和核酸比对的两个结果）（blastx与blastp的结果文件）

3.HMMER比对的Pfam结果（hmmscan比对结果文件）

那么接下来就是采用以下命令进行（这个要明白Trinotate就是一个把数据装到数据库的工具）

1. 导入转录本和蛋白数据

$TrinotateSeqLoader.pl #调用的脚本

--sqliteTrinotate_v3.sqlite#导入的数据库（就是之前你下的）

--gene_trans_map /data1/spider/ytbiosoft/data/trinity.all/trinity_out_dir_all.Trinity.fasta.gene_trans_map  #导入的是转录本与基因的关系文件

--transcript_fasta /data1/spider/ytbiosoft/data/trinity.all/trinity_out_dir_all.Trinity.fasta  #导入转录本文件

--transdecoder_pep     /data1/spider/ytbiosoft/data/trinity.all/TransDecoder.LongOrfs/trinity_out_dir_all.Trinity.fasta.transdecoder.pep #导入转录本编码蛋白质文件

结果如下：

$TrinotateSeqLoader.pl --sqlite Trinotate_v3.sqlite --gene_trans_map /data1/spider/ytbiosoft/data/trinity.all/trinity_out_dir_all.Trinity.fasta.gene_trans_map --transcript_fasta /data1/spider/ytbiosoft/data/trinity.all/trinity_out_dir_all.Trinity.fasta --transdecoder_pep /data1/spider/ytbiosoft/data/trinity.all/TransDecoder.LongOrfs/trinity_out_dir_all.Trinity.fasta.transdecoder.pep

-parsing gene/trans map file.... done.

-loading Transcripts.

[375700]

done.

-loading ORFs.

[180000]

done.

Loading complete..

2.接下来导入Blast比对结果：

$Trinotate_BLAST_loader.pl --sqlite Trinotate_v3.sqlite --outfmt6 blastx.outfmt6 --prog blastx --dbtype Swissprot

#调用的脚本 #处理的数据库 #导入的是blastx的结果

$Trinotate_BLAST_loader.pl --sqlite Trinotate_v3.sqlite --outfmt6 blastp.outfmt6 --prog blastp --dbtype Swissprot

#调用的脚本 #处理的数据库 #导入的是blastp的结果

3.最后导入HMMER比对结果

$Trinotate_PFAM_loader.pl --sqlite Trinotate_v3.sqlite --pfam TrinotatePFAM.out

#调用的脚本 #处理的数据库 #导入的是pfam的结果

最终，这样我们就将之前的结果导入到了我们 Trinotate_v3.sqlite 这个文件中。之后在其他的地方我们会使用到这个，而且这个数据库也可以通过脚本Trinotate_report_writer.pl直接进行调用：

$Trinotate_report_writer.pl--sqlite Trinotate_v3.sqlite > database_report.xls

最终我们将blast结果以及hmmer预测结果进行了整合，同时发现SQL数据库的管理在大数据整合过程中非常重要，而且SQL数据库还可以在其他地方被调用。

最后编辑于：2023.06.28 20:13:17

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 199,478评论 5赞 467
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 83,825评论 2赞 376
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 146,482评论 0赞 330
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 53,726评论 1赞 271
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 62,633评论 5赞 359
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,018评论 1赞 275
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,513评论 3赞 390
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,168评论 0赞 254
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,320评论 1赞 294
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,264评论 2赞 317
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,288评论 1赞 328
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 32,995评论 3赞 315
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,587评论 3赞 303
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,667评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,909评论 1赞 255
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,284评论 2赞 345
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 41,862评论 2赞 339

采用Trinotate对拼接结果进行注释

推荐阅读更多精彩内容