跟着Science学保守元件(Conserved Elemtns)的鉴定

0.代码来源
文章参考:
《Large-scale ruminant genome sequencing provides insights into their evolution and distinct traits》
DOI: 10.1126/science.aav6202
后面简称'原文'
本文很贴心的给出了整个保守元件(CE, Conserved Element)的鉴定流程以及所用到的脚本.
https://zenodo.org/record/2549147
如果用到这个代码，请注明代码来源并引用这篇文章

软件的安装
参见: http://compgen.cshl.edu/phast

2.多基因组比对
这里也可以参考以下两个链接
https://biochen.org/cn/blog/2021/%E5%A2%9E%E5%BC%BA%E7%89%88%E4%B8%A4%E5%9F%BA%E5%9B%A0%E7%BB%84%E6%AF%94%E5%AF%B9/
https://biochen.org/cn/blog/2021/%E5%A4%9A%E5%9F%BA%E5%9B%A0%E7%BB%84%E6%AF%94%E5%AF%B9/

假设已经拿到了最后的maf文件
这里只演示一条染色体的结果

3.1 mod文件的获取
PS: 原文获取mod的做法较为繁琐
具体见本文末尾 4. 一节, 有完整的复现

这里先展示一种较为简单的方法,
此处参考: https://biochen.org/cn/blog/2021/%E4%BD%BF%E7%94%A8PHAST%E5%88%86%E6%9E%90%E4%BF%9D%E5%AE%88%E6%80%A7/
以及官方给出的pipeline (真的比原文的做法简单太多):
http://compgen.cshl.edu/phast/phyloP-tutorial.php

phyloFit  -i MAF  Chr1.maf  --tree  "(((((A, B), (C, D)), E), F), G)"
#输出文件phyloFit.mod
#用作后续分析

输出文件phyloFit.mod以用作后续分析

3.2 CE的鉴定

phastCons  Chr1.maf   phyloFit.mod   --target-coverage 0.25 --expected-length 12 --rho 0.4 --msa-format MAF --seqname Chr1  --most-conserved Chr1.CE.bed > Chr1.conserved.score.wig

输出文件1是Chr1.CE.bed, 这个就是保守元件的bed文件, 还有一个Chr1.conserved.score.wig是CE每个bp的保守性得分
其实做到这里就已经算是完成了, 但是可以再多做一点
比如我的进化树包含了 A - G总共7个物种，但是A-D才是我的研究物种, 我想把CE过滤到A-D上
按照原文的做法, 有两种过滤方式, 首先把长度小于20bp的CE都过滤掉(这一步骤本文就略去)

第一是CE的 EFG三个外群物种上是没有序列可以比对到的
或者说只有A-D也就是研究物种中才能找到的，研究物种特有的CE

第二是CE在外群物种中也有, 但是在研究物种中是 '保守的' 或者 '加速进化的' (基于看不懂的LRT test一大堆, 参考http://compgen.cshl.edu/phast/help-pages/phyloP.txt)

3.3 过滤CE (具体用到的脚本在本文末尾或者从zendo上自己下载)
3.3.1 首先是得到第一种CE:
修改脚本maf_bed_filter_Type1.pl
把这个脚本的标注出来的这一行, 修改为maf文件中外群物种的具体名称

修改我高亮的地方

perl maf_bed_filter_Type1.pl Chr1.CE.bed Chr1.maf ./
#会生成work.out， 这个就是外群没有，研究物种特有的CE，但是格式不是bed格式

#下面就是把work.out转为bed文件
perl work_out2bed.pl work.out Chr1 ./ 
#输出文件是Chr1.bed, 这就是第一种CE了


#做一下长度过滤, 低于20bp的扔掉
for i in {1..28}
do
awk '{if($3-$2>19)print}' Chr${i}.bed > Chr${i}.tmp && rm Chr${i}.bed && rename .tmp .Type1.bed *
done

3.3.2 其次是第二种CE
修改脚本01.convertMaf2List.pl
第一个红色箭头修改为maf文件作为参考基因组的物种的具体名称
第二个红色箭头修改为maf文件的所有物种的名称

修改红色箭头所示地方

修改mod文件 phyloFit.mod
在研究物种的祖先节点处加上标识, 这里表示我把我的研究物种的祖先节点命名为'Node'

修改红色箭头所示的地方

文件修改完毕，开始跑后续

perl   01.convertMaf2List.pl  Chr1.maf   Chr1
#输出的Chr1.maf.lst用作后续

phyloP --subtree Node --msa-format MAF --features Chr1.CE.bed -method LRT --mode CONACC phyloFit.mod Chr1.maf > Chr1.phyloP.bed
#--subtree Node 这里就是前面提及的研究物种的祖先节点
#--features Chr1.CE.bed 最开始鉴定出来的CE
#-method LRT  大概是用了什么统计检验, 但我看不懂
#--mode CONACC  这个模式下除了能看研究物种相比于外群物种保守, 还能看研究物种相比于外群物种快速进化的

#最后输出Chr1.phyloP.bed

输出文件Chr1.phyloP.bed长这样:

Chr1.phyloP.bed

最后一列如果小于0, 表示研究物种相比于外群物种快速进化的CE
最后一列如果大于0，研究物种相比于外群物种保守的CE

最后再卡一下显著性阈值和做一下p值校正, 这里就不再赘述

 head -n 1 Chr10.phyloP.bed > head

for i in {1..28}
do
awk '{if($9<0.05 && $9>0)print}' Chr${i}.phyloP.bed > Chr${i}.tmp && cat head Chr${i}.tmp > Chr${i}.phyloP.conservation.bed && rm Chr${i}.tmp
done

for i in {1..28}             
do
awk '{if($9>-0.05 && $9<0)print}' Chr${i}.phyloP.bed > Chr${i}.tmp && cat head Chr${i}.tmp > Chr${i}.phyloP.acceleration.bed && rm Chr${i}.tmp 
done

for i in {1..28}
do
awk '{if($3-$2>19)print}' Chr${i}.phyloP.acceleration.bed > Chr${i}.tmp && cat head Chr${i}.tmp > ../Acceleration/Chr${i}.acceleration.CE.bed && rm Chr${i}.tmp
done

for i in {1..28}
do
awk '{if($3-$2>19)print}' Chr${i}.phyloP.conservation.bed > Chr${i}.tmp && cat head Chr${i}.tmp > ../Type2/Chr${i}.Type2.bed && rm Chr${i}.tmp          
done


#####
#使用wigToBigWig将wig文件转换为二进制的bigWig文件后，就能在Genome Browser中展示了
#wigToBigWig in.wig chrom.sizes out.bw

for i in {1..28}
do         
/data/01/user164/software/wigToBigWig Chr${i}.CE.wig Golani.Chr.v4.length Chr${i}.CE.bw
done

#假设给定一条序列比如lncRNA，我们想计算它的平均phastCons分数
#可以使用bigWigAverageOverBed实现
#bigWigAverageOverBed in.bw in.bed out.tab

for i in {1..28}
do
/data/01/user164/software/bigWigAverageOverBed Chr${i}.CE.bw /data/01/user164/workspace_for_whole_object/spalax_SV/10.CE/Ruminant.Pipe.9spp/CE.bed/Chr${i}.CE.bed Chr${i}.tmp;
awk '{print $1"\t"$2"\t"$4"\t"$5}' Chr${i}.tmp > Chr${i}.CE.score; rm Chr${i}.tmp;
done

cat *.CE.score | sort -Vk 1 > all.CE.score
awk '{if($2>19)print}' all.CE.score > all.CE.dayu20bp.score

4.原文的做法
4.1 4d.nonconserved.mod文件的获取
其实和我前文中相比没有太大改变, 只不过原文中用了4d位点
那么首先从maf文件中提取4D位点:
这里参考了: http://compgen.cshl.edu/phast/phastCons-HOWTO.html

#依然是使用phast包内置的软件
msa_view  Chr1.maf  --in-format MAF  --4d  --features Chr1.gff  > Chr1.4d-codons.ss
msa_view Chr1.4d-codons.ss --in-format SS --out-format SS --tuple-size 1 > Chr1.4d-sites.ss
#对每条染色体重复这个操作
for i in {1..10}
do
msa_view  split_maf/Chr/Chr${i}.maf --in-format MAF --4d --features gff/Chr${i}.gff > 4d.site/Chr${i}.4d.ss
msa_view 4d.site/Chr${i}.4d-codons.ss  --in-format SS --out-format SS --tuple-size 1 > 4d.site/Chr${i}.4d-sites.ss
done

msa_view --unordered-ss --out-format SS --aggregate go,ca,ju,ga,Mus,Rat,Hamster,Squirrel,Rabbit Chr1.4d-sites.ss,Chr2.4d-sites.ss,Chr3.4d-sites.ss,Chr4.4d-sites.ss,Chr5.4d-sites.ss,Chr6.4d-sites.ss,Chr7.4d-sites.ss,Chr8.4d-sites.ss,Chr9.4d-sites.ss,Chr10.4d-sites.ss > all-4d.sites.ss

phyloFit --tree species.tree -i SS all-4d.sites.ss --out-root 4d.nonconserved
#用上述流程即可



#以上是标准的流程, 由于我的maf文件存在一些问题
#故进行修改
for i in {1..28}
do
msa_view split_maf/Chr/Chr${i}.maf --in-format MAF --4d --features gff/Chr${i}.gff > 9spp.4d.site/Chr${i}.tmp1

msa_view 9spp.4d.site/Chr${i}.tmp1 --in-format SS --out-format SS --tuple-size 1 > 9spp.4d.site/Chr${i}.tmp2

msa_view 9spp.4d.site/Chr${i}.tmp2 --in-format SS --out-format FASTA > 9spp.4d.site/Chr${i}.tmp3

samtools faidx 9spp.4d.site/Chr${i}.tmp3   

for c in {go,ca,ga,ju,Mus,Rat,Hamster,Squirrel,Rabbit}
do
samtools faidx 9spp.4d.site/Chr${i}.tmp3 $c >> 9spp.4d.site/Chr${i}.tmp4      
done

mv 9spp.4d.site/Chr${i}.tmp4 9spp.4d.site/Chr${i}.4d-sites.fasta      
msa_view 9spp.4d.site/Chr${i}.4d-sites.fasta --in-format FASTA --out-format SS > 9spp.4d.site/Chr${i}.4d-sites.ss
done
msa_view --u
nordered-ss --out-format SS --aggregate go,ca,ju,ga,Mus,Rat,Hamster,Squirrel,Rabbit Chr1.4d-sites.ss,Chr2.4d-sites.ss,Chr3.4d-sites.ss,Chr4.4d-sites.ss,Chr5.4d-sites.ss,Chr6.4d-sites.ss,Chr7.4d-sites.ss,Chr8.4d-sites.ss,Chr9.4d-sites.ss,Chr10.4d-sites.ss > all-4d.sites.ss
phyloFit --tree species.tree -i SS all-4d.sites.ss --out-root 4d.nonconserved
##

这一套操作下来, 得到文件: 4d.nonconserved.mod

4.2 estimating rho

for i in {1..10}
do
mkdir Chr${i} && cd Chr${i}
msa_split Chr${i}.maf --in-format MAF  --windows 1000000,0 --out-root split --out-format SS --min-informative 1000 --between-blocks 5000  #切割maf文件
done


#以单条染色体为例, 每条染色体重复以下操作
ls *.ss > ss.list
for ss in `cat ss.list`;do phastCons --estimate-rho $ss --no-post-probs $ss 4d.nonconserved.mod ;done
#4d.nonconserved.mod就是4.1的输出文件

ls *.cons.mod >all.cons.mod.list
ls *.noncons.mod > all.noncons.mod.list
phyloBoot --read-mods all.cons.mod.list --output-average Chr1.ave.cons.mod
phyloBoot --read-mods all.noncons.mod.list --output-average Chr1.ave.noncons.mod

phastCons --most-conserved Chr1.CE.bed --score Chr1.maf Chr1.ave.cons.mod,Chr1.ave.noncons.mod \> Chr1.wig
#以单条染色体为例, 每条染色体重复以上操作
#如此原文的CE鉴定部分就结束, 后面对CE进行分类/过滤与本文的 3.3 一节相同

#区别在于用到的mod文件, 原文是用的4d位点生成的4D.nonconserced.mod
phyloP --subtree Node --msa-format MAF --features 1.HCE.bed --method LRT --mode CONACC 4D.nonconserced.mod 1.maf > Chr1.phyloP.bed

区别在于用到的mod文件, 原文是用的4d位点生成的4D.nonconserced.mod
phyloP --subtree Node --msa-format MAF --features 1.HCE.bed --method LRT --mode CONACC 4D.nonconserced.mod 1.maf > Chr1.phyloP.bed

5.原文中提供的脚本
vim maf_bed_filter_Type1.pl

use warnings;
use strict;
my $bed_file = $ARGV[0];#"/public/home/linzeshan/project/01.conserve/Type_one_re_fileter/new_method/work/29.bed";
my $maf_file = $ARGV[1];#"../data/maf2list/29/29.maf.lst";
my $work_dir = $ARGV[2];

my @arr = ("Mus","Rat","Hamster","Squirrel","Rabbit");  #外群物种
my $species = {};
foreach my $arr(@arr){
    $species->{$arr} = 1;
}

open BED,$bed_file or die $!;
open MAF,$maf_file or die $!;

my $word = <MAF>;
my @first_line = split(/\t/,$word);
my $num = @first_line;
my @locate_arr;

for(my $i = 0;$i<$num;$i++){
    push (@locate_arr,$i) if (exists ($species->{$first_line[$i]}));
}
foreach my $j(@locate_arr){
    print $j."\t";
}

open OUT,">$work_dir/work.out" or die $!;
open ERR,">$work_dir/work.err" or die $!;

while(<BED>){
    my @array = split(/\t/,$_);
    my $pause = "nn";
    for(my $i=$array[1];$i<=$array[2];$i++){
        next if $i == 0;
        my $chu = 1;
        while(1){
            my $nt;
            if ($pause ne "nn"){
                $nt = $pause;
                $pause = "nn";
            }
            else {
                $nt = <MAF>;
            }
            my @nt = split(/\t/,$nt);
            if ($i>$nt[1]){
                next;
            }
            elsif($i == $nt[1]){
                foreach my $pocky(@locate_arr){
                    if ($nt[$pocky] ne "-"){
                    $chu = 0;
                    }
                }
                last;
            }
            else{
                $pause = $nt;
                print ERR "worng bed $i maf $nt[1]\n";
                last;
            }
        }
        print OUT "$i\t+\n" if $chu == 1;
        print OUT "$i\t-\n" if $chu == 0;
    }
}

vim work_out2bed.pl

my $file = $ARGV[0];
my $chr = $ARGV[1];
my $work_dir = $ARGV[2];
open IN,$file or die $!;
open OUT,">$work_dir/$chr.bed" or die $!;
my $over = 1;my $start=0;my $end=0;my $s_e = 1;
while(<IN>){
    chomp;
    my @array = split(/\t/,$_);
    #print $array[0]."\t"."..".$array[1]."..\n";last;}=p
    if ($array[1] eq "-"){
        $end = $start if $end == 0;
        print OUT "$chr\t$start\t$end\n" unless $start == 0;
        $start = 0;$end = 0;
        $over = 1;
        next;
    }
    else{
        if ($over == 1){
            $start = $array[0];
            $end = $array[0];
            $over = 0;
        }
        else{
            if ($array[0] > $end+1){
                $end = $start if $end == 0;
                print OUT "$chr\t$start\t$end\n" unless $start == 0;
                $start = 0;$end = 0;
                $start = $array[0];
                $end = $array[0];
            }
            else{
                $end = $array[0];
            }
        }
    }
}

vim 01.convertMaf2List.pl

#! /usr/env perl
use strict;
use warnings;

my $maf=$ARGV[0];
my $output=$ARGV[1].".maf.lst";
my $ref="golani";
my @species=("go","ga","ca","ju","Mus","Rat","Hamster","Jac","Squirrel");
open O,">$output";
open E,">no_ref.maf";
#open O,"| gzip - > $output";
my @head=("chr","pos",@species);
my $head=join "\t",@head;
print O "$head\n";

my $control=0;
my $content="";
open I,"cat $maf |";
while(<I>){
    next if(/^#/);
    if(/^a\s+score=/){
        $content=$_;
        while(<I>){
            if(/^s\s+/){
    $content.=$_;
            }
            else{
    last;
            }
        }
        &analysis($content);
    }
    # last if($control++>=0);
}
close I;
close O;

sub analysis{
    my $content=shift;
    my @line=split(/\n/,$content);
    my $head=shift @line;
    $head=~/score=([\d\.]+)/;
    my $score=$1;my $all_strand;
    # print "$score\n";
    my %pos;
    my $isRefFound=0;
    my $ref_chr="NA";
    foreach my $line(@line){
        my @a=split(/\s+/,$line);
        my ($s,$chr,$start,$alignment_length,$strand,$sequence_length,$sequence)=@a;my $all_strand = $strand;
        $chr=~/^([^\.]+)\.(.*)/;
        my $species=$1;
        $chr=$2;
        if($species eq $ref){
            $ref_chr=$chr;
            $isRefFound=1;
            my @base=split(//,$sequence);
            if($strand eq "+"){
    my $pos=$start;
    for(my $i=0;$i<@base;$i++){
        if($base[$i] ne "-"){
            $pos++;
            $pos{$i}=$pos;
        }
    }
            }
            if($strand eq "-"){
    my $pos=$start;
    for(my $i=0;$i<@base;$i++){
        if($base[$i] ne "-"){
            $pos++;
            my $real_pos = $sequence_length-1-($pos-1)+1;
            $pos{$i}=$real_pos;
        }
    }
            }
        }
    }
    if($isRefFound == 0){
        print E $content;
        next;   
#        die "$ref not found!\nspecies name should be added before chr such as chr01 should be cattle.chr01\n";
    }
    my %data;
    foreach my $line(@line){
        my @a=split(/\s+/,$line);
        my ($s,$chr,$start,$alignment_length,$strand,$sequence_length,$sequence)=@a;
        $chr=~/^([^\.]+)\.(.*)/;
        my $species=$1;
        $chr=$2;
        my @base=split(//,$sequence);
        $sequence =~ tr/ATCGRYMK/TAGCYRKM/ if ($strand eq "-");
        for(my $i=0;$i<@base;$i++){
            next if(!exists $pos{$i});
            my $pos=$pos{$i};
            $data{$pos}{$species}=$base[$i];
        }
    }
    my @output_line;
    foreach my $pos(sort {$a<=>$b} keys %data){
        @output_line=($ref_chr,$pos);
        foreach my $species(@species){
    my $base="-";
    if(exists $data{$pos}{$species}){
        $base = $data{$pos}{$species};
    }
    push @output_line,$base;
            }
        my $output_line=join "\t",@output_line;
        print O "$output_line\n";
    }
}

最后编辑于：2023.11.22 17:16:49

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 202,056评论 5赞 474
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 84,842评论 2赞 378
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 148,938评论 0赞 335
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,296评论 1赞 272
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,292评论 5赞 363
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,413评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,824评论 3赞 393
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,493评论 0赞 256
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,686评论 1赞 295
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,502评论 2赞 318
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,553评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,281评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,820评论 3赞 305
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,873评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,109评论 1赞 258
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,699评论 2赞 348
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,257评论 2赞 341

跟着Science学保守元件(Conserved Elemtns)的鉴定

推荐阅读更多精彩内容