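As a bioinformatics engineer, I (Kaka) work mainly with Python, Perl, and R. Every time I write a script there is some syntax I cannot quite remember and have to look up online, so this post summarizes how the same task is written in each language.
Write a dataframe `data` to a specified file `outfile`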
#python
data.to_csv(outfile,sep="\t",header=True,index=True)
#R
write.table(data,file=outfile,sep="\t",col.names=NA,quote=F,row.names=T)
#perl
open(my $out, '>', $outfile) or die $!;
# Perl has no dataframe: store the data in a hash, then loop over the hash and print each entry to $out
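The Perl approach above (keep the rows in a hash, then print them one by one) can also be written in plain Python without pandas. The following is only a minimal sketch: the example rows, the column list, and the temporary output path are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical example data: row name -> {column name -> value}
data = {
    "cell1": {"path": "/tmp/a", "spname": "human"},
    "cell2": {"path": "/tmp/b", "spname": "mouse"},
}
columns = ["path", "spname"]

# Write to a temporary file so the sketch does not touch real paths
outfile = os.path.join(tempfile.mkdtemp(), "out.tsv")
with open(outfile, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow([""] + columns)       # header, with an empty cell above the row names
    for row_name, row in data.items():    # iterate over the dict, like a Perl hash
        writer.writerow([row_name] + [row[c] for c in columns])

# Read the file back to check what was written
with open(outfile) as handle:
    lines = handle.read().splitlines()
```

The empty first header cell mirrors what `index=True` in pandas and `col.names=NA` in R produce.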
Iterate over `filelist` with a for loop
#python
filelist = ['c1293_fig_2e','c1353_fig_5b']
for i in filelist:
    print(i)
#R
filelist = c('c1293_fig_2e','c1353_fig_5b')
for (i in filelist){
print(i)
}
#perl
my @filelist = qw(c1293_fig_2e c1353_fig_5b);
foreach my $i(@filelist){
print "$i\n";
}
String concatenation: join several strings into a new string `file_tmp`
#python
i = 'abc'
file_tmp = '/var/uploaded/public/' + i +'_cat.rds'
#R
i = 'abc'
file_tmp = paste0('/var/uploaded/public/',i,'_cat.rds')
#perl
my $i = 'abc';
my $file_tmp = '/var/uploaded/public/'.$i.'_cat.rds';
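In Python, `+` is not the only way to build such a path. The sketch below shows two common alternatives, an f-string and `os.path.join`; the variable `base_dir` is a name introduced here for illustration.

```python
import os.path

i = "abc"
base_dir = "/var/uploaded/public"   # hypothetical variable holding the directory

# f-string: the variables are interpolated directly into the literal
file_tmp = f"{base_dir}/{i}_cat.rds"

# os.path.join inserts the path separator between the parts
file_tmp_joined = os.path.join(base_dir, i + "_cat.rds")
```

`os.path.join` is usually the safer choice when the directory may or may not end with a slash.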
Check whether a file (t.sh) or a directory (/Dir) exists on a Linux system
#python
import os
os.path.exists('t.sh')
os.path.exists('/Dir')
os.path.isfile('t.sh')
#R
file.exists('t.sh')
dir.exists('/Dir')
#perl
-e 't.sh'
-e '/Dir'
-f 't.sh'
-d '/Dir'
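The Perl tests `-f` and `-d` distinguish files from directories, while `-e` and `os.path.exists` only say that something is there. A small Python sketch of the same distinction, using `pathlib`; the helper name `describe` and the temporary paths are assumptions for illustration.

```python
import os
import tempfile
from pathlib import Path

def describe(path):
    """Return 'file', 'directory', or 'missing' for the given path."""
    p = Path(path)
    if p.is_file():
        return "file"
    if p.is_dir():
        return "directory"
    return "missing"

# Build a throwaway directory and script so the checks have something to find
tmp_dir = tempfile.mkdtemp()
script = os.path.join(tmp_dir, "t.sh")
Path(script).write_text("#!/bin/bash\n")

results = [
    describe(script),                         # an existing file
    describe(tmp_dir),                        # an existing directory
    describe(os.path.join(tmp_dir, "nope")),  # a path that does not exist
]
```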
Select elements or slices from a dataframe `data` (with columns `path` and `spname`)
#python
data.loc[:,'path']
data.loc[0,['path','spname']]
data.iloc[:,1]
data.iloc[0,[0,1]]
#R
data[1:6,1:6]
data[1:6,c('path','spname')]
#perl
# Perl has no dataframe concept; such data is usually stored in a hash
Read the file annot_article.txt into a dataframe
#python
import pandas as pd
info = pd.read_csv('annot_article.txt', sep="\t", dtype='str', index_col=0)
#R
info <- read.table('annot_article.txt',sep="\t",header=T,row.names=1,stringsAsFactors=F)
#perl
open IN, "<annot_article.txt" or die $!;
# loop over the lines of IN and store them in a hash
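The hash-based Perl approach has a direct Python counterpart: read the file line by line and build a dict keyed by the first column. The sketch below assumes a tab-separated layout with a header line; the in-memory text stands in for annot_article.txt and its contents are made up.

```python
import csv
import io

# Hypothetical stand-in for annot_article.txt: the first column holds the row names
text = "id\tpath\tspname\nc1293\t/data/a\thuman\nc1353\t/data/b\tmouse\n"

info = {}
reader = csv.reader(io.StringIO(text), delimiter="\t")
header = next(reader)                     # first line: column names
for fields in reader:
    row_name, values = fields[0], fields[1:]
    # column name -> value; every value stays a string, like dtype='str'
    info[row_name] = dict(zip(header[1:], values))
```

With a real file, `io.StringIO(text)` would be replaced by `open('annot_article.txt', newline='')`.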
Count how often each category occurs in the columns of a dataframe `data` (with columns annot_sub, annot_sub2, and annot_article)
#python
data[['annot_sub','annot_sub2']].value_counts()
#R
table(data['annot_article'])
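When the columns are plain Python lists rather than a dataframe, `collections.Counter` gives the same kind of frequency table. A minimal sketch with made-up annotation values:

```python
from collections import Counter

# Hypothetical annotation columns
annot_sub = ["Tcell", "Bcell", "Tcell", "Tcell"]
annot_sub2 = ["CD4", "naive", "CD8", "CD4"]

# Frequency of each category in one column
single_counts = Counter(annot_sub)

# Frequency of each combination across two columns
pair_counts = Counter(zip(annot_sub, annot_sub2))
```

`single_counts.most_common()` returns the categories sorted by decreasing count.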
Logical AND and OR operators
#python
a and b
a or b
#R
a && b
a || b
#perl
$a && $b
$a || $b
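One detail worth remembering is that Python's `and` and `or` work on whole objects and return one of their operands, so they do not combine lists element by element. The sketch below illustrates both points with made-up values.

```python
a = []             # an empty list is falsy
b = ["sample1"]    # a non-empty list is truthy

both = a and b     # returns a, the first falsy operand
either = a or b    # returns b, the first truthy operand

# Element-wise combination needs an explicit comprehension
mask1 = [True, True, False]
mask2 = [True, False, False]
elementwise_and = [x and y for x, y in zip(mask1, mask2)]
elementwise_or = [x or y for x, y in zip(mask1, mask2)]
```

For pandas Series, the element-wise operators are `&` and `|`.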
Terminate a running script
#python
import sys
sys.exit('error')
#R
stop('error')
#perl
die('error')
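In Python scripts, stopping is usually done with `sys.exit`, which raises `SystemExit`. A small sketch of how that behaves; the function `require_positive` is a hypothetical helper written for this example.

```python
import sys

def require_positive(n):
    """Stop the script with an error message if n is not positive."""
    if n <= 0:
        # Left uncaught, a string argument is printed to stderr and the exit status is 1
        sys.exit("error: n must be positive")
    return n

# sys.exit raises SystemExit, so it can be intercepted, for example in tests
try:
    require_positive(-1)
    stopped = False
except SystemExit as exc:
    stopped = True
    message = exc.code
```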
Take the intersection of two vectors `data1` and `data2` and store it in `tmp`
#python
tmp = [val for val in data1 if val in data2]
#R
tmp <- intersect(data1,data2)
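The list comprehension above keeps duplicates and scans `data2` once per element, whereas R's `intersect` returns unique values. The following sketch shows two set-based variants in Python; the gene names are made-up example data.

```python
data1 = ["CD3D", "CD19", "MS4A1", "CD3D"]
data2 = ["MS4A1", "CD3D", "NKG7"]

# Set intersection: fast, but order is not preserved and duplicates are dropped
tmp_set = set(data1) & set(data2)

# Order-preserving variant: test membership against a set, deduplicate as we go
lookup = set(data2)
seen = set()
tmp = []
for val in data1:
    if val in lookup and val not in seen:
        tmp.append(val)
        seen.add(val)
```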
View the source code of a function `function`
#python
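# the __code__ attribute: the compiled code object, which records the source file and first line number
function.__code__
# look up the defining module in sys.modules to find the file it was loaded from
import sys
sys.modules[function.__module__]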
#R
1 Type the function name (function) at the R prompt
2 methods(function)
3 getAnywhere(function)
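To print the source text itself in Python, the standard library's `inspect` module is one option. The sketch below uses `json.dumps` as the example function only because it is written in Python and ships with every installation; functions implemented in C have no retrievable source.

```python
import inspect
import json

# json.dumps is implemented in Python, so its source file is available
source_text = inspect.getsource(json.dumps)
source_file = inspect.getsourcefile(json.dumps)

# Show the line that opens the function definition
print(source_text.splitlines()[0])
```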
Check the data type of `data`
#python
type(data)
#R
class(data)
Replace the names in a vector `origin_gene` according to the replacement rule `maplist`
#python
import pandas as pd
pd.DataFrame(origin_gene).iloc[:,0].map(maplist).tolist() #origin_gene is the list to be replaced, maplist is the replacement rule (a dict); names missing from maplist become NaN
#R
plyr::mapvalues(origin_gene,from=list1, to=list2) #list1 and list2 define the replacement rule
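The two snippets above treat unmatched names differently: the pandas version turns them into NaN, while `plyr::mapvalues` leaves them unchanged. A plain-Python sketch of both behaviours, using a made-up `maplist` dict:

```python
origin_gene = ["Cd3d", "Ms4a1", "Xist"]
maplist = {"Cd3d": "CD3D", "Ms4a1": "MS4A1"}   # hypothetical replacement rule

# Names missing from maplist become None (the pandas version gives NaN)
mapped_strict = [maplist.get(gene) for gene in origin_gene]

# Names missing from maplist keep their original value
mapped_keep = [maplist.get(gene, gene) for gene in origin_gene]
```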
Split the string `str` on ","
#python
str.split(",")
#R
strsplit(str,split=',')
#perl
split /,/, $str;
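A few details of Python's `split` are easy to forget: it keeps surrounding whitespace and trailing empty fields, and it accepts a maximum number of splits. The sketch below uses the variable name `line` rather than `str`, since `str` shadows the built-in type in Python; the input value is made up.

```python
line = "c1293, c1353 ,c1402,"

# Plain split keeps whitespace and the trailing empty field
raw_fields = line.split(",")

# Strip each field and drop the empty ones
clean_fields = [f.strip() for f in raw_fields if f.strip()]

# maxsplit=1: split only at the first comma
first, rest = line.split(",", 1)
```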
Replace the character a with b in a string
#python
string.replace('a','b')
#R
gsub('a','b',string)
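Note that R's `gsub` treats its first argument as a regular expression by default, while Python's `str.replace` is always literal. A short sketch of the literal and the pattern-based forms in Python, with a made-up input string:

```python
import re

string = "a.b.a"

literal = string.replace("a", "b")         # every literal "a" becomes "b"
first_only = string.replace("a", "b", 1)   # count=1: only the first match
by_regex = re.sub(r"\.", "_", string)      # pattern-based, like gsub without fixed=TRUE
```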
Import modules (packages) in each language
#python
import os
import pandas as pd
#R
library(Seurat)
library(monocle)
#perl
use strict;
use File::Basename;
Check the version of an imported module
#R
library(Seurat)
packageVersion('Seurat') #Seurat is the package name
#python
import anndata
anndata.__version__
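Not every Python package defines `__version__`. As an alternative, the standard library's `importlib.metadata` (Python 3.8 and later) reads the version from the installed package metadata. The helper name `installed_version` below is an assumption for this sketch.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# A name that should not be installed anywhere, to show the fallback
missing = installed_version("surely-not-a-real-package-name")
```

Calling `installed_version("anndata")` would return the version string when anndata is installed.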
Comparing scanpy (Python) and Seurat (R)
- Where the expression matrices are stored
#seurat, fixed structure
raw counts: PRO@assays$RNA@counts
normalized data: PRO@assays$RNA@data
scaledata: PRO@assays$RNA@scale.data
#scanpy, not fixed; typically
raw counts: adata.raw.X
normalized data: adata.X or adata.layers['normalised']
scaledata: a layer in adata.layers
- Subset the datasets PRO and adata
#seurat
PRO <- subset(PRO,idents=c('Tcell','Bcell'))
PRO <- subset(PRO,sample %in% c('sample1','sample2'))
#scanpy
used_cell = adata.obs[adata.obs['annot_full'].isin(['Tcell','Bcell']) & adata.obs['sample'].isin(['sample1','sample2'])]
adata = adata[used_cell.index,]
- Inverse subset (exclude the matching cells)
#seurat
PRO <- subset(PRO,idents=c('Tcell','Bcell'), invert=T)
PRO <- subset(PRO,sample %in% c('sample1','sample2'),invert=T)
#scanpy
adata = adata[~(adata.obs['annot_full'].isin(['Tcell','Bcell']) & adata.obs['sample'].isin(['sample1','sample2']))]
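When inverting a combined filter, it matters whether the negation is applied to the combined condition or to each condition separately. The sketch below shows the difference on a plain list of made-up cells; it does not use scanpy or Seurat, and which variant matches the intended result should be checked against your own data.

```python
# Hypothetical cells with the two metadata fields used above
cells = [
    {"annot_full": "Tcell", "sample": "sample1"},
    {"annot_full": "Tcell", "sample": "sample3"},
    {"annot_full": "NK", "sample": "sample1"},
    {"annot_full": "NK", "sample": "sample3"},
]
types = {"Tcell", "Bcell"}
samples = {"sample1", "sample2"}

# Drop only cells that match BOTH conditions: not (A and B)
drop_both = [
    c for c in cells
    if not (c["annot_full"] in types and c["sample"] in samples)
]

# Drop cells that match EITHER condition: (not A) and (not B)
drop_either = [
    c for c in cells
    if c["annot_full"] not in types and c["sample"] not in samples
]
```

The first variant keeps three of the four cells, the second keeps only the NK cell from sample3.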
Merge two dataframes `df1` and `df2` into `df`
#python
import pandas as pd
df = pd.concat([df1,df2]) #merge by row
#R
df = rbind(df1,df2) #merge by row
df = cbind(df1,df2) #merge by column