关键词
- 批次效应
- 批次校正
- 单细胞数据整合
适用背景
单细胞数据由于实验平台或样本等原因会造成不同数据集之间存在批次效应,这种批次效应是人为因素造成的,没有实际的生物学意义,可能对研究结果产生极大影响。批次效应可由聚类结果确定,如果聚类出的某些亚群绝大部分都来自同一个样本一般就认为存在批次效应,因此需要进行批次校正。
本文总结了三种常用的整合方法代码:CCA,SCTransform和Harmony。
方法一
CCA整合方法是目前应用最多方法,是Seurat自带的,大多数情况以及够用了,效果也还可以,但是对于较大数据集,耗时较长,占内存也较大。目前,Seurat官网在此基础上推荐reference-based,也就是指定参考数据集进行整合,但对于自产数据集,一般根本无法预先知道哪个样本效果最好,这种reference-based的思路更适合数据挖掘类的研究。
参数简介
- obj,Seurat对象
- group.by,整合分组
- mt.pattern或mt.list,指定线粒体基因,mt.pattern支持模糊搜索,mt.list直接给定基因集,格式为向量
- dim.use PCA主成分的选择个数
- mt.cutoff 线粒体百分比阈值
- nf.low 基因数下限
- nf.high 基因数上限
- nfeatures,用于整合的高变基因选择数
- res,聚类亚群的分辨率
seurat_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all.list <- SplitObject(all, split.by = group.by)
for (i in 1:length(all.list)){
all.list[[i]] <- NormalizeData(all.list[[i]], verbose = FALSE)
all.list[[i]] <- FindVariableFeatures(all.list[[i]], selection.method = "vst",nfeatures = nfeatures, verbose = FALSE)}
reference.list <- all.list
all.anchors <- FindIntegrationAnchors(object.list = reference.list, dims =1:dim.use)
all.integrated <- IntegrateData(anchorset = all.anchors, dims = 1:dim.use)
DefaultAssay(all.integrated) <- "integrated"
all.integrated <- ScaleData(all.integrated, verbose = FALSE)
npcs <- dim.use+10
all.integrated <- RunPCA(all.integrated, npcs = npcs, verbose = FALSE)
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法二
第二种方法是Seurat官网极度推荐的,主要由于方法一的Normalization and variance stabilization流程存在一定问题,会造成基因表达量会与测序深度存在明显的相关关系等,因此提出了SCTransform进行预处理,然后再整合,其实后面的整合方法跟方法一的类型,只不过这里的前期预处理用的是SCTransform,而方法一用的是LogNormalize,因此整合的对象结构是类似的。详细内容可阅读这篇文献。这种方法理论上更为合理,但是也更耗运行内存和运行时间。参数与方法一一致。
sct_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(obj))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
obj <- all
ifnb.list <- SplitObject(obj, split.by = group.by)
if (!requireNamespace("glmGamPoi", quietly = TRUE)) {
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE)
}else{
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE, method = "glmGamPoi")
}
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = nfeatures)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
anchor.features = features)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:dim.use)
all.integrated <- immune.combined.sct
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法三
第三种方法是一种降维整合,基于harmony包,这种方法的优势在于够快,大部分情况都能有比较好的结果。参数与方法一一致。
harmony_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=20,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
library(harmony)
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(all)
all <- ScaleData(all , features = all.genes, vars.to.regress = "nCount_RNA")
saveRDS(all,"regress.rds")
all <- RunPCA(all, features = VariableFeatures(object = all))
all <- RunHarmony(all, group.by , plot_convergence = F,dims.use = 1:dim.use)
Combine <- all
Combine = RunTSNE(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = RunUMAP(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = FindNeighbors(Combine, reduction = "harmony",dims = 1:dim.use)
Combine = FindClusters(Combine, resolution = res)
return(Combine)
}
小结与补充
单细胞数据整合一直是个玄学,没有说哪一种整合方法是最好的,不同方法针对不同样本会出现不同效果,只能每种方法都试一下才能知道哪种较好。而且还需要结合实际情况进行选择,例如数据集太大,或者运行内存不够,可能harmony的方法更适合,当然如果数据集适中,各种运行条件也合适那就可以考虑理论上更为合理的SCTransform方法。