聚类分析&maker define cluster - Seurat

########################################################

在完成filter工作以后,我们进行下一步的聚类、marker鉴定

PART1. 聚类

step0. count normalization

normalize:把count值转化为log值

0.1 normalization主要考虑的因素有两个:

(1)测序深度(sequencing depth):

我的理解是把一个细胞的reads数目normalize成10000(default),实际上比较的基因百分数的变化

(2)基因长度(Gene length)

0.2 normalizing the data

# normalizing the data
# default setting: normalization.method = "LogNormalize" and scale.factor = 10000  

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

step1. feature selection - 找clusters之间表达水平差异化最大的基因feature

  • Identification of highly variable features
    这里指的是,单一样本,不同cluster之间相比,表达水平差异化最大的基因feature
# 加载Seurat对象/加载保存为.Rdata的Seurat对象
Seurat
load("data/pbmc_seurat_filtered.RData")

# 用 FindVariableFeatures 找variable gene;default 值如下
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# 列出 cluster 之间变化最大的gene - Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# 画图 plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc) # without labels
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) # with labels
plot1 + plot2

step2. scaling the data

  • 为 PCA 做准备

如果是为了PCA做准备,那么对 all.genes scaling data 则耗时太长;一个解决办法是,使用 scaledata 默认值(依照上一步鉴定的 2000 variable features),这样耗时短也能达到分群的目的

# perform scale data on the previously identified variable features(2000 by default)
pbmc <- ScaleData(pbmc)

  • DoHeatmap

To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.

# DoHeatmap by scaling all genes
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

step3. PCA—主成分分析

用tSNE较多,因为tSNE分的比较开,uMAP的优点在于能看到谱系来源,但是分的不开,所以如果要做分cluster的话,用tSNE更多。

先用默认的 previously determined variable features 作为input,but can be defined using features argument if you wish to choose a different subset.

# runPCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

3.1 识别重要主成分 PC (该部分需要加强理解 - option)

为了确保分群所依照的主成分基因,它不是由于单个基因表达的技术性噪声带来的干扰;所以要在PCA将维之前,识别重要的主成分基因,也就是,用 PC 去代表聚合的变异最大的基因。

然后用 PC得分,衡量该 cluster 在每个PC 上的权重,也就是,PC 代表的“元基因”(变异最大的那些基因)能解释该cluster 百分之几的差异度。

所以,在确定下游聚类分析之前,有一个问题,我们该如何识别PC 呢?

目前Seurat提供有以下三种方法:VizDimReduction, DimPlot, and DimHeatmap

3.1.1 找合适重要主成分的方法
  • 热图可视化 DimHeatmap

cells参数指定了用于绘图的最大正负PCA分数的细胞数。我们要寻找的是,在热图开始看起来更 "模糊 "的PC,也就是说,各组基因之间的区分不是那么明显的PC。

# 研究PC热图
DimHeatmap(seurat_integrated, 
           dims = 1:9, 
           cells = 500, 
           balanced = TRUE)

如果我们想探索大量的PCA,这种方法可能会很慢,而且很难将单个基因可视化。同样的,为了探索大量的PC,我们可以通过PCA分数驱动PC,打印出前10个(或更多的)正基因和负基因。

The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.

  • VizDimReduction

The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.

# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7


# 可视化 VizDimReduction PCA results
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

  • DimPlot

The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.

DimPlot(pbmc, reduction = "pca")

3.1.2 Determine the ‘dimensionality’ of the dataset

弯头图(elbow plot)直观地显示了每个PC的标准差,我们要寻找标准差开始趋于平稳的地方。从本质上说,弯头出现的地方通常是识别大多数变异的阈值。然而,这种方法可能是相当主观的。

根据这个图,我们可以通过PC8-PC10附近的肘部发生的位置来大致确定大部分的变化,或者可以认为应该是数据点开始接近X轴的时候,PC30左右。

ElbowPlot(pbmc)

step4. 聚类细胞 cluster the cell

Seurat使用了一种基于图的聚类方法,它将细胞嵌入到一个图结构中,使用K-近邻(KNN)图(默认情况下),并在具有相似基因表达模式的细胞之间画出边缘。然后,它试图将这个图分割成高度相互关联的 "准聚类 "或 "群落"[Seurat-Guided-Clustering-Tutorial]。

我们将使用FindClusters()函数来执行基于图的聚类。resolution是一个重要的参数,它设置了下行聚类的 "粒度 (granularity)",需要对每个单独的实验进行优化。对于3,000-5,000个细胞的数据集,resolution设置在0.4-1.4之间,一般可以获得良好的聚类效果。分辨率的增加会导致更多的聚类,这通常是较大的数据集所需要的。

FindClusters()函数允许我们输入一系列的分辨率,并将计算出聚类的 "粒度"。这便于测试哪种分辨率对于分群更有帮助,而不需要为每个分辨率分别运行函数。

# 确定k-近邻图
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:40)
                                
# 确定聚类的不同分辨率                               
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

# 如果我们看一下Seurat对象的元数据(seurat_integrated@metadata),每一个不同的分辨率都有一个单独的列来计算。
# 探索分辨率
seurat_integrated@meta.data %>% 
        View()

# 我们通常会选择一个中间范围的分辨率开始,比如0.6或0.8。我们将通过使用Idents()函数赋予聚类身份,从0.8的分辨率开始。
# 赋予聚类的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"

# 举例
# ----------------
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                1                3                1                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

step5.Run non-linear dimensional reduction (UMAP/tSNE)

为了使细胞类群可视化,有一些不同的降维技术可以使用。最流行的方法包括t分布随机邻域嵌入(t-SNE)和统一模态近似与投影(UMAP)技术。

这两种方法的目的是将高维空间中具有相似局部邻域的细胞一起放置在低维空间中。这两种方法都需要输入PCA维度的数量来进行可视化,我们建议使用相同数量的PC作为聚类分析的输入。这里,我们将用UMAP方法来进行可视化聚类分析。

# umap
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, 
        reduction = "umap", 
        label = TRUE,
        label.size = 6)

# tsne
pbmc <- RunTSNE(pbmc, features = VariableFeatures(object = pbmc))
DimPlot(pbmc, reduction = "tsne")

这对于研究其他分辨率也是很有用的。它可以让你快速了解基于分辨率参数的类群会有怎样的变化。例如,让我们换成0.4的分辨率。

# 分配类群的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.4"

# 绘制UMAP图
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

# 保存为.RData
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

PARTII marker define cluster

step0. Find clusters vs. each other, or against all cells.

By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.

FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells.

# find all markers of cluster 1 (vs. all other cluster by default)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# IL32  0   0.8373872   0.948   0.464   0
# LTB   0   0.8921170   0.981   0.642   0
# CD3D  0   0.6436286   0.919   0.431   0
# IL7R  0   0.8147082   0.747   0.325   0
# LDHB  0   0.6253110   0.950   0.613   0

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# FCGR3A    0   2.963144    0.975   0.037   0
# IFITM3    0   2.698187    0.975   0.046   0
# CFD       0   2.362381    0.938   0.037   0
# CD68  0   2.087366    0.926   0.036   0
# RP11-290F20.3 0   1.886288    0.840   0.016   0

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

step1. visualizing marker expression

VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.

1.1 VinPlot the expression level

# you can plot avg_logFC
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

1.2 FeaturePlot the distribution

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",  "CD8A"))

1.3 DoHeatmap for given cells and features

# In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

step2. 根据找到的marker/canonical markers 导出csv

##找到每个cluster的marker
markers <- FindMarkers(pbmc, ident.1 = 0,  only.pos = TRUE,min.pct = 0.25)
write.table(markers, file="/Users/apple/Desktop/Result/markers_0.tsv", sep="\t", quote=FALSE, col.names=NA)

step3. 根据marker.csv 画图

##直接根据marker画图

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:4,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_1.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[8:14,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_2.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[15:21,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_3.png", a)
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[22:28,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_4.png", a)

a <- FeaturePlot(pbmc, reduction = "tsne",features = "Cidec", cols = c("grey", "red"))
ggsave("genemap_1.png", a)


step4. 根据不同的 cluster 指定细胞类型 Assigning cell type identity to clusters

##标记
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:8,1])
new.cluster.ids <- genes
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
a<-DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = 0.5) + NoLegend()
ggsave("tsne.png", a)

# or 直接按顺序给cluster0 -> 8 指定细胞类型
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# 存储为RDS
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,098评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,213评论 2 380
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 149,960评论 0 336
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,519评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,512评论 5 364
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,533评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,914评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,574评论 0 256
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,804评论 1 296
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,563评论 2 319
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,644评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,350评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,933评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,908评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,146评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,847评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,361评论 2 342