聚类分析&maker define cluster - Seurat

########################################################

题目：1.聚类分析&maker define cluster - Seurat
语言：R
官方教程：https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
日期：April 17, 2020
Author: YQ
########################################################

在完成filter工作以后，我们进行下一步的聚类、marker鉴定

PART1. 聚类

step0. count normalization

normalize：把count值转化为log值

0.1 normalization主要考虑的因素有两个：

(1)测序深度（sequencing depth）:

我的理解是把一个细胞的reads数目normalize成10000(default),实际上比较的基因百分数的变化

(2)基因长度（Gene length）

0.2 normalizing the data

# normalizing the data
# default setting: normalization.method = "LogNormalize" and scale.factor = 10000  

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

step1. feature selection - 找clusters之间表达水平差异化最大的基因feature

Identification of highly variable features
这里指的是，单一样本，不同cluster之间相比，表达水平差异化最大的基因feature

# 加载Seurat对象/加载保存为.Rdata的Seurat对象
Seurat
load("data/pbmc_seurat_filtered.RData")

# 用 FindVariableFeatures 找variable gene；default 值如下
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# 列出 cluster 之间变化最大的gene - Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# 画图 plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc) # without labels
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) # with labels
plot1 + plot2

step2. scaling the data

为 PCA 做准备

如果是为了PCA做准备，那么对 all.genes scaling data 则耗时太长；一个解决办法是，使用 scaledata 默认值（依照上一步鉴定的 2000 variable features），这样耗时短也能达到分群的目的

# perform scale data on the previously identified variable features(2000 by default)
pbmc <- ScaleData(pbmc)

DoHeatmap

To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.

# DoHeatmap by scaling all genes
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

step3. PCA—主成分分析

用tSNE较多，因为tSNE分的比较开，uMAP的优点在于能看到谱系来源，但是分的不开，所以如果要做分cluster的话，用tSNE更多。

先用默认的 previously determined variable features 作为input，but can be defined using features argument if you wish to choose a different subset.

# runPCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

3.1 识别重要主成分 PC (该部分需要加强理解 - option)

为了确保分群所依照的主成分基因，它不是由于单个基因表达的技术性噪声带来的干扰；所以要在PCA将维之前，识别重要的主成分基因，也就是，用 PC 去代表聚合的变异最大的基因。

然后用 PC得分，衡量该 cluster 在每个PC 上的权重，也就是，PC 代表的“元基因”（变异最大的那些基因）能解释该cluster 百分之几的差异度。

所以，在确定下游聚类分析之前，有一个问题，我们该如何识别PC 呢？

目前Seurat提供有以下三种方法：VizDimReduction, DimPlot, and DimHeatmap

3.1.1 找合适重要主成分的方法

热图可视化 DimHeatmap

cells参数指定了用于绘图的最大正负PCA分数的细胞数。我们要寻找的是，在热图开始看起来更 "模糊 "的PC，也就是说，各组基因之间的区分不是那么明显的PC。

# 研究PC热图
DimHeatmap(seurat_integrated, 
           dims = 1:9, 
           cells = 500, 
           balanced = TRUE)

如果我们想探索大量的PCA，这种方法可能会很慢，而且很难将单个基因可视化。同样的，为了探索大量的PC，我们可以通过PCA分数驱动PC，打印出前10个（或更多的）正基因和负基因。

The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.

VizDimReduction

The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.

# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7


# 可视化 VizDimReduction PCA results
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot

The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.

DimPlot(pbmc, reduction = "pca")

3.1.2 Determine the ‘dimensionality’ of the dataset

弯头图（elbow plot）直观地显示了每个PC的标准差，我们要寻找标准差开始趋于平稳的地方。从本质上说，弯头出现的地方通常是识别大多数变异的阈值。然而，这种方法可能是相当主观的。

根据这个图，我们可以通过PC8-PC10附近的肘部发生的位置来大致确定大部分的变化，或者可以认为应该是数据点开始接近X轴的时候，PC30左右。

ElbowPlot(pbmc)

step4. 聚类细胞 cluster the cell

Seurat使用了一种基于图的聚类方法，它将细胞嵌入到一个图结构中，使用K-近邻（KNN）图（默认情况下），并在具有相似基因表达模式的细胞之间画出边缘。然后，它试图将这个图分割成高度相互关联的 "准聚类 "或 "群落"[Seurat-Guided-Clustering-Tutorial]。

我们将使用FindClusters()函数来执行基于图的聚类。resolution是一个重要的参数，它设置了下行聚类的 "粒度 (granularity)"，需要对每个单独的实验进行优化。对于3,000-5,000个细胞的数据集，resolution设置在0.4-1.4之间，一般可以获得良好的聚类效果。分辨率的增加会导致更多的聚类，这通常是较大的数据集所需要的。

FindClusters()函数允许我们输入一系列的分辨率，并将计算出聚类的 "粒度"。这便于测试哪种分辨率对于分群更有帮助，而不需要为每个分辨率分别运行函数。

# 确定k-近邻图
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:40)
                                
# 确定聚类的不同分辨率                               
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

# 如果我们看一下Seurat对象的元数据(seurat_integrated@metadata)，每一个不同的分辨率都有一个单独的列来计算。
# 探索分辨率
seurat_integrated@meta.data %>% 
        View()

# 我们通常会选择一个中间范围的分辨率开始，比如0.6或0.8。我们将通过使用Idents()函数赋予聚类身份，从0.8的分辨率开始。
# 赋予聚类的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"

# 举例
# ----------------
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                1                3                1                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

step5.Run non-linear dimensional reduction (UMAP/tSNE)

为了使细胞类群可视化，有一些不同的降维技术可以使用。最流行的方法包括t分布随机邻域嵌入（t-SNE）和统一模态近似与投影（UMAP）技术。

这两种方法的目的是将高维空间中具有相似局部邻域的细胞一起放置在低维空间中。这两种方法都需要输入PCA维度的数量来进行可视化，我们建议使用相同数量的PC作为聚类分析的输入。这里，我们将用UMAP方法来进行可视化聚类分析。

# umap
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, 
        reduction = "umap", 
        label = TRUE,
        label.size = 6)

# tsne
pbmc <- RunTSNE(pbmc, features = VariableFeatures(object = pbmc))
DimPlot(pbmc, reduction = "tsne")

这对于研究其他分辨率也是很有用的。它可以让你快速了解基于分辨率参数的类群会有怎样的变化。例如，让我们换成0.4的分辨率。

# 分配类群的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.4"

# 绘制UMAP图
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

# 保存为.RData
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

PARTII marker define cluster

step0. Find clusters vs. each other, or against all cells.

By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.

FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells.

# find all markers of cluster 1 (vs. all other cluster by default)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# IL32  0   0.8373872   0.948   0.464   0
# LTB   0   0.8921170   0.981   0.642   0
# CD3D  0   0.6436286   0.919   0.431   0
# IL7R  0   0.8147082   0.747   0.325   0
# LDHB  0   0.6253110   0.950   0.613   0

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# FCGR3A    0   2.963144    0.975   0.037   0
# IFITM3    0   2.698187    0.975   0.046   0
# CFD       0   2.362381    0.938   0.037   0
# CD68  0   2.087366    0.926   0.036   0
# RP11-290F20.3 0   1.886288    0.840   0.016   0

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

step1. visualizing marker expression

VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.

1.1 VinPlot the expression level

# you can plot avg_logFC
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

1.2 FeaturePlot the distribution

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",  "CD8A"))

1.3 DoHeatmap for given cells and features

# In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

step2. 根据找到的marker/canonical markers 导出csv

##找到每个cluster的marker
markers <- FindMarkers(pbmc, ident.1 = 0,  only.pos = TRUE,min.pct = 0.25)
write.table(markers, file="/Users/apple/Desktop/Result/markers_0.tsv", sep="\t", quote=FALSE, col.names=NA)

step3. 根据marker.csv 画图

##直接根据marker画图

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:4,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_1.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[8:14,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_2.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[15:21,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_3.png", a)
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[22:28,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_4.png", a)

a <- FeaturePlot(pbmc, reduction = "tsne",features = "Cidec", cols = c("grey", "red"))
ggsave("genemap_1.png", a)

step4. 根据不同的 cluster 指定细胞类型 Assigning cell type identity to clusters

##标记
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:8,1])
new.cluster.ids <- genes
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
a<-DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = 0.5) + NoLegend()
ggsave("tsne.png", a)

# or 直接按顺序给cluster0 -> 8 指定细胞类型
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# 存储为RDS
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")