上次教程已经带领大家先简单的认识一下monocle3的特点与功能(单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3)。虽然是录了五十分钟,但是鉴于monocle3的坑还是很多,所以每一部分的功能还是需要带大家展开了解一下。这次内容主要带领大家学习monocle3的降维、分群和聚类功能,不知道大家是更喜欢用Seurat还是这次的monocle3呢?
视频中提到的Rmakdown还没有制作完,大家不要着急,我们先把这节课的测试文件与代码放在这个链接中,monocle3系列课程制作完毕会再进行更新,请大家持续关注:
资料链接:
https://pan.baidu.com/s/16lPDInnenycR9sTcABolSg?pwd=tdp7
你也可以直接前往monocle3官网教程进行学习:
https://cole-trapnell-lab.github.io/monocle3/docs/differential/
视频教程总是上传失败,大家直接去B站看吧:
https://www.bilibili.com/video/BV1br4y1x7Hf?p=9
2.1.monocle内置
2.1.1.降维
rm(list = ls());gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 9994083 533.8 27726538 1480.8 27726538 1480.8
## Vcells 18353825 140.1 158881073 1212.2 198600815 1515.3
library(monocle3)library(dplyr)
cds <- readRDS('test.data/simple.rds')#怎么构建cds就不重复了,直接读一下上一部分的
cds <- preprocess_cds(cds, num_dim = 100)#主要是完成normalize,上一个视频刚讲过
plot_pc_variance_explained(cds)
cds <- reduce_dimension(cds)#降维,相当于Seurat::RunUMAP()
## No preprocess_method specified, using preprocess_method = 'PCA'
#这里我们没有指定reduction_method,但是默认是"UMAP",有"UMAP", "tSNE", "PCA", "LSI", "Aligned"可供选择
#一般大于10000个细胞认为是大数据集,这时可以用 umap.fast_sgd=TRUE
umap.fast_sgd参数或者设置cores可以加快运算速度,但是可以能会让降维图最终显得有些不同,所以要慎重选择
red.1 <- plot_cells(cds, color_cells_by="cao_cell_type", cell_size = 2,
graph_label_size = 5,
group_label_size = 3)#类似于Seurat::DimPlot()
## No trajectory to plot. Has learn_graph() been called yet?
cds <- reduce_dimension(cds,umap.fast_sgd=T)
## No preprocess_method specified, using preprocess_method = 'PCA'
## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'
red.2 <- plot_cells(cds, color_cells_by="cao_cell_type", cell_size = 2, graph_label_size = 5, group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?
#这些参数的默认值都太小了,不利于我们观察,每次要敲很多参数,这就是我吐槽monocle3代码不友好的地方
cds <- reduce_dimension(cds,cores =5)
## No preprocess_method specified, using preprocess_method = 'PCA'
## Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1'
red.3 <- plot_cells(cds, color_cells_by="cao_cell_type",
cell_size = 2,
graph_label_size = 5, group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?suppressMessages(library(patchwork))red.1|red.2|red.3#线程对降维的影响似乎要小一些
## Warning: Removed 1 rows containing missing values (geom_text_repel).
## Removed 1 rows containing missing values (geom_text_repel).
## Removed 1 rows containing missing values (geom_text_repel).
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
试试tSNE
cds <- reduce_dimension(cds, reduction_method="tSNE")
## No preprocess_method specified, using preprocess_method = 'PCA'
plot_cells(cds, color_cells_by="cao_cell_type", reduction_method="tSNE")#给大家看看默认效果多让人抓狂
## No trajectory to plot. Has learn_graph() been called yet?
## Warning: Removed 1 rows containing missing values (geom_text_repel).
#看看UMAP和tSNE的差别吧red.1|plot_cells(cds, reduction_method="tSNE",
color_cells_by="cao_cell_type", cell_size = 2,
group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?
## Warning: Removed 1 rows containing missing values (geom_text_repel).
## Removed 1 rows containing missing values (geom_text_repel).
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
2.1.2.分群
cds <- cluster_cells(cds)#分群
#这里计算方式用的是community detection,类似于Seurat::FindClusters(),但原理有所不同cds <- cluster_cells(cds, resolution=0.1)
#貌似现在只能选择分辨率,不能选择分群数量了,我觉得还挺好的一个功能没了
unique(partitions(cds))#查看分群ID
## [1] 1 3 2
## Levels: 1 2 3
plot_cells(cds)
## No trajectory to plot. Has learn_graph() been called yet?
plot_cells(cds, genes=c("cpna-2", "egl-21", "ram-2", "inos-1"))#看看基因在降维图中的表达量吧
## No trajectory to plot. Has learn_graph() been called yet?
#相当于Seurat::FeaturePlot()
2.1.3.去批次
names(cds@colData)#用于去批次的变量从这里挑
## [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue"
## [5] "Size_Factor"
bfalig <- plot_cells(cds, color_cells_by="plate", label_cell_groups=FALSE)
## No trajectory to plot. Has learn_graph() been called yet?
cds <- align_cds(cds, alignment_group = "plate")#去除批次效应
## Aligning cells from different batches using Batchelor.
## Please remember to cite:
## Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). 'Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.' Nat. Biotechnol., 36(5), 421-427. doi: 10.1038/nbt.4091
aftalig <- plot_cells(cds, color_cells_by="plate", label_cell_groups=FALSE)
## No trajectory to plot. Has learn_graph() been called yet?
bfalig|aftalig#看看效果
2.1.4.计算并展示marker基因
#计算marker基因,类似于Seurat::FindAllMarkers()marker_test_res <- top_markers(cds, group_cells_by="cao_cell_type",
reference_cells=1000, cores=2)
marker_test_res[1:4,1:4]
## gene_id gene_short_name cell_group marker_score
## 1 WBGene00000004 aat-3 GABAergic neurons 0.4279421
## 2 WBGene00000016 abf-5 Pharyngeal muscle 0.3333333
## 3 WBGene00000031 abu-8 Pharyngeal epithelia 0.3269503
## 4 WBGene00000037 ace-3 Canal associated neurons 0.4750789
top_specific_markers <- marker_test_res %>% filter(fraction_expressing >= 0.10) %>% group_by(cell_group) %>% top_n(1, pseudo_R2)top_specific_markers[1:4,1:4]
## # A tibble: 4 × 4
## # Groups: cell_group [4]
## gene_id gene_short_name cell_group marker_score
## <chr> <fct> <chr> <dbl>
## 1 WBGene00000485 che-3 Dopaminergic neurons 0.591
## 2 WBGene00000681 col-107 Seam cells 0.326
## 3 WBGene00000753 col-180 Non-seam hypodermis 0.355
## 4 WBGene00001172 egl-3 Pharyngeal neurons 0.297
top_specific_marker_ids <- unique(top_specific_markers %>% pull(gene_id))plot_genes_by_group(cds, top_specific_marker_ids[1:10],
group_cells_by="cao_cell_type",
max.size=3)#相当于Seurat::DotPlot
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
## Warning in stats::cor(t(res)): the standard deviation is zero
## Warning in if (axis_order == "marker_group") {: the condition has length > 1 and
## only the first element will be used
plot_genes_by_group(cds, top_specific_marker_ids[1:10],
max.size=3,
group_cells_by="partition")#根据cluster分组绘制
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
## Warning in if (ordering_type == "cluster_row_col") {: the condition has length >
## 1 and only the first element will be used
2.1.5.注释细胞
colData(cds)$assigned_cell_type <- as.character(partitions(cds))
colData(cds)$assigned_cell_type <- dplyr::recode(colData(cds)$assigned_cell_type, "1"="Bio_1", "2"="Bio_2", "3"="Bio_3",
'4'='Bio_4')#值得表扬的是没有像Seurat一样从0开始编号plot_cells(cds, group_cells_by="partition", color_cells_by="assigned_cell_type",
cell_size = 2, graph_label_size = 5, group_label_size = 3)
## No trajectory to plot. Has learn_graph() been called yet?
2.1.6.取子集以及亚群分析
try(cds_subset <- choose_cells(cds))#交互式的圈出一些细胞用于下游分析
## Error : choose_cells only works in interactive mode.
cds_subset <- cds #这里我本来细胞数就比较少就不取了class(cds_subset)
## [1] "cell_data_set"
## attr(,"package")
## [1] "monocle3"
pr_graph_test_res <- graph_test(cds_subset,
neighbor_graph="knn", cores=2)
通过marker基因集可视化帮助判断
#专门用来分析亚群的marker genepr_deg_ids <- row.names(subset(pr_graph_test_res,
morans_I > 0.01 & q_value < 0.05))
gene_module_df <- find_gene_modules(cds_subset[pr_deg_ids,], resolution=1e-3)#Seurat::AddModuleScore()
gene_module_df[1:5,1:5]
## # A tibble: 5 × 5
## id module supermodule dim_1 dim_2
## <chr> <fct> <fct> <dbl> <dbl>
## 1 WBGene00000002 11 5 1.19 2.30
## 2 WBGene00000003 2 2 9.66 -2.55
## 3 WBGene00000004 5 3 4.72 0.633
## 4 WBGene00000006 2 2 10.7 -2.83
## 5 WBGene00000010 3 1 -5.50 -2.16
plot_cells(cds_subset, genes=gene_module_df, show_trajectory_graph=FALSE,
label_cell_groups=FALSE, cell_size = 0.5, graph_label_size = 5,
group_label_size = 3)#相当于基因集打分
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
2.2.Garnett
这个功能旨在帮助大家自动注释细胞
2.2.1.利用Marker制作Garnett文件
assigned_type_marker_test_res <- top_markers(cds,
group_cells_by="assigned_cell_type",
reference_cells=1000, cores=2)
garnett_markers <- assigned_type_marker_test_res %>% filter(marker_test_q_value < 0.01 & specificity >= 0.5) %>% group_by(cell_group) %>% top_n(5, marker_score)garnett_markers <- garnett_markers %>% group_by(gene_short_name) %>% filter(n() == 1)garnett_markers[1:5,1:5]
## # A tibble: 5 × 5
## # Groups: gene_short_name [5]
## gene_id gene_short_name cell_group marker_score mean_expression
## <chr> <fct> <chr> <dbl> <dbl>
## 1 WBGene00000063 act-1 Bio_1 0.750 3.39
## 2 WBGene00001172 egl-3 Bio_2 0.229 1.88
## 3 WBGene00001189 egl-21 Bio_2 0.212 1.39
## 4 WBGene00002048 ida-1 Bio_2 0.291 2.14
## 5 WBGene00003369 mlc-1 Bio_1 0.738 1.41
generate_garnett_marker_file(garnett_markers,
file="./marker_file.txt")
## Garnett marker file written to ./marker_file.txt
2.2.2.利用先验Marker制作Garnett文件
#如果你有一些先验的marker,也可以自行修改Garnett文件#目前Garnett可以支持monocle2和monocle3
if(!require(org.Mm.eg.db))BiocManager::install(
c("org.Mm.eg.db", "org.Hs.eg.db","org.Ce.eg.db"))
## Warning: package 'AnnotationDbi' was built under R version 4.1.1
if(!require(garnett))devtools::install_github("cole-trapnell-lab/garnett", ref="monocle3")
unique(clusters(cds))
## [1] 6 1 4 3 5 7 2 8 9
## Levels: 1 2 3 4 5 6 7 8 9
colData(cds)$garnett_cluster <- clusters(cds)#根据marker基因注释细胞
worm_classifier <- train_cell_classifier(cds = cds,
marker_file = "./marker_file.txt",
db=org.Ce.eg.db::org.Ce.eg.db,
cds_gene_id_type = "ENSEMBL",
num_unknown = 50,
#防止被错误的分配,可以输入unknown的数量
marker_file_gene_id_type = "SYMBOL",
cores=4)#读取并构建garnett注释
## Warning in make_name_map(parse_list,
## as.character(row.names(rowData(norm_cds))), : 1 genes could not be converted
## from SYMBOL to ENSEMBL These genes are listed below:
## Warning in make_name_map(parse_list, as.character(row.names(rowData(norm_cds))), : The following genes from the cell type definition file are not presentin the cell dataset. Please check these genes for errors. Cell typedetermination will continue, ignoring these genes.
## tag-80
## training_sample
## Cell type Bio_1 Cell type Bio_2 Cell type Bio_3 Unknown
## 118 106 64 50
rownames(cds)[1:5]
## [1] "WBGene00000001" "WBGene00000002" "WBGene00000003" "WBGene00000004"
## [5] "WBGene00000005"
class(worm_classifier)#如果你对一个组织器官比较了解,可以给出一些先验的marker
## [1] "garnett_classifier"
## attr(,"package")
## [1] "garnett"
cds <- classify_cells(cds, worm_classifier,
db = org.Ce.eg.db::org.Ce.eg.db,
cluster_extend = TRUE,
cds_gene_id_type = "ENSEMBL")#把garnett的注释加进来unique
(colData(cds)$garnett_cluster)
## [1] 6 1 4 3 5 7 2 8 9
## Levels: 1 2 3 4 5 6 7 8 9
unique(colData(cds)$cluster_ext_type)
## [1] "Cell type Bio_1" "Cell type Bio_3" "Unknown" "Cell type Bio_2"
plot_cells(cds,show_trajectory_graph = F, group_cells_by="partition",
color_cells_by="cluster_ext_type", cell_size = 2,
graph_label_size = 5, group_label_size = 3)|
plot_cells(cds,show_trajectory_graph = F, group_cells_by="partition",
color_cells_by="assigned_cell_type", cell_size = 2,
graph_label_size = 5, group_label_size = 3)
欢迎关注同名公众号~