celaref ||单细胞细胞类型定义工具

Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.

当我们拿到单细胞转录组数据的时候，不管结果怎么样，翻来覆去发现其实就是一个基因和细胞的矩阵而已！一般基于这个矩阵，会先过滤细胞均一化，降维，聚类（最重要的一步，因为它可以发现细胞的异质性），聚类结果的差异分析（声称找到了每个亚群的marker基因），最后做一下数据的可视化。

且慢，用非监督的方式聚出来的类（一般叫做 $Cluster_i,i$ 为分的类数），既然是细胞异质性的体现，那么它到底是什么细胞呢，可以叫做什么，是CD4细胞呢还是microglia细胞？因为 $cluster_{i}$ 是没有意义的，只有起了像CD4这样的名字，这类细胞才有故事可以讲。

在定义细胞类型之前，需要确定就哪种聚类结果来做，是图聚类的结果还是k-means某一类的结果。如何来确定？看看分群的tsne图是一个不错的参考。

那么呢，之前你们家大师推荐过几个数据库single cell marker 基因数据库是基于某亚群细胞的marker基因，根据已经知道marker基因的细胞类型，这些数据库其实就是一个细胞类型和marker基因大表。根据marker基因列表与分析出来的差异基因列表以及基因丰度的关系（往往是做一个热图）来推断某个cluster属于哪一种细胞类型。

另外还有一种定义细胞类型的方法是通过每个cluster与已知细胞类型的表达谱的相似性来定义细胞类型，已有的方法为SingleR、celaref。均是R包，其中celaref是2019年的新品，也是本文主要介绍的一种方法。

celaref 简介

走遍示例流程

安装celaref包的时候，就把示例数据也下载了，首先载入数据，然后观察一下示例数据是怎么组织的，以便我们以后组织自己的数据。在接触一个新包的时候，最好的学习方法就是用示例数据走一遍然后看看有哪些功能（函数以及函数的参数可以用），多用help或者？来查看细节。一个R包封装的越好也就要求我们花时间去了解他的细节。

library(celaref)

# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# Load data
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
head(toy_ref_se)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

可见我载入了两个数据：一个是toy_ref_se，reference；一个是toy_query_se，query.都是SummarizedExperiment对象:

对数据进行过滤，help一下这个函数，看看都有哪些过滤条件。

# Filter data
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

对参考和需要鉴定的表达谱分别做差异分析：

# Setup within-experiment differential expression
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")

分别得到差异基因结果：

绘制一组小提琴图，显示每个查询组的“top”基因在参考数据库中的分布。

# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)

根据参考数据集的相似性，在查询数据集中构造一些合理的标签或组/集群。

# And get group labels
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)

可以看到除了Group2 没有相应的similarity 类似的type之外，其余的都找到了相对应的细胞类型（基于检验的，不是生物学的）。这当然可以为我们定义细胞类型做一个统计上的参考。

准备数据

那么，如何应用我们自己的数据来做呢？官网也给出了数据准备的方法：

在准备数据时需要以下数据：

Counts Matrix Number of gene tags per gene per cell.
Cell information A sample-sheet table of cell-level information. Only two fields are essential:
cell_sample: A unique cell identifier
group: The cluster/group to which the cell has been assigned.
Gene information Optional. A table of extra gene-level information.
ID: A unique gene identifier

输入数据：

tab键分割的ounts matrix

gene	Cell1	cell2	cell3	cell4	…	cell954
GeneA	0	1	0	1	…	0
GeneB	0	3	0	2	…	2
GeneC	1	40	1	0	…	0
....

tab键分割的细胞分群信息

CellId	Sample	Cluster
cell1	Control	cluster1
cell2	Control	cluster7
…	…	…
cell954	KO	cluster8

From 10X pipeline output

dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata', 
                                   dataset_genome = "GRCh38", 
                                   clustering_set = "kmeans_7_clusters")

拿一个10X的例子试试吧

#Prepare 10X query dataset

先要把数据下载好，发现这应用的对象还是cellranger V2 pipeline的输出结果，注意为了和官网数据格式保持一致，我把 $otus$ 文件夹命名为 $10X_pbmc4k$ 了，这个前后一致就可以了。

+--- 10X_pbmc4k
|   +--- analysis
|   |   +--- clustering
|   |   |   +--- graphclust
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_10_clusters
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_2_clusters
···
|   |   +--- diffexp
|   |   |   +--- graphclust
|   |   |   |   +--- differential_expression.csv
|   |   |   +--- kmeans_10_clusters
···
|   |   +--- pca
|   |   |   +--- 10_components
|   |   |   |   +--- components.csv
|   |   |   |   +--- dispersion.csv
|   |   |   |   +--- genes_selected.csv
|   |   |   |   +--- projection.csv
|   |   |   |   +--- variance.csv
|   |   +--- tsne
|   |   |   +--- 2_components
|   |   |   |   +--- projection.csv
|   +--- filtered_gene_bc_matrices
|   |   +--- GRCh38
|   |   |   +--- barcodes.tsv
|   |   |   +--- genes.tsv
|   |   |   +--- matrix.mtx

datasets_dir <- "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\celaref/celaref_extra_vignette_data/datasets"
dir(datasets_dir)

dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
  dataset_path   = file.path(datasets_dir,'10X_pbmc4k'), 
  dataset_genome = "GRCh38", 
  clustering_set = "kmeans_7_clusters", 
  id_to_use      = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)

可见我选择的是kmeans_7_clusters 的聚类结果来进行细胞定义。处理过之后：

> dataset_se.10X_pbmc4k_k7
class: SummarizedExperiment 
dim: 15407 4340 
metadata(0):
assays(1): ''
rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(4): ensembl_ID GeneSymbol ID total_count
colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
colData names(2): cell_sample group

下一步是对数据的转化和处理，也是one-to-others的DE分析，非常的慢和消耗内存，有可能跑断的。关键是这个函数做了是什么？其实就是把我们的矩阵文件整理成上面演示的 $toy \_ query\_se$ 格式。

de_table.10X_pbmc4k_k7   <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7)
Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded  # 这个sorry 可以忽略 ，在Linux下才能指定运行进程数。
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
                                                                                                                                                          
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
                                                                                                                                                          
Done!
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
 Completed [=======>----------------------------------------------------------------------------------------------------------------]   7% with 0 failures

windows下居然跑了一上午

准备 reference数据集

来自响应的数据库，因为我们的是血细胞，选择的是haemosphere .

A suitable PBMC reference (a ‘HaemAtlas’) has been published by $Watkins et al. (2009)$ . They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the $‘haemosphere’$ website (Graaf et al. 2016).

下载页面

下载完之后由于是微阵列数据需要做一些特殊处理。

Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)

library(readr)
this_dataset_dir     <- file.path(datasets_dir,     'haemosphere_datasets','watkins')
norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt")
samples_file         <- file.path(this_dataset_dir, "watkins_samples.txt")

norm_expression_table.full <- read.table(norm_expression_file, sep="\t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE)

samples_table              <- read_tsv(samples_file, col_types = cols())
samples_table$description  <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays

> samples_table$description
 [1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult"
 [8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"
[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"
[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"
[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"
[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"
[43] "NA."             "NA."             "NA."             "NA."             "NA."             "NA."             "NA."            
[50] "NA."

从样本表中可以看出，这个数据集包含了其他组织，但是作为PBMC的参考，我们只考虑外周血样本。与其他数据加载函数一样，要从分析中删除一个样本(或单元格)，从样本表中删除它就ok了。

amples_table        <- samples_table[samples_table$tissue == "Peripheral Blood",] 
> samples_table
# A tibble: 42 x 7
   sampleId     celltype            cell_lineage       surface_markers tissue           description     notes
   <chr>        <chr>               <chr>              <lgl>           <chr>            <chr>           <lgl>
 1 1674120023_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 2 1674120023_B granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
 3 1674120023_C natural killer cell NK Cell Lineage    NA              Peripheral Blood X49.years.adult NA   
 4 1674120023_D Th lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 5 1674120023_E Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 6 1674120023_F monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 7 1674120053_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 8 1674120053_B monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 9 1674120053_C granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
10 1674120053_D Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
# ... with 32 more rows

通常情况下，最难的部分是格式化输入。应该使用与查询数据集相同的id，将微阵列表达式值作为规范化的、日志转换的数据。任何探头或样品级的过滤也应事先进行。在这种情况下，从网站获得的数据是正常的，但仍然需要匹配id。

library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))

# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe  > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]

convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){
  
  the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
  colnames(the_probes_table) <- c("old_id", "new_id")
  
  # Before DE, just pick the top expresed probe to represent the gene
  # Not perfect, but this is a ranking-based analysis.
  # hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
  probe_expression_levels <- rowSums(expression_table)
  the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]
  
  the_genes_table <-  the_probes_table %>% 
    group_by(new_id) %>%
    top_n(1, avgexpr)
  
  expression_table <- expression_table[the_genes_table$old_id,]
  rownames(expression_table) <- the_genes_table$new_id
  
  return(expression_table)
}

# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full, 
                                                            probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")



# Go...
de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
  norm_expression_table = norm_expression_table.genes, 
  sample_sheet_table    = samples_table, 
  dataset_name          = "Watkins2009PBMCs", 
  extra_factor_name     = 'description', 
  sample_name           = "sampleId",
  group_name            = 'celltype')
> de_table.Watkins2009PBMCs
# A tibble: 113,376 x 21
   ID    logFC AveExpr     t  P.Value adj.P.Val     B     ci CI_upper CI_lower ci_inner ci_outer log2FC      fdr group sig   sig_up
   <chr> <dbl>   <dbl> <dbl>    <dbl>     <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <fct> <lgl> <lgl> 
 1 JCHA~  6.86    8.65  27.5 7.13e-26  2.70e-23  49.0 0.0952     6.96     6.77     6.77     6.96   6.86 2.70e-23 B ly~ TRUE  TRUE  
 2 FCRLA  6.41    8.77  23.7 1.13e-23  3.29e-21  44.0 0.103      6.51     6.30     6.30     6.51   6.41 3.29e-21 B ly~ TRUE  TRUE  
 3 VPRE~  6.18    8.66  40.2 1.20e-31  8.70e-29  62.0 0.0587     6.23     6.12     6.12     6.23   6.18 8.70e-29 B ly~ TRUE  TRUE  
 4 TCL1A  6.09    8.52  42.8 1.33e-32  1.14e-29  64.1 0.0544     6.14     6.04     6.04     6.14   6.09 1.14e-29 B ly~ TRUE  TRUE  
 5 CD79A  6.09    9.08  22.7 5.17e-23  1.40e-20  42.4 0.102      6.20     5.99     5.99     6.20   6.09 1.40e-20 B ly~ TRUE  TRUE  
 6 CD19   5.90    8.44  38.4 6.09e-31  4.11e-28  60.5 0.0587     5.96     5.85     5.85     5.96   5.90 4.11e-28 B ly~ TRUE  TRUE  
 7 HLA-~  5.61    8.35  47.9 2.30e-34  2.56e-31  68.0 0.0447     5.66     5.57     5.57     5.66   5.61 2.56e-31 B ly~ TRUE  TRUE  
 8 OSBP~  5.48    7.83  53.2 5.60e-36  8.82e-33  71.4 0.0394     5.52     5.45     5.45     5.52   5.48 8.82e-33 B ly~ TRUE  TRUE  
 9 CNTN~  5.38    7.95  49.3 8.12e-35  9.60e-32  68.9 0.0416     5.42     5.34     5.34     5.42   5.38 9.60e-32 B ly~ TRUE  TRUE  
10 COBL~  5.26    7.70  56.4 6.77e-37  1.28e-33  73.3 0.0356     5.30     5.23     5.23     5.30   5.26 1.28e-33 B ly~ TRUE  TRUE  
# ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>

细胞类型定义（Compare 10X query PBMCs to to reference）

数据准备好就和之前的示例是一样的了。

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)

Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.

Logging the plot will be more informative at the top end for this dataset.

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)

可以打标签啦。

 label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)
label_table.pbmc4k_k7_vs_Watkins2009PBMCs

可以看出，七个群有六个找到了与之相似的群名称，并给出了pval.

make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)

行啦，人pbmc细胞的定义工作就完成了。官网也提供了小鼠细胞类型识别的教程。这里就不再复述啦。

完成了细胞类型定义，单细胞的故事才刚刚开始。所以，这不是结束，甚至不是结束的开始，而可能是开始的结束。

SingleR
celaref
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
SummarizedExperiment

最后编辑于：2019.05.31 18:41:51

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 193,812评论 5赞 459
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 81,626评论 2赞 371
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 141,144评论 0赞 319
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 52,052评论 1赞 263
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 60,925评论 4赞 355
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 46,035评论 1赞 272
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 36,461评论 3赞 381
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 35,150评论 0赞 253
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 39,413评论 1赞 290
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 34,501评论 2赞 307
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 36,277评论 1赞 325
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 32,159评论 3赞 312
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 37,528评论 3赞 298
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 28,868评论 0赞 17
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,143评论 1赞 250
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 41,407评论 2赞 341
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 40,615评论 2赞 335