10X Single-Cell (10X Spatial Transcriptomics) Clustering Analysis with scDCC

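Hi everyone. Today let's put in some effort and look at how the dropout phenomenon, which is widespread in 10X single-cell and 10X spatial transcriptomics data (expressed genes being recorded as zero counts), affects our analysis, and how the method in the paper, scDCC, works around it. The paper is "Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data", published in Nature Communications (NC) in March 2021 by Chinese authors, which is a respectable venue. For background on dropout, you can read my earlier post, "How Dropout Works in Deep Learning (10X Single-Cell and 10X Spatial Transcriptomics)", for a quick overview. Now let's dig in and see how the problem is solved. (When will we be able to write our own algorithms, instead of reading and borrowing other people's?)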

As usual, the paper first, then the example code.

Abstract

Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. (I believe this is what most of us do: once we have the matrix, we go straight to dimensionality reduction and clustering with Seurat, using almost no prior knowledge.) When confronted by the high dimensionality and pervasive dropout events of scRNA-Seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters (have you run into this? Some clusters have very few differentially expressed genes, or none at all, and simply cannot be annotated), which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. (Indeed, blind parameter tuning; this may also be the biggest gap between research services and clinical testing.) Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC, that integrates domain knowledge into the clustering step (using prior knowledge in clustering, which is actually hard to do). Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment. (The last sentence is boilerplate; every paper praises itself, otherwise it would not get published 😄)

Introduction

Let's distill this part.
The commonly used workflow today reduces dimensionality with PCA, t-SNE or UMAP and then clusters with k-means or hierarchical clustering for visualization, including SC3 (Spectral clustering), pcaReduce (PCA + k-means + hierarchical), TSCAN (PCA + Gaussian mixture model) and Mpath (Hierarchical), to name a few. (There really are a lot; I have previously shared the principles of PCA and a deeper discussion, which you can look up.) However, because single-cell data are sparse (here meaning dropout and highly variable gene-level expression), these traditional clustering methods tend to give suboptimal results.
Recently, various clustering methods have been proposed to overcome the challenges in scRNA-seq data analysis.
(1) Shared nearest neighbor (SNN)-Cliq combines a quasi-clique-based clustering algorithm with the SNN-based similarity measure to automatically identify clusters in the high-dimensional and high-variable scRNA-seq data. (SNN is the approach Seurat uses for clustering.)
(2) DendroSplit applies "split" and "merge" operations to the dendrogram obtained from hierarchical clustering, which iteratively groups cells based on their pairwise distances (calculated on selected genes), to uncover multiple levels of biologically meaningful populations with interpretable hyperparameters. (Hierarchical clustering is actually rarely used.)
(3) If the dropout probability P(u) is a decreasing function of the gene expression u, CIDR uses a nonlinear least-squares regression to empirically estimate P(u) and imputes the gene expressions with a weighted average to alleviate the impact of dropouts. (Personally, I do not find this very convincing.)
(4) In CIDR, clustering analysis is then performed on the first few principal coordinates, obtained through principal coordinate analysis (PCoA) on the imputed expression matrix. (This is what everyone does; the only difference may be in how many components to keep.)
(5) SIMLR and MPSSC are both multiple kernel-based spectral clustering methods. Considering the complexities of the scRNA-seq data, multiple kernel functions can help to learn robust similarity measures that correspond to different informative representations of the data. (Frankly, I had never heard of these methods 😂) However, spectral clustering relies on the full graph Laplacian matrix, which is prohibitively expensive to compute and store. (The drawback seems significant; no wonder I had not heard of them.)
(6) The high complexity and limited scalability generally impede applying these methods to large scRNA-seq datasets. (The characteristics of single-cell data really do not allow older-generation methods to be applied as they are.)
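
To make the idea of an SNN-based similarity concrete, here is a minimal sketch in plain Python. It scores two cells by the Jaccard overlap of their k-nearest-neighbour sets, which is the flavour of SNN used in graph-based clustering; it is not the exact rank-based definition used by SNN-Cliq, and the function names, the toy coordinates and the choice to count each cell as its own neighbour are all my assumptions for illustration.

```python
import math

def knn_sets(points, k):
    """For each point, return the set of its k nearest neighbours plus itself."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]) | {i})
    return neighbours

def snn_similarity(points, k):
    """Jaccard overlap of neighbour sets: 1.0 = identical neighbourhoods, 0.0 = disjoint."""
    nbrs = knn_sets(points, k)
    n = len(points)
    return [
        [len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j]) for j in range(n)]
        for i in range(n)
    ]

# Toy data: two well-separated groups of three "cells" in a 2-D embedding.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sim = snn_similarity(cells, k=2)
print(sim[0][1], sim[0][3])
```

Cells in the same group share their whole neighbourhood, while cells in different groups share nothing, which is why a graph built on this similarity separates clusters even when raw distances are noisy.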

Models

The large number of cells profiled by scRNA-seq gives researchers a unique opportunity to apply deep learning approaches to model the noisy and complex scRNA-seq data. (This is my career goal.)
(1) scScope and DCA (Deep Count Autoencoder) apply regular autoencoders to denoise single-cell gene expression data and impute the missing values. (Frankly, I had not heard of these either.) In autoencoders, the low-dimensional bottleneck layer enforces the encoder to learn only the essential latent representations and the decoding procedure ignores non-essential sources of variations of the expression data. (This is fairly specialized; look it up if you are interested.)
(2) Compared to scScope, DCA explicitly models the overdispersion and zero-inflation with a zero-inflated negative binomial (ZINB) model-based loss function and learns gene-specific parameters (mean, dispersion and dropout probability) from the scRNA-seq data. (The zero-inflated negative binomial distribution: I do not know how familiar you are with it, but those who have used scanpy should know it.)
(3) SCVI and SCVIS are variational autoencoders (VAE) focusing on dimension reduction of scRNA-seq data. (I had not heard of these methods either; clearly I still have a lot to learn.) Unlike autoencoder, variational autoencoder assumes that latent representations learnt by the encoder follow a predefined distribution (typically a Gaussian distribution). SCVIS uses the Student's t-distributions to replace the regular MSE-loss (mean square error) VAE, while SCVI applies the ZINB-loss VAE to characterize scRNA-seq data. (Each has its merits in the choice of distribution.) Variational autoencoder is a deep generative model, but the assumption of latent representations following a Gaussian distribution might introduce the overregularization problem and compromise its performance. (Again the drawback is obvious; no wonder they are not used much 😄)
(4) More recently, Tian et al. developed a ZINB model-based deep clustering method (scDeepCluster) and showed that it could effectively characterize and cluster the discrete, over-dispersed and zero-inflated scRNA-seq count data. (The authors cite their own earlier paper, which is nice; and ZINB is the most commonly used distribution for single-cell data.) scDeepCluster combines the ZINB model-based autoencoder with the deep embedding clustering, which optimizes the latent feature learning and clustering simultaneously to achieve better clustering results. (It is not enough for the authors to think it is good; we have to find it good too.)
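
Since the ZINB loss keeps coming up, here is a minimal sketch of what it computes, written in plain Python rather than a deep learning framework. It uses the standard mean/dispersion parameterisation of the negative binomial plus a dropout probability pi that adds extra mass at zero. The function names and the toy counts are mine; in DCA or scDeepCluster the three parameters are predicted per gene and per cell by the decoder, which this sketch does not attempt to reproduce.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu and dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a dropout (forced zero)."""
    if x == 0:
        # A zero is either a dropout or a genuine NB zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)

def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts, the quantity a ZINB loss minimises."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)

# Toy counts for one gene with many zeros (hypothetical values).
gene_counts = [0, 0, 0, 3, 0, 5, 0, 2]
print(zinb_nll(gene_counts, mu=2.0, theta=1.5, pi=0.3))
print(zinb_nll(gene_counts, mu=2.0, theta=1.5, pi=0.0))
```

Setting pi to 0 reduces the model to an ordinary negative binomial, so comparing the two printed values shows how much the zero-inflation term helps on zero-heavy data.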

Downstream analysis

Much of the downstream biological investigation relies on initial clustering results. Although clustering aims to explore and uncover new information, biologists expect to see some meaningful clusters that are consistent with their prior knowledge. (A typical results-driven mindset, not far from fabrication.) In other words, totally exotic clustering with poor biological interpretability is puzzling, which is generally not desired by biologists. (But it still has to be grounded in objective facts.) For a clustering algorithm, it is good to accommodate biological interpretability while minimizing clustering loss from computational aspect. (This is the goal of the algorithm.) However, existing algorithms only support unsupervised clustering (supervised is not necessarily better than unsupervised), and the result sometimes disagrees with prior knowledge. If a method initially fails to find a meaningful solution, the only recourse may be for the user to manually and repeatedly tweak clustering parameters until sufficiently good clusters are found. (Be careful here; do not just follow the crowd.)
We note that prior knowledge has become widely available in many cases. (But it is not necessarily all correct.) Quite a few cell type-specific signature sets have been published. (Every sample is different; you cannot treat them all the same with a one-size-fits-all rule.) Ignoring prior information may lead to suboptimal, unexpected, and even illogical clustering results. (I do not entirely agree with this sentence: improving the algorithm is understandable, but with too much human intervention the result is just as bad.) The paper then mentions several cell type annotation tools. Honestly, cell state is a dynamic process, and it is unlikely that software alone can identify cell types; prior knowledge also does not necessarily fit every situation. Different tissues, sources, strains and treatments all change the cells, so I personally disagree with the view here.
However, there are several limitations of these methods.
(1) First, they are developed in the context of the marker genes and lack the flexibility to integrate other kinds of prior information. (Human intervention must never be excessive.)
(2) Second, they are only applicable to scenarios where cell types are predefined and well-studied marker genes exist. (This is not quite right either.)
Poorly understood cell types would be invisible to these methods. Finally, they both ignore pervasive dropout events, a well-known problem for scRNA-seq data.

In this article, we are interested in integrating prior information into the modeling process to guide our deep learning model to simultaneously learn meaningful and desired latent representations and clusters. (Prior knowledge combined with machine learning; once human factors are involved, be careful.) Specifically, we convert (partial) prior knowledge into soft pairwise constraints and add them as additional terms into the loss function for optimization. (Manually adding external factors.) This falls under semi-supervised learning, and the software is scDCC.

[Image: overview of scDCC]

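scDCC encodes prior knowledge into constraint information, which is integrated to the clustering procedure via a novel loss function. Of course, the paper goes on to say that its own method is good, which we should read critically (the algorithm itself is covered in the Methods section).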
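
To show how pairwise constraints can become extra loss terms, here is a minimal sketch in plain Python. It assumes each cell has a soft cluster assignment q (a probability vector over clusters) and uses the form that is common in constrained deep clustering: a must-link pair is penalised when the two cells are unlikely to share a cluster, and a cannot-link pair is penalised when they are likely to. Whether scDCC averages or sums over pairs, and how it weights these terms against the ZINB and clustering losses, is not stated in this post, so treat the averaging, the eps value and the function names as my assumptions and check the paper's Methods for the exact definition.

```python
import math

def same_cluster_prob(q_i, q_j):
    """Probability that two cells fall in the same cluster under soft assignments."""
    return sum(a * b for a, b in zip(q_i, q_j))

def must_link_loss(q, ml_pairs, eps=1e-10):
    """Small when must-link pairs share a cluster with high probability."""
    terms = [math.log(same_cluster_prob(q[i], q[j]) + eps) for i, j in ml_pairs]
    return -sum(terms) / len(ml_pairs)

def cannot_link_loss(q, cl_pairs, eps=1e-10):
    """Small when cannot-link pairs are unlikely to share a cluster."""
    terms = [math.log(1.0 - same_cluster_prob(q[i], q[j]) + eps) for i, j in cl_pairs]
    return -sum(terms) / len(cl_pairs)

# Toy soft assignments for three cells over two clusters (hypothetical values).
q = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(must_link_loss(q, [(0, 1)]), must_link_loss(q, [(0, 2)]))
print(cannot_link_loss(q, [(0, 2)]), cannot_link_loss(q, [(0, 1)]))
```

Cells 0 and 1 sit in the same cluster, so a must-link between them costs little and a cannot-link costs a lot; the opposite holds for cells 0 and 2. Because the penalty is added to the loss rather than enforced as a hard rule, a wrong constraint can be overruled by the data, which is what "soft" means here.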

Result1 Pairwise constraints.

Pairwise constraints mainly focus on the together or apart guidance as defined by prior information and domain knowledge. They enforce small divergence between predefined "similar" samples, while enlarging the difference between "dissimilar" instances. (Put simply, prior knowledge is used to constrain "distance": how close similar samples should be and how far apart dissimilar ones should be.) Researchers usually encode the together and apart information into must-link (ML) and cannot-link (CL) constraints, respectively. (Categorizing the information.) With the proper setup, pairwise constraints have been proved to be capable of defining any ground-truth partition. (This is basically machine learning.) In the context of scRNA-seq studies, pairwise constraints can be constructed based on the cell distance computed using marker genes (where do the marker genes come from? Someone else's? That does not seem very reliable), cell sorting using flow cytometry, or other methods depending on real application scenarios.
To evaluate the performance of pairwise constraints, the paper uses the following datasets:

[Image: datasets used in the evaluation]

We selected 10% of cells with known labels to generate constraints in each dataset and evaluated the performance of scDCC on the remaining 90% of cells. (Frankly, this approach is silly.) We show that the prior information encoded as soft constraints could help inform the latent representations of the remaining cells and therefore improve the clustering performance. (This part is basically useless.)
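
Here is a minimal sketch of how constraints could be generated from a labelled 10% of cells: sample pairs from the labelled subset, and call a pair must-link if the labels match and cannot-link otherwise. The function name, the toy labels and the sampling scheme (pairs drawn independently, so duplicates are possible) are my assumptions; the post does not say how the authors sampled pairs.

```python
import random

def generate_constraints(labels, labelled_cells, n_pairs, seed=0):
    """Sample n_pairs pairs from the labelled cells and split them into ML and CL lists."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(labelled_cells, 2)  # two distinct labelled cells
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

# Toy setup: 100 cells from 4 cell types, 10% of them with known labels.
labels = [cell % 4 for cell in range(100)]
labelled_cells = random.Random(1).sample(range(100), 10)
must_link, cannot_link = generate_constraints(labels, labelled_cells, n_pairs=30)
print(len(must_link), len(cannot_link))
```

Only the labelled cells ever appear in a constraint; the remaining 90% are influenced indirectly, through the shared encoder that the constraint terms help to train.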
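Three clustering metrics:
(1) normalized mutual information (NMI), range 0 to 1.
(2) clustering accuracy (CA), range 0 to 1.
(3) adjusted Rand index (ARI) (see the Rand index), which can be negative.
(A quick primer: the Rand index requires the true class labels C. Let K be the clustering result, a the number of pairs of elements that are in the same class in both C and K, and b the number of pairs that are in different classes in both C and K. The Rand index is (a + b) divided by the total number of pairs, so it measures how often the two partitions agree on whether a pair of objects belongs together.)
A larger value indicates better concordance between the predicted labels and ground truth. The number of pairwise constraints fed into the model explicitly controls how much prior information is applied in the clustering process. (This is quite limiting.)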
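
To make the Rand index and its adjusted version concrete, here is a small self-contained sketch in plain Python. The Rand index counts agreeing pairs directly; the ARI uses the standard contingency-table formula that subtracts the agreement expected by chance. The function names and toy labels are mine, and in practice you would simply call a library implementation.

```python
from collections import Counter
from math import comb

def rand_index(truth, pred):
    """Fraction of pairs on which the two partitions agree (same/same or different/different)."""
    n = len(truth)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (truth[i] == truth[j]) == (pred[i] == pred[j]):
                agree += 1
    return agree / comb(n, 2)

def adjusted_rand_index(truth, pred):
    """Rand index corrected for chance; 1.0 is perfect, about 0 is random, negative is worse than random."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, labels renamed
print(rand_index(truth, [0, 1, 0, 1]), adjusted_rand_index(truth, [0, 1, 0, 1]))
```

The second example splits every true pair, which gives a positive Rand index but a negative ARI, and that is exactly why the ARI can drop below zero while NMI and CA cannot.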
Let's look at the experimental results in the paper.

[Image: experimental results]

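Not bad, of course; the prior knowledge used in the paper was surely well prepared. For datasets that are difficult to cluster, imposing a small set of pairwise constraints significantly improves the results. With 6000 pairwise constraints, scDCC achieves acceptable performance on all four datasets. (With this much prior knowledge, do we even need to validate anything? Might as well annotate everything directly.)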

[Image: t-SNE of the latent representations, with ML (blue) and CL (red) constraints]

A random subset of corresponding ML (blue lines) and CL (red lines) constraints are also plotted (t-SNE).
As shown, the latent representations learned by the ZINB model-based autoencoder are noisy and different labels are mixed. Although the representations from scDeepCluster could separate different clusters, the inconsistency against the constraints still exists. Finally, by incorporating the soft constraints into the model training, scDCC was able to precisely separate the clusters and the results are consistent with both ML (blue lines) and CL (red lines) constraints. (Their own software performs best, which hardly needs saying, because it is given more information.) Overall, these results show that pairwise constraints can help to learn a better representation during the end-to-end learning procedure and improve clustering performance.

For the randomly selected 2100 cells in each dataset, we observed that scDCC with 0 constraint outperformed most competing scRNA-seq clustering methods (this is the more meaningful comparison) (some strong methods outperformed scDCC with 0 constraints on some datasets, such as SC3 and Seurat on mouse bladder cells). (Seurat's clustering method is indeed quite good.) With constraints, scDCC performs best, which feels like a stretch.
The next part is important.
In real applications, we recognize that constraint information may not be 100% accurate. (Being half accurate would already be very good.) To evaluate the robustness of the proposed method, we applied scDCC to the datasets with 5% and 10% erroneous pairwise constraints. (Let's see what happens when the priors carry a certain error rate.) Of course, the robustness is decent, otherwise we would not be seeing this paper; but when the error rate gets somewhat high, the method breaks down completely. Therefore, users should take caution when adding highly erroneous constraints. Of course, the other validation results are also good.
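
If you want to run the same kind of robustness check yourself, here is a minimal sketch of injecting a fixed fraction of wrong constraints: a chosen share of pairs is moved to the opposite list, so a true must-link becomes a false cannot-link and vice versa. The post does not describe how the authors generated their erroneous constraints, so the flipping scheme, the function name and the toy pairs are my assumptions.

```python
import random

def corrupt_constraints(must_link, cannot_link, error_rate, seed=0):
    """Flip exactly round(error_rate * total) constraints to the opposite type."""
    rng = random.Random(seed)
    tagged = [(pair, True) for pair in must_link] + [(pair, False) for pair in cannot_link]
    n_flip = round(error_rate * len(tagged))
    to_flip = set(rng.sample(range(len(tagged)), n_flip))
    noisy_ml, noisy_cl = [], []
    for idx, (pair, is_ml) in enumerate(tagged):
        if idx in to_flip:
            is_ml = not is_ml  # this constraint is now wrong
        (noisy_ml if is_ml else noisy_cl).append(pair)
    return noisy_ml, noisy_cl

# Toy constraints: 100 must-link and 100 cannot-link pairs (hypothetical cell indices).
ml = [(i, i + 1) for i in range(0, 200, 2)]
cl = [(i, i + 1) for i in range(200, 400, 2)]
noisy_ml, noisy_cl = corrupt_constraints(ml, cl, error_rate=0.05)
print(len(noisy_ml), len(noisy_cl))
```

Running the clustering once with the clean lists and once with the noisy ones, then comparing ARI, gives a direct feel for how much error your own constraints can tolerate.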

Result2 Robustness on highly dispersed genes.

Gene filtering is widely applied in many single-cell analysis pipelines. (This is usually the first step of a real analysis.) One typical gene filtering strategy is to filter out low variable genes and only keep highly dispersed genes. (That is, selecting highly variable genes.) Selecting highly dispersed genes could amplify the differences among cells but lose key information between cell clusters. (Hmm, is that right???) To evaluate the robustness of scDCC on highly dispersed genes, we conducted experiments on the top 2000 highly dispersed genes of the four datasets and displayed the performances of scDCC and baseline methods. The results are decent too, of course, but not very useful.
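
For readers who have only ever called a library function for this step, here is a simplified sketch of selecting highly dispersed genes in plain Python. It ranks genes by variance divided by mean (the Fano factor) on raw counts. Real pipelines usually normalise first and compare dispersion within bins of similar mean expression, which this sketch deliberately skips; the function name and the toy matrix are my assumptions.

```python
def top_dispersed_genes(counts, n_top):
    """counts: list of cells, each a list of per-gene counts. Returns indices of the n_top genes."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    dispersion = []
    for g in range(n_genes):
        values = [cell[g] for cell in counts]
        mean = sum(values) / n_cells
        if mean == 0:
            dispersion.append(0.0)  # gene never detected, nothing to rank
            continue
        variance = sum((v - mean) ** 2 for v in values) / n_cells
        dispersion.append(variance / mean)
    ranked = sorted(range(n_genes), key=lambda g: dispersion[g], reverse=True)
    return ranked[:n_top]

# Toy matrix: 4 cells x 4 genes (hypothetical counts).
counts = [
    [5, 0, 1, 0],
    [5, 10, 1, 0],
    [5, 0, 2, 0],
    [5, 10, 0, 0],
]
print(top_dispersed_genes(counts, n_top=2))
```

Gene 0 is highly expressed but constant, so it is dropped, while gene 1 switches on and off across cells and ranks first. That also shows the risk the paper mentions: a gene that is informative but stable within every cluster can be filtered out.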

Result3 Real applications and use cases. (Take a look.)

Generating accurate constraints is the key to successfully apply the proposed scDCC algorithm to obtain robust and desired clustering results. (So this appears to be the main limitation.) There are two approaches:
(1) Protein marker-based constraints.
(2) Marker gene-based constraints.
Both require manual labeling first; there is still a long way to go.

Methods

[Image: Methods]

As for the code, it is here: scDCC

After reading this paper, I feel my life slipping away.

Life is good, and better with you.
