NBIS系列单细胞转录组数据分析实战(五):基因差异表达分析

第五节:基因差异表达分析

在本节教程中,我们将介绍基因差异表达分析的相关知识。在单细胞数据分析中,差异表达分析可以具有多种功能,例如鉴定细胞群的标记基因,以及跨条件(健康与对照)比较的差异调节基因。

加载所需的R包和数据集

suppressPackageStartupMessages({
    library(Seurat)
    library(venn)
    library(dplyr)
    library(cowplot)
    library(ggplot2)
    library(pheatmap)
    library(enrichR)
    library(rafalib)
})

alldata <- readRDS("data/results/covid_qc_dr_int_cl.rds")

在这里,我们选定louvain图聚类在0.5的分辨率下的聚类分群结果进行后续的差异表达分析。

# Set the identity as louvain with resolution 0.5
sel.clust = "CCA_snn_res.0.5"

alldata <- SetIdent(alldata, value = sel.clust)
table(alldata@active.ident)
##    0    1    2    3    4    5    6    7    8    9   10 
## 1453  570  528  494  487  470  446  400  240  230  214
# plot this clustering
plot_grid(ncol = 3, DimPlot(alldata, label = T) + NoAxes(), DimPlot(alldata, group.by = "orig.ident") + 
    NoAxes(), DimPlot(alldata, group.by = "type") + NoAxes())
image.png

鉴定细胞标记基因

首先,我们计算出每个聚类簇中差异表达排名靠前的基因。我们可以选择不同的计算方法和参数,以完善差异分析的结果。当寻找marker基因时,我们想要筛选出那些在一种细胞类型中高表达而在其他类型中可能不表达的基因。

# Compute differentiall expression
# 使用FindAllMarkers函数计算每个cluster中的差异表达基因
markers_genes <- FindAllMarkers(alldata, logfc.threshold = 0.2, test.use = "wilcox", assay = "RNA",
                                min.pct = 0.1, min.diff.pct = 0.2, only.pos = TRUE, max.cells.per.ident = 50)

现在,我们可以选择前25个上调的基因进行可视化展示。

top25 <- markers_genes %>% group_by(cluster) %>% top_n(-25, p_val_adj)
top25
#  p_val    avg_logFC    pct.1    pct.2    p_val_adj    cluster    gene
#2.264061e-18   2.1968625   0.964   0.072   4.102704e-14    0   VCAN
#9.762285e-18   2.0240455   0.994   0.135   1.769024e-13    0   FCN1
#9.967076e-18   1.9095912   0.939   0.058   1.806134e-13    0   CD14
#1.105185e-17   2.5175223   1.000   0.260   2.002706e-13    0   LYZ
#2.277970e-17   1.4242183   0.972   0.336   4.127910e-13    0   APLP2
#2.570420e-17   2.7792763   0.997   0.343   4.657858e-13    0   S100A9
#3.653861e-17   1.2712956   0.884   0.097   6.621161e-13    0   MPEG1
#1.199814e-16   2.1078711   0.985   0.122   2.174182e-12    0   AC020656.1
#1.341305e-16   2.8514451   0.993   0.257   2.430579e-12    0   S100A8
#1.725826e-16   1.8378729   0.966   0.111   3.127369e-12    0   MNDA

绘制条形图

mypar(2, 5, mar = c(4, 6, 3, 1))
for (i in unique(top25$cluster)) {
    barplot(sort(setNames(top25$avg_logFC, top25$gene)[top25$cluster == i], F), horiz = T, 
        las = 1, main = paste0(i, " vs. rest"), border = "white", yaxs = "i")
    abline(v = c(0, 0.25), lty = c(1, 2))
}
image.png

选择top5基因绘制热图

top5 <- markers_genes %>% group_by(cluster) %>% top_n(-5, p_val_adj)

# create a scale.data slot for the selected genes
alldata <- ScaleData(alldata, features = as.character(unique(top5$gene)), assay = "RNA")
DoHeatmap(alldata, features = as.character(unique(top5$gene)), group.by = sel.clust, assay = "RNA")
image.png

绘制气泡图

# 使用DotPlot函数绘制气泡图
DotPlot(alldata, features = rev(as.character(unique(top5$gene))), group.by = sel.clust, 
    assay = "RNA") + coord_flip()
image.png

绘制小提琴图

# take top 3 genes per cluster/
top3 <- top5 %>% group_by(cluster) %>% top_n(-3, p_val)

# set pt.size to zero if you do not want all the points to hide the violin
# shapes, or to a small value like 0.1
VlnPlot(alldata, features = as.character(unique(top3$gene)), ncol = 5, group.by = sel.clust, 
    assay = "RNA", pt.size = 0)
image.png

注意:针对以上差异分析内容,我们还可以选择不同的方法进行计算,如“wilcox”(Wilcoxon Rank Sum test), “bimod” (Likelihood-ratio test), “roc” (Identifies ‘markers’ of gene expression using ROC analysis),“t” (Student’s t-test),“negbinom” (negative binomial generalized linear model),“poisson” (poisson generalized linear model), “LR” (logistic regression), “MAST” (hurdle model), “DESeq2” (negative binomial distribution).

跨条件比较的差异表达分析

在我们的数据中,有来自患者和健康对照人群两大类,我们想知道在特定细胞类型中哪些基因的影响最大。因此,我们可以对Covid / Ctrl两组数据进行跨条件比较的差异表达分析。

# select all cells in cluster 2
cell_selection <- subset(alldata, cells = colnames(alldata)[alldata@meta.data[, sel.clust] ==  2])
cell_selection <- SetIdent(cell_selection, value = "type")

# Compute differentiall expression
DGE_cell_selection <- FindAllMarkers(cell_selection, logfc.threshold = 0.2, test.use = "wilcox", 
    min.pct = 0.1, min.diff.pct = 0.2, only.pos = TRUE, max.cells.per.ident = 50, 
    assay = "RNA")

绘制小提琴图查看差异基因的表达情况

top5_cell_selection <- DGE_cell_selection %>% group_by(cluster) %>% top_n(-5, p_val)

VlnPlot(cell_selection, features = as.character(unique(top5_cell_selection$gene)), 
    ncol = 5, group.by = "type", assay = "RNA", pt.size = 0.1)
image.png

查看差异基因在所有cluster中的分组表达情况

VlnPlot(alldata, features = as.character(unique(top5_cell_selection$gene)), ncol = 5, 
    split.by = "type", assay = "RNA", pt.size = 0)
image.png

基因集分析

超几何功能富集分析

当我们得到差异表达基因列表时,就可以使用超几何检验进行功能富集分析,这里我们使用enrichR包进行功能富集分析。

# Load additional packages
library(enrichR)

# Check available databases to perform enrichment (then choose one)
enrichR::listEnrichrDbs()
image.png
# Perform enrichment
enrich_results <- enrichr(genes = DGE_cell_selection$gene[DGE_cell_selection$cluster == 
    "Covid"], databases = "GO_Biological_Process_2017b")[[1]]
## Uploading data to Enrichr... Done.
##   Querying GO_Biological_Process_2017b... Done.
## Parsing results... Done.

一些感兴趣的数据库:

  • GO_Biological_Process_2017b
  • KEGG_2019_Human
  • KEGG_2019_Mouse
  • WikiPathways_2019_Human
  • WikiPathways_2019_Mouse

我们可以使用简单的条形图来可视化富集分析的结果

par(mfrow = c(1, 1), mar = c(3, 25, 2, 1))
barplot(height = -log10(enrich_results$P.value)[10:1], names.arg = enrich_results$Term[10:1], 
    horiz = TRUE, las = 1, border = FALSE, cex.names = 0.6)
abline(v = c(-log10(0.05)), lty = 2)
abline(v = 0, lty = 1)
image.png

基因集富集分析(GSEA)

除了使用超几何检验进行富集分析外,我们还可以执行基因集富集分析(GSEA),该功能对排序后的基因列表(通常基于倍数变化)进行评分,并通过置换检验来计算某个特定的基因集是否显著的富集到上调的还是下调的基因中。

# 计算差异表达基因
DGE_cell_selection <- FindMarkers(cell_selection, ident.1 = "Covid", logfc.threshold = -Inf, 
                                  test.use = "wilcox", min.pct = 0.1, min.diff.pct = 0, only.pos = FALSE, max.cells.per.ident = 50, assay = "RNA")

# Create a gene rank based on the gene expression fold change
# 根据差异倍数对基因进行排序
gene_rank <- setNames(DGE_cell_selection$avg_logFC, casefold(rownames(DGE_cell_selection), upper = T))

得到排好序的差异基因列表后,我们就可以对其进行基因集的富集分析。我们可以从分子特征数据库(MSigDB)中获取相应的基因集,这里以KEGG代谢通路富集为例。

# install.packages('msigdbr')
library(msigdbr)

# Download gene sets
msigdbgmt <- msigdbr::msigdbr("Homo sapiens")
msigdbgmt <- as.data.frame(msigdbgmt)

# List available gene sets
unique(msigdbgmt$gs_subcat)
##  [1] "MIR:MIR_Legacy"  "TFT:TFT_Legacy"  "CGP"             "TFT:GTRD"       
##  [5] ""                "CP:BIOCARTA"     "CGN"             "GO:MF"          
##  [9] "GO:BP"           "GO:CC"           "HPO"             "CP:KEGG"        
## [13] "MIR:MIRDB"       "CM"              "CP"              "CP:PID"         
## [17] "CP:REACTOME"     "CP:WIKIPATHWAYS"

# Subset which gene set you want to use.
msigdbgmt_subset <- msigdbgmt[msigdbgmt$gs_subcat == "CP:WIKIPATHWAYS", ]
gmt <- lapply(unique(msigdbgmt_subset$gs_name), function(x) {
    msigdbgmt_subset[msigdbgmt_subset$gs_name == x, "gene_symbol"]
})

names(gmt) <- unique(paste0(msigdbgmt_subset$gs_name, "_", msigdbgmt_subset$gs_exact_source))

接下来,我们进行GSEA富集分析。得到的结果中会包含多个不同代谢通路的富集情况,我们可以对这些通路进行排序和过滤,仅展示最重要的代谢通路。我们可以按p-valueNES值对它们进行排序和过滤。

library(fgsea)

# Perform enrichemnt analysis
fgseaRes <- fgsea(pathways = gmt, stats = gene_rank, 
                  minSize = 15, maxSize = 500,  nperm = 10000)

fgseaRes <- fgseaRes[order(fgseaRes$RES, decreasing = T), ]

# Filter the results table to show only the top 10 UP or DOWN regulated processes
# (optional)
top10_UP <- fgseaRes$pathway[1:10]

# Nice summary table (shown as a plot)
dev.off()
plotGseaTable(gmt[top10_UP], gene_rank, fgseaRes, gseaParam = 0.5)
## "","p_val","avg_logFC","pct.1","pct.2","p_val_adj","cluster","gene"
## "VCAN",2.26406050074545e-18,2.19686246834267,0.964,0.072,4.10270403340083e-14,"0","VCAN"
## "FCN1",9.76228453901092e-18,2.02404548581854,0.994,0.135,1.76902358131417e-13,"0","FCN1"
## "CD14",9.96707632855414e-18,1.90959124343176,0.939,0.058,1.8061339014973e-13,"0","CD14"
## "LYZ",1.10518526809416e-17,2.51752231699128,1,0.26,2.00270622431342e-13,"0","LYZ"
## "APLP2",2.27797041995035e-17,1.42421829154358,0.972,0.336,4.12791019799203e-13,"0","APLP2"
## "S100A9",2.57041990836472e-17,2.77927631368357,0.997,0.343,4.6578579159477e-13,"0","S100A9"
## "MPEG1",3.65386053599734e-17,1.27129564857088,0.884,0.097,6.62116067728078e-13,"0","MPEG1"
## "AC020656.1",1.19981360229422e-16,2.10787107628474,0.985,0.122,2.17418222871735e-12,"0","AC020656.1"
## "S100A8",1.34130508577447e-16,2.8514451433658,0.993,0.257,2.43057894593192e-12,"0","S100A8"
## "MNDA",1.72582559901868e-16,1.83787286513408,0.966,0.111,3.12736856798175e-12,"0","MNDA"
## "CSTA",1.95052415941971e-16,1.52842274057699,0.953,0.082,3.53454482928446e-12,"0","CSTA"
## "TYMP",2.33058390521209e-16,1.60278457009931,0.981,0.296,4.22325109463483e-12,"0","TYMP"
## "S100A12",2.96694584422187e-16,2.32785948742408,0.939,0.069,5.37640256431445e-12,"0","S100A12"
## "NCF2",4.76129446813577e-16,1.29612848623007,0.913,0.098,8.62794170570882e-12,"0","NCF2"
## "S100A11",1.45270665495142e-15,1.48753121077731,0.991,0.46,2.63244972943746e-11,"0","S100A11"
## "PLBD1",1.7606666199921e-15,1.56695692268896,0.856,0.066,3.19050398208769e-11,"0","PLBD1"
## "TSPO",2.29679670794202e-15,1.1932694569082,0.986,0.563,4.16202531446173e-11,"0","TSPO"
## "TNFSF13B",2.7508922871735e-15,1.28293955643987,0.886,0.085,4.9848919135871e-11,"0","TNFSF13B"
## "BRI3",3.18917499088416e-15,1.31778444541193,0.988,0.452,5.77910400098119e-11,"0","BRI3"
## "GRN",5.60438207582128e-15,1.63288864506976,0.967,0.174,1.01557007595957e-10,"0","GRN"
## "CSF3R",8.43482246596123e-15,1.38474196303786,0.926,0.087,1.52847417905684e-10,"0","CSF3R"
## "DUSP6",9.00960709980655e-15,1.63444829183343,0.828,0.103,1.63263090255594e-10,"0","DUSP6"
## "SERPINA1",1.04571152790341e-14,1.35247892492229,0.971,0.115,1.89493385971377e-10,"0","SERPINA1"
## "MS4A6A",1.06273950651256e-14,1.56597113708651,0.932,0.063,1.92579025975142e-10,"0","MS4A6A"
## "CYBB",1.33277817727406e-14,1.41435759146114,0.953,0.159,2.41512733503832e-10,"0","CYBB"
## "CFP",1.85094570039818e-14,1.09123390164341,0.855,0.089,3.35409870369153e-10,"0","CFP"
## "IGSF6",2.15961777805468e-14,1.15146027588538,0.782,0.071,3.91344337561288e-10,"0","IGSF6"
## "TALDO1",2.21894474063566e-14,1.16470313175123,0.957,0.353,4.02094976450588e-10,"0","TALDO1"
## "CTSS",3.93304535224592e-14,1.67034752638023,0.998,0.57,7.12707148280483e-10,"0","CTSS"
## "NAMPT",4.06179589121295e-14,1.69262915955195,0.943,0.242,7.36038033446699e-10,"0","NAMPT"
## "TKT",5.33889287865753e-14,1.31633400735535,0.979,0.42,9.6746077854153e-10,"0","TKT"
## "AIF1",7.38214607324845e-14,1.45756262888725,0.992,0.18,1.33771868993335e-09,"0","AIF1"
## "LILRA5",1.26075588622411e-13,0.981857624374599,0.766,0.075,2.28461574142671e-09,"0","LILRA5"
## "FGL2",1.27635506727916e-13,1.34945590284596,0.917,0.138,2.31288301741656e-09,"0","FGL2"
## "TIMP2",1.59612347621965e-13,0.991012230366151,0.818,0.08,2.89233535125763e-09,"0","TIMP2"
## "MEGF9",1.59668377892485e-13,0.767342234766623,0.628,0.078,2.89335067578972e-09,"0","MEGF9"
## "CD68",2.09471521369e-13,1.16673787964521,0.93,0.123,3.79583343872765e-09,"0","CD68"

保存基因差异表达分析的结果

saveRDS(alldata, "data/3pbmc_qc_dr_int_cl_dge.rds")
write.csv(markers_genes)
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] fgsea_1.16.0    msigdbr_7.2.1   rafalib_1.0.0   enrichR_2.1    
##  [5] pheatmap_1.0.12 ggplot2_3.3.3   cowplot_1.1.1   dplyr_1.0.3    
##  [9] venn_1.9        Seurat_3.2.3    RJSONIO_1.3-1.4 optparse_1.6.6 
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-0      rjson_0.2.20         
##   [4] deldir_0.2-9          ellipsis_0.3.1        ggridges_0.5.3       
##   [7] spatstat.data_1.7-0   farver_2.0.3          leiden_0.3.6         
##  [10] listenv_0.8.0         getopt_1.20.3         ggrepel_0.9.1        
##  [13] codetools_0.2-18      splines_4.0.3         knitr_1.30           
##  [16] polyclip_1.10-0       jsonlite_1.7.2        ica_1.0-2            
##  [19] cluster_2.1.0         png_0.1-7             uwot_0.1.10          
##  [22] shiny_1.5.0           sctransform_0.3.2     compiler_4.0.3       
##  [25] httr_1.4.2            assertthat_0.2.1      Matrix_1.3-2         
##  [28] fastmap_1.0.1         lazyeval_0.2.2        limma_3.46.0         
##  [31] later_1.1.0.1         formatR_1.7           admisc_0.11          
##  [34] htmltools_0.5.1       tools_4.0.3           rsvd_1.0.3           
##  [37] igraph_1.2.6          gtable_0.3.0          glue_1.4.2           
##  [40] RANN_2.6.1            reshape2_1.4.4        fastmatch_1.1-0      
##  [43] Rcpp_1.0.6            spatstat_1.64-1       scattermore_0.7      
##  [46] vctrs_0.3.6           nlme_3.1-151          lmtest_0.9-38        
##  [49] xfun_0.20             stringr_1.4.0         globals_0.14.0       
##  [52] mime_0.9              miniUI_0.1.1.1        lifecycle_0.2.0      
##  [55] irlba_2.3.3           goftest_1.2-2         future_1.21.0        
##  [58] MASS_7.3-53           zoo_1.8-8             scales_1.1.1         
##  [61] promises_1.1.1        spatstat.utils_1.20-2 parallel_4.0.3       
##  [64] RColorBrewer_1.1-2    curl_4.3              yaml_2.2.1           
##  [67] reticulate_1.18       pbapply_1.4-3         gridExtra_2.3        
##  [70] rpart_4.1-15          stringi_1.5.3         BiocParallel_1.24.0  
##  [73] rlang_0.4.10          pkgconfig_2.0.3       matrixStats_0.57.0   
##  [76] evaluate_0.14         lattice_0.20-41       ROCR_1.0-11          
##  [79] purrr_0.3.4           tensor_1.5            labeling_0.4.2       
##  [82] patchwork_1.1.1       htmlwidgets_1.5.3     tidyselect_1.1.0     
##  [85] parallelly_1.23.0     RcppAnnoy_0.0.18      plyr_1.8.6           
##  [88] magrittr_2.0.1        R6_2.5.0              generics_0.1.0       
##  [91] DBI_1.1.1             pillar_1.4.7          withr_2.4.0          
##  [94] mgcv_1.8-33           fitdistrplus_1.1-3    survival_3.2-7       
##  [97] abind_1.4-5           tibble_3.0.5          future.apply_1.7.0   
## [100] crayon_1.3.4          KernSmooth_2.23-18    plotly_4.9.3         
## [103] rmarkdown_2.6         grid_4.0.3            data.table_1.13.6    
## [106] digest_0.6.27         xtable_1.8-4          tidyr_1.1.2          
## [109] httpuv_1.5.5          munsell_0.5.0         viridisLite_0.3.0

参考来源:https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_05_dge.html

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,456评论 5 477
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,370评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,337评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,583评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,596评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,572评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,936评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,595评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,850评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,601评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,685评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,371评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,951评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,934评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,167评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 43,636评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,411评论 2 342

推荐阅读更多精彩内容