一致性聚类是一种提供定量证据的方法,用以确定数据集中可能的聚类的成员及数量,例如微阵列基因表达。这种方法在癌症基因组学中得到了广泛应用,以发现了新的疾病分子亚类。
一致性聚类方法包括从一组项目(例如微阵列)中进行二次抽样,并确定聚类计数(k)的簇。然后,计算成对共识值,即两个项目在同一子样本中出现的次数中占据同一簇的比例,并将其存储在每个 k 的对称共识矩阵中。
1. 准备输入数据
# 安装并加载所需的R包
# BiocManager::install("ConsensusClusterPlus")
library(ConsensusClusterPlus)
# 输入数据格式是一个矩阵,其中列是样本(项目),行是特征,单元格是数值(此处展示TCGA-STAD的FPKM数据)
STAD_tumor[1:5,1:5]
## TCGA-BR-A4J4-01A-12R-A251-31 TCGA-BR-A4IZ-01A-32R-A251-31 TCGA-RD-A7C1-01A-11R-A32D-31 TCGA-BR-6852-01A-11R-1884-13 TCGA-BR-7851-01A-11R-2203-13
## 5_8S_rRNA 0.0866 0.4991 1.0460 0.1388 0.0965
## 5S_rRNA 0.0000 0.2676 2.6896 0.3327 0.1393
## 7SK 0.3533 0.2546 1.2531 0.2813 0.0000
## A1BG 0.0197 0.0358 0.1067 0.1153 0.0406
## A1BG-AS1 0.0546 0.2324 0.3320 0.1367 0.0525
# 为了选择信息最丰富的基因进行类检测,将数据集减少到前 5,000 个可变性最大的基因(通过中值绝对偏差 - MAD来衡量)
mads = apply(STAD_tumor, 1, mad) # MAD测度
STAD_tumor = STAD_tumor[rev(order(mads))[1:5000], ] # 提取前5000个基因
# 归一化操作,sweep是一个循环函数,首先用apply计算每行的中值,然后用每个基因在样本中的表达值减中值
STAD_tumor = sweep(STAD_tumor, 1, apply(STAD_tumor, 1, median, na.rm = T))
STAD_tumor[1:5,1:5]
## TCGA-BR-A4J4-01A-12R-A251-31 TCGA-BR-A4IZ-01A-32R-A251-31 TCGA-RD-A7C1-01A-11R-A32D-31 TCGA-BR-6852-01A-11R-1884-13 TCGA-BR-7851-01A-11R-2203-13
## 5_8S_rRNA -0.04830 0.36420 0.91110 0.00390 -0.03840
## 5S_rRNA -0.36695 -0.09935 2.32265 -0.03425 -0.22765
## 7SK 0.24415 0.14545 1.14395 0.17215 -0.10915
## A1BG -0.00930 0.00680 0.07770 0.08630 0.01160
## A1BG-AS1 -0.05605 0.12175 0.22135 0.02605 -0.05815
2. 运行ConsensusClusterPlus
主要包含了以下几个参数:
1)pItem, 样品的抽样比例,如 pItem=0.8 表示选择80%的样本进行重复抽样;
2)pfeature, Feature的抽样比例;
3)maxK, 聚类结果中分类的最大的K值,必须是整数;
4)reps, 重复抽样的数目;
5)clusterAlg, 使用的聚类算法,“hc”用于层次聚类,“pam”用于PAM(Partioning Around Medoids)算法,“km”用于K-Means算法,也可以自定义函数;
6)distance, 计算距离的方法,有pearson、spearman、euclidean、binary、maximum、canberra、minkowski;
7)title, 输出结果的文件夹名字,包含了输出的图片;
8)seed, 随机种子,用于重复结果
9)tmyPal,指定一致性矩阵使用的颜色,默认使用白-蓝色
10)plot,不设置时图片结果仅输出到屏幕,也可以设置输出为'pdf', 'png', 'pngBMP'
11)writeTable,若为TRUE,则将一致性矩阵、ICL、log输出到CSV文件
title=tempdir()
results = ConsensusClusterPlus(STAD_tumor, maxK = 20, reps = 1000, pItem = 0.8, pFeature = 1, title = "untitled_consensus_cluster", clusterAlg = "hc",
distance = "pearson", seed = 1262118388.71279, tmyPal=NULL, writeTable=FALSE, plot = "png")
## end fraction
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
## clustered
# ConsensusClusterPlus 的输出是一个列表,其中列表的元素对应于第 k 个集群的结果,例如 results[[8]] 就是 k =8 的结果 result。
str(results[[8]])
## List of 5
## $ consensusMatrix: num [1:412, 1:412] 1 0.00156 0.1442 0 0.00158 ...
## $ consensusTree :List of 7
## ..$ merge : int [1:411, 1:2] -1 -2 -30 -42 -43 -49 -50 -51 -59 -63 ...
## ..$ height : num [1:411] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ order : int [1:412] 410 258 251 330 22 175 191 241 252 382 ...
## ..$ labels : NULL
## ..$ method : chr "average"
## ..$ call : language hclust(d = as.dist(1 - fm), method = finalLinkage)
## ..$ dist.method: NULL
## ..- attr(*, "class")= chr "hclust"
## $ consensusClass : Named int [1:412] 1 2 3 4 2 3 3 2 3 3 ...
## ..- attr(*, "names")= chr [1:412] "TCGA-BR-A4J4-01A-12R-A251-31" "TCGA-BR-A4IZ-01A-32R-A251-31" "TCGA-RD-A7C1-01A-11R-A32D-31" ## "TCGA-BR-6852-01A-11R-1884-13" ...
## $ ml : num [1:412, 1:412] 1 0.00156 0.1442 0 0.00158 ...
## $ clrs :List of 3
## ..$ : chr [1:412] "#B2DF8A" "#A6CEE3" "#1F78B4" "#FF7F00" ...
## ..$ : num 8
## ..$ : chr [1:8] "#B2DF8A" "#A6CEE3" "#1F78B4" "#FF7F00" ...
results[[8]][["consensusMatrix"]][1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000000 0.001557632 0.144200627 0.000000000 0.001577287
## [2,] 0.001557632 1.000000000 0.000000000 0.165094340 0.591263651
## [3,] 0.144200627 0.000000000 1.000000000 0.001567398 0.000000000
## [4,] 0.000000000 0.165094340 0.001567398 1.000000000 0.569105691
## [5,] 0.001577287 0.591263651 0.000000000 0.569105691 1.000000000
# 样本的聚类树
results[[8]][["consensusTree"]]
##
## Call:
## hclust(d = as.dist(1 - fm), method = finalLinkage)
##
## Cluster method : average
## Number of objects: 412
# consensusClass, 样本的聚类结果
length(results[[8]][["consensusClass"]])
## [1] 412
results[[8]][["consensusClass"]][1:5]
## TCGA-BR-A4J4-01A-12R-A251-31 TCGA-BR-A4IZ-01A-32R-A251-31 TCGA-RD-A7C1-01A-11R-A32D-31 TCGA-BR-6852-01A-11R-1884-13 TCGA-BR-7851-01A-11R-2203-13
## 1 2 3 4 2
# ml, 就是consensusMatrix
results[[8]][["ml"]][1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000000 0.001557632 0.144200627 0.000000000 0.001577287
## [2,] 0.001557632 1.000000000 0.000000000 0.165094340 0.591263651
## [3,] 0.144200627 0.000000000 1.000000000 0.001567398 0.000000000
## [4,] 0.000000000 0.165094340 0.001567398 1.000000000 0.569105691
## [5,] 0.001577287 0.591263651 0.000000000 0.569105691 1.000000000
results[[8]][["consensusMatrix"]][1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000000 0.001557632 0.144200627 0.000000000 0.001577287
## [2,] 0.001557632 1.000000000 0.000000000 0.165094340 0.591263651
## [3,] 0.144200627 0.000000000 1.000000000 0.001567398 0.000000000
## [4,] 0.000000000 0.165094340 0.001567398 1.000000000 0.569105691
## [5,] 0.001577287 0.591263651 0.000000000 0.569105691 1.000000000
# clrs, 颜色
results[[8]][["clrs"]]
## [[1]]
## [1] "#B2DF8A" "#A6CEE3" "#1F78B4" "#FF7F00" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#FF7F00" "#FF7F00" "#FF7F00" "#1F78B4"
## [17] "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#E31A1C" "#A6CEE3" "#A6CEE3" "#1F78B4" "#FF7F00" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#FF7F00"
## [33] "#FF7F00" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3" "#33A02C" "#1F78B4" "#1F78B4"
## [49] "#A6CEE3" "#A6CEE3" "#A6CEE3" "#FF7F00" "#FB9A99" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#33A02C"
## [65] "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00" "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#FB9A99"
## [81] "#1F78B4" "#FF7F00" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4"
## [97] "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#FF7F00" "#FF7F00" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#A6CEE3" "#FF7F00" "#A6CEE3" "#A6CEE3"
## [113] "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#FF7F00" "#1F78B4" "#FB9A99"
## [129] "#1F78B4" "#FF7F00" "#FF7F00" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#FF7F00" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3"
## [145] "#FF7F00" "#1F78B4" "#A6CEE3" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#FF7F00" "#1F78B4" "#1F78B4"
## [161] "#A6CEE3" "#A6CEE3" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#B2DF8A" "#FF7F00" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" "#E31A1C" "#1F78B4"
## [177] "#FF7F00" "#FF7F00" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#FF7F00" "#1F78B4" "#FF7F00" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#FF7F00" "#FDBF6F" "#1F78B4"
## [193] "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#FB9A99" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3"
## [209] "#FF7F00" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#FF7F00" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#A6CEE3" "#FF7F00" "#1F78B4" "#B2DF8A"
## [225] "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#FF7F00" "#1F78B4" "#1F78B4" "#FF7F00" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4"
## [241] "#FDBF6F" "#A6CEE3" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#1F78B4" "#E31A1C" "#FDBF6F" "#1F78B4" "#1F78B4" "#B2DF8A" "#FF7F00"
## [257] "#FF7F00" "#E31A1C" "#1F78B4" "#A6CEE3" "#FF7F00" "#A6CEE3" "#33A02C" "#A6CEE3" "#A6CEE3" "#FF7F00" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3"
## [273] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#FB9A99" "#1F78B4"
## [289] "#1F78B4" "#1F78B4" "#FF7F00" "#1F78B4" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#33A02C" "#1F78B4" "#B2DF8A" "#FF7F00" "#FF7F00" "#FF7F00"
## [305] "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#FF7F00" "#1F78B4" "#A6CEE3" "#1F78B4"
## [321] "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#E31A1C" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4" "#1F78B4"
## [337] "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00" "#1F78B4" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#B2DF8A" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#1F78B4" "#A6CEE3" "#33A02C"
## [353] "#FF7F00" "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00" "#B2DF8A" "#1F78B4" "#A6CEE3" "#B2DF8A" "#A6CEE3" "#FF7F00" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00"
## [369] "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#33A02C" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#A6CEE3" "#FDBF6F" "#1F78B4" "#1F78B4"
## [385] "#FF7F00" "#FF7F00" "#1F78B4" "#1F78B4" "#A6CEE3" "#1F78B4" "#1F78B4" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#A6CEE3" "#1F78B4" "#A6CEE3" "#1F78B4" "#FF7F00" "#A6CEE3"
## [401] "#1F78B4" "#A6CEE3" "#FF7F00" "#A6CEE3" "#1F78B4" "#B2DF8A" "#1F78B4" "#1F78B4" "#FF7F00" "#E31A1C" "#1F78B4" "#FF7F00"
##
## [[2]]
## [1] 8
##
## [[3]]
## [1] "#B2DF8A" "#A6CEE3" "#1F78B4" "#FF7F00" "#E31A1C" "#33A02C" "#FB9A99" "#FDBF6F"
★ 官网推荐,在实际运行中,推荐reps设置的更大,比如1000, maxK设置的更大,比如20
3. 计算聚类一致性 (cluster-consensus) 和样品一致性 (item-consensus)
主要包含了以下几个参数:
1)title,设置生成的文件的路径
2)plot,不设置时图片结果仅输出到屏幕,也可以设置输出为'pdf', 'png', 'pngBMP' 。
3)writeTable,若为TRUE,则将一致性矩阵、ICL、log输出到CSV文件
icl = calcICL(results, title = "consensus_cluster", plot = "png")
dim(icl[["clusterConsensus"]])
icl[["clusterConsensus"]][1:5,]
## k cluster clusterConsensus
## [1,] 2 1 0.8236517
## [2,] 2 2 0.9204836
## [3,] 3 1 0.6659016
## [4,] 3 2 0.7503082
## [5,] 3 3 0.8561029
dim(icl[["itemConsensus"]])
icl[["itemConsensus"]][1:5,]
## k cluster item itemConsensus
## 1 2 1 TCGA-ZQ-A9CR-01A-11R-A39E-31 0.2776996
## 2 2 1 TCGA-BR-6563-01A-13R-2055-13 0.2786404
## 3 2 1 TCGA-IN-7808-01A-11R-2203-13 0.2941890
## 4 2 1 TCGA-VQ-A8P8-01A-11R-A39E-31 0.2783686
## 5 2 1 TCGA-HU-A4HB-01A-12R-A251-31 0.2714451
4. 聚类结果展示
4.1 矩阵热图
ꔷ 一致性矩阵的行和列表示的都是样本,一致性矩阵的值按从0(不可能聚类在一起)到1(总是聚类在一起)用白色到深蓝色表示
ꔷ 一致性矩阵按照一致性聚类(热图上方的树状图)来排列。树状图和热图之间的彩色矩形即分出来的类别
4.2 一致性累积分布函数(CDF)图
ꔷ 此图展示了k取不同数值时(用颜色表示)的一致性矩阵的累积分布函数,用于判断当k取何值时,CDF达到一个近似最大值,此时一致性和簇置信度在达到最大,聚类分析结果最可靠
4.3 Delta Area Plot
ꔷ 此图展示了 k 和 k-1 相比CDF曲线下面积的相对变化。当k = 2时,因为没有k=1,所以第一个点表示的是k=2时CDF曲线下总面积,而当k = 10 时,曲线下面积趋近于平稳,所以10为最合适的k值。
4.4 Tracking Plot
ꔷ 此图下方的黑色条纹表示样品,展示的是样品在k取不同的值时,归属的分类情况,不同颜色的色块代表不同的分类。取不同k值前后经常改变颜色分类的样品代表其分类不稳定。若分类不稳定的样本太多,则说明该k值下的分类不稳定。
4.5 item-Consensus Plot(样本一致性图)
ꔷ 纵坐标代表Item-consensus values。k值不同时,每个样本都会有一个对应不同簇的item-consensus values。竖条代表每一个样本,竖条按从下到上递增的值进行排序,高度代表该样本的总item-consensus values。每个样本的顶部都有一个小矩形,小矩形的颜色代表该样本被分到了哪一簇。
ꔷ 该图提供了给定k下所有其他集群的样本一致性图,可以查看样本是否是集群中非常“纯粹”的成员,或者它是否与多个集群共享高度一致性(多种颜色的列中的大矩形),表明它是不稳定或“不纯粹”的成员。 这些值可用于选择“核心”样本,它们高度代表集群。 此外,也可以帮助决策簇数量
4.6 Cluster-Consensus Plot(聚类一致性图 )
ꔷ 此图展示的是不同k值下,每个分类的cluster-consensus value(该簇中成员pairwise consensus values的均值)。该值越高(低)代表稳定性越高(低)。可用于判断同一k值下以及不同k值之间cluster-consensus value的高低
5. 最佳的K值的选择
通过上面的结果解读我们可以选取出初步分类的亚组的个数,但是这种方法有时候带有主观性,也可以用PAC法确定最佳聚类数目
# 面积最小值对应K为最佳K
Kvec = 2:20
x1 = 0.1; x2 = 0.9 # threshold defining the intermediate sub-interval
PAC = rep(NA,length(Kvec))
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
M = results[[i]]$consensusMatrix
Fn = ecdf(M[lower.tri(M)]) # M 为计算出共识矩阵
PAC[i-1] = Fn(x2) - Fn(x1)
}
optK = Kvec[which.min(PAC)] # 理想的K值
# 重新绘制热图(以k=7为例)
## 选取自己感兴趣的基因集
intersection_geneExp[1:3,1:3]
## TCGA-BR-A4J4-01A-12R-A251-31 TCGA-BR-A4IZ-01A-32R-A251-31 TCGA-RD-A7C1-01A-11R-A32D-31
## ALOX12B 0.5224 0.0031 0.6993
## GOT1 20.0073 9.3511 39.1501
## ALOX15B 0.2185 0.1652 1.3996
annCol <- data.frame(results = paste0("Cluster",
results[[7]][['consensusClass']]),
row.names = colnames(intersection_geneExp))
head(annCol)
## results
## TCGA-BR-A4J4-01A-12R-A251-31 Cluster1
## TCGA-BR-A4IZ-01A-32R-A251-31 Cluster1
## TCGA-RD-A7C1-01A-11R-A32D-31 Cluster2
## TCGA-BR-6852-01A-11R-1884-13 Cluster2
## TCGA-BR-7851-01A-11R-2203-13 Cluster2
## TCGA-BR-A4J1-01A-11R-A251-31 Cluster3
library(RColorBrewer)
library(pheatmap)
mycol <- brewer.pal(7, "Set3")
annColors <- list(results = c("Cluster1" = mycol[1],
"Cluster2" = mycol[2],
"Cluster3" = mycol[3],
"Cluster4" = mycol[4],
"Cluster5" = mycol[5],
"Cluster6" = mycol[6],
"Cluster7"= mycol[7]))
heatdata <- results[[7]][["consensusMatrix"]]
dimnames(heatdata) <- list(colnames(intersection_geneExp), colnames(intersection_geneExp))
heatdata[1:3,1:3]
## TCGA-BR-A4J4-01A-12R-A251-31 TCGA-BR-A4IZ-01A-32R-A251-31 TCGA-RD-A7C1-01A-11R-A32D-31
## TCGA-BR-A4J4-01A-12R-A251-31 1.0000000 0.6542056 0
## TCGA-BR-A4IZ-01A-32R-A251-31 0.6542056 1.0000000 0
## TCGA-RD-A7C1-01A-11R-A32D-31 0.0000000 0.0000000 1
# 绘制热图
result <- pheatmap(mat = heatdata,
color = colorRampPalette((c("white","steelblue")))(100),
border_color = NA,
annotation_col = annCol,
annotation_colors = annColors,
show_colnames = F,
show_rownames = F)
6. 提取分型结果
library(tidyverse)
Cluster_res <- annCol %>% as.data.frame %>% rownames_to_column("patient_ID")
table(Cluster_res$results)
##
## Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7
## 132 94 37 57 38 48 6
#合并预后数据进行生存分析
STAD_clinical <- read_tsv("STAD_clinical.txt")
STAD_clinical$time <- ifelse(STAD_clinical$vital_status=='Alive', STAD_clinical$days_to_last_follow_up, STAD_clinical$days_to_death) # 如果患者已经死亡则选择days_to_death,如果患者处于活着状态,则选择days_to_last_follow_up
STAD_clinical$status <- ifelse(STAD_clinical$vital_status=='Alive', 0, 1)
Cluster_res <- Cluster_res %>% inner_join(STAD_clinical, "patient_ID")
head(Cluster_res)
## patient_ID results age gender tissue_or_organ_of_origin primary_diagnosis vital_status ajcc_pathologic_stage year_of_birth days_to_last_follow_up year_of_death days_to_death time status
## 1 TCGA-BR-A4J4-01A-12R-A251-31 Cluster1 39 male Gastric antrum Adenocarcinoma, NOS Alive Stage IIIB 1973 16 NA NA 16 0
## 2 TCGA-BR-A4IZ-01A-32R-A251-31 Cluster1 45 female Gastric antrum Carcinoma, diffuse type Dead Stage IIIB 1967 24 NA 273 273 1
## 3 TCGA-RD-A7C1-01A-11R-A32D-31 Cluster2 82 male Fundus of stomach Carcinoma, diffuse type Dead Stage IB 1923 NA 2006 507 507 1
## 4 TCGA-BR-6852-01A-11R-1884-13 Cluster2 64 female Cardia, NOS Adenocarcinoma, NOS Alive Stage IIA 1947 1367 NA NA 1367 0
## 5 TCGA-BR-7851-01A-11R-2203-13 Cluster2 74 male Body of stomach Adenocarcinoma, intestinal type Dead Stage IIB 1937 574 NA NA NA 1
## 6 TCGA-BR-A4J1-01A-11R-A251-31 Cluster2 63 male Gastric antrum Adenocarcinoma, NOS Dead Stage IIIA 1949 NA 2012 22 22 1
#进行生存分析
library(survival)
library(survminer)
fit <- survfit(Surv(time, status) ~ results, data = Cluster_res)
ggsurvplot(fit, data = Cluster_res,
main = "Survival curve", # 添加标题
surv.median.line = "hv", # 添加中位数生存时间线
font.main = c(16, "bold", "darkblue"), # 设置标题字体大小、格式和颜色
font.x = c(14, "bold.italic", "red"), # 设置x轴字体大小、格式和颜色
font.y = c(14, "bold.italic", "darkred"), # 设置y轴字体大小、格式和颜色
font.tickslab = c(12, "plain", "darkgreen"), # 设置坐标轴刻度字体大小、格式和颜色
legend = c(0.9, 0.15), # 通过坐标指定图例位置
legend.title = "Cluster", legend.labs = c("Cluster1","Cluster2","Cluster3","Cluster4","Cluster5","Cluster6","Cluster7"), # 更改图例标题和标签
size = 1, # 改变线条大小
linetype = "strata", # 按组更改线型
break.time.by = 250, # 将时间轴以250作为打断
# palette = "Dark2" # 使用 brewer 调色板“Dark2”
palette = c('#E5D2DD', '#53A85F', '#F1BB72', '#F3B1A0', '#D6E7A3', '#57C3F3', '#476D87'), # 自定义调色板
ggtheme = theme_bw(), # 修改主题
# conf.int = TRUE, # 添加置信区间
pval = TRUE, # 添加p值
risk.table = TRUE, risk.table.y.text.col = TRUE, # 添加风险表,并按层更改风险表y 文本颜色
tables.height = 0.2, # 设置风险表的高度
tables.theme = theme_cleantable(), # 设置风险表的主题
xlim = c(0, 1050) # 更改 x 轴范围
)
7. tSNE分析
# 将数据框中字符型数据转化为数值型
intersection_tumor <- as.data.frame(lapply(intersection_geneExp, as.numeric))
row.names(intersection_tumor) <- row.names(intersection_geneExp)
colnames(intersection_tumor) <- colnames(tumor_matrix)
intersection_tumor = na.omit(intersection_tumor)
intersection_tumor <- as.matrix(intersection_tumor)
library(Rtsne)
tSNE_res<- Rtsne(t(intersection_tumor), dims= 2, perplexity= 10, verbose= F, max_iter= 500, check_duplicates= F)
tsne<- data.frame(tSNE1 = tSNE_res[["Y"]][,1], tSNE2= tSNE_res[["Y"]][,2], cluster = annCol$results)
ggplot(tsne, aes(x= tSNE1, y= tSNE2, color= cluster)) +
geom_point(size= 4.5, alpha= 0.5) +
stat_ellipse(level= 0.85, show.legend = F) +
theme(legend.position= "top")