2020-07-02 WGCNA 练习

WGCNA基本概念

加权基因共表达网络分析（WGCNA）是一种系统生物学方法，用于样品中基因之间的相关模式。WGCNA可用于查找高度相关的基因的簇（模块），使用module eigengene或intramodular hub gene对这些簇进行汇总，将模块相互关联并与外部样本性状关联（使用eigengene网络）方法），并用于计算 module membership度量。挑选模块内hub基因，这些基因可以用于生物标志物或治疗靶标。
它与只挑选差异基因相比，WGCNA可以从成千上万的基因中挑选出高度相关的基因的簇（模块），并将模块与外部样本性状关联，找出与样本性状高度相关的模块。然后就可以进行模块内分析。

Co-expression network(共表达网络)

共表达网络定义为无向的、加权的基因网络。这样一个网络的节点对应于基因，基因之间的边代表基因表达量的相关性，加权是将相关性的绝对值提高到幂 $β≥1$ (软阈值)，加权基因共表达网络的构建以牺牲低相关性为代价，强调高相关性。具体地说，

$a_{ij}=|cor(x_i,x_j)|^β$

表示无符号网络的邻接关系。

Module(模块)

模块是高度互连的基因簇。在无符号共表达网络中,

$a_{ij}=|cor(x_i,x_j)|^β$

;模块对应于具有高度相关的基因簇。在有符号网络中, $a_{ij}=(0.5∗(1+cor(x_i,x_j)))^β$

，模块对应于正相关的基因。这里的加权的网络就等于邻接矩阵。通过幂邻接转换，就强化了高相关性基因的关系，弱化了相关性基因的关系。

Connectivity(连接度)

对于每个基因，连接性(也称为度)被定义为与其他基因的连接强度之和：在共表达网络中，连接度衡量一个基因与所有其他网络基因的相关性。

Intramodular connectivity(模块内连接度)

模块内链接度衡量给定基因相对于特定模块的基因的连接或共表达程度。模块内连接度可以做为Module membership的度量。

Module eigengene E

给定模块的第一主成分，代表整个模块的基因表达谱.

Module Membership(MM)

对于每个基因，我们通过将其基因表达谱与模块的Module eigengene相关性来定义Module Membership。

$MMblue(i)=cor(x_i,Eblue)$

测量基因i与蓝色模块Module eigengene的相关性。如果MM blue(i)接近0，则i-th基因不是蓝色模块的一部分。另一方面，如果MM blue(i)接近1或-1，则它与蓝色模块基因高度相关.MM标记编码基因与蓝色模块Module eigengene之间是正相关还是负相关.

hub gene

高度连接基因的缩写，根据定义，它是共表达网络模块内具有高连接度的基因。

Gene significance(GS)

$GS.ClinicalTrait(i)=|cor(x_i,ClinicalTrait)|$

其中 $Xi$ 表示 $i$ 基因的表达谱， $GS.ClinicalTrait(i)$ 的绝对值越高，第 $i$ 基因的生物学意义就越大。

基本分析流程

数据输入和清洗
网络构建和模块检测
量化模块和样本性状的关系
挑出感兴趣模块内部的基因
可视化TOM矩阵
将网络导出到外部数据进行可视化

1.数据分析的常见问题

需要多少个样本？

不建议对少于15个样本的数据集尝试WGCNA。与其他分析方法一样，更多的样品通常会导致更可靠和更精确的结果(最少5个样本一个组)。

如何过滤掉探针？

探针集或基因可以通过均值、绝对中位差(MAD)或方差进行过滤，因为低表达或不变的基因通常代表噪声。用均值表达还是方差过滤是否更好尚有争议，两者都有优缺点。不建议通过差异分析过滤掉基因。

除了芯片数据，是否可以用RNA-seq数据进行WGCNA分析？

使用（正确归一化的）RNA-seq数据与使用（正确归一化的）微阵列数据并没有什么不同。只要使用相同的方式处理所有样本，无论是使用RPKM，FPKM还是简单的归一化计数，对于WGCNA分析都不会产生很大的不同。

如果数据来自不同批次，需要去除批次效应。

挑选软阈值的问题？

如果合理的阈值（无符号或有符号的混合网络，小于15，有符号的网络，小于30）不能使无尺度拓扑网络系数R^2高于0.8，或者或平均连接度降到100以下。可能是由于批次效应，生物学异质性（例如，由来自2个不同组织的样品组成的数据集）或条件之间的强烈变化（例如按时间序列表示）而导致的。应该仔细调查是否存在样本异质性，驱动异质性的原因以及是否应调整数据等. 如果事实证明由一个不想删除的有趣的生物学变量引起的（即调整数据），则可以根据样本数量选择适当的软阈值如下表所示。

Number of samples	Unsigned and signed hybrid networks	Signed networks
Less than 20	9	18
20-30	8	16
30-40	7	14
more than 40	6	12

2. 分析流程

数据输入、清洗、预处理：得到一个行为样本，列为基因的表达矩阵，另一个是样本对应临床特征的矩阵

2.1数据筛载及清洗

Step2.1.1 Data input and cleaning

rm(list = ls())
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.2
# Load the WGCNA package
library(WGCNA);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the fpkm data set
library(stringr)
a <- read.csv("core_table_gene.csv",header = T)
class(a)
#> [1] "data.frame"
#a=as.data.frame(a)
head(a[,1:6])
#>     st_gene_id gene_id gene_symbol exp_X391Y_504KO exp_X391Y_505KO
#> 1 G10090_32561   20115        Rps7          331.06          355.76
#> 2 G10090_15366   20501     Slc16a1           77.58           83.00
#> 3 G10090_11692   12428       Ccna2            1.23            1.00
#> 4 G10090_17909   57249       Gabrq            0.15            0.02
#> 5 G10090_20425  406220       Krt77            0.00            0.06
#> 6 G10090_14637  171285      Havcr2            0.42            0.40
#>   exp_X391Y_506HET
#> 1           352.14
#> 2            55.75
#> 3             2.24
#> 4             0.02
#> 5             0.00
#> 6             0.31
rownames(a)=a[,1]
table(a$type)
#> < table of extent 0 >
dat <- a[,str_detect(colnames(a),"count")]#筛选原始读值
dat <- dat[,1:10]
probe2sym=a[,1:3]#提取探针和基因symbol的信息
colnames(probe2sym)=c('probe', "ENTREZID","SYMBOL")
rownames(dat) <- probe2sym[,3]

dim(dat)
#> [1] 18139    10

# Take a quick look at what is in the data set:
names(dat);
#>  [1] "read_count_X391Y_504KO"  "read_count_X391Y_505KO" 
#>  [3] "read_count_X391Y_506HET" "read_count_X391Y_508HET"
#>  [5] "read_count_X391Y_510HET" "read_count_X391Y_522KO" 
#>  [7] "read_count_X391Y_531HET" "read_count_X391Y_560HET"
#>  [9] "read_count_X391Y_561KO"  "read_count_X391Y_595KO"

boxplot(dat,las=2)#画10个样本的箱线图看一下基因的表达情况

image.png


dat <- log2(dat+1)#将fpkm的数据进行log转换
boxplot(dat,las=2)

image.png

Step2.1.2 匹配数据和筛选 goodSamplesGenes

## 因为WGCNA针对的是基因进行聚类，而一般我们的聚类是针对样本用hclust即可，所以这个时候需要转置
datExpr0 <-  t(dat[order(apply(dat,1,mad), decreasing = T)[1:5000],])# top 5000 mad genes

datExpr <- datExpr0 

#我们首先检查有太多缺失值的基因和样本
gsg = goodSamplesGenes(datExpr0, verbose = 3);
#>  Flagging genes and samples with too many missing values...
#>   ..step 1
gsg$allOK
#> [1] TRUE

#如果最后一条语句返回TRUE，则所有基因都通过了就是OK的。如果没有，我们就从数据中删除有问题的基因和样本：
if (!gsg$allOK){
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
    printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0) 
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}

Step2.1.3 画层次聚类树，查看是否有离群的样本

#接下来，我们对样本进行聚类(与稍后对基因进行聚类形成对比)，以查看是否有明显的异常值
sampleTree = hclust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-sampleClustering.png",width = 800,height = 600)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)+abline(h =75 , col = "red")
#> integer(0)

image.png

Step2.1.4 如果有离群样本就设置abline的h的值

# Plot a line to show the cut
#abline(h = 70000, col = "red");#从数据上看聚类的还可以，不需要剔除样本所以修改下参数“ h = ”
# Determine cluster under the line
clust = cutreeStatic(sampleTree, cutHeight = 75, minSize = 10)#不需要剔除样本所以修改下参数“ cutHeight = ”
table(clust)
#> clust
#>  1 
#> 10
# clust == 1 包含了我们需要的样本
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#变量datExpr现在包含可用于网络分析的表达式数据。

Step2.1.5 画样本的层次聚类树和trait的热图

得到样本聚类树和临床trait的热图

#读取处理文件
datTraits=read.table("datTraits.txt",sep = "\t",header = T,check.names = F)
datTraits
#>                         ctrl ko
#> read_count_X391Y_504KO     0  1
#> read_count_X391Y_505KO     0  1
#> read_count_X391Y_506HET    1  0
#> read_count_X391Y_508HET    1  0
#> read_count_X391Y_510HET    1  0
#> read_count_X391Y_522KO     0  1
#> read_count_X391Y_531HET    1  0
#> read_count_X391Y_560HET    1  0
#> read_count_X391Y_561KO     0  1
#> read_count_X391Y_595KO     0  1

## 下面主要是为了防止表型与样本名字对不上
datTraits <- datTraits[match(rownames(datExpr),rownames(datTraits)),]

identical(rownames(datTraits),rownames(datExpr))
#> [1] TRUE

# Re-cluster samples
sampleTree2 = hclust(dist(datExpr), method = "average")
# 将样本用颜色表示：在图2所示的曲线图中，白色表示低值，红色表示高值，灰色表示缺少条目
#如果是连续性变量会是渐变色，如果是‘0’，‘1’的数据将会是红白相间
traitColors = numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram and the colors underneath.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-Sample dendrogram and trait heatmap.png",width = 800,height = 600)
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits), 
                    main = "Sample dendrogram and trait heatmap")

datExpr=as.data.frame(datExpr)

save(datExpr, datTraits, file = "WGCNA-01-dataInput.RData")

image.png

选取 mad 前5000的基因
实验共有10个样本

2.2 一步步手动网络构建和模块检测

2.2.1 Step-by-step network construction and module detection

# Display the current working directory
rm(list = ls())

# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. At present this call is necessary.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments.
# See note above.
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 2)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
#> Allowing multi-threading with up to 2 threads.
# Load the data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
#> [1] "datExpr"   "datTraits"

2.2.2 Choose a set of soft-thresholding powers

# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=1))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
#> pickSoftThreshold: will use block size 5000.
#>  pickSoftThreshold: calculating connectivity for given powers...
#>    ..working on genes 1 through 5000 of 5000
#>    Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
#> 1      1    0.140 -5.35          0.954 1400.00   1390.00 1650.0
#> 2      2    0.189 -3.26          0.934  580.00    571.00  777.0
#> 3      3    0.216 -2.53          0.899  291.00    283.00  434.0
#> 4      4    0.259 -2.25          0.858  164.00    158.00  269.0
#> 5      5    0.395 -2.54          0.884  100.00     95.10  182.0
#> 6      6    0.540 -2.80          0.903   64.90     61.00  131.0
#> 7      7    0.651 -2.96          0.920   44.10     41.00   97.9
#> 8      8    0.719 -3.06          0.936   31.10     28.70   75.8
#> 9      9    0.777 -3.13          0.944   22.60     20.70   60.4
#> 10    10    0.811 -3.16          0.950   16.90     15.30   49.1
#> 11    12    0.859 -3.09          0.951   10.00      8.93   34.3
#> 12    13    0.878 -3.07          0.954    7.89      7.01   29.3
#> 13    14    0.893 -3.05          0.950    6.32      5.58   25.4
#> 14    15    0.907 -3.05          0.961    5.13      4.49   22.3
#> 15    16    0.921 -3.05          0.965    4.20      3.66   19.7
#> 16    17    0.931 -3.07          0.971    3.48      3.01   17.6
#> 17    18    0.939 -3.07          0.974    2.91      2.50   15.8
#> 18    19    0.941 -3.07          0.971    2.46      2.09   14.2
#> 19    20    0.951 -3.05          0.976    2.09      1.77   12.9
# Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.85,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

image.png

2.2.3 Co-expression similarity and adjacency

## Co-expression similarity and adjacency
# We now calculate the adjacencies, using the soft thresholding power “sft$powerEstimate”:
softPower = 12;softPower    ##有最佳的power的估计值
#> [1] 12
adjacency = adjacency(datExpr, power = softPower);

2.2.4 Topological Overlap Matrix (TOM)

## To minimize effects of noise and spurious associations(最小化噪声和伪关联的影响), 
# we transform the adjacency into Topological Overlap Matrix, and calculate the corresponding dissimilarity(并计算相应的相异性):

# Turn adjacency into topological overlap
TOM = TOMsimilarity(adjacency);
#> ..connectivity..
#> ..matrix multiplication (system BLAS)..
#> ..normalization..
#> ..done.
dissTOM = 1-TOM

image.png

2.2.5 Clustering using TOM

# 现在使用层次聚类来产生基因的层次聚类树（树状图）。注意，我们使用函数hclust，它提供比标准hclust函数更快的层次聚类例程。

# Call(调用) the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);

## 最后一个命令绘制的集群树状图如图所示。

#在聚类树（树状图）中，每一片叶子，即一条短的垂直线，对应一个基因。树状图的分支群紧密相连，高度共表达基因。模块识别相当于单个分支的识别（“从树状图上切下分支”）。有几种分支切割方法；我们的标准方法是从包Dynamic Tree Cut中切割动态树。下一段代码说明了它的用途。

# We like large modules, so we set the minimum module size relatively high:
minModuleSize = 30;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize);
#>  ..cutHeight not given, setting it to 0.997  ===>  99% of the (truncated) height range in dendro.
#>  ..done.
table(dynamicMods)
#> dynamicMods
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
#> 175 115 109 109 101  97  88  88  86  85  80  78  75  74  74  73  73  71  69  68 
#>  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
#>  68  67  64  64  64  64  62  61  61  60  60  60  59  59  59  58  57  57  57  57 
#>  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
#>  57  56  54  53  52  51  50  50  50  50  49  48  48  47  47  46  45  45  44  44 
#>  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
#>  43  43  42  42  42  41  41  39  39  38  37  36  36  35  35  35  35  33  33  32 
#>  81  82  83  84  85  86  87  88 
#>  32  32  32  32  31  31  31  30

## 函数返回88个模块，标记为1–88从大到小。标签0是为未分配的基因保留的。上面的命令列出了模块的大小。我们现在在基因树状图下绘制模块分配：

# Convert numeric lables into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
#> dynamicColors
#>   antiquewhite4         bisque4           black            blue           blue2 
#>              43              50              88             115              35 
#>           brown          brown2          brown4          coral1          coral2 
#>             109              35              51              44              43 
#>            cyan       darkgreen        darkgrey     darkmagenta  darkolivegreen 
#>              74              67              64              59              59 
#> darkolivegreen4      darkorange     darkorange2         darkred   darkseagreen3 
#>              35              64              52              68              30 
#>   darkseagreen4   darkslateblue   darkturquoise      darkviolet      firebrick4 
#>              44              50              64              35              36 
#>     floralwhite           green     greenyellow          grey60        honeydew 
#>              53             101              80              73              31 
#>       honeydew1      indianred4           ivory  lavenderblush2  lavenderblush3 
#>              45              36              54              31              45 
#>      lightcoral       lightcyan      lightcyan1      lightgreen      lightpink3 
#>              37              73              56              71              31 
#>      lightpink4  lightsteelblue lightsteelblue1     lightyellow         magenta 
#>              46              38              57              69              86 
#>        magenta4          maroon    mediumorchid   mediumpurple2   mediumpurple3 
#>              32              47              42              39              57 
#>    midnightblue    navajowhite1    navajowhite2          orange      orangered3 
#>              74              32              47              64              39 
#>      orangered4   paleturquoise  palevioletred2  palevioletred3            pink 
#>              57              60              32              48              88 
#>            plum           plum1           plum2           plum3          purple 
#>              41              57              50              33              85 
#>             red       royalblue     saddlebrown          salmon         salmon2 
#>              97              68              61              75              32 
#>         salmon4         sienna3         skyblue        skyblue1        skyblue2 
#>              48              59              61              41              42 
#>        skyblue3       steelblue             tan         thistle        thistle1 
#>              57              60              78              32              49 
#>        thistle2        thistle3       turquoise          violet           white 
#>              50              33             175              60              62 
#>          yellow         yellow4     yellowgreen 
#>             109              42              58
# Plot the dendrogram and colors underneath
sizeGrWindow(8,6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")

image.png

2.2.6 合并表达谱非常相似的模块

# 动态树切割可以识别其表达谱非常相似的模块。合并这些模块可能是谨慎的，因为它们的基因是高度共表达的。为了量化整个模块的共表达相似性，我们计算它们的特征基因并根据它们的相关性对它们进行聚类：

# Calculate eigengenes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average");
# Plot the result
sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
     xlab = "", sub = "")
#我们选择0.25的高度切割，对应于0.75的相关性以合并
MEDissThres = 0.25
# Plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
#>  mergeCloseModules: Merging modules whose distance is less than 0.25
#>    multiSetMEs: Calculating module MEs.
#>      Working on set 1 ...
#>      moduleEigengenes: Calculating 88 module eigengenes in given set.
#>    multiSetMEs: Calculating module MEs.
#>      Working on set 1 ...
#>      moduleEigengenes: Calculating 78 module eigengenes in given set.
#>    multiSetMEs: Calculating module MEs.
#>      Working on set 1 ...
#>      moduleEigengenes: Calculating 77 module eigengenes in given set.
#>    Calculating new MEs...
#>    multiSetMEs: Calculating module MEs.
#>      Working on set 1 ...
#>      moduleEigengenes: Calculating 77 module eigengenes in given set.
# The merged module colors
mergedColors = merge$colors;

length(table(mergedColors)) #13个模块和一步构建是一样的
#> [1] 77
# Eigengenes of the new merged modules:
mergedMEs = merge$newMEs;

image.png

#为了了解合并对我们的模块颜色有什么影响，我们再次绘制了基因树状图，下面是原始的和合并的模块颜色
sizeGrWindow(12, 9)
#pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
dev.off()
#> null device 
#>           1

image.png

#在随后的分析中，我们将在mergedColors中使用合并的模块颜色。我们保存相关变量，以便在本教程的后续部分中使用：

# Rename to moduleColors
moduleColors = mergedColors
# Construct numerical labels corresponding to the colors
colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)-1;
MEs = mergedMEs;
# Save module colors and labels for use in subsequent parts
save(MEs, moduleLabels, moduleColors, geneTree, file = "WGCNA-02-networkConstruction-stepByStep.RData")
length(table(moduleColors))
#> [1] 77
table(moduleColors)
#> moduleColors
#>   antiquewhite4         bisque4           black           blue2           brown 
#>              43              50              88             106             109 
#>          brown2          brown4          coral1          coral2            cyan 
#>              35              51              44             141              74 
#>       darkgreen        darkgrey     darkmagenta  darkolivegreen darkolivegreen4 
#>              67              64             134              59              70 
#>      darkorange     darkorange2         darkred   darkseagreen3   darkseagreen4 
#>              64              52              68              30              44 
#>   darkslateblue   darkturquoise      firebrick4     floralwhite           green 
#>              50              64              36              53             101 
#>     greenyellow          grey60        honeydew       honeydew1      indianred4 
#>              80             133              31             198              36 
#>           ivory  lavenderblush2  lavenderblush3      lightcoral       lightcyan 
#>              54              73              45              37              73 
#>      lightcyan1      lightpink3      lightpink4 lightsteelblue1     lightyellow 
#>              56              63              46              57              69 
#>         magenta        magenta4          maroon    mediumorchid   mediumpurple2 
#>              86              32              47              42              39 
#>   mediumpurple3    midnightblue    navajowhite1    navajowhite2          orange 
#>              57              74              32              47              64 
#>      orangered3      orangered4   paleturquoise  palevioletred3            pink 
#>              39              57              60              48              88 
#>           plum1           plum2           plum3          purple             red 
#>             105              50              33              85              97 
#>       royalblue     saddlebrown         salmon2         sienna3         skyblue 
#>              68              61              32              59              61 
#>        skyblue1        skyblue2       steelblue             tan         thistle 
#>              41              42              60              78              32 
#>        thistle1        thistle2        thistle3       turquoise           white 
#>              49              50              33             175              62 
#>          yellow     yellowgreen 
#>             109              58

2.3 挑选与感兴趣临床特征的模块

# Display the current working directory
rm(list = ls())
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
#> [1] "datExpr"   "datTraits"
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-stepByStep.RData");
lnames
#> [1] "MEs"          "moduleLabels" "moduleColors" "geneTree"

2.3.1 计算MEs

# Define numbers of genes and samples
nGenes = ncol(datExpr);#定义基因和样本的数量
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)#不同颜色的模块的ME值矩 (样本vs模块)
moduleTraitCor = cor(MEs, datTraits, use = "p");#计算模块与临床数据的相关性 行为样本，列为ME与临床特征的关系
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)

2.3.2 画模块和临床trait的关系图

sizeGrWindow(15,20)
# Will display correlations and their p-values
textMatrix =  paste(signif(moduleTraitCor, 2), "\n(",
                    signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
png("step3-Module-trait-relationships.png",width = 1500,height = 1200,res = 130)
par(mar= c(5,8, 2, 2));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50),#WGCNA提醒greenWhiteRed不适合红绿色盲，建议用blueWhiteRed
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))

image.png

分析确定了几个重要的模块-性状关联。我们将把重点放在ko这一感兴趣的特征上

2.3.3 基因与性状和重要模块的关系：GS和MM

#GS: as(the absolute value of) the correlation between the gene and the trait
#MM: as the correlation of the module eigengene and the gene expression profile. This allows us to quantify the similarity of all genes on the array to every module.

# Define variable ko containing the ko column of datTrait
ko = as.data.frame(datTraits$ko);
names(ko) = "ko"
# names (colors) of the modules
modNames = substring(names(MEs), 3)

geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
#在列名上加MM，p.MM
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

geneTraitSignificance = as.data.frame(cor(datExpr, ko, use = "p"));#修改临床特征ko
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));

names(geneTraitSignificance) = paste("GS.", names(ko), sep="");#修改临床特征ko
names(GSPvalue) = paste("p.GS.", names(ko), sep="")#修改临床特征ko

2.3.4 模块内分析：作模块membership和genesignificance的相关图

#darkorange
module = "darkorange"
column = match(module, modNames);
moduleGenes = moduleColors==module;

sizeGrWindow(7, 7);
png("step3-Module_membership-gene_significance.png",width = 800,height = 600)
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for group",
                   main = paste("Module membership vs. gene significance\n"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

如下图所示。显然，GS和MM是高度相关的，说明与某个性状高度显著相关的基因通常也是与该性状相关的模块中最重要的(中心)元素。

image.png

2.3.5 计算模块内连接度

adjacency = adjacency(datExpr, power =7);#计算一个邻接矩阵
Alldegrees=intramodularConnectivity(adjacency, moduleColors)

#a data frame with 4 columns giving the totalconnectivity(kTotal ),
#intramodular connectivity(kWithon ), extra-modular connectivity(kOut  ), 
#and the difference of the intraandextra-modular connectivities for all genes(kDiff ); 
#otherwise a vector of intramodular connectivities,

class(Alldegrees)
#> [1] "data.frame"
module = "darkorange"; # Select module
probes = names(datExpr) # Select module probes
inModule = (moduleColors==module);
modProbes = probes[inModule];
length(modProbes)
#> [1] 64
KIM_module=Alldegrees[modProbes,]

image.png

2.3.6 展示模块的热图和eigengene

#我们现在创建一个解释模块（heatmap）和相应module eigengene（barplot）之间关系的图：
sizeGrWindow(8,7);
which.module="darkorange"
ME=MEs[, paste("ME",which.module, sep="")]
par(mfrow=c(2,1), mar=c(0.3, 5.5, 3, 2))
plotMat(t(scale(datExpr[,moduleColors==which.module ]) ),
        nrgcols=30,rlabels=F,rcols=which.module,
        main=which.module, cex.main=2)
par(mar=c(5, 4.2, 0, 0.7))
barplot(ME, col=which.module, main="", cex.main=2,
        ylab="eigengene expression",xlab="array sample")

最上面一行显示了灰色模块基因（行）在微阵列（列）中的位置。下一行显示与相同微阵列样本相对应的模块特征基因表达值（y轴）。注意，ME在许多模块基因表达不足的阵列中呈现低值（热图中为绿色）。ME对于很多模块基因过度表达的阵列具有很高的值（热图中为红色）。ME可以被认为是该模块最具代表性的基因表达谱。

2.3.6展示模块内基因名

names(datExpr)[1:10]
#>  [1] "Gm40525" "H2-Q8"   "Gm5796"  "Car3"    "Xntrpc"  "Has1"    "Gm27021"
#>  [8] "Gm5917"  "Xlr3a"   "Capn11"
tail(names(datExpr)[moduleColors=="darkorange"])
#> [1] "Dnajb1"  "Galnt15" "Clcn5"   "Oscp1"   "Ifi207"  "Gda"

创建一个数据框，其中包含所有探针的以下信息：探针ID、基因符号、module color, gene significance for weight, and module membership and p-values in all modules. 模块将按其’ko’重要性排序，最重要的模块位于左侧。

# Create the starting data frame
geneInfo0 = data.frame(geneSymbol = colnames(datExpr),
                       type = rep("mRNA",length(colnames(datExpr))),
                       moduleColor = moduleColors,
                       geneTraitSignificance,
                       GSPvalue)
# Order modules by their significance for  ‘ko’
modOrder = order(-abs(cor(MEs,ko, use = "p"))); # 修改特征参数‘ko’
# Add module membership information in the chosen order
for (mod in 1:ncol(geneModuleMembership)) 
{
  oldNames = names(geneInfo0)
  geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]], 
                         MMPvalue[, modOrder[mod]]);
  names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
                       paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
# Order the genes in the geneInfo variable first by module color, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.ko)); # 修改特征参数‘ko’
geneInfo = geneInfo0[geneOrder, ]

write.csv(geneInfo, file = "geneInfo.csv")

2.4 对模块内的基因的进行GO富集分析

Interfacing network analysis with other data such as GO and KEGG

2.4.1 主要是关心具体某个模块内部的基因

# Display the current working directory
rm(list = ls())
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 4)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
#> Allowing multi-threading with up to 4 threads.
# Load the expression and trait data saved in the first part
lnames = load(file ="WGCNA-01-dataInput.RData");#加载进来这里的我的表达矩阵变成了matrix,将其转为data.frame 才不会报错
#The variable lnames contains the names of loaded variables.
lnames
#> [1] "datExpr"   "datTraits"
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-stepByStep.RData");
lnames
#> [1] "MEs"          "moduleLabels" "moduleColors" "geneTree"

# Select module
module = "darkorange"
# Select module probes
probes = colnames(datExpr) #我们例子里面的probe就是基因
inModule = (moduleColors==module);
table(inModule)
#> inModule
#> FALSE  TRUE 
#>  4936    64
modProbes = probes[inModule]; #模块内的基因已经挑了出来，可以用Y叔的包画图了
head(modProbes)
#> [1] "Gpr141"  "Gm6569"  "Samd11"  "Gpr141b" "Clstn3"  "Gm266"

modGenes = modProbes

2.4.2 GO分析，kEGG分析

library(clusterProfiler)
#> Warning: package 'clusterProfiler' was built under R version 3.6.2
library(org.Mm.eg.db)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.2
modGenes=as.data.frame(modGenes)
mod_entrez= mapIds(x = org.Mm.eg.db,
                   keys = modGenes$modGenes,
                   keytype = "SYMBOL",
                   column = "ENTREZID")
length(mod_entrez)
#> [1] 64
mod_entrez =na.omit(mod_entrez)#去除没有ENTREZ id 的基因，
length(mod_entrez)
#> [1] 64

#对基因集进行富集分析。给定一个基因载体，该函数将在FDR控制后返回富集GO分类。
go <- enrichGO(gene = mod_entrez,   #a vector of entrez gene id
               keyType = "ENTREZID",#输入基因的型
               OrgDb = "org.Mm.eg.db", #组织数据库，bioconductor里面有人，鼠等
               ont="all",
               pvalueCutoff = 0.5,
               qvalueCutoff = 0.5,
               readable = TRUE)#whether mapping gene ID to gene Name

par(mar= c(3,4, 2, 2));
png("GO_all.png",width = 1500,height = 1200,res = 130)# 尝试随便命名
dotplot(go, split="ONTOLOGY") + facet_grid(ONTOLOGY~., scale="free")
####  查看得到的结果这里好像有个参数可以直接返回，等有空了去仔细看看这个R包
go_result=go@result
write.csv(go_result,file = 'go_all_result.csv')

image.png

Network visualization using WGCNA functions

# Network visualization using WGCNA functions

# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
library(gplots)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
  allowWGCNAThreads(nThreads = 4)
} else{
  enableWGCNAThreads(nThreads = 2)
  #disableWGCNAThreads()
}
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-stepByStep.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)


# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
load(file="TOMsimilarityFromExpr.Rdata")
if (F) {
  
  TOM = TOMsimilarityFromExpr(datExpr, power = 12);#前面的power为12
  dissTOM = 1-TOM;#前面估计的power为12
  save(TOM,file="TOMsimilarityFromExpr.Rdata")
  # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
}


plotTOM = dissTOM^12;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function

if (T) {
  
sizeGrWindow(9,9)
png("step5-Network-heatmap.png",width = 800,height = 600)
myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes", col = myheatcol)
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
dev.off()
}
#使用热图将基因网络可视化。热图描绘了分析中所有基因之间的拓扑重叠矩阵(TOM)。
#浅色表示低重叠，逐渐变深的红色表示高重叠。沿对角线的深色块是模块。
#左侧和顶部还显示了基因树状图和模块分配

  nSelect = 400
  # For reproducibility, we set the random seed
  set.seed(10);
  select = sample(nGenes, size = nSelect);
  selectTOM = dissTOM[select, select];
  # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
  selectTree = hclust(as.dist(selectTOM), method = "average")
  selectColors = moduleColors[select];
  # Open a graphical window
  sizeGrWindow(9,9)
  # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing 
  # the color palette; setting the diagonal to NA also improves the clarity of the plot
  plotDiss = selectTOM^7;
  diag(plotDiss) = NA;
  myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
  TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes",col = myheatcol)
  

#重新计算模块MEs
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate  'hour' from the clinical traits   
months = as.data.frame(datTraits$months);
names(months) = "months"
# hour加⼊入MEs成为MET
# Add the hour to existing module eigengenes
MET = orderMEs(cbind(MEs, months))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,10);
par(cex = 0.9)
png("step5-Eigengene-dendrogram.png",width = 800,height = 600)
par(margin(3,6,2,2))
plotEigengeneNetworks(MET, "Eigengene dendrogram and adjacency heatmap" , 
                      marDendro = c(0,4,1,2), 
                      marHeatmap = c(5,5,1,2) ,
                      cex.lab = 0.8, xLabelsAngle= 90)
dev.off()

#该函数生成ME和性状的树状图，以及它们之间关系的热图。显示了 eigengenes的层次聚类树图由dissimilarity of eigengenes EI构成,Ej由1−cor(Ei，Ej)给出，
#下面的热图显示eigengenes的邻接值，AIj=(1+COR(Ei，Ej))/2。

#要拆分树状图和热图，我们可以使用以下代码

# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
# 模块的进化树
#png("step5-Eigengene-dendrogram-hclust.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
dev.off()
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
par(margin(3,6,2,2))
# 性状与模块热图
#png("step5-Eigengene-adjacency-heatmap.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(5,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
#可视化的特征基因网络表示模块和临床性状之间的关系。

image.png

还可将挑选出的基因在cytoscape进行可视化，通过cytohubba插件进行可视化

2020-07-02 WGCNA 练习

WGCNA基本概念

Co-expression network(共表达网络)

Module(模块)

Connectivity(连接度)

Intramodular connectivity(模块内连接度)

Module eigengene E

Module Membership(MM)

hub gene

Gene significance(GS)

基本分析流程

1.数据分析的常见问题

需要多少个样本？

如何过滤掉探针？

除了芯片数据，是否可以用RNA-seq数据进行WGCNA分析？

挑选软阈值的问题？

2. 分析流程

2.1数据筛载及清洗

Step2.1.1 Data input and cleaning

Step2.1.2 匹配数据和筛选 goodSamplesGenes

Step2.1.3 画层次聚类树，查看是否有离群的样本

Step2.1.4 如果有离群样本就设置abline的h的值

Step2.1.5 画样本的层次聚类树 和trait的热图

2.2 一步步手动网络构建和模块检测

2.2.1 Step-by-step network construction and module detection

2.2.2 Choose a set of soft-thresholding powers

2.2.3 Co-expression similarity and adjacency

2.2.4 Topological Overlap Matrix (TOM)

2.2.5 Clustering using TOM

2.2.6 合并表达谱非常相似的模块

2.3 挑选与感兴趣临床特征的模块

2.3.1 计算MEs

2.3.2 画模块和临床trait的关系图

2.3.3 基因与性状和重要模块的关系：GS和MM

2.3.4 模块内分析：作模块membership和genesignificance的相关图

2.3.5 计算模块内连接度

2.3.6 展示模块的热图和eigengene

2.3.6展示模块内基因名

2.4 对模块内的基因的进行GO富集分析

2.4.1 主要是关心具体某个模块内部的基因

2.4.2 GO分析，kEGG分析

Network visualization using WGCNA functions

Step2.1.5 画样本的层次聚类树和trait的热图