01 Generalized Linear Mixed Models (GLMM)
The generalized linear mixed model (GLMM) extends both the generalized linear model (GLM) and the linear mixed model (LMM); it was proposed in the 1990s. Because it borrows the ideas of mixed models, the GLMM is considered particularly well suited to longitudinal (repeated-measures) data. It is not limited to repeated measures, however: being essentially a multilevel model, it can handle data with any hierarchical structure.
Equivalently, a GLMM can be viewed as an extension of the LMM in which the response is no longer required to be normally distributed, or as an extension of the GLM that accommodates both fixed and random effects.
An example: suppose we believe that a drug's efficacy is related to the time of dosing, but not simply in the sense that mean efficacy changes with dosing time; rather, the random variability of the efficacy depends on dosing time. At 10 a.m., say, everyone is in roughly the same half-full state, so the same dose works similarly for all subjects. At noon, however, some people have not yet eaten, some have just eaten, some have drunk alcohol that reacts with the drug, and some have taken vinegar that reacts with it in yet another way. Dosing at noon therefore produces a much larger random error in the response. When it is this random error, rather than the mean response itself, that varies with time and follows some distribution, a generalized linear mixed model is required. For a fixed effect, the parameter means: for each one-unit change in the covariate, the response changes by that amount on average. A random effect, by contrast, is itself a random variable following a normal distribution, so its effect on the response need not be the same across different levels of the grouping variable.
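To make the fixed-versus-random distinction concrete, here is a minimal simulation sketch (the variable names subj, time and b are invented for illustration): every subject shares one fixed slope for time, but each subject draws its own intercept from a normal distribution. Fitting with lme4::lmer recovers both the common slope and the between-subject variance.
library(lme4)
set.seed(1)
n_subj <- 20; n_rep <- 5                     # 20 subjects, 5 repeated measurements each
subj <- factor(rep(1:n_subj, each = n_rep))  # grouping factor
time <- rep(1:n_rep, n_subj)                 # within-subject time points
b <- rnorm(n_subj, sd = 2)                   # subject-specific random intercepts ~ N(0, 4)
y <- 1 + 0.5 * time + b[as.integer(subj)] + rnorm(n_subj * n_rep)
fit <- lmer(y ~ time + (1 | subj))           # fixed slope for time, random intercept per subject
summary(fit)                                 # slope estimate near 0.5, intercept SD near 2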
Limitations of the GLM
The GLM accommodates many response distributions, but it requires the observations to be independent. In applications such as genetic evaluation the data clearly cannot satisfy this requirement.
Software that can fit GLMMs
lme4 in R (see the sketch below)
ASReml-R (an R package)
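As a quick illustration with lme4, the package's own cbpp data can be fitted with a binomial GLMM in a single call; herd enters as a random intercept, so the within-herd correlation that a plain GLM would ignore is modeled explicitly:
library(lme4)
data(cbpp, package = "lme4")
## binomial GLMM: fixed effect for period, random intercept for herd
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)
summary(gm1)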
02 The glmmLasso Algorithm
glmmLasso: an algorithm for high-dimensional generalized linear mixed models that performs variable selection via L1-penalized estimation.
glmmLasso(fix=formula, rnd=formula, data, lambda, family = gaussian(link="identity"),
switch.NR=FALSE, final.re=FALSE, control = list())
The glmmLasso algorithm is a gradient ascent algorithm designed for generalized linear mixed models that incorporates variable selection by L1-penalized estimation. In a final re-estimation step, a model containing only the variables with nonzero fixed effects is refitted by simple Fisher scoring. For both the main algorithm and the final Fisher scoring re-estimation, two methods are available for computing the random-effects variance-covariance parameter estimates: an EM-type estimate and an REML-type estimate.
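In symbols, writing \ell(\beta, b) for the log-likelihood with fixed effects \beta and random effects b, the lasso-penalized criterion being maximized is, schematically,

\ell^{\mathrm{pen}}(\beta, b) = \ell(\beta, b) - \lambda \sum_{j=1}^{p} |\beta_j|

Larger values of the tuning parameter \lambda shrink more fixed-effect coefficients exactly to zero, which is what produces the variable selection. (This is the generic L1-penalized objective; internally glmmLasso maximizes an approximate version of it.)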
Arguments:
fix
A two-sided linear formula object describing the fixed-effects part of the model, with the response on the left of the ~ operator and the terms, separated by + operators, on the right. For categorical covariates use as.factor(.) in the formula. Note that the corresponding dummy variables are treated as a group and are updated blockwise.
rnd
A two-sided linear formula object describing the random-effects part of the model, with the grouping factor on the left of the ~ operator and the random terms, separated by + operators, on the right. Alternatively, the random-effects design matrix can be given directly (with suitable column names). If set to NULL, no random effects are included.
data
The data frame containing the variables named in the formulas.
lambda
The penalty parameter that controls the shrinkage of the fixed terms and hence the variable selection. The optimal penalty is a tuning parameter of the procedure and has to be determined, e.g. by information criteria or cross-validation (see the examples below).
family
A GLM family; see glm and family. Ordinal response models can also be fitted: use family = acat() or family = cumulative() to fit adjacent-category or cumulative models, respectively. If family is missing, a linear mixed model is fitted; otherwise a generalized linear mixed model is fitted.
switch.NR
Logical. Should the algorithm switch to a Newton-Raphson update step when this looks reasonable? Default is FALSE.
final.re
Logical. Should a final Fisher scoring re-estimation be performed? Default is FALSE.
control
A list of control values for the estimation algorithm, replacing the defaults returned by glmmLassoControl. Defaults to an empty list.
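Before the extended walkthrough below (adapted from the package demo), here is a minimal quick-start sketch on the package's bundled soccer data. The covariate subset and the fixed lambda = 50 are arbitrary choices for illustration; in practice lambda should be tuned as the following sections show. With family omitted, a (lasso-penalized) linear mixed model is fitted:
library(glmmLasso)
data("soccer", package = "glmmLasso")
## lasso-penalized linear mixed model with a random intercept per team
fit <- glmmLasso(points ~ transfer.spendings + ave.unfair.score + ball.possession,
                 rnd = list(team = ~1), data = soccer, lambda = 50)
summary(fit)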
library(glmmLasso)
data("soccer")
## center all metric variables so that the
## glmmPQL starting values are on the correct scale
soccer[,c(4,5,9:16)]<-scale(soccer[,c(4,5,9:16)],center=TRUE,scale=TRUE)
soccer<-data.frame(soccer)
## grid for the tuning parameter lambda
lambda <- seq(500,0,by=-5)
family <- poisson(link = log)
################## First Simple Method ############################################
## Using BIC (or AIC, respectively) to determine the optimal tuning parameter lambda
BIC_vec<-rep(Inf,length(lambda))
## first fit good starting model
library(MASS);library(nlme)
PQL<-glmmPQL(points~1,random = ~1|team,family=family,data=soccer)
Delta.start<-c(as.numeric(PQL$coef$fixed),rep(0,6),as.numeric(t(PQL$coef$random$team)))
Q.start<-as.numeric(VarCorr(PQL)[1,1])
for(j in 1:length(lambda))
{
print(paste("Iteration ", j,sep=""))
glm1 <- try(glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer, lambda=lambda[j],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.start,q_start=Q.start)), silent=TRUE)
if(!inherits(glm1, "try-error"))
{
BIC_vec[j]<-glm1$bic
}
}
opt<-which.min(BIC_vec)
glm1_final <- glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer, lambda=lambda[opt],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.start,q_start=Q.start))
summary(glm1_final)
################## Second Simple Method ###########################
## Using 5-fold CV to determine the optimal tuning parameter lambda
### set seed
set.seed(123)
N<-dim(soccer)[1]
ind<-sample(N,N)
lambda <- seq(500,0,by=-5)
## set number of folds
kk<-5
nk <- floor(N/kk)
Devianz_ma<-matrix(Inf,ncol=kk,nrow=length(lambda))
## first fit good starting model
library(MASS);library(nlme)
PQL<-glmmPQL(points~1,random = ~1|team,family=family,data=soccer)
Delta.start<-c(as.numeric(PQL$coef$fixed),rep(0,6),as.numeric(t(PQL$coef$random$team)))
Q.start<-as.numeric(VarCorr(PQL)[1,1])
for(j in 1:length(lambda))
{
print(paste("Iteration ", j,sep=""))
for (i in 1:kk)
{
if (i < kk)
{
indi <- ind[(i-1)*nk+(1:nk)]
}else{
indi <- ind[((i-1)*nk+1):N]
}
soccer.train<-soccer[-indi,]
soccer.test<-soccer[indi,]
glm2 <- try(glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer.train, lambda=lambda[j],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.start,q_start=Q.start))
,silent=TRUE)
if(!inherits(glm2, "try-error"))
{
y.hat<-predict(glm2,soccer.test)
Devianz_ma[j,i]<-sum(family$dev.resids(soccer.test$points,y.hat,wt=rep(1,length(y.hat))))
}
}
print(sum(Devianz_ma[j,]))
}
Devianz_vec<-apply(Devianz_ma,1,sum)
opt2<-which.min(Devianz_vec)
glm2_final <- glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer, lambda=lambda[opt2],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.start,q_start=Q.start))
summary(glm2_final)
################## More Elegant Method ############################################
## Idea: start with a big lambda and use the estimates of the previous fit (BUT: before
## the final re-estimation Fisher scoring is performed!) as starting values for the next fit;
## make sure that your lambda sequence starts at a value big enough so that all covariates are
## shrunk to zero;
## Using BIC (or AIC, respectively) to determine the optimal tuning parameter lambda
## (here, a log-grid is used)
lambda <- exp(seq(log(460),log(1e-4),length.out = 50))
#lambda <- seq(460,0,length.out = 51)
BIC_vec<-rep(Inf,length(lambda))
family <- poisson(link = log)
# specify starting values for the very first fit; Delta.start must have suitable length:
# 7 fixed effects (intercept + 6 covariates) + 23 team random intercepts
Delta.start <- as.matrix(t(rep(0,7+23)))
# set reasonable starting value for intercept (depending on GLM family)
Delta.start[1,1] <- family$linkfun(mean(soccer$points))
Q.start<-0.1
for(j in 1:length(lambda))
{
print(paste("Iteration ", j,sep=""))
glm3 <- glmmLasso(points~1 +transfer.spendings
+ ave.unfair.score
+ tackles
+ sold.out
+ ball.possession
+ ave.attend
,rnd = list(team=~1),
family = family, data = soccer,
lambda=lambda[j], switch.NR=FALSE,final.re=FALSE,
control = list(start=Delta.start[j,],q_start=Q.start[j]))
print(colnames(glm3$Deltamatrix)[2:7][glm3$Deltamatrix[glm3$conv.step,2:7]!=0])
BIC_vec[j]<-glm3$bic
Delta.start<-rbind(Delta.start,glm3$Deltamatrix[glm3$conv.step,])
Q.start<-c(Q.start,glm3$Q_long[[glm3$conv.step+1]])
}
opt3<-which.min(BIC_vec)
glm3_final <- glmmLasso(points~transfer.spendings + ave.unfair.score
+ ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer, lambda=lambda[opt3],
switch.NR=FALSE,final.re=FALSE,
control = list(start=Delta.start[opt3,],q_start=Q.start[opt3]))
summary(glm3_final)
## plot coefficient paths
# quartz()
par(mar=c(6,6,4,4))
plot(lambda,Delta.start[2:(length(lambda)+1),2],type="l",lwd=2,ylim=c(-0.05,0.05),ylab=expression(hat(beta[j])))
abline(h=0,lty=3,lwd=2,col="grey")
for(i in 3:7){
lines(lambda[1:length(lambda)],Delta.start[2:(length(lambda)+1),i], lwd=2)
}
axis(3,at=lambda[opt3],label=substitute(paste(lambda["opt"]," = ",nn), list(nn=lambda[opt3])))
abline(v=lambda[opt3],lty=2,lwd=2,col=2)
################## Elegant Cross-Validation ###########################
## Using 5-fold CV to determine the optimal tuning parameter lambda
## Idea: on each training set, as in the previous method, start
## with a big lambda and use the estimates of the previous fit (BUT: before
## the final re-estimation Fisher scoring is performed!) as starting values for the next fit;
## make sure that your lambda sequence starts at a value big enough so that all
## covariates are shrunk to zero;
### set seed
set.seed(1909)
N<-dim(soccer)[1]
ind<-sample(N,N)
lambda <- seq(500,0,by=-5)
kk<-5
nk <- floor(N/kk)
Devianz_ma<-matrix(Inf,ncol=kk,nrow=length(lambda))
## first fit good starting model
library(MASS);library(nlme)
PQL <- glmmPQL(points~1,random = ~1|team,family=family,data=soccer)
Delta.start <- as.matrix(t(c(as.numeric(PQL$coef$fixed),rep(0,6),as.numeric(t(PQL$coef$random$team)))))
Q.start <- as.numeric(VarCorr(PQL)[1,1])
## loop over the folds
for (i in 1:kk)
{
print(paste("CV Loop ", i,sep=""))
if (i < kk)
{
indi <- ind[(i-1)*nk+(1:nk)]
}else{
indi <- ind[((i-1)*nk+1):N]
}
soccer.train<-soccer[-indi,]
soccer.test<-soccer[indi,]
Delta.temp <- Delta.start
Q.temp <- Q.start
## loop over lambda grid
for(j in 1:length(lambda))
{
#print(paste("Lambda Iteration ", j,sep=""))
glm4 <- try(glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer.train, lambda=lambda[j],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.temp[j,],q_start=Q.temp[j]))
,silent=TRUE)
if(!inherits(glm4, "try-error"))
{
y.hat<-predict(glm4,soccer.test)
Delta.temp<-rbind(Delta.temp,glm4$Deltamatrix[glm4$conv.step,])
Q.temp<-c(Q.temp,glm4$Q_long[[glm4$conv.step+1]])
Devianz_ma[j,i]<-sum(family$dev.resids(soccer.test$points,y.hat,wt=rep(1,length(y.hat))))
}
}
}
Devianz_vec<-apply(Devianz_ma,1,sum)
opt4<-which.min(Devianz_vec)
## now fit the full model along the lambda path up to the optimal lambda (at index opt4)
for(j in 1:opt4)
{
glm4.big <- glmmLasso(points~transfer.spendings
+ ave.unfair.score + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
family = family, data = soccer, lambda=lambda[j],switch.NR=FALSE,final.re=FALSE,
control=list(start=Delta.start[j,],q_start=Q.start[j]))
Delta.start<-rbind(Delta.start,glm4.big$Deltamatrix[glm4.big$conv.step,])
Q.start<-c(Q.start,glm4.big$Q_long[[glm4.big$conv.step+1]])
}
glm4_final <- glm4.big
summary(glm4_final)
## Not run:
data("soccer")
soccer[,c(4,5,9:16)]<-scale(soccer[,c(4,5,9:16)],center=TRUE,scale=TRUE)
soccer<-data.frame(soccer)
## linear mixed model
lm1 <- glmmLasso(points ~ transfer.spendings + ave.unfair.score
+ ball.possession + tackles
+ ave.attend + sold.out, rnd = list(team=~1),
lambda=50, data = soccer)
summary(lm1)
## similar linear model without random effects
lm1b <- glmmLasso(points ~ transfer.spendings + ave.unfair.score
+ ball.possession + tackles
+ ave.attend + sold.out, rnd = NULL,
lambda=50, data = soccer)
summary(lm1b)
## linear mixed model with slope on ave.attend;
## the coefficient of ave.attend is not penalized;
lm2 <- glmmLasso(points~transfer.spendings + ave.unfair.score
+ ball.possession + tackles + ave.attend
+ sold.out, rnd = list(team=~1 + ave.attend), lambda=10,
data = soccer, control = list(index=c(1,2,3,4,NA,5),
method="REML",print.iter=TRUE))
summary(lm2)
## linear mixed model with categorical covariates
## and final Fisher scoring re-estimation step
lm3 <- glmmLasso(points ~ transfer.spendings + as.factor(red.card)
+ as.factor(yellow.red.card) + ball.possession
+ tackles + ave.attend + sold.out, rnd = list(team=~1),
data = soccer, lambda=100, final.re=TRUE,
control = list(print.iter=TRUE,print.iter.final=TRUE))
summary(lm3)
## generalized linear mixed model
## with starting values
glm1 <- glmmLasso(points~transfer.spendings
+ ave.unfair.score + sold.out
+ tackles + ave.attend + ball.possession, rnd = list(team=~1),
family = poisson(link = log), data = soccer, lambda=100,
control = list(print.iter=TRUE,start=c(1,rep(0,29)),q_start=0.7))
summary(glm1)
## generalized linear mixed model with a smooth term
glm2 <- glmmLasso(points~ + ave.unfair.score + ave.attend
+ ball.possession + tackles + sold.out,
rnd = list(team=~1), family = poisson(link = log),
data = soccer, lambda=100, control = list(smooth=
list(formula=~-1 + transfer.spendings, nbasis=7,
spline.degree=3, diff.ord=2, penal=5),
print.iter=TRUE))
summary(glm2)
plot(glm2,plot.data=FALSE)
#####################
#####################
#####################
data(knee)
knee[,c(2,4:6)]<-scale(knee[,c(2,4:6)],center=TRUE,scale=TRUE)
knee<-data.frame(knee)
## fit cumulative model
glm3 <- glmmLasso(pain ~ time + th + age + sex, rnd = NULL,
family = cumulative(), data = knee, lambda=10,
control=list(print.iter=TRUE))
summary(glm3)
## fit adjacent category model
glm4 <- glmmLasso(pain ~ time + th + age + sex, rnd = NULL,
family = acat(), data = knee, lambda=10,
control=list(print.iter=TRUE))
summary(glm4)
# see also demo("glmmLasso-soccer")
## End(Not run)