Model Comparison
> #The anova() function compares the goodness of fit of two nested models. A nested model is one whose terms are completely contained in the other model. In the multiple regression model for states, we found that the coefficients for Income and Frost were not significant. We can therefore test whether the model without these two variables predicts as well as the model that includes them.
> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit1 <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states)
> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> anova(fit2, fit1)
Analysis of Variance Table

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Income + Illiteracy + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25
2     45 289.17  2  0.078505 0.0061 0.9939
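The F statistic in the anova() table above can be reproduced by hand from the two residual sums of squares, which makes the logic of the test concrete. A minimal sketch using the (rounded) values from the table:

```r
# F = ((RSS_reduced - RSS_full) / extra df) / (RSS_full / residual df of full model)
# Values below are taken from the anova() table above.
rss_full <- 289.17   # RSS of the 4-predictor model, 45 residual df
extra_ss <- 0.078505 # Sum of Sq gained by adding Income and Frost (2 df)
f_stat <- (extra_ss / 2) / (rss_full / 45)
f_stat                                   # ~0.0061, matching the table
pf(f_stat, 2, 45, lower.tail = FALSE)    # ~0.9939, the reported Pr(>F)
```

The large p-value means the extra variables add essentially nothing, so the reduced model is preferred.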
> fit1 <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states)
> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> AIC(fit1, fit2)
     df      AIC
fit1  6 241.6429
fit2  4 237.6565
> #The AIC values here indicate that the model without Income and Frost is better: models with smaller AIC are preferred, since a smaller value means the model achieves an adequate fit with fewer parameters. This criterion is computed with the AIC() function.
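As a sanity check on what AIC() is reporting, the value can be reconstructed from the model's log-likelihood: for an lm fit, AIC = -2*logLik + 2*k, where k counts every estimated parameter (intercept, slopes, and the error variance). A minimal sketch:

```r
states <- as.data.frame(state.x77[, c("Murder","Population","Illiteracy","Income","Frost")])
fit1 <- lm(Murder ~ Population + Income + Illiteracy + Frost, data = states)

ll <- logLik(fit1)
k  <- attr(ll, "df")   # 6: intercept + 4 slopes + error variance
aic_manual <- -2 * as.numeric(ll) + 2 * k

all.equal(aic_manual, AIC(fit1))   # TRUE
```

This also explains the df column printed by AIC(fit1, fit2): it is k, not the residual degrees of freedom.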
Variable Selection
> library(MASS)
> states<- as.data.frame(state.x77[,c("Murder","Population","Income","Illiteracy","Frost")])
> fit<-lm(Murder~Population+Income+Illiteracy+Frost,data=states)
> stepAIC(fit,direction="backward")
Start:  AIC=97.75
Murder ~ Population + Income + Illiteracy + Frost

             Df Sum of Sq    RSS     AIC
- Frost       1     0.021 289.19  95.753
- Income      1     0.057 289.22  95.759
<none>                    289.17  97.749
- Population  1    39.238 328.41 102.111
- Illiteracy  1   144.264 433.43 115.986

Step:  AIC=95.75
Murder ~ Population + Income + Illiteracy

             Df Sum of Sq    RSS     AIC
- Income      1     0.057 289.25  93.763
<none>                    289.19  95.753
- Population  1    43.658 332.85 100.783
- Illiteracy  1   236.196 525.38 123.605

Step:  AIC=93.76
Murder ~ Population + Illiteracy

             Df Sum of Sq    RSS     AIC
<none>                    289.25  93.763
- Population  1    48.517 337.76  99.516
- Illiteracy  1   299.646 588.89 127.311
Call:
lm(formula = Murder ~ Population + Illiteracy, data = states)
Coefficients:
(Intercept) Population Illiteracy
1.6515497 0.0002242 4.0807366
> #As you can see, the model is adjusted step by step: at each step the variable whose removal lowers AIC the most is deleted, and the process stops once removing any remaining variable would no longer decrease AIC.
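A practical point worth noting: stepAIC() returns the final fitted lm object, so the selected model can be stored and used directly. A short sketch (trace = FALSE suppresses the step-by-step log shown above):

```r
library(MASS)
states <- as.data.frame(state.x77[, c("Murder","Population","Income","Illiteracy","Frost")])
fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data = states)

# Run backward selection quietly and keep the chosen model
final <- stepAIC(fit, direction = "backward", trace = FALSE)

formula(final)   # Murder ~ Population + Illiteracy
coef(final)      # same coefficients as in the Call output above
```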
All Subsets Regression
> states<-as.data.frame(state.x77[,c("Murder","Population","Income","Illiteracy","Frost")])
> library(leaps)
> leaps <- regsubsets(Murder ~ Population + Income + Illiteracy + Frost, data=states, nbest=4)
> plot(leaps, scale="adjr2")
> #In the first row of the plot, you can see that the model with the intercept and Income has an adjusted R-squared of 0.033, while the model with the intercept and Population has an adjusted R-squared of 0.1. Jumping to the 12th row, the model with the intercept, Population, Illiteracy, and Income has an adjusted R-squared of 0.54, whereas the model with only the intercept, Population, and Illiteracy has an adjusted R-squared of 0.55; here the model with fewer predictors has the larger adjusted R-squared. The plot indicates that the two-predictor model (Population and Illiteracy) is the best.
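The adjusted R-squared values read off the plot can also be extracted numerically from the regsubsets fit. A minimal sketch using the summary() components the leaps package provides ($which for the variable membership of each subset, $adjr2 for its adjusted R-squared):

```r
library(leaps)
states <- as.data.frame(state.x77[, c("Murder","Population","Income","Illiteracy","Frost")])
leaps <- regsubsets(Murder ~ Population + Income + Illiteracy + Frost,
                    data = states, nbest = 4)

ss <- summary(leaps)
# One row per candidate subset: which variables it contains, plus its adjusted R^2
cbind(ss$which, adjr2 = round(ss$adjr2, 3))
# Row index of the subset that maximizes adjusted R^2
which.max(ss$adjr2)
```

This gives the same ranking as the plot, without having to read values off the axis.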