Simple Linear Regression: one regressor
◾OLS approach to estimate the coefficients
◾OLS assumptions
◾Sampling Distribution of the OLS Estimator
◾Measures of Fit: R-squared
◾Gauss-Markov Theorem
Multiple Regression Models
◾Omitted Variable Problem
◾Collinearity Problem
◾Hypothesis Tests and Confidence Intervals: t-test and F-test
01 Linear Regression with One Regressor
Estimating the coefficients of the linear regression model
pacman::p_load(AER,MASS)
data(CASchools)
class(CASchools)
head(CASchools)
# compute STR (student-teacher ratio) and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers
# compute score (average of reading and math scores) and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute the sample averages of STR and score
avg_STR <- mean(CASchools$STR)
avg_score <- mean(CASchools$score)
# compute the sample standard deviations of STR and score (var() gives variances)
sd_STR <- sd(CASchools$STR)
sd_score <- sd(CASchools$score)
# set up a vector of percentiles and compute the quantiles
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)
# gather everything in a data.frame
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score),
StandardDeviation = c(sd_STR, sd_score),
quantile = rbind(quant_STR, quant_score))
# main adds a title; xlab and ylab add custom axis labels
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of TestScore and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)")
# compute the correlation coefficient
# cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
cor(CASchools$STR, CASchools$score)
# covariance: cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
Variance: measures how much a single series of data fluctuates; the smaller the value, the smaller the fluctuation.
Covariance: measures how two data series vary together; a positive value means they tend to move in the same direction, a negative value means they move in opposite directions, and 0 means no linear co-movement.
Correlation coefficient: the degree of linear association between two data series, with values in [-1, 1]; the closer to 0, the weaker the association. The correlation coefficient measures linear association between two variables and does not necessarily imply causation.
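As a quick illustration (a minimal sketch using the CASchools variables defined above), the three quantities can be computed directly in R:
# variance: the spread of a single variable
var(CASchools$STR)
# covariance: negative here, so STR and score tend to move in opposite directions
cov(CASchools$STR, CASchools$score)
# correlation: the covariance rescaled to [-1, 1]; -0.2263627 here, a weak negative association
cor(CASchools$STR, CASchools$score)
# check: the correlation equals the covariance divided by the product of the standard deviations
cov(CASchools$STR, CASchools$score) / (sd(CASchools$STR) * sd(CASchools$score))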
> class(CASchools)
[1] "data.frame"
> head(CASchools)
district school county grades
1 75119 Sunol Glen Unified Alameda KK-08
2 61499 Manzanita Elementary Butte KK-08
3 61549 Thermalito Union Elementary Butte KK-08
4 61457 Golden Feather Union Elementary Butte KK-08
5 61523 Palermo Union Elementary Butte KK-08
6 62042 Burrel Union Elementary Fresno KK-08
students teachers calworks lunch computer expenditure
1 195 10.90 0.5102 2.0408 67 6384.911
2 240 11.15 15.4167 47.9167 101 5099.381
3 1550 82.90 55.0323 76.3226 169 5501.955
4 243 14.00 36.4754 77.0492 85 7101.831
5 1335 71.50 33.1086 78.4270 171 5235.988
6 137 6.40 12.3188 86.9565 25 5580.147
income english read math
1 22.690001 0.000000 691.6 690.0
2 9.824000 4.583333 660.5 661.9
3 8.978000 30.000002 636.3 650.9
4 8.978000 0.000000 651.9 643.5
5 9.080333 13.857677 641.8 639.9
6 10.415000 12.408759 605.7 605.4
> DistributionSummary
Average StandardDeviation quantile.10.
quant_STR 19.64043 1.891812 17.3486
quant_score 654.15655 19.053347 630.3950
quantile.25. quantile.40. quantile.50. quantile.60.
quant_STR 18.58236 19.26618 19.72321 20.0783
quant_score 640.05000 649.06999 654.45000 659.4000
quantile.75. quantile.90.
quant_STR 20.87181 21.86741
quant_score 666.66249 678.85999
> cor(CASchools$STR, CASchools$score)
[1] -0.2263627
◾Computing the regression coefficients by ordinary least squares
OLS estimation chooses the regression coefficients so that the estimated regression line is as "close" as possible to the observed data points, where closeness is measured by the sum of squared residuals between the observed and fitted values.
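Concretely, OLS minimizes the sum of squared residuals Σ(Yi − b0 − b1·Xi)²; solving the first-order conditions yields the closed-form estimators implemented in the code below:
β1^ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²  and  β0^ = Ȳ − β1^·X̄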
# attach() makes the variables in CASchools accessible without the $ operator
attach(CASchools)
# compute beta_1_hat
beta_1 <- sum((STR - mean(STR)) * (score - mean(score))) / sum((STR - mean(STR))^2)
# compute beta_0_hat
beta_0 <- mean(score) - beta_1 * mean(STR)
# fit the model
linear_model <- lm(score ~ STR, data = CASchools)
linear_model
# a model without an intercept: lm(ts ~ cs - 1) or lm(ts ~ cs + 0)
# scatterplot, setting the parameters xlim and ylim
plot(score ~ STR,
data = CASchools,
main = "Scatterplot of TestScore and STR",
xlab = "STR (X)",
ylab = "Test Score (Y)",
xlim = c(10, 30),
ylim = c(600, 720))
abline(linear_model) # add the regression line
# print the regression results
mod_summary <- summary(linear_model)
mod_summary
# compute the coefficient of determination R^2 manually
SSR <- sum(mod_summary$residuals^2) # sum of squared residuals
TSS <- sum((score-mean(score))^2) # total sum of squares
R2 <- 1 - SSR/TSS
# compute the SER (standard error of the regression) manually
n <- nrow(CASchools) # alternatively: n <- length(score)
SER <- sqrt(SSR / (n-2))
#SER <- summary(linear_model)$sigma # residual standard error
#SSR <- SER^2*(length(score)-2)
#cov_matrix <- summary(linear_model)$cov.unscaled
#SEs <- sqrt(diag(cov_matrix)) # diag() extracts the diagonal of the matrix
> beta_1
[1] -2.279808
> beta_0
[1] 698.9329
> linear_model
Call:
lm(formula = score ~ STR, data = CASchools)
Coefficients:
(Intercept) STR
698.93 -2.28
> mod_summary
Call:
lm(formula = score ~ STR, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients: # regression coefficients (estimates, standard errors, t values, p-values)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
STR -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
# F value and p-value of the F test for the overall significance of the regression
# anova(linear_model) # prints the analysis-of-variance table
> R2
[1] 0.05124009
> SER
[1] 18.58097
1. Residuals: measure how far the observed response values lie from the values predicted by the model.
2. Coefficients: the estimated coefficients; Intercept is the model's intercept.
3. Std. Error: the standard error of each coefficient estimate, describing how precisely the coefficient is estimated; smaller values are better.
4. Significance: the significance test indicates whether a linear relationship between the variables exists; more stars next to a p-value mean a more significant variable. If the p-value is smaller than the significance level, the null hypothesis can be rejected.
5. t statistic: used in the t test to decide whether to reject the null hypothesis.
6. Residual standard error: the average amount by which the response deviates from the regression line; here, actual test scores deviate from the regression line by about 18.58 points on average.
7. Multiple R-squared: the coefficient of determination, R² = ESS/TSS = 1 − SSR/TSS (with SSR the sum of squared residuals, as in the code above). It measures how well the regression line fits the observed data, i.e., the share of the variation in y explained by x. For example, an R² of 0.716 in a regression of non-performing loans on loan balances would mean that 71.6% of the variation in non-performing loans is explained by their linear relationship with loan balances. In simple linear regression, the correlation coefficient r equals the square root of R² up to sign (|r| = √R²).
8. Adjusted R-squared: used to compare the fit of linear models with different numbers of regressors; the closer to 1, the better the fit.
9. F statistic: measures the joint effect of all regressors on the response; the larger the F statistic, the stronger the evidence against the null hypothesis that all slope coefficients are zero.
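All of these quantities can also be extracted programmatically; a minimal sketch using the mod_summary object created above:
# coefficient table: estimates, standard errors, t values, p-values
mod_summary$coefficients
# residual standard error (SER) and residual degrees of freedom
mod_summary$sigma
mod_summary$df[2]
# R^2 and adjusted R^2
mod_summary$r.squared
mod_summary$adj.r.squared
# F statistic with its numerator and denominator degrees of freedom
mod_summary$fstatistic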
◾OLS assumptions
Assumption 1: The conditional mean of the error term is zero, E(ui|Xi) = 0
set.seed(321)
# simulate the data
X <- runif(50, min = -5, max = 5) # uniformly distributed random numbers
u <- rnorm(50, sd = 1) # normally distributed random numbers
# the true relationship
Y <- X^2 + 2 * X + u
# fit a simple linear regression model
mod_simple <- lm(Y ~ X)
# predictions from a quadratic model
prediction <- predict(lm(Y ~ X + I(X^2)), data.frame(X = sort(X)))
# scatterplot
plot(Y ~ X)
abline(mod_simple, col = "red")
lines(sort(X), prediction)
With the quadratic model (the black curve), there is no systematic deviation between the observations and the predicted relationship; with the simple linear regression model, however, E(ui|Xi) varies with Xi, so the assumption is violated.
Assumption 2: Independent and identically distributed observations
set.seed(123)
# generate a vector of dates
Date <- seq(as.Date("1951/1/1"), as.Date("2000/1/1"), "years")
# initialize the vector of employment counts
X <- c(5000, rep(NA, length(Date)-1))
# generate time-series observations with random influences
for (i in 2:length(Date)) {
X[i] <- -50 + 0.98 * X[i-1] + rnorm(n = 1, sd = 200) }
# plot
plot(x = Date,
y = X,
type = "l",
col = "steelblue",
ylab = "Workers",
xlab = "Time")
Clearly, the observations on the number of workers are not independent: today's employment level is correlated with tomorrow's. The assumption is therefore violated.
Assumption 3: Large outliers are unlikely
set.seed(123)
# generate the data
X <- sort(runif(10, min = 30, max = 70))
Y <- rnorm(10 , mean = 200, sd = 50)
Y[9] <- 2000
# fit the model with the outlier
fit <- lm(Y ~ X)
# fit the model without the outlier
fitWithoutOutlier <- lm(Y[-9] ~ X[-9])
# plot the results
plot(Y ~ X)
abline(fit)
abline(fitWithoutOutlier, col = "red")
Even when extreme observations are recorded correctly, it is advisable to exclude them before estimating the model, because OLS is sensitive to outliers.
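The sensitivity can be quantified by comparing the two fitted coefficient vectors directly (a quick check using the fit and fitWithoutOutlier objects from above):
# slope and intercept change dramatically once observation 9 is dropped
coef(fit)
coef(fitWithoutOutlier)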
◾Sampling distribution of the OLS estimator
Simulation study 1
# simulate the data
N <- 100000
X <- runif(N, min = 0, max = 20)
u <- rnorm(N, sd = 10)
# population regression
Y <- -2 + 3.5 * X + u
population <- data.frame(X, Y)
# set the sample size
n <- 100
# compute the variance of beta_hat_0
H_i <- 1 - mean(X) / mean(X^2) * X
var_b0 <- var(H_i * u) / (n * mean(H_i^2)^2 )
# compute the variance of beta_hat_1
var_b1 <- var( ( X - mean(X) ) * u ) / (n * var(X)^2)
# set the number of repetitions and the sample size
n <- 100
reps <- 10000
# initialize the matrix of estimates
fit <- matrix(ncol = 2, nrow = reps)
# loop: sample and estimate the coefficients
for (i in 1:reps){
sample <- population[sample(1:N, n), ]
fit[i, ] <- lm(Y ~ X, data = sample)$coefficients
}
# compute variance estimates from the simulation results
var(fit[, 1])
var(fit[, 2])
# divide the plotting area into 1 row and 2 columns
par(mfrow = c(1, 2))
# histogram of the beta_0 estimates
hist(fit[, 1],
cex.main = 1,
main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[0] ~ Estimates),
xlab = bquote(hat(beta)[0]),
freq = F)
# add the theoretical density curve
curve(dnorm(x,-2,sqrt(var_b0)),
add = T,
col = "darkred")
# histogram of the beta_1 estimates
hist(fit[, 2],
cex.main = 1,
main = bquote(The ~ Distribution ~ of ~ 10000 ~ beta[1] ~ Estimates),
xlab = bquote(hat(beta)[1]),
freq = F)
# add the theoretical density curve
curve(dnorm(x,3.5,sqrt(var_b1)),
add = T,
col = "darkred")
> var_b0
[1] 4.045066
> var_b1
[1] 0.03018694
> var(fit[, 1])
[1] 4.186832
> var(fit[, 2])
[1] 0.03096199
The variance estimates are close to their theoretical counterparts, and the histograms show that the distributions of the estimators are well approximated by the normal distribution.
Simulation study 2
set.seed(1)
# set the number of repetitions and the sample sizes
reps <- 1000
n <- c(100, 250, 1000, 3000)
# initialize the matrix of estimates
fit <- matrix(ncol = 2, nrow = reps)
# divide the plotting area into 2 rows and 2 columns
par(mfrow = c(2, 2))
# repeated sampling and plotting
# outer loop: iterate over the sample sizes
for (j in 1:length(n)) {
# inner loop: sampling and estimating of the coefficients
for (i in 1:reps){
sample <- population[sample(1:N, n[j]), ]
fit[i, ] <- lm(Y ~ X, data = sample)$coefficients
}
# plot a density estimate of the slope estimates
plot(density(fit[ ,2]), xlim=c(2.5, 4.5),
col = j,
main = paste("n=", n[j]),
xlab = bquote(hat(beta)[1]))
}
We find that as n increases, the distribution of β̂1 concentrates around its mean, i.e., its variance decreases. In other words, as the sample size grows, the probability of observing an estimate close to the true value β1 = 3.5 increases. The same holds for the distribution of β̂0.
Hypothesis tests and confidence intervals in the simple linear regression model
Testing a two-sided hypothesis about the slope coefficient
data(CASchools)
# compute STR, the student-teacher ratio
CASchools$STR <- CASchools$students/CASchools$teachers
# compute score
CASchools$score <- (CASchools$read + CASchools$math)/2
# fit the model
linear_model <- lm(score ~ STR, data = CASchools)
# inspect the estimated coefficients
summary(linear_model)$coefficients
# obtain the residual degrees of freedom
linear_model$df.residual
# compute the p-value of the two-sided significance test for the slope coefficient
# if the t value is negative: p value = 2 * pt(t value, df); if positive: p value = 2 * pt(t value, df, lower.tail = FALSE)
2 * pt(-4.751327, df = 418)
# when n is large enough, the standard normal distribution can also be used: 2 * (1 - pnorm(abs(t value)))
# compute the t value manually: t value = beta_1_hat / SE(beta_1_hat)
coef <- summary(linear_model)$coefficients
t_value <- coef[2,1]/coef[2,2]
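As a consistency check (a sketch based on the t_value just computed), the reported p-value can be reproduced from the manual t statistic and compared with the large-sample normal approximation:
# exact two-sided p-value from the t distribution with 418 degrees of freedom
2 * pt(-abs(t_value), df = 418)
# large-sample approximation based on the standard normal distribution
2 * (1 - pnorm(abs(t_value)))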
> summary(linear_model)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.932949 9.4674911 73.824516 6.569846e-242
STR -2.279808 0.4798255 -4.751327 2.783308e-06
> linear_model$df.residual
[1] 418
> 2 * pt(-4.751327, df = 418)
[1] 2.78331e-06
Confidence intervals for the regression coefficients
set.seed(4)
# generate and plot the sample data
Y <- rnorm(n = 100,
mean = 5,
sd = 5)
plot(Y, pch = 19, col = "steelblue")
# confidence interval at the 1 - α confidence level (α = 0.05)
> cbind(CIlower = mean(Y) - 1.96 * 5 / 10, CIupper = mean(Y) + 1.96 * 5 / 10)
CIlower CIupper
[1,] 4.502625 6.462625
# compute 95% confidence intervals for the coefficients of linear_model
confint(linear_model,level = 0.95)
#> 2.5 % 97.5 %
#> (Intercept) 680.32312 717.542775
#> STR -3.22298 -1.336636
Gauss-Markov theorem
The OLS estimator is the best (in the sense of smallest variance) linear conditionally unbiased estimator (BLUE).
Simulation study: the BLUE estimator
# set the sample size and the number of repetitions
n <- 100
reps <- 1e5
# choose epsilon and create the weight vector as defined above
epsilon <- 0.8 # set epsilon
w <- c(rep((1 + epsilon) / n, n / 2),
rep((1 - epsilon) / n, n / 2) )
# draw a random sample y_1, ..., y_n from the standard normal distribution,
# apply both estimators 1e5 times, and store the results in the vectors "ols" and "weightedestimator"
ols <- rep(NA, reps)
weightedestimator <- rep(NA, reps)
for (i in 1:reps) {
y <- rnorm(n) # draw a sample of 100 observations
ols[i] <- mean(y)
weightedestimator[i] <- crossprod(w, y) # inner product of the weight and sample vectors
}
# plot kernel density estimates of the estimators' distributions:
# OLS
plot(density(ols),
col = "purple",
lwd = 3,
main = "Density of OLS and Weighted Estimator",
xlab = "Estimates")
# weighted
lines(density(weightedestimator),
col = "steelblue",
lwd = 3)
# add a dashed line at 0 and a legend to the plot
abline(v = 0, lty = 2)
# add the legend
legend('topright',
c("OLS", "Weighted"),
col = c("purple", "steelblue"),
lwd = 3)
Both estimators appear to be unbiased: the means of their estimated distributions are zero.
The estimator using the weights is less efficient than the OLS estimator: its distribution is more dispersed.
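The visual impression can be confirmed numerically; a quick check using the simulation vectors from above:
# both estimators are approximately unbiased ...
mean(ols)
mean(weightedestimator)
# ... but the weighted estimator has the larger sampling variance
var(ols)
var(weightedestimator)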
02 Regression Models with Multiple Regressors
Omitted Variable Bias
library(AER)
library(MASS)
data(CASchools)
# define the variables
CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# compute the correlations
cor(CASchools$STR, CASchools$score)
cor(CASchools$STR, CASchools$english)
# fit the two regression models
mod <- lm(score ~ STR, data = CASchools)
mult.mod <- lm(score ~ STR + english, data = CASchools)
# print the results
mod
mult.mod
> mod
Call:
lm(formula = score ~ STR, data = CASchools)
Coefficients:
(Intercept) STR
698.93 -2.28
> mult.mod
Call:
lm(formula = score ~ STR + english, data = CASchools)
Coefficients:
(Intercept) STR english
686.0322 -1.1013 -0.6498
Multiple regression models
summary(mult.mod)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.0322445 7.41131160 92.565565 3.871327e-280
STR -1.1012956 0.38027827 -2.896026 3.978059e-03
english -0.6497768 0.03934254 -16.515882 1.657448e-47
Measures of Fit in Multiple Regression
summary(mult.mod)
n <- nrow(CASchools) # number of observations
k <- 2 # number of regressors
#k <- ncol(mult.mod$model)-1
y_mean <- mean(CASchools$score)
SSR <- sum(residuals(mult.mod)^2) # sum of squared residuals
TSS <- sum((CASchools$score - y_mean )^2) # total sum of squares
ESS <- sum((fitted(mult.mod) - y_mean)^2) # explained sum of squares
# compute the measures of fit manually
SER <- sqrt(1/(n-k-1) * SSR) # standard error of the regression
Rsq <- 1 - (SSR / TSS) # Rˆ2
adj_Rsq <- 1 - (n-1)/(n-k-1) * SSR/TSS # adj. Rˆ2
c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)
> summary(mult.mod)
Call:
lm(formula = score ~ STR + english, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-48.845 -10.240 -0.308 9.815 43.461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
STR -1.10130 0.38028 -2.896 0.00398 **
english -0.64978 0.03934 -16.516 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
> c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)
SER R2 Adj.R2
14.4644831 0.4264315 0.4236805
OLS assumptions in multiple regression
Multicollinearity
Multicollinearity means that two or more regressors in a multiple regression model are strongly correlated. If the correlation between two or more regressors is perfect, i.e., one regressor can be written as a linear combination of the others, we speak of perfect multicollinearity.
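Perfect multicollinearity can also be detected programmatically; a small sketch, assuming a fitted lm object named mult.mod as in the examples below:
# alias() reports regressors that are exact linear combinations of other regressors
alias(mult.mod)
# pairwise correlations flag strongly (though not perfectly) related regressors
cor(CASchools$STR, CASchools$english)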
Example [1] of perfect multicollinearity: a regressor that is an exact linear function of another regressor
# define FracEL, the fraction of English learners
CASchools$FracEL <- CASchools$english / 100
# fit the model
mult.mod <- lm(score ~ STR + english + FracEL, data = CASchools)
# lm() automatically drops FracEL from the regression because it is perfectly collinear with english
summary(mult.mod)
> summary(mult.mod)
Call:
lm(formula = score ~ STR + english + FracEL, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-48.845 -10.240 -0.308 9.815 43.461
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
STR -1.10130 0.38028 -2.896 0.00398 **
english -0.64978 0.03934 -16.516 < 2e-16 ***
FracEL NA NA NA NA
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
Example [2] of perfect multicollinearity: a variable defined with a logical error, so that it shows no variation
# NS = 0 if STR is smaller than 12, otherwise NS = 1
CASchools$NS <- ifelse(CASchools$STR < 12, 0, 1) # the variable takes only one value
# fit the model
mult.mod <- lm(score ~ computer + english + NS, data = CASchools)
summary(mult.mod)
> summary(mult.mod)
Call:
lm(formula = score ~ computer + english + NS, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-49.492 -9.976 -0.778 8.761 43.798
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 663.704837 0.984259 674.319 < 2e-16 ***
computer 0.005374 0.001670 3.218 0.00139 **
english -0.708947 0.040303 -17.591 < 2e-16 ***
NS NA NA NA NA
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.43 on 417 degrees of freedom
Multiple R-squared: 0.4291, Adjusted R-squared: 0.4263
F-statistic: 156.7 on 2 and 417 DF, p-value: < 2.2e-16
> table(CASchools$NS) # the variable takes only one value
1
420
Example [3] of perfect multicollinearity: the "dummy variable trap"
This can occur when several mutually exclusive dummy variables are included as regressors together with an intercept.
The remedy is to exclude one of the dummy categories (the omitted category then serves as the baseline); alternatively, the intercept can be dropped.
set.seed(1)
# generate artificial data on compass direction
CASchools$direction <- sample(c("West", "North", "South", "East"),
420,
replace = T)
# fit the model
mult.mod <- lm(score ~ STR + english + direction, data = CASchools)
summary(mult.mod)
> summary(mult.mod)
Call:
lm(formula = score ~ STR + english + direction, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-49.603 -10.175 -0.484 9.524 42.830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 684.80477 7.54130 90.807 < 2e-16 ***
STR -1.08873 0.38153 -2.854 0.00454 **
english -0.65597 0.04018 -16.325 < 2e-16 ***
directionNorth 1.66314 2.05870 0.808 0.41964
directionSouth 0.71619 2.06321 0.347 0.72867
directionWest 1.79351 1.98174 0.905 0.36598
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.5 on 414 degrees of freedom
Multiple R-squared: 0.4279, Adjusted R-squared: 0.421
F-statistic: 61.92 on 5 and 414 DF, p-value: < 2.2e-16
Example [4] of perfect multicollinearity: a redundant regressor creates an exact linear relationship
# percentage of English speakers (100 minus the percentage of English learners)
CASchools$PctES <- 100 - CASchools$english
# fit the model
mult.mod <- lm(score ~ STR + english + PctES, data = CASchools)
summary(mult.mod)
> summary(mult.mod)
Call:
lm(formula = score ~ STR + english + PctES, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-48.845 -10.240 -0.308 9.815 43.461
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
STR -1.10130 0.38028 -2.896 0.00398 **
english -0.64978 0.03934 -16.516 < 2e-16 ***
PctES NA NA NA NA
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.46 on 417 degrees of freedom
Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
The distribution of the OLS estimators in multiple regression
pacman::p_load(mvtnorm,MASS)
# set the sample size
n <- 50
# initialize the matrix of coefficient estimates
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
set.seed(1)
# loop: sample and estimate the coefficients
for (i in 1:10000) {
X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
u <- rnorm(n, sd = 5)
Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
coefs[i,] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]
}
# compute a 2D kernel density estimate of the coefficient estimates
kde <- kde2d(coefs[, 1], coefs[, 2])
# plot
persp(kde,
theta = 310,
phi = 30,
xlab = "beta_1",
ylab = "beta_2",
zlab = "Est. Density")
Hypothesis Tests and Confidence Intervals for a Single Coefficient
coeftest(x, vcov. = NULL, df = NULL, ...)
coeftest() performs t tests (or z tests) on regression coefficients. The argument x is the object to test, usually a fitted regression model (an lm object). vcov. is the covariance matrix of the parameters; with the default NULL, the classical OLS estimate is used. df gives the degrees of freedom of the t test; with the default NULL it is n - k (the sample size minus the number of parameters), and when df is not a finite positive number (for example a negative value or Inf) a z test based on the normal distribution is performed instead.
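For instance, a minimal sketch (assuming the linear_model object fitted earlier in this section):
# classical t tests, identical to summary(linear_model)$coefficients
coeftest(linear_model)
# z tests based on the standard normal distribution (df is not a finite positive number)
coeftest(linear_model, df = Inf)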
#obtain a robust summary of the coefficients
coeftest(mod, vcov. = vcovHC)
pacman::p_load(stargazer,AER)
CASchools$size <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2
# t-test
model <- lm(score ~ size + english, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
# compute the t statistics for all coefficients
coefs <- summary(model)$coefficients[, 1]
SEs <- summary(model)$coefficients[, 2]
tstats <- coefs/SEs
# compute the p-values for all significance tests
pvals <- 2*(pnorm(-abs(tstats)))
# compute confidence intervals for the individual coefficients
confint(model)
# obtain confidence intervals at the α = 0.1 level (90% confidence)
confint(model, level = 0.9)
> coeftest(model, vcov. = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.032245 8.728225 78.5993 < 2e-16 ***
size -1.101296 0.432847 -2.5443 0.01131 *
english -0.649777 0.031032 -20.9391 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> confint(model)
2.5 % 97.5 %
(Intercept) 671.4640580 700.6004311
STR -1.8487969 -0.3537944
english -0.7271113 -0.5724424
To compute confidence intervals for individual coefficients in the multiple regression model, use confint(). Its drawback is that it cannot use robust standard errors. Large-sample robust confidence intervals are therefore computed manually below.
# print a coefficient summary that reports robust standard errors
coeftest(model, vcov. = vcovHC)
# check whether the hypotheses are rejected at the 1% significance level
coeftest(model , vcov. = vcovHC)[, 4] < 0.01
# compute robust standard errors manually
rob_se <- diag(vcovHC(model, type = "HC1"))^0.5
# compute robust 95% confidence intervals manually
rbind("lower" = coef(model) - qnorm(0.975) * rob_se,
"upper" = coef(model) + qnorm(0.975) * rob_se)
(Intercept) size english
lower 668.9252 -1.9496606 -0.7105980
upper 703.1393 -0.2529307 -0.5889557
# compute robust 90% confidence intervals manually
rbind("lower" = coef(model) - qnorm(0.95) * rob_se,
"upper" = coef(model) + qnorm(0.95) * rob_se)
(Intercept) size english
lower 671.6756 -1.8132659 -0.7008195
upper 700.3889 -0.3893254 -0.5987341
Homoskedasticity: an important assumption of the classical linear regression model, imposed so that the estimated regression parameters have good statistical properties: the error terms in the population regression function all have the same variance.
Heteroskedasticity: if this assumption fails (the error terms have different variances), the linear regression model is said to be heteroskedastic.
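A small simulated sketch of the distinction (artificial data, with variable names chosen here for illustration): when the error spread grows with the regressor, the homoskedasticity assumption fails and robust standard errors differ from the classical ones:
set.seed(1)
x <- runif(200, min = 1, max = 10)
u <- rnorm(200, sd = 0.5 * x) # error variance increases with x: heteroskedastic errors
y <- 1 + 2 * x + u
het_mod <- lm(y ~ x)
# classical (homoskedasticity-only) standard errors
summary(het_mod)$coefficients[, 2]
# heteroskedasticity-robust standard errors (HC1, via vcovHC from the sandwich package loaded with AER)
sqrt(diag(vcovHC(het_mod, type = "HC1")))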
Another extension of the model
# rescale expenditure to thousands of dollars
CASchools$expenditure <- CASchools$expenditure/1000
# fit the model
model <- lm(score ~ STR + english + expenditure, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
# correlation between 'STR' and 'expenditure'
cor(CASchools$STR, CASchools$expenditure)
> coeftest(model, vcov. = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 649.577947 15.458344 42.0212 < 2e-16 ***
STR -0.286399 0.482073 -0.5941 0.55277
english -0.656023 0.031784 -20.6398 < 2e-16 ***
expenditure 3.867901 1.580722 2.4469 0.01482 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> cor(CASchools$STR, CASchools$expenditure)
[1] -0.6199822
Joint Hypothesis Testing Using the F-Statistic
# fit the multiple regression model
model <- lm(score ~ STR + english + expenditure, data = CASchools)
summary(model)
# execute the test on the model, supplying the restrictions on the regressors as character strings
linearHypothesis(model, c("STR=0", "expenditure=0"))
# heteroskedasticity-robust F test
linearHypothesis(model, c("STR=0", "expenditure=0"), white.adjust = "hc1")
> summary(model)
Call:
lm(formula = score ~ STR + english + expenditure, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-51.340 -10.111 0.293 10.318 43.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 649.57795 15.20572 42.719 < 2e-16 ***
STR -0.28640 0.48052 -0.596 0.55149
english -0.65602 0.03911 -16.776 < 2e-16 ***
expenditure 3.86790 1.41212 2.739 0.00643 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.35 on 416 degrees of freedom
Multiple R-squared: 0.4366, Adjusted R-squared: 0.4325
F-statistic: 107.5 on 3 and 416 DF, p-value: < 2.2e-16
> linearHypothesis(model, c("STR=0", "expenditure=0"))
Linear hypothesis test
Hypothesis:
STR = 0
expenditure = 0
Model 1: restricted model
Model 2: score ~ STR + english + expenditure
Res.Df RSS Df Sum of Sq F Pr(>F)
1 418 89000
2 416 85700 2 3300.3 8.0101 0.000386 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> linearHypothesis(model, c("STR=0", "expenditure=0"), white.adjust = "hc1")
Linear hypothesis test
Hypothesis:
STR = 0
expenditure = 0
Model 1: restricted model
Model 2: score ~ STR + english + expenditure
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 418
2 416 2 5.4337 0.004682 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Confidence sets for multiple coefficients
◾Based on the F statistic above, confidence sets can be constructed. A confidence set is the analogue of the confidence interval for a single coefficient: it is the set of coefficient combinations that contains the true coefficients with a prespecified probability.
◾The confidence set for two coefficients is an ellipse centered at the point defined by the two coefficient estimates; confidenceEllipse() draws the confidence set for two coefficients of a model.
# plot the 95% confidence set for the coefficients on size and expenditure
model <- lm(score ~ size + english + expenditure, data = CASchools)
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("size", "expenditure"),
main = "95% Confidence Set")
The ellipse is centered at (-0.29, 3.87), the estimated coefficients on size and expenditure. Moreover, (0, 0) is not an element of the 95% confidence set, so we can reject H0: beta_1 = 0, beta_3 = 0.
By default, confidenceEllipse() uses homoskedasticity-only standard errors. The code below shows how to compute a robust confidence ellipse.
# plot the robust 95% confidence set for the coefficients on size and expenditure
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("size", "expenditure"),
main = "95% Confidence Sets",
vcov. = vcovHC(model, type = "HC1"),
col = "red")
# add the non-robust 95% confidence set for comparison
confidenceEllipse(model,
fill = T,
lwd = 0,
which.coef = c("size", "expenditure"),
add = T)
Specification of multiple regression models
Choosing a regression specification, i.e., selecting the variables to include in a regression model, is a difficult task, but some guidelines exist.
The goal is clear: obtain an unbiased and precise estimate of the causal relationship of interest.
◾First, think about omitted variables, i.e., avoid possible bias by using suitable control variables.
◾Second, compare different specifications using measures of fit, but do not rely solely on R².
Below we discuss potential omitted variable bias in the multiple regression model.
# fit the model
model <- lm(score ~ size + english + lunch, data = CASchools)
coeftest(model, vcov. = vcovHC, type = "HC1")
> coeftest(model, vcov. = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 700.149957 5.568453 125.7351 < 2.2e-16 ***
size -0.998309 0.270080 -3.6963 0.0002480 ***
english -0.121573 0.032832 -3.7029 0.0002418 ***
lunch -0.547345 0.024107 -22.7046 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Compared with lm(score ~ size + english), adding lunch produces no substantial change in the coefficient on size, which moves by only about 0.1. Although the difference in the estimates is small in this case, keeping lunch in the model makes the assumption of conditional mean independence more plausible.
R² cannot tell us whether:
1. the included variables are statistically significant;
2. the regressors are the true cause of the movements in the dependent variable;
3. there is omitted variable bias;
4. the most appropriate set of regressors has been chosen.
set.seed(1)
# generate observations on PLS (parking lot space)
CASchools$PLS <- c(22 * CASchools$income
- 15 * CASchools$size
+ 0.2 * CASchools$expenditure
+ rnorm(nrow(CASchools), sd = 80) + 3000)
# scatterplot of the relationship between PLS and score
plot(CASchools$PLS,
CASchools$score,
xlab = "Parking Lot Space",
ylab = "Test Score",
pch = 20,
col = "steelblue")
# regress score on PLS
summary(lm(score ~ PLS, data = CASchools))
> summary(lm(score ~ PLS, data = CASchools))
Call:
lm(formula = score ~ PLS, data = CASchools)
Residuals:
Min 1Q Median 3Q Max
-42.386 -9.739 0.598 10.495 36.866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.576e+02 1.171e+01 39.08 <2e-16 ***
PLS 6.455e-02 3.837e-03 16.82 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.73 on 418 degrees of freedom
Multiple R-squared: 0.4037, Adjusted R-squared: 0.4023
F-statistic: 283 on 1 and 418 DF, p-value: < 2.2e-16
PLS is a linear function of income, size, expenditure, and a random error, so some positive relationship between PLS and score is to be expected.
The coefficient on PLS is positive and significantly different from zero. Moreover, R² and adjusted R² are about 0.4, far above the roughly 0.05 observed when regressing score on size alone.
Analysis of the test score data set
To estimate the causal effect of a change in STR on score, this example shows how multiple regression can mitigate omitted variable bias. We consider two variables describing student characteristics, english and lunch, which are correlated with STR and are assumed to affect test scores, and we introduce an additional variable, calworks, which is highly correlated with lunch.
# 'calworks' is highly correlated with 'lunch'
cor(CASchools$calworks, CASchools$lunch)
# plot student characteristics against test scores
m <- rbind(c(1, 2), c(3, 0))
graphics::layout(mat = m)
# scatterplots
plot(score ~ english,
data = CASchools,
col = "steelblue",
pch = 20,
xlim = c(0, 100),
cex.main = 0.9,
main = "Percentage of English language learners")
plot(score ~ lunch,
data = CASchools,
col = "steelblue",
pch = 20,
cex.main = 0.9,
main = "Percentage qualifying for reduced price lunch")
plot(score ~ calworks,
data = CASchools,
col = "steelblue",
pch = 20,
xlim = c(0, 100),
cex.main = 0.9,
main = "Percentage qualifying for income assistance")
# correlations between the student characteristics and score
cor(CASchools$score, CASchools$english)
cor(CASchools$score, CASchools$lunch)
cor(CASchools$score, CASchools$calworks)
# consider five different model specifications
install.packages("stargazer")
library(stargazer)
spec1 <- lm(score ~ STR, data = CASchools)
spec2 <- lm(score ~ STR + english, data = CASchools)
spec3 <- lm(score ~ STR + english + lunch, data = CASchools)
spec4 <- lm(score ~ STR + english + calworks, data = CASchools)
spec5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# collect the robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(spec1, type = "HC1"))),
sqrt(diag(vcovHC(spec2, type = "HC1"))),
sqrt(diag(vcovHC(spec3, type = "HC1"))),
sqrt(diag(vcovHC(spec4, type = "HC1"))),
sqrt(diag(vcovHC(spec5, type = "HC1"))))
# generate a LaTeX table with stargazer
stargazer(spec1, spec2, spec3, spec4, spec5,
se = rob_se,
digits = 3,
header = F,
column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)"))
> cor(CASchools$calworks, CASchools$lunch)
[1] 0.7394218
> cor(CASchools$score, CASchools$english)
[1] -0.6441238
> cor(CASchools$score, CASchools$lunch)
[1] -0.868772
> cor(CASchools$score, CASchools$calworks)
[1] -0.6268533
stargazer() produces professional-looking HTML and LaTeX tables that meet scientific standards.
1. Adding the control variables roughly halves the estimated coefficient on STR: other things equal, reducing the student-teacher ratio by one unit is estimated to raise score by about 1 point on average.
2. Adding the student characteristics as controls raises R² and adjusted R² from about 0.049 in spec1 to about 0.773 in spec3/spec5, so these variables are suitable predictors of score.
3. The coefficient on calworks is not statistically significant in the full specification, and adding calworks to spec3 has a negligible effect on the estimated coefficient on STR (and its standard error).