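We have the following data.
The subjects are split into two groups, A and B.
The four rows below are the case counts for four different diseases.
# First, build a data frame: one row per disease, one column per group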
dat <- data.frame(low=c(13,7,21,6),
high=c(77,22,21,71))
# Group A has 66 samples in total and group B has 128
total_no <- c(66,128)
# Start by building a 2 x 2 table from the first row of dat
# low high
# 13 77
# 53 51
tmp <- chisq.test(rbind(dat[1,], total_no-dat[1,]))
# Extract the chi-square statistic and the p-value
tmp$statistic
tmp$p.value
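Before trusting the statistic, it can help to confirm which table actually reaches `chisq.test`. The following is a minimal sketch, not part of the original analysis: the names `cases`, `tab` and `res` and the row labels `cases` / `non_cases` are introduced here for illustration only. It rebuilds the first table as a labelled matrix and reads back the observed counts, expected counts and degrees of freedom from the result object.

```r
dat <- data.frame(low = c(13, 7, 21, 6), high = c(77, 22, 21, 71))
total_no <- c(66, 128)

# Illustrative names: pull the first row out as a named numeric vector
cases <- unlist(dat[1, ])

# Row 1 = cases, row 2 = group total minus cases
tab <- rbind(cases = cases, non_cases = total_no - cases)
tab

res <- chisq.test(tab)
res$observed   # the counts the test actually used
res$expected   # expected counts under independence
res$parameter  # degrees of freedom
```

Each column of `tab` should add up to the corresponding entry of `total_no`, which is a quick check that the subtraction lined up with the right group.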
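# The other three rows could be computed by hand, but let's try a loop instead
# Preallocate the result vectors, one slot per row of dat
k <- rep(NA_real_, nrow(dat))
p <- rep(NA_real_, nrow(dat))
# Loop over the rows of dat
for (i in seq_len(nrow(dat))) {
  a <- chisq.test(rbind(dat[i,], total_no-dat[i,]))
  k[i] <- a$statistic
  p[i] <- a$p.value
}
# Row k holds the chi-square statistics, row p the p-values
results <- rbind(k,p)
results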
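The same loop can also be written without preallocated vectors. This is only an alternative sketch: `test_row` and `results_alt` are hypothetical names that do not appear in the original code, and `sapply` is swapped in for the explicit `for` loop.

```r
dat <- data.frame(low = c(13, 7, 21, 6), high = c(77, 22, 21, 71))
total_no <- c(66, 128)

# Hypothetical helper: run the test for row i of dat and
# return the statistic and p-value as a named vector
test_row <- function(i) {
  cases <- unlist(dat[i, ])
  res <- chisq.test(rbind(cases, total_no - cases))
  c(k = unname(res$statistic), p = res$p.value)
}

# One column per disease, rows named k and p
results_alt <- sapply(seq_len(nrow(dat)), test_row)
results_alt
```

Because every call returns a named vector of length two, `sapply` assembles a matrix with the same layout as `rbind(k, p)`.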
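That gives the final results.
But the story doesn't end here...
The SPSS output did not match the R output:
the two programs reported different chi-square values for the same table.
Why? Why?
Looking for the cause
Was the data entered into R incorrectly?
So I re-entered it, mimicking the SPSS layout,
using the t() function to transpose the table.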
dat <- data.frame(low=c(13,7,21,6),
high=c(77,22,21,71))
total_no <- c(66,128)
# Add the t() transpose at this step
tmp <- chisq.test(t(rbind(dat[1,], total_no-dat[1,])))
tmp$statistic
tmp$p.value
But the result was still the same.
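That outcome is expected: the chi-square statistic treats rows and columns symmetrically, so transposing the table cannot change it. A small sketch of the check follows; the names `tab`, `orig` and `flipped` and the row labels are assumptions made for this illustration.

```r
# First 2 x 2 table from the text, filled column by column
tab <- matrix(c(13, 53, 77, 51), nrow = 2,
              dimnames = list(c("cases", "non_cases"), c("low", "high")))

orig    <- chisq.test(tab)
flipped <- chisq.test(t(tab))

# Both comparisons should be TRUE up to floating-point tolerance
isTRUE(all.equal(unname(orig$statistic), unname(flipped$statistic)))
isTRUE(all.equal(orig$p.value, flipped$p.value))
```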
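Do R and SPSS use different settings?
Checking the R help page for chisq.test turned up a clue.
It turns out something called the Yates correction was behind it (mostly because my statistics knowledge is too weak).
Run R again, this time with the correction turned off (the correct = FALSE argument of chisq.test).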
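The rerun itself is not shown in the text. A minimal sketch of what it presumably looked like, assuming the only change was the `correct` argument of `chisq.test`; the names `tab`, `with_yates` and `without_yates` are illustrative.

```r
dat <- data.frame(low = c(13, 7, 21, 6), high = c(77, 22, 21, 71))
total_no <- c(66, 128)

cases <- unlist(dat[1, ])
tab <- rbind(cases, total_no - cases)

# Default behaviour: continuity correction applied to 2 x 2 tables
with_yates <- chisq.test(tab)

# Correction switched off: plain Pearson chi-square
without_yates <- chisq.test(tab, correct = FALSE)

c(yates   = unname(with_yates$statistic),
  pearson = unname(without_yates$statistic))
```

The uncorrected statistic is the larger of the two, since the correction shrinks every |observed - expected| difference before squaring.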
Bingo! The chi-square value now matches the one from SPSS.
What is the Yates correction?
Reference:
https://www.statisticshowto.datasciencecentral.com/what-is-the-yates-correction/
Why use the Yates correction?
The Yates correction is a correction made to account for the fact that both Pearson’s chi-square test and McNemar’s chi-square test are biased upwards for a 2 x 2 contingency table. An upwards bias tends to make results larger than they should be. If you are creating a 2 x 2 contingency table that uses either of these two tests, the Yates correction is usually recommended, especially if the expected cell frequencies are below 10 (some authors put that figure at 5).
Chi2 tests are biased upwards when used on 2 x 2 contingency tables. The reason is that the statistical Chi2 distribution is continuous and the 2 x 2 contingency table is dichotomous (in other words, it isn’t continuous, there are two variables). All you really need to know is that if your expected cell frequencies are below 10, you probably should be using the Yates correction.
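To make the correction concrete, here is a hand computation for the first table, offered as a sketch rather than a description of R's internals. The names `tab`, `expected`, `pearson` and `yates` are illustrative. The fixed 0.5 used here works because every |observed - expected| difference in this table is larger than 0.5; the `chisq.test` help page notes that the correction is never allowed to exceed the differences themselves.

```r
# First 2 x 2 table from the text, filled column by column
tab <- matrix(c(13, 53, 77, 51), nrow = 2)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Plain Pearson statistic
pearson <- sum((tab - expected)^2 / expected)

# Yates version: shrink each absolute difference by 0.5 before squaring
yates <- sum((abs(tab - expected) - 0.5)^2 / expected)

c(pearson = pearson, yates = yates)

# Compare against chisq.test with and without the correction
all.equal(pearson, unname(chisq.test(tab, correct = FALSE)$statistic))
all.equal(yates,   unname(chisq.test(tab)$statistic))
```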
R's chisq.test applies the Yates correction to 2 x 2 tables by default, which is how this whole story came about.