本章内容
描述性统计分析
频数表和列联表
相关系数和协方差
t检验
非参数统计
7.1 描述性统计分析
连续型变量的中心趋势、变化性和分布形状的方法。
> myvars <- c("mpg","hp","wt")
> head(mtcars[myvars])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
7.1.1 方法云集
> myvars <- c("mpg","hp","wt")
> summary(mtcars[myvars])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
sapply(x, FUN, options)
其中的x是你的数据框(或矩阵),FUN为一个任意的函数。如果指定了options,它们将被传递 给FUN。
> mystats <- function(x, na.omit = F){
+ if(na.omit)
+ x <- x[!is.na(x)]
+ m <- mean(x)
+ n <- length(x)
+ s <- sd(x)
+ skew <- sum((x-m)^3/s^3)/n
+ kurt <- sum((x-m)^4/s^4)/n-3
+ return(c(n=n,mean = m, stdev = s, skew = skew, kurtoside = kurt))
+ }
> myvars <- c("mpg","hp","wt")
> sapply(mtcars[myvars], mystats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtoside -0.372766 -0.1355511 -0.02271075
7.1.2 更多方法
若干用户贡献包都提供了计算描述性统计量的函数,其中包括Hmisc、pastecs和psych。 由于这些包并未包括在基础安装中,所以需要在首次使用之前先进行安装。
Hmisc包中的describe()函数可返回变量和观测的数量、缺失值和唯一值的数目、平均值、 分位数,以及五个最大的值和五个最小的值。
pastecs包中有一个名为stat.desc()的函数,它可以计算种类繁多的描述性统计量。
psych包也拥有一个名为describe()的函数,它可以计算非缺失值的数量、 平均数、标准差、中位数、截尾均值、绝对中位差、最小值、最大值、值域、偏度、峰度和平均 值的标准误。
install.packages(c("Hmisc", "pastecs", "psych"))
library(Hmisc)
myvars <- c("mpg","hp","wt")
describe(mtcars[myvars])
mtcars[myvars]
3 Variables 32 Observations
-------------------------------------------------------------------------------------------------------------
mpg
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
32 0 25 0.999 20.09 6.796 12.00 14.34 15.43 19.20 22.80 30.09
.95
31.30
lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
-------------------------------------------------------------------------------------------------------------
hp
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
32 0 22 0.997 146.7 77.04 63.65 66.00 96.50 123.00 180.00 243.50
.95
253.55
lowest : 52 62 65 66 91, highest: 215 230 245 264 335
-------------------------------------------------------------------------------------------------------------
wt
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90
32 0 29 0.999 3.217 1.089 1.736 1.956 2.581 3.325 3.610 4.048
.95
5.293
lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
-------------------------------------------------------------------------------------------------------------
> library(pastecs)
> myvars <- c("mpg","hp","wt")
> stat.desc(mtcars[myvars])
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285
> library(psych)
载入程辑包:‘psych’
The following object is masked from ‘package:Hmisc’:
describe
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
Warning message:
程辑包‘psych’是用R版本3.6.2 来建造的
> myvars <- c("mpg","hp","wt")
> describe(mtcars[myvars])
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
7.1.3 分组计算描述性统计量
使用aggregate()分组获取描述性统计量。(5.6.2节)
> myvars <- c("mpg","hp","wt")
> aggregate(mtcars[myvars], by = list(am=mtcars$am), mean)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
> aggregate(mtcars[myvars], by = list(am=mtcars$am), sd)
am mpg hp wt
1 0 3.833966 53.90820 0.7774001
2 1 6.166504 84.06232 0.6169816
by(data, INDICES, FUN)
使用by()分组计算描述性统计量
> dstats <- function(x) sapply(x, mystats)
> myvars <- c("mpg","hp","wt")
> by(mtcars[myvars], mtcars$am, dstats)
mtcars$am: 0
mpg hp wt
n 19.00000000 19.00000000 19.0000000
mean 17.14736842 160.26315789 3.7688947
stdev 3.83396639 53.90819573 0.7774001
skew 0.01395038 -0.01422519 0.9759294
kurtoside -0.80317826 -1.20969733 0.1415676
---------------------------------------------------------------------------------
mtcars$am: 1
mpg hp wt
n 13.00000000 13.0000000 13.0000000
mean 24.39230769 126.8461538 2.4110000
stdev 6.16650381 84.0623243 0.6169816
skew 0.05256118 1.3598859 0.2103128
kurtoside -1.45535200 0.5634635 -1.1737358
7.1.4 分组计算的扩展
doBy包和psych包也提供了分组计算描述性统计量的函数。同样地,它们未随基本安装发布, 必须在首次使用前进行安装。
#使用doBy包中的summaryBy()分组计算概述统计量
> install.packages("doBy")
> library(doBy)
> summaryBy(mpg+hp+wt~am, data = mtcars, FUN = mystats)
am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtoside hp.n hp.mean hp.stdev hp.skew hp.kurtoside wt.n
1 0 19 17.14737 3.833966 0.01395038 -0.8031783 19 160.2632 53.90820 -0.01422519 -1.2096973 19
2 1 13 24.39231 6.166504 0.05256118 -1.4553520 13 126.8462 84.06232 1.35988586 0.5634635 13
wt.mean wt.stdev wt.skew wt.kurtoside
1 3.768895 0.7774001 0.9759294 0.1415676
2 2.411000 0.6169816 0.2103128 -1.1737358
> describeBy(mtcars[myvars], list(am = mtcars$am))
Descriptive statistics by group
am: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 19 17.15 3.83 17.30 17.12 3.11 10.40 24.40 14.00 0.01 -0.80 0.88
hp 2 19 160.26 53.91 175.00 161.06 77.10 62.00 245.00 183.00 -0.01 -1.21 12.37
wt 3 19 3.77 0.78 3.52 3.75 0.45 2.46 5.42 2.96 0.98 0.14 0.18
---------------------------------------------------------------------------------
am: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 13 24.39 6.17 22.80 24.38 6.67 15.00 33.90 18.90 0.05 -1.46 1.71
hp 2 13 126.85 84.06 109.00 114.73 63.75 52.00 335.00 283.00 1.36 0.56 23.31
wt 3 13 2.41 0.62 2.32 2.39 0.68 1.51 3.57 2.06 0.21 -1.17 0.17
#describe.by()函数不允许指定任意函数,所以它的普适性较低。
7.2 频数表和列联表
风湿性关节炎新疗法的双盲临床实验的结果。治疗情况(安慰剂治疗、用药治疗)、性别(男性、女性)和改善情况(无改善、一定程度 的改善、显著改善)均为类别型因子。
library(vcd)
> head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
7.2.1 生成频数表
最重要的函数已列于表7-1中。
1. 一维列联表
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
None Some Marked
42 14 28
> prop.table(mytable)
Improved
None Some Marked
0.5000000 0.1666667 0.3333333
> prop.table(mytable)*100
Improved
None Some Marked
50.00000 16.66667 33.33333
2. 二维列联表
mytable <- table(A, B)
其中的A是行变量,B是列变量。除此之外,xtabs()函数还可使用公式风格的输入创建列联表。
mytable <- xtabs(~A+B, data = mydata)
其中的mydata是一个矩阵或数据框。总的来说,要进行交叉分类的变量应出现在公式的右侧(即~ 符号的右方),以+作为分隔符。若某个变量写在公式的左侧,则其为一个频数向量(在数据已经 被表格化时很有用)。