cor相关性:
cor():
cor(x, y , method = c("pearson", "kendall", "spearman"))
计算x,y的相关性,x,y需要为向量,若为矩阵则计算x列,y列的相关性。method指定相关性计算的方法,默认情况下是Pearson相关性系数。
Pearson,spearman计算方法见笔记本
cor.test():
cor()只给出相关性值,cor.test()同时给出检验的p值。
lm()线性拟合
lm()函数返回拟合结果的对象,可以用summary()函数查看其内容
> test
height weight gender BMI
tom 180 75 male 23.14815
cindy 165 58 female 21.30395
jimmy 175 72 male 23.51020
sam 173 68 male 22.72044
lucy 160 60 female 23.43750
lily 165 55 female 20.20202> result=lm(height~weight,data=test)
> result
Call:
lm(formula = height ~ weight, data = test)
Coefficients:
(Intercept) weight
115.34 0.84
#拟合方程为height=115.34+0.84weight
> summary(result)
Call:
lm(formula = height ~ weight, data = test)
Residuals:
tom cindy jimmy sam Lucy lily
1.6529 0.9336 -0.8270 0.5332 -5.7465 3.4537
Coefficients:
Estimate Std. Error t value Pr(>|t|)#Pr(>|t|)为显著性
(Intercept) 115.3441 12.5825 9.167 0.000786 ***
weight 0.8400 0.1933 4.346 0.012198 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.519 on 4 degrees of freedom
Multiple R-squared: 0.8252, Adjusted R-squared: 0.7815
#拟合优度和矫正后的拟合优度,此项数值越大越好
F-statistic: 18.89 on 1 and 4 DF, p-value: 0.0122
#检验整个方程的p值,<0.05为宜
检验一套数据是否符合正态分布
符合正态分布的数据情况
法一
>data=rnorm(1000)
> hist(data,prob=T)
> lines(density(data))
绘制数据的统计图,大致符合钟形分布即可大致判断符合正态分布
法二
qqnorm(data)
qqline(data)
直线大致满足y=x即可判断为符合正态分布
法三
> shapiro.test(data)
Shapiro-Wilk normality test
data: data
W = 0.99786, p-value = 0.229
#>0.05符合正态分布,<0,05不符合正态分布
不符合正态分布的数据情况
法一
> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> hist(a,breaks=seq(0.5,21,by=1),prob=TRUE)
> lines(density(a),col="blue")#明显看出是偏态分布
> abline(v=median(a),col="red")#添加中值,红色线
> abline(v=mean(a),col="green")#添加平均数,绿色线
> #对于偏态分布,中值比平均数更有意义
法二
> qqnorm(a)
> qqline(a)
明显为偏态分布
法三
> shapiro.test(a)
Shapiro-Wilk normality test
data: a
W = 0.63575, p-value = 1.589e-06
#pvalue<0.05,偏态分布
不符合正态分布的数据,不能使用t检验,应该使用秩和检验。
> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> b=c(rep(2,7),rep(3,5),rep(5,8),8,10,18,25)
#a,b不符合正态分布
> t.test(a,b)
Welch Two Sample t-test
data: a and b
t = -1.2025, df = 44.761, p-value = 0.2355
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.681665 1.181665
sample estimates:
mean of x mean of y
3.666667 5.416667
#不符合正态分布的数据,不能用T检验,此时P value>0.05,无显著差异
> wilcox.test(a,b,exact = FALSE)#秩和检验,非参数检验
Wilcoxon rank sum test with continuity correction
data: a and b
W = 162.5, p-value = 0.008673
alternative hypothesis: true location shift is not equal to 0
#不符合正态分布的数据,用秩和检验,此时P value<0.05,有显著差异
百分比检验
假设我们知道全球流感对死亡率是10%,在美国对调查发现400名流感病人中有51人死亡了,请问美国流感对死亡率是否显著高于全球流感对死亡率。
> prop.test(51,400,p=0.1,alternative = "greater")
1-sample proportions test with continuity correction
data: 51 out of 400, null probability 0.1
X-squared = 3.0625, df = 1, p-value = 0.04006
alternative hypothesis: true p is greater than 0.1
95 percent confidence interval:
0.101422 1.000000
sample estimates:
p
0.1275
卡方检验和fisher精确检验
> data=rbind(c(50,250),c(8,10))
> rownames(data)=c("cig","non-cig")
> colnames(data)=c("qiguanyan","non-qiguanyan")
> chisq.test(data)#卡方检验,注意输入的值直接为数据框
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 7.0225, df = 1, p-value = 0.008049
Warning message:
In chisq.test(data) : Chi-squared approximation may be incorrect
#此时会产生报错,产生此项错误的原因是吸烟患者样本数太少,出现此类报错,可用下文的fisher精确检验做统计检验
> fisher.test(data)#fisher精确检验
Fisher's Exact Test for Count Data
data: data
p-value = 0.007489
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.08452923 0.77224505
sample estimates:
odds ratio
0.25151
方差分析示例
step1:大致查看数据
> cholesterol
trt response
1 1time 3.8612
2 1time 10.3868
3 1time 5.9059
4 1time 3.0609
5 1time 7.7204
6 1time 2.7139
7 1time 4.9243
8 1time 2.3039
9 1time 7.5301
10 1time 9.4123
11 2times 10.3993
12 2times 8.6027
13 2times 13.6320
14 2times 3.5054
15 2times 7.7703
16 2times 8.6266
17 2times 9.2274
18 2times 6.3159
19 2times 15.8258
20 2times 8.3443
21 4times 13.9621
22 4times 13.9606
23 4times 13.9176
24 4times 8.0534
25 4times 11.0432
26 4times 12.3692
27 4times 10.3921
28 4times 9.0286
29 4times 12.8416
30 4times 18.1794
31 drugD 16.9819
32 drugD 15.4576
33 drugD 19.9793
34 drugD 14.7389
35 drugD 13.5850
36 drugD 10.8648
37 drugD 17.5897
38 drugD 8.8194
39 drugD 17.9635
40 drugD 17.6316
41 drugE 21.5119
42 drugE 27.2445
43 drugE 20.5199
44 drugE 15.7707
45 drugE 22.8850
46 drugE 23.9527
47 drugE 21.5925
48 drugE 18.3058
49 drugE 20.3851
50 drugE 17.3071
> boxplot(response~trt,data=cholesterol)
大致可以看出不同处理间y是有差异的。
step2:检查正态分布及方差齐性
> shapiro.test(cholesterol$response)
#检验是否为正态分布,>0.05没有显著差异,为正态分布
Shapiro-Wilk normality test
data: cholesterol$response
W = 0.97722, p-value = 0.4417
> bartlett.test(response~trt,data=cholesterol)
#检查方差齐性,>0.05方差没有显著差异。
Bartlett test of homogeneity of variances
data: response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653
step3方差分析:
> fit<-aov(response~trt,data=cholesterol)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
trt 4 1351.4 337.8 32.43 9.82e-13 ***
Residuals 45 468.8 10.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Pr(>F) 即p值小于0.001,极显著,即处理组间有差异,可以用于事后多重比较
step4:事后多重比较
> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = response ~ trt, data = cholesterol)
$trt
diff lwr upr p adj
2times-1time 3.44300 -0.6582817 7.544282 0.1380949
4times-1time 6.59281 2.4915283 10.694092 0.0003542
drugD-1time 9.57920 5.4779183 13.680482 0.0000003
drugE-1time 15.16555 11.0642683 19.266832 0.0000000
4times-2times 3.14981 -0.9514717 7.251092 0.2050382
drugD-2times 6.13620 2.0349183 10.237482 0.0009611
drugE-2times 11.72255 7.6212683 15.823832 0.0000000
drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
drugE-4times 8.57274 4.4714583 12.674022 0.0000037
drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
#diff:两组平均值之间的差异
#lwr,upr:置信区间的上下端点为95%(默认值)
#p adj:调整后的多个比较的p值。
#如本例,除了2times-1time,2times-1time,drugD-4times没有显著差异,其余组别均有显著差异
step5 图例展示
> library(multcomp)
> par(mar=c(5,4,6,2))
> tuk<-glht(fit,linfct=mcp(trt="Tukey"))
> plot(cld(tuk,level=0.05),col="lightgreen")
有相同字母的组别表示差异不显著