Reference book: R in Action (《R语言实战》)
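Contents
1: Descriptive statistics functions // maximum, minimum and similar summary values; here, the stat.desc() function
2: Descriptive statistics by group // by()
3: Contingency and frequency tables // table() and related tabulation functions
4: Tests of independence // test whether variables are related (that is, whether they are independent)
5: Correlation // measure the strength of the association between variables
6: Tests of group differences // t-tests, including the dependent (paired) t-test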
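1, 2: Descriptive statistics functions (overall and by group)
①: Only the stat.desc() function from the pastecs package is covered here; by() applies it to each group
by(data, grouping variable, summary function)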
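library(pastecs)                  # provides stat.desc()
myvars <- c("mpg", "hp", "wt")
data <- head(mtcars[myvars])      # first six rows only, to keep the output short
myam <- head(mtcars$am)           # transmission type for the same six rows
by(data, myam, stat.desc)         # run stat.desc() separately for each value of am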
myam: 0
mpg hp wt
nbr.val 3.00000000 3.0000000 3.00000000
nbr.null 0.00000000 0.0000000 0.00000000
nbr.na 0.00000000 0.0000000 0.00000000
min 18.10000000 105.0000000 3.21500000
max 21.40000000 175.0000000 3.46000000
range 3.30000000 70.0000000 0.24500000
sum 58.20000000 390.0000000 10.11500000
median 18.70000000 110.0000000 3.44000000
mean 19.40000000 130.0000000 3.37166667
SE.mean 1.01488916 22.5462488 0.07854581
CI.mean.0.95 4.36671560 97.0086788 0.33795535
var 3.09000000 1525.0000000 0.01850833
std.dev 1.75783958 39.0512484 0.13604534
coef.var 0.09061029 0.3003942 0.04034958
myam: 1
mpg hp wt
nbr.val 3.00000000 3.00000000 3.0000000
nbr.null 0.00000000 0.00000000 0.0000000
nbr.na 0.00000000 0.00000000 0.0000000
min 21.00000000 93.00000000 2.3200000
max 22.80000000 110.00000000 2.8750000
range 1.80000000 17.00000000 0.5550000
sum 64.80000000 313.00000000 7.8150000
median 21.00000000 110.00000000 2.6200000
mean 21.60000000 104.33333333 2.6050000
SE.mean 0.60000000 5.66666667 0.1603901
CI.mean.0.95 2.58159164 24.38169880 0.6901031
var 1.08000000 96.33333333 0.0771750
std.dev 1.03923048 9.81495458 0.2778039
coef.var 0.04811252 0.09407305 0.1066426
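To make explicit what by() is doing with stat.desc(), here is a minimal base-R sketch that computes a few of the same statistics per group. mystats() and dstats() are hypothetical helpers invented for this illustration (they are not part of pastecs), and only n, mean and standard deviation are computed.

```r
# Same six rows of mtcars as in the notes above
vars <- c("mpg", "hp", "wt")
data <- head(mtcars[vars])
myam <- head(mtcars$am)

# Hypothetical helpers (not from pastecs): a small subset of what stat.desc() reports
mystats <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))
dstats  <- function(d) sapply(d, mystats)   # apply mystats() to every column

# split() breaks the data frame up by group; lapply() summarises each piece
res <- lapply(split(data, myam), dstats)
res[["0"]]   # rows n/mean/sd, columns mpg/hp/wt, for am == 0
res[["1"]]   # the same summary for am == 1
```

The mean and sd rows should agree with the mean and std.dev rows of the stat.desc() output for each group.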
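3: Contingency and frequency tables
(1) Steps for building a contingency table
① Build a one-way or multiway table --> ② Convert counts to proportions --> ③ Add marginal sums and flatten the table
① Build the table
table(A, B) or xtabs(~A+B, data=)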
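library(vcd)                                                 # provides the Arthritis data set
mytable <- table(Arthritis$Improved)                         # one-way frequency table
mytable <- xtabs(~Treatment+Improved, data=Arthritis)        # two-way table (replaces the one-way table)
mytable1 <- xtabs(~Treatment+Sex+Improved, data=Arthritis)   # three-way table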
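As a quick check that table() and xtabs() are two interfaces to the same counts, the sketch below builds one two-way table both ways. It uses the built-in mtcars data instead of Arthritis purely so that it runs without the vcd package; that substitution is my choice for illustration, not something from the book.

```r
# table() takes vectors; xtabs() takes a formula plus a data frame
t1 <- table(cyl = mtcars$cyl, gear = mtcars$gear)
t2 <- xtabs(~ cyl + gear, data = mtcars)

t1              # cylinders in rows, gears in columns
all(t1 == t2)   # both calls give identical cell counts
```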
② Convert counts to proportions
prop.table(mytable) # each cell as a proportion of the grand total
         Improved
Treatment       None       Some     Marked
  Placebo 0.34523810 0.08333333 0.08333333
  Treated 0.15476190 0.08333333 0.25000000
prop.table(mytable,1) # row proportions (each row sums to 1)
prop.table(mytable,2) # column proportions (each column sums to 1)
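The effect of the margin argument is easiest to see by checking what sums to 1. In this sketch the Treatment x Improved counts are re-entered by hand (they are the row totals of the three-way table shown in these notes), so the code does not need the vcd package; the name tab is just an illustrative choice.

```r
# Treatment x Improved counts, typed in by hand from the tables in these notes
tab <- as.table(matrix(c(29, 7,  7,
                         13, 7, 21),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

sum(prop.table(tab))          # no margin: all cells together sum to 1
rowSums(prop.table(tab, 1))   # margin 1: every row sums to 1
colSums(prop.table(tab, 2))   # margin 2: every column sums to 1
```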
③ Add marginal sums with addmargins()
addmargins(prop.table(mytable)) # add a Sum row and a Sum column
         Improved
Treatment       None       Some     Marked        Sum
  Placebo 0.34523810 0.08333333 0.08333333 0.51190476
  Treated 0.15476190 0.08333333 0.25000000 0.48809524
  Sum     0.50000000 0.16666667 0.33333333 1.00000000
addmargins(prop.table(mytable),1) # add only a Sum row (totals over Treatment)
addmargins(prop.table(mytable),2) # add only a Sum column (totals over Improved)
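A small sketch of how addmargins() combines with prop.table(). As an assumption for illustration, the Treatment x Improved counts are typed in by hand from the tables in these notes so that the code runs without vcd; the variable names are arbitrary.

```r
# Treatment x Improved counts, typed in by hand from the tables in these notes
tab <- as.table(matrix(c(29, 7,  7,
                         13, 7, 21),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

am <- addmargins(tab)                      # Sum row and Sum column on the raw counts
am
rp <- addmargins(prop.table(tab, 1), 2)    # row proportions plus a Sum column
rp                                         # the Sum column is 1 for every row
```

Adding the margin only along the dimension that was not normalised keeps the table readable: row proportions get a Sum column, column proportions get a Sum row.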
③ (continued) Flatten a multiway table into a compact layout with ftable()
mytable1 <-xtabs(~Treatment+Sex+Improved, data=Arthritis)
mytable1
, , Improved = None
         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7
, , Improved = Some
         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2
, , Improved = Marked
         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5
ftable(mytable1)
Improved None Some Marked
Treatment Sex
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
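Male 7 2 5
ftable(addmargins(prop.table(mytable1, c(1,2)), 3))
# Proportions are computed within each Treatment x Sex combination, a Sum margin is added over the third variable (Improved), and the result is flattened with ftable()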
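The sketch below shows the difference between the default print of a three-way table and its ftable() form. It uses the built-in mtcars data (my substitution, so that the code runs without vcd); tab3 and ft are illustrative names.

```r
# Three-way table: transmission (am) x gears x cylinders
tab3 <- xtabs(~ am + gear + cyl, data = mtcars)
tab3               # default print: one am-by-gear slice per value of cyl

ft <- ftable(tab3) # flattened: am and gear as row variables, cyl as columns
ft
dim(ft)            # 6 rows (2 am levels x 3 gear levels), 3 columns (cyl)
```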
4: Tests of independence
①: Only fisher.test() (Fisher's exact test) is covered here // tests whether two categorical variables are independent
mytable1 <- xtabs(~Treatment+Improved, data=Arthritis)
fisher.test(mytable1)
Fisher's Exact Test for Count Data
data: mytable1
p-value = 0.001393
alternative hypothesis: two.sided
# p < 0.01, so independence is rejected: Treatment and Improved are associated
mytable1 <- xtabs(~Improved+Sex, data=Arthritis)
fisher.test(mytable1)
Fisher's Exact Test for Count Data
data: mytable1
p-value = 0.1094
alternative hypothesis: two.sided
# p > 0.05, so there is no evidence that Improved and Sex are associated
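The first test above can be reproduced without vcd by typing in the Treatment x Improved counts by hand (they are the totals of the three-way table shown earlier). The sketch also runs chisq.test(), the chi-square test of independence, as a common alternative; that test is not used in these notes and is included only for comparison.

```r
# Treatment x Improved counts, typed in by hand from the tables in these notes
tab <- as.table(matrix(c(29, 7,  7,
                         13, 7, 21),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

ft <- fisher.test(tab)   # exact test, as in the notes
ft$p.value               # compare with the p-value reported above

cs <- chisq.test(tab)    # chi-square test (not from the notes, shown for comparison)
cs$statistic             # test statistic
cs$parameter             # degrees of freedom: (2 - 1) * (3 - 1) = 2
cs$p.value
```

Both tests ask the same question (are the row and column variables independent?), so for a table like this their p-values would be expected to lead to the same conclusion.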
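5: Correlation
(1) Idea: the strength and direction of the relationship between variables, expressed as a signed number
① cor(data, method=)
The default method is pearson (degree of linear association); spearman (rank-order correlation) is also available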
states <-state.x77[,1:6]
cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
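To see what the method argument changes, the sketch below computes both coefficients on the same data and checks that Spearman's coefficient is simply Pearson's coefficient applied to the ranks. The variable names are illustrative.

```r
states <- state.x77[, 1:6]

r_pearson  <- cor(states)                        # default: method = "pearson"
r_spearman <- cor(states, method = "spearman")   # rank-based correlation

r_pearson["Murder", "Illiteracy"]    # same entry as in the matrix above
r_spearman["Murder", "Illiteracy"]

# Spearman = Pearson computed on the ranks of each column
ranks <- apply(states, 2, rank)
all.equal(r_spearman, cor(ranks))
```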
(2) Significance tests for correlations // not covered in detail here
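Although the notes skip this topic, a minimal sketch with base R's cor.test() shows the general shape of such a test. The choice of the Illiteracy and Murder columns is mine, for illustration only.

```r
states <- state.x77[, 1:6]

# Test H0: the true correlation between illiteracy and murder rate is zero
ct <- cor.test(states[, "Illiteracy"], states[, "Murder"])   # Pearson by default

ct$estimate    # the sample correlation, the same value cor() returns
ct$parameter   # degrees of freedom: n - 2
ct$p.value     # a small p-value means the zero-correlation hypothesis is rejected
```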
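6: Tests of group differences
(1) Idea: check whether two groups have the same value on a variable, and if not, how large the difference is.
(2) Methods
①: Independent t-test // compares the means of two independent groups; assumes the data are approximately normally distributed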
library(MASS)
t.test(Prob~So , data=UScrime)
Welch Two Sample t-test
data: Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
0.03851265 0.06371269
Since p < 0.001, the mean probability of imprisonment (Prob) differs between Southern (So = 1) and non-Southern (So = 0) states
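The t value reported by t.test() here is the Welch statistic, t = (mean0 - mean1) / sqrt(var0/n0 + var1/n1). The sketch below recomputes it by hand as a sanity check; x0, x1 and t_manual are illustrative names.

```r
library(MASS)   # provides the UScrime data set

tt <- t.test(Prob ~ So, data = UScrime)   # Welch test (unequal variances) by default

# Recompute the Welch statistic from the two groups
x0 <- UScrime$Prob[UScrime$So == 0]       # non-Southern states
x1 <- UScrime$Prob[UScrime$So == 1]       # Southern states
t_manual <- (mean(x0) - mean(x1)) /
  sqrt(var(x0) / length(x0) + var(x1) / length(x1))

c(t.test = unname(tt$statistic), manual = t_manual)   # the two values should agree
```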
②: Dependent (paired) t-test // for two sets of observations that are related to each other
with(UScrime, t.test(U1,U2,paired=TRUE))
Paired t-test
data: U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
57.67003 65.30870
sample estimates:
mean of the differences
61.48936
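The mean difference is large (about 61.5) and p < 2.2e-16, so the unemployment rate differs markedly between younger (U1, ages 14-24) and older (U2, ages 35-39) urban males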
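A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which the following sketch checks. The object names are illustrative.

```r
library(MASS)   # provides the UScrime data set

paired    <- with(UScrime, t.test(U1, U2, paired = TRUE))
onesample <- with(UScrime, t.test(U1 - U2))   # H0: mean difference is 0

unname(paired$statistic)      # t statistic of the paired test
unname(onesample$statistic)   # same value from the one-sample test
mean(UScrime$U1 - UScrime$U2) # the "mean of the differences" in the output above
```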
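③: Wilcoxon rank-sum test for independent groups, wilcox.test() (difference between two groups on one variable)
// a nonparametric alternative, typically used when the assumptions of the t-test are not met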
wilcox.test(Prob~So, data=UScrime)
data: Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0
p < 0.001, so the probability of imprisonment differs between Southern and non-Southern states
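The W statistic is based on ranks rather than on the raw values: rank all observations together, sum the ranks of the first group, and subtract n1 * (n1 + 1) / 2. The sketch below recomputes it by hand; suppressWarnings() is used only as a precaution in case R warns about tied values, and the variable names are illustrative.

```r
library(MASS)   # provides the UScrime data set

wt <- suppressWarnings(wilcox.test(Prob ~ So, data = UScrime))

r  <- rank(UScrime$Prob)    # ranks over both groups combined
g0 <- UScrime$So == 0       # first group: non-Southern states
n0 <- sum(g0)
W_manual <- sum(r[g0]) - n0 * (n0 + 1) / 2

c(wilcox.test = unname(wt$statistic), manual = W_manual)   # should agree
```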
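④: Kruskal-Wallis test for independent groups, kruskal.test() // difference among more than two groups on one variable
states <- data.frame(state.region, state.x77)   # the formula interface needs a data frame that includes the region factor
kruskal.test(Illiteracy ~ state.region, data=states)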
Kruskal-Wallis rank sum test
data: Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
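Illiteracy rates are not the same across the regions of the country; a follow-up analysis can examine which regions differ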
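A self-contained sketch of the test above, with the per-region medians added to help read the result. The tapply() summary is my addition for illustration and is not part of the original notes.

```r
# Combine the region factor with the numeric state data
states <- data.frame(state.region, state.x77)

kt <- kruskal.test(Illiteracy ~ state.region, data = states)
kt$statistic   # Kruskal-Wallis chi-squared
kt$parameter   # degrees of freedom: number of regions - 1
kt$p.value

# Median illiteracy rate per region, to see where the differences lie
tapply(states$Illiteracy, states$state.region, median)
```

The test only says that at least one region differs; it does not say which one, so the medians (or a pairwise follow-up test) are needed for interpretation.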