R语言线性统计模型实例
- 毫无关系的前言
- 正文
- 变量选择
- 检验技术
- 变量选择准则
- 附录(代码)
毫无关系的前言
这学期实在是太快了,一转眼又快到了令人疼苦的考试月,我是很讨厌大学这种考试的,一方面我发现确实考不怎么过别人,另一方面我很讨厌临时刷题,临时积极地问问题、熬夜。但在大环境下,我熬的夜确实比之前多了,但效率不见得高。
这是本学期最后一篇文章了,作为学生,我常感叹自己能力不足,自从我在CSDN发了自己的第一篇文章后,对于一些问题我开始变得好奇起来,我能否做一个这样的东西?我能否独立将其源代码写出来,这也是我写作的动力。当然,里面不乏一些是我平时的作业,要删了可惜,不如融入互联网记忆,如果有人因为看了而少走一些弯路,那我觉得是很有意义的事情。
正文
这个其实是我作业的另一个部分,对于上一个部分可以参考我上一篇文章,是关于一个最简单的线性回归模型做的一系列估计、检验、预测之类的。这一篇文章我们刚开始并不知道用什么线性模型好?甚至其实我所选的数据集,用线性回归模型并不合适,当然,我们更应该看中得到这个不好结果的过程里,我们所作的工作。
首先,拿到一个数据集,先了解一下它的背景信息
rm(list=ls())
library("datasets")
data("swiss")
变量选择
检验技术
说通俗一点是你要留下一部分数据(约10%,视具体情况而定)对你得到的结果进行检验,在这里testset就是检验集,我们拟合用的数据放在fitset里头,当然对于检验问题一个比较出名的有交叉检验技术,但这里我们并不会用,因为样本量不大。
testset=swiss[1:5,]
fitset=swiss[-(1:5),]
变量选择准则
以下代码是没有用R自带的函数,共采纳了四个准则,分别是
index=list(c(2),c(3),c(4),c(5),c(2,3),c(2,4),c(2,5),c(3,4),c(3,5)
,c(4,5),c(2,3,4),c(2,3,5),c(2,4,5),c(3,4,5)
,c(2,3,4,5))
y=as.vector(fitset$Fertility)
x_temp=c(1:42)*0+1
x=as.matrix(fitset[,unlist(index[15])])
x=matrix(c(x_temp,x),ncol=as.numeric(length(unlist(index[15]))+1))
beta=solve(t(x)%*%x)%*%t(x)%*%y
y_fit=x%*%beta
sigma_estimate_square=sum(y_fit*y_fit)/(42-5)
c_1=c()
c_2=c()
c_3=c()
c_4=c()
for(i in 1:15)
{
q=as.numeric(length(unlist(index[i]))+1)
x=as.matrix(fitset[,unlist(index[i])])
x=matrix(c(x_temp,x),ncol=q)
# 参数计算
beta=solve(t(x)%*%x)%*%t(x)%*%y
# 拟合值
y_fit=x%*%beta
# RSS_q准则
RSS_q=sum(y_fit*y_fit)/(42-length(unlist(index[i]))-1)
# Cp准则
Cp=RSS_q/sigma_estimate_square-42+2*q
# AIC,BIC
AIC=42*log(RSS_q)+2*q
BIC=42*log(RSS_q)+2*q*log(42)
c_1=c(c_1,RSS_q)
c_2=c(c_2,Cp)
c_3=c(c_3,AIC)
c_4=c(c_4,BIC)
}
matrix(data=c(c_1,c_2,c_3,c_4),ncol = 4,dimnames =
list(index,c("RSS_q","Cp","AIC","BIC")))
cat("under the criterion of RSS_q, choose model",which.min(c_1))
cat("under the criterion of Cp, choose model",which.min(c_2))
cat("under the criterion of AIC, choose model",which.min(c_3))
cat("under the criterion of BIC, choose model",which.min(c_4))
输出结果如下:
> cat("under the criterion of RSS_q, choose model",which.min(c_1))
under the criterion of RSS_q, choose model 5
> cat("under the criterion of Cp, choose model",which.min(c_2))
under the criterion of Cp, choose model 5
> cat("under the criterion of AIC, choose model",which.min(c_3))
under the criterion of AIC, choose model 5
> cat("under the criterion of BIC, choose model",which.min(c_4))
under the criterion of BIC, choose model 5
都指向同一个模型
在此模型下,很容易求得模型参数的估计值,残差分析,异常点检验同第一部分。由于只有一个因变量,故不用考虑多重共线线性
参数估计结果如下:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-33.763 -5.399 1.681 7.228 16.120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.95653 2.31929 27.145 < 2e-16 ***
x 0.13713 0.03983 3.442 0.00136 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.77 on 40 degrees of freedom
Multiple R-squared: 0.2286, Adjusted R-squared: 0.2093
F-statistic: 11.85 on 1 and 40 DF, p-value: 0.001364
(代码放最后)
残差分析结果如下:
可以看到有3个点的残差不在[-2,2]的范围内,总体上看分布不均匀,有上升趋势,可能需要进行数据变换。利用R自带的boxcox变换函数,我们得到
a=boxcox(model)
a$x[which.max(a$y)]
# 结果(求变换用的λ)
[1] 2
也即建立新的模型如下:(model2)
重新拟合,得到的新的参数的估计值和新的残差图为:
Call:
lm(formula = (y * y - 1)/2 ~ x)
Residuals:
Min 1Q Median 3Q Max
-1824.63 -361.60 77.48 476.25 1269.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2002.027 148.315 13.50 < 2e-16 ***
x 10.265 2.547 4.03 0.000243 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 688.4 on 40 degrees of freedom
Multiple R-squared: 0.2887, Adjusted R-squared: 0.2709
F-statistic: 16.24 on 1 and 40 DF, p-value: 0.0002432
可以看到,经过Box-cox变换后,残差图并没有明显的改善,但残差确实会有所改善。但是R-squared直接下降到了0.2887,这是不可接受的。
model2 预测如下:
fit1=predict(model,data.frame(x=as.vector(testset$Infant.Mortality)))
a=predict(model2,data.frame(x=as.vector(testset$Infant.Mortality)))
fit2=sqrt(a*2+1)
real=as.vector(testset$Fertility)
# 结果
real: 80.2 83.1 92.5 85.8 76.9
fit1:
fit lwr upr
1 66.00076 43.93625 88.06526 T
2 66.00076 43.93625 88.06526 T
3 65.72650 43.65058 87.80242 F
4 65.74022 43.66490 87.81554 T
5 65.78135 43.70781 87.85490 T
fit2:
fit lwr upr
1 66.78921 40.48244 85.33915 T
2 66.78921 40.48244 85.33915 T
3 66.48113 39.95385 85.10683 F
4 66.49657 39.98049 85.11844 F
5 66.54286 40.06028 85.15327 T
(T代表真实值落入预测区间,F代表真实值没有落入预测区间)
可以看到,效果是相对的差,但是有一个地方值得注意,经过Box-Cox变换后的模型预测效果并没有更好,也就是说,Boxcox变换对于有些模型来讲,并没有起到效果,甚至是负效果。
我们进行前面进行变量选择出来的结果并不好,我们尝试利用全模型从另一个角度出发选择新的模型
我们不妨试一下用最简单的全模型进行拟合。(兜兜转转又回到起点):model3
参数估计如下:
Call:
lm(formula = swiss$Fertility ~ swiss$Agriculture + swiss$Examination +
swiss$Education + swiss$Catholic + swiss$Infant.Mortality)
Residuals:
Min 1Q Median 3Q Max
-15.2743 -5.2617 0.5032 4.1198 15.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
swiss$Agriculture -0.17211 0.07030 -2.448 0.01873 *
swiss$Examination -0.25801 0.25388 -1.016 0.31546
swiss$Education -0.87094 0.18303 -4.758 2.43e-05 ***
swiss$Catholic 0.10412 0.03526 2.953 0.00519 **
swiss$Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
经过计算,预测结果较前面误差小点,可以看到,变量 Agriculture 和 Examination不显著,选择将其去掉,得到新的模型:model4
无情参数估计:
Call:
lm(formula = swiss$Fertility ~ swiss$Education + swiss$Catholic +
swiss$Infant.Mortality)
Residuals:
Min 1Q Median 3Q Max
-14.4781 -5.4403 -0.5143 4.1568 15.1187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.67707 7.91908 6.147 2.24e-07 ***
swiss$Education -0.75925 0.11680 -6.501 6.83e-08 ***
swiss$Catholic 0.09607 0.02722 3.530 0.00101 **
swiss$Infant.Mortality 1.29615 0.38699 3.349 0.00169 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.505 on 43 degrees of freedom
Multiple R-squared: 0.6625, Adjusted R-squared: 0.639
F-statistic: 28.14 on 3 and 43 DF, p-value: 3.15e-10
这个模型看上去不错,进行预测看一下结果如何:
> fit4
fit lwr upr
1 69.29743 5.5818929 133.0130
2 78.76860 15.0530576 142.4841
3 80.03561 16.3200699 143.7511
4 72.91831 9.2027734 136.6338
5 64.48474 0.7692015 128.2003
> real
[1] 80.2 83.1 92.5 85.8 76.9
预测结果较前面误差小,但预测区间范围过大
接下来我们对model3进行岭估计:
library(glmnet)
x=as.matrix(fitset[,unlist(index[31])])
r1<-glmnet(x=x,y=y,family = "gaussian",alpha = 0)
plot(r1,xvar="lambda")
#用交叉验证得到lambda
r1cv<-cv.glmnet(x=x,y=y,family="gaussian",alpha=0,nfolds = 10)
plot(r1cv)
#取误差平方和最小时的λ
model5<-glmnet(x=x,y=y,family = "gaussian",alpha = 0,lambda = r1cv$lambda.min)
rcoef(rimin)
结果如下:
> model5
Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0, lambda = r1cv$lambda.min)
Df %Dev Lambda
1 5 71.6 0.8288
> coef(model5)
6 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 54.42787574
Agriculture -0.03983483
Examination -0.02877502
Education -0.73958965
Catholic 0.09934750
Infant.Mortality 1.05449691
但实际上预测的结果仍然不如意,我只能说这个数据集确实不适合用线性统计模型来做,或者说这个数据集选的是真的不好。
附录(代码)
rm(list=ls())
library("datasets")
library("MASS")
data("swiss")
testset=swiss[1:5,]
fitset=swiss[-(1:5),]
# 变量选择
index=list(c(2),c(3),c(4),c(5),c(6),c(2,3),c(2,4),c(2,5),c(2,6),c(3,4),c(3,5)
,c(3,6),c(4,5),c(4,6),c(5,6),c(2,3,4),c(2,3,5),c(2,4,5),c(3,4,5),
c(2,3,6),c(2,4,6),c(2,5,6),c(3,4,6),c(3,5,6),c(4,5,6)
,c(2,3,4,5),c(2,3,4,6),c(2,3,5,6),c(2,4,5,6),c(3,4,5,6)
,c(2,3,4,5,6))
y=as.vector(fitset$Fertility)
x_temp=c(1:42)*0+1
x=as.matrix(fitset[,unlist(index[31])])
x=matrix(c(x_temp,x),ncol=as.numeric(length(unlist(index[31]))+1))
beta=solve(t(x)%*%x)%*%t(x)%*%y
y_fit=x%*%beta
sigma_estimate_square=sum(y_fit*y_fit)/(42-6)
c_1=c()
c_2=c()
c_3=c()
c_4=c()
for(i in 1:31)
{
q=as.numeric(length(unlist(index[i]))+1)
x=as.matrix(fitset[,unlist(index[i])])
x=matrix(c(x_temp,x),ncol=q)
# 参数计算
beta=solve(t(x)%*%x)%*%t(x)%*%y
# 拟合值
y_fit=x%*%beta
# RSS_q准则
RSS_q=sum(y_fit*y_fit)/(42-length(unlist(index[i]))-1)
# Cp准则
Cp=RSS_q/sigma_estimate_square-42+2*q
# AIC,BIC
AIC=42*log(RSS_q)+2*q
BIC=42*log(RSS_q)+2*q*log(42)
c_1=c(c_1,RSS_q)
c_2=c(c_2,Cp)
c_3=c(c_3,AIC)
c_4=c(c_4,BIC)
}
matrix(data=c(c_1,c_2,c_3,c_4),ncol = 4,dimnames =
list(index,c("RSS_q","Cp","AIC","BIC")))
cat("under the criterion of RSS_q, choose model",which.min(c_1))
cat("under the criterion of Cp, choose model",which.min(c_2))
cat("under the criterion of AIC, choose model",which.min(c_3))
cat("under the criterion of BIC, choose model",which.min(c_4))
x=as.vector(fitset[,unlist(index[4])])
model=lm(y~x)
summary(model)
plot(y,rstudent(model),ylim = c(-4,4),cex=0.8,xlab = "y")
abline(h=c(-2,0,2),lty=2)
a=boxcox(model)
a$x[which.max(a$y)]
## 得到λ=2
model2=lm((y*y-1)/2~x)
summary(model2)
plot((y*y-1)/2,rstudent(model2),ylim = c(-4,4),cex=0.8,xlab = "y")
abline(h=c(-2,0,2),lty=2)
predict(model,data.frame(x=as.vector(testset$Infant.Mortality)),interval = "prediction")
a=predict(model2,data.frame(x=as.vector(testset$Infant.Mortality)),interval = "prediction")
fit2=sqrt(a*2+1)
real=as.vector(testset$Fertility)
model3=lm(swiss$Fertility~swiss$Agriculture+swiss$Examination+swiss$Education
+swiss$Catholic+swiss$Infant.Mortality)
summary(model3)
model4=lm(swiss$Fertility~swiss$Education+swiss$Catholic+swiss$Infant.Mortality)
summary(model4)
# predict(model4,data.frame(testset[,c(3,4,5)]))
fit4=model4$coefficients[1]+model4$coefficients[2]*testset$Education+
model4$coefficients[3]*testset$Catholic+model4$coefficients[4
]*testset$Infant.Mortality
x=as.matrix(fitset[,unlist(index[31])])
x=matrix(c(x_temp,x),ncol=6)
y_fit=as.vector(predict(model4,swiss))[6:47]
sigma.estimate=sqrt(sum((y-y_fit)*(y-y_fit))/37)
x.pred=as.vector(testset$Fertility)
x.pred=c(1,x.pred)
temp=t(x.pred)%*%solve(t(x)%*%x)%*%x.pred
temp2=qt(1-0.05/2,df=37)*sigma.estimate*sqrt(1+temp)
temp2=as.numeric(temp2)
fit4=data.frame(fit=fit4,lwr=fit4-temp2,upr=fit4+temp2)
library(glmnet)
x=as.matrix(fitset[,unlist(index[31])])
r1<-glmnet(x=x,y=y,family = "gaussian",alpha = 0)
plot(r1,xvar="lambda")
#用交叉验证得到lambda
r1cv<-cv.glmnet(x=x,y=y,family="gaussian",alpha=0,nfolds = 10)
plot(r1cv)
#取误差平方和最小时的λ
model5<-glmnet(x=x,y=y,family = "gaussian",alpha = 0,lambda = r1cv$lambda.min)
coef(model5)
predict(model5,as.matrix(testset[,-1]),interval="prediction")