Table of contents
- Ordinary least squares (OLS) regression
- Regression diagnostics
- Generalized linear regression: logistic regression
- Other generalized linear models
- Nonlinear regression models
Preface
This post summarizes how to carry out regression analysis in R.
Ordinary least squares (OLS) regression
This covers simple linear regression, polynomial regression, and multiple linear regression.
Statistical assumptions
To interpret the coefficients of an OLS model properly, the data must satisfy the following statistical assumptions:
- Normality: for fixed values of the independent variables, the dependent variable is normally distributed;
- Independence: the observations (equivalently, the error terms) are independent of one another;
- Linearity: the dependent variable is linearly related to the independent variables;
- Homoscedasticity: the variance of the dependent variable does not change across levels of the independent variables.
Simple linear regression
dt <- women
# View(dt)
# Fit the linear regression
fit <- lm(weight~height, data = women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# Scatter plot with the fitted regression line
plot(dt$height, dt$weight)
abline(fit)
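Beyond the point estimates in summary(), confidence intervals for the coefficients are often useful. A quick check, reusing the fit object above:
# 95% confidence intervals for the intercept and slope
confint(fit)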
Polynomial regression
fit1 <- lm(weight~height+I(height^2), data = women)
summary(fit1)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
# Scatter plot with the fitted quadratic curve
plot(dt$height, dt$weight)
lines(dt$height, fitted(fit1))
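To check whether the quadratic term is a genuine improvement over the straight-line fit, the two nested models can be compared with an F test (a sketch reusing fit and fit1 from above):
# Nested-model comparison: does adding I(height^2) significantly reduce the residual error?
anova(fit, fit1)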
Multiple linear regression
states <- as.data.frame(state.x77[,c("Murder",
"Population",
"Illiteracy",
"Income", "Frost")])
# Pairwise correlations among the variables
cor(states)
## Murder Population Illiteracy Income Frost
## Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
## Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
## Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
## Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
## Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
library(car)
## Loading required package: carData
scatterplotMatrix(states, spread=FALSE, smoother.args=list(lty=2),
main="Scatter Plot Matrix")
fit2 <- lm(Murder ~ Population + Illiteracy + Income + Frost,
data=states)
summary(fit2)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
Multiple linear regression with interaction terms
If the interaction between two predictor variables is significant, the relationship between the response and one of the predictors depends on the level of the other predictor (see the visualization sketch after the model output below).
fit3 <- lm(mpg~hp+wt+hp:wt, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -1.6491 -0.7362 1.4211 4.5513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
## hp -0.12010 0.02470 -4.863 4.04e-05 ***
## wt -8.21662 1.26971 -6.471 5.20e-07 ***
## hp:wt 0.02785 0.00742 3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
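One convenient way to see what a significant hp:wt interaction means is to plot mpg against hp at several fixed values of wt. A sketch using the effects package (assumed to be installed; the chosen wt levels are illustrative):
# Visualize the hp:wt interaction: effect of hp on mpg at three representative values of wt
library(effects)
plot(effect("hp:wt", fit3, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)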
Summary
All of the models above are fitted with the lm() function, and summary() reports the results.
Steps for reading the output:
- First look at the estimated coefficients
- Check the p-value of each predictor
- Check the model's goodness of fit (R-squared)
- Check the p-value of the F test
Regression diagnostics
Standard approach
dt <- women
# Fit the linear regression
fit <- lm(weight~height, data = women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# Show all four diagnostic plots at once
par(mfrow=c(2, 2))
plot(fit)
# Diagnostics for the quadratic fit
fit2 <- lm(weight~height+I(height^2), data = women)
plot(fit2)
Reading the plots:
- Top left: Residuals vs Fitted plot, used to assess linearity. The residuals should scatter randomly around zero; the clear curvature here suggests the linearity assumption may not hold for the straight-line fit.
- Top right: Normal Q-Q plot, used to assess normality. If the points fall roughly on a straight line, the normality assumption is satisfied.
- Bottom left: Scale-Location plot, used to assess homoscedasticity.
- Bottom right: Residuals vs Leverage plot, used to spot outliers, high-leverage points, and influential observations; it does not test independence, which is best judged from how the data were collected (or with the Durbin-Watson test below).
Diagnostics with the car package, one assumption at a time:
- Normality: qqPlot(); if all points fall within the confidence envelope, the normality assumption is well satisfied.
states <- as.data.frame(state.x77[,c("Murder",
"Population",
"Illiteracy",
"Income", "Frost")])
fit3 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
summary(fit3)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
par(mfrow=c(1,1))
library(car)
qqPlot(fit3, labels=row.names(states), id.method="identify",
simulate=TRUE, main="Q-Q Plot")
## Nevada Rhode Island
## 28 39
fitted(fit3)["Nevada"]
## Nevada
## 3.878958
states["Nevada",]
## Murder Population Illiteracy Income Frost
## Nevada 11.5 590 0.5 5149 188
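Nevada's observed murder rate (11.5) is far above its fitted value (about 3.88), which is exactly what its large positive residual shows. A quick check on the same fit3 object:
# Nevada's residual: observed minus fitted murder rate
residuals(fit3)["Nevada"]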
- Independence: Durbin-Watson test (a non-significant p-value indicates the errors are not autocorrelated)
durbinWatsonTest(fit3)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2006929 2.317691 0.24
## Alternative hypothesis: rho != 0
- Linearity: component-plus-residual plots (partial residual plots)
par(mfrow=c(2,2))
crPlots(fit3)
The component-plus-residual plots support the linearity assumption; the linear model form appears appropriate for this data set.
- Homoscedasticity: score test for non-constant error variance
ncvTest(fit3)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.746514, Df = 1, p = 0.18632
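The non-significant p-value (0.19) suggests constant error variance. car also provides a spread-level plot as a visual check; a suggested power transformation close to 1 means no transformation of the response is needed. A sketch on the same model:
# Visual check of homoscedasticity; also prints a suggested power transformation
spreadLevelPlot(fit3)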
Global validation of the model assumptions
This uses the external gvlma package.
library(gvlma)
Perform a global validation of the linear model assumptions:
gvmodel <- gvlma(fit3)
summary(gvmodel)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit3)
##
## Value p-value Decision
## Global Stat 2.7728 0.5965 Assumptions acceptable.
## Skewness 1.5374 0.2150 Assumptions acceptable.
## Kurtosis 0.6376 0.4246 Assumptions acceptable.
## Link Function 0.1154 0.7341 Assumptions acceptable.
## Heteroscedasticity 0.4824 0.4873 Assumptions acceptable.
Multicollinearity
Variance inflation factor (VIF)
vif(fit3)
## Population Illiteracy Income Frost
## 1.245282 2.165848 1.345822 2.082547
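A common rule of thumb is that sqrt(VIF) > 2 signals a multicollinearity problem; none of the predictors here come close. A quick check:
# Rule of thumb: sqrt(VIF) > 2 suggests problematic multicollinearity
sqrt(vif(fit3)) > 2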
# Correlations among the predictors, for comparison with the VIFs
cor(states)
## Murder Population Illiteracy Income Frost
## Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
## Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
## Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
## Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
## Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
Generalized linear regression: logistic regression
The sigmoid function
Properties
The sigmoid function is $\sigma(x) = \frac{1}{1 + e^{-x}}$. Its properties:
- Domain: $(-\infty, +\infty)$;
- Range: $(0, 1)$;
- Continuous and smooth everywhere on its domain;
- Differentiable everywhere, with derivative $\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$.
Model
Fitting procedure:
Construct the loss function and estimate the parameters by maximum likelihood (the parameters are computed with gradient descent).
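For reference, the logistic regression model and the log-likelihood maximized during estimation can be written as the standard formulas:
$$P(Y = 1 \mid x) = \sigma\left(\beta_0 + \beta^{\top} x\right) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}, \qquad \ell(\beta) = \sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right], \quad p_i = P(Y_i = 1 \mid x_i).$$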
R implementation
fm <- glm(formula, family = binomial(link = logit),
data=data.frame)
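A concrete illustration on built-in data (a sketch; the choice of mtcars, the response am, and the predictors hp and wt is just for demonstration):
# Logistic regression: probability of a manual transmission (am = 1) given hp and wt
logit.fit <- glm(am ~ hp + wt, family = binomial(link = "logit"), data = mtcars)
summary(logit.fit)
# Fitted probabilities for the first few cars
head(predict(logit.fit, type = "response"))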
Other generalized linear models
fitted.model <- glm(formula, family = family.generator, data = data.frame)
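The family.generator argument selects the error distribution and link function; common choices include binomial(), poisson(), Gamma(), and gaussian(). For example, a Poisson regression on count data (a sketch using the built-in warpbreaks data set):
# Poisson regression: number of warp breaks as a function of wool type and tension
pois.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)
summary(pois.fit)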
Nonlinear regression models
Polynomial regression model
alloy<-data.frame(
x=c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
y=c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
1.70, 1.80, 1.90, 2.35, 2.54, 2.90)
)
lm.sol<-lm(y~1+x+I(x^2),data=alloy)
summary(lm.sol)
##
## Call:
## lm(formula = y ~ 1 + x + I(x^2), data = alloy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33322 -0.14222 -0.07922 0.05275 0.84577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257.06961 47.00295 5.469 0.000273 ***
## x -12.62032 2.35377 -5.362 0.000318 ***
## I(x^2) 0.15600 0.02942 5.303 0.000346 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.329 on 10 degrees of freedom
## Multiple R-squared: 0.7843, Adjusted R-squared: 0.7412
## F-statistic: 18.18 on 2 and 10 DF, p-value: 0.0004668
Orthogonal polynomial regression
Orthogonal polynomials remove the multicollinearity between the polynomial terms.
# Use the poly() function to construct the orthogonal polynomials
poly(alloy$x, degree=2)
## 1 2
## [1,] -4.447496e-01 0.49168917
## [2,] -3.706247e-01 0.24584459
## [3,] -2.964997e-01 0.04469902
## [4,] -2.223748e-01 -0.11174754
## [5,] -1.482499e-01 -0.22349508
## [6,] -7.412493e-02 -0.29054360
## [7,] -1.645904e-17 -0.31289311
## [8,] 7.412493e-02 -0.29054360
## [9,] 1.482499e-01 -0.22349508
## [10,] 2.223748e-01 -0.11174754
## [11,] 2.964997e-01 0.04469902
## [12,] 3.706247e-01 0.24584459
## [13,] 4.447496e-01 0.49168917
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] 40 40
##
## attr(,"coefs")$norm2
## [1] 1.000 13.000 45.500 125.125
##
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly" "matrix"
# Fit the regression on the orthogonal polynomial terms
lm.pol<-lm(y ~ 1 + poly(x, 2), data=alloy)
summary(lm.pol)
##
## Call:
## lm(formula = y ~ 1 + poly(x, 2), data = alloy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33322 -0.14222 -0.07922 0.05275 0.84577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.40923 0.09126 26.400 1.4e-10 ***
## poly(x, 2)1 -0.94435 0.32904 -2.870 0.016669 *
## poly(x, 2)2 1.74505 0.32904 5.303 0.000346 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.329 on 10 degrees of freedom
## Multiple R-squared: 0.7843, Adjusted R-squared: 0.7412
## F-statistic: 18.18 on 2 and 10 DF, p-value: 0.0004668
(Intrinsically) nonlinear regression models
- Method 1: estimate the parameters by maximum likelihood
- Method 2: use nls(), the built-in parameter estimation function for nonlinear models
# Enter the data as a data frame
cl<-data.frame(
X=c(rep(2*4:21, c(2, 4, 4, 3, 3, 2, 3, 3, 3, 3, 2,
3, 2, 1, 2, 2, 1, 1))),
Y=c(0.49, 0.49, 0.48, 0.47, 0.48, 0.47, 0.46, 0.46,
0.45, 0.43, 0.45, 0.43, 0.43, 0.44, 0.43, 0.43,
0.46, 0.45, 0.42, 0.42, 0.43, 0.41, 0.41, 0.40,
0.42, 0.40, 0.40, 0.41, 0.40, 0.41, 0.41, 0.40,
0.40, 0.40, 0.38, 0.41, 0.40, 0.40, 0.41, 0.38,
0.40, 0.40, 0.39, 0.39)
)
# Nonlinear least-squares fit; report the parameter estimates
nls.sol<-nls(Y~a+(0.49-a)*exp(-b*(X-8)), data=cl,
start = list( a= 0.1, b = 0.01 ))
nls.sum<-summary(nls.sol); nls.sum
##
## Formula: Y ~ a + (0.49 - a) * exp(-b * (X - 8))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 0.390140 0.005045 77.333 < 2e-16 ***
## b 0.101633 0.013360 7.607 1.99e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01091 on 42 degrees of freedom
##
## Number of iterations to convergence: 19
## Achieved convergence tolerance: 1.365e-06
# Plot the fitted curve over the scatter plot
xfit<-seq(8,44,len=200)
yfit<-predict(nls.sol, data.frame(X=xfit))
plot(cl$X, cl$Y)
lines(xfit, yfit)
# Compute the partial derivatives and the corresponding Jacobian matrix
fn<-function(a, b, X){
f1 <- 1-exp(-b*(X-8))
f2 <- -(0.49-a)*(X-8)*exp(-b*(X-8))
cbind(f1,f2)
}
D<-fn(nls.sum$parameters[1,1], nls.sum$parameters[2,1], cl$X)
# Variance estimate for theta
theta.var<-nls.sum$sigma^2*solve(t(D)%*%D);
theta.var
## f1 f2
## f1 2.545130e-05 5.984318e-05
## f2 5.984318e-05 1.784969e-04
# Standard errors computed from theta.var
sqrt(theta.var[1,1])
## [1] 0.005044928
sqrt(theta.var[2,2])
## [1] 0.01336027
# Standard errors reported by summary(); identical to the values computed above
nls.sum$parameters[,2]
## a b
## 0.005044928 0.013360271
Estimating the other parameters
- Error variance $\sigma^2$: the estimate $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i,\hat{\theta})\bigr)^2$ is not unbiased, but when the sample size is large its bias is small (see the numeric check right after this item).
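A quick numeric check, reusing nls.sol and nls.sum from above:
# Maximum-likelihood-style (slightly biased) estimate of the error variance: RSS / n
sum(residuals(nls.sol)^2) / length(cl$Y)
# Estimate used by summary(): RSS / (n - p), i.e. nls.sum$sigma squared
nls.sum$sigma^2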
- Method 3: parameter estimation for nonlinear models with nlm(), a general-purpose function for minimizing multivariate functions
fn<-function(p, X, Y){
f <- Y-p[1]-(0.49-p[1])*exp(-p[2]*(X-8))
res<-sum(f^2)
f1<- -1+exp(-p[2]*(X-8))
f2<- (0.49-p[1])*exp(-p[2]*(X-8))*(X-8)
J<-cbind(f1,f2)
attr(res, "gradient") <- 2*t(J)%*%f
res
}
# Note: the columns of cl are uppercase X and Y; cl$x and cl$y are NULL here,
# so the objective is identically 0 and nlm() stops immediately (see the degenerate output below)
nlm(fn, p=c(0.1, 0.01), X=cl$x, Y=cl$y,
hessian=TRUE)
## $minimum
## [1] 0
##
## $estimate
## [1] 0.10 0.01
##
## $gradient
## [1] 0 0
##
## $hessian
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## $code
## [1] 1
##
## $iterations
## [1] 0
# Correct call with the actual column names X and Y
nlm(fn, p=c(0.1, 0.01), X=cl$X, Y=cl$Y, hessian=TRUE)
## $minimum
## [1] 0.00500168
##
## $estimate
## [1] 0.3901400 0.1016327
##
## $gradient
## [1] 7.954390e-07 -3.261297e-07
##
## $hessian
## [,1] [,2]
## [1,] 44.20335 -14.799291
## [2,] -14.79929 6.248565
##
## $code
## [1] 1
##
## $iterations
## [1] 33
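The nlm() output can also be turned into approximate standard errors. Near the minimum of the residual sum of squares S(theta), the Hessian is roughly 2 * t(J) %*% J, so Var(theta) is approximately 2 * sigma^2 * solve(hessian) with sigma^2 = S(theta) / (n - p). A sketch of this computation, which reproduces the nls() standard errors:
# Approximate standard errors from the nlm() Hessian (Gauss-Newton approximation)
est <- nlm(fn, p=c(0.1, 0.01), X=cl$X, Y=cl$Y, hessian=TRUE)
sigma2 <- est$minimum / (length(cl$Y) - 2)   # RSS / (n - p)
sqrt(diag(2 * sigma2 * solve(est$hessian)))  # close to 0.00504 and 0.01336 above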
# The same objective function, this time without supplying the gradient
fn<-function(p, X, Y){
f <- Y-p[1]-(0.49-p[1])*exp(-p[2]*(X-8))
res <- sum(f^2)
res
}
# Same caveat as above: cl$x and cl$y are NULL (the columns are X and Y),
# so this call is degenerate; use X=cl$X, Y=cl$Y to reproduce the estimates
nlm(fn, p=c(0.1, 0.01),X=cl$x,Y=cl$y)
## $minimum
## [1] 0
##
## $estimate
## [1] 0.10 0.01
##
## $gradient
## [1] 0 0
##
## $code
## [1] 1
##
## $iterations
## [1] 0