Table of contents
- Ordinary least squares (OLS) regression
- Regression diagnostics
- Generalized linear regression: logistic regression
- Other generalized linear models
- Nonlinear regression models
Preface
This post summarizes how to carry out regression analysis in R.
Ordinary least squares (OLS) regression
This covers simple linear regression, polynomial regression, and multiple linear regression.
Statistical assumptions
To interpret the coefficients of an OLS model properly, the data must satisfy the following statistical assumptions:
- Normality: for fixed values of the independent variables, the dependent variable is normally distributed;
- Independence: the observations (equivalently, the error terms) are independent of one another;
- Linearity: the dependent variable is linearly related to the independent variables;
- Homoscedasticity: the variance of the dependent variable does not change across levels of the independent variables.
Simple linear regression
dt <- women
# View(dt)
# Fit the linear regression
fit <- lm(weight~height, data = women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# Scatter plot with the fitted regression line
plot(dt$height, dt$weight)
abline(fit)
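Beyond the point estimates in summary(), confidence intervals for the coefficients are often useful. A quick check, reusing the fit object above:
# 95% confidence intervals for the intercept and slope
confint(fit)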
Polynomial regression
fit1 <- lm(weight~height+I(height^2), data = women)
summary(fit1)
##
## Call:
## lm(formula = weight ~ height + I(height^2), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50941 -0.29611 -0.00941 0.28615 0.59706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
## height -7.34832 0.77769 -9.449 6.58e-07 ***
## I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
# Scatter plot with the fitted quadratic curve
plot(dt$height, dt$weight)
lines(dt$height, fitted(fit1))
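To check whether the quadratic term is a genuine improvement over the straight-line fit, the two nested models can be compared with an F test (a sketch reusing fit and fit1 from above):
# Nested-model comparison: does adding I(height^2) significantly reduce the residual error?
anova(fit, fit1)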
Multiple linear regression
states <- as.data.frame(state.x77[,c("Murder",
"Population",
"Illiteracy",
"Income", "Frost")])
# Pairwise correlations among the variables
cor(states)
## Murder Population Illiteracy Income Frost
## Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
## Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
## Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
## Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
## Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
library(car)
## Loading required package: carData
scatterplotMatrix(states, spread=FALSE, smoother.args=list(lty=2),
main="Scatter Plot Matrix")
fit2 <- lm(Murder ~ Population + Illiteracy + Income + Frost,
data=states)
summary(fit2)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
Multiple linear regression with interaction terms
If the interaction between two predictor variables is significant, the relationship between the response and one of the predictors depends on the level of the other predictor (see the visualization sketch after the model output below).
fit3 <- lm(mpg~hp+wt+hp:wt, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -1.6491 -0.7362 1.4211 4.5513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
## hp -0.12010 0.02470 -4.863 4.04e-05 ***
## wt -8.21662 1.26971 -6.471 5.20e-07 ***
## hp:wt 0.02785 0.00742 3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
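One convenient way to see what a significant hp:wt interaction means is to plot mpg against hp at several fixed values of wt. A sketch using the effects package (assumed to be installed; the chosen wt levels are illustrative):
# Visualize the hp:wt interaction: effect of hp on mpg at three representative values of wt
library(effects)
plot(effect("hp:wt", fit3, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)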
Summary
All of the models above are fitted with the lm() function, and summary() reports the results.
Steps for reading the output:
- First look at the estimated coefficients
- Check the p-value of each predictor
- Check the model's goodness of fit (R-squared)
- Check the p-value of the F test
Regression diagnostics
Standard approach
dt <- women
# Fit the linear regression
fit <- lm(weight~height, data = women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# Show all four diagnostic plots at once
par(mfrow=c(2, 2))
plot(fit)
# Diagnostics for the quadratic fit
fit2 <- lm(weight~height+I(height^2), data = women)
plot(fit2)
Reading the plots:
- Top left: Residuals vs Fitted plot, used to assess linearity. The residuals should scatter randomly around zero; the clear curvature here suggests the linearity assumption may not hold for the straight-line fit.
- Top right: Normal Q-Q plot, used to assess normality. If the points fall roughly on a straight line, the normality assumption is satisfied.
- Bottom left: Scale-Location plot, used to assess homoscedasticity.
- Bottom right: Residuals vs Leverage plot, used to spot outliers, high-leverage points, and influential observations; it does not test independence, which is best judged from how the data were collected (or with the Durbin-Watson test below).
Diagnostics with the car package, one assumption at a time:
- Normality: qqPlot(); if all points fall within the confidence envelope, the normality assumption is well satisfied.
states <- as.data.frame(state.x77[,c("Murder",
"Population",
"Illiteracy",
"Income", "Frost")])
fit3 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
summary(fit3)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
par(mfrow=c(1,1))
library(car)
qqPlot(fit3, labels=row.names(states), id.method="identify",
simulate=TRUE, main="Q-Q Plot")
## Nevada Rhode Island
## 28 39
fitted(fit3)["Nevada"]
## Nevada
## 3.878958
states["Nevada",]
## Murder Population Illiteracy Income Frost
## Nevada 11.5 590 0.5 5149 188
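Nevada's observed murder rate (11.5) is far above its fitted value (about 3.88), which is exactly what its large positive residual shows. A quick check on the same fit3 object:
# Nevada's residual: observed minus fitted murder rate
residuals(fit3)["Nevada"]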
- Independence: Durbin-Watson test (a non-significant p-value indicates the errors are not autocorrelated)
durbinWatsonTest(fit3)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2006929 2.317691 0.24
## Alternative hypothesis: rho != 0
- Linearity: component-plus-residual plots (partial residual plots)
par(mfrow=c(2,2))
crPlots(fit3)
The component-plus-residual plots support the linearity assumption; the linear model form appears appropriate for this data set.
- Homoscedasticity: score test for non-constant error variance
ncvTest(fit3)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.746514, Df = 1, p = 0.18632
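The non-significant p-value (0.19) suggests constant error variance. car also provides a spread-level plot as a visual check; a suggested power transformation close to 1 means no transformation of the response is needed. A sketch on the same model:
# Visual check of homoscedasticity; also prints a suggested power transformation
spreadLevelPlot(fit3)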
Global validation of the model assumptions
This uses the external gvlma package.
library(gvlma)
Perform a global validation of the linear model assumptions:
gvmodel <- gvlma(fit3)
summary(gvmodel)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit3)
##
## Value p-value Decision
## Global Stat 2.7728 0.5965 Assumptions acceptable.
## Skewness 1.5374 0.2150 Assumptions acceptable.
## Kurtosis 0.6376 0.4246 Assumptions acceptable.
## Link Function 0.1154 0.7341 Assumptions acceptable.
## Heteroscedasticity 0.4824 0.4873 Assumptions acceptable.
Multicollinearity
Variance inflation factor (VIF)
vif(fit3)
## Population Illiteracy Income Frost
## 1.245282 2.165848 1.345822 2.082547
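A common rule of thumb is that sqrt(VIF) > 2 signals a multicollinearity problem; none of the predictors here come close. A quick check:
# Rule of thumb: sqrt(VIF) > 2 suggests problematic multicollinearity
sqrt(vif(fit3)) > 2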
# Correlations among the predictors, for comparison with the VIFs
cor(states)
## Murder Population Illiteracy Income Frost
## Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
## Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
## Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
## Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
## Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
Generalized linear regression: logistic regression
The sigmoid function
Properties
The sigmoid function is $\sigma(x) = \frac{1}{1 + e^{-x}}$. Its properties:
- Domain: $(-\infty, +\infty)$;
- Range: $(0, 1)$;
- Continuous and smooth everywhere on its domain;
- Differentiable everywhere, with derivative $\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$.
Model
Fitting procedure:
Construct the loss function and estimate the parameters by maximum likelihood (the parameters are computed with gradient descent).
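For reference, the logistic regression model and the log-likelihood maximized during estimation can be written as the standard formulas:
$$P(Y = 1 \mid x) = \sigma\left(\beta_0 + \beta^{\top} x\right) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}, \qquad \ell(\beta) = \sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right], \quad p_i = P(Y_i = 1 \mid x_i).$$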
R implementation
fm <- glm(formula, family = binomial(link = logit),
data=data.frame)
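A concrete illustration on built-in data (a sketch; the choice of mtcars, the response am, and the predictors hp and wt is just for demonstration):
# Logistic regression: probability of a manual transmission (am = 1) given hp and wt
logit.fit <- glm(am ~ hp + wt, family = binomial(link = "logit"), data = mtcars)
summary(logit.fit)
# Fitted probabilities for the first few cars
head(predict(logit.fit, type = "response"))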
Other generalized linear models
fitted.model <- glm(formula, family = family.generator, data = data.frame)
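The family.generator argument selects the error distribution and link function; common choices include binomial(), poisson(), Gamma(), and gaussian(). For example, a Poisson regression on count data (a sketch using the built-in warpbreaks data set):
# Poisson regression: number of warp breaks as a function of wool type and tension
pois.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)
summary(pois.fit)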
Nonlinear regression models
Polynomial regression model
alloy<-data.frame(
x=c(37.0, 37.5, 38.0, 38.5, 39.0, 39.5, 40.0,
40.5, 41.0, 41.5, 42.0, 42.5, 43.0),
y=c(3.40, 3.00, 3.00, 3.27, 2.10, 1.83, 1.53,
1.70, 1.80, 1.90, 2.35, 2.54, 2.90)
)
lm.sol<-lm(y~1+x+I(x^2),data=alloy)
summary(lm.sol)
##
## Call:
## lm(formula = y ~ 1 + x + I(x^2), data = alloy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33322 -0.14222 -0.07922 0.05275 0.84577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257.06961 47.00295 5.469 0.000273 ***
## x -12.62032 2.35377 -5.362 0.000318 ***
## I(x^2) 0.15600 0.02942 5.303 0.000346 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.329 on 10 degrees of freedom
## Multiple R-squared: 0.7843, Adjusted R-squared: 0.7412
## F-statistic: 18.18 on 2 and 10 DF, p-value: 0.0004668
Orthogonal polynomial regression
Orthogonal polynomials remove the multicollinearity between the polynomial terms.
# Use the poly() function to construct the orthogonal polynomials
poly(alloy$x, degree=2)
## 1 2
## [1,] -4.447496e-01 0.49168917
## [2,] -3.706247e-01 0.24584459
## [3,] -2.964997e-01 0.04469902
## [4,] -2.223748e-01 -0.11174754
## [5,] -1.482499e-01 -0.22349508
## [6,] -7.412493e-02 -0.29054360
## [7,] -1.645904e-17 -0.31289311
## [8,] 7.412493e-02 -0.29054360
## [9,] 1.482499e-01 -0.22349508
## [10,] 2.223748e-01 -0.11174754
## [11,] 2.964997e-01 0.04469902
## [12,] 3.706247e-01 0.24584459
## [13,] 4.447496e-01 0.49168917
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] 40 40
##
## attr(,"coefs")$norm2
## [1] 1.000 13.000 45.500 125.125
##
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly" "matrix"
# Fit the regression on the orthogonal polynomial terms
lm.pol<-lm(y ~ 1 + poly(x, 2), data=alloy)
summary(lm.pol)
##
## Call:
## lm(formula = y ~ 1 + poly(x, 2), data = alloy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33322 -0.14222 -0.07922 0.05275 0.84577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.40923 0.09126 26.400 1.4e-10 ***
## poly(x, 2)1 -0.94435 0.32904 -2.870 0.016669 *
## poly(x, 2)2 1.74505 0.32904 5.303 0.000346 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.329 on 10 degrees of freedom
## Multiple R-squared: 0.7843, Adjusted R-squared: 0.7412
## F-statistic: 18.18 on 2 and 10 DF, p-value: 0.0004668
(Intrinsically) nonlinear regression models
- Method 1: estimate the parameters by maximum likelihood
- Method 2: use nls(), the built-in parameter estimation function for nonlinear models
# Enter the data as a data frame
cl<-data.frame(
X=c(rep(2*4:21, c(2, 4, 4, 3, 3, 2, 3, 3, 3, 3, 2,
3, 2, 1, 2, 2, 1, 1))),
Y=c(0.49, 0.49, 0.48, 0.47, 0.48, 0.47, 0.46, 0.46,
0.45, 0.43, 0.45, 0.43, 0.43, 0.44, 0.43, 0.43,
0.46, 0.45, 0.42, 0.42, 0.43, 0.41, 0.41, 0.40,
0.42, 0.40, 0.40, 0.41, 0.40, 0.41, 0.41, 0.40,
0.40, 0.40, 0.38, 0.41, 0.40, 0.40, 0.41, 0.38,
0.40, 0.40, 0.39, 0.39)
)
# Nonlinear least-squares fit; report the parameter estimates
nls.sol<-nls(Y~a+(0.49-a)*exp(-b*(X-8)), data=cl,
start = list( a= 0.1, b = 0.01 ))
nls.sum<-summary(nls.sol); nls.sum
##
## Formula: Y ~ a + (0.49 - a) * exp(-b * (X - 8))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 0.390140 0.005045 77.333 < 2e-16 ***
## b 0.101633 0.013360 7.607 1.99e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01091 on 42 degrees of freedom
##
## Number of iterations to convergence: 19
## Achieved convergence tolerance: 1.365e-06
# Plot the fitted curve over the scatter plot
xfit<-seq(8,44,len=200)
yfit<-predict(nls.sol, data.frame(X=xfit))
plot(cl$X, cl$Y)
lines(xfit, yfit)
# Compute the partial derivatives and the corresponding Jacobian matrix
fn<-function(a, b, X){
f1 <- 1-exp(-b*(X-8))
f2 <- -(0.49-a)*(X-8)*exp(-b*(X-8))
cbind(f1,f2)
}
D<-fn(nls.sum$parameters[1,1], nls.sum$parameters[2,1], cl$X)
# Variance estimate for theta
theta.var<-nls.sum$sigma^2*solve(t(D)%*%D);
theta.var
## f1 f2
## f1 2.545130e-05 5.984318e-05
## f2 5.984318e-05 1.784969e-04
# Standard errors computed from theta.var
sqrt(theta.var[1,1])
## [1] 0.005044928
sqrt(theta.var[2,2])
## [1] 0.01336027
# Standard errors reported by summary(); identical to the values computed above
nls.sum$parameters[,2]
## a b
## 0.005044928 0.013360271
Estimating the other parameters
- Error variance $\sigma^2$: the estimate $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i,\hat{\theta})\bigr)^2$ is not unbiased, but when the sample size is large its bias is small (see the numeric check right after this item).
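A quick numeric check, reusing nls.sol and nls.sum from above:
# Maximum-likelihood-style (slightly biased) estimate of the error variance: RSS / n
sum(residuals(nls.sol)^2) / length(cl$Y)
# Estimate used by summary(): RSS / (n - p), i.e. nls.sum$sigma squared
nls.sum$sigma^2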
- Method 3: parameter estimation for nonlinear models with nlm(), a general-purpose function for minimizing multivariate functions
fn<-function(p, X, Y){
f <- Y-p[1]-(0.49-p[1])*exp(-p[2]*(X-8))
res<-sum(f^2)
f1<- -1+exp(-p[2]*(X-8))
f2<- (0.49-p[1])*exp(-p[2]*(X-8))*(X-8)
J<-cbind(f1,f2)
attr(res, "gradient") <- 2*t(J)%*%f
res
}
# Note: the columns of cl are uppercase X and Y; cl$x and cl$y are NULL here,
# so the objective is identically 0 and nlm() stops immediately (see the degenerate output below)
nlm(fn, p=c(0.1, 0.01), X=cl$x, Y=cl$y,
hessian=TRUE)
## $minimum
## [1] 0
##
## $estimate
## [1] 0.10 0.01
##
## $gradient
## [1] 0 0
##
## $hessian
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## $code
## [1] 1
##
## $iterations
## [1] 0
# Correct call with the actual column names X and Y
nlm(fn, p=c(0.1, 0.01), X=cl$X, Y=cl$Y, hessian=TRUE)
## $minimum
## [1] 0.00500168
##
## $estimate
## [1] 0.3901400 0.1016327
##
## $gradient
## [1] 7.954390e-07 -3.261297e-07
##
## $hessian
## [,1] [,2]
## [1,] 44.20335 -14.799291
## [2,] -14.79929 6.248565
##
## $code
## [1] 1
##
## $iterations
## [1] 33
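The nlm() output can also be turned into approximate standard errors. Near the minimum of the residual sum of squares S(theta), the Hessian is roughly 2 * t(J) %*% J, so Var(theta) is approximately 2 * sigma^2 * solve(hessian) with sigma^2 = S(theta) / (n - p). A sketch of this computation, which reproduces the nls() standard errors:
# Approximate standard errors from the nlm() Hessian (Gauss-Newton approximation)
est <- nlm(fn, p=c(0.1, 0.01), X=cl$X, Y=cl$Y, hessian=TRUE)
sigma2 <- est$minimum / (length(cl$Y) - 2)   # RSS / (n - p)
sqrt(diag(2 * sigma2 * solve(est$hessian)))  # close to 0.00504 and 0.01336 above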
# The same objective function, this time without supplying the gradient
fn<-function(p, X, Y){
f <- Y-p[1]-(0.49-p[1])*exp(-p[2]*(X-8))
res <- sum(f^2)
res
}
# Same caveat as above: cl$x and cl$y are NULL (the columns are X and Y),
# so this call is degenerate; use X=cl$X, Y=cl$Y to reproduce the estimates
nlm(fn, p=c(0.1, 0.01),X=cl$x,Y=cl$y)
## $minimum
## [1] 0
##
## $estimate
## [1] 0.10 0.01
##
## $gradient
## [1] 0 0
##
## $code
## [1] 1
##
## $iterations
## [1] 0