1. Two ways to compare models

(1) Comparing two models with the anova() function
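
The examples in these notes assume a states data frame containing the Murder, Population, Illiteracy, Income, and Frost variables; a minimal sketch of one way to build it from R's built-in state.x77 dataset is:

# assumed setup: states is built from the built-in state.x77 matrix
states <- as.data.frame(state.x77[, c("Murder", "Population",
    "Illiteracy", "Income", "Frost")])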

fit1 <- lm(Murder ~ Population + Illiteracy + Income +
    Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
anova(fit2, fit1)

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25
2     45 289.17  2  0.078505 0.0061 0.9939

Model 1 is nested within Model 2. The anova() function also tests whether Income and Frost should be added to the linear model. Because the test is not significant (p = 0.994), we conclude that these two variables are not needed and can be dropped from the model.

(2) The AIC (Akaike Information Criterion) can also be used to compare models. It takes into account both a model's statistical fit and the number of parameters used to achieve it; models with smaller AIC values are preferred, since they achieve adequate fit with fewer parameters. The criterion is available through the AIC() function.
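
For reference, the criterion for a model with k estimated parameters and maximized likelihood L is AIC = -2*ln(L) + 2*k; the 2*k term penalizes additional parameters, so extra complexity is only rewarded if it improves the fit enough.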

fit1 <- lm(Murder ~ Population + Illiteracy + Income +
    Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
AIC(fit1, fit2)

     df      AIC
fit1  6 241.6429
fit2  4 237.6565

The anova() comparison requires nested models, whereas the AIC approach does not.
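
As an illustration of this point, AIC() can rank models whose predictor sets are not nested; the two groupings below are arbitrary choices made here for illustration, not taken from the text above.

# two non-nested models: neither predictor set contains the other
fitA <- lm(Murder ~ Population + Illiteracy, data = states)
fitB <- lm(Murder ~ Income + Frost, data = states)
AIC(fitA, fitB)  # the model with the smaller AIC is preferred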

2. Several methods of variable selection (reducing the number of predictors)

The stepAIC() function in the MASS package performs stepwise model selection (forward, backward, and both directions) using an exact AIC criterion.

(1) Forward stepwise regression: predictor variables are added to the model one at a time, until adding more variables would no longer improve the model.

(2) Backward stepwise regression: the model starts with all predictor variables, and variables are removed one at a time until removing another variable would degrade the quality of the model.

(3) Stepwise (forward/backward) regression, usually just called stepwise regression, combines the two approaches: variables are entered one at a time, but at every step all variables in the model are re-evaluated, and any that no longer contribute are removed. A predictor may therefore be added and removed several times before the best model is reached (a sketch of the forward and both directions follows the backward example below).

library(MASS)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
stepAIC(fit, direction = "backward")
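
For completeness, a minimal sketch of the other two directions discussed above; the scope formula used for forward selection is an assumption about which candidate predictors to consider.

# stepwise ("both"): variables may enter and leave the model at each step
stepAIC(fit, direction = "both")

# forward selection: start from the intercept-only model and add predictors;
# scope (assumed here) lists the candidate predictors to consider
fit0 <- lm(Murder ~ 1, data = states)
stepAIC(fit0, direction = "forward",
    scope = ~ Population + Illiteracy + Income + Frost)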

Drawback: although stepwise regression is likely to find a good model, there is no guarantee that it is the best one, because not every possible model is evaluated.

(4) All-subsets regression

Here, every possible model is examined.

All-subsets regression can be carried out with the regsubsets() function in the leaps package.

library(leaps)
leaps <- regsubsets(Murder ~ Population + Illiteracy +
    Income + Frost, data = states, nbest = 4)
plot(leaps, scale = "adjr2")

library(car)
subsets(leaps, statistic = "cp",
    main = "Cp Plot for All Subsets Regression")
abline(1, 1, lty = 2, col = "red")

3. How do we assess how good the model is?
(1) Cross-validation

The shrinkage() function below uses k-fold cross-validation (via crossval() in the bootstrap package) to compare the model's original R-square with its cross-validated R-square; a large drop between the two suggests the model will not predict new data well.
shrinkage <- function(fit, k = 10) {
     require(bootstrap)
     # define functions
     theta.fit <- function(x, y) {
         lsfit(x, y)
     }
     theta.predict <- function(fit, x) {
         cbind(1, x) %*% fit$coef
     }
     
     # matrix of predictors
     x <- fit$model[, 2:ncol(fit$model)]
     # vector of predicted values
     y <- fit$model[, 1]
     
     results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
     r2 <- cor(y, fit$fitted.values)^2
     r2cv <- cor(y, results$cv.fit)^2
     cat("Original R-square =", r2, "\n")
     cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
     cat("Change =", r2 - r2cv, "\n")
 }
 fit <- lm(Murder ~ Population + Income + Illiteracy + 
     Frost, data = states)
 shrinkage(fit)


 fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
 shrinkage(fit2)

(2) Relative importance of variables

The simplest approach is to compare standardized regression coefficients: standardize the variables with scale() and refit the model, as below.
 zstates <- as.data.frame(scale(states))
 zfit <- lm(Murder ~ Population + Income + Illiteracy + 
     Frost, data = zstates)
 coef(zfit)
A more refined option is the relweights() function below, which decomposes the model's R-square into the percentage contributed by each predictor (via an orthogonal transformation of the predictor correlation matrix):

relweights <- function(fit, ...) {
     R <- cor(fit$model)
     nvar <- ncol(R)
     rxx <- R[2:nvar, 2:nvar]
     rxy <- R[2:nvar, 1]
     svd <- eigen(rxx)
     evec <- svd$vectors
     ev <- svd$values
     delta <- diag(sqrt(ev))
     
     # correlations between original predictors and new orthogonal variables
     lambda <- evec %*% delta %*% t(evec)
     lambdasq <- lambda^2
     
     # regression coefficients of Y on orthogonal variables
     beta <- solve(lambda) %*% rxy
     rsquare <- colSums(beta^2)
     rawwgt <- lambdasq %*% beta^2
     import <- (rawwgt/rsquare) * 100
     lbls <- names(fit$model[2:nvar])
     rownames(import) <- lbls
     colnames(import) <- "Weights"
     
     # plot results
     barplot(t(import), names.arg = lbls, ylab = "% of R-Square", 
         xlab = "Predictor Variables", main = "Relative Importance of Predictor Variables", 
         sub = paste("R-Square = ", round(rsquare, digits = 3)), 
         ...)
     return(import)
 }
 fit <- lm(Murder ~ Population + Illiteracy + Income + 
     Frost, data = states)
 relweights(fit, col = "lightgrey")