Reference: 【回归分析】 (Regression Analysis), lectures by Prof. 黄冠华 (Huang Guanhua), National Chiao Tung University, Taiwan
- Goal: to test how well the model fits the observed data.
- In linear regression, the coefficient of determination, which represents the fraction of the total variation in the data explained by the model, can be used as a goodness-of-fit measure.
- In logistic regression, the coefficient of determination is not a valid goodness-of-fit measure, so we need a different quantity on which to base a goodness-of-fit test.
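- Model assumptions in logistic regression
Independence: this matters through the likelihood function. If the observations are not independent, the standard errors and significance tests derived from the likelihood function are all wrong.
Linearity: the log odds are linear in the parameters.
In this case linearity is on the logit scale, meaning that the log odds increase by the same amount with every unit increase in x. This is the same as saying that the odds ratio between x and x+1 is the same no matter what x is.
No interaction: assuming no interaction effects means assuming that the odds ratio for one variable is constant across the levels of the other.
- Saturated model: the most complex model and a complete description of the raw data. It needs no additional assumptions, and predictions made with it are the same as predictions made from the raw data.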
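- How can we tell whether a model is saturated?
If the number of groups in the data = the number of unknown parameters in the model, then the model is saturated.
- Recap: a goodness-of-fit test asks whether the new model fits the original data closely. The saturated model fits the original data exactly, so the question becomes: is the new model close to the saturated model? If it is, the new model is good; otherwise it is not.
For grouped data, goodness of fit amounts to comparing the model we have with the saturated model (since the data can be exactly reproduced by the saturated model).
This is then equivalent to testing whether enough interaction effects have been included in the model (since the saturated model is the model with all possible interactions).
- How do we compare two models?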
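Use a test statistic that compares the observed counts with the counts predicted by the fitted model.
Two forms of goodness-of-fit test are commonly used with logistic regression, where sums are taken over risk factor-confounder combinations:
Form 1 (Pearson chi-square):
$$X^2 = \sum \frac{(O - E)^2}{E}$$
where $O$ are the numbers of observations in each cell, and $E$ are the predicted numbers of observations based on the fitted model.
Form 2 (deviance):
$$G^2 = 2 \sum O \log\frac{O}{E}$$
- A small example: below, x1 is a binary variable, D21 and D31 are dummy variables, and y is the binary dependent variable. Write down the saturated model and the new model to be tested.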
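As a sketch of what the two models in this example could look like, assume D21 and D31 are the two dummy variables of a single three-level factor, so that together with the binary x1 the data fall into 2 × 3 = 6 groups. The coefficient names $\beta_j$ are notation introduced here, and the main-effects form of the model under test is an assumption made for illustration, not something the notes state.

$$
\begin{aligned}
\text{saturated model:}\quad \operatorname{logit} P(y=1) &= \beta_0 + \beta_1 x_1 + \beta_2 D_{21} + \beta_3 D_{31} + \beta_4 x_1 D_{21} + \beta_5 x_1 D_{31} \\
\text{model under test:}\quad \operatorname{logit} P(y=1) &= \beta_0 + \beta_1 x_1 + \beta_2 D_{21} + \beta_3 D_{31}
\end{aligned}
$$

Under these assumptions the saturated model has 6 unknown parameters for 6 groups, which matches the rule for recognizing a saturated model, and the two models differ only in the interaction terms.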
From the fitted model, compute the probability of y=1 that enters the formula, i.e. the fitted probability $\hat{p} = \hat{P}(y=1)$.
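The grouped-data calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the two statistics are the usual Pearson chi-square and deviance sums over cells. The function names, the group sizes, the event counts and the fitted probabilities are all made up for the example and do not come from the lecture.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson-type statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def deviance_g2(observed, expected):
    """Deviance-type statistic: 2 * sum of O * log(O / E).

    Cells with O = 0 contribute 0 (the limit of O * log(O / E) as O -> 0).
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical grouped data: three covariate patterns.
n = [50, 40, 60]             # cases per group (made up)
events = [10, 16, 33]        # cases with y = 1 per group (made up)
p_hat = [0.22, 0.38, 0.55]   # fitted P(y = 1) per group (made up, not from a real fit)

# Each group contributes two cells: the y = 1 cell and the y = 0 cell.
observed = events + [ni - yi for ni, yi in zip(n, events)]
expected = ([ni * pi for ni, pi in zip(n, p_hat)]
            + [ni * (1 - pi) for ni, pi in zip(n, p_hat)])

x2 = pearson_chi2(observed, expected)
g2 = deviance_g2(observed, expected)
print(f"Pearson X^2 = {x2:.4f}, deviance G^2 = {g2:.4f}")
```

Both statistics are zero when the predicted counts equal the observed counts, which is exactly the saturated-model case. For grouped data they are usually referred to a chi-square distribution whose degrees of freedom are the number of groups minus the number of fitted parameters.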
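The above applies to grouped data. Goodness of fit for individual data works as follows.
The distinguishing feature is that the independent variables are continuous, so the data are not grouped.
Approach: use the Hosmer and Lemeshow method to form groups.
Rough steps of the Hosmer and Lemeshow grouping:
Using the fitted model above and the known covariates of each case, compute the corresponding probability that y=1.
Then sort the cases by this computed probability, from smallest to largest.
With 10 groups, the cases in the lowest 0%-10% form group 1, those in 10%-20% form group 2, ..., and those in 90%-100% form group 10.
After grouping, how is the predicted number of observations (the expected count) for the first group computed? Average the fitted probabilities of the cases in the first group and multiply by that group's sample size.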
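The grouping steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `hosmer_lemeshow` is hypothetical, the data are simulated, and the statistic is assumed to be a Pearson-type sum over the y=1 and y=0 cells of each group.

```python
import random

def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Group cases by fitted probability and return (statistic, table).

    y      : list of 0/1 outcomes
    p_hat  : list of fitted P(y = 1), one per case
    table  : list of (group size, observed events, expected events)
    """
    # Sort case indices by fitted probability, smallest first.
    order = sorted(range(len(y)), key=lambda i: p_hat[i])
    n = len(order)
    stat, table = 0.0, []
    for g in range(n_groups):
        # Group g holds the cases between the g/G and (g+1)/G quantiles.
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue
        size = len(idx)
        observed = sum(y[i] for i in idx)
        # Expected events = mean fitted probability * group size.
        mean_p = sum(p_hat[i] for i in idx) / size
        expected = mean_p * size
        # Combined contribution of the y = 1 and y = 0 cells of this group
        # (assumes 0 < mean_p < 1, otherwise the denominator is zero).
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
        table.append((size, observed, expected))
    return stat, table

# Simulated example: outcomes drawn from the "fitted" probabilities themselves,
# so the model is correct by construction.
rng = random.Random(0)
p_hat = [0.05 + 0.9 * rng.random() for _ in range(200)]
y = [1 if rng.random() < p else 0 for p in p_hat]

stat, table = hosmer_lemeshow(y, p_hat, n_groups=10)
for size, obs, exp in table:
    print(f"n={size:3d}  observed={obs:3d}  expected={exp:6.2f}")
print(f"Hosmer-Lemeshow statistic = {stat:.3f}")
```

The statistic is conventionally compared with a chi-square distribution on `n_groups - 2` degrees of freedom. Ties in the fitted probabilities and the exact choice of cut points are handled differently across software, so results from this sketch may not match a given package.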
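The goodness-of-fit test above is an overall test: not rejecting the null hypothesis means the model fits well, while rejecting it means the model fits poorly. It does not say where the fit is poor or which assumption has failed, so we turn to residual analysis.
The residual plots of ordinary linear regression no longer work in logistic regression: the outcome only takes the values 0 or 1, so the plotted residuals fall into separate bands instead of scattering evenly, and the plot cannot be read in the usual way. Pearson residuals are used instead.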
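The notes stop before defining the Pearson residual, so the sketch below uses the standard textbook definition: for a group with $n$ cases, $y$ observed events and fitted probability $\hat{p}$, the residual is $(y - n\hat{p}) / \sqrt{n\hat{p}(1-\hat{p})}$, and an individual case is the special case $n = 1$. The function name and the numbers are hypothetical.

```python
import math

def pearson_residuals(events, n, p_hat):
    """Pearson residual per group: (y - n*p) / sqrt(n*p*(1 - p))."""
    return [(y - m * p) / math.sqrt(m * p * (1 - p))
            for y, m, p in zip(events, n, p_hat)]

# Hypothetical grouped data (made up, not from a real fit).
n = [50, 40, 60]
events = [10, 16, 33]
p_hat = [0.22, 0.38, 0.55]

res = pearson_residuals(events, n, p_hat)
for i, r in enumerate(res, start=1):
    print(f"group {i}: Pearson residual = {r:+.3f}")

# The squared residuals add up to the Pearson chi-square statistic,
# so each residual shows how much one group contributes to the overall lack of fit.
print(f"sum of squared residuals = {sum(r * r for r in res):.4f}")
```

Unlike the overall test, the residuals point to the groups or cases where observed and fitted values disagree most. A common rule of thumb is to look more closely at residuals with absolute value above roughly 2.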