因果推断dowhy之-Lalonde数据集上的案例学习

原创

l8947943s 2023-03-25 07:42:04 ©著作权

文章标签 学习 python 开发语言 ide 数据 文章分类 JavaScript 前端开发

©著作权归作者所有：来自51CTO博客作者l8947943s的原创作品，请联系作者获取转载授权，否则将追究法律责任

0x01. 背景

本次实验是使用Lalonde数据集在DoWhy中的因果推断的探索。这项研究考察了职业培训项目(treatment)在完成几年后对个人实际收入的影响。数据包括一些人口统计学变量(年龄、种族、学术背景和以前的实际收入)，这些数据作为common cause，以1978年的实际收入(数据中字段re78为outcome)。

数据解释如下：
age: age in years.

educ: years of schooling.

black: indicator variable for blacks.

hisp: indicator variable for Hispanics.

married: indicator variable for martial status.

nodegr: indicator variable for high school diploma.

re74: real earnings in 1974.

re75: real earnings in 1975.

re78: real earnings in 1978.

u74: indicator variable for earnings in 1974 being zero.

u75: indicator variable for earnings in 1975 being zero.

treat: an indicator variable for treatment status.

0x02. 开始实验

0x02_1. 导入数据

import dowhy.datasets

lalonde = dowhy.datasets.lalonde_dataset()
lalonde.head()

数据的输出如下：

	treat	age	educ	black	hisp	married	nodegr	re74	re75	re78	u74	u75
0	False	23.0	10.0	1.0	0.0	0.0	1.0	0.0	0.0	0.000	1.0	1.0
1	False	26.0	12.0	0.0	0.0	0.0	0.0	0.0	0.0	12383.680	1.0	1.0
2	False	22.0	9.0	1.0	0.0	0.0	1.0	0.0	0.0	0.000	1.0	1.0
3	False	18.0	9.0	1.0	0.0	0.0	1.0	0.0	0.0	10740.080	1.0	1.0
4	False	45.0	11.0	1.0	0.0	0.0	1.0	0.0	0.0	11796.470	1.0	1.0

0x02_2. Model

from dowhy import CausalModel


model=CausalModel(
        data = lalonde,
        treatment='treat',
        outcome='re78', # re78表示78年的收入
        common_causes= ['nodegr', 'black', 'hisp', 'age', 'educ', 'married'])
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

输出如下：

因果推断dowhy之-Lalonde数据集上的案例学习_学习

0x02_3. Identify

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

输出结果如下：

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d                                                
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]                                            
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

0x02_4. Estimate

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_weighting",
                                 target_units="ate",
                                 method_params={"weighting_scheme":"ips_weight"})
print(estimate)

输出结果如下：

*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d                                                
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]                                            
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)

## Realized estimand
b: re78~treat+educ+hisp+nodegr+age+black+married
Target units: ate

## Estimate
Mean value: 1639.7970160320529

0x02_5. Refute

Random Common Cause

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)

print(lalonde_refute_random_common_cause)

输出结果如下：

Refute: Add a random common cause
Estimated effect:1639.7970160320529
New effect:1639.7970160320524
p value:2.0

Replace Treatment with Placebo

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute"
)

print(lalonde_refute_random_common_cause)

输出结果如下：

Refute: Use a Placebo Treatment
Estimated effect:1639.7970160320529
New effect:-178.33720174019228
p value:0.72

Remove Random Subset of Data

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.9
)

print(lalonde_refute_random_common_cause)

输出结果如下：

Refute: Use a subset of data
Estimated effect:1639.7970160320529
New effect:1664.4859682446834
p value:0.82

0x03. 使用statsmodels观测统计数据

原notebooke不知道为什么要使用该包观测数据，就当扩充知识面了吧。

import statsmodels.formula.api as smf
reg=smf.wls('re78~1+treat', data=lalonde, weights=lalonde.ips_stabilized_weight)
res=reg.fit()
res.summary()

效果如下：

WLS Regression Results                            
==============================================================================
Dep. Variable:                   re78   R-squared:                       0.015
Model:                            WLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     6.743
Date:                Tue, 21 Mar 2023   Prob (F-statistic):            0.00973
Time:                        21:08:53   Log-Likelihood:                -4544.7
No. Observations:                 445   AIC:                             9093.
Df Residuals:                     443   BIC:                             9102.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      4555.0763    406.703     11.200      0.000    3755.769    5354.384
treat[T.True]  1639.7970    631.495      2.597      0.010     398.699    2880.895
==============================================================================
Omnibus:                      303.262   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4770.534
Skew:                           2.709   Prob(JB):                         0.00
Kurtosis:                      18.097   Cond. No.                         2.47
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

对没有使用weight和使用weight在married因素上针对treated和untreated的数据进行可视化：

estimate.interpret(method_name="confounder_distribution_interpreter",var_type='discrete',
                   var_name='married', fig_size = (10, 7), font_size = 12)

效果如图：

因果推断dowhy之-Lalonde数据集上的案例学习_python_02

手动模拟IPW，计算最终的ATE结果。

df = model._data
ps = df['propensity_score']
y = df['re78']
z = df['treat']

ey1 = z*y/ps / sum(z/ps)
ey0 = (1-z)*y/(1-ps) / sum((1-z)/(1-ps))
ate = ey1.sum()-ey0.sum()
print("Causal Estimate is " + str(ate))

# correct -> Causal Estimate is 1634.9868359746906

输出效果如下：