0x01. 背景

本次实验是使用Lalonde数据集在DoWhy中的因果推断的探索。这项研究考察了职业培训项目(treatment)在完成几年后对个人实际收入的影响。数据包括一些人口统计学变量(年龄、种族、学术背景和以前的实际收入),这些数据作为common cause,以1978年的实际收入(数据中字段re78为outcome)。

数据解释如下:
age: age in years.

educ: years of schooling.

black: indicator variable for blacks.

hisp: indicator variable for Hispanics.

married: indicator variable for martial status.

nodegr: indicator variable for high school diploma.

re74: real earnings in 1974.

re75: real earnings in 1975.

re78: real earnings in 1978.

u74: indicator variable for earnings in 1974 being zero.

u75: indicator variable for earnings in 1975 being zero.

treat: an indicator variable for treatment status.

0x02. 开始实验

0x02_1. 导入数据

import dowhy.datasets

lalonde = dowhy.datasets.lalonde_dataset()
lalonde.head()

数据的输出如下:

treat

age

educ

black

hisp

married

nodegr

re74

re75

re78

u74

u75

0

False

23.0

10.0

1.0

0.0

0.0

1.0

0.0

0.0

0.000

1.0

1.0

1

False

26.0

12.0

0.0

0.0

0.0

0.0

0.0

0.0

12383.680

1.0

1.0

2

False

22.0

9.0

1.0

0.0

0.0

1.0

0.0

0.0

0.000

1.0

1.0

3

False

18.0

9.0

1.0

0.0

0.0

1.0

0.0

0.0

10740.080

1.0

1.0

4

False

45.0

11.0

1.0

0.0

0.0

1.0

0.0

0.0

11796.470

1.0

1.0

0x02_2. Model

from dowhy import CausalModel


model=CausalModel(
        data = lalonde,
        treatment='treat',
        outcome='re78', # re78表示78年的收入
        common_causes= ['nodegr', 'black', 'hisp', 'age', 'educ', 'married'])
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

输出如下:

因果推断dowhy之-Lalonde数据集上的案例学习_学习

0x02_3. Identify

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

输出结果如下:

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d                                                
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]                                            
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

0x02_4. Estimate

estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_weighting",
                                 target_units="ate",
                                 method_params={"weighting_scheme":"ips_weight"})
print(estimate)

输出结果如下:

*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d                                                
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]                                            
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)

## Realized estimand
b: re78~treat+educ+hisp+nodegr+age+black+married
Target units: ate

## Estimate
Mean value: 1639.7970160320529

0x02_5. Refute

Random Common Cause

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)

print(lalonde_refute_random_common_cause)

输出结果如下:

Refute: Add a random common cause
Estimated effect:1639.7970160320529
New effect:1639.7970160320524
p value:2.0

Replace Treatment with Placebo

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute"
)

print(lalonde_refute_random_common_cause)

输出结果如下:

Refute: Use a Placebo Treatment
Estimated effect:1639.7970160320529
New effect:-178.33720174019228
p value:0.72

Remove Random Subset of Data

lalonde_refute_random_common_cause = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.9
)

print(lalonde_refute_random_common_cause)

输出结果如下:

Refute: Use a subset of data
Estimated effect:1639.7970160320529
New effect:1664.4859682446834
p value:0.82

0x03. 使用statsmodels观测统计数据

原notebooke不知道为什么要使用该包观测数据,就当扩充知识面了吧。

import statsmodels.formula.api as smf
reg=smf.wls('re78~1+treat', data=lalonde, weights=lalonde.ips_stabilized_weight)
res=reg.fit()
res.summary()

效果如下:

WLS Regression Results                            
==============================================================================
Dep. Variable:                   re78   R-squared:                       0.015
Model:                            WLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     6.743
Date:                Tue, 21 Mar 2023   Prob (F-statistic):            0.00973
Time:                        21:08:53   Log-Likelihood:                -4544.7
No. Observations:                 445   AIC:                             9093.
Df Residuals:                     443   BIC:                             9102.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      4555.0763    406.703     11.200      0.000    3755.769    5354.384
treat[T.True]  1639.7970    631.495      2.597      0.010     398.699    2880.895
==============================================================================
Omnibus:                      303.262   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4770.534
Skew:                           2.709   Prob(JB):                         0.00
Kurtosis:                      18.097   Cond. No.                         2.47
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

对没有使用weight和使用weight在married因素上针对treated和untreated的数据进行可视化:

estimate.interpret(method_name="confounder_distribution_interpreter",var_type='discrete',
                   var_name='married', fig_size = (10, 7), font_size = 12)

效果如图:

因果推断dowhy之-Lalonde数据集上的案例学习_python_02


手动模拟IPW,计算最终的ATE结果。

df = model._data
ps = df['propensity_score']
y = df['re78']
z = df['treat']

ey1 = z*y/ps / sum(z/ps)
ey0 = (1-z)*y/(1-ps) / sum((1-z)/(1-ps))
ate = ey1.sum()-ey0.sum()
print("Causal Estimate is " + str(ate))

# correct -> Causal Estimate is 1634.9868359746906

输出效果如下:

Causal Estimate is 1639.7970160320492