0x01. 背景
本次实验是使用Lalonde数据集在DoWhy中的因果推断的探索。这项研究考察了职业培训项目(treatment)
在完成几年后对个人实际收入的影响。数据包括一些人口统计学变量(年龄、种族、学术背景和以前的实际收入),这些数据作为common cause
,以1978年的实际收入(数据中字段re78为outcome
)。
数据解释如下:
age: age in years.
educ: years of schooling.
black: indicator variable for blacks.
hisp: indicator variable for Hispanics.
married: indicator variable for martial status.
nodegr: indicator variable for high school diploma.
re74: real earnings in 1974.
re75: real earnings in 1975.
re78: real earnings in 1978.
u74: indicator variable for earnings in 1974 being zero.
u75: indicator variable for earnings in 1975 being zero.
treat: an indicator variable for treatment status.
0x02. 开始实验
0x02_1. 导入数据
import dowhy.datasets
lalonde = dowhy.datasets.lalonde_dataset()
lalonde.head()
数据的输出如下:
treat | age | educ | black | hisp | married | nodegr | re74 | re75 | re78 | u74 | u75 | |
0 | False | 23.0 | 10.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.000 | 1.0 | 1.0 |
1 | False | 26.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12383.680 | 1.0 | 1.0 |
2 | False | 22.0 | 9.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.000 | 1.0 | 1.0 |
3 | False | 18.0 | 9.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 10740.080 | 1.0 | 1.0 |
4 | False | 45.0 | 11.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11796.470 | 1.0 | 1.0 |
0x02_2. Model
from dowhy import CausalModel
model=CausalModel(
data = lalonde,
treatment='treat',
outcome='re78', # re78表示78年的收入
common_causes= ['nodegr', 'black', 'hisp', 'age', 'educ', 'married'])
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
输出如下:
0x02_3. Identify
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
输出结果如下:
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
0x02_4. Estimate
estimate = model.estimate_effect(identified_estimand,
method_name="backdoor.propensity_score_weighting",
target_units="ate",
method_params={"weighting_scheme":"ips_weight"})
print(estimate)
输出结果如下:
*** Causal Estimate ***
## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────(E[re78|educ,hisp,nodegr,age,black,married])
d[treat]
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,educ,hisp,nodegr,age,black,married,U) = P(re78|treat,educ,hisp,nodegr,age,black,married)
## Realized estimand
b: re78~treat+educ+hisp+nodegr+age+black+married
Target units: ate
## Estimate
Mean value: 1639.7970160320529
0x02_5. Refute
Random Common Cause
lalonde_refute_random_common_cause = model.refute_estimate(
identified_estimand,
estimate,
method_name="random_common_cause"
)
print(lalonde_refute_random_common_cause)
输出结果如下:
Refute: Add a random common cause
Estimated effect:1639.7970160320529
New effect:1639.7970160320524
p value:2.0
Replace Treatment with Placebo
lalonde_refute_random_common_cause = model.refute_estimate(
identified_estimand,
estimate,
method_name="placebo_treatment_refuter",
placebo_type="permute"
)
print(lalonde_refute_random_common_cause)
输出结果如下:
Refute: Use a Placebo Treatment
Estimated effect:1639.7970160320529
New effect:-178.33720174019228
p value:0.72
Remove Random Subset of Data
lalonde_refute_random_common_cause = model.refute_estimate(
identified_estimand,
estimate,
method_name="data_subset_refuter",
subset_fraction=0.9
)
print(lalonde_refute_random_common_cause)
输出结果如下:
Refute: Use a subset of data
Estimated effect:1639.7970160320529
New effect:1664.4859682446834
p value:0.82
0x03. 使用statsmodels观测统计数据
原notebooke不知道为什么要使用该包观测数据,就当扩充知识面了吧。
import statsmodels.formula.api as smf
reg=smf.wls('re78~1+treat', data=lalonde, weights=lalonde.ips_stabilized_weight)
res=reg.fit()
res.summary()
效果如下:
WLS Regression Results
==============================================================================
Dep. Variable: re78 R-squared: 0.015
Model: WLS Adj. R-squared: 0.013
Method: Least Squares F-statistic: 6.743
Date: Tue, 21 Mar 2023 Prob (F-statistic): 0.00973
Time: 21:08:53 Log-Likelihood: -4544.7
No. Observations: 445 AIC: 9093.
Df Residuals: 443 BIC: 9102.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 4555.0763 406.703 11.200 0.000 3755.769 5354.384
treat[T.True] 1639.7970 631.495 2.597 0.010 398.699 2880.895
==============================================================================
Omnibus: 303.262 Durbin-Watson: 2.085
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4770.534
Skew: 2.709 Prob(JB): 0.00
Kurtosis: 18.097 Cond. No. 2.47
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
对没有使用weight和使用weight在married因素上针对treated和untreated的数据进行可视化:
estimate.interpret(method_name="confounder_distribution_interpreter",var_type='discrete',
var_name='married', fig_size = (10, 7), font_size = 12)
效果如图:
手动模拟IPW,计算最终的ATE结果。
df = model._data
ps = df['propensity_score']
y = df['re78']
z = df['treat']
ey1 = z*y/ps / sum(z/ps)
ey0 = (1-z)*y/(1-ps) / sum((1-z)/(1-ps))
ate = ey1.sum()-ey0.sum()
print("Causal Estimate is " + str(ate))
# correct -> Causal Estimate is 1634.9868359746906
输出效果如下:
Causal Estimate is 1639.7970160320492