0x01. 案例背景
IHDP(Infant Health and Development Program)就是一个半合成的典型数据集,用于研究 “专家是否家访” 对 “婴儿日后认知测验得分” 之间的关系。原数据集是基于随机控制实验进行的,因此可以获得因果干预效应的groud truth。为了实现观察性研究数据的数据有偏特点,特意从原数据干预组中有偏向性地去除了一部分数据引入选择偏倚。该数据集共包含747个样本(干预组: 139; 控制组: 608), 共包含25个协变量涉及孩童和其母亲的各项属性。
0x02. 开始实验
0x02_1. 导入要使用的包
# importing required libraries
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np
0x02_2. 读取数据
数据的地址为:IHDP数据集。
代码中读取数据会报错,因此我自行将数据整理成csv格式,放在本地读取了。
# 原始数据读取方式,因网络问题读取失败
# data= pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/IHDP/csv/ihdp_npci_1.csv", header = None)
# 使用整理的本地化数据读取
data = pd.read_csv("data/ihdp.csv", header = None)
data
数据的原始格式如下表:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
0 | 1 | 5.599916 | 4.318780 | 3.268256 | 6.854457 | -0.528603 | -0.343455 | 1.128554 | 0.161703 | -0.316603 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 6.875856 | 7.856495 | 6.636059 | 7.562718 | -1.736945 | -1.802002 | 0.383828 | 2.244320 | -0.629189 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 2.996273 | 6.633952 | 1.570536 | 6.121617 | -0.807451 | -0.202946 | -0.360898 | -0.879606 | 0.808706 | … | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 1.366206 | 5.697239 | 1.244738 | 5.889125 | 0.390083 | 0.596582 | -1.850350 | -0.879606 | -0.004017 | … | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1.963538 | 6.202582 | 1.685048 | 6.191994 | -1.045229 | -0.602710 | 0.011465 | 0.161703 | 0.683672 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0x02_3. 数据处理
col = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1" ,]
for i in range(1,26):
col.append("x"+str(i))
data.columns = col
data = data.astype({"treatment":'bool'}, copy=False)
data.head()
数据输出结构如图:
treatment | y_factual | y_cfactual | mu0 | mu1 | x1 | x2 | x3 | x4 | x5 | … | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | |
0 | True | 5.599916 | 4.318780 | 3.268256 | 6.854457 | -0.528603 | -0.343455 | 1.128554 | 0.161703 | -0.316603 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | False | 6.875856 | 7.856495 | 6.636059 | 7.562718 | -1.736945 | -1.802002 | 0.383828 | 2.244320 | -0.629189 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | False | 2.996273 | 6.633952 | 1.570536 | 6.121617 | -0.807451 | -0.202946 | -0.360898 | -0.879606 | 0.808706 | … | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | False | 1.366206 | 5.697239 | 1.244738 | 5.889125 | 0.390083 | 0.596582 | -1.850350 | -0.879606 | -0.004017 | … | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | False | 1.963538 | 6.202582 | 1.685048 | 6.191994 | -1.045229 | -0.602710 | 0.011465 | 0.161703 | 0.683672 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0x02_4. Model
# Create a causal model from the data and given common causes.
model=CausalModel(
data = data,
treatment='treatment',
outcome='y_factual',
common_causes=["x"+str(i) for i in range(1,26)]
)
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
输出结果如图所示:
妈的,这吊图真是画质渣渣。。。图中将至特征作为共同原因,treatment为数据的treatment
,输出为y_factual,共同原因既作用于treatment
又作用于y_factual
0x02_5. Identify
#Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="maximal-adjustment")
print(identified_estimand)
输出结果如下:
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────────(E[y_factual|x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x2
d[treatment]
3,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16])
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→y_factual then P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16,U) = P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
0x02_6. Estimate
0x02_6_1. 使用线性回归
# Estimate the causal effect and compare it with Average Treatment Effect
estimate = model.estimate_effect(identified_estimand,
method_name="backdoor.linear_regression", test_significance=True
)
print(estimate)
print("Causal Estimate is " + str(estimate.value))
data_1 = data[data["treatment"]==1]
data_0 = data[data["treatment"]==0]
print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))
输出结果如下:
*** Causal Estimate ***
## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
────────────(E[y_factual|x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x2
d[treatment]
3,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16])
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→y_factual then P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16,U) = P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16)
## Realized estimand
b: y_factual~treatment+x10+x18+x1+x25+x15+x4+x12+x9+x11+x13+x6+x17+x5+x20+x23+x8+x2+x24+x7+x19+x22+x14+x21+x3+x16
Target units: ate
## Estimate
Mean value: 3.9286717508727156
Causal Estimate is 3.9286717508727156
ATE 4.021121012430829
0x02_6_2. 使用propensity_score_matching Stratification
estimate = model.estimate_effect(identified_estimand,
method_name="backdoor.propensity_score_matching"
)
print("Causal Estimate is " + str(estimate.value))
print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))
输出如下:
Causal Estimate is 3.97913882321704
ATE 4.021121012430829
0x02_6_3. 使用Propensity Score Stratification
estimate = model.estimate_effect(identified_estimand,
method_name="backdoor.propensity_score_stratification", method_params={'num_strata':50, 'clipping_threshold':5}
)
print("Causal Estimate is " + str(estimate.value))
print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))
输出如下:
Causal Estimate is 3.4550471588628207
ATE 4.021121012430829
0x02_6_4. 使用Propensity Score Weighting
estimate = model.estimate_effect(identified_estimand,
method_name="backdoor.propensity_score_weighting"
)
print("Causal Estimate is " + str(estimate.value))
print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))
输出为:
Causal Estimate is 3.9286717508727156
ATE 4.021121012430829
0x02_7. Refute
0x02_7_1. 使用random_common_cause
refute_results=model.refute_estimate(identified_estimand, estimate,
method_name="random_common_cause")
print(refute_results)
输出结果如下:
Refute: Add a random common cause
Estimated effect:3.4550471588628207
New effect:3.4550471588628215
p value:2.0
0x02_7_2. 使用placebo_treatment_refuter
res_placebo=model.refute_estimate(identified_estimand, estimate,
method_name="placebo_treatment_refuter")
print(res_placebo)
输出结果如下:
Refute: Use a Placebo Treatment
Estimated effect:3.4550471588628207
New effect:-0.00207657716864257
p value:0.88
0x02_7_3. 使用data_subset_refuter
res_subset=model.refute_estimate(identified_estimand, estimate,
method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)
输出结果如下:
Refute: Use a subset of data
Estimated effect:3.4550471588628207
New effect:3.466444805696942
p value:0.9
0x03. 总结
本案例按照标准DoWhy标准流程进行,实现因果的分析的结果。