DoubleMLRegressor
- class pgmpy.prediction.DoubleMLRegressor.DoubleMLRegressor(causal_graph, nuisance_estimators=None, effect_estimator=None, n_folds: int = 5, seed: int | None = None)
Bases: _BaseCausalPrediction
Implements the Double Machine Learning Regressor [1] (DML2) with cross-fitting.
This estimator implements the DoubleML algorithm with cross-fitting behind a scikit-learn compatible estimator API. It reads the exposure, outcome, and adjustment variables from a user-specified causal graph and uses them to fit a Double ML regressor and make predictions. The model is defined as follows:
- Given data D: (Y, T, X), where:
Y : outcome variable
T : treatment (exposure) variable
X : adjustment (confounder + pretreatment) variables
The DoubleML fitting procedure consists of three main steps:
1. Sample splitting into n_folds folds for cross-fitting.
2. Fitting two nuisance estimators on each fold and computing residuals from them on each fold:
- Outcome models (outcome_est_): predict Y from X.
- Treatment models (treatment_est_): predict T from X.
- Outcome residuals: Y - outcome_est_.predict(X)
- Treatment residuals: T - treatment_est_.predict(X)
3. Stacking the residuals from all folds and fitting the effect estimator (effect_est_) to predict the outcome residuals from the treatment residuals.
Using the fitted models, predictions on new data (X_new, T_new) are computed as:
res_T_new = T_new - treatment_est_.predict(X_new)
Y_pred = effect_est_(res_T_new) + outcome_est_.predict(X_new)
- Parameters:
- causal_graph : DAG, PDAG, ADMG, MAG, or PAG
Causal graph with defined variable roles. The causal graph must have the following roles: exposures, outcomes, and adjustment. Additionally, pretreatment can be specified.
- nuisance_estimators : an estimator or a tuple of estimators of size 2 (default=LinearRegression)
If a single estimator is provided, it is used for both outcome and treatment nuisance models.
If a tuple of two estimators is provided, the first one is used for the treatment model and the second for the outcome model.
If None, defaults to LinearRegression for both models.
- effect_estimator : estimator-like (default=LinearRegression)
Estimator for the final effect estimation step. Must have a fit method and a predict method. If None, defaults to LinearRegression.
- n_folds : int, default=5
Number of folds to use for cross-fitting. If 1, doesn’t perform cross-fitting and computes in-sample residuals.
- seed : int or None
Random seed for cross-fitting splits.
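One way to read the nuisance_estimators, n_folds, and seed parameters is sketched below. The helpers resolve_nuisance and make_folds are hypothetical and are not part of pgmpy. They only mirror the rules stated above: a single estimator serves both nuisance models, a tuple is read as (treatment, outcome), None falls back to a default, a fixed seed makes the split reproducible, and n_folds=1 means in-sample residuals. The copying of a single estimator and the exact shuffling scheme are assumptions.

```python
import copy
import random


def resolve_nuisance(nuisance_estimators, default):
    """Return (treatment_estimator, outcome_estimator) following the documented rules."""
    if nuisance_estimators is None:
        nuisance_estimators = default
    if isinstance(nuisance_estimators, tuple):
        if len(nuisance_estimators) != 2:
            raise ValueError("Expected a tuple of exactly two estimators.")
        # Documented order: first is the treatment model, second the outcome model.
        treatment, outcome = nuisance_estimators
        return treatment, outcome
    # A single estimator is used for both models; copy it so the two fits
    # do not share state (assumption about how sharing is avoided).
    return copy.deepcopy(nuisance_estimators), copy.deepcopy(nuisance_estimators)


def make_folds(n_samples, n_folds=5, seed=None):
    """Return a list of (train_indices, test_indices) pairs."""
    indices = list(range(n_samples))
    if n_folds == 1:
        # No cross-fitting: train and evaluate on the same samples.
        return [(indices, indices)]
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    folds = [sorted(shuffled[k::n_folds]) for k in range(n_folds)]
    return [(sorted(set(indices) - set(test)), test) for test in folds]


splits = make_folds(10, n_folds=5, seed=0)
repeat = make_folds(10, n_folds=5, seed=0)
in_sample = make_folds(4, n_folds=1)
pair = resolve_nuisance(("treatment_model", "outcome_model"), default="linear")
print(splits == repeat, in_sample, pair)
```

Strings stand in for estimator objects in the usage lines so that the sketch runs without scikit-learn.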
- Attributes:
- n_folds_ : int
Number of folds used in cross-fitting.
- n_features_in_ : int
Number of features seen during fit.
- n_samples_ : int
Number of samples seen during fit.
- exposure_var_ : str
Name of the exposure (treatment) variable.
- outcome_var_ : str
Name of the outcome variable.
- adjustment_vars_ : list of str
Names of adjustment (confounder) variables.
- pretreatment_vars_ : list of str
Names of pretreatment variables.
- feature_columns_fit_ : list of str
Names of features used in the model.
- outcome_est_ : estimator-like or list of estimator-like
Fitted outcome nuisance model(s).
- treatment_est_ : estimator-like or list of estimator-like
Fitted treatment nuisance model(s).
- effect_est_ : estimator-like
Fitted final effect estimator.
Notes
While the implementation allows the effect estimator to be any scikit-learn compatible estimator, the theoretical guarantees for DoubleML hold when the effect estimator is a linear model (such as LinearRegression). Using non-linear effect estimators may lead to biased estimates.
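The role of linearity can be seen from the partially linear model analysed in [1]. The notation below (θ, g, m, ε, η) is introduced here for illustration and does not correspond to names in the pgmpy API.

```latex
\begin{aligned}
Y &= \theta\, T + g(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid T, X] &= 0,\\
T &= m(X) + \eta, & \mathbb{E}[\eta \mid X] &= 0,\\
Y - \mathbb{E}[Y \mid X] &= \theta\,\bigl(T - \mathbb{E}[T \mid X]\bigr) + \varepsilon .
\end{aligned}
```

The third line follows by subtracting E[Y | X] = θ m(X) + g(X) from the first. Under this model, the slope of a linear regression of outcome residuals on treatment residuals is the effect θ, which is the quantity a linear effect estimator targets.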
References
[1] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
Examples
>>> # Example 1: With adjustments and cross-fitting
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from pgmpy.base.DAG import DAG
>>> from pgmpy.prediction import DoubleMLRegressor

>>> # Simulate data from a linear Gaussian BN that we use to estimate the causal effect from.
>>> lgbn = DAG.from_dagitty(
...     "dag { X -> T [beta=0.2] X -> Y [beta=0.3] T -> Y [beta=0.4] }"
... )
>>> data = lgbn.simulate(n_samples=1000, seed=42)
>>> X = data.loc[:, ["X", "T"]]
>>> y = data["Y"]

>>> # Construct a DAG (roles must match DataFrame column names)
>>> dag = DAG(
...     lgbn.edges(), roles={"exposures": "T", "adjustment": "X", "outcomes": "Y"}
... )
>>> dml = DoubleMLRegressor(
...     causal_graph=dag,
...     nuisance_estimators=LinearRegression(),
...     effect_estimator=LinearRegression(),
...     n_folds=3,
... )
>>> dml = dml.fit(X, y)
>>> dml.effect_est_
LinearRegression()
>>> dml.effect_est_.coef_.round(1)
array([0.4])

>>> preds = dml.predict(X.iloc[:5])
>>> preds.shape
(5,)

>>> dml.n_folds_
3
>>> dml.n_samples_
1000
- fit(X, y, sample_weight: Any | None = None)
Fit the DoubleML model using the provided data.
- Parameters:
- X : pandas.DataFrame or numpy.ndarray
Feature data containing exposure and adjustment variables. If a numpy array is provided, it is converted to a dataframe with column names starting from 0.
- y : pandas.Series, pandas.DataFrame, or numpy.ndarray
Outcome variable. If a DataFrame is provided, it must have a single column.
- sample_weight : array-like of shape (n_samples,), optional
Sample weights to be used in fitting the nuisance and effect estimators.
- Returns:
- self : object
Fitted estimator.
- predict(X)
Makes conditional interventional (CATE) predictions using the fitted DoubleML model.
- Parameters:
- X : pandas.DataFrame
Feature data containing the exposure and adjustment variables for which to make predictions.
- Returns:
- outcome_pred : numpy.ndarray
Predicted outcome values.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> DoubleMLRegressor
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
- Returns:
- self : object
The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> DoubleMLRegressor
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
- sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in score.
- Returns:
- self : object
The updated object.