DoubleMLRegressor

class pgmpy.prediction.DoubleMLRegressor.DoubleMLRegressor(causal_graph, nuisance_estimators=None, effect_estimator=None, n_folds: int = 5, seed: int | None = None)

Bases: _BaseCausalPrediction

Implements the Double Machine Learning Regressor[1] (DML2) with cross-fitting.

This estimator implements the DoubleML algorithm with cross-fitting behind a scikit-learn compatible estimator API. It extracts the exposure, outcome, and adjustment variables from a user-specified causal graph and uses them to fit and predict with a Double ML regressor. The model is defined as follows:

Given data D: (Y, T, X), where:

- Y: outcome variable
- T: treatment (exposure) variable
- X: adjustment (confounder + pretreatment) variables

The DoubleML fitting procedure consists of three main steps:

1. Sample splitting into n_folds folds for cross-fitting.
2. Fitting two nuisance estimators on each fold:
   - Outcome models (outcome_est_): predict Y from X.
   - Treatment models (treatment_est_): predict T from X.

   Then computing residuals using the nuisance estimators on each fold:
   - Outcome residuals: Y - outcome_est_.predict(X)
   - Treatment residuals: T - treatment_est_.predict(X)
3. Stacking the residuals from the folds together and fitting the effect estimator (effect_est_) to predict the outcome residuals from the treatment residuals.

Using the fitted models, predictions on new data (X_new, T_new) are computed as:

res_T_new = T_new - treatment_est_.predict(X_new)
Y_pred = effect_est_.predict(res_T_new) + outcome_est_.predict(X_new)
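The fitting and prediction steps above can be sketched with plain scikit-learn and NumPy. This is a minimal illustration of cross-fitted residual-on-residual regression, not pgmpy's implementation: the synthetic data, the variable names, the choice to fit nuisance models on the training folds and compute residuals on the held-out fold, and the averaging of fold models at prediction time are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: X -> T, X -> Y, T -> Y with a true effect of 0.4 (illustrative only).
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))
T = 0.2 * X[:, 0] + rng.normal(size=n)
Y = 0.3 * X[:, 0] + 0.4 * T + rng.normal(size=n)

nuisance_template = LinearRegression()

# Step 1: split the sample into folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

res_Y = np.empty(n)
res_T = np.empty(n)
outcome_models, treatment_models = [], []

for train_idx, test_idx in kfold.split(X):
    # Step 2a: fit both nuisance models on the training part of the split.
    outcome_est = clone(nuisance_template).fit(X[train_idx], Y[train_idx])
    treatment_est = clone(nuisance_template).fit(X[train_idx], T[train_idx])
    outcome_models.append(outcome_est)
    treatment_models.append(treatment_est)

    # Step 2b: compute residuals on the held-out fold.
    res_Y[test_idx] = Y[test_idx] - outcome_est.predict(X[test_idx])
    res_T[test_idx] = T[test_idx] - treatment_est.predict(X[test_idx])

# Step 3: regress the stacked outcome residuals on the stacked treatment residuals.
effect_est = LinearRegression().fit(res_T.reshape(-1, 1), res_Y)
theta_hat = effect_est.coef_[0]


def predict_new(X_new, T_new):
    """Predict Y for new rows; fold models are averaged (an assumption of this sketch)."""
    t_hat = np.mean([m.predict(X_new) for m in treatment_models], axis=0)
    y_hat = np.mean([m.predict(X_new) for m in outcome_models], axis=0)
    res_T_new = T_new - t_hat
    return effect_est.predict(res_T_new.reshape(-1, 1)) + y_hat


print(round(theta_hat, 2))
print(predict_new(X[:5], T[:5]).shape)
```

With a linear effect estimator, the fitted coefficient plays the role of the estimated treatment effect, which is why the example inspects `coef_`.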

Parameters:

causal_graph : DAG, PDAG, ADMG, MAG, or PAG

Causal graph with defined variable roles. The causal graph must have the following roles: exposures, outcomes, and adjustment. Additionally, pretreatment can be specified.

nuisance_estimators : estimator or tuple of two estimators, default=LinearRegression

If a single estimator is provided, it is used for both outcome and treatment nuisance models.

If a tuple of two estimators is provided, the first one is used for the treatment model and the second for the outcome model.

If None, defaults to LinearRegression for both models.

effect_estimator : estimator-like, default=LinearRegression

Estimator for the final effect estimation step. Must have a fit method and a predict method. If None, defaults to LinearRegression.

n_folds : int, default=5

Number of folds to use for cross-fitting. If 1, cross-fitting is skipped and the residuals are computed in-sample.

seed : int or None, default=None

Random seed for cross-fitting splits.
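The difference between `n_folds=1` (in-sample residuals) and cross-fitting can be illustrated without pgmpy. The sketch below uses a deliberately overfitting learner so the contrast is visible; the data, the decision-tree learner, and the assumption that the seed ends up as the `random_state` of a shuffled `KFold` are illustrative choices, not statements about pgmpy's internals.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
T = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)

# A fully grown tree memorises its training rows.
learner = DecisionTreeRegressor(random_state=0)

# In-sample residuals (the n_folds=1 case): fit and predict on the same rows.
in_sample_res = T - learner.fit(X, T).predict(X)

# Cross-fitted residuals: every row is predicted by a model that never saw it.
seed = 42  # hypothetical seed, used here as KFold's random_state
folds = KFold(n_splits=5, shuffle=True, random_state=seed)
cross_fit_res = T - cross_val_predict(learner, X, T, cv=folds)

print(np.var(in_sample_res), np.var(cross_fit_res))

# The same seed reproduces the same splits.
splits_a = [test for _, test in folds.split(X)]
splits_b = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=seed).split(X)]
same_splits = all(np.array_equal(a, b) for a, b in zip(splits_a, splits_b))
```

The in-sample residuals collapse towards zero because the tree has memorised the data, which would leave the effect estimator nothing to learn from. Cross-fitting avoids this when flexible nuisance learners are used.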

Attributes:

n_folds_ (int): Number of folds used in cross-fitting.
n_features_in_ (int): Number of features seen during fit.
n_samples_ (int): Number of samples seen during fit.
exposure_var_ (str): Name of the exposure (treatment) variable.
outcome_var_ (str): Name of the outcome variable.
adjustment_vars_ (list of str): Names of adjustment (confounder) variables.
pretreatment_vars_ (list of str): Names of pretreatment variables.
feature_columns_fit_ (list of str): Names of features used in the model.
outcome_est_ (estimator-like or list of estimator-like): Fitted outcome nuisance model(s).
treatment_est_ (estimator-like or list of estimator-like): Fitted treatment nuisance model(s).
effect_est_ (estimator-like): Fitted final effect estimator.

Notes

While the implementation allows the effect estimator to be any sklearn-compatible estimator, the theoretical guarantees for DoubleML hold when the effect estimator is a linear model (such as LinearRegression). Using non-linear effect estimators may lead to biased estimates.
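For context, the linear case corresponds to the partially linear model studied in [1]. The notation below (θ, g, m, ℓ, and the tilde residuals) is introduced here for illustration and does not appear in the pgmpy API. It is a sketch of the standard partialling-out estimator, not a description of pgmpy's internals.

```latex
\begin{aligned}
Y &= \theta\, T + g(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid T, X] &= 0,\\
T &= m(X) + \eta, & \mathbb{E}[\eta \mid X] &= 0,\\[4pt]
\tilde{Y}_i &= Y_i - \hat{\ell}(X_i), & \hat{\ell}(X) &\approx \mathbb{E}[Y \mid X],\\
\tilde{T}_i &= T_i - \hat{m}(X_i), & \hat{m}(X) &\approx \mathbb{E}[T \mid X],\\[4pt]
\hat{\theta} &= \Bigl(\sum_{i} \tilde{T}_i^{\,2}\Bigr)^{-1} \sum_{i} \tilde{T}_i\, \tilde{Y}_i .
\end{aligned}
```

Here the outcome model estimates ℓ and the treatment model estimates m. The last line is a least-squares regression without intercept of the pooled outcome residuals on the pooled treatment residuals; a LinearRegression effect estimator additionally fits an intercept, which should be close to zero when the nuisance models are well specified.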

References

[1] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.

Examples

>>> # Example 1: With adjustments and cross-fitting
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from pgmpy.base.DAG import DAG
>>> from pgmpy.prediction import DoubleMLRegressor

>>> # Simulate data from a linear Gaussian BN; the causal effect is estimated from this data.
>>> lgbn = DAG.from_dagitty(
...     "dag { X -> T [beta=0.2] X -> Y [beta=0.3] T -> Y [beta=0.4] }"
... )
>>> data = lgbn.simulate(n_samples=1000, seed=42)
>>> X = data.loc[:, ["X", "T"]]
>>> y = data["Y"]

>>> # construct a DAG (roles must match DataFrame column names)
>>> dag = DAG(
...     lgbn.edges(), roles={"exposures": "T", "adjustment": "X", "outcomes": "Y"}
... )
>>> dml = DoubleMLRegressor(
...     causal_graph=dag,
...     nuisance_estimators=LinearRegression(),
...     effect_estimator=LinearRegression(),
...     n_folds=3,
... )
>>> dml = dml.fit(X, y)
>>> dml.effect_est_
LinearRegression()
>>> dml.effect_est_.coef_.round(1)
array([0.4])

>>> preds = dml.predict(X.iloc[:5])
>>> preds.shape
(5,)

>>> dml.n_folds_
3
>>> dml.n_samples_
1000

fit(X, y, sample_weight: Any | None = None)

Fit the DoubleML model using the provided data.

Parameters:

X (pandas.DataFrame or numpy.ndarray): Feature data containing exposure and adjustment variables. If a numpy array is provided, it is converted to a dataframe with column names starting from 0.
y (pandas.Series, pandas.DataFrame, or numpy.ndarray): Outcome variable. If a DataFrame is provided, it must have a single column.
sample_weight (array-like of shape (n_samples,), optional): Sample weights to be used in fitting the nuisance and effect estimators.

Returns:

self (object): Fitted estimator.
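The input conversions described for `fit` can be previewed with pandas alone. This sketch only shows what a default array-to-DataFrame conversion produces and how a single-column DataFrame maps to a Series. It assumes the estimator relies on standard pandas behaviour, which the text implies but does not spell out.

```python
import numpy as np
import pandas as pd

# A 2-column feature array, e.g. one adjustment variable and one exposure.
X_array = np.array([[0.1, 1.0], [0.2, 0.0], [0.3, 1.0]])

# Default conversion: columns are labelled with integers starting from 0.
X_df = pd.DataFrame(X_array)
print(list(X_df.columns))  # [0, 1]

# A single-column outcome DataFrame carries the same values as a Series.
y_df = pd.DataFrame({"Y": [1.0, 2.0, 3.0]})
assert y_df.shape[1] == 1, "outcome DataFrame must have exactly one column"
y_series = y_df.iloc[:, 0]
print(y_series.name)  # Y
```

Since array columns receive integer labels rather than names, passing a DataFrame whose column names match the variables in the causal graph is likely the less error-prone option.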

predict(X)

Makes conditional interventional (CATE) predictions using the fitted DoubleML model.

Parameters:

X (pandas.DataFrame): Feature data containing the exposure and adjustment variables for which to make predictions.

Returns:

outcome_pred (numpy.ndarray): Predicted outcome values.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DoubleMLRegressor

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in fit.

Returns:

self (object): The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DoubleMLRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in score.

Returns:

self (object): The updated object.