DoubleMLRegressor

class pgmpy.prediction.DoubleMLRegressor.DoubleMLRegressor(causal_graph, nuisance_estimators=None, effect_estimator=None, n_folds: int = 5, seed: int | None = None)

Bases: _BaseCausalPrediction

Implements the Double Machine Learning Regressor[1] (DML2) with cross-fitting.

This estimator implements the DoubleML algorithm with cross-fitting behind a scikit-learn compatible estimator API. It extracts the exposure, outcome, and adjustment variables from a user-specified causal graph and uses them to fit a Double ML regressor and make predictions. The model is defined as follows:

Given data D: (Y, T, X), where:

Y : outcome variable
T : treatment (exposure) variable
X : adjustment (confounder + pretreatment) variables

The DoubleML fitting procedure consists of four main steps:

  1. Sample splitting into n_folds folds for cross-fitting.

  2. Fitting two nuisance estimators on each fold:
    • Outcome model (outcome_est_): predicts Y from X.

    • Treatment model (treatment_est_): predicts T from X.

  3. Computing residuals using the nuisance estimators on each fold:
    • Outcome residuals: Y - outcome_est_.predict(X)

    • Treatment residuals: T - treatment_est_.predict(X)

  4. Stacking the residuals from all folds and fitting the effect estimator (effect_est_) to predict the outcome residuals from the treatment residuals.
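The procedure above can be sketched with plain scikit-learn and NumPy. This is a minimal illustration, not the pgmpy implementation: it assumes the common cross-fitting convention (fit nuisance models on the training folds, compute residuals on the held-out fold), and the synthetic data, variable names, and coefficients are chosen only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data (assumption): X -> T (0.2), X -> Y (0.3), T -> Y (0.4).
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))
T = 0.2 * X[:, 0] + rng.normal(size=n)
Y = 0.3 * X[:, 0] + 0.4 * T + rng.normal(size=n)

res_Y = np.empty(n)
res_T = np.empty(n)

# Split the samples, fit both nuisance models on the training part,
# and compute residuals on the held-out part of each split.
splitter = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X):
    outcome_model = LinearRegression().fit(X[train_idx], Y[train_idx])
    treatment_model = LinearRegression().fit(X[train_idx], T[train_idx])
    res_Y[test_idx] = Y[test_idx] - outcome_model.predict(X[test_idx])
    res_T[test_idx] = T[test_idx] - treatment_model.predict(X[test_idx])

# Regress the stacked outcome residuals on the stacked treatment residuals.
effect_model = LinearRegression().fit(res_T.reshape(-1, 1), res_Y)
theta_hat = effect_model.coef_[0]  # estimate of the T -> Y effect
```

With this data-generating process, theta_hat should land close to the simulated effect of 0.4.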

Using the fitted models, predictions on new data (X_new, T_new) are computed as:

res_T_new = T_new - treatment_est_.predict(X_new)
Y_pred = effect_est_.predict(res_T_new) + outcome_est_.predict(X_new)
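The prediction rule can be sketched the same way. The snippet below is an illustration under stated assumptions, not the library code: the nuisance models are fitted in-sample (analogous to n_folds=1) to keep it short, and predict_outcome is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): X -> T (0.2), X -> Y (0.3), T -> Y (0.4).
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 1))
T = 0.2 * X[:, 0] + rng.normal(size=n)
Y = 0.3 * X[:, 0] + 0.4 * T + rng.normal(size=n)

# In-sample nuisance fits, then the residual-on-residual effect model.
treatment_model = LinearRegression().fit(X, T)
outcome_model = LinearRegression().fit(X, Y)
effect_model = LinearRegression().fit(
    (T - treatment_model.predict(X)).reshape(-1, 1),
    Y - outcome_model.predict(X),
)


def predict_outcome(X_new, T_new):
    """Hypothetical helper: effect of the residualized treatment plus the outcome baseline."""
    res_T_new = T_new - treatment_model.predict(X_new)
    return effect_model.predict(res_T_new.reshape(-1, 1)) + outcome_model.predict(X_new)


Y_pred = predict_outcome(X[:5], T[:5])
```

The outcome model supplies the baseline prediction from X alone, and the effect model adds the contribution of the part of the treatment that X does not explain.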

Parameters:
causal_graph : DAG, PDAG, ADMG, MAG, or PAG

Causal graph with defined variable roles. The causal graph must have the following roles: exposures, outcomes, and adjustment. Additionally, pretreatment can be specified.

nuisance_estimators : estimator or tuple of two estimators, default=LinearRegression

If a single estimator is provided, it is used for both outcome and treatment nuisance models.

If a tuple of two estimators is provided, the first one is used for the treatment model and the second for the outcome model.

If None, defaults to LinearRegression for both models.

effect_estimator : estimator-like, default=LinearRegression

Estimator for the final effect estimation step. Must have a fit method and a predict method. If None, defaults to LinearRegression.

n_folds : int, default=5

Number of folds to use for cross-fitting. If 1, no cross-fitting is performed and the residuals are computed in-sample.

seed : int or None, default=None

Random seed for cross-fitting splits.
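The three accepted forms of nuisance_estimators described above can be pictured with the following sketch. resolve_nuisance_estimators is a hypothetical helper written for illustration, not a pgmpy function; cloning the inputs with sklearn.base.clone is an assumption about sensible behavior, not a documented detail.

```python
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge


def resolve_nuisance_estimators(nuisance_estimators=None):
    """Hypothetical helper: return (treatment_estimator, outcome_estimator)."""
    if nuisance_estimators is None:
        # Default: LinearRegression for both nuisance models.
        return LinearRegression(), LinearRegression()
    if isinstance(nuisance_estimators, tuple):
        if len(nuisance_estimators) != 2:
            raise ValueError("Expected a tuple of exactly two estimators.")
        # Documented order: treatment model first, outcome model second.
        treatment_est, outcome_est = nuisance_estimators
        return clone(treatment_est), clone(outcome_est)
    # Single estimator: an independent copy for each nuisance model.
    return clone(nuisance_estimators), clone(nuisance_estimators)


treatment_est, outcome_est = resolve_nuisance_estimators((Ridge(alpha=1.0), LinearRegression()))
```

Passing separate estimators is useful when the treatment and outcome relationships call for different model classes or regularization.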

Attributes:
n_folds_ : int

Number of folds used in cross-fitting.

n_features_in_ : int

Number of features seen during fit.

n_samples_ : int

Number of samples seen during fit.

exposure_var_ : str

Name of the exposure (treatment) variable.

outcome_var_ : str

Name of the outcome variable.

adjustment_vars_ : list of str

Names of adjustment (confounder) variables.

pretreatment_vars_ : list of str

Names of pretreatment variables.

feature_columns_fit_ : list of str

Names of features used in the model.

outcome_est_ : estimator-like or list of estimator-like

Fitted outcome nuisance model(s).

treatment_est_ : estimator-like or list of estimator-like

Fitted treatment nuisance model(s).

effect_est_ : estimator-like

Fitted final effect estimator.

Notes

While the implementation allows the effect estimator to be any scikit-learn compatible estimator, the theoretical guarantees for DoubleML hold when the effect estimator is a linear model (such as LinearRegression). Using non-linear effect estimators may lead to biased estimates.
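For context, the guarantees in [1] are derived for the partially linear model. A sketch in that setting follows; the symbols θ₀, g₀, m₀, ε, and ν are notation introduced here and are not used elsewhere on this page.

```latex
\begin{aligned}
Y &= \theta_0\, T + g_0(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, T] &= 0, \\
T &= m_0(X) + \nu, & \mathbb{E}[\nu \mid X] &= 0, \\
Y - \mathbb{E}[Y \mid X] &= \theta_0 \bigl(T - \mathbb{E}[T \mid X]\bigr) + \varepsilon .
\end{aligned}
```

The last line follows by taking the conditional expectation of the first line given X and subtracting it. The outcome residual is therefore linear in the treatment residual, with slope θ₀, which is why a linear effect estimator matches the setting in which the theory is stated.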

References

[1]

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.

Examples

>>> # Example 1: With adjustments and cross-fitting
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from pgmpy.base.DAG import DAG
>>> from pgmpy.prediction import DoubleMLRegressor
>>> # Simulate data from a linear Gaussian BN that we use to estimate the causal effect from.
>>> lgbn = DAG.from_dagitty(
...     "dag { X -> T [beta=0.2] X -> Y [beta=0.3] T -> Y [beta=0.4] }"
... )
>>> data = lgbn.simulate(n_samples=1000, seed=42)
>>> X = data.loc[:, ["X", "T"]]
>>> y = data["Y"]
>>> # construct a DAG (roles must match DataFrame column names)
>>> dag = DAG(
...     lgbn.edges(), roles={"exposures": "T", "adjustment": "X", "outcomes": "Y"}
... )
>>> dml = DoubleMLRegressor(
...     causal_graph=dag,
...     nuisance_estimators=LinearRegression(),
...     effect_estimator=LinearRegression(),
...     n_folds=3,
... )
>>> dml = dml.fit(X, y)
>>> dml.effect_est_
LinearRegression()
>>> dml.effect_est_.coef_.round(1)
array([0.4])
>>> preds = dml.predict(X.iloc[:5])
>>> preds.shape
(5,)
>>> dml.n_folds_
3
>>> dml.n_samples_
1000
fit(X, y, sample_weight: Any | None = None)

Fit the DoubleML model using the provided data.

Parameters:
X : pandas.DataFrame or numpy.ndarray

Feature data containing exposure and adjustment variables. If a numpy array is provided, it is converted to a dataframe with column names starting from 0.

y : pandas.Series, pandas.DataFrame, or numpy.ndarray

Outcome variable. If a DataFrame is provided, it must have a single column.

sample_weight : array-like of shape (n_samples,), optional

Sample weights to be used in fitting the nuisance and effect estimators.

Returns:
self : object

Fitted estimator.
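The input conventions for fit can be illustrated with pandas alone. The snippet assumes the array-to-DataFrame conversion behaves like a plain pd.DataFrame(X) call, which yields integer column labels starting from 0; how pgmpy performs the conversion internally is not shown in this page.

```python
import numpy as np
import pandas as pd

# A NumPy feature matrix becomes a DataFrame with integer column labels.
X_array = np.array([[0.1, 1.0], [0.2, 0.0], [0.3, 1.0]])
X_frame = pd.DataFrame(X_array)
column_names = list(X_frame.columns)

# A single-column DataFrame outcome can be reduced to a Series.
y_frame = pd.DataFrame({"Y": [1.0, 2.0, 3.0]})
if y_frame.shape[1] != 1:
    raise ValueError("y must have exactly one column.")
y_series = y_frame.iloc[:, 0]
```

Passing a DataFrame with named columns avoids relying on positional integer labels when matching columns to the roles in the causal graph.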

predict(X)

Makes conditional interventional (CATE) predictions using the fitted DoubleML model.

Parameters:
X : pandas.DataFrame

Feature data containing the exposure and adjustment variables for which to make predictions.

Returns:
outcome_pred : numpy.ndarray

Predicted outcome values.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> DoubleMLRegressor

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns:
self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> DoubleMLRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns:
self : object

The updated object.