NaiveIVRegressor

class pgmpy.prediction.NaiveIVRegressor.NaiveIVRegressor(causal_graph, stage1_estimator: Any | None = None, stage2_estimator: Any | None = None)

Bases: _BaseCausalPrediction

Implements a naive instrumental variable (IV) regressor for a single exposure and one or more instruments.

This estimator implements a simple two-stage least squares (2SLS) style procedure for a single exposure and a single outcome with one or more instrumental variables. The first stage fits exposure ~ instrument(s) using stage1_estimator. The second stage fits outcome ~ predicted_exposure (+ pretreatment covariates, if any) using stage2_estimator.
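The two stages described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not pgmpy's implementation: the simulated data, the unobserved confounder `u`, and the helper `ols` are hypothetical and exist only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: two instruments, one unobserved confounder u.
z = rng.normal(size=(n, 2))
u = rng.normal(size=n)
x = 0.8 * z[:, 0] + 0.8 * z[:, 1] + u + rng.normal(size=n)  # exposure
y = 0.3 * x + u + rng.normal(size=n)  # outcome; true effect of x is 0.3


def ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Stage 1: exposure ~ instruments, then predict the exposure.
stage1 = ols(z, x)
x_hat = stage1[0] + z @ stage1[1:]

# Stage 2: outcome ~ predicted exposure.
beta_iv = ols(x_hat, y)[1]

# For comparison: regressing the outcome directly on the observed exposure
# picks up the confounding through u.
beta_ols = ols(x, y)[1]
```

In this setup `beta_iv` should land close to the simulated effect of 0.3, while `beta_ols` is pulled upward by the confounder. That gap is the reason for regressing on the predicted exposure instead of the observed one.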

Parameters:

causal_graph (DAG, PDAG, ADMG, MAG, or PAG): Causal graph with defined variable roles.
stage1_estimator (sklearn regressor, optional; default = LinearRegression()): Estimator for the stage 1 regression of the exposure on the instrument(s).
stage2_estimator (sklearn regressor, optional; default = LinearRegression()): Estimator for the stage 2 regression of the outcome on the predicted exposure and pretreatment covariates (if any).

Attributes:

exposure_var_ (str): Name of the exposure variable (single).
outcome_var_ (str): Name of the outcome variable (single).
instrument_vars_ (list of str): Names of the instrument variables extracted from the causal graph.
pretreatment_vars_ (list of str): Names of the pretreatment covariates extracted from the causal graph.
feature_columns_fit_ (list of str): Names of the features used during fit.
feature_columns_predict_ (list of str): Names of the features used during predict.
stage1_est_ (estimator): Fitted first-stage estimator.
stage2_est_ (estimator): Fitted second-stage estimator.
coef_ (array-like): Coefficients from the fitted stage2_estimator (if available).

References

[1]

“Instrumental Variables Estimation.” Wikipedia: https://en.wikipedia.org/wiki/Instrumental_variables_estimation

Examples

>>> # Example 1: Basic usage with LinearRegression estimators
>>> import pandas as pd
>>> from pgmpy.base import DAG
>>> from sklearn.linear_model import LinearRegression
>>> from pgmpy.prediction import NaiveIVRegressor
>>>
>>> # Simulate data from a linear Gaussian Bayesian network
>>> lgbn = DAG.from_dagitty(
...     "dag { Z1 -> X [beta=0.2] Z2 -> X [beta=0.2] X -> Y [beta=0.3] }"
... )
>>> data = lgbn.simulate(1000, seed=42)  # returns a pandas DataFrame
>>> df = data.loc[:, ["X", "Z1", "Z2"]]
>>> df = (df - df.mean(axis=0)) / df.std(axis=0)
>>> y = data["Y"]
>>> G = DAG(
...     lgbn.edges(),
...     roles={"exposures": "X", "instrument": ("Z1", "Z2"), "outcomes": "Y"},
... )
>>>
>>> model = NaiveIVRegressor(
...     causal_graph=G,
...     stage1_estimator=LinearRegression(),
...     stage2_estimator=LinearRegression(),
... )
>>> # Fit the model and make predictions
>>> _ = model.fit(df, y)
>>> preds = model.predict(df)
>>> preds.shape[0]
1000

>>> # Example 2: Usage with multiple instruments and a pretreatment covariate
>>> import pandas as pd
>>> from pgmpy.base import DAG
>>> from sklearn.linear_model import LinearRegression
>>> from pgmpy.prediction import NaiveIVRegressor
>>>
>>> # Simulate data from a linear Gaussian Bayesian Network
>>> lgbn = DAG.from_dagitty(
...     "dag { U1 -> X [beta=0.3] U2 -> X [beta=0.2] U3 -> X [beta=0.1] "
...     "U4 -> X [beta=0.2] X -> Y [beta=0.6] P -> Y [beta=0.2] }"
... )
>>> data = lgbn.simulate(300, seed=42)
>>> df = data.loc[:, ["X", "U1", "U2", "U3", "P"]]
>>>
>>> dag = DAG(
...     ebunch=[
...         ("U1", "X"),
...         ("U2", "X"),
...         ("U3", "X"),
...         ("U4", "X"),
...         ("X", "Y"),
...         ("P", "Y"),
...     ],
...     roles={
...         "exposures": "X",
...         "instrument": ("U1", "U2", "U3"),
...         "outcomes": "Y",
...         "pretreatment": ["P"],
...     },
... )
>>> model = NaiveIVRegressor(
...     causal_graph=dag,
... )
>>>
>>> # Fit the model and make predictions
>>> _ = model.fit(df, data["Y"])
>>> preds = model.predict(df)
>>> preds.shape[0]
300

>>> # Example 3: Usage with custom estimators and numpy array inputs
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.base import DAG
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor
>>> from pgmpy.prediction import NaiveIVRegressor
>>>
>>> dag = DAG(
...     ebunch=[(1, 0), (0, 2)],
...     roles={"exposures": [0], "outcomes": [2], "instrument": [1]},
... )
>>> model = NaiveIVRegressor(
...     causal_graph=dag,
...     stage1_estimator=RandomForestRegressor(),
...     stage2_estimator=LinearRegression(),
... )
>>>
>>> # Simulate some random data
>>> n_samples = 50
>>> X_array = np.random.normal(0, 1, (n_samples, 2))
>>> y_array = np.random.normal(0, 1, n_samples)
>>>
>>> # Fit the model and make predictions
>>> _ = model.fit(X_array, y_array)
>>> preds = model.predict(X_array)
>>> preds.shape[0]
50

fit(X, y, sample_weight: Any | None = None)

Fit the two-stage least squares regression defined by the causal graph. The stage 1 estimator is first fit to predict the exposure variable from the instrument(s). The stage 2 estimator is then fit to predict the outcome variable from the predicted exposure and the pretreatment variables (if any).

Parameters:

X (pandas.DataFrame or numpy.ndarray): Feature data containing the exposure, instrument, and pretreatment variables.
y (pandas.Series, pandas.DataFrame, or numpy.ndarray): Outcome variable.
sample_weight (array-like, optional): Sample weights for fitting the estimators.

Returns:

self (object): Fitted estimator.

predict(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → NaiveIVRegressor

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for the sample_weight parameter in fit.

Returns:

self (object): The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → NaiveIVRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for the sample_weight parameter in score.

Returns:

self (object): The updated object.