Causal Discovery and Structure Learning#

Causal discovery recovers the causal graph from observational data — determining which variables directly cause which others. Given a dataset of jointly observed variables, pgmpy searches over graph structures to produce a DAG or PDAG representing the causal relationships.

At a Glance#

  • Unified API: All discovery algorithms share a consistent fit / causal_graph_ / score interface.

  • Expert Knowledge: Encode required edges, forbidden edges, and temporal orderings to constrain the search.

  • Scoring and Evaluation: Compare learned graphs against data or a known reference using a unified score(...) method.

  • Scikit-learn Compatibility: Use discovery estimators in sklearn pipelines and tooling.

API#

All discovery algorithms follow the same pattern — instantiate, fit, and read the result:

from pgmpy.datasets import load_dataset
from pgmpy.causal_discovery import PC   # swap in any discovery algorithm

dataset = load_dataset("sachs_discrete")
data = dataset.data

est = PC(ci_test="chi_square", return_type="dag", show_progress=False)
est.fit(data)

print(est.causal_graph_.edges())
print(est.adjacency_matrix_.head())

Switching algorithms requires only changing the class and its constructor arguments. The fitted result is always accessed through causal_graph_ and adjacency_matrix_.
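The two result attributes carry the same information in different shapes: causal_graph_ holds an edge list, while adjacency_matrix_ holds the matrix view of the same graph. As a pure-Python illustration of that relationship (not pgmpy code; the edge list below is made up):

```python
# Illustrative only: `edges` mimics what causal_graph_.edges() might return.
edges = [("PKC", "PKA"), ("PKA", "Raf"), ("Raf", "Mek")]
nodes = sorted({n for e in edges for n in e})

# adjacency[u][v] is True exactly when the learned graph contains u -> v
adjacency = {u: {v: (u, v) in edges for v in nodes} for u in nodes}

assert adjacency["PKC"]["PKA"] and not adjacency["PKA"]["PKC"]
```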

Expert Knowledge#

pgmpy supports incorporating domain knowledge into the discovery process through the ExpertKnowledge class. You can encode:

  • Required edges that must appear in the learned graph.

  • Forbidden edges that must not appear.

  • Temporal orderings that constrain edge directions.

  • Search-space restrictions that limit which variables are considered.

Both constraint-based and score-based algorithms accept an expert_knowledge parameter:

from pgmpy.causal_discovery import HillClimbSearch, ExpertKnowledge
from pgmpy.datasets import load_dataset

dataset = load_dataset("sachs_discrete")
data = dataset.data

expert = ExpertKnowledge(
    required_edges=[("PKC", "PKA")],
    forbidden_edges=[("PKA", "PKC")],
    temporal_order=[["PKC", "Raf"], ["Mek", "PKA"]],
)

est = HillClimbSearch(
    scoring_method="bic-d", expert_knowledge=expert, show_progress=False
)
est.fit(data)
print(est.causal_graph_.edges())
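Temporal orderings are the most compact of these constraints: a variable in a later tier may not cause a variable in an earlier tier. A minimal pure-Python sketch of how tiers expand into forbidden edges (illustrative only, not pgmpy's internals):

```python
# Illustrative only: expand temporal tiers into the forbidden edges they
# imply. Each inner list is a tier; edges from a later tier back into an
# earlier tier are disallowed.
def tiers_to_forbidden_edges(tiers):
    forbidden = set()
    for i, later in enumerate(tiers):
        for earlier in tiers[:i]:
            for u in later:
                for v in earlier:
                    forbidden.add((u, v))  # u comes after v, so u -> v is out
    return forbidden

tiers = [["PKC", "Raf"], ["Mek", "PKA"]]
# Forbids the four edges from {Mek, PKA} back into {PKC, Raf}.
assert len(tiers_to_forbidden_edges(tiers)) == 4
```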

Scoring and Evaluation#

All discovery estimators expose a unified score(...) method for comparing results:

from pgmpy.causal_discovery import HillClimbSearch, PC
from pgmpy.datasets import load_dataset
from pgmpy.structure_score import BIC

dataset = load_dataset("sachs_discrete")
data = dataset.data

hc = HillClimbSearch(scoring_method=BIC(data), show_progress=False).fit(data)
pc = PC(ci_test="chi_square", return_type="dag", show_progress=False).fit(data)

# Score against data (no ground truth needed)
print(hc.score(X=data, metric="correlation_score"))
print(pc.score(X=data, metric="correlation_score"))

You can score against data using unsupervised metrics, or against a known reference graph using supervised metrics. Use pgmpy.metrics.get_metrics(...) to discover available metrics programmatically.
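For intuition about the supervised case, structural Hamming distance (SHD) counts the edge edits separating the learned graph from the reference. A rough sketch over directed edge sets (illustrative, not pgmpy's implementation; conventions differ on whether a reversal costs one edit or two — here it costs one):

```python
def shd(learned, true):
    """Edge insertions, deletions, and reversals needed to turn `learned`
    into `true`. A reversed edge counts as a single edit in this sketch."""
    learned, true = set(learned), set(true)
    extra = learned - true      # edges to delete
    missing = true - learned    # edges to add
    # A pair present in both sets with opposite direction is one reversal,
    # not a deletion plus an insertion.
    reversals = {(u, v) for (u, v) in extra if (v, u) in missing}
    return len(extra) + len(missing) - len(reversals)

learned = [("A", "B"), ("B", "C"), ("C", "D")]
true = [("A", "B"), ("C", "B"), ("C", "D")]
print(shd(learned, true))  # 1: the B-C edge is merely reversed
```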

Scikit-learn Compatibility#

Discovery estimators inherit from sklearn.base.BaseEstimator and work with sklearn tooling — cloning, parameter inspection, and pipelines:

from sklearn.pipeline import Pipeline
from pgmpy.causal_discovery import PC
from pgmpy.datasets import load_dataset

data = load_dataset("sachs_discrete").data

pipeline = Pipeline(
    steps=[
        ("discover", PC(ci_test="chi_square", return_type="dag", show_progress=False)),
    ]
)
pipeline.fit(data)

print(pipeline[-1].causal_graph_)
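The same BaseEstimator contract is what makes cloning and parameter inspection work: constructor arguments are stored verbatim as attributes of the same name. A sketch using a hypothetical stand-in class (ToyDiscovery is not a pgmpy estimator; it only mimics that convention):

```python
from sklearn.base import BaseEstimator, clone

class ToyDiscovery(BaseEstimator):
    """Hypothetical stand-in: sklearn's get_params/clone machinery works
    because __init__ arguments become same-named attributes, untouched."""
    def __init__(self, ci_test="chi_square", return_type="dag"):
        self.ci_test = ci_test
        self.return_type = return_type

est = ToyDiscovery(ci_test="g_sq")
print(est.get_params())  # {'ci_test': 'g_sq', 'return_type': 'dag'}

fresh = clone(est)       # a new, unfitted copy with identical parameters
assert fresh is not est and fresh.get_params() == est.get_params()
```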

Common Recipes#

from pgmpy.datasets import load_dataset

dataset = load_dataset("sachs_discrete")
data = dataset.data

Score-based discovery with BIC:

from pgmpy.causal_discovery import HillClimbSearch

est = HillClimbSearch(scoring_method="bic-d", show_progress=False).fit(data)

Constraint-based discovery with a different CI test:

from pgmpy.causal_discovery import PC

est = PC(ci_test="g_sq", return_type="dag", show_progress=False).fit(data)

GES returning a PDAG:

from pgmpy.causal_discovery import GES

est = GES(scoring_method="bic-d", return_type="pdag", show_progress=False).fit(data)

Score against ground truth:

est.score(true_graph=dataset.ground_truth, metric="shd")

Auto-detect scoring for continuous data:

# continuous_data: a DataFrame whose columns are all numeric
est = HillClimbSearch(scoring_method=None, show_progress=False).fit(continuous_data)

See Also#

  • Parameter Estimation — Estimate CPDs once you have a graph.

  • Metrics — Evaluate learned graphs with supervised and unsupervised metrics.

API Reference#

For the full list of discovery algorithms, scoring methods, and conditional independence tests: