Expert Knowledge Integration with Causal Discovery

Causal Discovery (aka. Structure learning) seeks to uncover the causal relationships among random variables from observational datasets. Methods like the PC algorithm and Hill-climb Search can be used for causal discovery, however, they often struggle to recover the correct causal relationships due to issues such as data not satisfying the assumptions of the method, limited data can lead to error, high compute time when there are many variables or the model is dense.

Incorporating expert knowledge can greatly improve the accuracy of these methods, as well as reduce the computational cost as it constrains the search space for these algorithms. pgmpy currently has the following options to specify expert knowledge:

  • Temporal Order / Tier information: Specify the tiers/temporal order for sets of variables.

  • Search Space: Specify the set of edges that the algorithms should search through.

  • Forbidden Edges: Edges that should not be present in the final model.

  • Required Edges: Edges that must be present in the final model.

This tutorial walks through examples of specifying expert knowledge with the causal discovery algorithms implemented in pgmpy. Expert knowledge is supported in pgmpy with the following algorithms:

  • PC

  • Hill Climb Search

  • GES

  • ExpertInLoop

The models used in this notebook are described in detail at the bnlearn repository.

Temporal Order with PC Algorithm

[1]:
# Imports and configuration
import logging
import itertools

from IPython.display import display
from pgmpy.utils import get_example_model
from pgmpy.global_vars import logger, config

logger.setLevel(logging.ERROR)
config.set_backend("numpy")
[2]:
# Load an example model and simulate data from it.
cancer = get_example_model("cancer")
cancer_samples = cancer.simulate(n_samples=5000, seed=42)

# Plot the `cancer` model
diag = cancer.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_4_1.svg
[3]:
# Run the standard PC algorithm and plot the results

from pgmpy.estimators import PC, ExpertKnowledge

learned_model = PC(cancer_samples).estimate()

diag = learned_model.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_5_1.svg

As seen from the learned model, the PC algorithm was not able to correctly recover the original data-generating DAG.

Now, we know that smoking (Smoker) and air pollution (Pollution) contribute to lung cancer (Cancer). Also, breathing difficulties (Dyspnoea) and a positive X-ray report (Xray) contribute to a cancer diagnosis (effect), so they cannot be the initial causes in this scenario. Based on this, we can craft a temporal order where the causes of cancer are at a higher temporal order, while the effects are lumped together at the end.

[4]:
# PC algorithm with the temporal knowledge.

expert_knowledge = ExpertKnowledge(
    temporal_order=[["Pollution", "Smoker"], ["Cancer"], ["Dyspnoea", "Xray"]]
)

est_model = PC(cancer_samples).estimate(expert_knowledge=expert_knowledge)

diag = est_model.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_7_1.svg

Note: Using any form of expert knowledge, especially with PC, is sensitive to the accuracy of the knowledge.

Expert knowledge based on search space

[5]:
# Load the Asia model and simulate data (https://www.bnlearn.com/bnrepository/discrete-small.html#asia)
asia = get_example_model("asia")
asia_samples = asia.simulate(n_samples=10000, seed=42)

diag = asia.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_10_1.svg

The idea with the search space is that it is usually difficult to distinguish between direct and indirect effects. For example, in the above Asia model, we might know that smoke would have a causal effect on dysp but might not exactly know through which pathways. With the search space, we try to specify a hypergraph over our variables such that we specify a causal edge between any two variables that we think should have a causal relationship, but aren’t sure about the causal path/edges.

As an example, for the asia model, we can consider additional edges from all three diseases (tub, lung, and bronc) to both the symptoms (xray and dysp).

[6]:
from pgmpy.metrics import SHD

search_space = [
    ("smoke", "cancer"),
    ("smoke", "lung"),
    ("smoke", "tub"),
    ("tub", "asia"),
    ("asia", "tub"),
    ("tub", "either"),
    ("lung", "either"),
    ("bronc", "either"),
    ("tub", "xray"),
    ("tub", "dysp"),
    ("lung", "x-ray"),
    ("lung", "dysp"),
    ("bronc", "xray"),
    ("bronc", "dysp"),
    ("either", "xray"),
    ("either", "dysp"),
    ("tub", "dysp"),
]

expert_knowledge = ExpertKnowledge(search_space=search_space)

est_model = PC(asia_samples).estimate(expert_knowledge=expert_knowledge, ci_test='pillai', enforce_expert_knowledge=True)
SHD(asia, est_model)
[6]:
8
[7]:
# Plot the estimated model

diag = est_model.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_13_0.svg

Knowledge base of required and forbidden edges

[8]:
child = get_example_model("child")
child_samples = child.simulate(n_samples=10000, seed=42)

diag = child.to_graphviz()
diag.layout(prog="dot")
display(diag)
../../../pgmpy_docs/doctrees/nbsphinx/examples_Expert_Knowledge_15_1.svg

Based on similar reasoning as in the first section, we can now create a knowledge base consisting of forbidden and required edges. For example, having birth asphyxia (‘Disease’) directly contributes to the 6 subclassses/markers of disease (‘LVH’, ‘LungParench’ etc.). It is the root cause, so it cannot have incoming edges from any of these nodes. Similarly, the CO2 level (‘CO2’) affects the CO2 report (‘CO2Report’), and not the other way around.

[9]:
required_edges = [("CO2", "CO2Report"),
                  ("ChestXray", "XrayReport"),
                  ("LVH", "LVHreport"),
                  ("Gruntin", "GruntingReport")]
# As disease should be an exogenous node, do not allow any incoming edges except `BirthAsphyxia'.
forbidden_edges = [(u, v) for u, v in itertools.combinations(child.nodes(), 2) if v == 'Disease']

expert_knowledge = ExpertKnowledge(
    required_edges=required_edges, forbidden_edges=forbidden_edges
)
est_model = PC(child_samples).estimate(expert_knowledge=expert_knowledge, ci_test='pillai', enforce_expert_knowledge=True)
SHD(child, est_model)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 11
      6 forbidden_edges = [(u, v) for u, v in itertools.combinations(child.nodes(), 2) if v == 'Disease']
      8 expert_knowledge = ExpertKnowledge(
      9     required_edges=required_edges, forbidden_edges=forbidden_edges
     10 )
---> 11 est_model = PC(child_samples).estimate(expert_knowledge=expert_knowledge, ci_test='pillai', enforce_expert_knowledge=True)
     12 SHD(child, est_model)

File ~/pgmpy/examples/pgmpy/estimators/PC.py:241, in PC.estimate(self, variant, ci_test, return_type, significance_level, max_cond_vars, expert_knowledge, enforce_expert_knowledge, n_jobs, show_progress, **kwargs)
    238     return skel, separating_sets
    240 # Step 2: Orient the edges based on collider structures.
--> 241 pdag = self.orient_colliders(
    242     skel, separating_sets, expert_knowledge.temporal_ordering
    243 )
    245 # Step 3: Either return the CPDAG, integrate expert knowledge or fully orient the edges to build a DAG.
    246 if expert_knowledge.temporal_order != [[]]:

File ~/pgmpy/examples/pgmpy/estimators/PC.py:548, in PC.orient_colliders(skeleton, separating_sets, temporal_ordering)
    546 if not skeleton.has_edge(X, Y):
    547     for Z in set(skeleton.neighbors(X)) & set(skeleton.neighbors(Y)):
--> 548         if Z not in separating_sets[frozenset((X, Y))]:
    549             if (temporal_ordering == dict()) or (
    550                 (temporal_ordering[Z] >= temporal_ordering[X])
    551                 and (temporal_ordering[Z] >= temporal_ordering[Y])
    552             ):
    553                 pdag.remove_edges_from([(Z, X), (Z, Y)])

KeyError: frozenset({'Disease', 'CO2'})
[ ]:
# Plot the estimated model

diag = est_model.to_graphviz()
diag.layout(prog="dot")
display(diag)