Structure Learning in Bayesian Networks#

In this notebook, we show a few examples of Causal Discovery (also known as Structure Learning) in pgmpy. pgmpy currently has the following algorithms for causal discovery:

  1. PC: Has three variants: original, stable, and parallel. PC is a constraint-based algorithm that uses Conditional Independence (CI) tests to construct the model structure.

  2. Hill-Climb Search: A greedy, score-based algorithm that iteratively makes local changes to the model structure to improve the model's overall score.

  3. Greedy Equivalence Search (GES): Another score-based method that makes greedy modifications to the model to improve its score iteratively.

  4. ExpertInLoop: An iterative algorithm that combines Conditional Independence testing with expert knowledge. The user or an LLM can act as the expert.

  5. Exhaustive Search: Iterates over all possible network structures on the given variables and returns the optimal one. Because it enumerates every possible structure, it is intractable unless the number of variables is small.
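
The intractability of exhaustive search is easy to quantify: the number of labeled DAGs grows super-exponentially in the number of variables. A quick sketch using Robinson's recurrence (pure Python, independent of pgmpy):

```python
from math import comb

def num_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    a = [1]  # a[0] = 1: the empty graph on zero nodes
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

for n in (2, 3, 4, 5):
    print(n, num_dags(n))
```

Already at 5 variables there are 29,281 possible DAGs, which is why exhaustive search is only practical for a handful of variables.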

The following Conditional Independence tests are available for use with the PC algorithm:

  1. Discrete Data: When all variables are discrete/categorical.

    1. Chi-square test: ci_test="chi_square"

    2. G-squared: ci_test="g_sq"

    3. Log-likelihood: Equivalent to the G-squared test. ci_test="log_likelihood"

  2. Continuous Data: When all variables are continuous/numerical.

    1. Partial Correlation: ci_test="pearsonr"

  3. Mixed Data: When there is a mix of categorical and continuous variables.

    1. Pillai: ci_test="pillai"
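
As a rough illustration of what the partial-correlation (`pearsonr`) test computes, the sketch below (plain NumPy, not pgmpy's implementation) correlates the residuals of x and y after linearly regressing out a single conditioning variable z:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out z (one conditioning variable)."""
    Z = np.column_stack([z, np.ones_like(z)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
z = x + 0.1 * rng.normal(size=5000)  # x -> z
y = z + 0.1 * rng.normal(size=5000)  # z -> y

print(np.corrcoef(x, y)[0, 1])  # marginally, x and y are strongly correlated
print(partial_corr(x, y, z))    # near zero: x is independent of y given z
```

Removing the edge x—y because the partial correlation vanishes given z is exactly the kind of decision PC makes at each step.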

For Hill-Climb Search, Exhaustive Search, and GES, the following scoring methods can be used:

  1. Discrete Data: When all variables are discrete/categorical.

    1. BIC Score: scoring_method="bic-d"

    2. AIC Score: scoring_method="aic-d"

    3. K2 Score: scoring_method="k2"

    4. BDeU Score: scoring_method="bdeu"

    5. BDs Score: scoring_method="bds"

  2. Continuous Data: When all variables are continuous/numerical.

    1. Log-Likelihood: scoring_method="ll-g"

    2. AIC: scoring_method="aic-g"

    3. BIC: scoring_method="bic-g"

  3. Mixed Data: When there is a mix of discrete and continuous variables.

    1. AIC: scoring_method="aic-cg"

    2. BIC: scoring_method="bic-cg"
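
All of these scores are decomposable: they sum, over nodes, a local term that rewards fit and penalizes parameter count. A minimal sketch of a BIC-style local score for discrete data (pure Python; pgmpy's implementations differ in details):

```python
from collections import Counter
from math import log

def local_bic(rows, node, parents):
    """BIC-style local score of `node` given `parents`; rows are dicts of discrete values."""
    n = len(rows)
    joint = Counter((tuple(row[p] for p in parents), row[node]) for row in rows)
    cond = Counter(tuple(row[p] for p in parents) for row in rows)
    # Conditional log-likelihood of the node given its parents
    ll = sum(c * log(c / cond[pa]) for (pa, _), c in joint.items())
    r_states = len({row[node] for row in rows})  # number of states of the node
    q_configs = len(cond)                        # observed parent configurations
    return ll - 0.5 * q_configs * (r_states - 1) * log(n)

data = [{"A": a, "B": a} for a in (0, 1)] * 4  # B deterministically copies A
print(local_bic(data, "B", ["A"]))  # better: A fully explains B
print(local_bic(data, "B", []))     # worse: B looks like a fair coin
```

Score-based search algorithms compare exactly such local terms before and after each candidate edge change.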

0. Simulate some sample datasets#

[11]:
import networkx as nx
import numpy as np
from sklearn.metrics import f1_score

from pgmpy.causal_discovery import PC, HillClimbSearch
from pgmpy.estimators import GES
from pgmpy.example_models import load_model
[3]:
# Discrete variable dataset
alarm_model = load_model("bnlearn/alarm")
alarm_samples = alarm_model.simulate(int(1e3))
alarm_samples.head()

# Continuous variable dataset
ecoli_model = load_model("bnlearn/ecoli70")
ecoli_samples = ecoli_model.simulate(int(1e3))
ecoli_samples.head()
Generating for node: BP: 100%|██████████| 37/37 [00:00<00:00, 92.14it/s]
[3]:
b1191 cspG eutG fixC cspA yecO yedE sucA cchB yceP ... dnaK folK ycgX lacZ nuoM dnaG b1583 mopB yaeM ftsJ
0 0.604062 2.547302 1.027700 1.400299 0.478782 1.806666 -2.077832 -1.229354 2.510977 0.784379 ... -1.843240 -0.447459 -1.517054 1.520873 -1.490384 -0.938163 0.412238 -2.132545 3.743092 -1.248824
1 0.599870 4.044272 1.430210 1.073578 0.363665 4.191238 -3.196371 -1.200287 3.072628 1.476936 ... -1.141279 0.056055 -0.871642 1.570619 -1.885748 -0.406936 1.359329 -2.626118 6.480438 -1.713716
2 -0.174397 2.314786 1.869508 1.333887 1.188071 2.432008 -1.952552 -1.696226 1.591140 1.924823 ... -2.049519 -0.214026 -1.467118 -0.648627 -3.483186 -1.015602 1.700527 -1.241445 2.912163 -0.500596
3 2.598877 1.709208 1.581040 2.630222 0.610870 1.269157 -0.764978 -1.421604 2.380991 1.668621 ... 3.167103 3.610352 2.727564 1.708831 -1.585674 2.261148 1.485151 2.788864 4.715340 2.504156
4 1.345047 2.061036 1.135546 1.210133 2.175349 1.658543 -1.243103 -1.782036 2.095497 0.866654 ... 1.257716 1.268718 1.957043 5.571929 -0.489090 1.575290 2.550061 0.642558 3.632864 1.114588

5 rows × 46 columns

[4]:
# Function to evaluate the learned model structures.
def get_f1_score(estimated_model, true_model):
    nodes = estimated_model.nodes()
    est_adj = nx.to_numpy_array(
        estimated_model.to_undirected(), nodelist=nodes, weight=None
    )
    true_adj = nx.to_numpy_array(
        true_model.to_undirected(), nodelist=nodes, weight=None
    )

    f1 = f1_score(np.ravel(true_adj), np.ravel(est_adj))
    print("F1-score for the model skeleton: ", f1)

1. PC algorithm#

[10]:
# Learning the discrete variable alarm model back

est = PC(ci_test='chi_square', variant="stable", max_cond_vars=4, return_type='dag')
est.fit(alarm_samples)
get_f1_score(est.causal_graph_, alarm_model)
Working for n conditional variables: 4: 100%|██████████| 4/4 [00:14<00:00,  3.64s/it]
F1-score for the model skeleton:  0.8205128205128205

[ ]:
# Learning the continuous variable ecoli model back

est = PC(ci_test='pearsonr', variant="orig", max_cond_vars=4, return_type='dag')
est.fit(X=ecoli_samples)
get_f1_score(est.causal_graph_, ecoli_model)
Working for n conditional variables: 4: 100%|██████████| 4/4 [00:08<00:00,  2.20s/it]
F1-score for the model skeleton:  0.7758620689655172
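
Both PC runs above follow the same core procedure: start from a complete undirected graph and delete every edge whose endpoints some conditioning set renders conditionally independent. A simplified sketch of this skeleton phase, with a perfect independence oracle standing in for a statistical CI test (the real algorithm restricts conditioning sets to current neighbours and afterwards orients edges):

```python
from itertools import combinations

def pc_skeleton(nodes, indep, max_cond=2):
    """Skeleton phase of PC: prune a complete graph using CI queries."""
    adj = {frozenset(pair) for pair in combinations(nodes, 2)}
    for level in range(max_cond + 1):  # grow the conditioning-set size
        for edge in sorted(adj, key=sorted):
            x, y = sorted(edge)
            rest = [n for n in nodes if n not in edge]
            if any(indep(x, y, set(z)) for z in combinations(rest, level)):
                adj.discard(edge)
    return adj

# Oracle for the chain A -> B -> C: the only CI statement is A _|_ C | B
indep = lambda x, y, z: {x, y} == {"A", "C"} and "B" in z
print(pc_skeleton(["A", "B", "C"], indep))
```

The edge A—C is removed at conditioning level 1, leaving exactly the skeleton A—B—C of the true chain.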

2. Hill-Climb Search Algorithm#

[12]:
# Learning the discrete variable alarm model back

est = HillClimbSearch(scoring_method="k2", max_indegree=4, max_iter=int(1e4))
est.fit(X=alarm_samples)
get_f1_score(est.causal_graph_, alarm_model)
  1%|          | 66/10000 [00:11<28:57,  5.72it/s]
F1-score for the model skeleton:  0.7321428571428571

[13]:
# Learning the continuous variable ecoli model back

est = HillClimbSearch(scoring_method="bic-g", max_indegree=4, max_iter=int(1e4))
est.fit(X=ecoli_samples)
get_f1_score(est.causal_graph_, ecoli_model)
  1%|          | 90/10000 [01:02<1:55:22,  1.43it/s]
F1-score for the model skeleton:  0.8
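
The search loop behind HillClimbSearch can be sketched in a few lines: evaluate every legal single-edge change, apply the one with the largest score gain, and stop when no change improves the score. A toy version with add/delete moves only (the real implementation also flips edges, caches local scores, and supports tabu lists), using a made-up score function purely for illustration:

```python
from itertools import permutations

def has_path(edges, src, dst):
    """DFS reachability over directed edges, used to keep the graph acyclic."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(b for a, b in edges if a == n)
    return False

def hill_climb(nodes, score, max_iter=100):
    edges = set()
    for _ in range(max_iter):
        current, best, best_gain = score(edges), None, 0.0
        for x, y in permutations(nodes, 2):
            if (x, y) in edges:
                move = edges - {(x, y)}            # delete move
            elif not has_path(edges, y, x):
                move = edges | {(x, y)}            # add move (stays acyclic)
            else:
                continue                            # adding (x, y) would create a cycle
            gain = score(move) - current
            if gain > best_gain:
                best, best_gain = move, gain
        if best is None:                            # local optimum reached
            return edges
        edges = best
    return edges

# Toy score: reward the true edges A->B and B->C, lightly penalize every edge
score = lambda edges: sum(e in {("A", "B"), ("B", "C")} for e in edges) - 0.1 * len(edges)
print(hill_climb(["A", "B", "C"], score))
```

With this score the loop adds the two rewarded edges and then stops, since every remaining move (including deletions) has a negative gain.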

3. GES algorithm#

[5]:
# Learning the discrete variable alarm model back

est = GES(data=alarm_samples)
estimated_model = est.estimate(scoring_method="bic-d", min_improvement=250)
get_f1_score(estimated_model, alarm_model)
F1-score for the model skeleton:  0.4444444444444444

4. Expert In Loop Algorithm#

Please refer to the following blog post for more details: https://medium.com/gopenai/llms-for-causal-discovery-745e2cba0b59