Bayesian Network

class pgmpy.models.BayesianNetwork.BayesianNetwork(ebunch=None, latents={})[source]

Base class for Bayesian Models.

add_cpds(*cpds)[source]

Add CPD (Conditional Probability Distribution) to the Bayesian Model.

Parameters

cpds (list, set, tuple (array-like)) – List of CPDs which will be associated with the model

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete.CPD import TabularCPD
>>> student = BayesianNetwork([('diff', 'grades'), ('intel', 'grades')])
>>> grades_cpd = TabularCPD('grades', 3, [[0.1,0.1,0.1,0.1,0.1,0.1],
...                                       [0.1,0.1,0.1,0.1,0.1,0.1],
...                                       [0.8,0.8,0.8,0.8,0.8,0.8]],
...                         evidence=['diff', 'intel'], evidence_card=[2, 3])
>>> student.add_cpds(grades_cpd)

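+---------+-------+-------+-------+-------+-------+-------+
| diff:   | easy                  | hard                  |
+---------+-------+-------+-------+-------+-------+-------+
| intel:  | dumb  | avg   | smart | dumb  | avg   | smart |
+---------+-------+-------+-------+-------+-------+-------+
| gradeA  | 0.1   | 0.1   | 0.1   | 0.1   | 0.1   | 0.1   |
+---------+-------+-------+-------+-------+-------+-------+
| gradeB  | 0.1   | 0.1   | 0.1   | 0.1   | 0.1   | 0.1   |
+---------+-------+-------+-------+-------+-------+-------+
| gradeC  | 0.8   | 0.8   | 0.8   | 0.8   | 0.8   | 0.8   |
+---------+-------+-------+-------+-------+-------+-------+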

add_edge(u, v, **kwargs)[source]

Add an edge between u and v.

The nodes u and v will be added automatically if they are not already in the graph.

Parameters
  • u (nodes) – Nodes can be any hashable python object.

  • v (nodes) – Nodes can be any hashable python object.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> G = BayesianNetwork()
>>> G.add_nodes_from(['grade', 'intel'])
>>> G.add_edge('grade', 'intel')
check_model()[source]

Check the model for various errors. This method checks for the following errors.

  • Checks if the sum of the probabilities for each state is equal to 1 (tol=0.01).

  • Checks if the CPDs associated with nodes are consistent with their parents.

Returns

check – True if all the checks are passed

Return type

boolean
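
The first check can be illustrated without pgmpy. The following is a minimal sketch, not the library's implementation: it assumes a CPD table stored as a list of rows (one row per state, one column per parent configuration), and the helper name columns_sum_to_one is hypothetical.

```python
def columns_sum_to_one(cpd_values, tol=0.01):
    """Return True if every column of the CPD table sums to 1 within tol."""
    n_columns = len(cpd_values[0])
    for col in range(n_columns):
        total = sum(row[col] for row in cpd_values)
        if abs(total - 1.0) > tol:
            return False
    return True


# Table from the add_cpds example: every column sums to 1.
grades = [[0.1] * 6, [0.1] * 6, [0.8] * 6]
# Deliberately inconsistent table: the columns sum to 0.9 and 1.2.
broken = [[0.1, 0.9], [0.8, 0.3]]

print(columns_sum_to_one(grades))
print(columns_sum_to_one(broken))
```

The tolerance mirrors the tol=0.01 mentioned above, so small rounding errors in hand-written tables do not fail the check.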

copy()[source]

Returns a copy of the model.

Returns

Copy of the model on which the method was called.

Return type

BayesianNetwork

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C')])
>>> cpd_a = TabularCPD('A', 2, [[0.2], [0.8]])
>>> cpd_b = TabularCPD('B', 2, [[0.3, 0.7], [0.7, 0.3]],
...                    evidence=['A'],
...                    evidence_card=[2])
>>> cpd_c = TabularCPD('C', 2, [[0.1, 0.9], [0.9, 0.1]],
...                    evidence=['B'],
...                    evidence_card=[2])
>>> model.add_cpds(cpd_a, cpd_b, cpd_c)
>>> copy_model = model.copy()
>>> copy_model.nodes()
NodeView(('A', 'B', 'C'))
>>> copy_model.edges()
OutEdgeView([('A', 'B'), ('B', 'C')])
>>> len(copy_model.get_cpds())
3
do(nodes, inplace=False)[source]

Applies the do operation. The do operation removes all incoming edges to variables in nodes and marginalizes their CPDs to only contain the variable itself.

Parameters
  • nodes (list, array-like) – The names of the nodes to apply the do-operator for.

  • inplace (boolean (default: False)) – If inplace=True, makes the changes to the current object, otherwise returns a new instance.

Returns

Instance of BayesianNetwork modified by the do operation.

Return type

pgmpy.models.BayesianNetwork

Examples
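
The graph part of the do operation can be sketched in plain Python. This is an illustration only, assuming the network is given as a list of (parent, child) edges; the helper name do_edges is hypothetical, and the CPD marginalization step is not shown.

```python
def do_edges(edges, nodes):
    """Return a new edge list with every edge pointing into `nodes` removed."""
    targets = set(nodes)
    return [(u, v) for (u, v) in edges if v not in targets]


edges = [('A', 'B'), ('B', 'C'), ('A', 'C')]

# Intervening on C cuts C off from its parents A and B.
mutilated = do_edges(edges, ['C'])
print(mutilated)
```

The original edge list is left untouched, which corresponds to the inplace=False behaviour described above.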

fit(data, estimator=None, state_names=[], complete_samples_only=True, n_jobs=-1, **kwargs)[source]

Estimates the CPD for each variable based on a given data set.

Parameters
  • data (pandas DataFrame object) – DataFrame object with column names identical to the variable names of the network. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)

  • estimator (Estimator class) – One of:

      - MaximumLikelihoodEstimator (default)
      - BayesianEstimator: In this case, pass ‘prior_type’ and either ‘pseudo_counts’ or ‘equivalent_sample_size’ as additional keyword arguments. See BayesianEstimator.get_parameters() for usage.
      - ExpectationMaximization

  • state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.

  • complete_samples_only (bool (default True)) – Specifies how to deal with missing data, if present. If set to True, all rows that contain np.NaN somewhere are ignored. If False, then for each variable, every row where neither the variable nor its parents are np.NaN is used.

  • n_jobs (int (default: -1)) – Number of threads/processes to use for estimation. It improves speed only for large networks (>100 nodes). For smaller networks it might reduce performance.

Returns

None – Modifies the network in place and adds the cpds property.

Return type

None

Examples

>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.estimators import MaximumLikelihoodEstimator
>>> data = pd.DataFrame(data={'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]})
>>> model = BayesianNetwork([('A', 'C'), ('B', 'C')])
>>> model.fit(data)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7fb98a7d50f0>,
<TabularCPD representing P(B:2) at 0x7fb98a7d5588>,
<TabularCPD representing P(C:2 | A:2, B:2) at 0x7fb98a7b1f98>]
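
For a node without parents, maximum likelihood estimation reduces to relative frequencies. The sketch below illustrates this on column A of the data above using only the standard library. It is not pgmpy's estimator, and the helper name mle_marginal is hypothetical.

```python
from collections import Counter


def mle_marginal(values):
    """Estimate P(X) for a parentless variable as relative state frequencies."""
    counts = Counter(values)
    total = len(values)
    return {state: counts[state] / total for state in sorted(counts)}


data = {'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]}

# A takes the value 0 in two of three rows and 1 in one of three rows.
p_a = mle_marginal(data['A'])
print(p_a)
```

For a node with parents, the same counting is done separately for each parent configuration, which is why sparse data can leave some columns poorly estimated.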
fit_update(data, n_prev_samples=None, n_jobs=-1)[source]

Method to update the parameters of the BayesianNetwork with more data. Internally, uses BayesianEstimator with dirichlet prior, and uses the current CPDs (along with n_prev_samples) to compute the pseudo_counts.

Parameters
  • data (pandas.DataFrame) – The new dataset to use for updating the model.

  • n_prev_samples (int) – The number of samples/datapoints on which the model was trained before. This parameter determines how much weight the new data should be given. If None, n_prev_samples is set to the number of rows in data.

  • n_jobs (int (default: -1)) – Number of threads/processes to use for estimation. It improves speed only for large networks (>100 nodes). For smaller networks it might reduce performance.

Returns

None – Modifies the network in place.

Return type

None

Examples

>>> from pgmpy.utils import get_example_model
>>> from pgmpy.sampling import BayesianModelSampling
>>> model = get_example_model('alarm')
>>> # Generate some new data.
>>> data = BayesianModelSampling(model).forward_sample(int(1e3))
>>> model.fit_update(data)
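
The role of n_prev_samples can be illustrated for a single parentless variable. The sketch below assumes the pseudo-counts are the current probabilities scaled by n_prev_samples, as the description above suggests. It is an illustration rather than pgmpy's code, and the helper name update_marginal is hypothetical.

```python
from collections import Counter


def update_marginal(prior_probs, n_prev_samples, new_values):
    """Combine pseudo-counts from the current CPD with counts from new data."""
    counts = Counter(new_values)
    # Pseudo-count of a state = current probability * number of earlier samples.
    posterior_counts = {
        state: prob * n_prev_samples + counts.get(state, 0)
        for state, prob in prior_probs.items()
    }
    total = sum(posterior_counts.values())
    return {state: c / total for state, c in posterior_counts.items()}


prior = {'low': 0.6, 'high': 0.4}
new_values = ['low', 'high', 'high', 'high']

# Pseudo-counts (6, 4) plus new counts (1, 3) give (7, 7) out of 14.
updated = update_marginal(prior, n_prev_samples=10, new_values=new_values)
print(updated)
```

A larger n_prev_samples makes the pseudo-counts dominate, so the new data moves the estimate less.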
get_cardinality(node=None)[source]

Returns the cardinality of the node. Throws an error if the CPD for the queried node hasn’t been added to the network.

Parameters

node (any hashable Python object (optional)) – The node whose cardinality we want. If node is not specified, returns a dictionary with the variables as keys and their respective cardinalities as values.

Returns

If node is specified, the cardinality of the node. Otherwise, a dictionary with the variables as keys and their respective cardinalities as values.

Return type

int or dict

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd_diff = TabularCPD('diff', 2, [[0.6], [0.4]])
>>> cpd_intel = TabularCPD('intel', 2, [[0.7], [0.3]])
>>> cpd_grade = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                                     [0.9, 0.1, 0.8, 0.3]],
...                                 ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd_diff, cpd_intel, cpd_grade)
>>> student.get_cardinality()
defaultdict(<class 'int'>, {'diff': 2, 'intel': 2, 'grade': 2})
>>> student.get_cardinality('intel')
2
get_cpds(node=None)[source]

Returns the CPD of the node. If node is not specified, returns all the CPDs that have been added to the graph so far.

Parameters

node (any hashable Python object (optional)) – The node whose CPD we want. If node is not specified, returns all the CPDs added to the model.

Returns

The CPD of node if node is specified; otherwise a list of TabularCPDs.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                               [0.9, 0.1, 0.8, 0.3]],
...                  ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd)
>>> student.get_cpds()
get_factorized_product(latex=False)[source]
get_markov_blanket(node)[source]

Returns the Markov blanket of a random variable. In a Bayesian network, the Markov blanket of a node is the set of its parents, its children, and its children’s other parents.

Parameters

node (string, int or any hashable Python object) – The node whose Markov blanket is to be returned.

Returns

List of nodes contained in the Markov blanket.

Return type

list

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> G = BayesianNetwork([('x', 'y'), ('z', 'y'), ('y', 'w'), ('y', 'v'), ('u', 'w'),
...                    ('s', 'v'), ('w', 't'), ('w', 'm'), ('v', 'n'), ('v', 'q')])
>>> G.get_markov_blanket('y')
['s', 'u', 'w', 'v', 'z', 'x']
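
The definition above translates directly into set operations on the edge list. The following is a minimal standard-library sketch, not pgmpy's implementation; the helper name markov_blanket is hypothetical.

```python
def markov_blanket(edges, node):
    """Parents, children and children's other parents of `node`."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    # Every parent of a child of `node`; this includes `node` itself.
    co_parents = {u for u, v in edges if v in children}
    return (parents | children | co_parents) - {node}


edges = [('x', 'y'), ('z', 'y'), ('y', 'w'), ('y', 'v'), ('u', 'w'),
         ('s', 'v'), ('w', 't'), ('w', 'm'), ('v', 'n'), ('v', 'q')]

blanket = markov_blanket(edges, 'y')
print(sorted(blanket))
```

The result is a set, so the order of nodes carries no meaning, just as the order of the list returned in the example above should not be relied on.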
static get_random(n_nodes=5, edge_prob=0.5, n_states=None, latents=False)[source]

Returns a randomly generated Bayesian network on n_nodes variables with an edge probability of edge_prob between variables.

Parameters
  • n_nodes (int) – The number of nodes in the randomly generated DAG.

  • edge_prob (float) – The probability of edge between any two nodes in the topologically sorted DAG.

  • n_states (int or list (array-like) (default: None)) – The number of states of each variable. When None randomly generates the number of states.

  • latents (bool (default: False)) – If True, also creates latent variables.

Returns

The randomly generated DAG.

Return type

pgmpy.base.DAG instance

Examples

>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork.get_random(n_nodes=5)
>>> model.nodes()
NodeView((0, 1, 3, 4, 2))
>>> model.edges()
OutEdgeView([(0, 1), (0, 3), (1, 3), (1, 4), (3, 4), (2, 3)])
>>> model.cpds
[<TabularCPD representing P(0:0) at 0x7f97e16eabe0>,
 <TabularCPD representing P(1:1 | 0:0) at 0x7f97e16ea670>,
 <TabularCPD representing P(3:3 | 0:0, 1:1, 2:2) at 0x7f97e16820d0>,
 <TabularCPD representing P(4:4 | 1:1, 3:3) at 0x7f97e16eae80>,
 <TabularCPD representing P(2:2) at 0x7f97e1682c40>]
is_imap(JPD)[source]

Checks whether the Bayesian model is an I-map of the given JointProbabilityDistribution.

Parameters

JPD (JointProbabilityDistribution) – The joint probability distribution for which to check the I-map.

Returns

True if the Bayesian model is an I-map of the given joint probability distribution, False otherwise.

Return type

boolean

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> from pgmpy.factors.discrete import JointProbabilityDistribution
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> diff_cpd = TabularCPD('diff', 2, [[0.2], [0.8]])
>>> intel_cpd = TabularCPD('intel', 3, [[0.5], [0.3], [0.2]])
>>> grade_cpd = TabularCPD('grade', 3,
...                        [[0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.8,0.8,0.8,0.8,0.8,0.8]],
...                        evidence=['diff', 'intel'],
...                        evidence_card=[2, 3])
>>> G.add_cpds(diff_cpd, intel_cpd, grade_cpd)
>>> val = [0.01, 0.01, 0.08, 0.006, 0.006, 0.048, 0.004, 0.004, 0.032,
...        0.04, 0.04, 0.32, 0.024, 0.024, 0.192, 0.016, 0.016, 0.128]
>>> JPD = JointProbabilityDistribution(['diff', 'intel', 'grade'], [2, 3, 3], val)
>>> G.is_imap(JPD)
True
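
The values in val above are the product of the three CPDs, which is why the check succeeds. The sketch below recomputes that product with the standard library. It assumes the joint table is laid out with the last variable ('grade') varying fastest, and it is not pgmpy's implementation of is_imap.

```python
import math
from itertools import product

p_diff = [0.2, 0.8]
p_intel = [0.5, 0.3, 0.2]
# In the example, P(grade | diff, intel) has the same column for every parent
# configuration, so a single column is enough here.
p_grade = [0.1, 0.1, 0.8]

# Enumerate (diff, intel, grade) with the last index changing fastest.
joint = [p_diff[d] * p_intel[i] * p_grade[g]
         for d, i, g in product(range(2), range(3), range(3))]

val = [0.01, 0.01, 0.08, 0.006, 0.006, 0.048, 0.004, 0.004, 0.032,
       0.04, 0.04, 0.32, 0.024, 0.024, 0.192, 0.016, 0.016, 0.128]

matches = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(joint, val))
print(matches)
```

Comparing with a tolerance instead of == avoids false negatives caused by floating-point rounding.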
static load(filename, filetype='bif')[source]

Reads the model from a file.

Parameters
  • filename (str) – The path along with the filename of the file to read.

  • filetype (str (default: bif)) – The format of the model file. Can be one of the following: bif, uai, xmlbif.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.utils import get_example_model
>>> alarm = get_example_model('alarm')
>>> alarm.save('alarm.bif', filetype='bif')
>>> alarm_model = BayesianNetwork.load('alarm.bif', filetype='bif')
predict(data, stochastic=False, n_jobs=-1)[source]

Predicts states of all the missing variables.

Parameters
  • data (pandas DataFrame object) – A DataFrame object with column names same as the variables in the model.

  • stochastic (boolean) – If True, does prediction by sampling from the distribution of the predicted variable(s). If False, returns the states with the highest probability value (i.e., MAP) for the predicted variable(s).

  • n_jobs (int (default: -1)) – The number of CPU cores to use. If -1, uses all available cores.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 5)),
...                       columns=['A', 'B', 'C', 'D', 'E'])
>>> train_data = values[:800]
>>> predict_data = values[800:]
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
>>> model.fit(train_data)
>>> predict_data = predict_data.copy()
>>> predict_data.drop('E', axis=1, inplace=True)
>>> y_pred = model.predict(predict_data)
>>> y_pred
    E
800 0
801 1
802 1
803 1
804 0
... ...
993 0
994 0
995 1
996 1
997 0
998 0
999 0
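
The difference between the two prediction modes can be shown on a single predicted distribution. This is a standard-library sketch under the assumption that the distribution over the missing variable is already known. The helper name pick_state and its rng parameter are hypothetical and not part of pgmpy.

```python
import random


def pick_state(distribution, stochastic=False, rng=None):
    """Choose a state from {state: probability}, by argmax or by sampling."""
    states = list(distribution)
    if stochastic:
        rng = rng or random.Random()
        weights = [distribution[s] for s in states]
        # Draw one state with probability proportional to its weight.
        return rng.choices(states, weights=weights, k=1)[0]
    # Deterministic mode: the most probable state.
    return max(states, key=lambda s: distribution[s])


dist = {0: 0.44, 1: 0.56}

map_state = pick_state(dist)
sampled = pick_state(dist, stochastic=True, rng=random.Random(0))
print(map_state, sampled)
```

With stochastic=False the answer is always the same for a given distribution, while repeated stochastic calls reproduce the distribution on average.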
predict_probability(data)[source]

Predicts probabilities of all states of the missing variables.

Parameters

data (pandas DataFrame object) – A DataFrame object with column names same as the variables in the model.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(100, 5)),
...                       columns=['A', 'B', 'C', 'D', 'E'])
>>> train_data = values[:80]
>>> predict_data = values[80:]
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
>>> model.fit(train_data)
>>> predict_data = predict_data.copy()
>>> predict_data.drop('B', axis=1, inplace=True)
>>> y_prob = model.predict_probability(predict_data)
>>> y_prob
    B_0         B_1
80  0.439178    0.560822
81  0.581970    0.418030
82  0.488275    0.511725
83  0.581970    0.418030
84  0.510794    0.489206
85  0.439178    0.560822
86  0.439178    0.560822
87  0.417124    0.582876
88  0.407978    0.592022
89  0.429905    0.570095
90  0.581970    0.418030
91  0.407978    0.592022
92  0.429905    0.570095
93  0.429905    0.570095
94  0.439178    0.560822
95  0.407978    0.592022
96  0.559904    0.440096
97  0.417124    0.582876
98  0.488275    0.511725
99  0.407978    0.592022
remove_cpds(*cpds)[source]

Removes the cpds that are provided in the argument.

Parameters

*cpds (TabularCPD object) – The CPD objects which are to be removed from the model.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                               [0.9, 0.1, 0.8, 0.3]],
...                  ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd)
>>> student.remove_cpds(cpd)
remove_node(node)[source]

Remove node from the model.

Removing a node also removes all the associated edges, removes the CPD of the node, and marginalizes the CPDs of its children.

Parameters

node (node) – Node which is to be removed from the model.

Returns

None

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C'),
...                        ('A', 'D'), ('D', 'C')])
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 4)),
...                       columns=['A', 'B', 'C', 'D'])
>>> model.fit(values)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7f28248e2438>,
 <TabularCPD representing P(B:2 | A:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2 | A:2) at 0x7f28248e26a0>]
>>> model.remove_node('A')
>>> model.get_cpds()
[<TabularCPD representing P(B:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2) at 0x7f28248e26a0>]
remove_nodes_from(nodes)[source]

Remove multiple nodes from the model.

Removing a node also removes all the associated edges, removes the CPD of the node, and marginalizes the CPDs of its children.

Parameters

nodes (list, set (iterable)) – Nodes which are to be removed from the model.

Returns

None

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C'),
...                        ('A', 'D'), ('D', 'C')])
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 4)),
...                       columns=['A', 'B', 'C', 'D'])
>>> model.fit(values)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7f28248e2438>,
 <TabularCPD representing P(B:2 | A:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2 | A:2) at 0x7f28248e26a0>]
>>> model.remove_nodes_from(['A', 'B'])
>>> model.get_cpds()
[<TabularCPD representing P(C:2 | D:2) at 0x7f28248e2a58>,
 <TabularCPD representing P(D:2) at 0x7f28248e26d8>]
save(filename, filetype='bif')[source]

Writes the model to a file.

Parameters
  • filename (str) – The path along with the filename where to write the file.

  • filetype (str (default: bif)) – The format in which to write the model to file. Can be one of the following: bif, uai, xmlbif.

Examples

>>> from pgmpy.utils import get_example_model
>>> alarm = get_example_model('alarm')
>>> alarm.save('alarm.bif', filetype='bif')
simulate(n_samples=10, do=None, evidence=None, virtual_evidence=None, virtual_intervention=None, include_latents=False, partial_samples=None, seed=None, show_progress=True)[source]

Simulates data from the given model. Internally uses methods from pgmpy.sampling.BayesianModelSampling to generate the data.

Parameters
  • n_samples (int) – The number of data samples to simulate from the model.

  • do (dict) – The interventions to apply to the model. dict should be of the form {variable_name: state}

  • evidence (dict) – Observed evidence to apply to the model. dict should be of the form {variable_name: state}

  • virtual_evidence (list) – Probabilistically apply evidence to the model. virtual_evidence should be a list of pgmpy.factors.discrete.TabularCPD objects specifying the virtual probabilities.

  • virtual_intervention (list) – Also known as soft intervention. virtual_intervention should be a list of pgmpy.factors.discrete.TabularCPD objects specifying the virtual/soft intervention probabilities.

  • include_latents (boolean) – Whether to include the latent variable values in the generated samples.

  • partial_samples (pandas.DataFrame) – A pandas dataframe specifying samples on some of the variables in the model. If specified, the sampling procedure uses these sample values, instead of generating them. partial_samples.shape[0] must be equal to n_samples.

  • seed (int (default: None)) – If a value is provided, sets the seed for numpy.random.

  • show_progress (bool) – If True, shows a progress bar when generating samples.

Returns

A dataframe with the simulated data.

Return type

pandas.DataFrame

Examples

>>> from pgmpy.utils import get_example_model
>>> from pgmpy.factors.discrete import TabularCPD

Simulation without any evidence or intervention:

>>> model = get_example_model('alarm')
>>> model.simulate(n_samples=10)

Simulation with hard evidence: MINVOLSET = HIGH

>>> model.simulate(n_samples=10, evidence={"MINVOLSET": "HIGH"})

Simulation with hard intervention: CVP = LOW

>>> model.simulate(n_samples=10, do={"CVP": "LOW"})

Simulation with virtual/soft evidence: p(MINVOLSET=LOW) = 0.8, p(MINVOLSET=HIGH) = 0.2, p(MINVOLSET=NORMAL) = 0

>>> virt_evidence = [TabularCPD("MINVOLSET", 3, [[0.8], [0.0], [0.2]],
...                             state_names={"MINVOLSET": ["LOW", "NORMAL", "HIGH"]})]
>>> model.simulate(n_samples=10, virtual_evidence=virt_evidence)

Simulation with virtual/soft intervention: p(CVP=LOW) = 0.2, p(CVP=NORMAL) = 0.5, p(CVP=HIGH) = 0.3

>>> virt_intervention = [TabularCPD("CVP", 3, [[0.2], [0.5], [0.3]],
...                                 state_names={"CVP": ["LOW", "NORMAL", "HIGH"]})]
>>> model.simulate(n_samples=10, virtual_intervention=virt_intervention)

to_junction_tree()[source]

Creates a junction tree (or clique tree) for a given bayesian model.

To convert a Bayesian model into a clique tree, it is first converted into a Markov model.

For a given Markov model (H), a junction tree (G) is a graph in which:

  1. each node in G corresponds to a maximal clique in H, and

  2. each sepset in G separates the variables strictly on one side of the edge from those on the other.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade'),
...                    ('intel', 'SAT'), ('grade', 'letter')])
>>> diff_cpd = TabularCPD('diff', 2, [[0.2], [0.8]])
>>> intel_cpd = TabularCPD('intel', 3, [[0.5], [0.3], [0.2]])
>>> grade_cpd = TabularCPD('grade', 3,
...                        [[0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.8,0.8,0.8,0.8,0.8,0.8]],
...                        evidence=['diff', 'intel'],
...                        evidence_card=[2, 3])
>>> sat_cpd = TabularCPD('SAT', 2,
...                      [[0.1, 0.2, 0.7],
...                       [0.9, 0.8, 0.3]],
...                      evidence=['intel'], evidence_card=[3])
>>> letter_cpd = TabularCPD('letter', 2,
...                         [[0.1, 0.4, 0.8],
...                          [0.9, 0.6, 0.2]],
...                         evidence=['grade'], evidence_card=[3])
>>> G.add_cpds(diff_cpd, intel_cpd, grade_cpd, sat_cpd, letter_cpd)
>>> jt = G.to_junction_tree()
to_markov_model()[source]

Converts the Bayesian model to a Markov model. The Markov model created is the moral graph of the Bayesian model.

Examples

>>> from pgmpy.models import BayesianNetwork
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade'),
...                    ('intel', 'SAT'), ('grade', 'letter')])
>>> mm = G.to_markov_model()
>>> mm.nodes()
NodeView(('diff', 'grade', 'intel', 'letter', 'SAT'))
>>> mm.edges()
EdgeView([('diff', 'grade'), ('diff', 'intel'), ('grade', 'letter'), ('grade', 'intel'), ('intel', 'SAT')])
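
Moralization itself is simple to express: drop the edge directions and connect every pair of parents that share a child. The sketch below reproduces the edge set from the example with the standard library only. It is an illustration, not pgmpy's implementation, and the helper name moral_edges is hypothetical.

```python
from itertools import combinations


def moral_edges(edges):
    """Return the undirected edges of the moral graph as a set of frozensets."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    # Start from the original edges with their direction dropped.
    undirected = {frozenset(edge) for edge in edges}
    # "Marry" every pair of parents that share a child.
    for node_parents in parents.values():
        for a, b in combinations(sorted(node_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


edges = [('diff', 'grade'), ('intel', 'grade'),
         ('intel', 'SAT'), ('grade', 'letter')]

moral = moral_edges(edges)
print(sorted(tuple(sorted(edge)) for edge in moral))
```

The only added edge is the one between 'diff' and 'intel', the two parents of 'grade', which matches the ('diff', 'intel') entry in the EdgeView above.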