Bayesian Network
- class pgmpy.models.BayesianNetwork.BayesianNetwork(ebunch=None, latents={})
Base class for Bayesian Models.
- add_cpds(*cpds)
Add CPD (Conditional Probability Distribution) to the Bayesian Model.
- Parameters:
cpds (list, set, tuple (array-like)) – List of CPDs which will be associated with the model
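Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete.CPD import TabularCPD
>>> student = BayesianNetwork([('diff', 'grades'), ('intel', 'grades')])
>>> grades_cpd = TabularCPD('grades', 3, [[0.1,0.1,0.1,0.1,0.1,0.1],
...                                       [0.1,0.1,0.1,0.1,0.1,0.1],
...                                       [0.8,0.8,0.8,0.8,0.8,0.8]],
...                         evidence=['diff', 'intel'], evidence_card=[2, 3])
>>> student.add_cpds(grades_cpd)

+--------+-----------------------+-----------------------+
| diff:  |         easy          |         hard          |
+--------+------+------+---------+------+------+---------+
| intel: | dumb | avg  |  smart  | dumb | avg  |  smart  |
+--------+------+------+---------+------+------+---------+
| gradeA | 0.1  | 0.1  |   0.1   | 0.1  | 0.1  |   0.1   |
+--------+------+------+---------+------+------+---------+
| gradeB | 0.1  | 0.1  |   0.1   | 0.1  | 0.1  |   0.1   |
+--------+------+------+---------+------+------+---------+
| gradeC | 0.8  | 0.8  |   0.8   | 0.8  | 0.8  |   0.8   |
+--------+------+------+---------+------+------+---------+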
- add_edge(u, v, **kwargs)
Add an edge between u and v.
The nodes u and v will be automatically added if they are not already in the graph
- Parameters:
u (nodes) – Nodes can be any hashable python object.
v (nodes) – Nodes can be any hashable python object.
Examples
>>> from pgmpy.models import BayesianNetwork
>>> G = BayesianNetwork()
>>> G.add_nodes_from(['grade', 'intel'])
>>> G.add_edge('grade', 'intel')
- check_model()
Check the model for various errors. This method performs the following checks:
  - Checks if the sum of the probabilities for each state is equal to 1 (tol=0.01).
  - Checks if the CPDs associated with nodes are consistent with their parents.
- Returns:
check – True if all the checks pass; otherwise an error is raised.
- Return type:
boolean
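The first check can be pictured as a column-sum test on a CPD table. The following is a minimal plain-Python sketch of that idea, not pgmpy's implementation: the helper name `columns_sum_to_one` is hypothetical, and it assumes the check means that, for each combination of parent states, the probabilities over the variable's states sum to 1 within the stated tolerance.

```python
def columns_sum_to_one(table, tol=0.01):
    """Return True if every column of a CPD table sums to 1 within tol.

    `table` is a list of rows: one row per state of the variable and
    one column per combination of parent states (an assumed layout).
    """
    n_cols = len(table[0])
    for col in range(n_cols):
        total = sum(row[col] for row in table)
        if abs(total - 1.0) > tol:
            return False
    return True


# The 'grades' table from the add_cpds example: every column sums to 1.
grades = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
]
grades_ok = columns_sum_to_one(grades)

# A malformed table whose first column sums to 1.2.
broken = [[0.5, 0.5], [0.7, 0.5]]
broken_ok = columns_sum_to_one(broken)
```

Here `grades_ok` is True and `broken_ok` is False, since the first column of `broken` exceeds 1 by more than the tolerance.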
- copy()
Returns a copy of the model.
- Returns:
Model’s copy – Copy of the model on which the method was called.
- Return type:
pgmpy.models.BayesianNetwork
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C')])
>>> cpd_a = TabularCPD('A', 2, [[0.2], [0.8]])
>>> cpd_b = TabularCPD('B', 2, [[0.3, 0.7], [0.7, 0.3]],
...                    evidence=['A'],
...                    evidence_card=[2])
>>> cpd_c = TabularCPD('C', 2, [[0.1, 0.9], [0.9, 0.1]],
...                    evidence=['B'],
...                    evidence_card=[2])
>>> model.add_cpds(cpd_a, cpd_b, cpd_c)
>>> copy_model = model.copy()
>>> copy_model.nodes()
NodeView(('A', 'B', 'C'))
>>> copy_model.edges()
OutEdgeView([('A', 'B'), ('B', 'C')])
>>> len(copy_model.get_cpds())
3
- do(nodes, inplace=False)
Applies the do operation. The do operation removes all incoming edges to variables in nodes and marginalizes their CPDs to only contain the variable itself.
- Parameters:
nodes (list, array-like) – The names of the nodes to apply the do-operator for.
inplace (boolean (default: False)) – If inplace=True, makes the changes to the current object, otherwise returns a new instance.
- Returns:
Modified network – If inplace=True, modifies the object itself and returns None; otherwise returns a new instance of BayesianNetwork modified by the do operation.
- Return type:
pgmpy.models.BayesianNetwork or None
Examples
>>> from pgmpy.utils import get_example_model
>>> asia = get_example_model('asia')
>>> asia.edges()
OutEdgeView([('asia', 'tub'), ('tub', 'either'), ('smoke', 'lung'), ('smoke', 'bronc'), ('lung', 'either'), ('bronc', 'dysp'), ('either', 'xray'), ('either', 'dysp')])
>>> do_bronc = asia.do(['bronc'])
>>> do_bronc.edges()
OutEdgeView([('asia', 'tub'), ('tub', 'either'), ('smoke', 'lung'), ('lung', 'either'), ('bronc', 'dysp'), ('either', 'xray'), ('either', 'dysp')])
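The graph side of the do operation (dropping every edge that points into an intervened node) can be sketched on a plain edge list. This is only an illustration of the edge removal described above, not pgmpy's code; the helper name `do_edges` is hypothetical, and the CPD marginalization step is not modelled.

```python
def do_edges(edges, nodes):
    """Return the edges that remain after removing all edges into `nodes`."""
    targets = set(nodes)
    return [(u, v) for (u, v) in edges if v not in targets]


# Edge list of the 'asia' network as printed in the example above.
asia_edges = [
    ('asia', 'tub'), ('tub', 'either'), ('smoke', 'lung'),
    ('smoke', 'bronc'), ('lung', 'either'), ('bronc', 'dysp'),
    ('either', 'xray'), ('either', 'dysp'),
]

# Intervening on 'bronc' removes its only incoming edge, ('smoke', 'bronc').
after_do = do_edges(asia_edges, ['bronc'])
```

Outgoing edges such as ('bronc', 'dysp') are kept, because an intervention cuts a variable off from its causes, not from its effects.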
- fit(data, estimator=None, state_names=[], complete_samples_only=True, n_jobs=-1, **kwargs)
Estimates the CPD for each variable based on a given data set.
- Parameters:
data (pandas DataFrame object) – DataFrame object with column names identical to the variable names of the network. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
estimator (Estimator class) – One of:
  - MaximumLikelihoodEstimator (default)
  - BayesianEstimator: In this case, pass ‘prior_type’ and either ‘pseudo_counts’ or ‘equivalent_sample_size’ as additional keyword arguments. See BayesianEstimator.get_parameters() for usage.
  - ExpectationMaximization
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
complete_samples_only (bool (default True)) – Specifies how to deal with missing data, if present. If set to True, all rows that contain np.NaN anywhere are ignored. If False, then for each variable, every row where neither the variable nor its parents are np.NaN is used.
n_jobs (int (default: -1)) – Number of threads/processes to use for estimation. It improves speed only for large networks (>100 nodes); for smaller networks it might reduce performance.
- Returns:
Fitted Model – Modifies the network inplace and adds the cpds property.
- Return type:
Examples
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.estimators import MaximumLikelihoodEstimator
>>> data = pd.DataFrame(data={'A': [0, 0, 1], 'B': [0, 1, 0], 'C': [1, 1, 0]})
>>> model = BayesianNetwork([('A', 'C'), ('B', 'C')])
>>> model.fit(data)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7fb98a7d50f0>,
 <TabularCPD representing P(B:2) at 0x7fb98a7d5588>,
 <TabularCPD representing P(C:2 | A:2, B:2) at 0x7fb98a7b1f98>]
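To make the default (maximum likelihood) estimation concrete, the sketch below computes relative frequencies by counting, using the same three-row data set as the example above. It is a plain-Python illustration under stated assumptions, not pgmpy's MaximumLikelihoodEstimator: the helper names `mle_marginal` and `mle_conditional` are hypothetical, and parent configurations that never occur in the data simply get no entry here, which may differ from how the library handles them.

```python
from collections import Counter


def mle_marginal(values):
    """Relative frequency of each state of a variable without parents."""
    counts = Counter(values)
    return {state: n / len(values) for state, n in counts.items()}


def mle_conditional(child, parents):
    """Relative frequency of each child state within each parent configuration.

    `child` is a list of child values; `parents` is a list of tuples holding
    the parent values of the same rows.
    """
    joint = Counter(zip(parents, child))
    parent_counts = Counter(parents)
    return {(pa, c): n / parent_counts[pa] for (pa, c), n in joint.items()}


a = [0, 0, 1]
b = [0, 1, 0]
c = [1, 1, 0]

p_a = mle_marginal(a)                      # P(A): A=0 in 2 of 3 rows
p_c = mle_conditional(c, list(zip(a, b)))  # P(C | A, B), observed configs only
```

With only three rows, each observed (A, B) configuration appears once, so every conditional estimate in `p_c` is degenerate (probability 1 for the single observed state of C).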
- fit_update(data, n_prev_samples=None, n_jobs=-1)
Method to update the parameters of the BayesianNetwork with more data. Internally, uses BayesianEstimator with dirichlet prior, and uses the current CPDs (along with n_prev_samples) to compute the pseudo_counts.
- Parameters:
data (pandas.DataFrame) – The new dataset to use for updating the model.
n_prev_samples (int) – The number of samples/datapoints on which the model was trained before. This parameter determines how much weight the new data should be given. If None, n_prev_samples = nrow(data).
n_jobs (int (default: -1)) – Number of threads/processes to use for estimation. It improves speed only for large networks (>100 nodes); for smaller networks it might reduce performance.
- Returns:
Updated model – Modifies the network inplace.
- Return type:
Examples
>>> from pgmpy.utils import get_example_model
>>> from pgmpy.sampling import BayesianModelSampling
>>> model = get_example_model('alarm')
>>> # Generate some new data.
>>> data = BayesianModelSampling(model).forward_sample(int(1e3))
>>> model.fit_update(data)
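The role of n_prev_samples can be illustrated for a single distribution. The sketch below assumes the pseudo counts are the current probabilities scaled by n_prev_samples and that the update is the usual Dirichlet posterior mean; this is an interpretation of the description above, not a copy of pgmpy's code, and the helper name `update_distribution` is hypothetical.

```python
def update_distribution(p_old, new_counts, n_prev_samples):
    """Blend an existing distribution with counts observed in new data.

    Assumed scheme: pseudo_counts = n_prev_samples * p_old, then
    normalise (pseudo_counts + new_counts).
    """
    pseudo_counts = [n_prev_samples * p for p in p_old]
    totals = [pc + c for pc, c in zip(pseudo_counts, new_counts)]
    norm = sum(totals)
    return [t / norm for t in totals]


# Old estimate is uniform; the new data shows 30 vs 10 observations.
equal_weight = update_distribution([0.5, 0.5], [30, 10], n_prev_samples=40)

# A larger n_prev_samples makes the old estimate dominate the update.
old_dominates = update_distribution([0.5, 0.5], [30, 10], n_prev_samples=360)
```

Under these assumptions `equal_weight` is [0.625, 0.375], while `old_dominates` moves much less, to [0.525, 0.475].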
- get_cardinality(node=None)
Returns the cardinality of the node. Throws an error if the CPD for the queried node hasn’t been added to the network.
- Parameters:
node (Any hashable python object(optional)) – The node whose cardinality we want. If node is not specified returns a dictionary with the given variable as keys and their respective cardinality as values.
- Returns:
variable cardinalities – If node is specified returns the cardinality of the node else returns a dictionary with the cardinality of each variable in the network
- Return type:
int or dict
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd_diff = TabularCPD('diff', 2, [[0.6], [0.4]])
>>> cpd_intel = TabularCPD('intel', 2, [[0.7], [0.3]])
>>> cpd_grade = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                                     [0.9, 0.1, 0.8, 0.3]],
...                        ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd_diff, cpd_intel, cpd_grade)
>>> student.get_cardinality()
defaultdict(<class 'int'>, {'diff': 2, 'intel': 2, 'grade': 2})
>>> student.get_cardinality('intel')
2
- get_cpds(node=None)
Returns the CPD of the node. If node is not specified, returns all the CPDs that have been added to the model so far.
- Parameters:
node (any hashable python object (optional)) – The node whose CPD we want. If node not specified returns all the CPDs added to the model.
- Returns:
A list of TabularCPDs
- Return type:
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                               [0.9, 0.1, 0.8, 0.3]],
...                  ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd)
>>> student.get_cpds()
- get_markov_blanket(node)
Returns a markov blanket for a random variable. In the case of Bayesian Networks, the markov blanket is the set of node’s parents, its children and its children’s other parents.
- Returns:
Markov Blanket – List of nodes contained in Markov Blanket of node
- Return type:
list
- Parameters:
node (string, int or any hashable python object.) – The node whose markov blanket would be returned.
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> G = BayesianNetwork([('x', 'y'), ('z', 'y'), ('y', 'w'), ('y', 'v'), ('u', 'w'),
...                      ('s', 'v'), ('w', 't'), ('w', 'm'), ('v', 'n'), ('v', 'q')])
>>> G.get_markov_blanket('y')
['s', 'u', 'w', 'v', 'z', 'x']
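The definition above (parents, children, and the children's other parents) translates directly into set operations on an edge list. The following plain-Python sketch uses the same graph as the example; the helper name `markov_blanket` is hypothetical and this is not pgmpy's implementation.

```python
def markov_blanket(edges, node):
    """Return the Markov blanket of `node` in a DAG given as (parent, child) edges."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    # All parents of the node's children (this includes the node itself).
    co_parents = {u for u, v in edges if v in children}
    return (parents | children | co_parents) - {node}


edges = [('x', 'y'), ('z', 'y'), ('y', 'w'), ('y', 'v'), ('u', 'w'),
         ('s', 'v'), ('w', 't'), ('w', 'm'), ('v', 'n'), ('v', 'q')]

# Parents: x, z. Children: w, v. Other parents of the children: u, s.
blanket = markov_blanket(edges, 'y')
```

The result is a set, so it contains the same six nodes as the list returned in the example, without any particular order.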
- static get_random(n_nodes=5, edge_prob=0.5, n_states=None, latents=False)
Returns a randomly generated Bayesian network on n_nodes variables with an edge probability of edge_prob between variables.
- Parameters:
n_nodes (int) – The number of nodes in the randomly generated DAG.
edge_prob (float) – The probability of edge between any two nodes in the topologically sorted DAG.
n_states (int or list (array-like) (default: None)) – The number of states of each variable. When None randomly generates the number of states.
latents (bool (default: False)) – If True, also creates latent variables.
- Returns:
Random DAG – The randomly generated DAG.
- Return type:
Examples
>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork.get_random(n_nodes=5)
>>> model.nodes()
NodeView((0, 1, 3, 4, 2))
>>> model.edges()
OutEdgeView([(0, 1), (0, 3), (1, 3), (1, 4), (3, 4), (2, 3)])
>>> model.cpds
[<TabularCPD representing P(0:0) at 0x7f97e16eabe0>,
 <TabularCPD representing P(1:1 | 0:0) at 0x7f97e16ea670>,
 <TabularCPD representing P(3:3 | 0:0, 1:1, 2:2) at 0x7f97e16820d0>,
 <TabularCPD representing P(4:4 | 1:1, 3:3) at 0x7f97e16eae80>,
 <TabularCPD representing P(2:2) at 0x7f97e1682c40>]
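One common way to obtain a random DAG from an edge probability is to fix a node ordering and add each forward edge independently with probability edge_prob, which rules out cycles by construction. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not a description of how pgmpy generates its graphs, and the helper name `random_dag_edges` is hypothetical. No CPDs are generated.

```python
import random


def random_dag_edges(n_nodes=5, edge_prob=0.5, seed=None):
    """Return (order, edges) for a random DAG over nodes 0..n_nodes-1.

    Edges only go from earlier to later nodes in `order`, so the
    resulting graph cannot contain a directed cycle.
    """
    rng = random.Random(seed)
    order = list(range(n_nodes))
    rng.shuffle(order)  # random topological order
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < edge_prob:
                edges.append((order[i], order[j]))
    return order, edges


order, edges = random_dag_edges(n_nodes=5, edge_prob=0.5, seed=0)
```

Setting edge_prob to 0 yields no edges, and setting it to 1 yields all n_nodes * (n_nodes - 1) / 2 forward edges.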
- is_imap(JPD)
Checks whether the Bayesian model is an I-map of the given JointProbabilityDistribution.
- Parameters:
JPD (JointProbabilityDistribution) – The joint probability distribution against which to check the I-map property.
- Returns:
is IMAP – True if the Bayesian model is an I-map of the given joint probability distribution, False otherwise.
- Return type:
bool
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> from pgmpy.factors.discrete import JointProbabilityDistribution
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> diff_cpd = TabularCPD('diff', 2, [[0.2], [0.8]])
>>> intel_cpd = TabularCPD('intel', 3, [[0.5], [0.3], [0.2]])
>>> grade_cpd = TabularCPD('grade', 3,
...                        [[0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.8,0.8,0.8,0.8,0.8,0.8]],
...                        evidence=['diff', 'intel'],
...                        evidence_card=[2, 3])
>>> G.add_cpds(diff_cpd, intel_cpd, grade_cpd)
>>> val = [0.01, 0.01, 0.08, 0.006, 0.006, 0.048, 0.004, 0.004, 0.032,
...        0.04, 0.04, 0.32, 0.024, 0.024, 0.192, 0.016, 0.016, 0.128]
>>> JPD = JointProbabilityDistribution(['diff', 'intel', 'grade'], [2, 3, 3], val)
>>> G.is_imap(JPD)
True
- static load(filename, filetype='bif')
Reads the model from a file.
- Parameters:
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.utils import get_example_model
>>> alarm = get_example_model('alarm')
>>> alarm.save('alarm.bif', filetype='bif')
>>> alarm_model = BayesianNetwork.load('alarm.bif', filetype='bif')
- predict(data, stochastic=False, n_jobs=-1)
Predicts states of all the missing variables.
- Parameters:
data (pandas DataFrame object) – A DataFrame object with column names same as the variables in the model.
stochastic (boolean) – If True, does prediction by sampling from the distribution of predicted variable(s). If False, returns the states with the highest probability value (i.e., MAP) for the predicted variable(s).
n_jobs (int (default: -1)) – The number of CPU cores to use. If -1, uses all available cores.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 5)),
...                       columns=['A', 'B', 'C', 'D', 'E'])
>>> train_data = values[:800]
>>> predict_data = values[800:]
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
>>> model.fit(train_data)
>>> predict_data = predict_data.copy()
>>> predict_data.drop('E', axis=1, inplace=True)
>>> y_pred = model.predict(predict_data)
>>> y_pred
     E
800  0
801  1
802  1
803  1
804  0
... ...
993  0
994  0
995  1
996  1
997  0
998  0
999  0
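The difference between the two modes of the stochastic parameter can be shown on a single predicted distribution. This plain-Python sketch is illustrative only and does not reproduce pgmpy's inference; the helper name `pick_state` and its `rng` argument are hypothetical.

```python
import random


def pick_state(probs, stochastic=False, rng=None):
    """Choose a state from a {state: probability} mapping.

    stochastic=False returns the most probable state (MAP);
    stochastic=True samples a state in proportion to its probability.
    """
    if stochastic:
        rng = rng or random.Random()
        states = list(probs)
        weights = [probs[s] for s in states]
        return rng.choices(states, weights=weights, k=1)[0]
    return max(probs, key=probs.get)


distribution = {0: 0.44, 1: 0.56}

map_state = pick_state(distribution)  # always the most probable state
sampled_state = pick_state(distribution, stochastic=True, rng=random.Random(0))
```

Repeated MAP predictions for the same row are identical, whereas repeated stochastic predictions vary from call to call unless the random generator is seeded.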
- predict_probability(data)
Predicts probabilities of all states of the missing variables.
- Parameters:
data (pandas DataFrame object) – A DataFrame object with column names same as the variables in the model.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(100, 5)),
...                       columns=['A', 'B', 'C', 'D', 'E'])
>>> train_data = values[:80]
>>> predict_data = values[80:]
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
>>> model.fit(train_data)
>>> predict_data = predict_data.copy()
>>> predict_data.drop('B', axis=1, inplace=True)
>>> y_prob = model.predict_probability(predict_data)
>>> y_prob
         B_0       B_1
80  0.439178  0.560822
81  0.581970  0.418030
82  0.488275  0.511725
83  0.581970  0.418030
84  0.510794  0.489206
85  0.439178  0.560822
86  0.439178  0.560822
87  0.417124  0.582876
88  0.407978  0.592022
89  0.429905  0.570095
90  0.581970  0.418030
91  0.407978  0.592022
92  0.429905  0.570095
93  0.429905  0.570095
94  0.439178  0.560822
95  0.407978  0.592022
96  0.559904  0.440096
97  0.417124  0.582876
98  0.488275  0.511725
99  0.407978  0.592022
- remove_cpds(*cpds)
Removes the cpds that are provided in the argument.
- Parameters:
*cpds (TabularCPD object) – The CPD object(s) to remove from the model.
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> student = BayesianNetwork([('diff', 'grade'), ('intel', 'grade')])
>>> cpd = TabularCPD('grade', 2, [[0.1, 0.9, 0.2, 0.7],
...                               [0.9, 0.1, 0.8, 0.3]],
...                  ['intel', 'diff'], [2, 2])
>>> student.add_cpds(cpd)
>>> student.remove_cpds(cpd)
- remove_node(node)
Remove node from the model.
Removing a node also removes all the associated edges, removes the CPD of the node and marginalizes the CPDs of its children.
- Parameters:
node (node) – Node which is to be removed from the model.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C'),
...                          ('A', 'D'), ('D', 'C')])
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 4)),
...                       columns=['A', 'B', 'C', 'D'])
>>> model.fit(values)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7f28248e2438>,
 <TabularCPD representing P(B:2 | A:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2 | A:2) at 0x7f28248e26a0>]
>>> model.remove_node('A')
>>> model.get_cpds()
[<TabularCPD representing P(B:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2) at 0x7f28248e26a0>]
- remove_nodes_from(nodes)
Remove multiple nodes from the model.
Removing a node also removes all the associated edges, removes the CPD of the node and marginalizes the CPDs of its children.
- Parameters:
nodes (list, set (iterable)) – Nodes which are to be removed from the model.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.models import BayesianNetwork
>>> model = BayesianNetwork([('A', 'B'), ('B', 'C'),
...                          ('A', 'D'), ('D', 'C')])
>>> values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 4)),
...                       columns=['A', 'B', 'C', 'D'])
>>> model.fit(values)
>>> model.get_cpds()
[<TabularCPD representing P(A:2) at 0x7f28248e2438>,
 <TabularCPD representing P(B:2 | A:2) at 0x7f28248e23c8>,
 <TabularCPD representing P(C:2 | B:2, D:2) at 0x7f28248e2748>,
 <TabularCPD representing P(D:2 | A:2) at 0x7f28248e26a0>]
>>> model.remove_nodes_from(['A', 'B'])
>>> model.get_cpds()
[<TabularCPD representing P(C:2 | D:2) at 0x7f28248e2a58>,
 <TabularCPD representing P(D:2) at 0x7f28248e26d8>]
- save(filename, filetype='bif')
Writes the model to a file.
- Parameters:
Examples
>>> from pgmpy.utils import get_example_model
>>> alarm = get_example_model('alarm')
>>> alarm.save('alarm.bif', filetype='bif')
- simulate(n_samples=10, do=None, evidence=None, virtual_evidence=None, virtual_intervention=None, include_latents=False, partial_samples=None, seed=None, show_progress=True)
Simulates data from the given model. Internally uses methods from pgmpy.sampling.BayesianModelSampling to generate the data.
- Parameters:
n_samples (int) – The number of data samples to simulate from the model.
do (dict) – The interventions to apply to the model. dict should be of the form {variable_name: state}
evidence (dict) – Observed evidence to apply to the model. dict should be of the form {variable_name: state}
virtual_evidence (list) – Probabilistically apply evidence to the model. virtual_evidence should be a list of pgmpy.factors.discrete.TabularCPD objects specifying the virtual probabilities.
virtual_intervention (list) – Also known as soft intervention. virtual_intervention should be a list of pgmpy.factors.discrete.TabularCPD objects specifying the virtual/soft intervention probabilities.
include_latents (boolean) – Whether to include the latent variable values in the generated samples.
partial_samples (pandas.DataFrame) – A pandas dataframe specifying samples on some of the variables in the model. If specified, the sampling procedure uses these sample values, instead of generating them. partial_samples.shape[0] must be equal to n_samples.
seed (int (default: None)) – If a value is provided, sets the seed for numpy.random.
show_progress (bool) – If True, shows a progress bar when generating samples.
- Returns:
A dataframe with the simulated data
- Return type:
pd.DataFrame
Examples
>>> from pgmpy.utils import get_example_model
Simulation without any evidence or intervention:
>>> model = get_example_model('alarm')
>>> model.simulate(n_samples=10)
Simulation with the hard evidence: MINVOLSET = HIGH:
>>> model.simulate(n_samples=10, evidence={"MINVOLSET": "HIGH"})
Simulation with hard intervention: CVP = LOW:
>>> model.simulate(n_samples=10, do={"CVP": "LOW"})
Simulation with virtual/soft evidence: p(MINVOLSET=LOW) = 0.8, p(MINVOLSET=HIGH) = 0.2, p(MINVOLSET=NORMAL) = 0:
>>> from pgmpy.factors.discrete import TabularCPD
>>> virt_evidence = [TabularCPD("MINVOLSET", 3, [[0.8], [0.0], [0.2]],
...                  state_names={"MINVOLSET": ["LOW", "NORMAL", "HIGH"]})]
>>> model.simulate(n_samples=10, virtual_evidence=virt_evidence)
Simulation with virtual/soft intervention: p(CVP=LOW) = 0.2, p(CVP=NORMAL)=0.5, p(CVP=HIGH)=0.3:
>>> virt_intervention = [TabularCPD("CVP", 3, [[0.2], [0.5], [0.3]],
...                      state_names={"CVP": ["LOW", "NORMAL", "HIGH"]})]
>>> model.simulate(n_samples=10, virtual_intervention=virt_intervention)
- property states
Returns a dictionary mapping each node to its list of possible states.
- Returns:
state_dict – Dictionary of nodes to possible states
- Return type:
dict
- to_junction_tree()
Creates a junction tree (or clique tree) for a given bayesian model.
For converting a Bayesian Model into a Clique tree, first it is converted into a Markov one.
For a given markov model (H), a junction tree (G) is a graph in which:
  1. each node in G corresponds to a maximal clique in H, and
  2. each sepset in G separates the variables strictly on one side of the edge from those on the other.
Examples
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.factors.discrete import TabularCPD
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade'),
...                      ('intel', 'SAT'), ('grade', 'letter')])
>>> diff_cpd = TabularCPD('diff', 2, [[0.2], [0.8]])
>>> intel_cpd = TabularCPD('intel', 3, [[0.5], [0.3], [0.2]])
>>> grade_cpd = TabularCPD('grade', 3,
...                        [[0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.1,0.1,0.1,0.1,0.1,0.1],
...                         [0.8,0.8,0.8,0.8,0.8,0.8]],
...                        evidence=['diff', 'intel'],
...                        evidence_card=[2, 3])
>>> sat_cpd = TabularCPD('SAT', 2,
...                      [[0.1, 0.2, 0.7],
...                       [0.9, 0.8, 0.3]],
...                      evidence=['intel'], evidence_card=[3])
>>> letter_cpd = TabularCPD('letter', 2,
...                         [[0.1, 0.4, 0.8],
...                          [0.9, 0.6, 0.2]],
...                         evidence=['grade'], evidence_card=[3])
>>> G.add_cpds(diff_cpd, intel_cpd, grade_cpd, sat_cpd, letter_cpd)
>>> jt = G.to_junction_tree()
- to_markov_model()
Converts bayesian model to markov model. The markov model created would be the moral graph of the bayesian model.
Examples
>>> from pgmpy.models import BayesianNetwork
>>> G = BayesianNetwork([('diff', 'grade'), ('intel', 'grade'),
...                      ('intel', 'SAT'), ('grade', 'letter')])
>>> mm = G.to_markov_model()
>>> mm.nodes()
NodeView(('diff', 'grade', 'intel', 'letter', 'SAT'))
>>> mm.edges()
EdgeView([('diff', 'grade'), ('diff', 'intel'), ('grade', 'letter'), ('grade', 'intel'), ('intel', 'SAT')])
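The moral graph mentioned above is obtained by dropping edge directions and connecting every pair of nodes that share a child. The sketch below shows this on the same edge list as the example, in plain Python; the helper name `moral_graph_edges` is hypothetical and this is not pgmpy's implementation.

```python
from itertools import combinations


def moral_graph_edges(edges):
    """Return the undirected edges of the moral graph of a DAG.

    Each undirected edge is represented as a frozenset of its two endpoints.
    """
    # Step 1: drop the direction of every existing edge.
    undirected = {frozenset(edge) for edge in edges}
    # Step 2: "marry" the parents of each node.
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    for node_parents in parents.values():
        for a, b in combinations(sorted(node_parents), 2):
            undirected.add(frozenset((a, b)))
    return undirected


moral = moral_graph_edges([('diff', 'grade'), ('intel', 'grade'),
                           ('intel', 'SAT'), ('grade', 'letter')])
```

Only 'grade' has more than one parent, so the single added edge joins 'diff' and 'intel', which matches the ('diff', 'intel') entry in the EdgeView printed in the example.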