Hill Climb Search
- class pgmpy.estimators.HillClimbSearch(data, use_cache=True, **kwargs)
Class for heuristic hill climb searches for DAGs, to learn network structure from data. The estimate method attempts to find a model with optimal score.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states (or values) that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
use_cache (boolean) – If True, caches scores for faster computation. Note: Caching only works for scoring methods that are decomposable. It can give wrong results with custom scoring methods.
References
Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009 Section 18.4.3 (page 811ff)
- estimate(scoring_method='k2score', start_dag=None, fixed_edges={}, tabu_length=100, max_indegree=None, black_list=None, white_list=None, epsilon=0.0001, max_iter=1000000.0, show_progress=True)
Performs local hill climb search to estimate the DAG structure that has optimal score, according to the scoring method supplied. Starts at model start_dag and proceeds by step-by-step network modifications until a local maximum is reached. Only estimates network structure, no parametrization.
- Parameters:
scoring_method (str or StructureScore instance) – The score to be optimized during structure estimation. Supported structure scores: k2score, bdeuscore, bdsscore, bicscore, aicscore. Also accepts a custom score, but it should be an instance of StructureScore.
start_dag (DAG instance) – The starting point for the local search. By default, a completely disconnected network is used.
fixed_edges (iterable) – A list of edges that will always be present in the final learned model. The algorithm adds these edges at the start and never changes them.
tabu_length (int) – If provided, the last tabu_length graph modifications cannot be reversed during the search procedure. This serves to enforce a wider exploration of the search space. Default value: 100.
max_indegree (int or None) – If provided and not None, the procedure only searches among models where all nodes have at most max_indegree parents. Defaults to None.
black_list (list or None) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges. Default: None
white_list (list or None) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list. Default: None
epsilon (float (default: 1e-4)) – Defines the exit condition. If the improvement in score is less than epsilon, the learned model is returned.
max_iter (int (default: 1e6)) – The maximum number of iterations allowed. Returns the learned model when the number of iterations is greater than max_iter.
- Returns:
Estimated model – A DAG at a (local) score maximum.
- Return type:
DAG instance
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators import HillClimbSearch, BicScore
>>> # create data sample with 9 random variables:
... data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
>>> # add 10th dependent variable
... data['J'] = data['A'] * data['B']
>>> est = HillClimbSearch(data)
>>> best_model = est.estimate(scoring_method=BicScore(data))
>>> sorted(best_model.nodes())
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
>>> best_model.edges()
OutEdgeView([('B', 'J'), ('A', 'J')])
>>> # search a model with restriction on the number of parents:
>>> est.estimate(max_indegree=1).edges()
OutEdgeView([('J', 'A'), ('B', 'J')])
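The search loop that estimate describes (start from a DAG, apply single-edge modifications, stop at a local maximum) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pgmpy's implementation: has_cycle, candidate_graphs, hill_climb and toy_score are hypothetical names, the score is a toy function that rewards two target edges, and the tabu list, max_indegree, fixed_edges and black/white lists are left out.

```python
from itertools import permutations


def has_cycle(nodes, edges):
    """Return True if the directed graph contains a cycle (depth-first search)."""
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on the DFS stack, 2 = finished

    def visit(n):
        state[n] = 1
        for c in children[n]:
            if state[c] == 1 or (state[c] == 0 and visit(c)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def candidate_graphs(nodes, edges):
    """Yield every edge set reachable by one add, remove, or flip that stays acyclic."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}  # removing an edge never creates a cycle
            flipped = (edges - {(u, v)}) | {(v, u)}
            if not has_cycle(nodes, flipped):
                yield flipped
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if not has_cycle(nodes, added):
                yield added


def hill_climb(nodes, score, start_edges=frozenset(), epsilon=1e-4, max_iter=1000):
    """Greedy local search: take the best single-edge change until the gain < epsilon."""
    current = frozenset(start_edges)
    current_score = score(current)
    for _ in range(max_iter):
        best_gain, best_edges = 0.0, None
        for candidate in candidate_graphs(nodes, current):
            gain = score(candidate) - current_score
            if gain > best_gain:
                best_gain, best_edges = gain, candidate
        if best_edges is None or best_gain < epsilon:
            break  # local maximum reached
        current, current_score = frozenset(best_edges), current_score + best_gain
    return current


# Toy score (an assumption for illustration): +1 per target edge, -1 per other edge.
target = {("A", "J"), ("B", "J")}


def toy_score(edges):
    return len(edges & target) - len(edges - target)


result = hill_climb(["A", "B", "J"], toy_score)
```

With a real structure score in place of toy_score, the same loop would only reach a local maximum, which is why the method documents its result as "a (local) score maximum".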
Structure Score
BDeu Score
- class pgmpy.estimators.BDeuScore(data, equivalent_sample_size=10, **kwargs)
Class for Bayesian structure scoring for BayesianNetworks with Dirichlet priors. The BDeu score is the result of setting all Dirichlet hyperparameters/pseudo_counts to equivalent_sample_size/variable_cardinality. The score-method measures how well a model is able to describe the given data set.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
equivalent_sample_size (int (default: 10)) – The equivalent/imaginary sample size (of uniform pseudo samples) for the Dirichlet hyperparameters. The score is sensitive to this value, so runs with different values might be useful.
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states (or values) that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
References
[1] Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009, Section 18.3.4-18.3.6 (esp. page 806)
[2] AM Carvalho, Scoring functions for learning Bayesian networks, http://www.lx.it.pt/~asmc/pub/talks/09-TA/ta_pres.pdf
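As a rough illustration of how a BDeu local score can be computed, the sketch below uses the common textbook formulation: each parent configuration receives a pseudo-count of equivalent_sample_size / q and each (configuration, state) cell receives equivalent_sample_size / (q * r), where q is the number of parent configurations and r the variable's cardinality. The function name bdeu_local_score and the list-of-dicts data layout are assumptions for this example and are not part of pgmpy.

```python
from collections import Counter
from math import lgamma, log


def bdeu_local_score(rows, variable, parents, states, equivalent_sample_size=10):
    """Log BDeu score of `variable` given `parents`; `rows` is a list of dicts."""
    r = len(states[variable])  # cardinality of the variable
    q = 1
    for p in parents:
        q *= len(states[p])  # number of parent configurations
    alpha = equivalent_sample_size / q  # pseudo-count per parent configuration
    beta = equivalent_sample_size / (q * r)  # pseudo-count per (configuration, state) cell

    config_counts = Counter(tuple(row[p] for p in parents) for row in rows)
    cell_counts = Counter((tuple(row[p] for p in parents), row[variable]) for row in rows)

    # Unobserved configurations and cells contribute exactly zero, so they are skipped.
    score = 0.0
    for n_j in config_counts.values():
        score += lgamma(alpha) - lgamma(alpha + n_j)
    for n_jk in cell_counts.values():
        score += lgamma(beta + n_jk) - lgamma(beta)
    return score


states = {"X": [0, 1], "Y": [0, 1]}
rows = [{"X": i % 2, "Y": i % 2} for i in range(20)]  # Y is a copy of X
with_parent = bdeu_local_score(rows, "Y", ["X"], states)
without_parent = bdeu_local_score(rows, "Y", [], states)

# Sanity check: a single observation of a binary variable has marginal likelihood 1/2.
single = bdeu_local_score([{"Y": 0}], "Y", [], {"Y": [0, 1]})
expected_single = log(0.5)
```

Because Y is fully determined by X in this toy data, the score with X as a parent is higher than the score with no parents.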
Bic Score
- class pgmpy.estimators.BicScore(data, **kwargs)
Class for structure scoring of BayesianNetworks with the BIC/MDL score. The BIC/MDL score (“Bayesian Information Criterion”, also “Minimum Description Length”) is a log-likelihood score with an additional penalty for network complexity, to avoid overfitting. The score-method measures how well a model is able to describe the given data set.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states (or values) that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
References
[1] Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009, Section 18.3.4-18.3.6 (esp. page 802)
[2] AM Carvalho, Scoring functions for learning Bayesian networks, http://www.lx.it.pt/~asmc/pub/talks/09-TA/ta_pres.pdf
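The "log-likelihood with a complexity penalty" idea can be made concrete with a small sketch. It assumes the usual BIC form for one node: the maximum-likelihood log-likelihood minus 0.5 * log(N) times the number of free parameters, q * (r - 1). The name bic_local_score and the list-of-dicts data layout are assumptions made for this example.

```python
from collections import Counter
from math import log


def bic_local_score(rows, variable, parents, states):
    """BIC local score of `variable` given `parents`; `rows` is a list of dicts."""
    r = len(states[variable])  # cardinality of the variable
    q = 1
    for p in parents:
        q *= len(states[p])  # number of parent configurations

    config_counts = Counter(tuple(row[p] for p in parents) for row in rows)
    cell_counts = Counter((tuple(row[p] for p in parents), row[variable]) for row in rows)

    # Log-likelihood under the maximum-likelihood conditional distribution.
    log_likelihood = sum(
        n_jk * log(n_jk / config_counts[config])
        for (config, _), n_jk in cell_counts.items()
    )
    # Penalty grows with the sample size and the number of free parameters.
    penalty = 0.5 * log(len(rows)) * q * (r - 1)
    return log_likelihood - penalty


states = {"X": [0, 1], "Y": [0, 1]}
rows = [{"X": i % 2, "Y": i % 2} for i in range(20)]  # Y is a copy of X
with_parent = bic_local_score(rows, "Y", ["X"], states)
without_parent = bic_local_score(rows, "Y", [], states)
expected_with_parent = -log(20)  # log-likelihood is 0, penalty is 0.5 * log(20) * 2
```

In this toy data the extra parent is worth its penalty because it removes all uncertainty about Y. For an unrelated parent, the penalty term would dominate and the score would drop.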
K2 Score
- class pgmpy.estimators.K2Score(data, **kwargs)
Class for Bayesian structure scoring for BayesianNetworks with Dirichlet priors. The K2 score is the result of setting all Dirichlet hyperparameters/pseudo_counts to 1. The score-method measures how well a model is able to describe the given data set.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states (or values) that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
References
[1] Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009, Section 18.3.4-18.3.6 (esp. page 806)
[2] AM Carvalho, Scoring functions for learning Bayesian networks, http://www.lx.it.pt/~asmc/pub/talks/09-TA/ta_pres.pdf
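Setting every Dirichlet pseudo-count to 1 gives a closed form that is easy to evaluate with log-gamma functions. The sketch below is one way to write it, assuming the standard K2 formula; k2_local_score and the list-of-dicts data layout are hypothetical choices for this example, not pgmpy's code.

```python
from collections import Counter
from math import lgamma, log


def k2_local_score(rows, variable, parents, states):
    """Log K2 score of `variable` given `parents`; `rows` is a list of dicts."""
    r = len(states[variable])  # cardinality of the variable

    config_counts = Counter(tuple(row[p] for p in parents) for row in rows)
    cell_counts = Counter((tuple(row[p] for p in parents), row[variable]) for row in rows)

    # With all pseudo-counts equal to 1, each parent configuration has total prior mass r.
    # Unobserved configurations and cells contribute zero, so they are skipped.
    score = 0.0
    for n_j in config_counts.values():
        score += lgamma(r) - lgamma(n_j + r)
    for n_jk in cell_counts.values():
        score += lgamma(n_jk + 1)
    return score


states = {"X": [0, 1], "Y": [0, 1]}
rows = [{"X": i % 2, "Y": i % 2} for i in range(20)]  # Y is a copy of X
with_parent = k2_local_score(rows, "Y", ["X"], states)
without_parent = k2_local_score(rows, "Y", [], states)

# Sanity check: one observation of a three-state variable has marginal likelihood 1/3.
single = k2_local_score([{"Z": "a"}], "Z", [], {"Z": ["a", "b", "c"]})
expected_single = log(1 / 3)
```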
BDs Score
- class pgmpy.estimators.BDsScore(data, equivalent_sample_size=10, **kwargs)
Class for Bayesian structure scoring for BayesianNetworks with Dirichlet priors. The BDs score is the result of setting all Dirichlet hyperparameters/pseudo_counts to equivalent_sample_size/modified_variable_cardinality, where modified_variable_cardinality counts only the parent configurations for which variable counts were actually observed. The score-method measures how well a model is able to describe the given data set.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.NaN. Note that pandas converts each column containing numpy.NaN values to dtype float.)
equivalent_sample_size (int (default: 10)) – The equivalent/imaginary sample size (of uniform pseudo samples) for the Dirichlet hyperparameters. The score is sensitive to this value, so runs with different values might be useful.
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states (or values) that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
References
[1] Scutari, Marco. An Empirical-Bayes Score for Discrete Bayesian Networks. Journal of Machine Learning Research, 2016, pp. 438–48
- local_score(variable, parents)
Computes a score that measures how much a given variable is “influenced” by a given list of potential parents.
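A local score is useful because decomposable structure scores are sums of one local term per node, which is also what makes score caching possible. The sketch below illustrates that decomposition with a plain maximum-likelihood log-likelihood as a stand-in local score. It is not the BDs formula, and local_log_likelihood and structure_score are hypothetical names introduced for this example.

```python
from collections import Counter
from math import log


def local_log_likelihood(rows, variable, parents):
    """Stand-in local score: maximum-likelihood log-probability of `variable` given `parents`."""
    config_counts = Counter(tuple(row[p] for p in parents) for row in rows)
    cell_counts = Counter((tuple(row[p] for p in parents), row[variable]) for row in rows)
    return sum(n * log(n / config_counts[config]) for (config, _), n in cell_counts.items())


def structure_score(rows, nodes, edges, local_score):
    """Score a whole DAG as the sum of each node's local score given its parents."""
    total = 0.0
    for node in nodes:
        parents = sorted(u for u, v in edges if v == node)
        total += local_score(rows, node, parents)
    return total


rows = [{"X": i % 2, "Y": i % 2} for i in range(20)]  # Y is a copy of X
empty = structure_score(rows, ["X", "Y"], [], local_log_likelihood)
with_edge = structure_score(rows, ["X", "Y"], [("X", "Y")], local_log_likelihood)
expected_empty = 40 * log(0.5)  # two independent fair binary variables, 20 rows each
expected_with_edge = 20 * log(0.5)  # only X remains uncertain once X -> Y is added
```

Changing one edge only changes the parent set of one node, so a search procedure needs to recompute a single local term per candidate modification.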