PC (Constraint-Based Estimator)¶
- class pgmpy.estimators.PC(data=None, independencies=None, **kwargs)[source]¶
Class for constraint-based estimation of DAGs using the PC algorithm from a given data set. Identifies (conditional) dependencies in the data set using statistical independence tests and estimates a DAG pattern that satisfies the identified dependencies. The DAG pattern can then be completed to a faithful DAG, if possible.
- Parameters:
data (pandas DataFrame object) – dataframe object where each column represents one variable. (If some values in the data are missing, the data cells should be set to numpy.nan. Note that pandas converts each column containing numpy.nan to dtype float.)
References
- [1] Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009, Section 18.2
- [2] Neapolitan, Learning Bayesian Networks, Section 10.1.2 for the PC algorithm (page 550), http://www.cs.technion.ac.il/~dang/books/Learning%20Bayesian%20Networks(Neapolitan,%20Richard).pdf
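Examples
A minimal construction sketch (the random dataset below is illustrative):
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators import PC
>>> data = pd.DataFrame(np.random.randint(0, 3, size=(1000, 3)), columns=list('XYZ'))
>>> est = PC(data)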
- build_skeleton(ci_test='chi_square', max_cond_vars=5, significance_level=0.01, variant='stable', n_jobs=-1, show_progress=True, **kwargs)[source]¶
Estimates a graph skeleton (UndirectedGraph) from a set of independencies using (the first part of) the PC algorithm. The independencies can either be provided as an instance of the Independencies-class or by passing a decision function that decides any conditional independency assertion. Returns a tuple (skeleton, separating_sets).
If an Independencies-instance is passed, the contained IndependenceAssertions have to admit a faithful BN representation. This is the case if they are obtained as a set of d-separations of some Bayesian network or if the independence assertions are closed under the semi-graphoid axioms. Otherwise, the procedure may fail to identify the correct structure.
- Returns:
skeleton (UndirectedGraph) – An estimate for the undirected graph skeleton of the BN underlying the data.
separating_sets (dict) – A dict containing, for each pair of not directly connected nodes, a separating set (“witnessing set”) of variables that makes them conditionally independent. (needed for edge orientation procedures)
References
- [1] Neapolitan, Learning Bayesian Networks, Section 10.1.2, Algorithm 10.2 (page 550), http://www.cs.technion.ac.il/~dang/books/Learning%20Bayesian%20Networks(Neapolitan,%20Richard).pdf
- [2] Koller & Friedman, Probabilistic Graphical Models - Principles and Techniques, 2009, Section 3.4.2.1 (page 85), Algorithm 3.3
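Examples
A minimal sketch of the skeleton phase (the data-generating process is illustrative; the printed edges assume the dependence D = A + B is detected, which may vary with the random sample):
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators import PC
>>> data = pd.DataFrame(np.random.randint(0, 3, size=(2000, 3)), columns=list('ABC'))
>>> data['D'] = data['A'] + data['B']
>>> skeleton, separating_sets = PC(data).build_skeleton()
>>> sorted(skeleton.edges())
[('A', 'D'), ('B', 'D')]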
- estimate(variant='stable', ci_test='chi_square', max_cond_vars=5, return_type='dag', significance_level=0.01, n_jobs=-1, show_progress=True, **kwargs)[source]¶
Estimates a DAG/PDAG from the given dataset using the PC algorithm, a constraint-based structure learning algorithm [1]. The independencies in the dataset are identified by performing statistical independence tests. This method returns a DAG/PDAG structure that is faithful to the independencies implied by the dataset.
- Parameters:
variant (str (one of "orig", "stable", "parallel")) –
The variant of the PC algorithm to run:
- "orig": The original PC algorithm. Might not give the same results in different runs, but does fewer independence tests compared to "stable".
- "stable": Gives the same result in every run, but needs to do more statistical independence tests.
- "parallel": Parallel version of PC-stable. Can run on multiple cores with the same result on each run.
ci_test (str or fun) –
The statistical test to use for testing conditional independence in the dataset. If str, the value should be one of:
- "independence_match": If using this option, an additional parameter independencies must be specified.
- "chi_square": Uses the Chi-square independence test. Works only for discrete datasets.
- "pearsonr": Uses partial correlation based on the Pearson correlation coefficient to test independence. Works only for continuous datasets.
- "g_sq": G-test. Works only for discrete datasets.
- "log_likelihood": Log-likelihood test. Works only for discrete datasets.
- "freeman_tuckey": Freeman-Tuckey test. Works only for discrete datasets.
- "modified_log_likelihood": Modified log-likelihood test. Works only for discrete datasets.
- "neyman": Neyman test. Works only for discrete datasets.
- "cressie_read": Cressie-Read test. Works only for discrete datasets.
max_cond_vars (int) – The maximum number of conditioning variables allowed in the conditional independence tests.
return_type (str (one of "dag", "cpdag", "pdag", "skeleton")) –
The type of structure to return:
- If return_type="pdag" or return_type="cpdag": a partially directed structure is returned.
- If return_type="dag": a fully directed structure is returned if it is possible to orient all the edges.
- If return_type="skeleton": an undirected graph is returned along with the separating sets.
significance_level (float (default: 0.01)) –
The statistical tests compare this value with the p-value of the test to decide whether the tested variables are independent or not. Different tests can treat this parameter differently:
- Chi-square: If p-value > significance_level, it assumes that the independence condition is satisfied in the data.
- pearsonr: If p-value > significance_level, it assumes that the independence condition is satisfied in the data.
- Returns:
Estimated model – The estimated model structure: a partially directed graph (PDAG), a fully directed graph (DAG), or a tuple of (undirected graph, separating sets), depending on the value of the return_type argument.
- Return type:
pgmpy.base.DAG, pgmpy.base.PDAG, or tuple(networkx.UndirectedGraph, dict)
References
- [1] Original PC: P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. Cambridge, MA: MIT Press, 2000.
- [2] Stable PC: D. Colombo and M. H. Maathuis, “A modification of the PC algorithm yielding order-independent skeletons,” ArXiv e-prints, Nov. 2012.
- [3] Parallel PC: Le, Thuc, et al. “A fast PC algorithm for high dimensional causal discovery with multi-core PCs.” IEEE/ACM Transactions on Computational Biology and Bioinformatics (2016).
Examples
>>> from pgmpy.utils import get_example_model
>>> from pgmpy.estimators import PC
>>> model = get_example_model('alarm')
>>> data = model.simulate(n_samples=1000)
>>> est = PC(data)
>>> model_chi = est.estimate(ci_test='chi_square')
>>> print(len(model_chi.edges()))
28
>>> model_gsq, _ = est.estimate(ci_test='g_sq', return_type='skeleton')
>>> print(len(model_gsq.edges()))
33
- static skeleton_to_pdag(skeleton, separating_sets)[source]¶
Orients the edges of a graph skeleton based on information from separating_sets to form a DAG pattern (PDAG).
- Parameters:
skeleton (UndirectedGraph) – An undirected graph skeleton, e.g. as produced by the build_skeleton method.
separating_sets (dict) – A dict containing, for each pair of not directly connected nodes, a separating set (“witnessing set”) of variables that makes them conditionally independent. (needed for edge orientation)
- Returns:
Model after edge orientation – An estimate for the DAG pattern of the BN underlying the data. The graph might contain some nodes with both-way edges (X->Y and Y->X). Any completion (removing one of the both-way edges for each such pair) results in an I-equivalent Bayesian network DAG.
- Return type:
pgmpy.base.PDAG
References
- [1] Neapolitan, Learning Bayesian Networks, Section 10.1.2, Algorithm 10.2 (page 550), http://www.cs.technion.ac.il/~dang/books/Learning%20Bayesian%20Networks(Neapolitan,%20Richard).pdf
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators import PC
>>> data = pd.DataFrame(np.random.randint(0, 4, size=(5000, 3)), columns=list('ABD'))
>>> data['C'] = data['A'] - data['B']
>>> data['D'] += data['A']
>>> c = PC(data)
>>> pdag = c.skeleton_to_pdag(*c.build_skeleton())
>>> pdag.edges()  # edges: A->C, B->C, A--D (not directed)
[('B', 'C'), ('A', 'C'), ('A', 'D'), ('D', 'A')]
Conditional Independence Tests for PC algorithm¶
- pgmpy.estimators.CITests.chi_square(X, Y, Z, data, boolean=True, **kwargs)[source]¶
Chi-square conditional independence test. Tests the null hypothesis that X is independent from Y given Zs.
This is done by comparing the observed frequencies with the expected frequencies if X, Y were conditionally independent, using a chi-square deviance statistic. The expected frequencies given independence are :math:`P(X,Y,Zs) = P(X|Zs)*P(Y|Zs)*P(Zs)`. The latter term can be computed as :math:`P(X,Zs)*P(Y,Zs)/P(Zs)`.
- Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the chi2 and p_value of the test.
- Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof), where chi is the chi-squared test statistic and p_value is the probability of observing the computed chi-square statistic (or an even higher value) under the null hypothesis that X ⟂ Y | Zs. If boolean=True, returns True if the p_value of the test is greater than or equal to significance_level, else returns False.
- Return type:
tuple or bool
References
[1] https://en.wikipedia.org/wiki/Chi-squared_test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import chi_square
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> chi_square(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> chi_square(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> chi_square(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
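A hedged follow-up sketch, reusing the data above, that retrieves the raw statistic instead of a boolean decision:
>>> chi, p_value, dof = chi_square(X='A', Y='B', Z=['D'], data=data, boolean=False)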
- pgmpy.estimators.CITests.g_sq(X, Y, Z, data, boolean=True, **kwargs)[source]¶
G-squared test for conditional independence. Also commonly known as the G-test, likelihood-ratio test, or maximum likelihood statistical significance test. Tests the null hypothesis that X is independent of Y given Zs.
- Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the chi2 and p_value of the test.
- Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof), where chi is the chi-squared test statistic and p_value is the probability of observing the computed chi-square statistic (or an even higher value) under the null hypothesis that X ⟂ Y | Zs. If boolean=True, returns True if the p_value of the test is greater than or equal to significance_level, else returns False.
- Return type:
tuple or bool
References
[1] https://en.wikipedia.org/wiki/G-test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import g_sq
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> g_sq(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> g_sq(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> g_sq(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
- pgmpy.estimators.CITests.independence_match(X, Y, Z, independencies, **kwargs)[source]¶
Checks if X ⟂ Y | Z is in independencies. This method is implemented to have a uniform API when the independencies are provided instead of data.
- Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z
independencies (pgmpy.independencies.Independencies) – The set of independence assertions in which to look up the condition X ⟂ Y | Z.
- Returns:
p-value
- Return type:
float (fixed to 0, since the lookup is deterministic rather than statistical)
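Examples
A minimal usage sketch, assuming the Independencies container from pgmpy.independencies (the assertions here are illustrative):
>>> from pgmpy.independencies import Independencies
>>> from pgmpy.estimators.CITests import independence_match
>>> ind = Independencies(['A', 'B', ['C']])  # asserts A ⟂ B | C
>>> independence_match(X='A', Y='B', Z=['C'], independencies=ind)
True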
- pgmpy.estimators.CITests.log_likelihood(X, Y, Z, data, boolean=True, **kwargs)[source]¶
Log likelihood ratio test for conditional independence. Also commonly known as G-test, G-squared test or maximum likelihood statistical significance test. Tests the null hypothesis that X is independent of Y given Zs.
- Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the chi2 and p_value of the test.
- Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof), where chi is the chi-squared test statistic and p_value is the probability of observing the computed chi-square statistic (or an even higher value) under the null hypothesis that X ⟂ Y | Zs. If boolean=True, returns True if the p_value of the test is greater than or equal to significance_level, else returns False.
- Return type:
tuple or bool
References
[1] https://en.wikipedia.org/wiki/G-test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import log_likelihood
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> log_likelihood(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> log_likelihood(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> log_likelihood(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
- pgmpy.estimators.CITests.modified_log_likelihood(X, Y, Z, data, boolean=True, **kwargs)[source]¶
Modified log likelihood ratio test for conditional independence. Tests the null hypothesis that X is independent of Y given Zs.
- Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the chi2 and p_value of the test.
- Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof), where chi is the chi-squared test statistic and p_value is the probability of observing the computed chi-square statistic (or an even higher value) under the null hypothesis that X ⟂ Y | Zs. If boolean=True, returns True if the p_value of the test is greater than or equal to significance_level, else returns False.
- Return type:
tuple or bool
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import modified_log_likelihood
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> modified_log_likelihood(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> modified_log_likelihood(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> modified_log_likelihood(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
- pgmpy.estimators.CITests.pearsonr(X, Y, Z, data, boolean=True, **kwargs)[source]¶
Computes the Pearson correlation coefficient and p-value for testing non-correlation. Should be used only on continuous data. When :math:`Z` is non-empty, it uses linear regression to residualize X and Y on Z and computes the Pearson coefficient on the residuals (i.e. the partial correlation).
- Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the Pearson correlation coefficient and p_value of the test.
- Returns:
CI Test results – If boolean=True, returns True if p-value >= significance_level, else False. If boolean=False, returns a tuple of (Pearson’s correlation coefficient, p-value).
- Return type:
tuple or bool
References
- [1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
- [2] https://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
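Examples
A minimal sketch on continuous data (the generating process is illustrative): X and Y are marginally correlated through Z, but conditionally independent given Z, so the test should reject marginal independence and accept conditional independence.
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import pearsonr
>>> rng = np.random.default_rng(42)
>>> data = pd.DataFrame(rng.normal(size=(10000, 1)), columns=['Z'])
>>> data['X'] = data['Z'] + rng.normal(scale=0.5, size=10000)
>>> data['Y'] = data['Z'] + rng.normal(scale=0.5, size=10000)
>>> pearsonr(X='X', Y='Y', Z=[], data=data, boolean=True, significance_level=0.05)
False
>>> pearsonr(X='X', Y='Y', Z=['Z'], data=data, boolean=True, significance_level=0.05)
True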
- pgmpy.estimators.CITests.pillai_trace(X, Y, Z, data, boolean=True, **kwargs)[source]¶
A mixed-data residualization based conditional independence test [1].
Uses an XGBoost estimator to compute LS residuals [2], and then performs an association test (Pillai’s Trace) on the residuals.
- Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the test statistic (Pillai’s Trace) and p_value of the test.
- Returns:
CI Test results – If boolean=True, returns True if p-value >= significance_level, else False. If boolean=False, returns a tuple of (Pillai’s Trace statistic, p-value).
- Return type:
tuple or bool
References
- [1] Ankan, Ankur, and Johannes Textor. “A simple unified approach to testing high-dimensional conditional independences for categorical and ordinal data.” Proceedings of the AAAI Conference on Artificial Intelligence.
- [2] Li, C.; and Shepherd, B. E. 2010. Test of Association Between Two Ordinal Variables While Adjusting for Covariates. Journal of the American Statistical Association.
- [3] Muller, K. E. and Peterson B. L. (1984) Practical Methods for computing power in testing the multivariate general linear hypothesis. Computational Statistics & Data Analysis.
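Examples
A hedged usage sketch (the continuous data here are illustrative; the test is designed for mixed categorical/ordinal/continuous data): Y depends on X only through Z, so the conditional test should return True.
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import pillai_trace
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.normal(size=(2000, 1)), columns=['Z'])
>>> data['X'] = data['Z'] + rng.normal(scale=0.5, size=2000)
>>> data['Y'] = data['Z'] + rng.normal(scale=0.5, size=2000)
>>> pillai_trace(X='X', Y='Y', Z=['Z'], data=data, boolean=True, significance_level=0.05)
True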
- pgmpy.estimators.CITests.power_divergence(X, Y, Z, data, boolean=True, lambda_='cressie-read', **kwargs)[source]¶
Computes the Cressie-Read power divergence statistic [1]. The null hypothesis for the test is that X is independent of Y given Z. Many frequency-comparison-based statistics (e.g. chi-square, G-test, etc.) belong to the power divergence family and are special cases of this test.
- Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
lambda_ (float or string) –
The lambda_ parameter for the power_divergence statistic. Some values of lambda_ correspond to other well-known tests:
- "pearson" (lambda_ = 1): Chi-squared test
- "log-likelihood" (lambda_ = 0): G-test or log-likelihood
- "freeman-tuckey" (lambda_ = -1/2): Freeman-Tuckey statistic
- "mod-log-likelihood" (lambda_ = -1): Modified log-likelihood
- "neyman" (lambda_ = -2): Neyman’s statistic
- "cressie-read" (lambda_ = 2/3): The value recommended in the paper [1]
boolean (bool) – If boolean=True, an additional argument significance_level must be specified. If p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False. If boolean=False, returns the chi2 and p_value of the test.
- Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof), where chi is the chi-squared test statistic and p_value is the probability of observing the computed chi-square statistic (or an even higher value) under the null hypothesis that X ⟂ Y | Zs. If boolean=True, returns True if the p_value of the test is greater than or equal to significance_level, else returns False.
- Return type:
tuple or bool
References
[1] Cressie, Noel, and Timothy RC Read. “Multinomial goodness‐of‐fit tests.” Journal of the Royal Statistical Society: Series B (Methodological) 46.3 (1984): 440-464.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import power_divergence
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> power_divergence(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> power_divergence(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> power_divergence(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
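A hedged follow-up sketch, reusing the data above, that selects a specific member of the power-divergence family via lambda_ and retrieves the raw statistic:
>>> stat, p_value, dof = power_divergence(X='A', Y='B', Z=['D'], data=data, boolean=False, lambda_='freeman-tuckey')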