PowerDivergence
- class pgmpy.ci_tests.PowerDivergence(data: DataFrame, lambda_: str | float = 'cressie-read')
Bases: _BaseCITest

Cressie-Read power divergence test for conditional independence on discrete data [1].
This test evaluates the null hypothesis \(X \perp Y \mid Z\) using contingency tables. For a contingency table with observed counts \(O_{ij}\) and expected counts \(E_{ij}\) under independence, the Cressie-Read power divergence statistic is:
\[T_\lambda = \frac{2}{\lambda(\lambda + 1)} \sum_{i, j} O_{ij} \left[\left(\frac{O_{ij}}{E_{ij}}\right)^\lambda - 1\right],\]
for \(\lambda \notin \{-1, 0\}\). Different values of \(\lambda\) recover common special cases such as the Pearson chi-square test and the log-likelihood ratio test.
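As a sanity check on the formula, \(T_\lambda\) can be evaluated directly and compared against scipy.stats.power_divergence. This is a sketch with made-up counts, not part of pgmpy's API:

```python
import numpy as np
from scipy.stats import power_divergence

# Made-up observed and expected counts (flattened 2x2 table, equal totals)
O = np.array([30.0, 20.0, 25.0, 25.0])
E = np.array([27.5, 22.5, 27.5, 22.5])


def cressie_read_stat(O, E, lambda_):
    """Direct evaluation of T_lambda for lambda not in {-1, 0}."""
    return 2.0 / (lambda_ * (lambda_ + 1)) * np.sum(O * ((O / E) ** lambda_ - 1))


# lambda = 1 recovers the Pearson chi-square statistic sum((O - E)^2 / E)
t_manual = cressie_read_stat(O, E, lambda_=1.0)
t_scipy, _ = power_divergence(O, E, lambda_="pearson")
assert np.isclose(t_manual, t_scipy)
```

For lambda = 1, expanding the bracket reduces the sum to the familiar \(\sum (O - E)^2 / E\), which is why the two values agree.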
If \(Z = \emptyset\), the implementation constructs the contingency table of \(X\) and \(Y\) from the full dataset and computes \(T_\lambda\) with
scipy.stats.chi2_contingency().

If \(Z \neq \emptyset\), the data are partitioned by each observed configuration \(z\) of \(Z\). For each stratum, a contingency table for \(X\) and \(Y\) is constructed and its power divergence statistic \(T_\lambda^{(z)}\) and degrees of freedom \(\nu^{(z)}\) are computed. The overall statistic and degrees of freedom are:
\[T = \sum_{z} T_\lambda^{(z)}, \qquad \nu = \sum_{z} \nu^{(z)},\]
where the sum runs over strata whose contingency tables contain no all-zero row or all-zero column. Strata with such degenerate tables are skipped.
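The stratified aggregation described above can be sketched with pandas and SciPy. This is an illustration under simplified assumptions (synthetic binary data, hypothetical variable names), not pgmpy's implementation:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic binary data with a single conditioning variable Z
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.integers(0, 2, size=(2000, 3)), columns=["X", "Y", "Z"])

stat, dof = 0.0, 0
for z, stratum in df.groupby("Z"):
    # Contingency table of X vs Y within this stratum of Z
    table = pd.crosstab(stratum["X"], stratum["Y"]).to_numpy()
    # Skip degenerate tables with an all-zero row or column
    if (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        continue
    t_z, _, dof_z, _ = chi2_contingency(table, lambda_="cressie-read")
    stat += t_z
    dof += dof_z

print(stat, dof)  # T and nu summed over the strata
```

Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default; whether a given implementation does the same is a detail worth checking.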
Under the null hypothesis, \(T\) is approximately chi-square distributed with \(\nu\) degrees of freedom, so the p-value is computed as:
\[p = 1 - F_{\chi^2_\nu}(T),\]
where \(F_{\chi^2_\nu}\) is the CDF of the chi-square distribution with \(\nu\) degrees of freedom.
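This final step maps directly onto scipy.stats.chi2; the survival function sf is \(1 - F\) and is numerically stabler than computing the complement of the CDF. A small illustration with an arbitrary statistic:

```python
from scipy.stats import chi2

t, dof = 3.84, 1  # arbitrary statistic and degrees of freedom for illustration
p = chi2.sf(t, df=dof)  # equivalent to 1 - chi2.cdf(t, df=dof)
print(round(p, 3))  # close to 0.05, the textbook critical value for 1 dof
```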
- Parameters:
- data: pandas.DataFrame
The dataset on which to test the independence condition.
- lambda_: float or str
The \(\lambda\) parameter for the power divergence statistic. Some values of lambda_ recover well-known special cases:

| lambda_ | \(\lambda\) | Statistic |
|---|---|---|
| "pearson" | 1 | Pearson chi-square test |
| "log-likelihood" | 0 | G-test (log-likelihood ratio) |
| "freeman-tukey" | -1/2 | Freeman-Tukey statistic |
| "mod-log-likelihood" | -1 | Modified log-likelihood ratio |
| "neyman" | -2 | Neyman's statistic |
| "cressie-read" | 2/3 | The value recommended in [1] |
- Attributes:
- statistic_: float
The power divergence test statistic \(T\). Set after calling the test.
- p_value_: float
The p-value for the test. Set after calling the test.
- dof_: int
Degrees of freedom \(\nu\) for the test. Set after calling the test.
References
[1] Cressie, Noel, and Timothy R. C. Read. "Multinomial goodness-of-fit tests." Journal of the Royal Statistical Society: Series B (Methodological) 46.3 (1984): 440-464.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.ci_tests import PowerDivergence
>>> np.random.seed(42)
>>> data = pd.DataFrame(
...     data=np.random.randint(low=0, high=2, size=(50000, 4)), columns=list("ABCD")
... )
>>> data["E"] = data["A"] + data["B"] + data["C"]
>>> test = PowerDivergence(data=data)
>>> test(X="A", Y="C", Z=[], significance_level=0.05)
np.True_
>>> round(test.statistic_, 2)
np.float64(0.03)
>>> round(test.p_value_, 2)
np.float64(0.86)
>>> test.dof_
1
>>> test(X="A", Y="B", Z=["D"], significance_level=0.05)
np.True_
>>> test(X="A", Y="B", Z=["D", "E"], significance_level=0.05)
np.False_