PowerDivergence#

class pgmpy.ci_tests.PowerDivergence(data: DataFrame, lambda_: str | float = 'cressie-read')[source]#

Bases: _BaseCITest

Cressie-Read power divergence test for conditional independence on discrete data [1].

This test evaluates the null hypothesis \(X \perp Y \mid Z\) using contingency tables. For a contingency table with observed counts \(O_{ij}\) and expected counts \(E_{ij}\) under independence, the Cressie-Read power divergence statistic is:

\[T_\lambda = \frac{2}{\lambda(\lambda + 1)} \sum_{i, j} O_{ij} \left[\left(\frac{O_{ij}}{E_{ij}}\right)^\lambda - 1\right],\]

for \(\lambda \notin \{-1, 0\}\). Different values of \(\lambda\) recover common special cases such as the Pearson chi-square test and the log-likelihood ratio test.
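These special cases can be verified numerically against the formula. A minimal sketch using scipy.stats.power_divergence (the array values below are made up for illustration):

```python
import numpy as np
from scipy.stats import power_divergence

# Illustrative observed and expected counts (not from any real dataset)
observed = np.array([12, 18, 20, 10])
expected = np.array([15, 15, 15, 15])

# lambda = 1 recovers Pearson's chi-square: sum((O - E)^2 / E)
stat_pearson, _ = power_divergence(observed, expected, lambda_=1)
manual_pearson = ((observed - expected) ** 2 / expected).sum()

# lambda -> 0 recovers the log-likelihood ratio (G-test): 2 * sum(O * log(O / E))
stat_g, _ = power_divergence(observed, expected, lambda_=0)
manual_g = 2 * (observed * np.log(observed / expected)).sum()
```

Both pairs agree to floating-point precision, confirming that \(T_1\) is the Pearson statistic and \(T_{\lambda \to 0}\) is the G statistic.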

If \(Z = \emptyset\), the implementation constructs the contingency table of \(X\) and \(Y\) from the full dataset and computes \(T_\lambda\) with scipy.stats.chi2_contingency().

If \(Z \neq \emptyset\), the data are partitioned by each observed configuration \(z\) of \(Z\). For each stratum, a contingency table for \(X\) and \(Y\) is constructed and its power divergence statistic \(T_\lambda^{(z)}\) and degrees of freedom \(\nu^{(z)}\) are computed. The overall statistic used in the code is:

\[T = \sum_{z} T_\lambda^{(z)}, \qquad \nu = \sum_{z} \nu^{(z)},\]

where the sum runs over strata whose contingency tables do not contain an all-zero row or all-zero column. Strata with such degenerate tables are skipped.
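The stratified aggregation described above can be sketched with pandas and scipy. This is an illustrative re-implementation under simplifying assumptions (binary columns, a single conditioning variable), not the class's actual code:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Synthetic data: X, Y, Z are independent binary variables
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(5000, 3)), columns=["X", "Y", "Z"])

stat, dof = 0.0, 0
for _, stratum in data.groupby("Z"):
    table = pd.crosstab(stratum["X"], stratum["Y"])
    # Mirror the documented behavior: skip degenerate tables that
    # contain an all-zero row or all-zero column
    if (table.to_numpy().sum(axis=0) == 0).any() or (table.to_numpy().sum(axis=1) == 0).any():
        continue
    t, _, nu, _ = chi2_contingency(table, lambda_="cressie-read")
    stat += t
    dof += nu

# Sum of per-stratum statistics compared against a chi-square with summed dof
p_value = chi2.sf(stat, dof)
```

With two strata of 2x2 tables this yields one degree of freedom per stratum, i.e. \(\nu = 2\) overall.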

Under the null hypothesis, \(T\) is approximately \(\chi^2_\nu\)-distributed, so the p-value is computed as:


\[p = 1 - F_{\chi^2_\nu}(T),\]

where \(F_{\chi^2_\nu}\) is the CDF of the chi-square distribution with \(\nu\) degrees of freedom.

Parameters:
data : pandas.DataFrame

The dataset on which to test the independence condition.

lambda_ : float or str

The \(\lambda\) parameter for the power divergence statistic. Some values of lambda_ recover well-known special cases:

  • “pearson” (\(\lambda = 1\)): Pearson chi-squared test

  • “log-likelihood” (\(\lambda = 0\)): log-likelihood ratio (G-test)

  • “freeman-tukey” (\(\lambda = -1/2\)): Freeman-Tukey statistic

  • “mod-log-likelihood” (\(\lambda = -1\)): modified log-likelihood ratio

  • “neyman” (\(\lambda = -2\)): Neyman's statistic

  • “cressie-read” (\(\lambda = 2/3\)): the value recommended in [1] (default)
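The named lambda_ values above can be compared directly on a single contingency table via scipy.stats.chi2_contingency, which this class wraps (the table below is made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 contingency table (not from any real dataset)
table = np.array([[30, 20], [25, 25]])

results = {}
for lam in ["pearson", "log-likelihood", "freeman-tukey",
            "mod-log-likelihood", "neyman", "cressie-read"]:
    stat, p, dof, _ = chi2_contingency(table, lambda_=lam)
    results[lam] = (stat, p)
```

The statistics differ slightly across choices of \(\lambda\), but for moderately large samples they are asymptotically equivalent and typically lead to the same accept/reject decision.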

Attributes:
statistic_ : float

The power divergence test statistic \(T\). Set after calling the test.

p_value_ : float

The p-value for the test. Set after calling the test.

dof_ : int

Degrees of freedom \(\nu\) for the test. Set after calling the test.

References

[1]

Cressie, Noel, and Timothy RC Read. “Multinomial goodness‐of‐fit tests.” Journal of the Royal Statistical Society: Series B (Methodological) 46.3 (1984): 440-464.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(42)
>>> data = pd.DataFrame(
...     data=np.random.randint(low=0, high=2, size=(50000, 4)), columns=list("ABCD")
... )
>>> data["E"] = data["A"] + data["B"] + data["C"]
>>> test = PowerDivergence(data=data)
>>> test(X="A", Y="C", Z=[], significance_level=0.05)
np.True_
>>> round(test.statistic_, 2)
np.float64(0.03)
>>> round(test.p_value_, 2)
np.float64(0.86)
>>> test.dof_
1
>>> test(X="A", Y="B", Z=["D"], significance_level=0.05)
np.True_
>>> test(X="A", Y="B", Z=["D", "E"], significance_level=0.05)
np.False_
run_test(X: str, Y: str, Z: list)[source]#

Compute power divergence statistic, p-value, and degrees of freedom.

Sets self.statistic_ (the power divergence statistic \(T\)), self.p_value_, and self.dof_.