Expert In The Loop

class pgmpy.estimators.ExpertInLoop(data: DataFrame | None = None, **kwargs)
estimate(pval_threshold: float = 0.05, effect_size_threshold: float = 0.05, orientation_fn: Callable[[...], Tuple[Hashable, Hashable] | None] = llm_pairwise_orient, orientations: Set[Tuple[str, str]] = {}, expert_knowledge: ExpertKnowledge | None = None, use_cache: bool = True, show_progress: bool = True, **kwargs) -> DAG

Estimates a DAG from the data by utilizing expert knowledge.

The method iteratively adds and removes edges between variables, similar to the Greedy Equivalence Search (GES) algorithm, guided by a global score metric that improves the model’s fit in each iteration. The score metric is based on conditional independence testing. When adding an edge to the model, the method asks for expert knowledge to decide the orientation of the edge. Alternatively, an LLM can be used to decide the orientation of the edge.
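To make the add/remove step concrete, here is a toy decision rule driven by the two thresholds documented under Parameters. It is only a sketch: the function name `suggest_action` is hypothetical, and the way the p-value and effect size are combined here is an assumption, not pgmpy's actual implementation.

```python
def suggest_action(p_value, effect_size, edge_present,
                   pval_threshold=0.05, effect_size_threshold=0.05):
    """Toy rule: return "add", "remove", or "keep" for one variable pair.

    Assumption: an edge is suggested only when the association is both
    significant and stronger than the effect size threshold.
    """
    significant = p_value < pval_threshold
    if not edge_present and significant and effect_size > effect_size_threshold:
        return "add"
    if edge_present and effect_size < effect_size_threshold:
        return "remove"
    return "keep"


print(suggest_action(0.001, 0.2, edge_present=False))  # add
print(suggest_action(0.4, 0.01, edge_present=True))    # remove
print(suggest_action(0.2, 0.3, edge_present=False))    # keep
```

In the real method, an "add" suggestion would then be passed to the expert (or LLM) to decide the edge's orientation.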

Parameters:
  • pval_threshold (float) – The p-value threshold to use for the test to determine whether there is a significant association between the variables or not.

  • effect_size_threshold (float) – The effect size threshold to use to suggest a new edge. If the conditional effect size between two variables is greater than the threshold, the algorithm suggests adding an edge between them. If the effect size for an existing edge is less than the threshold, the algorithm suggests removing the edge.

  • orientation_fn (callable (default: pgmpy.utils.llm_pairwise_orient)) –

    A function to determine edge orientation. The function must take at least two arguments (the names of the two variables) and return either a tuple (source, target) representing the directed edge from source to target, or None representing no edge between the variables. Any additional keyword arguments passed to estimate() are forwarded to this function.

    Built-in functions that can be used:

    • pgmpy.utils.manual_pairwise_orient: Prompts the user to specify the direction between two variables by presenting options and taking input.

    • pgmpy.utils.llm_pairwise_orient: Uses a Large Language Model to determine direction. Requires additional parameters:

      • variable_descriptions: dict of {var_name: description} for context

      • llm_model: name of the LLM model (default: “gemini/gemini-1.5-flash”)

      • system_prompt: optional custom system prompt

    Custom functions can be provided that implement any desired logic for determining edge orientation, including using local LLMs or domain-specific heuristics.

  • orientations (set) – A set of directed edges to use as preferred orientations. These take precedence over the output of orientation_fn.

  • expert_knowledge (pgmpy.estimators.ExpertKnowledge (default: None)) –

    Expert knowledge about the causal structure. This can include:

      • forbidden_edges: Edges that should not be present in the final model.

      • required_edges: Edges that must be present in the final model (can be removed during pruning).

      • temporal_order: The temporal ordering of variables. Note that explicit orientations specified in the ‘orientations’ parameter will override this temporal ordering.

  • use_cache (bool) – If True, the method caches the results returned by orientation_fn and reuses them in later calls to estimate() instead of calling orientation_fn again.

  • show_progress (bool (default: True)) – If True, prints progress information while the method runs.

  • kwargs (kwargs) – Any additional parameters to pass to the orientation_fn.
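The interplay of orientation_fn, orientations, and use_cache described above can be sketched in plain Python. Everything here is illustrative: `tiered_orient`, `resolve_orientation`, and the `tiers` keyword are hypothetical names, and the lookup order (preferred orientations first, then the cache, then the function) is an assumption about the behaviour, not pgmpy's internal code.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Optional, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def tiered_orient(var1, var2, tiers=None, **kwargs) -> Optional[Edge]:
    """Hypothetical orientation function: earlier tier -> later tier.

    Follows the documented contract: two variable names in, a
    (source, target) tuple or None out, extra keyword arguments accepted.
    """
    tiers = tiers or {}
    if var1 not in tiers or var2 not in tiers or tiers[var1] == tiers[var2]:
        return None  # no edge between the variables
    return (var1, var2) if tiers[var1] < tiers[var2] else (var2, var1)


def resolve_orientation(
    var1,
    var2,
    orientation_fn: Callable[..., Optional[Edge]],
    orientations: Set[Edge] = frozenset(),
    cache: Optional[Dict[FrozenSet, Optional[Edge]]] = None,
    **kwargs,
) -> Optional[Edge]:
    """Assumed lookup order: preferred orientations, cache, orientation_fn."""
    if (var1, var2) in orientations:
        return (var1, var2)
    if (var2, var1) in orientations:
        return (var2, var1)
    key = frozenset((var1, var2))
    if cache is not None and key in cache:
        return cache[key]  # reuse an earlier answer
    result = orientation_fn(var1, var2, **kwargs)  # kwargs are forwarded
    if cache is not None:
        cache[key] = result
    return result


tiers = {"Pollution": 0, "Smoker": 0, "Cancer": 1, "Xray": 2, "Dyspnoea": 2}
cache = {}
print(resolve_orientation("Xray", "Cancer", tiered_orient, cache=cache, tiers=tiers))
# ('Cancer', 'Xray')
print(resolve_orientation("Pollution", "Smoker", tiered_orient, tiers=tiers))
# None
print(resolve_orientation("Pollution", "Smoker", tiered_orient,
                          orientations={("Smoker", "Pollution")}, tiers=tiers))
# ('Smoker', 'Pollution')
```

A function shaped like `tiered_orient` could be passed as orientation_fn to estimate(), with `tiers` supplied as an extra keyword argument.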

Returns:

A DAG representing the learned causal structure.

Return type:

pgmpy.base.DAG

Examples

>>> from pgmpy.utils import (
...     get_example_model,
...     llm_pairwise_orient,
...     manual_pairwise_orient,
... )
>>> from pgmpy.estimators import ExpertInLoop
>>> model = get_example_model("cancer")
>>> df = model.simulate(int(1e3))
>>> # Using manual orientation
>>> dag = ExpertInLoop(df).estimate(
...     effect_size_threshold=0.0001, orientation_fn=manual_pairwise_orient
... )
>>> # Using LLM-based orientation
>>> variable_descriptions = {
...     "Smoker": "A binary variable representing whether a person smokes or not.",
...     "Cancer": "A binary variable representing whether a person has cancer.",
...     "Xray": "A binary variable representing the result of an X-ray test.",
...     "Pollution": "A binary variable representing whether the person is in a high-pollution area or not.",
...     "Dyspnoea": "A binary variable representing whether a person has shortness of breath.",
... }
>>> dag = ExpertInLoop(df).estimate(
...     effect_size_threshold=0.0001,
...     orientation_fn=llm_pairwise_orient,
...     variable_descriptions=variable_descriptions,
...     llm_model="gemini/gemini-1.5-flash",
... )
>>> dag.edges()
OutEdgeView([('Smoker', 'Cancer'), ('Cancer', 'Xray'), ('Cancer', 'Dyspnoea'), ('Pollution', 'Cancer')])
>>> # Using a custom orientation function
>>> def my_orientation_func(var1, var2, **kwargs):
...     # Custom logic to determine edge orientation
...     if var1 == "Pollution" and var2 == "Cancer":
...         return ("Pollution", "Cancer")  # Pollution -> Cancer
...     elif var1 == "Cancer" and var2 == "Pollution":
...         return ("Pollution", "Cancer")  # Pollution -> Cancer
...     elif "Smoker" in (var1, var2) and "Cancer" in (var1, var2):
...         return ("Smoker", "Cancer")  # Smoker -> Cancer
...     # For edges involving Xray, always orient from other variable to Xray
...     elif "Xray" in (var1, var2):
...         if var1 == "Xray":
...             return (var2, var1)
...         else:
...             return (var1, var2)
...     # Default: use alphabetical ordering
...     return (var1, var2) if var1 < var2 else (var2, var1)
...
>>> dag = ExpertInLoop(df).estimate(
...     effect_size_threshold=0.0001, orientation_fn=my_orientation_func
... )
>>> dag.edges()
OutEdgeView([('Smoker', 'Cancer'), ('Cancer', 'Xray'), ('Cancer', 'Dyspnoea'), ('Pollution', 'Cancer')])

test_all(dag: DAG) -> DataFrame

Runs conditional independence (CI) tests on all possible combinations of variables in dag.

Parameters:

dag (pgmpy.base.DAG) – The DAG on which to run the tests.

Returns:

The results with p-values and effect sizes of all the tests.

Return type:

pd.DataFrame
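
As a rough picture of what "all possible combinations" might look like, the sketch below enumerates every unordered pair of variables together with a conditioning set. The helper `enumerate_ci_tests` is hypothetical, and conditioning on the union of both variables' current parents is an assumption; the documentation above does not state which conditioning sets are used.

```python
from itertools import combinations


def enumerate_ci_tests(parents):
    """List (u, v, conditioning_set, edge_present) for every variable pair.

    parents: dict mapping each variable to the set of its parents in the DAG.
    Assumption: each pair is tested given the union of both parent sets.
    """
    tests = []
    for u, v in combinations(sorted(parents), 2):
        edge_present = u in parents[v] or v in parents[u]
        cond_set = (parents[u] | parents[v]) - {u, v}
        tests.append((u, v, sorted(cond_set), edge_present))
    return tests


# Parent sets of the "cancer" example network used above.
parents = {
    "Pollution": set(),
    "Smoker": set(),
    "Cancer": {"Pollution", "Smoker"},
    "Xray": {"Cancer"},
    "Dyspnoea": {"Cancer"},
}
tests = enumerate_ci_tests(parents)
print(len(tests))  # 10 pairs for 5 variables
for row in tests:
    print(row)
```

Each enumerated row corresponds to one CI test whose p-value and effect size would appear in the returned DataFrame.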