
Causal Evaluation of Language Models

Contributions

WHAT WE EVALUATE

CaLM Framework

We craft a comprehensive and adaptable framework to meticulously evaluate the causal reasoning capabilities of language models.

Causal Target
Causal Task
Causal Ladder
Causal Scenario
Domain

KEY FEATURES

Broad Coverage

The most comprehensive and in-depth causal evaluation benchmark to date.

Empirical Findings

Findings from the Model

  • Causal reasoning inability. At present, language models struggle to perform tasks that require sophisticated causal reasoning. As the complexity of the causal reasoning increases, each model's accuracy progressively deteriorates, eventually falling to almost zero.
  • ...

Findings from the Adaptation

  • Optimal prompt varies across causal scenarios. No single "optimal prompt" fits all causal scenarios. Based on our observations, for scenarios at the lower levels of the causal ladder (i.e., causal discovery and association), 1-shot or 3-shot IcL proves effective. For scenarios at the intervention level, 3-shot IcL is recommended, and adding more shots may be beneficial if possible. For the counterfactual level, which requires detailed reasoning to reach the correct response, we suggest manual CoT (see the sketch after this list).
  • ...
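
The recommendation above can be read as a simple lookup from causal-ladder level to prompting strategy. Below is a minimal illustrative sketch; the level keys and strategy labels are our own shorthand for this example, not identifiers from the CaLM codebase.

    # Hypothetical sketch: pick a prompting strategy by causal-ladder level.
    # Level keys and strategy labels are assumptions made for illustration.
    PROMPT_BY_LEVEL = {
        "causal discovery": "1-shot or 3-shot IcL",
        "association":      "1-shot or 3-shot IcL",
        "intervention":     "3-shot IcL (more shots may help if affordable)",
        "counterfactual":   "manual CoT",
    }

    def choose_prompt_strategy(level: str) -> str:
        """Return the recommended prompting strategy for a given ladder level."""
        return PROMPT_BY_LEVEL[level.lower()]

    print(choose_prompt_strategy("Counterfactual"))  # -> manual CoT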

Findings from the Causal Ladder

  • Consistent model capabilities in causal reasoning across scenarios. The causal reasoning capabilities of models show inherent consistency across the four levels of the causal ladder. Specifically, model performance is positively correlated across 19 scenarios (excluding CEI and CB), suggesting that a model's causal reasoning ability is cohesive rather than limited to specific scenarios (see the sketch after this list).
  • ...
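
A consistency claim of this kind can be checked by correlating per-scenario accuracies across models. The snippet below is a minimal sketch with made-up numbers; it only illustrates the idea and is not the CaLM evaluation code.

    # Minimal sketch: rank-correlate model accuracies between two causal scenarios.
    # Model names and accuracy values are fabricated for illustration only.
    from scipy.stats import spearmanr

    acc_scenario_a = {"model_1": 0.72, "model_2": 0.55, "model_3": 0.40, "model_4": 0.63}
    acc_scenario_b = {"model_1": 0.68, "model_2": 0.51, "model_3": 0.35, "model_4": 0.60}

    models = sorted(acc_scenario_a)
    rho, p_value = spearmanr([acc_scenario_a[m] for m in models],
                             [acc_scenario_b[m] for m in models])
    # A positive rho means models that rank high in one scenario also rank high in the other.
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")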

Findings from the Causal Scenario

  • Pairwise causal discovery (PCD). PCD seeks to establish whether a causal relationship exists between two given events and to identify which of the two is the cause and which is the effect. In terms of understandability, the scenario is easy. The leading three performers in this scenario are GPT-4 (79.1%), GPT-3.5-Turbo (75.2%), and text-davinci-003 (74.7%). The top model-prompt pair is GPT-4 with EF, achieving an accuracy of 83.0%. In terms of solvability, the scenario is well-solved, as the average accuracies of the top three models all exceed 70%. The most stable models, characterized by the lowest model volatility, are GPT-3.5-Turbo (1.3), Baichuan1 (7B) (2.1), and text-curie-001 (2.2) (see the volatility sketch after this list)...
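
"Model volatility" above summarizes how much a model's accuracy fluctuates within a scenario. The sketch below assumes volatility is the standard deviation of accuracy (in percentage points) across prompt styles; this reading and all numbers are assumptions for illustration, not the benchmark's exact definition.

    # Hypothetical sketch: volatility as the standard deviation of per-prompt accuracy.
    # Accuracy values (in percentage points) are fabricated for illustration only.
    import statistics

    per_prompt_accuracy = {
        "model_a": [78.0, 80.0, 79.0, 77.5],   # stable model -> low volatility
        "model_b": [60.0, 72.0, 55.0, 68.0],   # prompt-sensitive model -> high volatility
    }

    for model, accs in per_prompt_accuracy.items():
        volatility = statistics.pstdev(accs)
        print(f"{model}: volatility = {volatility:.1f}")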

Findings from the Domain

  • Comparing seen vs. unseen datasets. The impact of using seen (open-source) and unseen (self-constructed) datasets on model performance depends on the complexity of the causal tasks. For more complex tasks at the intervention and counterfactual levels, models tend to perform better on open-source datasets than on self-constructed ones. Conversely, for simpler tasks related to causal discovery, models perform slightly better on self-constructed datasets than on publicly available ones.

Findings from the Mode

  • Correlations among text modes. The three modes selected for our analysis (Natural, Symbolic, and Mathematical) are all rooted in textual data, with the Natural mode serving as the primary basis. Our experimental results show a marked correlation between the Natural mode and the other two modes, highlighting interconnected capabilities across these modes.

Findings from the Language

  • Performance differences between English and Chinese datasets. In almost 90% of the causal scenarios, models perform better on English datasets. This trend is likely attributable to the dominance of English in the models' training data. As these models are deployed globally, it is crucial that training involves balanced and diverse language corpora to improve performance across languages.

Findings from the Metric

  • Variability in models' robustness and accuracy across causal scenarios. The relationship between a model's robustness and its accuracy varies significantly across causal scenarios. In more challenging scenarios, such as PN and PS, models may show very low accuracy but disproportionately high robustness, primarily because their responses remain consistently poor and are unaffected by disturbances. In contrast, in simpler scenarios like PCD and AR, accuracy and robustness tend to be positively correlated, suggesting that as models perform better, they also become more stable...

Findings from the Error

  • Model capabilities and limitations in following instructions. All models inherently possess the ability to generate content and typically do not produce empty responses, even when faced with challenging questions. However, their capacity to follow instructions accurately remains limited. These models often struggle to provide even the most straightforward response specified by the instructions, indicating significant room for improvement in instruction following.
  • ...

Join Us ;)

The data is available on GitHub.

Description