Here we evaluate the overall performance across all scenarios. Taken as a whole, open-source and limited-source models each account for half of the top 10 on CaLM. However, all models in the top 5 are limited-source, with text-davinci-002, ranked fifth, leading the sixth-placed InternLM-chat (20B) by almost 6%. This suggests that, for causal reasoning scenarios, a noticeable gap remains between open-source and limited-source models. Across every scenario and all prompts, GPT-4 secures first place with a significant lead of nearly 8% over the second-ranked GPT-3.5-Turbo.
We present the scenario analysis below. It turns out that models perform better on scenarios at the lower rungs of the causal ladder.