Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often over-rely on unimodal biases (e.g., language bias and vision bias), which leads to incorrect answers on complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis. Guided by the causal graph, we introduce MORE, a novel dataset of 12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities, as answering its questions requires multi-hop reasoning and overcoming unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities: a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs, and fine-tuning to refine open-source MLLMs. Extensive quantitative and qualitative experiments offer valuable insights for future research.
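The DeVA strategy can be sketched as a simple decompose-then-verify pipeline. The sketch below is a minimal illustration only, assuming a generic `query_mllm` callable (hypothetical; the paper's actual prompts, decomposition format, and verification criteria differ):

```python
from typing import Callable, List


def deva_answer(question: str,
                image: object,
                query_mllm: Callable[[str, object], str],
                max_hops: int = 3) -> str:
    """Decompose-Verify-Answer sketch: split a multi-hop VQA question
    into sub-questions, verify each intermediate answer, then compose
    the final answer from the verified facts."""
    # 1. Decompose: ask the model for sub-questions, one per line.
    #    (Prompt wording here is illustrative, not the paper's.)
    subs = query_mllm(
        f"Decompose into at most {max_hops} sub-questions, one per line:\n{question}",
        image,
    ).splitlines()

    facts: List[str] = []
    for sub in subs:
        # 2. Answer each sub-question against the image.
        ans = query_mllm(sub, image)
        # 3. Verify: re-ask with the candidate answer; keep only confirmed facts.
        verdict = query_mllm(
            f"Is '{ans}' a correct answer to '{sub}'? Reply yes or no.", image
        )
        if verdict.strip().lower().startswith("yes"):
            facts.append(f"{sub} -> {ans}")

    # 4. Final answer conditioned on the verified reasoning chain,
    #    discouraging unimodal shortcuts.
    context = "\n".join(facts)
    return query_mllm(f"Facts:\n{context}\nNow answer: {question}", image)
```

A stubbed `query_mllm` (e.g., one that returns canned strings per prompt prefix) is enough to exercise the control flow before wiring in a real MLLM API.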
| Model | LLM | #Params | Two-Hop (Open-ended) | Two-Hop (Multi-choice) | Three-Hop (Open-ended) | Three-Hop (Multi-choice) | Overall (Open-ended) | Overall (Multi-choice) |
|---|---|---|---|---|---|---|---|---|
Random | / | / | / | 25.0 | / | 25.0 | / | 25.0 |
BLIP-2 | OPT | 6.7B | 4.0 | 16.4 | 1.4 | 15.4 | 2.7 | 15.9 |
InstructBLIP | Vicuna | 13B | 3.0 | 17.0 | 1.6 | 16.2 | 2.3 | 16.6 |
mPLUG-Owl | Llama | 7B | 4.0 | 12.4 | 8.2 | 11.4 | 6.1 | 11.9 |
LLaVA | Llama | 13B | 8.0 | 20.8 | 6.8 | 13.6 | 7.4 | 17.5 |
GPT-4V | - | - | 15.8 | 25.6 | 15.3 | 23.2 | 15.6 | 24.4 |
Gemini Pro | - | - | 14.2 | 33.5 | 10.1 | 24.4 | 12.2 | 28.9 |
@article{chen2024quantifying,
title={Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective},
author={Chen, Meiqi and Cao, Yixin and Zhang, Yan and Lu, Chaochao},
journal={arXiv preprint arXiv:2403.18346},
year={2024}
}