Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often over-rely on unimodal biases (e.g., language bias and vision bias), which leads to incorrect answers on complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis. Guided by the causal graph, we introduce MORE, a novel dataset of 12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities, as answering its questions requires multi-hop reasoning and overcoming unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities: a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs, and fine-tuning to refine open-source MLLMs. Extensive quantitative and qualitative experiments offer valuable insights for future research.
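The DeVA strategy can be sketched as a simple decompose-then-verify pipeline. The sketch below is a minimal illustration only, assuming a generic `query_mllm` callable (hypothetical; the paper's actual prompts, decomposition format, and verification criteria differ):

```python
from typing import Callable, List


def deva_answer(question: str,
                image: object,
                query_mllm: Callable[[str, object], str],
                max_hops: int = 3) -> str:
    """Decompose-Verify-Answer sketch: split a multi-hop VQA question
    into sub-questions, verify each intermediate answer, then compose
    the final answer from the verified facts."""
    # 1. Decompose: ask the model for sub-questions, one per line.
    #    (Prompt wording here is illustrative, not the paper's.)
    subs = query_mllm(
        f"Decompose into at most {max_hops} sub-questions, one per line:\n{question}",
        image,
    ).splitlines()

    facts: List[str] = []
    for sub in subs:
        # 2. Answer each sub-question against the image.
        ans = query_mllm(sub, image)
        # 3. Verify: re-ask with the candidate answer; keep only confirmed facts.
        verdict = query_mllm(
            f"Is '{ans}' a correct answer to '{sub}'? Reply yes or no.", image
        )
        if verdict.strip().lower().startswith("yes"):
            facts.append(f"{sub} -> {ans}")

    # 4. Final answer conditioned on the verified reasoning chain,
    #    discouraging unimodal shortcuts.
    context = "\n".join(facts)
    return query_mllm(f"Facts:\n{context}\nNow answer: {question}", image)
```

A stubbed `query_mllm` (e.g., one that returns canned strings per prompt prefix) is enough to exercise the control flow before wiring in a real MLLM API.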
| Model | LLM | #Params | Two-Hop (Open-ended) | Two-Hop (Multi-choice) | Three-Hop (Open-ended) | Three-Hop (Multi-choice) | Overall (Open-ended) | Overall (Multi-choice) |
|---|---|---|---|---|---|---|---|---|
Random | / | / | / | 25.0 | / | 25.0 | / | 25.0 |
BLIP-2 | OPT | 6.7B | 4.0 | 16.4 | 1.4 | 15.4 | 2.7 | 15.9 |
InstructBLIP | Vicuna | 13B | 3.0 | 17.0 | 1.6 | 16.2 | 2.3 | 16.6 |
mPLUG-Owl | Llama | 7B | 4.0 | 12.4 | 8.2 | 11.4 | 6.1 | 11.9 |
LLaVA | Llama | 13B | 8.0 | 20.8 | 6.8 | 13.6 | 7.4 | 17.5 |
GPT-4V | - | - | 15.8 | 25.6 | 15.3 | 23.2 | 15.6 | 24.4 |
Gemini Pro | - | - | 14.2 | 33.5 | 10.1 | 24.4 | 12.2 | 28.9 |
@article{chen2024quantifying,
title={Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective},
author={Chen, Meiqi and Cao, Yixin and Zhang, Yan and Lu, Chaochao},
journal={arXiv preprint arXiv:2403.18346},
year={2024}
}