DEPO: Dual‑Efficiency Preference Optimization for LLM Agents

Sirui Chen1,2,3*, Mengshi Zhao4*, Lei Xu2,5, Yuying Zhao3,
Beier Zhu3†, Hanwang Zhang3, Shengjie Zhao1, Chaochao Lu2†
1Tongji University, 2Shanghai Artificial Intelligence Laboratory, 3Nanyang Technological University,
4The University of Hong Kong, 5École Polytechnique Fédérale de Lausanne (EPFL)

AAAI Conference on Artificial Intelligence (AAAI) 2026
*Equal contribution. †Corresponding authors.

Dual‑Efficiency

A comparison of (a) step-level inefficiency, arising from latency and cost in LLM token generation; (b) trajectory-level inefficiency, arising from latency and cost in environment interactions such as API calls; and (c) our proposed dual-efficiency. For LLM agents, achieving genuine efficiency requires joint optimization across both dimensions.
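To make the two notions concrete, the sketch below computes both costs from a logged agent trajectory. The Step/Trajectory structures and field names are illustrative assumptions for this page, not the paper's exact formulation.

from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    response_tokens: int  # tokens the LLM generates at this step (reasoning + action)

@dataclass
class Trajectory:
    steps: List[Step]
    success: bool

def step_level_cost(traj: Trajectory) -> float:
    """Average tokens generated per step; lower means better step-level efficiency."""
    return sum(s.response_tokens for s in traj.steps) / max(len(traj.steps), 1)

def trajectory_level_cost(traj: Trajectory) -> int:
    """Number of environment interactions (e.g., API calls) needed to finish the task;
    lower means better trajectory-level efficiency."""
    return len(traj.steps)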

Abstract

Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chains of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, which hinders targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference-based optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in task performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data.
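As a rough illustration of how a dual-efficiency preference signal could be constructed, the sketch below scores trajectories by task success, total generated tokens, and step count, then forms (chosen, rejected) pairs per task for DPO-style preference training. The weights, dictionary fields, and pairing rule are assumptions made for this sketch and do not reproduce DEPO's exact training recipe.

from typing import Dict, List, Tuple

def dual_efficiency_score(total_tokens: int, num_steps: int, success: bool,
                          alpha: float = 1e-3, beta: float = 0.1) -> float:
    """Higher is better: reward task success, penalize token usage and step count."""
    return (1.0 if success else 0.0) - alpha * total_tokens - beta * num_steps

def build_preference_pairs(trajectories: List[dict]) -> List[Tuple[dict, dict]]:
    """Group trajectories by task, rank them by dual-efficiency score, and
    return adjacent (chosen, rejected) pairs for preference optimization."""
    by_task: Dict[str, List[dict]] = {}
    for t in trajectories:
        by_task.setdefault(t["task_id"], []).append(t)

    pairs: List[Tuple[dict, dict]] = []
    for group in by_task.values():
        ranked = sorted(
            group,
            key=lambda t: dual_efficiency_score(
                t["total_tokens"], t["num_steps"], t["success"]),
            reverse=True,
        )
        for better, worse in zip(ranked, ranked[1:]):
            pairs.append((better, worse))
    return pairs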

Experiment Results

Main Results

Comparison of DEPO with a wide range of baselines.

DEPO achieves a significant improvement in the efficiency of LLM agents. Crucially, this efficiency gain does not come at the cost of performance; DEPO maintains—or even improves—it.
Generalizability

Model generalizability across math benchmarks.

DEPO demonstrates strong generalizability, achieving higher average accuracy while reducing generation tokens on most tasks.
Sample Efficiency

Sample efficiency of DEPO.

DEPO exhibits excellent sample efficiency. With larger training sets, DEPO consistently yields further gains.

Video Presentation

BibTeX

        
@misc{chen2025depodualefficiencypreferenceoptimization,
  title={DEPO: Dual-Efficiency Preference Optimization for LLM Agents},
  author={Sirui Chen and Mengshi Zhao and Lei Xu and Yuying Zhao and Beier Zhu and Hanwang Zhang and Shengjie Zhao and Chaochao Lu},
  year={2025},
  eprint={2511.15392},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.15392},
}