DEPO: Dual‑Efficiency Preference Optimization for LLM Agents

Sirui Chen1,2,3*, Mengshi Zhao4*, Lei Xu2,5, Yuying Zhao3,
Beier Zhu3†, Hanwang Zhang3, Shengjie Zhao1, Chaochao Lu2†
1Tongji University, 2Shanghai Artificial Intelligence Laboratory, 3Nanyang Technological University,
4The University of Hong Kong, 5École Polytechnique Fédérale de Lausanne (EPFL)

AAAI Conference on Artificial Intelligence (AAAI) 2026
*Equal contribution. †Corresponding authors.

Dual‑Efficiency

A comparison of (a) step-level inefficiency, arising from latency and cost in LLM token generation; (b) trajectory-level inefficiency, arising from latency and cost in environment interactions such as API calls; and (c) our proposed dual-efficiency. For LLM agents, achieving genuine efficiency requires joint optimization across both dimensions.
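To make the two notions concrete, the sketch below computes both costs from a logged agent trajectory. The Step/Trajectory structures and field names are illustrative assumptions for this page, not the paper's exact formulation.

from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    response_tokens: int  # tokens the LLM generates at this step (reasoning + action)

@dataclass
class Trajectory:
    steps: List[Step]
    success: bool

def step_level_cost(traj: Trajectory) -> float:
    """Average tokens generated per step; lower means better step-level efficiency."""
    return sum(s.response_tokens for s in traj.steps) / max(len(traj.steps), 1)

def trajectory_level_cost(traj: Trajectory) -> int:
    """Number of environment interactions (e.g., API calls) needed to finish the task;
    lower means better trajectory-level efficiency."""
    return len(traj.steps)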

Abstract

Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chains of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, which hinders targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference-based optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in task performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data.
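As a rough illustration of how a dual-efficiency preference signal could be constructed, the sketch below scores trajectories by task success, total generated tokens, and step count, then forms (chosen, rejected) pairs per task for DPO-style preference training. The weights, dictionary fields, and pairing rule are assumptions made for this sketch and do not reproduce DEPO's exact training recipe.

from typing import Dict, List, Tuple

def dual_efficiency_score(total_tokens: int, num_steps: int, success: bool,
                          alpha: float = 1e-3, beta: float = 0.1) -> float:
    """Higher is better: reward task success, penalize token usage and step count."""
    return (1.0 if success else 0.0) - alpha * total_tokens - beta * num_steps

def build_preference_pairs(trajectories: List[dict]) -> List[Tuple[dict, dict]]:
    """Group trajectories by task, rank them by dual-efficiency score, and
    return adjacent (chosen, rejected) pairs for preference optimization."""
    by_task: Dict[str, List[dict]] = {}
    for t in trajectories:
        by_task.setdefault(t["task_id"], []).append(t)

    pairs: List[Tuple[dict, dict]] = []
    for group in by_task.values():
        ranked = sorted(
            group,
            key=lambda t: dual_efficiency_score(
                t["total_tokens"], t["num_steps"], t["success"]),
            reverse=True,
        )
        for better, worse in zip(ranked, ranked[1:]):
            pairs.append((better, worse))
    return pairs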

Experiment Results

Main Results

Comparison of DEPO with a wide range of baselines.

DEPO achieves a significant improvement in the efficiency of LLM agents. Crucially, this efficiency gain does not come at the cost of performance; DEPO maintains—or even improves—it.
Generalizability

Model generalizability across math benchmarks.

DEPO demonstrates strong generalizability, achieving higher average accuracy while reducing generation tokens on most tasks.
Sample Efficiency

Sample efficiency of DEPO.

DEPO exhibits excellent sample efficiency. With larger training sets, DEPO consistently yields further gains.

Video Presentation

BibTeX

        
@misc{chen2025depodualefficiencypreferenceoptimization,
  title={DEPO: Dual-Efficiency Preference Optimization for LLM Agents},
  author={Sirui Chen and Mengshi Zhao and Lei Xu and Yuying Zhao and Beier Zhu and Hanwang Zhang and Shengjie Zhao and Chaochao Lu},
  year={2025},
  eprint={2511.15392},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.15392},
}