Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, and there is growing interest in enhancing their reasoning by scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited.
Reasoning-oriented methods struggle to generalize to open-ended scenarios because they implicitly assume complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges:
- Error propagation: errors in early steps cascade through the reasoning chain
- Verification bottleneck: the explore-exploit trade-off that arises in multi-branch decision processes
To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches.
Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%.
Key Features
- Iterative Refinement through Decomposition: Breaks down complex reasoning tasks into manageable steps
- Retrieval-then-Reasoning: Augments LLMs with fine-grained knowledge retrieval
- Monte Carlo Tree Search: Mitigates error propagation by enabling exploration of multiple branches
- Risk-Adaptive Search: Uses Bayesian risk minimization to select a promising reasoning path (see the sketch after this list)
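The risk-adaptive step can be pictured with a small sketch. The risk formulation below is an illustrative assumption rather than the paper's exact definition: we simply take risk to be the expected cost of committing to a branch given a model-estimated probability that the branch is correct, and pick the branch with minimum expected risk. `ReasoningState`, `expected_risk`, and `select_branch` are hypothetical names.

```python
# Hypothetical sketch of risk-adaptive path selection (not ARise's exact formulation).
# Assumption: each candidate branch carries a model-estimated probability of being
# on a correct reasoning path; Bayes risk is the expected cost of committing to it.

from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningState:
    trace: str         # partial reasoning trace for this branch
    p_correct: float   # estimated probability that the branch is correct

def expected_risk(state: ReasoningState, wrong_cost: float = 1.0) -> float:
    """Bayes risk of committing to `state`: cost of being wrong times P(wrong)."""
    return wrong_cost * (1.0 - state.p_correct)

def select_branch(candidates: List[ReasoningState]) -> ReasoningState:
    """Bayesian risk minimization: choose the branch with minimum expected risk."""
    return min(candidates, key=expected_risk)
```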
Method: ARise Pipeline

Figure 1: ARise Pipeline Overview
ARise iteratively refines reasoning steps through decomposition and retrieval-then-reasoning, providing fine-grained knowledge to the LLM. MCTS treats each step as a node in the search tree, expanding linear reasoning into a tree to mitigate error propagation: it enables exploration of multiple reasoning paths and allows backtracking when necessary. Risk assessment leverages Bayesian risk minimization to evaluate the quality of each reasoning state, dynamically optimizing action strategies to guide the search towards promising directions.
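As a rough illustration of how these pieces might fit together, the following minimal Python sketch runs an MCTS-style loop in this spirit: selection by UCT, expansion via decomposition plus retrieval-then-reasoning, evaluation by a risk score, and backpropagation of the reward. The callables `decompose`, `retrieve`, `reason`, and `estimate_risk` are placeholders for the LLM and retriever calls, not the repository's actual API.

```python
# Hypothetical sketch of an ARise-style search loop. Each node is a reasoning
# state; expansion decomposes the question into next sub-steps, retrieval
# grounds each step, and a risk score (lower is better) provides the reward.

import math
import random
from typing import Callable, List, Optional

class Node:
    def __init__(self, state: str, parent: Optional["Node"] = None):
        self.state = state                 # accumulated reasoning trace
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0                   # sum of rewards backpropagated here

    def uct(self, c: float = 1.4) -> float:
        """Upper confidence bound used for selection; unvisited nodes go first."""
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def search(question: str,
           decompose: Callable[[str], List[str]],     # state -> candidate sub-questions
           retrieve: Callable[[str], str],            # sub-question -> evidence
           reason: Callable[[str, str], str],         # (sub-question, evidence) -> step
           estimate_risk: Callable[[str], float],     # state -> risk in [0, 1]
           iterations: int = 32) -> str:
    root = Node(state=question)
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: decompose, retrieve evidence, and reason one step per branch.
        for sub_q in decompose(node.state):
            evidence = retrieve(sub_q)
            step = reason(sub_q, evidence)
            node.children.append(Node(node.state + "\n" + step, parent=node))
        if not node.children:
            continue
        # 3. Evaluation: score one new child; reward is 1 minus its estimated risk.
        child = random.choice(node.children)
        reward = 1.0 - estimate_risk(child.state)
        # 4. Backpropagation: push the reward back up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Follow the most-visited children to return the most promising trace.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.state
```

Using `1 - risk` as the reward means low-risk branches accumulate value, so visits concentrate on promising reasoning paths while branches whose risk grows are naturally abandoned, which is the backtracking behavior described above.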
Experimental Results
Comparison with Baseline Methods

Figure 2: Comparison with Baseline Methods
ARise demonstrates superior performance. Specifically, on the Qwen2.5-14B-Instruct model, ARise outperforms all baselines across all benchmarks, achieving an absolute improvement in EM of 19.83% over the vanilla RAG method, 13.29% over prompt-based baselines, and 15.5% over search-based baselines.
ARise maintains robust performance on the Qwen2.5-7B-Instruct model, with an absolute improvement of 13.67% in EM over the vanilla RAG method, and overall surpasses the various baselines. ARise performs slightly worse on the Llama models; nevertheless, it still maintains a notable F1 advantage on Llama, indicating its effectiveness in selecting more promising paths.
Comparison with Large Reasoning Models (LRMs)

Figure 3: Comparison with Large Reasoning Models
Learning-based LRMs, such as the DeepSeek-R1 distilled models, have not yet reached the point where they can match or replace search-based reasoning methods in performance.
Our empirical comparison between base models equipped with ARise and the DeepSeek-R1 distilled models reveals key insights into the effectiveness of test-time search. These learning-based LRMs distill similar reasoning patterns from DeepSeek-R1. ARise exhibits a performance advantage over the LRMs, especially on the Qwen model series. On average, ARise shows a relative improvement of 4.03%, emphasizing the benefit of our search-based method.
Citation
@article{zhang2025arise,
  title   = {ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search},
  author  = {Yize Zhang and Tianshu Wang and Sirui Chen and Kun Wang and Xingyu Zeng and Hongyu Lin and Xianpei Han and Le Sun and Chaochao Lu},
  year    = {2025},
  journal = {arXiv preprint arXiv:2504.10893}
}