Test Time Recursive Thinking
MCP server: Test Time Recursive Thinking
Installation
npx test-time-recursive-thinkingAsk AI about Test Time Recursive Thinking
Powered by Claude Β· Grounded in docs
I know everything about Test Time Recursive Thinking. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
Test-time Recursive Thinking (TRT)
Self-Improvement without External Feedback (arXiv)
What is TRT?
Test-time Recursive Thinking (TRT) is an agentic framework that enables LLMs to self-improve during inference through iterative reflectionβwithout requiring external feedback or reward signals.
TRT operates in three stages:
- Generate: The model produces multiple solution candidates for a given problem
- Select: Solutions are evaluated and the best candidates are identified using self-consistency or verification
- Reflect: The model analyzes successful and failed attempts to extract generalizable insights, which inform subsequent generation rounds
This recursive process allows the model to accumulate knowledge within a session, progressively improving solution quality through self-directed learning.
Installation
Prerequisites
- Python >= 3.10
- CUDA-compatible GPU (recommended for vLLM-based experiments)
- Azure OpenAI API access (for LiveCodeBench experiments)
Quick Install
git clone https://github.com/YufanZhuang/test-time-recursive-thinking.git
cd test-time-recursive-thinking
./setup_env.sh
Manual Install
# Install AIME dependencies
cd AIME
pip install -r requirements.txt
# Install LiveCodeBench
cd ../LiveCodeBench
pip install -e .
# Install MCP server dependencies (for TRT agentic mode)
pip install mcp fastmcp aiofiles orjson
Quick Start
AIME Mathematical Reasoning
cd AIME/bash_scripts
bash qwen3.sh # Run Qwen3-235B evaluation
bash gpt_oss.sh # Run GPT model evaluation
LiveCodeBench Code Generation
Step 1: Set environment variables
export OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
Step 2: Start the TRT MCP server
cd LiveCodeBench/kflow_mcp
bash start_server.sh
Step 3: Run TRT evaluation (in a separate terminal)
cd LiveCodeBench/bash_scripts
bash kflow_o4-mini.sh # o4-mini with TRT
bash kflow_o3.sh # o3 with TRT
Configuration Options
AIME Parameters
| Parameter | Description | Default |
|---|---|---|
--model_name | Model identifier (e.g., Qwen/Qwen3-235B-A22B-Thinking-2507) | Required |
--max_new_tokens | Maximum tokens to generate | 262144 |
--temperature | Sampling temperature | 0.6 |
--reflex_size | Number of reflection samples (Maj@N) | 64 |
LiveCodeBench Parameters
| Parameter | Description | Default |
|---|---|---|
--model | Model name/identifier | Required |
--scenario | Evaluation scenario | codegeneration |
--max_tokens | Maximum token limit | 200000 |
--trt_rounds | Number of TRT iterations | 8 |
--roll_out_n | Number of rollouts per problem | 2 |
--difficulty | Problem difficulty filter | hard |
Evaluations
AIME Mathematical Reasoning
TRT achieves 100% accuracy on both AIME-24 and AIME-25 benchmarks:
LiveCodeBench Code Generation (Hard Problems)
TRT provides significant improvements on hard coding problems:
| Model | Baseline | TRT | Improvement |
|---|---|---|---|
| o4-mini (high) | 63.5% | 73.9% | +10.4pp |
| o3 (high) | 57.1% | 71.9% | +14.8pp |
Project Structure
test-time-recursive-thinking/
βββ README.md # This file
βββ LICENSE # MIT License
βββ setup_env.sh # Environment setup script
βββ assets/ # Images and figures
βββ AIME/ # AIME mathematical reasoning
β βββ bash_scripts/ # Experiment launch scripts
β βββ scripts/ # Python evaluation scripts
β βββ requirements.txt # Dependencies
β βββ README.md
βββ LiveCodeBench/ # Code generation evaluation
βββ bash_scripts/ # Experiment launch scripts
βββ kflow_mcp/ # TRT MCP server
βββ lcb_runner/ # Main evaluation runner
βββ pyproject.toml # Package configuration
βββ README.md
Questions?
If you have any questions related to the code or the paper, feel free to reach out to us at y5zhuang@ucsd.edu.
Citation
If you find our paper and code useful, please cite us:
@misc{zhuang2026testtimerecursivethinkingselfimprovement,
title={Test-time Recursive Thinking: Self-Improvement without External Feedback},
author={Yufan Zhuang and Chandan Singh and Liyuan Liu and Yelong Shen and Dinghuai Zhang and Jingbo Shang and Jianfeng Gao and Weizhu Chen},
year={2026},
eprint={2602.03094},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.03094},
}
