Code CoT w/ Self-Evolution LLM: rStar-Math on Small Language Models (notably without Phi-4 from Microsoft).
In Round 1, DeepSeek-Coder-V2-Instruct (236B) generated a large dataset of high-quality, step-by-step reasoning trajectories for mathematical tasks, as Chain-of-Thought reasoning paths, which were used to fine-tune the small policy model, a 7B Qwen SLM.
After Round 1, the self-evolution framework takes over, using Monte Carlo Tree Search (MCTS) and a Process Preference Model (PPM) to iteratively improve the 7B policy model.
An open question remains:
Starting from a plain 7B policy model alone, with no frontier-model bootstrap, self-evolution could in theory gradually refine its reasoning ability by leveraging a) Monte Carlo Tree Search (MCTS) for reasoning-path generation and b) code-augmented verification to ensure the logical correctness of the generated data. But would this method ever converge to high-performance mathematical reasoning?
This remains unanswered by Microsoft.
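The two mechanisms named above, MCTS over reasoning paths and code-augmented verification of each step, can be sketched in miniature. This is a hedged toy illustration, not the rStar-Math implementation: the candidate steps, the `verify_step`/`mcts_step` helpers, and the single-level UCT search are all simplified assumptions; the real system runs a full multi-step tree with a learned PPM reward.

```python
import math

# Toy sketch of code-augmented MCTS (NOT the rStar-Math code).
# Each candidate "reasoning step" is a Python snippet; a step is kept
# only if its code executes without error (code-augmented verification),
# and a UCT-style search then concentrates visits on high-reward steps.

def verify_step(code: str, env: dict) -> bool:
    """Execute a candidate step; reject it if it raises."""
    try:
        exec(code, env)
        return True
    except Exception:
        return False

class Node:
    def __init__(self, code="", parent=None):
        self.code, self.parent = code, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Standard UCT: exploitation term plus exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_step(root, candidates, reward_fn, rollouts=32):
    """One simplified MCTS round: expand only code-verified candidates,
    then repeatedly pick a child by UCT and back up its terminal reward."""
    for code in candidates:
        if verify_step(code, {}):            # drop steps whose code fails
            root.children.append(Node(code, parent=root))
    for _ in range(rollouts):
        root.visits += 1
        child = max(root.children, key=Node.uct)
        child.visits += 1
        child.value += reward_fn(child.code)  # e.g. 1.0 if answer correct
    return max(root.children, key=lambda n: n.visits)

# Toy demo: find the step that sets x to 5; "x = 1/0" is filtered out
# by verification, and the wrong-but-runnable step earns zero reward.
def _reward(code):
    env = {}
    exec(code, env)
    return 1.0 if env.get("x") == 5 else 0.0

root = Node()
best = mcts_step(root, ["x = 2 + 3", "x = 2 * 3", "x = 1/0"], _reward)
```

The sketch shows why the open question is interesting: verification only filters out steps whose code crashes, so the search still depends on the reward signal (here a hard-coded check, in the paper a learned PPM) to steer the policy toward correct reasoning.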
All rights w/ authors:
rStar-Math: Small LLMs Can Master Math Reasoning
with Self-Evolved Deep Thinking
by Xinyu Guan, Li Lyna Zhang, Yifei Liu,
Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
from Microsoft Research Asia
#coding
#reasoning
#science
#airesearch