Code CoT w/ Self-Evolution LLM: rStar-Math on Small Language Models (notably without Phi-4 from Microsoft).
In Round 1, DeepSeek-Coder-V2-Instruct (236B) generated a large dataset of high-quality, step-by-step reasoning trajectories for mathematical tasks, as Chain-of-Thought reasoning paths, which were used to fine-tune the small policy model, a 7B Qwen SLM.
After Round 1, the self-evolution framework takes over, using Monte Carlo Tree Search (MCTS) and a Process Preference Model (PPM) to iteratively improve the 7B policy model.
An open question remains:
Starting from a plain 7B policy model alone, with no frontier-model bootstrap, self-evolution could in theory gradually refine its reasoning ability by leveraging a) Monte Carlo Tree Search (MCTS) for reasoning-path generation and b) code-augmented verification to ensure the logical correctness of the generated data. But would this method ever converge to high-performance mathematical reasoning?
This remains unanswered by Microsoft.
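The two mechanisms named above, MCTS over reasoning paths and code-augmented verification of each step, can be sketched in miniature. This is a hedged toy illustration, not the rStar-Math implementation: the candidate steps, the `verify_step`/`mcts_step` helpers, and the single-level UCT search are all simplified assumptions; the real system runs a full multi-step tree with a learned PPM reward.

```python
import math

# Toy sketch of code-augmented MCTS (NOT the rStar-Math code).
# Each candidate "reasoning step" is a Python snippet; a step is kept
# only if its code executes without error (code-augmented verification),
# and a UCT-style search then concentrates visits on high-reward steps.

def verify_step(code: str, env: dict) -> bool:
    """Execute a candidate step; reject it if it raises."""
    try:
        exec(code, env)
        return True
    except Exception:
        return False

class Node:
    def __init__(self, code="", parent=None):
        self.code, self.parent = code, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Standard UCT: exploitation term plus exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_step(root, candidates, reward_fn, rollouts=32):
    """One simplified MCTS round: expand only code-verified candidates,
    then repeatedly pick a child by UCT and back up its terminal reward."""
    for code in candidates:
        if verify_step(code, {}):            # drop steps whose code fails
            root.children.append(Node(code, parent=root))
    for _ in range(rollouts):
        root.visits += 1
        child = max(root.children, key=Node.uct)
        child.visits += 1
        child.value += reward_fn(child.code)  # e.g. 1.0 if answer correct
    return max(root.children, key=lambda n: n.visits)

# Toy demo: find the step that sets x to 5; "x = 1/0" is filtered out
# by verification, and the wrong-but-runnable step earns zero reward.
def _reward(code):
    env = {}
    exec(code, env)
    return 1.0 if env.get("x") == 5 else 0.0

root = Node()
best = mcts_step(root, ["x = 2 + 3", "x = 2 * 3", "x = 1/0"], _reward)
```

The sketch shows why the open question is interesting: verification only filters out steps whose code crashes, so the search still depends on the reward signal (here a hard-coded check, in the paper a learned PPM) to steer the policy toward correct reasoning.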
All rights w/ authors:
rStar-Math: Small LLMs Can Master Math Reasoning
with Self-Evolved Deep Thinking
by Xinyu Guan, Li Lyna Zhang, Yifei Liu,
Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
from Microsoft Research Asia
#coding
#reasoning
#science
#airesearch