Self-Rewarding Reasoning Large Language Models (SR-LLMs)
"There are decades where nothing happens; and there are weeks where decades happen"
The race for AGI is heating up.
We have more players competing than ever before, and we are now seeing true competition from DeepSeek and Alibaba’s Qwen model series. These teams are injecting novel engineering breakthroughs into the ML ecosystem, and that alpha is being rapidly absorbed by Western AI labs.
This is good for everyone.
This week was memorable. New OpenAI models were released, Google dropped Gemini 2.0 Flash Thinking, wildly powerful video generators from China took over the net… and the research world is experiencing breakthrough after breakthrough.
Now we are teaching models to evaluate the correctness of their outputs in real-time during inference without external feedback.
From the abstract of the February 26, 2025 paper, Self-rewarding correction for mathematical reasoning:
“We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models’ ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.”
Let’s break this down.
This paper tackles the problem of making large language models (LLMs) better at mathematical reasoning, specifically focusing on their ability to self-correct their answers. Instead of relying on external "reward models" (which are often other LLMs, adding computational cost), the researchers propose a way to train a single LLM to do three things simultaneously:
Generate a solution: Like a regular LLM, it attempts to solve a math problem step-by-step (using "chain-of-thought" reasoning).
Evaluate its own solution: After generating a solution, the LLM assesses whether its own answer is likely correct or incorrect. This is done generatively, by outputting special tokens like "[VERIFY] correct" or "[VERIFY] wrong".
Correct its solution (if necessary): If the LLM judges its own answer as incorrect, it tries again, generating a revised solution. This process continues until the LLM believes its answer is correct.
The key innovation is training this "self-rewarding" capability without needing human-labeled data or external reward models during inference.
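To make that loop concrete, here is a minimal Python sketch of what inference could look like with such a model. The generate helper and the max_rounds cap are placeholders of my own, not the paper's code; the only detail taken from the paper is the convention of emitting "[VERIFY] correct" / "[VERIFY] wrong" and stopping once the model judges its own answer correct.

```python
# Minimal sketch of the inference-time self-correction loop described above.
# `generate` is a placeholder for any LLM completion call (local model or API);
# it is NOT the paper's implementation.

def generate(prompt: str) -> str:
    """Placeholder for a call to the fine-tuned self-rewarding model."""
    raise NotImplementedError

def solve_with_self_correction(problem: str, max_rounds: int = 4) -> str:
    transcript = problem
    answer = ""
    for _ in range(max_rounds):
        # 1. Generate a step-by-step attempt that ends with a self-evaluation token.
        attempt = generate(transcript)
        transcript += attempt
        answer = attempt

        # 2. The model's own verdict decides whether to stop or revise.
        if "[VERIFY] correct" in attempt:
            break
        # 3. Otherwise the transcript keeps growing and the model tries again.
    return answer
```

Because the verdict comes from the same model that produced the solution, there is no second reward model to call at inference time, which is where the deployment savings come from.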
The training happens in two stages:
Stage 1: Supervised Fine-Tuning (SFT): The researchers create a dataset of math problems and solutions using a clever "sequential rejection sampling" technique. They start with a base LLM (like Qwen-2.5-Math or Llama-3) and generate many initial solution attempts. They then use a ground-truth verifier (a symbolic math engine, not an LLM) to identify correct and incorrect answers. They carefully construct trajectories that include:
Initial incorrect attempts.
Self-evaluations of those attempts ("[VERIFY] wrong").
Revised, correct attempts.
Self-evaluations of the revisions ("[VERIFY] correct").
Trajectories that are correct on the first attempt, ending with a self-evaluation ("[VERIFY] correct").
Fine-tuning on this curated data teaches the LLM the pattern of self-rewarding and self-correction.
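For intuition, here is a hedged sketch of how one such training trajectory might be stitched together. The check_answer callable stands in for the symbolic ground-truth verifier and attempts would come from sampling the base model many times; only the overall trajectory shape with "[VERIFY]" markers follows the paper's description.

```python
# Hedged sketch of Stage 1 data construction. `check_answer` stands in for
# the symbolic ground-truth verifier; `attempts` would be many solutions
# sampled from the base model for the same problem.

def build_trajectory(problem: str, attempts: list[str], check_answer) -> str | None:
    """Stitch sampled attempts into one self-correction training trajectory."""
    right = [a for a in attempts if check_answer(problem, a)]
    wrong = [a for a in attempts if not check_answer(problem, a)]
    if not right:
        return None  # rejection sampling: discard problems with no correct attempt

    if wrong:
        # Incorrect attempt -> "[VERIFY] wrong" -> revised, correct attempt.
        return (
            problem + "\n"
            + wrong[0] + "\n[VERIFY] wrong\n"
            + right[0] + "\n[VERIFY] correct"
        )
    # Correct on the first try: single attempt plus "[VERIFY] correct".
    return problem + "\n" + right[0] + "\n[VERIFY] correct"
```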
Stage 2: Reinforcement Learning (RL): They further refine the model using RL, primarily with a simple "correctness" reward (did the final answer match the ground truth?). They experiment with both PPO and DPO-style algorithms. This stage fine-tunes the model's ability to both reason correctly and self-evaluate accurately.
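The rule-based signal itself can be as simple as an exact-match check on the final answer. The sketch below is my own simplification, with a hypothetical extract_final_answer parser; the paper's actual reward design and the PPO/DPO training loop around it are more involved.

```python
# Simplified rule-based reward for the RL stage: 1.0 if the extracted final
# answer matches the ground truth, else 0.0. `extract_final_answer` is a
# hypothetical parser, not the paper's implementation.

def extract_final_answer(completion: str) -> str:
    """Hypothetical parser: take whatever follows the last 'Final answer:'."""
    return completion.rsplit("Final answer:", 1)[-1].strip()

def correctness_reward(completion: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0
```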
The results show that this "self-rewarding" approach significantly outperforms models that try to self-correct purely through prompting (intrinsic self-correction) and is competitive with methods that use external reward models.
This is what a few of the exchanges look like: a simplified version of what the authors call the “reasoning path” (a term I quite like).
“The general idea is to unify the generative reward model and reasoning model into a single LLM. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment.
To enable this, we first employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals.”
- Paper Authors
Link: https://arxiv.org/pdf/2502.19613
The really interesting avenues I see involve mixing this technique with existing tooling & methods… more on that in the next section.