How to build custom reasoning agents with a fraction of the compute



Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback.

Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), combines the reliable performance tracking of reinforcement learning with the granular feedback of self-distillation. 

Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic.

The problem with training reasoning models

The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment. An automated verifier checks if the model’s answer is right or wrong, providing a binary reward, such as a 0 or 1.
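As a rough illustration of what such a verifier looks like, the sketch below scores an answer against a known reference. The function name `verify_answer` and the whitespace and case normalization are assumptions made for this example; a real verifier would be task-specific.

```python
def verify_answer(model_answer: str, reference: str) -> int:
    """Return 1 if the normalized answers match exactly, otherwise 0."""
    def normalize(text: str) -> str:
        # Hypothetical normalization: ignore case and all spaces.
        return text.strip().lower().replace(" ", "")

    return int(normalize(model_answer) == normalize(reference))


# Three sampled answers to a question whose verified answer is "42".
rewards = [verify_answer(answer, "42") for answer in ["42", " 42 ", "41"]]
print(rewards)  # [1, 1, 0]
```

The whole response, however long, is reduced to that single 0 or 1.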

RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it's a pivotal logical step or a throwaway phrase.” Consequently, the model never learns which intermediate steps led to its success or failure.
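The uniform-credit problem Yang describes can be seen in a minimal sketch of a GRPO-style advantage. The helper names are hypothetical, and the use of the population standard deviation and the `eps` value are assumptions; implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some stacks use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token in the response receives the same scalar advantage."""
    return [advantage] * num_tokens


# Four sampled responses to one prompt, scored 0 or 1 by the verifier.
rewards = [1, 0, 0, 1]
advs = group_relative_advantages(rewards)

# A five-token response: pivotal steps and filler get identical credit.
token_advs = broadcast_to_tokens(advs[0], 5)
```

Nothing in this signal distinguishes one token from another, which is the signal density problem in the quote above.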

On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to that of the teacher token by token. This provides the student with granular feedback on the entire reasoning chain and response-generation process.
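A token-by-token comparison of this kind is often expressed as a divergence between the two models' next-token distributions. The sketch below uses KL divergence from the student to the teacher as one common choice; the article does not specify the exact objective, and the three-word vocabulary and probabilities are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Next-token distributions at two positions, over a toy 3-token vocabulary.
# Index i must refer to the same token in both models.
student_steps = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
teacher_steps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

# One loss value per position: zero where the models agree, larger where
# the student's distribution drifts from the teacher's.
per_token_loss = [
    kl_divergence(s, t) for s, t in zip(student_steps, teacher_steps)
]
```

Because the sum pairs up probabilities by vocabulary index, the comparison only makes sense when both models index the same tokens.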

Keeping a separate, much larger teacher model running alongside the student for the entire training process incurs heavy computational overhead. “You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,” Yang said. Furthermore, the teacher and student models must share the exact same vocabulary structure, which, according to Yang, “quietly rules out most cross-architecture, cross-modality, or multilingual setups that enterprises actually run.”

The promise and failure of self-distillation

On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both the student and the teacher.

During training, the student receives a standard prompt while the teacher receives privileged information, such as a verified, step-by-step answer key. This well-informed teacher version of the model then evaluates the student version, providing token-by-token feedback as the student tries to solve the problem using only the standard prompt.
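The two roles differ only in what the model is shown. A minimal sketch of that setup, with a hypothetical `build_contexts` helper and an invented prompt template, might look like this:

```python
def build_contexts(question: str, verified_solution: str) -> dict:
    """Build the student and teacher prompts for the same model.

    The template wording is an assumption made for illustration.
    """
    student_prompt = f"Question: {question}\nAnswer step by step."
    teacher_prompt = (
        f"Question: {question}\n"
        f"Verified solution (privileged): {verified_solution}\n"
        "Answer step by step."
    )
    return {"student": student_prompt, "teacher": teacher_prompt}


contexts = build_contexts(
    question="How many apples remain if 3 of 10 are eaten?",
    verified_solution="10 - 3 = 7",
)
```

The student generates from `contexts["student"]`, and the same weights, conditioned on `contexts["teacher"]`, score each token the student produced.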

OPSD appears to be the perfect compromise for an enterprise budget. It delivers the granular, step-by-step guidance of OPD, and because it eliminates the need for an external teacher model, it runs at close to the computational cost of RLVR, requiring only an extra forward pass for the teacher.

However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”

“The objective is structurally ill-posed,” Yang said. “There's an irreducible mutual-information gap that the student can never close… When self-distillation is set up as distribution matching, the student is asked to imitate the teacher's full output distribution under privileged context.”

Because the teacher evaluates the student based on a hidden answer key, the training objective forces the student model to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student model starts hallucinating references to an invisible solution that it will not have access to in a real-world deployment.

In practice, OPSD models show a rapid spike in performance early in training, but their reasoning capabilities soon plateau and progressively degrade over time.

Decoupling direction from magnitude with RLSD

The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. They identified that the signal dictating the direction of the update (i.e., whether to reinforce or penalize a behavior) can be sparse, but must be perfectly reliable, because pointing the model in the wrong direction damages its reasoning policy.

On the other hand, the signal dictating the magnitude of the update (i.e., how much relative credit or blame a specific step deserves) benefits from being extremely dense to enable fine-grained, step-by-step corrections.

RLSD builds on this principle by decoupling the update direction from the update magnitude. The framework lets the verifiable environmental feedback from the RLVR signal strictly determine the direction of learning. The model only receives overall reinforcement if the final answer is objectively correct.

The self-teacher is stripped of its power to dictate what the model should generate. Instead, the teacher's token-by-token assessment is repurposed to determine the magnitude of the update. It simply distributes the total credit or blame across the individual steps of the model's reasoning path.

This alters how the model learns compared to the classic OPSD paradigm. In standard OPSD, the training objective acts like behavioral cloning, where the model is forced to directly copy the exact wording and phrasing of the teacher. This causes the student to hallucinate and leak references to data it does not have.

Instead of forcing the model to copy a hidden solution, RLSD provides a natural and virtually cost-free source of per-token credit information.

“The intuition: we're not teaching the model to reason like the teacher,” Yang said. “We're telling the model, on the path it chose, which of its own tokens were actually doing the work. The model's exploration distribution stays its own. Only the credit allocation gets sharpened.”

If a specific deduction strongly supports the correct outcome, it receives a higher score. If it is just a useless filler word, it receives a baseline score. RLSD eliminates the need to train complex auxiliary reward networks, manually annotate step-by-step data, or maintain massive external teacher models.
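The split between direction and magnitude can be sketched in a few lines. This is not the paper's formula: the weighting rule (an exponentiated gap between teacher and student log-probabilities), the clipping range, and the function name are all assumptions chosen to illustrate the idea that the verifier fixes the sign while the self-teacher only redistributes credit.

```python
import math


def rlsd_token_advantages(seq_advantage, student_logps, teacher_logps,
                          clip=(0.5, 2.0)):
    """Spread one sequence-level advantage across tokens.

    Direction: the sign of seq_advantage, set by the verifier, never flips.
    Magnitude: positive per-token weights from the teacher/student gap
    (illustrative rule, not taken from the paper).
    """
    if not student_logps:
        return []
    sign = 1.0 if seq_advantage >= 0 else -1.0
    raw = []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        # Correct answer: tokens the informed teacher favors earn more credit.
        # Wrong answer: tokens the informed teacher disfavors earn more blame.
        weight = math.exp(sign * (t_lp - s_lp))
        raw.append(min(max(weight, clip[0]), clip[1]))
    mean_weight = sum(raw) / len(raw)
    # Normalize so the weights average to 1 and total credit is preserved.
    return [seq_advantage * w / mean_weight for w in raw]


# Correct response: the middle token is the pivotal deduction.
credit_correct = rlsd_token_advantages(1.0, [-1.0, -1.0, -1.0],
                                       [-1.0, -0.4, -1.0])

# Wrong response: the middle token is where the reasoning went astray.
credit_wrong = rlsd_token_advantages(-1.0, [-1.0, -1.0, -1.0],
                                     [-1.0, -2.0, -1.0])
```

In both cases the filler tokens, where teacher and student agree, keep a baseline share, and every token's update points the same way as the verifier's verdict.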

Putting RLSD to the test

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks. These included MMMU for college-level multi-discipline questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark explicitly designed to be nearly impossible for current frontier models.

They compared the RLSD model against the base model with no post-training, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two.

RLSD outperformed every other method, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69 percentage points and standard RLVR by 2.32 percentage points. The gains were most pronounced in complex mathematical reasoning tasks: on the MathVision benchmark, RLSD outperformed standard RLVR by 3.91 percentage points.

Beyond accuracy, the framework offers massive efficiency gains. “Concretely, RLSD at 200 training steps already beats GRPO trained for 400 steps, so roughly 2x convergence speedup,” Yang said. “Cost-wise, the only overhead beyond a normal GRPO pipeline is one extra forward pass per response to grab teacher logits. Compared to rollout generation… that's basically free.”

Unlike OPSD, which saw performance spike and then completely collapse due to information leakage, RLSD maintained long-term training stability and converged on a higher performance ceiling than standard methods.

The qualitative findings show how RLSD changes what the model learns from. For example, in a complex visual counting task, standard RLVR looked at the final correct answer and gave the entire paragraph of reasoning tokens the same reward. RLSD instead concentrated the reward on the specific subtraction steps that solved the problem, while actively down-weighting generic filler text like "Looking at the image, I see…".

In another example, the model performed an incorrect math derivation based on a bar chart. Instead of labeling the whole response as a failure, RLSD concentrated the heaviest penalty on the exact point where the model misread a relationship from the chart. It remained neutral on the rest of the logical setup, recognizing that the initial framework was valid.

This is particularly important for messy, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers do not want it to unlearn its entire analytical framework. They just want it to fix the specific assumption it got wrong. RLSD allows the model to learn exactly which logical leaps are valuable and which are flawed, token by token. Because RLSD does this by repurposing the model itself, it provides models with granular reasoning capabilities while keeping the costs of training reasonable.

How enterprises can get started

For data engineers and AI orchestration teams, integrating RLSD is straightforward, but it requires the right setup. The most critical requirement is a verifiable reward signal, which can come from tools such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.
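To make the idea of a verifiable reward concrete, here is a minimal sketch of an SQL-execution check built on Python's standard `sqlite3` module. The function name `sql_reward`, the toy `orders` table, and the order-insensitive comparison are assumptions for illustration; a production verifier would also need sandboxing and timeouts for untrusted queries.

```python
import sqlite3


def sql_reward(candidate_sql: str, reference_sql: str, setup_sql: str) -> int:
    """Return 1 if the candidate query yields the reference result set."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        expected = conn.execute(reference_sql).fetchall()
        try:
            actual = conn.execute(candidate_sql).fetchall()
        except sqlite3.Error:
            return 0  # queries that fail to run earn no reward
        # Assumption: row order does not matter for this task.
        return int(sorted(actual) == sorted(expected))
    finally:
        conn.close()


SETUP = """
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.5);
"""
REFERENCE = "SELECT SUM(amount) FROM orders"

good = sql_reward("SELECT SUM(amount) FROM orders WHERE id > 0", REFERENCE, SETUP)
bad = sql_reward("SELECT COUNT(*) FROM orders", REFERENCE, SETUP)
broken = sql_reward("SELEC nothing", REFERENCE, SETUP)
```

The check compares results rather than query text, so a differently worded but equivalent query still earns the reward.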

However, RLSD is highly flexible regarding the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises to either pay annotators or distill from a frontier model, RLSD does not.

“If you have full verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the ground-truth final answer, that also works… OPSD doesn't have this flexibility.”

Integrating the technique into existing open-source multimodal RL frameworks like veRL or EasyR1 is lightweight. According to Yang, it requires no framework rewrite and slots right into the standard stack. The code change amounts to tens of lines that adjust the GRPO objective and sync the teacher with the student.

Looking ahead, RLSD offers a powerful way for enterprises to maximize their existing internal assets.

“The proprietary data enterprises hold inside their perimeter (compliance manuals, internal documentation, historical tickets, verified code snippets) is essentially free privileged information,” Yang concluded. “RLSD lets enterprises feed this kind of data straight in as privileged context, which sharpens the learning signal on smaller models without needing an external teacher and without sending anything outside the network.”


