VibeThinker-3B: Packing Verifiable Reasoning into 3 Billion Parameters

“Small model beats big model” papers appear regularly. Usually the claim holds on a specific benchmark under specific conditions, not across the board. WeiboAI’s VibeThinker-3B, published June 15, 2026, makes the same kind of claim but draws a clearer boundary around it. The authors do not argue that a 3B model replaces a frontier generalist. They argue that verifiable reasoning can be compressed into a small model, while open-domain knowledge and general dialogue still benefit from more parameters.

Model basics

The central hypothesis

The authors distinguish two kinds of capability by how they scale with parameters:

  • Compressible: verifiable reasoning, multi-step reasoning, constraint satisfaction, self-correction, math/coding/STEM problem-solving.
  • Needs broad coverage: open-domain knowledge, general-purpose dialogue, long-tail scenario understanding, factual recall across many topics.

If this distinction holds, domains with reliable verification signals (math, code) are good candidates for a focused training push on a small model. VibeThinker-3B tests that idea.

Training pipeline

Five stages:

1. Curriculum-based two-stage SFT

Stage 1 gives broad coverage of math, code, STEM reasoning, general dialogue, and instruction following. Stage 2 narrows to hard, long-horizon reasoning samples. Seed queries are chosen because they come with clear answers, full solutions, unit tests, or executable evaluation rules.

Rather than learning one correct solution path, the distillation preserves multiple valid reasoning traces from a teacher model. The authors call this Diversity-Exploring Distillation, aimed at building a spectrum of valid approaches rather than memorizing one.
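One way to picture keeping several valid traces per query is the filter below. It is a minimal sketch, not the authors' pipeline: the function name `select_diverse_traces`, the exact-match answer check, the whitespace-insensitive duplicate test, and the `max_per_query` cap are all assumptions made for illustration.

```python
def select_diverse_traces(
    traces: list[tuple[str, str]],
    reference_answer: str,
    max_per_query: int = 4,
) -> list[str]:
    """Keep up to max_per_query distinct teacher traces whose answer is correct.

    Each trace is a (reasoning, final_answer) pair. "Distinct" here only means
    "not identical after lowercasing and collapsing whitespace", which is a
    deliberately crude stand-in for a real diversity measure.
    """
    kept: list[str] = []
    seen: set[str] = set()
    for reasoning, answer in traces:
        if answer.strip() != reference_answer.strip():
            continue  # wrong final answer: never distilled
        key = " ".join(reasoning.lower().split())
        if key in seen:
            continue  # same trace up to whitespace and case
        seen.add(key)
        kept.append(reasoning)
        if len(kept) == max_per_query:
            break
    return kept


teacher_traces = [
    ("Factor the quadratic: (x - 2)(x - 3) = 0, so x = 2 or 3.", "2, 3"),
    ("Use the quadratic formula with a=1, b=-5, c=6.", "2, 3"),
    ("Factor the quadratic:  (x - 2)(x - 3) = 0, so x = 2 or 3.", "2, 3"),
    ("Guess x = 1 and stop.", "1"),
]
kept = select_diverse_traces(teacher_traces, reference_answer="2, 3")
```

Here the third trace is dropped as a duplicate and the fourth as incorrect, so two different solution paths survive for the same query.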

2. Multi-domain reasoning RL

Reuses MGPO (MaxEnt-Guided Policy Optimization) from VibeThinker-1.5B. Samples where rollouts include both correct and incorrect answers get higher training weight than samples at the extremes (always right or always wrong). Training runs Math RL, then Code RL, then STEM RL. A single 64K context window preserves long reasoning trajectories without truncation.
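The weighting idea can be illustrated with a small sketch. This is not the MGPO formula from the paper: it assumes binary entropy of the per-prompt rollout accuracy as a stand-in weight, chosen only because it is zero at the extremes and maximal when rollouts are evenly split. The names `binary_entropy` and `sample_weight` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome: 0 at p=0 or p=1, 1 at p=0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def sample_weight(rollout_correct: list[bool]) -> float:
    """Assumed weight for one prompt, based on how mixed its rollouts are."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    accuracy = sum(rollout_correct) / len(rollout_correct)
    return binary_entropy(accuracy)


weights = {
    "always_right": sample_weight([True] * 8),
    "always_wrong": sample_weight([False] * 8),
    "mixed": sample_weight([True] * 4 + [False] * 4),
    "mostly_right": sample_weight([True] * 7 + [False]),
}
```

Under this assumption, prompts the model always solves or always fails contribute nothing, an even split gets the full weight, and a mostly-solved prompt falls in between.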

3. Long2Short Math RL

After accuracy-focused RL has expanded the model’s reasoning ability, a second RL pass shifts reward toward shorter correct trajectories. The goal is to maintain accuracy while reducing redundant reasoning tokens.
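A length-aware reward of this kind might look like the sketch below. The paper's actual reward shaping is not reproduced here: the function name `long2short_reward`, the linear length bonus, the `length_bonus` size, and the 65,536-token budget (taken to match the 64K context) are assumptions.

```python
def long2short_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int = 65536,
    length_bonus: float = 0.2,
) -> float:
    """Reward correct answers, with a small extra bonus for shorter trajectories.

    Incorrect answers get 0 regardless of length, so brevity can never
    outrank correctness. The bonus shrinks linearly as the token budget fills.
    """
    if not is_correct:
        return 0.0
    used_fraction = min(max(num_tokens, 0), max_tokens) / max_tokens
    return 1.0 + length_bonus * (1.0 - used_fraction)


short_correct = long2short_reward(True, 2_000)
long_correct = long2short_reward(True, 40_000)
short_wrong = long2short_reward(False, 500)
```

The design choice worth noting is that the bonus is bounded: any correct trajectory still scores at least 1.0, so the policy has no incentive to trade accuracy for brevity.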

4. Offline self-distillation

Selects verified-correct trajectories from the Math/Code/STEM RL checkpoints and applies SFT back to a unified student model. A length-normalized negative log-likelihood score (learning-potential score) prioritizes traces the student currently handles poorly, so distillation adds the most value where the model is weakest.
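The scoring step can be sketched as follows, assuming the student's per-token log-probabilities for each verified trace are already available. The function names, the example traces, and the `top_k` selection rule are hypothetical; only the length-normalized negative log-likelihood itself comes from the description above.

```python
def learning_potential(token_logprobs: list[float]) -> float:
    """Length-normalized NLL: mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("trace has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def prioritize(traces: dict[str, list[float]], top_k: int) -> list[str]:
    """Return the top_k trace ids the student currently finds least likely."""
    ranked = sorted(
        traces,
        key=lambda trace_id: learning_potential(traces[trace_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Student log-probs per token for three verified-correct traces (made-up values).
student_logprobs = {
    "easy": [-0.1, -0.2, -0.1],
    "hard": [-1.5, -2.0, -1.0, -1.5],
    "long_easy": [-0.1] * 20,
}
selected = prioritize(student_logprobs, top_k=2)
```

Normalizing by length matters here: `long_easy` has a larger total NLL than `easy` simply because it is longer, but its per-token score is lower, so it is not mistaken for a trace the student struggles with.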

5. Instruct RL

A final stage for user-facing behavior: format-sensitive prompts, long-context instructions, and general alignment examples. Explicit constraints use rule-based validators; open-ended prompts use a rubric-based reward model.
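For the explicit-constraint side, a rule-based validator can be as simple as a list of boolean checks. The sketch below is illustrative only: the specific validators, the `rule_reward` name, and the "fraction of checks passed" scoring are assumptions, since the source does not describe the validators used.

```python
import json
from typing import Callable

Validator = Callable[[str], bool]


def max_words(limit: int) -> Validator:
    """Build a check that the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def must_contain(keyword: str) -> Validator:
    """Build a case-insensitive check that `keyword` appears in the response."""
    return lambda text: keyword.lower() in text.lower()


def is_valid_json(text: str) -> bool:
    """Check that the whole response parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def rule_reward(response: str, validators: list[Validator]) -> float:
    """Assumed scoring: fraction of format constraints the response satisfies."""
    if not validators:
        raise ValueError("need at least one validator")
    passed = sum(1 for check in validators if check(response))
    return passed / len(validators)


checks = [max_words(10), must_contain("answer"), is_valid_json]
reward_ok = rule_reward('{"answer": 42}', checks)
reward_bad = rule_reward("The answer is forty-two, probably", checks)
```

Checks like these are deterministic and cheap, which is why they suit explicit format constraints, while open-ended prompts need a learned or rubric-based judge.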

Benchmark results

The scores below are taken from the paper and the model card. CLR-augmented scores are explained in the next section.

Benchmark            Score           With CLR
AIME25               91.4            96.7
AIME26               94.3            97.1
HMMT25               89.3            95.4
BruMO25              93.8            99.2
IMO-AnswerBench      76.4            80.6
LiveCodeBench v6     80.2 (Pass@1)   N/A
OJBench              38.6            N/A
GPQA-Diamond         N/A             72.9
IFEval               93.4            N/A
IFBench              74.5            N/A

LeetCode OOD (new problems from April 25 to May 31, 2026; Python one-shot): 123 of 128, 96.1%.

Evaluation conditions vary by benchmark: math Pass@1 is averaged over 64 independent generation samples per problem, IMO-AnswerBench uses 16, and coding uses 8. Scores for comparison models come from each model’s own release report or from public leaderboards; they were not re-evaluated under an identical harness.
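Averaged Pass@1 is simple arithmetic, sketched below under the assumption that each sample is graded as a plain correct/incorrect boolean. The function name and the toy results are hypothetical; the last line re-derives the LeetCode figure quoted above.

```python
def mean_pass_at_1(results: list[list[bool]]) -> float:
    """Average Pass@1 in percent.

    results[i][j] is True if the j-th independent sample for problem i was
    graded correct. Each problem's Pass@1 is its fraction of correct samples,
    and the benchmark score is the mean of those fractions.
    """
    if not results or any(len(samples) == 0 for samples in results):
        raise ValueError("every problem needs at least one sample")
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * sum(per_problem) / len(per_problem)


# Three toy problems with 64 samples each: always solved, solved 48/64, never solved.
toy_results = [
    [True] * 64,
    [True] * 48 + [False] * 16,
    [False] * 64,
]
toy_score = mean_pass_at_1(toy_results)

# LeetCode OOD: 123 of 128 problems solved in one shot.
leetcode_rate = round(100.0 * 123 / 128, 1)
```

Averaging over many samples reduces variance from sampling temperature, which matters on small benchmarks where a single problem shifts the score by several points.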

CLR: Claim-Level Reliability Assessment

CLR is a test-time scaling technique that does not modify model weights.

Standard self-verification checks a whole reasoning trace at once. CLR splits the trace into individual claims or logical anchors and assesses each one’s reliability separately. The paper reports this pushes Pass@1 higher on answer-verifiable math benchmarks: AIME26 goes from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, BruMO25 from 93.8 to 99.2, IMO-AnswerBench from 76.4 to 80.6.
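The difference between whole-trace and claim-level checking can be shown with a toy example. This sketch does not implement CLR as described in the paper: splitting on lines, the arithmetic-only scorer `toy_reliability`, and the 0.5 score for unparsed claims are all assumptions standing in for whatever claim extraction and reliability model the authors use.

```python
import re

# Matches simple claims such as "36 + 7 = 43".
CLAIM_RE = re.compile(r"^\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)\s*$")


def split_claims(trace: str) -> list[str]:
    """Assumed segmentation: one claim per non-empty line."""
    return [line.strip() for line in trace.splitlines() if line.strip()]


def toy_reliability(claim: str) -> float:
    """Stand-in scorer: 1.0 if the arithmetic holds, 0.0 if not, 0.5 if unparsed."""
    match = CLAIM_RE.match(claim)
    if match is None:
        return 0.5
    a, op, b, c = int(match[1]), match[2], int(match[3]), int(match[4])
    value = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if value == c else 0.0


def assess(trace: str) -> list[tuple[str, float]]:
    """Score every claim in the trace separately."""
    return [(claim, toy_reliability(claim)) for claim in split_claims(trace)]


def weakest_claim(trace: str) -> tuple[str, float]:
    """Return the claim with the lowest reliability score."""
    return min(assess(trace), key=lambda pair: pair[1])


trace = """12 * 3 = 36
36 + 7 = 44
So the answer is 44"""
scores = assess(trace)
culprit = weakest_claim(trace)
```

A whole-trace check could only report that the final answer is suspect. Scoring each claim separately points at the second line as the step that fails, which is the kind of localization a claim-level method relies on.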

Intended use and stated limitations

The model card is direct about scope.

Well-suited for:

  • LeetCode and competitive programming problems
  • Math olympiad, STEM reasoning, problems with verifiable answers
  • Local or low-cost inference where a strong reasoning core matters
  • Solving verifiable subproblems decomposed by a larger orchestrator

Not suitable for:

  • Tool calling / function calling
  • API orchestration
  • Autonomous coding agents
  • Broad research agents
  • General chat, open-domain factual QA

The model card explicitly states the model was not trained on tool-calling or agent-based programming data, and recommends against those use cases.

Caveats

Math evaluation mixes automated verifiers with LLM-as-judge, which means the choice of judge can affect reported numbers. Comparison scores are not from a unified re-evaluation harness. The strong results apply to verifiable reasoning domains; treating them as evidence that a 3B model is a general frontier replacement would be overreading the paper.
