VibeThinker-3B: A 3B Parameter Model That Matches Frontier Reasoning Systems — Here's What I Found After Running It
I was scrolling through Hacker News yesterday when a title caught my eye: a 3B parameter model beating Opus 4.5 on reasoning tasks. My first thought was clickbait. Then I read the paper — AIME26 score of 94.3, LiveCodeBench 80.2 Pass@1, 96.1% acceptance rate on unseen LeetCode contests.
That got my attention.
A 3B parameter model, performing at the same level as DeepSeek V3.2, GLM-5, and Gemini 3 Pro on mathematical and coding reasoning. These are models with hundreds of billions of parameters. How is that even possible?
The paper is on arXiv, the code is open source, and the weights are on HuggingFace. So I spent most of the day running it and digging into the technical details. Here's what I found.
Where VibeThinker Comes From
VibeThinker comes from WeiboAI — the AI research team at Weibo (the Chinese social media company). It's not a massive billion-dollar lab project. They took a different approach: small model, sophisticated training.
Their core technique is called the Spectrum-to-Signal Principle (SSP). It works in two phases:
-
SFT phase — build diversity: Instead of just training the model on standard answers, they feed it multiple solution paths for each problem. Correct ones, wrong ones, roundabout ones, direct ones. Two-stage distillation pulls these diverse approaches from a larger model, then curriculum learning gradually increases difficulty.
-
RL phase — amplify the right signal: They use something called MGPO (MaxEnt-Guided Policy Optimization) to find the correct paths among all those diverse solutions, while maintaining diversity. The maximum entropy constraint prevents mode collapse — where the model finds one trick that works and stops exploring alternatives.
For the 3B version, they added multi-domain RL (not just math, but also coding and STEM), offline self-distillation (the model generates its own training data), and a round of Instruct RL to boost instruction-following. They also introduced CLR (Claim-Level Reliability Assessment), a test-time scaling strategy that I'll get into later — it's genuinely clever.
The Benchmark Numbers
Let me just lay out the data:
Mathematical reasoning:
- AIME26: 94.3 (97.1 with CLR)
- HMMT25: 89.3 (95.4 with CLR)
- BruMO25: 99.2 with CLR
Coding reasoning:
- LiveCodeBench v6: 80.2 Pass@1
- Recent unseen LeetCode weekly/biweekly contests (Apr-May 2026): 96.1% acceptance
Instruction following:
- IFEval: 93.4
For context, DeepSeek V3.2 scores around 90-something on AIME26. GLM-5 is in the same range. A 3B model is matching these giants.
But here's the important caveat: the model excels at verifiable reasoning — math problems with definitive answers, coding problems with test cases, STEM questions with right-or-wrong answers. For open-ended knowledge tasks, long-context understanding, or creative writing, 3B models still can't compete with frontier systems. The paper itself acknowledges this: "for broad open-domain knowledge tasks, larger general-purpose models may still be more suitable."
Don't expect this to replace your daily Claude or GPT usage. Its strength is narrow, but within that narrow band, it's shockingly strong.
Running It Locally: Hands-On Experience
The weights are at WeiboAI/VibeThinker-3B on HuggingFace, with a ModelScope mirror. The official recommendation is vLLM==0.10.1 or SGLang>=0.4.9.post6. Plain transformers works too, but inference speed is much slower.
Quick test with transformers
| 1 | |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | |
| 25 | |
| 26 | |
The recommended inference parameters are important: temperature=0.6 or 1.0, top_p=0.95, top_k=-1 (or None in transformers). Don't use low temperature or greedy decoding — the training method needs sampling diversity to perform at its best.
Production setup with vLLM
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
This gives you a standard OpenAI-compatible API. I ran it on a 4090 — VRAM usage was around 8-9GB after loading. Works fine on any 24GB card (4090, 3090, A5000). On 16GB cards you'll need to reduce max-model-len.
CLR: The Clever Test-Time Scaling Trick
CLR (Claim-Level Reliability Assessment) is the part I found most interesting.
Normal small-model inference works like this: generate one answer, check if it's right, done. CLR does something different — it breaks the reasoning into individual "claims" and verifies each one separately.
Say a math problem's solution goes:
- Let x = 3 (claim 1)
- Substituting gives y = 5 (claim 2)
- Therefore xy = 15 (claim 3)
CLR verifies each step independently. If claim 2 is wrong, it pinpoints the exact step that failed instead of rejecting the whole answer.
The impact is significant: AIME26 jumps from 94.3 to 97.1, HMMT25 from 89.3 to 95.4. At competition math level, those are large improvements.
The tradeoff is compute — CLR essentially runs the problem multiple times for verification. In production, you'd need to weigh the accuracy gain against the extra inference cost.
The Cost Story: Why Small Models Matter
The VibeThinker 1.5B predecessor cost $7,800 to post-train. DeepSeek R1 cost $294K. MiniMax-M1 cost $535K. That's a 30-60x difference.
This cost gap shows up everywhere:
Training cost: Smaller models need far less compute for training, fine-tuning, and RL. Same hardware, way more experiments.
Inference cost: A 3B model runs on a single consumer GPU. DeepSeek R1 at 671B parameters needs multiple A100s or H100s. The hardware barrier is an order of magnitude lower.
Deployment cost: For local deployment and edge inference, a 3B model fits in a laptop GPU. 671B? Forget it.
Iteration speed: Small model experiments run fast. Large model hyperparameter tuning might take days.
The tradeoff is capability scope. VibeThinker can't write your emails or summarize articles. It's purpose-built for math and coding reasoning.
How It Relates to Qwen2.5-Coder-3B
VibeThinker-3B's base is Qwen2.5-Coder-3B. It's not trained from scratch — it's post-trained on top of an existing code model.
This is a smart choice. Qwen2.5-Coder-3B already has decent code understanding. Using it as a foundation and applying SSP training specifically for reasoning gives you a model that starts with solid fundamentals.
Architecturally, VibeThinker-3B is essentially identical to Qwen2.5-Coder-3B — same transformer architecture, same tokenizer. All the differences are in the post-training data and methodology.
If you're already using the Qwen2.5 family, switching to VibeThinker is nearly frictionless. Same API, same inference framework, similar prompt format.
When to Use It (and When Not To)
Good fit:
- Math competition tutoring: AIME/HMMT-level problems. The model's problem-solving is genuinely strong.
- Algorithm problem assistance: 96% pass rate on unseen LeetCode contests. Great for practice assistance or automated code logic checking.
- STEM problem solving: Physics, chemistry, engineering computation — anything verifiable.
- Embedded reasoning engine: Local/edge deployment where you need precise reasoning with minimal hardware.
- Complement to larger models: Main model handles general tasks, VibeThinker handles the math-heavy subtasks.
Bad fit:
- General conversation: Chat, Q&A, casual talk — use Claude or GPT.
- Knowledge-intensive tasks: World knowledge, factual Q&A — 3B doesn't store enough.
- Long text processing: Context understanding and document analysis aren't its strengths.
- Creative writing: Just no.
The 1.5B Version: An Even More Extreme Option
There's also a 1.5B version from November 2025. That one scored 80.3 on AIME24, beating the original DeepSeek R1 (671B parameters) which scored 79.8.
400x fewer parameters, slightly higher score. It's a striking comparison, even though later DeepSeek R1 versions have improved.
The 1.5B training cost was $7,800. The entire post-training process cost under eight thousand dollars. DeepSeek R1's post-training was $294K. That efficiency is hard to ignore.
The 1.5B is even more narrowly focused than the 3B. For pure math competition or pure algorithm problems, it works and runs faster on cheaper hardware. For anything slightly more general, go with the 3B.
Things to Watch Out For
The model doesn't "talk like a human." VibeThinker is optimized for verifiable reasoning. Its training signal comes from mathematical and coding correctness feedback, not human preference ratings. The output style is dry — direct problem-solving, no small talk. If you want a model that can solve problems AND chat, stick with the larger models.
Inference parameters matter. Official recommendation: temperature=0.6 or 1.0, top_p=0.95, top_k=-1. Don't use default temperature=0 or very low values. The training method needs sampling diversity.
The 40960 max_new_tokens isn't arbitrary. Math reasoning can produce very long solution chains. If you're tight on VRAM, you can reduce it, but be aware that some solutions genuinely need that much space.
Open source, but not a silver bullet. MIT license, weights are free to use. But don't expect it to solve everything. It's a specialized reasoning model. Use it for what it's good at and you'll be impressed. Use it for what it's not designed for and you'll be disappointed.
What the Community Is Saying
VibeThinker-3B hit 216 points on Hacker News with 85 comments. Some interesting discussion points:
Benchmark skepticism: "AIME only has a few problems each year, overfitting is too easy." Fair point. AIME problem sets are limited and the model might have seen similar types during training. But LiveCodeBench and LeetCode contests use continuously updated problems — overfitting is much less likely there. The 96.1% LeetCode acceptance on Apr-May 2026 unseen problems is fairly convincing.
The compression hypothesis: Someone pointed out that if a 3B model can match frontier reasoning, maybe "reasoning ability" is compressible. The model doesn't need to store all of human knowledge — it just needs the reasoning core. This aligns with the paper's "Parametric Compression-Coverage Hypothesis" — verifiable reasoning can be compressed into compact reasoning cores, while open-domain knowledge needs broad parameter coverage.
Agent use case: Can this serve as a reasoning engine for AI agents? Theoretically yes. Many agent workflows need precise reasoning — financial calculations, logical judgments, code verification. A 3B model dedicated to these tasks is cheaper and faster than calling a large model.
But: Competition math problems have relatively fixed patterns. Real-world software development is very different. The model's strength is "precise solving of well-defined problems," not "creative solving of open-ended problems."
Getting the Model
- HuggingFace:
WeiboAI/VibeThinker-3B - ModelScope:
WeiboAI/VibeThinker-3B - GitHub:
WeiboAI/VibeThinker(evaluation code and sample data)
MIT license, fully open for commercial use.
What's Next
I'm planning to integrate VibeThinker-3B into my Claude Code workflow — specifically for math computation verification and algorithm logic checking. Claude handles the coding and requirement understanding, VibeThinker handles the mathematical verification. Best of both worlds.
I'll write a follow-up once I have results. Drop a comment if you've tried it or have questions.
- VibeThinker-3B paper: arXiv:2606.16140*
- Weights: huggingface.co/WeiboAI/VibeThinker-3B*
- GitHub: github.com/WeiboAI/VibeThinker*