"What is the estimated token throughput for DeepSeek V3.2 on a B200 GPU?"
## Estimated Token Throughput for DeepSeek V3.2 on an NVIDIA B200

### What the sources actually say
| Source | Reported figure | Context / caveats |
|---|---|---|
| OpenRouter / DeepInfra (API stats) | ≈ 28 tokens / second (tok/s) | Throughput observed for the DeepSeek V3.2‑Exp model on DeepInfra’s deployment. The hardware stack includes B200/GB200 GPUs, but the metric is not broken out per GPU and may reflect a multi‑GPU service configuration [10]. |
| vLLM blog (DeepSeek‑V3.2‑Exp in vLLM) | ~200 tok/s for V3.1 on a B200; V3.2 runs at ~60 % of that speed | The blog measures V3.1’s throughput on a B200 and notes that V3.2’s throughput is roughly 60 % of V3.1’s. Applying the 60 % factor yields an estimated ~120 tokens / second for V3.2 on a single B200 [3]. |
| SGLang issue tracker | V3.2 achieves ~60 % of V3.1’s throughput (no absolute numbers) | Confirms the relative performance drop but does not give a concrete tok/s value [1]. |
| General DeepSeek V3 (earlier model) | ~60 tokens / second (hardware unspecified) | Provides a rough baseline for earlier versions, but cannot be mapped directly to V3.2 or to B200 hardware [8]. |
### How to interpret the numbers

**Directly measured figure (28 tok/s)**
- This is the only explicit throughput number that mentions DeepSeek V3.2‑Exp and a B200‑compatible platform.
- Because the OpenRouter/DeepInfra service may be aggregating several GPUs or using a specific batch size, the 28 tok/s should be treated as a conservative, system‑level figure rather than the peak per‑GPU capability. The short sketch below shows what this rate means in wall‑clock terms.
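
As a quick sanity check on what that service‑level rate means for a user, here is a minimal sketch; the response length is an arbitrary example, not a measured value.

```python
# Rough wall-clock latency implied by the observed ~28 tok/s service-level figure.
# The response length below is an arbitrary example, not a measured value.

observed_tok_s = 28.0      # DeepInfra/OpenRouter figure for DeepSeek V3.2-Exp
response_tokens = 1_000    # hypothetical response length

decode_time_s = response_tokens / observed_tok_s
print(f"Time to stream {response_tokens} tokens: ~{decode_time_s:.0f} s")
# -> ~36 s, ignoring prompt processing (prefill) and any queueing time
```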
**Derived estimate (≈ 120 tok/s)**
- The vLLM blog gives a solid benchmark for V3.1 on a B200 (≈ 200 tok/s).
- The same source (and the SGLang tracker) states that V3.2 runs at about 60 % of V3.1’s speed.
- Multiplying 200 tok/s × 0.60 yields an estimated 120 tokens / second for a single B200 when running the experimental V3.2 model with its fine‑grained sparse attention kernels.
- This estimate assumes batch size, prompt length, and precision (e.g., FP8) comparable to the V3.1 measurement; the arithmetic is spelled out in the sketch below.
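
The derivation is simple enough to write out; only the 200 tok/s baseline and the 0.60 factor come from the cited sources, the rest is arithmetic.

```python
# Back-of-the-envelope estimate for DeepSeek V3.2-Exp decode throughput on a B200.
# Baseline and relative factor are from the vLLM blog / SGLang discussion cited above.

v31_tok_s_on_b200 = 200.0   # measured V3.1 throughput on a B200 (vLLM blog)
v32_relative_speed = 0.60   # V3.2 runs at ~60% of V3.1's speed (vLLM blog, SGLang issue)

v32_estimate = v31_tok_s_on_b200 * v32_relative_speed
print(f"Estimated V3.2 throughput on a B200: ~{v32_estimate:.0f} tok/s")
# -> ~120 tok/s, assuming comparable batch size, prompt length, and FP8 precision
```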
### Why the numbers differ
- Sparse attention overhead: V3.2’s new DSA (DeepSeek Sparse Attention) reduces memory usage but introduces extra kernel logic, which can lower raw throughput.
- Deployment variations: Cloud services (DeepInfra) may prioritize latency, cost, or multi‑tenant fairness, leading to lower observed tok/s.
- Benchmark conditions: The vLLM measurement was performed in a controlled single‑GPU setting, while the OpenRouter metric reflects a production API environment.
### Practical takeaway
- If you need a concrete, publicly available benchmark, the most reliable number is ≈ 28 tokens / second, as reported by DeepInfra/OpenRouter [10].
- If you are planning your own B200 deployment and can replicate the vLLM testing conditions (single GPU, optimal batch size, FP8 precision), you can expect up to ~120 tokens / second, acknowledging that this is an extrapolation from the V3.1 baseline [3][1]. The sizing sketch below turns both figures into rough capacity numbers.
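
For sizing purposes, both figures can be converted into rough daily‑volume numbers. The sketch below uses the two throughput values discussed above; the demand target is a placeholder to replace with your own, and each figure is treated as the rate of one serving instance.

```python
# Rough capacity-planning sketch. Throughput values are the figures discussed above;
# the daily demand target is a placeholder, and each rate is treated as one serving
# instance (the 28 tok/s figure is service-level, not necessarily one GPU).

seconds_per_day = 24 * 3600
target_tokens_per_day = 100_000_000   # hypothetical daily generation volume

for label, tok_s in [("conservative (28 tok/s)", 28.0), ("optimistic (~120 tok/s)", 120.0)]:
    tokens_per_day = tok_s * seconds_per_day
    instances = target_tokens_per_day / tokens_per_day
    print(f"{label}: {tokens_per_day / 1e6:.1f}M tok/day per instance "
          f"-> ~{instances:.0f} instances for {target_tokens_per_day / 1e6:.0f}M tok/day")
```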
### Bottom‑line recommendation
- Start with the 28 tok/s figure for capacity planning in a cloud‑service context.
- Benchmark your own workload if you have direct access to a B200; the vLLM‑derived estimate (~120 tok/s) gives a realistic upper bound under ideal conditions. A minimal measurement sketch follows.
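
If you do have B200 hardware, a measurement along these lines is worth more than any extrapolation. This is a minimal sketch using vLLM’s offline API; the model ID, parallelism setting, and toy prompt set are assumptions to adjust for your environment, and the full model will need appropriate multi‑GPU settings rather than a single device.

```python
# Minimal aggregate-throughput benchmark sketch using vLLM's offline API.
# Model ID, tensor_parallel_size, and the prompt set are assumptions; the full
# DeepSeek V3.2 checkpoint requires multi-GPU parallelism, not a single card.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed model ID; substitute your own
    tensor_parallel_size=8,                  # adjust to your B200 node layout
)

prompts = ["Explain the difference between prefill and decode."] * 64  # toy workload
params = SamplingParams(temperature=0.7, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s aggregate")
```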