"What is the estimated token throughput for DeepSeek V3.2 on a B200 GPU?"

Estimated Token Throughput for DeepSeek V3.2 on an NVIDIA B200

What the sources actually say

| Source | Reported figure | Context / caveats |
| --- | --- | --- |
| OpenRouter / DeepInfra (API stats) | ≈ 28 tokens/second (tok/s) | Throughput observed for the DeepSeek V3.2‑Exp model on DeepInfra's deployment. The hardware stack includes B200/GB200 GPUs, but the metric is not broken out per GPU and may reflect a multi‑GPU service configuration [10]. |
| vLLM blog (DeepSeek‑V3.2‑Exp in vLLM) | ~200 tok/s for V3.1 on a B200; V3.2 runs at ~60 % of that speed | The blog explicitly measures V3.1's throughput on a B200 and notes that V3.2's throughput is roughly 60 % of V3.1's. Applying the 60 % factor yields an estimated ~120 tok/s for V3.2 on a single B200 [3]. |
| SGLang issue tracker | V3.2 achieves ~60 % of V3.1's throughput (no absolute numbers) | Confirms the relative performance drop but does not give a concrete tok/s value [1]. |
| General DeepSeek V3 (earlier model) | ~60 tokens/second (hardware unspecified) | Provides a rough baseline for earlier versions, but cannot be directly mapped to V3.2 or to B200 hardware [8]. |

How to interpret the numbers

  1. Directly measured figure (28 tok/s)

    • This is the only explicit throughput number that mentions DeepSeek V3.2‑Exp and a B200‑compatible platform.
    • Because the OpenRouter/DeepInfra service may be aggregating several GPUs or using a specific batch size, the 28 tok/s should be treated as a conservative, system‑level figure rather than the peak per‑GPU capability.
  2. Derived estimate (≈ 120 tok/s)

    • The vLLM blog gives a solid benchmark for V3.1 on a B200 (≈ 200 tok/s).
    • The same source (and the SGLang tracker) states that V3.2 runs at about 60 % of V3.1’s speed.
    • Multiplying 200 tok/s × 0.60 yields an estimated 120 tokens / second for a single B200 running the experimental V3.2 model with its fine‑grained sparse‑attention kernels (a worked version of this calculation follows the list).
    • This estimate assumes comparable batch size, prompt length, and precision (e.g., FP8) to the V3.1 measurement.
  3. Why the numbers differ

    • Sparse attention overhead: V3.2’s new DSA (DeepSeek Sparse Attention) reduces memory usage but introduces extra kernel logic, which can lower raw throughput.
    • Deployment variations: Cloud services (DeepInfra) may prioritize latency, cost, or multi‑tenant fairness, leading to lower observed tok/s.
    • Benchmark conditions: The vLLM measurement was performed in a controlled single‑GPU setting, while the OpenRouter metric reflects a production API environment.
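
To make the derivation explicit, the short sketch below reproduces the arithmetic using only the figures cited above (the 200 tok/s V3.1 baseline, the ~60 % relative factor, and the 28 tok/s service‑level number); it introduces no new measurements.

```python
# Worked version of the throughput derivation; all constants come from the cited sources.

V31_B200_TOKS_PER_S = 200.0   # vLLM blog: DeepSeek V3.1 on a single B200
V32_RELATIVE_SPEED = 0.60     # vLLM blog / SGLang tracker: V3.2 runs at ~60 % of V3.1
OBSERVED_SERVICE_TOKS = 28.0  # OpenRouter / DeepInfra production metric for V3.2-Exp

# Derived single-GPU estimate for V3.2 on a B200.
v32_estimate = V31_B200_TOKS_PER_S * V32_RELATIVE_SPEED   # = 120 tok/s

# Gap between the controlled-benchmark extrapolation and the production API figure.
gap_factor = v32_estimate / OBSERVED_SERVICE_TOKS          # ≈ 4.3x

print(f"Derived V3.2-on-B200 estimate: {v32_estimate:.0f} tok/s")
print(f"Observed service-level figure: {OBSERVED_SERVICE_TOKS:.0f} tok/s "
      f"(~{gap_factor:.1f}x lower than the extrapolation)")
```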

Practical takeaway

  • If you need a concrete, publicly available benchmark, the most reliable number is ≈ 28 tokens / second as reported by DeepInfra/OpenRouter [10]; a simple capacity‑planning sketch based on this figure follows below.
  • If you are planning your own B200 deployment and can replicate the vLLM testing conditions (single GPU, optimal batch size, FP8 precision), you can expect up to ~120 tokens / second, acknowledging that this is an extrapolation from the V3.1 baseline [3][1].
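
As a rough illustration of capacity planning with the service‑level figure, the sketch below estimates how many concurrent ≈ 28 tok/s generation streams a target workload needs. The workload parameters (requests per minute, average output length) are hypothetical placeholders, not values from the sources.

```python
import math

# Hypothetical workload assumptions -- replace with your own numbers.
REQUESTS_PER_MINUTE = 120   # target request arrival rate (assumed)
AVG_OUTPUT_TOKENS = 500     # average generated tokens per request (assumed)

# Service-level throughput per stream, from the DeepInfra/OpenRouter figure [10].
TOKENS_PER_SECOND_PER_STREAM = 28.0

# Aggregate token demand the service must sustain.
tokens_per_second_needed = REQUESTS_PER_MINUTE * AVG_OUTPUT_TOKENS / 60.0

# Number of concurrent 28 tok/s streams required to keep up with that demand.
streams_needed = math.ceil(tokens_per_second_needed / TOKENS_PER_SECOND_PER_STREAM)

print(f"Sustained demand: {tokens_per_second_needed:.0f} tok/s")
print(f"Concurrent streams at 28 tok/s each: {streams_needed}")
```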

Bottom‑line recommendation

  • Start with the 28 tok/s figure for capacity planning in a cloud‑service context.
  • Benchmark your own workload if you have direct access to B200 hardware; the vLLM‑derived estimate (~120 tok/s) gives a realistic upper bound under ideal conditions. A minimal measurement sketch follows.
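
The sketch below shows one way to measure decode throughput with vLLM's offline Python API. It is a minimal illustration of the measurement pattern only: the model identifier, parallelism setting, and prompts are assumptions, and DeepSeek V3.2 will not fit on a single B200, so a real run requires a multi‑GPU node and a vLLM build with the V3.2 sparse‑attention kernels.

```python
import time

from vllm import LLM, SamplingParams  # vLLM offline inference API

# Illustrative engine setup -- the model name and parallelism are placeholders.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed HF identifier
    tensor_parallel_size=8,                  # adjust to your hardware
)

# A small batch of prompts representative of your workload (assumed content).
prompts = ["Summarize the benefits of sparse attention."] * 16
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```

Comparing the resulting number against the ≈ 28 tok/s service figure and the ~120 tok/s extrapolation will show where your own deployment sits between production‑API and ideal‑benchmark conditions.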
