"What is the estimated token throughput for DeepSeek V3.2 on a B200 GPU?"

Estimated Token Throughput for DeepSeek V3.2 on an NVIDIA B200

What the sources actually say

| Source | Reported figure | Context / caveats |
| --- | --- | --- |
| OpenRouter / DeepInfra (API stats) | ≈ 28 tokens/second (tok/s) | Throughput observed for the DeepSeek V3.2‑Exp model on DeepInfra's deployment. The hardware stack includes B200/GB200 GPUs, but the metric is not broken out per GPU and may reflect a multi‑GPU service configuration [10]. |
| vLLM blog (DeepSeek‑V3.2‑Exp in vLLM) | ~200 tok/s for V3.1 on a B200; V3.2 runs at ~60 % of that speed | The blog explicitly measures V3.1's throughput on a B200 and notes that V3.2's throughput is roughly 60 % of V3.1's. Applying the 60 % factor yields an estimated ~120 tok/s for V3.2 on a single B200 [3]. |
| SGLang issue tracker | V3.2 achieves ~60 % of V3.1's throughput (no absolute numbers) | Confirms the relative performance drop but does not give a concrete tok/s value [1]. |
| General DeepSeek V3 (earlier model) | ~60 tokens/second (hardware unspecified) | Provides a rough baseline for earlier versions, but cannot be directly mapped to V3.2 or to B200 hardware [8]. |

How to interpret the numbers

  1. Directly measured figure (28 tok/s)

    • This is the only explicit throughput number that mentions DeepSeek V3.2‑Exp and a B200‑compatible platform.
    • Because the OpenRouter/DeepInfra service may be aggregating several GPUs or using a specific batch size, the 28 tok/s should be treated as a conservative, system‑level figure rather than the peak per‑GPU capability.
  2. Derived estimate (≈ 120 tok/s)

    • The vLLM blog gives a solid benchmark for V3.1 on a B200 (≈ 200 tok/s).
    • The same source (and the SGLang tracker) states that V3.2 runs at about 60 % of V3.1’s speed.
    • Multiplying 200 tok/s × 0.60 yields an estimated 120 tokens / second for a single B200 running the experimental V3.2 model with its fine‑grained sparse‑attention kernels (a worked version of this calculation follows the list).
    • This estimate assumes comparable batch size, prompt length, and precision (e.g., FP8) to the V3.1 measurement.
  3. Why the numbers differ

    • Sparse attention overhead: V3.2’s new DSA (DeepSeek Sparse Attention) reduces memory usage but introduces extra kernel logic, which can lower raw throughput.
    • Deployment variations: Cloud services (DeepInfra) may prioritize latency, cost, or multi‑tenant fairness, leading to lower observed tok/s.
    • Benchmark conditions: The vLLM measurement was performed in a controlled single‑GPU setting, while the OpenRouter metric reflects a production API environment.
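
To make the derivation explicit, the short sketch below reproduces the arithmetic using only the figures cited above (the 200 tok/s V3.1 baseline, the ~60 % relative factor, and the 28 tok/s service‑level number); it introduces no new measurements.

```python
# Worked version of the throughput derivation; all constants come from the cited sources.

V31_B200_TOKS_PER_S = 200.0   # vLLM blog: DeepSeek V3.1 on a single B200
V32_RELATIVE_SPEED = 0.60     # vLLM blog / SGLang tracker: V3.2 runs at ~60 % of V3.1
OBSERVED_SERVICE_TOKS = 28.0  # OpenRouter / DeepInfra production metric for V3.2-Exp

# Derived single-GPU estimate for V3.2 on a B200.
v32_estimate = V31_B200_TOKS_PER_S * V32_RELATIVE_SPEED   # = 120 tok/s

# Gap between the controlled-benchmark extrapolation and the production API figure.
gap_factor = v32_estimate / OBSERVED_SERVICE_TOKS          # ≈ 4.3x

print(f"Derived V3.2-on-B200 estimate: {v32_estimate:.0f} tok/s")
print(f"Observed service-level figure: {OBSERVED_SERVICE_TOKS:.0f} tok/s "
      f"(~{gap_factor:.1f}x lower than the extrapolation)")
```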

Practical takeaway

  • If you need a concrete, publicly available benchmark, the most reliable number is ≈ 28 tokens / second as reported by DeepInfra/OpenRouter [10]; a simple capacity‑planning sketch based on this figure follows below.
  • If you are planning your own B200 deployment and can replicate the vLLM testing conditions (single GPU, optimal batch size, FP8 precision), you can expect up to ~120 tokens / second, acknowledging that this is an extrapolation from the V3.1 baseline [3][1].
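
As a rough illustration of capacity planning with the service‑level figure, the sketch below estimates how many concurrent ≈ 28 tok/s generation streams a target workload needs. The workload parameters (requests per minute, average output length) are hypothetical placeholders, not values from the sources.

```python
import math

# Hypothetical workload assumptions -- replace with your own numbers.
REQUESTS_PER_MINUTE = 120   # target request arrival rate (assumed)
AVG_OUTPUT_TOKENS = 500     # average generated tokens per request (assumed)

# Service-level throughput per stream, from the DeepInfra/OpenRouter figure [10].
TOKENS_PER_SECOND_PER_STREAM = 28.0

# Aggregate token demand the service must sustain.
tokens_per_second_needed = REQUESTS_PER_MINUTE * AVG_OUTPUT_TOKENS / 60.0

# Number of concurrent 28 tok/s streams required to keep up with that demand.
streams_needed = math.ceil(tokens_per_second_needed / TOKENS_PER_SECOND_PER_STREAM)

print(f"Sustained demand: {tokens_per_second_needed:.0f} tok/s")
print(f"Concurrent streams at 28 tok/s each: {streams_needed}")
```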

Bottom‑line recommendation

  • Start with the 28 tok/s figure for capacity planning in a cloud‑service context.
  • Benchmark your own workload if you have direct access to B200 hardware; the vLLM‑derived estimate (~120 tok/s) gives a realistic upper bound under ideal conditions. A minimal measurement sketch follows.
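
The sketch below shows one way to measure decode throughput with vLLM's offline Python API. It is a minimal illustration of the measurement pattern only: the model identifier, parallelism setting, and prompts are assumptions, and DeepSeek V3.2 will not fit on a single B200, so a real run requires a multi‑GPU node and a vLLM build with the V3.2 sparse‑attention kernels.

```python
import time

from vllm import LLM, SamplingParams  # vLLM offline inference API

# Illustrative engine setup -- the model name and parallelism are placeholders.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed HF identifier
    tensor_parallel_size=8,                  # adjust to your hardware
)

# A small batch of prompts representative of your workload (assumed content).
prompts = ["Summarize the benefits of sparse attention."] * 16
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```

Comparing the resulting number against the ≈ 28 tok/s service figure and the ~120 tok/s extrapolation will show where your own deployment sits between production‑API and ideal‑benchmark conditions.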
