MiMo-v2.5-Pro-UltraSpeed: What a 1T Model Doing 1000 Tokens Per Second Actually Means

Michael Sintim-Koree · June 2026

The benchmark getting passed around for MiMo-v2.5-Pro-UltraSpeed is 1000 tokens per second at 1 trillion parameters. At first glance, that number reads like a cherry-picked single-query figure on specialized hardware running some narrow task. After digging into what's actually claimed and how the architecture achieves it, I think the number is real in the sense that matters, and the reasoning system underneath it is worth understanding separately from the throughput headline.

Large models are slow. Fast inference usually means small models or aggressive quantization eating into quality. MiMo-v2.5-Pro-UltraSpeed is a different design choice, and that choice has real consequences for how you'd deploy it.


What MiMo is and why the architecture matters

MiMo is Xiaomi's reasoning model line, positioned as a frontier reasoning system with a training approach focused on mathematical and logical reasoning from the ground up rather than bolted on afterward. The v2.5-Pro variant is the full-scale release; UltraSpeed is the inference-optimized deployment configuration. The model is a Mixture-of-Experts architecture, which is the first thing that explains the throughput number.

A trillion-parameter MoE model does not run a trillion parameters per token. That's the point. In a dense model, every parameter participates in every forward pass. In a sparse MoE model, only the activated experts do. MiMo-v2.5-Pro uses a sparse gating mechanism where each token routes through a small subset of the experts in each MoE layer, leaving the rest idle for that token. The total parameter count is 1.02T; the active parameter count per forward pass is 42 billion, a figure confirmed across Xiaomi's own documentation, Artificial Analysis, and multiple independent reviews. That active-parameter figure determines compute cost per token. The 1T headline does not.
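The gap between the two numbers is easy to put in concrete terms. The sketch below uses the 1.02T and 42B figures from above; the "about 2 FLOPs per active parameter per token" rule is a common back-of-envelope approximation for a forward pass, not something Xiaomi has published, and it ignores attention cost over long contexts.

```python
# Parameter counts are the figures quoted in the article. The 2-FLOPs-per-parameter
# rule is a generic approximation (an assumption), not a vendor number.
TOTAL_PARAMS = 1.02e12   # total parameters (1.02T)
ACTIVE_PARAMS = 42e9     # parameters active per token (42B)


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that participate in one token."""
    return active / total


def approx_flops_per_token(params: float) -> float:
    """Rough forward-pass cost: one multiply and one add per parameter used."""
    return 2.0 * params


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active fraction: {frac:.1%}")
print(f"approx FLOPs/token, sparse: {approx_flops_per_token(ACTIVE_PARAMS):.2e}")
print(f"approx FLOPs/token, if dense: {approx_flops_per_token(TOTAL_PARAMS):.2e}")
```

Under these assumptions roughly 4% of the weights do work on any given token, which is why per-token compute looks like that of a mid-sized model rather than a trillion-parameter one.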

This is the same design pattern behind DeepSeek-V3 and the original Switch Transformer work. The architectural idea is not new. What MiMo is claiming is execution quality on top of it: reasoning capability at the level of much larger dense models, at inference costs closer to models a fraction of the stated size.


Where 1000 tokens per second actually comes from

MoE models have a known inference challenge: expert routing introduces communication overhead in distributed deployments because different tokens get routed to different experts, which may sit on different devices. If your expert parallelism across multiple GPUs or nodes is inefficient, you spend more time on routing synchronization than on actual matrix multiplications. A model that's theoretically fast becomes practically slow because you're waiting on the network between expert activations.

UltraSpeed addresses this through optimized expert routing and batching: grouping tokens by their expert assignments before dispatching compute, reducing tail latency from stragglers, and minimizing the all-to-all communication that moves tokens between devices. The 1000 tokens per second figure is a throughput number measured across batched requests, not a single-query latency measurement. At scale, with a well-filled batch, the active experts stay busy and the communication overhead gets amortized across many tokens. That's the configuration where the number is real. (For context, independent benchmarks on the standard API show output speeds around 60 tokens per second; the 1000 token/s figure is specific to the UltraSpeed cluster configuration with high batch utilization.)
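The "group tokens by expert before dispatching" idea can be shown in a few lines. This is a toy illustration in plain Python, not UltraSpeed's implementation: the function names, the top-k rule, and the router scores are all hypothetical, and a real system would do this on-device with fused kernels.

```python
from collections import defaultdict


def top_k_experts(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest router scores for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def group_by_expert(router_scores: list[list[float]], k: int) -> dict[int, list[int]]:
    """Map each expert id to the indices of the tokens routed to it."""
    groups: dict[int, list[int]] = defaultdict(list)
    for token_idx, scores in enumerate(router_scores):
        for expert in top_k_experts(scores, k):
            groups[expert].append(token_idx)
    return dict(groups)


# Toy router output: 4 tokens, 3 experts (made-up scores).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
batches = group_by_expert(scores, k=1)
for expert, tokens in sorted(batches.items()):
    print(f"expert {expert}: tokens {tokens}")
```

Each expert then runs one matrix multiplication over its whole group instead of one per token. The larger the batch, the fuller each group, which is the amortization the throughput figure depends on.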

A key throughput mechanism is Multi-Token Prediction (MTP). MiMo-v2.5-Pro is equipped with three lightweight MTP modules using dense FFNs, which enable the model to predict multiple tokens per forward pass and effectively triple output speed during inference. This is the documented mechanism behind the UltraSpeed throughput numbers: not classic speculative decoding with a separate draft model, but a native architectural feature of the model itself. Whether additional speculative decoding optimizations are layered on top in the UltraSpeed serving configuration hasn't been fully disclosed.
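A rough model shows how three MTP modules could land near "triple". The sketch assumes the extra predictions are checked left to right and that the first rejected one discards everything after it, as in self-speculative decoding. The article does not describe the verification scheme, so that is an assumption, and so are the per-token acceptance probability and its independence across positions. The cost of the MTP modules themselves is ignored.

```python
def expected_tokens_per_step(num_mtp_modules: int, accept_prob: float) -> float:
    """Expected tokens emitted per forward pass under left-to-right acceptance.

    The main head's token is always kept. The i-th extra token counts only if
    it and every extra token before it were accepted, which happens with
    probability accept_prob ** i when acceptances are independent.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    total = 1.0
    survive = 1.0
    for _ in range(num_mtp_modules):
        survive *= accept_prob
        total += survive
    return total


for p in (0.5, 0.8, 0.95):  # hypothetical acceptance probabilities
    print(f"p={p}: {expected_tokens_per_step(3, p):.2f} tokens per step")
```

With a hypothetical acceptance probability of 0.8 the sum 1 + 0.8 + 0.64 + 0.512 comes to about 2.95 tokens per step. The ceiling with three modules is 4, so a "triple" claim implies high but not perfect acceptance under this model.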


The reasoning capability claim

Speed at this scale is only interesting if the model is actually good. A trillion-parameter model doing 1000 tokens per second that produces mediocre reasoning output is just expensive fast text generation.

MiMo's training approach centers on prioritizing reasoning-focused data throughout pretraining and post-training, rather than treating reasoning as a fine-tuning concern. Pre-training runs on 27 trillion tokens using FP8 mixed precision, with post-training following a three-stage paradigm: supervised fine-tuning, domain-specialized training using separate teacher models optimized via domain-specific RL across math, safety, and agentic tool-use, and then Multi-Teacher On-Policy Distillation (MOPD) to merge those capabilities into a unified model. The v2.5-Pro benchmarks show competitive performance on agentic coding tasks (notably 57.2% on SWE-bench Pro and strong scores on ClawEval), and the model ranks #8 of 144 tracked models on the Artificial Analysis Intelligence Index v4.0 with a score of 54. The primary comparisons in Xiaomi's own benchmarks are against Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, not o1. The training data curation decisions and the multi-teacher distillation approach appear to be doing significant work here.

The honest read of the benchmark numbers: MiMo-v2.5-Pro is competitive with frontier reasoning models on structured agentic and coding tasks. It doesn't obviously beat them across the board. The claim is frontier-competitive quality at significantly higher inference throughput and dramatically lower cost (roughly 40–60% fewer tokens per trajectory than comparable models on agentic benchmarks), not state-of-the-art on every benchmark.


What this changes for inference at scale

The practical bottleneck for deploying large reasoning models in production has been cost per token, not capability per se. OpenAI's o1 and o3 pricing reflects real compute costs: complex reasoning tasks that require long chains of thought get expensive fast when you're paying per output token and the model is generating thousands of tokens of internal reasoning before producing a response.

A model that sustains 1000 tokens per second on reasoning-heavy workloads shifts the economics on a few specific use cases:

  • Batch processing large volumes of documents requiring structured reasoning — legal, financial analysis, technical review — where throughput is the constraint and per-query latency doesn't matter.
  • Agentic workflows running long reasoning chains as intermediate steps. Per-step cost drops when each step is cheaper to generate, which means you can afford more steps.
  • Applications where reasoning depth is currently capped because extended chain-of-thought is too expensive at production volume. That constraint loosens.
  • Real-time code review or test generation pipelines that need frontier reasoning quality but can't absorb frontier-model latency budgets.

What it doesn't change: interactive latency-sensitive applications where time-to-first-token matters more than sustained throughput, single-query workloads where batch efficiency doesn't apply, and anything outside structured reasoning where MiMo's benchmark advantages narrow or disappear.


The infrastructure reality

One trillion parameters, even sparse, take serious hardware to serve at 1000 tokens per second. The UltraSpeed mode was developed in collaboration with TileRT and is specifically described as running on commodity GPUs through extreme model-system codesign, though "commodity" here still means a multi-GPU cluster with high-bandwidth interconnects, not consumer hardware. The 1000 tokens per second figure is a cluster throughput number, not a workstation benchmark.
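A lower bound on the hardware comes from weight memory alone. Sparsity cuts compute per token, but every expert still has to be resident somewhere. The sketch assumes 1 byte per parameter (FP8 serving, which is my assumption; the article only says FP8 was used in pre-training) and a hypothetical 80 GB accelerator. It ignores KV cache, activations, and replication, so the real number is higher.

```python
import math

TOTAL_PARAMS = 1.02e12  # total parameter count quoted in the article


def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return params * bytes_per_param


def min_devices_for_weights(params: float, bytes_per_param: float,
                            device_mem_gb: float) -> int:
    """Smallest device count whose combined memory fits the weights alone."""
    return math.ceil(weight_bytes(params, bytes_per_param) / (device_mem_gb * 1e9))


# 80 GB per device is a hypothetical figure chosen for illustration.
for label, width in (("FP8", 1.0), ("BF16", 2.0)):
    tb = weight_bytes(TOTAL_PARAMS, width) / 1e12
    n = min_devices_for_weights(TOTAL_PARAMS, width, device_mem_gb=80)
    print(f"{label}: {tb:.2f} TB of weights, at least {n} devices")
```

Even the optimistic FP8 case needs more than a single eight-GPU node of that size before any memory is set aside for serving traffic.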

Organizations evaluating this for self-hosted deployment need to be clear-eyed about what infrastructure the speed claim assumes. Accessing MiMo-v2.5-Pro-UltraSpeed via API is a different proposition from running it on-premises. The API case is straightforward: Xiaomi offers it at 3× the cost of the standard MiMo-v2.5-Pro API, delivering approximately 10× the generation speed, on a limited-availability basis due to constrained high-speed inference resources. The self-hosted case requires hardware investment that most organizations won't make for a single model, regardless of the throughput numbers. That's not a knock on the model; it's just the reality of serving 1T parameters at any speed.
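The API trade-off reduces to simple arithmetic: pay 3× per token, wait roughly a tenth as long. The sketch below uses those two ratios from the paragraph above and the ~60 tokens per second standard-API figure quoted earlier. The base price is a placeholder, not Xiaomi's actual pricing, and treating a job as one stream at the quoted rate is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_mtok: float   # USD per million output tokens (placeholder value)
    tokens_per_sec: float   # sustained output rate seen by the job


def job_cost_and_seconds(tier: Tier, output_tokens: int) -> tuple[float, float]:
    """Return (cost in USD, wall-clock seconds) for generating output_tokens."""
    cost = tier.price_per_mtok * output_tokens / 1_000_000
    seconds = output_tokens / tier.tokens_per_sec
    return cost, seconds


BASE_PRICE = 1.0   # hypothetical; substitute the real list price
BASE_SPEED = 60.0  # tokens/s, the independent standard-API measurement

standard = Tier("standard", BASE_PRICE, BASE_SPEED)
ultraspeed = Tier("ultraspeed", 3 * BASE_PRICE, 10 * BASE_SPEED)

for tier in (standard, ultraspeed):
    cost, seconds = job_cost_and_seconds(tier, output_tokens=600_000)
    print(f"{tier.name}: ${cost:.2f}, {seconds / 60:.0f} min")
```

Whether a 3× bill for a 10× shorter wait is worth it depends on what the waiting costs you, which is a question for the workload rather than the model.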

For organizations that do have the infrastructure (large enterprises, AI-native companies, cloud providers building internal tooling) the question is how this fits into a model routing strategy alongside existing deployments. A 1T MoE model at 1000 tokens per second is a strong candidate for the reasoning-intensive tier of a tiered inference stack, with smaller faster models handling simpler classification or generation tasks upstream.
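A tiered stack can be sketched as a small routing function. Everything here is illustrative: the tier names, the request fields, and the idea that an upstream classifier has already labeled each request are assumptions, not part of any MiMo tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    needs_reasoning: bool    # label from an upstream classifier (hypothetical)
    latency_sensitive: bool  # interactive traffic where time-to-first-token dominates


def choose_tier(req: Request) -> str:
    """Pick a serving tier; the tier names are placeholders."""
    if not req.needs_reasoning:
        # Simple classification or generation goes to the small, cheap model.
        return "small-fast"
    if req.latency_sensitive:
        # Batched throughput does not help a single interactive query.
        return "interactive-reasoning"
    # Reasoning-heavy work that tolerates batching goes to the large MoE tier.
    return "batch-reasoning-moe"


jobs = [
    Request("Tag this ticket by product area", False, True),
    Request("Review this contract for indemnity risks", True, False),
    Request("Explain this stack trace while I wait", True, True),
]
for job in jobs:
    print(f"{choose_tier(job)}: {job.prompt}")
```

The point of the split is that the large model only sees traffic that can fill its batches, which is the condition its throughput figure depends on.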


The open weights question

MiMo-v2.5-Pro is available as open weights on Hugging Face under a permissive license, including weights, tokenizer, and the full model card. That's the detail that makes the infrastructure conversation relevant for more than just cloud customers. Open weights at this scale and quality, with a documented inference optimization approach, gives organizations a path to deploy frontier reasoning capability on their own hardware under their own control.

Whether the open weights release includes the full UltraSpeed inference optimization or just the base weights matters enormously for anyone trying to replicate the throughput numbers. The MTP modules are part of the released model architecture, so that throughput multiplier is accessible. But the expert batching, routing optimizations, and full TileRT system codesign that produce the 1000 token/s figure are engineering artifacts layered on top; those are separate from the weights themselves. This distinction is buried in the announcement and deserves more attention than it's getting.

A trillion-parameter open-weights model with frontier reasoning quality and an inference configuration designed to hit 1000 tokens per second on appropriate hardware is a genuinely different option from what was available eighteen months ago. The gap between what you can run locally and what you can only access through a cloud API has narrowed, and MiMo is a concrete data point in that shift.


Where skepticism is warranted

The 1T parameter count is marketing math as much as engineering specification. The number that matters for compute cost is active parameters per token — 42 billion in MiMo-v2.5-Pro's case, which is now publicly documented. That active figure is roughly what you'd expect from a model with DeepSeek-V3-class inference costs, not a model that actually runs a trillion parameters per token. The 1T headline should always be read alongside the 42B active caveat.

Benchmark performance on agentic coding and long-horizon tasks is real but shows where the model was trained to excel. MiMo-v2.5-Pro's strongest results are on ClawEval, SWE-bench Pro, and GDPVal — agentic and software engineering tasks. General reasoning, open-ended generation, long-context coherence: the benchmark coverage there is thinner, and that's where evaluation data would be needed before treating this as a drop-in replacement for a broader reasoning workload.

The throughput number is also hardware-specific in ways that aren't disclosed clearly enough to be useful. 1000 tokens per second on what cluster configuration, at what batch size, on what sequence length distribution? Each of those variables shifts the number significantly. A throughput claim without a complete hardware and configuration disclosure is hard to use for actual deployment planning.
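One way to make the missing disclosure concrete is to write down the fields a throughput claim needs before it can be used for planning. The record below is a hypothetical structure, and the example values are invented to show how much the interpretation moves. It treats the headline as an aggregate figure across concurrent sequences, which is how this article reads it, not something the announcement spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputClaim:
    aggregate_tokens_per_sec: float  # the headline number
    concurrent_sequences: int        # batch size during measurement
    gpus: int                        # devices in the serving cluster
    mean_output_tokens: int          # typical generation length in the test

    def per_sequence_tokens_per_sec(self) -> float:
        """Rate a single request would observe if the batch shares evenly."""
        return self.aggregate_tokens_per_sec / self.concurrent_sequences

    def per_gpu_tokens_per_sec(self) -> float:
        """Hardware efficiency, the figure that drives cost per token."""
        return self.aggregate_tokens_per_sec / self.gpus


# Two invented readings of the same 1000 tokens/s headline.
readings = [
    ThroughputClaim(1000.0, concurrent_sequences=1, gpus=8, mean_output_tokens=512),
    ThroughputClaim(1000.0, concurrent_sequences=16, gpus=8, mean_output_tokens=512),
]
for r in readings:
    print(f"batch={r.concurrent_sequences}: "
          f"{r.per_sequence_tokens_per_sec():.1f} tok/s per request, "
          f"{r.per_gpu_tokens_per_sec():.1f} tok/s per GPU")
```

The same headline describes very different user experiences depending on the batch size, which is why the undisclosed variables matter.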


MiMo-v2.5-Pro-UltraSpeed is genuinely interesting. The combination of MoE-based parameter efficiency (42B active out of 1.02T total), reasoning-focused training with multi-teacher distillation, and native Multi-Token Prediction modules producing 1000 tokens per second in the UltraSpeed configuration is a real capability, not a marketing artifact dressed up as a benchmark. It's a strong option for high-throughput agentic and coding workloads on appropriate infrastructure, with open weights that make self-hosted deployment viable for organizations that can build or rent the cluster to serve it properly.

The 1T headline is the least informative thing about it. The 42B active parameter count, the MTP inference architecture, and the multi-teacher post-training strategy are what actually determine whether it fits your use case — and those require more digging than the announcement gives you.


If you've run MiMo-v2.5-Pro on your own hardware and actually measured throughput and active parameter utilization — or compared it head-to-head with DeepSeek-V3 on the same cluster — I'd genuinely like to hear what you found. The gap between the throughput claim and what you actually get serving base weights naively is where the useful information lives, and nobody seems to have published that yet.