Headline: Mixture of Experts: Trillion Parameters, Billion-Scale Cost
A 671-billion-parameter model running on hardware that should choke on anything above 70 billion. That's not a typo. It's what Mixture of Experts makes possible, and it's why the architecture now underpins DeepSeek V3, Mixtral, Grok-1, and according to NVIDIA, over 60% of open-source AI model releases in 2024-25. Every top-10 open model on the Artificial Analysis leaderboard uses it.
The core insight is almost embarrassingly simple: don't activate all parameters for every token. Route each token through a small subset of specialized subnetworks called "experts," leaving the rest dormant. You get the capacity of a massive model at the inference cost of a much smaller one.
Sparse Routing in Practice
A standard transformer uses dense feed-forward layers where every token passes through every parameter. MoE swaps these out for multiple parallel "expert" networks plus a learned router that decides which experts handle which tokens.
The router itself is a small neural network trained alongside the main model. Given an input token, it outputs a probability distribution over all available experts, picks the top K (usually 2), and sends the token through only those. The outputs get weighted by the router's confidence scores and combined.
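That routing step is easier to see in code than in prose. Here's a minimal NumPy sketch of a single token passing through a top-K MoE layer; the shapes, weights, and `moe_layer` name are illustrative stand-ins, not any model's actual implementation.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Route one token through its top-k experts (illustrative sketch).

    x        : (d,) token representation
    router_w : (d, n_experts) learned router weights
    experts  : list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ router_w                        # score every expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over experts
    top_k = np.argsort(probs)[-k:]               # keep only the k best
    gates = probs[top_k] / probs[top_k].sum()    # renormalized confidences
    # Only the chosen experts execute; the rest stay dormant this token.
    return sum(g * experts[i](x) for i, g in zip(top_k, gates))

# Toy usage: 4 experts, top-2 routing, random token.
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n)]
out = moe_layer(rng.normal(size=d), rng.normal(size=(d, n)), experts)
print(out.shape)  # (8,)
```

The key property is in the last line of the function: compute scales with `k`, not with `len(experts)`.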
Mixtral 8x7B has 47 billion total parameters but activates only about 13 billion per token.
The "8x7B" refers to eight 7-billion-parameter expert networks, but each token only visits two of them. Inference compute scales with active parameters, not total parameters. That distinction is everything.
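The arithmetic behind that distinction is worth making explicit. The split below between shared parameters (attention, embeddings, seen by every token) and per-expert FFN parameters is a rough back-of-envelope reconstruction from Mixtral's published totals (~47B total, ~13B active), not official figures.

```python
# Why "8x7B" is neither 56B total nor 7B active:
# experts share the attention and embedding layers,
# and each token visits only 2 of the 8 expert FFNs.
n_experts, top_k = 8, 2
expert_params = 5.63e9  # assumed FFN params per expert (approximation)
shared_params = 1.6e9   # assumed shared attention/embedding params (approximation)

total = shared_params + n_experts * expert_params   # what you store
active = shared_params + top_k * expert_params      # what you compute per token

print(f"total  ≈ {total / 1e9:.0f}B")   # ≈ 47B
print(f"active ≈ {active / 1e9:.0f}B")  # ≈ 13B
```

Memory cost follows `total`; inference FLOPs follow `active`.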
The efficiency gains stack up fast. Switch Transformers achieved 4x pretraining speedup over T5-XXL using this approach. DeepSeek pushed further: their DeepSeekMoE paper reports a 16B model matching LLaMA2 7B with only ~40% of the compute, and a 145B model matching DeepSeek 67B with just 28.5%.
When Routing Goes Wrong
There's an obvious failure mode here. What if the router just sends everything to the same few experts? You'd end up with a dense model but with extra steps.
This happens more than you'd think.
Left unconstrained, routers collapse to using a handful of experts while others atrophy. NVIDIA's research shows that even with balancing algorithms, the busiest expert still receives 40-60% more tokens than the least busy one. The standard fix involves auxiliary losses during training that penalize uneven token distribution: Google's Switch Transformer work popularized a simple load-balancing loss, and the follow-up ST-MoE paper added a "router z-loss" to stabilize training and prevent collapse. These fixes work, but tuning the auxiliary losses is notoriously finicky. Cameron Wolfe's analysis describes MoE training as "notoriously difficult" due to instability, precision sensitivity, and hyperparameter brittleness. When it works, it works well. Getting it to work is the hard part.
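To make "penalize uneven token distribution" concrete, here's a sketch of the Switch-Transformer-style load-balancing loss, N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens sent to expert i and Pᵢ is the router's mean probability for it. The function name and toy inputs are ours; the formula is from the Switch Transformer paper.

```python
import numpy as np

def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary load-balancing loss (Switch-Transformer-style sketch).

    router_probs : (tokens, n_experts) router softmax outputs
    assignments  : (tokens,) index of the expert each token was sent to
    Returns N * sum_i f_i * P_i, minimized (value 1.0) when both the
    token counts and the router's confidence are uniform across experts.
    """
    f = np.bincount(assignments, minlength=n_experts) / len(assignments)
    P = router_probs.mean(axis=0)
    return n_experts * np.sum(f * P)

# Balanced routing sits at the minimum; collapse onto one expert raises it.
uniform = np.full((4, 2), 0.5)           # router spreads probability evenly
skewed = np.tile([0.9, 0.1], (4, 1))     # router favors expert 0
balanced = load_balance_loss(uniform, np.array([0, 1, 0, 1]), 2)   # 1.0
collapsed = load_balance_loss(skewed, np.array([0, 0, 0, 0]), 2)   # 1.8
```

Because the loss grows as routing concentrates, gradient descent on it pushes the router back toward spreading tokens out.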
You might expect experts to specialize by domain: one for legal text, another for code, another for medical terminology. They don't. Instead, experts develop fine-grained pattern specialization. One might handle certain syntactic structures. Another might specialize in specific token transitions. The routing decisions happen at such granular levels that human-interpretable domain boundaries don't really apply. This means expert routing adapts to the actual statistical structure of language rather than imposed categories, but it also means you can't easily inspect what each expert "knows" or surgically remove capabilities.
What DeepSeek Added
DeepSeek pushed the architecture further with two additions documented in their MoE paper.
Fine-grained expert segmentation. Instead of 8 large experts with top-2 routing, they use many smaller experts with more activated per token. This gives the router more granular choices and reduces knowledge overlap between experts.
Shared experts. Some expert capacity is always activated regardless of routing. These shared experts capture common knowledge that applies across most inputs, while routed experts handle specialized patterns. This reduces redundancy, since routed experts don't each need to learn common patterns independently.
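The two ideas combine naturally in one forward pass: an always-on shared path plus a routed path over many small experts. The sketch below shows that structure only; expert counts, shapes, and the function name are illustrative, not DeepSeek's actual configuration.

```python
import numpy as np

def deepseek_style_moe(x, shared, routed, router_w, k=6):
    """Simplified sketch of a shared + fine-grained routed MoE layer.

    shared : experts that run for EVERY token (common knowledge)
    routed : many small experts; only the top-k fire (specialized patterns)
    """
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_k = np.argsort(probs)[-k:]
    gates = probs[top_k] / probs[top_k].sum()
    out = sum(e(x) for e in shared)                        # always-on path
    out += sum(g * routed[i](x) for i, g in zip(top_k, gates))
    return out

# Toy usage: 2 shared experts, 16 small routed experts, top-6 routing.
rng = np.random.default_rng(1)
d = 8
shared = [lambda x, W=rng.normal(size=(d, d)) * 0.1: x @ W for _ in range(2)]
routed = [lambda x, W=rng.normal(size=(d, d)) * 0.1: x @ W for _ in range(16)]
out = deepseek_style_moe(rng.normal(size=d), shared, routed,
                         rng.normal(size=(d, 16)), k=6)
print(out.shape)  # (8,)
```

Note how the shared path bypasses the router entirely, which is exactly what lets routed experts shed the common patterns they'd otherwise each relearn.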
The payoff: DeepSeek V3 has 671 billion total parameters but activates only 37 billion per inference. That's a 94% sparsity ratio, and as we covered previously, it enables running a genuinely frontier model on comparatively reasonable hardware.
One clarification worth making: what's rumored about GPT-5's architecture differs from MoE as described here. According to Latent Space, GPT-5 routes between complete models (reasoning and non-reasoning variants) at the system level. That's "mixture of models," not mixture of experts. Traditional MoE routes at the token level, where every token in a sequence might go to different experts. Both use routing, but the mechanisms are distinct.
Why Now?
MoE isn't new. The concept dates to the 1990s, and Google was training trillion-parameter MoE models with 2048 experts back in 2022.
What changed is the engineering. Training and serving sparse models requires careful parallelization. Expert weights need distribution across devices. Routing decisions create communication overhead. Making this efficient at scale took years of infrastructure work, and that work is now done. NVIDIA reports that their GB200 NVL72 delivers 10x performance improvement for MoE inference, enabling roughly 1/10th the token cost. The hardware caught up to the architecture.
The pattern is now established: if you want frontier performance at reasonable cost, you use MoE. DeepSeek V3, Grok-1, DBRX, Mixtral, and the latest Mistral Large all do. The question has shifted from "should we use MoE?" to "how do we tune it correctly?"
Our read: For builders, this changes how to think about model selection. A 671B MoE model isn't comparable to a 671B dense model. The active parameter count determines inference cost and, to some extent, capability. Mixtral 8x7B performs like a 70B dense model at the cost of a 13B one. That's the MoE value proposition in a single comparison.