Mixture of Experts
Kimi K2 lists 1.04 trillion parameters. To produce each token, it runs about 32 billion of them, roughly three percent. The design that splits those two numbers is mixture of experts, or MoE.
A standard model runs every parameter on every token. An MoE carves one part of each layer into many smaller sub-networks, the experts, and puts a small learned router in front of them. For each token, the router scores the experts and runs only the top few, often plus one shared expert that always fires; the rest stay idle. Total parameters set the model's capacity, and the active few set the compute per token.
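The routing step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular model's implementation: the names `route`, `moe_layer`, and `make_expert` are hypothetical, the router is assumed to be a single linear layer, and gates are assumed to be a softmax over the chosen experts' scores, which is one common choice among several.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a short list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, top_k=2):
    """Score every expert for one token and keep the top_k (hypothetical helper)."""
    # One dot product per expert: the router is assumed to be a small linear layer.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize over the chosen experts so their gates sum to 1.
    gates = softmax([scores[i] for i in chosen])
    return list(zip(chosen, gates))

def moe_layer(token, router_weights, experts, shared_expert=None, top_k=2):
    """Run only the routed experts, plus an optional always-on shared expert."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router_weights, top_k):
        y = experts[idx](token)  # only the chosen experts execute
        out = [o + gate * v for o, v in zip(out, y)]
    if shared_expert is not None:
        out = [o + v for o, v in zip(out, shared_expert(token))]
    return out

# Toy setup: four stand-in "experts" that record when they are called.
calls = []

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * v for v in x]
    return expert

experts = [make_expert(i) for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
token = [2.0, 1.0]

routing = route(token, router_weights, top_k=2)
output = moe_layer(token, router_weights, experts, top_k=2)
```

With these toy weights the router scores the four experts 2.0, 1.0, -2.0, and 1.5, so only experts 0 and 3 run; the other two are never called even though they still exist in memory.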
It is tempting to picture one expert for math and another for French, with the router sending each token to its subject. That is not what happens. The router learns whatever lowers the training loss, and when Mistral traced the routing inside Mixtral, the split fell along syntax, not subject; math, biology, and philosophy text routed almost identically.
Skipping experts saves compute, not memory. Any expert might be the next token's pick, so all of them stay loaded in GPU memory while only a sliver runs. An MoE therefore decodes at roughly the speed of its active size and takes the memory of its total size. It is also no match for a dense model of that total size; its quality lands between that of a dense model at its active size and one at its total size.
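The split between what is stored and what is run can be made concrete with back-of-the-envelope arithmetic. The parameter counts below are the Kimi K2 figures quoted earlier; everything else is an assumption for illustration: one byte per parameter (8-bit weights) and the common rule of thumb of about 2 FLOPs per active parameter per token. The helper names are hypothetical.

```python
TOTAL_PARAMS = 1.04e12   # total parameters, from the text
ACTIVE_PARAMS = 32e9     # parameters run per token, from the text

def weight_memory_gb(params, bytes_per_param):
    # Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes).
    return params * bytes_per_param / 1e9

def flops_per_token(active_params):
    # Assumed rule of thumb: one multiply and one add per active parameter.
    return 2 * active_params

# Fraction of the model that runs for each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Memory scales with TOTAL parameters: every expert must stay loaded.
# Assumption: 1 byte per parameter (8-bit weights).
memory_gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=1)

# Compute scales with ACTIVE parameters only.
moe_flops = flops_per_token(ACTIVE_PARAMS)
dense_flops = flops_per_token(TOTAL_PARAMS)
compute_ratio = dense_flops / moe_flops

print(f"active fraction: {active_fraction:.1%}")
print(f"weight memory:   {memory_gb:.0f} GB")
print(f"dense/MoE FLOPs: {compute_ratio:.1f}x")
```

Under these assumptions the weights alone need on the order of a terabyte of memory, while the per-token compute is about one thirty-second of what a dense model of the same total size would spend.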
Capacity is what you store; the active slice is what you pay to run.