Parameters

Remember y = mx + b from algebra. The slope m scales the input; the intercept b shifts the result. A model is that equation blown up to billions of slopes and intercepts, stacked across many layers. Those tunable numbers are its parameters: the weights (the slopes) and the biases (the intercepts). Each one is a single scalar, usually stored in 16 bits.
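To make the counting concrete, here is a minimal sketch that tallies weights and biases for a stack of fully connected layers and converts the total to bytes at 16 bits per parameter. The layer widths and the helper names (`linear_layer_params`, `count_params`, `memory_bytes`) are hypothetical, chosen only for illustration; real models also contain other kinds of layers.

```python
from typing import Sequence


def linear_layer_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    weights = n_in * n_out
    biases = n_out
    return weights + biases


def count_params(widths: Sequence[int]) -> int:
    """Sum parameters over consecutive layers, e.g. [4, 8, 2] means 4 -> 8 -> 2."""
    return sum(linear_layer_params(a, b) for a, b in zip(widths, widths[1:]))


def memory_bytes(n_params: int, bits_per_param: int = 16) -> int:
    """Storage needed if every parameter takes bits_per_param bits."""
    return n_params * bits_per_param // 8


single = linear_layer_params(1, 1)  # y = mx + b: one weight (m), one bias (b)
toy = count_params([4, 8, 2])       # (4*8 + 8) + (8*2 + 2), made-up widths
print(single, toy, memory_bytes(toy))  # prints: 2 58 116
```

By the same arithmetic, 175 billion parameters at 16 bits each would occupy 350 billion bytes, or about 350 GB.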

The parameters are the model, and that's not a figure of speech. Pretraining compresses patterns from text straight into these values, and the knowledge lives nowhere else.

Don't confuse them with hyperparameters, the settings a human picks before training: the learning rate, the number of layers, and the batch size. Those shape how the model learns; the parameters are what it learns.
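The split can be seen in a toy line fit. In the sketch below, the learning rate and step count are hyperparameters fixed up front, while m and b are parameters that training adjusts. Plain gradient descent on mean squared error is an assumption made for illustration, as are the data and every name in the code; the text above does not prescribe a training algorithm.

```python
# Hyperparameters: chosen by a human before training starts.
LEARNING_RATE = 0.05
STEPS = 2000

# Toy data generated from y = 3x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 1.0 for x in xs]


def train(xs, ys, learning_rate, steps):
    """Fit y = m*x + b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0  # parameters: arbitrary starting values, then learned
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to m and b.
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


m, b = train(xs, ys, LEARNING_RATE, STEPS)
print(round(m, 3), round(b, 3))  # approximately 3.0 and 1.0
```

Changing `LEARNING_RATE` or `STEPS` changes how the fit proceeds, but the values the code ends up with, m and b, are the only things that get learned.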

People read a parameter count as raw power. Treat it as a rough proxy. GPT-3 has 175 billion parameters, yet DeepMind's Chinchilla showed that a 70B model can beat a much larger one: it outperformed the 280B Gopher by training on far more data. Data and compute weigh as much as size.

The count blurs again with Mixture of Experts, where each token is routed through only a few of the model's expert sub-networks. Kimi K2 holds 1.04 trillion parameters but activates only 32 billion per token, so a model's size now takes two numbers: total capacity and active cost.
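A quick calculation shows how far apart the two numbers sit. The parameter counts below come from the text; the 16-bit storage figure is an assumption carried over from the opening paragraph, not a statement about how Kimi K2 is actually stored, and the variable names are illustrative.

```python
TOTAL_PARAMS = 1_040_000_000_000  # total capacity, from the text
ACTIVE_PARAMS = 32_000_000_000    # activated per token, from the text
BITS_PER_PARAM = 16               # assumption: 16 bits per parameter

# Share of the model that does work on any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Every parameter still has to be stored, active or not.
total_bytes = TOTAL_PARAMS * BITS_PER_PARAM // 8

print(f"active share per token: {active_fraction:.1%}")    # 3.1%
print(f"storage at 16 bits: {total_bytes / 1e12:.2f} TB")  # 2.08 TB
```

Under these assumptions, roughly 3% of the parameters handle each token while all of them still have to be held in memory, which is why neither number alone describes the model.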

The model is its parameters. Bigger is a bet, not a guarantee.

See how size trades against data in scaling laws.