Neural Networks
Feed a language model the words "San Francisco is a" and "city" comes back as the likely next token, while an unrelated word like "kitten" sits near zero. Behind that ranking is a function built from layers of small units, each passing numbers to the next. That function is a neural network.
Numbers flow through three kinds of layers.
- The input layer takes your data. Text first becomes tokens, then numbers.
- Hidden layers do the work. Each unit multiplies its inputs by learned weights, adds a bias, then bends the result with an activation function. That bend is the point: its job is non-linearity, not "firing between 0 and 1." ReLU, the common default, returns `max(0, x)`, unbounded above, not the 0-to-1 squash of the old sigmoid. Remove it and the layers collapse, since `W2(W1x + b1) + b2` reduces to a single linear step. Depth would buy nothing.
- The output layer scores every candidate next token. Those raw scores are logits, and softmax turns them into probabilities that sum to 1. That's how "city" beats "kitten."
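The three layers above can be sketched as a tiny forward pass in plain Python. Everything specific here is an assumption made for illustration: the three-word vocabulary, the two-number input, and the hand-picked weights and biases (a real model learns its weights from data and scores tens of thousands of tokens). The helper names `dense`, `relu`, and `softmax` are hypothetical, not from any library.

```python
import math

def relu(x):
    # The "bend": negative values become 0, positive values pass through unchanged.
    return max(0.0, x)

def dense(weights, biases, inputs):
    # One layer: each unit takes a weighted sum of its inputs, then adds its bias.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: none of these numbers come from a trained model.
VOCAB = ["city", "kitten", "bridge"]
x = [1.0, 0.5]                                   # stand-in for an encoded prompt
W1 = [[0.8, -0.2], [-0.5, 0.9], [0.3, 0.4]]      # hidden layer: 3 units, 2 inputs each
b1 = [0.1, 0.0, -0.2]
W2 = [[1.5, -0.3, 0.8], [-1.0, 0.2, -0.6], [0.4, 0.1, 0.3]]  # one output row per token
b2 = [0.2, -0.1, 0.0]

hidden = [relu(h) for h in dense(W1, b1, x)]     # multiply, add, bend
logits = dense(W2, b2, hidden)                   # raw score per candidate token
probs = softmax(logits)                          # probabilities that sum to 1

for token, p in sorted(zip(VOCAB, probs), key=lambda pair: -pair[1]):
    print(f"{token}: {p:.3f}")
```

With these made-up weights, "city" ends up with the highest probability and "kitten" the lowest, which mirrors the ranking in the opening example only because the numbers were chosen that way.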
Nobody wrote that ranking by hand. It fell out of the training data, one weighted sum at a time.
A neuron doesn't fire like a brain cell. It computes a weighted sum and bends it, and that bend is the whole trick. It's just math: multiply, add, bend, repeat.
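To see why the bend matters, here is a minimal sketch that checks the collapse numerically: two stacked layers with no activation give the same output as one layer with weights `W2·W1` and bias `W2·b1 + b2`, while inserting ReLU between them breaks that equivalence. The 2×2 matrices, the input, and the helper names (`matvec`, `matmul`, `add`) are arbitrary choices for illustration.

```python
def matvec(W, x):
    # Matrix-vector product: one weighted sum per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product, used to fold two weight matrices into one.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative values, not learned weights.
W1 = [[1.0, -2.0], [0.5, 3.0]]
b1 = [0.5, -1.0]
W2 = [[2.0, 1.0], [-1.0, 0.5]]
b2 = [0.0, 1.0]

def two_linear_layers(x):
    # W2(W1x + b1) + b2 with no activation in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map folded into a single layer: W = W2·W1, b = W2·b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    # Same weights, but with the bend applied to the hidden values.
    hidden = [max(0.0, h) for h in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)

x = [1.0, 2.0]
print(two_linear_layers(x))     # matches one_linear_layer(x)
print(one_linear_layer(x))
print(two_layers_with_relu(x))  # differs once ReLU zeroes a negative hidden value
```

The ReLU version only differs for inputs that push at least one hidden value below zero, as this `x` does; that input-dependent switching is what a single linear layer cannot reproduce.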
See 3Blue1Brown's chapter 1 for the visuals, then move on to parameters and deep learning.