Neural Networks

Feed a language model the words "San Francisco is a" and "city" comes back as the likely next token, while an unrelated word like "kitten" gets a probability near zero. Behind that ranking is a function built from layers of small units, each passing numbers to the next. That function is a neural network.

[Figure: Next-token prediction. The context tokens "San", "Francisco", "is", "a" enter the input layer and pass through the hidden layers; the output layer ranks candidate next tokens: "city", "place", "major".]

Numbers flow through three kinds of layers.

  1. The input layer takes your data. Text first becomes tokens, then numbers.
  2. Hidden layers do the work. Each unit multiplies its inputs by learned weights, adds a bias, then bends the result with an activation function. That bend is the point: its job is non-linearity, not "firing between 0 and 1." ReLU, the common default, returns max(0, x), which is unbounded above, unlike the 0-to-1 squash of the old sigmoid. Remove the activation and stacked layers collapse, since W2(W1x + b1) + b2 multiplies out to (W2W1)x + (W2b1 + b2), a single linear step. Depth would buy nothing.
  3. The output layer scores every candidate next token. Those raw scores are logits, and softmax turns them into probabilities that sum to 1. That's how "city" beats "kitten."
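The three steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how a real language model is built: the input vector, the weights and biases, and the three-word vocabulary are all made-up numbers chosen by hand for illustration, whereas a real model learns its weights from data and scores tens of thousands of tokens.

```python
import math

def relu(z):
    # The "bend": negative values become 0, positive values pass through.
    return max(0.0, z)

def dense(inputs, weights, biases):
    # One layer: each unit takes a weighted sum of the inputs and adds its bias.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    # Subtracting the max first keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values (assumptions, not learned weights).
x = [1.0, 0.5, -0.5]                          # input: the context, already turned into numbers
W1 = [[0.5, -1.0, 0.25], [1.0, 0.5, -0.5]]    # hidden layer: 2 units, 3 inputs each
b1 = [0.0, 0.5]
W2 = [[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0]]   # output layer: one row per candidate token
b2 = [0.0, 0.0, 0.0]
vocab = ["city", "place", "kitten"]

hidden = [relu(z) for z in dense(x, W1, b1)]  # multiply, add, bend
logits = dense(hidden, W2, b2)                # raw score per candidate token
probs = softmax(logits)                       # probabilities that sum to 1

for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token}: {p:.4f}")
```

With these hand-picked numbers, "city" gets the highest probability and "kitten" lands near zero. Change the weights and the ranking changes with them.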

Nobody wrote that ranking by hand. It fell out of the training data, one weighted sum at a time.

A neuron doesn't fire like a brain cell. It computes a weighted sum and bends it, and that bend is the whole trick. It's just math: multiply, add, bend, repeat.
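To see why the bend is the whole trick, here is a small check in plain Python that two layers with no activation between them give the same output as one merged layer. The weights, biases, and input are arbitrary made-up numbers, and the helper names (dense, matmul) are hypothetical; read it as a sketch of the algebra, not a piece of any real library.

```python
def dense(inputs, weights, biases):
    # One layer without activation: weighted sum plus bias for each unit.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def matmul(A, B):
    # Plain matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical toy values (assumptions chosen for illustration).
x = [1.0, 0.5, -0.5]
W1 = [[0.5, -1.0, 0.25], [1.0, 0.5, -0.5]]
b1 = [0.0, 0.5]
W2 = [[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0]]
b2 = [0.25, 0.0, -0.25]

# Two layers, no bend in between: W2(W1x + b1) + b2
two_steps = dense(dense(x, W1, b1), W2, b2)

# The same thing as a single layer: (W2W1)x + (W2b1 + b2)
W = matmul(W2, W1)
b = [wb + c for wb, c in zip(dense(b1, W2, [0.0, 0.0, 0.0]), b2)]
one_step = dense(x, W, b)

# Put ReLU between the layers and the shortcut no longer matches.
with_relu = dense([max(0.0, z) for z in dense(x, W1, b1)], W2, b2)

print(two_steps)
print(one_step)
print(with_relu)
```

The first two printed lists agree up to floating-point rounding. The third differs, because ReLU zeroed out a negative value in the hidden layer, and no single weight matrix and bias can reproduce that for every input.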

See 3Blue1Brown's chapter 1 for the visuals, then continue to parameters and deep learning.