Attention

Read "the cat sat on the mat because it was tired." You take "it" to mean the cat, not the mat. A transformer recovers that link by letting each token build its meaning from the tokens around it. That mechanism is self-attention.

[Interactive figure: Self-attention. Caption: The model links "it" back to "cat." Example readout: "it" attends to "cat" with weight 0.46.]

Tap any word in the figure to make it the query. Each token projects into three learned vectors. A query says what it is looking for, a key advertises what it offers, and a value holds what it passes along. The model scores every query against every key and runs the scores through a softmax, so the weights sum to one. Each token's output is a weighted blend of the values. The query from "it" lands on the key from "cat," so cat's value dominates what "it" becomes. It works like a soft database lookup, run for every token at once.
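The score-softmax-blend steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it. The toy vectors are made up for the example (a real model learns the projections that produce them), and dividing the scores by the square root of the key size is an assumption borrowed from the standard transformer recipe, which the text above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Score every query against every key, then mix the values."""
    scale = math.sqrt(len(keys[0]))  # assumption: standard 1/sqrt(d_k) scaling
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output is a weighted blend of all value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors for three tokens; a real model learns these.
tokens = ["cat", "mat", "it"]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]]  # only "it" has a pointed query

outputs, weights = attention(queries, keys, values)
it_weights = dict(zip(tokens, weights[2]))  # how "it" spreads its attention
```

With these made-up numbers the query from "it" lines up with the key from "cat," so most of the weight in `it_weights` falls on "cat" and the output for "it" sits close to cat's value.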

Here the queries, keys, and values all come from one sequence, which is the "self" in self-attention. Pull the query from one sequence and the keys and values from another and you get cross-attention, which is how a decoder reads an encoder's output or how text attends to image features (3Blue1Brown covers both). Let each token see only itself and earlier positions and attention is causal, the masked form decoder models use to predict the next word; lift the mask, as encoders like BERT do, and it reads in both directions.
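The difference between causal and bidirectional attention can be shown on the weights alone. The sketch below is a hedged illustration with made-up scores: `masked_weights` is a hypothetical helper, and it hides later positions by slicing each row before the softmax. Practical implementations usually get the same effect by setting the hidden scores to negative infinity.

```python
import math

def masked_weights(scores, causal):
    """Softmax each row of a square score matrix, optionally hiding later positions."""
    rows = []
    for i, row in enumerate(scores):
        # Under a causal mask, position i may only look at positions 0..i.
        visible = row[: i + 1] if causal else row
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Hidden positions receive a weight of exactly zero.
        rows.append([e / total for e in exps] + [0.0] * (len(row) - len(visible)))
    return rows

# Hypothetical scores for a three-token sequence (all equal, for readability).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

causal = masked_weights(scores, causal=True)
bidirectional = masked_weights(scores, causal=False)
```

With equal scores, the causal rows spread their weight over one, two, then three visible positions, while every bidirectional row spreads it evenly over all three.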

Every token scores every other, so the work grows with the square of the sequence length. That cost is the pressure behind the KV cache and FlashAttention. Strip the name and attention is one move: score, then mix.