Tokenization

A model never sees the letters in your prompt. It reads numbers. Turning your text into those numbers is tokenization: it splits the text into chunks called tokens, then maps each to an integer ID. From there on, the model only handles IDs.

[Interactive demo: the raw text "San Francisco" split into the tokens "San", "Franc", "isco"]

Those chunks come from byte-level Byte Pair Encoding (BPE). Training starts with the 256 possible byte values, then merges the most frequent adjacent pair in the training text into a new token, again and again, until the vocabulary reaches its target size. Common words collapse into a single token; rare words stay split into pieces. Because the base vocabulary is raw bytes, every possible input can already be encoded, so emoji and other scripts never produce an unknown token.
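The merge loop is easier to follow in code. What follows is a minimal sketch, not a production tokenizer: the function names (train_bpe, merge, encode, decode) and the tiny training string are made up for illustration, and it skips the regex pre-splitting that real tokenizers usually apply before merging.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `corpus`."""
    ids = list(corpus.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                         # (left_id, right_id) -> new_id, in learned order
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                   # nothing repeats any more, so stop early
            break
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Start from raw bytes, then replay the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand every ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Toy training text (an assumption for this sketch, not real training data).
merges = train_bpe("low lower lowest low low", num_merges=10)
common = encode("low", merges)   # frequent word: collapses to very few IDs
emoji = encode("🙂", merges)      # never seen in training: stays as raw bytes
```

In this toy run the frequent word "low" ends up as a single ID, while the emoji, which never appeared in the training text, is still encodable as its four UTF-8 bytes. That is the "no unknown token" property in miniature.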

The merges chase raw frequency, not spelling. A tidy split like ["un", "believ", "able"] is only illustrative; GPT-4's tokenizer returns ["un", "belie", "vable"], and in a real sentence, where a space sits in front, the whole word collapses to one token. Because the model never sees letters, it stumbles when asked to count the r's in "strawberry," which GPT-4 reads as ["str", "aw", "berry"]. Paste a string into the OpenAI tokenizer and watch where it breaks.
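To make the letter-counting problem concrete, here is a small sketch that contrasts what a person sees with what the model receives. The three chunks are the split quoted above; the integer IDs are invented placeholders, not real vocabulary entries.

```python
# The split quoted in the text for "strawberry".
tokens = ["str", "aw", "berry"]

# Hypothetical IDs, made up for illustration only.
toy_vocab = {"str": 101, "aw": 102, "berry": 103}

# What the model receives: opaque integers with no letters inside them.
ids = [toy_vocab[t] for t in tokens]

# Counting a letter is trivial with the characters in hand...
r_count = sum(chunk.count("r") for chunk in tokens)

# ...but nothing in the ID sequence itself says how many r's each ID contains.
print(ids, r_count)
```

With the characters available the count is a one-liner. The model has to have learned, per token, which letters each ID stands for, and the r's here are spread across two different chunks.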

Tokens are also the unit you pay for. API pricing, context limits, and inference speed are all counted in tokens, roughly four English characters each, never in words. The tokenizer itself holds no intelligence. It's a frequency table of byte chunks, fixed before the model runs and before any embedding turns those chunks into meaning.
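The four-characters rule of thumb allows a quick back-of-the-envelope estimate before you send anything. This is a rough sketch only: the helper names and the price argument are hypothetical, the ratio applies to English text, and an exact count requires running the model's actual tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact ratio


def estimate_tokens(text):
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(text, usd_per_million_tokens):
    """Approximate cost; the price is a caller-supplied, hypothetical figure."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000


prompt = "A model never sees the letters in your prompt. It reads numbers."
approx_tokens = estimate_tokens(prompt)
```

Treat the result as a budget guide. Code, non-English text, and unusual strings can tokenize into far more pieces than this ratio suggests.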