Embeddings

A token ID is only a label: it tells the model which token it is looking at, not what that token means. An embedding fixes that. It is a learned vector, a list of numbers that acts as coordinates for the token in a high-dimensional space, positioned so that words used in similar ways land near each other. "King" sits close to "queen," and "Paris" close to "France."
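As a rough illustration of the lookup, the sketch below maps token IDs to rows of a small table and compares distances. The vocabulary, the three-dimensional vectors, and the helper names embed and distance are all made up for this example; real tables are learned and have hundreds or thousands of columns.

```python
import math

# Hypothetical toy vocabulary: token string -> token ID.
vocab = {"king": 0, "queen": 1, "paris": 2, "france": 3}

# Hypothetical embedding table: row i is the vector for token ID i.
# These values are hand-picked for illustration, not learned.
embedding_table = [
    [0.90, 0.80, 0.10],  # king
    [0.88, 0.82, 0.15],  # queen
    [0.10, 0.20, 0.90],  # paris
    [0.12, 0.25, 0.85],  # france
]

def embed(token):
    """Look up the vector for a token: string -> ID -> table row."""
    return embedding_table[vocab[token]]

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "king" sits much closer to "queen" than to "paris" in this toy space.
print(distance(embed("king"), embed("queen")) < distance(embed("king"), embed("paris")))
```

The ID itself carries no meaning. Swapping which ID belongs to which token changes nothing as long as the rows move with it.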

[Animation: Word embeddings. Each word becomes a vector.]

Nobody assigns these coordinates by hand. They start out random, and training pulls tokens that appear in similar contexts toward each other. The resulting input table is static: it holds one fixed vector per token, so a word with two senses still gets a single row. Attention later reshapes each vector using its neighbors, which is how meaning can shift with context.
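A minimal sketch of the static table follows, assuming a five-word hypothetical vocabulary and a small Gaussian initialization (the 0.02 standard deviation is one common choice, not something the text specifies). It leaves out training and only shows that the lookup ignores context: "bank" gets the same row in both sentences.

```python
import random

def init_embedding_table(vocab_size, dim, seed=0):
    """Build a table of small random vectors, one row per token ID."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

# Hypothetical vocabulary for illustration only.
vocab = {"the": 0, "river": 1, "bank": 2, "approved": 3, "loan": 4}
table = init_embedding_table(len(vocab), dim=4)

def embed_sentence(tokens):
    """Static lookup: each token maps to its row, regardless of neighbors."""
    return [table[vocab[t]] for t in tokens]

river_sentence = embed_sentence(["the", "river", "bank"])
loan_sentence = embed_sentence(["the", "bank", "approved", "the", "loan"])

# Both senses of "bank" come out of the table as the identical vector.
print(river_sentence[2] == loan_sentence[1])
```

Any difference between the two senses has to be added after this step, by layers that look at the surrounding tokens.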

Some directions in the space line up with relationships. The best-known result, the one the animation walks through, is that king - man + woman lands near queen (Mikolov, Yih, and Zweig, 2013). The effect is real, a property of the word2vec and GloVe vector spaces, but the example is cherry-picked. The standard demo excludes the three input words from the candidates; leave them in and the nearest vector is king itself, with queen second (Linzen, 2016). The arithmetic holds for gender and capital-country pairs and breaks on most others.
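The effect of excluding the input words can be reproduced on a toy scale. The vectors below are hand-made three-dimensional values chosen to show the behavior, not real word2vec or GloVe embeddings, and rank_neighbors is a hypothetical helper written for this sketch.

```python
import math

# Hand-made toy vectors (assumption): NOT taken from any trained model.
vectors = {
    "king":  [0.9, 0.3, 0.4],
    "queen": [0.9, -0.3, 0.4],
    "man":   [0.1, 0.2, 0.6],
    "woman": [0.1, 0.0, 0.6],
    "paris": [0.0, 0.0, -0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbors(target, exclude=()):
    """Return vocabulary words sorted from most to least similar to target."""
    candidates = [w for w in vectors if w not in exclude]
    return sorted(candidates,
                  key=lambda w: cosine_similarity(target, vectors[w]),
                  reverse=True)

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

with_inputs = rank_neighbors(target)
without_inputs = rank_neighbors(target, exclude=("king", "man", "woman"))

print(with_inputs[:2])     # ['king', 'queen']
print(without_inputs[0])   # queen
```

In this toy space the man-to-woman offset is smaller than the king-to-queen gap, so the result barely leaves king. Queen only wins once king is removed from the candidates.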

So read nearness as a learned hint, not proof of understanding. That hint is the engine under semantic search and retrieval-augmented generation (RAG): embed two pieces of text, score them by cosine similarity, and you get a usable "how related are these" signal even when they share no words.
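The scoring step can be sketched as below. The vectors here are invented stand-ins for what an embedding model would return for each text; no real model is called, and the texts and numbers are assumptions chosen so the query shares no words with its best match.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for the output of an embedding model.
query_text = "how do I reset my password"
query_vec = [0.8, 0.1, 0.3]

docs = {
    "steps to recover account login": [0.7, 0.2, 0.4],
    "best pasta recipes for dinner": [0.1, 0.9, 0.2],
}

# Rank documents by similarity to the query, most related first.
ranked = sorted(docs,
                key=lambda text: cosine_similarity(query_vec, docs[text]),
                reverse=True)

for text in ranked:
    print(round(cosine_similarity(query_vec, docs[text]), 3), text)
```

Cosine similarity compares direction and ignores vector length, which is one reason it is a common default for comparing embeddings.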