Multimodal

A multimodal model handles more than text. It can take in images, audio, or video, and sometimes produce them too. A "modality" is just a type of data, and multimodal means many of them at once.

The trick that makes it work is the same idea behind embeddings. Each kind of input gets converted into vectors in a shared space. An image is cut into patches, audio into slices, and each piece becomes a token-like vector the transformer can read next to words. Once everything is vectors, the model treats them the same way.

That shared space is why you can show a model a photo and ask about it in text, or hand it a chart and get a written summary. The image and the question live in the same representation, so attention can relate them directly.

For building, multimodal mostly expands what counts as a prompt. You can pass a screenshot, a PDF page, or an audio clip where you used to pass only text. The same ideas still apply: more input is more tokens, which affects cost, latency, and how much context you can fit.