Let’s start with the basics: understanding how an LLM “thinks” requires understanding the Transformer, the architecture introduced in 2017 by Google’s “Attention Is All You Need” paper. Everything you use today—GPT, Claude, Gemini, Llama—is built on this foundation.
1. Tokens: the atomic unit of text
Before talking about attention, you need to understand how the model “sees” text. Raw text is broken into tokens—which don’t necessarily correspond to whole words. The word “tokenization” might become ["token", "ization"], while “AI” is a single token.
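To make the splitting concrete, here is a toy greedy longest-match tokenizer over a tiny made-up vocabulary. Real LLMs use learned subword vocabularies (typically trained with byte-pair encoding), so both the vocabulary and the matching rule below are illustrative assumptions, not any model's actual tokenizer:

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"token", "ization", "AI", "love", " "}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("tokenization"))  # → ['token', 'ization']
print(tokenize("AI"))            # → ['AI']
```

The key takeaway is that the model never sees “tokenization” as one unit—it sees two vocabulary entries that happen to be adjacent.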
Each token is converted into an embedding: a high-dimensional numeric vector (e.g., 768 or 4,096 floats) that represents its meaning in semantic space. This is where mathematics takes over from linguistics.
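Mechanically, the embedding step is just a table lookup: each token ID indexes a row of a learned matrix. A minimal sketch with NumPy—the sizes are typical, but the values here are random rather than learned, and the token IDs are made up:

```python
import numpy as np

vocab_size, d_model = 50_000, 768  # typical orders of magnitude
rng = np.random.default_rng(0)

# In a real model these weights are learned during training.
embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

token_ids = [1012, 30]                    # hypothetical IDs for two tokens
vectors = embedding_matrix[token_ids]     # one row per token

print(vectors.shape)  # → (2, 768): two tokens, each a 768-float vector
```

From this point on, the model manipulates only these vectors; the original characters are gone.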
2. The Attention Mechanism
The self-attention mechanism is the heart of the Transformer. The central idea is simple yet powerful: each token must ask itself, “Which other tokens should I attend to in order to understand my own meaning?”
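That question is answered numerically by scaled dot-product attention, as defined in the “Attention Is All You Need” paper. A minimal single-head sketch in NumPy (no masking, no multiple heads; the projection matrices here are random placeholders for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d) token embeddings; Wq/Wk/Wv: (d, d) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # each token scores every other token
    # Row-wise softmax turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all tokens' value vectors.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 4, 8  # tiny sizes for illustration
X = rng.standard_normal((seq_len, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # → (4, 8): one updated vector per input token
```

Each token's output vector is thus a blend of every token's information, weighted by relevance—this is literally what “paying attention” means here.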


