Let’s start with the basics: understanding how an LLM “thinks” requires understanding the Transformer, the architecture introduced in 2017 by Google’s “Attention Is All You Need” paper. Everything you use today—GPT, Claude, Gemini, Llama—is built on this foundation.
1. Tokens: the atomic unit of text
Before talking about attention, you need to understand how the model “sees” text. Raw text is broken into tokens—which don’t necessarily correspond to whole words. The word “tokenization” might become ["token", "ization"], while “AI” is a single token.
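To make the splitting concrete, here is a toy greedy longest-match tokenizer over a tiny made-up vocabulary. Real LLMs use learned subword vocabularies (typically trained with byte-pair encoding), so both the vocabulary and the matching rule below are illustrative assumptions, not any model's actual tokenizer:

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"token", "ization", "AI", "love", " "}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("tokenization"))  # → ['token', 'ization']
print(tokenize("AI"))            # → ['AI']
```

The key takeaway is that the model never sees “tokenization” as one unit—it sees two vocabulary entries that happen to be adjacent.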
Each token is converted into an embedding: a high-dimensional numeric vector (e.g., 768 or 4,096 floats) that represents its meaning in semantic space. This is where mathematics takes over from linguistics.
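Mechanically, the embedding step is just a table lookup: each token ID indexes a row of a learned matrix. A minimal sketch with NumPy—the sizes are typical, but the values here are random rather than learned, and the token IDs are made up:

```python
import numpy as np

vocab_size, d_model = 50_000, 768  # typical orders of magnitude
rng = np.random.default_rng(0)

# In a real model these weights are learned during training.
embedding_matrix = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

token_ids = [1012, 30]                    # hypothetical IDs for two tokens
vectors = embedding_matrix[token_ids]     # one row per token

print(vectors.shape)  # → (2, 768): two tokens, each a 768-float vector
```

From this point on, the model manipulates only these vectors; the original characters are gone.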
2. The Attention Mechanism
The self-attention mechanism is the heart of the Transformer. The central idea is simple yet powerful: each token must ask itself, “Which other tokens should I attend to in order to understand my own meaning?”
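That question is answered numerically by scaled dot-product attention, as defined in the “Attention Is All You Need” paper. A minimal single-head sketch in NumPy (no masking, no multiple heads; the projection matrices here are random placeholders for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d) token embeddings; Wq/Wk/Wv: (d, d) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # each token scores every other token
    # Row-wise softmax turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all tokens' value vectors.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 4, 8  # tiny sizes for illustration
X = rng.standard_normal((seq_len, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # → (4, 8): one updated vector per input token
```

Each token's output vector is thus a blend of every token's information, weighted by relevance—this is literally what “paying attention” means here.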


