
Transformer Attention


A Transformer layer does not read a sequence in order the way a recurrent network does. Instead, each token's embedding is first projected into three vectors: a query, a key, and a value. The query represents what the current token is trying to find; the keys represent what every token offers as an addressable signal; the values carry the information that will be mixed into the output. Attention computes compatibility scores by comparing each query with all keys (typically with a dot product), turns those scores into weights through softmax, and uses the weights to form a weighted sum of the values.
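The query, key, and value flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attention` are hypothetical, the learned projections are skipped (queries, keys, and values are passed in directly), and the dot-product score with 1/sqrt(d_k) scaling follows the common convention and is assumed here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (lists of floats).

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compatibility score of this query with every key (assumed: scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted mix of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: one query that matches the first key more closely than the second.
outputs, weights = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights[0])   # the first weight is the larger one
print(outputs[0])   # the output leans toward the first value vector
```

Because the query is more compatible with the first key, most of the weight, and therefore most of the output, comes from the first value.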

The important detail is that attention is content-addressed, not position-addressed by default. Without a positional signal, the scores depend only on token content, so reordering the input merely reorders the output in the same way: the attention mechanism has no built-in reason to distinguish "A before B" from "B before A." This is why position information must be injected separately.
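A small experiment makes the order-blindness concrete. The sketch below assumes identity projections (queries, keys, and values are all the raw token vectors) and uses made-up toy vectors for the tokens and for the positional signal; `self_attention`, `add_positions`, and `POS` are hypothetical names chosen for illustration.

```python
import math

def self_attention(xs):
    """Self-attention with identity projections: queries = keys = values = xs."""
    d = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return outputs

def add_positions(xs, positions):
    """Inject position by adding a per-slot vector to each token vector."""
    return [[x + p for x, p in zip(vec, pos)] for vec, pos in zip(xs, positions)]

def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))

# Toy token vectors (arbitrary values, for illustration only).
A = [1.0, 0.0, 0.5]
B = [0.0, 1.0, -0.5]
C = [0.5, 0.5, 1.0]

# Without positions: A's output is the same whether A sits in slot 0 or slot 1.
out_abc = self_attention([A, B, C])
out_cab = self_attention([C, A, B])
same_without_positions = close(out_abc[0], out_cab[1])

# With toy positional vectors added, A's output depends on where A sits.
POS = [[0.0, 0.0, 0.0], [0.3, -0.2, 0.1], [-0.1, 0.4, 0.2]]
pos_abc = self_attention(add_positions([A, B, C], POS))
pos_cab = self_attention(add_positions([C, A, B], POS))
same_with_positions = close(pos_abc[0], pos_cab[1])

print(same_without_positions, same_with_positions)
```

Under these assumptions, the first flag comes out true and the second false: reordering the tokens only reorders the outputs until a positional signal is mixed into the token vectors.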

Multi-head attention makes this process less brittle. A single attention head must compress all relationships into one scoring pattern, but multiple heads allow the model to learn several relationship types at the same time: local syntax, long-range reference, delimiter matching, topic continuation, or positional cues. The outputs of these heads are concatenated and projected back into the model dimension, so the layer can combine several learned views of the sequence into one representation.
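The concatenate-and-project step can be sketched as follows. This is an illustrative sketch under stated assumptions: the weight matrices are random rather than learned, each head gets d_model / num_heads dimensions (a common choice, assumed here), and all function names (`matmul`, `attention`, `random_matrix`, `multi_head_attention`) are hypothetical helpers, not a library API.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are (seq_len x d_head)."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose keys
    scores = matmul(q, k_t)                       # (seq_len x seq_len)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rng, rows, cols):
    """Stand-in for learned weights (assumption: untrained, uniform in [-1, 1])."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """x: (seq_len x d_model); heads: list of (w_q, w_k, w_v), each (d_model x d_head);
    w_o: (num_heads * d_head x d_model)."""
    # Each head projects the input with its own matrices and attends independently.
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the heads' outputs per token along the feature axis.
    concat = [[value for h in head_outputs for value in h[i]] for i in range(len(x))]
    # Project the concatenation back into the model dimension.
    return matmul(concat, w_o)

rng = random.Random(0)
d_model, num_heads, seq_len = 8, 2, 3
d_head = d_model // num_heads
x = random_matrix(rng, seq_len, d_model)
heads = [tuple(random_matrix(rng, d_model, d_head) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(rng, num_heads * d_head, d_model)

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 3 8
```

The output has the same shape as the input (one d_model-sized vector per token), which is what lets the layer merge several heads' views into a single representation.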