Every equation, explained with intuition. Based on Chris Wendler's construction.
1. Token Embedding
Token to vector mapping
$$x_i = E[\text{token}_i] \in \mathbb{R}^d$$
What: Each token (character) is mapped to a dense vector of dimension \(d\).
The embedding matrix \(E\) has one row per vocabulary item.
Why: Neural nets need continuous vectors, not discrete symbols.
The embedding lets similar tokens have similar vectors (learned during training).
What if: If we used one-hot vectors instead, every token would be
equally "far" from every other — no notion of similarity. Dense embeddings let the model
generalize across similar tokens.
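The one-hot claim is easy to check numerically. A minimal sketch (the vocabulary size of 5 is an arbitrary illustrative choice):

```python
import numpy as np

# One-hot vectors: every distinct pair of tokens is exactly the same
# distance apart (sqrt(2)), so there is no notion of similarity at all.
vocab = 5
onehot = np.eye(vocab)
dists = [np.linalg.norm(onehot[i] - onehot[j])
         for i in range(vocab) for j in range(vocab) if i != j]
```

Every pairwise distance comes out identical, which is exactly the "no notion of similarity" problem dense embeddings avoid.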
In this demo
$$E[\text{token}_i]_j = \frac{1}{2}\sin\!\left(\text{idx}(i) \cdot \frac{\pi}{13} + 0.1j\right), \quad j = 0,\ldots,d{-}1$$
What: We use a deterministic sinusoidal embedding so every token gets a
unique, reproducible vector. Different tokens get different frequency patterns.
Why: In a real transformer, \(E\) is learned. Here we use a fixed function
so the demo is fully self-contained with no training step.
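A sketch of this fixed embedding in NumPy. The dimension `d = 8` is an assumption of this sketch, not fixed by the formula:

```python
import numpy as np

d = 8  # embedding dimension (assumed for this sketch)

def embed(token_idx: int) -> np.ndarray:
    """Deterministic embedding: 0.5 * sin(idx * pi/13 + 0.1*j) per dim j."""
    j = np.arange(d)
    return 0.5 * np.sin(token_idx * np.pi / 13 + 0.1 * j)

# Different tokens get different, fully reproducible vectors.
e_a, e_b = embed(0), embed(1)
```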
2. RoPE — Rotary Position Embedding
Frequency schedule
$$\theta_i = \frac{1}{10000^{\,2i/d}}, \quad i = 0, 1, \ldots, \tfrac{d}{2}{-}1$$
What: Each pair of dimensions rotates at a different frequency.
Low-index dimensions rotate fast (high \(\theta\)); high-index dimensions rotate slowly.
Why: Multiple frequencies create a unique "fingerprint" for each position.
A query and a key can align across ALL frequencies simultaneously only at one specific relative offset.
This is like how a combination lock needs all tumblers aligned.
What if: If all frequencies were the same, positions that differ by
\(2\pi/\theta\) would be indistinguishable — periodic aliasing.
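The schedule is one line of NumPy; `d = 8` is an assumed dimension for illustration:

```python
import numpy as np

# theta_i = 10000^(-2i/d): pair 0 rotates fastest (theta = 1),
# later pairs rotate ever more slowly.
d = 8
i = np.arange(d // 2)
theta = 1.0 / (10000.0 ** (2 * i / d))
```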
Rotation matrix at position m
$$R_{\Theta,m} = \text{diag}\!\begin{pmatrix}
\begin{bmatrix} \cos m\theta_0 & -\sin m\theta_0 \\ \sin m\theta_0 & \cos m\theta_0 \end{bmatrix},
\;\ldots\;,
\begin{bmatrix} \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \end{bmatrix}
\end{pmatrix}$$
What: A block-diagonal matrix of \(d/2\) independent 2D rotations.
Each 2D block rotates its pair of dimensions by angle \(m\theta_i\).
Why block-diagonal? It preserves norms (\(\|R_{\Theta,m} v\| = \|v\|\))
and composes cleanly: \(R_{\Theta,a} \cdot R_{\Theta,b} = R_{\Theta,a+b}\).
This composition property is the key to making relative position work!
Key composition property
$$R_{\Theta,m}^T \cdot R_{\Theta,n} = R_{\Theta,\,n-m}$$
What: When we compute \(q_m^T k_n\), the two position rotations collapse into a
SINGLE rotation by the relative offset \(n - m\).
Why this matters: Attention scores depend only on the distance
between tokens, not their absolute positions. Token at position 5 attending to
position 3 gets the same geometric relationship as position 100 attending to 98.
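Both properties — norm preservation and rotation composition — can be verified numerically. A sketch assuming `d = 8`:

```python
import numpy as np

# Block-diagonal RoPE matrix R_{Theta,m}: d/2 independent 2D rotations,
# pair i turning by angle m * theta_i (d = 8 is an assumed dimension).
d = 8
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope_matrix(m: float) -> np.ndarray:
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

# Composition: rotating by a, then by b, equals rotating by a + b.
composed = rope_matrix(2) @ rope_matrix(3)
```

The same function also confirms the key identity above: `rope_matrix(2).T @ rope_matrix(3)` equals `rope_matrix(1)`, a single rotation by the relative offset.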
3. Layer 0 — Previous Token Head
Goal: each position \(m\) attends to position \(m{-}1\) with maximum weight.
Key weight matrix — constant output
$$W_k \, x = (1, 0, 1, 0, \ldots, 1, 0)^T \quad \text{for ALL } x$$
What: \(W_k\) projects every token to the same vector \(c = (1,0,1,0,\ldots)\).
The key carries NO information about token identity.
Why: We want attention to depend ONLY on position, not on what the
token is. By making \(W_k x\) constant, the only distinguishing signal in
\(k_n = R_{\Theta,n} \cdot c\) comes from the RoPE rotation at position \(n\).
What if W_k depended on the token? Then attention would mix
positional and semantic signals. The head would no longer be a pure "previous position" detector.
Keys with RoPE
$$k_n = R_{\Theta,n} \cdot W_k \, x_n = R_{\Theta,n} \cdot c$$
What: Each key is the constant vector \(c\) rotated by position \(n\).
Position 0's key is unrotated; position 1's key is rotated by \(\theta_i\) in each 2D pair; position 2's by \(2\theta_i\); and so on.
Query weight matrix — the key equation
$$\boxed{W_q = \alpha \cdot R_{\Theta,-1} \cdot W_k}$$
What: The query matrix is the key matrix, pre-rotated by \(-1\) step and scaled by \(\alpha\).
Why \(R_{\Theta,-1}\)? This introduces a "-1 offset" into the query.
When RoPE adds the query's own position \(m\), the total rotation becomes \(m - 1\),
which perfectly aligns with the key at position \(m - 1\). The \(-1\) is the "look back one step" instruction.
Why \(\alpha\)? Scaling amplifies the score difference between the matching
position and non-matching positions. Larger \(\alpha\) → sharper attention → closer to a hard "select previous" operation.
What if we used \(R_{\Theta,-2}\)? Then position \(m\) would attend
to \(m{-}2\), creating a "two tokens back" head instead!
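This "what if" is easy to test: pre-rotating the query by \(R_{\Theta,-2}\) moves the best-matching key to position \(m-2\). A sketch with assumed values `d = 8`, `T = 6`, `alpha = 10`:

```python
import numpy as np

d, T, alpha = 8, 6, 10.0  # assumed dimensions and scale for this sketch
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

c = np.tile([1.0, 0.0], d // 2)                    # constant key direction
keys = np.stack([rope(n) @ c for n in range(T)])

# Pre-rotate by -2 instead of -1: a "two tokens back" head.
qs = [alpha * (rope(m) @ rope(-2) @ c) for m in range(2, T)]
offsets = [int(np.argmax(keys @ q)) for q in qs]   # best-matching key per query
```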
Queries with RoPE
$$q_m = R_{\Theta,m} \cdot W_q \, x_m
= R_{\Theta,m} \cdot \alpha \, R_{\Theta,-1} \cdot c
= \alpha \cdot R_{\Theta,\,m-1} \cdot c$$
What: The rotations compose! \(R_{\Theta,m} \cdot R_{\Theta,-1} = R_{\Theta,m-1}\).
So the query at position \(m\) looks like the key at position \(m{-}1\) (times \(\alpha\)).
Why this creates diagonal attention: \(q_m\) is essentially a rotated copy
of \(c\) at angle \(m{-}1\), and \(k_n\) is a rotated copy of \(c\) at angle \(n\).
They match best when \(n = m{-}1\).
Attention scores — deriving the diagonal
$$q_m^T k_n \;=\; \alpha \cdot c^T \, R_{\Theta,\,m-1}^T \, R_{\Theta,n} \, c
\;=\; \alpha \cdot c^T \, R_{\Theta,\,n-m+1} \, c$$
$$= \alpha \sum_{i=0}^{d/2-1} \cos\!\big((n - m + 1)\,\theta_i\big)$$
What: The score is a sum of cosines at different frequencies, all evaluated at the
relative offset \(\Delta = n - m + 1\).
When \(n = m{-}1\): \(\Delta = 0\), every cosine equals 1, score = \(\alpha \cdot d/2\) (MAXIMUM).
When \(n \neq m{-}1\): \(\Delta \neq 0\), the cosines point in different directions
and partially cancel. The sum is strictly less than \(d/2\). More frequencies → better cancellation → sharper peak.
Analogy: Think of \(d/2\) clock hands all pointing up at \(\Delta=0\).
At any other \(\Delta\), they point in different directions and their vertical sum is smaller.
After softmax — the attention pattern
$$\text{Attn}(m, n) = \text{softmax}_n\!\big(q_m^T k_n\big) \approx
\begin{cases} 1 & \text{if } n = m{-}1 \\ 0 & \text{otherwise}\end{cases}$$
What: Softmax exponentiates the score gap. Since position \(m{-}1\) has the highest score
by a margin proportional to \(\alpha\), it gets almost all the probability mass.
Result: A clear sub-diagonal band — each row puts ~100% attention on the previous position.
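Putting the pieces together, here is a sketch of the previous-token head's attention pattern (`d = 8`, sequence length 6, and `alpha = 10` are illustrative choices):

```python
import numpy as np

# Previous-token head: constant key c rotated by RoPE at each position,
# query pre-rotated by -1 and scaled by alpha.
d, T, alpha = 8, 6, 10.0
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

c = np.tile([1.0, 0.0], d // 2)                       # constant key direction
keys = np.stack([rope(n) @ c for n in range(T)])      # k_n = R_n c
queries = np.stack([alpha * (rope(m) @ rope(-1) @ c)  # q_m = alpha R_{m-1} c
                    for m in range(T)])

scores = queries @ keys.T                             # scores[m, n] = q_m . k_n
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)               # row-wise softmax
```

Each row `m >= 1` puts nearly all of its mass on column `m - 1` — the sub-diagonal band described above.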
Output — copies the previous token's value
$$o_m^{(0)} = \sum_n \text{Attn}(m, n) \cdot v_n \;\approx\; v_{m-1}$$
What: Since attention concentrates on \(n = m{-}1\), the output at position \(m\)
is approximately the value vector of the previous token.
Why this matters for induction: After this layer, position \(m\) now
"knows" what token came before it. This information flows through the residual stream
and is used by Layer 1 to complete the induction circuit.
4. Layer 1 — Semantic / Induction Head
Goal: find previous occurrences of the current token and predict what followed them.
Rank-1 weight matrices
$$W_k^{(1)} = u_k \, v_k^T, \qquad W_q^{(1)} = u_q \, v_q^T$$
What: Both key and query matrices are rank-1 (outer product of two vectors).
This factorizes the computation into a "what to read" vector \(v\) and a "what to broadcast" vector \(u\).
Why rank-1? It makes the attention score decompose cleanly:
\(q_m^T k_n\) becomes a product of two independent projections
— one depending on the query token and one on the key token.
Keys and queries
$$k_n = u_k \cdot (v_k^T x_n), \qquad q_m = u_q \cdot (v_q^T x_m)$$
What: Each key/query is a fixed direction (\(u_k\) or \(u_q\)) scaled by how much
the token projects onto the "reading" direction (\(v_k\) or \(v_q\)).
Why: \(v_k^T x_n\) is a scalar that captures "the semantic identity of token \(n\)".
All keys point in the same direction \(u_k\) but with different magnitudes depending on the token.
Attention score — factored form
$$q_m^T k_n = \underbrace{(u_q^T u_k)}_{\text{constant}} \cdot \underbrace{(v_q^T x_m)}_{\text{query token}} \cdot \underbrace{(v_k^T x_n)}_{\text{key token}}$$
What: The score factors into three parts:
(1) a global constant, (2) a query-side scalar, (3) a key-side scalar.
Why this enables semantic matching: If \(v_q = v_k\), then
\(v_q^T x_m\) and \(v_k^T x_n\) are large for the same tokens, making identical tokens attend
to each other strongly. This is "find tokens like me".
What if the matrix were full-rank? Then the score would be a general
bilinear form \(x_m^T M x_n\), which could match ANY pair of tokens.
Rank-1 constrains it to match tokens with the same projection — a simpler, more interpretable pattern.
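A quick numerical check of the factorization, with randomly drawn (purely illustrative) vectors and a shared reading direction \(v = v_q = v_k\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # assumed dimension for this sketch
u_q, u_k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
x_m, x_n = rng.normal(size=d), rng.normal(size=d)

q = u_q * (v @ x_m)        # q_m = u_q (v^T x_m): fixed direction, token-dependent scale
k = u_k * (v @ x_n)        # k_n = u_k (v^T x_n)
score = q @ k
# Factored form: global constant * query-side scalar * key-side scalar.
factored = (u_q @ u_k) * (v @ x_m) * (v @ x_n)
```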
The induction mechanism
$$\text{Input: } \ldots A \; B \;\ldots\; A \;\; \underset{\uparrow}{\text{[predict here]}}$$
$$\text{Layer 0 output at the position after the first } A\!: \text{ carries ``my previous token was } A\text{''}$$
$$\text{Layer 1: last } A \text{ matches that marker} \;\Rightarrow\; \text{attend to the position after the first } A \;\Rightarrow\; \text{predict } B$$
The induction circuit in two steps:
Step 1 (Layer 0): Previous-token head copies each token's predecessor into
the residual stream. After Layer 0, position \(i\) knows "my previous token was ___".
Step 2 (Layer 1): The semantic head at the last position sees token \(A\),
searches for earlier occurrences of \(A\), and attends to the position AFTER it.
Whatever followed \(A\) before (\(B\)) becomes the prediction.
Why two layers? Layer 0 shifts information backward by one step.
Layer 1 uses that shifted information to perform the pattern match.
Neither layer alone can do induction — it requires the composition.
What if there's no repeated token? No induction pattern fires.
The model falls back to unigram/positional statistics.
5. Complete Forward Pass
Step-by-step from input tokens to prediction.
Step 1: Embed tokens
$$h_i^{(0)} = E[\text{token}_i] \in \mathbb{R}^d$$
What: Convert each token to its embedding vector. This is the initial "residual stream" state.
Step 2: Layer 0 — Previous Token Head
$$k_n^{(0)} = R_{\Theta,n} \cdot c \qquad (\text{constant key})$$
$$q_m^{(0)} = \alpha \cdot R_{\Theta,m-1} \cdot c$$
$$A^{(0)}_{m,n} = \text{softmax}_n\!\left(\alpha \sum_i \cos\!\big((n{-}m{+}1)\theta_i\big)\right) \approx \mathbf{1}[n = m{-}1]$$
$$o_m^{(0)} = \sum_n A^{(0)}_{m,n} \cdot V^{(0)} h_n^{(0)}$$
What: Each position gathers the value of its predecessor. The output \(o_m^{(0)}\) carries
information about token \(m{-}1\).
Step 3: Residual connection after Layer 0
$$h_m^{(1)} = h_m^{(0)} + o_m^{(0)}$$
What: The residual stream now contains BOTH the current token's embedding AND
the previous token's information, added together.
Why residual? Without the \(+\), Layer 1 would only see the previous token's info
and lose the current token's identity. The residual connection preserves both signals.
Step 4: Layer 1 — Semantic / Induction Head
$$k_n^{(1)} = u_k \cdot (v_k^T h_n^{(1)})$$
$$q_m^{(1)} = u_q \cdot (v_q^T h_m^{(1)})$$
$$A^{(1)}_{m,n} = \text{softmax}_n\!\big(q_m^{(1)T} k_n^{(1)}\big)$$
$$o_m^{(1)} = \sum_n A^{(1)}_{m,n} \cdot V^{(1)} h_n^{(1)}$$
What: Semantic matching on the enriched residual stream. Because \(h^{(1)}\) contains
both "who am I" and "who came before me", the rank-1 matching can find tokens with the
same predecessor-current pair, enabling induction.
Step 5: Final residual and output
$$h_m^{(2)} = h_m^{(1)} + o_m^{(1)}$$
$$\text{logits}_m = W_{\text{out}} \cdot h_m^{(2)} + b_{\text{out}}$$
$$P(\text{next token} \mid \text{context}) = \text{softmax}(\text{logits}_m)$$
What: The final residual stream is projected to vocabulary logits. Softmax converts to probabilities.
For induction: \(h_m^{(2)}\) at the last position contains the value from the token
that followed the previous occurrence. \(W_{\text{out}}\) maps this to high probability for that token.
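The five steps can be sketched end to end. The token sequence, \(\alpha\), and the layer-1 directions below are illustrative choices, not the demo's actual weights, and both value matrices are taken as identity for brevity:

```python
import numpy as np

d, vocab = 8, 13  # assumed sizes for this sketch
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

def embed(idx):
    return 0.5 * np.sin(idx * np.pi / 13 + 0.1 * np.arange(d))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

tokens = [0, 1, 2, 0]                                # "A B C A"
T = len(tokens)
h0 = np.stack([embed(t) for t in tokens])            # Step 1: embed

# Step 2: Layer 0, previous-token head.
alpha, c = 10.0, np.tile([1.0, 0.0], d // 2)
K0 = np.stack([rope(n) @ c for n in range(T)])
Q0 = np.stack([alpha * (rope(m) @ rope(-1) @ c) for m in range(T)])
A0 = softmax_rows(Q0 @ K0.T)
h1 = h0 + A0 @ h0                                    # Step 3: residual (V^(0) = I)

# Step 4: Layer 1, rank-1 semantic head (illustrative directions).
rng = np.random.default_rng(0)
u_q = u_k = rng.normal(size=d)
v_q = v_k = rng.normal(size=d)
K1 = np.outer(h1 @ v_k, u_k)                         # rows k_n = u_k (v_k^T h_n)
Q1 = np.outer(h1 @ v_q, u_q)
A1 = softmax_rows(Q1 @ K1.T)
h2 = h1 + A1 @ h1                                    # Step 5: residual (V^(1) = I)

W_out = np.stack([embed(t) for t in range(vocab)])   # tie output to embeddings
logits = h2 @ W_out.T
probs = softmax_rows(logits)
```

This pipeline traces the shapes and information flow; with the hand-picked weights of the actual construction (rather than the random directions here), Layer 1's attention would lock onto the position after the first \(A\).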
Summary: Information flow
$$\underset{\text{embed}}{x_i} \;\xrightarrow{\text{Layer 0}}\;
\underset{\text{+ prev token info}}{h^{(1)}} \;\xrightarrow{\text{Layer 1}}\;
\underset{\text{+ induction match}}{h^{(2)}} \;\xrightarrow{W_\text{out}}\;
\text{prediction}$$
The full picture: Embeddings → shifted context via previous-token head →
residual enrichment → semantic match via induction head → prediction.
Each layer adds exactly one piece of the puzzle.