RoPE Transformer

Based on Chris Wendler's equations

📐 RoPE Rotation Matrix

$$R^{d}_{\Theta, m} = \begin{pmatrix} \cos{m\theta_1} & -\sin{m\theta_1} & 0 & 0 & \cdots & 0 & 0 \\ \sin{m\theta_1} & \cos{m\theta_1} & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos{m\theta_2} & -\sin{m\theta_2} & \cdots & 0 & 0 \\ 0 & 0 & \sin{m\theta_2} & \cos{m\theta_2} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos{m\theta_{d/2}} & -\sin{m\theta_{d/2}} \\ 0 & 0 & 0 & 0 & \cdots & \sin{m\theta_{d/2}} & \cos{m\theta_{d/2}} \\ \end{pmatrix}$$

Where \(\theta_i = 1 / 10000^{2(i-1)/d}\) for \(i = 1, \ldots, d/2\), and \(m\) is the token position

🔑 Head 1: Previous Token Head

$$W_k \, x = c = (1, 0, 1, 0, \ldots)^T \quad \text{(constant for all } x \text{)}$$ $$W_q = \alpha \, R_{\Theta, -1} \, W_k$$ $$q_m = \alpha \, R_{\Theta,m-1} \cdot c, \qquad k_n = R_{\Theta,n} \cdot c$$ $$q_m^T k_n = \alpha \sum_{i} \cos\!\big((n - m + 1)\,\theta_i\big) \;=\; \begin{cases} \alpha \cdot d/2 & n = m{-}1 \\ < \alpha \cdot d/2 & n \neq m{-}1 \end{cases}$$

Diagonal attention: each position m attends maximally to position m−1

🧠 Head 2: Semantic Head (Rank-1)

$$W_k^{(2)} = u_k v_k^T, \quad W_q^{(2)} = u_q v_q^T$$ $$q_m^T k_n = (u_q^T u_k) (v_q^T x_m) (v_k^T x_n)$$

Matches based on token semantics (repeating patterns)

Complete 2-Layer RoPE Transformer

Every equation, explained with intuition. Based on Chris Wendler's construction.

1. Token Embedding

Token to vector mapping
$$x_i = E[\text{token}_i] \in \mathbb{R}^d$$
What: Each token (character) is mapped to a dense vector of dimension \(d\). The embedding matrix \(E\) has one row per vocabulary item.
Why: Neural nets need continuous vectors, not discrete symbols. The embedding lets similar tokens have similar vectors (learned during training).
What if: If we used one-hot vectors instead, every token would be equally "far" from every other — no notion of similarity. Dense embeddings let the model generalize across similar tokens.
In this demo
$$E[\text{token}_i]_j = \frac{1}{2}\sin\!\left(\text{idx}(i) \cdot \frac{\pi}{13} + 0.1j\right), \quad j = 0,\ldots,d{-}1$$
What: We use a deterministic sinusoidal embedding so every token gets a unique, reproducible vector. Different tokens get different frequency patterns.
Why: In a real transformer, \(E\) is learned. Here we use a fixed function so the demo is fully self-contained with no training step.
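The fixed embedding is easy to reproduce in a few lines of numpy. This is a minimal sketch; the dimension \(d = 8\) and the 13-symbol vocabulary string are assumptions made here for illustration (the \(\pi/13\) in the formula suggests 13 tokens, but the demo's actual alphabet isn't specified).

```python
import numpy as np

d = 8                      # embedding dimension (assumed for this sketch)
VOCAB = "abcdefghijklm"    # hypothetical 13-token vocabulary, so idx(i) runs 0..12

def embed(token: str) -> np.ndarray:
    """Deterministic sinusoidal embedding: E[token]_j = 0.5*sin(idx*pi/13 + 0.1*j)."""
    idx = VOCAB.index(token)
    j = np.arange(d)
    return 0.5 * np.sin(idx * np.pi / 13 + 0.1 * j)

# Distinct tokens get distinct vectors; the same token is always reproducible.
x_a, x_b = embed("a"), embed("b")
```

Because the map is a fixed function of the token index, there is no embedding table to train: re-running `embed` always yields the same vector.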

2. RoPE — Rotary Position Embedding

Frequency schedule
$$\theta_i = \frac{1}{10000^{\,2i/d}}, \quad i = 0, 1, \ldots, \tfrac{d}{2}{-}1$$
What: Each pair of dimensions rotates at a different frequency. Low-index dimensions rotate fast (high \(\theta\)); high-index dimensions rotate slowly.
Why: Multiple frequencies create a unique "fingerprint" for each position. Two positions can only match across ALL frequencies simultaneously if they are the same offset apart. This is like how a combination lock needs all tumblers aligned.
What if: If all frequencies were the same, positions that differ by \(2\pi/\theta\) would be indistinguishable — periodic aliasing.
Rotation matrix at position m
$$R_{\Theta,m} = \text{diag}\!\begin{pmatrix} \begin{bmatrix} \cos m\theta_0 & -\sin m\theta_0 \\ \sin m\theta_0 & \cos m\theta_0 \end{bmatrix}, \;\ldots\;, \begin{bmatrix} \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \end{bmatrix} \end{pmatrix}$$
What: A block-diagonal matrix of \(d/2\) independent 2D rotations. Each 2D block rotates its pair of dimensions by angle \(m\theta_i\).
Why block-diagonal? It preserves norms (\(\|R_{\Theta,m} v\| = \|v\|\)) and composes cleanly: \(R_{\Theta,a} \cdot R_{\Theta,b} = R_{\Theta,a+b}\). This composition property is the key to making relative position work!
Key composition property
$$R_{\Theta,m}^T \cdot R_{\Theta,n} = R_{\Theta,\,n-m}$$
What: When we compute \(q_m^T k_n\), the two position rotations collapse into a SINGLE rotation by the relative offset \(n - m\).
Why this matters: Attention scores depend only on the distance between tokens, not their absolute positions. Token at position 5 attending to position 3 gets the same geometric relationship as position 100 attending to 98.
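Both properties (additive composition and the collapse to a relative offset) can be checked numerically. A small sketch, assuming \(d = 8\):

```python
import numpy as np

def rope_matrix(m: float, d: int = 8) -> np.ndarray:
    """Block-diagonal R_{Theta,m}: d/2 independent 2D rotations by m*theta_i."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = 1.0 / 10000 ** (2 * i / d)
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

# Additive composition: R_a R_b = R_{a+b}.
assert np.allclose(rope_matrix(3) @ rope_matrix(5), rope_matrix(8))
# The key identity: R_m^T R_n = R_{n-m}, so scores see only relative offsets.
assert np.allclose(rope_matrix(3).T @ rope_matrix(5), rope_matrix(2))
# Norm preservation: each R is orthogonal.
v = np.arange(8, dtype=float)
assert np.isclose(np.linalg.norm(rope_matrix(7) @ v), np.linalg.norm(v))
```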

3. Layer 0 — Previous Token Head

Goal: each position \(m\) attends to position \(m{-}1\) with maximum weight.

Key weight matrix — constant output
$$W_k \, x = (1, 0, 1, 0, \ldots, 1, 0)^T \quad \text{for ALL } x$$
What: \(W_k\) projects every token to the same vector \(c = (1,0,1,0,\ldots)\). The key carries NO information about token identity.
Why: We want attention to depend ONLY on position, not on what the token is. By making \(W_k x\) constant, the only distinguishing signal in \(k_n = R_{\Theta,n} \cdot c\) comes from the RoPE rotation at position \(n\).
What if W_k depended on the token? Then attention would mix positional and semantic signals. The head would no longer be a pure "previous position" detector.
Keys with RoPE
$$k_n = R_{\Theta,n} \cdot W_k \, x_n = R_{\Theta,n} \cdot c$$
What: Each key is the constant vector \(c\) rotated by position \(n\). Position 0's key is unrotated; position 1 is rotated by \(\theta_i\); position 2 by \(2\theta_i\); etc.
Query weight matrix — the key equation
$$\boxed{W_q = \alpha \cdot R_{\Theta,-1} \cdot W_k}$$
What: The query matrix is the key matrix, pre-rotated by \(-1\) step and scaled by \(\alpha\).
Why \(R_{\Theta,-1}\)? This introduces a "-1 offset" into the query. When RoPE adds the query's own position \(m\), the total rotation becomes \(m - 1\), which perfectly aligns with the key at position \(m - 1\). The \(-1\) is the "look back one step" instruction.
Why \(\alpha\)? Scaling amplifies the score difference between the matching position and non-matching positions. Larger \(\alpha\) → sharper attention → closer to a hard "select previous" operation.
What if we used \(R_{\Theta,-2}\)? Then position \(m\) would attend to \(m{-}2\), creating a "two tokens back" head instead!
Queries with RoPE
$$q_m = R_{\Theta,m} \cdot W_q \, x_m = R_{\Theta,m} \cdot \alpha \, R_{\Theta,-1} \cdot c = \alpha \cdot R_{\Theta,\,m-1} \cdot c$$
What: The rotations compose! \(R_{\Theta,m} \cdot R_{\Theta,-1} = R_{\Theta,m-1}\). So the query at position \(m\) looks like the key at position \(m{-}1\) (times \(\alpha\)).
Why this creates diagonal attention: \(q_m\) is essentially a rotated copy of \(c\) at angle \(m{-}1\), and \(k_n\) is a rotated copy of \(c\) at angle \(n\). They match best when \(n = m{-}1\).
Attention scores — deriving the diagonal
$$q_m^T k_n \;=\; \alpha \cdot c^T \, R_{\Theta,\,m-1}^T \, R_{\Theta,n} \, c \;=\; \alpha \cdot c^T \, R_{\Theta,\,n-m+1} \, c$$
$$= \alpha \sum_{i=0}^{d/2-1} \cos\!\big((n - m + 1)\,\theta_i\big)$$
What: The score is a sum of cosines at different frequencies, all evaluated at the relative offset \(\Delta = n - m + 1\).
When \(n = m{-}1\): \(\Delta = 0\), every cosine equals 1, score = \(\alpha \cdot d/2\) (MAXIMUM).
When \(n \neq m{-}1\): \(\Delta \neq 0\), the cosines point in different directions and partially cancel. The sum is strictly less than \(d/2\). More frequencies → better cancellation → sharper peak.
Analogy: Think of \(d/2\) clock hands all pointing up at \(\Delta=0\). At any other \(\Delta\), they point in different directions and their vertical sum is smaller.
After softmax — the attention pattern
$$\text{Attn}(m, n) = \text{softmax}_n\!\big(q_m^T k_n\big) \approx \begin{cases} 1 & \text{if } n = m{-}1 \\ 0 & \text{otherwise}\end{cases}$$
What: Softmax exponentiates the score gap. Since position \(m{-}1\) has the highest score by a margin proportional to \(\alpha\), it gets almost all the probability mass.
Result: A clear sub-diagonal band — each row puts ~100% attention on the previous position.
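The whole pattern can be verified numerically. In this sketch, \(d\), \(\alpha\), and the sequence length are assumed values, and attention is masked causally (positions only attend backward):

```python
import numpy as np

d, alpha, T = 8, 6.0, 10          # dimension, scale, and length are assumptions
theta = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)
c = np.tile([1.0, 0.0], d // 2)   # the constant key vector (1, 0, 1, 0, ...)

def rot(m):
    """Block-diagonal RoPE rotation R_{Theta,m}."""
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[cs, -sn], [sn, cs]]
    return R

K = np.stack([rot(n) @ c for n in range(T)])              # k_n = R_n c
Q = np.stack([alpha * rot(m - 1) @ c for m in range(T)])  # q_m = alpha R_{m-1} c
scores = Q @ K.T      # scores[m, n] = alpha * sum_i cos((n - m + 1) theta_i)
# Causal softmax over n <= m.
A = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
A = np.exp(A - A.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
# Every row m >= 1 concentrates on the previous position m-1.
assert all(A[m].argmax() == m - 1 for m in range(1, T))
```

Increasing `alpha` pushes each row's mass closer to exactly 1.0 on position m−1, as the softmax exponentiates the score gap.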
Output — copies the previous token's value
$$o_m^{(0)} = \sum_n \text{Attn}(m, n) \cdot v_n \;\approx\; v_{m-1}$$
What: Since attention concentrates on \(n = m{-}1\), the output at position \(m\) is approximately the value vector of the previous token.
Why this matters for induction: After this layer, position \(m\) now "knows" what token came before it. This information flows through the residual stream and is used by Layer 1 to complete the induction circuit.

4. Layer 1 — Semantic / Induction Head

Goal: find previous occurrences of the current token and predict what followed them.

Rank-1 weight matrices
$$W_k^{(1)} = u_k \, v_k^T, \qquad W_q^{(1)} = u_q \, v_q^T$$
What: Both key and query matrices are rank-1 (outer product of two vectors). This factorizes the computation into a "what to read" vector \(v\) and a "what to broadcast" vector \(u\).
Why rank-1? It makes the attention score decompose cleanly: \(q_m^T k_n\) becomes a product of two independent projections — one depending on the query token and one on the key token.
Keys and queries
$$k_n = u_k \cdot (v_k^T x_n), \qquad q_m = u_q \cdot (v_q^T x_m)$$
What: Each key/query is a fixed direction (\(u_k\) or \(u_q\)) scaled by how much the token projects onto the "reading" direction (\(v_k\) or \(v_q\)).
Why: \(v_k^T x_n\) is a scalar that captures "the semantic identity of token \(n\)". All keys point in the same direction \(u_k\) but with different magnitudes depending on the token.
Attention score — factored form
$$q_m^T k_n = \underbrace{(u_q^T u_k)}_{\text{constant}} \cdot \underbrace{(v_q^T x_m)}_{\text{query token}} \cdot \underbrace{(v_k^T x_n)}_{\text{key token}}$$
What: The score factors into three parts: (1) a global constant, (2) a query-side scalar, (3) a key-side scalar.
Why this enables semantic matching: If \(v_q = v_k\), then \(v_q^T x_m\) and \(v_k^T x_n\) are large for the same tokens, making identical tokens attend to each other strongly. This is "find tokens like me".
What if the matrix were full-rank? Then the score would be a general bilinear form \(x_m^T M x_n\), which could match ANY pair of tokens. Rank-1 constrains it to match tokens with the same projection — a simpler, more interpretable pattern.
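The factored form of the score is easy to verify numerically; the random vectors below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
u_k, v_k, u_q, v_q = rng.normal(size=(4, d))
W_k = np.outer(u_k, v_k)          # rank-1: u_k v_k^T
W_q = np.outer(u_q, v_q)          # rank-1: u_q v_q^T
x_m, x_n = rng.normal(size=(2, d))

q_m, k_n = W_q @ x_m, W_k @ x_n
# The score factors into (u_q . u_k) * (v_q . x_m) * (v_k . x_n).
factored = (u_q @ u_k) * (v_q @ x_m) * (v_k @ x_n)
assert np.isclose(q_m @ k_n, factored)
assert np.linalg.matrix_rank(W_k) == 1
```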
The induction mechanism
$$\text{Input: } \ldots A \; B \;\ldots\; A \;\; \underset{\uparrow}{\text{[predict here]}}$$ $$\text{Layer 0: each position learns its predecessor; the position of } B \text{ now knows its previous token was } A$$ $$\text{Layer 1: the last } A \text{ matches that key} \;\Rightarrow\; \text{attends to the position after the first } A \;\Rightarrow\; \text{predicts } B$$
The induction circuit in two steps:
Step 1 (Layer 0): Previous-token head copies each token's predecessor into the residual stream. After Layer 0, position \(i\) knows "my previous token was ___".
Step 2 (Layer 1): The semantic head at the last position sees token \(A\), searches for earlier occurrences of \(A\), and attends to the position AFTER it. Whatever followed \(A\) before (\(B\)) becomes the prediction.
Why two layers? Layer 0 shifts information backward by one step. Layer 1 uses that shifted information to perform the pattern match. Neither layer alone can do induction — it requires the composition.
What if there's no repeated token? No induction pattern fires. The model falls back to unigram/positional statistics.
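The two-step circuit can be sketched at the token level, abstracting away all the vector arithmetic. Taking the most recent match as the tie-breaker is an assumption of this sketch, not something the construction above specifies:

```python
def induct(seq):
    """Token-level sketch of the two-layer induction circuit."""
    # Step 1 (layer 0, previous-token head): each position learns its predecessor.
    prev = [None] + seq[:-1]
    # Step 2 (layer 1, semantic head): the last token attends to earlier positions
    # whose *predecessor* matches it; the token found there is the prediction.
    last = seq[-1]
    matches = [seq[n] for n in range(len(seq) - 1) if prev[n] == last]
    return matches[-1] if matches else None  # no repeat -> no induction prediction

print(induct(list("abcab")))   # the final 'b' was last followed by 'c' -> prints c
```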

5. Complete Forward Pass

Step-by-step from input tokens to prediction.

Step 1: Embed tokens
$$h_i^{(0)} = E[\text{token}_i] \in \mathbb{R}^d$$
What: Convert each token to its embedding vector. This is the initial "residual stream" state.
Step 2: Layer 0 — Previous Token Head
$$k_n^{(0)} = R_{\Theta,n} \cdot c \qquad (\text{constant key})$$ $$q_m^{(0)} = \alpha \cdot R_{\Theta,m-1} \cdot c$$ $$A^{(0)}_{m,n} = \text{softmax}_n\!\left(\alpha \sum_i \cos\!\big((n{-}m{+}1)\theta_i\big)\right) \approx \mathbf{1}[n = m{-}1]$$ $$o_m^{(0)} = \sum_n A^{(0)}_{m,n} \cdot V^{(0)} h_n^{(0)}$$
What: Each position gathers the value of its predecessor. The output \(o_m^{(0)}\) carries information about token \(m{-}1\).
Step 3: Residual connection after Layer 0
$$h_m^{(1)} = h_m^{(0)} + o_m^{(0)}$$
What: The residual stream now contains BOTH the current token's embedding AND the previous token's information, added together.
Why residual? Without the \(+\), Layer 1 would only see the previous token's info and lose the current token's identity. The residual connection preserves both signals.
Step 4: Layer 1 — Semantic / Induction Head
$$k_n^{(1)} = u_k \cdot (v_k^T h_n^{(1)})$$ $$q_m^{(1)} = u_q \cdot (v_q^T h_m^{(1)})$$ $$A^{(1)}_{m,n} = \text{softmax}_n\!\big(q_m^{(1)T} k_n^{(1)}\big)$$ $$o_m^{(1)} = \sum_n A^{(1)}_{m,n} \cdot V^{(1)} h_n^{(1)}$$
What: Semantic matching on the enriched residual stream. Because \(h^{(1)}\) contains both "who am I" and "who came before me", the rank-1 matching can find tokens with the same predecessor-current pair, enabling induction.
Step 5: Final residual and output
$$h_m^{(2)} = h_m^{(1)} + o_m^{(1)}$$ $$\text{logits}_m = W_{\text{out}} \cdot h_m^{(2)} + b_{\text{out}}$$ $$P(\text{next token} \mid \text{context}) = \text{softmax}(\text{logits}_m)$$
What: The final residual stream is projected to vocabulary logits. Softmax converts to probabilities.
For induction: \(h_m^{(2)}\) at the last position contains the value from the token that followed the previous occurrence. \(W_{\text{out}}\) maps this to high probability for that token.
Summary: Information flow
$$\underset{\text{embed}}{x_i} \;\xrightarrow{\text{Layer 0}}\; \underset{\text{+ prev token info}}{h^{(1)}} \;\xrightarrow{\text{Layer 1}}\; \underset{\text{+ induction match}}{h^{(2)}} \;\xrightarrow{W_\text{out}}\; \text{prediction}$$
The full picture: Embeddings → shifted context via previous-token head → residual enrichment → semantic match via induction head → prediction. Each layer adds exactly one piece of the puzzle.
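The five steps above can be wired together in a short numpy sketch. The sizes, \(\alpha\), and especially the randomly drawn value and output matrices are assumptions (the actual demo chooses these so the induction prediction comes out), so this sketch only checks shapes and normalization, not the final prediction:

```python
import numpy as np

d, V, T, alpha = 8, 13, 6, 6.0      # assumed toy sizes
tokens = [0, 1, 2, 0, 1, 2]         # a repeating pattern for induction
rng = np.random.default_rng(0)
theta = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)

def rot(m):
    """Block-diagonal RoPE rotation R_{Theta,m}."""
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def attend(Q, K, vals):
    """Causal softmax attention: each row m attends over columns n <= m."""
    S = np.where(np.tril(np.ones((T, T), dtype=bool)), Q @ K.T, -np.inf)
    W = np.exp(S - S.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    return W @ vals

# Step 1: deterministic sinusoidal embeddings.
E = 0.5 * np.sin(np.arange(V)[:, None] * np.pi / 13 + 0.1 * np.arange(d)[None, :])
h0 = E[tokens]                                             # (T, d)
# Step 2: layer 0, previous-token head (constant key c, query pre-rotated by -1).
c = np.tile([1.0, 0.0], d // 2)
K0 = np.stack([rot(n) @ c for n in range(T)])
Q0 = np.stack([alpha * rot(m - 1) @ c for m in range(T)])
V0 = rng.normal(size=(d, d)) / np.sqrt(d)                  # illustrative value matrix
# Step 3: residual connection.
h1 = h0 + attend(Q0, K0, h0 @ V0.T)
# Step 4: layer 1, rank-1 semantic head (v_q = v_k assumed: "find tokens like me").
u_q, u_k, v = rng.normal(size=(3, d))
Q1 = np.outer(h1 @ v, u_q)                                 # q_m = u_q (v^T h_m)
K1 = np.outer(h1 @ v, u_k)                                 # k_n = u_k (v^T h_n)
V1 = rng.normal(size=(d, d)) / np.sqrt(d)
h2 = h1 + attend(Q1, K1, h1 @ V1.T)
# Step 5: project to vocabulary logits and normalize.
W_out = rng.normal(size=(V, d)) / np.sqrt(d)
logits = h2 @ W_out.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert P.shape == (T, V) and np.allclose(P.sum(axis=1), 1.0)
```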
[Interactive demo panels: Token Embedding → Keys (k_n = R_{Θ,n} W_k x_n) → Queries (q_m = R_{Θ,m} W_q x_m) → Attention Scores → Attention Pattern (Softmax) → Output (previous context) → Semantic Keys/Queries (Rank-1) → Induction Pattern → Final Output → 🎯 Prediction]