Every equation, explained with intuition. Based on Chris Wendler's construction.
1. Token Embedding
Token to vector mapping
$$x_i = E[\text{token}_i] \in \mathbb{R}^d$$
What: Each token (character) is mapped to a dense vector of dimension \(d\).
The embedding matrix \(E\) has one row per vocabulary item.
Why: Neural nets need continuous vectors, not discrete symbols.
The embedding lets similar tokens have similar vectors (learned during training).
What if: If we used one-hot vectors instead, every token would be
equally "far" from every other — no notion of similarity. Dense embeddings let the model
generalize across similar tokens.
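The one-hot claim is easy to check numerically. A minimal sketch (the vocabulary size of 5 is an arbitrary illustrative choice):

```python
import numpy as np

# One-hot vectors: every distinct pair of tokens is exactly the same
# distance apart (sqrt(2)), so there is no notion of similarity at all.
vocab = 5
onehot = np.eye(vocab)
dists = [np.linalg.norm(onehot[i] - onehot[j])
         for i in range(vocab) for j in range(vocab) if i != j]
```

Every pairwise distance comes out identical, which is exactly the "no notion of similarity" problem dense embeddings avoid.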
In this demo
$$E[\text{token}_i]_j = \frac{1}{2}\sin\!\left(\text{idx}(i) \cdot \frac{\pi}{13} + 0.1j\right), \quad j = 0,\ldots,d{-}1$$
What: We use a deterministic sinusoidal embedding so every token gets a
unique, reproducible vector. Different tokens get different frequency patterns.
Why: In a real transformer, \(E\) is learned. Here we use a fixed function
so the demo is fully self-contained with no training step.
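A sketch of this fixed embedding in NumPy. The dimension `d = 8` is an assumption of this sketch, not fixed by the formula:

```python
import numpy as np

d = 8  # embedding dimension (assumed for this sketch)

def embed(token_idx: int) -> np.ndarray:
    """Deterministic embedding: 0.5 * sin(idx * pi/13 + 0.1*j) per dim j."""
    j = np.arange(d)
    return 0.5 * np.sin(token_idx * np.pi / 13 + 0.1 * j)

# Different tokens get different, fully reproducible vectors.
e_a, e_b = embed(0), embed(1)
```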
2. RoPE — Rotary Position Embedding
Frequency schedule
$$\theta_i = \frac{1}{10000^{\,2i/d}}, \quad i = 0, 1, \ldots, \tfrac{d}{2}{-}1$$
What: Each pair of dimensions rotates at a different frequency.
Low-index dimensions rotate fast (high \(\theta\)); high-index dimensions rotate slowly.
Why: Multiple frequencies create a unique "fingerprint" for each position.
A query and a key can align across ALL frequencies simultaneously only at one specific relative offset.
This is like how a combination lock needs all tumblers aligned.
What if: If all frequencies were the same, positions that differ by
\(2\pi/\theta\) would be indistinguishable — periodic aliasing.
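The schedule is one line of NumPy; `d = 8` is an assumed dimension for illustration:

```python
import numpy as np

# theta_i = 10000^(-2i/d): pair 0 rotates fastest (theta = 1),
# later pairs rotate ever more slowly.
d = 8
i = np.arange(d // 2)
theta = 1.0 / (10000.0 ** (2 * i / d))
```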
Rotation matrix at position m
$$R_{\Theta,m} = \text{diag}\!\begin{pmatrix}
\begin{bmatrix} \cos m\theta_0 & -\sin m\theta_0 \\ \sin m\theta_0 & \cos m\theta_0 \end{bmatrix},
\;\ldots\;,
\begin{bmatrix} \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \end{bmatrix}
\end{pmatrix}$$
What: A block-diagonal matrix of \(d/2\) independent 2D rotations.
Each 2D block rotates its pair of dimensions by angle \(m\theta_i\).
Why block-diagonal? It preserves norms (\(\|R_{\Theta,m} v\| = \|v\|\))
and composes cleanly: \(R_{\Theta,a} \cdot R_{\Theta,b} = R_{\Theta,a+b}\).
This composition property is the key to making relative position work!
Key composition property
$$R_{\Theta,m}^T \cdot R_{\Theta,n} = R_{\Theta,\,n-m}$$
What: When we compute \(q_m^T k_n\), the two position rotations collapse into a
SINGLE rotation by the relative offset \(n - m\).
Why this matters: Attention scores depend only on the distance
between tokens, not their absolute positions. Token at position 5 attending to
position 3 gets the same geometric relationship as position 100 attending to 98.
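Both properties — norm preservation and rotation composition — can be verified numerically. A sketch assuming `d = 8`:

```python
import numpy as np

# Block-diagonal RoPE matrix R_{Theta,m}: d/2 independent 2D rotations,
# pair i turning by angle m * theta_i (d = 8 is an assumed dimension).
d = 8
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope_matrix(m: float) -> np.ndarray:
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

# Composition: rotating by a, then by b, equals rotating by a + b.
composed = rope_matrix(2) @ rope_matrix(3)
```

The same function also confirms the key identity above: `rope_matrix(2).T @ rope_matrix(3)` equals `rope_matrix(1)`, a single rotation by the relative offset.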
3. Layer 0 — Previous Token Head
Goal: each position \(m\) attends to position \(m{-}1\) with maximum weight.
Key weight matrix — constant output
$$W_k \, x = (1, 0, 1, 0, \ldots, 1, 0)^T \quad \text{for ALL } x$$
What: \(W_k\) projects every token to the same vector \(c = (1,0,1,0,\ldots)\).
The key carries NO information about token identity.
Why: We want attention to depend ONLY on position, not on what the
token is. By making \(W_k x\) constant, the only distinguishing signal in
\(k_n = R_{\Theta,n} \cdot c\) comes from the RoPE rotation at position \(n\).
What if W_k depended on the token? Then attention would mix
positional and semantic signals. The head would no longer be a pure "previous position" detector.
Keys with RoPE
$$k_n = R_{\Theta,n} \cdot W_k \, x_n = R_{\Theta,n} \cdot c$$
What: Each key is the constant vector \(c\) rotated by position \(n\).
Position 0's key is unrotated; position 1's key is rotated by \(\theta_i\) in each 2D pair; position 2's by \(2\theta_i\); and so on.
Query weight matrix — the key equation
$$\boxed{W_q = \alpha \cdot R_{\Theta,-1} \cdot W_k}$$
What: The query matrix is the key matrix, pre-rotated by \(-1\) step and scaled by \(\alpha\).
Why \(R_{\Theta,-1}\)? This introduces a "-1 offset" into the query.
When RoPE adds the query's own position \(m\), the total rotation becomes \(m - 1\),
which perfectly aligns with the key at position \(m - 1\). The \(-1\) is the "look back one step" instruction.
Why \(\alpha\)? Scaling amplifies the score difference between the matching
position and non-matching positions. Larger \(\alpha\) → sharper attention → closer to a hard "select previous" operation.
What if we used \(R_{\Theta,-2}\)? Then position \(m\) would attend
to \(m{-}2\), creating a "two tokens back" head instead!
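This "what if" is easy to test: pre-rotating the query by \(R_{\Theta,-2}\) moves the best-matching key to position \(m-2\). A sketch with assumed values `d = 8`, `T = 6`, `alpha = 10`:

```python
import numpy as np

d, T, alpha = 8, 6, 10.0  # assumed dimensions and scale for this sketch
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

c = np.tile([1.0, 0.0], d // 2)                    # constant key direction
keys = np.stack([rope(n) @ c for n in range(T)])

# Pre-rotate by -2 instead of -1: a "two tokens back" head.
qs = [alpha * (rope(m) @ rope(-2) @ c) for m in range(2, T)]
offsets = [int(np.argmax(keys @ q)) for q in qs]   # best-matching key per query
```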
Queries with RoPE
$$q_m = R_{\Theta,m} \cdot W_q \, x_m
= R_{\Theta,m} \cdot \alpha \, R_{\Theta,-1} \cdot c
= \alpha \cdot R_{\Theta,\,m-1} \cdot c$$
What: The rotations compose! \(R_{\Theta,m} \cdot R_{\Theta,-1} = R_{\Theta,m-1}\).
So the query at position \(m\) looks like the key at position \(m{-}1\) (times \(\alpha\)).
Why this creates diagonal attention: \(q_m\) is essentially a rotated copy
of \(c\) at angle \(m{-}1\), and \(k_n\) is a rotated copy of \(c\) at angle \(n\).
They match best when \(n = m{-}1\).
Attention scores — deriving the diagonal
$$q_m^T k_n \;=\; \alpha \cdot c^T \, R_{\Theta,\,m-1}^T \, R_{\Theta,n} \, c
\;=\; \alpha \cdot c^T \, R_{\Theta,\,n-m+1} \, c$$
$$= \alpha \sum_{i=0}^{d/2-1} \cos\!\big((n - m + 1)\,\theta_i\big)$$
What: The score is a sum of cosines at different frequencies, all evaluated at the
relative offset \(\Delta = n - m + 1\).
When \(n = m{-}1\): \(\Delta = 0\), every cosine equals 1, score = \(\alpha \cdot d/2\) (MAXIMUM).
When \(n \neq m{-}1\): \(\Delta \neq 0\), the cosines point in different directions
and partially cancel. The sum is strictly less than \(d/2\). More frequencies → better cancellation → sharper peak.
Analogy: Think of \(d/2\) clock hands all pointing up at \(\Delta=0\).
At any other \(\Delta\), they point in different directions and their vertical sum is smaller.
After softmax — the attention pattern
$$\text{Attn}(m, n) = \text{softmax}_n\!\big(q_m^T k_n\big) \approx
\begin{cases} 1 & \text{if } n = m{-}1 \\ 0 & \text{otherwise}\end{cases}$$
What: Softmax exponentiates the score gap. Since position \(m{-}1\) has the highest score
by a margin proportional to \(\alpha\), it gets almost all the probability mass.
Result: A clear sub-diagonal band — each row puts ~100% attention on the previous position.
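Putting the pieces together, here is a sketch of the previous-token head's attention pattern (`d = 8`, sequence length 6, and `alpha = 10` are illustrative choices):

```python
import numpy as np

# Previous-token head: constant key c rotated by RoPE at each position,
# query pre-rotated by -1 and scaled by alpha.
d, T, alpha = 8, 6, 10.0
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

c = np.tile([1.0, 0.0], d // 2)                       # constant key direction
keys = np.stack([rope(n) @ c for n in range(T)])      # k_n = R_n c
queries = np.stack([alpha * (rope(m) @ rope(-1) @ c)  # q_m = alpha R_{m-1} c
                    for m in range(T)])

scores = queries @ keys.T                             # scores[m, n] = q_m . k_n
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)               # row-wise softmax
```

Each row `m >= 1` puts nearly all of its mass on column `m - 1` — the sub-diagonal band described above.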
Output — copies the previous token's value
$$o_m^{(0)} = \sum_n \text{Attn}(m, n) \cdot v_n \;\approx\; v_{m-1}$$
What: Since attention concentrates on \(n = m{-}1\), the output at position \(m\)
is approximately the value vector of the previous token.
Why this matters for induction: After this layer, position \(m\) now
"knows" what token came before it. This information flows through the residual stream
and is used by Layer 1 to complete the induction circuit.
4. Layer 1 — Semantic / Induction Head
Goal: find previous occurrences of the current token and predict what followed them.
Rank-1 weight matrices
$$W_k^{(1)} = u_k \, v_k^T, \qquad W_q^{(1)} = u_q \, v_q^T$$
What: Both key and query matrices are rank-1 (outer product of two vectors).
This factorizes the computation into a "what to read" vector \(v\) and a "what to broadcast" vector \(u\).
Why rank-1? It makes the attention score decompose cleanly:
\(q_m^T k_n\) becomes a product of two independent projections
— one depending on the query token and one on the key token.
Keys and queries
$$k_n = u_k \cdot (v_k^T x_n), \qquad q_m = u_q \cdot (v_q^T x_m)$$
What: Each key/query is a fixed direction (\(u_k\) or \(u_q\)) scaled by how much
the token projects onto the "reading" direction (\(v_k\) or \(v_q\)).
Why: \(v_k^T x_n\) is a scalar that captures "the semantic identity of token \(n\)".
All keys point in the same direction \(u_k\) but with different magnitudes depending on the token.
Attention score — factored form
$$q_m^T k_n = \underbrace{(u_q^T u_k)}_{\text{constant}} \cdot \underbrace{(v_q^T x_m)}_{\text{query token}} \cdot \underbrace{(v_k^T x_n)}_{\text{key token}}$$
What: The score factors into three parts:
(1) a global constant, (2) a query-side scalar, (3) a key-side scalar.
Why this enables semantic matching: If \(v_q = v_k\), then
\(v_q^T x_m\) and \(v_k^T x_n\) are large for the same tokens, making identical tokens attend
to each other strongly. This is "find tokens like me".
What if the matrix were full-rank? Then the score would be a general
bilinear form \(x_m^T M x_n\), which could match ANY pair of tokens.
Rank-1 constrains it to match tokens with the same projection — a simpler, more interpretable pattern.
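A quick numerical check of the factorization, with randomly drawn (purely illustrative) vectors and a shared reading direction \(v = v_q = v_k\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # assumed dimension for this sketch
u_q, u_k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
x_m, x_n = rng.normal(size=d), rng.normal(size=d)

q = u_q * (v @ x_m)        # q_m = u_q (v^T x_m): fixed direction, token-dependent scale
k = u_k * (v @ x_n)        # k_n = u_k (v^T x_n)
score = q @ k
# Factored form: global constant * query-side scalar * key-side scalar.
factored = (u_q @ u_k) * (v @ x_m) * (v @ x_n)
```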
The induction mechanism
$$\text{Input: } \ldots A \; B \;\ldots\; A \;\; \underset{\uparrow}{\text{[predict here]}}$$
$$\text{Layer 0 output at the position after the first } A\!: \text{ carries ``my previous token was } A\text{''}$$
$$\text{Layer 1: last } A \text{ matches that marker} \;\Rightarrow\; \text{attend to the position after the first } A \;\Rightarrow\; \text{predict } B$$
The induction circuit in two steps:
Step 1 (Layer 0): Previous-token head copies each token's predecessor into
the residual stream. After Layer 0, position \(i\) knows "my previous token was ___".
Step 2 (Layer 1): The semantic head at the last position sees token \(A\),
searches for earlier occurrences of \(A\), and attends to the position AFTER it.
Whatever followed \(A\) before (\(B\)) becomes the prediction.
Why two layers? Layer 0 shifts information backward by one step.
Layer 1 uses that shifted information to perform the pattern match.
Neither layer alone can do induction — it requires the composition.
What if there's no repeated token? No induction pattern fires.
The model falls back to unigram/positional statistics.
5. Complete Forward Pass
Step-by-step from input tokens to prediction.
Step 1: Embed tokens
$$h_i^{(0)} = E[\text{token}_i] \in \mathbb{R}^d$$
What: Convert each token to its embedding vector. This is the initial "residual stream" state.
Step 2: Layer 0 — Previous Token Head
$$k_n^{(0)} = R_{\Theta,n} \cdot c \qquad (\text{constant key})$$
$$q_m^{(0)} = \alpha \cdot R_{\Theta,m-1} \cdot c$$
$$A^{(0)}_{m,n} = \text{softmax}_n\!\left(\alpha \sum_i \cos\!\big((n{-}m{+}1)\theta_i\big)\right) \approx \mathbf{1}[n = m{-}1]$$
$$o_m^{(0)} = \sum_n A^{(0)}_{m,n} \cdot V^{(0)} h_n^{(0)}$$
What: Each position gathers the value of its predecessor. The output \(o_m^{(0)}\) carries
information about token \(m{-}1\).
Step 3: Residual connection after Layer 0
$$h_m^{(1)} = h_m^{(0)} + o_m^{(0)}$$
What: The residual stream now contains BOTH the current token's embedding AND
the previous token's information, added together.
Why residual? Without the \(+\), Layer 1 would only see the previous token's info
and lose the current token's identity. The residual connection preserves both signals.
Step 4: Layer 1 — Semantic / Induction Head
$$k_n^{(1)} = u_k \cdot (v_k^T h_n^{(1)})$$
$$q_m^{(1)} = u_q \cdot (v_q^T h_m^{(1)})$$
$$A^{(1)}_{m,n} = \text{softmax}_n\!\big(q_m^{(1)T} k_n^{(1)}\big)$$
$$o_m^{(1)} = \sum_n A^{(1)}_{m,n} \cdot V^{(1)} h_n^{(1)}$$
What: Semantic matching on the enriched residual stream. Because \(h^{(1)}\) contains
both "who am I" and "who came before me", the rank-1 matching can find tokens with the
same predecessor-current pair, enabling induction.
Step 5: Final residual and output
$$h_m^{(2)} = h_m^{(1)} + o_m^{(1)}$$
$$\text{logits}_m = W_{\text{out}} \cdot h_m^{(2)} + b_{\text{out}}$$
$$P(\text{next token} \mid \text{context}) = \text{softmax}(\text{logits}_m)$$
What: The final residual stream is projected to vocabulary logits. Softmax converts to probabilities.
For induction: \(h_m^{(2)}\) at the last position contains the value from the token
that followed the previous occurrence. \(W_{\text{out}}\) maps this to high probability for that token.
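The five steps can be sketched end to end. The token sequence, \(\alpha\), and the layer-1 directions below are illustrative choices, not the demo's actual weights, and both value matrices are taken as identity for brevity:

```python
import numpy as np

d, vocab = 8, 13  # assumed sizes for this sketch
theta = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

def rope(m):
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        cs, sn = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[cs, -sn], [sn, cs]]
    return R

def embed(idx):
    return 0.5 * np.sin(idx * np.pi / 13 + 0.1 * np.arange(d))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

tokens = [0, 1, 2, 0]                                # "A B C A"
T = len(tokens)
h0 = np.stack([embed(t) for t in tokens])            # Step 1: embed

# Step 2: Layer 0, previous-token head.
alpha, c = 10.0, np.tile([1.0, 0.0], d // 2)
K0 = np.stack([rope(n) @ c for n in range(T)])
Q0 = np.stack([alpha * (rope(m) @ rope(-1) @ c) for m in range(T)])
A0 = softmax_rows(Q0 @ K0.T)
h1 = h0 + A0 @ h0                                    # Step 3: residual (V^(0) = I)

# Step 4: Layer 1, rank-1 semantic head (illustrative directions).
rng = np.random.default_rng(0)
u_q = u_k = rng.normal(size=d)
v_q = v_k = rng.normal(size=d)
K1 = np.outer(h1 @ v_k, u_k)                         # rows k_n = u_k (v_k^T h_n)
Q1 = np.outer(h1 @ v_q, u_q)
A1 = softmax_rows(Q1 @ K1.T)
h2 = h1 + A1 @ h1                                    # Step 5: residual (V^(1) = I)

W_out = np.stack([embed(t) for t in range(vocab)])   # tie output to embeddings
logits = h2 @ W_out.T
probs = softmax_rows(logits)
```

This pipeline traces the shapes and information flow; with the hand-picked weights of the actual construction (rather than the random directions here), Layer 1's attention would lock onto the position after the first \(A\).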
Summary: Information flow
$$\underset{\text{embed}}{x_i} \;\xrightarrow{\text{Layer 0}}\;
\underset{\text{+ prev token info}}{h^{(1)}} \;\xrightarrow{\text{Layer 1}}\;
\underset{\text{+ induction match}}{h^{(2)}} \;\xrightarrow{W_\text{out}}\;
\text{prediction}$$
The full picture: Embeddings → shifted context via previous-token head →
residual enrichment → semantic match via induction head → prediction.
Each layer adds exactly one piece of the puzzle.