Introduction

In this post, I walk through the mathematics behind the RoPE (Rotary Position Embedding) implementation. I assume readers have a high-level understanding of sinusoidal positional encoding and of RoPE itself.

The implementation discussed here is based on HuggingFace’s Transformers library (specifically the Llama model).

RoPE was introduced in the paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” by Su et al. (2021). The core idea is to encode positional information by rotating query and key vectors before computing attention scores — rather than adding a fixed positional vector to token embeddings. This results in an elegant property: attention scores naturally encode relative position between tokens.


RoPE Implementation in the Llama Model

Note: Some code has been removed for readability.

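import torch


class LlamaRotaryEmbedding(torch.nn.Module):
    inv_freq: torch.Tensor

    def __init__(self, config, device=None):
        super().__init__()

        self.config = config
        base = self.config.base_theta
        dim = self.config.hidden_size
        num_heads = self.config.num_attn_heads
        head_dim = dim // num_heads

        # One inverse frequency per 2D plane of a head:
        # exponents 0/head_dim, 2/head_dim, ..., (head_dim - 2)/head_dim.
        inv_freq = 1.0 / (
            base ** (
                torch.arange(0, head_dim, 2, dtype=torch.int64) / head_dim)
        )

        # Scaling factor applied to cos/sin in forward(). Assumed to be 1.0 here
        # (no scaling), since the code that sets it was removed for readability.
        self.attention_scaling = 1.0

        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.register_buffer("original_inv_freq", inv_freq.clone(), persistent=False)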

    @torch.no_grad()
    def forward(self, x, position_ids):
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
        position_ids_expanded = position_ids[:, None, :].float()

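        # Upstream uses device_type to disable autocast around the matmul below;
        # that wrapper is omitted here, so the variable is unused in this excerpt.
        device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"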

        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos() * self.attention_scaling
        sin = emb.sin() * self.attention_scaling

        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

Simplified Version

def simple_rope():
    sq_len = 100
    base_theta = 10000
    head_dim = 64

    q = torch.randn(size=(sq_len, head_dim))
    k = torch.randn(size=(sq_len, head_dim))

    inv_freq = 1.0 / (base_theta ** (torch.arange(0, head_dim, 2, dtype=torch.int64) / head_dim))

    position_ids = torch.arange(0, sq_len)
    freq = torch.outer(position_ids, inv_freq)
    cos = torch.cat([freq.cos(), freq.cos()], dim=-1)
    sin = torch.cat([freq.sin(), freq.sin()], dim=-1)

    # Applying the rotation
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


q_embed, k_embed = simple_rope()

Explanation

In the code above, we rotate the query and key vectors using sinusoidal functions:

q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)

$$ q' = q \cdot \cos + \text{rotate\_half}(q) \cdot \sin $$

$$ k' = k \cdot \cos + \text{rotate\_half}(k) \cdot \sin $$

This is equivalent to applying a rotation matrix to each query and key vector:

$$ q' = R(m)\, q \qquad k' = R(n)\, k $$

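where $m$ and $n$ are the position indices of the query and key tokens respectively, and $R(m)$ is shorthand for rotating every 2D plane of the vector by an angle proportional to the position $m$. Within a single plane, rotation by an angle $\theta$ is the 2D rotation matrix $R(\theta)$: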

$$ R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} $$

For a 2D vector $\begin{bmatrix}x \\ y\end{bmatrix}$, rotation gives:

$$ R(\theta) \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{bmatrix} $$

$$ x' = x\cos\theta - y\sin\theta $$

$$ y' = x\sin\theta + y\cos\theta $$
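As a quick sanity check of these two formulas, here is a minimal sketch in plain Python (no PyTorch needed). The helper name `rotate_2d` and the test point are illustrative choices, not part of the HuggingFace code. It also confirms that a rotation leaves the vector's length unchanged.

```python
import math


def rotate_2d(x, y, theta):
    """Rotate the point (x, y) counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c


# Rotating (1, 0) by 90 degrees should land (up to rounding) on (0, 1).
x_new, y_new = rotate_2d(1.0, 0.0, math.pi / 2)

# A rotation never changes the length of the vector.
norm_before = math.hypot(1.0, 0.0)
norm_after = math.hypot(x_new, y_new)
print(x_new, y_new, norm_before, norm_after)
```

Because the norm is preserved, rotating queries and keys changes only their direction, not their magnitude.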

Key Property: Relative Position in Attention

We rotate query and key independently. The relative rotation naturally emerges during the attention dot product:

$$ q_m^T k_n = q^T R(m)^T R(n)\, k = q^T R(n - m)\, k $$

This holds because:

$$ R(m)^T = R(-m) $$

$$ R(m)^T R(n) = R(-m)\, R(n) = R(n - m) $$
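The last identity can be verified by multiplying the two matrices out and applying the angle-difference formulas. In this worked step, $m$ and $n$ stand for the rotation angles of a single plane at the two positions:

$$ R(-m)\,R(n) = \begin{bmatrix} \cos m & \sin m \\ -\sin m & \cos m \end{bmatrix} \begin{bmatrix} \cos n & -\sin n \\ \sin n & \cos n \end{bmatrix} = \begin{bmatrix} \cos m\cos n + \sin m\sin n & -(\sin n\cos m - \cos n\sin m) \\ \sin n\cos m - \cos n\sin m & \cos m\cos n + \sin m\sin n \end{bmatrix} $$

$$ = \begin{bmatrix} \cos(n - m) & -\sin(n - m) \\ \sin(n - m) & \cos(n - m) \end{bmatrix} = R(n - m) $$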

This is the most elegant property of RoPE. The attention score between a query at position $m$ and a key at position $n$ depends only on the relative distance $(n - m)$, not on their absolute positions. This gives the model an inherent sense of relative position without needing a separate relative positional encoding scheme.


Relative Position Matrix in Attention

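The signed offset $(m - n)$ for every query-key pair in a sequence of length 9 is shown below, with the query position $m$ indexing rows and the key position $n$ indexing columns. The rotation in the attention score uses the negated value, $(n - m)$: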

$$ \begin{bmatrix} 0 & -1 & -2 & -3 & -4 & -5 & -6 & -7 & -8 \\ 1 & 0 & -1 & -2 & -3 & -4 & -5 & -6 & -7 \\ 2 & 1 & 0 & -1 & -2 & -3 & -4 & -5 & -6 \\ 3 & 2 & 1 & 0 & -1 & -2 & -3 & -4 & -5 \\ 4 & 3 & 2 & 1 & 0 & -1 & -2 & -3 & -4 \\ 5 & 4 & 3 & 2 & 1 & 0 & -1 & -2 & -3 \\ 6 & 5 & 4 & 3 & 2 & 1 & 0 & -1 & -2 \\ 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 & -1 \\ 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \end{bmatrix} $$
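The same matrix can be generated in a few lines. This is a small sketch in plain Python, assuming zero-based positions with query positions `m` as rows and key positions `n` as columns. The variable name `rel_pos` is made up for illustration.

```python
seq_len = 9

# Row index m is the query position, column index n is the key position.
rel_pos = [[m - n for n in range(seq_len)] for m in range(seq_len)]

for row in rel_pos:
    print(" ".join(f"{v:2d}" for v in row))
```

Every diagonal holds a single constant value, which is another way of saying that an entry depends only on the offset between the two positions.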

Step-by-Step Walkthrough

Token Representation

Before diving into the steps, it helps to understand how a token embedding is laid out in memory. For a token with head_dim = 6, the embedding is arranged as:

$$ \text{query} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} $$

A vector of length $d$ has $d/2$ planes, where each plane is an $(x_i, y_i)$ pair that will be rotated together:

$$ \text{planes} = [x_1, y_1],\; [x_2, y_2],\; [x_3, y_3],\; \ldots,\; [x_{d/2}, y_{d/2}] $$

In the HuggingFace RoPE implementation, the embedding is arranged as $[x_1, x_2, \ldots, x_{d/2},\; y_1, y_2, \ldots, y_{d/2}]$ — all $x$-values first, then all $y$-values (non-interleaved layout). The rotate_half function is specifically designed for this layout. An alternative layout places pairs together: $[x_1, y_1, x_2, y_2, \ldots]$ (interleaved), but this implementation uses the former.
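To make the difference between the two layouts concrete, the sketch below extracts the $(x_i, y_i)$ planes from each one. It is plain Python with string placeholders instead of numbers. The helper names `planes_split_half` and `planes_interleaved` are hypothetical and do not appear in the library.

```python
def planes_split_half(vec):
    """Pair element i with element i + d/2 (all x-values first, then all y-values)."""
    half = len(vec) // 2
    return [(vec[i], vec[i + half]) for i in range(half)]


def planes_interleaved(vec):
    """Pair adjacent elements (x_1, y_1, x_2, y_2, ...)."""
    return [(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]


split_half = ["x1", "x2", "x3", "y1", "y2", "y3"]
interleaved = ["x1", "y1", "x2", "y2", "x3", "y3"]

print(planes_split_half(split_half))
print(planes_interleaved(interleaved))
```

Both calls produce the same three planes. Only the memory order of the components differs, which is why code written for one layout cannot be applied to the other unchanged.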


Step 1: Inverse Frequency

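inv_freq = 1.0 / (base_theta ** (torch.arange(0, head_dim, 2, dtype=torch.int64) / head_dim))

$$ \text{inv\_freq}_i = \frac{1}{\text{base\_theta}^{2i / \text{head\_dim}}}, \qquad i = 0, 1, \ldots, \frac{\text{head\_dim}}{2} - 1 $$

The index $i$ is zero-based, so $\text{inv\_freq}_i$ belongs to plane $[x_{i+1}, y_{i+1}]$.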

[Figure: Inverse Frequency]

What is inverse frequency?

Inverse frequency controls the rotation speed for each plane in the token embedding. The early planes (low index $i$) have higher inverse frequency values and rotate faster — more per unit of position. The later planes have lower values and rotate slowly.
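A small numeric sketch shows how quickly the values fall off. It uses plain Python and assumes `base_theta = 10000` with a deliberately tiny `head_dim = 8` so the numbers are easy to read. Both values are chosen for illustration only.

```python
base_theta = 10000
head_dim = 8

# Exponents 0/d, 2/d, ..., (d-2)/d: one inverse frequency per plane.
inv_freq = [1.0 / (base_theta ** (j / head_dim)) for j in range(0, head_dim, 2)]

# Approximately 1, 0.1, 0.01, 0.001 (up to floating-point rounding).
print(inv_freq)
```

With these settings each plane rotates ten times more slowly than the one before it.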

Frequency Rotation

For position $p$, the plane with frequency index $i$ is rotated by the angle:

$$ \theta_i^{(p)} = p \times \text{inv\_freq}_i = \frac{p}{\text{base\_theta}^{2i/d}} $$

  • The first plane $[x_1, y_1]$ rotates by the largest angle per position (highest frequency).
  • The last plane $[x_{d/2}, y_{d/2}]$ rotates by the smallest angle per position (lowest frequency).

Why does this help?

Different planes capture positional information at different scales:

| Frequency | Behavior | Encodes |
| --- | --- | --- |
| High (early planes) | Oscillates rapidly | Fine-grained, local positional differences; distinguishes nearby tokens |
| Low (later planes) | Oscillates slowly | Coarse-grained, global positional information; distinguishes tokens far apart |
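One way to put numbers on these scales is to compute each plane's wavelength, meaning the number of positions it takes to complete one full turn ($2\pi / \text{inv\_freq}_i$). The sketch below is plain Python and again assumes `base_theta = 10000` and a toy `head_dim = 8`. The name `wavelengths` is just illustrative.

```python
import math

base_theta = 10000
head_dim = 8

wavelengths = []
for j in range(0, head_dim, 2):
    inv_freq = 1.0 / (base_theta ** (j / head_dim))
    # Positions needed for this plane to complete one full rotation.
    wavelengths.append(2 * math.pi / inv_freq)

# Roughly 6.3, 63, 630 and 6300 positions per full turn.
print([round(w, 1) for w in wavelengths])
```

The fastest plane repeats after about six tokens, so it can only tell close neighbours apart. The slowest plane needs thousands of tokens to complete a turn and therefore still varies across long distances.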

This multi-scale strategy is directly inspired by sinusoidal positional encodings from the original Transformer paper (“Attention Is All You Need”, Vaswani et al.). The key difference is that RoPE applies these frequencies as rotations (multiplicative), not as additions to the embedding. This makes RoPE compatible with the dot-product attention mechanism in a way that preserves the relative position property shown above.


Step 2: Rotation Angles

position_ids = torch.arange(0, sq_len)
freq = torch.outer(position_ids, inv_freq)

torch.outer computes the outer product of position_ids and inv_freq, producing a rotation angle for every (position, plane) pair:

$$ \text{PE}(\text{pos},\, i) = \text{pos} \times \text{inv\_freq}_i = \frac{\text{pos}}{\text{base\_theta}^{2i/d}} $$

For each position $p$, the frequency vector has $d/2$ entries:

$$ \left[\frac{p}{10000^{0/d}},\; \frac{p}{10000^{2/d}},\; \frac{p}{10000^{4/d}},\; \ldots,\; \frac{p}{10000^{(d-2)/d}}\right] $$

The full frequency matrix across all positions $p = 0, 1, \ldots, n - 1$ (with $n = \text{seq\_len}$) is:

$$ \text{freq} = \begin{bmatrix} \frac{0}{10000^{0/d}} & \frac{0}{10000^{2/d}} & \cdots & \frac{0}{10000^{(d-2)/d}} \\ \frac{1}{10000^{0/d}} & \frac{1}{10000^{2/d}} & \cdots & \frac{1}{10000^{(d-2)/d}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{n-1}{10000^{0/d}} & \frac{n-1}{10000^{2/d}} & \cdots & \frac{n-1}{10000^{(d-2)/d}} \end{bmatrix} \in \mathbb{R}^{\text{seq\_len} \times d/2} $$

where $d = \text{head\_dim}$.
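The outer product is easy to reproduce by hand. This sketch builds the same angle table with nested list comprehensions in plain Python, assuming a toy configuration (`head_dim = 8`, `seq_len = 4`, `base_theta = 10000`) picked only to keep the output small.

```python
base_theta = 10000
head_dim = 8
seq_len = 4

inv_freq = [1.0 / (base_theta ** (j / head_dim)) for j in range(0, head_dim, 2)]

# Outer product: one row per position, one column per plane.
freq = [[p * w for w in inv_freq] for p in range(seq_len)]

for p, row in enumerate(freq):
    print(p, [round(a, 4) for a in row])
```

The table has `seq_len` rows and `head_dim / 2` columns, and the row for position 0 is all zeros, so the first token is not rotated at all.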


Step 3: Query and Key Rotation

We apply cosine and sine to the frequency matrix, then concatenate each result with itself so that it matches the full head dimension:

cos = freq.cos()
sin = freq.sin()
cos = torch.cat([cos, cos], dim=-1)
sin = torch.cat([sin, sin], dim=-1)

[Figures: cos plot; cos imshow; sin imshow]

Why concatenate cos and sin twice?

The query vector has the layout $[x_1, \ldots, x_{d/2},\; y_1, \ldots, y_{d/2}]$. Both the $x$-half and the $y$-half need the same set of per-plane cosine (and sine) values. Concatenating [cos, cos] and [sin, sin] ensures every component gets the correct angle.
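The sketch below illustrates the duplication for a single position in plain Python. The three angles are arbitrary placeholder values, and list concatenation stands in for `torch.cat([cos, cos], dim=-1)`.

```python
import math

# Hypothetical rotation angles for planes 1..3 at one position.
angles = [0.5, 0.25, 0.125]
half = len(angles)

cos_half = [math.cos(a) for a in angles]
cos_full = cos_half + cos_half  # plays the role of torch.cat([cos, cos], dim=-1)

# Index i holds x_i and index i + half holds y_i; both see the same cosine.
for i in range(half):
    print(i, i + half, cos_full[i], cos_full[i + half])
```

After duplication, the entries at index `i` and `i + half` are identical, so the $x$ and $y$ components of a plane are always multiplied by the cosine of the same angle.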

[Figures: cos query; sin query]


The rotate_half function

def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

This rearranges $[x_1, x_2, x_3,\; y_1, y_2, y_3]$ into $[-y_1, -y_2, -y_3,\; x_1, x_2, x_3]$:

$$ \text{rotate\_half}(q) = \begin{bmatrix} -y_1 \\ -y_2 \\ -y_3 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} $$
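The same rearrangement can be tried on a plain Python list. The helper `rotate_half_list` is an illustrative stand-in for the tensor version and is not part of the library.

```python
def rotate_half_list(x):
    """List version of rotate_half: [x1.., y1..] becomes [-y1.., x1..]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1


q = [1, 2, 3, 4, 5, 6]
rotated = rotate_half_list(q)             # [-4, -5, -6, 1, 2, 3]
rotated_twice = rotate_half_list(rotated)  # [-1, -2, -3, -4, -5, -6]
print(rotated, rotated_twice)
```

Applying the function twice negates the vector. That matches the geometric reading: `rotate_half` turns every plane by 90 degrees, and two such turns make a 180 degree rotation.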

Putting it all together

Step A — element-wise multiply query by cos:

$$ \cos \times q = \begin{bmatrix} x_1\cos\theta_1 \\ x_2\cos\theta_2 \\ x_3\cos\theta_3 \\ y_1\cos\theta_1 \\ y_2\cos\theta_2 \\ y_3\cos\theta_3 \end{bmatrix} $$

Step B — element-wise multiply rotate_half(q) by sin:

$$ \sin \times \text{rotate\_half}(q) = \begin{bmatrix} -y_1\sin\theta_1 \\ -y_2\sin\theta_2 \\ -y_3\sin\theta_3 \\ x_1\sin\theta_1 \\ x_2\sin\theta_2 \\ x_3\sin\theta_3 \end{bmatrix} $$

Step C — add them:

$$ q' = \cos \cdot q + \sin \cdot \text{rotate\_half}(q) =\begin{bmatrix} x_1\cos\theta_1 - y_1\sin\theta_1 \\ x_2\cos\theta_2 - y_2\sin\theta_2 \\ x_3\cos\theta_3 - y_3\sin\theta_3 \\ y_1\cos\theta_1 + x_1\sin\theta_1 \\ y_2\cos\theta_2 + x_2\sin\theta_2 \\ y_3\cos\theta_3 + x_3\sin\theta_3 \end{bmatrix} $$

This is precisely the 2D rotation applied independently to each plane $[x_i, y_i]$ with its own angle $\theta_i$:

$$ x_i' = x_i\cos\theta_i - y_i\sin\theta_i $$

$$ y_i' = x_i\sin\theta_i + y_i\cos\theta_i $$

Each plane is rotated by a different angle $\theta_i$ — the planes associated with higher-frequency dimensions rotate more per position step, while lower-frequency planes rotate less. This is what gives the model a rich, multi-scale representation of position.
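The equivalence derived in Steps A to C can be checked numerically. The sketch below, in plain Python, computes the result once with the `cos`/`sin`/`rotate_half` formula and once by rotating each plane explicitly. All helper names, the example vector, and the angles are hypothetical.

```python
import math


def rotate_half_list(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_split_half(vec, angles):
    """q * cos + rotate_half(q) * sin, with cos and sin duplicated across both halves."""
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half_list(vec)
    return [v * c + r * s for v, c, r, s in zip(vec, cos, rot, sin)]


def rope_per_plane(vec, angles):
    """Explicit 2D rotation of each (x_i, y_i) plane by its own angle."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i, a in enumerate(angles):
        x, y = vec[i], vec[i + half]
        out[i] = x * math.cos(a) - y * math.sin(a)
        out[i + half] = x * math.sin(a) + y * math.cos(a)
    return out


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
angles = [0.3, 0.2, 0.1]

via_formula = rope_split_half(q, angles)
via_planes = rope_per_plane(q, angles)
max_diff = max(abs(u - v) for u, v in zip(via_formula, via_planes))
print(max_diff)
```

The two results agree up to floating-point rounding, which is the numeric counterpart of the algebra above.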


Step 4: Relative Rotation in Attention

After rotating both query and key, the attention dot product is:

$$ q_m^T k_n = (R(m)\,q)^T (R(n)\,k) = q^T R(m)^T R(n)\, k = q^T R(n - m)\, k $$

The attention score depends only on the relative position $(n - m)$ between the query at position $m$ and the key at position $n$. This is the core property that makes RoPE a relative positional encoding, even though it is applied at the individual token level.
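A numeric experiment makes this tangible. The sketch below, in plain Python, rotates one query and one key at several positions and compares the dot products. The helpers `rope` and `dot`, the 4-dimensional vectors, and the chosen positions are all illustrative assumptions.

```python
import math


def rope(vec, position, base_theta=10000):
    """Rotate each (x_i, y_i) plane of a split-half vector by position * inv_freq_i."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = position / (base_theta ** (2 * i / d))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + half]
        out[i] = x * c - y * s
        out[i + half] = x * s + y * c
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 1.5]
k = [1.0, 0.25, -0.5, 2.0]

score_a = dot(rope(q, 3), rope(k, 7))    # n - m = 4
score_b = dot(rope(q, 10), rope(k, 14))  # n - m = 4, shifted by 7 positions
score_c = dot(rope(q, 3), rope(k, 8))    # n - m = 5
print(score_a, score_b, score_c)
```

The first two scores match up to rounding because the offset is the same, while the third differs because the offset changed.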

Because the model attends based on relative distance, it is in principle less tied to the exact absolute position values seen during training than a model with fixed absolute positional encodings. In practice, quality still tends to degrade well beyond the training length, which is why RoPE scaling methods such as position interpolation, NTK-aware scaling, and YaRN are commonly applied for longer contexts.


Summary

Here is a concise summary of the full RoPE pipeline:

| Step | Operation | Formula |
| --- | --- | --- |
| 1 | Compute inverse frequencies | $\text{inv\_freq}_i = 1 / \text{base}^{2i/d}$ |
| 2 | Compute rotation angles | $\theta_i^{(p)} = p \times \text{inv\_freq}_i$ |
| 3 | Build cos/sin tables | Duplicate along last dim to cover both $x$ and $y$ halves |
| 4 | Rotate queries and keys | $q' = q \cdot \cos + \text{rotate\_half}(q) \cdot \sin$ |
| 5 | Attention score | $q_m^T k_n = q^T R(n-m)\, k$ (relative position encoded implicitly) |