Attention

Introduction In this post, I explain the RoPE (Rotary Position Embedding) implementation mathematically. I assume readers have some high-level understanding of sinusoidal positional encoding and Rotary PE. The implementation discussed here is based on HuggingFace’s Transformers library (specifically the Llama model). RoPE was introduced in the paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” by Su et al. (2021). The core idea is to encode positional information by rotating query and key vectors before computing attention scores — rather than adding a fixed positional vector to token embeddings. This results in an elegant property: attention scores naturally encode relative position between tokens. ...