Positional Encoding in Transformers: From Sinusoidal to RoPE

Updated: Feb 14, 2026

Understand positional encoding in Transformers, from sinusoidal embeddings to RoPE, and learn how models capture token order and relative position.

I’m fascinated by how we encode word position inside Transformers.

If you remember a few years ago, when people were still mainly talking about recurrent networks like LSTMs, position came “for free” from the sequence itself: hidden states were built step by step, so the model naturally carried a notion of order through time.

In pure Transformers, that sequential recurrence is gone. Attention is parallel.
So without positional information, a model would treat token sets as if order barely mattered.

For an LLM, sentences like:

  • “The dog bites the man”

  • “The man bites the dog”

must not mean the same thing. The model needs position to know who is doing what to whom.

That is exactly where positional encoding enters.

Positional Encoding

In Attention Is All You Need, this is solved by adding a positional vector to each token embedding:

$$x_{pos} = e_{token} + p_{pos}$$

where:

  • $e_{token}$: token embedding

  • $p_{pos}$: positional encoding vector for position $pos$

The original sinusoidal formulation is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Intuitively, it is tempting to think "token 1 gets +1, token 2 gets +2, etc."
In reality each position receives a unique pattern of sine and cosine values, but the key point stands: this is still an absolute encoding of position.
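As a quick sanity check, the sinusoidal table can be built in a few lines of NumPy (the function name is mine, not from the paper):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional-encoding table from Attention Is All You Need."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cos
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Note that row 0 is $(\sin 0, \cos 0, \ldots) = (0, 1, \ldots)$: every position gets a distinct, fixed pattern.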

And this creates a practical issue for long contexts: absolute positions can be less robust when we care about relative distance between words (how far one token is from another), especially beyond training length.

We need a way to represent position relatively.

That is where RoPE (Rotary Positional Embedding) comes in.

RoPE (Rotary Positional Embedding)

Before explaining the method, remember: tokens are mapped to vectors, and vectors are objects we can transform—add, multiply, and also rotate in vector space.

RoPE’s core idea is simple and elegant:

We rotate representations by an angle that depends on token position.

Imagine comparing the word at position 10 and position 15.

  • Position 10: rotated by $10\theta$

  • Position 15: rotated by $15\theta$

  • Difference: $5\theta$

Now imagine those same words appear later at positions 100 and 105.

  • Position 100: rotated by $100\theta$

  • Position 105: rotated by $105\theta$

  • Difference: $5\theta$

The absolute angles changed, but the relative angle stayed the same.
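This invariance is easy to verify numerically. A minimal sketch with a single rotation frequency (values arbitrary):

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1                       # one rotation frequency, for illustration
q = np.array([1.0, 0.2])
k = np.array([0.5, -0.7])

# Same 5-position offset, different absolute positions:
near = rotate(q, 10 * theta) @ rotate(k, 15 * theta)
far = rotate(q, 100 * theta) @ rotate(k, 105 * theta)
print(np.isclose(near, far))  # True: the dot product sees only the offset
```

This works because $R(a)q \cdot R(b)k = q \cdot R(b-a)k$: the score depends only on the angular difference.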

That gives two big advantages over older approaches:

  1. Relative distance awareness
    The model cares about angular difference (relative offset), not only absolute location in the sequence.

  2. Better length extrapolation (with caveats)
    Because position is encoded as continuous rotations, models often generalize better to longer contexts than seen during training (not infinitely, but usually better than purely learned absolute tables).

RoPE explained mathematically

Now we go into the math.

I know many people dislike this part, but if we want to implement it properly, we need formal definitions—not only analogies.

Before formulas, let’s clarify where RoPE is applied.

RoPE is applied after creating Query and Key, and before computing their dot product.

Unlike old positional embeddings (added once at the input), RoPE is typically applied inside each attention layer.

The workflow (step by step)

Inside one attention head:

  1. Input
    Token representation arrives from embeddings + previous layers.

  2. Linear projections (create Q, K, V)

  3. Apply RoPE to Q and K (usually not to V)

  4. Attention score (dot product of the rotated Q and K)

  5. Softmax and output
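The steps above can be sketched end to end in NumPy. Here RoPE is applied via the equivalent complex-number trick: each (even, odd) dimension pair is treated as one complex number and multiplied by $e^{im\theta_i}$. All names are illustrative, not from any particular library:

```python
import numpy as np

def attention_head_with_rope(x, Wq, Wk, Wv, theta_base=10000.0):
    """One attention head following the workflow above (illustrative sketch)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # 2. linear projections
    seq_len, d = q.shape
    theta = theta_base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    rot = np.exp(1j * np.outer(np.arange(seq_len), theta))

    def rope(t):
        # 3. view (even, odd) pairs as complex numbers, rotate by m * theta_i
        tc = (t[:, 0::2] + 1j * t[:, 1::2]) * rot
        out = np.empty_like(t)
        out[:, 0::2], out[:, 1::2] = tc.real, tc.imag
        return out

    q, k = rope(q), rope(k)                           # RoPE on Q and K only
    scores = (q @ k.T) / np.sqrt(d)                   # 4. attention scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax
    return w @ v                                      # 5. weighted output
```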

RoPE pairs dimensions into 2D planes.
For each pair $i$ at position $m$, we rotate by angle $m\theta_i$.

Define the 2D rotation matrix:

$$R(m\theta_i) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$

For the $i$-th pair $(x_{2i}, x_{2i+1})$ of a vector $x$ at position $m$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = R(m\theta_i)\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

with frequencies typically chosen as:

$$\theta_i = 10000^{-2i/d}$$

Equivalent component form:

$$x'_{2i} = x_{2i}\cos(m\theta_i) - x_{2i+1}\sin(m\theta_i)$$
$$x'_{2i+1} = x_{2i}\sin(m\theta_i) + x_{2i+1}\cos(m\theta_i)$$
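The matrix form and the component form are interchangeable. A quick numerical check for a single dimension pair (values arbitrary):

```python
import numpy as np

# One pair (i fixed) at an arbitrary position m:
m, theta_i = 7, 0.05
x_pair = np.array([0.3, -1.2])

# Rotation-matrix form
R = np.array([[np.cos(m * theta_i), -np.sin(m * theta_i)],
              [np.sin(m * theta_i),  np.cos(m * theta_i)]])
matrix_form = R @ x_pair

# Component form
component_form = np.array([
    x_pair[0] * np.cos(m * theta_i) - x_pair[1] * np.sin(m * theta_i),
    x_pair[0] * np.sin(m * theta_i) + x_pair[1] * np.cos(m * theta_i),
])
print(np.allclose(matrix_form, component_form))  # True
```

In practice implementations use the component form, since materializing a block-diagonal rotation matrix would be wasteful.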

When we talk about algorithms within Transformers, I always love to get under the hood and see how they actually tick. Matrix calculations can be handled in countless ways; some are memory-efficient, others are processing-heavy, and finding the sweet spot is an art form.

In many cases, optimizing these operations into custom CUDA kernels significantly accelerates the training and inference of machine learning models.

Today, I want to highlight a specific Pull Request from the Unsloth repository—PR #238.

This PR focuses on a substantial optimization of the RoPE (Rotary Positional Embeddings) algorithm, squeezing even more performance out of the hardware.

Unsloth’s PR #238 Optimization

PR #238 is a performance-critical update to the Unsloth library that surgically optimizes the RoPE (Rotary Positional Embeddings) algorithm.

In standard deep learning frameworks, RoPE calculations are "memory-bound," meaning they waste too much time moving data in and out of the GPU's memory (VRAM).

Unsloth solves this by implementing a custom Triton kernel that fuses operations and handles memory "in-place."

The result is a dramatic reduction in VRAM usage and a significant boost in training speed, allowing large models to be fine-tuned on consumer-grade hardware.

1. Breaking the "Memory Wall"

In a standard PyTorch implementation, rotating an embedding vector using RoPE requires multiple sequential steps:

  1. Load the tensor from VRAM.

  2. Calculate sine and cosine components.

  3. Create a temporary intermediate tensor to store the rotated values.

  4. Write the final result back to VRAM.

This "round-trip" to the VRAM is the primary bottleneck. The GPU's compute cores are incredibly fast, but waiting for data to travel from the memory chips to the processor is slow. This is known as being memory-bound.

2. Kernel Fusion via Triton

Unsloth utilizes Triton, a language that allows developers to write highly parallelized GPU kernels with more flexibility than standard Python. PR #238 "fuses" these operations. Instead of multiple steps, the kernel loads a chunk of the embedding into the SRAM (the ultra-fast, on-chip cache). All rotations, multiplications, and additions happen right there, in one go. By the time the data leaves the chip, the calculation is finished, drastically reducing the number of read/write cycles to the VRAM.

3. Optimized Backward Pass & Longer Context

Training requires calculating gradients (backpropagation). Normally, the system must store "activations" from the forward pass to calculate these gradients later, which eats up massive amounts of memory. This PR optimizes the backward pass by mathematically deriving the gradients within the kernel itself. Because it doesn't need to store heavy intermediate states, the memory footprint drops. This is why Unsloth can handle longer context windows (like jumping from 4k to 32k or even 128k tokens) without hitting an "Out of Memory" (OOM) error.

4. Efficient In-place Tensor Manipulation

One of the most effective tricks in this PR is the use of in-place operations. Rather than allocating a new block of memory for the "rotated" version of the data, the kernel overwrites the original data in memory. At the scale of billions of parameters and thousands of tokens, avoiding these "clones" saves gigabytes of VRAM, making the difference between a model that crashes and one that trains smoothly on an RTX 3090/4090.
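This is not Unsloth's actual Triton code, but the in-place idea can be illustrated in NumPy: rotate the vector where it lives, keeping only a half-size scratch buffer instead of allocating a full clone (function name is mine):

```python
import numpy as np

def rope_inplace(x, m, theta_base=10000.0):
    """Rotate one vector x at position m in place; no full-size copy."""
    d = x.shape[0]
    ang = m * theta_base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(ang), np.sin(ang)
    even = x[0::2].copy()                 # half-size scratch, not a full clone
    x[0::2] = even * cos - x[1::2] * sin  # overwrite even dims first
    x[1::2] = even * sin + x[1::2] * cos  # then odd dims, reusing saved evens
    return x
```

The half-size scratch is needed because the second line still reads the original even components; real fused kernels keep that scratch in registers or SRAM rather than in VRAM.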
