Understanding Transformers
Positional Embedding

In a transformer model, the self-attention layers treat their input as an unordered set, so positional information is absent when a sentence is fed into the model. Positional embeddings are added to the input token embeddings to give the model a sense of word order.

Consider a sequence of words. We want the model to understand the position of each word in the sequence. To do this, we create a unique "positional embedding" for each position using sine and cosine functions.

Sine and cosine waves are special because they repeat in a predictable pattern. If you move along the wave, you'll see that it goes up, down, and back up again at regular intervals. This repeating pattern is called periodicity.

In positional embeddings, we use this periodicity to create a unique code for each position. The code is based on the position's location on the sine and cosine waves. Because of the repeating pattern of the waves, positions that are a fixed distance apart have codes related by a transformation that depends only on that distance, as the identities below show.
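
One way to see this relationship concretely (writing $\omega$ for the frequency of a single sine/cosine pair, a shorthand introduced here for convenience rather than taken from the original text) is through the standard angle-addition identities:

$$
\sin\big(\omega(pos + k)\big) = \sin(\omega\, pos)\cos(\omega k) + \cos(\omega\, pos)\sin(\omega k)
$$
$$
\cos\big(\omega(pos + k)\big) = \cos(\omega\, pos)\cos(\omega k) - \sin(\omega\, pos)\sin(\omega k)
$$

Since $\cos(\omega k)$ and $\sin(\omega k)$ depend only on the offset $k$, the code for position $pos + k$ is a fixed linear transformation of the code for position $pos$.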

There are different ways to create these positional embeddings. One way is to use fixed embeddings, where the codes are predefined by a formula. Another way is to let the model learn the embeddings during training. Learned embeddings add trainable parameters and are tied to the maximum sequence length seen during training, but in practice they can perform on par with the fixed ones; a sketch of this variant follows.
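
For comparison, here is a minimal PyTorch sketch of the learned variant. The class name, the `max_len` parameter, and the use of `nn.Embedding` as the position lookup table are illustrative choices made here, not details taken from the original paper:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Looks up a trainable vector for each position index (illustrative sketch)."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # One trainable row per position, learned jointly with the rest of the model.
        self.pos_embedding = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)  # 0, 1, ..., seq_len - 1
        # Look up one vector per position and add it to every sequence in the batch.
        return token_embeddings + self.pos_embedding(positions)
```

Unlike the fixed sinusoidal codes described below, this lookup table cannot produce a code for a position beyond `max_len`.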

At the end of this stage, the positional embeddings are added to the token embeddings, providing the model with both token and positional information.

The original paper, "Attention Is All You Need", uses sine and cosine functions of different frequencies to create the positional embeddings. The equations for these embeddings are:

$$
\text{PE}(pos,\, 2i) = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)
\qquad
\text{PE}(pos,\, 2i+1) = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)
$$

Where:

  • $\text{PE}$ is the positional embedding
  • $pos$ is the position of the token in the sequence
  • $i$ is the dimension index of the positional embedding
    • $\text{PE}(pos, 2i)$ fills the even dimension indices (the sine terms)
    • $\text{PE}(pos, 2i+1)$ fills the odd dimension indices (the cosine terms)
  • $d_{model}$ is the dimensionality of the token embeddings, chosen in the previous step
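
Putting the equations together, here is a minimal NumPy sketch that builds the positional-embedding matrix and adds it to a sequence of token embeddings. The function name and the random `token_embeddings` stand-in are illustrative assumptions, not part of the original paper:

```python
import numpy as np

def sinusoidal_positional_embedding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional embeddings (assumes d_model is even)."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dimension indices 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)    # 1 / 10000^(2i / d_model)
    angles = positions * angle_rates                         # (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get the sine terms
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine terms
    return pe

# Example: add positional information to a sequence of token embeddings.
seq_len, d_model = 8, 16
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in for real token embeddings
pe = sinusoidal_positional_embedding(seq_len, d_model)
model_input = token_embeddings + pe                    # element-wise sum fed into the encoder
print(pe.shape, model_input.shape)                     # (8, 16) (8, 16)
```

Each row of `pe` is the unique code for one position, and each consecutive sine/cosine column pair shares one frequency, varying fastest in the first dimensions and slowest in the last.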