Understanding Transformers
Attention

Understanding the Self Attention Mechanism

Self-attention allows each token in an input sequence to incorporate and derive meaning from the other relevant tokens in that sequence, wherever they appear. It is similar to a person reading a sentence and understanding each word by relating it to the broader context.

Imagine reading a sentence and coming across a word like "it". To understand it, you must look back in the sentence to see what noun or concept "it" refers to. Self-attention allows a Transformer model to weigh the importance of different words in a sentence when understanding or encoding a particular word. It assigns an attention score to each word, indicating how much weight that word should receive when encoding the current one. These attention scores are dynamic and depend on the context of the sentence. For example, if "it" refers to "the cat," the attention mechanism would give high scores to "the" and "cat" when encoding "it."

Self Attention

The self-attention mechanism computes attention scores for each token in the input sequence. For a given token, it scores every token in the sequence and then builds that token's new representation as a weighted sum of the embeddings of all tokens, where the attention scores determine the weights. This mechanism is applied to all tokens simultaneously and in parallel, making it efficient.

To compute the attention scores, the self-attention mechanism uses three sets of vectors: Query (Q), Key (K), and Value (V). These vectors are linear projections of the input embeddings:

Q = XW_Q
K = XW_K
V = XW_V
  • where X is the input embedding matrix, and W_Q, W_K, and W_V are learned projection matrices.
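
A minimal NumPy sketch of these projections may help. The shapes (4 tokens, embedding size 8) and the random weights are illustrative assumptions, not values from the original text or a trained model:

```python
import numpy as np

# Illustrative dimensions: 4 tokens, embedding size 8 (assumed, not from the text).
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))    # input embedding matrix, one row per token
W_Q = rng.normal(size=(d_model, d_model))  # learned projection matrices
W_K = rng.normal(size=(d_model, d_model))  # (random here purely for illustration)
W_V = rng.normal(size=(d_model, d_model))

Q = X @ W_Q  # queries
K = X @ W_K  # keys
V = X @ W_V  # values

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```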

The Query vector represents the token we are trying to encode, while the Key vectors represent all other tokens. The Value vectors store the information that will be used to create the output.

The attention scores are calculated by measuring the similarity between the Query and Key vectors. High similarity results in higher attention scores. This similarity is computed using a dot product followed by scaling and softmax normalization:

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
  • where Q, K, V are the Query, Key, and Value vectors
  • d_k is the dimensionality of the query/key vectors

The softmax is given by

\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \quad \text{for } i = 1, 2, \dots, K

Where:

  • z_i is the input to the softmax function for the i-th element in a vector of length K.
  • e^{z_i} is the exponential of z_i, which maps the input to a positive value.
  • \sum_{j=1}^K e^{z_j} is the sum of the exponentials of all the elements in the input vector.
  • The softmax function \sigma(z_i) normalizes the exponential of z_i by dividing it by the sum of the exponentials of all the elements in the vector.

The softmax function takes a vector of arbitrary real numbers and maps it to a probability distribution, where each element is in the range (0, 1) and the sum of all elements is equal to 1. This is commonly used in the output layer of a neural network for multi-class classification tasks.
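
As a concrete illustration of the formula, here is a small NumPy implementation (the max-subtraction is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; the output is unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)                             # maps every input to a positive value
    return e / e.sum(axis=-1, keepdims=True)  # each row now sums to 1

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))  # approximately [0.659 0.242 0.099] — positive and summing to 1
```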

The intuition behind this equation is that each token (represented by its Query vector) is compared with all other tokens (represented by their Key vectors) to determine their relevance or similarity. The dot product measures this similarity, and the softmax normalization ensures that the attention scores sum up to 1, representing a probability distribution over the tokens. The weighted sum of the Value vectors, where the weights are the attention scores, then becomes the updated representation of the token being encoded, now informed by its context within the input sequence.
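
Putting the pieces together, this sketch implements the attention equation above, reusing the softmax function and the Q, K, V matrices from the earlier snippets:

```python
def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # token-to-token similarity matrix
    weights = softmax(scores)        # attention scores; each row sums to 1
    return weights @ V, weights      # weighted sum of the Value vectors

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each token's attention weights sum to 1
```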

In other words, each token acts as a query that softly searches the entire input context for relevant keys. The model learns what kinds of keys to expect for queries of a given type, and it uses the resulting relevance scores to update the representation of the query token with pertinent information drawn from the whole context.

Multi-Head Attention

Multi-head attention enhances the expressiveness of the self-attention mechanism by splitting the Query (Q), Key (K), and Value (V) vectors into multiple smaller vectors (heads) and computing attention in parallel for each head.

The results from all heads are then concatenated and linearly transformed to obtain the final output:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where:

  • Q, K, and V are the query, key, and value matrices, respectively. They are typically derived from the input embeddings or the output of the previous layer.
  • head_i represents the output of the i-th attention head.
  • Concat is the concatenation operation, which concatenates the outputs of all attention heads along the feature dimension.
  • W^O is a learnable weight matrix used to linearly transform the concatenated outputs of the attention heads.

Each head is computed as:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)

where:

  • QW^Q_i, KW^K_i, and VW^V_i are the linearly transformed query, key, and value matrices for the i-th attention head, respectively.
  • W^Q_i, W^K_i, and W^V_i are learnable weight matrices used to project the query, key, and value matrices into a lower-dimensional space for the i-th attention head.
  • Attention is the attention function, which computes the weighted sum of the values based on the compatibility between the queries and keys.
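
The sketch below ties these equations together, building on the scaled_dot_product_attention function from the earlier snippet. The head count and the random per-head matrices W^Q_i, W^K_i, W^V_i and output matrix W^O are illustrative assumptions; in a real model they are learned:

```python
def multi_head_attention(Q, K, V, h, rng):
    d_model = Q.shape[-1]
    d_head = d_model // h  # each head operates in a lower-dimensional space
    heads = []
    for _ in range(h):
        # head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i); weights are random here.
        W_Qi = rng.normal(size=(d_model, d_head))
        W_Ki = rng.normal(size=(d_model, d_head))
        W_Vi = rng.normal(size=(d_model, d_head))
        head, _ = scaled_dot_product_attention(Q @ W_Qi, K @ W_Ki, V @ W_Vi)
        heads.append(head)
    # Concatenate along the feature dimension, then apply W^O.
    W_O = rng.normal(size=(h * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ W_O

output = multi_head_attention(Q, K, V, h=2, rng=np.random.default_rng(1))
print(output.shape)  # (4, 8) — same shape as the input sequence
```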

Multi-head attention allows the model to attend to different aspects of the input sequence simultaneously, capturing diverse relationships and representations.

By leveraging self-attention and multi-head attention, Transformers can effectively model long-range dependencies and capture the contextual information necessary for various natural language processing tasks.