Understanding Transformers
FFN & Outputs

Output & FeedForward Network

Feedforward Neural Network (FFN)

After the self-attention layer, the Transformer architecture includes a position-wise feedforward neural network (FFN) that is applied independently to each position (token) in the sequence. The FFN consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

where:

  • x is the input vector (output of the self-attention layer)
  • W_1 and W_2 are learned weight matrices
  • b_1 and b_2 are learned bias vectors
  • max(0, z) is the ReLU activation function, where z is the input to the activation

The FFN introduces non-linearity and increases the expressive power of the model, allowing it to capture more complex interactions and relationships between the tokens beyond what is captured by the self-attention layer alone. It learns to combine and manipulate the information non-linearly, enabling the model to capture higher-level features and patterns in the input sequence.
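
To make this concrete, here is a minimal PyTorch sketch of the position-wise FFN described above. The class name is illustrative, and the sizes d_model = 512 and d_ff = 2048 follow the original Transformer defaults:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied independently at each position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # W_2, b_2
        self.relu = nn.ReLU()                     # max(0, z)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the same weights are applied to every token.
        return self.linear2(self.relu(self.linear1(x)))

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 10, 512))   # 2 sequences of 10 tokens -> output of the same shape
```

Because the two linear layers operate on the last dimension only, every token is transformed by the same weights, which is exactly what "position-wise" means.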

Residual Connection and Layer Normalization

The Transformer architecture employs residual connections and layer normalization to facilitate the flow of information and stabilize the training process.

Residual connections are used to add the input of a layer to its output. In the Transformer, the outputs of the self-attention layer and the FFN are added to their respective inputs:

x_{attn} = LayerNorm(x + Attention(x))
x_{ffn} = LayerNorm(x_{attn} + FFN(x_{attn}))
  • where x is the input to the layer, Attention(x) is the output of the self-attention layer, and FFN(x_{attn}) is the output of the feedforward neural network.

Residual connections allow the model to easily learn identity functions and bypass layers if needed, helping to prevent the vanishing or exploding gradient problem and enabling the training of deeper networks.
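
The post-layer-norm pattern in the equations above can be sketched directly. Here self_attention and ffn are placeholder callables standing in for the sub-layers discussed earlier (a sketch, not a full implementation):

```python
import torch.nn as nn

d_model = 512
norm1 = nn.LayerNorm(d_model)
norm2 = nn.LayerNorm(d_model)

def encoder_sublayers(x, self_attention, ffn):
    # x: (batch, seq_len, d_model); self_attention and ffn are placeholder callables.
    x_attn = norm1(x + self_attention(x))   # x_attn = LayerNorm(x + Attention(x))
    x_ffn = norm2(x_attn + ffn(x_attn))     # x_ffn = LayerNorm(x_attn + FFN(x_attn))
    return x_ffn
```

If a sub-layer contributes nothing useful, the addition still passes x through unchanged, which is what makes the identity path easy to learn.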

Layer normalization is applied after each residual connection to normalize the activations across the features within each layer:

LayerNorm(x) = \frac{x - \text{mean}(x)}{\sqrt{\text{var}(x) + \epsilon}} \cdot \gamma + \beta
  • where mean(x) and var(x) are the mean and variance of the input x computed across the features, ϵ is a small constant for numerical stability, and γ and β are learned scaling and shifting parameters.

Layer normalization helps keep the activations stable and prevents them from diverging or vanishing during training, allowing the model to learn more efficiently and converge faster.
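
The formula translates directly into a few lines of code. The sketch below assumes PyTorch; eps = 1e-5 is a common but illustrative choice:

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature (last) dimension, then scale by gamma and shift by beta.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps) * gamma + beta

# One token vector with 4 features; gamma and beta at their typical initial values.
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
gamma, beta = torch.ones(4), torch.zeros(4)
print(layer_norm(x, gamma, beta))   # ~ [-1.342, -0.447, 0.447, 1.342]
```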

Output

The output of the Transformer encoder block is the final representation obtained after the layer normalization of the FFN output, capturing the local and global context information of each token in the input sequence.
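
Putting the pieces together, a single encoder block can be sketched as follows (PyTorch, using nn.MultiheadAttention for the self-attention sub-layer; the class name is illustrative and the dimensions match the original paper's defaults):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention -> add & norm -> position-wise FFN -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # residual connection + layer norm
        return x                                # contextualized representation for each token

block = EncoderBlock()
tokens = torch.randn(2, 10, 512)                # 2 sequences of 10 token embeddings
print(block(tokens).shape)                      # torch.Size([2, 10, 512])
```

In practice several such blocks are stacked, and the output of the last one is the encoder representation passed to the decoder.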

In the Transformer decoder, the output of the decoder block is used to generate the next token in the output sequence. It is passed through a linear transformation followed by a softmax function to produce a probability distribution over the vocabulary:

output = \text{softmax}(x_{ffn} W_{vocab} + b_{vocab})
  • where x_{ffn} is the output of the decoder block's FFN, W_{vocab} and b_{vocab} are the learned weights and biases of the output linear transformation, and softmax converts the logits into probabilities.

The token with the highest probability is then selected as the next token in the output sequence, and the process continues until an end-of-sequence token is generated or a maximum sequence length is reached.
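
This projection and greedy selection step can be illustrated as follows (PyTorch; the vocabulary size of 32,000 is an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 32000            # vocab_size chosen only for illustration
to_vocab = nn.Linear(d_model, vocab_size)   # W_vocab and b_vocab

x_ffn = torch.randn(1, d_model)             # decoder block output at the current position
logits = to_vocab(x_ffn)                    # shape: (1, vocab_size)
probs = F.softmax(logits, dim=-1)           # probability distribution over the vocabulary
next_token = probs.argmax(dim=-1)           # greedy choice: id of the most probable token
```

In autoregressive generation this step runs inside a loop: the chosen token is appended to the decoder input, and the loop stops at an end-of-sequence token or a maximum length.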

For instance, let's consider a machine translation task where the Transformer model is used to translate a sentence from English to French. The input to the Transformer encoder would be the English sentence, and the output of the encoder would be a contextualized representation of each word in the input sentence. This output representation captures the local and global context information of each word, considering its relationships with other words in the sentence.

The Transformer decoder then takes this output representation as input and generates the French translation one word at a time. At each step, the decoder attends to the relevant parts of the encoder's output representation to determine the most appropriate French word to generate. The decoder's output at each step is passed through a linear transformation and a softmax function to produce a probability distribution over the French vocabulary. The word with the highest probability is selected as the next word in the French translation.

This process continues until the decoder generates an end-of-sequence token, indicating that the translation is complete. The final French translation is the concatenation of all the words generated by the decoder.

In summary, the feedforward neural network, residual connections, and layer normalization work together with the self-attention mechanism to process and transform the input sequence, introducing non-linearity, facilitating information flow, stabilizing training, and generating the final output representations.