Input Embedding
Input embedding is a crucial step in the Transformer architecture, where the input sequence is converted into a numerical representation that the model can process. This numerical representation captures the semantic meaning and relationships between the input tokens, enabling the Transformer to understand and operate on the input data effectively.
What is a vocabulary?
To process text data, language models must convert words and sentences into a numerical representation that computers can understand. This process involves creating a vocabulary, which consists of the following steps:
- Tokenization: Breaking down text data into individual words and symbols.
- Mapping: Assigning a unique numeric ID value to each token (word or symbol).
- Creating a vocabulary dictionary: Linking each token ID with its original string value.
- Encoding: Converting the text into a series of token IDs for the computer model to process.
- Decoding: Using the vocabulary dictionary to convert token IDs to the original words.
The vocabulary is a bridge between human language and computer language, serving as a predefined list of words and subwords the computer understands. Each word in the vocabulary is assigned a unique number, allowing the computer to translate sentences into a series of numbers.
For instance, the word "apple" might be represented by the number 42, "banana" by 78, and so on. So, given a sentence like "I like apples and bananas," the computer translates it into a series of numbers using the vocabulary, like [12, 34, 42, 3, 78].
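The vocabulary steps above can be sketched in a few lines of Python. This is a toy example: the whitespace tokenizer, the vocabulary, and the ID values (mirroring the numbers used above) are all made up for illustration; real tokenizers are learned from large corpora.

```python
# Toy vocabulary sketch: tokenize, map tokens to IDs, encode, and decode.
# The vocabulary and ID values are invented for illustration.

vocab = {"I": 12, "like": 34, "apples": 42, "and": 3, "bananas": 78}
inv_vocab = {token_id: token for token, token_id in vocab.items()}  # ID -> token

def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; real models use subword tokenization.
    return text.replace(",", "").split()

def encode(text: str) -> list[int]:
    return [vocab[token] for token in tokenize(text)]

def decode(token_ids: list[int]) -> str:
    return " ".join(inv_vocab[token_id] for token_id in token_ids)

print(encode("I like apples and bananas"))  # [12, 34, 42, 3, 78]
print(decode([12, 34, 42, 3, 78]))          # I like apples and bananas
```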
Words are not always represented as full words in the vocabulary. Tokenization is often applied to break words into smaller subword units called tokens. This approach helps because the vocabulary can cover roots, prefixes, and suffixes that are shared between words. Doing so reduces the total vocabulary size needed to encode text (otherwise, every distinct word form would need its own vocabulary entry, like "carry" and "carried" or "apple" and "apples").
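As a rough illustration of how subwords share pieces between related words, here is a toy greedy longest-match tokenizer. The subword inventory is invented for the example; real vocabularies are learned from data with algorithms such as byte-pair encoding or WordPiece.

```python
# Toy subword tokenizer using greedy longest-prefix matching.
# The subword inventory below is made up for illustration.

subwords = ["carr", "appl", "ied", "es", "e", "y", "s"]

def subword_tokenize(word: str) -> list[str]:
    pieces = []
    while word:
        # Pick the longest subword that prefixes the remaining text.
        match = max((sw for sw in subwords if word.startswith(sw)),
                    key=len, default=None)
        if match is None:
            # Unknown character: fall back to a single-character piece.
            match = word[0]
        pieces.append(match)
        word = word[len(match):]
    return pieces

print(subword_tokenize("carry"))    # ['carr', 'y']
print(subword_tokenize("carried"))  # ['carr', 'ied']
print(subword_tokenize("apple"))    # ['appl', 'e']
print(subword_tokenize("apples"))   # ['appl', 'es']
```

The shared pieces "carr" and "appl" only need to be stored once, which is why subword vocabularies stay compact even for large corpora.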
Input Embedding Process
The input (tokenization) and embedding process looks as follows:
- Splitting the input text into pieces: "The human investigates" -> ["The_", "human_", "invest", "igat", "es_"]
- Indexing the tokens into a vocabulary: ["The_", "human_", "invest", "igat", "es_"] -> [3, 721, 66, 3434, 12]
- Assigning a d_model dimensional vector, where d_model is selected by the user, to each vocabulary entry (so the full lookup table has vocab_size rows), and looking up the vector for each token ID: [3, 721, 66, 3434, 12] -> [[0.123, 0.0232, ...], [...], [...], [...], [...]]. This results in a matrix of size [5 * d_model] in our case, one row per token, as sketched below.
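A minimal sketch of this lookup in NumPy, assuming illustrative sizes (vocab_size = 5000 and d_model = 8 are made up for the example). Libraries such as PyTorch wrap the same operation in an embedding layer, but at its core it is just selecting rows from a [vocab_size * d_model] table.

```python
import numpy as np

# Illustrative sizes; real models use much larger values
# (e.g. vocab_size in the tens of thousands, d_model of 512 or more).
vocab_size = 5000
d_model = 8

# One d_model-dimensional vector per vocabulary entry, initialized randomly.
embedding_table = np.random.randn(vocab_size, d_model)

# Token IDs from the example above.
token_ids = [3, 721, 66, 3434, 12]

# Embedding lookup: select the row for each token ID.
embedded = embedding_table[token_ids]

print(embedded.shape)  # (5, 8) -> [sequence_length * d_model]
```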
One might wonder why we use vectors to represent tokens when we already have a vocabulary. Computers understand numbers, so converting words or tokens into numerical form is necessary. A single number, such as the token ID, is usually insufficient to capture all the information about a word (e.g., semantic, syntactic, and contextual meaning). Therefore, each token in the vocabulary is assigned a unique vector (a set of numbers) as its representation.
This numerical representation allows the model to perform mathematical operations on tokens and to learn these word representations. The vectors are initialized randomly for each token and are then learned over the course of training the model.
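The following PyTorch sketch (again with made-up sizes) illustrates that the embedding table is an ordinary trainable parameter: it starts out random and receives gradients like any other weight during training.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
vocab_size, d_model = 5000, 8

# nn.Embedding holds a [vocab_size * d_model] weight matrix,
# initialized randomly and updated by the optimizer during training.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([3, 721, 66, 3434, 12])
vectors = embedding(token_ids)      # shape: (5, 8)

# A dummy loss to show that the embedding weights receive gradients.
loss = vectors.sum()
loss.backward()
print(embedding.weight.grad.shape)  # torch.Size([5000, 8])
```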