Table of Contents

Solving Transformer by Hand: A Step-by-Step Math Example
Inputs and Positional Encoding
Step 1 (Defining the Data)
Step 2 (Finding the Vocab Size)
Step 3 (Encoding and Embedding)
Step 4 (Positional Embedding)
Encoder
Step 1 (Performing Single-Head Attention)
Step 1 — Defining our Dataset
Step 2 — Finding Vocab Size
Step 3 — Encoding
Step 4 — Calculating Embedding
Step 5 — Calculating Positional Embedding
Step 6 — Concatenating Positional and Word Embeddings
Step 7 — Multi-Head Attention
Step 8 — Adding and Normalizing
Step 9 — Feed Forward Network
Step 10 — Adding and Normalizing Again
Step 11 — Decoder Part
Step 12 — Understanding Masked Multi-Head Attention
Step 13 — Calculating the Predicted Word
Important Points
Conclusion
Transformer Architecture Explained
Tokenization
Embedding
Positional encoding
Transformer block
Attention
The Softmax Layer
Post-Training
Solving Transformer by Hand: A Step-by-Step Math Example
I have already written a detailed blog on how transformers work, using a very small sample dataset. It has turned out to be my best blog so far, as it raised my profile and gave me the motivation to write more. However, that blog is incomplete: it only covers about 20% of the transformer architecture and, as readers have pointed out, contains numerous calculation errors. Now that a considerable amount of time has passed, I am revisiting the topic in this new blog.
My previous blog on the transformer architecture (covers only about 20%):
Understanding Transformers: A Step-by-Step Math Example — Part 1
I understand that the transformer architecture may seem scary, and you might have encountered various explanations on YouTube or in blogs. In my blog, however, I will make an effort to clarify it by providing a comprehensive numerical example. By doing so, I hope to simplify the understanding of the transformer architecture.
Shoutout to HeduAI for providing clear explanations that have helped clarify my own concepts!
This blog is incomplete; here is the complete version of it:
Understanding Transformers from Start to End — A Step-by-Step Math Example
Let's get started!
Inputs and Positional Encoding
Let's solve the initial part, where we determine our inputs and calculate the positional encoding for them.
Step 1 (Defining the Data)
The initial step is to define our dataset (corpus).
Our dataset contains 3 sentences (dialogues) taken from the Game of Thrones TV show. Although this dataset may seem small, its small size actually helps us work through the results using the upcoming mathematical equations.
Step 2 (Finding the Vocab Size)
To determine the vocabulary size, we need to identify the total number of unique words in our dataset. This is crucial for encoding (i.e., converting the data into numbers).
We will break our dataset into a list of tokens, denoted N, where each word is a single token.
After obtaining the list of tokens N, we can apply a formula to calculate the vocabulary size.
Using a set operation helps remove duplicates, and we can then count the unique words to determine the vocabulary size. Therefore, the vocabulary size is 23, as there are 23 unique words in the given list.
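As a quick sanity check, here is a minimal Python sketch of this step. The three sentences below are placeholders for the actual corpus shown in the figures, so the resulting count may differ from 23.

```python
# A minimal sketch of Step 2, assuming a placeholder corpus.
# The real dataset consists of three Game of Thrones dialogues (see the figure above);
# swap in the exact sentences to reproduce the vocab size of 23.
corpus = [
    "when you play the game of thrones you win or you die",  # placeholder sentence
    "there is no middle ground",                              # placeholder sentence
    "the night is dark and full of terrors",                  # placeholder sentence
]

# N: the list of all word tokens in the dataset
N = [word for sentence in corpus for word in sentence.split()]

# vocab_size = |set(N)|: a set removes duplicates, len counts the unique words
vocab_size = len(set(N))
print(f"N = {len(N)} tokens, vocab_size = {vocab_size}")
```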
Step 3 (Encoding and Embedding)
We will assign an integer to each unique word of our dataset.
After encoding our entire dataset, it's time to select our input. We will choose a sentence from our corpus to start with:
"When you play game of thrones"
Each word passed as input will be represented by its encoded integer, and each integer value will have an associated embedding vector attached to it.
- These embeddings could be obtained from a model such as Google's Word2vec (a vector representation of words). In our numerical example, we will assume the embedding vector for each word is filled with random values between 0 and 1.
- Moreover, the original paper uses a 512-dimensional embedding vector; we will use a very small dimension, i.e., 5, for the numerical example.
Each word is now represented by an embedding vector of dimension 5, and the values are filled with random numbers using the Excel function RAND().
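The sketch below mirrors this step in Python, assuming the placeholder vocabulary from the previous sketch; the random 5-dimensional vectors stand in for the RAND() values in the figure and would be learned during training.

```python
import numpy as np

# Assumes `N` (token list) from the previous sketch; IDs are assigned per unique word.
word_to_id = {word: idx for idx, word in enumerate(sorted(set(N)))}

# One random 5-dimensional embedding vector per unique word, values in (0, 1),
# mimicking the Excel RAND() initialization.
d_model = 5
rng = np.random.default_rng(0)
embedding_table = rng.random((len(word_to_id), d_model))

sentence = "when you play game of thrones".split()
token_ids = [word_to_id[w] for w in sentence]   # encoding step
word_embeddings = embedding_table[token_ids]    # shape: (6, 5)
print(word_embeddings.shape)
```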
Step 4 (Positional Embedding)
Let's consider the first word, i.e., "When", and calculate its positional embedding vector.
There are two formulas for positional embedding:
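For reference, these are the standard sinusoidal formulas from "Attention Is All You Need" (pos is the word's position in the sequence, i indexes pairs of embedding dimensions, and d is the embedding size); even dimension indices use the sine formula and odd ones use the cosine formula:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$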
The pos value for the first word, "When", will be zero, since it corresponds to the starting index of the sequence. The value of i, depending on whether it is even or odd, determines which formula to use for calculating the PE values. The dimension value d is the dimensionality of the embedding vectors, which in this case is 5.
Continuing the calculation of positional embeddings, we assign a pos value of 1 to the next word, "you", and keep incrementing the pos value for each subsequent word in the sequence.
After finding the positional embeddings, we can combine them with the original word embeddings.
The resultant vectors we obtain are the sums e1+p1, e2+p2, e3+p3, and so on.
The output of this initial part of our transformer architecture serves as the input to the encoder.
Encoder
In the encoder, we perform complex operations involving the query, key, and value matrices. These operations are crucial for transforming the input data and extracting meaningful representations.
Inside the multi-head attention mechanism, a single attention layer consists of several key components. These components include:
Please note that the yellow box represents a single attention mechanism. What makes it multi-head attention is the presence of multiple yellow boxes. For the purposes of this numerical example, we will consider only one (i.e., single-head attention), as depicted in the diagram above.
Step 1 (Performing Single-Head Attention)
There are three inputs to the attention layer:
- Query
- Key
- Value
In the diagram provided above, the three input matrices (pink matrices) represent the transposed output obtained from the previous step of adding the positional embeddings to the word embedding matrix.
On the other hand, the linear weight matrices (yellow, blue, and red) represent the weights used in the attention mechanism. These matrices can have any number of columns, but their number of rows must equal the number of columns of the input matrices so that the multiplication is defined.
In our case, we will assume that the linear matrices (yellow, blue, and red) contain random weights. These weights are typically initialized randomly and then adjusted during the training process through techniques like backpropagation and gradient descent.
So let's calculate the Query, Key, and Value matrices:
Once we have the query, key, and value matrices, we proceed with additional matrix multiplications in the attention mechanism.
Now we multiply the resultant matrix with the value matrix that we computed earlier:
If we have multiple attention heads, each yielding a matrix of dimension (6x3), the next step involves concatenating these matrices together.
In the next step, we once again perform a linear transformation, similar to the process used to obtain the query, key, and value matrices. This linear transformation is applied to the concatenated matrix obtained from the multiple attention heads.
As this blog is already becoming lengthy, in the next part we will shift our focus to the remaining steps of the encoder architecture.
If you have any query, feel free to ask me!
I plan to explain the transformer again in the same manner as I did in my previous blog (for both coders and non-coders), providing a complete, step-by-step guide to understanding how transformers work.
Table of Contents
- Defining our Dataset
- Finding Vocab Size
- Encoding
- Calculating Embedding
- Calculating Positional Embedding
- Concatenating Positional and Word Embeddings
- Multi-Head Attention
- Adding and Normalizing
- Feed Forward Network
- Adding and Normalizing Again
- Decoder Part
- Understanding Masked Multi-Head Attention
- Calculating the Predicted Word
- Important Points
- Conclusion
Step 1 — Defining our Dataset
The dataset used for creating ChatGPT is 570 GB. For our purposes, on the other hand, we will be using a very small dataset so that we can perform the numerical calculations visually.
Our entire dataset containing only three sentences
Our entire dataset contains only three sentences, all of which are dialogues taken from a TV show. Although our dataset is already clean, in real-world scenarios like the creation of ChatGPT, cleaning a 570 GB dataset requires a significant amount of effort.
Step 2 — Finding Vocab Size
The vocabulary size is the total number of unique words in our dataset. It can be calculated using the formula below, where N is the total number of words (tokens) in our dataset.
vocab_size formula, where N is the total number of words
In order to find N, we need to break our dataset into individual words.
calculating variable N
After obtaining N, we perform a set operation to remove duplicates, and then we count the unique words to determine the vocabulary size.
finding vocab size
Therefore, the vocabulary size is 23, as there are 23 unique words in our dataset.
Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
encoding our unique words
While we treat a single word as a single token and assign a number to it, ChatGPT treats a portion of a word as a single token, following the rough rule of thumb: 1 token ≈ 0.75 words.
After encoding our entire dataset, it's time to select our input and start working with the transformer architecture.
Step 4 — Calculating Embedding
Let's select a sentence from our corpus that will be processed by our transformer architecture.
Input sentence for transformer
We have selected our input, and we need to find an embedding vector for it. The original paper uses a 512-dimensional embedding vector for each input word.
Original paper uses a 512-dimensional vector
Since, for our case, we need to work with a smaller embedding dimension to visualize how the calculation takes place, we will be using an embedding dimension of 6.
Embedding vectors of our input
The values of the embedding vectors are between 0 and 1 and are filled randomly in the beginning. They will later be updated as our transformer starts understanding the meanings of the words.
Step 5 — Calculating Positional Embedding
Now we need to find positional embeddings for our input. There are two formulas for positional embedding, and which one applies depends on the position i within the embedding vector of each word.
Positional Embedding formula
As you know, our input sentence is "when you play the game of thrones". The starting word is "when", with a starting index (pos) value of 0 and a dimension (d) of 6. For i from 0 to 5, we calculate the positional embedding of the first word of the input sentence.
Positional Embedding for the word "When"
Similarly, we can calculate the positional embeddings for all the words in our input sentence.
Calculating the positional embeddings of our input (the calculated values are rounded)
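Below is a small Python sketch of this calculation, assuming the standard sinusoidal formulas from the original paper (even indices use sine, odd indices use cosine); depending on the exact indexing convention used in the figures, the rounded values may differ slightly.

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pe = np.zeros((seq_len, d))
    for pos in range(seq_len):
        for i in range(d):
            # Even dimensions use sine, odd dimensions use cosine;
            # both share the exponent 2*(i//2)/d from the paper.
            angle = pos / (10000 ** (2 * (i // 2) / d))
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

sentence = "when you play the game of thrones".split()
pe = positional_encoding(len(sentence), d=6)
print(np.round(pe, 2))   # row 0 is the encoding for "when" (pos = 0)
```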
Step 6 — Concatenating Positional and Word Embeddings
After calculating the positional embeddings, we need to add the word embeddings and the positional embeddings element-wise.
concatenation step
The matrix that results from combining the two matrices (the word embedding matrix and the positional embedding matrix) will serve as the input to the encoder part.
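Continuing the sketch from the previous steps (hypothetical random word embeddings plus the sinusoidal positional encodings), the combination is a simple element-wise addition:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = len(sentence), 6                # `sentence` and `pe` come from the previous sketch
word_embeddings = rng.random((seq_len, d))   # random stand-in for the learned embeddings

encoder_input = word_embeddings + pe         # element-wise addition, shape (seq_len, d)
print(encoder_input.shape)
```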
Step 7 — Multi-Head Attention
Multi-head attention is comprised of many single-head attentions. It is up to us how many single heads we combine. For example, Meta's LLaMA uses 32 attention heads. Below is an illustrated diagram of what a single-head attention looks like.
Single-head attention in the Transformer
There are three inputs: query, key, and value. Each of these matrices is obtained by multiplying a different set of weights with the transpose of the same matrix that we computed earlier by adding the word embedding and positional embedding matrices.
Let's say that, for computing the query matrix, the weights matrix must have the same number of rows as the number of columns of the transposed matrix, while the number of columns of the weights matrix can be anything; for example, we suppose 4 columns in our weights matrix. The values in the weights matrix are random numbers between 0 and 1, which will later be updated once our transformer starts learning the meaning of these words.
calculating the Query matrix
Similarly, we can compute the key and value matrices using the same procedure, but the values in the weights matrices must be different for each.
Calculating the Key and Value matrices
So, after multiplying the matrices, the resultant query, key, and value matrices are obtained:
Query, Key, Value matrices
Now that we have all three matrices, let's start calculating single-head attention step by step.
matrix multiplication between Query and Key
To scale the resultant matrix, we reuse the dimension of our embedding vector, which is 6.
scaling the resultant matrix by the embedding dimension
The next step, masking, is optional, and we won't be calculating it here. Masking is like telling the model to focus only on what has happened before a certain point and not to peek into the future while figuring out the importance of different words in a sentence. It helps the model understand things step by step, without cheating by looking ahead.
So now we apply the softmax operation to our scaled resultant matrix.
Applying softmax to the resultant matrix
We do the final multiplication step to obtain the resultant matrix of single-head attention.
calculating the final matrix of single-head attention
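Putting the whole of Step 7 together, here is a minimal numpy sketch of single-head attention on our encoder input; the (6 x 4) weight shapes follow the description above, all weights are random stand-ins for learned parameters, and the scaling comment notes where this walkthrough differs from the original paper.

```python
import numpy as np

rng = np.random.default_rng(2)
X = encoder_input                      # (seq_len, 6), from Step 6
d_model, d_head = X.shape[1], 4        # 4 columns in each weight matrix, as above

# Random weight matrices standing in for the learned linear projections
W_q = rng.random((d_model, d_head))
W_k = rng.random((d_model, d_head))
W_v = rng.random((d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # query, key, value matrices

# The original paper scales by sqrt(d_k); the text above reuses the embedding dimension.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
attention_output = weights @ V         # (seq_len, d_head)
print(attention_output.shape)
```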
We have calculated single-head attention, while multi-head attention comprises many single-head attentions, as I stated earlier. Below is a visual of what it looks like:
Multi-head attention in the Transformer
Each single-head attention has three inputs: query, key, and value, and each of the three has a different set of weights. Once all the single-head attentions output their resultant matrices, they are all concatenated, and the final concatenated matrix is once again transformed linearly by multiplying it with a weights matrix initialized with random values, which will later get updated when the transformer starts training.
In our case we are considering single-head attention, but this is how it looks if we are working with multi-head attention.
Single-head attention vs multi-head attention
In either case, whether it's single-head or multi-head attention, the resultant matrix needs to be transformed linearly once more by multiplying it with a weights matrix.
linearly transforming the single-head attention matrix
Make sure that the number of columns of this linear weights matrix equals the number of columns of the matrix that we computed earlier (word embedding + positional embedding), because in the next step we will be adding the resultant matrix to that (word embedding + positional embedding) matrix.
Output matrix of multi-head attention
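If we did have several heads, the assembly step could be sketched as follows, continuing the previous sketch; `num_heads`, the per-head weights, and `W_o` are all illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
num_heads = 2                                     # illustrative choice

def single_head(X, rng, d_head=4):
    """One single-head attention, identical to the previous sketch."""
    d_model = X.shape[1]
    W_q, W_k, W_v = (rng.random((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V

heads = [single_head(X, rng) for _ in range(num_heads)]
concat = np.concatenate(heads, axis=1)            # (seq_len, num_heads * d_head)

# Final linear layer projects back to d_model so we can add it to X in the next step
W_o = rng.random((concat.shape[1], X.shape[1]))
multi_head_output = concat @ W_o                  # (seq_len, d_model)
print(multi_head_output.shape)
```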
Now that we have computed the resultant matrix for multi-head attention, we will work on the add-and-normalize step.
Step 8 — Adding and Normalizing
Once we obtain the resultant matrix from multi-head attention, we have to add it to our original matrix. Let's do that first.
Adding matrices to perform the add-and-norm step
To normalize the above matrix, we need to compute the mean and standard deviation row-wise, for each row.
calculating mean and std
We subtract each value of the matrix by the corresponding row mean and divide it by the corresponding standard deviation.
normalizing the resultant matrix
Adding a small error value (epsilon) to the denominator prevents it from being zero and avoids making the entire term infinite.
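A small sketch of this add-and-norm step, continuing from the previous ones (epsilon is the small error value mentioned above):

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual addition followed by row-wise (layer) normalization."""
    added = x + sublayer_out                       # add: residual connection
    mean = added.mean(axis=1, keepdims=True)       # row-wise mean
    std = added.std(axis=1, keepdims=True)         # row-wise standard deviation
    return (added - mean) / (std + eps)            # normalize; eps avoids division by zero

norm1 = add_and_norm(X, multi_head_output)         # X and multi_head_output from earlier sketches
```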
Step 9 — Feed Forward Network
After normalizing the matrix, it will be processed through a feed forward network. We will use a very basic network that contains only one linear layer and one ReLU activation layer. This is what it looks like visually:
Feed Forward network comparison
First, we compute the linear layer by multiplying our last calculated matrix with a random weights matrix, which will be updated when the transformer starts learning, and then adding a bias matrix that also contains random values.
Calculating the Linear Layer
After calculating the linear layer, we pass it through the ReLU layer, using its formula.
Calculating the ReLU Layer
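Sketching this feed forward network in the same style (the random `W_ff` and `b_ff` stand in for learned parameters; the output width is kept at the model dimension so the next add-and-norm step works):

```python
import numpy as np

rng = np.random.default_rng(4)
d_model = norm1.shape[1]                 # keep the same width so the next add-and-norm works

W_ff = rng.random((d_model, d_model))    # random stand-in for the linear-layer weights
b_ff = rng.random((1, d_model))          # random stand-in for the bias

linear_out = norm1 @ W_ff + b_ff         # linear layer
relu_out = np.maximum(0, linear_out)     # ReLU activation: max(0, x)
```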
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed forward network, we add it to the matrix obtained from the previous add-and-norm step, and then normalize it using the row-wise mean and standard deviation.
Add and Norm after the Feed Forward Network
The output matrix of this add-and-norm step will serve as the query and key matrices in one of the multi-head attention mechanisms present in the decoder part, which you can easily see by tracing the path from this add-and-norm block to the decoder section.
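Reusing the helper from Step 8, this second add-and-norm can be sketched in one line:

```python
norm2 = add_and_norm(norm1, relu_out)   # encoder output; passed to the decoder's cross-attention
```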
Step 11 — Decoder Part
The good news is that up until now we have calculated the encoder part, and all the steps we have performed, from encoding our dataset to passing our matrix through the feed forward network, were unique; we hadn't calculated them before. From now on, however, the upcoming steps in the remaining architecture of the transformer (the decoder part) involve similar kinds of matrix multiplications.
Take a look at our transformer architecture, what we have covered so far and what we still have to cover:
Upcoming steps illustration
We won't be calculating the entire decoder, because most of it involves calculations similar to those we have already done in the encoder. Calculating the decoder in detail would only make the blog lengthy due to the repetitive steps. Instead, we only need to focus on the calculations of the decoder's input and output.
During training, there are two inputs to the decoder. One comes from the encoder, where the output matrix of the last add-and-norm layer serves as the query and key for the second multi-head attention layer in the decoder part. Below is a visualization of it (from Batool Haider):
Visualization is from Batool Haider
The value matrix, on the other hand, comes from the decoder after its first add-and-norm step.
The second input to the decoder is the target text. If you remember, our input to the encoder is "when you play game of thrones", so the input to the decoder is the target text, which in our case is "you win or you die".
The decoder input text needs to follow a standard wrapping of tokens that makes the transformer aware of where to start and where to end.
input comparison of encoder and decoder
Here <start> and <end> are two newly introduced tokens. Moreover, the decoder takes one token as input at a time. That means <start> will serve as the input, and "you" must be the text predicted for it.
Decoder input: the <start> token
As we already know, these embeddings are filled with random values, which will later be updated during the training process.
Compute the rest of the blocks in the same way that we computed them earlier in the encoder part.
Calculating the Decoder
Before diving into any further details, we need to understand what masked multi-head attention is, using a simple mathematical example.
Step 12 — Understanding Masked Multi-Head Attention
In a Transformer, masked multi-head attention is like a spotlight that the model uses to focus on different parts of a sentence. It is special because it doesn't let the model cheat by looking at words that come later in the sentence. This helps the model understand and generate sentences step by step, which is important in tasks like conversation or translating words into another language.
Suppose we have the following input matrix, where each row represents a position in the sequence and each column represents a feature:
input matrix for masked multi-head attention
Now, let's walk through the components of masked multi-head attention with two heads:
- Linear Projections (Query, Key, Value): Assume the linear projections for each head: Head 1: Wq1, Wk1, Wv1 and Head 2: Wq2, Wk2, Wv2.
- Calculate Attention Scores: For each head, calculate attention scores using the dot product of Query and Key, and apply the mask to prevent attending to future positions.
- Apply Softmax: Apply the softmax function to obtain the attention weights.
- Weighted Summation (Value): Multiply the attention weights by the Value to get the weighted sum for each head.
- Concatenate and Linear Transformation: Concatenate the outputs from both heads and apply a linear transformation.
Let's do a simplified calculation:
Assuming two conditions:
- Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.
- Q = K = V = the input matrix.
Masked Multi-Head Attention (Two Heads)
The concatenation step combines the outputs from the two attention heads into a single set of information. Imagine you have two friends who each give you advice on a problem. Concatenating their advice means putting both pieces of advice together so that you have a more complete view of what they suggest. In the context of the transformer model, this step helps capture different aspects of the input data from multiple perspectives, contributing to a richer representation that the model can use for further processing.
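To make the masking step concrete, here is a small numpy sketch of one head under the simplifying conditions above (identity projections, so Q = K = V = the input matrix); the toy input values are illustrative, and the upper-triangular mask of -inf is what prevents each position from attending to future positions.

```python
import numpy as np

X_dec = np.array([[1.0, 0.0],      # toy 3-position input matrix (illustrative values)
                  [0.0, 1.0],
                  [1.0, 1.0]])

Q = K = V = X_dec                               # identity projections, as assumed above
scores = Q @ K.T / np.sqrt(X_dec.shape[1])      # scaled dot-product scores

# Causal mask: -inf above the diagonal so softmax gives those positions zero weight
mask = np.triu(np.full(scores.shape, -np.inf), k=1)
masked_scores = scores + mask

weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=1, keepdims=True)
masked_attention = weights @ V                  # each row only uses current and past positions
print(np.round(weights, 2))
```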
Step 13 — Calculating the Predicted Word
The output matrix of the decoder's last add-and-norm block must contain the same number of rows as the input matrix, while the number of columns can be anything. Here, we work with 6 columns.
Add and Norm output of the decoder
The resultant matrix of the decoder's last add-and-norm block must be flattened so that it can be passed to a linear layer, which computes the predicted probability of each unique word in our dataset (corpus).
flattening the last add-and-norm block matrix
This flattened layer is passed through a linear layer to compute the logits (scores) of each unique word in our dataset.
Calculating the Logits
Once we obtain the logits, we can apply the softmax function to normalize them and find the word with the highest probability.
Finding the predicted word
So, based on our calculations, the word predicted by the decoder is "you".
Final output of the decoder
This predicted word, "you", will be treated as the next input word for the decoder, and this process continues until the <end> token is predicted.
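A rough sketch of this final step, following the flattening approach described above; the decoder output, the projection weights, and the vocabulary are all illustrative stand-ins (a trained model would produce a meaningful word rather than a random one).

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = sorted(set(N))                      # unique words from the Step 2 sketch (placeholder corpus)
decoder_output = rng.random((1, 6))         # stand-in: one row for the single <start> token input

flattened = decoder_output.flatten()                       # flatten the matrix into one vector
W_out = rng.random((flattened.shape[0], len(vocab)))       # linear layer weights (random stand-in)
logits = flattened @ W_out                                 # one score (logit) per vocabulary word

probs = np.exp(logits) / np.exp(logits).sum()              # softmax turns logits into probabilities
predicted_word = vocab[int(np.argmax(probs))]              # pick the most probable word
print(predicted_word)
```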
Important Points
- The above example is very simple, as it does not involve epochs or other important training parameters, which can only really be explored using a programming language like Python.
- It shows the process only up to training; evaluation and testing cannot be visualized using this matrix approach.
- Masked multi-head attention prevents the transformer from looking at future tokens, so the model cannot cheat by peeking ahead while learning to predict the next word.
Conclusion
In this blog, I have shown you a very basic way of how transformers work mathematically, using a matrix approach. We applied positional encoding, softmax, a feed forward network, and, most importantly, multi-head attention.
In the future, I will be posting more blogs on transformers and LLMs, as my core focus is NLP. More importantly, if you want to build your own million-parameter LLM from scratch using Python, I have written a blog on it which has received a lot of appreciation on Medium. You can read it here:
Transformer Architecture Explained
Transformers are a new development in machine learning that have been making a lot of noise lately. They are incredibly good at keeping track of context, and this is why the text that they write makes sense. In this chapter, we will go over their architecture and how they work.
Transformer models are one of the most exciting new developments in machine learning. They were introduced in the paper "Attention Is All You Need". Transformers can be used to write stories, essays, and poems, answer questions, translate between languages, chat with humans, and even pass exams that are hard for humans! But what are they? You'll be happy to know that the architecture of transformer models is not that complex; it is simply a concatenation of some very useful components, each of which has its own function. In this chapter, you will learn about all of these components.
In a nutshell, what does a transformer do? Imagine that you're writing a text message on your phone. After each word, you may get three words suggested to you. For example, if you type "Hello, how are", the phone may suggest words such as "you" or "your" as the next word. Of course, if you keep selecting the suggested word on your phone, you'll quickly find that the message formed by these words makes no sense. If you look at each set of 3 or 4 consecutive words, it may make sense, but these words don't concatenate into anything meaningful. This is because the model used in the phone doesn't carry the overall context of the message; it simply predicts which word is more likely to come up after the last few. Transformers, on the other hand, keep track of the context of what is being written, and this is why the text they write makes sense.
The phone can suggest the next word to use in a text message, but it does not have the power to generate coherent text.
I have to be honest with you: the first time I found out that transformers build text one word at a time, I couldn't believe it. First of all, this is not how humans form sentences and thoughts. We first form a basic thought, and then start refining it and adding words to it. It is also not how ML models do other things. For example, images are not built this way. Most neural-network-based graphical models form a rough version of the image and slowly refine it or add detail until it is perfect. So why would a transformer model build text word by word? One answer is: because that works really well. A more satisfying one is that transformers are so incredibly good at keeping track of the context that the next word they pick is exactly what they need to keep an idea going.
And how are transformers trained? With a lot of data; all the data on the internet, in fact. So when you input the sentence "Hello, how are" into the transformer, it simply knows that, based on all the text on the internet, the best next word is "you". If you were to give it a more complicated command, say, "Write a story.", it may figure out that a good next word to use is "Once". Then it adds this word to the command and figures out that a good next word is "upon", and so on. Word by word, it will continue until it writes a story.
Command: Write a story.
Response: Once
Next command: Write a story. Once
Response: upon
Next command: Write a story. Once upon
Response: a
Next command: Write a story. Once upon a
Response: time
Next command: Write a story. Once upon a time
Response: there
Now that we know what transformers do, let's get to their architecture. If you've seen the architecture of a transformer model, you may have jumped in awe like I did the first time I saw it; it looks quite complicated! However, when you break it down into its most important parts, it's not so bad. The transformer has 4 main parts:
- Tokenization
- Embedding
- Positional encoding
- Transformer block (several of these)
- Softmax
The fourth one, the transformer block, is the most complex of all. Many of these can be concatenated, and each one contains two main parts: the attention and the feedforward components.
The architecture of a transformer model
Tokenization
Tokenization is the most basic step. It relies on a large vocabulary of tokens, including all the words, punctuation signs, and so on. The tokenization step takes every word, prefix, suffix, and punctuation sign and maps it to a known token from this vocabulary.
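As a toy illustration (real tokenizers, such as the subword tokenizers used by GPT-style models, are more sophisticated), a minimal word-and-punctuation tokenizer might look like this:

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (a toy stand-in for a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(toy_tokenize("Write a story."))   # ['write', 'a', 'story', '.']
```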
Embedding
Once the input has been tokenized, it's time to turn words into numbers. For this, we use an embedding. In a previous chapter you learned how text embeddings send every piece of text to a vector (a list) of numbers. If two pieces of text are similar, then the numbers in their corresponding vectors are similar to each other (componentwise, meaning each pair of numbers in the same position are similar). Otherwise, if two pieces of text are different, then the numbers in their corresponding vectors are different.
In general, embeddings send every word (token) to a long list of numbers.
Positional encoding
Once we have the vectors corresponding to each of the tokens in the sentence, the next step is to turn all of these into one vector to process. The most common way to turn a bunch of vectors into one vector is to add them componentwise. That means we add each coordinate separately. For example, if the vectors (of length 2) are [1,2] and [3,4], their corresponding sum is [1+3, 2+4], which equals [4,6]. This can work, but there's a small caveat. Addition is commutative, meaning that if you add the same numbers in a different order, you get the same result. In that case, the sentence "I'm not sad, I'm happy" and the sentence "I'm not happy, I'm sad" will result in the same vector, given that they have the same words, just in a different order. This is not good. Therefore, we must come up with some method that will give us a different vector for each of the two sentences. Several methods work, and we'll go with one of them: positional encoding. Positional encoding consists of adding a sequence of predefined vectors to the embedding vectors of the words. This ensures we get a unique vector for every sentence, and sentences with the same words in different order will be assigned different vectors. In the example below, the vectors corresponding to the words "Write", "a", "story", and "." become the modified vectors that carry information about their position, labeled "Write (1)", "a (2)", "story (3)", and ". (4)".
Positional encoding adds a positional vector to each word, in order to keep track of the positions of the words.
Transformer block
Let's recap what we have so far. The words come in and get turned into tokens (tokenization), the tokenized words are turned into numbers (embeddings), and then order gets taken into account (positional encoding). This gives us a vector for every token that we input to the model. Now, the next step is to predict the next word in the sentence. This is done with a really, really large neural network, which is trained precisely with that goal: to predict the next word in a sentence.
We can train such a large network, but we can vastly improve it by adding a key step: the attention component. Introduced in the seminal paper "Attention Is All You Need", it is one of the key ingredients in transformer models, and one of the reasons they work so well. Attention is explained in the next section, but for now, imagine it as a way to add context to each word in the text.
The attention component is added at every block of the feedforward network. Therefore, if you imagine a large feedforward neural network whose goal is to predict the next word, formed by several blocks of smaller neural networks, an attention component is added to each one of these blocks. Each component of the transformer, called a transformer block, is then formed by two main components:
- The attention component.
- The feedforward component.
The transformer is a concatenation of many transformer blocks. Each of these is composed of an attention component followed by a feedforward component (a neural network).
Attention
The next step is attention. As you learned, the attention mechanism deals with a very important problem: the problem of context. Sometimes, as you know, the same word can be used with different meanings. This tends to confuse language models, since an embedding simply sends words to vectors, without knowing which definition of the word is being used.
Attention is a very useful technique that helps language models understand the context. In order to understand how attention works, consider the following two sentences:
- Sentence 1: The bank of the river.
- Sentence 2: Money in the bank.
As you can see, the word "bank" appears in both, but with different meanings. In sentence 1, we are referring to the land at the side of the river, and in the second one to the institution that holds money. The computer has no idea of this, so we need to somehow inject that knowledge into it. What can help us? Well, it seems that the other words in the sentence can come to our rescue. For the first sentence, the words "the" and "of" do us no good. But the word "river" is the one that lets us know that we're talking about the land at the side of the river. Similarly, in sentence 2, the word "money" is the one that helps us understand that the word "bank" is now referring to the institution that holds money.
Attention helps give context to each word, based on the other words in the sentence (or text).
In short, what attention does is move the words in a sentence (or piece of text) closer together in the word embedding. In that way, the word "bank" in the sentence "Money in the bank" will be moved closer to the word "money". Equivalently, in the sentence "The bank of the river", the word "bank" will be moved closer to the word "river". That way, the modified word "bank" in each of the two sentences will carry some of the information from the neighboring words, adding context to it.
The attention step used in transformer models is actually much more powerful, and it's called multi-head attention. In multi-head attention, several different embeddings are used to modify the vectors and add context to them. Multi-head attention has helped language models reach much higher levels of efficacy when processing and generating text.
The Softmax Layer
Now that you know that a transformer is formed by many layers of transformer blocks, each containing an attention layer and a feedforward layer, you can think of it as a large neural network that predicts the next word in a sentence. The transformer outputs scores for all the words, where the highest scores are given to the words that are most likely to come next in the sentence.
The last step of a transformer is a softmax layer, which turns these scores into probabilities (that add up to 1), where the highest scores correspond to the highest probabilities. Then we can sample from these probabilities to pick the next word. In the example below, the transformer gives the highest probability, 0.5, to "Once", and probabilities of 0.3 and 0.2 to "Somewhere" and "There". Once we sample, the word "Once" is selected, and that's the output of the transformer.
The softmax layer turns the scores into probabilities, and these are used to pick the next word in the text.
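The example above can be reproduced with a few lines of numpy; the three candidate words and their scores are illustrative, chosen so the softmax comes out close to 0.5, 0.3, and 0.2.

```python
import numpy as np

words = ["Once", "Somewhere", "There"]
scores = np.array([1.0, 0.49, 0.08])            # illustrative logits; softmax gives roughly [0.5, 0.3, 0.2]

probs = np.exp(scores) / np.exp(scores).sum()   # softmax: scores -> probabilities summing to 1
print(np.round(probs, 2))

rng = np.random.default_rng(0)
next_word = rng.choice(words, p=probs)          # sample the next word from the distribution
print(next_word)
```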
Now what? Well, we repeat the step. We now input the text "Write a story. Once" into the model, and most likely the output will be "upon". Repeating this step again and again, the transformer will end up writing a story, such as "Once upon a time, there was a …".
Post-Training
Now that you know how transformers work, we still have a bit of work to do. Imagine the following: you ask the transformer, "What is the capital of Algeria?". We would love for it to answer "Algiers" and move on. However, the transformer is trained on the entire internet. The internet is a big place, and it's not necessarily the best question-and-answer repository. Many pages, for example, have long lists of questions without answers. In this case, the next sentence after "What is the capital of Algeria?" could be another question, such as "What is the population of Algeria?" or "What is the capital of Burkina Faso?". The transformer is not a human who thinks about their responses; it simply mimics what it sees on the internet (or in any dataset it has been given). So how do we get the transformer to answer questions?
The answer is post-training. In the same way that you would teach a person to do certain tasks, you can get a transformer to perform tasks. Once a transformer has been trained on the entire internet, it is trained again on a large dataset of questions and their respective answers. Transformers (like humans) have a bias towards the last things they've learned, so post-training has proven to be a very useful step in helping transformers succeed at the tasks they are asked to do.
Post-training also helps with many other tasks. For example, one can post-train a transformer with large datasets of conversations, in order to help it perform well as a chatbot, or to help us write stories, poems, or even code.
https://medium.com/@amanatulla1606/transformer-architecture-explained-2c49e2257b4c