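Understanding how positional encoding is computed in the Transformer
Positional encoding tells a Transformer model where an entity or word sits in the sequence, so that every position is assigned a unique representation. The simplest approach would be to use the index value itself as the position, but for long sequences the index values grow large, which causes a number of problems.
The positional encoding layer in Transformers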
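Suppose we have an input sequence of length L and we need the position of an object within that sequence. The positional encoding is given by sine and cosine functions of different frequencies:
$$PE(pos, 2i) = \sin\left(\frac{pos}{n^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{n^{2i/d}}\right)$$
pos: the position of the object in the input sequence, with 0 ≤ pos < L
d: the dimension of the position vector
n: a user-defined scalar, set to 10000 in the original paper
i: used to map to column indices, with 0 ≤ i < d/2; a single value of i maps to both a sine and a cosine function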
import numpy as np

def getPositionEncoding(seq_len, dim, n=10000):
    # PE[pos, 2i] = sin(pos / n^(2i/dim)), PE[pos, 2i+1] = cos(pos / n^(2i/dim))
    PE = np.zeros(shape=(seq_len, dim))
    for pos in range(seq_len):
        for i in range(dim // 2):
            denominator = np.power(n, 2*i/dim)
            print("pos=",pos," i=",i, " 2*i=", 2*i, " 2*i+1=", 2*i+1 )
            PE[pos, 2*i] = np.sin(pos/denominator)      # even column
            PE[pos, 2*i+1] = np.cos(pos/denominator)    # odd column
    return PE

# seq_len=4: the sequence has four words; dim=4: each position vector has 4 dimensions (an even number works best);
# n=100 replaces the 10000 in the formula
PE = getPositionEncoding(seq_len=4, dim=4, n=100)
print(PE)
pos= 0 i= 0 2*i= 0 2*i+1= 1
pos= 0 i= 1 2*i= 2 2*i+1= 3
pos= 1 i= 0 2*i= 0 2*i+1= 1
pos= 1 i= 1 2*i= 2 2*i+1= 3
pos= 2 i= 0 2*i= 0 2*i+1= 1
pos= 2 i= 1 2*i= 2 2*i+1= 3
pos= 3 i= 0 2*i= 0 2*i+1= 1
pos= 3 i= 1 2*i= 2 2*i+1= 3
[[ 0. 1. 0. 1. ]
[ 0.84147098 0.54030231 0.09983342 0.99500417]
[ 0.90929743 -0.41614684 0.19866933 0.98006658]
[ 0.14112001 -0.9899925 0.29552021 0.95533649]]
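The double loop above can also be written with NumPy broadcasting. The following is a minimal sketch rather than part of the original code: the function name `get_position_encoding_vectorized` is made up for this illustration, and it assumes `dim` is even.

```python
import numpy as np

def get_position_encoding_vectorized(seq_len, dim, n=10000):
    """Sinusoidal position encoding without Python loops (assumes dim is even)."""
    pos = np.arange(seq_len)[:, np.newaxis]    # shape (seq_len, 1)
    i = np.arange(dim // 2)[np.newaxis, :]     # shape (1, dim // 2)
    angles = pos / np.power(n, 2 * i / dim)    # shape (seq_len, dim // 2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)               # even columns 2i
    pe[:, 1::2] = np.cos(angles)               # odd columns 2i+1
    return pe

pe_small = get_position_encoding_vectorized(seq_len=4, dim=4, n=100)
print(np.round(pe_small, 8))
```

With `seq_len=4, dim=4, n=100` this should reproduce the 4x4 matrix printed above, since it evaluates the same sine and cosine terms.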
Setting n=10,000 as done in the original paper, you get the following:
import matplotlib.pyplot as plt
P = getPositionEncoding(seq_len=100, dim=512, n=10000)  # 100 words; 512-dimensional position vectors
cax = plt.matshow(P)
plt.gcf().colorbar(cax)
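import numpy as np

# n, pos and the first value of i are missing from the original listing.
# The values below are inferred: they are the ones that reproduce the output printed underneath.
# Note that i=400 lies outside the valid range 0 <= i < dim/2 used by getPositionEncoding.
n = 10000
pos = 5
dim = 512
i = 400
denominator = np.power(n, 2*i/dim)
print("pos/denominator=",pos/denominator, np.sin(pos/denominator) )
print("pos/denominator=",pos/denominator, np.cos(pos/denominator))
i = 3
denominator = np.power(n, 2*i/dim)
print("pos/denominator=",pos/denominator, np.sin(pos/denominator) )
print("pos/denominator=",pos/denominator, np.cos(pos/denominator))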
pos/denominator= 2.811706625951745e-06 2.8117066259480403e-06
pos/denominator= 2.811706625951745e-06 0.9999999999960472
pos/denominator= 4.488435662236571 -0.9750270944422548
pos/denominator= 4.488435662236571 -0.2220859408055681
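The printed values show one case where the angle pos/denominator is tiny and one where it is several radians. One way to read this is through the wavelength of each sine/cosine pair, which follows from the formula as 2π·n^(2i/dim). The sketch below is only an illustration; the helper name `wavelength` is not from the original post.

```python
import math

def wavelength(i, dim=512, n=10000):
    """Number of positions after which the sin/cos pair at index i repeats."""
    return 2 * math.pi * n ** (2 * i / dim)

# Small i: fast-changing columns; large i: columns that barely move between neighbouring positions
for i in (0, 3, 128, 255):
    print(f"i={i:3d}  wavelength ~ {wavelength(i):.1f} positions")
```

The wavelengths grow geometrically with i, from 2π at i=0 towards 2π·n at the last pair, which is why the high-index columns look almost constant in the matshow plot for short sequences.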
Writing your own positional encoding layer in Keras
import tensorflow as tf
from tensorflow import convert_to_tensor, string
from tensorflow.keras.layers import TextVectorization, Embedding, Layer
from tensorflow.data import Dataset
import numpy as np
import matplotlib.pyplot as plt
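The following code uses a Tokenizer object to convert each text into a sequence of integers (each integer is the index of a token in the dictionary).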
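# The legacy Keras Tokenizer is not in the import list above, so it is imported here
from tensorflow.keras.preprocessing.text import Tokenizer

output_sequence_length = 4
vocab_size = 10
sentences = ["How are you doing", "I am doing good"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)  # build the word index before converting texts
tokenized_sent = tokenizer.texts_to_sequences(sentences)
print("Vectorized words: ", tokenized_sent)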
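To make the text-to-integer step concrete without TensorFlow, here is a simplified pure-Python stand-in. It is a sketch under assumptions, not the Keras implementation: the helper names `build_word_index` and `texts_to_sequences` are chosen for this illustration, words are only lowercased and split on whitespace, and indices start at 1 with the most frequent word first.

```python
from collections import Counter

def build_word_index(texts):
    """Map each lowercase word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every word by its integer index."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]

sentences = ["How are you doing", "I am doing good"]
word_index = build_word_index(sentences)
print(word_index)
print(texts_to_sequences(sentences, word_index))  # [[2, 3, 4, 1], [5, 6, 1, 7]]
```

"doing" occurs twice, so it gets index 1; the remaining words are numbered in order of first appearance.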
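When implementing a Transformer model, you have to write your own positional encoding layer. This Keras example shows a custom layer, a subclass of Layer, that adds a word Embedding and a position Embedding together: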
class PositionEmbeddingLayer(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingLayer, self).__init__(**kwargs)
        # Learned embedding for the token ids
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim
        )
        # Learned embedding for the positions 0 .. sequence_length-1
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )

    def call(self, inputs):
        # One position index per token in the (last) sequence axis
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices
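What the `call` method computes can be sketched in plain NumPy: two table lookups followed by an addition. This is an illustration only. The random tables stand in for the trainable Embedding weights, and names such as `word_table`, `position_table` and `position_embedding` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, output_dim = 10, 4, 6

# Random stand-ins for the weights of the two Embedding layers
word_table = rng.normal(size=(vocab_size, output_dim))
position_table = rng.normal(size=(sequence_length, output_dim))

def position_embedding(token_ids):
    """Look up word and position vectors and add them, as the layer's call() does."""
    token_ids = np.asarray(token_ids)
    position_indices = np.arange(token_ids.shape[-1])
    embedded_words = word_table[token_ids]                   # (batch, seq, dim)
    embedded_positions = position_table[position_indices]    # (seq, dim), broadcast over the batch
    return embedded_words + embedded_positions

batch = [[2, 3, 4, 1], [5, 6, 1, 7]]
out = position_embedding(batch)
print(out.shape)  # (2, 4, 6)
```

Token 1 appears at position 3 in the first sentence and position 2 in the second, so its two output vectors differ only by the position vectors that were added.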
Author: Srinidhi Karjol
Here is an awesome recent YouTube video that covers position embeddings in great depth, with beautiful animations:
Visual Guide to Transformer Neural Networks - (Part 1) Position Embeddings
Taking excerpts from the video, let us try understanding the “sin” part of the formula to compute the position embeddings:
Here “pos” refers to the position of the “word” in the sequence. P0 refers to the position embedding of the first word; “d” means the size of the word/token embedding. In this example d=5. Finally, “i” refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).
While “d” is fixed, “pos” and “i” vary. Let us try understanding the latter two.
If we plot a sin curve and vary “pos” (on the x-axis), you will land up with different position values on the y-axis. Therefore, words with different positions will have different position embeddings values.
There is a problem though. Since the “sin” curve repeats at regular intervals, you can see in the figure above that P0 and P6 have the same position embedding value, despite being at two very different positions. This is where the “i” part in the equation comes into play.
If you vary “i” in the equation above, you will get a bunch of curves with varying frequencies. Reading off the position embedding values against the different frequencies ends up giving different values at different embedding dimensions for P0 and P6.
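The collision and its fix can be checked numerically. The sketch below assumes, purely for illustration, a sine curve that repeats every 6 positions, so that positions 0 and 6 coincide; the periods 6, 20 and 70 and the helper `encode` are made up and are not the frequencies of the actual formula.

```python
import math

def encode(pos, periods):
    """One sine value per period; the periods are illustrative only."""
    return [math.sin(2 * math.pi * pos / p) for p in periods]

# A single curve with period 6: positions 0 and 6 land on the same value
one_curve = [6]
print([round(v, 3) for v in encode(0, one_curve)],
      [round(v, 3) for v in encode(6, one_curve)])

# Several curves with different periods: the two positions are now distinguishable
several_curves = [6, 20, 70]
print([round(v, 3) for v in encode(0, several_curves)])
print([round(v, 3) for v in encode(6, several_curves)])
```

The first dimension still agrees for both positions, but the slower curves give different values, so the full vectors differ.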
The intuition
You may wonder how this combination of sines and cosines could ever represent a position or order. It is actually quite simple. Suppose you want to represent a number in binary format; what does that look like?
You can spot the rate of change between different bits. The least significant bit alternates on every number, the second-lowest bit flips every two numbers, and so on.
But using binary values would be a waste of space in the world of floats. So instead, we can use their continuous float counterparts: sinusoidal functions. Indeed, they are the equivalent of alternating bits. Moreover, by decreasing their frequencies, we can go from red bits to orange ones.
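The analogy can be made concrete: bit k of a number is a square wave with period 2^(k+1), and a sine wave with the same period changes sign exactly where the bit flips. The sketch below is an illustration of the analogy only; the real encoding uses frequencies n^(-2i/d) rather than powers of two, and the helper names are made up.

```python
import math

def bit(number, k):
    """Bit k of an integer (k=0 is the least significant bit)."""
    return (number >> k) & 1

def bit_from_sine(number, k):
    """Recover bit k from the sign of a sine wave with period 2**(k+1).

    The half-step offset keeps the sample away from the zero crossings.
    """
    return 1 if math.sin(math.pi * (number + 0.5) / 2 ** k) < 0 else 0

for number in range(8):
    bits = [bit(number, k) for k in (2, 1, 0)]
    from_sine = [bit_from_sine(number, k) for k in (2, 1, 0)]
    print(number, bits, from_sine)
```

Both columns agree for every number, which shows that a set of sinusoids with halving frequencies carries the same ordering information as the bits, while also taking smooth values in between.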