The Transformer comes from the paper "Attention Is All You Need". Since its introduction in 2017 its popularity has never waned: models built on it, such as BERT and its many variants, the GPT series, and so on, keep pushing the state of the art on all kinds of benchmark tasks. It is hard to find anyone working on NLP who is not familiar with it. As a newcomer to NLP I have certainly felt its power, but due to time constraints I never got around to writing up a proper summary until now.
At a high level, the Transformer consists of an encoder and a decoder, as shown in the figure below (taken from the paper). Each encoder layer contains two sub-layers, Multi-Head Attention and Feed Forward, and the encoder stacks N such layers (N = 6 in the paper); every sub-layer is wrapped with a residual connection followed by layer normalization. Each decoder layer contains three sub-layers, Masked Multi-Head Attention, Multi-Head Attention (over the encoder output), and Feed Forward, and the decoder likewise stacks N = 6 such layers, again with residual connections and layer normalization around every sub-layer.
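To make the "Add & Norm" wrapping concrete, here is a minimal sketch of one encoder layer in plain Python/NumPy (purely illustrative; layer_norm, self_attention and feed_forward are hypothetical stand-ins, not the paper's or tensor2tensor's implementation):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension; learnable gain/bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # "Add & Norm": residual connection followed by layer normalization.
    return layer_norm(x + fn(x))

def encoder_layer(x, self_attention, feed_forward):
    # One encoder layer = self-attention sub-layer + feed-forward sub-layer,
    # each wrapped with a residual connection and layer normalization.
    x = sublayer(x, self_attention)
    x = sublayer(x, feed_forward)
    return x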
Attention
The paper describes attention as follows: "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key." Roughly speaking, an attention function maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors; the output is a weighted sum of the values, and the weight of each value is computed by a compatibility function of the query with the corresponding key. Scaled dot-product attention is shown in the left half of the figure below, and the paper gives its formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The input matrices $Q$, $K$, $V$ correspond to the query, key, and value above, with shapes $n \times d_k$, $m \times d_k$, and $m \times d_v$ respectively (each of $Q$, $K$, $V$ is obtained by multiplying the input $X$ with its own weight matrix $W^Q$, $W^K$, $W^V$; the details are skipped here).
Step 1: compute the dot product of the query with the keys and divide by $\sqrt{d_k}$;
Step 2: pass the result of the previous step through a softmax to obtain the weights;
Step 3: compute the weighted sum of the values with these weights to obtain the attention output.
def dot_product_attention(q,
                          k,
                          v,
                          ...):
  """Dot-product attention.

  Args:
    q: Tensor with shape [..., length_q, depth_k].
    k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must
        match with q.
    v: Tensor with shape [..., length_kv, depth_v]. Leading dimensions must
        match with q.

  Returns:
    Tensor with shape [..., length_q, depth_v].
  """
  # Compute the matrix product of Q and K.
  logits = tf.matmul(q, k, transpose_b=True)
  # Normalize the scores with a softmax to obtain the attention weights.
  weights = tf.nn.softmax(logits, name="attention_weights")
  # Multiply by V to obtain the weighted representation.
  return tf.matmul(weights, v)
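Note that this abridged snippet omits the division by $\sqrt{d_k}$; in the library that scaling is presumably applied before this function is called. For completeness, here is a self-contained NumPy sketch of the full scaled dot-product attention described in the three steps above (illustrative only, not the library code):

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: [length_q, d_k], k: [length_kv, d_k], v: [length_kv, d_v]
    d_k = q.shape[-1]
    # Step 1: dot product of queries and keys, divided by sqrt(d_k).
    logits = q @ k.T / np.sqrt(d_k)
    # Step 2: softmax over the key positions to obtain the weights.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 3: weighted sum of the values.
    return weights @ v

# Tiny usage example with random inputs.
q = np.random.randn(4, 8)    # 4 query positions, d_k = 8
k = np.random.randn(6, 8)    # 6 key/value positions
v = np.random.randn(6, 16)   # d_v = 16
out = scaled_dot_product_attention(q, k, v)   # shape (4, 16)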
As shown in the figure below, Multi-Head Attention simply runs the attention above several times in parallel and concatenates the results; the paper formulates it as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Concretely: first apply linear projections to $Q$, $K$, $V$, then perform attention $h$ times (once per head), and finally concatenate the results.
def multihead_attention(query_antecedent,
                        memory_antecedent,
                        ...):
  """Multihead scaled-dot-product attention with input/output transformations.

  Args:
    query_antecedent: a Tensor with shape [batch, length_q, channels]
    memory_antecedent: a Tensor with shape [batch, length_m, channels] or None
    ...

  Returns:
    The result of the attention transformation. The output shape is
        [batch_size, length_q, hidden_dim]
  """
  # Compute the q, k, v matrices from the inputs.
  q, k, v = compute_qkv(query_antecedent, memory_antecedent, ...)
  # Dot-product attention.
  x = dot_product_attention(q, k, v, ...)
  # Final output (linear) transformation.
  x = common_layers.dense(x, ...)
  return x
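The library code above hides the per-head splitting inside the elided arguments. As a rough sketch of the multi-head logic itself (reusing scaled_dot_product_attention from the NumPy sketch above; the weight names w_q, w_k, w_v, w_o are hypothetical):

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # Self-attention over x: [length, d_model]; w_q/w_k/w_v/w_o: [d_model, d_model].
    length, d_model = x.shape
    depth = d_model // num_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for i in range(num_heads):
        # Each head attends over its own depth-sized slice of the projections.
        sl = slice(i * depth, (i + 1) * depth)
        heads.append(scaled_dot_product_attention(q[:, sl], k[:, sl], v[:, sl]))
    # Concatenate the h heads and apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ w_o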
Feed Forward and Residual Connection
The feed-forward network in the paper consists of two linear transformations with a ReLU activation in between, i.e. $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. To make the residual connections convenient, all sub-layers and the embedding layers produce outputs of the same dimension, $d_{model} = 512$.
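A minimal sketch of the position-wise feed-forward sub-layer with its residual connection, reusing layer_norm from the encoder sketch above (the weight names are hypothetical; in the paper's base model W1 maps 512 -> 2048 and W2 maps 2048 -> 512):

def feed_forward(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2: two linear maps with a ReLU in between.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def feed_forward_sublayer(x, w1, b1, w2, b2):
    # Residual connection followed by layer normalization ("Add & Norm").
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))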
Positional Encoding
Since the model contains no recurrent or convolutional structure, it has no built-in way of exploiting the order of the sequence, so the paper introduces positional encoding: position information for each token is added to the word embeddings as they are encoded, and this position information helps describe the relationships between tokens. The paper computes the positional encoding as:
$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$
@expert_utils.add_name_scope()
def get_timing_signal_1d(length,
                         channels,
                         min_timescale=1.0,
                         max_timescale=1.0e4,
                         start_index=0):
  """Gets a bunch of sinusoids of different frequencies.

  Each channel of the input Tensor is incremented by a sinusoid of a different
  frequency and phase.

  This allows attention to learn to use absolute and relative positions.
  Timing signals should be added to some precursors of both the query and the
  memory inputs to attention.

  The use of relative position is possible because sin(x+y) and cos(x+y) can be
  expressed in terms of y, sin(x) and cos(x).

  In particular, we use a geometric sequence of timescales starting with
  min_timescale and ending with max_timescale. The number of different
  timescales is equal to channels / 2. For each timescale, we
  generate the two sinusoidal signals sin(timestep/timescale) and
  cos(timestep/timescale). All of these sinusoids are concatenated in
  the channels dimension.

  Args:
    length: scalar, length of timing signal sequence.
    channels: scalar, size of timing embeddings to create. The number of
        different timescales is equal to channels / 2.
    min_timescale: a float
    max_timescale: a float
    start_index: index of first position

  Returns:
    a Tensor of timing signals [1, length, channels]
  """
  position = tf.to_float(tf.range(length) + start_index)
  num_timescales = channels // 2
  log_timescale_increment = (
      math.log(float(max_timescale) / float(min_timescale)) /
      (tf.to_float(num_timescales) - 1))
  inv_timescales = min_timescale * tf.exp(
      tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
  scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
  signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
  signal = tf.pad(signal, [[0, 0], [0, tf.mod(channels, 2)]])
  signal = tf.reshape(signal, [1, length, channels])
  return signal
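Roughly speaking, this timing signal is simply added to the token embeddings before the first encoder/decoder layer; a sketch of how it is typically used (illustrative, not the exact library call site):

# x: token embeddings with shape [batch, length, channels];
# broadcasting adds the same [1, length, channels] signal to every example.
signal = get_timing_signal_1d(tf.shape(x)[1], tf.shape(x)[2])
x = x + signal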