1. Preface

An annotated version of “Attention Is All You Need”

“This post presents an annotated version of the paper in the form of a line-by-line implementation”

Code is available here

0.1 Prelims

# !pip install -r requirements.txt
# # Uncomment for colab
# #
# !pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil
# !python -m spacy download de_core_news_sm
# !python -m spacy download en_core_web_sm
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


# Set to False to skip notebook execution (e.g. for debugging)
warnings.filterwarnings("ignore")
RUN_EXAMPLES = True
# Some convenience helper functions used throughout the notebook


def is_interactive_notebook():
    return __name__ == "__main__"


def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None

0.2 Background

Reducing the amount of computation needed for sequential processing is an important goal in neural network design. Earlier architectures proposed for this purpose, including the Extended Neural GPU, ByteNet, and ConvS2S, are all built on CNNs and compute hidden representations for all input and output positions in parallel.

In these models, the number of operations required to relate two arbitrary input or output positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet. This makes it much harder to learn dependencies between distant positions. In the Transformer, the number of operations is reduced to a constant (at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect that Multi-Head Attention counteracts).

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language modeling tasks.

The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

1. Part 1: Model Architecture

1.1 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure:

  • The encoder maps an input sequence $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$.
  • Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ one symbol at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

The overall architecture of the Transformer is shown in the figure below. Both the encoder and the decoder are built from self-attention and point-wise, fully connected layers; their structures are shown in the left and right halves of the figure, respectively.

1.2 Encoder and Decoder Stacks

1.2.1 Encoder

The encoder is composed of a stack of $N = 6$ identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

A residual connection is employed around each of the two sub-layers, followed by layer normalization.

class LayerNorm(nn.Module):
    "Construct a LayerNorm module (see citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout to the output of each sub-layer before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of the same dimension, $d_{\text{model}} = 512$.

class SubLayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SubLayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attention and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SubLayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

1.2.2 Decoder

The decoder is also composed of a stack of $N = 6$ identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

As in the encoder, a residual connection is employed around each sub-layer, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SubLayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

We also modify the self-attention sub-layer in the decoder stack to prevent each position from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$; in other words, information that should not yet be visible is masked out.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0

The attention-mask figure below shows which positions (columns) each target word (row) is allowed to look at. During training, the word at the current decoding position cannot attend to words at later positions.
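
As a quick check, we can also print the mask for a short sequence directly. The snippet below is a minimal sketch that is not part of the original notebook; it shows the lower-triangular pattern in which row $i$ may only attend to columns $\le i$.

# Minimal sketch (not in the original notebook): inspect the mask for length 5.
print(subsequent_mask(5)[0])
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])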

1.2.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call this particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot product of the query with each key, divide each dot product by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packing the queries, keys, and values into matrices $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}$ so that the whole computation can be carried out with fast matrix operations. The output matrix is:

$$\text{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^T}{\sqrt{d_k}}\right)\boldsymbol{V}$$

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$.

Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is faster and more space-efficient in practice, because it can be implemented using highly optimized matrix multiplication code.

While the two mechanisms perform similarly for small values of $d_k$, additive attention outperforms unscaled dot-product attention for larger values of $d_k$. We suspect that for large $d_k$ the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To see why the dot products grow large, assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$; their dot product $q \cdot k = \sum^{d_k}_{i = 1}q_ik_i$ then has mean $0$ and variance $d_k$.) To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
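
The scaling argument is easy to verify numerically. The following snippet is a small sanity check that is not part of the original notebook: it samples random queries and keys and shows that the variance of the unscaled dot products grows roughly linearly with $d_k$, while the scaled version stays near $1$.

# Hypothetical sanity check (not in the original notebook): the variance of
# q . k grows with d_k, while the scaled version stays close to 1.
for d_k in [16, 64, 256]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.var().item(), (dots / math.sqrt(d_k)).var().item())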

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this:

$$\text{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\boldsymbol{W}^{O}, \quad \text{head}_i = \text{Attention}(\boldsymbol{Q}\boldsymbol{W}^{Q}_i, \boldsymbol{K}\boldsymbol{W}^{K}_i, \boldsymbol{V}\boldsymbol{W}^{V}_i)$$

where the projections are parameter matrices $\boldsymbol{W}^{Q}_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, \boldsymbol{W}^{K}_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, \boldsymbol{W}^{V}_i \in \mathbb{R}^{d_{\text{model}} \times d_v}, \boldsymbol{W}^{O} \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.

In this work we employ $h = 8$ parallel attention layers, or heads. For each head we use $d_k = d_v = d_{\text{model}} / h = 64$. Because the dimension of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.

class MultiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        del query
        del key
        del value
        return self.linears[-1](x)
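
As a quick shape check (a hypothetical example, not part of the original notebook), the module maps a (batch, seq_len, d_model) tensor to a tensor of the same shape, while the stored attention weights have shape (batch, h, seq_len, seq_len):

# Hypothetical shape check (not in the original notebook).
mha = MultiHeadAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)
print(out.shape)       # torch.Size([2, 10, 512])
print(mha.attn.shape)  # torch.Size([2, 8, 10, 10])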

1.2.4 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanism in sequence-to-sequence models.
  2. The encoder contains self-attention layers. In a self-attention layer, all of the keys, values, and queries come from the same place, in this case the output of the previous encoder layer. Each position in the current encoder layer can attend to all positions in the previous layer.
  3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to the current decoding position and all positions before it. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections (see the sketch after this list).
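
To make the masking in item 3 concrete, the snippet below (an illustrative sketch, not part of the original notebook) runs the attention function defined earlier with a subsequent mask, so each query position receives weight only from itself and earlier positions:

# Illustrative sketch (not in the original notebook): causal masking in
# scaled dot-product attention; the upper triangle of p_attn is ~0.
x = torch.randn(1, 5, 8)  # (batch, seq_len, d_k)
_, p_attn = attention(x, x, x, mask=subsequent_mask(5))
print(p_attn[0])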

1.3 Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

Another way of describing this is as two convolutions with kernel size $1$. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner layer has dimensionality $d_{ff} = 2048$.

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

1.4 Embedding and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

In our model, the two embedding layers and the pre-softmax linear transformation share the same weight matrix. In the embedding layers, the weights are multiplied by $\sqrt{d_{\text{model}}}$.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
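
Note that the make_model function defined in Section 1.6 below does not actually tie these weight matrices. A minimal sketch of how the sharing could be wired up for a model built by make_model (an illustration only, not the notebook's code) is:

# Illustrative sketch only (the notebook's make_model does not tie weights):
# share one weight matrix between the target embedding and the generator.
# Sharing with the source embedding as well would additionally require a
# joint source/target vocabulary.
def tie_weights(model):
    tgt_emb = model.tgt_embed[0]  # Embeddings module inside nn.Sequential
    model.generator.proj.weight = tgt_emb.lut.weight  # both are (vocab, d_model)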

1.5 Position Encoding

Since the model contains no recurrence and no convolution, in order for it to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens. To this end, we add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so the two can be summed. There are many possible choices of positional encodings, both learned and fixed.

In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}}), \quad PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

As shown in the figure below, the positional encoding adds a sinusoid at each position; the frequency and offset of the wave differ for each dimension.

We also experimented with learned positional embeddings instead and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.
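
The figure can be reproduced with a few lines of plotting code. The snippet below is a minimal sketch (not the notebook's original plotting cell) that evaluates the PE buffer on a zero input and plots four of its dimensions with altair:

# Minimal sketch (not the original plotting cell): visualize PE dimensions 4..7.
pe = PositionalEncoding(d_model=20, dropout=0)
y = pe.forward(torch.zeros(1, 100, 20))[0]  # (100, 20)

frames = []
for dim in [4, 5, 6, 7]:
    frames.append(
        pd.DataFrame(
            {"position": range(100), "embedding": y[:, dim].tolist(), "dimension": dim}
        )
    )
data = pd.concat(frames)

alt.Chart(data).mark_line().encode(
    x="position:Q", y="embedding:Q", color="dimension:N"
)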

1.6 Full Model

Below we define a function that assembles the full model and sets its hyperparameters.

def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model
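
As a quick smoke test (a hypothetical example, not part of the original notebook), we can build a small two-layer model over a toy vocabulary of 11 symbols and count its parameters:

# Hypothetical smoke test (not in the original notebook).
tmp_model = make_model(11, 11, N=2)
print(sum(p.numel() for p in tmp_model.parameters()), "parameters")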

1.7 Inference

Here we take a step forward and generate a prediction from the model. We try to have the Transformer simply memorize its input. As the output below shows, the predictions are random because the model has not been trained yet.

In the next chapter we will build the training loop and try to train the model to memorize the numbers from 1 to 10.

def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)


def run_tests():
    for _ in range(10):
        inference_test()


show_example(run_tests)

2. Part 2: Model Training