1. Preface

An annotated version of “Attention Is All You Need”

“This post presents an annotated version of the paper in the form of a line-by-line implementation”

Code is available here

0.1 Prelims

# !pip install -r requirements.txt
# # Uncomment for colab
# #
# !pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil
# !python -m spacy download de_core_news_sm
# !python -m spacy download en_core_web_sm
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


# Set to False to skip notebook execution (e.g. for debugging)
warnings.filterwarnings("ignore")
RUN_EXAMPLES = True
# Some convenience helper functions used throughout the notebook


def is_interactive_notebook():
    return __name__ == "__main__"


def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None

0.2 Background

Reducing the amount of computation needed for sequential processing is an important goal in neural network design. Earlier architectures proposed for this purpose, including the Extended Neural GPU, ByteNet, and ConvS2S, are all built on CNNs and compute hidden representations for all input and output positions in parallel.

In these models, the number of operations required to relate two arbitrary input or output positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet. This makes it much harder to learn dependencies between distant positions. In the Transformer, the number of operations is reduced to a constant (at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect that Multi-Head Attention counteracts).

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language modeling tasks.

The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

1. Part 1: Model Architecture

1.1 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure:

  • The encoder maps an input sequence $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$.
  • Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ one symbol at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

The overall architecture of the Transformer is shown in the figure below. Both the encoder and the decoder are built from self-attention and point-wise, fully connected layers; their structures are shown in the left and right halves of the figure, respectively.

1.2 Encoder and Decoder Stacks

1.2.1 Encoder

The encoder is composed of a stack of $N = 6$ identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

A residual connection is employed around each of the two sub-layers, followed by layer normalization.

class LayerNorm(nn.Module):
    "Construct a LayerNorm module (see citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout to the output of each sub-layer before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of the same dimension, $d_{\text{model}} = 512$.

class SubLayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SubLayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attention and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SubLayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

1.2.2 Decoder

The decoder is also composed of a stack of $N = 6$ identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

As in the encoder, a residual connection is employed around each sub-layer, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SubLayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

We also modify the self-attention sub-layer in the decoder stack to prevent each position from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$; in other words, information that should not yet be visible is masked out.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0

The attention-mask figure below shows which positions (columns) each target word (row) is allowed to look at. During training, the word at the current decoding position cannot attend to words at later positions.
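
As a quick check, we can also print the mask for a short sequence directly. The snippet below is a minimal sketch that is not part of the original notebook; it shows the lower-triangular pattern in which row $i$ may only attend to columns $\le i$.

# Minimal sketch (not in the original notebook): inspect the mask for length 5.
print(subsequent_mask(5)[0])
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])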

1.2.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call this particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot product of the query with each key, divide each dot product by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packing the queries, keys, and values into matrices $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}$ so that the whole computation can be carried out with fast matrix operations. The output matrix is:

$$\text{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^T}{\sqrt{d_k}}\right)\boldsymbol{V}$$

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$.

Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is faster and more space-efficient in practice, because it can be implemented using highly optimized matrix multiplication code.

While the two mechanisms perform similarly for small values of $d_k$, additive attention outperforms unscaled dot-product attention for larger values of $d_k$. We suspect that for large $d_k$ the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To see why the dot products grow large, assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$; their dot product $q \cdot k = \sum^{d_k}_{i = 1}q_ik_i$ then has mean $0$ and variance $d_k$.) To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
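
The scaling argument is easy to verify numerically. The following snippet is a small sanity check that is not part of the original notebook: it samples random queries and keys and shows that the variance of the unscaled dot products grows roughly linearly with $d_k$, while the scaled version stays near $1$.

# Hypothetical sanity check (not in the original notebook): the variance of
# q . k grows with d_k, while the scaled version stays close to 1.
for d_k in [16, 64, 256]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.var().item(), (dots / math.sqrt(d_k)).var().item())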

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this:

$$\text{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\boldsymbol{W}^{O}, \quad \text{head}_i = \text{Attention}(\boldsymbol{Q}\boldsymbol{W}^{Q}_i, \boldsymbol{K}\boldsymbol{W}^{K}_i, \boldsymbol{V}\boldsymbol{W}^{V}_i)$$

where the projections are parameter matrices $\boldsymbol{W}^{Q}_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, \boldsymbol{W}^{K}_i \in \mathbb{R}^{d_{\text{model}} \times d_k}, \boldsymbol{W}^{V}_i \in \mathbb{R}^{d_{\text{model}} \times d_v}, \boldsymbol{W}^{O} \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.

In this work we employ $h = 8$ parallel attention layers, or heads. For each head we use $d_k = d_v = d_{\text{model}} / h = 64$. Because the dimension of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.

class MultiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        del query
        del key
        del value
        return self.linears[-1](x)
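
As a quick shape check (a hypothetical example, not part of the original notebook), the module maps a (batch, seq_len, d_model) tensor to a tensor of the same shape, while the stored attention weights have shape (batch, h, seq_len, seq_len):

# Hypothetical shape check (not in the original notebook).
mha = MultiHeadAttention(h=8, d_model=512)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)
print(out.shape)       # torch.Size([2, 10, 512])
print(mha.attn.shape)  # torch.Size([2, 8, 10, 10])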

1.2.4 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanism in sequence-to-sequence models.
  2. The encoder contains self-attention layers. In a self-attention layer, all of the keys, values, and queries come from the same place, in this case the output of the previous encoder layer. Each position in the current encoder layer can attend to all positions in the previous layer.
  3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to the current decoding position and all positions before it. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections (see the sketch after this list).
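
To make the masking in item 3 concrete, the snippet below (an illustrative sketch, not part of the original notebook) runs the attention function defined earlier with a subsequent mask, so each query position receives weight only from itself and earlier positions:

# Illustrative sketch (not in the original notebook): causal masking in
# scaled dot-product attention; the upper triangle of p_attn is ~0.
x = torch.randn(1, 5, 8)  # (batch, seq_len, d_k)
_, p_attn = attention(x, x, x, mask=subsequent_mask(5))
print(p_attn[0])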

1.3 Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

Another way of describing this is as two convolutions with kernel size $1$. The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner layer has dimensionality $d_{ff} = 2048$.

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

1.4 Embedding and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

In our model, the two embedding layers and the pre-softmax linear transformation share the same weight matrix. In the embedding layers, the weights are multiplied by $\sqrt{d_{\text{model}}}$.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
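
Note that the make_model function defined in Section 1.6 below does not actually tie these weight matrices. A minimal sketch of how the sharing could be wired up for a model built by make_model (an illustration only, not the notebook's code) is:

# Illustrative sketch only (the notebook's make_model does not tie weights):
# share one weight matrix between the target embedding and the generator.
# Sharing with the source embedding as well would additionally require a
# joint source/target vocabulary.
def tie_weights(model):
    tgt_emb = model.tgt_embed[0]  # Embeddings module inside nn.Sequential
    model.generator.proj.weight = tgt_emb.lut.weight  # both are (vocab, d_model)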

1.5 Position Encoding

Since the model contains no recurrence and no convolution, in order for it to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens. To this end, we add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so the two can be summed. There are many possible choices of positional encodings, both learned and fixed.

In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}}), \quad PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

As shown in the figure below, the positional encoding adds a sinusoid at each position; the frequency and offset of the wave differ for each dimension.

We also experimented with learned positional embeddings instead and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.
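
The figure can be reproduced with a few lines of plotting code. The snippet below is a minimal sketch (not the notebook's original plotting cell) that evaluates the PE buffer on a zero input and plots four of its dimensions with altair:

# Minimal sketch (not the original plotting cell): visualize PE dimensions 4..7.
pe = PositionalEncoding(d_model=20, dropout=0)
y = pe.forward(torch.zeros(1, 100, 20))[0]  # (100, 20)

frames = []
for dim in [4, 5, 6, 7]:
    frames.append(
        pd.DataFrame(
            {"position": range(100), "embedding": y[:, dim].tolist(), "dimension": dim}
        )
    )
data = pd.concat(frames)

alt.Chart(data).mark_line().encode(
    x="position:Q", y="embedding:Q", color="dimension:N"
)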

1.6 Full Model

Below we define a function that assembles the full model and sets its hyperparameters.

def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model
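
As a quick smoke test (a hypothetical example, not part of the original notebook), we can build a small two-layer model over a toy vocabulary of 11 symbols and count its parameters:

# Hypothetical smoke test (not in the original notebook).
tmp_model = make_model(11, 11, N=2)
print(sum(p.numel() for p in tmp_model.parameters()), "parameters")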

1.7 Inference

Here we take a step forward and generate a prediction from the model. We try to have the Transformer simply memorize its input. As the output below shows, the predictions are random because the model has not been trained yet.

In the next chapter we will build the training loop and try to train the model to memorize the numbers from 1 to 10.

def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)


def run_tests():
    for _ in range(10):
        inference_test()


show_example(run_tests)

2. Part 2: Model Training