# A Detailed Look at the Transformer Architecture
Tip: The diagram below is written in Mermaid; you can view it rendered directly in any Markdown renderer that supports Mermaid.
```mermaid
graph LR
    subgraph Encoder
        Input[Input sequence] --> Embedding[Embedding]
        Embedding --> PosEnc[Positional encoding]
        PosEnc --> SA1[Multi-head self-attention 1]
        SA1 --> FF1[Feed-forward network 1]
        FF1 --> SA2[Multi-head self-attention 2]
        SA2 --> FF2[Feed-forward network 2]
        FF2 --> OutputE[Encoder output]
    end
    subgraph Decoder
        DecIn[Target sequence] --> EmbDec[Embedding]
        EmbDec --> PosDec[Positional encoding]
        PosDec --> SA_Dec[Self-attention]
        SA_Dec --> CrossAtt[Cross-attention]
        CrossAtt --> FF_Dec[Feed-forward network]
        FF_Dec --> OutputD[Decoder output]
    end
    OutputE --> CrossAtt
```
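If you want to sanity-check the data flow in the diagram (in particular the arrow from the encoder output into the decoder's cross-attention), PyTorch's built-in `nn.Transformer` already wires these pieces together. The sketch below is illustrative only; all dimensions are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Arbitrary toy dimensions, chosen only for illustration.
d_model, nhead = 64, 4

# nn.Transformer bundles the encoder and decoder stacks from the diagram,
# including the cross-attention link that consumes the encoder output.
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=128)

src = torch.randn(10, 2, d_model)  # source: (seq_len, batch_size, d_model)
tgt = torch.randn(7, 2, d_model)   # target: (seq_len, batch_size, d_model)

out = model(src, tgt)
print(out.shape)  # torch.Size([7, 2, 64]) -- one vector per target position
```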
## Key Components
| Component | Role | Code example |
|---|---|---|
| Embedding + Positional Encoding | Maps discrete tokens to vectors and injects position information | `nn.Embedding` + sinusoidal encoding |
| Multi-Head Self-Attention | Captures long-range dependencies within the sequence | `nn.MultiheadAttention` |
| Feed-Forward Network | Position-wise non-linear transformation | `nn.Linear` → GELU → `nn.Linear` |
| LayerNorm & Residual | Stabilizes training and speeds up convergence | `x = x + Sublayer(x); x = LayerNorm(x)` |
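The first row of the table mentions sinusoidal positional encoding, which the encoder example below does not implement, so here is a minimal sketch following the sine/cosine formulation of the original Transformer paper. The class name `SinusoidalPositionalEncoding` and the `max_len` default are assumptions for illustration, not part of any library API.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position information to token embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Assumes d_model is even, as in the standard formulation.
        position = torch.arange(max_len).unsqueeze(1)                              # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                               # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                               # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(1))                                # (max_len, 1, d_model)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model) -- same layout as the encoder example below.
        return x + self.pe[:x.size(0)]
```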
## PyTorch Example: A Simplified Transformer Encoder Layer
```python
import torch
import torch.nn as nn


class SimpleTransformerEncoder(nn.Module):
    """A single post-norm Transformer encoder layer: self-attention + position-wise FFN."""

    def __init__(self, d_model=256, nhead=8, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention sub-layer, followed by residual connection and LayerNorm
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask,
                                 key_padding_mask=src_key_padding_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward sub-layer (ReLU here; GELU is also common), residual + LayerNorm
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
```
Usage note: `src` is expected to have shape `(seq_len, batch_size, d_model)`, the default (non-`batch_first`) layout for `nn.MultiheadAttention`.
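As a quick smoke test (random inputs, arbitrary sizes, assuming the `SimpleTransformerEncoder` class defined above is in scope):

```python
import torch

encoder = SimpleTransformerEncoder(d_model=256, nhead=8)
src = torch.randn(20, 4, 256)  # (seq_len=20, batch_size=4, d_model=256)
out = encoder(src)
print(out.shape)               # torch.Size([20, 4, 256]) -- shape is preserved
```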
## Summary
By combining parallel self-attention with stacked residual layers, the Transformer models long sequences efficiently. Once you are comfortable with the core modules above, you can go on to build more complex models such as BERT and the GPT family on your own.