# A Detailed Look at the Transformer Architecture

Tip: the diagram below is drawn with Mermaid; you can view it directly in any Markdown renderer that supports it.

```mermaid
graph LR
    subgraph Encoder
        Input[Input sequence] --> Embedding[Embedding]
        Embedding --> PosEnc[Positional encoding]
        PosEnc --> SA1[Multi-head self-attention 1]
        SA1 --> FF1[Feed-forward network 1]
        FF1 --> SA2[Multi-head self-attention 2]
        SA2 --> FF2[Feed-forward network 2]
        FF2 --> OutputE[Encoder output]
    end
    subgraph Decoder
        DecIn[Target sequence] --> EmbDec[Embedding]
        EmbDec --> PosDec[Positional encoding]
        PosDec --> SA_Dec[Masked self-attention]
        SA_Dec --> CrossAtt[Cross-attention]
        CrossAtt --> FF_Dec[Feed-forward network]
        FF_Dec --> OutputD[Decoder output]
    end
    OutputE --> CrossAtt
```

## Key Components

| Component | Role | Code example |
| --- | --- | --- |
| Embedding + positional encoding | Maps discrete tokens to vectors and injects position information (see the sketch below) | `nn.Embedding` + sinusoidal encoding |
| Multi-head self-attention | Captures long-range dependencies within the sequence | `nn.MultiheadAttention` |
| Feed-forward network | Position-wise non-linear transformation | `nn.Linear` -> GELU -> `nn.Linear` |
| LayerNorm & residual connection | Stabilizes training and speeds up convergence | `x = x + Sublayer(x); x = LayerNorm(x)` |
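
The table mentions sinusoidal positional encoding; here is a minimal sketch of how it can be precomputed and added to the embeddings. The class name `SinusoidalPositionalEncoding` and the `(seq_len, batch_size, d_model)` layout are illustrative assumptions chosen to match the encoder-layer example further below.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sin/cos positional encodings to the input embeddings."""

    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                     # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                     # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(1))                      # (max_len, 1, d_model)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model)
        return x + self.pe[: x.size(0)]
```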

## PyTorch Example: A Simplified Transformer Encoder Layer


```python
import torch
import torch.nn as nn


class SimpleTransformerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention sub-layer with residual connection and LayerNorm
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask,
                                 key_padding_mask=src_key_padding_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward sub-layer with residual connection and LayerNorm
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
```

Run tip: `src` should have shape `(seq_len, batch_size, d_model)`, since `nn.MultiheadAttention` defaults to `batch_first=False`.
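
As a quick sanity check, the following snippet (with arbitrary illustrative sizes) passes a random tensor of that shape through the layer:

```python
encoder_layer = SimpleTransformerEncoder(d_model=256, nhead=8)
src = torch.randn(10, 32, 256)   # (seq_len=10, batch_size=32, d_model=256)
out = encoder_layer(src)
print(out.shape)                 # torch.Size([10, 32, 256])
```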

## Summary

The Transformer models long sequences efficiently by combining parallel self-attention with stacked residual sub-layers. Once you are comfortable with the core modules above, you can build more complex models such as BERT or the GPT family yourself.
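
For example, a full encoder is simply N copies of the layer above applied in sequence. A minimal sketch follows; the class name `SimpleTransformerEncoderStack` and `num_layers=6` are illustrative choices, not part of the original example.

```python
class SimpleTransformerEncoderStack(nn.Module):
    def __init__(self, num_layers=6, d_model=256, nhead=8):
        super().__init__()
        # Stack of identical encoder layers defined above
        self.layers = nn.ModuleList(
            [SimpleTransformerEncoder(d_model=d_model, nhead=nhead) for _ in range(num_layers)]
        )

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        for layer in self.layers:
            src = layer(src, src_mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        return src
```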
