# A Detailed Look at the Transformer Architecture

Tip: the diagram below is drawn with Mermaid; you can view it directly in any Markdown renderer that supports it.

```mermaid
graph LR
    subgraph Encoder
        Input[Input sequence] --> Embedding[Embedding]
        Embedding --> PosEnc[Positional encoding]
        PosEnc --> SA1[Multi-head self-attention 1]
        SA1 --> FF1[Feed-forward network 1]
        FF1 --> SA2[Multi-head self-attention 2]
        SA2 --> FF2[Feed-forward network 2]
        FF2 --> OutputE[Encoder output]
    end
    subgraph Decoder
        DecIn[Target sequence] --> EmbDec[Embedding]
        EmbDec --> PosDec[Positional encoding]
        PosDec --> SA_Dec[Masked self-attention]
        SA_Dec --> CrossAtt[Cross-attention]
        CrossAtt --> FF_Dec[Feed-forward network]
        FF_Dec --> OutputD[Decoder output]
    end
    OutputE --> CrossAtt
```

## Key Components

| Component | Role | Code example |
| --- | --- | --- |
| Embedding + positional encoding | Maps discrete tokens to vectors and injects position information (see the sketch below) | `nn.Embedding` + sinusoidal encoding |
| Multi-head self-attention | Captures long-range dependencies within the sequence | `nn.MultiheadAttention` |
| Feed-forward network | Position-wise non-linear transformation | `nn.Linear` -> GELU -> `nn.Linear` |
| LayerNorm & residual connection | Stabilizes training and speeds up convergence | `x = x + Sublayer(x); x = LayerNorm(x)` |
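
The table mentions sinusoidal positional encoding; here is a minimal sketch of how it can be precomputed and added to the embeddings. The class name `SinusoidalPositionalEncoding` and the `(seq_len, batch_size, d_model)` layout are illustrative assumptions chosen to match the encoder-layer example further below.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sin/cos positional encodings to the input embeddings."""

    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                     # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                     # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(1))                      # (max_len, 1, d_model)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model)
        return x + self.pe[: x.size(0)]
```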

## PyTorch Example: A Simplified Transformer Encoder Layer


```python
import torch
import torch.nn as nn


class SimpleTransformerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention sub-layer with residual connection and LayerNorm
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask,
                                 key_padding_mask=src_key_padding_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward sub-layer with residual connection and LayerNorm
        src2 = self.linear2(self.dropout(torch.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
```

Run tip: `src` should have shape `(seq_len, batch_size, d_model)`, since `nn.MultiheadAttention` defaults to `batch_first=False`.
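
As a quick sanity check, the following snippet (with arbitrary illustrative sizes) passes a random tensor of that shape through the layer:

```python
encoder_layer = SimpleTransformerEncoder(d_model=256, nhead=8)
src = torch.randn(10, 32, 256)   # (seq_len=10, batch_size=32, d_model=256)
out = encoder_layer(src)
print(out.shape)                 # torch.Size([10, 32, 256])
```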

## Summary

The Transformer models long sequences efficiently by combining parallel self-attention with stacked residual sub-layers. Once you are comfortable with the core modules above, you can build more complex models such as BERT or the GPT family yourself.
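
For example, a full encoder is simply N copies of the layer above applied in sequence. A minimal sketch follows; the class name `SimpleTransformerEncoderStack` and `num_layers=6` are illustrative choices, not part of the original example.

```python
class SimpleTransformerEncoderStack(nn.Module):
    def __init__(self, num_layers=6, d_model=256, nhead=8):
        super().__init__()
        # Stack of identical encoder layers defined above
        self.layers = nn.ModuleList(
            [SimpleTransformerEncoder(d_model=d_model, nhead=nhead) for _ in range(num_layers)]
        )

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        for layer in self.layers:
            src = layer(src, src_mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        return src
```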
