Karpathy GPT Tutorial Notes (Part 5)

布客飞龙

布客飞龙 · Published 2026-06-22 03:01:23

To make this possible, we need to change the Flatten layer. We introduce a new FlattenConsecutive layer that concatenates every n consecutive elements and adds a "group" dimension.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_281.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_283.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_285.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_287.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_289.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_291.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_293.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_295.png

class FlattenConsecutive:
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            x = x.squeeze(1)
        return x
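
As a quick sanity check of the layer above, the following sketch re-declares FlattenConsecutive and compares its `view` call with an explicit concatenation of the embeddings at even and odd positions. The tensor sizes (4 examples, 8 characters, 10-dimensional embeddings) are illustrative assumptions, not values taken from these notes.

```python
import torch

class FlattenConsecutive:
    """Concatenate every n consecutive time steps along the channel axis."""
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            x = x.squeeze(1)  # drop the group axis once only one group is left
        return x

torch.manual_seed(0)
x = torch.randn(4, 8, 10)  # assumed sizes: (batch, characters, embedding dim)

out = FlattenConsecutive(2)(x)
# Explicit version: pair position 0 with 1, 2 with 3, and so on.
explicit = torch.cat([x[:, ::2, :], x[:, 1::2, :]], dim=2)

print(out.shape)                   # shape after grouping pairs of characters
print(torch.equal(out, explicit))  # whether both constructions agree

flat_all = FlattenConsecutive(8)(x)  # all 8 characters in one group
print(flat_all.shape)
```

With n equal to the full sequence length the group axis is squeezed away, so the layer falls back to the behavior of an ordinary flatten.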

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_297.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_299.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_301.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_303.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_305.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_307.png

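Next we redesign the model architecture. The first FlattenConsecutive(2) layer splits the 8 characters into 4 groups and concatenates the embeddings of the 2 characters in each group. The linear layer that follows therefore only sees information from those 2 characters. A second FlattenConsecutive(2) then merges the 4 groups into 2, and so on, forming a small hierarchical network.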

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_309.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_311.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_313.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_315.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_317.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_319.png

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size)
])
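
To see how the tensor shapes evolve through this stack, the sketch below traces them in plain Python without building the model. The helper `trace_shapes` is hypothetical, and the concrete sizes (batch of 32, 24-dimensional embeddings, 128 hidden units, a 27-character vocabulary) are assumptions chosen only for illustration.

```python
def trace_shapes(batch, block_size, n_embd, n_hidden, vocab_size, n=2):
    """Return (layer name, output shape) pairs for the hierarchical model."""
    shapes = [("Embedding", (batch, block_size, n_embd))]
    groups, channels = block_size, n_embd
    while groups > 1:
        groups //= n
        # FlattenConsecutive(n): n times fewer groups, n times more channels.
        if groups > 1:
            flat = (batch, groups, channels * n)
        else:
            flat = (batch, channels * n)  # the last group axis is squeezed out
        shapes.append((f"FlattenConsecutive({n})", flat))
        # Linear, BatchNorm and Tanh only change the last dimension.
        shapes.append(("Linear/BatchNorm/Tanh", flat[:-1] + (n_hidden,)))
        channels = n_hidden
    shapes.append(("Linear (logits)", (batch, vocab_size)))
    return shapes

shapes = trace_shapes(batch=32, block_size=8, n_embd=24, n_hidden=128, vocab_size=27)
for name, shape in shapes:
    print(f"{name:24s} {shape}")
```

Each FlattenConsecutive(2) halves the number of groups, so a block size of 8 needs exactly three such stages before the final linear layer.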

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_321.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_323.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_325.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_327.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_329.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_331.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_333.png

Fixing the BatchNorm Layer

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_335.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_337.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_339.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_341.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_343.png

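In the previous section we built the hierarchical model. In this section we fix a key problem: how the BatchNorm layer handles multi-dimensional input.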

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_345.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_347.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_349.png

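Our original BatchNorm implementation assumed 2D input of shape (batch_size, features). In the new architecture, FlattenConsecutive produces 3D input of shape (batch_size, groups, features). During training, BatchNorm must therefore compute the mean and variance over both the batch and the groups dimensions, so that each feature channel gets a single statistic.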

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_351.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_353.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_355.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_357.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_359.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_361.png

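class BatchNorm:
    def __call__(self, x):
        if self.training:
            # Reduce over batch and groups for 3D input, over batch only for 2D input.
            dims = (0, 1) if x.ndim == 3 else 0
            xmean = x.mean(dims, keepdim=True)
            xvar = x.var(dims, keepdim=True)
        # ... normalization and the running-statistics update follow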
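
The difference between the two reductions is easiest to see from the shapes of the resulting statistics. The sketch below uses an arbitrary 3D tensor (the sizes 32, 4 and 68 are assumptions for illustration) and compares reducing over the batch dimension alone with reducing over batch and groups together.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 4, 68)  # assumed sizes: (batch, groups, features)

# Reducing over the batch dimension only keeps a separate statistic per group.
mean_batch_only = x.mean(0, keepdim=True)
# Reducing over batch and groups gives one statistic per feature channel.
mean_batch_and_groups = x.mean((0, 1), keepdim=True)

print(mean_batch_only.shape)
print(mean_batch_and_groups.shape)

# Normalizing with the per-channel statistics, as the fixed layer does.
xvar = x.var((0, 1), keepdim=True)
xhat = (x - mean_batch_and_groups) / torch.sqrt(xvar + 1e-5)
```

With the batch-only reduction the layer still runs thanks to broadcasting, which is why the problem is easy to miss: it silently maintains separate statistics for every group position instead of one per channel.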

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_363.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_365.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_367.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_369.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_371.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_373.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_375.png

After fixing this bug, model performance improved slightly but consistently.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_377.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_379.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_381.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_383.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_385.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_387.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_389.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_391.png

Experimental Results and Future Directions

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_393.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_395.png

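By increasing model capacity (for example, the embedding dimension and the hidden layer size), we eventually brought the validation loss down to about 1.993, crossing the 2.0 mark.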

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_397.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_399.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_401.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_403.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_405.png

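In this lesson we implemented a simplified WaveNet-style architecture. We learned how to:

Use modular building blocks (such as Sequential) to organize a complex network.
Fuse information hierarchically with FlattenConsecutive and linear layers.
Adjust BatchNorm so that it handles multi-dimensional input correctly.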

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_407.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_409.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_411.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_413.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_415.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_417.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_419.png

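However, what we implemented is only the core skeleton of the WaveNet idea. The full WaveNet also includes gated activation units, residual connections, and dilated causal convolutions (for efficient computation). We also lack a systematic framework for hyperparameter search and experiments, so our tuning so far has been mostly "guess and check".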

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_421.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_423.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_425.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_427.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_429.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_431.png

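In future lessons we could:

Implement dilated convolutions to compute the outputs for a whole input sequence efficiently.
Add residual connections to train deeper networks.
Build an experiment pipeline for large-scale hyperparameter optimization.
Explore recurrent neural networks (RNN/LSTM) and the Transformer architecture.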

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_433.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_435.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_437.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_439.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_441.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_443.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_445.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_447.png

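Challenge: try tuning this lesson's model (for example, the channel counts of each layer or the embedding dimension), or read the WaveNet paper and implement its more complex layers, and see whether you can beat the validation loss of 1.993.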

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_449.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_451.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_453.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_455.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_457.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_459.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_461.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_463.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_465.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/c9be2506944533fbac16b5c8db99ea71_467.png

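Summary: in this lesson we started from a basic MLP and gradually built a hierarchical, WaveNet-like character-level language model. We refactored the code to make it clearer, introduced the idea of hierarchical information fusion, and fixed the way the batch normalization layer handles multi-dimensional input. Performance improved, but this is only the beginning of exploring modern deep neural network architectures.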

Lesson P7: Building GPT from Scratch 🧠

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_1.png

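In this lesson we learn how to build a GPT-like Transformer language model from scratch. We use a simple character-level dataset (Tiny Shakespeare) and implement the model's core components step by step, including self-attention, multi-head attention, the feed-forward network, and residual connections. Along the way you will gain a solid understanding of the basic principles behind modern large language models such as ChatGPT.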

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_3.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_5.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_7.png

Overview 📋

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_9.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_11.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_13.png

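The Transformer architecture is at the core of many of today's advanced AI systems. It was first proposed in the 2017 paper "Attention Is All You Need", and GPT (Generative Pre-trained Transformer) is built on it. In this tutorial we focus on building a decoder-only Transformer for character-level language modeling. We cannot reproduce a system as complex as ChatGPT, but building a miniature version lets us understand clearly how it works.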

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_15.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_17.png

We start with data processing, implement the key parts of the model step by step, train on the Tiny Shakespeare dataset, and finally generate Shakespeare-style text.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_19.png

1. Data Preparation and Tokenization 📚

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_21.png

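First we need to prepare the data and convert it into a format the model can process. We use the Tiny Shakespeare dataset, a single text file of roughly one million characters drawn from Shakespeare's works.

1.1 Reading the Data

We download the dataset from the given URL and read it into one long string.

import torch
import requests

# Download the dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
text = response.text
print(f"Dataset length (characters): {len(text)}")
print(text[:1000])  # print the first 1000 characters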

1.2 Building the Vocabulary

Next we find all unique characters in the dataset and build a vocabulary. Each character is mapped to a unique integer (token).

# Collect all unique characters and sort them
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(''.join(chars))  # print all characters

# Create the encoder and decoder
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

def encode(s):
    return [stoi[c] for c in s]  # string -> list of integers

def decode(l):
    return ''.join([itos[i] for i in l])  # list of integers -> string

# Test encoding and decoding
test_str = "hi there"
encoded = encode(test_str)
decoded = decode(encoded)
print(f"Original string: {test_str}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

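1.3 Splitting the Dataset

We split the dataset into a training set (90%) and a validation set (10%). The validation set measures how well the model generalizes and lets us detect overfitting.

# Encode the entire text as a tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)

# Split into training and validation sets
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]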

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_23.png

2. Batching the Data 🔄

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_25.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_27.png

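Because we cannot feed the whole dataset into the model at once, we train on small chunks sampled at random from the data. Each batch contains several independent sequences, which the model processes in parallel.

The following function creates a batch: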

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_29.png

def get_batch(split):
    # Pick the training or validation split
    data = train_data if split == 'train' else val_data
    # Sample random start indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Build inputs x and targets y (y is x shifted by one position)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# Hyperparameters
batch_size = 4
block_size = 8

# Fetch one batch
xb, yb = get_batch('train')
print('Shape of input xb:', xb.shape)
print('Shape of target yb:', yb.shape)
print('Example inputs:\n', xb)
print('Example targets:\n', yb)

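In this batch, xb is the model input and yb holds, for every position, the next character as the target. The model's task is to predict yb from the context in xb.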
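
Each row of a batch actually packs several training examples, one per position, because the target at position t depends on all inputs up to t. The sketch below spells these pairs out in plain Python; the helper `context_target_pairs` and the token ids are hypothetical and only serve as an illustration.

```python
def context_target_pairs(chunk):
    """Split a chunk of block_size + 1 tokens into (context, target) pairs."""
    x, y = chunk[:-1], chunk[1:]
    return [(x[: t + 1], y[t]) for t in range(len(x))]

# Nine illustrative token ids: block_size = 8 inputs plus one extra target.
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]
pairs = context_target_pairs(chunk)
for context, target in pairs:
    print(f"when the input is {context} the target is {target}")
```

A single row of length block_size therefore yields block_size examples, with context lengths from 1 up to block_size.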

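3. Baseline Model: Bigram Language Model 🔤

Before diving into the Transformer, we first implement the simplest possible language model: a bigram model. It predicts the next character from the identity of the current character alone, ignoring all earlier context.

3.1 Model Definition

A bigram model is essentially a lookup table in which each character directly predicts the distribution of the next character.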

import torch.nn as nn

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_31.png>

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token maps directly to the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both integer tensors of shape (B, T)
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            # Reshape to the layout PyTorch's cross-entropy loss expects
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is the current context, of shape (B, T)
        for _ in range(max_new_tokens):
            # Get the predictions
            logits, loss = self(idx)
            # Keep only the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # Apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1)  # (B, C)
            # Sample the next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

# Instantiate the model
model = BigramLanguageModel(vocab_size)

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_33.png

3.2 Training and Generation

We can train this model with a simple optimization loop and look at what it generates.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for steps in range(10000):
    # Fetch a batch of data
    xb, yb = get_batch('train')
    # Forward pass: compute the loss
    logits, loss = model(xb, yb)
    # Backward pass: update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"Final loss: {loss.item()}")

# Generate text
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
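
A useful reference point when reading the printed loss is the cross-entropy of a model that guesses uniformly over the vocabulary, which equals ln(vocab_size). The sketch below computes it for an assumed vocabulary of 65 characters; that number is an assumption here, since the notes only print the vocabulary size without stating it.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return -math.log(1.0 / vocab_size)

baseline = uniform_cross_entropy(65)  # 65 is an assumed vocabulary size
print(f"{baseline:.4f}")
```

A freshly initialized model can start somewhat above this value because its random logits are not perfectly uniform, so a training loss well below the baseline is a sign that the model has learned real statistics of the text.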

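The bigram model is very limited because it uses no context beyond the current character. Next we introduce self-attention, which lets characters communicate with each other.

4. Self-Attention 🤝

Self-attention is the core component of the Transformer. It lets each element (token) of a sequence aggregate information from other elements according to how it relates to them.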

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_35.png

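4.1 The Math

The key idea of self-attention is that every token emits three vectors: a query, a key, and a value.

Query (Q): "what am I looking for?"
Key (K): "what information do I contain?"
Value (V): "what will I pass along if I am attended to?"

The affinity between tokens (the raw attention score) is the dot product of queries and keys: affinity = Q @ K^T. After normalization, these scores serve as the weights of a weighted sum over the values, which is how information is aggregated.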

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_37.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_39.png

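To enforce causality in language modeling (the current token must not see future tokens), we use a lower-triangular mask and set the attention scores of future positions to negative infinity, so that their weights become 0 after softmax.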
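
To make the mechanics concrete, here is a minimal pure-Python sketch of causal single-head attention over small lists of vectors. It is an illustration under stated assumptions rather than the implementation used in this lesson: the function `causal_attention` is hypothetical, it scales scores by 1/sqrt(d), and it skips future positions outright, which has the same effect as filling them with negative infinity before the softmax.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention over lists of vectors (one per position)."""
    T, d = len(q), len(q[0])
    out, weights = [], []
    for t in range(T):
        # Scaled dot-product scores against positions 0..t only (the causal mask).
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))])
    return out, weights

# All-zero queries give every visible position the same score.
q = [[0.0, 0.0]] * 3
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [3.0], [5.0]]
out, weights = causal_attention(q, k, v)
print(weights)
print(out)
```

With all-zero queries each row of weights is uniform over the positions seen so far, so the outputs are running averages of the values. Learned queries and keys replace this uniform average with a data-dependent one.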

4.2 Implementing Single-Head Self-Attention

Here is a PyTorch implementation of a single self-attention head:

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_41.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_43.png

class Head(nn.Module):
    """ One head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask used for causal attention
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # Compute attention scores (affinities), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Apply the causal mask
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = nn.functional.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of the values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v  # (B, T, head_size)
        return out

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_45.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_46.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_48.png

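In this implementation:

We define linear projection layers for the keys, queries, and values.
We compute scaled dot-product attention scores and apply the causal mask.
We use softmax to turn the scores into a probability distribution (the attention weights).
We use these weights to take a weighted sum of the value vectors, which gives the output.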
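
The reason for scaling the dot products can be checked numerically. For queries and keys with unit-variance entries, the variance of an unscaled dot product grows with the head size, which pushes softmax toward near one-hot outputs; multiplying by head_size ** -0.5 brings the variance back to about 1. The following sketch estimates both variances with the standard library only; the helper name, the head size of 16, and the sample count are assumptions for illustration.

```python
import random
import statistics

def dot_product_variance(head_size, n_samples=2000, scale=1.0, seed=0):
    """Empirical variance of scaled dot products of unit-Gaussian vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        dots.append(scale * sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

head_size = 16
raw = dot_product_variance(head_size)
scaled = dot_product_variance(head_size, scale=head_size ** -0.5)
print(f"unscaled variance ~ {raw:.2f}, scaled variance ~ {scaled:.2f}")
```

The unscaled estimate should land near head_size and the scaled one near 1, up to sampling noise.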

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_50.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_52.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_54.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_56.png

5. Multi-Head Attention and the Transformer Block 🧩

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_58.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_60.png

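A single attention head may only pick up on one particular kind of relationship. To capture richer information we run several attention heads in parallel; this is multi-head attention.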

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_62.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_64.png

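5.1 Implementing Multi-Head Attention

We concatenate the outputs of several single attention heads along the channel dimension.

class MultiHeadAttention(nn.Module):
    """ Multiple heads of self-attention """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Projection layer; its input width is the concatenated head output
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Run every head on the same input and concatenate along the channel dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))  # project back onto the residual pathway
        return out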

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_66.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_68.png

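5.2 Feed-Forward Network

After self-attention has let tokens communicate, each token needs to process the information it gathered on its own. This is done by a simple feed-forward network (FFN), usually a two-layer MLP.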

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_70.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_72.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_74.png

class FeedForward(nn.Module):
    """ A simple feed-forward network """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand the dimension
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back to the original dimension
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_76.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_77.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_79.png

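5.3 Building the Transformer Block

We now combine multi-head attention and the feed-forward network into a Transformer block. To make deep networks easier to optimize, we add residual connections and layer normalization.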

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_81.png

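Residual connection: adds the input of each sub-layer directly to its output. This creates a gradient highway that helps mitigate vanishing gradients in deep networks.
Layer normalization: normalizes each token's features inside the block, which stabilizes training.

class Block(nn.Module):
    """ Transformer block: communication (attention) followed by computation (feed-forward) """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # multi-head self-attention
        self.ffwd = FeedForward(n_embd)                  # feed-forward network
        self.ln1 = nn.LayerNorm(n_embd)                  # layer norm 1
        self.ln2 = nn.LayerNorm(n_embd)                  # layer norm 2

    def forward(self, x):
        # Self-attention with a residual connection; layer norm is applied first (pre-norm)
        x = x + self.sa(self.ln1(x))
        # Feed-forward network with a residual connection; layer norm is applied first
        x = x + self.ffwd(self.ln2(x))
        return x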
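
The two ingredients of the block can be illustrated for a single token in plain Python. This is a simplified sketch under stated assumptions: `layer_norm` and `residual_block` are hypothetical helpers, and the learnable gain and bias of `nn.LayerNorm` are left out (they start at 1 and 0, so a freshly initialized layer behaves like this version).

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)  # biased variance
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def residual_block(x, sublayer):
    """Pre-norm residual update: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

token = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(token)
# A sub-layer that outputs zeros leaves the residual stream unchanged.
same = residual_block(token, lambda h: [0.0] * len(h))
print(normed)
print(same)
```

Because the sub-layer only adds to the input, a block can start out close to the identity and contribute gradually as training progresses, which is one reason deep stacks of such blocks remain trainable.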

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_83.png

6. Building the Full GPT Model 🏗️

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_85.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_87.png

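We can now put all the components together to build the full GPT model. Our model consists of:

A token embedding layer that turns integer tokens into vectors.
A position embedding layer that provides position information for each position in the sequence.
Several Transformer blocks (decoder blocks).
A final layer normalization and a linear projection layer that predict the next token.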

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_89.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_91.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_93.png

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # One embedding vector per token
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # One embedding vector per position
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Stack of Transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        # Final layer norm
        self.ln_f = nn.LayerNorm(n_embd)
        # Language-modeling head: projects features back onto the vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Look up token and position embeddings (on the same device as idx)
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # (B, T, n_embd)
        # Pass through the Transformer blocks
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is the current context, of shape (B, T)
        for _ in range(max_new_tokens):
            # Crop the context to the block size if it is too long
            idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
            # Get the predictions
            logits, loss = self(idx_cond)
            # Keep only the last time step
            logits = logits[:, -1, :]  # (B, C)
            # Apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1)  # (B, C)
            # Sample the next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_95.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_97.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_99.png

7. Training and Evaluating the Model 🚀

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_101.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_103.png

We can now train the GPT model with larger hyperparameters and observe its performance.

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_105.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_107.png

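7.1 Setting Hyperparameters and the Device

# Hyperparameters
batch_size = 64         # number of independent sequences per batch
block_size = 256        # maximum context length
max_iters = 5000        # number of training iterations
eval_interval = 500     # evaluate every this many steps
learning_rate = 3e-4    # learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU if available
eval_iters = 200        # number of batches averaged when estimating the loss
n_embd = 384            # embedding dimension
n_head = 6              # number of attention heads
n_layer = 6             # number of Transformer blocks
dropout = 0.2           # dropout rate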

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_109.png>

# Instantiate the model and move it to the device
model = GPTLanguageModel()
m = model.to(device)
print(f"Model parameters: {sum(p.numel() for p in m.parameters())/1e6:.2f} M")
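
The printed parameter count can be cross-checked by hand from the layer definitions. The sketch below adds up the weights and biases of each component for the hyperparameters above; `count_gpt_parameters` is a hypothetical helper, and the vocabulary size of 65 is an assumption, since the notes do not state the value printed for Tiny Shakespeare.

```python
def count_gpt_parameters(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count trainable parameters of the GPTLanguageModel layout above."""
    head_size = n_embd // n_head
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Key, query and value projections (no bias) for every head
    attention = n_head * 3 * n_embd * head_size
    # Output projection of multi-head attention (weights + bias)
    proj = (head_size * n_head) * n_embd + n_embd
    # Two linear layers of the feed-forward network (weights + biases)
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a gain and a bias vector
    layer_norms = 2 * (2 * n_embd)
    block = attention + proj + ffwd + layer_norms
    # Final LayerNorm plus the language-modeling head
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)
    return embeddings + n_layer * block + final

total = count_gpt_parameters(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
print(f"{total / 1e6:.2f} M parameters")
```

The causal mask registered with `register_buffer` is not a parameter and is therefore not counted. Almost all of the total sits in the Transformer blocks, with the feed-forward layers taking the largest share.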

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_111.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/d489e9697e4ec525091637b9ac0b6163_113.png>

# Create the optimizer
optimizer =

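# Lesson P8: State of GPT 🧠

In this lesson we learn how large language models such as GPT are trained and how to apply them effectively to real tasks. The lesson has two parts: the first covers the full pipeline for training a GPT assistant, and the second discusses how best to use these assistants in practice.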

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_1.png>

## Part 1: How to Train a GPT Assistant 🏗️

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_3.png>

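Training an assistant model like GPT is a multi-stage process. It can be roughly divided into four main stages: pretraining, supervised fine-tuning, reward modeling, and reinforcement learning. We go through them one by one below.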

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_5.png>

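### 1. Pretraining: Building the Base Model

Pretraining is the heart of the whole process and consumes the vast majority of the compute and time. The goal of this stage is for the model to learn to understand and generate human language.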

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_7.png>

First we need to collect a massive amount of text data. This data usually comes from the internet and mixes several sources.

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_9.png>

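Common sources that make up the training data mixture include:
*   Common Crawl (web crawl data)
*   C4 (another widely used crawl-based dataset)
*   High-quality datasets such as GitHub code, Wikipedia, books, academic papers, and Stack Exchange Q&A.

These sources are mixed and sampled in fixed proportions to form the training set for the neural network. Before training, the text goes through a preprocessing step called tokenization, which losslessly converts raw text into a sequence of integers, the "native" format that GPT models work with. A commonly used algorithm is byte pair encoding (BPE).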

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_11.png>

**Tokenization example**: `"Hello world!"` may be converted to the integer sequence `[15496, 995, 0]`.
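
The core step of byte pair encoding can be sketched in a few lines: find the most frequent adjacent pair of tokens and replace it with a new token id, then repeat. The toy example below performs a single merge on a short byte string. It is a minimal illustration of the idea, not the tokenizer used by GPT models, and the helper names and the sample string are assumptions.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of pair in ids with new_id."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
pair = most_frequent_pair(ids)             # the most frequent byte pair
new_ids = merge_pair(ids, pair, 256)       # 256 is the first id beyond the byte range
print(pair, new_ids)
```

Each merge shortens the sequence and grows the vocabulary by one entry. Repeating the step many times on a large corpus yields a vocabulary of the size used in practice.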

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_13.png>

Next, let's look at some of the key hyperparameters that govern this stage, using GPT-3 and LLaMA as examples:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_15.png>

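*   **Vocabulary size**: usually in the tens of thousands (for example, 50,257 tokens).
*   **Context length**: the number of tokens the model can look at in one go; early models used 2K or 4K, while newer models reach 100K tokens or more.
*   **Parameter count**: GPT-3 has 175 billion parameters, LLaMA has 65 billion.
*   **Training data size**: GPT-3 was trained on about 300 billion tokens, whereas LLaMA was trained on about 1.4 trillion tokens.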

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_17.png>

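How capable a model is depends not only on its parameter count but also heavily on how much data it is trained on and for how long. The hyperparameters that specify the Transformer architecture include the number of heads, the model dimension, and the number of layers. Training a 65-billion-parameter model can take roughly 2,000 GPUs running for 21 days, at a cost of several million dollars.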
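
A little arithmetic on the figures quoted above makes the comparison concrete. The sketch below computes how many training tokens each model saw per parameter and the approximate GPU-hours for the 65-billion-parameter run; the helper `tokens_per_parameter` is hypothetical, and all inputs are the rounded numbers from this section.

```python
def tokens_per_parameter(tokens, parameters):
    """Training tokens seen per model parameter."""
    return tokens / parameters

gpt3 = tokens_per_parameter(300e9, 175e9)    # GPT-3: 300B tokens, 175B parameters
llama = tokens_per_parameter(1.4e12, 65e9)   # LLaMA: 1.4T tokens, 65B parameters
gpu_hours = 2000 * 21 * 24                   # GPUs x days x hours per day

print(f"GPT-3: {gpt3:.1f} tokens/param, LLaMA: {llama:.1f} tokens/param")
print(f"LLaMA 65B training: about {gpu_hours:,} GPU-hours")
```

LLaMA sees more than ten times as many tokens per parameter as GPT-3, which illustrates why a smaller model trained longer can still be very capable.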

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_19.png>

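So how does pretraining actually work? The tokenized data is arranged into batches. Each batch holds multiple independent rows, each as long as the context length. Documents are packed into the rows and separated by a special end-of-text token.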
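
The packing step can be sketched in a few lines. This is a minimal illustration, not the pipeline used for any particular model: the function name `pack_documents` is hypothetical, the end-of-text ID is assumed to be 50256 (the value GPT-2 uses), and trailing tokens that do not fill a whole row are simply dropped.

```python
EOT = 50256  # assumed end-of-text token ID (GPT-2's <|endoftext|>); any reserved ID works


def pack_documents(docs, context_len, batch_size):
    """Concatenate tokenized documents with EOT separators, then cut into rows.

    docs: list of token-ID lists, one per document.
    Returns a list of batches; each batch is a list of rows of length context_len.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOT)  # mark where one document ends and the next begins

    # Cut the stream into full rows; a trailing partial row is dropped.
    rows = [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
batches = pack_documents(docs, context_len=4, batch_size=2)
for batch in batches:
    print(batch)
```

Note that a row can start or end in the middle of a document; the end-of-text token is what tells the model that a new, unrelated document has begun.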

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_21.png>

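The model's task is to predict the next token in the sequence. Take the green cell in the figure: the Transformer looks at all the yellow tokens before it (the context) and tries to predict the red token that comes next. It outputs a probability for every token in the vocabulary.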

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_23.png>

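**Training objective**: the model's predicted probabilities are compared with the actual next token (the supervision signal), and backpropagation is used to adjust the Transformer's billions of parameters so that its predictions keep improving.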

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_25.png>

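At the start of training the weights are random and the output is gibberish. As training proceeds, the model gradually picks up words, grammar, and text structure. We can track progress by watching the training loss fall: a lower loss means the model assigns a higher probability to the correct next token.

Once pretraining finishes we have a base model. It turns out that a model trained this way on a very large corpus learns powerful general-purpose language representations and can be adapted efficiently to all kinds of downstream tasks.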
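
To make the link between loss and probability concrete, here is a small sketch of the per-token loss, assuming the standard cross-entropy objective (the negative log-probability of the correct next token). The four-token vocabulary and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target):
    """Cross-entropy for one position: -log P(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target])


# A model with random weights is roughly uniform over the vocabulary.
untrained = next_token_loss([0.0, 0.0, 0.0, 0.0], target=2)
# A trained model puts most of its probability mass on the right token.
trained = next_token_loss([0.1, 0.2, 3.0, -1.0], target=2)
print(untrained, trained)
```

With a uniform guess over a vocabulary of size V the loss is log(V), which is why a freshly initialized model starts near that value and improves from there.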

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_27.png>

### 2. From Base Model to Assistant Model

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_29.png>

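A base model is essentially a document completer: all it wants to do is continue what it takes to be a document. If you ask it "What is the capital of France?", it may continue with something like "The capital of France is a common question; the answer is Paris." rather than answering directly.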

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_31.png>

To turn the model into a useful assistant, we need to steer it further. There are two main routes:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_33.png>

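**Route 1: Prompt engineering**
We can trick a base model into performing a task by carefully crafting its input. With few-shot prompting, for instance, we put a few question-and-answer examples before the question so that the model imitates the format. We can even lay the document out as a conversation between a human and an assistant to coax the base model into playing the assistant role. This approach is not always reliable, though.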
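
A few-shot prompt is just a string laid out so that the most natural continuation is the answer. The sketch below shows one possible layout; the `Q:`/`A:` format and the helper name `few_shot_prompt` are illustrative choices, not a required convention.

```python
def few_shot_prompt(examples, question):
    """Build a document whose natural continuation is the answer to `question`."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # leave the last answer open for the model
    return "\n\n".join(parts)


prompt = few_shot_prompt(
    [("What is the capital of Germany?", "Berlin"),
     ("What is the capital of Japan?", "Tokyo")],
    "What is the capital of France?",
)
print(prompt)
```

A base model that sees this text is completing a document of question-answer pairs, so a short direct answer is far more likely than a rambling continuation.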

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_35.png>

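**Route 2: Supervised fine-tuning**
This is the more reliable way to build a real assistant model. In this stage we collect a small but high-quality dataset.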

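The dataset is built as follows:
*   Human contractors are hired to write prompts and the corresponding ideal responses, following detailed guidelines (responses should be helpful, truthful, and harmless).
*   Typically on the order of tens of thousands to a hundred thousand such examples are needed.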

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_37.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_39.png>

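We then keep running the same **language modeling task** on this new dataset. The algorithm is unchanged; only the training data switches from internet documents to high-quality prompt-response pairs. The resulting model is called the SFT model, and it is an assistant that can be deployed as is.

### 3. Reinforcement Learning from Human Feedback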

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_41.png>

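To make the assistant better still, we can bring in reinforcement learning from human feedback (RLHF). This stage has two steps: reward modeling and reinforcement learning.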

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_43.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_45.png>

**Step 1: Reward modeling**
Here the form of data collection changes from writing answers to comparing answers.

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_47.png>

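The data is collected as follows:
1.  Use the existing SFT model to generate several (say, three) different responses to the same prompt.
2.  Have human labelers rank the responses by quality.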

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_49.png>

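Next we train a reward model. Given a prompt and a response, it predicts a scalar reward that represents the quality of the response. During training, the reward model's predictions are pushed to agree with the labelers' rankings.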
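
One common way to turn a ranking into a training signal is a pairwise loss: for every pair of responses, penalize the reward model when the lower-ranked response scores higher. The sketch below assumes the `-log(sigmoid(r_better - r_worse))` form, which is a widely used choice but is not spelled out in these notes; the function name and reward values are illustrative.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_to_worst):
    """Average -log(sigmoid(r_better - r_worse)) over all pairs in one ranking.

    rewards_best_to_worst: reward-model scores for the responses to one prompt,
    listed in the order the human labeler ranked them (best first).
    """
    losses = []
    for better, worse in combinations(rewards_best_to_worst, 2):
        # -log(sigmoid(x)) == log(1 + exp(-x))
        losses.append(math.log(1.0 + math.exp(-(better - worse))))
    return sum(losses) / len(losses)


agrees = pairwise_ranking_loss([2.0, 0.5, -1.0])     # scores follow the ranking
disagrees = pairwise_ranking_loss([-1.0, 0.5, 2.0])  # scores contradict it
print(agrees, disagrees)
```

The loss is small when the scores are ordered the same way as the human ranking and grows as they disagree, so minimizing it pulls the reward model toward the labelers' preferences.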

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_51.png>

**Step 2: Reinforcement learning**
The reward model is now frozen and used to guide further optimization of the SFT model.

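The reinforcement learning loop works as follows:
1.  Collect a large batch of prompts.
2.  Generate a response to each prompt with the current SFT model.
3.  Score each response with the reward model.
4.  Update the SFT model's parameters so that responses that earned a high reward become more likely in the future and low-reward responses become less likely.

This is usually done with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO). The resulting model is the RLHF model. ChatGPT, for example, is an RLHF model.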
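
The effect of step 4 can be seen in a toy setting. The sketch below is deliberately not PPO: it assumes a "policy" that is just a softmax over three fixed candidate responses with made-up rewards, and it takes exact gradient steps on the expected reward. It only illustrates the direction of the update.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=0.5):
    """One gradient-ascent step on the expected reward of a softmax policy.

    d E[r] / d logit_i = p_i * (r_i - E[r]), so responses that score above
    the current average gain probability and the others lose it.
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)]


rewards = [1.0, 0.2, -0.5]   # hypothetical reward-model scores for 3 responses
logits = [0.0, 0.0, 0.0]     # start with all responses equally likely
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)
probs = softmax(logits)
print(probs)
```

Real RLHF training samples responses token by token from a large model and adds safeguards against drifting too far from the SFT model, but the underlying idea is the same reweighting toward high-reward outputs.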

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_53.png>

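So why bother with RLHF? Experiments show that people generally prefer the outputs of RLHF models. One plausible reason is that, for a human, judging which of two answers is better is much easier than writing a perfect answer from scratch. RLHF makes more efficient use of human judgment.

Note, however, that RLHF models are not better than base models in every respect. They can lose some creativity or diversity, so their outputs are more deterministic and conservative. For tasks that need diverse output, such as brainstorming creative names, a base model may have the edge.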

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_55.png>

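At present, the most capable assistant models (such as GPT-4 and Claude) have mostly been through RLHF, whereas many open-source models (such as Koala) are SFT models.

## Part 2: How to Use GPT Assistants Effectively 🛠️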

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_57.png>

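Now that we know how the models are trained, let's look at how to use them well in practice. We start with a concrete example that shows how differently a human and an LLM go about solving a problem.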

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_59.png>

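Suppose you want to write the sentence "California's population is 53 times that of Alaska." Your thought process might go like this:
1.  **Awareness**: I need to compare the populations of the two states.
2.  **Knowledge retrieval**: I don't know the exact figures, so I look them up on Wikipedia.
3.  **Tool use**: With the figures in hand, I use a calculator to do the division.
4.  **Reflection and verification**: Is 53 times plausible? California is the most populous state, so it seems reasonable.
5.  **Drafting and revising**: I try a phrasing, decide that it reads awkwardly, delete it, rewrite, and settle on a final version.

This process involves a rich inner monologue, tool use, and repeated checking. GPT, by contrast, sees only a sequence of tokens, one after another. It spends the same, bounded amount of computation on every token (an 80-layer Transformer, for example, gets 80 steps of "thinking" per token). It has no ongoing inner monologue and does not, on its own, check for mistakes or reach for external tools along the way; it simply imitates the next-token distribution of its training data.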

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_61.png>

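Prompt engineering can therefore be seen as a bridge across the gap between the cognitive architectures of humans and LLMs. Some core strategies follow.

### 1. Give the Model Time to Think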

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_63.png>

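LLMs need tokens to think. For a hard problem, you cannot expect the answer to come out in a single token.

**Key techniques**:
*   **Chain of thought**: ask the model in the prompt to reason step by step or to show its work. This forces it to spread the reasoning over many output tokens, which makes a correct answer more likely. For example, include "Let's think step by step..." in the prompt.
*   **Self-consistency**: don't sample just once. Have the model generate several answers and aggregate them, by majority vote or by picking the best one, so that the result does not hinge on a single unlucky sample.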
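
Self-consistency with majority voting can be sketched as follows. The `sample_fn` argument stands in for whatever call returns one sampled final answer from a model; it is a hypothetical interface, and the canned sampler below exists only to make the example runnable.

```python
from collections import Counter


def self_consistent_answer(sample_fn, prompt, n=5):
    """Sample n answers for the same prompt and return the most common one."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes


# Stand-in for a real model call: returns pre-written final answers in order.
canned = iter(["42", "41", "42", "42", "40"])


def fake_sampler(prompt):
    return next(canned)


answer, votes = self_consistent_answer(
    fake_sampler, "What is 6 * 7? Let's think step by step.")
print(answer, votes)
```

In practice each sample would contain a full chain of thought, and only the extracted final answer would be passed to the vote.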

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_65.png>

### 2. Ask Explicitly for High-Quality Output

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_67.png>

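An LLM's training data contains both high-quality and low-quality answers, and by default the model imitates all of it. You have to ask explicitly for an expert-level answer.

**Key techniques**:
*   Specify a role or a quality bar in the prompt, such as "You are a leading physicist" or "Make sure the answer is correct."
*   This helps the model concentrate its probability mass on high-quality outputs instead of spreading it across every possible continuation.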

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_69.png>

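### 3. Compensate for the Model's Weaknesses

LLMs can be poor at exact arithmetic, at accessing up-to-date information, and at sticking to a specific format. We make up for this with prompting or external tools.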

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_71.png>

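**Key techniques**:
*   **Tool use**: tell the model plainly, "You are not good at mental arithmetic; use the calculator tool provided," and define the format for calling the tool. Many frameworks (such as ReAct) build tool calls into the model's reasoning process.
*   **Retrieval augmentation**: combine the model's large internal memory with external retrieval. Using techniques such as a vector database, retrieve the document chunks relevant to the task and insert them into the model's context, which serves as its working memory. This can substantially improve performance in a specific domain.
*   **Output constraints**: use techniques such as guided sampling to force the output into a specific format (such as JSON or XML), so that downstream programs can parse it reliably.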
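
The retrieval idea can be illustrated without any external service. The sketch below swaps the vector database and learned embeddings for a bag-of-words vector and cosine similarity, which is a much cruder stand-in; the helper names and the sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word counts, punctuation ignored."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=1):
    """Put the retrieved chunks into the context ahead of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "Alaska is the largest US state by area.",
    "California is the most populous US state.",
    "Python is a programming language.",
]
prompt = build_prompt("Which state is the most populous?", chunks)
print(prompt)
```

A production system would replace `bow` and `cosine` with an embedding model and an index, but the shape of the pipeline (embed, rank, stuff into the prompt) stays the same.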

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_73.png>

### 4. Beyond a Single Prompt: Building Systems

Complex tasks often cannot be completed in a single question-and-answer exchange.

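**Key techniques**:
*   **Prompt chaining**: link several prompts into a workflow. For example, have the model plan the steps first, then carry them out one by one, then summarize.
*   **Reflection and retry**: have the model assess whether the answer it produced is correct and try again if it is not. This mimics human self-correction.
*   **Tree search**: as in AlphaGo, maintain several candidate reasoning paths (a tree of thoughts), evaluate and expand them, and pick the best one at the end. This takes Python glue code to coordinate multiple LLM calls.

### Practical Advice and Summary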
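
The reflect-and-retry pattern is a small control loop around two model calls. In the sketch below, `generate` and `check` are hypothetical callables standing in for an answering prompt and a self-evaluation prompt; the fake versions only make the loop runnable.

```python
def solve_with_retry(generate, check, task, max_attempts=3):
    """Ask for an answer, have it judged, and retry with feedback if rejected."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        answer = generate(task + feedback)
        if check(task, answer):
            return answer, attempt
        # Tell the next attempt what went wrong so it does not repeat itself.
        feedback = f"\nPrevious answer {answer!r} was judged incorrect. Try again."
    return None, max_attempts


# Stand-ins for real model calls.
drafts = iter(["54", "53"])


def fake_generate(prompt):
    return next(drafts)


def fake_check(task, answer):
    return answer == "53"


result, attempts = solve_with_retry(
    fake_generate, fake_check,
    "How many times larger is California's population than Alaska's?")
print(result, attempts)
```

The same skeleton extends to prompt chains (call several different prompts in sequence) and to tree search (keep several candidate answers alive instead of one).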

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_75.png>

For beginners and application developers, the recommended path is:

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_77.png>

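1.  **Start with prompt engineering**: begin with the most capable model available (such as GPT-4) and write detailed prompts that include examples and background information. Keep the LLM's "psychology" in mind and use techniques such as chain of thought and retrieval augmentation.
2.  **Think about system design**: don't limit yourself to a single prompt. Consider how code can glue multiple prompts, tool calls, and control logic together into a reliable system.
3.  **Leave fine-tuning for last**: consider fine-tuning only once prompt engineering has been exhausted. Supervised fine-tuning is relatively straightforward but needs high-quality data. RLHF is much more complex and unstable and is not recommended for beginners at present.
4.  **Know the limitations and use the models safely**: always keep in mind that LLMs hallucinate, carry biases, have a knowledge cutoff, and are vulnerable to attacks. Use them in low-stakes settings, as a copilot that offers ideas and suggestions, and keep a human in the loop.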

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_79.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_81.png>

---

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/b4741a42e6cf74795c90528614bcb573_83.png>

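**Lesson summary**: we covered the four core stages of training a GPT assistant (pretraining, supervised fine-tuning, reward modeling, and reinforcement learning) and the difference between base models and assistant models. More importantly, we looked at how prompt engineering, tool use, and system design can bridge the cognitive gap between humans and LLMs so that these models can be used effectively and reliably in practice. Remember that an LLM is a remarkable token simulator; our job is to guide it and to create the conditions in which it can "think".

# Lesson P9: Building the GPT Tokenizer 🧩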

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_1.png>

<https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_3.png>

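In this lesson we study a key but often overlooked component of large language models (LLMs): the tokenizer. We cover what tokenization is and why it matters, and we implement a byte pair encoding (BPE) tokenizer from scratch. By the end you should understand how tokenization affects model behavior and be able to build and train a custom tokenizer.

## Overview: What Is Tokenization?

Tokenization is the process of converting a text string into a sequence of integers called tokens, the basic units a language model reads and processes. In the earlier lesson on building GPT from scratch we used a simple character-level tokenizer. Production LLMs such as the GPT series use more sophisticated schemes, such as byte pair encoding.

Tokenization is at the root of many odd LLM behaviors: trouble with spelling, weaker performance on non-English languages, poor arithmetic, and so on. Understanding how it works is essential to understanding LLMs in depth.

## From Character-Level to Subword Tokenization

The previous section recalled simple character-level tokenization. This section looks at the more advanced subword approach.

In character-level tokenization, each character (such as `'h'` or `'i'`) maps to its own integer. This is simple, but it makes sequences very long and inefficient. The sentence "hello there", for example, is encoded as one integer per character.

In practice we use subword tokenization, which merges common character combinations (such as `'he'` or `'ll'`) into single tokens and so shortens the sequence. Algorithms such as byte pair encoding do this.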
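
A character-level tokenizer fits in a few lines. This is a minimal sketch in the spirit of the earlier lesson; the names `stoi`, `itos`, `encode_chars`, and `decode_chars` are illustrative, and the vocabulary is built from the example sentence alone.

```python
text = "hello there"

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character


def encode_chars(s):
    return [stoi[c] for c in s]


def decode_chars(ids):
    return "".join(itos[i] for i in ids)


ids = encode_chars(text)
print(len(chars), "distinct characters;", len(ids), "tokens:", ids)
```

The vocabulary is tiny, but the sequence is exactly as long as the text, which is the inefficiency that subword tokenization addresses.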

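## The Byte Pair Encoding Algorithm in Detail

Byte pair encoding began as a data compression algorithm and was later adopted for tokenization in NLP. The core idea is to iteratively merge the most frequent pair of adjacent bytes in the data.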

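The basic steps of BPE are:

1.  Encode the text as a sequence of UTF-8 bytes; the initial vocabulary is the 256 byte values (0-255).
2.  Count how often each pair of adjacent tokens occurs.
3.  Find the most frequent pair.
4.  Create a new token for that pair and add it to the vocabulary.
5.  Replace every occurrence of the pair in the data with the new token.
6.  Repeat steps 2-5 until the target vocabulary size is reached or there are no more pairs to merge.

In this way we start from raw bytes and gradually build up tokens for common character combinations, compressing the text efficiently.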
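
Steps 1 to 3 can be traced on a tiny input. The string `"aaabdaaabac"` is a commonly used toy example for BPE, chosen here for illustration; the sketch uses `collections.Counter` only to keep the trace short.

```python
from collections import Counter

text = "aaabdaaabac"

# Step 1: raw UTF-8 bytes, each an integer in 0..255 ('a' is 97).
ids = list(text.encode("utf-8"))

# Step 2: count every adjacent pair, including overlapping ones.
pair_counts = Counter(zip(ids, ids[1:]))

# Step 3: pick the most frequent pair.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)  # the pair (97, 97), i.e. "aa", seen 4 times
```

Steps 4 and 5 would then assign the first free ID, 256, to `(97, 97)` and rewrite the sequence with it, after which the counting starts again on the shorter sequence.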

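## Implementing a BPE Tokenizer

Now let's implement a basic BPE tokenizer. We will write a training function that learns merge rules from data, and encode/decode functions that convert between text and tokens.

First, we need a function that counts how often each adjacent pair occurs.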

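```python
def get_stats(ids):
    """
    Count occurrences of each adjacent pair in a list of integer IDs.

    Args:
        ids: list of integers representing bytes or tokens.
    Returns:
        A dict mapping (id1, id2) tuples to their counts.
    """
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
```

Next, we implement the function that merges a chosen pair.

```python
def merge(ids, pair, idx):
    """
    Replace every occurrence of a given pair in the ID sequence with a new ID.

    Args:
        ids: list of integers.
        pair: the pair to merge, e.g. (101, 32).
        idx: the new token ID to substitute (e.g. 256).
    Returns:
        A new list of IDs with the merge applied.
    """
    newids = []
    i = 0
    while i < len(ids):
        # If the pair starts at position i, emit the new ID and skip both elements
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
```

Now we can write the training loop, which merges iteratively and builds the vocabulary.

```python
def train_bpe(text, vocab_size):
    """
    Train a BPE tokenizer on a text.

    Args:
        text: training text string.
        vocab_size: target vocabulary size.
    Returns:
        merges: dict of merge rules mapping a pair (id1, id2) to its new token ID.
        vocab: mapping from token ID to its bytes representation.
    """
    # 1. Encode the text as UTF-8 bytes and convert to a list of integers
    tokens = list(text.encode('utf-8'))
    # The initial vocabulary is the 256 byte values (0-255)
    num_merges = vocab_size - 256
    merges = {}  # (id1, id2) -> new_id
    vocab = {idx: bytes([idx]) for idx in range(256)}  # id -> bytes
    for i in range(num_merges):
        # 2. Count pair frequencies in the current token sequence
        stats = get_stats(tokens)
        if not stats:
            break
        # 3. Find the most frequent pair
        top_pair = max(stats, key=stats.get)
        # 4. Assign a new ID (starting at 256)
        idx = 256 + i
        # 5. Record the merge rule
        merges[top_pair] = idx
        # 6. Update the vocabulary: the new token is the concatenation of its parts
        vocab[idx] = vocab[top_pair[0]] + vocab[top_pair[1]]
        # 7. Apply the merge to the sequence
        tokens = merge(tokens, top_pair, idx)
    return merges, vocab
```

## Encoding and Decoding

With a trained tokenizer (`merges` and `vocab`) in hand, we need to implement encoding (text -> tokens) and decoding (tokens -> text).

Decoding is the simpler of the two: map each token ID back to its bytes through `vocab`, concatenate, and decode the result as a string.

```python
def decode(ids, vocab):
    """
    Decode a sequence of token IDs into a text string.

    Args:
        ids: list of token IDs.
        vocab: mapping from token ID to its bytes representation.
    Returns:
        The decoded string.
    """
    # Look up the bytes for each ID and concatenate them
    tokens_bytes = b''.join(vocab[idx] for idx in ids)
    # Decode the bytes; 'replace' substitutes a marker for invalid UTF-8 sequences
    text = tokens_bytes.decode('utf-8', errors='replace')
    return text
```

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_5.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_7.png

Encoding has to replay the merges learned during training: convert the text to bytes, then apply the merge rules repeatedly, earliest-learned rule first.

```python
def encode(text, merges):
    """
    Encode a text string into a sequence of token IDs.

    Args:
        text: input text.
        merges: dict of merge rules produced by training.
    Returns:
        A list of token IDs.
    """
    # Convert the text to UTF-8 bytes, then to a list of integers
    tokens = list(text.encode('utf-8'))
    # Keep merging for as long as some pair is mergeable
    while True:
        stats = get_stats(tokens)
        # Find the mergeable pair with the highest priority,
        # i.e. the one whose new ID in merges is smallest (learned earliest)
        pair_to_merge = None
        min_idx = float('inf')
        for pair in stats:
            idx = merges.get(pair)
            if idx is not None and idx < min_idx:
                min_idx = idx
                pair_to_merge = pair
        # Stop when nothing is left to merge
        if pair_to_merge is None:
            break
        # Apply the merge
        idx = merges[pair_to_merge]
        tokens = merge(tokens, pair_to_merge, idx)
    return tokens
```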

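## The Complexity of Real Tokenizers

What we implemented above is a basic, purely algorithmic BPE tokenizer. Tokenizers used in practice (for GPT-2 or GPT-4, for example) add further rules to handle harder cases.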

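**Pre-tokenization rules**: GPT-2, for example, uses a fairly complex regular expression to split the text into chunks (letters, numbers, punctuation, and so on) before any BPE merges are applied. Merges then happen only within a chunk, which prevents strings like "dog." and "dog!" from each being merged into separate tokens and keeps tokenization more consistent.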
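
The effect of pre-splitting can be seen with a much simpler pattern than the one GPT-2 actually uses. The regular expression below is a reduced, ASCII-only stand-in written for the standard `re` module; it is an assumption made for illustration, not GPT-2's pattern.

```python
import re

# Simplified stand-in: runs of letters, digits, or punctuation, each optionally
# preceded by one space, plus any remaining whitespace.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_split(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return SPLIT_PATTERN.findall(text)


text = "The dog. The dog! 123"
chunks = pre_split(text)
print(chunks)
```

Because `dog` and the punctuation after it land in different chunks, no merge can ever glue them together, so the word keeps the same tokens whatever follows it.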

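**Special tokens**: besides the tokens learned from data, tokenizers add special tokens, such as `<|endoftext|>` to delimit documents, or markers that distinguish user, assistant, and system messages in chat models. These tokens have their own IDs in the vocabulary and are handled specially during processing.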
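
One way to give special tokens their own handling is to split them out of the text before ordinary encoding. In the sketch below the table `SPECIAL_TOKENS`, the function `encode_with_special`, and the raw-byte `byte_encode` stand-in are all assumptions for illustration; the ID 50256 is the one GPT-2 assigns to `<|endoftext|>`.

```python
import re

SPECIAL_TOKENS = {"<|endoftext|>": 50256}  # illustrative table with one entry


def encode_with_special(text, encode_ordinary):
    """Encode text, mapping special-token strings straight to their reserved IDs."""
    # A capturing group makes re.split keep the special tokens in the result.
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[part])    # bypass BPE entirely
        elif part:
            ids.extend(encode_ordinary(part))   # normal path
    return ids


def byte_encode(s):
    """Stand-in for a trained BPE encoder: plain UTF-8 bytes."""
    return list(s.encode("utf-8"))


ids = encode_with_special("hi<|endoftext|>yo", byte_encode)
print(ids)
```

Keeping special tokens out of the merge process means they can never be produced by accident from ordinary text fragments that happen to be merged.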

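**Effect of vocabulary size**: vocabulary size is a key hyperparameter. A vocabulary that is too small (character-level, say) makes sequences very long and burns a lot of compute. One that is too large makes each token rarer, which can leave embeddings undertrained, and it also increases the cost of the model's output layer. Current state-of-the-art models typically use vocabularies of tens of thousands up to roughly a hundred thousand tokens.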
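
The cost on the model side is easy to estimate: the token embedding table has one row per vocabulary entry, so its size grows linearly with the vocabulary. The sketch below assumes an embedding width of 768 (the width of the smallest GPT-2 model) and compares a byte-level vocabulary with larger ones; the figures are simple arithmetic, not measurements.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a token embedding table of shape (vocab_size, d_model)."""
    return vocab_size * d_model


d_model = 768  # assumed embedding width
for vocab_size in (256, 50257, 100000):
    print(vocab_size, embedding_params(vocab_size, d_model))

gpt2_sized = embedding_params(50257, d_model)
```

The output layer that scores every vocabulary entry has the same shape, so it grows at the same rate unless its weights are shared with the embedding table.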

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_9.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_11.png

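## How the Tokenizer Relates to Model Training

It is important to be clear that training the tokenizer and training the language model are two separate stages.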

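*   **Tokenizer training**: run the BPE algorithm on a representative dataset (which may differ from the model's training set) to determine the merge rules and the final vocabulary. This produces the two core artifacts, `merges` and `vocab`.
*   **Model training**: use the trained tokenizer to convert the entire model training corpus into token sequences. These sequences are stored, and the language model trains on them, learning to predict the next token.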

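This separation means the tokenizer can be designed for particular goals (multilingual support, code handling) on its own dataset, before any model training starts. A trained model is tied to the tokenizer it was trained with, however: changing the vocabulary afterwards generally means retraining the model.

## Summary

In this lesson we covered the core ideas behind building a GPT tokenizer: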

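*   **Why tokenization matters**: tokenization is the bridge through which text enters an LLM, and its design directly affects how well the model handles tasks such as spelling, multilingual text, arithmetic, and code.
*   **How BPE works**: it builds a vocabulary by iteratively merging the most frequent byte pair, giving a compressed representation that moves from bytes up to subwords.
*   **Tokenizer implementation**: we implemented the core functions `train_bpe`, `encode`, and `decode` and ended up with a working basic tokenizer.
*   **Practical considerations**: real tokenizers (such as OpenAI's tiktoken) add pre-tokenization rules and special tokens, and involve design choices such as vocabulary size.
*   **Training workflow**: tokenizer training and language model training are two separate stages, carried out one after the other.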

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_13.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_15.png

https://github.com/OpenDocCN/dsai-notes-pt1-zh/raw/master/docs/andrej/img/9b369713d8fe0d40ac1101ef2ac09517_17.png

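Tokenization is only a preprocessing step, but it has a deep effect on how a language model behaves and what it can do. We hope this tutorial has taken some of the mystery out of tokenization and given you a solid footing for understanding and using large language models.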
