LLM Distributed Training and Inference Optimization

Technology Overview

LLM Distributed System Architecture at a Glance

Hardware: NVIDIA H100, A100 80GB, HBM3 Bandwidth, InfiniBand

⬇ NVLink / NCCL

Compute: Megatron-LM, DeepSpeed, FSDP

⬇ 3D Parallelism

Parallelism: Data Parallel, Tensor Parallel, Pipeline Parallel

⬇ Optimization

Optimization: Mixed Precision, Activation Checkpoint, Batching / Quantization, KV Cache

⬇ Serving

Serving: vLLM / TGI, TensorRT-LLM, MLOps Pipeline

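1T+

GPT-4 parameter scale (unofficial estimate)

800GB

FP8 model bandwidth

16x

H100 vs A100 Speedup

70B

Largest model on a single GPU (quantized)

"Scaling laws show that model performance improves as a power law in compute, data, and parameter count. The core task of a distributed system is to approach that theoretical frontier within a limited hardware budget."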
— Kaplan et al., Scaling Laws for Neural Language Models

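Parallelism Strategies

📊

Data Parallelism

Every GPU holds a full replica of the model and processes a different shard of the data. Gradients are synchronized with AllReduce.

Pros: simple to implement and easy to scale; communication is one gradient AllReduce per step, which can overlap with the backward pass
Cons: memory-inefficient, since every GPU stores the full model, so models larger than one GPU's memory cannot be trained
Best for: scaling out training when the model fits on a single GPU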

PyTorch DDP

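import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the distributed environment (launch with torchrun, which sets LOCAL_RANK)
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model`, `loss_fn`, `optimizer`, and `dataloader` are assumed to be defined elsewhere;
# the dataloader should use a DistributedSampler so each rank sees a different shard
model = model.cuda()
model = DDP(model, device_ids=[local_rank])

# Training loop: each process runs forward/backward independently
for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()  # DDP all-reduces the gradients automatically
    optimizer.step()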
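Why averaging per-rank gradients is equivalent to training on the full batch can be checked on a toy problem. The sketch below is a plain-Python simulation, not DDP itself: the one-parameter model, `grad`, and `all_reduce_mean` are hypothetical stand-ins, and the shards are assumed to be equal in size.

```python
# Toy model: y = w * x with squared-error loss, so dL/dw = 2 * (w*x - y) * x.
def grad(w, batch):
    """Mean gradient of the squared error over one batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for AllReduce(sum) followed by division by the world size."""
    return sum(values) / len(values)


w = 0.5
full_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Two simulated ranks, each holding an equal-sized shard of the batch.
shards = [full_batch[:2], full_batch[2:]]
local_grads = [grad(w, shard) for shard in shards]
synced_grad = all_reduce_mean(local_grads)

print(local_grads, synced_grad, grad(w, full_batch))
```

Each rank computes a different local gradient, but after the averaging step every rank applies the same update it would have computed on the whole batch.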

🔀

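Tensor Parallelism

Splits the weight matrices of a single layer across multiple GPUs along one dimension, typically pairing a column split with a row split (the Megatron-LM approach).

Communication: AllReduce to combine the partial results from each shard
Pros: supports layers whose parameters exceed a single GPU's memory
Cons: communication-heavy; needs a fast interconnect such as NVLink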

Megatron-LM Tensor Parallel

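import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

# Tensor-parallel configuration example (illustrative keys modeled on Megatron-LM arguments)
tensor_parallel_config = {
    "tensor_model_parallel_size": 8,  # TP=8 shards each attention layer across 8 GPUs
    "pipeline_model_parallel_size": 8,
    "num_layers_per_pipeline_stage": 4,
}

# Column-parallel linear with TP=2: W = [W1, W2], split along the output dimension
# Y = [X @ W1, X @ W2] -> all-gather (concatenate); no summation is involved
# A row-parallel layer instead computes Y = X1 @ W1 + X2 @ W2 -> AllReduce
class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        self.world_size = world_size
        # This rank stores only out_features // world_size output columns
        self.weight = nn.Parameter(torch.randn(out_features // world_size, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features // world_size))

    def forward(self, x):
        output_parallel = F.linear(x, self.weight, self.bias)
        # This GPU computed only some output columns; gather the rest (forward-only sketch)
        gathered = [torch.empty_like(output_parallel) for _ in range(self.world_size)]
        dist.all_gather(gathered, output_parallel)
        return torch.cat(gathered, dim=-1)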
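The difference between the column split and the row split is easy to verify numerically without any GPUs. The following is a minimal sketch in plain Python lists, assuming two simulated ranks; `matmul`, `split_columns`, and `split_rows` are hypothetical helpers written for this illustration, not Megatron-LM functions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


x = [[1.0, 2.0, 3.0, 4.0]]          # one token, hidden size 4
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0]]          # 4 x 4 weight

# Column parallel: every rank sees the full x and produces a slice of the output columns.
col_partials = [matmul(x, shard) for shard in split_columns(w, 2)]
y_column = [sum((part[0] for part in col_partials), [])]   # concatenate (all-gather)

# Row parallel: x is split along the hidden dimension and the partial outputs are summed.
row_partials = [matmul(xs, ws) for xs, ws in zip(split_columns(x, 2), split_rows(w, 2))]
y_row = [[a + b for a, b in zip(row_partials[0][0], row_partials[1][0])]]  # all-reduce (sum)

y_reference = matmul(x, w)
print(y_column, y_row, y_reference)
```

Both layouts reproduce the unsharded result. Chaining a column-parallel layer into a row-parallel layer lets the first layer skip its gather, which is why the two splits are usually paired.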

🔄

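Pipeline Parallelism

Partitions the model by layer across GPUs to form a compute pipeline. Reducing pipeline bubbles (idle time) is the main optimization target.

1F1B: One Forward One Backward, the classic schedule
Interleaved: each GPU holds several non-contiguous layer chunks and interleaves micro-batches across them, which shortens the bubble
Asynchronous and bidirectional pipelines: e.g. Chimera, Discolato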

GPipe Pipeline Schedule

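# Pipeline-parallel schedule sketch (num_microbatches=4, num_stages=4)
# Idealized GPipe schedule (forward and backward steps take equal time):
# Stage 0: [F0][F1][F2][F3]                        [B3][B2][B1][B0]
# Stage 1:     [F0][F1][F2][F3]                [B3][B2][B1][B0]
# Stage 2:         [F0][F1][F2][F3]        [B3][B2][B1][B0]
# Stage 3:             [F0][F1][F2][F3][B3][B2][B1][B0]

# Bubble ratio (idle time / ideal compute time) = (num_stages - 1) / num_microbatches
# Interleaved 1F1B with v model chunks per GPU divides this by v; with enough micro-batches the bubble can drop to ~5%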
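The bubble formula can be turned into a small calculator to see how the number of micro-batches and interleaving affect idle time. This is a sketch under the usual idealization (equal time per stage, no communication cost); the function names and the `virtual_stages` parameter are hypothetical.

```python
def bubble_ratio(num_stages, num_microbatches, virtual_stages=1):
    """Idle time divided by ideal compute time: (p - 1) / (v * m)."""
    return (num_stages - 1) / (virtual_stages * num_microbatches)


def bubble_fraction(num_stages, num_microbatches, virtual_stages=1):
    """Idle time as a share of the whole step: ratio / (1 + ratio)."""
    ratio = bubble_ratio(num_stages, num_microbatches, virtual_stages)
    return ratio / (1 + ratio)


print(bubble_ratio(4, 4))        # the 4-stage, 4-micro-batch schedule
print(bubble_ratio(4, 32))       # more micro-batches
print(bubble_ratio(4, 32, 2))    # plus 2 model chunks per GPU
```

With 4 stages and 4 micro-batches the ratio is 0.75, so idle time is 3/7 of the step. Raising the micro-batch count to 32 and using 2 chunks per GPU brings the ratio under 5%.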

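Strategy	Communication pattern	Memory efficiency	Scaling efficiency	Best for
Data Parallel (DDP)	AllReduce (gradients)	Low (full replica)	High (~linear)	Small to mid-size models, multi-node scale-out
Tensor Parallel (TP)	AllReduce (activations)	High (per-layer sharding)	Medium (depends on NVLink)	Layers too large for one GPU, Transformers
Pipeline Parallel (PP)	P2P (activations/gradients)	Medium (per-stage caching)	Medium (bubble overhead)	Deep models, multi-node deployment
3D Parallel (DP+TP+PP)	Mixed	High (jointly optimized)	High (near-linear)	Training trillion-parameter models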
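For the 3D row, the three degrees multiply to the world size, and every global rank gets one coordinate per dimension. A minimal sketch follows, assuming a layout in which the tensor-parallel index varies fastest so that each TP group sits on adjacent ranks; the function names and this ordering are assumptions for illustration, and frameworks may order the dimensions differently.

```python
def data_parallel_size(world_size, tp, pp):
    """DP degree implied by the world size and the TP/PP degrees."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    tp_idx = rank % tp
    dp_idx = (rank // tp) % dp
    pp_idx = rank // (tp * dp)
    return dp_idx, pp_idx, tp_idx


dp = data_parallel_size(world_size=64, tp=8, pp=4)
print(dp, rank_to_coords(8, tp=8, pp=4, dp=dp), rank_to_coords(63, tp=8, pp=4, dp=dp))
```

With 64 GPUs, TP=8 and PP=4 leave a data-parallel degree of 2. Ranks 0 to 7 form one TP group, which on an 8-GPU node keeps the heaviest communication on NVLink.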

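Inference Optimization

⚡

Continuous Batching

Traditional static batching waits for every request in a batch to finish before starting the next batch, leaving the GPU idle. Continuous batching (iteration-level scheduling) rebuilds the batch dynamically and substantially raises throughput.

Core idea: schedule at the iteration level rather than the request level
Key metrics: throughput up 4-10x and latency down about 30% (reported gains; actual numbers depend on the workload)
Implementations: Orca Scheduler, Ray Serve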

vLLM Continuous Batching

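# vLLM PagedAttention + Continuous Batching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    # KV cache memory is managed automatically
    max_num_seqs=256,        # maximum number of concurrent sequences
    max_num_batched_tokens=8192,  # maximum tokens per batch
)

# Dynamic batching: requests of different lengths are mixed
prompts = [
    "Explain tensor parallelism in one sentence.",
    "Summarize how a KV cache speeds up autoregressive decoding.",
]  # illustrative prompts
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)
# vLLM schedules the requests itself to keep GPU utilization high
outputs = llm.generate(prompts, sampling_params)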
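The gain from iteration-level scheduling can be illustrated by counting decode iterations on a toy workload. The sketch below is a simplified simulation rather than vLLM's scheduler: it assumes one token per request per iteration, ignores prefill cost, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave and queued requests join at every iteration."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before this iteration
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode iteration: every running request emits one token,
        # and requests that just produced their last token drop out
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 2]   # output tokens per request (made-up workload)
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

On this workload static batching needs 11 iterations and continuous batching needs 8, because short requests free their slots while the two long requests are still decoding.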

💾

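KV Cache Optimization

Key-Value Cache Management

During generative inference, KV cache memory grows linearly with both the number of concurrent sequences and the sequence length, making it one of the main inference bottlenecks.

PagedAttention: manages the KV cache in pages, as an OS manages virtual memory
Cache quantization: compress from FP16 to INT8 to INT4
Prefix caching: share the KV entries of a common system prompt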

KV Cache Memory Estimate

# KV cache memory estimate
# Per layer, per sequence = 2 * num_kv_heads * head_dim * seq_len * bytes_per_param
# LLaMA-2 70B (seq_len=4096, INT8); the model uses GQA with 8 KV heads:

layers = 80
num_kv_heads = 8
head_dim = 128
seq_len = 4096
bytes_per_param = 1  # INT8

# KV cache for one sequence
kv_cache_per_seq = 2 * layers * num_kv_heads * head_dim * seq_len * bytes_per_param
print(f"{kv_cache_per_seq / 1e9:.2f} GB")  # ~0.67 GB

# 256 concurrent sequences
total_kv = kv_cache_per_seq * 256  # ~172 GB, more than one 80 GB GPU holds, so paged management is needed

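Batching & Quantization

🎯

Quantization Strategy Comparison

Quantization Strategies

LLM quantization falls into post-training quantization (PTQ) and quantization-aware training (QAT). The mainstream schemes below target different accuracy-performance trade-offs: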

Method	Accuracy loss	Inference speedup	Memory reduction
FP16 (Baseline)	-	1x	1x
INT8	<1%	1.5-2x	2x
INT4 + GPTQ	2-3%	3-4x	4x
INT4 + AWQ	<1%	3-4x	4x
FP8 (H100)	<0.5%	2x	2x
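What "INT8" means for a weight tensor can be shown with the simplest scheme, symmetric per-tensor quantization. This is a minimal sketch only: GPTQ and AWQ use more elaborate, calibration-based procedures, and the function names and sample weights here are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point weights."""
    return [scale * v for v in q]


weights = [0.02, -0.51, 0.33, 1.27, -0.90]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of two, and the rounding error per weight is bounded by half the scale. A single outlier inflates the scale for the whole tensor, which is one reason practical schemes quantize per channel or per group.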

🔧

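Batch Scheduling Algorithms

Efficient batching has to balance GPU utilization against request latency. Common scheduling policies:

First-Come-First-Serve (FCFS): simple, but prone to fragmentation and head-of-line blocking behind long requests
Shortest-Job-First (SJF): lowers average latency, but can starve long requests
Preemptive Scheduling: preempts running requests dynamically to optimize the maximum completion time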

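Sarathi-Serve Chunked Prefill Scheduling

# Sarathi: schedule at chunk granularity so long sequences do not block others
# Long sequences are split into fixed-size chunks

CHUNK_SIZE = 512   # tokens per chunk
MAX_TOKENS = 8192  # per-iteration token budget (illustrative value)

def count_tokens(batch):
    # Tokens already scheduled this iteration; `batch` holds (request, chunk_size) pairs
    return sum(chunk_size for _, chunk_size in batch)

# Scheduling decision: each iteration, pick the requests that fit into the current batch
def schedule_batch(pending_requests, current_batch):
    available_slots = MAX_TOKENS - count_tokens(current_batch)

    # Higher priority first, then shortest next chunk first
    ordered = sorted(
        pending_requests,
        key=lambda r: (-r.priority, min(CHUNK_SIZE, r.remaining_tokens)),
    )

    selected = []
    for req in ordered:
        next_chunk_size = min(CHUNK_SIZE, req.remaining_tokens)
        if next_chunk_size <= available_slots:
            selected.append(req)
            available_slots -= next_chunk_size  # consume budget so the batch never overflows
    return selected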
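The latency trade-off between FCFS and SJF can be quantified on a tiny queue. The sketch below assumes jobs run one after another with known service times, which real LLM serving can only estimate; the job list is made up.

```python
def mean_completion_time(service_times):
    """Average time at which each job finishes when jobs run back to back."""
    elapsed = 0
    total = 0
    for t in service_times:
        elapsed += t
        total += elapsed
    return total / len(service_times)


jobs = [8, 1, 2, 1]                       # service time per request, in arrival order
fcfs = mean_completion_time(jobs)         # run in arrival order
sjf = mean_completion_time(sorted(jobs))  # run shortest first
print(fcfs, sjf)
```

SJF cuts the mean completion time from 10 to 4.75 here, but the long job now finishes last. If short jobs keep arriving, it can be postponed indefinitely, which is the starvation risk noted above.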

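KV Cache Optimization in Depth

📑

How PagedAttention Works

Virtual Memory for KV Cache

Borrowing the paging idea from operating systems, vLLM breaks the KV cache into fixed-size blocks, so physical memory can be allocated on demand and shared between sequences.

Block size: typically 16 tokens per block
Prefix caching: requests with the same system prompt reuse its KV blocks automatically
Speculative decoding: KV entries computed for draft tokens can be reused once those tokens are accepted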

PagedAttention Block Management

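# KV cache block allocation sketch
# Physical memory: [Block 0][Block 1][Block 2][Block 3]...
# Logical sequence 1: [0,1] -> [0,1,2] -> [0,1,2,3]
# Logical sequence 2: [4,5]

class KVCacheBlockManager:
    def __init__(self, num_physical_blocks):
        # Every physical block starts out free
        self.free_list = list(range(num_physical_blocks))

    def alloc(self, num_blocks):
        # Allocate physical blocks on demand
        if num_blocks > len(self.free_list):
            raise MemoryError("not enough free KV cache blocks")
        physical_blocks = []
        for _ in range(num_blocks):
            block_id = self.free_list.pop()
            physical_blocks.append(block_id)
        return physical_blocks

    def cannibalize(self, completed_seq):
        # Reclaim the blocks of a finished sequence for new requests
        self.free_list.extend(completed_seq.block_ids)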
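The other half of paging is the per-sequence block table that translates a token position into a physical location. The following is a minimal sketch of that mapping, assuming 16-token blocks; `BlockTable`, `blocks_needed`, and the pool layout are hypothetical names for illustration and do not mirror vLLM's internal classes.

```python
import math

BLOCK_SIZE = 16  # tokens per block


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of blocks required to hold `num_tokens` tokens."""
    return math.ceil(num_tokens / block_size)


class BlockTable:
    """Maps one sequence's logical block index to a physical block id."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.physical = []               # physical[i] backs logical block i
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical.append(self.free_blocks.pop())
        self.num_tokens += 1

    def locate(self, token_index):
        """Return (physical block id, offset inside the block) for a token."""
        return self.physical[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


pool = list(range(7, -1, -1))   # 8 free physical blocks; pop() hands out 0, 1, 2, ...
seq = BlockTable(pool)
for _ in range(35):
    seq.append_token()
print(seq.physical, seq.locate(34), len(pool))
```

A 35-token sequence occupies three blocks, and only the last block is partly empty. Memory is claimed as tokens arrive rather than reserved up front for the maximum length, so waste stays below one block per sequence.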

🔄

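Cache Reuse Strategies

In multi-turn chat and RAG workloads, the KV entries of the system prompt and of shared user-prompt prefixes can be reused across requests:

Static KV cache: the unchanging system-prompt portion is computed once and cached
StreamingLLM: keeps the first few tokens as attention sinks plus a sliding window of recent KV entries
Cache-aware prefix encoding: learned cache placement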

Prefix Caching Example

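# Enabling prefix caching (illustrative config: `enable_prefix_caching` is the vLLM engine option;
# `prefix_cache_id` is a placeholder label, not a verified HuggingFace TGI setting)
{
    "enable_prefix_caching": true,
    "prefix_cache_id": "system_prompt_v1"
}

# Multiple requests share the same system prompt
# Request A: [System] + [User A question]
# Request B: [System] + [User B question]
# KV cache: the System portion is computed only once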
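One way to detect that two requests share a prefix is to key each full block by all the tokens up to and including it. The sketch below assumes that approach with a tiny block size for readability; `PrefixCache` and its method are hypothetical, and production engines typically chain hashes instead of storing whole prefixes as keys.

```python
BLOCK_SIZE = 4  # small block size to keep the example readable


class PrefixCache:
    def __init__(self):
        self.blocks = {}    # prefix tuple -> id of the block that ends this prefix
        self.next_id = 0

    def lookup_or_insert(self, tokens):
        """Return (block ids for the full blocks, number of blocks reused)."""
        ids, hits = [], 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])   # a block is identified by the whole prefix up to it
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[key])
        return ids, hits


system = [1, 2, 3, 4, 5, 6, 7, 8]            # shared system prompt (token ids)
cache = PrefixCache()
result_a = cache.lookup_or_insert(system + [20, 21, 22, 23])
result_b = cache.lookup_or_insert(system + [30, 31, 32, 33, 34])
print(result_a, result_b)
```

Request B reuses the two blocks that hold the system prompt and only allocates a new block for its own question. Its trailing partial block is left out because only full blocks are shareable in this scheme.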

🗜️

MQA / GQA

Multi-Query & Grouped-Query Attention

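In standard MHA, every token stores K/V for every head. MQA and GQA shrink the cache by reducing the number of K/V heads:

MQA: all attention heads share a single K/V head (smallest cache, but quality can suffer)
GQA: N query heads are grouped to share M K/V heads (recommended)
LLaMA 2: uses GQA (8 KV heads for 70B)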

GQA Memory Analysis

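# GQA vs MHA KV cache comparison
# LLaMA-2 70B: num_q_heads=64, num_kv_heads=8

mha_kv_heads = 64  # MHA would keep one K/V head per query head
gqa_kv_heads = 8   # GQA keeps only 8 K/V heads

print(f"KV cache savings: {(1 - gqa_kv_heads / mha_kv_heads) * 100:.1f}%")  # KV cache savings: 87.5%

# Note: the query dimension is unchanged; only the number of K/V heads shrinks
# At inference time each K/V head is shared by its group of 8 query heads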
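How query heads map onto shared K/V heads, and what that does to per-token cache size, can be written out directly. The sketch assumes the published LLaMA-2 70B shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and FP16 cache entries; the function names are hypothetical.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


gqa_bytes = kv_bytes_per_token(80, 8, 128, 2)    # GQA, FP16
mha_bytes = kv_bytes_per_token(80, 64, 128, 2)   # hypothetical MHA variant, FP16
print([kv_head_for(h, 64, 8) for h in (0, 7, 8, 63)], gqa_bytes, mha_bytes)
```

Query heads 0 to 7 read K/V head 0, heads 8 to 15 read head 1, and so on. Each token costs 320 KiB of cache under GQA versus 2.5 MiB if every query head had its own K/V head.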

🧊

FlashAttention

IO-Aware Exact Attention

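FlashAttention uses IO-aware tiling: instead of repeatedly reading and writing the full attention matrix in HBM, it computes attention block by block in on-chip SRAM, which sharply cuts memory use and IO overhead:

Memory: O(N²) → O(N), since the full attention matrix is never materialized
Speed: 2-4x faster, especially on long sequences
Variants: FlashAttention-2 (better parallelism and work partitioning)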

FlashAttention-2 Usage

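# Speed up attention with FlashAttention-2
import math

import torch
from flash_attn import flash_attn_func

# Q, K, V: [batch, seq_len, num_heads, head_dim], fp16/bf16 tensors on a CUDA device
batch, seq_len, num_heads, head_dim = 2, 1024, 16, 64  # illustrative sizes
q, k, v = (
    torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
    for _ in range(3)
)

# FlashAttention handles tiling and softmax normalization internally
output = flash_attn_func(
    q, k, v,
    dropout_p=0.0,
    softmax_scale=1.0 / math.sqrt(head_dim),
    causal=True  # use a causal mask for autoregressive generation
)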
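The trick that lets attention be computed tile by tile without ever holding all scores is the online softmax: keep a running maximum and running sums, and rescale them whenever a larger score appears. Below is a minimal single-query sketch in plain Python, assuming scalar values for brevity; it illustrates the numerics only and is not the FlashAttention kernel.

```python
import math


def naive_attention_row(scores, values):
    """Softmax over all scores at once, then a weighted sum of the values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention_row(scores, values, tile_size):
    """Same result, visiting keys tile by tile with a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    running_out = 0.0   # unnormalized weighted sum of the values seen so far
    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        new_max = max(running_max, max(tile_scores))
        # Rescale everything accumulated under the old maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(naive_attention_row(scores, values), tiled_attention_row(scores, values, 2))
```

The tiled version only ever holds one tile of scores plus three running scalars, which is why memory drops from quadratic to linear in sequence length while the result stays exact up to floating-point rounding.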

MLOps in Production

🏗️

LLM Training Stack

Training Infrastructure

Production-grade LLM training needs a complete MLOps pipeline:

Data management: Data-centric AI, Delta Lake, Arrow
Distributed training: Megatron-LM + DeepSpeed + FSDP
Experiment tracking: W&B, MLflow, TensorBoard
Model registry: MLflow Model Registry, HuggingFace Hub

Distributed Training Config

# DeepSpeed ZeRO-3 + tensor parallel config (the `tensor_parallel` keys are illustrative; check them against your DeepSpeed version)
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "tensor_parallel": {
        "enabled": true,
        "size": 8
    },
    "bf16": {"enabled": true},
    "gradient_clipping": 1.0
}

🚀

Inference Serving Stack

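Inference Serving Architecture

Production inference has to account for high availability, elastic scaling, and cost control:

Inference engines: vLLM, TGI, TensorRT-LLM, LMDeploy
Load balancing: nginx, envoy, custom scheduler
Autoscaling: KEDA + Prometheus + HPA/VPA
Cost optimization: Spot Instance + Checkpointing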

Kubernetes HPA for LLM

# KEDA autoscaling driven by a custom metric
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-worker-scaler
spec:
  scaleTargetRef:
    name: llm-inference-worker
  pollingInterval: 10
  cooldownPeriod: 300
  triggers:
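  - type: prometheus
    metadata:
      metricName: pending_requests_total
      serverAddress: http://prometheus:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "10"  # target waiting requests per replica (illustrative value; the trigger requires a threshold)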
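KEDA feeds the query result to the Horizontal Pod Autoscaler, which for an average-value target scales to roughly the metric divided by the per-replica target, rounded up. The sketch below shows that arithmetic only; the target of 10 waiting requests per replica and the replica bounds are assumed values that do not come from the manifest, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(queue_length, threshold, min_replicas=1, max_replicas=10):
    """Replica count needed so each replica handles at most `threshold` waiting requests."""
    raw = math.ceil(queue_length / threshold)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(45, 10), desired_replicas(0, 10), desired_replicas(500, 10))
```

A queue of 45 waiting requests asks for 5 replicas, an empty queue falls back to the minimum, and a burst of 500 is capped at the maximum. For GPU-backed workers the cap usually reflects the size of the GPU pool.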

📊

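Monitoring and Observability

Observability

Monitoring for LLM systems has to cover both traditional metrics and model-specific ones:

System layer: GPU utilization, memory, temperature, power draw
Inference layer: time to first token (TTFT), time per output token, token throughput
Model layer: perplexity, hallucination rate, toxicity score
Business layer: request success rate, error distribution, SLA attainment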

Prometheus Metrics

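# Example vLLM Prometheus metrics (illustrative; exact names and labels depend on the vLLM version)
# vLLM exposes these on its metrics endpoint for Prometheus to scrape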

vllm:num_requests_total{model="llama-2-70b",} 15234.0
vllm:num_generation_tokens_total{model="llama-2-70b",} 2048576.0
vllm:gpu_cache_usage_perc{model="llama-2-70b",gpu="0",} 0.89
vllm:time_to_first_token_seconds{model="llama-2-70b",quantile="0.5",} 0.045
vllm:time_per_output_token_seconds{model="llama-2-70b",quantile="0.5",} 0.012
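The two latency metrics above can be derived from per-token timestamps on the client side, which is useful for cross-checking what the server reports. A minimal sketch follows; the helper names and sample timings are made up, and the percentile uses the simple nearest-rank definition rather than Prometheus histogram interpolation.

```python
import math


def ttft(request_time, token_times):
    """Time to first token: delay between sending the request and the first token."""
    return token_times[0] - request_time


def tpot(token_times):
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


token_times = [10.5, 10.75, 11.0, 11.25]          # arrival time of each output token, seconds
print(ttft(10.0, token_times), tpot(token_times))
print(percentile([0.2, 0.5, 0.3, 0.9, 0.4], 50), percentile([0.2, 0.5, 0.3, 0.9, 0.4], 90))
```

TTFT is dominated by queueing and prefill, while time per output token reflects decode speed under the current batch size. Tracking them separately shows which stage a regression comes from.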

🔄

CI/CD for LLMs

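Continuous Integration and Deployment

CI/CD pipelines for LLMs differ significantly from those for traditional software:

Model validation: unit tests, benchmark evaluation
A/B testing: canary releases, traffic splitting
Rollback strategy: blue-green deployment, feature flags
Compliance checks: model cards, license audits, data lineage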

GitHub Actions LLM Evaluation

# .github/workflows/llm-eval.yml
name: LLM Evaluation Pipeline

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: gpu-runner
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          # Assumes lm-evaluation-harness; pass --model_args pretrained=... for the checkpoint under test
          python -m lm_eval \
            --model hf \
            --tasks mmlu,truthfulqa \
            --num_fewshot 5
      - name: Compare to Baseline
        run: |
          python scripts/compare_evals.py \
            --current results.json \
            --baseline s3://models/baseline.json
      - name: Comment Results
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              body: 'Eval Results: ...'
            })
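The comparison step in the workflow refers to a script whose contents are not shown. One plausible core for such a script is a per-task regression check against the baseline, sketched below; the function name, the tolerance of one percentage point, and the scores are all hypothetical.

```python
def find_regressions(current, baseline, tolerance=0.01):
    """Return tasks whose score is missing or dropped by more than `tolerance`."""
    regressions = {}
    for task, base_score in baseline.items():
        score = current.get(task)
        if score is None or base_score - score > tolerance:
            regressions[task] = (base_score, score)
    return regressions


baseline = {"mmlu": 0.69, "truthfulqa": 0.45}   # made-up scores
current = {"mmlu": 0.70, "truthfulqa": 0.42}
failed = find_regressions(current, baseline)
print(failed)
```

A CI job would exit with a non-zero status when the returned dictionary is non-empty, which blocks the pull request until the regression is explained or the baseline is updated.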
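
"The core challenge of MLOps is not getting a model to run, but building a closed loop from experimentation to production so that model iteration keeps pace with changing business needs."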
— Google Cloud MLOps Guidelines