## Abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ∼90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.

## Section Overview

• First, learning-rate adjustment for large-scale training: the linear scaling rule and the warmup strategy
• Next, common pitfalls during training and the implementation details that trigger them
• Then, a detailed description of the distributed training algorithm
• Finally, experiments demonstrating that the difficulty of large-scale training is an optimization problem, not a generalization problem

## Linear Scaling Rule

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
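The rule reduces to one multiplication; a minimal sketch (the function name is illustrative, not from the paper's code):

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: scale the learning rate by k = batch / base_batch."""
    k = batch / base_batch
    return base_lr * k

# The paper's ResNet-50 baseline uses lr = 0.1 at minibatch size 256,
# so a minibatch of 8192 (k = 32) gives lr = 3.2.
```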

### Informal Justification

Compare $k$ small-minibatch SGD steps (minibatches $B_j$ of size $n$, learning rate $\eta$) against a single step over the union of those minibatches (size $kn$, learning rate $\hat{\eta}$):

$w_{t+k} = w_{t} - \eta \frac {1}{n} \sum_{j<k} \sum_{x\in B_{j}} \nabla l(x, w_{t+j}) \\ \hat{w}_{t+1} = w_{t} - \hat{\eta} \frac {1}{kn} \sum_{j<k} \sum_{x\in B_{j}} \nabla l(x, w_{t})$

Under the assumption that $\nabla l(x, w_{t+j}) \approx \nabla l(x, w_{t})$ for $j < k$, setting $\hat{\eta} = k\eta$ gives $\hat{w}_{t+1} \approx w_{t+k}$, which is exactly the linear scaling rule.

### Limitations

1. Early in training, when the network weights are changing rapidly, the gradient-similarity assumption breaks down
2. The minibatch size cannot be scaled up indefinitely

### Remedies

1. The first problem is addressed with a warmup strategy
2. For the second, experiments show that the upper limit for accuracy-preserving training is a minibatch size of about 8k

## Warmup Strategies

• Constant warmup
  • Idea: train with a fixed, low learning rate for the first $5$ epochs
  • Suitable for: fine-tuning a pretrained model
• Gradual warmup
  • Idea: during the first $5$ epochs, start from a low learning rate $\eta$ and ramp it up gradually to the target rate $k\eta$
  • Suitable for: training a model from scratch
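Gradual warmup can be sketched as a simple schedule function (illustrative names; the paper ramps per iteration, while per-epoch granularity is shown here for brevity):

```python
def learning_rate(epoch: int, base_lr: float, k: int, warmup_epochs: int = 5) -> float:
    """Gradual warmup: ramp linearly from base_lr to k*base_lr over the
    first warmup_epochs epochs, then hold at the target rate k*base_lr."""
    target = k * base_lr
    if epoch < warmup_epochs:
        # linear interpolation between base_lr and target
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target
```

After warmup ends, the usual schedule (e.g. stepwise decay) takes over from the target rate.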

## Practical Guidelines

1. Scaling the cross-entropy loss is not equivalent to scaling the learning rate
2. If the alternate momentum formulation is used, changing the learning rate also requires a momentum correction
3. Normalize the per-GPU loss by the total minibatch size $kn$, not by the per-GPU size $n$
4. In each epoch, use a single random shuffle of the training data, then partition it across the GPUs
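Guideline 3 can be checked numerically: normalizing each per-worker loss sum by the global size $kn$ and then summing across workers equals normalizing the whole batch once by $kn$. A minimal sketch with illustrative names (`allreduce_sum` stands in for a real allreduce):

```python
def per_worker_loss(losses, k):
    """Sum of this worker's per-example losses, normalized by k*n."""
    n = len(losses)
    return sum(losses) / (k * n)   # divide by the total minibatch size k*n

def allreduce_sum(values):
    """Stand-in for an allreduce over workers: a plain sum."""
    return sum(values)

# Two workers, n = 2 examples each (global minibatch size k*n = 4):
shards = [[1.0, 3.0], [2.0, 6.0]]
k = len(shards)
global_loss = allreduce_sum(per_worker_loss(s, k) for s in shards)

# Equals the whole batch's loss normalized once by k*n:
reference = sum(sum(s) for s in shards) / (k * len(shards[0]))
```

Normalizing by $n$ instead would silently scale gradients by $k$, which interacts badly with the linear scaling rule.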

Reference momentum SGD, with the learning rate outside the velocity:

$u_{t+1} = mu_{t} + \frac {1}{n} \sum_{x\in B} \nabla l(x, w_{t})\\ w_{t+1} = w_{t} - \eta u_{t+1}$

Alternate formulation, which folds the learning rate into the velocity ($v_t = \eta u_t$):

$v_{t+1} = mv_{t} + \eta \frac {1}{n}\sum_{x\in B}\nabla l(x, w_{t}) \\ w_{t+1} = w_{t} - v_{t+1}$

When the learning rate changes from $\eta_t$ to $\eta_{t+1}$, keeping the two forms equivalent requires the momentum correction $v_{t+1} = m \frac{\eta_{t+1}}{\eta_{t}} v_{t} + \eta_{t+1} \frac{1}{n}\sum_{x\in B}\nabla l(x, w_{t})$.
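The equivalence of the two forms under a learning-rate change can be verified with a one-step sketch (illustrative names, not the paper's code):

```python
# Guideline 2 sketch: the reference update keeps eta outside the velocity;
# the alternate form folds it in. When eta changes between steps, the
# alternate form needs a momentum correction factor eta_new/eta_old on
# the stored velocity to stay equivalent to the reference update.

def u_form(w, u, grad, m, eta):
    """Reference: u_{t+1} = m*u_t + grad; w_{t+1} = w_t - eta*u_{t+1}."""
    u = m * u + grad
    return w - eta * u, u

def v_form(w, v, grad, m, eta_new, eta_old):
    """Alternate form with momentum correction:
    v_{t+1} = m*(eta_new/eta_old)*v_t + eta_new*grad; w_{t+1} = w_t - v_{t+1}."""
    v = m * (eta_new / eta_old) * v + eta_new * grad
    return w - v, v
```

Starting from the same state (with $v_t = \eta_t u_t$) and changing the rate from $\eta_t$ to $\eta_{t+1}$, both updates produce the same weights; dropping the correction factor matters most exactly when the rate changes, e.g. during warmup.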