## Abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ∼90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
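As a quick illustration of the linear scaling rule (a sketch using the paper's reference values of $$η = 0.1$$ for minibatch size 256; the variable names are our own):

```python
# Linear scaling rule: when the minibatch size is multiplied by k,
# multiply the learning rate by k as well.
base_lr = 0.1         # reference learning rate eta for minibatch size 256
base_batch = 256      # reference minibatch size n
large_batch = 8192    # large minibatch kn

k = large_batch // base_batch  # k = 32
scaled_lr = k * base_lr        # target learning rate k * eta = 3.2
print(k, scaled_lr)            # 32 3.2
```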

## 2.2 Warmup

As we discussed, for large minibatches (e.g., 8k) the linear scaling rule breaks down when the network is changing rapidly, which commonly occurs in early stages of training. We find that this issue can be alleviated by a properly designed warmup [16], namely, a strategy of using less aggressive learning rates at the start of training.

Constant warmup. The warmup strategy presented in [16] uses a low constant learning rate for the first few epochs of training. As we will show in §5, we have found constant warmup particularly helpful for prototyping object detection and segmentation methods [9, 30, 25, 14] that fine-tune pre-trained layers together with newly initialized layers.

In our ImageNet experiments with a large minibatch of size $$kn$$, we tried training with the low learning rate of $$η$$ for the first 5 epochs and then returning to the target learning rate of $$\hat{η} = kη$$. However, given a large $$k$$, we find that this constant warmup is not sufficient to solve the optimization problem, and the transition out of the low learning rate warmup phase can cause the training error to spike. This leads us to propose the following gradual warmup.
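As a rough PyTorch sketch of constant warmup (our own illustration, not the paper's Caffe2 code; the model and epoch counts are placeholders): hold the learning rate at $$η$$ for the first 5 epochs, then jump to $$\hat{η} = kη$$.

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in model for illustration
eta, k = 0.1, 32                  # base learning rate and scaling factor
warmup_epochs, total_epochs = 5, 90

# Initialize the optimizer at the target rate k * eta; the lambda scales
# it down to eta during warmup and back to k * eta afterwards.
optimizer = torch.optim.SGD(model.parameters(), lr=k * eta, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: 1.0 / k if epoch < warmup_epochs else 1.0,
)

for epoch in range(total_epochs):
    # ... train one epoch ...
    scheduler.step()
```

The abrupt factor-of-$$k$$ jump at epoch 5 is precisely the transition that can make the training error spike.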

Gradual warmup. We present an alternative warmup that gradually ramps up the learning rate from a small to a large value. This ramp avoids a sudden increase from a small learning rate to a large one, allowing healthy convergence at the start of training. In practice, with a large minibatch of size $$kn$$, we start from a learning rate of $$η$$ and increment it by a constant amount at each iteration such that it reaches $$\hat{η} = kη$$ after 5 epochs. After the warmup phase, we go back to the original learning rate schedule.
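A per-iteration sketch of this gradual ramp (again our own illustration, not the paper's code; the iteration counts assume ImageNet's ~1.28M training images and a minibatch of 8192):

```python
import torch

model = torch.nn.Linear(10, 10)    # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

eta, k = 0.1, 32                   # base learning rate and scaling factor
iters_per_epoch = 1281167 // 8192  # ImageNet images / minibatch size kn
warmup_iters = 5 * iters_per_epoch # the ramp spans the first 5 epochs

def warmup_lr(it):
    """Linear ramp from eta to k * eta over warmup_iters iterations."""
    if it < warmup_iters:
        return eta + (k * eta - eta) * it / warmup_iters
    return k * eta  # after warmup, the original schedule takes over

for it in range(warmup_iters + 1):
    for group in optimizer.param_groups:
        group["lr"] = warmup_lr(it)
    # ... forward pass, backward pass, optimizer.step() ...
```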

## Advantages

Related discussions:

• What does “learning rate warm-up” mean?
• What happened to differential learning rates and warmup?

Warmup offers the following advantages:

• It helps mitigate the model's premature overfitting to the early mini-batches, keeping the distribution it sees stable
• It helps keep the deeper layers of the model stable

## PyTorch Implementation

We forked the repository and made some adjustments: zjZSTU/pytorch-gradual-warmup-lr
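The fork's exact API may differ after those adjustments; as a self-contained sketch of what an epoch-wise gradual warmup scheduler looks like in PyTorch (the class and argument names below are our own, not the repository's):

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler

class GradualWarmupLR(_LRScheduler):
    """Ramp each param group's lr linearly from base_lr to k * base_lr
    over warmup_epochs epochs, then hold it at k * base_lr (hand off to
    the usual decay schedule from there)."""

    def __init__(self, optimizer, k, warmup_epochs, last_epoch=-1):
        self.k = k
        self.warmup_epochs = warmup_epochs
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        frac = min(self.last_epoch / self.warmup_epochs, 1.0)
        return [lr * (1.0 + (self.k - 1.0) * frac) for lr in self.base_lrs]

# Usage: the lr climbs from 0.1 to 3.2 over the first 5 epochs, then stays flat.
model = torch.nn.Linear(10, 10)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = GradualWarmupLR(optimizer, k=32, warmup_epochs=5)

for epoch in range(8):
    # ... train one epoch ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```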