## Abstract

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most training is done with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improve performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentum. Files to help replicate the results reported here are available at https://github.com/lnsmith54/hyperParam1.

## 4.4 WEIGHT DECAY

Weight decay is one form of regularization, and it plays an important role in training, so its value needs to be set properly. The important point made above applies here; that is, practitioners must balance the various forms of regularization to obtain good performance. The interested reader can see Kukacka et al. (2017) for a review of regularization methods.

Our experiments show that weight decay is not like learning rates or momentum and the best value should remain constant through the training (i.e., cyclical weight decay is not useful). This appears to be generally so for regularization but was not tested for all regularization methods (a more complete study of regularization is planned for Part 2 of this report). Since the network’s performance is dependent on a proper weight decay value, a grid search is worthwhile and differences are visible early in the training. That is, the validation loss early in the training is sufficient for determining a good value. As shown below, a reasonable procedure is to make combined CLR and CM runs at a few values of the weight decay in order to simultaneously determine the best learning rates, momentum and weight decay.


If you have no idea of a reasonable value for weight decay, test $10^{−3}, 10^{−4} , 10^{−5}$ , and 0. Smaller datasets and architectures seem to require larger values for weight decay while larger datasets and deeper architectures seem to require smaller values. Our hypothesis is that complex data provides its own regularization and other regularization should be reduced.

On the other hand, if your experience indicates that a weight decay value of $10^{−4}$ should be about right, these initial runs might be at $3×10^{−5}, 10^{−4}, 3×10^{−4}$. The reasoning behind choosing $3$ rather than $5$ is that only an order of magnitude is needed for weight decay, so this report suggests bisecting the exponent rather than bisecting the value (i.e., between $10^{−4}$ and $10^{−3}$ one bisects as $10^{−3.5} = 3.16 × 10^{−4}$). Afterwards, make a follow-up run that bisects the exponent of the best two of these or, if none seems best, extrapolate towards an improved value.

Remark 6. Since the amount of regularization must be balanced for each dataset and architecture, the value of weight decay is a key knob to turn for tuning regularization against the regularization from an increasing learning rate. While other forms of regularization are generally fixed (i.e., dropout ratio, stochastic depth), one can easily change the weight decay value when experimenting with maximum learning rate and stepsize values.

Figure 9a shows the validation loss of a grid search for a 3-layer network on Cifar-10 data, after assuming a learning rate of 0.005 and momentum of 0.95. Here it would be reasonable to run values of $1×10^{−2} , 3.2×10^{−3} , 10^{−3}$ , which are shown in the Figure. Clearly the yellow curve implies that $1 × 10^{−2}$ is too large and the blue curve implies that $10^{−3}$ is too small (notice the overfitting). After running these three, a value of $3.2 × 10^{−3}$ seems right but one can also make a run with a weight decay value of $10^{−2.75} = 1.8 × 10^{−3}$ , which is the purple curve. This confirms that $3.2 × 10^{−3}$ is a good choice. Figure 9b shows the accuracy results from trainings at all four of these values and it is clear that the validation loss is predictive of the best final accuracy.

A reasonable question is whether the values for weight decay, learning rate, and momentum can all be determined simultaneously. Figure 10a shows the runs of a learning rate range test (LR = 0.001 - 0.01) along with a decreasing momentum (= 0.98 - 0.8) at weight decay values of $10^{−2}, 3.2 × 10^{−3}, 10^{−3}$. As before, a value of $3.2 × 10^{−3}$ seems best. However, a test of weight decay at $1.8 × 10^{−3}$ shows it is better because it remains stable for larger learning rates and even attains a slightly lower validation loss. This is confirmed in Figure 10b, which shows a slightly improved accuracy at learning rates above 0.005.

The optimal weight decay is different if you search with a constant learning rate versus using a learning rate range. This aligns with our intuition because the larger learning rates provide regularization so a smaller weight decay value is optimal. Figure 11a shows the results of a weight decay search with a constant learning rate of 0.1. In this case a weight decay of $10^{−4}$ exhibits overfitting and a larger weight decay of $10^{−3}$ is better. Also shown are the similar results at weight decays of $3.2 × 10^{−4}$ and $5.6 × 10^{−4}$ to illustrate that a single significant figure accuracy for weight decay is all that is necessary. On the other hand, Figure 11b illustrates the results of a weight decay search using a learning rate range test from 0.1 to 1.0. This search indicates a smaller weight decay value of $10^{−4}$ is best and larger learning rates in the range of 0.5 to 0.8 should be used.

Another option for a grid search over weight decay is to make a single run at a middle value of weight decay and save a snapshot after the loss plateaus. Use this snapshot to restart runs, each with a different value of WD. This can save time in searching for the best weight decay. Figure 12a shows an example on Cifar-10 with a 3-layer network (this is for illustration only as this architecture runs very quickly). Here the initial run was with a sub-optimal weight decay value of $10^{−3}$. From the restart point, three continuation runs were made with weight decay values of $10^{−3}, 3 × 10^{−3}$, and $10^{−2}$. This Figure shows that $3 × 10^{−3}$ is best.

Figure 12b illustrates a weight decay grid search from a snapshot for resnet-56 while performing a LR range test. This Figure shows the first half of the range test with a value of weight decay of $10^{−4}$ . Then three continuations are run with weight decay values of $10^{−3}, 3 × 10^{−4}$, and $10^{−4}$ . It is clear that a weight decay value of $10^{−4}$ is best and information about the learning rate range is simultaneously available.

## Implementation Procedure

1. Find the optimal learning rate.
2. With the learning rate fixed, test weight decay values of 0, 1e-3, 1e-4, and 1e-5.
3. If, for example, 1e-4 turns out best, refine with weight decay values of 3e-5, 1e-4, and 3e-4.
4. Take the resulting value as the best weight decay (a compact sketch of steps 2-4 follows below).

## PyTorch Implementation

### Training Parameters

• Model: SqueezeNet
• Dataset: CIFAR100
• Loss function: label smoothing regularization
• Optimizer: Adam
• Batch size: 96
• Learning rate: from a minimum of 1e-8 to a maximum of 10 (an LR range test; see the setup sketch below)

### Comparing Different Learning Rate Decay Ranges

Three decay ranges, all starting from a learning rate of 3e-4, were compared:

| Learning rate decay range | Top-1 acc | Top-5 acc |
| --- | --- | --- |
| 3e-4 -> 1e-4 | 60.31% | 85.53% |
| 3e-4 -> 3e-5 | 59.85% | 84.86% |
| 3e-4 -> 0 | 60.02% | 85.64% |

## Summary

• Learning rate: 3e-4
• Learning rate decay range: 3e-4 -> 1e-4
• Weight decay: 3e-5