PyTorch implementation: lukemelas/EfficientNet-PyTorch

## Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.

To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at this https URL.

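## Introduction

1. ResNet can be scaled up from ResNet-18 to ResNet-200 by using more layers (depth);
2. GPipe reached 84.3% ImageNet top-1 accuracy by scaling a baseline model up to four times larger;
3. ...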

• Scale the network depth by $$\alpha^{N}$$
• Scale the network width by $$\beta^{N}$$
• Scale the image size by $$\gamma^{N}$$

## Related Work

• On the expressive power of deep neural networks. ICML, 2017.
• ResNet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172–6181, 2018.
• On the expressive power of overlapping architectures of deep learning. ICLR, 2018.
• The expressive power of neural networks: A view from the width. NeurIPS, 2018.

## Compound Model Scaling

### Problem Formulation

Layer $$i$$ of a convolutional network computes:

$Y_{i} = F_{i} (X_{i})$

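• $$F_{i}$$ is the operator;
• $$Y_{i}$$ is the output tensor;
• $$X_{i}$$ is the input tensor, with shape $$<H_{i}, W_{i}, C_{i}>$$ (batch dimension omitted);
• $$H_{i}$$ and $$W_{i}$$ are the spatial dimensions;
• $$C_{i}$$ is the channel dimension.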

$N = F_{k}\odot ... \odot F_{2}\odot F_{1}(X_{1})=\bigodot_{j=1...k}F_{j}(X_{1})$

$N=\bigodot_{i=1...s}F_{i}^{L_{i}}(X_{<H_{i}, W_{i}, C_{i}>})$

• $$F_{i}^{L_{i}}$$ means that layer $$F_{i}$$ is repeated $$L_{i}$$ times in stage $$i$$;
• $$<H_{i}, W_{i}, C_{i}>$$ is the shape of the input tensor $$X$$ of layer $$i$$;
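The stage-wise composition above can be sketched in plain Python. This is only an illustration under simplifying assumptions: `repeat_layer` and `compose_stages` are hypothetical helper names, and the "layers" are toy functions on numbers rather than real convolutions on tensors.

```python
from functools import reduce
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def repeat_layer(layer: Layer, times: int) -> Layer:
    """Return F^L: the same layer applied `times` times in sequence."""
    def repeated(x: float) -> float:
        for _ in range(times):
            x = layer(x)
        return x
    return repeated


def compose_stages(stages: List[Tuple[Layer, int]]) -> Layer:
    """Chain the stages in order, each given as a (layer, repeats) pair."""
    blocks = [repeat_layer(layer, times) for layer, times in stages]
    return lambda x: reduce(lambda acc, block: block(acc), blocks, x)


# Toy network: stage 1 adds 1 three times, stage 2 doubles twice.
network = compose_stages([(lambda x: x + 1, 3), (lambda x: x * 2, 2)])
result = network(0.0)  # ((0 + 1 + 1 + 1) * 2) * 2
```

Here each `(layer, repeats)` pair plays the role of one stage $$F_{i}^{L_{i}}$$, and the stages are applied in order from the first to the last.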

$\max_{d, w, r} Accuracy(N(d, w, r))$

$s.t. N(d, w, r) = \bigodot_{i=1...s}\hat{F}_{i}^{d\cdot \hat{L}_{i}}(X_{<r\cdot \hat{H}_{i}, r\cdot \hat{W}_{i}, w\cdot \hat{C}_{i}>})$

$Memory(N) \leq target\_memory$

$FLOPS(N) \leq target\_flops$

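• $$w, d, r$$ are the coefficients for scaling network width, depth, and resolution, respectively;
• $$\hat{F}_{i}, \hat{L}_{i}, \hat{H}_{i}, \hat{W}_{i}, \hat{C}_{i}$$ are the predefined parameters of the baseline network.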
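To make the roles of the three coefficients concrete, the sketch below applies $$d$$, $$w$$, and $$r$$ to a baseline described as a list of stages. The `Stage` class, the function names, the rounding rules, and the two-stage baseline are all assumptions made for illustration; they are not the real EfficientNet-B0 layout or the rounding used by any particular implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One baseline stage: `layers` repeats on a (height, width, channels) input."""
    layers: int
    height: int
    width: int
    channels: int


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Apply the depth (d), width (w) and resolution (r) coefficients to a stage.

    Assumed rounding: depth and channels round up, spatial sizes round to nearest.
    """
    return Stage(
        layers=math.ceil(d * stage.layers),
        height=round(r * stage.height),
        width=round(r * stage.width),
        channels=math.ceil(w * stage.channels),
    )


def scale_network(baseline: List[Stage], d: float, w: float, r: float) -> List[Stage]:
    """Scale every stage of the baseline with the same three coefficients."""
    return [scale_stage(stage, d, w, r) for stage in baseline]


# Hypothetical two-stage baseline, used only to exercise the function.
baseline = [Stage(1, 112, 112, 16), Stage(2, 56, 56, 24)]
scaled = scale_network(baseline, d=2.0, w=1.5, r=1.25)
```

Note that $$d$$ only touches the number of layers, $$r$$ only the spatial size, and $$w$$ only the channel count, which is why the three can be tuned separately or together.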

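### Scaling a Single Dimension

1. The optimal values of the coefficients $$d, w, r$$ depend on each other;
2. The coefficients need to change under different resource constraints.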

### Compound Scaling

$depth: d = \alpha^{\phi }$

$width: w = \beta^{\phi }$

$resolution: r = \gamma^{\phi }$

$s.t.\ \ \alpha\cdot \beta^{2}\cdot \gamma^{2} \approx 2$

$\alpha \geq 1, \beta \geq 1, \gamma \geq 1$

• $$\alpha, \beta, \gamma$$ are constants determined by a grid search;
• $$\phi$$ controls how much additional compute is available for scaling.
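The compound scaling rule can be tried out numerically with a few lines of Python. The default values $$\alpha = 1.2$$, $$\beta = 1.1$$, $$\gamma = 1.15$$ are the ones the EfficientNet paper reports for its B0 baseline; they do not appear in the notes above. The `Scaling` tuple and the `compound_scaling` function are hypothetical names. The FLOPS factor relies on the approximation that FLOPS grow in proportion to $$d \cdot w^{2} \cdot r^{2}$$.

```python
from typing import NamedTuple


class Scaling(NamedTuple):
    depth: float
    width: float
    resolution: float
    flops_factor: float


def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1,
                     gamma: float = 1.15) -> Scaling:
    """Compute d, w, r for a given phi, plus the approximate FLOPS multiplier."""
    if min(alpha, beta, gamma) < 1:
        raise ValueError("alpha, beta and gamma must all be >= 1")
    d = alpha ** phi
    w = beta ** phi
    r = gamma ** phi
    # Approximation: FLOPS scale as d * w^2 * r^2 = (alpha * beta^2 * gamma^2) ** phi.
    return Scaling(d, w, r, d * w ** 2 * r ** 2)


for phi in range(4):
    s = compound_scaling(phi)
    print(f"phi={phi}: d={s.depth:.2f} w={s.width:.2f} "
          f"r={s.resolution:.2f} flops x{s.flops_factor:.2f}")
```

With these defaults $$\alpha\cdot \beta^{2}\cdot \gamma^{2}$$ is about 1.92, so each unit increase of $$\phi$$ roughly doubles the FLOPS, which is what the constraint is meant to guarantee.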

## Experiments

### EfficientNet

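Training EfficientNet on ImageNet:

• Optimizer: RMSProp with decay 0.9 and momentum 0.9
• Batch normalization: momentum 0.99
• Weight decay: 1e-5
• Learning rate: initial value 0.256, decayed by a factor of 0.97 every 2.4 epochs
• Activation: SiLU (Swish-1)
• Data augmentation: AutoAugment
• Stochastic depth: survival probability 0.8
• Dropout: increased linearly from 0.2 for B0 to 0.5 for B7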
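Two of the settings above are schedules rather than constants, and a small sketch may help in reading them. It rests on two assumptions: that the learning rate decays in discrete steps (once per full 2.4 epochs) rather than continuously, and that "linearly" means plain interpolation over the variant index from B0 to B7. Both function names are hypothetical, and real implementations may round or stage these values differently.

```python
import math


def learning_rate(epoch: float, initial: float = 0.256, decay: float = 0.97,
                  decay_every: float = 2.4) -> float:
    """Staircase exponential decay: multiply by `decay` once per `decay_every` epochs."""
    steps = math.floor(epoch / decay_every)
    return initial * decay ** steps


def dropout_rate(variant: int, low: float = 0.2, high: float = 0.5,
                 last_variant: int = 7) -> float:
    """Linearly interpolate the dropout rate from B0 (`low`) to B`last_variant` (`high`)."""
    if not 0 <= variant <= last_variant:
        raise ValueError("variant must be between 0 and last_variant")
    return low + (high - low) * variant / last_variant


for variant in range(8):
    print(f"B{variant}: dropout={dropout_rate(variant):.3f}")
```

If a continuous decay is preferred, dropping the `math.floor` call gives a smooth curve that passes through the same values at multiples of 2.4 epochs.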

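## Summary

1. Proposes a compound scaling method;
2. Finds the EfficientNet-B0 architecture through architecture search;
3. Shows that a network built from the lightweight MBConv operator can also be scaled up into large models.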