## Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

## 1x1 Convolutions

1. Increase the representational power of a convolution by following it with a nonlinear activation
2. Reduce the depth of the output volume by controlling the number of filters, achieving dimension reduction and removing computational bottlenecks
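To make the second point concrete, the following sketch compares the parameter count of a direct 5x5 convolution against the 1x1-reduced version, using the channel numbers of the inception (3a) module from the paper (192 input channels, a 16-filter 5x5 reduce, 32 5x5 filters). The helper name is illustrative only.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 convolution: 192 -> 32 channels
direct = conv_params(5, 192, 32)                            # 153600

# 1x1 reduction to 16 channels first, then 5x5 convolution to 32
reduced = conv_params(1, 192, 16) + conv_params(5, 16, 32)  # 3072 + 12800 = 15872

print(direct, reduced, direct / reduced)  # roughly a 9.7x parameter saving
```

The same ratio applies to multiply-adds, since both layers slide over the same spatial grid, which is why the reductions remove the computational bottleneck.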

## Considerations

1. A bigger network means more parameters, which makes it more prone to overfitting, since the amount of labeled training data is limited
2. Uniformly increasing network size dramatically increases the demand on computational resources, and much of that computation is easily wasted (many weights end up close to zero)

## Architecture Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.


As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.

### Model Details

• Activation: ReLU
• Input: 224x224 RGB images with the mean subtracted
• "#3x3 reduce" and "#5x5 reduce" denote the number of 1x1 convolutions applied before the 3x3 and 5x5 convolutions
• The "pool proj" column gives the number of 1x1 filters applied after the built-in max-pooling
• Fully connected layers: replaced with average pooling
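Replacing the fully connected layer with average pooling is purely a shape operation and adds no parameters, as the sketch below shows (the 7x7x1024 input matches the final feature map size of GoogLeNet; the variable names are illustrative):

```python
import numpy as np

# final Inception stage output: 1024 channels over a 7x7 grid
features = np.random.default_rng(0).standard_normal((1024, 7, 7))

# global average pooling: one scalar per channel, no learned weights
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (1024,)
```

A flattening fully connected layer mapping 7x7x1024 to 1024 units would instead need about 51 million weights, which is why this substitution shrinks the model so much.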

### Auxiliary Classifiers

• An average pooling layer with 5x5 filter size and stride 3, resulting in a 4x4x512 output at stage (4a) and a 4x4x528 output at stage (4d)
• A 1x1 convolution with 128 filters for dimension reduction and rectified linear activation
• A fully connected layer with 1024 units and ReLU
• A dropout layer with a 70% ratio of dropped outputs
• A linear layer with softmax loss as the classifier, predicting the same 1000 classes
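The shape arithmetic of the auxiliary classifier attached at (4a) can be checked step by step. This is only a forward-pass sketch: the weights are random placeholders, and dropout is omitted because it is a training-time layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 14, 14))  # stage (4a) output feature map

# 5x5 average pooling with stride 3: floor((14 - 5) / 3) + 1 = 4, giving 4x4x512
h = (14 - 5) // 3 + 1
pooled = np.zeros((512, h, h))
for i in range(h):
    for j in range(h):
        pooled[:, i, j] = x[:, 3*i:3*i + 5, 3*j:3*j + 5].mean(axis=(1, 2))

# 1x1 convolution with 128 filters: a matrix multiply over the channel axis, then ReLU
w1 = rng.standard_normal((128, 512)) * 0.01
reduced = np.maximum(np.einsum('oc,chw->ohw', w1, pooled), 0)

# fully connected layer with 1024 units on the flattened 128*4*4 = 2048 features
w2 = rng.standard_normal((1024, reduced.size)) * 0.01
fc = np.maximum(w2 @ reduced.ravel(), 0)

# linear layer + softmax over the 1000 classes
w3 = rng.standard_normal((1000, 1024)) * 0.01
logits = w3 @ fc
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(reduced.shape, fc.shape, probs.shape)  # (128, 4, 4) (1024,) (1000,)
```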

## Training Methodology

• Optimizer: SGD with 0.9 momentum
• Step learning-rate schedule: decrease the learning rate by 4% every 8 epochs
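The step schedule amounts to a one-line function (epochs are assumed zero-indexed, and `base_lr=0.01` is a placeholder value, since the initial learning rate is not stated here):

```python
def learning_rate(epoch, base_lr=0.01):
    """Decrease the learning rate by 4% every 8 epochs (base_lr is an assumed value)."""
    return base_lr * 0.96 ** (epoch // 8)

print(learning_rate(0), learning_rate(8), learning_rate(16))  # 0.01 0.0096 0.009216
```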

## ILSVRC 2014 Classification Challenge

### Overview

The ILSVRC 2014 classification challenge covers 1000 categories, with about 1.2 million training images, 50,000 validation images, and 100,000 test images. Each image is annotated with one ground-truth category.

### Evaluation Metrics

1. Top-1 accuracy rate
2. Top-5 error rate
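Both metrics can be computed from predicted class scores as follows (a sketch with a toy batch of 3 samples over 4 classes, so top-3 stands in for top-5; the helper name is illustrative):

```python
import numpy as np

def top_k_correct(scores, labels, k):
    """Count samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes per sample
    return sum(label in row for row, label in zip(topk, labels))

scores = np.array([[0.10, 0.50, 0.25, 0.15],
                   [0.60, 0.05, 0.25, 0.10],
                   [0.20, 0.15, 0.30, 0.35]])
labels = [1, 2, 1]
n = len(labels)
top1_accuracy = top_k_correct(scores, labels, 1) / n
top3_error = 1 - top_k_correct(scores, labels, 3) / n
print(top1_accuracy, top3_error)
```

Top-1 accuracy rewards only an exact best guess, while top-k error forgives a correct class anywhere in the k highest-scoring predictions; ILSVRC uses k = 5 as the official ranking metric.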

### Ensemble and Testing Techniques

1. Seven identical GoogLeNet models were trained with different image sampling methods and combined into an ensemble for prediction
2. At test time, each image is resized to 4 scales (shorter side at 256, 288, 320, 352). From each scaled image, the left/center/right squares (with side equal to the shorter dimension) are taken; for portrait images, top/center/bottom instead. For each square, the four corner and center 224x224 crops are taken, along with the square itself resized to 224x224, plus the mirrored versions of all of these. This yields 4x3x6x2 = 144 crops per image
3. The softmax probabilities are averaged over all crops and over the ensemble models to obtain the final per-class probabilities
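The 144-crop scheme can be enumerated programmatically to verify the count (a sketch: the tuples only describe the crops, no actual image processing is performed):

```python
# enumerate the test-time views: 4 scales x 3 squares x 6 crops x 2 mirrors = 144
scales = [256, 288, 320, 352]            # shorter side after resizing
squares = ['left', 'center', 'right']    # top/center/bottom for portrait images
crops = ['top-left', 'top-right', 'bottom-left', 'bottom-right', 'center', 'resized-square']
mirrors = [False, True]

test_views = [(s, sq, c, m) for s in scales for sq in squares for c in crops for m in mirrors]
print(len(test_views))  # 144
```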