SSD: Single Shot MultiBox Detector

Published 2020-05-17, updated 2021-04-14, category: Deep Learning/deeplearning


Abstract

We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP\(^{1}\) on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd .


\(^{1}\) We achieved even better results using an improved data augmentation scheme in follow-on experiments: 77.2% mAP for 300×300 input and 79.8% mAP for 512×512 input on VOC2007. Please see Sec. 3.6 for details.


Algorithm Implementation

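Use a single base network to extract several feature maps at different resolutions; these supply the features for box regression and classification
Set up prior boxes (the "default boxes" of the abstract): on every cell of every feature map, place a set of prior boxes with different scales and aspect ratios
Feed the feature maps into the subsequent prediction layers to obtain the predicted regression targets \(t\), the classes, and the class probabilities (implemented with convolutional filters)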
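Training stage:
1. Using the ground-truth boxes annotated for each image, compute the regression targets \(t_{*}\) and the corresponding classes
2. Compute the classification loss (Softmax Loss) and the localization loss (Smooth L1 Loss)
3. Update the weights by gradient descent and apply the learning-rate schedule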
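
The first two training steps can be illustrated with a minimal pure-Python sketch. It assumes boxes are stored in center form `(cx, cy, w, h)` and that each ground-truth box has already been matched to a prior box; the function names (`encode`, `smooth_l1`, `localization_loss`, `softmax_loss`) are hypothetical and chosen only for this illustration. The per-coordinate variance scaling used by some SSD implementations is deliberately left out, so this is a sketch of the idea rather than the reference implementation.

```python
import math


def encode(gt, prior):
    """Regression target t* for one matched (ground truth, prior) pair.

    Both boxes are (cx, cy, w, h). Centers are offset relative to the
    prior size; widths and heights are encoded as log ratios.
    """
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = prior
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(t_pred, t_star):
    """Sum of Smooth L1 over the four box coordinates of one positive prior."""
    return sum(smooth_l1(p - s) for p, s in zip(t_pred, t_star))


def softmax_loss(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Illustrative values only: one prior box and one matched ground-truth box.
prior = (0.5, 0.5, 0.2, 0.2)
gt = (0.55, 0.45, 0.4, 0.2)
t_star = encode(gt, prior)
loc = localization_loss((0.0, 0.0, 0.0, 0.0), t_star)
cls = softmax_loss([2.0, 0.5, -1.0], label=0)
```

In a full training loop the localization loss would be summed over positive (matched) priors only, while the classification loss would also cover negatives; how the two terms are weighted and normalized is not specified in the outline above.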
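Testing stage:
1. Combine the prior boxes with the predicted regression targets \(t\) to compute the actual predicted bounding boxes
2. Apply NMS to filter out overlapping predicted boxes within each class
3. Filter by class probability, discarding predicted boxes whose probability is below the classification threshold
4. Keep the top few predicted boxes as the final result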
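
The four test-time steps can be sketched for a single class as follows. This is a minimal pure-Python illustration under stated assumptions: prior boxes are in center form `(cx, cy, w, h)`, the offsets use the same center/log-size encoding as the training targets, and the helper names (`decode`, `iou`, `nms`, `detect_one_class`) as well as the default thresholds are hypothetical values picked for the example. The order of operations follows the list above; some implementations apply the score threshold before NMS to reduce the number of comparisons.

```python
import math


def decode(t, prior):
    """Turn predicted offsets t back into a corner-form box (x1, y1, x2, y2)."""
    t_cx, t_cy, t_w, t_h = t
    d_cx, d_cy, d_w, d_h = prior
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two corner-form boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS; returns kept indices sorted by descending score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def detect_one_class(offsets, priors, scores, score_thresh=0.5, top_k=200):
    """Decode, run NMS, drop low scores, then keep the top_k boxes."""
    boxes = [decode(t, p) for t, p in zip(offsets, priors)]
    keep = nms(boxes, scores)
    keep = [i for i in keep if scores[i] >= score_thresh]
    return [(boxes[i], scores[i]) for i in keep[:top_k]]


# Illustrative values only: two heavily overlapping priors and one distant one.
priors = [(0.5, 0.5, 0.2, 0.2), (0.52, 0.5, 0.2, 0.2), (0.2, 0.2, 0.1, 0.1)]
offsets = [(0.0, 0.0, 0.0, 0.0)] * 3
scores = [0.9, 0.8, 0.3]
detections = detect_one_class(offsets, priors, scores)
```

With these inputs the second box is suppressed by NMS because it overlaps the first, and the third box survives NMS but is removed by the score threshold, leaving a single detection. For a multi-class detector the same routine would be run once per class on that class's scores.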