Temporal Segment Network

Paper:

Implementation: ZJCV/TSN

Abstract

Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of temporal segment network framework given limited training samples. Our approach obtains the state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of temporal segment network and the proposed good practices.

Motivation and Contributions

The authors of the paper start from one observation:

consecutive frames are highly redundant, where a sparse and global temporal sampling strategy would be more favorable and efficient in this case.

The paper's contributions fall into three areas:

  1. Proposes an end-to-end framework, called temporal segment network (TSN), for learning video representations that capture long-range temporal information (see the sketch after this list)
  2. Designs a hierarchical aggregation scheme for applying action recognition models to untrimmed videos
  3. Studies a series of good practices for training deep action recognition models
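To make the first point concrete, here is a minimal PyTorch-style sketch of the video-level forward pass: a shared 2D backbone scores one snippet per segment, and average pooling (standing in for the segmental consensus) fuses the K snippet-level scores into a video-level prediction. The class name, the `backbone` argument, and the default of 3 segments are illustrative assumptions, not the ZJCV/TSN repo's actual API.

```python
import torch
import torch.nn as nn


class TSN(nn.Module):
    """Minimal TSN-style model: a shared 2D backbone scores each
    sampled snippet; average pooling acts as the segmental consensus
    that fuses the K snippet scores into one video-level score."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone          # shared across all segments
        self.num_segments = num_segments  # K

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, K, C, H, W), one snippet per segment.
        n, k, c, h, w = x.shape
        assert k == self.num_segments
        # Fold segments into the batch dimension; the backbone only
        # ever sees single frames, so any image CNN fits here.
        scores = self.backbone(x.view(n * k, c, h, w))  # (N*K, classes)
        # Segmental consensus: average the per-segment class scores.
        return scores.view(n, k, -1).mean(dim=1)        # (N, classes)
```

For instance, `TSN(torchvision.models.resnet18(num_classes=101))` gives a toy RGB-stream model for UCF101. Because the consensus is a plain average, gradients from the video-level loss flow back through every segment, which is what makes the framework end-to-end.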

Temporal Segment Network

TSN's video sampling strategy follows two principles:

  1. Sparse sampling: the number of frames used per training pass is fixed and does not grow with video length, which keeps the computational cost constant
  2. Global sampling: the sampled snippets are spread evenly along the temporal dimension, so regardless of video length they roughly cover the visual content of the entire video (a sampling sketch follows this list)
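Both principles reduce to one index-selection rule: split the frame range into K equal segments and pick one index per segment, random within the segment during training (temporal jittering) and the segment center at test time. A minimal sketch, assuming frames are addressed by integer index; `sample_indices` is a hypothetical helper name, not the repo's API.

```python
import numpy as np


def sample_indices(num_frames: int, num_segments: int = 3,
                   train: bool = True) -> np.ndarray:
    """Sparse, global snippet sampling in the spirit of TSN: the
    indices [0, num_frames) are split into `num_segments` equal
    segments and one index is drawn per segment, so the cost is
    fixed (always K frames) and the coverage is global."""
    seg_len = num_frames / num_segments
    starts = np.arange(num_segments) * seg_len
    if train:
        # Random offset inside each segment (temporal jittering).
        offsets = np.random.uniform(0, seg_len, size=num_segments)
    else:
        # Deterministic evaluation: take each segment's center.
        offsets = np.full(num_segments, seg_len / 2)
    return np.minimum(starts + offsets, num_frames - 1).astype(int)
```

For a 300-frame video with K = 3, training draws one frame each from [0, 100), [100, 200) and [200, 300); whether the video has 30 frames or 3,000, the cost stays at exactly three frames while the samples still span the whole clip.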

The overall architecture is illustrated in the figure below: