## Abstract

Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models could be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the instantiation of the temporal segment network framework given limited training samples. Our approach obtains state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of temporal segment network and the proposed good practices.

## Introduction

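Existing deep models for action recognition have two limitations:

1. Some deep models operate on consecutive frames over a short span, so they focus on appearance and short-term motion and do not exploit long-range temporal structure;
2. Other deep models collect long-range frame data by dense sampling (possibly at a fixed interval), which consumes a large amount of memory and computation time.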

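This raises three challenges:

1. How to effectively learn video representations that capture long-range temporal structure;
2. How to apply the learned deep models to more realistic untrimmed videos;
3. How to learn deep models effectively from limited training samples and apply them to large-scale data.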

Consecutive frames are highly redundant, so a sparse and global temporal sampling strategy is more favorable and efficient.

Five aggregation functions are considered for combining the classification scores of the sampled frames.
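The idea of aggregating per-frame classification scores into one video-level prediction can be sketched as follows. The note does not list the five aggregation functions, so the sketch uses average and max pooling purely as illustrative choices; the function names `aggregate_scores` and `softmax` are hypothetical and not taken from the paper's code.

```python
import math


def aggregate_scores(snippet_scores, method="average"):
    """Combine per-snippet class scores into one video-level score vector.

    snippet_scores: list of score vectors, one per sampled snippet,
    each of the same length (one entry per class).
    """
    if not snippet_scores:
        raise ValueError("snippet_scores must not be empty")
    # Regroup scores so that each tuple holds one class across all snippets.
    per_class = list(zip(*snippet_scores))
    if method == "average":
        return [sum(c) / len(c) for c in per_class]
    if method == "max":
        return [max(c) for c in per_class]
    raise ValueError(f"unknown aggregation method: {method}")


def softmax(scores):
    """Turn aggregated scores into class probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three snippets, two classes (made-up scores for illustration only).
scores = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
video_scores = aggregate_scores(scores, method="average")
video_probs = softmax(video_scores)
```

Applying the softmax after aggregation, rather than per snippet, means the prediction is made once for the whole video.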

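Four input modalities are considered:

1. RGB images;
2. Stacked RGB differences;
3. Stacked optical flow fields;
4. Stacked warped optical flow fields.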
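As an illustration of the stacked RGB difference input, the sketch below subtracts each frame from the next one and stacks the results. It assumes every frame is a flat list of pixel values of equal length; the function name `stacked_rgb_difference` is hypothetical, and the paper's exact preprocessing may differ.

```python
def stacked_rgb_difference(frames):
    """Return the element-wise differences between consecutive frames.

    frames: list of frames, each a flat list of pixel values.
    For N input frames the result contains N - 1 difference frames.
    """
    if len(frames) < 2:
        raise ValueError("at least two frames are required")
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same size")
        # A positive value means the pixel became brighter in the later frame.
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs


# Three tiny made-up "frames" with three pixel values each.
frames = [[10, 20, 30], [12, 18, 30], [15, 18, 25]]
diff_stack = stacked_rgb_difference(frames)
```

Unlike optical flow, this input needs only subtraction, which is consistent with the high frame rate the abstract reports for the RGB difference model.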

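The paper discusses:

1. Training and inference problems, in three respects: model design, application, and training;
2. The corresponding solutions.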

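Contributions:

1. An end-to-end framework, called temporal segment network (TSN), for learning video representations that capture long-range temporal information;
2. A hierarchical aggregation scheme for applying action recognition models to untrimmed videos;
3. A study of a series of good practices for deep action recognition models.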

## Temporal Segment Network

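This section covers, in order:

1. The motivation for segment-based sampling;
2. The architecture of the temporal segment network framework;
3. An analysis of the aggregation functions;
4. Several practical issues in instantiating the temporal segment network framework.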

## Sampling Strategy

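TSN's video sampling strategy follows two principles:

1. Sparse sampling: the number of frames used in one training pass is fixed and does not change with video length, which keeps the computational cost constant.
2. Global sampling: the sampled snippets are distributed uniformly along the temporal dimension, so they roughly cover the visual content of the whole video regardless of its length.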
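The two principles above can be sketched as a small sampling routine: the video is split into a fixed number of equal-length segments and one frame index is drawn from each. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `sample_segment_indices`, the random draw during training, the segment-center choice at test time, and the handling of very short videos are all assumptions made for illustration.

```python
import random


def sample_segment_indices(num_frames, num_segments, training=True, rng=None):
    """Pick one frame index per segment so the samples span the whole video.

    The number of returned indices always equals num_segments (sparse),
    and the segments tile the full frame range (global).
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    if rng is None:
        rng = random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers the half-open frame range [start, end).
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        if end <= start:
            # Assumption: when the video has fewer frames than segments,
            # reuse the nearest valid frame instead of failing.
            indices.append(min(start, num_frames - 1))
        elif training:
            # Random position inside the segment (assumed training behavior).
            indices.append(rng.randrange(start, end))
        else:
            # Deterministic segment center (assumed test-time behavior).
            indices.append((start + end - 1) // 2)
    return indices


# A 100-frame video and a 3000-frame video both yield exactly 4 indices.
short_video = sample_segment_indices(100, 4, training=False)
long_video = sample_segment_indices(3000, 4, training=False)
```

Because the output length depends only on `num_segments`, the cost of one forward pass stays the same for the 100-frame and the 3000-frame video.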

The sampling strategy therefore offers:

1. A fixed computational cost;
2. A guaranteed sampling cost.