## Abstract

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

## Section Overview

• Section 1: Introduction. Explains why action recognition in video is harder than still-image classification, and reviews prior approaches to video action classification: hand-crafted shallow representations (optical flow, trajectories) and earlier deep network methods
• Section 2: Two-stream architecture. Introduces the two-stream network architecture and its spatial stream
• Section 3: Introduces the temporal stream network
• Section 4: Introduces the multi-task learning framework, i.e. training a single network on multiple datasets with different label sets
• Section 5: Implementation details
• Section 6: Evaluation of the two-stream network

## Temporal Network

1. Optical flow stacking: dense optical flow between consecutive frames $$t$$ and $$t+1$$ can be viewed as a displacement vector field $$d_{t}$$, where $$d_{t}(u,v)$$ is the displacement vector at point $$(u,v)$$ of frame $$t$$. It has two components, $$d_{t}^{x}$$ and $$d_{t}^{y}$$, which can be treated as two image channels. The paper stacks the flow channels $$d_{t}^{x,y}$$ of $$L$$ consecutive frame pairs to obtain $$2L$$ input channels, so the input volume $$I_{\tau}\in \Re^{w\times h\times 2L}$$ for an arbitrary frame $$\tau$$ is given by:

$$I_{\tau}(u, v, 2k-1) = d_{\tau+k-1}^{x}(u, v),\quad I_{\tau}(u, v, 2k) = d_{\tau+k-1}^{y}(u, v),\qquad u = [1; w],\ v = [1; h],\ k = [1; L]$$

2. Trajectory stacking: replaces the flow sampled at a fixed location across frames with flow sampled along motion trajectories
3. Bi-directional optical flow: constructs both forward and backward optical flow fields
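The optical flow stacking above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the dense flow fields have already been computed (e.g. with an off-the-shelf optical flow method) and are supplied as per-frame arrays `flow_x[t]`, `flow_y[t]` of shape `(h, w)`, holding the horizontal and vertical displacement between frames `t` and `t+1`. The channel layout follows the equation for $$I_{\tau}$$, using the NumPy convention `(h, w, 2L)` for the array axes.

```python
import numpy as np

def stack_optical_flow(flow_x, flow_y, tau, L):
    """Build the 2L-channel input volume I_tau for the temporal stream.

    flow_x[t], flow_y[t]: (h, w) arrays with the x/y displacement
    components of the dense flow between frames t and t+1.
    Channel 2k-1 (1-based) holds d^x of frame tau+k-1,
    channel 2k holds d^y of the same frame, for k = 1..L.
    """
    h, w = flow_x[tau].shape
    volume = np.empty((h, w, 2 * L), dtype=np.float32)
    for k in range(1, L + 1):
        # 1-based channels 2k-1 and 2k -> 0-based indices 2k-2 and 2k-1
        volume[:, :, 2 * k - 2] = flow_x[tau + k - 1]
        volume[:, :, 2 * k - 1] = flow_y[tau + k - 1]
    return volume
```

With the paper's setting of $$L = 10$$, this yields a 20-channel input; trajectory stacking would differ only in sampling each $$d_{\tau+k-1}$$ at positions propagated along the flow rather than at the fixed point $$(u, v)$$.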