## Abstract

Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.

## Introduction

The main contributions are:

1. An SfM-based pipeline that mines hard-positive and hard-negative training examples; training shows that both kinds of examples benefit the retrieval task.
2. Whitening learned on the same training data improves performance, and applying it as a separate post-processing step proves more effective than learning it end-to-end.
3. A learnable pooling layer that improves retrieval performance without changing the descriptor dimensionality.
4. An $\alpha$-weighted query-expansion algorithm better suited to compact descriptors.
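The $\alpha$-weighted query expansion of the last item can be sketched roughly as follows: the expanded query is the sum of the query's top-$n$ neighbours, each weighted by its cosine similarity to the query raised to the power $\alpha$. This is a minimal NumPy sketch; the neighbour count `n` and the `alpha` default are illustrative, not the paper's exact settings.

```python
import numpy as np

def alpha_qe(q, db, n=5, alpha=3.0):
    """alpha-weighted query expansion over L2-normalized row descriptors.

    The new query aggregates the top-n neighbours of q, weighted by
    (cosine similarity)**alpha; alpha=0 recovers plain average QE.
    """
    sims = db @ q                          # cosine similarities to the query
    top = np.argsort(-sims)[:n]            # indices of the n nearest neighbours
    w = np.clip(sims[top], 0.0, None) ** alpha
    q_new = (w[:, None] * db[top]).sum(axis=0) + q
    return q_new / np.linalg.norm(q_new)   # re-normalize the expanded query

# toy usage on a random, L2-normalized database
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 8))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q_exp = alpha_qe(db[0], db, n=5)
```

Because the weights decay with similarity, distant (likely irrelevant) neighbours contribute little, which is why this variant tolerates larger $n$ than unweighted query expansion.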

## GeM

Let $X_{k}$ denote the set of activations of the $k$-th feature map of the last convolutional layer, $k \in \{1, \dots, K\}$. Max pooling, average pooling, and the proposed generalized-mean (GeM) pooling produce the descriptors:

$f^{(m)}=[f^{(m)}_{1}, ..., f^{(m)}_{k}, ..., f^{(m)}_{K}]^{T}, \ f^{(m)}_{k}=\max_{x\in X_{k}}x$

$f^{(a)}=[f^{(a)}_{1}, ..., f^{(a)}_{k}, ..., f^{(a)}_{K}]^{T}, \ f^{(a)}_{k}=\frac{1}{\left|X_{k} \right|}\sum_{x\in X_{k}}x$

$f^{(g)}=[f^{(g)}_{1}, ..., f^{(g)}_{k}, ..., f^{(g)}_{K}]^{T}, \ f^{(g)}_{k}=(\frac{1}{\left|X_{k} \right|}\sum_{x\in X_{k}}x^{p_{k}})^{\frac{1}{p_{k}}}$
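A minimal NumPy sketch of GeM pooling over a $K \times H \times W$ feature map; the shape convention and the clamping epsilon are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized-Mean pooling: fmap has shape (K, H, W), returns a
    K-dim descriptor. p=1 reduces to average pooling; as p grows the
    result approaches max pooling."""
    x = np.clip(fmap, eps, None)  # activations assumed non-negative (post-ReLU)
    return np.power(np.mean(np.power(x, p), axis=(1, 2)), 1.0 / p)

np.random.seed(0)
fmap = np.random.rand(4, 7, 7)
f_avg = gem_pool(fmap, p=1.0)     # equals average pooling
f_gem = gem_pool(fmap, p=3.0)
f_max = gem_pool(fmap, p=1000.0)  # close to max pooling
```

By the power-mean inequality the descriptor is nondecreasing in $p$, so GeM interpolates between the average-pooled and max-pooled extremes.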

Back-propagation through GeM only needs the derivatives of $f_{k}$ with respect to the activations and the pooling parameter $p_{k}$:

$\frac{\partial f_{k}}{\partial x_{i}}=\frac{1}{\left|X_{k}\right|} f_{k}^{1-p_{k}}x_{i}^{p_{k}-1}$

$\frac{\partial f_{k}}{\partial p_{k}}=\frac{f_{k}}{p_{k}^{2}}\left(\log \frac{\left|X_{k}\right|}{\sum_{x\in X_{k}}x^{p_{k}}}+p_{k}\frac{\sum_{x\in X_{k}}x^{p_{k}}\log x}{\sum_{x\in X_{k}}x^{p_{k}}}\right)$
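The two derivative formulas can be checked numerically against central differences; this is a verification sketch on a toy activation set, not code from the paper:

```python
import numpy as np

def gem(x, p):
    """GeM of a 1-D array of positive activations for one feature map."""
    return np.mean(x ** p) ** (1.0 / p)

x = np.array([0.2, 0.5, 0.9, 1.3])
p = 3.0
f = gem(x, p)
S = np.sum(x ** p)

# analytic derivatives, transcribing the formulas above
df_dx = (1.0 / x.size) * f ** (1.0 - p) * x ** (p - 1.0)
df_dp = (f / p ** 2) * (np.log(x.size / S)
                        + p * np.sum(x ** p * np.log(x)) / S)

# central-difference counterparts
h = 1e-6
num_dp = (gem(x, p + h) - gem(x, p - h)) / (2 * h)
e0 = np.eye(x.size)[0]
num_dx0 = (gem(x + h * e0, p) - gem(x - h * e0, p)) / (2 * h)
```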

## Siamese Learning

The network is trained in a siamese fashion with a contrastive loss over descriptor pairs $(i,j)$, where $Y(i,j)=1$ for matching and $Y(i,j)=0$ for non-matching pairs, $\tilde{f}$ is the L2-normalized descriptor, and $\tau$ is the margin:

$L(i,j)=\begin{cases} \frac{1}{2}\left\| \tilde{f}_{i} - \tilde{f}_{j} \right\|^{2} & \text{ if } Y(i,j)=1 \\ \frac{1}{2}\left(\max\{0, \tau-\left\| \tilde{f}_{i} - \tilde{f}_{j} \right\|\}\right)^{2} & \text{ if } Y(i,j)=0 \end{cases}$
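The loss above can be sketched directly in NumPy; the margin value used below is illustrative, not the paper's setting:

```python
import numpy as np

def contrastive_loss(fi, fj, match, tau=0.7):
    """Contrastive loss over one descriptor pair.
    match=1 (Y(i,j)=1): pull matching descriptors together;
    match=0: push non-matching descriptors apart, up to the margin tau."""
    fi = fi / np.linalg.norm(fi)   # tilde f: L2-normalized descriptors
    fj = fj / np.linalg.norm(fj)
    d = np.linalg.norm(fi - fj)
    if match == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, tau - d) ** 2

# identical pair: zero loss; orthogonal negative pair beyond the margin: zero loss
same = contrastive_loss(np.array([1.0, 0.0]), np.array([2.0, 0.0]), match=1)
far = contrastive_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]), match=0)
```

Note the normalization makes the loss depend only on descriptor direction, so scaled copies of the same descriptor incur no positive-pair penalty.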

## Experiments

### Pooling

Pooling variants compared:

1. A single parameter $p$ shared across all feature maps;
2. A separate $p_{k}$ per feature map;
3. $p$ either fixed or learned.

Findings:

1. Fixed and learned $p$ reach similar performance, with the learned setting slightly better;
2. The initial value of $p$ affects the final result.

## Summary

1. An automatic SfM-based data-cleaning pipeline;
2. The generalized-mean pooling layer (GeM);
3. CNN descriptor whitening;
4. The $\alpha$-weighted query-expansion algorithm ($\alpha$QE).