Domain-Size Pooling in Local Descriptors and Network Architectures

Abstract

We introduce a simple modification of local image descriptors, such as SIFT, that improves matching performance by 43.09% on the Oxford image matching benchmark and is implementable in a few lines of code. To put this in perspective, it is more than half of the improvement that SIFT provides over raw image intensities on the same datasets. The trick consists of pooling gradient orientations across different domain sizes, in addition to spatial locations, and yields a descriptor of the same dimension as the original, which we call DSP-SIFT. Domain-size pooling also enables DSP-SIFT to outperform by 28.29% a convolutional neural network that in turn has recently been reported to outperform ordinary SIFT by 11.54%. This is despite the network being trained on millions of images and outputting a descriptor of considerably larger size. Domain-size pooling is counter-intuitive and contrary to the practice of scale selection as taught in scale-space theory, but has solid roots in classical sampling theory.

Highlights

[Figure: fig-dsp-sift-logo]

In SIFT (top), isolated scales are selected (a) and the descriptor is constructed from the image at the selected scale (b) by computing gradient orientations (c) and pooling them in spatial neighborhoods (d), yielding histograms that are normalized and concatenated to form the descriptor (e). In DSP-SIFT (bottom), pooling occurs across different domain sizes (a): patches of different sizes are rescaled (b), gradient orientations are computed (c) and pooled across locations and scales (d), and concatenated, yielding a descriptor (e) of the same dimension as ordinary SIFT.
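The pipeline above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it uses a nearest-neighbor rescale, a bare-bones 4x4x8 gradient-orientation histogram in place of a full SIFT descriptor, and a hypothetical `dsp_descriptor` entry point; the scale set and patch size are illustrative choices.

```python
import numpy as np

def resize_nn(patch, size):
    # Nearest-neighbor rescale to (size, size); stands in for proper interpolation.
    h, w = patch.shape
    ys = (np.arange(size) * h / size).astype(int)
    xs = (np.arange(size) * w / size).astype(int)
    return patch[np.ix_(ys, xs)]

def grad_orientation_hist(patch, grid=4, bins=8):
    # SIFT-like step: gradient-orientation histograms on a grid x grid array of cells.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    n = patch.shape[0] // grid
    hist = np.zeros((grid, grid, bins))
    for i in range(grid):
        for j in range(grid):
            m = mag[i * n:(i + 1) * n, j * n:(j + 1) * n].ravel()
            a = ang[i * n:(i + 1) * n, j * n:(j + 1) * n].ravel()
            b = np.minimum((a / (2 * np.pi) * bins).astype(int), bins - 1)
            np.add.at(hist[i, j], b, m)  # magnitude-weighted orientation votes
    return hist.ravel()

def dsp_descriptor(image, center, base_radius, scales=(0.75, 1.0, 1.25), size=16):
    # Domain-size pooling: crop patches of several sizes around the same point,
    # rescale each to a canonical size, and average the resulting histograms.
    cy, cx = center
    acc = np.zeros(4 * 4 * 8)
    for s in scales:
        r = int(round(base_radius * s))
        crop = image[cy - r:cy + r, cx - r:cx + r]
        acc += grad_orientation_hist(resize_nn(crop, size))
    acc /= len(scales)
    return acc / (np.linalg.norm(acc) + 1e-12)  # same 128-dim length as SIFT
```

Note that the output has the same dimension as an ordinary SIFT descriptor: averaging across domain sizes happens before normalization, so no extra storage is needed.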


[Figures: fig-dspsift-sift-miko, fig-dspsift-sift-fisc]

Each point represents one pair of images in the Oxford and Fischer datasets. The coordinates indicate average precision for each of the two methods under comparison. DSP-SIFT outperforms SIFT by a wide margin. The relative performance improvement of the winner is shown in the title of each panel.

Comparison with Convolutional Neural Networks

Convolutional neural networks have recently been shown to improve image matching by extracting local representations from higher layers of the network responses (Fischer et al., 2014). However, this comes at the cost of CNN features having a much higher dimension and the network itself being trained on millions of images. We show that with domain-size pooling, DSP-SIFT outperforms CNN features by a wide margin at a fraction of the computational and storage cost.

[Figures: fig-dspsift-sift-miko, fig-dspsift-sift-fisc]

Again, each point represents one pair of images in the Oxford and Fischer datasets. The coordinates indicate average precision for each of the two methods under comparison. DSP-SIFT outperforms the CNN by 21% on the Oxford image matching dataset and by more than 5% on the Fischer dataset. The relative performance improvement of the winner is shown in the title of each panel.

Extensions

Domain-size pooling can be applied to other descriptors as well as to network architectures. A simple modification of the scattering transform network gives DSP-SC, which outperforms the original scattering representation by a margin of 19.54% (mean average precision) on the standard Oxford dataset. DSP-CNN, which performs domain-size pooling in the intermediate layers of a convolutional neural network, also achieves better performance on challenging object classification tasks. More details can be found in Reference 2.
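The same idea carries over to a convolutional layer. The sketch below is an assumption-laden illustration of DSP-CNN-style pooling, not the architecture from Reference 2: a single filter response is computed on several rescalings of the input (nearest-neighbor resampling here for simplicity) and the responses are averaged on a common grid.

```python
import numpy as np

def resize_nn(x, shape):
    # Nearest-neighbor resample to an arbitrary (rows, cols) shape.
    ys = (np.arange(shape[0]) * x.shape[0] / shape[0]).astype(int)
    xs = (np.arange(shape[1]) * x.shape[1] / shape[1]).astype(int)
    return x[np.ix_(ys, xs)]

def conv_valid(x, k):
    # Naive 2-D cross-correlation with 'valid' padding, as in CNN conv layers.
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def dsp_conv(x, k, scales=(0.75, 1.0, 1.25)):
    # Domain-size pooling for one conv filter (sketch): apply the same filter
    # to rescaled copies of the input and average the responses after mapping
    # them back to the reference-scale grid. The scale set is illustrative.
    ref = conv_valid(x, k)
    acc = np.zeros_like(ref)
    for s in scales:
        xs = resize_nn(x, (int(x.shape[0] * s), int(x.shape[1] * s)))
        acc += resize_nn(conv_valid(xs, k), ref.shape)
    return acc / len(scales)
```

The output has the same spatial size as the ordinary response at the reference scale, so such a layer can replace a standard convolution without changing the rest of the network.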

Code

Download dsp_toolbox_v0.0.2

References

  1. J. Dong and S. Soatto. Domain-Size Pooling in Local Descriptors: DSP-SIFT. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [pdf][extended][slides]

  2. S. Soatto, J. Dong and N. Karianakis. Visual Scene Representations: Contrast, Scaling and Occlusion. In International Conference on Learning Representations (ICLR) Workshop, 2015. [pdf]