Domain-Size Pooling in Local Descriptors and Network Architectures
We introduce a simple modification of local image descriptors, such as SIFT, that improves matching performance by 43.09% on the Oxford image matching benchmark and can be implemented in a few lines of code. To put this in perspective, it is more than half of the improvement that SIFT provides over raw image intensities on the same data. The trick consists of pooling gradient orientations across different domain sizes, in addition to spatial locations, and yields a descriptor of the same dimension as the original, which we call DSP-SIFT. Domain-size pooling causes DSP-SIFT to outperform a convolutional neural network by 28.29%, which in turn has recently been reported to outperform ordinary SIFT by 11.54%. This is despite the network being trained on millions of images and outputting a descriptor of considerably larger size. Domain-size pooling is counter-intuitive and contrary to the practice of scale selection as taught in scale-space theory, but it has solid roots in classical sampling theory.
In SIFT (top), isolated scales are selected (a) and the descriptor is constructed from the image at the selected scale (b) by computing gradient orientations (c) and pooling them in spatial neighborhoods (d), yielding histograms that are normalized and concatenated to form the descriptor (e). In DSP-SIFT (bottom), pooling occurs across different domain sizes (a): patches of different sizes are re-scaled (b), gradient orientations are computed (c) and pooled across locations and scales (d), and then concatenated, yielding a descriptor (e) of the same dimension as ordinary SIFT.
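The pipeline above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the authors' implementation: it uses a single orientation histogram per patch (omitting SIFT's spatial cell grid and the resampling of each crop to a canonical size), and the function names and scale set are our own choices. The key point it demonstrates is that pooling histograms across crops of several sizes leaves the descriptor dimension unchanged.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Gradient-orientation histogram of a 2-D patch, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)                 # orientations in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def dsp_descriptor(image, center, base_radius, scales=(0.75, 1.0, 1.25), n_bins=8):
    """Pool orientation histograms across several domain sizes (crop radii)
    around one keypoint, then normalize. The output dimension is n_bins
    regardless of how many domain sizes are pooled."""
    cy, cx = center
    acc = np.zeros(n_bins)
    for s in scales:
        r = max(1, int(round(base_radius * s)))
        crop = image[cy - r:cy + r + 1, cx - r:cx + r + 1]
        acc += orientation_histogram(crop, n_bins)         # pool across sizes
    return acc / (np.linalg.norm(acc) + 1e-12)             # normalize like SIFT
```

Setting `scales=(1.0,)` recovers the single-size histogram, so the ordinary descriptor is the degenerate case of this sketch.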
Each point represents one pair of images in the Oxford and Fischer datasets. The coordinates indicate average precision for each of the two methods under comparison. DSP-SIFT outperforms SIFT by a wide margin. The relative performance improvement of the winner is shown in the title of each panel.
Comparison with Convolutional Neural Networks
Convolutional neural networks have recently been shown to improve image matching by extracting local representations from higher layers of the network responses (Fischer et al. 2014). However, this comes at the cost of CNN features having a much higher dimension and the network being trained on millions of images. We show that, with domain-size pooling, DSP-SIFT outperforms CNN features by a wide margin at a fraction of the computational and storage cost.
Again, each point represents one pair of images in the Oxford and Fischer datasets. The coordinates indicate average precision for each of the two methods under comparison. DSP-SIFT outperforms the CNN by 21% on the Oxford image matching dataset and by more than 5% on the Fischer dataset. The relative performance improvement of the winner is shown in the title of each panel.
Domain-size pooling can be applied to other descriptors as well as to network architectures. A simple modification of the Scattering Transform Network gives DSP-SC, which outperforms the original scattering representation by a margin of 19.54% (mean average precision) on the standard Oxford dataset. DSP-CNN, which performs domain-size pooling in the intermediate layers of a convolutional neural network, also achieves better performance on challenging object classification tasks. More details in Reference 2.
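The same idea carries over to intermediate feature maps: responses computed at several input sizes can be resampled to a common spatial size and averaged. The sketch below is a minimal, framework-free illustration of that pooling step under our own assumptions (nearest-neighbor resampling, plain averaging); how the feature maps themselves are produced is left to the surrounding network.

```python
import numpy as np

def resize_nn(fmap, out_h, out_w):
    """Nearest-neighbor resampling of a 2-D feature map to (out_h, out_w)."""
    h, w = fmap.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return fmap[np.ix_(ys, xs)]

def dsp_pool(feature_maps, out_shape=(16, 16)):
    """Domain-size pooling of feature maps computed at several input sizes:
    resample each map to a common spatial size, then average."""
    maps = [resize_nn(f, *out_shape) for f in feature_maps]
    return np.mean(maps, axis=0)
```

As with the descriptor, the pooled output has a fixed shape no matter how many domain sizes contribute, so it slots into the rest of the network without changing downstream dimensions.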
S. Soatto, J. Dong and N. Karianakis. Visual Scene Representations: Contrast, Scaling and Occlusion. The International Conference on Learning Representations (ICLR) Workshop, 2015. [pdf]