Connecting Images through Sources: Exploring Low-Data, Heterogeneous Instance Retrieval
Abstract
1. Introduction
- We present the alegoria dataset, a new benchmark made available to the community, highlighting a panel of variations commonly found in cultural data through collections of vertical aerial imagery, oblique imagery, and street-level imagery from various sources.
- We propose new indicators for measuring cross-domain performance in image retrieval.
- We review available methods in the literature with the goal of identifying promising methods to solve our challenge. We reimplement and train two of these methods.
- We evaluate a panel of state-of-the-art methods on the alegoria benchmark through three evaluation setups, and further compare their performance against known variations.
- Motivated by the uneven performance of descriptors depending on image characteristics, we propose a new multi-descriptor diffusion method, with a variation allowing the balance between inter- and intra-domain performance to be fine-tuned.
2. Materials and Methods
2.1. Alegoria Benchmark
2.1.1. Presentation
- Henrard (NN): Oblique aerial imagery in the Paris region between 1930 and 1970,
- Lapie (AN and NN): Oblique (low and medium altitude) aerial imagery between 1955 and 1965,
- Combier (NN): Ground-level and oblique aerial imagery between 1949 and 1974,
- MRU (AN): Oblique aerial wide area imagery between 1948 and 1970,
- Photothèque (IGN): Oblique and vertical imagery between 1920 and 2010.
- Scale (what portion of the image does the object occupy?): Very close/Close/Midrange/Far
- Illumination: Under-illuminated/Well-illuminated/Over-illuminated
- Vertical orientation: Ground level (or street-view)/Oblique/Vertical
- Level of occlusion (is the object hidden behind other objects?): No occlusion/Partially hidden/Hidden
- Alterations (is the image degraded?): No alteration/Mildly degraded/Degraded
- Color: Color/Grayscale/Monochrome (e.g., sepia)
2.1.2. Cultural Heritage Datasets
2.1.3. Evaluation Protocol
- Absolute retrieval performance: the average quality of the result lists obtained when using all annotated images as queries, regardless of domain considerations.
- Intra-domain or attribute-specific performance: the retrieval performance obtained when using a subset of annotated images from a specific collection or with a specific attribute value. This allows a finer comparison along different representation domains and characteristics.
- Inter-domain performance: the ability to retrieve images outside of the query domain.
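To make these three indicators concrete, the sketch below is our own minimal illustration, not the authors' evaluation code; it computes mAP over a chosen query subset and a simple count of relevant cross-domain results, assuming L2-normalised descriptors, class labels, and source-domain labels stored as NumPy arrays. The paper's specific inter-domain indicators (mP1, qP1, mAPD) are not reproduced here.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given a boolean relevance vector in ranked order."""
    hits = np.where(relevance)[0]
    if hits.size == 0:
        return 0.0
    precisions = (np.arange(hits.size) + 1) / (hits + 1)
    return precisions.mean()

def mean_ap(desc, labels, query_idx):
    """mAP over the given queries, searching the whole annotated + distractor set."""
    aps = []
    for q in query_idx:
        scores = desc @ desc[q]          # cosine similarity (descriptors are L2-normalised)
        order = np.argsort(-scores)
        order = order[order != q]        # drop the query from its own ranking
        aps.append(average_precision(labels[order] == labels[q]))
    return float(np.mean(aps))

def cross_domain_hits(desc, labels, domains, q, k=10):
    """Among the top-k results for query q, count relevant images from other domains."""
    scores = desc @ desc[q]
    order = np.argsort(-scores)
    order = order[order != q][:k]
    relevant = labels[order] == labels[q]
    return int(np.sum(relevant & (domains[order] != domains[q])))

# Absolute performance: every annotated image is used as a query.
# absolute_map = mean_ap(desc, labels, np.arange(num_annotated))
# Intra-domain / attribute-specific performance: queries restricted to one collection.
# intra_mru = mean_ap(desc, labels, np.where(domains == "MRU")[0])
```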
2.2. Retrieval Framework
2.3. The Low-Data Problem
2.3.1. Supervised Learning or Transfer Learning: Fine-Tuning on Related Datasets
2.3.2. Few-Shot Learning, Meta-Learning: A Promising Path?
- To our knowledge, there has been no work applying the principles of few-shot learning to image retrieval (rather, few-shot learning has borrowed ideas from image retrieval [45]). The situation is indeed tricky in the sense that image retrieval does not use class information (except when evaluating), so defining a support set is not trivial.
- Most existing works have been tested on simplistic datasets with low resolution, few problematic variations, and semantically easy classes, e.g., the standard Meta-Dataset [44] grouping ImageNet, Omniglot (character recognition) [46], VGG Flowers (flower recognition) [47], Traffic Signs (traffic sign recognition) [48], and other relatively simple datasets.
- Due to the computational overhead induced by meta-learning architectures, small feature extractors such as ResNet-18, or its even smaller variant ResNet-12, are used [49] to avoid memory limitations. This naturally also limits final performance.
- Simple baselines obtain better results than complex meta-algorithms [49] in some setups, which indicates that there is still room for improvement.
2.3.3. Unsupervised and Self-Supervised Learning
- Teaching a model how to solve Jigsaw puzzles [50] generated by selecting patches of an image, to automatically learn the important parts of an object. This is an example of a pretext task (solving puzzles) making the model learn semantically important features (shapes, relative spatial positions) that can be reused for a downstream task (in our case retrieval).
- Learning image generation models with the Generative Adversarial Networks architecture [51]. In this setup, a generative model competes with a discriminative model. The generative model tries to fool the discriminator by producing realistic fake images, while the discriminator tries to distinguish fake images from real images. Here, the discriminator provides a form of automated supervision to the generator, using only pixel data from a database of images. By reconstructing realistic images, the generative model is forced to get a visual understanding of the object. Applications to discriminative tasks have shown that the learned features do contain discriminative information [52], but for now only on the very basic MNIST [53] dataset.
- Leveraging data augmentation techniques to learn visual patterns without labels. Recall the tuplet losses presented in Section 2.3.1: the positive pairs that we pull together (while pushing negative pairs apart) can be generated automatically from a single image by applying simple transformations such as cropping, distortion, or blurring. Chen et al. [54] achieved results comparable to early supervised models on ImageNet classification with this framework; a minimal sketch follows this list.
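The sketch below illustrates the third strategy only: generating positive pairs from a single image and contrasting them against the rest of the batch. It assumes PyTorch and torchvision; the augmentation choices and hyper-parameters are illustrative and do not reproduce the exact recipe of Chen et al. [54].

```python
import torch
from torchvision import transforms

# Two random "views" of the same image form a positive pair; views of
# different images act as negatives. Transform choices here are illustrative.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def make_pair(pil_image):
    """Return two augmented views of one image (a self-supervised positive pair)."""
    return augment(pil_image), augment(pil_image)

def nt_xent(z1, z2, temperature=0.5):
    """Normalised temperature-scaled cross-entropy over a batch of view pairs."""
    z = torch.cat([z1, z2])                          # (2N, d) embeddings of both views
    z = torch.nn.functional.normalize(z, dim=1)
    sim = z @ z.t() / temperature                    # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                # a view is not its own positive
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return torch.nn.functional.cross_entropy(sim, targets)
```

Each image in a batch contributes two views; the loss treats the paired view as the positive and every other view in the batch as a negative, so no class labels are needed.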
2.4. The Heterogeneity Problem
2.4.1. Reducing Variance
2.4.2. Robust Feature Learning
2.5. Contributions and Experiments
2.5.1. Fast Adaptation
- A: Use one random image from the alegoria benchmark as support to compute adapted parameters, then extract all descriptors with the same adapted parameters.
- B: Same as A, but with five random images.
- C: For each image, we first compute adapted parameters using the image itself, then extract its descriptor. Descriptors of different images are thus extracted with different parameters.
- D: We first issue an unadapted image search with the backbone extractor. Using the first retrieved image as the support set, we compute adapted parameters (different for each image), extract descriptors, and conduct another search.
- E: For comparison, we also include an experiment with noise as support: we compute adapted parameters with random values sampled from the normal distribution as input, and extract descriptors with these parameters (the same for all images). A minimal sketch of setups A, C, and E follows this list.
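The setups differ only in which support set conditions the adaptation. The sketch below assumes a CNAPS-style extractor with hypothetical methods adapt(support) (returning conditioning parameters, e.g., FiLM scales and shifts) and extract(image, params) (returning a descriptor), and a list of CHW image tensors; these names are ours for illustration and do not correspond to the rCNAPS code.

```python
import torch

def extract_setup_a(model, images, support_size=1):
    """Setups A/B: adapt once on a few random benchmark images, reuse the parameters."""
    idx = torch.randperm(len(images))[:support_size]
    support = torch.stack([images[int(i)] for i in idx])
    params = model.adapt(support)                        # one set of adapted parameters
    return [model.extract(img.unsqueeze(0), params) for img in images]

def extract_setup_c(model, images):
    """Setup C: each image serves as its own support set (per-image parameters)."""
    descriptors = []
    for img in images:
        params = model.adapt(img.unsqueeze(0))           # adapt on the image itself
        descriptors.append(model.extract(img.unsqueeze(0), params))
    return descriptors

def extract_setup_e(model, images, image_shape=(3, 224, 224)):
    """Setup E (control): adapt on Gaussian noise instead of real images."""
    params = model.adapt(torch.randn(1, *image_shape))
    return [model.extract(img.unsqueeze(0), params) for img in images]
```

Setup D follows the same pattern, except that the support set for each query is the first result of an unadapted search.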
2.5.2. Multi-Descriptor Diffusion: Balancing Intra- and Inter-Domain Performance
3. Evaluations and Results
3.1. rCNAPS Experiments
3.2. Preprocessing
4. Discussion
4.1. Supervision Axis
4.2. Intra-Domain Performance Disparities
4.3. Diffusion
4.4. Heterogeneity
4.5. Influence of the Training Dataset
4.6. Computational Complexity
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Class Name | Urban/Natural | Class Definition | Class Type |
---|---|---|---|
amiens | urban | tower | object |
annecy | semi urban | lake mouth | zone |
arc de triomphe | urban | monument | object |
basilique sacre coeur | urban | church | object |
biarritz | semi urban | hotel, beach | zone |
amiral bruix boulevard | urban | crossroad | zone |
bourg en bresse | semi urban | factory | object |
brest | urban | port | zone |
fourviere cathedral | urban | church | object |
reims cathedral | urban | church | object |
saint etienne de toul cathedral | urban | church | object |
deauville international center | urban | hotel | object |
charlevilles mezieres | urban | square | zone |
chantilly castle | semi urban | castle | object |
palace of versailles | urban | castle | object |
choux creteil | urban | tower | object |
cite internationale lyon | urban | neighborhood | zone |
foix | semi urban | castle | object |
gare du nord paris | urban | train station | object |
gare est paris | urban | train station | object |
gare perrache lyon | urban | train station | object |
grenoble | urban | river | zone |
guethary | natural | hotel | object |
saint laurent hospital chalon | urban | hotel | object |
nantes island | urban | neighborhood | zone |
invalides | urban | hotel | object |
issy moulineaux | urban | bridge | object |
la madeleine paris | urban | monument | object |
le havre | urban | tower | object |
lery seyne sur mer | semi urban | church | object |
macon | urban | bridge | object |
mairie lille | urban | tower | object |
chasseneuil memorial | natural | monument | object |
mont blanc | natural | mountain | object |
mont saint michel | natural | neighborhood | zone |
neuilly sur seine | urban | neighborhood | zone |
notre dame de lorette | natural | church | object |
notre dame garde | urban | church | object |
notre dame paris | urban | church | object |
pantheon paris | urban | monument | object |
picpus | urban | neighborhood | zone |
place bourse bordeaux | urban | square | zone
place marche clichy | urban | square | zone |
bouc harbour | semi urban | harbor | zone |
porte pantin | urban | neighborhood | zone |
porte saint denis | urban | monument | object |
aubepins neighborhood | urban | neighborhood | zone |
reims racetrack | urban | neighborhood | zone |
riom | urban | neighborhood (town) | zone
saint claude | semi urban | church | object |
gerland stadium | urban | monument | object |
st tropez | semi urban | neighborhood (town) | zone |
toulon | urban | neighborhood | zone |
eiffel tower | urban | tower | object |
tours | urban | neighborhood | zone |
aillaud towers nanterre | urban | tower | object |
vannes | urban | neighborhood | zone |
villa monceau | urban | neighborhood | zone |
References
- Pion, N.; Humenberger, M.; Csurka, G.; Cabon, Y.; Sattler, T. Benchmarking Image Retrieval for Visual Localization. In Proceedings of the International Conference on 3D Vision, Fukuoka, Japan, 25–28 November 2020.
- Mustafa, A.; Kim, H.; Guillemaut, J.Y.; Hilton, A. Temporally coherent 4D reconstruction of complex dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Doulamis, A.; Doulamis, N.; Ioannidis, C.; Chrysouli, C.; Nikos, G.; Dimitropoulos, K.; Potsiou, C.; Stathopoulou, E.; Ioannides, M. 5D Modelling: An Efficient Approach for Creating Spatiotemporal Predictive 3D Maps of Large-Scale Cultural Resources. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, II-5/W3, 61–68.
- Palermo, F.; Hays, J.; Efros, A.A. Dating Historical Color Images. In European Conference on Computer Vision, Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; pp. 499–512.
- Blettery, E.; Lecat, P.; Devaux, A.; Gouet-Brunet, V.; Saly-Giocanti, F.; Brédif, M.; Delavoipiere, L.; Conord, S.; Moret, F. A Spatio-temporal Web-application for the Understanding of the Formation of the Parisian Metropolis. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, VI-4/W1-20, 45–52.
- Doulamis, A.; Doulamis, N.; Protopapadakis, E.; Voulodimos, A.; Ioannides, M. 4D Modelling in Cultural Heritage. In Advances in Digital Cultural Heritage; Ioannides, M., Martins, J., Žarnić, R., Lim, V., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; pp. 174–196.
- Schindler, G.; Dellaert, F. 4D Cities: Analyzing, Visualizing, and Interacting with Historical Urban Photo Collections. J. Multimed. 2012, 7, 7.
- Gominski, D.; Poreba, M.; Gouet-Brunet, V.; Chen, L. Challenging deep image descriptors for retrieval in heterogeneous iconographic collections. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia HeritAge Contents-SUMAC’19, Nice, France, 21–25 October 2019; pp. 31–38.
- Fiorucci, M.; Khoroshiltseva, M.; Pontil, M.; Traviglia, A.; Del Bue, A.; James, S. Machine Learning for Cultural Heritage: A Survey. Pattern Recognit. Lett. 2020, 133, 102–108.
- Kramer, I. Available online: https://github.com/ickramer/Arran (accessed on 9 January 2020).
- Ratajczak, R.; Crispim-Junior, C.F.; Faure, E.; Fervers, B.; Tougne, L. Automatic Land Cover Reconstruction From Historical Aerial Images: An Evaluation of Features Extraction and Classification Algorithms. IEEE Trans. Image Process. 2019, 28, 3357–3371.
- Fernando, B.; Tommasi, T.; Tuytelaars, T. Location recognition over large time lags. Comput. Vis. Image Underst. 2015, 139, 21–28.
- Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
- Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
- Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3476–3485.
- Brown, A.; Xie, W.; Kalogeiton, V.; Zisserman, A. Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Radenovic, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5706–5715.
- Cao, B.; Araujo, A.; Sim, J. Unifying Deep Local and Global Features for Efficient Image Search. arXiv 2020, arXiv:2001.05027.
- Babenko, A.; Lempitsky, V.S. Aggregating Local Deep Features for Image Retrieval. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1269–1277.
- Tolias, G.; Sicre, R.; Jégou, H. Particular Object Retrieval with Integral Max-Pooling of CNN Activations. In Proceedings of the ICLR 2016—International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–12.
- Radenović, F.; Tolias, G.; Chum, O. Fine-Tuning CNN Image Retrieval with No Human Annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1655–1668.
- Kalantidis, Y.; Mellina, C.; Osindero, S. Cross-Dimensional Weighting for Aggregated Deep Convolutional Features. In European Conference on Computer Vision, Proceedings of the ECCV 2016: Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; Hua, G., Jégou, H., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; pp. 685–701.
- Ng, T.; Balntas, V.; Tian, Y.; Mikolajczyk, K. SOLAR: Second-Order Loss and Attention for Image Retrieval. arXiv 2020, arXiv:2001.08972.
- Tolias, G.; Jenicek, T.; Chum, O. Learning and Aggregating Deep Local Descriptors for Instance-Level Recognition. In European Conference on Computer Vision, Proceedings of the Computer Vision—ECCV 2020, 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 460–477.
- Chum, O.; Philbin, J.; Sivic, J.; Isard, M.; Zisserman, A. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In Proceedings of the IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
- Tolias, G.; Avrithis, Y.; Jégou, H. Image Search with Selective Match Kernels: Aggregation Across Single and Multiple Images. Int. J. Comput. Vis. 2016, 116, 247–261.
- Zhang, X.; Jiang, M.; Zheng, Z.; Tan, X.; Ding, E.; Yang, Y. Understanding Image Retrieval Re-Ranking: A Graph Neural Network Perspective. arXiv 2020, arXiv:2012.07620.
- Donoser, M.; Bischof, H. Diffusion Processes for Retrieval Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1320–1327, ISSN 1063-6919.
- Yang, F.; Hinami, R.; Matsui, Y.; Ly, S.; Satoh, S. Efficient Image Retrieval via Decoupling Diffusion into Online and Offline Processing. Proc. AAAI Conf. Artif. Intell. 2019, 33, 9087–9094.
- Iscen, A.; Tolias, G.; Avrithis, Y.; Furon, T.; Chum, O. Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 926–935, ISSN 1063-6919.
- Zhong, Z.; Zheng, L.; Cao, D.; Li, S. Re-ranking Person Re-identification with k-Reciprocal Encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3652–3661, ISSN 1063-6919.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
- Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google Landmarks Dataset v2—A Large-Scale Benchmark for Instance-Level Recognition and Retrieval. arXiv 2020, arXiv:2004.01804.
- Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. End-to-End Learning of Deep Visual Representations for Image Retrieval. Int. J. Comput. Vis. 2017, 124, 237–254.
- Scott, G.J.; England, M.R.; Starms, W.A.; Marcum, R.A.; Davis, C.H. Training Deep Convolutional Neural Networks for Land-Cover Classification of High-Resolution Imagery. IEEE Geosci. Remote Sens. Lett. 2017, 14, 549–553.
- Gominski, D.; Gouet-Brunet, V.; Chen, L. Unifying Remote Sensing Image Retrieval and Classification with Robust Fine-tuning. arXiv 2021, arXiv:2102.13392.
- Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020.
- Babenko, A.; Slesarev, A.; Chigorin, A.; Lempitsky, V. Neural Codes for Image Retrieval. In European Conference on Computer Vision, Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; Volume 8689.
- Revaud, J.; Almazán, J.; Rezende, R.S.; Souza, C.R.d. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5107–5116.
- Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv. 2020, 53, 63:1–63:34.
- Requeima, J.; Gordon, J.; Bronskill, J.; Nowozin, S.; Turner, R.E. Fast and Flexible Multi-Task Classification using Conditional Neural Adaptive Processes. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., de Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 7957–7968.
- Bateni, P.; Goyal, R.; Masrani, V.; Wood, F.; Sigal, L. Improved Few-Shot Visual Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 14481–14490.
- Perez, E.; Strub, F.; Vries, H.d.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.J.; Manzagol, P.A.; et al. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. arXiv 2019, arXiv:1903.03096.
- Triantafillou, E.; Zemel, R.; Urtasun, R. Few-Shot Learning through an Information Retrieval Lens. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 2252–2262.
- Lake, B.M.; Salakhutdinov, R.; Tenenbaum, J.B. Human-level concept learning through probabilistic program induction. Science 2015, 350, 1332–1338.
- Nilsback, M.E.; Zisserman, A. Automated Flower Classification over a Large Number of Classes. In Proceedings of the Sixth Indian Conference on Computer Vision, Graphics Image Processing, Washington, DC, USA, 16–19 December 2008; pp. 722–729.
- Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark. In Proceedings of the International Joint Conference on Neural Networks, Dallas, TX, USA, 4–9 August 2013.
- Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A Baseline for Few-Shot Image Classification. arXiv 2019, arXiv:1909.02729.
- Noroozi, M.; Favaro, P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In European Conference on Computer Vision, Proceedings of the ECCV 2016: Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016.
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, NIPS’14, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
- Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial Feature Learning. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
- Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 1597–1607.
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
- Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
- Sandfort, V.; Yan, K.; Pickhardt, P.J.; Summers, R.M. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci. Rep. 2019, 9, 16884.
- Ma, D.; Tang, P.; Zhao, L. SiftingGAN: Generating and Sifting Labeled Samples to Improve the Remote Sensing Image Scene Classification Baseline In Vitro. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1046–1050.
- Mescheder, L.M.; Geiger, A.; Nowozin, S. Which Training Methods for GANs do actually Converge? In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Marriott, R.T.; Madiouni, S.; Romdhani, S.; Gentric, S.; Chen, L. An Assessment of GANs for Identity-related Applications. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–10, ISSN 2474-9699.
- Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded Up Robust Features. In European Conference on Computer Vision, Proceedings of the ECCV 2006: Computer Vision—ECCV, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; p. 14.
- Verdie, Y.; Yi, K.M.; Fua, P.; Lepetit, V. TILDE: A Temporally Invariant Learned DEtector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5279–5288.
- Yi, K.M.; Trulls, E.; Ono, Y.; Lepetit, V.; Salzmann, M.; Fua, P. Learning to Find Good Correspondences. arXiv 2017, arXiv:1711.05971.
- Hu, S.; Feng, M.; Nguyen, R.M.H.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Gool, L. Night-to-Day Image Translation for Retrieval-based Localization. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964.
- Lin, L.; Wang, G.; Zuo, W.; Feng, X.; Zhang, L. Cross-Domain Visual Matching via Generalized Similarity Measure and Feature Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1089–1102.
- Jegou, H.; Douze, M.; Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 117–128.
Item | Value |
---|---|
Number of classes | 58 |
Min number of images per class | 10 |
Max number of images per class | 119 |
Mean number of images per class | 31 |
Median number of images per class | 25 |
Image file format | .jpg |
Image dimension (width × height) | 800 px × variable |
Number of annotated images | 1858 |
of which from Henrard | 99 |
of which from Lapie | 40 |
of which from Combier | 7 |
of which from MRU | 299 |
of which from Photothèque | 711 |
of which from Internet | 702 |
Number of distractors | 11,316 |
of which from Henrard | 935 |
of which from Lapie | 4508 |
of which from Combier | 1969 |
of which from MRU | 2193 |
of which from Photothèque | 1260 |
of which from Internet | 451 |
Training Dataset | Semantics | Number of Classes | Size |
---|---|---|---|
ImageNet [32] | Generalist | 1000 | 1.3 M |
GoogleLandmarks [15] | Landmarks | 81 k | 1.4 M |
SF300 [36] | Remote Sensing | 27 k | 308 k |
University1652 [37] | Cross-view buildings | 701 | 50 k |
SfM-120k [21] | Landmarks | 551 | 120 k |
Landmarks [38] | Landmarks | 672 | 214 k |
Absolute performance (mAP) is reported on the full alegoria benchmark; intra-domain performance (mAP) per collection (MRU, Lapie, Photothèque, Internet, Henrard); inter-domain performance via the mP1, qP1, and mAPD indicators (– = no reranking).

Method | Training Dataset | Reranking | alegoria | MRU | Lapie | Photothèque | Internet | Henrard | mP1 | qP1 | mAPD
---|---|---|---|---|---|---|---|---|---|---|---
Unsupervised | | | | | | | | | | |
SimCLR | alegoria distractors | – | 5.77 | 10.22 | 7.89 | 5.01 | 4.29 | 7.66 | 134 | 40 | 176.8
Semi-supervised | | | | | | | | | | |
rCNAPS | GoogleLandmarks | – | 8.47 | 11.02 | 2.91 | 8.2 | 7.82 | 9.88 | 101 | 24 | 237.1
Supervised | | | | | | | | | | |
*GeM | ImageNet | – | 14.25 | 23.63 | 19.92 | 11.28 | 12.98 | 13.93 | 152 | 23 | 290.9
GeM-ArcFace | GoogleLandmarks | – | 24.30 | 27.16 | 29.43 | 13.45 | 34.10 | 20.33 | 67 | 10 | 251.0
GeM-ArcFace | SF300 | – | 14.41 | 26.27 | 17.76 | 14.44 | 9.33 | 13.32 | 99 | 20 | 256.5
*GeM-Triplet | Univ1652 | – | 10.02 | 20.33 | 15.37 | 8.98 | 6.45 | 10.17 | 100 | 25 | 239.1
*GeM-Contrastive | SfM120k | – | 19.02 | 26.39 | 19.70 | 14.59 | 20.57 | 16.83 | 107 | 18 | 281.1
*RMAC-APLoss | Landmarks | – | 19.97 | 24.82 | 20.25 | 13.96 | 24.38 | 15.49 | 79 | 13 | 261.5
*HOW-Contrastive | GoogleLandmarks | – | 19.16 | 28.11 | 19.53 | 10.20 | 24.45 | 17.55 | 105 | 21 | 274.0
Diffusion | | | | | | | | | | |
GeM-ArcFace | GoogleLandmarks | +QE | 25.02 | 27.95 | 30.41 | 13.19 | 35.81 | 20.57 | 77 | 11 | 250.6
GeM-ArcFace | GoogleLandmarks | +GQE | 27.41 | 30.60 | 45.08 | 15.01 | 37.82 | 23.94 | 78 | 11 | 284.8
GeM-ArcFace, *GeM-Contrastive, *HOW-Contrastive | GoogleLandmarks, SfM120k, GoogleLandmarks | +MD | 29.17 | 35.28 | 39.70 | 17.92 | 37.53 | 26.00 | 97 | 13 | 327.5
GeM-ArcFace, *GeM-Contrastive, *HOW-Contrastive | GoogleLandmarks, SfM120k, GoogleLandmarks | +cMD | 29.10 | 35.31 | 41.40 | 17.87 | 37.35 | 25.60 | 81 | 12 | 279.5
Evaluation Setup | Absolute Perf. (mAP) |
---|---|
Unadapted | 5.81 |
A (few-shot, k = 1) | 8.34 |
B (few-shot, k = 5) | 8.47 |
C (self-supervised) | 7.91 |
D (query expansion, k = 1) | 5.72 |
E (noise) | 8.00 |
Grayscale | Crop | Absolute Perf. (mAP) |
---|---|---|
– | – | 23.68 |
🗸 | – | 23.78 |
– | 🗸 | 24.17 |
🗸 | 🗸 | 24.30 |
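The ablation above indicates that converting query images to grayscale and cropping them before descriptor extraction both improve mAP. Below is a minimal sketch of such a preprocessing step, assuming PIL and torchvision; the crop size and resize resolution are placeholders, not the values used in the paper.

```python
from PIL import Image
from torchvision import transforms

# Illustrative preprocessing: grayscale conversion (replicated to 3 channels so that
# RGB-pretrained extractors still accept the input) followed by a centre crop.
# Crop and resize sizes below are placeholders, not the paper's settings.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.CenterCrop(512),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# img = Image.open("query.jpg").convert("RGB")
# tensor = preprocess(img)   # ready to feed to the descriptor extractor
```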
Method | Computation Time |
---|---|
QE | 147 ms |
GQE | 57 ms |
MD | 128 ms |
cMD | 140 ms |