Transformers in Pedestrian Image Retrieval and Person Re-Identification in a Multi-Camera Surveillance System
Abstract
1. Introduction
2. Related Work
3. Foundation
3.1. Residual Network
3.2. Dense Network
3.3. PCB
3.4. Transformers
- Self-Attention: Self-attention estimates the significance of each item with respect to all others, explicitly modeling their interactions for structured prediction and updating each item by aggregating global information from the entire input sequence, as shown in Figure 2. Consider a sequence of $n$ items with embedding dimension $d$, i.e., $X \in \mathbb{R}^{n \times d}$. The aim is to capture all the interactions, encoding each entity in terms of the global contextual information via three learnable weight matrices for Keys ($W^{K} \in \mathbb{R}^{d \times d_{k}}$), Queries ($W^{Q} \in \mathbb{R}^{d \times d_{q}}$), and Values ($W^{V} \in \mathbb{R}^{d \times d_{v}}$). Projecting $X$ onto these matrices gives $Q = XW^{Q}$, $K = XW^{K}$, and $V = XW^{V}$, and the self-attention layer's output is
  $$Z = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{q}}}\right)V,$$
  achieved by computing the dot-product of each query with all keys and applying a softmax to obtain normalized attention scores; each output item is then the weighted sum of all items, with the attention scores providing the weights.
- Multi-headed Attention: The multi-head attention shown in Figure 3 is composed of $h$ self-attention modules that together capture multiple complex relationships among the items in a sequence, where the $i$-th module learns its own weight matrices $W^{Q_{i}}$, $W^{K_{i}}$, and $W^{V_{i}}$. The outputs of the $h$ self-attention modules are concatenated and then projected onto a weight matrix $W^{O}$.
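The two operations above can be sketched in a few lines of NumPy. This is a minimal illustration with arbitrary dimensions, not the paper's implementation; all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project X onto queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n, n) pairwise query-key dot products
    # softmax over the key axis yields normalized attention weights per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # each output is a weighted sum of all values

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples; W_o projects the concatenation."""
    outs = [self_attention(Wq_Wk_Wv[0] if False else X, *Wq_Wk_Wv) for Wq_Wk_Wv in heads]
    return np.concatenate(outs, axis=-1) @ W_o

n, d, h = 4, 8, 2
X = rng.standard_normal((n, d))
heads = [tuple(rng.standard_normal((d, d // h)) for _ in range(3)) for _ in range(h)]
W_o = rng.standard_normal((d, d))
Z = multi_head_attention(X, heads, W_o)        # shape (n, d)
```

With $h = 2$ heads of width $d/h$, the concatenation restores the original embedding dimension before the final projection, which is the standard choice in [27].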
4. Proposed Architectures
4.1. Residual Transformer
4.2. Dense Transformer
4.3. PCB Transformer
5. Experimental Results
5.1. Setup
- Market-1501 (http://zheng-lab.cecs.anu.edu.au/Project/project_reid.html, accessed on 18 August 2020) dataset [39] was collected with six cameras, one low-resolution and five high-resolution, placed outside a supermarket at Tsinghua University, with overlapping fields of view between the different cameras. Market-1501 has 32,668 annotated bounding boxes of 1501 pedestrians. To enable cross-camera search, each pedestrian is captured by multiple cameras, and every pedestrian is guaranteed to appear in at least two of them.
- DukeMTMC-reID (https://github.com/sxzrt/DukeMTMC-reID_evaluation#download-dataset, accessed on 25 August 2020) dataset is constructed from the DukeMTMC [40] dataset, which consists of high-resolution videos acquired by eight cameras with annotated pedestrian bounding boxes. In [40], pedestrian images are cropped every 120 frames, yielding 1812 identities with 36,411 bounding boxes. Of these, 702 IDs are selected for training and another 702 for testing, ensuring that each pedestrian appears in more than two cameras.
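Both datasets are evaluated under the cross-camera protocol: for each query, gallery images of the same identity taken by the same camera are discarded, and Rank-k accuracy and mean average precision (mAP) are reported. A minimal sketch of that per-query evaluation, assuming a precomputed query-to-gallery distance vector (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def rank_k_and_ap(dist, q_id, q_cam, g_ids, g_cams, k=1):
    """Rank-k hit and average precision for one query; gallery images of the
    same identity from the same camera are excluded (cross-camera search)."""
    order = np.argsort(dist)                                  # gallery sorted by distance
    keep = ~((g_ids[order] == q_id) & (g_cams[order] == q_cam))
    matches = (g_ids[order][keep] == q_id).astype(float)      # 1 where identity matches
    hit = bool(matches[:k].any())                             # Rank-k: any match in top k?
    cum = np.cumsum(matches)
    precision = cum / (np.arange(len(matches)) + 1)           # precision at each rank
    ap = float((precision * matches).sum() / max(matches.sum(), 1.0))
    return hit, ap

# toy gallery: identity 1 appears in cameras 1 and 2; the query is (id 1, camera 1)
dist   = np.array([0.1, 0.4, 0.2])
g_ids  = np.array([1, 2, 1])
g_cams = np.array([2, 3, 1])
hit, ap = rank_k_and_ap(dist, q_id=1, q_cam=1, g_ids=g_ids, g_cams=g_cams, k=1)
```

The reported mAP is the mean of these per-query average precisions over all queries, and the Rank-k scores aggregate the per-query hits.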
5.2. Comparisons
5.3. Ablation Studies
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984.
- Bai, X.; Yang, M.; Huang, T.; Dou, Z.; Yu, R.; Xu, Y. Deep-person: Learning discriminative deep features for person re-identification. arXiv 2017, arXiv:1711.10658.
- Yan, Y.; Zhang, Q.; Ni, B.; Zhang, W.; Xu, M.; Yang, X. Learning Context Graph for Person Search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2153–2162.
- Bakalos, N.; Voulodimos, A.; Doulamis, N.; Doulamis, A.; Ostfeld, A.; Salomons, E.; Caubet, J.; Jimenez, V.; Li, P. Protecting Water Infrastructure From Cyber and Physical Threats: Using Multimodal Data Fusion and Adaptive Deep Learning to Monitor Critical Systems. IEEE Signal Process. Mag. 2019, 36, 36–48.
- Xu, Y.; Ma, B.; Huang, R.; Lin, L. Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 937–940.
- Dai, Z.; Chen, M.; Zhu, S.; Tan, P. Batch feature erasing for person re-identification and beyond. arXiv 2018, arXiv:1811.07130.
- Huang, H.; Yang, W.; Chen, X.; Zhao, X.; Huang, K.; Lin, J.; Huang, G.; Du, D. EANet: Enhancing Alignment for Cross-Domain Person Re-identification. arXiv 2018, arXiv:1812.11369.
- Zheng, Z.; Yang, X.; Yu, Z.; Zheng, L.; Yang, Y.; Kautz, J. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2133–2142.
- Wang, G.; Lai, J.; Huang, P.; Xie, X. Spatial-temporal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8933–8940.
- Yang, F.; Yan, K.; Lu, S.; Jia, H.; Xie, X.; Gao, W. Attention driven person re-identification. Pattern Recognit. 2019, 86, 143–155.
- Adaimi, G.; Kreiss, S.; Alahi, A. Rethinking Person Re-Identification with Confidence. arXiv 2019, arXiv:1906.04692.
- Wieczorek, M.; Rychalska, B.; Dabrowski, J. On the Unreasonable Effectiveness of Centroids in Image Retrieval. arXiv 2021, arXiv:2104.13643.
- Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
- Wang, H.; Fan, Y.; Wang, Z.; Jiao, L.; Schiele, B. Parameter-Free Spatial Attention Network for Person Re-Identification. arXiv 2018, arXiv:1811.12150.
- Wojke, N.; Bewley, A. Deep cosine metric learning for person re-identification. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 748–756.
- Zhong, Z.; Zheng, L.; Zheng, Z.; Li, S.; Yang, Y. Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5157–5166.
- Zheng, F.; Deng, C.; Sun, X.; Jiang, X.; Guo, X.; Yu, Z.; Huang, F.; Ji, R. Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8506–8514.
- Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1487–1495.
- Quan, R.; Dong, X.; Wu, Y.; Zhu, L.; Yang, Y. Auto-ReID: Searching for a Part-aware ConvNet for Person Re-Identification. arXiv 2019, arXiv:1903.09776.
- Ro, Y.; Choi, J.; Jo, D.U.; Heo, B.; Lim, J.; Choi, J.Y. Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-identification. arXiv 2019, arXiv:1901.06140.
- Zeng, Z.; Wang, Z.; Wang, Z.; Chuang, Y.Y.; Satoh, S. Illumination-Adaptive Person Re-identification. arXiv 2019, arXiv:1905.04525.
- Zhang, S.; Yin, Z.; Wu, X.; Wang, K.; Zhou, Q.; Kang, B. FPB: Feature Pyramid Branch for Person Re-Identification. arXiv 2021, arXiv:2108.01901.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Sharma, C.; Kapil, S.R.; Chapman, D. Person Re-Identification with a Locally Aware Transformer. arXiv 2021, arXiv:2106.03720.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Yunpeng, G. A general multi-modal data learning method for Person Re-identification. arXiv 2021, arXiv:2101.08533.
- Wang, D.; Zhang, S. Unsupervised person re-identification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10978–10987.
- Shu, X.; Wang, X.; Zhang, S.; Zhang, X.; Chen, Y.; Li, G.; Tian, Q. Large-Scale Spatio-Temporal Person Re-identification: Algorithm and Benchmark. arXiv 2021, arXiv:2105.15076.
- Jin, H.; Wang, X.; Liao, S.; Li, S.Z. Deep person re-identification with improved embedding and efficient training. In Proceedings of the 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 261–267.
- Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385.
- Bromley, J.; Bentz, J.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Sackinger, E.; Shah, R. Signature Verification using a “Siamese” Time Delay Neural Network. Int. J. Pattern Recognit. Artif. Intell. 1993, 7, 669–688.
- Ge, Y.; Li, Z.; Zhao, H.; Yin, G.; Yi, S.; Wang, X.; Li, H. FD-GAN: Pose-guided feature distilling GAN for robust person re-identification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1230–1241.
- Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference, Seoul, Korea, 22–26 October 2018; ACM: New York, NY, USA, 2018; pp. 274–282.
- Zhong, Z.; Zheng, L.; Luo, Z.; Li, S.; Yang, Y. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 598–607.
- Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-Scale Feature Learning for Person Re-Identification. arXiv 2019, arXiv:1905.00953.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
- Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
- Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Workshops, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 17–35.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
- Tahir, M.; Anwar, S.; Mian, A. Deep localization of protein structures in fluorescence microscopy images. arXiv 2019, arXiv:1910.04287.
- Anwar, H.; Anwar, S.; Zambanini, S.; Porikli, F. Deep ancient Roman Republican coin classification via feature fusion and attention. Pattern Recognit. 2021, 114, 107871.
- Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839–847.
- Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 24–25.
- Muhammad, M.B.; Yeasin, M. Eigen-CAM: Class Activation Map using Principal Components. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7.
| Method | L1 R@1 | L1 R@5 | L1 R@10 | L1 mAP | L2 R@1 | L2 R@5 | L2 R@10 | L2 mAP | L3 R@1 | L3 R@5 | L3 R@10 | L3 mAP | L4 R@1 | L4 R@5 | L4 R@10 | L4 mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Market-1501 Dataset | | | | | | | | | | | | | | | | |
| ResNet50 | 87.65 | 94.69 | 96.85 | 71.15 | | | | | | | | | | | | |
| | 90.86 | 96.44 | 97.92 | 74.99 | 90.77 | 96.59 | 97.92 | 75.61 | 90.38 | 95.96 | 97.71 | 74.80 | 91.06 | 96.59 | 97.92 | 76.92 |
| | 90.08 | 96.29 | 97.63 | 74.78 | 90.97 | 96.44 | 97.83 | 76.63 | 90.20 | 96.35 | 97.95 | 76.00 | - | - | - | - |
| | 90.26 | 96.41 | 97.65 | 75.14 | 90.26 | 96.23 | 97.45 | 75.59 | 90.53 | 96.08 | 97.42 | 74.87 | 89.85 | 96.20 | 97.74 | 74.69 |
| | 89.70 | 96.20 | 97.48 | 74.21 | 89.40 | 96.50 | 97.77 | 74.56 | 89.67 | 96.41 | 97.95 | 74.74 | - | - | - | - |
| | 90.20 | 96.32 | 97.92 | 74.42 | 90.50 | 96.50 | 97.68 | 75.02 | 90.17 | 96.38 | 97.92 | 74.36 | 89.61 | 96.11 | 97.83 | 74.44 |
| DukeMTMC Dataset | | | | | | | | | | | | | | | | |
| ResNet50 | 77.74 | 87.84 | 91.34 | 60.65 | | | | | | | | | | | | |
| | 81.33 | 90.62 | 93.04 | 64.35 | 81.46 | 90.66 | 93.04 | 65.10 | 81.82 | 90.62 | 93.36 | 65.51 | 82.81 | 91.20 | 93.36 | 66.34 |
| | 81.19 | 91.29 | 93.90 | 64.68 | 81.60 | 90.98 | 93.58 | 64.56 | 82.41 | 90.93 | 93.18 | 65.70 | - | - | - | - |
| | 80.75 | 90.75 | 93.45 | 64.54 | 81.64 | 91.11 | 93.94 | 65.20 | 81.55 | 91.07 | 93.54 | 65.82 | - | - | - | - |
| | 81.37 | 90.66 | 93.09 | 65.27 | 82.14 | 90.98 | 93.13 | 65.67 | 81.87 | 90.35 | 93.27 | 65.50 | 81.19 | 90.17 | 93.40 | 65.80 |
| | 80.88 | 90.53 | 92.86 | 64.86 | 81.15 | 89.81 | 92.64 | 64.20 | 81.51 | 90.40 | 93.36 | 65.08 | 81.78 | 90.80 | 93.81 | 65.33 |
| Method | L1 R@1 | L1 R@5 | L1 R@10 | L1 mAP | L2 R@1 | L2 R@5 | L2 R@10 | L2 mAP | L3 R@1 | L3 R@5 | L3 R@10 | L3 mAP | L4 R@1 | L4 R@5 | L4 R@10 | L4 mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Market-1501 Dataset | | | | | | | | | | | | | | | | |
| DenseNet | | | | | | | | | | | | | | | | |
| | 90.02 | 95.99 | 97.42 | 73.74 | 90.94 | 97.00 | 97.95 | 75.68 | 90.47 | 96.59 | 97.71 | 76.28 | 90.56 | 96.59 | 97.89 | 75.51 |
| | 90.65 | 96.70 | 97.74 | 74.43 | 90.86 | 96.32 | 97.62 | 75.73 | 90.68 | 96.79 | 98.04 | 75.66 | 90.88 | 96.82 | 97.83 | 75.31 |
| | 90.91 | 96.50 | 97.89 | 76.29 | 91.06 | 96.79 | 98.04 | 75.37 | 91.21 | 97.00 | 98.07 | 76.73 | 90.38 | 96.59 | 97.89 | 74.97 |
| | 91.00 | 96.50 | 97.62 | 75.47 | 90.86 | 96.82 | 98.10 | 75.76 | 90.62 | 96.59 | 97.89 | 74.95 | 90.56 | 96.29 | 97.62 | 75.33 |
| DukeMTMC Dataset | | | | | | | | | | | | | | | | |
| DenseNet | | | | | | | | | | | | | | | | |
| | 82.05 | 90.31 | 92.82 | 64.61 | 83.39 | 91.43 | 93.85 | 66.04 | 83.21 | 91.79 | 94.30 | 66.84 | 82.99 | 91.61 | 93.54 | 66.92 |
| | 81.78 | 91.25 | 93.67 | 65.76 | 82.09 | 91.16 | 93.54 | 66.32 | 83.08 | 91.52 | 93.58 | 65.79 | 82.50 | 91.56 | 94.21 | 65.87 |
| | 82.63 | 91.34 | 93.81 | 65.30 | 81.87 | 90.48 | 93.36 | 66.45 | 82.18 | 90.80 | 93.49 | 66.23 | 81.82 | 91.02 | 93.31 | 65.93 |
| | 81.96 | 91.38 | 94.21 | 65.99 | 81.60 | 90.48 | 93.18 | 64.40 | 82.59 | 90.62 | 93.67 | 65.79 | 82.85 | 91.07 | 94.03 | 66.54 |
| Methods | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|
| | 92.64 | - | - | 77.47 |
| | 86.58 | 94.42 | 96.26 | 66.00 |
| | 90.08 | 95.72 | 97.15 | 71.95 |
| | 88.00 | 95.25 | 96.67 | 68.23 |
| | 72.36 | 86.67 | 90.53 | 42.64 |
| | Baseline | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Time | 55 min | 139 min | 154 min | 164 min | 178 min | 152 min | 197 min | 212 min | 234 min |
| Parameters | 24.94 M | 25.34 M | 26.91 M | 33.21 M | 58.39 M | 25.34 M | 27.90 M | 37.35 M | 75.11 M |
| Method | L1 R@1 | L1 R@5 | L1 R@10 | L1 mAP | L2 R@1 | L2 R@5 | L2 R@10 | L2 mAP | L3 R@1 | L3 R@5 | L3 R@10 | L3 mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 89.40 | 95.78 | 97.65 | 74.41 | 90.23 | 96.11 | 97.65 | 75.29 | 90.74 | 96.41 | 97.83 | 75.47 |
| | 89.46 | 96.32 | 97.57 | 74.33 | 90.32 | 96.47 | 97.65 | 75.80 | 90.59 | 96.41 | 97.65 | 75.45 |
| | 90.26 | 96.41 | 97.65 | 75.14 | 90.26 | 96.23 | 97.45 | 75.59 | 90.53 | 96.08 | 97.42 | 74.87 |
| | 0.06 | 0.06 | 0.06 | 0.20 | 0.06 | 0.06 | 0.06 | 0.20 | 0.06 | 0.06 | 0.06 | 0.20 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tahir, M.; Anwar, S. Transformers in Pedestrian Image Retrieval and Person Re-Identification in a Multi-Camera Surveillance System. Appl. Sci. 2021, 11, 9197. https://doi.org/10.3390/app11199197