Uniformity Attentive Learning-Based Siamese Network for Person Re-Identification
Abstract
1. Introduction
- We propose an attentive learning-based Siamese network for person Re-ID. Our method includes a channel attention module and an encoder-decoder attention module for robust feature extraction.
- We propose a uniformity loss for learning both common and discriminative features. The proposed loss helps the Siamese network learn the most important features more accurately (a minimal code sketch of these ideas follows this list).
- Extensive experiments on three common benchmarks show that the proposed method achieves results comparable to state-of-the-art methods in terms of both subjective and objective measures.
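To make the first two contributions concrete, the following is a minimal PyTorch sketch of an SE-style channel attention block and a simple uniformity-style loss that penalizes disagreement between the attention maps of a positive image pair. The reduction ratio, layer sizes, and the exact loss formulation are assumptions for illustration; they are not taken verbatim from the paper.

```python
# Minimal PyTorch sketch, for illustration only: the reduction ratio, layer
# sizes, and the exact uniformity formulation are assumptions, not the
# paper's definitions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatial dims, then reweight channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))      # (B, C) channel weights from global pooling
        return x * w.view(b, c, 1, 1)        # reweighted feature maps


def uniformity_loss(att_a: torch.Tensor, att_b: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the attention maps of a positive pair,
    pushing both Siamese branches toward the same (common) person regions."""
    att_a = F.normalize(att_a.flatten(start_dim=1), dim=1)
    att_b = F.normalize(att_b.flatten(start_dim=1), dim=1)
    return ((att_a - att_b) ** 2).sum(dim=1).mean()
```

In a training loop, one would pass a positive pair through the two weight-shared branches and add `uniformity_loss` to the identification and verification terms, which mirrors the stated goal of attending to common regions while keeping discriminative features.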
2. Related Works
2.1. Attention Mechanism
2.2. Siamese Network for Person ReID
3. The Proposed Method
3.1. Identification Using Attention Module
3.1.1. Encoder-Decoder Attention Module
3.1.2. Channel Attention Module
3.1.3. Identification Loss
3.2. Siamese Verification Loss
3.3. Attention Uniformity Loss
3.4. Overall Architecture and Final Loss
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Implementation Details
4.3. Comparison with State-of-the-Art Methods
4.4. Ablation Study
4.4.1. Efficiency of the Proposed Attention Module and Uniformity Loss
4.4.2. Comparison on Network Architectural Change
4.5. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. Wang, X. Intelligent multi-camera video surveillance: A review. Pattern Recognit. Lett. 2013, 34, 3–19.
2. Loy, C.C.; Xiang, T.; Gong, S. Multi-camera activity correlation analysis. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1988–1995.
3. Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244.
4. Zheng, W.S.; Gong, S.; Xiang, T. Reidentification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 653–668.
5. Zajdel, W.; Zivkovic, Z.; Krose, B. Keeping track of humans: Have I seen this person before? In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 18–22 April 2005; pp. 2081–2086.
6. Gray, D.; Tao, H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision; Springer: Berlin, Germany, 2008; pp. 262–275.
7. Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; Cristani, M. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2360–2367.
8. Gheissari, N.; Sebastian, T.B.; Hartley, R. Person reidentification using spatiotemporal appearance. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; pp. 1528–1535.
9. Zhao, R.; Ouyang, W.; Wang, X. Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3586–3593.
10. Mignon, A.; Jurie, F. PCCA: A new approach for distance learning from sparse pairwise constraints. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 18–20 June 2012; pp. 2666–2672.
11. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2197–2206.
12. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1116–1124.
13. Li, W.; Zhao, R.; Xiao, T.; Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 152–159.
14. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762.
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
17. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–10 February 2017.
18. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
23. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025.
24. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212.
25. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
26. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
27. Sermanet, P.; Frome, A.; Real, E. Attention for fine-grained categorization. arXiv 2014, arXiv:1412.7054.
28. Li, J.; Wei, Y.; Liang, X.; Dong, J.; Xu, T.; Feng, J.; Yan, S. Attentive contexts for object detection. IEEE Trans. Multimed. 2016, 19, 944–954.
29. Liu, X.; Zhao, H.; Tian, M.; Sheng, L.; Shao, J.; Yi, S.; Yan, J.; Wang, X. HydraPlus-Net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359.
30. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294.
31. Li, S.; Bak, S.; Carr, P.; Wang, X. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 369–378.
32. Varior, R.R.; Haloi, M.; Wang, G. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 791–808.
33. Zheng, M.; Karanam, S.; Wu, Z.; Radke, R.J. Re-identification with consistent attentive siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5735–5744.
34. Zheng, Z.; Zheng, L.; Yang, Y. A discriminatively learned CNN embedding for person reidentification. ACM Trans. Multimed. Comput. Commun. Appl. 2017, 14, 1–20.
35. Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; Zheng, N. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1335–1344.
36. Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412.
37. Li, D.X.; Fei, G.Y.; Teng, S.W. Learning large margin multiple granularity features with an improved Siamese network for person re-identification. Symmetry 2020, 12, 92.
38. Guo, H.; Zheng, K.; Fan, X.; Yu, H.; Wang, S. Visual attention consistency under image transforms for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 729–739.
39. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 483–499.
40. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984.
41. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
42. Zhong, Z.; Zheng, L.; Cao, D.; Li, S. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1318–1327.
43. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking; Springer: Berlin, Germany, 2016.
44. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
45. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 384–393.
46. Chen, D.; Yuan, Z.; Chen, B.; Zheng, N. Similarity learning with spatial constraints for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1268–1277.
47. Zhang, L.; Xiang, T.; Gong, S. Learning a discriminative null space for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1239–1248.
48. Chen, Y.; Zhu, X.; Gong, S. Person re-identification by deep learning multi-scale representations. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2590–2600.
49. Liu, H.; Feng, J.; Qi, M.; Jiang, J.; Yan, S. End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 2017, 26, 3492–3506.
50. Varior, R.R.; Shuai, B.; Lu, J.; Xu, D.; Wang, G. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 135–153.
51. Sun, Y.; Zheng, L.; Deng, W.; Wang, S. SVDNet for pedestrian retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3800–3808.
52. Wang, H.; Gong, S.; Xiang, T. Highly efficient regression for scalable person re-identification. arXiv 2016, arXiv:1612.01341.
53. Zheng, Z.; Zheng, L.; Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3037–3045.
54. Chang, X.; Hospedales, T.M.; Xiang, T. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2109–2118.
55. Li, W.; Zhu, X.; Gong, S. Person re-identification by deep joint learning of multi-loss classification. arXiv 2017, arXiv:1705.04724.
Network configuration of the proposed model (ResNet50 backbone with the attention modules):

| Name | Size | Backbone | Attention Module |
|---|---|---|---|
| Input | 128 × 256 | | |
| Conv1 | 64 × 32 | 7 × 7, 64, stride 2; max pool 3 × 3, stride 2 | |
| Layer1 | 64 × 32 | residual block × 3 | |
| Layer2 | 32 × 16 | residual block × 4 | |
| Layer3 | 16 × 8 | residual block × 6 | |
| Channel attention | 1 × 1 | | global average pool (1 × 1, 1024); fc [1024, 64]; fc [64, 1024] |
| Encoder | 16 × 8 | | |
| Decoder | 16 × 8 | | |
| Multiple1 | 16 × 8 | | ChannelAtt × Layer3 |
| Multiple2 | 16 × 8 | | Multiple1 × Decoder |
| Layer4 | 8 × 4 | residual block × 3 | |
| Spatial attention | 1 × 1 | | 1 × 1, 2048 |
| | | global average pool (1 × 1, 2048) | |
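For reference, below is a minimal sketch of how the attention branch in the table above could attach to a torchvision ResNet50: the channel attention (whose 1024 → 64 → 1024 fully connected sizes match the table) and an hourglass-style encoder-decoder mask are applied to the layer3 output, and their products (Multiple1, Multiple2) feed layer4 followed by global average pooling. The `EncoderDecoderAttention` stub, the pretrained-weight flag, and the omission of the final 1 × 1 spatial attention are simplifying assumptions, not the authors' implementation; `ChannelAttention` is the class from the earlier sketch.

```python
# A minimal sketch (not the authors' implementation) of how the attention
# branch could attach at layer3 of a torchvision ResNet50.
# Assumptions: torchvision >= 0.13, 256 x 128 input crops, and a small conv
# hourglass standing in for the encoder-decoder attention.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class EncoderDecoderAttention(nn.Module):
    """Downsample/upsample hourglass that outputs a single-channel spatial mask."""

    def __init__(self, channels: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels // 4, 1, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))  # (B, 1, 16, 8) spatial mask


class AttentiveBackbone(nn.Module):
    def __init__(self, num_ids: int):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")  # ImageNet pretraining
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4
        # 1024 -> 64 -> 1024 mirrors the fc sizes listed in the table
        self.channel_att = ChannelAttention(1024, reduction=16)
        self.spatial_att = EncoderDecoderAttention(1024)
        self.classifier = nn.Linear(2048, num_ids)  # identification head

    def forward(self, x: torch.Tensor):
        # x: (B, 3, 256, 128) person crops
        f = self.layer3(self.layer2(self.layer1(self.stem(x))))  # (B, 1024, 16, 8)
        f = self.channel_att(f)          # "Multiple1": channel reweighting
        f = f * self.spatial_att(f)      # "Multiple2": spatial reweighting
        f = self.layer4(f)               # (B, 2048, 8, 4)
        emb = f.mean(dim=(2, 3))         # global average pool -> (B, 2048) embedding
        return emb, self.classifier(emb)
```

In the Siamese setting, two weight-shared copies of `AttentiveBackbone` would embed an image pair; the classifier logits would drive the identification loss, while the embeddings and attention maps would feed the verification and uniformity losses.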
Comparison with state-of-the-art methods on Market1501 (Rank-1 and mAP, %):

| Method | Rank-1 (single query) | mAP (single query) | Rank-1 (multi query) | mAP (multi query) |
|---|---|---|---|---|
| XQDA [11] | 43.8 | 22.2 | 54.1 | 28.4 |
| SCS [46] | 51.9 | 26.3 | - | - |
| DNS [47] | 61.0 | 35.6 | 71.5 | 46.0 |
| CRAFT [48] | 68.7 | 42.3 | 77.0 | 50.3 |
| CAN [49] | 60.3 | 35.9 | 72.1 | 47.9 |
| S-LSTM [50] | - | - | 61.6 | 35.3 |
| G-SCNN [32] | 65.8 | 39.5 | 76.0 | 48.4 |
| SVDNet [51] | 82.3 | 62.1 | - | - |
| MSCAN [45] | 80.3 | 57.5 | 86.8 | 66.7 |
| HA-CNN [30] | 91.2 | 75.7 | 93.8 | 82.8 |
| Ours (ResNet50) | 91.3 | 79.2 | 94.1 | 85.3 |
| Ours (VGG16) | 89.3 | 73.3 | 92.8 | 81.0 |
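The Rank-k and mAP values in these tables follow the standard re-ID evaluation protocol: a cumulative matching characteristic (CMC) curve plus mean average precision, with same-identity, same-camera gallery entries excluded per query. Below is a minimal NumPy sketch of that protocol; the array names are illustrative and this is not the evaluation script used in the paper.

```python
# Sketch of the standard re-ID evaluation protocol behind the Rank-k / mAP
# numbers above (CMC curve + mean average precision). Not the paper's code.
import numpy as np


def evaluate(dist, q_ids, q_cams, g_ids, g_cams, max_rank=10):
    """dist: (num_query, num_gallery) distance matrix; ids/cams: 1-D int arrays."""
    cmc = np.zeros(max_rank)
    aps = []
    valid_queries = 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                  # gallery sorted by distance
        same_id = g_ids[order] == q_ids[i]
        same_cam = g_cams[order] == q_cams[i]
        keep = ~(same_id & same_cam)                 # drop same-id, same-camera shots
        matches = same_id[keep].astype(int)
        hits = np.flatnonzero(matches)
        if hits.size == 0:                           # identity absent from gallery
            continue
        valid_queries += 1
        if hits[0] < max_rank:                       # CMC: 1 from first correct rank on
            cmc[hits[0]:] += 1
        precision_at_hits = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())         # average precision for this query
    return cmc / valid_queries, float(np.mean(aps))  # (CMC curve, mAP)
```

Rank-1 is `cmc[0]`, Rank-5 is `cmc[4]`, and mAP is the second return value, usually reported as a percentage.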
Comparison on CUHK03 (detected and labeled bounding-box settings; Rank-1 and mAP, %):

| Method | Rank-1 (detected) | mAP (detected) | Rank-1 (labeled) | mAP (labeled) |
|---|---|---|---|---|
| BoW + XQDA [52] | 6.4 | 6.4 | 7.9 | 7.3 |
| LOMO + XQDA [11] | 12.8 | 11.5 | 14.8 | 13.6 |
| IDE-R [42] | 21.3 | 19.7 | 22.2 | 21.0 |
| IDE-R + XQDA [42] | 31.1 | 28.2 | 32.0 | 29.6 |
| PAN [53] | 36.3 | 34.0 | 36.9 | 35.0 |
| DPFL [48] | 40.7 | 37.0 | 43.0 | 40.5 |
| HA-CNN [30] | 41.7 | 38.6 | 44.4 | 41.0 |
| MLFN [54] | 52.8 | 47.8 | 54.7 | 49.2 |
| CASN [33] | 57.4 | 50.7 | 58.9 | 52.2 |
| Ours (ResNet50) | 58.9 | 52.6 | 62.6 | 57.7 |
| Ours (VGG16) | 52.7 | 48.4 | 46.9 | 42.2 |
Comparison on DukeMTMC-ReID (Rank-1 and mAP, %):

| Method | Rank-1 | mAP |
|---|---|---|
| BoW + KISSME [52] | 25.1 | 12.2 |
| LOMO + XQDA [11] | 30.8 | 17.0 |
| ResNet50 [20] | 65.2 | 45.0 |
| JLML [55] | 73.3 | 56.4 |
| SVDNet [51] | 76.7 | 56.8 |
| HA-CNN [30] | 80.5 | 63.8 |
| Ours (ResNet50) | 80.7 | 65.5 |
| Ours (VGG16) | 78.0 | 61.4 |
Ablation on Market1501: effect of the attention module (AM) and uniformity loss (UL):

| Method | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|
| ResNet50-Basel. [34] | 88.1 | 95.0 | 96.8 | 71.2 |
| ResNet50 + AM | 89.7 | 96.2 | 97.4 | 76.7 |
| Ours (ResNet50 + AM + UL) | 91.3 | 96.9 | 98.2 | 79.2 |
| VGG16-Basel. [34] | 85.3 | 94.5 | 96.3 | 68.2 |
| VGG16 + AM | 87.2 | 95.5 | 97.4 | 69.2 |
| Ours (VGG16 + AM + UL) | 89.2 | 96.1 | 97.5 | 73.3 |
Ablation on Market1501: attaching the attention module (AM) after different backbone layers:

| Configuration | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|
| layer1-AM | 89.3 | 96.3 | 97.4 | 74.6 |
| layer2-AM | 90.7 | 96.7 | 98.0 | 78.2 |
| layer3-AM | 91.3 | 96.9 | 98.2 | 79.2 |
| layer4-AM | 88.7 | 96.1 | 97.4 | 75.4 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Jeong, D.; Park, H.; Shin, J.; Kang, D.; Paik, J. Uniformity Attentive Learning-Based Siamese Network for Person Re-Identification. Sensors 2020, 20, 3603. https://doi.org/10.3390/s20123603