PVTReID: A Quick Person Reidentification-Based Pyramid Vision Transformer
Abstract
1. Introduction
- We propose a strong basic network that applies PVT to ReID and achieves performance comparable to CNN-based methods. Because PVT shrinks its feature pyramid progressively from stage to stage, the network spends far less computation on large feature maps, which alleviates the slow inference typical of ReID on person images (a minimal sketch of the underlying attention mechanism follows this list).
- We introduce a local feature clustering (LFC) module that identifies the most dispersed local features and clusters them. With LFC, the feature representations of different people are pushed further apart in the feature space, making person features more robust.
- We apply side information embeddings (SIE) to the camera information and feed the result to the feature extraction network during training, which reduces the interference of nonvisual information in the learned features and improves their robustness. In addition, we verify the effect of injecting camera information at each of the PVT encoding stages.
- The final PVTReID framework achieves 87.8%, 63.2%, and 80.5% mAP on Market-1501, MSMT17, and DukeMTMC-reID, respectively, with faster inference than CNN-based methods.
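The speedup claimed in the first contribution comes from PVT's spatial-reduction attention (SRA) [13], which computes keys and values on a spatially downsampled copy of the token map. The following is a minimal PyTorch sketch of SRA with our own module and parameter names; it illustrates the mechanism rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Minimal sketch of PVT-style spatial-reduction attention (SRA).

    Keys/values are computed on a spatially downsampled copy of the
    feature map, so attention cost drops from O(N^2) to O(N^2 / r^2)
    for reduction ratio r. Names and defaults here are illustrative,
    not the authors' exact implementation.
    """

    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Strided conv shrinks the K/V token grid by sr_ratio per side.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                      # N == H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Downsample the token map before producing keys/values.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N', head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For a 256 × 128 person crop tokenized at stride 4 (a 64 × 32 token grid), a reduction ratio of 4 shrinks the key/value set from 2048 to 128 tokens, cutting the attention cost by roughly a factor of 16.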
2. Related Work
2.1. Person Reidentification
2.2. Vision Transformer
2.3. Side Information
3. Methods
3.1. Basic PVT-Based Network
3.1.1. Patch Embed
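As a sketch of what this stage does (following the public PVT v2 design [29], not necessarily the authors' exact code): an overlapping strided convolution tokenizes the input image, and the same construction downsamples the token map between stages to form the pyramid.

```python
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Sketch of PVT-v2-style overlapping patch embedding: a strided
    convolution with padding tokenizes the image (or the previous
    stage's map), reducing resolution while increasing channels."""

    def __init__(self, in_ch=3, dim=64, patch=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=stride,
                              padding=patch // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.proj(x)                  # (B, dim, H', W')
        H, W = x.shape[2:]
        x = x.flatten(2).transpose(1, 2)  # (B, H'*W', dim) token sequence
        return self.norm(x), H, W
```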
3.1.2. Feature Usage
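A common way to turn the last-stage tokens into a retrieval feature is global pooling followed by a BNNeck [23], with the pre-BN feature feeding the triplet loss and the post-BN feature feeding the ID classifier. The head below is a hypothetical sketch in that style; the paper's exact head may differ.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """Hypothetical feature-usage head: pool the last-stage token map
    into one vector, then apply a BNNeck-style batch norm before the
    ID classifier (a common ReID recipe [23])."""

    def __init__(self, dim, num_ids):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.classifier = nn.Linear(dim, num_ids, bias=False)

    def forward(self, tokens):
        feat = tokens.mean(dim=1)   # (B, N, C) -> (B, C) global feature
        feat_bn = self.bn(feat)     # BNNeck: normalized feature
        logits = self.classifier(feat_bn)
        # feat feeds the triplet loss; feat_bn is used at retrieval time.
        return feat, feat_bn, logits
```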
3.2. Local Feature Clustering Module
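The contribution list describes LFC as identifying the most dispersed local features and clustering them; the exact formulation is not reproduced here, so the following is a hypothetical sketch of such an objective, with the function name, top-k selection, and distance choice all our own assumptions.

```python
import torch
import torch.nn.functional as F

def lfc_loss(local_feats, k=4):
    """Hypothetical sketch of a local-feature-clustering objective.

    local_feats: (B, N, C) per-image local tokens from the backbone.
    Selects the k tokens farthest from the image's mean feature (the
    most dispersed ones) and penalizes their spread around their own
    centroid, pulling outlying parts back toward a compact cluster.
    This follows the paper's description, not its exact equations.
    """
    feats = F.normalize(local_feats, dim=-1)
    center = feats.mean(dim=1, keepdim=True)              # (B, 1, C)
    dist = (feats - center).pow(2).sum(-1)                # (B, N)
    idx = dist.topk(k, dim=1).indices                     # most dispersed tokens
    picked = torch.gather(
        feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
    centroid = picked.mean(dim=1, keepdim=True)
    return (picked - centroid).pow(2).sum(-1).mean()
```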
3.3. Side Information Embeddings
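SIE here builds on the TransReID-style construction [11]: one learnable embedding per camera, scaled by a coefficient and added to the tokens entering a chosen PVT stage, so the encoder can factor out camera-specific bias. A minimal sketch under our own naming:

```python
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Sketch of camera-aware side information embedding (SIE).

    One learnable vector per camera is scaled by a coefficient and
    added to every token before a chosen PVT stage. Mirrors the
    TransReID-style SIE the paper builds on; names are ours.
    """

    def __init__(self, num_cameras, dim, lam=1.0):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))
        nn.init.trunc_normal_(self.cam_embed, std=0.02)
        self.lam = lam

    def forward(self, tokens, cam_ids):
        # tokens: (B, N, C); cam_ids: (B,) integer camera indices.
        return tokens + self.lam * self.cam_embed[cam_ids].unsqueeze(1)
```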
4. Experiments
4.1. Datasets
4.2. Implementation
4.3. Evaluation Protocol
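The metrics reported throughout are mAP and Rank-1 (R1) under the standard Market-1501 protocol [32], in which gallery matches sharing both the identity and the camera of the query are discarded. A minimal NumPy sketch of this computation, under our own naming:

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, q_cams, g_cams):
    """Sketch of mAP and Rank-1 over a (Q, G) query-to-gallery
    distance matrix, following the standard Market-1501 protocol."""
    aps, rank1_hits = [], 0.0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        # Drop gallery images with the same ID *and* camera as the query.
        keep = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][keep] == q_ids[i]).astype(np.float64)
        if matches.sum() == 0:      # query identity absent from gallery
            continue
        rank1_hits += matches[0]
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return float(np.mean(aps)), float(rank1_hits / len(aps))
```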
4.4. Results for the PVT-Based Basic Network
4.5. Ablation Study of LFC
4.6. Ablation Study of Camera Information
4.6.1. Performance Analysis
4.6.2. Ablation Study of
4.7. Ablation Study of PVTReID
4.8. Comparison to State-of-the-Art Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Additional Experimental Results
Study of the Basic PVT-Based Network
Method | LR | DR | ADR | DPR | Loss | Market-1501 mAP | Market-1501 R1
---|---|---|---|---|---|---|---
Basic | 0.0008 | 0.1 | 0.1 | 0.3 | ✔ | 86.3 | 94.9
Learning Rate | 0.008 | ✔ | ✔ | ✔ | ✔ | 86.0 (−0.3) | 94.1 (−0.8)
 | 0.004 | ✔ | ✔ | ✔ | ✔ | 86.1 (−0.2) | 94.2 (−0.7)
 | 0.002 | ✔ | ✔ | ✔ | ✔ | 86.2 (−0.1) | 94.5 (−0.4)
 | 0.001 | ✔ | ✔ | ✔ | ✔ | 86.3 (0.0) | 94.7 (−0.2)
 | 0.0006 | ✔ | ✔ | ✔ | ✔ | 86.2 (−0.1) | 94.6 (−0.3)
 | 0.0004 | ✔ | ✔ | ✔ | ✔ | 86.0 (−0.3) | 94.2 (−0.7)
Drop Rate | ✔ | 0.0 | ✔ | ✔ | ✔ | 86.1 (−0.2) | 94.4 (−0.6)
 | ✔ | 0.2 | ✔ | ✔ | ✔ | 85.5 (−0.8) | 94.0 (−0.9)
Attention Drop | ✔ | ✔ | 0.0 | ✔ | ✔ | 86.2 (−0.1) | 94.2 (−0.7)
 | ✔ | ✔ | 0.2 | ✔ | ✔ | 85.2 (−1.1) | 94.3 (−0.6)
Drop Path | ✔ | ✔ | ✔ | 0.0 | ✔ | 84.9 (−1.4) | 93.6 (−1.3)
 | ✔ | ✔ | ✔ | 0.1 | ✔ | 85.7 (−0.6) | 94.1 (−0.8)
 | ✔ | ✔ | ✔ | 0.2 | ✔ | 86.1 (−0.2) | 94.6 (−0.3)
 | ✔ | ✔ | ✔ | 0.4 | ✔ | 86.0 (−0.3) | 94.4 (−0.5)
Loss Function | ✔ | ✔ | ✔ | ✔ | ✘ | 86.0 (−0.3) | 94.2 (−0.7)

LR = learning rate; DR = drop rate; ADR = attention drop rate; DPR = drop path rate; Loss = default loss-function setting. ✔ means the Basic setting is kept; deltas in parentheses are relative to Basic.
References
1. Luo, H.; Jiang, W.; Fan, X.; Zhang, S. A survey on deep learning based person re-identification. Acta Autom. Sin. 2019, 45, 2032–2049.
2. Zhuang, Z.; Wei, L.; Xie, L.; Zhang, T.; Zhang, H.; Wu, H.; Ai, H.; Tian, Q. Rethinking the distribution gap of person re-identification with camera-based batch normalization. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII. Springer: Cham, Switzerland, 2020; pp. 140–157.
3. Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Pattern Recognit. 2019, 95, 151–161.
4. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII. Springer: Cham, Switzerland, 2016; pp. 499–515.
5. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
6. Zheng, Z.; Zheng, L.; Yang, Y. A discriminatively learned CNN embedding for person reidentification. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2017, 14, 1–20.
7. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195.
8. Chen, X.; Fu, C.; Zhao, Y.; Zheng, F.; Song, J.; Ji, R.; Yang, Y. Salience-guided cascaded suppression network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3300–3310.
9. Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8351–8361.
10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
11. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15013–15022.
12. Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 367–376.
13. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578.
14. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
15. Islam, M.A.; Jia, S.; Bruce, N.D. How much position information do convolutional neural networks encode? arXiv 2020, arXiv:2001.08248.
16. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
17. Luo, H.; Jiang, W.; Zhang, X.; Fan, X.; Qian, J.; Zhang, C. AlignedReID++: Dynamically matching local information for person re-identification. Pattern Recognit. 2019, 94, 53–61.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
19. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984.
20. Matsukawa, T.; Suzuki, E. Person re-identification using CNN features learned from combination of attributes. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2428–2433.
21. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
22. Yuan, Y.; Chen, W.; Yang, Y.; Wang, Z. In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 354–355.
23. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
24. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407.
25. Wei, L.; Zhang, S.; Yao, H.; Gao, W.; Tian, Q. GLAD: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 420–428.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15.
27. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
28. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
29. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
31. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
32. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
33. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88.
34. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 October 2016; pp. 17–35.
35. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008.
36. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
37. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2736–2746.
38. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712.
39. Jin, X.; Lan, C.; Zeng, W.; Wei, G.; Chen, Z. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11173–11180.
40. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551.
41. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6449–6458.
42. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III. Springer: Cham, Switzerland, 2020; pp. 346–363.
43. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282.
Dataset | #Cameras | #Images | #IDs
---|---|---|---
Market-1501 | 6 | 32,668 | 1,501
MSMT17 | 15 | 126,441 | 4,101
DukeMTMC-reID | 8 | 36,411 | 1,404
Backbone | Params (M) | Inference Time (× ResNet50) | MSMT17 mAP | MSMT17 R1
---|---|---|---|---
ResNet50 | 23.5 | 1.0× | 51.3 | 75.3
ResNet101 | 44.5 | 1.48× | 53.8 | 77.0
ResNet152 | 60.2 | 1.96× | 55.6 | 78.4
ResNeSt50 | 25.6 | 1.86× | 61.2 | 82.0
ResNeSt200 | 68.6 | 3.12× | 63.5 | 83.5
ViT-B | 86.0 | 1.79× | 61.0 | 81.8
PVT-V2-B2 | 25.4 | 0.82× | 54.1 | 77.3
PVT-V2-B5 | 82.0 | 1.22× | 60.2 | 81.2
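The relative inference times above, and the images/s throughput reported in the final comparison table, can be measured with a routine like the following sketch (our own helper, not the authors' benchmark code; it assumes a CUDA device).

```python
import time
import torch

@torch.no_grad()
def throughput(model, batch=64, size=(3, 256, 128), iters=50, device="cuda"):
    """Rough images/s measurement sketch: warm-up, then a timed loop
    bracketed by CUDA synchronization so GPU work is fully counted."""
    model.eval().to(device)
    x = torch.randn(batch, *size, device=device)
    for _ in range(10):          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch * iters / (time.time() - start)
```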
Method | Market-1501 mAP | Market-1501 R1 | MSMT17 mAP | MSMT17 R1 | DukeMTMC-reID mAP | DukeMTMC-reID R1
---|---|---|---|---|---|---
Basic | 86.3 | 94.9 | 60.2 | 81.2 | 77.8 | 87.9
+LFC | 87.6 | 95.1 | 62.1 | 82.1 | 79.3 | 88.6
+LFC w/o local | 87.6 | 95.1 | 62.0 | 82.1 | 79.2 | 88.4
Method | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Market-1501 mAP | Market-1501 R1 | MSMT17 mAP | MSMT17 R1 | DukeMTMC-reID mAP | DukeMTMC-reID R1
---|---|---|---|---|---|---|---|---|---|---
Basic | | | | | 86.3 | 94.9 | 60.2 | 81.2 | 77.8 | 87.9
+SIE | ✔ | | | | 86.7 | 94.7 | 61.8 | 81.9 | 79.4 | 88.6
+SIE | | ✔ | | | 86.6 | 94.5 | 61.7 | 82.0 | 79.2 | 88.5
+SIE | | | ✔ | | 86.4 | 94.4 | 61.2 | 81.5 | 78.8 | 88.3
+SIE | | | | ✔ | 86.0 | 94.2 | 59.7 | 81.1 | 78.3 | 88.1
Method | LFC | SIE | Market-1501 mAP | Market-1501 R1 | MSMT17 mAP | MSMT17 R1 | DukeMTMC-reID mAP | DukeMTMC-reID R1
---|---|---|---|---|---|---|---|---
Basic | ✘ | ✘ | 86.3 | 94.9 | 60.2 | 81.2 | 77.8 | 87.9
+LFC | ✔ | ✘ | 87.6 | 95.1 | 62.1 | 82.1 | 79.3 | 88.6
+SIE | ✘ | ✔ | 86.7 | 94.7 | 61.8 | 81.9 | 79.7 | 89.3
PVTReID | ✔ | ✔ | 87.8 | 95.0 | 63.2 | 82.3 | 80.5 | 90.0
Backbone | Method | Size | Inference (Images/s) | Market-1501 mAP | Market-1501 R1 | MSMT17 mAP | MSMT17 R1 | DukeMTMC-reID mAP | DukeMTMC-reID R1
---|---|---|---|---|---|---|---|---|---
CNN | CBN [2] | 256 × 128 | 338 | 77.3 | 91.3 | 42.9 | 72.8 | 67.3 | 82.5
CNN | OSNet [38] | 256 × 128 | 2028 | 84.9 | 94.8 | 52.9 | 78.7 | 73.5 | 88.6
CNN | SAN [39] | 256 × 128 | 290 | 88.0 | 96.1 | 55.7 | 79.2 | 75.7 | 87.9
CNN | PGFA [40] | 256 × 128 | 263 | 76.8 | 91.2 | - | - | 65.5 | 82.6
CNN | HOReID [41] | 256 × 128 | 310 | 84.9 | 94.2 | - | - | 75.6 | 86.9
CNN | ISP [42] | 256 × 128 | 315 | 88.6 | 95.3 | - | - | 80.0 | 89.6
CNN | MGN [43] | 384 × 128 | 287 | 86.9 | 95.7 | 52.1 | 76.9 | 78.4 | 88.7
CNN | SCSN [8] | 384 × 128 | 267 | 88.5 | 95.7 | 58.5 | 83.8 | 79.0 | 91.0
CNN | ABDNet [9] | 384 × 128 | 223 | 88.3 | 95.6 | 60.8 | 82.3 | 78.6 | 89.0
PVT | Basic | 256 × 128 | 359 | 86.3 | 94.9 | 60.2 | 81.2 | 77.8 | 87.9
PVT | PVTReID | 256 × 128 | 341 | 87.8 | 95.3 | 63.2 | 82.3 | 80.5 | 90.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).