1. Introduction
With the development of deep neural networks and the growing demands of application scenarios, person re-identification (Re-ID) technology has promising development prospects. The main challenges this research needs to solve are occlusion, similar appearance, and illumination change. Several novel and effective Re-ID models [1,2] based on deep learning have been proposed. In addition, Re-ID models [3,4,5] based on attention mechanisms have achieved many encouraging results. A popular approach is to capture the global and local features of pedestrians through attention mechanisms. Yang et al. [3] found that the semantic information obtained from global features also contains interference information (e.g., background clutter). To solve this problem, Chen et al. [4] designed a mixed high-order attention network to exploit complex high-order statistical information through the attention mechanism. Chen et al. [5] proposed an attentive but diverse network (ABD-Net), which applies a complementarity mechanism to attention.
However, researchers often focus on achieving a more robust model and neglect baseline research, and current baseline research is insufficient to support high-precision model development. Specifically, BagTricks [6] is a baseline with high performance. We show the visualization results of BagTricks in Figure 1. We observe that in Figure 1a,c, the target pedestrian is occluded by another pedestrian, which introduces wrong matches into the retrieval results. In Figure 1b, the clothes of the target pedestrian are very similar to those of another person, which causes BagTricks [6] to retrieve some wrong results.
After studying and summarizing existing algorithms, we designed an adaptive multiple-loss baseline for Re-ID that is simple in structure but powerful in function.
There are two reasons why we designed a simple but powerful baseline. First, to extract rich and representative global and local features, most researchers work on constructing deep convolutional neural networks [7,8,9]. Xia et al. [7] designed a second-order non-local attention model (SONA), which effectively represents the local information of the target pedestrian through second-order features. In addition, Zheng et al. [8] proposed a pyramid network to integrate the global and local information of input pictures. Alemu et al. [9] proposed deep constrained dominant sets, which can alleviate the problem of appearance similarity to a certain extent. Zhang et al. [10] designed a densely semantically aligned (DSA) framework to effectively alleviate the occlusion encountered during pedestrian recognition. Zhang et al. [11] designed relation-aware global attention (RGA) to extract contextual semantic information. Usually, researchers apply a new method to a strong baseline to obtain a high-accuracy retrieval network. Through comparative experiments, we found that the performance of the network differs greatly when the same module is applied to different baselines.
Second, we conducted a detailed survey of current articles on Re-ID baselines [6,12,13]. Specifically, BagTricks [6] is a high-performance baseline that combines six training tricks. Xiong et al. [12] placed batch normalization after global pooling to improve network performance; a minimal sketch of this design is given below. In particular, Sun et al. [13] designed a part-based convolutional baseline that extracts pedestrian features from body parts. Ye et al. [14] proposed the robust AGW baseline, which adds the non-local attention [15] module to ResNet-50 [16].
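To make the pooling-then-normalization design concrete, the following is a minimal PyTorch sketch of such a baseline head, assuming a ResNet-50 feature map as input. The class name, feature dimension, and identity count are our own illustrative choices, not code from [6,12].

```python
import torch
import torch.nn as nn

class BaselineHead(nn.Module):
    """Illustrative head in the style of [6,12]: pooling -> BN -> ID classifier."""
    def __init__(self, in_dim: int = 2048, num_ids: int = 751):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.bnneck = nn.BatchNorm1d(in_dim)    # batch normalization placed after pooling
        self.bnneck.bias.requires_grad_(False)  # trick reported in [6]: freeze the BN shift
        self.classifier = nn.Linear(in_dim, num_ids, bias=False)

    def forward(self, feat_map: torch.Tensor):
        f_t = self.gap(feat_map).flatten(1)  # pre-BN feature: used by triplet-style losses
        f_i = self.bnneck(f_t)               # normalized feature: used by the ID (softmax) loss
        return f_t, self.classifier(f_i)
```

The key point of this design is that the pre-BN feature serves the metric losses while the normalized feature serves the ID classifier, which [6] reports makes the two loss types easier to optimize jointly.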
In addition, we find that the major difference among these baselines lies in their loss functions, and the retrieval accuracy of existing models on small datasets is still not satisfactory. Researchers usually build models with an ID loss and a triplet loss [17], as the triplet loss increases the distance between the input image and negative samples. Wu et al. [18] argued that the triplet loss can disrupt the internal structure of the samples, and that hard negative samples may lead to model collapse. We therefore introduce an adaptive mining sample (AMS) loss based on the triplet loss. The AMS loss automatically assigns an appropriate margin to each sample group, which effectively avoids samples being misjudged (e.g., negative samples misjudged as positive). We use both the triplet loss and the AMS loss in the designed baseline, and the trained model achieves high retrieval accuracy.
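The precise AMS formulation is not given in this introduction; purely to illustrate the idea described above (a margin that adapts to the sample group instead of staying fixed), the following PyTorch sketch modifies a batch-hard triplet loss. The function name, the 0.5 scaling factor, and the batch-hard mining strategy are our assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet(feats: torch.Tensor, labels: torch.Tensor,
                            base_margin: float = 0.3) -> torch.Tensor:
    """Batch-hard triplet loss whose margin adapts to each anchor's positives (illustrative)."""
    dist = torch.cdist(feats, feats, p=2)              # pairwise L2 distances, shape (N, N)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # True where two samples share an identity
    # Hardest positive: the farthest sample with the same identity.
    d_ap = dist.masked_fill(~same, 0.0).max(dim=1).values
    # Hardest negative: the closest sample with a different identity.
    d_an = dist.masked_fill(same, float("inf")).min(dim=1).values
    # Hypothetical adaptation rule: widen the margin when the positives are
    # spread out, so borderline negatives are not mistaken for positives.
    margin = base_margin + 0.5 * d_ap.detach()
    return F.relu(d_ap - d_an + margin).mean()
```

With a fixed margin, a hard negative lying just outside a spread-out positive cluster can still rank above a distant positive; letting the margin grow with the hardest-positive distance pushes such negatives further away, which is the kind of misjudgment the AMS loss aims to avoid.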
In summary, this manuscript makes the following contributions:
- Based on the triplet loss, the designed AMS loss greatly improves the performance of the model. Its simple but robust design gives the network not only high accuracy but also strong practicability.
- We propose a robust and simple baseline, which achieves 82.3% mAP and 85.6% Rank-1 on the CUHK-03 dataset. These results are 25.7% and 26.8% higher, respectively, than those of the current strong baseline BagTricks [6].
- We carried out comparative and ablation experiments, such as embedding novel modules or replacing the backbone, to prove that the proposed baseline is effective for Re-ID tasks.
Author Contributions
Conceptualization and methodology, Z.H.; software, Z.H. and Y.L.; validation, Y.L. and L.W.; formal analysis, L.W. and A.D.; data curation and writing—original draft preparation, A.D.; writing—review and editing, S.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China under grant number U1903213, and the Tianshan Innovation Team of Xinjiang Uygur Autonomous Region under grant number 2020D14044.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Tay, C.-P.; Roy, S.; Yap, K.-H. AANet: Attribute attention network for person re-identifications. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7127–7136.
2. Gong, Y.; Wang, L.; Li, Y.; Du, A. A discriminative person re-identification model with global-local attention and adaptive weighted rank list loss. IEEE Access 2020, 8, 203700–203711.
3. Yang, W.; Huang, H.; Zhang, Z.; Chen, X.; Huang, K.; Zhang, S. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1389–1398.
4. Chen, B.; Deng, W.; Hu, J. Mixed high-order attention network for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 371–381.
5. Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 8350–8360.
6. Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimed. 2019, 22, 2597–2609.
7. Xia, B.N.; Gong, Y.; Zhang, Y.; Poellabauer, C. Second-order non-local attention networks for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3759–3768.
8. Zheng, F.; Deng, C.; Sun, X.; Jiang, X.; Guo, X.; Yu, Z.; Huang, F.; Ji, R. Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8506–8514.
9. Alemu, L.T.; Pelillo, M.; Shah, M. Deep constrained dominant sets for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9855–9864.
10. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z. Densely semantically aligned person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 667–676.
11. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3183–3192.
12. Xiong, F.; Xiao, Y.; Cao, Z.; Gong, K.; Fang, Z.; Zhou, J.T. Good practices on building effective CNN baseline model for person re-identification. In Proceedings of the Tenth International Conference on Graphics and Image Processing (ICGIP 2018), Chengdu, China, 12–14 December 2018; pp. 1–2.
13. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11208, pp. 480–496.
14. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C.H. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893.
15. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
17. Liu, H.; Feng, J.; Qi, M.; Jiang, J.; Yan, S. End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 2017, 26, 3492–3506.
18. Wu, C.-Y.; Manmatha, R.; Smola, A.J.; Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2859–2867.
19. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 1487–1495.
20. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13001–13008.
21. Zheng, Z.; Zheng, L.; Yang, Y. A discriminatively learned CNN embedding for person reidentification. ACM Trans. Multimed. Comput. Commun. Appl. 2018, 14, 1–20.
22. Fan, X.; Jiang, W.; Luo, H.; Fei, M. SphereReID: Deep hypersphere manifold embedding for person re-identification. J. Vis. Commun. Image Represent. 2019, 60, 51–58.
23. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
24. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 17–35.
25. Li, W.; Zhao, R.; Xiao, T.; Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159.
26. Fan, D.; Wang, L.; Cheng, S.; Li, Y. Dual branch attention network for person re-identification. Sensors 2021, 21, 5839.
27. Ristani, E.; Tomasi, C. Features for multi-target multi-camera tracking and re-identification. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6036–6046.
28. Gong, Y.; Wang, L.; Cheng, S.; Li, Y. A strong baseline based on adaptive mining sample loss for person re-identification. In Proceedings of the CAAI International Conference on Artificial Intelligence 2021, Hangzhou, China, 5–6 June 2021; pp. 469–480.
29. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6397–6406.
30. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742.
31. Chen, X.; Fu, C.; Zhao, Y.; Zheng, F.; Song, J.; Ji, R.; Yang, Y. Salience-guided cascaded suppression network for person re-identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3297–3307.
32. Wang, G.; Yuan, Y.; Li, J.; Ge, S.; Zhou, X. Receptive multi-granularity representation for person re-identification. IEEE Trans. Image Process. 2020, 29, 6096–6109.
33. Zheng, Z.; Yang, X.; Yu, Z.; Zheng, L.; Yang, Y.; Kautz, J. Joint discriminative and generative learning for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2138–2147.
34. Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. Interaction-and-aggregation network for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9317–9326.
35. Chen, G.; Lin, C.; Ren, L.; Lu, J.; Zhou, J. Self-critical attention learning for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9637–9646.
36. Gu, H.; Fu, G.; Li, J.; Zhu, J. Auto-ReID+: Searching for a multi-branch ConvNet for person re-identification. Neurocomputing 2021, 435, 53–66.
37. Jiao, S.; Pan, Z.; Hu, G.; Shen, Q.; Du, L.; Chen, Y.; Wang, J. Multi-scale and multi-branch feature representation for person re-identification. Neurocomputing 2020, 414, 120–130.
38. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based object re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 11–17 October 2021; pp. 14993–15002.
39. Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 11–17 October 2021; pp. 1005–1014.
40. Zheng, Z.; Zheng, L.; Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 3037–3045.
41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929.
42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).