Person Identification and Gender Classification Based on Vision Transformers for Periocular Images
Simple Summary
Abstract
1. Introduction
- (A)
- Perform image resizing and normalization. The images are resized to the same input size that was used during the development of the pre-trained model.
- (B)
- The motivation for vision transformers comes from sentence transformers in NLP (natural language processing), which expect sequences of words as input. In the image-processing context, however, sequences of words are not available. Hence, the entire image is split into patches of equal size so that these patches can be arranged as a sequence and fed to the encoder of the vision transformer. Note that, even after splitting the original image, the patches are still in 2-D form.
- (C)
- Each patch is subsequently flattened to form a 1-D vector from a 2-D matrix.
- (D)
- The 1-D vectors are then subjected to a linear mapping, which assigns an intermediate representation to each original 1-D vector. This representation may have a lower dimensionality than the original 1-D vector.
- (E)
- A learnable classification token is prepended to the sequence of linearly mapped 1-D vectors; the final state of this additional token is the representation used to predict the classification label.
- (F)
- Using a positional encoding function, positional information is added to the vectors; the encoded values typically follow frequency (sinusoidal) patterns (a minimal code sketch covering steps (A)-(F) is given after this list).
- (G)
- The vectors are further subjected to an encoding process. This encoding is a two-step procedure involving layer normalization and the multi-head self-attention mechanism. During layer normalization, a layer-specific mean subtraction and division by the standard deviation is performed. For multi-head self-attention, the following process is performed: for every processed 1-D vector pertaining to its 2-D patch, a dot product is computed between the query (Q) and key (K) vectors, which yields a similarity score between Q and K. This product is divided by the square root of the key vector's dimension. The result is passed through the softmax function, which yields a matrix whose values lie in the range of 0 to 1. This matrix is multiplied with the value (V) matrix, distributing the similarity amongst patches; this is the output of each attention head. At the end of this process, the head outputs are concatenated and a linear mapping is applied to reduce the dimensionality of the concatenated multi-head self-attention matrix (a sketch of this attention computation also follows the list).
- (H)
- The MLP (multi-layer perceptron) head comprises the final intermediate layers, which yield the probabilities of the predefined categories.
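The following is a minimal PyTorch sketch of steps (A)-(F), assuming a 224 × 224 input, 16 × 16 patches, and an embedding width of 192 (the ViT-T/16 configuration); the `PatchEmbedding` class and its parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# (A) Resize and normalize with ImageNet statistics (the dataset used for the pre-trained backbone).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class PatchEmbedding(nn.Module):
    """Steps (B)-(F): split into patches, flatten, linearly map,
    prepend a class token and add positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196 patches
        patch_dim = in_chans * patch_size * patch_size            # 3 * 16 * 16 = 768 values per patch
        self.proj = nn.Linear(patch_dim, embed_dim)               # (D) linear mapping of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                     # (E) learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))  # (F) positional embedding

    def forward(self, x):                                         # x: (B, 3, 224, 224)
        B, C, H, W = x.shape
        p = self.patch_size
        # (B) split into non-overlapping p x p patches, (C) flatten each patch to a 1-D vector
        patches = x.unfold(2, p, p).unfold(3, p, p)               # (B, C, 14, 14, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        tokens = self.proj(patches)                               # (B, 196, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                  # prepend class token -> (B, 197, embed_dim)
        return tokens + self.pos_embed                            # add positional information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 192])
```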
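Below is a sketch of the attention computation described in (G), shown for a single head for brevity; in the multi-head case the embedding is split across several such heads, whose outputs are concatenated and linearly projected back to the embedding width.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Step (G): similarity between queries and keys, scaled by sqrt(d_k),
    softmax-normalised and used to weight the values."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (tokens, tokens) similarity matrix
    weights = F.softmax(scores, dim=-1)             # each row sums to 1, values in [0, 1]
    return weights @ v                              # similarity-weighted combination of the values

# Toy example: 197 tokens (196 patches + class token), head dimension 64
tokens = torch.randn(197, 64)
out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention: Q, K, V derived from the same tokens
print(out.shape)  # torch.Size([197, 64])
```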
2. Related Works
3. Materials and Methods
3.1. Dataset
3.2. Transformers in NLP
3.3. Transformers in Vision
3.4. Proposed Approach
3.4.1. ROI Extraction
- Step 1
- —Let (x1, y1) be the coordinates of the medial canthus point (inner corner) and (x2, y2) be the coordinates of the lateral canthus point (outer corner). The Euclidean distance D between the medial and lateral canthus points is determined as mentioned in (2).
- Step 2
- —Compute a 2-D coordinate C = (Cx, Cy) using (3). This is a point on the line connecting the medial and lateral canthus points.
- Step 3
- —Calculate the rectangular ROI's upper-left coordinate (X1, Y1) and lower-right coordinate (X2, Y2) as mentioned in (4) and (5), respectively.
- Step 4
- —To obtain the rectangular ROI, crop the image using the calculated points (X1, Y1) and (X2, Y2) (a minimal code sketch of Steps 1-4 is given below).
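The sketch below illustrates Steps 1-4 under stated assumptions: since equations (2)-(5) are not reproduced here, the rectangle's half-width and half-height are expressed as hypothetical fractions (`w_ratio`, `h_ratio`) of the canthus distance D, and the reference point C is taken as the midpoint of the canthus line; the actual expressions follow the paper's equations.

```python
import numpy as np

def extract_periocular_roi(image, medial, lateral, w_ratio=0.6, h_ratio=0.4):
    """Crop a rectangular periocular ROI from the two canthus landmarks.

    medial, lateral : (x, y) coordinates of the inner and outer eye corners.
    w_ratio, h_ratio: hypothetical scale factors; the paper's equations (4)-(5)
                      define the actual rectangle size.
    """
    (x1, y1), (x2, y2) = medial, lateral
    # Step 1: Euclidean distance between the medial and lateral canthus points (eq. 2)
    d = np.hypot(x2 - x1, y2 - y1)
    # Step 2: reference point on the canthus line (taken here as the midpoint; placeholder for eq. 3)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Step 3: upper-left (X1, Y1) and lower-right (X2, Y2) corners of the ROI rectangle (eqs. 4 and 5)
    X1, Y1 = int(cx - w_ratio * d), int(cy - h_ratio * d)
    X2, Y2 = int(cx + w_ratio * d), int(cy + h_ratio * d)
    # Step 4: crop the rectangular ROI (clamped to the image bounds)
    h, w = image.shape[:2]
    X1, Y1 = max(X1, 0), max(Y1, 0)
    X2, Y2 = min(X2, w), min(Y2, h)
    return image[Y1:Y2, X1:X2]

roi = extract_periocular_roi(np.zeros((480, 640, 3), dtype=np.uint8),
                             medial=(300, 240), lateral=(380, 235))
print(roi.shape)
```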
3.4.2. Proposed Model
- Step 1
- —The initial step of development is to apply pre-processing operations: the images are resized to 224 × 224 and normalized with the ImageNet statistics, since ImageNet pre-trained weights are utilized to initialize the backbone of the architecture.
- Step 2
- —After this pre-processing step, the images and corresponding labels are randomly shuffled and passed as batches to the model to obtain the raw logits. The images are first broken down into patches and flattened using the linear projection matrix, and the positional embeddings are then added to them.
- Step 3
- —The transformer encoder block, analogous to the block proposed by Vaswani et al. [25], consists of self-attention blocks, normalization layers, fully connected layers, and residual connections. The attention blocks are multi-headed and can therefore focus on different patterns in the image.
- Step 4
- —The embedding from the terminal FC layer is then passed to two pooling layers, an average pool and a max pool, whose outputs are concatenated before finally being passed to the classification head.
- Step 5
- —Instead of a custom train-test split, stratified k-fold cross-validation with k = 5 is used to counterbalance the class imbalance. Unlike a purely random split, this ensures that the class proportions are preserved in each fold, and the model therefore becomes less prone to overfitting.
- Step 6
- —The fully connected layer outputs the class prediction using the softmax function, and the class with the highest probability becomes the predicted class.
- Step 7
- —This cross-validation strategy is also beneficial during inference: after training is completed for every fold, five models are available, and the mean of their predictions is taken as the final class (a condensed code sketch of Steps 5-7 follows).
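The sketch below condenses the cross-validation and ensembling logic of Steps 5-7, assuming a timm ViT backbone (`vit_base_patch16_224`) and placeholder `images`/`labels` tensors; the training loop itself (optimizer, epochs, augmentation) is omitted and is not the authors' exact procedure.

```python
import numpy as np
import torch
import timm
from sklearn.model_selection import StratifiedKFold

# Placeholder data: pre-processed 224 x 224 images and integer class labels (assumption).
images = torch.randn(100, 3, 224, 224)
labels = np.random.randint(0, 2, size=100)          # e.g. gender classification
num_classes = len(np.unique(labels))

# Step 5: stratified 5-fold split preserves the class proportions in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_models = []
for train_idx, val_idx in skf.split(images.numpy(), labels):
    model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=num_classes)
    # ... train on images[train_idx], validate on images[val_idx] (omitted) ...
    fold_models.append(model.eval())

# Steps 6-7: softmax over each fold model's logits, mean of the five probability
# vectors, and the class with the highest mean probability as the final prediction.
with torch.no_grad():
    test_batch = images[:8]
    probs = torch.stack([torch.softmax(m(test_batch), dim=1) for m in fold_models])
    predictions = probs.mean(dim=0).argmax(dim=1)
print(predictions)
```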
4. Results
4.1. Experiments for Person Identification
4.2. Experiments for Gender Classification
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Jain, A.K.; Flynn, P.; Ross, A.A. HandBook of Biometrics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; ISBN 9780387710402. [Google Scholar]
- Agarwal, S.; Punn, N.S.; Sonbhadra, S.K.; Tanveer, M.; Nagabhushan, P.; Pandian, K.K.S.; Saxena, P. Unleashing the Power of Disruptive and Emerging Technologies amid COVID-19: A Detailed Review. arXiv 2020, arXiv:2005.11507. [Google Scholar]
- Okereafor, K.; Ekong, I.; Okon Markson, I.; Enwere, K. Fingerprint Biometric System Hygiene and the Risk of COVID-19 Transmission. JMIR Biomed. Eng. 2020, 5, e19623. [Google Scholar] [CrossRef]
- Wei, J.; Wang, Y.; Wu, X.; He, Z.; He, R.; Sun, Z. Cross-Sensor Iris Recognition Using Adversarial Strategy and Sensor-Specific Information. In Proceedings of the 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems, BTAS 2019, Tampa, FL, USA, 23–26 September 2019. [Google Scholar] [CrossRef]
- Kumari, P.; Seeja, K.R. An Optimal Feature Enriched Region of Interest (ROI) Extraction for Periocular Biometric System. Multimed Tools Appl. 2021, 80, 33573–33591. [Google Scholar] [CrossRef] [PubMed]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Park, U.; Jillela, R.R.; Ross, A.; Jain, A.K. Periocular Biometrics in the Visible Spectrum. IEEE Trans. Inf. Forensics Secur. 2011, 6, 96–106. [Google Scholar] [CrossRef] [Green Version]
- Sharma, R.; Ross, A. Periocular biometrics and its relevance to partially masked faces: A survey. Comput. Vis. Image Underst. 2023, 226, 103583. [Google Scholar] [CrossRef]
- Hollingsworth, K.; Bowyer, K.W.; Flynn, P.J. Identifying Useful Features for Human Verification in Near-Infrared Periocular Images. Image Vis. Comput. 2011, 29, 707–715. [Google Scholar] [CrossRef]
- Hollingsworth, K.P.; Darnell, S.S.; Miller, P.E.; Woodard, D.L.; Bowyer, K.W.; Flynn, P.J. Human and Machine Performance on Periocular Biometrics under Near-Infrared Light and Visible Light. IEEE Trans. Inf. Forensics Secur. 2012, 7, 588–601. [Google Scholar] [CrossRef]
- Oh, B.S.; Oh, K.; Toh, K.A. On Projection-Based Methods for Periocular Identity Verification. In Proceedings of the 2012 7th IEEE Conference on Industrial Electronics and Applications, ICIEA, Singapore, 18–20 July 2012; pp. 871–876. [Google Scholar] [CrossRef]
- Smereka, J.M.; Kumar, B.V.K.V. What Is a “good” Periocular Region for Recognition? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 117–124. [Google Scholar] [CrossRef]
- Smereka, J.M.; Kumar, B.V.K.V.; Rodriguez, A. Selecting Discriminative Regions for Periocular Verification. In Proceedings of the ISBA 2016—IEEE International Conference on Identity, Security and Behavior Analysis, Sendai, Japan, 29 February–2 March 2016. [Google Scholar] [CrossRef]
- Ambika, D.R.; Radhika, K.R.; Seshachalam, D. Fusion of Shape and Texture for Unconstrained Periocular Authentication. Int. J. Comput. Inf. Eng. 2017, 11, 821–827. [Google Scholar]
- Zhao, Z.; Kumar, A. Accurate Periocular Recognition under Less Constrained Environment Using Semantics-Assisted Convolutional Neural Network. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1017–1030. [Google Scholar] [CrossRef]
- Chen, H.; Gao, M.; Ricanek, K.; Xu, W.; Fang, B. A Novel Race Classification Method Based on Periocular Features Fusion. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1–21. [Google Scholar] [CrossRef]
- Castrillón-Santana, M.; Lorenzo-Navarro, J.; Ramón-Balmaseda, E. On Using Periocular Biometric for Gender Classification in the Wild. Pattern Recognit. Lett. 2016, 82, 181–189. [Google Scholar] [CrossRef]
- Tapia, J.; Aravena, C.C. Gender Classification from Periocular NIR Images Using Fusion of CNNs Models. In Proceedings of the 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis, ISBA 2018. Singapore, 11–12 January 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Kuehlkamp, A.; Bowyer, K. Predicting Gender from Iris Texture May Be Harder than It Seems. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 904–912. [Google Scholar] [CrossRef] [Green Version]
- Kumari, P.; Seeja, K.R. Periocular Biometrics for Non-Ideal Images: With off-the-Shelf Deep CNN & Transfer Learning Approach. Procedia Comput. Sci. 2020, 167, 344–352. [Google Scholar] [CrossRef]
- Naseer, M.; Ranasinghe, K.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Intriguing Properties of Vision Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 23296–23308. [Google Scholar]
- Zhang, L.; Wen, Y. A Transformer-Based Framework for Automatic COVID19 Diagnosis in Chest CTs. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 513–518. [Google Scholar] [CrossRef]
- Guo, J.; Han, K.; Wu, H.; Xu, C.; Tang, Y.; Xu, C.; Wang, Y. CMT: Convolutional Neural Networks Meet Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1–14. [Google Scholar]
- Bhojanapalli, S.; Chakrabarti, A.; Glasner, D.; Li, D.; Unterthiner, T.; Veit, A. Understanding Robustness of Transformers for Image Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 10211–10221. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
- Xia, Y.; He, T.; Tan, X.; Tian, F.; He, D.; Qin, T. Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; Volume 5, pp. 5466–5473. [Google Scholar] [CrossRef] [Green Version]
- Padole, C.N.; Proenca, H. Periocular Recognition: Analysis of Performance Degradation Factors. In Proceedings of the 2012 5th IAPR International Conference on Biometrics, ICB 2012, New Delhi, India, 29 March–1 April 2012; pp. 439–445. [Google Scholar] [CrossRef] [Green Version]
- Available online: www.youtube.com (accessed on 31 January 2023).
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL HLT 2019—2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Wightman, R. ViT Training Details. Issue #252, rwightman/pytorch-image-models, GitHub. Available online: https://github.com/rwightman/pytorch-image-models/issues/252#issuecomment-713838112 (accessed on 6 June 2022).
- Liu, P.; Guo, J.M.; Tseng, S.H.; Wong, K.S.; der Lee, J.; Yao, C.C.; Zhu, D. Ocular Recognition for Blinking Eyes. IEEE Trans. Image Process. 2017, 26, 5070–5081. [Google Scholar] [CrossRef] [PubMed]
- Kumari, P.; Seeja, K.R. A Novel Periocular Biometrics solution for authentication during COVID-19 pandemic situation. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 10321–10337. [Google Scholar] [CrossRef] [PubMed]
| ViT Model/Patch | Width | Depth (Layers) | MLP Size | Heads | Parameters (Million) | GFLOPS |
|---|---|---|---|---|---|---|
| T/16 | 192 | 12 | 768 | 3 | 5.5 | 2.5 |
| B/16 | 768 | 12 | 3072 | 12 | 86 | 35.1 |
| S/16 | 256 | 6 | 1024 | 8 | 5.0 | 2.2 |
| B/32 | 768 | 12 | 3072 | 12 | 87 | 8.7 |
| S/32 | 384 | 12 | 1536 | 6 | 22 | 2.3 |
| Experiments on the UBIPr Dataset | Models | ViT Accuracy: Original, K-Fold (5) | ViT Accuracy: Improved, K-Fold (5) | ViT Accuracy: Train-Val-Test (6025-1152-3075) | Earlier Results: Train-Val-Test (6025-1152-3075) |
|---|---|---|---|---|---|
| Person Identification (PI) | ViT-T/16 | 94.821 | 96.196 | 92.580 | 93.83 [33] |
| | ViT-B/16 | 98.186 | 98.147 | 96.000 | |
| | ViT-S/16 | 94.645 | 96.079 | 92.650 | |
| | ViT-B/32 | 94.625 | 93.381 | 85.630 | |
| | ViT-S/32 | 91.641 | 93.494 | 87.840 | |
| Experiments on the UBIPr Dataset | Models | ViT Accuracy: Original, K-Fold (5) | ViT Accuracy: Improved, K-Fold (5) | ViT Accuracy: Train-Val-Test (6025-1152-3075) | Earlier Results: Train-Val-Test (6025-1152-3075) |
|---|---|---|---|---|---|
| Gender Classification (GC) | ViT-T/16 | 98.605 | 98.430 | 97.820 | 95.00 [33] |
| | ViT-B/16 | 99.040 | 99.132 | 98.530 | |
| | ViT-S/16 | 98.440 | 98.664 | 98.170 | |
| | ViT-B/32 | 97.580 | 97.464 | 96.170 | |
| | ViT-S/32 | 97.776 | 97.386 | 96.580 | |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).