Modelling Sign Language with Encoder-Only Transformers and Human Pose Estimation Keypoint Data
Abstract
1. Introduction
1.1. Sign Language
1.2. Modelling
1.3. Datasets
1.4. Our Approach
1.5. Related Work
1.6. Contributions
1.7. Article Organisation
2. Materials and Methods
2.1. Dataset
Algorithm 1 Data normalisation algorithm
Require: data ▹ [sequences, frames, keypoints, x or y value], e.g., [64, 203, 54, 2]
Require: mask ▹ same shape as data; specifies keypoints with valid values
1: function NORM(data, mask)
2:   for each sequence do
3:     compute the mean x value of scaling keypoint 0
4:     compute the mean x value of scaling keypoint 1
5:     compute the mean y value of scaling keypoint 0
6:     compute the mean y value of scaling keypoint 1
7:     compute the x distance between the mean scaling keypoints
8:     compute the y distance between the mean scaling keypoints
9:     normalise the keypoints to the mean scaling-keypoint distances
10:    compute the mean x value of the new origin keypoint
11:    compute the mean y value of the new origin keypoint
12:    translate the keypoints to the new origin
13:  end for
14:  return data
15: end function
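A minimal NumPy sketch of the normalisation that Algorithm 1 describes: per sequence, the keypoints are scaled by the mean distance between two reference ("scaling") keypoints and then translated so that a chosen keypoint becomes the origin. The scaling and origin keypoint indices below are placeholder arguments (the paper treats them as experimental parameters), and `norm` is our naming.

```python
import numpy as np

def norm(data, mask, scale_kps=(0, 1), origin_kp=0):
    """Sketch of Algorithm 1: per-sequence keypoint normalisation.

    data : float array [sequences, frames, keypoints, 2] with (x, y) values
    mask : bool array, same shape; True where a keypoint value is valid
    scale_kps : indices of the two scaling keypoints (placeholder choice)
    origin_kp : index of the new-origin keypoint (placeholder choice)
    """
    data = data.copy()
    a, b = scale_kps
    for s in range(data.shape[0]):
        seq, m = data[s], mask[s]
        # Mean x and y of each scaling keypoint over the valid frames.
        ax = seq[:, a, 0][m[:, a, 0]].mean()
        ay = seq[:, a, 1][m[:, a, 1]].mean()
        bx = seq[:, b, 0][m[:, b, 0]].mean()
        by = seq[:, b, 1][m[:, b, 1]].mean()
        # x and y distances between the mean scaling keypoints.
        dx, dy = abs(bx - ax), abs(by - ay)
        # Normalise all coordinates to the mean scaling-keypoint distances.
        seq[:, :, 0] /= dx
        seq[:, :, 1] /= dy
        # Mean position of the new-origin keypoint, then translate to it.
        ox = seq[:, origin_kp, 0][m[:, origin_kp, 0]].mean()
        oy = seq[:, origin_kp, 1][m[:, origin_kp, 1]].mean()
        seq[:, :, 0] -= ox
        seq[:, :, 1] -= oy
    return data
```

After this transform, the origin keypoint sits at (0, 0) on average and the scaling-keypoint separation defines the unit of length, making sequences comparable across signers and camera framings.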
2.2. Model
2.3. Model Regularisation
2.4. Experimental Setup
2.5. Experiments
3. Results
4. Discussion
4.1. Model Performance
4.2. Model Training Practicalities
4.3. Model Architecture and Parameters
4.4. Normalisation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ASL | American Sign Language
BERT | Bidirectional Encoder Representations from Transformers
BSL | British Sign Language
DGS | German Sign Language
GRU | gated recurrent unit
HCI | human–computer interaction
I3D | Inflated 3D ConvNet
NLP | natural language processing
ReLU | rectified linear unit
SPOTER | Sign POse-based TransformER
VM | virtual machine
WLASL | Word-level American Sign Language dataset
WLASL-alt | Word-level American Sign Language alternative dataset
References
- Vamplew, P.W. Recognition of Sign Language Using Neural Networks. Ph.D. Thesis, University of Tasmania, Hobart, Australia, 1996. [Google Scholar]
- Starner, T.; Weaver, J.; Pentland, A. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1371–1375. [Google Scholar] [CrossRef]
- Stokoe, W.C. Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf; University of Buffalo: Buffalo, NY, USA, 1960. [Google Scholar]
- Tamura, S.; Kawasaki, S. Recognition of Sign Language Motion Images. Pattern Recognit. 1988, 21, 343–353. [Google Scholar] [CrossRef]
- Vogler, C.; Sun, H.; Metaxas, D. A Framework for Motion Recognition with Applications to American Sign Language and Gait Recognition. In Proceedings of the Workshop on Human Motion, Austin, TX, USA, 7–8 December 2000; pp. 33–38. [Google Scholar] [CrossRef]
- Kim, S.; Waldron, M.B. Adaptation of Self Organizing Network for ASL Recognition. In Proceedings of the 15th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, USA, 31 October 1993; p. 254. [Google Scholar] [CrossRef]
- Waldron, M.B.; Kim, S. Isolated ASL Sign Recognition System for Deaf Persons. IEEE Trans. Rehabil. Eng. 1995, 3, 261–271. [Google Scholar] [CrossRef]
- Vogler, C.; Metaxas, D. Parallel Hidden Markov Models for American Sign Language Recognition. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 1, pp. 116–122. [Google Scholar] [CrossRef]
- Kadir, T.; Bowden, R.; Ong, E.J.; Zisserman, A. Minimal Training, Large Lexicon, Unconstrained Sign Language Recognition. In Proceedings of the British Machine Vision Conference, Kingston, UK, 7–9 September 2004; Hoppe, A., Barman, S., Ellis, T., Eds.; pp. 96.1–96.10. [Google Scholar] [CrossRef]
- Cooper, H.; Bowden, R. Sign Language Recognition Using Linguistically Derived Sub-Units. In Proceedings of the Language Resources and Evaluation Conference Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Languages Technologies, MCC, Valletta, Malta, 17–23 May 2010; pp. 1–5. [Google Scholar]
- Theodorakis, S.; Pitsikalis, V.; Maragos, P. Model-Level Data-Driven Sub-Units for Signs in Videos of Continuous Sign Language. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 2262–2265. [Google Scholar] [CrossRef]
- Pitsikalis, V.; Theodorakis, S.; Vogler, C.; Maragos, P. Advances in Phonetics-Based Sub-Unit Modeling for Transcription Alignment and Sign Language Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1–6. [Google Scholar] [CrossRef]
- Cooper, H.; Ong, E.J.; Pugeault, N.; Bowden, R. Sign Language Recognition Using Sub-Units. J. Mach. Learn. Res. 2012, 13, 2205–2231. [Google Scholar] [CrossRef]
- Koller, O.; Ney, H.; Bowden, R. May the Force Be with You: Force-aligned Signwriting for Automatic Subunit Annotation of Corpora. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Zhang, J.; Zhou, W.; Xie, C.; Pu, J.; Li, H. Chinese Sign Language Recognition with Adaptive HMM. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar] [CrossRef]
- Camgöz, N.C.; Hadfield, S.; Koller, O.; Bowden, R. SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3075–3084. [Google Scholar] [CrossRef]
- Mittal, A.; Kumar, P.; Roy, P.P.; Balasubramanian, R.; Chaudhuri, B.B. A Modified LSTM Model for Continuous Sign Language Recognition Using Leap Motion. IEEE Sens. J. 2019, 19, 7056–7063. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Long Beach Convention and Entertainment Center: Long Beach, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; (Long and Short Papers). Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
- Hosemann, J. Eye Gaze and Verb Agreement in German Sign Language: A First Glance. Sign Lang. Linguist. 2011, 14, 76–93. [Google Scholar] [CrossRef]
- LeMaster, B. What Difference Does Difference Make?: Negotiating Gender and Generation in Irish Sign Language. In Gendered Practices in Language; Benor, S., Rose, M., Sharma, D., Sweetland, J., Zhang, Q., Eds.; CSLI Publications, Stanford University: Stanford, CA, USA, 2002; pp. 309–338. [Google Scholar]
- Klomp, U. Conditional Clauses in Sign Language of the Netherlands: A Corpus-Based Study. Sign Lang. Stud. 2019, 19, 309–347. [Google Scholar] [CrossRef]
- Bickford, J.A.; Fraychineaud, K. Mouth Morphemes in ASL: A Closer Look. In Proceedings of the Theoretical Issues in Sign Language Research Conference, Florianopolis, Brazil, 6–9 December 2006; pp. 32–47. [Google Scholar]
- Bragg, D.; Koller, O.; Bellard, M.; Berke, L.; Boudreault, P.; Braffort, A.; Caselli, N.; Huenerfauth, M.; Kacorri, H.; Verhoef, T.; et al. Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective. In Proceedings of the ASSETS 2019—21st International ACM SIGACCESS Conference on Computers and Accessibility, Pittsburgh, PA, USA, 28–30 October 2019; pp. 16–31. [Google Scholar] [CrossRef]
- Emmorey, K. Language and Space (Excerpt). In Space: In Science, Art, and Society; Penz, F., Radick, G., Howell, R., Eds.; Cambridge University Press: Cambridge, UK, 2004; pp. 22–45. [Google Scholar]
- Woll, B. Digiti Lingua: A Celebration of British Sign Language and Deaf Culture; The Royal Society: London, UK, 2013. [Google Scholar]
- Quer, J.; Steinbach, M. Ambiguities in Sign Languages. Linguist. Rev. 2015, 32, 143–165. [Google Scholar] [CrossRef]
- Kramer, J.; Leifer, L. The Talking Glove. ACM SIGCAPH Comput. Phys. Handicap. 1988, 39, 12–16. [Google Scholar] [CrossRef]
- Massachusetts Institute of Technology. Ryan Patterson, American Sign Language Translator/Glove. 2002. Available online: https://lemelson.mit.edu/resources/ryan-patterson (accessed on 20 March 2023).
- Osika, M. EnableTalk. 2012. Available online: https://web.archive.org/web/20200922151309/https://enabletalk.com/welcome-to-enabletalk/ (accessed on 27 February 2023).
- Lin, M.; Villalba, R. Sign Language Glove. 2014. Available online: https://people.ece.cornell.edu/land/courses/ece4760/FinalProjects/f2014/rdv28_mjl256/webpage/ (accessed on 20 March 2023).
- BrightSign Technology Limited. The BrightSign Glove. 2015. Available online: https://www.brightsignglove.com/ (accessed on 20 March 2023).
- Pryor, T.; Azodi, N. SignAloud: Gloves That Transliterate Sign Language into Text and Speech, Lemelson-MIT Student Prize Undergraduate Team Winner. 2016. Available online: https://web.archive.org/web/20161216144128/https://lemelson.mit.edu/winners/thomas-pryor-and-navid-azodi (accessed on 20 March 2023).
- Avalos, J.M.L. IPN Engineer Develops a System for Sign Translation. 2016. Available online: http://www.cienciamx.com/index.php/tecnologia/robotica/5354-sistema-para-traduccion-de-senas-en-mexico-e-directa (accessed on 20 March 2023).
- O’Connor, T.F.; Fach, M.E.; Miller, R.; Root, S.E.; Mercier, P.P.; Lipomi, D.J. The Language of Glove: Wireless Gesture Decoder with Low-Power and Stretchable Hybrid Electronics. PLoS ONE 2017, 12, e0179766. [Google Scholar] [CrossRef]
- Allela, R.; Muthoni, C.; Karibe, D. SIGN-IO. 2019. Available online: http://sign-io.com/ (accessed on 20 March 2023).
- Forshay, L.; Winter, K.; Bender, E.M. Open Letter to UW’s Office of News & Information about the SignAloud Project. 2016. Available online: http://depts.washington.edu/asluw/SignAloud-openletter.pdf (accessed on 20 March 2023).
- Erard, M. Why Sign Language Gloves Don’t Help Deaf People. Deaf Life 2019, 24, 22–39. [Google Scholar]
- Dafnis, K.M.; Chroni, E.; Neidle, C.; Metaxas, D.N. Bidirectional Skeleton-Based Isolated Sign Recognition Using Graph Convolutional Networks. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022. [Google Scholar]
- Johnston, T. Auslan Corpus Annotation Guidelines. 2013. Available online: https://media.auslan.org.au/attachments/AuslanCorpusAnnotationGuidelines_Johnston.pdf (accessed on 20 March 2023).
- Cormier, K.; Fenlon, J. BSL Corpus Annotation Guidelines. 2014. Available online: https://bslcorpusproject.org/wp-content/uploads/BSLCorpusAnnotationGuidelines_23October2014.pdf (accessed on 20 March 2023).
- Crasborn, O.; Bank, R.; Cormier, K. Digging into Signs: Towards a Gloss Annotation Standard for Sign Language Corpora. In Proceedings of the 7th Workshop on the Representation and Processing of Sign Languages: Corpus Mining, Language Resources and Evaluation Conference, Portorož, Slovenia, 28 May 2016; pp. 1–11. [Google Scholar] [CrossRef]
- Mesch, J.; Wallin, L. Gloss Annotations in the Swedish Sign Language Corpus. Int. J. Corpus Linguist. 2015, 20, 102–120. [Google Scholar] [CrossRef]
- Gries, S.T.; Berez, A.L. Handbook of Linguistic Annotation; Springer: Dordrecht, The Netherlands, 2017. [Google Scholar] [CrossRef]
- Koller, O.; Ney, H.; Bowden, R. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3793–3802. [Google Scholar] [CrossRef]
- Hosain, A.A.; Santhalingam, P.S.; Pathak, P.; Rangwala, H.; Kosecka, J. FineHand: Learning Hand Shapes for American Sign Language Recognition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 700–707. [Google Scholar] [CrossRef]
- Mukushev, M.; Imashev, A.; Kimmelman, V.; Sandygulova, A. Automatic Classification of Handshapes in Russian Sign Language. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, Marseille, France, 11–16 May 2020; pp. 165–170. [Google Scholar]
- Rios-Figueroa, H.V.; Sánchez-García, A.J.; Sosa-Jiménez, C.O.; Solís-González-Cosío, A.L. Use of Spherical and Cartesian Features for Learning and Recognition of the Static Mexican Sign Language Alphabet. Mathematics 2022, 10, 2904. [Google Scholar] [CrossRef]
- Yang, S.H.; Cheng, Y.M.; Huang, J.W.; Chen, Y.P. RFaNet: Receptive Field-Aware Network with Finger Attention for Fingerspelling Recognition Using a Depth Sensor. Mathematics 2021, 9, 2815. [Google Scholar] [CrossRef]
- Goldin-Meadow, S.; Brentari, D. Gesture, Sign, and Language: The Coming of Age of Sign Language and Gesture Studies. Behav. Brain Sci. 2017, 40, e46. [Google Scholar] [CrossRef]
- Antonakos, E.; Roussos, A.; Zafeiriou, S. A Survey on Mouth Modeling and Analysis for Sign Language Recognition. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; pp. 1–7. [Google Scholar] [CrossRef]
- Capek, C.M.; Waters, D.; Woll, B.; MacSweeney, M.; Brammer, M.J.; McGuire, P.K.; David, A.S.; Campbell, R. Hand and Mouth: Cortical Correlates of Lexical Processing in British Sign Language and Speechreading English. J. Cogn. Neurosci. 2008, 20, 1220–1234. [Google Scholar] [CrossRef]
- Koller, O.; Ney, H.; Bowden, R. Deep Learning of Mouth Shapes for Sign Language. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, Chile, 7–13 December 2015; pp. 477–483. [Google Scholar] [CrossRef]
- Wilson, N.; Brumm, M.; Grigat, R.R. Classification of Mouth Gestures in German Sign Language Using 3D Convolutional Neural Networks. In Proceedings of the 10th International Conference on Pattern Recognition Systems (ICPRS-2019), Tours, France, 8–10 July 2019; Institution of Engineering and Technology: Tours, France, 2019; pp. 52–57. [Google Scholar] [CrossRef]
- Michael, N.; Yang, P.; Liu, Q.; Metaxas, D.; Neidle, C. A Framework for the Recognition of Nonmanual Markers in Segmented Sequences of American Sign Language. In Proceedings of the British Machine Vision Conference, Dundee, UK, 29 August–2 September 2011; British Machine Vision Association: Dundee, UK, 2011; pp. 124.1–124.12. [Google Scholar] [CrossRef]
- Antonakos, E.; Pitsikalis, V.; Maragos, P. Classification of Extreme Facial Events in Sign Language Videos. EURASIP J. Image Video Process. 2014, 2014, 14. [Google Scholar] [CrossRef]
- Metaxas, D.; Dilsizian, M.; Neidle, C. Scalable ASL Sign Recognition Using Model-Based Machine Learning and Linguistically Annotated Corpora. In Proceedings of the 8th Workshop on the Representation & Processing of Sign Languages: Involving the Language Community, Language Resources and Evaluation Conference, Miyazaki, Japan, 12 May 2018. [Google Scholar]
- Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-Channel Transformers for Multi-articulatory Sign Language Translation. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020) Part XI, Glasgow, UK, 23–28 August 2020; pp. 1–18. [Google Scholar]
- Weast, T.P. Questions in American Sign Language: A Quantitative Analysis of Raised and Lowered Eyebrows. Ph.D. Thesis, University of Texas at Arlington, Arlington, TX, USA, 2008. [Google Scholar]
- Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep Learning Applications and Challenges in Big Data Analytics. J. Big Data 2015, 2, 1–21. [Google Scholar] [CrossRef]
- Von Agris, U.; Blömer, C.; Kraiss, K.F. Rapid Signer Adaptation for Continuous Sign Language Recognition Using a Combined Approach of Eigenvoices, MLLR, and MAP. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar] [CrossRef]
- Gweth, Y.L.; Plahl, C.; Ney, H. Enhanced Continuous Sign Language Recognition Using PCA and Neural Network Features. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 55–60. [Google Scholar] [CrossRef]
- Forster, J.; Koller, O.; Oberdörfer, C.; Gweth, Y.; Ney, H. Improving Continuous Sign Language Recognition: Speech Recognition Techniques and System Design. In Proceedings of the SLPAT 2013, 4th Workshop on Speech and Language Processing for Assistive Technologies, Grenoble, France, 21–22 August 2013; pp. 41–46. [Google Scholar]
- Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325. [Google Scholar] [CrossRef]
- Cui, R.; Liu, H.; Zhang, C. A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training. IEEE Trans. Multimed. 2019, 21, 1880–1891. [Google Scholar] [CrossRef]
- Forster, J.; Schmidt, C.; Hoyoux, T.; Koller, O.; Zelle, U.; Piater, J.; Ney, H. RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 23–25 May 2012; pp. 3785–3789. [Google Scholar]
- Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [Google Scholar] [CrossRef]
- Camgöz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793. [Google Scholar] [CrossRef]
- Schmidt, C.; Koller, O.; Ney, H. Enhancing Gloss-Based Corpora with Facial Features Using Active Appearance Model. In Proceedings of the International Symposium on Sign Language Translation and Avatar Technology, Chicago, IL, USA, 18–19 October 2013; pp. 1–7. [Google Scholar]
- Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; Li, W. Video-Based Sign Language Recognition without Temporal Segmentation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2257–2264. [Google Scholar]
- Konstantinidis, D.; Dimitropoulos, K.; Daras, P. A Deep Learning Approach for Analyzing Video and Skeletal Features in Sign Language Recognition. In Proceedings of the 2018 IEEE International Conference on Imaging Systems and Techniques (IST), Krakow, Poland, 16–18 October 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Wang, S.; Guo, D.; Zhou, W.G.; Zha, Z.J.; Wang, M. Connectionist Temporal Fusion for Sign Language Translation. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 26 October 2018; pp. 1483–1491. [Google Scholar] [CrossRef]
- Elakkiya, R.; Selvamani, K. Subunit Sign Modeling Framework for Continuous Sign Language Recognition. Comput. Electr. Eng. 2019, 74, 379–390. [Google Scholar] [CrossRef]
- Guo, D.; Wang, S.; Tian, Q.; Wang, M. Dense Temporal Convolution Network for Sign Language Translation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 744–750. [Google Scholar] [CrossRef]
- Pu, J.; Zhou, W.; Li, H. Iterative Alignment Network for Continuous Sign Language Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4160–4169. [Google Scholar] [CrossRef]
- Zhang, Z.; Pu, J.; Zhuang, L.; Zhou, W.; Li, H. Continuous Sign Language Recognition via Reinforcement Learning. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 285–289. [Google Scholar] [CrossRef]
- Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1–11. [Google Scholar]
- Koller, O. Towards Large Vocabulary Continuous Sign Language Recognition: From Artificial to Real-Life Tasks. Ph.D. Thesis, RWTH Aachen University, Aachen, Germany, 2020. [Google Scholar]
- Stoll, S.; Camgoz, N.C.; Hadfield, S.; Bowden, R. Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks. Int. J. Comput. Vis. 2020, 128, 891–908. [Google Scholar] [CrossRef]
- Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13009–13016. [Google Scholar] [CrossRef]
- Papastratis, I.; Dimitropoulos, K.; Daras, P. Continuous Sign Language Recognition through a Context-Aware Generative Adversarial Network. Sensors 2021, 21, 2437. [Google Scholar] [CrossRef] [PubMed]
- Tang, S.; Hong, R.; Guo, D.; Wang, M. Gloss Semantic-Enhanced Network with Online Back-Translation for Sign Language Production. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; ACM: Lisboa, Portugal, 2022; pp. 5630–5638. [Google Scholar] [CrossRef]
- Schembri, A.; Fenlon, J.; Rentelis, R.; Reynolds, S.; Cormier, K. Building the British Sign Language Corpus. Lang. Doc. 2013, 7, 136–154. [Google Scholar]
- Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; DeHaan, K.; Metze, F.; Torres, J.; Giro-i-Nieto, X. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Proceedings of the 2021 IEEE CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1–14. [Google Scholar]
- Li, D.; Opazo, C.R.; Yu, X.; Li, H. Word-Level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1448–1458. [Google Scholar] [CrossRef]
- Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef]
- Hosain, A.A.; Selvam Santhalingam, P.; Pathak, P.; Rangwala, H.; Kosecka, J. Hand Pose Guided 3D Pooling for Word-level Sign Language Recognition. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3428–3438. [Google Scholar] [CrossRef]
- Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-Based Sign Language Recognition Using GCN and BERT. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 5–9 January 2021; pp. 31–40. [Google Scholar] [CrossRef]
- Bohacek, M.; Hruz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2022; pp. 182–191. [Google Scholar] [CrossRef]
- Eunice, J.; J, A.; Sei, Y.; Hemanth, D.J. Sign2Pose: A Pose-Based Approach for Gloss Prediction Using a Transformer Model. Sensors 2023, 23, 2853. [Google Scholar] [CrossRef]
- Neidle, C.; Ballard, C. Revised Gloss Labels for Signs from the WLASL Dataset: Preliminary Version. 2022. Available online: https://www.bu.edu/asllrp/wlasl-alt-glosses.pdf (accessed on 20 March 2023).
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. [Google Scholar] [CrossRef]
- Shanker, M.; Hu, M.; Hung, M. Effect of Data Standardization on Neural Network Training. Omega 1996, 24, 385–397. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the Ninth International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T.Y. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
- Liu, X.; Yu, H.F.; Dhillon, I.S.; Hsieh, C.J. Learning to Encode Position for Transformer with Continuous Dynamical Model. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Volume 119. [Google Scholar]
- Embedding—PyTorch 1.9.0 Documentation. Available online: https://pytorch.org/docs/1.9.0/generated/torch.nn.Embedding.html (accessed on 20 March 2023).
- Poulinakis, K.; Drikakis, D.; Kokkinakis, I.W.; Spottswood, S.M. Machine-Learning Methods on Noisy and Sparse Data. Mathematics 2023, 11, 236. [Google Scholar] [CrossRef]
- LogSoftmax—PyTorch 1.9.0 Documentation. Available online: https://pytorch.org/docs/1.9.0/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax (accessed on 20 March 2023).
- CrossEntropyLoss—PyTorch 1.9.0 Documentation. Available online: https://pytorch.org/docs/1.9.0/generated/torch.nn.CrossEntropyLoss.html?highlight=cross%20entropy%20loss#torch.nn.CrossEntropyLoss (accessed on 20 March 2023).
- Adam—PyTorch 1.9.0 Documentation. Available online: https://pytorch.org/docs/1.9.0/generated/torch.optim.Adam.html (accessed on 20 March 2023).
- CosineAnnealingWarmRestarts—PyTorch 1.9.0 Documentation. Available online: https://pytorch.org/docs/1.9.0/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html (accessed on 20 March 2023).
- Cranfield University. Digital Aviation Research and Technology Centre. 2023. Available online: https://www.cranfield.ac.uk/centres/digital-aviation-research-and-technology-centre (accessed on 20 March 2023).
- Emmorey, K.; Thompson, R.; Colvin, R. Eye Gaze during Comprehension of American Sign Language by Native and Beginning Signers. J. Deaf Stud. Deaf Educ. 2009, 14, 237–243. [Google Scholar] [CrossRef]
Model | WLASL-100 Top-1 | WLASL-100 Top-5 | WLASL-100 Top-10 | WLASL-300 Top-1 | WLASL-300 Top-5 | WLASL-300 Top-10
---|---|---|---|---|---|---
Pose-TGCN [85] | | | | | |
Pose-GRU [85] | | | | | |
GCN-BERT [88] | | | | | |
SPOTER [89] | – | – | – | – | |
Sign2Pose [90] | – | – | – | – | |
Keypoint Group | Keypoint Labels 1 | Number of Keypoints |
---|---|---|
Body pose | 0–7, 15–18 | 12 |
Left hand | 0–20 | 21 |
Right hand | 0–20 | 21 |
All | – | 54 |
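The table's index selection can be written out directly. This sketch assumes the per-frame keypoints arrive in OpenPose's usual layout (25 BODY_25 body keypoints and 21 keypoints per hand); the constant and function names are ours.

```python
import numpy as np

# Indices from the keypoint-group table: body-pose keypoints 0-7 and 15-18
# (12 in total), plus all 21 keypoints of each hand, for 54 keypoints per frame.
BODY_KEEP = list(range(0, 8)) + list(range(15, 19))
HAND_KEEP = list(range(21))

def select_keypoints(body, left_hand, right_hand):
    """Stack the retained keypoints into one [54, 2] (x, y) array.

    body: [25, 2] BODY_25 pose keypoints; left_hand/right_hand: [21, 2].
    """
    return np.concatenate([body[BODY_KEEP],
                           left_hand[HAND_KEEP],
                           right_hand[HAND_KEEP]])
```

Dropping body keypoints 8-14 and 19-24 discards the lower body, which carries little signing information, while keeping the face-adjacent head points and both full hands.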
Dataset Split | Number of Examples |
---|---|
Training | 11,445 |
Validation | 2298 |
Test | 2145 |
All | 15,888 |
Classes | Training Examples | Validation Examples | Testing Examples | Total Examples |
---|---|---|---|---|
10 | 282 | 68 | 65 | 415 |
50 | 1052 | 246 | 241 | 1539 |
100 | 1842 | 418 | 403 | 2663 |
300 | 4302 | 950 | 889 | 6141 |
Parameter | Value(s) |
---|---|
108 | |
108 | |
Attention heads 1 | |
Activation function | ReLU |
Dropout |
Parameter | Value(s) |
---|---|
Weight decay | |
Learning rate | |
Parameter | Value(s) |
---|---|
Encoder layers | |
Sign classes | 10 |
Encoder attention heads | 4 |
Norm. centroid keypoint | |
Norm. scaling keypoints | |
Encoder FFN dimension | 108 |
Encoder dropout | 0 |
Embedding dropout | 0 |
Batch size | 64 |
Encoder activation function | ReLU |
Augmentation | None |
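Note that 54 keypoints × 2 coordinates = 108, matching the encoder dimension listed in the parameter tables. Treating each frame's flattened keypoints as one encoder token is our reading of this setup rather than a quoted implementation detail; a sketch:

```python
import numpy as np

def frames_to_tokens(clip):
    """Flatten a [frames, 54, 2] keypoint clip into a [frames, 108] token
    sequence: one token per frame, with 54 keypoints x 2 coordinates = 108
    features, matching the listed model dimension."""
    frames, n_kps, n_coords = clip.shape
    return clip.reshape(frames, n_kps * n_coords)
```

A clip of, say, 203 frames then becomes a sequence of 203 tokens of width 108, ready for positional encoding and the transformer encoder stack.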
Encoder Layers | Model Parameters | Train Top-1 Accuracy | Validation Top-1 Accuracy | Test-A Top-1 Accuracy | Test-L Top-1 Accuracy |
---|---|---|---|---|---|
1 | 94,078 | ||||
2 | 165,142 | ||||
3 | 236,206 | ||||
4 | 307,270 | ||||
5 | 378,334 |
Classes | Attention Heads | Train Top-1 Accuracy | Validation Top-1 Accuracy | Test-A Top-1 Accuracy | Test-L Top-1 Accuracy
---|---|---|---|---|---
10 | 1 | | | |
 | 2 | | | |
 | 3 | | | |
 | 4 | | | |
 | 6 | | | |
 | 9 | | | |
50 | 1 | | | |
 | 2 | | | |
 | 3 | | | |
 | 4 | | | |
 | 6 | | | |
 | 9 | | | |
100 | 1 | | | |
 | 2 | | | |
 | 3 | | | |
 | 4 | | | |
 | 6 | | | |
 | 9 | | | |
300 | 1 | | | |
 | 2 | | | |
 | 3 | | | |
 | 4 | | | |
 | 6 | | | |
 | 9 | | | |
Classes | Top-k | Attention Heads | Metric A, L or Same | Accuracy | Metric Accuracy
---|---|---|---|---|---
10 | 1 | 3 | L | |
 | 5 | 3 | A | |
 | 10 | 1 | Same | |
50 | 1 | 4 | A | |
 | 5 | 3 | A | |
 | 10 | 3 | A | |
100 | 1 | 4 | L | |
 | 5 | 6 | A | |
 | 10 | 6 | A | |
300 | 1 | 6 | L | |
 | 5 | 6 | A | |
 | 10 | 6 | L | |
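The Top-1/Top-5/Top-10 metrics reported throughout can be computed from raw class scores as follows. This is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true class is among the k highest scores.

    scores: [n_examples, n_classes] model outputs (logits or probabilities)
    labels: [n_examples] integer class labels
    """
    # Indices of the k best-scoring classes for each example.
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()
```

Top-1 is the usual classification accuracy; Top-5 and Top-10 credit a prediction whenever the correct gloss appears anywhere in the model's five or ten most confident classes.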
Classes | Attention Heads | Test-A Top-1 Accuracy Range | Test-L Top-1 Accuracy Range | Test-A Top-1 Uncertainty | Test-L Top-1 Uncertainty
---|---|---|---|---|---
10 | 1 | 0.0938 | 0.0781 | 0.0204 | 0.0183
 | 2 | 0.0703 | 0.0547 | 0.0172 | 0.0140
 | 3 | 0.0703 | 0.0703 | 0.0154 | 0.0144
 | 4 | 0.0781 | 0.0781 | 0.0202 | 0.0174
 | 6 | 0.0781 | 0.0469 | 0.0168 | 0.0098
 | 9 | 0.0625 | 0.0781 | 0.0141 | 0.0161
50 | 1 | 0.0441 | 0.0843 | 0.0130 | 0.0189
 | 2 | 0.0872 | 0.0677 | 0.0173 | 0.0154
 | 3 | 0.0767 | 0.0607 | 0.0181 | 0.0142
 | 4 | 0.0749 | 0.0699 | 0.0178 | 0.0163
 | 6 | 0.1122 | 0.0645 | 0.0220 | 0.0152
 | 9 | 0.0433 | 0.0829 | 0.0105 | 0.0184
100 | 1 | 0.0366 | 0.0798 | 0.0099 | 0.0164
 | 2 | 0.0737 | 0.0842 | 0.0152 | 0.0169
 | 3 | 0.0739 | 0.0373 | 0.0162 | 0.0093
 | 4 | 0.0841 | 0.0631 | 0.0181 | 0.0164
 | 6 | 0.0635 | 0.0655 | 0.0156 | 0.0130
 | 9 | 0.0622 | 0.0747 | 0.0138 | 0.0158
300 | 1 | 0.0395 | 0.0230 | 0.0079 | 0.0057
 | 2 | 0.0449 | 0.0455 | 0.0103 | 0.0091
 | 3 | 0.0501 | 0.0405 | 0.0113 | 0.0090
 | 4 | 0.0275 | 0.0181 | 0.0065 | 0.0057
 | 6 | 0.0350 | 0.0183 | 0.0078 | 0.0039
 | 9 | 0.0278 | 0.0262 | 0.0067 | 0.0067
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Woods, L.T.; Rana, Z.A. Modelling Sign Language with Encoder-Only Transformers and Human Pose Estimation Keypoint Data. Mathematics 2023, 11, 2129. https://doi.org/10.3390/math11092129