A Historical Survey of Advances in Transformer Architectures
Abstract
1. Introduction
1.1. Transformers
1.2. Self Attention
Attention(Q, K, V) = softmax(QK^T/√d_k)V, where:

- Q is the matrix of the queries;
- K is the matrix of the keys;
- V is the matrix of the values.
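To make the computation above concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention. The function name, toy tensor shapes, and random inputs are illustrative assumptions rather than code taken from any of the surveyed works.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

# Toy usage: 4 tokens, key/query dimension 8, value dimension 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```

In the multi-head formulation of Vaswani et al. [1], this operation is applied in parallel to several learned linear projections of the inputs, and the per-head outputs are concatenated and projected back to the model dimension.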
1.3. Feedforward Networks
1.4. Residual Connections
1.5. Position Encodings
2. Survey Methodology
2.1. Information Sources
2.2. Search
2.3. Study Selection
2.4. Data Collection and Data Items
3. Survey Results
3.1. Early Transformer Implementations
3.1.1. Introductory Works
3.1.2. Further Progression
3.1.3. Recent Advancements
3.2. Text-Based Applications
3.3. Image-Based Applications
3.4. Miscellaneous Applications
3.5. Recent Directions
4. Discussion
4.1. Historical Insight
4.2. Application-Based Implementations
4.2.1. Text-Based Applications
4.2.2. Image-Based Applications
4.2.3. Miscellaneous Applications
5. Gaps and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
- Li, X.; Metsis, V.; Wang, H.; Ngu, A.H.H. TTS-GAN: A Transformer-Based Time-Series Generative Adversarial Network. In Artificial Intelligence in Medicine; Michalowski, M., Abidi, S.S.R., Abidi, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2022; Volume 13263, pp. 133–143. ISBN 978-3-031-09341-8. [Google Scholar]
- Myers, D.; Mohawesh, R.; Chellaboina, V.I.; Sathvik, A.L.; Venkatesh, P.; Ho, Y.-H.; Henshaw, H.; Alhawawreh, M.; Berdik, D.; Jararweh, Y. Foundation and large language models: Fundamentals, challenges, opportunities, and social impacts. Cluster Comput. 2024, 27, 1–26. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–21. [Google Scholar] [CrossRef] [PubMed]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Rumelhart, D.E.; McClelland, J.L. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations; MIT Press: Cambridge, MA, USA, 1987; pp. 318–362. ISBN 978-0-262-29140-8. [Google Scholar]
- Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 8–15. [Google Scholar]
- Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; Dasgupta, S., McAllester, D., Eds.; PMLR: Atlanta, GA, USA, 2013; Volume 28, pp. 1310–1318. [Google Scholar]
- Kim, Y.; Denton, C.; Hoang, L.; Rush, A.M. Structured Attention Networks. arXiv 2017, arXiv:1702.00887. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [Google Scholar]
- Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learning Syst. 2017, 28, 2222–2232. [Google Scholar] [CrossRef] [PubMed]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2021, 54, 1–41. [Google Scholar] [CrossRef]
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Pittsburgh, PA, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR: London, UK, 2017; Volume 70, pp. 1243–1252. [Google Scholar]
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, 71. [Google Scholar] [CrossRef] [PubMed]
- Attention Is All You Need Search Results. Available online: https://scholar.google.ae/scholar?q=Attention+Is+All+You+Need&hl=en&as_sdt=0&as_vis=1&oi=scholart (accessed on 5 June 2022).
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 14 May 2024).
- Radford, A.; Jozefowicz, R.; Sutskever, I. Learning to Generate Reviews and Discovering Sentiment. arXiv 2017, arXiv:1704.01444. [Google Scholar]
- Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning Deep Transformer Models for Machine Translation. arXiv 2019, arXiv:1906.01787. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Sukhbaatar, S.; Grave, E.; Lample, G.; Jegou, H.; Joulin, A. Augmenting Self-attention with Persistent Memory. arXiv 2019, arXiv:1907.01470. [Google Scholar]
- Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
- Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite Transformer with Long-Short Range Attention. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. ISBN 978-3-030-58451-1. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2022; pp. 6877–6886. [Google Scholar]
- Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12294–12305. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: London, UK, 2021; Volume 139, pp. 10347–10357. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2022, arXiv:2101.03961. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
- Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling Vision Transformers. arXiv 2021, arXiv:2106.04560. [Google Scholar]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-trained Transformer Language Models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
- Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. DeepNet: Scaling Transformers to 1000 Layers. arXiv 2022, arXiv:2203.00555. [Google Scholar] [CrossRef]
- Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949. [Google Scholar]
- Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet Size and Speed. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]
- Meng, L.; Li, H.; Chen, B.-C.; Lan, S.; Wu, Z.; Jiang, Y.-G.; Lim, S.-N. AdaViT: Adaptive Vision Transformers for Efficient Image Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12299–12308. [Google Scholar]
- Yin, H.; Vahdat, A.; Alvarez, J.M.; Mallya, A.; Kautz, J.; Molchanov, P. A-ViT: Adaptive Tokens for Efficient Vision Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10799–10808. [Google Scholar]
- Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently Scaling Transformer Inference. Proc. Mach. Learn. Syst. 2023, 5. [Google Scholar]
- Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. MiniViT: Compressing Vision Transformers with Weight Multiplexing. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12135–12144. [Google Scholar]
- Yu, H.; Wu, J. A unified pruning framework for vision transformers. Sci. China Inf. Sci. 2023, 66, 179101. [Google Scholar] [CrossRef]
- Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et al. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Pittsburgh, PA, USA, 2014; pp. 12–58. [Google Scholar]
- Lim, D.; Hohne, F.; Li, X.; Huang, S.L.; Gupta, V.; Bhalerao, O.; Lim, S.-N. Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. arXiv 2021, arXiv:2110.14446. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. ISBN 978-3-319-10601-4. [Google Scholar]
- Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
- Taylor, W.L. “Cloze procedure”: A new tool for measuring readability. J. Q. 1953, 30, 415–433. [Google Scholar] [CrossRef]
- Schwartz, R.; Sap, M.; Konstas, I.; Zilles, L.; Choi, Y.; Smith, N.A. Story Cloze Task: UW NLP System. In Proceedings of the LSDSem 2017, Valencia, Spain, 3 April 2017. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 353–355. [Google Scholar]
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 785–794. [Google Scholar]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Association for Computational Linguistics: Austin, TX, USA, 2016; pp. 2383–2392. [Google Scholar]
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 933–941. [Google Scholar]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497. [Google Scholar] [CrossRef]
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Volume 7, pp. 32–33. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 14 May 2024).
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C.V. Cats and Dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing through ADE20K Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 5122–5130. [Google Scholar]
- Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.-G.; Lee, S.-W.; Fidler, S.; Urtasun, R.; Yuille, A. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the Real World: A Survey on NLP Applications. Information 2023, 14, 242. [Google Scholar] [CrossRef]
- Ainslie, J.; Ontanon, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Ravula, A.; Sanghai, S.; Wang, Q.; Yang, L. ETC: Encoding Long and Structured Inputs in Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Pittsburgh, PA, USA, 2020; pp. 268–284. [Google Scholar]
- Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar]
- Yan, H.; Deng, B.; Li, X.; Qiu, X. TENER: Adapting Transformer Encoder for Named Entity Recognition. arXiv 2019, arXiv:1911.04474. [Google Scholar]
- Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; Zhou, M. Neural Question Generation from Text: A Preliminary Study. arXiv 2017, arXiv:1704.01792. [Google Scholar]
- Miwa, M.; Bansal, M. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Pittsburgh, PA, USA, 2016; pp. 1105–1116. [Google Scholar]
- Fragkou, P. Applying named entity recognition and co-reference resolution for segmenting English texts. Prog. Artif. Intell. 2017, 6, 325–346. [Google Scholar] [CrossRef]
- Parmar, N.J.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image Transformer. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Wang, J.; Yu, X.; Gao, Y. Feature Fusion Vision Transformer for Fine-Grained Visual Categorization. arXiv 2021, arXiv:2107.02341. [Google Scholar]
- Hamilton, M.; Zhang, Z.; Hariharan, B.; Snavely, N.; Freeman, W.T. Unsupervised Semantic Segmentation by Distilling Feature Correspondences. arXiv 2022, arXiv:2203.08414. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
- Koizumi, Y.; Masumura, R.; Nishida, K.; Yasuda, M.; Saito, S. A Transformer-based Audio Captioning Model with Keyword Estimation. arXiv 2020, arXiv:2007.00222. [Google Scholar]
- Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
- Liu, M.; Ren, S.; Ma, S.; Jiao, J.; Chen, Y.; Wang, Z.; Song, W. Gated Transformer Networks for Multivariate Time Series Classification. arXiv 2021, arXiv:2103.14438. [Google Scholar]
- Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. arXiv 2022, arXiv:2201.12740. [Google Scholar]
- Tuli, S.; Casale, G.; Jennings, N.R. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data. arXiv 2022, arXiv:2201.07284. [Google Scholar] [CrossRef]
- He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in medical image analysis. Intell. Med. 2023, 3, 59–78. [Google Scholar] [CrossRef]
- Sajun, A.R.; Zualkernan, I.; Sankalpa, D. Investigating the Performance of FixMatch for COVID-19 Detection in Chest X-rays. Appl. Sci. 2022, 12, 4694. [Google Scholar] [CrossRef]
- Ziani, S. Enhancing fetal electrocardiogram classification: A hybrid approach incorporating multimodal data fusion and advanced deep learning models. Multimed. Tools Appl. 2023, 83, 55011–55051. [Google Scholar] [CrossRef]
- Ziani, S.; Farhaoui, Y.; Moutaib, M. Extraction of Fetal Electrocardiogram by Combining Deep Learning and SVD-ICA-NMF Methods. Big Data Min. Anal. 2023, 6, 301–310. [Google Scholar] [CrossRef]
- Li, J.; Chen, J.; Tang, Y.; Wang, C.; Landman, B.A.; Zhou, S.K. Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives. Med. Image Anal. 2023, 85, 102762. [Google Scholar] [CrossRef] [PubMed]
- Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef]
- Fournier, Q.; Caron, G.M.; Aloise, D. A Practical Survey on Faster and Lighter Transformers. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
- Wang, W.; Zhang, J.; Cao, Y.; Shen, Y.; Tao, D. Towards Data-Efficient Detection Transformers. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 88–105. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Liu, P.J.; Saleh, M.; Pot, E.; Goodrich, B.; Sepassi, R.; Kaiser, L.; Shazeer, N. Generating Wikipedia by Summarizing Long Sequences. arXiv 2018, arXiv:1801.10198. [Google Scholar]
- Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2023, 55, 1–28. [Google Scholar] [CrossRef]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
- Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2021, arXiv:2009.14794. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [CrossRef]
- Jaegle, A.; Gimeno, F.; Brock, A.; Zisserman, A.; Vinyals, O.; Carreira, J. Perceiver: General Perception with Iterative Attention. arXiv 2021, arXiv:2103.03206. [Google Scholar]
- Jaegle, A.; Borgeaud, S.; Alayrac, J.-B.; Doersch, C.; Ionescu, C.; Ding, D.; Koppula, S.; Zoran, D.; Brock, A.; Shelhamer, E.; et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs. arXiv 2022, arXiv:2107.14795. [Google Scholar]
- Weng, Z.; Yang, X.; Li, A.; Wu, Z.; Jiang, Y.-G. Semi-supervised vision transformers. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results | No. of Citations |
---|---|---|---|---|---|---|---|
Attention Is All You Need [1] | Vaswani et al. | Jun 2017 | Transformer | WMT 2014 English-to-German translation task, WMT 2014 English-to-French translation task | ByteNet, Deep-Att + PosUnk, GNMT + RL, ConvS2S, MoE, Deep-Att + PosUnk Ensemble, GNMT + RL Ensemble, ConvS2S Ensemble | 28.4 BLEU for EN-DE and 41.8 BLEU for EN-FR with Transformer (Big) | 90,568 |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [17] | Devlin et al. | Oct 2018 | NLP Transformer | BooksCorpus, English Wikipedia | GLUE, SQuAD v1.1, SQuAD v2.0 | BERT Large average score of 82.1 on GLUE testing | 78,786 |
Self-Attention with Relative Position Representations [19] | Shaw et al. | Mar 2018 | Translation NLP Transformer | WMT 2014 English–German, WMT 2014 English–French | Original Transformer | English-to-German improved over the baseline by 0.3 and 1.3 BLEU for the base and big configurations, respectively, and English-to-French improved by 0.5 and 0.3 BLEU for the base and big configurations, respectively | 1882 |
Improving Language Understanding by Generative Pre-Training [23] | Radford et al. | Jun 2018 | GPT | Unsupervised training—BooksCorpus; fine-tuning based on task—natural language inference, question answering, semantic similarity, and text classification, all present in GLUE | NLI-ESIM + ELMo, CAFE, Stochastic Answer Network, GenSen, Multi-task BiLSTM + Attn; QA-val-LS-skip, Hidden Coherence Model, Dynamic Fusion Net, BiAttention MRU; SS, Classification-sparse byte mLSTM, TF-KLD, ECNU, Single-task BiLSTM + ELMo + Attn, Multi-task BiLSTM + ELMo + Attn | Best results: NLI-SNLI-89.9 QA-Story Cloze-86.5 SS-STSB-82.0 Classification-CoLA-45.4 GLUE-72.8 | 6642 |
RoBERTa: A Robustly Optimized BERT Pre-training Approach [26] | Liu et al. | Jul 2019 | Variant of BERT | BooksCorpus, English Wikipedia, CC-News, OpenWebText, Stories | BERT Large, XLNet Large; Ensembles-ALICE, MT-DNN, XLNet | Best results: SQuAD 1.1-F1-94.6 RACE-Middle-86.5 GLUE-SST-96.4 | 8926 |
Language Models are Unsupervised Multitask Learners [90] | Radford et al. | Feb 2019 | (GPT2) GPT variation | Created own dataset called WebText | Baseline models, in general | 55 F1 on CoQa, matches or exceeds 3 of 4 baselines, has state-of-the-art results on 7/8 datasets | 6954 |
Learning Deep Transformer Models for Machine Translation [25] | Wang et al. | Jun 2019 | Translation NLP Transformer | WMT'16 English–German (En-De) and NIST'12 Chinese–English (Zh-En-Small) | Original Transformer | Avg. BLEU scores [%] on NIST'12 Chinese–English translation: 52.11; BLEU scores [%] on WMT'18 Chinese–English translation: newstest17-26.9, newstest18-27.4 | 548 |
Augmenting Self-attention with Persistent Memory [27] | Sukhbaatar et al. | Jul 2019 | Introduction of a new layer for the Transformer | Character-level modeling—enwik8, text8; word-level modeling—WikiText-103 | Character level-LN HM-LSTM, Recurrent Highway Networks, Large mLSTM, T12, Transformer + adaptive span; Word level-LSTM, TCN, GCNN-8, LSTM + Neural Cache, 4-layer QRNN, LSTM + Hebbian + Cache, Transformer-XL Standard | enwik8-1.01 text8-1.11 WikiText-103-18.3 | 94 |
GLU Variants Improve Transformer [28] | Shazeer N | Feb 2020 | Variation of original and T5 by adding GLU layers | C4 | T5 | GLUE best average score of 84.67-FFNReGLU | 141 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [22] | Raffel et al. | Jan 2020 | NLP Transformer | C4, fine-tuning using GLUE and SuperGLUE | Self-trained Baseline experimental setup | GLUE-85.97 CNNDM-20.90 SQuAD-85.44 SGLUE-75.64 EnDe-28.37 EnFr-41.37 EnRo-28.98 | 10,117 |
Lite Transformer with Long-Short Range Attention [29] | Wu et al. | Apr 2020 | Lightweight translation transformers to deploy on end devices | IWSLT'14 German–English, WMT English to German, WMT English to French (En-Fr) | Original Transformer, adaptive inputs (Baevski and Auli) | CNN-DailyMail-F1-Rouge-R-1:41.3, R-2:18.8, R-L:38.3 (did not beat original but lighter) WikiText-103-Valid ppl.-21.4, Test ppl.-22.2 | 234 |
End-to-End Object Detection with Transformers [30] | Carion et al. | May 2020 | Object detection Transformer | COCO 2017 | Different variations of Faster RCNN for detection Panoptic FNN, UPSnet for panoptic segmentation | Panoptic Quality-45.1 Able to classify classes in general without being biased to the training images | 7829 |
Language Models are Few-Shot Learners [31] | Brown et al. | May 2020 | GPT-3 | Common Crawl, WebText, Books1, Books2, Wikipedia | QA-RAG, T5-11B (2 variants) | QA-beats SOTA in TriviaQA-71.2 with GPT-3 few-shot; LAMBADA-few-shot-86.4 (beats SOTA); PIQA-few-shot-82.8 | 14,698 |
An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [32] | Dosovitskiy et al. | Oct 2020 | Vision Transformer (ViT) | Trained on ILSVRC-2012, ImageNet-21k, JFT; transferred to ReaL labels, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102 | BiT-L (ResNet152x4), Noisy Student (EfficientNet-L2) | ImageNet-88.55-ViT-H ReaL-90.72-ViT-H CIFAR-10-99.50-ViT-H CIFAR-100-94.55-ViT-H Oxford-IIIT-Pets-97.56-ViT-H Oxford Flowers-99.74-ViT-L VTAB (19 tasks)-77.63-ViT-H | 21,833 |
Pre-Trained Image Processing Transformer [34] | Chen et al. | Dec 2020 | Image Processing Transformer (IPT) | ImageNet | Super-resolution-VDSR, EDSR, RCAN, RDN, OISR-RK3, RNAN, SAN, HAN, IGNN image denoising-CBM3D, TNRD, DnCNN, MemNet, IRCNN, FFDNet, SADNet, RDN image deraining-DSC, GMM, JCAS, Clear, DDN, RESCAN, PReNet, JORDER.E, SPANet, SSIR, RCDNet | Super resolution: set5-38.37, set14-34.43, B100-32.48, Urban100-33.76 image denoising: BSD68-30-30.75, 50-28.39, Urban100 30-32.00, 50-29.71 deraining: Rain100L-PSNR-41.62, SSIM-0.9880 | 1129 |
Training data-efficient image transformers and distillation through attention [35] | Touvron et al. | Dec 2020 | Based on ViT (DeiT) | ImageNet | ResNet, RegNetY, EfficientNet, KDforAA, ViT (all versions) | DeiT-B 384/1000 epochs outperforms ViT and EfficientNet-85.2 acc | 4021 |
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [33] | Zheng et al. | Dec 2020 | Semantic Segmentation Transformer (SETR) | CityScapes, ADE20K, Pascal Context, all trained separately | SCN, Semantic FPN ADE20K Dataset-FCN, CCNet, Strip pooling, DANet, OCRNet, UperNet, Deeplab V3+ Pascal Context-DANet, EMANet, SVCNet, Strip pooling, GFFNet, APCNet Cityscapes validation-FCN, PSPNet, DeepLab-V3, NonLocal, CCNet, GCNet, Axial-DeepLab-XL, Axial-DeepLab-L | ADE20K-mIoU = 50.28 Pascal = 55.83 Cityscapes = 82.15 | 2030 |
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [36] | Fedus et al. | Jan 2021 | Transformer | Colossal Clean Crawled Corpus (C4) | MoE, T5 | Negative log perplexity (quality threshold)-1.534; best average score on SQuAD of 88.6 vs. T5 | 894 |
Learning Transferable Visual Models From Natural Language Supervision [37] | Radford et al. | Feb 2021 | Text-to-Image transformer | Created own dataset called WIT (WebImageText)—has a similar word count to WebText | Visual N-Grams for comparison on zero-shot transfer | aYahoo-98.4 ImageNet-76.2 SUN-58.5 | 8121 |
Scaling Vision Transformers [38] | Zhai et al. | Jun 2021 | Scaled-up ViT (ViT-G) | JFT-3B | NS, MPL, CLIP, ALIGN, BiT-L (ResNet), ViT-H | ImageNet-90.45 INet V2-88.33 VTAB (light)-78.29 | 573 |
OPT: Open Pre-trained Transformer Language Models [39] | Zhang et al. | May 2022 | Pre-trained NLP transformers, architecture followed GPT-3 | BookCorpus, Stories, CCNews v2, CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO, Wikipedia, dataset of Baumgartner et al. | Dialogue evaluations—Reddit 2.7B, BlenderBot1, R2C2 BlenderBot; hate speech detection—Davinci; CrowS-Pairs—GPT-3; StereoSet—Davinci; dialogue responsible AI evaluations—Reddit 2.7B, BlenderBot1, R2C2 BlenderBot | Outperforms Davinci in hate speech detection, best is few-shot (multiclass) with F1-score of 0.812; CrowS-Pairs—better than GPT-3 only in two categories, Religion and Disability, with an accuracy of 68.6% and 76.7%, respectively; StereoSet—almost the same as Davinci | 719 |
Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
---|---|---|---|---|---|---|
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [17] | Devlin et al. | 2018 | NLP Transformer | BooksCorpus, English Wikipedia | GLUE, SQuAD v1.1, SQuAD v2.0 | BERT Large average score of 82.1 on GLUE testing |
RoBERTa: A Robustly Optimized BERT Pre-training Approach [26] | Liu et al. | 2019 | Variant of BERT | BooksCorpus, English Wikipedia, CC-News, OpenWebText, Stories | BERT Large, XLNet Large; Ensembles-ALICE, MT-DNN, XLNet | Best results: SQuAD 1.1-F1-94.6 RACE-Middle-86.5 GLUE-SST-96.4 |
TENER: Adapting Transformer Encoder for Named Entity Recognition [69] | Yan et al. | 2019 | Sequence labeling (NER) transformer | English NER-CoNLL2003, OntoNotes 5.0; Chinese NER-Chinese part of OntoNotes 4.0, MSRA, Weibo NER, Resume NER | Chinese NER-BiLSTM, 1D-CNN, CAN-NER, Transformer; English NER-BiLSTM-CRF, CNN-BiLSTM-CRF, BiLSTM-BiLSTM-CRF, CNN-BiLSTM-CRF, 1D-CNN, LM-LSTM-CRF, CRF + HSCRF, BiLSTM-BiLSTM-CRF, LS + BiLSTM-CRF, CN^3, GRN | F1-scores: Chinese NER-Weibo-58.17, Resume-95.00, OntoNotes4.0-72.43, MSRA-92.74; English NER-OntoNotes 5.0-88.43, model + CNN-char achieves 91.45 for CoNLL 2003 |
ETC: Encoding Long and Structured Inputs in Transformers [67] | Ainslie et al. | 2020 | Variation of BERT-lifted weights from RoBERTa | BooksCorpus, English Wikipedia | BERT, RoBERTa | Leaderboard results SOTA (1ST) NQ long answer-77.78 HOTPOT QA SUP.F1-89.09 WikiHop-82.25 OpenKP-42.05 |
Big Bird: Transformers for Longer Sequences [68] | Zaheer et al. | 2021 | Variation of BERT | MLM | HGN, GSAN, ReflectionNet, RikiNet-v2, Fusion-in-decoder, SpanBERT, MRC-GCN, MultiHop, Longformer | Answering QA task-Best results (F1 SCORE) HotpotQA-Sup-89.1 NaturalIQ-LA-77.8 TriviaQA-Verified-92.4 WikiHop-82.3 (accuracy) |
Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
---|---|---|---|---|---|---|
Image Transformer [73] | Parmar et al. | 2018 | Attention Transformer | CIFAR-10 | Generative image modeling-Pixel CNN, Row Pixel RNN, Gated Pixel CNN, Pixel CNN+, PixelSNAIL; further inference-ResNet, srez GAN, Pixel Recursive | GIM-4.06 bits/dim on CIFAR-10 validation; second best with 3.77 on ImageNet, very close to Pixel RNN with 3.86 |
An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [32] | Dosovitskiy et al. | 2020 | Vision Transformer (ViT) | Trained on ILSVRC-2012, ImageNet-21k, JFT; transferred to ReaL labels, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102 | BiT-L (ResNet152x4), Noisy Student (EfficientNet-L2) | ImageNet-88.55-ViT-H ReaL-90.72-ViT-H CIFAR-10-99.50-ViT-H CIFAR-100-94.55-ViT-H Oxford-IIIT-Pets-97.56-ViT-H Oxford Flowers-99.74-ViT-L VTAB (19 tasks)-77.63-ViT-H |
Training data-efficient image transformers and distillation through attention [35] | Touvron et al. | 2020 | Based on ViT | ImageNet | ResNet, RegNetY, EfficientNet, KDforAA, ViT (all versions) | DeiT-B 384/1000 epochs outperforms ViT and EfficientNet-85.2 acc |
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [74] | Wang et al. | 2021 | Introduction of MAWS (mutual attention weight selection) | CUB-200-2011, Stanford Dogs, and iNaturalist2017 | CUB-200-2011-ResNet-50, RA-CNN, GP-256, MaxEnt, DFL-CNN, NTS-Net, Cross-X, DCL, CIN, DBTNet, ASNet, S3N, FDL, PMG, API-Net, StackedLSTM, MMAL-Net, ViT, TransFG & PSM; iNaturalist2017-ResNet152, SSN, Huang et al., IncResNetv2, TASN, ViT, TransFG & PSM; Stanford Dogs-MaxEnt, FDL, RA-CNN, SEF, Cross-X, API-Net, ViT, TransFG & PSM | CUB-91.3% accuracy, iNaturalist2017-68.5%, Stanford Dogs-92.4% |
Unsupervised Semantic Segmentation By Distilling Feature Correspondences [75] | Hamilton et al. | 2022 | Unsupervised Semantic Segmentation Transformer (STEGO) | 27 class COCOStuff, 27 classes of Cityscapes | ResNet50, MoCoV2, DINO, Deep Cluster, SIFT, Doersch et al., Isola et al. AC, InMARS, IIC, MDC, PiCIE, PiCIE + H | Unsupervised Accuracy-56.9, mIoU-28.2 Linear Probe Accuracy-76.1, mIoU-41.0 |
Name of Paper | Author | Date | Proposed Model | Datasets | Models Benchmarked Against | Results |
---|---|---|---|---|---|---|
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition [76] | Dong et al. | 2018 | Audio Transformer | Wall Street Journal dataset | CTC, seq2seq, seq2seq + deep convolutional, seq2seq + Unigram LS | WER 10.9 on eval92 |
A Transformer-based Audio-Captioning Model with Keyword Estimation [77] | Koizumi et al. | 2020 | Audio-Captioning Transformer (TRACKE) | Clotho dataset | Baseline LSTM, Transformer from same challenge | Beats in BLEU-1 with 52.1, BLEU-2-30.9, BLEU-3-18.8, BLEU-4-10.8, CIDEr-25.8, METEOR-14.9, ROUGE-L-34.5, SPICE-9.7, SPIDEr-17.7 |
AST: Audio Spectrogram Transformer [78] | Gong et al. | 2021 | Audio Transformer | Converted pre-trained ViT to AST, used DeiT weights | AudioSet dataset-Baseline, PANN, PSLA single, PSLA Ensemble-S, PSLA Ensemble-M; ESC-50, Speech Commands V2-SOTA-S (without additional audio data), SOTA-P (with additional audio data) | AudioSet-AST (Ensemble-M)-balanced mAP-0.378, full mAP-0.485; ESC-50-AST-P (trained using additional audio data)-95.6%; Speech Commands V2-AST-S (trained without additional audio data)-98.11% |
Gated Transformer Networks for Multivariate Time Series Classification [79] | Liu et al. | 2021 | Time series classification transformer | AUSLAN, ArabicDigits, CMUsubject1, CharacterTrajectories, ECG, JapaneseVowels, KickvsPunch, Libras, NetFlow, UWave, Wafer, WalkvsRun, PEMS | MLP, FCN, ResNet, Encoder, MCNN, t-LeNet, MCDCNN, Time-CNN, TWIESN | Best SOTA results in 7/13 datasets, with best scores of 100% for CMUsubject1, NetFlow, and WalkvsRun |
TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data [81] | Tuli et al. | 2022 | Anomaly Detection Time Series Transformer | NAB, UCR, MBA, SMAP, MSL, SWaT, WADI, SMD, MSDS | MERLIN, LSTM-NDT, DAGMM, OmniAnomaly, MSCRED, MAD-GAN, USAD, MTAD-GAT, CAE-M, GDN | Beats the SOTA results in 7/10 datasets for both F1-score and AUC; best score is an AUC of 0.9994 and an F1 of 0.9694 for the UCR dataset |