Retrieval-Based Transformer Pseudocode Generation
Abstract
1. Introduction
- Retrieval: finding the training-set code line most similar to the input code.
- Deep learning-based transformer: passing the retrieved code (i.e., the retrieved input) through an encoder-decoder transformer.
- Replacement: refining the transformer output using information from the input, the retrieved input, and the retrieved output (i.e., the retrieved pseudocode); a sketch of the full pipeline follows this list.
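A minimal, self-contained sketch of this three-phase pipeline. The function names and the position-wise token alignment are illustrative assumptions, not the authors' implementation; `difflib` stands in for the similarity measures compared in Section 4.

```python
import difflib

def generate_pseudocode(input_code, training_pairs, transformer):
    """training_pairs: list of (code, pseudocode) pairs; transformer: a
    trained encoder-decoder model wrapped as a string-to-string callable."""
    # Phase 1 (retrieval): pick the training pair whose code is most
    # similar to the input code.
    retrieved_code, retrieved_pseudo = max(
        training_pairs,
        key=lambda p: difflib.SequenceMatcher(None, input_code, p[0]).ratio())

    # Phase 2 (transformer): translate the retrieved code into pseudocode.
    generated = transformer(retrieved_code)

    # Phase 3 (replacement): tokens that differ between the retrieved code
    # and the input code are mapped back into the generated pseudocode.
    # (retrieved_pseudo is also available to more elaborate replacement
    # schemes; position-wise alignment is a simplifying assumption.)
    mapping = {old: new
               for old, new in zip(retrieved_code.split(), input_code.split())
               if old != new}
    return " ".join(mapping.get(tok, tok) for tok in generated.split())
```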
2. Related Work
3. Retrieval-Based Transformer Pseudocode Generation
- Retrieval: finding the training-set code line most similar to the input code.
- Deep learning-based transformer: passing the retrieved code (i.e., the retrieved input) through an encoder-decoder transformer.
- Replacement: refining the transformer output using information from the input, the retrieved input, and the retrieved output (i.e., the retrieved pseudocode); illustrative sketches follow each subsection heading below.
3.1. Similarity between Input and Training Dataset (Retrieval Phase)
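Of the similarity measures evaluated later (LSI, VSM, NNGen, and Levenshtein distance), Levenshtein distance performs best on both datasets. A minimal sketch of Levenshtein-based retrieval follows; the function names and the (code, pseudocode) pair representation are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def retrieve(input_code, training_pairs):
    # Return the (code, pseudocode) training pair whose code has the
    # smallest edit distance to the input code.
    return min(training_pairs, key=lambda p: levenshtein(input_code, p[0]))
```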
3.2. Deep Learning-Based Transformer
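Section 4.3 reports that the six-layer configuration works best. Below is a minimal PyTorch sketch of an encoder-decoder transformer of that depth; the class name, embedding size, and head count are illustrative assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class CodeToPseudoTransformer(nn.Module):
    """Encoder-decoder transformer mapping code tokens to pseudocode tokens.
    NOTE: positional encodings are omitted here; a real model needs them."""
    def __init__(self, src_vocab, tgt_vocab, d_model=512, nhead=8, layers=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each target position may attend only to earlier ones.
        mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_embed(src_ids),
                                  self.tgt_embed(tgt_ids),
                                  tgt_mask=mask)
        return self.out(hidden)  # logits over the pseudocode vocabulary
```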
3.3. Replacement for Generating the Target Pseudocode
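A sketch of the replacement phase under a simplifying position-wise alignment assumption (the paper's actual Replace function is more elaborate). It also exhibits the limitation discussed in Section 4.3: generated tokens that do not trace back to the retrieved input have no mapping and stay unchanged.

```python
def replace(generated_pseudo, input_code, retrieved_code):
    # Build a token mapping from the retrieved code to the actual input
    # (position-wise alignment -- a simplifying assumption).
    mapping = {ret_tok: in_tok
               for ret_tok, in_tok in zip(retrieved_code.split(),
                                          input_code.split())
               if ret_tok != in_tok}

    # Generated tokens with no mapping (e.g., the spurious "to string ' '"
    # tokens discussed in Section 4.3) are left untouched.
    return " ".join(mapping.get(tok, tok)
                    for tok in generated_pseudo.split())

# Example: the retrieved line "if x == 0:" was translated as
# "if x equals integer 0,". For the input "if space == 0:" the
# replacement yields "if space equals integer 0,".
print(replace("if x equals integer 0,", "if space == 0:", "if x == 0:"))
```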
4. Experimental Results and Evaluation
4.1. Dataset Description
4.2. Performance Evaluation
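The tables below report BLEU and exact-match accuracy. The paper does not publish its evaluation script, so the following is only a plausible sketch using NLTK's BLEU implementation; a SPoC line can have several reference pseudocodes (see the dataset examples below), hence the list-of-references shape.

```python
from nltk.translate.bleu_score import corpus_bleu

def evaluate(references, hypotheses):
    """references: one list of tokenized reference pseudocodes per example;
    hypotheses: one tokenized generated pseudocode per example."""
    bleu = corpus_bleu(references, hypotheses)
    # Exact-match accuracy: a hypothesis counts if it equals any reference.
    exact = sum(hyp in refs for refs, hyp in zip(references, hypotheses))
    return 100 * bleu, 100 * exact / len(hypotheses)

refs = [[["if", "x2", "is", "greater", "than", "x1"], ["if", "x2", ">", "x1"]]]
hyps = [["if", "x2", ">", "x1"]]
print(evaluate(refs, hyps))  # a perfect match -> BLEU 100, accuracy 100
```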
4.3. Results
- The deep learning-based transformer (DLBT) performs better with six layers than with eight.
- In the retrieval process, some pseudocode generated by the 8-layer DLBT contains tokens for which the Replace function fails to find a proper replacement. For example, "if space == 0:" is converted to "If space equals to string '0'," whereas the reference pseudocode is "if space equals integer 0,". The tokens "to string ' '" do not refer back to any token of the input and therefore cannot be replaced by the Replace function.
4.4. Results Discussion and Interpretation
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Gad, W.; Alokla, A.; Nazih, W.; Aref, M.; Salem, A.B. DLBT: Deep Learning-Based Transformer to Generate Pseudo-Code from Source Code. Comput. Mater. Contin. 2022, 70, 3117–3132.
- Yang, G.; Zhou, Y.; Chen, X.; Yu, C. Fine-Grained Pseudo-Code Generation Method via Code Feature Extraction and Transformer. In Proceedings of the 28th Asia-Pacific Software Engineering Conference (APSEC), Taipei, Taiwan, 6–9 December 2021.
- Alhefdhi, A.; Dam, H.K.; Hata, H.; Ghose, A. Generating Pseudo-Code from Source Code Using Deep Learning. In Proceedings of the 25th Australasian Software Engineering Conference (ASWEC), Adelaide, SA, Australia, 26–30 November 2018; pp. 21–25.
- Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 574–584.
- Ling, W.; Blunsom, P.; Grefenstette, E.; Hermann, K.M.; Kocisky, T.; Wang, F.; Senior, A. Latent Predictor Networks for Code Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 599–609.
- Jia, R.; Liang, P. Data Recombination for Neural Semantic Parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 12–22.
- Locascio, N.; Narasimhan, K.; DeLeon, E.; Kushman, N.; Barzilay, R. Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1918–1923.
- Babhulgaonkar, A.; Bharad, S. Statistical Machine Translation. In Proceedings of the 1st International Conference on Intelligent Systems and Information Management (ICISIM); IEEE: Manhattan, NY, USA; pp. 62–67.
- Koehn, P. Neural Machine Translation; Cambridge University Press: Cambridge, UK, 2020.
- Sennrich, R.; Zhang, B. Revisiting Low-Resource Neural Machine Translation: A Case Study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 211–221.
- Mahata, S.K.; Mandal, S.; Das, D.; Bandyopadhyay, S. SMT vs. NMT: A Comparison over Hindi and Bengali Simple Sentences. arXiv 2018, arXiv:1812.04898.
- Yin, P.; Neubig, G. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 440–450.
- Rabinovich, M.; Stern, M.; Klein, D. Abstract Syntax Networks for Code Generation and Semantic Parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1139–1149.
- Roodschild, M.; Sardiñas, J.G.; Will, A. A New Approach for the Vanishing Gradient Problem on Sigmoid Activation. Prog. Artif. Intell. 2020, 9, 351–360.
- Zhang, J.; Utiyama, M.; Sumita, E.; Neubig, G.; Nakamura, S. Guiding Neural Machine Translation with Retrieved Translation Pieces. In Proceedings of the NAACL-HLT, New Orleans, LA, USA, 1–6 June 2018; pp. 1325–1335.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Büch, L.; Andrzejak, A. Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection. In Proceedings of the 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; pp. 95–104.
- Deng, Y.; Huang, H.; Chen, X.; Liu, Z.; Wu, S.; Xuan, J.; Li, Z. From Code to Natural Language: Type-Aware Sketch-Based Seq2Seq Learning. In International Conference on Database Systems for Advanced Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 352–368.
- Gu, J.; Lu, Z.; Li, H.; Li, V.O. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 1631–1640.
- Hayati, S.A.; Olivier, R.; Avvaru, P.; Yin, P.; Tomasic, A.; Neubig, G. Retrieval-Based Neural Code Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 925–930.
- Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Hu, X.; Li, G.; Xia, X.; Lo, D.; Jin, Z. Deep Code Comment Generation. In Proceedings of the 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), Gothenburg, Sweden, 27 May–3 June 2018; pp. 200–210.
- Hu, X.; Li, G.; Xia, X.; Lo, D.; Jin, Z. Deep Code Comment Generation with Hybrid Lexical and Syntactical Information. Empir. Softw. Eng. 2020, 25, 2179–2217.
- Rai, S.; Gupta, A. Generation of Pseudo Code from the Python Source Code Using Rule-Based Machine Translation. arXiv 2019, arXiv:1906.06117.
- Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Liu, X. Retrieval-Based Neural Source Code Summarization. In Proceedings of the 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea, 5–11 October 2020; pp. 1385–1397.
- Dumais, S.T. Latent Semantic Analysis. Annu. Rev. Inf. Sci. Technol. 2004, 38, 188–230.
- Salton, G.; Wong, A.; Yang, C.S. A Vector Space Model for Automatic Indexing. Commun. ACM 1975, 18, 613–620.
- Liu, Z.; Xia, X.; Hassan, A.E.; Lo, D.; Xing, Z.; Wang, X. Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; pp. 373–384.
- Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710. Available online: https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf (accessed on 24 December 2021).
- Reiter, E. A Structured Review of the Validity of BLEU. Comput. Linguist. 2018, 44, 393–401.
- Haiduc, S.; Aponte, J.; Moreno, L.; Marcus, A. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In Proceedings of the 17th Working Conference on Reverse Engineering, Beverly, MA, USA, 13–16 October 2010; pp. 35–44.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
- Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python; Zenodo: Geneva, Switzerland, 2020.
- Princeton University. About WordNet. Available online: https://wordnet.princeton.edu/ (accessed on 3 March 2021).
SPoC dataset examples (one C++ line can have several reference pseudocodes):

| C++ Code | Pseudocode (alternative references) |
|---|---|
| string s; | create string s; declare string variable s; declare s as string; s = string |
| x1 = s[0]-96; | set x1 to s[0]-96; set x1 = s[0]-96; x1 = s[0]-96 |
| if (x2 > x1) | if x2 is greater than x1; if x2 > x1 |
| Dataset | Training | Testing |
|---|---|---|
| Django | 17,605 | 1200 |
| SPoC | 181,862 | 15,183 |
Retrieval methods on the Django dataset:

| Method | BLEU (All) | BLEU (First) | Accuracy (All) | Accuracy (First) | Time |
|---|---|---|---|---|---|
| VSM | 28.43 | 30.59 | 12.99 | 15.93 | 0.08 s |
| LSI | 28.61 | 30.81 | 12.92 | 15.73 | 7.75 s |
| NNGen | 30.41 | 31.66 | 13.97 | 16.62 | 0.0025 s |
| Levenshtein | 34.79 | 38.15 | 12.97 | 17.56 | 0.066 s |

Retrieval methods on the SPoC dataset:

| Method | BLEU (All) | BLEU (First) | Accuracy (All) | Accuracy (First) | Time |
|---|---|---|---|---|---|
| VSM | 14.54 | 20.50 | 1.66 | 3.17 | 0.24 s |
| LSI | 14.62 | 20.68 | 1.64 | 3.12 | 79.90 s |
| NNGen | 27.44 | 38.58 | 8.08 | 26.48 | 0.009 s |
| Levenshtein | 38.85 | 60.40 | 16.80 | 49.205 | 0.12 s |
| # | Python Code | Pseudocode |
|---|---|---|
| Manually Generated Pseudocode | | |
| 1 | from io import BytesIO | # from io import BytesIO into default name space. |
| 2 | open_tags = [ ] | # open_tags is an empty list. |
| 3 | if six.PY2: | # if six.PY2 is true, |
| 4 | for supported_code in _supported: | # for every supported_code in _supported, |
| 5 | s = s.replace('\t', '\\t') | # replace every occurrence of '\t' in s with '\\t'. |
| DLBT (6 layers) | | |
| 1 | from io import BytesIO | # from io import BytesIO into default name space. |
| 2 | open_tags = [ ] | # Substitute cache.get for cache_get. |
| 3 | if six.PY2: | # If true, Six.py2 is true. |
| 4 | for supported_code in _supported: | # For every supported in this value, |
| 5 | s = s.replace('\t', '\\t') | # s is an empty list. Replace every occurrence of '\t' in t with '\\t'. |
| DLBT (8 layers) | | |
| 1 | from io import BytesIO | # from io import BytesIO into default name space. |
| 2 | open_tags = [ ] | # Substitute pattern.regex.pattern for template_dir. |
| 3 | if six.PY2: | # If six.py2 is true, If six.py2 is true, |
| 4 | for supported_code in _supported: | # For every template_dir in supported, |
| 5 | s = s.replace('\t', '\\t') | # replace all occurrences of 't ' with 't '. |
| Retrieval-based Transformer (6 layers) | | |
| 1 | from io import BytesIO | # from io import BytesIO into default name space. |
| 2 | open_tags = [ ] | # open_tags is an empty list. |
| 3 | if six.PY2: | # if six.PY2 is true, |
| 4 | for supported_code in _supported: | # for every supported_code in _supported |
| 5 | s = s.replace('\t', '\\t') | # s is an empty list. Replace every occurrence of '\t' in s with '\\t'. |
| Retrieval-based Transformer (8 layers) | | |
| 1 | from io import BytesIO | # from io import BytesIO into default name space. |
| 2 | open_tags = [ ] | # open_tags is an empty list. |
| 3 | if six.PY2: | # if six.PY2 is true, |
| 4 | for supported_code in _supported: | # for every supported_code in _supported |
| 5 | s = s.replace('\t', '\\t') | # replace every occurrences of '\t' in s with '\\t'. |
| # | C++ Code | Pseudocode |
|---|---|---|
| Manually Generated Pseudocode | | |
| 1 | int n; | create integer n |
| 2 | cin >> n; | read n |
| 3 | for (int i = 0; i < (int)ans.size(); i++) { | for i = 0 to integer casted size of ans exclusive |
| 4 | if (x == cur) { | if x is equal to cur |
| 5 | for (int i = 0; i < a; i++) { cin >> a[i]; } | for i = 0 to a exclusive, read a[i] |
| DLBT (6 layers) | | |
| 1 | int n; | create integer n |
| 2 | cin >> n; | read n |
| 3 | for (int i = 0; i < (int)ans.size(); i++) { | for i = 0 to size of a exclusive |
| 4 | if (x == cur) { | if x is equal to 0 |
| 5 | for (int i = 0; i < n; i++) { cin >> a[i]; } | for i = 0 to n exclusive, then read a[i] |
| Retrieval-based Transformer (6 layers) | | |
| 1 | int n; | create integer n |
| 2 | cin >> n; | read n |
| 3 | for (int i = 0; i < (int)ans.size(); i++) { | for i = 0 to size of ans exclusive |
| 4 | if (x == cur) { | if x is equal to cur |
| 5 | for (int i = 0; i < n; i++) { cin >> a[i]; } | for i = 0 to a exclusive, read a[i] |
| Dataset | LSI BLEU | LSI Acc | VSM BLEU | VSM Acc | NNGen BLEU | NNGen Acc | Levenshtein BLEU | Levenshtein Acc |
|---|---|---|---|---|---|---|---|---|
| Django (6 layers) | 55.96 | 32.16 | 55.23 | 31.52 | 59.74 | 37.17 | 61.96 | 40.81 |
| Django (8 layers) | 53.14 | 31.76 | 52.32 | 30.81 | 57.63 | 34.07 | 60.55 | 38.94 |
| SPoC (6 layers) | 22.31 | 3.02 | 22.03 | 3.17 | 40.79 | 27.14 | 50.28 | 34.72 |
| SPoC (8 layers) | 20.83 | 2.05 | 20.67 | 2.31 | 39.43 | 25.74 | 45.37 | 27.94 |
| Dataset | Model | BLEU |
|---|---|---|
| Django | Retrieval-based Transformer (6 layers) | 61.96 |
| | Retrieval-based Transformer (8 layers) | 61.29 |
| | DLBT (6 layers, without cross-validation) | 59.62 |
| | DLBT (8 layers, without cross-validation) | 58.58 |
| | Code2NL [18] | 56.54 |
| | code2pseudocode [3] | 54.78 |
| | T2SMT [4] | 54.08 |
| | DeepPseudo [2] | 50.81 |
| | Code-GRU [18] | 50.81 |
| | Seq2Seq w/ Atten. [2] | 43.96 |
| | NoAtt [3] | 43.55 |
| | RBMT [21] | 41.87 |
| | CODE-NN [2,18] | 40.51 |
| | ConvS2S [2] | 37.45 |
| | Seq2Seq w/o Atten. [2] | 36.48 |
| | Seq2Seq [18] | 28.26 |
| | PBMT [3] | 25.17 |
| | SimpleRNN [3] | 6.45 |
| SPoC | Retrieval-based Transformer (6 layers) | 50.28 |
| | DLBT (6 layers, without cross-validation) | 48.12 |
| | DeepPseudo [2] | 46.45 |
| | Retrieval-based Transformer (8 layers) | 45.37 |
| | Transformer [2] | 43.73 |
| | DLBT (8 layers, without cross-validation) | 43.16 |
| | Seq2Seq w/ Atten. [2] | 41.00 |
| | ConvS2S [2] | 34.19 |
| | Seq2Seq w/o Atten. [2] | 33.76 |
| | CODE-NN [2,18] | 32.10 |