Pseudocode Generation from Source Code Using the BART Model
Abstract
1. Introduction
2. Related Work
3. The Proposed Model for Pseudocode Generation
3.1. Bidirectional Encoder
3.2. Auto-Regressive Decoder
3.3. Pre-Trained BART Model
4. Experiments
4.1. Datasets
4.2. Models’ Parameters
4.3. Performance Measures
4.4. Results
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Xia, X.; Bao, L.; Lo, D.; Xing, Z.; Hassan, A.E.; Li, S. Measuring program comprehension: A large-scale field study with professionals. IEEE Trans. Softw. Eng. 2017, 44, 951–976.
- Von Mayrhauser, A.; Vans, A.M. Program comprehension during software maintenance and evolution. Computer 1995, 28, 44–55.
- Gad, W.; Alokla, A.; Nazih, W.; Aref, M.; Salem, A.B. DLBT: Deep Learning-Based Transformer to Generate Pseudo-Code from Source Code. CMC-Comput. Mater. Contin. 2022, 70, 3117–3132.
- Alokla, A.; Gad, W.; Nazih, W.; Aref, M.; Salem, A.B. Retrieval-Based Transformer Pseudocode Generation. Mathematics 2022, 10, 604.
- Yang, G.; Zhou, Y.; Chen, X.; Yu, C. Fine-Grained Pseudo-Code Generation Method via Code Feature Extraction and Transformer. In Proceedings of the 28th Asia-Pacific Software Engineering Conference (APSEC), Taipei, Taiwan, 6–9 December 2021; IEEE: Manhattan, NY, USA, 2021.
- Alhefdhi, A.; Dam, H.K.; Hata, H.; Ghose, A. Generating Pseudo-Code from Source Code using Deep Learning. In Proceedings of the 25th Australasian Software Engineering Conference (ASWEC), Adelaide, SA, Australia, 26–30 November 2018; IEEE: Manhattan, NY, USA, 2018; pp. 21–25.
- Koehn, P. Neural Machine Translation; Cambridge University Press: Cambridge, UK, 2020.
- Babhulgaonkar, A.; Bharad, S. Statistical Machine Translation. In Proceedings of the 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, 5–6 October 2017; IEEE: Manhattan, NY, USA, 2017; pp. 62–67.
- Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-Code from Source Code using Statistical Machine Translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; IEEE: Manhattan, NY, USA, 2015; pp. 574–584.
- Sennrich, R.; Zhang, B. Revisiting Low-Resource Neural Machine Translation: A Case Study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 211–221.
- Mahata, S.K.; Mandal, S.; Das, D.; Bandyopadhyay, S. SMT vs. NMT: A comparison over Hindi & Bengali simple sentences. arXiv 2018, arXiv:1812.04898.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Reiter, E. A structured review of the validity of BLEU. Comput. Linguist. 2018, 44, 393–401.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Roodschild, M.; Sardinas, J.G.; Will, A. A new approach for the vanishing gradient problem on sigmoid activation. Prog. Artif. Intell. 2020, 9, 351–360.
- Pascanu, R.; Mikolov, T.; Bengio, Y. Understanding the exploding gradient problem. arXiv 2012, arXiv:1211.5063.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. 2018, preprint.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Online, 6–12 December 2020; Volume 33, pp. 1877–1901.
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Zhang, J.; Utiyama, M.; Sumita, E.; Neubig, G.; Nakamura, S. Guiding Neural Machine Translation with Retrieved Translation Pieces. In Proceedings of the NAACL-HLT, New Orleans, LA, USA, 1–6 June 2018; pp. 1325–1335.
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 933–941.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Deng, Y.; Huang, H.; Chen, X.; Liu, Z.; Wu, S.; Xuan, J.; Li, Z. From Code to Natural Language: Type-Aware Sketch-Based Seq2Seq Learning. In Proceedings of the International Conference on Database Systems for Advanced Applications, Hyderabad, India, 11–14 April 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 352–368.
- Gu, J.; Lu, Z.; Li, H.; Li, V.O. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1631–1640.
- Buch, L.; Andrzejak, A. Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection. In Proceedings of the 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; IEEE: Manhattan, NY, USA, 2019; pp. 95–104.
- Rai, S.; Gupta, A. Generation of Pseudo Code from the Python Source Code using Rule-Based Machine Translation. arXiv 2019, arXiv:1906.06117.
- Norouzi, S.; Tang, K.; Cao, Y. Code Generation from Natural Language with Less Prior and More Monolingual Data. arXiv 2021, arXiv:2101.00259.
- Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Liu, X. Retrieval-Based Neural Source Code Summarization. In Proceedings of the 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea, 5–11 October 2020; IEEE: Manhattan, NY, USA, 2020; pp. 1385–1397.
- Niu, C.; Li, C.; Ng, V.; Ge, J.; Huang, L.; Luo, B. SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations. In Proceedings of the 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 21–29 May 2022; IEEE: Manhattan, NY, USA, 2022; pp. 1–13.
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020; arXiv 2020, arXiv:2002.08155.
- Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv 2020, arXiv:2009.08366.
- Guo, J.; Liu, J.; Wan, Y.; Li, L.; Zhou, P. Modeling Hierarchical Syntax Structure with Triplet Position for Source Code Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 486–500.
- Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Kulal, S.; Pasupat, P.; Chandra, K.; Lee, M.; Padon, O.; Aiken, A.; Liang, P.S. SPoC: Search-based pseudocode to code. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
Model | Layers | Hidden Size | Attention Heads | Parameters
---|---|---|---|---
BART large | 12 | 1024 | 16 | 400 M
BART base | 6 | 768 | 12 | 110 M
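The configuration figures in the table above can be read directly from the published pre-trained checkpoints. The sketch below assumes the HuggingFace `transformers` implementation of BART and its `facebook/bart-base` and `facebook/bart-large` checkpoints; this is one common way to obtain the pre-trained model, not necessarily the authors' exact setup.

```python
# Minimal sketch: inspect the BART base/large configurations summarized above,
# assuming the HuggingFace `transformers` checkpoints.
from transformers import BartForConditionalGeneration

for name in ("facebook/bart-base", "facebook/bart-large"):
    model = BartForConditionalGeneration.from_pretrained(name)
    cfg = model.config
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {cfg.encoder_layers} encoder layers, "
          f"hidden size {cfg.d_model}, {cfg.encoder_attention_heads} heads, "
          f"~{n_params / 1e6:.0f}M parameters")
    del model  # free memory before loading the next checkpoint
```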
Dataset | Training | Evaluation | Test |
---|---|---|---|
Django | 17,005 | 600 | 1200 |
SPoC | 180,962 | 9000 | 15,183 |
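The paper's own training pipeline is not reproduced here; as a rough illustration of how BART can be fine-tuned on such code–pseudocode pairs, the sketch below again assumes the HuggingFace `transformers` implementation. The example pair, batch size, learning rate, and sequence length are illustrative placeholders rather than the authors' settings; the Django and SPoC splits above would supply the real pairs.

```python
# Illustrative fine-tuning sketch for code -> pseudocode pairs
# (hypothetical hyperparameters).
import torch
from torch.utils.data import DataLoader
from transformers import BartTokenizer, BartForConditionalGeneration

name = "facebook/bart-base"                     # or "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

# Placeholder pair; in the paper these come from Django (Python) or SPoC (C++).
train_pairs = [("if new_name != col_name:",
                "# if new_name is not equal to col_name,")]

def collate(batch):
    src, tgt = zip(*batch)
    enc = tokenizer(list(src), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(tgt), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # mask padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(5):                  # the result tables report 5-8 epochs
    for batch in loader:
        loss = model(**batch).loss      # token-level cross-entropy on pseudocode
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```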
# | Python Code | Pseudocode
---|---|---
Manually Generated Pseudocode | |
1 | if new_name != col_name: | # if new_name is not equal to col_name,
2 | self.stream.write(b'\n') | # write a newline character into self.stream file stream.
3 | def __init__(self, context, base, sysid, pubid): | # define the method __init__ with arguments self, context, base, sysid and pubid.
4 | if path.startswith(('http://', 'https://', '/')): | # if path starts with string 'http://' or string 'https://' or with string '/',
5 | dec_part = decimal_sep + dec_part | # add decimal_sep and dec_part, substitute the result for dec_part.
BART Base | |
1 | if new_name != col_name: | # if new_name is not equal to col_name,
2 | self.stream.write(b'\n') | # write newline to the self.stream file stream.
3 | def __init__(self, context, base, sysid, pubid): | # define the method __init__ with 5 arguments: self, context, base, sysid and pubid.
4 | if path.startswith(('http://', 'https://', '/')): | # if path starts with string 'http://', 'https://' or '/',
5 | dec_part = decimal_sep + dec_part | # sum decimal_sep and dec_part, substitute the result for dec_part.
BART Large | |
1 | if new_name != col_name: | # if new_name is not equal to col_name,
2 | self.stream.write(b'\n') | # write a new line into a self.stream file stream.
3 | def __init__(self, context, base, sysid, pubid): | # define the method __init__ with arguments self, context, base, sysid and pubid.
4 | if path.startswith(('http://', 'https://', '/')): | # if path starts with 'http://', 'https://' or '/',
5 | dec_part = decimal_sep + dec_part | # add decimal_sep and dec_part, substitute the result for dec_part.
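To make the comparison above concrete, a fine-tuned checkpoint can generate pseudocode for a single statement roughly as follows. This sketch assumes the HuggingFace `transformers` API; the checkpoint directory and beam-search settings are hypothetical, not values stated in the paper.

```python
# Generation sketch for the first Django example above
# (hypothetical checkpoint path and decoding settings).
from transformers import BartTokenizer, BartForConditionalGeneration

ckpt = "bart-django-pseudocode"            # hypothetical fine-tuned checkpoint
tokenizer = BartTokenizer.from_pretrained(ckpt)
model = BartForConditionalGeneration.from_pretrained(ckpt)
model.eval()

code = "if new_name != col_name:"
inputs = tokenizer(code, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=5, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# The table above shows both BART variants producing:
# "# if new_name is not equal to col_name,"
```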
# | C++ Code | Pseudocode
---|---|---
Manually Generated Pseudocode | |
1 | int len = s.size(); | create integer len with len = size of s
2 | cin >> s >> t; | read s and t
3 | for (int j = 0; j < (n - i * 2 - 1)/2; j++) { cout << "*"; } | for j = 0 to (n - i * 2 - 1)/2, print *
4 | if (a == "rock" and b == "paper") return -1; | return -1 if a is equal to "rock" and b = "paper"
5 | for (int j = 1; j <= 2 * (i - 1) + 1; j++) cout << "D"; | for j = 1 to 2 * (i - 1) + 1 inclusive, print D
BART Base | |
1 | int len = s.size(); | create integer len with len = size of s
2 | cin >> s >> t; | read s and t
3 | for (int j = 0; j < (n - i * 2 - 1)/2; j++) { cout << "*"; } | for j = 0 to (n - i * 2 - 1)/2, exclusive
4 | if (a == "rock" and b == "paper") return -1; | if a is equal to "rock" and b is equal "paper" return -1
5 | for (int j = 1; j <= 2 * (i - 1) + 1; j++) cout << "D"; | for j = 1 to 2 * (i - 1) + 1, display D
BART Large | |
1 | int len = s.size(); | create integer len with len = size of s
2 | cin >> s >> t; | read s and t
3 | for (int j = 0; j < (n - i * 2 - 1)/2; j++) { cout << "*"; } | for j = 0 to (n - i * 2 - 1)/2, print
4 | if (a == "rock" and b == "paper") return -1; | if a is equal to "rock" and b is equal "paper" return -1
5 | for (int j = 1; j <= 2 * (i - 1) + 1; j++) cout << "D"; | for j = 1 to 2 * (i - 1) + 1, print D
BART Large results per epoch on the Django dataset:

Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
---|---|---|---|---|---|---|---|---|
Train Loss | 0.644 | 0.30 | 0.270 | 0.131 | 0.241 | 0.375 | 0.106 | 0.093 |
Evaluation Loss | 0.543 | 0.470 | 0.435 | 0.417 | 0.417 | 0.426 | 0.437 | 0.441 |
Evaluation BLEU | 68.59 | 72.06 | 74.56 | 75.41 | 76.09 | 76.66 | 77.20 | 77.53 |
Evaluation Accuracy | 43.71 | 49.24 | 51.26 | 53.26 | 54.69 | 54.63 | 55.53 | 56.13 |
Test BLEU | 68.92 | 72.12 | 73.52 | 74.42 | 75.44 | 77.15 | 77.11 | 77.76 |
Test Accuracy | 46.57 | 50.12 | 52.75 | 53.33 | 54.80 | 56.39 | 56.80 | 58.31 |
BART Large results per epoch on the SPoC dataset:

Epoch | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|
Train Loss | 0.328 | 0.284 | 0.210 | 0.181 | 0.124 |
Evaluation Loss | 0.443 | 0.400 | 0.375 | 0.357 | 0.327 |
Evaluation BLEU | 70.28 | 73.32 | 75.87 | 75.89 | 77.05 |
Evaluation Accuracy | 52.83 | 54.62 | 57.79 | 60.18 | 60.77 |
Test BLEU | 71.58 | 74.32 | 75.43 | 76.89 | 77.65 |
Test Accuracy | 52.37 | 56.62 | 58.09 | 60.28 | 61.77 |
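The BLEU and accuracy rows in the two tables above are corpus-level BLEU and exact-match accuracy over the generated pseudocode. A minimal way to compute comparable numbers is sketched below; it assumes the `sacrebleu` package and toy data, and the authors' exact BLEU configuration and tokenization may differ.

```python
# Sketch of the two reported measures: corpus BLEU (via sacrebleu, an
# assumption) and exact-match accuracy, on toy hypothesis/reference pairs.
import sacrebleu

hypotheses = ["# if x is not equal to y,", "# read s and t"]    # model outputs (toy)
references = ["# if x is not equal to y,", "# read s and t."]   # reference pseudocode

bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
accuracy = 100.0 * sum(h.strip() == r.strip()
                       for h, r in zip(hypotheses, references)) / len(references)
print(f"BLEU = {bleu:.2f}, exact-match accuracy = {accuracy:.2f}%")
```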
Dataset | Model | BLEU
---|---|---
Django | BART Large | 77.76
 | BART Base | 75.82
 | Levenshtein Retrieval on 6-layer DLBT [4] | 61.96
 | Levenshtein Retrieval on 8-layer DLBT [4] | 61.29
 | 6-layer DLBT Not cross [3] | 59.62
 | 8-layer DLBT Not cross [3] | 58.58
 | Code2NL [24] | 56.54
 | code2pseudocode [6] | 54.78
 | T2SMT [9] | 54.08
 | DeepPseudo [5] | 50.817
 | Code-GRU [25] | 50.81
 | Seq2Seq w/ Atten. [5] | 43.96
 | NoAtt [6] | 43.55
 | RBMT [28] | 41.876
 | CODE-NN [5,25] | 40.51
 | ConvS2S [5] | 37.455
 | Seq2Seq w/o Atten. [5] | 36.483
 | Seq2Seq [25] | 28.26
 | PBMT [9] | 25.17
 | SimpleRNN [6] | 6.45
SPoC | BART Large | 77.65
 | BART Base | 76.26
 | Levenshtein Retrieval on 6-layer DLBT [4] | 50.28
 | 6-layer DLBT Not cross [3] | 48.12
 | DeepPseudo [5] | 46.454
 | Transformer [5] | 43.738
 | Seq2Seq w/ Atten. [5] | 41.007
 | ConvS2S [5] | 34.197
 | Seq2Seq w/o Atten. [5] | 33.761
 | CODE-NN [5,25] | 32.105
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).