MRBERT: Pre-Training of Melody and Rhythm for Automatic Music Generation
Abstract
1. Introduction
2. Related Work
3. Automatic Music Generation Based on MRBERT
3.1. Token Representation
3.2. Pre-Training of MRBERT
3.3. Fine-Tuning of Generation Tasks
3.3.1. Autoregressive Generation Task
3.3.2. Conditional Generation Task
3.3.3. Seq2Seq Generation Task
3.3.4. Joint Generation
4. Experiments
4.1. Dataset
4.2. Experimental Environment
4.3. Results of Autoregressive Generation
4.4. Results of Conditional Generation
4.5. Results of Seq2Seq Generation
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, W.; Wu, Q.J.; Yang, Y.; Akilan, T. Multimodel Feature Reinforcement Framework Using Moore–Penrose Inverse for Big Data Analysis. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 5008–5021.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; pp. 1877–1901.
- Dong, H.W.; Hsiao, W.Y.; Yang, L.C.; Yang, Y.H. MuseGAN: Multi-Track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 34–41.
- Li, S.; Jang, S.; Sung, Y. Automatic Melody Composition Using Enhanced GAN. Mathematics 2019, 7, 883.
- Choi, K.; Fazekas, G.; Sandler, M.; Cho, K. Convolutional Recurrent Neural Networks for Music Classification. In Proceedings of the 2017 IEEE 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2392–2396.
- Qiu, L.; Li, S.; Sung, Y. DBTMPE: Deep Bidirectional Transformers-Based Masked Predictive Encoder Approach for Music Genre Classification. Mathematics 2021, 9, 530.
- Park, H.; Yoo, C.D. Melody Extraction and Detection through LSTM-RNN with Harmonic Sum Loss. In Proceedings of the 2017 IEEE 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2766–2770.
- Li, S.; Jang, S.; Sung, Y. Melody Extraction and Encoding Method for Generating Healthcare Music Automatically. Electronics 2019, 8, 1250.
- McLeod, A.; Steedman, M. Evaluating Automatic Polyphonic Music Transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 23–27 September 2018; pp. 42–49.
- Jiang, Z.; Li, S.; Sung, Y. Enhanced Evaluation Method of Musical Instrument Digital Interface Data Based on Random Masking and Seq2Seq Model. Mathematics 2022, 10, 2747.
- Wu, J.; Hu, C.; Wang, Y.; Hu, X.; Zhu, J. A Hierarchical Recurrent Neural Network for Symbolic Melody Generation. IEEE Trans. Cybern. 2019, 50, 2749–2757.
- Li, S.; Jang, S.; Sung, Y. INCO-GAN: Variable-Length Music Generation Method Based on Inception Model-Based Conditional GAN. Mathematics 2021, 9, 387.
- Makris, D.; Agres, K.R.; Herremans, D. Generating Lead Sheets with Affect: A Novel Conditional Seq2Seq Framework. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Walder, C. Modelling Symbolic Music: Beyond the Piano Roll. In Proceedings of the 8th Asian Conference on Machine Learning (ACML), Hamilton, New Zealand, 16–18 November 2016; pp. 174–189.
- Hadjeres, G.; Pachet, F.; Nielsen, F. DeepBach: A Steerable Model for Bach Chorales Generation. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1362–1371.
- Chu, H.; Urtasun, R.; Fidler, S. Song From PI: A Musically Plausible Network for Pop Music Generation. arXiv 2016, arXiv:1611.03477.
- Mogren, O. C-RNN-GAN: Continuous Recurrent Neural Networks with Adversarial Training. arXiv 2016, arXiv:1611.09904.
- Noh, S.H. Analysis of Gradient Vanishing of RNNs and Performance Comparison. Information 2021, 12, 442.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
- Zeng, M.; Tan, X.; Wang, R.; Ju, Z.; Qin, T.; Liu, T.Y. MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 791–800.
- Chou, Y.H.; Chen, I.; Chang, C.J.; Ching, J.; Yang, Y.H. MidiBERT-Piano: Large-Scale Pre-Training for Symbolic Music Understanding. arXiv 2021, arXiv:2107.05223.
- Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 2227–2237.
- Huang, Y.S.; Yang, Y.H. Pop Music Transformer: Beat-Based Modeling and Generation of Expressive Pop Piano Compositions. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1180–1188.
- Hsiao, W.Y.; Liu, J.Y.; Yeh, Y.C.; Yang, Y.H. Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 178–186.
- Simonetta, F.; Carnovalini, F.; Orio, N.; Rodà, A. Symbolic Music Similarity through a Graph-Based Representation. In Proceedings of Audio Mostly on Sound in Immersion and Emotion, North Wales, UK, 12–14 September 2018; pp. 1–7.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Shapiro, I.; Huber, M. Markov Chains for Computer Music Generation. J. Humanist. Math. 2021, 11, 167–195.
- Mittal, G.; Engel, J.; Hawthorne, C.; Simon, I. Symbolic Music Generation with Diffusion Models. arXiv 2021, arXiv:2103.16091.
- Zhang, W.; Wu, Q.J.; Zhao, W.W.; Deng, H.; Yang, Y. Hierarchical One-Class Model with Subnetwork for Representation Learning and Outlier Detection. IEEE Trans. Cybern. 2022, 1–14.
- Zhang, W.; Yang, Y.; Wu, Q.J.; Wang, T.; Zhang, H. Multimodal Moore–Penrose Inverse-Based Recomputation Framework for Big Data Analysis. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–13.
- Zhang, W.; Wu, Q.J.; Yang, Y. Semisupervised Manifold Regularization via a Subnetwork-Based Representation Learning Model. IEEE Trans. Cybern. 2022, 1–14.
| Parameters | MRBERT | w/o Cross-Attn. | w/o Separate Embed. |
|---|---|---|---|
| Number of layers | 6 × 2 | 12 | 12 |
| Hidden size | 768 | 768 | 768 |
| FFN inner hidden size | 3072 | 3072 | 3072 |
| Attention heads | 12 | 12 | 12 |
| Attention head size | 64 | 64 | 64 |
| Dropout | 0.1 | 0.1 | 0.1 |
| Batch size | 32 | 32 | 32 |
| Weight decay | 0.01 | 0.01 | 0.01 |
| Max steps | 10 k | 10 k | 10 k |
| Warmup steps | 1 k | 1 k | 1 k |
| Learning rate decay | power | power | power |
| Adam ε | 1 × 10−6 | 1 × 10−6 | 1 × 10−6 |
| Adam β₁ | 0.9 | 0.9 | 0.9 |
| Adam β₂ | 0.98 | 0.98 | 0.98 |
| Melody vocab size | 68 + 4 = 72 | 68 + 4 = 72 | – |
| Rhythm vocab size | 17 + 4 = 21 | 17 + 4 = 21 | – |
| Melody + rhythm vocab size | – | – | 68 + 17 + 4 = 89 |
| Chord vocab size | 795 + 4 = 799 | 795 + 4 = 799 | 795 + 4 = 799 |
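The hyperparameters above can be gathered into a small configuration object. The sketch below is illustrative only (the class and field names are not from the paper's code); it assumes "6 × 2" means two parallel 6-layer streams, one for melody and one for rhythm, and that each vocabulary adds 4 special tokens (e.g., <BOS>, <EOS>, mask, and padding):

```python
from dataclasses import dataclass

@dataclass
class MRBertConfig:
    # Two parallel 6-layer streams (melody and rhythm) connected by
    # cross-attention; the ablations use a single 12-layer stack instead.
    num_layers_per_stream: int = 6
    num_streams: int = 2
    hidden_size: int = 768
    ffn_inner_size: int = 3072
    num_attention_heads: int = 12
    dropout: float = 0.1
    melody_vocab_size: int = 68 + 4   # 68 pitch tokens + 4 special tokens
    rhythm_vocab_size: int = 17 + 4   # 17 duration tokens + 4 special tokens
    chord_vocab_size: int = 795 + 4

    @property
    def attention_head_size(self) -> int:
        # 768 / 12 = 64, matching the attention head size in the table.
        return self.hidden_size // self.num_attention_heads

cfg = MRBertConfig()
```

Note that the per-head size is derived rather than stored, which keeps the config consistent by construction with the hidden size and head count.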
| Time Step | Pitch | Top-4 Melody Probabilities | Rhythm | Top-4 Rhythm Probabilities |
|---|---|---|---|---|
| 1 | <BOS> | – | <BOS> | – |
| 2 | Rest | Rest: 0.100; G4: 0.098; F4: 0.092; D4: 0.089 | 1/4 | 1/4: 0.309; 1/8: 0.263; 1/2: 0.146; 1/16: 0.054 |
| 3 | A4 | A4: 0.111; G4: 0.104; D4: 0.095; Rest: 0.095 | 1/4 | 1/4: 0.519; 1/8: 0.205; 1/2: 0.114; 3/4: 0.046 |
| 4 | G4 | G4: 0.127; E4: 0.114; A4: 0.087; F4: 0.079 | 1/4 | 1/4: 0.501; 1/8: 0.202; 1/4: 0.104; 3/4: 0.054 |
| 5 | E4 | E4: 0.132; A4: 0.098; F4: 0.081; D4: 0.072 | 1/8 | 1/8: 0.364; 1/4: 0.364; 1/2: 0.097; 3/4: 0.070 |
| 6 | G4 | G4: 0.161; A4: 0.153; D4: 0.079; B4: 0.069 | 1/8 | 1/8: 0.427; 1/4: 0.356; 1/2: 0.073; 3/8: 0.042 |
| 7 | A4 | A4: 0.187; E4: 0.146; B4: 0.080; D4: 0.077 | 1/4 | 1/4: 0.423; 1/8: 0.398; 1/2: 0.065; 3/8: 0.037 |
| 8 | E4 | E4: 0.152; A4: 0.136; G4: 0.125; D4: 0.104 | 1/8 | 1/8: 0.465; 1/4: 0.308; 1/2: 0.076; 3/4: 0.049 |
| 9 | G4 | G4: 0.157; E4: 0.147; A4: 0.118; D4: 0.112 | 1/8 | 1/8: 0.412; 1/4: 0.313; 1/2: 0.072; 3/8: 0.061 |
| 10 | A4 | A4: 0.164; D4: 0.100; E4: 0.089; C5: 0.066 | 1/8 | 1/8: 0.355; 1/4: 0.344; 1/2: 0.110; 3/8: 0.056 |
| 11 | C5 | C5: 0.125; G4: 0.107; D4: 0.093; F4: 0.087 | 1/8 | 1/8: 0.385; 1/4: 0.370; 1/2: 0.112; 3/8: 0.038 |
| 12 | G4 | G4: 0.177; A4: 0.148; E4: 0.139; D4: 0.088 | 1/8 | 1/8: 0.569; 1/4: 0.267; 1/2: 0.056; 3/8: 0.045 |
| 13 | A4 | A4: 0.163; E4: 0.113; D4: 0.106; Rest: 0.086 | 1/8 | 1/8: 0.405; 1/4: 0.338; 1/2: 0.071; 3/8: 0.048 |
| 14 | E4 | E4: 0.131; A4: 0.108; F4: 0.085; D4: 0.074 | 1/4 | 1/4: 0.453; 1/8: 0.319; 1/2: 0.082; 3/8: 0.029 |
| 15 | F4 | F4: 0.148; A4: 0.102; G4: 0.090; C5: 0.086 | 1/8 | 1/8: 0.497; 1/4: 0.263; 1/2: 0.075; 3/4: 0.046 |
| 16 | G4 | G4: 0.212; A4: 0.142; E4: 0.116; D4: 0.088 | 1/8 | 1/8: 0.519; 1/4: 0.259; 1/2: 0.082; 3/8: 0.031 |
| 17 | A4 | A4: 0.156; E4: 0.116; D4: 0.088; F4: 0.076 | 1/8 | 1/8: 0.445; 1/4: 0.349; 1/2: 0.056; 3/8: 0.039 |
| 18 | F4 | F4: 0.144; E4: 0.104; G4: 0.087; C5: 0.079 | 1/8 | 1/8: 0.452; 1/4: 0.286; 1/2: 0.093; 3/8: 0.045 |
| 19 | G4 | G4: 0.148; A4: 0.134; E4: 0.103; D4: 0.099 | 1/8 | 1/8: 0.489; 1/4: 0.329; 1/2: 0.065; 3/8: 0.034 |
| 20 | E4 | E4: 0.139; A4: 0.120; C5: 0.093; F4: 0.077 | 1/8 | 1/8: 0.495; 1/4: 0.296; 1/2: 0.082; 3/8: 0.041 |
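Each row above lists the top-4 candidates for the next pitch and the next rhythm token. One common way to turn such a distribution into a generated token is top-k sampling; the following is a minimal sketch under that assumption (the paper does not specify its exact decoding strategy), using the rhythm distribution reported at time step 2:

```python
import random

def sample_top_k(probs: dict, k: int = 4, rng=random) -> str:
    """Keep the k most probable tokens, renormalize, and draw one."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    for tok, p in top:
        r -= p
        if r <= 0:
            return tok
    return top[-1][0]  # guard against floating-point round-off

# Top-4 rhythm probabilities reported at time step 2 of the table.
rhythm_probs = {"1/4": 0.309, "1/8": 0.263, "1/2": 0.146, "1/16": 0.054}
next_rhythm = sample_top_k(rhythm_probs, k=4)
```

With these weights, "1/4" is drawn roughly 40% of the time after renormalization, which matches the greedy choice shown in the table while still allowing variation between generations.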
| Model | HITS@1 Mel. (%) | HITS@1 Rhy. (%) | HITS@3 Mel. (%) | HITS@3 Rhy. (%) | HITS@5 Mel. (%) | HITS@5 Rhy. (%) | HITS@10 Mel. (%) | HITS@10 Rhy. (%) |
|---|---|---|---|---|---|---|---|---|
| MRBERT | 15.87 | 51.53 | 42.03 | 83.01 | 61.53 | 92.81 | 87.36 | 99.81 |
| w/o cross-attn. | 14.74 | 51.44 | 38.96 | 82.65 | 57.45 | 91.88 | 84.58 | 99.80 |
| w/o separate embed. | 14.27 | 51.16 | 38.14 | 82.17 | 55.90 | 90.91 | 83.88 | 99.79 |
| RNN | 12.51 | 48.24 | 33.60 | 79.28 | 50.28 | 89.67 | 78.63 | 99.72 |
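HITS@k measures the fraction of positions at which the ground-truth token appears among the model's k highest-ranked candidates. A minimal sketch of the metric (toy data below is hypothetical, not the paper's evaluation set):

```python
def hits_at_k(ranked_predictions: list, targets: list, k: int) -> float:
    """Percentage of positions where the true token is in the top-k
    candidates ranked by the model."""
    hits = sum(t in preds[:k] for preds, t in zip(ranked_predictions, targets))
    return 100.0 * hits / len(targets)

# Hypothetical per-position rankings (best candidate first) and truths.
ranked = [["G4", "A4", "E4"], ["A4", "G4", "D4"], ["E4", "F4", "A4"]]
truth = ["A4", "A4", "C5"]
h1 = hits_at_k(ranked, truth, k=1)  # only position 2 hits -> 1/3
h3 = hits_at_k(ranked, truth, k=3)  # positions 1 and 2 hit -> 2/3
```

Because rhythm has a far smaller vocabulary (21 tokens) than melody (72 tokens), rhythm HITS@k values in the tables are naturally much higher at every k.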
| Masked Pitch Sequence | Probabilities of Pitch | Masked Rhythm Sequence | Probabilities of Rhythm |
|---|---|---|---|
| <BOS>, D4, E-4, F4, G4, … | E-4: 0.276; G4: 0.130; B-4: 0.118; A-4: 0.114; F4: 0.087; Rest: 0.069 | <BOS>, 1/6, 1/6, 1/6, 1/2, … | 1/6: 0.626; 3/16: 0.098; 1/4: 0.094; 1/2: 0.048 |
| …, C5, B4, A4, G4, F#4, … | A4: 0.229; Rest: 0.164; G4: 0.160; C5: 0.141; B4: 0.096; D5: 0.033 | …, 1/8, 1/8, 1/8, 1/8, 1/8, … | 1/8: 0.785; 1/4: 0.109; 3/8: 0.040; 1/2: 0.038 |
| …, G4, F4, F4, F4, <EOS> | G4: 0.280; F4: 0.127; A4: 0.116; E4: 0.105; D4: 0.086; F#4: 0.083 | …, 3/8, 1/8, 1/2, 1/2, <EOS> | 1/8: 0.280; 1/2: 0.197; 1/4: 0.086; 3/8: 0.080 |
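In the conditional setting, the model predicts a distribution over candidates for a masked position given the surrounding context. The simplest infilling rule is greedy decoding, i.e., taking the argmax; a minimal sketch (the paper may instead sample from the distribution), using the candidate set reported for the first masked pitch above:

```python
def fill_mask(candidates: dict) -> str:
    """Greedy infilling: pick the highest-probability candidate for a
    masked position."""
    return max(candidates, key=candidates.get)

# Candidate distribution reported for the first masked pitch in the table.
pitch_probs = {"E-4": 0.276, "G4": 0.130, "B-4": 0.118,
               "A-4": 0.114, "F4": 0.087, "Rest": 0.069}
filled = fill_mask(pitch_probs)  # "E-4", the ground-truth pitch
```

In the first two rows the greedy choice recovers the masked token (E-4, A4, 1/6, 1/8), while in the last row the argmax (G4, 1/8) differs from the ground truth (F4, 1/2), illustrating where HITS@1 fails but HITS@k with larger k still counts a hit.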
| Model | HITS@1 Mel. (%) | HITS@1 Rhy. (%) | HITS@3 Mel. (%) | HITS@3 Rhy. (%) | HITS@5 Mel. (%) | HITS@5 Rhy. (%) | HITS@10 Mel. (%) | HITS@10 Rhy. (%) |
|---|---|---|---|---|---|---|---|---|
| MRBERT | 18.67 | 51.14 | 45.86 | 82.78 | 65.05 | 93.69 | 89.84 | 99.79 |
| w/o cross-attn. | 18.07 | 50.93 | 43.94 | 82.02 | 63.35 | 92.55 | 88.10 | 99.69 |
| w/o separate embed. | 15.69 | 48.61 | 40.27 | 80.11 | 57.68 | 90.73 | 84.91 | 99.57 |
| BiRNN | 13.07 | 48.11 | 34.91 | 78.48 | 51.95 | 89.03 | 79.71 | 99.12 |
| Model | HITS@1 (%) | HITS@3 (%) | HITS@5 (%) | HITS@10 (%) |
|---|---|---|---|---|
| MRBERT | 22.94 | 45.90 | 57.42 | 71.97 |
| w/o cross-attn. | 22.61 | 45.24 | 56.75 | 71.18 |
| w/o separate embed. | 22.15 | 43.46 | 55.12 | 70.17 |
| BiRNN | 19.70 | 39.96 | 51.50 | 66.51 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, S.; Sung, Y. MRBERT: Pre-Training of Melody and Rhythm for Automatic Music Generation. Mathematics 2023, 11, 798. https://doi.org/10.3390/math11040798