Spectral Salt-and-Pepper Patch Masking for Self-Supervised Speech Representation Learning
Abstract
1. Introduction
- We propose a simple yet novel masking method for self-supervised speech representation learning based on consecutive quadrilateral-shaped S&P patch blocks. To our knowledge, S&P masking has not previously been applied to speech representation learning.
- Because a spectrogram differs from a natural image in resolution and scale, applying S&P noise directly is ineffective. We therefore modify S&P noise and show that the modified form is better suited to the reconstruction objective of self-supervised speech representation learning (a minimal sketch of the resulting spectral S&P patch masking follows this list).
- We show that combining the proposed spectral S&P patch masking with a conventional reconstruction-based speech representation learning approach is more effective on several speech downstream tasks than using traditional masking methods alone.
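As a concrete illustration of the idea, the following PyTorch sketch applies salt-and-pepper corruption in consecutive rectangular patches over a log-mel spectrogram before masked reconstruction. This is a minimal sketch, not the authors' exact implementation: the patch size, the corruption probability `prob_c` (mirroring the hyperparameter C studied in Section 5.7), and the choice of salt/pepper values (spectrogram maximum/minimum) are illustrative assumptions.

```python
# Minimal sketch (assumed configuration, not the authors' exact code):
# corrupt a log-mel spectrogram with salt-and-pepper "patches" for reconstruction pretraining.
import torch


def spectral_snp_patch_mask(spec, patch_time=8, patch_freq=8, prob_c=0.004,
                            salt_val=None, pepper_val=None):
    """Return a copy of `spec` (time x freq) in which randomly selected
    quadrilateral patches are replaced by salt (max) or pepper (min) values."""
    corrupted = spec.clone()
    t_len, f_len = spec.shape
    salt = spec.max() if salt_val is None else salt_val        # "salt" = spectrogram maximum (assumption)
    pepper = spec.min() if pepper_val is None else pepper_val  # "pepper" = spectrogram minimum (assumption)
    # Walk over a grid of non-overlapping patches; corrupt each patch with probability C.
    for t0 in range(0, t_len, patch_time):
        for f0 in range(0, f_len, patch_freq):
            if torch.rand(1).item() < prob_c:
                value = salt if torch.rand(1).item() < 0.5 else pepper
                corrupted[t0:t0 + patch_time, f0:f0 + patch_freq] = value
    return corrupted


# Usage: corrupt the input and train the encoder to reconstruct the clean spectrogram.
spec = torch.randn(400, 80)                # e.g., 400 frames x 80 mel bins
noisy = spectral_snp_patch_mask(spec)
# loss = F.l1_loss(model(noisy), spec)     # reconstruction objective (L1 loss is an assumption)
```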
2. Related Work
2.1. Salt-and-Pepper Noise
2.2. Masked Reconstruction for Self-Supervised Speech Representation Learning
3. Method
3.1. Modified S&P for Speech Representation Learning
3.2. Consecutive Quadrilateral-Shaped Spectral S&P Patch Masking for Self-Supervised Learning
3.3. Pretraining with S&P Patch Masking for Self-Supervised Learning
3.4. Training Pretrained Model on Downstream Tasks
4. Experimental Setup
4.1. Dataset Description
4.2. Downstream Tasks Details
4.3. Software and Hardware Details
- Python version: 3.9.15
- GPU server: The pretraining and downstream experiments are conducted on an NVIDIA RTX A6000 GPU (48GB) server and TITAN RTX GPU (24GB) server running Ubuntu 18.04.6 LTS, respectively.
- Deep learning framework: PyTorch [54] version 1.12.1 with CUDA version 11.3 and cuDNN version 8.2.1. These libraries enable efficient GPU acceleration for training and inference.
- English ASR downstream: For English ASR tasks, both the KenLM [48] (Available online: https://github.com/kpu/kenlm, accessed on 14 June 2023) and Wav2letter++ [49] (Available online: https://github.com/flashlight/wav2letter, accessed on 14 June 2023) libraries are employed. KenLM is used for language modeling, while Wav2letter++ provides the decoder used to produce recognition hypotheses.
- Korean ASR downstream: For Korean ASR tasks, the python-Levenshtein library (version 0.20.9) is used to compute the edit-distance-based error metric (a minimal CER example follows this list).
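The snippet below is a minimal sketch of computing the character error rate (CER) with python-Levenshtein, as used for the Korean ASR evaluation. Stripping spaces before scoring is an assumption for illustration, not necessarily the authors' exact normalization.

```python
# Minimal CER sketch using the python-Levenshtein edit distance.
import Levenshtein


def cer(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    ref = reference.replace(" ", "")   # assumption: spaces ignored for Korean CER
    hyp = hypothesis.replace(" ", "")
    return Levenshtein.distance(ref, hyp) / max(len(ref), 1)


print(f"{cer('안녕하세요', '안녕하세오') * 100:.2f}%")  # -> 20.00%
```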
5. Results
5.1. LibriSpeech Phoneme Classification Results
5.2. TIMIT Phoneme Classification Results
5.3. Keyword Spotting Results
5.4. Speaker Identification Results
5.5. English ASR Results
5.6. Korean ASR Results
5.7. Ablation: Impact of Two S&P Patch Masking Hyperparameters
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Liu, A.T.; Yang, S.W.; Chi, P.H.; Hsu, P.C.; Lee, H.Y. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6419–6423.
- Chung, Y.A.; Zhang, Y.; Han, W.; Chiu, C.C.; Qin, J.; Pang, R.; Wu, Y. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 244–250.
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
- Liu, A.T.; Li, S.W.; Lee, H.Y. TERA: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2351–2366.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Ling, S.; Liu, Y.; Salazar, J.; Kirchhoff, K. Deep contextualized acoustic representations for semi-supervised speech recognition. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6429–6433.
- Wang, W.; Tang, Q.; Livescu, K. Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6889–6893.
- Liu, A.H.; Chung, Y.A.; Glass, J. Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies. Proc. Interspeech 2021, 3730–3734.
- Ling, S.; Liu, Y. DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization. arXiv 2020, arXiv:2012.06659.
- Chi, P.H.; Chung, P.H.; Wu, T.H.; Hsieh, C.C.; Chen, Y.H.; Li, S.W.; Lee, H.Y. Audio ALBERT: A lite BERT for self-supervised learning of audio representation. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 344–350.
- van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-Training for Speech Recognition. Proc. Interspeech 2019, 3465–3469.
- Chung, Y.A.; Hsu, W.N.; Tang, H.; Glass, J. An Unsupervised Autoregressive Model for Speech Representation Learning. Proc. Interspeech 2019, 146–150.
- Gunel, B.; Du, J.; Conneau, A.; Stoyanov, V. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning. arXiv 2020, arXiv:2011.01403.
- Kim, T.; Yoo, K.M.; Lee, S.G. Self-Guided Contrastive Learning for BERT Sentence Representations. arXiv 2021, arXiv:2106.07345.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
- Palanisamy, K.; Singhania, D.; Yao, A. Rethinking CNN models for audio classification. arXiv 2020, arXiv:2007.11154.
- Gong, Y.; Chung, Y.A.; Glass, J. PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3292–3306.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. Proc. Interspeech 2021, 571–575.
- Gong, Y.; Lai, C.I.; Chung, Y.A.; Glass, J. SSAST: Self-supervised audio spectrogram transformer. AAAI Conf. Artif. Intell. 2022, 36, 10699–10709.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proc. Interspeech 2019, 2613–2617.
- DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
- Agostinelli, F.; Anderson, M.R.; Lee, H. Adaptive multi-column deep neural networks with application to robust image denoising. Adv. Neural Inf. Process. Syst. 2013, 26.
- Lehtinen, J.; Munkberg, J.; Hasselgren, J.; Laine, S.; Karras, T.; Aittala, M.; Aila, T. Noise2Noise: Learning Image Restoration without Clean Data. arXiv 2018, arXiv:1803.04189.
- Liang, L.; Deng, S.; Gueguen, L.; Wei, M.; Wu, X.; Qin, J. Convolutional neural network with median layers for denoising salt-and-pepper contaminations. Neurocomputing 2021, 442, 26–35.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Bang, J.U.; Yun, S.; Kim, S.H.; Choi, M.Y.; Lee, M.K.; Kim, Y.J.; Kim, D.H.; Park, J.; Lee, Y.J.; Kim, S.H. KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition. Appl. Sci. 2020, 10, 6936.
- Chan, R.H.; Ho, C.W.; Nikolova, M. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. IEEE Trans. Image Process. 2005, 14, 1479–1485.
- Esakkirajan, S.; Veerakumar, T.; Subramanyam, A.N.; PremChand, C. Removal of high density salt and pepper noise through modified decision based unsymmetric trimmed median filter. IEEE Signal Process. Lett. 2011, 18, 287–290.
- Erhan, D.; Courville, A.; Bengio, Y.; Vincent, P. Why does unsupervised pre-training help deep learning? In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 201–208.
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
- S3PRL Speech Toolkit (S3PRL). GitHub. Available online: https://github.com/s3prl/s3prl (accessed on 9 June 2023).
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. N 1993, 93, 27403.
- Warden, P. Speech Commands: A Public Dataset for Single-Word Speech Recognition. Available online: http://download.tensorflow.org/data/speech_commands_v0 (accessed on 9 June 2023).
- Nagrani, A.; Chung, J.S.; Xie, W.; Zisserman, A. VoxCeleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 2020, 60, 101027.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
- Lee, K.F.; Hon, H.W. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 1641–1648.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Yang, S.W.; Chi, P.H.; Chuang, Y.S.; Lai, C.I.J.; Lakhotia, K.; Lin, Y.Y.; Liu, A.T.; Shi, J.; Chang, X.; Lin, G.T.; et al. SUPERB: Speech Processing Universal PERformance Benchmark. Proc. Interspeech 2021, 1194–1198.
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
- Heafield, K. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30–31 July 2011; pp. 187–197.
- Pratap, V.; Hannun, A.; Xu, Q.; Cai, J.; Kahn, J.; Synnaeve, G.; Liptchinsky, V.; Collobert, R. Wav2letter++: A fast open-source speech recognition system. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6460–6464.
- Kim, J.W.; Chung, H.; Jung, H.Y. Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition. Mathematics 2023, 11, 622.
- Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Enrique Yalta Soplin, N.; Heymann, J.; Wiesner, M.; Chen, N.; et al. ESPnet: End-to-End Speech Processing Toolkit. Proc. Interspeech 2018, 2207–2211.
- Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32.
- Yujian, L.; Bo, L. A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
- Yang, Y.Y.; Hira, M.; Ni, Z.; Astafurov, A.; Chen, C.; Puhrsch, C.; Pollack, D.; Genzel, D.; Greenberg, D.; Yang, E.Z.; et al. TorchAudio: Building blocks for audio and speech processing. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6982–6986.
- Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
- Kahn, J.; Rivière, M.; Zheng, W.; Kharitonov, E.; Xu, Q.; Mazaré, P.E.; Karadayi, J.; Liptchinsky, V.; Collobert, R.; Fuegen, C.; et al. Libri-Light: A benchmark for ASR with limited or no supervision. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7669–7673.
| Name | Hours | Used for Pretraining | Used for Downstream Task |
|---|---|---|---|
| LibriSpeech [33] | 960 | ✓ | Phoneme Classification (100 h), English ASR (100 h) |
| TIMIT [40] | 5.4 | ✗ | Phoneme Classification (All) |
| Speech Commands [41] | 18 | ✗ | Keyword Spotting (All) |
| VoxCeleb1 [42] | 352 | ✗ | Speaker Identification (All) |
| KsponSpeech [34] | 1000 | ✓ | Korean ASR (All) |
| Representations | Network Type | No. Model Parameters |
|---|---|---|
| Fbank * | - | 0 |
| APC * [15] | Non-parallel | 9,107,024 |
| NPC * [10] | Non-parallel | 19,380,560 |
| Mockingjay * [3] | Non-parallel | 22,226,928 |
| Audio ALBERT * [12] | Non-parallel | 7,805,264 |
| TERA * [6] | Non-parallel | 21,981,008 |
| S&P Patch † (Ours) | Non-parallel | 21,981,008 |
| *Combined with other representations* | | |
| Mockingjay + Ours † | Parallel | 22,226,928 |
| Audio ALBERT + Ours † | Parallel | 7,805,264 |
| TERA + Ours † | Parallel | 21,981,008 |
| Representations | Feature Extraction WER (↓) | Feature Extraction Rescore (↓) | Fine-Tuning WER (↓) | Fine-Tuning Rescore (↓) | Average WER (↓) | Average Rescore (↓) |
|---|---|---|---|---|---|---|
| Fbank | 27.90 | 18.42 | 27.90 | 18.42 | 27.90 | 18.42 |
| APC [15] | 23.66 | 16.58 | 21.44 | 15.37 | 22.55 | 15.98 |
| NPC [10] | 24.18 | 16.25 | 21.20 | 14.55 | 22.69 | 15.40 |
| Mockingjay [3] | 26.45 | 17.59 | 19.48 | 14.43 | 22.97 | 16.01 |
| Audio ALBERT [12] | 24.32 | 16.14 | 19.16 | 14.27 | 21.74 | 15.21 |
| TERA [6] | 22.47 | 14.96 | 19.95 | 14.16 | 21.21 | 14.56 |
| Ours | 26.35 | 16.83 | 20.39 | 14.52 | 23.37 | 15.68 |
| *Combined with other representations* | | | | | | |
| Mockingjay + Ours | 26.35 | 16.83 | 20.39 | 14.52 | 23.37 | 15.68 |
| Audio ALBERT + Ours | 26.23 | 17.30 | 19.25 | 14.05 | 22.74 | 15.68 |
| TERA + Ours | 21.74 | 14.04 | 17.78 | 13.03 | 19.76 | 13.54 |
| Representations | Feature Extraction WER (↓) | Feature Extraction Rescore (↓) | Fine-Tuning WER (↓) | Fine-Tuning Rescore (↓) | Average WER (↓) | Average Rescore (↓) |
|---|---|---|---|---|---|---|
| Fbank | 22.89 | 15.35 | 22.89 | 15.35 | 22.89 | 15.35 |
| APC [15] | 21.94 | 15.32 | 19.18 | 13.26 | 20.56 | 14.29 |
| NPC [10] | 22.21 | 15.45 | 19.42 | 13.36 | 20.82 | 14.41 |
| Mockingjay [3] | 21.56 | 15.31 | 17.75 | 12.52 | 19.66 | 13.92 |
| Audio ALBERT [12] | 21.13 | 14.54 | 17.20 | 12.25 | 19.17 | 13.40 |
| TERA [6] | 19.90 | 13.33 | 17.18 | 12.06 | 18.54 | 12.70 |
| Ours | 22.03 | 15.89 | 17.88 | 13.02 | 19.96 | 14.46 |
| *Combined with other representations* | | | | | | |
| Mockingjay + Ours | 21.22 | 15.06 | 17.31 | 12.42 | 19.27 | 13.74 |
| Audio ALBERT + Ours | 20.58 | 14.24 | 17.03 | 12.02 | 18.81 | 13.13 |
| TERA + Ours | 18.02 | 12.94 | 16.37 | 11.51 | 17.20 | 12.23 |
| Representations | CER (↓) |
|---|---|
| Fbank | 15.31 |
| APC [15] | 13.36 |
| NPC [10] | 14.78 |
| Mockingjay [3] | 16.95 |
| Audio ALBERT [12] | 17.25 |
| TERA [6] | 13.86 |
| SVR1K [50] | 12.32 |
| Ours | 14.66 |
| *Combined with other representations* | |
| Mockingjay + Ours | 14.83 |
| Audio ALBERT + Ours | 15.87 |
| TERA + Ours | 12.14 |
(a)

| C | Feature Extraction (↑) | Fine-Tuning (↑) |
|---|---|---|
| 0.001 | 68.91 | 88.06 |
| 0.002 | 70.29 | 88.46 |
| 0.004 | 73.03 | 89.18 |
| 0.006 | 71.07 | 88.85 |
| 0.008 | 69.97 | 88.75 |
| 0.01 | 69.02 | 88.14 |

(b)

| C | Feature Extraction (↑) | Fine-Tuning (↑) |
|---|---|---|
| 0.004 | 70.84 | 87.89 |
| | 71.82 | 88.13 |
| | 71.70 | 88.49 |
| | 73.03 | 89.18 |
| | 72.10 | 88.87 |
| | 71.59 | 88.49 |
| | 68.85 | 88.44 |
| | 71.76 | 88.86 |
(a)

| C | CER (↓) |
|---|---|
| 0.002 | 13.84 |
| 0.004 | 12.14 |
| 0.006 | 13.58 |
| 0.008 | 13.75 |
| 0.01 | 13.34 |

(b)

| C | CER (↓) |
|---|---|
| 0.004 | 17.25 |
| | 12.14 |
| | 13.51 |
| | 15.82 |