1. Introduction
As global resource shortages grow more severe, the development and utilization of marine resources are becoming increasingly important. The ocean is seen as a crucial future energy source, expected to provide over 50% of energy needs [1]. This outlook has significantly promoted deep-sea activities and the development of related technologies worldwide, such as deep-sea diving, deep-sea mining, and deep-sea exploration. It has also facilitated the application of saturation diving technology in fields such as marine development, maritime operations, and offshore rescue.
Operating in deep-sea environments requires highly specialized technology. In addition to relying on advanced manned submersibles or underwater vehicles, saturation diving technology can be used. Saturation diving is a special diving technique that allows divers to work in deep-sea environments for extended periods [2]. In saturation diving, divers inhale a gas mixture composed of helium and oxygen, which enables them to adapt to the high-pressure environment of the deep sea without experiencing 'nitrogen narcosis' [3]. However, this helium-oxygen gas mixture causes significant changes in the divers' speech, resulting in the phenomenon of 'helium speech'. The distortion caused by helium speech not only hampers the ability of surface support personnel to assess the physiological condition of divers but also hinders communication between divers, reducing the safety and efficiency of diving operations. Correcting helium speech distortion is therefore an urgent problem. Various attempts have been made to address this challenge, mainly using speech processing techniques that enhance the intelligibility of helium speech. Although traditional methods such as time-domain processing, linear prediction, and homomorphic signal processing have improved the intelligibility of helium speech to some extent, the quality of the corrected speech has not yet reached the desired standard. These methods mainly address formant frequency distortion but fall short in dealing with noise and pitch correction.
In current helium speech processing, time-domain segmentation is the most commonly used correction method. The helium speech signal is divided into several segments, each segment is stretched, and the segments are then recombined into a corrected speech signal [4]. Although this method addresses formant frequency distortion to some extent, it does not solve the problem of spectral envelope correction. To achieve more complex spectral envelope correction, Suzuki et al. [5] proposed linear prediction and homomorphic signal processing methods to estimate and correct the impulse response of the vocal tract, and also used inter-segment autocorrelation techniques to reduce the influence of noise on helium speech [6]. Richards [7] proposed a helium speech correction method based on the short-time Fourier transform of speech signals. This method is similar to time-domain segmentation, except that it processes the signal segments in the frequency domain via the Fourier transform rather than in the time domain. Although it can map any spectral envelope, it cannot correct pitch distortion. Subsequently, Duncan proposed a residual-excited linear prediction coder, which improved the intelligibility and naturalness of the speech output to some extent but remained far from practical application. Other methods for correcting helium speech include improved speech excitation vocoders [8], analytic signal methods [9], and high-pressure speech transcoders [10].
There are several urgent issues in helium speech correction: First, there is currently no publicly available dataset for helium speech correction. Second, processing methods for helium speech remain confined to time-domain and frequency-domain approaches. Additionally, traditional correction methods must satisfy certain conditions [11], and most correction functions apply only to a specific frequency range rather than all frequencies. Finally, helium speech correction involves both sentences and isolated words, but there are no effective evaluation metrics for sentences.
With the rapid development of machine learning and deep learning technologies [12,13], researchers have begun exploring their application to helium speech correction. GANs, an advanced class of deep learning models, were initially applied in image processing and later extended to speech signal processing [14]. Through their multi-layer structure and unsupervised learning mechanism, GANs can effectively capture the spectral details of a speech signal [15] and significantly improve speech correction performance [16]. The main contributions of this paper are as follows:
Constructing a helium speech dataset: The dataset includes various types of speech at different depths. Speech from multiple divers is collected for the same text, which improves the diversity of the dataset to some extent;
Proposing a comprehensive similarity evaluation metric: This metric is used to conduct in-depth evaluation from the perspective of keywords in sentences, providing an additional reference point for evaluating the performance of helium speech;
Proposing an adaptive-audio-based metric generative adversarial network (AMGAN): This approach utilizes an adaptive segmentation algorithm and a fusion loss function, enhancing the ability to learn helium speech features while overcoming the shortcomings of traditional methods in high-pitch correction.
2. Related Work
Helium speech refers to speech produced in an environment containing helium. Due to the physical properties of helium, the resonant frequencies of the speaker's vocal tract shift nonlinearly upward [17]. The main characteristics of helium speech include a nonlinear increase in formant frequencies, an increase in formant bandwidth, a reduction in formant amplitude, and changes in the pitch period. These features significantly reduce the intelligibility of helium speech, thereby affecting the quality of speech communication in special environments such as saturation diving [18].
2.1. Helium Speech Evaluation Mechanisms and Datasets
One of the main challenges in helium speech research is the lack of evaluation mechanisms and the scarcity of datasets. The effectiveness of previous helium speech correction methods has mainly been evaluated by the word error rate [19], which measures the correction effect by counting the errors between the corrected speech and the target speech. However, the word error rate has significant limitations in practice [20]. It is a superficial evaluation based on string matching: it cannot deeply compare the semantic consistency and perceptual quality of speech, nor can it effectively compare the similarity between keywords in two sentences [21]. This prevents the word error rate from fully reflecting the true intelligibility and clarity of the corrected speech, constraining the evaluation results [22].
The signal-to-noise ratio (SNR) and the mean opinion score (MOS) are not suitable for directly evaluating speech correction effects at this stage. The SNR mainly measures the clarity of the signal and the influence of background noise; while it can indicate the quality of the speech signal, it does not address the distortion caused by helium speech. The MOS is based on subjective ratings from listeners and aims to quantify speech quality. Although it offers some reference value, its subjectivity and the variability of evaluation criteria mean that the MOS may not accurately reflect the semantic similarity targeted in this study.
Regarding datasets, there is currently no publicly available standardized dataset in the field of helium speech, which complicates research on helium speech correction [23]. Collecting helium speech data is inherently challenging: it requires a high-pressure environment, making the collection process complex and time-consuming. A high-quality dataset should include speech data from various languages and cover different types of material, such as isolated words and continuous passages, to ensure diversity and representativeness.
2.2. Helium Speech Distortion Mechanisms
In normal speech, sound waves generated by vocal cord vibrations are transmitted through the vocal tract, producing formants at different frequencies. In a helium environment, however, the speed of sound is higher than in air, causing the resonant frequencies to shift upward. This upward shift, particularly in the low-frequency range, manifests as nonlinear warping, which severely affects speech clarity and intelligibility [24,25]. The increase in formant bandwidth is limited at high frequencies but much larger at low frequencies, and the formant amplitude is significantly attenuated, further contributing to speech distortion.
The authors of [26] compared the spectrograms of the isolated phrase "amplitude modulation" in normal speech and helium speech, finding that the displacement ratios of the first four formant frequencies in helium speech were 1.81, 1.73, 1.16, and 1.32, respectively, and that these shifts were nonlinear. There was also an increase in formant bandwidth and a significant attenuation of the high-frequency components above 6000 Hz. Another study [27] compared the spectrograms of continuous speech in normal and helium environments, revealing that the pitch frequency in helium speech increased by approximately 1.5 octaves, indicating a significant change in the pitch period. These studies suggest that changes in formant frequencies and pitch period are the primary factors behind helium speech distortion.
2.3. Feature Selection for Helium Speech
FBank features are commonly used in speech signal processing and are known for their excellent spectral representation capabilities [28]. The FBank features of helium speech are obtained by applying the Fourier transform to the helium speech signal and filtering the resulting spectrum. FBank features not only retain important information from the speech signal but also effectively suppress the influence of noise. Additionally, they have high resolution in the frequency domain, allowing them to accurately represent the spectral characteristics of speech signals. FBank features have therefore been widely used in fields such as speech recognition and speech synthesis. In helium speech correction, FBank features help capture the spectral characteristics of the speech signal, thereby improving correction accuracy.
Formant features are also an important characteristic of speech signals, reflecting formant frequencies and bandwidths. Formants are determined by the shape and size of the vocal tract and have a significant impact on the quality and timbre of speech [29]. In speech signals, formants typically appear as a series of spectral peaks that remain relatively stable across different speech segments. Analyzing formant features yields information on the formant frequencies and bandwidths of a speech signal, aiding the identification and differentiation of different speech signals. In helium speech correction, formant features help restore the quality and timbre of the speech, making the corrected speech more natural and clear.
Using both FBank features and formant features has a complementary effect. FBank features primarily capture the spectral characteristics of the speech signal, providing high-resolution spectral information, while formant features capture the formant information of the speech signal [30], providing information on speech quality and timbre. Combining the two enables a comprehensive analysis of the speech signal in both the spectral and formant domains.
2.4. Common Correction Methods for Helium Speech
Traditional helium speech correction methods fall into two categories: time-domain processing and frequency-domain processing. Both correct helium speech by analyzing the speech production mechanism in a helium-oxygen environment, but they differ in processing details.

Time-domain processing techniques focus on the temporal characteristics of speech signals. Common approaches, such as pitch period detection and waveform adjustment, can enhance the clarity of helium speech to some extent. However, because helium speech distortion originates in the frequency domain, the effectiveness of time-domain methods is significantly limited.

Frequency-domain processing techniques analyze the spectral characteristics of speech signals for correction. Common methods include linear predictive coefficients, linear predictive cepstral coefficients, and the short-time Fourier transform. These techniques address helium speech distortion by adjusting formant frequencies, bandwidths, and amplitudes. In particular, FBank features are effective at capturing spectral information and thus at improving speech quality.
In recent years, deep learning [31,32] has made significant progress in speech and image processing. By simulating the neural network structure of the human brain, deep learning can learn and extract complex features from large volumes of data. Applying deep learning to helium speech correction enables the automatic learning of spectral characteristics and effective correction. Research indicates [33] that deep learning-based helium speech correction methods excel at improving speech intelligibility and quality.
3. Helium Speech Correction Methods
Due to its unique characteristics, helium speech requires detailed consideration of its spectral properties, time-domain features, and acoustic changes during speech conversion. To improve the accuracy and robustness of helium speech conversion, this study constructed a helium speech dataset and used the MetricGAN+ model as a baseline. On this basis, an adaptive-audio-based metric generative adversarial network (AMGAN) is proposed, introducing an adaptive speech segmentation algorithm and a combined loss function in a hybrid architecture that comprehensively considers both the perceptual quality and the semantic consistency of the speech. The structure of AMGAN is shown in Figure 1.
The discriminator in AMGAN employs a CNN architecture featuring multiple two-dimensional convolutional layers along with a global average pooling layer. This design enables the model to handle inputs of varying lengths: even when the input speech length varies, the discriminator ensures that the FBank features and pitch period features are adjusted to a consistent length after processing.

The generator is a fully convolutional feedforward network that incorporates residual blocks. Its fully connected layers utilize LeakyReLU nodes and Learnable Sigmoid (LS) nodes. Additionally, downsampling is applied to the mask estimate prior to T-F masking. The AMGAN model requires approximately 1.32 GFLOPs per forward pass and has around 1.89 million parameters.
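To make the architecture concrete, the following PyTorch sketch shows one plausible reading of the discriminator described above. The layer count, channel widths, and the stacking of FBank and pitch features into two input channels are our assumptions for illustration, not specifications from the paper.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Illustrative sketch: stacked 2-D convolutions followed by global
    average pooling, so inputs of varying time length map to a
    fixed-size representation before the fully connected head."""
    def __init__(self, n_channels: int = 16):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(2, n_channels, kernel_size=5), nn.LeakyReLU(0.3),
            nn.Conv2d(n_channels, n_channels, kernel_size=5), nn.LeakyReLU(0.3),
            nn.Conv2d(n_channels, n_channels, kernel_size=5), nn.LeakyReLU(0.3),
        )
        self.head = nn.Sequential(
            nn.Linear(n_channels, 50), nn.LeakyReLU(0.3),
            nn.Linear(50, 1),  # predicted (normalized) quality score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, time, freq), e.g. FBank + pitch features stacked;
        # time/freq dims may vary between utterances
        h = self.convs(x)
        h = h.mean(dim=(2, 3))  # global average pooling over time and freq
        return self.head(h)
```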
3.1. Speech Adaptive Segmentation Algorithm
In helium speech correction, traditional fixed-parameter methods are insufficient to meet the differentiated processing needs of speech at different frequencies. Due to the unique acoustic properties of helium speech, the processing modes for the high-frequency and low-frequency bands differ, requiring distinct activation behavior in each band. Fixed-parameter activation functions struggle with this complexity, which limits the effectiveness of speech correction. To address this issue, this study proposes an adaptive segmentation method: by introducing learnable nodes, the activation function adaptively adjusts its compression mode according to the characteristics of different frequency bands, specifically using the following equation:

$$f(x_k) = \frac{1}{1 + e^{-\alpha_k x_k}}$$

where $\alpha_k$ is the compression ratio learned during training for frequency band $k$, used to control the shape of the compression function.
The key to the adaptive segmentation algorithm lies in the introduction of learnable parameters, so that the activation function can be dynamically adjusted during training and thereby better adapt to the speech characteristics of different frequencies. As shown in Figure 2, a higher $\alpha$ behaves like a hard threshold: it is not smooth but more saturated, approximating a binary mask with most output values near 0 or 1. In contrast, a lower $\alpha$ resembles a linear function.

This approach allows optimized processing of both high-frequency and low-frequency speech. Moreover, with learnable parameters, the activation function can learn the optimal processing mode during training, which is expected to achieve good correction results across different frequency ranges.
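As an illustration, a minimal PyTorch sketch of such a band-wise learnable sigmoid is given below. The per-band parameter $\alpha_k$ and its initialization are assumptions, since the paper only states that the compression ratio is learned.

```python
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    """Band-wise sigmoid with a learnable compression ratio alpha.
    A large alpha approaches a hard (binary) mask; a small alpha
    behaves almost linearly near the origin."""
    def __init__(self, n_freq_bands: int, init_alpha: float = 1.0):
        super().__init__()
        # one compression ratio per frequency band, learned with the model
        self.alpha = nn.Parameter(torch.full((n_freq_bands,), init_alpha))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n_freq_bands); alpha broadcasts over leading dims
        return torch.sigmoid(self.alpha * x)
```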
Moreover, adaptive segmentation also splits long sentences based on the root mean square (RMS) value, dividing the entire speech signal into segments of similar length. Silent parts of the speech are detected by calculating the RMS value of each frame:

$$\mathrm{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}$$

where $N$ is the number of samples per frame, and $x_i$ is the amplitude of the $i$th sample.
Once the voiced segment since the last split reaches the minimum length and the following silent part exceeds the minimum interval, the audio is split at the frame within the silent region that has the lowest RMS value:

$$t_{\mathrm{split}} = \underset{t \in [t_s,\, t_e]}{\arg\min}\ \mathrm{RMS}(t)$$

where $t_{\mathrm{split}}$ is the time point of separation, $t_s$ is the start time of the silent part, and $t_e$ is the end time of the silent part.
Additionally, longer silent segments are deleted or shortened to avoid unnecessary pauses that could disrupt speech continuity. These operations effectively reduce the temporal misalignment caused by the same person speaking at different times, and thus the accuracy loss such misalignment would introduce. Through adaptive segmentation, the speech signal is divided into shorter segments that are easier to process, ensuring a higher degree of semantic consistency during subsequent speech conversion and processing.
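The following NumPy sketch illustrates this RMS-based splitting. The frame length, silence threshold, and minimum segment and silence durations are illustrative assumptions; the paper does not report its exact values.

```python
import numpy as np

def rms_split(signal, sr=16000, frame_len=400, hop=160,
              silence_thresh=0.02, min_segment_s=1.0, min_silence_s=0.3):
    """Split speech at the lowest-RMS frame of each silent region that
    follows a sufficiently long voiced segment."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    rms = np.array([np.sqrt(np.mean(signal[i*hop:i*hop+frame_len] ** 2))
                    for i in range(n_frames)])
    silent = rms < silence_thresh

    cuts, seg_start, i = [], 0, 0
    while i < n_frames:
        if silent[i]:
            j = i
            while j < n_frames and silent[j]:
                j += 1  # extent of this silent region: frames [i, j)
            speech_ok = (i - seg_start) * hop / sr >= min_segment_s
            gap_ok = (j - i) * hop / sr >= min_silence_s
            if speech_ok and gap_ok:
                # cut at the quietest frame inside the silent region
                cuts.append((i + int(np.argmin(rms[i:j]))) * hop)
                seg_start = j
            i = j
        else:
            i += 1

    edges = [0] + cuts + [len(signal)]
    return [signal[a:b] for a, b in zip(edges[:-1], edges[1:])]
```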
3.2. Combined Loss Functions
The total generator loss function $L_G$ is designed to better quantify the differences between two segments of speech, so that helium speech can be converted to normal speech while maintaining semantic consistency and perceptual quality. It consists of several components, each with a specific purpose, combined as a weighted sum:

$$L_G = L_{\mathrm{main}} + \lambda L_{\mathrm{aux}}$$

where $\lambda$ is the weight coefficient of the loss function, and $L_{\mathrm{main}}$ and $L_{\mathrm{aux}}$ are the primary and auxiliary loss functions, respectively.
3.2.1. Main Loss Function
The primary loss function plays a central role in network training, as it is primarily responsible for updating the network. Let $Q'(I)$ represent a function that normalizes the target evaluation metric to the range $[0, 1]$, where $I$ denotes the input to the target evaluation metric. If the evaluation involves a pair consisting of helium speech $x$ and its corresponding normal speech $y$, the behavior of the discriminator network $D$ should be similar to $Q'$. The objective function of $D$ is then as follows:

$$L_D = \mathbb{E}_{x,y}\Big[\big(D(y, y) - Q'(y, y)\big)^2 + \big(D(G(x), y) - Q'(G(x), y)\big)^2\Big]$$

where $L_D$ is the discriminator loss used to train the discriminator network to distinguish between generated speech and real speech.

The training of the generator network $G$ is entirely dependent on the adversarial loss:

$$L_{\mathrm{main}} = \mathbb{E}_{x}\Big[\big(D(G(x), y) - s\big)^2\Big]$$

where $L_{\mathrm{main}}$ represents the generator loss, which primarily focuses on the quality of the generated speech and the consistency of semantics, and $s$ denotes the expected score; in this model, setting $s = 1$ indicates the expectation that the generated speech matches the original speech.
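A minimal PyTorch-style sketch of these two objectives is shown below. It assumes a generator G, a discriminator D that scores a (speech, reference) pair, and a normalized metric function q_metric playing the role of $Q'$; all three are placeholders rather than the paper's exact components.

```python
import torch

def discriminator_loss(D, G, x, y, q_metric):
    """MetricGAN-style discriminator objective: D should mimic the
    normalized quality metric Q' for both clean and generated speech.
    q_metric is assumed to return scores already scaled to [0, 1]."""
    fake = G(x).detach()  # do not backpropagate into the generator here
    loss_clean = (D(y, y) - q_metric(y, y)) ** 2
    loss_fake = (D(fake, y) - q_metric(fake, y)) ** 2
    return (loss_clean + loss_fake).mean()

def generator_main_loss(D, G, x, y, s=1.0):
    """Adversarial (main) generator loss: push D's score for the
    corrected speech toward the target score s (here s = 1)."""
    return ((D(G(x), y) - s) ** 2).mean()
```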
3.2.2. Auxiliary Loss Function
The auxiliary loss function is primarily used during training to address time distortion issues caused by factors such as speech rate, pitch, or intonation. In helium speech correction, traditional speech processing methods often perform poorly when dealing with variations in speech rate and pitch. This is largely because different speech rates, pitches, and intonations can cause temporal distortions, making it difficult for the model to capture accurate speech features. To address this issue, this study introduces an auxiliary loss function to enhance the model’s ability to handle temporal distortions, thereby reducing errors caused by such distortions. By aligning and adjusting along the time dimension, the model can maintain high correction accuracy even in the face of changes in speech rate, pitch, and intonation.
The auxiliary loss function compares the similarity between two sequences by finding the optimal matching path between them, measuring their similarity along the time axis.

Let there be two sequences $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$, where $x_i$ and $y_j$ denote elements in the sequences. The goal of the algorithm is to find a path that minimizes the total distance, where the auxiliary distance can be represented by the following formula:

$$L_{\mathrm{aux}} = \min_{P} \sum_{(i, j) \in P} d(x_i, y_j)$$

where $P$ is the sequence of points on the path that satisfy the start and end conditions, and $d(x_i, y_j)$ represents the distance or similarity measure between the elements at position $(i, j)$ in sequences $X$ and $Y$.
The temporal distortion problem between helium speech and normal speech is partially solved by introducing sequence alignment to minimize the auxiliary distance and ensure that the converted speech is aligned along the time axis.
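For illustration, the sketch below computes the classic dynamic-time-warping distance that this path-based alignment describes. In actual training a differentiable relaxation (e.g., soft-DTW) would be needed for gradients to flow; the paper does not state which variant it uses.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic-time-warping distance between two feature sequences
    (frames as rows), matching the min-cost-path formulation above."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # frame distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```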
3.3. Helium Speech Feature Extraction
The process of extracting FBank features from helium speech is illustrated in Figure 3.
Pre-emphasis: The speech signal is passed through a high-pass filter to boost its high-frequency components, flattening the spectrum so that it can be computed with a similar signal-to-noise ratio across the entire frequency range from low to high.
Framing: The audio signal is divided into frames of fixed length, with this frame length serving as the minimum unit.
Windowing: Adjacent frames need to have overlapping portions, meaning the distance the window moves each time is less than the frame length.
DFT Processing: A fixed number of features are extracted from each frame, and each frame signal undergoes the discrete Fourier transform (DFT) to obtain the periodogram estimate of the power spectrum. The formulas are as follows:

$$X_m(k) = \sum_{n=0}^{N-1} x_m(n)\, w(n)\, e^{-j 2\pi k n / N}, \qquad 0 \le k < K$$

$$P_m(k) = \frac{1}{N} \left| X_m(k) \right|^2$$

where $x_m(n)$ is the time-domain data value of the $m$th frame, $n$ is the sample index within the frame, $X_m(k)$ denotes the $k$th complex coefficient of the $m$th frame, $P_m(k)$ represents the power spectrum of the $m$th frame, $w(n)$ is the $N$-point window function, and $K$ is the length of the DFT.
Filtering: After the DFT, the magnitude of each frequency component is obtained, and squaring this magnitude yields the energy of each frequency component. Since the human ear is more sensitive to lower frequencies and less sensitive to higher frequencies, the Mel frequency scale is used to simulate this characteristic. The relationship between Mel frequency and actual frequency is as follows:

$$\mathrm{Mel}(f) = k \log_{10}\left(1 + \frac{f}{f_0}\right)$$

where $k$ (commonly 2595) is the scaling factor that converts linear frequency to Mel frequency, and $f_0$ (commonly 700 Hz) is the baseline reflecting the frequency resolution characteristics of the auditory system. The Mel filter bank is applied to the energy spectrum to obtain the FBank features, with the following formula:

$$\mathrm{FBank}(m) = \log \left( \sum_{k=0}^{K-1} H_m(k)\, P(k) \right)$$

where $H_m(k)$ is the frequency response of the $m$th Mel filter and $P(k)$ is the power spectrum.
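The whole pipeline (pre-emphasis, framing, windowing, DFT, Mel filtering) can be summarized in a short NumPy sketch. The frame sizes, FFT length, and filter count below are common defaults rather than the paper's settings.

```python
import numpy as np

def fbank(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
          n_mels=40, pre_emph=0.97):
    """Minimal FBank extraction following the steps of Figure 3."""
    # 1) pre-emphasis: boost high frequencies
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2) framing with overlap + 3) Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 4) DFT -> periodogram power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5) triangular Mel filter bank on the power spectrum
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return np.log(power @ H.T + 1e-10)  # log Mel-filter-bank energies
```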
To extract formants, an electrical transmission simulation model is established, approximating the human vocal tract as a first-order lossy acoustic tube. Changes in environmental pressure and respiratory gas composition do not affect the overall shape of the source spectrum, and the speech spectrum is the product of the source spectrum and the transmission spectrum; consequently, with the source spectrum shape unchanged, the amplitude of the helium speech spectrum in the high-frequency range is significantly reduced. Therefore, pre-emphasis must be performed before extraction:

$$s'(n) = s(n) - \mu\, s(n-1)$$

where $\mu$ is the pre-emphasis weight, and $s$ represents the original signal.

By applying pre-emphasis, the amplitude of the high-frequency components is increased to compensate for the high-frequency attenuation caused by the transmission spectrum expansion, which makes the amplitude of each formant more uniform when extracting the formant parameters.
Next, prediction error filtering is performed using the following filter:

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$

where $p$ is the order of the prediction error filter, and the $a_i$ values are the filter coefficients for each order. The complex roots of the polynomial represent the formant frequencies and bandwidths. Let $z_i = r_i e^{j\theta_i}$ be any complex root; then, its conjugate $z_i^{*}$ is also a root. Let $z_i$ correspond to a formant frequency $F_i$ and a bandwidth $B_i$; then,

$$F_i = \frac{\theta_i}{2\pi T}, \qquad B_i = -\frac{\ln r_i}{\pi T}$$

where $T$ is the sampling period.
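A compact NumPy/SciPy sketch of this root-finding procedure follows. The LPC order and pre-emphasis weight are typical values, not those used in the paper, and a practical formant picker would additionally discard roots with implausibly wide bandwidths.

```python
import numpy as np
from scipy.signal import lfilter

def formants(frame, sr=16000, order=12, mu=0.97):
    """Estimate formant frequencies/bandwidths from the roots of the
    LPC prediction-error filter A(z) = 1 + sum a_i z^-i."""
    x = lfilter([1.0, -mu], [1.0], frame)  # pre-emphasis
    x = x * np.hamming(len(x))
    # autocorrelation method: solve the normal equations for a_i
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1: order + 1])
    roots = np.roots(np.concatenate(([1.0], a)))
    roots = roots[np.imag(roots) > 0]  # keep one root per conjugate pair
    T = 1.0 / sr
    freqs = np.angle(roots) / (2 * np.pi * T)   # F_i = theta_i / (2*pi*T)
    bws = -np.log(np.abs(roots)) / (np.pi * T)  # B_i = -ln(r_i) / (pi*T)
    order_idx = np.argsort(freqs)
    return freqs[order_idx], bws[order_idx]
```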
4. Evaluation System
Using speech recognition technology to convert speech into text can simplify the process of comparing speech data. By employing natural language processing techniques to compare the semantic similarity between generated text and reference text, the quality of speech conversion can be indirectly assessed, allowing for a focus on the analysis of correction algorithms. Based on this approach, both normal speech and trained speech are processed through the same speech recognition model, which ensures the consistency of the speech-to-text conversion process. This unified standard reduces conversion errors and provides a more equitable evaluation criterion.
4.1. Paragraph Evaluation
The word error rate cannot effectively compare the similarity between keywords in two sentences, and an overall similarity score cannot evaluate similarity at the keyword level. Therefore, the proposed comprehensive similarity (CS) score is used to evaluate keyword similarity in sentences.
For the CS score, let there be a document $D$ and a query $Q$ containing the keywords $q_1, \ldots, q_n$. The Best Matching 25 (BM25) score can be defined as follows:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ represents the term frequency of $q_i$ in document $D$, $|D|$ denotes the number of words in document $D$, and $\mathrm{avgdl}$ indicates the average length of all documents in the corpus. $k_1$ and $b$ are free parameters. $\mathrm{IDF}(q_i)$ represents the inverse document frequency of the term $q_i$ and is calculated as follows:

$$\mathrm{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where $N$ is the total number of documents in the corpus, and $n(q_i)$ represents the number of documents that contain $q_i$.
BM25 is an enhancement of the TF-IDF algorithm. In the term frequency calculation, BM25 limits the impact of the term frequency of keyword $q_i$ on the score: to prevent excessively high term frequencies from dominating, BM25 bounds this factor above by $k_1 + 1$.
Although BM25 itself is not a similarity metric, it can be used to some extent for similarity computation. To measure the similarity between two texts $A$ and $B$, and to enable the CS score to reflect semantic similarity, the following steps are performed (Figure 4):
Indexing: Treat the keywords in texts A and B as queries and the other text as the document. First, index text A.
Query: Use text B as the query and score text A using BM25.
Inverted Query: Index text B and use text A as the query to score text B using BM25.
Combined Score: Sum and average these two scores to obtain a CS score.
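A minimal Python sketch of this bidirectional scoring is shown below. It assumes pre-tokenized inputs (e.g., via a Chinese tokenizer such as jieba) and adds 1 inside the IDF logarithm, a standard guard against negative IDF for very common terms; neither detail is specified by the paper.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score doc_tokens against query_tokens with BM25; corpus is a
    list of token lists used for IDF and average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in set(query_tokens):
        n_q = sum(1 for d in corpus if q in d)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1.0)
        tf = doc_tokens.count(q)
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def cs_score(tokens_a, tokens_b, corpus):
    """Comprehensive similarity: average of the two directional BM25
    scores (A as document / B as query, and vice versa)."""
    return 0.5 * (bm25_score(tokens_b, tokens_a, corpus)
                  + bm25_score(tokens_a, tokens_b, corpus))
```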
4.2. Isolated Word Evaluation
During diving, communication often relies on complete sentences, but the recognition and correction of isolated words are equally important. Additionally, the pronunciation of isolated words read individually may differ slightly from those within complete sentences. This difference arises because the pronunciation and intonation of words in a sentence are influenced by the surrounding context, while isolated words are pronounced in a relatively independent and standard manner. Therefore, these subtle differences must be considered when evaluating isolated words. In this study, isolated words are categorized into two types: those read directly in isolation and those extracted from sentences through tokenization.
Tokenization is the process of breaking continuous text into individual words or phrases, which is particularly important for Chinese text processing due to the lack of explicit word boundaries in Chinese. Tokenization helps extract core vocabulary and semantic information from sentences. Additionally, tokenization technology for Chinese is well established, allowing for the correction and evaluation of isolated words obtained from tokenized sentences.
For the evaluation criteria described above, Jaccard similarity is used as the metric for helium speech isolated word correction. Jaccard similarity is a widely used metric in speech recognition, focusing on the overall similarity of words rather than their offset, and does not involve time relations, making it well suited for assessing the correction performance of isolated words.
The Jaccard similarity is used to compare isolated words and derive similarity scores. It is defined as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are two sets, $|A \cap B|$ represents the size of their intersection, and $|A \cup B|$ represents the size of their union.
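For isolated-word lists, the metric reduces to a few lines; the example words below are illustrative only.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections, treated as sets."""
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 1.0

# e.g., comparing recognized isolated words before/after correction:
# two of four distinct words match -> 0.5
print(jaccard(["请", "我", "有"], ["请", "我", "右"]))
```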
5. Datasets
Saturation diving is a specialized deep-sea operation method in which divers work in high-pressure environments for extended periods, breathing a helium-oxygen gas mixture to avoid nitrogen narcosis and oxygen toxicity. At 0 °C, the speed of sound in air is 332 m/s; in hydrogen, it is 1286 m/s; and in helium, it is 971 m/s. Because the helium-oxygen mixture changes sound propagation, the speech of divers in this environment exhibits unique acoustic characteristics. Given the lack of publicly available helium speech datasets, we collected a helium speech dataset to investigate these characteristics and convert them into normal speech through speech conversion technology.

In the helium speech dataset constructed in this study, the data were collected from divers breathing helium-oxygen gas during saturation dives to depths of 83-110 m while reading aloud. The dataset was stored in WAV format with a sampling rate of 16 kHz.

The total duration of the dataset is 326 min, recorded by nine divers with extensive diving experience who can accurately read professional terms and technical descriptions. The recordings were made using high-sensitivity microphones and professional recorders to ensure audio quality. Each diver's average reading time was approximately 36 min, and the recordings were later segmented into smaller speech fragments for speech processing and analysis, facilitating model training and testing.
The reading content mainly includes five sections:
Common Words: This includes words frequently used in daily diving operations. The pronunciation and recognition of these common words are particularly important in underwater environments and can be used to analyze changes in divers’ pronunciation under different pressures and environmental conditions, helping to optimize the performance of speech correction systems.
Article Paragraphs: This section includes common paragraphs from commentary articles, covering topics ranging from literary criticism to current events. Through these texts, the study can examine changes in speech characteristics under different themes.
Diving Technical Terms: The technical term section includes terms related to diving operations, equipment usage, and emergency handling. Because these involve critical commands and safety operations, the pronunciation characteristics of these technical terms in high-pressure environments have significant research value.
Daily Conversations: The daily conversation texts are selected to simulate dialogues that divers might have during breaks in their work, including greetings, discussions about work progress, and task arrangements. This section helps analyze speech characteristics in natural conversational settings.
Technical Instructions: This section includes detailed descriptions of diving equipment and operational procedures. When divers read these instructions, they need to use more formal and technical language, which places higher demands on the accuracy of speech conversion.
Table 1 provides detailed information on the construction of the helium speech dataset, including speech types, recording environments, the number of participants, average reading duration, diving depth relative to sea level, and the average word count of the corresponding audio.
6. Evaluation
6.1. Experimental Environment
In this experiment, AMGAN was trained on an NVIDIA GeForce RTX 3090 GPU. The software configuration included Python 3.10, PyTorch 1.8.0, and CUDA 10.1. The input audio had a sampling rate of 16 kHz, and the audio was segmented into 1–5 s clips to reduce computational complexity. Audio segmentation also helped accelerate the training and inference process to some extent. Due to limitations in computing power and the convergence speed of the algorithm, the training process consisted of 8,000 epochs, with the optimizer set to AdamW and an initial learning rate of 0.0005.
6.2. Analysis of Continuous Sentence Correction Results
In this study, we first used the semantic textual similarity (STS) score from the HanLP library as the evaluation metric for the overall similarity of corrected sentences. We then employed the proposed comprehensive similarity score as the evaluation metric for sentence-keyword similarity, allowing a thorough assessment of the correction algorithms from multiple dimensions [34].

The experiment was conducted on 12 sets of Chinese text paragraphs from the helium speech dataset, where each paragraph was segmented into several sentences for feature extraction. The extracted features primarily included FBank features and formant features [35].
Figure 5 illustrates the process of extracting FBank features from the helium speech sample “我们知道水是生物的重要组成部分” (“We know that water is an essential component of living organisms”) and the resulting FBank features.
Figure 6 shows the comparison of formant features of the sentence. Formants are key features reflecting the speech spectrum, and the comparison results between the converted speech and normal speech in terms of formant features also demonstrate high similarity. This further validates the effectiveness and stability of the AMGAN model in the speech conversion process.
Figure 7 is a heatmap of CS scores. The heatmap visually presents the distribution of CS scores between different sentences in the same paragraph, under normal pressure, and after model correction. It can be observed that the CS scores along the diagonal are significantly higher than others, indicating that the sentences converted by the AMGAN model have a high similarity to the corresponding sentences under normal pressure. This suggests a high correlation between the keywords in the sentences under normal pressure and those in the converted helium–oxygen environment.
The analysis was conducted on 12 sets of text paragraphs from the helium speech dataset, and the experimental results are shown in Table 2, which details the STS and CS scores for different sentences. By comparing these data, it can be observed that the AMGAN model performs well on most sentences, exhibiting high STS and CS scores. Additionally, based on multiple comparisons of CS scores for the same paragraph, this study concludes that when the CS score exceeds 15, the paragraph demonstrates a high degree of similarity.
The analysis of the above experimental data reveals that the AMGAN model performs excellently in long speech correction. The STS and CS scores for most sentences are high, indicating good conversion results at the sentence level. The STS scores show the similarity in text correction results before and after conversion, proving that the model effectively retains the linguistic information of the original speech. Furthermore, the high CS scores indicate that the keywords in the converted speech closely resemble those in the original speech under normal pressure, validating the practical application value of the model.
6.3. Isolated Word Evaluation Metrics and Correction Results Analysis
In the study of isolated words, feature extraction was performed on the input helium speech data. Figure 8 shows the speech features of isolated words in different environments: Figure 8a–c show the waveform, spectrogram, and FBank features of “请” (please), “我” (I), and “有” (have) under normal pressure, respectively; Figure 8d–f show the corresponding features in a helium–oxygen environment; and Figure 8g–i show the features after correction to normal speech using AMGAN. Comparing the speech features in the helium–oxygen and normal-pressure environments shows that some features of the corrected audio are already similar to those under normal pressure.
Table 3 presents the correction results of nine sets of isolated words in helium speech, evaluated by Jaccard similarity after segmentation [36]. It can be seen that AMGAN effectively restores isolated words in a helium–oxygen environment to recognizable normal speech.
The experimental results show that the accuracy of all test sets is above 90%, with the lowest being 93.65%. Analysis of the original speech suggested that this lowest similarity most likely occurred because one of the saturation divers had an accent that caused inaccuracies in single-word correction. On most word lists, the AMGAN model exhibited high Jaccard similarity, indicating good speech conversion performance. Comparing the word correction results allowed the model's performance on isolated words to be evaluated, verifying its accuracy and stability.
7. Conclusions
In this study, we first constructed a helium speech dataset containing isolated words and continuous speech paragraphs in both Chinese and English and proposed a new evaluation metric called the comprehensive similarity score to better assess the effectiveness of helium speech correction. Based on this, a GAN model named AMGAN was designed and implemented. This model significantly improved the clarity and intelligibility of helium speech through an adaptive speech segmentation algorithm and a combined loss function.
Currently, preliminary research results have been achieved for helium speech correction in both paragraphs and isolated words. The AMGAN method has shown excellent performance in helium speech correction, with significant improvements in clarity and intelligibility compared to traditional methods. Moreover, the comprehensive similarity score provides a new perspective for evaluating the effectiveness of speech correction, offering a more comprehensive reflection of corrected speech quality. These results not only enrich the theoretical foundation of helium speech correction but also provide effective technical support for practical applications. Additionally, this research has implications for improving the efficiency and safety of deep-sea diving communication and offers a reference for the future development of helium speech processing technology.
However, this study has several limitations. First, the experiments used only a Chinese dataset; future research could expand to other languages to explore the applicability of the AMGAN method in different linguistic environments and the resulting differences in correction behavior. Second, the current dataset primarily consists of divers with similar linguistic backgrounds, which limits the diversity of accents and speaking styles and may affect the model's ability to generalize; the lack of samples from different regions and cultural backgrounds may likewise reduce the model's adaptability to diverse speech inputs.

To address these issues, future work should enhance the dataset's diversity by collecting data from divers of different regions and linguistic backgrounds and by covering a wider range of speaking styles; data augmentation techniques could also help simulate differences in accent and intonation. Moreover, while the comprehensive similarity score performed well in evaluating helium speech correction, other potential evaluation metrics have not been systematically compared, and this should be a focus of future research. Lastly, this study mainly examined the GAN-based AMGAN model; future work could explore the performance of other deep learning models in helium speech correction to identify further optimized solutions.

This research provides new methods and evaluation metrics for helium speech correction, with high practical value and promising application prospects. Although there are some shortcomings, these limitations offer new directions and ideas for future research. It is hoped that more researchers will conduct in-depth exploration in this field, advancing the development of helium speech correction technology.
Author Contributions
Conceptualization, H.L. and S.Z.; methodology, Y.C. and H.L.; software, Y.C. and H.L.; validation, Y.C.; investigation, S.Z.; writing—original draft preparation, Y.C.; supervision, H.J. and S.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62371261) and the Nantong Science and Technology Program (No. JC2023076).
Data Availability Statement
The datasets presented in this article are not readily available because the data are part of an ongoing study and due to technical limitations. Requests to access the datasets should be directed to the corresponding author.
Conflicts of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Narayanan, R.; Sreelekshmi, K.; Keerthi, T. Marine resources and sustainable utilization. In Conservation and Sustainable Utilization of Bioresources; Springer: Berlin/Heidelberg, Germany, 2023; pp. 581–596.
- Verma, R.; Mohanty, C.; Kodange, C. Saturation diving and its role in submarine rescue. J. Mar. Med. Soc. 2016, 18, 72–74.
- Vrijdag, X.C.; van Waart, H.; Pullon, R.M.; Sames, C.; Mitchell, S.J.; Sleigh, J.W. EEG functional connectivity is sensitive for nitrogen narcosis at 608 kPa. Sci. Rep. 2022, 12, 4880.
- Stover, W. Technique for correcting helium speech distortion. J. Acoust. Soc. Am. 1967, 41, 70–74.
- Suzuki, H. Helium speech unscrambler using digital filter constructed by linear prediction and impulse response conversion. IEICE Trans. Commun. 1975, 58, 337–384.
- Suzuki, J.; Nakatsui, M. Translation of helium speech by splicing of autocorrelation function. J. Radio Res. Lab. 1976, 23, 229–234.
- Richards, M. Helium speech enhancement using the short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1982, 30, 841–853.
- Golden, R.M. Improving naturalness and intelligibility of helium-oxygen speech, using vocoder techniques. J. Acoust. Soc. Am. 1966, 40, 621–624.
- Takasugi, T.; Suzuki, J. Translation of helium speech by the use of ‘analytic signal’. J. Radio Res. Lab. 1974, 21, 61–69.
- Daymi, M.; Gayed, M.; Malherbe, J.C.; Kammoun, L. A modified hyperbaric speech transcoder. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Yasmine Hammamet, Tunisia, 6–9 October 2002; Volume 6, p. 6.
- Jack, M.; Duncan, G. The helium speech effect and electronic techniques for enhancing intelligibility in a helium-oxygen environment. Radio Electron. Eng. 1982, 52, 211–223.
- Liu, H.; Zhang, C.; Deng, Y.; Liu, T.; Zhang, Z.; Li, Y.F. Orientation cues-aware facial relationship representation for head pose estimation via transformer. IEEE Trans. Image Process. 2023, 32, 6289–6302.
- Liu, H.; Zhou, Q.; Zhang, C.; Zhu, J.; Liu, T.; Zhang, Z.; Li, Y.F. MMATrans: Muscle movement aware representation learning for facial expression recognition via transformers. IEEE Trans. Ind. Inform. 2024.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Qin, Z.; Zhao, T.; Li, F.; Tao, X. Survey of research on multimodal semantic communication. J. Commun. 2023, 44, 28–41.
- Maben, L.M.; Guo, Z.; Chen, C.; Chudiwal, U.; Siong, C.E. Study of generative adversarial networks for noisy speech simulation from clean speech. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, Taipei, Taiwan, 31 October–3 November 2023; pp. 1143–1149.
- Hollien, H.; Thompson, C.; Cannon, B. Speech intelligibility as a function of ambient pressure and HeO2 atmosphere. Aerosp. Med. 1973, 44, 249–253.
- Zhang, X.; Zheng, L. Features extraction and analysis of disguised speech formant based on SoundTouch. In Proceedings of the 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference, Chongqing, China, 11–13 October 2019; pp. 502–508.
- Chen, M.; Duquenne, P.A.; Andrews, P.; Kao, J.; Mourachko, A.; Schwenk, H.; Costa-jussà, M.R. BLASER: A text-free speech-to-speech translation evaluation metric. arXiv 2022, arXiv:2212.08486.
- Zhang, J. Efficiency analysis of Jaccard similarity in probabilistic distribution model. Acad. J. Comput. Inf. Sci. 2023, 6, 53–63.
- Wu, S.; Liu, F.; Zhang, K. Short text similarity calculation based on Jaccard and semantic mixture. In Proceedings of Bio-Inspired Computing: Theories and Applications (BIC-TA 2020), Qingdao, China, 23–25 October 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 37–45.
- Agrawal, V.; Chaurasia, A.; Kumar, S.; Chikkamath, S.; Nirmala, S.; Budihal, S. Wav2Letter: Transforming speech to text with CNN for automatic speech recognition. In Proceedings of the 2024 3rd International Conference for Innovation in Technology, Bangalore, India, 1–3 March 2024; pp. 1–5.
- Looby, A.; Erbe, C.; Bravo, S.; Cox, K.; Davies, H.L.; Di Iorio, L.; Jézéquel, Y.; Juanes, F.; Martin, C.W.; Mooney, T.A.; et al. Global inventory of species categorized by known underwater sonifery. Sci. Data 2023, 10, 892.
- Delattre, P.C.; Liberman, A.M.; Cooper, F.S. Acoustic loci and transitional cues for consonants. J. Acoust. Soc. Am. 1955, 27, 769–773.
- Lindblom, B.E.; Studdert-Kennedy, M. On the role of formant transitions in vowel recognition. J. Acoust. Soc. Am. 1967, 42, 830–843.
- Dongmei, L.; Ming, L.; Lili, G. A helium speech recognition method using machine learning. Telecommun. Technol. 2022, 62, 72–77.
- Zhang, S.; Guo, L.; Li, H.; Bao, Z.; Zhang, X.; Chen, Y. A survey on helium speech communications in saturation diving. China Commun. 2020, 17, 68–79.
- Shanthi, T.S.; Lingam, C. Review of feature extraction techniques in automatic speech recognition. Int. J. Sci. Eng. Technol. 2013, 2, 479–484.
- Goki, S.H.; Ghazvini, M.; Hamzenejadi, S. A wavelet transform based scheme to extract speech pitch and formant frequencies. arXiv 2022, arXiv:2209.00733.
- Hess, W. Pitch Determination of Speech Signals: Algorithms and Devices; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 3.
- Liu, T.; Wang, M.; Yang, B.; Liu, H.; Yi, S. ESERNet: Learning spectrogram structure relationship for effective speech emotion recognition with Swin Transformer in classroom discourse analysis. Neurocomputing 2024, 612, 128711.
- Liu, H.; Liu, T.; Chen, Y.; Zhang, Z.; Li, Y.F. EHPE: Skeleton cues-based Gaussian coordinate encoding for efficient human pose estimation. IEEE Trans. Multimed. 2022, 26, 8464–8475.
- Li, D.; Zhang, S.; Guo, L.; Chen, Y. Helium speech correction algorithm based on deep neural networks. In Proceedings of the 2020 International Conference on Wireless Communications and Signal Processing, Nanjing, China, 21–23 October 2020; pp. 99–103.
- Meiyu, Z.; Peiliang, W.; Yan, D.; Yi, L.; Lingfu, K. Chinese semantic and phonological information-based text proofreading model for speech recognition. J. Commun. 2022, 43, 65–79.
- Medabalimi, A.J.X.; Seshadri, G.; Bayya, Y. Extraction of formant bandwidths using properties of group delay functions. Speech Commun. 2014, 63, 70–83.
- Jin, L.; Hendriks, H. The development of aspect marking in L1 and L2 Chinese. Work. Pap. Engl. Appl. Linguist. 2005, 9, 69–99.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).