3.1. Description of Experimental Setups
To study the quality of speech recognition with a phase-sensitive reflectometer, a dedicated setup, shown in Figure 5, was built, and the influence of its design parameters on the quality of speech registration and recognition was studied experimentally. The radiation of a continuous-wave laser source is amplified, if necessary, by an amplifier (A) and passes to a circulator (C), which sends it along the path 1–2 to the sensing fiber of length LS, which can be arranged in different ways. The far end of the sensing fiber is wound in several turns with a small radius of 4 mm, which is less than the minimum bend radius allowing properly confined guidance in our standard SM fiber. The laser radiation therefore leaks out in these turns, and no light is reflected from the far end of the sensor, which ensures that the setup operates as a single section of the reflectometer. The waves backscattered by the refractive index inhomogeneities of the fiber interfere with one another, and the interference signal travels through the circulator ports 2–3 to the optical preamplifier (pEDFA). The optical filter (F) then cuts off the spontaneous emission noise of the pEDFA, and the photodetector (PD) detects the interference signal. The signal from the PD is sent to the ADC, whose sampling frequency fD can be varied if necessary. The digitized signal then travels to a PC, which displays the interference signal as a time-dependent intensity that changes when the fiber is affected by acoustic disturbances and speech. The setup represents a single “point” of the fiber line of a phase-sensitive reflectometer, i.e., it operates as a single fiber section within the spatial resolution determined by the pulse width according to the formula
$$l_0 = \frac{c\,\tau}{2n},$$
where $c$ is the speed of light in a vacuum, $\tau$ is the probing pulse duration, and n = 1.5 is the refractive index of the fiber. With a sensing fiber length of 20 m in the setup, we experimentally imitate data acquisition from one section of the phase-sensitive reflectometer with a probing pulse of duration $\tau$ = 200 ns (FWHM). At the output of the setup, one time sequence is recorded, which corresponds to a single cross-section of the waterfall reflectogram at one coordinate of the reflectometer line. All further processing and recognition algorithms are then applied to this time sequence.
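The correspondence between the 20 m sensing fiber and the 200 ns probing pulse can be checked numerically. The sketch below assumes the usual φ-OTDR resolution relation l0 = cτ/(2n); the function names are illustrative and do not come from the paper.

```python
C_VACUUM = 299_792_458.0  # speed of light in a vacuum, m/s


def spatial_resolution(pulse_duration_s: float, n: float = 1.5) -> float:
    """Fiber length resolved by one probing pulse: l0 = c * tau / (2 * n)."""
    return C_VACUUM * pulse_duration_s / (2.0 * n)


def equivalent_pulse_duration(fiber_length_m: float, n: float = 1.5) -> float:
    """Pulse duration emulated by a sensing fiber of the given length."""
    return 2.0 * n * fiber_length_m / C_VACUUM


l0 = spatial_resolution(200e-9)          # close to 20 m
tau = equivalent_pulse_duration(20.0)    # close to 200 ns
print(f"resolution for a 200 ns pulse: {l0:.2f} m")
print(f"pulse emulated by a 20 m fiber: {tau * 1e9:.1f} ns")
```

The same relation gives the pulse duration emulated by the 2.5 m straight fiber used in the later stages, which is why changing the fiber length acts as the equivalent of changing the pulse width.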
The experiments require varying the main reflectometer parameters, such as the pulse repetition rate, over a wide range in order to investigate the quality of speech recognition under different conditions. In a commercially available φ-OTDR, these parameters cannot be varied quickly, and changing them may require too much labor. To solve this problem, a dedicated setup modeling a single coordinate of a φ-OTDR was developed. It does not require modulation of the radiation source intensity and allows the use of a lower-frequency ADC than a typical φ-OTDR. Thus, the main parameters of a distributed microphone based on φ-OTDR can be varied quickly and easily by simply changing the parameters of the setup. The equivalent of the pulse width is changed by changing the length of the linear sensing fiber section. The equivalent of the pulse repetition rate is changed by varying the ADC sampling frequency. The signal-to-noise ratio at the output of the system can be varied by changing the configuration of the optical amplification circuit, i.e., by changing the parameters of the booster (A) and preamplifier (pEDFA) combination, or by adding variable attenuation to the optical signal. All the setup parameters and their correspondence to the parameters of a φ-OTDR implemented as a ready-made device are given in Table 1.
In practice, the distributed microphone system must be able to register acoustic disturbances of quite different sound volumes. The intensity of an acoustic disturbance may decrease for many reasons, such as a change in the distance from the acoustic source to the sensing fiber, a change in speech volume, or obstacles shielding the sensing fiber from the sound. Therefore, it is necessary to study the quality of speech recognition when the fiber is perturbed by acoustic waves of different sound volumes. Various sensing fiber configurations were investigated as well. The experimental studies were thus carried out in several stages:
1. A cylindrical PZT with a sensing fiber wound on it was used as a sensing element. The PZT was a hollow cylinder with an outer diameter of D = 85 mm, an inner diameter of d = 75 mm, and a height of H = 30 mm, and the length of the wound sensing fiber was LS = 20 m. The sensing fiber was driven by the PZT vibrations caused by changes in the electrical signal applied to the PZT, which had a −3 dB bandwidth of Δfmax = 16 kHz. The amplitude of the electrical signal applied to the PZT changed over time according to an audio recording of the Harvard sentences, so the sensing fiber wound on the PZT was stretched by the PZT vibrations (an “active” PZT). A scheme and a photo of the experimental setup are shown in Figure 6a,b.
At this stage of the experiments, we verified the simulated results of signal formation in a fiber-optic microphone based on a φ-OTDR. An acoustic disturbance applied to the fiber by the PZT is an idealized case in which a realization of the sentences is cleanly converted into fiber vibrations. This stage thus allowed us to conduct the first tests of how the ADC sampling frequency influences the quality of speech registration and recognition with the fiber-optic microphone.
2. A sensing fiber with a length of LS = 20 m was wound into a passive coil with a diameter of 85 mm. Speakers playing recordings of the Harvard sentences acted on the sensing fiber coil.
This stage of the experiments allowed us to conduct trials reproducing real conditions. Such a sensing fiber design maintains a high system sensitivity [35], since the speakers affect the entire fiber length. A scheme and a photo of the experimental setup are shown in Figure 6a,b. The same PZT as in stage 1 was used as the coil former, but no voltage was applied to it. The speakers playing the audio recording of the Harvard sentences were placed next to this “passive” PZT (without feeding any electrical signal to it).
3. A straight sensing fiber with a length of LS = 2.5 m was used, perturbed by speakers playing recordings of the Harvard sentences. The experimental setup scheme is shown in Figure 7a. This stage of the experiments included several steps:
3.1. The fiber was simply placed on a table, without any auxiliary elements for improving sound transmission to the optical fiber.
3.2. The fiber was glued to a metal plate with dimensions of 80 × 50 × 0.2 mm³, as proposed in [12,36] and shown in Figure 7b. The metal plate was meant to increase the sensitivity of the system. It was mounted on legs so that it could oscillate under the acoustic wave from the speakers and transmit the vibrations to the optical fiber.
3.3. The fiber was wound on an elastic core made of a plastic bottle with a diameter of 140 mm, as proposed in [37]. Like the metal plate in the previous experiment 3.2, the bottle made it possible to increase the amplitude of the vibrations transmitted to the fiber. In this experiment, three regimes were studied: firstly, the bottle was kept in its original form, i.e., with a bottom, and the speakers acted on it (Figure 8a); secondly, the bottom of the bottle was cut off, thus forming a horn, and the speakers acted on the sidepiece (Figure 8b); and thirdly, the speakers were placed near the removed bottom and acted on the bottle from inside (Figure 8c). By changing the volume of the sound produced by the speakers, we investigated the behavior of such a system at different sound levels and with various parameters.
4. A linear sensing fiber with a length of L = 2.5 m was used with a pair of wFBGs spaced 1 m apart, as shown in Figure 9. The disturbance was applied in the middle of the fiber between the two wFBGs by speakers playing recordings of the Harvard sentences. This sensing fiber configuration increases the intensity of the backscattered light waves, which increases the signal-to-noise ratio at the photodetector.
Since the sensing fiber in all experiments was short (less than 1 km), there was no need to amplify the laser radiation with a booster. The pEDFA amplification varied depending on the configuration of the sensing fiber: in some cases, the backscattered signal is intense enough that no optical preamplifier is needed, while in other cases, it is not. We therefore assembled two setups, with and without a preamplifier and a filter, to compare the results.
The parameters of the setup in the different experiments are given in Table 2.
The most important parameter to consider when designing a real system is the pulse repetition rate, because it defines the requirements for the ADC, which is an expensive component of the system. Most importantly, however, the ADC sampling frequency fD, as the equivalent of the pulse repetition rate, physically limits the maximum sensing fiber length LS. The pulse repetition rate is related to the maximum sensing fiber length as follows:
$$L_{S,\max} = \frac{c}{2 n f_{rep}},$$
where $f_{rep}$ is the pulse repetition rate, $n$ is the fiber effective refractive index, and $c$ is the speed of light in a vacuum.
Table 3 shows how the maximum sensing fiber length relates to the pulse repetition rate of a φ-OTDR (the ADC sampling frequency of the experimental setup). Each section of a sensing fiber within a resolution length is equivalent to a one-point microphone, and such microphones are placed with a resolution of l0 = 20 m. The equivalent number of one-point microphones for each pulse repetition rate is therefore shown in Table 3 as well.
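The relation between the pulse repetition rate, the maximum fiber length, and the equivalent number of one-point microphones can be sketched as follows. The sketch assumes the standard condition that only one probing pulse travels in the fiber at a time, L_max = c/(2 n f_rep), a rounded c = 3 × 10^8 m/s, n = 1.5, and l0 = 20 m; the function names are illustrative, and the printed values are not copied from Table 3 and may differ from it by rounding.

```python
C_VACUUM = 3.0e8  # rounded speed of light in a vacuum, m/s (assumption)


def max_fiber_length(f_rep_hz: float, n: float = 1.5) -> float:
    """Longest fiber for which a pulse returns before the next one is sent."""
    return C_VACUUM / (2.0 * n * f_rep_hz)


def equivalent_microphones(f_rep_hz: float, resolution_m: float = 20.0,
                           n: float = 1.5) -> int:
    """Number of whole resolution cells (one-point microphones) in the fiber."""
    return int(max_fiber_length(f_rep_hz, n) // resolution_m)


for f_khz in (3, 5, 10, 20, 40):
    f_rep = f_khz * 1e3
    length_km = max_fiber_length(f_rep) / 1e3
    count = equivalent_microphones(f_rep)
    print(f"{f_khz:>2} kHz: L_max = {length_km:6.2f} km, microphones = {count}")
```

Under these assumptions, a repetition rate of 10 kHz corresponds to a 10 km fiber, so every doubling of the repetition rate halves both the reachable length and the number of equivalent microphones.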
A sensing fiber length of less than 10 km, such as LS = 2.5 km, is sufficient for smart home implementations, such as the concept of a single fiber serving multiple apartments, and for other local applications in small areas and at short distances. This will help with the development of advanced devices. However, when larger distances must be covered, for example, for integration into a fiber communication line using a comprehensive remote monitoring system [38] or for distributed monitoring of roads [39] in smart city infrastructure with a speech recognition function, a sensor length LS of tens of kilometers must be ensured. At the moment, long-distance DAS are most in demand. Therefore, ensuring high recognition quality at sampling frequencies of less than 10 kHz remains an important issue, and this problem can be solved by increasing the system sensitivity [37].
Other φ-OTDR parameters can be varied easily in practice: the required optical amplification is set by changing the pEDFA gain, and the pulse duration is changed by adjusting the parameters of the acousto-optic modulator (AOM). Therefore, we paid the most attention to how the ADC sampling frequency in our setup influences the quality of speech recognition.
In each experiment, the dependence of the quality of speech recognition on the ADC sampling frequency was studied using five speech-recognition services. Each service was used for speech recognition at sampling frequencies of 3, 5, 10, 20, and 40 kHz. For each sampling frequency, the percentage of correctly recognized words was calculated, which made it possible to compare the services quantitatively and to choose those that recognize speech best. For the chosen services, the quality of speech recording with our microphone and of speech recognition was then studied in more detail. As a result, curves of the percentage of recognized words and of the Levenshtein distance as functions of the ADC sampling frequency were obtained, which are equivalent to the dependence on the pulse repetition rate in a φ-OTDR. After analyzing these dependencies, the most effective sampling frequencies were selected to study the influence of other system parameters on the recognition quality. For example, when the fiber was affected by the sound from the speakers, the dependence of the recognition quality on the sound volume measured near the sensing fiber was studied.
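The two quality metrics can be sketched in a few lines. The Levenshtein distance below is the standard character-level edit distance; the word percentage is only one plausible definition (order-independent matching of lowercased words with punctuation removed), since the exact counting rule is not given in this section. The function names and the example hypothesis string are illustrative.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def words(text: str) -> list:
    """Lowercase the text, drop punctuation, and split it into words."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()


def percent_words_recognized(reference: str, hypothesis: str) -> float:
    """Share of reference words found in the hypothesis (assumed definition)."""
    ref_counts = Counter(words(reference))
    hyp_counts = Counter(words(hypothesis))
    matched = sum((ref_counts & hyp_counts).values())  # multiset intersection
    total = sum(ref_counts.values())
    return 100.0 * matched / total if total else 0.0


reference = "The birch canoe slid on the smooth planks."
hypothesis = "the birch canoe slid on smooth planks"  # illustrative recognizer output
print(percent_words_recognized(reference, hypothesis))
print(levenshtein(reference, hypothesis))
```

With an empty recognition result, the Levenshtein distance equals the number of characters in the reference text, which is why a distance close to the reference length means that speech is not recognized at all.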
The analysis of signal spectrograms is also informative. It allows us to evaluate the frequency composition of the signals received and thus determine why the particular configurations of sensing fiber provide good or bad speech recognition in a distributed fiber microphone based on φ-OTDR.
Figure 10a,b shows the spectrum and a spectrogram of the audio recording applied to the sensing fiber in the experiments. The spectrum (Figure 10a) is calculated as the FFT of the original audio recording, in which the 10 Harvard sentences from Listing 1 follow one another with 1 s spacing. The spectrum shows which spectral components are present anywhere in the realization, but not the moments at which they occur. This is a drawback of estimating the spectral composition from the spectrum alone and the reason why the spectrogram (Figure 10b) is needed for a proper comparison of the obtained spectral composition with the initial one. The original audio recording was obtained with a sampling frequency of 44 kHz, while the spectrogram is presented in the frequency range up to 5 kHz for better visualization, since spectral components above 5 kHz have a low magnitude and do not contribute significantly to the spectrum but rather increase the noise. The spectrogram shows the frequency composition as a function of time for each sentence. The sentences can be easily distinguished, as there are 10 regions in the 3D plot, separated in time by dark-blue silence regions of 1 s duration with a magnitude of less than −10 dB. Subsequently, we compared all the newly obtained and reconstructed spectrograms with this original 3D plot. The experimental results obtained in the different experiments are presented and discussed below.
3.2. Results of Experimental Studies of Quality of Speech Recognition with Different Services for Different Sensing Fiber Configurations
Table 4 and Table 5 show how the percentage of correctly recognized words depends on the ADC sampling frequency in a distributed fiber-optic microphone based on φ-OTDR with a sensing fiber wound around a hollow PZT cylinder and with a coiled sensing fiber, respectively. In both cases, the time sequences at the output of the setup shown in Figure 6 were gathered and recognized with the five services.
Table 6 shows the results of speech recognition using Yandex SpeechKit (YS) and Whisper NN for sampling frequencies of 20 kHz and 40 kHz and for different sound volumes of the acoustic wave from the speakers acting on the fiber coil. The sound volume was measured near the coil, perpendicular to the speakers, by a smartphone with a sound meter app. We selected three values of sound volume for the PC-controlled audio recording, which are given in Table 6. The sound volumes in the experiments, which ranged from 70 to 92 dB(C), are quite high and lie within the range of loud conversations.
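To give a feeling for what these level differences mean physically, the sketch below converts a difference in sound level into a ratio of sound pressures. It assumes that the smartphone readings can be treated as ordinary sound pressure levels, so that a difference of ΔL dB corresponds to a pressure ratio of 10^(ΔL/20); the function name is illustrative.

```python
def pressure_ratio(level_high_db: float, level_low_db: float) -> float:
    """Ratio of sound pressures for two levels given in dB."""
    return 10.0 ** ((level_high_db - level_low_db) / 20.0)


# The end points of the 70 to 92 dB(C) range quoted above, and a 12 dB step.
print(f"92 vs 70 dB(C): pressure ratio {pressure_ratio(92, 70):.1f}")
print(f"92 vs 80 dB(C): pressure ratio {pressure_ratio(92, 80):.1f}")
```

Under this assumption, the quietest recording in the quoted range perturbs the fiber with more than ten times less sound pressure than the loudest one.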
Table 7 gives the experimental results of speech recognition depending on the ADC sampling frequency for a straight sensing fiber in two cases: when the fiber is simply placed on a table (stage 3.1) and when a fiber section with a length of 0.8 m is glued to a metal plate (stage 3.2). As in the previous experiments, we used Yandex SpeechKit and Whisper NN for speech recognition of the gathered signals. The average sound volume measured near the sensing fiber was 89 dB(C). Studies at lower volume levels were also conducted, but in these cases, the quality of speech recognition was greatly reduced, as in the experiments with a coiled sensing fiber, and these results are uninformative.
Table 8 shows the results of speech recognition for signals gathered with an ADC sampling frequency of 40 kHz using a sensing fiber wound around an elastic core, for different volumes of the Harvard sentences audio recording played by the speakers. The sound volumes given in Table 8 were measured at the output of the speakers installed near the bottle.
For the fiber microphone based on a φ-OTDR with wFBGs, the quality of speech signal recognition was investigated for a sound volume of 108 dB(C). Table 9 and Table 10 show the results for Yandex SpeechKit and Whisper NN only, since the other algorithms did not recognize speech with adequate quality when a sensing fiber with a pair of wFBGs was used.
Figure 11 shows spectrograms of signals gathered with a sampling frequency of 40 kHz for different types of disturbance and sensing fiber configurations. The spectrograms are built using a sliding Hanning window with a length of HSP = 2000 samples. Each spectrogram is marked with the Levenshtein distance value calculated after speech recognition of the corresponding signal.
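A sliding-window spectrogram of this kind can be sketched with the standard library alone. At a sampling frequency of 40 kHz, a 2000-sample window spans 50 ms and gives a frequency step of 20 Hz. The sketch below is not the authors' processing code: the 50% window overlap, the dB scaling, and the function names are assumptions, and the naive DFT is used only to keep the example self-contained (an FFT routine would be used in practice).

```python
import cmath
import math


def hann_window(length: int) -> list:
    """Symmetric Hann (Hanning) window; length must be at least 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]


def spectrogram(signal, fs, window_length=2000, hop=None, max_freq=None):
    """Return (freqs_hz, times_s, frames), where frames[t][m] is in dB."""
    hop = hop or window_length // 2          # assumed 50% overlap
    window = hann_window(window_length)
    df = fs / window_length                  # frequency step of the DFT bins
    n_bins = window_length // 2 + 1          # bins from 0 Hz up to fs / 2
    if max_freq is not None:
        n_bins = min(n_bins, int(max_freq / df) + 1)
    freqs = [m * df for m in range(n_bins)]
    times, frames = [], []
    for start in range(0, len(signal) - window_length + 1, hop):
        seg = [signal[start + k] * window[k] for k in range(window_length)]
        row = []
        for m in range(n_bins):
            # Naive DFT of the windowed segment at bin m.
            acc = sum(seg[k] * cmath.exp(-2j * math.pi * m * k / window_length)
                      for k in range(window_length))
            row.append(20.0 * math.log10(abs(acc) + 1e-12))
        frames.append(row)
        times.append((start + window_length / 2) / fs)
    return freqs, times, frames


# Small demonstration with a short window so that the naive DFT stays fast.
fs = 8000.0
tone = [math.sin(2.0 * math.pi * 1000.0 * k / fs) for k in range(256)]
freqs, times, frames = spectrogram(tone, fs, window_length=64)
peak_bin = max(range(len(freqs)), key=lambda m: frames[0][m])
print(f"strongest component near {freqs[peak_bin]:.0f} Hz")
```

The `max_freq` argument mirrors the choice of displaying only the lower part of the spectrum, where most of the speech energy lies.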
As shown in Table 4 and Table 5, the best recognition quality is provided by Yandex SpeechKit and Whisper NN when the ADC sampling frequency is 40 kHz. The other recognition services (Yandex Translate, Speechpad, and Google Documents) do not provide good speech recognition: the percentage of correctly recognized words is zero or very low for a sensing fiber wound around both the “active” and the “passive” PZT. At sampling frequencies of 3 and 5 kHz, the low recognition quality of all five services is caused by poor speech recording with the microphone, because a low fD cuts off high frequencies. Meanwhile, at high sampling frequencies from 10 to 40 kHz, Yandex SpeechKit and Whisper NN give results that are better by 50% to 70% than those of Yandex Translate, Speechpad, and Google Documents, which might be due to the better-developed recognition algorithms of these two services. In the spectrogram of a signal gathered with the “active” PZT disturbance and a sampling frequency of 40 kHz, the sentences can be clearly distinguished from each other, and the spectrogram in Figure 11a is quite similar to the initial one in Figure 10b.
For the coiled sensing fiber, the most intense acoustic signal, with a volume of 92 dB(C), is recognized best, as shown in Table 6. As expected, the lower the intensity of the acoustic disturbance, the lower the percentage of correctly recognized words. When the sound volume is decreased from 92 dB(C) to 80 dB(C), the percentage of words correctly recognized by Whisper NN drops by half. When the volume is at the typical conversation level, i.e., 50 dB(C), speech cannot be recognized. The spectrogram for the coiled sensing fiber is noisier. Unlike the original one, it has high-power components (up to 50 dB in magnitude, shown in yellow in Figure 11b) in a wide frequency range above 5 kHz, so it is difficult to distinguish the wanted frequencies from the high background noise. We assume this is due to the much higher sensitivity of the 20 m fiber coil, which has approximately 75 turns with a winding diameter of 85 mm. However, to confirm this, it is necessary to exclude the possibility of signal wrapping at a high sound volume, which may also cause high noise in the spectrum.
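The quoted number of turns follows directly from the coil geometry. The short check below assumes that every turn has the full 85 mm winding diameter and neglects the fiber thickness and winding pitch; the function name is illustrative.

```python
import math


def coil_turns(fiber_length_m: float, winding_diameter_m: float) -> float:
    """Number of turns when each turn uses one circumference of fiber."""
    return fiber_length_m / (math.pi * winding_diameter_m)


turns = coil_turns(20.0, 0.085)
print(f"about {turns:.0f} turns")
```

All of these turns are exposed to the same acoustic wave, which is consistent with the assumption that the coil is far more sensitive than a straight fiber of a few meters.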
Table 7 shows that, for a pulse repetition rate above 10 kHz, high-quality speech recognition of up to 75% can be achieved even with a sensing fiber simply placed on the table. The spectrogram in Figure 11c shows that the 10 sentences can be distinguished from each other (there are regions of higher magnitude in the frequency range of 0.5–2 kHz), but it is very noisy. However, a frequency of 10 kHz is quite high and limits the parameters of a φ-OTDR: according to Table 3, LS,max = 10 km. For a sampling frequency lower than 10 kHz, a metal plate should be used to ensure more efficient conversion of sound into fiber vibrations. At sampling frequencies higher than 10 kHz, however, the metal plate has a serious drawback that is not typical of any other case. In the spectrogram in Figure 11d for 40 kHz, dead zones appear during the 22nd and 33rd seconds (displayed as dark-blue regions). During these time intervals, the fiber does not vibrate, and the sound from the speakers is not registered. We associate such artifacts with the natural frequencies of the plate, which lead to distortions and degrade the desired signal. Thus, we can conclude that using a metal plate to enhance sound transmission leads to many limitations.
Table 8 shows that acting on the bottom of the bottle and on the bottle sidepiece, unlike other sensing fiber configurations, yields a poor recognition quality of less than 15% of correctly recognized words, even at a high sound volume of 92 dB(C). This is caused by deterioration of the registered signal and can be explained as follows. When the bottom of the bottle is perturbed, it vibrates intensively, which amplifies both the wanted acoustic waves from the speakers and the unwanted ones coming from the environment. The mix of waves undergoes multiple reflections in the internal cavity of the bottle, which leads to a significant background level and the emergence of parasitic waves. The signal-to-noise ratio decreases because the unwanted acoustic waves are amplified multiple times. One can observe this effect in the spectrogram shown in Figure 11e as an increase in the noise level, so it is impossible to distinguish the sentences from each other. When the bottom and the sidepiece (Figure 11f) of the bottle are perturbed, the most significant spectral components in the spectrograms are at frequencies of about 0.5 kHz and 1 kHz, probably due to the natural frequencies of the bottle with a bottom. Perturbing the horn-like bottle from the inside yields good speech recognition quality even at the lowest volume level of 72 dB(C): in the spectrogram (Figure 11g), the sentences can be distinguished from each other, the speech is transmitted without significant distortion, and the noise level is quite low. Visually, the spectrogram is similar to the numerical simulation result based on the developed mathematical model.
The speech recognition results obtained with the wFBGs, given in Table 9 and Table 10, show that this configuration of the sensing fiber yields the highest quality of speech recognition. This is also demonstrated by the spectrograms for a sampling frequency of 40 kHz in Figure 11h (without a preamplifier, the percentage is 96% and the Levenshtein distance is 15) and in Figure 11i (with a preamplifier, the percentage is almost 94% and the Levenshtein distance is 26), which are similar to the spectrogram of the initial signal and in each of which the boundaries between the sentences can be clearly distinguished. However, in Figure 11i, unstable sensitivity regions are observed at the beginning of sentences six and eight, which are in an increased-sensitivity region. This may be due to random changes in the polarization state of the propagating light, which lead to the polarization-induced fading (PIF) effect [40]. If a preamplifier is used, some parasitic high-power spectral components appear at frequencies around 3.5 kHz, but they do not affect the quality of speech recognition dramatically. As shown in Table 9 and Table 10, the sensing fiber with wFBGs allows word recognition even at low sampling frequencies of 5 and 10 kHz. Thus, artificial reflectors such as wFBGs in the sensing fiber yield quite a stable signal, which is a promising result.
3.3. Comparative Analysis of Speech Recognition Quality for Different Sensing Fiber Configurations
For Yandex SpeechKit and Whisper NN, which generally demonstrated the best speech recognition quality, curves of the percentage of recognized words and of the Levenshtein distance were obtained as functions of the ADC sampling frequency in a range from 500 Hz to 40 kHz, as presented in Figure 12a–d.
The results of speech recognition at different sampling frequencies in the experimental setup using the PZT with a sensing fiber are shown in Figure 12 in blue. When the ADC sampling frequency fD is less than 5 kHz, the quality of speech recognition decreases significantly. The percentage of recognized words and the Levenshtein distance tend to 0 and 386, respectively (where M = 386 is the number of characters in the Harvard sentences of Listing 1), which means that speech is not recognized at all. With an increase in the sampling frequency and, consequently, in the pulse repetition rate, the quality of speech recognition improves according to both metrics. The percentage of recognized words increases and reaches 77.5% and 88.75% for Yandex SpeechKit and Whisper NN, respectively, and the Levenshtein distance decreases to 50 when the sampling frequency reaches 40 kHz. One can identify a certain cut-off sampling frequency, i.e., a threshold sampling frequency above which high-quality recognition is ensured: the percentage of recognized words is above 70%, and the Levenshtein distance is less than 100. For Yandex SpeechKit, the cut-off ADC sampling frequency is 15 kHz, and for Whisper NN, it is 5 kHz. When the ADC sampling frequency fD is greater than 20 kHz, the recognition quality increases insignificantly, and a kind of saturation appears. The nonlinear nature of the plot might be due to the random distribution of light amplitude and to phase wrapping, which affect the signal-to-noise ratio at the PD. It is interesting that for Whisper NN, the quality of recognition does not grow proportionally to the sampling frequency but instead has a threshold nature. At sampling frequencies below 5 kHz, speech is not recognized at all, and at a frequency of 5 kHz and higher, speech is recognized with high quality, up to 70% (a Levenshtein distance of 55). The results of these experiments allow us to conclude that the pulse repetition rate of a φ-OTDR significantly affects the quality of the recorded interference signal, which determines the subsequent quality of speech recognition. Thus, the quality of signal registration with the fiber, and hence the quality of speech recognition, depends on the pulse repetition rate of the φ-OTDR. Below the cut-off sampling frequency, all the words are transmitted poorly because the information contained in the high speech frequencies is lost, so the recognition services are not able to recover words from the sounds. Above it, the higher the pulse repetition rate of the φ-OTDR, the better the speech recognition quality. However, after reaching a limit (typical for a specific sensing fiber configuration), the recognition quality does not grow further with the pulse repetition rate, because at higher frequencies the noise increases along with the wanted signal.
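The cut-off criterion described above can be turned into a small search over a measured curve. The sketch below uses the thresholds from the text (more than 70% of words and a Levenshtein distance below 100) and interprets "ensured above the cut-off" strictly, i.e., every measured point at and above the cut-off must pass; this interpretation and the function name are assumptions. The curve in the example consists of made-up placeholder numbers, not measurements from the paper.

```python
def cutoff_frequency(curve, min_percent=70.0, max_distance=100):
    """Lowest sampling frequency from which all higher points pass both tests.

    curve: iterable of (sampling_frequency_hz, percent_words, levenshtein).
    Returns None if even the highest sampling frequency fails.
    """
    cutoff = None
    # Walk from the highest sampling frequency downwards until a point fails.
    for fs_hz, percent, distance in sorted(curve, reverse=True):
        if percent > min_percent and distance < max_distance:
            cutoff = fs_hz
        else:
            break
    return cutoff


# Placeholder curve for illustration only (NOT data from the experiments).
example_curve = [
    (1_000, 0.0, 386),
    (5_000, 40.0, 200),
    (10_000, 72.0, 90),
    (20_000, 80.0, 60),
    (40_000, 85.0, 50),
]
print(cutoff_frequency(example_curve))
```

With this strict definition, a curve that dips at an intermediate sampling frequency gets a cut-off above the dip, so a rugged curve is reported as having a higher cut-off than a smooth one.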
It is also important to compare the experimental results obtained when the speakers perturb a coiled sensing fiber or a sensing fiber with a length of 2.5 m simply placed on the table. For the coiled sensing fiber, the speech recognition quality curve is shown in Figure 12 in red. It is more rugged than the one obtained with the PZT disturbance, which is probably due to the peculiarities of sound wave propagation from the speakers along the sensing fiber. Nevertheless, for Whisper NN, a cut-off recognition frequency of about 10 kHz can be identified. At an ADC sampling frequency of 25 kHz, there is a sharp deterioration of the recognition quality, perhaps because of an increase in random ambient acoustic noise during this experiment. For higher sampling frequencies, however, the quality of recognition is better, and the percentage of recognized words reaches 72.2% (the Levenshtein distance is 104). At a sampling frequency of 40 kHz, the recognition quality decreases again. This characterizes the microphone with a coiled sensing element as unstable over time, probably because of its high sensitivity to both wanted (speech) and unwanted (acoustic noise) disturbances.
For the setup with a sensing element glued to a metal plate (the curve shown in purple in Figure 12), the recognition quality also drops sharply at some sampling frequencies. In general, however, the quality of speech recognition increases with the sampling frequency. At frequencies lower than 10 kHz, this sensing fiber configuration allows better speech recording and recognition than the others, except for the sensing fiber with wFBGs. When comparing the three dependencies obtained with Whisper NN for a coiled sensing fiber, a sensing fiber simply placed on the table (yellow curve), and a sensing fiber glued to a metal plate, one can see that at a sampling frequency of about 5 kHz, the best result is obtained with the sensing fiber glued to a metal plate. However, when the sampling frequency is greater than 10 kHz, the quality of speech recognition with a coiled sensing fiber becomes higher than with the other two configurations. Thus, for a pulse repetition rate of more than 10 kHz, increasing the sensitivity with a metal plate becomes unsuitable, since in this case the result is the worst: the speech recognition quality is less than 66.3% (the Levenshtein distance is 102).
The results demonstrate that it is fundamentally possible to recognize speech recorded by a fiber microphone based on φ-OTDR with a pulse repetition rate of frep > 10 kHz with an accuracy of more than 50%, even when a linear sensing fiber is simply placed on a table. If the quality of speech recognition must be increased without increasing the pulse repetition rate, a coiled fiber can be used, which increases the sensitivity of the system and, in practice, is easily implemented for short paths. When a real acoustic disturbance is applied to the fiber by the speakers, the best recognition quality is obtained with a sensing fiber with wFBGs. It is interesting to compare the results obtained with a wFBG sensing fiber and with a sensing fiber wound around a hollow PZT cylinder in two spectral ranges and with two recognition methods. For Whisper NN, the cut-off recognition frequency with the PZT actuator is 5 kHz, and in the frequency range below 10 kHz, the speech recognition quality in the scheme with the sensing fiber wound around the PZT is the best. At the same time, for a sensing fiber with wFBGs, the cut-off recognition frequency is 10 kHz, and at fD = 3 kHz and fD = 5 kHz, few words are recognized by Whisper NN, so the percentage of recognition is 30%, and the Levenshtein distance is more than 200. Thus, when the fiber is perturbed by the speakers, wFBGs make it possible to ensure the best recognition quality and stable operation of the fiber microphone. At a lower pulse repetition rate, the recognition quality depends significantly on the recognition algorithm: the experimental results show that Yandex SpeechKit recognizes almost 70% of words at fD = 5 kHz, while Whisper NN recognizes less than 30%.
An important result is that the speech recognition quality increases with the pulse repetition rate, which limits the spectrum that can be recorded. A low pulse repetition rate results in a loss of high-frequency components. High frequencies in human speech are important, as they carry meaning. They are used not only to express emotions and intonation but also to help distinguish sounds and words in speech, give brightness and clarity to the voice, and improve speech intelligibility in noisy environments. Therefore, even with effective neural network algorithms, a recognition quality above 75% is ensured only at high pulse repetition rates. However, a high pulse repetition rate limits the maximum length of the sensing fiber. Thus, it is necessary to find a compromise between the available sensing fiber length and the desired quality of speech recognition, which depends on the specific task to be solved. For example, a higher pulse repetition rate permits only a shorter sensing fiber but provides a higher quality of speech recognition within the observed area. This might be preferable when all commands must be captured clearly, with an error of no more than a few words, for example, in voice control systems within the limited area of one apartment, office, etc. When the length of the sensor must be increased, only a lower recognition quality can be obtained, but this allows, for example, more users to be served simultaneously without increasing equipment costs. In this case, one can recognize only the keywords of the conversation, which, for example, indicate that there is a request for some service at a specific coordinate along the sensor, and the command can subsequently be clarified in a follow-up contact.
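This compromise can be made concrete with a short calculation. The sketch below assumes the standard single-pulse-in-fiber condition f_rep,max = c/(2 n L), a rounded c = 3 × 10^8 m/s, n = 1.5, and the Nyquist limit, under which a signal sampled at f_rep can represent acoustic frequencies only up to f_rep/2. The function names and the list of example lengths are illustrative.

```python
C_VACUUM = 3.0e8  # rounded speed of light in a vacuum, m/s (assumption)


def max_repetition_rate(fiber_length_m: float, n: float = 1.5) -> float:
    """Highest pulse repetition rate that keeps one pulse in the fiber."""
    return C_VACUUM / (2.0 * n * fiber_length_m)


def recordable_bandwidth(f_rep_hz: float) -> float:
    """Nyquist limit: highest acoustic frequency representable at f_rep."""
    return f_rep_hz / 2.0


for length_km in (2.5, 10.0, 50.0):
    f_rep = max_repetition_rate(length_km * 1e3)
    band = recordable_bandwidth(f_rep)
    print(f"L = {length_km:4.1f} km: f_rep <= {f_rep / 1e3:4.1f} kHz, "
          f"acoustic band <= {band / 1e3:4.1f} kHz")
```

Under these assumptions, a 2.5 km sensor can be interrogated at 40 kHz, while a sensor of tens of kilometers is restricted to a few kilohertz, where only part of the speech spectrum is recorded.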