1. Introduction
The rapid development of information and communication technology (ICT) has led to a positive effect on societies and communities in many ways, along with a dramatic increase in multimedia information usage via the Internet. In addition, new innovations have appeared alongside the utilization of big data collected from cyber-physical systems and worldwide networks. Although it can enrich our daily lives, multimedia big data, which includes personal data that users wish to keep private, is at high risk of illegal distribution and misuse through the proliferation of non-authentic media and the leakage of private information.
For example, speech communication technology has been implemented via speech communication channels such as the Voice over Internet Protocol (VoIP) and the Public Switched Telephone Network (PSTN). Because these channels are considerably vulnerable to attacks, problems such as speech tampering, spoofing, and issues regarding the digital forensics of speech data produced with voice conversion and text-to-speech techniques have occurred [1,2]. Therefore, it is necessary to achieve secure protection and implement preventive countermeasures in speech communication technology.
The speech watermarking technique has attracted attention as a countermeasure against speech tampering and spoofing [3,4]. This technique aims to protect digital speech content by embedding an inaudible security code into a speech signal and by detecting the embedded security code from the watermarked speech signal. In general, speech watermarking methods must meet four requirements to provide a useful and reliable form of audio/speech watermarking [2,5]: (1) inaudibility (the embedded data are inaudible to humans and cause no sound distortion), (2) robustness (the embedded data survive processing such as data compression and malicious attacks), (3) blind-detectability (the embedded data can be detected without using the original or a reference signal), and (4) confidentiality (the embedded data are concealed securely and undetectably).
Conventional audio watermarking techniques can also be used for speech watermarking [6]. Typical examples include the least-significant bit (LSB) replacement method [3] and the direct spread spectrum (DSS) method [3,7]. The LSB method is advantageous for inaudibility because it has little effect on the sample magnitude [3], but it is fragile against any modification. In contrast, the DSS method is robust against various modifications [3,7], but it is disadvantageous for inaudibility because spectrum spreading distorts the sound. In short, the LSB method satisfies requirement (1) but not (2), and the DSS method satisfies requirement (2) but not (1).
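The contrast between the two conventional methods can be illustrated with a minimal sketch. It is not taken from the cited implementations: the function names, the frame length of 256 samples, the strength of 0.05, and the synthetic noise host are all assumptions chosen for illustration, and plain Python lists are used so that the example needs only the standard library.

```python
import random

def lsb_embed(samples, bits):
    """Overwrite the least-significant bit of the first len(bits) integer samples."""
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit
    return out

def lsb_extract(samples, n_bits):
    """Read the payload back from the least-significant bits."""
    return [s & 1 for s in samples[:n_bits]]

def dss_embed(host, bits, pn, strength):
    """Add +/- strength * pn to consecutive frames of len(pn) samples, one bit per frame."""
    out = list(host)
    for j, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for t, chip in enumerate(pn):
            out[j * len(pn) + t] += strength * sign * chip
    return out

def dss_detect(signal, n_bits, pn):
    """Despread by correlating each frame with the same PN sequence and taking the sign."""
    n = len(pn)
    return [1 if sum(signal[j * n + t] * pn[t] for t in range(n)) > 0 else 0
            for j in range(n_bits)]

random.seed(0)
bits = [1, 0, 1, 1, 0, 0, 1, 0]

# LSB: changes each sample by at most one quantization step, but is easily erased.
pcm = [random.randint(-2000, 2000) for _ in range(64)]       # hypothetical integer samples
lsb_marked = lsb_embed(pcm, bits)
requantized = [(s >> 1) << 1 for s in lsb_marked]             # drop one bit of resolution

# DSS: spreads each bit over a whole frame, so it tolerates added noise.
pn = [random.choice((-1.0, 1.0)) for _ in range(256)]         # shared pseudo-random key
host = [random.gauss(0.0, 0.1) for _ in range(256 * len(bits))]
dss_marked = dss_embed(host, bits, pn, strength=0.05)
noisy = [y + random.gauss(0.0, 0.05) for y in dss_marked]     # mild channel noise

print("LSB after requantization:", lsb_extract(requantized, len(bits)))
print("DSS after added noise:   ", dss_detect(noisy, len(bits), pn))
```

In this toy setting, the LSB payload disappears after a single requantization step, whereas the DSS payload is still recovered after noise is added, at the cost of a perturbation on every sample of the frame.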
Other state-of-the-art audio watermarking methods include singular value decomposition (SVD) with dither modulation quantization, which is a type of quantization index modulation (QIM) [8,9], and various phase modulation techniques [5,10]. Although they are strong in one or two requirements (e.g., inaudibility and robustness, or blind-detectability), they cannot satisfy all four requirements simultaneously because they are fragile against speech codecs and sensitive to frame desynchronization attacks. This suggests that speech watermarking built on typical audio watermarking techniques must be reconsidered so that it is robust against speech codecs and supports blind detection with frame synchronization.
Some speech watermarking methods are designed for specific speech codecs [2]. These methods embed and detect a security code adaptively in the parameters of the codec scheme (linear predictive parameters, codebook parameters, vector quantization, etc.) for G.711, G.721, G.728, G.729, and GSM, although the main approaches target code-excited linear prediction (CELP) and mixed-excitation linear prediction (MELP). These methods can be used for covert communication based on the VoIP system and for the design of real-time communication over PSTN [2]. All of these methods involve some trade-off between sound quality, capacity, and robustness. However, as there have been no comparative evaluations of these methods under various conditions, it is unclear whether they can satisfy all four requirements simultaneously, although it is safe to say that they are not robust against various speech codecs because they depend on specific speech codecs.
A few sophisticated speech watermarking methods based on the source-filter model have been proposed for application to speech tampering detection [11,12]. The source-filter model assumes that the glottal pulse is the sound source and the vocal tract is a filter (spectral envelope) in a speech synthesis system. Linear prediction (LP) synthesis can be utilized as the source-filter model. In LP synthesis, speech is decomposed into the sound source, represented by the LP residue, and the vocal tract information, represented by the LP coefficients, and these two components are used to synthesize the speech signal. This inspired us to investigate whether an LP-based speech watermarking scheme could be implemented by applying watermarking separately to the LP residue as the source and to the LP coefficients as the vocal tract filter without the two affecting each other.
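The decomposition into LP coefficients and LP residue can be sketched as follows. This is a minimal illustration that assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to perform LP analysis and is not stated in the text; the function names and the synthetic second-order autoregressive "speech" frame are hypothetical.

```python
import random

def lp_coefficients(frame, order):
    """Autocorrelation method with the Levinson-Durbin recursion.
    Returns a[1..order] such that x[t] is predicted by sum_i a[i] * x[t - i]."""
    n = len(frame)
    r = [sum(frame[t] * frame[t - lag] for t in range(lag, n)) for lag in range(order + 1)]
    a, err = [0.0] * (order + 1), r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                      # silent frame: nothing to predict
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        a = [a[j] - k * a[i - j] if 0 < j < i else a[j] for j in range(order + 1)]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]

def lp_residue(frame, coeffs):
    """Source part: prediction error e[t] = x[t] - sum_i a[i] * x[t - i] (zeros before the frame)."""
    return [frame[t] - sum(c * frame[t - i] for i, c in enumerate(coeffs, 1) if t >= i)
            for t in range(len(frame))]

def lp_synthesize(residue, coeffs):
    """Filter part: drive the all-pole synthesis filter with the residue."""
    out = []
    for t, e in enumerate(residue):
        out.append(e + sum(c * out[t - i] for i, c in enumerate(coeffs, 1) if t >= i))
    return out

random.seed(0)
frame = [0.0, 0.0]
for _ in range(320):                        # 320 samples = 20 ms at 16 kHz
    frame.append(1.3 * frame[-1] - 0.6 * frame[-2] + random.gauss(0.0, 1.0))
frame = frame[2:]

coeffs = lp_coefficients(frame, order=12)
residue = lp_residue(frame, coeffs)
rebuilt = lp_synthesize(residue, coeffs)

print("residue energy / frame energy:",
      sum(e * e for e in residue) / sum(x * x for x in frame))
print("max reconstruction error:", max(abs(x - y) for x, y in zip(frame, rebuilt)))
```

Because the residue and the coefficients together reconstruct the frame, a modification applied to one of them can, in principle, be made without touching the other, which is the property the LP-based watermarking schemes rely on.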
Speech watermarking methods based on formant tuning (FT) [13] and on the McAdams coefficient (MC) [14] have been proposed for LP schemes to embed the security code into the spectral envelope of a speech signal, i.e., the LP coefficients. The FT method achieves inaudible watermarking by controlling line spectral frequency (LSF) features converted from the LP coefficients so that the security code is embedded by formant tuning. The MC-based method also achieves inaudible and robust watermarking by controlling McAdams coefficients so that the security code is embedded by frequency spectral scaling, i.e., scale-shifting of the spectral envelope shapes related to the LSFs. These methods satisfy robustness because the LSF features are robust against various speech codecs. Although both methods can satisfy blind-detectability, they may be sensitive to frame desynchronization. Confidentiality in these methods can be satisfied by keeping a condition of the watermarking algorithm non-public.
Another possibility for speech watermarking based on the LP scheme is to embed the security code into the LP residue. We previously proposed a DSS method using the LP residue (LP-DSS scheme) to combine the robustness of the DSS method with the inaudibility of the LP-based method [15], which was inspired by the similarity between the statistical properties of the pseudo-random noise (PN) signal used in DSS and those of the LP residue. While this method was more inaudible than the DSS method and kept the same robustness, it did not satisfy the last two requirements, blind-detectability and confidentiality, because it is still a non-blind speech watermarking method.
This paper aims to develop a state-of-the-art speech watermarking technique that satisfies all four requirements. To this end, we propose a blind speech watermarking method based on the LP-DSS scheme that incorporates its blind detection and frame synchronization and adds two embedding processes to solve the blind-detectability and confidentiality issues.
This paper is organized as follows.
Section 2 describes the LP-DSS scheme and then addresses the remaining issues.
Section 3 describes the two forms of data embedding for blind detection and frame synchronization to solve the remaining issues.
Section 4 validates the proposed method with regard to frame synchronization and blind detection, and then describes how to set the embedding strength in the proposed method.
Section 5 compares the proposed method with three related methods with regard to the four requirements for speech watermarking.
Section 6 concludes this paper.
3. Proposed Method
We propose blind detection and frame synchronization processes to satisfy the blind-detectability requirement. We also propose two forms of data embedding for the blind detection to satisfy the confidentiality requirement.
We assume that the LP residues of three adjacent frames are highly correlated. This suggests that the LP residue of an even-numbered frame can be used to embed security codes in odd-numbered frames, and that the LP residue of the same even-numbered frame can then be used to accurately detect the security codes from the watermarked signal of the odd-numbered frames. We also speculate that the opposite embedding/detection form can be used by interchanging the even-numbered and odd-numbered frames. Given that the LP residues used for embedding and detection are the same, frame synchronization can also be achieved by maximizing the cross-correlation between the watermarked signal and the LP residue, which serves as the synchronization criterion.
3.1. Concept of Data Embedding for Blind Detection and Frame Synchronization
The LP scheme presupposes that the time sequence signal is a stationary process. Within a short duration, speech signals are stationary, so they can be accurately predicted using the LP scheme. Consequently, the host signals of adjacent frames have equal statistical properties during frame processing, and the host signals are highly correlated across adjacent frames. The LP residues obtained from the host signal are also expected to be highly correlated across adjacent frames.
The LP residue in the (k)th frame is assumed to be equal to the LP residue in an adjacent frame, i.e., the (k − 1)th frame or the (k + 1)th frame. On the basis of this assumption, the LP residue in the adjacent (k − 1)th or (k + 1)th frame can be used to embed a security code in the host signal in the (k)th frame. The security code can also be detected by using the LP residue in the adjacent (k − 1)th or (k + 1)th frame. Furthermore, if we consider the embedding of security code in the forms of front and back frames, when the LP residue in the (k)th frame for embedding and the LP residue in the (k)th frame for detection are the same (), the expectation is calculated as .
On the other hand, when the different LP residue () in the (k)th frame is used for embedding and detecting secure code into a speech signal, the expected value is . The expected value in the case of the frame being correctly synchronized is thus higher than in the case of the frame not being correctly synchronized.
Figure 2 shows the block diagram of the proposed method, where (a,b) show the embedding and detection processes for the security code, respectively. The difference between the non-blind LP-DSS method and the blind LP-DSS method is that the host signal is classified into even-numbered and odd-numbered frames, and the LP residue from the host signal of the even-numbered or odd-numbered frames is used to embed the security code in the host signal of the other frames. As a result, the LP residue can be recomputed from the watermarked signal of the unprocessed frame, and the security code can be detected from the watermarked signal of the other frame. In addition, even if the frame position is shifted during security code detection, the proposed method can correctly detect the security code by synchronizing the frames. The data embedding, frame synchronization, and blind detection processes are explained in Sections 3.2–3.4, respectively.
3.2. Data Embedding Process
As a basic assumption, the frame length is known. We consider two forms of embedding process in the frame processing: front-side and back-side.
Figure 3a,b show the embedding process of the security code in the front-side and back-side forms, respectively.
The host signal is divided into K frames by a short-term analysis using a fixed frame length. In the front-side form of the embedding process, k denotes an even-numbered frame. The LP residue of the adjacent odd-numbered frame is obtained by LP analysis of the host signal in that frame. The security code is spread-spectrum modulated using this LP residue and embedded in the host signal of the even-numbered frame k to give the watermarked signal. The watermarked signal in the odd-numbered frame is the same as the host signal in that frame because nothing is embedded there. To embed the security code robustly, the embedding strength a is controlled by the embedding-strength level (as in the LP-DSS method) using Equation (6).
Similarly, the embedding process (back-side form) can be done by switching the even/odd-numbered frames for LP residue calculation and embedding, as shown in Figure 3b.
3.3. Frame Synchronization Process
Figure 4 shows the proposed frame synchronization process. The watermarked signal is segmented into K frames. The LP residue of the frame adjacent to the (k)th frame is calculated from the watermarked signal in that adjacent frame. The expected value of the product of this LP residue and the watermarked signal in the (k)th frame is then evaluated, as in Equation (10). Substituting the watermarked signal into Equation (10) yields Equation (11), where the signals involved are assumed to be ergodic and the statistical properties of the LP residue are used.

We then multiply the watermarked signal by the LP residue and obtain the expected value by Fourier transform for each sample shift s. Here, the subscript k denotes a reference frame, defined as a frame of the watermarked signal, and the LP residue is that of the adjacent frame used to embed the security code. By finding the expectation value as in Equation (9), only the second term in Equation (11) remains. Then, the watermarked signal in the (k)th frame is shifted by one sample and the expected value is calculated in the same way. This is repeated over the whole range of sample shifts. The absolute value of each expected value is taken, and the arithmetic mean of these correlation values over the reference frames is calculated.

The synchronized position is determined as the sample shift that maximizes this arithmetic mean.
3.4. Blind Detection Process
We consider two forms of detection process in the frame processing: front-side and back-side.
Figure 3c,d show the detection process of the security code in the front-side and back-side forms, respectively.
The security code is detected from the watermarked signal in the (k)th frame. In the front-side form of the detection process, the watermarked signal in the adjacent reference frame is the same as the host signal in that frame. The LP residue of the reference frame is therefore calculated from the watermarked signal in that frame by using LP analysis. Then, the security code can be obtained accurately by multiplying this LP residue by the watermarked signal in the (k)th frame using Equation (7).
Similarly, the detection process can be done backward (back-side form) by switching the even/odd-numbered frames for the LP residue calculation and detection, as shown in Figure 3d.
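The embedding, blind detection, and synchronization criterion of Sections 3.2–3.4 can be put together in a small sketch. It is a simplified illustration under several assumptions and not the implementation evaluated in this paper: the front-side form is assumed to take the LP residue from the preceding frame, one bit is embedded per frame pair, `strength` is a plain multiplier standing in for Equation (6), LP analysis uses the autocorrelation method, the host is a synthetic autoregressive signal rather than speech, and all function names are hypothetical.

```python
import random

def lp_residue(frame, order):
    """LP residue of one frame (autocorrelation method, Levinson-Durbin recursion)."""
    n = len(frame)
    r = [sum(frame[t] * frame[t - lag] for t in range(lag, n)) for lag in range(order + 1)]
    a, err = [0.0] * (order + 1), r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        a = [a[j] - k * a[i - j] if 0 < j < i else a[j] for j in range(order + 1)]
        a[i] = k
        err *= 1.0 - k * k
    return [frame[t] - sum(a[i] * frame[t - i] for i in range(1, min(order, t) + 1))
            for t in range(n)]

def embed_front(host, bits, frame_len, order, strength):
    """Frame 2j (0-based) stays untouched; its LP residue carries bit j into frame 2j + 1."""
    marked = list(host)
    for j, bit in enumerate(bits):
        ref = host[2 * j * frame_len:(2 * j + 1) * frame_len]
        res = lp_residue(ref, order)
        sign = 1.0 if bit else -1.0
        start = (2 * j + 1) * frame_len
        for t in range(frame_len):
            marked[start + t] += strength * sign * res[t]
    return marked

def correlate(signal, offset, j, frame_len, order):
    """Correlate the residue of reference frame 2j with frame 2j + 1, framing from `offset`."""
    a = offset + 2 * j * frame_len
    ref = signal[a:a + frame_len]
    tgt = signal[a + frame_len:a + 2 * frame_len]
    return sum(r * y for r, y in zip(lp_residue(ref, order), tgt))

def detect_front(marked, n_bits, frame_len, order, offset=0):
    """Blind detection: only the watermarked signal is needed; the sign gives the bit."""
    return [1 if correlate(marked, offset, j, frame_len, order) > 0 else 0
            for j in range(n_bits)]

def sync_score(marked, offset, n_ref, frame_len, order):
    """Mean absolute correlation over n_ref reference frames for a candidate offset."""
    return sum(abs(correlate(marked, offset, j, frame_len, order))
               for j in range(n_ref)) / n_ref

random.seed(1)
FRAME, ORDER, N_BITS = 64, 4, 40            # toy sizes, much smaller than in the paper
host = [0.0, 0.0]
for _ in range(FRAME * (2 * N_BITS + 1)):   # one spare frame so shifted windows stay in range
    host.append(1.3 * host[-1] - 0.6 * host[-2] + random.gauss(0.0, 1.0))
host = host[2:]
bits = [random.randint(0, 1) for _ in range(N_BITS)]

marked = embed_front(host, bits, FRAME, ORDER, strength=2.0)   # exaggerated strength
recovered = detect_front(marked, N_BITS, FRAME, ORDER)
aligned = sync_score(marked, 0, N_BITS, FRAME, ORDER)
shifted = sync_score(marked, FRAME // 2, N_BITS, FRAME, ORDER)

print("bit errors:", sum(b != r for b, r in zip(bits, recovered)))
print("sync score, aligned vs. half-frame shift:", aligned, shifted)
```

A synchronization search would evaluate `sync_score` for every candidate offset within one frame and keep the offset with the largest score. The back-side form would be obtained by exchanging the roles of the reference and embedded frames.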
4. Validation of Proposed Method
In this section, we investigate whether the frame synchronization and blind detection processes are functioning as designed and determine the appropriate level for the embedding strength. The correct functioning of the frame synchronization process is evaluated by carrying out a frame synchronization test. The correct functioning of the blind detection process is evaluated by carrying out bit detection tests. The appropriate embedding-strength level is determined by examining the trade-off between PESQ, LSD, and BER.
4.1. Database and Conditions
The 12 utterances in the ATR database (B set) [16] were used to evaluate the frame synchronization process and the blind detection process. The original speech signals had a 16-kHz sampling frequency, 16-bit quantization, a duration of 8.5 s, and one channel (mono).
The LP order P is 12 because the sampling frequency is 16 kHz. The frame length of the proposed method is 20 ms, the same as the frame length of the LP analysis. The bit rates (bps) in these validations were 4, 8, and 16 bps. Twelve random bit strings were used for each bit rate. The embedding-strength level for embedding the security code was set to dB in 5-dB increments.
4.2. Validation of Frame Synchronization
The frame synchronization process in the proposed method was evaluated by carrying out a frame synchronization test. The accuracy of frame synchronization
was considered to be correctly frame synchronized at
. This is defined as
where
is the number of analysis frames that can be completely frame-synchronized, and
is the number of all analysis frames. The number of reference frames is 1, 2, 4, 8, 16, 32, 64, 128, and All (202). The total number of speech signal frames used in this evaluation is
, that is, 202.
Figure 5 shows the averaged accuracy of frame synchronization for the front-side and back-side forms, where the horizontal axis represents the number of reference frames and the vertical axis represents the accuracy of frame synchronization. We can see here that the frame synchronization processes of the front-side and back-side forms show no significant difference in accuracy. In addition, the accuracy of frame synchronization increases with the number of reference frames and with the embedding-strength level. Specifically, the frame synchronization processes of the front-side and back-side forms have
accuracy of frame synchronization when the embedding-strength level is
dB and the number of reference frames is
, that is, 202.
Next, we evaluated the robustness of the frame synchronization process of the front-side and back-side forms against the sample-cut attack. The first few samples of the signal with embedded bit strings are randomly deleted in the range of . The resulting signal is used as input to analyze the accuracy of frame synchronization using Equation (15). The number of samples to be deleted is 14, 57, 109, 127, 144, 320, 406, 439, 480, 487, 494, and 611.
Figure 6 shows the averaged accuracy of frame synchronization for the front-side and back-side forms under the sample-cut attack, where the horizontal axis represents the number of reference frames and the vertical axis represents the accuracy of frame synchronization. We can see here that the frame synchronization processes of the front-side and back-side forms show no significant difference in accuracy. In addition, the accuracy of frame synchronization increases with the number of reference frames and with the embedding-strength level. Specifically, the frame synchronization accuracy in the front-side and back-side forms is
when the number of reference frames is All and the embedding-strength level is
dB.
4.3. Validation of Blind Detection
The performance of the blind detection process to detect security codes correctly was evaluated from the results of bit detection tests. The criterion for detecting a correct security code was a BER of less than or equal to .
Figure 7 shows the BER results of blind detection for the front-side and back-side forms, where the horizontal axis represents bit rate and the vertical axis represents BER. The solid and dotted lines show results with a BER of less than and greater than
, respectively. As we can see, the BER increases with the bit rate increase. In addition, a decrease in BER was observed with an increase in the embedding-strength level. As a result, we confirmed that at bit rates of 4 and 8 bps, the BER was less than
at the embedding-strength level of
dB, and at the bit rate of 16 bps, the BER was less than
at the embedding-strength level of
dB.
4.4. Validation of Blind Detection with Frame Synchronization
To investigate the effect of the frame synchronization process on the blind-detection process, we compared the BER of the blind-detection process alone to that with the frame synchronization. Here, we used the number of reference frames All, where the frame synchronization accuracy was .
Figure 8 shows the BER results of the blind-detection process with the frame synchronization for the front-side and back-side forms, where the horizontal axis represents bit rate and the vertical axis represents BER. Comparing Figures 7 and 8, we can see that the blind-detection process with frame synchronization reduced the BER by
at the embedding-strength level of
dB. On the other hand, at
to 0 dB, where the accuracy of the frame synchronization was
, no significant difference was observed with the presence or absence of frame synchronization.
4.5. Setting of Embedding Strength
The embedding-strength level in the proposed method is determined by trading off PESQ, LSD, and BER, i.e., by keeping PESQ high while keeping LSD and BER low. As shown in Equation (6), the embedding strength is calculated from the power levels of the host signal and the watermarked signal and from the embedding-strength level.
Bit detection and sound quality tests were carried out to determine the optimal embedding-strength level. In the sound quality test, PESQ (International Telecommunication Union Telecommunication Standardization Sector (ITU-T) Recommendation P.862) and LSD were used as objective measures, as in our previous study [15]. In speech watermarking, it is generally helpful if the security code can be detected with a bit rate of 6 bps and a BER of less than or equal to
.
We carried out the sound quality tests (PESQ and LSD) to determine how well the watermarked signal was perceived. PESQ is the perceived sound quality of the watermarked signal relative to the original signal, expressed as mean opinion scores (MOS). The MOS varies on a scale of 1 (bad), 2 (poor), 3 (fair), 4 (good), and 5 (excellent). Typically, the PESQ threshold for speech watermarking is 3 (fair or slightly annoying). LSD is defined as
where the two spectra are the short-term Fourier transforms of the original signal and the watermarked signal in the q-th frame. Typically, the LSD criterion for speech watermarking is less than or equal to 1 dB.
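As an illustration of how such a measure can be computed, the sketch below implements one common form of the log-spectral distance: the root-mean-square difference between the log power spectra of the two signals in each frame, averaged over frames. The exact normalization used in this paper may differ, and the function names, the naive DFT, the frame length of 32 samples, and the small constant `eps` are assumptions made for a self-contained example.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power spectrum over bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(frame))) ** 2
            for f in range(n // 2 + 1)]

def log_spectral_distance(original, watermarked, frame_len, eps=1e-12):
    """Mean over frames of the RMS difference between log power spectra, in dB."""
    dists = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        px = power_spectrum(original[start:start + frame_len])
        py = power_spectrum(watermarked[start:start + frame_len])
        sq = [(10.0 * math.log10((b + eps) / (a + eps))) ** 2 for a, b in zip(px, py)]
        dists.append(math.sqrt(sum(sq) / len(sq)))
    return sum(dists) / len(dists)

random.seed(0)
original = [random.gauss(0.0, 1.0) for _ in range(128)]
doubled = [2.0 * x for x in original]           # uniform gain of 2 on every bin

print("LSD of a signal with itself:", log_spectral_distance(original, original, 32))
print("LSD after doubling the amplitude:", log_spectral_distance(original, doubled, 32))
```

Doubling the amplitude multiplies every power-spectrum bin by four, so the second value should be close to 10 log10(4), about 6.02 dB, which gives a quick sanity check of the implementation.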
Figure 9 shows the results of PESQ, LSD, and BER from the blind detection process with frame synchronization of the front-side form, where the horizontal axis represents embedding-strength level
and the vertical axis represents (a) PESQ, (b) LSD, and (c) BER. As we can see, the BER decreases as
increases, and distortions increase as
increases. The
was determined to be
dB when the bit rate was 4 and 8 bps. At this time, the mean
was
dB. When the bit rate was 16 bps, there was no optimal
. Thus, the
was determined to be
dB, in which BER
, LSD
dB, and PESQ
MOSs, at this bit rate. Hence, the mean
was
dB.
Figure 10 shows the results of PESQ, LSD, and BER from the blind detection process with frame synchronization of the back-side form, where the horizontal axis represents embedding-strength level
and the vertical axis represents (a) PESQ, (b) LSD, and (c) BER. We can see here that the BER decreases as
increases, and distortions increase as
increases. The
was determined to be
dB when the bit rate was 4 and 8 bps. At this time, the mean
was
dB. When the bit rate was 16 bps, there was no optimal
. Thus, the
was determined to be
dB, in which BER
, LSD
dB, and PESQ
MOSs, at this bit rate. Hence, the mean
was
dB.
After applying the two embedding processes for the blind detection process with the frame synchronization process, the optimal embedding-strength level was determined to be dB.
5. Comparative Evaluations
We investigated whether the proposed method satisfies the four requirements of inaudibility, robustness, blind-detectability, and confidentiality by comparing it with the LSB, DSS [4], and MC [14] methods.
5.1. Database and Conditions
We selected a total of 100 utterances: 50 semi-randomly selected utterances from LibriSpeech [17] and 50 semi-randomly selected utterances from the Voice Cloning Toolkit (VCTK) [18]. The selections were semi-random so that we could obtain utterances from a specific number of speakers. LibriSpeech was originally sampled at 16 kHz to study automatic speech recognition, and VCTK was originally sampled at 48 kHz to study speech synthesis. The sampling frequency for both databases was standardized to 16 kHz.
Inaudibility was evaluated by performing a sound quality test. Robustness was evaluated by carrying out basic robustness tests of ten processes as non-malicious interference [19]: normal (no interference), downsampling to 12 kHz (Resample-12), upsampling to 24 kHz (Resample-24), bit compression to 8 bit (Requant-8), bit expansion to 24 bit (Requant-24), conversion to Ogg format (Ogg), conversion to G711 coding (G711), conversion to G723.1 coding (G723.1), conversion to G726 coding (G726), and conversion to MP4 format (MP4). Blind-detectability was evaluated by carrying out frame synchronization tests and bit detection tests. Confidentiality was evaluated by performing bit detection tests when the embedding process and detection process of the front-side form and back-side form were switched. Bit rates of 4, 8, and 16 bps were used for evaluation.
5.2. Evaluation Results for Inaudibility
Figure 11 shows the evaluation results of the sound quality test, namely, PESQ and LSD with respect to the bit rate, for the LSB, DSS, MC, and proposed methods. As we can see, the LSB method has the best sound quality among the four methods. The DSS method has a PESQ of less than 3 and an LSD of more than 1 dB. The PESQ for the MC method is greater than 3 up to 16 bps, and its LSD is less than 1 dB. The PESQ for the proposed method is greater than 3 up to 8 bps, and its LSD is 1 dB.
5.3. Evaluation Results for Robustness
Figure 12 shows the evaluation results for the basic robustness tests of the LSB, DSS, MC, and proposed methods, where the horizontal axis shows the bit rate and the vertical axis shows the BER of each attack process. As we can see, the LSB method is not robust (i.e., does not have a BER less than
) against any attacks except Requant-24. The DSS method is robust (BER less than
) except for G723.1. The MC method and the proposed method are robust (BER less than
) except for the Requant-8 and G723.1 encodings. Unfortunately, speech watermarking using LP residuals is fundamentally vulnerable to G726.
5.4. Evaluation Results for Blind-Detectability
Blind-detectability was evaluated by carrying out the frame synchronization test and bit detection test of the blind detection process with the frame synchronization process when the embedding-strength level was varied. In this paper, we regard blind-detectability to be achieved if the following two conditions are satisfied: (a) the BER is less than or equal to and (b) the accuracy of the frame synchronization is greater than or equal to . The magenta marker shows the result when the embedding-strength level is dB.
Figure 13 shows the results of blind-detectability, where the horizontal axis represents the accuracy of frame synchronization and the vertical axis represents BER. The gray areas indicate that two conditions can be satisfied simultaneously. We can see here that the BER decreases as the accuracy of the frame synchronization increases. The results at 4 and 8 bps show that the accuracy of the frame synchronization is greater than
, and the BER of less than
is satisfied at the embedding-strength level
of
dB.
5.5. Evaluation Results for Confidentiality
The proposed method can increase confidentiality in non-public conditions. However, if it were to be implemented in the public condition, we do not know what would happen to the confidentiality. Therefore, confidentiality was evaluated by carrying out bit detection tests when the embedding and detection processes were switched in public conditions. Confidentiality was defined as the security codes being incorrectly detected if the BER was greater than or equal to .
Figure 14 shows the evaluation results for confidentiality, where the horizontal axis shows the bit rate and the vertical axis shows the BER. We can see here that a security code embedded by the front-side form cannot be detected correctly by the back-side form. Likewise, a security code embedded by the back-side form could not be detected correctly by the front-side form. If a security code is embedded in the front-side form, the back-side form cannot properly frame-synchronize because the LP residue used in the front-side and back-side forms is different. Therefore, if embedding and detection are performed with different forms, frame synchronization is not possible and the BER of the security codes increases. These results demonstrate that the proposed method is confidential in public conditions thanks to the front-side and back-side forms.
5.6. Performance Limitations
The proposed method has two limitations: (i) the accuracy of frame synchronization due to the number of reference frames depending on the signal length and (ii) limited embedding capacity due to the frame length of the LP analysis.
For (i), the evaluation in Section 4.2 shows that the accuracy of frame synchronization depends on the number of reference frames. At least 128 reference frames are required for frame synchronization accuracy to exceed
, which means the frame window length of 20 ms would require at least
s for the host signal.
For (ii), we used a frame window length of 20 ms for the proposed method as the commonly used analysis length for LP analysis. In that case, the number of frames per second is 50. Since the proposed method embeds one bit in two frames, the embedding capacity is limited to 25 bps. Generally, the bit rate of 4 bps is sufficient for speech watermarking. When speech watermarking is applied to tampering detection and spoofing detection, the precision of the tampering detection and spoofing detection depends on the bit rate. If the precision of is required, a bit rate of at least 16 bps is needed.
These limitations are due to the fact that the frame length is 20 ms. Reducing the frame length to shorter than 20 ms should lead to an increase in the number of frames, which in turn will lead to an increase in the embedding capacity and the number of reference frames.
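The capacity arithmetic in (ii) can be reproduced with a few lines. The function names are hypothetical, and the only inputs are the frame length and the one-bit-per-two-frames rule stated above.

```python
def frames_per_second(frame_ms):
    """Number of non-overlapping analysis frames in one second of signal."""
    return 1000.0 / frame_ms

def capacity_bps(frame_ms, frames_per_bit=2):
    """Upper bound on the bit rate when one bit occupies `frames_per_bit` frames."""
    return frames_per_second(frame_ms) / frames_per_bit

for frame_ms in (20, 10, 5):
    print(frame_ms, "ms frames:", frames_per_second(frame_ms), "frames/s,",
          capacity_bps(frame_ms), "bps at most")
```

With 20-ms frames this gives 50 frames per second and an upper bound of 25 bps; halving the frame length doubles both figures.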