1. Introduction
The MPEG-H 3D Audio (3DA) standard [1] aims to provide immersive 3D audio for high-resolution UHDTVs. It comprises state-of-the-art technologies that support high-efficiency compression and transmission of channel-based, object-based, SAOC (spatial audio object coding), and HOA (higher-order ambisonics) audio formats, as well as high-quality rendering on various loudspeaker layouts. Recently, the MPEG-H 3DA system was adopted in TV broadcast standards, e.g., ATSC 3.0 [2] and DVB [3], and in the virtual reality (VR) profile of the 3GPP mobile standard [4].
The MPEG-H decoder can render each audio format on various playback layouts using the format converter, object renderer, SAOC renderer, and HOA renderer. In addition, the standard includes a low-complexity, high-quality binaural renderer in response to the growing number of mobile-device use cases. Another important feature of MPEG-H 3DA is interactivity, i.e., the listener can manipulate each audio scene to the extent allowed by the content creator. A typical use case is to render audio based on the three degrees of freedom of the head (i.e., yaw, pitch, and roll rotation). With these advantages, the MPEG-H 3DA decoder low complexity (LC) profile was selected as a 3DoF (degrees of freedom)/3DoF+ rendering technology for the MPEG-I audio standard for augmented reality (AR) and VR.
MPEG-H 3DA performs 3DoF rendering of object and channel signals [1] using virtual loudspeaker rendering (VLR) [5]. In VLR, object signals are first converted to virtual channel (loudspeaker) signals using panning techniques, such as 3D vector base amplitude panning (VBAP) [6] and multiple-direction amplitude panning (MDAP) [7]. The virtual channel signals are filtered with low-resolution binaural room transfer functions (BRTFs) or head-related transfer functions (HRTFs) and then downmixed to generate the binaural signal. During this binaural downmixing, various speaker-to-ear transfer functions corresponding to different acoustic paths are mixed, resulting in the so-called comb filter effect [8]. The primary artifacts of the comb filter effect are spectral coloration and volume degradation, both of which can cause significant inaccuracy in the spatial image position [8,9]. Therefore, to improve the audio quality of MPEG-H 3DA binaural rendering (BR), it is crucial to prevent the comb filter effect.
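The comb filter effect described above can be reproduced with a minimal numerical sketch (illustrative only, not part of the standard): summing a direct path with a delayed copy, as happens when two speaker-to-ear transfer functions with different acoustic path lengths are downmixed, carves periodic notches into the magnitude response.

```python
import numpy as np

def comb_magnitude(delay_samples, n_fft=512):
    """Magnitude response of summing a direct path with a delayed copy.

    Models the downmix of two speaker-to-ear paths whose acoustic path
    lengths differ by `delay_samples` samples (equal amplitudes assumed).
    """
    h = np.zeros(n_fft)
    h[0] = 0.5              # first virtual-loudspeaker path
    h[delay_samples] = 0.5  # second path, arriving later
    return np.abs(np.fft.rfft(h))

mag = comb_magnitude(delay_samples=8)
# In-phase bins sum constructively toward 1.0, while deep notches appear
# where the two paths cancel, i.e., at odd multiples of fs / (2 * delay).
```

The notch depths and positions depend on the delay and amplitude mismatch between paths, which is why the resulting coloration varies with source direction.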
For channel signals, the MPEG-H 3DA decoder performs BR after 'objectizing' them (i.e., regarding each channel signal as an object signal located at its loudspeaker position). In general, high-quality channel sources embed 'artistic intentions', such as EQ, which may mask the comb filter artifacts. However, artistic intention alone cannot overcome the comb filter effect in every case of 3DoF rendering of channel sources.
This paper presents a rigorous method to improve the sound quality of MPEG-H 3DA BR. We first propose an efficient gain normalization algorithm that compensates for the spectral artifacts caused by the comb filter effect. Previous studies on this issue mainly focused on the panning algorithm. In [7,10], the panning gain ratio was adjusted to improve sound image localization. In [8,9], panning gain normalization was used to reduce coloration. However, these methods were developed assuming loudspeaker-based listening environments, in which the binaural transfer functions are not precisely considered; hence, they only partially prevent the comb filter effect when used for VLR. A baseline solution for this spectral coloration and distortion is active downmixing [11]. Unfortunately, active downmixing does not work when more than one sound object resides in a single processing band, which is common in many audio signals.
The proposed algorithm performs spectral compensation in both the panning gain and downmix signal domains. To implement frequency-dependent compensation efficiently within the standard, the MPEG-H 3DA frequency-domain object renderer and binaural renderer, which operate in the 64-band complex quadrature mirror filter (QMF) domain [12], are used. Panning gain compensation is performed in the low-frequency bands, where the bandwidth of a QMF band is wider than the critical bandwidth, and a signal compensation similar to active downmixing is performed in the high-frequency bands, where a QMF band is narrower than the critical bandwidth. In this way, the proposed algorithm balances complexity against performance within the framework of MPEG-H 3DA.
This paper is organized as follows. Section 2 reviews the MPEG-H 3DA decoder and BR and demonstrates the spectral artifacts in the MPEG-H 3DA BR. Section 3 proposes the binaural gain normalization and its implementation structure on the MPEG-H 3DA BR. Section 4 describes the implementation and analyzes its complexity. Section 5 presents experimental results, including both objective and subjective tests, and Section 6 concludes the paper.
2. Spectral Distortions in the MPEG-H 3DA BR
Figure 1 illustrates the core decoding and rendering parts of MPEG-H 3DA. The dynamic range compression (DRC) and HOA rendering blocks are omitted from the figure for brevity. The core decoder of MPEG-H 3DA converts a compressed bit-stream into waveforms (e.g., channel-based or object-based audio) and associated metadata (e.g., the positions of objects and the loudspeaker geometry). During the rendering stage, the associated metadata are processed by the scene-displacement interface, which calculates and applies a rotation matrix to update the positions of audio objects and channels according to user interaction (e.g., the user's yaw, pitch, and roll movements). Using 3D VBAP, the object renderer then takes the object and channel signals and the updated positions to produce virtual loudspeaker signals. Finally, the binaural renderer generates a binaural signal by filtering each channel signal with the BRTF/HRTF that corresponds to its virtual loudspeaker position [13].
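The scene-displacement step above can be sketched as follows. This is an illustrative yaw-pitch-roll rotation in Python; the axis conventions, rotation order, and interface names here are our assumptions, not the standard's exact definitions.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Compose a 3D rotation from yaw (about z), pitch (about y), and
    roll (about x), all in radians. Illustrative only; the exact
    conventions of the MPEG-H scene-displacement interface are
    defined in the standard."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# A 90-degree yaw moves a frontal object to the listener's side,
# after which the object renderer re-pans it onto the virtual layout.
front = np.array([1.0, 0.0, 0.0])
rotated = rotation_matrix(np.pi / 2, 0.0, 0.0) @ front
```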
Figure 2 shows a more detailed schematic of the VLR-based BR in MPEG-H 3DA. For a given loudspeaker layout and head rotation information, the VLR system generates channel signals for all object signals. Following 3D VBAP, the i-th channel signal $c_i[l,k]$ for input object signals $o_m[l,k]$ is obtained as

$$c_i[l,k] = \sum_{m=1}^{M} \frac{g_{i,m}}{\|\mathbf{g}_m\|_p}\, o_m[l,k], \quad i = 1, \ldots, N, \tag{1}$$

where $l$ and $k$ are the frame and frequency indices, respectively, $\|\mathbf{g}_m\|_p$ represents the $p$-norm of the panning gain vector $\mathbf{g}_m = [g_{1,m}, \ldots, g_{N,m}]^{T}$, and $M$ and $N$ denote the total numbers of input sound objects and virtual loudspeakers, respectively. The normalized panning gain $\tilde{g}_{i,m} = g_{i,m}/\|\mathbf{g}_m\|_p$ maintains a constant loudness regardless of the panning direction. The norm order $p$ can be either 1 or 2, depending on the coherence between the binaural filters [8]. The ear signals are then obtained by filtering the channel signals with the BRTFs/HRTFs that correspond to the virtual loudspeaker locations. The binaural signal $y^{L,R}[l,k]$ in Figure 2 is obtained by summing the ear signals as follows, where the frequency-transformed BRTFs/HRTFs are denoted as $h_i^{L,R}[k]$:

$$y^{L,R}[l,k] = \sum_{i=1}^{N} h_i^{L,R}[k]\, c_i[l,k]. \tag{2}$$
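The $p$-norm gain normalization of Equation (1) can be sketched as follows; `normalize_gains` is a hypothetical helper name, not part of the reference software.

```python
import numpy as np

def normalize_gains(g, p=2):
    """Normalize a panning-gain vector by its p-norm (p = 1 or 2), so the
    summed amplitude (p = 1) or total power (p = 2) of the panned source
    stays constant regardless of the panning direction."""
    g = np.asarray(g, dtype=float)
    return g / np.linalg.norm(g, ord=p)

# Two panning directions with very different raw gains...
g_a = normalize_gains([0.9, 0.1], p=2)
g_b = normalize_gains([0.5, 0.5], p=2)
# ...end up with identical total power after normalization.
```

The choice p = 2 preserves power, which is appropriate when the binaural filters are mutually incoherent; p = 1 preserves amplitude for coherent summation.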
The transfer function of the MPEG-H 3DA BR system between the m-th sound object and the binaural signal, defined as $H_m^{L,R}[k] = y^{L,R}[k]/o_m[k]$, can be expressed as

$$H_m^{L,R}[k] = \sum_{i=1}^{N} h_i^{L,R}[k]\, \tilde{g}_{i,m} = \left(\mathbf{h}^{L,R}[k]\right)^{T} \tilde{\mathbf{g}}_m, \tag{3}$$

where $\mathbf{h}^{L,R}[k] = [h_1^{L,R}[k], \ldots, h_N^{L,R}[k]]^{T}$ and $\tilde{\mathbf{g}}_m = \mathbf{g}_m/\|\mathbf{g}_m\|_p$. Here, for simplicity, we omit the frame index $l$ and assume a unit-amplitude object signal, $o_m[k] = 1$.
The comb filter effect occurs during the summing of the ear signals in Equation (2). The effect can be illustrated by calculating the transfer function between an object signal and the left ear signal, i.e., $H_m^{L}[k]$, following the MPEG-H 3DA BR standard [1], as in Figure 3. During the calculation, we assume that a source located to the left of the frontal direction is rendered using a pair of virtual loudspeakers surrounding the source direction. The diffuse-field equalized (DFE) MIT HRTF set [14] is used for the virtual loudspeakers, and p is set to 2 for the gain normalization.
Compared to the true HRTF, the transfer function in Figure 3 shows two main artifacts: broad spectral notches near 3.3 kHz and 12 kHz, and a broad spectral boost below 1.5 kHz. These distortions are commonly observed in practical loudspeaker setups and directly degrade the perceptual quality of the downmixed sound.
A simple solution is active downmixing [11], which compensates for the spectral distortions during the downmixing of the ear signals. However, active downmixing applies a gain to the downmixed binaural signal rather than to each individual object signal. Consequently, unwanted spectral distortion may occur when multiple objects reside in the same processing band. Therefore, a rigorous approach that solves the issue with minimal artifacts and low computational complexity is required for a standard technology.
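The idea and its limitation can be illustrated with a toy single-band energy-matching gain (a sketch of the general principle behind active downmixing, with hypothetical names, not the exact algorithm of [11]):

```python
import numpy as np

def active_downmix_gain(target_band, downmix_band):
    """Single corrective gain for one band: match the energy of the passive
    downmix to the energy of the ideal (distortion-free) target."""
    num = np.sum(np.abs(target_band) ** 2)
    den = np.sum(np.abs(downmix_band) ** 2) + 1e-12  # avoid divide-by-zero
    return float(np.sqrt(num / den))

# One object in the band: the comb attenuation is uniform, so a single
# gain restores the target exactly.
target = np.array([1.0, 0.8])
downmix = np.array([0.5, 0.4])  # uniformly comb-attenuated target
g = active_downmix_gain(target, downmix)
# g * downmix recovers the target bin-by-bin here. With two objects whose
# distortions differ (one bin needing x2, another x0.5), the single gain
# can only be a compromise, which is the distortion discussed above.
```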
3. Proposed Spectral Compensation for the MPEG-H 3DA BR
In this section, we propose two methods for preventing the spectral artifacts caused by the comb filter effect. The first, panning gain compensation (PGC), compensates the panning gains of each object and thereby suppresses the spectral notches and boosts that would otherwise occur during downmixing. PGC can be computationally heavy, as the left- and right-ear signals of each object must be compensated separately. The second, binaural spectral compensation (BSC), is proposed to reduce the computational complexity of PGC. The two compensation methods are then applied in different QMF bands to balance system complexity against performance.
3.1. Panning Gain Compensation (PGC)
Ideally, the transfer function between a sound object and the listener's ear is distortionless if it equals the true HRTF corresponding to the virtual source position. In practice, however, the BR system has HRTFs only at sparse locations given by the pre-defined loudspeaker layout. In such cases, it is still possible to approximate the true HRTF using geometric interpolation. A previous study [15] showed that the HRTF magnitude at a target location can be estimated by interpolating the magnitudes of the neighboring HRTFs surrounding that location, and that the interpolation weights can be approximated by 3D VBAP when the virtual source is in the far field.
Inspired by this, we approximate the magnitude of the ideal HRTF as

$$\left|\hat{h}_m^{L,R}[k]\right| = \left\| \tilde{\mathbf{g}}_m \odot \left|\mathbf{h}^{L,R}[k]\right| \right\|_p, \tag{4}$$

where ⊙ denotes element-wise multiplication. It is important to note that the norm order p in Equations (1) and (4) should be identical to obtain a smooth interpolation. To validate the magnitude approximation of Equation (4), we measure the HRTF magnitude of the 5th subband of the 64-band QMF in MPEG-H 3DA operating on a 22.2-channel loudspeaker layout [16]. The results obtained over the full azimuth range at a fixed elevation angle are plotted in Figure 4. Since the 22.2-channel system has more loudspeakers on the frontal hemisphere than on the back hemisphere, the approximation accuracy is expected to be higher in the frontal region. Regarding the norm order, in our experiments we choose p = 2, since it provides a smoother approximation of the target HRTF than other values, as shown in Figure 4.
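Per frequency bin, Equation (4) amounts to the following computation; the helper name and numeric values are illustrative.

```python
import numpy as np

def interp_hrtf_mag(g_norm, hrtf_mags, p=2):
    """Approximate the HRTF magnitude at a panned direction as the p-norm
    of the element-wise product of the normalized panning gains and the
    HRTF magnitudes of the neighboring loudspeakers (one frequency bin)."""
    prod = np.asarray(g_norm) * np.asarray(hrtf_mags)  # element-wise (⊙)
    return float(np.linalg.norm(prod, ord=p))

# When the pan collapses onto a single loudspeaker, the interpolation
# returns that loudspeaker's HRTF magnitude exactly.
mag = interp_hrtf_mag([1.0, 0.0], [0.7, 1.2], p=2)
```

Using the same norm order p as in Equation (1) keeps the interpolated magnitude consistent with the loudness normalization of the panning gains.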
We also design real-valued compensation coefficients, applied to the panning gains, to prevent the comb filter effect. Our goal is to define the real-valued coefficient $w_m^{L,R}$ that satisfies the following condition:

$$\left| w_m^{L,R}[k]\, H_m^{L,R}[k] \right| = \left| \hat{h}_m^{L,R}[k] \right|. \tag{5}$$

This simply means that the panning gains are determined by computing $w_m^{L,R}[k]\, \tilde{\mathbf{g}}_m$. However, considering the practical implementation of PGC on the MPEG-H 3DA BR working in the QMF domain, we calculate the compensation coefficient per QMF subband. Using Equations (3) and (4), we calculate the compensation coefficient of subband b as

$$w_m^{L,R}[b] = \sqrt{ \frac{ \sum_{k=k_b}^{k_{b+1}-1} \left| \hat{h}_m^{L,R}[k] \right|^2 }{ \sum_{k=k_b}^{k_{b+1}-1} \left| H_m^{L,R}[k] \right|^2 } }, \tag{6}$$

where $k_b$ is the lowest fast Fourier transform (FFT) bin belonging to the frequency range of the b-th QMF subband.
Equivalently, based on Equation (3), we apply the compensation gains directly to the panning gains as

$$\tilde{g}_{i,m}^{L,R}[b] = w_m^{L,R}[b]\, \tilde{g}_{i,m}. \tag{7}$$

Finally, the channel signals for the left and right ears are calculated as

$$c_i^{L,R}[l,k] = \sum_{m=1}^{M} \tilde{g}_{i,m}^{L,R}[b]\, o_m[l,k], \quad k \in b. \tag{8}$$

The compensation of $\tilde{g}_{i,m}$ using Equation (5) means that the loudness of each subband in the downmixed signal is restored to that of the target signal. In other words, the comb filter effect is prevented by the proposed compensation of the panning gains.
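A minimal sketch of the per-subband PGC coefficient of Equation (6) and its application to the panning gains as in Equation (7); function names and numeric values are illustrative, not the reference implementation.

```python
import numpy as np

def pgc_coefficient(target_mags, rendered_mags):
    """Per-subband PGC coefficient (cf. Equation (6)): square root of the
    ratio between the target (interpolated HRTF) energy and the rendered
    transfer-function energy over the FFT bins of one QMF subband."""
    num = np.sum(np.asarray(target_mags) ** 2)
    den = np.sum(np.asarray(rendered_mags) ** 2) + 1e-12
    return float(np.sqrt(num / den))

# A subband where comb filtering halved every bin gets a coefficient of 2,
# which is then applied to the normalized panning gains (cf. Equation (7)).
w = pgc_coefficient([0.8, 0.6], [0.4, 0.3])
g_comp = w * np.array([0.6, 0.8])  # compensated panning gains
```

Because the coefficient is computed per object and per ear, each object's subband loudness is restored individually, unlike a single downmix-domain gain.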
3.2. Binaural Spectral Compensation (BSC)
In the previous section, PGC was proposed as an effective solution for preventing the comb filter effect. However, in PGC, every panning gain of a channel signal must be compensated separately for the left and right ears, which doubles the number of convolution operations, as in Equation (8).
An alternative approach is binaural spectral compensation (BSC), in which the downmixed signal is compensated directly as

$$\hat{y}^{L,R}[l,k] = w^{L,R}[l,k]\, y^{L,R}[l,k], \tag{9}$$

where $w^{L,R}[l,k]$ is a real-valued gain designed to compensate for the spectral notches and boosts due to the comb filter effect. The downmixed ear signal is given by

$$y^{L,R}[l,k] = \left(\mathbf{h}^{L,R}[k]\right)^{T} \mathbf{c}[l,k], \tag{10}$$

where $\mathbf{c}[l,k] = [c_1[l,k], \ldots, c_N[l,k]]^{T}$ denotes a vector comprising the channel signals. Additionally, using Equation (4), the ideal downmixed signal without spectral artifacts is obtained as

$$\bar{y}^{L,R}[l,k] = \sum_{m=1}^{M} \left| \hat{h}_m^{L,R}[k] \right| o_m[l,k]. \tag{11}$$

To reduce unnecessary temporal fluctuation, the compensation gain for QMF subband b of the MPEG-H 3DA BR can be obtained as

$$w^{L,R}[l,b] = \mathcal{E}\left\{ \frac{ \sum_{k=k_b}^{k_{b+1}-1} \left| \bar{y}^{L,R}[l,k] \right| }{ \sum_{k=k_b}^{k_{b+1}-1} \left| y^{L,R}[l,k] \right| } \right\}, \tag{12}$$

where $\mathcal{E}\{\cdot\}$ denotes a time-smoothing operator conveniently implemented using a first-order IIR recursive filter.
Finally, a spectrally compensated binaural signal is obtained as

$$\hat{y}^{L,R}[l,k] = w^{L,R}[l,b]\, y^{L,R}[l,k], \quad k \in b. \tag{13}$$

From the implementation perspective, calculating a non-integer norm is highly complex. Therefore, we use the norm order p = 2 for ease of implementation, which was also validated from the performance perspective in the previous section. However, even with p = 2, calculating the square root for each frequency bin k is a significant burden for the rendering processor. To circumvent this problem, we further modify the numerator and denominator of Equation (12) as

$$w^{L,R}[l,b] = \mathcal{E}\left\{ \sqrt{ \frac{ \sum_{k=k_b}^{k_{b+1}-1} \left| \bar{y}^{L,R}[l,k] \right|^2 }{ \sum_{k=k_b}^{k_{b+1}-1} \left| y^{L,R}[l,k] \right|^2 } } \right\}. \tag{14}$$

In this way, the square root is computed only once per QMF band b, significantly reducing the computational complexity. Our experiments show that the computational accuracy remains within 95% of the original value as long as a sufficient number of frequency bins (>10) are included.
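The band-wise BSC gain with first-order recursive smoothing can be sketched as follows; the function name and the smoothing constant `alpha` are illustrative, not standardized values.

```python
import numpy as np

def bsc_gain(ideal_band, downmix_band, prev_gain, alpha=0.6):
    """Band-wise BSC gain: square root of the energy ratio between the
    ideal and the actual downmix over one QMF band (a single square root
    per band), followed by first-order IIR smoothing over time."""
    num = np.sum(np.abs(ideal_band) ** 2)
    den = np.sum(np.abs(downmix_band) ** 2) + 1e-12  # avoid divide-by-zero
    g = np.sqrt(num / den)
    return float(alpha * prev_gain + (1.0 - alpha) * g)

# The smoothed gain moves toward the instantaneous energy ratio (2.0 here)
# without jumping, which reduces audible temporal fluctuation.
g1 = bsc_gain(np.ones(12), 0.5 * np.ones(12), prev_gain=1.0)
```

Summing magnitude-squared bins first (real and imaginary parts squared) avoids a per-bin square root, which is the complexity saving of Equation (14) over Equation (12).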
3.3. Combination of PGC and BSC
The downside of BSC is the discrepancy between the rendered signal and the target signal when more than one acoustic object resides in a single band. In such cases, the BSC gain is likely to be incorrect for every object. In particular, when two objects lie in different critical bands belonging to the same QMF band b, experiments show that the artifacts can become audible. Therefore, although BSC is computationally much simpler than PGC, in our implementation it is applied only to QMF bands whose bandwidth is comparable to or narrower than the critical bandwidth.
It can be noted that the BSC gain in Equation (12) is equivalent to the active downmixing gain in [11] under certain conditions. The purpose of developing BSC in this paper, however, is to balance the increase in computational complexity against the deterioration of the sound quality of MPEG-H 3DA BR within a unified framework comprising two different compensation strategies.
4. Implementation and Complexity
The proposed algorithm was implemented on the frequency-domain binaural renderer (FD-BR) of the MPEG-H 3DA reference software. Figure 5 shows block diagrams of the MPEG-H 3DA BR comprising the PGC and BSC blocks. As illustrated, PGC is applied to the output of the object renderer (amplitude panning) before the binaural rendering block, while BSC is applied to the downmixed binaural signal. In the figure, $r$ denotes the QMF time slot, and $l$ and $r'$ denote the index of the short-time segmented frame and the QMF time slot index within the segmented frame, respectively, i.e., $r = lR + r'$, where $R$ is the size of the short-time frame. For the implementation, the compensated panning gains for PGC, i.e., $\tilde{g}_{i,m}^{L,R}[b]$ in Equation (7), are calculated for all azimuth and elevation angles at 1-degree intervals and stored in a lookup table. The system is designed to refresh the panning gains at every time slot of the QMF subband.
PGC and BSC are selectively used in different QMF bands according to the bandwidths of the critical band and the processing (QMF) band, as described in Section 3.3. The bands below 6 kHz are compensated using PGC. For the bands below 750 Hz, the PGC gains of the left and right ears are averaged and applied to both ear signals, because the phase difference between the left- and right-ear HRTFs is relatively insignificant there. BSC is employed for the bands above 6 kHz, where the critical bandwidth is wide enough to cover multiple QMF bands. To avoid artifacts, we limit the magnitude of the BSC gain to within −4.8 dB∼4.8 dB, and the time-smoothing operator $\mathcal{E}\{\cdot\}$ is implemented using a first-order IIR low-pass filter with a time constant of 10 ms.
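The band-splitting and gain-limiting rules described above can be summarized in code; the function names are ours, and keying the decision on a band-center frequency is an assumption of this sketch.

```python
import numpy as np

def select_method(band_center_hz):
    """Band-wise method selection as described in the implementation:
    PGC with left/right-averaged gains below 750 Hz, per-ear PGC up to
    6 kHz, and BSC above 6 kHz."""
    if band_center_hz < 750.0:
        return "PGC-averaged"
    if band_center_hz < 6000.0:
        return "PGC"
    return "BSC"

def limit_gain_db(gain, limit_db=4.8):
    """Clamp a linear BSC gain to the +/-4.8 dB range to avoid artifacts."""
    lim = 10.0 ** (limit_db / 20.0)
    return float(np.clip(gain, 1.0 / lim, lim))
```

Limiting the gain keeps an occasional near-zero downmix band (deep comb notch) from being boosted into an audible artifact.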
Table 1 and Table 2 summarize the computational complexities of the PGC and BSC blocks in the FD-BR in units of MOPS (million operations per second). In the tables, $M$ is the number of object signals as in Equation (1), and $\bar{N}$ is the average number of loudspeakers involved in the amplitude panning, i.e., $\bar{N} = 3$ for 3D VBAP. $N$ is the total number of virtual loudspeakers, and $K$ is the FFT size. The weight in the rightmost column of the tables is based on [17], and $F$ accounts for the number of real operations of the butterfly in the FFT algorithm [13]. $B$ is the number of subbands over which the FD-BR is performed, and $f_r$ is the number of FFT frames per second, determined by the sampling frequency $f_s$.
Using Table 1 and Table 2 [13], the worst-case MOPS of the MPEG-H 3DA BR can be counted for HRTF-based virtual loudspeaker rendering, with and without the proposed compensation method. For the calculation, we choose MPEG-H 3DA low complexity (LC) profile level 4, which corresponds to virtual loudspeaker rendering of 28 objects on the 22.2-channel layout, i.e., M = 28 and N = 24, at a sampling rate of $f_s$ = 48 kHz, with the remaining parameters set according to a typical operating condition of the FD-BR using HRTFs. The counted MOPS of the FD-BR are then compared for the cases with and without the proposed compensation algorithm.
We can estimate the contribution of the proposed method to the total computational cost of the MPEG-H 3DA decoding and binaural rendering process. The numbers are obtained under the condition that each QMF band selectively employs either PGC or BSC, as explained previously. Under this condition, the proposed compensation algorithm takes about 3.1% and 16.3% of the worst-case complexity of the entire MPEG-H 3DA decoding/rendering process [1] and of the MPEG-H 3DA BR, respectively. This means that the increase in complexity incurred by the proposed method is marginal. In fact, a computational growth of about 10 MOPS implies that the proposed method is affordable even for low-power mobile devices.