1. Introduction
While multichannel loudspeaker systems for home and car entertainment are becoming increasingly popular, the number of available multichannel audio recordings is still limited (in the following, this term covers both multichannel recordings and multichannel mixes). In contrast to movies on DVD or Blu-ray, the majority of audio recordings are available only in the two-channel stereo format. The same holds for the main content of digital radio and television and for the increasingly significant streaming services for music and movies. Multichannel audio content is therefore scarce, which makes a system worthwhile that extends an original stereo audio signal for playback over a multichannel loudspeaker system. Such a system can enhance the spatial quality and the listening experience compared with pure stereo playback. For this reason, there is wide scholarly interest in novel stereo-to-multichannel upmix algorithms, e.g., [1, 2, 3, 4, 5], and many more recent publications address this topic. To determine the quality of such upmix algorithms, subjective listening tests have typically been used [6, 7, 8]. Usher [6] presented specific design criteria for stereo-to-multichannel upmix algorithms to enhance spatial sound quality. He used formal listening tests for subjective evaluation of the design criteria according to [9], where three general sound quality issues for the evaluation of multichannel audio systems were defined. Choisel and Wickelmaier [7] used eight selected spatial attributes for sound quality evaluation in order to compare upmix algorithms. They derived a set of objective measures from sound field analysis to predict auditory attributes. Barry and Kearney [8] used subjective listening tests for the assessment of source separation-based upmixing algorithms. In addition, they used objective testing to measure the errors which could theoretically occur in source separation algorithms.
Formal listening tests have so far been the only established approach to assessing the quality of stereo-to-multichannel upmix algorithms. However, subjective quality assessments are both time-consuming and expensive. An objective evaluation technique for stereo upmix algorithms for spatial audio is therefore desirable.
2. Objective Evaluation Test
For the objective evaluation test, the following assumptions about stereo-to-multichannel upmix algorithms are made. Upmix algorithms should enhance and extend the listening experience without adding artificial effects or content, and should reproduce virtual sound sources true to the original. In the original stereo configuration, the virtual sound sources are placed between the two (front) loudspeakers. It is therefore assumed that upmix algorithms are designed to place virtual sound sources only between the front loudspeakers and not in the rear, so the remaining amount of direct signal in the surround channels should be as low as possible. Furthermore, upmix algorithms should provide the listener with front channels that are louder than the surround channels, provided that the stereo input signal always contains a virtual sound source. In addition, it is assumed that the effects of the correlation of the surround channels are perceived subjectively, and that the surround channels should have a certain degree of correlation to prevent uncomfortable perception. Finally, upmix algorithms should create a high subjectively perceived spatial quality with all loudspeakers in order to enhance the listening experience.
For the objective evaluation of stereo-to-multichannel upmix algorithms, the following five tests were defined: (1) panning test, (2) direct signal test, (3) volume test, (4) phase test and (5) perception test. In each test, a special test signal is used as input signal for the tested upmix. The generated output signals are then analyzed and evaluated according to defined criteria (see Figure 1). The evaluation test thus measures how well the assumptions above are met according to these criteria.
The overall evaluation score of a tested upmix is obtained by combining the single-test evaluation scores with their respective weighting factors.
The higher a score, the better the test result; zero is the worst and one the best evaluation score. To illustrate how results can be visualized, two upmix algorithms available on the market are used, each with two modes for music (a) and movies (b). In the following, they are denoted as upmix 1(a), 1(b), 2(a) and 2(b).
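The combination of single-test scores described above can be sketched as a weighted mean. This is only an illustration: the function name `overall_score`, the normalization by the sum of the weights and the example values are assumptions, not the formula or the results of this paper.

```python
def overall_score(scores, weights):
    """Weighted mean of single-test scores, each assumed to lie in [0, 1].

    Dividing by the sum of the weights (an assumption) keeps the result
    in [0, 1] for any set of non-negative weights.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Hypothetical scores in the order: panning, direct signal, volume,
# phase, perception; equal weights are chosen only for illustration.
example = overall_score([0.8, 0.6, 0.9, 0.5, 0.7], [1, 1, 1, 1, 1])
```

Setting a weight to zero removes the corresponding test from the overall score, which is one way to adjust the significance of single tests.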
2.1. Panning Test
Criterion: The direction of the virtual sound source in the stereo-to-multichannel upmix should correspond to the direction of the virtual sound source in the initial stereo configuration. As a result, the spatial representation of sound events is preserved true to the original.
The panning test (see Figure 2) is conducted in two versions. Its evaluation score results from the evaluation scores of the time- and frequency-independent panning test and the time- and frequency-dependent panning test, each weighted with its own weighting factor.
Initially, virtual test sound sources are defined at intervals of 1° with angles from −30° to 30° according to the reference loudspeaker arrangement [10]. With the tangent law as the modified stereophonic law of sines, the two panning coefficients for the left and right channel can be calculated from the angle of the virtual sound source [11]. A signal weighted with the left panning coefficient represents the left part of a stereo signal, and a signal weighted with the right panning coefficient represents the right part. For every angle, a stereo test signal is generated as input signal for the tested upmix. This is done by multiplying the resulting panning coefficients with white Gaussian noise.
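The generation of the panned test signals can be sketched as follows. The sketch uses the common form of the tangent law, tan(φ)/tan(φ0) = (gL − gR)/(gL + gR), with φ0 = 30° and gains normalized to unit power. The sign convention (positive angles towards the left loudspeaker), the power normalization and all function names are assumptions made for illustration.

```python
import math
import random


def tangent_law_gains(angle_deg, base_angle_deg=30.0):
    """Return (g_left, g_right) for a source at angle_deg.

    Assumed convention: +base_angle_deg is the left loudspeaker.
    Gains are normalized so that g_left**2 + g_right**2 == 1.
    """
    ratio = math.tan(math.radians(angle_deg)) / math.tan(math.radians(base_angle_deg))
    norm = math.sqrt(2.0 * (1.0 + ratio * ratio))
    return (1.0 + ratio) / norm, (1.0 - ratio) / norm


def panned_noise(angle_deg, num_samples, seed=0):
    """Stereo test signal: one white Gaussian noise sequence, panned."""
    rng = random.Random(seed)
    g_left, g_right = tangent_law_gains(angle_deg)
    noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    return [g_left * n for n in noise], [g_right * n for n in noise]


# One test signal per angle from -30 to 30 degrees in 1 degree steps.
test_signals = {
    angle: panned_noise(angle, 1024, seed=angle + 30)
    for angle in range(-30, 31)
}
```

With this convention, a source at 0° yields equal gains in both channels, and a source at ±30° is reproduced by one loudspeaker only.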
From the output signals of the upmix algorithm for the three front channels (front left, center and front right), two panning coefficients are determined, and from these panning coefficients the direction of the virtual sound source in the upmix is calculated. Note that two-to-five upmix algorithms are tested, but only the three front channels are used for the panning test. The difference between the angle of the virtual sound source of the upmix and the angle defined by the test signal serves as the basis for evaluation. The score of the time- and frequency-independent panning test is calculated from the mean of the normalized absolute deviation over all test cases.
The normalization of the mean deviation is needed so that the evaluation score assumes values ranging from 0 to 1.
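A minimal sketch of such a normalized score is given below. It assumes that the score is one minus the mean absolute angular deviation divided by a normalization angle, and that the normalization angle is 60° (the largest deviation possible inside the ±30° range). Both choices, as well as the name `panning_score`, are assumptions for illustration and not the definition used in this paper.

```python
def panning_score(reference_angles, upmix_angles, max_deviation_deg=60.0):
    """Score in [0, 1]: 1 means the upmix reproduces every test angle.

    reference_angles: angles of the virtual test sound sources (degrees)
    upmix_angles: angles estimated from the front channels of the upmix
    max_deviation_deg: assumed normalization angle
    """
    if len(reference_angles) != len(upmix_angles) or not reference_angles:
        raise ValueError("need two non-empty angle lists of equal length")
    deviations = [abs(u - r) for r, u in zip(reference_angles, upmix_angles)]
    mean_normalized = sum(deviations) / (len(deviations) * max_deviation_deg)
    # Clamp in case an estimated angle lies far outside the stereo base.
    return max(0.0, 1.0 - mean_normalized)


reference = list(range(-30, 31))
perfect = panning_score(reference, reference)
collapsed = panning_score(reference, [0] * len(reference))
```

Here `perfect` describes an upmix that preserves every direction, while `collapsed` describes an upmix that moves every source to the center.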
In the case of the time- and frequency-dependent panning test, the angle of the virtual test sound source is randomly generated in the range of −30° to 30°. This is done for every time and frequency index in a time–frequency representation obtained with a short-time Fourier transform (STFT). In this way, the ability of the upmix to respond to fast changes of the virtual sound source is tested. The score is calculated from the mean of the normalized absolute deviation across all times and frequencies, normalized with the maximum absolute value of all angles.
The evaluation allows the comparison of the angle of the virtual sound source of the stereo input signal with the angle of the virtual sound source of the multichannel output signal (see Figure 3).
Upmix 1: In mode (a), the more the angle of the virtual sound source in the stereo configuration diverges from the center, the larger the discrepancies in the multichannel configuration. As a consequence, the spatial extent of the initial stereo configuration is in parts significantly reduced, and the majority of sound events is perceived from a small spot around the center. In mode (b), the direction of the virtual sound source in the multichannel configuration does not even roughly follow the direction of the virtual sound source in the stereo configuration. This is because mode (b) aims to enhance speech intelligibility: only a small range around the center speaker is emphasized, and parts just beyond this range are already located considerably further away.
Upmix 2: In mode (a), the angle of the virtual sound source in the multichannel configuration roughly follows the angle of the virtual sound source in the stereo configuration. The more the stereo angle diverges from the center, the larger the discrepancies in the multichannel configuration, until the multichannel angle quickly converges towards its limiting value. The spatial extent is largely preserved, but the majority of the virtual sound sources is located closer to the center speaker. In mode (b), the spatial extent is also largely preserved, but within a certain range around the center all sound events are located in a single direction. To enhance speech intelligibility, sound events in the center are emphasized, because parts just beyond this range are located considerably further away.
2.2. Direct Signal Test
Criterion: The remaining amount of direct signal in the surround channels of the stereo-to-multichannel upmix could result in undesired virtual sound sources, which could interfere with a spatial representation of sound events that is true to the original. Although it could lead to a higher subjectively perceived spatial quality, a remaining amount of direct signal in the surround channels would contradict the assumptions made for upmix algorithms and should therefore be as low as possible.
Again, a special test signal is defined and used as input signal for the tested upmix algorithm. Different direct signals were taken from the database MedleyDB [12]. These audio recordings were then convolved with room impulse responses and mixed to a test signal (see Appendix A). The procedure of the direct signal test is shown in Figure 4.
The generated surround channels (surround left and surround right) of the upmix are analyzed by determining their remaining amount of direct signal. This amount is compared with the known direct signal of the test signal and serves as the basis for evaluation. The score of the direct signal test is calculated from the mean of the quotient of the spectral envelopes [13] of the two amounts of direct signal, across all times and frequencies, representing their relative deviation. To ensure a comparative evaluation, the summed power of the extracted surround channels must be equivalent to the summed power of the input signals, because signals before and after the upmixing process are compared. Only normalized surround signals make a comparative evaluation possible. Among other things, this ensures that surround signals which are identical to the input signals but reduced in power are treated correctly, because they have the same relative amount of direct signal. The evaluation allows the comparison of the remaining surround channel direct signal with the known direct signal of the test signal (see Figure 5).
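The power normalization of the surround channels described above can be sketched as follows. The sketch applies one common gain to both surround channels so that their summed mean power matches that of the stereo input. Using a single broadband gain and the function names are assumptions; the estimation of the direct signal and of the spectral envelopes is not shown.

```python
import math


def mean_power(signal):
    """Mean squared value of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def normalize_surround_power(surround_left, surround_right, input_left, input_right):
    """Scale both surround channels with one gain so that their summed
    power equals the summed power of the stereo input channels."""
    target = mean_power(input_left) + mean_power(input_right)
    current = mean_power(surround_left) + mean_power(surround_right)
    if current == 0.0:
        raise ValueError("surround channels are silent")
    gain = math.sqrt(target / current)
    return [gain * x for x in surround_left], [gain * x for x in surround_right]


# Surround channels that equal the input but are reduced in power ...
stereo_in = [1.0, -1.0, 1.0, -1.0]
attenuated = [0.1 * x for x in stereo_in]
# ... are restored to the input level before any comparison is made.
norm_left, norm_right = normalize_surround_power(attenuated, attenuated, stereo_in, stereo_in)
```

After normalization, the attenuated copies are again equal to the input signals, so they would be assigned the same relative amount of direct signal as the input.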
Upmix 1: Mode (a) contains a reduced remaining direct signal in the surround channels. Mode (b) contains a remaining direct signal which is slightly lower or greater.
Upmix 2: The remaining direct signals are almost identical in modes (a) and (b), but slightly increased relative to the direct signal of the test signal. Such proportionally increased remaining surround channel direct signals can occur because of input signal level adjustment or positive feedback of the upmix output signals. It should be noted that other upmix algorithms could also exhibit clearly reduced remaining surround channel direct signals.
2.3. Volume Test
Criterion: The power and loudness of the surround channels of the stereo-to-multichannel upmix should not be greater than those of the front channels. Otherwise, an unnatural or unexpected spatial sound could occur, because the surround sound to the side of or behind the listener would be perceived as louder than the sound events in front of the listener.
The volume test is therefore subdivided into the power test and the loudness test. The evaluation score of the volume test results from the evaluation scores of the power test and the loudness test, each weighted with its own weighting factor.
Five-second-long extracts were taken from each of twelve popular pieces of music from various genres to create a sixty-second-long test signal (see Appendix B, Table A3).
2.3.1. Power Test
The procedure of the power test is shown in Figure 6.
The defined stereo test signal is used as input signal for the tested upmix. The powers of the five generated upmix output signals (front left, center, front right, surround left and surround right) are considered. The maximum power of the three front signals is compared with the power of each of the two surround signals, and the resulting ratios serve as the basis for evaluation. The evaluation scores of the left and right surround signal are calculated from the means of the quotients across all times in which the power of the respective surround signal is greater than the power of the front channels.
These evaluation scores represent the relative deviations of the power of the left and right surround signal, respectively, from the maximum power of the three front signals. The evaluation score of the power test results from the evaluation scores of the power test for the left and right surround signal, each weighted with its own weighting factor.
The evaluation allows the comparison of front with surround channel power (see Figure 7). For reasons of clarity, only the left surround channel power is used in the following figures.
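The comparison of surround and front channel power can be sketched blockwise as follows. The sketch assumes that, for every block in which the surround power exceeds the maximum front power, the quotient of maximum front power and surround power is taken, and that the score is the mean of these quotients (or one if no such block exists). This reading, the block-based power estimate and the function names are assumptions; the exact formula of this paper is not reproduced.

```python
def block_powers(signal, block_size):
    """Mean power of consecutive, non-overlapping blocks."""
    return [
        sum(x * x for x in signal[i:i + block_size]) / block_size
        for i in range(0, len(signal) - block_size + 1, block_size)
    ]


def surround_power_score(front_block_powers, surround_block_powers):
    """front_block_powers: three lists (front left, center, front right),
    each holding one power value per block.
    surround_block_powers: one power value per block for one surround channel.
    """
    quotients = []
    for block, p_surround in enumerate(surround_block_powers):
        p_front_max = max(channel[block] for channel in front_block_powers)
        if p_surround > p_front_max:
            # Quotient in (0, 1): the louder the surround, the lower the value.
            quotients.append(p_front_max / p_surround)
    return sum(quotients) / len(quotients) if quotients else 1.0


# Hypothetical block powers for three front channels and one surround channel.
front = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]
score_loud_surround = surround_power_score(front, [1.0, 4.0, 2.0])
score_quiet_surround = surround_power_score(front, [1.0, 1.0, 1.0])
```

In the first example only the second block violates the criterion, with a surround power twice the maximum front power.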
Upmix 1: In mode (a), the power of the left surround channel is generally lower than, and in some passages as high as, the power of the front channel with the greatest power. In mode (b), the power of the left surround channel is mostly lower than, and in some passages higher than, the power of the front channel with the greatest power. While the powers of the front channels are relatively similar in mode (a), they mostly differ considerably in mode (b), where the strong emphasis on the center channel is particularly apparent.
Upmix 2: In mode (a), the power of the left surround channel is generally lower than, and in some passages as high as, the power of the front channel with the greatest power. In mode (b), the power of the left surround channel is lower than, and in some passages as high as or slightly higher than, the power of the front channel with the greatest power. While the powers of the front channels are relatively similar in mode (a), they mostly differ considerably in mode (b), where the strong emphasis on the center channel is particularly apparent.
2.3.2. Loudness Test
The procedure of the loudness test is shown in Figure 8.
The defined stereo test signal is used as input signal for the tested upmix, and the five generated upmix output signals are considered. The loudness of the three front channels (front left, center and front right) is compared with the loudness of the two surround channels (surround left and surround right). This comparison serves as the basis for the evaluation score of the loudness test. The loudness is determined blockwise according to [14], but separately for the front channels and for the surround channels. The evaluation score is calculated from the mean of the absolute deviations across all blocks in which the loudness of the surround channels is greater than the loudness of the front channels.
The evaluation allows the comparison of the loudness of the front channels with the loudness of the surround channels (see Figure 9).
Upmix 1: In mode (a), the loudness of the surround channels is generally lower than, and in some passages similar to or slightly greater than, the loudness of the front channels. In mode (b), the loudness of the surround channels is generally lower than, and in some passages higher than, the loudness of the front channels.
Upmix 2: In mode (a) as well as in mode (b), the loudness of the surround channels is generally lower than, and in some passages similar to or slightly greater than, the loudness of the front channels.
2.4. Phase Test
Criterion: The surround channels of the stereo-to-multichannel upmix should have a certain degree of correlation to prevent uncomfortable perception. If the surround channels were completely correlated, a mono sound source would be created, which could be perceived as uncomfortable. If the surround channels were completely decorrelated, two independent sound sources would be created, which could be perceived as uncomfortable, too [15, 16, 17, 18].
The procedure of the phase test is shown in Figure 10. The test signal is identical to the test signal of the volume test and is used as input signal for the tested upmix.
The correlation degree results from the normalized cross-correlation of the generated surround left and surround right output signals of the upmix, and leads to the evaluation score of the phase test.
With the requirement that the surround channels should be neither completely correlated nor completely decorrelated, two evaluation limits were defined within which the correlation is assumed to be optimal. The tendency towards complete correlation, and thus the creation of a mono sound source, is weighted more heavily in the evaluation score than the tendency towards complete decorrelation. The evaluation allows the comparison of the correlation degrees of the surround channels (see Figure 11).
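The correlation degree of the two surround channels can be sketched as a normalized cross-correlation. The sketch evaluates the correlation at zero lag only, which is an assumption, as is the function name. The evaluation limits and the asymmetric weighting of the score are not reproduced here.

```python
import math


def correlation_degree(surround_left, surround_right):
    """Normalized cross-correlation at zero lag, in the range [-1, 1]."""
    if len(surround_left) != len(surround_right):
        raise ValueError("channels must have the same length")
    cross = sum(a * b for a, b in zip(surround_left, surround_right))
    energy_left = sum(a * a for a in surround_left)
    energy_right = sum(b * b for b in surround_right)
    if energy_left == 0.0 or energy_right == 0.0:
        raise ValueError("correlation is undefined for a silent channel")
    return cross / math.sqrt(energy_left * energy_right)


signal = [0.5, -1.0, 0.25, 2.0]
identical = correlation_degree(signal, signal)
inverted = correlation_degree(signal, [-x for x in signal])
orthogonal = correlation_degree([1.0, 0.0], [0.0, 1.0])
```

A value close to one indicates surround channels that merge into a mono source, a value close to minus one indicates phase-inverted channels, and a value close to zero indicates decorrelated channels.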
Upmix 1: In mode (a), the correlation degree of the surround channels is predominantly negative; in mode (b), it is predominantly positive. It is notable that the surround signals of one mode are a phase-inverted version of the surround signals of the other mode.
Upmix 2: In both modes, the correlation degree of the surround channels is approximately one. There is therefore the danger that the correlated surround signals merge into a mono source.
2.5. Perception Test
Criterion: The stereo-to-multichannel upmix should generate a high subjectively perceived spatial quality. As a result, the listening experience is improved compared to the initial stereo configuration, and the listener feels placed in the middle of the sound events.
The procedure of the perception test is shown in Figure 12. The test signal is identical to the test signal of the volume and phase tests and is used as input signal for the tested upmix.
The interaural cross-correlation coefficient (IACC) describes the subjectively perceived spatial quality of sound events and is a measure for apparent source width (ASW). The lateral energy fraction (LF) describes the impression of spatial quality and is also a measure for listener envelopment (LEV) [7, 19, 20]. For the determination of IACC, the five generated upmix output signals are used to create a simulated sound field on the basis of head-related impulse responses (HRIR). The binaural signals for the left and right ear result from summing, across all channels of the multichannel configuration, the generated upmix output signals convolved with the respective head-related impulse responses for the left and right ear.
IACC results from the normalized cross-correlation of the binaural signals for the left and right ear after half-wave rectification and low-pass filtering with a third-order Butterworth filter. The prefiltering ensures that the results correspond better to the subjectively perceived spatial quality [21, 22, 23].
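The construction of the binaural signals and of IACC can be written compactly as follows. The notation is assumed for illustration: x_c is the upmix output signal of channel c, h_{c,l} and h_{c,r} are the HRIRs from loudspeaker c to the left and right ear, a tilde marks the half-wave rectified and low-pass filtered signals, and the lag range τ_max (commonly 1 ms in the room acoustics literature) is an assumption, not a value taken from this paper.

```latex
% Binaural signals: sum over all channels c of the multichannel configuration
b_l(t) = \sum_{c} \left( x_c * h_{c,l} \right)(t), \qquad
b_r(t) = \sum_{c} \left( x_c * h_{c,r} \right)(t)

% IACC: maximum of the normalized cross-correlation of the prefiltered signals
\mathrm{IACC} = \max_{|\tau| \le \tau_{\max}}
\frac{\left| \int \tilde{b}_l(t)\, \tilde{b}_r(t+\tau)\, \mathrm{d}t \right|}
     {\sqrt{\int \tilde{b}_l^{2}(t)\, \mathrm{d}t \, \int \tilde{b}_r^{2}(t)\, \mathrm{d}t}}
```

With this normalization, IACC lies between zero and one, and identical ear signals yield a value of one.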
For the determination of LF, the five generated upmix output signals are used to create another simulated sound field. The signal recorded by a virtual omnidirectional microphone results from summing the generated upmix output signals across all channels of the multichannel configuration.
The signal recorded by a virtual bidirectional microphone results from summing the generated upmix output signals across all channels of the multichannel configuration, each weighted according to the direction of the respective loudspeaker.
With these two signals, LF results from the ratio of the acoustic waves arriving at the listening position laterally to those arriving from all directions [7, 23].
Because a simulated sound field is used, the differentiation between early- and late-arriving signal components is omitted [7].
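The two virtual microphone signals and the resulting LF can be sketched as follows. The sketch assumes loudspeaker azimuths of ±30° and ±110° for a five-channel layout, a bidirectional microphone whose axis points sideways so that each channel is weighted with the sine of its azimuth, and LF as the energy ratio of the two microphone signals. These choices and all names are assumptions for illustration.

```python
import math

# Assumed loudspeaker azimuths in degrees (positive values to the left).
SPEAKER_AZIMUTH_DEG = {"L": 30.0, "C": 0.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0}


def lateral_energy_fraction(channels):
    """channels: dict mapping a channel name to a list of samples.

    Returns the energy of the virtual bidirectional (lateral) microphone
    signal divided by the energy of the virtual omnidirectional one.
    """
    num_samples = len(next(iter(channels.values())))
    omni = [sum(channels[name][i] for name in channels) for i in range(num_samples)]
    lateral = [
        sum(
            math.sin(math.radians(SPEAKER_AZIMUTH_DEG[name])) * channels[name][i]
            for name in channels
        )
        for i in range(num_samples)
    ]
    energy_omni = sum(x * x for x in omni)
    if energy_omni == 0.0:
        raise ValueError("omnidirectional signal is silent")
    return sum(x * x for x in lateral) / energy_omni


silence = [0.0, 0.0, 0.0]
burst = [1.0, -0.5, 0.25]
lf_center_only = lateral_energy_fraction({"C": burst, "Ls": silence})
lf_surround_only = lateral_energy_fraction({"C": silence, "Ls": burst})
```

A signal played only by the center loudspeaker yields an LF of zero, while a signal played only by one surround loudspeaker yields an LF close to one.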
The evaluation score of the perception test based on IACC is a direct measure for ASW. The lower IACC is, the higher the subjectively perceived spatial quality. The evaluation score of the perception test based on LF is a direct measure for LEV. The higher LF is, the higher the impression of spatial quality. The evaluation score of the perception test results from the evaluation scores of the perception test based on IACC and LF, each weighted with its own weighting factor.
Using simulated sound fields in the perception test ensures simplicity, because the results are independent of the properties of room, speakers, microphones, etc., which would have to be considered if IACC and LF were determined from costly recordings. The evaluation allows the comparison of IACC and LF (see Figure 13).
Upmix 1: In mode (a), IACC assumes medium to high values and LF assumes high values (see Figure 13). Accordingly, the subjectively perceived spatial quality is medium to low and the impression of spatial quality is high. In mode (b), IACC and LF both assume medium to high values. Accordingly, the subjectively perceived spatial quality is medium to low and the impression of spatial quality is medium to high.
Upmix 2: In both modes, IACC assumes high values and LF medium to high values (see Figure 14). Accordingly, the subjectively perceived spatial quality is low and the impression of spatial quality is medium to high.
3. Results
Table A4 (Appendix C) summarizes the single scores of all evaluation tests for the exemplarily tested stereo-to-multichannel upmix algorithms, the weighting factors used and the resulting overall evaluation score. The higher a score, the better the test result of the tested stereo-to-multichannel upmix according to the defined criteria. Zero is the worst, one the best evaluation score. The significance of single tests can be adjusted with the help of the weighting factors (see Appendix D).
Figure 14 illustrates the evaluation scores according to Table A4. All in all, upmix 1(a) has the best overall evaluation score by far and upmix 1(b) the second best. Upmix 2(b) has the worst overall evaluation score and upmix 2(a) the second worst. Note that two commercial upmix algorithms in two different modes were used to demonstrate the functional principle of the proposed evaluation test and to illustrate how possible results can be visualized. The aim of this paper was not to compare existing upmix algorithms but to introduce an objective evaluation test that makes an objective comparison possible. An overall evaluation of an upmix algorithm with the proposed evaluation test is therefore most meaningful in comparison with other upmix algorithms as references, and Figure 14 provides a suitable graphical overview for such a comparative evaluation. Note that the corresponding results of the single evaluation tests were presented at the end of each section.
For the proposed evaluation test, several assumptions about stereo-to-multichannel upmix algorithms were made. Since upmix algorithms are also based on assumptions, the evaluation test measures how well the assumptions made here are met. Furthermore, a self-contained evaluation of a single upmix algorithm should focus on the panning test, the direct signal test and the volume test. This is because the effects of the correlation of the surround channels (phase test) are perceived subjectively. In addition, the impacts of lateral energy fraction and interaural cross-correlation (perception test) on perceived spatiality are subjective, too.
4. Conclusions
In this paper, we proposed an objective evaluation for stereo-to-multichannel upmix algorithms based on defined objective criteria, special test signals and several single evaluation tests. Two upmix algorithms available on the market were used to demonstrate the single tests by way of example. The panning test checks whether the direction of the virtual sound source in the stereo-to-multichannel upmix corresponds to the direction of the virtual sound source in the initial stereo configuration. The direct signal test checks whether the remaining direct signal in the surround channels is as low as possible. The volume test checks whether the power and the loudness of the surround channels are not greater than those of the front channels. The phase test checks whether the surround channels of the stereo-to-multichannel upmix are neither completely correlated nor completely decorrelated, but have a certain degree of correlation. Finally, the perception test checks whether the stereo-to-multichannel upmix generates a high subjectively perceived spatial quality.
The introduced objective evaluation test enables an objective comparative evaluation, which provides a measurable quantity for the quality of stereo-to-multichannel upmix algorithms. In addition, the objective evaluation test could be used for the optimization of upmix algorithms and for clarifying and illustrating the impacts and influences of different modes and parameters. The proposed objective evaluation test is intended as an alternative or supplement to time-consuming and expensive subjective listening tests.
Nevertheless, a comparison of the proposed objective evaluation test with subjective test results will be a focus of future work as part of an appropriate validation.