Comparing both sets of results, there is a noticeable difference in relative spread perception between horizontal and vertical interchannel decorrelation. The horizontal results show that decorrelation is similarly effective at increasing horizontal image spread (HIS) for all frequency bands; however, a slightly longer time-delay is required for the ‘Low’ frequency band to generate significantly greater levels of HIS (this is discussed further in Section 6.1 below). In contrast, vertical decorrelation in the median plane appears to be most effective for the ‘Low’ frequency band, though with little significant difference between conditions (unlike the results for horizontal decorrelation). The statistical correlation between gain factor and vertical image spread (VIS) is also noticeably lower for all vertical decorrelation conditions than for horizontal decorrelation. This suggests that, although significant changes to VIS by vertical decorrelation are observed in the median plane, the effect is weaker than that of horizontal decorrelation between a pair of left and right loudspeakers.
6.1. Interaural Cross-Correlation Coefficient (IACC)
Considering the relationship between horizontal decorrelation and interaural cross-correlation (IAC), Table 4 below displays the IAC coefficients (IACCs) of the binauralised stimulus signals in the horizontal plane. These results were calculated as the average IACC of 50 ms windows over time. For the binauralisation, a pair of head-related impulse responses (HRIRs) from the MIT KEMAR database [21] (left and right (±30°)) were convolved with the source signals of each condition. For the ‘Low’ and ‘High’ frequency bands, the results demonstrate that the IACC decreases as the gain factor increases (i.e., as the ICCC decreases); this is the expected relationship between ICCC and IACC, as demonstrated in previous research [1]. However, for the ‘Middle’ frequency band, IACC appears to decrease up to a gain factor of 0.6, before slightly increasing as the gain increases further. The reason for this is not clear and the subjective results in Figure 7 do not seem to follow the same trend. Since Figure 7 shows that HIS increases as gain factor increases (ICCC decreases), the results in Table 4 suggest that IACC may not be a good predictor of HIS for middle frequencies; instead, calculation of ICCC may prove to be more accurate for HIS prediction in this region.
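The windowed IACC calculation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the ±1 ms lag range follows the common IACC convention and is an assumption here (the text specifies only the 50 ms window), as are the non-overlapping windows and the function names. Plain Python is used for clarity; an array library would normally be preferred for real signals.

```python
import math

def iacc(left, right, fs, max_lag_ms=1.0):
    """Peak absolute normalised cross-correlation within +/- max_lag_ms."""
    n = min(len(left), len(right))
    energy = math.sqrt(sum(x * x for x in left[:n]) *
                       sum(y * y for y in right[:n]))
    if energy == 0.0:
        return 0.0  # a silent channel carries no correlation information
    max_lag = int(round(max_lag_ms * 1e-3 * fs))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[i] with right[i + lag] over the overlapping samples.
        start, stop = max(0, -lag), min(n, n - lag)
        acc = sum(left[i] * right[i + lag] for i in range(start, stop))
        best = max(best, abs(acc) / energy)
    return best

def windowed_iacc(left, right, fs, window_ms=50.0):
    """Mean IACC over consecutive, non-overlapping windows (assumed hop)."""
    win = int(round(window_ms * 1e-3 * fs))
    n = min(len(left), len(right))
    if n < win:
        raise ValueError("signals are shorter than one analysis window")
    values = [iacc(left[s:s + win], right[s:s + win], fs)
              for s in range(0, n - win + 1, win)]
    return sum(values) / len(values)

# Toy usage: identical ear signals are fully correlated.
fs = 8000
tone = [math.sin(2 * math.pi * 500 * i / fs) for i in range(800)]
print(f"IACC of identical signals: {windowed_iacc(tone, tone, fs):.3f}")
```

In practice, `left` and `right` would be the two ear signals obtained by convolving each loudspeaker feed with the corresponding HRIR pair and summing at each ear.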
To investigate the IACC results for the ‘Middle’ band further, the binauralised ‘Middle’ stimuli have been filtered into the three contributing octave-bands (centre frequencies of 500 Hz, 1 kHz and 2 kHz), with Table 5 below displaying the 50 ms window averaged IACC for each octave-band filtered condition. Here the 500 Hz octave-band displays a decrease of IACC as the gain factor increases (i.e., ICCC decreases), corresponding with the subjective HIS results for the ‘Middle’ band in Figure 7. On the other hand, the 1 kHz band has a decrease of IACC up to a gain factor of ‘0.8’, which then increases again with a gain factor of ‘1.0’. Similarly, the 2 kHz band has a decrease of IACC up to ‘0.6’, followed by an increase of IACC as gain factor increases to ‘1.0’. This indicates that the IACC results seen in Table 4 are largely determined by the 1 kHz and 2 kHz octave-bands; however, if IACC does indeed contribute to HIS, it seems that the subjective results for the ‘Middle’ band may have been dictated by the 500 Hz octave-band. The apparent disagreement between the IACC results and subjective HIS results for the ‘Middle’ frequency band suggests that measurement of IACC in two-channel stereophony (with a base angle of 60°) may not be an accurate predictor of HIS, particularly for broader bands in the middle frequency region. It should also be noted that the observed trend of IACC for the 1 kHz and 2 kHz octave-bands may be specific to the decorrelation method used in the present study; therefore, further investigations are required to observe the octave-band relationship between ICCC and IACC for different decorrelation techniques.
With regard to concert hall acoustics, the IACC3 measure considers that the greatest contributors to ASW lie within the frequencies of the ‘Middle’ band; it has been demonstrated in concert hall studies that an increase of ASW corresponds with a decrease of IACC for the 500 Hz, 1 kHz and 2 kHz octave-bands [2]. The results in Table 4 and Table 5 appear to contradict this somewhat, in that only the IACCs of the 500 Hz octave-band correspond with the increasing trend seen in the subjective HIS results. As mentioned above, this effect may be specific to the decorrelation method used; however, it could also be an inherent limitation of IACC calculation when presenting audio in two-channel stereophony. In order to properly assess the use of IACC (or IACC3) for HIS prediction over loudspeakers, further investigation is required to compare ICCC, IACC and HIS at octave-band level, combined with the assessment of multiple decorrelation methods.
It is also interesting to note that significant increases of HIS were perceived for all three frequency bands in the subjective results, rather than just the ‘Middle’ band from which the IACC3 is measured. For the ‘Low’ band, it may be that these frequencies are mostly correlated in concert halls when summing at the ear, thus providing no contribution to the measurement of IACC; however, when two low frequency signals are artificially decorrelated between a left and right loudspeaker pair, the differences of interaural correlation would likely be greater, seemingly causing an increase of HIS. Moreover, high frequencies were not included in the IACC3 measure due to a lack of reflection energy at higher frequencies in concert halls—with that in mind, the results presented here strongly suggest that measuring the IACC of high frequencies can also contribute to the measurement of HIS in two-channel stereophony. A greater consideration of higher frequencies (4 kHz to 16 kHz octave-bands) could be the basis of accurate HIS measurement in surround sound reproduction, where high frequency energy may be considerably greater.
6.2. Low Frequency Band Discussion
In the subjective results, it is apparent that the perception of the ‘Low’ frequency band differs between horizontal and vertical decorrelation. In order to observe the effect of time-delay on the source signals for the ‘Low’ band, Figure 9 displays the spectral difference between the two output channels, with gain factors of ‘0.2’, ‘0.6’ and ‘1.0’ for each time-delay. Spectra were calculated as the long-term average FFT using 4096 FFT points and a frame size of 4096 samples (with 50% overlapping windows and no spectral smoothing). In the plots, a positive amplitude indicates a bias towards the right/height loudspeaker channel (for horizontal and vertical decorrelation, respectively), whereas a negative amplitude indicates a bias towards the left/main loudspeaker channel.
From the Figure 9 plots, it can be seen that as time-delay decreases, the distribution of frequencies between the two channels becomes unbalanced. This is further reflected in Table 6, where the RMS of the two decorrelated output signals has been calculated for each time-delay, using a gain factor of ‘1.0’. With a 1 ms time-delay and gain factor of ‘1.0’, all frequencies below around 250 Hz are boosted in the left/main loudspeaker channel, resulting in an RMS difference between the two channels of 4.4 dB. In the case of horizontal decorrelation, this bias to the left loudspeaker may have caused the greater deviation of responses seen in the ‘Low’ frequency band subjective results (as suggested by the larger error bars for the 1 ms time-delay in Figure 7). For vertical decorrelation, on the other hand, the uneven frequency distribution with a 1 ms time-delay would have resulted in more energy in the lower main-layer loudspeaker below 250 Hz, potentially causing an increase of perceived loudness (despite SPL level-matching the conditions). It is hypothesised that such a change in perceived loudness could have caused the greater perception of VIS seen for the 1 ms time-delay in Figure 8, potentially from an enhanced floor reflection; having said that, the statistical correlation between gain factor and VIS for a 1 ms delay remains weak (rs = 0.42). These results suggest a 1 ms time-delay is unsuitable for decorrelating low frequency content. In future experimentation, it may also be useful to RMS level-match the two decorrelated outputs of the complementary comb-filtering method, in order to reduce a bias of energy towards one loudspeaker channel.
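To illustrate how a short delay can unbalance the low-frequency energy between channels, the sketch below evaluates the magnitude responses of a generic complementary comb-filter pair, 1 + g·e^(−j2πfτ) and 1 − g·e^(−j2πfτ). This sum/difference structure is an assumption made purely for illustration (the exact decorrelator is not restated in this section), and the function names are hypothetical. Under that assumption the two responses have equal magnitude at f = 1/(4τ), which is 250 Hz for a 1 ms delay, and the ‘sum’ output dominates below that frequency. This is consistent with the bias below around 250 Hz described above.

```python
import cmath
import math

def comb_pair_magnitudes(freq_hz, delay_s, gain):
    """Magnitudes of the assumed pair H+ = 1 + g*z and H- = 1 - g*z,
    where z = exp(-j*2*pi*f*tau) represents a pure delay of tau seconds."""
    z = cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return abs(1 + gain * z), abs(1 - gain * z)

def level_difference_db(freq_hz, delay_s, gain):
    """Level of the 'sum' output relative to the 'difference' output, in dB."""
    h_sum, h_diff = comb_pair_magnitudes(freq_hz, delay_s, gain)
    return 20.0 * math.log10(h_sum / h_diff)

def crossover_hz(delay_s):
    """Lowest frequency at which both outputs have equal magnitude."""
    return 1.0 / (4.0 * delay_s)

tau = 1e-3  # 1 ms time-delay
print(f"crossover frequency: {crossover_hz(tau):.0f} Hz")
for f in (63.0, 125.0, 250.0):
    print(f"{f:5.0f} Hz: sum vs difference output "
          f"{level_difference_db(f, tau, 1.0):+.1f} dB")
```

The crossover scales inversely with the delay, so longer delays move the first region of imbalance further down in frequency and produce more alternating boosts and cuts within a given band.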
While uneven frequency distribution may account for the 1 ms vertical decorrelation results, VIS change was also seen for longer time-delays at low frequencies (though the differences were largely insignificant). As hypothesised above, a floor reflection may have influenced the perception of VIS, particularly when more energy is present in the lower main-layer loudspeaker. To look for a potential effect of the listening room on VIS perception, binaural room impulse responses (BRIRs) of the semi-anechoic chamber have been recorded using the HAART impulse response toolbox [22], which utilises the exponential sine sweep approach [23]. Sine sweeps were reproduced from both the main- and height-layer loudspeakers independently, with the signals captured by a Neumann KU100 dummy head located in the listening position. The main- and height-layer BRIRs were then time and level aligned, before being summed together to replicate the vertical testing condition. Figure 10 displays the FFT of the summed BRIRs, calculated using 4096 FFT-points and a frame length of 4096 samples (with 50% overlapping Hann windows). On inspection of the spectrum, a large notch can be seen in the low frequency region around 140 Hz, as well as smaller notches in the ‘Middle’ frequency band (up to around 2 kHz). The regularity of the notches suggests a comb-filter effect due to a first reflection interacting with the direct sound, presumably from the rubber flooring of the semi-anechoic chamber (despite placing absorption on the floor between the listener and the loudspeakers). The first frequency notch of comb-filtering when two similar signals interact can be determined by Equation (2) below. The main-layer loudspeaker was located at 1.15 m above the ground and 1.5 m from the listening position; this results in a floor reflection path of around 1.25 m greater than the direct signal, with a delay of ~3.6 ms between their arrival at the ear. From this, it is calculated that the first comb-filter notch from a floor reflection should theoretically occur at 139 Hz; the similarity between this and the large notch observed in Figure 10 suggests that a floor reflection is indeed present. Previous research has shown that a single ceiling reflection can increase the perception of VIS [6,7]; further testing is required to observe whether a single floor reflection can also have a similar effect on the vertical image.
$$f = \frac{1}{2t} \qquad (2)$$

where $t$ is the time-delay between signals and $f$ is the first notch frequency of the comb-filter.
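As a check on the figures quoted above, the floor-reflection geometry can be worked through with the image-source method. The sketch assumes that the ear height equals the loudspeaker height (1.15 m) and that the speed of sound is 343 m/s; neither value is stated in the text, and the function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (not stated in the text)

def floor_reflection_extra_path(height_m, distance_m):
    """Extra path length of a floor reflection relative to the direct sound,
    assuming source and ear are both at height_m (image-source method)."""
    # The mirrored source sits 2 * height_m below the real one.
    reflected = math.hypot(distance_m, 2.0 * height_m)
    return reflected - distance_m

def first_notch_hz(delay_s):
    """First comb-filter notch for a delayed copy: f = 1 / (2 t)."""
    return 1.0 / (2.0 * delay_s)

extra = floor_reflection_extra_path(1.15, 1.5)
delay = extra / SPEED_OF_SOUND
print(f"extra path {extra:.2f} m, delay {delay * 1e3:.1f} ms, "
      f"first notch {first_notch_hz(delay):.0f} Hz")
```

Under these assumptions the extra path and delay agree with the quoted ~1.25 m and ~3.6 ms. Using the rounded 3.6 ms delay gives a first notch of about 139 Hz, while the unrounded geometry gives a value a hertz or two lower.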
To analyse the summed main and height BRIRs further, the ratio of early reflection energy to direct sound energy (ER/D) at the listening position has been calculated using Equation (3) below. The results presented in Table 7 show that the ER/D is noticeably greater for the ‘Low’ frequency band (−1.9 dB) than for the other bands; in other words, the floor reflection observed in Figure 10 is likely to have been heavily weighted with low frequency energy, while the higher frequencies were mostly absorbed. Hypothetically, a decorrelation of enhanced reflections might have led to a further increase of VIS with the ‘Low’ frequency band. If this were the case, it is possible that the results presented here are specific to the listening environment in which the testing was conducted. Nevertheless, the subjective results still indicate that some change of VIS is perceivable at low frequencies, with further investigation required to ascertain the exact cause of the perception.
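$$\mathrm{ER/D} = 10\log_{10}\left(\frac{\int_{t_2}^{t_3} p^2(t)\,dt}{\int_{t_1}^{t_2} p^2(t)\,dt}\right) \qquad (3)$$

where $p(t)$ is the impulse signal, $t_1$ is 0 ms, $t_2$ is 2.5 ms and $t_3$ is 80 ms.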
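A minimal sketch of how such an energy ratio could be computed from a sampled impulse response follows. It assumes that the direct sound is integrated from 0 ms to 2.5 ms and the early reflections from 2.5 ms to 80 ms, that time zero is the arrival of the direct sound, and that the result is expressed in dB; the function name and arguments are hypothetical.

```python
import math

def er_to_direct_db(impulse, fs, t1_ms=0.0, t2_ms=2.5, t3_ms=80.0):
    """Early-reflection-to-direct energy ratio in dB.

    impulse: samples of the impulse response, with index 0 at the
             arrival of the direct sound (assumed).
    fs:      sample rate in Hz.
    """
    n1, n2, n3 = (int(round(t * 1e-3 * fs)) for t in (t1_ms, t2_ms, t3_ms))
    direct = sum(x * x for x in impulse[n1:n2])  # energy in [t1, t2)
    early = sum(x * x for x in impulse[n2:n3])   # energy in [t2, t3)
    if direct == 0.0 or early == 0.0:
        raise ValueError("both windows must contain non-zero energy")
    return 10.0 * math.log10(early / direct)

# Toy example: unit direct sound plus one reflection at half amplitude,
# arriving 3.6 ms later (i.e. inside the early-reflection window).
fs = 48000
ir = [0.0] * 4800
ir[0] = 1.0
ir[int(round(3.6e-3 * fs))] = 0.5
print(f"ER/D = {er_to_direct_db(ir, fs):.1f} dB")
```

In the toy example the reflection carries a quarter of the direct energy, so the ratio is negative. A band-limited ER/D such as that reported per frequency band would apply the same calculation to a band-filtered impulse response.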
6.3. High Frequency Band Discussion
On further inspection of the vertically summed BRIR spectra in Figure 10, large notches can also be seen within the ‘High’ frequency band, specifically in the region of the 16 kHz octave-band; these notches are presumably due to HRTF filtering at the pinna [9]. To investigate the effect of the HRTF on the ‘High’ band, the vertical stimuli have been convolved with the sum of two anechoic head-related impulse responses (HRIRs) from MIT’s KEMAR dummy head database [21], where one HRIR represents the main-layer loudspeaker angle (0° azimuth, 0° elevation) and the other the height-layer loudspeaker (0° azimuth, +30° elevation). In order to observe the gain factor effect on the ear input spectrum, the HRIR-convolved stimuli spectra have been plotted in Figure 11 for three gain factors (0.2, 0.4 and 1.0) of each time-delay. This demonstrates how the HRTF spectrum changes as the vertically-arranged signals are decorrelated, with only three gain factors plotted to improve clarity. The FFTs have been calculated as a long-term average using 4096 FFT-points and a frame length of 4096 samples, with 1/96 octave spectral smoothing and 50% overlapping Hann windows.
The spectra in Figure 11 display similar high frequency spectral notches to those observed in Figure 10; these notches are around 11.5 kHz and 17 kHz and appear to have the greatest depth when the signals are correlated (gain factor = 0.0). As the gain factor increases (i.e., as correlation between the main- and height-channels decreases), a spectral boost occurs in these regions and the notches become ‘filled in’. This is most apparent for the 10 ms and 20 ms time-delays, whereas with shorter delays, a comb-filtering effect occurs that noticeably distorts the definition of the notches. Furthermore, Figure 12 below compares the HRTFs for each loudspeaker position independently against the HRTF for the loudspeakers combined (i.e., the vertical stereophonic test condition). Here it is seen that the notches displayed in the plots of Figure 11 appear to be unrelated to the individual main- and height-layer loudspeaker HRTFs, which suggests that these notches are specific to the vertical stereophonic condition.
From these observations, it is hypothesised that the filling in of notches provides a spectral cue for the perception of VIS; however, further investigations are required to explore whether this is indeed the case. If the spectral notches (and subsequent filling from decorrelation) are an important cue for VIS perception, the increase of spectral distortion with shorter time-delays would inevitably have an impact on the detection of such cues. This is reflected in the vertical decorrelation subjective results (Figure 8), where larger error bars are seen for the 1 ms time-delay and a significant gain factor effect is only apparent for the 5 ms time-delay and above; the only significant difference between individual gain factor conditions for the ‘High’ band was seen with a time-delay of 10 ms.
Despite clear spectral changes in the ‘High’ frequency band as correlation decreases, little significant change of VIS is seen between the different gain factor conditions in the subjective results (Figure 8). One possibility is that this is related to the unweighted sound pressure level (SPL) (dB(Z)) used during testing, which is likely to produce differences in loudness between the frequency bands. It is known from the literature that presentation level has an impact on the perceived extent of a source, where a greater level increases the size of an auditory event [24]. To quantify the loudness of each band, Table 8 displays the LUFS (LKFS) values [25] for the correlated stimuli source signals used during testing, as calculated by Adobe Audition. The results show a +4 dB increase of loudness for the ‘High’ frequency band compared to the ‘Low’ frequency band (when both have been SPL level-matched).
Table 8 also displays the LUFS values calculated for a pink noise signal that has been filtered into the same frequency bands used during testing. Pink noise has equal energy for each octave-band and is thought to roughly represent the typical octave-band relationship within a complex signal. Looking at Table 8, the pink noise LUFS results demonstrate that the relative loudness of the ‘High’ frequency band is considerably lower than that of the ‘Low’ frequency band. When comparing this against the LUFS of the test stimuli, it is clear that the levels used during testing are not representative of a typical interband frequency relationship found in a complex source. Given that the loudness of the high frequency stimuli was comparatively high during testing, it may have increased the perception of VIS for all conditions, making the VIS changes from decorrelation more subtle; further testing of this hypothesis is required to determine the effect of level and loudness on VIS perception.
The loudness level of the ‘High’ frequency band stimuli signal (Table 8) could also be related to a lack of reflective energy at high frequencies. As seen with the ER/D in Table 7, the ‘Low’ frequency band has the greatest amount of early reflective energy, suggesting that less amplification would be required to meet the target SPL. In contrast, given the greater absorption at high frequencies in the semi-anechoic chamber, it is thought that the ‘High’ frequency band would require more amplification at line level to match the same SPL.
Another reason for a lack of significant VIS difference between the ‘High’ frequency band conditions in the subjective testing could be related to the “pitch-height effect” [24,26,27] and “directional bands effect” [28,29]. From the directional band research, it is known that 4 kHz and 16 kHz bands tend to be perceived in front and an 8 kHz band is often perceived above (if presented under anechoic conditions). Similarly, when octave-band noise signals are presented at ear height from in front of the listener, a “pitch-height effect” occurs which sees the 8 kHz octave-band elevated upwards (towards the position of the height-channel loudspeaker), whereas the 16 kHz band is localised towards the main-channel loudspeaker and 4 kHz is perceived somewhere between the two [24,26]. Wallis and Lee [27] have demonstrated that this effect also occurs when coherent octave-bands are presented in vertical stereophony, i.e., the same signal reproduced in a main-layer loudspeaker and a height-layer loudspeaker simultaneously (both of which are located in the median plane, similar to the test condition in the current experiment). It is hypothesised that this natural vertical spread of frequencies may also be apparent in the ‘High’ frequency band signals, resulting in an initial broad VIS when both signals are correlated (gain factor = 0.0), with relatively small changes to VIS from decorrelation as the gain factor increases. The potential perception of this has been illustrated in Figure 13, showing the possible distribution of octave-bands across the frontal vertical image. To investigate this hypothesis further, it would be useful to conduct experiments on the absolute extent of VIS for individual octave-bands, as well as for broadband signals; results from such an investigation would give important insights into the inherent perception of VIS by vertical interchannel decorrelation.