1. Introduction
In audio reproduction systems for three-dimensional (3D) sound, such as Auro 3D [1] and Dolby Atmos [2], the loudspeakers can generally be divided into two layers: the lower (main) layer and the upper (height) layer. In the context of sound recording made using a microphone array in an acoustic space, the frontal loudspeakers of the main layer, which are located on the horizontal plane, are predominantly used for sound source positioning. Conversely, the height layer, which is typically elevated by between 30° and 45°, primarily aims to enhance the perceived listener envelopment (LEV) by presenting ambient signals, although it can also be used to reproduce elevated sound sources. When recording for such formats, it is necessary to pay close attention to the amount of direct sound present in the height layer signal. The reason for this is as follows. Should there be excessive direct sound in the height layer then, at the reproduction stage, sound sources may be perceived as vertically oriented phantom images at intermediate positions between the main and height loudspeaker layers. Additional spatial and timbral effects may also be perceived, depending on the time and level relationships between the direct sounds in the respective layers. Collectively, these properties comprise an interference effect referred to as ‘vertical interchannel crosstalk’.
To date, the few studies that have considered vertical interchannel crosstalk have primarily been concerned with preventing the direct sound present in the height layer from affecting the perceived location of the main channel signal. Although this has received little attention in the literature, suggestions as to the nature of the effect can be garnered from studies undertaken within the context of vertical amplitude panning. Within such studies, the literature generally agrees that increases in interchannel level difference (ICLD) between vertically arranged stereophonic loudspeakers will cause the resultant phantom image to be localised in a position biased towards the loudspeaker of greater amplitude [3,4,5,6,7]. Despite this, such studies do not necessarily indicate that sufficient ICLD alone will prevent the signal in the height layer from affecting the perceived location of the main channel signal. For example, whilst Barbour [5] demonstrated that ICLDs between 6 and 9 dB resulted in the perceived phantom image position matching the physical position of the lower loudspeaker for pink noise and speech sources, Somerville et al. [3] and Kimura and Ando [7] found that the phantom images remained somewhat elevated for ICLDs up to and including 15 dB when the test stimuli were musical sources, pink noise and speech. It should be noted that differences in the experimental setup and, in particular, the physical position of the loudspeakers might have contributed to these differences in results.
With respect to more direct experiments into the effects of vertical interchannel crosstalk, Lee [8] conducted an analysis into the ‘localisation threshold’, which was defined as the minimum amount of attenuation of direct sound necessary in the height layer for the main channel signal to be localised at the position of the main layer. It is important to note that the localisation threshold is not a complete masking of the direct sound in the height layer. Instead, although the perceived location of the main channel signal would be unaffected, the aforementioned spatial and timbral effects of vertical interchannel crosstalk would remain somewhat audible. In [8], cello and bongo sources were presented from vertically arranged stereophonic loudspeakers located directly in front of the listening position. With respect to the listening position, the lower (main) loudspeaker was not elevated, whilst the upper (height) loudspeaker was elevated by 30°. Delays ranging from 0 to 50 ms were applied to the height loudspeaker with respect to the main. A subsequent study conducted by Stenzl et al. [9] was generally similar, although for that study phantom images were formed between diagonally arranged loudspeakers (e.g., the left loudspeaker in the main layer and right loudspeaker in the height layer). In addition, their test stimuli also included male speech alongside the cello and bongo sources. The results of both studies revealed the following with respect to localisation thresholds. Firstly, for delays in the range of 0–10 ms, they were not significantly affected by interchannel time difference (ICTD). In addition to this, the effect of sound source was not significant. The thresholds reported by each study then, for ICTDs up to 10 ms, were in the range of −6 to −7 dB, which shows good agreement with the amplitude panning experiments of Barbour [5] and Wendt et al. [6].
From a practical standpoint, the results presented in [8] and [9] can be used to influence techniques both for the rendering of 3D images and for the design of microphone configurations for recording in 3D audio formats. In either case, the results are informative as to the maximum levels of direct sound that can be present in the height layer without the perceived location of the main channel signal being affected. For example, with respect to microphone techniques, it can be seen that the direct sound in the height layer must be attenuated by a minimum of 6 dB when the spacing between the main and height layers of microphones is less than 3.4 m (corresponding to an ICTD of 10 ms). This can be achieved through the use of cardioid microphones in the height layer. In the case that a vertically coincident configuration is used, angling the height microphones at least 90° away from the sound source should provide the necessary attenuation, as was suggested in [8]. The microphones in the main layer should be positioned on axis with respect to the sound source.
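The two figures quoted above (a 3.4 m spacing for a 10 ms ICTD, and roughly 6 dB of attenuation from a cardioid angled 90° off axis) can be checked with a short calculation. The following Python sketch is illustrative only: it assumes a speed of sound of 343 m/s, an ideal first-order cardioid with sensitivity 0.5(1 + cos θ), and free-field conditions. The function names are hypothetical and do not come from any of the cited studies.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def ictd_ms_from_spacing(spacing_m, c=SPEED_OF_SOUND_M_S):
    """Delay in ms of direct sound travelling an extra path of spacing_m metres."""
    return 1000.0 * spacing_m / c


def spacing_from_ictd_ms(ictd_ms, c=SPEED_OF_SOUND_M_S):
    """Path difference in metres that produces a delay of ictd_ms milliseconds."""
    return c * ictd_ms / 1000.0


def cardioid_attenuation_db(angle_deg):
    """Level of an ideal cardioid relative to its on-axis pickup, in dB.

    Assumes the textbook pattern 0.5 * (1 + cos(theta)); real microphones
    deviate from this, particularly at high frequencies.
    """
    sensitivity = 0.5 * (1.0 + math.cos(math.radians(angle_deg)))
    if sensitivity <= 0.0:
        return -math.inf  # ideal null at the rear of the pattern
    return 20.0 * math.log10(sensitivity)


if __name__ == "__main__":
    print(round(spacing_from_ictd_ms(10.0), 2))     # path difference for 10 ms
    print(round(cardioid_attenuation_db(90.0), 2))  # level at 90 degrees off axis
```

Under these assumptions, a 10 ms ICTD corresponds to a path difference of about 3.43 m, and an ideal cardioid is about 6 dB down at 90°, which is consistent with the values given in the text.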
In a more recent study conducted by the authors [10], it was demonstrated that localisation thresholds have a frequency dependency. Octave bands of pink noise, with centre frequencies ranging from 125 Hz to 8 kHz, as well as broadband pink noise, were presented to subjects from vertically arranged stereophonic loudspeakers in an anechoic chamber. Delays ranging from 0 to 10 ms were applied to the height layer with respect to the main. The results of the study showed that the localisation thresholds were not significantly affected by ICTD, which agreed with the results reported in [8,9]. The thresholds for the 125 and 250 Hz bands were in the range of −3 to −5 dB, which was significantly higher than the −9 to −11 dB thresholds found for the 1, 2 and 8 kHz bands. In addition, the threshold for the broadband source was the lowest of all stimuli tested, being −11.5 dB. These results seem to provide an implication for the analysis of localisation thresholds for natural sound sources with different spectral balances; it might be suggested that the threshold for a high frequency dominant source would be lower than that for a low frequency dominant source. As mentioned above, previous studies [8,9] reported that the localisation threshold was not source dependent. However, since the sources used in those studies were somewhat limited (cello and bongo in [8], cello, bongo and speech in [9]), a wider range of sources would need to be tested in order to confirm the source dependency of the localisation threshold.
Of further interest in the present study is how localisation thresholds are affected by the way in which the test stimuli are presented to subjects (the presentation method). In two recent localisation experiments conducted by the authors [11,12], continuous broadband pink noise was presented to subjects from loudspeakers arranged in two layers. The loudspeakers positioned on the main layer were not elevated with respect to the listening position, whilst those in the height layer were elevated by 30°. In the first experiment [11], each layer consisted of a single loudspeaker positioned with 0° azimuth. Under such conditions, localisation judgments for the pink noise sources were accurate. Conversely, in the second experiment [12] each layer consisted of stereophonic loudspeakers with a base angle of 60° (±30°). The results of this study showed that the pink noise was perceived as being elevated with respect to the physical position of each layer. This difference in results is indicative of the phantom image elevation effect, in which stimuli are perceived as being more elevated when presented as stereophonic phantom images than when presented from a single source [13,14]. This has notable implications for the reduction of vertical interchannel crosstalk. If main channel images are elevated with respect to the physical position of the main channel layer as a result of the phantom image elevation effect, then it could be argued that the location-based effects of vertical interchannel crosstalk would be less distracting. Consequently, the localisation threshold might be much lower under such circumstances or, alternatively, might not be necessary at all. It is therefore of interest to determine how the localisation thresholds would vary when stimuli are presented as vertically arranged quadraphonic phantom images compared with vertical stereophonic presentation.
From the above background the following research questions were derived:
The present paper is organised as follows. An experiment is first described in which the effects of sound source, presentation method and ICTD on localisation threshold were analysed. Following this, a second experiment is presented in which the thresholds obtained in the first experiment were applied to sound sources and verified in localisation tests. The paper concludes with discussions pertaining to the results of each experiment, as well as the implications for image rendering and microphone techniques. This also includes suggestions for future work.
3. Experiment Two: Verification of the Localisation Thresholds
3.1. Materials and Methods
3.1.1. Physical Setup
The physical setup for the verification test is shown in Figure 8. The experiment was conducted in the same room as was used in Experiment One and used an almost identical setup. However, as Experiment One demonstrated that the localisation thresholds were not affected by presentation method, only the L, R, HL and HR loudspeakers were used (i.e., the vertical quadraphonic condition); the C and HC loudspeakers were removed. The vertical quadraphonic condition was favoured over the vertical stereophonic condition because existing 3D audio systems, such as Auro 3D [1], tend to make use of elevated L and R loudspeakers but do not always use an elevated centre loudspeaker. It was therefore considered that the vertical quadraphonic condition would be more relevant to practical situations. A light-emitting diode (LED) strip was positioned directly in front of the listening position. This was located behind the acoustically transparent curtain and was to be used by subjects to make localisation judgments.
3.1.2. Test Stimuli
The stimuli used for the experiment were the same sources used in Experiment One. The test stimuli were presented to subjects in the following conditions: (1) main layer only; (2) height layer only; (3) vertically oriented phantom image with 0 dB interchannel level difference (ICLD); and (4) vertically oriented phantom image with the localisation threshold applied to the height layer. The ICTDs applied to the height layer for the phantom image conditions were 0 and 1 ms. The 10 ms condition was not tested for the following reasons. Firstly, as there was no significant difference between the localisation thresholds obtained for 1 and 10 ms, it was deemed unnecessary to test both conditions. Furthermore, as discussed earlier, 10 ms represents a condition whereby the path difference between the direct sound arriving at the main and height layers respectively is around 3.4 m, which is fairly large in practice. It was therefore decided that the 1-ms condition would be more representative of a practical configuration, with the resultant path difference being only around 0.34 m.
Although not necessarily integral to the verification test, the 0-dB ICLD and height layer only conditions were included in the experiment in order to reduce any expectation biases. During preliminary tests, in which only the main layer only and localisation threshold conditions were considered, subjects reported that hearing all stimuli originate from the same position in a localisation test was confusing. Furthermore, some were led to believe that the stimuli could not all be coming from the same location and this forced them to provide different answers to what was actually being perceived. The height layer only and 0-dB conditions were therefore included in order to introduce stimuli that were in a position away from the main layer only condition. This was found to prevent the issue.
All stimuli were presented at 70 dB LAeq at the listening position when presented from the main layer only. The increase in amplitude when the stimuli were presented as vertically arranged quadraphonic phantom images was dependent on the localisation threshold applied to the height layer. In the case of the 0-ms condition, the height layer was attenuated by 9.5 dB with respect to the main layer, whilst for 1 ms the attenuation was 7 dB, which was based on the results from Experiment One. In total, there were 30 stimuli, being the main and height layer-only conditions (10), the localisation threshold conditions (10—five sources, two ICTDs) and the 0-dB ICLD conditions (10—five sources, two ICTDs).
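The stimulus count and the attenuation applied to the height layer can be laid out explicitly. The sketch below is a minimal illustration rather than the authors' test software: it enumerates the conditions described above for the five sources used in Experiment One and converts the reported height-layer attenuations (9.5 dB at 0 ms, 7 dB at 1 ms) into linear gain factors. The dictionary keys and function names are hypothetical.

```python
from itertools import product

SOURCES = ["conga", "quartet", "speech", "guitar", "oboe"]
ICTDS_MS = (0, 1)
# Height-layer level relative to the main layer at the localisation threshold
THRESHOLD_DB = {0: -9.5, 1: -7.0}


def db_to_gain(level_db):
    """Convert a level in dB to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)


def build_stimuli():
    """Return one dict per stimulus; field names are illustrative only."""
    stimuli = []
    # Single-layer conditions have no ICTD or interchannel level difference
    for source, layer in product(SOURCES, ("main_only", "height_only")):
        stimuli.append({"source": source, "condition": layer,
                        "ictd_ms": None, "height_db": None})
    # Phantom image conditions: threshold applied, and 0 dB ICLD
    for source, ictd in product(SOURCES, ICTDS_MS):
        stimuli.append({"source": source, "condition": "threshold",
                        "ictd_ms": ictd, "height_db": THRESHOLD_DB[ictd]})
        stimuli.append({"source": source, "condition": "zero_db_icld",
                        "ictd_ms": ictd, "height_db": 0.0})
    return stimuli


if __name__ == "__main__":
    stimuli = build_stimuli()
    print(len(stimuli))
    for ictd, level in THRESHOLD_DB.items():
        print(ictd, "ms:", round(db_to_gain(level), 3))
```

The enumeration yields 30 stimuli, matching the total stated above, and the two thresholds correspond to height-layer amplitude factors of roughly 0.33 and 0.45 relative to the main layer.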
3.1.3. Test Method
The test was completed by the same 10 subjects who participated in Experiment One. Localisation judgments were made using the LED strip located directly in front of the listening position. For each test, subjects were provided with a handheld knob, which controlled which LED on the strip was turned on. Subjects were required to adjust the knob until the position of the active LED matched the perceived location of the focal point of each stimulus. This method was chosen following research conducted by Lee et al. [27], who found that it was faster and produced results with greater accuracy and consistency compared to the numbered scale method, which had been used in a number of previous vertical localisation studies [11,28,29]. The position of the LED selected for each stimulus was converted into an elevation angle. The heads of subjects were not fixed; however, they were instructed to sit up and face forwards at all times, using only their eyes to look at the light strip. To help maintain the correct seating position, a small headrest was positioned behind the head of each subject. The test was completed four times by each subject, with each sitting containing all 30 stimuli and taking around 10 min to complete. The presentation order of stimuli was randomised for each test.
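The conversion from LED position to elevation angle is not specified in detail, but one plausible form is simple trigonometry. The sketch below assumes a straight, vertical strip at a known horizontal distance from the listener, with evenly spaced LEDs and a known LED at ear height. All of these geometric parameters, and the function name, are assumptions introduced for illustration and are not taken from the experiment.

```python
import math


def led_to_elevation_deg(led_index, ear_level_index, led_pitch_m, distance_m):
    """Elevation angle in degrees of an LED as seen from the listening position.

    led_index       -- index of the selected LED, counted from the bottom
    ear_level_index -- index of the LED at ear height (assumed to be known)
    led_pitch_m     -- spacing between adjacent LEDs in metres (assumed uniform)
    distance_m      -- horizontal distance from the listener to the strip
    """
    height_m = (led_index - ear_level_index) * led_pitch_m
    return math.degrees(math.atan2(height_m, distance_m))


if __name__ == "__main__":
    # Hypothetical geometry: 2 cm LED pitch, strip 2 m from the listener
    print(round(led_to_elevation_deg(60, 10, 0.02, 2.0), 1))
```

If the strip were curved so that every LED sat at the same distance from the listener, the angle would instead be proportional to the arc length along the strip, so the geometry would need to be confirmed before applying a conversion of this kind.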
3.2. Results
Levene and Shapiro–Wilk tests were first conducted, using the SPSS Statistics 22 software (IBM, New York, NY, USA), in order to determine the suitability of the collected data for parametric statistical analysis. The Shapiro–Wilk test showed that not all scores in each condition featured normal distribution, although the results of the Levene test showed homogeneity of variance for all sound sources. For these reasons, non-parametric tests were chosen for the statistical analysis.
Figure 9 shows the median perceived elevation of each of the test stimuli, plotted with notch edges. Consideration of the data reveals the following. Firstly, the localisation thresholds derived in Experiment One resulted in perceived elevation judgments similar to those for the same source presented from the main layer only. This was the case for all sources, with the median difference in perceived elevation between the main layer only and localisation threshold conditions ranging between −4.0° and −0.8° for the 0-ms ICTD and between 3.9° and 0.0° for the 1-ms ICTD. In addition to this, the notch edges for all the localisation threshold conditions overlap with those for main layer-only presentation. It is also interesting to note that, for the 0-ms ICTD, the median perceived elevation for the stimuli with the localisation threshold applied was slightly lower than the main layer only condition for all sources.
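For readers unfamiliar with notched box plots, the notch edges are commonly computed as the median ± 1.57 × IQR/√n, and non-overlapping notches are taken as informal evidence that two medians differ. The following sketch uses that convention; whether the plotting software used for Figure 9 applies exactly this formula and quartile definition is an assumption, and the function names are illustrative.

```python
import math
import statistics


def notch_edges(samples):
    """Return (lower, upper) notch edges: median +/- 1.57 * IQR / sqrt(n)."""
    n = len(samples)
    median = statistics.median(samples)
    # Quartile definitions differ between packages; "inclusive" is one choice
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median + half_width


def notches_overlap(samples_a, samples_b):
    """True if the notch intervals of the two samples intersect."""
    low_a, high_a = notch_edges(samples_a)
    low_b, high_b = notch_edges(samples_b)
    return low_a <= high_b and low_b <= high_a
```

Applied to two sets of elevation judgments, `notches_overlap` returning `True` corresponds to the overlapping notch edges described above for the localisation threshold and main layer-only conditions.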
In order to further determine whether or not the localisation thresholds derived from Experiment One were successful in preventing vertical interchannel crosstalk from affecting the perceived location of the main channel signal, Wilcoxon tests were conducted. The results suggested that, generally, there were no significant differences between the elevation judgments for the localisation threshold and main layer-only conditions. However, the data did suggest that the difference was significant for the quartet at 0 ms (p = 0.041) and for both the guitar (p = 0.002) and speech (p = 0.025) at 1 ms. Despite this, there is a clear overlap between the notch edges for each of these stimuli. In addition, the Pearson’s correlation coefficient did not show a large effect in any case (r = 0.25 for quartet at 0 ms, r = 0.39 for guitar at 1 ms, r = 0.28 for speech at 1 ms). It can therefore be suggested that the difference in median perceived elevation between the localisation threshold and main layer only conditions was not significant. Consequently, the localisation thresholds derived in the present study are appropriate in preventing vertical interchannel crosstalk from affecting the perceived location of the main channel signal.
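To make the comparison concrete, the sketch below implements a two-sided Wilcoxon signed-rank test using the normal approximation, together with an effect size of the form r = |Z|/√N. It is a simplified stand-in for the SPSS procedure: zero differences are dropped, tied ranks are averaged, and neither a tie correction to the variance nor a continuity correction is applied, so p-values may differ slightly from those reported. Treating the reported r as |Z|/√N, with N the total number of observations, is an assumption, since the paper does not state the convention used.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, z, p), where w_plus is the sum of ranks of the
    positive differences x - y.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w_plus, z, p


def effect_size_r(z, n_observations):
    """Effect size r = |Z| / sqrt(N); the meaning of N is an assumption."""
    return abs(z) / math.sqrt(n_observations)
```

Under this convention, a significant p-value accompanied by a small-to-medium r, as reported above, indicates a statistically detectable but modest shift in the elevation judgments.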
With respect to the 0-dB ICLD conditions, a series of interesting results can be seen. Firstly, for the oboe it is clear that perceived elevation was not significantly affected by changes in how the stimulus was presented to subjects. In all cases, the perceived elevation was similar, which would indicate that this source was less affected by the migration of the main channel signal from the main layer as a result of vertical interchannel crosstalk. This result might suggest that the application of localisation thresholds would not always be necessary. Furthermore, for the other stimuli it is clear that the median perceived elevation was greater for the 0-dB ICLD condition compared to the main layer only condition. For 0 ms, the difference in median perceived elevation ranged from 5.5° to 7.2°, whilst for 1 ms the difference ranged from 7.9° to 11.4°. This result indicates that the perceived location of the main channel signal would be more affected by vertical interchannel crosstalk when the height layer is delayed with respect to the main. However, despite this result it is clear that the difference was not always significant, with there being a notable overlap of the notch edges between the 0-dB ICLD conditions and the main layer-only conditions. This is particularly noticeable for the guitar and quartet sources at 0 ms. Nevertheless, it is clear that vertical interchannel crosstalk at the very least resulted in an increase in the median perceived elevation of the main channel signal, which was notably reduced when the localisation thresholds derived from Experiment One were applied.
A further result of note can be seen with respect to the main and height layer-only conditions. Firstly, for the latter condition it would appear that perceived elevation judgments were generally accurate for all sources, excluding the oboe, with respect to the physical position of the height layer. Conversely, for the main layer-only condition the judgments were less accurate, with perceived source elevation being in the range of 5.8°–13.0° with respect to the main layer’s physical location. This elevation of the sound source with respect to the main layer was also maintained for the conditions whereby a localisation threshold was applied to the height layer. The results of a Wilcoxon signed rank test, which compared the results for the main layer-only condition to the physical position of the main layer (0°), showed that each source was perceived to be significantly higher than the physical height from which the source was presented (p < 0.001 for all sources).
5. Conclusions
The present study carried out an analysis of localisation thresholds for natural sound sources. The study was divided into two experiments. In the first (Experiment One) the effects of sound source, ICTD and presentation method were examined. Anechoically recorded conga, quartet, speech, guitar and oboe sources were presented to subjects in a natural listening environment using two conditions: vertical stereophonic and vertical quadraphonic. For each condition, the loudspeakers were divided into two layers, being ‘height’ (30° elevation) and ‘main’ (0° elevation). Delays ranging from 0 to 10 ms were applied to the height layer with respect to the main. Subjects sat a listening test in which the minimum amount of attenuation necessary in the height layer for the resultant phantom image to match the position of the same source presented from the main layer alone was considered.
The results of the experiment showed that the localisation thresholds were affected only by ICTD. For delays of 0 ms the threshold was −9.5 dB, which was significantly lower than the −7 dB found for 1 and 10 ms. That less ICLD was necessary in the presence of a delay was initially interpreted based on the existence of a localisation dominance effect. In addition, attempts to explain the non-significant effect of sound source were made based on the hypothesis that the primary mechanism to determine whether or not the localisation threshold had been met was the balance of spectral energy provided by the main and height layer, particularly in the 7–9 kHz range, which is not related to the spectrum of the source itself. This hypothesis also explained the non-significant effect of presentation method, with it being demonstrated that the reduction in the difference in energy between the main layer only and phantom image conditions in the 7–9 kHz region was similar for both methods for a given ICLD.
In Experiment Two, the localisation thresholds obtained in Experiment One were applied to natural sound sources, with localisation tests being conducted in order to verify that they were effective at preventing vertical interchannel crosstalk from affecting the perceived location of the main channel signal. Stimuli were presented using the vertical quadraphonic condition, with the main and height layer-only, 0-dB ICLD and localisation threshold conditions all being tested. ICTDs of 0 and 1 ms were applied to the height layer with respect to the main. Subjects used a light strip, which was controlled by a handheld knob, in order to identify the perceived location of each stimulus. For all stimuli, there was no significant difference in perceived elevation between the main layer only and localisation threshold conditions.
A key result from Experiment Two was that no evidence was found to support the existence of a localisation dominance effect, with the perceived elevation of the sources with 1-ms ICTD being higher than those with 0-ms ICTD. It is therefore unclear why less level reduction was necessary in Experiment One in the case that an ICTD was present. The results also showed evidence of the phantom image elevation effect, which was used to suggest that the perceived difference in elevation between the main layer only and phantom image conditions would be similar for both presentation methods for a given ICLD. This therefore further explained why the effect of presentation method was not significant in Experiment One. In addition, the results implied that the oboe source was localised based on the pitch-height effect. This meant that the hypothesis regarding the balance of spectral cues provided by the main and height layers did not adequately explain the localisation thresholds obtained for this source. As a result of this, it was suggested that the results might be explained based on differences in perceived VIS between the main layer-only and phantom image conditions.
The practical implications of the results obtained in the study were also discussed. In particular, differences between suggestions made in previous studies and those indicated by the present results were considered. It was also stated that further study would need to be conducted into the spatial and timbral effects when the localisation thresholds are applied in order to determine whether or not it would be more appropriate for the direct sound in the height layer to be masked.