1. Introduction
The feeling of being “immersed” in virtual environments (VEs) has long been an essential element in the design of user experiences for developers and content creators [1]. Virtual environments are immersive when they afford perception of the environment through sensorimotor relationships that mimic our natural existence [2]. The degree of immersion depends on various factors of the visual experience, such as the field of view, display latency and display resolution, as well as on the number of other sensory modalities available in the virtual environment. For example, a slowly updated display is less immersive than one that can keep up with the speed of our head movements. In its simplest form, an immersive VE has both visual and auditory modalities [3,4].
Immersive sound in a VE can be achieved by incorporating environmental sounds, the sounds of our own actions and a simulation of the acoustics of the environment, which affects the perceived sound [4]. Sound can be incorporated into a VE as simple stereo sound or as spatial sound, which renders real-world cues such as sound reflections and acoustic changes due to body movements, making the virtual experience feel more authentic. Spatial sounds not only increase the feeling of presence or ‘being there’ but also elicit more head and body movements from the user on account of being more immersive [5]. However, spatial sounds are complex, and both their acquisition and reproduction are demanding in terms of equipment, effort and expense. Binaural sound is sound that is perceived as being present at a specific location in space—distance, elevation and azimuth. As the name suggests, it is achieved by simulating how the sound reaches each of our ears. Finding where a sound originates—sound localization—is essential to veridical perception of an environment. High fidelity in spatial sound rendering cannot be compromised, since conflicting visual information could interfere with sound localization [6], as commonly seen in the capture effect or ventriloquism effect [7,8]. Serafin and colleagues [4] describe ‘ear adequate’ headphones, individually administered binaural signals, head movement tracking and room acoustics as some of the requirements of a spatial soundscape to ensure high fidelity in the audio-visual environment. The position of the sound source, the position of the receiver (user), the individual ear and head properties of the receiver, the positions of other objects in the environment and the acoustic properties of the room can all be used together to generate an individualized soundscape. The level of complexity of the soundscape incorporated into the VE is application dependent [9]. Therefore, it is more pragmatic and economical to use spatial sounds only when they are effective and add value to the specific VE. For example, spatial sound may not be essential for a virtual lesson with an instructor speaking, whereas it will be an advantage in a table tennis training environment, where auditory feedback improves gameplay.
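To make the distinction concrete, the sketch below renders a mono source to two ears using only an interaural time difference (Woodworth’s approximation) and a crude interaural level difference. This is a minimal illustration, not the renderer used in our study; a faithful binaural pipeline would instead convolve the source with individually measured head-related transfer functions (HRTFs).

```python
import numpy as np

SAMPLE_RATE = 44100       # Hz
SPEED_OF_SOUND = 343.0    # m/s
HEAD_RADIUS = 0.0875      # m, average head radius (assumed)

def binaural_pan(mono, azimuth_deg, sample_rate=SAMPLE_RATE):
    """Very simplified binaural rendering of a mono signal.

    Applies an interaural time difference (Woodworth's formula) and a
    crude interaural level difference. A production renderer would
    instead convolve the signal with the listener's HRTFs.
    """
    az = np.deg2rad(azimuth_deg)
    # Woodworth ITD approximation: r/c * (az + sin(az))
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (az + np.sin(az))
    delay = int(round(abs(itd) * sample_rate))
    # Crude ILD: attenuate the far ear as the source moves sideways
    far_gain = 1.0 - 0.3 * abs(np.sin(az))
    near = np.asarray(mono, dtype=float)
    far = np.concatenate([np.zeros(delay), near])[:len(near)] * far_gain
    # Positive azimuth = source on the right, so the right ear is nearer
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right], axis=1)  # (n_samples, 2) stereo array
```

A plain stereo cue, by contrast, would apply only a fixed channel delay or gain imbalance independent of listener pose, which conveys lateralization but not a stable external source position.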
In an investigation of the efficacy of different sound types in a 3D VE, Høeg and colleagues [10] used a visual search task, in which participants were asked to search for a specific visual target randomly positioned in a scene. To assist the user in finding the target, a sound was played to indicate the target location. This auditory cue is akin to a friend calling our name from a crowd, which helps us find them more easily. The study measured the effect of different auditory cues on participant reaction times (RT), that is, the time to find the target in the scene. The authors found that binaurally presented cues, being spatially and temporally synchronous with the visual elements of the display, facilitated RTs more than stereo cues or the absence of cues. This finding is significant because it essentially shows that sound localization was better with binaural audio in a virtual environment. However, since the visual environment in this study was a simple 3D visual search display based on a 360° video, it is not clear whether the advantage of a binaural cue also holds in a more dynamic virtual environment with environmental noise, which is more likely to occur in a real-world setting.
It has been observed that the introduction of noise can obscure audio cues and hinder the detection of visual stimuli [11]. In recent research by Malpica, Serrano, Gutierrez and Masia [12], the introduction of different types of noise led to a severe drop in visual detection and recognition performance in virtual reality (VR), irrespective of the type of noise introduced.
In an inverse effect, sounds go undetected under high perceptual load in the visual modality, a phenomenon known as ‘inattentional deafness’ [13]. Moreover, visual distractors capture attention even when they are irrelevant to the task [14,15,16,17].
The above findings on the effect of auditory and visual noise on perception indicate that behaviour is affected by the presence of noise—both visual and auditory. Therefore, in our study, we tested the efficacy of different types of spatial sounds, specifically stereo and binaural, in a virtual environment with and without environmental noise. We used a visual search task with different auditory cues to test their relative effects on search performance. This task allowed us to use both visual and auditory modalities in the VE and to place our task at the intersection of the two modalities. In this manner, we studied both visual target identification and sound localization simultaneously in a noisy virtual environment. A sound localization task or a visual search task on its own would be insufficient to understand perception of auditory and visual stimuli in an ecologically valid virtual environment. Our setup enabled the study of the interaction between the visual and auditory modalities in a VE. In this manner, we mimicked a common scenario in everyday life—looking for someone in a crowd, which is made easier if they call to us. This is where the advantage of binaural audio—that it can be placed at a distance and elevation along an azimuth—comes into play. The sound source would be congruent with the visual stimulus location, enhancing stimulus detection. A stereo cue, in contrast, only has slight delays between the inputs to the left and right ears, which gives an illusion of depth but does not enable accurate sound localization. Therefore, the binaural audio cue is expected to facilitate visual search more than the stereo cue, as found in previous literature [10,18]. This will not necessarily hold in the condition with environmental noise, where both auditory and visual distractors will interfere with target search and localization.
The interim results from our study have already been described elsewhere [19]. The descriptive results indicated lower performance variability in the presence of an auditory cue, with a slight indication that the binaural cue may be more advantageous than the stereo cue. Participants did not report differences in mental load between the experimental conditions on any dimension measured using the NASA-TLX questionnaire [20]. Participants generally reported high spatial presence in the task as measured using the Igroup Presence Questionnaire (IPQ) [21].
In the present investigation, we derived four measures from the eye tracking data we collected during the experiment—two measures pertaining to the spatiotemporal characteristics of eye movements made during the task and two measures pertaining to the physiological response to the task. We chose these measures to obtain a comprehensive understanding of behaviour in a rich virtual environment. These measures would tease apart the different cognitive processes that contribute to task performance, allowing us to assess the effect of the environment and the different auditory cues.
The first measure, time to first fixation (TFF), quantifies the time taken to find the target, which is an indicator of the speed of target localization. The second measure, gaze trajectory length (GTL), quantifies the length of the search path, which gives us insight into the search process adopted by participants. The TFF results are expected to replicate those obtained by Høeg et al. [10]. We expect that binaural cues will result in shorter search times (TFF) and shorter search paths (GTL) than the stereo and no-cue conditions in the noise-free environment. Such a result would indicate quicker target detection with binaural cues in a realistic environment, which would strengthen the case for using binaural sound in VEs.
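Both measures can be computed directly from the gaze samples. The sketch below assumes a simplified data layout (timestamps in milliseconds, gaze directions as unit vectors and a precomputed boolean mask marking samples that fall on the target); the function names and the 100 ms fixation threshold are illustrative assumptions, not the exact parameters of our pipeline.

```python
import numpy as np

def time_to_first_fixation(timestamps_ms, on_target, min_dur_ms=100):
    """Time from trial onset until the first fixation on the target.

    `on_target` marks samples whose gaze falls inside the target's
    area of interest; a fixation is counted once gaze stays on target
    for at least `min_dur_ms` (an assumed threshold).
    """
    start = None
    for t, hit in zip(timestamps_ms, on_target):
        if hit:
            if start is None:
                start = t                       # candidate fixation onset
            elif t - start >= min_dur_ms:
                return start - timestamps_ms[0]  # TFF relative to trial start
        else:
            start = None                        # gaze left the target
    return np.nan  # target never fixated in this trial

def gaze_trajectory_length(gaze_dirs):
    """Length of the search path as the summed angular distance (in
    degrees) between consecutive gaze direction unit vectors."""
    v = np.asarray(gaze_dirs, dtype=float)
    cos = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).sum()
```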
We do not have a specific prediction about whether the same results will be obtained in the condition with environmental noise. Even if the cues are effective, the presence of distracting noise could make the search task more effortful. In our interim analysis, although there was no discernible pattern in the mental effort reported by participants, frustration was reported in the conditions with environmental noise [19]. Therefore, in the present study, we focused on two measures of mental effort that could be derived from the eye tracking data: blink rate and pupil size. Pupil diameter is a well-established indicator of cognitive load that increases as load increases [22,23,24,25]. It has been tested as an indicator of mental effort in practical applications such as combat [26], driving [27,28] and surgery [29]. In contrast, blink rate is a more ambiguous measure. In some studies, blink rate has been reported to decrease with cognitive load [30], while in others, it has been reported to increase with load [24]. A more complicated relationship between blink rate and different types of load has been found in other studies [31,32]. In our study, we expect pupil size to increase in the conditions with environmental noise, while no specific prediction is made for blink rate. We also expect pupil size to be larger in conditions without a cue than with a stereo or binaural cue. An advantage for the binaural cue in terms of cognitive load measures would be the ultimate benchmark for the utility of binaural sounds in VEs.
4. Discussion
In this study, we compared the effect of three types of auditory cues (no cue, stereo cue and binaural cue) on visual search behaviour in two types of virtual environments using four measures—time to first fixation (TFF), gaze trajectory length (GTL), blink rate and pupil size.
4.1. Behavioural and Eye Position Measures
First, we found a performance advantage for the binaural cue in comparison to trials where the cue was absent, whereas no such advantage was present for stereo cues. This improved performance was also visible in the target search duration (TFF). Participants were quicker to find the target with the help of a binaural cue than with a stereo cue or in the absence of an auditory cue (Section 3.2). This result is in line with the quicker search times obtained in the studies by Høeg and colleagues [10] and Brungart and colleagues [18].
The gaze trajectory length (GTL) measure, which quantified the length of the search path (Section 3.3), revealed a cue advantage as well. Trials with no auditory cue showed longer search paths than trials with binaural and stereo cues, clearly showing a benefit of the auditory cue. However, there was no difference between the search paths for the stereo and binaural cues.
Although not statistically significant, the boxplots and summary barplots (Figure 2a,b) show that, in the presence of a cue, search durations (TFF) were higher when the stadium had distractors (full stadium condition) than when it did not (empty stadium condition). This effect may be attributed to distracted search in the full stadium only in the presence of auditory cues, since no such effect is present when there is no auditory cue. This is visible in the individual participant data (Figure 2c), where 12 of 17 participants show lower search times in the empty than in the full stadium condition for binaural cues (11 for stereo cues). While research combining task performance with fully moving VEs is scarce, related research could provide additional insight. Olk and colleagues [35] reported slower detection of stimuli in a VE when those stimuli were harder to distinguish from surrounding objects, either because of their distance or their distinctiveness. This indicates that the minions, which were chosen to merge with the yellow elements of the VE, were indeed less distinctive when the players and audience with yellow uniforms appeared in the full stadium condition. Moreover, the appearance of a person—a strong social element—was recently shown to influence participants’ visual attention in virtual reality [36]: a person was fixated significantly more in a 360° video than in a 2D video. In our task, the players and audience in the full stadium condition would have similarly attracted attention. This validates the minions as an appropriate visual target for our task. Additionally, the presence of people, even though task-irrelevant, negatively impacted task performance in our study.
However, no such difference between the two stadium conditions is seen for the search paths. To understand this disparity, we additionally investigated TFF and GTL by separating the trials based on the location of the targets (Figure 6). For TFF in the empty stadium, we found a strong association between the target distance from the centre (in degrees) and the time to fixation of the target. For the full stadium, this association holds only in the presence of a binaural cue and, to a lesser degree, in the presence of a stereo cue. This implies that the full stadium interferes with the search process, as expected.
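An association of this kind can be quantified, for example, with a rank correlation between target eccentricity and TFF computed separately per stadium and cue condition. The sketch below is illustrative only; the column names and the choice of Spearman’s rho are assumptions, not a description of our exact analysis.

```python
import pandas as pd
from scipy.stats import spearmanr

def ecc_tff_association(trials: pd.DataFrame) -> pd.DataFrame:
    """Rank correlation between target eccentricity and TFF.

    `trials` is assumed to have one row per trial with hypothetical
    columns 'stadium' (empty/full), 'cue' (none/stereo/binaural),
    'target_ecc_deg' (target distance from centre) and 'tff_ms'.
    """
    rows = []
    for (stadium, cue), g in trials.groupby(["stadium", "cue"]):
        rho, p = spearmanr(g["target_ecc_deg"], g["tff_ms"])
        rows.append({"stadium": stadium, "cue": cue, "rho": rho, "p": p})
    return pd.DataFrame(rows)
```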
In contrast, for the gaze trajectory lengths, we could not find such a pattern. It remains unclear how the eye and head movements necessary for different target positions in the different stadium conditions moderate the overall results. This could have been due to a design shortcoming. As mentioned earlier, our participants did not always fixate exactly on the large, central blue cross before the start of each trial. Such inconsistent trial starts mean that the search did not always begin from the middle of the display, leading to inconsistent GTLs. One solution to this problem would be to force the participant to fixate on the central cross to begin the trial. Alternatively, a more dynamic setup would avert this problem by presenting the cue depending on the participant’s current gaze location.
Another source of inconsistency in the results was individual differences. Comparing the per-participant effects in Figure 3 and Figure 4, the individual measures vary strongly for both gaze paths (GTL) and search durations (TFF). We could not identify any general pattern describing the relative degree of change in either measurement. Larger sample sizes are required in future studies to mitigate the effect of this variability. Any variability stemming from differing levels of familiarity with virtual reality technology, although low as mentioned in Section 2.1, could also be explored with a larger sample size. In spite of individual effects heavily moderating the size of the difference between the absence and presence of an auditory cue in this visual search task, we found that the presence of any auditory cue speeds up search performance.
Overall, the search duration (TFF) and search path (GTL) measures proved to be useful metrics of search behaviour in our task. Together, they revealed a search advantage of auditory cues, with the binaural cue being slightly more advantageous than the stereo cue.
4.2. Physiological Measures
The next two measures we tested—blink rate and pupil size—were physiological measures, both of which have been studied in virtual environments. Although not statistically significant, we found lower blink rates in the empty stadium conditions than in the full stadium conditions. However, the trials without cues did not show such an effect. On the contrary, we observed lower blink rates in the easier trials with the binaural cue. Blink rate is an inconclusive measure of cognitive effort [37]. Blink rate is known to decrease in cases of extreme focus and increased workload, as observed in surgeons [30,38]. Veltman and Gaillard [30] point to a distinction in the underlying factors that affect blink rate: they found that blink rate decreased when more visual information had to be processed, while it increased when the difficulty of the task increased. In an experiment systematically varying visual and cognitive demands, Recarte and colleagues [31] found that blink rate decreased with visual load and increased with mental load. In a driving task, Merat et al. [32] found a similar fall in blink rate with increased visual information in the absence of a secondary task. Adding a secondary task increased blink rates, although some results did not fit this pattern, indicating a tradeoff in blink behaviour between visual information intake and mental workload. These U-shaped patterns in blink rates have been interpreted differently by others. Berguer and colleagues [39] found that surgeons had lower blink rates when performing surgery than at rest, but that blink rate increased when performing the same surgery in a laparoscopic environment. They interpreted this result as the outcome of a conflict between task demand or stress and concentration. Zheng et al. [38] found, in a VR laparoscopic surgery setting, that participants who reported more frustration and mental effort on the NASA-TLX blinked less frequently. It is also worth noting that some studies have not reported an effect of mental load on blink rate even when pupil size or other measures responded to load [28,40].
Given the ambiguous nature of the factors affecting blink rate discussed above, our results did not show a discernible pattern that would draw parallels to any of the above literature. Although the full stadium condition had more visual information in the display, this information was task-irrelevant and therefore cannot be equated with the demand of having to process additional visual information as described above. Our task may also have been too easy to elicit an effect of visual or mental load in comparison to the difficult driving and surgery scenarios that have been studied. In addition, the median blink rates and an inspection of individual participant blink rates revealed high variability in the data. Large variability in blink rate was also reported by Benedetto et al. [28]. Although blink rate is known to decrease in head-mounted VR displays in comparison to monitors or natural settings [41], some participants in our data showed extremely high blink rates (up to 15), which may indicate poor data quality. Blinks are identified when the pupil is not detected by the eye tracker. The Tobii Glasses eye tracker we used was embedded in the VR headset, which should have resulted in less data loss. However, loss of the pupil size data stream occurred more frequently for some participants (as high as 19% for one participant). Some participants wore glasses and/or contact lenses, which may have resulted in higher data loss. This is a shortcoming of video-based eye tracking that needs to be overcome to increase the reach of eye-tracking-integrated VR setups. A simple solution to this problem would be to record a video of the participant’s eyes, which would allow blinks to be identified manually.
The comparability of our results with the existing literature is further complicated by the fact that blink sensors (remote eye tracker, head-mounted eye tracker, EOG) and blink detection algorithms (manual, automatic, different duration thresholds, etc.) all differ between studies. If future studies report data quality and the precise parameters used for blink detection, it will become easier to reach a consensus on this complex measure.
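As an illustration of why these parameters matter, the sketch below implements one common gap-based detector: stretches of missing pupil data are labelled as blinks only if their duration falls between two thresholds. The 70 ms and 500 ms values are assumptions for illustration, not the thresholds used in our study.

```python
import numpy as np

def detect_blinks(timestamps_ms, pupil_mm, min_dur_ms=70, max_dur_ms=500):
    """Label gaps in the pupil data stream as blinks.

    Gaps shorter than `min_dur_ms` are treated as tracking noise and
    gaps longer than `max_dur_ms` as data loss rather than blinks;
    both thresholds are assumptions that should be reported.
    Returns a list of (start_ms, end_ms) blink intervals.
    """
    missing = np.isnan(np.asarray(pupil_mm, dtype=float))
    blinks = []
    start = None
    for t, m in zip(timestamps_ms, missing):
        if m and start is None:
            start = t                     # gap begins
        elif not m and start is not None:
            dur = t - start               # gap ends; check its duration
            if min_dur_ms <= dur <= max_dur_ms:
                blinks.append((start, t))
            start = None
    return blinks
```

Varying either threshold changes the resulting blink rate considerably, which is one reason results are hard to compare across studies that do not report these parameters.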
Our last measure, pupil size, showed only an effect of the stadium, with larger pupil sizes in the full stadium condition. This effect is clearly due to luminance differences between the two scenes. Pupil size responds to both changes in luminance and cognitive effort [25], and our result shows that task-evoked pupillary responses (TEPRs) were not separable from luminance effects. However, even in the empty stadium trials, where we can reasonably assume equivalent luminance across cue conditions, any increase in cognitive effort that may have been present was not visible in our results except as a small decrease in median pupil size. It should be noted that TEPRs are small changes that require a large number of trials to be averaged and that reflect large changes in cognitive load [22,23]; neither condition applied to our study.
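For completeness, a typical TEPR analysis subtracts a pre-stimulus baseline from each trial and averages the corrected traces over many trials. The sketch below illustrates this standard approach under assumed trial alignment and an assumed baseline window; it is not the analysis performed in our study.

```python
import numpy as np

def baseline_corrected_tepr(trials_pupil, baseline_samples=30):
    """Average task-evoked pupillary response across trials.

    `trials_pupil` is a (n_trials, n_samples) array of pupil size
    aligned to cue onset; the first `baseline_samples` of each trial
    (an assumed pre-cue window) serve as the subtractive baseline.
    NaNs from blinks/data loss are ignored via nan-aware means.
    """
    trials = np.asarray(trials_pupil, dtype=float)
    baseline = np.nanmean(trials[:, :baseline_samples], axis=1, keepdims=True)
    return np.nanmean(trials - baseline, axis=0)  # mean TEPR time course
```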
Although our scene was chosen to be visually and auditorily realistic, which was an advantage for immersion and presence in the VE, the realistic stimulus was also partly the reason why the physiological measures did not perform well. The same study could be conducted with more fine-grained control over the visual and auditory noise levels in the environment. The experiment could include conditions with only auditory noise or only visual noise to isolate the effect of noise from the amount of visual input that needs to be processed. This would enable the use of both pupil size and blink rate.
For future use of pupil size in our paradigm, besides correcting the design constraints mentioned above, technical sources of error need to be accounted for as well. Eye movements themselves cause distortions in measured pupil size, which are most evident at different camera viewing angles. Correcting these distortions requires complex mathematical models [42]. Most eye trackers incorporate these perspective-distortion corrections; however, individual differences might still exist (described and modelled in Mathur et al. [42]). One solution is to use measures that are resistant to luminance changes and pupil distortions, such as the Index of Pupillary Activity [43], which measures changes in the oscillatory behaviour of the pupil signal. However, this measure requires long trial durations, which most of our trials did not have.
4.3. Eye Tracking and Immersive VR
Eye tracking gives us access to many dimensions of behaviour. It extends the study of simple behavioural responses by giving us a more fine-grained insight into human interaction with the environment. In our task, we could have asked participants to simply press a button on detecting a target. Recording eye movement data instead allowed us to look more closely into the search strategy of each participant. It also allowed participants to perform the task more naturally, without having to remember which buttons to press. Such eye movement paradigms make stimulus-response paradigms more seamless and ecologically valid. However, as discussed above, some of the measures obtained from eye tracking data have shortcomings that need to be overcome. Higher fidelity signals will be required in the future for the effective use of systems that provide metrics of cognitive effort for improving user experience and for providing user feedback.
In spite of the lack of results from the physiological measures, our eye position measures revealed a definitive advantage of an auditory cue for target localization and detection in a virtual environment. We also found that visual and auditory noise interfered with target localization in the presence of facilitating cues. There is also an indication of the usefulness of the binaural cue, which was seen in spite of the large individual differences across the different degrees of environmental noise. This evidence supports the use of spatial sound in virtual environments to improve responsiveness and immersion. Our study can be extended in future research with different environments and different degrees of noise to obtain a more comprehensive understanding of sound localization and perception in realistic VEs. This would enable the design of more effective virtual environments with appropriate use of binaural sounds.