1. Introduction
At present, an operator monitors the air and space situation and makes search judgments about relevant targets mainly through a radar interface [1]. With the explosive growth in the number of aircraft, the air and space situation is becoming increasingly complex [2,3,4], and the limited information capacity of the 2D graphical interface is exposed: the height, distance, and direction of a flight target are difficult to read from the radar display at first glance, which seriously reduces search and localization efficiency. The rapid development of mixed reality (MR) technology offers a way to address these problems: its three-dimensional display broadens the information dimension, enables holographic perception of the overall situation, and fills gaps in the types of information presented [5,6,7,8]. However, the richer information also imposes greater visual fatigue and cognitive load on the operator during search and judgment [9,10]. Multimodal interaction is an important means of human–computer interaction, and interaction efficiency can be significantly improved by combining different sensory channels [11,12,13]; multimodal solutions that combine vision with hearing, touch, and other modalities are becoming increasingly common. In this paper, the operator's visual search judgment is supported mainly by adding auditory assistance in an MR environment [14]. Sound elements are paired with spatial perception to indicate the spatial characteristics of the target, where these characteristics chiefly stand for the target's position information. In a spherical coordinate system [15], the position of a 3D object is uniquely determined by its height h, distance r, and azimuth angle θ. Existing work offers numerous studies on the intuitive perception of, and mutual induction between, pitch and spatial height, volume and spatial distance, and sound channel alternation and spatial direction. This paper reviews the research results and current progress of auditory-assisted vision from these three perspectives.
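For concreteness, the triple (h, r, θ) can be mapped to a point in the MR scene as in the minimal sketch below, which assumes r is the horizontal (ground-plane) distance and θ the azimuth from a reference axis; the exact convention used in the experimental scene may differ.

```python
import math

def target_position(h, r, theta_deg):
    """Place a target from (height h, distance r, azimuth theta).

    Sketch only: r is taken as the horizontal (ground-plane) distance and
    theta as the azimuth from the x-axis, which may differ from the paper's
    exact coordinate convention.
    """
    theta = math.radians(theta_deg)
    return (r * math.cos(theta),  # x
            r * math.sin(theta),  # y
            h)                    # z (height)

# Example: a target 500 m away, 300 m up, at a bearing of 45 degrees.
print(target_position(300.0, 500.0, 45.0))
```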
It is intuitively clear that there is a connection between pitch and spatial height. Cesare V. Parise and Katharina Knorre [16] established a mapping between auditory frequency and perceived vertical height by measuring the statistics of natural auditory signals: their experiments show a consistent mapping between sound frequency and the average height of its external spatial source, which is particularly evident in the middle range of the spectrum between 1 and 6 kHz. Moderate but consistent frequency-dependent biases were also present in horizontal sound localization. These results suggest a significant frequency-dependent bias in sound localization that depends on the statistics of the natural auditory scene as well as on the filtering properties of the outer ear. This offers important groundwork for frequency-dependent mappings such as the pitch–height mapping studied in this paper and provides a theoretical basis for their further study and broader application in MR environments. The mapping is also supported by neurological evidence: using fMRI, Kelly McCormick [17] identified a pitch–height correspondence by observing consistent pitch–spatial height effects in the bilateral inferior frontal and insular cortex, the right frontal eye field, and the right subparietal cortex. Applications of pitch–height cross-modal mapping are also becoming more abundant and mature. Marco Pitteri [18] described the SMARC effect in the music domain: pitch height can be expressed in a spatial format, with response times (RTs) faster for low pitches when the response is executed in the lower part of space and faster for high pitches when the response is executed in the upper part of space. This provides an important theoretical basis for the search performance examined in this paper. Judith Holler's [19] research shows that the co-speech pitch gestures of Dutch and Persian speakers reveal a mapping of high pitch to upper space and low pitch to lower space, and this gestural space–pitch mapping tends to occur simultaneously with the corresponding spatial word (high/low). In addition, Sarah Dolscheid [20] confirmed that speakers of different languages differ in their space–pitch associations, while their respective data still show a strong correlation between pitch and spatial height; moreover, the pitch–height mapping is more malleable than other mapping correlations. The above studies and other related work [21,22] have explored the pitch–height indication relationship at both the physiological and application levels, but the mechanism has not been verified or applied in a mixed-reality environment, and whether it transfers to the scenario of this paper remains to be investigated.
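To make the pitch–height cue concrete, the sketch below maps three hypothetical height regions to three hypothetical pitches and synthesizes a short sine-tone cue; the actual frequencies and audio material used in the experiments are not specified here and may differ.

```python
import numpy as np

# Hypothetical pitch levels for the three height regions (Hz); the
# frequencies actually used in the experiments may differ.
PITCH_BY_REGION = {"high": 1200.0, "medium": 600.0, "low": 300.0}

def height_cue(region, duration_s=0.5, sample_rate=44100):
    """Synthesize a sine-tone cue whose pitch encodes the target's height region."""
    freq = PITCH_BY_REGION[region]
    t = np.linspace(0.0, duration_s, int(sample_rate * duration_s), endpoint=False)
    return 0.5 * np.sin(2.0 * np.pi * freq * t)

tone = height_cue("high")  # cue intended to draw attention to the upper region
```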
Volume and distance are also correlated: the farther away a source is, the quieter it sounds [23]. Exploiting this property, the cognitive mechanisms by which volume indicates, offsets, or enhances the perception of visual spatial distance have been intensively investigated and are widely applied. Paul Luizard [24] presented a heuristic model based on a parametric analytic solution of the diffusion equation and a physical approximation of sound energy behavior; it is used to inform the design of geometric structures in concert halls and explains the energy decay properties of sound level in audio-visually coupled spaces, providing important support for the volume design principles adopted later in this paper. Andrew J. Kolarik [25] showed experimentally the coupling between visual distance properties and sound source distance and found that participants with partial visual loss misjudge the relationship between room size and sound source distance relative to the judgments of participants with normal vision. Scott A. Hong [26] investigated hearing damage to infants from excessive volume by placing three microphones at different spatial distances to simulate different volume levels. Min-Chih Hsieh [27] presented electric vehicle warning sounds at different volume levels and distances and measured how efficiently the warnings were recognized, thereby investigating how auditory warnings at different volume levels influence visual perception. Hongrae Jo [28] used combined visual and acoustic methods to explain bubble condensation: the bubble volume was measured visually to determine the distance it traveled in the vertical direction, and the auditory perception of the volume-converted sound pressure signal was used to obtain its relative position, reflecting the consistency between visual distance and auditory volume as indicators. These volume–distance cross-modal applications exemplify the validity of auditory indication of distance and its strong link to cognition. By setting three different distances, Like Jiang [29] studied how quickly sounds at different distances affect visual judgment, which can be used to evaluate the impact of traffic noise on the efficiency of visual judgments on highways. That study also informs the experimental design of this paper: by studying the correspondence between different distances and different sounds, the correlation between volume and distance indication can be confirmed. The mechanisms, experiments, and applications of volume-indicated distance across scientific fields are illustrated by the above results and other related studies [30,31,32]. However, whether this mechanism transfers smoothly and reasonably to an MR environment, and whether it remains consistent with the underlying cognitive mechanisms, needs further investigation.
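As a point of reference for how volume can encode distance, the sketch below applies the standard free-field inverse-distance law, under which sound pressure falls by roughly 6 dB per doubling of distance; the gain curve actually used in the experiments is a separate design choice.

```python
import math

def distance_gain(distance_m, ref_distance_m=1.0):
    """Linear gain of a point source under the free-field inverse-distance (1/r) law."""
    d = max(distance_m, ref_distance_m)   # clamp to avoid gains above the reference
    return ref_distance_m / d

def distance_gain_db(distance_m, ref_distance_m=1.0):
    """The same attenuation expressed in decibels relative to the reference distance."""
    return 20.0 * math.log10(distance_gain(distance_m, ref_distance_m))

# Example: doubling the distance attenuates the cue by about 6 dB.
print(distance_gain_db(2.0))   # ~ -6.02 dB
```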
Direction in visual space can likewise be indicated by sound. The head-related transfer function (HRTF), which describes the transmission of sound waves from a source to the two ears, is the basis of a very effective class of sound localization algorithms. Dmitry N. Zotkin and Ramani Duraiswami [33] detailed the algorithmic aspects of HRTF interpolation, room impulse response creation, HRTF selection from a database, and audio scene representation by selecting personalized HRTFs based on anatomical measurements, and thereby constructed a theoretical framework for rendering spatial audio in a virtual auditory space. This provides an important theoretical basis for auditory-assisted visual–spatial localization. V. R. Algazi and R. O. Duda [34] studied the statistics of anthropometric parameters and the correlation between anthropometric measurements and certain temporal and spectral features of the HRTF, and they released a public database, contributing to research on auditory localization. Sound can also indicate direction through binaural channel alternation and differences in arrival time. M. Houtenbos [35] explored the effects of different audio channel input sequences on perceived orientation in joint visual-auditory interaction when complete visual cues are absent, using audio "beeps" presented in different lateral orderings (left ear before right, or right ear before left) corresponding to the target's direction of approach. The experiments showed that this way of indicating direction can substantially improve the efficiency of the visual-auditory display and the accuracy of perceived direction, which informed the specific experimental protocol in this paper. Tahir Mushtaq Qureshi [36] proposed a digital waveguide-based vocal tract model that decomposes the tract into cylindrical segments of uniform length, examines the propagation time of sound waves along its axial direction, and infers the associated direction from this time difference. Alfonso Nieto-Castanon [37] analyzed subjects' vocalization data and found that the degree of acoustic variation along a given joint direction was relatively strong, confirming a consistent relationship between the vocal tract as a target variable and orientation. These studies effectively demonstrate the effectiveness of direction indication based on sound channel alternation and its region-specific cognitive induction. However, the related mechanisms have rarely been applied in MR environments, and this needs to be explored through dedicated experiments.
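One simple, well-established ingredient of such binaural direction cues is the interaural time difference (ITD). The sketch below uses Woodworth's spherical-head approximation to estimate the ITD for a distant source at a given azimuth; it is illustrative only and is not the HRTF rendering pipeline of the cited studies.

```python
import math

HEAD_RADIUS_M = 0.0875     # typical adult head radius (assumed value)
SPEED_OF_SOUND_M_S = 343.0

def interaural_time_difference(azimuth_deg):
    """Woodworth's spherical-head approximation of the ITD for a far source.

    azimuth_deg: 0 = straight ahead, positive toward the right ear.
    Returns the arrival-time difference between the two ears in seconds.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (math.sin(theta) + theta)

# Example: a source at 90 degrees arrives ~0.66 ms earlier at the near ear,
# which listeners perceive as a strong lateral cue.
print(interaural_time_difference(90.0))
```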
In summary, prior work shows that pitch, volume, and sound channel alternation in joint audio-visual interaction are closely related to spatial height, distance, and direction in real environments, with a wide range of applications in the positioning, medical, and transportation fields [38,39,40]. However, there is still a gap in how to efficiently optimize spatial–visual interaction through auditory assistance in MR environments. Whether the mechanisms observed in real space remain applicable in an MR environment, and whether the cognitive mechanisms change, has not yet been explored; further study of visual-auditory multimodal interaction in MR environments is therefore stalled, and the search tasks of command and control in a holographic environment remain constrained. In addition, research on integrating the three types of elements above is lacking, so even when each element is available separately, rapid and accurate localization of sound and spatial position cannot be achieved. Therefore, in this paper, experiments were designed to investigate the connections between sound characteristics and the spatial elements of MR, namely pitch–height, volume–distance, and sound channel alternation–spatial orientation, and to validate the effect of fusing these auditory elements to indicate spatial location.
4. Discussion and Analysis
This paper focuses on the connections between auditory pitch, volume, and sound channel alternation schemes and spatial height, distance, and direction in an MR display environment in the context of air and space situations, and it explores the induced mechanisms and cognitive properties of auditory-assisted visual–spatial search and localization. The results of the three experiments show that the auditory indication elements are strongly correlated with the corresponding target characteristics, and the variability among the datasets for the different pitch–height, volume–distance, and channel alternation–azimuth conditions is obvious. This indicates that the indication relationships do not interfere with one another and are clearly distinguishable, providing sufficient cognitive discrimination.
In Experiment 1, we used audio cues with different pitches to assist subjects' spatial–visual height judgments in an MR environment. The relationship between the spatial area indicated by the auditory cue and the area of human visual cognition was determined by comparing reaction times. Data analysis showed that the association between pitch and target height was significant: the three mappings, high pitch–high region, medium pitch–medium region, and low pitch–low region, yielded the shortest response times, with little difference among the three pitch conditions. This suggests that high, medium, and low pitches guide subjects' visual cognition to the high, medium, and low regions, respectively. In addition, when the pitch–height association was mismatched, the search process still conformed to human reverse-cognition patterns and the cognitive properties of airborne targets.
In Experiment 2, subjects' spatial–visual distance discrimination in an MR environment was assisted by audio at different volume levels. By comparing distance discrimination with and without volume-based auditory cues, we analyzed whether volume optimized human visual distance perception and whether the underlying cognitive mechanisms changed. Without a volume cue, discrimination efficiency for distant, medium, and near targets increased in that order. With a volume cue, discrimination efficiency for far and near targets was similar, and both were lower than for intermediate-distance targets; in all cases, however, the results were better than without volume assistance. This illustrates the strong correlation between volume and distance in the MR environment and the benefit of volume-assisted distance optimization.
In Experiment 3, different sound channel alternation schemes were set up to assist subjects in discriminating spatial–visual directions in an MR environment. Combining intuitive interaction theory and coding theory, a channel alternation scheme was designed for the direction-indicating sounds: short audio in the left or right channel indicated the left or right direction, and "long sound plus short sound" audio in the left or right channel indicated the left-back or right-back direction. Experimental verification showed a clear optimization effect in every direction when the channel alternation scheme assisted the interaction. Moreover, the degree of optimization was greater for the left-back and right-back directions, which further reduced the differences in judgment time between directions. This shows that the approach improves subjects' direction discrimination in the MR environment in all respects.
In a comprehensive experiment, the three auditory properties were combined into single audio cues, each indicating a specific location in space. The overall visual space was divided into 36 areas by combining height, distance, and direction, following the same divisions used above. Correspondingly, 36 audio cues with different pitch, volume, and channel alternation patterns were created, and the correlation between cues and areas was studied to determine the effectiveness of auditory-assisted indication for visual position judgment in MR environments. An evaluation method was introduced for this assessment, and the results showed a clear optimization effect.
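As an illustration of how the three cue dimensions could be combined for the 36 regions, the sketch below maps a (height region, distance band, direction) triple to a cue description consisting of a pitch, a gain, and a left/right long-short channel pattern; all concrete parameter values are hypothetical and stand in for the settings used in the experiments.

```python
# Hypothetical parameter tables standing in for the experimental settings.
PITCH_HZ = {"high": 1200.0, "medium": 600.0, "low": 300.0}
GAIN = {"near": 1.0, "medium": 0.5, "far": 0.25}
CHANNEL_PATTERN = {
    "left": ["L-short"],
    "right": ["R-short"],
    "left-back": ["L-long", "L-short"],
    "right-back": ["R-long", "R-short"],
}

def fused_cue(height_region, distance_band, direction):
    """Describe one of the 36 (3 x 3 x 4) fused audio cues as a parameter set."""
    return {
        "pitch_hz": PITCH_HZ[height_region],
        "gain": GAIN[distance_band],
        "channel_pattern": CHANNEL_PATTERN[direction],
    }

# Example: the cue for a high, far, left-back target.
print(fused_cue("high", "far", "left-back"))
```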