3. Developing RVM-IVRSA
3.1. Framework
Figure 2 introduces the main functions, network architecture, and underlying data classification of the RVM-IVRSA system used in the experiment, organized by its functional, network, and data layers. In addition, we applied several techniques and solutions to optimize the user experience in motion tracking, voice communication, scene synchronization, and visual delay compensation.
3.2. Established Scene
The implementation of RVM-IVRSA functionality is based on a virtual space scenario. We set up a scene in Unity3D and imported a character model and a SteamVR camera object, as shown in Figure 3. When the scene is running, “Camera” presents changes in the visual field to the user according to the movement of the HMD. The “Controller (left)” and “Controller (right)” objects under “Camera” simulate the motion of the two VIVE controllers. The character model in the scene does not represent the local user; it is driven by operational parameters received from the other terminal in order to render the actions of the other user.
3.3. Motion Tracking
A very important function in the functional layer is the optical positioning of the HMD and the two controllers (three tracked points in total), which captures the user’s head and hand movements, the basis of body language expression in virtual space [23]. The HTC VIVE system includes the following components: a VIVE HMD, two Lighthouse laser base stations, and two wireless controllers. The most traditional way to track head position with VR headsets is to use inertial sensors, but inertial sensors can only measure rotation (about the X, Y, and Z axes, i.e., three degrees of freedom), not translation (along those axes, the other three of the full six degrees of freedom) [24]. In addition, inertial sensor drift error is relatively large, so more accurate and unconstrained tracking of head movement requires an additional location-tracking technology. Instead of the usual optical-lens-and-marker positioning system, the Lighthouse system used by HTC VIVE consists of two laser base stations. Each base station contains an infrared LED array and two infrared laser emitters mounted on mutually perpendicular rotating axes, each completing a sweep in 10 ms. The base station operates on a 20 ms cycle: at the beginning of a cycle, the infrared LED array flashes as a synchronization signal; during the first 10 ms, the rotating X-axis laser sweeps the user’s free activity area; during the remaining 10 ms, the rotating Y-axis laser sweeps the area while the X-axis laser does not emit light [25,26].
A number of photosensitive sensors are installed on the HMD and the controllers. After the base station’s LED flashes, the sensors synchronize to this signal and measure the time it takes the X-axis laser and the Y-axis laser to reach each sensor. This is exactly the time it takes each laser to rotate to that particular angle, and thus the X-axis and Y-axis angles of the sensor relative to the base station are known. Since the positions of the photosensitive sensors distributed on the HMD and controllers are also known, the position and motion trajectory of the HMD can be calculated from the differences between the sensor readings. HTC VIVE’s positioning system captures the spatial position and rotation of the HMD and the two controllers and maps them into virtual space [27]. The schematic diagram is shown in Figure 4.
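To make the timing-to-angle relationship concrete, the sweep geometry described above can be sketched as follows. This is an illustrative sketch, not HTC’s implementation; the assumption that a rotor covers 180 degrees during its 10 ms sweep, and the function names, are ours.

```python
SWEEP_MS = 10.0  # each rotor sweeps the tracked volume in 10 ms, per the 20 ms cycle above

def hit_time_to_angle(hit_ms):
    """Convert the delay between the sync flash and the laser hitting a
    photodiode into the diode's angle relative to the base station.
    Assumes the rotor covers 180 degrees during its 10 ms sweep."""
    return 180.0 * hit_ms / SWEEP_MS

def sensor_angles(x_hit_ms, y_hit_ms):
    """The X sweep occupies the first 10 ms of the cycle and the Y sweep
    the second; y_hit_ms is measured from the start of the Y sweep."""
    return hit_time_to_angle(x_hit_ms), hit_time_to_angle(y_hit_ms)
```

With two such angle pairs (one per base station) and the known sensor layout on the HMD, the pose follows by triangulation.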
The screen of a user using RVM-IVRSA is shown in Figure 5. It displays the user’s perspective in the system; the movements of the user’s head and hands are reflected in shape changes of the model’s head and hands at the other terminal, as seen by the other observer.
In software development, we took a relatively simple path: the two terminals send each other the user status data obtained by HTC VIVE’s positioning system. The HTC VIVE device was used without full-body tracking accessories, so the captured data included only the status of the user’s head and hands. We assigned the other user’s state data to the head and hands of the model in the scene, completing the morphological changes of the head and hands of that user’s identification (the model) in the three-dimensional scene.
In the concrete implementation, we added the component “SteamVR_TrackedObject” to the “Controller (left)” and “Controller (right)” children of the “Camera” object; all traceable devices are stored in this “TrackedObject” class. We used the GetComponent method (we set a SteamVR_TrackedObject variable T with the statement “T = GetComponent<SteamVR_TrackedObject>()”, which establishes the tracking relationship between T and the controller) to obtain the currently tracked object and the controller input.
The component “SteamVR_Controller” was used to manage input controls for all devices. We set a variable D of class “SteamVR_Controller.Device” to the device at T’s index (implemented by the statement “D = SteamVR_Controller.Input((int)T.index)”). An if statement, “if (D.GetPressDown(SteamVR_Controller.ButtonMask.Trigger))”, monitors whether the trigger is pressed; GetPressDown returns a Boolean value. We designed the system so that each time the trigger is pressed, the “Camera” moves forward 1 m (according to the angle between the “Camera” and the X-axis, the system calculates the X increment of the position parameter when the forward action occurs; the Y increment is calculated in the same way), thus realizing the user’s movement function in the scene.
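The decomposition of the 1 m forward step into position increments is basic trigonometry. A minimal sketch (in Python rather than Unity C#, with hypothetical names), assuming the angle is measured between the camera’s facing direction and the X-axis:

```python
import math

STEP_M = 1.0  # each trigger press moves the camera forward 1 m

def forward_increment(angle_deg, step=STEP_M):
    """Decompose a forward step into X and Y position increments from the
    angle between the camera's facing direction and the X-axis."""
    rad = math.radians(angle_deg)
    return step * math.cos(rad), step * math.sin(rad)
```

For example, facing along the X-axis (angle 0) moves the camera 1 m along X and 0 m along Y.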
3.4. Voice Communication
Voice data transfer uses the microphone in HTC VIVE’s HMD to collect the user’s voice data and create a WAVE file to store it. We used the DirectSound class of the DirectX API to complete voice acquisition. The collected voice data are transmitted to the other client through a TCP socket. When a client receives a voice file, it writes the acquired data to a buffer and calls the function “GetVoiceData()” [28,29,30]. Real-time voice communication allows users to communicate more promptly. At the same time, the user’s tone, accent, and idiom are carried in the audio signal, which shows the user’s mood and personality more comprehensively.
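Because TCP delivers a byte stream with no message boundaries, the receiver needs a way to delimit each voice chunk before writing it to its buffer. One common approach, shown here as an illustrative sketch (the frame layout is our assumption, not the system’s documented protocol), is length-prefixed framing:

```python
import struct

def pack_voice(data):
    """Prefix the voice payload with a 4-byte big-endian length so the
    receiver can delimit messages on the TCP byte stream."""
    return struct.pack(">I", len(data)) + data

def unpack_voice(buf):
    """Return (payload, remaining_bytes), or (None, buf) if the buffer
    does not yet contain a complete frame."""
    if len(buf) < 4:
        return None, buf
    (n,) = struct.unpack(">I", buf[:4])
    if len(buf) < 4 + n:
        return None, buf
    return buf[4:4 + n], buf[4 + n:]
```

The receiver loops over `unpack_voice` on its accumulated buffer, handing each complete payload to the playback routine.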
3.5. Scene Synchronization
Unlike traditional network data communication, the synchronization data of RVM-IVRSA are not text, pictures, or files, but a virtual environment containing multiple 3D models. Therefore, scene synchronization had to be considered in the design of the test platform.
The test platform adopts an ECS (entity component system) architecture [31], whose pattern follows the principle of composition over inheritance. Each basic unit in the scene is an entity, and each entity is composed of one or more components. Each component contains only the data that represent its characteristics. For example, MoveComponent contains properties such as speed, location, etc.; once an entity owns a MoveComponent, it is considered able to move. A system is a tool for processing the collection of entities that share one or more components. In this example, the moving system is concerned only with moving entities: it walks through all entities that have a MoveComponent and updates their locations on the basis of the relevant data (speed, location, orientation, and so on). Entities and components have a one-to-many relationship, and what capabilities an entity has depends entirely on which components it has. By dynamically adding or removing components, the behavior of an entity can be changed at run time [32,33].
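The MoveComponent example above can be sketched in a few lines. This is a minimal illustration of the ECS pattern in Python (the test platform itself is a Unity project), with names of our own choosing:

```python
class Entity:
    """An entity is just a bag of components, keyed by component type name."""
    def __init__(self):
        self.components = {}

    def add(self, component):
        self.components[type(component).__name__] = component

    def get(self, name):
        return self.components.get(name)

class MoveComponent:
    """Pure data: position and velocity, no behavior."""
    def __init__(self, x, y, vx, vy):
        self.x, self.y, self.vx, self.vy = x, y, vx, vy

def move_system(entities, dt):
    """Process every entity that owns a MoveComponent; skip the rest."""
    for e in entities:
        m = e.get("MoveComponent")
        if m is not None:
            m.x += m.vx * dt
            m.y += m.vy * dt
```

Adding or removing a MoveComponent at run time is all it takes to make an entity start or stop moving, which is the composition-over-inheritance point made above.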
On the basis of the ECS architecture of the test platform, we adopted a deterministic lockstep synchronization mechanism to realize scene synchronization [34]. When a client’s operation data start to upload to the server, the server locks the current frame and does not start to simulate the scene until the data of all terminals have been uploaded. During simulation, the server processes the simulation results into client instruction data, which are forwarded to all user terminals. Finally, each user terminal runs its own simulation on the basis of the forwarded instructions just received. The schematic diagram of the deterministic lockstep synchronization mechanism is shown in Figure 6.
Here, we used two PCs as terminals along with an Alibaba ECS cloud server. The transmitted data were the terminal users’ input data, namely, the real-time position parameters of the HMD, the key parameters of the controllers, and voice files. The operation parameter values from the two terminals were first consolidated into two sets of packets, each containing a token value marking the number of the source terminal (an int variable called “T number”, assigned the values 1 and 2 for terminals 1 and 2, respectively). The packets were then bundled and sent to both terminals. Each terminal filters on the basis of the “T number” value, retaining the operation parameters from the other terminal and assigning them to the character model in its scene.
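The frame locking and “T number” filtering just described can be sketched as follows. This is an illustrative model of the mechanism, not the platform’s actual server code, and the packet fields are our assumptions:

```python
def merge_frame(packets):
    """Server side: the frame stays locked until input from every terminal
    has arrived; only then is the merged packet list broadcast."""
    terminals = {p["T_number"] for p in packets}
    if terminals != {1, 2}:
        return None  # frame still locked, waiting for the missing terminal
    return list(packets)

def apply_remote(merged, my_t_number):
    """Client side: keep only the other terminal's parameters to drive
    the remote user's character model."""
    return [p for p in merged if p["T_number"] != my_t_number]
```

Because every terminal simulates the same merged input in the same order, the two scenes evolve deterministically in lockstep.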
3.6. Compensating for Visual Delay
RVM-IVRSA is an immersive virtual reality system with distributed characteristics. When an immersive virtual reality system is used, visual delay occurs: when the HMD undergoes angular motion, the image generation time in the scene lags behind the motion time. Humans can sense a visual delay of more than 30 ms; to maintain a good sense of immersion, a scene needs a frame rate of at least 15 fps and a visual delay of less than 100 ms [35]. There is also some network delay in RVM-IVRSA because its TCP socket communication must establish a connection before communicating [36,37]. To reduce the perceived delay, simulation algorithms are needed to estimate the motion of the HMD in advance.
The dead reckoning (DR) algorithm can compensate for visual delay and network delay [38,39,40]:

$$\hat{P}(t+\Delta t) = P(t) + V(t)\,\Delta t + \tfrac{1}{2}A(t)\,\Delta t^{2} \qquad (1)$$

where $P(t)$ is the position vector at time $t$, $V(t)$ is the velocity vector at time $t$, $A(t)$ is the acceleration vector at time $t$, and $\hat{P}(t+\Delta t)$ is the estimated position vector at time $t+\Delta t$. The calculation error in Formula (1) is

$$E(t+\Delta t) = P(t+\Delta t) - \hat{P}(t+\Delta t) \qquad (2)$$

where $P(t+\Delta t)$ is the actual position vector at time $t+\Delta t$.
All position vectors in Equations (1) and (2) consist of the position and angle parameters of the HMD, that is, $(x, y, z, \alpha, \beta, \gamma)$.
The velocity and acceleration in Equation (1) can be calculated by Equation (3):

$$V(t) = \frac{P(t) - P(t-\Delta t)}{\Delta t}, \qquad A(t) = \frac{V(t) - V(t-\Delta t)}{\Delta t} \qquad (3)$$
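Equations (1)–(3) translate directly into code. A small sketch (our own function names), applying the finite-difference scheme componentwise to the HMD position-and-angle vector:

```python
def dead_reckoning(p, v, a, dt):
    """Equation (1): estimate the position vector at t + dt from the state at t."""
    return tuple(pi + vi * dt + 0.5 * ai * dt * dt
                 for pi, vi, ai in zip(p, v, a))

def dr_error(p_actual, p_est):
    """Equation (2): error between the actual and estimated position vectors."""
    return tuple(pa - pe for pa, pe in zip(p_actual, p_est))

def finite_differences(p_prev, p_now, v_prev, dt):
    """Equation (3): velocity and acceleration from successive samples."""
    v_now = tuple((pn - pp) / dt for pn, pp in zip(p_now, p_prev))
    a_now = tuple((vn - vp) / dt for vn, vp in zip(v_now, v_prev))
    return v_now, a_now
```

The receiving terminal can thus keep rendering an extrapolated pose while waiting for the next state packet, correcting when it arrives.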
In addition to the DR algorithm, we also used the Sklansky model and its prediction algorithm to predict the motion of the HMD. The Sklansky model is a basic motion model; the computational cost of the algorithm is small, and it is suitable for real-time tracking [41]. Its discrete form is

$$X(k+1) = \Phi X(k) + \Gamma w(k) \qquad (4)$$

$$X(k) = \begin{bmatrix} x(k) \\ \dot{x}(k) \\ \ddot{x}(k) \end{bmatrix}, \quad
\Phi = \begin{bmatrix} 1 & T & T^{2}/2 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{bmatrix}, \quad
\Gamma = \begin{bmatrix} T^{2}/2 \\ T \\ 1 \end{bmatrix}$$

In Equation (4), $T$ is the sampling period, $x(k)$ is the position of the head at time $k$, $\dot{x}(k)$ is the velocity of the head at time $k$, and $\ddot{x}(k)$ is the acceleration of the head at time $k$. The noise $w(k)$ is a Gaussian white noise sequence with a mean value of 0 ($E[w(k)w(j)] = q\,\delta_{kj}$, where $\delta_{kj}$ is the Kronecker delta).
According to the obtained HMD position information, we can establish the measurement equation of the system as follows:

$$Z(k) = H X(k) + v(k) \qquad (5)$$

where $H = [\,1 \;\; 0 \;\; 0\,]$, and the measuring noise $v(k)$ is a Gaussian white noise sequence with a mean value of 0 ($E[v(k)v(j)] = r\,\delta_{kj}$).
A Kalman filter is used in the prediction calculation [42]; it is a recursive data processing scheme that can give a new state estimate at any time from a recursive equation, so the amount of calculation and data storage is small [43]. Kalman filtering assumes two variables, position and velocity, which are both random and subject to a Gaussian distribution. In this case, position and velocity are correlated: the possibility of observing a particular position depends on the current velocity. This correlation is represented by the covariance matrix $P$. In short, each element $P_{ij}$ in the matrix represents the degree of correlation between the $i$th and $j$th state variables.
The equations used for prediction include:
State prediction formula:

$$\hat{X}(k+1 \mid k) = \Phi \hat{X}(k \mid k) \qquad (6)$$

Variance prediction formula:

$$P(k+1 \mid k) = \Phi P(k \mid k) \Phi^{T} + \Gamma q \Gamma^{T} \qquad (7)$$

Prediction residual formula:

$$\hat{X}(k+1 \mid k+1) = \hat{X}(k+1 \mid k) + K(k+1)\,[\,Z(k+1) - H\hat{X}(k+1 \mid k)\,] \qquad (8)$$

Residual assistance formula:

$$K(k+1) = P(k+1 \mid k) H^{T}\,[\,H P(k+1 \mid k) H^{T} + r\,]^{-1} \qquad (9)$$

In the equations above, $(k+1 \mid k)$ means “prediction”, and $(k \mid k)$ and $(k+1 \mid k+1)$ mean “optimal”. According to these equations, the head position and the initial value of the predicted variance can be used to predict head motion [44,45].
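A single predict/update cycle of these equations can be sketched in plain Python for one axis of the three-state (position, velocity, acceleration) model. This is an illustrative implementation under the standard constant-acceleration form of the Sklansky model; the scalar noise intensities q and r are assumptions:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def kalman_step(x, P, z, T, q, r):
    """One predict/update cycle for one axis: x is the 3x1 state
    [position, velocity, acceleration], P its covariance, z the scalar
    position measurement, T the sampling period."""
    F = [[1.0, T, T * T / 2.0], [0.0, 1.0, T], [0.0, 0.0, 1.0]]
    G = [[T * T / 2.0], [T], [1.0]]
    # state and covariance prediction (the "prediction" equations)
    x_pred = mat_mul(F, x)
    Q = [[q * gi[0] * gj[0] for gj in G] for gi in G]  # Gamma q Gamma^T
    P_pred = mat_add(mat_mul(mat_mul(F, P), transpose(F)), Q)
    # residual and gain; with H = [1 0 0] the innovation covariance is scalar
    residual = z - x_pred[0][0]
    S = P_pred[0][0] + r
    K = [[P_pred[i][0] / S] for i in range(3)]
    # "optimal" (updated) state and covariance
    x_new = [[x_pred[i][0] + K[i][0] * residual] for i in range(3)]
    P_new = [[P_pred[i][j] - K[i][0] * P_pred[0][j] for j in range(3)]
             for i in range(3)]
    return x_new, P_new, x_pred
```

Running one such filter per tracked coordinate yields the predicted head pose used to compensate the delay.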
We added a script component, “position adjustment”, to the head and hand objects of the character model, which corrected the model position on the basis of the DR algorithm and the Sklansky model. It first obtained the position information of the “Camera” and “Controller” at different moments, then calculated the velocity at each moment from the change in position. From the change in velocity, the system calculated the acceleration at each moment. Finally, the received state parameters were adjusted according to the DR algorithm and Sklansky model, yielding the position and other parameters of the model. In this way, the spatial position and orientation parameters of the character model were determined, realizing synchronized model changes across the two terminals.
5. Results
5.1. User Satisfaction
In the first part of the experiment, the questionnaire put forward five survey questions about user satisfaction, assessing volunteers’ satisfaction from five aspects. A Likert scale was used, with volunteers giving each question a score of 1 to 5. The five questions were
- Question 1:
Do you think this new mode of interaction is comfortable to use? (high score means comfort)
- Question 2:
Is the system stable when you use it? (high score means stability)
- Question 3:
Do you think this kind of social expression is efficient? (high score means high efficiency)
- Question 4:
Would you recommend this new type of social contact to people around you? (high score means more willing to recommend)
- Question 5:
Do you think there is much room for improvement in the system? (high score means no improvement is needed)
From the statistics of the five questions, we could analyze user satisfaction with virtual reality social communication from five aspects of the system: comfort, stability, usage efficiency, extensibility, and demand for improvement. We assumed that RVM-IVRSA would be comparable in stability to traditional social software (WeChat) and would have an even greater advantage in efficiency. In popularization and improvement, the more mature traditional social software should achieve a better result. In terms of comfort, immersive virtual reality should be rated lower than traditional social software because it can cause physiological problems such as dizziness. Overall, we expected the results to differ considerably. The result statistics are given in Table 1.
In this part of the experiment, we proposed the hypothesis that RVM-IVRSA would differ greatly from WeChat in user feedback on comfort, efficiency, popularization, and improvement, while performing very similarly to WeChat in stability. The experimental results showed that the user feedback was basically in line with our assumptions. However, examining the average score data, we found that although the average scores for efficiency and popularization were relatively close, the p-values were very small, indicating that users showed clear personal preferences between RVM-IVRSA and WeChat. On the basis of this result, we believe that there are fundamental differences between IVRSA and WeChat as interactive systems.
In terms of stability, the absence of a significant difference between the two groups of data (p = 0.39) indicated that RVM-IVRSA does not differ from traditional desktop social applications in stability. The high stability feedback (mean = 4.40, standard deviation = 0.32) indicates that the visual delay and network delay of the system are not obvious and that the DR algorithm and Sklansky model prediction algorithm were effective. On the basis of the “usage efficiency” feedback, RVM-IVRSA’s mode of interaction is superior in terms of communication. The statistical result of p = 0.04 indicates a significant difference, showing that this interaction technology is disruptive and has a great impact on user experience.
RVM-IVRSA was rated low on comfort. The discomfort caused by using the system was due to the vertigo of immersive virtual reality and stronger light stimuli. We cannot ignore the fact that vertigo remains a technical problem of virtual reality that is difficult to overcome; a user’s physical condition and adaptability to VR systems determine its degree. Although many VR system designs adopt measures (such as increasing the frame rate or canceling physical acceleration) that alleviate the vertigo effect to a certain extent, the problem cannot be completely avoided. Generally, the continuous wearing time of an immersive VR device should not exceed 30 min.
RVM-IVRSA’s “popularization” and “improvement” feedback results were lower than those of WeChat, revealing the shortcomings of the system’s immature design. The “popularization” evaluation (mean = 3.31, standard deviation = 0.65) showed that volunteers’ evaluations of RVM-IVRSA’s universality were inconsistent. The main reason was the high cost of VR equipment; some volunteers thought RVM-IVRSA had no advantage in promotion.
5.2. Online Social Experience
In addition to questions about user satisfaction, the questionnaire included five questions about functional effects and user experience, in order to obtain volunteers’ subjective feedback on the social functions of the system. The five questions were
- Question 6:
Is expression of intent limited? (high score means more freedom)
- Question 7:
Is expression of intent accurate? (high score means more accurate)
- Question 8:
Are there various ways of expressing intentions? (high score means higher variability)
- Question 9:
Is social process natural? (high score means more natural)
- Question 10:
How similar is it to a real face-to-face social situation? (high score means more similar)
In terms of user intent expression, we assumed that RVM-IVRSA would be superior to traditional social software across the board and that the differences between the two would be significant, indicating that RVM-IVRSA’s social approach is disruptive. The results for the above five questions are given in Table 2.
In this part of the experiment, we hypothesized that RVM-IVRSA’s feedback results would be better than WeChat’s in all aspects, with a large gap; the experimental results basically confirmed this hypothesis.
Table 2 shows that the system had obvious functional advantages. Using voice and body language together allowed users to express their intentions more freely and accurately. Throughout the process, volunteers behaved naturally and communicated efficiently. It is worth mentioning that some volunteers pointed out shortcomings in the virtual space scene interaction and in the modeling quality.
Compared with the user feedback for WeChat, RVM-IVRSA showed a comprehensive advantage in the user’s social experience. The significant differences across the five sets of data reflect the disruptive impact of face-to-face social simulation on the online social field.
5.3. Understanding of Body Language
Table 3 records the accuracy rate and mean time of users’ recognition of RVM-IVRSA body language expressions in the second part of our test. We first recorded the recognition accuracy and average time spent (excluding failed identifications) for the five actions provided by the researchers (driving, reading, tennis, hugging, and clapping). In addition, 45 kinds of actions were independently selected by volunteers, over 75 tests in total; listing so many kinds of test data separately would be impractical. The test results of these 45 actions did not match the preset actions in number of test runs and should not be used as directly comparable data. However, these actions had a strong randomness that is closer to real usage. Therefore, the data from this part of the test were aggregated and recorded in Table 3 under the label “Optional”. We regard these as reference data representing the real situation: analyzing them alone gives a better understanding of the real usage level of body language in this social mode, and combining them with the preset action data shows the impact of various movements on the recognition rate.
Figure 7 is drawn from the data in Table 3.
Table 3 and Figure 7 show that the five pre-prepared actions had higher recognition accuracy than the volunteers’ “Optional” actions. The accuracy of “Drive” and “Handclap” reached 100%, “Tennis” reached 87%, and the recognition time of these three movements was within 15 s. The recognition accuracy of the “Read” action was only 53%; we think this is because the action not only needed the character model to imitate the shape change, but also needed the cooperation of other object models (such as books) to be reproduced accurately. When only the model’s shape changed, pose estimation could easily go wrong. The “Hug” action was difficult to reproduce accurately using only three-point tracking of the head and hands, since it requires physical changes in other parts of the body, such as the arms. From the “Optional” data, we can infer that in practice the success rate of body language recognition was not ideal, and recognition also took a long time. In post-experiment feedback, volunteers gave several reasons for this. In addition to the points described above, another important reason was that users were not familiar with this type of interaction and could not quickly think of efficient body language movements. Regarding this feedback, we believe that RVM-IVRSA technology is not yet mature and that users need some experience to use all its functions freely; therefore, the use of wizards is very necessary for IVRSA based on motion capture.
We obtained age, gender, and body language recognition rates for the 30 users. We grouped the data by age and gender to discuss the effects of users’ age and gender on the experimental results.
5.4. Influences of User Age
We grouped the time data according to users’ age. As the number of people in each age group differed, we asked the smaller groups to carry out several additional tests to ensure that each action in each age group had six test data points. The mean time of each group is given in Table 4.
Figure 8 is a line graph drawn on the basis of the data in Table 4.
Figure 8 shows that the older the user, the more time he or she spent on recognition. Older people use less body language in social interactions and are slower to respond to changes in movement than younger people; in fact, some older people have trouble with body language. Social software developers should consider giving older people more guidance. The times each age group spent recognizing “Drive” and “Handclap” were the closest, indicating that actions with good universality are less affected by a user’s age.
The time spent on the four actions “Tennis”, “Hug”, “Optional”, and “Read” increased with user age. “Tennis” and “Hug” had the biggest increases in average time, suggesting that these actions are more familiar to younger people than to older people.
5.5. Influences of Users’ Gender
We also grouped the time data according to users’ gender. We randomly selected 10 groups of male data and 10 groups of female data for comparison. The mean times of the two groups are given in Table 5.
Figure 9 is a histogram drawn on the basis of the data in Table 5.
From Table 5 and Figure 9, it can be seen that the action type had the biggest impact on recognition time; the recognition times of male and female users for the same action did not differ much. The biggest difference between male and female results was for “Tennis”, which we think is because the male volunteers paid more attention to sports than the female volunteers. This suggests that the influence of gender differences on people’s attention is directly reflected in the understanding of body language in social behavior.
6. Conclusions
It should be emphasized that the experimental design and results of this study were based on the functional mode of “scene roaming + real-time voice + motion capture”. In the study of face-to-face social simulation, this mode is representative, but not comprehensive.
Through the discussion and analysis of experimental results, we have summarized two main conclusions:
Firstly, the method of face-to-face scene simulation using an immersive virtual reality system in social software can effectively improve the efficiency of users’ intention expression. The combination of voice communication and body language provides more options for users to express their intentions. Online social applications that simulate face-to-face social situations can create a more natural and realistic social environment for users and improve their social experience. Compared to traditional desktop social applications, volunteers showed a more active and energetic state when using IVRSA.
Secondly, IVRSA alone does not perform well in body language, but combined with real-time voice communication it can achieve high recognition accuracy. Social VR software that uses only head and hand positioning is still limited in its expression of body language, and it is difficult to fully reflect the shape changes of many complex actions through three-point tracking. High-quality virtual reality social applications require more comprehensive body tracking devices with more tracking points.
In terms of efficiency and ease of use, the post-WIMP interactive system in this mode was obviously better than the traditional desktop interactive system. However, through our experiments, we also found some system defects caused by technical limitations.
First of all, from the user satisfaction survey, we learned that users found RVM-IVRSA less convenient and comfortable than the WeChat application and that this kind of application is currently difficult to popularize and needs improvement. After the experiment, we discussed with the participants and learned that the main cause of these problems was the high hardware requirements. Because of the need for real-time graphics computation, the performance of the graphics processor directly affects the system’s frame rate. Current immersive VR apps have a small market share, and user demand is low; buying HTC VIVE devices is not a must for most people, and HTC VIVE is not cheap. The HTC VIVE HMD weighs 470 g, which all participants considered acceptable but inevitably uncomfortable.
Secondly, as shown above, the body language recognition experiment shows that the movement reproduction provided by three-point tracking is very rough, which makes the expression of complex movements very difficult. In addition, the poor quality of the model geometry and materials caused some distortion problems. The object collision system in the scene is relatively simple, which also leads to models penetrating each other. These problems have nothing to do with hardware; they are shortcomings of the system’s functional implementation and the main direction for our future improvement.
When recording the rate of body language recognition, we also recorded each user’s age and gender. Therefore, in addition to the systematic evaluation, we grouped the data by users’ age and gender and discussed their influence. We admit that this part of the discussion is not comprehensive, and a more rigorous experimental design is needed to explore this issue further.
Users of different ages have different understandings of body language, which was evident in the simulation system. Older users understood and performed movements worse than younger users and may have some operational difficulties when using IVRSA; it is necessary for IVRSA to take age into account when improving its manner of operation and guidance. The impact of gender differences on body language understanding is mainly reflected in the different hot topics that male and female users pay attention to; for example, in the identification of sports-related body movements, there was a large time gap between male and female users. We believe that in addition to age and gender, cultural, regional, and occupational factors may also lead to differences in users’ sensitivity to body movements, and the impact of these factors remains to be studied.
Finally, in a comprehensive evaluation of the system presented in this paper, we believe that the use of RVM-IVRSA is feasible and has many advantages over traditional desktop social applications. However, this system development scheme is not perfect, and there is still a lot of room for improvement. Through the evaluation of RVM-IVRSA, we believe that IVRSA based on other functional patterns is also feasible. In face-to-face social simulation, “RVM” is the most basic functional mode, and other IVRSA functional modes can be regarded as a supplement to “RVM”.