1. Introduction
In the post-pandemic era, the increasing acceptance of remote work and online education has become an undeniable reality. Against this backdrop, virtual reality (VR) technology is gradually gaining prominence, hailed as an innovative approach to remote work and education. Compared to conventional modes of remote work and learning, VR offers users a fresh experience of interacting with data in a visualized environment, liberating them from the physical constraints of traditional screens. This technological innovation endows remote work and education with a more appealing and immersive quality. For instance, Hodgson et al. [1] discuss how immersive VR is being integrated into higher education environments to enhance student engagement. Christopoulos et al. [2] explore the benefits of virtual interactions in education to increase student motivation. Tunk and Kumar [3] further highlight the potential of VR to redefine “work from home” by making remote work more engaging and collaborative.
However, realizing the full potential of these trends faces a significant obstacle: the lack of robust text input functionality in VR. Bowman et al. [4] compared various text input methods in VR and identified several challenges. Meanwhile, Grubert et al. [5] examined the usability of physical keyboards in VR and emphasized the need for more precise hand tracking. Additionally, another study by Grubert et al. [6] investigated how different hand representations influence typing accuracy, underlining the necessity of optimizing VR text input systems to match real-world typing efficiency.
Although existing solutions, such as wearable devices, controllers, and motion sensors, provide text input support, they are often inconvenient and incur additional costs. For instance, Boletsis and Kongsvik [7] propose a VR keyboard solution using a drum-like design, while Otte et al. [8] explore text input using a touch-sensitive physical keyboard, and Meier et al. [9] introduce the TapID wristband for text input based on finger taps. Recently, machine learning methods have gained attention, with some solutions using additional cameras to capture users’ hand movements and display them in real time in VR. For example, Hwang et al. [10] developed a 3D pose estimation approach using a monocular fisheye camera, while Wu et al. [11] proposed a wrist-mounted camera for estimating finger positions. Although these approaches are innovative, they require extra equipment, which adds complexity and cost to the user experience.
Furthermore, many existing solutions do not support using a physical keyboard, which can disrupt users’ typing habits and cause inconvenience. Studies such as those by Fourrier et al. [12] and Kim et al. [13] analyze gesture-based VR typing systems, highlighting the limitations of virtual keyboards that deviate from the traditional physical keyboard experience. To preserve a familiar and comfortable experience, our research focuses on using a physical keyboard with 3D tactile feedback, aligning with the principle of “easy adaptation for users with keyboard input experience”.
Therefore, we propose a solution that utilizes the built-in cameras of HMDs to capture users’ typing actions on a physical keyboard. This approach avoids inconvenience and additional costs while respecting users’ typing habits. However, using the built-in cameras of HMDs presents unique challenges. From the perspective of the HMDs’ cameras, the fingers are difficult to capture accurately because the palm obstructs the view, making it challenging to obtain a complete hand outline and precise finger positions.
Another significant barrier to effective text input in VR is the presence of jitter. Jitter refers to image rendering issues that cause virtual hands to shake, resulting in inconsistency between the movements of virtual hands and the responses in the virtual environment. This inconsistency further prevents users from interacting correctly with the virtual keyboard, causing severe typing errors. Stauffert et al. [14] emphasize that even small amounts of jitter can negatively impact VR performance, particularly in tasks requiring precision.
This study aims to address the text entry challenge in VR, especially within immersive office experiences. As mentioned above, we consider the following research questions:
How can the built-in cameras of HMDs accurately detect typing actions, even when the line of sight is obstructed, considering the unique challenges posed by these cameras?
How can jitter be reduced to enhance the accuracy of virtual hand movements in VR, thereby improving the user experience?
How does using a physical keyboard with 3D tactile feedback impact users’ typing efficiency and habits in VR?
To address these research questions, we propose utilizing back-of-the-hand images. By extracting information from these images, we can accurately predict finger positions even when the fingers themselves are occluded. To achieve this, we first establish a database of back-of-the-hand images. We then feed the back-of-the-hand images and the corresponding motion history images (MHIs) into a two-stream long short-term memory (LSTM) network, and apply Kalman filtering (KF) to its output to reduce jitter, thereby enhancing the precision of hand position tracking. Our choice of LSTM over other models is grounded in Section 4, which evaluates multiple models against key criteria such as latency, accuracy, and jitter. Advanced models, including TSSequencer [15], PatchTST [16], BNNActionNet [17], and LSTM, are compared to comprehensively assess each model’s suitability for VR typing tasks. Based on this thorough evaluation, covering latency, accuracy, jitter, and ease of deployment on head-mounted displays (HMDs), we ultimately decided to adopt 2S-LSTM-KF as the optimal model for our VR typing system.
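To make this pipeline concrete, the sketch below shows one plausible way to assemble the two pieces described above: updating a motion history image from consecutive back-of-the-hand frames, and fusing an appearance stream with an MHI stream in a two-stream LSTM. This is a minimal PyTorch illustration, not our exact implementation: the feature dimensions, the MHI decay parameter tau, and the ten-class output head are assumptions for demonstration only.

```python
import numpy as np
import torch
import torch.nn as nn

def update_mhi(mhi, prev_frame, frame, tau=15, diff_thresh=25):
    """Update a motion history image (MHI) from two consecutive grayscale
    frames: pixels with motion are reset to tau, static pixels decay by 1."""
    motion = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)) > diff_thresh
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

class TwoStreamLSTM(nn.Module):
    """Two-stream LSTM: one stream consumes per-frame appearance features
    (back-of-the-hand image), the other consumes MHI features; the final
    hidden states are fused for prediction (here, a hypothetical
    ten-way finger classification head)."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=10):
        super().__init__()
        self.rgb_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.mhi_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, rgb_seq, mhi_seq):
        # rgb_seq, mhi_seq: (batch, time, feat_dim) feature sequences.
        _, (h_rgb, _) = self.rgb_lstm(rgb_seq)
        _, (h_mhi, _) = self.mhi_lstm(mhi_seq)
        fused = torch.cat([h_rgb[-1], h_mhi[-1]], dim=-1)
        return self.head(fused)

# Example: classify a 30-frame sequence (features assumed precomputed).
model = TwoStreamLSTM()
logits = model(torch.randn(1, 30, 256), torch.randn(1, 30, 256))
```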
Subsequently, comparative experiments were conducted, and statistical analyses were performed on typing and questionnaire results, confirming the efficacy of our proposed method in maintaining typing efficiency. Finally, through additional experiments, we analyzed changes in users’ typing habits between regular and virtual reality typing, confirming that our method minimally impacts users’ typing habits.
Following this introduction, Section 2 reviews related work on VR typing systems and text entry challenges, providing background for our approach. Section 3 explains the proposed method, detailing the 2S-LSTM network and KF techniques for hand tracking and jitter reduction. Section 4 presents the performance comparison, evaluating our model alongside other state-of-the-art models. Section 5 describes the experimental studies, including a typing efficiency comparison and a typing behavior analysis examining how different VR typing solutions affect finger usage and typing habits. Finally, Section 6 concludes the study and suggests directions for future research.
4. Performance Comparison
To identify the optimal network framework for VR typing tasks, we conducted a preliminary comparison of multiple models, focusing on latency, accuracy, and jitter. This comparison aimed to determine which model provides the best overall performance.
4.1. Participants and Equipment
Latency, accuracy, and jitter are influenced primarily by the performance of hardware and algorithms. Therefore, we standardized the hardware across all conditions, using the HTC VIVE Pro paired with our developed VR typing interface. The only variable across conditions was the hand-tracking model employed. To gain insights into real-world user experience, we recruited three participants (two males and one female) with normal vision for the comparison.
Latency data were captured using VRScore [37], a widely used VR performance assessment tool. Accuracy was evaluated using the test set from the internal dataset described in Section 3. Jitter was quantified by comparing the positions of real and virtual hands.
4.2. Comparison Conditions
We tested the following models in the VR typing environment, where 2S denotes a two-stream network architecture and KF denotes Kalman filtering:
Condition 1: HTC VIVE Pro built-in gesture detection;
Condition 2: TSSequencer [15];
Condition 3: 2S-TSSequencer;
Condition 4: 2S-TSSequencer-KF;
Condition 5: PatchTST [16];
Condition 6: 2S-PatchTST;
Condition 7: BNNActionNet [17];
Condition 8: 2S-BNNActionNet;
Condition 9: 2S-BNNActionNet-KF;
Condition 10: LSTM;
Condition 11: 2S-LSTM;
Condition 12: 2S-LSTM-KF (Ours).
Each condition differed only in the model used, with all other factors, such as refresh rate, kept consistent to ensure that performance differences were attributed solely to the models.
4.3. Metrics and Data Collection
For an optimal VR typing experience, latency, accuracy, and jitter are all crucial evaluation metrics. We chose to collect data on all three metrics across the conditions to make a comprehensive assessment and identify the best-performing model.
4.3.1. Latency
Latency is an important evaluation metric, as high latency can induce motion sickness in users [38]. While an ideal latency is below 20 ms [39], most VR systems struggle to maintain stability within this range due to factors such as graphical rendering, signal transmission, and computational load. Individual sensitivity to latency varies, with some users perceiving delays as short as 3–4 ms [40]. We recorded the minimum, maximum, and average latency over 10 min intervals for each model. Participants provided feedback on their perceived latency and were allowed to switch between conditions for better comparison.
4.3.2. Accuracy
The accuracy for each model was measured using the test set from our internal dataset after training with the training set. This allowed us to evaluate each model’s effectiveness in accurately recognizing hand movements during the typing task.
4.3.3. Jitter
Jitter was evaluated as the stability of hand positions by measuring discrepancies between real and virtual hand positions at 21 × 2 key points. Points with a discrepancy exceeding a threshold were counted as contributing to jitter, while points below this threshold were not. The threshold value was established based on criteria published in our prior work at TENCON2023 [41].
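As a rough illustration of this metric, the following sketch counts, per frame, how many of the 21 × 2 tracked key points (21 per hand, both hands) deviate from their ground-truth positions by more than the threshold. The threshold value and array shapes here are placeholders for demonstration; the actual criteria follow our earlier work [41].

```python
import numpy as np

def jitter_score(real_pts, virtual_pts, threshold=0.01):
    """Count key points whose real/virtual discrepancy exceeds a threshold.

    real_pts, virtual_pts: arrays of shape (frames, 42, 3) holding the 3D
    positions of 21 key points per hand (both hands) for each frame.
    threshold: discrepancy cutoff in meters (placeholder value).
    Returns the fraction of point observations counted as jitter.
    """
    dist = np.linalg.norm(real_pts - virtual_pts, axis=-1)  # (frames, 42)
    return float((dist > threshold).mean())

# Example with synthetic data: 100 frames, 42 key points each.
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 42, 3))
virtual = real + rng.normal(scale=0.005, size=(100, 42, 3))
print(f"jitter fraction: {jitter_score(real, virtual):.3f}")
```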
4.4. Results and Discussion
4.4.1. Results for Latency, Accuracy, and Jitter
The latency measurements for the different conditions are summarized in Table 1, which presents the minimum, maximum, and average latency values derived from the total 10 min of latency data collected for each condition.
Concerning latency, as shown in Table 1, most models demonstrated acceptable latency compared to the baseline (HTC VIVE Pro built-in gesture detection), with only PatchTST and 2S-PatchTST showing significantly higher latency. This increased latency may lead to user discomfort, such as dizziness, making these models less suitable for VR typing tasks. Participants reported feeling very uncomfortable and restless after using PatchTST and 2S-PatchTST for a period of time, which differed from their experiences in the other conditions.
Regarding accuracy, Table 1 indicates that all models except LSTM achieved respectable accuracy. Excluding the high-latency PatchTST and 2S-PatchTST, the highest accuracy was observed with 2S-BNNActionNet-KF, which outperformed our proposed model by 2.27%. However, this difference was not substantial enough to noticeably affect typing performance, as participant feedback confirmed that users could not perceive a clear difference in accuracy among the top-performing models.
In terms of jitter, as shown in Table 1, comparing 2S-TSSequencer with 2S-TSSequencer-KF, 2S-BNNActionNet with 2S-BNNActionNet-KF, and 2S-LSTM with 2S-LSTM-KF (Ours), it is evident that the models with KF exhibit smaller jitter values than their non-KF counterparts. Additionally, participants reported being generally satisfied with the jitter performance of the conditions that included KF.
4.4.2. Discussion on Performance Comparison
The latency results showed that while all models other than PatchTST and 2S-PatchTST exhibited slightly higher latency than the baseline condition (Condition 1), this increase, ranging from a few milliseconds to just over ten milliseconds, remained within an acceptable range. Participant feedback indicated that the slight increase in latency introduced by these models was imperceptible compared to Condition 1. Consequently, both PatchTST and 2S-PatchTST can be excluded from consideration due to excessive latency; we believe the computational heaviness of PatchTST, which is based on the Transformer architecture, is a key factor contributing to its significant latency issues.
In terms of accuracy, 2S-BNNActionNet-KF emerged as the top performer, with the 2S-LSTM-KF model trailing by 2.27%. The 2S-TSSequencer-KF also performed admirably, leading 2S-LSTM-KF by just 1.57%. Given the nature of typing actions, which involve subtle movements and rapid finger lifts, the task of identifying typing fingers may not necessitate complex long-range dependency modeling, thereby limiting the advantages of the TSSequencer. Furthermore, the TSSequencer model might require larger and higher-quality datasets to fully realize its strengths; the dataset used in this study was self-made under limited conditions and funding, potentially constraining its performance. The results show that 2S-BNNActionNet-KF is a promising solution, especially in terms of accuracy, although LSTM performed slightly better in terms of latency and jitter. Previous research has reported that BNNActionNet has the advantage of lower computing-resource requirements, whereas LSTM achieves higher accuracy, especially in applications that require capturing subtle temporal variations [42]. As the computing resources of new HMDs improve in the future, these results may change.
Jitter analysis showed that 2S-LSTM-KF performed the best, followed by 2S-BNNActionNet-KF and 2S-TSSequencer-KF, which also demonstrated solid results. When comparing models with and without Kalman filtering (KF), the KF-enhanced versions consistently showed improved jitter performance. This suggests that incorporating KF benefits jitter reduction not only in 2S-LSTM-KF but across other models as well.
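To illustrate why KF smooths the tracked key points, the sketch below applies a simple constant-velocity Kalman filter to one noisy key-point coordinate; in practice each coordinate of each key point can be filtered this way. The frame interval assumes a 90 Hz refresh rate, and the process/measurement noise values are illustrative assumptions, not our tuned parameters.

```python
import numpy as np

def kalman_smooth_1d(measurements, dt=1 / 90, q=1e-4, r=1e-2):
    """Constant-velocity Kalman filter over one key-point coordinate.

    measurements: noisy positions, one sample per frame.
    dt: frame interval (assuming a 90 Hz HMD refresh rate).
    q, r: process and measurement noise scales (illustrative values).
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition: position, velocity
    H = np.array([[1.0, 0.0]])             # we observe position only
    Q = q * np.eye(2)                      # process noise covariance
    R = np.array([[r]])                    # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]])
    P = np.eye(2)
    out = []
    for z in measurements:
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the new measurement.
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return np.array(out)

# Example: smooth a jittery sine trajectory sampled over 2 s.
t = np.linspace(0, 2, 180)
noisy = np.sin(t) + np.random.default_rng(1).normal(scale=0.05, size=t.size)
smooth = kalman_smooth_1d(noisy)
```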
After considering latency, accuracy, and jitter performance, we believe that both 2S-BNNActionNet-KF and 2S-LSTM-KF are strong candidates. Given that 2S-BNNActionNet-KF does not significantly outperform 2S-LSTM-KF across all metrics, and considering the authors’ extensive experience in deploying LSTM on VR devices, we decided to use 2S-LSTM-KF for this experiment. In our future work, we will further explore and investigate the potential applications of 2S-BNNActionNet.
6. Conclusions and Future Work
This study addresses the challenge of text entry in VR environments, particularly within immersive office settings. By leveraging machine learning techniques, the proposed 2S-LSTM typing solution, which utilizes back-of-the-hand images, demonstrates superior performance compared to existing solutions such as the Oculus Quest 2 and Leap Motion. The 2S-LSTM solution significantly enhances typing efficiency, reduces fatigue, accurately replicates hand positions, and provides a more positive user experience. These findings underscore the potential of the developed solution to improve typing performance and user satisfaction in VR environments.
Through the performance comparison, we evaluated several advanced models, including TSSequencer, PatchTST, and BNNActionNet, across latency, accuracy, and jitter metrics. Considering these results, as well as the constraints posed by our available resources, we regard both 2S-BNNActionNet-KF and 2S-LSTM-KF as solid choices. While both demonstrated strong performance, we chose to proceed with 2S-LSTM-KF, as the performance gap between it and 2S-BNNActionNet-KF is minimal and imperceptible to users in the VR typing task. Additionally, the authors’ extensive experience with deploying LSTM on VR devices supports this decision.
Furthermore, the outcomes of this study are expected to significantly contribute to the fields of distance learning and telecommuting. Addressing the challenges of text entry in VR can facilitate the development and widespread adoption of VR technology across various applications. Future research and development efforts can focus on refining the solution and exploring its potential applications in practical settings. Additionally, expanding the sample size, incorporating additional typing metrics, and further investigating factors influencing typing performance in VR environments can provide valuable insights for developing and refining VR typing systems.
However, several avenues for future research and development could further enhance the effectiveness and user experience of VR typing systems. One key limitation of this study is the relatively small and homogeneous sample size. Future research should aim to include a larger and more diverse group of participants, helping to generalize the findings across different demographics, such as age, typing proficiency, and familiarity with VR technology.
One notable aspect of the performance comparison was the promising potential of 2S-BNNActionNet-KF. This model demonstrated strong performance across various metrics, indicating that future work could explore replacing 2S-LSTM-KF with 2S-BNNActionNet-KF to achieve even better results. Investigating the benefits of integrating this model may yield further improvements in typing accuracy and overall user experience.
While the 2S-LSTM typing solution has shown promise, there remains room for improvement. Future work could focus on refining the algorithm to further enhance typing accuracy and efficiency, potentially by incorporating more sophisticated machine learning techniques or adapting the algorithm to account for individual differences in typing habits.
This study primarily focused on short-term adaptation to VR typing. Future research should investigate long-term adaptation and learning effects. Understanding how typing habits evolve over extended periods of VR use could provide valuable insights into designing more intuitive and efficient typing systems.
Overall, this research has laid a strong foundation for future advancements in VR typing systems. By addressing these identified areas for future work, researchers and developers can continue to enhance VR typing solutions, making them more efficient, intuitive, and accessible to a broader range of users.