
Computer Vision and Virtual Reality: Technologies and Applications

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (20 October 2024) | Viewed by 8191

Special Issue Editors

Guest Editor
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Interests: computer vision; virtual reality; digital twin

Guest Editor
Department of Computer Science and Information Systems, Texas A&M University, Commerce, TX 75428, USA
Interests: autonomous driving; computer vision; cyber physical systems; cyber security

Guest Editor
Department of Computer Science, North China Electric Power University, Baoding 071003, China
Interests: computer animation; deep learning

Guest Editor
School of Data Science and Media Intelligence, Communication University of China, Beijing 100024, China
Interests: virtual reality; 3D vision; digital humans

Special Issue Information

Dear Colleagues,

Computer vision (CV) is one of the fundamental technologies for immersive virtual reality (VR) and augmented reality (AR) systems, in which cameras are often used to capture real-world information. Sensor-captured imagery and the associated intelligent processing algorithms support 3D reconstruction, scene understanding, gesture recognition, eye tracking, object detection and tracking, and more, all of which contribute to creating a more realistic, interactive, and fascinating virtual world.

This Special Issue is open to multidisciplinary research on the convergence of CV and VR/AR. It welcomes original research articles, reviews, and communications in this domain on topics that include, but are not limited to, the following:

  • Deep learning in image processing;
  • Image segmentation;
  • Object detection and recognition;
  • Vision-based tracking and sensing;
  • Pose estimation;
  • Human–computer interaction;
  • 3D reconstruction and computer graphics;
  • SLAM;
  • Scene understanding;
  • Augmented reality;
  • Emerging VR/AR applications and systems based on CV technologies.

We look forward to receiving your contributions.

Dr. Hai Huang
Dr. Yuehua Wang
Dr. Xuqiang Shao
Dr. Ming Meng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computer vision
  • computer graphics
  • virtual reality
  • augmented reality
  • machine learning
  • deep learning
  • human–computer interaction
  • 3D reconstruction
  • 3D vision

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

30 pages, 14025 KiB  
Article
Player Experience Evaluation in Game-Based Systems for Older Adults
by Johnny Alexander Salazar-Cardona, Bryjeth Ceballos-Cardona, Patricia Paderewski-Rodriguez, Francisco Gutiérrez-Vela and Jeferson Arango-López
Sensors 2024, 24(18), 6121; https://doi.org/10.3390/s24186121 - 22 Sep 2024
Cited by 1 | Viewed by 890
Abstract
Significant efforts are currently being made to improve the quality of life of the older adult population. These efforts focus on aspects such as health, social interaction, and mental health. One of the approaches that has shown positive results in several studies is the application of game-based systems. These systems are not only used for entertainment, but also as tools for learning and promoting positive feelings. They are a means to overcome loneliness and isolation, as well as to improve health and provide support in daily life. However, it is important to note that, while these experiences are gradually being introduced to the older adult population, they are often designed with a younger audience in mind who are assumed to be more technologically proficient. This supposition can make older adults initially feel intimidated when interacting with this type of technology, which limits their ability to fully utilize and enjoy these technological solutions. Therefore, the purpose of this article is to apply a game experience and fun evaluation process oriented toward the older adult population based on the playability theory of human–computer interaction in virtual reality game experiences. This is expected to offer highly rewarding and pleasurable experiences, which will improve engagement with the older population and promote active and healthy aging.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)

25 pages, 19272 KiB  
Article
6DoF Object Pose and Focal Length Estimation from Single RGB Images in Uncontrolled Environments
by Mayura Manawadu and Soon-Yong Park
Sensors 2024, 24(17), 5474; https://doi.org/10.3390/s24175474 - 23 Aug 2024
Viewed by 1115
Abstract
Accurate 6DoF (degrees of freedom) pose and focal length estimation are important in extended reality (XR) applications, enabling precise object alignment and projection scaling, thereby enhancing user experiences. This study focuses on improving 6DoF pose estimation using single RGB images with unknown camera metadata. Estimating the 6DoF pose and focal length from an uncontrolled RGB image, obtained from the internet, is challenging because it often lacks crucial metadata. Existing methods such as FocalPose and FocalPose++ have made progress in this domain but still face challenges due to the projection scale ambiguity between the translation of an object along the z-axis (tz) and the camera's focal length. To overcome this, we propose a two-stage strategy that decouples the projection scaling ambiguity in the estimation of z-axis translation and focal length. In the first stage, tz is set arbitrarily, and we predict all the other pose parameters and focal length relative to the fixed tz. In the second stage, we predict the true value of tz while scaling the focal length based on the tz update. The proposed two-stage method reduces projection scale ambiguity in RGB images and improves pose estimation accuracy. The iterative update rules constrained to the first stage and tailored loss functions, including Huber loss in the second stage, enhance the accuracy of both 6DoF pose and focal length estimation. Experimental results using benchmark datasets show significant improvements in terms of median rotation and translation errors, as well as better projection accuracy compared to the existing state-of-the-art methods. In an evaluation across the Pix3D datasets (chair, sofa, table, and bed), the proposed two-stage method improves projection accuracy by approximately 7.19%. Additionally, the incorporation of Huber loss resulted in a significant reduction in translation and focal length errors by 20.27% and 6.65%, respectively, in comparison to the FocalPose++ method.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
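As a concrete illustration of the ambiguity this paper addresses, the short NumPy sketch below shows that, for a fronto-parallel planar object, scaling the focal length and the z-translation by the same factor leaves the 2D projection unchanged; this is why stage one can fix tz arbitrarily and stage two can recover the true tz while rescaling the focal length. The sketch is an illustrative assumption of the geometry only (the project function, values, and variable names are made up), not the authors' implementation.

```python
# Minimal pinhole-projection sketch of the f / tz scale ambiguity (illustrative only).
import numpy as np

def project(points_3d, f, t):
    """Pinhole projection of Nx3 object points under translation t = (tx, ty, tz)."""
    cam = points_3d + t                      # camera-frame coordinates
    return f * cam[:, :2] / cam[:, 2:3]      # perspective divide, principal point at origin

points = np.random.rand(8, 3) - 0.5
points[:, 2] = 0.0                           # planar toy object, so the ambiguity is exact

# Stage 1 (assumed): tz is fixed arbitrarily; the focal length is estimated relative to it.
tz_fixed, f_rel = 1.0, 600.0
uv_stage1 = project(points, f_rel, np.array([0.05, -0.02, tz_fixed]))

# Stage 2 (assumed): the true tz is predicted and the focal length is rescaled accordingly.
tz_true = 2.5
f_true = f_rel * (tz_true / tz_fixed)
uv_stage2 = project(points, f_true, np.array([0.05, -0.02, tz_true]))

print(np.allclose(uv_stage1, uv_stage2))     # True: same image, different (f, tz) pairs
```

For general (non-planar) objects the equality is only approximate, but this near-ambiguity between f and tz is precisely what makes joint estimation from a single image difficult.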

19 pages, 6221 KiB  
Article
Learning Temporal–Spatial Contextual Adaptation for Three-Dimensional Human Pose Estimation
by Hexin Wang, Wei Quan, Runjing Zhao, Miaomiao Zhang and Na Jiang
Sensors 2024, 24(13), 4422; https://doi.org/10.3390/s24134422 - 8 Jul 2024
Viewed by 1069
Abstract
Three-dimensional human pose estimation focuses on generating 3D pose sequences from 2D videos. It has enormous potential in the fields of human–robot interaction, remote sensing, virtual reality, and computer vision. Existing methods primarily focus on exploring spatial or temporal encoding to achieve 3D pose inference. However, various architectures exploit the independent effects of spatial and temporal cues on 3D pose estimation, while neglecting their spatial–temporal synergistic influence. To address this issue, this paper proposes a novel 3D pose estimation method with a dual-adaptive spatial–temporal former (DASTFormer) and additional supervised training. The DASTFormer contains attention-adaptive (AtA) and pure-adaptive (PuA) modes, which enhance pose inference from 2D to 3D by adaptively learning spatial–temporal effects, considering both their cooperative and independent influences. In addition, additional supervised training with a batch variance loss is proposed in this work. Unlike the common training strategy, a two-round parameter update is conducted on the same batch data. Not only can it better explore the potential relationship between spatial–temporal encoding and 3D poses, but it can also alleviate the batch size limitations imposed by graphics cards on transformer-based frameworks. Extensive experimental results show that the proposed method significantly outperforms most state-of-the-art approaches on the Human3.6M and HumanEva datasets.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
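To make the training idea concrete, here is a hedged PyTorch sketch of a two-round parameter update on the same batch, with a variance-matching term standing in for the batch variance loss; the tiny linear lifter, the loss weight, and the exact form of the variance term are assumptions for illustration, not the DASTFormer implementation.

```python
import torch
import torch.nn as nn

model = nn.Linear(2 * 17, 3 * 17)                # stand-in for a 2D-to-3D pose lifter (17 joints)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
pose_loss = nn.L1Loss()                          # stand-in supervision on 3D joint coordinates

def batch_variance_loss(pred, target):
    """Assumed form: match per-dimension variance statistics across the batch."""
    return (pred.var(dim=0) - target.var(dim=0)).abs().mean()

poses_2d, poses_3d = torch.randn(8, 34), torch.randn(8, 51)   # one dummy batch

# Round 1: ordinary supervised update on the batch.
optimizer.zero_grad()
loss_round1 = pose_loss(model(poses_2d), poses_3d)
loss_round1.backward()
optimizer.step()

# Round 2: the same batch again, now with the batch-variance term added.
optimizer.zero_grad()
pred = model(poses_2d)
loss_round2 = pose_loss(pred, poses_3d) + 0.1 * batch_variance_loss(pred, poses_3d)
loss_round2.backward()
optimizer.step()
```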

19 pages, 3590 KiB  
Article
A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion
by Nianfeng Li, Zhenyan Wang, Yongyuan Huang, Jia Tian, Xinyuan Li and Zhiguo Xiao
Sensors 2024, 24(12), 3758; https://doi.org/10.3390/s24123758 - 9 Jun 2024
Cited by 1 | Viewed by 819
Abstract
Scene text detection is an important research field in computer vision, playing a crucial role in various application scenarios. However, existing scene text detection methods often fail to achieve satisfactory results when faced with text instances of different sizes, shapes, and complex backgrounds. To address the challenge of detecting diverse texts in natural scenes, this paper proposes a multi-scale natural scene text detection method based on attention feature extraction and cascaded feature fusion. This method combines global and local attention through an improved attention feature fusion module (DSAF) to capture text features of different scales, enhancing the network's perception of text regions and improving its feature extraction capabilities. Simultaneously, an improved cascaded feature fusion module (PFFM) is used to fully integrate the extracted feature maps, expanding the receptive field of features and enriching the expressive ability of the feature maps. Finally, for the cascaded feature maps, a lightweight subspace attention module (SAM) is introduced to partition the concatenated feature maps into several subspace feature maps, facilitating spatial information interaction among features of different scales. In this paper, comparative experiments are conducted on the ICDAR2015, Total-Text, and MSRA-TD500 datasets, and comparisons are made with some existing scene text detection methods. The results show that the proposed method achieves good performance in terms of accuracy, recall, and F-score, thus verifying its effectiveness and practicality.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
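For readers unfamiliar with subspace attention, the PyTorch sketch below shows the general pattern of splitting a concatenated feature map into channel groups and reweighting each group with its own spatial attention map; the 1x1 convolution plus sigmoid gating and all shapes are assumptions for illustration, not the exact SAM design used in the paper.

```python
import torch
import torch.nn as nn

class SubspaceAttention(nn.Module):
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        sub = channels // groups
        # one tiny attention head per subspace: 1x1 conv -> spatial map -> sigmoid
        self.heads = nn.ModuleList([nn.Conv2d(sub, 1, kernel_size=1) for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.groups, dim=1)                  # split channels into subspaces
        out = [c * torch.sigmoid(h(c)) for c, h in zip(chunks, self.heads)]
        return torch.cat(out, dim=1)                          # re-concatenate the reweighted groups

fused = torch.randn(1, 256, 64, 64)          # assumed shape for the concatenated feature map
print(SubspaceAttention(256)(fused).shape)   # torch.Size([1, 256, 64, 64])
```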

17 pages, 49973 KiB  
Article
Real-Time Multi-Person Video Synthesis with Controllable Prior-Guided Matting
by Aoran Chen, Hai Huang, Yueyan Zhu and Junsheng Xue
Sensors 2024, 24(9), 2795; https://doi.org/10.3390/s24092795 - 27 Apr 2024
Cited by 1 | Viewed by 974
Abstract
In order to enhance matting performance in multi-person dynamic scenarios, we introduce a robust, real-time, high-resolution, and controllable human video matting method that achieves state-of-the-art performance on all metrics. Unlike most existing methods that perform video matting frame by frame as independent images, we design a unified architecture using a controllable generation model to address the lack of overall semantic information in multi-person video. Our method, called ControlMatting, uses an independent recurrent architecture to exploit temporal information in videos and achieves significant improvements in temporal coherence and detailed matting quality. ControlMatting adopts a mixed training strategy that combines matting and semantic segmentation datasets, which effectively improves the semantic understanding ability of the model. Furthermore, we propose a novel deep learning-based image filter algorithm that enforces our detailed augmentation ability on both matting and segmentation objectives. Our experiments show that prior information about the human body from the image itself can effectively combat the defect masking problem caused by complex dynamic scenarios with multiple people.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
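As a rough illustration of the recurrent, prior-guided idea, the sketch below processes a clip frame by frame, conditioning each step on a human-prior mask and a recurrent state carried across frames; the tiny convolutional recurrence, shapes, and names are assumptions, not the ControlMatting architecture.

```python
import torch
import torch.nn as nn

class TinyRecurrentMatter(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        # input = RGB frame (3) + prior mask (1) + recurrent state (hidden)
        self.encode = nn.Conv2d(3 + 1 + hidden, hidden, kernel_size=3, padding=1)
        self.to_alpha = nn.Conv2d(hidden, 1, kernel_size=1)
        self.hidden = hidden

    def forward(self, frames: torch.Tensor, priors: torch.Tensor) -> torch.Tensor:
        """frames: (T, 3, H, W), priors: (T, 1, H, W) -> alpha mattes (T, 1, H, W)."""
        t, _, h, w = frames.shape
        state = torch.zeros(1, self.hidden, h, w)
        alphas = []
        for i in range(t):
            x = torch.cat([frames[i:i+1], priors[i:i+1], state], dim=1)
            state = torch.tanh(self.encode(x))               # carry temporal context forward
            alphas.append(torch.sigmoid(self.to_alpha(state)))
        return torch.cat(alphas, dim=0)

frames, priors = torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64)
print(TinyRecurrentMatter()(frames, priors).shape)           # torch.Size([4, 1, 64, 64])
```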

20 pages, 9499 KiB  
Article
Part2Point: A Part-Oriented Point Cloud Reconstruction Framework
by Yu-Cheng Feng, Sheng-Yun Zeng and Tyng-Yeu Liang
Sensors 2024, 24(1), 34; https://doi.org/10.3390/s24010034 - 20 Dec 2023
Cited by 2 | Viewed by 1270
Abstract
Three-dimensional object modeling is necessary for developing virtual and augmented reality applications. Traditionally, application engineers must manually use art software to edit object shapes or exploit LiDAR to scan physical objects for constructing 3D models. This is very time-consuming and costly work. Fortunately, GPUs have recently provided a cost-effective solution for massive data computation. With GPU support, many studies have proposed 3D model generators based on different learning architectures, which can automatically convert 2D object pictures into 3D object models with good performance. However, as the demand for model resolution increases, the required computing time and memory space increase as significantly as the parameters of the learning architecture, which seriously degrades the efficiency of 3D model construction and the feasibility of resolution improvement. To resolve this problem, this paper proposes a part-oriented point cloud reconstruction framework called Part2Point. This framework segments the object's parts, reconstructs the point cloud for individual object parts, and combines the part point clouds into the complete object point cloud. Therefore, it can reduce the number of learning network parameters at the same resolution, effectively minimizing the calculation time cost and the required memory space. Moreover, it can improve the resolution of the reconstructed point cloud so that the reconstructed model can present more details of object parts.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
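The part-oriented pipeline can be summarized in a few lines of Python: reconstruct a small point cloud per segmented part and merge the part clouds into the full object cloud. The per-part generator below returns random points and is purely a placeholder for the learned network; function names and sizes are assumptions.

```python
import numpy as np

def reconstruct_part(part_image: np.ndarray, n_points: int = 1024) -> np.ndarray:
    """Placeholder for a learned per-part image-to-point-cloud generator."""
    rng = np.random.default_rng(int(part_image.sum()) % 2**32)
    return rng.standard_normal((n_points, 3)).astype(np.float32)

def part2point(part_images: list) -> np.ndarray:
    """Reconstruct each part independently, then merge into one object cloud."""
    part_clouds = [reconstruct_part(img) for img in part_images]
    return np.concatenate(part_clouds, axis=0)               # (num_parts * n_points, 3)

parts = [np.random.rand(64, 64, 3) for _ in range(4)]        # e.g., seat, back, legs, arms
print(part2point(parts).shape)                               # (4096, 3)
```

Because each part is reconstructed by a smaller network at a fixed point budget, the per-part parameter count stays modest while the merged cloud can carry more detail, which mirrors the efficiency argument made in the abstract.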

16 pages, 2490 KiB  
Article
E2LNet: An Efficient and Effective Lightweight Network for Panoramic Depth Estimation
by Jiayue Xu, Jianping Zhao, Hua Li, Cheng Han and Chao Xu
Sensors 2023, 23(22), 9218; https://doi.org/10.3390/s23229218 - 16 Nov 2023
Viewed by 1103
Abstract
Monocular panoramic depth estimation has various applications in robotics and autonomous driving due to its ability to perceive the entire field of view. However, panoramic depth estimation faces two significant challenges: global context capturing and distortion awareness. In this paper, we propose a new framework for panoramic depth estimation that can simultaneously address panoramic distortion and extract global context information, thereby improving the performance of panoramic depth estimation. Specifically, we introduce an attention mechanism into the multi-scale dilated convolution and adaptively adjust the receptive field size between different spatial positions, designing the adaptive attention dilated convolution module, which effectively perceives distortion. At the same time, we design the global scene understanding module to integrate global context information into the feature maps generated by the feature extractor. Finally, we trained and evaluated our model on three benchmark datasets comprising both virtual and real-world RGB-D panoramas. The experimental results show that the proposed method achieves competitive performance, comparable to existing techniques in both quantitative and qualitative evaluations. Furthermore, our method has fewer parameters and more flexibility, making it a scalable solution for mobile AR.
(This article belongs to the Special Issue Computer Vision and Virtual Reality: Technologies and Applications)
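A hedged PyTorch sketch of the general idea behind an adaptive attention dilated convolution is shown below: several dilation rates run in parallel and are blended per pixel by a learned softmax attention, so the effective receptive field varies across the panorama; the layer sizes, dilation rates, and module name are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveDilatedConv(nn.Module):
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        # parallel 3x3 branches with increasing dilation (same spatial size via matching padding)
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r) for r in rates]
        )
        self.attn = nn.Conv2d(channels, len(rates), kernel_size=1)   # per-pixel branch weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([b(x) for b in self.branches], dim=1)    # (B, R, C, H, W)
        weights = torch.softmax(self.attn(x), dim=1).unsqueeze(2)    # (B, R, 1, H, W)
        return (weights * feats).sum(dim=1)                          # adaptive per-pixel blend

x = torch.randn(1, 32, 32, 64)               # an equirectangular-style feature map (assumed shape)
print(AdaptiveDilatedConv(32)(x).shape)      # torch.Size([1, 32, 32, 64])
```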
