Three-Dimensional Machine Vision for Robots: Human Activity and Scene Understanding

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 31 March 2025 | Viewed by 6711

Special Issue Editors


Prof. Dr. Liang Zhang
Guest Editor
Embedded Technology and Visual Processing Research Center, School of Computer Science and Technology, Xidian University, Xi’an 710071, China
Interests: 3D vision; scene understanding; robotics; deep learning

Dr. Guangming Zhu
Guest Editor
Embedded Technology and Visual Processing Research Center, School of Computer Science and Technology, Xidian University, Xi’an 710071, China
Interests: scene understanding; robotics; human–object interaction

Special Issue Information

Dear Colleagues,

Robots are now deployed in many different domains. Human activity and scene understanding is essential for service robots, which must build a complete picture of both the people around them and their environment. It also underpins many other applications, such as security, surveillance, human–computer interaction, patient monitoring and analysis systems, sports, and robotics at large. With the development of deep learning techniques for video analysis, scene understanding, and natural language processing, multimodal features (including appearance, spatial, and semantic features) derived from video frames, skeleton data, and semantic labels have all been used to improve the performance of human activity and scene understanding. Vision transformers and graph models have achieved exemplary performance across a broad range of computer vision tasks, e.g., image recognition, object detection, segmentation, and image captioning, all of which help robots develop understanding and perception.

This Special Issue seeks original contributions that advance the theory and algorithmic design of vision transformers and graph models, with a focus on state-of-the-art techniques for human activity understanding that solve important problems in 3D robot action/activity recognition, understanding, and prediction.

In this Special Issue, original research articles and reviews are welcome. Research areas may include (but are not limited to) the following:

  • Human–object interaction recognition;
  • Graph models;
  • Action recognition;
  • Graph neural networks;
  • Action predictions;
  • Two-/Three-dimensional scene understanding;
  • Two-/Three-dimensional object recognition.

Prof. Dr. Liang Zhang
Dr. Guangming Zhu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • robot scene understanding
  • action recognition
  • graph models
  • action prediction

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)


Research

17 pages, 6909 KiB  
Article
Quality Assessment of Light Field Images Based on Adaptive Attention in ViT
by Yifan Du, Wei Lang, Xinwen Hu, Li Yu, Hua Zhang, Lingjun Zhang and Yifan Wu
Electronics 2024, 13(15), 2985; https://doi.org/10.3390/electronics13152985 - 29 Jul 2024
Viewed by 834
Abstract
Light field images record rich information about the light rays in a scene and provide multiple views from a single image, offering a new data source for 3D reconstruction. However, ensuring the quality of light field images themselves is challenging, and distorted image inputs may lead to poor reconstruction results. Accurate light field image quality assessment can pre-judge the quality of light field images used as input for 3D reconstruction, providing a reference for the reconstruction results before the reconstruction work begins and significantly improving the efficiency of 3D reconstruction based on light field images. In this paper, we propose an Adaptive Vision Transformer-based light field image quality assessment model (AViT-LFIQA). The model adopts a multi-view sub-aperture image sequence input method, greatly reducing the number of input images while retaining as much information as possible from the original light field image and alleviating the training pressure on the neural network. Furthermore, we design an adaptive learnable attention layer based on ViT, which addresses the lack of inductive bias in ViT by using adaptive diagonal masking and a learnable temperature coefficient strategy, making the model more suitable for training on small light field image datasets. Experimental results demonstrate that the proposed model is effective for various types of distortion and shows superior performance in light field image quality assessment.
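
The adaptive attention layer described in the abstract can be sketched in a few lines. Below is a hypothetical PyTorch illustration of multi-head self-attention with a learnable temperature coefficient and diagonal masking; the class name, the clamp floor, and the fixed (rather than adaptive) diagonal mask are our own simplifications, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTemperatureAttention(nn.Module):
    """Self-attention with a learnable temperature and diagonal masking.

    Hypothetical sketch of the ideas described in the AViT-LFIQA abstract;
    the authors' actual layer may differ in detail.
    """

    def __init__(self, dim: int, num_heads: int = 8, mask_diagonal: bool = True):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # A learnable per-head temperature replaces the fixed 1/sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.mask_diagonal = mask_diagonal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.temperature.clamp(min=1e-2)
        if self.mask_diagonal:
            # Masking self-to-self attention forces each token to aggregate
            # context from other tokens, a mild inductive bias for small datasets.
            eye = torch.eye(N, dtype=torch.bool, device=x.device)
            attn = attn.masked_fill(eye, float("-inf"))
        attn = F.softmax(attn, dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))
```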

23 pages, 24997 KiB  
Article
YOLO Adaptive Developments in Complex Natural Environments for Tiny Object Detection
by Jikun Zhong, Qing Cheng, Xingchen Hu and Zhong Liu
Electronics 2024, 13(13), 2525; https://doi.org/10.3390/electronics13132525 - 27 Jun 2024
Viewed by 1276
Abstract
Detection of tiny objects in complex environments is a matter of urgency, not only because of high real-world demand but also because of high deployment and real-time requirements. Although many current single-stage algorithms achieve good detection performance under low computing power requirements, significant challenges remain, such as distinguishing the background from object features and extracting small-scale target features in complex natural environments. To address this, we first created real datasets based on natural environments and improved dataset diversity using a combination of copy–paste enhancement and multiple image enhancement techniques. As for the choice of network, we chose YOLOv5s for its small number of parameters and ease of deployment among models of its class. Most improvement strategies claim to boost feature extraction and recognition performance; we instead weigh detection performance against realistic deployment feasibility. Therefore, based on the most popular improvement methods for YOLOv5s, we make adaptive improvements in three aspects: the attention mechanism, the head network, and the backbone network. The experimental results show that the decoupled-head and Slimneck-based improvements achieved, respectively, 0.872 and 0.849 on mAP0.5, 0.538 and 0.479 on mAP0.5:0.95, and 87.5% and 89.8% on Precision, surpassing the baseline model's results on these three metrics: 0.705, 0.405, and 83.6%. This suggests that the adaptively improved model can better meet routine detection needs without significantly increasing the number of parameters. These models perform well on our custom dataset and are also effective on images that are difficult to inspect with the naked eye. Meanwhile, we find that YOLOv8s, which also has the decoupled-head improvement, achieves 0.743, 0.461, and 87.17% on these three metrics, showing that, on our dataset, more advanced results can be achieved with fewer model parameters simply by adding a decoupled head. We also discuss and analyze some improvements that did not suit our dataset, which provides ideas for researchers in similar scenarios: amid the booming development of object detection, choosing a suitable model and combining it with other techniques helps provide solutions to real-world problems.
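
The copy–paste enhancement mentioned in the abstract is a general augmentation technique: labeled objects are cropped from one image and pasted into another, with the pasted boxes added to the label set. A minimal NumPy sketch follows; real pipelines (including, presumably, the authors') typically also use segmentation masks, rescaling, and blending.

```python
import random
import numpy as np

def copy_paste(src_img, src_boxes, dst_img, dst_boxes, n_paste=3):
    """Minimal copy-paste augmentation: crop labeled objects from a source
    image and paste them at random locations in a destination image.

    Boxes are (x1, y1, x2, y2) integer pixel coordinates; images are
    HxWx3 uint8 arrays.
    """
    out = dst_img.copy()
    new_boxes = list(dst_boxes)
    H, W = out.shape[:2]
    for x1, y1, x2, y2 in random.sample(list(src_boxes),
                                        min(n_paste, len(src_boxes))):
        patch = src_img[y1:y2, x1:x2]
        ph, pw = patch.shape[:2]
        if ph == 0 or pw == 0 or ph >= H or pw >= W:
            continue
        # Choose a random paste position that keeps the patch inside the image.
        px = random.randint(0, W - pw)
        py = random.randint(0, H - ph)
        out[py:py + ph, px:px + pw] = patch
        new_boxes.append((px, py, px + pw, py + ph))
    return out, new_boxes
```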

26 pages, 17621 KiB  
Article
DPCalib: Dual-Perspective View Network for LiDAR-Camera Joint Calibration
by Jinghao Cao, Xiong Yang, Sheng Liu, Tiejian Tang, Yang Li and Sidan Du
Electronics 2024, 13(10), 1914; https://doi.org/10.3390/electronics13101914 - 13 May 2024
Viewed by 1304
Abstract
The precise calibration of a LiDAR-camera system is a crucial prerequisite for multimodal 3D information fusion in perception systems. The accuracy and robustness of existing traditional offline calibration methods are inferior to those of methods based on deep learning. Meanwhile, most parameter regression-based online calibration methods directly project LiDAR data onto a single plane, leading to information loss and perceptual limitations. In this paper, we propose DPCalib, a novel dual-perspective view network that mitigates this issue through the fusion and reuse of input information. We design a feature encoder that effectively extracts features from two orthogonal views using attention mechanisms, and we propose an effective decoder that aggregates the features from both views to produce accurate extrinsic parameter estimates. Experimental results demonstrate that our approach outperforms existing SOTA methods, and ablation experiments validate the rationality and effectiveness of our design.
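
To make the dual-view idea concrete, the sketch below shows two standard orthogonal projections of a LiDAR point cloud: a perspective projection into the camera image plane and a bird's-eye-view rasterization. The function names, ranges, and resolution are illustrative assumptions and are not taken from the DPCalib paper.

```python
import numpy as np

def project_points(points, T, K, img_hw):
    """Project LiDAR points (N, 3) into the camera image plane.

    T: 4x4 extrinsic transform (LiDAR -> camera); K: 3x3 camera intrinsics.
    Returns integer pixel coordinates and depths of the points in view.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    cam = (T @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0.1]                              # keep points in front
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    h, w = img_hw
    keep = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[keep].astype(int), cam[keep, 2]

def bev_map(points, x_range=(0, 60), y_range=(-30, 30), res=0.25):
    """Rasterize points into a bird's-eye-view occupancy grid (orthogonal view)."""
    xs, ys = points[:, 0], points[:, 1]
    keep = ((xs >= x_range[0]) & (xs < x_range[1]) &
            (ys >= y_range[0]) & (ys < y_range[1]))
    gx = ((xs[keep] - x_range[0]) / res).astype(int)
    gy = ((ys[keep] - y_range[0]) / res).astype(int)
    grid = np.zeros((int((y_range[1] - y_range[0]) / res),
                     int((x_range[1] - x_range[0]) / res)), dtype=np.float32)
    grid[gy, gx] = 1.0
    return grid
```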

13 pages, 16460 KiB  
Article
Research on PointPillars Algorithm Based on Feature-Enhanced Backbone Network
by Xiaoning Shu and Liang Zhang
Electronics 2024, 13(7), 1233; https://doi.org/10.3390/electronics13071233 - 27 Mar 2024
Cited by 2 | Viewed by 1207
Abstract
The 3D object detection algorithm PointPillars has gained popularity in industrial applications, where improving detection accuracy while maintaining high efficiency remains a significant challenge. To address the low detection accuracy of PointPillars, this paper proposes a feature-enhancement-based improvement of its backbone network. The algorithm enriches the preliminary feature information of the backbone by augmenting PointPillars with channel attention and spatial attention mechanisms. To address the inefficiency caused by the excessive number of down-sampling parameters in PointPillars, FasterNet (a lightweight and efficient feature extraction network) is used for down-sampling and for forming feature maps at different scales. To prevent the loss and blurring of extracted features that result from transposed convolution, we adopt the lightweight and efficient up-sampling modules CARAFE and DySample to adjust resolution. Experimental results show improved accuracy across all difficulty levels of the KITTI dataset, demonstrating the superiority of the algorithm over PointPillars.
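
The channel and spatial attention used to enhance the backbone's preliminary features can be illustrated with a CBAM-style module. The following PyTorch sketch is a generic rendering of that idea, not the paper's exact module design.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel attention followed by spatial attention.

    A generic sketch of the attention described in the abstract; the
    paper's exact modules may differ.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: squeeze spatial dims with avg- and max-pooling.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: squeeze the channel dim, then apply a 7x7 conv.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```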

17 pages, 2228 KiB  
Article
Applying Machine Learning to Construct a Printed Circuit Board Gold Finger Defect Detection System
by Chien-Yi Huang and Pei-Xuan Tsai
Electronics 2024, 13(6), 1090; https://doi.org/10.3390/electronics13061090 - 15 Mar 2024
Cited by 1 | Viewed by 1371
Abstract
Machine vision systems use industrial cameras' digital sensors to collect images and use computers for image pre-processing, analysis, and the measurement of various features in order to make decisions. With increasing capacity and quality demands in the electronics industry, incoming quality control (IQC) standards are becoming more and more stringent. The industry's incoming quality control relies mainly on manual sampling; although this saves time and cost, the miss rate remains high. This study aimed to establish an automatic defect detection system that can quickly identify defects in the gold fingers on printed circuit boards (PCBs) according to the manufacturer's standard. During the iterative training process of deep learning, the parameters required for image processing and inference are updated automatically. In this study, we discussed and compared the object detection networks of the YOLOv3 (You Only Look Once, version 3) and Faster Region-Based Convolutional Neural Network (Faster R-CNN) algorithms. The results showed that the defect classification and detection model built on the YOLOv3 architecture could identify defects with an accuracy of 95%. As a result, the IQC sampling inspection was changed to a full inspection, and the surface mount technology (SMT) full inspection station was eliminated, reducing the need for inspection personnel.
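
As a rough illustration of the full-inspection workflow the abstract describes, the sketch below wraps a trained detector behind a simple pass/fail decision rule. The `detector` interface, `Detection` type, and confidence threshold are assumptions for illustration, not the authors' actual system.

```python
from typing import Callable, List, Tuple

# A detection: (class_name, confidence, (x1, y1, x2, y2)).
Detection = Tuple[str, float, Tuple[int, int, int, int]]

def inspect_board(image, detector: Callable[[object], List[Detection]],
                  conf_threshold: float = 0.5) -> dict:
    """Run the trained defect detector on a board image and pass the board
    only if no defect is found above the confidence threshold."""
    defects = [d for d in detector(image) if d[1] >= conf_threshold]
    return {
        "pass": len(defects) == 0,
        "defects": defects,  # route failed boards to manual review / rework
    }
```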
