
State-of-the-Art of Computer Vision and Pattern Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 August 2024) | Viewed by 20213

Special Issue Editors


Guest Editor
Department of Information and Communication Engineering, and Convergence Engineering for Intelligent Drone, Sejong University, Seoul 05006, Republic of Korea
Interests: deep learning; object detection; NLP; pattern recognition; computer vision

Special Issue Information

Dear Colleagues,

In the rapidly evolving field of computer vision and pattern recognition, continuous advancements are reshaping the way we perceive and interact with visual data. This Special Issue aims to discuss the latest breakthroughs and innovations in these domains, offering a comprehensive snapshot of the cutting-edge research that is pushing the boundaries of what is possible.

The Special Issue will cover a wide spectrum of topics, including, but not limited to, image classification, object detection, image segmentation, video analysis, deep learning, feature extraction, face recognition, and gesture recognition. Contributions will explore novel algorithms, architectures, methodologies, and applications that contribute to the enhanced understanding and interpretation of visual data. Additionally, the issue will delve into the fusion of computer vision and pattern recognition, highlighting the synergies between these two fields and their combined potential to revolutionize various industries.

We invite researchers, practitioners, and experts in computer vision and pattern recognition to submit their original research, reviews, and case studies. The Special Issue aims to foster interdisciplinary collaboration, enabling researchers to share their insights, experiences, and challenges. By addressing both theoretical and practical aspects, this collection of articles will not only provide a comprehensive overview of recent advances but also serve as a valuable resource for researchers, practitioners, and educators in the field.

Prof. Dr. Hyeonjoon Moon
Dr. Lien Minh Dang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image classification
  • object detection
  • image segmentation
  • video analysis
  • deep learning
  • feature extraction
  • gesture recognition
  • pattern recognition
  • computer vision
  • face recognition

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (12 papers)


Research

17 pages, 1335 KiB  
Article
Fast Motion State Estimation Based on Point Cloud by Combining Deep Learning and Spatio-Temporal Constraints
by Sidong Wu, Liuquan Ren and Enzhi Zhu
Appl. Sci. 2024, 14(19), 8969; https://doi.org/10.3390/app14198969 - 5 Oct 2024
Viewed by 771
Abstract
Moving objects in the environment carry higher priority and pose greater challenges in growing domains such as unmanned vehicles and intelligent robotics. Estimating the motion state of objects from point clouds in outdoor scenarios remains a challenging research area, owing to factors such as limited temporal information, large data volumes, long network processing times, and ego-motion. A point cloud frame typically contains 60,000–120,000 points, yet most current motion state estimation methods downsample to only a few thousand points for fast processing. This downsampling discards scene information, which keeps these methods far from practical application. This paper therefore proposes a motion state estimation method that combines spatio-temporal constraints with deep learning. It first estimates and compensates for the ego-motion of multi-frame point cloud data, mapping the frames into a unified coordinate system; a motion segmentation model operating on the multi-frame point cloud is then proposed to segment moving objects. Finally, spatio-temporal constraints are used to associate moving objects across moments and estimate their motion vectors. Experiments on KITTI, nuScenes, and real captured data show good results, with average vector deviations of only 0.036 m and 0.043 m on KITTI and nuScenes, respectively, at a processing time of about 80 ms. The EPE3D error on KITTI is only 0.076 m, demonstrating the effectiveness of the method. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
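
The authors' network and constraints are not reproduced here; purely as an illustration of the ego-motion compensation step described in the abstract, the sketch below maps multi-frame point clouds into a common coordinate system using known ego-poses (NumPy only; the availability of 4x4 poses from an odometry/SLAM module is an assumption, not part of the paper).

    import numpy as np

    def transform_points(points, pose):
        """Apply a 4x4 pose matrix to an (N, 3) point cloud."""
        homo = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
        return (pose @ homo.T).T[:, :3]

    def compensate_ego_motion(frames, poses):
        """Map every frame into the coordinate system of the last frame.

        frames: list of (N_i, 3) arrays, oldest first.
        poses:  list of 4x4 world-from-sensor matrices, one per frame
                (assumed to come from an odometry/SLAM module).
        """
        ref_inv = np.linalg.inv(poses[-1])        # world -> last sensor frame
        return [transform_points(f, ref_inv @ p) for f, p in zip(frames, poses)]

    # Toy usage: two frames, with the sensor moving 1 m forward between them.
    f0, f1 = np.random.rand(100, 3), np.random.rand(100, 3)
    p0, p1 = np.eye(4), np.eye(4)
    p1[0, 3] = 1.0
    aligned = compensate_ego_motion([f0, f1], [p0, p1])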

21 pages, 8944 KiB  
Article
Industrial Image Anomaly Detection via Self-Supervised Learning with Feature Enhancement Assistance
by Bin Wu and Xiaoqi Wang
Appl. Sci. 2024, 14(16), 7301; https://doi.org/10.3390/app14167301 - 19 Aug 2024
Viewed by 995
Abstract
Industrial anomaly detection is constrained by the scarcity of anomaly samples, limiting the applicability of supervised learning methods. Many studies have therefore generated simulated anomaly images and adopted self-supervised learning, and networks pre-trained on ImageNet have been explored to assist this training. However, accurate anomaly detection remains time-consuming because the depth and parameter count of such networks are not reduced. In this paper, we propose a self-supervised learning method based on Feature Enhancement Patch Distribution Modeling (FEPDM), which generates simulated anomalies. Unlike direct training on the original feature extraction network, our approach uses a pre-trained network to extract multi-scale features. By aggregating these multi-scale features, we can train at the feature level, adapting more efficiently to various network structures, reducing domain bias with respect to natural image classification, and significantly reducing the number of parameters involved in training. This approach not only enhances the model’s generalization ability but also significantly improves the efficiency of anomaly detection. Evaluated on the MVTec AD and BTAD datasets, the method obtains (image-level, pixel-level) AUROC scores of (95.7%, 96.2%) and (93.4%, 97.6%), respectively. The experimental results demonstrate the efficacy of our method in tackling the scarcity of abnormal samples in industrial scenarios while highlighting its broad generalizability. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
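
FEPDM itself is not reproduced here; the sketch below only illustrates the general idea of extracting multi-scale features from a frozen, ImageNet-pre-trained backbone and fitting a per-location Gaussian over the aggregated patch features, in the spirit of patch-distribution modelling (the backbone, layer choice, random channel subset, and regularization are assumptions, not the authors' design; anomaly scoring would then use a distance to this distribution).

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet18

    backbone = resnet18(weights="IMAGENET1K_V1").eval()
    feats = {}
    for name in ("layer1", "layer2", "layer3"):
        getattr(backbone, name).register_forward_hook(
            lambda m, i, o, n=name: feats.__setitem__(n, o))

    @torch.no_grad()
    def patch_features(x):
        """Aggregate multi-scale backbone features into one (B, C, H, W) map."""
        feats.clear()
        backbone(x)
        h, w = feats["layer1"].shape[-2:]
        maps = [F.interpolate(feats[n], size=(h, w), mode="bilinear",
                              align_corners=False)
                for n in ("layer1", "layer2", "layer3")]
        return torch.cat(maps, dim=1)

    @torch.no_grad()
    def fit_gaussian(normal_images, n_channels=64):
        """Per-location mean/covariance over anomaly-free training images."""
        emb = patch_features(normal_images)                 # (B, C, H, W)
        idx = torch.randperm(emb.shape[1])[:n_channels]     # random channel subset
        b, c, h, w = emb[:, idx].shape
        emb = emb[:, idx].permute(0, 2, 3, 1).reshape(b, h * w, c)
        mean = emb.mean(dim=0)                               # (HW, C)
        centered = emb - mean
        cov = torch.einsum("bpc,bpd->pcd", centered, centered) / max(b - 1, 1)
        cov += 0.01 * torch.eye(c)                           # keep it invertible
        return mean, cov, idx

    mean, cov, idx = fit_gaussian(torch.randn(8, 3, 224, 224))   # stand-in images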

18 pages, 1758 KiB  
Article
A Human Body Simulation Using Semantic Segmentation and Image-Based Reconstruction Techniques for Personalized Healthcare
by Junyong So, Sekyoung Youm and Sojung Kim
Appl. Sci. 2024, 14(16), 7107; https://doi.org/10.3390/app14167107 - 13 Aug 2024
Viewed by 1118
Abstract
The global healthcare market is expanding, with a particular focus on personalized care for individuals who are unable to leave their homes due to the COVID-19 pandemic. However, the implementation of personalized care is challenging due to the need for additional devices, such as smartwatches and wearable trackers. This study aims to develop a human body simulation that predicts and visualizes an individual’s 3D body changes based on 2D images taken with a portable device. The proposed simulation uses semantic segmentation and image-based reconstruction techniques to preprocess 2D images and construct 3D body models, and it considers the user’s exercise plan to enable the visualization of 3D body changes. The simulation was developed based on human-in-the-loop experimental results and literature data. The experiment shows no statistically significant difference between the simulated body and actual anthropometric measurements (paired t-test, p = 0.3483). Unlike existing anthropometry approaches, the proposed simulation provides an accurate and efficient estimation of the human body in a 3D environment without the need for expensive equipment such as a 3D scanner or a scanning uniform. This can promote preventive treatment for individuals who lack access to healthcare. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
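
The reported p-value of 0.3483 comes from a paired t-test between simulated and measured anthropometric values; as a reminder of how such a comparison is computed, here is a minimal SciPy example with made-up numbers (not the study's data).

    from scipy import stats

    # Hypothetical paired measurements (cm) for the same participants:
    # simulated 3D-body values vs. tape-measure ground truth.
    simulated = [92.1, 88.4, 101.3, 95.0, 84.7, 98.2]
    measured = [91.5, 89.0, 100.8, 96.1, 85.2, 97.6]

    t_stat, p_value = stats.ttest_rel(simulated, measured)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # A p-value above 0.05 indicates no statistically significant paired
    # difference, which is the outcome the study reports.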

16 pages, 2729 KiB  
Article
Analysis of Bullet Impact Locations in the 10 m Air Pistol Men’s Competition Based on Covariance
by Ji-Yeon Moon and Euichul Lee
Appl. Sci. 2024, 14(14), 6006; https://doi.org/10.3390/app14146006 - 10 Jul 2024
Viewed by 932
Abstract
The purpose of this study was to quantify the bullet impact locations of the men’s 10 m air pistol competition and propose objective metrics for evaluating shooting techniques. We automatically collected data from the top 20 competitors’ shooting results using computer vision techniques. Metrics such as x-variance, y-variance, covariance, x-mean, y-mean, root mean square error (RMSE), x-mean score, and y-mean score were computed to investigate correlations among rankings, left–right and up–down shot groups, aiming relationships, and precision. Covariance analysis revealed significant interactions between horizontal and vertical aiming, highlighting the importance of balanced coordination between these directions for high performance. Athletes with lower covariance values, indicating less variation between horizontal and vertical aiming, tended to achieve higher rankings. Additionally, top-ranked athletes exhibited lower RMSE values, underscoring the importance of precision in achieving high scores. In conclusion, this study analyzed the correlation between x and y through covariance, examined its relationship with competition rankings, and proposed new indicators for training and performance enhancement. This study is novel in that it provides quantitative data to correct poor aiming and shooting habits by performing a covariance-based bidirectional correlation analysis, rather than simply analyzing bullet impact locations in a single horizontal or vertical direction. Our approach establishes a foundation for more data-driven and objective evaluations in the sport of shooting. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
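
The metrics named in the abstract (x/y variance, covariance, mean offsets, RMSE) are standard statistics; a small NumPy sketch with synthetic impact coordinates (not competition data, and only one plausible reading of the RMSE definition) shows how they can be computed.

    import numpy as np

    # Hypothetical bullet impact locations (mm) relative to the target centre.
    x = np.array([1.2, -0.8, 0.5, -1.5, 0.9, 0.1])
    y = np.array([-0.4, 0.7, 1.1, -0.2, -0.9, 0.3])

    x_var, y_var = x.var(ddof=1), y.var(ddof=1)
    covariance = np.cov(x, y, ddof=1)[0, 1]   # horizontal/vertical aim interaction
    x_mean, y_mean = x.mean(), y.mean()       # systematic left-right / up-down bias
    rmse = np.sqrt(np.mean(x**2 + y**2))      # radial spread around the centre

    print(f"var_x={x_var:.3f} var_y={y_var:.3f} cov={covariance:.3f} rmse={rmse:.3f}")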

15 pages, 470 KiB  
Article
Addressing the Non-Stationarity and Complexity of Time Series Data for Long-Term Forecasts
by Ranjai Baidya and Sang-Woong Lee
Appl. Sci. 2024, 14(11), 4436; https://doi.org/10.3390/app14114436 - 23 May 2024
Cited by 1 | Viewed by 1899
Abstract
Real-life time series datasets exhibit complications that hinder the study of time series forecasting (TSF). These datasets inherently exhibit non-stationarity, as their distributions vary over time. Furthermore, the intricate inter- and intra-series relationships among data points pose challenges for modeling. Many existing TSF models overlook one or both of these issues, resulting in inaccurate forecasts. This study proposes a novel TSF model designed to address the challenges posed by real-life data, delivering accurate forecasts in both multivariate and univariate settings. First, we propose methods termed “weak-stationarizing” and “non-stationarity restoring” to mitigate distributional shift; these methods enable the removal and restoration of non-stationary components from individual data points as needed. Second, we utilize the spectral decomposition of the weak-stationary time series to extract informative features for forecasting, exploiting a mixer architecture to learn inter- and intra-series dependencies from the unraveled representation of the overall time series. To ensure the efficacy of our model, we conduct comparative evaluations against state-of-the-art models on six real-world datasets spanning diverse fields. Across each dataset, our model consistently outperforms or yields comparable results to existing models. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
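
The paper's "weak-stationarizing" and "non-stationarity restoring" operators are defined there; the sketch below only illustrates the general pattern of removing a window's non-stationary statistics before forecasting and restoring them afterwards (a simplified, instance-normalization-style scheme, not the authors' method).

    import numpy as np

    def stationarize(window, eps=1e-8):
        """Remove the window's mean and scale; keep them for later restoration."""
        mu, sigma = window.mean(), window.std() + eps
        return (window - mu) / sigma, (mu, sigma)

    def restore(forecast, stats):
        """Put the removed non-stationary components back onto the forecast."""
        mu, sigma = stats
        return forecast * sigma + mu

    # Toy usage with a trending series and a placeholder "model" output.
    series = np.linspace(0, 10, 96) + 0.1 * np.random.randn(96)
    normalized, stats = stationarize(series)
    normalized_forecast = normalized[-24:]      # stand-in for a forecaster's output
    forecast = restore(normalized_forecast, stats)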

23 pages, 12281 KiB  
Article
Exploring the Feasibility of Vision-Based Non-Contact Oxygen Saturation Estimation: Considering Critical Color Components and Individual Differences
by Hyeon Ah Seong, Chae Lin Seok and Eui Chul Lee
Appl. Sci. 2024, 14(11), 4374; https://doi.org/10.3390/app14114374 - 22 May 2024
Viewed by 906
Abstract
The blood oxygen saturation, which indicates the ratio of oxygenated hemoglobin to total hemoglobin in the blood, is closely related to one’s health status. Oxygen saturation is typically measured using a pulse oximeter. However, this method can cause skin irritation, and in situations where there is a risk of infectious diseases, the use of such contact-based oxygen saturation measurement devices can increase the risk of infection. Therefore, recently, methods for estimating oxygen saturation using facial or hand images have been proposed. In this paper, we propose a method for estimating oxygen saturation from facial images based on a convolutional neural network (CNN). Particularly, instead of arbitrarily calculating the AC and DC components, which are essential for measuring oxygen saturation, we directly utilized signals obtained from facial images to train the model and predict oxygen saturation. Moreover, to account for the time-consuming nature of accurately measuring oxygen saturation, we diversified the model inputs. As a result, for inputs of 10 s, the Pearson correlation coefficient was calculated as 0.570, the mean absolute error was 1.755%, the root mean square error was 2.284%, and the intraclass correlation coefficient was 0.574. For inputs of 20 s, these metrics were calculated as 0.630, 1.720%, 2.219%, and 0.681, respectively. For inputs of 30 s, they were calculated as 0.663, 2.142%, 2.612%, and 0.646, respectively. This confirms that it is possible to estimate oxygen saturation without calculating the AC and DC components, which heavily influence the prediction results. Furthermore, we analyzed how the trained model predicted oxygen saturation through ‘SHapley Additive exPlanations’ and found significant variations in the feature contributions among participants. This indicates that, for more accurate predictions of oxygen saturation, it may be necessary to individually select appropriate color channels for each participant. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
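
For context on the AC and DC components mentioned in the abstract, conventional contact and camera-based pipelines estimate SpO2 from a "ratio of ratios" between two colour channels; the snippet below shows that classical baseline (the calibration constants are illustrative only, and the paper's CNN deliberately avoids this explicit computation).

    import numpy as np

    def ratio_of_ratios_spo2(red, blue, a=110.0, b=25.0):
        """Classical SpO2 estimate from two colour-channel signals.

        red, blue: 1-D arrays of per-frame mean skin-pixel intensities.
        a, b: calibration constants (illustrative values, not from the paper).
        """
        ac_red, dc_red = red.std(), red.mean()
        ac_blue, dc_blue = blue.std(), blue.mean()
        rr = (ac_red / dc_red) / (ac_blue / dc_blue)   # ratio of ratios
        return a - b * rr

    # Toy traces standing in for skin-region signals extracted from video frames.
    t = np.arange(0, 10, 1 / 30)                        # 10 s at 30 fps
    red = 120 + 0.8 * np.sin(2 * np.pi * 1.2 * t)
    blue = 100 + 0.6 * np.sin(2 * np.pi * 1.2 * t)
    print(f"estimated SpO2: {ratio_of_ratios_spo2(red, blue):.1f}%")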

14 pages, 3982 KiB  
Article
Deep Feature Retention Module Network for Texture Classification
by Sung-Hwan Park, Sung-Yoon Ahn and Sang-Woong Lee
Appl. Sci. 2024, 14(10), 4011; https://doi.org/10.3390/app14104011 - 9 May 2024
Viewed by 917
Abstract
Texture describes the unique features of an image, making texture classification a crucial task in computer vision. Various CNN-based deep learning methods have been developed to classify textures. During training, a deep-learning model learns features end to end, from low to high levels, yet most CNN architectures rely only on high-level features for the final classification, so low- and mid-level information is not prioritized. In texture classification, however, it is essential to capture detailed feature information within the pattern, because textures within the same class exhibit diversity and irregularity; low- and mid-level features can therefore also provide meaningful cues for distinguishing classes. In this study, we introduce a CNN model with a feature retention module (FRM) to utilize features from multiple levels. The FRM maintains the texture information extracted at each level and extracts feature information through filters of various sizes. We used three texture datasets to evaluate the proposed model combined with the FRM. The experimental results showed that learning from different levels of features together improves performance more than learning from high-level features alone. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
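
The FRM's exact design is given in the paper; as a generic illustration of keeping low-, mid- and high-level CNN features for the final decision, the PyTorch sketch below pools the outputs of several stages and concatenates them before the classifier (the stage layout and pooling are assumptions, not the proposed module).

    import torch
    import torch.nn as nn

    class MultiLevelTextureNet(nn.Module):
        """Toy classifier that retains low-, mid- and high-level features."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                        nn.MaxPool2d(2))
            self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                        nn.MaxPool2d(2))
            self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                                        nn.MaxPool2d(2))
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(32 + 64 + 128, num_classes)

        def forward(self, x):
            f1 = self.stage1(x)                  # low-level texture cues
            f2 = self.stage2(f1)                 # mid-level patterns
            f3 = self.stage3(f2)                 # high-level semantics
            pooled = [self.pool(f).flatten(1) for f in (f1, f2, f3)]
            return self.fc(torch.cat(pooled, dim=1))

    logits = MultiLevelTextureNet()(torch.randn(2, 3, 64, 64))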

14 pages, 6250 KiB  
Article
Emotion Recognition beyond Pixels: Leveraging Facial Point Landmark Meshes
by Herag Arabian, Tamer Abdulbaki Alshirbaji, J. Geoffrey Chase and Knut Moeller
Appl. Sci. 2024, 14(8), 3358; https://doi.org/10.3390/app14083358 - 16 Apr 2024
Cited by 2 | Viewed by 1322
Abstract
Digital health apps have become a staple in daily life, promoting awareness and providing motivation for a healthier lifestyle. With an already overwhelmed healthcare system, digital therapies offer relief to both patient and physician alike. One such planned digital therapy application is the incorporation of an emotion recognition model as a tool for therapeutic interventions for people with autism spectrum disorder (ASD). Diagnoses of ASD have increased relatively rapidly in recent years. To ensure effective recognition of expressions, a system is designed to analyze and classify different emotions from facial landmarks. Facial landmarks combined with a corresponding mesh have the potential of bypassing hurdles of model robustness commonly affecting emotion recognition from images. Landmarks are extracted from facial images using the Mediapipe framework, after which a custom mesh is constructed from the detected landmarks and used as input to a graph convolution network (GCN) model for emotion classification. The GCN makes use of the relations formed from the mesh along with the special distance features extracted. A weighted loss approach is also utilized to reduce the effects of an imbalanced dataset. The model was trained and evaluated with the Aff-Wild2 database. The results yielded a 58.76% mean accuracy on the selected validation set. The proposed approach shows the potential and limitations of using GCNs for emotion recognition in real-world scenarios. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
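
The custom mesh and GCN are specific to the paper; the fragment below only shows the landmark-extraction step with MediaPipe Face Mesh and one placeholder way to turn the landmark coordinates into graph input (the k-nearest-neighbour edges are an illustrative stand-in, not the authors' mesh construction).

    import cv2
    import mediapipe as mp
    import numpy as np

    def face_landmarks(image_bgr):
        """Return (468, 3) normalized landmark coordinates, or None if no face."""
        with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                             max_num_faces=1) as mesh:
            result = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
        if not result.multi_face_landmarks:
            return None
        lms = result.multi_face_landmarks[0].landmark
        return np.array([[p.x, p.y, p.z] for p in lms], dtype=np.float32)

    def knn_edges(points, k=3):
        """Placeholder graph: connect each landmark to its k nearest neighbours."""
        dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
        neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]   # skip self (column 0)
        return np.array([(i, j) for i in range(len(points)) for j in neighbours[i]])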

19 pages, 2204 KiB  
Article
Polyp Generalization via Diversifying Style at Feature-Level Space
by Sahadev Poudel and Sang-Woong Lee
Appl. Sci. 2024, 14(7), 2780; https://doi.org/10.3390/app14072780 - 26 Mar 2024
Viewed by 657
Abstract
In polyp segmentation, the latest notable topic revolves around polyp generalization, which aims to develop deep learning-based models capable of learning from single or multiple source domains and applying this knowledge to unseen datasets. A significant challenge in real-world clinical settings is the suboptimal performance of generalized models due to domain shift. Convolutional neural networks (CNNs) are often biased towards low-level features, such as style features, impacting generalization. Despite attempts to mitigate this bias using data augmentation techniques, learning model-agnostic and class-specific feature representations remains complex. Previous methods have employed image-level transformations with styles to supplement training data diversity. However, these approaches face limitations in ensuring style diversity due to restricted style sources, limiting the utilization of the potential style space. To address this, we propose a straightforward yet effective style conversion and generation module integrated into the UNet model. This module transfers diverse yet plausible style features to the original training data at the feature-level space, ensuring that generated styles align closely with the original data. Our method demonstrates superior performance in single-domain generalization tasks across five datasets compared to prior methods. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
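
The proposed style conversion and generation module is described in the paper; the sketch below shows only the generic idea of perturbing a feature map's channel statistics to create new plausible styles at the feature level (an AdaIN/MixStyle-flavoured operation offered as a related technique, not the authors' module).

    import torch

    def perturb_feature_style(feat, alpha=0.1, eps=1e-6):
        """Randomly jitter per-channel mean/std of a (B, C, H, W) feature map.

        Channel statistics act as a proxy for style; jittering them diversifies
        styles while leaving the normalized content unchanged.
        """
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + eps
        normalized = (feat - mu) / sigma                     # style removed
        new_mu = mu * (1 + alpha * torch.randn_like(mu))     # jittered style stats
        new_sigma = sigma * (1 + alpha * torch.randn_like(sigma))
        return normalized * new_sigma + new_mu

    feat = torch.randn(4, 64, 32, 32)    # e.g. the output of a UNet encoder block
    augmented = perturb_feature_style(feat)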

14 pages, 5220 KiB  
Article
Land-Cover Classification Using Deep Learning with High-Resolution Remote-Sensing Imagery
by Muhammad Fayaz, Junyoung Nam, L. Minh Dang, Hyoung-Kyu Song and Hyeonjoon Moon
Appl. Sci. 2024, 14(5), 1844; https://doi.org/10.3390/app14051844 - 23 Feb 2024
Cited by 4 | Viewed by 3839
Abstract
Land-area classification (LAC) research offers a promising avenue to address the intricacies of urban planning, agricultural zoning, and environmental monitoring, with a specific focus on urban areas and their complex land-usage patterns. The potential of LAC research is significantly propelled by advancements in high-resolution satellite imagery and machine learning strategies, particularly the use of convolutional neural networks (CNNs). Accurate LAC is paramount for informed urban development and effective land management, yet traditional remote-sensing methods encounter limitations in precisely classifying dynamic and complex urban land areas. In this study, we therefore investigated the application of transfer learning with the Inception-v3 and DenseNet121 architectures to establish a reliable LAC system for identifying urban land-use classes. Transfer learning allows the LAC system to benefit from features pre-trained on large datasets, enhancing generalization and performance compared to training from scratch, and facilitates the effective use of limited labeled data during fine-tuning, making it a valuable strategy for complex urban land classification tasks. The fine-tuned Inception-v3 and DenseNet121 networks thereby leverage pre-existing knowledge from extensive datasets, enhancing their adaptability to the intricacies of LAC; in this way, our research contributes to the evolution of remote-sensing methodologies and underscores the importance of fine-tuning and well-chosen network architectures in the continual enhancement of LAC systems. Through experiments conducted on the UC-Merced_LandUse dataset, we demonstrate the effectiveness of our approach, achieving 92% accuracy, 93% recall, 92% precision, and a 92% F1-score. Heatmap analysis further elucidates the decision-making process of the models, providing insights into the classification mechanism. The successful application of CNNs to LAC, coupled with heatmap analysis, opens promising avenues for enhanced urban planning, agricultural zoning, and environmental monitoring through more accurate and automated land-area classification. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
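
As a generic illustration of the transfer-learning setup the abstract describes (ImageNet-pre-trained backbone, new classification head, fine-tuning), a torchvision sketch for DenseNet121 follows; which layers are frozen and the training schedule are assumptions rather than the paper's exact configuration.

    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 21   # the UC Merced land-use dataset has 21 classes

    model = models.densenet121(weights="IMAGENET1K_V1")   # ImageNet pre-training
    for param in model.features.parameters():             # freeze the backbone
        param.requires_grad = False
    model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)

    # Fine-tuning would then train only the new head (optionally unfreezing later
    # blocks), e.g. with torch.optim.Adam over the trainable parameters.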

18 pages, 1398 KiB  
Article
A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition
by Najmul Hassan, Abu Saleh Musa Miah and Jungpil Shin
Appl. Sci. 2024, 14(2), 603; https://doi.org/10.3390/app14020603 - 10 Jan 2024
Cited by 11 | Viewed by 3596
Abstract
Dynamic human activity recognition (HAR) is a domain of study that is currently receiving considerable attention within the fields of computer vision and pattern recognition. The growing need for artificial-intelligence (AI)-driven systems to evaluate human behaviour and bolster security underscores the timeliness of this research. Despite the strides made by numerous researchers in developing dynamic HAR frameworks utilizing diverse pre-trained architectures for feature extraction and classification, persisting challenges include suboptimal accuracy and the computational intricacies of existing systems. These challenges arise from the vast video-based datasets and the inherent similarity of the data. To address them, we propose an innovative dynamic HAR technique employing a deep bidirectional long short-term memory (Deep BiLSTM) model facilitated by a pre-trained, transfer-learning-based feature-extraction approach. Our approach begins with Convolutional Neural Network (CNN) models, specifically MobileNetV2, to extract deep-level features from video frames. These features are then fed into an optimized Deep BiLSTM network to discern dependencies and process the data, enabling optimal predictions. During the testing phase, an iterative fine-tuning procedure is introduced to update the parameters of the trained model, ensuring adaptability to varying scenarios. The proposed model’s efficacy was rigorously evaluated on three benchmark datasets, namely UCF11, UCF Sports, and JHMDB, achieving notable accuracies of 99.20%, 93.3%, and 76.30%, respectively. This high accuracy substantiates the superiority of our proposed model, signaling a promising advancement in the domain of activity recognition. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
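
The pipeline sketched in the abstract (per-frame MobileNetV2 features followed by a deep bidirectional LSTM) can be outlined as below; the hidden size, sequence length, and use of the final time step for classification are assumptions, not the authors' exact settings.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CNNBiLSTM(nn.Module):
        """Per-frame MobileNetV2 features -> deep BiLSTM -> activity logits."""
        def __init__(self, num_classes=11, hidden=256, layers=2):
            super().__init__()
            backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
            self.cnn = nn.Sequential(backbone.features,
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.lstm = nn.LSTM(1280, hidden, num_layers=layers,
                                batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, num_classes)

        def forward(self, clips):                  # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)
            return self.fc(out[:, -1])             # classify from the last step

    logits = CNNBiLSTM()(torch.randn(2, 8, 3, 224, 224))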

16 pages, 10304 KiB  
Article
BWLM: A Balanced Weight Learning Mechanism for Long-Tailed Image Recognition
by Baoyu Fan, Han Ma, Yue Liu and Xiaochen Yuan
Appl. Sci. 2024, 14(1), 454; https://doi.org/10.3390/app14010454 - 4 Jan 2024
Cited by 2 | Viewed by 1473
Abstract
With the growth of real-world data, datasets often exhibit a long-tailed distribution of class sample sizes. In long-tailed image recognition, existing solutions usually adopt a class rebalancing strategy, such as reweighting based on the effective sample size of each class, which skews accuracy towards common classes. However, the key to long-tailed image recognition is to increase the accuracy of rare classes while maintaining that of common classes. This research explores a direction that balances the accuracy of common and rare classes simultaneously. First, a two-stage training scheme is adopted, motivated by the use of transfer learning to balance the features of common and rare classes. Second, a balanced weight function called the Balanced Focal Softmax (BFS) loss is proposed, which combines a balanced softmax loss focusing on common classes with a balanced focal loss focusing on rare classes to achieve dual balance in long-tailed image recognition. A Balanced Weight Learning Mechanism (BWLM) is then proposed to further exploit weight decay: applied as the weight-balancing technique for the BFS loss, weight decay drives the model towards smaller, more balanced weights by penalizing large ones. Extensive experiments on five long-tailed image datasets show that transferring the weights from the first stage to the second alleviates the bias of naive models toward common classes. The proposed BWLM not only balances the weights of common and rare classes but also greatly improves the accuracy of long-tailed image recognition, outperforming many state-of-the-art algorithms. Full article
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)
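
The BFS loss couples a balanced-softmax term with a focal term; the sketch below shows one way such a combination can be written (the log-count logit adjustment and focal weighting are standard forms, while the interpolation between the two terms is only a guess at the paper's weighting).

    import torch
    import torch.nn.functional as F

    def balanced_focal_softmax_loss(logits, targets, class_counts,
                                    gamma=2.0, mix=0.5):
        """Blend of balanced-softmax CE (common classes) and focal loss (rare classes).

        logits: (B, C); targets: (B,); class_counts: (C,) training-set frequencies.
        mix: interpolation weight between the two terms (illustrative choice).
        """
        prior = torch.log(class_counts.float() + 1e-8)
        balanced_ce = F.cross_entropy(logits + prior, targets)   # balanced softmax

        log_p = F.log_softmax(logits, dim=1)
        log_p_t = log_p.gather(1, targets[:, None]).squeeze(1)
        focal = (-(1 - log_p_t.exp()) ** gamma * log_p_t).mean()

        return mix * balanced_ce + (1 - mix) * focal

    # Toy usage on a fictitious 5-class long-tailed problem.
    counts = torch.tensor([5000, 1200, 300, 60, 12])
    loss = balanced_focal_softmax_loss(torch.randn(8, 5),
                                       torch.randint(0, 5, (8,)), counts)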
