Deep Learning for Image, Video and Signal Processing

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Applications".

Deadline for manuscript submissions: closed (30 September 2024) | Viewed by 52510

Special Issue Editors


Dr. Nikolaos Mitianoudis
Guest Editor
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Interests: deep learning; computer vision; audio source separation; music information retrieval

Dr. Ilias Theodorakopoulos
Guest Editor
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Interests: deep learning; machine learning; manifold learning; image analysis; biomedical signal processing; biomedical image analysis; pattern recognition

Special Issue Information

Dear Colleagues,

Deep learning has been a major revolution in modern information processing. All major application areas, including image, video and signal processing, have benefited from this breakthrough. Deep learning has rendered obsolete traditional approaches that employ handcrafted features, by allowing neural networks to learn optimized features directly from data. Current architectures, with millions of parameters, address many image, video and signal processing problems with top performance. The use of GPUs has been instrumental in training these networks. In addition, extensions of traditional learning strategies, such as contrastive learning, semi-supervised learning and teacher–student models, have addressed the requirement for large amounts of annotated data.

The aim of this Special Issue is to present and highlight the newest trends in deep learning for image, video and signal processing applications. These may include, but are not limited to, the following topics:

  • Object detection;
  • Semantic/instance segmentation;
  • Image fusion;
  • Image/video spatial/temporal inpainting;
  • Generative image/video processing;
  • Image/video classification;
  • Document image processing;
  • Image/video processing for autonomous driving;
  • Audio processing/classification;
  • Audio source separation;
  • Contrastive/semi-supervised learning;
  • Knowledge distillation methods.

Dr. Nikolaos Mitianoudis
Dr. Ilias Theodorakopoulos
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (23 papers)


Research


14 pages, 1898 KiB  
Article
Privacy-Preserving ConvMixer Without Any Accuracy Degradation Using Compressible Encrypted Images
by Haiwei Lin, Shoko Imaizumi and Hitoshi Kiya
Information 2024, 15(11), 723; https://doi.org/10.3390/info15110723 - 11 Nov 2024
Viewed by 371
Abstract
We propose an enhanced privacy-preserving method for image classification using ConvMixer, which is an extremely simple model that is similar in spirit to the Vision Transformer (ViT). Most privacy-preserving methods using encrypted images cause the performance of models to degrade due to the influence of encryption, but a state-of-the-art method was demonstrated to have the same classification accuracy as that of models without any encryption under the use of ViT. However, the method, in which a common secret key is assigned to each patch, is not robust enough against ciphertext-only attacks (COAs) including jigsaw puzzle solver attacks if compressible encrypted images are used. In addition, ConvMixer is less robust than ViT because there is no position embedding. To overcome this issue, we propose a novel block-wise encryption method that allows us to assign an independent key to each patch to enhance robustness against attacks. In experiments, the effectiveness of the method is verified in terms of image classification accuracy and robustness, and it is compared with conventional privacy-preserving methods using image encryption. Full article
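The per-patch independent-key idea can be illustrated with a simple block-wise pixel shuffle. This is a toy sketch, not the authors' exact cipher; the 16-pixel patch size and the seed-derivation scheme are illustrative assumptions:

```python
import numpy as np

def encrypt_blockwise(img: np.ndarray, master_seed: int, patch: int = 16) -> np.ndarray:
    """Shuffle the pixels of every patch with a key derived independently per patch."""
    h, w, c = img.shape
    out = img.copy()
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            bh, bw = min(patch, h - i), min(patch, w - j)
            block = out[i:i + bh, j:j + bw].reshape(-1, c)
            # Independent key per patch position, instead of one key shared by all patches.
            rng = np.random.default_rng((master_seed, i, j))
            out[i:i + bh, j:j + bw] = block[rng.permutation(block.shape[0])].reshape(bh, bw, c)
    return out

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
encrypted = encrypt_blockwise(img, master_seed=42)
```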

13 pages, 22601 KiB  
Article
Lightweight Reference-Based Video Super-Resolution Using Deformable Convolution
by Tomo Miyazaki, Zirui Guo and Shinichiro Omachi
Information 2024, 15(11), 718; https://doi.org/10.3390/info15110718 - 8 Nov 2024
Viewed by 396
Abstract
Super-resolution is a technique for generating a high-resolution image or video from a low-resolution counterpart by predicting natural and realistic texture information. It has various applications such as medical image analysis, surveillance, remote sensing, etc. However, traditional single-image super-resolution methods can lead to a blurry visual effect. Reference-based super-resolution methods have been proposed to recover detailed information accurately. In reference-based methods, a high-resolution image is also used as a reference in addition to the low-resolution input image. Reference-based methods aim at transferring high-resolution textures from the reference image to produce visually pleasing results. However, it requires texture alignment between low-resolution and reference images, which generally requires a lot of time and memory. This paper proposes a lightweight reference-based video super-resolution method using deformable convolution. The proposed method makes the reference-based super-resolution a technology that can be easily used even in environments with limited computational resources. To verify the effectiveness of the proposed method, we conducted experiments to compare the proposed method with baseline methods in two aspects: runtime and memory usage, in addition to accuracy. The experimental results showed that the proposed method restored a high-quality super-resolved image from a very low-resolution level in 0.0138 s using two NVIDIA RTX 2080 GPUs, much faster than the representative method. Full article
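A minimal sketch of reference-feature alignment with torchvision's deformable convolution, in the spirit of the approach; the channel count and the single-layer offset predictor are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RefAlign(nn.Module):
    """Align reference features to LR features with one deformable convolution."""
    def __init__(self, channels: int = 64, k: int = 3):
        super().__init__()
        # Offsets: 2 values (dx, dy) per sampling location of the k x k kernel.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, lr_feat, ref_feat):
        offsets = self.offset_pred(torch.cat([lr_feat, ref_feat], dim=1))
        return self.deform(ref_feat, offsets)  # reference features warped toward LR

lr, ref = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
aligned = RefAlign()(lr, ref)  # -> (1, 64, 32, 32)
```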

14 pages, 6789 KiB  
Article
Real-Time Nonlinear Image Reconstruction in Electrical Capacitance Tomography Using the Generative Adversarial Network
by Damian Wanta, Mikhail Ivanenko, Waldemar T. Smolik, Przemysław Wróblewski and Mateusz Midura
Information 2024, 15(10), 617; https://doi.org/10.3390/info15100617 - 9 Oct 2024
Viewed by 432
Abstract
This study investigated the potential of the generative adversarial neural network (cGAN) image reconstruction in industrial electrical capacitance tomography. The image reconstruction quality was examined using image patterns typical for a two-phase flow. The training dataset was prepared by generating images of random test objects and simulating the corresponding capacitance measurements. Numerical simulations were performed using the ECTsim toolkit for MATLAB. A cylindrical sixteen-electrode ECT sensor was used in the experiments. Real measurements were obtained using the EVT4 data acquisition system. The reconstructed images were evaluated using selected image quality metrics. The results obtained using cGAN are better than those obtained using the Landweber iteration and simplified Levenberg–Marquardt algorithm. The suggested method offers a promising solution for a fast reconstruction algorithm suitable for real-time monitoring and the control of a two-phase flow using ECT. Full article
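A bare-bones generator of the kind such a reconstruction could use, mapping the 120 inter-electrode capacitances of a 16-electrode sensor (16·15/2 independent pairs) to a 64×64 image; the layer shapes are assumptions, and the cGAN discriminator and training loop are omitted:

```python
import torch
import torch.nn as nn

class ECTGenerator(nn.Module):
    """Capacitance vector -> 2D permittivity image (generator only, illustrative)."""
    def __init__(self, n_meas: int = 120, base: int = 64):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(n_meas, base * 4 * 8 * 8)  # seed an 8x8 feature map
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),      # 32x32
            nn.ConvTranspose2d(base, 1, 4, stride=2, padding=1), nn.Sigmoid(),          # 64x64
        )

    def forward(self, c):  # c: (batch, n_meas)
        x = self.fc(c).view(-1, self.base * 4, 8, 8)
        return self.up(x)

image = ECTGenerator()(torch.randn(4, 120))  # -> (4, 1, 64, 64)
```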

17 pages, 9437 KiB  
Article
Utilizing RT-DETR Model for Fruit Calorie Estimation from Digital Images
by Shaomei Tang and Weiqi Yan
Information 2024, 15(8), 469; https://doi.org/10.3390/info15080469 - 7 Aug 2024
Viewed by 1592
Abstract
Estimating the calorie content of fruits is critical for weight management and maintaining overall health, as well as aiding individuals in making informed dietary choices. Accurate knowledge of fruit calorie content assists in crafting personalized nutrition plans and preventing obesity and associated health issues. In this paper, we investigate the application of deep learning models for estimating the calorie content in fruits from digital images, aiming to provide a more efficient and accurate method for nutritional analysis. We create a dataset comprising images of various fruits and employ random data augmentation techniques during training to enhance model robustness. We utilize the RT-DETR model integrated into the ultralytics framework for implementation and conduct comparative experiments with YOLOv10 on the dataset. Our results show that the RT-DETR model achieved a precision rate of 99.01% and an mAP50-95 of 94.45% in fruit detection from digital images, outperforming YOLOv10 in terms of F1-confidence curves, P-R curves, precision, and mAP. In conclusion, we utilize a transformer architecture to detect fruits and estimate their calorie and nutritional content. The results of the experiments provide a technical reference for more accurately monitoring an individual’s dietary intake by estimating the calorie content of fruits. Full article
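Using RT-DETR through the ultralytics framework follows the pattern below; the dataset config name, image file, and the calorie lookup table are hypothetical placeholders, not the paper's artifacts:

```python
from ultralytics import RTDETR

# Hypothetical calorie table (kcal per 100 g); real values come from a nutrition database.
KCAL_PER_100G = {"apple": 52, "banana": 89, "orange": 47}

model = RTDETR("rtdetr-l.pt")                # pretrained RT-DETR weights
model.train(data="fruits.yaml", epochs=100)  # dataset config name is a placeholder

results = model("fruit_photo.jpg")           # run detection on one image
for box in results[0].boxes:
    name = results[0].names[int(box.cls)]
    print(name, KCAL_PER_100G.get(name, float("nan")), "kcal / 100 g")
```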

16 pages, 3092 KiB  
Article
Epileptic Seizure Detection from Decomposed EEG Signal through 1D and 2D Feature Representation and Convolutional Neural Network
by Shupta Das, Suraiya Akter Mumu, M. A. H. Akhand, Abdus Salam and Md Abdus Samad Kamal
Information 2024, 15(5), 256; https://doi.org/10.3390/info15050256 - 2 May 2024
Cited by 3 | Viewed by 1423
Abstract
Electroencephalogram (EEG) has emerged as the most favorable source for recognizing brain disorders like epileptic seizure (ES) using deep learning (DL) methods. This study investigated the well-performed EEG-based ES detection method by decomposing EEG signals. Specifically, empirical mode decomposition (EMD) decomposes EEG signals into six intrinsic mode functions (IMFs). Three distinct features, namely, fluctuation index, variance, and ellipse area of the second order difference plot (SODP), were extracted from each of the IMFs. The feature values from all EEG channels were arranged in two composite feature forms: a 1D (i.e., unidimensional) form and a 2D image-like form. For ES recognition, the convolutional neural network (CNN), the most prominent DL model for 2D input, was considered for the 2D feature form, and a 1D version of CNN was employed for the 1D feature form. The experiment was conducted on a benchmark CHB-MIT dataset as well as a dataset prepared from the EEG signals of ES patients from Prince Hospital Khulna (PHK), Bangladesh. The 2D feature-based CNN model outperformed the other 1D feature-based models, showing an accuracy of 99.78% for CHB-MIT and 95.26% for PHK. Furthermore, the cross-dataset evaluations also showed favorable outcomes. Therefore, the proposed method with 2D composite feature form can be a promising ES detection method. Full article
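A sketch of the per-IMF feature extraction step, under common definitions: the fluctuation index is taken here as the mean absolute first difference, and the SODP ellipse area follows the usual second-order-difference-plot formula; both are assumptions about the paper's exact variants:

```python
import numpy as np
from PyEMD import EMD  # pip install EMD-signal

def sodp_ellipse_area(x: np.ndarray) -> float:
    """Ellipse area of the second-order difference plot (a common definition)."""
    y, z = x[1:-1] - x[:-2], x[2:] - x[1:-1]  # first and shifted first differences
    sy2, sz2, syz = np.mean(y**2), np.mean(z**2), np.mean(y * z)
    d = np.sqrt((sy2 + sz2) ** 2 - 4 * (sy2 * sz2 - syz**2))
    a, b = np.sqrt(3 * (sy2 + sz2 + d)), np.sqrt(3 * (sy2 + sz2 - d))
    return np.pi * a * b

def imf_features(signal: np.ndarray, n_imfs: int = 6) -> np.ndarray:
    """Three features (fluctuation index, variance, SODP area) per IMF, per channel."""
    imfs = EMD().emd(signal)[:n_imfs]          # keep the first six IMFs
    feats = []
    for imf in imfs:
        fluct = np.mean(np.abs(np.diff(imf)))  # fluctuation index (assumed definition)
        feats += [fluct, np.var(imf), sodp_ellipse_area(imf)]
    return np.asarray(feats)                   # length 3 * n_imfs
```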

18 pages, 3172 KiB  
Article
Transformer-Based Approach to Pathology Diagnosis Using Audio Spectrogram
by Mohammad Tami, Sari Masri, Ahmad Hasasneh and Chakib Tadj
Information 2024, 15(5), 253; https://doi.org/10.3390/info15050253 - 30 Apr 2024
Viewed by 1572
Abstract
Early detection of infant pathologies by non-invasive means is a critical aspect of pediatric healthcare. Audio analysis of infant crying has emerged as a promising method to identify various health conditions without direct medical intervention. In this study, we present a cutting-edge machine learning model that employs audio spectrograms and transformer-based algorithms to classify infant crying into distinct pathological categories. Our innovative model bypasses the extensive preprocessing typically associated with audio data by exploiting the self-attention mechanisms of the transformer, thereby preserving the integrity of the audio’s diagnostic features. When benchmarked against established machine learning and deep learning models, our approach demonstrated a remarkable 98.69% accuracy, 98.73% precision, 98.71% recall, and an F1 score of 98.71%, surpassing the performance of both traditional machine learning and convolutional neural network models. This research not only provides a novel diagnostic tool that is scalable and efficient but also opens avenues for improving pediatric care through early and accurate detection of pathologies. Full article
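The overall pipeline, spectrogram frames fed to a transformer encoder, can be sketched as follows; this is a generic stand-in rather than the paper's model, and the sample rate, depth, and class count are assumptions:

```python
import torch
import torch.nn as nn
import torchaudio

class CrySpectrogramTransformer(nn.Module):
    """Mel-spectrogram frames into a transformer encoder (illustrative)."""
    def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 3):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, wav):                       # wav: (batch, samples)
        spec = self.melspec(wav).transpose(1, 2)  # (batch, frames, n_mels)
        h = self.encoder(self.proj(spec.log1p()))
        return self.head(h.mean(dim=1))           # average-pool over time

logits = CrySpectrogramTransformer()(torch.randn(2, 16000))
```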

22 pages, 6807 KiB  
Article
Deep Learning-Based Road Pavement Inspection by Integrating Visual Information and IMU
by Chen-Chiung Hsieh, Han-Wen Jia, Wei-Hsin Huang and Mei-Hua Hsih
Information 2024, 15(4), 239; https://doi.org/10.3390/info15040239 - 20 Apr 2024
Cited by 1 | Viewed by 1832
Abstract
This study proposes a deep learning method for pavement defect detection, focusing on identifying potholes and cracks. A dataset comprising 10,828 images is collected, with 8662 allocated for training, 1083 for validation, and 1083 for testing. Vehicle attitude data are categorized based on three-axis acceleration and attitude change, with 6656 (64%) for training, 1664 (16%) for validation, and 2080 (20%) for testing. The Nvidia Jetson Nano serves as the vehicle-embedded system, transmitting IMU-acquired vehicle data and GoPro-captured images over a 5G network to the server. The server recognizes two damage categories, low-risk and high-risk, storing results in MongoDB. Severe damage triggers immediate alerts to maintenance personnel, while less severe issues are recorded for scheduled maintenance. The method selects YOLOv7 among various object detection models for pavement defect detection, achieving a mAP of 93.3%, a recall rate of 87.8%, a precision of 93.2%, and a processing speed of 30–40 FPS. Bi-LSTM is then chosen for vehicle vibration data processing, yielding 77% mAP, 94.9% recall rate, and 89.8% precision. Integration of the visual and vibration results, along with vehicle speed and travel distance, results in a final recall rate of 90.2% and precision of 83.7% after field testing. Full article
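A minimal Bi-LSTM classifier for the vehicle vibration stream might look like this; the six input channels (three-axis acceleration plus attitude change) and the two risk classes are assumptions for the example:

```python
import torch
import torch.nn as nn

class ImuBiLSTM(nn.Module):
    """Bi-LSTM classifier for IMU sequences (illustrative)."""
    def __init__(self, n_features: int = 6, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # 2x hidden: both directions

    def forward(self, x):             # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # classify from the last time step

logits = ImuBiLSTM()(torch.randn(8, 100, 6))  # low-risk vs. high-risk (assumed labels)
```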

15 pages, 3559 KiB  
Article
STAR-3D: A Holistic Approach for Human Activity Recognition in the Classroom Environment
by Vijeta Sharma, Manjari Gupta, Ajai Kumar and Deepti Mishra
Information 2024, 15(4), 179; https://doi.org/10.3390/info15040179 - 25 Mar 2024
Cited by 3 | Viewed by 1555
Abstract
The video camera is essential for reliable activity monitoring, and a robust analysis helps in efficient interpretation. The systematic assessment of classroom activity through videos can help understand engagement levels from the perspective of both students and teachers. This practice can also help in robot-assistive classroom monitoring in the context of human–robot interaction. Therefore, we propose a novel algorithm for student–teacher activity recognition using 3D CNN (STAR-3D). The experiment is carried out using India’s indigenously developed supercomputer PARAM Shivay by the Centre for Development of Advanced Computing (C-DAC), Pune, India, under the National Supercomputing Mission (NSM), with a peak performance of 837 TeraFlops. The EduNet dataset (registered under the trademark of the DRSTATM dataset), a self-developed video dataset for classroom activities with 20 action classes, is used to train the model. Due to the unavailability of similar datasets containing both students’ and teachers’ actions, training, testing, and validation are only carried out on the EduNet dataset with 83.5% accuracy. To the best of our knowledge, this is the first attempt to develop an end-to-end algorithm that recognises both the students’ and teachers’ activities in the classroom environment, and it mainly focuses on school levels (K-12). In addition, a comparison with other approaches in the same domain shows our work’s novelty. This novel algorithm will also influence the researcher in exploring research on the “Convergence of High-Performance Computing and Artificial Intelligence”. We also present future research directions to integrate the STAR-3D algorithm with robots for classroom monitoring. Full article

22 pages, 12087 KiB  
Article
A Cloud-Based Deep Learning Framework for Downy Mildew Detection in Viticulture Using Real-Time Image Acquisition from Embedded Devices and Drones
by Sotirios Kontogiannis, Myrto Konstantinidou, Vasileios Tsioukas and Christos Pikridas
Information 2024, 15(4), 178; https://doi.org/10.3390/info15040178 - 24 Mar 2024
Cited by 2 | Viewed by 1734
Abstract
In viticulture, downy mildew is one of the most common diseases that, if not adequately treated, can diminish production yield. However, the uncontrolled use of pesticides to alleviate its occurrence can pose significant risks for farmers, consumers, and the environment. This paper presents a new framework for the early detection and estimation of the mildew’s appearance in viticulture fields. The framework utilizes a protocol for the real-time acquisition of drones’ high-resolution RGB images and a cloud-docker-based video or image inference process using object detection CNN models. The authors implemented their framework proposition using open-source tools and experimented with their proposed implementation on the Debina grape variety in Zitsa, Greece, during downy mildew outbreaks. The authors present evaluation results of deep learning Faster R-CNN object detection models trained on their downy mildew annotated dataset, using the different object classifiers of VGG16, ViTDet, MobileNetV3, EfficientNet, SqueezeNet, and ResNet. The authors compare Faster R-CNN and YOLO object detectors in terms of accuracy and speed. From their experimentation, the embedded device model ViTDet showed the worst accuracy results compared to the fast inferences of YOLOv8, while MobileNetV3 significantly outperformed YOLOv8 in terms of both accuracy and speed. Regarding cloud inferences, large ResNet models performed well in terms of accuracy, while YOLOv5 faster inferences presented significant object classification losses. Full article
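Swapping object-classifier backbones in a Faster R-CNN, as compared above, is straightforward in torchvision; the two-class head (background plus lesion) is an assumption for the sketch:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Pretrained detector with a MobileNetV3 backbone, one of the classifiers compared above.
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(weights="DEFAULT")

# Re-head for 2 classes: background + downy-mildew lesion (class count is an assumption).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
```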

23 pages, 2629 KiB  
Article
Detect with Style: A Contrastive Learning Framework for Detecting Computer-Generated Images
by Georgios Karantaidis and Constantine Kotropoulos
Information 2024, 15(3), 158; https://doi.org/10.3390/info15030158 - 11 Mar 2024
Viewed by 1679
Abstract
The detection of computer-generated (CG) multimedia content has become of utmost importance due to the advances in digital image processing and computer graphics. Realistic CG images could be used for fraudulent purposes due to the deceiving recognition capabilities of human eyes. So, there is a need to deploy algorithmic tools for distinguishing CG images from natural ones within multimedia forensics. Here, an end-to-end framework is proposed to tackle the problem of distinguishing CG images from natural ones by utilizing supervised contrastive learning and arbitrary style transfer by means of a two-stage deep neural network architecture. This architecture enables discrimination by leveraging per-class embeddings and generating multiple training samples to increase model capacity without the need for a vast amount of initial data. Stochastic weight averaging (SWA) is also employed to improve the generalization and stability of the proposed framework. Extensive experiments are conducted to investigate the impact of various noise conditions on the classification accuracy and the proposed framework’s generalization ability. The conducted experiments demonstrate superior performance over the existing state-of-the-art methodologies on the public DSTok, Rahmouni, and LSCGB benchmark datasets. Hypothesis testing asserts that the improvements in detection accuracy are statistically significant. Full article
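Stochastic weight averaging, as employed here, is available directly in PyTorch; a standard recipe looks like the sketch below (the optimizer, learning rates, and epoch split are assumptions):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, loader, loss_fn, epochs=100, swa_start=75):
    """Standard SWA recipe via torch.optim.swa_utils (hyperparameters are assumptions)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    swa_model = AveragedModel(model)
    swa_sched = SWALR(opt, swa_lr=0.005)
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch >= swa_start:            # start averaging weights late in training
            swa_model.update_parameters(model)
            swa_sched.step()
    update_bn(loader, swa_model)          # recompute BatchNorm stats for the averaged weights
    return swa_model
```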

61 pages, 7868 KiB  
Article
Advances in Facial Expression Recognition: A Survey of Methods, Benchmarks, Models, and Datasets
by Thomas Kopalidis, Vassilios Solachidis, Nicholas Vretos and Petros Daras
Information 2024, 15(3), 135; https://doi.org/10.3390/info15030135 - 28 Feb 2024
Viewed by 13270
Abstract
Recent technological developments have enabled computers to identify and categorize facial expressions to determine a person’s emotional state in an image or a video. This process, called “Facial Expression Recognition (FER)”, has become one of the most popular research areas in computer vision. In recent times, deep FER systems have primarily concentrated on addressing two significant challenges: the problem of overfitting due to limited training data availability, and the presence of expression-unrelated variations, including illumination, head pose, image resolution, and identity bias. In this paper, a comprehensive survey is provided on deep FER, encompassing algorithms and datasets that offer insights into these intrinsic problems. Initially, this paper presents a detailed timeline showcasing the evolution of methods and datasets in deep FER. This timeline illustrates the progression and development of the techniques and data resources used in FER. Then, a comprehensive review of FER methods is introduced, including the basic principles of FER (components such as preprocessing, feature extraction, and classification), from the pre-deep-learning era (traditional methods using handcrafted features, e.g., SVM and HOG) to the deep learning era. Moreover, a brief introduction is provided related to the benchmark datasets (there are two categories: controlled environments (lab) and uncontrolled environments (in the wild)) used to evaluate different FER methods, and a comparison of different FER models. Existing deep neural networks and related training strategies designed for FER, based on static images and dynamic image sequences, are discussed. The remaining challenges and corresponding opportunities in FER and the future directions for designing robust deep FER systems are also pinpointed. Full article

18 pages, 7127 KiB  
Article
Benchmarking Automated Machine Learning (AutoML) Frameworks for Object Detection
by Samuel de Oliveira, Oguzhan Topsakal and Onur Toker
Information 2024, 15(1), 63; https://doi.org/10.3390/info15010063 - 21 Jan 2024
Cited by 1 | Viewed by 3098
Abstract
Automated Machine Learning (AutoML) is a subdomain of machine learning that seeks to expand the usability of traditional machine learning methods to non-expert users by automating various tasks which normally require manual configuration. Prior benchmarking studies on AutoML systems—whose aim is to compare and evaluate their capabilities—have mostly focused on tabular or structured data. In this study, we evaluate AutoML systems on the task of object detection by curating three commonly used object detection datasets (Open Images V7, Microsoft COCO 2017, and Pascal VOC2012) in order to benchmark three different AutoML frameworks—namely, Google’s Vertex AI, NVIDIA’s TAO, and AutoGluon. We reduced the datasets to only include images with a single object instance in order to understand the effect of class imbalance, as well as dataset and object size. We used the metrics of the average precision (AP) and mean average precision (mAP). Solely in terms of accuracy, our results indicate AutoGluon as the best-performing framework, with a mAP of 0.8901, 0.8972, and 0.8644 for the Pascal VOC2012, COCO 2017, and Open Images V7 datasets, respectively. NVIDIA TAO achieved a mAP of 0.8254, 0.8165, and 0.7754 for those same datasets, while Google’s VertexAI scored 0.855, 0.793, and 0.761. We found the dataset size had an inverse relationship to mAP across all the frameworks, and there was no relationship between class size or imbalance and accuracy. Furthermore, we discuss each framework’s relative benefits and drawbacks from the standpoint of ease of use. This study also points out the issues found as we examined the labels of a subset of each dataset. Labeling errors in the datasets appear to have a substantial negative effect on accuracy that is not resolved by larger datasets. Overall, this study provides a platform for future development and research on this nascent field of machine learning. Full article

18 pages, 3164 KiB  
Article
Fast Object Detection Leveraging Global Feature Fusion in Boundary-Aware Convolutional Networks
by Weiming Fan, Jiahui Yu and Zhaojie Ju
Information 2024, 15(1), 53; https://doi.org/10.3390/info15010053 - 17 Jan 2024
Viewed by 1698
Abstract
Endoscopy, a pervasive instrument for the diagnosis and treatment of hollow anatomical structures, conventionally necessitates the arduous manual scrutiny of seasoned medical experts. Nevertheless, the recent strides in deep learning technologies proffer novel avenues for research, endowing it with the potential for amplified robustness and precision, accompanied by the pledge of cost abatement in detection procedures, while simultaneously providing substantial assistance to clinical practitioners. Within this investigation, we usher in an innovative technique for the identification of anomalies in endoscopic imagery, christened as Context-enhanced Feature Fusion with Boundary-aware Convolution (GFFBAC). We employ the Context-enhanced Feature Fusion (CEFF) methodology, underpinned by Convolutional Neural Networks (CNNs), to establish equilibrium amidst the tiers of the feature pyramids. These intricately harnessed features are subsequently amalgamated into the Boundary-aware Convolution (BAC) module to reinforce both the faculties of localization and classification. A thorough exploration conducted across three disparate datasets elucidates that the proposition not only surpasses its contemporaries in object detection performance but also yields detection boxes of heightened precision. Full article

21 pages, 11275 KiB  
Article
Towards Enhancing Automated Defect Recognition (ADR) in Digital X-ray Radiography Applications: Synthesizing Training Data through X-ray Intensity Distribution Modeling for Deep Learning Algorithms
by Bata Hena, Ziang Wei, Luc Perron, Clemente Ibarra Castanedo and Xavier Maldague
Information 2024, 15(1), 16; https://doi.org/10.3390/info15010016 - 27 Dec 2023
Cited by 4 | Viewed by 2489
Abstract
Industrial radiography is a pivotal non-destructive testing (NDT) method that ensures quality and safety in a wide range of industrial sectors. Conventional human-based approaches, however, are prone to challenges in defect detection accuracy and efficiency, primarily due to the high inspection demand from manufacturing industries with high production throughput. To solve this challenge, numerous computer-based alternatives have been developed, including Automated Defect Recognition (ADR) using deep learning algorithms. At the core of training, these algorithms demand large volumes of data that should be representative of real-world cases. However, the availability of digital X-ray radiography data for open research is limited by non-disclosure contractual terms in the industry. This study presents a pipeline that is capable of modeling synthetic images based on statistical information acquired from X-ray intensity distribution from real digital X-ray radiography images. Through meticulous analysis of the intensity distribution in digital X-ray images, the unique statistical patterns associated with the exposure conditions used during image acquisition, type of component, thickness variations, beam divergence, anode heel effect, etc., are extracted. The realized synthetic images were utilized to train deep learning models, yielding an impressive model performance with a mean intersection over union (IoU) of 0.93 and a mean dice coefficient of 0.96 on real unseen digital X-ray radiography images. This methodology is scalable and adaptable, making it suitable for diverse industrial applications. Full article
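The two reported segmentation metrics have simple binary-mask definitions; a minimal sketch:

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, target: np.ndarray):
    """Binary-mask IoU and Dice coefficient, the two metrics reported above."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0   # two empty masks count as a perfect match
    dice = 2 * inter / total if total else 1.0
    return iou, dice
```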

18 pages, 6041 KiB  
Article
Dual-Pyramid Wide Residual Network for Semantic Segmentation on Cross-Style Datasets
by Guan-Ting Shen and Yin-Fu Huang
Information 2023, 14(12), 630; https://doi.org/10.3390/info14120630 - 24 Nov 2023
Viewed by 1408
Abstract
Image segmentation is the process of partitioning an image into multiple segments where the goal is to simplify the representation of the image and make the image more meaningful and easier to analyze. In particular, semantic segmentation is an approach of detecting the classes of objects, based on each pixel. In the past, most semantic segmentation models were for only one single style, such as urban street views, medical images, or even manga. In this paper, we propose a semantic segmentation model called the Dual-Pyramid Wide Residual Network (DPWRN) to solve the segmentation on cross-style datasets, which is suitable for diverse segmentation applications. The DPWRN integrated the Pyramid of Kernel paralleled with Dilation (PKD) and Multi-Feature Fusion (MFF) to improve the accuracy of segmentation. To evaluate the generalization of the DPWRN and its superiority over most state-of-the-art models, three datasets with completely different styles are tested in the experiments. As a result, our model achieves 75.95% of mIoU on CamVid, 83.60% of F1-score on DRIVE, and 86.87% of F1-score on eBDtheque. This verifies that the DPWRN can be generalized and shows its superiority in semantic segmentation on cross-style datasets. Full article

21 pages, 4397 KiB  
Article
POSS-CNN: An Automatically Generated Convolutional Neural Network with Precision and Operation Separable Structure Aiming at Target Recognition and Detection
by Jia Hou, Jingyu Zhang, Qi Chen, Siwei Xiang, Yishuo Meng, Jianfei Wang, Cimang Lu and Chen Yang
Information 2023, 14(11), 604; https://doi.org/10.3390/info14110604 - 7 Nov 2023
Viewed by 1731
Abstract
Artificial intelligence is changing and influencing our world. As one of the main algorithms in the field of artificial intelligence, convolutional neural networks (CNNs) have developed rapidly in recent years. Especially after the emergence of NASNet, CNNs have gradually pushed the idea of AutoML to the public’s attention, and large numbers of new structures designed by automatic searches are appearing. These networks are usually based on reinforcement learning and evolutionary learning algorithms. However, sometimes, the blocks of these networks are complex, and there is no small model for simpler tasks. Therefore, this paper proposes POSS-CNN aiming at target recognition and detection, which employs a multi-branch CNN structure with PSNC and a method of automatic parallel selection for super parameters based on a multi-branch CNN structure. Moreover, POSS-CNN can be broken up. By choosing a single branch or the combination of two branches as the “benchmark”, as well as the overall POSS-CNN, we can achieve seven models with different precision and operations. The test accuracy of POSS-CNN for a recognition task tested on a CIFAR10 dataset can reach 86.4%, which is equivalent to AlexNet and VggNet, but the operation and parameters of the whole model in this paper are 45.9% and 45.8% of AlexNet, and 29.5% and 29.4% of VggNet. The mAP of POSS-CNN for a detection task tested on the LSVH dataset is 45.8, inferior to the 62.3 of YOLOv3. However, compared with YOLOv3, the operation and parameters of the model in this paper are reduced by 57.4% and 15.6%, respectively. After being accelerated by WRA, POSS-CNN for a detection task tested on an LSVH dataset can achieve 27 fps, and the energy efficiency is 0.42 J/f, which is 5 times and 96.6 times better than GPU 2080Ti in performance and energy efficiency, respectively. Full article

15 pages, 1514 KiB  
Article
Deep-Learning-Based Multitask Ultrasound Beamforming
by Elay Dahan and Israel Cohen
Information 2023, 14(10), 582; https://doi.org/10.3390/info14100582 - 23 Oct 2023
Viewed by 2236
Abstract
In this paper, we present a new method for multitask learning applied to ultrasound beamforming. Beamforming is a critical component in the ultrasound image formation pipeline. Ultrasound images are constructed using sensor readings from multiple transducer elements, with each element typically capturing multiple acquisitions per frame. Hence, the beamformer is crucial for framerate performance and overall image quality. Furthermore, post-processing, such as image denoising, is usually applied to the beamformed image to achieve high clarity for diagnosis. This work shows a fully convolutional neural network that can learn different tasks by applying a new weight normalization scheme. We adapt our model to both high frame rate requirements by fitting weight normalization parameters for the sub-sampling task and image denoising by optimizing the normalization parameters for the speckle reduction task. Our model outperforms single-angle delay and sum on pixel-level measures for speckle noise reduction, subsampling, and single-angle reconstruction. Full article
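For context, the single-angle delay-and-sum baseline the network is compared against can be written compactly; this sketch assumes a 0° plane-wave transmit and omits apodization and sub-sample interpolation:

```python
import numpy as np

def das_plane_wave(rf, elem_x, xs, zs, fs, c=1540.0):
    """Delay-and-sum for a single 0-degree plane-wave transmit.

    rf: (n_elements, n_samples) channel data; elem_x: element x-positions [m];
    xs, zs: pixel grid coordinates [m]; fs: sampling rate [Hz]; c: sound speed [m/s].
    """
    n_elem, n_samp = rf.shape
    img = np.zeros((len(zs), len(xs)))
    for iz, z in enumerate(zs):
        for ix, x in enumerate(xs):
            t_tx = z / c                                   # plane wave reaches depth z
            t_rx = np.sqrt(z**2 + (x - elem_x) ** 2) / c   # echo back to each element
            idx = np.round((t_tx + t_rx) * fs).astype(int)
            valid = idx < n_samp
            img[iz, ix] = rf[np.arange(n_elem)[valid], idx[valid]].sum()
    return img
```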

15 pages, 4049 KiB  
Article
On the Use of Kullback–Leibler Divergence for Kernel Selection and Interpretation in Variational Autoencoders for Feature Creation
by Fábio Mendonça, Sheikh Shanawaz Mostafa, Fernando Morgado-Dias and Antonio G. Ravelo-García
Information 2023, 14(10), 571; https://doi.org/10.3390/info14100571 - 18 Oct 2023
Cited by 1 | Viewed by 1926
Abstract
This study presents a novel approach for kernel selection based on Kullback–Leibler divergence in variational autoencoders using features generated by the convolutional encoder. The proposed methodology focuses on identifying the most relevant subset of latent variables to reduce the model’s parameters. Each latent variable is sampled from the distribution associated with a single kernel of the last encoder’s convolutional layer, resulting in an individual distribution for each kernel. Relevant features are selected from the sampled latent variables to perform kernel selection, which filters out uninformative features and, consequently, unnecessary kernels. Both the proposed filter method and the sequential feature selection (standard wrapper method) were examined for feature selection. Particularly, the filter method evaluates the Kullback–Leibler divergence between all kernels’ distributions and hypothesizes that similar kernels can be discarded as they do not convey relevant information. This hypothesis was confirmed through the experiments performed on four standard datasets, where it was observed that the number of kernels can be reduced without meaningfully affecting the performance. This analysis was based on the accuracy of the model when the selected kernels fed a probabilistic classifier and the feature-based similarity index to appraise the quality of the reconstructed images when the variational autoencoder only uses the selected kernels. Therefore, the proposed methodology guides the reduction of the number of parameters of the model, making it suitable for developing applications for resource-constrained devices. Full article
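The closed-form KL divergence between univariate Gaussians underpins the comparison of kernels' latent distributions; a sketch of the pairwise screening, where the redundancy threshold and the symmetrized comparison are assumptions:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ), closed form for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# mu[k], var[k]: latent distribution tied to encoder kernel k (one latent per kernel).
n_kernels = 32
mu, var = np.random.randn(n_kernels), np.exp(np.random.randn(n_kernels))

# Pairwise divergences: near-duplicate kernels (small KL both ways) are pruning candidates.
pairwise = np.array([[kl_gauss(mu[i], var[i], mu[j], var[j]) for j in range(n_kernels)]
                     for i in range(n_kernels)])
sym = pairwise + pairwise.T
np.fill_diagonal(sym, np.inf)                 # ignore self-comparisons
redundant = np.argwhere(sym < 0.05)           # threshold is an assumption
```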

16 pages, 5731 KiB  
Article
Innovative Visualization Approach for Biomechanical Time Series in Stroke Diagnosis Using Explainable Machine Learning Methods: A Proof-of-Concept Study
by Kyriakos Apostolidis, Christos Kokkotis, Evangelos Karakasis, Evangeli Karampina, Serafeim Moustakidis, Dimitrios Menychtas, Georgios Giarmatzis, Dimitrios Tsiptsios, Konstantinos Vadikolias and Nikolaos Aggelousis
Information 2023, 14(10), 559; https://doi.org/10.3390/info14100559 - 12 Oct 2023
Cited by 4 | Viewed by 2014
Abstract
Stroke remains a predominant cause of mortality and disability worldwide. The endeavor to diagnose stroke through biomechanical time-series data coupled with Artificial Intelligence (AI) poses a formidable challenge, especially amidst constrained participant numbers. The challenge escalates when dealing with small datasets, a common scenario in preliminary medical research. While recent advances have ushered in few-shot learning algorithms adept at handling sparse data, this paper pioneers a distinctive methodology involving a visualization-centric approach to navigating the small-data challenge in diagnosing stroke survivors based on gait-analysis-derived biomechanical data. Employing Siamese neural networks (SNNs), our method transforms a biomechanical time series into visually intuitive images, facilitating a unique analytical lens. The kinematic data encapsulated comprise a spectrum of gait metrics, including movements of the ankle, knee, hip, and center of mass in three dimensions for both paretic and non-paretic legs. Following the visual transformation, the SNN serves as a potent feature extractor, mapping the data into a high-dimensional feature space conducive to classification. The extracted features are subsequently fed into various machine learning (ML) models like support vector machines (SVMs), Random Forest (RF), or neural networks (NN) for classification. In pursuit of heightened interpretability, a cornerstone in medical AI applications, we employ the Grad-CAM (Class Activation Map) tool to visually highlight the critical regions influencing the model’s decision. Our methodology, though exploratory, showcases a promising avenue for leveraging visualized biomechanical data in stroke diagnosis, achieving a perfect classification rate in our preliminary dataset. The visual inspection of generated images elucidates a clear separation of classes (100%), underscoring the potential of this visualization-driven approach in the realm of small data. This proof-of-concept study accentuates the novelty of visual data transformation in enhancing both interpretability and performance in stroke diagnosis using limited data, laying a robust foundation for future research in larger-scale evaluations. Full article

17 pages, 1251 KiB  
Article
Sound Event Detection in Domestic Environment Using Frequency-Dynamic Convolution and Local Attention
by Grigorios-Aris Cheimariotis and Nikolaos Mitianoudis
Information 2023, 14(10), 534; https://doi.org/10.3390/info14100534 - 30 Sep 2023
Cited by 3 | Viewed by 1572
Abstract
This work describes a methodology for sound event detection in domestic environments. Efficient solutions in this task can support the autonomous living of the elderly. The methodology deals with the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)” 2023, and more specifically with Task 4a “Sound event detection of domestic activities”. This task involves the detection of 10 common events in domestic environments in 10 s sound clips. The events may have arbitrary duration in the 10 s clip. The main components of the methodology are data augmentation on mel-spectrograms that represent the sound clips, feature extraction by passing spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and use of BiGRU for sequence modeling. Also, a mean teacher model is employed for leveraging unlabeled data. This research focuses on the effect of data augmentation techniques, of the feature extraction models, and on self-supervised learning. The main contribution is the proposed feature extraction model, which uses weighted attention on frequency in each convolution, combined in sequence with a local attention module adopted by computer vision. The proposed system features promising and robust performance. Full article
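The mean teacher component keeps an exponential moving average of the student's weights; a minimal sketch, where the decay rate and the stand-in model are assumptions:

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    """Mean-teacher update: teacher weights track an EMA of the student's weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

student = torch.nn.Linear(128, 10)    # stand-in for the actual feature-extraction model
teacher = copy.deepcopy(student)      # teacher starts as a copy, then tracks the EMA
for p in teacher.parameters():
    p.requires_grad_(False)           # the teacher is never trained directly

ema_update(teacher, student)          # called after every optimizer step
```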

17 pages, 5637 KiB  
Article
A Deep Neural Network for Working Memory Load Prediction from EEG Ensemble Empirical Mode Decomposition
by Sriniketan Sridhar, Anibal Romney and Vidya Manian
Information 2023, 14(9), 473; https://doi.org/10.3390/info14090473 - 25 Aug 2023
Viewed by 2052
Abstract
Mild Cognitive Impairment (MCI) and Alzheimer’s Disease (AD) are frequently associated with working memory (WM) dysfunction, which is also observed in various neural psychiatric disorders, including depression, schizophrenia, and ADHD. Early detection of WM dysfunction is essential to predict the onset of MCI and AD. Artificial Intelligence (AI)-based algorithms are increasingly used to identify biomarkers for detecting subtle changes in loaded WM. This paper presents an approach using electroencephalograms (EEG), time-frequency signal processing, and a Deep Neural Network (DNN) to predict WM load in normal and MCI-diagnosed subjects. EEG signals were recorded using an EEG cap during working memory tasks, including block tapping and N-back visuospatial interfaces. The data were bandpass-filtered, and independent components analysis was used to select the best electrode channels. The Ensemble Empirical Mode Decomposition (EEMD) algorithm was then applied to the EEG signals to obtain the time-frequency Intrinsic Mode Functions (IMFs). The EEMD and DNN methods perform better than traditional machine learning methods as well as Convolutional Neural Networks (CNN) for the prediction of WM load. Prediction accuracies were consistently higher for both normal and MCI subjects, averaging 97.62%. The average Kappa score for normal subjects was 94.98% and 92.49% for subjects with MCI. Subjects with MCI showed higher values for beta and alpha oscillations in the frontal region than normal subjects. The average power spectral density of the IMFs showed that the IMFs (p = 0.0469 for normal subjects and p = 0.0145 for subjects with MCI) are robust and reliable features for WM load prediction. Full article

Review


24 pages, 1556 KiB  
Review
Audio-Driven Facial Animation with Deep Learning: A Survey
by Diqiong Jiang, Jian Chang, Lihua You, Shaojun Bian, Robert Kosk and Greg Maguire
Information 2024, 15(11), 675; https://doi.org/10.3390/info15110675 - 28 Oct 2024
Viewed by 963
Abstract
Audio-driven facial animation is a rapidly evolving field that aims to generate realistic facial expressions and lip movements synchronized with a given audio input. This survey provides a comprehensive review of deep learning techniques applied to audio-driven facial animation, with a focus on both audio-driven facial image animation and audio-driven facial mesh animation. These approaches employ deep learning to map audio inputs directly onto 3D facial meshes or 2D images, enabling the creation of highly realistic and synchronized animations. This survey also explores evaluation metrics, available datasets, and the challenges that remain, such as disentangling lip synchronization and emotions, generalization across speakers, and dataset limitations. Lastly, we discuss future directions, including multi-modal integration, personalized models, and facial attribute modification in animations, all of which are critical for the continued development and application of this technology. Full article

52 pages, 3960 KiB  
Review
A Critical Analysis of Deep Semi-Supervised Learning Approaches for Enhanced Medical Image Classification
by Kaushlesh Singh Shakya, Azadeh Alavi, Julie Porteous, Priti K, Amit Laddi and Manojkumar Jaiswal
Information 2024, 15(5), 246; https://doi.org/10.3390/info15050246 - 24 Apr 2024
Viewed by 1908
Abstract
Deep semi-supervised learning (DSSL) is a machine learning paradigm that blends supervised and unsupervised learning techniques to improve the performance of various models in computer vision tasks. Medical image classification plays a crucial role in disease diagnosis, treatment planning, and patient care. However, obtaining labeled medical image data is often expensive and time-consuming for medical practitioners, leading to limited labeled datasets. DSSL techniques aim to address this challenge, particularly in various medical image tasks, to improve model generalization and performance. DSSL models leverage both the labeled information, which provides explicit supervision, and the unlabeled data, which can provide additional information about the underlying data distribution. That offers a practical solution to resource-intensive demands of data annotation, and enhances the model’s ability to generalize across diverse and previously unseen data landscapes. The present study provides a critical review of various DSSL approaches and their effectiveness and challenges in enhancing medical image classification tasks. The study categorized DSSL techniques into six classes: consistency regularization method, deep adversarial method, pseudo-learning method, graph-based method, multi-label method, and hybrid method. Further, a comparative analysis of performance for six considered methods is conducted using existing studies. The referenced studies have employed metrics such as accuracy, sensitivity, specificity, AUC-ROC, and F1 score to evaluate the performance of DSSL methods on different medical image datasets. Additionally, challenges of the datasets, such as heterogeneity, limited labeled data, and model interpretability, were discussed and highlighted in the context of DSSL for medical image classification. The current review provides future directions and considerations to researchers to further address the challenges and take full advantage of these methods in clinical practices. Full article
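As one concrete example of the pseudo-learning family reviewed here, confidence-thresholded pseudo-labeling can be sketched as follows; the 0.95 threshold is a FixMatch-style assumption, not a value from the review:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, x_unlabeled, threshold: float = 0.95):
    """Confidence-thresholded pseudo-labeling on an unlabeled batch (illustrative)."""
    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= threshold                  # keep only confident predictions
    if not mask.any():
        return torch.tensor(0.0, device=x_unlabeled.device)
    logits = model(x_unlabeled[mask])             # second pass with gradients enabled
    return F.cross_entropy(logits, pseudo[mask])  # add to the supervised loss
```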
