Artificial Intelligence for Multimedia Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 June 2021) | Viewed by 59424

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Prof. Dr. Byung-Gyu Kim
Guest Editor
Department of IT Engineering, Sookmyung Women’s University, Seoul 04310, Republic of Korea
Interests: image/video signal processing; pattern recognition; computer vision; deep learning; artificial intelligence

Prof. Dr. Dongsan Jun
Guest Editor
Department of Information and Communication Engineering, Kyungnam University, Changwon 51767, Republic of Korea
Interests: media; video coding; video compression; video encoder; image processing; realistic digital broadcasting system

Special Issue Information

Dear Colleagues,

At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a 2012 global image recognition contest, the University of Toronto SuperVision team led by Prof. Geoffrey Hinton took first and second place by a landslide, sparking an explosion of interest in deep learning. Since then, global experts and companies such as Google, Microsoft, NVIDIA, and Intel have been competing to lead artificial intelligence technologies such as deep learning. They are now developing deep-learning-based technologies for all industries and solving many classification and recognition problems.

These artificial intelligence technologies are also being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years attempts have been made to improve the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. In addition, technologies for media creation, processing, editing, and scenario generation are very important areas of research in multimedia processing and engineering.

This Special Issue invites submissions broadly across advanced computational intelligence algorithms and technologies for emerging multimedia signal processing. Specific topics include, but are not limited to:

- Signal/image/video processing algorithms for advanced deep learning;

- Fast processing and complexity reduction mechanisms based on deep neural networks;

- Protection technologies for privacy and personalized media data;

- Advanced circuit/system design and analysis based on deep neural networks;

- Image/video-based recognition algorithms using deep neural networks;

- Deep-learning-based speech and audio processing;

- Efficient multimedia sharing schemes using artificial intelligence;

- Artificial intelligence technologies for multimedia creation, processing, editing, and scenario generation;

- Deep-learning-based web data mining and representation.

Prof. Dr. Byung-Gyu Kim
Prof. Dr. Dongsan Jun
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Artificial/computational intelligence
  • Image/video/speech signal processing
  • Advanced deep learning
  • Learning mechanism
  • Multimedia processing

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (14 papers)


Editorial

Jump to: Research

3 pages, 174 KiB  
Editorial
Artificial Intelligence for Multimedia Signal Processing
by Byung-Gyu Kim and Dong-San Jun
Appl. Sci. 2022, 12(15), 7358; https://doi.org/10.3390/app12157358 - 22 Jul 2022
Cited by 1 | Viewed by 1673
Abstract
At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a 2012 global image recognition contest, the University of Toronto SuperVision team led by Prof [...] Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)

Research

Jump to: Editorial

13 pages, 6937 KiB  
Article
Reduction of Compression Artifacts Using a Densely Cascading Image Restoration Network
by Yooho Lee, Sang-hyo Park, Eunjun Rhee, Byung-Gyu Kim and Dongsan Jun
Appl. Sci. 2021, 11(17), 7803; https://doi.org/10.3390/app11177803 - 25 Aug 2021
Cited by 2 | Viewed by 2596
Abstract
Since high-quality realistic media are widely used in various computer vision applications, image compression is one of the essential technologies to enable real-time applications. Image compression generally causes undesired compression artifacts, such as blocking artifacts and ringing effects. In this study, we propose a densely cascading image restoration network (DCRN), which consists of an input layer, a densely cascading feature extractor, a channel attention block, and an output layer. The densely cascading feature extractor has three densely cascading (DC) blocks, and each DC block contains two convolutional layers, five dense layers, and a bottleneck layer. To optimize the proposed network architectures, we investigated the trade-off between quality enhancement and network complexity. Experimental results revealed that the proposed DCRN can achieve a better peak signal-to-noise ratio and structural similarity index measure for compressed Joint Photographic Experts Group (JPEG) images compared to the previous methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
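
For readers who want a concrete picture of the densely cascading idea, the following PyTorch sketch assembles a DC block (two convolutional layers, five dense layers, and a bottleneck layer) with a simple channel attention module. Channel widths, the growth rate, and the placement of the attention module are illustrative assumptions, not the authors' exact DCRN configuration.

```python
# Minimal sketch of a densely cascading (DC) block with channel attention,
# loosely following the DCRN description above. Sizes are assumed values.
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, growth, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate new features with all previous ones (dense connection).
        return torch.cat([x, self.act(self.conv(x))], dim=1)

class ChannelAttention(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))  # re-weight feature channels

class DCBlock(nn.Module):
    def __init__(self, ch=64, growth=32, n_dense=5):
        super().__init__()
        self.head = nn.Conv2d(ch, ch, 3, padding=1)
        self.dense = nn.Sequential(
            *[DenseLayer(ch + i * growth, growth) for i in range(n_dense)])
        self.bottleneck = nn.Conv2d(ch + n_dense * growth, ch, 1)  # 1x1 bottleneck
        self.tail = nn.Conv2d(ch, ch, 3, padding=1)
        self.attn = ChannelAttention(ch)

    def forward(self, x):
        out = self.bottleneck(self.dense(self.head(x)))
        return self.attn(self.tail(out)) + x  # residual connection

y = DCBlock()(torch.randn(1, 64, 48, 48))  # -> torch.Size([1, 64, 48, 48])
```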

17 pages, 6472 KiB  
Article
Context-Based Structure Mining Methodology for Static Object Re-Identification in Broadcast Content
by Krishna Kumar Thirukokaranam Chandrasekar and Steven Verstockt
Appl. Sci. 2021, 11(16), 7266; https://doi.org/10.3390/app11167266 - 6 Aug 2021
Cited by 1 | Viewed by 1644
Abstract
Technological advancement, in addition to the pandemic, has given rise to an explosive increase in the consumption and creation of multimedia content worldwide. This has motivated people to enrich and publish their content in a way that enhances the experience of the user. In this paper, we propose a context-based structure mining pipeline that not only attempts to enrich the content, but also simultaneously splits it into shots and logical story units (LSU). Subsequently, this paper extends the structure mining pipeline to re-ID objects in broadcast videos such as SOAPs. We hypothesise the object re-ID problem of SOAP-type content to be equivalent to the identification of reoccurring contexts, since these contexts normally have a unique spatio-temporal similarity within the content structure. By implementing pre-trained models for object and place detection, the pipeline was evaluated using metrics for shot and scene detection on benchmark datasets, such as RAI. The object re-ID methodology was also evaluated on 20 randomly selected episodes from broadcast SOAP shows New Girl and Friends. We demonstrate, quantitatively, that the pipeline outperforms existing state-of-the-art methods for shot boundary detection, scene detection, and re-identification tasks. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
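
The pipeline first splits the content into shots before mining recurring contexts. Purely as an illustration of the shot-splitting step (the paper's own pipeline builds on pre-trained object and place detectors, not on this heuristic), a minimal color-histogram-difference shot boundary detector might look like the sketch below; the threshold and bin count are arbitrary assumptions.

```python
# Illustrative shot-boundary detection by color-histogram difference between
# consecutive frames. Generic baseline only, not the authors' pipeline.
import numpy as np

def color_histogram(frame, bins=16):
    """frame: HxWx3 uint8 array -> normalized per-channel histogram."""
    hist = np.concatenate(
        [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0] for c in range(3)])
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    """Return frame indices where a new shot is assumed to start."""
    boundaries = []
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        # L1 distance between successive histograms; a large jump suggests a cut.
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries

# Example with random frames (real use would read decoded video frames).
frames = [np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8) for _ in range(10)]
print(detect_shot_boundaries(frames))
```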

13 pages, 4074 KiB  
Article
3D Avatar Approach for Continuous Sign Movement Using Speech/Text
by Debashis Das Chakladar, Pradeep Kumar, Shubham Mandal, Partha Pratim Roy, Masakazu Iwamura and Byung-Gyu Kim
Appl. Sci. 2021, 11(8), 3439; https://doi.org/10.3390/app11083439 - 12 Apr 2021
Cited by 14 | Viewed by 10141
Abstract
Sign language is a visual language for communication used by hearing-impaired people with the help of hand and finger movements. Indian Sign Language (ISL) is a well-developed and standard way of communication for hearing-impaired people living in India. However, other people who use spoken language always face difficulty while communicating with a hearing-impaired person due to lack of sign language knowledge. In this study, we have developed a 3D avatar-based sign language learning system that converts the input speech/text into corresponding sign movements for ISL. The system consists of three modules. Initially, the input speech is converted into an English sentence. Then, that English sentence is converted into the corresponding ISL sentence using the Natural Language Processing (NLP) technique. Finally, the motion of the 3D avatar is defined based on the ISL sentence. The translation module achieves a 10.50 SER (Sign Error Rate) score. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)

13 pages, 401 KiB  
Article
A Novel 1-D CCANet for ECG Classification
by Ian-Christopher Tanoh and Paolo Napoletano
Appl. Sci. 2021, 11(6), 2758; https://doi.org/10.3390/app11062758 - 19 Mar 2021
Cited by 11 | Viewed by 3413
Abstract
This paper puts forward a 1-D convolutional neural network (CNN) that exploits a novel analysis of the correlation between the two leads of the noisy electrocardiogram (ECG) to classify heartbeats. The proposed method is one-dimensional, enabling complex structures while maintaining a reasonable computational complexity. It is based on the combination of elementary handcrafted time domain features, frequency domain features through spectrograms and the use of autoregressive modeling. On the MIT-BIH database, a 95.52% overall accuracy is obtained by classifying 15 types, whereas a 95.70% overall accuracy is reached when classifying 7 types from the INCART database. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
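
To illustrate the one-dimensional setting, the sketch below simply stacks the two ECG leads as input channels of a small 1-D CNN. It deliberately omits the paper's CCANet-style correlation analysis and the handcrafted, spectrogram, and autoregressive features; all layer sizes are assumptions.

```python
# Minimal 1-D CNN over two ECG leads (PyTorch). Illustrative only.
import torch
import torch.nn as nn

class TwoLeadECGNet(nn.Module):
    def __init__(self, n_classes=15, n_samples=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2))
        self.classifier = nn.Linear(32 * (n_samples // 4), n_classes)

    def forward(self, x):            # x: (batch, 2 leads, n_samples)
        f = self.features(x)
        return self.classifier(f.flatten(1))

logits = TwoLeadECGNet()(torch.randn(8, 2, 256))  # -> (8, 15) class scores
```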

16 pages, 3719 KiB  
Article
Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech
by Yun Kyung Lee and Jeon Gue Park
Appl. Sci. 2021, 11(6), 2642; https://doi.org/10.3390/app11062642 - 16 Mar 2021
Cited by 4 | Viewed by 2412
Abstract
This paper addresses an automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker’s spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions to those of native speakers. In order to compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker’s English sentence is different from that of the native speaker, we align the phonemic sequences based on a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In addition, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy for all proficiency areas compared to a method employing conventional acoustic models. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
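
The stress and rhythm comparison relies on aligning the learner's phoneme sequence to a native reference with dynamic time warping. A minimal DTW sketch under an assumed 0/1 substitution cost (the paper's actual local cost is not specified here) is shown below.

```python
# Dynamic time warping (DTW) over two phoneme sequences. The 0/1 cost is an
# illustrative assumption, not the paper's exact local distance.
def dtw_align(seq_a, seq_b, cost=lambda a, b: 0 if a == b else 1):
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # insertion, deletion, or match/substitution
            d[i][j] = cost(seq_a[i - 1], seq_b[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]  # total alignment cost

# Lower cost means the learner's phoneme sequence is closer to the reference.
print(dtw_align("DH IH S IH Z".split(), "DH IH S IH S".split()))  # -> 1.0
```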

11 pages, 1606 KiB  
Article
Rain Streak Removal for Single Images Using Conditional Generative Adversarial Networks
by Prasad Hettiarachchi, Rashmika Nawaratne, Damminda Alahakoon, Daswin De Silva and Naveen Chilamkurti
Appl. Sci. 2021, 11(5), 2214; https://doi.org/10.3390/app11052214 - 3 Mar 2021
Cited by 13 | Viewed by 3288
Abstract
Rapid developments in urbanization and smart city environments have accelerated the need to deliver safe, sustainable, and effective resource utilization and service provision and have thereby enhanced the need for intelligent, real-time video surveillance. Recent advances in machine learning and deep learning have made it possible to detect and localize salient objects in surveillance video streams; however, several practical issues remain unaddressed, such as diverse weather conditions, recording conditions, and motion blur. In this context, image de-raining is an important issue that has been investigated extensively in recent years to provide accurate and high-quality surveillance in the smart city domain. Existing deep convolutional neural networks have achieved great success in image translation and other computer vision tasks; however, image de-raining is ill-posed and has not been addressed in real-time, intelligent video surveillance systems. In this work, we propose to utilize the generative capabilities of recently introduced conditional generative adversarial networks (cGANs) as an image de-raining approach. We utilize the adversarial loss in GANs, which provides an additional component to the loss function that in turn regulates the final output and helps to yield better results. Experiments on both real and synthetic data show that the proposed method outperforms most of the existing state-of-the-art models in terms of quantitative evaluations and visual appearance. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
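
Conditional GAN image-to-image models of this kind commonly combine an adversarial term with a pixel-wise L1 reconstruction term in the generator objective. The sketch below shows that combination; the tiny patch discriminator and the weighting factor are assumed stand-ins, not the paper's exact formulation.

```python
# Hedged sketch of a cGAN generator objective for de-raining:
# adversarial loss + weighted L1 reconstruction loss.
import torch
import torch.nn as nn

adv_criterion = nn.BCEWithLogitsLoss()
l1_criterion = nn.L1Loss()
lambda_l1 = 100.0  # assumed trade-off weight between the two terms

def generator_loss(discriminator, rainy, derained, clean):
    # The discriminator sees the conditioning input (rainy image) with the output.
    pred_fake = discriminator(torch.cat([rainy, derained], dim=1))
    adv = adv_criterion(pred_fake, torch.ones_like(pred_fake))  # try to fool D
    rec = l1_criterion(derained, clean)                         # stay close to ground truth
    return adv + lambda_l1 * rec

# Dummy patch discriminator just to make the sketch executable end to end.
disc = nn.Sequential(nn.Conv2d(6, 1, 4, stride=2, padding=1))
rainy = torch.randn(1, 3, 64, 64)
derained = torch.randn(1, 3, 64, 64)   # would come from the generator
clean = torch.randn(1, 3, 64, 64)      # ground-truth rain-free image
print(generator_loss(disc, rainy, derained, clean))
```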

14 pages, 2752 KiB  
Article
Discovering Sentimental Interaction via Graph Convolutional Network for Visual Sentiment Prediction
by Lifang Wu, Heng Zhang, Sinuo Deng, Ge Shi and Xu Liu
Appl. Sci. 2021, 11(4), 1404; https://doi.org/10.3390/app11041404 - 4 Feb 2021
Cited by 21 | Viewed by 2603
Abstract
With the popularity of online opinion expression, automatic sentiment analysis of images has gained considerable attention. Most methods focus on effectively extracting the sentimental features of images, such as enhancing local features through saliency detection or instance segmentation tools. However, as a high-level abstraction, sentiment is difficult to capture accurately from visual elements alone because of the “affective gap”. Previous works have overlooked the contribution of the interaction among objects to the image sentiment. We aim to utilize the interactive characteristics of objects in the sentimental space, inspired by the human sentimental principle that each object contributes to the overall sentiment. To achieve this goal, we propose a framework to leverage the sentimental interaction characteristic based on a Graph Convolutional Network (GCN). We first utilize an off-the-shelf tool to recognize objects and build a graph over them. Visual features represent nodes, and the emotional distances between objects act as edges. Then, we employ GCNs to obtain the interaction features among objects, which are fused with the CNN output of the whole image to predict the final results. Experimental results show that our method exceeds the state-of-the-art algorithms, demonstrating that the rational use of interaction features can improve performance in sentiment analysis. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
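
The overall idea can be pictured as a graph convolution over object nodes whose edge weights encode emotional distances, pooled and fused with a whole-image CNN feature. In the sketch below, the feature dimensions, the mean pooling, and the single-layer GCN with row-normalized adjacency are all illustrative assumptions.

```python
# Toy sketch: graph convolution over detected objects fused with an image feature.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Row-normalize the adjacency matrix, then propagate node features.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear((adj / deg) @ x))

n_objects, obj_dim, img_dim, n_classes = 5, 128, 512, 2
gcn = SimpleGCNLayer(obj_dim, 64)
fusion = nn.Linear(64 + img_dim, n_classes)

obj_feats = torch.randn(n_objects, obj_dim)   # node features from detected objects
adj = torch.rand(n_objects, n_objects)        # edge weights, e.g., emotional distances
img_feat = torch.randn(img_dim)               # CNN feature of the whole image

interaction = gcn(obj_feats, adj).mean(dim=0) # pool object interaction features
logits = fusion(torch.cat([interaction, img_feat]))
```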

12 pages, 5708 KiB  
Article
Single Image Super-Resolution Method Using CNN-Based Lightweight Neural Networks
by Seonjae Kim, Dongsan Jun, Byung-Gyu Kim, Hunjoo Lee and Eunjun Rhee
Appl. Sci. 2021, 11(3), 1092; https://doi.org/10.3390/app11031092 - 25 Jan 2021
Cited by 24 | Viewed by 5440
Abstract
There are many studies that seek to enhance a low-resolution image into a high-resolution image in the area of super-resolution. As deep learning technologies have recently shown impressive results in the image interpolation and restoration field, recent studies are focusing on convolutional neural network (CNN)-based super-resolution schemes to surpass the conventional pixel-wise interpolation methods. In this paper, we propose two lightweight neural networks with a hybrid residual and dense connection structure to improve the super-resolution performance. In order to design the proposed networks, we extracted training images from the DIVerse 2K (DIV2K) image dataset and investigated the trade-off between the quality enhancement performance and network complexity under the proposed methods. The experimental results show that the proposed methods can significantly reduce both the inference time and the memory required to store parameters and intermediate feature maps, while maintaining image quality similar to that of the previous methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
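
As a rough illustration of a lightweight hybrid residual/dense super-resolution design, the sketch below fuses the outputs of a few residual blocks with a 1x1 convolution and upscales with sub-pixel convolution (PixelShuffle). It is not the paper's architecture; all channel counts and block depths are assumptions.

```python
# Illustrative lightweight super-resolution network (PyTorch).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)    # local residual connection

class LightSR(nn.Module):
    def __init__(self, ch=32, n_blocks=4, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.ModuleList([ResBlock(ch) for _ in range(n_blocks)])
        self.fuse = nn.Conv2d(ch * (n_blocks + 1), ch, 1)  # dense-style fusion of all block outputs
        self.upscale = nn.Sequential(
            nn.Conv2d(ch, 3 * scale * scale, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, x):
        feats = [self.head(x)]
        for block in self.blocks:
            feats.append(block(feats[-1]))
        return self.upscale(self.fuse(torch.cat(feats, dim=1)))

sr = LightSR()(torch.randn(1, 3, 64, 64))  # -> (1, 3, 128, 128)
```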

15 pages, 1416 KiB  
Article
A Multi-Resolution Approach to GAN-Based Speech Enhancement
by Hyung Yong Kim, Ji Won Yoon, Sung Jun Cheon, Woo Hyun Kang and Nam Soo Kim
Appl. Sci. 2021, 11(2), 721; https://doi.org/10.3390/app11020721 - 13 Jan 2021
Cited by 21 | Viewed by 4007
Abstract
Recently, generative adversarial networks (GANs) have been successfully applied to speech enhancement. However, there still remain two issues that need to be addressed: (1) GAN-based training is typically unstable due to its non-convex property, and (2) most of the conventional methods do not fully take advantage of the speech characteristics, which could result in a sub-optimal solution. In order to deal with these problems, we propose a progressive generator that can handle the speech in a multi-resolution fashion. Additionally, we propose a multi-scale discriminator that discriminates the real and generated speech at various sampling rates to stabilize GAN training. The proposed structure was compared with the conventional GAN-based speech enhancement algorithms using the VoiceBank-DEMAND dataset. Experimental results showed that the proposed approach can make the training faster and more stable, which improves the performance on various metrics for speech enhancement. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
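
A multi-scale discriminator of the kind described can be sketched as the same small 1-D discriminator applied to progressively downsampled waveforms, as below; the layer shapes, the number of scales, and the use of average-pool downsampling are assumptions rather than the paper's exact design.

```python
# Sketch of a multi-scale waveform discriminator (PyTorch).
import torch
import torch.nn as nn

def make_discriminator():
    return nn.Sequential(
        nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
        nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
        nn.Conv1d(32, 1, 3, padding=1))   # patch-wise real/fake scores

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, n_scales=3):
        super().__init__()
        self.discs = nn.ModuleList([make_discriminator() for _ in range(n_scales)])
        self.downsample = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, wav):               # wav: (batch, 1, samples)
        outputs = []
        for disc in self.discs:
            outputs.append(disc(wav))     # judge the waveform at this resolution
            wav = self.downsample(wav)    # halve the sampling rate for the next scale
        return outputs

scores = MultiScaleDiscriminator()(torch.randn(2, 1, 16384))
```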

12 pages, 1140 KiB  
Article
Place Classification Algorithm Based on Semantic Segmented Objects
by Woon-Ha Yeo, Young-Jin Heo, Young-Ju Choi and Byung-Gyu Kim
Appl. Sci. 2020, 10(24), 9069; https://doi.org/10.3390/app10249069 - 18 Dec 2020
Cited by 8 | Viewed by 2540
Abstract
Scene or place classification is one of the important problems in image and video search and recommendation systems. Humans can understand the scene in which they are located, but this is difficult for machines. Considering a scene image that contains several objects, humans recognize the scene based on these objects, especially the background objects. Based on this observation, we propose an efficient scene classification algorithm for three different classes by detecting objects in the scene. We use a pre-trained semantic segmentation model to extract objects from an image. After that, we construct a weight matrix to better determine the scene class. Finally, we classify an image into one of three scene classes (i.e., indoor, nature, city) by using the designed weighting matrix. Our scheme outperforms several classification methods using convolutional neural networks (CNNs), such as VGG, Inception, ResNet, ResNeXt, Wide-ResNet, DenseNet, and MnasNet. The proposed model achieves 90.8% verification accuracy, an improvement of more than 2.8% over the existing CNN-based methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
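
The weighting-matrix step can be pictured with a toy example: each segmented object votes for the three scene classes through a class-by-object weight matrix, and the class with the highest score wins. The object list and weight values below are made up purely for illustration.

```python
# Toy sketch of scene classification from segmented-object coverage.
import numpy as np

classes = ["indoor", "nature", "city"]
objects = ["sofa", "tree", "building", "sky", "floor"]

# weight[c, o]: how strongly object o indicates scene class c (assumed values)
weight = np.array([
    [0.9, 0.0, 0.0, 0.0, 0.8],   # indoor
    [0.0, 0.9, 0.1, 0.5, 0.0],   # nature
    [0.0, 0.1, 0.9, 0.4, 0.0],   # city
])

def classify_scene(object_areas):
    """object_areas: fraction of the image covered by each object class."""
    scores = weight @ np.asarray(object_areas)
    return classes[int(np.argmax(scores))]

print(classify_scene([0.0, 0.35, 0.2, 0.3, 0.0]))  # -> "nature"
```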

20 pages, 4670 KiB  
Article
Recommendations for Different Tasks Based on the Uniform Multimodal Joint Representation
by Haiying Liu, Sinuo Deng, Lifang Wu, Meng Jian, Bowen Yang and Dai Zhang
Appl. Sci. 2020, 10(18), 6170; https://doi.org/10.3390/app10186170 - 4 Sep 2020
Cited by 4 | Viewed by 2533
Abstract
Content curation social networks (CCSNs), such as Pinterest and Huaban, are interest driven and content centric. On CCSNs, user interests are represented by a set of boards, and a board is composed of various pins. A pin is an image with a description. All entities, such as users, boards, and categories, can be represented as a set of pins. Therefore, it is possible to implement entity representation and the corresponding recommendations in a uniform representation space built from pins. Furthermore, many pins are re-pinned from other users, and the re-pin sequences of pins are recorded on CCSNs. In this paper, a framework that can learn the multimodal joint representation of pins, including text representation, image representation, and multimodal fusion, is proposed. Image representations are extracted from a multilabel convolutional neural network. The multiple labels of pins are automatically obtained from the category distributions in the re-pin sequences, which benefits from the network architecture. Text representations are obtained with the word2vec tool. The two modalities are fused with a multimodal deep Boltzmann machine. On the basis of the pin representation, different recommendation tasks are implemented, including recommending pins or boards to users, recommending thumbnails to boards, and recommending categories to boards. Experimental results on a dataset from Huaban demonstrate that the multimodal joint representation of pins contains the information of user interests. Furthermore, the proposed multimodal joint representation outperformed unimodal representations in different recommendation tasks. Experiments were also performed to validate the effectiveness of the proposed recommendation methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
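
One way to picture the uniform representation space is that every entity (board, user, category) is an aggregate of its pins' joint embeddings, so recommendation reduces to nearest-neighbour search in that space. The sketch below uses random vectors as stand-ins for the multimodal pin representations and mean pooling plus cosine similarity as assumed choices.

```python
# Toy sketch: entity representation and pin recommendation in one shared space.
import numpy as np

rng = np.random.default_rng(0)
pin_embeddings = rng.normal(size=(100, 64))   # 100 pins, 64-d joint (image+text) features

def entity_embedding(pin_indices):
    """An entity (board or user) is represented by the mean of its pins."""
    return pin_embeddings[pin_indices].mean(axis=0)

def recommend_pins(entity_vec, top_k=5):
    sims = pin_embeddings @ entity_vec
    sims /= np.linalg.norm(pin_embeddings, axis=1) * np.linalg.norm(entity_vec)
    return np.argsort(-sims)[:top_k]          # indices of the most similar pins

board = entity_embedding([3, 17, 42])
print(recommend_pins(board))
```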

16 pages, 827 KiB  
Article
The Application and Improvement of Deep Neural Networks in Environmental Sound Recognition
by Yu-Kai Lin, Mu-Chun Su and Yi-Zeng Hsieh
Appl. Sci. 2020, 10(17), 5965; https://doi.org/10.3390/app10175965 - 28 Aug 2020
Cited by 11 | Viewed by 2584
Abstract
Neural networks have achieved great results in sound recognition, and many different kinds of acoustic features have been tried as the training input for the network. However, there is still doubt about whether a neural network can efficiently extract features from the raw audio signal input. This study improved the raw-signal-input network from other research by using deeper network architectures. The raw signals could be better analyzed in the proposed network. We also present a discussion of several kinds of network settings, and with the spectrogram-like conversion, our network could reach an accuracy of 73.55% on the open audio dataset “Dataset for Environmental Sound Classification 50” (ESC50). This study also proposed a network architecture that can combine different kinds of network feeds with different features. With the help of global pooling, a flexible fusion method was integrated into the network. Our experiment successfully combined two different networks with different audio feature inputs (a raw audio signal and the log-mel spectrum). Using the above settings, the proposed ParallelNet finally reached an accuracy of 81.55% on ESC50, which also reaches the recognition level of human beings. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
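
The fusion idea can be sketched as two branches, one operating on the raw waveform and one on the log-mel spectrogram, each reduced to a fixed-size vector by global pooling and concatenated before the classifier. The branch depths and channel counts below are assumptions, not the ParallelNet configuration itself.

```python
# Sketch of global-pooling fusion of a raw-waveform branch and a log-mel branch.
import torch
import torch.nn as nn

class ParallelFusionNet(nn.Module):
    def __init__(self, n_classes=50):
        super().__init__()
        self.raw_branch = nn.Sequential(                 # 1-D branch on the raw waveform
            nn.Conv1d(1, 32, 64, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))                     # global pooling -> fixed-size vector
        self.mel_branch = nn.Sequential(                 # 2-D branch on the log-mel spectrogram
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wav, logmel):
        raw_feat = self.raw_branch(wav).flatten(1)       # (batch, 32)
        mel_feat = self.mel_branch(logmel).flatten(1)    # (batch, 32)
        return self.classifier(torch.cat([raw_feat, mel_feat], dim=1))

logits = ParallelFusionNet()(torch.randn(4, 1, 22050), torch.randn(4, 1, 128, 431))
```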

18 pages, 7190 KiB  
Article
Human Height Estimation by Color Deep Learning and Depth 3D Conversion
by Dong-seok Lee, Jong-soo Kim, Seok Chan Jeong and Soon-kak Kwon
Appl. Sci. 2020, 10(16), 5531; https://doi.org/10.3390/app10165531 - 10 Aug 2020
Cited by 20 | Viewed by 12272
Abstract
In this study, an estimation method for human height is proposed using color and depth information. Color images are used for deep learning with Mask R-CNN to detect a human body and a human head separately. If color images are not available for extracting the human body region due to a low-light environment, then the human body region is extracted by comparing the current frame of the depth video with a pre-stored background depth image. The topmost point of the human head region is taken as the top of the head and the bottommost point of the human body region as the bottom of the foot. The depth value of the head-top point is corrected to a pixel value that has high similarity to a neighboring pixel. The position of the body-bottom point is corrected by calculating a depth gradient between vertically adjacent pixels. The head-top and foot-bottom points are converted into 3D real-world coordinates using the depth information, and the human height is estimated as the Euclidean distance between these two real-world coordinates. Estimation errors for human height are corrected as the average of accumulated heights. In the experimental results, the estimation errors of human height for a standing person are 0.7% and 2.2% when the human body region is extracted by Mask R-CNN and by the background depth image, respectively. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
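
The geometric core of the method, back-projecting the head-top and foot-bottom pixels to 3-D camera coordinates and measuring their Euclidean distance, can be sketched as follows; the camera intrinsics and the example pixel/depth values are assumed, not taken from the paper.

```python
# Sketch: pixel + depth -> 3-D point, then height as a Euclidean distance.
import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5   # assumed depth-camera intrinsics

def to_world(u, v, depth_mm):
    """Pixel (u, v) with depth in millimetres -> 3-D point in metres."""
    z = depth_mm / 1000.0
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

head_top = to_world(320, 60, 2480)     # example head-top pixel and depth
foot_bottom = to_world(320, 430, 2520) # example foot-bottom pixel and depth
height_m = np.linalg.norm(head_top - foot_bottom)
print(f"estimated height: {height_m:.2f} m")   # ~1.76 m for these values
```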
