1. Introduction
With the rapid advancement of technology, envisioning a future filled with smart gadgets is becoming increasingly feasible. Smart security cameras are wire-free cameras that can do more than simply record video and take pictures. These devices include additional features that allow you to check on the condition of your home even when you are hundreds of miles away, providing assurance that your home is secure and safe from intruders. The detection of human actions from real-time CCTV video streams is a major research topic and is highly useful for video surveillance and anomaly detection.
1.1. Motivation
In the domain of security, there are two conflicting criteria for smart monitoring. The first requirement is high-quality video. Having a visual record of an incident is useless if no valuable information can be extracted from it. The capacity to recognize persons or interpret the appearance of objects can be decisive in determining whether a video contains important information. The overall visual fidelity of a video is determined by its resolution: the higher the resolution, the clearer the video. Resolution also affects file size, so a high-definition video has a larger file size than a standard-definition video of the same runtime. The second requirement for smart security monitoring is the availability of data storage. Once recording is complete, the video must be saved, which requires memory, and additional processing power if it is to be streamed. This memory and processing capacity requires a substantial financial commitment. As a result, the size of a video file accounts for a significant portion of the entire cost of any security system.
Video compression could help address this issue. A video codec removes parts of the data that are deemed redundant, reducing the size of the video. However, because video compression results in data loss, some of the lost data may be critical, as a deep learning neural network may rely on the missing data. We define critical data as data that have a statistical link with poor deep learning neural network prediction performance if they are omitted or altered. As a result, it is critical to pick a video compression format that strikes the optimal balance between removing unnecessary data and retaining crucial data. In addition, standard video compression standards are intended for human video consumption. Most videos are currently processed by machines, and human consumption is linked to validating machine analysis results [1]. A new video codec could permit lower bitrates without compromising machine analysis performance.
1.2. Research Purpose
Humans watch and evaluate video very differently from machines. Perceived differences in the colour, shape, and size of objects are influenced by their distance, the lighting, familiarity, and image resolution. The impact of compression on human perception of digital information can be measured using metrics such as the Structural Similarity Index Metric (SSIM) [2,3] and the Peak Signal-to-Noise Ratio (PSNR) [3]. The goal of this research is to identify the maximum bitrate compression that a video may undergo while a neural network is still able to predict the action in the video.
This is accomplished by examining the outputs of a Convolutional Dense Neural Network (5C3DNN) tested on several different bitrate versions of video sequences from the i3DPost Multi-view Human Action Dataset [4]. The 5C3DNN is trained on the HMDB51 Dataset [5].
2. Related Work
From previous research, it can be assumed that the major adverse effects on neural network prediction probability appear to be caused by several types of blur and noise. To test the performance of a new capsule-based architecture [6] on data that include intrinsic noise, an image classification architecture based on the clustering of neurons into capsules and a new dynamic routing protocol was explored. To this end, the efficiencies of six widely used Convolutional Neural Networks were reviewed on two distinct datasets with varying image-quality distortions. It was demonstrated that high classification accuracy is not achievable when a dataset comprises a large number of classes, and that the new capsule architecture is resistant to specific types of image loss.
The effects of x264 video encoding optimized for human vision on neural network performance in the realm of autonomous vehicles were analysed, potentially demonstrating the relevance of x264 video encoding as a method for decreasing the volume of data required for the sufficient performance of machine learning algorithms used for autonomous driving while preserving human perception levels of the data [7].
A range of network architectures and application domains, including end-to-end convolution, encoder–decoder, region-based CNN (R-CNN), dual-stream, and generative adversarial networks (GAN), have been examined to demonstrate a nonlinear and nonuniform link between network performance and the amount of lossy compression [8]. Notably, performance degrades drastically below a JPEG quality (quantization) level of 15% and an H.264 Constant Rate Factor (CRF) of 40%. In certain instances, however, retraining these architectures on pre-compressed imagery recovers network performance. In addition, there is a correlation between architectures employing an encoder–decoder pipeline and lossy image compression resistance.
Training the models [9] on uncompressed images increases their robustness when evaluated on compressed data; moderate image compression has no effect on the classification performance of deep learning-based models. On average, an image can be compressed by a quality factor of 10 for a JPEG encoder and 20 for a MozJPEG encoder while preserving its classification accuracy [10].
3. Methodology
3.1. Bitrate Compression
The FFmpeg [11] tool is used to compress our dataset in terms of bitrate. We pick FFmpeg to compress our datasets since it is a high-performance, industry-leading, open-source multimedia framework for encoding, decoding, transcoding, and other tasks. Video bitrate is the rate at which video data are conveyed, in bits per second. When a video is to be compressed, a maximum bitrate that must not be exceeded is passed to the video codec. For example, if the video codec is instructed to compress the video at 1 Mbps, the encoder compresses each second of the video so that the decoder receives only 1 Mb of data per second.
There are several strategies for choosing and applying a bitrate value during video compression, such as the following (a minimal encoding sketch is given after the list):
Constant Bitrate (CBR) [12]: video quality is sacrificed in order to keep the bitrate constant.
Variable Bitrate (VBR) [12]: maintains video quality while allowing the bitrate to fluctuate.
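As an illustration of how the bitrate cap described above is requested in practice, the following is a minimal sketch that drives FFmpeg from Python; the file names and the 35 Kbps target are placeholders chosen for this example rather than values from our pipeline. Setting -maxrate equal to -b:v approximates constant-bitrate behaviour, while omitting it lets the bitrate vary.

import subprocess

def compress_to_bitrate(src, dst, bitrate="35k"):
    """Re-encode a video with H.264 while capping the bitrate.

    -b:v sets the target average bitrate; -maxrate and -bufsize
    constrain how far the encoder may deviate from that target.
    """
    cmd = [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,               # input video
        "-c:v", "libx264",       # H.264 encoder
        "-b:v", bitrate,         # target average bitrate
        "-maxrate", bitrate,     # ceiling on the instantaneous bitrate
        "-bufsize", "70k",       # rate-control buffer size (illustrative)
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example with hypothetical file names:
# compress_to_bitrate("sequence.mp4", "sequence_35k.mp4", "35k")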
To understand how bitrate affects video quality, it is necessary to know how video compression works. When a video is compressed, the compression technique uses the Discrete Cosine Transform (DCT) [13] to move the video from the pixel domain to the frequency domain, and quantization [14] to discard a significant number of frequencies that the human eye cannot distinguish. Compressing a video therefore trades information for file size. If a video is compressed heavily, a substantial amount of data are lost and the degradation is obvious; if it is compressed lightly, the file size is large but the video quality is excellent. This is known as the rate–distortion trade-off in video compression. Assuming that the resolution is fixed, the lower the bitrate, the worse the video quality.
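To make the transform-and-quantize step concrete, the following toy Python sketch applies a 2D DCT to an 8 × 8 block, quantizes the coefficients with a single illustrative step size, and reconstructs the block; the step size of 16 is an arbitrary value chosen for demonstration and does not correspond to any particular codec profile.

import numpy as np
from scipy.fft import dctn, idctn

# An arbitrary 8 x 8 block of 8-bit pixel values, just for demonstration.
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)

# Forward 2D DCT: pixel domain -> frequency domain.
coeffs = dctn(block, norm="ortho")

# Quantization: divide by a step size and round, which zeroes out many
# small (mostly high-frequency) coefficients. The step size is illustrative.
step = 16.0
quantized = np.round(coeffs / step)

# Dequantize and apply the inverse DCT to reconstruct the block.
reconstructed = idctn(quantized * step, norm="ortho")

# The reconstruction error is the distortion traded for a smaller representation.
print("zero coefficients:", int(np.sum(quantized == 0)))
print("mean absolute error:", float(np.abs(block - reconstructed).mean()))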
The i3DPost Multi-view Human Action Dataset consists of sequential png frames; because the encoding pipeline operates on video files, we first use FFmpeg to compile this dataset into an mp4 file. As depicted in Figure 1, this video is fed into an H.264/HEVC [15] video encoder, which performs prediction, transformation, and encoding operations to produce a compressed bitstream. An H.264/HEVC video decoder then performs decoding, inverse transformation, and reconstruction to produce a decoded video sequence. This decoded video sequence is bitrate-altered before being evaluated with the 5C3DNN model.
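A sketch of how the png frames could be compiled into an mp4 file with FFmpeg is shown below; the frame-name pattern and frame rate are assumptions about the dataset layout rather than details confirmed by the dataset documentation.

import subprocess

def frames_to_mp4(frame_pattern, dst, fps=30):
    """Compile a sequence of png frames into an H.264 mp4 file."""
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),    # input frame rate
        "-i", frame_pattern,       # e.g. "frames/%03d.png" (assumed naming)
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",     # 4:2:0 chroma subsampling
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example with a hypothetical path:
# frames_to_mp4("frames/%03d.png", "sequence.mp4")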
3.2. Neural Network
A neural network classification model with five convolution layers is built; this is a basic classifier. A moving-average technique [16] is employed: a function takes ‘n’ frames from a video and predicts the action in each frame, and averaging the predictions of these ‘n’ frames yields the overall score, producing a single prediction for the entire video. This is referred to as the Single-Frame CNN method.
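The Single-Frame CNN scoring described above can be sketched as follows; the frame-sampling strategy, the 64 × 64 input size, and the model variable are assumptions consistent with the description rather than the exact implementation.

import cv2
import numpy as np

def predict_video(model, video_path, n_frames=20, size=(64, 64)):
    """Single-Frame CNN: predict each sampled frame, then average the scores."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample n evenly spaced frame indices across the video.
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)

    frame_probs = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)             # OpenCV reads BGR
        frame = cv2.resize(frame, size).astype("float32") / 255.0  # normalise to [0, 1]
        # Per-frame class probabilities from the image classifier.
        probs = model.predict(frame[np.newaxis, ...], verbose=0)[0]
        frame_probs.append(probs)
    cap.release()

    # The video-level score is the average of the per-frame predictions.
    return np.mean(frame_probs, axis=0) if frame_probs else None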
The 5C3DNN model (Figure 2) is first trained on the HMDB51 dataset. The HMDB51 dataset includes videos from many sources, such as movies and web videos. The dataset includes 6849 video clips from 51 activity categories, each of which contains at least 101 clips. This 5C3DNN model is an image classification model that operates on every frame of the video before averaging the individual probabilities to produce a final probability vector.
Several constants are defined during the preparation of the dataset, including the frame height and width (64 × 64) and the maximum number of frames collected from each action to train the model. All frames are resized to this standard size, and each resized frame becomes a feature. Each feature is assigned an action label, giving two lists: features and labels. The labels are converted to one-hot-encoded vectors. After these lists are split into training and testing sets, the training data are fed into the 5C3DNN model. After training, the model predicts with 96% accuracy on the test data.
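A minimal sketch of what such a five-convolution-layer classifier might look like in Keras is given below; the filter counts, kernel sizes, and dense-layer widths are illustrative assumptions, since they are not specified here.

from tensorflow.keras import layers, models

IMG_HEIGHT, IMG_WIDTH = 64, 64   # frame size used during dataset preparation
NUM_CLASSES = 6                  # six HMDB51 action classes are used

def build_5c3dnn():
    """Five convolution layers followed by dense layers (assumed sizes)."""
    model = models.Sequential([
        layers.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),   # one output per action class
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",          # labels are one-hot encoded
                  metrics=["accuracy"])
    return model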
3.3. YUV
YUV is a pixel format employed by video applications; YUV is also the name of the colour space shared by all “YUV” pixel formats. In contrast to RGB (Red–Green–Blue) formats, YUV colours are represented with one “luminance” component, called Y (similar to grey scale), and two “chrominance” components, called U (blue projection) and V (red projection).
The YUV 4:2:0 format comprises the luminance plane Y, followed by the U and V chrominance planes. The two chrominance planes (blue and red projections) are sub-sampled by a factor of two in both the horizontal and vertical dimensions: for a 2 × 2 square of pixels, there are four Y samples but only one U sample and one V sample. This format uses 48 bits per 4 pixels, giving it a depth of 12 bits per pixel.
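The 4:2:0 layout can be illustrated with a small numpy sketch that reduces full-resolution chroma planes by a factor of two in each dimension; averaging each 2 × 2 block is one common subsampling choice, assumed here for illustration.

import numpy as np

def to_yuv420_planes(y, u, v):
    """Subsample full-resolution U and V planes to 4:2:0.

    y, u, v: 2D arrays with the same (even) height and width.
    Returns the unchanged Y plane plus half-resolution U and V planes.
    """
    h, w = y.shape
    # Average each 2 x 2 block of the chroma planes.
    u_sub = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v_sub = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u_sub, v_sub

# With 8-bit samples, a 1920 x 1080 frame then stores
# 1920*1080 + 2*(960*540) samples, i.e. 12 bits per pixel on average.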
The sequential png frames from the i3DPost Multi-view Human Action Dataset are grouped and converted into YUV 4:2:0 format video, as this is considered a raw data form and is useful for calculating the PSNR with respect to the YUV 4:2:0 versions of the bitrate-altered videos.
4. Results and Analysis
The i3DPost Multi-view Human Action Dataset has 125 sequential uncompressed frames with a resolution of 1920 × 1080. The bitrate of the uncompressed video formed using these frames is calculated based on one uncompressed frame:
pixels per frame = 1920 * 1080 = 2,073,600 pixels.
In YUV420, there is one U and one V value per 2 × 2 group of Y values (i.e., the two chroma components are sampled at half the sample rate of the luma, both horizontally and vertically).
Hence, a 2 × 2 block of an uncompressed frame has four 4-bit Y values plus one 4-bit U value and one 4-bit V value. Therefore, a 2 × 2 block occupies 16 + 4 + 4 = 24 bits, i.e., 6 bits per pixel.
Hence, bits per frame = 6 * 2,073,600 = 12,441,600 bits.
The uncompressed video runs at 30 frames per second (FPS). Therefore,
bit rate = 30 * size of one uncompressed frame = 30 * 12,441,600 = 373,248,000 bits/s.
The bitrate of each uncompressed video sequence that is formed from the uncompressed images in the i3DPost Multi-view Human Action Dataset is 373,248,000 bits/s.
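The arithmetic above can be reproduced with a short helper; the 6 bits-per-pixel figure follows the calculation in the text.

def raw_bitrate(width=1920, height=1080, bits_per_pixel=6, fps=30):
    """Bitrate of an uncompressed video, following the calculation above."""
    pixels_per_frame = width * height                     # 2,073,600 pixels
    bits_per_frame = bits_per_pixel * pixels_per_frame    # 12,441,600 bits
    return fps * bits_per_frame                           # bits per second

print(raw_bitrate())  # 373,248,000 bits/s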
The trained 5C3DNN model is further tested on bitrate-altered videos to find out the extent to which we can decrease the bitrate of a video before the neural network model fails to predict the action. For this, eight video sequences with bitrates altered at variable intervals are fed into the model and the accuracies are noted. The following graph shows the results obtained for each video sequence.
The results are consistent and show a non-linear relationship with respect to bitrate compression. The neural network appears to perform remarkably well on video sequences that are too compressed for human consumption. From Figure 3, we can observe that in the range of 25 Kbps–35 Kbps the neural network loses its ability to predict the action in the video.
To calculate the PSNR value between the uncompressed video and the bitrate-altered videos, the MATLAB function ‘yuvpsnr’ is used. The bitrate-altered videos are first converted from mp4 to YUV420 format, as the function operates only on two YUV-format videos.
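As an illustration of the underlying computation (which this work performs in MATLAB), the following Python sketch evaluates the standard definition PSNR = 10·log10(MAX² / MSE) between two frames or planes of equal size, assuming 8-bit samples.

import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak Signal-to-Noise Ratio between two frames/planes of equal shape."""
    reference = reference.astype(np.float64)
    distorted = distorted.astype(np.float64)
    mse = np.mean((reference - distorted) ** 2)
    if mse == 0:
        return float("inf")   # identical inputs
    return 10.0 * np.log10((max_value ** 2) / mse)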
Both of the Bitrate vs. PSNR graphs (Figure 4 and Figure 5) portray the same fall in quality with bitrate reduction that was depicted in the Bitrate vs. Accuracy graph (Figure 3) earlier.
The Bitrate vs. SSIM graphs (Figure 6) confirm the fall in quality with respect to bitrate shown in Figure 3.
The following quartet of figures (Figure 7, Figure 8, Figure 9 and Figure 10) portrays the dissipation of the data as the bitrate is reduced, and thus the loss of accuracy in predicting the activity at 35 Kbps. We can see that the model performs well even at a 40 Kbps bitrate, where human consumption of the video is not possible.
Table 1 gives the approximate compression ratios of the video sequences at the breakeven points.
The neural network model presented in this paper is trained and tested on only six classes of the HMDB51 Dataset (“climb_stairs”, “fall_floor”, “run”, “sit”, “stand”, and “walk”). To evaluate the performance of this model, it is compared with proven neural network architectures in Table 2.
5. Conclusions
In this research work, we empirically assessed the effect of bitrate reduction on the performance of neural networks for action recognition. In doing so, we hoped to demonstrate the applicability of video compression as a means of decreasing the quantity of data required for the proper performance of deep learning algorithms used for action detection while preserving human perception levels of said data.
We generated a dataset from raw uncompressed images and evaluated bitrate-altered videos derived from eight video sequences in that dataset on a 5C3DNN model trained on a different activity dataset. Following this, we drew two comparisons. First, we examined the accuracy on the bitrate-altered videos and determined the range of bitrates at which the neural network loses its capacity to predict the activity. Next, we compared the PSNR values of the YUV versions of the bitrate-altered videos to those of the uncompressed raw videos to see if they complemented the prior findings.
Important findings from our studies include that the range of 25 Kbps to 35 Kbps contains the breakeven bitrate points at which the neural network could no longer predict the action. This range of videos has a compression ratio between 10,977.88 and 14,929.92. It was also found that there is no linear relationship between bitrate compression and 5C3DNN performance. Finally, the model was compared with different architectures and demonstrated its ability to perform better than them at lower bitrates.
The datasets reported in this work are static; hence, additional work on dynamic videos is necessary to strengthen the outcomes of this experiment. Future research on this subject would benefit from the inclusion of more diverse datasets, including ones with different luminance settings. It would also benefit from testing the effectiveness of another neural network model created and trained for an application domain other than action recognition.