1. Introduction
In an increasingly connected world, smart cities are becoming ever more present in our society. Within these cities, the number of use cases in which such innovations benefit the inhabitants is also growing, improving their quality of life [1,2,3].
One of these areas is safety, in which machine learning (ML) models show high potential for real-time video-stream analysis to determine whether violence is present in those videos [4]. These ML approaches belong to the field of computer vision, which is responsible for processing digital images and videos and extracting knowledge and understandable information from them so that it can be used in diverse contexts [5]. Depending on the nature of the data to be assessed, several ML paradigms can be used to train a model so that it improves itself and correctly predicts the outcome for given data [6].
Some of the available alternatives for recognizing actions in video streams are based on ML approaches such as deep learning (DL), which has grown in popularity in recent years as its potential in applications that benefit from a machine recognizing diverse human actions became clear [7,8]. When discussing DL, it is important to refer to neural networks (NNs), which aim to mimic the operation of the human brain, borrowing biological terms such as neurons and synapses. Unlike conventional algorithmic approaches, they usually do not follow a fixed set of instructions to solve problems. Neurons are powerful components with high potential for information storage, image recognition, and classification problems [9].
In this sense, the present work describes the exploration and analysis of ML models that detect violence in video streams. Following the CRISP-DM (CRoss Industry Standard Process for Data Mining) methodology, ML models were applied to the collected data, allowing a comparison of their performance in terms of audio and video classification.
Regarding the structure of this article, after the present introduction, an exposition of some related works is given, followed by an explanation of the methodology, materials, and methods. The results are then shown and discussed. Finally, some conclusions are drawn.
2. Related Work
Over the last few years, several studies have been carried out in this field, and in this article we present a few of them considered relevant to the framework of this study.
In [10], the researchers highlighted the need for violence detection models to be efficient. As such, they proposed a hybrid "handcrafted/learned" feature framework. Handcrafted spatiotemporal features can accurately capture both appearance and motion; however, the extraction of such features is still troublesome for some applications.
The model first attempts to obtain a representative image from the input video sequence for feature extraction, using a Hough forest as a classifier. Afterwards, a 2D CNN is used to classify the image. The proposed approach was tested in a less-crowded scenario, where it achieved accuracy rates ranging between 84% and 96%.
In [11], the researchers assume that in a fighting scene the motion blobs have a specific shape and position; after analysis, the K largest blobs are classified as either violent or nonviolent.
Objects are detected with an ellipse detection method, features are extracted with an acceleration-estimation algorithm, and spatiotemporal features are finally used to classify the scenes. This method was tested with both crowded and less-crowded scenes, yielding an accuracy rate of nearly 90%.
While this method was outperformed by state-of-the-art (SotA) methods in terms of accuracy, its considerably faster computation time makes it desirable for real-time applications.
In [12], the researchers used independent networks to learn features specifically related to violence, such as blood, explosions, and fights. To describe the violence using such features, distinct SVM classifiers were trained for each concept, and their results were later merged into a meta-classification.
Objects were detected using movement detection and a temporal-robust features model, and features were extracted with a bag-of-words method. This approach was applied to sparsely crowded scenarios, attaining a 96% accuracy rate.
3. Methodology, Materials, and Methods
Overall, this study was developed based on the CRISP-DM methodology, which, in short, consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. In this article, the main aspects of this process are explained.
In order to carry out this study, several technologies were leveraged, the most relevant being Python and some of its libraries, together with PyTorch, an open-source ML library based on the Torch library, which was used to create and train the models.
Making use of these technologies, it was possible to create the whole pipeline for creating, training, and evaluating the models, described in the next subsection. In these training sessions, each of the chosen pretrained models used at each stage of the pipeline was tested with different settings, in order to understand which configuration best suited each model and yielded the best results.
In the training stage, each of the models was fed with data from which it tries to extract knowledge, learning which features correlate with the label; in this case, which features lead to a video stream being considered violent or not. In the validation stage, each of the models was also fed with data; however, they do not learn from it and instead try to leverage their previously gained knowledge in order to predict whether the video stream contains violence. In each training session, data pertaining to each of the models in the pipeline and their performance were saved in order to analyze how well the models performed. Such data encompass the accuracy percentage in each epoch, the loss in each epoch, and the information necessary to compose a confusion matrix (CM—composed of true positives, true negatives, false positives, and false negatives) regarding the last epoch.
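As an illustration of the bookkeeping described above, the following sketch shows a generic PyTorch epoch loop that records accuracy, loss, and the counts needed for the confusion matrix. The names `model`, `loader`, `criterion`, and `optimizer` are placeholders; the actual pipeline of this study may differ in its details.

```python
# Minimal sketch of the per-epoch bookkeeping described above.
import torch

def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    training = optimizer is not None
    model.train(training)
    total_loss, correct, n = 0.0, 0, 0
    tp = tn = fp = fn = 0
    with torch.set_grad_enabled(training):
        for inputs, labels in loader:          # labels: 1 = violent, 0 = nonviolent
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            preds = outputs.argmax(dim=1)
            total_loss += loss.item() * labels.size(0)
            correct += (preds == labels).sum().item()
            n += labels.size(0)
            # Counts needed to compose the confusion matrix of the last epoch.
            tp += ((preds == 1) & (labels == 1)).sum().item()
            tn += ((preds == 0) & (labels == 0)).sum().item()
            fp += ((preds == 1) & (labels == 0)).sum().item()
            fn += ((preds == 0) & (labels == 1)).sum().item()
    return {"loss": total_loss / n, "acc": correct / n, "cm": (tp, tn, fp, fn)}
```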
3.1. Pipeline
In order to provide a clearer understanding of the models exposed in this article, the pipeline of the research is presented in Figure 1. This figure describes the flow of information and the actions taken in the training and evaluation of the models.
First, the dataset was loaded and, in the first stage, a number of videos equal to the defined batch size (BS), a hyperparameter described later in this paper, had their audio and video data extracted. An MFCC spectrogram was then generated from the extracted audio of each loaded video, while the video frames were loaded into a tensor so that they could be passed to the classifier.
Afterwards, both classifiers received their respective data, processed it, and made a prediction, with the audio classifier completing its processing before the video one.
Once all the videos in the dataset were covered, the results were processed and saved for analysis.
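The following sketch illustrates this first pipeline stage for a single video: the audio track is converted into an MFCC spectrogram and the frames into a tensor for the video classifier. It uses torchvision and torchaudio; the parameter values and helper name `load_sample` are illustrative assumptions, not the exact configuration used in the study.

```python
# Illustrative sketch of the first pipeline stage described above.
import torch
import torchaudio
from torchvision.io import read_video

def load_sample(path, n_mfcc=40, n_frames=16, size=112):
    frames, audio, info = read_video(path, pts_unit="sec")  # frames: (T, H, W, C)
    # Audio branch: mono waveform -> MFCC spectrogram for the audio classifier.
    waveform = audio.mean(dim=0, keepdim=True).float()       # (1, samples)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=int(info["audio_fps"]), n_mfcc=n_mfcc)(waveform)
    # Video branch: subsample T frames, reorder to (C, T, H, W), scale to [0, 1].
    idx = torch.linspace(0, frames.shape[0] - 1, n_frames).long()
    clip = frames[idx].permute(3, 0, 1, 2).float() / 255.0
    clip = torch.nn.functional.interpolate(clip, size=(size, size))
    return mfcc, clip
```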
3.2. Data
In the action recognition portion of the ML field, there is no shortage of datasets, e.g., UCF50 [13] and UCF101 [14], which feature, respectively, 50 and 101 categories of human-performed actions. Just as there are datasets that feature nonviolent actions, there are also ones that feature violent actions. Some examples are the Hockey Fight Detection Dataset [15] and the Real Life Violence Situations Dataset [16], the latter containing a variety of violent scenes, ranging from punches and kicks to throws, among others.
For this study, the Hockey dataset was deemed unsuitable because, despite showing violent actions, these scenes would perhaps not generalize well to violent acts committed by people in day-to-day situations, due to the inherent characteristics of that dataset (bulky hockey equipment, sticks, and the ice rink on which the games are played). Hence, the decision was made to use the Real Life Violence Situations Dataset; however, after running into some errors, a problem became clear: most of the videos featured in the dataset contained no sound whatsoever. It was then decided that the dataset should be exploited to its full potential, which resulted in around 250 usable videos, nearly all of them violent, as very few of the nonviolent videos had sound.
To mend this, scenes from "city walking" videos were utilized. These videos are filmed by people with GoPros or similar cameras who do not speak and limit themselves to walking around different parts of different cities while capturing footage. In these videos, people can be seen walking the streets, talking, eating, and engaging in everyday actions. Consequently, "city walking" videos from several cities were downloaded, and excerpts were taken from them in order to give the models a wide range of nonviolent scenes with audio.
In total, 500 videos with a duration of around 10 s each were collected, of which 250 featured violent scenes and 250 nonviolent scenes.
Figure 2 presents some examples of the scenes contained in the dataset. Besides the retrieval of public datasets featuring violent actions, the recording of videos specific to this project, featuring the simulation of several violent scenarios, was also scheduled to take place. However, these recordings were delayed by bureaucratic constraints, such as COVID-19 restrictions, making them unavailable for use in this article.
3.3. Data Preparation
In ML, it is common practice to apply some kind of preprocessing to the data that will be used to train the models, as some of its features may contribute negatively to the models' training.
An example of these negative impacts regards images, as they are typically stored as multidimensional arrays of values ranging from 0 to 255. One possible consequence of feeding those images as they are to a model would be what is often referred to as “exploding gradients”, where a model’s weights will vary dramatically from extremely high to extremely low values, possibly reaching NaN values, rendering the model useless.
After normalization, the data typically ranges within [0, 1], although it is also possible to normalize the values to a range of [−0.5, 0.5]. In this case, the normalization was performed so that the final values ranged within [0, 1], in order to improve the performance and training stability of the models.
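A minimal sketch of this normalization, assuming frames stored as uint8 values in [0, 255]; the variable names are illustrative.

```python
# Rescale uint8 pixel values to [0, 1] before feeding them to the models.
import torch

frames_uint8 = torch.randint(0, 256, (16, 3, 112, 112), dtype=torch.uint8)
frames = frames_uint8.float() / 255.0          # values now in [0, 1]

# Normalizing to [-0.5, 0.5] instead would simply subtract 0.5:
# frames = frames_uint8.float() / 255.0 - 0.5
```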
3.4. Modeling
For each classifier, several models were tested in order to obtain more robust results from which to draw conclusions. In this section, each of the models used in each classifier is explored, with its architecture detailed and its origins explained, leading to a better understanding of why it is suitable for the task at hand.
3.4.1. Audio
Audio analysis using ML techniques is based on extracting useful information from a soundtrack, analyzing it, and then making predictions over it. This useful information consists of the frequency and amplitude of the sound wave over a period of time. Two popular ways of extracting these features from an audio signal are as follows (a short code sketch is given after the list):
STFT (short-time Fourier transform): an audio waveform is converted to a spectrogram using the STFT. This spectrogram displays the time–frequency changes as a 2D array of complex numbers that represent the magnitude and the phase [17].
MFCC (Mel-frequency cepstral coefficients): consists of calculating the power spectrum, which gives the frequency spectrum used to identify the frequencies present. The Mel-scale filter bank is applied to the power spectrum in order to extract the frequency bands; afterwards, the logarithm of the filter bank energies is taken and the DCT coefficients are applied to generate a compressed version of the filter banks and obtain the MFCCs [17].
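The sketch below illustrates both feature-extraction routes using torchaudio; the parameter values (sample rate, FFT size, number of coefficients) are examples only, not the settings used in this study.

```python
# Illustrative comparison of the STFT and MFCC routes described above.
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000)               # 1 s of mono audio at 16 kHz

# STFT route: complex spectrogram carrying magnitude and phase.
stft = T.Spectrogram(n_fft=512, power=None)(waveform)   # (1, freq, time), complex

# MFCC route: Mel filter bank + log + DCT, a compressed representation.
mfcc = T.MFCC(sample_rate=16000, n_mfcc=40,
              melkwargs={"n_fft": 512, "n_mels": 64})(waveform)  # (1, 40, time)
```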
The audio classifiers used in this study were both proposed in [18], along with a few other networks. The objective of that paper's authors was to present a residual learning framework that would ease the training of networks substantially deeper than those used until then. Their architecture was created mainly with VGG nets in mind, having up to eight times their depth (when comparing the 152-layer version), while attempting to achieve similar, if not better, results with lower complexity.
The proposed NNs were highly successful, having earned the authors first place in the ILSVRC 2015 classification task, and having achieved better results than the VGG networks on the ImageNet classification dataset.
The first audio classifier detailed here is the Resnet18, whose architecture can be seen in Table 1. This model's architecture starts with a convolutional layer with a 7 × 7 kernel size, followed by the beginning of the skip connection. The input is then added to the output produced by the 3 × 3 max pooling layer and two pairs of convolutional layers with a 3 × 3 kernel size, each with 64 kernels. This part represents the first residual block, and five convolutional layers in total.
From there, the output of this residual block is passed on to three additional residual blocks, each with two pairs of convolutional layers. The convolutional layers of these residual blocks each have a kernel size of 3 × 3, starting with 128 filters. The convolutional layers in each residual block have double the number of filters of those in the previous block, while the size of the output decreases by half. With these additional convolutional layers and the fully connected layer featured afterwards, the total number of layers is brought up to 18, from which the model gets its name.
The second network that served as an audio classifier was the Resnet34, which has an architecture that closely resembles the Resnet18, as can be seen in Table 2. This model's architecture begins with almost the same setup for its first convolutional layer and the residual block following it, with the difference that this residual block features three pairs of convolutional layers rather than two. From there, it also follows the same trend, with the convolutional layers of each residual block doubling their number of filters compared with those of the previous block; in this model's case, however, each residual block does not necessarily have only two pairs of convolutional layers. More precisely, the second residual block features four pairs of convolutional layers, i.e., eight in total. The trend is kept from there, with the difference that the third residual block features six pairs of convolutional layers, and the fourth, three pairs.
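As an illustration, the sketch below shows how Resnet18/Resnet34, as provided by torchvision, could be adapted to classify single-channel MFCC spectrograms into two classes (violent/nonviolent). This is an assumed setup for illustration, not necessarily the exact configuration used in the study.

```python
# Sketch: adapt torchvision ResNets to 1-channel spectrogram input, 2 classes.
import torch.nn as nn
from torchvision import models

def build_audio_classifier(arch="resnet18", num_classes=2):
    net = models.resnet18() if arch == "resnet18" else models.resnet34()
    # Replace the first convolution so it accepts 1-channel spectrograms.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the final fully connected layer for binary classification.
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net
```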
3.4.2. Video
Video analysis presents different challenges compared to image analysis. When analyzing a video, the previous events in the sequence must be taken into account, so that the whole video is considered rather than individual frames.
To do this, one must take into account not only the two dimensions present in image analysis but also a third one: the temporal dimension. To accommodate this new dimension, several strategies are available; however, the one employed in this article revolves around 3D CNNs, which, as shown in [19], outperform 2D CNNs in challenging action recognition benchmarks such as Sports-1M [20] and Kinetics [21].
The authors of [19] were inspired by these promising results and introduced two new forms of spatiotemporal convolutions that can be viewed as middle grounds between the extremes of 2D (spatial) convolution and full 3D (spatiotemporal) convolution. Their first proposal, called mixed convolution (MC), is the one employed in the Resnet_MC18 described next.
The first proposal is the Resnet_MC18, whose architecture can be seen in Table 3. It employs 3D convolutions solely in the early layers of the network, with 2D convolutions in the top layers; the rationale is that motion modeling is a low/mid-level operation that can be implemented via 3D convolutions in the early layers of a network, and that spatial reasoning over these mid-level motion features (implemented by 2D convolutions in the top layers) leads to accurate action recognition and, in this article, to the accurate recognition of violent actions [19].
The second proposal by the authors of [19] is a "(2+1)D" convolutional block (Table 4), which explicitly factorizes 3D convolution into two separate and successive operations, a 2D spatial convolution and a 1D temporal convolution. The authors justify this "decomposition" with two reasons.
The first advantage is that having an additional nonlinear rectification between these two operations effectively doubles the number of nonlinearities, compared to a network using full 3D convolutions for the same number of parameters, thus rendering the model capable of representing more complex functions. The second potential benefit is that the decomposition facilitates the optimization, yielding in practice both a lower training loss and a lower testing loss, hence effectively making the (2+1)D blocks easier to optimize.
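For illustration, both video architectures are available in torchvision (mc3_18 for the mixed-convolution variant and r2plus1d_18 for the (2+1)D variant); the sketch below adapts them for binary classification. The exact weights and settings used in this study are not reproduced here.

```python
# Sketch: torchvision video ResNets adapted to two classes (violent/nonviolent).
import torch
import torch.nn as nn
from torchvision.models.video import mc3_18, r2plus1d_18

def build_video_classifier(arch="r2plus1d_18", num_classes=2):
    net = mc3_18() if arch == "mc3_18" else r2plus1d_18()
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# Input clips are 5D tensors: (batch, channels, frames, height, width).
clip = torch.rand(2, 3, 16, 112, 112)
logits = build_video_classifier()(clip)        # shape: (2, 2)
```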
3.5. Evaluation
In order to evaluate the different classifiers' performances, different metrics were considered and analyzed, namely, the following (a brief sketch of their computation is given after the list):
Average training accuracy (ATA)—average accuracy rate achieved by a classifier with the training set.
Average training loss (ATL)—average loss achieved by a classifier with the training set.
Average validation accuracy (AVA)—average accuracy rate achieved by a classifier with the validation set.
Average validation loss (AVL)—average loss achieved by a classifier with the validation set.
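The sketch below illustrates how these four metrics can be computed, assuming the averages are taken over the epochs of a training session (the per-epoch values would come from a loop such as the one sketched in Section 3); the structure of `history` is an assumption for illustration.

```python
# Illustrative computation of ATA, ATL, AVA, and AVL from per-epoch records.
def summarize(history):
    # `history` is a list of dicts, one per epoch, e.g.
    # [{"train_acc": 0.9, "train_loss": 0.1, "val_acc": 0.8, "val_loss": 0.4}, ...]
    n = len(history)
    return {
        "ATA": sum(e["train_acc"] for e in history) / n,
        "ATL": sum(e["train_loss"] for e in history) / n,
        "AVA": sum(e["val_acc"] for e in history) / n,
        "AVL": sum(e["val_loss"] for e in history) / n,
    }
```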
4. Results and Discussion
After the training of every classifier had been completed under diverse conditions, altering the different hyperparameters, a set of results was obtained. These results encompass close to 200 h of training, divided into eight training sessions, in which the four different models were used in their respective classifiers. They comprise dozens of graphs and text files, ranging from confusion matrices that plot the last epoch of the training session to line graphs that plot the evolution of the accuracy and loss rates. Having this data available is crucial in order to determine which approach works best and why something is not performing to its fullest extent. It is important to note two crucial hyperparameters, namely, BS and learning rate (LR). BS can be defined as the number of samples that pass through the neural network at one time. On the other hand, LR controls how much the model needs to change in response to the estimated error each time the model weights are updated [22]. All other configuration options were left at their default values, as described in PyTorch's documentation of the models.
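As a brief illustration of where these two hyperparameters act in a PyTorch pipeline, BS is the DataLoader batch size and LR is the optimizer step size. The dummy dataset, dummy model, LR value, and choice of optimizer below are illustrative assumptions, not the exact setup of this study.

```python
# Sketch: BS and LR in a PyTorch training setup.
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 7                                  # BS values tested: 5 and 7
learning_rate = 1e-4                            # illustrative value only

# Dummy data and model just to make the example self-contained.
dataset = TensorDataset(torch.rand(20, 3, 112, 112), torch.randint(0, 2, (20,)))
model = torch.nn.Linear(3 * 112 * 112, 2)

loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```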
In the following subsections, the results achieved by the two types of classifiers tested in this article will be detailed and analyzed. They will be evaluated from several perspectives, and an attempt will be made to draw connections between the different hyperparameter settings and the performance of the classifiers. Firstly, the average values of each of the models used for the classifier in question will be analyzed, regardless of the hyperparameters set. Afterwards, the hyperparameters will be taken into account in order to obtain a more comprehensive look at the results and to better understand whether any conclusions can be drawn.
4.1. Audio
The rationale behind classifying audio arises from the sound characteristics typically present in violent acts, such as screaming, the battering of surfaces, the breaking of items, and so forth. It is believed that the classifier will be able to identify the scenarios in which violent acts are occurring based solely on the audio feed.
In Table 5, a general look into each model's performance can be seen in terms of ATA, ATL, AVA, and AVL.
Here, we can see that, in a general sense, the models are extremely similar, which is to be expected, as they were conceived by the same authors and are based on the same architecture, the only difference being that the Resnet34 has more layers than the Resnet18. All of their values are very similar, with the AVL perhaps showing the most significant difference, but even in that case it amounts to only a 3.75% difference.
Table 6 presents the audio classifier results by BS. When comparing the results from Table 5 with those from Table 6, it is possible to notice a trend in which both models performed better with a BS of 7 than with a BS of 5, mainly in terms of average accuracy. This is particularly notable with the Resnet34, which scored 71% average accuracy on the validation set, almost 5% more than its AVA as seen in Table 5.
Looking at Table 6, it is possible to verify that a wider range of results is present. When comparing the models' performance with the different BSs, the results are somewhat inconclusive. On one hand, the ATA somewhat improves, accompanied by a drop in the ATL. However, when it comes to the average accuracy on the validation set, in the Resnet18 model's case, there was no improvement gained from increasing the BS from 5 to 7; its performance was even worse, as the average loss on the validation set increased. In the Resnet34's case, the average accuracy on the validation set improved by almost 6%, while the average loss barely increased.
Taking a look at the models’ results from an LR perspective, the obtained results are presented in Table 7. In this table, it is possible to verify how the LR affected the models’ performance. Looking at the performance over the training set, both models performed similarly, scoring close to 90% accuracy with the lower of the two tested LRs and dropping to around 80% accuracy with the higher LR, with this drop being accompanied by an increase in the average loss. Furthermore, the models’ trend in the training set held in the validation set as well: with an increase in LR, the models’ performance decreased, as verified by both a decrease in average accuracy and an increase in average loss.
It is possible, once again, to notice a trend when comparing the results from Table 5 and Table 7, which shows that both models benefit from using the lower LR, as can be seen in both the increase in average accuracy on the training and validation sets and the accompanying reduction in loss.
In a general sense, the audio classifiers performed as expected, displaying solid performance, although not as good as that of their video counterparts. The best result was achieved by the Resnet18 model with the lower of the tested LRs and a BS of 7, reaching an average accuracy of 91% and an average loss of 0.08 on the training set, with those values worsening to 76% and 0.51 on the validation set. Despite the 15% drop, the average accuracy on the validation set still falls within the expected values.
4.2. Video
Regarding video classifiers, the expectations shift, as essentially all models used in state-of-the-art papers are video classifiers, so these kinds of models are expected to perform at an extremely high level. It is natural that video classifiers perform better, as video features are more descriptive and correlate more strongly with violent actions such as punching and beating than the sound these actions produce.
Looking at Table 8, the distinction between the performance of the audio and video classifiers becomes clear. On the training set, the video classifiers outscored the audio classifiers by around 5% on average. Their performance becomes even more impressive when it comes to the average accuracy on the validation set, where the video classifiers outscored their audio counterparts by a whopping 15%.
The general results from the video classifiers showed promise, so it is interesting to understand how the different hyperparameters interact with each one of the models that were chosen as video classifiers and their respective performance.
Firstly, looking at their behavior from the perspective of the BS parameter, the obtained results are presented in Table 9.
These results were quite surprising due to their unexpected consistency, which seems to indicate that BSs of 5 and 7 are both adequate values with which to train the models. Across both models, regardless of which BS was used, the results remained remarkably close, with a maximum difference between the models' average accuracy on the validation set of just 1.5%.
Moving on to Table 10, it is possible to verify the LR's influence on the models' results. In this table, much more diverse results are present, although this is not immediately apparent. Regarding the average accuracy on the training set, neither the BS nor the LR had much of an effect, with every model scoring between 90% and 91% with an accompanying low loss. When it comes to the average accuracy on the validation set, however, the scenario changes dramatically. Despite there being no extreme fluctuation, there is now a close to 7% difference in average accuracy between the models, ranging from 82.5% to 89%. In this regard, it is clear that the LR bears much more relevance to the models' performance than the BS, at least for the values used.
The best results, regarding the average accuracy on the validation set, were achieved by both the Resnet(2+1)D and the Resnet_MC18, with both models attaining 89% accuracy on the set. As can be expected from the explanation above, this impressive accuracy was obtained with the lower of the tested LRs. Comparing these results with those of the audio classifiers, it becomes clear why video classifiers are used in virtually every state-of-the-art paper. Even when taking into account all the different testing scenarios, in no circumstance did the audio classifiers score close to their video counterparts.
5. Conclusions
After all the tests and analyses were carried out, it was finally possible to conclude which model and type of classifier performed better by looking at Table 11. In this table, it is possible to see which model, by type of classifier, scored the highest average accuracy rates over the validation set. It is important to note that the "-" in the BS column of the video classifiers means that, with either a BS of 5 or 7, both of these models scored the same average accuracy of 89% over the validation set; the "-" thus signifies that, regardless of the BS, the models performed the same when using the lower LR.
Looking at Table 11, it is clear that all the models outperformed the expectations previously set for them. The audio classifier, despite being the lowest scorer of the tested classifiers, still achieved a respectable 76% average accuracy, which, in and of itself, is a good achievement. When looking at the other classifiers, however, it is possible to truly perceive the potential that these models have for action recognition, and for violence detection in particular.
This research showed that the application of neural network models for the detection of violence is quite feasible, since all of these models scored, on average, between 85% and 89% accuracy on the validation set, which are remarkable results. Compared to existing schemes, these results are slightly inferior to those found in state-of-the-art papers, which report slightly higher accuracy rates of around 90%. However, with a more robust dataset, these values could surely be improved, and the results could match those of state-of-the-art models.
Concerning future work, two main paths could be taken. First, the expansion of the dataset used could lead to a more robustly trained model, which would naturally yield better results, since it would be more comprehensively trained. Second, implementing the model in a real-time environment would be advantageous for advancing this research. In addition, this study makes a novel contribution, as it represents the basis for an even more robust study to be developed by the authors, which will include the detection of violence using early fusion and late fusion methods (multimodal classification).