1. Introduction
Suspicious-behavior detection is a crucial component of crime prevention. Despite a large body of work on video anomaly detection, little attention has been paid to suspicious-behavior detection for crime prevention. The task is difficult for several reasons. First, crime prevention requires detecting suspicious behavior before the actual crime occurs. This is a considerable challenge, since typical suspicious behavior in a slow-paced environment is highly similar to normal behavior. Second, abnormal actions stemming from suspicious activities generally unfold at a slower pace than actual criminal acts such as theft, burglary, and robbery. Third, a relatively long observation of a person’s activity is often required before any inference can be made. Lastly, significantly fewer data are available for training the model. In this project, we propose an effective method for detecting suspicious behavior that works with minimal training.
With the advancement of computer vision technology, various techniques have been proposed for crime prevention through video-anomaly detection. However, the effectiveness of these techniques varies depending on the use cases, such as traffic monitoring, crowd monitoring, and security surveillance. In this paper, we specifically focus on the use case of retail theft prevention, which aims to identify suspicious behavior in retail stores such as shoplifting or similar types of theft.
Several state-of-the-art techniques for preventing shoplifting have been proposed, including those by Kirichenko L. et al. [1], Gandapur et al. [2], Qin Z. et al. [3], and Wu Y. et al. [4]. These techniques rely heavily on CNN-based feature extraction methods that use pretrained models to extract visual appearance and optical flow features in order to represent spatiotemporal information. However, such methods are computationally costly and prone to excessive and redundant information. For example, the Inflated 3D ConvNet (I3D) is a state-of-the-art deep learning preprocessing method for video classification and action recognition in computer vision. Proposed by researchers at Google in 2017, it builds on the success of 2D convolutional neural networks (CNNs) by extending them to 3D. I3D uses 3D convolutional layers to process spatiotemporal data in video frames. It starts with a 2D CNN pretrained on ImageNet and then extends it to 3D by inflating each 2D filter into a 3D filter. This allows the network to capture both spatial and temporal information from video frames. I3D achieved state-of-the-art results on several benchmark datasets for action recognition, including Kinetics and HMDB51, and it is widely used in research and applications related to security surveillance.
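To make the inflation step concrete, the following is a minimal sketch (written in PyTorch, which we assume here purely for illustration) of how a pretrained 2D convolution kernel can be inflated into a 3D kernel by replicating it along a new temporal axis and rescaling, in the spirit of I3D; it is not the original authors' implementation.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D convolution (I3D-style sketch).

    The 2D kernel is repeated `time_dim` times along a new temporal axis and
    divided by `time_dim`, so the inflated filter gives the same response on a
    "boring" video made of identical repeated frames.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), rescaled by 1/T
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate the first convolution of a 2D backbone
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d, time_dim=7)
clip = torch.randn(1, 3, 16, 224, 224)  # (batch, channels, frames, height, width)
print(conv3d(clip).shape)
```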
The I3D method is a powerful approach to video analysis, but it demands considerable processing power: preprocessing a single frame with I3D can take around 0.5 to 5 s even on a high-spec hardware configuration. Object detection techniques such as YOLO, Faster R-CNN, and Mask R-CNN, on the other hand, can perform real-time video processing, detecting and extracting human bounding box coordinates at a rate of 30 frames per second or higher. Thus, by efficiently utilizing video frame features in real time to detect suspicious behavior (without relying on I3D or a similar approach), it is possible to significantly reduce detection inference time.
The question that arises next is which features we should extract from video frames to effectively detect suspicious behaviors. Our hypothesis is that, by examining a sequence of video frames depicting a person’s activity and behavior, we can increase the likelihood of detecting suspicious behavior. Time-series deep learning classification models can be trained to track and learn the sequences of people’s actions and movements, which can lead to more efficient and improved detection performance. To achieve this, we propose a novel approach that involves extracting the bounding box features of individuals in CCTV videos, and analyzing the data using time-series deep learning algorithms. We aim to address the following research questions:
How effective is using a sequence of video frames depicting a person’s activity and behavior in increasing the likelihood of detecting suspicious behavior?
How can time-series deep learning classification models be trained to track and learn sequences of individual actions and movements to improve detection performance?
How does the proposed method compare to the I3D preprocessing method and the state-of-the-art Robust Temporal Feature Magnitude (RTFM) deep learning anomaly detection method in terms of detection performance on shoplifting incidents?
To this end, we propose using YOLOv5 with Deep Sort to detect and track individuals across multiple frames in video sequences. We then extract the resulting bounding box coordinates as temporal features and use them as inputs to time-series classification deep learning models. We evaluated our approach on the UCF Crime dataset, which includes labeled video frames of shoplifting incidents, and compared it against the state-of-the-art Robust Temporal Feature Magnitude (RTFM) deep learning anomaly detection method. Our results reveal that our method exhibits faster detection speed and higher accuracy scores: we achieved an 8.45-fold increase in detection speed and an F1 score of 92%, surpassing RTFM by 3%, all without the need for data augmentation or I3D image feature extraction.
The paper is organized as follows: Section 2 provides a review of the related work, while Section 3 details the proposed method. Section 4 describes the used datasets and the applied preprocessing techniques. The experimental setup is discussed in Section 5, while Section 6 presents the results and their discussion. Lastly, in Section 7, the paper is concluded with a summary of the findings.
2. Related Work
Anomaly detection has been widely studied in computer vision [5,6,7,8,9] in various problem settings such as fighting/violence alerts, people fall detection, unusual pedestrian motion patterns, and traffic accidents. Three common techniques have been extensively studied in the field of video anomaly detection, namely, unsupervised [10,11,12,13,14,15], supervised [16,17,18], and semi-supervised/weakly supervised [19,20,21,22,23,24] approaches. The unsupervised technique attempts to detect abnormal activities when no labeled normal/abnormal training data are provided. The supervised technique, on the other hand, uses labeled normal/anomalous data during the training process. Recently, the semi-supervised/weakly supervised technique has emerged and is often cited as the state of the art for video anomaly detection. The weakly supervised technique uses video-level labels to selectively segment videos into normal and abnormal frames. During the training phase, if a video comprises only normal events, each frame is labeled as normal. However, if the video contains at least one abnormal frame, all the frames in the video (including the normal event frames) are labeled as abnormal. This labeling assignment is known as “noisy labels” in the literature because normal frames are labeled as abnormal. After such labeling, a ranking loss is applied to assign higher scores to the anomalous frames in the video. Recent results [19,20] showed that these approaches are very effective in video anomaly detection.
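As an illustration of how such a ranking objective can be set up from video-level labels alone, the following is a minimal PyTorch sketch of a multiple-instance ranking (hinge) loss that pushes the maximum anomaly score of an abnormal video above that of a normal video; it is a generic example in the spirit of the weakly supervised approaches [19,20], not their exact formulation.

```python
import torch

def mil_ranking_loss(abnormal_scores: torch.Tensor,
                     normal_scores: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Multiple-instance ranking loss for weakly supervised anomaly detection (sketch).

    abnormal_scores: per-segment anomaly scores of a video labeled abnormal, shape (num_segments,)
    normal_scores:   per-segment anomaly scores of a video labeled normal,   shape (num_segments,)
    Only video-level labels are used: the most anomalous segment of the abnormal
    video should score at least `margin` higher than the most anomalous segment
    of the normal video.
    """
    return torch.relu(margin - abnormal_scores.max() + normal_scores.max())

# Toy usage with random segment scores in [0, 1]
abnormal = torch.rand(32)
normal = torch.rand(32)
print(mil_ranking_loss(abnormal, normal).item())
```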
Given the extensive body of work on video anomaly detection, it is neither feasible nor practical to compare the effectiveness of different techniques without considering the use cases. In this paper, we specifically focus on shoplifting crime prevention. The objective is to detect a crime through the suspicious behavior of individuals before the actual shoplifting occurs.
Kirichenko L. et al. [1] presented a hybrid neural network for detecting shoplifting in video records. The network combined convolutional and recurrent networks, with gated recurrent units used as the recurrent component. Ansari et al. [25] proposed a dual-stream convolutional neural network and a long short-term memory (LSTM)-based deep learner to extract appearance and salient motion features from video sequences.
Gandapur et al. [2] proposed a three-layered bidirectional gated recurrent unit (BiGRU) combined with a convolutional neural network (CNN). The CNN was used to extract spatial features from video frames, whereas temporal and local motion features were extracted by the BiGRU from the CNN-extracted features of multiple frames. A limitation of this method is that the video frames of certain actions need to be manually chosen as part of the video-processing phase. The approach also requires a rank-based loss to effectively detect and classify suspicious activity.
Qin Z. et al. [3] proposed a two-stage method to detect and prevent criminal activities in shopping malls in real time. The first stage involves CNN feature extraction using a pretrained VGG-16 model. The second stage performs classification with either an SVM or an LSTM trained with a custom ranking loss.
Wu Y. et al. [4] proposed a three-dimensional convolutional neural network (3D-CNN) to extract informative features from continuous multiframe cube data and capture spatial–temporal characteristics. The input to the 3D-CNN is cube data composed of multiple consecutive video frames, aiming to improve the detection of crime events.
These state-of-the-art techniques rely heavily on CNN-based feature extraction methods (e.g., I3D, S3D, CLIP, RAFT, ResNet, and VGGish) that use a pretrained model to extract visual appearance and optical flow features in order to represent spatiotemporal information. The learned feature representations from such methods are susceptible to excessive and redundant information because they track all surfaces and edges in each frame. Furthermore, they are computationally costly because objects and the scene typically move in every video frame. In this paper, we propose a much simpler and more efficient approach to capturing and extracting spatiotemporal features. Rather than capturing all moving objects and surfaces, we aim to capture the sequence of bounding boxes of individuals across consecutive frames. This minimizes the motion-feature representation to four attributes (i.e., x1, x2, y1, y2) per person. The movement velocity of each person is also tracked and represented as a sequence of time-series data points. This approach has a significantly smaller motion-feature representation and lower computational overhead compared to those of existing approaches.
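To illustrate, the snippet below is a minimal NumPy sketch (our own simplified assumption of the feature layout, not the exact implementation) that turns one person's bounding box track into a per-frame feature sequence consisting of the four coordinates plus a center velocity.

```python
import numpy as np

def track_to_features(boxes: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Convert one person's bounding box track into a time-series feature matrix.

    boxes: array of shape (T, 4) with rows (x1, y1, x2, y2) for T consecutive frames.
    Returns an array of shape (T, 6): the four box coordinates plus the
    velocity (pixels/second) of the box center along x and y.
    """
    centers = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2])
    # Finite-difference velocity of the box center; zeros prepended for the first frame.
    velocity = np.vstack([np.zeros((1, 2)), np.diff(centers, axis=0) * fps])
    return np.hstack([boxes, velocity])

# Toy example: a person drifting to the right over 5 frames
boxes = np.array([[10, 50, 60, 200],
                  [14, 50, 64, 200],
                  [18, 51, 68, 201],
                  [23, 51, 73, 201],
                  [29, 52, 79, 202]], dtype=float)
print(track_to_features(boxes).shape)  # (5, 6)
```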
3. Proposed Method
The proposed approach is a two-stage algorithm consisting of feature extraction and deep-learning time-series analysis. The first stage uses YOLOv5 with Deep Sort to track individuals across a video and capture the bounding box information of each individual. The extracted bounding box information is used to create a new dataset. In the second stage, the extracted temporal features are supplied to deep-learning models for analysis and classification.
The first stage employs the object detection and video tracking algorithm YOLOv5 with Deep Sort [26] to track multiple individuals across each video. This allows for the extraction of bounding box features across multiple time steps. Hence, the video-based dataset is converted into a tabular format, as shown in Figure 1.
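The following Python sketch outlines this conversion step under our own assumptions: `PersonTracker`-style usage and the `update()` method are hypothetical stand-ins for a YOLOv5 with Deep Sort tracker, not the actual API of any specific repository.

```python
import csv
import cv2  # OpenCV, assumed available for reading video frames

def video_to_table(video_path: str, video_id: int, tracker, out_csv: str) -> None:
    """Run a person tracker over a video and write one row per (frame, person).

    `tracker` is assumed to expose update(frame) -> list of
    (person_id, x1, y1, x2, y2) tuples for the people tracked in that frame.
    """
    cap = cv2.VideoCapture(video_path)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video", "frame", "person_id", "x1", "y1", "x2", "y2"])
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for person_id, x1, y1, x2, y2 in tracker.update(frame):
                writer.writerow([video_id, frame_idx, person_id, x1, y1, x2, y2])
            frame_idx += 1
    cap.release()
```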
The second stage of the proposed method involves applying various state-of-the-art time-series deep-learning methods to the extracted dataset. We explored different time-series classification algorithms imported from the popular Time Series Artificial Intelligence (TSAI) library.
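A hedged sketch of this second stage using the tsai library is shown below; the exact calls and parameters are illustrative, the file names are hypothetical, and X is assumed to be an array of shape (samples, features, timesteps) built from the bounding box table, with y holding the video-level labels.

```python
import numpy as np
from tsai.all import *  # tsai exports get_splits, get_ts_dls, ts_learner, XceptionTime, etc.

# Hypothetical precomputed inputs: windows of bounding box features and video-level labels
X = np.load("bbox_windows.npy")   # shape: (n_samples, n_features, n_timesteps)
y = np.load("labels.npy")         # e.g., "normal" / "shoplifting"

splits = get_splits(y, valid_size=0.2, stratify=True, shuffle=True)
tfms = [None, TSClassification()]            # encode string labels as classes
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, bs=64)

learn = ts_learner(dls, XceptionTime, metrics=accuracy)
learn.fit_one_cycle(15, 1e-3)                # 15 epochs, matching our experimental setup
```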
The final dataset is constructed by recording the video number, the frame number, and the bounding box coordinates of each person in the frame as identified by YOLOv5 with Deep Sort. An example of the final dataset structure is shown in Figure 1, and the description of the dataset columns is provided in Figure 2. This conversion of video data into a tabular format using YOLOv5 with Deep Sort leads to rapid data processing and requires no other preprocessing or special image data augmentation techniques, allowing for quick conversion into the described tabular format.
We employed YOLOv5 with Deep Sort using the default parameters loaded with the pretrained weights from crowdhuman_yolov5m [26]. This set of weights was chosen because it was trained to identify humans in crowded scenarios, which applies to our case. Moreover, we configured the model to track only humans and no other animals/objects by setting the class parameter to 0. Hence, all the resulting bounding boxes track people through multiple frames.
The proposed method for tracking people through videos reduces the impact of cluttered scenes and removes any possible dependency of the classification model on the scene itself, which allows for better generalization. A popular model such as YOLOv5 with Deep Sort can easily identify humans in different scenarios and focus only on them, while the classification models focus on analyzing the movement and positions of people. The proposed pipeline is illustrated in Figure 3.
An example of the bounding boxes tracking people in successive frames can be seen below. Examining the video at 18, 20, 33, 36, 41, and 48 s in Figure 4 demonstrates how Deep Sort was able to track two individuals in a store and consistently label them as P23 and P24 (Persons Number 23 and 24). This implementation successfully tracked multiple people even when they switched positions, as was the case at 41 and 48 s.
Existing state-of-the-art anomaly detection models, such as RTFM, apply sophisticated data transformation and segmentation techniques to train deep-learning models. In contrast, instead of using pixels as features (as in CNN-based models), our approach generates numerical features of human activity: a real-time person tracker produces movement coordinates for each time frame, which are then used as features for our deep learning models.
Due to the time-series nature of the coordinate and time-frame data, we employed recurrent neural networks capable of learning order in sequence prediction problems. Long short-term memory (LSTM)-based networks are therefore proposed as ideal candidates for Stage 2 of the proposed approach due to their ability to capture temporal dependencies. We also explored the state-of-the-art time-series deep learning classification models XCM, an eXplainable Convolutional neural network for Multivariate time series classification (Fauvel, 2021), and MiniRocket (Dempster, 2021), as they both offer excellent performance on time-series classification tasks.
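To make the LSTM option concrete, below is a minimal PyTorch sketch of an LSTM-based classifier over fixed-length windows of per-frame bounding box features; it illustrates the kind of Stage 2 model we have in mind rather than the exact architectures (XceptionTime, XCM, MiniRocket) evaluated later.

```python
import torch
import torch.nn as nn

class BoxSequenceClassifier(nn.Module):
    """LSTM classifier over per-frame bounding box features (sketch).

    Input:  (batch, timesteps, features), e.g., features = x1, y1, x2, y2, vx, vy.
    Output: logits over {normal, suspicious}.
    """
    def __init__(self, n_features: int = 6, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])    # classify from the last layer's final hidden state

model = BoxSequenceClassifier()
windows = torch.randn(8, 120, 6)     # 8 clips, 120 frames, 6 features each
print(model(windows).shape)          # torch.Size([8, 2])
```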
6. Results and Discussion
We computed precision, recall, F1 score, and AUC for each time-series model trained for 15 and 25 epochs. The models achieved optimal performance at 15 epochs, which we chose as the standard training time for our models. The results of our experiments with 15 epochs are presented in Table 4. High precision, recall, F1 score, and AUC (close to 1) are desired, as together they indicate a well-fitted model.
As shown in Table 4, the XceptionTime and XCM models achieved the best performance; in particular, both achieved an AUC of 0.96 and an F1 score of 0.97. This indicates that the proposed approach based on either XceptionTime or XCM could identify abnormal videos with a high degree of accuracy. InceptionTime achieved average performance with an AUC of 0.77 and an F1 score of 0.56, while MiniRocket performed poorly with an AUC of 0.50 and an F1 score of 0.46. The results in Table 4 show that the proposed approach, based on the combination of bounding box feature extraction and time-series models (XceptionTime or XCM), was capable of accurately identifying abnormal videos.
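The reported metrics can be computed in the standard way with scikit-learn; the following brief sketch (with our own variable names and toy values, not the paper's predictions) shows how.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def report_metrics(y_true, y_pred, y_score):
    """y_true/y_pred: binary labels (1 = abnormal); y_score: predicted probability of the abnormal class."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }

# Toy example
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.9, 0.8, 0.4, 0.3]
print(report_metrics(y_true, y_pred, y_score))
```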
The confusion matrix for the best models is presented in Figure 10. Neither model misclassified a normal instance as abnormal, and only 1 out of 12 abnormal instances was classified as normal. In other words, the models mistook only 1 instance of shoplifting for regular shopping while accurately identifying the remaining 11 instances of shoplifting.
We further investigated the misclassification in Figure 10, which occurred in Shoplifting Video 34 of the UCF Crime dataset. This video produced over 20,000 rows in the tabular dataset, which were cut down to 15,000. This was particularly odd because the video itself contained only 3 people in the entire clip; nevertheless, YOLOv5 with Deep Sort identified about 225 bounding boxes in the video. This likely confused the models, since no other video in the dataset involved as many tracked identities. Hence, this is an issue with the application of YOLOv5 with Deep Sort and the main reason behind the misclassification.
To better compare the tested models, we performed 10-fold cross validation on the dataset. In the context of shoplifting crime prevention, false positives can be costly, as they may lead to unnecessary interventions or the detainment of innocent individuals. Therefore, in this specific application, even a small false-positive rate may render the model totally infeasible in real life. This suggests that precision is more important than recall in this context, as it is crucial to minimize the number of false positives.
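A hedged sketch of the 10-fold procedure is shown below, using scikit-learn's StratifiedKFold and a generic `fit`/`predict` model interface that we assume here for illustration; it is not the exact evaluation script.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, precision_score, recall_score

def cross_validate(model_factory, X, y, n_splits=10, seed=42):
    """Run stratified k-fold CV and collect per-fold precision, recall, and F1.

    model_factory: callable returning a fresh model exposing fit(X, y) and predict(X).
    X: (n_samples, n_features, n_timesteps) windows; y: binary labels (1 = abnormal).
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"precision": [], "recall": [], "f1": []}
    for train_idx, test_idx in skf.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        scores["precision"].append(precision_score(y[test_idx], y_pred))
        scores["recall"].append(recall_score(y[test_idx], y_pred))
        scores["f1"].append(f1_score(y[test_idx], y_pred))
    # Mean, standard deviation, and median per metric across the folds
    return {k: (np.mean(v), np.std(v), np.median(v)) for k, v in scores.items()}
```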
Table 4 shows that the highest precision value of 0.99 was achieved by the XceptionTime and XCM models, indicating that they are very good at correctly identifying positive samples: out of all the positive predictions made by these models, 99% were correct. The precision value for the InceptionTime model was relatively lower, at 0.64, indicating that it produced more false positives than the other models did. Considering that even a small false-positive rate may render the model totally infeasible in real life, the XceptionTime and XCM models appear to be the most appropriate choices for the shoplifting crime prevention application, as they achieved the highest precision values. The MiniRocket model, on the other hand, achieved the lowest precision value of 0.43, which may be considered unacceptable in this context.
While recall is also an important metric, the fact that false positives may render the model totally infeasible suggests that precision should be prioritized over recall in this specific application. Nevertheless, precision and recall are both important, and a balance between the two should be sought in the context of the specific problem. Therefore, the average F1 scores across the 10 folds, with their standard deviations and median F1 scores for each model, are shown in Table 5. Clearly, XceptionTime and MiniRocket performed the best during cross validation, with the highest average F1 scores of 0.87 and 0.89, respectively, and a median F1 score of 0.92. Moreover, XCM actually performed the worst during cross validation, with a mean F1 score of only 0.6 and a large variance. The distribution of the F1 scores for each model is better illustrated in the box plot in Figure 11.
Moreover, the XceptionTime model still achieved the highest precision value, with a mean of 0.96 ± 0.04, indicating that it was able to correctly identify positive samples with a high degree of accuracy. The MiniRocket model had the second-highest precision value, with a mean of 0.91 ± 0.08, followed by the InceptionTime model, with a mean of 0.86 ± 0.17. The XCM model had the lowest precision value, with a mean of 0.63 ± 0.24, indicating that it was less effective at correctly identifying positive samples.
Since a high number of false positives may render the model totally infeasible in real life, the XceptionTime model may still be the most appropriate choice for the shoplifting crime prevention application, as it achieved the highest precision value in both sets of results. However, the MiniRocket and InceptionTime models also achieved high precision values and may be worth considering.
In terms of recall, the XceptionTime model had the highest mean recall value of 0.92 ± 0.06, indicating that it was able to correctly identify a high proportion of positive samples. The InceptionTime and MiniRocket models had slightly lower recall values, with means of 0.85 ± 0.14 and 0.88 ± 0.09, respectively. The XCM model had the lowest recall value, with a mean of 0.63 ± 0.19, indicating that it was less effective at identifying positive samples.
The figure shows that MiniRocket was actually the most stable algorithm since its F1 score had the lowest interquartile range, while XCM performed the poorest, with an extremely large interquartile range of about 27%.
Using the nonparametric Kruskal–Wallis test to check for significant differences among the median F1 scores, we obtained a statistic of 11.015 and a p-value of 0.012, which led us to reject the null hypothesis (at the 5% significance level) that the distributions of all the models had the same median value. Hence, at least one model had a statistically significantly different median F1 score.
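For reference, this test can be computed with SciPy as follows; the per-fold F1 arrays below are placeholder values for illustration, not the paper's fold-level results.

```python
from scipy.stats import kruskal

# Placeholder per-fold F1 scores for each model (illustrative values only)
f1_inceptiontime = [0.85, 0.90, 0.78, 0.92, 0.88, 0.95, 0.70, 0.91, 0.86, 0.89]
f1_xceptiontime  = [0.92, 0.89, 0.95, 0.90, 0.83, 0.93, 0.80, 0.92, 0.87, 0.91]
f1_minirocket    = [0.92, 0.93, 0.90, 0.88, 0.94, 0.91, 0.85, 0.92, 0.89, 0.90]
f1_xcm           = [0.60, 0.75, 0.40, 0.82, 0.55, 0.68, 0.33, 0.71, 0.58, 0.62]

stat, p_value = kruskal(f1_inceptiontime, f1_xceptiontime, f1_minirocket, f1_xcm)
print(f"H = {stat:.3f}, p = {p_value:.3f}")  # p < 0.05 -> reject equal medians
```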
We then used the nonparametric Tukey test to determine whether the medians of the F1 score distributions differed significantly and to identify exactly which models had a statistically different F1 score distribution. Table 6 shows the results of the test.
Here, we can clearly see that we were able to reject the null hypothesis (at the 5% level of significance) when comparing any model to XCM, but not otherwise. This indicates that XCM had a statistically different distribution from those of the other models, while InceptionTime, XceptionTime, and MiniRocket were comparable. Hence, XCM performed statistically worse than all the other models.
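For illustration, such a pairwise comparison can be carried out in Python; the sketch below uses Dunn's test with Holm correction (from the scikit-posthocs package) as a stand-in for the nonparametric pairwise post hoc procedure, again with placeholder F1 values rather than the paper's fold-level results.

```python
import scikit_posthocs as sp

# Placeholder per-fold F1 scores (illustrative values, not the paper's results)
f1_scores = {
    "InceptionTime": [0.85, 0.90, 0.78, 0.92, 0.88, 0.95, 0.70, 0.91, 0.86, 0.89],
    "XceptionTime":  [0.92, 0.89, 0.95, 0.90, 0.83, 0.93, 0.80, 0.92, 0.87, 0.91],
    "MiniRocket":    [0.92, 0.93, 0.90, 0.88, 0.94, 0.91, 0.85, 0.92, 0.89, 0.90],
    "XCM":           [0.60, 0.75, 0.40, 0.82, 0.55, 0.68, 0.33, 0.71, 0.58, 0.62],
}

# Dunn's test with Holm correction as a nonparametric pairwise post hoc
p_matrix = sp.posthoc_dunn(list(f1_scores.values()), p_adjust="holm")
p_matrix.index = p_matrix.columns = list(f1_scores.keys())
print(p_matrix.round(3))  # pairs involving XCM should show p < 0.05
```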
On the basis of the overall distribution of the F1 scores across the 10-fold cross validation, we conclude that MiniRocket was the best model, with the highest median F1 score of 92% and low variance, which rendered it more stable.
We now compare our best models during cross validation, MiniRocket and XceptionTime, to the baseline RTFM [33]. As shown in Table 7, our median F1 score of 92% outperforms this baseline.
The imbalance in the dataset classes was easily handled by XceptionTime and MiniRocket without requiring any particular preprocessing or data balancing methods such as oversampling. The ability of these models to handle imbalanced data without additional preprocessing indicates the robustness of the method.
A further attractive feature of MiniRocket is that it had a mean F1 score of 89%, which was the same as that for RTFM. This shows that our proposed method was able to perform just as well with a far more intuitive approach to the problem.
Additionally, as shown in Figure 12, RTFM correctly identified all the normal class instances but was unable to distinguish the abnormal instances, with 5 misclassifications and 5 correct classifications. Hence, RTFM easily learned what normal instances looked like, but was easily confused by abnormal instances.
We can also visually compare the scenes that RTFM misclassified but our approach classified correctly. RTFM incorrectly labeled five of the shoplifting scenes (numbers 24 to 28) as normal. These five misclassifications can be seen in Figure 13.
By analyzing the videos in Figure 13, one can understand why the RTFM model was unable to detect these anomalies. A particularly noticeable characteristic of each of the clips that RTFM failed to classify correctly is that, in all of them, the subject performing the theft remained in the same place for almost the entire clip. Since RTFM focuses on identifying temporal features from I3D features, we hypothesize that clips with little subject movement lack the temporal information that RTFM needs to flag them as anomalous. RTFM aims to capture temporal information between successive frames in a video but, as observed, it is unable to capture anomalies when the subjects in the clip barely move. The proposed bounding box representation of the videos captures this temporal information more effectively and allows the machine-learning model to better detect anomalies on the basis of the movement of these bounding boxes.
Additionally, the proposed approach outperformed the SOTA in terms of inference time. We examined the inference time of the RTFM model and of our proposed approach using MiniRocket on a 14 s shoplifting video at 30 frames per second. The prediction was performed 10 times, and the times were averaged to account for any fluctuations in inference time. The results of the inference test, conducted on an RTX 2060, can be found in Table 8 and show that the proposed approach with MiniRocket was up to 8.45 times faster than RTFM. The primary reason for this is that RTFM spends a large amount of computation time generating I3D features, whereas the proposed approach relies on extremely fast object detection with YOLO and Deep Sort, which leads to much faster preprocessing. Needless to say, this reduction in preprocessing time provides a competitive advantage during inference.
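The timing protocol can be summarized by the following hedged sketch, where `predict_video` is a placeholder for either pipeline's end-to-end inference call (preprocessing plus prediction).

```python
import time
import statistics

def average_inference_time(predict_video, video_path: str, n_runs: int = 10) -> float:
    """Average end-to-end inference time over repeated runs on a single video.

    predict_video: placeholder callable that preprocesses the video and returns a prediction.
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_video(video_path)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```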