A Self-Paced Multiple Instance Learning Framework for Weakly Supervised Video Anomaly Detection
Abstract
1. Introduction
- We propose a general self-paced multiple-instance learning (SP-MIL) framework for the task of WS-VAD, which can significantly enhance the performance of widely used models (e.g., DeepMIL, RTFM, UR-DMU). To the best of our knowledge, this is the first work in which self-paced learning is used for WS-VAD.
- Unlike the widely used top-k instance selection strategy, which may result in a biased classifier, we propose to adaptively select video instances (i.e., segments) from easy to hard according to the principle of self-paced learning. By alternately updating the subset of segments used for training and the parameters of the classifier, we obtain an enhanced classifier capable of unbiased prediction.
- Extensive experiments conducted on the UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the effectiveness of the proposed SP-MIL framework across different video features and baseline models, and the proposed framework achieves the best performance compared with state-of-the-art methods.
2. Related Work
2.1. Weakly Supervised Video Anomaly Detection
2.2. Self-Paced Learning
3. Proposed Method
3.1. Overview
Algorithm 1: Pseudo-code of the proposed SP-MIL framework
3.2. SP-MIL Framework
3.2.1. Baseline Model Initialization
- 1. Feature extractor. Feature extractors used in the existing WS-VAD task can generally be categorized into two types. The first type relies solely on a pre-trained network for feature extraction, such as convolutional 3D neural networks (C3D) [20], Inflated 3D ConvNets (I3D) [21], and contrastive language-image pre-training (CLIP) [22]; in this case, the output of the feature extractor is a set of segment-level feature vectors produced by the pre-trained network. The second type further refines the pre-trained features using a temporal modeling network built upon the pre-trained network, such as a graph convolutional network (GCN) [1], a multi-scale temporal network (MTN) [5], or temporal self-attention (TSA) [23]; the output is then the set of temporally refined segment features. Despite the variations in feature extraction methods and output dimensions among different pre-trained networks, each video segment can be consistently represented as a corresponding feature vector, ensuring compatibility with the subsequent modeling. To demonstrate the generality of our SP-MIL framework, we evaluate the effectiveness of two different pre-trained feature extractors (I3D and CLIP) together with DeepMIL [7], which relies solely on the pre-trained network for feature extraction, as well as RTFM [5] and UR-DMU [24], which additionally incorporate the temporal modeling network.
- 2. Classifier f. The standard classifier f used in the WS-VAD task typically comprises a fully connected network. Each video segment feature is fed into f, producing its corresponding anomaly score (a minimal sketch of such a classifier with a common MIL objective is given after this list). The widely used classification objective function is as follows. The overall objective function of the baseline model is then expressed as follows:
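To ground these two components, below is a minimal sketch of a segment-level classifier f paired with one common video-level MIL objective (binary cross-entropy on the highest-scoring segment). The input dimensionality, layer sizes, dropout rate, and this particular bag-level objective are illustrative assumptions, not the exact formulation of the baselines; the exact objectives are the equations referenced above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentClassifier(nn.Module):
    """Minimal fully connected classifier f: per-segment feature -> anomaly score in [0, 1]."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512, dropout: float = 0.6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, D) segment features of one video -> (T,) segment-level anomaly scores
        return torch.sigmoid(self.net(feats)).squeeze(-1)

def video_level_mil_loss(scores: torch.Tensor, video_label: torch.Tensor) -> torch.Tensor:
    """Bag-level BCE on the highest-scoring segment (one common MIL classification objective)."""
    bag_score = scores.max().view(1)
    return F.binary_cross_entropy(bag_score, video_label.view(1))

# Usage sketch: 32 segments per video, features from a frozen pre-trained extractor.
clf = SegmentClassifier(feat_dim=2048)
feats = torch.randn(32, 2048)          # stand-in for I3D/CLIP segment features
label = torch.tensor(1.0)              # video-level label: 1 = anomalous, 0 = normal
loss = video_level_mil_loss(clf(feats), label)
loss.backward()
```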
3.2.2. Classifier Enhancement Using Self-Paced Learning
- 1. Segment-level score prediction. For each video during the t-th epoch of training, the classifier takes the video segments as input and outputs segment-level anomaly scores using the following formulation:
- 2. Segment-level pseudo-label generation. For each segment-level anomaly score, its segment-level pseudo-label is generated using a pre-defined binarization threshold. The calculation formula is as follows:
- 3. Adaptive segment subset selection. The purpose of segment selection is to adaptively identify the most easily classified subset of segments from each video in every training epoch. The processes of segment subset selection and classifier updating are iterative. During the t-th epoch of training, this self-paced learning-based segment selection and classifier-updating strategy can be formulated as follows, where the self-paced learning regularizer can be defined according to [25]. To solve the optimization problem (8), we first need to estimate the age parameter according to the loss values of the video segments during the t-th epoch of training. In particular, for all segments of a video and their corresponding pseudo-labels, the age parameter can be estimated as follows. We first compute the corresponding loss values. It is assumed that the easier the segment, the lower its corresponding loss value. Under this assumption, we sort the segment loss values in ascending order and collect them into a vector; the age parameter can then be defined as the R-th element of this vector, that is,
- 4. Classifier updating. The classifier updating and segment selection processes are performed alternately. During this stage, the segment selection is fixed while the classifier parameters are optimized using the selected segment subset (a minimal sketch of this alternating procedure is given after this list), and the segment-level classification objective is defined as follows:
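The following is a minimal sketch of one SP-MIL round for a single video, covering the four steps above. It assumes binary cross-entropy as the per-segment loss, the hard (0/1) self-paced weighting of [25], and illustrative defaults (binarization threshold 0.5, R = 10); the function names and these defaults are assumptions, not the exact implementation of Algorithm 1.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_easy_segments(scores: torch.Tensor, threshold: float = 0.5, r: int = 10):
    """One round of SP-MIL segment selection for a single video (hedged sketch).

    scores    : (T,) segment-level anomaly scores from the current classifier (step 1).
    threshold : pre-defined binarization threshold for pseudo-labels (step 2).
    r         : index of the sorted per-segment loss used as the age parameter (step 3).
    """
    pseudo = (scores >= threshold).float()                              # segment-level pseudo-labels
    losses = F.binary_cross_entropy(scores, pseudo, reduction="none")   # per-segment loss values
    sorted_losses, _ = torch.sort(losses)                               # ascending: easy -> hard
    lam = sorted_losses[min(r, losses.numel()) - 1]                     # age parameter = R-th smallest loss
    selected = losses <= lam                                            # hard (0/1) self-paced weights
    return pseudo, selected

def spmil_update_step(clf, optimizer, feats, threshold: float = 0.5, r: int = 10) -> float:
    """Alternate step (step 4): fix the selected subset, then update the classifier on it."""
    scores = clf(feats)                                                 # (T,) scores for one video
    pseudo, selected = select_easy_segments(scores.detach(), threshold, r)
    loss = F.binary_cross_entropy(scores[selected], pseudo[selected])   # segment-level objective on the subset
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full implementation would typically let the age parameter grow across epochs so that harder segments are gradually included, following the easy-to-hard principle described above.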
3.2.3. Testing Phase
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Baseline Models and Implementation Details
4.2.1. Baseline Models
- DeepMIL model: This model was introduced by Sultani et al. [7] to solve the WS-VAD problem. In this approach, the pre-trained features of video segments are first extracted using a pre-trained feature extractor. These features are then directly fed into the classifier to generate anomaly scores, and the segment with the highest anomaly score is selected to train the classifier using the video-level label (a hedged sketch of this ranking objective is given after this list). The objective function is defined as follows:
- RTFM model: This model, introduced by Tian et al. [5], first extracts pre-trained features of video segments through a pre-trained feature extractor. These features are then processed by a multi-scale temporal network to derive temporal context features, which are subsequently fed into the classifier to obtain anomaly scores. The top-k segments with the highest temporal context feature magnitudes are selected to train the classifier with video-level labels. The objective function is defined as follows:
- UR-DMU model: This model, introduced by Zhou et al. [24], first extracts pre-trained features of video segments through a pre-trained feature extractor. These features are subsequently processed by global and local multi-head self-attention modules within a transformer network to obtain more expressive embeddings, which are then fed into dual memory units (DMU) to learn more discriminative features. The embeddings and the features output by the DMU are then incorporated into the classifier training process. For further details, please refer to [24].
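As a concrete illustration of the DeepMIL objective summarized above, here is a hedged sketch of the MIL ranking loss with temporal smoothness and sparsity terms in the spirit of Sultani et al. [7]. The coefficient values (8e-5) and the function signature are assumptions for illustration; the RTFM and UR-DMU objectives are not sketched here.

```python
import torch

def deepmil_ranking_loss(scores_abn: torch.Tensor, scores_norm: torch.Tensor,
                         lambda_smooth: float = 8e-5, lambda_sparse: float = 8e-5) -> torch.Tensor:
    """MIL ranking objective in the spirit of Sultani et al. [7] (illustrative sketch).

    scores_abn  : (T,) segment scores of an anomalous (positive-bag) video.
    scores_norm : (T,) segment scores of a normal (negative-bag) video.
    """
    # Hinge ranking term: the top-scoring anomalous segment should outscore the top normal one.
    ranking = torch.clamp(1.0 - scores_abn.max() + scores_norm.max(), min=0.0)
    # Temporal smoothness: adjacent segments of the anomalous video should score similarly.
    smoothness = ((scores_abn[1:] - scores_abn[:-1]) ** 2).sum()
    # Sparsity: only a few segments of an anomalous video are expected to be anomalous.
    sparsity = scores_abn.sum()
    return ranking + lambda_smooth * smoothness + lambda_sparse * sparsity
```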
4.2.2. Implementation Details
4.3. The Effectiveness of the SP-MIL-Based Classifier Enhancement Strategy
4.4. The Effectiveness of Late-Fusion Strategy for SP-MIL Framework
4.5. Effect of Parameters and R
4.6. Comparison with the Prior SOTA
- Results on ShanghaiTech. Table 7 presents the AUC results of different WS-VAD methods on the ShanghaiTech dataset. It can be observed that RTFM+SP-MIL achieves an AUC of 98.10%. This may be attributed to the fact that our framework can enhance the classifier’s discriminability through the self-paced learning strategy. Compared with the noisy label cleaning strategy in GCN [1], our model demonstrates an AUC improvement of 13.66%, highlighting the benefit of refining segment-level pseudo-labels via the SP-MIL framework. Additionally, Ye et al. [30] and Wu et al. [16] improved the temporal modeling of pre-trained features; however, their performance remains slightly lower than that of our RTFM+SP-MIL method.
Method | Source | Feature | AUC (%) |
---|---|---|---|
DeepMIL [7] | CVPR 2018 | C3D RGB | 75.41 |
DeepMIL * [7] | CVPR 2018 | I3D RGB | 78.32 |
DeepMIL * [7] | CVPR 2018 | CLIP | 80.91 |
GCN [1] | CVPR 2019 | TSN RGB | 82.12 |
CLAWS [31] | ECCV 2020 | C3D RGB | 83.03 |
MIST [9] | CVPR 2021 | I3D RGB | 82.30 |
RTFM [5] | ICCV 2021 | I3D RGB | 84.30 |
RTFM * [5] | ICCV 2021 | I3D RGB | 83.50 |
RTFM * [5] | ICCV 2021 | CLIP | 84.45 |
Chang et al. [14] | TMM 2021 | I3D RGB | 84.62 |
Wu et al. [16] | TIP 2021 | I3D RGB | 84.89 |
BN-SVP [32] | CVPR 2022 | I3D RGB | 83.39 |
TCA-VAD [33] | ICME 2022 | I3D RGB | 83.75 |
MSL [6] | AAAI 2022 | I3D RGB | 85.30 |
Thakare et al. [34] | PR 2023 | I3D RGB | 83.56 |
Liu et al. [13] | TNNLS 2023 | I3D RGB | 85.42 |
Sun et al. [35] | ICME 2023 | I3D RGB | 85.88 |
Cho et al. [36] | CVPR 2023 | I3D RGB | 86.10 |
Zhang et al. [11] | CVPR 2023 | I3D RGB | 86.22 |
He et al. [28] | PR 2024 | I3D RGB | 85.07 |
AlMarri et al. [37] | WACV 2024 | I3D RGB+FLOW | 85.47 |
Yang et al. [29] | CVPR 2024 | CLIP (RGB+Text) | 87.79 |
DeepMIL+SP-MIL | | I3D+CLIP | 84.81 |
RTFM+SP-MIL | | I3D+CLIP | 86.35 |
UR-DMU+SP-MIL | | I3D+CLIP | 87.76 |

* means that we re-implemented this method.

Method | Source | Feature | AUC (%) |
---|---|---|---|
DeepMIL * [7] | CVPR 2018 | I3D RGB | 92.71 |
DeepMIL * [7] | CVPR 2018 | CLIP | 94.96 |
GCN [1] | CVPR 2019 | TSN RGB | 84.44 |
CLAWS [31] | ECCV 2020 | C3D RGB | 89.67 |
Chang et al. [14] | TMM 2021 | I3D RGB | 92.25 |
MIST [9] | CVPR 2021 | I3D RGB | 94.83 |
RTFM [5] | ICCV 2021 | I3D RGB | 97.21 |
RTFM * [5] | ICCV 2021 | I3D RGB | 97.39 |
RTFM * [5] | ICCV 2021 | CLIP | 97.70 |
Wu et al. [16] | TIP 2021 | I3D RGB | 97.48 |
BN-SVP [32] | CVPR 2022 | C3D RGB | 96.00 |
S3R [38] | ECCV 2022 | I3D RGB | 97.48 |
MSL [6] | AAAI 2022 | I3D RGB | 96.08 |
Liu et al. [13] | TNNLS 2023 | I3D RGB | 97.54 |
Liu et al. [13] | TNNLS 2023 | I3D RGB+FLOW | 97.76 |
Cho et al. [36] | CVPR 2023 | I3D RGB | 97.60 |
Sun et al. [35] | ICME 2023 | I3D RGB | 97.92 |
Tan et al. [39] | WACV 2024 | I3D RGB+FLOW | 97.54 |
Ye et al. [30] | ICASSP 2024 | I3D RGB | 98.00 |
DeepMIL+SP-MIL | | I3D+CLIP | 97.81 |
RTFM+SP-MIL | | I3D+CLIP | 98.10 |
UR-DMU+SP-MIL | | I3D+CLIP | 97.92 |

* means that we re-implemented this method.

- Results on XD-Violence. Table 8 shows the performance of our framework on the XD-Violence dataset compared with that of state-of-the-art (SOTA) methods. As can be seen from the table, RTFM+SP-MIL achieved the best results. With the CLIP feature extractor, the AP value of our RTFM+SP-MIL was 0.74% higher than that of Yang et al. [29]. The MGFN [40] network enhances the multi-scale temporal network of the RTFM model, but its result is 5.23% lower than that of our RTFM+SP-MIL. Similarly, the AFT method [28] focuses on recognizing abnormal events and also employs a late-fusion strategy, but our RTFM+SP-MIL result is 4.35% higher than that of He et al. [28]. These results demonstrate the effectiveness of our framework.
4.7. Qualitative Results
5. Conclusions
6. Limitation and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1237–1246. [Google Scholar]
- Kommanduri, R.; Ghorai, M. DAST-Net: Dense visual attention augmented spatio-temporal network for unsupervised video anomaly detection. Neurocomputing 2024, 579, 127444. [Google Scholar] [CrossRef]
- Cho, M.; Kim, T.; Kim, W.J.; Cho, S.; Lee, S. Unsupervised video anomaly detection via normalizing flows with implicit latent features. Pattern Recognit. 2022, 129, 108703. [Google Scholar] [CrossRef]
- Pu, Y.; Wu, X. Locality-Aware Attention Network with Discriminative Dynamics Learning for Weakly Supervised Anomaly Detection. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4975–4986. [Google Scholar]
- Li, S.; Liu, F.; Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1395–1403. [Google Scholar]
- Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]
- Park, S.; Kim, H.; Kim, M.; Kim, D.; Sohn, K. Normality Guided Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2665–2674. [Google Scholar]
- Feng, J.C.; Hong, F.T.; Zheng, W.S. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14009–14018. [Google Scholar]
- Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8022–8031. [Google Scholar]
- Zhang, C.; Li, G.; Qi, Y.; Wang, S.; Qing, L.; Huang, Q.; Yang, M.H. Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16271–16280. [Google Scholar]
- Kumar, M.; Packer, B.; Koller, D. Self-Paced Learning for Latent Variable Models. Adv. Neural Inf. Process. Syst. 2010, 23, 1189–1197. [Google Scholar]
- Liu, T.; Lam, K.M.; Kong, J. Distilling privileged knowledge for anomalous event detection from weakly labeled videos. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 12627–12641. [Google Scholar] [CrossRef] [PubMed]
- Chang, S.; Li, Y.; Shen, S.; Feng, J.; Zhou, Z. Contrastive attention for video anomaly detection. IEEE Trans. Multimed. 2021, 24, 4067–4076. [Google Scholar] [CrossRef]
- Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar]
- Wu, P.; Liu, J. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Trans. Image Process. 2021, 30, 3513–3527. [Google Scholar] [CrossRef] [PubMed]
- Zhou, S.; Wang, J.; Meng, D.; Xin, X.; Li, Y.; Gong, Y.; Zheng, N. Deep self-paced learning for person re-identification. Pattern Recognit. 2018, 76, 739–751. [Google Scholar] [CrossRef]
- Sangineto, E.; Nabi, M.; Culibrk, D.; Sebe, N. Self paced deep learning for weakly supervised object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 712–725. [Google Scholar] [CrossRef] [PubMed]
- Zhang, D.; Yang, L.; Meng, D.; Xu, D.; Han, J. SPFTN: A Self-Paced Fine-Tuning Network for Segmenting Objects in Weakly Labelled Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5340–5348. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Joo, H.K.; Vo, K.; Yamazaki, K.; Le, N. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3230–3234. [Google Scholar]
- Zhou, H.; Yu, J.; Yang, W. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3769–3777. [Google Scholar]
- Wang, K.; Wang, Y.; Zhao, Q.; Meng, D.; Liao, X.; Xu, Z. SPLBoost: An Improved Robust Boosting Algorithm Based on Self-Paced Learning. IEEE Trans. Cybern. 2021, 51, 1556–1570. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6536–6545. [Google Scholar]
- Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 322–339. [Google Scholar]
- He, P.; Zhang, F.; Li, G.; Li, H. Adversarial and focused training of abnormal videos for weakly-supervised anomaly detection. Pattern Recognit. 2024, 147, 110119. [Google Scholar] [CrossRef]
- Yang, Z.; Liu, J.; Wu, P. Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 18899–18908. [Google Scholar]
- Ye, H.; Xu, K.; Jiang, X.; Sun, T. Learning Spatio-Temporal Relations with Multi-Scale Integrated Perception for Video Anomaly Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 4020–4024. [Google Scholar]
- Zaheer, M.Z.; Mahmood, A.; Astrid, M.; Lee, S.I. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 358–376. [Google Scholar]
- Sapkota, H.; Yu, Q. Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
- Yu, S.; Wang, C.; Xiang, L.; Wu, J. TCA-VAD: Temporal Context Alignment Network for Weakly Supervised Video Anomaly Detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Thakare, K.V.; Dogra, D.P.; Choi, H.; Kim, H.; Kim, I.J. Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognit. 2023, 140, 109567. [Google Scholar] [CrossRef]
- Sun, S.; Gong, X. Long-short temporal co-teaching for weakly supervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2711–2716. [Google Scholar]
- Cho, M.; Kim, M.; Hwang, S.; Park, C.; Lee, K.; Lee, S. Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12137–12146. [Google Scholar]
- AlMarri, S.; Zaheer, M.Z.; Nandakumar, K. A Multi-Head Approach with Shuffled Segments for Weakly-Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 132–142. [Google Scholar]
- Wu, J.C.; Hsieh, H.Y.; Chen, D.J.; Fuh, C.S.; Liu, T.L. Self-supervised Sparse Representation for Video Anomaly Detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
- Tan, W.; Yao, Q.; Liu, J. Overlooked video classification in weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 202–210. [Google Scholar]
- Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. Mgfn: Magnitude-contrastive glance-and-focus network for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 387–395. [Google Scholar]
Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%) |
---|---|---|---|---|
DeepMIL * | I3D | 78.32 | 92.71 | 73.24 |
DeepMIL+SP-MIL | I3D | 82.68 | 97.04 | 74.44 |
RTFM * | I3D | 83.50 | 97.39 | 78.07 |
RTFM+SP-MIL | I3D | 85.20 | 97.74 | 80.81 |
UR-DMU * | I3D | 85.02 | 97.54 | 80.05 |
UR-DMU+SP-MIL | I3D | 86.53 | 97.76 | 80.77 |
Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%) |
---|---|---|---|---|
DeepMIL * | CLIP | 80.91 | 94.96 | 74.03 |
DeepMIL+SP-MIL | CLIP | 84.14 | 97.54 | 75.28 |
RTFM * | CLIP | 84.45 | 97.70 | 79.11 |
RTFM+SP-MIL | CLIP | 85.50 | 97.94 | 81.52 |
UR-DMU * | CLIP | 86.76 | 97.56 | 80.64 |
UR-DMU+SP-MIL | CLIP | 87.13 | 97.90 | 81.24 |
Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%) |
---|---|---|---|---|
DeepMIL * | I3D+CLIP | 80.96 | 94.97 | 75.07 |
DeepMIL+SP-MIL | I3D+CLIP | 84.81 | 97.81 | 76.63 |
RTFM * | I3D+CLIP | 84.65 | 97.76 | 80.80 |
RTFM+SP-MIL | I3D+CLIP | 86.35 | 98.10 | 84.42 |
UR-DMU * | I3D+CLIP | 86.89 | 97.78 | 82.97 |
UR-DMU+SP-MIL | I3D+CLIP | 87.76 | 97.92 | 84.32 |
 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|---|---|---|---|
AUC (%) | 83.94 | 84.60 | 84.68 | 85.40 | 85.50 | 85.49 | 85.41 | 84.80 | 83.58 |
R | 3 | 5 | 8 | 10 | 15 | 20 | 25 | 30 | 32 |
---|---|---|---|---|---|---|---|---|---|
AUC (%) | 84.43 | 84.68 | 85.07 | 85.50 | 84.23 | 83.67 | 82.99 | 81.24 | 80.75 |
Method | Source | Feature | AP (%) |
---|---|---|---|
DeepMIL * [7] | CVPR 2018 | I3D RGB | 73.24 |
DeepMIL * [7] | CVPR 2018 | CLIP | 74.03 |
Wu et al. [27] | ECCV 2020 | C3D RGB | 67.19 |
MSL [6] | AAAI 2022 | C3D RGB | 75.53 |
Wu et al. [16] | TIP 2021 | I3D RGB | 75.90 |
RTFM [5] | ICCV 2021 | I3D RGB | 77.81 |
RTFM * [5] | ICCV 2021 | I3D RGB | 78.07 |
RTFM * [5] | ICCV 2021 | CLIP | 79.11 |
MSL [6] | AAAI 2022 | I3D RGB | 78.28 |
Sun et al. [35] | ICME 2023 | I3D RGB | 77.92 |
MGFN [40] | AAAI 2023 | I3D RGB | 79.19 |
Thakare et al. [34] | PR 2023 | I3D RGB | 79.89 |
DMU [24] | AAAI 2023 | I3D RGB | 81.66 |
Liu et al. [13] | TNNLS 2023 | I3D RGB | 79.00 |
Zhang et al. [11] | CVPR 2023 | I3D RGB | 78.74 |
Zhang et al. [11] | CVPR 2023 | I3D+VGGish | 81.43 |
Cho et al. [36] | CVPR 2023 | I3D RGB | 81.30 |
He et al. [28] | PR 2024 | I3D RGB | 80.07 |
Tan et al. [39] | WACV 2024 | I3D RGB | 82.10 |
Yang et al. [29] | CVPR 2024 | CLIP | 83.68 |
DeepMIL+SP-MIL | | I3D+CLIP | 76.63 |
RTFM+SP-MIL | | I3D+CLIP | 84.42 |
UR-DMU+SP-MIL | | I3D+CLIP | 84.32 |