Article

A Self-Paced Multiple Instance Learning Framework for Weakly Supervised Video Anomaly Detection

School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1049; https://doi.org/10.3390/app15031049
Submission received: 25 November 2024 / Revised: 15 January 2025 / Accepted: 19 January 2025 / Published: 21 January 2025
(This article belongs to the Collection Trends and Prospects in Multimedia)

Abstract

Weakly supervised video anomaly detection (WS-VAD) is often addressed as a multiple-instance learning problem in which a small, fixed number of video segments is selected for classifier training. However, this selection strategy usually leads to a biased classifier. To solve this problem, we propose a novel self-paced multiple-instance learning (SP-MIL) framework for WS-VAD. Given a pre-trained baseline model, the proposed SP-MIL enhances its performance by adaptively selecting video segments (from easy to hard) and persistently updating the classifier. In particular, in each training epoch, the baseline classifier first predicts the anomaly score of each segment, from which pseudo-labels are generated. Then, for all segments in each video, an age parameter is estimated from their loss values. Based on the age parameter, we determine the self-paced learning weight (a hard weight with values of 0 or 1) of each segment, which is used to select the subset of segments. Finally, the selected segments, along with their pseudo-labels, are used to update the classifier. Extensive experiments conducted on the UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the effectiveness of the proposed framework, which outperforms state-of-the-art methods.

1. Introduction

As monitoring systems become more widespread, the timely and accurate detection of abnormal behaviors or emergencies in surveillance videos is crucial for building smart cities and maintaining public safety. An abnormal event in a video typically refers to an unusual appearance or motion property, or to a normal appearance or motion property occurring at an anomalous time or in an anomalous space. The task of video anomaly detection is to identify such temporal or spatial anomalies within the video. Based on the granularity of the available labels, video anomaly detection methods can typically be classified into supervised learning [1], unsupervised learning [2,3], and weakly supervised learning [4,5]. Since manual frame-by-frame labeling of abnormal events in videos is extremely labor-intensive, supervised learning-based video anomaly detection methods are prohibitively costly. In contrast, unsupervised video anomaly detection methods rely on the similarities among samples to perform clustering, or model the distribution of normal samples. During testing, video frames that deviate significantly from the learned normal pattern are identified as anomalies. However, due to the absence of prior knowledge about anomalies, the effectiveness of unsupervised methods is generally limited [6]. Consequently, this paper focuses on weakly supervised video anomaly detection methods, which rely solely on video-level labels (normal and abnormal) during training.
The WS-VAD task was initially proposed by Sultani et al. [7] and formulated as a multiple-instance learning (MIL) problem. In particular, normal and abnormal videos are treated as negative and positive bags, respectively, and a series of non-overlapping video clips of equal length are treated as instances within each bag. Only video-level labels are provided during training, while frame-by-frame anomaly detection must be performed during testing. If at least one anomalous instance is detected in a video, the video is labeled anomalous (i.e., a positive bag); otherwise, it is classified as normal (i.e., a negative bag). The objective of the WS-VAD task is to train an instance-level anomaly classifier using only video-level labels. During training, the most representative instances from both positive and negative bags are first identified, and these selected instances, along with their corresponding bag-level labels, are then used to train the anomaly detection classifier. The training process forces down the predicted anomaly score of each selected representative segment in a normal video, while pushing up the scores of one or more selected segments with the highest anomaly scores in an abnormal video. During testing, the classifier predicts the anomaly score for each video segment, with all frames within a segment being assigned the same score, thereby enabling frame-by-frame anomaly detection. Based on this MIL framework, numerous studies have been conducted, leading to significant advancements [8,9].
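To make the top-k selection concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how segment-level anomaly scores are pooled into a video-level prediction under this MIL strategy; the tensor shapes and the choice of k are illustrative assumptions.

```python
import torch

def video_level_score(segment_scores: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Pool segment-level anomaly scores (shape [N]) into a single video-level score
    by averaging the k largest scores, mimicking the top-k MIL selection strategy."""
    topk = torch.topk(segment_scores, k=min(k, segment_scores.numel())).values
    return topk.mean()

# Toy example: a bag of N = 32 segment scores in [0, 1] for one video.
scores = torch.rand(32)
p_video = video_level_score(scores, k=3)
# p_video is then compared with the video-level label y in {0, 1} via a binary
# cross-entropy or ranking loss to train the classifier.
```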
However, as a result of this training approach, the classifier is typically sensitive only to the most representative segments of both normal and abnormal videos, while neglecting the other segments within these videos. Consequently, classifiers trained in this manner often yield biased results [10], tending to recognize only the most conspicuous anomalies while frequently overlooking subtle ones. For example, a classifier may identify the moment a car is vandalized but miss the subtle anomalous activity of the person just before the vandalism. The moment the car is destroyed can be seen as an easy anomaly, while the subtle abnormal activity prior to the destruction can be seen as a difficult anomaly. A biased classifier generally recognizes only the easy anomalies and fails on the difficult ones, whereas an unbiased classifier should be able to identify both. The root cause of biased predictions in MIL is its training scheme, which involves biased sample selection. A natural idea is therefore to train the classifier on a broader set of video clips rather than only the top-k segments with the highest anomaly scores. Previous studies [1,10,11] have made significant progress toward this goal, often by training classifiers on the most confident or reliable video clips along with their respective pseudo-labels. For example, Zhong et al. [1] performed 10-crop data augmentation on each video segment and then selected the segments with the least variance among the ten scores as the most confident ones for training. Similarly, Zhang et al. [11] employed Monte Carlo dropout for uncertainty estimation, with each segment generating multiple distinct scores via dropout; a fixed proportion of segments with the lowest variance is selected, and their corresponding average scores are used to train the classifier. In addition, Lv et al. [10] considered the video clips with the most stable historical scores as the confident set, while the remaining clips were categorized as the ambiguous set. They employed supervised learning for clips within the confident set and unsupervised clustering for clips within the ambiguous set to facilitate classifier training. Although the aforementioned methods have demonstrated some success, they typically pre-define the proportion or number of reliable and confident samples to be selected, which lacks flexibility in adapting to different datasets and can limit the generalization performance of the classifier.
To address this issue, we propose a general self-paced multiple-instance learning (SP-MIL) framework inspired by the idea of self-paced learning [12]. A comparative flow diagram illustrating our framework is shown in Figure 1. In the baseline model, given a sequence of video segments, segment-level features are derived via feature extraction and temporal modeling, after which a randomly initialized classifier predicts the anomaly score of each segment. The top-k segments with the highest anomaly scores are then selected and aggregated to form video-level features, and the classifier is trained using these video-level features along with the corresponding video-level labels to obtain the baseline model. It is worth noting that k is usually a pre-set fixed value that does not change with the training epochs. This setting makes the trained classifier biased and inaccurate in judging the degree of abnormality of the other segments in the video. Different from the baseline model, the proposed SP-MIL framework enhances the classifier's prediction ability by progressively and adaptively incorporating more representative video segments, from easy to hard, in each epoch of training through a self-paced learning strategy. Specifically, in each training epoch, the difficulty of each video segment is first calculated based on its current anomaly score and its corresponding pseudo-label. Based on this difficulty, the easiest video segments at this epoch are selected for training. As training progresses, the number of selected segments gradually increases from easy to hard, and more segments are adaptively integrated into the classifier's training process. This enables the classifier to progressively incorporate more video segments, thereby enhancing its ability to distinguish between anomalous and normal cases and yielding unbiased predictions.
The main contributions of this paper can be summarized as follows:
  • We propose a general self-paced multiple-instance learning (SP-MIL) framework for the task of WS-VAD, which can significantly enhance the performance of widely used models (e.g., DeepMIL, RTFM, UR-DMU). To the best of our knowledge, this is the first work in which self-paced learning is used for WS-VAD.
  • Unlike the widely used top-k instance selection strategy, which may result in a biased classifier, we propose to adaptively select video instances (i.e., segments) from easy to hard according to the principle of self-paced learning. By alternately updating the subset of segments used for training and the parameters of the classifier, we obtain an enhanced classifier capable of unbiased prediction.
  • Extensive experiments conducted on the UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the effectiveness of the proposed SP-MIL framework across different video features and baseline models, and the proposed framework achieved the best performance compared with the state-of-the-art methods.

2. Related Work

2.1. Weakly Supervised Video Anomaly Detection

The weakly supervised setting offers a balanced solution for the VAD task, requiring less annotation effort than the dense frame-level labeling of a fully supervised setting while achieving superior performance compared to the unsupervised setting. Current WS-VAD approaches can be broadly categorized into two classes [13]. The first class comprises encoder-agnostic methods, which utilize a pre-trained feature extractor to obtain segment features and train only the classifier to detect abnormal segments. The second class comprises encoder-based methods, which train both the feature extractor and the classifier. Among the first class, Sultani et al. [7] modeled this task as a multiple-instance learning problem, where the segment with the highest anomaly score was used as the video-level prediction and a ranking loss was employed to increase the margin between normal and abnormal video predictions. Chang et al. [14] proposed a contrastive attention mechanism, assigning importance weights to each video segment and weighting segment features to obtain video-level features, which were then used with video-level labels to train the classifier. Zhang et al. [15] improved the loss function by introducing a new inner-bag loss, which encouraged the highest and lowest scores in normal videos to converge as closely as possible, while driving the highest and lowest scores in abnormal videos apart. However, some researchers argued that the feature extractors used in this category are usually pre-trained on large-scale trimmed video datasets, which introduces a domain gap when they are applied to the WS-VAD task [9].
In the second class of methods, some researchers [5,9,16] performed temporal modeling on the segment features produced by the feature extractor. They used temporal modeling networks to further extract temporal context features and then trained the temporal modeling network alongside the classifier. For example, Tian et al. [5] proposed a multi-scale temporal network (MTN) that captures both multi-resolution local and global temporal dependencies between video segments. Wu and Liu [16] introduced a causal temporal relation (CTR) module, designed to aggregate relevant information from historical and current features through temporal attention. In contrast, Feng et al. [9] trained the feature extractor and classifier simultaneously, without using a temporal modeling network; they proposed an attention-based self-guided encoder to learn task-specific representations and generated clip-level pseudo-labels to guide network optimization. However, both types of methods typically use only a small, fixed number of segments with the top-k anomaly scores from normal and abnormal videos to update the classifier. A classifier obtained by this kind of segment selection and training strategy usually produces biased predictions [10]; that is, only simple and obvious abnormal behaviors can be detected, while more subtle anomalies may be missed or trigger false alarms. Based on this observation, achieving unbiased prediction using only video-level labels, without additional human annotations, has become an important problem to address.

2.2. Self-Paced Learning

The self-paced learning (SPL) theory [12] is inspired by the cognitive process of human beings, where samples are gradually incorporated into the training process from easy to hard. In recent years, self-paced learning methods have been successfully applied to different computer vision tasks [17,18,19]. For example, Zhou et al. [17] proposed a deep self-paced learning algorithm for the person re-identification task, which utilized high-confidence samples while suppressing low-confidence noisy samples in the early stages of training. This approach allowed the neural network to be trained more stably by gradually involving reliable samples from easy to hard, thereby improving the accuracy of person re-identification. For weakly supervised object detection tasks, Sangineto et al. [18] proposed a training method based on self-paced learning, which selects the most reliable subset of images and boxes in each iteration and trains the network in a fully supervised manner to develop robust detectors. Zhang et al. [19] proposed a self-paced fine-tuning network-based framework for the weakly supervised object segmentation task, which combined a self-paced learning mechanism with deep neural networks to enhance the model’s segmentation capability.
Inspired by the successful application of SPL in related tasks, we propose to combine SPL with multiple-instance learning (MIL) for the WS-VAD task. This combination aims to aid the selection of more reliable video segments, thereby facilitating the training of more robust models. To the best of our knowledge, we are the first to use SPL for the WS-VAD task. The work most closely related to ours is that of Lv et al. [10], in which the authors proposed an unbiased multiple-instance learning (UMIL) approach. In each training iteration, the video clips are first divided into confident and ambiguous sets. The confident set is used to train the classifier through segment-wise supervised learning, while the ambiguous set assists classifier training by reducing the abnormal context bias within the confident set through unsupervised clustering. This method allows the classifier to be trained on a broader range of video clips, ultimately leading to a more reliable and unbiased classifier. However, their approach requires presetting the proportion of confident video clips used for training, which limits its flexibility in practical scenarios, particularly for videos with differing durations of abnormal events. Therefore, inspired by the success of self-paced learning, we propose an SP-MIL framework whose core idea is to adaptively select the subset of segments used for training in an easy-to-hard manner and to dynamically update the classifier, persistently enhancing its performance through continuous self-training.

3. Proposed Method

3.1. Overview

An overview of the proposed general SP-MIL framework is shown in Figure 2. Each input video is first split into consecutive, non-overlapping segments. Then, the pre-trained features of these segments are extracted, and temporal modeling is applied to derive temporal context features; note that temporal modeling is an optional module. The temporal context features of the segments are then fed into a randomly initialized classifier to predict the anomaly score of each segment. Subsequently, the most representative segments are selected, together with the video-level labels, to train this classifier. Once this training is complete, the trained classifier serves as the baseline model of our proposed SP-MIL framework and is then refined to obtain an enhanced classifier for unbiased prediction. The pseudo-code of the proposed SP-MIL framework is provided in Algorithm 1. Specifically, the anomaly scores of the segments are first predicted and then converted into pseudo-labels via threshold binarization, and the difficulty of each segment is calculated from its loss. Next, in each training epoch, the age parameter of each video is estimated from the losses of all segments in the video, reflecting the overall difficulty of the video and its segments. The estimated age parameter is then used as a threshold to determine the self-paced learning weight of each segment, a binary hard weight with values of 0 or 1. Finally, the segment subset used for training is selected according to the self-paced learning weights, and the selected segments, along with their pseudo-labels, are used to update the classifier. This process is repeated T times to obtain the enhanced classifier $f^{T}$ with the optimal parameters $\omega^{T}$.
Algorithm 1: Pseudo-code of the proposed SP-MIL framework
Applsci 15 01049 i001
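Since Algorithm 1 appears only as a figure here, the following hedged PyTorch sketch restates the training loop described above under our own simplifying assumptions (a sigmoid classifier over pre-extracted segment features, and omission of the hinge/sparsity/smoothness regularizers introduced later); the variable names and data layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_sp_mil(classifier, videos, labels, eta=0.5, R0=3, RT=32, T=100):
    """Hedged sketch of the SP-MIL loop: score prediction, pseudo-labeling,
    age-parameter estimation, easy-to-hard segment selection, classifier update.
    `videos[k]` holds the segment features of video k; `labels[k]` is 0 or 1."""
    optim = torch.optim.Adam(classifier.parameters(), lr=1e-3, weight_decay=5e-4)
    for t in range(1, T + 1):
        R = R0 + (t * (RT - R0)) // T                       # selection budget (Eq. (11))
        for feats, y_video in zip(videos, labels):
            scores = classifier(feats).squeeze(-1)          # segment-level anomaly scores
            # Pseudo-labels: threshold scores in abnormal videos, all zeros in normal ones.
            pseudo = (scores.detach() > eta).float() if y_video == 1 else torch.zeros_like(scores)
            losses = F.binary_cross_entropy(scores, pseudo, reduction="none")
            lam = torch.sort(losses.detach()).values[min(R, len(losses)) - 1]  # age parameter (Eq. (10))
            v = (losses.detach() <= lam).float()            # hard self-paced weights (Eq. (12))
            loss = (v * losses).sum() / v.sum().clamp(min=1)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return classifier                                       # enhanced classifier f_T
```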

3.2. SP-MIL Framework

3.2.1. Baseline Model Initialization

Let $\{X_k\}_{k=1}^{K}$ denote the given $K$ training videos, which include $K^{+}$ abnormal videos and $K^{-}$ normal videos; the index set of the abnormal videos is $I^{+} = \{1, 2, \ldots, K^{+}\}$ and that of the normal videos is $I^{-} = \{K^{+}+1, K^{+}+2, \ldots, K\}$. Given a video $X_k = \{x_{k,i}\}_{i=1}^{N}$ with $N$ continuous and non-overlapping segments, the corresponding weak label $y_k \in \{0, 1\}$ indicates whether $X_k$ contains abnormal events. The baseline model consists of two primary components: a feature extractor $g$ and a classifier $f$. For each video $X_k$, the feature extractor $g$ takes the video segments $\{x_{k,i}\}_{i=1}^{N}$ as input and generates segment features $\{z_{k,i}\}_{i=1}^{N}$, which are subsequently fed into the classifier $f$ to obtain the anomaly scores $\{s_{k,i}\}_{i=1}^{N}$. The top-1 or top-k segments with the highest anomaly scores are selected and used to train the classifier with video-level labels.
1.
Feature extractor $g$. Feature extractors used in existing WS-VAD methods can generally be categorized into two types. The first type relies solely on a pre-trained network $h_0$ for feature extraction, such as convolutional 3D neural networks (C3D) [20], Inflated 3D ConvNets (I3D) [21], and contrastive language-image pre-training (CLIP) [22]. The output of such a feature extractor is represented as $z_{k,i} = g(x_{k,i}) = h_0(x_{k,i})$. The second type further refines the features using a temporal modeling network $h_1$ built upon the pre-trained network $h_0$, such as a graph convolutional network (GCN) [1], a multi-scale temporal network (MTN) [5], or temporal self-attention (TSA) [23]. In this case, the output of the feature extractor is represented as $z_{k,i} = g(x_{k,i}) = h_1(h_0(x_{k,i}))$. Despite the variations in feature extraction methods and output dimensions among different pre-trained networks, each video segment can be consistently represented as a feature vector, ensuring compatibility between $h_1$ and $h_0$. To demonstrate the generality of our SP-MIL framework, we evaluate its effectiveness with two different pre-trained feature extractors (I3D and CLIP) on DeepMIL [7], which relies solely on the pre-trained network $h_0$ for feature extraction, and on RTFM [5] and UR-DMU [24], which additionally incorporate the temporal modeling network $h_1$.
2.
Classifier $f$. The standard classifier $f$ used in the WS-VAD task typically comprises a fully connected network. Each video segment feature $z_{k,i}$ is fed into $f$, producing its corresponding anomaly score $s_{k,i} = f(z_{k,i}) = f(g(x_{k,i}))$. The widely used classification objective function is as follows:
$$\mathcal{L}_{ce} = -y_k \log p_k - (1 - y_k) \log(1 - p_k),$$
where $p_k = \max_{i \in \{1,2,\ldots,N\}} s_{k,i}$ or $p_k = \frac{1}{k}\sum_{i \in \Omega_k(X_k)} s_{k,i}$, with $\Omega_k(X_k)$ denoting the segment index set of video $X_k$ with the top-k largest anomaly scores. This indicates that one or more of the most representative video segments are typically selected, and the video-level label is used to supervise the training of the classifier $f$. The regularization terms commonly used in baseline loss functions include the ranking loss (i.e., hinge loss), the smoothness loss, and the sparsity loss, with their respective expressions given as follows:
$$\mathcal{L}_{hinge} = \max\Big(0,\ 1 - \max_{k \in I^{+}} f(g(x_{k,i})) + \max_{k \in I^{-}} f(g(x_{k,i}))\Big),$$
$$\mathcal{L}_{sparse} = \sum_{i=1}^{N} \big|f(g(x_{k,i}))\big|,$$
$$\mathcal{L}_{smooth} = \sum_{i=1}^{N-1} \big(f(g(x_{k,i})) - f(g(x_{k,i+1}))\big)^{2}.$$
The overall objective function of the baseline model is expressed as follows:
$$\mathcal{L}_{baseline} = \alpha \mathcal{L}_{hinge} + \beta \mathcal{L}_{sparse} + \gamma \mathcal{L}_{smooth} + \mathcal{L}_{ce},$$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters.
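As a concrete reading of Equations (1)-(5), the sketch below computes the baseline objective for one abnormal/normal video pair in PyTorch; applying the hinge term to the pooled bag scores and the placeholder loss weights are our assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def baseline_loss(scores_abn, scores_nor, k=3, alpha=1.0, beta=1.0, gamma=1.0):
    """Hedged sketch of the baseline objective (Eqs. (1)-(5)) for one abnormal video
    (label 1) and one normal video (label 0), given their segment scores in [0, 1]."""
    # Eq. (1): BCE on the top-k pooled video-level scores.
    p_abn = torch.topk(scores_abn, k).values.mean()
    p_nor = torch.topk(scores_nor, k).values.mean()
    l_ce = F.binary_cross_entropy(p_abn, torch.ones(())) + \
           F.binary_cross_entropy(p_nor, torch.zeros(()))
    # Eq. (2): ranking (hinge) loss between the two bags' maximum scores.
    l_hinge = torch.clamp(1.0 - scores_abn.max() + scores_nor.max(), min=0.0)
    # Eq. (3): sparsity of the abnormal-video scores.
    l_sparse = scores_abn.abs().sum()
    # Eq. (4): temporal smoothness of consecutive segment scores.
    l_smooth = ((scores_abn[1:] - scores_abn[:-1]) ** 2).sum()
    # Eq. (5): weighted combination (alpha, beta, gamma are placeholder weights here).
    return alpha * l_hinge + beta * l_sparse + gamma * l_smooth + l_ce
```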

3.2.2. Classifier Enhancement Using Self-Paced Learning

The SP-MIL framework models the WS-VAD task as a segment-level supervised learning problem. The main idea is to initially learn easy segments and then gradually transfer the acquired knowledge to recognize harder ones. By gradually exposing the classifier to more segments and their pseudo-labels during training, the classifier can identify difficult anomalies more effectively than the baseline model, thereby achieving more unbiased prediction. Specifically, the SP-MIL framework consists of four primary steps: segment-level score prediction, segment-level pseudo-label generation, adaptive segment subset selection, and classifier updating.
For illustrative purposes, let the superscript $t$ denote the $t$-th training epoch. For each video $X_k = \{x_{k,i}\}_{i=1}^{N}$ with $N$ segments, the segment features $\{z_{k,i}\}_{i=1}^{N}$ are used to generate the anomaly scores $\{s_{k,i}^{t}\}_{i=1}^{N}$ using the classifier $f^{t-1}$ from the $(t-1)$-th training epoch. These anomaly scores are then binarized to produce the segment-level pseudo-labels $\{y_{k,i}^{t}\}_{i=1}^{N}$. Subsequently, the segment subset with index set $I_k^{t}$ is adaptively updated based on the estimated age parameter $\lambda_k^{t}$ and the segment-level self-paced learning weights $\{v_{k,i}^{t}\}_{i=1}^{N}$. The age parameter vector for all $K$ videos is $\boldsymbol{\lambda}^{t} = [\lambda_1^{t}, \ldots, \lambda_K^{t}] \in \mathbb{R}^{K}$, where $\lambda_k^{t}$ is the age parameter shared by all segments of the $k$-th video in the $t$-th epoch. The self-paced learning weight matrix for all segments of all videos in the $t$-th epoch is $V^{t} = [\mathbf{v}_1^{t}, \ldots, \mathbf{v}_K^{t}]^{T} \in \mathbb{R}^{K \times N}$, where $\mathbf{v}_k^{t} = [v_{k,1}^{t}, v_{k,2}^{t}, \ldots, v_{k,N}^{t}]$ and $v_{k,i}^{t}$ is the self-paced learning weight of the $i$-th segment of the $k$-th video in the $t$-th epoch; it is a hard weight taking the binary value 0 or 1, indicating whether the video segment is selected. Finally, the segments in the selected subset, together with their pseudo-labels, are used for supervised training, refining the classifier $f^{t-1}$ to obtain $f^{t}$. More details on each step are given below, and the parameter updates at each step are illustrated in Figure 3.
1.
Segment-level score prediction. For each video $X_k$ in the $t$-th training epoch, the classifier $f^{t-1}$ takes the video segments $\{x_{k,i}\}_{i=1}^{N}$ as input and outputs the segment-level anomaly scores $\{s_{k,i}^{t}\}_{i=1}^{N}$ as follows:
$$s_{k,i}^{t} = f\big(g(x_{k,i}); \omega^{t-1}\big), \quad x_{k,i} \in X_k,\ k \in \{1, 2, \ldots, K\},$$
where $\omega^{t-1}$ denotes the network parameters of the classifier $f^{t-1}$.
2.
Segment-level pseudo-label generation. For each segment-level anomaly score $s_{k,i}^{t}$, the segment-level pseudo-label $y_{k,i}^{t}$ is generated using a pre-defined binarization threshold $\eta$:
$$y_{k,i}^{t} = \begin{cases} \mathbb{1}\big(s_{k,i}^{t} > \eta\big), & k \in I^{+} \\ 0, & k \in I^{-} \end{cases}$$
where $\mathbb{1}(\cdot)$ is the indicator function. Equation (7) shows that, in the $t$-th training epoch, the segment-level pseudo-label of a segment in an abnormal video is set to 1 if its anomaly score exceeds the threshold $\eta$ and to 0 otherwise. Note that for segments in normal videos, the segment-level pseudo-label is always set to 0.
3.
Adaptive segment subset selection. The purpose of segment selection is to adaptively identify, in every training epoch, the subset of segments in each video that are most easily classified. Segment subset selection and classifier updating are performed iteratively. In the $t$-th training epoch, this self-paced learning-based segment selection and classifier-updating strategy can be formulated as
$$\min_{\omega^{t}, V^{t}} E\big(\omega^{t}, V^{t}; \boldsymbol{\lambda}^{t}\big) = \sum_{k=1}^{K} \sum_{i \in I_k^{t}} v_{k,i}^{t}\, l_{k,i}^{t} + h\big(V^{t}; \boldsymbol{\lambda}^{t}\big),$$
where $\omega^{t}$ denotes the model parameters of $f^{t}$, $V^{t}$ denotes the self-paced learning weight matrix, and $v_{k,i}^{t}$ is the self-paced learning weight of the $i$-th segment of the $k$-th video. $\boldsymbol{\lambda}^{t}$ is the age parameter vector, and $\lambda_k^{t}$ is the age parameter shared by all segments of the $k$-th video, which controls the difficulty of the video. $I_k^{t}$ is the index set of the easiest segments selected from the $k$-th video. For each video segment and its generated pseudo-label in the $t$-th training epoch, $l_{k,i}^{t}$ denotes the segment-level loss, computed as
$$l_{k,i}^{t} = -y_{k,i}^{t} \log\big(f(g(x_{k,i}); \omega^{t-1})\big) - \big(1 - y_{k,i}^{t}\big) \log\big(1 - f(g(x_{k,i}); \omega^{t-1})\big).$$
$h(V^{t}; \boldsymbol{\lambda}^{t})$ is the self-paced learning regularizer; following [25], it can be defined as $h(V^{t}; \boldsymbol{\lambda}^{t}) = -\sum_{k=1}^{K} \sum_{i=1}^{N} \lambda_k^{t} v_{k,i}^{t}$. To solve the optimization problem (8), we first need to estimate the age parameter $\boldsymbol{\lambda}^{t}$ from the loss values of the video segments in the $t$-th training epoch. In particular, for all segments $\{x_{k,i}\}_{i=1}^{N}$ of video $X_k$ and their corresponding pseudo-labels $\{y_{k,i}^{t}\}_{i=1}^{N}$, $\lambda_k^{t}$ is estimated as follows. We first compute the corresponding loss values $\{l_{k,i}^{t}\}_{i=1}^{N}$, under the assumption that the easier the segment, the lower its loss value. We then sort the segment loss values in ascending order and let $\mathbf{L} = [l_{k,(1)}^{t}, l_{k,(2)}^{t}, \ldots, l_{k,(N)}^{t}]$ denote the vector of sorted loss values. Then, $\lambda_k^{t}$ is defined as the $R$-th element of $\mathbf{L}$, that is,
$$\lambda_k^{t} = \mathbf{L}(R),$$
where $R$ denotes the number of segments selected from video $X_k$ and is defined as
$$R = R_0 + \left\lfloor \frac{t}{T} \times \big(R_T - R_0\big) \right\rfloor,$$
where $\lfloor \cdot \rfloor$ denotes the floor function, $R_0$ and $R_T$ are the numbers of segments selected for the subset indexed by $I_k^{t}$ in the initial and final training epochs, respectively, and $T$ is the maximum number of training epochs. Note that the age parameter describes the difficulty tolerated for each segment; by the definition of $\lambda_k^{t}$, it increases gradually with the training epoch $t$, so that progressively more difficult segments are added to the training set for classifier updating. This is the key to the easy-to-hard sample selection strategy of self-paced learning. Once all $\lambda_k^{t}$ are defined, the optimization problem (8) can be solved with the alternative convex search method: when $\omega^{t-1}$ is fixed, the optimal $V^{t}$ is given in closed form by Equation (12) [12], and when $V^{t}$ is fixed, existing off-the-shelf supervised learning methods can be used to determine the optimal $\omega^{t}$ (see the sketch after this list).
$$v_{k,i}^{t} = \mathbb{1}\big(l_{k,i}^{t} \leq \lambda_k^{t}\big), \quad x_{k,i} \in X_k,\ k \in \{1, 2, \ldots, K\},$$
where $\mathbb{1}(\cdot)$ is the indicator function. Equation (12) shows that when a segment's loss is no greater than the given age parameter $\lambda_k^{t}$, it is considered an easy segment and is selected (i.e., $v_{k,i}^{t} = 1$); otherwise, it is not selected (i.e., $v_{k,i}^{t} = 0$). As $\boldsymbol{\lambda}^{t}$ gradually increases over time, more segments are selected into the subset indexed by $I_k^{t}$. This allows the classifier to progressively access more segments, moving from easier segments to more difficult ones, thereby facilitating unbiased prediction.
4.
Classifier updating. Classifier updating and segment selection are performed alternately. In this stage, $V^{t}$ is fixed while the classifier parameters $\omega^{t}$ are optimized using the selected segment subset indexed by $I_k^{t}$, and the segment-level classification objective is defined as
$$\mathcal{L}_{sce} = \sum_{k=1}^{K} \sum_{i \in I_k^{t}} l_{k,i}^{t},$$
where $l_{k,i}^{t}$ is given by Equation (9). Similarly, we employ the hinge, smoothness, and sparsity terms, and the total loss function of the proposed SP-MIL framework is defined as
$$\mathcal{L}_{SP\text{-}MIL} = \alpha \mathcal{L}_{hinge} + \beta \mathcal{L}_{sparse} + \gamma \mathcal{L}_{smooth} + \mathcal{L}_{sce},$$
where $\mathcal{L}_{hinge}$, $\mathcal{L}_{sparse}$, and $\mathcal{L}_{smooth}$ correspond to Equations (2)-(4), respectively. Together with the four-step iterative process described above, this constitutes a single training epoch of the proposed SP-MIL framework. Repeating this process over $T$ epochs yields the final enhanced classifier $f^{T}$ with the optimal parameters $\omega^{T}$.
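The sketch below isolates the closed-form selection step of Equations (10)-(12) for a single video; it assumes the per-segment losses are already available as a 1-D tensor and follows the epoch schedule of Equation (11).

```python
import torch

def spl_weights(losses: torch.Tensor, t: int, T: int, R0: int = 3, RT: int = 32):
    """Estimate the age parameter lambda_k^t from the sorted segment losses of one
    video (Eq. (10)) with the growing budget R of Eq. (11), and return the binary
    self-paced weights of Eq. (12)."""
    N = losses.numel()
    R = max(1, min(N, R0 + (t * (RT - R0)) // T))     # Eq. (11): easy-to-hard budget
    lam = torch.sort(losses).values[R - 1]            # Eq. (10): R-th smallest loss
    v = (losses <= lam).float()                       # Eq. (12): select segments with loss <= lambda
    return v, lam

# Example: early epochs keep only the lowest-loss (easiest) segments of the video.
losses = torch.rand(32)
v_early, _ = spl_weights(losses, t=5, T=100)    # few segments selected
v_late, _ = spl_weights(losses, t=95, T=100)    # most segments selected
```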

3.2.3. Testing Phase

Given a test video, the pre-trained features of its segments are first extracted by the feature extractor $g$, and the anomaly score of each segment is then predicted directly by the classifier enhanced with the SP-MIL framework. Drawing inspiration from ensemble learning, classifiers enhanced with different pre-trained feature extractors can complement each other. We therefore propose a straightforward yet effective late-fusion strategy: when I3D and CLIP are both used as pre-trained feature extractors, the two enhanced classifiers are fused at the score level, further boosting performance. The fused segment-level anomaly score is calculated as
$$s_{k,i} = \begin{cases} s_1, & \text{if } |s_1 - c| \geq |s_2 - c| \\ s_2, & \text{otherwise} \end{cases}$$
where $s_1$ and $s_2$ denote the anomaly scores predicted by the classifiers enhanced with I3D and CLIP features, respectively, and $c$ is the threshold used to compare the anomaly scores from the two enhanced classifiers.
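For illustration, the following is a minimal sketch of the score-level late fusion in Equation (15), assuming the rule keeps, for each segment, the score lying farther from the threshold c (i.e., the more confident of the two classifiers); the variable names are placeholders.

```python
import torch

def late_fusion(s_i3d: torch.Tensor, s_clip: torch.Tensor, c: float = 0.4) -> torch.Tensor:
    """Score-level late fusion (Eq. (15)): for every segment, keep the anomaly score
    that is farther from the threshold c, taken here as the more confident prediction."""
    keep_i3d = (s_i3d - c).abs() >= (s_clip - c).abs()
    return torch.where(keep_i3d, s_i3d, s_clip)

# Example: fuse per-segment scores from the I3D- and CLIP-based enhanced classifiers.
fused = late_fusion(torch.tensor([0.90, 0.20, 0.55]), torch.tensor([0.60, 0.10, 0.80]))
```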

4. Experiments

4.1. Datasets and Evaluation Metrics

We conducted the WS-VAD experiments on three widely used datasets: ShanghaiTech [26], UCF-Crime [7], and XD-Violence [27]. (1) ShanghaiTech: A medium-scale dataset derived from campus street video surveillance, encompassing 13 distinct background scenes, comprising 437 videos—330 normal and 107 abnormal. Following the standard splits established in [1] for WS-VAD, we partitioned this dataset into two subsets: a training set with 63 abnormal and 175 normal videos, and a testing set with the remaining 44 abnormal and 155 normal videos. (2) UCF-Crime: A large-scale dataset comprising 1900 videos, evenly split between 950 abnormal and 950 normal videos. The dataset encompasses 13 realistic anomalies, including abuse, arrest, arson, assault, accident, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. (3) XD-Violence: A dataset comprising 217 hours of audio-visual content collected from both movies and real-world scenes, containing 4754 untrimmed videos, with 3954 videos for training and 800 videos for testing.
For ShanghaiTech and UCF-Crime, the area under the frame-level receiver operating characteristic curve (AUC) is used as the evaluation metric. For XD-Violence, the area under the precision-recall curve, referred to as frame-level average precision (AP), is used as the evaluation metric.
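Both metrics can be computed directly from frame-level labels and scores, for example with scikit-learn; the snippet below is a minimal sketch with toy data, not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy frame-level ground truth (0 = normal, 1 = abnormal) and predicted anomaly scores.
gt = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9])

auc = roc_auc_score(gt, scores)            # frame-level AUC (ShanghaiTech, UCF-Crime)
ap = average_precision_score(gt, scores)   # frame-level AP (XD-Violence)
```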

4.2. Baseline Models and Implementation Details

4.2.1. Baseline Models

Since the SP-MIL framework does not alter the baseline model structure, to verify the effectiveness of our proposed general SP-MIL framework, we employed three of the most widely used WS-VAD methods as our baseline models: DeepMIL [7], RTFM [5], and UR-DMU [24].
  • DeepMIL model: This model was introduced by Sultani et al. [7] to solve the WS-VAD problem. In this approach, the pre-trained features of video segments are first extracted using a pre-trained feature extractor. These pre-trained features are then directly fed into the classifier to generate anomaly scores, and the segment with the highest anomaly score is selected to train the classifier using the video-level label. The objective function is defined as follows:
    $$\mathcal{L}_{DeepMIL} = \alpha \mathcal{L}_{hinge} + \beta \mathcal{L}_{sparse} + \gamma \mathcal{L}_{smooth},$$
    where $\mathcal{L}_{hinge}$, $\mathcal{L}_{sparse}$, and $\mathcal{L}_{smooth}$ correspond to Formulas (2)-(4), respectively.
  • RTFM model: This model, introduced by Tian et al. [5], first extracts pre-trained features of video segments through a pre-trained feature extractor. These pre-trained features are then processed by a multi-scale temporal network to derive temporal context features, which are subsequently fed into the classifier to obtain anomaly scores; the top-k segments with the highest temporal context feature magnitudes are selected to train the classifier with video-level labels. The objective function is defined as follows:
    $$\mathcal{L}_{RTFM} = \alpha \mathcal{L}_{hinge} + \beta \mathcal{L}_{sparse} + \gamma \mathcal{L}_{smooth} + \mathcal{L}_{ce},$$
    where $\mathcal{L}_{hinge}$, $\mathcal{L}_{sparse}$, $\mathcal{L}_{smooth}$, and $\mathcal{L}_{ce}$ correspond to Formulas (2)-(5), respectively. Notably, the hinge loss in the RTFM model differs slightly from Formula (2), as it is applied to the magnitudes of the temporal context features rather than to the anomaly scores.
  • UR-DMU model: This model, introduced by Zhou et al. [24], first extracts pre-trained features of video segments through a pre-trained feature extractor. These pre-trained features are subsequently processed by a global and local multi-head self-attention module within a transformer network to obtain more expressive embeddings, which are then fed into dual memory units (DMU) to learn more discriminative features. The embeddings and features output by the DMU are then incorporated into the classifier training process. For further details, please refer to [24].

4.2.2. Implementation Details

Following [7,24], each video is divided into 32 or 200 segments, i.e., N = 32 for DeepMIL and RTFM and N = 200 for UR-DMU. The hyperparameters in the loss function are set to $\alpha = 1\times10^{-4}$, $\beta = 8\times10^{-3}$, and $\gamma = 8\times10^{-4}$. The maximum number of epochs T is set to 100, with $R_0 = 3$ and $R_T = 32$ for DeepMIL and RTFM ($R_T = 200$ for UR-DMU). The threshold value $\eta$ is set to 0.5, and the late-fusion threshold c is set to 0.4. The SP-MIL framework is trained on a single NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using PyTorch (version 2.1). The Adam optimizer is employed with a weight decay of 0.0005 and a learning rate of 0.001. For DeepMIL, the fully connected layers in our model have 512, 32, and 1 nodes, each followed by a ReLU activation and a dropout layer with a dropout rate of 0.7. For RTFM and UR-DMU, the numbers of nodes are 512, 128, and 1, respectively. For a fair comparison, we also report the results of DeepMIL, RTFM, and UR-DMU reproduced within our framework using different pre-trained features.
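As a concrete reading of these settings, the sketch below builds the DeepMIL-style classifier head (512, 32, and 1 nodes with ReLU and dropout 0.7) and the Adam optimizer described above; the input feature dimension and the final sigmoid are assumptions that depend on the chosen feature extractor.

```python
import torch
import torch.nn as nn

FEAT_DIM = 2048  # assumed I3D feature dimension; CLIP features would differ

# DeepMIL-style head (512 -> 32 -> 1); RTFM and UR-DMU use 512 -> 128 -> 1 instead.
classifier = nn.Sequential(
    nn.Linear(FEAT_DIM, 512), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(512, 32), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(32, 1), nn.Sigmoid(),  # sigmoid assumed so segment scores lie in [0, 1]
)

optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001, weight_decay=0.0005)
```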

4.3. The Effectiveness of the SP-MIL-Based Classifier Enhancement Strategy

In this section, we verify the general effectiveness of the SP-MIL-based classifier enhancement strategy using various baseline models (DeepMIL [7], RTFM [5], UR-DMU [24]) and its generality across different feature extractors (I3D, CLIP).
Table 1 and Table 2 present the comparative results between the baseline models and the enhanced classifiers on the ShanghaiTech, UCF-Crime, and XD-Violence datasets, with DeepMIL, RTFM, and UR-DMU as the baseline models and I3D and CLIP as the feature extractors, respectively. From Table 1, it can be observed that when I3D is used as the feature extractor, the performance of all three baseline models is significantly improved by further training with the SP-MIL framework. Specifically, DeepMIL+SP-MIL achieved performance gains of 4.36%, 4.33%, and 1.20% over DeepMIL on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively. Similarly, RTFM+SP-MIL achieved improvements of 1.70%, 0.35%, and 2.74% over RTFM, and UR-DMU+SP-MIL improvements of 1.51%, 0.22%, and 0.72% over UR-DMU on the same datasets. Table 2 exhibits a comparable trend with the CLIP video features. Table 1 and Table 2 demonstrate that the SP-MIL framework enables the classifier to initially focus on easily identifiable video segments and progressively tackle more challenging examples near the classification boundary. Consequently, the classifier gradually accumulates crucial prior knowledge about the challenging segments, thereby enhancing its discriminative capability.
To further confirm this observation, we use t-SNE to visualize the features extracted from the output of the second layer of the classifier in both the baseline models and the baselines+SP-MIL on the UCF-Crime dataset. The results are shown in Figure 4. It illustrates that classifier features trained by the three baseline models remain intermixed. However, after additional training with the SP-MIL framework, normal and abnormal video features become distinctly separated, demonstrating the effectiveness of the SP-MIL framework.

4.4. The Effectiveness of Late-Fusion Strategy for SP-MIL Framework

Drawing inspiration from ensemble learning, we propose a straightforward yet effective score-level late-fusion strategy. We first use I3D and CLIP as feature extractors, train the respective classifiers enhanced by the SP-MIL framework, and test the video segments with these classifiers to obtain the corresponding anomaly scores. The anomaly scores of the video segments are subsequently fused using Formula (15) to produce the final anomaly scores. The comparison results are presented in Table 3. Notably, we integrate these two classifiers only during the testing phase, as the entire pipeline has already been trained. From Table 3, we find that the fusion results of the SP-MIL framework show significant improvements over both baseline models, which indicates that the widely used I3D and CLIP features carry substantial complementary information for the WS-VAD task.

4.5. Effect of Parameters η and R

For the pseudo-label threshold $\eta$, we employ the RTFM model as the baseline model and CLIP as the feature extractor to conduct ablation experiments on the UCF-Crime dataset. The experimental results are presented in Table 4. We find that the model achieves the best performance when $\eta$ is set to 0.5. This may be because frame-level anomaly detection is a binary classification task, for which 0.5 is the natural decision boundary between the two classes.
The sample selection hyperparameter R represents the number of segments selected for training from each video in the current epoch t. We use the RTFM model and the CLIP feature extractor to perform ablation experiments on the UCF-Crime dataset; the results are presented in Table 5. Based on these results, we conclude that 10 is the most suitable value of R for the UCF-Crime dataset. We also conducted similar experiments on the ShanghaiTech and XD-Violence datasets and found that 8 and 15 are the most suitable values of R, respectively. These results demonstrate that our SP-MIL framework can adaptively select the optimal number of training samples for different datasets. Notably, when UR-DMU is used as the baseline model, the most suitable values of R for the ShanghaiTech, UCF-Crime, and XD-Violence datasets are 15, 21, and 33, respectively.

4.6. Comparison with the Prior SOTA

In this section, we compare the performance of our proposed methods (DeepMIL+SP-MIL, RTFM+SP-MIL, and UR-DMU+SP-MIL) with other state-of-the-art methods.
Results on UCF-Crime. Table 6 shows that UR-DMU+SP-MIL achieved competitive results. For comparison, He et al. [28] focused on addressing the inter-video data imbalance problem by proposing an adversarial and focus training method. However, their performance was 2.69% lower than that of UR-DMU+SP-MIL. Similarly, Zhang et al. [11] introduced a method that focuses on pseudo-label completeness and uncertainty, employing Monte Carlo dropout to select confident video clips and enhance classifier performance. However, their results were slightly lower than those of the RTFM+SP-MIL method. In addition, our UR-DMU+SP-MIL method is comparable to that of Yang et al. [29], who incorporated additional textual information to enhance the performance of their model. Compared to MIST [9], which utilized a two-stage training strategy to improve the accuracy for pseudo-labels, the UR-DMU+SP-MIL method not only outperformed it by 5.46% but also demonstrated its versatility by improving the capabilities of classifiers with various baseline models.
Results on ShanghaiTech. Table 7 presents the AUC results of different WS-VAD methods on the ShanghaiTech dataset. It can be observed that RTFM+SP-MIL achieves an AUC of 98.10%. This may be attributed to the fact that our framework enhances the classifier's discriminability through the self-paced learning strategy. Compared to the noisy-label cleaning strategy in GCN [1], our model demonstrates an AUC improvement of 13.66%, highlighting the benefit of refining segment-level pseudo-labels via the SP-MIL framework. Additionally, Ye et al. [30] and Wu et al. [16] improved the temporal modeling of pre-trained features; however, their performance remains slightly lower than that of our RTFM+SP-MIL method.
Table 6. Comparisons of the frame-level AUC scores with other methods on the UCF-Crime dataset.

Method | Source | Feature | AUC (%)
DeepMIL [7] | CVPR 2018 | C3D RGB | 75.41
DeepMIL * [7] | CVPR 2018 | I3D RGB | 78.32
DeepMIL * [7] | CVPR 2018 | CLIP | 80.91
GCN [1] | CVPR 2019 | TSN RGB | 82.12
CLAWS [31] | ECCV 2020 | C3D RGB | 83.03
MIST [9] | CVPR 2021 | I3D RGB | 82.30
RTFM [5] | ICCV 2021 | I3D RGB | 84.30
RTFM * [5] | ICCV 2021 | I3D RGB | 83.50
RTFM * [5] | ICCV 2021 | CLIP | 84.45
Chang et al. [14] | TMM 2021 | I3D RGB | 84.62
Wu et al. [16] | TIP 2021 | I3D RGB | 84.89
BN-SVP [32] | CVPR 2022 | I3D RGB | 83.39
TCA-VAD [33] | ICME 2022 | I3D RGB | 83.75
MSL [6] | AAAI 2022 | I3D RGB | 85.30
Thakare et al. [34] | PR 2023 | I3D RGB | 83.56
Liu et al. [13] | TNNLS 2023 | I3D RGB | 85.42
Sun et al. [35] | ICME 2023 | I3D RGB | 85.88
Cho et al. [36] | CVPR 2023 | I3D RGB | 86.10
Zhang et al. [11] | CVPR 2023 | I3D RGB | 86.22
He et al. [28] | PR 2024 | I3D RGB | 85.07
AlMarri et al. [37] | WACV 2024 | I3D RGB+FLOW | 85.47
Yang et al. [29] | CVPR 2024 | CLIP (RGB+Text) | 87.79
DeepMIL+SP-MIL | - | I3D+CLIP | 84.81
RTFM+SP-MIL | - | I3D+CLIP | 86.35
UR-DMU+SP-MIL | - | I3D+CLIP | 87.76
* means that we re-implemented this method.
Table 7. Comparisons of the frame-level AUC scores with other methods on the ShanghaiTech dataset.

Method | Source | Feature | AUC (%)
DeepMIL * [7] | CVPR 2018 | I3D RGB | 92.71
DeepMIL * [7] | CVPR 2018 | CLIP | 94.96
GCN [1] | CVPR 2019 | TSN RGB | 84.44
CLAWS [31] | ECCV 2020 | C3D RGB | 89.67
Chang et al. [14] | TMM 2021 | I3D RGB | 92.25
MIST [9] | CVPR 2021 | I3D RGB | 94.83
RTFM [5] | ICCV 2021 | I3D RGB | 97.21
RTFM * [5] | ICCV 2021 | I3D RGB | 97.39
RTFM * [5] | ICCV 2021 | CLIP | 97.70
Wu et al. [16] | TIP 2021 | I3D RGB | 97.48
BN-SVP [32] | CVPR 2022 | C3D RGB | 96.00
S3R [38] | ECCV 2022 | I3D RGB | 97.48
MSL [6] | AAAI 2022 | I3D RGB | 96.08
Liu et al. [13] | TNNLS 2023 | I3D RGB | 97.54
Liu et al. [13] | TNNLS 2023 | I3D RGB+FLOW | 97.76
Cho et al. [36] | CVPR 2023 | I3D RGB | 97.60
Sun et al. [35] | ICME 2023 | I3D RGB | 97.92
Tan et al. [39] | WACV 2024 | I3D RGB+FLOW | 97.54
Ye et al. [30] | ICASSP 2024 | I3D RGB | 98.00
DeepMIL+SP-MIL | - | I3D+CLIP | 97.81
RTFM+SP-MIL | - | I3D+CLIP | 98.10
UR-DMU+SP-MIL | - | I3D+CLIP | 97.92
* means that we re-implemented this method.
Results on XD-Violence. Table 8 shows the performance of our framework on the XD-Violence dataset compared to state-of-the-art (SOTA) methods. As can be seen from the table, RTFM+SP-MIL achieved the best results. With the CLIP feature extractor, the AP of our RTFM+SP-MIL is 0.74% higher than that of Yang et al. [29]. The MGFN [40] network enhances the multi-scale temporal network of the RTFM model, but its result is 5.23% lower than that of our RTFM+SP-MIL. Similarly, the AFT method [28] focuses on recognizing abnormal events and also employs a late-fusion strategy, but our RTFM+SP-MIL results are 4.35% higher than those of He et al. [28]. These results demonstrate the effectiveness of our framework.

4.7. Qualitative Results

In this section, we present the qualitative results of our RTFM+SP-MIL and the three baseline models (DeepMIL, RTFM, UR-DMU) on the UCF-Crime, ShanghaiTech, and XD-Violence datasets. Figure 5, Figure 6 and Figure 7 show that our framework locates abnormal events more effectively than the baseline models. For the UCF-Crime dataset, Figure 5a–c demonstrates that the localization curve of our framework closely aligns with the ground truth. Even when the abnormal events occur intermittently, as in Figure 5a, our detection results are only slightly delayed or advanced. This highlights the robustness and accuracy of our framework in detecting abnormal events such as arson and robbery. Additionally, Figure 5d illustrates that our framework has a low false positive rate. Similar conclusions can be drawn from the localization curves on the ShanghaiTech and XD-Violence datasets. In addition, Figure 6b illustrates that the classifiers in RTFM and DeepMIL exhibit a bias in localizing anomalies, resulting in delayed anomaly detection, while the classifier in UR-DMU causes significant false alarms. In contrast, our method effectively identifies the location of anomalies, highlighting the effectiveness of our SP-MIL strategy in providing unbiased predictions. Although our framework exhibits a few false positives in Figure 6a and Figure 7c, overall, our detection results accurately and promptly reflect the real anomalies in the videos.

5. Conclusions

In this paper, we model the WS-VAD task as a segment-level supervised learning problem and propose a general self-paced multiple-instance learning framework. The proposed framework adaptively selects training segments in each training epoch, progressing from easy to hard video segments, and alternately updates the self-paced learning weights and the classifier parameters to enhance the baseline classifier toward unbiased prediction. Our SP-MIL framework can be applied to various baseline models (e.g., DeepMIL, RTFM, UR-DMU) and feature extractors (e.g., I3D, CLIP), demonstrating broad applicability and effectiveness. Furthermore, with RTFM as the baseline model, the proposed SP-MIL framework achieved state-of-the-art results on the ShanghaiTech and XD-Violence datasets, and with UR-DMU as the baseline model, it achieved competitive results on the UCF-Crime dataset.

6. Limitation and Future Work

This study has several limitations that open avenues for future research: (1) Text information integration. Although our current approach leverages pre-trained CLIP features, it lacks text information related to the video anomaly class that could enhance the model’s performance. Future work could investigate integrating text-based features to exploit the rich semantic context of video annotations, possibly resulting in more robust anomaly detection. (2) Sample selection mechanism. In our self-paced learning method, the number of samples involved in training increases with each epoch. While this approach allows for adaptive learning, current mechanisms for selecting segments could be refined. Exploring more sophisticated sample selection strategies, such as dynamic weighting or reinforcement learning-based methods, could optimize the learning process and improve the model’s ability to detect more subtle anomalies. (3) Baseline model diversity. Our study primarily utilizes three common baseline models, leaving many other potential model architectures unexplored. Future research should explore a broader range of model architectures to validate their effectiveness within the SP-MIL framework. This exploration could uncover new insights and further enhance the generalizability of our approach.

Author Contributions

Conceptualization, P.H. and H.L.; methodology, P.H.; software, P.H. and M.H.; validation, P.H., H.L. and M.H.; formal analysis, P.H.; investigation, P.H.; writing—original draft preparation, P.H.; writing—review and editing, H.L.; visualization, M.H.; supervision, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant 2018AAA0102201.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1237–1246. [Google Scholar]
  2. Kommanduri, R.; Ghorai, M. DAST-Net: Dense visual attention augmented spatio-temporal network for unsupervised video anomaly detection. Neurocomputing 2024, 579, 127444. [Google Scholar] [CrossRef]
  3. Cho, M.; Kim, T.; Kim, W.J.; Cho, S.; Lee, S. Unsupervised video anomaly detection via normalizing flows with implicit latent features. Pattern Recognit. 2022, 129, 108703. [Google Scholar] [CrossRef]
  4. Pu, Y.; Wu, X. Locality-Aware Attention Network with Discriminative Dynamics Learning for Weakly Supervised Anomaly Detection. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  5. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4975–4986. [Google Scholar]
  6. Li, S.; Liu, F.; Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1395–1403. [Google Scholar]
  7. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]
  8. Park, S.; Kim, H.; Kim, M.; Kim, D.; Sohn, K. Normality Guided Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2665–2674. [Google Scholar]
  9. Feng, J.C.; Hong, F.T.; Zheng, W.S. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14009–14018. [Google Scholar]
  10. Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased multiple instance learning for weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8022–8031. [Google Scholar]
  11. Zhang, C.; Li, G.; Qi, Y.; Wang, S.; Qing, L.; Huang, Q.; Yang, M.H. Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16271–16280. [Google Scholar]
  12. Kumar, M.; Packer, B.; Koller, D. Self-Paced Learning for Latent Variable Models. Adv. Neural Inf. Process. Syst. 2010, 23, 1189–1197. [Google Scholar]
  13. Liu, T.; Lam, K.M.; Kong, J. Distilling privileged knowledge for anomalous event detection from weakly labeled videos. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 12627–12641. [Google Scholar] [CrossRef] [PubMed]
  14. Chang, S.; Li, Y.; Shen, S.; Feng, J.; Zhou, Z. Contrastive attention for video anomaly detection. IEEE Trans. Multimed. 2021, 24, 4067–4076. [Google Scholar] [CrossRef]
  15. Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar]
  16. Wu, P.; Liu, J. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Trans. Image Process. 2021, 30, 3513–3527. [Google Scholar] [CrossRef] [PubMed]
  17. Zhou, S.; Wang, J.; Meng, D.; Xin, X.; Li, Y.; Gong, Y.; Zheng, N. Deep self-paced learning for person re-identification. Pattern Recognit. 2018, 76, 739–751. [Google Scholar] [CrossRef]
  18. Sangineto, E.; Nabi, M.; Culibrk, D.; Sebe, N. Self paced deep learning for weakly supervised object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 712–725. [Google Scholar] [CrossRef] [PubMed]
  19. Zhang, D.; Yang, L.; Meng, D.; Xu, D.; Han, J. SPFTN: A Self-Paced Fine-Tuning Network for Segmenting Objects in Weakly Labelled Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5340–5348. [Google Scholar]
  20. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  21. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  22. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  23. Joo, H.K.; Vo, K.; Yamazaki, K.; Le, N. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3230–3234. [Google Scholar]
  24. Zhou, H.; Yu, J.; Yang, W. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3769–3777. [Google Scholar]
  25. Wang, K.; Wang, Y.; Zhao, Q.; Meng, D.; Liao, X.; Xu, Z. SPLBoost: An Improved Robust Boosting Algorithm Based on Self-Paced Learning. IEEE Trans. Cybern. 2021, 51, 1556–1570. [Google Scholar] [CrossRef] [PubMed]
  26. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6536–6545. [Google Scholar]
  27. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 322–339. [Google Scholar]
  28. He, P.; Zhang, F.; Li, G.; Li, H. Adversarial and focused training of abnormal videos for weakly-supervised anomaly detection. Pattern Recognit. 2024, 147, 110119. [Google Scholar] [CrossRef]
  29. Yang, Z.; Liu, J.; Wu, P. Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 18899–18908. [Google Scholar]
  30. Ye, H.; Xu, K.; Jiang, X.; Sun, T. Learning Spatio-Temporal Relations with Multi-Scale Integrated Perception for Video Anomaly Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 4020–4024. [Google Scholar]
  31. Zaheer, M.Z.; Mahmood, A.; Astrid, M.; Lee, S.I. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 358–376. [Google Scholar]
  32. Sapkota, H.; Yu, Q. Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
33. Yu, S.; Wang, C.; Xiang, L.; Wu, J. TCA-VAD: Temporal Context Alignment Network for Weakly Supervised Video Anomaly Detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  34. Thakare, K.V.; Dogra, D.P.; Choi, H.; Kim, H.; Kim, I.J. Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognit. 2023, 140, 109567. [Google Scholar] [CrossRef]
  35. Sun, S.; Gong, X. Long-short temporal co-teaching for weakly supervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2711–2716. [Google Scholar]
  36. Cho, M.; Kim, M.; Hwang, S.; Park, C.; Lee, K.; Lee, S. Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12137–12146. [Google Scholar]
  37. AlMarri, S.; Zaheer, M.Z.; Nandakumar, K. A Multi-Head Approach with Shuffled Segments for Weakly-Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 132–142. [Google Scholar]
38. Wu, J.C.; Hsieh, H.Y.; Chen, D.J.; Fuh, C.S.; Liu, T.L. Self-supervised Sparse Representation for Video Anomaly Detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  39. Tan, W.; Yao, Q.; Liu, J. Overlooked video classification in weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 202–210. [Google Scholar]
  40. Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. Mgfn: Magnitude-contrastive glance-and-focus network for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 387–395. [Google Scholar]
Figure 1. Schematic diagram of the baseline model and the proposed self-paced multiple-instance learning (SP-MIL) framework.
Figure 2. Overview of the proposed general SP-MIL framework.
Figure 3. An example to illustrate the parameter updates at each step of the loop.
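For a concrete picture of the loop illustrated in Figure 3, the following minimal Python sketch applies the standard hard-weight self-paced rule of Kumar et al. [12] to a toy set of segment losses. The quantile-based choice of the age parameter and the `keep_ratio` argument are illustrative assumptions, not the exact schedule used in the paper.

```python
import numpy as np

def select_segments(losses, age_lambda):
    """Standard hard-weight self-paced rule: a segment is kept (weight 1)
    only if its current loss is below the age parameter."""
    return (losses < age_lambda).astype(np.float32)  # 0/1 weights

def self_paced_epoch(losses, keep_ratio):
    """One illustrative loop step: estimate the age parameter from the
    loss distribution, then return the 0/1 selection weights."""
    # Hypothetical choice: set the age parameter so that roughly
    # `keep_ratio` of the segments (the easiest ones) are selected.
    age_lambda = np.quantile(losses, keep_ratio)
    return select_segments(losses, age_lambda)

# Toy usage: 8 segment losses from the current classifier.
losses = np.array([0.1, 0.9, 0.2, 1.5, 0.05, 0.7, 0.3, 2.0])
w = self_paced_epoch(losses, keep_ratio=0.5)
print(w)  # easy segments (low loss) get weight 1, hard ones 0
```

In practice, the selected segments and their pseudo-labels would then be used to update the classifier before the next epoch enlarges the selection.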
Figure 4. t-SNE feature distributions of the baseline models and of their counterparts enhanced by the proposed SP-MIL framework. The top sections of (1), (2), and (3) show the feature distributions of the DeepMIL, RTFM, and UR-DMU models, respectively; the bottom sections show those of the corresponding enhanced models DeepMIL+SP-MIL, RTFM+SP-MIL, and UR-DMU+SP-MIL.
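Plots such as those in Figure 4 can be reproduced with off-the-shelf tools. The sketch below, assuming randomly generated placeholder features and segment labels, projects segment features to 2-D with scikit-learn's t-SNE and colors points by their normal/abnormal label.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical inputs: N x D segment features from a (baseline or
# SP-MIL-enhanced) classifier, plus 0/1 segment labels.
features = np.random.randn(500, 1024)
labels = np.random.randint(0, 2, size=500)

# Project the features to 2-D for visualization.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

plt.scatter(embedded[labels == 0, 0], embedded[labels == 0, 1],
            s=5, label="normal")
plt.scatter(embedded[labels == 1, 0], embedded[labels == 1, 1],
            s=5, label="abnormal")
plt.legend()
plt.savefig("tsne_features.png", dpi=200)
```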
Figure 5. Qualitative anomaly detection results on the UCF-Crime dataset. The red boxes mark the abnormal events in each abnormal video.
Figure 6. Qualitative anomaly detection results on the ShanghaiTech dataset. The red boxes mark the abnormal events in each abnormal video.
Figure 7. Qualitative anomaly detection results on the XD-Violence dataset. The red boxes mark the abnormal events in each abnormal video.
Table 1. The effectiveness of the SP-MIL framework for different baseline models (I3D features are used).

Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%)
DeepMIL * | I3D | 78.32 | 92.71 | 73.24
DeepMIL+SP-MIL | I3D | 82.68 | 97.04 | 74.44
RTFM * | I3D | 83.50 | 97.39 | 78.07
RTFM+SP-MIL | I3D | 85.20 | 97.74 | 80.81
UR-DMU * | I3D | 85.02 | 97.54 | 80.05
UR-DMU+SP-MIL | I3D | 86.53 | 97.76 | 80.77
* means that we re-implemented this method.
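The AUC and AP values in Tables 1–3 and Table 8 are frame-level metrics. Assuming the standard evaluation protocol used in the WS-VAD literature, they can be computed from per-frame anomaly scores as in the short sketch below (function and variable names are illustrative).

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(frame_labels, frame_scores):
    """Frame-level AUC (UCF-Crime, ShanghaiTech) and AP (XD-Violence)
    computed from per-frame anomaly scores, reported as percentages."""
    auc = roc_auc_score(frame_labels, frame_scores) * 100.0
    ap = average_precision_score(frame_labels, frame_scores) * 100.0
    return auc, ap

# Toy usage with hypothetical ground-truth labels and predicted scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.80, 0.60, 0.20, 0.90]
print(evaluate(labels, scores))
```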
Table 2. The effectiveness of the SP-MIL framework for different baseline models (CLIP features are used).

Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%)
DeepMIL * | CLIP | 80.91 | 94.96 | 74.03
DeepMIL+SP-MIL | CLIP | 84.14 | 97.54 | 75.28
RTFM * | CLIP | 84.45 | 97.70 | 79.11
RTFM+SP-MIL | CLIP | 85.50 | 97.94 | 81.52
UR-DMU * | CLIP | 86.76 | 97.56 | 80.64
UR-DMU+SP-MIL | CLIP | 87.13 | 97.90 | 81.24
* means that we re-implemented this method.
Table 3. The effectiveness of the late-fusion strategy for the SP-MIL framework.

Method | Feature Extractor | UCF-Crime AUC (%) | ShanghaiTech AUC (%) | XD-Violence AP (%)
DeepMIL * | I3D+CLIP | 80.96 | 94.97 | 75.07
DeepMIL+SP-MIL | I3D+CLIP | 84.81 | 97.81 | 76.63
RTFM * | I3D+CLIP | 84.65 | 97.76 | 80.80
RTFM+SP-MIL | I3D+CLIP | 86.35 | 98.10 | 84.42
UR-DMU * | I3D+CLIP | 86.89 | 97.78 | 82.97
UR-DMU+SP-MIL | I3D+CLIP | 87.76 | 97.92 | 84.32
* means that we re-implemented this method.
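Table 3 reports results with an I3D+CLIP late-fusion strategy. As a rough illustration only (the paper's exact fusion rule is not reproduced here), the sketch below assumes a simple score-level fusion that averages the per-segment anomaly scores produced by the I3D-based and CLIP-based models.

```python
import numpy as np

def late_fuse(scores_i3d, scores_clip, weight=0.5):
    """Hypothetical score-level late fusion: weighted average of the
    per-segment anomaly scores from two single-feature models."""
    return weight * np.asarray(scores_i3d) + (1.0 - weight) * np.asarray(scores_clip)

# Toy usage on five segment scores from the two single-feature models.
fused = late_fuse([0.2, 0.7, 0.9, 0.1, 0.4], [0.3, 0.6, 0.8, 0.2, 0.5])
print(fused)
```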
Table 4. Ablation study on parameter η on the UCF-Crime dataset.

η | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
AUC (%) | 83.94 | 84.60 | 84.68 | 85.40 | 85.50 | 85.49 | 85.41 | 84.80 | 83.58
Table 5. Ablation study on parameter R on the UCF-Crime dataset.

R | 3 | 5 | 8 | 10 | 15 | 20 | 25 | 30 | 32
AUC (%) | 84.43 | 84.68 | 85.07 | 85.50 | 84.23 | 83.67 | 82.99 | 81.24 | 80.75
Table 8. Performance comparison of state-of-the-art methods on the XD-Violence dataset.

Method | Source | Feature | AP (%)
DeepMIL * [7] | CVPR 2018 | I3D RGB | 73.24
DeepMIL * [7] | CVPR 2018 | CLIP | 74.03
Wu et al. [27] | ECCV 2020 | C3D RGB | 67.19
MSL [6] | AAAI 2022 | C3D RGB | 75.53
Wu et al. [16] | TIP 2021 | I3D RGB | 75.90
RTFM [5] | ICCV 2021 | I3D RGB | 77.81
RTFM * [5] | ICCV 2021 | I3D RGB | 78.07
RTFM * [5] | ICCV 2021 | CLIP | 79.11
MSL [6] | AAAI 2022 | I3D RGB | 78.28
Sun et al. [35] | ICME 2023 | I3D RGB | 77.92
MGFN [40] | AAAI 2023 | I3D RGB | 79.19
Thakare et al. [34] | PR 2023 | I3D RGB | 79.89
DMU [24] | AAAI 2023 | I3D RGB | 81.66
Liu et al. [13] | TNNLS 2023 | I3D RGB | 79.00
Zhang et al. [11] | CVPR 2023 | I3D RGB | 78.74
Zhang et al. [11] | CVPR 2023 | I3D+VGGish | 81.43
Cho et al. [36] | CVPR 2023 | I3D RGB | 81.30
He et al. [28] | PR 2024 | I3D RGB | 80.07
Tan et al. [39] | WACV 2024 | I3D RGB | 82.10
Yang et al. [29] | CVPR 2024 | CLIP | 83.68
DeepMIL+SP-MIL | – | I3D+CLIP | 76.63
RTFM+SP-MIL | – | I3D+CLIP | 84.42
UR-DMU+SP-MIL | – | I3D+CLIP | 84.32
* means that we re-implemented this method.