Active Hard Sample Learning for Violation Action Recognition in Power Grid Operation

Meng, Lingwen; He, Di; Ban, Guobang; Xi, Guanghui; Li, Anjun; Zhu, Xinshan

doi:10.3390/info16010067


by Lingwen Meng ¹, Di He ²,*, Guobang Ban ¹, Guanghui Xi ¹, Anjun Li ¹ and Xinshan Zhu ²

¹ Electric Power Research Institute of Guizhou Power Grid Co., Ltd., Guiyang 550002, China
² School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
* Author to whom correspondence should be addressed.

Information 2025, 16(1), 67; https://doi.org/10.3390/info16010067

Submission received: 28 November 2024 / Revised: 13 January 2025 / Accepted: 14 January 2025 / Published: 20 January 2025


Abstract:

Power grid operation occurs in complex, dynamic environments where the timely identification of operator violations is essential for safety. Traditional methods often rely on manual supervision and rule-based detection, leading to inefficiencies. Existing deep learning approaches, while powerful, require fully labeled data and long training times, thereby increasing costs. To address these challenges, we propose an active hard sample learning method specifically for the violation action recognition of operators in power grid operation. We design a hard instance sampling module with multi-strategy fusion based on active learning to improve training efficiency. This module identifies hard samples based on the consistency of models or samples, where we develop uncertainty evaluation and the instance discrimination strategy to assess the contributions of samples effectively. We utilize ResNet50 and ViT architectures with Faster-RCNN for detection and recognition, developed using PyTorch 2.0. The dataset comprises 2000 samples, and 30% and 60% labeled data are employed. Experimental results show significant improvements in model performance and training efficiency, demonstrating the method’s effectiveness in complex power grid environments. Our approach enhances safety monitoring and advances active learning and hard sample techniques in practical applications.

Keywords:

violation action recognition; hard sample learning; active learning; semi-supervised learning; power grid operation

1. Introduction

The safe and stable operation of the power grid, as a cornerstone of national infrastructure, is essential not only for economic growth but also for maintaining social stability. Power grid operation frequently takes place in complex and dynamic environments, where various violations, ranging from human error and negligence to procedural violations and equipment failures, are common [1,2]. Such violations can jeopardize the integrity of the power grid, leading to significant risks, including equipment failure, substantial property damage, and even personal injury.

Given these challenges, the accurate and timely identification of violations in grid operations has become a key focus of intelligent monitoring and maintenance [3]. AI-powered smart monitoring systems facilitate the real-time detection of operator violations and operational anomalies. By proactively identifying potential issues, these systems reduce reliance on manual supervision, improve response times, and enhance operational efficiency. Automating the detection process not only minimizes the risk of costly downtime but also ensures safer working conditions, contributing to more reliable and resilient power grid operations.

Traditional violation action recognition methods, which rely primarily on manual supervision and rule-based detection, are inefficient, heavily dependent on human expertise, and poorly equipped to handle the complex and dynamic environments of power grid operation. With the broad application of deep learning in computer vision, significant advances have been made in video action recognition [4,5,6], and deep learning-based video analysis methods have gradually been applied to violation action recognition [7,8,9]. For the detection of violation behavior, Wang et al. [10] developed a violation detection strategy that can accurately identify two types of violation instances by combining a YOLOv5-based object detection model, an HRNet-based pose estimation model, and an ST-GCN skeleton-based action recognition model. The two types of violations include violations without tools (i.e., safety helmets and working clothes) and violations without specific tools (i.e., falling, climbing, and crossing). Furthermore, Li et al. [11] employed a convolutional neural network to extract the surface and deep spatio-temporal fusion features of the skeleton sequence for underground operation violation recognition. Recently, Elezi et al. [12] proposed combining active learning and semi-supervised strategies to optimize the selection of high-quality samples for manual annotation. Based on sample uncertainty, they introduced a selection algorithm grounded in the robustness of network predictions, which improved adaptability in selecting samples across various scenarios. Parvaneh et al. [13] proposed an active learning-based strategy for selecting unlabeled samples, transforming the task of uncertain sample selection into one of selecting samples near the decision boundary. By performing weighted interpolation between the representations of labeled and unlabeled data in the latent space, they identified the most inconsistent unlabeled data, thereby improving the efficiency of training classification networks.

Although deep learning models have shown impressive accuracy in violation action recognition, they face significant challenges in complex environments. Video instances collected in such settings often suffer from motion blur, occlusion, and lighting variations [14,15,16]. Existing methods struggle to accurately identify operator violations under these conditions, and they remain limited by high annotation costs and constrained computational resources. To overcome these challenges, we propose an Active Hard Sample Learning (AHSL) method for the violation action recognition of operators in power grid operation.

Our contributions can be summarized as follows:

To improve the training efficiency of the network, we design a hard instance sampling (HIS) module with multi-strategy fusion to reduce the iteration time of the training. This module identifies hard samples based on the consistency of models or samples and performs iterative learning through active learning to ensure continuous updates of the network.
We develop uncertainty evaluation (UE) and the instance discrimination strategy (IDS) to eliminate redundant data while ensuring sample diversity and balance, and we fuse the two to assess the contribution of each sample effectively, so as to obtain valuable hard samples for the network.
Experimental results demonstrate significant improvements in both model performance and training efficiency, showcasing the effectiveness of the proposed method in complex power grid environments.

The rest of the paper is organized as follows. Section 2 details the proposed method, Section 3 presents the experiments that verify its advantages, and Sections 4 and 5 provide the discussion and conclusions.

2. Methods

Given the high annotation costs for massive video data in power operation monitoring, data redundancy, and the lengthy training time of models in the power sector, selecting performance-optimal samples from vast datasets is crucial. To address these challenges, we propose AHSL for violation action recognition of power grid operators. In AHSL, we introduce the mean teacher framework consisting of a teacher model and a student model, combined with semi-supervised learning. Using the competitive relationship between the two models, the valuable unlabeled samples are selected for annotation, and the new hard samples are learned through an active learning approach. Specifically, we design the HIS module with multi-strategy fusion to address issues, such as data imbalance and redundancy in power grid operation, so as to improve the training efficiency of the network. This module first uses UE based on the consistency between the teacher model and the student model to obtain uncertainty scores. These uncertainty scores are then used for an initial sample selection. Subsequently, IDS is applied to evaluate the selected samples, generating discrimination scores. Finally, the two scores are fused to compute the final score of the sample selection. A higher score indicates that it is more challenging to distinguish samples with the deep model, suggesting that learning from these hard instance samples could improve the robustness of the deep model. Meanwhile, using the active learning mechanism, the selected hard instance samples are annotated and incorporated into the training process to re-optimize the network, enabling efficient and accurate recognition of violation actions for operators in power grid operation, even in complex environments.

As illustrated in Figure 1, the proposed AHSL involves three main phases: data preparation, training, and testing. In the data preparation phase, the dataset is divided into a training set, a validation set, and a test set. We utilize a detection network, i.e., Faster-RCNN [17], to detect the operators in each sample and mark them with detection boxes, and data augmentation is applied to the marked training set to prevent overfitting during network training. During the training stage, 30% or 60% of the detected training samples are randomly selected as the labeled sample set $\{v_{j}\}|_{j=1}^{L}$, where $L$ is the number of labeled samples, and the remainder form the unlabeled sample set $\{x_{i}\}|_{i=1}^{M}$, where $M$ is the number of unlabeled samples. The training process employs a network composed of a teacher model and a student model that share the same structure. The teacher model predicts the unlabeled sample $x_{i}$, generating $p_{i}^{t}$, while the student model extracts features from both the labeled sample $v_{j}$ and the unlabeled sample $x_{i}$, obtaining $\tilde{f}_{j}^{s}$ and $f_{i}^{s}$, and predicts the unlabeled sample $x_{i}$ to produce $p_{i}^{s}$. These results are then fed into the HIS module.

In the HIS module, uncertainty estimation (UE) is used to assess the consistency between the teacher and student models’ predictions for the unlabeled sample set, generating uncertainty scores. Meanwhile, the instance discrimination strategy (IDS) is applied to measure the consistency between the labeled and unlabeled samples, producing discrimination scores to ensure sample diversity. These uncertainty and discrimination scores are then fused to compute final selection scores for the unlabeled samples. Samples with higher scores, indicating greater difficulty for the model, are identified as high-value hard instances. These hard samples are manually labeled and added to the labeled set for the next training cycle, creating an iterative active learning process. To prevent excessively high uncertainty scores, a regularization loss is introduced, which helps identify samples where the models exhibit significant disagreements, further refining the selection of challenging instances. The student model is updated through the classification loss and the regularization loss, while the teacher model is updated using an exponential moving average (EMA) [18] of the student model’s parameters. During the testing phase, the optimized teacher model is used for inference, producing final predictions of operator violation actions in complex environments of power grid operation. Note that we refer to validation and testing collectively as the testing phase.

2.1. Hard Instance Sampling Module

In typical scenarios of power grid operation, because the equipment operates stably, the collected video data may exhibit appearance redundancy, action redundancy, and a lack of change in motion or action dynamics. These issues result in a large number of redundant frames or frames lacking valuable information, which slows the convergence of the models. To address this problem, effective strategies need to be designed within the model to filter out such redundant or low-value frames. One possible strategy is to detect and remove appearance-redundant frames by comparing pixel differences between consecutive frames. Another is to use a technique such as optical flow estimation to detect and filter out motion-redundant frames. However, video data often include complex scene factors such as cluttered backgrounds, target occlusion, and lighting changes, making traditional methods like frame differencing and motion detection insufficiently general and flexible for handling the varied movements and changes in video content. To overcome these challenges, we propose the HIS module to select high-value, challenging instance samples for training. It enhances the model’s ability to filter redundant frames and focuses learning on more informative and challenging samples, ultimately improving the efficiency and accuracy of the action recognition model in power grid operation. The HIS module includes uncertainty evaluation and the instance discrimination strategy.

Uncertainty Evaluation. Uncertainty-based strategies for sample selection have been widely applied and have achieved significant improvements in model performance when combined with active learning [19,20,21,22]. In the proposed HIS module, uncertainty estimation (UE) is carried out by measuring the consistency between the teacher and student models to assess the uncertainty of each sample. Specifically, given the unlabeled sample set $\{x_{i}\}|_{i=1}^{M}$, the teacher model and the student model are used to extract features from and predict these samples, producing the prediction probabilities $\{p^{t}(x_{i})\}|_{i=1}^{M}$ and $\{p^{s}(x_{i})\}|_{i=1}^{M}$, where $M$ represents the size of the mini-batch. The uncertainty of each sample is determined based on the difference between the two predictions. Accordingly, the uncertainty score $u_{i}$ for the sample $x_{i}$ is defined as

$$u_{i} = \mathrm{KL}\left(p^{t}(x_{i}) \,\|\, p^{s}(x_{i})\right) = p^{t}(x_{i}) \log \frac{p^{t}(x_{i})}{p^{s}(x_{i}) + \epsilon},$$

(1)

where the Kullback–Leibler (KL) divergence measures the difference between the two predictions, $p^{t}(x_{i})$ and $p^{s}(x_{i})$ represent the prediction probabilities of the teacher model and the student model for sample $x_{i}$, respectively, and the small constant $\epsilon$ ensures that the denominator is never zero, thus avoiding division-by-zero errors during uncertainty computation. From Equation (1), a higher score $u_{i}$ indicates a greater difference between the two predictions; that is, the uncertainty of the sample $x_{i}$ is greater, implying that $x_{i}$ is more challenging and valuable.
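To make Equation (1) concrete, the following minimal Python sketch computes the uncertainty score for a single sample. It assumes that the product in Equation (1) is summed over the action classes, as in the standard definition of the KL divergence; the function name `uncertainty_score`, the default value of `eps`, and the example probability vectors are illustrative choices, not details taken from the paper.

```python
import math

def uncertainty_score(p_teacher, p_student, eps=1e-8):
    """KL(teacher || student) for one sample, following Eq. (1).

    Assumption: the per-class terms are summed over all classes.
    eps is added to the denominator only, as in the equation.
    """
    return sum(
        pt * math.log(pt / (ps + eps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0  # a class with zero teacher probability contributes 0
    )

# Hypothetical 3-class predictions for two unlabeled samples.
agree = uncertainty_score([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
disagree = uncertainty_score([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

When the two models agree, the score is close to zero; when they disagree, the score grows, so the second sample would be ranked as the harder one.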

Instance Discrimination Strategy. Video data typically contains rich temporal information, and compared to static images, the motion in videos exhibits continuity or repetition. Since neighboring frames often share similar semantic information, they may also have comparable uncertainty scores. In this case, selecting instance samples solely based on uncertainty evaluation may limit learning with neighboring frames. However, these frames may contribute equally to the learning process of the action recognition model, meaning that relying only on uncertainty might not fully exploit the temporal information present in the video. To better utilize this temporal information, we also propose the IDS. This strategy avoids repeatedly selecting temporally similar frames for annotation, ensuring the diversity and representativeness of the selected instance samples. In doing so, it helps reduce redundancy and promotes the selection of more informative samples, thus optimizing the efficiency of model training and improving its performance in action recognition tasks.

Using the IDS, we evaluate samples in the temporal domain from both the sample distribution and the feature distribution. Concretely, for each unlabeled sample, a Gaussian distribution $N(\mu, \sigma^{2})$ is used as the distance metric to evaluate the instance from the view of the sample distribution. Given the labeled sample set $\{v_{j}\}|_{j=1}^{L}$, we utilize the student model to extract the features $\{\tilde{f}_{j}^{s}\}|_{j=1}^{L}$, where the features are the output of the layer immediately before the classification layer. Next, based on the statistical characteristics (e.g., mean and variance) of these features, the evaluation indicator of the unlabeled instance sample $x_{i}$ under the sample distribution is as follows:

$$r_{i} = 1 - \sum_{j=1}^{L} \omega_{i}^{j} \, e^{-\frac{1}{2} \left(\frac{x_{i} - \mu_{j}}{\sigma_{j}}\right)^{2}},$$

(2)

where $\mu_{j}$ and $\sigma_{j}$ represent the mean and standard deviation of the feature $\tilde{f}_{j}^{s}$ for the $j$th labeled sample. The mask $\omega_{i}^{j} \in \{0, 1\}$ is used to select the closest distribution for the $i$th unlabeled instance sample. Specifically, $\omega_{i}^{j} = 1$ if the distribution of the $j$th labeled sample is very close to that of the unlabeled sample $x_{i}$; otherwise, $\omega_{i}^{j} = 0$. Since each labeled sample has its own distribution, centered around its temporal location in the video, this approach is conducive to selecting more informative samples for model training.

Meanwhile, the feature distribution distance is further considered to enhance the semantic accuracy of sample evaluation. Hence, the score of the feature distance between the $i$th unlabeled sample and the $j$th labeled sample is calculated as follows:

$$e_{i} = \alpha \cdot \exp\left(-\frac{\left(1 - D(\tilde{f}_{j}^{s}, f_{i}^{s})\right)^{2}}{2\delta^{2}}\right),$$

(3)

where $D(\cdot)$ denotes a distance function such as the Euclidean distance, and $\alpha$ and $\delta$ are two learnable parameters. The score $e_{i}$ obtained by Equation (3) has a Gaussian-like form: for distances $D \leq 1$, the farther the feature distance between the $i$th unlabeled sample and the $j$th labeled sample, the larger the value of $e_{i}$. Hence, by simultaneously considering both the sample and feature distributions, the discrimination score is $d_{i} = r_{i} + e_{i}$. In doing so, the strategy ensures the selection of diverse instances over time and avoids the repetitive selection of temporally close frames, which increases the diversity of training samples.
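The discrimination score can be illustrated with a small Python sketch of Equations (2) and (3). Several simplifications are assumed here and are not specified in the paper: the unlabeled sample is represented by a scalar temporal position, the mask selects exactly one labeled distribution (the one with the smallest standardized distance), the feature distance is computed against that same labeled sample, and `alpha` and `delta` are fixed constants rather than learned parameters. All function names and toy values are hypothetical.

```python
import math

def sample_distribution_score(x, centers, spreads):
    """Eq. (2) with a one-hot mask: only the closest labeled distribution counts."""
    # Assumption: "closest" means the smallest standardized distance |x - mu| / sigma.
    z = [abs(x - mu) / sigma for mu, sigma in zip(centers, spreads)]
    j = min(range(len(z)), key=z.__getitem__)
    return 1.0 - math.exp(-0.5 * z[j] ** 2), j

def feature_distance_score(f_labeled, f_unlabeled, alpha=1.0, delta=0.5):
    """Eq. (3) with Euclidean distance; alpha and delta are fixed in this sketch."""
    dist = math.dist(f_labeled, f_unlabeled)
    return alpha * math.exp(-((1.0 - dist) ** 2) / (2.0 * delta ** 2))

def discrimination_score(x, f_unlabeled, centers, spreads, labeled_features):
    """d_i = r_i + e_i, pairing the unlabeled sample with its closest labeled sample."""
    r, j = sample_distribution_score(x, centers, spreads)
    e = feature_distance_score(labeled_features[j], f_unlabeled)
    return r + e

# Hypothetical toy data: two labeled samples at temporal positions 2 and 10.
centers, spreads = [2.0, 10.0], [1.0, 1.0]
labeled_features = [[1.0, 0.0], [0.0, 1.0]]

# An unlabeled frame that coincides with a labeled one, and one that is farther away.
near = discrimination_score(2.0, [1.0, 0.0], centers, spreads, labeled_features)
far = discrimination_score(5.0, [0.6, 0.8], centers, spreads, labeled_features)
```

In this toy setting the frame that duplicates a labeled sample receives the lower score, which matches the stated goal of avoiding the repeated selection of temporally close frames.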

Considering both the uncertainty and the diversity of samples, we perform multi-strategy fusion. That is, by combining UE and the IDS, the samples most beneficial for model learning (hard samples) are selected for labeling. The final score $s_{i}$ for sample selection is computed by a fusion scoring mechanism, formulated as follows:

$$s_{i} = \beta \cdot u_{i} + (1 - \beta) \cdot d_{i},$$

(4)

where $\beta$ is a coupling factor that balances the contributions of the two strategies in the final selection process, and $u_{i}$ and $d_{i}$ are each normalized into the range $(0, 1)$. After obtaining the final score $s_{i}$, we use a threshold $\tau$ to select the hard samples ($s_{i} \geq \tau$) for manual annotation and add them to the original labeled sample set. The network is then iteratively updated, thus completing one cycle of active learning.
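The fusion and thresholding step of Equation (4) can be sketched as follows. The paper states only that the two scores are normalized; this sketch assumes min-max normalization over the current batch, which maps onto the closed interval [0, 1] rather than the open range given in the text. The defaults `beta=0.5` and `tau=0.6` follow the values reported in the experimental details, while the function names and the raw scores are hypothetical.

```python
def min_max(values):
    """Min-max normalization (an assumed choice; the paper does not name the method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_hard_samples(u, d, beta=0.5, tau=0.6):
    """Fuse uncertainty and discrimination scores (Eq. 4) and keep samples with s_i >= tau."""
    u_norm, d_norm = min_max(u), min_max(d)
    scores = [beta * a + (1.0 - beta) * b for a, b in zip(u_norm, d_norm)]
    selected = [i for i, s in enumerate(scores) if s >= tau]
    return scores, selected

# Hypothetical raw scores for four unlabeled samples.
u_raw = [0.1, 0.9, 0.5, 1.3]
d_raw = [0.2, 1.8, 1.0, 0.4]
scores, selected = select_hard_samples(u_raw, d_raw)
```

With these toy values only the sample at index 1 clears the threshold: it is both uncertain and distinct, whereas the sample at index 3 has the highest uncertainty but a low discrimination score.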

Furthermore, learning only from the hard samples may make the training data sparse, which causes instability in model training. To address this issue, we adopt temporal warping augmentation [23] to expand the hard samples along the temporal dimension, aiming to overcome the complexity of actions due to temporal variations and to introduce greater randomness and continuity into the data. Specifically, temporal warping augmentation stretches the samples to different time lengths. As shown in Figure 2, for an 8-frame video segment, 2 or 4 frames are selected for annotation, and the unselected frames are padded with randomly chosen adjacent frames. This data augmentation can improve the stability of the model during training.

For action classification, cross-entropy loss is used for optimization, and the uncertainty scores are used to weight the hard samples to increase their contribution to model learning. Hence, the classification loss $L_{id}$ is calculated as follows:

$$L_{id} = -\frac{1}{M} \sum_{j=1}^{M} w_{j} \, q_{j} \log \tilde{p}^{s}(x_{j}),$$

(5)

where $q_{j}$ is the ground truth label of the $j$th sample, $\tilde{p}^{s}(x_{j})$ is the prediction probability for the $j$th sample from the student model, and $w_{j} = \exp(u_{j})$ is the weight of the $j$th hard sample. Note that $w_{j} = 1$ for the original labeled sample set. Based on the uncertainty score $u_{i}$, a regularization loss $L_{reg}$ is proposed to prevent high uncertainty on the unlabeled sample set $\{x_{i}\}|_{i=1}^{M}$; it is calculated as follows:

$$L_{reg} = \frac{1}{M} \sum_{i=1}^{M} \exp(-u_{i}).$$

(6)
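Equations (5) and (6) can be written out in a few lines of Python. The sketch assumes one-hot ground truth labels, so that the term $q_{j} \log \tilde{p}^{s}(x_{j})$ reduces to the log-probability of the true class; an uncertainty of `None` marks a sample from the original labeled set, whose weight is 1. The function names and the toy batch are hypothetical.

```python
import math

def classification_loss(labels, student_probs, uncertainties):
    """Eq. (5): cross-entropy weighted by w_j = exp(u_j) for hard samples, 1 otherwise.

    Assumption: labels are class indices (one-hot ground truth).
    """
    total = 0.0
    for y, probs, u in zip(labels, student_probs, uncertainties):
        weight = 1.0 if u is None else math.exp(u)
        total += weight * math.log(probs[y])
    return -total / len(labels)

def regularization_loss(uncertainties):
    """Eq. (6): mean of exp(-u_i) over the unlabeled mini-batch."""
    return sum(math.exp(-u) for u in uncertainties) / len(uncertainties)

# Hypothetical batch: one original labeled sample and one hard sample with u = 0.5.
labels = [0, 1]
student_probs = [[0.8, 0.2], [0.4, 0.6]]
weighted = classification_loss(labels, student_probs, [None, 0.5])
unweighted = classification_loss(labels, student_probs, [None, None])
reg = regularization_loss([0.0, 0.0])
```

Because $\exp(u_{j}) > 1$ for any positive uncertainty, the hard sample contributes more to the weighted loss than it would under plain cross-entropy.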

2.2. Optimization

The overall optimization objective of the student model is a combination of the classification loss and the regularization loss, defined as

$$L_{total} = L_{id} + L_{reg}.$$

(7)

For the student model, starting from the current parameter values, the gradient of the total loss function $L_{total}$ with respect to each parameter is computed using the backpropagation algorithm. Then, the student model’s parameters are updated using the gradient descent algorithm as follows:

$$\theta_{l+1}^{s} = \theta_{l}^{s} - \gamma \frac{\partial L_{total}}{\partial \theta_{l}^{s}},$$

(8)

where $\theta_{l}^{s}$ is the parameter of the student model at the $l$th iteration and $\gamma$ is the learning rate. To avoid overfitting, the teacher model’s parameters are not updated online during training. Instead, the parameters of the student model are used to update the teacher model’s parameters using the EMA algorithm. The update rule for the teacher model’s parameters is defined as

$$\theta_{l+1}^{t} \leftarrow \phi \, \theta_{l}^{t} + (1 - \phi) \, \theta_{l+1}^{s},$$

(9)

where $\theta_{l}^{t}$ is the teacher model’s parameter at the $l$th iteration, $\theta_{l+1}^{s}$ is the student model’s parameter at the $(l+1)$th iteration, and $\phi$ is the smoothing factor that controls the update speed of the parameters, typically set to a value very close to 1, with an empirical value of 0.999. Through iterative optimization, the model gradually learns discriminative features from the training data and improves its performance.
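One training iteration under Equations (8) and (9) can be sketched on plain Python lists. The sketch implements the gradient descent step exactly as written in Equation (8), without the momentum and weight decay that the experiments use with SGD, and treats each model as a flat list of parameters. The function names, the learning rate, and the toy gradients are hypothetical.

```python
def sgd_step(student, grads, lr):
    """Eq. (8): plain gradient descent on the student parameters."""
    return [p - lr * g for p, g in zip(student, grads)]

def ema_update(teacher, student, phi=0.999):
    """Eq. (9): the teacher tracks an exponential moving average of the student."""
    return [phi * t + (1.0 - phi) * s for t, s in zip(teacher, student)]

# Hypothetical two-parameter models and gradients of the total loss.
student = [1.0, -2.0]
teacher = [0.0, 0.0]
grads = [0.5, -1.0]

student = sgd_step(student, grads, lr=0.1)   # the student moves first
teacher = ema_update(teacher, student)       # the teacher then follows the new student
```

With $\phi = 0.999$ the teacher moves only 0.1% of the way toward the student per iteration, which is why its predictions change smoothly even when the student's parameters fluctuate.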

3. Experiments

In typical power grid operation scenarios, we follow international electrical safety standards, such as Article 100 of NFPA-70E [24], to establish safe working conditions. This includes a risk assessment of the operational environment, contingency planning, ensuring that all personnel have received task-specific training, and requiring operators to hold a work ticket before conducting operations. Note that in the blackout operation scenario, electric verification is conducted before the operation. With these preparations in place, our research aims to collect video data, monitor the operation process, and automatically recognize operators’ actions, so as to determine whether operators are working according to the standards, for example, by wearing personal protective equipment. This helps to monitor emergencies during power grid operation and thereby prevent electrical shock hazards.

The data collection is conducted by combining on-site video recording and real-time monitoring systems. Each video is recorded at a resolution of 1080p, the average duration of all videos is 10 min, and the sample data are collected in various complex power grid operating environments, characterized by variations in lighting, temporal positions, and weather conditions. For example, (1) recording is conducted at different times of the day to capture both bright sunlight and low-light scenes. (2) Work conditions range across various locations, such as substations and outdoor work sites, where the positions of operators and equipment frequently change. (3) Weather conditions such as sunny, cloudy, and rainy days are considered during data collection to enhance the practical applicability of the dataset.

Furthermore, we extract frames from the collected video data at a rate of 30 frames per second and use the detection network, i.e., Faster-RCNN, to identify operators and ensure that each sample contains at least one operator. Ultimately, the dataset contains 2000 samples, which are divided into training, validation, and test sets in a 6:2:2 ratio, resulting in 1200 training samples, 400 validation samples, and 400 test samples. The validation samples are labeled by professionals. Based on semi-supervised learning, we use two experimental settings, in which 30% and 60% of the training samples are randomly selected for labeling and evaluation by professionals. Note that each labeled sample contains at least one violation by an operator. There are 8 violation categories for operators in total, divided into violations related to wearing personal protective equipment (i.e., safety helmet, work clothes, insulated gloves, insulated shoes, safety harness) and other violations (i.e., smoking, crossing security boundary, throwing wires or tools). Several screenshots of the dataset are shown in Figure 3.
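The split sizes can be checked with a few lines of integer arithmetic. The initial labeled-set sizes of 360 and 720 are derived here from the stated 30% and 60% settings and are not reported explicitly in the paper; the function name `split_counts` is hypothetical.

```python
def split_counts(total, ratios):
    """Divide `total` samples according to integer ratios such as (6, 2, 2)."""
    unit = total // sum(ratios)
    return [unit * r for r in ratios]

n_train, n_val, n_test = split_counts(2000, (6, 2, 2))

# Initial labeled subsets of the training set (derived, not reported in the paper).
labeled_30 = n_train * 30 // 100
labeled_60 = n_train * 60 // 100
unlabeled_30 = n_train - labeled_30
unlabeled_60 = n_train - labeled_60
```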

3.1. Evaluation Criteria

In the experiments, accuracy and the F1-score are used to evaluate the action recognition results. Accuracy is defined as the proportion of correctly classified samples and is calculated by

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

(10)

where $TP$ represents the number of true positive samples, $TN$ represents the number of true negative samples, $FP$ represents the number of false positive samples, and $FN$ represents the number of false negative samples. The F1-score (F1) is the harmonic mean of precision and recall, calculated by

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

(11)

where precision and recall are calculated by

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

(12)

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

(13)
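Equations (10) to (13) translate directly into code. The sketch below computes all four metrics from confusion-matrix counts; the zero-denominator guards and the example counts are additions for illustration and do not come from the paper's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts (Eqs. 10-13)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guards against empty denominators are an added convention, not part of the equations.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for one violation category on a 200-sample test subset.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

For these hypothetical counts the accuracy is 0.85 and the recall is 0.8, while the F1-score (16/19, about 0.842) sits between precision and recall, as expected for a harmonic mean.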

3.2. Experimental Details

For the network architecture, two mainstream models, ResNet50 [25] (R-50) and ViT [26] (ViT-S), are used in the experiments. The pre-trained parameters for ResNet50 are obtained from ImageNet [27], while ViT is first trained on ImageNet-21K to obtain initial weights, which are then fine-tuned on ImageNet-1K to acquire the pre-trained parameters. During the training phase, all training samples, originally at a resolution of 1920 × 1080, are resized to 224 × 224 before being input into the network. The samples are then augmented via random cropping and flipping, random grayscale, and color jitter to increase the number of training samples. The SGD optimizer is used for training, with momentum and weight decay set to 0.9 and 1 × 10⁻⁴, respectively. The initial learning rate is set to 1 × 10⁻³ for R-50 and 8 × 10⁻³ for ViT-S, with a cosine learning rate decay strategy applied. The mini-batch size $N$ is set to 64, and the models are trained for 30 epochs (R-50) or 50 epochs (ViT-S). The threshold $\tau$ for selecting hard instance samples is set to 0.6, and the coupling factor $\beta$ is set to 0.5. At the initial stage of active learning, 30% or 60% of the training samples are randomly selected as the initial labeled sample set, with the remainder forming the unlabeled sample set. The training time on an RTX-3090 graphics card is about 35 min with 30% labeled data and 50 min with 60% labeled data. For a single sample, the average test time of the model is 83.3 ms.
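The cosine learning rate decay mentioned above can be sketched as follows. The paper does not give the exact schedule, so this sketch assumes the common form that decays once per epoch from the initial rate to zero without warm-up; the function name `cosine_lr` and the `min_lr` parameter are hypothetical. The initial rate of 1 × 10⁻³ and the 30 epochs are the reported R-50 settings.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs (assumed form)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

# Per-epoch learning rates for the R-50 setting: 30 epochs, initial rate 1e-3.
schedule = [cosine_lr(epoch, 30, 1e-3) for epoch in range(31)]
```

Under this assumed form the rate starts at the initial value, reaches half of it at the midpoint (epoch 15), and decays to zero at the final epoch.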

3.3. Results Comparison and Analysis

In typical power grid operation scenarios, the proposed method is trained by 30% and 60% labeled samples and tested on a test set where the samples are collected from complex environments. Table 1 and Table 2 present the accuracy of the proposed method for each violation category in different types, along with the accuracy of baseline methods ResNet50 (BS-R-50) and ViT (BS-ViT-S) for comparison. Table 1 uses the ResNet50 model, while Table 2 applies the ViT model.

From Table 1 and Table 2, it can be observed that the proposed method demonstrates substantial performance improvements over the baseline models (ResNet50 and ViT). These tables present the accuracies for violation categories of different types in various environments. By averaging the results across all categories, the average accuracy of the proposed method using ResNet50 trained by 30% and 60% labeled samples increases from the baseline 77.74% and 83.71% to 81.88% and 86.98%. ViT, when trained on 30% and 60% labeled samples, increases to 83.08% and 87.26% in terms of average accuracy. This clear improvement across both models demonstrates the method’s strong generalization capability across varying model architectures and violation action types. The performance gains also reflect the method’s effectiveness in selecting informative samples and improving the learning process, resulting in better overall model performance even in complex scenarios. Furthermore, the model’s performance improves as the proportion of labeled samples increases. With 30% labeled data, the model relies more heavily on unlabeled data, which may enhance generalization but could also lead to lower accuracy due to the insufficient number of labeled samples. Increasing the labeled data to 60% improves the performance by providing more reliable training signals.

In Table 1, the accuracy for each violation category using ResNet50 trained on 60% labeled samples improves by significant margins when the proposed method is applied. For example, the accuracy of recognizing the smoking category (other violations) increases by 2.42%, while the accuracy of identifying the safety helmet category (personal protective equipment) improves by 2.95%. These consistent improvements highlight how the proposed method enhances recognition performance across multiple violation types, indicating a better learning process for both simple and hard-to-detect violations. Similarly, in Table 2, the proposed method continues to outperform the baseline. For example, improvements are seen in categories such as throwing wires or tools and insulated gloves, with accuracy gains of 5.23% and 2.33%, respectively. These results suggest that the proposed method is model-agnostic, working well with both CNN-based and transformer-based models such as ResNet50 and ViT.

As shown in Figure 4, considering the model trained on 60% labeled samples, we further compare the testing results of the baseline methods (BS-R-50 and BS-ViT-S) and the proposed methods (AHSL-R-50 and AHSL-ViT-S) across different numbers of training epochs. In Figure 4, we can see the superior efficiency of the proposed AHSL in model fitting. By analyzing the results, it becomes clear that the proposed AHSL achieves higher accuracy more quickly on different model architectures, indicating a faster convergence rate. The effectiveness of the proposed AHSL in terms of model fitting time can be attributed to two factors: (1) The proposed AHSL selectively focuses on difficult or ambiguous samples, which accelerates the learning process by reducing redundant training on easy, already well-classified samples. This targeted training allows the model to adapt more rapidly to challenging data points, thus enhancing overall efficiency. (2) By incorporating multiple strategies to evaluate sample uncertainty and diversity, the model learns more effectively in each epoch. This reduces the number of epochs required to achieve comparable or better performance than the baseline methods. Overall, the proposed AHSL improves training efficiency by speeding up convergence while maintaining or improving accuracy, thereby reducing the total time needed for model fitting.

We also conducted an ablation study to evaluate the contribution of each component of the proposed method trained on 60% labeled samples. This study systematically removes certain components of the proposed method to analyze their individual impact on performance. As shown in Table 3, AHSL-R-50, based on the ResNet50 model, is used for this analysis, with average accuracy, recall, precision, and F1-score as evaluation metrics. The average accuracy and F1-score of action recognition decrease when uncertainty estimation is not applied (w/o UE) or when the instance discrimination strategy is removed (w/o IDS). The best performance is achieved when all components are used. Hence, both the uncertainty estimation and the instance discrimination strategy contribute to the performance gains. We also calculated the proportions of false positive and false negative samples across the entire test set, denoted as the false positive rate (FPR) and the false negative rate (FNR), respectively. Compared to BS-R-50 trained on 60% labeled samples, AHSL-R-50 produces fewer erroneous samples (FPR falls from 12.75% to 8.00% and FNR from 15.25% to 9.00%) and gains 7.09 percentage points in F1-score, highlighting its superior robustness. This decrease in error rates indicates that the uncertainty estimation and instance discrimination mechanisms effectively enhance the model’s overall accuracy in complex test scenarios.
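As a consistency check, the F1 column of Table 3 can be recomputed from the recall and precision columns. The sketch below assumes that the reported F1 is the harmonic mean of the reported average precision and recall; the source does not spell out the averaging scheme, so this is an assumption, although the recomputed values agree with the table to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %) copied from Table 3.
rows = {
    "BS-R-50": (77.32, 80.31),
    "AHSL-R-50 (w/o UE)": (79.85, 82.47),
    "AHSL-R-50 (w/o IDS)": (82.34, 84.89),
    "AHSL-R-50": (85.23, 86.55),
}

f1 = {method: round(f1_score(p, r), 2) for method, (r, p) in rows.items()}
f1_gain = round(f1["AHSL-R-50"] - f1["BS-R-50"], 2)
print(f1)
print(f1_gain)
```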

To further demonstrate the effectiveness of the proposed framework based on active hard sample learning (AHSL), we compare it with three advanced existing methods, SVFormer-S [23], TimeBalance [28], and AFNet [29], all trained on 60% labeled samples. The test results are listed in Table 4. Specifically, we first train and test the models of these three methods on their own to obtain their recognition results. We then integrate these models into the proposed AHSL framework and repeat the training and testing. Comparing Number 1 with Number 4, Number 2 with Number 5, and Number 3 with Number 6 in Table 4 shows that the AHSL framework achieves accuracy comparable to or higher than the original models (0.11 percentage points lower for SVFormer-S, and higher for TimeBalance and AFNet) while reaching model convergence in fewer iteration rounds. This again confirms that the proposed AHSL framework can improve the training efficiency of the model.
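The pairwise comparison in Table 4 can be summarized as an accuracy change and a relative reduction in epochs to convergence. The following sketch uses only the numbers from Table 4; the `results` structure and the `summary` name are illustrative.

```python
# Per method: (accuracy %, epochs) without AHSL and with AHSL, copied from Table 4.
results = {
    "SVFormer-S": ((82.69, 50), (82.58, 30)),
    "TimeBalance": ((78.85, 50), (80.02, 30)),
    "AFNet": ((83.76, 40), (84.15, 25)),
}

summary = {}
for name, ((acc_base, ep_base), (acc_ahsl, ep_ahsl)) in results.items():
    accuracy_change = round(acc_ahsl - acc_base, 2)  # percentage points
    epoch_reduction = round(100 * (ep_base - ep_ahsl) / ep_base, 1)  # percent
    summary[name] = (accuracy_change, epoch_reduction)

print(summary)
```

Under these numbers, the epoch count drops by 37.5% to 40% in all three pairs, while accuracy changes by less than 1.2 percentage points in either direction.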

Furthermore, we analyze two important hyperparameters of the proposed method trained on 60% labeled samples: the threshold τ for selecting hard instance samples and the coupling factor β in the fusion scoring mechanism. We conduct experiments with a set of values (such as {0.3, 0.4, 0.5, 0.6, 0.7}) for both τ and β. As shown in Figure 5, the value of τ directly influences the selection of training instances and, consequently, model performance. When τ is set too low or too high, the performance of the model decreases significantly. Our experiments demonstrate that τ = 0.6 provides the best balance, selecting a mix of moderately difficult and challenging instances. This choice encourages the model to focus on learning from harder cases while maintaining a diverse range of training examples, ultimately resulting in improved overall performance and better generalization. From the right panel of Figure 5, we observe that an intermediate value, β = 0.5, offers the best performance, as it selects hard instance samples for model learning while weighting UE and the IDS equally.
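To illustrate how τ and β interact, the following sketch shows one possible fusion-and-threshold rule. It assumes, purely for illustration, that the fused score is a convex combination of a UE score and an IDS score, both scaled to [0, 1], and that a sample is treated as hard when the fused score exceeds τ. The function name, the score values, and the exact fusion rule are hypothetical; the fusion scoring mechanism of the paper may differ.

```python
def select_hard_samples(ue_scores, ids_scores, beta=0.5, tau=0.6):
    """Return indices of samples whose fused score exceeds tau.

    Assumed fusion rule (not confirmed by the source):
        score = beta * ue + (1 - beta) * ids
    """
    selected = []
    for index, (ue, ids) in enumerate(zip(ue_scores, ids_scores)):
        score = beta * ue + (1.0 - beta) * ids
        if score > tau:
            selected.append(index)
    return selected


# Hypothetical per-sample scores in [0, 1].
ue = [0.9, 0.2, 0.7, 0.5]
ids = [0.8, 0.1, 0.4, 0.9]

hard = select_hard_samples(ue, ids)               # beta = 0.5, tau = 0.6
hard_low_tau = select_hard_samples(ue, ids, tau=0.5)
print(hard, hard_low_tau)
```

With β = 0.5 both scores contribute equally, and lowering τ admits more samples into the hard set, which matches the qualitative trade-off described above.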

To provide a clearer understanding of the performance of the proposed method trained on 60% labeled samples for recognizing violation actions by operators in power grid operations, we visualize the recognition results for selected instances from the test set, as illustrated in Figure 6 and Figure 7. In these figures, the first column presents the original images captured during power grid operations, the second column displays the detection results for the target operators, and the third column shows the recognized violation actions of these operators. The figures show that the proposed method successfully identifies violation actions even in complex scenarios (e.g., indoor and outdoor scenes, sunny and rainy days, and day and night). This demonstrates the robustness and adaptability of the method, confirming that it is capable of recognizing violation actions in real time under various operational conditions. Furthermore, these visualizations highlight the model’s potential for use in automated monitoring and intelligent safety management within power grid operations.

4. Discussion

Traditional action recognition methods primarily rely on large-scale labeled data and standard classification algorithms. However, these methods face challenges in highly complex and dynamic scenarios, such as violation recognition in power grid operation. Especially in cases of sample scarcity and imbalance, traditional methods often suffer from performance limitations. The proposed method is trained with 30% and 60% labeled data and tested on data collected from power grid operation in complex environments; however, the gains from additional labeled data may diminish as the model becomes more confident. Hence, the optimal percentage of labeled data depends on balancing labeling costs against performance improvements, and in this study, 60% provides a good compromise. The successful application of the proposed method advances the development of active learning and hard sample learning in practical applications, which helps models adapt to changes in real-world scenarios. Its applicability is not limited to power grid operation; it can be extended to other fields requiring precise action recognition, such as industrial monitoring and safety inspection. However, the method still has limitations: it currently handles only single-modal data and only recognizes the violation actions of a single target. Therefore, future work that integrates multimodal data (text, sensor data, or audio) and develops multi-object recognition techniques can further explore the potential of this method in complex scenes.

5. Conclusions

We have achieved significant performance improvements over the baseline methods in recognizing the violation actions of power grid operators in complex environments. For example, using ResNet50 as a backbone trained on 30% and 60% labeled data, the proposed method achieves average accuracies of 81.88% and 86.98%, respectively, on 400 test images. The proposed AHSL combines the mean teacher framework with an HIS module designed specifically for the power grid environment. This module, which includes UE and the IDS, selects hard samples for model learning through multi-strategy fusion, significantly improving both the training efficiency and the recognition accuracy of the model. By incorporating semi-supervised learning, the proposed method achieves good performance with only part of the data labeled, thus reducing reliance on fully labeled data. Furthermore, the proposed method is beneficial for enhancing safety monitoring and advancing active learning and hard sample techniques in practical applications.

Author Contributions

Conceptualization, L.M., D.H. and X.Z.; methodology, D.H.; validation, G.B., G.X. and A.L.; formal analysis, L.M.; investigation, D.H.; data curation, G.X. and A.L.; writing—original draft preparation, D.H.; writing—review and editing, L.M. and X.Z.; visualization, G.X. and A.L.; supervision, G.B.; project administration, G.B.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Guizhou Power Grid Co., Ltd., grant number GZKJXM20222320.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Acknowledgments

We are grateful for the administrative and technical support of the Electric Power Research Institute of Guizhou Power Grid Co., Ltd.

Conflicts of Interest

Authors Lingwen Meng, Guobang Ban, Guanghui Xi and Anjun Li were employed by the Electric Power Research Institute of Guizhou Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Mohamed, A.A.R.; Best, R.J.; Liu, X.; Morrow, D.J. Two-phase BESS optimization methodology to enhance the distribution network power quality and mitigate violations. IET Renew. Power Gener. 2023, 17, 2895–2908.
2. Zhou, F.; Lu, H.; Jiang, C. Violation with concerns of safety: A study on non-compliant behavior and the antecedent and consequent effects in power grid construction. Saf. Sci. 2024, 170, 106353.
3. Ban, G.; Fu, L.; Jiang, L.; Du, H.; Li, A.; He, Y.; Zhou, J. Two-stage dynamic risk identification of complex work personnel behavior based on image screening. Power Syst. Big Data 2024, 27, 58–69.
4. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
5. Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213.
6. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; p. 4.
7. Alafif, T.; Hadi, A.; Allahyani, M.; Alzahrani, B.; Alhothali, A.; Alotaibi, R.; Barnawi, A. Hybrid classifiers for spatio-temporal abnormal behavior detection, tracking, and recognition in massive Hajj crowds. Electronics 2023, 12, 1165.
8. Meydani, A.; Shahinzadeh, H.; Ramezani, A.; Moazzami, M.; Nafisi, H.; Askarian-Abyaneh, H. Comprehensive Review of Artificial Intelligence Applications in Smart Grid Operations. In Proceedings of the International Conference on Technology and Energy Management, Behshahr, Iran, 14–15 February 2024; pp. 1–13.
9. Meng, L.; Ban, G.; Liu, F.; Qiu, W.; He, D.; Zhang, L.; Wang, S. Illegal action classification in power grid operation based on cross-domain few-shot learning. Power Syst. Big Data 2024, 27, 69–76.
10. Wang, J.; Zhou, H.; Sun, H.; Su, Z.; Li, X. A Violation Behaviors Detection Method for Substation Operators based on YOLOv5 And Pose Estimation. In Proceedings of the IEEE China International Youth Conference on Electrical Engineering, Wuhan, China, 3–5 November 2022; pp. 1–5.
11. Li, X.; Chen, H.; Tian, Y.; Zhong, Y.; Jiang, G. Identification Method of Underground Operation Violation Behavior Based on Deep Learning. In Proceedings of the International Conference on Algorithms, High Performance Computing and Artificial Intelligence, Guangzhou, China, 21–23 October 2022; pp. 547–550.
12. Elezi, I.; Yu, Z.; Anandkumar, A.; Leal-Taixe, L.; Alvarez, J.M. Not all labels are equal: Rationalizing the labeling costs for training object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14492–14501.
13. Parvaneh, A.; Abbasnejad, E.; Teney, D.; Haffari, G.R.; Van Den Hengel, A.; Shi, J.Q. Active learning by feature mixing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12237–12246.
14. Wu, Z.; Li, H.; Zheng, Y.; Xiong, C.; Jiang, Y.G.; Davis, L.S. A coarse-to-fine framework for resource efficient video recognition. Int. J. Comput. Vis. 2021, 129, 2965–2977.
15. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211.
16. Li, W.; Pan, G.; Wang, C.; Xing, Z.; Han, Z. From coarse to fine: Hierarchical structure-aware video summarization. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–16.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
18. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
19. Jain, P.; Kapoor, A. Active learning for large multi-class problems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 762–769.
20. Li, X.; Guo, Y. Adaptive active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 859–866.
21. Yang, Y.; Ma, Z.; Nie, F.; Chang, X.; Hauptmann, A.G. Multi-class active learning by uncertainty sampling with diversity maximization. Int. J. Comput. Vis. 2015, 113, 113–127.
22. Heilbron, F.C.; Lee, J.Y.; Jin, H.; Ghanem, B. What do i annotate next? an empirical study of active learning for action localization. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 199–216.
23. Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.G. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18816–18826.
24. NFPA-70E. Available online: https://www.nfpa.org/codes-and-standards/nfpa-70e-standard-development/70e (accessed on 27 November 2024).
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
26. Kolesnikov, A.; Dosovitskiy, A.; Weissenborn, D.; Heigold, G.; Uszkoreit, J.; Beyer, L.; Minderer, M.; Dehghani, M.; Houlsby, N.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
27. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
28. Dave, I.R.; Rizve, M.N.; Chen, C.; Shah, M. Timebalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2341–2352.
29. Zhang, Y.; Bai, Y.; Wang, H.; Xu, Y.; Fu, Y. Look more but care less in video recognition. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 30813–30825.

Figure 1. The framework of the proposed method, which includes the data preparation phase, the training phase, and the testing phase. Different colors of the bar rectangle denote the probabilities of different categories in p_{i}^{s} and p_{i}^{t}.

Figure 2. Temporal warping augmentation.

Figure 3. Several screenshots randomly selected in the dataset.

Figure 4. Comparison of the results on different training epochs.

Figure 5. Results for different values of the threshold τ for selecting hard instance samples and the coupling factor β.

Figure 6. Visualization results of violation action recognition for operators in power grid operation.

Figure 7. Visualization results of violation action recognition for operators in power grid operation.

Table 1. Accuracy comparison of the model (R-50) trained on 30% and 60% labeled samples, evaluated on the test set for different categories of violations. The results in bold indicate the best results.

Types	Categories	BS-R-50 (30%)	AHSL-R-50 (30%)	BS-R-50 (60%)	AHSL-R-50 (60%)
personal protective equipment	safety helmet	85.67%	90.54%	92.23%	95.18%
	work clothes	73.28%	76.82%	78.56%	82.32%
	insulated gloves	81.96%	85.24%	85.52%	86.46%
	insulated shoes	68.75%	72.51%	73.13%	74.67%
	safety harness	66.64%	69.78%	72.43%	76.56%
others	smoking	83.29%	88.07%	91.03%	93.45%
	crossing security boundary	78.95%	84.32%	86.36%	92.45%
	throwing wires or tools	83.17%	87.76%	90.45%	94.78%
Average Accuracy		77.74%	81.88%	83.71%	86.98%

Table 2. Accuracy comparison of the model (ViT-S) trained on 30% and 60% labeled samples, evaluated on the test set for different categories of violations. The results in bold indicate the best results.

Types	Categories	BS-ViT-S (30%)	AHSL-ViT-S (30%)	BS-ViT-S (60%)	AHSL-ViT-S (60%)
personal protective equipment	safety helmet	87.52%	91.78%	92.55%	95.89%
	work clothes	73.87%	77.25%	77.92%	81.78%
	insulated gloves	81.87%	85.16%	84.63%	86.96%
	insulated shoes	69.65%	73.52%	72.32%	74.45%
	safety harness	67.49%	71.13%	70.85%	77.09%
others	smoking	86.92%	90.27%	92.14%	94.07%
	crossing security boundary	82.37%	86.58%	85.06%	92.83%
	throwing wires or tools	85.74%	88.94%	89.75%	94.98%
Average Accuracy		79.43%	83.08%	83.15%	87.26%

Table 3. Results of the different components of the proposed method trained on 60% labeled samples, where ↑ indicates that larger is better and ↓ indicates that smaller is better. The results in bold indicate the best results.

Method	Average Accuracy ↑	Recall ↑	Precision ↑	F1 ↑	FPR ↓	FNR ↓
BS-R-50	83.71%	77.32%	80.31%	78.79%	12.75%	15.25%
AHSL-R-50 (w/o UE)	84.86%	79.85%	82.47%	81.14%	11.00%	13.00%
AHSL-R-50 (w/o IDS)	85.15%	82.34%	84.89%	83.60%	9.25%	11.25%
AHSL-R-50	86.98%	85.23%	86.55%	85.88%	8.00%	9.00%

Table 4. Comparison of average accuracy and training epochs for the existing methods trained on 60% labeled samples. #Epoch indicates the number of iteration rounds for model convergence. The results in bold indicate the best results.

Number	Method	Average Accuracy	#Epoch
1	SVFormer-S [23]	82.69%	50
2	TimeBalance [28]	78.85%	50
3	AFNet [29]	83.76%	40
4	SVFormer-S + AHSL	82.58%	30
5	TimeBalance + AHSL	80.02%	30
6	AFNet + AHSL	84.15%	25

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
