1. Introduction
Electricity has become a basic need in the modern world, as it is used in homes, businesses, and industry. It is distributed to these sectors through a network called the power grid. Technically, the power grid consists of a production side and a demand side, and electricity generation is increased or decreased according to the demand side's needs. Unfortunately, some of the electricity produced is lost during generation, transmission, and distribution. Energy losses are divided into two main classes: technical losses and non-technical losses (NTLs). Various methods, techniques, and tools are in practice or have been proposed to address technical losses.
On the demand side, one of the NTLs is electricity theft. Electricity loss is a serious issue for power utility companies, as it causes major disruption to their operations, leading to loss of revenue, increased generation load, and excessive electricity bills for legitimate consumers. Moreover, electricity loss also causes issues related to economic growth and power infrastructure stability. NTLs, also known as commercial losses, occur mostly due to electricity theft and fraud. Power utility companies still lose large amounts of revenue to electricity theft and fraud committed by consumers. Such theft places a heavy burden on the power grid infrastructure and can result in fires that threaten public safety. It also causes loss of revenue for electricity generation companies [
1,
2,
3]. Addressing the power losses caused by theft is a challenge. Theft can be committed by tampering with electricity meters, launching double-tapping attacks, altering meter readings over communication links, and using shunt devices. It is an open secret that power utilization is strongly connected with the development of a country and is hence a vital measure that shapes the foundation of industrialization. With the ever-increasing demand for power, electricity theft is at a peak. Fossil fuel combustion from electricity generation causes 70% of greenhouse gas (GHG) emissions [
4]. In spite of endeavors to reduce GHG emissions, electricity theft overshadows these endeavors in developing countries. The capacity to generate electric power is diminished as a result of the resources lost to energy theft. Electricity theft also causes unnecessary blackouts and load-shedding, which encourage users to opt for alternative energy resources to fulfill their requirements, including petrol and diesel generators that produce GHG emissions.
The majority of climate talks have focused on how to lower GHG emissions; very few have examined the consequences of energy theft. Smart meters (SMs) are suggested as a strategy to prevent energy theft by continuously monitoring the electrical system and remotely isolating energy-theft hotspots. All transformers, distribution poles, and customer houses should have SMs. The measurements are then transmitted over a communication network to the distribution company's database for examination, and if trouble areas are found, power is cut off remotely. This technology would enhance performance, which would immediately result in a decrease in GHG emissions while also increasing total returns to the distribution firm. It would also promote transparency in the metering process.
Moreover, NTLs cause USD 75 billion in lost revenue in the United States. This amount is enough to power 77,000 households for a year [
5]. A World Bank report shows that China, Brazil, and India suffer 16%, 25%, and 6% losses in electricity supply, respectively [
6]. According to Jokar et al. [
7], such losses are not limited to developing countries; developed countries such as the U.S. and the U.K. bear losses of USD 6 billion and GBP 173 million, respectively, each year. The above discussion shows that an efficient electricity theft detection (ETD) model is required to detect NTLs. In the literature, hardware-based, data-driven, and game-theoretic approaches are used to detect NTLs. Hardware-based approaches use sensors and radio-frequency identification tags to distinguish between honest and malicious samples. However, these approaches are expensive, incur huge maintenance costs, and do not provide optimal results under extreme weather conditions [
3,
8,
9,
10]. Methods based on game theory design a utility function involving electric utilities, stakeholders, and customers. However, it is difficult to formulate an accurate utility function. Moreover, these approaches are less accurate and have a high false-positive rate (FPR) [
11,
12,
13,
14].
The introduction of smart power grids opens new opportunities for ETD. A smart grid is an upgraded version of a conventional power grid and consists of smart meters, sensors, and computing devices that have self-healing mechanisms and communication technologies. The smart meters and sensors obtain data on consumers’ electricity consumption (EC), electricity prices, and the status of the grid infrastructure [
15,
16]. Data-driven approaches are trained on the collected EC data to distinguish between honest and malicious samples. These approaches have received considerable attention from the research community, but they have the following limitations: the curse of dimensionality, class imbalance problems, and low detection rates for standalone ML and DL models. Moreover, conventional ML models such as k-nearest neighbors and naïve Bayes have high FPRs. As mentioned in the literature, electric utilities cannot tolerate low detection rates and high FPRs because they have limited resources for on-site inspection.
This paper presents a hybrid DL model, named HGC, that combines a gated recurrent unit (GRU) and a convolutional neural network (CNN). The GRU extracts temporal features, while the CNN retrieves abstract patterns from EC data; the HGC model thus combines the advantages of both models and outperforms existing ones. Furthermore, an uneven distribution of class samples leads to poor performance: it biases models toward the majority class, which produces incorrect results. In this paper, a hybrid approach consisting of undersampling and oversampling methods is presented to deal with the uneven distribution of class samples. The main contributions of the paper are listed below.
We present an HGC model that combines the advantages of GRU and CNN. It is the first study that combines the advantages of sequential and non-sequential models.
A CNN model extracts latent or abstract patterns, while a GRU retrieves temporal patterns from EC data. The curse of dimensionality is addressed with both DL models.
An adaptive synthetic minority oversampling approach combined with Tomek links is used to address the class imbalance problem (see the sketch after this list).
The performance of the HGC model is evaluated using a real EC dataset obtained from the State Grid Corporation of China (SGCC).
To verify the real efficiency of the proposed model, extensive experimentation is performed based on recall, accuracy, precision, F1 score, and FPR.
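To make the resampling step concrete, the following is a minimal sketch of the hybrid oversampling/undersampling combination. It assumes the imbalanced-learn package and uses its SMOTETomek combination (SMOTE plus Tomek links); the array shapes are placeholders rather than the SGCC dimensions, and an adaptive variant such as ADASYN could be swapped in for SMOTE.

# Minimal sketch of hybrid resampling: SMOTE oversamples the minority (theft)
# class, then Tomek links removes borderline majority samples.
# Assumes the imbalanced-learn package; data shapes are placeholders.
import numpy as np
from imblearn.combine import SMOTETomek

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))            # stand-in for EC feature vectors
y = (rng.random(1000) < 0.1).astype(int)   # ~10% theft samples (imbalanced)

resampler = SMOTETomek(random_state=0)
X_res, y_res = resampler.fit_resample(X, y)
print("class counts before:", np.bincount(y), "after:", np.bincount(y_res))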
The rest of the paper is organized as follows. Section 2 presents an overview of the related literature. Section 3 states the problem, and Section 4 describes the materials and methods. Section 5 outlines the proposed model. Section 6 presents the experimental setting and analysis, and Section 7 presents the experimental results and discussion. Finally, Section 8 concludes the paper.
2. Related Literature
This section reviews the tools and techniques proposed in the literature to detect NTLs. In [
5], a model combining CNN and multilayer perceptron (MLP) is used. It integrates the advantages of both DL models, which is why it gives better results than standalone models. The former is employed to extract hidden, abstract patterns, while the latter is used to extract meaningful information. The class imbalance problem, however, is not addressed, which biases the ML and DL models toward majority class samples and causes them to ignore minority ones. Moreover, MLP does not perform well on sequential datasets. Jokar et al. [
7] propose an electricity theft detector that is developed using an SVM classifier to differentiate between malicious and honest customers. It is the first study that integrates an ML model and hardware devices to capture drift changes in data, which can happen for many reasons, e.g., a different number of members in a household or weather changes. The authors utilize random undersampling to solve the uneven distribution of class samples; however, this technique causes underfitting. Moreover, they utilize hardware devices that make the proposed solution expensive. In [
17], the authors propose a theft detector that contains gradient boosting classifiers. They introduce the concept of stochastic features, which enhance the detection rate and reduce the FPR. They also conduct a comparative study and show that boosting classifiers perform better than SVM on an Irish dataset. In addition, the existing electricity theft cases are updated on the grounds that they bear little resemblance to real-world samples. Random oversampling is employed to handle the uneven distribution of class samples, which creates an overfitting problem. The curse of dimensionality is another major nuisance and reduces the detection rate of ML and DL models. In [
18], the authors use meta-heuristic techniques to select an optimal combination of features from EC data, which mitigates overfitting, memory constraints, and computational overhead. However, they use accuracy as the fitness function to evaluate the efficacy of the meta-heuristic techniques, which is not good practice on imbalanced data.
In [
19], a long short-term memory (LSTM)-based framework is suggested for differentiating between malicious and normal patterns as well as changes due to drift. To our knowledge, this is the first study that considers drift changes alongside malicious patterns and reduces the FPR. Power utilities are unable to bear a high FPR due to their limited resources for on-site inspection. Fenza et al. [
20] propose a model that integrates the benefits of both CNN and random forest. The former is used to obtain abstract features, while the latter is used to differentiate between malicious and normal patterns in EC data. The class imbalance problem is handled using SMOTE, which can cause overfitting. In [
21], a DL model is proposed that integrates the benefits of both LSTM and MLP. This is the first article that has leveraged the benefits of both sequential and non-sequential data. The class imbalance problem is not considered, which is why ML and DL models give biased results. In [
22], an ensemble deep CNN is used for the detection of atypical behaviors in EC data. Imbalanced data, a severe issue in ETD, are handled through random bagging. Finally, a well-known voting ensemble strategy is utilized to decide between malicious and normal patterns. Ghori et al. [
23] conduct a comparison study of different conventional ML classifiers using a real EC dataset. ANN and boosting classifiers such as LightBoost, CatBoost, and XGBoost perform better than the other models. Moreover, the curse of dimensionality is dealt with by selecting an optimal combination of features.
In [
24], the authors put forward an interesting technique for NTL detection using smart meter data. Moreover, auxiliary information is utilized to enhance the accuracy of the ML models. Different features are built using distance- and density-based outlier-detection methods. The proposed model is employed in smart grids to distinguish illegitimate patterns from legitimate ones. In [
25], Hasan et al. put forward the idea of identifying low-voltage stations and compare the performance of supervised and unsupervised learning methods. The suggested method gives better results than SVM and DT-SVM.
Ismail et al. [
26] propose an integrated model of CNN and LSTM. This is the first study that integrates the benefits of both of these DL models. Moreover, the uneven distribution of class samples is another severe issue, and SMOTE is utilized to handle it. The proposed hybrid model achieves 89% accuracy, which is higher than that of conventional ML and DL models.
The poisoning attack problem in smart grids is addressed by Maamar et al. [
27]. They introduce a sequential and parallel DL-based autoencoder based on GRU and LSTM models. The deep neural network performs better than a shallow neural network. In [
28], it is revealed that existing studies mostly monitor attacks on the consumer side; few focus on the distribution side, where hackers compromise utility meters and inflate electricity bills. In their study, the authors introduce a hybrid C-RNN-based model and show that it performs well compared to other DL models. The proposed model is evaluated on SCADA meter readings.
In [
29], a new hybrid approach is introduced that integrates the benefits of k-means clustering and a deep neural network. Irish Smart Energy Trials data are used for model evaluation. However, the proposed model's performance could be improved further by utilizing more advanced clustering algorithms. Shehzad et al. [
30] introduce a smart system for ETD. The system integrates the benefits of statistical methods and different DL models such as MLP, LSTM, RNN, and GRU. The proposed technique is evaluated on real data from Singaporean homes. However, the performance of the suggested technique is not checked using other performance measures such as F1-score, recall, precision, FPR, ROC-AUC, and PR-AUC.
6. Experimental Setting and Analysis
In this section, we analyze the performance of the proposed model on the SGCC dataset using various performance measures. We also compare the results obtained with the proposed model to those of benchmark models.
6.1. Performance Measures
Uneven distribution of class samples is a critical problem in ETD, where the number of samples of the normal class is much higher than that of the malicious class. When an ML or DL model is trained on this type of data, it becomes biased toward majority class samples and ignores minority class samples, producing false results/alarms. The literature indicates that electric utilities cannot tolerate false alarms due to limited resources for on-site inspection. Although the training dataset is balanced with the proposed sampling technique, the test data remain unbalanced. Therefore, appropriate performance measures are needed to evaluate the performance of the benchmark and proposed models. In this paper, the performance measures used are accuracy, precision, recall, F1 score, ROC-AUC, and PR-AUC. To calculate these measures, we use a confusion matrix, which contains the true negative (TN), true positive (TP), false negative (FN), and false positive (FP) counts.
6.1.1. Accuracy
Accuracy is the ratio between the number of correct predictions and the total number of records in the dataset:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FN$, and $FP$ are the numbers of true positives, true negatives, false negatives, and false positives, respectively.
6.1.2. Recall
Recall is determined by dividing the correctly predicted positive records by the total number of positive records. The equation of recall is given below, as described in [33]:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where $FN$ is the number of dishonest consumers predicted by the model as honest consumers.
6.1.3. F1-Score
The F1-score is also a good performance measure for imbalanced datasets. When ML/DL models have a high F1-score, they are considered good for predictions in real-world scenarios. The equation for the F1-score is given below, as described in [34,35]:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Precision is calculated as the number of true positives divided by the sum of true positives and false positives, as mentioned in [33]:

$$\text{Precision} = \frac{TP}{TP + FP}$$
6.1.4. ROC-AUC and PR-AUC
The ROC curve is obtained by plotting recall and FPR on the y-axis and x-axis, respectively. It is a good measure for imbalanced datasets because it is not skewed toward the majority class, and its value ranges from 0 to 1. However, the ROC curve only considers the recall/true positive rate, so it focuses on positive records and ignores the negative ones. The PR curve is another important measure that considers recall and precision simultaneously and gives equal importance to both classes.
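As an illustration (a minimal sketch, not the authors' code), all of the above measures can be computed from a model's outputs with scikit-learn; the labels and scores below are toy placeholders.

# Minimal sketch: computing the reported measures from predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             confusion_matrix)

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])                   # 1 = theft
y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.9, 0.3, 0.2, 0.7])   # model scores
y_pred = (y_prob >= 0.5).astype(int)                          # thresholded labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                                          # false-positive rate

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))            # uses scores
print("PR-AUC   :", average_precision_score(y_true, y_prob))
print("FPR      :", fpr)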
6.2. Implementation Environment
The proposed and benchmark models are implemented using Google Colaboratory [
36], which provides cloud-based computing power. Their performance is studied using the SGCC dataset collected from the largest electric utility in China. The DL models are implemented using TensorFlow (v2.8.2), the ML models are trained and evaluated using the scikit-learn library (v1.0.2), and the Keras API is used to develop the hybrid model.
6.3. Proposed Deep Learning Model Performance Analysis
In this section, we analyze the performance of the proposed model using accuracy and loss curves for training and testing data.
Figure 2 shows the performance of the model on training and test data using accuracy curves. Both curves initially move side by side with only a small gap, indicating that the proposed model does not suffer from overfitting at this stage. However, after the fourth epoch, the test accuracy starts to decrease, which indicates the onset of overfitting. Thus, if the model is trained for more than four epochs, its performance decreases. To improve the model's performance in the future, meta-heuristic algorithms will be used to help select the optimal parameters for deep and machine learning models, as selecting these parameters manually is very complex and time-consuming.
Figure 3 shows the same phenomenon using loss curves on the training and testing data. The value of the loss could be decreased with more epochs; however, there is a high probability that the model would then encounter overfitting, which affects generalization. The proposed model consists of GRU, CNN, and dense layers. The update and reset gates in the GRU layer control the flow of information through the network: they remember valuable information and discard redundant and noisy patterns in the data. The CNN layers help the proposed hybrid model learn global/abstract patterns from EC data and reduce the curse of dimensionality, which directly increases the convergence speed. The literature shows that dropout layers simplify the model and prevent overfitting. Finally, the dense layer takes the inputs from the GRU and CNN branches and passes them to a sigmoid function to distinguish between normal and malicious samples. For all these reasons, the hybrid model performs better than the individual models.
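To make this architecture concrete, the following is a minimal Keras sketch of an HGC-style hybrid under stated assumptions: the layer sizes, sequence length, and training settings are illustrative placeholders, not the paper's tuned configuration.

# Minimal sketch of an HGC-style hybrid: a GRU branch (temporal patterns) and
# a Conv1D branch (abstract patterns) merged into a dense sigmoid classifier.
# All sizes are assumptions, not the authors' tuned hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers, Model

seq_len, n_feat = 1034, 1                    # placeholder: daily EC readings
inp = layers.Input(shape=(seq_len, n_feat))

gru_branch = layers.GRU(64)(inp)             # update/reset gates keep temporal info

cnn_branch = layers.Conv1D(32, 7, activation="relu")(inp)
cnn_branch = layers.MaxPooling1D(2)(cnn_branch)
cnn_branch = layers.Flatten()(cnn_branch)

merged = layers.concatenate([gru_branch, cnn_branch])
merged = layers.Dropout(0.5)(merged)         # dropout to curb overfitting
out = layers.Dense(1, activation="sigmoid")(merged)   # theft vs. normal

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping reflects the observation that test accuracy degrades after
# roughly the fourth epoch.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                        restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=15, batch_size=128, callbacks=[stop])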
6.4. Benchmark Models
This section implements various DL and ML models that have previously been proposed in the literature and compares their performance with that of the proposed hybrid model.
6.4.1. Wide and Deep Convolutional Neural Network
In [
5], Zheng et al. propose a DL model that is a fusion of CNN and ANN. This is the first study to combine the advantages of both models. The authors feed 2D data to the CNN, while 1D data are fed into the ANN to learn local and global patterns from the SGCC dataset. However, the ANN model does not give good results on the sequential 1D data because it is designed for tabular data. In this work, we use the same hyperparameter settings and the same dataset for a fair comparison.
6.4.2. Logistic Regression (LR)
This is a basic supervised learning model used for binary classification. It is also known as a single-layer neural network. It simply contains an input layer whose values are multiplied by weights, and the resulting value is fed into a sigmoid function that produces a probability between 0 and 1, which is thresholded to yield the predicted class. LR offers various solvers, such as Newton's method and stochastic gradient descent, that are used to fit the model parameters.
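For reference, the decision rule described above can be written in its standard textbook form (added here for clarity, not taken from the paper):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \begin{cases} 1, & \text{if } \sigma(\mathbf{w}^{\top}\mathbf{x} + b) \ge 0.5 \\ 0, & \text{otherwise} \end{cases}$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\mathbf{x}$ is the input vector.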
6.4.3. Decision Tree (DT)
DTs are used in both regression and classification tasks. They consist of a root node, edges, and leaf nodes that are used to predict the result. A DT mimics human decision-making and creates a tree-like structure in which the dataset is divided into many branches based on features. The best attributes/features are selected as root nodes based on information gain or the Gini index. DTs are easy to implement and give good results on smaller datasets. However, for larger datasets there is a risk of overfitting. In addition, a small change in the data leads to poor generalization.
6.4.4. Support Vector Machine (SVM)
The SVM is a supervised learning model used for both regression and classification purposes. It can classify linear and nonlinear data by using the power of kernel functions, which draw a decision boundary between normal and malicious samples after converting nonlinear data into linear patterns. In [7], the authors develop an electricity theft detector based on consumption patterns, using an SVM classifier to draw a decision boundary between benign and stolen samples. According to the literature, SVM is well-suited to smaller datasets, as it requires a lot of computational time to draw a decision boundary between normal and malicious patterns in larger datasets. In this work, the RBF kernel is used for the SGCC dataset due to the nonlinearity of the data.
6.4.5. Random Forest (RF)
An ensemble technique called RF is used to solve complex problems by training multiple decision trees on a dataset. It has applications in banking, e-commerce, and other fields. RFs control the overfitting problem of DTs and increase precision, with little adjustment of hyperparameters needed; precision improves further as the number of DTs is increased during training. However, RFs require a lot of computation time on larger datasets, since multiple DTs are trained on a single dataset, which reduces their effectiveness in real-world problems.
6.4.6. Naive Bayes Classifier
This is a classification method derived from Bayes' theorem. Naive Bayes (NB) assumes that the input features are independent of one another and uses probability distributions to distinguish between normal and malicious samples. Many variants have been developed for different types of datasets. NB has applications in various fields, such as sentiment analysis, email filtering, recommender systems, spam detection, and natural language processing. In this work, we use Gaussian NB, since the SGCC dataset has continuous features.
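For illustration, the benchmark classifiers above can be instantiated with scikit-learn as in the following minimal sketch; the synthetic data, default hyperparameters, and the RBF kernel choice (per Section 6.4.4) stand in for the actual experimental configuration.

# Minimal sketch of the benchmark classifiers; synthetic data stands in for
# the preprocessed SGCC arrays, and hyperparameters are scikit-learn defaults.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))                 # stand-in for EC features
y = (rng.random(2000) < 0.1).astype(int)        # imbalanced labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "DT":  DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf", probability=True),  # RBF kernel for nonlinear data
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "NB":  GaussianNB(),                         # continuous EC features
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    print(name, "ROC-AUC:", round(roc_auc_score(y_te, scores), 3))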
7. Experimental Results and Discussions
The performance of the proposed HGC model is compared with that of the state-of-the-art classifiers. The same dataset with different training and testing ratios is used for DT, NB, LR, CNN, GRU, RF, SVM, and WDCNN. As discussed earlier, the CNN design consists of a number of convolution layers with filters (kernels) and pooling layers, followed by one or more fully connected (FC) layers, and applies a softmax function to classify an object with probabilistic values between 0 and 1. Each layer has its own functionality and extracts abstract or latent features that cannot be detected by the human eye.
The GRU layers have two important gates: update and reset. These are used to learn necessary patterns and remove unnecessary values. As discussed earlier, the flow of information is controlled by the GRU gates to improve the performance of the model. The GRU-extracted features are then combined with the latent or abstract patterns. The proposed HGC model extracts abstract and periodic patterns from EC data using CNN and GRU, respectively; hence, HGC outperforms its counterparts. The combination of optimal features helps the HGC attain 0.96 PR-AUC and 0.97 ROC-AUC values, which are higher than those of all the above-mentioned classifiers. The performance of the proposed model is compared with that of the conventional models using PR and ROC curves in
Figure 4 and
Figure 5. The proposed hybrid model achieves better results than its counterparts. SVM achieves 0.88 ROC-AUC and 0.85 PR-AUC. We use a linear kernel instead of an RBF kernel to train the SVM model on the EC data because the dataset contains a large number of records and features; the RBF kernel greatly increases the model's computation time and is therefore not suitable for larger datasets.
LR is a conventional ML model that distinguishes between normal and malicious samples using a sigmoid function. It achieves 0.86 and 0.88 for PR-AUC and ROC-AUC, respectively, which is better than SVM but lower than the other models. It has a large number of applications in various fields because it is easy to implement and is suitable for linearly separable datasets; however, in the SGCC dataset, malicious and normal samples are not linearly separable. Therefore, LR gives lower performance compared to other models [
30].
RF achieves 0.76 PR-AUC and 0.75 ROC-AUC, while DT achieves 0.80 ROC-AUC and 0.85 PR-AUC on the EC dataset; that is, DT gives better results than RF. DT provides good performance on smaller datasets but overfits on larger datasets, and small changes in the data reduce its generalization ability. RF is an ensemble method designed to overcome the overfitting/low generalization of DT. It controls overfitting but has low PR-AUC and ROC-AUC, as seen in
Figure 4 and
Figure 5, because RF takes the average of all DT prediction results.
In addition, NB is a conventional classifier that distinguishes between normal and malicious samples using Bayes' theorem. It obtains 0.71 and 0.65 PR-AUC and ROC-AUC values, respectively. Unlike the other conventional ML and ensemble models, it gives poor results because it assumes an independence relationship among the attributes that does not hold in EC data.
Moreover, CNN attains 0.96 ROC-AUC and 0.94 PR-AUC values, while GRU attains 0.96 ROC-AUC and 0.96 PR-AUC values on the EC dataset, which are higher than the values of the conventional ML models. Technically, a CNN consists of a number of convolution layers with filters (kernels) and pooling layers, followed by one or more fully connected (FC) layers. The convolutional layers are used to remove redundant, overlapping, and noisy values from the EC data. GRU also gives good results that are in the acceptable range, as it has update and reset gates that help it remember periodic patterns. In [5], the authors combine the merits of the ANN and CNN models to develop a hybrid model, which achieves 0.96 PR-AUC and 0.97 ROC-AUC. In the literature, the authors demonstrate that hybrid models perform better than standalone DL and ML models. Therefore, in this research, the Keras API is used to develop a hybrid model that integrates the advantages of both the GRU and CNN models. The former learns the temporal patterns, while the latter derives global and abstract patterns from EC data. The extracted features of both models are merged and passed to a fully connected layer for the classification of theft and normal patterns. For the above reasons, the proposed model achieves better results than the standalone DL models and the previously proposed hybrid DL models: it achieves a 0.987 ROC-AUC value and a 0.985 PR-AUC value on the EC data, as observed in
Table 5 and
Table 6.
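For completeness, the update and reset gating discussed above follows the standard GRU formulation (textbook form, added for clarity rather than taken from the paper):

$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1})$$

$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $z_t$ is the update gate, $r_t$ is the reset gate, $h_t$ is the hidden state, and $\odot$ denotes element-wise multiplication.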
Table 5 and
Table 6 show the performance analysis of the ML and DL models at 70% and 60% training ratios, respectively. It can be seen that the proposed model maintains its superiority and gives better results at both training ratios. For the DL models, performance increases as the size of the training data increases because DL models are inherently sensitive to the size of the training data. On the other hand, the performance of conventional ML models follows the power law [37]. This law states that the performance of ML models increases with the amount of data only up to a certain point; beyond this point, the models face the problem of overfitting, which affects their generalizability. In this work, RF and NB give poor results compared to the other conventional ML models. Although both models perform well on balanced datasets, they show poor performance here due to the following limitations.
NB assumes an independence relationship among features that does not exist in real EC data, while RF controls overfitting by averaging the performance of all of its DTs. The literature shows that the performance of DL models depends on the size of the training data: large datasets yield high values for the performance measures. A ROC analysis of different hybrid models is given in
Table 7. In [
38], CNN-LSTM and LSTM RUSBoost achieve 0.817 and 0.879 ROC values, respectively, while in [
30], MLP–LSTM achieves 0.92 ROC, and HG
achieves 0.93 ROC. Our proposed model maintains its superiority and performs better than the above-mentioned hybrid models, achieving 0.98 ROC.
The computation time of the ML and DL models is given in
Table 8. NB and LR have lower computation times than the other ML models: the former only computes the probability distribution of the features and provides the final results, whereas LR is a single-layer neural network that multiplies the inputs with weights and distinguishes between malicious and normal samples. For these reasons, they require little computational time compared to other ML models.
In ETD, SVM is a well-known classifier. RF requires more training time than DT because it trains multiple DTs on the SGCC dataset and averages the multiple estimators. Moreover, the training time of DL models depends on the number of hidden layers, the size of the dataset, the batch size, and the number of neurons in each layer. GRU and CNN take 2364 and 202 seconds to train, respectively. GRU requires more training time because its update and reset gates extract temporal patterns from the SGCC data and save the important information in memory networks, while CNN only retrieves abstract/latent patterns using convolution functions and max-pooling layers, which is why it has a low computation time. Moreover, HGC takes 1704 seconds to train on the SGCC dataset. It has a lower computation time than GRU because it converges in 5 epochs, whereas GRU converges in 15 epochs. In addition, HGC requires more training time than the CNN model because it integrates the benefits of both models. Moreover, meta-heuristic techniques are currently receiving attention from the research community for feature selection and hyperparameter optimization in ML and DL models. Therefore, in this study, BHA, a meta-heuristic technique, is used for feature selection. The literature demonstrates that these techniques have high computational complexity. For this reason, a small portion of the dataset is used to evaluate the ability of BHA for feature selection: the selected data consist of 10,000 records and 30 days of EC values out of the 42,372 records. BHA takes 3000 seconds to select the optimal combination of features/attributes from the selected EC data, which is more than the time required by all the DL models: GRU, CNN, WDCNN, and HGC. These results show that the computational time of BHA increases as the amount of data increases; therefore, such techniques are not suitable for real-time smart grid applications. Moreover, an increased dataset size enhances the performance of DL models; hence, the performance of these models depends on the size of the training dataset. In the case of conventional ML models, the performance is enhanced following the power law, and it stops improving after a certain point of training [37].
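Since the paper does not list the BHA procedure itself, the following is a minimal toy sketch of black-hole-style feature selection under common assumptions (BHA taken to denote the black hole optimization algorithm; binary feature masks thresholded from continuous star positions; cross-validated classifier accuracy as the fitness function; synthetic data standing in for the EC records):

# Toy sketch of black-hole-based feature selection (all settings assumed).
# Candidate "stars" are feature masks; the fittest star is the black hole;
# stars drift toward it, and stars crossing the event horizon re-spawn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))                 # stand-in for EC records
y = (rng.random(500) < 0.5).astype(int)

def fitness(mask):
    # Fitness of a feature subset: mean CV accuracy of a simple classifier.
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(max_iter=500)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

n_stars, n_iter = 10, 20
n_feat = X.shape[1]
pos = rng.random((n_stars, n_feat))            # star positions in [0, 1]^n_feat
masks = (pos > 0.5).astype(int)                # binary feature masks
fit = np.array([fitness(m) for m in masks])

for _ in range(n_iter):
    bh = fit.argmax()                          # fittest star acts as the black hole
    pos += rng.random((n_stars, 1)) * (pos[bh] - pos)   # stars drift toward it
    radius = fit[bh] / fit.sum()               # event-horizon radius
    for i in range(n_stars):
        if i != bh and np.linalg.norm(pos[i] - pos[bh]) < radius:
            pos[i] = rng.random(n_feat)        # swallowed star re-spawns randomly
        masks[i] = (pos[i] > 0.5).astype(int)
        fit[i] = fitness(masks[i])

best = masks[fit.argmax()]
print("selected features:", np.flatnonzero(best), "fitness:", round(fit.max(), 3))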
From the literature, hybrid models work well because they combine the strengths of both constituent DL models and have better generalization capabilities than many other machine and deep learning models. HGC maintains its dominance over the state-of-the-art DL models and shows better performance across a variety of training ratios on the SGCC dataset. That said, there is no free lunch: the cost-benefit analysis is a trade-off between computational time and accuracy. The proposed algorithm is computationally expensive, but it provides higher accuracy than the other algorithms used for comparison. With more and more computational resources available these days, researchers are focusing on algorithms that provide better efficiency in the face of large-scale data.
8. Conclusions and Future Work
Electricity theft is an unavoidable issue that causes power losses in both developed and developing countries. As a result, power utility companies face major disruptions in their operations, leading to loss of revenue. Moreover, electricity loss also causes issues with economic growth and power infrastructure stability. In this study, a combined DL model for NTL detection is presented that incorporates a GRU and a CNN. The EC data are pre-processed by normalization to remove null and undefined values. In addition, the uneven distribution of class samples is another problem in ETD that affects the effectiveness of ML and DL models; in this paper, a hybrid resampling approach is used to address it. The performance of the proposed model is evaluated on the real SGCC dataset using various performance metrics and compared with SVM, LR, CNN, GRU, RF, DT, NB, and WDCNN. The model achieves 0.987 ROC-AUC, 0.985 PR-AUC, 0.94 accuracy, 0.94 F1-score, and 0.91 recall on the SGCC dataset. These results are better than those of the other ML and DL models. However, despite outperforming the alternative techniques, the proposed model is sensitive to changes in the input data. The presented model will help many industrial applications to identify normal and abnormal samples or records. To improve the model's performance and avoid overfitting, meta-heuristic algorithms can help select the optimal parameters for deep and machine learning models, as it is very complex and time-consuming to select these parameters manually.
In the future, meta-heuristic techniques will be used to achieve optimal hyperparameter tuning in DL models.