2.1. Challenges in the Smart Meter Big Data Analytics
In reality, a smart meter is typically installed at the main power switch of a household, enabling the measurement and transmission of electricity consumption data at frequent intervals, ranging from every few seconds to minutes [
8]. In certain cases, multiple meters may be deployed collectively, with each meter capturing the electricity usage of a circuit or a major electrical appliance, such as a refrigerator, air conditioner, washing machine, and more [
14,
15]. Undoubtedly, the aggregation of household electricity consumption data has catalyzed the rise of big data analytics in recent years [
16,
17]. Various institutions, such as the National Science Foundation of the United States, the Engineering and Science Research Institute of the United Kingdom, and the Smart City Innovation Center of Denmark, have supported numerous studies on the smart meter big data analytics [
18]. The number of related papers has been steadily increasing since 2012. The following discussion covers three significant themes drawn from the literature: (1) Load management, which builds on load forecasting. Predicting household electricity consumption makes it possible to classify households into groups and manage them appropriately from the demand side [1, 19]. For instance, during periods of tight power supply, the households best suited to reducing their consumption can be identified. Alternatively, time-based electricity pricing can be used to incentivize households to save electricity during specific time slots. (2) Load characteristic analysis, which groups electricity consumption behaviors using the extensive data collected from households. This analysis supports comparisons among peer households and a deeper understanding of consumption patterns [20, 21]. For example, activities such as morning washing or evening meal preparation can be categorized to identify common trends. (3) Electricity theft detection, which is a distinct application of smart meter big data analytics. It focuses on identifying prolonged illicit electricity consumption and is a specialized form of abnormal data detection [22, 23]. This analysis requires electricity consumption data covering a longer period and is of particular interest to power companies aiming to reduce losses from unauthorized consumption.
In addition, different brands of smart meters may vary in the quality of their measurement and/or transmission [24]. Consequently, managing and analyzing smart meter big data raises challenges in data storage and pre-processing. To illustrate, a smart meter that records power consumption every second generates 31,536,000 records per year (60 × 60 × 24 × 365), which underscores the data volume that a single meter can produce. Effective strategies for data storage, management, pre-processing, and analysis are therefore needed to obtain reliable results from such a large volume of data [24].
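The record count above can be checked, and extended to other sampling intervals, with a few lines of Python. This is a minimal sketch: the function name `records_per_year`, the 365-day year, and the 100-meter building (an upper bound taken from the "fewer than 100 households" setting of this study) are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year, as in the figure quoted above


def records_per_year(interval_seconds: int, meters: int = 1) -> int:
    """Records produced in one year when each meter reports at a fixed interval."""
    return (SECONDS_PER_DAY * DAYS_PER_YEAR // interval_seconds) * meters


per_second = records_per_year(1)            # 31,536,000 (one meter, 1 s interval)
per_minute = records_per_year(60)           # 525,600
per_15_min = records_per_year(15 * 60)      # 35,040
building = records_per_year(1, meters=100)  # 3,153,600,000 (hypothetical 100 meters)

print(per_second, per_minute, per_15_min, building)
```

Moving from a 1 s to a 15 min interval reduces the yearly volume per meter by a factor of 900, which is why the choice of time granularity matters for storage planning.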
At present, two primary technologies are used for storing and managing such big data [8, 19, 20, 21, 25]. The first is the relational database, the most prevalent and well-established technology, for which SQL is the standard query and data-processing language. The second, exemplified by Apache Hive 3.1.3, is the distributed database, which is well suited to managing a large and continually growing volume of data [26]. It relies on the Hadoop Distributed File System (HDFS) to integrate additional databases seamlessly as the data size increases [26]. Moreover, from the perspective of smart meter manufacturers, the majority of devices currently output power consumption data in CSV format. Consequently, programming scripts are needed for data pre-processing, converting the CSV records into one of the database systems mentioned above [
21]. When deciding between a relational or distributed database for smart meter data, it is essential to recognize that this study focuses on a building with fewer than 100 households, each equipped with a smart meter. Despite the substantial data volume, managing it typically remains feasible without necessitating multiple database servers to handle advanced functions, such as load balancing [
8]. Therefore, in such a setting, a relational database such as PostgreSQL, rather than a distributed option such as Apache Hive, is likely the better choice for storing smart meter data. This preference arises from the relational database’s proficiency in handling programming scripts for diverse data pre-processing tasks [
19,
20,
25]. Additionally, the main focus of smart meter big data analytics often revolves around electricity consumption records of individual buildings over a maximum three-year period. Such dataset sizes comfortably fit within the capacities of a relational database, offering cost benefits, ease of deployment, and management efficiency. Another prominent tool in the realm of big data analytics is Apache Spark 3.2.4, an analytics framework built on a database foundation, engineered to expedite data querying and processing [
27]. However, Apache Spark demands a substantial memory allocation to function optimally [
27]. In the context assumed in this study, where a building might have just one database server for all electricity consumption data, introducing an additional server to accommodate Apache Spark’s high memory needs might not be the most economical approach. In lieu of this, utilizing a relational database with standard SQL and self-developed data-processing programs seems more than adequate for managing various tasks across the outlined scenarios.
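The CSV-to-database pre-processing step described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this study: SQLite (from the Python standard library) stands in for PostgreSQL so that the example is self-contained, and the column names `meter_id`, `timestamp`, and `watts`, the table name `readings`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; real meters use vendor-specific column layouts.
SAMPLE_CSV = """meter_id,timestamp,watts
A01,2023-01-01 00:00:00,120.5
A01,2023-01-01 00:01:00,
A01,2023-01-01 00:02:00,130.0
A02,2023-01-01 00:00:00,not_a_number
A02,2023-01-01 00:01:00,89.0
"""


def load_csv(conn, csv_file):
    """Insert rows with a numeric reading into `readings`; return (loaded, skipped)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " meter_id TEXT NOT NULL,"
        " ts TEXT NOT NULL,"
        " watts REAL NOT NULL,"
        " PRIMARY KEY (meter_id, ts))"
    )
    loaded = skipped = 0
    for row in csv.DictReader(csv_file):
        try:
            watts = float(row["watts"])
        except (TypeError, ValueError):
            skipped += 1  # empty or malformed reading: leave it for later gap handling
            continue
        conn.execute(
            "INSERT OR REPLACE INTO readings (meter_id, ts, watts) VALUES (?, ?, ?)",
            (row["meter_id"], row["timestamp"], watts),
        )
        loaded += 1
    conn.commit()
    return loaded, skipped


conn = sqlite3.connect(":memory:")
loaded, skipped = load_csv(conn, io.StringIO(SAMPLE_CSV))
print(loaded, skipped)  # 3 rows loaded, 2 rows skipped
```

Skipping unparsable rows rather than guessing values keeps the stored records faithful to what the meter reported; how the resulting gaps are filled is a separate pre-processing decision.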
Furthermore, the time granularity of electricity consumption records often varies significantly, especially across different brands of smart meters [28, 29]. For example, records may occasionally be missing, so that only one or two power consumption records are available per 15 min even though consumption is intended to be measured every minute [30]. Consequently, the number of valid electricity usage records within a given time interval, such as 15 min, can fluctuate [30]. The sampling interval of smart meters discussed in the literature also varies widely, from thousands of samples per second (i.e., in the kHz range) to one sample every two hours [
28,
29]. Instantaneous electricity consumption data collected every few minutes are usually sufficient for most analyses and help residents monitor and conserve electricity. However, higher sampling frequencies are recommended for electricity billing, given the varying periods and progressive tariff structures [
7]. In fact, managing the large storage space and complexities associated with electricity consumption records has been a topic of discussion, with proposed methods for maintaining accuracy and reducing the data volume [
8,
31]. Therefore, prior to conducting the smart meter big data analytics, it is necessary to pre-process the raw data to ensure the presence of an electricity consumption record on the time axis every 5 or 15 min [
32,
33]. Moreover, when analyzing the electricity consumption behavior of residents, it may be necessary to prepare a separate dataset comprising electricity consumption records at analysis time intervals of 30 or 60 min [
32,
33]. Essentially, the time granularity of the data collection should be determined based on the analysis algorithm employed [
34]. The original electricity consumption records should be preserved unchanged in the database, while data at other time granularities should be generated on demand through a mechanism akin to a database view.
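The idea of keeping the raw records intact while deriving coarser granularities through a database view can be sketched as follows. The example is a sketch under assumptions: SQLite stands in for the relational database, the table `readings`, the view `readings_15min`, and the sample values are hypothetical, and averaging the readings inside each 15 min bucket is one possible aggregation that the source does not prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT PRIMARY KEY, watts REAL NOT NULL)")

# Hypothetical raw records with irregular spacing and a gap between 00:30 and 00:45.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [
        ("2023-01-01 00:00:00", 100.0),
        ("2023-01-01 00:01:00", 200.0),
        ("2023-01-01 00:14:00", 300.0),
        ("2023-01-01 00:15:00", 400.0),
        ("2023-01-01 00:29:00", 600.0),
        ("2023-01-01 00:45:00", 50.0),
    ],
)

# The view floors each timestamp to the start of its 900 s (15 min) bucket.
# It stores no data of its own, so the raw table stays untouched.
conn.execute(
    """
    CREATE VIEW readings_15min AS
    SELECT datetime((CAST(strftime('%s', ts) AS INTEGER) / 900) * 900,
                    'unixepoch') AS bucket_start,
           AVG(watts) AS avg_watts,
           COUNT(*)   AS n_records
    FROM readings
    GROUP BY bucket_start
    """
)

rows = conn.execute(
    "SELECT bucket_start, avg_watts, n_records FROM readings_15min ORDER BY bucket_start"
).fetchall()
for row in rows:
    print(row)
```

In this sketch the 00:30 bucket is absent from the result because no raw record falls inside it. A view of this kind therefore exposes gaps rather than filling them, and the `n_records` column shows how many raw records support each aggregated value.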
In summary, analyzing the electricity consumption records of all buildings together at the city level may raise concerns about breaches of personal privacy [25]. Further, electricity consumption records spanning five years or longer seem better stored and analyzed by the power company. A database server at the building or household level is better suited to managing records within a three-year timeframe. Hence, it is recommended to deploy a distributed database on the power company’s side, while the server at the building or household level can use a contemporary relational database. This configuration is deemed sufficient for handling the big data generated by smart meters under the given assumptions.
2.2. Difficulties of Predicting Household-Level Power Consumption
It is important to recognize that the mere installation of smart meters does not directly lead to reduced electricity bills for residents [
24]. To achieve significant energy savings, it is essential to combine smart meters with comprehensive information services that can effectively influence residents’ behavior and promote electricity conservation [
35]. One common example of such services is electricity consumption forecasting, combined with time-of-use (TOU) rates. This service can suggest optimal usage times for electrical appliances, encouraging households to adjust their consumption habits to save money [
7,
24,
36,
37]. Indeed, electricity consumption prediction has long been a focus of research in the field of energy conservation and carbon reduction [
38]. Initially, it emerged from energy consumption simulations during the building design stage, estimating air conditioning or lighting energy usage based on building characteristics such as orientation and the window opening rate [
38,
39]. With the proliferation of Internet of Things (IoT) sensors, various algorithms have been developed in the literature to estimate the electricity consumption of buildings or households using sensed data [
30,
40,
41,
42]. In addition, rapid advances in artificial intelligence in recent years have contributed to more accurate household electricity consumption prediction models that align better with real-world needs. It is through the combination of smart meters, advanced prediction techniques, and behavioral changes that substantial progress can be made in achieving energy savings and promoting sustainable practices. Load forecasting, the prediction of future power consumption, can be accomplished through two primary methods: (1) predicting household behavior by leveraging various sensors (e.g., motion, temperature, and humidity sensors) to estimate power consumption, sometimes on a per-appliance basis [
7,
31,
36,
37], and (2) directly utilizing extensive electricity consumption data from smart meters to project future consumption [
19].
Despite the clear demand for load forecasting and the availability of numerous methods [1], achieving high forecasting accuracy remains a challenge. Some studies have reported forecasting errors of up to 300% [8]. Several metrics are available to assess the accuracy of electricity consumption forecasting, and they fall broadly into two categories. The first category is suitable for comparing the accuracy of different forecasting methods on the same dataset and includes Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The second category expresses the error of each prediction method as a percentage, which makes it applicable across forecasting methods and datasets; an example is Mean Absolute Percentage Error (MAPE), as shown in Equation (1):
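$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - y_{\mathrm{pred},t}}{y_t} \right| \qquad (1)$$
where
$t$ represents a specific record in the electricity consumption dataset,
$n$ represents the total number of records,
$y_t$ represents the actual power consumption value, and
$y_{\mathrm{pred},t}$ represents the predicted power consumption value.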
Hence, MAPE stands out as an adequate indicator for comparing error levels among various prediction methods across different electricity consumption datasets. Moreover, considering the wide array of prediction techniques and metrics utilized in the literature, the research team conducted a review of three recent articles that employed deep learning methods for electricity consumption prediction, all of which employed the MAPE metric to demonstrate prediction performance. The first article employs the LSTM method to forecast the power consumption of air-conditioning systems in multiple factory settings, achieving a MAPE value of approximately 10% [
10]. The second article utilizes a general deep learning approach as well as a custom CNN and LSTM method for predicting power consumption in the well-known IHEPC public household power usage dataset, yielding a MAPE value of around 30% [
11]. The third article applies the LSTM method to predict real-time electricity consumption in school buildings, resulting in MAPE values ranging from 5% to 30% on various occasions [
12].
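The distinction between the two metric categories discussed above can be illustrated with a short Python sketch. The function names and the sample values are hypothetical; MAPE follows the standard definition with actual values $y_t$ and predicted values $y_{\mathrm{pred},t}$, and it is undefined whenever an actual value is zero, which the sketch treats as an error.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error, in the same unit as the data."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error, in the squared unit of the data."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error, in the same unit as the data."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; requires nonzero actual values."""
    if any(y == 0 for y in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 / len(actual) * sum(
        abs((y - p) / y) for y, p in zip(actual, predicted)
    )


# Hypothetical readings in watts, then the same readings expressed in kilowatts.
actual_w = [100.0, 200.0, 400.0]
pred_w = [110.0, 180.0, 400.0]
actual_kw = [y / 1000 for y in actual_w]
pred_kw = [p / 1000 for p in pred_w]

print(mae(actual_w, pred_w), mae(actual_kw, pred_kw))    # differs by a factor of 1000
print(mape(actual_w, pred_w), mape(actual_kw, pred_kw))  # approximately equal
```

MAE, MSE, and RMSE change with the unit and magnitude of the data, whereas MAPE does not, which is why only the latter is compared across datasets here. The zero-value restriction matters in practice for household data, where readings close to 0 W can inflate the percentage error.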
The studies reported in [10, 12] show that when electricity consumption data are collected in relatively simple environments, such as factory air-conditioning systems or office settings such as schools, the MAPE values are typically small. Even so, a MAPE of about 10% remains, attributed primarily to instances of high electricity consumption [10, 12]. In essence, when only the high-power records in the dataset are considered, such as predicting consumption exceeding 3000 W, the MAPE value tends to increase because few such data points are available. Conversely, for electricity consumption data from typical households, the MAPE value remains high, around 30%, even with enhanced deep learning techniques. The literature thus confirms that predicting electricity consumption for general households is difficult. Addressing this requires robust data pre-processing strategies and predictive algorithms tailored to the specific forecasting needs of households, so that residents can fully benefit from smart meters.
Finally, the literature shows that when analyzing electricity consumption at the city or community level, the overall prediction accuracy tends to be relatively high due to the larger scale [
25]. On the other hand, when the analysis is performed at the household level, while it provides the most relevant insights for individual households, the prediction accuracy is not always ideal [
40]. To improve the accuracy of household-level electricity consumption prediction, it is necessary to augment smart meter data with additional environmental factors (such as temperature and humidity, illuminance, etc.), building characteristics (such as orientation, indoor area, building materials, etc.) [
41,
43,
44], and the information pertaining to residents’ daily schedules and activities [
45]. However, collecting and analyzing additional personal privacy data to enhance the precision of electricity consumption forecasting may raise concerns and potentially discourage residents from installing smart meters [
24,
34]. Further, recent literature indicates that, in order to enhance accuracy, it is necessary to integrate a greater number of sensors for monitoring buildings, residential environments, and residents’ behavior [
37]. Nevertheless, this raises concerns about personal privacy, making it difficult for the public to accept such intrusive monitoring practices [
34].
2.3. The Need for Predicting Unsafe Power Usage Events
Overall, the existing literature on the big data analytics of smart meters primarily focuses on power consumption prediction. Other applications are relatively scarce, possibly due to the recent implementation of smart meters in developed countries and the gradual accumulation of electricity consumption data [
3]. However, British scholars have explored the relationship between smart meters and residential fires, indicating a slight correlation resulting from flaws in the process of replacing old meters with new smart meters, leading to incomplete wiring and subsequent fire incidents [
3]. From the perspective of residents’ concerns, electrical safety is undoubtedly one of the most important issues. Nevertheless, there is currently limited literature utilizing smart meter big data to predict unsafe electricity consumption that can lead to electrical fires.
Predicting instances of unsafe electricity usage within a household is comparatively less complex than the task of household-level electricity consumption prediction, as it can solely rely on the household’s historical smart meter big data. Moreover, such predictive capabilities offer significant advantages to households while requiring a lesser amount of personal privacy information. The underlying assumption for predicting unsafe electricity consumption is similar to load prediction, assuming no significant changes in residents’ behavior (e.g., prolonged absences or tenant turnover), enabling algorithms to utilize past electricity consumption data to forecast future patterns. Previous literature has explored abnormal power consumption from two main perspectives: the power company side and the user side [
46]. The power company side focuses on identifying cases of electricity theft and detecting discrepancies between actual and measured power consumption [
22,
46]. The user side examines abnormal power consumption patterns related to appliances and explores strategies for power-saving appliance replacements [
47]. However, incidents of extremely high power consumption that lead to actual electrical fires are rare, so the records of interest make up only a small fraction of the dataset. Rebalancing the data by simply expanding the records of extremely high power consumption is therefore not feasible, and deleting records of normal power consumption is not appropriate either.
Thus, it is believed that the integration of such warning mechanisms into smart meters could offer a valuable capability to promptly notify residents about potentially unsafe electricity usage. This would empower residents to take timely corrective actions, thereby ensuring the safety of their homes. Prioritizing these advanced predictive features should be regarded as a fundamental service provided by smart meters [
7], ultimately enhancing residents’ confidence in their effectiveness. This increased trust is expected to stimulate higher rates of smart meter adoption.