The proposed method in this paper was validated and analyzed using a real-world dataset. The software platform used the Anaconda3 distribution with Python 3.6. The hardware platform was a desktop computer with an Intel i7-8700 CPU and 16 GB of RAM. The experimental data used in this study were sourced from the UK-DALE dataset, specifically the sampling data from House 1 published by Imperial College London. This dataset provides device-level and total power data for 50 appliances in a single UK household over 385 days, along with records of state-change events for each device. The sampling interval is 6 s.
4.1. Comparison of Similarity of Load Characteristics
The accuracy of NILM load identification largely depends on the selected load features. In this paper, super states are formed by aggregating the internal states of all target devices. If the target devices include appliances or appliance combinations with similar power values and change patterns, super states with similar observed values and change patterns may be generated, which can adversely affect the accuracy of load disaggregation.
To validate the ability of the KLE feature matrix to reduce the similarity of load features, we first identified appliances and appliance combinations with highly similar power features. By comparing the power waveforms of appliances in the dataset, three devices were selected: a television (TV), a refrigerator (REF), and a desktop computer (PC). Among these, the power curves of the television and refrigerator exhibit high similarity, and the power curve of the television–refrigerator combination is highly similar to that of the desktop computer. For each device and device combination, a representative portion of power samples was selected, with the sample waveforms shown in Figure 4.
It can be seen that the power values of the refrigerator and television are mainly distributed in the range of 80 W to 120 W, with very similar fluctuations near 90 W. Comparing the power waveform of the computer with that of the television–refrigerator combination, the power values and waveform change patterns also show high similarity. For these devices and device combinations, it is difficult to achieve effective discrimination using traditional steady-state features such as power and current.
The binning method was used to quantify the internal states of the super states for the television, refrigerator, computer, and television–refrigerator combination. The power curves in Figure 4 were converted into the corresponding quantized-state sequence diagrams, with the results shown in Figure 5. For simplicity in labeling, the quantized states of the television, computer, refrigerator, and television–refrigerator combination are denoted as TV, PC, REF, and TV-REF, respectively.
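As an illustration of this quantization step, the following sketch maps a power trace to discrete state labels with uniform bins; the bin count and upper power limit are hypothetical settings, not the paper's exact configuration.

```python
import numpy as np

# Minimal sketch of binning-based state quantization (hypothetical bin settings).
def quantize_states(power, n_bins=4, p_max=250.0):
    """Map a power trace (W) to integer state labels via uniform binning."""
    edges = np.linspace(0.0, p_max, n_bins + 1)  # uniform bin edges in watts
    return np.digitize(power, edges[1:-1])       # labels in 0 .. n_bins-1

power = np.array([0.0, 85.0, 92.0, 110.0, 240.0, 5.0])  # toy power samples
print(quantize_states(power))                            # -> [0 1 1 1 3 0]
```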
By examining Figure 4 and Figure 5, it can be observed that the states of the television and refrigerator have a high degree of similarity. This implies that if these devices or combinations are included in a super state, and the other device states remain the same, the power magnitude and waveform of the super states in which the television is on and the refrigerator is off, or vice versa, will be very similar. Similarly, the computer's state is highly similar to that of the television–refrigerator combination, which can also lead to misclassification during load disaggregation. The experiment uses these states as examples to demonstrate the ability of the KLE feature extraction method to reduce the similarity of load feature curves with close values and change patterns.
To quantitatively compare the impact of different load feature selections on the similarity of various device states or super states, the similarity index in Equation (25) is used to represent the degree of similarity between the load features of two states. In Equation (25), $s_i$ and $s_j$ represent any two device states or super states, $f$ denotes the load feature currently being compared, and $\mathbf{x}_i^{f}$ and $\mathbf{x}_j^{f}$ represent the sample vectors of $s_i$ and $s_j$ under feature $f$. By calculating the similarity index for a specific type of load feature, the similarity and identifiability of appliances under that load feature can be quantitatively compared: the higher the similarity, the more challenging it is to distinguish between the states.
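Assuming Equation (25) takes the common form of a normalized cosine similarity between the two feature sample vectors, it can be written as:

```latex
% Assumed cosine-similarity form of the index in Equation (25).
\mathrm{Sim}(s_i, s_j)
  = \frac{\langle \mathbf{x}_i^{f},\, \mathbf{x}_j^{f} \rangle}
         {\| \mathbf{x}_i^{f} \|_2 \, \| \mathbf{x}_j^{f} \|_2}
```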
Based on the analysis of Figure 4 and Figure 5, the following state pairs were selected for comparison: TV(1) and REF(1), PC(2) and TV-REF(2), and PC(2) and TV-REF(3). Equation (25) was used to calculate the similarity indices for both the power features and the KLE features of these device states.
The power features and the corresponding KLE feature matrices of TV(1), PC(2), and REF(1) are shown in Figure 6.
Figure 7 compares the similarity indices of the three state groups under different load features. It can be observed that when using power features, the similarity indices of all three comparison groups are above 0.8, indicating a high degree of similarity. In contrast, the similarity indices of the KLE feature matrices are significantly lower, all falling between 0.3 and 0.4. This indicates that KLE features can reduce the similarity between device states and super states with similar power curves, which is beneficial for load disaggregation and state detection.
The similarity index experiment confirms that KLE features can increase the distinction between load features at the feature level. The DSM-DDL algorithm further enhances the ability to identify devices with similar features at the algorithmic level.
4.3. Accuracy Testing Criteria
First, three parameters are defined. For a super state $S_i$, the load disaggregation results are categorized as follows: TP (true positive) means that $S_i$ is correctly identified as $S_i$; FP (false positive) means that another super state is incorrectly identified as $S_i$; and FN (false negative) means that $S_i$ is incorrectly identified as another super state.
Compared with the classical f-score criterion, the finite-state f-score (FS-f score) method [19] with local penalty measures transforms the binary true-positive (TP) attribute into a measurement over discrete values, making it better suited to non-binary classification tasks such as state classification and power estimation. The FS-f score is defined as follows.
The inaccurate portion of true positives (inacc) penalizes the error between the estimated state and the real state at each time step, where $T$ is the length of the time series, $\hat{y}_t$ is the estimated power value of the super state at time $t$, $y_t$ is the power observation value at time $t$, and $K$ is the total number of super states. The precision and recall of the FS-f score are defined in terms of TP, FP, FN, and inacc, and the FS-f score is the harmonic mean of precision and recall.
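Following the finite-state f-score formulation of [19] and the symbols defined above, these quantities take approximately the following form (a sketch; the paper's exact normalization may differ):

```latex
% Sketch of the FS-f score, following the finite-state f-score of [19].
\mathrm{inacc} = \sum_{t=1}^{T} \frac{\lvert \hat{y}_t - y_t \rvert}{K}, \quad
P = \frac{\mathrm{TP} - \mathrm{inacc}}{\mathrm{TP} + \mathrm{FP}}, \quad
R = \frac{\mathrm{TP} - \mathrm{inacc}}{\mathrm{TP} + \mathrm{FN}}, \quad
\text{FS-}f = \frac{2PR}{P + R}
```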
To account for the algorithm's power consumption estimation error, the estimation accuracy (Est.Acc) is also defined: the higher the Est.Acc value, the more accurate the algorithm's energy consumption estimation.
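Using the notation above, this metric commonly takes the following form in the NILM literature (the paper's exact definition may differ):

```latex
% Common NILM estimation-accuracy metric (sketch).
\mathrm{Est.Acc} = 1 - \frac{\sum_{t=1}^{T} \lvert \hat{y}_t - y_t \rvert}
                           {2 \sum_{t=1}^{T} y_t}
```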
4.4. Analysis of Algorithm Efficiency
The super state includes all possible combinations of the internal states of the target devices. As with the state space of multi-agent systems, the number of super states grows exponentially [20] as the number of devices increases. For example, if two devices with three internal states each form a super state, the number of possible super states is 3² = 9; for three such devices, it increases to 3³ = 27. However, this is only the theoretical count. An analysis of datasets such as AMPds [21] and UK-DALE reveals that the super states that actually appear in the data are highly sparse. This sparsity is determined by household electricity usage patterns and the operating principles of the appliances. For instance, a washing machine's state sequence might be wash–rinse–spin, but it is unlikely to follow the sequence spin–wash–rinse. Additionally, some appliances are only used at specific times of the day, such as lights and kitchen appliances. These factors cause the actual number of super states to be far lower than the theoretical number.
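The gap between the two counts can be checked directly from quantized per-device state sequences; the sketch below uses toy data in place of the UK-DALE sequences.

```python
import numpy as np

# Sketch: theoretical vs. actually observed super-state counts.
# `device_states` stands in for per-device quantized state sequences
# aligned on the same 6 s sampling grid (toy data, not UK-DALE).
rng = np.random.default_rng(0)
device_states = [rng.integers(0, 3, size=8640) for _ in range(3)]  # 3 devices, 3 states each

theoretical = int(np.prod([len(np.unique(s)) for s in device_states]))  # 3^3 = 27

# A super state is the tuple of simultaneous device states at one time step;
# real usage data is correlated, so far fewer tuples occur than in theory.
observed = len(set(zip(*device_states)))

print(f"theoretical: {theoretical}, observed: {observed}")
```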
In the UK-DALE dataset, the relationship between the theoretical number of super states and the actual number of super states as the number of devices increases is illustrated in Figure 8. Even when simultaneously disaggregating 15 devices, only about 6500 super states are observed. The sparsity of the super states not only prevents the number of improved deep dictionaries that need to be learned from growing exponentially but also makes the state transition probability matrix and the observation probability matrix highly sparse. By considering only the super states that actually occur when learning dictionaries and constructing these matrices, storage requirements and computational complexity can be significantly reduced.
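A sketch of how this observed-only convention keeps the transition matrix small, assuming a sequence of observed super-state labels (all names hypothetical):

```python
from collections import Counter
from scipy.sparse import csr_matrix

def sparse_transition_matrix(state_seq, n_states):
    """Build a row-normalized sparse transition matrix from an observed
    super-state label sequence; only transitions that occur are stored."""
    counts = Counter(zip(state_seq[:-1], state_seq[1:]))
    rows, cols = zip(*counts.keys())
    data = [float(c) for c in counts.values()]
    A = csr_matrix((data, (rows, cols)), shape=(n_states, n_states))
    row_sums = A.sum(axis=1).A.ravel()      # convert counts to probabilities
    row_sums[row_sums == 0.0] = 1.0         # avoid division by zero for unseen rows
    return csr_matrix(A.multiply(1.0 / row_sums[:, None]))

A = sparse_transition_matrix([0, 1, 1, 2, 0, 1], n_states=3)
print(A.nnz, "stored transitions out of", 3 * 3)   # -> 4 stored transitions out of 9
```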
Figure 9 illustrates the relationship between the time required for a complete super-state estimation using the KLE feature extraction and DSM-DDL load disaggregation method and the actual number of super states [22]. As shown, the algorithm's execution time increases in a piecewise linear fashion as the number of super states grows. The number of super states affects load disaggregation time primarily through the selection of the candidate super-state set, which requires traversing the state transition matrix and the observation probability matrix to compute the candidate probabilities. The greater the number of super states, the more dictionaries are included in the candidate super-state set. Examining Figure 8 and Figure 9 together shows that even with as many as 10⁶ super states, the load disaggregation time is only around 7 s. For common scales of load monitoring, where the number of target appliances ranges from 5 to 15, the actual number of super states is below 10,000 and the algorithm's runtime is less than 1.3 s, meeting the requirements of practical applications.
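The candidate-set step described above can be sketched as follows, assuming a dense transition matrix `A`, an observation probability matrix `B`, and a quantized observation index (all names and values hypothetical):

```python
import numpy as np

def candidate_super_states(A, B, prev_state, obs_bin, top_k=5):
    """Score every super state by its transition probability from the previous
    state times the probability of the current quantized observation, then
    keep the top-k candidates whose dictionaries will be evaluated."""
    scores = A[prev_state, :] * B[:, obs_bin]
    return np.argsort(scores)[::-1][:top_k]

# Toy matrices: 4 super states, 3 observation bins (hypothetical values).
A = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.6, 0.2, 0.1],
              [0.0, 0.3, 0.5, 0.2],
              [0.2, 0.0, 0.3, 0.5]])
B = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7],
              [0.3, 0.3, 0.4]])
print(candidate_super_states(A, B, prev_state=1, obs_bin=2, top_k=2))  # -> [2 1]
```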
4.5. Load Disaggregation Accuracy Experiment
We selected 10 days of sampling data for nine commonly used devices, several of which are highly similar, from the UK-DALE dataset as the experimental data. The nine devices are kitchen lights (KILs), refrigerators (FRs), televisions (TVs), computers (PCs), dishwashers (DWs), hoovers (HOs), washing machines (WMs), hair dryers (HDs), and toasters (TOs). The appliances involved in the experiment were divided into two groups. The first group consists of devices with high similarity: DWs, WMs, TOs, HDs, and HOs. Among them, HOs and WMs, as well as TOs and HDs, have a high degree of similarity, as shown in the power curve graphs in Figure 4. The second group consists of devices with lower similarity: KILs, TVs, PCs, DWs, and FRs.
In addition to the target devices involved in the super-state modeling, data from non-modeled devices were added to the total power data of each group as noise, resulting in a noise measure (NM) of approximately 6% for both groups. Sampling data from non-operating periods (where the power remains at 0 for an extended duration) were partially removed to refine the dataset. For each group of devices, load disaggregation experiments were designed for the following two scenarios [22]:
No Noise: Only the power data of the devices of interest are used in both the training and test sets, representing an ideal scenario.
With Noise: The power data of non-participating devices are retained in both the training and test sets, making this scenario more aligned with real-world conditions.
First, the deep dictionary optimization function defined in Equation (13) is used for dictionary learning and load disaggregation. The current, power, and KLE feature matrices are separately used as load features to aggregate the target devices into super states. The KLE features of the target devices are extracted from the device-level power sampling data, while the KLE feature matrices of the super states are extracted from the total power sampling data. Experiments are conducted on the first group of devices under both noise scenarios, following the iteration stopping criteria established in Section 2.3. The results are as follows:
The average number of dictionaries within the load disaggregation candidate dictionary set is 4.7.
It is clear that, in the absence of noise, all three load features achieve their best results. Among them, experiments using power features exhibit the lowest accuracy due to the high similarity between the devices' power characteristics, while current features perform slightly better. In all scenarios, the KLE features achieve the best results, demonstrating that this feature effectively reduces the similarity between similar devices' characteristics and, when applied to load disaggregation, can significantly improve the algorithm's accuracy.
Next, the proposed DSM-DDL algorithm is applied, with the optimization objective given by Equation (14). The threshold in Equation (6) is set to 0.1, and both the dimension of the power column vector and the order of the autocorrelation matrix are set to 5. The weighting parameters in Equation (16) are configured accordingly. The algorithm is then tested using current, power, and KLE feature matrices as load features to verify whether the improved deep dictionary method can further enhance the NILM system's ability to disaggregate similar devices.
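As a rough illustration of KLE feature extraction under these settings, the sketch below segments a power trace into length-5 column vectors, estimates the order-5 autocorrelation matrix, and projects onto its eigenbasis. This is a generic Karhunen–Loève expansion, not necessarily the paper's exact procedure.

```python
import numpy as np

def kle_features(power, dim=5):
    """Generic Karhunen-Loeve expansion: project length-`dim` power vectors
    onto the eigenbasis of their empirical autocorrelation matrix."""
    n = len(power) // dim
    X = power[: n * dim].reshape(n, dim).T       # dim x n matrix of column vectors
    R = (X @ X.T) / n                            # order-`dim` autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(R)         # symmetric eigendecomposition
    Phi = eigvecs[:, np.argsort(eigvals)[::-1]]  # basis sorted by descending eigenvalue
    return Phi.T @ X                             # KLE coefficient (feature) matrix

rng = np.random.default_rng(1)
power = np.abs(rng.normal(100.0, 10.0, size=1000))  # toy power trace (W)
print(kle_features(power).shape)                    # -> (5, 200)
```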
The experimental results of the improved deep dictionary in the three scenarios are as follows.
The average number of dictionaries within the load disaggregation candidate dictionary set is 4.7.
Based on the results of both experiments, it can be observed that using state transition probabilities and observation probabilities as criteria for selecting the super-state dictionary resulted in an average of 4.7 algorithm runs per disaggregation, significantly fewer than the total of 268 super states. This approach effectively reduces computation time and hardware overhead while maintaining accuracy, thereby ensuring efficient load disaggregation.
When comparing the results of the two experiments, it is evident that the proposed DSM-DDL algorithm outperforms deep dictionary learning in all aspects, especially when disaggregating highly similar devices. Interestingly, the accuracy improvement was more pronounced when using power features compared to KLE features. This is because KLE features already exhibit low similarity, resulting in minimal overlap between super-state domains. Consequently, the effect of reducing overlap using the improved deep dictionary was more significant for the power features, which originally had higher similarity.
Comparing Table 2 with Table 3 and Table 4 with Table 5, it is evident that noise has a more significant negative impact on the deep dictionary method than on the improved deep dictionary method. After incorporating the distance-based optimization into the dictionary learning objective function, the elements within each super-state domain are more closely clustered, and the KLE feature matrix is better positioned within its domain. Consequently, the overlap between similar super-state domains is reduced, making the disaggregation process less susceptible to noise-induced measurement deviations.
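A generic form of such a distance-regularized deep dictionary objective, sketched here with an intra-domain centroid penalty (the actual Equation (14) may differ), is:

```latex
% Sketch of a distance-regularized deep dictionary objective;
% \mu_{c(k)} denotes the centroid of sample k's super-state domain.
\min_{D_1,\dots,D_L,\,Z}\;
  \bigl\| X - D_1\,\varphi\bigl(D_2\,\varphi(\cdots D_L Z)\bigr) \bigr\|_F^2
  + \lambda \sum_{k} \bigl\| \mathbf{z}_k - \boldsymbol{\mu}_{c(k)} \bigr\|_2^2
```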
In both sets of experiments, the overall accuracy of noise modeling is lower than in the noise-free scenarios, which is expected given the complexity of real-world applications. However, when compared to scenarios where noise is retained but not addressed, noise modeling demonstrates a higher disaggregation accuracy, highlighting its effectiveness in practical applications.
In the noise modeling scenario, the accuracy metrics of the load disaggregation method using current load features with deep dictionary learning (DDL) are compared against those of the method using the KLE feature matrix with the proposed DSM-DDL algorithm. As shown in Figure 10 and Table 6, the NILM method combining the KLE feature matrix with the improved deep dictionary learning significantly outperforms the method using current features and DDL, with an improvement of approximately 8% in both the FS-f score and Est.Acc metrics.
The above experiments tested the algorithm's accuracy on super states aggregated from devices with partially similar features. It can be observed that even in the noise modeling scenario, where KLE features and the improved deep dictionary are used, the power estimation accuracy reaches only 73.4%. Based on the feature similarity experiments in Section 3.1 and the experimental data in References [10,14], this is likely due to the high similarity among devices in this scenario. In fact, similar devices make up only a portion of the entire dataset, while most devices have significantly different load characteristics.
As a comparison, four devices with generally lower similarity (refrigerators, dishwashers, hair dryers, and vacuum cleaners) were selected and aggregated into super states for the load disaggregation experiment. In the noise modeling scenario, load disaggregation was performed using current features combined with DDL and KLE feature matrices combined with DSM-DDL, respectively. The results are shown in Table 6. It can be observed that device similarity significantly impacts the accuracy of load disaggregation: the disaggregation accuracy for the highly similar dishwasher and vacuum cleaner is noticeably lower than that for the other two devices. However, as the overall similarity among the target devices decreases, the overall accuracy on the aggregated data improves substantially compared with the previous experiments. Furthermore, combining the results in Figure 10 and Table 6, it can be seen that although the refrigerator's power consumption has low similarity with that of the other devices, its power draw is small, so its power variation has little impact on the overall power variation. This leads to high similarity between super states that differ only in the refrigerator's state, which reduces accuracy. Table 6 also shows that the accuracy metrics of the aggregated data are closer to those of the dishwasher and vacuum cleaner, indicating that the larger a device's share of the total power, the greater its influence on the final power estimation. Improving the estimation accuracy of similar high-power devices can therefore effectively enhance the performance of the load disaggregation method.
The load disaggregation accuracy experiments demonstrate that the use of KLE features and super-state modeling with improved deep dictionary learning significantly enhances the ability to identify similar devices. When combined with noise modeling techniques, the accuracy and robustness of the NILM system are effectively improved.