1. Introduction
Colombia’s Non-Interconnected Zones (NIZ) are places that, due to their geographical location, cannot be connected to the national electricity grid, which leads to the use of polluting methods of electricity generation in these areas [1]. Nowadays, access to energy services is indispensable for almost any activity, and the presence or absence of energy determines the social, economic, and environmental dynamics of communities. Therefore, it is important to analyze the trends in the Non-Interconnected Zones in terms of energy service and to ask: Which NIZs have the highest energy demand? Which zones have the best quality of electricity service? Is it possible to determine the ideal locations for renewable energy projects to benefit zones with similar consumption characteristics? Through data mining, it is possible to transform simple values into relevant and useful information to answer the above questions [2]. In this study, clustering techniques such as K-means, K-medoids, and divisive analysis clustering are used to analyze patterns or associations of groups with similar characteristics, considering variables related to electrical energy in Colombia’s NIZs, such as active energy, reactive energy, power, and hours without service in these locations.
It is important to highlight that the data analytics techniques applied in this study have not been implemented before in the context of the NIZs in Colombia, which means that there is no record of similar studies specifically aimed at these locations and their problems. This allows a first approach to generate knowledge of these areas for academia, industry, and government that can lead to the analysis of various factors related to the NIZs. On the other hand, it is a tool for data-driven decision-making to guide investment, performance, and best practices to evaluate the possibility of new energy projects that allow for the improvement of the quality of electricity service in these locations.
Section 2 presents a theoretical framework that introduces the concepts of clustering, describes the algorithms used, and reviews some similar works. Section 3 describes the methodological approach, the data and variables, and the analysis techniques used. This is followed by the results obtained and a discussion aimed at identifying the similarities among the clusters, and finally, conclusions and future work are presented.
2. Background
Clustering is an unsupervised learning technique that identifies groups with similar characteristics within a dataset. Clustering methods can be divided into two types: partitioning clustering, which requires prior specification of the number of clusters (K-means and K-medoids), and hierarchical clustering, which does not require prior specification of the number of clusters (divisive and agglomerative hierarchical clustering) [3].
Several clustering algorithms were used in this work: K-means, K-medoids, and divisive analysis clustering. K-means groups the data into a number of clusters established beforehand. It places the elements with the smallest distances between them in the same cluster, which is represented by the average of all its values, that is, its centroid; each element therefore belongs to the cluster whose average is closest to it [4]. It is one of the most widely used clustering algorithms in research because of its simplicity and effectiveness, although one of its disadvantages is that it is often affected by outliers; its effectiveness therefore depends on the dataset.
According to Saxena et al. [4], the algorithm is very simple: the number of clusters K is selected and the initial centroids are placed; the distance from each element to each centroid is calculated; each element is assigned to its closest centroid; the centroids are recomputed as the averages of their clusters; and the process ends when the centroids no longer change between iterations.
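The steps just described can be illustrated with a minimal Python sketch. It is not the implementation used in the study; the function name `kmeans`, the toy data, and the choice of initial centroids are assumptions made only for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means on a list of numeric tuples.

    `centroids` is the initial guess (one tuple per cluster); how it is
    chosen is left to the caller and is an assumption of this sketch.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: (daily service hours, active energy) per zone.
zones = [(8.0, 1.0), (9.0, 1.5), (10.0, 2.0), (23.0, 9.0), (24.0, 10.0)]
labels, centroids = kmeans(zones, centroids=[zones[0], zones[-1]])
```

In this toy case the three low-service zones end up in one cluster and the two high-service zones in the other. With real data the result depends on the initial centroids, which is why several starts are usually compared.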
K-medoids is a technique very similar to K-means, but it uses the most central element of each cluster (the medoid) as its representative, and for this reason it is less affected by outliers. Several algorithms implement the K-medoids method; one of the most common is partitioning around medoids (PAM), which is the one used in this work. According to Chitra and Maheswari [5], PAM is an iterative algorithm that first selects K initial elements as seed medoids and then calculates the distances between the elements to assign each element to its closest medoid. This process is repeated until the medoids found are those with the smallest distances to the elements within their clusters.
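A simplified, swap-based sketch of the K-medoids idea is shown below. It is not the full PAM algorithm of [5]: the seeding (the first K points) and the first-improvement swap rule are simplifying assumptions, and the names `total_cost` and `k_medoids` are hypothetical.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k):
    """Simplified swap-based K-medoids; returns (medoid indices, labels)."""
    medoids = list(range(k))  # naive seed: the first k points (an assumption)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid point.
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # keep any swap that lowers the total cost
                    medoids, best, improved = trial, cost, True
    # Final assignment: each point joins its closest medoid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


# Hypothetical toy data: (daily service hours, active energy) per zone.
zones = [(8.0, 1.0), (9.0, 1.5), (10.0, 2.0),
         (23.0, 9.0), (24.0, 10.0), (25.0, 11.0)]
medoids, labels = k_medoids(zones, k=2)
```

Because each cluster is represented by one of its own elements rather than by an average, a single extreme value moves the representative much less than it would move a centroid.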
Apart from PAM, there are other algorithms for applying K-medoids, which differ in their advantages or in the settings where they apply. One well-known example is the CLARA method, which is often used when the dataset is so large that running PAM would require a lot of computational resources [6]. In this case, the PAM algorithm is sufficient, as the dataset is not large enough to require such computational capacity.
Divisive analysis clustering (DIANA) is a technique that does not require the number of clusters to be specified beforehand. It starts with a single cluster containing all the elements and splits it repeatedly until each element forms its own cluster [7]. It is also an iterative algorithm: at each step, it takes the cluster with the greatest difference between two of its elements and then takes the element of that cluster with the greatest distance from the rest to start a new cluster. The remaining elements are reallocated depending on how close they are to the new cluster, so that the original cluster is divided into two. The process continues until individual elements are reached [6].
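The splitting procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the routine used in the study: the names `split` and `diana` are hypothetical, and stopping at a requested number of clusters (instead of continuing down to single elements) is an assumption made to keep the example short.

```python
import math


def mean_dist(i, group, d):
    """Average distance from element i to the elements of a non-empty group."""
    return sum(d(i, j) for j in group) / len(group)


def split(cluster, d):
    """Divide one cluster (list of indices, size >= 2) into two."""
    rest = list(cluster)
    # The element farthest, on average, from the others starts the new cluster.
    seed = max(rest, key=lambda i: mean_dist(i, [j for j in rest if j != i], d))
    rest.remove(seed)
    splinter = [seed]
    while len(rest) > 1:
        # Gain of moving i: how much closer it is to the new cluster.
        gains = {
            i: mean_dist(i, [j for j in rest if j != i], d)
            - mean_dist(i, splinter, d)
            for i in rest
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # nobody prefers the new cluster any more
            break
        rest.remove(best)
        splinter.append(best)
    return [rest, splinter]


def diana(points, n_clusters):
    """Split top-down until n_clusters clusters exist (assumed stop rule)."""
    def d(i, j):
        return math.dist(points[i], points[j])

    clusters = [list(range(len(points)))]
    while len(clusters) < n_clusters:
        # Choose the cluster with the largest distance between two elements.
        target = max(clusters, key=lambda c: max(d(i, j) for i in c for j in c))
        if len(target) < 2:
            break  # only single elements remain
        clusters.remove(target)
        clusters.extend(split(target, d))
    return clusters


# Hypothetical toy data: (daily service hours, active energy) per zone.
zones = [(8.0, 1.0), (9.0, 1.5), (10.0, 2.0),
         (23.0, 9.0), (24.0, 10.0), (25.0, 11.0)]
clusters = diana(zones, n_clusters=2)
```

Recording the order of the splits and the diameter of each divided cluster would give the information needed to draw a dendrogram.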
In this research, we use the dendrogram for the visual representation of the DIANA algorithm and validate it using the correlation coefficient between the distances of the dendrogram and the original distances of the elements. To be considered a good dendrogram, the coefficient must be between 0.7 and 1.
Finally, heatmap analysis combines heatmaps with dendrograms to visually represent the similarity between data and determine common patterns or characteristics [6].
When performing clustering analysis, it is also necessary to validate the results to know to what extent they are reliable. There are two ways to validate clustering results: internal validation and external validation. External validation is only possible when the actual classification of each element of the dataset is available for comparison with the classification obtained by the clustering algorithms. Internal validation considers the position of each element within its cluster and the difference or distance between clusters to determine how good and accurate the classification is [8].
For this study, the Silhouette internal validation index is considered to observe how accurate the cluster assignment of each algorithm is. According to [6], the Silhouette index ($s_i$) of each element $i$ is obtained by first calculating the average of the distances ($a_i$) between the element $i$ and the other elements of the same cluster; then the average distance between the element $i$ and the elements of each of the other clusters is calculated. Finally, $b_i$ is identified as the smallest of these average distances between the element and the rest of the clusters. Its equation is then given by:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
In this way, the index value will be between 1 and −1. High values represent a good allocation of the element in its cluster, and low values represent a wrong allocation. On the other hand, the software to be used contains functions and packages that make this process easy.
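As an illustration of the Silhouette computation, the following minimal Python sketch computes the width of each element as (b - a) / max(a, b), where a is the average distance to the other elements of its own cluster and b is the smallest average distance to another cluster. The function name `silhouette`, the toy data, and the convention of assigning 0 to single-element clusters are assumptions of this sketch, not details taken from the study.

```python
import math


def silhouette(points, labels):
    """Silhouette width of every point; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # single-element cluster: width set to 0 (assumed convention)
            widths.append(0.0)
            continue
        # a: average distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths


# Hypothetical toy data: (daily service hours, active energy) per zone.
zones = [(8.0, 1.0), (9.0, 1.5), (10.0, 2.0),
         (23.0, 9.0), (24.0, 10.0), (25.0, 11.0)]
good = silhouette(zones, [0, 0, 0, 1, 1, 1])
bad = silhouette(zones, [0, 0, 1, 1, 1, 1])  # third zone deliberately misplaced
average_width = sum(good) / len(good)
```

With the well-separated labelling every width is close to 1, while the deliberately misplaced zone in the second call receives a negative width, which matches the interpretation of high and low values given above.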
It should be noted that it is impossible to carry out an external validation for this work, given that there are no real classification values assigned to each shire, county, or municipality, nor precedents of studies on this subject in Colombia’s Non-Interconnected Zones.
In the same sense, it is possible to find studies or research in which data mining and clustering techniques are applied to the analysis of electrical factors from a macro (analysis across several countries) or micro (analysis by residential units or buildings) point of view. Among these studies is the work carried out by Gostkowski et al. [9], who evaluated energy consumption with respect to the economic development of various sectors of the countries belonging to the Visegrad Group; they applied various clustering techniques (K-means, hierarchical agglomerative clustering, and DIANA) that allowed them to determine the changes in energy consumption and establish a clear dynamic between consumption and the economy of the countries over the years.
There are also authors such as Li et al., Liu et al., and Ramos et al. [10, 11, 12], who used data mining and various clustering techniques (partitional and hierarchical) to characterize energy consumption. This makes it possible to know the electricity consumption habits of customers or buildings, design improvements to the electricity service, and detect faults or anomalies, leading to progress in energy efficiency, new energy management strategies, and the identification of future consumption trends. These works are clear evidence that the application of clustering and data mining algorithms can provide valuable tools to improve various factors related to electrical energy.
Finally, the work of Kapousouz et al. [13] can be highlighted; they conducted a clustering analysis of electricity and water consumption in the United States with data from 1985 to 2015. They related electricity and water resources and identified changes and trends over the years; in this way, clustering techniques can be applied to relate different topics and find information that helps to understand certain dynamics.
5. Discussion
From all the analyses, there are clusters with similar characteristics that make it possible to determine the quality of service for each of the non-interconnected zones, as well as the zones with the highest and lowest hours of service and the relationship in terms of active energy consumption, maximum power, and reactive energy.
Regarding the analysis by shire counties using the K-means and DIANA algorithms, the shire counties classified in the first cluster are those with the most deficient service, with the fewest hours of daily electricity service (only about 9 h) and therefore very low power and consumption values. This is the cluster with the most elements; it contains the shire counties of Guaviare, Caquetá, Chocó, Cauca, La Guajira, Valle del Cauca, Casanare, Nariño, Antioquia, and Bolívar, which have their own environmental characteristics (forests, mountains, and river systems) and good energy potential for the implementation of renewable energies, such as photovoltaic systems for La Guajira, Bolívar, or Casanare, wind systems for La Guajira, or hydrokinetic systems for Antioquia or Cauca [21, 22, 23, 24].
On the other hand, those located in the third cluster (Vaupés, Putumayo, and Guainía) have a regular service in terms of the hours of energy service available to them compared to the other clusters, and their grouping closely matches their geographical location, as these three shire counties are in the southeast of the country. This grouping or pattern obtained through clustering analysis could be due to similar consumption habits and cultures and similar territorial management plans.
In turn, the Silhouette index for this first analysis by shire counties showed a good value (0.6) for the K-means and DIANA algorithms, which supports the conclusion that, based on the data, the classification is consistent with the characteristics of the areas. It has been possible to observe hidden relationships, such as the quality of energy service by shire county grouping and the under-utilization of renewable natural resources in certain areas of the country, which would not have been apparent from looking at the large dataset alone. Beyond testing different clustering methods, this analysis presents the conditions under which Colombians in Non-Interconnected Zones have or do not have access to a good electricity service and, consequently, to basic means for the development of their daily, economic, social, and even cultural activities.
In developing countries such as Colombia that face a lack of access to modern energy services, energy poverty can be explained by some factors mentioned by [25, 26], such as the availability and reliability of energy sources (NIZs are peripheral and remote areas) and factors associated with household preferences and social and cultural beliefs (shire counties such as Amazonas, La Guajira, and Cauca have indigenous populations, and Chocó has Afro-descendant populations, as described by [27]). In addition, some researchers, such as Fisher et al. [28], consider that price is the critical factor for energy access, and it can be a determining factor for these areas with isolated energy policies, social inequality, and high poverty rates [29].
In addition, the study was carried out for the 24 municipalities that have NIZ in the shire counties: Amazonas, Cauca, Chocó, Magdalena, La Guajira, and Vichada which are shire counties in different regions of the country, with interesting behavior in their hours of service and consumption.
At the municipal level, it is observed that, of the 24 municipalities studied, the largest group (a cluster of 12 municipalities) in various regions and shire counties of the country has a low-quality service with approximately 8 h of service per day. The K-means analysis also shows that the municipality of Leticia in Amazonas has a high level of reactive energy, which places it in a cluster of its own, as none of the other municipalities studied have similar characteristics. Although Leticia has on average 23 h of energy per day, the statistical analysis shows that it has been without energy service for days at a time.
In the case of Uribia, La Guajira, the study shows that it belongs to the worst service cluster; however, this area has a high energy potential for the implementation of wind and solar energy [30, 31]. Similarly, there are many municipalities with good potential where the development of clean energy generation projects would improve the living conditions of their inhabitants and the environmental conditions of the area.
On the other hand, considering the Silhouette index for the analysis at the municipal level, good values between 0.59 and 0.6 were obtained for the three algorithms applied, indicating how well the municipalities studied are classified within each cluster. This is due to the selection of an optimal number of clusters: choosing more or fewer clusters would not result in acceptable validation index values, as each municipality or shire county would not be well classified, and it would be very difficult to obtain hidden information on the relationships and similarities of each non-interconnected zone.
This can support the future improvement of energy management or power generation projects, justify actions, support decision-making, and understand the different situations at the national level in non-interconnected areas.
6. Conclusions
The results show the classification of the shire counties with respect to the quality of their current electricity service, from a deficient service to a more adequate service, considering the demand and the hours without service. In this way, it is possible to determine where energy solutions should be oriented, with the strategic objective of greater coverage in more NIZ.
It is emphasized that the results are evaluated by an internal clustering validation (Silhouette index) that makes it possible to determine the quality of the analysis, to compare the different methods or algorithms applied (K-means, K-medoids PAM, divisive analysis clustering, and heatmaps), and to find the most efficient analyses.
The results show similarities between the quality of service and the geographical locations of the NIZ themselves, which opens the possibility of studying consumption habits and customs in the communities of non-interconnected zones.
This research has identified relevant information for the analysis of electricity variables in the NIZ. All the findings help to guide decision-making, to identify patterns and similarities, and provide tools to think about improving the living conditions of the inhabitants of the non-interconnected areas of Colombia.
In future work, it is proposed to extend the data analysis with other techniques to identify the incidence of Non-Interconnected Zones in territories that have Territorially Focused Development Programmes (PDET). These are special municipalities that, based on the Colombian peace process, aim to stabilize and transform the territories most affected by violence, poverty, illicit economies, and institutional weakness, and thus achieve rural development. In this same way, and as a future work, it is proposed to process the totality of the data at the municipal level of each shire county without any type of restriction, with the aim of having a sufficiently detailed study. It should be highlighted that the algorithms presented are sufficiently robust to carry out this analysis.