Article

Data Analysis of Electricity Service in Colombia’s Non-Interconnected Zones through Different Clustering Techniques

by Ramón Fernando Colmenares-Quintero 1,*, Gina Maestre-Gongora 1, Marieth Baquero-Almazo 1, Kim E. Stansfield 2 and Juan Carlos Colmenares-Quintero 3
1 Faculty of Engineering, Universidad Cooperativa de Colombia, Calle 50A No. 41-34, Medellín 050012, Colombia
2 VOCATE Ltd., 2 Fountain Place, Worcester WR1 3HW, UK
3 Institute of Physical Chemistry, Polish Academy of Sciences, Kasprzaka 44/52, 01-224 Warsaw, Poland
* Author to whom correspondence should be addressed.
Energies 2022, 15(20), 7644; https://doi.org/10.3390/en15207644
Submission received: 20 September 2022 / Revised: 28 September 2022 / Accepted: 11 October 2022 / Published: 17 October 2022
(This article belongs to the Special Issue Bio-Refineries and Renewable Energies Supported on ICT)

Abstract

Energy determines the social, economic, and environmental aspects that enable the advancement of communities. For this reason, this paper aims to analyze the quality of the energy service in the Non-Interconnected Zones (NIZ) of Colombia. For this purpose, clustering techniques (K-means, K-medoids, divisive analysis clustering, and heatmaps) are applied for data analysis in the context of the NIZ to identify patterns or hidden information in the Colombian government data related to the state of the electricity service in these localities during the years 2019–2020. A descriptive statistical analysis and validation of the results of the clustering techniques is also carried out using R software. Through the implementation of clustering algorithms such as K-means, K-medoids, and divisive analysis clustering, potential areas for the development of renewable and alternative energy projects are identified, considering places with deficiencies in their current electricity service, higher consumption, or places with very low daily hours of electricity service. Additionally, relationships were identified in the dataset that can be considered as tools that would support decision-making for academia and industry, as well as the definition of guidelines or strategies from the government to improve energy efficiency and quality for these places, and consequently, the living conditions of the residents of Colombia’s NIZs.

1. Introduction

Colombia’s Non-Interconnected Zones (NIZ) are places that, due to their geographical location, cannot be connected to the national electricity grid, causing the use of polluting methods of electricity generation in these areas [1]. Nowadays, access to energy services is indispensable for almost any activity, and the presence or absence of energy determines the social, economic, and environmental dynamics of communities. Therefore, it is important to analyze the trends in the Non-Interconnected Zones in terms of energy service and to ask: Which NIZs have the highest energy demand? Which zones have the best quality of electricity service? Is it possible to determine the ideal locations for renewable energy projects to benefit zones with similar consumption characteristics? Through data mining, it is possible to transform simple values into relevant and useful information to answer the above questions [2]. In this study, clustering techniques such as K-Means, K-medoids, and divisive analysis clustering are used to analyze patterns or associations of groups with similar characteristics, considering variables related to electrical energy in Colombia’s NIZs, such as active energy, reactive energy, power, and hours without service in these locations.
It is important to highlight that the data analytics techniques applied in this study have not been implemented before in the context of the NIZs in Colombia, which means that there is no record of similar studies specifically aimed at these locations and their problems. This allows a first approach to generate knowledge of these areas for academia, industry, and government that can lead to the analysis of various factors related to the NIZs. On the other hand, it is a tool for data-driven decision-making to guide investment, performance, and best practices to evaluate the possibility of new energy projects that allow for the improvement of the quality of electricity service in these locations.
This article presents in Section 2 a theoretical framework where the concepts of clustering are presented, and the algorithms used are described; additionally some similar works are presented. Section 3 describes the methodological approach, the data and variables, and the analysis techniques used. This is followed by the results obtained and a discussion aimed at identifying the similarities of the clusters identified, and finally, conclusions and future work are presented.

2. Background

Clustering is an unsupervised learning technique that identifies groups with similar characteristics within a dataset. Clustering methods can be divided into two types: partitioning clustering, which requires prior specification of the number of clusters (K-means and K-medoids), and hierarchical clustering, which does not (divisive and agglomerative hierarchical clustering) [3].
Several clustering algorithms were used in this work: K-means, K-medoids, and divisive analysis clustering. K-means groups the data into a previously established number of clusters. It places the elements with the smallest distances between them in the same cluster, and each cluster is represented by its centroid, that is, the average of all the values in the cluster; each element therefore belongs to the cluster whose average is closest to it [4]. It is one of the most widely used clustering algorithms in research because of its simplicity and effectiveness. Its main disadvantage is that it is often affected by outliers, so its effectiveness depends on the dataset.
According to Saxena et al. [4], the algorithm is simple. The number of clusters K is selected and the initial centroids are located. The distance from each element to each centroid is then calculated, and each element is assigned to the cluster with the closest centroid. The centroids are recalculated from the new assignment, and the process ends when the centroids no longer change between iterations.
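The steps described by Saxena et al. can be sketched in a few lines. The following is a minimal illustration in Python, not the R code used in this study; it assumes Euclidean distance and caller-supplied initial centroids, and the function names and sample points are hypothetical.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-means: assign to nearest centroid, recompute means, repeat."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each element joins the cluster with the closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its current members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # stop when the centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels, centroids = kmeans(points, initial_centroids=[(1.0, 1.0), (9.0, 9.5)])
print(labels, centroids)
```

Because each centroid is a mean, a single extreme value pulls it away from the bulk of the cluster, which is the sensitivity to outliers mentioned above.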
K-medoids is very similar to K-means, but it represents each cluster by its most central element (the medoid) rather than by an average, and for this reason it is less affected by outliers. Several algorithms implement the K-medoids method; one of the most common is partitioning around medoids (PAM), which is the one used in this work. According to Chitra and Maheswari [5], PAM is an iterative algorithm that first selects K initial elements as seed medoids and then calculates the distances between the elements in order to assign each element to its closest medoid. This process is repeated until the medoids found minimize the distances to the elements within their clusters.
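A simplified version of this idea can be sketched as follows. This Python illustration is not the PAM implementation used in the study: it takes the seed medoids from the caller and accepts any swap that lowers the total distance, whereas full PAM has a dedicated build phase and searches for the best swap at each step. The function names and data are hypothetical.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids, dist):
    """Sum over all elements of the distance to their closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, initial_medoids, dist):
    """Swap medoids with non-medoids while the total cost keeps decreasing."""
    medoids = list(initial_medoids)  # medoids are stored as indices into points
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:  # keep the swap only if it lowers the cost
                    medoids, best, improved = trial, cost, True
    # Final assignment: each element goes to its closest medoid.
    labels = [
        min(range(len(medoids)), key=lambda k: dist(p, points[medoids[k]]))
        for p in points
    ]
    return medoids, labels, best


# Hypothetical data: two groups; the seeds deliberately start in the same group.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
medoids, labels, cost = pam(points, initial_medoids=[1, 2], dist=manhattan)
print(medoids, labels, cost)
```

Each medoid is always an actual element of the dataset, so one extreme value cannot drag the cluster representative to an empty region of the space.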
Apart from the PAM algorithm, there are other methods for applying K-medoids, which differ in their advantages and in the fields where they are applied. One well-known example is the CLARA method, which is often used when the dataset is so large that running PAM would require a lot of computational resources [6]. In this case, the PAM algorithm is sufficient because the dataset is not large enough to require such computational capacity.
Divisive analysis clustering (DIANA) is a hierarchical technique that does not require the number of clusters to be specified beforehand. It starts with a single cluster containing all the elements and divides it successively until each element forms its own cluster [7]. The algorithm works iteratively. At each step it takes the cluster with the greatest difference between two of its elements and, within it, the element with the greatest distance from the rest, which starts a new cluster. The remaining elements are then reallocated depending on how close they are to the new cluster, so that the original cluster is divided into two. This is repeated until individual elements are reached [6].
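The divisive procedure can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the R routine used in the study: it works on a precomputed distance matrix `D`, stops once a requested number of clusters is reached instead of building the full dendrogram, and all names and data are hypothetical.

```python
def average_distance(i, others, D):
    """Mean distance from element i to the elements listed in others."""
    return sum(D[i][j] for j in others) / len(others)


def diameter(cluster, D):
    """Largest distance between any two elements of a cluster."""
    return max((D[i][j] for i in cluster for j in cluster), default=0.0)


def split(cluster, D):
    """Divide one cluster (two or more elements) by growing a splinter group."""
    remaining = list(cluster)
    # The element farthest, on average, from the rest starts the new cluster.
    seed = max(
        remaining,
        key=lambda i: average_distance(i, [j for j in remaining if j != i], D),
    )
    remaining.remove(seed)
    splinter = [seed]
    while len(remaining) > 1:
        # Positive gain: the element is closer to the splinter group than to the rest.
        gains = {
            i: average_distance(i, [j for j in remaining if j != i], D)
            - average_distance(i, splinter, D)
            for i in remaining
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        remaining.remove(best)
        splinter.append(best)
    return remaining, splinter


def diana(D, n_clusters):
    """Repeatedly split the cluster with the largest diameter."""
    clusters = [list(range(len(D)))]
    while len(clusters) < n_clusters:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break  # only single elements are left
        widest = max(candidates, key=lambda c: diameter(c, D))
        clusters.remove(widest)
        clusters.extend(split(widest, D))
    return clusters


# Hypothetical one-dimensional data with two obvious groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0]
D = [[abs(a - b) for b in values] for a in values]
clusters = diana(D, n_clusters=2)
print(clusters)
```

Asking for as many clusters as there are elements continues the splitting down to individual elements, which corresponds to the complete hierarchy.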
In this research, the dendrogram is used for the visual representation of the DIANA result, and the result is validated using the correlation coefficient, which compares the distances in the dendrogram with the original distances between the elements. For a dendrogram to be considered good, the coefficient must be between 0.7 and 1.
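As an illustration of what this coefficient measures, the sketch below computes the Pearson correlation between the original pairwise distances and the distances read from a dendrogram. It is written in Python for illustration only; the four data values and the dendrogram are hypothetical, and taking the height of each split as the diameter of the cluster being divided is an assumption.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def dendrogram_distance(i, j, splits):
    """Height of the smallest cluster in the dendrogram containing both i and j."""
    return min(h for h, members in splits if i in members and j in members)


# Hypothetical one-dimensional data.
values = [0.0, 1.0, 5.0, 6.0]

# Hypothetical dendrogram: (height, members of the cluster divided at that height).
splits = [(6.0, {0, 1, 2, 3}), (1.0, {0, 1}), (1.0, {2, 3})]

pairs = list(combinations(range(len(values)), 2))
original = [abs(values[i] - values[j]) for i, j in pairs]
from_tree = [dendrogram_distance(i, j, splits) for i, j in pairs]

r = pearson(original, from_tree)
print(round(r, 3))
```

A value close to 1 means that the dendrogram preserves the original distances well.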
Finally, heatmap analysis combines heatmaps with dendrograms to visually represent the similarity between data and determine common patterns or characteristics [6].
When performing clustering analysis, it is also necessary to validate the results to know to what extent they are reliable. There are two ways to validate clustering results: internal validation and external validation. External validation is only possible when the actual classification of each element of the dataset is available for comparison with the classification obtained by the clustering algorithms. Internal validation considers the position of each element within its cluster and the difference or distance between clusters to determine how good and accurate the classification is [8].
For this study, the Silhouette internal validation index is used to observe how accurate the cluster assignment of each algorithm is. According to [6], the Silhouette index $s_i$ of each element $i$ is obtained as follows. First, $a_i$ is calculated as the average distance between element $i$ and the other elements of the same cluster. Then the average distance between element $i$ and the elements of each of the other clusters is calculated, and $b_i$ is the smallest of these averages. The index is then given by:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
The index value therefore lies between −1 and 1. High values represent a good allocation of the element to its cluster, and low values represent a wrong allocation. The software used contains functions and packages that make this calculation easy.
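To make the Silhouette calculation concrete, the sketch below computes the index of every element for a given assignment. It is a Python illustration rather than the R function used in the study; it assumes at least two clusters, assigns 0 to elements that are alone in their cluster (a common convention), and uses hypothetical names and data.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_values(points, labels, dist):
    """Silhouette index of each element, given its cluster label."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # convention for a cluster with a single element
            continue
        # a: average distance to the other elements of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the elements of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical data: two compact groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1], euclidean)
bad = silhouette_values(points, [0, 1, 0, 1], euclidean)
print(sum(good) / len(good), sum(bad) / len(bad))
```

The first assignment follows the natural groups and gives values close to 1, while the second mixes the groups and gives negative values.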
It should be noted that it is impossible to carry out an external validation for this work, given that there are no real classification values assigned to each shire county or municipality, nor precedents of studies on this subject in Colombia’s Non-Interconnected Zones.
In the same sense, it is possible to find studies or research in which data mining and clustering techniques are applied to work with the analysis of electrical factors from a macro (analysis by several countries) or micro (analysis by residential units or buildings) point of view. Among these studies is the work carried out by Gostkowski et al. [9], who managed to evaluate energy consumption with respect to the economic development of various sectors of the countries belonging to the Visegrad Group; they applied various clustering techniques (K-means, hierarchical agglomerative cluster, and DIANA) that allow them to determine the changes in energy consumption and establish a clear dynamic between consumption and the economy of the countries over the years.
Other authors, such as Li et al., Liu et al., and Ramos et al. [10,11,12], used data mining and various clustering techniques (partitional and hierarchical) to characterize energy consumption. This makes it possible to know the electricity consumption habits of customers or buildings, design improvements to the electricity service, and detect faults or anomalies, which in turn supports progress in energy efficiency, new energy management strategies, and the identification of future consumption trends. These studies are clear evidence that clustering and data mining algorithms can provide valuable tools for improving various aspects of electrical energy.
Finally, the work of Kapousouz et al. [13] can be highlighted, who conducted a clustering analysis of electricity and water consumption in the United States with data from 1985 to 2015. They related electricity and water resources and identified changes and trends over the years; in this way, clustering techniques can be applied to relate different topics and find information that helps to understand certain dynamics.

3. Materials and Methods

For this work, we used the datasets of the state of the electricity service in the country’s Non-Interconnected Zones, covering the years 2019–2020, from the database of the Institute for Planning and Promotion of Energy Solutions for Non-Interconnected Zones (IPSE) [14]. This dataset was obtained through the National Monitoring Centre (CNM) and telemetry systems that read variables such as active energy (MWh), reactive energy (MVARh), maximum power (kW), and hours per day with and without power service for each NIZ (with telemetry system) of Colombia, as can be seen in Table 1.
The stages for the analysis of these data are:
  • Data collection, processing, and cleaning: to obtain an adequate analysis, data in most cases need to be processed and cleaned, given that datasets are collected monthly and therefore need to be standardized and normalized for their use. In some cases, the data had different units of measurement, with missing or wrong values. For this study, we used the data processed by Colmenares-Quintero et al. [15], which for the most part, were adequately adjusted to the standards required for this work.
  • The application of different clustering algorithms with respective validations: four techniques were applied (K-means, K-medoids, divisive analysis clustering, and heatmaps), with internal validation by the Silhouette index, to search for patterns or classifications that help to understand the behavior of the electricity service in the NIZ, among other aspects.
  • Analysis and interpretation of the clusters obtained. Once the results of the application of the clustering algorithms were obtained, they were compared and analyzed to determine the best algorithm for the case of these data and to observe and analyze the implications for the quality of the electricity service, according to the clusters obtained.
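One common way to put variables with different units on a comparable scale before clustering is z-score standardization. The sketch below shows that step in Python; treating the standardization mentioned in the first stage as z-scores is an assumption, since the source does not specify the procedure, and the function name and sample values are hypothetical.

```python
import math


def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Sample standard deviation (divides by n - 1).
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical records: (hours of service per day, active energy in MWh).
rows = [(1.0, 100.0), (2.0, 200.0), (3.0, 300.0)]
scaled = standardize(rows)
print(scaled)
```

Without a step of this kind, a variable measured in large units would dominate the distance calculations used by the clustering algorithms.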
The application of the different clustering techniques was carried out in the R software version 4.1.1 due to its variety of tools, libraries, and packages and because it is free software [16].
It is important to highlight that the methodology used presents an innovation compared to other studies carried out with clustering techniques. It is a complete methodology that includes the collection and cleaning of data for in-depth analysis and the discovery of hidden information, and it uses open government data provided by a public institution. This demonstrates the transparency of the government in providing information for the community, industry, and academia, as well as the progress of Colombian digital government. In the same way, the use of open government data provides a way to reuse data to obtain public, social, and economic value from it [17,18].

4. Results

4.1. Descriptive Statistical Analysis

Initially, based on the previous cleaning and adaptation of the data, a descriptive statistical analysis of the variables is carried out for each of the 18 shire counties with NIZ in the country. This paper presents the results for 7 of the 18 shire counties analyzed: Amazonas, Cauca, Chocó, La Guajira, Magdalena, San Andrés y Providencia, and Vichada. These are the most representative shire counties in terms of NIZ energy demand at the national level, according to the study in [15] and the telemetry reports submitted by IPSE over the years studied [19]. Minimums, maximums, ranges, means, medians, and standard deviations of the different variables by shire county were evaluated to observe hidden behaviors and characteristics that would be impossible to perceive by inspecting a very large database directly, and that can help to better understand the object of study of this work.
Table 2 and Figure 1 show that, among the shire counties analyzed, San Andrés y Providencia has the highest electricity consumption, with 81% (8412 MWh) of the total active energy analyzed, followed by Amazonas with 11% (1195 MWh). The remaining 8% is distributed among the shire counties of Vichada, Chocó, Magdalena, La Guajira, and Cauca, in order of highest to lowest active energy.
In addition, Table 2 shows that areas or localities in shire counties such as Amazonas, Cauca, Chocó, and Vichada have been without electricity service all day long. This gives a broad indication of the quality of the energy service in these places, which are isolated border areas of the country whose territory consists of jungle and mountains and is marked by internal conflict.
On the other hand, Table 2 shows that, in terms of average hours of energy service, the shire county of San Andrés y Providencia has almost 24 h per day, which could explain its active energy demand, given that it is the non-interconnected shire county with the most hours of service. It is followed by Magdalena, Vichada, and Amazonas with 22, 20, and 18 h per day on average, respectively. The shire counties of Cauca and La Guajira have the lowest average daily hours of energy service among those analyzed, at 6 h each, which matches their position in terms of the active energy they consume. This suggests a direct relationship between the active energy value and the average daily hours of electricity service for each shire county.
Concerning the number of hours with and without energy for each shire county (which are important for identifying the quality of service), the values in Table 2 allow a broader analysis. The shire counties of San Andrés y Providencia, Magdalena, Vichada, and Amazonas exceeded 80% of the time with electricity service per day on average, while the shire county of Chocó barely exceeded 50%, and the shire counties of Cauca and La Guajira did not even reach an average of 30%. These figures point to an initial grouping that can be observed more accurately in the clustering analyses.
Regarding the maximum power per shire county, Table 2 and Figure 2 show that, as expected, San Andrés y Providencia had the highest average power, with 64% (14021 kW) of the total for the shire counties studied. In second place was Chocó with 18% (3925 kW), followed by Amazonas and Vichada with 11% (2295 kW) and 5% (1194 kW), respectively. These values can be directly linked to the size or number of non-interconnected localities in each shire county studied.

4.2. Energy Service Analysis by Shire Counties Applying Different Clustering Algorithms

Based on the above approach to recognizing the data and its trends, the clustering analysis starts at the departmental level with the variables of active energy, maximum power, and hours of energy service. San Andrés y Providencia was excluded from this analysis because its high values (representing 80% of demand) make it difficult to analyze the whole dataset, and we are interested in assessing trends in other areas.

4.2.1. K-Means Algorithm by Shire Counties

The K-means algorithm was applied to the data using the Euclidean distance measure, which is based on the Pythagorean theorem and seeks to measure the distance between two elements. Figure 3a shows the results of the fviz_nbclust function that evaluates, by different methods, the optimal number of clusters (to identify whether clustering is possible and how many clusters should be implemented); Figure 3b shows the clustering pattern of shire counties within the three identified clusters. Table 3 shows the average summary of the information relevant to each classified cluster and its respective variables (hours of service per day, active energy, and power); from this table, it is possible to highlight and obtain information about the clusters with the highest and the lowest number of service hours per day or the clusters with the highest and the lowest consumption of active energy and power.

4.2.2. K-Medoids Algorithm by Shire Counties

The data was then analyzed with the K-medoids algorithm using the Manhattan distance measure, which is theoretically more robust than the Euclidean one since it is less affected by outliers. Figure 4a shows that changing the distance measure leads the fviz_nbclust function to suggest a different number of clusters, and Figure 4b accordingly shows four clusters. Contrasting the clusters in Figure 3b and Figure 4b shows that the shire county of Amazonas was placed in a cluster of its own; only the internal validation by the Silhouette index in Section 4.4 will tell which of the two analyses is more accurate. Table 4 shows the representative information for each new grouping of shire counties and the corresponding variables, giving detailed numerical information for each of the four clusters. In Table 4, the Amazonas shire county, which forms its own cluster, has values very similar to those of the cluster it belonged to in the previous analysis (Figure 3b and Table 3). This suggests an incorrect grouping, which is examined in the course of the investigation.
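The two distance measures used in Sections 4.2.1 and 4.2.2 can be compared directly. The following Python sketch is only an illustration with hypothetical points; it is not taken from the R analysis.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical pair of elements described by two variables.
p, q = (0.0, 0.0), (3.0, 4.0)
d_euclidean = euclidean(p, q)   # hypotenuse of a 3-4-5 right triangle
d_manhattan = manhattan(p, q)   # 3 + 4
print(d_euclidean, d_manhattan)
```

The Euclidean measure squares each coordinate difference, so one very large difference weighs more heavily than it does in the Manhattan measure, where differences are simply added.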

4.2.3. DIANA Algorithm and Heatmaps by Shire Counties

Subsequently, the divisive analysis clustering (DIANA) technique was applied to obtain a dendrogram that represents in a different way what was expressed by the previous methods (Figure 5). The correlation coefficient of the dendrogram is 0.8515; according to [6], a value of 1 is optimal and values from 0.7 upwards are acceptable. In this method, it is not necessary to specify the number of clusters beforehand. Additionally, heatmaps are used for better visualization (Figure 6).

4.3. Energy Service Analysis by Municipalities Applying Different Clustering Algorithms

Because there are so many data and non-interconnected localities or zones, represented by approximately 1800 records, an analysis only at the shire county level is too general. The information can therefore be made more specific by studying the municipalities of shire counties that attract attention because of their behavior or characteristics in the descriptive statistical analysis. This section analyzes the 24 municipalities where there are non-interconnected zones in the shire counties of Amazonas, Cauca, Chocó, Magdalena, La Guajira, and Vichada. This study aims to gain a deeper understanding of service quality and to consider other variables in addition to those previously analyzed.

4.3.1. K-Means Algorithm by Municipality

First, the function is applied to identify the optimal number of clusters by different methods; most methods indicate that the best number of clusters for these data is three (Figure 7a). Unlike the previous section, this section analyzes reactive energy (measured in MVARh) and the average hours of service per day that the NIZs in these municipalities had. The aim is to identify clusters of municipalities whose energy generation systems have large losses.
The result of this study was the grouping of municipalities into three clusters (Figure 7b), where the average measures of each cluster are shown in Table 5. It details the values of variables such as hours of service per day and reactive energy for each cluster; the values shown demonstrate the great difference between cluster 2 and cluster 3 mainly in terms of hours of service per day, given that for the municipalities belonging to cluster 3 there is an average of 8.33 h of service per day, while for the municipalities of cluster 2 there is an average of 23.1 h of service per day.

4.3.2. K-Medoids Algorithm by Municipality

Additionally, to allow a later comparison between algorithms, the study was repeated using the K-medoids algorithm. The function yields the same optimal number of clusters (Figure 8a), resulting in three clusters of municipalities with similar characteristics (Figure 8b). As in the previous case (Figure 7), the classification obtained for each cluster is very similar, and in both cases the municipality of Leticia forms an independent cluster. This is because Leticia has higher values of consumption, energy, and service hours than the other municipalities studied at this stage, and no other element comes close to it or has similar characteristics. The summary information for each variable with respect to each cluster is shown in Table 6. Compared to Table 5, Table 6 does not show large variations in the values of the variables, which provides evidence that the two techniques (K-means and K-medoids) behave similarly for this dataset. These groupings are analyzed and validated in the next section.

4.3.3. DIANA Algorithm and Heatmaps by Municipality

Finally, the “Divisive Analysis Clustering” algorithm was applied, obtaining the dendrogram in Figure 9; this dendrogram has a data correlation of 0.715, which is within the acceptable range. Heatmaps (Figure 10) are also presented as another way of looking at this classification. For both cases (Heatmaps and DIANA algorithm), it is not necessary to express in advance the number of clusters, since for these, it is necessary to start from a single very large cluster to reach the elements individually, forming on the way clusters of municipalities with similar characteristics.

4.4. Internal Validation and Comparison of Clustering Techniques

Different algorithms need to be applied to the same dataset in order to choose the one that best fits the data, that is, the one that gives the best and most accurate classification and clustering of the dataset.
For validation purposes, it is impossible to apply external validation, as there are no real values for the classification or grouping of shire counties or municipalities, and no previous studies for comparison. For this reason, internal validation is carried out employing the “Silhouette” index, which identifies whether the assignment of an element within a cluster is well done with respect to the other elements of the same cluster and the other clusters [6]. In general terms, the value of the “Silhouette” will be between 1 and −1, with high values corresponding to a correctly made assignment of elements within a cluster and low or negative values corresponding to an incorrect assignment.
In this respect, Table 7 shows the average Silhouette index values of the elements in each cluster for each of the algorithms or techniques used in the analysis by shire counties. The best allocations were made by the K-means and DIANA algorithms, which have an average index of 0.6, a high value that indicates a correct allocation and grouping. The K-medoids algorithm presents a lower average index. This may be because it was run with the Manhattan distance measure, which suggested an additional, independent cluster for the shire county of Amazonas that is not present in the other two techniques.
As for the analysis by municipalities, Table 8 shows the average of the indices of the elements in each cluster for each of the techniques and algorithms applied. In this case, the K-means and K-medoids techniques allocate each element well within the established clusters, with indices of 0.6, so either of these two algorithms fits these data. The DIANA algorithm has a slightly lower index of 0.59.
In general terms, the best fitting technique for both analyses (shire counties and municipalities) and datasets is the K-means algorithm with Euclidean distance measure. This algorithm, unlike DIANA, gives the means of each cluster for each variable studied, giving the possibility to have a much more valid quantitative analysis. However, visually DIANA analysis allows us to identify other trends and see other paths or more specific groupings with low computational cost (fewer iterations) [20].
In this way, it is demonstrated that the results obtained are reliable, given that they present high indexes that justify and validate the analysis carried out. The different applied clustering techniques perform their functions correctly for the data set; this is further justified by the theoretical–practical works [6,8,20] that show the functionality of the Silhouette index to internally validate the results obtained from the clustering analysis.

5. Discussion

From all the analyses, there are clusters with similar characteristics that make it possible to determine the quality of service for each of the non-interconnected zones, as well as the zones with the highest and lowest hours of service and the relationship in terms of active energy consumption, maximum power, and reactive energy.
Regarding the analysis by shire counties using the K-means and DIANA algorithms, the shire counties classified in the first cluster are those with the most deficient service, with fewer hours of daily electricity service (only about 9 h) and therefore very low power and consumption values. This is the cluster with the most elements, containing the shire counties of Guaviare, Caquetá, Chocó, Cauca, La Guajira, Valle del Cauca, Casanare, Nariño, Antioquia, and Bolívar. These have their own environmental characteristics (forests, mountains, and river systems) and good energy potential for the implementation of renewable energies, such as photovoltaic systems for La Guajira, Bolívar, or Casanare, wind systems for La Guajira, or hydrokinetic systems for Antioquia or Cauca [21,22,23,24].
On the other hand, those located in the third cluster (Vaupés, Putumayo, and Guainía) have a regular service, with respect to the hours of energy service available to them compared to the other clusters, and they are grouped in a very similar way to their geographical location, where these three shire counties are in the southeast of the country. This grouping or pattern obtained through clustering analysis could be due to similar consumption habits and cultures and similar territorial management plans.
In turn, the Silhouette index for this first analysis by shire counties showed a good value (0.6) for the K-means and DIANA algorithms, which justifies and proves that, based on the data, a classification was made in accordance with the characteristics of the areas. It has been possible to observe hidden relationships, such as the quality of energy service by shire county grouping and the under-utilization of renewable natural resources in certain areas of the country, which would not have been apparent from looking at the large data set alone. Beyond testing different clustering methods, the conditions under which Colombians in Non-Interconnected Zones have or do not have access to a good electricity service and, consequently, to basic measures for the development of their daily, economic, social, and even cultural activities, are presented.
In developing countries such as Colombia that face a lack of access to modern energy services, energy poverty in this context can be explained by some factors mentioned by [25,26], such as the availability and reliability of energy sources, NIZs are peripheral and remote areas, factors associated with household preferences and social and cultural beliefs, shire counties such as Amazonas, La Guajira and Cauca with indigenous populations or Chocó with Afro-descendant populations, as described by [27]. In addition, some researchers, such as Fisher et al. [28], consider that price is the critical factor for energy access and can be a determining factor for these areas with isolated energy policies, social inequality, and high poverty rates [29].
In addition, the study was carried out for the 24 municipalities that have NIZ in the shire counties: Amazonas, Cauca, Chocó, Magdalena, La Guajira, and Vichada which are shire counties in different regions of the country, with interesting behavior in their hours of service and consumption.
At the municipal level, 12 of the 24 municipalities studied, spread across various regions and shire counties of the country, form a cluster with a low-quality service of approximately 8 h per day. The K-means analysis also shows that the municipality of Leticia in Amazonas has a high level of reactive energy, which places it in a cluster of its own, as none of the other municipalities studied has similar characteristics. Although Leticia averages 23 h of energy service per day, the statistical analysis shows that it has gone without service for days at a time.
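The way an outlying municipality can end up in a cluster of its own can be illustrated with a minimal sketch of K-means (Lloyd's iterations), written here in plain Python rather than the R workflow used in the study. The eight (hours of service, reactive energy) pairs are hypothetical values chosen for illustration, not the IPSE data; the z-score standardization and the hand-picked starting centroids are also assumptions of this sketch.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the old centroid if a cluster loses all its members.
            new_centroids.append(tuple(mean(c) for c in zip(*members))
                                 if members else centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical municipalities: (hours of service per day, reactive energy).
data = [(8.0, 9.0), (8.5, 10.5), (7.9, 11.0), (9.0, 8.5),
        (20.0, 100.0), (20.5, 115.0), (19.8, 120.0),
        (23.0, 1400.0)]  # the last row mimics a high-reactive-energy outlier

z = standardize(data)
# Assumed seeding: one starting centroid taken from each visible group.
labels, centroids = kmeans(z, [z[0], z[4], z[7]])
print(labels)
```

In this toy data the last observation is far from every other point on the reactive-energy axis, so it keeps a cluster to itself even though its hours of service are close to those of the second group.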
In the case of Uribia, La Guajira, the study shows that it belongs to the worst-service cluster; however, this area has a high potential for wind and solar energy [30,31]. Many other municipalities are in a similar position: they have good potential, and developing clean energy generation projects there would improve the living conditions of their inhabitants and the environmental conditions of the area.
On the other hand, the Silhouette index for the analysis at the municipal level gave good values, between 0.59 and 0.6, for the three algorithms applied, indicating how well the municipalities studied are classified within each cluster. This reflects the selection of an optimal number of clusters: with more or fewer clusters, the validation index would not reach acceptable values, the municipalities or shire counties would not be well classified, and it would be very difficult to obtain hidden information on the relationships and similarities among the non-interconnected zones.
These results can support the future improvement of energy management or power generation projects, justify actions, support decision-making, and help to understand the different situations in non-interconnected areas at the national level.
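For reference, the Silhouette value of an observation is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below, in plain Python with hypothetical points (not the study's data or its R code), compares the average Silhouette index of a three-cluster and a two-cluster labeling of the same points.

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of zero to the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return scores


def average_silhouette(points, labels):
    return mean(silhouette_scores(points, labels))


# Hypothetical points forming three compact groups.
pts = [(1, 1), (1, 2), (2, 1),
       (8, 8), (8, 9), (9, 8),
       (15, 1), (15, 2), (16, 1)]
labels3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one label per visible group
labels2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # two of the groups merged

avg3 = average_silhouette(pts, labels3)
avg2 = average_silhouette(pts, labels2)
print(round(avg3, 2), round(avg2, 2))
```

With these points, the three-cluster labeling scores higher than the two-cluster one, which is the kind of comparison an average Silhouette index supports when choosing the number of clusters.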

6. Conclusions

The results show the classification of the shire counties with respect to the quality of their current electricity service, from a deficient service to a more adequate service, considering the demand and the hours without service. In this way, it is possible to determine where energy solutions should be oriented, with the strategic objective of greater coverage in more NIZ.
The results are evaluated through an internal clustering validation (Silhouette index), which makes it possible to assess the quality of the analysis, to compare the different methods or algorithms applied (K-means, K-medoids PAM, divisive analysis clustering, and heatmaps), and to find the most efficient analyses.
The results show similarities between the quality of service and the geographical locations of the NIZ, which opens the way to studying consumption habits and customs in the communities of non-interconnected zones.
This research has identified relevant information for the analysis of electricity variables in the NIZ. All the findings help to guide decision-making, to identify patterns and similarities, and to provide tools for improving the living conditions of the inhabitants of the non-interconnected areas of Colombia.
In future work, it is proposed to extend the data analysis with other techniques to identify the incidence of Non-Interconnected Zones in territories that have Territorially Focused Development Programmes (PDET). These are special municipalities that, based on the Colombian peace process, aim to stabilize and transform the territories most affected by violence, poverty, illicit economies, and institutional weakness, and thus achieve rural development. Likewise, as future work, it is proposed to process all the data at the municipal level for each shire county without any restriction, with the aim of obtaining a sufficiently detailed study. The algorithms presented are sufficiently robust to carry out this analysis.

Author Contributions

Conceptualization, R.F.C.-Q. and G.M.-G.; methodology, R.F.C.-Q., G.M.-G. and M.B.-A.; validation, R.F.C.-Q., G.M.-G., M.B.-A., K.E.S. and J.C.C.-Q.; investigation, R.F.C.-Q. and G.M.-G.; data Curation, G.M.-G. and M.B.-A.; writing—original draft preparation, R.F.C.-Q., G.M.-G. and M.B.-A.; writing—review and editing, R.F.C.-Q., K.E.S. and J.C.C.-Q.; visualization, R.F.C.-Q., G.M.-G., M.B.-A., K.E.S. and J.C.C.-Q.; supervision, R.F.C.-Q.; project administration, R.F.C.-Q.; funding acquisition, R.F.C.-Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Universidad Cooperativa de Colombia, grant number INV3123, and the APC was funded by the Universidad Cooperativa de Colombia.

Data Availability Statement

All data analyzed during this study are included in this article.

Acknowledgments

The authors acknowledge the fruitful ideas and discussions received during different meetings from the ENCORE consortium (Project title: “ENCORE: Energizing Coastal Regions with Offshore Renewable Energy”) and the Universidad Nacional del Chimborazo (Project title: “Estudio de procesos de biorrefinería aplicados a biomasa proveniente de residuos lignocelulósicos empleando tratamientos térmicos y disolventes eutécticos profundos; estudio de caso: Riobamba Ecuador”).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Colmenares-Quintero, R.F.; Latorre-Noguera, L.F.; Rojas, N.; Kolmsee, K.; Stansfield, K.E.; Colmenares-Quintero, J.C. Computational Framework for the Selection of Energy Solutions in Indigenous Communities in Colombia: Kanalitojo Case Study. Cogent Eng. 2021, 8, 1926406.
  2. Ochoa, L.L.; Paredes, K.R.; Tejada, J.E. Estudio Comparativo de Técnicas no Supervisadas de Minería de Datos para Segmentación de Alumnos. In Global Partnerships for Development and Engineering Education, Proceedings of the 15th LACCEI International Multi-Conference for Engineering, Education and Technology, Boca Raton, FL, USA, 19–21 July 2017; Latin American and Caribbean Consortium of Engineering Institutions: Boca Raton, FL, USA, 2017; p. 115. ISBN 978-0-9993443-0-9.
  3. Razak, M.A.; Yakub, F.; Sulaiman, N.N.I.; Rashid, A.M.Z.; Shaikh Salim, S.A.Z.; Rasid, A.Z.; Abu, A. Energy Consumption Clustering Analysis in Residential Building. In Proceedings of the Intelligent Manufacturing and Mechatronics; Jamaludin, Z., Ali Mokhtar, M.N., Eds.; Springer: Singapore, 2020; pp. 436–450.
  4. Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.-T. A Review of Clustering Techniques and Developments. Neurocomputing 2017, 267, 664–681.
  5. Chitra, K.; Maheswari, D. A Comparative Study of Various Clustering Algorithms in Data Mining. Int. J. Comput. Sci. Mob. Comput. 2017, 6, 109–115.
  6. Amat Rodrigo, J. RPubs-Clustering y Heatmaps: Aprendizaje No Supervisado Con R. Available online: https://rpubs.com/Joaquin_AR/310338 (accessed on 5 December 2021).
  7. Rodriguez, M.Z.; Comin, C.H.; Casanova, D.; Bruno, O.M.; Amancio, D.R.; Costa, L.D.; Rodrigues, F.A. Clustering Algorithms: A Comparative Approach. PLoS ONE 2019, 14, e0210236.
  8. Tizón Galisteo, D. Big Data Clustering. Master’s Thesis, UNED, Madrid, Spain, 2017.
  9. Gostkowski, M.; Rokicki, T.; Ochnio, L.; Koszela, G.; Wojtczuk, K.; Ratajczak, M.; Szczepaniuk, H.; Bórawski, P.; Bełdycka-Bórawska, A. Clustering Analysis of Energy Consumption in the Countries of the Visegrad Group. Energies 2021, 14, 5612.
  10. Li, Y.; Yang, J.; Jiang, X. Study on Clustering Analysis of Building Energy Consumption Data. IOP Conf. Ser. Earth Environ. Sci. 2021, 676, 012061.
  11. Liu, X.; Ding, Y.; Tang, H.; Xiao, F. A Data Mining-Based Framework for the Identification of Daily Electricity Usage Patterns and Anomaly Detection in Building Electricity Consumption Data. Energy Build. 2021, 231, 110601.
  12. Ramos, S.; Soares, J.; Cembranel, S.S.; Tavares, I.; Foroozandeh, Z.; Vale, Z.; Fernandes, R. Data Mining Techniques for Electricity Customer Characterization. Procedia Comput. Sci. 2021, 186, 475–488.
  13. Kapousouz, E.; Seyrfar, A.; Derrible, S.; Ataei, H. Chapter 5-A Clustering Analysis of Energy and Water Consumption in U.S. States from 1985 to 2015. In Data Science Applied to Sustainability Analysis; Dunn, J., Balaprakash, P., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; pp. 81–108. ISBN 978-0-12-817976-5.
  14. IPSE, IPSE–Energía Que Nos Conecta. Available online: https://ipse.gov.co/ (accessed on 18 May 2022).
  15. Colmenares-Quintero, R.F.; Maestre-Gongora, G.P.; Pacheco-Moreno, L.J.; Rojas, N.; Stansfield, K.E.; Colmenares-Quintero, J.C. Analysis of the Energy Service in Non-Interconnected Zones of Colombia Using Business Intelligence. Cogent Eng. 2021, 8, 1907970.
  16. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
  17. Abusleme, C. ¿Por qué los gobiernos promueven estrategias de datos abiertos? Los casos de México, Chile y Colombia. Rev. Estud. Políticas Públicas 2020, 6, 20–41.
  18. Maestre-Gongora, G.; Rangel-Carrillo, A.; Osorio-Sanabria, M. The Value of Open Data Government: A Quality Assessment Approach. Rev. Investig. Desarro. E Innov. 2021, 11, 507–518.
  19. IPSE Aumenta en un 5.9% la Energía Registrada en las Localidades de las Zonas No Interconectadas según Informe de Telemetría. Available online: https://ipse.gov.co/blog/2021/11/26/aumenta-en-un-5-9-la-energia-registrada-en-las-localidades-de-las-zonas-no-interconectadas-segun-informe-de-telemetria/ (accessed on 27 September 2022).
  20. Reddy, C.K.; Vinzamuri, B. A Survey of Partitional and Hierarchical Clustering Algorithms. In Data Clustering; Chapman and Hall/CRC: Boca Raton, FL, USA, 2014; ISBN 978-1-315-37351-5.
  21. Carvajal-Romo, G.; Valderrama-Mendoza, M.; Rodríguez-Urrego, D.; Rodríguez-Urrego, L. Assessment of Solar and Wind Energy Potential in La Guajira, Colombia: Current Status, and Future Prospects. Sustain. Energy Technol. Assess. 2019, 36, 100531.
  22. López, A.R.; Krumm, A.; Schattenhofer, L.; Burandt, T.; Montoya, F.C.; Oberländer, N.; Oei, P.-Y. Solar PV Generation in Colombia-A Qualitative and Quantitative Approach to Analyze the Potential of Solar Energy Market. Renew. Energy 2020, 148, 1266–1279.
  23. Villegas-Quiceno, A.P.; Aristizabal-Tique, V.H.; Arbelaez-Pérez, O.F.; Colmenares-Quintero, R.F.; Vélez-Hoyos, F.J. Development of Riverine Hydrokinetic Energy Systems in Colombia and Other World Regions: A Review of Case Studies. DYNA 2021, 88, 256–264.
  24. Vides-Prado, A.; Camargo, E.O.; Vides-Prado, C.; Orozco, I.H.; Chenlo, F.; Candelo, J.E.; Sarmiento, A.B. Techno-Economic Feasibility Analysis of Photovoltaic Systems in Remote Areas for Indigenous Communities in the Colombian Guajira. Renew. Sustain. Energy Rev. 2018, 82, 4245–4255.
  25. Sy, S.A.; Mokaddem, L. Energy Poverty in Developing Countries: A Review of the Concept and Its Measurements. Energy Res. Soc. Sci. 2022, 89, 102562.
  26. Hiemstra-van der Horst, G.; Hovorka, A.J. Reassessing the “Energy Ladder”: Household Energy Use in Maun, Botswana. Energy Policy 2008, 36, 3333–3344.
  27. Benavides-Castillo, J.M.; Carmona-Parra, J.A.; Rojas, N.; Stansfield, K.E.; Colmenares-Quintero, J.C.; Colmenares-Quintero, R.F. Framework to Design Water-Energy Solutions Based on Community Perceptions: Case Study from a Caribbean Coast Community in Colombia. Cogent Eng. 2021, 8, 1905232.
  28. Fisher, U.; Sugarmen, C.; Ring, A.; Sinai, J. Gas Turbine “Solarization”-Modifications for Solar/Fuel Hybrid Operation. J. Sol. Energy Eng. 2004, 126, 872–878.
  29. Prieto, A.V.; García-Estévez, J.; Ariza, J.F. On the Relationship between Mining and Rural Poverty: Evidence for Colombia. Resour. Policy 2022, 75, 102443.
  30. Atlas Interactivo-Radiación IDEAM. Available online: http://atlas.ideam.gov.co/visorAtlasRadiacion.html (accessed on 3 September 2021).
  31. Atlas Interactivo-Vientos-IDEAM. Available online: http://atlas.ideam.gov.co/visorAtlasVientos.html (accessed on 3 September 2021).
Figure 1. Box and whisker: active energy for shire county.
Figure 2. Box and Whisker: Maximum Power for shire county.
Figure 3. (a) Optimal number of clusters; (b) clusters by K-means algorithm (shire counties, active energy, operating hours, and power).
Figure 4. (a) Optimal number of clusters; (b) clusters by K-medoids algorithm (shire counties, active energy, operating hours, and power).
Figure 5. Dendrogram by DIANA algorithm (shire counties, active energy, operating hours and power).
Figure 6. Heatmaps (shire counties, active energy, operating hours, and power).
Figure 7. (a) Optimal number of clusters; (b) clusters by K-means algorithm (municipalities, operating hours per day, and reactive power).
Figure 8. (a) Optimal number of clusters; (b) clusters by K-medoids algorithm (municipalities, operating hours per day, and reactive power).
Figure 9. Dendrogram by DIANA algorithm (municipalities, hours of service, and reactive energy).
Figure 10. Heatmaps (municipalities, hours of service, and reactive energy).
Table 1. Dataset characterization.
| Variables | Unit of Measurement | Number of Records |
| --- | --- | --- |
| Active energy | MWh | 1777 |
| Reactive energy | MVARh | 1687 |
| Power | kW | 1811 |
| Hours with energy service | Hour | 1814 |
| Hours without energy service | Hour | 1799 |
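Table 2. Descriptive statistical analysis of the data.
| Shire County | Variable | Minimum | Maximum | Range | Average | Median | Standard Deviation | Coefficient of Variation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazonas | Active Energy | 21 | 4109 | 4088 | 1195 | 100 | 1697 | 141.93 |
| Amazonas | Maximum Power | 79.65 | 8008.78 | 7929.13 | 2295 | 218.77 | 3275 | 142.71 |
| Amazonas | Hours With Service | 5.52 | 24 | 18.48 | 18.88 | 23.59 | 6 | 35.50 |
| Amazonas | Hours Without Service | 0 | 18.48 | 18.48 | 5.12 | 0.41 | 6 | 131.09 |
| Cauca | Active Energy | 0.245 | 34.27 | 34.02 | 8 | 6.97 | 6 | 73.15 |
| Cauca | Maximum Power | 0.32 | 4800 | 4799.68 | 102 | 44.86 | 431 | 421.40 |
| Cauca | Hours With Service | 0 | 24 | 24 | 6.46 | 6.36 | 2 | 40.05 |
| Cauca | Hours Without Service | 0 | 24 | 24 | 17.54 | 17.64 | 2 | 14.44 |
| Chocó | Active Energy | 0.062 | 831.67 | 831.61 | 111 | 33.795 | 156 | 139.69 |
| Chocó | Maximum Power | 0.48 | 847,700 | 847,699.52 | 3925 | 130.06 | 47,351 | 1206.13 |
| Chocó | Hours With Service | 1.1 | 24 | 22.9 | 13.42 | 12.59 | 7 | 54.20 |
| Chocó | Hours Without Service | 0 | 22.9 | 22.9 | 10.58 | 10.7 | 7 | 68.74 |
| La Guajira | Active Energy | 16 | 43 | 27 | 30 | 29.5 | 5 | 19.64 |
| La Guajira | Maximum Power | 188.81 | 252.48 | 63.67 | 223 | 221.72 | 19 | 8.75 |
| La Guajira | Hours With Service | 3.51 | 10.1 | 6.59 | 6.86 | 7.26 | 1 | 23.11 |
| La Guajira | Hours Without Service | 13.9 | 20.49 | 6.59 | 17.14 | 16.74 | 1 | 9.24 |
| Magdalena | Active Energy | 51 | 107 | 56 | 78 | 76.5 | 12 | 16.12 |
| Magdalena | Maximum Power | 129.22 | 206.91 | 77.69 | 162 | 160.25 | 22 | 13.72 |
| Magdalena | Hours With Service | 15.17 | 23.7 | 8.53 | 22.34 | 23.14 | 1 | 8.15 |
| Magdalena | Hours Without Service | 0.3 | 8.83 | 8.53 | 1.66 | 0.86 | 1 | 109.99 |
| San Andrés Y Providencia | Active Energy | 838 | 18,209 | 17,371 | 8412 | 11,383 | 7361 | 87.50 |
| San Andrés Y Providencia | Maximum Power | 1521.5 | 30,490.46 | 28,968.96 | 14,021 | 10,292.44 | 12,843 | 91.60 |
| San Andrés Y Providencia | Hours With Service | 23.59 | 24 | 0.41 | 23.99 | 24 | 0 | 0.26 |
| San Andrés Y Providencia | Hours Without Service | 0 | 0.41 | 0.41 | 0.01 | 0 | 0 | 648.07 |
| Vichada | Active Energy | 13 | 2629 | 2616 | 614 | 259 | 836 | 136.16 |
| Vichada | Maximum Power | 103.66 | 4614.72 | 4511.06 | 1194 | 580.58 | 1528 | 127.98 |
| Vichada | Hours With Service | 5.9 | 24 | 18.1 | 20.08 | 23.4 | 5 | 26.22 |
| Vichada | Hours Without Service | 0 | 18.1 | 18.1 | 3.92 | 0.575 | 5 | 136.15 |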
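The columns of Table 2 are standard descriptive statistics. As a minimal sketch, assuming the coefficient of variation is the standard deviation divided by the mean and expressed as a percentage, and assuming the sample (n - 1) standard deviation, they could be computed as follows. The code is plain Python rather than the R used in the study, and the daily hours of service in the example are hypothetical.

```python
from statistics import mean, median, stdev


def describe(values):
    """Summary statistics in the column layout of Table 2."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator); an assumption
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": avg,
        "median": median(values),
        "sd": sd,
        "cv_percent": 100 * sd / avg,  # coefficient of variation, in percent
    }


# Hypothetical daily hours of service for one locality over a week.
hours = [24, 24, 23.5, 6, 24, 12, 24]
stats = describe(hours)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A coefficient of variation above 100, as in several rows of the table, means the standard deviation exceeds the mean, which is typical of variables dominated by a few very large or very small readings.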
Table 3. Representative means of the shire counties clusters by K-means.
Cluster | Hours of Service per Day | Active Energy (MWh) | Power (kW)
19.1242.71356
221.15051937
315.051114047
Table 4. Representative means of shire counties clusters by K-medoids.
Cluster | Hours of Service per Day | Active Energy (MWh) | Power (kW)
118.911962295
27.8918.5992
318.82212215
415.051114047
Table 5. Representative means of the municipality clusters by K-means.
Cluster | Hours of Service per Day | Reactive Energy (MVARh)
120.1111
223.11412
38.339.86
Table 6. Representative means of municipality clusters by K-medoids.
Cluster | Hours of Service per Day | Reactive Energy (MVARh)
120.6119
28.8311
323.11412
Table 7. Comparison of clustering algorithms for shire county analysis.
Average Silhouette Index of Clustering by Shire Counties
| K-Means | K-Medoids | Divisive Analysis (DIANA) |
| --- | --- | --- |
| 0.6 | 0.53 | 0.6 |
Table 8. Comparison of clustering algorithms for municipal analysis.
Average Silhouette Index of Clustering by Municipalities
| K-Means | K-Medoids | Divisive Analysis (DIANA) |
| --- | --- | --- |
| 0.6 | 0.6 | 0.59 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
