1. Introduction
We present a method for visualizing multidimensional data that extends the sum of ranking differences (SRD) method with a goal-oriented modification of the widely applied parallel coordinates method.
Parallel coordinates is a popular visualization technique thanks to its straightforward use and comprehensibility [
1,
2,
3]. An axis is drawn for each variable, the data points of each variable are plotted on its axis, and each point is connected to its counterpart on the neighboring axis. The method visualizes the correlation between the variables, as the intersections of the lines connecting the axes may indicate how different the variables are. The rank correlation coefficient (Kendall’s tau, τ) measures the correlation if intersections are present [
4]. If no intersections can be observed between the lines, the ordering of the two variables is identical, and their rank correlation is equal to one. To provide a comprehensible visualization, the number of intersections between the axes should be reduced. The number of intersections can be reduced through the careful placement of the variables along the x-axis, which is achieved by measuring the similarity between the axes [
5]. Multidimensional scaling curtails the number of intersections by reducing the variables into 2D coordinates and ordering based on their proximity to each other [
6]. Employing hierarchical clustering and occlusion algorithms enables the number of lines to be reduced, improving the comprehensibility of the visualization [
7]. These examples illustrate that the technique of parallel coordinates requires the effective ordering of the variables based on their similarity.
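The link between line crossings and rank correlation can be made concrete with a short sketch. Assuming untied values, the lines of two objects cross between neighboring axes exactly when the pair is discordant, so Kendall's τ follows directly from the crossing count. The function names are illustrative and are not taken from the toolbox described in this work.

```python
from itertools import combinations


def count_crossings(left, right):
    """Count line crossings between two neighboring axes.

    left[i] and right[i] are the positions (ranks) of object i on the two axes.
    The lines of objects i and j cross when their order differs on the axes.
    """
    crossings = 0
    for i, j in combinations(range(len(left)), 2):
        if (left[i] - left[j]) * (right[i] - right[j]) < 0:
            crossings += 1
    return crossings


def kendall_tau(left, right):
    """Kendall's tau for untied data, computed from the number of crossings."""
    n = len(left)
    pairs = n * (n - 1) // 2
    return 1.0 - 2.0 * count_crossings(left, right) / pairs


reference = [1, 2, 3, 4]
print(count_crossings(reference, [1, 3, 2, 4]))  # one swapped pair
print(kendall_tau(reference, [1, 2, 3, 4]))      # identical ordering
print(kendall_tau(reference, [4, 3, 2, 1]))      # fully reversed ordering
```

With no crossings τ equals one, and every additional crossing lowers τ by the same amount, which is why reducing crossings and placing similar axes next to each other are two views of the same goal.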
The aim of this work is to improve the parallel coordinate visualization by ordering the variables based on the SRD method and also to provide a more detailed visualization of how the variables are grouped in the SRD method.
SRD provides an ordering based on how similar the variables are to a common reference, which can be an ideal one or aggregated gold standard [
8,
9,
10]. As this straightforward method is visualized through the 1D ordering of the SRD values, the information that can be captured in one figure is inevitably limited. The correlation between the variables is not shown, nor is the figure capable of identifying mutually similar variables. The motivation behind this study was to bring more information to the figure of the SRD results. The aim of this work is to overcome the deficiencies of the present visualization tools for the SRD technique by a goal-oriented modification of the widely applied parallel coordinates method.
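A minimal sketch of the SRD calculation may help here. It assumes the usual definition: the sum of absolute differences between the ranks of a variable and the ranks of the reference, scaled by its maximum so that the values fall between 0 and 100. The function names are illustrative, ties receive average ranks, and the scaling maximum is the one valid for untied rankings.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def srd(variable, reference):
    """Sum of absolute rank differences between a variable and the reference."""
    return sum(abs(a - b) for a, b in zip(ranks(variable), ranks(reference)))


def max_srd(n):
    """Largest possible SRD for n untied objects: n^2/2 (even n), (n^2-1)/2 (odd n)."""
    return n * n // 2


def normalized_srd(variable, reference):
    """SRD scaled to the interval from 0 (identical ranking) to 100 (reversed)."""
    return 100.0 * srd(variable, reference) / max_srd(len(variable))


gold_standard = [1.0, 2.0, 3.0, 4.0]
print(normalized_srd([10.0, 30.0, 20.0, 40.0], gold_standard))
```

The reference passed as `gold_standard` can be an ideal ranking or a row-wise aggregate of the variables, as described above.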
The key idea of this work is to incorporate SRD into the parallel coordinates method. As the parallel coordinates method excels at describing correlations, it seems reasonable to use the techniques for multi-criteria analysis methods, such as the SRD, which is used to determine the relationship of the variables to a gold standard. The SRD sorts the axes of parallel coordinates, efficiently depicting the ranking and showing the correlation between neighboring variables.
The SRD values indicate whether a variable lies close to the ideal ranking or whether its ranking is random or reversed, which provides an unambiguous ordering and may further enhance the visualization technique based on the parallel coordinates method.
For the visualization of mutually similar variables, two directions are defined on the x-axis, namely the left- and right-hand sides. As the deviation angle between the SRD values of the variables is another indicator of the consistency of the visualization, we developed an iterative algorithm that determines the angles between the last variables placed in the two directions; the selected variable is placed on the left- or right-hand side of the reference based on the smaller deviation angle. Additionally, the method is compared with simple (alphabetical), multidimensional scaling-based and unidirectional SRD-based orderings.
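The following is only a rough sketch of how such an angle-based, two-directional placement could look; it is not the algorithm of this work, whose details are given in Section 3. Several points are assumptions made for illustration: the inputs are already rankings, the two variables closest to the reference seed the right- and left-hand sides, the angle is taken at the reference vertex of the triangle formed by three SRD distances (law of cosines), and ties go to the right. All function names are hypothetical.

```python
import math


def srd(ranking_a, ranking_b):
    """Sum of absolute rank differences between two rankings."""
    return sum(abs(a - b) for a, b in zip(ranking_a, ranking_b))


def angle_at_reference(d_ref_a, d_ref_b, d_a_b):
    """Angle at the reference vertex, from the law of cosines (assumed rule)."""
    if d_ref_a == 0 or d_ref_b == 0:
        return 0.0  # degenerate triangle: a variable coincides with the reference
    cosine = (d_ref_a ** 2 + d_ref_b ** 2 - d_a_b ** 2) / (2.0 * d_ref_a * d_ref_b)
    return math.acos(max(-1.0, min(1.0, cosine)))


def bidirectional_order(rankings, reference_name):
    """Return the axis order: left side (reversed), reference, right side."""
    reference = rankings[reference_name]
    others = sorted(
        (name for name in rankings if name != reference_name),
        key=lambda name: srd(rankings[name], reference),
    )
    # Assumed seeding: closest variable to the right, second closest to the left.
    right, left = others[:1], others[1:2]
    for name in others[2:]:
        d_new = srd(rankings[name], reference)
        angles = []
        for side in (left, right):
            last = side[-1]  # most recently placed variable on this side
            angles.append(angle_at_reference(
                d_new,
                srd(rankings[last], reference),
                srd(rankings[name], rankings[last]),
            ))
        if angles[0] < angles[1]:
            left.append(name)
        else:
            right.append(name)
    return left[::-1] + [reference_name] + right


rankings = {
    "ref": [1, 2, 3, 4, 5, 6],
    "a": [1, 2, 3, 4, 6, 5],
    "b": [2, 1, 3, 4, 5, 6],
    "c": [1, 2, 3, 5, 6, 4],
    "d": [3, 1, 2, 4, 5, 6],
}
print(bidirectional_order(rankings, "ref"))
```

In this toy input, `c` deviates from the reference in the same positions as `a`, and `d` in the same positions as `b`, so each joins the side of its look-alike even though all four are close to the reference.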
The contribution of this work is as follows:
We utilize SRD as the axes of parallel coordinates, providing a thorough visualization, and define two directions for dissimilar variables.
We introduce a new visualization technique for the sum of ranking differences instead of the classical unidirectional ordering.
We extend the parallel coordinates to two directions on the x-axis (to the left and right from the middle) to provide information on mutually dissimilar variables.
We provide a toolbox for MATLAB and Octave to enhance the visualization of SRD-based parallel coordinates.
In the following, we first provide a review of the relevant studies and identify the major gaps in the state of the art and state how these gaps will be addressed by this study. In
Section 3, the details of the method are presented.
Section 4 demonstrates the applicability of the visualization tool in the analysis of the sources of greenhouse gas emissions. We compare the alphabetical, MDS-based, SRD-based and deviation angle-based SRD orderings to benchmark them for the purpose of visualizing the parallel coordinates. According to the benchmarking results, the angle-based method yields the best arrangement of the climate-change data.
2. Review of Related Methods—Motivation
The visualization of multivariate data focuses primarily on the visualization of data points, i.e., projecting data points into a lower-dimensional subspace. The principal component analysis (PCA) utilizes the correlation of data. Other techniques, such as Sammon mapping, preserve the distances or neighborhoods by searching for two- or three-dimensional mappings [
11].
It is a crucial but distinct task to explore and visualize the relationships between the variables themselves. Exploratory data analysis widely uses correlation and distance matrices (scatter matrices) to visualize the relationships between variables. Although a scatter matrix represents the variables themselves, it also shows the correlation that a comparison between variables reveals. Heatmap-based representations of correlation matrices are probably the most common technique in this area. The seriation of the matrices also supports the representation of objects, where related groups of variables can be defined and arranged [
12]. A further development is when the dominant relations can be interpreted as a network, and the internal structure of the variables is explored by displaying this network [
13]. Classical multivariate statistical tools are also enriched by the visualization of the variables. The biplot visualization was developed in connection with PCA. The biplot represents the variables by vectors, and the angles between these vectors express the similarity of the variables in a quantitative manner [
14].
The application of the above methods is resource intensive, the evaluation of the results requires significant background knowledge, and the importance of correlations is not revealed. The SRD technique was created to fulfill the need for simplicity and interpretability. Besides exploring the relationships between the variables, the SRD algorithm serves an additional purpose: it aggregates the variables (aspects) and characterizes the other variables (aspects) in relation to the aggregate. Therefore, SRD visualization not only sorts variables based on distance, but also illustrates the likelihood that the rank ordering(s) can occur.
The advantage of the method is also confirmed by its broad applicability. It has been used for the following:
To determine the similarity between models and facilitate the selection of models without considering weight allocation problems [
15];
For tea grade identification [
16];
In QSAR modeling to determine training-test set splits correctly [
17];
For comparison of performance parameters (merits) in QSAR/QSPR model validation [
18];
For column selection in chromatography [
19,
20];
For comparison of lipophilicity parameters [
21,
22];
For outlier detection [
23];
In political sciences to determine the optimal constituency size [
24] and to rank universities and institutions [
25].
The 1D visualization of SRD is distinctive. A colored bar is assigned to each variable. The x-axis and one of the y-axes carry the same quantity, namely, the scaled SRD values between 0 and 100; i.e., the 2D plot in fact realizes a 1D ordering. Hence, the lengths of the bars do not carry any additional information; their tops lie on the 45-degree line. The essential information is the distance of each bar from zero (the gold standard) and from the random distribution. The SRD distribution of random numbers is calculated from the exact theoretical distribution only if the number of objects (lines) is small (<14 for untied observations and <9 if ties are also present in the input matrix). In all other cases, the random SRD distribution is well approximated with a Gaussian curve based on Monte Carlo simulations (preferably in the form of a cumulative distribution function).
Figure 1 shows the traditional way to visualize the SRD ordering [
8,
9]. A large distance from zero suggests that the variables are not optimal (or that the gold standard has little to do with them). The variables located under the cumulative probability curve are indistinguishable from a random ranking. XX1, at approximately 5%, shows the predefined error limit (a random ordering cannot be fully excluded, but it occurs in fewer than 5 cases out of 100). The reverse rankings are located to the right of the median. If a line is placed to the right of the XX19 dotted line, the reverse ranking is considered to be significant at the 5% level. The grouping of the lines is also informative: it reflects not their distance from each other, but their distance relative to the gold standard. The ordering of SRD values is one-dimensional from left to right.
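The thresholds of the random SRD distribution can be approximated by simulation. The sketch below is one possible Monte Carlo estimate, not the validated procedure of the SRD literature: it draws random permutations, computes their scaled SRD against a fixed reference, and reads off empirical percentiles. The labels XX1 and XX19 follow the text; the label `Med` for the median, the function names and the default number of draws are assumptions.

```python
import random


def max_srd(n):
    """Largest possible SRD for n untied objects."""
    return n * n // 2


def random_srd_sample(n_objects, n_draws, seed=0):
    """Sorted scaled SRD values of random permutations against 1..n."""
    rng = random.Random(seed)
    reference = list(range(1, n_objects + 1))
    values = []
    for _ in range(n_draws):
        permutation = reference[:]
        rng.shuffle(permutation)
        total = sum(abs(p - r) for p, r in zip(permutation, reference))
        values.append(100.0 * total / max_srd(n_objects))
    values.sort()
    return values


def percentile(sorted_values, fraction):
    """Empirical percentile of an already sorted sample."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def random_thresholds(n_objects, n_draws=2000, seed=0):
    """5%, 50% and 95% points of the simulated random SRD distribution."""
    sample = random_srd_sample(n_objects, n_draws, seed)
    return {
        "XX1": percentile(sample, 0.05),
        "Med": percentile(sample, 0.50),
        "XX19": percentile(sample, 0.95),
    }


print(random_thresholds(147))
```

A variable whose scaled SRD falls to the left of XX1 would then be treated as significantly closer to the reference than a random ranking, and one to the right of XX19 as significantly reversed.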
There is another visualization possibility for SRD: the uncertainty of the SRD bars can be estimated by leave-one-out or leave-many-out cross-validation or by bootstrapping. Preferably, such uncertainties are plotted on a box-and-whisker plot, which shows a kind of distributional information. This approach was introduced in 2011 [
8].
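As an illustration of the leave-one-out variant, the sketch below recomputes the scaled SRD after removing each object in turn and summarizes the resulting values as they might appear in a box-and-whisker plot. It assumes untied values and uses hypothetical function names; leave-many-out and bootstrap variants would resample the objects in the same way.

```python
from statistics import median


def ranks(values):
    """1-based ranks, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def normalized_srd(variable, reference):
    """Scaled SRD (0 to 100) between a variable and the reference."""
    total = sum(abs(a - b) for a, b in zip(ranks(variable), ranks(reference)))
    return 100.0 * total / (len(variable) ** 2 // 2)


def leave_one_out_srd(variable, reference):
    """Scaled SRD values obtained by omitting each object once."""
    results = []
    for skip in range(len(variable)):
        reduced_variable = variable[:skip] + variable[skip + 1:]
        reduced_reference = reference[:skip] + reference[skip + 1:]
        results.append(normalized_srd(reduced_variable, reduced_reference))
    return results


def box_summary(values):
    """Minimum, median and maximum, the core of a box-and-whisker plot."""
    return min(values), median(values), max(values)


loo = leave_one_out_srd([1, 2, 3, 5, 4], [1, 2, 3, 4, 5])
print(loo, box_summary(loo))
```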
The ultimate visualization of SRD is a heatmap. It solves the problematic selection of the gold standard in a way that all variables serve as the gold standard once and only once; it is called comparisons with one variable at a time (COVAT) [
22]. This distance matrix is symmetric and ordered according to the sum of the distances, column-wise. However, this matrix of pairwise distances with color codes shows only binary couplings, similarly to a correlation matrix, just in a reverse order (the smaller an SRD value is, the greater the similarity).
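The COVAT idea can be sketched as a pairwise SRD matrix in which every variable serves as the reference once. The code assumes that the inputs are already untied rankings and that the variables are ordered by ascending column sum; the function names are illustrative.

```python
def normalized_srd(ranking_a, ranking_b):
    """Scaled SRD (0 to 100) between two untied rankings."""
    total = sum(abs(a - b) for a, b in zip(ranking_a, ranking_b))
    return 100.0 * total / (len(ranking_a) ** 2 // 2)


def covat_matrix(rankings):
    """Pairwise SRD matrix with variables ordered by the sum of distances."""
    names = list(rankings)
    pairwise = {
        a: {b: normalized_srd(rankings[a], rankings[b]) for b in names}
        for a in names
    }
    # The matrix is symmetric, so row sums equal column sums.
    ordered = sorted(names, key=lambda name: sum(pairwise[name].values()))
    matrix = [[pairwise[a][b] for b in ordered] for a in ordered]
    return ordered, matrix


rankings = {
    "x": [1, 2, 3, 4],
    "y": [1, 2, 4, 3],
    "z": [4, 3, 2, 1],
}
order, matrix = covat_matrix(rankings)
print(order)
for row in matrix:
    print(row)
```

Each cell could then be color-coded for the heatmap, with small values indicating high similarity.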
All of these visualizations provide information on the variables and their ordering (grouping). The ranking of the objects (enumerated in the rows of the input matrix) and their specific rankings have not been shown in any of the figures yet, although it would be imperative to do so as well. This train of thought led us to close this gap in the visualization of object rankings. As presented in the following section, we propose a method that arranges the parallel coordinates bidirectionally, from the middle to the left and to the right, based on the SRD ordering.
4. Case Study of Greenhouse Gas Emissions
To introduce the capabilities of the SRD-based parallel coordinates, we employ the algorithm on the Climate Analysis Indicators Tool (CAIT) [
29] database of Climate Watch, which contains a category type of aggregated greenhouse gases (all GHG) that includes carbon dioxide, methane, nitrous oxide and fluorinated gases. Carbon dioxide (CO2) represents another emissions category.
Table 1 contains all emissions categories as well as their IDs and codes. All categories are provided in metric tons of CO2 equivalent divided by the population. Indicators are named according to the following format: category code + ‘e’ + emissions type, where ‘e’ is the separator between the category and the emissions type.
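The naming convention can be illustrated with two small helper functions. They are hypothetical and assume that category codes consist of upper-case letters only, so that the first lowercase ‘e’ is the separator; names that do not follow the convention are returned as unparsed.

```python
def build_indicator(category_code, emissions_type):
    """Compose an indicator name: category code + 'e' + emissions type."""
    return f"{category_code}e{emissions_type}"


def parse_indicator(name):
    """Split an indicator name at the first 'e'; return None if it does not fit."""
    code, separator, emissions_type = name.partition("e")
    if not separator or not emissions_type or not code.isupper():
        return None
    return code, emissions_type


print(build_indicator("LCF", "CO2"))
print(parse_indicator("INDeAllGHG"))
print(parse_indicator("RuralPOP"))
```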
Moreover, the emissions data include the land use, change and forestry (LCF) indicator provided by the Food and Agriculture Organization of the United Nations [
30] in their Food and Agriculture statistics (FAOSTAT) emissions database, as well as combustion data recorded by the Organization for Economic Co-Operation and Development (OECD) [
31]. The economic indicators are retrieved from the World Bank Open Data database, namely, the gross domestic product (GDP), population, rural population percentage, GDP growth and urban population growth rate [
32]. The emissions and the GDP are divided by the population, which is removed afterwards. Faulty entries with a high proportion of missing data as well as countries with fewer than 500,000 inhabitants are excluded.
First, the categories characterizing climate change are reduced: constants (all zeroes) as well as variables with low standard deviation are eliminated, and the variables are arranged in alphabetical order. The total numbers of nations and categories are 147 and 20 (excluding population), respectively.
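The preprocessing steps described above can be sketched as follows. The data layout (a dictionary of columns plus a population list), the function name and the `min_std` threshold are assumptions made for illustration; the text does not state the standard deviation cut-off that was used, and the handling of missing data is omitted.

```python
from statistics import pstdev


def preprocess(table, population, per_capita_columns,
               min_population=500_000, min_std=1e-9):
    """Filter countries, convert to per-capita values, drop flat variables.

    table: dict mapping variable name -> list of values (one per country)
    population: list of inhabitants per country, in the same order
    per_capita_columns: names of variables to divide by the population
    min_std: hypothetical threshold below which a variable is discarded
    """
    keep = [i for i, p in enumerate(population) if p >= min_population]
    cleaned = {}
    for name, column in table.items():
        values = [column[i] for i in keep]
        if name in per_capita_columns:
            values = [v / population[i] for v, i in zip(values, keep)]
        if pstdev(values) > min_std:  # removes constants and near-constants
            cleaned[name] = values
    # Alphabetical arrangement of the remaining variables.
    return dict(sorted(cleaned.items(), key=lambda item: item[0].lower()))


table = {
    "RuralPOP": [10.0, 20.0, 30.0, 40.0],
    "GDP": [100.0, 400.0, 900.0, 5.0],
    "Zero": [0.0, 0.0, 0.0, 0.0],
}
population = [1_000_000, 2_000_000, 3_000_000, 100_000]
print(preprocess(table, population, {"GDP"}))
```

Because the threshold is applied after the per-capita division, it depends on the units of each variable; a relative criterion might be preferable in practice.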
Regarding the visualization of parallel coordinates, the red axis is considered to be the gold standard or origin axis. The quantiles (25%—yellow, 50%—magenta, 75%—blue, 100%—green) and the first decile (10%—red) of the ranks are colored to distinguish between their positions. The rank numbers for each variable are plotted on the y-axes, and the x-axis depends on the input data. In the case of
Figure 4, the indicators are provided in alphabetical order for the purpose of visualization. The neighboring indicators are ordered arbitrarily, without regard to their relationships to each other; therefore, a small total number of crossings is unlikely. This ordering yields 76,509 intersections for a ranking of 147 members. The quantiles change position with each step further from the origin; therefore, the ordering is inefficient. Especially between GDP, GDP growth, industry, all GHGs categories, and from the land use, change and forestry to the total emissions category, the intersections suggest that these are dissimilar neighbors. Even in this inefficient layout, very similar variables (e.g., LCFeAllGHG and LCFeCO2) can be detected, as can reverse rankings (e.g., OFCAllGHG and RuralPOP) and very different ones (e.g., GDPGrowth and INDeAllGHG).
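The quantity used to compare the orderings, the total number of intersections, can be computed by summing the crossings between each pair of neighboring axes. The sketch below assumes untied rankings and uses hypothetical function names; the upper bound is the case in which every pair of lines crosses between every pair of neighboring axes.

```python
from itertools import combinations


def crossings_between(left, right):
    """Number of object pairs whose order differs on two neighboring axes."""
    return sum(
        1
        for i, j in combinations(range(len(left)), 2)
        if (left[i] - left[j]) * (right[i] - right[j]) < 0
    )


def total_crossings(rankings, axis_order):
    """Sum of crossings over all neighboring axis pairs of an ordering."""
    return sum(
        crossings_between(rankings[a], rankings[b])
        for a, b in zip(axis_order, axis_order[1:])
    )


def max_crossings(n_objects, n_axes):
    """Upper bound: all pairs cross in every gap between neighboring axes."""
    return (n_axes - 1) * n_objects * (n_objects - 1) // 2


rankings = {
    "x": [1, 2, 3, 4],
    "y": [1, 3, 2, 4],
    "z": [4, 3, 2, 1],
}
print(total_crossings(rankings, ["x", "y", "z"]))
print(total_crossings(rankings, ["x", "z", "y"]))
print(max_crossings(147, 20))
```

If all 20 categories are drawn for the 147 countries, the upper bound is 203,889 crossings, which gives a scale against which the reported counts can be read.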
The MDS-based analysis focuses on the lower-dimensional representation of the pairwise dissimilarity matrix. Kendall’s τ represents similarities; therefore, its inversion yields a dissimilarity that can be used to calculate the MDS value of a variable. The method analyzes the similarity of one variable to all others.
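One way to obtain such an ordering is a one-dimensional classical MDS of the τ-based dissimilarities, sketched below. The text does not specify the MDS variant or the exact inversion, so the dissimilarity 1 − τ, the classical (double-centering) formulation and the power iteration used to find the leading coordinate are all assumptions; the function names are illustrative and untied rankings are assumed.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau for untied data via the number of discordant pairs."""
    n = len(a)
    discordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )
    return 1.0 - 4.0 * discordant / (n * (n - 1))


def mds_order(rankings, iterations=200):
    """Order variables by their coordinate on the first classical MDS axis."""
    names = list(rankings)
    m = len(names)
    # Squared dissimilarities, with dissimilarity assumed to be 1 - tau.
    sq = [[(1.0 - kendall_tau(rankings[a], rankings[b])) ** 2 for b in names]
          for a in names]
    row_mean = [sum(row) / m for row in sq]
    total_mean = sum(row_mean) / m
    # Double centering; sq is symmetric, so column means equal row means.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(m)] for i in range(m)]
    # Power iteration: converges to the eigenvector of largest magnitude,
    # which is the first MDS axis when that eigenvalue is positive.
    vec = [float(i + 1) for i in range(m)]
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break  # all variables identical: no direction to extract
        vec = [x / norm for x in nxt]
    return [name for _, name in sorted(zip(vec, names))]


rankings = {
    "x": [1, 2, 3, 4, 5],
    "y": [1, 2, 3, 5, 4],
    "z": [5, 4, 3, 2, 1],
}
print(mds_order(rankings))
```

The direction of the axis is arbitrary, so the returned order may be mirrored; only the neighborhood relations are meaningful.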
The parallel visualization of MDS yields 53,591 intersections, and both negative and positive dissimilarities. The MDS-based ordering is depicted in
Figure 5. On the left-hand side, various categories are placed in the vicinity of each other, creating a cluster whose quantiles remain in a similar position, reducing the number of intersections.
The grouping suggests that these retain a close connection to each other, primarily energy-related variables and GDP, total emissions, industry and transportation categories. This result might indicate how closely the sectors relate to each other. As the figure proceeds towards the positive limits of the x-axis, the intersections between the categories become more and more convoluted. The distance between the categories is either minute or enormous. The introduction of a circular economy may depend on the sectors that are distant from each other. The waste, other fuel sources, industry CO2, and growth indicators may be the key to starting the transition to a circular economy, as these sectors are relatively “independent” of each other. Waste is a consequence of human consumption, and changing the recycling industry barely affects the products themselves, as the core materials rarely change. On some occasions, the waste can be decommissioned, e.g., communal waste can be burned or fermented to methane.
To depict the SRD-based parallel coordinates, we performed SRD with the total of all GHGs emissions category as the gold standard before sorting the categories based on the SRD value and drafting
Figure 6. The correlations gradually diminish as the normalized SRD values increase. Above 35, the number of intersections in the rankings of the variables escalates. The proximity of the axes shows the close degree of similarity based on their distance from the ideal ranking, although they are not necessarily mutually similar. The total number of intersections is 53,591.
Beginning from the left-hand side, the number of intersections is considerable in the range between 92 and 66, followed by a group of variables that may be mutually similar. On the right-hand side, the quantiles are swapped in terms of position, although there are exceptions, as the SRD values are not 100 (not perfectly inverse). Given that the total emissions indicator acts as the reference, it is placed at the zero SRD value mark. The following indicators are the total greenhouse gases emitted by the energy sector, as well as the CO2 emissions of the total and energy sectors. Electricity, transportation and manufacturing are grouped, expressing a connection to the GDP indicator. Industry, building and bunker fuels are grouped as well, as are the CO2 emitted from industrial activity and waste. Other fuel sources, agriculture and GDP growth are considered outliers. The land use, change and forestry indicators, urban growth, and rural population are ranked in reverse order to the gold standard.
In
Figure 7, the angle-based ordering of the SRD values is depicted. The axis of the total GHG emissions category is denoted with red dashes on the x-axis so that the gold standard is quickly identifiable. On the left-hand side, the position of the quantiles is reversed, indicating that the left-hand side of the figure contains reversely ranked categories, whereas, on the right-hand side, the quantiles remain within the same range, with a negligible degree of fluctuation. It is reasonable that the randomly and reversely ranked categories are placed in the opposite direction: if one side is filled with similarly correlating variables, the angles among the remaining non-correlating variables may be smaller than their angles to the variables on the other side. The total number of crossings can be further reduced to 52,200 if the asymmetric distances are plotted to the right and the left of the gold standard, whose SRD value is 0; this is 24,309 fewer than in the alphabetical approach.
In the vicinity of the gold standard in
Figure 7, all greenhouse gases emitted by the energy sector, CO2 emissions of transportation, manufacturing, and GDP are followed by industry, bunker fuels and waste, all of which are mutually similar. Such a relation implicitly suggests the significant contribution of heavy industry to the total emissions. Because of that, the economic models of nations should be shifted towards circular, sustainable economies, as circular economies may provide such benefits as mitigation, reduced costs of manufacturing, recycling and efficient waste management [
33,
34,
35]. On the left-hand side, the total CO2 emitted by energy and electricity, respectively, are similar variables. It is in the best interest of mankind to reduce CO2 emissions because of the global warming and health issues that are caused by pollution and increasing temperatures. The most polluting sector of all is the energy sector, where fossil fuels are used for on-site and industrial energy generation. An example is Germany, which has made significant progress in introducing renewable energy to the sector in an attempt to reduce CO2 emissions. A net-positive change occurred, whereas the growth of the industry has not been altered, as it has provided an immense number of jobs [
36]. The only problem with such energy is its storage, which is not yet cost-effective [
37].
To reduce emissions from bunker fuels, shipping could be optimized based on the operational and charter cost of vessels, port handling and fees, fuel consumption, and inventory management [
38]. As shipping provides the most merchandise to the service sector, unjustified consumption poses health risks to those living in cities with harbors [
39] and steadily increases the polluting agents in the air.
A strong manufacturing industry is a prerequisite of a profitable economy, but without a circular one, the processes may cause a detrimental effect on the environment. As one of the most polluting countries, China has been identified as being capable of reducing 78% of its CO2 emissions if proper actions are taken [
40]. For green technology to progress, optimization and perseverance are better driving forces than strongly profit-oriented approaches masked by greenwashing.
The left-hand side consists of indicators that may prove useful in characterizing either the service or the agricultural sector, but are not essential to the industry sector. Industry is a requirement for the service sector, and agriculture is boosted with manufactured machines. The GDP growth is reversely ranked along with agriculture, yet it is clear that the two quantities barely correlate. In many developing countries, agriculture is based on subsistence farming, and a high proportion of the population may solely feed themselves and their families. The populations of these countries live below the poverty line and can only contribute marginal profits to the “big” economy [
41]. The figure contains 50,252 intersections, 1948 fewer than the classical SRD-based ordering.
Several sectors of the economy are provided in the analysis. The sectors are in exponential growth in general, which does not consider sustainability and circular economy. Unfortunately, the phenomenon of greenwashing [
42] has arisen under the current regulations. Greenwashing companies are not environmentally friendly, but aim to obtain more profit. These organizations may exaggerate emissions reductions and mislead consumers. Huge contributors of greenhouse gases are the building and construction sectors. The problems and solutions are being thoroughly researched [
43]. To reduce emissions, the introduction of biofuel to heavy machinery and energy generation is recommended. The electrification of processes and the application of carbon capture technologies are also proposed. By embracing the recommended solutions, company policies become ever more aligned with the Paris Agreement in both the short and the long term, avoiding greenwashing. Passive houses are an example of greener building. The maintenance cost in terms of the environmental footprint of office buildings is now reduced [
44], and passive houses are under strict regulation of heat insulation and electricity consumption [
45]. The actual manufacturing emissions of the required materials for passive buildings may be overlooked [
46]. However, from the CO2 emissions of the buildings category, the correlation steeply decreases.
5. Discussion
As mentioned above, the ordering of the variables is a major factor in reducing the number of intersections. It is not recommended to visualize the parallel coordinates based on an arbitrary ordering, such as the alphabetical one, as the number of intersections is large (76,509 in this example). Each ordering method reduces the number of intersections significantly (each by about 25,000), indicating that a deliberate ordering is a requirement for good visualization.
Although the angle-based ordering algorithm provided fewer intersections than the multidimensional scaling-based method, the gain is only marginal. However, the hierarchical ordering-based multidimensional scaling was not compared to SRD. We compared a similar rank correlation-based multidimensional scaling technique to the novel SRD-based parallel coordinates algorithm; the multidimensional scaling performs slightly worse. It is important to note that another analysis may not yield such enormous or minute distances between the axes. By dropping the fixed position of the axes, the multidimensional scaling-based method may reduce the resemblance of the figure and provide information on the relationship between the axes.
The inclusion of the SRD method in the visualization of the parallel coordinates provided a satisfactory reduction in the number of intersections, which was compared with that of multidimensional scaling. “Dynamics Visualization based on Parallel Coordinates” (DYVIPAC) focuses on a set of problems and not on general use. However, the SRD angle method-based visualization alone is not capable of capturing complex biological or mathematical problems, or dynamics analyses. It is a technique that focuses on the correlation and relationship between the variables.
The parallel coordinates technique is widely used, e.g., in ecology, where a complete framework was created to ease the visualization and clarification of water quality as well as invasive species proliferation. The visualization is interactive, and the framework provides features, such as line coloring, data filtering, and highlighting as well as tables. In the SRD-based parallel coordinates visualization toolbox, we provide a general SRD-based visualization with angle-based ordering that can be further extended in MATLAB or Octave, depending on the requirements of the analyst.
6. Conclusions
The visualization of parallel coordinates in itself is extended as the representation of the sum of ranking differences (SRD); the order of the variables is calculated by the SRD method. The SRD and the parallel coordinates technique are modified by placing the variables not only on the right-hand side of the gold standard, but on the left-hand side as well, thereby increasing the information content of the visualized relationship system (i.e., showing that variables adjacent to each other are similar).
We developed a method for classifying the arrangement that uses SRD as a distance norm satisfying the triangle inequality (and the cosine theorem). The variables are iteratively placed based on the angle of deviation between the selected side reference variables on the left- and right-hand sides. The parallel coordinates-based visualization technique is preferable, as it significantly reduces the number of intersections in the figure.
The algorithm inherits the limitations of parallel coordinates, e.g., negative correlations can easily be overestimated visually, and the difficulty of reading the visualization may increase when a large number of variables or objects is added. Non-linear correlations may not be appropriately interpreted in the original method, and this requires further research concerning the visualization of SRD values. Although the original parallel coordinates method suffers from a lack of proper ordering, this is solved by applying the SRD algorithm. However, these limitations are intertwined with advantages. The visualized data present a great deal of information that can be evaluated in a short time, and the method is easy to learn. It excels at providing information on the correlation between the rankings and detects outliers in the rankings, making it a candidate for efficient cluster analysis.
Another limitation is that only two mutually similar groups can be established, due to the shortage of axes. In this case, the probability values being mutually similar may pave the way for more information on the exact relationships between the variables. It is also important to note that parallel coordinates may be used alongside other visualization techniques.
As for future research, the determination of non-linear correlations will be built into the SRD framework; moreover, the behavior of aggregation is to be researched, where the properties of the distributions of the ranking distances during data fusion are to be determined.