1. Introduction
Nowadays, comparing the performance of a newly developed multi-objective optimization algorithm typically involves calculating descriptive statistics such as means, medians, and standard deviations of the algorithm's performance. Recently, we have published several studies showing that calculating descriptive statistics is not enough to compare algorithms' performances and that higher-level, inferential statistics are also required [
1,
2,
3].
In this paper, we will show how different benchmarking practices can have a large influence on the outcome of a statistical analysis performed on multi-objective optimization algorithms. Additionally, we will present a web-service-based e-Learning tool called DSCTool [
4], which guides the user through all the steps needed to perform a proper statistical analysis. The DSCTool also reduces the amount of additional statistical knowledge required from the user, such as knowing which conditions must be fulfilled to select a relevant and proper statistical test (e.g., parametric or nonparametric) [
5]. The conditions that should be checked before selecting a relevant statistical test include data independence, normality, and homogeneity of variances (homoscedasticity).
In multi-objective optimization, there is no single solution that simultaneously optimizes each objective. Because the objectives are conflicting, there exists a set of alternative solutions instead. Each solution that belongs to this set is optimal in the sense that no other solution from the search space is superior to it when all objectives are considered. The question that arises here is how to compare algorithms with regard to sets of solutions. To this end, many different performance metrics have been proposed, which map solution sets to real numbers. By using performance metrics, we can quantify the differences between solution sets, and these values can then be used as input data for statistical analysis.
To demonstrate the importance of performing a proper statistical analysis in benchmarking studies that involve multi-objective optimization algorithms, we analyzed the results presented at the 2018 IEEE CEC Competition on Evolutionary Many-Objective Optimization. The analyses are performed using the DSCTool, and each step of the DSCTool pipeline is explained in more detail. More details about the DSCTool are presented in [
4].
The rest of the paper is organized as follows. In
Section 2, multi-objective optimization is briefly reintroduced, followed by
Section 3, where two important caveats of statistical analysis are presented.
Section 4 reintroduces the DSCTool and
Section 5 shows and compares different statistical analyses performed using the DSCTool. Finally, the conclusions of the paper are presented in
Section 6.
2. Multi-Objective Optimization
Multi-objective optimization is an area of multiple-criteria decision making concerned with optimization problems that involve more than one objective, which must be optimized simultaneously. A multi-objective problem can be formulated as
$$\min_{x \in X} F(x) = \bigl( f_1(x), f_2(x), \dots, f_m(x) \bigr),$$
where $m$ is the number of objectives and $X$ is the set of feasible solutions. The solutions in $X$ must often satisfy inequality constraints $g_i(x) \le 0$, $i = 1, \dots, p$, and equality constraints $h_j(x) = 0$, $j = 1, \dots, q$. Since typical optimization algorithms are unconstrained search methods, dealing with constraints is a challenging task. Standard approaches [
6] to handling constraints include penalty functions, objectivization of constraint violations, repair algorithms, separation of constraints and objectives, and hybrid methods.
The objectives are usually conflicting; therefore, there exists some kind of trade-off among them, i.e., one objective can only be improved at the expense of another. As a consequence, there is no single optimal solution for all objectives, but a set of solutions called the Pareto optimal set, all of which are considered optimal with respect to all objectives. There are many quality aspects with regard to which approximation sets are compared, such as closeness to the Pareto optimal set and coverage of a wide range of diverse solutions. Quality is measured in terms of criteria that relate to the properties of convergence and diversity. Many studies have tackled this problem by using unary and binary quality indicators, which take one or two input sets and map them into a real number. However, one problem that arises is that there exist many quality indicators, for example, hypervolume (HV) [
7], epsilon indicator (EI) [
8], generational distance (GD) [
9], inverse generational distance (IGD) [
9], etc., which capture different information from the approximation set, and their selection can greatly influence the outcome of a comparison study. For the interested reader, Riquelme et al. [
10] provide an in-depth review and analysis of 54 multi-objective optimization metrics. One option to reduce the influence of the selection of a quality indicator is to use ensemble learning, where the idea is to combine the information gained from several quality indicators into one real value. Once a suitable performance metric has been calculated, we still need to perform a statistical analysis to gain an understanding of the compared algorithms' performances.
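As a small illustration of how a quality indicator maps an approximation set to a real number, the following Python sketch computes simplified (arithmetic-mean) forms of the generational distance (GD) and inverted generational distance (IGD); the arrays and their values are made up purely for illustration, and power-mean variants of these indicators also exist in the literature.

```python
import numpy as np

def generational_distance(approx, reference):
    """Simplified GD: mean distance from each point of the approximation
    set to its nearest point in the reference (Pareto front) set."""
    diffs = approx[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2).min(axis=1)
    return dists.mean()

def inverted_generational_distance(approx, reference):
    """Simplified IGD: mean distance from each reference point to its
    nearest point in the approximation set."""
    return generational_distance(reference, approx)

# Illustrative 2-objective example (all values are made up).
reference = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
approx = np.array([[0.1, 1.1], [0.6, 0.6], [1.1, 0.1]])

print("GD :", generational_distance(approx, reference))
print("IGD:", inverted_generational_distance(approx, reference))
```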
3. Statistical Analysis
Frequentist statistical analysis can be done via exploratory data analysis, which involves calculating descriptive statistics (i.e., mean, median, standard deviation, etc.), or via inferential statistics, which uses higher-level methods to explore relations between the compared variables (e.g., hypothesis testing with statistical tests) [
2]. No matter which kind of statistical analysis is done, it is very important to understand all of its limitations and caveats, because improper usage can lead to wrong conclusions. For example, improper usage often comes from misunderstanding or not knowing the requirements under which a certain statistical test can be applied. Recently, it was shown that outliers and small differences in the data can have a negative impact on the results obtained from statistical tests [
11].
3.1. Outliers
An outlier is an observation that lies outside of the distribution of the data [
12]. If outliers are not handled properly, they can distort the computed statistics so that these no longer reflect the actual probability distribution of the data. As a result, the statistical analysis may indicate a statistically significant difference, while the underlying probability distributions indicate the opposite.
Outliers are data points that differ significantly from the other data points in the data set and can cause serious problems in statistical analyses. For example, means are the most commonly used descriptive statistics in comparison studies because they are unbiased estimators. However, they are sensitive to outliers, so by using them for statistical analysis we inherently transfer this sensitivity to the results of any analysis that builds on them. One option to reduce the influence of outliers is to use medians, which are more robust statistics since they are less sensitive to outliers.
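As a small numerical illustration of this sensitivity (with made-up values), the following Python snippet shows how a single outlying run shifts the mean of a sample of 20 runs while leaving the median practically unchanged.

```python
import numpy as np

# 19 ordinary runs of a hypothetical algorithm plus one outlying run (made up).
runs = np.array([0.10, 0.11, 0.09, 0.12, 0.10, 0.11, 0.10, 0.09, 0.12, 0.10,
                 0.11, 0.10, 0.09, 0.11, 0.10, 0.12, 0.10, 0.11, 0.09, 5.00])

print("mean   without outlier:", np.mean(runs[:-1]))    # ~0.104
print("mean   with    outlier:", np.mean(runs))         # ~0.35, pulled up by one run
print("median without outlier:", np.median(runs[:-1]))  # 0.10
print("median with    outlier:", np.median(runs))       # 0.10, barely affected
```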
3.2. Small Differences in the Data
Although medians reduce the sensitivity to outliers, both means and medians can have a negative impact on the results obtained from statistical tests. Such an impact can be observed when the differences between means or medians lie within some ε-neighborhood. An ε-neighborhood is a range in the function value space in which the distance from a given number is less than some specified number ε. When the means or medians are not the same but lie within some ε-neighborhood, the algorithms receive different rankings. However, if the distributions of the multiple runs are the same, indicating that there are no performance differences between the compared algorithms, they should obtain the same rankings. On the contrary, it can also happen that the distributions of the multiple runs are not the same, indicating that the algorithms should obtain different rankings, while the means or medians are the same.
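The following illustrative Python example (with synthetic data and a two-sample Kolmogorov–Smirnov test standing in for a generic distribution comparison) shows how two algorithms whose results come from the same distribution can still receive different median-based rankings because their medians differ by a tiny ε, while a distribution-based comparison typically reports no significant difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two hypothetical algorithms whose results are drawn from the *same* distribution.
alg_a = rng.normal(loc=0.5, scale=0.05, size=30)
alg_b = rng.normal(loc=0.5, scale=0.05, size=30)

med_a, med_b = np.median(alg_a), np.median(alg_b)
print(f"medians: {med_a:.4f} vs {med_b:.4f}")  # differ only by a tiny epsilon
print("median-based ranking:", "A better" if med_a < med_b else "B better")

# A distribution-based comparison typically reports no significant difference.
res = stats.ks_2samp(alg_a, alg_b)
print(f"two-sample KS test p-value: {res.pvalue:.3f}")
```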
4. The DSCTool
The DSCTool is presented in [
4] as an e-Learning tool whose goal is to support all required scenarios for different performance evaluations of single- and multi-objective optimization algorithms. It guides the user through the steps of a statistical analysis pipeline, starting from preparing and providing the input data (optimization algorithm results), then selecting the desired comparison scenario, and finally obtaining the result. In this way, the statistical knowledge required from the user is greatly reduced. In effect, the DSCTool allows users not to think about how the statistical analysis should be performed, but only to define the comparison scenario that is relevant for their experimental setup. Basically, the user must select a significance level for the statistical test and follow a pipeline built around the statistical tests most commonly used in benchmarking evolutionary algorithms. For a multi-objective optimization algorithm analysis, the user must provide data in the form of one or more quality indicators, decide on a significance level, and choose which kind of ensemble (i.e., average, hierarchical majority vote, or data-driven) is desired for comparing the data calculated by the quality indicators. Following the pipeline is straightforward, as will be shown in the next sections.
The DSCTool is developed for the Deep Statistical Comparison (DSC) approach [
2] and its variants, which all have the benefit of providing a robust statistical analysis with a reduced influence of outliers and small differences in the data, since the comparison is made on data distributions. In this paper, we will only show the web services that implement the DSC variants used with multi-objective optimization, i.e., the basic DSC ranking scheme used for comparing the data for one quality indicator [
13], and average, hierarchical majority vote, and data-driven ensembles that are used to fuse the information from more quality indicators [
1,
3].
The DSC ranking scheme is based on comparing distributions using some two-sample statistical test [
14] with a predefined significance level. The obtained rankings are then used for further statistical comparison. The DSC average ensemble is a ranking scheme that uses a set of user-selected quality indicators [
1] to calculate a mean of DSC rankings for each quality indicator on specific problems [
15]. The DSC hierarchical majority vote ensemble is a ranking scheme [
1] that checks which algorithm wins in the most quality indicators, or which algorithm most often receives the best DSC ranking, on each benchmark problem separately. Finally, the DSC data-driven ensemble [
3] estimates the preference of each quality indicator using its entropy, which is calculated with the Shannon entropy weighted method [
16]. The preference ranking organization method (PROMETHEE) [
17] is then used to determine the rankings.
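To make the intuition behind the DSC ranking scheme more tangible, the sketch below implements a simplified, illustrative version of it for a single problem: run distributions are compared with the two-sample Anderson–Darling test from SciPy, algorithms whose distributions do not differ significantly are grouped together, and each group shares the average of the ranks it spans. This is only our reading of the idea; the exact published procedure is the one implemented by the DSCTool web services.

```python
import numpy as np
from scipy import stats

def dsc_like_ranking(runs_per_alg, alpha=0.05):
    """Illustrative, simplified DSC-style ranking for a single problem.

    runs_per_alg maps an algorithm name to a 1-D array of quality indicator
    values from multiple runs. Algorithms whose run distributions are not
    significantly different share the same (averaged) rank.
    """
    # Order algorithms by the median of their indicator values (lower = better).
    names = sorted(runs_per_alg, key=lambda n: np.median(runs_per_alg[n]))

    groups = []
    for name in names:
        for group in groups:
            # Join an existing group if the distribution does not differ
            # significantly from any member of that group (two-sample
            # Anderson-Darling test; SciPy caps this approximate p-value
            # to the range [0.001, 0.25]).
            if all(stats.anderson_ksamp([runs_per_alg[name], runs_per_alg[m]]
                                        ).significance_level > alpha
                   for m in group):
                group.append(name)
                break
        else:
            groups.append([name])

    # Members of a group share the average of the ranks the group spans.
    ranks, next_rank = {}, 1
    for group in groups:
        shared = np.mean(range(next_rank, next_rank + len(group)))
        ranks.update({name: shared for name in group})
        next_rank += len(group)
    return ranks

# Made-up example: B and C behave alike, A is clearly better.
rng = np.random.default_rng(0)
example = {"A": rng.normal(0.1, 0.01, 20),
           "B": rng.normal(0.5, 0.05, 20),
           "C": rng.normal(0.5, 0.05, 20)}
print(dsc_like_ranking(example))
```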
5. Experiments, Results, and Discussion
To show the usefulness and the power of DSCTool, the results presented at the 2018 IEEE CEC Competition on Evolutionary Many-Objective Optimization [
18] are analyzed, where 10 algorithms were submitted for competition (i.e., AGE-II [
19], AMPDEA, BCE-IBEA [
20], CVEA3 [
21], fastCAR [
22], HHcMOEA [
23], KnEA [
24], RPEA [
25], RSEA [
26], and RVEA [
27]). The competition provided 15 optimization problems (MaF01-MaF15) with 5, 10, and 15 objectives [
28]. The obtained optimization results for each algorithm, together with the competition results, can be accessed at
https://github.com/ranchengcn/IEEE-CEC-MaOO-Competition/tree/master/2018. We would like to point out here that we focus only on the question of how to evaluate performance (i.e., which statistical approach should be used), since the problem selection and experimental setup have already been defined by the competition organizers.
For the performance metrics, the organizers selected the inverted generational distance (IGD), for which 10,000 uniformly distributed reference points were sampled on the Pareto front, and the hypervolume (HV), for which the population was normalized by the nadir point of the Pareto front and a Monte Carlo estimation with 1,000,000 sampling points was adopted. Each algorithm was executed 20 times on each problem with 5, 10, and 15 objectives, resulting in 900 approximation sets. Using these approximation sets, both quality indicators were calculated, and the mean of each quality indicator was taken as the comparison metric for each problem and number of objectives. The algorithms were then ranked according to this comparison metric, and the final score of an algorithm was determined as the sum of the reciprocal values of its rankings (this scoring is sketched in the short example below). In
Table 1 and
Table 2, official competition ranking results for 5 and 10 objectives are presented for both quality indicators, respectively.
Using the tables, it can be seen that CVEA3 is the best performing algorithm, since it obtained a total ranking of 1 for both quality indicators on 5 and 10 objectives.
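For completeness, the competition's reciprocal-rank scoring described above can be reproduced in a few lines of Python; the algorithm names and mean indicator values below are made up purely to show the aggregation.

```python
import numpy as np

# Made-up mean indicator values (lower is better); rows are problems,
# columns are algorithms.
algorithms = ["ALG-A", "ALG-B", "ALG-C"]
mean_indicator = np.array([[0.12, 0.10, 0.15],   # problem 1
                           [0.20, 0.25, 0.18]])  # problem 2

scores = np.zeros(len(algorithms))
for row in mean_indicator:
    # Rank 1 = best (smallest mean indicator value); ties are ignored here.
    ranks = row.argsort().argsort() + 1
    scores += 1.0 / ranks  # final score = sum of reciprocal ranks

for name, score in sorted(zip(algorithms, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```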
Further, a statistical analysis of the same data is shown using the DSCTool. First, the quality indicators have to be calculated from the approximation sets obtained by the algorithms. In addition to IGD and HV, the generational distance (GD) and the epsilon indicator (EI) were also calculated. After all quality indicators are calculated, the data for each quality indicator must be organized into an appropriate JSON input for the rank web service, which performs the DSC ranking scheme. The JSON inputs for each quality indicator and each number of objectives are available at
http://cs.ijs.si/dl/dsctool/rank-json.zip. For our analysis, the two-sample Anderson–Darling test was selected to compare the data distributions (since it is more powerful than the Kolmogorov–Smirnov test). To obtain the DSC rankings for each quality indicator, the rank web service was executed with the appropriate JSON input (a sketch of such a call is given after the tables below). The result is a JSON response in which the rankings are calculated based on the DSC ranking scheme. The ranking results for all quality indicators for 5 and 10 objectives are provided in
Table 3,
Table 4,
Table 5 and
Table 6, respectively.
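To give an impression of how such a web service call might be scripted, the sketch below posts one of the prepared JSON inputs to the rank service. The endpoint URL, the input file name, and any authentication details are placeholders chosen for illustration; the authoritative interface description is the one given in [4].

```python
import json
import requests  # third-party HTTP client (pip install requests)

# Placeholder endpoint -- the real URL, authentication, and JSON schema
# are the ones documented for the DSCTool web services [4].
DSCTOOL_RANK_URL = "https://example.org/dsctool/rank"

# Illustrative file name for one of the prepared JSON inputs from the
# rank-json.zip archive referenced in the text.
with open("M05-igd-rank.json") as f:
    payload = json.load(f)

response = requests.post(DSCTOOL_RANK_URL, json=payload, timeout=60)
response.raise_for_status()

# The JSON response contains the DSC rankings per algorithm and benchmark
# problem; its exact structure follows the DSCTool specification.
print(json.dumps(response.json(), indent=2))
```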
First, let us compare results obtained in the official competition (see
Table 1 and
Table 2) and the results obtained by DSCTool, where comparisons are done using only one quality indicator (see
Table 3,
Table 4,
Table 5 and
Table 6). Here, we would like to remind the reader that the competition ranking is based on a simple statistic, the mean value obtained for a specific quality indicator, while the DSCTool uses the DSC approach, in which the comparison is based on the distribution of the quality indicator values over several runs. One can quickly see that for some algorithms the obtained ranking is the same (e.g., for the algorithm CVEA3 on 5 objectives, both approaches returned ranking 1 according to the IGD and HV quality indicators), while for others there is a large difference between the rankings (e.g., for the algorithm AGE-II on 5 objectives and the HV quality indicator, the competition ranked it 3rd, while the DSCTool ranked it 9th). Since the paper is not about comparing different statistical approaches, we will not go into details. Nevertheless, we would like to remind the reader that, though mean values are unbiased estimators, they can be heavily influenced by outliers and small differences in the data. For further details on this topic, we refer the reader to [
2]. Therefore, by merely using a different (in our case more powerful) statistic, the conclusions obtained from the rankings can change drastically. In our experiment, two additional quality indicators (i.e., GD and EI) were calculated on purpose, so that the influence of the indicator selection can be seen even more clearly. Looking at
Table 3,
Table 4,
Table 5 and
Table 6, drastic changes in the rankings can be observed. Looking again at the algorithm AGE-II, it can be seen that if only EI and GD were used, AGE-II would change from an average-performing algorithm to the best one, also outperforming the algorithm CVEA3 in all cases.
Though it is well known that the selection of the quality indicator can have a big influence on the outcome of a statistical analysis, this influence can also be clearly observed in
Table 3,
Table 4,
Table 5 and
Table 6. In such situations, it is better to use an ensemble of several quality indicators and estimate the performance according to all of them. The DSCTool provides the ensemble service, which calculates rankings according to the inputs from several quality indicators. To do this, the ranking results from running the rank web service on all quality indicators should be taken and used as inputs to the ensemble service. In addition to putting all quality indicator data into a proper JSON form, one needs to decide which ensemble technique to use. The DSCTool provides three ensemble options, namely average, hierarchical, and data-driven. With the average method, the rankings are simply averaged; with the hierarchical method, the input rankings are considered from a hierarchical viewpoint, where the algorithm with the highest rankings obtains the best ensemble ranking; and with the data-driven method, the information gain provided by the quality indicators is also taken into account to determine the ensemble ranking. Examples of the JSON input for the average ensemble method can be found at
http://cs.ijs.si/dl/dsctool/M05-average.json and
http://cs.ijs.si/dl/dsctool/M10-average.json for 5 and 10 objectives, respectively. To prepare the JSON input for the hierarchical or data-driven ensemble, the name of the method should be changed from average to hierarchical or data-driven, while everything else remains the same (a sketch of these calls is given below). Further, to obtain the ensemble rankings, the ensemble service was executed with the appropriate JSON input. The rankings obtained from the ensemble services for 5 and 10 objectives are shown in
Table 7,
Table 8 and
Table 9. Unsurprisingly, it can again be seen that the rankings have changed with respect to the individual rankings shown in
Table 3,
Table 4,
Table 5 and
Table 6. Now the rankings no longer represent individual quality indicators, but a fusion of the information from all of them. All ensembles provide similar ranking results, but there are still some differences between them. Thus, the selection of an ensemble method can also have an influence and must be made with great care. If we are interested purely in average performance, the average ensemble should be selected; if we are interested in the algorithm that most often performs well, the hierarchical ensemble should be selected (this is also recommended for dynamic optimization); while if we care about which algorithm seems the most relevant in general, the data-driven ensemble method should be selected.
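A corresponding sketch for the ensemble step is given below. It assumes, purely for illustration, that the ensemble method name is stored under a top-level key in the JSON input; the real schema, endpoint, and authentication are defined by the DSCTool documentation [4].

```python
import json
import requests

DSCTOOL_ENSEMBLE_URL = "https://example.org/dsctool/ensemble"  # placeholder

# One of the published example inputs for the average ensemble.
with open("M05-average.json") as f:
    payload = json.load(f)

results = {}
for method in ("average", "hierarchical", "data-driven"):
    # As described in the text, only the ensemble method name changes between
    # the three inputs; storing it under a top-level "method" key is our
    # illustrative assumption, not the documented schema.
    payload["method"] = method
    response = requests.post(DSCTOOL_ENSEMBLE_URL, json=payload, timeout=60)
    response.raise_for_status()
    results[method] = response.json()

for method, result in results.items():
    print(method, "->", json.dumps(result)[:120], "...")
```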
The algorithm rankings obtained on each benchmark problem carry a statistical significance that can be used to compare the algorithms only on that specific problem (i.e., in a single-problem analysis). However, if we are interested in a more general conclusion, or want to compare the algorithms over the set of all benchmark problems (a multiple-problem analysis), it is not enough to look only at the mean ranking of the algorithms obtained over all benchmark problems. In such situations, we should analyze the data by applying an appropriate statistical test. For this purpose, an omnibus test must be performed, which is implemented in the DSCTool as the omnibus web service. The omnibus test can be performed on the results from any of the above tables. When using the DSCTool rank or ensemble web service, we receive a JSON result that consists of the rankings for all algorithms and all benchmark problems (which were shown in
Table 3,
Table 4,
Table 5,
Table 6,
Table 7,
Table 8 and
Table 9). Additionally, it provides the information on whether parametric tests can be applied to our data and which statistical tests are relevant for it. In our case, the Friedman test was recommended as an appropriate one. To draw a general conclusion, we continue by using it with a significance level of 0.05. Instead of showing the results for all obtained rankings (presented in the tables), only one scenario was selected to show how this can be done, and its results were compared to the outcome of the competition. We decided to perform the omnibus test on the results obtained with the EI quality indicator on the five-objective problems. The reason is that, here, the algorithm AGE-II outperformed CVEA3, which was shown to be the best performing algorithm at the competition. After preparing the JSON input (accessible at
http://cs.ijs.si/dl/dsctool/M05-ei-omnibus.json) the omnibus web service was executed. The obtained results indicate that the null hypothesis is rejected, since the calculated
p-value is smaller than our previously set significance level (0.05). This means that there is statistical significance in the data and a post-hoc test should be performed to identify where these significant differences come from. In our case, the algorithm with the lowest mean value (AGE-II) was selected as the control algorithm and compared to the other algorithms. To apply the post-hoc test, the algorithm means obtained from the omnibus test were taken, the number of algorithms and the number of benchmark functions were set (in our case 10 and 15, respectively), and the control algorithm was specified (in our case AGE-II). Since the post-hoc statistic depends on the omnibus statistical test, the same statistical test must be used in the post-hoc step (in our case, the Friedman test). After creating the JSON input (accessible at
http://cs.ijs.si/dl/dsctool/M05-ei-posthoc.json), the posthoc web service was executed. The results show that the algorithm AGE-II significantly outperforms every other algorithm. Therefore, we have shown that not only can the rankings change by applying a different quality indicator, but the differences between the performances can also be statistically significant. If we performed a similar test on the HV quality indicator for the five-objective problems, the results would be reversed; here, CVEA3 would be shown to be the algorithm that significantly outperforms every other algorithm. However, if we looked at the results of any of the ensemble methods, we would see that there is no statistical significance between the algorithms AGE-II and CVEA3, which in the end would be the general conclusion (assuming we were not searching for the best algorithm according to some problem-specific quality indicator).
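Finally, the omnibus and post-hoc steps described above could be scripted in the same spirit. The endpoint URLs and the key used to read the p-value from the response are placeholders chosen for illustration; the actual schema is specified in [4].

```python
import json
import requests

DSCTOOL_OMNIBUS_URL = "https://example.org/dsctool/omnibus"  # placeholder
DSCTOOL_POSTHOC_URL = "https://example.org/dsctool/posthoc"  # placeholder
ALPHA = 0.05

with open("M05-ei-omnibus.json") as f:
    omnibus_payload = json.load(f)

omnibus_result = requests.post(DSCTOOL_OMNIBUS_URL, json=omnibus_payload,
                               timeout=60).json()

# Reading the p-value under a "p_value" key is an illustrative assumption;
# the real response layout is defined by the DSCTool specification [4].
if omnibus_result.get("p_value", 1.0) < ALPHA:
    # Significant differences exist, so run the post-hoc comparison against
    # the control algorithm (AGE-II in the scenario discussed in the text).
    with open("M05-ei-posthoc.json") as f:
        posthoc_payload = json.load(f)
    posthoc_result = requests.post(DSCTOOL_POSTHOC_URL, json=posthoc_payload,
                                   timeout=60).json()
    print(json.dumps(posthoc_result, indent=2))
else:
    print("No statistically significant differences at alpha =", ALPHA)
```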
With our experiments, we have shown how important it is to perform a proper analysis, since we were able to change the outcome of the study merely by selecting different quality indicators and/or statistical approaches.