Article

Study on the Use of Artificially Generated Objects in the Process of Training MLP Neural Networks Based on Dispersed Data

by Kwabena Frimpong Marfo and Małgorzata Przybyła-Kasperek *
Institute of Computer Science, University of Silesia, Bȩdzińska 39, 41-200 Sosnowiec, Poland
* Author to whom correspondence should be addressed.
Entropy 2023, 25(5), 703; https://doi.org/10.3390/e25050703
Submission received: 23 March 2023 / Revised: 19 April 2023 / Accepted: 21 April 2023 / Published: 24 April 2023

Abstract

This study concerns dispersed data stored in independent local tables with different sets of attributes. The paper proposes a new method for training a single neural network (a multilayer perceptron) based on dispersed data. The idea is to train local models with identical structures based on the local tables; however, due to the different sets of conditional attributes present in local tables, it is necessary to generate some artificial objects to train the local models. The paper presents a study on the use of varying parameter values in the proposed method of creating artificial objects to train local models. The paper presents an exhaustive comparison in terms of the number of artificial objects generated based on a single original object, the degree of data dispersion, data balancing, and different network structures—the number of neurons in the hidden layer. It was found that for data sets with a large number of objects, a smaller number of artificial objects is optimal. For smaller data sets, a greater number of artificial objects (three or four) produces better results. For large data sets, data balancing and the degree of dispersion have no significant impact on the quality of classification. Rather, a greater number of neurons in the hidden layer (ranging from three to five times the number of neurons in the input layer) produces better results.

1. Introduction

A major challenge in applying machine learning is the decentralization of data sets and the inconsistency of information stored in independent local bases. When data are collected independently by institutions such as banks, hospitals, and various types of mobile applications, one cannot expect the format of the data to be uniform and consistent. Rather, one should expect that different sets of attributes and different sets of objects are present in local tables. Additionally, inconsistencies in the data occur very often. The research presented in this paper deals precisely with the issue of classification based on dispersed data. By dispersed data, we mean data collected in several decision tables that contain inconsistencies and have different sets of attributes and objects, with the possibility that some attributes and objects are common among decision tables. In addition, it is almost impossible to identify which objects are shared among decision tables, since doing so would require a central identifier of objects, which more often than not does not exist or may not be accessible for data protection reasons.
The two main approaches that can be used for dispersed data are ensembles of classifiers and federated learning. Ensemble learning is a general approach in which local models are created independently based on local tables [1,2], after which a final prediction is generated from the local models by applying some fusion method [3,4,5]. In this approach, there is no global model as such.
In federated learning, building a global model is the main objective [6,7]. In this approach, the main focus is on data protection and data privacy [8]. Here, models are created in local spaces, and only their parameters are sent to a central server—local data are not exchanged or combined among local spaces. The local models are then aggregated, and the aggregated model is sent back to the local spaces. This procedure is iterated until a convergence criterion is satisfied.
The approach proposed in this paper is quite different. The aim of the method is to build a global model but in a completely different way than in federated learning. Indeed, local models are built based on local tables which are later used to construct a global model; however, this procedure is not iterative. Creation of a global model is carried out by a one-time aggregation. In the final stage, the global model is trained with a stratified subset of the original test set for which the values on the full set of conditional attributes present in all local tables are defined.
In this study, neural networks—multilayer perceptrons (MLP)—are used as local models. For the aggregation of such local networks to be possible, all of them must have the same structure. Since there are different conditional attributes in each local table, obtaining the same input layer in all models is not trivial. It is necessary to artificially generate objects based on the original objects that are to be used to train the network. Such artificial objects must have defined values on the conditional attributes that are missing in the considered local table. The paper proposes a method for generating artificial objects and contains a study on the use of different parameter values in the proposed method. An exhaustive comparison in terms of the number of artificial objects generated based on a single original object, the degree of data dispersion, data balancing, and different network structures (the number of neurons in the hidden layer) is presented. The main conclusions are as follows. For data sets with a large number of objects, a smaller number of artificial objects is optimal. For smaller data sets, a greater number of artificial objects (three or four) produces better results. For large data sets, data balancing and the degree of dispersion have no significant impact on the quality of classification. Rather, a greater number of neurons in the hidden layer (ranging from three to five times the number of neurons in the input layer) produces better results.
The contributions of the paper are as follows:
  • Proposing a method for generating artificial objects for training local MLP networks with identical structures;
  • Comparison of the proposed method for different numbers of artificially generated objects;
  • Comparison of the proposed method for different versions of data dispersion;
  • Comparison of the proposed method for different numbers of neurons in the hidden layer;
  • Comparison of the proposed method for balanced and imbalanced versions of the data sets.
Neural networks have been considered for dispersed data in various applications. The papers [9,10] considered neural networks as a model for aggregating prediction vectors generated by local classifiers. In the paper [11], neural networks were used in a federated learning approach. Neural networks were also used as base models in an ensemble of classifiers whose predictions were then aggregated by various fusion methods [12]. However, none of the approaches described above is similar to the one proposed in this study. The main differences lie in the non-iterative construction of the global model in the proposed approach and in the use of local tables with different sets of conditional attributes to train local networks with identical structures.
The paper is organized as follows. In Section 2, the proposed method for generating a global model is described. The section explains how to determine the structure of local models and how to prepare artificial objects for training them. Then, the method of aggregating the local models into the global model and the stage of training the global model are described. Section 3 describes the data sets that were used and presents the conducted experiments, comparisons, and a discussion of the obtained results. Section 4 presents conclusions and plans for future research.

2. Materials and Methods

The main idea of the proposed model is to build a global model based on dispersed data—local tables with different sets of conditional attributes—in three stages:
  • First stage: training local models (MLP neural networks) based on the local tables;
  • Second stage: aggregation of the local models into the global model. This stage is performed in a non-iterative way by a single calculation;
  • Third stage: post-training the global model using a stratified subset of the original test set.
All three stages are described below in separate subsections.

2.1. First Stage—Training Local Models, MLP Neural Networks, Based on Local Tables

Formally, dispersed data is a set of decision tables that are collected independently by separate units. We assume that a set of decision tables from one discipline, $D_i = (U_i, A_i, d)$, $i \in \{1, \ldots, n\}$, is available, where $U_i$ is the universe comprising a set of objects, $A_i$ is a set of conditional attributes, and $d$ is a decision attribute. We assume that the sets of conditional attributes of the local tables differ considerably, although it may occasionally happen that a larger set of attributes is shared between tables. More likely, the differences between the attributes found in local tables are significant.
The local models used in this study are multilayer perceptron (MLP) networks. Based on each local table, an MLP model is trained separately. The requirement that all local models have the same structure is not trivial to satisfy, since each local table has different conditional attributes, which makes the training process difficult. We propose that the input layer of each local network contain all the attributes that are present in any local table—let us denote this set as $A = \bigcup_{i \in \{1, \ldots, n\}} A_i$. In addition, the hidden layer should contain the same number of neurons in all networks. The output layer will be the same for all tables due to the identical decision attribute present in all local tables. In this study, we use only one hidden layer in the network.
Now, a problem arises when we seek to train such a network based on a single local table given that the table in question lacks conditional attributes (perhaps many) that are present in the input layer of the network. A method for generating artificial objects with supplemented values on missing conditional attributes is proposed. These values are imputed based on certain characteristics provided by other local tables in the dispersed data in which the missing attributes are present. In doing so, data protection is ensured because we do not exchange raw data but only certain values of statistical measures derived from the dispersed data.
Based on each original object from a local table, k artificial objects are generated as follows:
  • Let us consider an object $x$ that belongs to a decision class $v$ from a local table $D_i$.
  • We define a set of tuples as
    $METHODS = \{(min, min), (min, mean), \ldots, (max, median), (max, max)\} = \{min, mean, median, max\} \times \{min, mean, median, max\}$
    For each missing attribute (attribute from the set $A \setminus A_i$) and each $method \in METHODS$, $method(0)$ is computed on the objects having the decision class $v$, separately for each local table in which the attribute is present. Then, $method(1)$ is computed on the resulting values from $method(0)$.
  • After step 2, there will be $|METHODS| = 16$ values for decision class $v$. Then, $k$ distinct values, denoted $a_1, \ldots, a_k$, are randomly selected from the 16 values, where $k$ is the number of artificial objects that are to be generated.
  • From step 3, there will be $k$ derived values for each of the missing attributes of object $x$.
  • The final step is to duplicate object $x$ $k$ times and assign the values $a_1, \ldots, a_k$ to the missing attributes.
This process is carried out for all objects in a local table and executed separately for each local table.
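The following sketch illustrates this generation step in Python (pandas/NumPy). It is only an illustration of the procedure described above; the function and variable names (candidate_values, make_artificial, decision_col, and so on) are assumptions and are not taken from the study's implementation.
```python
import numpy as np
import pandas as pd

STATS = {"min": np.min, "mean": np.mean, "median": np.median, "max": np.max}

def candidate_values(attr, decision, local_tables, decision_col="class"):
    """Return the 16 candidate values for one missing attribute and one decision class."""
    # method(0): statistic of the attribute over the class-v objects, computed
    # separately in every local table that contains the attribute
    # (assumes the attribute, with class-v objects, occurs in at least one other table)
    inner = {
        name: [f(t.loc[t[decision_col] == decision, attr].to_numpy())
               for t in local_tables
               if attr in t.columns and (t[decision_col] == decision).any()]
        for name, f in STATS.items()
    }
    # method(1): statistic of the per-table results; 4 x 4 = 16 combinations
    return [g(inner[name]) for name in STATS for g in STATS.values()]

def make_artificial(x, decision, own_attrs, all_attrs, local_tables, k, rng):
    """Duplicate object x (a row as a dict or pandas Series) k times and fill
    every missing attribute with randomly selected candidate values."""
    copies = [dict(x) for _ in range(k)]
    for attr in sorted(set(all_attrs) - set(own_attrs)):
        cands = list(set(candidate_values(attr, decision, local_tables)))
        chosen = rng.choice(cands, size=min(k, len(cands)), replace=False)
        # if fewer than k distinct candidates exist, values are reused
        for copy_, value in zip(copies, np.resize(chosen, k)):
            copy_[attr] = value
    return copies
```
Note that, in the dispersed setting, only these aggregated statistics—not the raw objects—would need to be exchanged between local tables.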
A training set of artificially prepared objects, generated as described above, is then used to train the MLP network. The neural networks are implemented using the Keras library in Python. Different numbers of neurons in the hidden layer are tested—values ranging from 0.25 to 5 times the number of neurons in the input layer. For the hidden layer, the ReLU (Rectified Linear Unit) activation function is used, as it is the most popular activation function and gives very good results [13]. For the output layer, the Softmax activation function is used, which is recommended when dealing with a multi-class problem [14]. The neural network is trained using gradient descent with an adaptive step size in the backpropagation method. The Adam optimizer [15] and the categorical cross-entropy loss function [16] are used in the study.
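As a concrete illustration, a local model of the kind described above could be defined in Keras as follows. The helper name, the default hidden-layer multiplier, and the commented training call are assumptions; only the layer types, activations, optimizer, and loss follow the description above.
```python
import tensorflow as tf

def build_local_mlp(n_inputs, n_classes, hidden_multiplier=3.0):
    """One local MLP: inputs for all attributes in A, one ReLU hidden layer,
    softmax output over the decision classes."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(int(hidden_multiplier * n_inputs), activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",                    # Adam [15]
                  loss="categorical_crossentropy",     # categorical cross-entropy [16]
                  metrics=["accuracy"])
    return model

# Illustrative usage (epoch count assumed, labels one-hot encoded):
# local_model = build_local_mlp(n_inputs=len(A), n_classes=n_classes)
# local_model.fit(X_artificial, y_one_hot, epochs=100, verbose=0)
```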

2.2. Second Stage—Aggregation of Local Models to the Global Model

The second stage consists of the aggregation of the local networks into a single global network. In the first stage, the local neural networks are prepared in such a way that aggregation is possible—all local networks have the same structure; thus, the global network also has this structure. The weights in the global model are determined as a weighted average of the corresponding weights from the local models. However, because the data are dispersed across local tables, not all local models are equally accurate, so a weighted average is employed to make a local model's influence on the construction of the global model depend on its accuracy. The method used is inspired by the second weighting system used in the AdaBoost algorithm [17].
For each local model, a classification error is estimated based on its training set (containing artificial objects). Let us denote by $e_i$ the classification error determined for the $i$-th local model, $i \in \{1, \ldots, n\}$. Since local models are built based on only a piece of the data, their accuracy can vary greatly. It may sometimes happen that their classification error is above $0.5$. In order not to eliminate such local models from the aggregation stage—as they may contain important information on specific attributes that can have a positive impact on the global model—min-max normalization to the interval $[0, 0.5]$ is applied to all errors $e_i$, $i \in \{1, \ldots, n\}$. Then, the weight $\omega_i$ for each local neural network, $i \in \{1, \ldots, n\}$, is computed according to the formula proposed in [17]:
$\omega_i = \ln\left(\frac{1 - e_i}{e_i}\right)$
The initial weights of the neural connections in the global model are then calculated based on the weights of all the local networks. More specifically, each weight of the global model is determined as the average of the corresponding local weights, weighted by $\omega_i$, $i \in \{1, \ldots, n\}$.
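A minimal sketch of this one-time aggregation for Keras models is given below. The error normalization, the AdaBoost-style weights, and the layer-wise weighted mean follow the description above; the function name, the epsilon guard, and the normalization of the weights so that they sum to one are our own assumptions.
```python
import numpy as np
import tensorflow as tf

def aggregate_local_models(local_models, local_errors, eps=1e-6):
    e = np.asarray(local_errors, dtype=float)
    e = (e - e.min()) / (e.max() - e.min() + eps) * 0.5   # min-max normalization into [0, 0.5]
    e = np.clip(e, eps, 0.5)                              # keep ln((1 - e) / e) finite
    omega = np.log((1.0 - e) / e)                         # adjusted model weights
    omega = omega / omega.sum()                           # coefficients of the weighted mean
    local_weights = [m.get_weights() for m in local_models]
    averaged = [sum(w * arrays[i] for w, arrays in zip(omega, local_weights))
                for i in range(len(local_weights[0]))]
    global_model = tf.keras.models.clone_model(local_models[0])   # same structure
    global_model.set_weights(averaged)
    return global_model
```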
It should be noted that attributes that appear more frequently in the local tables may be better trained in the global model than others. Therefore, an MLP network created in this way does not always generate sufficiently good results. In the next stage, the quality of the network is improved.

2.3. Third Stage—Post-Training the Global Model Using a Small Training Set

In order to implement this step, it is necessary to have access to an independent set of training data, which can be called a global training set. This means that each object in this set has values for all conditional attributes $A$ from the dispersed data. This set cannot be generated from the local tables, since their aggregation is not possible under the assumptions about dispersed data mentioned earlier.
Such a global training set is extracted from the test set. The test set is divided into two equal parts in a stratified manner. One part is used for the post-training stage and the other for testing. This procedure is repeated twice, each time using a different half for the post-training phase. In future studies, it is planned to generate such a global training set artificially.
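A sketch of this post-training protocol is shown below, assuming one-hot encoded labels and using scikit-learn's stratified split; the epoch count and the function name are assumptions.
```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

def post_train_and_test(global_model, X_test, y_test_one_hot, epochs=50):
    X_a, X_b, y_a, y_b = train_test_split(
        X_test, y_test_one_hot, test_size=0.5,
        stratify=y_test_one_hot.argmax(axis=1))          # stratified 50/50 split
    accuracies = []
    for (X_tr, y_tr), (X_ev, y_ev) in [((X_a, y_a), (X_b, y_b)),
                                       ((X_b, y_b), (X_a, y_a))]:
        m = tf.keras.models.clone_model(global_model)    # fresh copy for each half
        m.set_weights(global_model.get_weights())
        m.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
        m.fit(X_tr, y_tr, epochs=epochs, verbose=0)      # post-training on one half
        accuracies.append(m.evaluate(X_ev, y_ev, verbose=0)[1])
    return sum(accuracies) / len(accuracies)             # average over the two runs
```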

3. Results

The experiments are conducted with data taken from the UC Irvine Machine Learning Repository. Three data sets are selected: Vehicle data [18], Landsat Satellite data [19], and Dry Bean data [20]. Each data set available in the repository is stored in a single table. These data sets are chosen for three reasons. First, they contain multiple decision classes, and the proposed method is tested for multi-class problems. Additionally, an important factor is the significant number of conditional attributes present in the data sets. The data are dispersed into local tables in such a way that the conditional attributes are split among them. The aim is to test the approach when local tables have different conditional attributes. To achieve this, a large number of attributes is needed in the original data so that such dispersion can be carried out and a meaningful subset of the attributes can be present in each local table. Lastly, this study focuses on numerical data—all data sets contain numerical (discrete or continuous) attributes. Due to the large variation in attribute values in the Dry Bean data, this set is normalized.
The only feasible way to evaluate the model for the considered dispersed data is the train-and-test method. This is because the data in the local tables contain only subsets of the conditional attributes, while we assume that the test objects have specified values for all attributes present in the local tables. Therefore, before the original data set is dispersed, it is divided into a training set (70% of objects) and a test set (30% of objects) in a stratified manner. Data characteristics are given in Table 1. The training sets are then dispersed into local tables. Different degrees of dispersion are considered in order to check whether the method can cope with significant data dispersion. Versions with 3, 5, 7, 9, and 11 local tables are created based on the original training set, where each local table contains only a subset of the original set of conditional attributes. In addition, different local tables have different sets of attributes; however, individual attributes may be shared among some tables. The decision attribute is included in each of the tables. The full set of objects is also stored in each of the local tables, but without identifiers. This reflects the real situation in which one cannot identify the objects across local tables.
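As a rough illustration only—the exact attribute assignment used in the study is not reproduced here—a training table could be dispersed along the following lines, keeping the decision column and all objects (without identifiers) in every local table and allowing a small overlap of attributes:
```python
import numpy as np

def disperse(train_df, n_tables, decision_col="class", overlap=1, seed=0):
    rng = np.random.default_rng(seed)
    attrs = np.array([c for c in train_df.columns if c != decision_col])
    rng.shuffle(attrs)
    tables = []
    for part in np.array_split(attrs, n_tables):
        shared = rng.choice(attrs, size=overlap, replace=False)   # possible common attributes
        cols = sorted(set(part.tolist()) | set(shared.tolist()))
        # full object set, identifiers dropped, local attribute subset plus decision attribute
        tables.append(train_df[cols + [decision_col]].reset_index(drop=True))
    return tables
```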
All the data sets are heavily imbalanced (Figure 1). To check whether the proposed method can handle imbalanced data, each data set is considered in two versions—the imbalanced version and the balanced version. The data are balanced with the use of the synthetic minority over-sampling technique (SMOTE) [21]. The implementation of this algorithm available in the WEKA software [22] is used. The balancing procedure is performed for each local table separately, using only the locally available subset of attributes. The objects of each decision class are over-sampled so that, after this process, every decision class has as many objects as the most numerous decision class in the set. Thus, a total of thirty dispersed sets are analyzed: each of the three data sets is dispersed into 5 versions, and each version is considered in a balanced and an imbalanced form (3 × 5 × 2).
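The study uses the WEKA implementation of SMOTE; a comparable per-table balancing step in Python could look as follows (this uses the imbalanced-learn package as a stand-in, not the implementation used in the experiments).
```python
from imblearn.over_sampling import SMOTE

def balance_local_table(table, decision_col="class"):
    X = table.drop(columns=[decision_col])
    y = table[decision_col]
    # by default SMOTE over-samples every minority class up to the majority class size
    X_res, y_res = SMOTE().fit_resample(X, y)
    return X_res.assign(**{decision_col: y_res})
```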
The quality of classification is evaluated based on the test set. The accuracy measure $acc$ is analyzed. It is defined as the fraction of correctly classified objects among all objects in the test set.
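Written as a formula (the symbol $\hat{d}(x)$, denoting the class predicted for a test object $x$, is introduced here only for clarity):

$acc = \frac{|\{x \in U_{test} : \hat{d}(x) = d(x)\}|}{|U_{test}|}$

where $U_{test}$ is the set of test objects.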
The main goal of the experiments is to investigate how the number of objects artificially generated based on a single object from a local table affects the quality of classification. An additional purpose is to determine guidelines for choosing an optimal value depending on the characteristics of the data sets, as well as to check the effect of the degree of dispersion on the obtained quality of classification. The different network structures and the impact of the number of neurons in the hidden layer on the quality of classification are also studied. A comparative analysis is carried out to determine whether the proposed approach performs equally well for balanced and imbalanced data. To meet the objectives above, the scheme of the experiments is as follows.
  • Studying different numbers of artificial objects generated based on a single object from each local table. The numbers of artificial objects $k \in \{1, 2, 3, 4, 5\}$ are studied.
  • Studying different levels of dispersion: 3, 5, 7, 9, 11 local tables.
  • Studying different numbers of neurons in the hidden layer. The number is determined in proportion to the number of neurons in the input layer. The following values are tested: {0.25, 0.5, 0.75, 1, 1.5, 1.75, 2, 2.5, 2.75, 3, 3.5, 3.75, 4, 4.5, 4.75, 5} × the number of neurons in the input layer.
  • Studying two versions for each data set—balanced and imbalanced versions.
  • Studying an iterative approach modeled on federated learning in order to make comparisons with the proposed approach.
Comparison of experimental results is made in terms of:
  • The quality of classification for different numbers of artificial objects generated;
  • The quality of classification for different versions of dispersion;
  • The quality of classification for different numbers of neurons in the hidden layer;
  • The quality of classification for balanced and imbalanced versions of the data sets.
Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6, presented in Appendix A, show the classification accuracy obtained for different versions of dispersion, different numbers of artificially generated objects, and different numbers of neurons in the hidden layer for the Vehicle imbalanced, Vehicle balanced, Landsat Satellite imbalanced, Landsat Satellite balanced, Dry Bean imbalanced, and Dry Bean balanced data sets, respectively. Each experiment is performed three times, and the average of the three runs is given in these tables. In each row of the tables, the best result is in a bold font. The following sections present an analysis of the results included in these tables from different perspectives. The last part presents a comparison with the approach modeled on federated learning.

3.1. Comparison of Quality of Classification for Different Numbers of Objects Artificially Generated

First, we compare the quality of classification for different numbers of artificially generated objects. Table 2 shows a comparison of the best results (those in a bold font in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6) obtained for different numbers of artificially generated objects. In the table, the best result for each dispersed data set is shown in a bold font.
As can be seen, for different data sets, different numbers of artificially generated objects guarantee the best results. In the case of the Vehicle data set, it can only be said that the approach with one artificial object gives the worst results. In the case of the Dry Bean data set, the use of two artificial objects definitely generates the best results. For the Landsat Satellite data set, it is hard to identify any such relationship.
Statistical tests are performed in order to check the significance of the differences in the obtained $acc$ results for different numbers of artificially generated objects. Friedman's test is performed on all results from Table 2. Five dependent groups are analyzed ($\{1, 2, 3, 4, 5\}$ artificial objects). The test did not confirm that the differences in classification accuracy among these five groups are significant ($p = 0.672$). However, as can be seen from Table 2, the classification accuracies obtained for different data sets are from completely different ranges. Due to this discrepancy, it is difficult to prove the significance of the differences. Therefore, it was decided to separate the obtained results by the considered data sets. Thus, three sets (for Vehicle, for Landsat Satellite, and for Dry Bean), each containing a ten-element sample, are obtained. Friedman's test confirmed the significance of the differences for the Dry Bean data set with $p = 0.003$. The Wilcoxon each-pair test confirmed significant differences between the average accuracy values for the following pairs: Vehicle—2 and 4 artificial objects, $p = 0.01$; Landsat Satellite—1 and 3 artificial objects, $p = 0.03$; Dry Bean—2 and 1 artificial objects, $p = 0.008$; 2 and 3 artificial objects, $p = 0.006$; 2 and 4 artificial objects, $p = 0.008$; 2 and 5 artificial objects, $p = 0.004$.
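The tests reported above could be reproduced with SciPy as sketched below. The paper does not state which statistical software was used, so this is only an illustrative equivalent; acc_by_k is a hypothetical dictionary mapping the number of artificial objects to the list of accuracies over the dispersed data sets.
```python
from scipy.stats import friedmanchisquare, wilcoxon

def compare_artificial_object_counts(acc_by_k):
    # Friedman's test over the dependent groups (one accuracy list per value of k)
    _, friedman_p = friedmanchisquare(*(acc_by_k[k] for k in sorted(acc_by_k)))
    # Wilcoxon signed-rank test for each pair of groups
    pairwise_p = {(a, b): wilcoxon(acc_by_k[a], acc_by_k[b]).pvalue
                  for a in sorted(acc_by_k) for b in sorted(acc_by_k) if a < b}
    return friedman_p, pairwise_p
```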
Additionally, comparative box-plot charts for the values of the classification accuracy and different data sets are created (Figure 2). As can be observed, for the Dry Bean data set, the box-plot for the two artificial objects definitely stands out among the others. It can also be concluded that using a single artificial object never generates good results. Taking into account the results of the comparisons and the number of objects in the analyzed data sets, a general conclusion can be drawn. For data sets with a large number of objects (around 9000 objects), a smaller number of artificial objects such as two objects is optimal. For smaller data sets with up to a thousand objects, a greater number of artificial objects (three or four) produces better results. More specifically, the smaller the number of objects in the local tables, the more artificially generated objects should be used in the proposed approach.

3.2. Comparison of Quality of Classification for Different Versions of Dispersion

We now compare the classification accuracy obtained for different versions of data dispersion. Table 3 presents a comparison of the best results (those in a bold font in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6) obtained for different versions of dispersion. In the table, the best result for each data set is shown in a bold font.
As can be observed, in the case of the Vehicle data set, the best results are obtained for medium data dispersion (7 local tables) or even large data dispersion (11 local tables). For this data set, the differences in results obtained for different versions of dispersion are the greatest compared with the other data sets. For the Landsat Satellite and Dry Bean data sets, the smallest dispersion (3 local tables) gives better results. However, looking closely at the results, we can observe that the absolute differences noted for these data sets are really small—at the third decimal place. So, we can conclude that for data sets with such a large number of objects, the differences recorded for different degrees of dispersion are negligible.
Statistical tests are performed in order to confirm the significance of the differences in the obtained $acc$ results. At first, the values of the classification accuracy in five dependent groups (3, 5, 7, 9, 11 local tables) are analyzed. The Friedman test confirmed a statistically significant difference in the results obtained for the five different versions of dispersion being considered, $\chi^2(28, 4) = 26.608$, $p = 0.00003$. The Wilcoxon each-pair test confirmed significant differences between the average accuracy values for all pairs involving 11 local tables: 3 and 11 local tables, $p = 0.007$; 5 and 11 local tables, $p = 0.001$; 7 and 11 local tables, $p = 0.004$; 9 and 11 local tables, $p = 0.016$.
Additionally, a comparative box-plot chart for the values of the classification accuracy is created (Figure 3). Here, the distributions of the values obtained for different versions of dispersion are similar; thus, we can conclude that for sufficiently large data sets (5000 objects), the degree of dispersion does not have a huge impact on the obtained results. More specifically, the degree of dispersion has little effect on the quality of classification in the proposed approach.

3.3. Comparison of Quality of Classification for Different Numbers of Neurons in the Hidden Layer

In Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6, presented earlier, all the results obtained for the different analyzed numbers of neurons in the hidden layer are given. The best obtained classification accuracies are also marked in those tables. It can be seen that these best results are generated by a higher number of neurons in the hidden layer. The optimal values range from 3 × the number of neurons in the input layer up to 5 × the number of neurons in the input layer. This property does not depend on the number of objects in the data set—no matter how large the data set is, more neurons in the hidden layer give better results. However, there is no single universal number of neurons in the hidden layer that is optimal for every data set.
In order to identify certain patterns for particular data sets, heat maps are created based on the results from Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 and shown in Figure 4. The number of neurons in the hidden layer is presented on the x-axis, while the number of artificial objects generated and the version of the dispersion are shown on the y-axis. The color on the map is determined by the classification accuracy value. For the Dry Bean data set, the clearest pattern can be seen: increasing the number of neurons in the hidden layer clearly improves classification accuracy. For the Vehicle data set, it can also be seen that a higher number of neurons results in better quality. The least visible dependence is found in the heat map for the Landsat Satellite data set. Here, for a large number of neurons in the hidden layer, both very good and worse classification quality are observed. More specifically, whether an increased number of neurons in the hidden layer improves the quality of classification depends on the data set, and this impact is very different and specific to each data set.

3.4. Comparison of Quality of Classification for Balanced and Imbalanced Versions of Data Set

We now focus on comparing the results obtained for balanced and imbalanced data. Table 4 presents a comparison of the best results (those in a bold font in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6) obtained for the balanced and imbalanced versions of each dispersed data set. In the table, the best result is shown in a bold font for each dispersed data set.
Based on the results, it cannot be concluded that the proposed method gives better results only for balanced or only for imbalanced data, as this depends on the data set in question. For the Vehicle data set, better results are obtained with balanced data, while for the Landsat Satellite data set, better results are obtained with imbalanced data. In both cases, the Wilcoxon test for dependent samples confirmed the statistical significance of the differences with $p = 0.0001$. In contrast, for the Dry Bean data set, the results in the balanced and imbalanced versions are virtually the same. Here, the Wilcoxon test did not confirm the significance of the differences ($p = 0.523$).
Comparative box-plot charts for the values of the classification accuracy in two groups of imbalanced and balanced data are created (Figure 5). The graphs confirm earlier conclusions; hence, it can be said that the proposed method handles balanced and imbalanced data comparably. In fact, the final result depends on the specifics of the data set. Determining the specific characteristics of the data sets that influenced the results requires further study. More specifically, it depends on the data set whether applying the SMOTE method for balancing the data set improves the quality of classification.

3.5. Comparison of Quality of Classification of the Proposed Approach with an Iterative Approach Modeled on Federated Learning

In this section, the results obtained with the approach modeled on federated learning [7,8,11] are presented and then compared with the results obtained for the proposed approach.
The main difference between the proposed approach and the one based on federated learning is the iterative aggregation of local models. In the proposed approach, local model aggregation occurs only once. The approach modeled on federated learning involves the following steps (a minimal sketch is given after the list):
  • Generation of local MLP neural networks based on the local tables, analogously to the procedure described in Section 2.1. This means that missing attributes in local tables are filled in, and artificial objects are generated.
  • The obtained weights and biases from local models are sent to a central server.
  • At the central server, the average of the weights and biases are computed, and the global model obtained is sent back to the local devices.
  • Local devices accept the global model, and once again, trained weights and biases are sent to the central server. Steps 3 and 4 are iterated three times.
  • The global model is post-trained on a stratified half of the test set, and its accuracy is tested on the remaining half. Then, the global model is post-trained on the other half and tested on the first half, after which the two classification accuracies are averaged. This is the final step of the process.
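A minimal sketch of this baseline for the same Keras local models is given below. The plain (unweighted) averaging, the helper names, and the local epoch count are assumptions, while the three aggregation rounds and the initial averaging follow the steps above.
```python
import numpy as np
import tensorflow as tf

def average_weights(weight_sets):
    # element-wise mean of the corresponding weight/bias arrays
    return [np.mean([w[i] for w in weight_sets], axis=0)
            for i in range(len(weight_sets[0]))]

def federated_baseline(local_models, local_data, rounds=3, epochs=10):
    global_model = tf.keras.models.clone_model(local_models[0])
    global_model.set_weights(average_weights([m.get_weights() for m in local_models]))
    for _ in range(rounds):
        collected = []
        for model, (X, y) in zip(local_models, local_data):
            model.set_weights(global_model.get_weights())    # accept the global model
            model.fit(X, y, epochs=epochs, verbose=0)        # local training
            collected.append(model.get_weights())
        global_model.set_weights(average_weights(collected)) # server-side averaging
    return global_model   # then post-trained and tested as in Section 2.3
```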
As may be noted, an effort was made to provide a fair comparison, as both the artificial objects and the post-training process were used in the above approach. An important difference between the proposed approach and the above model is the iterative aggregation of the global model. In addition, the same numbers of artificially generated objects and the same numbers of neurons in the hidden layer were analyzed. Of course, the experiments were performed on the same data sets in terms of the degree of dispersion and the balanced/imbalanced version. The full results are not given here for the sake of readability and clarity of the paper. Table 5 gives a comparison of the results obtained for the proposed approach and the one based on federated learning. In the table, the better result of the two approaches is shown in a bold font. As can be seen, in the overwhelming majority of cases, the proposed approach produced better results. Only in thirteen cases for the Vehicle data set did the approach modeled on federated learning produce slightly better results.
Statistical tests are performed to confirm the significance of the differences in the obtained results a c c for the proposed approach and the approach modeled on federated learning. The Wilcoxon test using all results from Table 5 is performed. Two dependent groups are analyzed (PA—the proposed approach, FL—the approach modeled on federated learning). The test confirms that differences among the classification accuracy in these two groups are significant ( p = 0.0001 ). Additionally, comparative box-plot charts for the values of the classification accuracy are created (Figure 6). The graphs confirm earlier conclusions, and hence it can be said that the proposed method generates better results than the approach modeled on federated learning.

4. Conclusions

The paper presented a new method for generating a global MLP model based on dispersed data with different sets of conditional attributes present in local tables. The proposed novelty is the method of generating artificial objects to train local networks with identical structures. An exhaustive comparison of the proposed method has been carried out in terms of the number of artificially generated objects, network structure, data balancing, and degree of data dispersion. The main conclusions are as follows. The greater the number of objects in the local tables, the fewer artificially generated objects are needed to obtain optimal results. For smaller data sets, a greater number of artificial objects (three or four) produces better results. For large data sets, data balancing and the degree of dispersion have no significant impact on the quality of classification. In most cases, a higher number of neurons in the hidden layer gives better results; however, this is very data-dependent and specific. The best results are obtained when the number of neurons in the hidden layer is equal to three to five times the number of neurons in the input layer. The paper also confirmed that the proposed method gives better results than the method modeled on federated learning.
Many aspects of the proposed approach should be considered in the future. The main plans are to test other ways of aggregating local models and to propose a new method for generating the global training set used in the post-training phase.

Author Contributions

Conceptualization, K.F.M., M.P.-K.; methodology, K.F.M., M.P.-K.; software, K.F.M.; validation, K.F.M., M.P.-K.; formal analysis, M.P.-K., K.F.M.; investigation, M.P.-K., K.F.M.; writing—original draft preparation, M.P.-K.; writing—review and editing, M.P.-K., K.F.M.; visualization, M.P.-K., K.F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Publicly available data sets were analyzed in this study. This data can be found here: [19].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 show the classification accuracy obtained for different versions of dispersion, different numbers of artificially generated objects, and different numbers of neurons in the hidden layer for Vehicle imbalanced, Vehicle balanced, Landsat Satellite imbalanced, Landsat Satellite balanced, Dry Bean imbalanced, and Dry Bean balanced data sets, respectively.
Table A1. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Vehicle imbalanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.2830.3310.5220.6640.7030.6940.7170.7130.6980.6630.7030.7240.6800.7110.7190.697
50.2810.6640.6780.6900.6930.6590.6990.6940.7020.6640.7030.6990.6940.7060.7100.694
170.4460.6400.6640.6860.6360.6590.6690.6590.6980.6820.6760.6690.6730.6850.6960.685
90.4660.6050.5720.6260.6670.6600.6590.6980.6900.6540.6820.6770.6750.6680.6420.692
110.2830.5160.5590.5370.6610.6520.6440.6760.6360.6940.6510.6800.6480.6690.6610.664
30.2830.3080.6640.6750.6730.6800.6680.6750.6850.6730.6760.6850.6640.6730.6850.688
50.2810.3980.6440.6710.6770.6650.6730.6650.6900.6640.6780.6820.6880.6960.6800.690
270.2830.6720.6500.6690.6960.6650.6730.6860.6890.6680.6990.6880.6930.6880.6730.692
90.2830.6670.6390.6520.6600.6670.6800.6900.6630.7070.7010.6730.6930.6780.6780.659
110.2830.2830.6390.6540.6990.6770.6990.6760.6800.7070.6920.6800.6650.6960.7100.678
30.2830.6750.6350.6480.6470.7060.7200.6980.7070.7030.7300.7230.7050.6970.6940.703
50.2830.6300.6770.7110.6800.6970.7170.7140.6970.6900.7140.6900.7280.7100.7100.707
370.2830.2800.6540.6840.6850.6770.6880.7010.7200.6920.7090.7130.6930.6920.7190.715
90.2830.2800.6850.6820.6920.6990.6730.7050.7090.6970.7010.6880.6880.6820.6850.707
110.2830.2830.6560.6680.6300.6550.6680.6430.6720.6590.6730.6720.6540.6710.6600.671
30.2830.5460.650.6380.6710.6940.6810.6680.6880.660.680.6780.6670.6890.690.702
50.2830.5180.6340.6570.6760.6810.6640.690.7230.6960.6920.6970.690.7240.6930.665
470.2830.3750.6840.6670.6940.680.6960.6930.6680.6780.7070.6780.6960.7190.7090.718
90.2830.5250.6730.690.690.7060.6930.6930.7180.7020.6880.7270.6890.6970.7090.685
110.2830.2830.6610.7010.7280.6840.7070.6920.7010.6990.690.7060.7050.7170.7240.698
30.3520.6170.630.6460.6690.6710.6640.650.6860.6690.6520.6850.6710.6930.6630.635
50.2830.6440.6920.6520.6810.7010.6850.6930.6730.6940.6920.7020.7010.7060.7140.71
570.2830.3910.6780.6630.6920.7130.6780.6730.6990.7130.690.690.7030.6760.7060.681
90.2830.520.6440.6340.6850.6940.6520.7030.6980.6960.6920.6860.6690.6920.6810.682
110.2830.2830.6590.6610.6570.6670.6890.6670.6780.6850.6610.6630.6960.680.6730.68
Table A2. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Vehicle balanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.2830.6850.7050.6920.6890.7020.7200.6990.7130.7050.7100.7180.7260.7300.7140.727
50.3740.6560.5420.6800.6970.6960.7270.6940.7020.6930.7130.7110.7380.6940.7050.728
170.2830.6840.7150.7300.7450.7510.7390.7430.7310.7450.7550.7380.7440.7390.7390.757
90.2830.7070.6850.5520.7100.7070.7350.7310.7430.7380.7260.7150.7320.7320.7300.709
110.4460.5380.6540.6760.6820.6600.6780.6820.6560.6810.6880.7050.6770.6970.6810.696
30.2830.5830.6510.6440.6850.6960.6640.690.6750.690.6840.6720.6930.6960.7050.69
50.2830.2830.6860.7110.7150.7110.7320.7340.7240.7240.730.7220.7360.7380.7310.748
270.2830.5670.7150.7020.7130.7280.7260.710.6990.7180.7390.7380.7130.7050.7030.717
90.3770.6890.6420.6860.6820.7170.6760.7150.7070.6980.6890.7050.7110.7180.6850.702
110.2830.6970.640.6850.6710.6960.6980.7110.6890.6940.7090.7180.7050.6880.7260.709
30.3020.6850.6810.6470.6970.7070.7060.7100.6930.6960.6900.6940.6750.6890.7260.715
50.2830.6770.7010.6860.6750.6930.7110.7130.7030.7100.7030.7140.7010.7220.7180.706
370.2830.4900.7400.7150.7390.7390.7340.7360.7470.7350.7550.7400.7560.7400.7510.736
90.2830.4650.6520.7020.6840.6970.7090.7270.6930.7010.7200.7230.6990.7360.7280.717
110.2830.4900.6810.6570.6760.6930.7060.6980.6890.6840.6980.6880.7010.6960.6860.692
30.2830.6760.610.710.7170.7020.7280.7340.7350.7180.7260.7050.7320.7310.7180.701
50.2830.6470.6770.6960.6940.6940.7030.6670.7170.7170.7220.730.7380.7360.7180.728
470.2830.7010.5540.7260.710.730.7180.740.7270.7170.7450.730.730.730.7480.752
90.2830.2830.7010.6970.7110.7110.7220.7470.730.7440.7410.720.7380.7350.7220.727
110.2830.2830.6690.7050.6750.6880.7020.7030.6970.710.7190.690.7170.6980.6940.713
30.2830.4020.6990.6520.6890.7140.7310.7180.7260.7090.690.6720.6730.7070.730.697
50.2830.5490.6550.6880.6960.6850.6780.7180.6810.7180.7110.7260.7220.7150.7010.71
570.2830.4860.6970.7060.6940.7150.7590.7220.7410.7220.7360.7470.7430.7270.7320.731
90.2830.3230.6880.7360.7180.7610.7410.7730.7390.7380.7480.7050.7530.740.7410.744
110.2830.2830.680.6730.6820.6820.6880.6780.6750.7020.6970.7010.6960.7130.7060.686
Table A3. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Landsat Satellite imbalanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.5880.7890.7890.7930.8060.8020.8080.8050.7990.8090.8050.8060.8090.8060.8020.801
50.2350.7930.7950.8090.8040.8070.8030.7980.8090.8030.8100.8050.8110.8130.8150.810
170.7470.7950.7880.7950.7960.8050.8140.8130.8080.8100.8000.8080.8070.7970.7990.805
90.7780.7910.7940.7920.7910.7930.7950.8010.7970.7980.8000.8040.8000.8050.8010.803
110.7930.7870.7890.8010.7940.7980.7890.7990.7910.8040.7990.8020.8020.7960.8020.803
30.7650.7870.7910.8030.80.7970.7960.8080.7920.8040.7980.8090.8030.8090.8030.803
50.5580.4890.8020.7970.8030.8020.8070.8030.8090.8070.8020.7970.7990.8070.8090.802
270.2350.7950.7980.8040.7950.7970.8050.7950.7990.7920.8090.8030.8070.8010.8050.803
90.2350.790.7940.7970.7990.7970.7980.8050.80.7970.7960.8130.8010.7960.7960.803
110.2350.7950.7920.7840.8040.8020.80.8060.8050.8080.8030.8150.810.8070.8150.801
30.7130.7860.7980.8000.8070.7960.8120.8010.8040.8080.8070.8090.8130.8000.8100.806
50.7500.7940.7960.8040.8030.8050.8100.8070.8110.8120.8070.8200.8100.8090.8030.804
370.5560.7940.7990.7890.8050.7980.7970.7990.7900.8010.8060.8110.7990.8030.7980.801
90.5560.7970.8010.7970.8020.8000.8050.8010.8010.8080.8040.8070.8020.8070.8040.805
110.2350.7530.7940.7910.7950.8030.8050.7950.7960.7990.8100.7990.8040.7940.7930.792
30.6230.7960.7940.8070.8090.8050.7950.8010.8070.8020.8110.8120.7970.8130.8090.81
50.7810.7980.7950.8040.790.8080.8010.8050.80.7980.810.8070.8080.8080.8040.811
470.5580.780.7960.80.7910.7970.8050.7950.8010.810.8030.80.8040.8030.7960.801
90.7840.7970.7970.7960.7940.790.7970.7830.8060.8030.8040.7960.8030.8030.7960.798
110.2350.7920.7930.7940.7950.7940.7970.7960.7930.7990.7950.7840.7940.7960.7940.8
30.2350.7830.7860.7940.790.8030.7910.7920.8030.8050.7890.8020.7970.8070.8030.808
50.5610.7940.80.7930.8110.8090.8060.7910.8090.8060.8120.8120.8010.8130.8130.809
570.660.790.7960.7990.7990.8090.8030.7970.8050.8060.80.8010.8020.8080.8060.803
90.5590.7920.8020.7920.7990.7920.8050.8090.80.8010.8030.8060.8010.7990.8020.801
110.2350.7890.7990.7980.790.7970.7990.8020.80.8010.8010.8030.8010.80.8040.794
Table A4. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Landsat Satellite balanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.7670.7940.7910.7820.7870.7920.7870.7880.7790.7870.7990.7960.7860.7890.7800.786
50.2380.7400.7780.7810.7850.7810.7930.7880.7930.7950.7890.7960.7870.7850.7900.799
170.4170.7760.7890.7930.7950.8000.7930.7840.7930.7940.7980.7910.7890.7830.8000.796
90.2380.7820.7870.7860.7900.7800.7850.7820.7810.7860.7800.7850.7820.7870.7870.794
110.2380.7410.7800.7810.7850.7880.7870.7890.7880.7890.7810.7820.7900.7830.7840.789
30.2380.7750.7790.7890.7820.7810.7910.790.7910.7890.7880.7890.7880.7850.7990.791
50.2380.7870.7930.7950.7780.7860.7930.7890.7940.7960.7910.7950.7880.7970.7910.787
270.2380.7880.7860.7870.7850.780.7810.7810.7780.790.7910.7670.7820.7820.7770.785
90.6890.7890.7830.7910.7870.7930.7770.7860.7850.7890.7920.7920.7870.7870.7890.781
110.5630.7720.7830.7850.780.7790.7820.7890.7820.7830.7720.7850.7910.780.7850.789
30.5620.7820.7860.7810.7780.7790.7870.7940.7800.7810.7910.7730.8010.7870.8030.790
50.2380.7850.7870.7950.7910.7880.7880.7900.7980.7860.7840.7840.8050.7820.7870.790
370.2380.7920.7860.7900.7850.7930.7970.7770.7860.7880.7920.8010.7980.7890.7800.798
90.5640.7810.7840.7860.7950.7910.7910.7870.7940.7940.7870.7950.7920.7870.7940.792
110.5670.7750.7800.7820.7880.7800.7830.7860.7850.7800.7790.7860.7880.7840.7890.773
30.7530.7740.7780.7880.7880.7830.7790.7880.7880.7810.7830.7870.7750.7870.7960.797
50.5680.7870.7880.790.7940.7860.7860.7930.790.7990.790.7810.790.7970.7970.801
470.5660.7730.780.7870.7920.7840.7710.790.7810.7830.7910.7750.7930.790.7790.775
90.2380.7730.7880.7780.7750.780.7870.780.780.7630.7930.7710.7850.7910.7870.784
110.5660.7850.7870.780.7840.7810.7820.7820.7910.7780.7840.7760.790.7760.7760.798
30.5690.7810.7940.7930.7870.7790.7940.7990.7970.8030.80.7880.7940.7970.7920.798
50.6760.7770.7810.7840.7850.80.7920.7950.7980.7840.7860.7810.7920.7940.7930.792
570.7340.7790.7810.780.7810.7780.7820.7780.7870.7750.780.7790.7910.7890.7880.769
90.4010.7880.780.7870.7920.790.7940.7870.7840.7930.7890.7960.7720.7930.7830.78
110.2380.7590.7850.7850.7870.7840.790.7920.7880.7910.7860.7860.7870.7860.7820.788
Table A5. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Dry Bean imbalanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.8770.8910.8920.9020.9020.9040.9040.9100.9120.9100.9100.9120.9120.9130.9150.915
50.5410.8880.8940.9000.9000.9070.9030.9070.9070.9090.9130.9110.9110.9120.9110.911
170.8850.8870.8930.8940.9010.9000.9020.9030.9050.9060.9090.9100.9110.9110.9110.911
90.8790.8900.8880.8930.8970.8990.9030.9040.9050.9070.9060.9070.9080.9100.9120.911
110.8820.8900.8920.8930.8990.8990.9020.9040.9050.9080.9070.9080.9100.9120.9120.910
30.8860.8890.9030.9050.9110.9140.9110.9120.9120.9120.9160.9150.9160.9160.9170.917
50.8220.890.9040.8990.9020.9050.9050.910.9110.9110.9110.9120.9140.9140.9140.915
270.880.8950.9030.8960.9040.9080.9080.9070.9080.9110.9110.9110.9110.9140.9150.913
90.8840.8880.8960.8940.8990.9020.9050.9050.9050.9060.910.9120.9120.9120.9110.913
110.8840.8910.8970.8940.8970.8990.9020.9040.9070.9060.9080.9120.9090.9130.9110.914
30.8850.8950.8970.8980.9040.9050.9070.9080.9100.9110.9110.9130.9120.9120.9120.913
50.8860.8880.8900.9020.9020.9060.9050.9060.9070.9090.9090.9100.9100.9100.9120.912
370.8040.8870.8960.8920.9020.9030.9020.9080.9070.9060.9090.9070.9090.9110.9110.911
90.5200.8860.8900.8950.8960.8990.8970.9050.9040.9020.9070.9070.9070.9100.9090.909
110.6330.8860.8910.8960.8970.8970.9000.9000.9020.9020.9030.9050.9050.9060.9070.908
30.790.8940.8950.90.9070.9070.9090.910.9090.910.9130.9130.9130.9140.9140.917
50.8840.8930.8990.90.9030.9030.9030.9070.9090.9110.9090.9110.9110.910.9110.914
470.8660.8870.8890.8960.9010.8980.8980.9050.9060.9050.9060.9090.9090.9080.910.91
90.8870.8860.8920.8920.8980.9020.9030.9030.9050.9060.910.9080.910.9090.9110.911
110.780.8890.8950.8960.8980.8990.90.9010.9030.9060.9090.9080.9080.9090.9090.911
30.8760.8920.9010.8980.9010.910.9080.9110.9130.9090.9120.9140.9140.9140.9140.911
50.8820.8870.8930.8970.9030.9030.9060.9090.9070.9080.910.910.9110.9120.9130.913
570.8830.890.8930.8950.8980.9020.9010.9040.9060.9080.910.9110.910.9130.9110.914
90.8830.8870.8890.8940.8990.8990.9010.9030.9060.9040.9070.9080.9110.9130.910.911
110.8560.8890.8920.8920.8970.8980.8970.9070.9030.9020.9060.9080.9080.910.9110.908
Table A6. Results of classification accuracy $acc$ for the proposed approach with one hidden layer and different numbers of artificially generated objects—Dry Bean balanced data set. Designation I is used for the number of neurons in the input layer.
No. of Artificial Objects | No. of Tables | No. of Neurons in Hidden Layer (columns, left to right): 0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I
30.4210.8910.8970.8980.9010.9050.9040.9100.9100.9110.9110.9120.9110.9140.9130.915
50.8830.8910.8940.9000.9020.9040.9040.9090.9090.9090.9100.9100.9090.9100.9110.912
170.6570.8880.8930.8940.8970.8970.9000.9040.9050.9050.9070.9080.9090.9090.9110.911
90.8810.8850.8940.8960.8980.8980.9010.9030.9050.9070.9080.9100.9070.9110.9130.912
110.8710.8870.8920.8940.8950.9030.9030.9020.9030.9080.9090.9090.9090.9100.9110.909
30.8970.9020.9050.9060.9120.910.9140.9160.9140.9160.9160.9170.9180.9160.9170.918
50.8860.9010.9060.9050.9070.9090.9090.9110.9070.910.9130.9150.9140.9140.9160.913
270.6340.8920.9030.9010.9070.9040.9050.9090.9150.9110.9140.9120.9140.9130.910.914
90.8630.8940.8960.90.9030.9060.9080.9080.9080.9070.9120.910.9120.9120.9120.913
110.8780.8950.8940.8920.9020.9010.9010.9090.9050.9080.9060.910.9090.910.910.911
30.4130.8900.8980.9010.9040.9080.9090.9080.9090.9080.9100.9120.9120.9130.9130.911
50.5800.8890.8910.8960.8990.9060.9070.9050.9060.9080.9090.9090.9130.9130.9120.911
370.8870.8890.8940.9000.9000.9030.9040.9050.9060.9090.9100.9120.9100.9130.9120.912
90.6060.8840.8920.8930.8970.9030.8990.9010.9010.9040.9040.9050.9060.9040.9070.906
110.7340.8910.8920.8940.8980.8980.9010.9020.9050.9060.9060.9060.9090.9110.9100.913
30.8840.8930.8940.8970.9050.9040.9070.9090.9110.910.9140.9110.9130.9130.9140.916
50.8840.8920.8940.8930.9020.9040.9020.9040.9090.9090.9070.9090.9120.9110.9130.913
470.8840.8880.8940.8970.90.90.9030.9050.9060.9070.9090.9090.9070.9090.910.912
90.8540.8880.8860.890.9010.9020.9030.9010.9040.9040.9020.9050.910.9090.910.911
110.8440.890.8930.8950.8960.9050.9010.9060.9050.9060.9070.9090.9090.9120.9110.912
30.8830.8890.8990.9010.9060.9080.9070.9080.910.9090.910.9120.9120.9110.9130.913
50.8840.8950.8960.90.9050.9070.9030.9060.9060.9070.9110.910.9110.9120.9130.912
570.8010.8850.8890.8960.9010.9050.90.9040.9050.9060.9080.9070.9110.9110.910.91
90.8820.8880.8910.8920.8980.90.90.9030.9030.9010.9070.9080.9070.9110.910.911
110.4250.8910.8920.8920.8970.8970.8970.9050.9040.9050.9070.9090.9070.9070.9090.909

Figure 1. Imbalance of data: cardinality of decision classes in the training and test sets.
Figure 2. Box-plot chart (median, first quartile Q1, third quartile Q3) of the classification accuracy acc for different numbers of artificially generated objects.
Figure 3. Box-plot chart (median, first quartile Q1, third quartile Q3) of the classification accuracy acc for different versions of dispersion.
Figure 4. Heat maps of the accuracy levels for all data sets.
Figure 5. Box-plot chart (median, first quartile Q1, third quartile Q3) of the classification accuracy acc for the imbalanced and balanced versions of the data sets.
Figure 6. Box-plot chart (median, first quartile Q1, third quartile Q3) of the classification accuracy acc for the proposed approach and the approach modeled on federated learning.
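Figures 2, 3, 5 and 6 summarize the classification accuracy as box plots grouped by a single experimental parameter. The sketch below shows one way such a chart could be produced from a table of results; the pandas/matplotlib tooling, the column names no_art_objects and accuracy, and the sample values (taken from Table 2) are assumptions made for illustration only, not the authors' actual plotting code.

```python
# Minimal sketch (assumed tooling and column names): box plot of classification
# accuracy grouped by the number of artificially generated objects, as in Figure 2.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical results table: one row per experiment.
results = pd.DataFrame({
    "no_art_objects": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "accuracy":       [0.724, 0.809, 0.688, 0.809, 0.730,
                       0.813, 0.702, 0.813, 0.693, 0.808],
})

# One box (median, Q1, Q3, whiskers) per parameter value.
results.boxplot(column="accuracy", by="no_art_objects")
plt.suptitle("")  # drop the automatic pandas title
plt.xlabel("Number of artificially generated objects")
plt.ylabel("Classification accuracy acc")
plt.show()
```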
Table 1. Data set characteristics. The symbol # denotes a count (of objects, attributes, or classes).
Data Set | # Training Set | # Test Set | # Conditional Attributes | # Decision Classes
Vehicle | 592 | 254 | 18 | 4
Landsat Satellite | 4435 | 1000 | 36 | 6
Dry Bean | 9527 | 4084 | 16 | 7
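The training/test proportions in Table 1 correspond to roughly a 70%/30% division of each data set (Landsat Satellite uses its fixed hold-out test set). A minimal sketch of such a split is shown below; the file name, the assumption that the last column holds the decision attribute, and the stratified 70/30 ratio are illustrative guesses rather than the authors' exact procedure.

```python
# Minimal sketch (assumptions: CSV file name, decision attribute in the last
# column, stratified 70/30 split as suggested by the counts in Table 1).
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("dry_bean.csv")           # hypothetical file name
X, y = data.iloc[:, :-1], data.iloc[:, -1]   # 16 conditional attributes, 1 decision

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # roughly 9527 and 4084 objects for Dry Bean
```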
Table 2. Comparison of classification accuracy acc obtained for different numbers of artificially generated objects.
Data Set | No. of Tables | No. of Artificially Generated Objects
 |  | 1 | 2 | 3 | 4 | 5
Vehicle imbalanced | 3 | 0.724 | 0.688 | 0.73 | 0.702 | 0.693
 | 5 | 0.71 | 0.696 | 0.728 | 0.724 | 0.714
 | 7 | 0.698 | 0.699 | 0.72 | 0.719 | 0.713
 | 9 | 0.698 | 0.707 | 0.709 | 0.727 | 0.703
 | 11 | 0.694 | 0.71 | 0.673 | 0.728 | 0.696
Vehicle balanced | 3 | 0.73 | 0.705 | 0.726 | 0.735 | 0.731
 | 5 | 0.738 | 0.748 | 0.722 | 0.738 | 0.726
 | 7 | 0.757 | 0.739 | 0.756 | 0.752 | 0.759
 | 9 | 0.743 | 0.718 | 0.736 | 0.747 | 0.773
 | 11 | 0.705 | 0.726 | 0.706 | 0.719 | 0.713
Landsat Satellite imbalanced | 3 | 0.809 | 0.809 | 0.813 | 0.813 | 0.808
 | 5 | 0.815 | 0.809 | 0.82 | 0.811 | 0.813
 | 7 | 0.814 | 0.809 | 0.811 | 0.81 | 0.809
 | 9 | 0.805 | 0.813 | 0.808 | 0.806 | 0.809
 | 11 | 0.804 | 0.815 | 0.81 | 0.8 | 0.804
Landsat Satellite balanced | 3 | 0.799 | 0.799 | 0.803 | 0.797 | 0.803
 | 5 | 0.799 | 0.797 | 0.805 | 0.801 | 0.8
 | 7 | 0.8 | 0.791 | 0.801 | 0.793 | 0.791
 | 9 | 0.794 | 0.793 | 0.795 | 0.793 | 0.796
 | 11 | 0.79 | 0.791 | 0.789 | 0.798 | 0.792
Dry Bean imbalanced | 3 | 0.915 | 0.917 | 0.913 | 0.917 | 0.914
 | 5 | 0.913 | 0.915 | 0.912 | 0.914 | 0.913
 | 7 | 0.911 | 0.915 | 0.911 | 0.91 | 0.914
 | 9 | 0.912 | 0.913 | 0.91 | 0.911 | 0.913
 | 11 | 0.912 | 0.914 | 0.908 | 0.911 | 0.911
Dry Bean balanced | 3 | 0.915 | 0.918 | 0.913 | 0.916 | 0.913
 | 5 | 0.912 | 0.916 | 0.913 | 0.913 | 0.913
 | 7 | 0.911 | 0.915 | 0.913 | 0.912 | 0.911
 | 9 | 0.913 | 0.913 | 0.907 | 0.911 | 0.911
 | 11 | 0.911 | 0.911 | 0.913 | 0.912 | 0.909
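Table 2 reports one accuracy value per combination of data set, number of local tables, and number of artificially generated objects. The fragment below sketches how raw experiment records could be pivoted into this layout; the DataFrame column names and the sample values are assumptions for illustration.

```python
# Minimal sketch (assumed column names): pivot raw accuracy records into the
# layout of Table 2, with rows indexed by data set and number of local tables
# and one column per number of artificially generated objects.
import pandas as pd

records = pd.DataFrame({
    "data_set":       ["Vehicle imbalanced"] * 4,
    "no_tables":      [3, 3, 5, 5],
    "no_art_objects": [1, 2, 1, 2],
    "accuracy":       [0.724, 0.688, 0.710, 0.696],
})

table2 = records.pivot_table(index=["data_set", "no_tables"],
                             columns="no_art_objects",
                             values="accuracy")
print(table2.round(3))
```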
Table 3. Comparison of classification accuracy acc obtained for different numbers of artificially generated objects.
Data Set | No. of Artificially Generated Objects | No. of Local Tables
 |  | 3 | 5 | 7 | 9 | 11
Vehicle imbalanced | 1 | 0.724 | 0.71 | 0.698 | 0.698 | 0.694
 | 2 | 0.688 | 0.696 | 0.699 | 0.707 | 0.71
 | 3 | 0.73 | 0.728 | 0.72 | 0.709 | 0.673
 | 4 | 0.702 | 0.724 | 0.719 | 0.727 | 0.728
 | 5 | 0.693 | 0.714 | 0.713 | 0.703 | 0.696
Vehicle balanced | 1 | 0.73 | 0.738 | 0.757 | 0.743 | 0.705
 | 2 | 0.705 | 0.748 | 0.739 | 0.718 | 0.726
 | 3 | 0.726 | 0.722 | 0.756 | 0.736 | 0.706
 | 4 | 0.735 | 0.738 | 0.752 | 0.747 | 0.719
 | 5 | 0.731 | 0.726 | 0.759 | 0.773 | 0.713
Landsat Satellite imbalanced | 1 | 0.809 | 0.815 | 0.814 | 0.805 | 0.804
 | 2 | 0.809 | 0.809 | 0.809 | 0.813 | 0.815
 | 3 | 0.813 | 0.82 | 0.811 | 0.808 | 0.81
 | 4 | 0.813 | 0.811 | 0.81 | 0.806 | 0.8
 | 5 | 0.808 | 0.813 | 0.809 | 0.809 | 0.804
Landsat Satellite balanced | 1 | 0.799 | 0.799 | 0.8 | 0.794 | 0.79
 | 2 | 0.799 | 0.797 | 0.791 | 0.793 | 0.791
 | 3 | 0.803 | 0.805 | 0.801 | 0.795 | 0.789
 | 4 | 0.797 | 0.801 | 0.793 | 0.793 | 0.798
 | 5 | 0.803 | 0.8 | 0.791 | 0.796 | 0.792
Dry Bean imbalanced | 1 | 0.915 | 0.913 | 0.911 | 0.912 | 0.912
 | 2 | 0.917 | 0.915 | 0.915 | 0.913 | 0.914
 | 3 | 0.913 | 0.912 | 0.911 | 0.91 | 0.908
 | 4 | 0.917 | 0.914 | 0.91 | 0.911 | 0.911
 | 5 | 0.914 | 0.913 | 0.914 | 0.913 | 0.911
Dry Bean balanced | 1 | 0.915 | 0.912 | 0.911 | 0.913 | 0.911
 | 2 | 0.918 | 0.916 | 0.915 | 0.913 | 0.911
 | 3 | 0.913 | 0.913 | 0.913 | 0.907 | 0.913
 | 4 | 0.916 | 0.913 | 0.912 | 0.911 | 0.912
 | 5 | 0.913 | 0.913 | 0.911 | 0.911 | 0.909
Table 4. Comparison of classification accuracy acc obtained for the imbalanced and balanced versions of the data sets.
Data Set | No. of Tables | No. of Art. Objects | Imbalanced | Balanced | Data Set | Imbalanced | Balanced
Vehicle | 3 | 1 | 0.724 | 0.73 | Dry Bean | 0.915 | 0.915
 |  | 2 | 0.688 | 0.705 |  | 0.917 | 0.918
 |  | 3 | 0.73 | 0.726 |  | 0.913 | 0.913
 |  | 4 | 0.702 | 0.735 |  | 0.917 | 0.916
 |  | 5 | 0.693 | 0.731 |  | 0.914 | 0.913
 | 5 | 1 | 0.71 | 0.738 |  | 0.913 | 0.912
 |  | 2 | 0.696 | 0.748 |  | 0.915 | 0.916
 |  | 3 | 0.728 | 0.722 |  | 0.912 | 0.913
 |  | 4 | 0.724 | 0.738 |  | 0.914 | 0.913
 |  | 5 | 0.714 | 0.726 |  | 0.913 | 0.913
 | 7 | 1 | 0.698 | 0.757 |  | 0.911 | 0.911
 |  | 2 | 0.699 | 0.739 |  | 0.915 | 0.915
 |  | 3 | 0.72 | 0.756 |  | 0.911 | 0.913
 |  | 4 | 0.719 | 0.752 |  | 0.91 | 0.912
 |  | 5 | 0.713 | 0.759 |  | 0.914 | 0.911
 | 9 | 1 | 0.698 | 0.743 |  | 0.912 | 0.913
 |  | 2 | 0.707 | 0.718 |  | 0.913 | 0.913
 |  | 3 | 0.709 | 0.736 |  | 0.91 | 0.907
 |  | 4 | 0.727 | 0.747 |  | 0.911 | 0.911
 |  | 5 | 0.703 | 0.773 |  | 0.913 | 0.911
 | 11 | 1 | 0.694 | 0.705 |  | 0.912 | 0.911
 |  | 2 | 0.71 | 0.726 |  | 0.914 | 0.911
 |  | 3 | 0.673 | 0.706 |  | 0.908 | 0.913
 |  | 4 | 0.728 | 0.719 |  | 0.911 | 0.912
 |  | 5 | 0.696 | 0.713 |  | 0.911 | 0.909
Landsat Satellite | 3 | 1 | 0.809 | 0.799
 |  | 2 | 0.809 | 0.799
 |  | 3 | 0.813 | 0.803
 |  | 4 | 0.813 | 0.797
 |  | 5 | 0.808 | 0.803
 | 5 | 1 | 0.815 | 0.799
 |  | 2 | 0.809 | 0.797
 |  | 3 | 0.82 | 0.805
 |  | 4 | 0.811 | 0.801
 |  | 5 | 0.813 | 0.8
 | 7 | 1 | 0.814 | 0.8
 |  | 2 | 0.809 | 0.791
 |  | 3 | 0.811 | 0.801
 |  | 4 | 0.81 | 0.793
 |  | 5 | 0.809 | 0.791
 | 9 | 1 | 0.805 | 0.794
 |  | 2 | 0.813 | 0.793
 |  | 3 | 0.808 | 0.795
 |  | 4 | 0.806 | 0.793
 |  | 5 | 0.809 | 0.796
 | 11 | 1 | 0.804 | 0.79
 |  | 2 | 0.815 | 0.791
 |  | 3 | 0.81 | 0.789
 |  | 4 | 0.8 | 0.798
 |  | 5 | 0.804 | 0.792
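Table 4 contrasts the imbalanced data sets with their balanced versions. One common way of obtaining a balanced training set is synthetic minority over-sampling; the sketch below uses SMOTE from the imbalanced-learn package purely as an illustration and does not claim to reproduce the balancing procedure actually applied in the experiments.

```python
# Minimal sketch: balancing a training set with SMOTE (imbalanced-learn).
# The synthetic data below merely stands in for a local training table.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=18, n_informative=10,
                           n_classes=4, weights=[0.55, 0.25, 0.15, 0.05],
                           random_state=42)
print("class counts before:", Counter(y))

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("class counts after: ", Counter(y_bal))
```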
Table 5. Comparison of classification accuracy acc obtained for the proposed approach (PA) and the approach modeled on federated learning (FL).
Approach |  | PA | FL | PA | FL | PA | FL | PA | FL | PA | FL
Data Set | No. of Tables | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 (No. of Artificially Generated Objects)
Vehicle imbalanced | 3 | 0.724 | 0.677 | 0.688 | 0.677 | 0.73 | 0.673 | 0.702 | 0.709 | 0.693 | 0.724
 | 5 | 0.71 | 0.681 | 0.696 | 0.673 | 0.728 | 0.693 | 0.724 | 0.717 | 0.714 | 0.709
 | 7 | 0.698 | 0.705 | 0.699 | 0.693 | 0.72 | 0.661 | 0.719 | 0.665 | 0.713 | 0.701
 | 9 | 0.698 | 0.673 | 0.707 | 0.685 | 0.709 | 0.709 | 0.727 | 0.697 | 0.703 | 0.677
 | 11 | 0.694 | 0.673 | 0.71 | 0.673 | 0.673 | 0.673 | 0.728 | 0.689 | 0.696 | 0.661
Vehicle balanced | 3 | 0.73 | 0.65 | 0.705 | 0.713 | 0.726 | 0.752 | 0.735 | 0.713 | 0.731 | 0.689
 | 5 | 0.738 | 0.709 | 0.748 | 0.669 | 0.722 | 0.701 | 0.738 | 0.732 | 0.726 | 0.713
 | 7 | 0.757 | 0.709 | 0.739 | 0.748 | 0.756 | 0.665 | 0.752 | 0.736 | 0.759 | 0.764
 | 9 | 0.743 | 0.717 | 0.718 | 0.756 | 0.736 | 0.728 | 0.747 | 0.748 | 0.773 | 0.748
 | 11 | 0.705 | 0.709 | 0.726 | 0.736 | 0.706 | 0.74 | 0.719 | 0.693 | 0.713 | 0.748
Landsat Satellite imbalanced | 3 | 0.809 | 0.759 | 0.809 | 0.766 | 0.813 | 0.773 | 0.813 | 0.783 | 0.808 | 0.781
 | 5 | 0.815 | 0.766 | 0.809 | 0.768 | 0.82 | 0.781 | 0.811 | 0.78 | 0.813 | 0.781
 | 7 | 0.814 | 0.779 | 0.809 | 0.77 | 0.811 | 0.777 | 0.81 | 0.777 | 0.809 | 0.769
 | 9 | 0.805 | 0.771 | 0.813 | 0.767 | 0.808 | 0.784 | 0.806 | 0.786 | 0.809 | 0.782
 | 11 | 0.804 | 0.771 | 0.815 | 0.773 | 0.81 | 0.775 | 0.8 | 0.781 | 0.804 | 0.782
Landsat Satellite balanced | 3 | 0.799 | 0.74 | 0.799 | 0.734 | 0.803 | 0.743 | 0.797 | 0.77 | 0.803 | 0.773
 | 5 | 0.799 | 0.74 | 0.797 | 0.759 | 0.805 | 0.746 | 0.801 | 0.777 | 0.8 | 0.774
 | 7 | 0.8 | 0.76 | 0.791 | 0.752 | 0.801 | 0.766 | 0.793 | 0.777 | 0.791 | 0.773
 | 9 | 0.794 | 0.75 | 0.793 | 0.759 | 0.795 | 0.762 | 0.793 | 0.774 | 0.796 | 0.765
 | 11 | 0.79 | 0.757 | 0.791 | 0.776 | 0.789 | 0.761 | 0.798 | 0.763 | 0.792 | 0.788
Dry Bean imbalanced | 3 | 0.915 | 0.881 | 0.917 | 0.904 | 0.913 | 0.883 | 0.917 | 0.877 | 0.914 | 0.894
 | 5 | 0.913 | 0.872 | 0.915 | 0.889 | 0.912 | 0.883 | 0.914 | 0.893 | 0.913 | 0.893
 | 7 | 0.911 | 0.88 | 0.915 | 0.899 | 0.911 | 0.889 | 0.91 | 0.878 | 0.914 | 0.873
 | 9 | 0.912 | 0.877 | 0.913 | 0.891 | 0.91 | 0.875 | 0.911 | 0.875 | 0.913 | 0.889
 | 11 | 0.912 | 0.887 | 0.914 | 0.891 | 0.908 | 0.878 | 0.911 | 0.889 | 0.911 | 0.893
Dry Bean balanced | 3 | 0.915 | 0.893 | 0.918 | 0.91 | 0.913 | 0.889 | 0.916 | 0.87 | 0.913 | 0.89
 | 5 | 0.912 | 0.876 | 0.916 | 0.9 | 0.913 | 0.884 | 0.913 | 0.859 | 0.913 | 0.88
 | 7 | 0.911 | 0.878 | 0.915 | 0.895 | 0.913 | 0.9 | 0.912 | 0.884 | 0.911 | 0.889
 | 9 | 0.913 | 0.881 | 0.913 | 0.89 | 0.907 | 0.881 | 0.911 | 0.87 | 0.911 | 0.881
 | 11 | 0.911 | 0.876 | 0.911 | 0.896 | 0.913 | 0.872 | 0.912 | 0.887 | 0.909 | 0.886