1. Introduction
Regression analysis is one of the most useful and widely used branches of supervised learning, and successful applications have been reported in many areas. Classical regression analysis assumes that the experimental data are precise, and numerous statistical treatments have been proposed to model precisely defined data [1]. However, this assumption is unrealistic in some cases, because the observations used in regression analysis may be imprecise, such as “about 50 years old” or “almost fully cured”.
In regression analysis, we sometimes encounter imprecise data with non-sharp boundaries, linguistic data, and incomplete information. The fuzzy set theory introduced by Zadeh [2] has been used to handle such data. A fuzzy regression model is needed to estimate the statistical relationship among variables when the independent variables, the dependent variables, or the regression coefficients are expressed as fuzzy sets. Since the fuzzy regression model was introduced by Tanaka [3], numerous modified or extended fuzzy regression models have been proposed, and many successful applications have been presented [2,3,4,5,6,7,8,9,10,11,12,13,14,15]. One goal of fuzzy regression analysis is to discover statistical relationships between given fuzzy values. The general approach, as discussed by many authors, is to induce this relationship from the centers and spreads of the fuzzy values using the same function.
In a fuzzy regression model, the response functions for the center and the spread may not be the same. Choi and Yoon [6] described a fuzzy regression model whose response functions for the center and the spread are an exponential function and a linear function, respectively. If the response functions for the center and the spread do not match, it may be more effective to assume them independently and to estimate each assumed response function separately. In addition, in fuzzy regression models, the observed centers or spreads of the response variable often take repeated values. If survey results or product quality are expressed as fuzzy numbers, the spread of the response variable may take only a finite number of repeated values, as in the examples presented in D’Urso [8] and Yoon et al. [14,15]. In this case, it may be more efficient to use regression models for categorical data instead of continuous data. In ordinary regression analysis, categorical regression analysis, logistic regression analysis, or discriminant analysis is used to analyze a categorical response variable. If the spreads of the fuzzy data repeat the same few values while the centers vary, it may be more efficient to estimate the fuzzy regression model by estimating the spread and the center independently instead of applying the same function to both. This is our motivation. The proposed hybrid estimation algorithm is a modified and extended version of the algorithm introduced in Jung et al. [16].
We propose an algorithm that constructs a fuzzy regression model when the response functions for the spread and the center of the dependent variable do not match. To estimate the regression model for the center of the dependent variable, which is continuous, we use the least absolute deviations (LAD) estimation method, which is not sensitive to outliers. In addition, discriminant analysis is used to estimate the response function for the spread of the dependent variable, which takes only a finite number of repeated values. The F-transform introduced by Perfilieva [17] has been studied and found useful in many applications [18,19,20,21,22]. The F-transform converts the original data into weighted mean values, where the weights are given by the basic functions, i.e., the membership functions that identify fuzzy sets. In this paper, we use the F-transform to categorize the spreads of the dependent variable. To predict the dependent variable of the fuzzy regression model, we propose a hybrid algorithm that combines LAD estimation with discriminant analysis and the F-transform. We expect the proposed hybrid algorithm to reduce the prediction error in fuzzy regression analysis when the response functions for the spread and the center of the dependent variable do not match.
This paper is organized as follows:
Section 2 presents preliminary concepts required to develop the main results.
Section 3 proposes a hybrid estimation algorithm combining the LAD estimation method and discriminant analysis.
Section 4 gives two numerical examples to illustrate our results and compare them with existing methods.
Section 5 concludes the paper.
2. Preliminaries
Fuzzy set theory was introduced to provide a suitable concept for dealing with imprecise data [2]. Following [3,4], we introduce some definitions required to develop the main results, such as fuzzy sets and fuzzy numbers.
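A fuzzy set $A$ on a universe $X$ is a set of ordered pairs
$$A = \{(x, \mu_A(x)) : x \in X\},$$
where $\mu_A : X \to [0, 1]$ is the membership function of $A$.
The support of $A$ defined on $X$ is the crisp set
$$\operatorname{supp}(A) = \{x \in X : \mu_A(x) > 0\}.$$
A fuzzy number $A$ is a normal and convex fuzzy subset of the real line $\mathbb{R}$ with bounded support. As a special case, a fuzzy number $A$, denoted by $(a, l_a, r_a)_{LR}$, is said to be an $LR$-fuzzy number if its membership function is
$$\mu_A(x) = \begin{cases} L\left(\dfrac{a - x}{l_a}\right), & x \le a, \\ R\left(\dfrac{x - a}{r_a}\right), & x > a, \end{cases}$$
where $a$ is the center, and $l_a > 0$ and $r_a > 0$ are the left spread and the right spread. $L$ and $R$ are functions verifying the properties of the class of fuzzy sets such that $L(0) = R(0) = 1$ and $L(x) = R(x) = 0$, $x \notin [0, 1)$. In particular, if $L(x) = R(x) = 1 - x$ in $(a, l_a, r_a)_{LR}$, then $A$ is called a triangular fuzzy number and is denoted by $(a, l_a, r_a)_T$.
For any $\alpha$ in $[0, 1]$, the $\alpha$-level set of a fuzzy set $A$ is the crisp set $A_\alpha = \{x \in X : \mu_A(x) \ge \alpha\}$ that contains all the elements in $X$ that have membership value in $A$ greater than or equal to $\alpha$.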
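The triangular case and its level sets can be made concrete with a minimal Python sketch. It assumes the shape functions $L(t) = R(t) = 1 - t$ and strictly positive spreads; the class name `TriangularFuzzyNumber` and its field names are illustrative and do not come from the paper. The example encodes an observation such as "about 50 years old" with an assumed spread of 5 on each side.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TriangularFuzzyNumber:
    """Triangular fuzzy number given by its center and positive spreads."""
    center: float
    left: float   # left spread, assumed > 0
    right: float  # right spread, assumed > 0

    def membership(self, x: float) -> float:
        """Membership degree, assuming L(t) = R(t) = 1 - t on [0, 1]."""
        if x <= self.center:
            t = (self.center - x) / self.left
        else:
            t = (x - self.center) / self.right
        return max(0.0, 1.0 - t)

    def alpha_cut(self, alpha: float) -> Tuple[float, float]:
        """Endpoints of the alpha-level set for 0 < alpha <= 1."""
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        return (self.center - (1.0 - alpha) * self.left,
                self.center + (1.0 - alpha) * self.right)


about_50 = TriangularFuzzyNumber(center=50.0, left=5.0, right=5.0)
print(about_50.membership(47.5))  # 0.5
print(about_50.alpha_cut(0.5))    # (47.5, 52.5)
```

The level set of a triangular fuzzy number is a closed interval that shrinks linearly to the center as the level approaches 1, which is why the endpoints can be computed in closed form here.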
The F (fuzzy)-transform was introduced by Perfilieva [17] in 2001. Here, some basic concepts from [18] are introduced.
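Definition 1. Let $x_1 < \dots < x_n$ be fixed nodes within $[a, b]$, such that $x_1 = a$, $x_n = b$ and $n \ge 2$. We say that fuzzy sets $A_1, \dots, A_n$, identified with their membership functions $A_1(x), \dots, A_n(x)$ defined on $[a, b]$, constitute a fuzzy partition of $[a, b]$ if they fulfill the following conditions for $k = 1, \dots, n$:
- (1)
$A_k : [a, b] \to [0, 1]$, $A_k(x_k) = 1$;
- (2)
$A_k(x) = 0$ if $x \notin (x_{k-1}, x_{k+1})$, where for the uniformity of denotation, we put $x_0 = a$ and $x_{n+1} = b$;
- (3)
$A_k(x)$ is continuous;
- (4)
$A_k(x)$ strictly increases on $[x_{k-1}, x_k]$ ($k = 2, \dots, n$) and strictly decreases on $[x_k, x_{k+1}]$ ($k = 1, \dots, n-1$);
- (5)
$\sum_{k=1}^{n} A_k(x) = 1$ for all $x \in [a, b]$.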
The membership functions
are called symmetric basic functions. An example of fuzzy sets
with symmetric triangular membership functions on the interval
is given below:
where
is defined by
on
(
). Also,
on
(
). Here,
strictly increases on
(
) and
strictly decreases on
(
), which satisfy above conditions.
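Definition 2. Let a discrete function $f$ be given at a finite set of points $P = \{p_1, \dots, p_l\} \subseteq [a, b]$, and let $A_1, \dots, A_n$ be basic functions that form a fuzzy partition of $[a, b]$. The F-transform of the discrete function $f$ with respect to $A_1, \dots, A_n$ defines the numerical vector $F_n[f] = (F_1, \dots, F_n)$, where each $F_k$ is given by
$$F_k = \frac{\sum_{j=1}^{l} f(p_j) A_k(p_j)}{\sum_{j=1}^{l} A_k(p_j)}, \quad k = 1, \dots, n.$$
The $F_k$ are weighted mean values of $f$, where the weights are determined by the membership values. The $F_k$ are called components of the discrete F-transform.
Definition 3. Let $F_n[f] = (F_1, \dots, F_n)$ be the F-transform of $f$ with respect to $A_1, \dots, A_n$. Then the function $f_{F,n}(p_j) = \sum_{k=1}^{n} F_k A_k(p_j)$, $j = 1, \dots, l$, is called the inverse F-transform of $f$.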
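Definitions 1 to 3 can be illustrated with a short Python sketch. It assumes equidistant nodes and symmetric triangular basic functions, which is only one admissible fuzzy partition; the function names `triangular_partition`, `f_transform` and `inverse_f_transform` are hypothetical, and the sample points and values are toy data rather than data from the paper.

```python
from typing import Callable, List, Sequence

BasicFunction = Callable[[float], float]


def triangular_partition(a: float, b: float, n: int) -> List[BasicFunction]:
    """Return n symmetric triangular basic functions on [a, b] (equidistant nodes)."""
    if n < 2:
        raise ValueError("a fuzzy partition needs at least two nodes")
    h = (b - a) / (n - 1)  # distance between neighbouring nodes

    def make(node: float) -> BasicFunction:
        # Peak 1 at the node, falling linearly to 0 at the neighbouring nodes.
        return lambda x: max(0.0, 1.0 - abs(x - node) / h)

    return [make(a + k * h) for k in range(n)]


def f_transform(points: Sequence[float], values: Sequence[float],
                basics: Sequence[BasicFunction]) -> List[float]:
    """Discrete F-transform: one membership-weighted mean per basic function."""
    components = []
    for basic in basics:
        weights = [basic(p) for p in points]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("every basic function must cover at least one point")
        components.append(sum(w * v for w, v in zip(weights, values)) / total)
    return components


def inverse_f_transform(x: float, components: Sequence[float],
                        basics: Sequence[BasicFunction]) -> float:
    """Inverse F-transform: components blended by the membership degrees of x."""
    return sum(c * basic(x) for c, basic in zip(components, basics))


points = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [0.0, 1.0, 2.0, 3.0, 4.0]          # toy function f(p) = p
basics = triangular_partition(0.0, 4.0, 3)  # nodes at 0, 2 and 4
components = f_transform(points, values, basics)
print(components)
print(inverse_f_transform(2.0, components, basics))
```

Because the basic functions sum to one at every point of the interval, the inverse F-transform of a constant function returns that constant, which is a quick sanity check for any implementation.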
3. Fuzzy Regression Based on the F-Transform and Discriminant Analysis
The general fuzzy regression model proposed by Tanaka [3], which has been applied in many fields, is formulated as follows:
$$Y_i = f(X_i, A) \oplus E_i, \qquad (*)$$
where $X_i$ is the fuzzy input, $A$ is the vector of fuzzy coefficients, $f$ is the known response function, $Y_i$ is the fuzzy output and $E_i$ is the fuzzy error.
The $\alpha$-level set of the general fuzzy regression model is given as follows:
The center of the above model (*) is represented as follows:
The models for the left and right spreads of model (*) are as follows:
In particular, if the response function is linear, the fuzzy regression model is formulated as follows:
where
,
and
are fuzzy numbers.
We estimate the $\alpha$-level set of the proposed fuzzy regression model by the least squares method, which minimizes the sum of squared residuals
In fuzzy regression analysis, many researchers have used the above least squares method. However, there are two points to consider in a general fuzzy regression model. The first is that the response function for the left spread and the response function for the right spread may not be the same. Choi and Yoon [6] proposed a general fuzzy regression model whose response function for the center is exponential and whose response function for the spread is linear. The second is that the set of distinct spread or center values of the dependent variable may be much smaller than the number of samples.
In fuzzy regression analysis, if the response functions of the spread and the center do not match, it may be more efficient to construct the fuzzy regression model by estimating the spread and the center independently instead of applying the same function to both. In addition, if the spreads of the dependent and independent variables take only a few repeated values, as in Table 1 presented by D’Urso [8], it may be more efficient to use statistical methods for categorical data than statistical methods for continuous variables. In Table 1, the number of samples is 30, but the spread of the response variable takes only 4 distinct values.
In general, the spreads of the fuzzy data in a fuzzy sample vary greatly. However, the spreads of fuzzy data obtained from product quality ratings, preferences, or surveys often consist of repeated values. If the spread repeats the same few values while the center changes, it may be more efficient to estimate the fuzzy regression model by estimating the spread and the center separately.
Especially for fuzzy regression models, it may be more efficient to estimate the spread using categorical data analysis methods when the number of distinct spread values of the dependent variable is small because of repetition.
Discriminant analysis [23,24] is a technique used to predict the probability of belonging to a given category based on one or more independent variables when the dependent variable is categorical and the independent variables are interval-scaled. A categorical variable is one whose values fall into several categories. In the fuzzy regression model, if the number of distinct spread values of a given dependent variable is small because of repetition, the spread of the dependent variable can be converted to a categorical variable.
To do this, we use the F-transform. In this paper, we propose a hybrid estimation algorithm to predict the dependent variable of a fuzzy regression model whose spreads are represented by only a few numbers because of repetition.
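The discriminant step mentioned above can be sketched for the simplest setting. The Python code below is a minimal illustration of Fisher's linear discriminant for two groups and two crisp predictors, assuming equal priors so that the cut-off is the midpoint of the projected group means. The function names `fit_fisher` and `predict_group` and the toy data are hypothetical; the paper's examples involve more groups and predictors, which this sketch does not cover.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _scatter(points: Sequence[Point], m: Point) -> Tuple[float, float, float]:
    """Entries (sxx, sxy, syy) of the within-group scatter matrix."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_fisher(group0: Sequence[Point], group1: Sequence[Point]) -> Tuple[Point, float]:
    """Return the discriminant direction w and the midpoint cut-off."""
    m0, m1 = _mean(group0), _mean(group1)
    s0, s1 = _scatter(group0, m0), _scatter(group1, m1)
    sxx, sxy, syy = s0[0] + s1[0], s0[1] + s1[1], s0[2] + s1[2]
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("pooled scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S_w^{-1} (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Assumed equal priors: cut halfway between the projected group means.
    cutoff = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, cutoff


def predict_group(w: Point, cutoff: float, x: Point) -> int:
    """Assign x to group 1 if its discriminant score exceeds the cut-off."""
    return 1 if w[0] * x[0] + w[1] * x[1] > cutoff else 0


small_spread: List[Point] = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
large_spread: List[Point] = [(6.0, 5.0), (7.0, 6.0), (7.0, 4.0), (8.0, 5.0)]
w, cutoff = fit_fisher(small_spread, large_spread)
print(predict_group(w, cutoff, (2.5, 2.0)), predict_group(w, cutoff, (7.5, 5.0)))
```

In the setting of this paper, the group labels would come from the basic functions whose supports contain the observed spread, and the predictors would be the independent variables of the regression.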
The Proposed Hybrid Estimation Algorithm
- Step 1.
Estimate the center of the dependent variable.
Using the set of centers of the dependent variable, obtain the predicted value of the center of the dependent variable by minimizing the following objective function:
- Step 2.
Estimate the left spread of the dependent variable using F-transform and Fisher’s linear discriminant analysis.
- 1.
Define the universe of discourse $U$: let $S$ be the set of spreads of the dependent variable, and let $s_{\min}$ and $s_{\max}$ be the minimum and the maximum value of $S$. Then the universe of discourse is defined by $U = [s_{\min} - d_1, s_{\max} + d_2]$, where $d_1$ and $d_2$ are two proper positive real numbers. The values $d_1$ and $d_2$ are predefined by the researcher or can be treated as tuning parameters.
- 2.
Define the basic function on
and obtain F-transform: The
F-transform
is obtained by
where the function
is the basic function on the universe
.
- 3.
Classify the given data by using the Fisher’s linear discriminant analysis: Each of the data is grouped q times which is the number of overlapped basic functions of corresponding data. In
Figure 1, the number of overlapped basic functions
q = 2 in (a,b) and
q = 3 in (c). If
is included in the support of
then
is assigned to group
. Since the basic functions are overlapped,
can be assigned to more than one group. Using the assigned group and independent variables, we construct the Fisher’s linear discriminant function and predict the assigned group based on the discriminant score. The predicted group is represented by
- 4.
Predict the left spread of dependent variable using the inverse F-transform: Using the F-transform obtained by the previous step 2-2 and the inverse F-transform, the spread
is predicted. Let
be the F-transform of the left spread with respect to
. Then the predicted left spread is given by
where
represents the F-transform corresponding to the predicted group
.
- Step 3.
Predict the right spread in the same way as in Step 2.
With this hybrid algorithm, we predict the values of the dependent variable. The overall method is summarized in Algorithm 1.
Algorithm 1: Predicting the value of the dependent variable with the F-transform
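Step 1 of the algorithm above can be sketched in Python for the special case of a single crisp predictor. The sketch relies on the standard property that, when the predictor takes at least two distinct values, some LAD-optimal line passes through two of the data points, so it simply tries every such line. The function name `lad_line` and the toy data are hypothetical; the paper's examples use several predictors, for which a linear programming or quantile regression solver would be the usual choice.

```python
from itertools import combinations
from typing import Sequence, Tuple


def lad_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) minimising sum_i |y_i - intercept - slope * x_i|.

    Exhaustive search over lines through pairs of data points; suitable only
    for small samples with one predictor.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yk - intercept - slope * xk) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Toy centers: four points on the line y = 2x and one outlier at x = 5.
x_centers = [1.0, 2.0, 3.0, 4.0, 5.0]
y_centers = [2.0, 4.0, 6.0, 8.0, 30.0]
intercept, slope = lad_line(x_centers, y_centers)
print(intercept, slope)
```

On these toy data the LAD fit stays on the line through the four regular points, whereas ordinary least squares would give slope 6 and intercept -8 because the single outlier dominates the squared residuals. This is the sense in which LAD estimation is not sensitive to outliers.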
4. Numerical Examples and Comparison Studies
In this section, we present two examples to compare the performance of the proposed method with existing methods that assume the same model for the center and the spread. Chachi and Taheri estimated the fuzzy regression model by minimizing the sum of squares of the left and right endpoints of the $\alpha$-level sets of the residuals [4]. Diamond constructed the fuzzy regression model by minimizing the sum of squared residuals of the endpoints of the support and the center [7]. Two measures are used to compare the performance of the estimated fuzzy regression models. One is based on the difference between the estimated value and the observed value, and the other compares the overlapped area.
The first performance measure compares the difference between the predicted value and the observed value and is given as follows [5]:
where
and
. The more efficient model has a value of this measure closer to zero.
The second performance measure compares the overlapped area between the predicted value and the observed value and is given as follows [4]:
where
and
means the minimum value. The more efficient method has a smaller value of the difference-based measure and a larger value of the overlap-based measure.
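The overlapped area that the second measure builds on can be approximated numerically. The Python sketch below integrates the pointwise minimum of two triangular membership functions with the midpoint rule. It illustrates only this ingredient and is not the exact measure of [4]; the names `membership` and `overlap_area`, the endpoint representation of a triangular fuzzy number, and the toy values are assumptions made for illustration.

```python
from typing import Tuple

# (left endpoint, center, right endpoint), assumed strictly increasing
Triangle = Tuple[float, float, float]


def membership(x: float, t: Triangle) -> float:
    """Membership degree of x in the triangular fuzzy number t."""
    lo, c, hi = t
    if x <= lo or x >= hi:
        return 0.0
    if x <= c:
        return (x - lo) / (c - lo)
    return (hi - x) / (hi - c)


def overlap_area(observed: Triangle, predicted: Triangle, steps: int = 6000) -> float:
    """Midpoint-rule estimate of the area under min(observed, predicted)."""
    lo = min(observed[0], predicted[0])
    hi = max(observed[2], predicted[2])
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += min(membership(x, observed), membership(x, predicted))
    return total * width


observed = (0.0, 1.0, 2.0)
predicted = (1.0, 2.0, 3.0)
print(overlap_area(observed, predicted))
print(overlap_area(observed, observed))
```

A prediction that coincides with the observation recovers the full area of the observed triangle, and a prediction with disjoint support gives zero, so larger values indicate a better match.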
To show the efficiency of the proposed estimation method, we use the examples used by D’Urso [8] and Yoon et al. [14]. To compare the performance of the proposed algorithm with the existing methods, the Chachi and Taheri method [4] and the Diamond method [7], which are based on least squares estimation, are used.
Example 1. D’Urso [8] and Chachi and Taheri [4] consider a multiple fuzzy regression model. Table 1 shows the performance data for 30 quality Roman restaurants.
- 1.
Estimate the center of the dependent variable using the least absolute deviations method. The estimated center is given by the following equation:
- 2.
From
Table 1, we obtain
which is the set of left spreads of the dependent variable. Define the universe of discourse
as
, where
and
with proper constant
and
.
- 3.
Estimate the left spread of the dependent variable using the F-transform and Fisher’s linear discriminant analysis. The given data show that the spreads of the 30 samples take only four distinct values. The basic functions of
can be given as follows:
Using the defined fuzzy partition and Equation (2), the F-transform is obtained as follows:
Since the number of overlapping basic functions is two, each data point is grouped twice. The first grouping is represented by
and the Fisher linear discriminant functions which classify the given data using independent variables
and
are as follows:
where
and
are not spreads but endpoints of the triangular fuzzy set.
Using the Fisher discriminant scores obtained from the above Fisher linear discriminant functions,
is predicted as
By a similar method, the second grouping is represented by
and
is predicted as
The results are presented in
Table 2. In
Table 2, the symbol
(or
) represents the membership degree to the group
(or
) of
From
Table 2 and Equation (
3), the left spreads are predicted. For instance,
By the same method, the right spreads are also predicted, and the predicted spreads are presented in Table 3.
Table 4 shows the final values estimated by the proposed hybrid algorithm. The symbols
and
denote the left endpoint and the right endpoint of the dependent variable, respectively.
The final estimate
is obtained from
Table 4, and the performance measures
and
are given in
Table 5.
Table 5 shows that the proposed hybrid estimation algorithm has the smallest sum of residual areas. Its area of overlap between the estimated and observed values is smaller than that of the difference method, but the gap is not large. Therefore, we can say that, on balance, the proposed hybrid algorithm outperforms the existing methods.
Example 2. Yoon et al. [15] surveyed the impact of family, colleague, school, and national satisfaction on life satisfaction. Table 6 shows the data from the satisfaction study; each observation in Table 6 is represented by a symmetric triangular fuzzy number with center m and width s. The number of observations on life satisfaction in Table 6 is 106, but the set of distinct spread values has only nine elements. This means that the same values are repeated. In this case, it may be more efficient to classify the spread of the dependent variable using discriminant analysis. The fuzzy partition of
for discriminant analysis can be defined as follows:
Since the number of overlapping basic functions is two, each data point is grouped twice. The first grouping is represented by
and the Fisher linear discriminant functions which classify the given data using independent variables
and
are as follows:
where
are not spreads but endpoints of the triangular fuzzy set. The result of LAD estimation for the centers
of dependent and independent variables is as follows:
The life satisfaction estimated using the fuzzy partition, discriminant analysis, and LAD estimation is given in Table 6. The results of the performance measures for life satisfaction are presented in Table 7. The performance of the proposed hybrid estimation algorithm is compared with the Diamond method [7] and the Chachi and Taheri method [4], as in Example 1.
Table 7 shows that if the spreads repeatedly take the same values, it may be more efficient to use the F-transform and discriminant analysis.
Examples 1 and 2 show that the proposed hybrid estimation algorithm may be more efficient if the spread of a given dependent variable takes repeated values. That is, when the number of distinct spread values of the dependent variable is much smaller than the sample size, the proposed hybrid estimation algorithm is effective.
5. Conclusions
In this paper, we have observed that the response functions for the center and the spread of the dependent variable in a fuzzy regression model may not match, and we have proposed a hybrid estimation algorithm that estimates the response function for the center and the response function for the spread independently. The proposed hybrid estimation algorithm is a modified and extended version of the algorithm introduced in Jung et al. [16]. We also applied discriminant analysis for categorical data to construct the fuzzy regression model when the set of distinct spread values of the dependent variable is much smaller than the number of samples. In addition, the F-transform was used to categorize the spread of the dependent variable. We then combined the LAD estimation method for the center of the dependent variable with the F-transform and discriminant analysis for the spread of the dependent variable to estimate the value of the dependent variable.
The two examples indicate that the proposed fuzzy regression model, estimated using the F-transform, the Fisher discriminant function, and the LAD estimation method, can be more efficient than the other existing methods. This means that when the number of distinct spread values of the dependent variable is much smaller than the sample size or the number of distinct center values, the proposed hybrid estimation algorithm can provide more efficient estimation results.
In future studies, we plan to check whether the proposed hybrid algorithm is robust to the number of basic functions and the type of membership function. In addition, we will apply our algorithm to fuzzy regression models with fuzzy coefficients.