1. Introduction
Density estimation is a common technique in modern data analysis. It is typically used to analyze statistical characteristics of samples, such as skewness and multimodality, and to quantify uncertainty, and it has been widely applied in engineering, economics, medicine, geography and other fields. Density estimation methods fall into two classes: parametric and nonparametric. The parametric approach imposes strong prior assumptions, restricting the probability density function to a given parametric family of distributions and then computing the corresponding parameter estimates from the samples. Its main weakness is that a misspecified prior model may lead to wrong conclusions. Moreover, when testing the fitted model, it is common for several candidate prior models to pass the goodness-of-fit test, which greatly affects the accuracy and efficiency of data analysis. To avoid these defects, Fix and Hodges [1] first eliminated the strong assumptions of the parametric method by introducing the idea of discriminant analysis, which is also the fundamental source of the nonparametric approach; the simple histogram is an intuitive embodiment of this idea. Nonparametric methods require no prior assumptions and can recover the true distribution of the samples in a purely data-driven way. Because of their simplicity and efficiency, they have attracted wide attention. The kernel density estimator, a classical nonparametric method, was first proposed by Rosenblatt [2] and Parzen [3]. It requires only a finite linear combination of a given kernel centered at the samples, yet it enjoys an optimal convergence rate and desirable regularization properties; it is therefore the most widely studied and applied nonparametric estimator at present. However, since actual samples usually come from a bounded region, the predictions of the classical kernel density estimator near boundary points are often poor; this is the so-called boundary problem. In addition, the selection of the bandwidth is crucial: a poor choice leads to over-smoothing in high (low)-density regions and a lack of local adaptivity.
To eliminate the boundary problem of the classical kernel density estimator, Schuster [4] proposed the data reflection method, which builds a complete dataset by reflecting the data near the endpoints of the bounded interval. However, data reflection only applies when the derivative of the true density is zero at the endpoints of the interval, because it actually corrects the inconsistency of classical kernel estimation on bounded intervals. Compared with this restriction on the boundary derivative of the true density, the boundary kernel method, first proposed by Gasser and Muller [5] and further extended since, is a more general solution: it can adapt to a density of any shape. A disadvantage of the boundary kernel method is that the estimates it produces may be negative. Jones and Foster [6] proposed a non-negative boundary kernel estimator that removes this defect by combining the advantages of the local linear estimator and the re-normalized kernel estimator. Another problem is that the boundary kernel method increases the variance of the estimator; to mitigate this, several methods have been proposed, such as the transformation method, the pseudo-data method and the generalized reflection method. In particular, Chen [7,8] attempted to provide a simpler way to eliminate the boundary problem without sacrificing the nonnegativity of the density estimate. He proposed a Beta kernel estimator on the support $[0,1]$ and a Gamma kernel estimator on the support $[0,\infty)$, and showed that they have smaller variances. These estimators have attracted wide attention. Markovich [9] discussed the good properties of the Gamma kernel estimator and extended it to the estimation of multivariate density functions and their partial derivatives. Based on the Gamma estimator, Lin [10] discussed the relationship between temperature and the number of potential novel coronavirus infections in China. However, Zhang [11,12] showed that the Beta and Gamma kernel estimators have serious boundary problems and perform worse than the well-known boundary kernel method when the true density does not satisfy the shoulder condition (that is, that the first derivative of the density at the boundary is 0). Cherfaoui et al. [13] further discussed the properties of Gamma kernels when the density does not satisfy the shoulder condition. The boundary problem of kernel density estimation therefore still leaves much room for improvement. Moreover, the methods above for eliminating the boundary problem generally involve complicated kernel constructions or restrictive applicability conditions, and they are inefficient for large samples.
Compared with computing the kernel density estimator directly, the binned kernel density estimator saves a great deal of computation: the samples are pre-grouped on an equispaced grid and an appropriate kernel density estimator is applied to the pre-binned data. Moreover, researchers have shown that a large number of bins does not affect the mean integrated squared error of the kernel density estimator [14]. Hall [15] studied the accuracy of binned kernel density estimation under general binning rules and, by analyzing how the binning rule and the smoothness of the kernel affect accuracy, provided the minimum grid size required to attain a given accuracy level. Luo [16] improved the accuracy of the kernel density estimator with a resampling strategy based on a multi-point grid. Harel [17] discussed the asymptotic normality of the binned kernel density estimator for non-stationary random variables. Peherstorfer [18] proposed a density estimator based on sparse grids, which can be viewed as an improved binning rule: it uses a sparse grid instead of a full grid to reduce the number of bins. Although the binned kernel density estimator improves the efficiency of processing large samples through the binning strategy, it still inherits the boundary problem of the kernel density estimator. There are also other methodologies for applying kernel density estimation to large datasets. Cheng [19] proposed a fast multivariate kernel density estimator for massive datasets that works in two steps: a kernel density estimate on sub-intervals, followed by function approximation on pseudo data via the Nadaraya–Watson estimator. However, the research of Gao [20] demonstrated that estimators of this generalized rational form have a low convergence rate. Moreover, computing the pseudo data with a kernel density estimator costs more than the binning rule above and does not address the boundary problem. Zheng [21] focused on choosing samples from large data to produce a proxy for the true data with a prescribed accuracy, which is more complex than direct binning, and that work pays little attention to the kernel density estimator itself. The binning approach is thus simple and transparent. Recently, we proposed a kernel density estimator based on quasi-interpolation and proved its theoretical statistical properties, but that work does not provide a solution for the over-smoothing phenomenon [22].
The other problem of kernel density estimators, the over-smoothing phenomenon, is caused by improper bandwidth selection, and different authors have adopted different remedies. The most classical bandwidth selector is the rule of thumb, which computes the optimal bandwidth from the standard deviation and the dimension of the samples. Because of its simplicity, it is the standard tool in most applications of kernel density estimators. However, actual samples are usually random and unevenly distributed, while the rule-of-thumb bandwidth is fixed; it provides an optimality criterion only in an average sense and does very little to relieve over-smoothing. The adaptive bandwidth approach, which can be viewed as a correction to the rule of thumb, consists of two steps. First, a pilot estimate is computed with a fixed bandwidth, and the ratio of the pointwise estimate at each sample to the geometric mean of these estimates is formed. Then, according to this ratio, a pointwise correction factor is determined to modify the bandwidth, and the final kernel density estimator is computed with the modified bandwidths. The adaptive bandwidth method improves on the accuracy of a fixed-bandwidth estimator, but it is difficult to apply to large samples, since every sample affects the correction factors and the computational cost is high. Barreiro-Ures [23] proposed a bandwidth selection method for large samples based on subagging, which can be viewed as an improvement on the cross-validation method; however, the resulting bandwidth is still global, so it is difficult to capture local changes in the samples, and the boundary problem is not considered.
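As an illustration of this two-step scheme, the following is a minimal Python (NumPy) sketch of an adaptive-bandwidth kernel density estimate. The Gaussian kernel, the rule-of-thumb pilot bandwidth and the Abramson-type square-root correction factor $\lambda_i=(\hat f(X_i)/g)^{-1/2}$ (with $g$ the geometric mean of the pilot values) are illustrative assumptions, not prescriptions from the works cited above.

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def adaptive_kde(samples, x_grid):
    n = samples.size
    # Step 1: pilot estimate with a fixed rule-of-thumb bandwidth.
    h0 = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    pilot = gauss_kernel((samples[:, None] - samples[None, :]) / h0).mean(axis=1) / h0
    # Geometric mean of the pilot values and Abramson-type correction factors.
    g = np.exp(np.log(pilot).mean())
    lam = (pilot / g) ** (-0.5)          # smaller bandwidth where density is high
    # Step 2: final estimate with pointwise-corrected bandwidths h0 * lam_i.
    h_i = h0 * lam
    u = (x_grid[:, None] - samples[None, :]) / h_i[None, :]
    return (gauss_kernel(u) / h_i[None, :]).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=500)
xs = np.linspace(-4, 4, 201)
f_hat = adaptive_kde(data, xs)
```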
In conclusion, the classical kernel density estimator is a convenient tool that is widely used in many branches of science and technology. However, most studies do not consider the limitations of the estimator model itself, and these limitations and deficiencies need further attention. In particular, previous work on kernel density estimators has not jointly addressed the boundary problem, the over-smoothing problem and computational efficiency for large samples. Therefore, in view of the shortcomings of the classical kernel density estimator, this paper proposes a new modeling process for multivariate adaptive binned kernel density estimators based on the quadtree algorithm and quasi-interpolation, which significantly improves the prediction accuracy of the estimated density function. The work of this paper is summarized as follows:
(1) Aiming at the boundary problem of the classical kernel density estimator defined over a bounded region, a new set of asymmetric kernel functions is introduced based on quasi-interpolation theory to avoid the boundary problem.
(2) To improve the computational efficiency of the classical kernel density estimator for large samples, the idea of binned kernel density estimation is introduced. An explicit expression for the coefficients of the density estimator under the binning rule is derived, which greatly reduces the computation and improves the efficiency of the model.
(3) To alleviate the over-smoothing phenomenon of classical kernel density estimators, this paper proposes an adaptive strategy based on the segmentation idea of the quadtree algorithm. We set segmentation thresholds on sample size, bin width and kurtosis to compute the number and width of the bins adaptively. This effectively avoids over-smoothing in high (low)-density areas, increases the local adaptability of the model to the samples and further improves its accuracy.
(4) We extend the univariate model based on the quadtree algorithm to the multivariate case. Monte Carlo simulations show that the models constructed in this paper perform well with respect to the boundary problem, large samples and the over-smoothing phenomenon, and are significantly better than the kernel density estimation methods in widespread use.
2. Univariate Adaptive Quasi-Interpolation Density Estimator
2.1. Univariate Quasi-Interpolation Density Estimator
Let $X_1, X_2, \ldots, X_n$ be a set of random samples drawn from the probability distribution of an unknown probability density function $f(x)$. The classical nonparametric kernel density estimator is defined as
$$\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_i}{h}\right),\qquad(1)$$
where $h$ denotes the bandwidth and $K(\cdot)$ denotes the kernel function or weight function. Some common symmetric kernel functions are shown in Table 1.
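As a quick illustration of Equation (1), here is a minimal Python (NumPy) sketch; the Gaussian kernel and the bandwidth value are assumptions for the example only.

```python
import numpy as np

def kde(x_grid, samples, h):
    """Classical kernel density estimator of Equation (1) with a Gaussian kernel."""
    u = (x_grid[:, None] - samples[None, :]) / h          # (x - X_i) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)        # K(u)
    return k.mean(axis=1) / h                             # average over samples, scale by 1/h

rng = np.random.default_rng(1)
samples = rng.normal(size=1000)
xs = np.linspace(-4, 4, 200)
density = kde(xs, samples, h=0.3)
```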
According to Equation (1), the classical kernel density estimator requires the distance between the prediction point and every sample point to be computed in order to assign the kernel weights, so the computation grows rapidly with the sample size. Note, however, that a prediction point is mainly influenced by the samples within a limited bandwidth neighborhood, while samples outside it have very little influence; the pointwise processing of the many samples outside the bandwidth neighborhood greatly reduces computational efficiency. Therefore, the binned kernel density estimator was proposed:
$$\hat{f}_b(x)=\frac{1}{nh}\sum_{j=1}^{m}n_j\,K\left(\frac{x-t_j}{h}\right),\qquad(2)$$
where $t_j$ denotes the center of the $j$-th bin and $n_j$ denotes the number of samples dropped in the $j$-th bin, satisfying $\sum_{j=1}^{m}n_j=n$. For clarity, we remind readers that $X_i$ denotes a random sample and $t_j$ represents the center of a bin. According to Equation (2), the binned kernel density estimator replaces the pointwise calculation of the classical kernel density estimator with a calculation over bin centers. Its essential idea is to treat the samples in a small region as a whole, with the central point of each region acting as a representative sample, so that the bandwidth difference between each individual sample and the central point of its region can be ignored. In this way, unnecessary detailed calculation in the classical kernel density estimator is avoided and the computational efficiency is improved while accuracy is preserved.
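A minimal Python sketch of Equation (2) follows, again assuming a Gaussian kernel; the equispaced grid of `m` bins over the sample range is an assumption of the example.

```python
import numpy as np

def binned_kde(x_grid, samples, h, m=200):
    """Binned kernel density estimator of Equation (2): bin first, then smooth."""
    a, b = samples.min(), samples.max()
    counts, edges = np.histogram(samples, bins=m, range=(a, b))  # n_j per bin
    centers = 0.5 * (edges[:-1] + edges[1:])                     # t_j
    u = (x_grid[:, None] - centers[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Weighted sum over the m bin centers instead of all n samples.
    return (k * counts[None, :]).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(2)
samples = rng.normal(size=100_000)
xs = np.linspace(-4, 4, 200)
density = binned_kde(xs, samples, h=0.1)
```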
However, since actual samples are usually drawn from a bounded domain, the above two classes of kernel density estimator face the same problem: the boundary problem occurs when a fixed symmetric kernel is used to estimate a true density defined over a bounded domain. The main reason is that, when smoothing near a boundary point, a fixed symmetric kernel allots weight outside the support of the density. A natural strategy is therefore to use a kernel that allots no weight outside the support. Hence, within the framework of numerical approximation, combining quasi-interpolation theory with the binning idea to improve the above models, this paper proposes a new binned quasi-interpolation density estimator, which not only improves the computational efficiency for large samples but also eliminates the boundary problem.
Let us start with some definitions and lemmas:
Definition 1. Let $[a,b]$ be a bounded interval, let $f$ be known, let $a=x_0<x_1<\cdots<x_N=b$ be a set of scattered centers on the interval, and let $\{f(x_j)\}_{j=0}^{N}$ be the discrete function values corresponding to the scattered centers. Let $c>0$ be the positive shape parameter and let
$$\phi_j(x)=\sqrt{(x-x_j)^2+c^2}$$
be the MQ function first constructed by Hardy [24]; then we have a quasi-interpolation ($\mathcal{L}_D$ operator)
$$(\mathcal{L}_D f)(x)=\sum_{j=0}^{N}f(x_j)\,\psi_j(x),$$
where $\psi_j(x)$ are the asymmetric MQ kernels
$$\psi_0(x)=\frac{1}{2}+\frac{\phi_1(x)-(x-x_0)}{2(x_1-x_0)},$$
$$\psi_1(x)=\frac{\phi_2(x)-\phi_1(x)}{2(x_2-x_1)}-\frac{\phi_1(x)-(x-x_0)}{2(x_1-x_0)},$$
$$\psi_j(x)=\frac{\phi_{j+1}(x)-\phi_j(x)}{2(x_{j+1}-x_j)}-\frac{\phi_j(x)-\phi_{j-1}(x)}{2(x_j-x_{j-1})},\quad 2\le j\le N-2,$$
$$\psi_{N-1}(x)=\frac{(x_N-x)-\phi_{N-1}(x)}{2(x_N-x_{N-1})}-\frac{\phi_{N-1}(x)-\phi_{N-2}(x)}{2(x_{N-1}-x_{N-2})},$$
$$\psi_N(x)=\frac{1}{2}+\frac{\phi_{N-1}(x)-(x_N-x)}{2(x_N-x_{N-1})}.\qquad(3)$$
Here, these kernels satisfy $\psi_j(x)\ge 0$ and $\sum_{j=0}^{N}\psi_j(x)=1$.
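To make the operator concrete, here is a short Python sketch of the asymmetric MQ kernels of Equation (3) and the $\mathcal{L}_D$ operator, a direct transcription of the definition as reconstructed above, with an arbitrary test function and shape parameter chosen for illustration.

```python
import numpy as np

def mq_kernels(x, nodes, c):
    """Asymmetric MQ kernels psi_j(x) of Equation (3) on scattered nodes."""
    phi = np.sqrt((x[:, None] - nodes[None, :])**2 + c**2)   # phi_j(x)
    N = nodes.size - 1
    d = np.diff(nodes)                                       # x_{j+1} - x_j
    slope = (phi[:, 1:] - phi[:, :-1]) / (2 * d)             # (phi_{j+1} - phi_j) / (2 d_j)
    left = (phi[:, 1] - (x - nodes[0])) / (2 * d[0])
    right = ((nodes[-1] - x) - phi[:, -2]) / (2 * d[-1])
    psi = np.empty((x.size, N + 1))
    psi[:, 0] = 0.5 + left
    psi[:, 1] = slope[:, 1] - left
    for j in range(2, N - 1):
        psi[:, j] = slope[:, j] - slope[:, j - 1]
    psi[:, N - 1] = right - slope[:, N - 2]
    psi[:, N] = 0.5 - right
    return psi

nodes = np.linspace(0.0, 1.0, 21)
x = np.linspace(0.0, 1.0, 101)
psi = mq_kernels(x, nodes, c=0.05)
f_quasi = psi @ np.sin(2 * np.pi * nodes)   # (L_D f)(x) for f(t) = sin(2 pi t)
print(np.allclose(psi.sum(axis=1), 1.0))    # partition-of-unity check
```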
In addition, we obtain the following error estimate, which can be found in Wu and Schaback [25].
Lemma 1. For any function $f\in C^2[a,b]$, let $h=\max_{0\le j\le N-1}(x_{j+1}-x_j)$, and let $c$ be a positive shape parameter. There exist constants $k_1,k_2>0$, independent of $h$ and $c$, such that the following inequality
$$\|f-\mathcal{L}_D f\|_{\infty}\le k_1 h^2+k_2 c^2|\log c|$$
holds. According to Lemma 1, for any shape parameter $c$ satisfying $c=\mathcal{O}(h)$, the convergence rate $\mathcal{O}(h^2|\log h|)$ over the whole bounded interval can be provided by the quasi-interpolation $\mathcal{L}_D$. Furthermore, the research of Ling [26] shows that the multivariate $\mathcal{L}_D$ operator built by the tensor product technique (dimension splitting) provides the same convergence rate as the univariate case. Inspired by the convergence characteristics of quasi-interpolation and the idea of the binned kernel density estimator, we construct a univariate adaptive quasi-interpolation density estimator based on the quadtree algorithm, which consists of three steps.
Suppose that $X$ is a random variable and $X_1,X_2,\ldots,X_n$ are the $n$ independent samples of the random variable $X$, with an unknown density function $f$ on the bounded interval $[a,b]$. The first step is to divide the interval $[a,b]$ into $N$ bins $I_j=[x_j,x_{j+1})$, $j=0,1,\ldots,N-1$, with nodes $a=x_0<x_1<\cdots<x_N=b$. Let $n_j$ denote the number of samples $X_i$ dropping into the corresponding bin $I_j$. In the second step, we construct a new univariate binned density estimator as follows:
$$\hat{f}(x)=\sum_{j=0}^{N}\alpha_j\,\psi_j(x).\qquad(4)$$
Here, $\psi_j(x)$ denote the asymmetric MQ kernels defined by Equation (3), and the coefficients $\alpha_j$ are defined as
$$\alpha_0=\frac{n_0}{n(x_1-x_0)},\qquad \alpha_j=\frac{1}{2}\left(\frac{n_{j-1}}{n(x_j-x_{j-1})}+\frac{n_j}{n(x_{j+1}-x_j)}\right),\ 1\le j\le N-1,\qquad \alpha_N=\frac{n_{N-1}}{n(x_N-x_{N-1})}.\qquad(5)$$
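A compact Python sketch of Equations (4) and (5) follows. Since the coefficient formula (5) is reconstructed above as a frequency average over adjacent bins, this sketch should be read as illustrative under that assumption; it reuses the `mq_kernels` routine from the sketch after Definition 1.

```python
import numpy as np
# uses mq_kernels() from the sketch after Definition 1

def binned_quasi_density(x, samples, nodes, c):
    """Univariate binned quasi-interpolation density estimator, Equations (4)-(5)."""
    n = samples.size
    counts, _ = np.histogram(samples, bins=nodes)        # n_j for bins I_j = [x_j, x_{j+1})
    heights = counts / (n * np.diff(nodes))              # n_j / (n (x_{j+1} - x_j))
    # Coefficients of Equation (5): averages of adjacent bin heights (assumed form).
    alpha = np.empty(nodes.size)
    alpha[0], alpha[-1] = heights[0], heights[-1]
    alpha[1:-1] = 0.5 * (heights[:-1] + heights[1:])
    return mq_kernels(x, nodes, c) @ alpha               # Equation (4)

rng = np.random.default_rng(3)
samples = rng.beta(2.0, 5.0, size=50_000)                # density on the bounded interval [0, 1]
nodes = np.linspace(0.0, 1.0, 41)
xs = np.linspace(0.0, 1.0, 201)
density = binned_quasi_density(xs, samples, nodes, c=0.02)
```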
According to Equation (4) and Lemma 1, the introduction of asymmetric MQ kernels avoids the boundary problem caused by weight being allotted outside the support when a traditional kernel smooths near the boundary points. Moreover, Equation (5) shows that $n_j/n$ represents the frequency of samples falling into the corresponding bin $I_j$. Through the linear combination of the frequencies of adjacent bins, an explicit expression for the coefficients of the estimator under the binning rule is given, which effectively improves the computational efficiency of the model. Thirdly, the over-smoothing phenomenon of the kernel density estimator is considered. In the above two steps, we built a univariate binned quasi-interpolation density estimator: given the samples and the interval, the interval is divided into a certain number of bins, and the estimated density function is then computed from the endpoint positions of the bins and the numbers of samples in them. If the number of bins is too small, the predicted result is over-smoothed and differs greatly from the actual scenario; if the number of bins is too large, the calculation efficiency drops sharply. How to determine the number and width of the bins is therefore key to both accuracy and efficiency. The most common method is the rule of thumb, which adopts a fixed bandwidth calculated as follows:
$$h=\left(\frac{4}{d+2}\right)^{\frac{1}{d+4}}n^{-\frac{1}{d+4}}\sigma.\qquad(6)$$
Here, $d$ denotes the dimension and $\sigma$ denotes the standard deviation of the samples. In particular, to maintain notational clarity, we remind readers that $d$ denotes the dimension, while the subscript $D$ in $\mathcal{L}_D$ is a mark of the operator. The number of bins is calculated as the ceiling $N=\lceil (b-a)/h\rceil$. This method uses an equal bandwidth; similar equal-bandwidth methods include the unbiased cross-validation method, the plug-in method and so on. However, because actual samples are strongly random and unevenly distributed, equal-bandwidth methods generally describe the details of high-density areas insufficiently, which causes the over-smoothing phenomenon. It is therefore desirable for the bandwidth to adjust adaptively to the local density of the samples: it should be smaller in high-density areas, to enhance local characterization and improve accuracy, and larger in flat areas, to avoid excessive calculation and improve computational efficiency. A common adaptive method determines the number of bins by the rule of thumb and computes estimates at the bin centers; the ratio between each estimate and the geometric mean of all estimates is then used as a bandwidth correction factor, so that a smaller bandwidth is taken in dense areas and a larger one in sparse areas. This adaptive method is simple and easy to implement, but it has three disadvantages. First, it is based on the rule-of-thumb estimate and the adaptive process does not change the number of bins, so it is essentially only an optimal reallocation of bandwidth. Second, the degree of adaptive refinement is insufficient: the bandwidth correction factor is determined too coarsely, is susceptible to extreme values, and makes it difficult to distinguish sharp peaks from wide ones. Third, the adaptive effect for multi-peak distributions is poor. In addition, the density near the boundary is usually small, and enlarging the bins there easily aggravates the boundary problem. Therefore, this paper proposes a new adaptive binning method.
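For reference, a minimal sketch of the rule-of-thumb bandwidth of Equation (6) and the implied number of bins, assuming a univariate sample whose range plays the role of $[a,b]$:

```python
import math
import numpy as np

def thumb_rule(samples, d=1):
    """Rule-of-thumb bandwidth of Equation (6) and the implied number of bins."""
    n = samples.shape[0]
    sigma = samples.std(ddof=1)
    h = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma
    a, b = samples.min(), samples.max()
    return h, math.ceil((b - a) / h)        # bandwidth h and bin count N

rng = np.random.default_rng(4)
h, N = thumb_rule(rng.normal(size=10_000))
```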
2.2. Adaptive Binned Method Based on Quadtree Algorithm
The quadtree algorithm, a spatial-partition indexing technique, is widely used in image processing [27]. Its key idea is the iterated segmentation of the data space, where the number of iterations depends on the number of samples in each bin and on a bin-width threshold. The density of the samples is thus characterized by the number and width of the bins: areas with dense samples undergo more iterated segmentation and areas with sparse samples undergo less. Following this segmentation idea, we can adaptively adjust the bin number and bin width in the quasi-interpolation density estimator in a data-driven way. A high-density area is divided into more bins with smaller widths, which capture the distribution details of the area more keenly, while a flat area is divided into fewer bins to save computation, so as to achieve a reasonable distribution of bins and improve the accuracy of the model. The adaptive binning method based on the quadtree algorithm is shown in Figure 1.
First of all, the sample space is divided into four bins, and the number of samples in each bin and the bin widths are recorded. Secondly, we set thresholds on the sample number and the bin width. The sample-number threshold captures the distribution details of high-density areas with more bins and improves computing efficiency in flat areas with fewer bins, which both relieves the over-smoothing problem and keeps the computation efficient. The bin-width threshold guarantees the segmentation level of the whole domain and avoids having too few bins, which would lead to large estimation errors or the boundary problem; following the rule of thumb, we set the bin-width threshold to the bandwidth $h$ of Equation (6). In addition, we set a kurtosis threshold to identify the peak structure of the samples and improve accuracy. Finally, the number of samples and the bin width of each bin are compared with the sample-number, bin-width and kurtosis thresholds, and the segmentation terminates when all of these conditions are met.
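The following Python sketch illustrates this recursive splitting in one dimension. The concrete threshold values, the stopping rule, the use of sample kurtosis via `scipy.stats.kurtosis` and the binary split per level are illustrative assumptions; the paper's scheme starts from four bins and applies the analogous tests.

```python
import numpy as np
from scipy.stats import kurtosis

def adaptive_bins(samples, lo, hi, n_thr, w_thr, k_thr=1.0):
    """Recursively split [lo, hi) until each bin meets the sample-number,
    bin-width and kurtosis thresholds (illustrative splitting rule)."""
    inside = samples[(samples >= lo) & (samples < hi)]
    width = hi - lo
    peaked = inside.size > 8 and abs(kurtosis(inside)) > k_thr   # sharp-peak flag
    # Stop when the bin is sparse enough and not peaked, or already narrower than w_thr.
    if width <= w_thr or (inside.size <= n_thr and not peaked):
        return [lo, hi]
    mid = 0.5 * (lo + hi)
    left = adaptive_bins(samples, lo, mid, n_thr, w_thr, k_thr)
    right = adaptive_bins(samples, mid, hi, n_thr, w_thr, k_thr)
    return left[:-1] + right                  # merge the shared edge

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0.3, 0.02, 5000), rng.uniform(0, 1, 5000)])
edges = np.array(adaptive_bins(data, 0.0, 1.0, n_thr=500, w_thr=0.01))
# Dense bins cluster around the sharp peak at 0.3; flat regions keep wide bins.
```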
3. Multivariate Adaptive Binned Quasi-Interpolation Density Estimator
Based on the idea of the above univariate adaptive binned quasi-interpolation density estimator, we extend it to the multivariate model. Following the above process, we first construct the multivariate binned density estimator. The classical multivariate kernel density estimator and the multivariate binned density estimator are extended from the univariate model via the tensor product technique. They are defined as follows:
$$\hat{f}(\mathbf{x})=\frac{1}{n\,h_1\cdots h_d}\sum_{i=1}^{n}\prod_{k=1}^{d}K\left(\frac{x^{(k)}-X_i^{(k)}}{h_k}\right),\qquad(7)$$
where $\mathbf{x}=(x^{(1)},\ldots,x^{(d)})$, $\mathbf{X}_i=(X_i^{(1)},\ldots,X_i^{(d)})$ and $h_k$ denotes the bandwidth in the $k$-th dimension, and
$$\hat{f}_b(\mathbf{x})=\frac{1}{n\,h_1\cdots h_d}\sum_{j}n_j\prod_{k=1}^{d}K\left(\frac{x^{(k)}-t_j^{(k)}}{h_k}\right),\qquad(8)$$
where $\mathbf{t}_j=(t_j^{(1)},\ldots,t_j^{(d)})$ denotes the center of the $j$-th bin and $n_j$ denotes the number of samples dropped in the $j$-th bin.
Based on the above univariate binned quasi-interpolation density estimator, we also extend it to the multivariate binned quasi-interpolation density estimator via the tensor product technique. Let $\mathbf{X}$ be a $d$-dimension random variable with an unknown density function $f$ defined on a bounded hyperrectangle $\Omega=[a_1,b_1]\times\cdots\times[a_d,b_d]$. Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be the $n$ independent samples of the random variable $\mathbf{X}$. Inspired by the idea of the histogram, $\Omega$ is divided into $N_1\times\cdots\times N_d$ sub-intervals $I_{j_1\cdots j_d}=[x_{j_1}^{(1)},x_{j_1+1}^{(1)})\times\cdots\times[x_{j_d}^{(d)},x_{j_d+1}^{(d)})$, $j_k=0,1,\ldots,N_k-1$, $k=1,\ldots,d$. Here, $N_1,\ldots,N_d$ are positive integers and $a_k=x_0^{(k)}<x_1^{(k)}<\cdots<x_{N_k}^{(k)}=b_k$. Then, the frequency of samples dropping into each bin $I_{j_1\cdots j_d}$ can be calculated as $n_{j_1\cdots j_d}/n$, where $n_{j_1\cdots j_d}$ denotes the number of samples in that bin. A multivariate (bivariate) quasi-interpolation density estimator via the tensor product technique is as follows:
$$\hat{f}(x^{(1)},x^{(2)})=\sum_{j_1=0}^{N_1}\sum_{j_2=0}^{N_2}\alpha_{j_1 j_2}\,\psi_{j_1}(x^{(1)})\,\psi_{j_2}(x^{(2)}).\qquad(9)$$
For $1\le j_1\le N_1-1$ and $1\le j_2\le N_2-1$, the coefficient $\alpha_{j_1 j_2}$ is
$$\alpha_{j_1 j_2}=\frac{1}{4}\sum_{s=j_1-1}^{j_1}\sum_{t=j_2-1}^{j_2}\frac{n_{st}}{n\,\Delta_s^{(1)}\Delta_t^{(2)}},\qquad(10)$$
where $\Delta_s^{(1)}=x_{s+1}^{(1)}-x_s^{(1)}$ and $\Delta_t^{(2)}=x_{t+1}^{(2)}-x_t^{(2)}$ denote the bin widths. For $j_1\in\{0,N_1\}$ or $j_2\in\{0,N_2\}$, the coefficient $\alpha_{j_1 j_2}$ is
$$\alpha_{j_1 j_2}=\frac{1}{|S_{j_1 j_2}|}\sum_{(s,t)\in S_{j_1 j_2}}\frac{n_{st}}{n\,\Delta_s^{(1)}\Delta_t^{(2)}},\qquad(11)$$
where $S_{j_1 j_2}$ denotes the set of bins adjacent to the node $(x_{j_1}^{(1)},x_{j_2}^{(2)})$ that lie inside $\Omega$.
In Equation (9), for $k=1,2$, $\psi_{j_k}(x^{(k)})$ are the asymmetric MQ kernels defined by Equation (3) on the nodes of the $k$-th dimension, and $N_1$ and $N_2$ are the numbers of bins in the two dimensions. To avoid the over-smoothing phenomenon, we use the advantage of the tensor product to transform the multivariate adaptive binning problem into univariate problems, and the adaptive process is shown in Figure 2.
First, for each dimension we divide the domain into two bins and record, per dimension, the number of samples and the bin width in each bin. Secondly, these are compared with the sample-number, bin-width and kurtosis thresholds to drive the iterative segmentation. Finally, the bins of the two dimensions are spanned into two-dimensional bins via the tensor product technique, and the number of samples falling in each two-dimensional bin is recorded.
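A short Python sketch of this per-dimension splitting followed by tensor-product counting, reusing the illustrative `adaptive_bins` routine from the sketch in Section 2.2 (thresholds again assumed for the example):

```python
import numpy as np
# uses adaptive_bins() from the sketch in Section 2.2

rng = np.random.default_rng(6)
xy = np.column_stack([rng.beta(2, 5, 20_000), rng.beta(5, 2, 20_000)])

# Adaptive 1-D edges for each dimension separately.
edges_x = np.array(adaptive_bins(xy[:, 0], 0.0, 1.0, n_thr=1000, w_thr=0.02))
edges_y = np.array(adaptive_bins(xy[:, 1], 0.0, 1.0, n_thr=1000, w_thr=0.02))

# Tensor product: counts n_{j1 j2} on the spanned two-dimensional bins.
counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges_x, edges_y])
```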