1. Introduction
Many variables of interest have natural boundaries. The propensity score is the probability of an individual receiving treatment and ranges between zero and one. The variant allele frequency (VAF) is an essential measure of genetic variation, also ranging from zero to one. The visual analog scale (VAS) score for pain ranges from 0 to 100 mm. When a researcher is interested in the shape of the density function of a bounded variable, conventional kernel methods, commonly used in statistical analysis, may not accurately capture the true shape or slope of the unknown density function, particularly near the boundary.
In semiparametric statistical problems, a nonparametrically estimated density or function can be used as an input for the second-stage estimation or testing procedure of the primary parameter of interest (see [1,2]). Many estimators that achieve semiparametric efficiency bounds have been obtained using this approach. When a nonparametric estimate serves as an intermediate input of statistical inference, the natural boundary of the variable is crucial because ignoring it can introduce significant boundary biases.
Several methods have been proposed for solving the density estimation problem within a bounded interval. These methods include reflection [3], pseudo-data generation [4], boundary kernels [5,6], and data transformation [7,8]. Among these, only the reflection method provides an exact kernel function when applied to a one-sided bounded domain problem. The other methods offer asymptotic solutions; the resulting kernel density estimators satisfy the boundary conditions only asymptotically.
When the support of the variable of interest is truly bounded (e.g., a two-sided interval), the reflection method is no longer exact. In the literature, it is often suggested that methods designed for one-sided domains can be extended to domains with two-sided boundaries by applying the method separately to each boundary. However, this suggestion is only justifiable in an asymptotic sense and does not yield an exact kernel function. This paper presents a systematic method for constructing exact kernel functions on a compact support with boundary conditions. The proposed method accommodates the boundary conditions most likely to arise in practice. Several examples designed for different boundary conditions are discussed.
Multi-dimensional extensions are essential in function and density estimation problems. Existing methods have focused on the one-dimensional case, that is, the function or density estimation problem on the real line. This paper presents a multivariate kernel density estimator designed for a compact support with boundary conditions. This is a new and significant contribution to the literature.
This study builds upon the method developed by [9,10], who proposed a kernel density estimator as a solution to the heat (or diffusion) equation. Ref. [10] examined the relationship between kernel- and penalty-based methods for density estimation from this perspective. Ref. [9] derived the kernel function that satisfies a boundary condition at the endpoints. In comparison with these studies, the present study extends the method’s scope to multivariate density estimation problems. Both [9,10] focused solely on one-dimensional density estimation problems. Furthermore, in contrast to [10], we consider the problem of finite support with boundary conditions, and in contrast to [9], we allow for general boundary conditions beyond the zero-derivative conditions at the boundary.
The remainder of this paper is organized as follows. Section 2 and Section 3 describe the procedure for deriving the exact kernel function on a compact support. Section 4 provides examples of kernel functions designed for different boundary conditions. Section 5 compares the proposed method with existing approaches, particularly the reflection method. Section 6 presents a multivariate extension. Section 7 assesses the finite sample performance of the estimators by simulation. Section 8 presents a real data example, and Section 9 concludes the paper. The proofs of the main results can be found in Appendix A.
2. Overview: Kernel Density Estimation and the Heat Equation
Kernel density estimation is a fundamental tool in nonparametric and semiparametric statistical methods. Let
be observations drawn from an unknown distribution
F with density
f. The following Gaussian kernel density estimator is a widely used estimator of
f:
This kernel density estimator can be regarded as a solution to the heat (or diffusion) equation. This interpretation of kernel smoothing was used in [9,10] and is well established in the imaging literature. Recent studies that employ this relationship include [11,12]. In this study, we apply the same perspective to density estimation problems with nontrivial boundary conditions. By modifying the boundary conditions of the heat equation, we derive various forms of exact kernel functions for density estimation on a compact support. Subsequently, we extend this approach to address multivariate density estimation problems.
We now connect the kernel smoothing method to the heat equation, focusing on a one-dimensional problem. The empirical distribution function is obtained by assigning equal mass
to each data point
:
Its derivative, known as the empirical density, is a linear combination of
n point masses, each positioned at one of the data points.
where
is the Dirac delta function. Let a kernel function
be a well-defined probability density function, and define
. The bandwidth
h serves as a scale parameter for the probability density
. The kernel density estimator is defined as the convolution of the kernel and empirical density function:
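As a concrete sketch (in Python; the function name and data are illustrative), this convolution is simply an average of rescaled kernels centered at the observations:

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at the points x: the convolution
    of the scaled kernel with the empirical density, i.e. an average of
    kernels centered at the observations."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - np.asarray(data)[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

data = np.array([0.1, 0.4, 0.45, 0.8])
grid = np.linspace(-2.0, 3.0, 1001)
f = gaussian_kde(grid, data, h=0.2)
```

Because each kernel integrates to one, the resulting estimate also integrates to one.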
Consider a rod with an initial temperature described by the function
. Let
denote the temperature at a point
x and time
t. If the rod is sufficiently long, the influence of the boundary on the temperature distribution can be neglected; therefore, we can model the phenomenon over the entire line
. Let
and
be the first and second derivatives with respect to the first argument, and
be the first derivative with respect to the second argument. Consequently, we determine the temperature
using the following equations:
The heat or diffusion Equation (3) describes the propagation of heat over time
t and location
x. Equation (4) specifies the initial temperature at time
, while Equation (5) represents the boundary condition. Proposition 1 demonstrates that the solution to the heat equation is
Suppose the data provide the initial condition
. This implies that the empirical density function
serves as the initial condition. Then, the solution takes the following form:
By setting
, we can identify the solution (7) to the heat equation as the kernel density estimate in Equation (1). The role of time
t (more precisely
) as a smoothing parameter
h is intuitive when considering the heat equation, which describes how heat or a diffusing substance, initially concentrated at a single point, spreads over time.
Figure 1 illustrates this phenomenon. In the initial stage at
, all the heat is concentrated at a single point. As time passes,
, heat spreads to the neighborhood of the initial point and is eventually felt even in remote areas.
Proposition 1 gives the solution to the heat equation. The proof can be found in standard textbooks such as [13,14,15,16]. We include it in an appendix because it illustrates the fundamental procedure that is repeatedly employed throughout this study.
Proposition 1. The solution to the heat Equation (3), subject to the initial condition (4) and boundary condition (5), is The solution consists of two steps. In the first step, we utilize the heat Equation (3) and boundary condition (5) to derive the heat kernel function. In the above example, the heat kernel is Gaussian, . The kernel function depends solely on ; therefore, it can be written as . In the second step, we use the observed data as an initial condition and perform the convolution of the heat kernel with the empirical density, resulting in .
In the subsequent sections, this procedure will be employed to derive the precise boundary kernel for density estimation on a compact support with general boundary conditions.
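The two-step procedure can also be checked numerically. The sketch below assumes the normalization u_t = u_xx, so that time t corresponds to the bandwidth h = sqrt(2t); the grid and finite-difference scheme are illustrative choices, not part of the paper's method. It evolves the empirical density under the heat equation and compares the result with the Gaussian kernel density estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, 8)

# Step 1: the heat kernel of u_t = u_xx is a Gaussian with variance 2t,
# so time t corresponds to the bandwidth h = sqrt(2 t).
h = 0.2
t = h**2 / 2.0

# Step 2: evolve the empirical density (point masses on a grid) with an
# explicit finite-difference scheme; the boundaries are far enough away
# to mimic the whole-line problem.
grid = np.linspace(-4.0, 5.0, 3001)
dx = grid[1] - grid[0]
u = np.zeros_like(grid)
for xi in data:
    u[np.argmin(np.abs(grid - xi))] += 1.0 / (len(data) * dx)
steps = int(np.ceil(t / (0.4 * dx**2)))   # keep r = dt/dx^2 <= 0.4 (stability)
r = (t / steps) / dx**2
for _ in range(steps):
    u[1:-1] += r * (u[2:] - 2.0 * u[1:-1] + u[:-2])

# Gaussian kernel density estimate with bandwidth h on the same grid
kde = np.mean(np.exp(-(grid[None, :] - data[:, None])**2 / (2.0 * h**2)),
              axis=0) / (h * np.sqrt(2.0 * np.pi))
```

Up to discretization error, the evolved empirical density and the kernel density estimate coincide.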
3. Method: Kernel Density Estimation on a Finite Interval with Boundary Conditions
We assume prior knowledge or make assumptions regarding the boundary behavior of an unknown density function. Some variables naturally possess conditions at their boundaries. Occasionally, we can estimate the boundary value or slope of the density function during the first stage. Fixing the value of the density function at a boundary is referred to as the Dirichlet condition, while predetermining the slope of the density function at a boundary is known as the Neumann condition. More general types of boundary conditions will be introduced in the following section.
Suppose the density function is supported on an interval
. We focus on cases where the support of the density is known. However, if, for instance, the right end point
l is unknown, it would be reasonable to estimate it using the maximum value of the sample. Assume that the slope of the density at the boundary is 0. We can formulate the problem as a heat equation with the Neumann boundary condition:
This type of boundary condition is known as a perfect insulation condition. The solutions to these equations provide the exact kernel, as summarized in Proposition 2.
Proposition 2. The solution of the heat Equation (8), subject to the perfect insulation condition (10) at both boundaries, is given by where the corresponding kernel is A kernel density estimate is obtained with the initial condition : The solution (12) is akin to orthogonal series estimators, where a trigonometric series serves as the basis for the expansion. The primary distinction lies in the rapid decay of the factor
as m increases. Due to this dampening factor, the influence of additional terms in the series (12) diminishes quickly. Unlike orthogonal series estimators, Equation (12) does not “collapse” to the data itself, preventing “Dirac’s disaster”. This implies that smoothing or truncating the series is not crucial (see [3], pp. 23–25). Because the contributions of the additional terms are negligible in our solution (12), the first few terms in the series determine most of the shape of the density. The series converges uniformly over the entire support. However, when numerically evaluating the density estimate in Proposition 2, one must decide how many terms to include. In most cases, the first 50 terms suffice to obtain a satisfactory approximation of the solution.
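For the unit interval, the truncated cosine series can be sketched as follows (assuming the normalization u_t = u_xx; the function name and data are illustrative):

```python
import numpy as np

def neumann_kde(x, data, t, l=1.0, M=50):
    """Zero-slope (perfect insulation) density estimate on [0, l]: the
    truncated cosine-series kernel of the Neumann heat problem, averaged
    over the data.  M = 50 terms usually suffice."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi / l)**2 * t)        # dampening factor
    cx = np.cos(np.pi * np.outer(x, m) / l)       # shape (len(x), M)
    cd = np.cos(np.pi * np.outer(data, m) / l)    # shape (n, M)
    # K(x, y) = 1/l + (2/l) sum_m damp_m cos(m pi x / l) cos(m pi y / l)
    return 1.0 / l + (2.0 / l) * cx @ (damp * cd.mean(axis=0))

x = np.linspace(0.0, 1.0, 401)
f = neumann_kde(x, np.array([0.05, 0.2, 0.5, 0.9]), t=0.005)
```

The constant term integrates to one over [0, l] and every cosine term integrates to zero, so the estimate is a proper density.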
Next, we consider density estimation under the Dirichlet condition. Suppose that the value of the density function is known at the boundary, and that the density is 0 at both endpoints. This problem can be expressed as the heat Equation (8), with the initial condition (9) and boundary condition:
Proposition 3. The solutions of Equations (8), (9), and (13) are given by with kernel function The resulting kernel density estimate is Consider a scenario where the information generated by data is perfectly absorbed upon reaching a boundary. This situation is analogous to having large reservoirs of 0 temperature at both ends of the domain. When heat flow reaches the boundary from the interior, it is rapidly absorbed (cooled), maintaining the boundary temperature at 0. Because, within a finite interval, the amount of information the data supplies is always positive, it is evident that we will lose a certain amount of information through the boundaries. Therefore, under the Dirichlet condition, the total amount of information no longer sums to one, and the resulting kernel density estimate is no longer a proper density function. This deficiency can be easily addressed. The heat (or information) lost can be inferred by calculating the heat (or information) conserved within the domain. At a fixed
t,
represents the amount of information at location
x, and the total information remaining in the domain is
Then the proper density function is
Whether the density estimate obtained under the 0 boundary condition needs to be modified into a proper density should be decided based on the specific problem.
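A sketch of the Dirichlet (absorbing boundary) estimator, including the renormalization step, is given below. The sine-series form and the closed-form mass integral follow from the eigenfunction expansion, assuming the normalization u_t = u_xx; the function name and data are illustrative:

```python
import numpy as np

def dirichlet_kde(x, data, t, l=1.0, M=50, renormalize=True):
    """Density estimate on [0, l] that vanishes at both endpoints: the
    sine-series (absorbing boundary) heat kernel averaged over the data,
    optionally rescaled to integrate to one."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi / l)**2 * t)
    sx = np.sin(np.pi * np.outer(x, m) / l)
    sd = np.sin(np.pi * np.outer(data, m) / l)
    coef = damp * sd.mean(axis=0)
    f = (2.0 / l) * sx @ coef
    if renormalize:
        # mass remaining in the domain, integrated in closed form:
        # int_0^l sin(m pi x / l) dx = (l / (m pi)) (1 - cos(m pi))
        mass = (2.0 / l) * np.sum(coef * (l / (np.pi * m)) * (1.0 - np.cos(np.pi * m)))
        f = f / mass
    return f

x = np.linspace(0.0, 1.0, 401)
f = dirichlet_kde(x, np.array([0.1, 0.3, 0.5, 0.85]), t=0.005)
```

Without renormalization the estimate integrates to less than one, reflecting the information absorbed at the boundaries.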
4. Method: General Boundary Conditions in One-Dimensional Problems
In this section, we discuss more general problems. Suppose that, at the endpoints, the values of a density function are given by
for
. This is a heat equation problem with the Dirichlet condition and is described by the diffusion Equation (
8), initial condition (9), and boundary conditions:
Proposition 4 identifies the solution to this problem.
Proposition 4. The solutions to Equations (8), (9), and (16) are As a second example, we consider a mixed boundary condition. On one side of the support, the first derivative of the density is given, while the density value is known on the other side. We can apply the Neumann condition at the left end and the Dirichlet condition at the right end. This problem can be represented as a heat equation with the following boundary conditions:
The solution is presented in Proposition 5.
Proposition 5. The solutions to Equations (8), (9), and (17) are Next, suppose a certain type of symmetry exists at the boundary, with the same density values and derivatives at the ends. To highlight the symmetry, consider a boundary
; then,
,
. It can be represented as a diffusion equation with the following boundary conditions:
Proposition 6 identifies the solution for symmetric boundary conditions.
Proposition 6. The solutions to Equations (8), (9), and (18) are It is impossible to cover all possible kernels for every conceivable boundary condition. However, there is a general principle at play. One can recast a density estimation problem as a heat/diffusion equation problem in order to determine an appropriate kernel with a specific boundary behavior. If the solution of that particular case is known, either analytically or numerically, in the context of the heat equation, it can be applied to density estimation problems.
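When no closed form is at hand, the recipe can be carried out numerically. The sketch below (an illustrative grid and scheme, assuming the normalization u_t = u_xx) imposes a mixed condition, zero slope at the left end and zero value at the right end, on an explicit finite-difference solver started from the empirical density:

```python
import numpy as np

def mixed_bc_kde(data, t, grid):
    """Numerical density estimate on a grid: solve u_t = u_xx from the
    empirical density, with a Neumann (zero-slope) condition on the left
    and a Dirichlet (zero-value) condition on the right."""
    dx = grid[1] - grid[0]
    u = np.zeros_like(grid)
    for xi in data:                          # empirical density: mass 1/n per point
        u[np.argmin(np.abs(grid - xi))] += 1.0 / (len(data) * dx)
    steps = int(np.ceil(t / (0.4 * dx**2)))  # keep r = dt/dx^2 <= 0.4 (stability)
    r = (t / steps) / dx**2
    for _ in range(steps):
        u[1:-1] += r * (u[2:] - 2.0 * u[1:-1] + u[:-2])
        u[0] = u[1]                          # zero slope at the left boundary
        u[-1] = 0.0                          # zero value at the right boundary
    return u

grid = np.linspace(0.0, 1.0, 401)
u = mixed_bc_kde(np.array([0.1, 0.3, 0.6]), 0.005, grid)
```

The same solver accepts any boundary rule that can be written as an update at the end cells, which is the sense in which the approach generalizes.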
5. Discussion: Comparisons with Existing Boundary Kernel Estimators
Several well-known methods exist for density estimation within a bounded interval. This section highlights similarities and differences between the newly proposed exact kernel estimator and the existing methods. The reflection method [3] provides exact kernel functions when the density is defined on a half interval with specific boundary conditions. The pseudo-data generation method [4] is a similar variant, differing from the reflection method in the way it generates reflected data. These two methods are the closest to the approach used in the present study. Additional methods include the boundary kernel [5,6] and transformation methods [7,8].
Except for the reflection method, existing methods do not provide an exact kernel function and are only valid asymptotically. Most of these methods are developed for a domain with a one-sided boundary, such as a positive half interval. While these methods can sometimes be extended to a finite interval with two boundaries, it is not always possible. Additionally, extending these methods to higher dimensions can be challenging.
Unlike existing methods, the proposed approach is applicable to a finite, two-sided interval. Extending this approach to multivariate problems is straightforward, and this multivariate kernel density estimation on a bounded domain constitutes the main contribution of this paper. Furthermore, with almost any boundary condition, it is possible to find an exact kernel that provides a valid density estimate.
Because the reflection and ‘negative’ reflection methods [3] are the most basic strategies in the literature for problems on a bounded domain and can be considered special cases of the diffusion equation-based method, we focus on the reflection method in this section. The origins of the reflection and negative reflection methods [3] can be traced back to [17], where formulas for a half-line density estimation problem were proposed. Although some of their properties were stated, the source of these formulas and the context in which they arise were not made explicit.
First, we describe the (negative) reflection method. Suppose that a sample
is drawn from an unknown density on a half-line
. The reflection method augments the data by copying the original observations on the opposite side of the domain
. Applying the normal kernel density method to the augmented data yields
If the kernel function is symmetric around 0 and differentiable, the first derivative of the density function is always 0 at the boundary,
. This is because
The negative reflection method is defined in a similar manner. Augment the data and apply the usual kernel method, but with weights of
to the reflected data:
We can show that the boundary value of the density estimate is always 0,
because
With the Gaussian kernel, the density estimates with the reflection and negative reflection methods are
Equations (19) and (20) provide the exact kernel functions for density estimation on a half-line with the corresponding boundary conditions. However, these methods are best understood when posed as heat equation problems. The heat equation framework elucidates why (19) or (20) should be proposed as solutions to this problem. The reflection methods above solve the heat equation defined on a half-line with the corresponding boundary conditions. Consider the heat equation on the positive half-line. Our exposition on this subject draws from [15].
One method to solve this half-line problem is to use even extensions. Because the initial condition
is defined solely on a half-line, we define its even extension as
and consider the entire line problem with a new initial condition
Its solution, which addresses the entire line problem, is presented in Proposition 1. Let us denote
the Gaussian kernel
, then
Restricting to positive half-lines,
is a unique solution to the original half-line problem. The boundary condition
is inherently satisfied because
is an even function. Its derivative with respect to the first argument,
, is an odd function; therefore, at the origin,
.
The explicit form of
is
:
The second equality arises from the definition of
, and the third equality follows from the change of variable
. The solution
is derived by restricting
to
. Finally, acknowledging the Gaussian kernel and setting
yield the following explicit formula:
This is exactly the same as the ‘reflection method’ (19) when we set . The even extension of the initial condition corresponds to augmenting the data by reflecting the original observations onto the negative half-line.
Just as the even extension explains how the reflection method is obtained, the odd extension provides a rationale for the negative reflection method. We define an odd extension as
following the same logic and arriving at the same result as the negative reflection solution (
20). We verify whether the odd extension ensures
for the boundary condition.
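Both constructions are easy to state in code. In the sketch below (illustrative names and simulated half-line data), sign=+1 gives the reflection estimator, whose slope vanishes at 0, and sign=-1 gives the negative reflection estimator, whose value vanishes at 0:

```python
import numpy as np

def phi(z, h):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (z / h)**2) / (h * np.sqrt(2.0 * np.pi))

def reflection_kde(x, data, h, sign=1.0):
    """Half-line estimator on [0, inf): augment the data with the
    reflected copies -X_i, weighted by +1 (reflection) or -1
    (negative reflection)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = np.asarray(data)[None, :]
    return (phi(x[:, None] - d, h) + sign * phi(x[:, None] + d, h)).mean(axis=1)

data = np.abs(np.random.default_rng(1).normal(size=200))
x = np.array([0.0, 1e-4])
f_plus = reflection_kde(x, data, h=0.3, sign=1.0)    # zero slope at 0
f_minus = reflection_kde(x, data, h=0.3, sign=-1.0)  # zero value at 0
```

The boundary behavior holds exactly, not merely asymptotically, because the Gaussian kernel is even.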
In the above discussion, our framework explains why these methods should be the solution of kernel density on a one-sided domain. However, this also underscores the limitations of the simple reflection method. Suppose the data are generated from a two-sided finite interval, and the derivative of the density function is 0 on both sides. One might be tempted to repeat the reflection procedure twice, once for the left side and once for the right side. However, the kernel obtained in this way is not the exact kernel for the finite-interval density estimation. The data should be reflected an infinite number of times. The exact kernel obtained through the reflection method is
This is obtained when we first reflect a data point on the negative side and periodically reflect both points with periods , that is, and for all integers . This indefinite reflection is required because the reflected data affect not only the left boundary but also the right side . Another copy must be made at position to cancel out the second undesired effect. However, this additional copy affects both boundaries , and to cancel these out appropriately, an infinite number of reflections must be made.
There are two representations, Equations (11) and (21), for the same problem. Proposition 7 confirms that these two equations represent the same kernel function in different ways.
Each representation has its advantages. In Equation (11), the infinite series converges rapidly when the dampening factor vanishes quickly as m increases, which occurs when either t is large or l is small. Therefore, Equation (11) is suitable for problems in which either the bandwidth is relatively large or the interval domain is narrow. In contrast, Equation (21) converges rapidly when its terms quickly approach zero as m increases, which happens when t is small or l is large. Therefore, the infinite reflection representation is preferable when the bandwidth is relatively small or the domain of the density function is long.
For the Dirichlet condition, when the value of the density function on both sides is set to 0, we apply the negative reflection principle to derive
Proposition 7. For and , the kernel density estimates in (11) and (21) are equivalent. Additionally, (14) and (22) are equivalent.
6. Main Results: Exact Kernels for Multivariate Data
This section considers a multivariate density estimation on a compact support with a boundary condition. One observes data where , a k-dimensional vector, is supported on a bounded domain . Let represent the closure of and denote the boundary of the domain. A Laplacian operator is defined as . It is a natural generalization of the second-order derivative. Let x represent a vector .
Suppose that the density function takes value 0 on the boundary. To determine the exact kernel of the density estimation, consider a
k-dimensional heat equation with Dirichlet boundary conditions:
The strategy that worked for the one-dimensional problem continues to be useful: find the solution of the heat equation and use the empirical density
as an initial condition
. The eigenfunction expansion can be used for this purpose [18]. Consider the following time-independent problem.
The values of that admit a nontrivial solution of the above equation are called the eigenvalues of Laplacian , and the corresponding ’s are eigenfunctions. There is a countable set of eigenvalues with positive values, and for each , there is a finite number of linearly independent eigenfunctions. Moreover, the eigenfunctions corresponding to distinct eigenvalues are orthogonal. Let us denote as the eigenvalues and eigenfunctions of the Laplacian in a bounded domain with Dirichlet boundary conditions.
Suppose that the initial condition
can be expanded in the above eigenfunctions:
where Fourier coefficients
. When the initial condition is data, i.e., a sum of Dirac delta functions, the meaning of the expansion should be understood in the sense that the infinite sum converges to
in
. Thus,
,
. The solution to the heat equation is
Plug in the coefficients
and interchange the integration and summation to obtain
This means that
can be written as a convolution of a kernel and the initial condition
where the kernel is defined as follows:
Plug the empirical density into the initial condition and obtain a kernel density estimator
The specific form of the density estimator depends on several factors, including the dimension of the problem, the shape of the boundary, and the specific boundary condition. We consider examples where a multivariate density function is defined within a box.
Suppose that a multivariate density function is defined on
, and its value on the boundary is 0. Consider the following eigenvalue problem:
Theorem 1 identifies its eigenvalues and eigenfunctions. Using the procedure described above, one can obtain the desired multivariate kernel function.
Theorem 1. The eigenvalues and eigenfunctions of Equation (25) are where . The corresponding kernel is Moreover, from (24), the kernel density estimator for a given bandwidth t is The next example is a case where the partial derivatives of the density function are known on the boundary. Everywhere on the boundary, the first partial derivative of the density function is identically 0. To determine the eigenvalues and corresponding eigenfunctions of the
Neumann problem, consider the following time-independent problem:
Theorem 2. The eigenvalues and eigenfunctions of Equation (26) are where . The kernel function is where constants can be found in (A5). Moreover, the multivariate density estimator is Note that the solution of the multivariate density estimation is the product of solutions of the corresponding univariate problems. This is because a rectangular domain ensures separability with respect to the variables . A circular domain also produces a simple solution owing to its angular symmetry. If the shape of is more challenging, the same procedure can still be used, but the solution of the heat equation may be found numerically.
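On a rectangular domain, the product structure makes the multivariate estimator a one-liner given the univariate kernel. A sketch for the zero-slope (Neumann) case on the unit square follows (illustrative names and data; normalization u_t = u_xx):

```python
import numpy as np

def cos_kernel_1d(x, y, t, l=1.0, M=50):
    """Univariate Neumann heat kernel on [0, l] (truncated cosine series)."""
    m = np.arange(1, M + 1)
    return 1.0 / l + (2.0 / l) * float(np.sum(
        np.exp(-(m * np.pi / l)**2 * t)
        * np.cos(m * np.pi * x / l) * np.cos(m * np.pi * y / l)))

def box_kde(x1, x2, data, t, l=(1.0, 1.0)):
    """Bivariate estimator on [0, l1] x [0, l2] with zero-slope boundary
    conditions: the kernel is the product of the univariate kernels."""
    return float(np.mean([cos_kernel_1d(x1, d1, t, l[0]) * cos_kernel_1d(x2, d2, t, l[1])
                          for d1, d2 in data]))

data = [(0.2, 0.3), (0.7, 0.5), (0.9, 0.9)]
f_mid = box_kde(0.5, 0.5, data, t=0.01)
```

Because each univariate kernel integrates to one over its interval, the product kernel integrates to one over the box.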
7. Simulation Results
This section assesses the finite sample performance of the proposed exact kernel density estimator using simulated data. We generate samples from two models. In Model 1, we independently draw a sample of size n from a standard uniform distribution. In Model 2, the sample is drawn from a truncated normal distribution with a mean of zero, a variance of one, and lower and upper truncation points of 0 and 1, respectively. In both cases, the support of the data is the unit interval . We evaluate the performance of several density estimators at 0, which is the left end of the support. The same analysis can be conducted for 1, the right end of the support. The results will be symmetric.
Consider four estimators. (i) denotes the ordinary kernel density estimator in (1) that has no boundary bias correction. (ii) is the density estimator from the one-sided reflection method. As explained in Section 5, it is the kernel density estimator using an augmented sample . (iii) is the estimator from the two-sided reflection method. In addition to the sample used in the one-sided reflection, this two-sided method uses additional data points . (iv) is the exact kernel density estimator obtained in Proposition 2. This estimator imposes the condition that the slopes of at the boundary points are equal to zero. As explained in Section 5, the reflection methods impose the same constraints at the boundary.
We try three different values for the sample size n and five different values for the bandwidth h, as shown in Table 1. Recall that the smoothing parameter t for the exact kernel method is related to h by . Let represent the density estimate at 0 for the j-th simulation run, and let L be the total number of simulation repetitions. Then the bias and root mean squared error are calculated by
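A small Monte Carlo in this spirit can be sketched as follows (Model 1 with illustrative values of n, h, and L; the exact kernel uses the cosine series of Proposition 2, assuming the normalization u_t = u_xx so that t = h^2/2):

```python
import numpy as np

def ordinary_kde0(data, h):
    """Ordinary Gaussian kernel density estimate at 0 (no correction)."""
    return np.mean(np.exp(-0.5 * (data / h)**2)) / (h * np.sqrt(2.0 * np.pi))

def exact_kde0(data, t, l=1.0, M=50):
    """Zero-slope exact kernel estimate at 0 (truncated cosine series)."""
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi / l)**2 * t)
    return 1.0 / l + (2.0 / l) * float(np.sum(
        damp * np.mean(np.cos(np.pi * np.outer(data, m) / l), axis=0)))

rng = np.random.default_rng(0)
n, h, L = 500, 0.1, 200
ords, exacts = [], []
for _ in range(L):
    s = rng.uniform(size=n)               # Model 1: standard uniform sample
    ords.append(ordinary_kde0(s, h))
    exacts.append(exact_kde0(s, h**2 / 2.0))
bias_ord = np.mean(ords) - 1.0            # true density at 0 is 1
bias_exact = np.mean(exacts) - 1.0
```

The ordinary estimator misses roughly half the mass at the boundary (bias near -0.5), while the exact kernel estimate is unbiased at 0 for the uniform density, since every cosine term has mean zero under the uniform distribution.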
Table 1 shows the results for Model 1. Overall, the ordinary kernel density estimator proves to be inadequate. It should not be used in empirical studies when the support of the density function is bounded. The one-sided reflection method significantly mitigates the boundary bias problem, and the two-sided reflection method performs even better than its one-sided counterpart. However, both tend to exhibit larger biases when the bandwidth is large. As explained in Section 5, reflecting the sample only once over the boundary is not sufficient to correct the boundary bias, and this tendency becomes more pronounced with larger bandwidth values. Regarding the bias–variance trade-off, note that a smaller bandwidth results in smaller bias but larger variance, while a larger bandwidth has the opposite effect. Finally, the exact kernel density estimator proves to be the most effective in controlling the boundary bias. It remains effective even in small samples and with larger bandwidth values.
Table 2 shows comparable results for Model 2. We observe the same overall tendencies as before. However, unlike in Model 1, the biases in Model 2 do not disappear even as the sample size increases. This may be attributed to the fact that the zero-slope conditions at the boundary may not be satisfied. Since this constraint is imposed in all three boundary correction methods considered in this section, it could result in a persistent bias. Nonetheless, even in this case, it is evident that the exact kernel density estimator is the most effective approach for addressing the boundary bias issue.
8. Real Data Example
The propensity score is the probability of an individual receiving treatment given their characteristics. One crucial issue is whether the propensity score model is well specified. Because the estimated propensity score is a critical step in controlling selection bias in non-experimental data, a misspecified propensity score may lead to incorrect inference of the treatment effect. One strategy in the literature for checking the model specification is to compare the estimated densities of the propensity scores between the treatment and control groups [19]. This demonstrates the importance of recognizing boundary effects in the density estimation step.
We illustrate this point using data from the National Supported Work (NSW) Demonstration. We use a sample from [20]. Since male participants in the NSW data are a focal point in the literature, we considered only male participants. For the non-experimental comparison group, we used the full PSID sample of [20]. There were 297 treatment observations and 2490 control observations. Most studies use parametric models, such as logit or probit, to estimate the propensity score; therefore, we fit the probability of being in the treatment group using a logit model. (Data are available in the R package rrp. The individual characteristics used to fit the logit model are the linear terms of all explanatory variables (except post-experiment income) in the dataset LLvsPSID in rrp.)
Figure 2 shows histograms of the estimated propensity scores of the treatment and control groups. There were many instances where the estimated probabilities were close to the boundaries. In the control group, most individuals’ propensity scores were close to 0. This is because there were few similarities between the control and treatment groups. The treatment group comprises people eligible for the government-supported social program; therefore, they often represent the low-income part of the population. In contrast, the PSID sample control group represents the average household in the United States; therefore, their characteristics are quite different from those of the treatment group.
We used the estimated propensity scores of each group as our data and applied the kernel method to estimate and compare the density functions of the propensity score in the two groups.
Figure 3 shows the densities estimated using the Gaussian kernel, applying [3]’s rule of thumb to select the bandwidth for each sample.
The propensity score is a probability, so its density function estimate should be confined to the unit interval. However, the density estimates flow outside the unit interval. This is because many individuals have propensity scores close to the boundaries of either 0 or 1. Because the kernel method is a smoothing technique, it allows the information carried by the data points to flow outside. A significant mass is generated outside the unit interval when many observations are near the boundaries. This is a conceptual problem because we compare two probability density functions. It also causes a boundary bias problem because it underestimates the density value near the boundary.
As a remedy, we use the exact kernel on a finite domain in Proposition 2:
We applied this to the propensity score of each group and obtained Figure 4. Unlike Figure 3, the results in Figure 4 show substantial masses clustered at the boundaries. For the control group, where most observations are close to 0, the estimated value of the density function near the left boundary is approximately twice as large in Figure 4 as in Figure 3. This indicates that one would draw incorrect inferences without a proper mechanism to correct the boundary effects.