1. Introduction
Functional data analysis (FDA) is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework, each sample element of functional data is considered to be a random function.
Popularized by Ramsay and Silverman [1,2], statistics for functional data have attracted considerable research interest because of their wide applications in many practical fields, such as medicine, economics and linguistics. For an introduction to these topics, we refer to the monographs of Ramsay and Silverman [3] for parametric models and Ferraty and Vieu [4] for nonparametric models.
In this paper, the following functional nonparametric regression model is considered:
$$Y = r(X) + \varepsilon, \qquad (1)$$
where $Y$ is a scalar response variable, $X$ is a covariate taking values in a subset $S_{\mathcal{F}}$ of an infinite-dimensional functional space $\mathcal{F}$ endowed with a semi-metric $d(\cdot,\cdot)$, $r(\cdot)$ is the unknown regression operator from $S_{\mathcal{F}}$ to $\mathbb{R}$, and the random error $\varepsilon$ satisfies $E(\varepsilon \mid X) = 0$.
For the estimation of model (1), Ferraty and Vieu [5] investigated the classical functional Nadaraya-Watson (N-W) kernel-type estimator of $r(\cdot)$ and obtained its asymptotic properties with rates in the case of $\alpha$-mixing functional data. Ling and Wu [6] studied a modified N-W kernel estimate and derived its asymptotic distribution for strongly mixing functional time series data, and Baíllo and Grané [7] proposed a functional local linear estimate based on the local linear idea. In this paper, we focus on the k-nearest neighbors (kNN) method for regression model (1). The kNN method, one of the simplest and most traditional nonparametric techniques, is often used as a nonparametric classification method. It was first developed by Evelyn Fix and Joseph Hodges in 1951 [8] and later expanded by Thomas Cover [9]. In our kNN regression, the input consists of the k closest training examples in a dataset, and the output is the property value for the object, namely the average of the values of its k nearest neighbors. Under independent samples, research on kNN regression mostly focuses on the estimation of the continuous regression function $r(\cdot)$. For example, Burba et al. [10] investigated the kNN estimator based on the idea of a locally adaptive bandwidth for functional explanatory variables. The papers [11,12,13,14,15,16,17,18], among others, obtained the asymptotic behavior of nonparametric regression estimators for functional data in the independent and dependent cases. Further, Kudraszow and Vieu [19] obtained asymptotic results for a kNN generalized regression estimator when the observed variables take values in an abstract space, and Kara-Zaitri et al. [20] provided an asymptotic theory for several different target operators, including the regression, conditional density, conditional distribution and hazard operators, together with some simulation experiments. However, functional observations often exhibit correlation, including some form of negative dependence or negative association.
Negatively associated (NA) sequences were introduced by Joag-Dev and Proschan in [21]. Random variables $X_1, X_2, \ldots, X_n$ are said to be NA if, for every pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \ldots, n\}$,
$$\operatorname{Cov}\big(f(X_i,\, i \in A_1),\; g(X_j,\, j \in A_2)\big) \le 0,$$
or equivalently,
$$E\big[f(X_i,\, i \in A_1)\, g(X_j,\, j \in A_2)\big] \le E\big[f(X_i,\, i \in A_1)\big]\, E\big[g(X_j,\, j \in A_2)\big],$$
where $f$ and $g$ are coordinatewise non-decreasing functions such that this covariance exists. An infinite sequence $\{X_n,\, n \ge 1\}$ is NA if every finite subcollection is NA.
For example, if $(X_1, \ldots, X_n)$ follows a permutation distribution, that is, it takes each of the $n!$ permutations of $(x_1, \ldots, x_n)$ with probability $1/n!$, where $x_1, \ldots, x_n$ are $n$ real numbers, then $(X_1, \ldots, X_n)$ is NA.
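The permutation example can be checked empirically. The following R sketch draws random permutations of a few fixed numbers and estimates the covariance between two non-decreasing functions of disjoint coordinate blocks; the particular numbers and the choices of $f$ and $g$ are illustrative.

```r
# Monte Carlo check of the NA property for a permutation distribution:
# (X_1, ..., X_5) is a uniformly random permutation of fixed x_1, ..., x_5.
set.seed(1)
x <- c(0.3, 1.2, 2.5, 4.0, 5.1)   # arbitrary fixed real numbers
n_rep <- 1e5
f_vals <- g_vals <- numeric(n_rep)
for (r in seq_len(n_rep)) {
  X <- sample(x)                  # one random permutation
  f_vals[r] <- max(X[1:2])        # f: non-decreasing in (X_1, X_2)
  g_vals[r] <- sum(X[3:5])        # g: non-decreasing in (X_3, X_4, X_5)
}
cov(f_vals, g_vals)               # negative, up to Monte Carlo error
```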
Since kNN regression under NA sequences has not yet been explored in the literature, in this paper we extend the kNN estimation of functional data from the case of independent samples to NA sequences.
Let the pairs $(X_i, Y_i)$, $i = 1, \ldots, n$, be a sample of NA pairs in $S_{\mathcal{F}} \times \mathbb{R}$, each distributed as the random vector $(X, Y)$ valued in $\mathcal{F} \times \mathbb{R}$. $(\mathcal{F}, d)$ is a semi-metric space, $\mathcal{F}$ is not necessarily of finite dimension, and we do not suppose the existence of a density for the functional random variable $X$. For a fixed $x \in \mathcal{F}$, the closed ball with $x$ as the center and $h$ as the radius is denoted as:
$$B(x, h) = \{x' \in \mathcal{F} : d(x, x') \le h\}.$$
The kNN regression estimator [10] is defined as follows:
$$\widehat{r}_{kNN}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\big(d(x, X_i)/H_{n,k}(x)\big)}{\sum_{i=1}^{n} K\big(d(x, X_i)/H_{n,k}(x)\big)},$$
where $K$ is the kernel function supported on $[0, 1]$, and $H_{n,k}(x)$ is a positive random variable that depends on $(X_1, \ldots, X_n)$ and is defined by:
$$H_{n,k}(x) = \min\Big\{h > 0 : \sum_{i=1}^{n} \mathbb{1}_{B(x, h)}(X_i) = k\Big\}.$$
Obviously, the kNN estimator can be seen as an expansion of the traditional kernel method [5] to a random, locally adaptive neighborhood; the latter is defined as:
$$\widehat{r}_{n}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\big(d(x, X_i)/h_n\big)}{\sum_{i=1}^{n} K\big(d(x, X_i)/h_n\big)},$$
where $h_n$ is a sequence of positive real numbers such that $h_n \to 0$ a.s. as $n \to \infty$.
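To fix ideas, here is a minimal R sketch of the kNN estimator above for curves observed on a common grid. The $L_2$ semi-metric, the Epanechnikov-type kernel default and the function name `knn_freg` are illustrative choices, not the paper's implementation.

```r
# Functional kNN regression with the random bandwidth H_{n,k}(x):
# the k-th smallest distance to x acts as a locally adaptive bandwidth.
knn_freg <- function(x_new, X_train, Y_train, k,
                     kernel = function(u) 0.75 * (1 - u^2) * (u <= 1)) {
  # X_train: n x p matrix of discretized curves (one row per curve)
  # L2 semi-metric between x_new and each training curve (up to grid spacing)
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  H <- sort(d)[k]                 # random bandwidth H_{n,k}(x)
  w <- kernel(d / H)              # kernel weights, supported on [0, 1]
  sum(w * Y_train) / sum(w)       # locally weighted average of the responses
}
```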
This paper is organized as follows. The main results of our paper on the asymptotic behavior of the kNN estimators using a data-driven random number of neighbors are given in Section 2. Section 3 illustrates the numerical performance of the proposed method, including a nonparametric functional regression analysis of the sea surface temperature (SST) data for the El Niño area (0–10° S, 80–90° W). The technical proofs are postponed to Section 4. Finally, Section 5 is devoted to comments on the results and to related perspectives for future work.
2. Assumptions and Main Results
In this section, we focus on the asymptotic properties of the kNN regression estimator; to state them, we first recall the notion of the rate of almost complete convergence of an estimator.
One says that the rate of almost complete convergence of a sequence $(U_n)_{n \ge 1}$ to $Y$ is of order $u_n$ if and only if, for any $\epsilon > 0$,
$$\sum_{n=1}^{\infty} P\big(|U_n - Y| > \epsilon\, u_n\big) < \infty,$$
and we write $U_n - Y = O_{a.co.}(u_n)$ (see for instance [5]). By the Borel-Cantelli lemma, this implies that $(U_n - Y)/u_n \to 0$ almost surely, so almost complete convergence is a stronger result than almost sure convergence.
Our results are stated under some mild assumptions, which we gather below for easy reference. Throughout the paper, we denote by $C, C_1, C_2, \ldots$ some positive generic constants, which may take different values in different places.
Assumption 1. $P(X \in B(x, h)) := \varphi_x(h) > 0$, and $\varphi_x(\cdot)$ is a continuous function, strictly monotonically increasing at the origin, with $\varphi_x(0) = 0$.
Assumption 2. There exist a function $\phi(\cdot)$ and a bounded function $f(\cdot)$ such that:
- (i) $\phi(0) = 0$, $\phi(h) > 0$ for any $h > 0$, and $f$ is bounded on $S_{\mathcal{F}}$;
- (ii) $\lim_{h \to 0} \phi(sh)/\phi(h) = \tau_0(s)$ for any $s \in [0, 1]$;
- (iii) the small ball probability factorizes, such that $\varphi_x(h) = \phi(h) f(x) + o(\phi(h))$ uniformly in $x \in S_{\mathcal{F}}$.
Assumption 3. $K(\cdot)$ is a nonnegative bounded kernel function with support $[0, 1]$, and if $K(1) = 0$, the derivative $K'$ exists on $[0, 1]$ satisfying $-\infty < C_1 \le K'(t) \le C_2 < 0$.
Assumption 4. $r(\cdot)$ is a bounded Lipschitz operator of order $\beta$ on $S_{\mathcal{F}}$, i.e., there exists $\beta > 0$ such that:
$$|r(u) - r(v)| \le C\, d^{\beta}(u, v) \quad \text{for all } u, v \in S_{\mathcal{F}}.$$
Assumption 5. For all $m \ge 2$, $E(|Y|^m \mid X = x) = \delta_m(x) < C < \infty$, with $\delta_m(\cdot)$ continuous on $S_{\mathcal{F}}$.
Assumption 6. The Kolmogorov $\epsilon$-entropy of $S_{\mathcal{F}}$ satisfies, for some $\beta > 1$:
$$\sum_{n=1}^{\infty} \exp\Big\{(1 - \beta)\, \psi_{S_{\mathcal{F}}}\Big(\frac{\log n}{n}\Big)\Big\} < \infty.$$
For $\epsilon > 0$, the Kolmogorov $\epsilon$-entropy of some set $S \subset \mathcal{F}$ is defined by $\psi_S(\epsilon) = \log\big(N_{\epsilon}(S)\big)$, where $N_{\epsilon}(S)$ is the minimal number of open balls $B(x_j, \epsilon)$, with $x_j$ as the center and $\epsilon$ as the radius in $\mathcal{F}$, which can cover $S$.
Remark 1. Assumption 1, Assumption 2((i)–(iii)) and Assumption 4 are the standard assumptions for small ball probabilities and regression operators in nonparametric FDA; see Kudraszow and Vieu [19]. Assumption 2(ii) plays a key role in the methodology, particularly when we compute the asymptotic variance, and permits it to be explicit, as in Ling and Wu [6]. Assumption 2(iii) states that the small ball probability can be written as the product of the two independent functions $\phi(h)$ and $f(x)$, which has been used many times in Masry [11], Laib and Louani [12] and other literature. Assumption 5 is standard in the nonparametric setting and concerns the existence of the conditional moments, as in Masry [11] and Burba et al. [10]; it aims at obtaining the rate of uniform almost complete convergence. Assumption 6 is the Kolmogorov $\epsilon$-entropy condition, which we will use in the proof of the rate of uniform almost complete convergence.
Theorem 1. Under Assumptions 1–6, suppose that the sequence $k = k_n$ satisfies $k_n/n \to 0$ and $\psi_{S_{\mathcal{F}}}(\log n / n)/k_n \to 0$ for $n$ large enough; then we have:
$$\sup_{x \in S_{\mathcal{F}}} \big|\widehat{r}_{kNN}(x) - r(x)\big| = O\bigg(\Big(\phi^{-1}\Big(\frac{k_n}{n}\Big)\Big)^{\beta}\bigg) + O_{a.co.}\Bigg(\sqrt{\frac{\psi_{S_{\mathcal{F}}}(\log n / n)}{k_n}}\Bigg).$$
Remark 2. The Theorem extends the kNN estimation result of Theorem 2 in Kudraszow and Vieu [19] from the independent case to the NA dependent case, and obtains the same convergence rate under the corresponding assumptions. Second, the almost complete convergence rate of the prediction operator splits into two parts: one part is affected by the dependence and Kolmogorov's $\epsilon$-entropy, and the other depends on the smoothness of the regression operator and the smoothing parameter $k$.
Corollary 1. Under the conditions of the Theorem, we have:
Corollary 2. Under the conditions of the Theorem, we have:
3. Simulation
3.1. A Simulation Study
In this section, we aim to illustrate the finite-sample performance of the nonparametric functional regression model and to make a comparison with the traditional kernel estimation method. We consider the nonparametric functional regression model:
$$Y_i = r(X_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the error $\varepsilon_i$ follows a normal distribution, and the functional curves $X_i(t)$ are generated from a random coefficient vector distributed as $N(\mathbf{0}, \Sigma)$, where $\mathbf{0}$ represents the zero vector and the covariance matrix $\Sigma$ has non-positive off-diagonal entries. By the definition of NA, it can be seen that the generated coefficient vector is an NA vector, with finite moments of any order (see Wu and Wang [22]).
We choose the remaining model parameters casually and consider several sample sizes $n$; $t$ takes 1000 equispaced values in its domain. We carry out the simulation of the curves $X_i(t)$ for $n = 330$ samples (see Figure 1).
We consider the Epanechnikov kernel given by $K(u) = \frac{3}{4}(1 - u^2)\,\mathbb{1}_{[0,1]}(u)$, and the semi-metrics $d_q$ based on derivatives of order $q$:
$$d_q(x_1, x_2) = \Big(\int \big(x_1^{(q)}(t) - x_2^{(q)}(t)\big)^2\, dt\Big)^{1/2}.$$
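As an illustration, the derivative-based semi-metric can be computed from discretized curves by smoothing them first; the spline-based sketch below is one possible implementation (the routines accompanying [4] use B-spline expansions instead), and the function name is ours.

```r
# Semi-metric d_q: L2 distance between the q-th derivatives (q = 0, 1 or 2)
# of two curves observed on a common grid t, estimated via smoothing splines.
semimetric_deriv <- function(x1, x2, t, q) {
  d1 <- predict(smooth.spline(t, x1), t, deriv = q)$y
  d2 <- predict(smooth.spline(t, x2), t, deriv = q)$y
  sqrt(sum((d1 - d2)^2) * mean(diff(t)))  # Riemann approximation of the L2 norm
}
```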
Our purpose is to compare the mean squared error (MSE) of the kNN method with that of the NW kernel approach on finite simulated datasets. In the finite-sample simulation, the following steps are followed (an end-to-end R sketch of these steps is given after Step 4).
Step 1: We take 300 curves to construct the training samples $\{(X_i, Y_i),\, i = 1, \ldots, 300\}$, and the other 30 curves constitute the test samples $\{(X_i, Y_i),\, i = 301, \ldots, 330\}$.
Step 2: In the training sample, the parameters $k$ in the kNN method and $h$ in the NW kernel method are automatically selected by the cross-validation method, respectively.
Step 3: Based on the MSE criterion (see [4] for details), we select the respective semi-metric parameters $q$ for the kNN method and the NW method.
Step 4: The predicted response values $\widehat{y}_i^{\,kNN}$ and $\widehat{y}_i^{\,NW}$ of the test sample are calculated by the kNN method and the NW method, respectively, and their MSEs and scatter plots against the true values $y_i$ are presented in Figure 2.
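The following R sketch walks through Steps 1–4 under stated assumptions: it reuses the `knn_freg` function sketched in the Introduction, and the regression operator `r_true`, the NA covariance matrix `Sigma` and the cross-validation grid are illustrative stand-ins rather than the paper's exact choices.

```r
# Steps 1-4 of the simulation: NA Gaussian coefficients, train/test split,
# cross-validated choice of k, and test-sample MSE for the kNN estimator.
library(MASS)                                      # for mvrnorm
set.seed(2022)
n <- 330; p <- 100
t_grid <- seq(0, 1, length.out = p)
# A Gaussian vector with non-positive off-diagonal covariances is NA:
Sigma <- (1 + 1/n) * diag(n) - matrix(1/n, n, n)   # positive definite
a <- mvrnorm(1, mu = rep(0, n), Sigma = Sigma)     # NA coefficients across i
X <- outer(a, cos(2 * pi * t_grid)) + matrix(rnorm(n * p, sd = 0.1), n, p)
r_true <- function(x) mean(x^2)                    # illustrative operator r
Y <- apply(X, 1, r_true) + rnorm(n, sd = 0.1)
train <- 1:300; test <- 301:330                    # Step 1: 300 / 30 split
# Step 2: choose k on the training sample by leave-one-out cross-validation
k_grid <- seq(5, 50, by = 5)
cv_err <- sapply(k_grid, function(k)
  mean(sapply(train, function(i)
    (Y[i] - knn_freg(X[i, ], X[train[-i], ], Y[train[-i]], k))^2)))
k_opt <- k_grid[which.min(cv_err)]
# Step 4: predict the 30 test responses and compute the MSE
pred <- sapply(test, function(i) knn_freg(X[i, ], X[train, ], Y[train], k_opt))
mean((Y[test] - pred)^2)
```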
As we can see in Figure 2, the MSE of the kNN method is much smaller than that of the NW method, and the scattered points for the kNN method in Figure 2 are more densely distributed around the line $y = x$, which shows that the kNN method has a better fit and higher prediction accuracy for the NA dependent functional samples.
The kNN method and the NW method were each used to conduct 100 independent replicated experiments at the different sample sizes. The average MSE (AMSE) was calculated for both methods at each sample size using the following equation:
$$\mathrm{AMSE} = \frac{1}{100} \sum_{l=1}^{100} \mathrm{MSE}_l,$$
where $\mathrm{MSE}_l$ denotes the mean squared error of the $l$-th replication.
As can be seen from Table 1, the AMSE of the kNN method is much smaller than that of the NW kernel method at each fixed sample size; when the estimation method is fixed, the AMSEs of the two estimation methods show the same trend: both decrease as the sample size increases. However, the AMSE of the kNN method decreases significantly faster than that of the NW kernel method.
3.2. A Real Study
This section applies the proposed kNN regression to the analysis of data consisting of the sea surface temperature (SST) for the El Niño area (0–10° S, 80–90° W) over a total of 31 years, from 1 January 1990 to 31 December 2020. The data are available online at https://www.cpc.ncep.noaa.gov/data/indices/ (accessed on 1 January 2022). More relevant discussions of these data can be found in Ezzahrioui et al. [13,14], Delsol et al. [23], and Ferraty et al. [24]. The 1618 weekly SST records from the original data were preprocessed and averaged by month to obtain 372 monthly average SST values.
Figure 3 displays the decomposition of the multiplicative time series of the monthly SST.
Figure 4 shows that the monthly average SST in the El Niño region from 1990 to 2020 had a clear seasonal variation, and the monthly trend of SST can also be clearly observed from the seasonal index plot of the monthly mean SST.
The main factors affecting the temperature variation can be generally summarized as seasonal factors and random fluctuations. If the seasonal factor is removed, the SST should be left with only random fluctuations, i.e., the values fluctuate up and down at some mean value. At the same time, if the effect of random fluctuations is not considered, the SST is left with only the seasonal factor, i.e., the SST will have similar values in the same month in different years.
The following steps implement the kNN regression estimation method for the analysis of the SST data; the comparison with the NW kernel estimation method is displayed in Figure 5.
Step 1: Transform 372 months (31 years) of SST data into functional data.
Step 2: Divide the 31 samples of data into two parts: 30 training samples of data for model fitting and 1 test sample of data for prediction assessment.
Step 3: Here, functional principal component analysis (FPCA) is used to build a semi-metric applicable to rough curves such as the SST data (see Chapter 3 of Ferraty et al. [25] for the methodology). The quadratic kernel function used in Section 3.1 is used in the kNN regression.
Step 4: The SST values for 12 months in 2020 are predicted by the kNN method and the NW method, respectively, along with obtaining their MSEs for both methods.
Then, in Step 1, we split the discrete monthly average temperature data of 372 months into 31 yearly temperature curves and express them as $X_i = \{X_i(j),\, j = 1, \ldots, 12\}$, $i = 1, \ldots, 31$. The response variable is the corresponding monthly value of the following year's curve, $Y_i = X_{i+1}(j)$, $i = 1, \ldots, 30$. Thus, $\{(X_i, Y_i),\, i = 1, \ldots, 30\}$ is the sample set of dependent functional type with a sample size of 30, where $X_i$ is the functional data and $Y_i$ is a real value. A sketch of this construction is given below.
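The reshaping in Step 1 can be sketched in R as follows; the file name and column name are hypothetical placeholders for the preprocessed monthly series, and January is used as the illustrative target month.

```r
# Step 1: reshape 372 monthly means (1990-2020) into 31 annual curves and
# form (curve, next-year value) pairs for one target month.
sst <- read.csv("sst_monthly.csv")$sst       # hypothetical preprocessed series
X_year <- matrix(sst, nrow = 31, ncol = 12, byrow = TRUE)  # one curve per year
X <- X_year[1:30, ]                          # covariate curves: 1990-2019
Y <- X_year[2:31, 1]                         # response: next year's January SST
# the 30 pairs (X_i, Y_i) are used for fitting; 2020 is predicted month by month
```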
In Step 3, the choice of the semi-metric parameter $q$ for the kNN method and the NW method is performed via cross-validation in R, which gives the respective values of $q$ for the kNN regression method and the NW method. The selection of the parameters $k$ and $h$ is similar to Section 3.1.
From Figure 5, which compares the MSE values calculated by the two methods, it can be seen that the MSE of the kNN method is much smaller than that of the NW method. Further, considering the degree of fit between the curves fitted by the two methods and the true curve (dotted line), the predicted curves of both methods are generally close to the true curve, indicating that the prediction performance of both methods is very good. However, a closer look reveals that the predicted values of the kNN method fit clearly better at the inflection points of the curve, such as in January, February, March, November and December, which reflects the fact that the kNN method pays more attention to local variation than the NW method when processing data like these, including an abnormal or extreme distribution of the response variable.