1. Introduction
In the past few decades, functional data analysis has been widely developed and applied in various fields, such as medicine, biology, economics, environmetrics, and chemistry (see [
1,
2,
3,
4,
5]). An important model in functional data analysis is the partial functional linear model, which includes the parametric linear part and the functional linear part. To make the relationships between variables more flexible, the parametric linear part is usually replaced by the non-parametric part. This model is known as the functional partially linear regression model, which has been studied in [
6,
7,
8]. The functional partially linear regression model is formulated as follows:
where
Y is the response variable.
denotes the functional predictor, characterized by its mean function,
, and covariance operator,
. The slope function
is an unknown function.
is a general continuous function defined on a compact support
. The random error
has a mean of zero and a finite variance
, and is statistically independent of the predictor
. When
is a constant, model (
1) reduces to a functional linear model. Refer to [
9,
10,
11] for further details. With
representing the parametric linear component, model (
1) is identified as a partially functional linear model, an area explored in [
12,
13,
14].
Hypothesis testing plays a critical role in statistical inference. For testing the linear relationship between the response and the functional predictor in the functional linear model, functional principal component analysis (FPCA) is a major idea in constructing test statistics. See [
9,
10,
15]. Taking into account the flexibility of non-parametric functions, Ref. [
6] introduced the functional partially linear model. Refs. [
7,
8] constructed the estimators of the slope functions based on spline and FPCA respectively. They utilized B-spline for estimating non-parametric components. In the context of predictors with additive measurement error, ref. [
16] investigated estimators for the slope function and non-parametric component using FPCA and kernel smoothing methods. Ref. [
17] established estimators of the slope function, non-parametric component, and mean of the response variable in the presence of randomly missing responses.
However, testing the relationship between the response variable and functional predictor in the functional partially linear regression model has been rarely considered so far. In this paper, the following hypothesis testing for model (
1) will be considered:
where
denotes an assigned function. Here we assume
without compromising generality. To test (
2) within the framework of model (
1), a chi-square test was devised by [
18]. This test relies on estimators for the nonlinear and slope functions. The underlying assumption is that the functional data can be well-approximated by a small number of principal components.
In particular, we focus on functional data that cannot be approximated with a few principal components, such as the velocity and acceleration of changes in China’s Air Quality Index (AQI). If these changes are represented by some curves, the velocity and acceleration are equivalent to the first and second derivatives of the AQI, respectively. The number of principal components selected by FPCA may approach approximately 30. Only several research studies have considered this data structure in the functional data analysis. Ref. [
19] constructed a FLUTE test based on order-four U-statistic for the testing in the functional linear model, which can be computationally very costly. In order to save calculation time, ref. [
20] developed a faster test using a order-two U-statistic. Inspired by this, we introduce a non-parametric U-statistic that integrates functional data analysis with the traditional kernel method to test (
2).
The structure of the paper is as follows.
Section 2 details the development of a new test procedure for the functional partially linear regression model.
Section 3 presents the theoretical properties of the proposed test statistic under some regularity conditions.
Section 4 includes a simulation study to evaluate the finite sample performance of the proposed test.
Section 5 presents the application of the test to spectrometric data. The proofs of the primary theoretical results are presented in
Appendix A.
2. Test Statistic
Assume
Y and
U are random variables taking real values.
is a stochastic process with sample paths in
, which is the set of all square-integrable functions defined on
. Let
,
represent the inner product and norm in
, respectively. {
} constitutes a random sample drawn from model (
1),
For any given
, we move
to the left,
Hence, model (
4) simplifies to a classical non-parametric model. A pseudo-estimate for the non-parametric function employing Nadaraya–Watson method, can be formulated as follows:
where
with
being a preselected kernel function. A kernel function maps from the set of real numbers to the set of real numbers. It adheres to the following properties: (i) Non-negativity: the kernel function
must be non-negative. (ii) Normalization: The integral (or sum in the discrete case) of the kernel function over the entire real line must equal 1, which means it can be interpreted as a probability density function. The bandwidth
h in (
5) is typically selected through data-driven procedures, such as cross-validation techniques. Here, we estimate non-parametric
without the
ith sample.
Let
where
. So the pseudo-estimate (
5) of non-parametric function can be reformulated in matrix form as
Substituting
for
in model (
3), we have
where
,
. If we denote
, where “≜” stands for “defined as”. Then
can be the estimator of the conditional expectation
for any
.
Given an arbitrary orthonormal basis
in
, the functional predictor
and the slope function
admit the following series expansions: Let
p represent the number of truncated basis functions, as follows:
where
,
. Let
, then the model (
6) can be rewritten as follows:
Denote
, which has mean
and covariance matrix
. Let
For model (
3), the approximation error is defined as follows:
To investigate the influence of the approximation error, we impose the following conditions on the functional predictors and regression function:
(C1) The functional predictors and the regression function adhere to the following conditions:
(i) The functional predictors reside within a Sobolev ellipsoid of order two, then there exists a universal constant C, such that
(ii) The regression function satisfies where D is a constant.
By applying the Cauchy–Schwarz inequality, we obtain the following:
Then the approximation error can be ignored as
. Model (
6) becomes as follows:
which is a high-dimensional partial linear model. Since
can be an effective measure for assessing the distance between
and zero for test (
2). Motivated by [
21], we construct the following test statistic by estimating (
8).
where
where
and
denote the sample means of
and
, respectively. By some calculations, we can obtain
,
. The test statistic
quantifies the discrepancy between
and 0 under the null hypothesis. High values of the test statistic
suggest evidence in favor of the alternative hypothesis, prompting the rejection of the null hypothesis.
3. Asymptotic Theory
To achieve the asymptotic properties of the proposed test, we first suppose the following conditions based on [
19,
21]. We denote the following:
A condition on the dimensionality of matrix is stipulated as follows:
(C2) As , ; , .
(C3) For a constant
, there exists an
m-dimensional random vector
such that
. The vector
is characterized by
,
, and for any
,
is a
matrix with
. It is assumed that each random vector
has finite fourth moments and
for some constant
. Moreover, we assume the following:
for
and
, where
d is a positive integer.
(C4) , and .
(C5) The error term satisfies .
(C6) The random variable U is confined to a compact domain , and its density function f exhibits a continuously differentiable second derivative and bounded away from 0 on its support. The kernel is a symmetric probability density with compact support and is Lipschitz continuous.
(C7) and are Lipschitz continuous and admit continuous second-order derivatives.
(C8) It is assumed that the sample size n and the smoothing parameter h satisfy the following: .
(C9) The truncated number p and the sample size n are assumed to satisfy .
Condition (C2) is widely utilized in high-dimensional data research (see [
21,
22,
23]). Condition (C3) resembles a factor model. To assess local power, we further impose condition (C4) on the coefficient vector
. In fact, (C4) can serve as the local alternative as its distance measurement between
and 0. This local alternative can be also found in [
21]. (C5) is the typical assumption for the error term
. Conditions (C6–C8) are very common in non-parametric smoothing. (C9) is a technical condition that is needed to derive the theorems.
In practical applications, the data must satisfy conditions (C1–C3) and (C7). Conditions (C1) and (C7) are generally met for most datasets. (C2) does not specify a relationship between p and n. The matrix’s positive definiteness ensures that the regression coefficients can be identified. holds if the eigenvalues of are all bounded or the largest eigenvalue is of smaller order than , where b is the number of unbounded eigenvalues. Condition (C3) essentially assumes that the functional predictor is based on a latent factor model, where the factor loadings meet the pseudo-independence assumption. If is a Gaussian process, it can be expanded as , with being independent standard normal random variables. This expansion is a special case of (C3) when the -th element of the transformation matrix is . These conditions are generally met for most data and do not affect the validity of the proposed test. Many datasets can be regarded as following a Gaussian process, such as changes in gene expression levels, logarithmic returns on financial asset prices, soil moisture, and temperature distribution.
We present the asymptotic theory for the proposed test statistic under the null hypothesis and local alternative (C4) in the subsequent two theorems:
Theorem 1. Under the assumptions of conditions (C1), (C3–C9), it follows thatwhere . It can be regarded as the covariance operator of a random variable . Theorem 2. Assume conditions (C1–C3) and (C5–C9) hold, we then have the following results under either the null hypothesis or the local alternative (C4):where represents convergence in distribution. Theorem 2 demonstrates that, under the local alternative hypothesis (C4), the proposed test statistic possesses the following asymptotic local power at the nominal significance level
:
where
denotes the cumulative distribution function of the standard normal, and
represents its
th quantile. We define
, which represents the signal-to-noise ratio. When the term
, the power converges to
, then the power converges to 1 if it has a high order of
. This implies that the proposed test is consistent. The power performance will be demonstrated through simulations in
Section 4.
According to Theorem 2, the proposed test statistic leads to the rejection of
at a significance level
when
where
and
serve as consistent estimators for
and
, respectively. We use a similar method as in [
24] to estimate the trace. That is,
where
,
,
with
. And the simple estimator
is used, which is consistent under the null hypothesis testing.
4. Simulation
This section evaluates the finite sample performance of the proposed test, including its size and power. The assessment is conducted through a series of simulation studies. Through numerical simulations, we will validate that the distribution of the proposed test statistic under the null hypothesis is consistent with the properties stated in Theorem 1. For each simulation, we create 1000 Monte Carlo samples. The basis expansion and FPCA are conducted using the R package fda.
To mitigate the probability of both Type I and Type II errors in the testing procedure, the sample size must be adequately large. However, to maintain computational efficiency during the numerical simulations, the sample size should not be excessively large. Consequently, the sample size n in this study has been set within a range of 50 to 200. To validate the effectiveness of our proposed test, the parameters are flexibly set.
Here we compare the proposed test
with the chi-square test
constructed by [
18]. The cumulative percentage of total variance (CPV) method is used to estimate the number of principal components in
. Let CPV, explained by the first
m empirical functional principal components, be defined as follows:
where
is the estimate of the eigenvalue of the covariance operator. The smallest value of
m for which CPV (
m) surpasses the threshold of 95% is selected in this section. We denote
p as the number of basis functions used to fit curves. The simulated data are produced according to the following model:
where
or
, and
is independently drawn from the uniform distribution on
. To analyze the impact of different error distributions, the following four distributions will be selected: (1)
∼
, (2)
∼
, (3)
∼
, (4)
∼
. All results about
are presented in
Supplementary Materials.
We next report the simulation results for two data structures of the predictor .
1. The predictor
is defined as
, with
normally distributed with mean 0 and variance
,
for
. The slope function
is given by
, where the coefficient
c ranges from 0 to 0.2.
corresponds to the null hypothesis. The number of basis functions used to fit curves and the sample size are taken as follows:
,
. Under different error distributions,
Table 1 and
Table 2 evaluate the empirical size and power of both tests for different non-parametric functions when the nominal level
is
.
From
Table 1 and
Table 2, the following can be seen: (i) The performances of both tests remain consistent across various error distributions and non-parametric functions; (ii) Because
is intended for functional data beyond the reach of a few principal components, the power of the proposed test is somewhat less than that of
. (iii) The power of the test increases with the sample size
n, but it is not significantly affected by increases in the parameter value
p. In fact, for the functional data structure given in Simulation 1, the number of principal components selected is relatively small, regardless of the number of basis functions used to fit the functional data.
2. The functional predictor is constructed using the expansion in (
7), with
representing the Fourier basis function on [0,1] defined as
,
,
,
,
. The first
p of the basis functions will be used to generate the prediction function and slope function. Let
,
, where
,
, the coefficient of slope function
with
and
c varying from 0 to 1.
corresponds to the case in which
is true. The coefficients of predictor
follow the moving average model:
where the constant
T adjusts the degree of dependence among the elements of the predictor.
are drawn independently from the distribution
with
. The element at the
position of the covariance matrix
for coefficient vector
is
where
is independently generated from the uniform distribution
.
The bandwidth is chosen using cross-validation (CV). At a significance level of
,
Table 3 delineates the empirical size and power of the two tests when the function
is linear.
Table 4 presents the results for the case where
is a trigonometric function.
From
Table 3 and
Table 4, the number of basis functions used for fitting functions has a very important impact on the test. Specifically, (i) Across various error distributions, as
p increases, the empirical size of test
significantly exceeds the nominal level, whereas our proposed test
maintains stable performance; (ii) The power of the test increases with the sample size
n. Conversely, it decreases as the values of
p increase. (iii) The proposed test demonstrates robustness across all scenarios presented in this simulation study. Actually, for the functional data structure given in Simulation 2, selecting too many principal components negates the effectiveness of FPCA-based test statistics. Instead, the proposed test has great advantages (see bold numbers in
Table 3 and
Table 4).
To more effectively verify the accuracy of the asymptotic theory underlying our proposed test statistic,
Table 5 provides the mean and standard deviation (sd) of the test statistic under different scenarios. From
Table 5, it is observed that when
, the mean of our proposed test statistic fluctuates around zero, and the standard deviation fluctuates around one. This aligns with the theoretical expectations. As
c increases, the mean of the test statistic moves further away from zero, and the standard deviation moves further away from one, indicating a departure from the null hypothesis.
Furthermore, to verify the asymptotic theory of our proposed test, we consider the case where
.
Figure 1 and
Figure 2 draw the null distributions and the q-q plots
, corresponding to
and
, respectively. The null distributions are represented by the dashed lines, while the solid lines are density function curves of standard normal distributions.
For different
,
Figure 3 and
Figure 4, respectively, show the empirical power functions of the proposed test statistics. These figures are presented for four different error distribution functions. The function
is linear in
Figure 3 and trigonometric in
Figure 4. When
, the empirical power functions of the proposed test are represented by solid lines, dashed lines, and dotted lines, respectively. From
Figure 3 and
Figure 4, it can be seen that the power increases rapidly as long as
c increases slightly. The test’s power is positively related to the sample size
n and inversely related to the magnitude of
p. The proposed test is stable under different error distributions. These are consistent with the conclusions in
Table 3 and
Table 4.
It is worth noting that, theoretically, a kernel function
is sufficient if it satisfies the conditions of symmetry and Lipschitz continuity. In practical applications, however, the choice of kernel function should be based on the characteristics and requirements of the data. For instance, the Epanechnikov kernel is more suitable for bounded data, while the Gaussian kernel is better suited for data with long tails. In this simulation study, according to the given data setting, the Epanechnikov kernel was chosen. To compare the effects of the two kernels, we replaced the Epanechnikov kernel used to generate
Figure 4 with a Gaussian kernel to produce
Figure 5. From
Figure 4 and
Figure 5, it can be observed that the impact of the two kernels on the test is relatively minor.
The numerical simulations show that our proposed test performs well for the data types. However, with larger sample sizes, the numerical simulations in this paper require considerable computational time, which is a limitation of the proposed test statistic. Additionally, its performance on the datasets that violate the assumptions (C1–C3) and (C7), such as when the real data are not in a Sobolev ellipsoid of order two, remains to be seen.
5. Application
This section applies the proposed test to the spectral data, which has been described and analyzed in the literature (see [
25,
26]). This dataset can be obtained on the following platforms:
http://lib.stat.cmu.edu/datasets/tecator (accessed on 16 July 2024). Each meat sample is characterized by a 100-channel spectrum of absorbance, along with the moisture (water), fat, and protein contents. The absorbance is calculated as the negative logarithm base 10 of the transmittance, as measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry. The dataset comprises 240 samples, partitioned into 5 subsets for the validation of models and extrapolation studies. In this section, we utilize a total of 215 samples, which include both training and test samples drawn from the 5 subsets. The spectral measurement data consist of curves, represented by
, corresponding to absorbance values recorded at 100 equally spaced wavelengths from 850 nm to 1050 nm. Let
represent the fat content as the response variable,
represent the protein content, and
represent the moisture content. Similar to [
27], the following two models will be used to assume the relationship between them:
The present investigation primarily focuses on the test in models (
10) and (
11):
. The number of basis functions used for fitting function curves
p is selected as 129.
Figure 6 shows the estimation of slope function
in models (
10) and (
11).
The calculation results are as follows: (i) For model (
10), the value of the statistic is
;
p-value is 0. (ii) For model (
11), the value of the statistic is
;
p-values are 0.386. From this, we can see that the model test (
10) is significant, while the model test (
11) is not significant. This result can also be reflected in
Figure 6. It is obvious that the estimated value of
on the right side of
Figure 6 is much smaller than that on the left side.