1. Introduction
With the advent of the big data era, more and more functional data, providing information about objects varying over a continuum, are collected.
Currently, functional data analysis is being applied in various fields such as medicine, environmental science, and economics and is receiving increasing attention. For details on functional data analysis, see the monographs by Ramsay and Silverman [1], Horváth and Kokoszka [2], and Hsing and Eubank [3].
Several variants of functional linear regression models have been proposed to investigate the influence of functional and/or scalar predictors on a functional or scalar response and, therefore, to make predictions. Cardot [4], Tony [5], and others have utilized spline methods for estimation and prediction in functional linear regression models. In 2007, Cardot et al. [6] extended the population least squares method to functional linear models, proposing smoothing spline estimators for the coefficient function and providing asymptotic results for this estimation. In 2012, Delaigle and Hall [7] utilized partial least squares to demonstrate the consistency and convergence of functional linear models. Tony and Ming [8] studied the estimation and prediction issues of functional linear regression models within the framework of reproducing kernel Hilbert spaces. Nevertheless, these models cannot deal with general responses such as binary or Poisson responses.
The functional linear regression model is an important tool for functional data analysis, and the generalized functional regression model extends it to handle more complex response variables. It builds on the generalized linear model, which was first introduced by Nelder and Wedderburn [9] in 1972 and relates continuous or discrete response variables to the predictor variables through a link function. In 2002, James [10] proposed generalized linear models with functional predictors and applied them to standard missing data problems. In 2005, Müller and Stadtmüller [11] proposed a generalized functional linear regression model where the response variable is a discrete scalar and the predictor is functional. In 2011, Goldsmith et al. [12] developed fast fitting methods for generalized functional linear models that can be applied to various functional data designs, including functions measured with and without error and sparsely or densely sampled. In 2021, Xiao et al. [13] proposed a generalized partially functional linear regression model where the response variable is general and the predictors are scalar and functional. However, none of these models incorporates interactions between functional predictors.
To better address complex data that include both functional predictors and scalar predictors, scholars have improved the functional linear model and proposed a functional regression model with mixed predictors. In 2016, Kong et al. [14] explored the estimation and variable selection problems in cases where the parametric part is high-dimensional and the functional predictors are multidimensional. Yao [15] and Ma et al. [16] further built upon the work of Kong et al. [14], conducting more in-depth research and proving the large sample properties of the estimators. In 2020, Xu et al. [17] studied the estimation and hypothesis testing issues for models with multiple functional predictors and demonstrated the corresponding large sample properties.
In many practical applications, we need to consider the interactions between variables; failing to do so may lead to an omitted-variable problem in the model and thus to inaccurate predictions and inappropriate interpretations. Introducing interaction terms can reduce this inaccuracy and make the model more reliable, thereby improving its predictions and providing more reliable decision support. Indeed, functional linear regression models with interactions between functional predictors have been proposed recently; several examples follow. In 2016, Usset et al. [18] proposed a functional regression model with a scalar response and multiple functional predictors with two-way interactions in addition to their main effects. In 2019, Luo and Qin [19] proposed function-on-function regression models with interaction and quadratic effects, together with an efficient estimation method that has a minimum prediction error. In 2013, Yang et al. [20] introduced a class of nonlinear multivariate time-frequency functional models that can identify important features in each signal as well as the interaction of signals. Some models considered the interaction of two different time points in the functional data. In 2020, Matsui [21] proposed a functional quadratic model that takes the interaction between two different time points of the functional data into consideration. In 2020, Sun and Wang [22] also considered a quadratic regression model where the predictor and the response are both functional; they estimated the coefficient functions, predicted unknown responses, and established the corresponding asymptotic results. Nonetheless, these models cannot be applied to general scalar responses. As far as we know, only Fuchs et al. [23] in 2015 considered a general scalar response with functional predictors and linear functional interaction terms. However, one drawback of the method of Fuchs et al. [23] is that scalar predictors are not included, and a second drawback is that the asymptotic properties of the estimated regression coefficients were not established.
A practical motivation of this paper is the investigation of the influence of air quality, climate factors, medical and social indicators, and their interactions on cancer incidence, which is treated as a binary response. Cancer is one of the leading causes of death in humans; therefore, it is crucial to analyze the factors related to cancer incidence. Studying cancer incidence can help improve public health and quality of life, reduce social medical costs, and promote human health and socio-economic development. In 2022, Qiu et al. [24] pointed out that cancer incidence in China is much higher than that in the United States and the United Kingdom because China faces problems such as a large population, uneven development across regions, and a relative lag in cancer control strategies. In 2014, Qin et al. [25] indicated that long-term exposure to air pollutants, or short-term exposure to high concentrations of air pollutants such as PM2.5, may be associated with increased incidence rates of overall cancer, especially prostate cancer and female breast cancer. In 2022, Wu et al. [26] found that areas with high green coverage have a lower risk of cancer. In 2023, Cao et al. [27] analyzed the relationship between per capita GDP and cancer incidence in 55 regions of China, showing that regions with high GDP have high cancer incidence. In 2017, Xu et al. [28] conducted a statistical analysis of PM2.5 levels in Changzhou, China, and considered an interaction between PM2.5 and relative humidity during the same period, indicating a certain degree of interaction between the two. In 2022, Yang et al. [29] used the generalized linear model to study the effects of PM2.5 and relative humidity on visibility and found a significant interaction between PM2.5 and relative humidity.
Therefore, we collected data on average daily PM2.5 concentration (from 1 January 2015 to 31 December 2020), average daily humidity (from 1 January 2015 to 31 December 2020), per capita GDP, green coverage rate in built-up areas, the proportion of medical personnel (PMP, the ratio of the number of licensed (assistant) doctors to the local population), and the binary cancer incidence in 49 cities in China from http://www.cnemc.cn/, http://www.stats.gov.cn/sj/ndsj/, and http://www.chinancpcn.org.cn/home. Our aim was to investigate the influence of PM2.5 concentration, air humidity, per capita GDP, green coverage, and PMP on cancer incidence, focusing not only on the main effects but also on the interaction between PM2.5 concentration and air humidity, and to make predictions.
Existing models with interaction terms between functional predictors and general scalar responses cannot deal with multiple functional and scalar predictors, which is the case in our motivating dataset. Moreover, the asymptotic properties of estimators have not been addressed in existing models. Therefore, in Section 2, we fully consider the combined influence of functional predictors, scalar predictors, and interactions between functional predictors on a general scalar response by proposing a generalized partially functional linear model with interaction terms. In Section 3, the asymptotic properties of our proposed estimators are established. Extensive simulation studies are given in Section 4. Section 5 is reserved for the real data analysis.
4. Simulation
In this simulation, we consider a setting with two functional predictors, three scalar predictors, an interaction term between the two functional predictors, and a binary response. In order to include the case in which the functional predictors do not have the same domain, we define the functional predictors
and
,
where
n can be any positive integer. In what follows,
n takes the values of 50, 100, and 500, and, for each
n, we run 100 simulations. First, we define two orthonormal bases
and
, satisfying
Under the Gaussian assumption, we define the two randomly generated functional principal component scores
that satisfy
where
. Notice that the first three functional principal components explain up to 90% of the variation in the two predictors. So, we have
Fifty sample curves of
and
are shown in
Figure 1.
For scalar predictors, we assume , and
We assume that the theoretical values of the regression coefficients are
where
,
For the interaction term, its principal component score is denoted by
and satisfies
where
The corresponding response variable is generated by
where the link function
and
is a sequence of pseudo-random numbers.
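The data-generating steps described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the simulation design of this paper: the score standard deviations, the coefficient values, the standard normal scalar predictors, and the logit link are all hypothetical choices, and the function name `simulate_binary_response` is introduced only for this example. It relies on the fact that, when a functional predictor and its coefficient function are expanded in the same orthonormal basis, the integral of their product reduces to a sum of products of scores and basis coefficients.

```python
import math
import random


def simulate_binary_response(n, seed=0):
    """Simulate n observations from a logistic model with two functional
    predictors (represented by their principal component scores), their
    interaction, and three scalar predictors.

    All numerical values below are illustrative assumptions.
    """
    rng = random.Random(seed)
    score_sd = [1.0, 0.5, 0.25]          # assumed decaying score standard deviations
    beta1 = [1.0, 0.5, 0.25]             # assumed basis coefficients of the first coefficient function
    beta2 = [0.5, 0.5, 0.5]              # assumed basis coefficients of the second coefficient function
    gamma = [[0.2 if j == k else 0.0 for k in range(3)] for j in range(3)]  # assumed interaction surface
    alpha = [0.4, 0.6, 0.8]              # assumed scalar-predictor coefficients

    data = []
    for _ in range(n):
        xi1 = [rng.gauss(0.0, s) for s in score_sd]   # Gaussian scores, predictor 1
        xi2 = [rng.gauss(0.0, s) for s in score_sd]   # Gaussian scores, predictor 2
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]   # scalar predictors

        # Main effects: with orthonormal bases, each integral is a score-coefficient sum.
        eta = sum(a * zi for a, zi in zip(alpha, z))
        eta += sum(b * x for b, x in zip(beta1, xi1))
        eta += sum(b * x for b, x in zip(beta2, xi2))
        # Interaction: the double integral reduces to a sum over products of scores.
        eta += sum(gamma[j][k] * xi1[j] * xi2[k] for j in range(3) for k in range(3))

        p = 1.0 / (1.0 + math.exp(-eta))   # inverse logit link (assumed)
        y = 1 if rng.random() < p else 0   # Bernoulli draw from a pseudo-random uniform
        data.append({"xi1": xi1, "xi2": xi2, "z": z, "p": p, "y": y})
    return data


sample = simulate_binary_response(100, seed=1)
```

Repeating such a call for each sample size and each replication would mimic the Monte Carlo loop described in the text.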
Principal component analysis was performed for , and the results showed that the numbers of principal components of with cumulative contribution were 3, 3, 3 for the three sample sizes, respectively, and the numbers of principal components of with cumulative contribution were 2, 2, 2.
Table 1 shows how the standardized prediction error (SPE) varies with different sample sizes, and the results show that the model’s predictions become more and more accurate as the sample size increases. Here, SPE is defined by
.
Figure 2 shows
,
and the corresponding
confidence interval bands for different sample sizes, where the red curves are the theoretical values of
and
and the black curves are the corresponding estimates
and
From
Figure 2, it can be seen that, as the sample size increases, the estimated value becomes closer to the theoretical value.
Figure 3 shows the visualized 3D plot with
in the middle panel and the
confidence intervals for
in the left and right panels.
Table 2 shows the estimated values of
and their corresponding standard deviations for different sample sizes. It can be seen that, as
n increases, the standard deviation becomes smaller and the estimated value of
becomes closer to the theoretical value, where the theoretical values of
are 4, 6, and 8, respectively.
Table 3 shows the standard deviation and root mean square error for
and
for different sample sizes. Here, we use the coefficients of the basis expansion of the regression coefficient function to calculate the root mean square error. For example, the root mean square error of
is
. The results show that, as
n increases, both the standard deviation and the root mean square error become smaller, indicating that, as the sample size increases, the estimation becomes more accurate.
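As an illustration of a root mean square error computed from basis-expansion coefficients, the sketch below averages the squared Euclidean distance between estimated and true coefficient vectors over Monte Carlo replications and takes the square root. This is one common convention and is an assumption here; the exact normalization used in this paper may differ. The function name `coefficient_rmse` and the numbers in the usage line are hypothetical.

```python
import math


def coefficient_rmse(estimates, truth):
    """Root mean square error of basis-coefficient estimates.

    estimates: list of replications, each a list of estimated coefficients.
    truth:     list of true coefficients, same length as each replication.
    """
    if not estimates:
        raise ValueError("at least one replication is required")
    total = 0.0
    for est in estimates:
        if len(est) != len(truth):
            raise ValueError("coefficient vectors must have equal length")
        # Squared Euclidean distance between estimated and true coefficients.
        total += sum((e - t) ** 2 for e, t in zip(est, truth))
    return math.sqrt(total / len(estimates))


# Hypothetical example with two replications and three coefficients.
rmse = coefficient_rmse([[1.1, 0.4, 0.3], [0.9, 0.6, 0.2]], [1.0, 0.5, 0.25])
```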
5. Application
To investigate the influence of air quality, climate factors, medical and social indicators, and their interactions on cancer incidence using the proposed model, we collected data on average daily PM2.5 concentration, average daily humidity, per capita GDP, green coverage rate in built-up areas, the proportion of medical personnel (PMP), and the incidence of cancer in 49 cities in China from http://www.cnemc.cn/, http://www.stats.gov.cn/sj/ndsj/, and http://www.chinancpcn.org.cn/home.
There are two functional predictors (average daily PM2.5 concentration and average daily humidity from 1 January 2015 to 31 December 2020), three scalar predictors (per capita GDP, greenery coverage, and PMP in 2020), and the response is the cancer incidence in 2020. The ratio of the number of new cancer cases to the total number of people in China in 2020 is
. The response is binary, coded as 0 or 1 to indicate a low or high cancer incidence rate, respectively. When the cancer incidence of a city was less than
, the city was considered to have a low cancer incidence rate, denoted by 0; otherwise, the cancer incidence was high, denoted by 1.
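The coding rule above can be written as a short function. This is a minimal sketch: the city names, the rates, and the threshold value in the usage example are hypothetical placeholders and are not taken from the data of this paper.

```python
def dichotomize_incidence(city_rates, national_rate):
    """Code each city's cancer incidence as 0 (low) or 1 (high).

    A city is coded 0 when its rate is strictly below the national rate,
    and 1 otherwise, following the rule stated in the text.
    """
    return {city: 0 if rate < national_rate else 1
            for city, rate in city_rates.items()}


# Hypothetical rates and threshold, for illustration only.
labels = dichotomize_incidence(
    {"city_a": 0.0021, "city_b": 0.0035, "city_c": 0.0030},
    national_rate=0.0030,
)
```

Note that a city whose rate equals the threshold is coded as 1 under this rule.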
Figure 4 shows average daily PM2.5 concentration and daily relative humidity in 21 cities selected from the 49 cities.
We chose as the link function. The functional predictors were first subjected to functional principal component analysis, and the numbers of functional principal components for PM2.5 concentration and relative humidity were then determined from the cumulative contribution; they were chosen as , in order to explain 75% of the variation.
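Selecting the number of components from the cumulative contribution can be sketched as follows: given the eigenvalues from a functional principal component analysis, take the smallest number of leading components whose share of the total variance reaches the target. The function name `n_components_for_variance` and the eigenvalues in the usage line are hypothetical; only the 75% target comes from the text.

```python
def n_components_for_variance(eigenvalues, threshold=0.75):
    """Smallest number of leading components whose cumulative
    proportion of variance is at least `threshold`."""
    vals = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(vals)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(vals, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(vals)


# Hypothetical eigenvalues: the first two explain 80% of the variance.
k = n_components_for_variance([5.0, 3.0, 1.0, 1.0], threshold=0.75)
```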
Prediction accuracy was assessed by generalized cross-validation (GCV), which gave a value of 0.0038.
The results of the regression coefficients for the scalar predictor variable
are shown in
Table 4, where we can see that the per capita GDP is positively correlated with the incidence of cancer, i.e., the higher the GDP per capita, the higher the incidence of cancer in that city, which is consistent with the findings of Cao et al. [27]. A possible reason is that the promotion of cancer screening, early diagnosis, and treatment in the more economically developed regions has, to some extent, facilitated the detection of the disease. The greenery coverage is negatively correlated with the cancer incidence, i.e., the higher the greenery coverage, the lower the cancer incidence, which is also consistent with the findings of Wu et al. [26]. A high green coverage rate may imply better air quality, which in turn may reduce the risk of cancer. Additionally, a high green coverage rate may provide more outdoor recreational spaces, promoting physical activity and exercise, contributing to good physical health, and, thus, reducing the risk of cancer. The PMP is positively correlated with the incidence of cancer. Cancer incidence is age-related, and older people are more susceptible to cancer. A higher PMP indicates better medical conditions and a longer average life expectancy and, therefore, a higher cancer incidence.
The regression coefficient functions
and
for the functional predictors are shown in
Figure 5. From
Figure 5, we can see that the effect of PM2.5 concentration on cancer incidence is generally positive, i.e., the higher the PM2.5 concentration, the higher the cancer incidence. This result is consistent with Qin et al. [25] from 2014. The positive association between humidity and cancer incidence is more pronounced, i.e., the higher the humidity, the higher the cancer incidence. Several mechanisms may contribute to this. In high-humidity environments, there may be a higher presence of mold and fungi, and the spores and harmful substances released by these microorganisms may have negative effects on human health, increasing the risk of cancer. In addition, pollutants in the air are more likely to adhere to suspended particles in high-humidity environments, making them more easily inhalable by humans. These pollutants include PM2.5, organic compounds, and heavy metals, which are believed to be associated with the occurrence of cancer. High humidity may also increase the survival time of bacteria and viruses in the air, increasing the chances of infection. Certain viruses such as hepatitis B virus and human papillomavirus (HPV) are believed to be associated with the occurrence of cancer.
The interaction surface estimate
(middle) ± two times the estimated standard errors (left and right) are given in
Figure 6.
Figure 7 shows the contour map of
, from which it can be seen that
decreases and then increases with
when
and increases and then decreases with
when
. Under higher humidity, PM2.5 particles may be more prone to settling, which reduces the suspended harmful particles in the air and potentially lowers the incidence of cancer. Conversely, under lower humidity, PM2.5 may be more likely to remain suspended in the air, increasing respiratory exposure and thereby raising the incidence of cancer. Additionally, PM2.5 concentration and humidity may not fluctuate synchronously throughout the day. By introducing interaction terms, the model can capture these temporal complexities, making the estimation results more in line with real-world conditions.
To verify the necessity of considering the interaction term, i.e., to demonstrate the effectiveness of our proposed method, we compare mod1 proposed in this paper with mod2, which does not include the interaction term, i.e.,
The general standards for evaluating model performance are the AIC (Akaike Information Criterion), residuals, R-squared, RMSE (root mean square error), and MAE (mean absolute error). Smaller values of AIC, residuals, RMSE, and MAE indicate that the model's fitting effect and generalization ability are better. The R-squared takes a value between 0 and 1, and the larger the value, the better the model's fitting effect. According to Table 5, the AIC, residuals, RMSE, and MAE values of mod1 are smaller and its R-squared is much closer to 1 than that of mod2, which indicates that mod1 performs better. Thus, including the interaction term between PM2.5 concentration and relative humidity improves the model fit and makes the research results more meaningful.
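For a binary response, criteria of this kind can be computed from the fitted probabilities. The sketch below is illustrative only and is not the computation behind Table 5: it assumes a Bernoulli log-likelihood, defines AIC as twice the number of parameters minus twice the log-likelihood, and measures RMSE and MAE between the observed 0/1 responses and the fitted probabilities. The function name `binary_fit_metrics`, the fitted probabilities, and the parameter counts are hypothetical.

```python
import math


def binary_fit_metrics(y, p, n_params):
    """AIC, RMSE, and MAE for a fitted binary-response model.

    y:        observed responses (0 or 1).
    p:        fitted probabilities that the response equals 1.
    n_params: number of estimated parameters in the model.
    """
    if len(y) != len(p) or not y:
        raise ValueError("y and p must be non-empty and of equal length")
    eps = 1e-12  # guards against log(0) for fitted probabilities at 0 or 1
    loglik = sum(
        yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1.0 - pi, eps))
        for yi, pi in zip(y, p)
    )
    n = len(y)
    return {
        "aic": 2 * n_params - 2 * loglik,
        "rmse": math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n),
        "mae": sum(abs(yi - pi) for yi, pi in zip(y, p)) / n,
    }


# Hypothetical fitted probabilities for two competing models.
y_obs = [1, 0, 1, 0]
with_interaction = binary_fit_metrics(y_obs, [0.9, 0.1, 0.8, 0.2], n_params=4)
without_interaction = binary_fit_metrics(y_obs, [0.6, 0.4, 0.6, 0.4], n_params=3)
```

Because AIC penalizes the number of parameters, a model with an extra interaction term is preferred under this criterion only if its gain in log-likelihood outweighs the added penalty.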
6. Discussion
This paper proposes a generalized partially functional linear model with interaction terms. We first use principal component analysis to reduce the dimensionality of the functional data, then obtain estimates of the unknown parameters by maximum likelihood estimation, prove the asymptotic properties of the estimators, and finally conduct simulation studies and apply our model to a real data example.
As the incidence and mortality of cancer in China are increasing year by year, it is necessary to study the influencing factors and formulate corresponding measures. We investigated the effects of PM2.5 concentration, average daily humidity, per capita GDP, the greenery coverage of built-up areas, and PMP on cancer incidence in 49 cities in China. The results showed that PM2.5 concentration and relative humidity have a generally positive effect on cancer incidence. Greenery coverage in built-up areas is negatively associated with cancer incidence, while per capita GDP and the proportion of medical personnel are positively associated with it. The higher the economic level and the more developed the medical conditions, the longer the average life expectancy of people and, therefore, the higher the cancer incidence. Comparing this model with the model without the interaction term shows that considering the interaction term leads to more accurate and meaningful predictions.
Our research lays a foundation for further study of generalized partially functional linear models with interaction terms and with an unknown link function or variance function.