1. Introduction
In an agricultural context, profitability, sustainability, efficiency in resource usage, quality of production and management decisions are supported by “precision agriculture” (PA), which involves methodologies, modeling tools and strategies to improve the quality of financial decision making in the agricultural sector. Different data modelling methodologies are used for PA, including regression models (Van de Putte et al. 2010), artificial neural networks (Shoshi et al. 2021; Kujawa and Niedbała 2021) and other machine learning techniques (Chlingaryan et al. 2018; Liakos et al. 2018). In addition, there are many studies about PA from other fields of applied science, such as electronic engineering, chemistry, biology and other natural sciences. This paper addresses the data modelling part through the use of a semiparametric time-series model and a local polynomial estimator, providing estimates with less risk and a more accurate representation of the data structure.
Over the last decade, data science applications and modelling tools have been used more frequently in agricultural studies, leading to an extensive literature on the subject. In the context of regression modelling, some important studies are as follows: Gonzalez-Sanchez et al. (2014) focused on accurate yield estimation based on machine learning methods and a linear regression model. Regarding the use of nonparametric methods in agricultural data analysis, Färe et al. (2013) presented a detailed review. Grigorios (2009) introduced a nonparametric regression-based kernel density estimator to represent the production function. Sam (2010) modeled the market risks of the agricultural futures of corn, soybeans and wheat by using a nonparametric kernel estimator. In addition, Zvizdojevic and Vukotic (2015), Ogundari and Brümmer (2011), Wang et al. (2016), Majumdar et al. (2017) and Shoshi et al. (2021) provide important contributions to the modeling of agricultural data using regression models and other machine learning techniques.
The studies given above use parametric and nonparametric analysis tools to model agricultural data. Parametric methods such as linear regression models require strict assumptions about the data structure. Nonparametric estimators, unlike parametric approaches, are very flexible; however, their estimation quality and accuracy diminish greatly when several predictors are added to the model, which is known as the “curse of dimensionality”. This paper considers a semiparametric time-series regression model to address both problems: the curse of dimensionality and the need for strict assumptions regarding the data structure. Semiparametric regression models combine the features of parametric and nonparametric models and thereby avoid the disadvantages of each. In detail, the parametric component of the model is interpreted as a linear regression model, while the nonparametric component relaxes the strict structural assumptions associated with linear regression. Moreover, the semiparametric model is easy to interpret. Although there are a number of studies on semiparametric time-series models in the literature, semiparametric techniques are largely absent from applications pertinent to “precision agriculture”. Some of the important studies on the semiparametric time-series model are as follows: Kato and Shiohama (2009), Gao and Phillips (2010), and Aydın and Yılmaz (2021). These studies used kernel and spline-based estimators and applied them to different fields, including econometrics, censored time-series data and medical applications.
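To make the combination of a linear part and a smooth part concrete, the following is a minimal sketch rather than the estimator proposed in this paper. It assumes a single parametric covariate, independent errors (the autoregressive error structure is ignored), a Gaussian kernel and a fixed bandwidth `h`, and it uses a standard partial-residual approach with a local linear (degree-1 local polynomial) smoother. The function names and the toy data are hypothetical.

```python
import math


def local_linear(z, v, z0, h):
    """Local linear (degree-1 local polynomial) fit of v on z, evaluated at z0."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for zi, vi in zip(z, v):
        d = zi - z0
        w = math.exp(-0.5 * (d / h) ** 2)  # Gaussian kernel weight (assumed choice)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * vi
        t1 += w * d * vi
    # Intercept of the kernel-weighted least-squares line centred at z0
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


def smooth(z, v, h):
    """Apply the local linear smoother at every observed z."""
    return [local_linear(z, v, z0, h) for z0 in z]


def fit_partially_linear(x, z, y, h):
    """Partial-residual fit of y = beta * x + f(z) + error (one parametric covariate)."""
    # Remove the part of x and y that is explained by a smooth function of z
    x_res = [xi - si for xi, si in zip(x, smooth(z, x, h))]
    y_res = [yi - si for yi, si in zip(y, smooth(z, y, h))]
    # Least-squares slope of the y residuals on the x residuals
    beta = sum(a * b for a, b in zip(x_res, y_res)) / sum(a * a for a in x_res)
    # Smooth what is left after removing the linear part to estimate f
    f_hat = smooth(z, [yi - beta * xi for xi, yi in zip(x, y)], h)
    return beta, f_hat


# Toy data, fabricated for illustration only (not the paper's dataset)
z = [i / 10 for i in range(30)]
x = [math.sin(3 * zi) for zi in z]
y = [3.0 * xi + 1.0 + 2.0 * zi for xi, zi in zip(x, z)]
beta_hat, f_hat = fit_partially_linear(x, z, y, h=0.3)
print(round(beta_hat, 6))
```

The toy smooth component is deliberately linear in `z`, because a local linear smoother reproduces linear functions exactly; this makes the recovered slope easy to check. With real data the bandwidth would need to be selected, for example by cross-validation.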
In contrast to the studies mentioned above, the main purpose of this study is to contribute to PA by proposing a semiparametric local polynomial estimator (LPE) for modelling agricultural time-series data and to show the effects of three main financial risk factors (currency exchange rates, foreign investment, and interest rates) on agricultural data. If statistically significant and higher-quality estimates of the response variable (i.e., crop yield values) are obtained using the LPE, this may provide a critical advantage in managing agricultural productivity and financial decision-making processes. The factors that affect the response variable can then be identified through the statistical significance of the parametric and nonparametric components of the model. The statistical properties of the LPE are derived in Section 3. Both the semiparametric time-series model and the LPE can be easily understood and interpreted, which is beneficial for farm managers and researchers carrying out data analysis for the purpose of making sound financial decisions in the agricultural sector.
The data analyzed in this paper comprise agricultural data and financial risk factors for the whole of Turkey. The dataset contains observations from 1962 to 2020. Cereal yield (kg per hectare) is the response variable. The selected predictors are the official exchange rate (USD), foreign direct investment (% of GDP) and the interest rate, as decided by the Central Bank of the Republic of Turkey. Land used for cereal production (km²) is chosen as the nonparametric covariate on the basis of its relationship with yield.
The organization of the paper is as follows: Section 2 offers a detailed overview of both the semiparametric time-series model and the LPE. Section 3 provides the finite sample properties of the introduced LPE as well as the evaluation metrics used to measure the quality of the LPE in modeling crop yield data. Section 4 comprises the analysis and results for the estimation of the semiparametric time-series model for the crop yield data. The parametric and nonparametric components of the model are presented individually. Finally, conclusions are presented in Section 5.
3. Statistical Properties and Evaluation Metrics
In this section, the finite sample properties of the proposed LPE are discussed. Regarding the parametric component of the model, Equation (10a) is expanded to show the bias and variance of . Note that partial residuals are needed here; these are defined after Equation (9). Because the model involves autoregressive error terms, both bias and variance involve the matrix. Before the calculations are made, some assumptions are needed in order to obtain accurate bias and variance of the regression coefficients. These assumptions are as follows:
A1. The regression function has a bounded second partial derivative.
A2. The matrix of parametric covariates has a continuous density function .
A3. is bounded as .
A4. The standard assumptions on the kernel function hold: is a continuous bivariate kernel function and .
A5. To provide the asymptotic normality, is bounded away from zero as , which is considered together with A3.
The bias and variance of the regression coefficients are presented in Theorem 1 under these assumptions.
Theorem 1. Assume that (A1)–(A5) hold. The expanded form of is thus given by: where is the partial residuals for . From (11), the bias and covariance matrix of can be inferred easily as follows: Also, if and .
The proof of Theorem 1 is given in
Appendix A. Theorem 2 is provided below to demonstrate the distribution of
.
Theorem 2. Assume that (A1)–(A5) hold. Let be the distribution function of the standard normal distribution and . Accordingly, the following expressions can be written: Here, this result shows that, whether or not the smooth function is present in the model, the estimate of has a-convergence to .
Note that the bias and variance–covariance matrix of the regression coefficients given in (12) and (13) are used as measurement tools to evaluate the behavior of the LPE in crop yield data modelling. Moreover, the model variance is generally unknown. Therefore, an estimate of it is used, calculated as follows:
where
and
are given in (10c).
In addition, root mean squared error (RMSE) scores for the nonparametric component estimation are calculated as:
After the parametric and non-parametric components, two criteria popular in the time-series literature are introduced to show the performance of the LPE for the semiparametric time-series model. These criteria are given below:
Hence, the performance of the LPE can be evaluated by using fitted values from the time-series model. Note that the semiparametric LPE differs from its alternatives by involving both parametric and nonparametric components. This feature makes the LPE more flexible than conventional alternatives such as linear estimators or autoregressive models. In this context, from our point of view, the effects of the financial risk factors on the crop yield are represented better by the LPE than by existing methods.
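As an illustration of how such criteria are computed from fitted values, the sketch below implements RMSE and the mean absolute percentage error (MAPE) under their common textbook definitions. These definitions are an assumption and may differ in detail from the expressions used in this paper; the function names are hypothetical, and MARE is omitted because its definition varies across sources.

```python
import math


def rmse(actual, fitted):
    """Root mean squared error between actual and fitted values."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / n)


def mape(actual, fitted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, fitted))
```

Both functions take two equal-length sequences; lower values indicate a closer fit, which is how the scores reported for the competing models are read.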
In addition,
Table 1 is presented below to provide some basic information about the data of interest. Detailed information about the data is given in
Section 4.
Table 1 presents the descriptive statistics of the variables.
4. Analysis of the Effects of Financial Risk Factors on Crop Yield
As mentioned in
Section 1, a cereal yield dataset collected between years 1962 and 2020 is modelled by the introduced LPE. The dataset was collected from the following website:
https://data.worldbank.org/indicator/NV.AGR.TOTL.ZS?locations=TR (accessed on 28 January 2022). The cereal yield (kg/hectare) variable (yield) is considered as the response variable to be explained using a multiple-predictor time-series model. There are many potential predictors for modelling the yield variable; however, this study focuses on some of the main financial risk factors, which are explained below in detail. In this section, the important predictors are determined according to their linear correlation with the response variable. The nonparametric covariate of the model is chosen by inspecting its scatter plot against the yield. Accordingly, the explanatory variables for both the parametric and nonparametric components of the model are listed as follows:
Notice that the variable names used in the semiparametric time-series model are given on the right side of the list.
Figure 1 is obtained by using R software.
Figure 1 displays the correlations between the parametric covariates and yield, which is used as the response variable. It should also be noted that panel (a) in
Figure 1 shows the scatter plots for each combination of variables, as well as the density plots for each variable and the graph giving the correlations between the variables, while panel (b) displays the correlogram showing the strength of the correlations between the variables.
The reason for choosing “Land under cereal production” as a nonparametric covariate can be clearly seen in
Figure 2. It seems that there is a clear nonlinear relationship between land under cereal production and the response variable yield.
From the information given above, the semiparametric time-series model is written as follows:
where
and
are autoregressive error terms, as defined in Equation (2). Here, the vector of regression coefficients is denoted by
and their LPE estimate is then
. Similarly, if
is specified as a vector, and its LPE estimate is expressed as
. Note that one of the commonly used methods in the time-series literature for obtaining a model for
responses is an autoregressive (AR) model. Therefore, an AR model is used as a benchmark method and the quality of each of the two models is compared. Note that
Aydın and Yılmaz (
2021) have previously discussed a similar comparison. The Dickey–Fuller test is applied to determine the optimum lag for the AR model and the results are shown in
Table 2.
It can be seen from
Table 2 that the yield series is stationary without lag when the trend coefficient is added to the model. In order to represent the data, this paper considers an AR
model selected according to the AIC criterion, and the model coefficients are estimated as
. Thus, the AR
model is given in (17) as
where
s are normally distributed with a constant variance (white noise). Additional results of the analysis are provided in the following figures and tables.
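For readers who want to see what fitting such a benchmark involves, the sketch below estimates an autoregressive model of order 1 with an intercept by ordinary least squares. The order, the absence of a trend term, the function names and the toy series are all assumptions made for illustration; they are not the specification or the estimates reported in this paper.

```python
def fit_ar1(series):
    """OLS estimates (c, phi) for y_t = c + phi * y_{t-1} + e_t."""
    lagged, current = series[:-1], series[1:]
    n = len(current)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    # Slope = sample covariance of (y_{t-1}, y_t) over sample variance of y_{t-1}
    cov = sum((l - mean_lag) * (c - mean_cur) for l, c in zip(lagged, current))
    var = sum((l - mean_lag) ** 2 for l in lagged)
    phi = cov / var
    return mean_cur - phi * mean_lag, phi


def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * series[-1]


# Noise-free toy series generated from y_t = 2 + 0.5 * y_{t-1} (illustration only)
toy = [10.0]
for _ in range(20):
    toy.append(2.0 + 0.5 * toy[-1])
c_hat, phi_hat = fit_ar1(toy)
```

Because the model is linear in the lagged response, it has no term that could absorb a curved relationship with another covariate, which is the limitation the semiparametric model is meant to address.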
The following tables (
Table 3 and
Table 4) contain the scores for measuring the quality of the semiparametric time-series model (based on the LPE) and the AR model, respectively. The minimum scores are indicated with an asterisk. The LPE gives the minimum mean absolute percentage error (MAPE), mean absolute relative error (MARE) and model variance, which means that the best estimates are obtained with the LPE rather than the AR model. Note that the performance values of the AR model are not too distant from those of the LPE. However, regarding overall model performance, the LPE appears to be superior.
Table 4 shows the bias and variances of the parametric component estimated using the LPE. Here,
denotes the effect of
to the model,
shows the contribution of
and
denotes the effect size of
on the variables. According to the values of the regression coefficients, it can be seen that
affects the response variable the most. This variable is highly dependent on the sensitive economic structure of Turkey. Interestingly, foreign investment seems to have an effect on the yield; however, the nature of the relationship between these two variables cannot be stated with certainty. In Turkey, higher foreign investment may indirectly affect the price of fertilizers, animal feed and pesticides, which are the main agricultural expenditures. Therefore, the yield may be positively affected by foreign investment. On the other hand, although the interest rate in the country seems to have little effect on the yield, this is not sufficient evidence to rule out its long-term effect on the yield or other important agricultural indicators. Therefore, the relationship between agricultural indicators and the interest rate should be closely inspected.
Figure 3 contains the bar graph of the scores given in
Table 3. With the exception of the MARE criterion, the performance of the LPE and AR models is similar. However, in general, the LPE solves the targeted modeling problem more efficiently than the traditional AR model, which is an expected result because the semiparametric time-series model includes the nonparametric component. The success of the LPE in this context is illustrated in
Figure 4. An estimated curve of
and its 95% confidence interval are also given in this figure. The RMSE value for the estimated curve is 0.4192.
Figure 4 also shows that the estimate obtained from LPE satisfactorily represents the data.
As can be seen in
Figure 4, the relationship between land under cereal production and yield is clearly nonlinear. Because the AR model uses a linear structure, it cannot capture purely nonlinear relationships such as this one. In this context, the merit of the introduced LPE should be emphasized because it can represent both linear and nonlinear relationships between the variables.
5. Conclusions
This paper discusses modelling crop yield data using a new semiparametric estimator based on local polynomial estimation, the LPE. A Turkish cereal yield dataset is considered as the real-world data example, and the results are given in Section 4. In order to assess the LPE’s performance, the AR model, which is the traditional method for modeling time-series data in the literature, is used as a benchmark, and the results of a comparison between the two methods are presented. The results given in Table 3 and Table 4 and in Figure 3 and Figure 4 show that the proposed LPE gives satisfactory estimates for both the parametric and nonparametric components. In this context, it can be said that the LPE successfully models crop yield with less risk. Although this paper considers the AR model as a benchmark method, crop yield prediction has been studied by several authors using different estimation techniques. For instance, Chandio et al. (2020) investigated the effects of climate change factors on cereal yield in Turkey between 1968 and 2014. Unlike our paper, they used linear regression to estimate the cereal yield. Although linear models are widely used for modelling time series, they cannot capture the nonlinear effects of the explanatory variables, and it is in this respect that the semiparametric estimator reduces the risk. Some similar studies are as follows: Çakır et al. (2014) used artificial neural networks to estimate the cereal yield in Turkey. Moreover, Chandio et al. (2021) examined the modelling of cereal production for different phenomena. Using the methods from these studies, a detailed comparison could be made in future research to show explicitly the behavior of the introduced semiparametric estimator relative to alternative nonparametric and semiparametric estimation methods. On the other hand, the LPE involves only one nonparametric component, and its performance depends on the optimal bandwidth parameter. Accordingly, the LPE is limited when more than one nonparametric component is needed, and its calculation process is more complicated than those of the AR and linear models. However, the performance of the LPE compensates for these disadvantages.