1. Introduction
In the field of environmental sciences, forecasting offers significant socioeconomic benefits to society at large [1]. Recent advancements in artificial intelligence (AI) have enhanced prediction capabilities, enabling the generation of numerous machine-based forecasts with improved accuracy. These advancements have been facilitated by cutting-edge technological developments [2]. Weather forecasting that includes parameters such as humidity, temperature, pressure, solar irradiation, and wind speed can be accomplished using various statistical and mathematical models. The applications of weather forecasting models developed based on data collected from ground, satellite, and radar images are extensive, covering fields such as transportation, disaster management, construction, and agriculture [3]. Additionally, these models play a critical role in optimizing agricultural practices by providing accurate and timely weather predictions [4]. To this day, the accurate prediction of hyperlocal weather parameters remains a challenging task due to the complexity of interactions among atmospheric parameters and the inherent limitations in capturing fine-grained spatial and temporal variations. By implementing IoT sensor networks, it is possible to acquire real-time, high-resolution data from several spatially distributed yet proximate locations. The resulting increase in gathered data can substantially improve the accuracy of weather parameter predictions [5].
Agriculture has long been in the spotlight in numerous countries because of the ongoing climate changes that affect crop yield. Costs are increasing while yields are decreasing, prompting farmers to abandon agriculture for other jobs. The Republic of Croatia is one of many countries that offer government incentives to increase crop production and to help farmers cope with rising production costs and more frequent extreme weather conditions. To address these issues and improve farmers’ livelihoods, technology is increasingly being adopted in agriculture. Researchers and innovators are developing new techniques to enhance cultivation productivity. Weather prediction plays a crucial role in helping farmers plan crop production and estimate yields, facilitating effective crop management [6].
Embedding Internet of Things (IoT) technologies resolves various challenges across diverse applications such as smart homes, agriculture, and weather forecasting. Traditional weather reporting systems often fail to provide consistent accuracy in these increasingly adverse conditions [
7]. IoT devices are capable of gathering data from proximate locations, enabling real-time monitoring of environmental parameters such as air pressure, temperature, solar irradiation, and humidity [
5]. This capability is further enhanced by the deployment of spatially distributed sensors, which provide high-resolution data essential for accurate and timely weather predictions [
8]. Additionally, these sensors play a crucial role in anomaly detection and improving the spatial resolution of weather models [
9]. The development of the IoT has enabled the acquisition of real-time weather forecasts. This advancement allows high-precision predictive modeling to be achieved by applying machine learning (ML) algorithms to the collected data [
10]. Additionally, the integration of IoT devices with ML techniques significantly enhances the accuracy and reliability of weather predictions [
11]. Weather prediction encompasses a variety of methods, ranging from relatively simple environmental analyses to highly complex automated mathematical models. The temporal scope of weather forecasts can vary from one day to several months [
2]. Given the non-linear relationship between crop yield and influencing factors, ML techniques are suitable for yield predictions. ML, IoT, and remote sensing technologies are transitioning traditional farming to smart farming, with applications like smart irrigation, remote monitoring, and crop growth tracking, providing innovative solutions for crop cultivation [
6].
Over the past years, AI methods have gained popularity in various fields. Among these, ML stands out as a potent tool for enhancing the accuracy and reliability of models used for predicting different parameters. More precise results are often achieved when weather forecasts cover smaller areas and shorter timespans. ML models usually achieve better estimation precision when they are trained on large datasets [12]. ML models can also be improved by learning from their errors [13].
Considering that numerous methods can be used to develop weather parameter models, the aim of this manuscript was to analyze and compare various Decision Tree (DT), Support Vector Machine (SVM), Gaussian Process Regression (GPR), and linear regression methods for application in agriculture. Considering the advancements in AI and ML technologies and the overall move toward smart agriculture and crop cultivation, an overview of the efficiency of the different models is presented through statistical analysis. Our research presents two major contributions. The first contribution is the comprehensive analysis of various ML technologies applied in agriculture, identifying the optimal methods for predicting specific agrometeorological parameters. Numerous studies underscore the critical role of accurate weather prediction in agriculture, significantly impacting crop yield and farm management practices. While the use of ML techniques such as linear regression, DT, SVM, and GPR for weather prediction is well documented, there is a need to identify the most effective ML techniques for specific meteorological parameters and geographical areas. Existing research often lacks comprehensive datasets essential for developing robust predictive models, a gap we address by collecting detailed data from IoT sensors across urban, suburban, and rural areas, focusing on agricultural applications. Additionally, while previous studies have applied individual ML models to weather prediction, there is a scarcity of comparative analyses evaluating multiple models across different geographical areas and meteorological parameters. This study fills this gap by comparing 19 regression models for temperature, humidity, air pressure, and solar irradiation. Furthermore, many studies focus on broad geographical scales, potentially overlooking the nuances of local weather patterns.
By emphasizing predictions based on Global Positioning System (GPS) coordinates and localized data, this research aims to enhance the precision of weather forecasts for specific micro-locations, benefiting urban agriculture, which is experiencing significant growth. The proposed methods are designed to use a minimal number of input parameters, tailored to the specific parameter being estimated, with GPS coordinates playing a crucial role in determining the micro-location of the predictions. This ensures that our models are efficient and accurate for localized weather prediction. Our test results reveal that the Exponential GPR model achieved the highest R-squared (R2) for both solar irradiation and temperature predictions. For humidity, the Exponential GPR and Bagged Trees models showed the highest accuracy. In air pressure prediction, the Rational Quadratic GPR model excelled, particularly in rural areas. These findings emphasize the robust performance of advanced regression models, especially the Exponential GPR, in accurately predicting meteorological parameters across various regions.
The second contribution is the introduction of a novel database, created using sensor data, which provides more extensive and detailed information for the selected region. This database, which consists of measured values of temperature, air pressure, solar irradiation, and humidity, supports the development of new models for similar locations in the future and offers valuable insights into urban agriculture. The detailed data captured by our sensors enable more precise agrometeorological predictions, which are essential for optimizing agricultural practices and improving crop yields in urban settings. This database stands out because it collects high-resolution weather data from a network of IoT sensors in urban, suburban, and rural areas, providing more detailed and localized information than other available databases. Unlike older databases that rely on less granular data or cover fewer locations in the region of interest, our database captures a wider range of environmental conditions with high temporal resolution, leading to more accurate and specific agrometeorological predictions. Comparative studies show that it offers superior detail and specificity compared to databases from sources like the National Centers for Environmental Information (NCEI) [
14] and the European Centre for Medium-Range Weather Forecasts (ECMWF) [
15]. The proposed database was used in modeling to present a comprehensive analysis of the presented data. Considering the fact that the model accuracy can be greatly affected by the size of the area taken into consideration, in part of the analysis, the data used in modeling were divided into three subcategories: rural, suburban, and urban.
Although numerous research papers describe various solutions for using IoT sensors to collect, store, and display weather data, there is a notable scarcity of open-access databases with meteorological data acquired using IoT sensors that can be used for developing ML models specifically for agricultural applications. The development of IoT-based weather reporting systems has shown the potential for creating open-access databases that provide real-time weather data [
16]. Similarly, advancements in robust and affordable automatic weather stations emphasize the importance of these open-access resources for continuous and reliable data collection [
17]. Furthermore, cost-effective IoT-based weather monitoring systems highlight the need for accessible databases that can enhance the precision and efficiency of weather forecasting models [
18]. Existing databases such as those from the National Centers for Environmental Information [
14], the European Centre for Medium-Range Weather Forecasts [
15], the National Aeronautics and Space Administration (NASA) [
19], the World Meteorological Organization (WMO) [
20], Meteostat [
21], Kaggle Datasets [
22], the Global Historical Climatology Network (GHCN) [
23], and OpenWeatherMap [
24] are valuable resources for developing ML models. However, our objective is to create an open-access database with a substantial amount of data for smaller, agriculturally significant areas in Croatia that are not extensively covered in many existing databases. This database will include data collected from various types of IoT sensors, enabling the analysis of model accuracy across different area sizes.
This paper is organized as follows. Following the Introduction, Section 2 provides detailed information on ML technologies that can be applied for agricultural purposes, their previous application for such purposes, and an overview of the technologies we used in our research. Section 3 presents the test setup, gives an overview of the hardware used to create the proposed database, and provides general information about the database itself. Section 4 presents the developed models, along with the statistical analysis and results of model verification. Section 5 offers an overview of the test results and provides recommendations for optimal models for estimating weather parameters.
2. Modeling of Meteorological Parameters Using Machine Learning
Given that the objective of this paper is to explore the application and efficiency of ML techniques for developing models to estimate weather parameters (temperature, humidity, solar irradiation, and air pressure), the following section will provide an overview of state-of-the-art solutions documented in the literature. Emphasis will also be placed on the specific ML techniques employed in these studies and open issues.
Commonly used ML techniques include linear regression, DT, and SVM, addressing both classification and regression problems [
12]. These techniques are crucial for various applications, enhancing predictive accuracy and model reliability [
13]. A DT operates as a classification model, demonstrating a recursive partitioning of the instance space. Supervised ML utilizing DTs has long been applied to regression problems to enhance prognostic accuracy [
25]. To achieve an optimal tradeoff between bias and variance as the models evolve from simple to complex, ensembles of trees are employed. Bagging is utilized to reduce variance [
26], whereas boosting aims to mitigate errors from previous trees during data partitioning [
27]. The DT approach can be employed for weather prediction by initially training the ML algorithm on historical climate data. The acquired model can then be applied to forecast various input variables such as temperature, solar irradiation, air pressure, and humidity [
10]. This method has been shown to improve predictive accuracy and reliability in weather forecasting applications by leveraging past data and ML techniques [
28]. Additionally, the integration of DT models with advanced ML techniques has demonstrated significant enhancements in the precision of weather predictions [
29].
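The variance-reduction idea behind bagging described above can be illustrated with a minimal sketch. It is not the implementation used in the cited studies: it assumes a single input feature, one-split regression trees (stumps), a minimum leaf size of four as quoted later in this paper, and a made-up step-shaped temperature profile. All function names and the number of trees are hypothetical.

```python
import random
from statistics import mean

MIN_LEAF = 4  # minimum leaf size quoted in the text


def fit_stump(xs, ys, min_leaf=MIN_LEAF):
    """Fit a one-split regression tree on a single input feature."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(min_leaf, len(xs) - min_leaf + 1):
        lo, hi = order[cut - 1], order[cut]
        if xs[lo] == xs[hi]:
            continue  # cannot place a threshold between identical inputs
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (xs[lo] + xs[hi]) / 2, left_mean, right_mean)
    if best is None:  # no admissible split: always predict the overall mean
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the individual tree predictions."""
    return mean(predict_stump(s, x) for s in stumps)


# Made-up data: hour of day versus a step-shaped temperature profile.
hours = [float(h) for h in range(24)]
temps = [15.0 if h < 12 else 25.0 for h in hours]
model = fit_bagged(hours, temps)
morning = predict_bagged(model, 3.0)
evening = predict_bagged(model, 20.0)
```

Averaging over bootstrap resamples is what distinguishes bagging from boosting, where each new tree would instead be fitted to the errors left by the previous ones.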
SVM is a powerful supervised ML algorithm with a broad range of applications, including the prediction of weather parameters. Traditionally, SVM has been employed for classification tasks. However, with the introduction of decision boundaries and hyperplanes, its use in regression tasks has increased. The objective of regression is to consider points within the decision boundary [
27]. The primary goal of SVMs is to identify an optimal hyperplane that classifies data points into distinct categories. It can also accurately predict continuous target variables. While the model is being trained, the SVM algorithm can adjust hyperplane parameters to minimize mistakes in regression tasks or maximize the margin among classes [
2]. SVMs are particularly effective for high-dimensional data and datasets with non-linear relationships, making them a robust ML technique [
30]. These capabilities allow SVMs to provide significant improvements in prediction accuracy and model performance in various applications [
31].
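The remark about considering points within the decision boundary corresponds, in the standard formulation of support vector regression, to the epsilon-insensitive loss: residuals smaller than a tolerance epsilon are not penalized. The source does not spell this out, so the sketch below is only an illustration of that textbook loss; the function name, the default tolerance, and the sample values are assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Mean epsilon-insensitive loss.

    Residuals inside the tube of half-width epsilon cost nothing;
    larger residuals are penalized linearly by the amount they exceed it.
    """
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Made-up temperatures in degrees Celsius: the first residual (0.3) lies inside
# the tube, the second (1.0) exceeds it by 0.5.
example_loss = epsilon_insensitive_loss([20.0, 21.0], [20.3, 22.0], epsilon=0.5)
```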
GPR utilizes kernel functions for non-linear regression tasks [32]. Beyond performing non-linear regression, GPR also provides a Gaussian predictive distribution for outputs at previously unseen inputs [33]. By effectively employing Bayes’ theorem of conditional probability, this technique interpolates observations at regular intervals [27].
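To make the idea of a Gaussian predictive distribution concrete, the following sketch computes the posterior mean and variance of a GP at a new input. It is a simplified illustration, not the MATLAB implementation used in this work: it assumes a one-dimensional input, the textbook exponential kernel sigma_f^2 * exp(-|a - b| / length), fixed hyperparameters (in practice these are fitted to the data), and made-up sample values.

```python
import math


def exponential_kernel(a, b, sigma_f=1.0, length=3.0):
    """Exponential kernel: sigma_f^2 * exp(-|a - b| / length)."""
    return sigma_f ** 2 * math.exp(-abs(a - b) / length)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def gpr_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance at x_new for a GP with a constant mean."""
    offset = sum(y_train) / len(y_train)  # constant mean, estimated from the data
    centered = [y - offset for y in y_train]
    n = len(x_train)
    gram = [
        [exponential_kernel(x_train[i], x_train[j]) + (noise if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    k_star = [exponential_kernel(x, x_new) for x in x_train]
    alpha = solve(gram, centered)
    post_mean = offset + sum(k * a for k, a in zip(k_star, alpha))
    weights = solve(gram, k_star)
    post_var = exponential_kernel(x_new, x_new) - sum(
        k * w for k, w in zip(k_star, weights)
    )
    return post_mean, max(post_var, 0.0)


# Made-up observations: hour of day versus temperature in degrees Celsius.
train_hours = [0.0, 6.0, 12.0, 18.0]
train_temps = [14.0, 16.0, 27.0, 22.0]
mean_at_noon, var_at_noon = gpr_predict(train_hours, train_temps, 12.0)
mean_far, var_far = gpr_predict(train_hours, train_temps, 1000.0)
```

At an observed input the predictive variance collapses toward the noise level, whereas far from all observations the prediction falls back to the prior mean with the full prior variance. This is the uncertainty information that distinguishes GPR from the other regression families compared here.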
Considering that this research aims to develop and compare models for estimating weather parameters for agricultural applications, specifically maize cultivation, one of the analyzed parameters was solar irradiation, which is crucial for crop growth. Solar irradiation estimation is examined in multiple research papers, given that solar energy is extensively investigated in the context of solar power plants. Concerning this topic, notable research is detailed in [
27], where solar irradiation estimation models were developed using five distinct deep learning algorithms. The study aimed to compare these methods in terms of accuracy (Root Mean Square Error (RMSE) and R2) and time complexity (prediction speed and training time) for regression tasks, with a graphical analysis of their regression training efficacy. The test results indicate that GPR achieves the highest accuracy but with increased time complexity. In contrast, ensemble methods based on DTs demonstrate faster performance but with comparatively lower accuracy. Another analysis of ML techniques, including various linear regression models, regression trees, and SVM, for modeling solar irradiation is presented in [
34]. For modeling purposes, solar irradiation was estimated based on historical weather data, specifically temperature, humidity, wind speed, and air pressure. The test results indicate that the SVM with a Radial Basis Function kernel achieves the best performance compared to other methods. Additionally, the study demonstrates that solar irradiation has a strong correlation with the historical weather data utilized in the modeling process. The Pearson correlation coefficient (R) ranges between 0.75 and 1 for all four parameters, indicating a high degree of linear relationship [
35].
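For reference, the Pearson correlation coefficient mentioned above can be computed as in the following sketch. The function name and the sample values are illustrative assumptions and are not taken from the cited study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values: irradiation rising linearly with temperature gives R close to 1,
# humidity falling linearly with temperature gives R close to -1.
temperature = [18.0, 21.0, 24.0, 27.0, 30.0]
irradiation = [200.0, 350.0, 500.0, 650.0, 800.0]
humidity = [80.0, 70.0, 60.0, 50.0, 40.0]
r_positive = pearson_r(temperature, irradiation)
r_negative = pearson_r(temperature, humidity)
```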
Although there is a limited number of research papers analyzing different ML techniques specifically for agricultural applications, several studies investigate the use of ML techniques for estimating various weather parameters. A notable study is presented in [
26], where the authors employ DT, SVM, random forest, and XGBoost algorithms to estimate solar irradiance and temperature. The accuracy of these methods was evaluated using absolute error (AE), mean absolute error (MAE) [
36], and mean square error (MSE) [
37]. The study concludes that the selection of parameters and the quality of training data significantly impact the efficiency of the proposed models, particularly when DT models are used. Although DT models demonstrate faster performance compared to other models, their efficiency is highly dependent on these factors.
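The error metrics named in this section (MAE, MSE, RMSE, and R2) follow their usual definitions, sketched below. The function names and sample values are illustrative assumptions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted temperatures in degrees Celsius.
measured = [20.0, 22.0, 24.0, 26.0]
predicted = [21.0, 22.0, 23.0, 26.0]
scores = {
    "MAE": mae(measured, predicted),
    "MSE": mse(measured, predicted),
    "RMSE": rmse(measured, predicted),
    "R2": r_squared(measured, predicted),
}
```

RMSE is expressed in the unit of the predicted parameter, while R2 is dimensionless, which is why R2 is convenient for comparing models across parameters with different units.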
One research paper employing a linear regression algorithm to estimate weather conditions is described in [38], but it does not compare the proposed solution with other ML techniques. A more comprehensive study on estimating weather conditions is presented in [39], concluding that models developed using DT achieve an accuracy of 0.82, outperforming K-Nearest Neighbor (KNN) models.
Weather forecasting using ML technologies is also investigated in [
2]. Compared to previous studies, this research incorporates humidity to assess whether a day is rainy, chilly, or hot. The authors employed SVM and DT models in conjunction with artificially trained neural networks. Their findings indicate that the SVM model outperforms the DT model in terms of accuracy, achieving 80% compared with 50%.
In contrast to most models in the relevant literature, which rely solely on available databases, the authors in [
5] integrate data from diverse sources such as traditional weather stations, user-generated reports, and IoT sensors to develop high-resolution models for estimating weather parameters. These models are designed to predict short-term, localized weather conditions. The approach combines hyperlocal weather estimation and anomaly detection using IoT sensor networks and advanced machine-learning techniques to predict wind, temperature, and precipitation. This solution is also capable of detecting weather anomalies in real-time, potentially indicating incoming extreme weather events. Our research similarly utilizes data collected from IoT sensors for hyperlocal weather estimation, aiming to predict weather changes that could adversely affect maize crops. The authors in [
7] utilize IoT devices to collect real-time weather data for estimating the weather conditions in specific areas. Their approach employs Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) algorithms to improve the accuracy and simplicity of predicting humidity and temperature for localized regions. This is noteworthy as our research also involves data collection via IoT devices and focuses on comparing the accuracy of developed models across areas of varying sizes. A more cost-efficient solution is presented in [
40], utilizing low-cost IoT boards and sensors to measure humidity, light, and temperature. Unlike other solutions, this approach collects data in an indoor environment and employs a logistic regression model to estimate weather parameters in real-time. The authors note that their solution slightly outperforms other available methods. Regarding the use of ML technologies for agricultural estimations, we refer to the research presented in [
6], which compared various ML techniques (Support Vector Machine, Multiple Linear Regression, and Random Forest) for estimating crop yield rates. The authors concluded that for this application, SVM and Random Forest produced similar results and outperformed the Multiple Linear Regression model in terms of accuracy.
Finally, two highly relevant papers that we want to discuss are [
27,
41], as they utilize open-source datasets with weather information for specific locations (Jaipur and Teknaf) in contrast to large-scale global datasets. The authors of [
27] employed five deep learning techniques for weather prediction, evaluating the efficiency of the proposed algorithms using mean squared error, mean absolute error, and time complexity. Their findings indicate that DT models are the fastest but exhibit lower accuracy compared to GPR models. In contrast, the research presented in [
41] uses the most comprehensive set of parameters (eight) for estimating solar irradiance, including humidity, temperature, wind speed, air pressure, precipitation, insolation clearness index, and earth skin temperature. This study was selected because it predicts solar irradiance with high accuracy using a feed-forward backpropagation neural network, although it requires eight measured input parameters.
The objective of our research is to achieve the best possible accuracy in predicting weather parameters while minimizing the number of input parameters measured in specific areas. By leveraging advanced ML techniques and integrating data from diverse sources, our study aims to develop robust models that require fewer input variables without compromising predictive performance. This approach not only enhances the efficiency and practicality of weather prediction models but also facilitates their application in resource-constrained environments such as small-scale agricultural operations. Through this focused methodology, we aim to provide high-precision weather forecasts that can significantly benefit agricultural productivity and decision-making processes.
For the development of models for the four meteorological parameters and analysis purposes, we used 19 different regression model types, as listed in Table 1. The models included a variety of machine-learning techniques: linear regression models, regression trees, SVM, GPR models, and ensembles of trees. The model types listed in Table 1 are all available in the MATLAB regression application and generally represent the most commonly used modeling techniques.
When applicable, a five-fold cross-validation method was employed to mitigate overfitting by partitioning the dataset into folds and estimating accuracy for each fold. The minimum leaf size for the models was set to four. Additionally, Principal Component Analysis was disabled, and surrogate decision splits were not utilized.
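The five-fold cross-validation procedure can be sketched as follows. This is a generic illustration rather than the MATLAB routine used in this work; the shuffling seed, the function names, and the trivial mean predictor used in the demonstration are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the RMSE obtained on each held-out fold."""
    fold_rmse = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_rmse.append((sum(errors) / len(errors)) ** 0.5)
    return fold_rmse


# Demonstration with a trivial model that always predicts the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


folds = k_fold_indices(23)
demo_scores = cross_validate(
    [float(i) for i in range(23)], [5.0] * 23, fit_mean, predict_mean
)
```

Each observation is held out exactly once, so every fold's error is computed on data the model did not see during training.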
3. Test Setup and Database
This section outlines an experiment conducted as part of the project titled “An Ecosystem of Networked Devices and Services for IoT Solutions Applied in Agriculture”, which focuses on the continuous monitoring of agrometeorological and weather conditions. The experiment is critical for assessing the impact of drought on crop yields in the Republic of Croatia, where drought is the predominant cause of unprofitable yields of essential crops. Furthermore, the increasing frequency of droughts and recent climate change are expected to substantially affect the viability of strategically important crops in Croatia.
The experiment was carried out from 13 July 2022 to 29 September 2022, in the regions of Osijek and Tovarnik. It employed 18 commercial sensor nodes utilizing the Long-Range Wide Area Network (LoRaWAN) protocol to transmit data to the LORIOT server. Subsequently, these data were relayed to the university server’s database via a web socket. The devices were mounted on tripods to facilitate the real-time collection of meteorological data. The sensors used in this study included the METEOHELIX® IoT PRO [42], LoRaWAN Weather Station (WS100LRW/LW) [43], and ELEVEN PARAMETER WEATHER STATION FOR LoRaWAN® [44]. Detailed technical specifications are provided in Table 2. The locations of 15 sensor nodes employed in and around Osijek are shown in Figure 1. The remaining three nodes are placed in Tovarnik, Croatia. The sensor locations were strategically chosen to cover rural, suburban, and urban areas, with a particular emphasis on urban areas due to the recent surge in urban agriculture. This focus on urban settings reflects the growing interest and expansion in urban farming practices. However, there were also technological constraints related to network connectivity (the distance from the communication tower) that influenced the placement of sensors. Ensuring reliable connectivity was essential for real-time data collection and monitoring, and this requirement sometimes limited the ability to deploy sensors in more remote rural locations. Despite these challenges, the deployment aimed to provide a comprehensive dataset that captures the diverse environmental conditions across different geographical contexts.
During the experiment, the frequency of data collection was once every 10 min, thus a total of 139,965 records of weather conditions were collected from the field and surrounding areas, specifically the cities of Osijek and Tovarnik. These records were categorized into three groups based on their geographical locations: urban, suburban, and rural. Six sensors were deployed in urban areas (Osijek), nine in suburban areas (near Osijek), and three in rural areas (Tovarnik). The collected data included meteorological parameters such as temperature, air humidity, solar irradiation, and air pressure. These data were processed and analyzed to accurately reflect the agrometeorological and physiological conditions. The comprehensive database generated from this experiment is made available through a GitHub link provided at the conclusion of this paper, facilitating access and utilization by researchers in the field. The data are crucial for agricultural production in the Republic of Croatia as they enable precise analysis of the impact of weather conditions, including drought and other extreme events, on crop yields and agricultural viability. Additionally, the findings from this study highlight the potential of IoT solutions to drive the development of new technologies and methodologies that contribute to the sustainability and productivity of agricultural practices. This experiment underscores the significant benefits of IoT applications in agriculture, demonstrating the feasibility of continuous monitoring and real-time analysis of agrometeorological and physiological conditions of crops. The flowchart provided in
Figure 2 outlines the process of developing machine-learning models for meteorological parameter estimation. Models were developed for four meteorological parameters: temperature, air pressure, humidity, and solar irradiation. These models were constructed separately for urban, suburban, and rural areas, as well as for the aggregated data representing the Slavonia region in Croatia. Following data collection, model training is conducted for each parameter using 19 different regression models (
Table 1), each incorporating various numbers of input parameters to explore their effects on model performance. The selection of the 19 different regression models is based on their performance in previous studies (
Section 2) and their suitability for the data characteristics. The modeling process involved training various ML techniques, including Linear Regression Models, DT, SVM, and GPR models, with carefully selected parameters. For Linear Regression Models, the parameters focused on ordinary least squares fitting. DTs were configured with a minimum leaf size of four and no surrogate decision splits, ensuring a balance between model complexity and interpretability. SVM models utilized different kernel functions, such as linear, quadratic, and cubic, to capture non-linear relationships in the data. GPR models employed kernel functions like squared exponential and rational quadratic to accommodate the non-linearity and variability in meteorological parameters. Hyperparameter selection was refined through a five-fold cross-validation process, which mitigated overfitting and ensured model robustness. By partitioning the dataset into five subsets, each model’s performance was comprehensively evaluated across different data segments. The data from 14 sensor nodes were utilized in the modeling process. After initial model training, the process involves the selection of optimal input parameters based on key performance metrics: RMSE, R2, prediction speed, and training time. This step ensures that the models use the most effective and efficient set of input variables. Subsequently, an analysis and selection of four models per parameter for each area is performed, again using RMSE, R2, prediction speed, and training time to determine the best candidates. The best model is chosen based on the highest R2 value, the lowest RMSE, and the shortest training and prediction times. The next step involves the selection of one optimal model per meteorological parameter for each geographical area, refining the choices to the single best-performing model for each parameter and area type. Validation using a subset of the data is then conducted to confirm the generalizability and accuracy of these selected models. For validation using a subset of the data, we gathered data from an additional four sensor nodes. Specifically, two sensor nodes from urban areas, one from a suburban area, and one from a rural area were designated exclusively for testing purposes. The sensors selected for verification were strategically chosen based on their GPS locations to present a challenging task for the proposed models. This careful selection ensures that the models are rigorously tested under diverse and demanding conditions, thereby validating their robustness and accuracy. By encompassing a wide range of environmental variables and micro-locations, the verification process provides a comprehensive assessment of the models’ performance, ensuring their reliability and effectiveness in real-world applications. To facilitate a comprehensive understanding of the proposed database,
Figure 3 illustrates data collected for urban areas that were used in modeling, and
Figure 4 illustrates data collected for urban areas that were used for verification.
Figure 3 and
Figure 4 present scatter plots illustrating the measured values of temperature, humidity, solar irradiation, and pressure across a span of 2.5 months. The data are depicted for each daytime hour, with multiple measurements captured, reflecting the variability of the environmental conditions over the observation period.
Finally, the modeling process ends with an evaluation based on performance metrics, including RMSE, R2, and R, to ensure that the models meet the required standards of accuracy and reliability for practical application. The models selected for each parameter and geographical area were chosen based on RMSE, R2, prediction speed, and training time. The Rational Quadratic GPR and Exponential GPR models consistently showed low RMSE and high R2 and were therefore chosen as the best models for predicting meteorological parameters across various regions. The detailed evaluation ensures that the selected models provide reliable and efficient predictions, making them suitable for practical applications in weather forecasting and agricultural planning. This structured and iterative approach ensures that the developed models are both robust and efficient, capable of providing accurate meteorological predictions for different geographical areas.
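The selection criteria described in this section (highest R2, lowest RMSE, shortest training and prediction times) can be expressed as a ranking, as in the sketch below. The text does not state how the criteria are weighted against each other, so the lexicographic order used here is an assumption. The candidate names and all numbers are hypothetical placeholders, not results from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    r2: float
    rmse: float
    train_time_s: float
    pred_speed_obs_per_s: float  # higher means faster prediction


def rank(candidates, top=4):
    """Sort by R2 (descending), then RMSE, training time, and prediction speed."""
    def key(c):
        return (-c.r2, c.rmse, c.train_time_s, -c.pred_speed_obs_per_s)
    return sorted(candidates, key=key)[:top]


# Hypothetical candidates for one parameter in one area.
pool = [
    Candidate("candidate_A", 0.97, 1.2, 40.0, 5000.0),
    Candidate("candidate_B", 0.97, 1.1, 300.0, 900.0),
    Candidate("candidate_C", 0.91, 2.0, 2.0, 40000.0),
    Candidate("candidate_D", 0.88, 2.4, 1.5, 52000.0),
    Candidate("candidate_E", 0.80, 3.1, 1.0, 60000.0),
]
shortlist = rank(pool, top=4)  # four models per parameter and area
best = shortlist[0]            # single model kept for verification
```

With a lexicographic order, speed only breaks ties between models of equal accuracy. A weighted score would be an alternative if faster but slightly less accurate models are preferred.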
4. Results
After generating 304 distinct models (19 for each meteorological parameter across various regional scales), an additional 380 models were created to determine the minimal-yet-sufficient number of input parameters required for accurately estimating each meteorological parameter. Pressure was modeled in three test cases that differed in the number of input parameters: the first test case included latitude, longitude, month, hour, temperature, and humidity; the second was without humidity, and the third was without temperature. Solar irradiation and humidity were modeled in two test cases that differed in the number of input parameters: the first test case included latitude, longitude, month, hour, and temperature, and the second was without temperature. Temperature was modeled in two test cases that differed in the number of input parameters: the first test case included latitude, longitude, month, hour, and humidity, and the second was without humidity. The developed models were evaluated using several performance metrics: root mean square error (RMSE), R², mean squared error, mean absolute error, prediction speed, and training time. Given the extensive data collected for all 684 models, all test results are available alongside the proposed database [45].
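The model counts quoted above can be reproduced with a short calculation. The sketch below assumes that every test case was trained with 19 model types on each of four area datasets (all data, urban, suburban, rural); this reading is not stated explicitly in the text, but it is consistent with the reported totals. The dictionary keys and variable names are illustrative.

```python
BASE_INPUTS = ["latitude", "longitude", "month", "hour"]

# Input sets tried per target, as listed in the text (first entry = full set).
TEST_CASES = {
    "pressure": [
        BASE_INPUTS + ["temperature", "humidity"],
        BASE_INPUTS + ["temperature"],  # without humidity
        BASE_INPUTS + ["humidity"],     # without temperature
    ],
    "solar_irradiation": [BASE_INPUTS + ["temperature"], BASE_INPUTS],
    "humidity": [BASE_INPUTS + ["temperature"], BASE_INPUTS],
    "temperature": [BASE_INPUTS + ["humidity"], BASE_INPUTS],
}

AREAS = ["all", "urban", "suburban", "rural"]  # assumption: four area datasets
MODEL_TYPES_PER_CASE = 19                      # model types trained per test case and area

models_per_case = MODEL_TYPES_PER_CASE * len(AREAS)
initial_models = models_per_case * len(TEST_CASES)  # one full-input case per target
total_models = models_per_case * sum(len(cases) for cases in TEST_CASES.values())
additional_models = total_models - initial_models

print(initial_models, additional_models, total_models)
```

Under these assumptions, the five reduced-input test cases (two for pressure and one for each of the other parameters) account for the additional models.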
Based on the analysis of the results, the optimal input parameters for predicting air pressure were identified as latitude, longitude, month, hour, temperature, and humidity. For predicting temperature, the input parameters were determined to be latitude, longitude, month, hour, and humidity. For solar irradiation and humidity predictions, the input parameters were identified as latitude, longitude, month, hour, and temperature.
In the second stage of the analysis, four models per meteorological parameter for each area were identified as having the most promising performance, resulting in a total of 64 models. The codes of these models are available for testing alongside the proposed database. The R² values, which were selected as the measure of model accuracy for all models, are presented in Table 3. The analysis indicates that regression trees, GPR models, and ensembles of trees achieve the highest R² values, thus providing the highest accuracy for modeling meteorological data collected via IoT sensor nodes. These models are particularly effective for agricultural applications.
For air pressure, the Rational Quadratic GPR model demonstrated the highest R² values, particularly in the case of rural areas, with values reaching 0.79. Bagged Trees and Fine Tree models also showed strong performance across all areas, though not as high as the GPR models. For the solar irradiation parameter, the Exponential GPR model consistently achieved the highest R² values across all areas, with values reaching up to 0.90 in rural areas and 0.88 overall. The Medium Tree and Bagged Trees models also performed well, with R² values ranging from 0.85 to 0.88 across different regions. Specifically, the Medium Tree model achieved an R² value of 0.88 in rural areas, while the Bagged Trees model had values close to 0.87–0.88 in suburban and rural areas. For humidity, both the Exponential GPR and Bagged Trees models exhibited high R² values around 0.88, indicating high accuracy. Medium Tree and Fine Gaussian SVM models also performed effectively, though with slightly lower values than the GPR and Bagged Trees models. Regarding temperature, the Exponential GPR model again led with the highest R² values, up to 0.88. Coarse Tree and Bagged Trees models showed solid performance with R² values around 0.87.
The RMSE values for solar irradiation models in our study show considerable variation across different geographical areas and model types, reflecting their performance and accuracy. For the Medium Tree model, the RMSE values are 82.129 W/m² (0.0821 kW/m²) for all sensors, 75.806 W/m² (0.0758 kW/m²) for urban areas, 86.111 W/m² (0.0861 kW/m²) for suburban areas, and 88.902 W/m² (0.0889 kW/m²) for rural areas. The Fine Gaussian SVM recorded RMSE values of 91.895 W/m² (0.0919 kW/m²) for all sensors, 80.888 W/m² (0.0809 kW/m²) for urban areas, 87.623 W/m² (0.0876 kW/m²) for suburban areas, and 87.029 W/m² (0.0870 kW/m²) for rural areas. The Bagged Trees model demonstrated RMSE values of 81.206 W/m² (0.0812 kW/m²) for all sensors, 78.845 W/m² (0.0788 kW/m²) for urban areas, 85.572 W/m² (0.0856 kW/m²) for suburban areas, and 96.037 W/m² (0.0960 kW/m²) for rural areas. Notably, the Exponential GPR model achieved the lowest RMSE values, with 77.990 W/m² (0.0780 kW/m²) for all sensors, 72.365 W/m² (0.0724 kW/m²) for urban areas, 79.079 W/m² (0.0791 kW/m²) for suburban areas, and 82.773 W/m² (0.0828 kW/m²) for rural areas.
In comparison to other studies, our results align well with reported RMSE values for solar irradiation predictions using ML models, which range from 40.87 W/m² to 94.89 W/m² (0.0409 to 0.0949 kW/m²) [46]. A related study (Table 1 of [47]) reported RMSE values ranging from 75.23 W/m² to 146.22 W/m² (0.0752 to 0.1462 kW/m²), which are comparable to our results, indicating the robustness and effectiveness of our selected models. The urban areas in our study exhibited the lowest RMSE values, indicating the highest accuracy for solar irradiation predictions. This is consistent with other research findings that suggest urban areas benefit from more stable and predictable environmental conditions compared to rural and suburban areas, leading to more accurate model predictions.
For all sensors combined, the Fine Tree model for pressure achieved an RMSE of 2.8461 hPa, indicating moderate accuracy with high processing efficiency. The Fine Gaussian SVM model recorded an RMSE of 3.1398 hPa, balancing accuracy and resource use. The Bagged Trees model had an RMSE of 2.8417 hPa, showing robust performance. The Rational Quadratic GPR model achieved the lowest RMSE value of 2.4369 hPa, indicating superior accuracy despite a more resource-intensive process. In urban areas, the Fine Tree model’s RMSE was 2.7039 hPa, demonstrating good accuracy. The Fine Gaussian SVM recorded an RMSE of 2.9425 hPa with moderate prediction speed. The Bagged Trees model showed an RMSE of 2.8352 hPa, making it efficient for urban data processing. The Rational Quadratic GPR model achieved the lowest RMSE of 2.1217 hPa, excelling in accuracy in urban environments. For suburban areas, the Fine Tree model recorded an RMSE of 2.9488 hPa, while the Fine Gaussian SVM had an RMSE of 3.1751 hPa. The Bagged Trees model had an RMSE of 2.9936 hPa. The Rational Quadratic GPR model again demonstrated the lowest RMSE of 2.4009 hPa, indicating its superior performance. In rural areas, the Fine Tree model’s RMSE was 3.3939 hPa. The Fine Gaussian SVM recorded an RMSE of 3.5445 hPa. The Bagged Trees model showed an RMSE of 3.8824 hPa. The Exponential GPR model achieved the lowest RMSE of 2.3321 hPa, highlighting its effectiveness in rural settings.
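As an illustration of how the pressure figures above rank the candidates, the sketch below picks the lowest-RMSE model per area. It considers RMSE only, which is a simplification: the study also weighs R², prediction speed, and training time. The dictionary layout and function name are illustrative.

```python
# Pressure RMSE values in hPa, copied from the paragraph above.
PRESSURE_RMSE_HPA = {
    "all": {
        "Fine Tree": 2.8461,
        "Fine Gaussian SVM": 3.1398,
        "Bagged Trees": 2.8417,
        "Rational Quadratic GPR": 2.4369,
    },
    "urban": {
        "Fine Tree": 2.7039,
        "Fine Gaussian SVM": 2.9425,
        "Bagged Trees": 2.8352,
        "Rational Quadratic GPR": 2.1217,
    },
    "suburban": {
        "Fine Tree": 2.9488,
        "Fine Gaussian SVM": 3.1751,
        "Bagged Trees": 2.9936,
        "Rational Quadratic GPR": 2.4009,
    },
    "rural": {
        "Fine Tree": 3.3939,
        "Fine Gaussian SVM": 3.5445,
        "Bagged Trees": 3.8824,
        "Exponential GPR": 2.3321,
    },
}


def lowest_rmse_model(rmse_by_model):
    """Return the name of the model with the smallest RMSE."""
    return min(rmse_by_model, key=rmse_by_model.get)


best_by_area = {
    area: lowest_rmse_model(models) for area, models in PRESSURE_RMSE_HPA.items()
}

for area, model in best_by_area.items():
    print(f"{area}: {model} ({PRESSURE_RMSE_HPA[area][model]} hPa)")
```

On RMSE alone, a GPR variant ranks first in every area, with the Exponential GPR model taking the rural case.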
For all sensors combined, the Medium Tree model for humidity achieved an RMSE of 8.1906%, reflecting moderate accuracy with high computational efficiency. The Fine Gaussian SVM model reported an RMSE of 8.3525%, balancing precision and resource utilization. The Bagged Trees model had an RMSE of 7.8979%, indicating robust performance. The Exponential GPR model attained the lowest RMSE value of 7.8392%, showcasing superior accuracy despite being more resource intensive. In urban areas, the Medium Tree model’s RMSE was 8.5452%, demonstrating good accuracy. The Fine Gaussian SVM recorded an RMSE of 8.8605%, with moderate prediction speed. The Bagged Trees model exhibited an RMSE of 8.4532%, making it efficient for urban data processing. The Exponential GPR model achieved the lowest RMSE of 8.2139%, excelling in accuracy in urban settings. For suburban areas, the Medium Tree model recorded an RMSE of 8.1297%, while the Fine Gaussian SVM had an RMSE of 8.2496%. The Bagged Trees model had an RMSE of 7.9374%. The Exponential GPR model again demonstrated the lowest RMSE of 7.785%, indicating its superior performance. In rural areas, the Medium Tree model’s RMSE was 8.2669%. The Fine Gaussian SVM recorded an RMSE of 8.2611%. The Boosted Trees model showed an RMSE of 9.4674%. The Exponential GPR model achieved the lowest RMSE of 7.814%, highlighting its effectiveness in rural environments. Overall, suburban areas in our study exhibited the lowest RMSE values, indicating the highest accuracy for humidity predictions. These results underscore the robustness and precision of the Exponential GPR model, making it a valuable tool for environmental monitoring and forecasting applications.
When comparing our results with other relevant research, studies on humidity prediction using ML report RMSE values ranging from 5% to 10%, depending on model complexity and the dataset used. For instance, a study utilizing LSTM and ANFIS models for daily relative humidity forecasting in Turkey reported RMSE values of 5.95% to 7.67% across various provinces [48]. This comparison indicates that our models, particularly the Exponential GPR model, perform competitively with state-of-the-art models, confirming their robustness and reliability in predicting humidity across various environmental conditions.
For all sensors combined, the Coarse Tree model for temperature had an RMSE of 2.3817 °C, showing decent accuracy with good efficiency. The Fine Gaussian SVM model had an RMSE of 2.4075 °C, balancing accuracy and resource use well. The Bagged Trees model recorded an RMSE of 2.2997 °C, indicating solid performance. The Exponential GPR model stood out with the lowest RMSE of 2.2805 °C, despite needing more resources. In urban areas, the Medium Tree model’s RMSE was 2.4158 °C, showing good accuracy. The Fine Gaussian SVM had an RMSE of 2.4548 °C, with moderate prediction speed. The Bagged Trees model had an RMSE of 2.3817 °C, making it efficient for urban data. The Exponential GPR model achieved the lowest RMSE of 2.3230 °C, excelling in accuracy in urban settings. For suburban areas, the Coarse Tree model had an RMSE of 2.3590 °C, while the Fine Gaussian SVM recorded an RMSE of 2.3902 °C. The Bagged Trees model had an RMSE of 2.3186 °C. The Exponential GPR model again had the lowest RMSE of 2.2801 °C, indicating its strong performance. In rural areas, the Medium Tree model’s RMSE was 2.8438 °C. The Fine Gaussian SVM recorded an RMSE of 2.8620 °C. The Boosted Trees model had an RMSE of 3.0385 °C. The Exponential GPR model achieved the lowest RMSE of 2.7016 °C, showing its effectiveness in rural settings.
When comparing our results with other relevant research, studies on temperature prediction using ML report RMSE values ranging from 0.5 °C to 3 °C, depending on model complexity and the dataset used. For instance, a study focusing on temperature forecasting using ML models like LSTM reported RMSE values around 2.3 °C, indicating high accuracy in daily temperature predictions [49]. Furthermore, a paper on ultra-low-temperature measurement using an SSA-PSO-ELM network model reported an RMSE of 3.3081 °C for SVR and 4.4835 °C for the least squares method, showcasing the potential for higher RMSE values in specific conditions [50]. This comparison underscores the effectiveness of our approach, with our RMSE values aligning well with those reported in the literature.
Overall, these findings underscore the effectiveness of advanced regression models, such as GPR and ensembles of trees, in capturing the complexities of meteorological data for agricultural applications. These models provide important tools for real-time monitoring and analysis, enhancing the precision of agricultural forecasts and decision-making processes. The comprehensive database, along with these high-performing models, offers valuable resources for ongoing research and practical implementations in the field of smart agriculture.
For predicting solar irradiation, temperature, air pressure, and humidity, tree-based models like Fine Trees and Bagged Trees are often recommended due to their reliable performance and robustness. As can be seen from the results, SVMs are also worth considering, especially for temperature and pressure predictions, if computational resources are carefully managed. GPR models, including variants like Squared Exponential GPR, Matern 5/2 GPR, Exponential GPR, and Rational Quadratic GPR, are highly effective for regression tasks, including predicting weather parameters. While using tree models for their lower computational cost and good scalability in large-scale applications can be beneficial, our results show that GPR models, particularly Exponential GPR, perform strongly when computational resources are available and high accuracy is needed, especially in cases of smaller datasets.
This study also provides comprehensive data on the prediction speed and training time for various regression models used to predict meteorological parameters, highlighting significant differences in computational efficiency. For instance, in predicting air pressure, the Fine Tree model achieved a high prediction speed of 1,500,000 observations per second with a relatively short training time of 40.935 s, whereas the Rational Quadratic GPR model, despite its higher accuracy, had a much lower prediction speed of 700 observations per second and a substantially longer training time of 12,094 s. Similarly, for solar irradiation, the Medium Tree model exhibited an impressive prediction speed of 2,000,000 observations per second and a training time of 28.872 s, while the Exponential GPR model, known for its accuracy, lagged with a prediction speed of 770 observations per second and a training time of 10,791 s. In humidity prediction, the Medium Tree model again outperformed in speed with 2,000,000 observations per second and a training time of 7.7101 s, compared to the Exponential GPR model’s 1600 observations per second and 3871.3 s of training time. For temperature, the Coarse Tree model led with the fastest prediction speed of 2,100,000 observations per second and the shortest training time of 1.8304 s, while the Exponential GPR model had the lowest prediction speed of 1000 observations per second and a longer training time of 5913.1 s. These results indicate that more complex models like the Exponential GPR and Rational Quadratic GPR models offer the highest accuracy, which is what we aimed for in this study, but at the cost of using more computational resources and time. 
The aforementioned results suggest that while more sophisticated models may provide better accuracy, their slower prediction speeds and longer training times may limit their practicality in some real-time or resource-constrained environments, making it crucial to balance model complexity with computational efficiency depending on the application’s needs.
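To make this trade-off concrete, the sketch below turns the prediction speeds and training times quoted above into ratios. The figures are copied from the text; the batch size of 100,000 observations and all identifiers are hypothetical choices made for illustration.

```python
# (model name, prediction speed in observations/s, training time in s), as quoted above.
REPORTED = {
    "pressure": {
        "tree": ("Fine Tree", 1_500_000, 40.935),
        "gpr": ("Rational Quadratic GPR", 700, 12_094.0),
    },
    "solar_irradiation": {
        "tree": ("Medium Tree", 2_000_000, 28.872),
        "gpr": ("Exponential GPR", 770, 10_791.0),
    },
    "humidity": {
        "tree": ("Medium Tree", 2_000_000, 7.7101),
        "gpr": ("Exponential GPR", 1_600, 3_871.3),
    },
    "temperature": {
        "tree": ("Coarse Tree", 2_100_000, 1.8304),
        "gpr": ("Exponential GPR", 1_000, 5_913.1),
    },
}


def tradeoff(entry, batch_size=100_000):
    """Compare a tree model with a GPR model; batch_size is a hypothetical workload."""
    _, tree_speed, tree_train = entry["tree"]
    _, gpr_speed, gpr_train = entry["gpr"]
    return {
        "speed_ratio": tree_speed / gpr_speed,      # how many times faster the tree predicts
        "training_ratio": gpr_train / tree_train,   # how many times longer the GPR trains
        "tree_batch_seconds": batch_size / tree_speed,
        "gpr_batch_seconds": batch_size / gpr_speed,
    }


summary = {parameter: tradeoff(entry) for parameter, entry in REPORTED.items()}

for parameter, result in summary.items():
    print(
        f"{parameter}: tree predicts {result['speed_ratio']:.0f}x faster, "
        f"GPR trains {result['training_ratio']:.0f}x longer"
    )
```

Whether such ratios matter depends on the deployment: for a handful of sensor nodes reporting at intervals of minutes, even the slower GPR prediction speeds may be sufficient, whereas bulk reprocessing of historical data favors the tree models.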
In the subsequent phase of the research, one model per meteorological parameter for each geographical area was selected, resulting in a total of 16 best-performing models. The Rational Quadratic GPR model was identified as the best-performing model for air pressure in suburban and urban areas, as well as when all data were used in the modeling process. In the case of rural areas, the Exponential GPR model was selected as the best-performing model for air pressure. Furthermore, Exponential GPR models were also chosen as the best-performing models for solar irradiation, humidity, and temperature across all area sizes. These selections underscore the effectiveness of GPR models in accurately capturing the complexities of meteorological data across various environmental conditions.
Data including RMSE, R², prediction speed, and training time are presented in Table 4.
Table 4 presents an evaluation of regression models for predicting temperature, humidity, solar irradiation, and pressure across different geographical areas: urban, suburban, rural, and all data combined. The models generally demonstrate low RMSE values, indicating accurate predictions, and high R² values, showing a strong fit between predicted and actual values. Models for rural areas exhibit the highest R² values for solar irradiation and pressure, indicating robust performance in these contexts. Prediction speed varies significantly, with rural models displaying much higher speeds, particularly for humidity and temperature, compared to models using all data, which have notably lower speeds for solar irradiation and pressure. Training time also varies, with models trained on all data requiring significantly more time, especially for solar irradiation and pressure, reflecting their resource-intensive nature. In contrast, rural models have much shorter training times, indicating greater efficiency. Urban and suburban models ensure a balance between accuracy and efficiency, offering reasonable prediction speeds and moderate training times, making them practical for real-time applications.
Overall, the table highlights the trade-offs between accuracy, as indicated by RMSE and R², and computational efficiency, as indicated by prediction speed and training time. Rural models are more efficient but slightly less accurate, while models using all data are more accurate but demand more computational resources. Urban and suburban models provide a middle ground between accuracy and efficiency. These trade-offs are crucial for selecting the appropriate model for a specific application, whether it prioritizes accuracy, speed, or computational efficiency.
Figure 5, Figure 6, Figure 7, and Figure 8 depict response plots and predicted vs. actual data plots for urban area models. In the verification phase of the modeling process, the performance metrics for temperature, humidity, solar irradiation, and pressure were further analyzed across different geographical areas: urban, suburban, and rural. The models were evaluated using the Pearson correlation coefficient and R² values to assess their accuracy and reliability (Table 5). For temperature, when considering all data, the correlation coefficient was 0.92, with an R² value of 0.85, indicating a strong relationship between the predicted and actual temperature values.
In urban areas, the temperature models achieved the highest R-value of 0.93 and an R² value of 0.87, reflecting excellent model performance. Suburban areas also demonstrated high accuracy, with R = 0.94 and R² = 0.88. However, in rural areas, the R-value was lower at 0.77 and the R² value was 0.59, suggesting that the model was less accurate in these areas than in urban and suburban areas. This can be explained by the smaller amount of data used in the modeling process, which also covers a larger area; this is something to be improved in the future, although the predictions are still considered adequate.
For humidity, the models showed a correlation coefficient of 0.93 and an R² value of 0.86 when considering all data, indicating high accuracy. In urban areas, the highest R-value was 0.95, with an R² value of 0.90, indicating very accurate predictions. Suburban areas achieved R = 0.95 and R² = 0.90, similar to urban areas. In rural areas, the R-value was 0.87 and the R² value was 0.75, indicating a reasonable level of accuracy, though lower than in urban and suburban areas.
For solar irradiation, the correlation coefficient for all data was 0.91, with an R² value of 0.83, reflecting strong model performance. In urban areas, the highest R-value was 0.97, with an R² value of 0.93, indicating very high accuracy. Suburban areas achieved R = 0.95 and R² = 0.91, showing strong model accuracy. In rural areas, the R-value was 0.87 and the R² value was 0.76, which, although lower, still indicated good model performance. For pressure, the models showed a correlation coefficient of 0.81 and an R² value of 0.64 when considering all data, indicating moderate accuracy. In urban areas, the highest R-value was 0.98, with an R² value of 0.95, reflecting excellent model performance. Suburban areas showed strong performance with R = 0.92 and R² = 0.84. In rural areas, the R-value was 0.88 and the R² value was 0.76, indicating a reasonable level of accuracy.
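As a quick consistency check on the verification figures, the sketch below compares each reported R² value with the square of the corresponding Pearson coefficient. It assumes that R² in the verification phase is the squared Pearson correlation, which the text does not state explicitly. Because the published values are rounded to two decimals, small gaps are expected. The data layout and names are illustrative.

```python
# (Pearson R, R^2) pairs from the verification phase, as reported in the text.
VERIFICATION = {
    "temperature": {
        "all": (0.92, 0.85), "urban": (0.93, 0.87),
        "suburban": (0.94, 0.88), "rural": (0.77, 0.59),
    },
    "humidity": {
        "all": (0.93, 0.86), "urban": (0.95, 0.90),
        "suburban": (0.95, 0.90), "rural": (0.87, 0.75),
    },
    "solar_irradiation": {
        "all": (0.91, 0.83), "urban": (0.97, 0.93),
        "suburban": (0.95, 0.91), "rural": (0.87, 0.76),
    },
    "pressure": {
        "all": (0.81, 0.64), "urban": (0.98, 0.95),
        "suburban": (0.92, 0.84), "rural": (0.88, 0.76),
    },
}

# Absolute gap between the squared correlation and the reported R^2.
gaps = {
    (parameter, area): abs(r * r - r2)
    for parameter, areas in VERIFICATION.items()
    for area, (r, r2) in areas.items()
}

worst = max(gaps, key=gaps.get)
max_gap = gaps[worst]
print(f"largest gap: {max_gap:.4f} for {worst}")
```

Under that assumption, all sixteen pairs agree to within two hundredths, which is consistent with rounding of the published values.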
The results from the model verification phase reveal that the models perform exceptionally well in urban and suburban areas, with high correlation coefficients and R² values across all meteorological parameters. However, the accuracy is lower in rural areas, which may be due to the more variable environmental conditions in these areas and to the smaller amount of input data. Overall, the models for temperature, humidity, and solar irradiation exhibit very high accuracy, particularly in urban and suburban areas. The model for pressure, while still accurate, shows moderate performance overall but excellent results in urban settings. These findings underscore the robustness and reliability of the developed models in different geographical contexts, providing valuable insights for agricultural applications and real-time monitoring.