1. Introduction
Agriculture is the cornerstone of food production and a key driver of economic stability in modern society. However, increasing global demand, environmental shifts, and resource limitations present significant challenges for traditional farming practices. These factors require innovative agricultural management approaches, leading to the emergence of smart agriculture. Smart agriculture leverages digital technologies, particularly information and communication technology (ICT), to enhance efficiency, optimize resource use, and promote sustainable farming practices. By integrating Internet of Things (IoT) devices, sensors, and data analytics, farmers and system developers can make informed, data-driven decisions to address these challenges [1,2].
In smart agriculture environments, data are continuously collected and analyzed in real time from a wide variety of sensors and IoT devices deployed in fields, greenhouses, and other agricultural settings. These devices monitor key environmental and operational variables, such as soil moisture, temperature, humidity, and crop health. This process results in vast amounts of time-series data, critical for building predictive models that can respond to environmental changes, optimize irrigation, monitor crop conditions, and ultimately automate agricultural processes [3].
However, given the sheer volume and complexity of time-series data, a primary challenge in developing machine learning applications for smart agriculture is manual data labeling. This process is often labor-intensive, time-consuming, and prone to inconsistencies, making efficient scaling difficult. To overcome these limitations, this study introduces a pseudo-labeling approach to automate time-series data labeling, significantly enhancing the convenience, efficiency, and accuracy of data preparation for machine learning models [4].
Pseudo-labeling is a semi-supervised learning method that uses a small amount of labeled data to assign labels to a larger, unlabeled dataset, expanding the labeled data through iterative refinement. This approach is highly useful in smart agriculture for rapid data analysis and real-time decision making, as it improves scalability, accuracy, and consistency in data preparation while reducing manual effort [5,6].
Figure 1 provides an overview of the intent-based IoT platform proposed in this study. This platform integrates machine learning and artificial intelligence to autonomously manage smart agricultural environments. By analyzing real-time environmental data and system feedback, the platform anticipates and adjusts key agricultural operations, such as irrigation, fertilization, and climate control. This platform enhances farming efficiency by enabling data-driven decision making with minimal manual intervention.
Our proposed machine learning algorithm performs this task by first extracting feature regions through waveform segmentation, which captures the key characteristics of the time-series data. These features are then used by the machine learning model to classify and predict the operational status of agricultural sensors and devices. The process is then further automated by generating labels for new data based on learned features, facilitating the development of robust and scalable machine learning models.
This automated approach enables system developers to manage vast quantities of agricultural time-series data more efficiently, leading to more accurate and responsive agricultural management systems. By integrating pseudo-labeling with machine learning, the proposed method provides a significant improvement over traditional manual labeling techniques. It supports the development of intelligent agricultural systems that can enhance operational efficiency and contribute to sustainable and scalable farming practices.
In this study, we propose a novel approach to automate the labeling of time-series data in smart agriculture, aiming to enhance the scalability and accuracy of machine learning applications in this field. The remainder of this paper is organized as follows.
Section 2 reviews related works, providing a comprehensive analysis of existing methodologies and identifying current limitations in time-series data labeling and automation within the context of smart agriculture.
Section 3 introduces our proposed methodology, detailing the pseudo-labeling algorithm and the waveform segmentation approach that leverages Long Short-Term Memory (LSTM) networks for effective data processing and labeling.
Section 4 presents the experimental setup, dataset description, and results, illustrating the model’s performance and its impact on data processing efficiency. In addition, we discuss the implications of our findings, comparing the advantages of our approach in terms of labeling accuracy, scalability, and operational efficiency to traditional methods. Finally,
Section 5 concludes this paper by summarizing the main contributions of this study and discussing potential future research directions that may further enhance automated data labeling in agricultural applications.
This structure is designed to provide a clear and logical flow, enabling readers to understand the development, implementation, and evaluation of our proposed model and its potential practical applications in smart agriculture.
2. Related Works
2.1. Time-Series Data Usage in Smart Agriculture
Time-series data, characterized by their temporal sequence, are integral to smart agriculture, as they enable real-time monitoring and analysis of critical farming variables. The sensors and IoT devices used in agricultural settings generate continuous streams of time-stamped data, reflecting various phenomena, such as soil moisture levels, crop health indicators, and environmental conditions. By leveraging the analytical power of time-series data, agricultural stakeholders can gain detailed insights into crop yields, forecast environmental trends, and optimize resource allocation. For example, models that predict irrigation needs based on soil moisture data can help conserve water and improve crop yields. Similarly, temperature and humidity monitoring can enable precise climate control in greenhouses, fostering better crop growth. Accurate and efficient analysis of time-series data is essential for supporting data-driven decisions that contribute to sustainable agricultural practices.
Agriculture plays an important role in global economies and sustainability. As the world’s population continues to grow, the necessity of increasing food production has led to a focus on enhancing, automating, and optimizing agricultural activities for crops and livestock. These efforts, however, often have negative environmental impacts. Smart farming technologies aim to address these challenges by optimizing productivity while reducing costs, waste, and environmental footprint. Moreover, these technologies contribute to improving the quality of crops and livestock.
A wide variety of sensors is used in precision agriculture to capture several types of measurements. In Sensing Approaches for Precision Agriculture by Kerry and Escolà, various sensors are presented, including those currently available on the market, as well as some under research or development. These sensors perform tasks such as soil sensing, crop health assessment, and disease detection. Other sensors discussed in this work have been deployed on unmanned aerial vehicles (UAVs) [7].
Several research proposals have emerged that leverage sensor data to optimize resource utilization. For example, a study by Munir et al. presented a system that optimizes water consumption in crop irrigation by using decision-making processes based on a machine learning algorithm (KNN), supported by an ontology and data from humidity, temperature, and light sensors [8].
Technological and scientific advancements in precision crop and livestock farming have also been explored. Monteiro et al. highlighted applications that use devices to monitor livestock using GPS-based geopositioning and machine learning solutions designed to detect animal discomfort, thereby reducing mortality and improving welfare. Additionally, some systems use UAVs or robots for crop fertilization and agricultural process automation using sensors and AI-driven actuators [9].
For livestock monitoring, one innovative approach involves the use of multi-sensor collars to monitor individual animals. These collars collect animal activity data and are integrated into cloud-based systems to provide real-time monitoring and proactive management. Andonovic et al. described how this solution can improve the extraction and analysis of livestock monitoring data [10].
The use of sensors in crop harvesting supports activities such as plant sorting, counting, and health monitoring. Mavridou et al. reviewed studies in this area, including the automation of harvesting tasks with UAVs and robots. These advancements highlight the growing reliance on sensor-based automation in precision farming [11].
Pesticide spraying is essential for protecting crops from diseases and pests, although manual spraying can expose humans to significant health risks. Mogili and Deepak examined how UAVs can automate pesticide-spraying tasks to improve spraying precision and ensure human safety [12].
Effective data management is essential for real-time decision making and for extracting valuable information from spatial and temporal data generated by agricultural sensors. Leroux et al. introduced GeoFIS, an open-source decision-support tool designed for precision agriculture. GeoFIS focuses on managing spatial data to support decision making, although it lacks a temporal and semantic management framework to capture the most essential details of agricultural data [13].
Ren et al. proposed a secure storage mechanism for managing large volumes of agricultural spatio-temporal data using blockchain technology. However, they acknowledged the high computational requirements that limit its usefulness in real-time agricultural scenarios [14].
Wisnubhadra et al. offered an alternative by presenting an open agricultural spatio-temporal data warehouse designed to handle spatial and temporal attributes using MobilityDB, an extension of PostgreSQL. Their system, while promising, showed limitations, with response times exceeding the maximum threshold for practical real-time applications [15].
Lastly, Deeken et al. introduced the SEMAP framework, designed for the spatio-semantic management of agricultural environments. Although it provides powerful spatial descriptions, its complexity makes real-time management difficult. In contrast, the data model proposed in this paper is designed to meet the specific needs of precision agriculture by simplifying spatial characterization while adding temporal and semantic dimensions [16].
2.2. Integration of IoT Platforms in Agriculture
Recent studies have explored the integration of IoT technology with data collection and machine learning for analysis. These technologies were initially planned for use in smart cities but have since found significant parallels and applications in agriculture. For instance, Ref. [17] employed deep learning convolutional neural networks (CNNs) to extract features from public infrastructure data, expediting evacuation procedures and contributing to smart city development. Similarly, using IoT platforms in agriculture can allow for the dynamic adjustment of irrigation systems and greenhouse environments based on real-time data, optimizing resource use and enhancing productivity. In [18], IoT technology facilitated urban development by improving public services and economic efficiency, similar to the way IoT in agriculture supports sustainable farming practices and increases yield through improved resource management.
However, smart cities and agriculture share common challenges when relying predominantly on cloud computing, such as latency issues and network limitations that negatively impact real-time system responses [19]. To overcome these limitations, fog and edge computing have emerged as promising alternative technologies. For example, an intelligent tracking system based on ZigBee wireless networks was proposed for real-time data aggregation in smart cities [20]. In agriculture, similar wireless sensor networks (WSNs) are important tools for monitoring soil conditions, crop health, and livestock. These networks, similar to those used in urban environments, rely on dynamically configured nodes (such as sensors and actuators) that function efficiently within centralized monitoring systems.
Ref. [21] introduced a framework for IoT-enabled smart city applications using big data analytics. This has potential benefits for agriculture by enhancing the analysis of sensor data to improve decision making. However, a key limitation is the data loading speed, which can hinder real-time applications. This is particularly important in agriculture, where immediate responses to environmental changes are necessary for operations such as irrigation and pest management.
While cloud computing offers extensive storage and computational capabilities, its latency is a hindrance to time-sensitive agricultural applications [22]. This has led to the growing adoption of edge computing [23], which processes data closer to the source, reducing latency and improving the real-time performance required for tasks such as automated irrigation and pest detection. Overall, the integration of IoT platforms with edge and fog computing, combined with advanced data analysis, has the potential to revolutionize the management and efficiency of modern agriculture.
2.3. Anomaly Detection and Pseudo-Labeling in Time-Series Data Analysis
Anomaly detection in time-series data is a promising research topic in various fields, particularly in complex environments such as smart cities, industrial automation, and finance. In these fields, early detection and responses to abnormal data patterns are essential for maintaining system integrity and performance. Recent studies have proposed various techniques to efficiently handle the complexity and scale of time-series data.
For example, research utilizing generative adversarial networks (GANs) has proven effective in detecting anomalies in multivariate time-series data [24]. These studies leverage GANs to learn the distribution of normal data and assess whether new data deviate from this distribution, thereby identifying anomalies. This approach is especially useful in high-dimensional datasets and has demonstrated detection accuracy superior to traditional statistical methods and simpler machine learning techniques.
Deep learning techniques such as autoencoders have also gained attention for anomaly detection. Autoencoders learn to compress and reconstruct input data, detecting anomalies based on significant reconstruction errors when compared to normal data [25]. This method is particularly adept at capturing complex patterns in data, showing promise in various domains.
Additionally, recent research has introduced transformer models for time-series anomaly detection [26]. Transformers are capable of effectively learning long-term dependencies in time-series data, leading to the development of more sophisticated anomaly detection models. This approach has been particularly beneficial in enhancing detection performance in multidimensional time-series data.
These studies highlight the importance of anomaly detection in time-series data and demonstrate the effective application of various techniques in real-world scenarios. Building on these advancements, our study integrates pseudo-labeling into the semantic segmentation of time-series data, enabling more precise and efficient analysis in complex agricultural environments.
Pseudo-labeling is a semi-supervised learning technique that has gained significant attention in recent years, particularly in scenarios where labeled data are scarce or expensive to obtain. This approach involves training a machine learning model on a small set of labeled data and then using the model to predict the labels for an unlabeled dataset. The most confident predictions are then added to the labeled set, and the model is retrained iteratively. This method leverages the vast amounts of unlabeled data typically available, improving the model’s performance by expanding the training dataset without extensive manual labeling.
In the context of time-series data, pseudo-labeling has proven effective for various tasks, such as anomaly detection, fault diagnosis, and predictive maintenance [27]. In the field of smart agriculture, pseudo-labeling has been employed to automatically label environmental sensor data, facilitating real-time decision-making processes, including irrigation management and crop monitoring. In industrial settings, pseudo-labeling techniques have been utilized to identify patterns in mechanical operational data, enabling the early detection of equipment failures and the optimization of maintenance schedules [28].
Recent studies have further emphasized the potential of pseudo-labeling in improving the scalability and accuracy of models, particularly for large-scale time-series datasets. Pseudo-labeling reduces reliance on fully manually labeled datasets, which is advantageous in domains where manual labeling is not only time-consuming but also susceptible to human error. For example, Du et al. [29] demonstrated the efficacy of a multi-stage learning strategy using pseudo-labeling for time-series anomaly detection, significantly reducing the need for manual labeling while enhancing model accuracy. Similarly, Jin et al. [30] introduced a pseudo-labeling framework that effectively improves the reliability of predictive models for urban time-series data, an important component of smart city applications.
Building on these advancements, our study integrates pseudo-labeling into the semantic segmentation of time-series data for smart agriculture applications. This approach not only improves the efficiency of the labeling process but also ensures more consistent and accurate segmentation, which is key to optimizing operations such as irrigation scheduling, crop monitoring, and equipment management. By leveraging the strengths of pseudo-labeling, our methodology enhances the capability of machine learning models to support more intelligent and autonomous decision making in agricultural environments.
2.4. Limitations of Related Works
Despite advancements in integrating IoT platforms and machine learning techniques with smart agriculture, several limitations persist in the current body of research. A major challenge lies in the fact that many studies narrowly focus on specific aspects of data collection or sensor management, often overlooking the broader and integrated scope necessary for comprehensive data analysis and decision making in real-world agricultural scenarios.
For example, while much attention has been given to improving sensor reliability and optimizing data acquisition through IoT technologies, there is still a lack of integrated methodologies that seamlessly apply machine learning techniques to automate decision-making processes at scale in smart farming environments. Most studies tend to focus on specific use cases, such as irrigation or soil moisture monitoring, but fail to provide solutions that consider the complexity and variety of agricultural data across different environments and applications.
Another limitation is an over-reliance on traditional non-machine learning methods, such as simple thresholding, pattern detection, or statistical analysis (e.g., ARIMA models), for analyzing time-series data. While these techniques are suitable for basic data analysis, they fall short when applied to large-scale datasets or more complex agricultural systems in which sensor data may have temporal or spatial dependencies. Furthermore, these conventional approaches often depend heavily on manually labeled data, which limits the scaling of applications for real-world farming and the ability to adapt machine learning models for dynamic, evolving environments.
Many existing studies also fail to realize the full potential of real-time decision-making systems, where latency and connectivity issues pose significant challenges, particularly in rural and agricultural settings. Although some progress has been made with cloud computing, issues related to latency, bandwidth, and dependence on mobile networks often impede real-time responsiveness. Recent trends toward fog and edge computing have opened new possibilities for real-time data processing closer to the source, yet the integration of these technologies into agricultural IoT systems is still in its early stages and requires further exploration.
Another significant gap in the literature is the lack of attention given to automated labeling techniques such as pseudo-labeling, which can drastically improve the scalability and efficiency of time-series data analysis in smart agriculture. While pseudo-labeling has shown promise in certain applications, its potential for labeling agricultural sensor data and improving model accuracy remains underexplored. The few studies that have adopted pseudo-labeling often fail to combine this technique with advanced machine learning methods, which are necessary to manage the complexity and diversity of agricultural time-series data.
Additionally, most current research on anomaly detection and time-series data analysis focuses on surface-level anomaly detection and classification but does not study more advanced techniques such as semantic data segmentation. This can limit the accuracy and precision of predictive models, especially for noisy or incomplete datasets. Another limitation is the tendency of existing models to analyze entire datasets without segmentation into smaller, more meaningful units based on short-term changes. This often leads to model inaccuracies, as the complexity and inherent noise in agricultural data can distort the results.
2.5. Contributions
Our study addresses the limitations highlighted in previous research by introducing a more comprehensive and integrated framework for smart agriculture that combines IoT platforms with advanced machine learning techniques such as pseudo-labeling. Specifically, we propose a novel approach that integrates pseudo-labeling into the semantic segmentation of time-series data, providing a more accurate and efficient method for data analysis in agricultural environments.
One of the key contributions of this research is the development of an automated pseudo-labeling process that significantly reduces the manual effort typically required for labeling large agricultural datasets. This allows for the rapid expansion of labeled datasets, thereby improving the scalability and efficiency of machine learning models used for agricultural decision making. By iteratively refining pseudo-labels, our approach ensures higher accuracy in model training, thereby improving predictions related to irrigation management, crop health monitoring, and equipment maintenance.
Our method also incorporates the use of real-time data processing techniques facilitated by edge and fog computing to address latency issues and network limitations that are common in remote agricultural settings. By processing data closer to the source, our system enhances the real-time responsiveness required for critical agricultural operations such as automated irrigation and pest detection. This not only increases operational efficiency but also optimizes resource use, promoting agricultural sustainability.
Moreover, we contribute to the field by providing a more sophisticated analysis of time-series data, particularly through the integration of semantic segmentation. Unlike traditional methods that analyze entire datasets, our approach allows for more granular analysis based on short-term data patterns, improving the precision and relevance of predictive models. This is particularly important in agriculture, where environmental conditions can change rapidly, and timely responses are important in maintaining productivity.
Our contributions can be summarized as follows:
A novel pseudo-labeling framework that automates the labeling of agricultural time-series data, enhancing scalability and model accuracy.
Integration of edge and fog computing to address latency issues and improve real-time decision making in smart agriculture.
Advanced semantic time-series data segmentation techniques, allowing for more precise and context-aware predictions in agricultural systems.
A holistic approach that combines IoT platforms with machine learning to create a robust, scalable system for managing complex agricultural environments, thereby supporting the development of intelligent and sustainable farming practices.
By addressing the gaps in current research, our study offers a significant advancement in the application of agricultural machine learning and IoT technologies, providing opportunities to develop more adaptive, resilient, and efficient agricultural systems.
3. LSTM and FSST-Based Fault Diagnosis Algorithm
The machine learning-based fault diagnosis algorithm proposed in this paper utilizes waveform segmentation to analyze system status and environmental conditions in time-series data, particularly within the context of smart agriculture. Waveform segmentation involves dividing time-series data into temporally contiguous segments, helping to identify specific periods or operational states. This segmentation is especially valuable for understanding periodic changes in agricultural environments, such as soil moisture variations, temperature fluctuations, and humidity levels. By extracting meaningful semantic segments from the data, the algorithm provides crucial insights into each phase, enhancing the accuracy of fault diagnosis for devices like soil moisture sensors, temperature sensors, and irrigation controllers.
Algorithm 1 summarizes the main steps of the machine learning-based fault diagnosis algorithm. The algorithm consists of three structured stages: data preprocessing, time-frequency feature extraction using the Fourier synchrosqueezed transform (FSST), and waveform segmentation and classification with LSTM. Each stage is designed to improve the analysis and diagnostic accuracy of time-series data collected from agricultural sensors.
Algorithm 1 Machine learning-based fault diagnosis algorithm.
Require: Time-series data X
Ensure: Fault diagnosis labels
1: Data Preprocessing:
2:   Normalize the time-series data (Equation (1))
3:   Filter the normalized data with the designed bandpass filter (Equation (2))
4: Time-Frequency Feature Extraction:
5:   Apply the Fourier synchrosqueezed transform (FSST) (Equation (3))
6: Waveform Segmentation Using LSTM:
7:   Split the data into training and testing datasets
8:   Initialize the LSTM neural network:
9:     Define the input layer
10:    Define the LSTM layers with hidden nodes
11:    Define the output layer with softmax activation
12:  Train the LSTM neural network:
13:  for each epoch do
14:    Forward pass through the LSTM layers
15:    Compute the loss
16:    Backpropagate the error
17:    Update the weights
18:  end for
19:  Monitor training accuracy, validation accuracy, the loss function, and overfitting
20:  Perform waveform segmentation:
21:    Predict labels for the test data
22: Post-processing:
23:   Combine the segmented labels for the entire input sequence
24:   Output the fault diagnosis labels
Figure 2 illustrates the semantic segmentation process as applied to time-series data. This segmentation process, enhanced by the Fourier synchrosqueezed transform (FSST), serves as a key analytical tool for examining dynamic frequency components within each data segment. The FSST allows for precise interpretation of time-frequency relationships, enabling the system to effectively detect anomalies and monitor sensor performance. The FSST's ability to capture changes in frequency modes over time is especially beneficial for sensor data analysis in smart agriculture, as it enhances the ability to diagnose and predict equipment performance accurately. In Figure 2, each segment represents a specific operational state, which facilitates accurate fault classification. This visualization is particularly useful for understanding the structured approach applied to the fault diagnosis process.
3.1. Data Preprocessing
Time-series data often exhibit significant or irregular fluctuations over time. If such data are not preprocessed, training machine learning algorithms on them can lead to various issues, including inadequate handling of missing values, the influence of outliers, differences in data scales, overlooked seasonality and trends, and neglected temporal structure. Filtering signals removes unnecessary noise by selectively passing or blocking frequency bands, enabling neural networks to better discern the features of the actual signal. Moreover, filtered signals often exhibit more normalized characteristics, contributing to stable neural network operation and reducing imbalances in input data scale. Using filtered data for neural network training can therefore enhance model performance and reduce the noise and volatility associated with signal processing. However, excessive filtering can distort the actual signal and lead to inaccurate judgments.
Rescaling time-series data for normalization is the process of transforming the data to a new range, allowing the data distribution to be adjusted. One commonly used method scales the data to the desired range using the minimum and maximum values of the input data. This process can be expressed mathematically as shown in Equation (1):

$X' = a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}$ (1)

In this expression, $X'$ represents the rescaled data, and $a$ and $b$ denote the lower and upper bounds of the new range, respectively. Moreover, $X$ represents the original data, $X_{\max}$ represents the maximum value of the data, and $X_{\min}$ represents the minimum value. Rescaling adjusts the data to a consistent range, which is particularly useful for machine learning models, as it aligns the ranges of the various input variables, thereby improving learning and performance. By adjusting the data distribution, rescaling reduces the model's sensitivity to outliers or anomalies, thus enhancing the model's stability and generalization abilities. Rescaling preserves the characteristics of the raw data, as it only adjusts the range of the data without altering its distribution. This is one of the reasons why rescaling is widely used in data preprocessing.
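As a concrete illustration, the following Python sketch applies the min-max rescaling of Equation (1) to a sensor trace; the array contents and target range are illustrative assumptions rather than values from the experiments.

import numpy as np

def rescale(x, a=0.0, b=1.0):
    # Min-max rescaling of Equation (1): map x into the range [a, b]
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:  # constant signal: avoid division by zero
        return np.full_like(x, a)
    return a + (x - x_min) * (b - a) / (x_max - x_min)

# Example: normalize a hypothetical soil-moisture trace to [0, 1]
soil_moisture = np.array([312.0, 305.5, 298.7, 330.2, 341.9])
normalized = rescale(soil_moisture)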
In time-series data, feature points exist in different frequency bands. We designed a bandpass filter that selectively passes the frequency bands containing these features, removes high-frequency noise, supports smooth feature extraction, enhances learning stability, and improves the signal-to-noise ratio (SNR). This filter prevents the neural network from learning unnecessary features, thereby improving model training. After the time-series data have been filtered, time-frequency features are extracted and provided to the neural network as input instead of the original data, allowing the network to learn time and frequency information simultaneously.
The proposed infinite impulse response (IIR) filter uses previous samples of the input and output to compute its output. It was designed using the 'ellip' (elliptic) design method, which provides heavily attenuated stopbands and the steepest cutoff for a given filter order. The given parameters adjust the frequency response of the filter, a key design consideration in digital signal processing. The transfer function of the designed filter is shown in Equation (2):

$H(z) = K \cdot \frac{\prod_{i=1}^{M} (z - z_i)}{\prod_{j=1}^{N} (z - p_j)}$ (2)

In this expression, $H(z)$ represents the transfer function of the filter, which describes the relationship between the input and output in the Z-transform domain. $K$ is the scaling constant of the filter, which adjusts the overall magnitude of the transfer function. The $z_i$ represent the zeros (roots), at which the output is zero for a given input, and the $p_j$ represent the poles, at which the output diverges to infinity.
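A minimal SciPy sketch of such an elliptic ('ellip') IIR bandpass design is shown below; the sampling rate, band edges, ripple, and attenuation values are placeholders chosen for illustration, not the parameters used in this study.

import numpy as np
from scipy.signal import ellip, sosfiltfilt

fs = 1 / 60.0            # assumed sampling rate: one sample per minute (Hz)
low, high = 1e-4, 2e-3   # assumed passband edges in Hz (slow irrigation dynamics)

# Fourth-order elliptic bandpass filter: 1 dB passband ripple, 40 dB stopband attenuation
sos = ellip(N=4, rp=1, rs=40, Wn=[low, high], btype="bandpass", output="sos", fs=fs)

x = np.loadtxt("sensor_trace.txt")   # hypothetical rescaled sensor signal
x_filtered = sosfiltfilt(sos, x)     # zero-phase filtering preserves event timing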
In time-series data, the relationship between time and frequency reflects periodic changes in the data over time. Specifically, frequency indicates the rate of periodic changes in the data at a specific time. Higher frequencies correspond to shorter time periods, while lower frequencies correspond to longer periods. Time-frequency relationships in time-series data explain the dynamic nature of the data; analyzing these data as a whole provides a more detailed interpretation of the data characteristics.
The FSST is a mathematical tool for analyzing the dynamic characteristics of frequency modes in the time-frequency domain of time-series data. Equation (3) shows the mathematical expression of the FSST:

$T_s(\omega, t) = \int_{-\infty}^{\infty} S(\eta, t)\, \delta\big(\omega - \hat{\omega}(\eta, t)\big)\, d\eta$ (3)

In this expression, $\omega$ represents the frequency, ranging continuously from $-\infty$ to $\infty$, and $t$ represents time, which indicates the time domain of the given time-series data. $S(\eta, t)$ is the time-frequency distribution, also known as the time-frequency kernel or window output, obtained from the short-time Fourier transform, and $\hat{\omega}(\eta, t)$ is the instantaneous-frequency estimate along which its energy is reassigned. One of the main features of the FSST is that this kernel is used to reconstruct the signal in the time-frequency domain while preserving the frequency-mode information of the original signal. $T_s(\omega, t)$ represents the result of the FSST, which captures the frequency modes of the time-series data in the time-frequency domain.
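The snippet below sketches how per-time-step time-frequency features for the LSTM can be obtained. For simplicity it uses SciPy's plain short-time Fourier transform as a stand-in; the FSST of Equation (3) additionally reassigns (synchrosqueezes) this representation along the frequency axis, for which MATLAB's fsst or a dedicated synchrosqueezing library can be used. The signal, sampling rate, and window parameters are illustrative assumptions.

import numpy as np
from scipy.signal import stft

fs = 1 / 60.0                        # assumed sampling rate (Hz)
x_filtered = np.random.randn(4096)   # stand-in for the filtered sensor signal

# Short-time Fourier transform: |Zxx| is a (freq_bins x time_steps) feature matrix
f, t, Zxx = stft(x_filtered, fs=fs, nperseg=128, noverlap=96)
features = np.abs(Zxx)

# One feature vector per time step, in the layout expected by a sequence model
sequence_input = features.T          # shape: (time_steps, freq_bins)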
Figure 3 provides a visualization of the time-series data after normalization, illustrating how the FSST enables the extraction of dynamic frequency content. These data were collected over a one-month period from a remote agricultural site, focusing on soil conditions. The mathematical normalization process adjusts the data scale to ensure consistency and reliability across a variety of environmental conditions. This enhances the tracking and analysis of temporal variations in frequency modes, enabling precise monitoring of changes within the time-series data. By standardizing the data, the model gains the improved accuracy and generalization essential for effective real-time smart agriculture applications.
3.2. Machine Learning for Waveform Segmentation
The machine learning classification procedure in this study involves dividing the data into training and testing datasets, training the neural network with the training data, and evaluating the neural network’s performance using the testing dataset. We create distinct repositories to separate the data into training and testing datasets. In this paper, LSTM neural network models are utilized for waveform segmentation. The LSTM model generates output in the form of a sequence or mask in which each label corresponds to a segment of the input signal. This configuration is commonly applied in classification tasks where specific events within time-series data need to be detected or predicted. Here, the model diagnoses sensor statuses by detecting the occurrence of events, with the mask providing a binary value at each time step to indicate the presence or absence of these events.
The neural network assigns labels to regions of the signal, effectively categorizing the data into segments based on event occurrence. To enable this segmentation, the region labels of the dataset are transformed into a sequence that contains one label per signal sample. This transformation is achieved through a function that identifies regions of interest and assigns labels accordingly. Regions of interest are identified based on the characteristics of the data. Each sample within these regions is labeled, while irrelevant samples are assigned an “n/a” (not applicable) label.
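A minimal sketch of this region-to-sequence transformation is given below; it assumes regions of interest arrive as (start, end, label) index tuples, and the helper name is hypothetical.

import numpy as np

def regions_to_mask(n_samples, regions, fill_label="n/a"):
    # Expand (start, end, label) regions of interest into one label per sample;
    # samples outside any region keep the "n/a" label
    mask = np.full(n_samples, fill_label, dtype=object)
    for start, end, label in regions:
        mask[start:end] = label
    return mask

# Example: a watering event between samples 120 and 180 of a 300-sample signal
labels = regions_to_mask(300, [(120, 180, "watering"), (180, 300, "normal operation")])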
Processing long sequences of input data requires careful management to prevent performance degradation and excessive memory usage. Thus, input signals and their corresponding label masks are split into shorter segments, improving memory storage and processing efficiency. This is accomplished by utilizing a transformed data repository and a data resizing function that cuts or splits input signals and label masks to manageable lengths. These segmented signals are then passed to the LSTM neural network for training and inference. Each segment is processed independently, allowing the LSTM to estimate based on partial sequences; this is an essential feature of analyzing large datasets in smart agriculture.
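The following sketch shows one way to cut a long signal and its label mask into fixed-length segments before they are passed to the LSTM; the segment length is an arbitrary illustrative value.

def split_into_segments(signal, mask, segment_len=500):
    # Cut a signal and its per-sample label mask into equal-length segments
    segments = []
    for start in range(0, len(signal) - segment_len + 1, segment_len):
        end = start + segment_len
        segments.append((signal[start:end], mask[start:end]))
    return segments  # each element: (signal_segment, label_segment)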
LSTM networks are particularly well suited for waveform segmentation of time-series data in agricultural applications because of their ability to capture temporal dependencies. LSTMs excel at processing sequential data where the order of inputs is significant, making them ideal for the complex, non-linear patterns often found in agricultural time-series data. Such data may include variations in soil moisture, temperature, and crop health over time. These non-linear data necessitate a model like LSTM that can adapt and predict based on time-dependent patterns. This capability is invaluable for detecting anomalies and diagnosing sensor statuses in smart farming environments, where the details of temporal data evolution are important. By maintaining sequential continuity and providing accurate predictions, LSTM models effectively support autonomous decision making in smart agriculture, enhancing system reliability and operational efficiency.
3.3. Model Setup and Training
Figure 4 provides a detailed view of our proposed LSTM neural network model architecture for automated labeling. This model facilitates the segmentation and classification of time-series data collected from agricultural sensors to automate the labeling process. The architecture begins with an input layer that processes one-dimensional time-series data, capturing agricultural sensor readings. Following the input layer, multiple LSTM layers capture temporal dependencies within the data, allowing the model to recognize sequential patterns that reflect various operational states. An intermediate layer then extracts the most important classification features, and a final softmax layer assigns a label to each waveform segment, effectively identifying and categorizing different operational states, such as irrigation, normal operation, or device abnormalities. This architecture supports a scalable, automated labeling system that enhances data processing efficiency and improves accuracy in smart agriculture.
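A compact Keras sketch of such a sequence-labeling architecture is shown below; the layer sizes, number of classes, and feature dimension are illustrative assumptions rather than the exact configuration used in this work.

from tensorflow import keras
from tensorflow.keras import layers

n_features = 65   # assumed per-time-step feature dimension (e.g., time-frequency bins)
n_classes = 3     # watering, normal operation, abnormal operation

model = keras.Sequential([
    layers.Input(shape=(None, n_features)),                                 # variable-length sequences
    layers.LSTM(128, return_sequences=True),                                # capture temporal dependencies
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(32, activation="relu")),            # intermediate features
    layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),  # one label per time step
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])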
The LSTM neural network, configured with these architecture settings, receives time-series data as input, processes each sample sequentially, and outputs a classification for each waveform class. To maximize performance, an appropriate number of hidden nodes is used, enhancing the model's capacity to capture complex temporal patterns in the data, such as fluctuations in soil moisture or temperature changes.
Several elements are implemented to train the neural network efficiently. The data processing pipeline is parallelized to handle large training datasets effectively. An on-demand data-reading method avoids loading all data into memory at once, keeping memory usage manageable when processing large volumes of sensor data. The gather function in the data repository further optimizes memory usage by providing efficient access to the required data as cell arrays containing the training and test signals and their labeled masks.
During training, various metrics are monitored to assess model performance and to adjust parameters as needed. The key metrics include the following:
Loss function changes: The loss function measures the discrepancy between the predicted values and the actual labels. Monitoring its value over time shows whether the model is learning effectively; ideally, the loss decreases consistently, indicating that the model's predictions are becoming more accurate.
Training accuracy: This metric indicates how well the neural network is learning from the training data. Tracking training accuracy helps determine the model’s ability to capture patterns in the time-series data, which helps accurately classify sensor statuses in agricultural applications.
Validation accuracy: This measures the model’s generalization ability on unseen data. High validation accuracy and balanced training accuracy indicate that the model is not overfitting. In smart agriculture, this is particularly important for ensuring the model’s robustness in varying environmental conditions.
Training speed: Efficient training is essential for managing large datasets. If training is slow, hyperparameters can be tuned or data preprocessing methods adjusted. Optimizing training speed is particularly relevant in agricultural applications, where models need to process high-frequency data in real time.
Overfitting: Overfitting occurs when the model performs well on training data but poorly on validation data. If training accuracy continues to increase while validation accuracy decreases, adjustments such as regularization techniques or data augmentation can be applied to prevent overfitting.
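The sketch below illustrates how these metrics can be monitored during training of the model defined above, with early stopping used to curb overfitting; the data arrays, epoch count, and batch size are placeholders.

import numpy as np
from tensorflow import keras

# Placeholder data: (sequences, time_steps, features) inputs and integer labels per time step
X_train = np.random.rand(64, 500, 65); y_train = np.random.randint(0, 3, (64, 500))
X_val = np.random.rand(16, 500, 65);   y_val = np.random.randint(0, 3, (16, 500))

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)  # guard against overfitting

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),   # tracks validation loss and accuracy each epoch
    epochs=50, batch_size=8,
    callbacks=[early_stop], verbose=2)

# history.history holds the monitored curves: loss, accuracy, val_loss, val_accuracy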
4. Experimental Environment and Methodology in Smart Farming Environment
4.1. Environment
Figure 5 provides an overview of the experimental data collection setup within the smart farming environment. This includes IoT-enabled sensors and control systems that continuously monitor soil conditions in real time to automate irrigation. The key components of the system include soil sensors, gateways, and data storage units, all of which support efficient environmental monitoring and data-driven decision making.
These components are described as follows:
Sensors and control units: The smart farm is equipped with various soil sensors and irrigation controllers. Soil sensors monitor real-time data such as moisture, temperature, and humidity levels. Based on these data, the irrigation system is managed by controlling solenoid valves to maintain optimal soil conditions for crop growth.
Gateway: The gateway acts as a central hub to collect data from soil sensors and transmit the data to a remote server. The gateway also communicates with control units to remotely adjust the irrigation system based on the data analysis.
Data collection system: This system stores and manages the gathered sensor data in a centralized database. This system supports real-time data collection and storage, allowing machine learning models to be trained on up-to-date environmental data.
Figure 6 shows the smart farm system used for the performance evaluation, highlighting its ability to monitor and adjust environmental factors in real time. Using advanced algorithms, this system distinguishes soil textures by analyzing characteristic changes and dynamically adjusts irrigation based on temperature and humidity variations. This functionality optimizes water usage and ensures that crops grow under ideal conditions, promoting sustainable and efficient agricultural practices.
Through the combination of sensors, gateways, and data storage systems, this smart farming setup provides a comprehensive approach to monitoring and managing agricultural environments. By utilizing this IoT-driven framework, the system achieves precise control over environmental variables, facilitating sustainable farming practices and supporting more efficient resource allocation. The collected data serve as a foundation for developing and evaluating machine learning models to diagnose and predict agricultural conditions in real time.
4.2. Methodology
The process of diagnosing sensor states through waveform segmentation in this study includes the following steps:
Time-series data collection: Time-series data related to soil conditions and environmental variables, such as temperature, humidity, electrical conductivity, and pH, are collected from sensors installed on the smart farm. Data are collected for a year and segmented into 10-day intervals. This segmentation allows for the analysis of seasonal patterns and the monitoring of changes in soil and environmental conditions over time for optimizing irrigation and other farming operations.
Waveform segmentation and state extraction: The collected time-series data are segmented using waveform segmentation techniques, and features are then extracted from each segment. These features represent distinct operational states of the sensor controllers in the smart farm environment. Machine learning models are then trained on the extracted features to classify the operational states based on time-series data. The states are categorized into “watering”, “normal operation”, and “abnormal operation”. This classification allows for real-time monitoring and prompt responses to any system issues that may occur.
Figure 7 contains the results of the dataset state classification, illustrating how the machine learning model segments time-series data into distinct categories. Each segment corresponds to different sensor states, such as "watering", "normal operation", and "abnormal operation", allowing for efficient monitoring and responses to changes in the smart farming environment. This classification supports the real-time diagnosis and management of agricultural devices based on operational status.
Partial labeling of features: To streamline the training process, a subset of the data is manually labeled as a reference for the entire dataset. Initially, we manually labeled specific sections where peaks appear during watering events, as shown by the dashed areas in Figure 8. By leveraging this partially labeled dataset, machine learning models automatically label the remaining data, significantly reducing the time and effort required for manual labeling.
Figure 8 demonstrates examples of semantic segmentation based on predicted labels. Each segmented section is analyzed by the model to monitor equipment states in real time and to support accurate predictions. This pseudo-labeling process enhances labeling efficiency, facilitating effective model development.
Our methodology introduces a novel pseudo-labeling technique to automate the labeling of semantic segments in time-series data. This process involves using machine learning algorithms to identify patterns within the data and assign labels based on learned features, reducing the reliance on manual labeling. Pseudo-labeling enhances the model’s ability to handle large-scale datasets with greater accuracy and consistency. The automated labeling approach also enables iterative refinement; as pseudo-labeled data are incorporated back into the training set, the model’s predictions continue to improve.
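A minimal sketch of this iterative pseudo-labeling loop is shown below; the confidence threshold, number of rounds, and the Keras-style model interface are hypothetical placeholders for the study's actual training pipeline.

import numpy as np

def pseudo_label(model, X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=3):
    # Iteratively grow the labeled set with the model's most confident predictions
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled, epochs=10, verbose=0)   # retrain on current labels
        probs = model.predict(X_unlabeled, verbose=0)           # (sequences, time_steps, classes)
        conf = probs.max(axis=-1).mean(axis=-1)                 # mean confidence per sequence
        keep = conf >= threshold                                # accept only confident sequences
        if not keep.any():
            break
        pseudo_y = probs[keep].argmax(axis=-1)                  # predicted integer labels
        X_labeled = np.concatenate([X_labeled, X_unlabeled[keep]])
        y_labeled = np.concatenate([y_labeled, pseudo_y])
        X_unlabeled = X_unlabeled[~keep]                        # remove newly labeled sequences
    return model, X_labeled, y_labeled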
Figure 8 illustrates the segmentation process along with the inherent challenges of manual labeling, highlighting areas for improvement. This figure provides examples of semantic segmentation based on predicted labels and shows how manual labeling can introduce errors, such as missing segments or inaccuracies. These errors offer insights into areas needing further refinement, enhancing the model’s overall performance.
These errors are described as follows:
Missing segments: These occur when certain regions are overlooked during real-time data labeling, especially in large-scale time series. Missing segments can prevent the model from capturing important transitions in the data, reducing classification accuracy.
Inaccuracies: Misclassification errors may arise from similarities between different classes, the ambiguity of real-time data, or the limitations of segmentation algorithms. Accurate labeling is essential for ensuring the model’s robustness, especially in the context of smart agriculture, where precise monitoring of soil and crop conditions is needed for optimal resource management.
By addressing these challenges through pseudo-labeling, the proposed method improves labeling efficiency and accuracy, enabling the model to deliver consistent and reliable classifications for various agricultural sensor states. This approach supports the effective deployment of machine learning models in smart farming applications, contributing to improved operational decision making and resource optimization.
4.3. Results
Figure 9 provides an overview of the model's classification performance by displaying the confusion matrix, a tool used to evaluate prediction accuracy across different categories. The experimental results were evaluated using this confusion matrix, from which we identified true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values. These values were then used to calculate accuracy, precision, recall, and the F1 score, providing a comprehensive assessment of the model's effectiveness.
The metrics used to assess the model’s effectiveness are described as follows:
Accuracy: The proportion of correct predictions among the total predictions, calculated as follows:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision: The ratio of true positive predictions to all positive predictions, calculated as follows:
$\text{Precision} = \frac{TP}{TP + FP}$
Recall (or Sensitivity): The proportion of actual positive instances correctly identified as positive, calculated as follows:
$\text{Recall} = \frac{TP}{TP + FN}$
F1 Score: The harmonic mean of precision and recall, calculated as follows:
$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
These metrics provide a comprehensive view of the model’s performance and its effectiveness in classification tasks for smart farming applications.
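For reference, the listing below computes the four metrics from raw TP/TN/FP/FN counts; the example counts are arbitrary and do not correspond to the values reported in Table 1.

def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 score from confusion-matrix counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Example with arbitrary counts for a single class
acc, prec, rec, f1 = classification_metrics(tp=420, tn=1310, fp=55, fn=80)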
Table 1 summarizes the evaluation metrics for each class.
The proposed model achieved an accuracy rate of 89.07%. However, due to data imbalance, this metric alone does not fully reflect the model's performance across all classes. While the model demonstrated high precision for the abnormal class, it was less precise for the normal (non-anomalous) class. Conversely, recall was high for the normal class but relatively lower for the irrigation class. The F1 score was moderate for the irrigation and normal classes and highest for the abnormal class. To enhance overall model performance, particularly in correctly classifying the normal class, further strategies, such as additional data collection and model fine-tuning, are necessary.
Figure 10 illustrates the impact of labeling errors on model performance by comparing the manually labeled data with the predicted labels. This figure highlights the discrepancies arising from manual labeling, such as missing segments and misclassifications, which can reduce accuracy and consistency in large-scale time-series data. The top graph shows the manually labeled data, with the black sections indicating missing labels and some irrigation segments incorrectly labeled as normal. These issues illustrate the challenges of manual labeling for large-scale time-series data. In the bottom graph, the proposed algorithm correctly differentiates between normal and irrigation events, emphasizing its potential for automated and accurate labeling.
5. Conclusions
This study explored the use of machine learning algorithms to analyze time-series data in the context of smart agriculture, where monitoring and managing environmental conditions help to improve sustainable farming practices. The proposed methods for data preprocessing, waveform segmentation, and state classification were evaluated in a smart farming environment and achieved high accuracy in sensor state identification. The model demonstrated an overall accuracy rate of 89%, with automated labeling resulting in a 30% reduction in data processing time. However, the model showed room for improvement in the classification of normal states, suggesting that additional refinement is necessary.
Future research may focus on enhancing model generalization by collecting data from various agricultural environments and exploring deeper neural network architectures. Real-time data processing capabilities will be essential in supporting the deployment of these models for smart farming. Additionally, integrating data from multiple devices, such as soil moisture and temperature sensors, can contribute to developing more comprehensive monitoring systems. Improving the interpretability of machine learning models will also be valuable for gaining insights into the models’ decision-making processes.
Our study introduced a pseudo-labeling approach to address the challenges of manual labeling for large-scale time-series data, leading to improvements in accuracy, scalability, and efficiency. This approach holds significant potential not only for smart agriculture but also for other domains requiring precise, rapid data analysis. By automating the labeling process, the model enhances its capability to manage large datasets efficiently, thereby supporting timely and informed decision making in agricultural practices.
In future work, we plan to expand the scope of our research to include IoT advancements and data security frameworks relevant to smart agriculture. This expansion will provide a more holistic understanding of the technological challenges in smart farming and further enhance the applicability of machine learning-based approaches. Ultimately, exploring the real-world impact of machine learning-driven decision making in agriculture will provide valuable insights for refining and extending these technologies, contributing to smarter and more sustainable farming practices.