1. Introduction
In modern agricultural production, the management of tobacco crop health [1] presents a significant and complex challenge. Diseases affecting tobacco not only severely impact crop yield and quality [2] but also lead to economic losses and ecological issues. Therefore, the accurate and efficient identification and grading of tobacco diseases are crucial for enhancing agricultural productivity and sustainability. Given that tobacco is an economically significant crop, the timely and accurate identification of its diseases directly influences the reduction in losses and the improvement of yield [3]. However, the identification of tobacco diseases faces challenges such as high disease diversity and complex environmental conditions [4], especially in low-resolution and complex agricultural scenes. Traditional identification methods rely on manual vision and experience [5], which are inefficient and susceptible to subjective biases.
Fitri et al. [6] explored pest detection in Indonesian tobacco plants using the Gray-Level Co-Occurrence Matrix (GLCM) for texture feature extraction and Naive Bayes for classification, achieving an accuracy of 82.2%. Xu et al. [7] found traditional ORB corner detection algorithms insufficiently sensitive to image edges when identifying tobacco leaf diseases, leading to suboptimal performance. Chen et al. [8] utilized machine learning methods to recognize the health status of tobacco leaves, selecting a 188-dimensional Support Vector Machine (SVM) combination as the final predictor and reaching an accuracy of 92.7%. Sakhamuri Sridevi et al. [9] reviewed plant diseases in India, emphasizing that manual identification requires extensive labor and botanical knowledge, resulting in high costs.
With the rapid advancement of artificial intelligence and computer vision technologies, their application in disease detection and analysis has become a research hotspot [10,11,12,13]. However, the complexity of agricultural scenes and limitations in image acquisition often result in low-quality tobacco images [14], challenging accurate disease identification. Traditional methods based on high-resolution images are less effective in these scenarios. Moreover, existing super-resolution techniques, despite enhancing image quality, still suffer from inefficiency and inadequate accuracy when processing agricultural images, necessitating more effective and precise disease identification methods.
Lin et al. [15] proposed the CAMFFNet (Coordinate Attention-Based Multiple Feature Fusion Network) CNN model for field tobacco disease recognition, achieving an accuracy of 89%. However, its large parameter size leads to high computational costs. Swasono Dwiretno Istiyadi et al. [16] used VGG16 for tobacco leaf pest classification, achieving high accuracy levels, but their dataset was limited to 1500 images, raising questions about the model's generalizability. Siva Krishna Dasari et al. [17] designed a CNN-based tobacco grading solution, achieving 85.10% accuracy but only 64% on other datasets. Wu et al. [18] proposed a convolutional neural network (CNN)-based intelligent bulk curing method, TobaccoNet, addressing the health hazards of bulk tobacco smoking and achieving significant results. Wang et al. [19] introduced a CNN-based quantitative modeling method for near-infrared spectroscopy datasets to detect nicotine in tobacco, aiding the tobacco industry's development. Li et al. [20] improved the YOLOv7 model for tobacco disease identification and tested it on the Android platform, showing over 90% accuracy. Guo et al. [21] designed a Convolutional Swin Transformer (CST) based on the Swin Transformer for plant disease identification, achieving an accuracy of 90.9%; however, they did not consider the model's robustness. He et al. [21] developed a joint Swin Transformer and SCMix MLP architecture for complex tobacco feature learning, proposing a tobacco classification model based on pyramid feature fusion that achieved 75.8% accuracy and a 12 ms inference time. Pant Kartikey et al. [22] focused on classifying tobacco-related media texts, considering factors such as the affected population's language and its combination in fine-grained classification mechanisms. Borhani Yasamin et al. [23] used the Vision Transformer (ViT) method for real-time automated plant disease detection, combining a CNN with a ViT, and noted that while the model performance increased, the prediction speed decreased.
Despite the maturation of computer vision technologies for tobacco disease identification, these methods generally rely on high-resolution images, and their detection effectiveness decreases with reduced image resolutions. Therefore, this study introduces an innovative model, DiffuCNN, specifically designed for identifying and grading tobacco diseases in low-resolution and complex agricultural scenes. The key contributions of this paper are as follows:
A novel deep learning model, DiffuCNN, is proposed, specially designed for counting tobacco diseases in low-resolution complex agricultural scenes, significantly improving the accuracy of tobacco disease detection under low-resolution conditions.
DiffuCNN integrates a diffusion-based resolution enhancement module, a target detection network optimized through filter pruning, and the CentralSGD optimization algorithm, effectively enhancing the performance of tobacco disease detection and grading.
Experimental results demonstrate that DiffuCNN surpasses other models in accuracy, recall, precision, and frames per second (FPS), particularly excelling in the performance of tobacco disease counting.
Detailed ablation studies on each component of DiffuCNN validate the significance of each part in improving the performance, including the effective application of resolution enhancement, filter pruning, and the CentralSGD optimization algorithm.
In summary, this research aims to provide robust technical support for tobacco disease monitoring and offer new insights and solutions for similar agricultural disease identification problems. In practical applications, this not only aids in enhancing the efficiency and accuracy of disease management but may also positively impact the sustainability of agricultural production. Ultimately, the goal is to contribute to the modernization of global agricultural production through technological innovation.
4. Proposed Method: DiffuCNN
The DiffuCNN model, presented in this paper, is an innovative approach for counting tobacco lesions. It integrates multiple advanced technologies to enhance the accuracy of lesion detection and counting in low-resolution, complex agricultural scenes. DiffuCNN addresses not only conventional target detection challenges but also emphasizes performance optimization under low-resolution and complex background conditions. The core design of the model combines a diffusion-based resolution enhancement technique, a target detection network optimized through filter pruning, the CentralSGD optimization algorithm, and the Diffusion Loss Function to efficiently and accurately count tobacco lesions in low-resolution images. After lesions are detected, the disease severity is graded by counting them.
4.1. Diffusion-Based Resolution Enhancement Module
This module processes low-resolution images, enhancing image details and clarity through a series of algorithms. Inspired by physical diffusion processes, it simulates the natural diffusion of light and color in scenes, effectively improving visual quality, as illustrated in Figure 4.
The input to this module is low-resolution agricultural scene images, potentially lacking detail due to poor shooting conditions, such as lighting or distance. The output is images with significantly enhanced resolutions, where details of small objects are more clearly visible. The process begins with the standardization of the input low-resolution images, including color correction and noise suppression. Then, the images undergo resolution enhancement using the diffusion algorithm, which simulates the natural diffusion of light in the scene, enhancing minute details in the image. Finally, sharpening and contrast adjustments are made to further improve the image quality, ensuring targets are clearly discernible in the image. The diffusion process is described by the following partial differential equation (PDE):
$$\frac{\partial I}{\partial t} = \nabla \cdot \left( D \, \nabla I \right)$$
where $I$ represents the image intensity, $t$ denotes the diffusion time, $D$ is the diffusion coefficient, and $\nabla$ represents the gradient operator. This equation indicates that the change in image intensity is proportional to the divergence of its gradient, simulating the diffusion of light in the physical world. In practical application, this equation is discretized and applied in image processing. The iterative updating of each pixel value in the image gradually enhances the image resolution and clarity. Specifically, the update of each pixel in each iteration can be expressed as follows:
$$I^{t+1}_{x,y} = I^{t}_{x,y} + \lambda \, \nabla \cdot \left( D \, \nabla I^{t} \right)_{x,y}$$
where $\lambda$ is a coefficient controlling the rate of diffusion and $I^{t}_{x,y}$ and $I^{t+1}_{x,y}$ represent the pixel values before and after the iteration, respectively. In the task of tobacco lesion counting, the diffusion-based resolution enhancement module offers significant advantages. By enhancing the details in low-resolution images, previously hard-to-distinguish tobacco lesions become clear, thus improving the detection accuracy. In complex agricultural scenes, this module effectively highlights targets like tobacco lesions against a complex background, facilitating subsequent recognition and counting. The diffusion process, simulating the natural diffusion of light, renders the processed images visually more natural, avoiding artificial traces from over-processing. The enhanced image quality ensures the robust performance of the model under varying conditions, such as different lighting and distances.
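To make the discretized update concrete, a minimal sketch of the iterative diffusion step in PyTorch is given below. The 3x3 Laplacian approximation of $\nabla \cdot (D\,\nabla I)$, the bicubic pre-upsampling, and parameter values such as `diffusion_rate` are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def diffusion_enhance(img: torch.Tensor, steps: int = 20,
                      diffusion_rate: float = 0.15, d_coeff: float = 1.0) -> torch.Tensor:
    """Iteratively apply the discretized update
    I^{t+1} = I^t + lambda * D * laplacian(I^t) to each channel.
    img: (B, C, H, W) tensor with values in [0, 1]."""
    # 3x3 Laplacian kernel approximating div(grad I)
    lap = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]], device=img.device).view(1, 1, 3, 3)
    lap = lap.repeat(img.shape[1], 1, 1, 1)          # one kernel per channel
    out = img
    for _ in range(steps):
        div_grad = F.conv2d(out, lap, padding=1, groups=img.shape[1])
        out = out + diffusion_rate * d_coeff * div_grad
        out = out.clamp(0.0, 1.0)                    # keep intensities valid
    return out

# Example: upscale a low-resolution field image, then refine it via diffusion
low_res = torch.rand(1, 3, 64, 64)                   # stand-in for a real image
upscaled = F.interpolate(low_res, scale_factor=4, mode="bicubic", align_corners=False)
enhanced = diffusion_enhance(upscaled)
```

In practice a small number of iterations suffices; overly large `diffusion_rate` values risk the numerical instability typical of explicit PDE schemes.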
4.2. Target Detection Network Based on Filter Pruning
In the proposed DiffuCNN model, a target detection network optimized through filter pruning is a key component, specifically designed for the accurate detection of tobacco lesions in resolution-enhanced images, as shown in Figure 5. This network optimizes its structure by implementing filter pruning in the convolutional layers, aiming to improve detection efficiency and reduce computational costs.
The network input consists of high-resolution images processed by the diffusion-based resolution enhancement module. These images feature richer details and clear characteristics of tobacco lesions. The output includes the detection results of tobacco lesions, comprising their positions and quantities. The network structure adopts a design stacked with multiple convolutional layers. Each layer consists of several convolutional kernels (filters) responsible for extracting features from the images. Filter pruning is conducted within these convolutional layers. Filters contributing less to the final detection performance are removed after analyzing their importance. This process involves assessing the weights of each filter and then pruning based on predetermined criteria. Convolutional layers are typically followed by an activation layer (e.g., ReLU) and optionally by a pooling layer to enhance non-linear expression capabilities and reduce feature dimensions. Generally, the initial layers of the network have fewer convolutional kernels, mainly extracting low-level features (such as edges and textures), while the number of kernels gradually increases in deeper layers for more complex high-level feature extraction. The input dimension depends on the size of the processed images, while the output dimension is related to the requirements of the detection task, typically involving estimates of the positions and quantities of tobacco lesions.
Not all filters in the convolutional layers significantly contribute to the final detection task. Therefore, the network can be optimized by assessing the importance of each filter and pruning those with lesser contributions. The importance of a filter can be evaluated using the following formula:
$$\mathrm{Importance}(F_i) = \sum_{j,k} \left| w^{(i)}_{j,k} \right|$$
where $F_i$ represents the $i$-th filter and $w^{(i)}_{j,k}$ is the weight of the filter at position $(j,k)$. The importance of a filter is thus estimated by the sum of the absolute values of its weights. In this process, filters with an importance below a certain threshold are removed. By eliminating unimportant filters, the network's parameter count and computational costs are reduced, making the model more lightweight and suitable for environments with limited computational resources. The pruning process helps prevent model overfitting by reducing the complexity of the model, allowing the network to focus more on features critical to the task. In processing high-resolution agricultural scene images, the pruned network can more efficiently handle large volumes of data while maintaining a high recognition rate for small targets like tobacco lesions.
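The following is a minimal sketch of this L1-norm-based filter pruning for a single convolutional layer. The `keep_ratio` threshold and the layer-rebuilding strategy are illustrative assumptions; in a full network, the input channels of the subsequent layer and any BatchNorm statistics must be pruned to match.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.7) -> nn.Conv2d:
    """Rank filters by the L1 norm of their weights, Importance(F_i) = sum |w|,
    and rebuild the layer keeping only the most important ones."""
    with torch.no_grad():
        importance = conv.weight.abs().sum(dim=(1, 2, 3))   # one score per filter
        n_keep = max(1, int(conv.out_channels * keep_ratio))
        keep_idx = torch.argsort(importance, descending=True)[:n_keep]

        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding,
                           bias=conv.bias is not None)
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned

# Example: remove the 30% least important filters of one layer
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
slim = prune_conv_filters(layer, keep_ratio=0.7)
print(slim.out_channels)  # 89
```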
4.3. CentralSGD
The design of CentralSGD addresses challenges encountered by traditional Stochastic Gradient Descent (SGD) methods in dealing with complex models and large-scale data. CentralSGD, based on the traditional SGD approach, introduces the concept of centralized gradients. In conventional SGD, each parameter update is based on the gradient computed from a single training sample or a small batch of samples. In CentralSGD, parameter updates consider not only the current batch’s gradient but also the central gradient of all samples (i.e., the average gradient). In each iteration, the gradient of the current batch is first calculated. Then, the gradient center of all samples is computed, and the current batch’s gradient is compared with this center. Finally, model parameters are updated using this centralized gradient information. The mathematical expression for CentralSGD is described by the following formula:
$$\theta_{t+1} = \theta_t - \eta \left( g_t - \bar{g} + \frac{1}{N} \sum_{i=1}^{N} g_i \right)$$
where $\theta_t$ is the model parameter at time step $t$, $\eta$ is the learning rate, $g_t$ is the average gradient of the current batch, $\bar{g}$ is the historical average of all sample gradients, $g_i$ is the gradient of the $i$-th sample, and $N$ is the total number of samples. The core idea of this method is to reduce the variance in gradient updates across iterations. Each iteration in traditional SGD can exhibit significant gradient fluctuations due to the randomness of individual batch samples, while CentralSGD reduces these fluctuations by introducing gradient centralization, making parameter updates smoother and more effective.
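Since no source code accompanies the paper, the sketch below illustrates one way to realize this update as a PyTorch optimizer. To stay tractable, the full-dataset gradient center is approximated by an exponential moving average (`avg_grad`), and the `beta` hyperparameter is an assumption.

```python
import torch
from torch.optim import Optimizer

class CentralSGD(Optimizer):
    """Sketch of SGD with centralized gradients: each update blends the
    current batch gradient with a running center of past gradients."""

    def __init__(self, params, lr: float = 0.001, beta: float = 0.9):
        super().__init__(params, dict(lr=lr, beta=beta))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, beta = group["lr"], group["beta"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "avg_grad" not in state:
                    state["avg_grad"] = torch.zeros_like(p.grad)  # gradient center
                avg = state["avg_grad"]
                # Centralized gradient: pull the noisy batch gradient toward the center
                central = beta * p.grad + (1.0 - beta) * avg
                p.add_(central, alpha=-lr)
                # Update the running center with the new batch gradient
                avg.mul_(beta).add_(p.grad, alpha=1.0 - beta)
```

Blending the batch gradient with the running center shrinks the batch-to-batch variance of the update direction, which is the effect the formula above formalizes.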
By reducing the fluctuations and instability in gradient updates, CentralSGD converges faster to the optimal solution. This is particularly important when dealing with large-scale datasets, as traditional SGD methods may lead to slower convergence rates in such cases. CentralSGD improves the stability of the training process by considering the gradient information of the entire dataset, reducing the impact of randomness in individual batch samples. For complex models like DiffuCNN, CentralSGD effectively handles a large number of parameters and complex gradient structures, maintaining efficient optimization during deep network training. CentralSGD is particularly suitable for distributed training environments, where gradient centralization can aid different training nodes in working more effectively together, reducing the decline in training efficiency caused by an uneven data distribution.
4.4. Diffusion Loss Function
In the DiffuCNN model proposed, the Diffusion Loss Function is an innovative loss function design, guiding the learning process of the model in the task of tobacco lesion counting. This loss function combines a traditional loss function (such as cross-entropy loss or mean squared error loss) with a regularization term based on the diffusion process, aiming to enhance the model’s accuracy in detecting and counting tobacco lesions in low-resolution, complex agricultural scenes.
The Diffusion Loss Function consists of two parts: a traditional loss function, evaluating the difference between the model output and the true labels, and a regularization term based on the diffusion process, ensuring the model does not overly rely on specific image features during learning, thereby enhancing its generalizability. The regularization term, designed based on the characteristics of the diffusion process, aims to simulate the propagation and change in image features in the natural world. This approach, rooted in physical diffusion theory, encourages the model to focus more on the overall features of the image rather than local details during learning.
The Diffusion Loss Function can be expressed by the following formula:
$$\mathcal{L} = \mathcal{L}_{\mathrm{trad}} + \lambda \, R_{\mathrm{diff}}$$
In the proposed model, the traditional loss function, denoted as $\mathcal{L}_{\mathrm{trad}}$, is complemented by a regularization term based on the diffusion process, represented as $R_{\mathrm{diff}}$. The parameter $\lambda$ serves as a hyperparameter balancing these two components. The regularization term $R_{\mathrm{diff}}$ is further expressed as follows:
$$R_{\mathrm{diff}} = \frac{1}{N} \sum_{i=1}^{N} \left\| f(x_i) - f(\tilde{x}_i) \right\|^2$$
Here, $f$ signifies the model's prediction function, $x_i$ is the original image sample, and $\tilde{x}_i$ is the image sample processed through the diffusion method, with $N$ being the total number of samples. By incorporating the regularization term based on the diffusion process, the model is encouraged to learn and recognize features that remain stable in images subjected to diffusion treatment. This approach shifts the model's focus towards the overall and stable features of images rather than relying solely on specific local details, thereby enhancing the model's generalization capability. Overfitting, a common issue in deep learning model training, especially with limited training data, is also addressed: by constraining the model's complexity through the regularization term, the model's tendency to overlearn noise or incidental features in the training data is reduced. In complex agricultural scenarios where tobacco lesions may appear under diverse backgrounds and lighting conditions, the Diffusion Loss Function motivates the model to learn features consistent across different environments, enabling better adaptation to these complex settings. Owing to its emphasis on learning the holistic features of images, the model achieves more effective identification and counting of tobacco lesions, maintaining a high accuracy even under low-resolution or incomplete visual information conditions.
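A minimal sketch of this loss for a classification-style head is shown below. The cross-entropy choice for $\mathcal{L}_{\mathrm{trad}}$, the weight `lam = 0.1`, and the use of a pre-computed diffusion-processed batch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, images, diffused_images, targets, lam: float = 0.1):
    """L = L_trad + lam * R_diff, with R_diff penalizing disagreement between
    predictions on original and diffusion-processed versions of each sample."""
    preds = model(images)
    l_trad = F.cross_entropy(preds, targets)                  # traditional term
    preds_tilde = model(diffused_images)                      # f(x_tilde)
    r_diff = (preds - preds_tilde).pow(2).sum(dim=1).mean()   # mean ||f(x) - f(x~)||^2
    return l_trad + lam * r_diff

# Example with a toy classifier and a stand-in for diffusion-processed images
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
x = torch.rand(8, 3, 32, 32)
x_tilde = x + 0.01 * torch.randn_like(x)
y = torch.randint(0, 5, (8,))
loss = diffusion_loss(model, x, x_tilde, y)
```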
4.5. Experimental Configuration
In the experimental setup for this study, detailed settings were meticulously established, encompassing hyperparameter configuration and the selection of hardware platforms and libraries, as well as evaluation metrics, which are all crucial for ensuring the effectiveness and reproducibility of the experiments.
4.5.1. Hyperparameter Settings and Hardware Platform with Libraries
The setting of hyperparameters is a critical step in deep learning experiments, directly impacting the training outcomes and the ultimate performance of the model. The hyperparameters in these experiments included the learning rate, batch size, and number of training epochs. The learning rate determines the speed of model weight updates, the batch size determines the amount of data used in each training iteration, and the number of epochs sets the total number of passes over the training data. The initial learning rate was set at 0.001, with a learning rate decay strategy implemented. A batch size of 64 was chosen to accelerate training, at the cost of a higher memory demand. Training was run for 100 epochs. The hardware for the experiments comprised an NVIDIA RTX 3000 GPU, an Intel Core i7 CPU, and 32 GB of memory. The deep learning framework was PyTorch (v2.0), with NumPy (v1.26.4) for numerical computations and Pandas (v2.2.0) for data processing and analysis.
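As a reproducibility aid, the snippet below sketches a training loop with the stated settings (initial learning rate 0.001, batch size 64, 100 epochs). The StepLR decay schedule, the placeholder model, and the synthetic data are assumptions, and the paper's CentralSGD would replace the plain SGD optimizer shown here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and synthetic data standing in for DiffuCNN and the tobacco dataset
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
loader = DataLoader(TensorDataset(torch.rand(256, 3, 64, 64),
                                  torch.randint(0, 4, (256,))), batch_size=64)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)          # initial LR 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # decay

for epoch in range(100):                                           # 100 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```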
4.5.2. Evaluation Metrics
Multiple metrics were employed to comprehensively assess the performance of the model, including precision, recall, accuracy, frames per second (FPS), and mean average precision (mAP).
Precision, defined as the proportion of correctly predicted positive samples to all samples predicted as positive, is mathematically expressed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ represents the number of true positives (correctly predicted positive samples) and $FP$ denotes the number of false positives (incorrectly predicted positive samples).
Recall, indicating the proportion of correctly predicted positive samples to all actual positive samples, is given by the following formula:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ stands for the number of false negatives (incorrectly predicted negative samples).
Accuracy, the ratio of correctly predicted samples (including both positive and negative samples) to all samples, is described as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TN$ signifies the number of true negatives (correctly predicted negative samples).
The FPS, a crucial measure of the model’s processing speed, especially in real-time applications, represents the number of frames processed per second. It is calculated as follows:
$$\mathrm{FPS} = \frac{1}{\text{Average Processing Time Per Frame}}$$
where the "Average Processing Time Per Frame" is the average time taken by the model to process a single frame.
mAP (mean average precision), a common metric for evaluating performance in object detection, information retrieval, and related fields, is the mean of the average precision (AP) values, assessing the model’s overall detection capability across multiple categories. AP is calculated as follows:
$$AP = \int_{0}^{1} P(r) \, dr$$
where $P(r)$ is the precision at recall rate $r$.
The calculation of mAP is as follows:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ is the number of categories and $AP_i$ is the average precision for the $i$-th category.
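These metrics reduce to simple count-based arithmetic once detections are matched to ground truth. The sketch below assembles them from raw counts; the counts, timing figures, and per-class AP values in the example are illustrative, and the per-class AP is assumed to be computed upstream from the precision-recall curve.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int,
                      total_time_s: float, n_frames: int) -> dict:
    """Compute precision, recall, accuracy, and FPS from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fps = n_frames / total_time_s        # reciprocal of average time per frame
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "fps": fps}

def mean_average_precision(ap_per_class: list) -> float:
    """mAP = (1/N) * sum(AP_i) over the N categories."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only
print(detection_metrics(tp=95, fp=2, fn=5, tn=48, total_time_s=2.5, n_frames=145))
print(mean_average_precision([0.97, 0.95, 0.96]))
```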
4.6. Baseline
To thoroughly evaluate the tobacco lesion counting model proposed in this article, a series of advanced comparative models was selected for comprehensive comparison. These baseline models cover both probability density-based counting methods and object detection-based counting methods, ensuring a broad and in-depth assessment.
Firstly, from the probability density-based counting methods, the MCNN (Multi-Column Convolutional Neural Network) [37], a model designed for varying scales of objects and particularly suitable for crowded scenes, was selected. CSRNet (Congested Scene Recognition Network) [38] utilizes its deep convolutional network to achieve high accuracy in density estimation within complex scenarios. The CAN (Context-Aware Network) [39], with its attention mechanism, focuses on counting in important areas of images, thus effectively enhancing the counting accuracy.
In the realm of target detection-based methods, Faster R-CNN [40], a high-precision target detection model integrating a Region Proposal Network (RPN), was chosen. YOLOv8 from the YOLO series [35] is known for its speed and efficiency. SSD (Single Shot MultiBox Detector) [36] was selected for its capability to handle objects of various sizes, and RetinaNet [41], distinguished by its unique Focal Loss design, excels in addressing class imbalance issues. CenterNet [42], which introduces a novel approach to target detection by directly predicting the center points of objects, and MAF50 [10] bring fresh perspectives to the field.
These models, serving as baselines, provided a comprehensive framework for assessing the performance of the proposed tobacco lesion counting model under various scenarios and conditions. Representing the latest advancements in the fields of counting and detection, these models cover a range of needs from real-time detection to high-precision counting. The comparison with these advanced models allowed for an in-depth understanding of the strengths, limitations, and practical applicability of the proposed model. Such thorough comparative analysis is vital for advancing tobacco lesion counting technology and offering guidance for more effective model optimization and application strategies in future works.
5. Results and Discussion
5.1. Objective Detection Performance Results
The objective detection performance experiments were conducted to evaluate various advanced models in the task of tobacco lesion detection. By comparing the precision, recall, mAP, and FPS of Faster R-CNN, YOLOv8, SSD, RetinaNet, CenterNet, and the method proposed in this study, insights into the performance differences among these models, and their underlying reasons, were gained. The experimental results are presented in Table 2 and Figure 6.
Experimental results reveal varying degrees of performance among the seven models (DiffuCNN, YOLOv8, MAF50, RetinaNet, CenterNet, SSD, and Faster R-CNN) in the disease detection task. DiffuCNN achieved the optimal performance across four metrics: precision, recall, mAP, and FPS, with respective values of 0.98, 0.95, 0.96, and 58. This indicates that DiffuCNN not only excels in disease identification accuracy but also possesses advantages in real-time processing speed. YOLOv8 and RetinaNet closely approach DiffuCNN in mAP, yet exhibit a notable gap in FPS, suggesting potential optimization deficiencies in processing speed compared to DiffuCNN. CenterNet, while slightly trailing DiffuCNN in mAP, demonstrates a superior FPS, indicating a processing speed advantage. SSD and MAF50 exhibit a moderate performance, whereas Faster R-CNN underperforms across all metrics.
In visual analysis of images, DiffuCNN stands out in identifying the edges of leaf diseases and recognizing diseases against complex backgrounds. The model precisely locates diseases, with bounding boxes closely aligning with the actual disease edges, and exhibits a lower false detection rate in complex backgrounds. This superiority may be attributed to advanced image processing technologies employed by DiffuCNN, such as a diffusion-based resolution enhancement module, which enhances key features in images without adding extra noise, thereby facilitating easier disease feature detection. YOLOv8 and RetinaNet, despite their commendable mAP performances, occasionally misjudge or overlook diseases in complex background sections, likely due to their limited capabilities in feature extraction and background noise suppression. The relatively high number of detection boxes for SSD and MAF50 suggests potential shortcomings in their false positive performance, leading to a reduced accuracy. The fewer bounding boxes produced by Faster R-CNN may result from inadequacies in its Region Proposal Network (RPN) in generating candidate areas, causing missed detections. In practical agricultural applications, the real-time processing capability (FPS) of models is equally critical. DiffuCNN and CenterNet excel in this aspect, implying their ability to swiftly process vast quantities of image data while maintaining a high accuracy, which is vital for the timeliness and precision of disease monitoring systems.
5.2. Counting Performance Results
The counting performance experiments aimed to comprehensively assess the performance of different models in the task of tobacco lesion counting, especially considering key metrics like precision, recall, accuracy, and FPS. The results demonstrated that, with advancements in objective detection technology, newer models exhibit superior performance in tobacco lesion counting. The experimental results are shown in Table 3.
The baseline model in this experiment, the Multi-Column Convolutional Neural Network (MCNN), demonstrated certain target detection capabilities; however, its overall performance was relatively low. The MCNN, designed to extract features at multiple scales, faced limitations when detecting small and irregularly shaped targets such as disease spots. Its lower precision, recall, and accuracy may be attributed to its feature extraction layers failing to capture sufficient detail to accurately differentiate between diseased spots and healthy tissue, while the lower FPS reflects its limited processing speed. YOLOv8, known for swift and accurate target detection, surpassed the MCNN in all performance indicators. Its architecture, which predicts both the category and location of targets in a single inference, offers significant advantages in speed. Nonetheless, YOLOv8 may not achieve peak performance when dealing with highly overlapping and small-sized targets due to constraints in its receptive field and anchor box settings. The improvements in precision, recall, and accuracy exhibited by the CSRNet and CAN models reflect their specialized design for dense object detection tasks. CSRNet enhances the precision of crowd counting by deeply characterizing density maps of targets, whereas the CAN employs attention mechanisms to reinforce the learning of local features. These mechanisms proved equally effective in the tobacco disease counting task, as they enhanced the network's sensitivity and discriminatory power towards disease spot features. RetinaNet addresses the issue of class imbalance in target detection with its innovative Focal Loss, performing exceptionally well in scenarios with a significant disparity between the numbers of positive and negative samples. This feature allows RetinaNet to maintain a high precision and recall when detecting rare and elusive targets like disease spots.

The optimal performance across all evaluation metrics achieved by the method presented in this paper can be attributed to several key technologies. Initially, advanced image preprocessing techniques were employed to enhance the features of disease spots in the input images, making them easier for the network to recognize. Subsequently, the network structure was specially designed to increase the sensitivity and classification performance for disease spots. Finally, sophisticated optimization algorithms were utilized to ensure the stability and efficiency of the training process, thereby achieving a higher precision and recall while maintaining a high FPS.
The method proposed in this study surpassed other models in all evaluation metrics, demonstrating a clear advantage in tobacco lesion counting, as shown in Figure 7. This superiority stems from optimizations in feature extraction, target localization, and background noise handling: the method employs more advanced network architectures and training strategies, specifically optimized for small, dense targets.
5.3. Ablation Study on Filter Pruning
This section explores the impact of filter pruning on objective detection model performance through ablation experiments. The experimental design compared the model performance under three conditions: no pruning, filter pruning, and common pruning. By evaluating precision, recall, accuracy, and FPS, this experiment aimed to reveal the potential of filter pruning in enhancing model efficiency and performance. The experimental results are presented in Table 4.
The experimental results demonstrated that models employing filter pruning exhibited outstanding precision, recall, and accuracy, along with a significant improvement in FPS. Models without pruning retained all original filters. Although such models maintained a high precision and recall, their extensive number of parameters resulted in a heavy computational burden and slower processing speed, as reflected in the lower FPS. Mathematically, models without pruning possess more parameters and a higher model capacity, enabling the capture of more feature information but also increasing the risk of overfitting and the computational complexity. Common pruning, including techniques like weight pruning or structural pruning, typically reduces the number of parameters in the network randomly or based on generic rules. While this method can improve computational efficiency, the lack of consideration for feature importance can reduce model performance, as evidenced by the lower precision, recall, and accuracy in the experiments. Additionally, common pruning, though increasing the FPS, did not exhibit as pronounced a performance enhancement as filter pruning. Filter pruning, by eliminating filters that contribute little to the final detection performance, reduces the model's parameter count and computational complexity. This not only makes the model more lightweight but also speeds up processing, as evident in the significantly increased FPS. Notably, despite the reduction in the number of parameters, the model's precision and recall remained high, indicating that filter pruning, while removing redundant parameters, retained the feature information crucial for the object detection task. These results suggest that filter pruning can maintain or even enhance model performance while effectively reducing computational demands.
In summary, filter pruning increases the model computational efficiency while largely preserving or even enhancing the detection performance. This characteristic makes it an effective method for optimizing complex deep learning models and particularly suitable for applications requiring rapid processing of large volumes of image data, such as tobacco lesion counting tasks. Through carefully designed filter pruning strategies, models can be streamlined while ensuring accuracy and efficiency in challenging object detection tasks.
5.4. Ablation Study on Diffusion Module
This experiment assessed the impact of the diffusion module on the object detection model in tobacco lesion counting tasks. The experimental design compared models with and without the diffusion module, providing deep insights into the mechanism and effectiveness of the diffusion module in improving model performance. The results showed that models incorporating the diffusion module improved significantly in precision, recall, and accuracy, with an increase in FPS. The results are presented in Table 5.
Models without the diffusion module, although showing decent performances, still had room for improvement in precision, recall, and accuracy. This might be attributed to the models’ inability to fully utilize all useful information in images with a low resolution or unclear details. Without steps to enhance the resolution or improve the image quality, the models might overlook some critical features, leading to performance limitations. The introduction of the diffusion module resulted in significant improvements in precision, recall, and accuracy. By emulating the natural process of diffusion, the module enhanced minor details and features in images, enabling more effective recognition and counting of tobacco lesions. This improvement was especially applicable to images with a low resolution or complex backgrounds, as it enhanced the utilizable information in images, thereby improving the model’s target detection capabilities. Mathematically, the diffusion module increased the pixel density and detail in images, enhancing the model’s capability to recognize image features. This method, without introducing additional noise, amplified key features in images, allowing more accurate localization and identification of tobacco lesions.
Theoretically, the introduction of the diffusion module primarily improved the model’s image processing ability. In deep learning, the quality of the input data directly impacts the model performance. By enhancing the image quality, the diffusion module allowed the model to capture more and finer feature information, which is crucial for object detection tasks. In tasks like tobacco lesion counting, where numerous small, dense targets must be identified, every detail in an image could contain key information. The diffusion module, by clarifying these details, bolstered the model’s detection capability. Moreover, while improving the image quality, the diffusion module did not significantly increase the computational load, as evidenced by the increase in FPS. This might be due to the module enhancing key information in images, making the model more efficient in subsequent feature extraction and classification steps.
5.5. Ablation Study on CentralSGD
This section's experiments evaluated and compared the performance differences between the CentralSGD optimization algorithm and other algorithms (traditional SGD and Adam) in object detection tasks. The experiment compared models using the different optimization algorithms in terms of precision, recall, accuracy, and frames per second (FPS), revealing the impact of the optimization algorithm on model performance and clarifying the advantages and applicability of CentralSGD. The results are shown in Table 6.
Models using traditional SGD, while displaying some detection capabilities, did not perform optimally across all metrics. This was mainly due to SGD’s approach of considering only the gradient of the current batch in each iteration, making it susceptible to fluctuations in individual data batches. Such fluctuations could slow the model’s convergence during training, making it challenging to achieve optimal performance. This characteristic of SGD becomes particularly evident in complex object detection tasks involving large data volumes and intricate model structures. In contrast, models incorporating the CentralSGD optimizer had a significantly improved performance across all metrics. CentralSGD’s design philosophy considers the gradient information of the entire dataset, making model parameter updates more stable and efficient. This method reduced fluctuations during training, accelerating the model convergence and enhancing the overall performance, especially in handling large datasets and complex network structures. Models using the Adam optimizer, although performing better than traditional SGD, were still outperformed by CentralSGD in this experiment. The Adam optimizer, combining momentum and adaptive learning rates, is generally considered to accelerate convergence in the initial stages of training and refine parameter adjustments in later stages. However, in this experiment, the Adam optimizer’s performance in complex object detection tasks was still not as good as CentralSGD, which was specifically optimized for such tasks.
Overall, this experiment highlighted the significant role of optimization algorithms in deep learning model training. Different optimization algorithms have distinct characteristics in terms of parameter update strategies, convergence speed, and stability, directly influencing the model performance in practical tasks. CentralSGD, with its unique global gradient consideration approach, not only improved the efficiency of the model training but also ensured a high performance in complex tasks. The advantages of this optimization algorithm are particularly evident in scenarios requiring the processing of large amounts of data and complex model structures, offering new perspectives for enhancing the performance of object detection models in practical applications.
6. Conclusions
In this study, the application of the DiffuCNN model in tobacco lesion counting tasks was thoroughly investigated, its performance across various aspects was assessed, and the impact of different technical components on the model efficacy was compared. After detecting lesions, the grading of the disease severity was achieved through counting. Through a series of experiments and analyses, key findings and insights were obtained, which hold significant value for future applications in the agricultural domain and research in deep learning.

In the counting performance experiments, the performance of the DiffuCNN model was evaluated against several other object detection models. The results indicated that DiffuCNN surpassed other models in precision, recall, accuracy, and frames per second (FPS), achieving values of 0.98, 0.96, 0.97, and 62, respectively. This superior performance is attributed to several key factors: Firstly, the resolution enhancement module based on diffusion significantly improved the quality of the input images, enabling more accurate recognition and counting of tobacco lesions in images. Secondly, the object detection network based on filter pruning optimized the model structure, reducing the computational load while maintaining a high detection performance. Lastly, the use of the CentralSGD optimization algorithm enhanced the training efficiency and final performance of the model.

In the object detection performance experiments, DiffuCNN demonstrated exceptional detection capabilities. The model accurately detected and located tobacco lesions in images, benefiting from its efficient network architecture and advanced image processing technology. Compared to traditional object detection models, DiffuCNN showed a superior performance in handling small, dense targets in complex agricultural scenes, outperforming other models in precision, recall, mean average precision (mAP), and FPS, with respective values of 0.98, 0.95, 0.96, and 58. This improvement highlights the innovative design and optimization of DiffuCNN, especially in dealing with challenging visual tasks.

In conclusion, this research provides a comprehensive evaluation and analysis of the DiffuCNN model, demonstrating how innovative technical components and algorithms can enhance the performance of deep learning models in complex tobacco lesion counting tasks. The combination of these techniques and methodologies offers an effective means for solving practical problems and also directs future research in the field of deep learning.