Article

Semantic Segmentation of Corn Leaf Blotch Disease Images Based on U-Net Integrated with RFB Structure and Dual Attention Mechanism

1 College of Information Technology, Jilin Agricultural University, Changchun 130118, China
2 College of Life Science, Changchun Normal University, Changchun 130118, China
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(11), 2652; https://doi.org/10.3390/agronomy14112652
Submission received: 28 September 2024 / Revised: 31 October 2024 / Accepted: 7 November 2024 / Published: 11 November 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract: Northern corn leaf blight (NCLB) is a fungal disease that can attack corn throughout its growing period, posing a significant threat to corn yield. Aiming at the problems of under-segmentation, over-segmentation, and low segmentation accuracy in traditional segmentation models for northern corn leaf blight, this study proposes a segmentation method based on an improved U-Net network model. By introducing the convolutional and max pooling layers of a VGG19 network, fusing the channel and spatial attention modules (CBAM), and combining the squeeze-and-excitation (SE) attention mechanism, the method enhances image feature decoding, integrates the feature maps of each layer, strengthens the feature extraction process, expands the receptive field, aggregates context information, and reduces the loss of positional and dense semantic information caused by pooling operations. Findings from the study show that the proposed NCLB-Net significantly improves the MIoU and PA indexes, reaching 92.43% and 94.71%, respectively. Compared with traditional methods (U-Net, SETR, DAnet, OCnet, PSPNet, etc.), the MIoU is improved by 20.81%, 16.10%, 9.79%, 5.27%, and 11.06%, and the PA by 11.49%, 8.18%, 9.54%, 13.11%, and 6.26%, respectively.

1. Introduction

Corn is the crop with the highest global production and is a significant cereal, feed, and industrial raw material crop [1]. Throughout its growth stages, corn is prone to various diseases, which can severely affect grain quality and potentially lead to substantial yield losses. Corn is susceptible to a common disease called northern corn leaf blight (NCLB), which is brought on by the fungus Exserohilum turcicum. It primarily harms the leaves, and in severe cases, it can also affect the sheaths and bracts. The disease usually starts from the bottom leaves and gradually spreads upwards, potentially engulfing the entire plant. NCLB is one of the crucial diseases of corn and is widely distributed across all corn-growing regions worldwide. In outbreak years, it can cause a yield reduction of 15–20%, and in severe cases, over 50% [1]. The occurrence and prevalence of this disease are influenced by a combination of factors, including the resistance of inbred lines, crop rotation systems, climatic conditions, and cultivation practices. Therefore, timely and accurate diagnosis of crop diseases is vital for the development of sustainable agriculture. Currently, the identification of corn leaf diseases relies primarily on manual inspection and the planting experience of farmers, which is highly subjective and, given the numerous types of diseases, often cannot identify them rapidly and accurately. Hence, achieving rapid, efficient, and intelligent identification of corn diseases is of great importance [2]. Computerized systems outperform manual methods in detecting and pinpointing corn leaf diseases, allowing for swift, targeted interventions that bolster crop yield and national food security [3].
Color and texture characteristics can be used to differentiate diseased regions from the background using conventional image segmentation techniques, such as threshold segmentation. In this approach, every pixel in the image is compared against a threshold: pixels with gray values above the threshold are assigned to the disease-affected class, i.e., to areas exhibiting visual characteristics indicative of disease, such as discoloration, irregular shapes, or texture changes. In plant pathology, these disease-affected regions are the focus of diagnostic and therapeutic activity, making them of prime importance for investigation. Pixels with gray values below the threshold are categorized as background or healthy tissue; this class encompasses the normal, healthy green tissue of the leaf as well as any non-leaf elements in the background of the image, such as soil, other plants, or the imaging environment itself. By modifying VI and Otsu thresholds, Talukdar [3] suggested a technique for the effective and real-time segmentation of diseased cabbage leaves. In order to obtain complete images of corn leaves, Tong [4] used the Otsu method, OpenCV morphological operations, and morphological transformation methods to outline the contours of healthy corn leaves, large spots, corn rust, and corn gray spots, and then used these contours to determine the difference set between the corn leaves and the background. Traditional image segmentation techniques require high image quality; if external factors degrade the image quality, the recognition results will be subpar or even incorrect. Consequently, the robustness and generalizability of these methodologies are inadequate, and they cannot ensure accuracy in real-world applications.
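As an illustration of such a thresholding pipeline (this is a minimal sketch, not the cited authors' code; the input path and the use of OpenCV are assumptions), an Otsu-based segmentation might look as follows:

```python
# Illustrative Otsu thresholding to separate lesion-like regions from a leaf
# image, assuming OpenCV is available; "leaf.jpg" is a hypothetical input path.
import cv2

img = cv2.imread("leaf.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)          # smooth to stabilize the threshold

# Otsu picks the gray-level threshold that maximizes between-class variance;
# pixels above/below it fall into the two classes discussed above.
thresh_val, mask = cv2.threshold(gray, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological opening removes isolated speckles, as in contour-based pipelines.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cv2.imwrite("lesion_mask.png", mask)
```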
In intelligent, image-based crop disease identification, leaf image segmentation is a key step [5]. To improve the accuracy and efficiency of intelligent crop disease detection, numerous scholars have conducted research and made improvements [6,7]. With its advancement, numerous researchers have started using machine learning for disease spot segmentation. A crop disease detection and classification model, EDLFM-RPD, was created by Almasoud [6]. EDLFM-RPD locates disease characteristics by using k-means clustering and median filtering as preprocessing. Features are then obtained by combining Inception-based deep features with manually generated gray-level co-occurrence matrices. Lastly, FSVM is used for classification. Ambarwari [7] identified plant species with an accuracy of 82.67% using an RBF-kernel support vector machine (SVM). Machine learning techniques can yield good segmentation results with a small sample size, but they are rather difficult to implement, require several stages of image preprocessing, and can produce detection results in digital images that do not align with human visual perception. Moreover, machine learning-based segmentation algorithms typically perform poorly in unstructured contexts and require researchers to manually design feature extraction and classifiers, making the process more challenging. Machine learning techniques are also used to detect various seeds: content-based image retrieval (CBIR) techniques and random forest classifiers, among other methodologies, have been used to identify corn seed types. However, these techniques require a large amount of data to support them, which is particularly challenging in agriculture, where data are often incomplete and difficult to access [8]. The rapid development of deep learning techniques provides a new way of thinking for researchers in the field of agriculture. To augment the precision and efficacy of corn disease detection, the academic community has persistently investigated and refined network models, culminating in significant advancements. Wang [9] introduced an ADSNN-BO model, predicated on the MobileNet architecture and augmented with advanced attention mechanisms, tailored for the detection and categorization of crop diseases in agricultural fields. The efficacy of this approach was substantiated by cross-validation classification experiments on publicly available datasets, yielding an accuracy rate of 94.65%. Nonetheless, the precision of crop disease identification remains an area requiring additional enhancement. Wang [10] selected apple black rot images from the PlantVillage dataset to diagnose the severity of apple black rot, using a deep convolutional neural network (DCNN) for training. The empirical findings suggest that the proposed methodology obviates the need for labor-intensive feature engineering and threshold-based segmentation techniques; utilization of the VGG-16 model for training yielded the highest overall accuracy rate of 90.4%. Sibiya et al. [11] introduced fuzzy logic decision rules to determine the severity of leaf diseases, using the ratio of lesion area to leaf area as the index.
Joshi [12] proposed a convolutional neural network (CNN), VirLeafNet, for the diagnosis of Clitoria ternatea mosaic virus, initially using traditional image segmentation techniques to separate the disease from the background in images, followed by VirLeafNet for disease severity assessment, achieving a diagnostic accuracy of 90.48%. Subramanian [13] investigated the use of VGG16, a densely connected CNN, for efficient classification of corn leaf diseases. This method employed Bayesian optimization for hyperparameter tuning and transfer learning. The images were sourced from the PlantVillage dataset and online resources, and the method was assessed using the performance metrics of precision, accuracy, and recall. A comparative analysis with alternative methodologies revealed that the approach enhanced precision and concurrently reduced computational time; nevertheless, it did not achieve comprehensive detection of all pathologies associated with corn foliage. Entuni [14] proposed an automatic identification method for corn leaf diseases using the DenseNet-201 algorithm, with images collected from the PlantVillage dataset. The recognition accuracy of the DenseNet-201 model was 95.11%, demonstrating elevated precision and robustness and outperforming prior techniques, including ResNet-50, ResNet-101, and the bag-of-features approach; the methodology did not, however, incorporate the analysis of other plant components for disease identification. Akila [15] analyzed the application of deep convolutional neural networks (DCNNs) in the classification of corn leaf diseases. Atila [16] elaborated on the use of a feature-extraction-based deep neural network for the prediction of leaf disease types. This method could accurately predict the location of diseases within leaves and classify them. The outcomes of the study indicate that the developed method achieved an accuracy of 96.88%, rendering it appropriate for predictive processes within practical agricultural contexts, albeit without the capability to forecast diseases affecting other crop species. Successively, Chen introduced a series of Deeplab architectures, namely Deeplab [17], DeeplabV2 [18], DeeplabV3 [19], and DeeplabV3+ [20], which are proficient in the efficient extraction of multi-scale semantic information from images. Ronneberger [21] proposed the U-Net model, which advanced the fully convolutional network (FCN) by integrating an encoding pathway that captures contextual information with a decoding pathway designed for accurate localization. The decoding pathway of the U-Net network concatenates high-resolution features with the upsampled output features of the decoder via a skip connection structure. Chen [22] subsequently proposed BLSNet, which is based on the U-Net architecture and further enhances the accuracy of crop damage segmentation through the incorporation of attention mechanisms and multi-scale feature extraction. In response to the challenges presented by traditional models, such as missed segmentations, erroneous segmentations, and low segmentation precision, an innovative technique was developed based on an enhanced U-Net network model, termed NCLB-Net. Compared to traditional methods, the improved model excelled in the segmentation of northern corn leaf blight, confirming its outstanding performance and generalization ability in the segmentation of corn leaf lesions and highlighting its potential value in practical applications.
The contributions of this paper are primarily encompassed in the following three aspects:
  • To achieve precise segmentation of northern corn leaf blight, a segmentation model for northern corn leaf blight, termed NCLB-Net, is proposed.
  • The incorporation of a dual attention mechanism into the U-Net encoding framework augments the model’s proficiency in delineating object boundaries, discerning textural characteristics and encapsulating semantic information pertinent to the target, thereby leading to the production of more accurate segmentation results.
  • Innovatively, the RFB (receptive field block) module is placed at the bottom of the U-Net to perform parallel sampling for feature extraction of input images. Through the enlargement of the receptive field, the approach intensifies semantic information and encodes global contextual data by leveraging image-level features. This strategy mitigates segmentation errors that may arise from an overemphasis on local features, thereby augmenting the segmentation efficacy of the network.
Owing to the minute size and indistinct boundaries of lesions associated with northern corn leaf blight, conventional deep learning methodologies are insufficient for accurate recognition. Consequently, to enhance the precision of semantic segmentation for disease detection, the present study introduces an augmented U-Net architecture, termed NCLB-Net. The network utilizes the VGG19 architecture [22] as the underlying framework for feature extraction, incorporates attention mechanism modules within the feature extraction phase, and employs the RFB (receptive field block) module to expand the filter's receptive field. The segmentation efficacy of the standard U-Net, DeeplabV3+, PSPNet, and the proposed NCLB-Net was comparatively evaluated on a dataset of northern corn leaf blight images.

2. Materials and Methods

2.1. Obtaining and Preparing Data

All of the field photos of corn leaves used in this study were taken in the main demonstration area of the National Million Mu Green Food Raw Material (Corn) Standardized Production Base in Lishu County, Siping City, Jilin Province, the People's Republic of China. As seen in Figure 1, a Sony Alpha 7 III digital camera was used to take the pictures.
In order to accurately depict environmental circumstances, we chose 2419 high-quality photos of NCLB. The gathered dataset comprises photos taken in both well-lit and cloudy settings. All these images have been verified by researchers with specialized knowledge, as detailed in Figure 2.
For the annotation and segmentation of the dataset, the study employed LabelMe version 4.5.13, a Python 3.7.0-based visual annotation tool built upon the Qt5 graphics library framework. LabelMe plays a key role in image annotation for applications such as object identification and semantic segmentation, and it also facilitates the creation of labels in COCO and VOC formats. In this study, LabelMe was used to trace closed polygons around the afflicted areas in order to define the morphology and location of northern corn leaf blight (NCLB). The resulting label data were saved in JSON format and then converted into binary PNG images using the json_to_dataset tool. As seen in Figure 3, the black pixels in these binary images represent the background, while the white pixels indicate the presence of leaf lesions.
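For illustration, the conversion step can be sketched as follows (a minimal stand-in for LabelMe's json_to_dataset utility; the file names are hypothetical):

```python
# Rasterize the polygons from a LabelMe JSON annotation into a binary PNG mask
# where white marks lesion pixels and black marks background (illustrative).
import json
from PIL import Image, ImageDraw

with open("leaf_0001.json") as f:          # hypothetical LabelMe annotation file
    ann = json.load(f)

w, h = ann["imageWidth"], ann["imageHeight"]
mask = Image.new("L", (w, h), 0)           # start with an all-black background
draw = ImageDraw.Draw(mask)

for shape in ann["shapes"]:                # each closed polygon traced in LabelMe
    polygon = [tuple(pt) for pt in shape["points"]]
    draw.polygon(polygon, fill=255)        # white = lesion pixels

mask.save("leaf_0001_mask.png")
```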
In real-world environments, the collected image datasets may be influenced by factors such as weather, lighting, and dust. Therefore, aligning with real-world conditions can further enhance the robustness and generalization capabilities of the model. The sample size of the aforementioned dataset is small, which is insufficient to support effective training and evaluation of deep learning models. To address the issue of limited sample size, this paper employs data augmentation techniques including random transformations, adjustments of image brightness and contrast, addition of noise, and translation. Random rotation increases the diversity of samples, simulating the scenario where the same sample is captured from different angles, thereby enabling the model to learn that disease features can still be recognized from varying viewpoints. Adjustments in brightness simulate actual lighting variations, enhancing the model's accuracy in real-world applications and emphasizing feature saliency. Adding noise improves the model's robustness, making it more resilient to noise and reducing its dependence on clean data. The employed augmentation techniques are designed to produce a broader spectrum of image samples, thereby enhancing the model's robustness and precision across various environmental conditions. Following this, the images were resized to a resolution of 512 × 512 pixels. This procedure culminated in an experimental dataset comprising 20,897 images, which constitutes the corn leaf blight dataset, hereinafter referred to as the CLB dataset. Figure 4 displays some of the augmentation results, and Table 1 shows the data distribution.
Employing the dataset construction methodology outlined previously, the corn disease imagery was subject to random allocation. For the experimental phase, the dataset was segmented into training and validation subsets in an 8:2 proportion. The details pertaining to the quantity of the images and corresponding labels for both the training and validation subsets are delineated in Table 2. To address the impact of the inherent randomness associated with the partitioning process, a suite of experiments was conducted to bolster the reliability of the obtained results.

2.2. Data Expansion

2.2.1. Random Rotation Transformation

To further enrich the dataset, the present study simulates images acquired from various perspectives by utilizing data augmentation strategies involving rotational and reflective transformations. The protocol for the random rotational augmentation is delineated as follows:
Let the pixel coordinates of the image prior to rotation be denoted as (m, n); expressed in polar form with radius μ and angle θ, they are:
$m = \mu \cos(\theta), \qquad n = \mu \sin(\theta)$
Following rotation of angle β, the resultant pixel coordinates within the image are denoted as (m′, n′) and are mathematically represented as:
$m' = \mu \cos(\theta - \beta), \qquad n' = \mu \sin(\theta - \beta)$
Expanding via the angle-difference identities, the corresponding transformation is:
$m' = \mu \cos(\theta)\cos(\beta) + \mu \sin(\theta)\sin(\beta), \qquad n' = \mu \sin(\theta)\cos(\beta) - \mu \cos(\theta)\sin(\beta)$
Upon substituting Equation (1) into Equation (3), the resulting expression is derived as follows:
$m' = m \cos(\beta) + n \sin(\beta), \qquad n' = n \cos(\beta) - m \sin(\beta)$
The concept of mirroring flip encompasses two distinct orientations: vertical mirroring flip and horizontal mirroring flip. The vertical mirroring flip is characterized by an inversion about the horizontal axis, whereas the horizontal mirroring flip entails an inversion about the vertical axis.
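For illustration, these augmentations can be sketched as follows (an assumed implementation using OpenCV; the affine matrix applies the same mapping as Equation (4) about the image center):

```python
# Random rotation by angle β plus random horizontal/vertical mirror flips.
import random
import cv2

def random_rotate_flip(img):
    h, w = img.shape[:2]
    beta = random.uniform(0, 360)                       # random rotation angle β
    M = cv2.getRotationMatrix2D((w / 2, h / 2), beta, 1.0)
    out = cv2.warpAffine(img, M, (w, h))                # rotate about the center
    if random.random() < 0.5:
        out = cv2.flip(out, 1)                          # horizontal mirror flip
    if random.random() < 0.5:
        out = cv2.flip(out, 0)                          # vertical mirror flip
    return out
```

When applied to segmentation data, the same transformation must of course be applied to the label mask as to the image.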

2.2.2. Adjustment of Luminance and Contrast

During the collection of the datasets, the clarity of the datasets may be affected by weather and lighting conditions. In order to more accurately align with the characteristics of corn disease targets within natural settings, the present study augments the datasets by manipulating luminance and contrast levels. The objective of this adjustment is to approximate the datasets model to the diverse scenarios that are likely to be encountered in natural environments.
The protocol for modulating image brightness is delineated as follows: the luminance level of the image is altered through the application of arithmetic operations (addition, subtraction, multiplication, or division) to the individual pixel values. Let V denote the original RGB value, V′ the modified RGB value, and h the adjustment coefficient. The mathematical expression for the brightness adjustment is presented in Equation (5).
$V' = V \times (1 + h)$
The procedure for modulating image contrast is as follows: the contrast is adjusted linearly about the median luminance of the image, stretching pixel values away from (or compressing them toward) that median. Let m denote the median luminance value of the image, with V, V′, and h retaining their previous designations. The detailed computational methodology is delineated in Equation (6).
$V' = m + (V - m) \times (1 + h)$
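A minimal sketch of Equations (5) and (6), assuming NumPy arrays with pixel values in [0, 255], might look as follows:

```python
# Brightness and contrast augmentation per Equations (5) and (6).
import numpy as np

def adjust_brightness(img, h):
    """V' = V * (1 + h); h > 0 brightens, h < 0 darkens."""
    return np.clip(img.astype(np.float32) * (1 + h), 0, 255).astype(np.uint8)

def adjust_contrast(img, h):
    """V' = m + (V - m) * (1 + h), stretching values about the median luminance m."""
    m = np.median(img)
    out = m + (img.astype(np.float32) - m) * (1 + h)
    return np.clip(out, 0, 255).astype(np.uint8)
```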

2.2.3. Adding Distractions to the Image

The infusion of noise into images serves to mimic the various interference factors that may be encountered in practical scenarios, and this process is instrumental in assessing and validating the efficacy of segmentation algorithms, thereby contributing to the augmentation of their robustness. Gaussian noise, also termed normal distribution noise, constitutes a form of stochastic perturbation that conforms to a Gaussian probability distribution. Notably, it introduces randomly distributed bell-shaped interference patterns within the image. This noise type is characterized by two key parameters: the mean and the variance. The mean value indicates the orientation of the axis of symmetry of the distribution, whereas the variance quantifies the breadth of the normal distribution curve. The probability density function of Gaussian noise is expressed in Equation (7).
$F(m) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(m - \mu)^2}{2\sigma^2}\right)$
where m represents the random variable, μ represents the mathematical expectation (mean), and σ² is the variance.
Salt-and-pepper noise, alternately referred to as impulse noise, is a prevalent form of noise encountered in image processing. It is characterized by the sporadic occurrence of black and white pixels within an image, which may appear concurrently. The introduction of salt-and-pepper noise can be attributed to a variety of factors, including sensor malfunctions, transmission errors, or other anomalies that arise during the image acquisition phase.
Salt-and-pepper noise typically results in the image displaying distinct black and white speckles, which can significantly impair the visual perception and degrade the quality of the image. In contrast, Gaussian noise tends to impart a blurred and distorted quality to the overall image, diminishing its clarity and contrast. Within the scope of this study, a composite noise model that integrates both Gaussian and salt-and-pepper noise is utilized to augment the robustness of the algorithm. This approach substantially enhances the generalization capabilities of the model, thereby improving its performance across a broader range of conditions.
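For illustration, the composite noise model can be sketched as follows (the noise parameters below are assumptions, not the paper's reported values):

```python
# Gaussian noise following Equation (7) plus salt-and-pepper impulses.
import numpy as np

def add_composite_noise(img, sigma=10.0, sp_ratio=0.01):
    out = img.astype(np.float32)
    out += np.random.normal(0.0, sigma, img.shape)      # Gaussian, mean 0, std sigma
    out = np.clip(out, 0, 255).astype(np.uint8)

    # Salt-and-pepper: flip a random fraction of pixels to pure black or white.
    coords = np.random.rand(*img.shape[:2])
    out[coords < sp_ratio / 2] = 0                               # pepper
    out[(coords >= sp_ratio / 2) & (coords < sp_ratio)] = 255    # salt
    return out
```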

2.3. NCLB-Net

2.3.1. Network Architecture

The U-Net architecture, as shown in Figure 5, constitutes a neural network model that integrates an encoder-decoder framework. The encoder segment is configured as a convolutional neural network (CNN), functioning as a contracting path dedicated to feature extraction and resolution reduction [23]. The contracting path consists of four distinct sub-blocks, each of which includes a sequence of two consecutive 3 × 3 convolutional layers, followed by a rectified linear unit (ReLU) activation function, and culminates in a max pooling layer, which downsamples the feature maps. The consecutive 3 × 3 convolutional layers are instrumental in diminishing the complexity of the neural network while preserving segmentation accuracy. Throughout the downsampling process, the feature channel count is doubled at each stage. The decoder segment of the network is architected with convolutional blocks that integrate upsampling mechanisms, constituting an expansive pathway dedicated to the restitution of image detail, the delineation of segmentation object contours, and the gradual restoration of the spatial resolution of the feature maps. Within this expansive pathway, each sub-block comprises two successive 3 × 3 convolutional layers, a ReLU activation function, and a transposed convolutional layer, which is responsible for the upsampling process. Upsampling amplifies the feature maps to twice their dimensions and reclaims lost detail information. A distinctive attribute of the U-Net is its skip connections (with cropping), which concatenate low-level detail features captured during downsampling with high-level semantic features extracted during upsampling at corresponding layers. The final segmentation output is augmented by combining object category recognition, inferred from low-resolution data, with the accurate localization provided by high-resolution features. This integration effectively addresses the challenge of information loss during the upsampling stage, thereby enhancing the precision of the segmentation outcomes.
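To make the structure above concrete, the following is a minimal sketch of these U-Net sub-blocks, assuming a PyTorch implementation (an illustration, not the authors' code):

```python
# Standard U-Net building blocks: double 3x3 conv + ReLU, max-pool downsampling,
# transposed-conv upsampling, and skip-connection concatenation.
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two consecutive 3x3 conv + ReLU layers, as in each U-Net sub-block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class Down(nn.Module):
    """Max pooling followed by DoubleConv; the channel count doubles."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))
    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Transposed-conv upsampling, skip concatenation, then DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)
    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))
```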
The U-Net model has yielded exceptional segmentation performance across a multitude of datasets. However, it also has several limitations. Firstly, there is a high degree of redundancy: a patch is extracted for each pixel, and the patches of adjacent pixels are very similar, which leads to a protracted network training process. Secondly, classification accuracy and localization accuracy are difficult to obtain simultaneously: additional pooling layers enlarge the receptive field and favor classification accuracy but sacrifice localization accuracy, whereas fewer pooling layers preserve localization at the cost of classification. Furthermore, directly feeding shallow network information into the decoder portion may produce poor segmentation near the margins of lesions. To enhance the model's segmentation performance and address the aforementioned issues, the conventional U-Net structure has undergone the following modifications: (1) Replacing the U-Net feature extraction network with VGG19: VGG19 is utilized in place of the original feature extraction network in the U-Net framework, which significantly improves the network's training accuracy and yields a more accurate segmentation method. (2) Adding the CBAM module: to improve the segmentation accuracy of the proposed model for corn leaf spot images, CBAM is integrated into the feature extraction module to decrease accuracy loss. (3) Incorporating the RFB module: to reduce the local information loss caused by grid effects and inadequate long-distance information correlation, the RFB module extracts features at various scales without additional pooling layers, and the squeeze-and-excitation (SE) attention mechanism is applied in conjunction with disease characteristics to help the network concentrate on the important aspects of lesions. Figure 6 displays the updated model's structure.

2.3.2. Feature Extraction Branch

In this study, the U-Net architecture serves as the foundational framework for model development, with the convolutional and max pooling layers derived from the VGG19 network constituting the encoder segment of the U-Net architecture. The objective of this integration is to augment the efficacy and precision of image feature extraction, thereby refining the semantic segmentation capabilities of the U-Net model and mitigating the influence of extraneous factors on the model’s interpretive accuracy. The component of the backbone network responsible for feature extraction is illustrated in Figure 7.
The encoder facilitates the acquisition of a sequence of feature layers via the VGG architecture, which are subsequently employed for the stacking of convolutional and max pooling layers. The foundational feature extraction component of the backbone is capable of generating five preliminary effective feature layers, which are utilized for the subsequent processes of stacking and cropping.

2.3.3. Convolutional Block Attention Module (CBAM)

In 2018, an attention mechanism termed the convolutional block attention module (CBAM) was introduced for use within convolutional neural network modules. This mechanism amalgamates a channel attention module with a spatial attention module. It operates by calculating attention weights across both the spatial and channel dimensions of the intermediate feature maps of deep learning models; these computed weights are then multiplied with the original feature maps to effect an adaptive refinement of the features. The CBAM module is designed to be incrementally integrated into established architectures, including VGG19, ResNet [24], and MobileNet [25]. The configuration of the CBAM module is depicted in Figure 8.
The channel attention module (CAM) transforms the input feature map F via global max pooling and global average pooling operations, conducted along the width and height dimensions, thereby generating two distinct feature descriptors. These descriptors are then fed into a shared two-layer neural network, a multilayer perceptron (MLP). The MLP outputs are summed element-wise and passed through a sigmoid activation function to yield the channel attention map, denoted as Mc. Thereafter, Mc is element-wise multiplied with the input feature map F to refine the feature representation. This sequence of operations is encapsulated in Equation (8).
$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))\big)$
In the equation, σ represents the sigmoid function, and $W_0 \in \mathbb{R}^{c/r \times c}$ and $W_1 \in \mathbb{R}^{c \times c/r}$ are the weights of the two MLP layers. The MLP weights are shared across the two input branches, and a ReLU activation function follows $W_0$.
The spatial attention module receives the feature map F produced by the CAM as its input feature map. Initially, both global max pooling and global average pooling are conducted across the channel axis, yielding two distinct feature maps. These feature maps are subsequently concatenated along the channel dimension. Following this concatenation, a convolutional layer is applied, which compresses the concatenated feature maps to a single channel. The resultant spatial attention feature, referred to as Ms, is produced by applying a sigmoid activation function. Ultimately, the spatial attention feature Ms is element-wise multiplied with the module’s input feature map to generate the final refined feature representation. The detailed procedure is articulated in Equation (9).
$M_s(F) = \sigma\big(f^{\,n \times n}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{\,n \times n}([F^s_{\mathrm{avg}}; F^s_{\mathrm{max}}])\big)$
where σ denotes the sigmoid function, [ ; ] denotes concatenation along the channel dimension, and $f^{\,n \times n}$ represents a convolution operation with a filter size of n × n.
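A minimal PyTorch sketch of Equations (8) and (9) is given below (illustrative only; the reduction ratio of 16 and the 7 × 7 spatial kernel are assumptions, not values reported by the paper):

```python
# CBAM: channel attention via a shared MLP over pooled descriptors, followed
# by spatial attention via an n x n convolution over pooled channel maps.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared W0, W1 of Eq. (8)
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))               # MLP(MaxPool(F))
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)    # Mc
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # channel-wise AvgPool
        mx = x.amax(dim=1, keepdim=True)                # channel-wise MaxPool
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms
        return x * w

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()
    def forward(self, x):
        return self.sa(self.ca(x))                      # channel first, then spatial
```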

2.3.4. SE Attention Mechanism Module

In the domain of deep learning, the attention mechanism serves to selectively prioritize salient information pertinent to the task at hand from an extensive information set. The integration of the attention mechanism with convolutional operations can markedly enhance the efficacy of semantic segmentation tasks. The squeeze-and-excitation (SE) module operates on the principle of applying a squeeze operation to the feature maps subsequent to convolution, facilitating the extraction of a global, channel-level feature representation [26]. The proposed model encapsulates the interdependencies amongst channels through an excitation mechanism, which ascertains the comparative significance (or weight) of each channel. The derived channel-wise weights are subsequently reapplied to the original feature maps depicting corn disease lesions, thereby generating the ultimate set of feature maps, while preserving the dimensions of the input feature maps derived from corn imagery. In practice, the squeeze-and-excitation (SE) module implements channel-wise attention or gating mechanisms via fully connected layers, which are responsible for mapping the initially encoded features from a lower-dimensional space to a higher-dimensional space, followed by a recalibration of the importance of each feature map. This process endows the model with an enhanced capacity to discern and express the inter-channel correlations, thereby refining the efficacy of the input data feature representation. The aggregated weights culminate in a distribution that represents the relative importance of the constituents within the feature maps. The structure of the SE attention mechanism is illustrated in Figure 9, with varying hues indicating distinct values used to measure the channel relevance, as cited in reference [27]. The SE attention mechanism adaptively identifies and amplifies the pivotal feature channels in corn disease imagery while concurrently diminishing the influence of less relevant channels. This refinement in the network’s representation of input data features augments the distinctiveness and discriminative power of the network.
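The squeeze-and-excitation operation described above can be sketched in a few lines of PyTorch (illustrative; the reduction ratio of 16 is an assumption):

```python
# SE block: global average pooling (squeeze), two fully connected layers
# (excitation), and channel-wise reweighting of the input feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # per-channel weights
        return x * w                                       # recalibrated features
```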

2.3.5. Feature Fusion Branch

In the course of convolutional neural network (CNN) feature learning, the progressive application of convolutional layers leads to a gradual decrease in image resolution. Consequently, the resolution of the resulting deep features may be diminished, which can culminate in the mis-classification of objects that occupy a minor proportion of pixels within the image. To enhance the precision of multi-scale object detection, it is advantageous to integrate features extracted from various strata of the network during the training phase.
The feature pyramid network (FPN) represents a methodology for synthesizing feature maps across diverse levels of a convolutional network, thereby enriching the feature extraction phase [28]. The architectural specifics of the FPN are delineated in Figure 10. The FPN facilitates the integration of feature maps that encapsulate information across diverse scales. As depicted, the FPN executes two upsampling operations on deep feature maps, which are then concatenated with shallow feature maps, followed by a convolutional process, to yield an updated set of deep features. The concatenation of features is executed in a sequential fashion, facilitating the prediction network to incorporate the five primary discriminative feature maps generated by the VGG module of the U-Net architecture’s backbone. The resulting composite feature maps encapsulate a richer semantic and spatial information content, a consequence of the amalgamation of features across various hierarchical strata. This enhanced feature representation contributes to the improved segmentation efficacy of the U-Net network.
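For illustration, the fusion step of an FPN level can be sketched as follows (a minimal example; the channel counts are assumptions):

```python
# FPN-style fusion: upsample the deep feature map, concatenate with the
# shallower map, and convolve to produce the fused output.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(deep, shallow, conv):
    """Upsample `deep` to `shallow`'s resolution, concatenate, then convolve."""
    deep_up = F.interpolate(deep, size=shallow.shape[2:], mode="bilinear",
                            align_corners=False)
    return conv(torch.cat([shallow, deep_up], dim=1))

conv = nn.Conv2d(512 + 256, 256, kernel_size=3, padding=1)   # assumed channels
deep = torch.randn(1, 512, 16, 16)
shallow = torch.randn(1, 256, 32, 32)
fused = fuse(deep, shallow, conv)          # -> (1, 256, 32, 32)
```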

2.3.6. RFB Module Construction

During the expansion of the receptive field and the aggregation of contextual information in semantic segmentation networks, pooling operations frequently result in the erosion of positional and dense semantic details. Nevertheless, the receptive field block (RFB) design mitigates the reliance on parameters and computational complexity, whilst preserving the integrity of image resolution. The architecture of the RFB is delineated in Figure 11. It successfully broadens the effective receptive field of the convolutional kernel using a reduced number of parameters and effectively integrates contextual information [23]. The architecture of the network is outlined as follows. Initially, a dimensionality reduction is performed on the input feature map via a 1 × 1 convolution, subsequent to which the feature map is processed in a sequential manner through three K-convolutions with distinct expansion ratios. The outputs from these four convolutional layers are concatenated along the channel axis, yielding a feature map that possesses an extensively enriched receptive field. Subsequently, the feature map with the enhanced receptive field acquires channel-wise weights via a selective convolutional module. These weights are utilized to compute the feature map, thereby effecting an adaptive modification of the receptive field dimensions. Following this, a 1 × 1 convolution is applied to revert the processed feature map to the original number of channels, which is then refined by the squeeze-and-excitation (SE) module. Conclusively, the refined feature map is integrated with the original feature map through a skip connection.
To reduce computational complexity, a 1 × 1 convolution is employed to change the number of channels to one-fourth of the input. The input feature map sequentially passes through the concatenated branches of the RFB convolutional layers and has its receptive field enhanced by K-conv. Since the transformation matrix F within K-conv contains an identity sub-matrix r 2 × r 2 , it can capture local contextual information ignored by dilated convolutions.
The feature maps emanating from the concatenated branches of the convolutional layer are subjected to a selective convolution module that operates on an attention mechanism. This module assigns weights to the output feature maps, replacing the conventional concatenation with a process of weighted summation. This approach facilitates an adaptive refinement of the receptive field size in response to the multi-scale characteristics of the input data.
Subsequent to this weighting process, the feature maps are re-scaled to their original channel depth via a 1 × 1 convolution. The squeeze-and-excitation (SE) module is subsequently employed to further augment the discriminative power of the feature maps. The enhanced feature maps are then combined with the original feature maps through a skip connection, thereby generating the final output feature map.
Following the integration of the K-conv and sequential branches, the extent of the feature map’s receptive field can be computed as follows:
$k' = (r_1 - 1)(k - 1) + k + (r_2 - 1)$
$RF = RF_1 + k' - 1$
where r 1 represents the intermediate expansion factor, r 2 represents the internal shared factor, k represents the size of the convolution kernel, k′ represents the equivalent convolution kernel size after K-conv, R F 1 represents the receptive field size of the input feature map, and RF represents the receptive field size obtained by K-conv.
When r 2 = 1, the equivalent convolution kernel size of dilated convolution and K-conv is equal; when r 2 > 1, the equivalent convolution kernel size of K-conv is larger than that of dilated convolution.
Firstly, K-conv can increase the size of the equivalent convolution kernel, thereby leading to an increase in the receptive field. Secondly, since K-conv includes an internal shared factor r 2 , which does not exist in dilated convolution, the network proposed in this paper can fully utilize some partial local contextual information. Thirdly, our network is connected in series, so the serial connection can fully utilize convolutional resources to extract features and reduce the number of parameters in the network. Finally, our method can adaptively adjust the weights of the receptive field based on the feature information of different layers.
The RFB (receptive field block) architecture facilitates the enlargement of the convolution kernel’s effective receptive field while maintaining a parsimonious parameter count, thereby proficiently integrating contextual information. This enlargement augments the perceptual field and enriches the semantic content within the feature maps. Furthermore, the features derived from the RFB are capable of encapsulating global contextual information and accounting for contextual interdependencies, thereby circumventing segmentation inaccuracies that may result from an excessive focus on local features. As a consequence, the precision of target segmentation is significantly enhanced. Consequently, the feature map imbued with high-level semantic content is processed by the RFB module prior to upsampling, in order to generate features across a spectrum of scales. This procedure is pivotal in augmenting the network’s efficacy in lesion detection, thereby refining its overall performance in the extraction of lesions.
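As a simplified illustration of this design (not a reproduction of the paper's K-conv or selective convolution; standard dilated convolutions with different rates stand in for the multi-rate branches, and the channel count is assumed divisible by 4):

```python
# RFB-style block: 1x1 reduction to C/4, multi-rate dilated branches,
# concatenation, 1x1 restoration, SE recalibration (SEBlock from the sketch
# in Section 2.3.4), and a skip connection back to the input.
import torch
import torch.nn as nn

class RFBLite(nn.Module):
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        mid = channels // 4                       # 1x1 reduction to C/4
        self.reduce = nn.Conv2d(channels, mid, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, padding=r, dilation=r) for r in rates)
        self.restore = nn.Conv2d(mid * (len(rates) + 1), channels, 1)
        self.se = SEBlock(channels)               # from the SE sketch above
    def forward(self, x):
        y = self.reduce(x)
        feats = [y] + [b(y) for b in self.branches]   # multi-rate receptive fields
        y = self.restore(torch.cat(feats, dim=1))     # back to original channels
        return x + self.se(y)                          # skip connection
```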

2.3.7. Optimize the Loss Function

U-Net is commonly used for pixel-level segmentation tasks and typically employs cross-entropy loss as the primary loss function. To enhance performance, a linear combination of cross-entropy loss (CELoss) and Dice loss is often utilized. In pixel-level segmentation tasks, the number of background pixels usually far exceeds the number of defect samples, and relying solely on cross-entropy loss may lead the network to converge to a local minimum. To address this issue, the Dice loss function is introduced to measure the similarity of samples; it can handle severe imbalances between the numbers of foreground and background pixels, because it emphasizes the correct classification of foreground pixels without overemphasizing background pixels. By jointly applying Dice loss and cross-entropy loss, the training robustly copes with extreme class imbalance in the data: the combined loss prioritizes changes in the loss of the target area, mitigating the impact of the size of the feature region on segmentation accuracy, while the cross-entropy term maintains an accurate full-image loss. Emphasizing the precise calculation of both the overall image loss and the target region loss improves the precision of segmentation from a global perspective.
$L_{\mathrm{Dice}} = 1 - \frac{2|X \cap Y|}{|X| + |Y|} = 1 - \frac{2TP}{FP + 2TP + FN}$
In this formula, |X∩Y| represents the number of pixels in the intersection of sets X and Y, while |X| and |Y| denote the numbers of pixels in X and Y, respectively. The Dice loss function focuses only on the difference between the label and the prediction during computation and is not affected by the imbalance between positive and negative samples. In this experiment, the Dice loss function and the cross-entropy loss function are also evaluated separately to compare the differences between the two loss functions.
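A minimal sketch of the combined loss for binary lesion/background masks (the equal weighting and the smoothing term are assumptions, not values stated by the paper):

```python
# Cross-entropy (BCE) plus the Dice loss of Equation (12).
import torch
import torch.nn as nn

def combined_loss(logits, target, smooth=1.0, alpha=0.5):
    """logits: (B,1,H,W) raw scores; target: (B,1,H,W) binary ground truth."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))                   # |X ∩ Y|
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))  # |X| + |Y|
    dice = 1 - (2 * inter + smooth) / (union + smooth)           # L_Dice per sample
    return alpha * bce + (1 - alpha) * dice.mean()
```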

2.4. Experimental Platform and Evaluation Index

2.4.1. Experimental Environment Configuration

In order to ensure uniformity, model training was carried out in the same environment during the experiment. The hardware and software configurations used in this experiment are shown in Table 3.

2.4.2. Training Settings

In the experimental process, to ensure the comparability of network performance, comparative experiments were conducted using the same dataset under identical environmental conditions. The configuration of experimental parameters was kept consistent, with the image size set to 512 × 512 pixels for uniform processing. To avoid GPU memory insufficiency due to excessively large batch sizes, which could lead to instability in network training, the batch size was limited to 4, with a total of 100 epochs conducted. This setting was determined through experimental results, aiming to achieve the optimal balance between efficiency and accuracy. Since different learning rates lead to variations in loss values, and too low a learning rate slows convergence, the learning rate was set to 5 × 10−4. To meet the requirements of the model at different training stages, the Adam optimizer was employed to adjust network parameters, minimizing the loss function and accelerating model convergence. These experimental configurations contribute to the training and optimization of deep learning models. Specific parameter settings are detailed in Table 4.
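In code, these settings correspond to a training loop along the following lines (a sketch; `NCLBNet` and `CLBDataset` are hypothetical placeholders for the paper's model and dataset classes):

```python
# Training configuration from Table 4: batch size 4, 100 epochs, Adam, lr 5e-4.
import torch
from torch.utils.data import DataLoader

model = NCLBNet()                                   # placeholder model class
loader = DataLoader(CLBDataset("train"), batch_size=4, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for epoch in range(100):                            # 100 training epochs
    for images, masks in loader:
        optimizer.zero_grad()
        loss = combined_loss(model(images), masks)  # loss sketch from Section 2.3.7
        loss.backward()
        optimizer.step()
```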

2.4.3. Model Evaluation Index

Commonly used image segmentation evaluation metrics include accuracy, precision, and sensitivity, also known as recall. These evaluation metrics are defined using the confusion matrix, as shown in Figure 12. In the figure, “Positive” refers to positive samples, and “Negative” refers to negative samples. True positive (TP) indicates that the pixel was predicted as a positive sample and is indeed a positive sample in the gold standard. True negative (TN) indicates that the pixel was predicted as a negative sample and is indeed a negative sample in the gold standard. False positive (FP) refers to the pixel being predicted as a positive sample when it is actually a negative sample in the gold standard. False negative (FN) indicates that the pixel was predicted as a negative sample when it is actually a positive sample in the gold standard.
Accuracy, calculated as the ratio of correctly classified samples to the total sample count (ranging from 0 to 1) as in Equation (13), is a common performance metric, though its interpretability is limited on imbalanced datasets. Precision, as shown in Equation (14), measures the proportion of true positives among positive predictions, while sensitivity (or recall), as in Equation (15), reflects the detection rate of actual positives. The intersection over union (IOU), as per Equation (16), evaluates the overlap between the prediction and the ground truth. The Dice similarity coefficient (DSC), detailed in Equation (17), provides a comprehensive assessment of segmentation accuracy by considering both false positives and false negatives, making it a robust evaluation metric [29].
The Kappa coefficient is a statistic used to measure inter-rater reliability for qualitative (categorical) items. It is generally considered more robust than a simple percent-agreement calculation, as Kappa accounts for agreement occurring by chance. In the context of image segmentation, it measures the agreement between the predicted segmentation and the ground truth, adjusted for chance. The formula for Kappa is shown in Equation (18), where “Observed agreement” is the proportion of times that the raters (or the model and the ground truth) agree, and “Expected agreement” is the proportion of times that they would be expected to agree by chance.
$Acc = \frac{TP + TN}{FN + FP + TN + TP}$
$Pre = \frac{TP}{TP + FP}$
$Sensitivity = \frac{TP}{TP + FN}$
$IOU = \frac{TP}{TP + FP + FN}$
$Dice = \frac{2TP}{FP + 2TP + FN}$
$Kappa = \frac{\text{Observed agreement} - \text{Expected agreement}}{1 - \text{Expected agreement}}$
In summary, this paper employs accuracy and the intersection over union (IOU) to evaluate the segmentation performance.
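For illustration, Equations (13)–(18) can be computed from binary prediction and ground-truth masks as follows (a sketch assuming NumPy boolean arrays):

```python
# Pixel-level metrics from the confusion matrix of Figure 12.
import numpy as np

def segmentation_metrics(pred, gt):
    """pred, gt: boolean arrays where True marks lesion pixels."""
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                           # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    return {
        "accuracy": po,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (fp + 2 * tp + fn),
        "kappa": (po - pe) / (1 - pe),
    }
```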

3. Results

Employing the NCLB-Net network model, a rigorous training regimen consisting of 100 epochs was executed on the dataset, culminating in a robustly trained and effective model. This training regimen not only expedited the model’s convergence but also markedly improved its capacity for data fitting, which in turn enhanced the precision of lesion segmentation within the confines of the NCLB dataset. By systematically executing these steps, a proficiently trained and optimized model was successfully developed, adeptly fulfilling the task of image segmentation for corn northern leaf blight.

3.1. Comparison of 4 Attention Mechanisms

To ascertain the disparities in detection efficacy among distinct attention mechanisms, four attention mechanisms were systematically integrated into the basal model for comparative evaluation: squeeze-and-excitation networks (SENet), dual attention (DA), efficient channel attention (ECA), and the convolutional block attention module (CBAM). The basal model, a refined U-Net architecture augmented with VGG19 and receptive field block (RFB) modules, was held constant in all other experimental variables to isolate the impact of the attention mechanisms on detection performance. Segmentation experiments on a test set of NCLB disease images with complex backgrounds were conducted using MIoU and PA as metrics. Table 5 shows the comparative results of the different attention mechanisms. As indicated in the table, the CBAM attention mechanism achieved the highest MIoU and PA scores, reaching 91.09% and 94.33%, respectively. Therefore, based on this performance, this paper selects CBAM as the most suitable attention mechanism and uses the validation set to evaluate the segmentation performance of the NCLB-Net model.

3.2. Ablation Experiment

To evaluate the efficacy of the NCLB-Net approach in the semantic segmentation of corn disease, it was benchmarked against conventional semantic segmentation techniques including a fully convolutional network (FCN), a pyramid scene parsing network (PSPNet), U-Net, and DeepLabV3+. The mean intersection over union (MIoU) and pixel accuracy (PA) metrics were employed to quantify the segmentation performance of the respective methods.
In order to examine the generalization capability and robustness of the NCLB-Net, segmentation experiments and comparative analyses were carried out on the established training and test datasets. To substantiate the utility of the NCLB-Net paradigm, which aims to surpass the segmentation performance of the conventional feature extraction network (VGG), the receptive field block (RFB) module was integrated into the skip connection component, and the squeeze-and-excitation (SE) block was appended to both the enhanced feature extraction module and the RFB module. The following ablation studies were performed on the test dataset:
(1) Scheme1: Replacing the feature extraction network of the traditional U-Net architecture with the VGG19 network, resulting in improved segmentation effects.
(2) Scheme2: Adding the CBAM module on top of Scheme1, leading to an increase in segmentation accuracy.
(3) Scheme3: Integrating the RFB module into the skip connection layers on the basis of Scheme2.
(4) Scheme4: Replacing the feature fusion network with the feature pyramid network (FPN) in Scheme3.
(5) NCLB-Net: Integrating the SE module into the FPN module and RFB module on the basis of Scheme4.
Table 6 provides the experimental results of the different configurations on the CLB dataset. The NCLB-Net demonstrates a marked improvement over the other configurations, suggesting that incorporating the squeeze-and-excitation (SE) module after the feature pyramid network (FPN) module and the receptive field block (RFB) module substantially augments the model's segmentation proficiency.

3.3. Performance Comparison of Five Segmentation Methods

In this paper, the NCLB-Net is compared with traditional segmentation algorithms including DeeplabV3+, SETR, DAnet, OCnet, and PSPnet.
The outcomes of the comparative analysis of the various segmentation algorithms are detailed in Table 7. As indicated in the table, the pixel accuracy (PA) achieved by the enhanced method introduced in this study is 94.71%, an increase of 11.49%, 8.18%, 9.54%, 13.11%, and 6.26% over the traditional U-net, SETR, DAnet, OCnet, and PSPnet algorithms, respectively. The MIoU of the improved method is 92.43%, which is 20.18%, 16.10%, 9.79%, 5.27%, and 11.06% higher than that of the traditional U-net, SETR, DAnet, OCnet, and PSPnet algorithms, respectively. The empirical findings suggest that the incorporation of attention mechanisms into the methodology advanced in this manuscript augments the model's feature extraction proficiency and markedly elevates the precision of semantic segmentation for northern corn leaf blight.
Figure 13 provides a visual representation of the segmentation outcomes. The first column in Figure 13 depicts the original images of corn northern leaf blight. The second column exhibits the manually annotated ground truth images. The third column reveals the segmentation outputs produced by the SETR model. The fourth column illustrates the segmentation results obtained from the DAnet model. The fifth column showcases the segmentation outcomes derived from the OCnet model. The sixth column presents the segmentation results of the PSPnet model. Conclusively, the seventh column displays the segmentation results yielded by the NCLB-Net model proposed in this study.
The visualization results indicate that while the PSPnet model achieves more accurate segmentation, it tends to miss and misidentify smaller lesions. The NCLB-Net model, conversely, demonstrates greater precision in delineating the boundaries of lesions, including those of a smaller scale, and aligns closely with the annotated ground truth. This alignment results in an exceptionally high accuracy rate for lesion segmentation.
The visual findings indicate that the integration of the RFB (receptive field block) module heightens the model’s sensitivity to the input images, allowing it to encapsulate a more extensive array of contextual information. The supplementary insertion of the CBAM (convolutional block attention module) into the feature extraction module, as well as the SE (squeeze-and-excitation) module during both the feature fusion phase and within the RFB module, contributes to the model’s advanced understanding of the feature interdependencies. This facilitates a refined focus on critical feature channels, leading to a more precise delineation of lesion regions and boundaries.

3.4. Model Performance Verification

To validate the performance of the proposed model, it was compared with five other CNN models. Keeping the experimental methods and environment consistent with the aforementioned setup, each model was trained, and the training results are presented in Figure 14. The proposed model demonstrates characteristics such as high initial segmentation accuracy, low loss rate, and rapid convergence speed.
As shown in Figure 14, the curves of our model during training and validation are relatively smooth with minimal fluctuation, indicating that our model possesses good stability.

3.5. Lesion Classification and Comparison Test

Since there is no clear grading standard for the degree of NCLB disease, in order to more accurately analyze the grading of the NCLB disease, this paper develops a grading standard for NCLB leaf spot with reference to the standard developed by the People’s Republic of China, “Summer Corn Seedling Growth Monitoring Regulation”. Based on the principle of pixel point statistics, this paper employs Python to execute the measurement of disease spot areas. Utilizing this method, the leaves are classified into three distinct categories: grade one, grade two, and grade three. The detailed criteria for each grade are outlined in Table 8, which provides a clear framework for the assessment and categorization of the disease spots on the leaves. This systematic approach ensures a standardized and objective evaluation of the disease severity, which is essential for agricultural and plant health monitoring purposes.
As delineated in Equation (19), the variable T denotes the ratio of the diseased region to the complete image area. The formula for its computation is as follows:
$T = \frac{A_{\mathrm{spot}}}{A_{\mathrm{image}}} = \frac{\sum_{(x,y) \in R_{\mathrm{spot}}} n(x,y)}{\sum_{(x,y) \in R_{\mathrm{image}}} n(x,y)}$
In the equation, $A_{\mathrm{spot}}$ denotes the area of the lesion region, $A_{\mathrm{image}}$ represents the area of the entire image, $R_{\mathrm{spot}}$ refers to the lesion region, and $R_{\mathrm{image}}$ denotes the image region.
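As an illustration of this pixel-statistics computation (the grading cut-offs below are hypothetical placeholders; the actual thresholds are those listed in Table 8):

```python
# Compute T = A_spot / A_image from a binary lesion mask and map it to a grade.
import numpy as np

def lesion_ratio(mask):
    """T of Equation (19); mask is a binary lesion mask (nonzero = lesion)."""
    return np.count_nonzero(mask) / mask.size

def grade(t, cut1=0.1, cut2=0.3):           # hypothetical thresholds
    if t <= cut1:
        return 1                            # grade one
    elif t <= cut2:
        return 2                            # grade two
    return 3                                # grade three
```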
To assess the efficacy of the model, a comparative experiment was conducted on the basis of the CLB dataset classification, where the method proposed in this paper was compared with the Scheme1, Scheme2, Scheme3, and Scheme4 models, as illustrated in the table below.
It is evident from the data presented in Table 9 that, across the experimental models, the highest segmentation accuracy is observed for level 3. This can be attributed to the larger leaf area occupied by level-3 spots. Figure 15 provides a visual comparison of the segmentation accuracy for level-3 lesions across the different models examined.
As is evident from Figure 15 and Table 9, the level-3 segmentation accuracy of the model proposed in this study is enhanced by 4.0%, 2.43%, and 1.16%, respectively, in comparison with the Scheme1, Scheme2, and Scheme3 models, and it also exceeds that of Scheme4. For the remaining two lesion levels, the proposed model likewise outperforms the Scheme1, Scheme2, Scheme3, and Scheme4 models.

4. Discussion

NCLB is a common corn disease that causes lesions on the leaves, severely impairing photosynthesis and nutrient accumulation and leading to significant yield losses. To minimize production losses, rapid detection and determination of the disease progression stage are crucial. In this research, we developed an optimized U-Net network tailored to the segmentation of corn leaf blotch lesions. By adopting VGG19 as the backbone of the U-Net architecture and integrating the convolutional block attention module (CBAM) into the downsampling phase, we markedly enhanced the network's training accuracy and achieved more precise segmentation outcomes. Furthermore, we inserted the receptive field block (RFB) module at the junction of the encoder and decoder within the U-Net structure, strengthening the connection between the two stages of the network. The SE attention mechanism was then introduced after the RFB convolution; in combination with the characteristics of the disease, this allows the network to attend more closely to the key parts of the lesion, improving segmentation accuracy. In the upsampling stage, feature maps from different layers are fused using an FPN embedded with the SE mechanism, which enhances the feature extraction process, as the FPN integrates feature maps that capture information at different scales. Finally, the loss function is optimized as a linear combination of cross-entropy loss and Dice loss, comprising a comprehensive full-image loss and an accurate loss computed over the target region.
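To make the loss formulation concrete, the following is a minimal PyTorch sketch of such a linear combination of cross-entropy and Dice loss (the class name, the equal 0.5/0.5 weighting, and the smoothing constant are our assumptions; the paper's exact weights are not restated here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Linear combination of pixel-wise cross-entropy and soft Dice loss."""
    def __init__(self, alpha: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.alpha = alpha  # mixing weight between the two terms (assumed 0.5)
        self.eps = eps      # smoothing constant to avoid division by zero

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) integer class labels
        ce = F.cross_entropy(logits, target)  # full-image pixel loss
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = 1.0 - ((2.0 * inter + self.eps) / (union + self.eps)).mean()  # region-overlap loss
        return self.alpha * ce + (1.0 - self.alpha) * dice
```

The cross-entropy term supervises every pixel of the image, while the Dice term directly rewards overlap with the (often small) lesion region, mitigating the foreground-background imbalance.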
While the current work demonstrates the network's main advantages of strong performance and fast operation, we acknowledge that there is still significant room for improvement. In practical applications, a comprehensive training strategy that accounts for both image quality and environmental factors remains to be developed. Beyond data augmentation, future work may adopt transfer learning or domain adaptation to improve the model's adaptability to different environments and thereby its robustness. The diversity of lesion types also leads to suboptimal segmentation in some cases; the network structure needs further optimization to accurately segment lesions of different shapes, and exploring stronger semantic segmentation models could improve adaptability to complex natural environments. Limitations remain in the training process as well: replacing the encoder module with VGG19 improved segmentation accuracy but may constrain the feature learning capability. The SE attention mechanism introduced after the RFB module is intended to compensate by enhancing feature learning and thus overall performance; absent such an attention mechanism, further refinement of the VGG19 structure would be needed to improve the extraction of complex features. Finally, optimizing the network for specific scenarios and training on different plant diseases and plant types is crucial for enhancing the model's generalization capability.

5. Conclusions

To address the issue of diminished segmentation precision for large maize leaf spots, this study introduces a novel deep learning-based segmentation approach termed NCLB-Net. The proposed method combines the U-Net architecture with the VGG19 network and incorporates the CBAM module, which markedly enhances the network's training accuracy and yields more precise segmentation outcomes. The RFB module is integrated into the skip connections to enlarge the receptive field, aggregate contextual information, and mitigate the loss of positional and dense semantic information caused by pooling operations, thereby reducing reliance on parametric complexity and computation. This helps the model capture image edge information effectively and retain detailed features, culminating in more refined and accurate segmentation results. Furthermore, the SE module is introduced within both the upsampling stage and the RFB module to better reconstruct object edge information, enhance the method's feature extraction capabilities, and minimize missed detections. The loss function is optimized through a linear combination of the cross-entropy and Dice loss functions, encompassing both the overall image loss and a precise computation of the target-region loss. Experimental evaluations conducted on the CLB dataset demonstrate the efficacy of the proposed method in extracting the lesion areas of large maize leaf spots, resulting in more accurate and efficient segmentation, and an ablation study ascertains the influence of each enhanced component on the semantic segmentation performance. The experimental validation indicates that the enhanced NCLB-Net achieves significant improvements in MIoU, mPA, mPr, and mRecall, with respective values of 86.32%, 88.97%, 91.10%, and 88.97%. Compared with traditional methods such as DeepLabv3+, SegNet, FCN, and PSPNet, the improved model performs well in segmenting northern maize leaf blight spots, confirming its potential value in practical applications. It provides a reference for precise pesticide application and stable yield increases in maize, helps prevent pesticide misuse, reduces disease-related yield losses, offers insight into the disease development process, and provides theoretical support for establishing an early warning system for corn diseases.
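As a final illustration of the upsampling-stage design summarized above, the sketch below shows one hypothetical way to fuse a decoder feature map with a skip connection, FPN-style, and reweight the fused result with an SE block (all names, the 1 × 1 projection, and the reduction ratio are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEFusion(nn.Module):
    """FPN-style fusion of decoder and skip features, followed by SE reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)  # merge concatenated maps
        self.se = nn.Sequential(                                      # compact SE block
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, decoder_feat: torch.Tensor, skip_feat: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(decoder_feat, size=skip_feat.shape[2:],
                           mode="bilinear", align_corners=False)      # upsample to skip resolution
        fused = self.proj(torch.cat([up, skip_feat], dim=1))          # FPN-style lateral fusion
        return fused * self.se(fused)                                 # channel attention on the fusion
```

The SE reweighting after fusion is what lets the decoder emphasize lesion-relevant channels from both scales before the next upsampling step.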

Author Contributions

Conceptualization: Y.M. and K.L.; methodology: K.L. and Y.S.; software: K.L. and Y.B.; validation: Y.S.; formal analysis: K.L.; investigation: K.L.; resources: Y.B.; data curation: K.L.; writing—original draft preparation: K.L.; writing—review and editing: Y.M. and K.L.; supervision: Y.S.; project administration: Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Changchun Science and Technology Bureau, funding number 21ZGN27 http://kijchangchun.gov.cn (accessed on 23 June 2024); the Science and Technology Department of Jilin Province, funding number 20210302009NC http://kjt.jl.gov.cn (accessed on 23 June 2024); and the Department of Education of Jilin Province, funding number JJKH20230386KJ http://jyt.jl.gov.cn (accessed on 23 June 2024).

Data Availability Statement

Owing to data security considerations, the dataset is not publicly available. Researchers who wish to use the data should contact the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cai, J.; Pan, R.; Lin, J.; Liu, J.; Zhang, L.; Wen, X.; Chen, X.; Zhang, X. Improved EfficientNet for corn disease identification. Front. Plant Sci. 2023, 14, 1224385. [Google Scholar] [CrossRef] [PubMed]
  2. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  3. Dutta, K.; Talukdar, D.; Bora, S.S. Segmentation of unhealthy leaves in cruciferous crops for early disease detection using vegetative indices and Otsu thresholding of aerial images. Measurement 2022, 189, 110478. [Google Scholar] [CrossRef]
  4. Liu, Z.; Du, Z.; Peng, Y.; Tong, M.; Liu, X.; Chen, W. Study on corn disease identification based on PCA and SVM. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; IEEE: Piscataway, NJ, USA, 2020; Volume 1, pp. 661–664. [Google Scholar]
  5. Tian, K.; Li, J.; Zeng, J.; Evans, A.; Zhang, L. Segmentation of tomato leaf images based on adaptive clustering number of K-means algorithm. Comput. Electron. Agric. 2019, 165, 104962. [Google Scholar] [CrossRef]
  6. Almasoud, A.S.; Abdelmaboud, A.; Eisa, T.A.E.; AlDuhayyim, M.; Elnour, A.A.H.; Hamza, M.A.; Motwakel, A.; Zamani, A.S. Artificial intelligence-based fusion model for paddy leaf disease detection and classification. Comput. Mater. Contin. 2022, 72, 1391–1407. [Google Scholar]
  7. Ambarwari, A.; Adrian, Q.J.; Herdiyeni, Y.; Hermadi, I. Plant species identification based on leaf venation features using SVM. Telkomnika Telecommun. Comput. Electron. Control 2020, 18, 726–732. [Google Scholar] [CrossRef]
  8. Ali, A.; Qadri, S.; Mashwani, W.K.; Belhaouari, S.B.; Naeem, S.; Rafique, S.; Jamal, F.; Chesneau, C.; Anam, S. Machine learning approach for the classification of corn seed using hybrid features. Int. J. Food Prop. 2020, 23, 1110–1124. [Google Scholar] [CrossRef]
  9. Wang, Y.; Wang, H.; Peng, Z. Rice diseases detection and classification using attention based neural network and Bayesian optimization. Expert Syst. Appl. 2021, 178, 114770. [Google Scholar] [CrossRef]
  10. Wang, G.; Yu, S.; Wang, J. Automatic Image-Based Plant Disease Severity Estimation Using Deep Learning. Comput. Intell. Neurosci. 2017, 7, 29–36. [Google Scholar] [CrossRef]
  11. Sibiya, M.; Sumbwanyambe, M. An algorithm for severity estimation of plant leaf diseases by the use of colour threshold image segmentation and fuzzy logic inference: A proposed algorithm to update a leaf doctor application. AgriEngineering 2019, 1, 205–219. [Google Scholar] [CrossRef]
  12. Joshi, R.C.; Kaushik, M.; Dutta, M.K.; Srivastava, A.; Choudhary, N. VirLeafNet: Automatic analysis and viral disease diagnosis using deep-learning in Vigna Mungo plant. Ecol. Inform. 2020, 6, 97–101. [Google Scholar] [CrossRef]
  13. Subramanian, M.; Lv, N.P.; Sathishkumar, V.E. Hyperparameter optimization for transfer learning of VGG16 for disease identification in corn leaves using Bayesian optimization. Big Data 2022, 10, 215–229. [Google Scholar] [CrossRef]
  14. Entuni, C.J.A.; Zulcaffle, T.M.A. Identification of corn leaf diseases comprising of blight, grey spot, and rust using DenseNet-201. Borneo J. Resour. Sci. Technol. 2022, 12, 125–134. [Google Scholar] [CrossRef]
  15. Priyadharshini, R.A.; Arivazhagan, S.; Arun, M.; Mirnalini, A. Maize leaf disease classification using deep convolutional neural networks. Neural Comput. Appl. 2019, 31, 8887–8895. [Google Scholar] [CrossRef]
  16. Atila, U.; Uçar, M.; Akyol, K.; Uçar, E. Plant leaf disease classification using EfficientNet deep learning model. Ecol. Inform. 2021, 61, 101182. [Google Scholar] [CrossRef]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  18. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  19. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  20. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  22. Chen, S.; Zhang, K.; Zhao, Y.; Sun, Y.; Ban, W.; Chen, Y.; Zhuang, H.; Zhang, X.; Liu, J.; Yang, T. An approach for rice bacterial leaf streak disease segmentation and disease severity estimation. Agriculture 2021, 11, 420. [Google Scholar] [CrossRef]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  28. Jin, X.; Xie, Y.; Wei, X.-S.; Zhao, B.-R.; Chen, Z.-M.; Tan, X. Delving deep into spatial pooling for squeeze-and-excitation networks. Pattern Recognit. 2022, 121, 108159. [Google Scholar] [CrossRef]
  29. Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Via del Mar, Chile, 27–29 October 2020; pp. 1–7. [Google Scholar] [CrossRef]
Figure 1. Sony Alpha 7 III digital camera.
Figure 2. Part of the NCLB disease dataset.
Figure 3. Status of picture annotation: (a) original image, (b) outcomes of image marking. White denotes the lesions and black the background.
Figure 4. Illustration of the data augmentation effect on part of the NCLB disease dataset.
Figure 5. U-Net structure model demonstration.
Figure 6. NCLB-Net construction. The numeral “1024” above the central color block denotes the insertion point of the RFB module within the network architecture. The green color block signifies the position at which the CBAM attention mechanism has been integrated, while the blue color block denotes the site of incorporation of the SE attention mechanism.
Figure 7. Backbone feature extraction network structure: (a) the left part represents the backbone network feature extraction model; (b) the right part represents the implementation approach of the backbone feature extraction part.
Figure 8. CBAM structure.
Figure 9. SE structure.
Figure 10. Augmentation of the architecture within the feature extraction segment of the network. The left part represents the enhanced feature extraction partial model; the right part represents the implementation approach of the enhanced feature extraction component.
Figure 11. RFB structure.
Figure 12. Confusion matrix.
Figure 13. Segmentation results of different algorithms: (a) original image, (b) ground truth, (c) Scheme1, (d) Scheme2, (e) Scheme3, (f) Scheme4, and (g) NCLB-Net. The red bounding boxes delineate regions with significant discrepancies among the segmentation methods.
Figure 14. Training and validation results of each model.
Figure 15. Comparison of classification accuracy for graded lesions.
Table 1. Dataset distribution.
Distribution | Before Data Augmentation | After Data Augmentation
Images | 2419 | 20,897
Table 2. Target numbers in dataset.
Distribution | Training | Validation
Images | 16,717 | 4180
Table 3. Experimental environment parameters.
Experimental Environment | Configuration
Operating system | Windows 11 64-bit
CPU | Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz
GPU | NVIDIA GeForce RTX 4090
CUDA | 11.7
Python | 3.10
PyTorch | 1.8.1
RAM | 16 GB
Table 4. Experimental parameter settings.
Experimental Parameters | Configuration
Input image | 512 × 512
Learning rate | 5 × 10⁻⁴
Epoch | 100
Batch_size | 4
Optimizer | Adam
Table 5. Comparative results of four attention mechanisms.
Method | MIoU (%) | PA (%)
Original | 90.13 | 93.60
SENet | 90.33 | 94.14
DANet | 90.21 | 94.07
ECANet | 90.39 | 94.21
CBAM | 92.61 | 95.48
Table 6. NCLB-Net ablation experiment results.
Method | MIoU (%) | PA (%)
U-Net | 82.17 | 91.41
Scheme1 | 88.59 | 92.09
Scheme2 | 90.18 | 92.37
Scheme3 | 90.75 | 93.01
Scheme4 | 90.91 | 93.66
NCLB-Net | 92.43 | 94.71
Table 7. Comparison of objective indicators (PA, MPA, MIoU, and Kappa) for semantic segmentation of images using the U-Net, segmentation transformer (SETR), DAnet, OCnet, PSPnet, and NCLB-Net algorithms.
Model | PA | MPA | MIoU | Kappa
U-Net | 83.22 | 42.68 | 72.25 | 51.63
SETR | 86.53 | 51.31 | 76.33 | 59.12
DAnet | 85.17 | 44.58 | 82.64 | 53.73
OCnet | 81.60 | 51.69 | 87.16 | 57.49
PSPnet | 88.45 | 49.74 | 81.37 | 56.75
NCLB-Net | 94.71 | 54.21 | 92.43 | 68.47
Table 8. Grading criteria for the extent of big spot disease in corn.
Disease Spot Level | Range of k | Quantity
Level 1 | 0 ≤ k ≤ 5% | 1017
Level 2 | 5% < k ≤ 25% | 748
Level 3 | 25% < k ≤ 50% | 654
Table 9. Comparative accuracy experiments of different models.
Model | PA (Level 1) | PA (Level 2) | PA (Level 3)
Scheme1 | 79.11 | 88.74 | 90.48
Scheme2 | 80.73 | 89.31 | 91.15
Scheme3 | 82.17 | 90.87 | 92.16
Scheme4 | 83.26 | 91.13 | 93.27
NCLB-Net | 85.71 | 92.24 | 95.43