1. Introduction
Sweetgum trees are deciduous trees with very high application value, particularly for soil conservation, ornamental use, raw material, and medicinal purposes [1,2]. As the planting area of this tree has continuously expanded, leaf spot disease of the sweetgum has gradually worsened; it is characterized by brown spots of small area. Generally, the grade of leaf spot disease is determined by the area of the spots on the leaves. Using deep learning to segment leaves enables a quick evaluation of the disease situation, which is of great significance for garden protection.
In recent years, many detection and classification technologies for plant diseases have emerged. Nandini et al. [3] described different image segmentation techniques, such as edge detection, thresholding, and clustering, which continue to show good potential for detecting diseases on plant leaves. Kumar et al. [4] proposed a method to classify three important leaf-surface diseases of banana using local texture features and completed a comparative performance analysis using ten-fold cross-validation. Jothiaruna et al. [5] proposed a leaf-spot segmentation method for real-field environments using comprehensive color features and a region-growing method. These methods separate the target area using texture, color, shape, and other information. The preprocessing in traditional methods is highly targeted, but plant diseases usually exhibit a variety of characteristics, and manual feature extraction involves a degree of subjectivity, which poses great challenges for traditional plant disease detection and recognition algorithms.
The emergence of deep learning provides a new technology for plant disease identification and detection. Deep neural networks have a strong ability to learn and self-train. They can obtain image information of plant diseases from multiple angles, which offers advantages for plant disease detection and prevention [6,7,8,9]. Lakshmi et al. [10] proposed an effective deep-learning framework for the automatic detection and segmentation of plant diseases using an improved CNN based on pixel-level mask regions. Mobeen et al. [11] used a CNN to classify the symptoms of plant diseases and proposed a staged transfer learning method, which helps the model converge faster. Liu Xinda et al. [12] addressed plant disease identification by weighting visual regions and losses and using an LSTM network to encode the weighted patch feature sequence into a complete feature representation. Prabhjot Kaur et al. [13] proposed a neural-network-based technique, Modified InceptionResNet-V2 (MIR-V2), combined with transfer learning to detect lesions in the leaves of different tomato plants; the model achieved 98.92% accuracy in lesion detection. Dang Huu Chau et al. [14] used the deep-learning-based Yolo-v5 model, with stability information sourced from an autoencoder, to train on the PlantVillage dataset; after training, the detection accuracy for lesions was 81.28%. Hongbo Yuan et al. [15] proposed an improved DeepLabV3+ network for the segmentation of grape leaf black rot spots. They used the ResNet101 network as the backbone of DeepLabV3+, inserted a channel attention module into the residual module, and added a feature fusion branch based on the feature pyramid network to the encoder to integrate feature maps at different levels. The evaluation indicators of the improved network were greatly improved, enhancing the segmentation of grape leaf black rot spots. Hu et al. [16] used a CNN to segment tea leaves and lesions and evaluate the degree of damage. Liang et al. [17] used a PD2SE-Net neural network to segment plant lesion areas and assess their damage, reporting an accuracy of more than 91%. The DeepLabV3+ model is mainly applied to satellite images, medical images, urban scenes, plant branch segmentation, and element extraction [18,19,20,21,22], but there are few applications to plant diseases. Therefore, applying a deep-learning network to plant disease images may be a fruitful, unexplored approach for the identification and detection of target diseases.
In 2018, Liang-Chieh Chen et al. [23] proposed the semantic segmentation network model DeepLabV3+. The network adds a decoder module to the original DeepLabV3 network to refine the semantic segmentation results, and it is currently the best-performing network in the DeepLab series. The encoder of DeepLabV3+ applies serial atrous convolutions in the Modified Aligned Xception backbone feature extraction network and divides the resulting features into two parts. One part is fed into the atrous spatial pyramid pooling (ASPP) module, which is composed of atrous convolutions with different dilation rates; the ASPP module encodes, concatenates, and fuses image context information and compresses it with a 1 × 1 convolution before passing it to the decoder. The other part is a shallow feature layer, taken before the pooling of the Block2 module in the Modified Aligned Xception, that is passed directly into the decoder. In the decoder, one branch applies a 1 × 1 convolution to this shallow feature map to reduce its dimensions; the other branch upsamples the ASPP-compressed feature map by fourfold bilinear interpolation. The two feature maps are then concatenated to form a shallow, cross-layer connection; the shallow-layer features carry detailed information, thus enriching the semantic and detailed information of the image. Finally, the merged feature map is upsampled by a factor of four to obtain a semantic segmentation map with the same size as the original picture.
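To make the ASPP structure concrete, the following is a minimal PyTorch sketch of a module of this kind; the output channel count and the dilation rates (6, 12, 18) are illustrative assumptions rather than the exact settings of the network described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel atrous convolutions at several rates plus
    an image-level pooling branch, concatenated and compressed by a 1x1 conv."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        # 1x1 convolution branch
        self.branches.append(nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # 3x3 atrous convolution branches with different dilation rates
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # image-level (global average pooling) branch
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # 1x1 convolution compressing the concatenated branches
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))
```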
Our team developed a method to improve the lesion segmentation accuracy for sweetgum leaves by combining attention mechanism modules with the DeeplabV3+ network. Sweetgum leaf spot images captured at the East Lake Campus of Zhejiang Agriculture and Forestry University in 2021 were used as the research object for lesion segmentation. The experimental results show that the model proposed in this paper satisfies the requirements of pixel accuracy, mean recall rate, mean intersection over union, and other aspects and has a good segmentation effect on leaf spots.
2. Materials and Methods
2.1. Production of the Dataset
In this experimental study, the sweetgum leaf spot images came from the sweetgum forest on the East Lake Campus of Zhejiang Agriculture and Forestry University. The research area is located in the northwest of Zhejiang Province, from 118°51′ to 119°52′ east longitude and 29°56′ to 30°23′ north latitude. It has a warm, humid monsoon climate with abundant sunshine and is suitable for the growth of sweetgum trees. The environmental background, natural light, shooting equipment, etc., affect the captured images. The images were collected from May to July 2021, in the morning, at noon, and in the evening. The dataset also includes leaf spot images taken on both cloudy and sunny days, which better reflect real observation conditions. In this experiment, sweetgum leaf lesions were photographed with an iPhone 7's built-in camera, and 160 valid images were screened.
The image semantic segmentation annotation tool Labelme was used to annotate the spot images at the pixel level. The annotation data were stored in JSON format, and the data labels were converted to binary PNG images using the labelme_json_to_dataset command. The black part is the background and the red part is the leaf spot, as shown in Figure 1.
In this paper, the experimental dataset was augmented because the direct use of raw data for training can easily cause the model to overfit. The CVPR Fine-Grained Visual Classification Challenge uses data augmentation operations to amplify raw data; this approach improves the generalization ability of the model and makes the learned features invariant to translations and flips. The Albumentations data augmentation library was used to randomly adjust the brightness of the annotated images and to crop, flip, and shift them, amplifying the data 2, 4, 3, and 2 times, respectively, for a total of 48 times. The images were then processed to a resolution of 400 × 400, giving a total of 7680 experimental images, which were used to build the Sweetgum Leaf Spot Dataset (SLSD). Figure 2 shows the augmented results for part of the dataset.
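As an illustration of the augmentation pipeline described above, the sketch below uses the Albumentations library; the probabilities, parameter ranges, and file names are illustrative assumptions, not the exact settings used to build SLSD.

```python
import albumentations as A
import cv2

# Illustrative pipeline: random brightness, crop, flip, and shift, followed by
# resizing to the 400 x 400 resolution used in the experiments.
transform = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
    A.RandomCrop(height=360, width=360, p=0.5),   # assumes source images > 360 px
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0, rotate_limit=0, p=0.5),
    A.Resize(height=400, width=400),
])

image = cv2.imread("leaf.jpg")            # hypothetical input image path
mask = cv2.imread("leaf_mask.png", 0)     # hypothetical single-channel label mask

# The mask is transformed consistently with the image, preserving the annotation.
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```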
2.2. Segmentation Method of Diseased Spots on Leaves of Sweetgum
To reduce the number of parameter calculations in the model and improve its calculation speed, the traditional Modified Aligned Xception network used to extract the backbone features of the DeepLabV3+ model is replaced by the lightweight MobileNetV2 network in the encoder part. To train a more powerful semantic segmentation model and obtain better segmentation accuracy, this paper adds an SA module [24] to the encoder part and CBAM [25] to the decoder part. To address the class imbalance in SLSD, a weighted loss function is introduced to assign different weights to the lesion class and the background class, improving the accuracy of the model's segmentation of the lesion area.
2.3. Backbone Feature Extraction Network
In the encoder section, we changed the Modified Aligned Xception network used for backbone feature extraction in the traditional DeepLabV3+ to a lightweight MobileNetV2 network.
The MobileNet model is a lightweight deep neural network proposed by Google for embedded devices such as mobile phones. MobileNetV2 [26] is an upgraded version of MobileNetV1 [27]. A convolution is usually followed by a ReLU function; MobileNetV1 uses ReLU6, whose maximum output is limited to 6. MobileNetV2 replaces the ReLU6 after the final pointwise convolution with a linear output. Experiments on Xception have shown that introducing a ReLU activation after depthwise convolution is less effective and also leads to information loss [28].
In the feature extraction operation, the neural network extracts the useful information of the target, which can be embedded in a low-dimensional subspace. The traditional network structure uses convolutions followed by the ReLU activation function, but applying the ReLU activation function in a low-dimensional space loses more useful information. In the linear bottleneck structure, the ReLU activation function is therefore replaced by a linear function to reduce the loss of useful information in the network.
The inverted residual structure used in the MobileNetV2 network consists of three parts. First, a 1 × 1 convolution is used to increase the dimension of the input features, then a 3 × 3 depthwise convolution is used for feature extraction, and finally a 1 × 1 convolution is used for dimension reduction.
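A minimal PyTorch sketch of such an inverted residual block is shown below; the default expansion factor and layer arrangement follow the general MobileNetV2 design, while the exact configuration used in this work is given by Table 1.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2 inverted residual block as described above:
    1x1 expansion -> 3x3 depthwise convolution -> 1x1 linear projection."""
    def __init__(self, in_ch, out_ch, stride=1, t=6):   # t: expansion factor
        super().__init__()
        hidden = in_ch * t
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 pointwise convolution: expand the channel dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution: per-channel spatial feature extraction
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 pointwise convolution: project back down, linear (no ReLU)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```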
The structural parameters of MobileNetV2 in this experiment are shown in Table 1, where t is the expansion factor; c is the depth of the output feature matrix; n is the number of repetitions of the bottleneck, which refers to the inverted residual structure; and s is the stride.
2.4. Convolutional Block Attention Module
Starting from the channel and spatial scopes, CBAM introduces two analytical dimensions, channel attention and spatial attention, to realize a sequential attention structure from channel to space. The Spatial Attention Module (SAM) makes the neural network pay more attention to the pixel regions of the image that determine lesion segmentation and ignore irrelevant regions. The Channel Attention Module (CAM) handles the allocation relationship among the feature map channels. Allocating attention across both dimensions strengthens the improvement in model performance provided by the attention mechanism. The structure of CBAM is shown in Figure 3.
The specific process of CAM is as follows: the input feature map F is passed through Global Max Pooling and Global Average Pooling over the width and height to obtain two 1 × 1 × C feature maps, which are then fed separately into a shared two-layer neural network (MLP). The outputs of the MLP are added element-wise and activated by a Sigmoid function to generate the final Channel Attention feature, M_C. Finally, M_C is multiplied element-wise with the input feature map F to generate the input features required by the Spatial Attention module. The structure of the Channel Attention module is shown in Figure 4.
The specific calculation is as follows:
$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big)$$
In the formula: $\sigma$ represents the Sigmoid activation function; $W_0 \in \mathbb{R}^{C/r \times C}$; $W_1 \in \mathbb{R}^{C \times C/r}$. Note that the MLP weights $W_0$ and $W_1$ are shared for the two inputs, and $W_0$ is followed by a ReLU activation function.
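For illustration, a minimal PyTorch sketch of a channel attention block following this formula is given below; the reduction ratio r = 16 is an assumed default.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of CBAM channel attention: a shared two-layer MLP applied to
    global average- and max-pooled descriptors, summed and passed to Sigmoid."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP (W0, ReLU, W1)
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, x):
        attn = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * attn   # element-wise multiplication with the input feature map
```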
The specific process of SAM is as follows: the feature map F′ output by the Channel Attention module is taken as the input feature map of this module. First, channel-based Global Max Pooling and Global Average Pooling are applied to obtain two H × W × 1 feature maps, which are concatenated along the channel dimension. A 7 × 7 convolution is then applied to reduce the number of channels to one, and, after Sigmoid activation, the Spatial Attention feature, M_S, is generated. Finally, M_S is multiplied with the input feature of the module to obtain the final generated features. The structure of the Spatial Attention module is shown in Figure 5.
The specific calculation is as follows:
$$M_S(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$
In the formula: $\sigma$ represents the Sigmoid activation function, and $f^{7\times 7}$ represents a convolution operation with a filter size of 7 × 7.
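A corresponding PyTorch sketch of the spatial attention block is shown below; it follows the 7 × 7 convolution described above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of CBAM spatial attention: channel-wise average and max maps are
    concatenated and passed through a 7x7 convolution and a Sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)     # H x W average over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # H x W max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn   # re-weight the input feature map spatially
```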
The low-level features obtained in the shallow network are used directly as input information in the decoding stage, which introduces a large number of background features and affects the segmentation results. With the addition of CBAM, the channel attention mechanism assigns greater weight to channels that respond strongly to the target object, and the spatial attention mechanism pays more attention to the foreground and focuses on the characteristics of the target region, which helps to generate a more effective feature map.
2.5. Shuffle Attention Module
The SA module uses a "channel split" to process the sub-features of each group in parallel. For the channel attention branch, GAP is used to generate channel statistics, and then a pair of parameters is used to scale and shift the channel vector. For the spatial attention branch, Group Normalization is used to generate spatial statistics, and then a compact feature is created in a manner similar to the channel branch. The two branches are then concatenated. After that, all sub-features are aggregated, and, finally, a "channel shuffle" operator is used to enable information communication between different sub-features. As shown in Figure 6, one branch generates the channel attention feature map by capturing the inter-channel relationships, and the other captures the spatial relationships to generate the spatial attention feature map.
Channel Attention: For a given feature map $X \in \mathbb{R}^{C \times H \times W}$, where C, H, and W represent the channel, spatial height, and spatial width, respectively, the SA module first divides X into G groups along the channel dimension, that is, $X = [X_1, \ldots, X_G]$, with $X_k \in \mathbb{R}^{C/G \times H \times W}$, where each sub-feature $X_k$ gradually captures a specific semantic response during training. Then, the corresponding importance coefficient is generated for each sub-feature through the attention module. Specifically, at the beginning of each attention unit, the input $X_k$ is divided into two branches along the channel dimension, $X_{k1}, X_{k2} \in \mathbb{R}^{C/2G \times H \times W}$. To balance speed and accuracy, global information is first embedded by simply using Global Average Pooling to generate the channel statistics $s \in \mathbb{R}^{C/2G \times 1 \times 1}$, which can be calculated by shrinking $X_{k1}$ over its spatial dimensions H × W:
$$s = \mathcal{F}_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)$$
Then, the Sigmoid activation function is used to aggregate the information and generate a compact feature map, and the final output of the channel attention branch is:
$$X'_{k1} = \sigma\big(\mathcal{F}_c(s)\big) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1}$$
In the formula: $W_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{C/2G \times 1 \times 1}$ are parameters used for scaling and shifting, respectively.
Spatial Attention: Spatial attention complements channel attention. First, Group Normalization (GN) is applied to $X_{k2}$ to obtain spatial statistics, and then $\mathcal{F}_c(\cdot)$ is used to enhance the representation of $X_{k2}$. The final output is:
$$X'_{k2} = \sigma\big(W_2 \cdot \mathrm{GN}(X_{k2}) + b_2\big) \cdot X_{k2}$$
In the formula: $W_2$ and $b_2$ are parameters in $\mathbb{R}^{C/2G \times 1 \times 1}$.
Finally, the two branches are concatenated so that the number of channels is the same as the number of inputs, that is, $X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W}$.
The features of the input image are extracted by the backbone network. Fusing the channel attention mechanism and the spatial attention mechanism not only reduces the loss of local information but also captures global information with long-range dependencies, effectively improving the feature extraction ability.
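The following is a compact PyTorch sketch of an SA module of this kind, assuming G = 8 groups; it is a simplified illustration of the channel split, dual-branch attention, and channel shuffle described above rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of an SA module: split channels into G groups, process a channel
    branch (GAP + scale/shift + Sigmoid) and a spatial branch (GroupNorm +
    scale/shift + Sigmoid) in each group, concatenate, then channel shuffle."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per branch within a group
        self.gap = nn.AdaptiveAvgPool2d(1)
        # learnable scale (W) and shift (b) parameters for each branch
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x_c, x_s = x.chunk(2, dim=1)                     # split each group in half
        # channel attention branch
        s = self.gap(x_c)
        x_c = x_c * torch.sigmoid(self.cw * s + self.cb)
        # spatial attention branch
        x_s = x_s * torch.sigmoid(self.sw * self.gn(x_s) + self.sb)
        out = torch.cat([x_c, x_s], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, groups=2)       # mix the two sub-features
```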
2.6. Weighted Loss Function
For an input sample, the difference between the output value of the model and the real value of the sample is called the loss, and the loss function describes this difference. For a deep learning model, the network weights are trained through backpropagation of the loss; therefore, the loss function determines the training effect of the deep learning model and is crucial. In this study, the annotations in SLSD are divided into two categories, disease spot and background, so multi-class cross-entropy is used as the loss function. However, because the background category occupies a large proportion of the dataset, the network tends to learn background characteristics during training, and the characteristics of the disease spot area cannot be effectively extracted, resulting in low segmentation accuracy for spot areas. To solve this problem caused by the class imbalance of SLSD, a weighted loss function is introduced. Based on the original multi-class cross-entropy loss function, the disease spot class and the background class are given different weights. The specific calculation formula is as follows:
$$Loss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} w_j \, y_{ij} \log(p_{ij}), \qquad w_j = \frac{N}{C \times N_j}$$
In the formula: N is the total number of pixels, C is the total number of categories, i denotes the i-th training pixel, j denotes the category of the training pixel, $y_{ij}$ is the real disease spot category of the annotation of the i-th training pixel, $p_{ij}$ is the disease spot category predicted for the i-th training pixel, $w_j$ represents the weight parameter of category j, and $N_j$ represents the number of pixels of category j.
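A minimal PyTorch sketch of such a weighted cross-entropy loss is given below, assuming inverse-class-frequency weights; the exact weighting values used in our experiments may differ.

```python
import torch
import torch.nn as nn

def class_weights(label_mask, num_classes=2):
    """Compute w_j from the pixel counts N_j of each class in the label masks
    (inverse class frequency, an assumed weighting scheme)."""
    counts = torch.bincount(label_mask.flatten(), minlength=num_classes).float()
    total = counts.sum()
    return total / (num_classes * counts.clamp(min=1))   # w_j = N / (C * N_j)

# logits: (batch, C, H, W) network output; labels: (batch, H, W) with class ids
logits = torch.randn(4, 2, 400, 400)
labels = torch.randint(0, 2, (4, 400, 400))

weights = class_weights(labels, num_classes=2)
criterion = nn.CrossEntropyLoss(weight=weights)   # weighted cross-entropy loss
loss = criterion(logits, labels)
```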
2.7. Image Semantic Segmentation Algorithm Based on Improved DeeplabV3+
The improved model structure is shown in
Figure 7.
The algorithm proposed in this article is based on the encoder–decoder structure. The encoder part is the feature extraction network, gradually obtaining feature maps and capturing higher-level semantic information. The decoder part projects the semantic features learned by the encoder into the pixel space to realize pixel-level segmentation. In the encoder, the MobileNetV2 network is used to build the backbone feature extraction network, and an SA module is added to it, which does not affect the speed of the network but increases the accuracy of the algorithm. Atrous Spatial Pyramid Pooling is connected behind the MobileNetV2 backbone network. Dilated convolutions with different sampling rates can be applied in parallel by ASPP, which is equivalent to capturing the context of the image at multiple scales. Dilated convolution inserts holes into the convolution kernel during the convolution operation to expand the receptive field, so that each convolution output contains a larger range of information. The ASPP module outputs high-level features after a 1 × 1 convolution. The feature map corresponding to the blue solid-line box in the encoder section of Figure 7 is visualized in Figure 7B.
In the decoder, the low-level features processed by CBAM are adjusted by a 1 × 1 convolution. These features are visualized in the solid blue-line box in the decoder section shown in Figure 7A. They are then effectively fused with the features obtained from the encoder after 4× upsampling. After 3 × 3 convolutions and upsampling, the features are gradually refined, the spatial information is recovered, and, finally, the segmentation result map is obtained.
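The following PyTorch sketch illustrates this decoder fusion step; the channel sizes (48 after the 1 × 1 reduction, 256 in the refinement convolution) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder fusion described above: low-level features (after
    CBAM) are reduced by a 1x1 convolution, concatenated with the 4x-upsampled
    ASPP output, refined, and upsampled 4x to the input resolution."""
    def __init__(self, low_ch=24, aspp_ch=256, num_classes=2):
        super().__init__()
        self.reduce = nn.Sequential(                 # 1x1 conv on low-level features
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(                 # 3x3 conv after concatenation
            nn.Conv2d(48 + aspp_ch, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))          # per-pixel class scores

    def forward(self, low_feat, aspp_feat):
        # 4x upsampling of the ASPP output to the low-level feature resolution
        x = F.interpolate(aspp_feat, size=low_feat.shape[2:],
                          mode="bilinear", align_corners=False)
        x = torch.cat([self.reduce(low_feat), x], dim=1)
        x = self.refine(x)
        # final 4x upsampling back to the original image size
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
```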
2.8. Evaluation Indicators
This paper selects the Pixel Accuracy (PA), Mean Intersection over Union (mIou), and Mean Recall Rate (mRecall) indicators to evaluate the segmentation performance on SLSD.
PA is the ratio of correctly classified pixels to the total number of pixels. The calculation formula is as follows:
$$PA = \frac{\sum_{i=1}^{k} P_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} P_{ij}}$$
mIou is the most commonly used evaluation index in semantic segmentation research. For each category, the intersection of the real and predicted pixel sets is divided by their union, and the results are then averaged over all categories. The calculation formula is as follows:
$$mIou = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{\sum_{j=1}^{k} P_{ij} + \sum_{j=1}^{k} P_{ji} - P_{ii}}$$
mRecall is the average, over all categories, of the ratio of the number of correctly classified pixels of each category to the total number of pixels that truly belong to that category. The calculation formula is as follows:
$$mRecall = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{\sum_{j=1}^{k} P_{ij}}$$
where k is the total number of categories, $P_{ij}$ represents the number of pixels that belong to class i but are predicted as class j, $P_{ii}$ represents the number of correctly predicted pixels, and $P_{ij}$ and $P_{ji}$ (i ≠ j) correspond to false negatives and false positives, respectively.
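The sketch below illustrates how these three indicators can be computed from a confusion matrix; it is a generic implementation of the formulas above, not the exact evaluation code used in this paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """P[i, j] = number of pixels of true class i predicted as class j."""
    P = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true.flatten(), y_pred.flatten()):
        P[t, p] += 1
    return P

def segmentation_metrics(P):
    tp = np.diag(P).astype(np.float64)                 # P_ii: correct pixels per class
    pa = tp.sum() / P.sum()                            # pixel accuracy
    recall = tp / P.sum(axis=1)                        # per-class recall
    iou = tp / (P.sum(axis=1) + P.sum(axis=0) - tp)    # per-class IoU
    return pa, recall.mean(), iou.mean()

# toy example with k = 2 classes (background = 0, disease spot = 1)
y_true = np.array([[0, 0, 1], [1, 1, 0]])
y_pred = np.array([[0, 1, 1], [1, 0, 0]])
pa, m_recall, m_iou = segmentation_metrics(confusion_matrix(y_true, y_pred, k=2))
print(pa, m_recall, m_iou)
```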
2.9. Disease Spot Classification
Since there is no established grading standard for the degree of leaf spot, to more accurately analyze the leaf spot grade of sweetgum, this paper formulates a grading standard for sweetgum leaf spot by referring to the Technical Specification for Detection and Reporting of Wheat Scab (GB/T 15796-2011) issued by the People's Republic of China. Based on the principle of pixel counting, this paper uses Python to compute statistics on the segmented disease spot area and divides the leaves into four levels: level 1, level 2, level 3, and level 4. The criteria for grading leaf lesions are shown in Table 2.
Among them, k is the proportion of the lesion area in the whole image. The calculation formula is as follows:
$$k = \frac{S_A}{S_B}$$
In the formula: $S_A$ is the area of the lesion region, $S_B$ is the area of the whole image, A indicates the lesion region, and B indicates the image region.
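The following sketch illustrates this pixel-counting principle; the grade thresholds in the code are purely hypothetical placeholders, since the actual grade boundaries are defined in Table 2.

```python
import numpy as np

def lesion_ratio(pred_mask, lesion_class=1):
    """k = S_A / S_B: lesion pixels divided by total image pixels."""
    s_a = np.count_nonzero(pred_mask == lesion_class)   # lesion area in pixels
    s_b = pred_mask.size                                 # whole image area in pixels
    return s_a / s_b

def lesion_grade(k, thresholds=(0.05, 0.15, 0.30)):      # hypothetical thresholds only
    if k <= thresholds[0]:
        return 1
    if k <= thresholds[1]:
        return 2
    if k <= thresholds[2]:
        return 3
    return 4

mask = np.zeros((400, 400), dtype=np.uint8)
mask[:40, :40] = 1                        # toy lesion occupying 1% of the image
print(lesion_ratio(mask), lesion_grade(lesion_ratio(mask)))
```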
4. Discussion
4.1. Ablation Experiment
The Modified Aligned Xception network originally used for backbone feature extraction was changed to the MobileNetV2 network. To verify the effectiveness of adding attention mechanism modules to the backbone feature extraction network and the decoder, four groups of experiments with different schemes were set up. The experimental results are shown in Table 4.
Scheme 1: Based on the traditional DeeplabV3+ network structure, the backbone feature network is replaced with the MobileNetV2 network;
Scheme 2: Based on Scheme 1, the SA module is added to the backbone feature extraction network;
Scheme 3: Based on Scheme 1, CBAM is added to the decoder;
Scheme 4: Based on Scheme 1, attention mechanism modules are added to both the backbone feature extraction network and the decoder.
From the results of the ablation experiment, it can be seen that, compared with the traditional DeeplabV3+ model, the model parameters and training time of Scheme 1 are significantly reduced. The experiments show that replacing the backbone feature extraction network is effective. Scheme 2 and Scheme 3 improve the PA, mRecall, and mIou values compared with the traditional DeeplabV3+. The improvement of Scheme 4 is larger, indicating that introducing attention mechanisms in both the encoder and decoder improves accuracy more noticeably.
To verify the effectiveness of the weighted loss function in improving the segmentation accuracy of the disease spot region, a detailed comparison experiment was carried out between Scheme 4 and our method. The experimental results are shown in Table 5. From the comparison results in Table 5, it can be seen that, compared with Scheme 4, our method improves the Recall of the disease spot class, decreases slightly in the Recall of the background class, and improves the IoU of both the disease spot and background classes. Overall, the PA, mRecall, and mIou of our method are improved compared with Scheme 4. The experimental data show that introducing the weighted loss function improves the segmentation accuracy of the disease spot region and thus the overall segmentation accuracy of the model.
4.2. Comparison of the Performance of Different Segmentation Methods
To further verify the segmentation performance of the improved DeeplabV3+ model, our method is compared with traditional semantic segmentation models commonly used for plant diseases, namely DeeplabV3+, Unet [29], and Segnet [30]. The comparison results are shown in Table 6.
From the experimental results in Table 6, our model outperforms the other segmentation networks in every aspect. The PA of the algorithm proposed in this paper is 94.5%, which is 4.2%, 6.2%, and 4.8% higher than that of the traditional DeeplabV3+, Unet, and Segnet models, respectively. The mRecall of the proposed algorithm is 85.4%, which is 4.3%, 6.7%, and 5.1% higher than that of the traditional DeeplabV3+, Unet, and Segnet, respectively. The mIou of the proposed algorithm is 81.3%, which is 2.4%, 4.9%, and 1.9% higher than that of the traditional DeeplabV3+, Unet, and Segnet, respectively. The experimental data show that the introduction of the two different attention mechanisms and the weighted loss function changes the behavior of the features and enhances their expressive ability.
Compared with the traditional DeeplabV3+ and Unet, the proposed algorithm trains significantly faster. This is because, in a neural network, the greater the number of layers in the model, the more parameters there are, the more complex the model is, and the more difficult the training is. However, compared with Segnet, the training time is slower, because the structure of Segnet itself is relatively simple; its segmentation accuracy, however, is not as good as that of the algorithm proposed in this paper. To sum up, the algorithm proposed in this paper replaces the backbone feature extraction network with MobileNetV2, which reduces the amount of parameter calculation in the model and thus improves the calculation speed.
This paper also visually displays the segmentation effects of the four algorithms, as shown in Figure 12, where red represents the lesions and black represents the background. Compared with Unet and Segnet, the traditional DeeplabV3+ and the algorithm proposed in this paper have a higher recognition rate for the edges of the disease spot area, because DeeplabV3+ contains atrous convolutions with different rates in the encoding part for feature extraction, which enhances the ability to recognize objects of different sizes and recovers the feature information of region edges. Compared with the traditional DeeplabV3+, the algorithm in this paper reduces missed detections of disease spots and improves edge recognition of the disease spot area. Compared with Unet and Segnet, the segmentation effect on disease spots is better, which proves that adding the two different attention mechanisms strengthens the extraction of information features from the edges and area of disease spots, while the weighted loss function reduces the loss of feature information.