2.2.2. Multi-Scale Feature Fusion Network RepGFPN
The YOLOv8 network retains the YOLOv5 design at the Neck end, employing the feature pyramid PAN [19] structure with top-down and bottom-up paths to handle multi-scale features. This structure only considers two kinds of connections: the same scale at adjacent levels and different scales at the same level, which limits the breadth and depth of information transmission and fusion. The GFPN [20] architecture overcomes these limitations by additionally fusing features of different scales from diagonally adjacent levels, as well as same-scale features from non-adjacent layers, significantly enlarging the scope and depth of feature fusion. Although GFPN improves accuracy, it markedly increases inference time and struggles to meet the real-time detection requirements of smart farms monitoring chili Phytophthora blight. One drawback of GFPN is that it does not allocate resources differentially across feature scales, instead assigning the same number of channels to every scale; this wastes computing resources on high-resolution feature maps. Additionally, GFPN involves a large number of upsampling and downsampling operations, further increasing computational complexity. The network structure diagrams of PAN and GFPN are depicted in Figure 4.
Therefore, this paper adopts the RepGFPN [21] network. RepGFPN largely retains the network structure of GFPN and inherits its $\log_2 n$-link skip-layer connections. Specifically, at each level $k$, the $l$-th layer receives feature maps from at most $\lceil \log_2 l \rceil + 1$ preceding layers, and these input layers are exponentially apart from depth $l$ with base 2, as denoted:
$$P_k^l = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(P_k^{l-2^n}, \ldots, P_k^{l-2^1}, P_k^{l-2^0}\right)\right), \quad l - 2^n \geq 0,$$
where $P_k^l$ denotes the feature map of the $l$-th layer at level $k$, and Concat() and Conv() represent concatenation and 3 × 3 convolution, respectively. The time complexity of the $\log_2 n$-link is only $O(l \cdot \log_2 l)$, rather than $O(l^2)$. This means that the RepGFPN network structure can obtain higher-quality feature maps at a lower cost. This network model uses different numbers of channels for features of different scales. In practical situations, the lesions caused by chili Phytophthora blight vary greatly in size, shape, and distribution. By adjusting the number of channels per scale, RepGFPN can more flexibly extract feature information at different scales, enhancing its ability to recognize lesions of various sizes while ensuring computational efficiency. Furthermore, this network eliminates some of the upsampling connections, avoiding unnecessary computational burdens. For most disease detection scenarios, upsampling operations do not significantly enhance the features of small regions; reducing upsampling connections allows the model to focus on fine features at the original resolution, improving the detection accuracy of small lesions. In terms of feature fusion, the CSPStage module is tailored for this purpose, and RepConv [22] is employed to substitute the 3 × 3 convolution of GFPN, enhancing the efficacy of feature integration and boosting the model's precision without incurring additional computational burdens.
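To make the $\log_2 n$-link connectivity concrete, the short sketch below (a hypothetical helper written for this explanation, not code from the RepGFPN implementation) lists which preceding layers are fused into layer $l$ under the rule above:

```python
import math

def log2n_link_inputs(l: int) -> list[int]:
    """Indices of preceding layers fused into layer l under the log2n-link:
    l - 2^0, l - 2^1, ..., keeping only non-negative indices."""
    if l <= 0:
        return []
    return [l - 2 ** n for n in range(int(math.log2(l)) + 1) if l - 2 ** n >= 0]

print(log2n_link_inputs(6))  # [5, 4, 2] -> layer 6 fuses layers 6-1, 6-2, 6-4
```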
Figure 5 presents the architecture of the RepGFPN network and the layout of the CSPStage module.
The CSPStage module serves as a crucial component of RepGFPN, primarily responsible for feature fusion. The incoming feature maps have undergone different sampling processes to reach a uniform resolution. Feature maps of a consistent resolution then undergo a Concat operation to increase their channel dimensionality. To integrate the information contained within the feature maps, we employ a strategy that combines multi-branch parallel processing with multi-level feature fusion. First, the input feature maps are split into two parallel branches for independent processing. The first branch undergoes a 1 × 1 convolution for dimensionality reduction and then, without further operations, directly participates in the subsequent Concat operation, ensuring the direct retention of the original feature information and providing unmodified input for the final feature integration. The second branch first applies a Rep3 × 3 convolution layer. During training, this layer utilizes the reparameterization technique to obtain diverse feature representations within a multi-branch structure; during inference, the same reparameterization technique folds it into a single efficient 3 × 3 convolution layer, preserving model accuracy while boosting inference speed. This multi-branch structure helps the network accelerate inference while maintaining high accuracy, enabling it to better distinguish the characteristics of different disease spots, especially subtle lesion boundaries and feature differences against complex backgrounds. The specific form of the structure is presented in
Figure 5c. Subsequently, a standard 3 × 3 convolution layer is employed to further refine and enhance the feature maps, thereby enhancing their nonlinear expression capacity and local feature capture capability. We perform the addition operation on the feature maps from the second branch, which have been through Rep3 × 3 and 3 × 3 convolution operations, as well as those that have not undergone any convolution operation, to enhance the richness of feature information. This operation is repeated n times to obtain n new feature maps. By utilizing multi-level parallel processing, the unprocessed feature maps from the initial branch undergo a concatenation operation with the n feature maps generated by the subsequent branch. This facilitates the merging of feature maps across different branches and layers, strengthening the model’s capacity to gather and understand intricate feature details. This operation enables the comprehensive fusion of cross-level and multi-branch feature maps, making the model more discriminative when identifying complex disease spot edges, colors, and texture features. Finally, a 1 × 1 convolution layer is employed to perform dimensionality reduction and linear transformation on the consolidated feature sets, resulting in the ultimate feature representation for facilitating the network’s forward propagation process. This step not only reduces the data dimension but also enhances the discriminability of features through linear transformations, enabling a more precise representation of the characteristics of disease spots. During the forward propagation process, these high-quality features assist the model in discriminating disease spot regions more rapidly and effectively, thereby achieving higher accuracy in practical chili disease detection.
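As a rough PyTorch sketch of the CSPStage data flow just described (split, Rep3 × 3 followed by 3 × 3 with a residual add repeated n times, cross-branch concatenation, and a final 1 × 1 fusion), consider the following; module and layer names are illustrative, and the RepConv block is shown only in its training-time multi-branch form:

```python
import torch
import torch.nn as nn

class RepConv3x3(nn.Module):
    """Training-time multi-branch block (3x3 conv + 1x1 conv + identity) that
    can later be re-parameterized into a single 3x3 convolution for inference."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # sum of the three parallel branches, then BN and activation
        return self.act(self.bn(self.conv3(x) + self.conv1(x) + x))

class CSPStage(nn.Module):
    """Sketch of the two-branch CSPStage fusion described in the text."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 1):
        super().__init__()
        mid = out_ch // 2
        self.branch1 = nn.Conv2d(in_ch, mid, 1)   # untouched shortcut branch
        self.branch2 = nn.Conv2d(in_ch, mid, 1)   # entry to the processed branch
        self.blocks = nn.ModuleList(
            nn.Sequential(RepConv3x3(mid), nn.Conv2d(mid, mid, 3, padding=1))
            for _ in range(n)
        )
        self.fuse = nn.Conv2d(mid * (n + 1), out_ch, 1)  # final 1x1 fusion

    def forward(self, x):
        y1 = self.branch1(x)       # original information, no further processing
        y = self.branch2(x)
        outs = [y1]
        for blk in self.blocks:
            y = y + blk(y)         # residual add of the Rep3x3 -> 3x3 path
            outs.append(y)         # keep each of the n intermediate maps
        return self.fuse(torch.cat(outs, dim=1))

# e.g. fusing a 128-channel map at 40x40 resolution
print(CSPStage(128, 128, n=2)(torch.randn(1, 128, 40, 40)).shape)
```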
2.2.3. Lightweight DySample Upsampling
The upsampling operator of YOLOv8 employs the nearest-neighbor [23] interpolation algorithm. It merely utilizes pixels within a relatively narrow range for prediction and fails to preserve the details of the feature map effectively. In practice, small objects can be difficult to detect against complex backgrounds, and this loss of image detail makes matters worse. To tackle this problem, kernel-based dynamic upsampling algorithms (such as CARAFE [24], FADE [25], and SAPA [26]) have been proposed; they have a larger receptive field and can achieve high-quality upsampling. Nevertheless, CARAFE's dynamic convolution and the auxiliary sub-network used to generate dynamic kernels lead to a notable increase in computation and latency, posing obstacles for real-time detection. Therefore, we adopt the DySample [27] algorithm. DySample uses a point-based sampling method, balancing performance improvement against computational cost, and improves detection accuracy at a small price. Figure 6 presents the detailed network structure of DySample.
Firstly, a feature vector $X$ with dimensions $C \times H \times W$ is introduced as the input. Here, $C$ represents the channel count, whereas $H$ and $W$ signify the vertical and horizontal dimensions of the feature map, respectively. The data from all channels of this feature map collectively form a preliminary feature map, which supplies sufficient information for subsequent sampling. This feature vector captures all possible spatial information and subtle features associated with disease characteristics.
Subsequently, the sampling point generator establishes a sampling set $S$ in accordance with specific rules and algorithms. With an upsampling factor $s$, the dimension of this sampling set $S$ is $2 \times sH \times sW$, where the first dimension holds the x and y sampling coordinates. The role of the sampling point generator is to dynamically adjust the positions of sampling points based on specific feature regions in the feature map, in order to capture finer image details. This method of dynamically generating sampling points can automatically identify important areas, enabling the model to focus on small yet crucial disease features, thereby improving detection accuracy. Subsequently, the grid sample function is employed to construct a new feature vector $X'$ by bilinear interpolation. This operation effectively smooths the features and accomplishes more detailed upsampling with less computational cost, enhancing the detection effect for small targets (such as minute lesions on leaves). The process can be defined as follows:
$$X' = \mathrm{grid\_sample}(X, S)$$
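As a minimal PyTorch illustration of this step (shapes chosen for a 2× upsampling; the random sampling set stands in for the generator's output), F.grid_sample bilinearly reads $X$ at the normalized positions stored in $S$:

```python
import torch
import torch.nn.functional as F

# X: (B, C, H, W); S: (B, sH, sW, 2) with coordinates normalized to [-1, 1]
x = torch.randn(1, 64, 20, 20)
s = torch.rand(1, 40, 40, 2) * 2 - 1   # random sampling set, for illustration only
x_up = F.grid_sample(x, s, mode="bilinear", align_corners=True)
print(x_up.shape)  # torch.Size([1, 64, 40, 40])
```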
Figure 6b illustrates the process by which the sampling point generator produces the sampling set $S$. In practical scenarios, the field environment is complex and variable, and the lesions of Phytophthora blight on chili leaves typically manifest as subtle changes in color and texture. To accurately detect these small and easily overlooked lesions, DySample's offset matrix generation and dynamic adjustment process adaptively adjusts the sampling positions based on the input features, showing good flexibility for lesions of different sizes and shapes. The sampling point generator accepts a feature matrix $X$ as input and uses two linear layers to generate two offset matrices $O_1$ and $O_2$, each of size $2s^2 \times H \times W$. To effectively prevent the overlap of sampling positions, a dynamic adjustment ability is introduced. First, the generator activates $O_1$ through the sigmoid function and then multiplies it by a static factor of 0.5, achieving a soft offset adjustment; the benefit is that the sampler becomes more sensitive to areas containing fine lesions without losing the overall structure. The adjusted $O_1$ is then element-wise multiplied with the second offset $O_2$ to generate a new offset matrix $O'$. $O'$ is then reshaped into an offset $O$ of size $2 \times sH \times sW$ through the pixel shuffle operation. This operation not only reshapes the dimension of the offset but also achieves efficient upsampling of the feature map by rearranging the elements while preserving the crucial spatial information. Ultimately, the sampling set $S$ is derived by combining the offset $O$ with the initial grid $G$, as expressed in the following formula:
$$S = G + O$$
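Putting the generator and the sampling step together, a hedged PyTorch sketch of this dynamic-scope pipeline might look as follows (a single sampling group is assumed; layer names and the rescaling of pixel offsets into grid_sample's normalized coordinate range are our own illustrative choices, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Illustrative DySample-style point upsampler (dynamic-scope variant)."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # two "linear" layers (1x1 convs), each emitting 2*s^2 offset channels
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        self.scope = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        s = self.scale
        # soft offset: 0.5 * sigmoid(O1), element-wise multiplied with O2
        o = 0.5 * torch.sigmoid(self.scope(x)) * self.offset(x)
        # pixel shuffle: (B, 2*s^2, H, W) -> (B, 2, sH, sW)
        o = F.pixel_shuffle(o, s)
        # initial grid G over the upsampled resolution, in [-1, 1] coordinates
        gy, gx = torch.meshgrid(
            torch.linspace(-1, 1, s * h, device=x.device),
            torch.linspace(-1, 1, s * w, device=x.device),
            indexing="ij",
        )
        g = torch.stack((gx, gy)).unsqueeze(0)             # (1, 2, sH, sW)
        # sampling set S = G + O (offsets rescaled from pixels to [-1, 1] units)
        norm = torch.tensor([w, h], dtype=x.dtype, device=x.device).view(1, 2, 1, 1)
        s_set = g + 2.0 * o / norm                          # (B, 2, sH, sW)
        # grid_sample expects the grid as (B, sH, sW, 2)
        return F.grid_sample(x, s_set.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=True)

# 2x upsampling of a 64-channel feature map
print(DySampleSketch(64)(torch.randn(1, 64, 20, 20)).shape)  # (1, 64, 40, 40)
```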
In summary, DySample leverages the generation and dynamic adjustment of offset matrices to enable the model to perform a more precise sampling of the delicate and complex lesion areas on chili plants. By adaptively adjusting the positions of sampling points during the upsampling process, it effectively retains the spatial information and crucial details of the feature map, without wasting computational resources on redundant information. This allows the model to more sensitively capture the characteristics of diseases, improving the detection accuracy of small areas even in complex field backgrounds. Furthermore, DySample’s point sampling method achieves efficient feature preservation at a low computational cost, making the model both accurate and suitable for real-time detection tasks.
2.2.4. CoordAtt Module
In the detection of chili Phytophthora blight, challenges such as overlapping detection targets and complex background environments may be encountered. Introducing an attention mechanism is an effective way to address these problems. The attention mechanism enables the network to focus on pertinent information while filtering out less crucial details, minimizing distraction from unnecessary data; this not only enhances detection accuracy but also greatly improves the efficiency of computing resource utilization [28]. Specifically, the SE channel attention mechanism [29] dynamically adjusts the significance of each channel, effectively boosting the network's capacity to prioritize and amplify essential information. However, it focuses only on the differences between channels and ignores positional information in the image space, which to a certain extent constrains the network's ability to detect chili Phytophthora blight. In contrast, the CBAM [30] attention mechanism achieves a more comprehensive feature extraction: it incorporates both channel-wise focus and spatial awareness, taking into account the prominence of features across channels and their importance within the spatial arrangement. Although this mechanism significantly improves the fineness of feature extraction, it also increases the computational cost correspondingly. Furthermore, CBAM primarily concentrates on local feature details and may fall somewhat short in integrating and comprehending holistic, global features; at the same time, slight precision losses in its processing of spatial information could affect the final detection performance to a certain extent.
Figure 7 displays the network architectures of SE and CBAM.
The CoordAtt (CA) attention mechanism [31] takes into account both spatial and channel information. To manage the computational expense, it embeds spatial information within the channels, attaining a broader informational vantage point. This mechanism abandons the conventional 2D global pooling, opting instead for two separate 1D global pooling operations that consolidate positional information along the vertical and horizontal axes, respectively. This not only significantly reduces computational complexity but also averts the detail loss that can arise from collapsing all spatial information at once. The method maintains direction-specific traits along the spatial dimension, empowering the model to discern and exploit the distinctive attributes of feature maps in these two orientations, thereby boosting the network's capacity to understand complex spatial relationships. The CA attention mechanism and its specific process are illustrated in Figure 8.
At the onset, the incoming feature representation $U$, of shape $C' \times H' \times W'$, undergoes the $F_{tr}$ transformation, yielding a transformed feature representation $X$ of shape $C \times H \times W$. As a result, the model is able to capture disease characteristics more clearly, effectively preserving the vital information in the lesion areas and making the location of the disease stand out more prominently. Following that, separate encoding processes are applied along the width and height dimensions, using pooling kernels of size $(H, 1)$ for the horizontal axis and $(1, W)$ for the vertical axis, respectively. The resulting outputs for the $c$-th channel at height $h$ and width $w$ are given as follows:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \leq i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \leq j < H} x_c(j, w)$$
The above formulas integrate features along the two directions and output a pair of direction-aware feature maps, $z^h$ and $z^w$. By preserving spatial information in one direction while capturing long-range relationships in the other, this attention module stands apart from global pooling compression. The positional information is adeptly incorporated into the produced feature maps, enhancing the network's capability to pinpoint the target with greater accuracy. Subsequently, a transposed concatenation operation is performed on $z^h$ and $z^w$, generating a composite feature map. Afterwards, it undergoes the transformation function $F_1$, which employs a 1 × 1 convolutional layer to compress the dimensionality and introduce nonlinearity, yielding the intermediate feature map $f \in \mathbb{R}^{C/r \times (H+W)}$, where $r$ is a reduction ratio. This enables the model to enhance its focus on small lesion areas without increasing the computational burden, and allows it to automatically distinguish subtle differences between diseased and normal areas in complex backgrounds, reducing false detections. The formula is presented as follows:
$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
Here, the symbol $[\cdot, \cdot]$ signifies concatenation along the spatial dimension, while $\delta$ denotes a nonlinear activation function. Next, the feature map $f$ is split along the spatial dimension into two distinct feature vectors: $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. To restore the channel counts of these two feature vectors to the initial channel count $C$, we employ two 1 × 1 convolutions, $F_h$ and $F_w$, respectively. This type of 1 × 1 convolution not only increases the channel number without altering the spatial dimensions of the feature map but also incorporates nonlinear combinations through the learned convolutional kernels, boosting the representational power of the features and ensuring that the model has higher discriminative power and detail restoration capability when detecting lesion spots. Paired with the sigmoid activation function, this yields the attention maps $g^h$ and $g^w$. The formulas are presented as follows:
$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$
The sigmoid function is denoted by $\sigma$, and $F_h$ and $F_w$ represent distinct 1 × 1 convolutional operations. We expand $g^h$ and $g^w$ and subsequently utilize them as attention coefficients for the feature map. Through these attention coefficients, the CA mechanism assigns higher attention to disease-infected areas and integrates this with the original feature map. This enables the model to prioritize the processing of lesion areas and ignore irrelevant regions in complex field scenarios, thereby enhancing detection accuracy. In practical applications, this implies that regardless of the orientation or location of a lesion, the CA attention mechanism can assist the model in achieving precise identification. The calculation formula of the CA attention mechanism is as follows:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
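For concreteness, a minimal PyTorch sketch of these steps follows (the reduction ratio, BatchNorm placement, and the SiLU nonlinearity standing in for $\delta$ are assumptions rather than the exact configuration used in this paper):

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Sketch of Coordinate Attention following the steps described above."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H,1): average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1,W): average over height
        self.f1 = nn.Sequential(                       # shared 1x1 transform F1
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.SiLU())
        self.f_h = nn.Conv2d(mid, channels, 1)         # F_h: restore channels
        self.f_w = nn.Conv2d(mid, channels, 1)         # F_w: restore channels

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                           # (B, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1), transposed
        f = self.f1(torch.cat([z_h, z_w], dim=2))      # (B, mid, H+W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)       # split back along space
        g_h = torch.sigmoid(self.f_h(f_h))                       # (B, C, H, 1)
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        return x * g_h * g_w   # attention broadcast along both spatial axes

print(CoordAtt(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```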
2.2.5. Inner-MPDIoU
YOLOv8 includes a bounding box regression loss function and a classification loss function. The bounding box regression loss quantifies the discrepancy between the predicted and actual bounding boxes [32]. The accuracy of the bounding box regression process directly influences the overall precision of object detection. The loss function for bounding box regression in YOLOv8 is CIoU, whose mathematical formulation is provided below:
$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$
The IoU in this formula signifies the ratio of the overlapping area to the combined area of the predicted bounding box and the true bounding box. $\rho^2(b, b^{gt})$ signifies the squared distance between the centers of the predicted box and the true box, while $c^2$ denotes the square of the diagonal length of the smallest bounding rectangle that encompasses both boxes. The balance parameter $\alpha$ adjusts the weight given to the aspect-ratio term, and $v$ measures the consistency of the aspect ratios of the two boxes. $w^{gt}$ and $h^{gt}$ represent the width and height of the true box, whereas $w$ and $h$ signify the corresponding dimensions of the predicted box.
For the detection of chili Phytophthora blight and similarly intricate object detection tasks, accurate detection of minute targets is of paramount importance. However, the traditional CIoU and its improved versions such as SIoU have limitations in certain specific situations. For instance, if the aspect ratio of the predicted bounding box matches that of the true bounding box while their sizes differ and their center points coincide, these loss functions may fail to adequately discern and refine the discrepancy, ultimately limiting detection performance. To overcome these shortcomings of CIoU, MPDIoU [33] was proposed, and its calculation formula is as follows:
$$L_{MPDIoU} = 1 - IoU + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2}$$
$$d_1^2 = \left(x_1^{prd} - x_1^{gt}\right)^2 + \left(y_1^{prd} - y_1^{gt}\right)^2, \qquad d_2^2 = \left(x_2^{prd} - x_2^{gt}\right)^2 + \left(y_2^{prd} - y_2^{gt}\right)^2$$
In the formula, $w$ and $h$ represent the width and height of the input image. $d_1^2$ is the squared distance between $(x_1^{prd}, y_1^{prd})$ and $(x_1^{gt}, y_1^{gt})$, and $d_2^2$ is the squared distance between $(x_2^{prd}, y_2^{prd})$ and $(x_2^{gt}, y_2^{gt})$. $(x_1^{prd}, y_1^{prd})$ signifies the coordinates of the upper-left corner of the predicted box, while $(x_2^{prd}, y_2^{prd})$ denotes the coordinates of its lower-right corner. Similarly, $(x_1^{gt}, y_1^{gt})$ and $(x_2^{gt}, y_2^{gt})$ represent the respective coordinates of the upper-left and lower-right corners of the true box.
The core of MPDIoU lies in utilizing the distance between the top-left and bottom-right key points to evaluate the degree of matching between the predicted bounding box and the ground truth bounding box. Compared to methods that require separate calculations of multiple indicators such as IoU, center point distance, aspect ratio, and more, MPDIoU focuses solely on the distance between these two corner points. By combining the coordinates of the top-left and bottom-right corners with the image's size information, MPDIoU implicitly covers several key pieces of information, including non-overlapping area, center point distance, and deviations in width and height. MPDIoU cleverly represents all these factors through the corner point distance, making the calculation process more concise and clear. It is worth noting that when the aspect ratios of the predicted bounding box and the ground truth bounding box are the same, the $L_{MPDIoU}$ value for a predicted bounding box located inside the ground truth bounding box is lower than that for one located outside. This characteristic ensures the accuracy of bounding box regression, resulting in more compact and less redundant predicted bounding boxes.
However, both IoU and MPDIoU exhibit relatively slow convergence rates. In this paper, MPDIoU is improved by adopting the Inner IoU [34] approach. Inner IoU introduces supplementary auxiliary bounding boxes as intermediate media for computing the IoU loss, providing additional information and guidance for the optimization process. Instead of directly calculating the IoU between the predicted bounding box and the ground truth bounding box, Inner IoU provides a more fine-grained assessment of localization precision by analyzing the overlap between an auxiliary bounding box and either the ground truth or the predicted bounding box. For different detection tasks, a scaling factor ratio is introduced to adjust the dimensions of the auxiliary bounding boxes and determine the loss. The calculation formulas are as follows:
$$inter = \left(\min\left(b_r^{gt}, b_r\right) - \max\left(b_l^{gt}, b_l\right)\right) \times \left(\min\left(b_b^{gt}, b_b\right) - \max\left(b_t^{gt}, b_t\right)\right)$$
$$union = w^{gt} h^{gt} \cdot (ratio)^2 + w h \cdot (ratio)^2 - inter$$
$$IoU^{inner} = \frac{inter}{union}$$
Among them, the ratio is the scale factor, and its usual value range is [0.5, 1.5]. When the ratio is less than 1, the auxiliary bounding box is smaller than the actual bounding box, causing the effective regression range to concentrate in the overlapping area with a high Intersection over Union (IoU) while ignoring the parts outside the boundary. This narrowed scope allows the loss function to focus on finely aligning the overlapping regions of the two boxes. In this scenario, even minor prediction deviations can lead to significant changes in the loss, resulting in larger gradient values. This enables the model to accelerate the alignment adjustment for high IoU samples and improve convergence speed. Conversely, when the ratio is greater than 1, the auxiliary bounding box is larger than the actual bounding box, thereby expanding the effective regression range. In this way, even if the overlap between the predicted box and the ground truth box is small, the loss function can still capture the overall offset between them, allowing the model to focus on broader positional information and improve the regression effect for low IoU samples. Additionally, larger auxiliary bounding boxes reduce sensitivity to minor deviations, leading to relatively smoother gradient changes. This helps steadily bring the predicted box closer to the ground truth box, thereby smoothly optimizing significantly offset boxes and avoiding excessive fluctuations. This adjustment allows the model to balance the regression effects of different IoU samples, and through dynamic adjustment of the ratio, it can effectively optimize the overall accuracy of low IoU samples while improving the convergence speed of high IoU samples. As this paper focuses on detecting chili Phytophthora blight, which demands a heightened level of image detail, a scaling factor of 1.2 has been chosen. The central coordinates of the ground truth box (gt) and its auxiliary bounding box are denoted as $(x_c^{gt}, y_c^{gt})$, while those of the anchor box and its auxiliary bounding box are labeled $(x_c, y_c)$. The height and width of the ground truth box are $h^{gt}$ and $w^{gt}$, respectively, and the height and width of the anchor box are $h$ and $w$, respectively. The auxiliary bounding box of the ground truth box is defined by its upper boundary $b_t^{gt}$, lower boundary $b_b^{gt}$, left boundary $b_l^{gt}$, and right boundary $b_r^{gt}$; correspondingly, $b_t$, $b_b$, $b_l$, and $b_r$ are the upper, lower, left, and right boundaries of the auxiliary bounding box of the anchor box. By incorporating Inner IoU into MPDIoU, we derive Inner-MPDIoU, and its computational formula is stated as follows:
$$L_{Inner\text{-}MPDIoU} = L_{MPDIoU} + IoU - IoU^{inner}$$
In summary, the Inner-MPDIoU loss function significantly enhances the performance of object detection tasks by integrating the advantages of both the MPDIoU and Inner-IoU methods.
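As a reference sketch, the formulas above can be rendered in PyTorch as follows (our own illustrative implementation of the stated definitions, not the training code of this work; the box format and parameter names are assumptions):

```python
import torch

def inner_mpdiou(pred, gt, img_w, img_h, ratio=1.2, eps=1e-7):
    """Inner-MPDIoU loss sketch. pred, gt: (..., 4) boxes as (x1, y1, x2, y2);
    img_w, img_h: input image size used to normalize the corner distances."""
    px1, py1, px2, py2 = pred.unbind(-1)
    gx1, gy1, gx2, gy2 = gt.unbind(-1)

    # standard IoU between predicted and true boxes
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # MPDIoU: corner-distance penalties normalized by the image diagonal
    d1 = (px1 - gx1) ** 2 + (py1 - gy1) ** 2
    d2 = (px2 - gx2) ** 2 + (py2 - gy2) ** 2
    diag = img_w ** 2 + img_h ** 2
    l_mpdiou = 1 - iou + d1 / diag + d2 / diag

    # Inner IoU on ratio-scaled auxiliary boxes centred on each original box
    def aux(x1, y1, x2, y2):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        hw, hh = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
        return cx - hw, cy - hh, cx + hw, cy + hh

    ax1, ay1, ax2, ay2 = aux(px1, py1, px2, py2)
    bx1, by1, bx2, by2 = aux(gx1, gy1, gx2, gy2)
    iw_i = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
    ih_i = (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(min=0)
    inter_i = iw_i * ih_i
    union_i = ((px2 - px1) * (py2 - py1)
               + (gx2 - gx1) * (gy2 - gy1)) * ratio ** 2 - inter_i
    iou_inner = inter_i / (union_i + eps)

    # L_Inner-MPDIoU = L_MPDIoU + IoU - IoU_inner
    return l_mpdiou + iou - iou_inner

pred = torch.tensor([[50.0, 60.0, 150.0, 160.0]])
gt = torch.tensor([[55.0, 65.0, 145.0, 155.0]])
print(inner_mpdiou(pred, gt, img_w=640, img_h=640))
```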