3.2. YOLOv5 Algorithm and Improvements
The YOLOv5 architecture comprises three key components: the backbone, neck, and head. Initially, the input image is preprocessed before entering the backbone, where features are extracted through the cross-stage partial (CSP), convolution-batch normalization-SiLU (CBS), and spatial pyramid pooling-fast (SPPF) modules. The CSP module is designed to enhance the model’s performance and computational efficiency. Its primary principle involves dividing the input features into two parts: one part undergoes a series of convolutional operations, while the other part connects directly to subsequent layers. This introduces cross-stage connections, effectively accelerating information propagation and reducing computational complexity. The CSP module not only enhances feature diversity but also diminishes redundancy among features, enabling the model to learn rich feature representations. Consequently, the CSP module significantly improves the model’s adaptability to complex tasks, particularly demonstrating superior performance in multi-object detection and intricate scenarios.
The CBS module integrates convolutional operations, batch normalization, and the SiLU activation function. First, the convolutional layer extracts local features from the input image, facilitating the detection of object edges and shapes. Subsequently, the batch normalization layer standardizes the convolutional outputs, stabilizing the outputs of different layers during training and thus accelerating convergence. Finally, the SiLU activation function serves as a nonlinear transformation, effectively introducing nonlinearity and enhancing the model’s representational capability. The CBS module not only improves object detection accuracy but also enhances processing speed, thereby meeting the demands of real-time applications.
The SPPF module aims to enhance the model’s adaptability to objects of varying scales. SPPF employs pooling operations of different sizes (e.g., 1 × 1, 5 × 5, 9 × 9) to perform multi-scale processing on feature maps, allowing both detailed and contextual information to be extracted. This multi-scale feature extraction strategy enables YOLOv5s to handle objects of diverse sizes. Moreover, the SPPF module reduces the size of the feature maps, alleviating the computational burden on subsequent layers and improving the overall inference efficiency of the model.
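To make the composition of these backbone blocks concrete, the following is a minimal PyTorch sketch of a CBS block and an SPPF block; the module names, channel counts, and kernel sizes are illustrative rather than the exact YOLOv5s configuration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU, as described above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial pyramid pooling-fast: repeated pooling with one kernel size."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # Concatenate the multi-scale pooled features before the final CBS.
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```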
The extracted features are then passed to the neck component, which is primarily responsible for fusing the multi-level features to enhance the model’s detection capabilities for objects of varying scales. YOLOv5s employs a bottom–up feature fusion strategy that combines high-level semantic information with low-level detail information. This module consists of multiple convolutional layers and upsampling layers. Convolutional layers are used for feature extraction and transformation, while upsampling layers refine low-resolution feature maps to higher resolutions, ensuring effective fusion of features across different levels. On this basis, the model can comprehensively utilize features from all layers, thereby improving its capability to detect multi-scale objects.
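As a rough sketch of this fusion step (channel counts and feature-map sizes are illustrative, not the actual YOLOv5s values), a deeper, low-resolution feature map is upsampled and concatenated with a shallower, higher-resolution one before further convolution:

```python
import torch
import torch.nn as nn

# Hypothetical neck features: a deep, low-resolution map and a shallower one.
p5 = torch.randn(1, 512, 20, 20)   # high-level semantic features
p4 = torch.randn(1, 256, 40, 40)   # lower-level detail features

up = nn.Upsample(scale_factor=2, mode="nearest")(p5)   # (1, 512, 40, 40)
fused = torch.cat([up, p4], dim=1)                     # (1, 768, 40, 40)
fused = nn.Conv2d(768, 256, kernel_size=1)(fused)      # transform the fused features
```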
Finally, the fused features are input into the head component, which is responsible for converting the features processed by the neck module into the final detection results. The head directly outputs detection information, including the bounding boxes of objects, class probabilities, and center point coordinates. Specifically, the head processes the feature map through a series of convolutional layers. To ensure the final accuracy, YOLOv5s also incorporates the non-maximum suppression (NMS) algorithm, which eliminates overlapping boxes while retaining the best detection results. In this way, YOLOv5s achieves high precision and efficiency in object detection tasks and demonstrates adaptability to various complex visual scenarios.
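For illustration, NMS can be applied to raw head outputs with an off-the-shelf operator such as torchvision's `nms`; the boxes, scores, and IoU threshold below are made-up examples rather than values from the paper.

```python
import torch
from torchvision.ops import nms

# Hypothetical head outputs: boxes in (x1, y1, x2, y2) format and confidence scores.
boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.], [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.75, 0.8])

keep = nms(boxes, scores, iou_threshold=0.45)  # indices of boxes kept after NMS
print(keep)  # tensor([0, 2]): the overlapping lower-score box is suppressed
```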
In urban garbage object detection, various environmental conditions, complex background scenes, and the diversity of garbage objects often result in misclassification of non-garbage objects as garbage and inaccurate localization of garbage objects. To address these issues, we introduce the coordinate attention (CA) module [33] into YOLOv5 to enhance the feature extraction capability and attention to garbage objects, providing support for subsequent image mixing. The specific addition position of the module is shown in Figure 1. By incorporating the CA module, the model can effectively capture the spatial relationship between target regions and surrounding environments by leveraging positional information to focus on target areas and suppress irrelevant backgrounds. This ensures accurate localization of target regions, enabling YOLOv5 to achieve more precise and reliable object recognition in urban garbage detection tasks.
The CA module embeds positional information into channel attention, with the objective of enhancing the feature learning capacity. This module is capable of transforming any intermediate feature while maintaining consistent input and output dimensions [33]. The process is illustrated in Figure 2.
Assume the input feature is of size $C \times H \times W$. First, we encode each channel using pooling kernels of size $(H, 1)$ and $(1, W)$ along the X and Y coordinates, respectively. The outputs of the $c$-th channel at height $h$ and of the $c$-th channel at width $w$ are as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \tag{1}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w), \tag{2}$$

where $i$ and $j$ denote the spatial indexes along the width $W$ and height $H$ dimensions.
Next, we combine the feature maps in the width and height directions to obtain a global receptive field. Using a $1 \times 1$ convolution, the feature dimension is reduced to $1/r$ of the original size, and the result is then sent to the sigmoid activation function to obtain the feature map $f$:

$$f = \sigma\left(F_1\left(\left[z^h, z^w\right]\right)\right), \tag{3}$$

where $z^h$ and $z^w$ denote the feature maps in the height and width directions, respectively; $F_1$ denotes the convolution operation with a $1 \times 1$ convolution kernel; and $\sigma$ represents the sigmoid activation function.
The feature map $f$ is then decomposed into two separate parts, $f^h$ and $f^w$, along the spatial dimension. Using two $1 \times 1$ convolution operations $F_h$ and $F_w$, $f^h$ and $f^w$ are transformed into tensors with the same number of channels as the input. After the sigmoid activation, we obtain the attention weights in the height and width directions, $g^h$ and $g^w$, respectively:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \tag{4}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right). \tag{5}$$
Finally, the attention weights $g^h$ and $g^w$ are multiplied with the original feature map to obtain the feature map with coordinate attention:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j). \tag{6}$$
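A minimal PyTorch sketch of this module, following Equations (1)-(6) above, is given below; the reduction ratio and the channel floor are illustrative choices rather than values taken from the original implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module following Eqs. (1)-(6); r is an illustrative reduction ratio."""
    def __init__(self, channels, r=32):
        super().__init__()
        hidden = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)    # F_1 in Eq. (3)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)   # F_h in Eq. (4)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)   # F_w in Eq. (5)

    def forward(self, x):
        n, c, H, W = x.shape
        # Eqs. (1)-(2): directional average pooling along width and height.
        z_h = x.mean(dim=3, keepdim=True)                      # (n, c, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, W, 1)
        # Eq. (3): concatenate, reduce channels with a 1x1 convolution, apply sigmoid.
        f = torch.sigmoid(self.conv1(torch.cat([z_h, z_w], dim=2)))
        # Split back into height and width branches.
        f_h, f_w = torch.split(f, [H, W], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)
        # Eqs. (4)-(5): per-direction attention weights.
        g_h = torch.sigmoid(self.conv_h(f_h))   # (n, c, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w))   # (n, c, 1, W)
        # Eq. (6): reweight the input feature map.
        return x * g_h * g_w
```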
3.3. Image Mixing Method Based on Attention and Confidence Fusion
In this study, we implement an image mixing technique that uses attention and confidence mechanisms to blend images from the source and target domains. A region is selectively cropped from the target image and overlaid onto the source image, creating a mixed image as depicted in
Figure 3. Firstly, the source image $x_s$ and target image $x_t$ are input to the detector. Both $x_s$ and $x_t$ are passed through the backbone for feature extraction and then through the neck section for multi-scale feature fusion. Through the coordinate attention module (C3 in Figure 1), the attention score matrices $A_s$ and $A_t$ are calculated based on the product of the attention weights $g^h$ and $g^w$ in Equations (4) and (5). After processing through the CA module, each image is divided into $H' \times W'$ regions, with each region corresponding to a pixel block in the original image. Each unit in the feature map (i.e., each element of the $H' \times W'$ matrix) represents the semantic representation of the corresponding pixel block, and its value indicates the model’s attention intensity, or importance, for that region of the original image. High values indicate that the model considers the region more important or informative, while low values indicate the opposite.
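For clarity, the per-location attention score can be sketched as follows, assuming the height and width weights from Equations (4) and (5) are multiplied and then averaged over channels; the channel averaging is our assumption rather than a detail stated in the text.

```python
import torch

def attention_score_map(g_h: torch.Tensor, g_w: torch.Tensor) -> torch.Tensor:
    """Per-location attention score from the CA weights.

    g_h: (n, c, H, 1) height-direction weights; g_w: (n, c, 1, W) width-direction
    weights. Their product broadcasts to (n, c, H, W); averaging over channels
    (our assumption) yields one (n, H, W) score map per image.
    """
    return (g_h * g_w).mean(dim=1)
```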
The attention score matrices $A_s$ and $A_t$ are computed in the same way, and the confidence matrix $C'$ is computed as follows. We first define a confidence matrix $C$ of size $H \times W$, initialized to zero. The detector head outputs predictions including the object’s center coordinates $(x, y)$, width $w$, height $h$, and confidence score $s$. Based on these predictions, we update the corresponding region of the confidence matrix $C$. The updated matrix is expressed as follows:

$$C_{i,j} = s, \quad x - \frac{w}{2} \le i \le x + \frac{w}{2}, \; y - \frac{h}{2} \le j \le y + \frac{h}{2}, \tag{7}$$

where $i$ is the horizontal index ranging from $x - w/2$ to $x + w/2$, and $j$ is the vertical index ranging from $y - h/2$ to $y + h/2$. After that, we apply average pooling to reduce the confidence matrix $C$ to a new matrix $C'$ of size $H' \times W'$. The average pooling operation is defined as follows:

$$C'_{m,n} = \frac{1}{k_h k_w} \sum_{p=0}^{k_h - 1} \sum_{q=0}^{k_w - 1} C_{m k_h + p,\, n k_w + q}, \tag{8}$$

where $C'_{m,n}$ represents the elements of the compressed confidence matrix, $k_h$ is the height of the pooling window, $k_w$ is the width of the pooling window, $p$ and $q$ are local indices within the pooling window, and $C_{m k_h + p,\, n k_w + q}$ refers to the elements of the original matrix $C$ within the pooling window.
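The construction of $C$ and its pooling to $C'$ can be sketched as follows; clipping box extents to the image and keeping the maximum score where boxes overlap are our assumptions, not details stated above.

```python
import torch
import torch.nn.functional as F

def confidence_matrices(detections, H, W, kh, kw):
    """Build the confidence matrix C (Eq. 7) and pool it to C' (Eq. 8).

    detections: iterable of (x, y, w, h, s) head outputs in pixel coordinates.
    """
    C = torch.zeros(H, W)
    for x, y, w, h, s in detections:
        i0, i1 = max(0, int(x - w / 2)), min(W, int(x + w / 2))
        j0, j1 = max(0, int(y - h / 2)), min(H, int(y + h / 2))
        # Keep the highest confidence where boxes overlap (an assumption for ties).
        C[j0:j1, i0:i1] = torch.maximum(C[j0:j1, i0:i1], torch.tensor(float(s)))
    # Eq. (8): non-overlapping average pooling down to H' x W'.
    C_prime = F.avg_pool2d(C[None, None], kernel_size=(kh, kw)).squeeze()
    return C, C_prime
```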
The attention-confidence score matrices $M_s$ and $M_t$ can be obtained by conducting the Hadamard product of the attention score matrix and the confidence matrix, which can be formulated as

$$M = A \odot \left(C' + 1\right), \tag{9}$$

where $M$ is the attention-confidence score fusion matrix, $A$ denotes the attention score matrix, and $C'$ denotes the confidence score matrix. Adding 1 to $C'$ mitigates the impact of zero confidence on attention scores.
By employing a sliding-window technique with a fixed window size, the regions with the lowest and highest scores can be identified. The highest-scoring region from the target image is then substituted into the lowest-scoring region of the source image, which facilitates information exchange between the source and target domains. This approach mitigates domain discrepancies, thereby enhancing the model’s adaptability to the target domain and improving its generalization capability. By comprehensively integrating confidence and attention, the importance and feature relevance of different regions within the mixed image can be effectively evaluated, resulting in more accurate mixed images. Even in the absence of explicit targets, the attention scores across different regions of the image can guide the mixing process, thereby enhancing the quality of the mixed image.
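A simplified sketch of this selection-and-paste step, combining Equation (9) with a sliding-window search, is given below; the window size, block size, and tie-breaking behaviour are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def mix_images(x_s, x_t, A_s, C_s, A_t, C_t, win=3, block=32):
    """Attention-confidence-guided image mixing (sketch).

    x_s, x_t: source/target images of shape (3, H, W); A_* and C_*: attention and
    pooled confidence matrices of shape (H', W'), with H = H' * block and
    W = W' * block. `win` and `block` are illustrative values.
    """
    # Eq. (9): attention-confidence fusion matrices.
    M_s = A_s * (C_s + 1.0)
    M_t = A_t * (C_t + 1.0)

    # Sliding-window mean score over every win x win block of grid cells.
    score_s = F.avg_pool2d(M_s[None, None], win, stride=1).squeeze()
    score_t = F.avg_pool2d(M_t[None, None], win, stride=1).squeeze()

    # Lowest-scoring window in the source, highest-scoring window in the target.
    lo = int(torch.argmin(score_s))
    hi = int(torch.argmax(score_t))
    lo_r, lo_c = lo // score_s.shape[1], lo % score_s.shape[1]
    hi_r, hi_c = hi // score_t.shape[1], hi % score_t.shape[1]

    # Paste the target patch into the source image (grid cells -> pixels).
    mixed = x_s.clone()
    size = win * block
    mixed[:, lo_r * block: lo_r * block + size, lo_c * block: lo_c * block + size] = \
        x_t[:, hi_r * block: hi_r * block + size, hi_c * block: hi_c * block + size]
    return mixed
```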
Unlike the self-attention mechanism employed in typical transformers, which demands significantly more computational resources when processing high-resolution images [34], CA is a more lightweight attention mechanism and offers superior performance with reduced computational overhead. It can also be seamlessly integrated into existing object detection frameworks, such as the YOLO model.
Based on experimental exploration, the CA module labeled C3 in
Figure 1 was selected as the attention score extractor. The attention effects of the C1, C2, and C3 modules are depicted in
Figure 4. As illustrated in the figure, the attention matrix of C3, compared to C1 and C2, more accurately captures the positions of the garbage objects in the original image. As the network layers go deeper, the extracted features progressively transition from low-level details to high-level semantic representations, allowing for a holistic characterization of the shape and structure of urban garbage objects. Moreover, deeper feature representations offer enhanced robustness against interference, as background noise is incrementally suppressed, allowing the CA module to concentrate more on regions pertinent to garbage targets. Furthermore, the CA module allocates spatial attention across both width and height dimensions, capturing fine-grained spatial variations, thereby enhancing its ability to accurately localize the targets. The neck component of YOLOv5 facilitates a multi-scale integration of both global and local information and helps the CA module to capture multi-scale features of objects in complex environments. Consequently, the C3 module precisely localizes urban garbage targets.