3.2. YOLOv5 Algorithm and Improvements
The YOLOv5 architecture comprises three key components: the backbone, neck, and head. Initially, the input image is preprocessed before entering the backbone, where features are extracted through the cross-stage partial (CSP), convolution-batch normalization-SiLU (CBS), and spatial pyramid pooling-fast (SPPF) modules. The CSP module is designed to enhance the model’s performance and computational efficiency. Its primary principle involves dividing the input features into two parts: one part undergoes a series of convolutional operations, while the other part connects directly to subsequent layers. This introduces cross-stage connections, effectively accelerating information propagation and reducing computational complexity. The CSP module not only enhances feature diversity but also diminishes redundancy among features, enabling the model to learn rich feature representations. Consequently, the CSP module significantly improves the model’s adaptability to complex tasks, particularly demonstrating superior performance in multi-object detection and intricate scenarios.
The CBS module integrates convolutional operations, batch normalization, and the SiLU activation function. First, the convolutional layer extracts local features from the input image, facilitating the detection of object edges and shapes. Subsequently, the batch normalization layer standardizes the convolutional outputs, stabilizing the outputs of different layers during training and thus accelerating convergence. Finally, the SiLU activation function serves as a nonlinear transformation, effectively introducing nonlinearity and enhancing the model’s representational capability. The CBS module not only improves object detection accuracy but also enhances processing speed, thereby meeting the demands of real-time applications.
The SPPF module aims to enhance the model’s adaptability to objects of varying scales. SPPF employs pooling operations of different sizes (e.g., 1 × 1, 5 × 5, 9 × 9) to perform multi-scale processing on feature maps, allowing both detailed and contextual information to be extracted. This multi-scale feature extraction strategy enables YOLOv5s to handle objects of diverse sizes. Moreover, the SPPF module reduces the size of the feature maps, alleviating the computational burden on subsequent layers and improving the overall inference efficiency of the model.
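To make the composition of these backbone blocks concrete, the following is a minimal PyTorch sketch of a CBS block and an SPPF block; the module names, channel counts, and kernel sizes are illustrative rather than the exact YOLOv5s configuration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU, as described above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial pyramid pooling-fast: repeated pooling with one kernel size."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # Concatenate the multi-scale pooled features before the final CBS.
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```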
The extracted features are then passed to the neck component, which is primarily responsible for fusing the multi-level features to enhance the model’s detection capabilities for objects of varying scales. YOLOv5s employs a bottom–up feature fusion strategy that combines high-level semantic information with low-level detail information. This module consists of multiple convolutional layers and upsampling layers. Convolutional layers are used for feature extraction and transformation, while upsampling layers refine low-resolution feature maps to higher resolutions, ensuring effective fusion of features across different levels. On this basis, the model can comprehensively utilize features from all layers, thereby improving its capability to detect multi-scale objects.
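As a rough sketch of this fusion step (channel counts and feature-map sizes are illustrative, not the actual YOLOv5s values), a deeper, low-resolution feature map is upsampled and concatenated with a shallower, higher-resolution one before further convolution:

```python
import torch
import torch.nn as nn

# Hypothetical neck features: a deep, low-resolution map and a shallower one.
p5 = torch.randn(1, 512, 20, 20)   # high-level semantic features
p4 = torch.randn(1, 256, 40, 40)   # lower-level detail features

up = nn.Upsample(scale_factor=2, mode="nearest")(p5)   # (1, 512, 40, 40)
fused = torch.cat([up, p4], dim=1)                     # (1, 768, 40, 40)
fused = nn.Conv2d(768, 256, kernel_size=1)(fused)      # transform the fused features
```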
Finally, the fused features are input into the head component, which is responsible for converting the features processed by the neck module into the final detection results. The head directly outputs detection information, including the bounding boxes of objects, class probabilities, and center point coordinates. Specifically, the head processes the feature map through a series of convolutional layers. To ensure the final accuracy, YOLOv5s also incorporates the non-maximum suppression (NMS) algorithm, which eliminates overlapping boxes while retaining the best detection results. In this way, YOLOv5s achieves high precision and efficiency in object detection tasks and demonstrates adaptability to various complex visual scenarios.
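For illustration, NMS can be applied to raw head outputs with an off-the-shelf operator such as torchvision's `nms`; the boxes, scores, and IoU threshold below are made-up examples rather than values from the paper.

```python
import torch
from torchvision.ops import nms

# Hypothetical head outputs: boxes in (x1, y1, x2, y2) format and confidence scores.
boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.], [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.75, 0.8])

keep = nms(boxes, scores, iou_threshold=0.45)  # indices of boxes kept after NMS
print(keep)  # tensor([0, 2]): the overlapping lower-score box is suppressed
```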
In urban garbage object detection, various environmental conditions, complex background scenes, and the diversity of garbage objects often result in misclassification of non-garbage objects as garbage and inaccurate localization of garbage objects. To address these issues, we introduce the coordinate attention (CA) module [33] into YOLOv5 to enhance the feature extraction capability and attention to garbage objects, providing support for subsequent image mixing. The specific addition position of the module is shown in Figure 1. By incorporating the CA module, the model can effectively capture the spatial relationship between target regions and surrounding environments by leveraging positional information to focus on target areas and suppress irrelevant backgrounds. This ensures accurate localization of target regions, enabling YOLOv5 to achieve more precise and reliable object recognition in urban garbage detection tasks.
The CA module embeds positional information into channel attention, with the objective of enhancing the feature learning capacity. This module is capable of transforming any intermediate feature while maintaining consistent input and output dimensions [33]. The process is illustrated in Figure 2.
Assume the input feature is of size $C \times H \times W$. First, we encode each channel using pooling kernels of size $(H, 1)$ and $(1, W)$ along the X and Y coordinates, respectively. The outputs of the $c$-th channel at height $h$ and of the $c$-th channel at width $w$ are as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \tag{1}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w), \tag{2}$$

where $i$ and $j$ denote the spatial indexes along the width $W$ and height $H$ dimensions.
Next, we combine the feature maps in the width and height directions to obtain a global receptive field. Using a $1 \times 1$ convolution, the feature dimension is reduced to $1/r$ of the original size, and the result is then sent to the sigmoid activation function to obtain the feature map $f$:

$$f = \sigma\left(F_1\left(\left[z^h, z^w\right]\right)\right), \tag{3}$$

where $z^h$ and $z^w$ denote the feature maps in the height and width directions, respectively; $F_1$ denotes the convolution operation with a $1 \times 1$ convolution kernel; and $\sigma$ represents the sigmoid activation function.
The feature map $f$ is then decomposed into two separate parts, $f^h$ and $f^w$, along the spatial dimension. Using two $1 \times 1$ convolution operations $F_h$ and $F_w$, $f^h$ and $f^w$ are transformed into tensors with the same number of channels as the input. After the sigmoid activation, we obtain the attention weights in the height and width directions, $g^h$ and $g^w$, respectively:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \tag{4}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right). \tag{5}$$
Finally, the attention weights $g^h$ and $g^w$ are multiplied with the original feature map to obtain the feature map with coordinate attention:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j). \tag{6}$$
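A minimal PyTorch sketch of this module, following Equations (1)-(6) above, is given below; the reduction ratio and the channel floor are illustrative choices rather than values taken from the original implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module following Eqs. (1)-(6); r is an illustrative reduction ratio."""
    def __init__(self, channels, r=32):
        super().__init__()
        hidden = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)    # F_1 in Eq. (3)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)   # F_h in Eq. (4)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)   # F_w in Eq. (5)

    def forward(self, x):
        n, c, H, W = x.shape
        # Eqs. (1)-(2): directional average pooling along width and height.
        z_h = x.mean(dim=3, keepdim=True)                      # (n, c, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, W, 1)
        # Eq. (3): concatenate, reduce channels with a 1x1 convolution, apply sigmoid.
        f = torch.sigmoid(self.conv1(torch.cat([z_h, z_w], dim=2)))
        # Split back into height and width branches.
        f_h, f_w = torch.split(f, [H, W], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)
        # Eqs. (4)-(5): per-direction attention weights.
        g_h = torch.sigmoid(self.conv_h(f_h))   # (n, c, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w))   # (n, c, 1, W)
        # Eq. (6): reweight the input feature map.
        return x * g_h * g_w
```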
3.3. Image Mixing Method Based on Attention and Confidence Fusion
In this study, we implement an image mixing technique that uses attention and confidence mechanisms to blend images from the source and target domains. A region is selectively cropped from the target image and overlaid onto the source image, creating a mixed image as depicted in
Figure 3. Firstly, the source image $x_s$ and target image $x_t$ are input to the detector. Both $x_s$ and $x_t$ are passed through the backbone for feature extraction and then through the neck section for multi-scale feature fusion. Through the coordinate attention module (C3 in Figure 1), the attention score matrices $A_s$ and $A_t$ are calculated based on the product of the attention weights $g^h$ and $g^w$ in Equations (4) and (5). After processing through the CA module, each image is divided into $H' \times W'$ regions, with each region corresponding to a pixel block in the original image. Each unit in the feature map (i.e., each element of the $H' \times W'$ matrix) represents the semantic representation of the corresponding pixel block, and its value indicates the model’s attention intensity, or importance, for that region of the original image. High values indicate that the model considers the region more important or informative, while low values indicate the opposite.
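For clarity, the per-location attention score can be sketched as follows, assuming the height and width weights from Equations (4) and (5) are multiplied and then averaged over channels; the channel averaging is our assumption rather than a detail stated in the text.

```python
import torch

def attention_score_map(g_h: torch.Tensor, g_w: torch.Tensor) -> torch.Tensor:
    """Per-location attention score from the CA weights.

    g_h: (n, c, H, 1) height-direction weights; g_w: (n, c, 1, W) width-direction
    weights. Their product broadcasts to (n, c, H, W); averaging over channels
    (our assumption) yields one (n, H, W) score map per image.
    """
    return (g_h * g_w).mean(dim=1)
```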
The attention score matrices $A_s$ and $A_t$ are computed in the same way, and the confidence matrix $C'$ is computed as follows. We first define a confidence matrix $C$ of size $H \times W$, initialized to zero. The detector head outputs predictions including the object’s center coordinates $(x, y)$, width $w$, height $h$, and confidence score $s$. Based on these predictions, we update the corresponding region of the confidence matrix $C$. The updated matrix is expressed as follows:

$$C_{i,j} = s, \quad x - \frac{w}{2} \le i \le x + \frac{w}{2}, \; y - \frac{h}{2} \le j \le y + \frac{h}{2}, \tag{7}$$

where $i$ is the horizontal index ranging from $x - w/2$ to $x + w/2$, and $j$ is the vertical index ranging from $y - h/2$ to $y + h/2$. After that, we apply average pooling to reduce the confidence matrix $C$ to a new matrix $C'$ of size $H' \times W'$. The average pooling operation is defined as follows:

$$C'_{m,n} = \frac{1}{k_h k_w} \sum_{p=0}^{k_h - 1} \sum_{q=0}^{k_w - 1} C_{m k_h + p,\, n k_w + q}, \tag{8}$$

where $C'_{m,n}$ represents the elements of the compressed confidence matrix, $k_h$ is the height of the pooling window, $k_w$ is the width of the pooling window, $p$ and $q$ are local indices within the pooling window, and $C_{m k_h + p,\, n k_w + q}$ refers to the elements of the original matrix $C$ within the pooling window.
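The construction of $C$ and its pooling to $C'$ can be sketched as follows; clipping box extents to the image and keeping the maximum score where boxes overlap are our assumptions, not details stated above.

```python
import torch
import torch.nn.functional as F

def confidence_matrices(detections, H, W, kh, kw):
    """Build the confidence matrix C (Eq. 7) and pool it to C' (Eq. 8).

    detections: iterable of (x, y, w, h, s) head outputs in pixel coordinates.
    """
    C = torch.zeros(H, W)
    for x, y, w, h, s in detections:
        i0, i1 = max(0, int(x - w / 2)), min(W, int(x + w / 2))
        j0, j1 = max(0, int(y - h / 2)), min(H, int(y + h / 2))
        # Keep the highest confidence where boxes overlap (an assumption for ties).
        C[j0:j1, i0:i1] = torch.maximum(C[j0:j1, i0:i1], torch.tensor(float(s)))
    # Eq. (8): non-overlapping average pooling down to H' x W'.
    C_prime = F.avg_pool2d(C[None, None], kernel_size=(kh, kw)).squeeze()
    return C, C_prime
```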
The attention-confidence score matrices $M_s$ and $M_t$ can be obtained by conducting the Hadamard product of the attention score matrix and the confidence matrix, which can be formulated as

$$M = A \odot \left(C' + 1\right), \tag{9}$$

where $M$ is the attention-confidence score fusion matrix, $A$ denotes the attention score matrix, and $C'$ denotes the confidence score matrix. Adding 1 to $C'$ mitigates the impact of zero confidence on attention scores.
By employing a sliding-window technique with a fixed window size, the regions with the lowest and highest scores can be identified. The highest-scoring region from the target image is then substituted into the lowest-scoring region of the source image, which facilitates information exchange between the source and target domains. This approach mitigates domain discrepancies, thereby enhancing the model’s adaptability to the target domain and improving its generalization capability. By comprehensively integrating confidence and attention, the importance and feature relevance of different regions within the mixed image can be effectively evaluated, resulting in more accurate mixed images. Even in the absence of explicit targets, the attention scores across different regions of the image can guide the mixing process, thereby enhancing the quality of the mixed image.
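A simplified sketch of this selection-and-paste step, combining Equation (9) with a sliding-window search, is given below; the window size, block size, and tie-breaking behaviour are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def mix_images(x_s, x_t, A_s, C_s, A_t, C_t, win=3, block=32):
    """Attention-confidence-guided image mixing (sketch).

    x_s, x_t: source/target images of shape (3, H, W); A_* and C_*: attention and
    pooled confidence matrices of shape (H', W'), with H = H' * block and
    W = W' * block. `win` and `block` are illustrative values.
    """
    # Eq. (9): attention-confidence fusion matrices.
    M_s = A_s * (C_s + 1.0)
    M_t = A_t * (C_t + 1.0)

    # Sliding-window mean score over every win x win block of grid cells.
    score_s = F.avg_pool2d(M_s[None, None], win, stride=1).squeeze()
    score_t = F.avg_pool2d(M_t[None, None], win, stride=1).squeeze()

    # Lowest-scoring window in the source, highest-scoring window in the target.
    lo = int(torch.argmin(score_s))
    hi = int(torch.argmax(score_t))
    lo_r, lo_c = lo // score_s.shape[1], lo % score_s.shape[1]
    hi_r, hi_c = hi // score_t.shape[1], hi % score_t.shape[1]

    # Paste the target patch into the source image (grid cells -> pixels).
    mixed = x_s.clone()
    size = win * block
    mixed[:, lo_r * block: lo_r * block + size, lo_c * block: lo_c * block + size] = \
        x_t[:, hi_r * block: hi_r * block + size, hi_c * block: hi_c * block + size]
    return mixed
```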
Unlike the self-attention mechanism employed in typical transformers, which demands significantly more computational resources when processing high-resolution images [34], CA is a more lightweight attention mechanism and offers superior performance with reduced computational overhead. It can also be seamlessly integrated into existing object detection frameworks, such as the YOLO model.
Based on experimental exploration, the CA module labeled C3 in
Figure 1 was selected as the attention score extractor. The attention effects of the C1, C2, and C3 modules are depicted in
Figure 4. As illustrated in the figure, the attention matrix of C3, compared to C1 and C2, more accurately captures the positions of the garbage objects in the original image. As the network layers go deeper, the extracted features progressively transition from low-level details to high-level semantic representations, allowing for a holistic characterization of the shape and structure of urban garbage objects. Moreover, deeper feature representations offer enhanced robustness against interference, as background noise is incrementally suppressed, allowing the CA module to concentrate more on regions pertinent to garbage targets. Furthermore, the CA module allocates spatial attention across both width and height dimensions, capturing fine-grained spatial variations, thereby enhancing its ability to accurately localize the targets. The neck component of YOLOv5 facilitates a multi-scale integration of both global and local information and helps the CA module to capture multi-scale features of objects in complex environments. Consequently, the C3 module precisely localizes urban garbage targets.