Article

SS-YOLOv8: A Lightweight Algorithm for Surface Litter Detection

1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
2 Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin 150028, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9283; https://doi.org/10.3390/app14209283
Submission received: 31 August 2024 / Revised: 8 October 2024 / Accepted: 9 October 2024 / Published: 12 October 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract

With the advancement of science and technology, pollution in rivers and on water surfaces has increased, impacting both ecology and public health. Timely identification of surface waste is crucial for effective cleanup. Edge devices have limited memory and computing resources, so the standard YOLOv8 algorithm runs inefficiently on them. This paper introduces a lightweight network model for detecting water surface litter. We enhance the CSP Bottleneck with two convolutions (C2f) module to improve feature extraction for image recognition tasks. By adopting the powerful intersection over union 2 (PIoU2) loss, we improve model accuracy over the original CIoU. Our Shared Convolutional Detection Head (SCDH) minimizes parameters, while a scale layer compensates for the differing feature scales of the detection levels. Using a slimming pruning method, we further reduce the model's size and computational needs. Our model achieves a mean average precision (mAP) of 79.9% on the surface litter dataset, with a compact size of 2.3 MB and a processing rate of 128 frames per second, meeting real-time detection requirements. This work contributes to efficient environmental monitoring and offers a scalable solution for deploying advanced detection models on resource-constrained devices.

1. Introduction

Increasingly erratic meteorological conditions have led to the accumulation of waste material in river systems. Surface garbage is a significant contributor to river pollution: it disrupts the ecological balance of rivers, kills large numbers of aquatic organisms, and degrades the living environment of people in riverine areas. Recycling and cleaning river garbage is an essential step in maintaining ecological balance; nevertheless, the process is inherently time consuming and labor intensive.
In recent years, a substantial body of research has been conducted on target detection in the domain of deep learning, as evidenced in [1,2,3,4,5]. These methods have the capacity to extract features from images and have demonstrated notable efficacy in target detection, image classification, and target localization. However, these models suffer from several limitations, including the use of large models, parameter redundancy, limited robustness, and low efficiency. Furthermore, they are not suitable for deployment on edge devices. In recent years, studies [6,7,8] have shown that deploying neural network models on edge devices is economical and applicable. The combination of high-resolution cameras and the compact size of edge devices allows for the effective detection of surface litter. Furthermore, the flexibility of the edge device, which has a minimal environmental impact, makes it a sustainable tool for the ecological maintenance of rivers. The ability to analyze real-time data enhances the timeliness of decision-making and effectively reduces the waste of labor. Thus, the deployment of the network model at the edge facilitates a shift towards more efficient, safe, and environmentally friendly practices in the field of river ecological surface waste detection.
In recent years, deep learning has become increasingly important in the field of object detection, as demonstrated in the literature [9]. Deep neural network-based detection algorithms have shown considerable promise for the automatic detection of rubbish, offering real-time, robust, and widely applicable solutions. Nevertheless, existing algorithmic models still leave room for improvement in the detection of water surface litter. The traditional Faster R-CNN [10] suffers from parameter redundancy, a large model size, and low efficiency. Classical single-stage algorithms, such as YOLOv8 [11], have demonstrated notable efficiency gains; however, they remain suboptimal in terms of accuracy and are characterized by large models that present challenges for deployment on edge devices.
It is commonly acknowledged that traditional convolutional neural networks (CNNs) often necessitate substantial computational resources and parameters to achieve satisfactory performance. This presents a significant challenge in terms of deploying such models on edge devices, which are often constrained by limited memory and computational capabilities. Accordingly, our objective is to develop a lightweight and highly performant model for the detection of water surface litter.
Henar et al. [12] compared optimized model deployments across a series of edge devices. The optimized models deployed on the Jetson series achieved an average percent power reduction (%R) of 53.18%, with a 35.9% percent difference from the results obtained on a local PC.
Despite the success of lightweight neural networks in various industries, research in river water surface litter detection is quite limited. This study aims to address this gap by developing an efficient and lightweight surface waste detection model. The contributions of this study are as follows:
(1) Based on the C2f module in YOLOv8, we propose WS-C2f, a faster implementation of the CSP Bottleneck with two convolutions that incorporates Sobel convolutions (Sobel_Conv), for the detection of real water surface litter. The WS module employs a Sobel_Conv branch to explicitly extract edge features from the image and an additional convolutional branch (conv_branch) to extract spatial information. In contrast to the Sobel_Conv branch, the convolutional branch extracts features from the original image, thereby preserving rich spatial details. The WS module concatenates the features extracted from the two branches; this fusion allows the learned feature representation to contain both rich edge information and spatial information, thereby enabling a more comprehensive portrayal of the image content;
(2) The concept of shared convolution is employed to develop a more compact detection head, SCDH, which significantly reduces the number of parameters and increases the efficiency of the model, particularly on constrained devices. To address the inconsistent scales of the targets detected by each detection head, a scale coefficient is used to scale the features within the shared convolution;
(3) The slimming pruning method reduces the number of parameters and the amount of computation, thereby reducing the size of the model and speeding up its execution;
(4) PIoU2 is introduced as a novel regression loss function to accelerate the regression of the network on low-IoU samples and to enhance the localization precision of targets.

2. Related Works

2.1. Surface Litter Detection

The conventional methodologies employed for the detection of water surface litter primarily entail the construction of templates through the enumeration of the texture, hue, and contour characteristics of water surface entities. Subsequently, the pertinent classifiers are utilized to ascertain the categories of water surface litter.
Scholars currently employ deep learning techniques for object detection [13,14]. The primary distinction among these approaches is whether they use a single-stage or two-stage pipeline. To achieve real-time deployment, the single-stage you only look once (YOLO) series [15,16] is commonly used on edge devices. Zhou et al. [17] employed MobileViTv3 as a feature extraction network to encode the entire visual field, thereby reducing the parameters of the YOLO model and expanding its scope of application. Jiang et al. [18] employed a modified YOLOv7 with ACanny PConv-ELAN and multi-scale gated attention for adaptive weight allocation (MGA) to enhance the precision of target detection for small, challenging objects.
While these methods are valuable for water surface litter detection, they still present several challenges. (1) Although many methods achieve higher detection precision in specific scenarios, they frequently require more model parameters and higher computational complexity, which limits their applicability for deployment at the edge. (2) Some existing lightweight models reduce the number of model parameters but fail to achieve the best trade-off between accuracy and model size. In this paper, we propose a convolutional module based on Sobel convolution, designed to aid the recognition of target edges and the separation of target and background, thereby enhancing the accuracy of the model. Furthermore, the model is pruned to reduce its overall size. This method achieves the best trade-off between model size and accuracy.

2.2. YOLOv8 Algorithm

YOLOv8 is the next major update of YOLOv5 and was open sourced by Ultralytics on 10 January 2023. It currently supports image classification, object detection, and instance segmentation tasks. Given the good performance of YOLOv5, YOLOv8 was well received upon release.
The YOLOv8 model introduces new features and improvements based on the YOLOv5 [19] model. The YOLOv8 model better balances speed and accuracy than the YOLOv5 model. The YOLOv8 model is categorized into five different sizes of models, n, s, m, l, and x, depending on the depth and width of the network. The n model has lower computational and parameter requirements, which facilitates deployment on mobile and CPU devices. YOLOv8 consists of three main components: the backbone, which is based on CSP, the neck, which is the feature enhancement network, and the head, which is the detection head.
These components are used for target feature extraction, target feature fusion, and prediction output, respectively. The network structure is illustrated in Figure 1. The backbone comprises the CBS, C2f, and SPPF modules. The C2f module makes the model lighter by removing one convolution layer from the original C3 module while retaining the CSP concept of YOLOv5, and the backbone ends with the Spatial Pyramid Pooling-Fast (SPPF) module. The neck employs the PA-FPN (Path Aggregation Network-Feature Pyramid Network) concept, fusing low-level and high-level features in both directions to enhance target detection across different scales; compared with YOLOv5, the convolution in the upsampling stage of the PA-FPN is removed and the C3 module is replaced with a C2f module. The head is changed from the anchor-based coupled head of YOLOv5 to an anchor-free decoupled head. This change removes the obj branch and introduces separate cls classification and box regression branches. Furthermore, the regression branch employs the integral form representation of the DFL strategy, while the decoupled head [20] is used for more efficient training and inference. Ultimately, three detection heads predict targets of different sizes. These improvements allow YOLOv8 to be more finely tuned and optimized while retaining the advantages of YOLOv5's network structure and improving performance in various scenarios.
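For readers who want to reproduce the baseline, a minimal training-and-export sketch with the Ultralytics Python API [11] might look as follows; the dataset file name `sld.yaml` and the 640-pixel input size are assumptions for illustration, while the epoch count and batch size follow Table 2.

```python
# Minimal baseline sketch using the Ultralytics API [11].
# "sld.yaml" (a hypothetical dataset config listing the eight SLD classes)
# and imgsz=640 are assumptions; epochs and batch size follow Table 2.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # pretrained YOLOv8n weights
model.train(data="sld.yaml", epochs=150, batch=8, imgsz=640)
metrics = model.val()                            # mAP50 / mAP50-95 on the validation split
model.export(format="onnx")                      # export for edge deployment
```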

3. SS (Sobel and Share)-YOLOv8 Algorithm

3.1. Algorithm Overview

The baseline YOLOv8n suffers from inconspicuous image feature extraction, insufficient spatial information, low accuracy in small-target localization, and insufficient real-time performance in water surface litter detection. To address these deficiencies, the C2f module is improved: WS-C2f is proposed and replaces the original C2f module. Secondly, the detection head of the original network model is replaced by SCDH-Detect. Finally, PIoU2 [21] replaces the original loss function, which accelerates the convergence of the model and enhances its accuracy for small-target localization. The model resulting from this lightweight design process is designated SS-YOLOv8n.

3.2. WS-C2f Module

The YOLOv8 feature extraction network is structured in a manner similar to the CSPDarkNet network. However, the C2f module, comprising the standard convolution (SConv) and the bottleneck structure (Bottleneck), exhibits shortcomings in terms of a considerable number of parameters and a relatively slow computation speed. To reduce the model’s requirements on device arithmetic and memory, and to improve the accuracy of water surface trash detection, this paper constructs the WS-C2f module based on the concept of the Hgnet (Hourglass net) module in the backbone network in RT-DETR [22] and replaces the C2f in the original backbone network with it. The module is designed to serve as an efficient front-end module in image recognition tasks.
This module fuses a Sobel_Conv [23] branch for extracting edge information and a convolutional branch for extracting spatial information, thereby enabling the learning of more nuanced representations of image features. The WS-C2f module explicitly extracts the edge features of an image through the Sobel_Conv branch. The Sobel filter is a well-established edge detection filter that effectively captures abrupt changes in intensity in an image, thereby obtaining crucial edge information. The Sobel operator is a discrete differential operator used in image processing edge detection algorithms. It approximates the gradient of the image grayscale by computing the gradient at each pixel; the greater the gradient, the more likely the pixel lies on an edge. The operator combines Gaussian smoothing and differentiation and is also known as a first-order differential or derivative operator. It filters the image in the horizontal and vertical directions to obtain the image gradient in the x direction and the y direction, respectively. The Sobel operator is based on the assumption that pixels in the neighborhood do not have an equivalent effect on the current pixel. Consequently, pixels at different distances are assigned different weights, which in turn produce varying effects on the result of the operator. In general, the influence of a given point decreases with increasing distance.
The Sobel operator principle involves convolving the incoming image pixels. Convolution can be understood as a process of identifying the gradient value or performing a weighted average, where the weights are the so-called convolution kernel. Subsequently, a thresholding operation is applied to the resulting new pixel gray value, thereby obtaining the horizontal and vertical gradient information. This information is utilized to determine edge information, as illustrated in Figure 2 and Figure 3 and Equations (1) and (2).
$$P5_X = \sum_{i,j} X_i P_j, \quad i, j \in [1, 9] \tag{1}$$
$$P5_Y = X_1 P_1 + X_4 P_2 + X_7 P_3 + X_2 P_4 + X_5 P_5 + X_8 P_6 + X_3 P_7 + X_6 P_8 + X_9 P_9 \tag{2}$$
The WS module employs an additional convolutional branch, as illustrated in Figure 4, to extract spatial information. In contrast to the Sobel_Conv branch, the convolutional branch is capable of extracting features from the original image, thereby preserving the rich spatial details inherent to the source data. The WS module combines the extracted features from the Sobel_Conv branch and the convolutional branch. The fusion operation permits the learned feature representation to contain both rich edge information and spatial information, thereby facilitating a more comprehensive portrayal of the image content.
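As an illustration of how the Sobel_Conv branch and the parallel convolutional branch described above could be wired together, the following PyTorch sketch uses fixed Sobel kernels as a depthwise convolution and concatenates the two branches; the exact WS-C2f layout (channel counts, bottleneck arrangement in Figure 4) is simplified, and the class names are ours.

```python
import torch
import torch.nn as nn

class SobelConv(nn.Module):
    """Fixed-weight Sobel branch extracting horizontal/vertical gradients.
    A simplified sketch of the Sobel_Conv idea in Section 3.2."""
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # One depthwise 3x3 kernel per input channel for each gradient direction.
        kernel = torch.stack([gx, gy]).unsqueeze(1).repeat(channels, 1, 1, 1)
        self.conv = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                              groups=channels, bias=False)
        self.conv.weight.data.copy_(kernel)
        self.conv.weight.requires_grad_(False)            # Sobel weights stay fixed
        self.proj = nn.Conv2d(2 * channels, channels, 1)  # fuse Gx/Gy responses

    def forward(self, x):
        return self.proj(self.conv(x))

class WSBlock(nn.Module):
    """Concatenate the Sobel (edge) branch with a plain conv (spatial) branch."""
    def __init__(self, channels):
        super().__init__()
        self.sobel_branch = SobelConv(channels)
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.sobel_branch(x), self.conv_branch(x)], dim=1))
```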

3.3. SCDH-Detect Module

The most significant alteration is evident in the head component of YOLOv8. As the designation "Decoupled-Head" implies, the original single detection head is decomposed into two distinct components. The three decoupled heads correspond to the feature map outputs P3, P4, and P5 of the neck. After a feature map enters the decoupled head, it is split into two branches; after passing through three convolutional layers, the head outputs the confidence score for each class and the four coordinates (center point coordinates, width, and height) of the regression bounding box, so as to accurately locate the object, as illustrated in Figure 5.
The feature maps obtained from the network, designated P3, P4, and P5, are fed into the decoupled head for prediction and enter the regression branch of the detection head. The regression head computes the positional offsets between the predicted box and the actual box; these offsets are then used for loss computation, producing a four-dimensional vector that represents the upper-left coordinates (x, y) and lower-right coordinates (x, y) of the target box. The classification head performs RoI (Region of Interest) Pooling and convolution processing on each anchor-free candidate box to obtain a classifier output tensor; each position of this tensor represents the probability that the candidate box belongs to a class. Finally, non-maximum suppression (NMS) is used to filter out the final detection results.
Group normalization (GN) [24] is an improved algorithm motivated by the high error rate of batch normalization (BN) [25] when the batch size is small, because the results computed by the BN layer depend on the data of the current batch. When the batch size is small (such as 2 or 4), the mean and root mean square of the batch data are less representative and therefore have a greater impact on the final result. GN divides the channels into groups and calculates the normalized mean and variance for each group. For each feature channel, GN calculates the mean and variance within its group and then uses these statistics to standardize the features within the group. The calculation of GN is not affected by the batch size, so the accuracy remains consistent under different batch sizes. In other words, the normalization operation of GN is independent of batch size. The formulas are as follows:
$$\tilde{x}_i = \frac{1}{\sigma_i}\left(x_i - \mu_i\right)$$
$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}\left(x_k - \mu_i\right)^2 + \epsilon}$$
where $\epsilon$ is a small constant, $x$ is a feature computed by a layer with feature size $(N, C, H, W)$, $S_i$ is the set of pixels over which the mean and variance are computed, and $m$ is the size of this set; $S_i$ varies with the normalization method. In GN,
$$S_i = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$
where $G$ is the number of groups, a predefined hyperparameter (default $G = 32$), $C/G$ is the number of channels in each group, and GN computes $\mu$ and $\sigma$ along the $(C/G, H, W)$ axes.
Therefore, the core idea of GN is to normalize each set of identical feature maps, and these groups are divided only in the channel dimension, making the normalization process independent of batch size.
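As a quick illustrative check (not from the paper), `torch.nn.GroupNorm` implements exactly these per-group statistics, and its output for a given sample is the same regardless of how many samples share the batch:

```python
import torch
import torch.nn as nn

# Illustrative check: GroupNorm statistics do not depend on batch size.
gn = nn.GroupNorm(num_groups=32, num_channels=64)

x = torch.randn(8, 64, 20, 20)          # a batch of 8 feature maps
y_batch = gn(x)
y_single = gn(x[:1])                    # the same sample normalized alone

# The first sample is normalized identically in both cases.
print(torch.allclose(y_batch[:1], y_single, atol=1e-6))   # True
```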
To reduce the size of the model, this paper uses the same set of convolutional kernels for the three output layers through shared convolution [26]. Significantly reducing the number of parameters makes the model lighter, especially on resource-constrained devices. When shared convolution is used, group normalization is applied instead of batch normalization (BN), and the proposed SCDH scales the features with a scale layer to address the inconsistent scales of the targets detected by each detection head, as shown in Figure 6.
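Below is a minimal sketch of the shared-convolution idea behind SCDH, assuming the neck outputs have already been projected to a common channel width; the per-level Scale module mirrors the scale layer described above, while the actual SCDH branch layout and channel widths (Figure 6) are simplified.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scale factor, compensating for the different
    target scales seen by each detection level (see Figure 6)."""
    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

def conv_gn(c_in: int, c_out: int) -> nn.Sequential:
    # Conv + GroupNorm + SiLU; GN keeps accuracy stable at small batch sizes.
    # c_out must be divisible by the group count (16 here).
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.GroupNorm(16, c_out), nn.SiLU())

class SharedHead(nn.Module):
    """One set of convolution weights shared by the P3/P4/P5 levels (a sketch)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.stem = nn.Sequential(conv_gn(channels, channels),
                                  conv_gn(channels, channels))
        self.cls = nn.Conv2d(channels, num_classes, 1)  # shared classification branch
        self.box = nn.Conv2d(channels, 4 * 16, 1)       # shared box branch (4 coords x 16 DFL bins, as in YOLOv8)
        self.scales = nn.ModuleList([Scale() for _ in range(3)])

    def forward(self, feats):                           # feats = [P3, P4, P5], same channel width assumed
        outs = []
        for f, scale in zip(feats, self.scales):
            f = self.stem(f)                            # identical weights at every level
            outs.append((self.cls(f), scale(self.box(f))))
        return outs
```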

3.4. Optimizing the Loss Function

CIoU [27] is used as the loss function in YOLOv8. Three factors are considered: the overlap area between the predicted box and the ground-truth box, the distance between their center points, and the aspect ratio. It is defined as follows:
$$L_{CIoU} = 1 - IoU(A, B) + \frac{\rho^2(A_{ctr}, B_{ctr})}{c^2} + \alpha v$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{\omega_{gt}}{h_{gt}} - \arctan\frac{\omega}{h}\right)^2, \qquad \alpha = \frac{v}{1 - IoU + v}$$
In the formula, $A$ and $B$ represent the two boxes, and $A_{ctr}$ and $B_{ctr}$ represent the center points of $A$ and $B$. From the formula, it can be seen that if the aspect ratios of the predicted box and the ground-truth ($gt$) box are the same, the aspect-ratio penalty term is constantly 0, which is not reasonable. Observing the gradients of $v$ with respect to $\omega$ and $h$ in CIoU, the two gradients are a pair of opposite numbers, i.e., $\omega$ and $h$ cannot increase or decrease at the same time, which is also not reasonable. The CIoU-based loss function is thus affected by an unreasonable penalty factor, which causes the anchor box to expand during regression and significantly slows convergence.
To solve this problem, an enhanced IoU loss function, PIoU, was proposed, which combines a penalty factor adaptive to target size with a gradient adjustment function based on anchor box quality. The PIoU loss guides the anchor box regression along an effective path and thus converges faster than traditional IoU-based loss functions, as shown in Figure 7.
Although PIoU can effectively accelerate convergence, its accuracy decreases on the surface litter dataset. This paper therefore introduces a non-monotonic attention function controlled by a single hyperparameter and combines it with PIoU. Incorporating this attention mechanism into the PIoU loss yields the PIoU2 loss, which increases the focus on medium- and high-quality anchor boxes, leading to improved detector performance and precision. The attention function and the PIoU2 loss are formulated as follows:
$$q = e^{-P}, \quad q \in (0, 1], \qquad u(x) = 3x \cdot e^{-x^2}, \qquad L_{PIoU2} = u(\lambda q) \cdot L_{PIoU} = 3\lambda q \cdot e^{-(\lambda q)^2} \cdot L_{PIoU}$$
A new parameter $q$, ranging over $(0, 1]$, is introduced to represent the quality of the anchor box, substituting for the penalty factor $P$. Setting $q$ to 1 corresponds to $P = 0$, signifying perfect alignment of the anchor box with the target box. As $P$ grows, $q$ diminishes, reflecting a decline in anchor box quality. The hyperparameter $\lambda$ regulates the dynamics of the attention function, enhancing the focus on medium-quality anchor boxes through the non-monotonic attention function $u(\lambda q)$. This design ensures that the PIoU2 loss not only incorporates the attention mechanism but also maximizes its benefits, particularly by prioritizing the medium-quality phase of anchor box regression. By concentrating on these medium-quality boxes, PIoU2 boosts the model's precision in identifying targets amid surface clutter and streamlines hyperparameter optimization.
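The reweighting in the formula above is straightforward to apply on top of an existing PIoU implementation. The following sketch assumes the per-anchor PIoU loss and its size-adaptive penalty factor P have already been computed as in [21]; the default value of λ is an assumption.

```python
import torch

def piou2_from_piou(l_piou: torch.Tensor, P: torch.Tensor, lam: float = 1.3) -> torch.Tensor:
    """Apply the non-monotonic attention of PIoU2 to a precomputed PIoU loss.

    l_piou: per-anchor PIoU loss values, computed as in [21] (assumed given).
    P:      per-anchor size-adaptive penalty factor from PIoU (P >= 0).
    lam:    hyperparameter controlling the attention; 1.3 is an assumed value.
    """
    q = torch.exp(-P)                                    # q in (0, 1]; q = 1 means perfect alignment
    u = 3.0 * (lam * q) * torch.exp(-((lam * q) ** 2))   # u(lam*q): peaks for mid-quality anchors
    return u * l_piou
```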

3.5. Channel Pruning Based on Slimming

The model after lightweighting is somewhat improved in terms of computation and memory usage but still contains parameter redundancy. To reduce parameter and computational redundancy while preserving the model's effectiveness in detecting water surface debris and easing its deployment at the edge, structured pruning is further used in this paper. Pruning techniques significantly reduce the size and computation of a deep learning model by removing redundant or unimportant weight parameters while preserving the accuracy of the model as much as possible. This not only reduces the need for storage and computing resources but also speeds up inference, so that pruned models can be deployed directly on edge devices.
In the literature [28], network slimming is proposed, which uses a wide and large network as an input model, but during training, unimportant channels are automatically identified and subsequently pruned, resulting in a thin and compact model with considerable accuracy. The core of slimming is to introduce a scaling factor γ for each channel and multiply it with the output of that channel. Next, we train the weights of the network and these scaling factors simultaneously and apply sparse regularization to the scaling factors. Finally, we prune the channels with small factors and fine-tune the pruned network. Specifically, the training goal of our method is
$$L = \sum_{(x, y)} l\left(f(x, W), y\right) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$
In the above formula, $(x, y)$ denote the training inputs and targets, $W$ denotes the trainable weights, the first term corresponds to the normal training loss of the CNN, $g(\cdot)$ is the sparsity-induced penalty on the scaling factors, and $\lambda$ balances the two terms. Let $z_{in}$ and $z_{out}$ be the input and output of a BN layer, $\epsilon$ a small constant, and $\mathcal{B}$ the current mini-batch; the BN layer performs the following transformation:
$$\hat{z} = \frac{z_{in} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta$$
As shown in Figure 8, the smaller the scaling factor  γ , the less important the channel is.
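A minimal sketch of the two ingredients described above follows: the L1 sparsity term on the BN scaling factors added to the training loss, and a global ranking of channels by |γ| to select which ones to prune (a per-layer threshold would be the non-global variant compared later). Channel removal itself and the subsequent fine-tuning are omitted.

```python
import torch
import torch.nn as nn

def bn_l1_sparsity(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """Sparsity penalty lam * sum(|gamma|) over all BN scaling factors;
    added to the task loss during training (second term of L above)."""
    penalty = sum(m.weight.abs().sum()
                  for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return lam * penalty

def channels_to_keep(model: nn.Module, prune_ratio: float = 0.5) -> dict:
    """Rank channels globally by |gamma| and mark the smallest fraction for removal.
    Returns {bn_layer_name: boolean mask of channels to keep}."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)      # global gamma threshold
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```

During training, the total objective would then be `loss = task_loss + bn_l1_sparsity(model)`, and after convergence `channels_to_keep(model, 0.5)` gives the per-layer masks that a structured-pruning step would apply before fine-tuning.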

4. Discussion

4.1. Experimental Environment and Dataset

The environment configuration for this experiment is shown in Table 1 below.
The key parameters for this network training are shown in Table 2.
The edge device used in this paper is the Jetson Nano b01. Its key parameters are shown in Table 3.
This experiment was conducted using the publicly available Surface Litter Dataset (SLD), a high-quality dataset focusing on floating debris that consists of a series of well-labeled images designed to train and evaluate the performance of computer vision models on the task of floating debris detection. The dataset contains 2400 images in eight categories: ‘bottle’, ‘grass’, ‘branch’, ‘milk crate’, ‘plastic bag’, ‘plastic garbage’, ‘ball’, and ‘leaf’. Examples of images from the SLD are shown in Figure 9. Analysis shows that the dataset is relatively complex: water surface reflections are severe, and the edge contours of objects are entangled with their reflections and difficult to distinguish, so objects must often be detected at different scales, requiring the network model to accurately identify target contours.

4.2. Evaluation Indicators

In this paper, the experiments use Precision (P), Recall (R), and mean Average Precision (mAP) as the evaluation metrics of the algorithm’s detection performance. The formula is as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R)\, dR$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
where  F P  is the number of negative samples that were incorrectly predicted,  F N  is the number of positive samples that were incorrectly predicted, and  T P  is the number of positive samples that were correctly predicted.  A P  reflects the accuracy of target detection for a single category, and  m A P  indicates the average  A P  value for multiple categories. Additional metrics are introduced to evaluate model complexity and computational efficiency, mainly GFLOPs, Params, Model Size, Latency, and FPS. GFLOPs and Params are used to evaluate the computational complexity of the model; Latency is the time it takes the model to complete a forward inference, excluding image preprocessing and non-maximum suppression (NMS), which is used to evaluate the computational efficiency of the network; FPS represents the recognition speed of the model.
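For illustration (an assumed helper, not part of the paper's code), the per-class precision, recall, and AP defined above can be computed from confidence-sorted true-positive flags as follows:

```python
import numpy as np

def precision_recall_ap(tp: np.ndarray, conf: np.ndarray, n_gt: int):
    """Precision/recall curve and all-point-interpolated AP for one class.

    tp:   1/0 flag per detection (true positive or not), any order.
    conf: detection confidence scores, same length as tp.
    n_gt: number of ground-truth objects of this class.
    """
    order = np.argsort(-conf)                     # sort detections by confidence
    tp = tp[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1 - tp)
    recall = cum_tp / max(n_gt, 1)                # R = TP / (TP + FN)
    precision = cum_tp / (cum_tp + cum_fp)        # P = TP / (TP + FP)

    # AP = integral of P(R): pad the curve, take the precision envelope,
    # and sum the area under the stepwise curve.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    ap = np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
    return precision, recall, ap
```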

4.3. Ablation Experiments

Ablation experiments were conducted to ascertain the contribution of each SS-YOLOv8n improvement to accuracy and to lightweight river garbage detection. The results are presented in Table 4. In Experiment A, the C2f module was replaced with WS-C2f, forming a new backbone network. This reduced the number of model parameters (from 3.01 M to 2.85 M) and the computation by 0.4 GFLOPs. In Experiment B, SCDH was employed in place of the original model's detection head. This reduced the number of parameters and the model size, with an improvement in mAP50 of 1.2%, demonstrating that the new head configuration mitigates the loss of information about surface litter targets. Experiment C introduces the new loss function, PIoU2, on top of B, which raises mAP50 to 80.1% and further optimizes detection accuracy; the new loss enhances the localization of water surface litter targets. Experiment D employs the slimming pruning method to compress the model derived from C, after which the compressed model is fine-tuned. While mAP50 drops by 0.2%, the resulting model is only 2.3 MB in size and exhibits a 0.17 ms reduction in inference latency. This strategy efficiently minimizes the number of parameters without compromising the model's efficacy and additionally increases operational speed.
To further show the effect of Experiments A, B, and C, the models of Experiments A, B, and C were migrated to the Jetson Nano b01 for testing, as shown in Figure 10.

4.4. Comparative Experiments

4.4.1. Comparison of Different Loss Functions

To assess the efficacy of PIoU2, it was compared against WIoU [29] and EIoU [30], along with Wise-PIoU2 and Wise-Inner-PIoU2 [31], which integrate concepts from Wise-IoU and Inner-IoU, respectively. The results are given in Table 5. After applying PIoU2, mAP50 and mAP50–95 improved by 2.1% and 1.5%, respectively, over CIoU.

4.4.2. Comparison of Different Pruning Methods

To determine the effectiveness of the slimming pruning method, it is compared with the lamp [32] and Group-norm [33] methods at a 50 percent pruning rate. As shown in Table 6, at the same pruned model size, the slimming method retains the highest accuracy, whereas the accuracy obtained with the other two methods is lower.

4.4.3. Comparison of Whether Global Pruning Is Turned on or Not

Whether to prune key network layers is a significant question in the pruning process. Global pruning, which does not differentiate the importance of network layers, has the advantage of significantly reducing the complexity of the network model; however, this may trade accuracy for the reduced complexity. Without global pruning, certain crucial layers are bypassed during pruning, which offers better accuracy at the cost of a larger network model, as illustrated in Table 7.

4.4.4. Comparison of Different Pruning Rates

As illustrated in Table 8, the impact of disparate pruning rates under a unified pruning methodology exhibits variability. The lightweight model was subjected to pruning at varying rates, after which the pruned model was fine-tuned to restore accuracy. It is found that with the increase in pruning rate, the number of parameters and the computational complexity of the model decrease. However, this does not necessarily result in a reduction in the model’s expressive capacity. In contrast, the accuracy of the 50% pruned model is enhanced by 0.4% in comparison to the baseline model. This indicates that the convolutional kernels within the same layer of the model may have acquired numerous analogous features, thereby resulting in parameter and computational redundancy.
Furthermore, the number of channels in each layer of the network before and after pruning is analyzed. As illustrated in Figure 11, the output channels at layers 7, 8, 9, and 21 are the most heavily pruned. This suggests that the redundancy within the shallow network is relatively minimal, with parameter redundancy emerging as the network deepens. It also indicates that removing the identified redundant parameters does not significantly impact the overall performance of the network.

4.4.5. Comparison of Different Models

To assess the algorithm's superiority, this research directly benchmarks SS-YOLOv8n against counterparts from the YOLO lineage: YOLOv5n, YOLOv5s, YOLOv8n, and YOLOv8s. As demonstrated in Table 9, in terms of lightweighting, SS-YOLOv8n has the lowest parameter count, computation, and model memory, occupying only about 12% and 10% of the memory of YOLOv5s and YOLOv8s, respectively. In terms of accuracy, SS-YOLOv8n shows the highest detection accuracy and the lowest number of parameters compared with the similarly scaled YOLOv8-ghost [34].
Considering processing speed, the FPS of SS-YOLOv8n reaches 128.9, corresponding to a detection time of roughly 7.8 ms per image. This is an improvement of about 16 frames/s over the original model and about 10 frames/s over YOLOv8-ghost. This evidence supports the assertion that the pruning method proposed in this paper is effective in achieving model compression and inference acceleration.

4.5. Visual Analytics

To analyze the enhancement of the overall improved model compared with the pre-improvement model, Grad-CAM [35] was employed to generate heat maps for the feature layer with network layer index 10 in the models before and after improvement. The heat map represents the model's confidence in the presence of an object: the value at each location represents the probability of detecting an object there. The model's focus on a particular region can be visually assessed by observing changes in the brightness of that region, as shown in Figure 12.
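As a reference for how such maps are produced, the following is a bare-bones Grad-CAM [35] sketch using forward/backward hooks; the use of the maximum raw score as the backpropagated objective is a simplifying assumption, and a full implementation would select a specific class or box score from YOLOv8's structured outputs.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Bare-bones Grad-CAM [35]: weight the target layer's activations by the
    spatially averaged gradients of the score, then ReLU and normalize."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    # Assumed scalar objective: the maximum raw output score. A detection
    # model returns structured predictions, so a real implementation would
    # pick a specific class/box score here.
    score = model(image).max()
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)           # GAP over H, W
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))  # weighted channel sum
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)     # normalize to [0, 1]
```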
The improved model produces a more concentrated region of brightness. For example, in the figure, SS-YOLOv8n not only accurately distinguishes the foreground from the background but also pays more attention to the upper half of the target. In contrast, the original model correctly detects the target but erroneously identifies part of the background region as the target's feature information. YOLOv8-mobilenetv4 does not recognize the target and the background accurately, and some edges are not recognized. YOLOv8-ghost mainly attends to the central location of the target, which is unfavorable for garbage floating on the water surface with reflections. YOLOv8-mobilenetv4 is a network that replaces the backbone of YOLOv8n with the MobileNetv4 backbone, which is too thin, so its target localization is much worse than that of the other three models.
Comparing the heat maps of the same images across the four models, SS-YOLOv8n performs best at identifying target edges, which improves the detection ability of the model.

5. Conclusions

In this paper, we improve YOLOv8 for deployment at the edge, where the original model is too large, processing is too slow, and real-time performance is inadequate. To this end, this paper proposes a fast and lightweight SS-YOLOv8 algorithm. First, the WS-C2f module with Sobel_Conv replaces the C2f module to enhance the feature extraction capability of the backbone. Secondly, to keep the model lightweight, a detection head, SCDH, built from shared convolutions is proposed. Thirdly, the original CIoU loss function is replaced by PIoU2, which improves the localization accuracy of small garbage on the water surface while accelerating the convergence of the network. Fourth, a slimming pruning method is employed to remove a significant number of redundant parameters, thereby reducing the model size. The experiments demonstrate that SS-YOLOv8 exhibits notable improvements in model size, computational efficiency, and inference speed. The model's mAP50 and mAP50–95 are 79.9% and 51%, respectively, while its memory occupation is reduced by 88.3%. Additionally, the detection speed is increased by approximately 16 frames/s, which considerably simplifies its operation on edge devices with minimal loss of accuracy. Subsequent work will concentrate on the deployment of the model on edge devices, with the objective of enhancing its detection capabilities in complex water areas and special scenarios.
Surface litter appears against complex backgrounds and is susceptible to weather. Semi-submerged trash and refraction at the water surface can also hinder computer vision; this is a limitation of the model. In the future, high-resolution drones, satellite imagery, and underwater sensors will be capable of monitoring and identifying the type and location of surface debris in real time. The model presented in this paper, combined with automated garbage sorting and tracking, can facilitate improvements in the accuracy and efficiency of cleanup work. The results of this research have the potential to enhance the quality of life of citizens by improving the living environment.

Author Contributions

Conceptualization, Z.Q. (Zheng Qin) and Z.F.; methodology, Z.F.; formal analysis, Z.Q. (Zheng Qin) and Z.F.; investigation, Z.Q. (Zheng Qin), Z.Q. (Zeguo Qiu), and W.L.; resources, Z.Q. (Zheng Qin) and Z.F.; data curation, Z.Q. (Zheng Qin); writing—original draft preparation, Z.F.; writing—review and editing, Z.Q. (Zheng Qin) and Z.F.; visualization, Z.F.; supervision, Z.Q. (Zeguo Qiu) and Z.F.; project administration, M.C.; funding acquisition, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Heilongjiang Postdoctoral Fund to pursue scientific research (grant number LBH-Z23025), Heilongjiang Province Colleges and Universities Basic Scientific Research Business Expenses Project (grant number 2023-KYYWF-1052), Harbin University of Commerce Industrialization Project (grant number 22CZ04), and Collaborative Innovation Achievement Program of Double First-class Disciplines in Heilongjiang Province (grant number LJGXCG2022-085).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The surface litter dataset for this paper can be accessed with the URL ‘https://pan.baidu.com/s/1rkhxL9LsOrdiWwotMfFXyw?pwd=u0b0’ (accessed on 8 October 2024). Additional original contributions presented in this study are included in this article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, D.; Yuan, Z.; Wu, X.; Wang, Y.; Liu, X. Real-Time Monitoring Method for Traffic Surveillance Scenarios Based on Enhanced YOLOv7. Appl. Sci. 2024, 14, 7383. [Google Scholar] [CrossRef]
  2. Karim, J. Enhancing Agriculture through Real-Time Grape Leaf Disease Classification via an Edge Device with a Lightweight CNN Architecture and Grad-CAM. Sci. Rep. 2024, 14, 16022. [Google Scholar] [CrossRef]
  3. Lee, Y.J.; Hwang, J.Y.; Park, J.; Jung, H.G.; Suhr, J.K. Deep Neural Network-Based Flood Monitoring System Fusing RGB and LWIR Cameras for Embedded IoT Edge Devices. Remote Sens. 2024, 16, 2358. [Google Scholar] [CrossRef]
  4. Su, K.; Tomioka, Y.; Zhao, Q. YOLIC: An efficient method for object localization and classification on edge devices. Image Vis. Comput. 2024, 147, 105095. [Google Scholar] [CrossRef]
  5. Vinoth, K.; Sasikumar, P. Lightweight object detection in low light: Pixel-wise depth refinement and TensorRT optimization. Results Eng. 2024, 23, 102510. [Google Scholar] [CrossRef]
  6. Karim, J.M.; Nahiduzzaman, M.; Ahsan, M. Development of an early detection and automatic targeting system for cotton weeds using an improved lightweight YOLOv8 architecture on an edge device. Knowl.-Based Syst. 2024, 300, 112204. [Google Scholar] [CrossRef]
  7. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  8. Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Lu, P.; Yang, K. A review of vehicle detection techniques for intelligent vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3811–3831. [Google Scholar] [CrossRef]
  9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  10. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  11. Glenn, J.; Ayush, C.; Jing, Q. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 2 July 2024).
  12. Henar, M.; Angela, C.; James, R.; Wansu, L.; Martin, M. Edge TMS: Optimized Real-Time Temperature Monitoring Systems Deployed on Edge AI Devices. IEEE Internet Things J. 2024, 1, 2490–2506. [Google Scholar]
  13. Song, G.; Chen, W.; Zhou, Q.; Guo, C. Underwater Robot Target Detection Algorithm Based on YOLOv8. Electronics 2024, 13, 3374. [Google Scholar] [CrossRef]
  14. Chen, X.; Yuan, M.; Yang, Q.; Yao, H.; Wang, H. Underwater-ycc: Underwater target detection optimization algorithm based on YOLOv7. J. Mar. Sci. Eng. 2023, 11, 995. [Google Scholar] [CrossRef]
  15. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Xia, Z.; Zhou, H.; Hu, H.; Zhang, G.; Hu, J.; He, T. YOLO-MTG: A lightweight YOLO model for multi-target garbage detection. Signal Image Video Process. 2024, 18, 5121–5136. [Google Scholar] [CrossRef]
  18. Jiang, Z.; Wu, B.; Ma, L.; Zhang, H.; Lian, J. APM-YOLOv7 for Small-Target Water-Floating Garbage Detection Based on Multi-Scale Feature Adaptive Weighted Fusion. Sensors 2024, 24, 50. [Google Scholar] [CrossRef] [PubMed]
  19. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D. ultralytics/yolov5: v7. 0-yolov5 Sota Realtime Instance Segmentation. Zenodo 2022. [Google Scholar]
  20. Lu, Y.; Lu, G.; Li, J. Fully shared convolutional neural networks. Neural Comput. Appl. 2021, 33, 8635–8648. [Google Scholar] [CrossRef]
  21. Can, L.; Kaige, W.; Qing, L.; Kun, Z. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. J. Neural Netw. 2024, 170, 276–284. [Google Scholar]
  22. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  23. Xiao, H.; Xiao, S. Image Sobel edge extraction algorithm accelerated by OpenCL. J. Supercomput. 2022, 78, 16235–16265. [Google Scholar] [CrossRef]
  24. Guo, B.; Cao, N.; Zhang, R.; Yang, P. GETNet: Group Normalization Shuffle and Enhanced Channel Self-Attention Network Based on VT-UNet for Brain Tumor Segmentation. Diagnostics 2024, 14, 1257. [Google Scholar] [CrossRef]
  25. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proc. Int. Conf. Mach. Learn. 2015, 37, 448–456. [Google Scholar]
  26. Wang, H.; Xie, Q.; Zeng, D.; Ma, J.; Meng, D.; Zheng, Y. OSCNet: Orientation-Shared Convolutional Network for CT Metal Artifact Learning. IEEE Trans. Med. Imaging 2024, 43, 489–502. [Google Scholar] [CrossRef] [PubMed]
  27. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  28. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. arXiv 2017, arXiv:1708.06519. [Google Scholar]
  29. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  30. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  31. Zhang, H.; Xu, C.; Zang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.0287. [Google Scholar]
  32. Lee, J.; Park, S.; Mo, S.; Ahn, S.; Shin, J. Layer-adaptive sparsity for the magnitude-based pruning. arXiv 2020, arXiv:2010.07611. [Google Scholar]
  33. Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. DepGraph: Towards Any Structural Pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar]
  34. Li, Y.; Yin, C.; Lei, Y.; Zhang, J.; Yan, Y. RDD-YOLO: Road Damage Detection Algorithm Based on Improved You Only Look Once Version 8. Appl. Sci. 2024, 14, 3360. [Google Scholar] [CrossRef]
  35. Selvaraju, R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2019, 128, 336–359. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 architecture diagram.
Figure 2. Schematic diagram of horizontal gradient extraction.
Figure 3. Schematic diagram of vertical gradient extraction.
Figure 4. WS-C2f modular structure.
Figure 5. Head structure.
Figure 6. SCDH-Detect structure.
Figure 7. PIoU loss function diagram.
Figure 8. Schematic diagram of pruning.
Figure 9. Selected samples of the dataset.
Figure 10. (a) Experiment A; (b) Experiment B; (c) Experiment C.
Figure 11. Comparison of the number of channels before and after pruning.
Figure 12. (a) SS-YOLOv8n heat map. (b) YOLOv8n heat map. (c) YOLOv8-ghost heat map. (d) YOLOv8-mobilenetv4 heat map.
Table 1. Experimental environment configuration.
Name | Environmental Parameters
Operating system | Windows 11 Chinese 64-bit
Graphics processor | NVIDIA T100
Display memory | 16 GB
Python | 3.8
Framework | PyTorch 1.11
Table 2. Configuration of training parameters.
Parameter Name | Parameter Value
Mosaic | 1.0
Weight decay | 0.0005
Batch size | 8
Epochs | 150
Momentum | 0.937
Learning rate | 0.01
Table 3. The key parameters of the Jetson Nano b01.
Name | Parameter Value
GPU | 128-core Maxwell
CPU | 4-core ARM Cortex-A57 @ 1.43 GHz
Memory | 4 GB 64-bit LPDDR4
Connectivity | Gigabit Ethernet
Size | 100 mm × 80 mm × 29 mm
Table 4. Ablation experiments.
Module / Metric | Baseline | A | B | C | D
WS-C2f | – | ✓ | ✓ | ✓ | ✓
SCDH | – | – | ✓ | ✓ | ✓
PIoU2 | – | – | – | ✓ | ✓
Slimming | – | – | – | – | ✓
mAP50/% | 0.778 | 0.781 | 0.793 | 0.801 | 0.799
mAP50–95/% | 0.504 | 0.505 | 0.494 | 0.511 | 0.51
Params | 3,007,208 | 2,852,312 | 2,208,075 | 2,208,075 | 739,851
GFLOPs/G | 8.1 | 7.7 | 6.2 | 6.2 | 3.1
Size/MB | 6.0 | 5.7 | 5.1 | 4.8 | 2.3
Latency/ms | 0.90 | 1.07 | 0.95 | 0.94 | 0.77
FPS | 112.5 | 94.3 | 105.5 | 105.9 | 128.9
Table 5. Comparison of results of different loss functions.
Loss Function | P | R | mAP50 | mAP50–95
CIoU | 0.797 | 0.758 | 0.78 | 0.496
EIoU | 0.759 | 0.732 | 0.777 | 0.504
WIoU | 0.789 | 0.72 | 0.791 | 0.498
PIoU | 0.757 | 0.733 | 0.795 | 0.487
PIoU2 | 0.827 | 0.683 | 0.801 | 0.511
Wise-PIoU2 | 0.795 | 0.758 | 0.788 | 0.499
Wise-Inner-PIoU2 | 0.734 | 0.771 | 0.777 | 0.494
Table 6. Comparison of results of different pruning methods.
Pruning Method | Size/MB | mAP50 | mAP50–95 | Flops/G | Parameters
Base | 6.0 | 0.778 | 0.504 | 8.1 | 3,007,208
Lamp | 2.8 | 0.787 | 0.501 | 3.5 | 1,066,213
Slimming | 2.8 | 0.805 | 0.513 | 3.5 | 1,066,213
Group-norm | 2.8 | 0.787 | 0.501 | 3.5 | 1,066,213
Table 7. Comparison of results with and without global pruning turned on.
Pruning Method | Global-Prune | Size/MB | mAP50 | mAP50–95 | Flops/G | Parameters
Base | – | 4.8 | 0.801 | 0.511 | 6.2 | 2,208,075
Slimming | False | 2.8 | 0.805 | 0.513 | 3.5 | 1,066,213
Slimming | True | 1.7 | 0.767 | 0.488 | 3.6 | 528,281
Table 8. Comparison of results for different pruning rates.
Pruning Rate | Global-Prune | Size/MB | mAP50 | mAP50–95 | Flops/G | Parameters
Base | – | 4.8 | 0.801 | 0.511 | 6.2 | 2,208,075
50% | False | 2.8 | 0.805 | 0.513 | 3.5 | 1,066,213
60% | False | 2.3 | 0.799 | 0.51 | 3.1 | 879,473
67% | False | 2.1 | 0.782 | 0.485 | 2.8 | 739,851
75% | False | 1.7 | 0.746 | 0.461 | 2.4 | 538,124
Table 9. Comparative tests of different models.
Model | Size/MB | mAP50 | mAP50–95 | Flops/G | Parameters | Latency/ms | FPS
YOLOv5n | 5.0 | 0.764 | 0.481 | 7.1 | 2,504,504 | 0.86 | 115.2
YOLOv5s | 18.5 | 0.776 | 0.496 | 23.8 | 911,463 | 1.9 | 52.0
YOLOv8n | 6.0 | 0.778 | 0.504 | 8.1 | 3,007,208 | 0.90 | 112.5
YOLOv8s | 22 | 0.779 | 0.513 | 28.5 | 11,128,680 | 2.18 | 45.9
YOLOv8-mobilenetv4 | 11.2 | 0.734 | 0.468 | 22.5 | 5,701,448 | 1.2 | 81.2
YOLOv8-ghost | 3.6 | 0.76 | 0.493 | 5.0 | 1,715,636 | 2.1 | 118.4
SS-YOLOv8n (ours) | 2.3 | 0.799 | 0.51 | 3.1 | 879,473 | 0.77 | 128.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
