1. Introduction
Various floating objects frequently appear in water bodies such as rivers, lakes, and reservoirs. Their substantial accumulation can adversely affect water quality and the safe operation of hydraulic engineering [1]. Traditional river drift detection relies mainly on manual inspection, which is labor-intensive and inefficient [2]. Detecting floating objects in monitoring videos, so that river images are identified and processed automatically, significantly improves detection efficiency and saves human resources. Compared with manual inspection, detection based on artificial intelligence is less influenced by subjective factors and provides more accurate results. With the development and application of machine learning technologies, the efficiency and accuracy of floating object detection have gradually improved [1].
In the early stages of floating object detection, methods such as image segmentation [3] and background subtraction [4] were commonly employed. However, these approaches often struggle with the completeness and accuracy of target extraction, especially for fast-moving floating objects and scenes with large lighting variations. The resulting missed and false detections make it difficult to meet current accuracy requirements for floating object detection.
In recent years, deep learning has become the dominant approach to floating object detection and has made significant strides in the field. Deep learning object detection algorithms have developed through several important stages. Two-stage algorithms first generate candidate boxes and then classify and regress each of them. The R-CNN [5] series represents this stage: it initially uses traditional methods such as selective search to generate candidate boxes, then employs a CNN for feature extraction, and finally uses classifiers such as SVM [6] for classification. To further improve detection speed, researchers proposed single-stage algorithms such as YOLO (You Only Look Once) [7] and SSD (Single-Shot MultiBox Detector) [8]. These algorithms do not generate candidate boxes; instead, they sample the image densely and cast object detection as a regression problem.
Lin et al. [9] achieved river floating object detection from unmanned aerial vehicles by combining temporal difference detection and background subtraction; however, variations in water surface characteristics made this approach susceptible to detection errors. Tharani et al. [10] introduced a new category for garbage detection, provided a manually collected and annotated dataset of garbage images, proposed a new attention layer focusing on smaller objects, and built a deep learning network on the Darknet framework. Lin et al. [11] proposed an enhanced YOLOv5s (FMA-YOLOv5s) algorithm that appends a Feature Map Attention (FMA) layer at the end of the backbone network to strengthen feature extraction; since the FMA layer does not alter the size of the input feature map, it can be added easily and flexibly to any network structure. Kong et al. [12] employed a detection algorithm based on the YOLOv3 network, improving accuracy and speed by training on a self-constructed dataset and ultimately deploying it on robots. Yi et al. [13] proposed a detection and localization algorithm to address the low accuracy of unmanned boats in detecting and locating floating objects. Chen et al. [14] improved the detection of small floating objects in the YOLOv5 architecture by adding a small-target detection head in the shallow layers; this integration of spatial and semantic feature information preserves the crucial features of small floating objects and thereby improves their detection. They also replaced the CIoU loss with SIoU, which considers the orientation of the ground-truth and predicted boxes and enhances the detection of small floating objects on the water surface.
Regarding lightweight variants of YOLOv5, Liu et al. [15] introduced the C3Ghost and GhostConv modules into the YOLOv5 backbone, combined the DWConv module with the C3Ghost module in the neck, and tested the result on the PASCAL VOC dataset; the algorithm reduced FLOPs by 54% and the number of model parameters by 52.53% while maintaining the same mAP. Arifando et al. [16] integrated the GhostConv and C3Ghost modules into YOLOv5 to reduce the parameter count and floating-point operations (FLOPs), and replaced the SPPF module in the YOLOv5 backbone with the SimSPPF module to improve computational efficiency and detection accuracy; the improved algorithm reduced FLOPs by 48% and the parameter count by 59.4%. Chen et al. [17] integrated the GhostConv module into YOLOv5s, added the CBAM attention module to the backbone, and replaced the network's upsampling module with the lightweight upsampling operator CARAFE; the improved network reduced the parameter count from 7.1 M to 3.9 M and increased the mAP by 1.5%.
The current mainstream detection algorithms include YOLOv5 [18,19], YOLOv6 [20], and YOLOv7 [21], among others. YOLOv5, developed by Ultralytics in 2020, has since evolved to version 7.0. Although YOLOv6 and subsequent models introduced improvements to the network structure and training strategies, YOLOv5 has been extensively validated in practical applications and demonstrates stable, reliable performance. This study therefore chose to enhance YOLOv5 for the detection of floating object targets.
Although YOLOv5 performs well on large public datasets such as MS COCO [22] and PASCAL VOC, it still has the following shortcomings in practical detection tasks such as floating object detection on video surveillance platforms: (1) The fully convolutional structure of YOLOv5 consists of standard convolutions, upsampling, residual units, and other basic modules. As the network deepens, the growth in parameters and computational cost imposes high hardware requirements on the platforms that deploy and run the algorithm, so the model complexity must be reduced further to suit more lightweight hardware. (2) Floating objects exhibit significant scale variations and include many small targets; YOLOv5's multi-scale prediction lacks robustness in feature extraction, leading in particular to suboptimal recognition of small-scale floating objects. This paper addresses these shortcomings in the YOLOv5 network structure and proposes a lightweight floating object detection algorithm based on YOLOv5.
2. Lightweight Approaches to the YOLOv5 Algorithm
2.1. YOLOv5 Algorithm
The YOLOv5 algorithm incorporates the concept of grids, dividing the image into multiple cells. Each grid cell is tasked with predicting one or more objects by generating prediction boxes. A cell can generate prediction boxes because it holds several template boxes, referred to as "anchors", each characterized by a predefined width, height, coordinates, and confidence value. During training, if the center of a manually annotated box falls within a particular cell, the corresponding anchor "grows" or "shrinks" toward the annotated box and its confidence is set to 1 (indicating the presence of an object); anchors in cells containing no annotated box are assigned a confidence of 0. Object detection is thus reduced to a combination of regression and classification: the differences in width, height, and coordinates between anchors and ground-truth boxes are treated as the regression loss, and binary cross-entropy serves as the confidence loss.
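As a rough illustration of this matching scheme, the following sketch (our own simplified code, not the YOLOv5 source; the function name, anchor values, and matching heuristic are chosen for this example) assigns a confidence target of 1 to the best-matching anchor in the cell containing a box center and scores all anchors with binary cross-entropy:

```python
# Illustrative sketch of anchor-based objectness targets (not YOLOv5 code).
import torch
import torch.nn.functional as F

grid_size = 13                                    # image divided into 13 x 13 cells
anchors = torch.tensor([[10., 13.], [33., 23.]])  # template (w, h) per cell

def assign_objectness(gt_box, pred_obj):
    """gt_box: (cx, cy, w, h) in grid units; pred_obj: (num_anchors, S, S) logits."""
    tgt = torch.zeros_like(pred_obj)
    gx, gy = int(gt_box[0]), int(gt_box[1])       # cell containing the box center
    # pick the anchor whose shape best matches the annotated box
    ratios = torch.min(anchors / gt_box[2:], gt_box[2:] / anchors).min(1).values
    best = ratios.argmax()
    tgt[best, gy, gx] = 1.0                       # confidence 1: object present
    # all other anchors keep confidence 0; BCE is the confidence loss
    return F.binary_cross_entropy_with_logits(pred_obj, tgt)

loss = assign_objectness(torch.tensor([6.4, 6.2, 30., 20.]),
                         torch.randn(2, grid_size, grid_size))
```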
YOLOv5 incorporates PANet (Path Aggregation Network) [23] in its network structure to construct a feature pyramid [24], as shown in Figure 1. The feature pyramid aids in detecting objects at different scales, thereby enhancing the model's robustness to objects of various sizes. The backbone adopts the CSPDarkNet53 + Focus structure, which extracts features effectively but complicates the architecture, increasing the parameter count and hindering efforts to make the model lightweight. The feature extraction network adopts the SPP + PAN structure, which further strengthens feature extraction and improves recognition accuracy across object types, but also raises complexity and computational load, slowing inference. YOLOv5 uses the SiLU activation function, a smooth, non-monotonic, non-linear activation; however, it may lack sufficient discriminative power in certain situations, limiting its ability to improve model performance.
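To make the pyramid idea concrete, here is a toy sketch of the top-down and bottom-up fusion passes (our own simplified code, not the YOLOv5 neck, which uses concatenation and convolution blocks rather than addition and pooling):

```python
# Toy PAN-style feature fusion across three scales (illustration only).
import torch
import torch.nn.functional as F

c3 = torch.randn(1, 256, 80, 80)   # shallow, high resolution
c4 = torch.randn(1, 256, 40, 40)
c5 = torch.randn(1, 256, 20, 20)   # deep, low resolution

# top-down pass (FPN): upsample deeper maps and fuse semantics downward
p4 = c4 + F.interpolate(c5, scale_factor=2, mode="nearest")
p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")

# bottom-up pass (PAN): downsample shallow maps and feed detail back upward
n4 = p4 + F.max_pool2d(p3, 2)
n5 = c5 + F.max_pool2d(n4, 2)
print(p3.shape, n4.shape, n5.shape)  # three scales for the detection heads
```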
This paper improves the structure of YOLOv5 in three respects, constructing the FRL-YOLO model shown in Figure 2. First, it extracts the FasterNet block from FasterNet [25] and integrates it into the C3 module of the YOLOv5 feature extraction network, forming C3-Faster. At the same time, it applies RepConv, proposed in the RepVGG [26] network, to the YOLOv5 feature fusion network in place of the Conv module, reducing the parameter count while enhancing feature fusion. Second, it adopts the ACON (Activate or Not) activation function in place of the SiLU activation originally used in the C3 module; ACON can adaptively choose whether to activate neurons, further improving the model's precision and robustness. Finally, it applies the LAMP algorithm to perform structured pruning, significantly optimizing the accuracy and size of the improved model.
2.2. C3-Faster
FLOPS (floating-point operations per second) denotes the number of floating-point operations a system can perform in one second, whereas FLOPs denotes the total number of floating-point operations actually performed by a system or algorithm. Reducing FLOPs alleviates the computational burden of neural networks, shortening both forward and backward propagation and contributing significantly to lower latency. A lower FLOP count can also improve the parallelism of neural networks on hardware, raising throughput and accelerating the processing of input data streams. To obtain lightweight networks that meet the demands of low latency and high throughput, previous research has therefore predominantly measured reductions in FLOPs (SqueezeNet [27], GhostNet [28], MobileNets [29]). However, neural network latency relates FLOPs and FLOPS as follows (Equation (1)):

$$\mathrm{Latency} = \frac{\mathrm{FLOPs}}{\mathrm{FLOPS}} \tag{1}$$
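For concreteness, a toy numeric reading of Equation (1) follows; the FLOP and FLOPS figures below are invented for illustration, not measurements:

```python
# Equation (1) numerically: latency = FLOPs / FLOPS (made-up example values).
flops = 16.5e9           # operations one forward pass needs (FLOPs)
flops_per_sec = 1.1e12   # operations the hardware delivers per second (FLOPS)
latency_s = flops / flops_per_sec
print(f"latency = {latency_s * 1e3:.2f} ms per image")  # 15.00 ms
```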
However, reducing FLOPs alone can increase memory access, accompanied by additional fragmented computations (concatenation, shuffling, pooling, normalization, and activation). Previous work focused solely on FLOPs and paid limited attention to FLOPS; even where the issue was recognized, no viable alternatives were considered to exist. Research indicates that the main cause of low FLOPS is frequent memory access [25]. This paper therefore introduces a novel CSP architecture, C3-Faster, to address the problem.
The traditional C3 module is the main module for learning residual features, as shown in Figure 3. Its structure consists of two branches: one stacks a specified number of Bottleneck blocks and three standard convolutional layers, while the other passes through a basic convolutional module; the outputs of the two branches are finally concatenated. This paper replaces the Bottleneck in the C3 module of the YOLOv5 backbone with PConv, with residual operations on the branches, forming the C3-Faster module shown in Figure 4. PConv applies the standard Conv module to only some of the input channels for spatial feature extraction, processing them selectively while leaving the remaining channel information untouched. The FLOPs of PConv alone are (Equation (2)):

$$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^2 \times c_p^2 \tag{2}$$
Here, $h$ and $w$ denote the height and width of the feature map, $k$ the convolution kernel size, and $c_p$ the number of channels involved in the regular convolution within PConv, typically taken as $1/4$ of the total channel count $c$. Consequently, the FLOPs of PConv are only $1/16$ of those of a standard convolution, and memory access is reduced accordingly. This substitution lowers the frequency of neural network memory access, allowing a higher FLOPS to be maintained with fewer FLOPs and thereby reducing network latency. We believe that, through C3-Faster, FRL-YOLO gains a larger receptive field and extracts spatial features more effectively, as compared in Figure 5.
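The following is a minimal PConv sketch in PyTorch reflecting the description above (our own simplified code, not the FasterNet release; the class name and the ratio argument are our choices): it convolves only the first $c_p = c/4$ channels and passes the rest through unchanged, so its FLOPs are $h\,w\,k^2 c_p^2$, i.e. $1/16$ of a full convolution.

```python
# Minimal PConv sketch: regular 3x3 conv on c/4 channels, identity on the rest.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, c, ratio=4):
        super().__init__()
        self.c_p = c // ratio                     # channels that get convolved
        self.conv = nn.Conv2d(self.c_p, self.c_p, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)  # x2 passes through untouched

x = torch.randn(1, 64, 32, 32)
print(PConv(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```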
2.3. RepConv
RepVGG is a technique for model re-parameterization that transforms a complex convolutional neural network into a network composed of simple convolutional and fully connected layers. This re-parameterization significantly reduces the computational workload and parameter count, improving inference speed and easing lightweight deployment. The core idea is to replace the original convolution with a module consisting of convolutional branches combined by element-wise addition, thereby re-parameterizing the network.

Integrating RepVGG into the feature fusion network involves constructing the training-time Feature Replication Layer (FRL) from an identity branch, a 1 × 1 convolution, and a 3 × 3 convolution. During inference, structural re-parameterization merges the identity branch, the 1 × 1 convolution, and the 3 × 3 convolution into a single 3 × 3 RepConv block, as shown in Figure 6. The multi-branch topology learns diverse feature information during training, while the simplified single-branch architecture saves memory during inference and enables rapid inference. After multi-branch training, one tensor is concatenated channel-wise with another, and a channel shuffle operator is employed to enhance information fusion between the two tensors, achieving deep integration of features from different input channels at low computational complexity.
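To make the inference-time merge concrete, the following sketch (simplified: BatchNorm folding is omitted, and all names are ours) fuses a 3 × 3 branch, a 1 × 1 branch, and an identity branch into a single 3 × 3 kernel and verifies that the fused convolution reproduces the multi-branch output:

```python
# Simplified RepVGG-style re-parameterization (BatchNorm omitted for brevity).
import torch
import torch.nn.functional as F

c = 8                                   # channels (identity needs c_in == c_out)
k3 = torch.randn(c, c, 3, 3)            # 3x3 branch weights
k1 = torch.randn(c, c, 1, 1)            # 1x1 branch weights

# 1x1 kernel -> 3x3 kernel by zero-padding around the center tap
k1_as_3 = F.pad(k1, [1, 1, 1, 1])
# identity -> 3x3 kernel: a 1 at the center of each channel's own filter
kid = torch.zeros(c, c, 3, 3)
for i in range(c):
    kid[i, i, 1, 1] = 1.0

k_fused = k3 + k1_as_3 + kid            # one kernel replaces three branches

x = torch.randn(1, c, 16, 16)
y_multi = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1) + x
y_fused = F.conv2d(x, k_fused, padding=1)
print(torch.allclose(y_multi, y_fused, atol=1e-4))   # True
```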
2.4. Activation Function Selection
ACON-C is a variant of the ACON (Activate or Not) activation function. By introducing learnable parameters, ACON-C effectively models the non-linear relationships within the data, enhancing the expressive power of the model. The function can also adaptively adjust its shape and characteristics based on the learned parameters, allowing the model to better accommodate diverse input data distributions and improving generalization. The ACON-C activation function can be expressed as (Equation (3)):

$$f_{\mathrm{ACON\text{-}C}}(x) = (p_1 - p_2)\,x \cdot \sigma\!\big(\beta (p_1 - p_2)\,x\big) + p_2\,x \tag{3}$$

Here, $\sigma$ denotes the sigmoid function, and $S_\beta$ denotes the smooth maximum, a smoothed and differentiable variant of the max function, with $\beta$ acting as a smoothing factor. As $\beta$ approaches infinity, the smooth maximum converges to the standard max function; when $\beta$ is 0, it reduces to an arithmetic mean. The $S_\beta$ formula is shown in Equation (4):

$$S_\beta(x_1, \ldots, x_n) = \frac{\sum_{i=1}^{n} x_i\, e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}} \tag{4}$$

In the ACON-C formula, the adaptive adjustment of the upper and lower bounds is achieved through the two learnable parameters $p_1$ and $p_2$. Taking the first derivative yields Equation (5):

$$\frac{\mathrm{d}}{\mathrm{d}x} f_{\mathrm{ACON\text{-}C}}(x) = (p_1 - p_2)\,\sigma\big(\beta(p_1 - p_2)x\big)\Big[1 + \beta (p_1 - p_2)\,x\,\big(1 - \sigma(\beta(p_1 - p_2)x)\big)\Big] + p_2 \tag{5}$$

As $x$ approaches positive infinity, the gradient of the function approaches $p_1$; as $x$ approaches negative infinity, the gradient approaches $p_2$. Taking the second derivative of the function yields Equation (6).
Setting the second derivative to zero and solving for the critical points yields the upper and lower bounds of the first derivative, as expressed in Equations (7) and (8), respectively:

$$\max\!\left(\frac{\mathrm{d}}{\mathrm{d}x} f_{\mathrm{ACON\text{-}C}}(x)\right) \approx 1.0998\,p_1 - 0.0998\,p_2 \tag{7}$$

$$\min\!\left(\frac{\mathrm{d}}{\mathrm{d}x} f_{\mathrm{ACON\text{-}C}}(x)\right) \approx 1.0998\,p_2 - 0.0998\,p_1 \tag{8}$$

The upper and lower bounds of the first derivative of ACON-C are thus jointly determined by the two learnable parameters $p_1$ and $p_2$. This design gives the model heightened flexibility during training, enabling it to adapt dynamically to varying data distributions. By learning and adjusting these two parameters, the model can finely tune the bounds of the first derivative, enhancing its capability to capture non-linear relationships within the data. This is crucial for the model's expressiveness and adaptability, and hence for improving overall performance.
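As a concrete reference, here is a compact PyTorch sketch of Equation (3) (our own illustration; the per-channel parameter shapes and the initialization of $\beta$ to 1 are our choices):

```python
# ACON-C sketch: learnable per-channel p1, p2 and smoothing factor beta.
import torch
import torch.nn as nn

class AconC(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, c, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, c, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, c, 1, 1))

    def forward(self, x):
        # (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x, Equation (3)
        d = (self.p1 - self.p2) * x
        return d * torch.sigmoid(self.beta * d) + self.p2 * x

print(AconC(16)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```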
2.5. Model Compression
Currently popular model compression methods include knowledge distillation [30], parameter quantization [31], and model pruning [32]. This research employs model pruning, trimming neurons and pruning the network structure of deep models. Pruning is generally categorized as structured or unstructured [33]; this experiment adopts structured pruning. Compared with unstructured pruning, structured pruning operates at higher levels, such as layers or filters, rather than on individual weights. Although this constitutes a coarse-grained approach, it proves to be the more practical strategy: a notable advantage is its direct compatibility with existing deep learning frameworks and hardware accelerators, eliminating the need for special software or hardware libraries.
This paper uses the Layer-Adaptive Magnitude-Based Pruning (LAMP) algorithm for structured pruning [34]. The core of the algorithm is to compute a LAMP score for each connection and prune the connections with the minimum scores. The algorithm achieves higher sparsity while preserving model performance, reducing the computational and memory demands of the model. With the weights of a layer sorted in ascending order of magnitude, the LAMP score for the u-th index of the weight tensor W is defined as:

$$\mathrm{score}(u; W) = \frac{\big(W[u]\big)^2}{\sum_{v \ge u} \big(W[v]\big)^2}$$
The LAMP score quantifies the relative importance of the target connection among all surviving connections within the same layer. Once the LAMP scores have been computed, the weights are sorted in ascending order based on a given index map. Subsequently, connections with the minimum LAMP scores are globally pruned until the desired level of overall sparsity constraint is achieved. Notably, the LAMP score does not involve any hyperparameters that require tuning and only necessitates basic tensor operations. This ensures that LAMP scores can be computed with minimal computational overhead.
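As an illustration of this definition, the following sketch (a hypothetical helper, not the authors' code) computes LAMP scores for one layer's weight tensor using a suffix sum over the ascending-sorted squared magnitudes:

```python
# LAMP scores for one layer: w[u]^2 / sum of w[v]^2 over surviving v >= u.
import torch

def lamp_scores(w: torch.Tensor) -> torch.Tensor:
    flat = w.flatten() ** 2
    order = torch.argsort(flat)                  # ascending magnitude
    sorted_sq = flat[order]
    # suffix sums give the denominator: all connections still surviving (v >= u)
    suffix = torch.flip(torch.cumsum(torch.flip(sorted_sq, [0]), 0), [0])
    scores = torch.empty_like(flat)
    scores[order] = sorted_sq / suffix           # map back to original positions
    return scores.view_as(w)

w = torch.randn(4, 4)
s = lamp_scores(w)
# the largest-magnitude weight in a layer always scores 1.0; globally,
# connections with the smallest scores are pruned first
print(s.max().item())
```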
4. Results
In Table 2, Table 3 and Table 4, the evaluation data were obtained by training and testing on the FloW-Img floating object dataset. Analyzing these data allows the performance of both the baseline and the optimized models to be assessed. YOLOv5s* denotes the YOLOv5s model optimized with the C3-Faster module, the RepConv module, and the ACON-C activation function.
As shown in Table 2, the YOLOv5s algorithm was compared with Faster R-CNN, SSD, YOLOv3-tiny, YOLOv6s, and YOLOv7-tiny; its mAP was higher than those of these models by 30.8%, 9.5%, 5.6%, 2.7%, and 3.6%, respectively. Compared with YOLOv8s, although YOLOv8s was 0.18% more accurate than YOLOv5s, the parameter count of YOLOv5s was only 64.8% of that of YOLOv8s, which aligns better with the lightweight principle of this article. YOLOv5s was therefore chosen as the benchmark model.
The model size of FRL-YOLO was 2 M, its computational cost was 4.6 GFLOPs, and its mAP reached 79.3%. Our model achieved an mAP surpassing almost all other classic detection models (except YOLOv8s). Moreover, YOLO models with mAP values similar to FRL-YOLO, such as YOLOv5s and YOLOv8s, required more than six times its computation. FRL-YOLO also achieved 623.5 FPS with a batch size of 128, surpassing all other models in detection speed and meeting the requirements of real-time detection. In summary, our model strikes an appropriate balance between speed, accuracy, and computational workload.
As shown in Table 3, experiment A was the original YOLOv5s algorithm, with an mAP of 0.789, a model volume of 13.9 M, and a parameter count of 7,012,822. Experiments B, C, and D, respectively, introduced the C3-Faster module into the backbone, introduced the RepConv module into the neck, and changed the activation function of the C3 module to ACON-C. These three experiments showed that C3-Faster substantially reduced the parameter count and model volume, albeit with a considerable decrease in accuracy; RepConv had little impact on model performance; and the ACON-C activation function increased the average precision by 0.5% with minimal changes in parameter count and model volume. Experiments E, F, and G combined the three improvement modules in different combinations. Experiment E combined C3-Faster and RepConv, maintaining accuracy while reducing the parameter count by 8.4% and the model volume by 1.1 M. Experiment F incorporated all three improvements, increasing detection accuracy and average precision with only a 0.4% increase in parameter count relative to the original YOLOv5s in experiment A. Both experiments E and F showed significant reductions in parameter count and model volume compared with the original YOLOv5s while also improving average precision. This is because the C3-Faster module reconstructs the receptive field of the backbone, enabling it to receive more feature information; the multi-branch structure of the RepConv module makes feature processing more effective; and the ACON-C activation function can adaptively and selectively activate neurons, better adapting to different data distributions and patterns and improving the model's adaptability and generalization.
To further investigate the impact of the pruning rate on model performance, we designed comparative experiments with different pruning rates; the results are shown in Table 4. The baseline FRL-YOLO network was pruned at rates of 40%, 50%, 60%, 70%, 75%, and 80%, and the pruning results were analyzed.
After fine-tuning training, the pruned models improved to varying degrees at the different pruning rates. However, once the pruning rate exceeded 60%, the average precision gradually decreased: at high pruning rates, some important feature weights were removed, causing a sharp decline in accuracy.
Model volume and parameter count decreased steadily as the pruning rate increased. At an 80% pruning rate, the model volume could be reduced to 1.3 M, only 10.3% of the baseline model, but accuracy dropped markedly relative to the 60% pruning rate. Taking all factors into consideration, we ultimately chose a pruning rate of 70% as the basic pruning parameter for FRL-YOLO; this significantly reduces the model's storage footprint while keeping accuracy consistent with the baseline model, achieving a good balance between accuracy and model size.
To visually demonstrate the detection performance of the proposed FRL-YOLO, a comparative analysis was conducted on the test set of the floating object dataset, with particular emphasis on YOLOv5s and YOLOv8s and on the effectiveness of FRL-YOLO in detecting small floating objects. The results, depicted in Figure 7, Figure 8 and Figure 9 (from top to bottom: YOLOv5s, YOLOv8s, FRL-YOLO), show that for relatively large floating objects the performance of the three models was comparable, with all of them detecting the objects successfully (confidence above 50%). However, when YOLOv5s and YOLOv8s detected small floating objects, their confidence levels were only around 40%, leading to potential misjudgments. In contrast, FRL-YOLO achieved a confidence above 50% on small targets, indicating its robustness in extracting key information from complex scenes to identify and localize small objects.
In the Grad-CAM heatmaps shown in Figure 7, Figure 8 and Figure 9, the color intensity represents the importance of each pixel or region to the predicted target category, with deeper colors (such as red or blue) indicating greater influence on the prediction. The YOLOv5s and YOLOv8s models clearly pay less attention to small targets than FRL-YOLO, which responds more strongly to them. This suggests that the FRL-YOLO model effectively focuses on and localizes small targets, even amidst complex scenes.