Article

YOLO-TC: An Optimized Detection Model for Monitoring Safety-Critical Small Objects in Tower Crane Operations

by Dong Ding, Zhengrong Deng * and Rui Yang
Guangxi Key Laboratory of Images and Graphics Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(1), 27; https://doi.org/10.3390/a18010027
Submission received: 7 November 2024 / Revised: 26 December 2024 / Accepted: 30 December 2024 / Published: 6 January 2025
(This article belongs to the Special Issue Advances in Computer Vision: Emerging Trends and Applications)

Abstract:
Ensuring operational safety within high-risk environments, such as construction sites, is paramount, especially for tower crane operations where distractions can lead to severe accidents. Despite existing behavioral monitoring approaches, the task of identifying small yet hazardous objects like mobile phones and cigarettes in real time remains a significant challenge in ensuring operator compliance and site safety. Traditional object detection models often fall short in crane operator cabins due to complex lighting conditions, cluttered backgrounds, and the small physical scale of target objects. To address these challenges, we introduce YOLO-TC, a refined object detection model tailored specifically for tower crane monitoring applications. Built upon the robust YOLOv7 architecture, our model integrates a novel channel–spatial attention mechanism, ECA-CBAM, into the backbone network, enhancing feature extraction without an increase in parameter count. Additionally, we propose the HA-PANet architecture to achieve progressive feature fusion, addressing scale disparities and prioritizing small object detection while reducing noise from unrelated objects. To improve bounding box regression, the MPDIoU Loss function is employed, resulting in superior accuracy for small, critical objects in dense environments. The experimental results on both the PASCAL VOC benchmark and a custom dataset demonstrate that YOLO-TC outperforms baseline models, showcasing its robustness in identifying high-risk objects under challenging conditions. This model holds significant promise for enhancing automated safety monitoring, potentially reducing occupational hazards by providing a proactive, resilient solution for real-time risk detection in tower crane operations.

1. Introduction

Tower cranes are essential for large-scale construction projects, particularly in the development of high-rise buildings and infrastructure. However, ensuring the safety of crane operations remains a persistent challenge, as improper behaviors such as smoking or mobile phone usage are significant contributors to accidents. While behavioral recognition techniques have been explored for monitoring unsafe behaviors, their reliance on continuous spatial–temporal analysis makes them computationally expensive and impractical for real-time monitoring. Instead, object detection techniques, which focus on identifying specific objects like mobile phones and cigarettes in individual image frames, provide a more efficient alternative.
In recent years, deep learning-based object detection models have advanced significantly, with the YOLO (You Only Look Once) series [1,2,3,4,5], PP-YOLO [6], and YOLOv8 emerging as prominent frameworks. YOLOv7 [7], for example, achieves an optimal trade-off between accuracy and speed, making it suitable for real-time applications. However, when applied to small object detection, these models exhibit critical shortcomings. Small objects occupy minimal pixel areas and lack salient features, which limits the ability of conventional feature extraction methods to distinguish them effectively. PP-YOLO, while enhancing detection through improved data augmentation and training strategies, struggles to achieve consistent precision for extremely small targets in cluttered environments. Similarly, YOLOv8 introduces a more efficient network architecture and anchor-free strategies, yet its performance remains constrained by challenges such as semantic disparity across scales and interference from complex backgrounds. Faster R-CNN [8], despite its accuracy, suffers from high computational overhead, making it impractical for real-time applications. These limitations manifest prominently in scenarios like tower crane operator surveillance, where objects such as cigarettes and mobile phones are often obscured by intricate backgrounds, variable lighting, and visual noise. The inability of these models to achieve robust performance in detecting small objects under such challenging conditions highlights the need for further architectural enhancements.
To address these issues, we propose YOLO-TC, an enhanced version of YOLOv7 specifically optimized for small object detection in complex environments. The primary contributions of this work are as follows:
  • We introduce ECA-CBAM, a novel channel–spatial attention module that integrates the strengths of Efficient Channel Attention (ECA) [9] and the Convolutional Block Attention Module (CBAM) [10]. By utilizing lightweight 1D convolution for cross-channel interactions, ECA-CBAM improves feature extraction efficiency while avoiding information loss.
  • To address multi-scale detection challenges, we propose the Hierarchical Asymptotic Path Aggregation Network (HA-PANet), which mitigates the semantic gap between non-adjacent feature maps. This enhancement enables more effective multi-scale feature fusion and improves the localization accuracy for small objects.
  • We conduct extensive experiments on a custom tower crane dataset and the PASCAL VOC benchmark. Comparative analyses with state-of-the-art models, including Retina-Net [11], YOLOv5, YOLOX, YOLOv7 and YOLOv8, demonstrate that YOLO-TC achieves superior performance in detecting small objects while maintaining competitive inference speed and computational efficiency.
The experimental results confirm that YOLO-TC effectively addresses the limitations of existing methods, offering significant improvements in accuracy, robustness, and generalizability for small object detection tasks. By balancing precision and real-time applicability, YOLO-TC provides a reliable solution for safety-critical scenarios, such as monitoring tower crane operations, where the accurate detection of small objects is paramount.

2. Related Work

2.1. Development and Application of Object Detection Algorithms

With the development of deep learning technologies, object detection algorithms have found widespread applications across various domains. Object detection algorithms fall into two categories: two-stage and one-stage. Two-stage algorithms first generate candidate boxes and then classify them; a typical example is the Faster R-CNN series. One-stage algorithms, including SSD [12] and the YOLO series, predict object locations and classes in a single pass without a separate proposal stage, enabling real-time detection.
To address issues such as complex image backgrounds and severe occlusion, Zhang et al. [13] introduced a Target-Aware Attention Module (TAAM) and a Channel Attention Module (CAM) in the YOLOv6 [14] model. The TAAM helps suppress background features, while CAM reduces irrelevant channel features. The modified model enhances the extraction of prohibited items’ information in complex backgrounds.
In the research on tower crane safety management based on computer vision technology, Yang [15] developed an automatic system for the collection, analysis, and early warning of safety distances for tower cranes based on Mask-RCNN [16]. This system utilizes the Mask-RCNN method to recognize video data and extract RGB color from the mask layer, thereby identifying hazardous areas and worker coordinates. These coordinates are then transformed into real-world distances to calculate the safety distance. By incorporating a mask layer, this method allows for the presence of distorted angles between the camera and the target, thus improving the recognition accuracy and applicability. Kang [17] proposed using Mask-RCNN to address the safety hazard caused by mismatched hook and steel ladle ear axes during the hoisting process. Compared to the manual acquisition of the matching status, the proposed method can quickly and accurately determine the hook’s matching status. Luo et al. [18] introduced a construction equipment recognition framework based on computer vision (CV) technology, which, using information captured by cameras, automatically estimates the overall posture of construction equipment.
Compared to existing research, the approach of using object detection methods to detect prohibited items in tower crane operation scenarios, to prevent safety accidents, is highly innovative. This method consumes fewer resources and is more efficient than traditional behavior recognition methods. Furthermore, the improved object detection model achieves a higher mAP than the more advanced YOLOv8 while also maintaining a balance between speed and the number of parameters.

2.2. Development and Application of Attentional Mechanisms

The attention mechanism has found widespread application in computer vision, particularly in recent years. The fundamental concept is to selectively focus on the most pertinent characteristics by allocating weights to various input information segments, enhancing the model’s capacity for feature extraction.
CBAM is a widely recognized attention mechanism that combines both channel and spatial attention. First, CBAM generates the weights of each channel using the channel attention module, which adjusts weight allocation based on global information from the feature map. Then, it assigns weights to spatial positions through the spatial attention module, allowing for more accurate spatial feature capture. This dual-attention mechanism performs exceptionally well in visual tasks, especially in modeling multi-scale information, thereby considerably improving the model’s feature extraction ability.
This method enhances attention to local and global information in complex scenarios, augmenting the model’s overall performance.
ECANet is a lightweight channel attention mechanism that replaces the conventional fully connected layer with 1D convolution, significantly lowering computational cost and simplifying the model’s structure. By adaptively weighting each channel, ECANet enhances the model’s feature extraction capabilities while maintaining a low parameter count and reduced computational complexity. This selective feature weighting, combined with the reduction in parameters, plays a critical role in optimizing model performance. The mechanism effectively mitigates the high computational costs typically associated with traditional channel attention methods, thereby making it suitable for deployment in resource-constrained environments. This design enables ECANet to excel in balancing model performance and computational overhead and is suitable for scenarios that require fast reasoning and fewer resources. The structure of ECANet is shown in Figure 1.
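For readers who want a concrete reference, the following PyTorch sketch shows an ECA-style channel attention block (global average pooling followed by a 1D convolution across channels). It is a minimal illustration of the published ECA-Net idea; the class name, kernel-size rounding convention, and usage example are our own assumptions rather than code from this paper.

```python
import math

import torch
import torch.nn as nn


class ECA(nn.Module):
    """ECA-style channel attention: global average pooling, a 1D convolution
    across channels, and a sigmoid gate that re-weights each channel."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapts to the channel count and is forced to be odd,
        # following the common ECA-Net convention.
        t = int(math.log2(channels) / gamma + b / gamma)
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W)
        y = self.avg_pool(x)                        # (N, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)           # (N, 1, C): channels as a 1D sequence
        y = self.conv(y)                            # local cross-channel interaction
        y = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return x * y                                # channel-wise re-weighting


x = torch.randn(2, 256, 40, 40)
print(ECA(256)(x).shape)   # torch.Size([2, 256, 40, 40])
```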

2.3. Development and Application of Feature Fusion Strategies

Feature fusion significantly improves object detection performance, particularly in detecting objects at multiple scales. Conventional feature fusion methods, like the Feature Pyramid Network (FPN) [19], combine high-level semantic features with low-level detailed information through top-down feature propagation, enhancing the detection network’s ability to handle multi-scale objects. Despite its advantages, the FPN still faces challenges with inadequate feature fusion, particularly when dealing with small objects.
Adaptively Spatial Feature Fusion (ASFF) [20] is an adaptive feature fusion strategy; its core concept is to adaptively select different levels of features to be fused according to the current input, avoiding redundancy and interference between features. ASFF effectively leverages information across different scales by adaptively weighting features at multiple resolutions, thereby improving the accuracy of small object detection. This flexible feature fusion approach performs well in complex environments, particularly when dealing with multi-scale objects, significantly enhancing the detection capability for small objects.
The Asymptotic Feature Pyramid Network (AFPN) [21] draws on adaptive spatial feature fusion to fuse features of different resolutions in stages during bottom-up feature extraction. By fusing the information from low-level features to high-level features one by one, the AFPN realizes the dual fusion of semantic and detailed information while avoiding the information gap between non-adjacent feature layers. This design can proficiently prevent the loss of feature information during multi-layer transfer, improving the model’s efficacy in the multi-scale detection of objects. The structure of the AFPN is shown in Figure 2.
Research on comprehensive attention mechanisms and feature fusion indicates that effective feature selection and fusion strategies can significantly enhance the performance of object detection networks, particularly for small object detection tasks. Building upon YOLOv7, this study proposes a novel attention mechanism and feature fusion strategy to address the limitations of the original approach in small object detection, thereby improving detection accuracy and robustness.

3. Methodology

3.1. YOLO-TC Network Structure

YOLO-TC includes four innovations. First, we improve the CBAM attention module into a lightweight channel–spatial attention module, ECA-CBAM, which focuses on the spatial features of small objects; the ECA-CBAM is integrated after each convolution block of the backbone network to strengthen feature extraction. Second, HA-PANet is proposed based on the AFPN idea, realizing the progressive fusion of adjacent effective feature layers and alleviating the semantic gap between feature layers. Third, we reconstruct the detection head to decrease network computation and enhance the small object recognition capability. Finally, we introduce the MPDIoU loss function [22], enabling the model to recognize and localize objects more precisely. The network structure of YOLO-TC is shown in Figure 3.

3.2. ECA-CBAM Attention Mechanism

When detecting surveillance images of tower crane cabs, image quality is poor because the camera is usually installed high up, phones and cigarettes occupy only a small fraction of the image, and occlusion, deformation, and blurred details are common. The YOLOv7 algorithm struggles to extract features from these objects properly, so an operator using a phone or smoking is often not identified. The conventional YOLOv7 network detects objects of different sizes by fusing feature maps at three resolutions (80 × 80, 40 × 40, and 20 × 20), with three anchor boxes assigned to each scale to capture small, medium, and large objects. For an input image of 640 × 640 pixels, the minimum detectable object size is therefore 8 × 8 pixels, and cigarettes and phones in the tower crane scene are often smaller than this. To enhance model performance, we add a 160 × 160 detection branch to YOLOv7, reducing the minimum detection resolution to 4 × 4 pixels. In addition, an attention mechanism is introduced to increase the network's sensitivity to small objects, improving detection accuracy and allowing the network to learn richer feature representations.
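The relationship between detection-branch resolution and the smallest resolvable object can be checked with a line of arithmetic (an illustrative calculation, not code from the paper): each grid cell of a detection branch covers input_size / grid_size pixels.

```python
# Illustrative arithmetic: one grid cell of each detection branch for a 640 x 640 input.
input_size = 640
for grid in (160, 80, 40, 20):
    cell = input_size // grid
    print(f"{grid} x {grid} branch -> cell size {cell} x {cell} px")
# 160 x 160 (added in YOLO-TC) -> 4 x 4 px; 80 x 80 -> 8 x 8 px;
# 40 x 40 -> 16 x 16 px; 20 x 20 -> 32 x 32 px
```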
However, introducing an attention mechanism increases the number of parameters and therefore the model's computational complexity. The computation of the fully connected layer scales with the number of channels, and because the detection branches have many channels, the overhead of channel attention is relatively high. We improve the CBAM by reducing the computation of its channel attention. Drawing on the ECANet network structure, we design an improved channel attention module and name it the E-CAM, shown in Figure 4.
As the figure shows, the E-CAM builds on the original channel attention by using a one-dimensional convolution with kernel length k to realize local cross-channel interaction and extract inter-channel dependencies. After the one-dimensional convolution, the element-wise sum of the outputs is passed through an activation function to produce the channel attention weights, which are then applied to the original input feature maps channel by channel, recalibrating the features. The kernel size k is determined as a function of the number of channels C; the mapping between the two is as follows:
$C = \phi(k) = 2^{\gamma k - b}$
Since the number of channels C is known at this point, we can derive an expression for k:
$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$
Here, $|x|_{\mathrm{odd}}$ denotes the odd number nearest to $x$; $\gamma$ is set to 2 and $b$ to 1.
The kernel size k is adjusted adaptively based on the number of channels C. The theoretical rationale behind this design lies in the observation that with an increasing number of channels, the cross-channel dependencies become more complex, necessitating a larger receptive field (i.e., a larger k) to effectively model long-range dependencies across channels. Conversely, when the number of channels is smaller, a relatively smaller k is sufficient to capture the essential inter-channel relationships.
Compared to a fixed kernel size, the adaptive adjustment mechanism of k offers several advantages. First, it achieves a balance between model complexity and representational capacity. Second, the dynamic kernel size enables the E-CAM to flexibly adapt to features with varying numbers of channels, thereby enhancing the effectiveness and generalization capability of the channel attention mechanism.
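As a quick numerical check of this mapping (our own worked example, using the common convention of truncating and then forcing the result to an odd value):

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """k = |log2(C)/gamma + b/gamma|_odd, realized by truncating and forcing an odd value."""
    t = int(math.log2(channels) / gamma + b / gamma)
    return t if t % 2 == 1 else t + 1

for c in (64, 128, 256, 512, 1024):
    print(c, eca_kernel_size(c))   # 64 -> 3, 128 -> 5, 256 -> 5, 512 -> 5, 1024 -> 5
```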
The improved E-CAM and SAM are utilized to construct a new channel–spatial attention structure, the ECA-CBAM, as shown in Figure 5.
The module first utilizes the E-CAM and SAM to obtain the channel attention weight $M_c$ and the spatial attention weight $M_s$, respectively. Then, $M_c$ and $M_s$ are expanded to the size $\mathbb{R}^{W \times H \times C}$, summed element-wise, and normalized to form the joint channel–spatial attention weight matrix $M_{cs}$. The weight values represent the attention distribution across the feature map, allowing the model to extract more relevant features from the focused regions. The calculation formula is as follows.
$M_{cs} = \mathrm{sigmoid}(M_c + M_s)$
Finally, the hybrid attention weight matrix $M_{cs}$ is multiplied element-wise with the input feature map $F$, resulting in the refined feature map $F_{cs}$, which is calculated as follows:
$F_{cs} = F \otimes M_{cs}$
The attention mechanism guides the model to allocate more computational resources to the regions of interest, enhancing its expressiveness. The idea of the ECA-CBAM is to obtain attention weight matrices $M_c$ and $M_s$ from the input feature map $F$ in both the channel and spatial dimensions, which improves the effective flow of feature information through the network. The module emphasizes meaningful features in both dimensions, highlighting essential features and suppressing ineffective ones. For small objects, the corresponding feature regions receive larger weights and contain more valid information, allowing the model to focus on learning the features of those regions and thereby improving feature extraction under limited computational resources.
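A minimal PyTorch sketch of the hybrid structure described by the two formulas above is given below: channel logits from an E-CAM-style 1D convolution and spatial logits from a CBAM-style 7 × 7 convolution are broadcast, summed, and passed through a sigmoid to gate the input. The class name, the 7 × 7 spatial kernel, and the exact placement of the sigmoid are assumptions for illustration; this is not the authors' released implementation.

```python
import math

import torch
import torch.nn as nn


class ECACBAM(nn.Module):
    """Hybrid channel-spatial attention: M_cs = sigmoid(M_c + M_s), F_cs = F * M_cs."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1, spatial_kernel: int = 7):
        super().__init__()
        # E-CAM branch: adaptive-kernel 1D convolution across channels.
        t = int(math.log2(channels) / gamma + b / gamma)
        k = t if t % 2 == 1 else t + 1
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.channel_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # SAM branch: channel-wise average/max maps fused by a spatial convolution.
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention logits M_c: (N, C, 1, 1).
        mc = self.gap(x).squeeze(-1).transpose(1, 2)              # (N, 1, C)
        mc = self.channel_conv(mc).transpose(1, 2).unsqueeze(-1)  # (N, C, 1, 1)
        # Spatial attention logits M_s: (N, 1, H, W).
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        ms = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        # Broadcast both to (N, C, H, W), sum, squash, and gate the input feature map.
        m_cs = torch.sigmoid(mc + ms)
        return x * m_cs


feat = torch.randn(1, 256, 40, 40)
print(ECACBAM(256)(feat).shape)   # torch.Size([1, 256, 40, 40])
```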

3.3. HA-PANet Structure

The Path Aggregation Network (PANet) structure [23] in YOLOv7 has limitations when dealing with object detection in tower crane operating scenarios. The PANet does not adequately address the semantic disparity between features from non-adjacent layers. The top-down fusion of high-level and low-level features may lose semantic information because of the substantial differences in object scales during multi-scale feature extraction, and low-level detail is often diluted by suboptimal fusion during bottom-up integration. As a result, the existing PANet suffers from insufficient feature fusion, semantic information degradation, and detail loss during object detection, ultimately compromising detection accuracy and robustness.
For the detection algorithm in this paper, the HA-PANet structure is proposed by improving the PANet structure with the AFPN idea in the YOLOv7 neck network structure.
The HA-PANet structure progressively fuses effective feature maps from the backbone network with neighboring feature layers in a deeper-to-shallower order to reduce computational demands. It accomplishes this progressive fusion between feature maps of different scales through the ASFF strategy, retaining as much information from each effective feature layer as possible during feature fusion and extraction. The specific processing strategy of each progressive fusion branch is determined by the dimensions and channel counts of the feature layers engaged in that branch's fusion; accordingly, the ASFF-1, ASFF-2, and ASFF-3 modules differ slightly in their internal structure. The structure of the HA-PANet is shown in Figure 6.
The proposed HA-PANet aims to further improve the detection of small objects. Unlike the original AFPN, the HA-PANet performs progressive feature fusion only on the modified P2, P3, and P4 branches of the backbone network, thereby reducing the number of fusion steps and minimizing additional computational and memory costs. In contrast to the AFPN, which performs multiple feature fusion operations across different levels, the HA-PANet groups P2 with P3 and P3 with P4, conducting a single fusion operation between each pair. This progressive enhancement process enables the low-level detailed features and high-level semantic features to complement each other, effectively balancing the representation of fine-grained details and high-level semantics.
Moreover, the HA-PANet incorporates ideas from ASFF, where the model introduces the P2 branch with higher spatial resolution. This branch is fused with higher-level features using a weighted aggregation, which prevents the loss of low-level details and mitigates the excessive abstraction of high-level semantics. As a result, the fusion of low-level and high-level features becomes more precise, further narrowing the semantic gap between them. Compared to other hierarchical feature fusion strategies, the HA-PANet better preserves the low-level feature details without allowing them to be overshadowed by higher-level features while also placing greater emphasis on small object detection. Consequently, the HA-PANet demonstrates superior performance, particularly in the detection of small objects, within the framework of multi-scale object detection.
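To illustrate the pairwise weighted aggregation used between neighboring levels (e.g., P2 with P3), the sketch below implements a two-input ASFF-style fusion in PyTorch. The channel widths, 1 × 1 projections, and nearest-neighbor upsampling are illustrative assumptions; the actual ASFF-1/2/3 modules in HA-PANet differ in their internal details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFF2(nn.Module):
    """Adaptive fusion of two adjacent feature levels (ASFF-style), used pairwise in HA-PANet.

    Both inputs are projected to a common channel width and resized to the
    higher-resolution grid; per-pixel softmax weights then decide how much each
    level contributes at every spatial location.
    """

    def __init__(self, ch_low: int, ch_high: int, ch_out: int):
        super().__init__()
        self.proj_low = nn.Conv2d(ch_low, ch_out, 1)    # higher-resolution, lower-level features
        self.proj_high = nn.Conv2d(ch_high, ch_out, 1)  # lower-resolution, higher-level features
        self.weight = nn.Conv2d(2 * ch_out, 2, 1)       # two fusion-weight logits per pixel
        self.fuse = nn.Conv2d(ch_out, ch_out, 3, padding=1)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        low = self.proj_low(f_low)
        high = F.interpolate(self.proj_high(f_high), size=low.shape[-2:], mode="nearest")
        w = torch.softmax(self.weight(torch.cat([low, high], dim=1)), dim=1)  # (N, 2, H, W)
        fused = w[:, 0:1] * low + w[:, 1:2] * high
        return self.fuse(fused)


# Example: fusing a P2-like map (stride 4) with a P3-like map (stride 8).
p2, p3 = torch.randn(1, 128, 160, 160), torch.randn(1, 256, 80, 80)
print(ASFF2(128, 256, 128)(p2, p3).shape)   # torch.Size([1, 128, 160, 160])
```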

3.4. Detection Head Reconstruction

Most object detection models typically extract features by stacking deeper convolutional neural network layers. Deeper convolutional layers gradually enrich semantic information but simultaneously blur fine-grained details, resulting in reduced accuracy for small object detection. The backbone network of YOLOv7 downsamples the input image three times in total. The network extracts four feature map layers: P2, P3, P4, and P5. The P3, P4, and P5 feature maps are processed through the neck layer and then directed to the small, medium, and large object detection heads in the head layer, respectively, for detecting objects of varying scales.
However, in the tower crane operating scenario, the cigarette and phone objects are tiny and have extremely weak contrast with the background. The P2 feature map contains richer low-level information about such objects, so we feed the P2 feature map into the neck layer and add a tiny-object head to improve detection accuracy and efficiency. Integrating low-level features allows the network to capture finer details of tiny objects and helps prevent the gradual loss of small-object information as the network deepens. Meanwhile, the large-object detection head is redundant in this scenario; removing the P5 detection layer improves the accuracy and stability of tiny-object detection in the tower crane operating environment. This adjustment increases the model's adaptability to the specific scenario, focusing on cigarette- and phone-sized objects while minimizing attention to irrelevant scales and decreasing computational complexity. The improved structure is shown in Figure 7, and an illustrative head configuration is sketched below.
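The reconstruction can be summarized by the strides and grid sizes each head operates on for a 640 × 640 input; the listing below is a hypothetical configuration that mirrors the description above (anchor settings are omitted because the paper does not list them).

```python
# Hypothetical head configuration reflecting the reconstruction described above
# (values are illustrative; the paper does not publish an exact configuration file).
yolov7_heads = {          # original YOLOv7: three heads fed by P3-P5
    "P3": {"stride": 8,  "grid": (80, 80)},
    "P4": {"stride": 16, "grid": (40, 40)},
    "P5": {"stride": 32, "grid": (20, 20)},
}
yolo_tc_heads = {         # YOLO-TC: add a tiny-object head on P2, drop the large-object head on P5
    "P2": {"stride": 4,  "grid": (160, 160)},   # minimum detectable object ~4 x 4 px at 640 x 640 input
    "P3": {"stride": 8,  "grid": (80, 80)},
    "P4": {"stride": 16, "grid": (40, 40)},
}
```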

3.5. MPDIoU Loss

Intersection over Union (IoU) is a simple metric that measures the overlap between two bounding boxes, and hence their positional error, by dividing the area of their intersection by the area of their union. The original YOLOv7 algorithm computes the coordinate regression loss with the CIoU loss function, defined by the following formula.
$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$
Here, $\rho(b, b^{gt})$ represents the Euclidean distance between the center points of the predicted box $b$ and the ground-truth box $b^{gt}$, which measures the predicted box's spatial proximity to the ground truth. $c$ denotes the diagonal length of the smallest enclosing box. $\alpha$ and $v$ are the balance parameter and the aspect-ratio consistency factor, respectively.
While CIoU considers factors such as intersection area, center point distance, and aspect ratio to evaluate bounding box overlap, its aspect-ratio term fails to represent the true differences in width and height between the predicted and ground-truth boxes. It also pays insufficient attention to the problem of uneven sample quality, which can slow the model's convergence. Therefore, to more effectively handle the regression of both overlapping and non-overlapping bounding boxes while accounting for corner distances and differences in width and height, we adopt the MPDIoU loss function instead of CIoU. MPDIoU is a similarity metric based on the minimum point distance between bounding boxes: it directly minimizes the distances between the upper-left and lower-right corners of the predicted and ground-truth boxes, streamlining the computation. It is computed as follows.
$d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2,$
$d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2,$
$\mathrm{MPDIoU} = \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2},$
$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|},$
$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2},$
$\mathrm{LOSS}_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU}.$
Here, $w$ and $h$ are the width and height of the predicted box, respectively, and $A$ and $B$ are the predicted box and the ground-truth box, respectively. $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$ represent the upper-left and lower-right coordinates of bounding box $A$, and $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ represent the upper-left and lower-right coordinates of bounding box $B$.
MPDIoU effectively promotes the predicted box to align more closely with the ground truth box, whether their centroids overlap or not, thereby preventing loss function failure. In addition, the detection task of cigarette and phone objects considers attributes such as shape and size. By incorporating factors like diagonal length, MPDIoU allows the model to detect and localize objects more accurately, enhancing both convergence speed and regression accuracy.
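Below is a minimal PyTorch sketch of an MPDIoU-style loss for corner-format boxes, following the formulas above; per the text, w and h are taken from the predicted box. The function name and numerical-stability epsilon are our own choices, and this is not the authors' released implementation.

```python
import torch


def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Minimal MPDIoU loss sketch; boxes are (x1, y1, x2, y2) with shape (N, 4)."""
    x1a, y1a, x2a, y2a = pred.unbind(-1)
    x1b, y1b, x2b, y2b = target.unbind(-1)

    # Standard IoU term.
    inter_w = (torch.min(x2a, x2b) - torch.max(x1a, x1b)).clamp(min=0)
    inter_h = (torch.min(y2a, y2b) - torch.max(y1a, y1b)).clamp(min=0)
    inter = inter_w * inter_h
    union = (x2a - x1a) * (y2a - y1a) + (x2b - x1b) * (y2b - y1b) - inter + eps
    iou = inter / union

    # Corner-distance penalties d1 (upper-left) and d2 (lower-right).
    d1 = (x1b - x1a) ** 2 + (y1b - y1a) ** 2
    d2 = (x2b - x2a) ** 2 + (y2b - y2a) ** 2
    w, h = x2a - x1a, y2a - y1a          # width/height of the predicted box, as in the text
    diag = w ** 2 + h ** 2 + eps

    mpdiou = iou - d1 / diag - d2 / diag
    return 1.0 - mpdiou                  # LOSS_MPDIoU, one value per box pair


pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
gt = torch.tensor([[12.0, 12.0, 52.0, 48.0]])
print(mpdiou_loss(pred, gt))             # small positive loss for a near-perfect box
```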

4. Experimental Design

4.1. Datasets and Evaluation Indicators

The dataset used in the experiments was obtained by extracting frames from surveillance videos of a tower crane cab at an equipment storage site operated by a construction company in Guangxi. It consists of 8766 images with a 1:1 ratio of positive to negative samples and includes tower crane cabs captured from different angles and at different times, as well as interfering factors such as objects being occluded or moving out of the frame. The objects in the images were labeled and classified with the LabelImg tool in TXT format, using the classes cigarette (C) and phone (P). Sample images from the dataset are shown in Figure 8.
In this study, the model’s performance is assessed using average precision (AP), mean average precision (mAP), recall (R), precision (P), and parameter count. AP and mAP quantify model precision, whereas the parameter count indicates the model’s complexity. Precision is defined as the ratio of true positive predictions to the total number of positive predictions made by the model. Recall is defined as the ratio of true positive predictions to the total number of actual positive instances. AP represents the area under the precision–recall (P-R) curve, where a larger area corresponds to higher recognition accuracy. mAP is defined as the average AP across all categories, where a higher value signifies improved overall object recognition performance. The parameter count indicates the total number of model parameters, where a smaller count enhances the model’s suitability for deployment on mobile devices.
$\mathrm{Precision} = \frac{TP}{FP + TP},$
$\mathrm{Recall} = \frac{TP}{FN + TP},$
$AP = \int_0^1 P(R)\,dR,$
$mAP = \frac{1}{n} \sum_{i=1}^{n} \int_0^1 P(R)\,dR$
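For reference, a common VOC-style way to compute AP as the area under the precision–recall curve, and mAP as the class average, is sketched below with NumPy; the all-point interpolation and the toy values are our own illustration and do not come from this paper's evaluation code or experiments.

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (VOC-style all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


# Toy example for the two classes in this paper (values are purely illustrative).
ap_cigarette = average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.8, 0.7]))
ap_phone = average_precision(np.array([0.3, 0.6, 0.9]), np.array([0.95, 0.9, 0.85]))
map50 = (ap_cigarette + ap_phone) / 2   # mAP = mean of per-class APs
print(ap_cigarette, ap_phone, map50)
```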

4.2. Parameter Settings

The experiments were conducted on Ubuntu 18.04.6 with an Intel Xeon W-2265 CPU and an NVIDIA RTX A4000 GPU, using PyTorch 1.13.0, CUDA 11.4, and cuDNN 8.2.4.
The dataset is divided into training, testing, and validation sets at a ratio of 6:2:2. The network input size is 640 × 640. The model parameters are optimized with the SGD optimizer over 300 training epochs with a batch size of 8. The initial learning rate is set to 0.01 to enhance training effectiveness, and a weight decay coefficient of 0.0005 is applied to mitigate the risk of overfitting. No pre-trained weights are used in any of the experiments.
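The reported optimizer settings translate into PyTorch roughly as follows; the model stand-in, the data pipeline, and the momentum value (taken from the VOC setup in Section 4.6, since the custom-dataset momentum is not stated) are assumptions, not the authors' training script.

```python
import torch
import torch.nn as nn

# Stand-in for the YOLO-TC network; replace with the real model definition.
model = nn.Conv2d(3, 16, 3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # initial learning rate reported in the paper
    momentum=0.9,       # assumption: value reported for the VOC experiments (Section 4.6)
    weight_decay=5e-4,  # weight decay coefficient reported in the paper
)

EPOCHS, BATCH_SIZE, IMG_SIZE = 300, 8, 640
for epoch in range(EPOCHS):
    # Forward pass on 640 x 640 batches of size 8, MPDIoU-based loss,
    # optimizer.zero_grad() / loss.backward() / optimizer.step() would go here.
    pass
```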

4.3. Comparative Experiments on Attentional Mechanisms

To assess the efficacy of the channel–spatial attention mechanism ECA-CBAM introduced in this research, we incorporated several attention modules into the YOLOv7 detection algorithm under identical experimental settings and examined the influence of each module on the model's performance. The results are shown in Table 1, where the symbol "↑" indicates that higher values are better.
CAM denotes the channel attention module with a compression rate of 16, and E-CAM denotes the lightweight CAM. Table 1 shows that, compared with the CAM, the E-CAM reduces the parameter count by 4.51 × 10⁶ and the computational cost by 7 × 10⁹ FLOPs while improving detection accuracy by 0.6% and increasing detection speed by 14 FPS. We attribute this to the feature compression in the CAM's fully connected layer, which causes information loss, whereas the E-CAM performs cross-channel interaction with a 1D convolution that avoids information loss, reduces complexity, and is more effective. Among the added attention modules, the ECA-CBAM yields the largest accuracy gain, improving the detection algorithm's mAP by 2.3%, while keeping the parameter count and computation moderate and preserving the real-time performance of the algorithm.
The precision-recall curves for the YOLOv7 model using the CBAM attention mechanism and the ECA-CBAM attention mechanism are shown in Figure 9. Figure 9a represents the use of CBAM, and Figure 9b represents the use of ECA-CBAM.

4.4. Ablation Experiment

Multiple ablation studies were conducted on the custom dataset to assess the impact of each enhancement on YOLO-TC. The results are shown in Table 2, where the symbol "√" indicates that the corresponding module was used in that experiment.
First, incorporating the ECA-CBAM attention mechanism into the backbone network improves the mAP by 2.3% while increasing the parameter count by 1.56 × 10⁶. Despite this slight increase, the ECA-CBAM mechanism improves the model's capacity to detect small objects in the tower crane operating scenario. Second, introducing the HA-PANet module alone improves the mAP by 1%, with an additional 2.73 × 10⁶ parameters. Reconstructing the detection head improves the mAP by 0.8% and reduces the parameter count by 3.07 × 10⁶, focusing the model on the small cigarette and phone object sizes rather than irrelevant scales. With all improvements combined, the network improves the mAP by 3.7% over the baseline with an additional 2.13 × 10⁶ parameters. Although the parameter count increases, the improved network is more stable and more resistant to interference from the various disturbing factors in tower crane operation than the original backbone network.

4.5. Comparative Experiments with Other Algorithms

To further ensure the reliability of the results, YOLO-TC is trained and validated on this paper's dataset and compared with the object detection algorithms Retina-Net, Center-Net [24], YOLOv5, YOLOX, YOLOv7, and YOLOv8. The comparative results are shown in Table 3.
The data in the table show that YOLO-TC surpasses the accuracy of the existing object detection algorithms in identifying both phone and cigarette objects. Compared with Center-Net, the strongest of the non-YOLO detectors evaluated, the mAP of the proposed algorithm is improved by 14.7%, and AP_C and AP_P are improved by 18% and 10.7%, respectively. Compared with YOLOX, which adopts an anchor-free design within the YOLO series, the mAP is improved by 10.5% and AP_C and AP_P by 10.7% and 7.6%, respectively; compared with YOLOv7, which has higher detection accuracy, the mAP is improved by 3.7%. Our model does not outperform YOLOv8 in parameter count or inference speed, but it improves the mAP by 1.7% over YOLOv8, and for small objects such as cigarettes it achieves an AP improvement of 2.5%. Although the proposed method is not ideal in terms of parameter count (38.67 × 10⁶) and FPS (64), it maintains a good balance between detection accuracy and speed, giving it competitive overall performance. The precision–recall curve and the visualized confusion matrix of YOLO-TC are shown in Figure 10.

4.6. Model Generalizability Analysis

To further assess the generalizability of YOLO-TC, we performed experiments on the PASCAL VOC dataset and evaluated the model with the mAP metric to demonstrate its detection performance on ordinary small objects. Training on PASCAL VOC used batches of 80 images and the SGD optimizer for 300 epochs; the learning rate was gradually increased from 0 to 0.015, and the momentum and weight decay were set to 0.9 and 5 × 10⁻⁴, respectively. The experimental results are described in detail below. The per-class average precision of YOLOv7, YOLOv8, and YOLO-TC on PASCAL VOC is shown in Figure 11:
The PASCAL VOC dataset contains many objects with small sizes, weakly distinctive features, and small pixel areas, making it suitable for testing a detector's small-object capability. The average precision comparison shows that YOLO-TC improves the detection accuracy of several small object classes on PASCAL VOC: its average precision is 2.2 percentage points higher than that of YOLOv7 and 0.7 percentage points higher than that of the more advanced YOLOv8. This indicates that YOLO-TC identifies and localizes small objects more accurately, enhancing the detection of these challenging targets. Its ability to detect more small-scale objects suggests that YOLO-TC generalizes well to small object detection tasks and can be applied in a wider range of practical scenarios.

4.7. Example Analysis of Testing

To demonstrate the enhanced algorithm’s detection efficacy more clearly, we compare the detection outcomes of YOLO-TC and YOLOv7 on the identical dataset to examine the specific differences in detection performance between the two models.
The visualized comparison is shown in Figure 12: Figure 12a shows the original labels annotated with the LabelImg tool, Figure 12b the detection results of YOLOv7, and Figure 12c the detection results of YOLO-TC. The labels in the images mark the detected objects, and the numbers adjacent to the labels denote the confidence of the detection boxes. A rectangular box marks a missed detection, and an elliptical box marks a false detection. As Figure 12b shows, the original YOLOv7 algorithm produces object misdetections, missed detections, and background false positives when processing real tower crane cab surveillance images with complex backgrounds. As Figure 12c shows, the improved algorithm strengthens feature extraction and combines multi-scale information, improving the model's ability to identify and localize objects. It shows clear advantages in detecting phone and cigarette objects, effectively reduces the common problems of missed and false detections, and improves localization accuracy and feature extraction for small objects across all object classes. The detection accuracy for each class increases to a reasonable level, providing more reliable technical support for applications in related fields.
In this research study, we utilized video data from 2021, selected for its adequacy in supporting the detection of small objects within operational settings. While the data may not represent the most current conditions, it reflects the necessary balance between availability and the specific needs of our study. We have reviewed the current legislation and found no significant changes that could impact the validity of our data. For completeness, future research should consider using more recent datasets to verify the persistence of our findings and adapt to any legislative or operational changes.

5. Conclusions

The YOLO-TC detection model has been developed, rigorously tested, and verified to identify tiny objects, such as phones and cigarettes, in tower crane operational contexts. A dataset was generated by acquiring real-world surveillance images of tower cranes to improve the model’s adaptability to various detection environments. A novel channel–spatial attention module, the ECA-CBAM, was introduced and integrated into the backbone network to enhance model efficiency and focus on the spatial characteristics of small objects. Additionally, the HA-PANet architecture was designed by incorporating the AFPN concept within the neck network to address semantic discrepancies across multiple feature layers. The detection head was restructured to reduce computational load while improving small object recognition capabilities, and the MPDIoU loss function was optimized to further refine the baseline model.
YOLO-TC achieves high-precision, real-time identification of small objects, such as phones and cigarettes, in tower crane operations, showing excellent generality and robustness. This model significantly contributes to improving site safety and production management. While the enhanced detection model achieves higher object identification accuracy, it also increases the number of model parameters and computational demands. Future work should focus on making the model more lightweight, ensuring scalability across different detection scenarios, and improving real-time performance without compromising detection accuracy. Furthermore, given the model's potential, its applicability could be expanded to other domains, such as industrial automation and surveillance in different environments, allowing broader deployment in safety-critical applications.

Author Contributions

Conceptualization, D.D. and R.Y.; methodology, D.D. and R.Y.; validation, D.D. and R.Y.; formal analysis, D.D. and R.Y.; investigation, D.D. and R.Y.; writing—original draft and visualization, D.D. and Z.D.; supervision, Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Guangxi Science and Technology Project (No. AB22035052), the Guangxi Key Laboratory of Image and Graphic Intelligent Processing Project (No. GIP2308), and the Innovation Project of GUET Graduate Education (No. 2023YCXS062).

Data Availability Statement

There are no restrictions on the sharing of relevant data in this study.

Acknowledgments

The authors thank the Special Issue editors and anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Redmon, J. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
2. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
3. Redmon, J. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
4. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
5. Ge, Z. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
6. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-YOLO: An Effective and Efficient Implementation of Object Detector. arXiv 2020, arXiv:2007.12099.
7. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
9. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
10. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
11. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
13. Zhang, Y.; Zhuo, L.; Ma, C.; Zhang, Y.; Li, J. CTA-FPN: Channel-Target Attention Feature Pyramid Network for Prohibited Object Detection in X-ray Images. Sens. Imaging 2023, 24, 14.
14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
15. Yang, Z.; Yuan, Y.; Zhang, M.; Zhao, X.; Zhang, Y.; Tian, B. Safety Distance Identification for Crane Drivers Based on Mask R-CNN. Sensors 2019, 19, 2789.
16. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
17. Kang, S.; Wang, H. Crane Hook Detection Based on Mask R-CNN in Steel-Making Plant. J. Phys. Conf. Ser. 2020, 1575, 012151.
18. Luo, H.; Wang, M.; Wong, P.K.Y.; Cheng, J.C. Full Body Pose Estimation of Construction Equipment Using Computer Vision and Deep Learning Techniques. Autom. Constr. 2020, 110, 103016.
19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
20. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
21. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2184–2189.
22. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662.
23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
24. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
Figure 1. ECANet structure.
Figure 2. AFPN structure.
Figure 3. YOLO-TC network structure.
Figure 4. E-CAM.
Figure 5. ECA-CBAM structure.
Figure 6. HA-PANet structure.
Figure 7. Detection head reconstruction.
Figure 8. Partial dataset images.
Figure 9. Precision–recall curves for attention mechanisms. (a) CBAM. (b) ECA-CBAM.
Figure 10. YOLO-TC accuracy–recall curve and visualized confusion matrix. (a) Accuracy–recall curve. (b) Confusion matrix.
Figure 11. YOLO-TC vs. YOLOv7 and YOLOv8 per-class mAP on the PASCAL VOC dataset.
Figure 12. Comparison of detection results before and after improvement. (a) Original picture. (b) Results of YOLOv7. (c) Results of YOLO-TC.
Table 1. Results of comparative experiments on attention mechanisms.

| Models | Params/10⁶ | FLOPs/10⁹ | mAP@0.5/% ↑ | FPS ↑ |
|---|---|---|---|---|
| YOLOv7 | 36.54 | 104.6 | 85.1 | 87 |
| YOLOv7 + CAM | 42.61 | 129.4 | 85.7 | 53 |
| YOLOv7 + E-CAM | 38.10 | 122.4 | 86.3 | 67 |
| YOLOv7 + CBAM | 42.62 | 129.6 | 86.8 | 45 |
| YOLOv7 + ECA-CBAM | 38.10 | 122.4 | 87.4 | 67 |
Table 2. Results of ablation experiments.

| YOLOv7 | ECA-CBAM | HA-PANet | Detection Head Reconstruction | MPDIoU | Params/10⁶ | mAP@0.5/% ↑ |
|---|---|---|---|---|---|---|
| √ | | | | | 36.54 | 85.1 |
| √ | √ | | | | 38.10 | 87.4 |
| √ | | √ | | | 39.27 | 86.1 |
| √ | | | √ | | 33.47 | 85.9 |
| √ | | | | √ | 36.54 | 85.5 |
| √ | √ | √ | | | 43.32 | 87.9 |
| √ | √ | √ | √ | | 38.67 | 88.5 |
| √ | √ | √ | √ | √ | 38.67 | 88.8 |
Table 3. Experimental results of comparison with other algorithms.

| Models | AP_C/% ↑ | AP_P/% ↑ | mAP@0.5/% ↑ | Params/10⁶ | FPS |
|---|---|---|---|---|---|
| Retina-Net | 52.1 | 87.1 | 67.1 | 37.30 | 34 |
| Center-Net | 60.6 | 88.3 | 74.1 | 32.00 | 80 |
| YOLOv5 | 66.5 | 90.9 | 77.7 | 6.60 | 65 |
| YOLOX | 67.9 | 91.4 | 78.3 | 8.27 | 62 |
| YOLOv7 | 73.5 | 96.2 | 85.1 | 36.54 | 87 |
| YOLOv8 | 76.1 | 98.7 | 87.1 | 29.14 | 101 |
| Ours | 78.6 | 99.0 | 88.8 | 38.67 | 64 |
