1. Introduction
In recent years, online transactions have rapidly become a mainstay of the shopping industry. In particular, the Coronavirus disease 2019 (COVID-19) pandemic significantly changed shopping behavior, with people preferring to purchase all kinds of essentials, including groceries, food, and clothing, online [1]. Even though the pandemic is almost over, many newcomers to online shopping have expressed continued interest in it [2].
Due to the growth of the e-commerce market, the increase in online shoppers, and a severe labor shortage, the demand for autonomous delivery mobile robots is rapidly increasing. In line with this trend, the autonomous delivery mobile robot market is expected to reach a new turning point, growing at a CAGR of 20.4% from 2021 to 2026 [3]. In addition, various other fields require autonomous mobile robots [4,5,6]. Therefore, in the near future, numerous mobile robots are expected to drive on traffic roads to deliver groceries, food, parcels, and other necessities.
In order for the era of autonomous mobile robots to become a reality, it is important for robots to accurately perceive their surroundings on roads with traffic. Specifically, to enable mobile robots to operate autonomously on traffic roads, the real-time identification of traffic lights is of utmost importance.
Several works have been proposed for this purpose, which can be classified into one-stage systems [7,8,9,10] and two-stage systems [11,12,13,14,15]. A one-stage system detects the location and state of traffic lights using a single network. In [7], the authors utilized modified AlexNet networks to analyze detection performance. The works in [8,9,10] exploited SSD [16], R-CNN [17], and YOLOv3 [18] to detect the location and state of traffic lights. On the other hand, a two-stage system sequentially performs two steps: first finding the location of the traffic light and then checking the signal status of the detected traffic light. In [11], the authors used a convolutional neural network (CNN) that generates saliency maps to detect traffic lights. The authors of [13] utilized YOLO [19] to locate traffic lights, with a CNN used for traffic light status recognition. The works in [12,14] adopted a histogram of oriented gradients (HOG) for the detection of traffic lights and also used a CNN for checking the status of traffic signals. In [15], the proposed system used deep learning networks for all of its components to detect traffic signs.
However, achieving good recognition performance with a one-stage system can be difficult due to its network structure. In a one-stage system, a single network predicts the location and state of traffic lights simultaneously, while a two-stage system consists of two separate networks, one for each task. Using a single network to perform two different tasks simultaneously may hinder performance in the one-stage approach. In fact, the work in [8] reported low precision and recall for red and yellow signals. Previous works on one-stage systems also have further limitations.
Table 1 summarizes the comparison between our work and related studies. As shown in Table 1, the system in [7] can only recognize a limited number of traffic light states, while the works in [9,10] provide limited evaluation results for the traffic light state recognition (TLSR) networks.
The two-stage system is composed of two networks, where the first network identifies the location of traffic lights and the second predicts their state. As shown in Table 1, all previous works on two-stage systems [11,12,13,14] built the network that predicts traffic light states as a basic CNN constructed by sequentially stacking multiple convolutional layers. Recently, however, several advanced network structures have been developed [20,21]. By utilizing their features to design a new network architecture, we can improve recognition performance while reducing the number of weight parameters. Indeed, the TLSR module proposed in this paper utilizes these advanced network structures [20,21], leading to improved recognition performance. Additionally, we conducted a comprehensive evaluation of both the proposed model and previous works using a larger number of traffic light states.
In addition, for the network that detects the location of traffic lights, all the aforementioned studies provide limited evaluations, which do not take into account the trade-off between detection performance and processing time across different input image sizes and backbone network types. The traffic light detection (TLD) module should provide good detection performance while ensuring real-time operation. As the size of the input image increases, detection performance improves, but processing speed decreases. Therefore, extensive evaluation is required to determine the input image size and backbone network type suitable for real-time operation. In this work, we conduct such evaluations with varying input image sizes and backbone network types.
In this paper, we revisit the implementation of a real-time traffic light recognition system from a two-stage viewpoint, comprising a TLD module and a TLSR module. The contributions of this work are summarized as follows:
To improve the recognition performance of the TLSR module and reduce the number of its weight parameters, we propose a lightweight and effective network using advanced features [20,21], whereas existing research utilized a simple CNN that stacks convolutional layers sequentially. Through evaluation, we confirmed that the proposed network architecture shows superior recognition performance for seven traffic signals compared with the simple CNN used in [11,12,13,14], while using fewer weight parameters. The reduction in weight parameters allows the proposed model to be used on devices with limited resources. Furthermore, through evaluations, this work also shows how the use of the advanced features affects recognition performance.
Additionally, in order to enhance the TLSR’s performance, we propose to utilize a ratio-preserving zero padding (RZP) approach for data preprocessing. This approach maintains the aspect ratio of traffic lights in images, preventing distortions that can arise from scaling.
By using the determined system parameters for the TLD module and the proposed network architecture trained with RZP for the TLSR module, the real-time traffic light recognition system is demonstrated using actual videos recorded in Gumi and Daegu, Korea. This demonstration shows that the system operates in real time at over 30 FPS while stably detecting traffic lights and identifying their states.
The rest of this paper is organized as follows. The design of our system is presented in Section 2, where the two essential system modules, the TLD module and the TLSR module, are outlined. In Section 3, to propose a lightweight network structure for the TLSR module, we conduct a comprehensive evaluation comparing various network architectures, with and without ratio-preserving zero padding, using performance metrics such as HF score–confidence curves, average HF scores, trainable parameters, and the number of convolutional layers. For the TLD module, we assess performance over various input image sizes and network types using metrics such as precision–recall curves, mean average precision, and frames per second. Based on these results, we determine system parameters, such as input image size and network type, and then present the demonstration results. Section 4 discusses the necessity of an on-board real-time traffic light recognition system and the rationale behind our main design choices. Finally, Section 5 concludes the paper.
3. Evaluation of TLD and TLSR Modules and Demonstration
To achieve real-time operation, a system should satisfy a minimum requirement of 30 frames per second (FPS). It is also essential to maximize the system's detection performance while meeting this requirement. However, there is a trade-off between FPS and detection performance, so a balance between the two must be struck.
For the TLD module, the trade-off arises due to input image size and network depth. Increasing the size of an input image can improve detection performance, but it also results in longer network processing time and consequently reduces FPS. Furthermore, increasing the depth of a network leads to improved detection performance; however, it results in increased processing time.
As a result, the size of an input image and the depth of a backbone network are system parameters for the TLD module. In this section, based on comprehensive evaluations, we determine suitable system parameters for real-time operation.
For the TLSR module, a lightweight and effective network structure is selected from the candidate network structures introduced in
Section 2.1.1. All the network architectures are compared in terms of HF score, number of trainable network parameters, and number of convolutional layers. The HF score is a metric introduced in this work, which is a harmonic mean between the F1 score of each traffic signal class and the F1 score of the correctness of a traffic light detected by the TLD module.
3.1. Performance Metrics
In order to evaluate the performance of the TLD module, we utilize the precision–recall curve obtained by varying the confidence threshold, which is the minimum value required to detect the presence of a traffic light. Additionally, the metrics of average precision (AP) and frames per second (FPS) are utilized.
To evaluate the performance of the TLSR module, we present the HF score, which is the harmonic mean of the F1 scores for recognizing the state of each traffic signal class and for filtering out non-traffic-light images. The HF score is calculated as

$\mathrm{HF}_i = \dfrac{2\, F_{1,i}\, F_{1,\mathrm{flt}}}{F_{1,i} + F_{1,\mathrm{flt}}}$,

where $F_{1,i}$ is the F1 score of the $i$-th traffic light signal, $F_{1,\mathrm{flt}}$ is the F1 score for filtering out non-traffic-light images, and $\mathrm{HF}_i \in [0, 1]$. As $\mathrm{HF}_i$ approaches 1, the TLSR module shows good performance in both its recognizing and filtering capabilities.
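For concreteness, the HF score can be computed as in the short sketch below; the function name and example values are illustrative, not taken from our implementation.

```python
def hf_score(f1_signal: float, f1_filter: float) -> float:
    """Harmonic mean of the i-th signal's recognition F1 score and the
    F1 score for filtering out non-traffic-light images."""
    if f1_signal + f1_filter == 0:
        return 0.0
    return 2.0 * f1_signal * f1_filter / (f1_signal + f1_filter)

# Example: strong recognition (F1 = 0.95) with weaker filtering (F1 = 0.80)
# is penalized by the harmonic mean, yielding HF ~ 0.87.
print(round(hf_score(0.95, 0.80), 2))
```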
As stated in Section 2.1.2, the TLSR module can filter out images incorrectly cropped by the TLD module: if every element of the network's output vector falls below a certain threshold, the crop is treated as containing no traffic light. When a network is trained to improve its filtering performance, however, its ability to classify the status of each traffic signal may decrease. This is due to how an activated traffic signal is identified: the $i$-th traffic signal is predicted as activated only if its output value reaches the threshold. Because of this trade-off, an evaluation that considers both aspects simultaneously is essential, and the HF score allows us to evaluate the candidate network structures by jointly considering their ability to filter out images without traffic lights and their performance in recognizing traffic light states.
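To make this rule concrete, the sketch below shows one plausible implementation of the joint activation/filtering decision; the threshold value and the seven-class layout are assumptions for illustration, not our exact implementation.

```python
import numpy as np

def recognize_states(outputs: np.ndarray, threshold: float = 0.5):
    """outputs: per-signal scores produced by the TLSR network for one crop.
    Returns None when every score falls below the threshold (the crop is
    filtered out as a non-traffic-light image); otherwise returns the
    indices of all signals predicted as activated."""
    activated = np.flatnonzero(outputs >= threshold)
    return None if activated.size == 0 else activated

# Hypothetical scores for seven signal classes (e.g., red + left activated):
scores = np.array([0.91, 0.04, 0.02, 0.88, 0.01, 0.03, 0.02])
print(recognize_states(scores))        # -> [0 3]
print(recognize_states(scores * 0.1))  # -> None (filtered out)
```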
In order to evaluate the efficiency of the TLSR module, we consider two factors: the number of trainable parameters and floating point operations (FLOPs), where FLOPs is a metric that represents the computational complexity of a network.
3.2. Training Details
To train both the TLD and TLSR modules, we utilized the stochastic gradient descent (SGD) optimizer with weight decay and momentum, alongside a cosine scheduler that anneals the learning rate from an initial to a final value; the same scheduling strategy was applied to both modules. The training and demonstration are conducted using an NVIDIA GeForce RTX 4090. For the TLD module, before training with the traffic light dataset, we perform a pretraining procedure with the COCO and ImageNet datasets [39,40].
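A minimal PyTorch sketch of this training configuration is shown below; since the exact hyperparameter values are specific to our setup, the numbers here are placeholders, and the stand-in model is for illustration only.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3)  # stand-in for the TLD/TLSR network

# Placeholder hyperparameters; not the exact values used in our experiments.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,             # initial learning rate
    momentum=0.9,
    weight_decay=5e-4,
)
num_epochs = 100
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-5,  # anneal toward the final rate
)

for epoch in range(num_epochs):
    # ... one pass over the traffic light dataset would go here ...
    scheduler.step()
```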
In order to determine system parameters for the TLD module and a network structure suitable for the TLSR module, both modules are trained with the traffic light dataset provided by the Electronics and Telecommunications Research Institute (ETRI), one of the most influential national research institutions in Korea [41]. This dataset consists of 10k images obtained from real-life road environments. The limited dataset allowed quicker testing of various models across various conditions, while also providing enough data for training large networks; this approach is often used in the deep learning field, as in [42]. For training the TLSR module, we intentionally crop areas lacking traffic lights and use them to enhance the TLSR module's ability to filter out traffic light images erroneously detected by the TLD module. After selecting suitable parameters and a network architecture, for the demonstration, both modules are trained using 230k additional images, collected in Seoul and nearby cities in Korea.
3.3. Network Structure Suitable for TLSR Module
3.3.1. Comparison of HF Scores with Different Network Structures
Figure 5 shows the HF scores of the seven traffic light signals at different confidence levels. Note that all the existing works [11,12,13,14] on two-stage systems use the 'CONV' architecture, and thus 'CONV' serves as the baseline for the performance comparison.
For the most frequent signals, namely red, yellow, green, and left, it can be observed from Figure 5a–d that the 'CONV' and 'CONV-FIT' network designs show lower HF scores than the other networks for almost all confidence values, and neither of them outperforms any of the other architectures.
In Figure 5e, the 'FPN-FIT', 'FPN-RES', and 'FPN-RES-FIT' structures outperform the other architectures in the top-left case, while 'FPN-FIT' and 'FPN-RES-FIT' are better than the other networks in Figure 5f. For the top-right signal in Figure 5g, the 'FPN' and 'FPN-RES-FIT' architectures exhibit superior performance compared with the others. These findings suggest that the feature pyramid structure in [20] has a significant impact on improving performance. In addition, the 'FPN-RES' architecture shows outstanding performance in Figure 5a–d, confirming that the skip connection from [21] also yields significant performance improvements.
Table 2 presents the average of the HF scores across all traffic signals and trainable parameters. The ‘FPN-RES-FIT’ architecture shows the highest average HF score with 27% fewer weight parameters compared with the ‘CONV’ network design. Additionally, the computational requirement of the ‘FPN-RES-FIT’ network structure is 0.30K FLOPs, while that of ‘CONV’ is 0.39K FLOPs. Because of its small number of weight parameters and low computational complexity, its processing time is negligible. As a result, the ‘FPN-RES-FIT’ network design is the most appropriate for the TLSR module.
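The trainable-parameter counts in Table 2 can be reproduced for any candidate architecture with a generic helper such as the one below (FLOPs are usually obtained with a separate profiling tool); the stand-in network is illustrative only.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Number of weight parameters updated during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with a small stand-in network (not one of the candidates):
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 7, 1))
print(count_trainable_params(net))  # -> 567
```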
Remark: The results of the 'FPN-RES-FIT' network structure warrant a closer look. The 'FPN-RES-FIT' design shows generally better performance, but it is slightly inferior to the 'FPN-RES' and 'FPN-FIT' networks for the top-left and bottom-left signals in Figure 5e,f. Applying 'RES' or 'FIT' to 'FPN' individually improves performance; however, applying them simultaneously yields a slightly smaller improvement than applying them separately. The reason can be explained as follows. As shown in Figure 2e, in order to apply 'RES' and 'FIT' simultaneously, a pooling operation must be implemented in the shortcut path. This pooling is applied along the width and height dimensions, which reduces spatial information in the resulting feature maps. Due to the relatively higher number of classes associated with the left direction, this reduction may slightly impair the network's ability to recognize the top-left and bottom-left signals. In future work, based on this insight, we plan to develop a shortcut block that can effectively combine 'RES' and 'FIT'.
3.3.2. Comparison of HF Scores with and without Ratio-Preserving Zero Padding
This section examines the impact of the RZP method on the ‘FPN-RES-FIT’ network’s performance.
Figure 6 presents the HF scores of six traffic signals with and without the RZP method. At all confidence levels, the network trained with the RZP method achieves superior performance compared with the one without it. Consequently, maintaining the aspect ratio of traffic lights significantly increases recognition performance.
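As an illustration, a minimal version of the RZP preprocessing could look like the following sketch using OpenCV; the target input size, the centered placement of the crop, and the assumption of 3-channel images are illustrative choices rather than the exact implementation.

```python
import cv2
import numpy as np

def ratio_preserving_zero_pad(crop: np.ndarray, size: int = 64) -> np.ndarray:
    """Fit a cropped traffic light into a size x size square without changing
    its aspect ratio; the area left over after scaling is zero-padded."""
    h, w = crop.shape[:2]
    scale = size / max(h, w)
    new_w = max(1, int(round(w * scale)))
    new_h = max(1, int(round(h * scale)))
    resized = cv2.resize(crop, (new_w, new_h))
    canvas = np.zeros((size, size, 3), dtype=crop.dtype)  # assumes 3 channels
    y0, x0 = (size - new_h) // 2, (size - new_w) // 2     # center the crop
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas
```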
Figure 7 shows the precision–recall (PR) curves of the TLD module for different backbone types and input sizes. With the intersection over union (IoU) threshold of non-maximum suppression (NMS) set to 0.5, Figure 7a,b shows the PR curves for 30 confidence thresholds between 0 and 1. Figure 7a presents detection performance for input sizes of 416, 640, 960, 1280, and 1920 with the backbone fixed to resnet-18, while Figure 7b shows detection performance for input sizes of 640, 960, and 1280 with the backbone changed from resnet-18 to resnet-34. For a detailed view, Figure 7c,d enlarges specific parts of Figure 7a,b. The labels 'RES-18' and 'RES-34' in the figure legends indicate results obtained with resnet-18 and resnet-34, respectively.
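For reference, the IoU measure that drives the NMS step can be computed as below; boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two overlapping detections of the same traffic light:
print(iou((10, 10, 50, 90), (20, 10, 60, 90)))  # -> 0.6, suppressed at 0.5
```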
3.4. System Parameters for TLD Module: Input Image Size and Backbone Network Type
3.4.1. Comparison of Detection Performance with Different Input Image Sizes and Backbone Types
Table 3 and Table 4 show the average precision (AP) and FPS, where AP is the area under the PR curve and FPS is the number of frames processed per second by the TLD module. In the tables, AP@0.5 is the result of setting the IoU threshold to 0.5, and AP@0.5:0.95 is the result of averaging the APs measured while varying the IoU threshold from 0.5 to 0.95.
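As a sketch of how AP is obtained from the sampled PR points, the helper below integrates precision over recall with the trapezoidal rule; benchmark toolkits typically use an interpolated variant, so the values here are illustrative.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the PR curve, with points sorted by increasing recall."""
    order = np.argsort(recall)
    r, p = recall[order], precision[order]
    # Trapezoidal rule over recall
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# PR points collected at a few confidence thresholds (illustrative values):
r = np.array([0.0, 0.5, 0.8, 0.95, 1.0])
p = np.array([1.0, 0.99, 0.97, 0.90, 0.50])
print(round(average_precision(r, p), 3))  # -> 0.967
```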
As shown in Figure 7a–d, for image sizes from 960 to 1920, the area under the PR curve is almost 1, while the areas for 416 and 640 are smaller. In Table 3 and Table 4, the AP@0.5 results show that detection performance increases with input size. On the other hand, for AP@0.5:0.95, detection performance increases up to an input size of 960 and then decreases. Table 3 shows that FPS decreases as the input size increases; in particular, for input sizes larger than 1280, the FPS falls below 30, the minimum requirement for real-time operation. Table 4 confirms that the processing speed of resnet-18 is 46.7%, 48.3%, and 46.1% faster than that of resnet-34 for input sizes of 640, 960, and 1280, respectively.
Finally, Figure 8 shows the resulting images for input sizes of 640 and 960. Small traffic lights within the same image are detected better with an input size of 960 than with 640, which confirms that larger input images yield better detection performance; the same holds when multiple traffic lights are present. As a result, our extensive evaluations confirm that there is a trade-off between detection performance and processing time. Since the minimum requirement for real-time operation is 30 FPS, we set the input size and backbone type for the TLD module to 960 and resnet-18, respectively.
3.4.2. Remark
In this paper, we focus on investigating the trade-off between detection performance and processing time for the TLD module. In particular, the TLD module is constructed following YOLOv5. The recent work in [15] also uses YOLOv5, to detect traffic signs rather than traffic lights. Although that study targets different objects, it shows that YOLOv5 outperforms YOLOv3 and YOLOv4 when detecting objects in a driving environment. Furthermore, through our extensive evaluation, we confirmed that the average precision of the TLD module is almost 1 for input image sizes of 960, 1280, and 1920 on the ETRI dataset. Therefore, we believe it is appropriate to evaluate the trade-off only for the TLD module we established.
3.5. Demonstration
Finally, we demonstrate our real-time traffic light recognition system through videos recorded on various traffic roads in Gumi and Daegu, Korea, which are not the cities where the training dataset was collected. Based on the evaluations, the TLD module uses an input image size of 960 and a resnet-18 backbone, while the 'FPN-RES-FIT' network trained with RZP is used for the TLSR module. This setting provides the best performance while ensuring the minimum requirement of 30 FPS (if the input image size is 1280, the FPS for the whole procedure falls below 30). The system for this demonstration runs on a GeForce RTX 4090, using 2.06 MB of memory for the TLSR module and 2.317 GB for the TLD module.
Figure 9 shows several resulting video frames, including the bounding boxes and states of detected traffic lights, along with the FPS of the two-stage system. Additionally, the location of each traffic road is marked on Google Maps and displayed in the lower-left corner. The resulting images confirm that traffic lights of various angles and sizes can be detected and that various types of traffic lights can be recognized in an actual driving environment. More video results can be found at the following link:
https://youtu.be/mQA8THT7OmE?si=-chjWc8h9_zc6Xuk, accessed on 30 January 2024.
4. Discussion
4.1. Necessity of Real-Time Traffic Light Recognition System
We can consider how vehicles might use big data about traffic lights. Two cases are possible: (1) the big data reside on remote servers and are provided to vehicles through wireless communication, and (2) the big data are imported into the vehicles themselves.
In the first case, to recognize the status of the traffic lights ahead, vehicles need to be connected to a remote server through vehicular networks such as VANETs (Vehicular Ad hoc Networks) or infrastructure networks such as LTE and 5G. However, the performance of wireless communication, such as throughput and delay, can vary: as the number of vehicles using wireless communication increases, throughput decreases and delay increases. Therefore, when wireless communication performance is poor, the vehicle has to determine the status of the traffic lights ahead without information from the server.
In the second case, vehicles can identify the status of traffic lights by utilizing the big data stored in them. However, they would not be able to handle changing situations, such as dynamic variation in traffic light flashing patterns. In Korea, to maintain smooth traffic flow, flashing patterns can vary dynamically based on real-time traffic conditions, such as congestion. Because traffic flow varies differently in every situation, it is difficult to include every possible case in stored data, making real-time management challenging with imported data alone.
As a result, there is a need to develop technology that can be implemented in vehicles and that allows vehicles to independently determine the status of the traffic light ahead.
4.2. Evaluation under Different Weather Conditions
In the field of computer vision, performance can be affected by weather and lighting conditions, including brightness, darkness, cloudiness, and rain. Figure 10 shows that the system developed in this work is capable of detecting and recognizing the location and state of traffic lights in various conditions, such as during sunset and nighttime, in rainy weather, and under cloudy skies. The results show that the system can also operate under different lighting conditions. Unfortunately, however, the training dataset does not have labels for weather conditions, so it is difficult to determine the amount of data for each condition; a brief check shows relatively few data for the above conditions. In future work, we plan to study various weather conditions by collecting additional data and addressing the imbalance between them.
4.3. Why Should a YOLO-Style Network Be Used for the TLD Module?
In the field of object detection, there are two approaches: one-stage detectors and two-stage detectors. Although two-stage detectors show better detection performance than one-stage detectors, their slower processing speeds make them unsuitable for real-time systems. Additionally, the vision transformer, which applies the transformer architecture to image classification, has recently been proposed; however, it has limitations such as high processing time and high memory requirements due to its large number of weight parameters and high computational complexity. Therefore, it is not yet suitable for real-time object detection systems.
The YOLO neural network family is a representative one-stage detector, while Faster R-CNN [43] is a representative two-stage detector. In particular, the YOLO family is known for its exceptional balance between accuracy and speed, allowing the rapid and dependable identification of objects in images. Therefore, in this work, we utilize a YOLO-style network for real-time traffic light detection.
4.4. Why Is 30 FPS Considered the Real-Time Operation Criterion in This Work?
In Korea, the speed limit for vehicles is 50 km/h in urban areas and 30 km/h in school zones, which translates to approximately 13.9 m/s and 8.3 m/s, respectively. At 30 FPS, 5 frames elapse in about 0.17 s, during which a vehicle travels roughly 2.3 m and 1.4 m, respectively. If the FPS decreases, this distance increases, resulting in longer reaction times, and longer reaction times could increase the risk of accidents. Additionally, commercial cameras typically run at 30 FPS. Therefore, this work adopts 30 FPS as the criterion for real-time operation.
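The distances above follow from simple kinematics, as the short computation below reproduces for any frame rate.

```python
def distance_over_frames(speed_kmh: float, fps: float, frames: int = 5) -> float:
    """Distance in meters a vehicle travels while `frames` frames elapse."""
    return (speed_kmh / 3.6) * (frames / fps)  # km/h -> m/s, then t = n / FPS

for speed in (50, 30):  # urban and school-zone speed limits in km/h
    print(f"{speed} km/h at 30 FPS: {distance_over_frames(speed, 30.0):.1f} m")
# -> 50 km/h: 2.3 m, 30 km/h: 1.4 m over 5 frames
```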
5. Conclusions and Future Works
In this paper, we revisit the real-time traffic light recognition system, which is a two-stage system comprising TLD and TLSR modules. Then, we propose a lightweight and effective network architecture for the TLSR module, introduce a data preprocessing approach to improve TLSR performance, and at the same time, determine the suitable system parameters of the TLD module through a comprehensive evaluation.
For the TLSR module, our objective was to construct an effective and lightweight network that has little impact on the overall FPS of the two-stage system. We evaluated various network structures and proposed a design that uses skip connections, multiple feature maps, and filter sizes customized to the size of the incoming feature maps. The proposed design achieves an average HF score close to 1 while using few weight parameters. In addition, we introduced ratio-preserving zero padding for the traffic light images detected by the TLD module, which improves the recognition performance of the TLSR module.
In the TLD module, to achieve a balance between detection performance and FPS for real-time operation, we conducted a comprehensive evaluation that varied system parameters such as the input image’s size and the backbone network type. We compared precision–recall curves, average precision, and FPS to determine a suitable pair that guarantees at least 30 FPS with adequate detection performance.
Finally, we demonstrated the system constructed with the proposed network structure, the determined system parameters, and ratio-preserving zero padding. The demonstration shows that our real-time traffic light recognition system can reliably detect traffic lights. In particular, when the vehicle is close to a traffic light, the system continuously detects it and accurately identifies its status, proving the system's reliability.
For future work, we plan to improve our real-time system to enable the continuous detection of traffic lights over considerable distances. Although our system can find distant traffic lights, it tends to detect them less continuously than nearby ones. Therefore, we will develop a TLD module that takes multiple consecutive video frames as input, enabling it to focus on areas where traffic lights appear across frames rather than operating independently on each frame. For this development, we plan to consider one of the recent object detection networks, such as YOLOv8.