1. Introduction
Object recognition and detection have long been important problems for many industries, including the medical field, the security industry, and the transportation sector. The issue, however, takes on a greater level of significance for people who are blind or have some other form of visual impairment. According to estimates by the World Health Organization (WHO), at least 2.2 billion people globally have some form of near or distance vision impairment, with associated annual productivity losses of around $411 billion [1]. People who are visually impaired face a wide variety of difficulties on a daily basis, ranging from difficulties with mobility and orientation to difficulties accessing information and services. One of the most significant daily issues is the use of public transportation, starting with the difficulty of getting to the destination, followed by the inconvenience and then safety concerns [2]. In the context of public bus transportation, people with visual impairment require information about their surroundings and any visible details at bus stops and terminals, such as schedules and routes, in order to use the system independently and safely. However, the majority of individuals with visual impairment face challenges in taking the correct bus and disembarking at the correct destination [3]. As a result, they may have to rely on others for transportation, limit their travel, or choose simpler activities that do not require extensive travel.
In recent years, the advancement of technology has enabled the development of more accurate and efficient object detection and recognition systems. Therefore, bus detection has been an area of active research. RFID and embedded systems are widely used for bus detection in intelligent transportation systems. These systems use RFID tags embedded in buses to communicate with stationary RFID readers at bus stops, enabling real-time bus detection and location tracking [4,5,6,7]. For instance, a study by Raj et al. [8] proposed an RFID-based bus detection system using passive RFID tags and RFID readers at bus stops. Another study proposed a mechatronic system architecture to improve transportation for people with visual impairment by combining RFID and wireless sensor networks [9]. However, RFID and embedded systems also have their limitations. These systems require specific hardware and can be expensive to install and maintain [10]. They can also be affected by environmental factors such as interference from nearby metal objects, signal reflections, and other sources of noise [11].
In contrast, deep learning-based methods, such as those using CNNs and object detection models, are becoming increasingly popular for object detection due to their high accuracy and real-time performance. These methods can accurately detect and recognize buses in real-world scenarios, even in complex urban environments, and they can run on mobile devices, making them accessible to individuals with visual impairment. For example, Liang et al. [12] introduced a lightweight and energy-efficient deep learning system for detecting the mode of transportation using only the accelerometer sensors in smartphones. Another study compared two deep learning models, AlexNet and Faster R-CNN, for detecting vehicles in an urban video sequence; the tests produced positive results and provided important insights into the architectures and strategies used to implement these models for video detection [13].
Deep-learning-based object detection algorithms have made significant progress in both detection accuracy and speed. They fall into two categories: two-stage and one-stage algorithms. After feature extraction, a two-stage detection algorithm first generates region proposals that may contain the target to be detected, and then locates and classifies each proposal using a convolutional neural network; representative algorithms with high detection accuracy include R-CNN [14], Fast R-CNN [15], Faster R-CNN [16], and SPPNet [17]. In contrast to two-stage algorithms, which require candidate regions to be generated before positioning and classification can be performed, one-stage detection algorithms perform both operations simultaneously. The detection speed of algorithms such as YOLO [18], SSD [19], RetinaNet [20], and FCOS [21] is better than that of two-stage algorithms, but their accuracy is lower.
YOLOv5 is an object detection algorithm that belongs to the YOLO (You Only Look Once) family of algorithms. It was released in June 2020 and quickly gained attention in the computer vision community due to its improved speed, accuracy, and model size [22]. In terms of speed, YOLOv5 can detect objects in real time or near real time, meaning it can process a video stream frame by frame and return object detections almost instantly. The combination of fast speed and high accuracy makes YOLOv5 a popular choice for object detection tasks in various fields, including surveillance [23], robotics [24], and autonomous driving [25], applications in which quick and accurate object detection is crucial for the success of the system. Regarding fast object detection, Huawei's Noah's Ark Lab proposed a new end-to-end neural network architecture called GhostNet, aimed at improving detection speed, which was presented at CVPR 2020 [26]. The GhostNet paper proposed a Ghost module that can generate more feature maps from cheaper operations than conventional convolutions, thereby reducing the number of floating-point operations (FLOPs) in the network. The experimental results showed that the proposed Ghost module decreased the computational cost of general convolutional layers while maintaining similar recognition performance.
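As a concrete illustration of this idea, the following is a minimal PyTorch sketch of a Ghost module written from the description above; the kernel sizes and the 1:1 split between "primary" and "ghost" channels are illustrative assumptions rather than the exact GhostNet configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a small ordinary convolution produces
    'primary' feature maps, and cheap depthwise convolutions derive the
    remaining 'ghost' feature maps from them."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio       # channels from the costly convolution
        cheap_ch = out_ch - primary_ch     # channels from the cheap operations
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        # Depthwise convolution: one cheap linear map per primary channel
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```

Because the depthwise "cheap" branch costs far fewer FLOPs than a full convolution over all output channels, the module approximately halves the computation of the layer it replaces at a ratio of 2.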
Fast object detection and high accuracy are important factors to consider when developing a bus detection model, especially for people with visual impairment. A lightweight and high-accuracy model is desirable to ensure real-time performance and ease of use on mobile devices. To address this issue, this study proposes a bus detection model that is lightweight, accurate, and suitable for use on mobile devices. The proposed model aims to provide an efficient and reliable solution for individuals with visual impairment to detect and recognize buses in real time. By combining YOLOv5 and GhostNet, we can offer a promising solution for real-time bus detection. Our solution leverages YOLOv5’s high speed and accuracy and GhostNet’s ability to reduce computational costs. This combination can efficiently detect buses in real-time video streams from different angles and distances. Furthermore, our solution addresses the computational challenges associated with object detection by using GhostNet to reduce the computational burden of YOLOv5. This enables our solution to be deployed on resource-constrained devices, such as mobile phones or embedded systems. As a result, we can implement real-time bus detection on edge computing systems, without relying on cloud computing resources. Moreover, the SimSPPF module is added to the YOLOv5 backbone to enhance the computational efficiency and object detection capabilities. The Slim scale detection model is also developed by modifying the original YOLOv5 structure, making it more efficient and faster for real-time object detection applications.
In summary, our proposed combination has the potential to provide an efficient and accurate solution for real-time bus detection for individuals with visual impairments, as real-time bus detection can aid in navigation and improve accessibility for those with disabilities. Additionally, our solution can also be beneficial for smart city transportation systems and autonomous driving. The ability to reduce the computational costs and deploy on resource-constrained devices makes this combination an attractive option for edge computing systems.
The following bullet points summarize the key contributions of the research study, which proposes a lightweight and high-accuracy bus detection model:
We proposed a lightweight and high-accuracy bus detection model based on an improved YOLOv5 model;
We integrated the GhostConv and C3Ghost modules into the YOLOv5 network to reduce the number of parameters and floating-point operations (FLOPs), preserving the detection accuracy while reducing the model parameters;
We added the SimSPPF module to replace the SPPF in the YOLOv5 backbone for increased computational efficiency and accurate object detection (a minimal sketch of SimSPPF is given after this list);
We developed a Slim scale detection model by modifying the original YOLOv5 structure to make the model more efficient and faster, which is critical for real-time object detection applications;
The experimental results showed that the Improved-YOLOv5 outperformed the original YOLOv5 in terms of the precision and mAP@0.5, with comparable recall;
Further analysis of the model complexity revealed that the Improved-YOLOv5 was more efficient due to fewer FLOPs, fewer parameters, less memory usage, and a faster inference time;
The proposed model is smaller and more feasible to implement on resource-constrained mobile devices, making it a promising option for bus detection systems.
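As promised above, here is a minimal PyTorch sketch of the SimSPPF idea: the SPPF structure from YOLOv5 (three chained max-pools whose outputs are concatenated) with ReLU activations in place of SiLU, which is the simplification SimSPPF introduces. The channel split is an illustrative assumption, not necessarily the configuration used in this work.

```python
import torch
import torch.nn as nn

class ConvBnReLU(nn.Module):
    """1x1 conv + BN + ReLU; using ReLU instead of SiLU is what makes SPPF 'simplified'."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class SimSPPF(nn.Module):
    """SPPF with three chained 5x5 max-pools; concatenating the intermediate
    results approximates pooling at 5x5, 9x9, and 13x13 receptive fields."""
    def __init__(self, in_ch, out_ch, pool_k=5):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = ConvBnReLU(in_ch, hidden)
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.cv2 = ConvBnReLU(hidden * 4, out_ch)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        return self.cv2(torch.cat([x, y1, y2, self.pool(y2)], dim=1))
```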
The structure of this paper consists of four main sections, each covering essential topics related to the proposed model for bus detection. Section 2 elaborates on the principles of YOLOv5 and the Improved-YOLOv5, providing an in-depth discussion of the technical aspects of the proposed model. Section 3 focuses on the model training process and experimental results, highlighting the various experiments performed to evaluate the proposed model's performance. Finally, Section 4 concludes the presented work, summarizing the significant findings and contributions of the study and highlighting the strengths of the proposed model, including its efficiency, accuracy, and feasibility.
3. Experiment and Results
3.1. Datasets
The dataset used in this study was composed of images that depicted buses in various positions, orientations, perspectives, and environmental conditions. The data collection process was conducted meticulously to ensure high-quality data. The images were collected using a mobile phone camera, as well as from the internet. The purpose of this dataset was to aid people with visual impairment in identifying nearby buses, and as such, it was focused on bus detection. The dataset contained three main classes: bus, door, and route.
Figure 6 provides an example of an image in the dataset, which shows a bus during daylight conditions. The image contains one instance of a bus, two instances of doors, and one instance of a route. Each image in the dataset was labeled with bounding boxes that indicated the location and area of each instance. This labeling information was then utilized by the model to conduct object detection within the images.
It is worth emphasizing that the dataset was fully labeled: each image was annotated with bounding boxes that precisely identified the locations of the buses, doors, and routes present within the images. The dataset covered a variety of environmental conditions, including different lighting conditions, weather conditions, and backgrounds. Furthermore, it contained buses of different colors, shapes, and sizes, making it a diverse dataset that could be used to train object detection models that are robust and adaptable to various real-world scenarios.
The dataset consisted of 1602 images: 75% were used as the training set (1202 images), 20% as the validation set (320 images), and 5% as the testing set (80 images). The images were labeled using the roboflow.com website with the YOLOv5 PyTorch output format (.txt), with a total of 5148 annotations and an average of 3.2 annotations per image. The annotation counts per class were 1745 instances for bus, 2254 for door, and 1149 for route.
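For reference, YOLOv5's .txt label format stores one object per line as "class x_center y_center width height", with the box coordinates normalized to [0, 1] relative to the image size. The following minimal parser sketch illustrates the format; the class ordering is an assumption made for illustration.

```python
from pathlib import Path

CLASS_NAMES = ["bus", "door", "route"]  # assumed class-index order, for illustration

def read_yolo_labels(label_path):
    """Parse a YOLOv5 .txt label file: one 'cls xc yc w h' line per object,
    with box coordinates normalized to [0, 1] relative to the image size."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((CLASS_NAMES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes
```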
Figure 7a presents a graph that shows the number of annotations per class in the dataset.
Figure 7b shows the distribution of the bounding boxes in the dataset by visualizing the location and size of each bounding box. This helps in understanding how the bounding boxes are distributed and in verifying that there is enough variation in the position and size of the objects for the model to learn from.
Figure 7c,d show the statistical distribution of the position and size of the bounding boxes, illustrating how they are spread across the dataset. These plots help determine whether the bounding boxes are evenly distributed or whether certain positions and sizes are over-represented. This is important to ensure that the model will not have problems detecting objects in the image, because objects vary in size and position.
This study used the default data augmentation techniques from the YOLOv5 model, including HSV-hue augmentation, image translation, image scaling, left–right flipping, and mosaic augmentation, as part of the model training process. These techniques expanded the amount of data available during training, reduced the risk of overfitting, and strengthened the model's ability to detect objects in different environments. All images in the dataset were also preprocessed by auto-orienting and resizing to 640 × 640, with an average image size of 0.41 MP. Together with the even distribution of the bounding boxes, this helped to ensure that the model could properly detect objects of various sizes and positions.
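For reference, the sketch below expresses the named augmentations as the corresponding entries of a YOLOv5-style hyperparameter dictionary; the values shown are the repository's stock defaults and are assumptions insofar as this study may have tuned them.

```python
# YOLOv5-style augmentation hyperparameters for the techniques named above.
# Values are the stock YOLOv5 defaults (assumed here, since the study reports
# using the default augmentation pipeline).
hyp_augment = {
    "hsv_h": 0.015,    # HSV-hue augmentation (fraction of the hue range)
    "translate": 0.1,  # image translation (+/- fraction of the image size)
    "scale": 0.5,      # image scaling (+/- gain)
    "fliplr": 0.5,     # probability of a left-right flip
    "mosaic": 1.0,     # probability of mosaic augmentation (4 images stitched)
}
```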
3.2. Experimental Environment
The experimental environment in this study required the use of several critical components, including a GPU with CUDA support and the PyTorch framework. To perform the learning process and object detection quickly and accurately, the YOLOv5 deep learning model required parallel computing. As a result, a GPU and CUDA were used to accelerate the computing, while the PyTorch framework was used to build and train models.
Table 1 shows the configuration details for this experimental environment.
3.3. Evaluation Metrics
Model evaluation is very important to determine the model's performance and its compatibility with the research objectives. Several evaluation metrics can be used in bus object detection, covering the accuracy, speed, and efficiency of the model. Metrics related to accuracy are usually the main focus in detecting bus objects, because the model must be able to identify buses with high accuracy. The basic evaluation metrics are precision (P) and recall (R). Precision is the ratio between the number of true positives (TP) and the number of positive predictions (TP + false positives (FP)); it measures how accurately the model predicts the positive class. Recall is the ratio between the number of true positives (TP) and the number of actual positives (TP + false negatives (FN)); it measures how many of the positive instances the model can identify. Apart from precision and recall, other evaluation metrics are more commonly used in object detection, namely average precision (AP) and mean average precision (mAP). The AP measures how well the model finds relevant objects and ignores irrelevant ones; it is calculated by plotting the precision–recall curve and computing the area under the curve. The mAP, on the other hand, is the average of the APs over all the object classes found by the model, and it provides a more comprehensive picture of the model's performance because it measures the accuracy for all classes of objects, not just one class. The equations are shown in Equations (11)–(14).
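For completeness, the standard forms of these metrics, consistent with the definitions above, can be written as follows; the numbering mirrors Equations (11)–(14) in the text, though the paper's exact notation may differ slightly.

```latex
\begin{align}
P &= \frac{TP}{TP + FP} \tag{11}\\
R &= \frac{TP}{TP + FN} \tag{12}\\
AP &= \int_{0}^{1} P(R)\,\mathrm{d}R \tag{13}\\
mAP &= \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{14}
\end{align}
```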
In addition to the evaluation metrics related to accuracy, there are metrics related to the model's speed and efficiency. For example, GFLOPs measures the number of floating-point operations, in billions, that the model performs for a single forward pass, and the inference time measures the time it takes the model to process a single image or video frame. These metrics are critical in real-time applications such as object detection in video.
Apart from that, there are other evaluation metrics such as the number of model parameters and the model size. The number of model parameters is the number of parameters (weights) used by the model to learn patterns in the training data. The more parameters used, the more complex the model. This metric gives an idea of the complexity of the model. Model size, on the other hand, measures the size of the model file in bytes or megabytes (MB). This metric gives an idea of the complexity and practicality of the model, especially in applications that require fast and space-efficient file transfers.
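Both of these quantities can be read directly from a PyTorch model; the sketch below is a minimal illustration (the stand-in model is arbitrary, and the size estimate assumes uncompressed float32 weights).

```python
import torch.nn as nn

def complexity_summary(model: nn.Module):
    """Report the parameter count and the approximate on-disk size of a model
    (float32 weights at 4 bytes each), the two complexity metrics used above."""
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = n_params * 4 / (1024 ** 2)  # assumes float32 storage, no compression
    return n_params, size_mb

# Usage with an arbitrary stand-in model (any nn.Module works the same way):
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
params, mb = complexity_summary(model)
print(f"{params:,} parameters, ~{mb:.2f} MB")
```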
3.4. Training Results and Analysis
In this study, the accuracy of two models was compared: YOLOv5n and the Improved-YOLOv5. YOLOv5n served as the baseline because it has the same depth and width structure as the Improved-YOLOv5. To avoid overfitting, both models were trained for up to 1000 epochs with early stopping. Early stopping is a regularization technique that halts training if the model's performance on a validation set stops improving after a certain number of epochs, called the "patience"; the patience was set to 100 epochs in this case. Both models' performance on the validation set was monitored during training, and when a model failed to improve its validation performance for 100 consecutive epochs, training was halted to avoid overfitting. The Improved-YOLOv5 trained for longer than YOLOv5n before early stopping was triggered, at epoch 816 versus epoch 517, indicating that it was better able to continue learning and generalizing the data features. The ability of the Improved-YOLOv5 to train for a longer period suggests that it was less vulnerable to overfitting than YOLOv5n, possibly because its architecture or hyperparameter configuration resulted in better model generalization. Ultimately, this improved ability to generalize and avoid overfitting may have led to its better performance compared to YOLOv5n. A comparison of the detection results of the different models is shown in Figure 8, which contains a visual comparison of the detections obtained by the two models; the figure shows how each model performed in detecting buses in images and provides a clear comparison of their accuracy.
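The patience mechanism described above can be sketched as follows; this is a generic illustration of patience-based early stopping, not YOLOv5's exact implementation, and the validation step here is a placeholder.

```python
import random

def validate_one_epoch(epoch):
    """Placeholder for one epoch of training plus validation; in practice the
    returned fitness would combine the validation mAP metrics."""
    return random.random()  # replace with real training/validation

def train_with_early_stopping(max_epochs=1000, patience=100):
    """Stop when the validation fitness has not improved for `patience` epochs."""
    best_fitness, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        fitness = validate_one_epoch(epoch)
        if fitness > best_fitness:
            best_fitness, best_epoch = fitness, epoch  # checkpoint the model here
        elif epoch - best_epoch >= patience:
            print(f"No improvement for {patience} epochs; stopping at epoch {epoch}.")
            break
    return best_fitness, best_epoch
```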
The results in Figure 9 provide insights into how the models learned over time and whether there were any particular trends or patterns in their performance. Additionally, Figure 10 shows a comparison of the models' accuracy. When the two models were compared, the Improved-YOLOv5 outperformed YOLOv5n in terms of the precision and mAP@0.5. The precision of the Improved-YOLOv5 model was 0.939, while the precision of the YOLOv5n model was 0.921. This suggests that the Improved-YOLOv5 was better at identifying true positives, which is important for people with visual impairment because false positives can lead to missed buses, frustration, and potential safety hazards. Both models had high recall, indicating that they correctly identified the majority of the buses in the images: the recall for YOLOv5n was 0.899, while the recall for the Improved-YOLOv5 was 0.895. The mean average precision (mAP) is a metric that measures how well the model performs at various IoU thresholds. Both models in this study had a good mAP@0.5, indicating that they could detect buses with a high level of accuracy; the Improved-YOLOv5 outperformed YOLOv5n, with a mAP@0.5 of 0.933 compared to YOLOv5n's score of 0.923. However, the Improved-YOLOv5 had a lower mAP@0.5:0.9 score of 0.619, while YOLOv5n had a higher score of 0.628. This means that YOLOv5n detected buses with greater accuracy across the range of IoU thresholds, especially at higher thresholds, than the Improved-YOLOv5.
In summary, the comparison revealed that the Improved-YOLOv5 outperformed YOLOv5n in terms of the precision and mAP@0.5, with comparable recall, indicating its ability to detect true positives with greater accuracy while still detecting the majority of objects in the images. YOLOv5n, on the other hand, had a higher mAP@0.5:0.9 score, indicating that it was better at detecting buses at stricter IoU thresholds. According to the findings, the Improved-YOLOv5 could be the better option for bus detection systems because it demonstrated better generalization and less overfitting, resulting in better overall performance. The difference in the mAP@0.5:0.9, however, highlights the need for additional research and testing to determine the best model for the specific application.
The comparison of the models' complexity is shown in Table 2. In this analysis, the complexity of YOLOv5n and the Improved-YOLOv5 was compared, with YOLOv5n serving as the baseline. The goal was to determine which model was more efficient and less complicated, making it a more practical solution for bus detection systems. The first metric considered was the number of parameters, i.e., the total number of learnable parameters in the model. YOLOv5n had 1,763,224 parameters, while the Improved-YOLOv5 had only 717,020. This demonstrates that the Improved-YOLOv5 was a simpler model because it had fewer parameters to learn, which means it is less complex and has a lower risk of overfitting, helping to ensure that the model generalizes well to new data. Furthermore, having fewer parameters means that the model requires less memory and storage, which is beneficial in resource-constrained environments.
The second metric considered was GFLOPs, the number of floating-point operations, in billions, required for a single forward pass. The Improved-YOLOv5 outperformed YOLOv5n in this case, requiring 2.1 GFLOPs versus 4.1 GFLOPs. This means that the Improved-YOLOv5 needs roughly half the computation to process an image, which is advantageous in applications where speed is critical, such as real-time object detection. This metric is significant because it largely determines the model's speed and efficiency, which can have significant implications for real-world applications.
The third metric was the model weight in megabytes (MB), i.e., the size of the saved model file. In this regard, the Improved-YOLOv5 again outperformed YOLOv5n, with a weight of 1.76 MB versus 3.72 MB. This means that the Improved-YOLOv5 is more compact and takes up less storage space, which is useful where storage space is limited and where fast, space-efficient file transfers matter. Overall, our analysis of the models' complexity indicates that the Improved-YOLOv5 is a more efficient and less complex model for bus detection systems than YOLOv5n: it has fewer parameters, uses less memory, requires less computation per image, and occupies less storage. These metrics are significant because they determine how well the model performs in real-world applications as well as the resources needed to run it. As a result, the Improved-YOLOv5 is the more practical solution for bus detection systems because it is simpler, more efficient, and more compact.
3.5. Ablation Experiment
An ablation experiment involves systematically removing or modifying individual components or parts of a model in order to assess their contribution to the overall performance. To accomplish this, we used the same dataset, hyperparameters, and training/testing procedures for all models, with the exception of the specific components being evaluated, throughout the study. This enabled us to attribute any differences in performance between the modified model and the original model to the specific modifications under consideration.
The ablation experiments are presented in Table 3. The results of the ablation study showed that the changes made to the YOLOv5 model had a significant impact on the precision and mAP@0.5 performance metrics. Compared to the original YOLOv5, the YOLOv5-Slim model had a slightly lower precision but a higher mAP@0.5. The addition of the Ghost module in the YOLOv5-Slim-Ghost model improved the mAP@0.5 score while keeping the precision relatively high. This demonstrates that incorporating a Ghost module can be an effective way to reduce the computational cost while maintaining the accuracy and mAP@0.5. The YOLOv5-Slim-Ghost-SimSPPF (Improved-YOLOv5) model, which combined the previous modifications with the simplified spatial pyramid pooling-fast (SimSPPF) module, outperformed the previous models in terms of the precision and mAP@0.5, while maintaining a similar recall score. This suggests that the SimSPPF module can significantly improve the model's performance, particularly in terms of the precision and mAP@0.5.
The ablation study also demonstrated a clear tradeoff between the number of parameters and the computational cost, as measured by GFLOPs. The YOLOv5 model had the most parameters (1,763,224) and the highest GFLOPs (4.1). The YOLOv5-Slim-Ghost model, on the other hand, had the fewest parameters and GFLOPs, with 716,636 and 2.1, respectively. This suggests that combining the Slim and Ghost modules can result in a more efficient model with a significantly lower computational cost, while still maintaining relatively good performance. The addition of the SimSPPF module in the Improved-YOLOv5 model slightly increased the number of parameters over the YOLOv5-Slim-Ghost model, to 717,020, while the GFLOPs remained at 2.1. This implies that the SimSPPF module can improve the model's performance without significantly increasing the computational cost. Overall, the ablation study results indicated that the changes made to the YOLOv5 model led to more efficient and effective bus detection models: the Slim and Ghost modules significantly reduced the parameters and the computational cost while maintaining relatively good performance, and the SimSPPF module improved the model's performance, particularly in terms of the precision, without significantly increasing the computational cost.
3.6. Comparative Experiments
The comparison of various models for the bus detection system using the YOLOv5 architecture provides valuable insights into their performance. The evaluation criteria for these models included the inference time, the mAP@0.5, the number of parameters, and the GFLOPs. The Improved-YOLOv5 stood out from the other models, as it achieved the best tradeoff between these criteria. Its inference time was the lowest among all models, meaning that it generated predictions quickly, making it well suited for real-time applications. Its mAP@0.5 score was also impressive, indicating that it was highly accurate in detecting objects in an image. One of the most notable features of the Improved-YOLOv5 was its lightweight architecture: with only 717,020 parameters, it required significantly less storage space than the other models, making it more efficient for deployment in resource-limited environments, and with only 2.1 GFLOPs, it required less computational power, reducing energy consumption and processing time. It is worth noting that YOLOv8n also achieved a high mAP@0.5 score, but it required more parameters and computational power than the Improved-YOLOv5. Similarly, YOLOv5x achieved a high mAP score, but at the cost of requiring significantly more parameters and computational power. Therefore, the Improved-YOLOv5 is the optimal choice for bus detection systems, as it achieves the best balance between accuracy, efficiency, and resource requirements. Its low inference time, high accuracy, and lightweight architecture make it the ideal choice for real-time bus detection and tracking systems, and it represents a significant improvement over the previous models for this application. The performance comparison with the various models can be found in Table 4 and Figure 11.
4. Conclusions
We proposed a lightweight and high-accuracy bus detection model based on the YOLOv5 architecture in this study to address the significant challenges that individuals with visual impairment face while waiting for buses. The Improved-YOLOv5, our proposed model, integrates the GhostConv and C3Ghost modules into the YOLOv5 network, implements a Slim scale detection model, and replaces the SPPF in the backbone with SimSPPF for increased computational efficiency and accurate object detection capabilities. By reducing the number of parameters and FLOPs, our proposed model outperformed the original YOLOv5 model in terms of the precision and mAP@0.5, with comparable recall. The Improved-YOLOv5 model also outperformed other object detection models in terms of the accuracy and inference time, as demonstrated by our experimental results. Furthermore, because of its smaller size, lower memory usage, and faster inference, the proposed model was more efficient. This makes it a more viable option for mobile devices and real-time applications, especially in resource-constrained scenarios.
The proposed bus detection model has the potential to assist individuals with visual impairment in identifying buses, thereby increasing their independence and quality of life. It can also be integrated into bus detection systems, improving public transportation services, particularly in urban areas. Our proposed model has significant potential benefits for assisting individuals with visual impairment and improving public transportation services, and we believe that our research will inspire further development and research in this area. However, several challenges remain to be addressed, such as improving the model’s generalizability across different datasets and environments. More research is needed to investigate the effectiveness of our proposed model under various lighting conditions and environmental factors. Nonetheless, our findings show that object detection models based on the Improved-YOLOv5 have the potential to address real-world problems and contribute to the development of more efficient and accurate bus detection systems, paving the way for future research and development in this field.