1. Introduction
Smart cities utilize big data to enhance urban planning, optimize resource allocation, and improve the overall quality of life for their residents [1]. Big data refers to the collection, organization, and processing of vast amounts of information using specialized software tools in a relatively short time frame [2,3,4]. These data are essential for the development of smart cities, particularly for traffic management [5]. The widespread use of big data has made it a major focus of attention across various sectors. The ability of big data technology to accurately handle and store large volumes of information has not only captured people’s interest, but has also brought about significant changes in how we live. Consequently, in the management of smart cities, the efficient processing of enormous volumes of data and its role in supporting urban development have become central concerns [4]. Big data also has the potential to enhance people’s quality of life, particularly in this era of a knowledge-driven economy.
Among photogrammetry and remote sensing equipment (e.g., color, multispectral, thermal, and Synthetic Aperture Radar (SAR) cameras) [6,7,8], radiometric images have shown efficient and economical performance in various fields such as environmental monitoring, agricultural field analysis, and urban mobility [9]. However, image segmentation and the classification of objects of interest present challenges. Since there is no universally accepted definition of what constitutes an object, owing to variability in human comprehension and uncertainty in visual perception, object detection and image segmentation have become major points of discussion among numerous remote sensing topics.
Previous methodologies aimed to identify vehicles in each single image without using adjacent frame information, even though a vehicle may appear in multiple consecutive frames [10]. Each frame may contain various objects ranging from moving vehicles such as cars to natural ones like trees. For vehicle detection, prior knowledge of vehicles plays a key role: vehicles follow a rectangular shape, and this provides meaningful information for distinguishing one vehicle type from another. In this case, two steps are considered: hypothesis generation (HG), which estimates several candidate locations, and hypothesis verification (HV), which verifies which of the candidate objects are vehicles [11].
Regarding the HG step, appearance features (e.g., edge detection) are used because they are simple and straightforward to implement. The following feature detectors are the most popular and robust procedures: color, shadow, symmetry, edge, Histogram of Oriented Gradients (HOG), Haar-like features, Gabor, Speeded Up Robust Features (SURF), Scale-Invariant Feature Transform (SIFT), and convolutional neural networks (CNNs) [12]. These features have their benefits and drawbacks. Regarding the benefits, by converting the RGB space of images into the HSV color space, red color values are intensified, which helps to identify the tail-lights of vehicles at nighttime [13]. This transformed space is also robust to illumination change, one of the most important challenges in image processing. However, these features also have drawbacks. For instance, HOG and Haar-like features can be computationally expensive, making real-time processing challenging. SURF and SIFT, while robust to scale and rotation changes, can be slow and require significant computational resources. Additionally, color-based features can be affected by variations in lighting and shadows, leading to potential inaccuracies in detection.
The above-mentioned features cannot, individually or in combination, reliably confirm true vehicles on their own. Therefore, the HV step aims to build a meaningful connection from the extracted features for detecting vehicles. In this case, machine learning and deep learning classifiers have shown efficient performance in vehicle extraction. Examples include the support vector machine (SVM), random forest (RF), K-nearest neighbor (KNN), AdaBoost, back propagation neural network (BPNN), and genetic algorithm (GA) [13]. By combining these classifiers with feature detectors, researchers have tried to recognize vehicles. For instance, with a combination of HOG and SVM, vehicles viewable from the rear can be detected, although the approach may not be rapid enough [14]. Using Haar-like features with generative AdaBoost is not only rapid enough to work in real-time situations, but it can also detect vehicles from rear-view images [15]. Noticeably, various deep learning classifiers have been proposed over the last decade, which play the roles of HG and HV simultaneously. You Only Look Once (YOLO), RCNN, Faster RCNN, and the Single Shot Detector (SSD) are the most popular methodologies in vehicle detection [16]. These methods have been trained with large, freely accessible datasets like COCO for object extraction, including vehicles.
In summary, while these methodologies may demonstrate effectiveness on images, significant challenges remain. These include the need for vast training data, the reliance on high-performance computing for execution, and the absence of comprehensive and challenging video datasets depicting highway scenarios. Previous research has scarcely examined its methodologies across diverse datasets encompassing day, night, and various weather conditions. Moreover, the focus has predominantly been limited to a single study area, thereby diminishing the generality of the algorithms. Therefore, it is crucial to address these issues to enhance applicability and robustness in real-world highway settings. Additionally, few studies have focused on low-resolution highway cameras, which represent one of the most challenging scenarios for object detection. Low-resolution cameras provide limited contextual information, making it particularly difficult to detect objects, especially distant vehicles. This challenge has rarely been addressed in the existing research.
As an image segmentation algorithm, the segment anything model (SAM) has received significant academic attention due to its outstanding performance in image segmentation tasks [17]. SAM was trained on a diverse dataset of over 1 billion segmentation masks, including road vehicles. Beyond the extensive dataset, SAM offers several advantages such as robustness, adaptability, precision, scalability, and ease of integration. These features make SAM highly suitable for urban traffic surveillance, enabling accurate and reliable vehicle detection and tracking in various conditions. SAM’s effectiveness lies in its ability to accurately partition images into segments without prior knowledge of specific object classes or categories. This flexibility results from SAM’s robust training on a diverse and extensive dataset, enabling it to generalize well to various image segmentation challenges. Using such an extensive dataset enhances SAM’s segmentation accuracy and contributes to its adaptability across a wide array of applications.
SAM’s proficiency extends to diverse domains including medical imaging, autonomous navigation, environmental monitoring, and more [18]. Ref. [19] introduced a universal crater detection approach based on the SAM algorithm, improving the computation time of crater detection, which was previously carried out manually. In a novel way, ref. [20] detected water leakage inside tunnels using SAM, achieving about 77% mIoU (mean Intersection over Union). SAM has also been applied to several medical image segmentation tasks [21], where the authors investigated possible paths for future research regarding SAM’s involvement in medical image segmentation. On the negative side, SAM produces unclassified segments, necessitating post-processing to assign class labels to them. While SAM has been predominantly used in medical image segmentation, its robust performance and adaptability make it suitable for urban traffic surveillance. The key similarities lie in the need for precision and robustness, while the primary differences involve the application context, real-time processing requirements, and data characteristics. By leveraging SAM’s capabilities, we aim to enhance vehicle detection and tracking in diverse and challenging traffic environments. The main contributions of this research are as follows.
Introduction of the Segment Anything Model (SAM): This research introduces the segment anything model (SAM), an advanced image segmentation algorithm, for the detection and tracking of vehicles using low-resolution, uncalibrated highway cameras. This novel application of SAM demonstrates its potential to address the complexities of vehicle detection in urban traffic surveillance.
Robust Performance in Diverse Conditions: The SAM-based algorithm demonstrates superior performance in various challenging conditions including different weather scenarios (rain, snow), lighting conditions (day and night), and diverse fields of view. This robustness ensures reliable vehicle detection and tracking, making it suitable for the dynamic and unpredictable nature of urban traffic environments.
Enhanced Real-Time Capabilities: By achieving significantly faster processing times, especially when utilizing GPU acceleration (100 to 200 milliseconds), the SAM-based algorithm facilitates real-time vehicle detection and tracking. This capability is crucial for intelligent transportation systems and smart city applications, where timely and accurate traffic data are essential for improving urban mobility and traffic management.
2. Methodology
Figure 1 illustrates the flowchart of our proposed algorithm for detecting and tracking on-road vehicles. In the initial stage, the algorithm identifies the surface of the road infrastructure to ensure that the segment anything model (SAM) methodology focuses solely on relevant objects, thereby enhancing computational efficiency. After applying SAM to segment road objects, the algorithm generates multiple segments encompassing vehicles and other road objects. Since the class of these segments is unknown, we employ three additional steps to further analyze and categorize vehicles.
2.1. Road Surface Detection
In the captured video images from cameras, various irrelevant objects located outside the highway road facilities are also included. To optimize our algorithm’s computation time and accuracy, a crucial step was to narrow down the application of our vehicle detection (VD) technique to the road infrastructure. Two options are available in this situation: (1) a manual restriction process (MRP), or (2) building a road detection algorithm similar to U-Net [22]. MRP entails manually extracting the road surface boundaries through annotation, whereas the second scenario necessitates a development step that mostly uses road colors or road markings for boundary identification. MRP was appropriate in our case because the cameras are fixed in highway monitoring projects. MRP is inappropriate if the cameras are positioned on a moving vehicle, as a sophisticated road extraction method is then required. As a result, MRP was implemented using computer vision tools available in the Python OpenCV library, as sketched below.
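The following is a minimal, illustrative sketch of how such a manual restriction mask can be applied with OpenCV; the polygon vertices and file names are placeholders rather than values used in the project.

```python
import cv2
import numpy as np

def restrict_to_road(frame, road_polygon):
    """Keep only the manually annotated road region (MRP); everything else is zeroed out.

    road_polygon: list of (x, y) vertices traced once per fixed camera.
    """
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [np.array(road_polygon, dtype=np.int32)], 255)
    return cv2.bitwise_and(frame, frame, mask=mask)

# Example: hypothetical polygon vertices for one 352 x 240 camera view.
roi = [(0, 239), (120, 90), (230, 90), (351, 239)]
frame = cv2.imread("frame_0001.png")
road_only = restrict_to_road(frame, roi)
```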
2.2. Segment Anything Model (SAM)
The segment anything model (SAM) is an image segmentation algorithm that has gained attention for its versatile applications in various fields including computer vision and artificial intelligence [17]. SAM utilizes advanced deep learning techniques to accurately identify and segment objects within an image, regardless of their class or category. Unlike traditional object detection algorithms that focus on specific predefined classes, SAM can segment any object in an image, making it highly flexible and adaptable [23]. It achieves this by leveraging a vast mask training dataset comprising approximately 1 billion samples. SAM employs an image encoder, a prompt encoder, and a fast mask decoder to achieve precise segmentation results.
The backbone network, typically a deep CNN like ResNet or a vision transformer (ViT), extracts features from the input image [24]. Let $F$ represent the feature maps, where $F \in \mathbb{R}^{H \times W \times C}$, with $H$ and $W$ being the height and width of the feature maps, respectively, and $C$ the number of channels.

The attention mechanism enhances the feature representation by computing the relationships between different spatial locations. Given the feature maps $F$, the self-attention operation can be formulated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here, $Q$ (query), $K$ (key), and $V$ (value) are linear transformations of the input feature maps $F$. The term $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimensionality of the key vectors.
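As an illustration of the scaled dot-product attention above, the following numpy sketch applies it to a flattened feature map; the projection matrices and dimensions are arbitrary placeholders, not parameters of SAM itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Scaled dot-product self-attention over flattened feature locations.

    F  : (N, C) feature map flattened to N = H*W spatial locations.
    Wq, Wk, Wv : (C, d_k) projection matrices (random here, purely for illustration).
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) pairwise spatial relations
    return softmax(scores, axis=-1) @ V       # attention-weighted values

# Toy dimensions: an 8x8 feature map with 32 channels, d_k = 16.
rng = np.random.default_rng(0)
F = rng.standard_normal((64, 32))
out = self_attention(F, *(rng.standard_normal((32, 16)) for _ in range(3)))
```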
The segmentation head decodes the enhanced feature maps to produce the segmentation mask $M$. This is typically achieved through a series of upsampling operations and convolutional layers, leading to the final output:

$$M = \mathcal{D}(F')$$

where $\mathcal{D}$ represents the decoder network and $F'$ are the attention-enhanced feature maps.
SAM is trained using a combination of loss functions to ensure accurate segmentation. The primary loss function used is the cross-entropy loss, defined as:

$$\mathcal{L}_{CE} = -\sum_{i}\sum_{c} y_{i,c}\,\log\left(\hat{y}_{i,c}\right)$$

where $y_{i,c}$ is the ground truth label and $\hat{y}_{i,c}$ is the predicted probability for pixel $i$ belonging to class $c$. Additionally, a dice loss can be employed to improve the performance of instance segmentation:

$$\mathcal{L}_{Dice} = 1 - \frac{2\sum_{i} y_i\,\hat{y}_i}{\sum_{i} y_i + \sum_{i} \hat{y}_i}$$

The total loss function $\mathcal{L}$ is a weighted sum of the cross-entropy loss and the dice loss:

$$\mathcal{L} = \lambda_{CE}\,\mathcal{L}_{CE} + \lambda_{Dice}\,\mathcal{L}_{Dice}$$

where $\lambda_{CE}$ and $\lambda_{Dice}$ are hyperparameters that balance the contribution of each loss term [25].
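A minimal PyTorch sketch of this weighted combination is shown below, assuming dense per-pixel integer class labels; the weighting values are illustrative defaults, not those used to train SAM.

```python
import torch
import torch.nn.functional as F

def combined_segmentation_loss(logits, target, lambda_ce=1.0, lambda_dice=1.0, eps=1e-6):
    """Weighted sum of cross-entropy and Dice losses, as described above.

    logits : (B, C, H, W) raw class scores; target : (B, H, W) integer labels.
    lambda_ce / lambda_dice are illustrative hyperparameters.
    """
    ce = F.cross_entropy(logits, target)

    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    dice = 1.0 - (2.0 * intersection + eps) / (union + eps)

    return lambda_ce * ce + lambda_dice * dice.mean()
```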
During training, the model parameters of the backbone and the decoder are optimized to minimize the total loss $\mathcal{L}$ [17]. The training process involves forward and backward passes. In the forward pass, the feature maps $F$ are computed, attention mechanisms are applied to obtain $F'$, and the segmentation mask $M$ is generated. The total loss $\mathcal{L}$ is then calculated using the ground truth masks and the predicted masks. In the backward pass, gradients are calculated, and the model parameters are updated using an optimizer such as Adam [4].
During inference, the trained model takes an input image $X$ and produces the segmentation mask $M$. The inference process involves extracting the feature maps $F$ from the input image using the backbone network, applying the attention mechanism to obtain the enhanced feature maps $F'$, and decoding the feature maps to generate the final segmentation mask $M$.
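For reference, a minimal sketch of this inference step using Meta’s publicly released segment-anything package is given below; the checkpoint file, model size, and image path are assumptions for illustration only.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint path and model size are illustrative; any official SAM checkpoint works.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to("cuda")  # GPU strongly recommended; CPU inference takes several seconds per frame

mask_generator = SamAutomaticMaskGenerator(sam)

frame = cv2.cvtColor(cv2.imread("highway_frame.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(frame)  # list of dicts: 'segmentation', 'area', 'bbox', ...

print(f"{len(masks)} class-agnostic segments found")
```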
To evaluate the performance of SAM, several metrics are commonly used. Intersection over Union (IoU) measures the overlap between the predicted segmentation mask and the ground truth mask [24]. Pixel accuracy is the ratio of correctly classified pixels to the total number of pixels. Mean IoU (mIoU) is the average IoU across all classes. The Dice coefficient, similar to IoU, focuses on the similarity between the predicted and ground truth masks.
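As a concrete illustration of these overlap metrics, the following short numpy sketch computes IoU and the Dice coefficient for a pair of binary masks (class-agnostic, purely illustrative):

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

def dice(pred_mask, gt_mask):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    total = pred_mask.sum() + gt_mask.sum()
    return 2 * inter / total if total else 0.0
```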
Figure 2 presents a sample of SAM’s output on a highway image, demonstrating the successful segmentation of vehicles. However, there remain three challenges in the exclusive detection of vehicle segments. First, SAM is a segmentation process, which means that the class and type of the segments are not initially known. Second, SAM may also segment non-vehicle objects including on-road elements like road markings and off-road entities like traffic signs. While the manual restriction process (MRP) stage may eliminate some of these non-vehicle objects, removing the remaining ones is required. Finally, SAM has the potential to divide a vehicle into multiple segments such as body and windshield. Consequently, it is necessary to merge segments belonging to a single vehicle into a unified vehicle class.
2.3. On-Road Moving Vehicle Detection (OR-MVD)
This step concentrates on detecting vehicle segments by simultaneously processing multiple video frames. Since vehicles are in motion and do not maintain a fixed position throughout the recording, the segments corresponding to vehicles exhibit slight positional changes between two adjacent frames.
Let us consider a scenario where a vehicle moves at a velocity of $v$ (m/s) and the camera records at a frame rate of 30 FPS, meaning that 30 images are captured every second and the time interval between two consecutive frames is $\Delta t = 1/30$ s. This implies that the vehicle moves approximately $v \cdot \Delta t = v/30$ m between two adjacent frames. However, objects like traffic signs, road markings, and poles have fixed positions in both adjacent frames.
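As an illustrative check (the 100 km/h speed below is an assumed value, not one reported in the paper):

$$v = 100~\text{km/h} \approx 27.8~\text{m/s}, \qquad d = v \cdot \Delta t \approx \frac{27.8}{30} \approx 0.93~\text{m per frame}$$

so even at highway speed, a vehicle shifts by roughly one meter between consecutive frames, while static roadside objects shift by zero.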
To find the vehicle segments inside each image, two steps are implemented in this stage. First, the center of each SAM segment in the two consecutive adjacent frames is extracted. Subsequently, the minimum distance between the segments in the first and second images is calculated. If the calculated minimum distance for each segment is zero, it indicates that the segment has not moved and is thus considered as a non-vehicle segment, which is subsequently removed. The final output of this step consists of segments that belong to vehicles captured in each image frame.
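A minimal sketch of this two-frame filtering step, operating on SAM’s mask dictionaries, is given below; the pixel threshold min_shift is a hypothetical parameter standing in for the zero-distance test described above.

```python
import numpy as np

def segment_centers(masks):
    """Centroid (row, col) of each binary SAM segment."""
    return np.array([np.argwhere(m["segmentation"]).mean(axis=0) for m in masks])

def moving_segments(masks_t, masks_t1, min_shift=1.0):
    """Keep segments of frame t whose nearest counterpart in frame t+1 has moved."""
    if not masks_t or not masks_t1:
        return []
    c_t, c_t1 = segment_centers(masks_t), segment_centers(masks_t1)
    keep = []
    for i, c in enumerate(c_t):
        d_min = np.linalg.norm(c_t1 - c, axis=1).min()
        if d_min >= min_shift:        # static segments (d_min ~ 0) are removed
            keep.append(masks_t[i])
    return keep
```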
2.4. Vehicle Segment Merging (VSM)
Since a vehicle may be split into several segments by SAM, this step aims to merge the segments belonging to a single vehicle. Therefore, each segment is compared with all of the other segments in order to merge the parts of each vehicle into a unique segment.
Let us consider two segments that belong to a single vehicle. The first scenario occurs when one segment is entirely enclosed within another. The second scenario involves two segments that share boundary parts. A similar situation can arise for adjacent vehicles, which poses a challenge for this second scenario.
Scenario one is handled using the union probability procedure. Let us consider $S_1$ and $S_2$ as two segments that belong to a vehicle, where $S_1$ is enclosed by $S_2$, and let $N_1$ and $N_2$ indicate the number of pixels in each segment. As the segments are polygons, a morphological closing operation is applied to each segment to fill possible holes. Then, the number of pixels shared by the two segments is counted and denoted $N_{\cap}$. Dividing $N_{\cap}$ by $N_1$ yields the union probability $P = N_{\cap}/N_1$. If $P$ is equal to one, the segment $S_1$ lies completely inside $S_2$, and both segments are merged.
If the calculated $P$ is not zero (a value of zero means that the two segments have no overlap), the two segments share a common boundary. This is where the second scenario arises. In this situation, two segments with common boundaries are merged and classified as part of a vehicle if the length of the merged segment increases in the direction of the road. Conversely, if the length does not exhibit this characteristic, the two segments are not considered part of the same vehicle. This approach is based on the observation that vehicles typically follow a rectangular shape aligned with the direction of the road.
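An illustrative sketch of the enclosure test in scenario one, assuming binary segment masks, is shown below; the 5 × 5 closing kernel is an assumed value, not one specified in the paper.

```python
import cv2
import numpy as np

def union_probability(seg_a, seg_b):
    """P = N_intersection / N_a after closing holes in both binary segments."""
    kernel = np.ones((5, 5), np.uint8)               # kernel size is an assumption
    a = cv2.morphologyEx(seg_a.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    b = cv2.morphologyEx(seg_b.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    n_inter = np.logical_and(a, b).sum()
    return n_inter / a.sum() if a.sum() else 0.0

def merge_if_enclosed(seg_a, seg_b):
    """Scenario one: merge seg_a into seg_b when seg_a is entirely enclosed (P == 1)."""
    if union_probability(seg_a, seg_b) >= 1.0:
        return np.logical_or(seg_a, seg_b)
    return None
```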
2.5. Counting and Tracking Vehicles (CTV)
A single car may appear in several images because cameras capture several consecutive frames per second. As a result, this phase attempts to give each vehicle that appears in multiple frames a unique number. The SAM technique generates a perimeter around each vehicle and detects only the pixels of the vehicles. In this case, the measurement is focused on determining the center of the segments corresponding to vehicles.
Among the proposed tracking methodologies [26], DeepSORT, an abbreviation for Deep Simple Online and Realtime Tracking, represents a sophisticated object tracking algorithm utilized in the realm of computer vision [27]. It serves the purpose of aiding computers in comprehending and monitoring the movement of objects, particularly vehicles, within videos. Acting as an evolution of the SORT (Simple Online and Real-time Tracking) method, DeepSORT integrates deep learning features to augment the object tracking capabilities, especially in scenarios characterized by a high degree of complexity.
The underlying process of DeepSORT unfolds in a sequence of steps. Initially, it engages in the detection of objects such as vehicles within each frame of a video, accomplished through the application of a previously trained model. Subsequently, the algorithm proceeds to extract distinctive features from these detected objects, leveraging a deep learning model. These features serve as unique characteristics that facilitate the identification and differentiation of one object from another.
The pivotal stage involves the association and tracking of these objects across various frames. This association takes into consideration both the visual characteristics of the objects (appearance features) and their respective movements (motion information). This dual consideration is crucial for establishing a coherent understanding of which object in one frame corresponds to a specific object in the subsequent frame.
Facilitating the counting of objects, particularly vehicles, is another essential functionality of DeepSORT. By tracking the movements and occurrences of specific objects within a video, the algorithm enables the derivation of valuable insights, particularly in scenarios like traffic analysis. Additionally, post-processing techniques are employed to refine the tracking results, addressing challenges such as the temporary disappearance of objects from the field of view or instances of false detections.
The prowess of DeepSORT becomes particularly evident in scenarios where numerous objects are in simultaneous motion such as densely populated traffic scenes. The incorporation of deep learning features enhances its robustness, ensuring accurate and consistent object tracking over time.
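The following is a hedged sketch of how SAM-derived vehicle boxes could be passed to DeepSORT, here assuming the third-party deep-sort-realtime Python package; the confidence value, class label, and max_age setting are illustrative assumptions rather than the paper’s configuration.

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

# max_age (frames a lost track is kept alive) is an illustrative setting.
tracker = DeepSort(max_age=30)

def track_vehicles(frame, vehicle_masks):
    """Feed SAM-derived vehicle boxes to DeepSORT and return (track_id, box) pairs."""
    detections = []
    for m in vehicle_masks:
        x, y, w, h = m["bbox"]                             # SAM reports boxes as XYWH
        detections.append(([x, y, w, h], 1.0, "vehicle"))  # confidence fixed at 1.0
    tracks = tracker.update_tracks(detections, frame=frame)
    return [(t.track_id, t.to_ltrb()) for t in tracks if t.is_confirmed()]
```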
Figure 3 shows the output of our proposed algorithm in which a unique bounding box was considered for each detected vehicle.
2.6. Evaluation Criteria
The formulas for precision, recall, and F1-score, expressed in Equations (6)–(8), respectively, are commonly used metrics to assess the effectiveness of classification models [3].
Precision evaluates the effectiveness of a model in accurately identifying true positives (vehicles) among the examples labeled positive (true positives + false positives); it quantifies the proportion of detected items that are relevant. Recall measures the ability of the model to correctly identify all relevant instances within a dataset, also known as the true positive rate, and is calculated by Equation (7). Here, “true positive” represents the number of correctly identified positive instances (vehicles), and “false negative” represents the number of positive instances that were incorrectly classified as negative. The F1-score is the harmonic mean of recall and precision, offering a statistic that strikes a compromise between the two.
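In their standard form, consistent with the descriptions above, these metrics are computed as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.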
3. Datasets
Dataset 1: The governmental organization “Ministère des Transports et de la Mobilité durable”, located in Québec City, Canada, has implemented a comprehensive system of traffic surveillance cameras across its road infrastructure (Québec 511). These strategically placed cameras are vital in monitoring and managing traffic flow. Although the cameras possess a relatively low resolution of 352 × 240 pixels and operate at a frame rate of 15 frames per second (FPS), their significance lies in their wide coverage of roads from diverse fields of view. The cameras record video throughout the day, ensuring 24-hour surveillance, and capture a wealth of valuable information, enabling the assessment of our proposed algorithm in many challenging scenarios. In this study, we considered 30 highway videos downloaded from the Québec 511 network in Quebec City, Canada. This dataset included 2250 images captured under various challenging conditions such as different illumination changes, particularly nighttime, diverse weather conditions like snowy days, varied fields of view from different camera angles, and different road infrastructures. Our aim was to evaluate the proposed algorithm under these challenging situations to assess its generality and robustness.
Dataset 2: The “Traffic Speed Estimation from Surveillance Video Data” project for the 2nd NVIDIA AI City Challenge Track 1 focuses on developing algorithms to estimate the speed of vehicles using surveillance video footage [28]. This challenge aims to leverage advanced AI techniques such as computer vision and deep learning to analyze video data captured from traffic cameras. The goal is to create accurate and efficient models that can process real-time video streams to monitor traffic flow, enhance road safety, and assist in urban planning. This project addresses practical challenges including varying lighting conditions, occlusions, and camera perspectives. The dataset provides challenging highway videos recorded at various road infrastructures; some were recorded at nighttime, the most challenging setting, and under diverse weather conditions such as rainy days. We used five challenging datasets with around 12,000 images to evaluate the proposed algorithm in challenging situations such as nighttime, various fields of view, different weather conditions, and high-resolution images.
As can be seen in Figure 4, four different cameras with various fields of view, camera quality, and different study areas were selected in Dataset 1. The availability of such a rich dataset allows for the rigorous testing and evaluation of our algorithm’s performance in real-world traffic conditions. By leveraging the recordings from these cameras, we could simulate and analyze various traffic situations including congestion, accidents, and inclement weather. This enabled us to develop and fine-tune our algorithm to address the unique challenges of different scenarios, ultimately improving its effectiveness and reliability.
4. Results and Discussion
This section focuses on a comprehensive discussion of the obtained results, providing an in-depth analysis. Furthermore, the benefits and drawbacks of our algorithm are thoroughly examined in comparison to state-of-the-art methodologies such as the single shot detector (SSD), the region-based convolutional neural network (RCNN), and You Only Look Once (YOLO) [11,29,30].
4.1. Results
Regarding vehicle detection (VD), as can be seen in Figure 5, our algorithm achieved a remarkable success rate, accurately detecting approximately 89.68% of vehicles without misclassifying non-vehicles as cars (recall = 100%). These scores indicate excellent accuracy and reliability in detecting and classifying vehicles. YOLOv7 exhibited the accuracy closest to our algorithm, at around 80.22% and 93.25% for precision and recall, respectively. The non-detection of distant vehicles in Figure 5 can be attributed to the resolution of the images captured by the surveillance cameras and the high traffic density. Low-resolution cameras lack the detail required to accurately detect distant objects, while high traffic density can lead to the occlusion of smaller or more distant vehicles. Future work will focus on enhancing the image resolution and refining detection algorithms to improve the detection of distant vehicles.
Table 1 displays the precision, recall, and F1-score values for different algorithms including the YOLO versions (YOLOv8, YOLOv7, YOLOv6, YOLOv5), Faster RCNN, SSD, and our SAM-based algorithm. Faster RCNN and SSD obtained very low precision, recall, and F1-score values, all of which were less than 2.00%. These results indicate the need for additional training datasets to improve the classification accuracy of these methods.
Figure 3 and Figure 5 demonstrate the capability of our algorithm to detect relatively small-sized vehicles located at distances exceeding 300 m from the camera. Notably, the achievement of this high accuracy in vehicle detection was due to the high performance of SAM in image segmentation (Figure 6).
This paper primarily focused on utilizing uncalibrated camera sensors installed on highways that collect publicly accessible online images. The vehicles often appear small and carry little contextual information. Despite their lower resolution, the extensive coverage and constant availability of the traffic surveillance cameras offer a valuable resource for evaluating and refining our algorithm under diverse and challenging conditions. Through this collaboration, we aimed to enhance the efficiency and effectiveness of traffic management systems, ultimately benefiting both road users and the transportation infrastructure. Notably, there was no need to redesign or reinstall high-resolution camera sensors on the roadways. For the first time, we demonstrate the acceptable performance of the SAM algorithm, recently published by Meta, on traffic surveillance images and show how efficiently this model works.
4.2. Challenges of the Datasets
In our study, we utilized two primary datasets: low-resolution 511 highway camera footage and the high-resolution NVIDIA AI Smart City contest dataset. Each dataset presented unique challenges that we addressed to improve the performance and evaluation of our proposed methods. The 511 highway cameras provided low-resolution images, which significantly affected the accuracy of vehicle detection and tracking. The limited pixel information made it challenging to identify distant or small vehicles accurately, and it also exacerbated the difficulty in distinguishing between closely packed vehicles and detecting fine details required for vehicle classification. On the other hand, while the NVIDIA AI Smart City dataset offered high-resolution images that facilitated more accurate detection and classification, it introduced computational challenges. Higher resolution images required more processing power and memory, which could limit the feasibility of real-time applications.
The datasets included varying fields of view, from narrow angles focusing on specific sections of a road to wide angles covering entire intersections. This diversity required robust algorithms that could adapt to different perspectives and ensure consistent performance across various scenarios. One of the most challenging aspects was the significant change in illumination between day and nighttime footage. During the day, shadows and glare interfered with detection algorithms, while at night, low light levels and artificial lighting created noise and reduced visibility. Therefore, ensuring reliable performance under varying lighting conditions required sophisticated preprocessing and adaptive algorithms.
Weather conditions such as rain and snow introduced additional complexities. Rain caused reflections and blur, while snow obscured vehicles and altered the appearance of the road surface. Both conditions demanded robust algorithms capable of maintaining accuracy despite these adversities. Additionally, the datasets encompassed a range of road infrastructures, from highways to urban streets. Each type of road presents unique challenges such as varying traffic densities, road markings, and surrounding environments. For instance, highways typically have higher speeds and more homogeneous traffic, while urban roads exhibit more variability in vehicle types, speeds, and interactions with pedestrians and cyclists.
Overall, we addressed these challenges by developing flexible, adaptive algorithms capable of handling diverse conditions. The insights gained from addressing these challenges in our datasets contributed significantly to the robustness and generalizability of our proposed methods, paving the way for more reliable urban traffic surveillance systems.
4.3. Running Time
In this study, we evaluated the processing time of our proposed system under different computational setups. The system was tested on a machine running Windows 10 with 16 GB DDR3 RAM and an Intel(R) Core(TM) i7-4700HQ CPU @ 2.4 GHz. Python 3.10.8 was used as the programming environment.
When the system was run without utilizing a GPU, the processing time ranged between 4 and 8 s per frame. This processing time, although acceptable for certain applications, does not meet the requirements for real-time processing, which is crucial for urban traffic surveillance systems.
However, when we leveraged GPU acceleration, the processing time significantly improved, ranging from approximately 100 to 200 milliseconds per frame. This substantial reduction in processing time demonstrates that the algorithm, when used with GPU support, is capable of real-time performance. The real-time processing capability is essential for practical deployment in live traffic monitoring and management systems, where timely data processing and decision making are critical.
By optimizing our system to utilize GPU acceleration, we ensured that it can handle the computational demands of real-time vehicle detection and tracking in various challenging conditions, as discussed in the previous sections.
4.4. Comparison with the Deep Learning Algorithms of SSD, RCNN and YOLO
To obtain more information on the state-of-the-art methodologies of Faster RCNN, YOLO, and SSD, a comparative analysis between these methods was undertaken by [16].
Table 1 summarizes the results obtained by YOLO, Faster RCNN, SSD, and our algorithm. It was observed that SSD and Faster RCNN were not suitable for the detection of vehicles inside highway images. The main reason behind the low accuracies of SSD and Faster RCNN is their training data: the datasets used to train these models did not include vehicles viewed from the perspective of highway cameras. To obtain better detection accuracy, new training datasets are needed to fine-tune these algorithms. To address this, we propose employing transfer learning, which involves fine-tuning pre-trained models on a specific dataset. This approach is expected to enhance the accuracy and efficiency of SSD and Faster RCNN for urban traffic surveillance. Future work will include implementing transfer learning and conducting extensive experiments to validate its effectiveness.
The YOLO family of models, particularly YOLOv7, showed acceptable performance with a good balance between precision and recall, making it a reliable choice for various object detection tasks. However, YOLOv6 and YOLOv8 exhibited lower precision compared to YOLOv7. YOLO models are known for their impressive speed, making them suitable for real-time applications, but they may struggle with detecting smaller objects due to their coarse grid: YOLO models first divide the input image into a fixed grid of cells and then detect objects within each cell. Our proposed SAM-based algorithm outperformed all of the other models with significantly higher precision, recall, and F1-score. The SAM-based algorithm’s ability to accurately detect and track vehicles in various challenging conditions including low-resolution images, different weather conditions, and varying lighting demonstrates its robustness and generality.
Our SAM-based algorithm demonstrated superior performance in diverse weather conditions such as rainy and snowy days compared to YOLO, SSD, and Faster RCNN. This can be attributed to its robustness in handling variations in weather, which often cause reflections, blurs, and obscured views. The SAM-based algorithm’s high precision and recall under different lighting conditions, particularly nighttime, highlight its adaptability. While YOLO models also performed reasonably well, SSD and Faster RCNN struggled significantly, likely due to their lower robustness against lighting variations.
Handling low-resolution images is particularly challenging due to the limited contextual information. Our SAM-based algorithm’s high performance in these scenarios suggests that it effectively managed the reduced pixel information, whereas SSD and Faster RCNN showed poor results, indicating their inadequacy for low-resolution datasets. YOLO models, although better than SSD and Faster RCNN, still did not match the performance of the SAM-based algorithm. The diverse fields of view in the datasets, from narrow to wide angles, posed a significant challenge. The SAM-based algorithm’s adaptability to these varying perspectives further underscores its robustness compared to other models.
In summary, our SAM-based algorithm demonstrated superior performance in all metrics, making it the most effective for our specific application in vehicle detection under diverse and challenging conditions. While YOLO models, particularly YOLOv7, offer a good balance and are effective for real-time applications, the SAM-based algorithm’s exceptional accuracy makes it the best choice for achieving high reliability and robustness in various scenarios. SSD and Faster RCNN, on the other hand, are not suitable for our use case due to their poor performance metrics.
4.5. Comparison with the Previous Traditional Algorithms
For VD on radiometric images, the proposed methodologies can be grouped into three classes: binary generation models (BGM), multimodal fusion sensors (MFS), and deep learning-based methodologies. BGM algorithms attempt to recognize vehicles by subtracting two adjacent frames; as vehicles move along the road infrastructure, they are detected when a binary threshold is applied to the subtracted frames. These kinds of algorithms are used less often because of their high sensitivity to changes in illumination, whether at night or during bad weather conditions. Concerning the MFS models, additional hardware such as radar is required in addition to cameras, which makes them more expensive. Deep learning algorithms have demonstrated effective performance in detecting objects, including cars. Three popular proposed algorithms are the single shot detector (SSD), the region-based convolutional neural network (RCNN), and You Only Look Once (YOLO). These models were developed using extensive, open-source benchmark datasets like Common Objects in Context (COCO), which include a wide range of objects from trucks to animals [31]. Numerous researchers have tried to apply these models to VD on highway traffic images, gathering hundreds of training vehicle photos and retraining these deep models to improve performance, mostly on high-end cloud computing systems. In many circumstances, such cloud computing platforms are not genuinely cost-effective for retraining and retesting these models. Another drawback of these models is that, due to their reliance on high-context and high-resolution images, they are inappropriate for use with the low-resolution traffic surveillance cameras used in this study. Furthermore, if a high-resolution camera sensor is used for traffic monitoring, privacy becomes a concern because the vehicle’s license plate and interior can easily be seen. In conclusion, the need for training data, high-end cloud computing platforms, expensive camera sensors, and installation costs has made the deployment of these algorithms difficult.
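To make the BGM idea concrete, a minimal OpenCV frame-differencing sketch is shown below; the threshold value is an illustrative assumption and underlines why such methods are sensitive to illumination changes.

```python
import cv2

def bgm_vehicle_mask(prev_frame, curr_frame, thresh=25):
    """Binary generation model (BGM): frame differencing followed by thresholding.

    The threshold value is illustrative; BGM is highly sensitive to illumination changes.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```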
4.6. SAM Performance
A critical aspect of the performance of SAM is its ability to handle low-resolution images effectively. This capability is particularly important because many urban traffic surveillance systems rely on cameras that do not produce high-resolution images. Low-resolution images present a challenge due to limited contextual information, but SAM managed to maintain a high performance by efficiently handling reduced pixel data. In contrast, SSD and Faster RCNN performed poorly with low-resolution datasets, indicating their inadequacy for such applications. YOLO models, although better than SSD and Faster RCNN, still fell short compared to SAM. Additionally, the SAM algorithm’s adaptability to diverse fields of view—from narrow to wide angles—further underscores its robustness. This versatility is crucial for urban environments where cameras capture various perspectives. Overall, SAM’s superior performance across multiple metrics including precision, recall, and adaptability to challenging conditions makes it the most effective model for vehicle detection in our specific application. While the YOLO models, particularly YOLOv7, offer a good balance and are suitable for real-time applications, SAM’s exceptional accuracy and reliability in diverse scenarios establish it as the best choice for high-reliability and robust vehicle detection.
Figure 6 shows the outputs of the SAM algorithm in challenging areas, with high performance in image segmentation. Additionally, we demonstrate SAM’s performance on satellite images (Figure 7). The algorithm can efficiently segment vehicles in satellite images obtained from Google Earth. It also performs acceptably on images showing vehicles from a top view, which opens up the opportunity for further work on platforms with a bird’s-eye view, ranging from unmanned aerial vehicles (UAVs) to satellite imagery. Additionally, another ten challenging highway cameras were used to evaluate SAM’s performance, as shown in Figure 8. Other roadside objects such as traffic signs, the road surface, and road markings could also be recognized by SAM.
The current implementation of the SAM has limitations in determining whether a vehicle is parked and in classifying vehicle types. These shortcomings are due to SAM’s focus on segmentation tasks. To address this, we propose combining SAM with the YOLO model, which excels in real-time object detection and classification. By integrating SAM’s segmentation capabilities with YOLO’s classification strengths, we aim to enhance the system’s overall performance. Future work will involve implementing and evaluating this combined approach to improve both vehicle detection and classification.
5. Conclusions and Future Works
The proposed deep learning algorithm based on the segment anything model (SAM) demonstrates significant advancements in vehicle detection and tracking using uncalibrated urban traffic surveillance cameras. By leveraging SAM’s flexible and adaptable segmentation capabilities, along with the robust tracking performance of DeepSORT, the methodology achieved high precision, recall, and F1-score metrics of 89.68%, 97.87%, and 93.60%, respectively. This indicates a notable improvement over existing state-of-the-art methods like YOLO. The algorithm’s ability to accurately detect and track vehicles, even under challenging conditions such as low resolution and varying illumination, underscores its potential for enhancing traffic management systems in smart cities.
Despite its success, the study acknowledges certain limitations and areas for further enhancement. The computational complexity introduced by segment merging and multi-frame analysis poses challenges for real-time application. Additionally, while the algorithm is tailored for uncalibrated highway cameras, its performance in other surveillance environments needs thorough investigation.
Future research will focus on optimizing the algorithm for real-time performance without compromising accuracy. This involves refining the segment merging process and exploring more efficient deep learning architectures. Additionally, expanding the algorithm’s applicability to various surveillance camera types and diverse urban environments will be essential. Another avenue for exploration is the integration of additional contextual information such as traffic flow patterns and vehicle types to enhance the robustness and accuracy of the detection and tracking system. Finally, collaboration with city traffic management authorities to implement and test the algorithm in real-world scenarios will provide valuable insights and further validate its practical utility. This continued development aims to contribute to more intelligent and efficient urban traffic monitoring and management systems.