1. Introduction
In computer vision, object tracking is a fundamental recognition problem that involves three key steps: target localization, continuous motion tracking, and trajectory estimation. A common approach is to extract the target's appearance features, predict their position, establish associations, and update the features. However, challenges such as lighting variations, scale changes, rotation, occlusion, and similar appearances complicate multiple-object tracking. Occlusion caused by static or moving objects can disrupt tracking and result in ID switches when the tracker assigns a new ID upon reappearance, mistaking the occluded object for a new one or halting tracking during mutual occlusion.
Object tracking methodologies can be categorized into single-object and multiple-object tracking. Single-object tracking, often applied to pedestrians, vehicles, or animals, has been widely studied due to the variability in target shapes and appearances, particularly with pedestrians. These algorithms are designed to accurately track an object’s motion over time and estimate its trajectory. A predominant approach in object tracking is the tracking-by-detection framework, wherein a tracker identifies and marks the target at regular intervals.
Multi-object tracking (MOT) poses greater challenges than single-object tracking because of the complexities in distinguishing multiple objects with similar appearances, handling occlusions, and managing fragmented trajectories. A critical challenge in MOT is determining whether detection boxes across different frames belong to the same object. MOT methods can be categorized as offline or online. Offline approaches optimize object trajectories across the entire frame sequence, offering higher accuracy but at the cost of significant computational demand and lack of real-time capability. By contrast, online methods process frames sequentially, relying on current and past data, making them suitable for real-time applications. However, online tracking faces difficulties with occlusions and fragmented trajectories, which complicate consistent object identification.
Network flow [1] models multi-object tracking as a directed graph, determining the optimal trajectories via a minimum-cost algorithm. Zhang et al. [2] combined temporal and spatial data for cross-camera pedestrian tracking, optimized by a conditional random field. Dai et al. [3] improved tracking with proposal generation and graph convolutional networks. Zhang et al. [4] streamlined re-identification with energy minimization for dense scenes. Although energy-based methods [5] can achieve global optimization, offline tracking methods are computationally heavy and unsuitable for real-time applications.
Multi-object tracking [6] has been modeled using Markov decision processes (MDPs). TC-ODAL [7] improves track reliability by classifying trackers based on detectability and continuity, using linking strategies and incremental linear discriminant analysis. SORT [8] enables real-time tracking with the Kalman filter and Hungarian algorithm, enhanced by Faster R-CNN detections. Although traditional methods rely on visual similarity and feature models to reduce ID switches, they often fail to distinguish subtle differences between similar targets, leading to potential misidentifications in online tracking scenarios.
Traditional feature models with manually designed descriptors struggle with noise and imprecise target representations. In contrast, the convolutional neural network (CNN) [9] proposed by LeCun learns features autonomously, overcoming these limitations. CDA-DDAL [10] improved tracking using adaptive strategies and tracklet metrics. CRDTC [11] links vehicle trajectories via background modeling and uses a support vector machine (SVM)–CNN hybrid for efficiency and accuracy. Deep-SORT [12], enhanced by ResNet [13] and the Kalman filter, balances frame rates and tracking accuracy, thereby improving the handling of prolonged occlusions.
Another issue is managing new objects entering the scene, which requires tracker initialization. Tracking by detection is the most common strategy, as Nodehi [14] demonstrated for detecting individuals and generating trajectories through re-identification. However, false detection bounding boxes increase tracker errors and the number of missing objects, so effective methods must robustly initialize trackers and filter out incorrect boxes. Many tracking methods, including TC-ODAL, CDA-DDAL, and AP_HWDPL [15], lack precise tracker initialization, leading to errors. Some use additional verification algorithms, such as the SVM in MDP tracking, but face challenges with false positives. Others, such as tracker hierarchy and Deep-SORT, use matching rates or frame counts to filter errors but struggle with continuous false detections. Methods such as NOTL [16] provide real-time tracking but struggle with ID switching and misidentification in complex scenarios with similar targets.
Current tracking methods that use tracking-by-detection often encounter challenges with erroneous detection boxes; effective frameworks should discern and discard incorrect detections. Wang et al. [17] proposed a multidomain joint learning method to address multiple-angle issues in drone data, tracking pedestrians by utilizing domain-guided dropout for feature organization. This approach enhances model accuracy across domains and integrates with Markov decision-process trackers for drone multi-object tracking.
To optimize the performance of deep learning feature extraction models, a large and diverse training dataset is essential. Training on a single dataset limits the model's generalization, which can reduce performance in complex tracking environments. Cross-dataset learning, which uses multiple datasets simultaneously, enhances the diversity and complementarity of the training data. This study therefore adopted cross-dataset learning to improve model performance and generalization.
Deep learning-based tracking frameworks outperform traditional methods but require substantial data. Owing to the scarcity of multi-object tracking datasets, cross-dataset learning with pedestrian re-identification datasets is commonly used. Cross-dataset learning, based on multi-task learning [18], merges training data from various tasks into common feature representations, enhancing performance in multiple applications [19,20,21,22,23,24]. Ref. [25] used a cross-domain CNN approach that integrated spatial and frequency features with attention mechanisms to enhance image manipulation detection, whereas MTDnet [26] and SpindleNet [27] incorporated auxiliary tasks to boost the accuracy of person re-identification. When data variations are subtle, single-task learning, as used by domain-guided dropout [28], is preferred to optimize the impact of neurons on individual datasets; however, this can lead to undertraining of neurons shared across tasks. Zhou et al. [29] developed the CenterTrack tracker for online, real-time object tracking by associating objects between adjacent frames; however, its performance declines with prolonged occlusions because of its reliance on local frame information.
This study proposes a novel online tracking framework using deep feature models, called single-task joint learning (STJL). The main goal is to incorporate training data from diverse environments to enhance target recognition across various scenarios. Although it uses a tracking-by-detection approach, the model addresses existing limitations by introducing a strict initialization judgment to reduce incorrect detection boxes, minimize errors, and maintain tracking accuracy. This framework improves MOT identification accuracy in applications such as deployed surveillance systems [30,31,32].
The STJL model is first trained offline, as shown in Figure 1. Multiple training datasets are standardized and relabeled to avoid duplicate labels and then combined into a single large STJL dataset. This dataset is used for single-task training with cosine metric learning [33], resulting in the STJL model.
After the offline training phase, the STJL model is used for tracking. It extracts features from detection boxes, predicts object states, and uses deep features to determine whether boxes belong to the same object, linking them accordingly. New trackers are initialized for new objects. Both detection-box confidence and tracklet confidence serve as criteria to distinguish accurate boxes from erroneous ones, excluding faulty detections.
This study proposes two key contributions: the STJL model, which enhances feature extraction in diverse and challenging environments, and a refined tracker initialization strategy that combines detection and tracklet confidence, significantly reducing false positives and ID switches for improved tracking performance.
This paper is structured as follows: Section 2 presents an initialization technique combining detection and tracklet confidence to enhance tracking precision; Section 3 validates the proposed deep model through experiments and compares its performance with state-of-the-art methods; Section 4 offers concluding remarks.
2. STJL Model Applied to Online Multi-Object Tracking
An enhanced online multi-object tracking technique using a deep-feature model within a tracking-by-detection framework is proposed. By combining the deep feature model with the Kalman filter, it links detection boxes of targets in each frame. The algorithm runs iteratively until the image input ends, producing the tracking results.
At time t, an object detector scans the input image. Detected objects are passed to the tracking algorithm, which extracts features from detection boxes and uses the Kalman filter to predict target location and motion. The feature distance between the detection box and the tracker is calculated to determine similarity. Based on position and feature similarity, the algorithm links detection boxes to trackers and updates their states. If a detection box is not linked to a tracker, a new tracker is added if a new object is indicated, or the box is discarded if it is deemed a false positive. The algorithm also removes trackers whose objects have exited, then moves to the next image at time t + 1, repeating the process until the sequence ends, as shown in Figure 2.
The following subsections detail the STJL model, covering its network architecture, training methods, object state prediction, tracking procedures, and linkage optimization using deep features. We also propose a novel tracker initialization method and explain the framework for evaluating tracker reliability.
2.1. Deep Feature Extraction
Feature extraction is the key to linking trackers to detection boxes. Traditional algorithms using features often struggle to distinguish similar-looking objects owing to their simplicity. In contrast, our model leverages a CNN based on a residual network and is trained using cosine metric learning to capture complex pedestrian features across diverse contexts. By merging multiple pedestrian re-identification datasets into a unified STJL dataset, the model’s generalization and object differentiation capabilities are enhanced.
As the depth of CNNs increases, the feature accuracy improves but can lead to vanishing gradients. ResNet addresses this by introducing shortcut connections that skip layers, enhancing gradient propagation and preventing this issue. These connections help the network learn residual functions, allowing for deeper stacking without hindering training. Each ResNet layer includes batch normalization, ReLU activation, and a 3 × 3 convolution. In this study, a ResNet architecture with two convolutional layers, one max-pooling layer, six residual blocks, and a fully connected layer was used. A 1 × 1 convolutional layer ensured consistent input and output dimensions across blocks for residual computation.
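As an illustration, the following is a minimal sketch of the pre-activation residual block just described (batch normalization, ReLU, 3 × 3 convolution, with a 1 × 1 projection when dimensions change). The PyTorch module and the specific layer widths are our own assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> 3x3 conv, applied twice.

    A 1x1 convolution projects the shortcut when the channel count or
    spatial resolution changes, keeping input and output dimensions
    consistent for the residual sum (as described in the text).
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        # 1x1 projection keeps shapes compatible across blocks.
        self.shortcut = (
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
            if (stride != 1 or in_ch != out_ch) else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)   # residual sum
```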
Metric learning, often used to compare images, measures distances between elements through various distance functions and is optimized using techniques such as SVM, K-means, and k-NN. Although CNNs can use pre-training on datasets such as ImageNet or MS COCO, category mismatches can limit their effectiveness for image retrieval tasks such as pedestrian re-identification. Effective sampling strategies are crucial during training to prevent mismatched data from hindering convergence. Deep cosine metric learning simplifies this process while maintaining its efficiency.
Deep cosine metric learning uses a network with both feature extraction and classification layers, but during testing, only the feature vectors are used, excluding the classification layer. In the training dataset, each data point consists of a training image, denoted as $x$, and its corresponding label, $y$, which signifies the individual's ID. Given that $y$ is part of a set comprising $C$ total IDs, the network perceives each ID as a distinct category. The terminal layer of the network therefore contains $C$ neurons. When an image is fed into the network, it yields $C$ probability scores, one for every category.
The softmax classifier computes the classification probability, defined as
$$ p(y = k \mid x) = \frac{\exp(\mathbf{w}_k^\top \mathbf{r} + b_k)}{\sum_{n=1}^{C} \exp(\mathbf{w}_n^\top \mathbf{r} + b_n)}, $$
where $\mathbf{r}$ is the feature vector computed by the network's feature extraction layer, $\mathbf{w}_k$ represents the network weights of the final fully connected layer, and $b_k$ is the bias term. $p(y = k \mid x)$ denotes the computed probability score for the image belonging to category $k$. After obtaining the probability scores for all categories, the loss function is defined as
$$ L = -\frac{1}{N} \sum_{(x,\, l) \in D} \log p(y = l \mid x), $$
where $D$ represents the set of training images in the dataset, $N$ is the total number of images in the dataset, and $l$ refers to the label or class of the image, representing the object's unique ID. This loss function minimizes the cross-entropy between the predicted distribution $p(y \mid x)$ and the ground-truth labels. Ideally, after training, $p(y = l \mid x)$ approaches 1 when the decision is correct and tends towards 0 for the other categories $k \neq l$.
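For concreteness, below is a minimal NumPy sketch of this classification head and its cross-entropy loss; the variable names (`W`, `b`, `features`) are illustrative, not taken from the paper.

```python
import numpy as np

def softmax_cross_entropy(features, labels, W, b):
    """Standard softmax classification loss over C identity classes.

    features: (N, d) feature vectors from the extraction layers
    labels:   (N,) integer identity IDs in [0, C)
    W:        (d, C) final fully connected weights
    b:        (C,) bias terms
    """
    logits = features @ W + b                      # (N, C) class scores
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # p(y = k | x)
    # Cross-entropy: average negative log-probability of the true ID.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
```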
Training solely with a standard softmax classifier may not optimize the feature vector distances for the same or different identities. Figure 3a shows a feature vector space with three categories. Ideally, as in Figure 3b, the classifier should define clear decision boundaries. However, within-category feature vectors can still vary significantly and are not always aligned with the mean value of their category.
To ensure compact feature vectors, the softmax classifier training parameters were adjusted. By adding $\ell_2$ normalization to the final layer, all feature vectors and weights were normalized to unit length, leading to the definition of the cosine softmax classifier:
$$ p(y = k \mid \mathbf{r}) = \frac{\exp(\kappa \, \tilde{\mathbf{w}}_k^\top \mathbf{r})}{\sum_{n=1}^{C} \exp(\kappa \, \tilde{\mathbf{w}}_n^\top \mathbf{r})}, \quad (1) $$
where $\tilde{\mathbf{w}}_k = \mathbf{w}_k / \|\mathbf{w}_k\|_2$ is the normalized network weight and $\kappa$ is a learnable scale weight. Equation (1) excludes the bias parameter $b$, simplifying the model with fewer parameters. The cross-entropy from (1) remains the loss function for training.
Figure 3c shows the decision boundary for the cosine softmax classifier. After training, all the sample vectors were normalized to the unit length; they not only moved away from the inter-category boundaries, but also converged towards their category’s mean value, fulfilling metric learning objectives.
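A minimal sketch of the cosine softmax head in Equation (1) follows, again with illustrative names; `kappa` is the learnable scale, and both weights and features are unit-normalized as described above.

```python
import numpy as np

def cosine_softmax(features, W, kappa):
    """Cosine softmax classifier (Equation (1)), bias-free.

    features: (N, d) feature vectors, l2-normalized per row
    W:        (d, C) class weights, l2-normalized per column
    kappa:    learnable scale parameter
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    logits = kappa * (features @ W)              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```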
Different datasets serve various computer vision tasks. Most datasets focus on a single environment, limiting adaptability to challenges like lighting and occlusions. A universal feature extractor is crucial for tracking in diverse settings. The rise of smart surveillance and pedestrian re-identification research is increasing dataset availability, improving model training. STJL enhances adaptability by training on multiple datasets. For instance, combining Market-1501 [34], which lacks volume, with DukeMTMC-reID [35], rich in occluded images, improves the network's robustness against occlusions, boosting accuracy on Market-1501.
To train the network using multiple datasets, assume there are $M$ datasets. Each dataset $D_i = \{(x_j^i, y_j^i)\}_{j=1}^{N_i}$ contains $N_i$ images with a total of $C_i$ pedestrians. Thus, our total training data can be defined as $D = \{D_1, D_2, \ldots, D_M\}$. In this context, $x_j^i$ represents the $j$-th image in the $i$-th dataset, and $y_j^i$ is the ID label of the $j$-th image in the $i$-th dataset. The label belongs to the set $\{1, 2, \ldots, C_i\}$.
When aiming to train with multiple datasets, the most intuitive approach is to employ multi-task learning. This entails training the network separately with distinct datasets. The learning loss can be defined as $L_{\mathrm{MTL}} = \sum_{i=1}^{M} L(D_i)$, where $L(\cdot)$ is the loss function. However, this training approach might cause the network to overfit to one particular dataset, and the order of the training datasets can also influence the final performance of the network.
Data from the various datasets were combined into an aggregated STJL dataset, with individuals re-labeled to ensure unique IDs across datasets for pedestrian re-identification. The new STJL dataset is represented as $D_{\mathrm{STJL}} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ represents a training image, $N$ is the total number of training images over all datasets, and $y_n$ denotes the re-labeled ID, which belongs to the set $\{1, 2, \ldots, C_{\mathrm{STJL}}\}$. $C_{\mathrm{STJL}}$ is the aggregate number of IDs in the new dataset and is defined as $C_{\mathrm{STJL}} = \sum_{i=1}^{M} C_i$, where $C_i$ represents the total number of IDs in the $i$-th dataset. The learning loss for the STJL dataset is refined to $L_{\mathrm{STJL}} = L(D_{\mathrm{STJL}})$.
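The re-labeling amounts to offsetting each dataset's IDs by the cumulative ID count of the datasets before it. Below is a sketch under the assumption that each dataset is a list of (image, local ID) pairs with dense IDs starting at 0; the names are hypothetical.

```python
def merge_into_stjl(datasets):
    """Merge per-dataset (image, id) pairs into one STJL dataset.

    datasets: list of lists of (image, local_id) pairs, where local_id
              runs from 0 to C_i - 1 within dataset i.
    Returns a single list with globally unique IDs: each dataset's labels
    are offset by the total ID count of the preceding datasets.
    """
    stjl, offset = [], 0
    for data in datasets:
        num_ids = 1 + max(local_id for _, local_id in data)   # C_i
        stjl.extend((img, local_id + offset) for img, local_id in data)
        offset += num_ids   # C_STJL accumulates as the sum of C_i
    return stjl
```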
Merging datasets into the STJL dataset enables the network to distinguish individuals both within and across datasets, enriching data and adding complexity by differentiating similar individuals, thus enhancing feature model training.
2.2. Tracking State Prediction
The Kalman filter predicts a target's position in the current frame based on the dynamics of its previous state. In multi-object tracking with uncalibrated cameras and no motion data, the target's state vector is defined as $\mathbf{x} = [u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}]^\top$, where $(u, v)$ are the pixel coordinates of the detection box center, $\gamma$ is the aspect ratio, $h$ is the height, and $\dot{u}, \dot{v}, \dot{\gamma}, \dot{h}$ represent the velocities and rates of change of the aspect ratio and height. The prediction step computes the state as $\hat{\mathbf{x}}_t = \mathbf{F} \mathbf{x}_{t-1}$ using the state transition matrix $\mathbf{F}$. The predicted error covariance matrix is calculated as $\hat{\mathbf{P}}_t = \mathbf{F} \mathbf{P}_{t-1} \mathbf{F}^\top + \mathbf{Q}$, where $\hat{\mathbf{P}}_t$ represents the predicted error covariance and $\mathbf{Q}$ is the process noise covariance.
In the update phase, the Kalman filter refines its prediction using current observations $\mathbf{z}_t = \mathbf{H} \mathbf{x}_t$, where $\mathbf{H}$ is the observation model. The Kalman gain $\mathbf{K}_t = \hat{\mathbf{P}}_t \mathbf{H}^\top (\mathbf{H} \hat{\mathbf{P}}_t \mathbf{H}^\top + \mathbf{R})^{-1}$ optimizes the update by minimizing the error, where $\mathbf{R}$ is the measurement noise covariance. The updated state is $\mathbf{x}_t = \hat{\mathbf{x}}_t + \mathbf{K}_t (\mathbf{z}_t - \mathbf{H} \hat{\mathbf{x}}_t)$, and the updated error covariance is $\mathbf{P}_t = (\mathbf{I} - \mathbf{K}_t \mathbf{H}) \hat{\mathbf{P}}_t$, where $\mathbf{x}_t$ is the object's updated state, $\hat{\mathbf{x}}_t$ is the predicted state, and $\mathbf{P}_t$ is the updated error covariance.
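A compact NumPy sketch of this constant-velocity predict/update cycle on the eight-dimensional state follows; the noise covariances `Q` and `R` are placeholder values to be tuned, not the paper's settings.

```python
import numpy as np

dim = 8                                  # state: [u, v, gamma, h, du, dv, dgamma, dh]
F = np.eye(dim)
F[:4, 4:] = np.eye(4)                    # constant-velocity transition (dt = 1 frame)
H = np.eye(4, dim)                       # observe [u, v, gamma, h] only
Q = np.eye(dim) * 1e-2                   # process noise covariance (placeholder)
R = np.eye(4) * 1e-1                     # measurement noise covariance (placeholder)

def predict(x, P):
    """Propagate the state and error covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x_pred, P_pred, z):
    """Fuse the predicted state with a measured detection box z = [u, v, gamma, h]."""
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)              # corrected state
    P = (np.eye(dim) - K @ H) @ P_pred             # corrected covariance
    return x, P
```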
2.3. Data Association
The Kalman filter refines the predicted object state by linking it to the detection results using the Hungarian algorithm. After prediction, the object's location is estimated, and feature vectors are extracted. The squared Mahalanobis distance,
$$ d^{(1)}(i, j) = (\mathbf{d}_j - \hat{\mathbf{y}}_i)^\top \mathbf{S}_i^{-1} (\mathbf{d}_j - \hat{\mathbf{y}}_i), $$
measures the positional distance between the $i$-th tracker and the $j$-th detection box $\mathbf{d}_j$, incorporating the filter's prediction $\hat{\mathbf{y}}_i$. The inverse covariance matrix $\mathbf{S}_i^{-1}$ is derived from the error covariance matrix using Cholesky decomposition. Distances within the 95% confidence interval of a chi-square distribution are considered probable links, with the connectivity matrix $b^{(1)}_{i,j} = \mathbb{1}[d^{(1)}(i, j) \le t^{(1)}]$ indicating valid connections when the Mahalanobis distance falls below the threshold $t^{(1)}$.
When target dynamics are reliable, the Mahalanobis distance is used to assess tracker–detection box linkage, but its accuracy diminishes with rapid camera movement. To compensate, a cosine distance based on appearance is proposed. The feature vector $\mathbf{r}_j$ for each detection box is normalized to unit length, and each tracker's feature history is stored in a gallery $R_i$ composed of the feature vectors from the most recently linked detection boxes. The cosine distance,
$$ d^{(2)}(i, j) = \min \{\, 1 - \mathbf{r}_j^\top \mathbf{r}_k \mid \mathbf{r}_k \in R_i \,\}, $$
calculates the similarity between the $i$-th tracker and the $j$-th detection box. Connections are deemed valid if the cosine distance falls below the threshold $t^{(2)}$, represented by the connectivity matrix $b^{(2)}_{i,j} = \mathbb{1}[d^{(2)}(i, j) \le t^{(2)}]$.
By combining the Mahalanobis distance, which evaluates short-term position dynamics, with the cosine distance, which accounts for appearance over longer occlusions, a comprehensive connection cost is derived. The final connectable indicator integrates both metrics as $b_{i,j} = b^{(1)}_{i,j} \cdot b^{(2)}_{i,j}$. In multi-object tracking, linking trackers and detection boxes is treated as a series of subproblems. To improve robustness, "matching overlap" prioritizes trackers with consistent visibility, linking detection boxes with the lowest distance cost while accounting for the tracker's "tracking age" $a_i$, which measures the time since its last link.
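Below is a sketch of age-prioritized association under these two gates, using SciPy's Hungarian solver. The threshold defaults (a chi-square 95% gate for a four-dimensional measurement, and an appearance gate of 0.3), the tracker attributes, and the gallery layout are illustrative assumptions, not the paper's exact values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(trackers, detections, t1=9.4877, t2=0.3, max_age=30):
    """Cascade matching: trackers with smaller tracking age match first.

    trackers:   objects with .age, .mahalanobis(det), and a (k, d) unit-norm
                .gallery of recent appearance features (hypothetical interface)
    detections: objects with a unit-normalized (d,) .feature vector
    Returns (matches, unmatched_detection_indices).
    """
    matches, unmatched = [], set(range(len(detections)))
    for age in range(max_age + 1):                       # prioritize recently seen trackers
        idx_t = [i for i, trk in enumerate(trackers) if trk.age == age]
        idx_d = sorted(unmatched)
        if not idx_t or not idx_d:
            continue
        cost = np.full((len(idx_t), len(idx_d)), 1e6)    # large cost = not connectable
        for a, i in enumerate(idx_t):
            for b, j in enumerate(idx_d):
                d1 = trackers[i].mahalanobis(detections[j])                     # position gate
                d2 = np.min(1.0 - trackers[i].gallery @ detections[j].feature)  # appearance gate
                if d1 <= t1 and d2 <= t2:                # b_ij = b1 * b2
                    cost[a, b] = d2
        rows, cols = linear_sum_assignment(cost)         # Hungarian algorithm
        for a, b in zip(rows, cols):
            if cost[a, b] < 1e6:
                matches.append((idx_t[a], idx_d[b]))
                unmatched.discard(idx_d[b])
    return matches, sorted(unmatched)
```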
2.4. Tracker Initialization and Termination
After matching, unmatched boxes might indicate new objects or false positives. Traditionally, they start as trial trackers, becoming official if they persist. This “continuous tracking determination” filters single-frame errors but may still track consecutive errors. A stringent initialization confidence threshold is recommended to prevent this.
To enhance tracking accuracy, the detection boxes in frame $t$, denoted as $D_t$, are filtered based on "detection confidence", which is calculated from the likelihood score provided by the detector. Detection confidence serves as a preliminary criterion for tracker initialization and error reduction. Specifically, the detection boxes are separated into two subsets according to a confidence threshold $\tau_d$: strong detection boxes $D_t^{\mathrm{strong}}$, which meet or exceed the confidence threshold, and weak detection boxes $D_t^{\mathrm{weak}}$, which fall below it:
$$ D_t^{\mathrm{strong}} = \{\, d \in D_t \mid s(d) \ge \tau_d \,\}, \qquad D_t^{\mathrm{weak}} = \{\, d \in D_t \mid s(d) < \tau_d \,\}, \quad (2) $$
where $s(d)$ is the detector's likelihood score. High-confidence detections $D_t^{\mathrm{strong}}$ are preferred for initializing trackers, providing a robust foundation for accurate tracking. In contrast, weak detection boxes $D_t^{\mathrm{weak}}$ are more prone to errors and are therefore cautiously considered or excluded from the initialization process. This two-tier confidence-based filtering, represented by Equation (2), helps reduce false positives and enhances the reliability of tracker initialization, contributing to overall tracking stability.
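In code, Equation (2) is a one-line partition of the frame's detections; the `score` attribute and the threshold value here are illustrative.

```python
def split_by_confidence(detections, tau_d=0.6):
    """Partition frame-t detections into strong and weak subsets (Equation (2))."""
    strong = [d for d in detections if d.score >= tau_d]  # candidates for initialization
    weak   = [d for d in detections if d.score <  tau_d]  # linking only, never initialize
    return strong, weak
```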
Strong detection boxes with high confidence are suitable for target linking and initialization, while weak detection boxes with lower confidence are prone to errors but can still be used for linking. However, depending solely on detection-box confidence can be misleading, as high-confidence errors might resemble human features. To address this, trajectory confidence is proposed to assess the detection boxes from a tracking perspective. Reliable trajectories indicate genuine objects, whereas errors tend to produce fragmented paths. Unmatched boxes start as trial trackers, and they are promoted to official trackers with unique IDs only when their trajectory confidence consistently exceeds a threshold.
The trajectory confidence, as shown in Equation (3), is based on two factors: the similarity between the linked detection boxes and the trajectory length. A consistent tracker will have similar boxes, and longer trajectories indicate higher reliability:
$$ \mathrm{conf}(T_i) = \left( \frac{1}{L_i} \sum_{h = t_s}^{t_c - 1} \mathrm{sim}\big(d_i^h, d_i^{h+1}\big) \right) \cdot \left( 1 - e^{-\beta \sqrt{L_i}} \right), \quad (3) $$
where $T_i$ is the trajectory of object $i$. The first term represents the average similarity of detection boxes between frames, where $L_i$ is the length of the trajectory and $h$ refers to the index of the frames considered in the calculation of the trajectory similarity. The interval of $h$ ranges from the starting frame of the detection to the current frame, i.e., $h \in [t_s, t_c]$, where $t_s$ is the frame in which the detection first appears and $t_c$ is the current frame being analyzed. The similarity between detection boxes, $\mathrm{sim}(d_i^h, d_i^{h+1})$, is calculated using the cosine similarity between the feature vectors extracted by the deep learning model, where $d_i^h$ is the detection box of object $i$ in frame $h$. Specifically, for two feature vectors $\mathbf{f}_a$ and $\mathbf{f}_b$, the similarity is computed as $\mathrm{sim}(\mathbf{f}_a, \mathbf{f}_b) = \frac{\mathbf{f}_a^\top \mathbf{f}_b}{\|\mathbf{f}_a\| \, \|\mathbf{f}_b\|}$. This similarity measures how closely related the detection boxes are in appearance, contributing to the overall trajectory confidence. The first term sums the detection box similarities from time $t_s$ to $t_c$ and divides by the trajectory length $L_i$, yielding the average similarity of the detection boxes; the higher the similarity, the higher the confidence. The second term evaluates the trajectory length, where $\beta$ is the control coefficient; the longer the trajectory, the higher the confidence. Both terms fall within the range of 0 to 1.
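A sketch of Equation (3) over a tracklet's stored feature history follows; the exponential form of the length term matches the reconstruction above and should be treated as an assumption about the exact functional form, with `beta` as a tunable coefficient.

```python
import numpy as np

def trajectory_confidence(features, beta=1.0):
    """Equation (3): average frame-to-frame appearance similarity,
    weighted by a length term that grows toward 1 for long tracklets.

    features: (L, d) array, one feature vector per linked frame.
    """
    if len(features) < 2:
        return 0.0
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = np.sum(f[:-1] * f[1:], axis=1)        # cosine similarity of consecutive boxes
    avg_sim = float(np.mean(sims))               # first term: average similarity
    length_term = 1.0 - np.exp(-beta * np.sqrt(len(features)))  # second term, in [0, 1)
    return avg_sim * length_term
```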
Only robust detection boxes are retained as trial trackers for potential new objects, while weaker ones are discarded. If their trajectory confidence surpasses a threshold, they become official trackers; otherwise, they are eliminated. In the proposed method, the confidence of a tentative trajectory $T_i^{\mathrm{tent}}$ is evaluated against a threshold to determine whether it can transition into a confirmed trajectory. Specifically, as described in Equation (4), if the confidence of the tentative trajectory exceeds the threshold $\tau_{\mathrm{conf}}$, the trajectory is classified as a new confirmed trajectory $T_i^{\mathrm{conf}}$:
$$ T_i^{\mathrm{tent}} \rightarrow T_i^{\mathrm{conf}} \quad \text{if} \quad \mathrm{conf}\big(T_i^{\mathrm{tent}}\big) > \tau_{\mathrm{conf}}. \quad (4) $$
Otherwise, the trajectory remains in the tentative state. This decision process is critical for ensuring that only reliable trajectories are promoted to the confirmed state, while less reliable ones remain tentative.
After the matching process, there may be unmatched detection boxes as well as trackers that are not linked to any detection boxes. Some of these trackers may no longer be linked because the object has left the tracking scene, requiring a decision on whether to terminate the tracker. The tracker's age is used as the criterion for this decision. Tracker age refers to the number of frames since the tracker was last linked to a detection box, as defined in Equation (5):
$$ a_i = \begin{cases} 0, & \text{if tracker } i \text{ is linked to a detection box at time } n, \\ a_i + 1, & \text{otherwise.} \end{cases} \quad (5) $$
The tracker age $a_i$ increments by 1 each time the Kalman filter makes a prediction. If, at the current time $n$, tracker $i$ is linked to a new detection box, the tracker age is reset to 0. When the object has left the scene, the tracker can no longer link to a detection box, causing its age to continue increasing. The algorithm terminates the tracker when its age exceeds a certain number of frames. In practice, this termination threshold is set to 30 frames and can be adjusted based on the video's frames per second.
3. Experiments
This section outlines the design and analysis of the experiments conducted. The experiments were performed in two parts. The first part involved the evaluation of the feature models. We adjusted the network architecture and training strategies to assess their performance to identify the optimal model for the subsequent multi-object tracking experiments. The second part focused on multi-object tracking tests, in which the best feature model was applied to multiple test videos, primarily tracking pedestrians. We tested different tracker initialization methods and adjusted various parameters to determine the most effective approach. Finally, we compared our results with several state-of-the-art tracking methods using popular multi-object tracking benchmark videos.
Section 3.1 outlines the evaluation criteria of this research model; Section 3.2 evaluates the performance of single-task joint learning and cosine metric learning; Section 3.3 compares our proposed method with state-of-the-art algorithms using a multi-object tracking dataset and provides a detailed analysis and discussion of the experimental results.
The experimental setup consisted of an Intel Core i7-930 CPU, Nvidia GeForce GTX 1080Ti GPU, and 24 GB RAM, running Ubuntu 14.04 LTS with CUDA 8.0 and CUDNN v6 for GPU acceleration, and implemented using Python 3.6 and TensorFlow 1.4.
To construct a comprehensive comparative analysis, we employed a diverse array of state-of-the-art MOT algorithms, each chosen for its distinct methodological focus within MOT, addressing critical aspects such as trajectory stability, occlusion handling, and computational efficiency. EAMTT [36] provides adaptive trajectory tracking, ensuring stability across extended sequences. CDT [37] utilizes discriminative feature learning to maintain object identity through gradual appearance variations, while MDP_SubCNN [38] applies part-based, multi-domain feature extraction to enhance resilience under occlusion. TSML_CDE [39] implements two-stream metric learning to achieve consistent object embedding, minimizing ID switches in high-mobility contexts. Deep-SORT [12] and KCF [40] serve as robust solutions for real-time processing; Deep-SORT leverages deep feature embedding for efficient object association, while KCF introduces a lightweight correlation filter-based approach suitable for resource-constrained environments. CDA-DDAL [10] and AP_HWDPL [15] are specifically optimized for occlusion robustness and hierarchical association prediction, offering reliable performance in densely populated, dynamic tracking conditions. Collectively, these algorithms provided a well-rounded framework for assessing our model's adaptability and performance across diverse, real-world tracking scenarios, emphasizing its versatility and competitive standing within the MOT landscape.
Our system used a ResNet for feature extraction, with a computational complexity of approximately $O(k^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}} \cdot H \cdot W)$ per convolutional layer, depending on the kernel size $k$ and the input dimensions $H \times W$. For metric learning, we employed the cosine softmax, with complexity $O(d)$ per comparison, where $d$ is the feature vector dimensionality.
Tracking state prediction was achieved through a Kalman filter, operating with a complexity of $O(m^3)$, where $m$ denotes the state dimension. Object association utilized the cosine distance, with complexity $O(d)$ per tracker–detection pair, while detection confidence assessment added minimal complexity through basic comparison operations.
Combining these components, the total computational complexity of our system was dominated by the feature extraction stage (ResNet), resulting in an approximate upper bound of $O(k^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}} \cdot H \cdot W)$ per frame. Given our implementation on an Intel Core i7-930 CPU (Intel Corporation, Santa Clara, CA, USA) with an Nvidia GTX 1080Ti GPU (Nvidia Corporation, Santa Clara, CA, USA), the system achieved near real-time processing speeds. This performance suggests that our model is feasible for real-world surveillance applications, with potential for further optimization through model compression techniques or hardware acceleration to ensure efficiency in more resource-constrained environments.
The feature model was trained as a pedestrian re-identification task using multiple widely cited datasets (CUHK03 [41], Market-1501, and DukeMTMC-reID), each selected for its unique representation of real-world tracking challenges and to ensure model robustness across diverse environments. CUHK03, captured from two non-overlapping cameras, includes challenges such as detection box misalignment and diverse pedestrian angles, which test the model's ability to adapt to variable perspectives and occlusions. Market-1501, filmed across six cameras, reflects real-world scenarios with frequent misalignments and varying angles, enhancing the model's ability to handle dynamic background changes and resolution inconsistencies. DukeMTMC-reID [35], featuring footage from eight synchronized cameras, provides a range of scenes with frequent occlusions, lighting changes, and multi-view tracking scenarios, ideal for testing and refining multi-object tracking models in complex multi-camera systems. Together, these datasets helped evaluate the model's scalability and adaptability in handling larger numbers of objects and expansive tracking environments, simulating conditions akin to real-world surveillance networks.
3.1. Model Evaluation Criteria
The performance evaluation utilized the cumulative match characteristic (CMC) in pedestrian re-identification. Features were extracted from query and gallery image sets. The query sets contained images of pedestrians with known IDs, while the gallery set consisted of images of pedestrians with unknown identities. The algorithm first extracted feature vectors for all images in both the query and gallery sets. Once all feature vectors were obtained, a distance function was applied to calculate the distances between feature vectors in the query and gallery sets, generating a distance matrix. Using this matrix, the images were then ranked according to distance, with smaller distances indicating higher similarity and thus appearing earlier in the ranking order. This ranking was subsequently used for performance evaluation. To obtain the CMC curve, the accuracy at each rank was calculated, derived using the following Equations (6) and (7):
$$ \mathrm{acc}_i(k) = \begin{cases} 1, & \text{if a correct match for query ID } i \text{ appears within Rank } k, \\ 0, & \text{otherwise,} \end{cases} \quad (6) $$
$$ \mathrm{CMC}(k) = \frac{1}{Q} \sum_{i=1}^{Q} \mathrm{acc}_i(k). \quad (7) $$
In Equation (6), $\mathrm{acc}_i(k)$ represents the accuracy for a pedestrian with ID $i$ in the query set at Rank $k$: the accuracy is 1 if the pedestrian with ID $i$ is matched within Rank $k$, and 0 otherwise. After calculating $\mathrm{acc}_i(k)$ for each ID, the overall accuracy at Rank $k$ was computed using Equation (7), where $Q$ is the total number of IDs in the query set; the sum of all $\mathrm{acc}_i(k)$ values divided by $Q$ gives the accuracy at Rank $k$. Once the accuracies for all ranks were calculated, plotting these values yielded the CMC curve.
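A minimal sketch of Equations (6) and (7), given a precomputed query–gallery distance matrix, follows; the names are illustrative.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """CMC accuracy per rank (Equations (6) and (7)).

    dist:        (Q, G) distance matrix between query and gallery features
    query_ids:   (Q,) ground-truth IDs of the query images
    gallery_ids: (G,) IDs of the gallery images
    """
    order = np.argsort(dist, axis=1)                 # smaller distance = earlier rank
    hits = gallery_ids[order] == query_ids[:, None]  # correct-match mask per rank
    acc_i = np.cumsum(hits, axis=1) > 0              # Eq. (6): 1 if matched within rank k
    return acc_i[:, :max_rank].mean(axis=0)          # Eq. (7): average over the Q queries
```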
Figure 4 displays query images on the left and gallery images on the right, with green frames indicating correct matches and red frames denoting incorrect ones. The accuracy was calculated based on the similarity ranks, with Rank 3 achieving 100% accuracy.
While the CMC metric has limitations, such as not accounting for multiple correct matches in the gallery, this issue is less relevant in multi-object tracking. In this context, each tracker was linked to a single detection box per frame, thereby eliminating the possibility of multiple correct matches. Therefore, Rank 1 accuracy serves as a reliable metric for evaluating the effectiveness of our deep feature model in multi-object tracking scenarios.
3.2. Evaluation of STJL Performance
STJL is a training strategy that utilizes multiple datasets to enhance the model’s generalization capabilities for broader applications. In this section, the constraints of training with a singular dataset are examined, and the efficacy of the STJL method is evaluated.
Experiments using three datasets are detailed: CUHK03 ('c'), Market-1501 ('m'), and DukeMTMC-reID ('d'). All images were resized to 128 × 64 pixels, as listed in Table 1. The training method used was cosine metric learning.
Models were first trained using individual datasets and subsequently evaluated across all three datasets using CMC, with an emphasis on Rank 1 accuracy.
Table 1 shows the image counts for training. The results in Figure 5a confirm that the single-dataset models excelled only on their corresponding test set, revealing a lack of generalization.
Single- and multi-dataset training strategies were contrasted, using Market-1501 as a baseline, with all multi-dataset models incorporating it for training. The total number of training images was the sum of those in each dataset. Models were named based on the datasets used for training; for example, the model trained on Market-1501 and DukeMTMC-reID was named 'md', and the one using all three datasets was 'cmd'. Four models ('m', 'md', 'cm', and 'cmd') were compared, as depicted in Figure 5b. The results indicate that the multi-dataset models outperformed the single-dataset model 'm'. Among the dual-dataset models, 'md' and 'cm' saw Rank 1 accuracy increases of approximately 0.8% and 0.6%, respectively, while the 'cmd' model using all three datasets improved by 1%.
Subsequently, the STJL approach was extended across multiple datasets, centering on the performance of these multi-dataset models on varied test datasets. Four distinct multi-dataset models ('cm', 'cd', 'md', and 'cmd') were trained and assessed on the three datasets. The results, shown in Figure 6a, indicate that models trained on two datasets performed worse on unseen datasets, whereas the model trained on all three datasets ('cmd') performed well across all test datasets. Moreover, 'cmd' outperformed the two-dataset models, with a 0.4% higher accuracy on Market-1501 compared with 'cm', and a 1.5% higher accuracy on CUHK03 compared with 'cd'.
In summary, our experiments showed that using a single dataset was insufficient for robust pedestrian re-identification. For example, Market-1501 lacked the scope to train the model for occlusions, while adding DukeMTMC-reID improved the performance by 0.8%. This supports the notion that diverse datasets not only enrich the training data but also address specific challenges. Hence, the ‘cmd’ model was chosen for subsequent endeavors.
Building upon the prior training approaches, STJL was executed using CUHK03, Market-1501, and DukeMTMC-reID as the training datasets. Two learning methodologies were compared: the first employed the conventional softmax classifier (denoted "softmax"), and the second leveraged cosine metric learning ("cosine softmax"). The primary evaluation metric was CMC Rank 1 accuracy.
The model comparison results in Figure 6b show that the models trained with the standard softmax classifier performed poorly on all three datasets, indicating their inability to learn generalizable features. In contrast, the models trained using cosine metric learning exhibited significantly better performance. Specifically, they outperformed the former by 52.13% for CUHK03, 33.23% for Market-1501, and 33.87% for DukeMTMC-reID. These experiments confirmed that cosine metric learning is more effective than the conventional softmax classifier in learning discriminative features across different environments, ensuring that models can extract universal and representative features.
Subsequently, the proposed method was compared with state-of-the-art techniques, including S2S [42], BaseIDE [43], MARS (IDE+XQDA) [44], CML, LOMO+XQDA, and TAUDL [45]. Except for LOMO+XQDA, all of these are deep-learning-based approaches. The compared datasets included Market-1501, CUHK03, and DukeMTMC-reID, with CMC Rank 1 accuracy as the evaluation criterion.
The accuracy results for the three datasets are shown in Figure 7. Our model achieved the highest Rank 1 accuracy on CUHK03, surpassing the latest methods with a 17.1% improvement over S2S and a 36% improvement over TAUDL (Figure 7a). Additionally, as shown in Figure 7b, our model outperformed the other state-of-the-art methods in Rank 1 accuracy on Market-1501, with a 15.52% lead over TAUDL, 13.08% over S2S, and 1% over CML. Figure 7c shows that our model also achieved the best Rank 1 accuracy on DukeMTMC-reID: compared with traditional methods like LOMO+XQDA, it achieved a remarkable 36.38% improvement, and among the deep learning methods, it outperformed TAUDL by 4.46% and BaseIDE by 0.94%. Overall, our model demonstrated superior Rank 1 accuracy across all datasets compared with the other state-of-the-art methods.
3.3. Evaluation of Multiple-Object Tracking Performance
In this section, a thorough assessment of the multi-object pedestrian tracking approach is provided. The experiments utilized video footage from various datasets as testing sources, including S2L1 and S2L2 from PETS2009 [46] as well as TUD-Crossing [47]. The experimental videos are listed in Table 2.
Multiple-object tracking accuracy (MOTA), precision (MOTP), false negatives (FN), false positives (FP), ID switches, mostly tracked (MT), and mostly lost (ML) are standard performance metrics. FP indicates incorrect detections, while FN refers to missed objects. In tracking by detection, FN and FP are mainly influenced by the detector, and there is often a trade-off; increasing sensitivity reduces FN but raises FP, and vice versa.
ID switches are crucial in tracking, as the goal is to maintain consistent object IDs. However, challenges like occlusions and similar appearances often cause ID switches. MOTA and MOTP measure overall tracking performance, while MT and ML assess trajectory completeness. A trajectory was considered MT if tracked for over 80% of its duration, and ML if tracked for less than 20%, without accounting for ID consistency.
The data association strategies of the tracking approach were evaluated using the two distance metrics derived in Section 2.3: the position distance $d^{(1)}$ and the feature distance $d^{(2)}$. Two approaches were assessed: the first utilized only the position distance for linking, while the second combined both distances, ensuring that detection boxes closely matched both position and features before being associated with a tracker.
The experiment utilized the PETS09-S2L1 and TUD-Crossing videos, evaluating tracking performance through four key metrics: FP, FN, IDs, and MOTA. As shown in Figure 8, the results demonstrate the impact of combining both position and feature distances in multi-object tracking. Using only the position distance $d^{(1)}$ for linking resulted in suboptimal tracking, with higher FP and IDs, a slight FN increase, and a lower MOTA. Conversely, employing both distance metrics, $d^{(1)}$ and $d^{(2)}$, significantly reduced the FP and IDs, leading to a higher MOTA. The figure clearly illustrates how the combined method outperformed the use of position distance alone, producing more robust tracking results across the test scenarios. These findings emphasize that considering only position distance leads to more linking errors, while incorporating feature distance effectively mitigates these issues, improving tracking accuracy in diverse environments.
This section outlines the comparison of four tracker initialization methods to improve tracking results, as follows:
Continual determination (CD): A trial tracker is established for potential new objects, requiring successful tracking for at least three frames before official initialization;
Detection confidence (DC): Detection boxes with confidence scores above a threshold are directly initialized;
Tracklet confidence (TC): Similar to CD, a trial tracker is created for new objects, but initialization occurs only when tracklet confidence exceeds a threshold;
Detection + tracklet confidence (DC + TC): Trial trackers are first created using detection confidence then initialized once both detection and tracklet confidence surpass set thresholds.
This experiment involved testing on the PETS09-S2L1 and TUD-Crossing videos, with evaluations based on FP, FN, IDs, and MOTA (Figure 9). In both scenarios, the CD approach resulted in higher FP and IDs, leading to lower MOTA. Using detection confidence or tracklet confidence alone reduced FP and IDs, slightly improving MOTA. Combining detection and tracklet confidence significantly reduced FP and IDs, with a slight increase in FN, ultimately achieving the best MOTA of the four methods.
In summary, the different initialization methods exhibited consistent trends in both the scenarios. Combining detection and tracklet confidence proved effective in reducing FP and IDs while maintaining a balanced FP and FN, resulting in the highest MOTA.
Experiments were performed on three videos (PETS09-S2L1, PETS09-S2L2, and TUD-Crossing) to compare the proposed method with state-of-the-art multiple-object tracking techniques, as shown in Table 3. Except for TSML_CDE and CDT, which are offline tracking methods, all the others are online tracking methods. Deep-SORT, AP_HWDPL, CDA-DDAL, and KCF are among the tracking methods that utilize deep learning networks.
In the experimental evaluation across the three videos, namely PETS09-S2L1, PETS09-S2L2, and TUD-Crossing, our proposed method consistently outperformed the other state-of-the-art techniques, excelling in metrics such as MT, ML, FN, and MOTA. The tracking results for the three videos are shown in Table 3. For PETS09-S2L1, our method achieved superior results, surpassing Deep-SORT by 2.5%, OMVT_TAAC by 12.8%, DMOT_MDP by 1.6%, and HMOT by 0.1% in terms of MOTA. For PETS09-S2L2, our method demonstrated excellent performance in MT, FN, and MOTA, outperforming EAMTT by 1.8%, AP_HWDPL by 6.1%, CDT by 14.4%, KCF by 10.3%, MDP_SubCNN by 14.6%, and CDA-DDAL by 11.3%. In the case of TUD-Crossing, our method performed best in ML, FN, and MOTA; in terms of MOTA, it outperformed CDT by 10.6%, KCF by 4.6%, MDP_SubCNN by 1.7%, TSML_CDE by 4.9%, AP_HWDPL by 2.7%, and CDA-DDAL by 1.7%.
Overall, our method consistently achieved the best results in FN and MOTA across the three experimental videos, demonstrating its adaptability to various video scenarios and its ability to overcome diverse challenges for accurate tracking outcomes.
The practical tracking outcomes of the proposed method are showcased, accompanied by an analysis. The results for PETS09-S2L1 are depicted in Figure 10a. Our method accurately tracked all objects, even when the pedestrians with IDs 1 and 2 briefly stopped and temporarily occluded each other during their encounter. Despite this, our tracker maintained continuous tracking of both pedestrians. Similarly, pedestrian ID 3 navigated around pedestrians 1 and 2 after passing a streetlight that momentarily obstructed the view; it also experienced occlusion during this process, but tracking seamlessly continued after the occlusion cleared.
The results for PETS09-S2L2 are shown in Figure 10b. In this video, a significant number of pedestrians walked randomly, and the pedestrian density was higher, leading to more pronounced occlusions among pedestrians. The results show that most of the targets were accurately tracked. Pedestrians located farther away from crowded areas, such as those with IDs 40 and 58, were successfully tracked. Within the crowded areas, apart from individual pedestrians who could not be detected owing to extensive occlusion, the majority, including pedestrians with IDs 41 and 11, were also effectively tracked.
The results for the TUD-Crossing dataset are illustrated in Figure 10c. Although this video featured fewer pedestrians, the frontal camera angle resulted in relatively larger areas of occlusion among pedestrians. The video depicts multiple pedestrians converging in the same direction, such as those with IDs 2, 3, 4, 5, and 6, who were sometimes temporarily obscured by pedestrians moving in the opposite direction, such as pedestrian ID 1. The images show that these pedestrians' IDs were maintained successfully, even after brief occlusions. Additionally, pedestrian ID 1 continued to be accurately tracked despite the presence of multiple pedestrians in the background.