1. Introduction
In computer vision, object tracking is a fundamental recognition problem that involves three key steps: target localization, continuous motion tracking, and trajectory estimation. A common approach is to extract the target's appearance features, predict their position, establish associations, and update the features. However, challenges such as lighting variations, scale changes, rotation, occlusion, and similar appearances complicate multiple-object tracking. Occlusion caused by static or moving objects can disrupt tracking and result in ID switches when the tracker assigns a new ID upon reappearance, mistaking the occluded object for a new one or halting tracking during mutual occlusion.
Object tracking methodologies can be categorized into single-object and multiple-object tracking. Single-object tracking, often applied to pedestrians, vehicles, or animals, has been widely studied due to the variability in target shapes and appearances, particularly with pedestrians. These algorithms are designed to accurately track an object’s motion over time and estimate its trajectory. A predominant approach in object tracking is the tracking-by-detection framework, wherein a tracker identifies and marks the target at regular intervals.
Multi-object tracking (MOT) poses greater challenges than single-object tracking because of the complexities in distinguishing multiple objects with similar appearances, handling occlusions, and managing fragmented trajectories. A critical challenge in MOT is determining whether detection boxes across different frames belong to the same object. MOT methods can be categorized as offline or online. Offline approaches optimize object trajectories across the entire frame sequence, offering higher accuracy but at the cost of significant computational demand and lack of real-time capability. By contrast, online methods process frames sequentially, relying on current and past data, making them suitable for real-time applications. However, online tracking faces difficulties with occlusions and fragmented trajectories, which complicate consistent object identification.
Network flow [1] models multi-object tracking as a directed graph, determining the optimal trajectories via a minimum-cost algorithm. Zhang et al. [2] combined temporal and spatial data for cross-camera pedestrian tracking, optimized by a conditional random field. Dai et al. [3] improved tracking with proposal generation and graph convolutional networks. Zhang et al. [4] streamlined re-identification with energy minimization for dense scenes. Although energy-based methods [5] can achieve global optimization, offline tracking methods are computationally heavy and unsuitable for real-time applications.
Multi-object tracking [6] has been modeled using Markov decision processes (MDPs). TC-ODAL [7] improves track reliability by classifying trackers based on detectability and continuity, using linking strategies and incremental linear discriminant analysis. SORT [8] enables real-time tracking with the Kalman filter and Hungarian algorithm, enhanced by Faster R-CNN detections. Although traditional methods rely on visual similarity and feature models to reduce ID switches, they often fail to distinguish subtle differences between similar targets, leading to potential misidentifications in online tracking scenarios.
Traditional feature models with manually designed descriptors struggle with noise and imprecise target representations. In contrast, the convolutional neural network (CNN) [9] proposed by LeCun learns features autonomously, overcoming these limitations. CDA-DDAL [10] improved tracking using adaptive strategies and tracklet metrics. CRDTC [11] links vehicle trajectories via background modeling and uses a support vector machine (SVM)–CNN hybrid for efficiency and accuracy. Deep-SORT [12], enhanced by ResNet [13] and the Kalman filter, balances frame rates and tracking accuracy, thereby improving the handling of prolonged occlusions.
Another issue is managing new objects entering the scene, which requires tracker initialization. Tracking by detection is the most common strategy, as Nodehi [14] demonstrated for detecting individuals and generating trajectories through re-identification. However, false detection bounding boxes increase tracker errors and the number of missing objects, so effective methods must robustly initialize trackers and filter out incorrect boxes. Many tracking methods, including TC-ODAL, CDA-DDAL, and AP_HWDPL [15], lack precise tracker initialization, leading to errors. Some use additional verification algorithms, such as the SVM in MDP tracking, but face challenges with false positives. Others, such as tracker hierarchy and Deep-SORT, use matching rates or frame counts to filter errors but struggle with continuous false detections. Methods such as NOTL [16] provide real-time tracking but struggle with ID switching and misidentification in complex scenarios with similar targets.
Current tracking methods that use tracking-by-detection often encounter challenges with erroneous detection boxes; effective frameworks should discern and discard incorrect detections. Wang et al. [17] proposed a multidomain joint learning method to address multiple-angle issues in drone data, tracking pedestrians by utilizing domain-guided dropout for feature organization. This approach enhances model accuracy across domains and integrates with Markov decision-process trackers for drone multi-object tracking.
To optimize the performance of deep learning feature extraction models, a large and diverse training dataset is essential. Training on a single dataset limits the model's generalization, which can reduce performance in complex tracking environments. Cross-dataset learning, which uses multiple datasets simultaneously, enhances the diversity and complementarity of the training data. This study therefore adopted cross-dataset learning to improve model performance and generalization.
Deep learning-based tracking frameworks outperform traditional methods but require substantial data. Owing to the scarcity of multi-object tracking datasets, cross-dataset learning with pedestrian re-identification datasets is commonly used. Cross-dataset learning, based on multi-task learning [18], merges training data from various tasks into common feature representations, enhancing performance in multiple applications [19,20,21,22,23,24]. Ref. [25] used a cross-domain CNN approach that integrated spatial and frequency features with attention mechanisms to enhance image manipulation detection, whereas MTDnet [26] and SpindleNet [27] incorporated auxiliary tasks to boost the accuracy of person re-identification. When data variations are subtle, single-task learning, as used by domain-guided dropout [28], is preferred to optimize the impact of neurons on individual datasets; however, this can lead to undertraining of neurons shared across tasks. Zhou et al. [29] developed the CenterTrack tracker for online, real-time object tracking by associating objects between adjacent frames; however, its performance declines with prolonged occlusions because of its reliance on local frame information.
This study proposes a novel online tracking framework using deep feature models, called single-task joint learning (STJL). The main goal is to incorporate training data from diverse environments to enhance target recognition across various scenarios. Although it uses a tracking-by-detection approach, the model addresses existing limitations by introducing a strict initialization judgment to reduce incorrect detection boxes, minimize errors, and maintain tracking accuracy. This framework improves MOT identification accuracy in applications such as deployed surveillance systems [30,31,32].
The STJL model is first trained offline, as shown in Figure 1. Multiple training datasets are standardized and relabeled to avoid duplicate labels and then combined into a single large STJL dataset. This dataset is used for single-task training with cosine metric learning [33], resulting in the STJL model.
After the offline training phase, the STJL model is used for tracking. It extracts features from detection boxes, predicts object states, and uses deep features to determine whether boxes belong to the same object, linking them accordingly. New trackers are initialized for new objects. Both detection-box confidence and tracklet confidence serve as criteria to distinguish accurate boxes from erroneous ones, excluding faulty detections.
This study proposes two key contributions: the STJL model, which enhances feature extraction in diverse and challenging environments, and a refined tracker initialization strategy that combines detection and tracklet confidence, significantly reducing false positives and ID switches for improved tracking performance.
This paper is structured as follows: Section 2 presents an initialization technique combining detection and tracklet confidence to enhance tracking precision; Section 3 validates the proposed deep model through experiments and compares its performance with state-of-the-art methods; Section 4 offers concluding remarks.
2. STJL Model Applied to Online Multi-Object Tracking
An enhanced online multi-object tracking technique using a deep-feature model within a tracking-by-detection framework is proposed. By combining the deep feature model with the Kalman filter, it links detection boxes of targets in each frame. The algorithm runs iteratively until the image input ends, producing the tracking results.
At time t, an object detector scans the input image. Detected objects are passed to the tracking algorithm, which extracts features from detection boxes and uses the Kalman filter to predict target location and motion. The feature distance between the detection box and the tracker is calculated to determine similarity. Based on position and feature similarity, the algorithm links detection boxes to trackers and updates their states. If a detection box is not linked to a tracker, a new tracker is added if a new object is indicated, or the box is discarded if it is deemed a false positive. The algorithm also removes trackers whose objects have exited, then moves to the next image at time t + 1, repeating the process until the sequence ends, as shown in Figure 2.
The following subsections detail the STJL model, covering its network architecture, training methods, object state prediction, tracking procedures, and linkage optimization using deep features. We also propose a novel tracker initialization method and explain the framework for evaluating tracker reliability.
2.1. Deep Feature Extraction
Feature extraction is the key to linking trackers to detection boxes. Traditional algorithms using features often struggle to distinguish similar-looking objects owing to their simplicity. In contrast, our model leverages a CNN based on a residual network and is trained using cosine metric learning to capture complex pedestrian features across diverse contexts. By merging multiple pedestrian re-identification datasets into a unified STJL dataset, the model’s generalization and object differentiation capabilities are enhanced.
As the depth of CNNs increases, the feature accuracy improves but can lead to vanishing gradients. ResNet addresses this by introducing shortcut connections that skip layers, enhancing gradient propagation and preventing this issue. These connections help the network learn residual functions, allowing for deeper stacking without hindering training. Each ResNet layer includes batch normalization, ReLU activation, and a 3 × 3 convolution. In this study, a ResNet architecture with two convolutional layers, one max-pooling layer, six residual blocks, and a fully connected layer was used. A 1 × 1 convolutional layer ensured consistent input and output dimensions across blocks for residual computation.
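As an illustration, the following is a minimal sketch of the pre-activation residual block just described (batch normalization, ReLU, 3 × 3 convolution, with a 1 × 1 projection when dimensions change). The PyTorch module and the specific layer widths are our own assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> 3x3 conv, applied twice.

    A 1x1 convolution projects the shortcut when the channel count or
    spatial resolution changes, keeping input and output dimensions
    consistent for the residual sum (as described in the text).
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        # 1x1 projection keeps shapes compatible across blocks.
        self.shortcut = (
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
            if (stride != 1 or in_ch != out_ch) else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)   # residual sum
```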
Metric learning, often used to compare images, measures distances between elements through various distance functions and is optimized using techniques such as SVM, K-means, and k-NN. Although CNNs can use pre-training on datasets such as ImageNet or MS COCO, category mismatches can limit their effectiveness for image retrieval tasks such as pedestrian re-identification. Effective sampling strategies are crucial during training to prevent mismatched data from hindering convergence. Deep cosine metric learning simplifies this process while maintaining its efficiency.
Deep cosine metric learning uses a network with both feature extraction and classification layers, but during testing, only the feature vectors are used, excluding the classification layer. In the training dataset, each data point consists of a training image, denoted as $x$, and its corresponding label, $y$, which signifies the individual's ID. Given that $y$ is part of a set comprising $C$ total IDs, the network perceives each ID as a distinct category. The terminal layer of the network therefore contains $C$ neurons. When an image is fed into the network, it yields $C$ probability scores, one for every category.
The softmax classifier computes the classification probability, defined as
$$ p(y = k \mid x) = \frac{\exp(\mathbf{w}_k^\top \mathbf{r} + b_k)}{\sum_{n=1}^{C} \exp(\mathbf{w}_n^\top \mathbf{r} + b_n)}, $$
where $\mathbf{r}$ is the feature vector computed by the network's feature extraction layer, $\mathbf{w}_k$ represents the network weights of the final fully connected layer, and $b_k$ is the bias term. $p(y = k \mid x)$ denotes the computed probability score for the image belonging to category $k$. After obtaining the probability scores for all categories, the loss function is defined as
$$ L = -\frac{1}{N} \sum_{(x,\, l) \in D} \log p(y = l \mid x), $$
where $D$ represents the set of training images in the dataset, $N$ is the total number of images in the dataset, and $l$ refers to the label or class of the image, representing the object's unique ID. This loss function minimizes the cross-entropy between the predicted distribution $p(y \mid x)$ and the ground-truth labels. Ideally, after training, $p(y = l \mid x)$ approaches 1 when the decision is correct and tends towards 0 for the other categories $k \neq l$.
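For concreteness, below is a minimal NumPy sketch of this classification head and its cross-entropy loss; the variable names (`W`, `b`, `features`) are illustrative, not taken from the paper.

```python
import numpy as np

def softmax_cross_entropy(features, labels, W, b):
    """Standard softmax classification loss over C identity classes.

    features: (N, d) feature vectors from the extraction layers
    labels:   (N,) integer identity IDs in [0, C)
    W:        (d, C) final fully connected weights
    b:        (C,) bias terms
    """
    logits = features @ W + b                      # (N, C) class scores
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # p(y = k | x)
    # Cross-entropy: average negative log-probability of the true ID.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
```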
Training solely with a standard softmax classifier may not optimize the feature vector distances for the same or different identities. Figure 3a shows a feature vector space with three categories. Ideally, as in Figure 3b, the classifier should define clear decision boundaries. However, within-category feature vectors can still vary significantly and are not always aligned with the mean value of their category.
To ensure compact feature vectors, the softmax classifier training parameters were adjusted. By adding $\ell_2$ normalization to the final layer, all feature vectors and weights were normalized to unit length, leading to the definition of the cosine softmax classifier:
$$ p(y = k \mid \mathbf{r}) = \frac{\exp(\kappa \, \tilde{\mathbf{w}}_k^\top \mathbf{r})}{\sum_{n=1}^{C} \exp(\kappa \, \tilde{\mathbf{w}}_n^\top \mathbf{r})}, \quad (1) $$
where $\tilde{\mathbf{w}}_k = \mathbf{w}_k / \|\mathbf{w}_k\|_2$ is the normalized network weight and $\kappa$ is a learnable scale weight. Equation (1) excludes the bias parameter $b$, simplifying the model with fewer parameters. The cross-entropy from (1) remains the loss function for training.
Figure 3c shows the decision boundary for the cosine softmax classifier. After training, all the sample vectors were normalized to the unit length; they not only moved away from the inter-category boundaries, but also converged towards their category’s mean value, fulfilling metric learning objectives.
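A minimal sketch of the cosine softmax head in Equation (1) follows, again with illustrative names; `kappa` is the learnable scale, and both weights and features are unit-normalized as described above.

```python
import numpy as np

def cosine_softmax(features, W, kappa):
    """Cosine softmax classifier (Equation (1)), bias-free.

    features: (N, d) feature vectors, l2-normalized per row
    W:        (d, C) class weights, l2-normalized per column
    kappa:    learnable scale parameter
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    logits = kappa * (features @ W)              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```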
Different datasets serve various computer vision tasks. Most datasets focus on a single environment, limiting adaptability to challenges like lighting and occlusions. A universal feature extractor is crucial for tracking in diverse settings. The rise of smart surveillance and pedestrian re-identification research is increasing dataset availability, improving model training. STJL enhances adaptability by training on multiple datasets. For instance, combining Market-1501 [34], which lacks volume, with DukeMTMC-reID [35], rich in occluded images, improves the network's robustness against occlusions, boosting accuracy on Market-1501.
To train the network using multiple datasets, assume there are $M$ datasets. Each dataset $D_i = \{(x_j^i, y_j^i)\}_{j=1}^{N_i}$ contains $N_i$ images with a total of $C_i$ pedestrians. Thus, our total training data can be defined as $D = \{D_1, D_2, \ldots, D_M\}$. In this context, $x_j^i$ represents the $j$-th image in the $i$-th dataset, and $y_j^i$ is the ID label of the $j$-th image in the $i$-th dataset. The label belongs to the set $\{1, 2, \ldots, C_i\}$.
When aiming to train with multiple datasets, the most intuitive approach is to employ multi-task learning. This entails training the network separately with distinct datasets. The learning loss can be defined as $L_{\mathrm{MTL}} = \sum_{i=1}^{M} L(D_i)$, where $L(\cdot)$ is the loss function. However, this training approach might cause the network to overfit to one particular dataset, and the order of the training datasets can also influence the final performance of the network.
Data from the various datasets were combined into an aggregated STJL dataset, with individuals re-labeled to ensure unique IDs across datasets for pedestrian re-identification. The new STJL dataset is represented as $D_{\mathrm{STJL}} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ represents a training image, $N$ is the total number of training images over all datasets, and $y_n$ denotes the re-labeled ID, which belongs to the set $\{1, 2, \ldots, C_{\mathrm{STJL}}\}$. $C_{\mathrm{STJL}}$ is the aggregate number of IDs in the new dataset and is defined as $C_{\mathrm{STJL}} = \sum_{i=1}^{M} C_i$, where $C_i$ represents the total number of IDs in the $i$-th dataset. The learning loss for the STJL dataset is refined to $L_{\mathrm{STJL}} = L(D_{\mathrm{STJL}})$.
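The re-labeling amounts to offsetting each dataset's IDs by the cumulative ID count of the datasets before it. Below is a sketch under the assumption that each dataset is a list of (image, local ID) pairs with dense IDs starting at 0; the names are hypothetical.

```python
def merge_into_stjl(datasets):
    """Merge per-dataset (image, id) pairs into one STJL dataset.

    datasets: list of lists of (image, local_id) pairs, where local_id
              runs from 0 to C_i - 1 within dataset i.
    Returns a single list with globally unique IDs: each dataset's labels
    are offset by the total ID count of the preceding datasets.
    """
    stjl, offset = [], 0
    for data in datasets:
        num_ids = 1 + max(local_id for _, local_id in data)   # C_i
        stjl.extend((img, local_id + offset) for img, local_id in data)
        offset += num_ids   # C_STJL accumulates as the sum of C_i
    return stjl
```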
Merging datasets into the STJL dataset enables the network to distinguish individuals both within and across datasets, enriching data and adding complexity by differentiating similar individuals, thus enhancing feature model training.
2.2. Tracking State Prediction
The Kalman filter predicts a target's position in the current frame based on the dynamics of its previous state. In multi-object tracking with uncalibrated cameras and no motion data, the target's state vector is defined as $\mathbf{x} = [u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}]^\top$, where $(u, v)$ are the pixel coordinates of the detection box center, $\gamma$ is the aspect ratio, $h$ is the height, and $\dot{u}, \dot{v}, \dot{\gamma}, \dot{h}$ represent the velocities and rates of change of the aspect ratio and height. The prediction step computes the state as $\hat{\mathbf{x}}_t = \mathbf{F} \mathbf{x}_{t-1}$ using the state transition matrix $\mathbf{F}$. The predicted error covariance matrix is calculated as $\hat{\mathbf{P}}_t = \mathbf{F} \mathbf{P}_{t-1} \mathbf{F}^\top + \mathbf{Q}$, where $\hat{\mathbf{P}}_t$ represents the predicted error covariance and $\mathbf{Q}$ is the process noise covariance.
In the update phase, the Kalman filter refines its prediction using current observations $\mathbf{z}_t = \mathbf{H} \mathbf{x}_t$, where $\mathbf{H}$ is the observation model. The Kalman gain $\mathbf{K}_t = \hat{\mathbf{P}}_t \mathbf{H}^\top (\mathbf{H} \hat{\mathbf{P}}_t \mathbf{H}^\top + \mathbf{R})^{-1}$ optimizes the update by minimizing the error, where $\mathbf{R}$ is the measurement noise covariance. The updated state is $\mathbf{x}_t = \hat{\mathbf{x}}_t + \mathbf{K}_t (\mathbf{z}_t - \mathbf{H} \hat{\mathbf{x}}_t)$, and the updated error covariance is $\mathbf{P}_t = (\mathbf{I} - \mathbf{K}_t \mathbf{H}) \hat{\mathbf{P}}_t$, where $\mathbf{x}_t$ is the object's updated state, $\hat{\mathbf{x}}_t$ is the predicted state, and $\mathbf{P}_t$ is the updated error covariance.
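A compact NumPy sketch of this constant-velocity predict/update cycle on the eight-dimensional state follows; the noise covariances `Q` and `R` are placeholder values to be tuned, not the paper's settings.

```python
import numpy as np

dim = 8                                  # state: [u, v, gamma, h, du, dv, dgamma, dh]
F = np.eye(dim)
F[:4, 4:] = np.eye(4)                    # constant-velocity transition (dt = 1 frame)
H = np.eye(4, dim)                       # observe [u, v, gamma, h] only
Q = np.eye(dim) * 1e-2                   # process noise covariance (placeholder)
R = np.eye(4) * 1e-1                     # measurement noise covariance (placeholder)

def predict(x, P):
    """Propagate the state and error covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x_pred, P_pred, z):
    """Fuse the predicted state with a measured detection box z = [u, v, gamma, h]."""
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)              # corrected state
    P = (np.eye(dim) - K @ H) @ P_pred             # corrected covariance
    return x, P
```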
2.3. Data Association
The Kalman filter refines the predicted object state by linking it to the detection results using the Hungarian algorithm. After prediction, the object's location is estimated, and feature vectors are extracted. The squared Mahalanobis distance,
$$ d^{(1)}(i, j) = (\mathbf{d}_j - \hat{\mathbf{y}}_i)^\top \mathbf{S}_i^{-1} (\mathbf{d}_j - \hat{\mathbf{y}}_i), $$
measures the positional distance between the $i$-th tracker and the $j$-th detection box $\mathbf{d}_j$, incorporating the filter's prediction $\hat{\mathbf{y}}_i$. The inverse covariance matrix $\mathbf{S}_i^{-1}$ is derived from the error covariance matrix using Cholesky decomposition. Distances within the 95% confidence interval of a chi-square distribution are considered probable links, with the connectivity matrix $b^{(1)}_{i,j} = \mathbb{1}[d^{(1)}(i, j) \le t^{(1)}]$ indicating valid connections when the Mahalanobis distance falls below the threshold $t^{(1)}$.
When target dynamics are reliable, the Mahalanobis distance is used to assess tracker–detection box linkage, but its accuracy diminishes with rapid camera movement. To compensate, a cosine distance based on appearance is proposed. The feature vector $\mathbf{r}_j$ for each detection box is normalized to unit length, and each tracker's feature history is stored in a gallery $R_i$ composed of the feature vectors from the most recently linked detection boxes. The cosine distance,
$$ d^{(2)}(i, j) = \min \{\, 1 - \mathbf{r}_j^\top \mathbf{r}_k \mid \mathbf{r}_k \in R_i \,\}, $$
calculates the similarity between the $i$-th tracker and the $j$-th detection box. Connections are deemed valid if the cosine distance falls below the threshold $t^{(2)}$, represented by the connectivity matrix $b^{(2)}_{i,j} = \mathbb{1}[d^{(2)}(i, j) \le t^{(2)}]$.
By combining the Mahalanobis distance, which evaluates short-term position dynamics, with the cosine distance, which accounts for appearance over longer occlusions, a comprehensive connection cost is derived. The final connectable indicator integrates both metrics as $b_{i,j} = b^{(1)}_{i,j} \cdot b^{(2)}_{i,j}$. In multi-object tracking, linking trackers and detection boxes is treated as a series of subproblems. To improve robustness, "matching overlap" prioritizes trackers with consistent visibility, linking detection boxes with the lowest distance cost while accounting for the tracker's "tracking age" $a_i$, which measures the time since its last link.
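Below is a sketch of age-prioritized association under these two gates, using SciPy's Hungarian solver. The threshold defaults (a chi-square 95% gate for a four-dimensional measurement, and an appearance gate of 0.3), the tracker attributes, and the gallery layout are illustrative assumptions, not the paper's exact values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(trackers, detections, t1=9.4877, t2=0.3, max_age=30):
    """Cascade matching: trackers with smaller tracking age match first.

    trackers:   objects with .age, .mahalanobis(det), and a (k, d) unit-norm
                .gallery of recent appearance features (hypothetical interface)
    detections: objects with a unit-normalized (d,) .feature vector
    Returns (matches, unmatched_detection_indices).
    """
    matches, unmatched = [], set(range(len(detections)))
    for age in range(max_age + 1):                       # prioritize recently seen trackers
        idx_t = [i for i, trk in enumerate(trackers) if trk.age == age]
        idx_d = sorted(unmatched)
        if not idx_t or not idx_d:
            continue
        cost = np.full((len(idx_t), len(idx_d)), 1e6)    # large cost = not connectable
        for a, i in enumerate(idx_t):
            for b, j in enumerate(idx_d):
                d1 = trackers[i].mahalanobis(detections[j])                     # position gate
                d2 = np.min(1.0 - trackers[i].gallery @ detections[j].feature)  # appearance gate
                if d1 <= t1 and d2 <= t2:                # b_ij = b1 * b2
                    cost[a, b] = d2
        rows, cols = linear_sum_assignment(cost)         # Hungarian algorithm
        for a, b in zip(rows, cols):
            if cost[a, b] < 1e6:
                matches.append((idx_t[a], idx_d[b]))
                unmatched.discard(idx_d[b])
    return matches, sorted(unmatched)
```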
2.4. Tracker Initialization and Termination
After matching, unmatched boxes might indicate new objects or false positives. Traditionally, they start as trial trackers, becoming official if they persist. This “continuous tracking determination” filters single-frame errors but may still track consecutive errors. A stringent initialization confidence threshold is recommended to prevent this.
To enhance tracking accuracy, the detection boxes in frame $t$, denoted as $D_t$, are filtered based on "detection confidence", which is calculated from the likelihood score provided by the detector. Detection confidence serves as a preliminary criterion for tracker initialization and error reduction. Specifically, the detection boxes are separated into two subsets according to a confidence threshold $\tau_d$: strong detection boxes $D_t^{\mathrm{strong}}$, which meet or exceed the confidence threshold, and weak detection boxes $D_t^{\mathrm{weak}}$, which fall below it:
$$ D_t^{\mathrm{strong}} = \{\, d \in D_t \mid s(d) \ge \tau_d \,\}, \qquad D_t^{\mathrm{weak}} = \{\, d \in D_t \mid s(d) < \tau_d \,\}, \quad (2) $$
where $s(d)$ is the detector's likelihood score. High-confidence detections $D_t^{\mathrm{strong}}$ are preferred for initializing trackers, providing a robust foundation for accurate tracking. In contrast, weak detection boxes $D_t^{\mathrm{weak}}$ are more prone to errors and are therefore cautiously considered or excluded from the initialization process. This two-tier confidence-based filtering, represented by Equation (2), helps reduce false positives and enhances the reliability of tracker initialization, contributing to overall tracking stability.
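In code, Equation (2) is a one-line partition of the frame's detections; the `score` attribute and the threshold value here are illustrative.

```python
def split_by_confidence(detections, tau_d=0.6):
    """Partition frame-t detections into strong and weak subsets (Equation (2))."""
    strong = [d for d in detections if d.score >= tau_d]  # candidates for initialization
    weak   = [d for d in detections if d.score <  tau_d]  # linking only, never initialize
    return strong, weak
```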
Strong detection boxes with high confidence are suitable for target linking and initialization, while weak detection boxes with lower confidence are prone to errors but can still be used for linking. However, depending solely on detection-box confidence can be misleading, as high-confidence errors might resemble human features. To address this, trajectory confidence is proposed to assess the detection boxes from a tracking perspective. Reliable trajectories indicate genuine objects, whereas errors tend to produce fragmented paths. Unmatched boxes start as trial trackers, and they are promoted to official trackers with unique IDs only when their trajectory confidence consistently exceeds a threshold.
The trajectory confidence, as shown in Equation (3), is based on two factors: the similarity between the linked detection boxes and the trajectory length. A consistent tracker will have similar boxes, and longer trajectories indicate higher reliability:
$$ \mathrm{conf}(T_i) = \left( \frac{1}{L_i} \sum_{h = t_s}^{t_c - 1} \mathrm{sim}\big(d_i^h, d_i^{h+1}\big) \right) \cdot \left( 1 - e^{-\beta \sqrt{L_i}} \right), \quad (3) $$
where $T_i$ is the trajectory of object $i$. The first term represents the average similarity of detection boxes between frames, where $L_i$ is the length of the trajectory and $h$ refers to the index of the frames considered in the calculation of the trajectory similarity. The interval of $h$ ranges from the starting frame of the detection to the current frame, i.e., $h \in [t_s, t_c]$, where $t_s$ is the frame in which the detection first appears and $t_c$ is the current frame being analyzed. The similarity between detection boxes, $\mathrm{sim}(d_i^h, d_i^{h+1})$, is calculated using the cosine similarity between the feature vectors extracted by the deep learning model, where $d_i^h$ is the detection box of object $i$ in frame $h$. Specifically, for two feature vectors $\mathbf{f}_a$ and $\mathbf{f}_b$, the similarity is computed as $\mathrm{sim}(\mathbf{f}_a, \mathbf{f}_b) = \frac{\mathbf{f}_a^\top \mathbf{f}_b}{\|\mathbf{f}_a\| \, \|\mathbf{f}_b\|}$. This similarity measures how closely related the detection boxes are in appearance, contributing to the overall trajectory confidence. The first term sums the detection box similarities from time $t_s$ to $t_c$ and divides by the trajectory length $L_i$, yielding the average similarity of the detection boxes; the higher the similarity, the higher the confidence. The second term evaluates the trajectory length, where $\beta$ is the control coefficient; the longer the trajectory, the higher the confidence. Both terms fall within the range of 0 to 1.
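A sketch of Equation (3) over a tracklet's stored feature history follows; the exponential form of the length term matches the reconstruction above and should be treated as an assumption about the exact functional form, with `beta` as a tunable coefficient.

```python
import numpy as np

def trajectory_confidence(features, beta=1.0):
    """Equation (3): average frame-to-frame appearance similarity,
    weighted by a length term that grows toward 1 for long tracklets.

    features: (L, d) array, one feature vector per linked frame.
    """
    if len(features) < 2:
        return 0.0
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = np.sum(f[:-1] * f[1:], axis=1)        # cosine similarity of consecutive boxes
    avg_sim = float(np.mean(sims))               # first term: average similarity
    length_term = 1.0 - np.exp(-beta * np.sqrt(len(features)))  # second term, in [0, 1)
    return avg_sim * length_term
```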
Only robust detection boxes are retained as trial trackers for potential new objects, while weaker ones are discarded. If their trajectory confidence surpasses a threshold, they become official trackers; otherwise, they are eliminated. In the proposed method, the confidence of a tentative trajectory $T_i^{\mathrm{tent}}$ is evaluated against a threshold to determine whether it can transition into a confirmed trajectory. Specifically, as described in Equation (4), if the confidence of the tentative trajectory exceeds the threshold $\tau_{\mathrm{conf}}$, the trajectory is classified as a new confirmed trajectory $T_i^{\mathrm{conf}}$:
$$ T_i^{\mathrm{tent}} \rightarrow T_i^{\mathrm{conf}} \quad \text{if} \quad \mathrm{conf}\big(T_i^{\mathrm{tent}}\big) > \tau_{\mathrm{conf}}. \quad (4) $$
Otherwise, the trajectory remains in the tentative state. This decision process is critical for ensuring that only reliable trajectories are promoted to the confirmed state, while less reliable ones remain tentative.
After the matching process, there may be unmatched detection boxes as well as trackers that are not linked to any detection boxes. Some of these trackers may no longer be linked because the object has left the tracking scene, requiring a decision on whether to terminate the tracker. The tracker's age is used as the criterion for this decision. Tracker age refers to the number of frames since the tracker was last linked to a detection box, as defined in Equation (5):
$$ a_i = \begin{cases} 0, & \text{if tracker } i \text{ is linked to a detection box at time } n, \\ a_i + 1, & \text{otherwise.} \end{cases} \quad (5) $$
The tracker age $a_i$ increments by 1 each time the Kalman filter makes a prediction. If, at the current time $n$, tracker $i$ is linked to a new detection box, the tracker age is reset to 0. When the object has left the scene, the tracker can no longer link to a detection box, causing its age to continue increasing. The algorithm terminates the tracker when its age exceeds a certain number of frames. In practice, this termination threshold is set to 30 frames and can be adjusted based on the video's frames per second.
3. Experiments
This section outlines the design and analysis of the experiments conducted. The experiments were performed in two parts. The first part involved the evaluation of the feature models. We adjusted the network architecture and training strategies to assess their performance to identify the optimal model for the subsequent multi-object tracking experiments. The second part focused on multi-object tracking tests, in which the best feature model was applied to multiple test videos, primarily tracking pedestrians. We tested different tracker initialization methods and adjusted various parameters to determine the most effective approach. Finally, we compared our results with several state-of-the-art tracking methods using popular multi-object tracking benchmark videos.
Section 3.1 outlines the evaluation criteria of this research model; Section 3.2 evaluates the performance of single-task joint learning and cosine metric learning; Section 3.3 compares our proposed method with state-of-the-art algorithms using a multi-object tracking dataset and provides a detailed analysis and discussion of the experimental results.
The experimental setup consisted of an Intel Core i7-930 CPU, Nvidia GeForce GTX 1080Ti GPU, and 24 GB RAM, running Ubuntu 14.04 LTS with CUDA 8.0 and CUDNN v6 for GPU acceleration, and implemented using Python 3.6 and TensorFlow 1.4.
To construct a comprehensive comparative analysis, we employed a diverse array of state-of-the-art MOT algorithms, each chosen for its distinct methodological focus within MOT, addressing critical aspects such as trajectory stability, occlusion handling, and computational efficiency. EAMTT [36] provides adaptive trajectory tracking, ensuring stability across extended sequences. CDT [37] utilizes discriminative feature learning to maintain object identity through gradual appearance variations, while MDP_SubCNN [38] applies part-based, multi-domain feature extraction to enhance resilience under occlusion. TSML_CDE [39] implements two-stream metric learning to achieve consistent object embedding, minimizing ID switches in high-mobility contexts. Deep-SORT [12] and KCF [40] serve as robust solutions for real-time processing; Deep-SORT leverages deep feature embedding for efficient object association, while KCF introduces a lightweight correlation filter-based approach suitable for resource-constrained environments. CDA-DDAL [10] and AP_HWDPL [15] are specifically optimized for occlusion robustness and hierarchical association prediction, offering reliable performance in densely populated, dynamic tracking conditions. Collectively, these algorithms provided a well-rounded framework for assessing our model's adaptability and performance across diverse, real-world tracking scenarios, emphasizing its versatility and competitive standing within the MOT landscape.
Our system used a ResNet for feature extraction, with a computational complexity of approximately $O(k^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}} \cdot H \cdot W)$ per convolutional layer, depending on the kernel size $k$ and the input dimensions $H \times W$. For metric learning, we employed the cosine softmax, with complexity $O(d)$ per comparison, where $d$ is the feature vector dimensionality.
Tracking state prediction was achieved through a Kalman filter, operating with a complexity of $O(m^3)$, where $m$ denotes the state dimension. Object association utilized the cosine distance, with complexity $O(d)$ per tracker–detection pair, while detection confidence assessment added minimal complexity through basic comparison operations.
Combining these components, the total computational complexity of our system was dominated by the feature extraction stage (ResNet), resulting in an approximate upper bound of $O(k^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}} \cdot H \cdot W)$ per frame. Given our implementation on an Intel Core i7-930 CPU (Intel Corporation, Santa Clara, CA, USA) with an Nvidia GTX 1080Ti GPU (Nvidia Corporation, Santa Clara, CA, USA), the system achieved near real-time processing speeds. This performance suggests that our model is feasible for real-world surveillance applications, with potential for further optimization through model compression techniques or hardware acceleration to ensure efficiency in more resource-constrained environments.
The feature model was trained as a pedestrian re-identification task using multiple widely cited datasets (CUHK03 [41], Market-1501, and DukeMTMC-reID), each selected for its unique representation of real-world tracking challenges and to ensure model robustness across diverse environments. CUHK03, captured from two non-overlapping cameras, includes challenges such as detection box misalignment and diverse pedestrian angles, which test the model's ability to adapt to variable perspectives and occlusions. Market-1501, filmed across six cameras, reflects real-world scenarios with frequent misalignments and varying angles, enhancing the model's ability to handle dynamic background changes and resolution inconsistencies. DukeMTMC-reID [35], featuring footage from eight synchronized cameras, provides a range of scenes with frequent occlusions, lighting changes, and multi-view tracking scenarios, ideal for testing and refining multi-object tracking models in complex multi-camera systems. Together, these datasets helped evaluate the model's scalability and adaptability in handling larger numbers of objects and expansive tracking environments, simulating conditions akin to real-world surveillance networks.
3.1. Model Evaluation Criteria
The performance evaluation utilized the cumulative match characteristic (CMC) in pedestrian re-identification. Features were extracted from query and gallery image sets. The query sets contained images of pedestrians with known IDs, while the gallery set consisted of images of pedestrians with unknown identities. The algorithm first extracted feature vectors for all images in both the query and gallery sets. Once all feature vectors were obtained, a distance function was applied to calculate the distances between feature vectors in the query and gallery sets, generating a distance matrix. Using this matrix, the images were then ranked according to distance, with smaller distances indicating higher similarity and thus appearing earlier in the ranking order. This ranking was subsequently used for performance evaluation. To obtain the CMC curve, the accuracy at each rank was calculated, derived using the following Equations (6) and (7):
$$ \mathrm{acc}_i(k) = \begin{cases} 1, & \text{if a correct match for query ID } i \text{ appears within Rank } k, \\ 0, & \text{otherwise,} \end{cases} \quad (6) $$
$$ \mathrm{CMC}(k) = \frac{1}{Q} \sum_{i=1}^{Q} \mathrm{acc}_i(k). \quad (7) $$
In Equation (6), $\mathrm{acc}_i(k)$ represents the accuracy for a pedestrian with ID $i$ in the query set at Rank $k$: the accuracy is 1 if the pedestrian with ID $i$ is matched within Rank $k$, and 0 otherwise. After calculating $\mathrm{acc}_i(k)$ for each ID, the overall accuracy at Rank $k$ was computed using Equation (7), where $Q$ is the total number of IDs in the query set; the sum of all $\mathrm{acc}_i(k)$ values divided by $Q$ gives the accuracy at Rank $k$. Once the accuracies for all ranks were calculated, plotting these values yielded the CMC curve.
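A minimal sketch of Equations (6) and (7), given a precomputed query–gallery distance matrix, follows; the names are illustrative.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """CMC accuracy per rank (Equations (6) and (7)).

    dist:        (Q, G) distance matrix between query and gallery features
    query_ids:   (Q,) ground-truth IDs of the query images
    gallery_ids: (G,) IDs of the gallery images
    """
    order = np.argsort(dist, axis=1)                 # smaller distance = earlier rank
    hits = gallery_ids[order] == query_ids[:, None]  # correct-match mask per rank
    acc_i = np.cumsum(hits, axis=1) > 0              # Eq. (6): 1 if matched within rank k
    return acc_i[:, :max_rank].mean(axis=0)          # Eq. (7): average over the Q queries
```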
Figure 4 displays query images on the left and gallery images on the right, with green frames indicating correct matches and red frames denoting incorrect ones. The accuracy was calculated based on the similarity ranks, with Rank 3 achieving 100% accuracy.
While the CMC metric has limitations, such as not accounting for multiple correct matches in the gallery, this issue is less relevant in multi-object tracking. In this context, each tracker was linked to a single detection box per frame, thereby eliminating the possibility of multiple correct matches. Therefore, Rank 1 accuracy serves as a reliable metric for evaluating the effectiveness of our deep feature model in multi-object tracking scenarios.
3.2. Evaluation of STJL Performance
STJL is a training strategy that utilizes multiple datasets to enhance the model’s generalization capabilities for broader applications. In this section, the constraints of training with a singular dataset are examined, and the efficacy of the STJL method is evaluated.
Experiments using three datasets are detailed: CUHK03 ('c'), Market-1501 ('m'), and DukeMTMC-reID ('d'). All images were resized to 128 × 64 pixels, as listed in Table 1. The training method used was cosine metric learning.
Models were first trained using individual datasets and subsequently evaluated across all three datasets using CMC, with an emphasis on Rank 1 accuracy.
Table 1 shows the image counts for training. The results in Figure 5a confirm that the single-dataset models excelled only on their corresponding test set, revealing a lack of generalization.
Single- and multi-dataset training strategies were contrasted, using Market-1501 as a baseline, with all multi-dataset models incorporating it for training. The total number of training images was the sum of those in each dataset. Models were named based on the datasets used for training; for example, the model trained on Market-1501 and DukeMTMC-reID was named 'md', and the one using all three datasets was 'cmd'. Four models ('m', 'md', 'cm', and 'cmd') were compared, as depicted in Figure 5b. The results indicate that the multi-dataset models outperformed the single-dataset model 'm'. Among the dual-dataset models, 'md' and 'cm' saw Rank 1 accuracy increases of approximately 0.8% and 0.6%, respectively, while the 'cmd' model using all three datasets improved by 1%.
Subsequently, the STJL approach was extended across multiple datasets, centering on the performance of these multi-dataset models on varied test datasets. Four distinct multi-dataset models ('cm', 'cd', 'md', and 'cmd') were trained and assessed on the three datasets. The results, shown in Figure 6a, indicate that models trained on two datasets performed worse on unseen datasets, whereas the model trained on all three datasets ('cmd') performed well across all test datasets. Moreover, 'cmd' outperformed the two-dataset models, with a 0.4% higher accuracy on Market-1501 compared with 'cm', and a 1.5% higher accuracy on CUHK03 compared with 'cd'.
In summary, our experiments showed that using a single dataset was insufficient for robust pedestrian re-identification. For example, Market-1501 lacked the scope to train the model for occlusions, while adding DukeMTMC-reID improved the performance by 0.8%. This supports the notion that diverse datasets not only enrich the training data but also address specific challenges. Hence, the ‘cmd’ model was chosen for subsequent endeavors.
Building upon the prior training approaches, STJL was executed using CUHK03, Market-1501, and DukeMTMC-reID as the training datasets. Two learning methodologies were compared: the first employed the conventional softmax classifier (denoted "softmax"), and the second leveraged cosine metric learning ("cosine softmax"). The primary evaluation metric was CMC Rank 1 accuracy.
The model comparison results in Figure 6b show that the models trained with the standard softmax classifier performed poorly on all three datasets, indicating their inability to learn generalizable features. In contrast, the models trained using cosine metric learning exhibited significantly better performance. Specifically, they outperformed the former by 52.13% for CUHK03, 33.23% for Market-1501, and 33.87% for DukeMTMC-reID. These experiments confirmed that cosine metric learning is more effective than the conventional softmax classifier in learning discriminative features across different environments, ensuring that models can extract universal and representative features.
Subsequently, the proposed method was compared with state-of-the-art techniques, including S2S [42], BaseIDE [43], MARS (IDE+XQDA) [44], CML, LOMO+XQDA, and TAUDL [45]. Except for LOMO+XQDA, all of these are deep-learning-based approaches. The compared datasets included Market-1501, CUHK03, and DukeMTMC-reID, with CMC Rank 1 accuracy as the evaluation criterion.
The accuracy results for the three datasets are shown in Figure 7. Our model achieved the highest Rank 1 accuracy on CUHK03, surpassing the latest methods with a 17.1% improvement over S2S and a 36% improvement over TAUDL (Figure 7a). Additionally, as shown in Figure 7b, our model outperformed the other state-of-the-art methods in Rank 1 accuracy on Market-1501, with a 15.52% lead over TAUDL, 13.08% over S2S, and 1% over CML. Figure 7c shows that our model also achieved the best Rank 1 accuracy on DukeMTMC-reID: compared with traditional methods like LOMO+XQDA, it achieved a remarkable 36.38% improvement, and among the deep learning methods, it outperformed TAUDL by 4.46% and BaseIDE by 0.94%. Overall, our model demonstrated superior Rank 1 accuracy across all datasets compared with the other state-of-the-art methods.
3.3. Evaluation of Multiple-Object Tracking Performance
In this section, a thorough assessment of the multi-object pedestrian tracking approach is provided. The experiments utilized video footage from various datasets as testing sources, including S2L1 and S2L2 from PETS2009 [46] as well as TUD-Crossing [47]. The experimental videos are listed in Table 2.
Multiple-object tracking accuracy (MOTA), precision (MOTP), false negatives (FN), false positives (FP), ID switches, mostly tracked (MT), and mostly lost (ML) are standard performance metrics. FP indicates incorrect detections, while FN refers to missed objects. In tracking by detection, FN and FP are mainly influenced by the detector, and there is often a trade-off; increasing sensitivity reduces FN but raises FP, and vice versa.
ID switches are crucial in tracking, as the goal is to maintain consistent object IDs. However, challenges like occlusions and similar appearances often cause ID switches. MOTA and MOTP measure overall tracking performance, while MT and ML assess trajectory completeness. A trajectory was considered MT if tracked for over 80% of its duration, and ML if tracked for less than 20%, without accounting for ID consistency.
The data association strategies of the tracking approach were evaluated using the two distance metrics derived in Section 2.3: the position distance $d^{(1)}$ and the feature distance $d^{(2)}$. Two approaches were assessed: the first utilized only the position distance for linking, while the second combined both distances, ensuring that detection boxes closely matched both position and features before being associated with a tracker.
The experiment utilized the PETS09-S2L1 and TUD-Crossing videos, evaluating tracking performance through four key metrics: FP, FN, IDs, and MOTA. As shown in Figure 8, the results demonstrate the impact of combining both position and feature distances in multi-object tracking. Using only the position distance $d^{(1)}$ for linking resulted in suboptimal tracking, with higher FP and IDs, a slight FN increase, and a lower MOTA. Conversely, employing both distance metrics, $d^{(1)}$ and $d^{(2)}$, significantly reduced the FP and IDs, leading to a higher MOTA. The figure clearly illustrates how the combined method outperformed the use of position distance alone, producing more robust tracking results across the test scenarios. These findings emphasize that considering only position distance leads to more linking errors, while incorporating feature distance effectively mitigates these issues, improving tracking accuracy in diverse environments.
This section outlines the comparison of four tracker initialization methods to improve tracking results, as follows:
Continual determination (CD): A trial tracker is established for potential new objects, requiring successful tracking for at least three frames before official initialization;
Detection confidence (DC): Detection boxes with confidence scores above a threshold are directly initialized;
Tracklet confidence (TC): Similar to CD, a trial tracker is created for new objects, but initialization occurs only when tracklet confidence exceeds a threshold;
Detection + tracklet confidence (DC + TC): Trial trackers are first created using detection confidence then initialized once both detection and tracklet confidence surpass set thresholds.
This experiment involved testing on the PETS09-S2L1 and TUD-Crossing videos, with evaluations based on FP, FN, IDs, and MOTA (Figure 9). In both scenarios, the CD approach resulted in higher FP and IDs, leading to lower MOTA. Using detection confidence or tracklet confidence alone reduced FP and IDs, slightly improving MOTA. Combining detection and tracklet confidence significantly reduced FP and IDs, with a slight increase in FN, ultimately achieving the best MOTA of the four methods.
In summary, the different initialization methods exhibited consistent trends in both the scenarios. Combining detection and tracklet confidence proved effective in reducing FP and IDs while maintaining a balanced FP and FN, resulting in the highest MOTA.
Experiments were performed on three videos (PETS09-S2L1, PETS09-S2L2, and TUD-Crossing) to compare the proposed method with state-of-the-art multiple-object tracking techniques, as shown in Table 3. Except for TSML_CDE and CDT, which are offline tracking methods, all the others are online tracking methods. Deep-SORT, AP_HWDPL, CDA-DDAL, and KCF are among the tracking methods that utilize deep learning networks.
In the experimental evaluation across the three videos, namely PETS09-S2L1, PETS09-S2L2, and TUD-Crossing, our proposed method consistently outperformed the other state-of-the-art techniques, excelling in metrics such as MT, ML, FN, and MOTA. The tracking results for the three videos are shown in Table 3. For PETS09-S2L1, our method achieved superior results, surpassing Deep-SORT by 2.5%, OMVT_TAAC by 12.8%, DMOT_MDP by 1.6%, and HMOT by 0.1% in terms of MOTA. For PETS09-S2L2, our method demonstrated excellent performance in MT, FN, and MOTA, outperforming EAMTT by 1.8%, AP_HWDPL by 6.1%, CDT by 14.4%, KCF by 10.3%, MDP_SubCNN by 14.6%, and CDA-DDAL by 11.3%. In the case of TUD-Crossing, our method performed best in ML, FN, and MOTA; in terms of MOTA, it outperformed CDT by 10.6%, KCF by 4.6%, MDP_SubCNN by 1.7%, TSML_CDE by 4.9%, AP_HWDPL by 2.7%, and CDA-DDAL by 1.7%.
Overall, our method consistently achieved the best results in FN and MOTA across the three experimental videos, demonstrating its adaptability to various video scenarios and its ability to overcome diverse challenges for accurate tracking outcomes.
The practical tracking outcomes of the proposed method are showcased, accompanied by an analysis. The results for PETS09-S2L1 are depicted in Figure 10a. Our method accurately tracked all objects, even when the pedestrians with IDs 1 and 2 briefly stopped and temporarily occluded each other during their encounter. Despite this, our tracker maintained continuous tracking of both pedestrians. Similarly, pedestrian ID 3 navigated around pedestrians 1 and 2 after passing a streetlight that momentarily obstructed the view; it also experienced occlusion during this process, but tracking seamlessly continued after the occlusion cleared.
The results for PETS09-S2L2 are shown in Figure 10b. In this video, a significant number of pedestrians walked randomly, and the pedestrian density was higher, leading to more pronounced occlusions among pedestrians. The results show that most of the targets were accurately tracked. Pedestrians located farther away from crowded areas, such as those with IDs 40 and 58, were successfully tracked. Within the crowded areas, apart from individual pedestrians who could not be detected owing to extensive occlusion, the majority, including pedestrians with IDs 41 and 11, were also effectively tracked.
The results for the TUD-Crossing dataset are illustrated in Figure 10c. Although this video featured fewer pedestrians, the frontal camera angle resulted in relatively larger areas of occlusion among pedestrians. The video depicts multiple pedestrians converging in the same direction, such as those with IDs 2, 3, 4, 5, and 6, who were sometimes temporarily obscured by pedestrians moving in the opposite direction, such as pedestrian ID 1. The images show that these pedestrians' IDs were maintained successfully, even after brief occlusions. Additionally, pedestrian ID 1 continued to be accurately tracked despite the presence of multiple pedestrians in the background.