Detector–Tracker Integration Framework for Autonomous Vehicles Pedestrian Tracking
Abstract
1. Introduction
- (a) A pedestrian object detector based on an improved YOLOv7 [33] network is established. A Space-to-Depth (SPD) convolution layer is adopted to improve the YOLOv7 backbone; the resulting YOLOv7-SPD detects pedestrians more reliably in complex scenes (an illustrative SPD block is sketched under Section 3.2.2).
- (b) A novel appearance feature extraction network is proposed, which incorporates the convolutional structural re-parameterization idea to construct a full-scale feature extraction block, enabling comprehensive extraction of pedestrian appearance features (an illustrative block is sketched under Section 3.3.2).
- (c) Experiments were carried out on the MOT17 and MOT20 public datasets and on driving video sequences, and the tracking performance of the proposed framework was evaluated against state-of-the-art multi-object tracking algorithms.
2. Related Work
2.1. System Model
2.2. Object Detection Algorithms
2.3. Tracking by Detection Algorithms
2.3.1. Appearance Feature Extraction Based on Feature Re-Extraction
2.3.2. Matching-Based Deep Data Association
2.3.3. Evaluation Metrics for Multi-Object Tracking
- True Positive (TP): predicted trajectories that coincide with a ground-truth (GT) trajectory.
- False Positive (FP): predicted trajectories that do not coincide with any GT trajectory.
- False Negative (FN): GT trajectories that are not covered by any generated trajectory.
- ID switch (IDs): the number of times the identity assigned to a tracked object changes.
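For orientation, these counts combine into the summary metrics reported in Section 4. The standard CLEAR-MOT and identity definitions (following the MOTChallenge benchmark [63]) are, with GT the total number of ground-truth boxes and IDTP/IDFP/IDFN the identity-level analogues of TP/FP/FN:

$$
\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDs}}{\mathrm{GT}}, \qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}.
$$

HOTA complements both: per localization threshold it is the geometric mean of detection accuracy and association accuracy, and the reported score averages this over thresholds.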
3. Proposed Online Pedestrian MOT Algorithm
3.1. Overall Architecture
3.2. Object Detector
3.2.1. YOLOv7-SPD
3.2.2. Space-to-Depth (SPD)
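As a minimal sketch of the idea, in the spirit of Sunkara and Luo [58] (the class and parameter names below are illustrative, not the paper's exact implementation): a space-to-depth rearrangement keeps every pixel when downsampling, and a stride-1 convolution then mixes the stacked channels, avoiding the information loss of strided convolution or pooling.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Minimal Space-to-Depth convolution block: downsample by rearranging
    pixels into channels, then mix with a stride-1 convolution, so no pixel
    is discarded the way strided convolution or pooling discards it."""

    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # The rearrangement multiplies channels by scale**2.
        self.conv = nn.Conv2d(in_channels * scale**2, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # Slice the map into s*s interleaved sub-grids and stack them along
        # the channel axis: (B, C, H, W) -> (B, C*s*s, H/s, W/s), lossless.
        x = torch.cat([x[..., i::s, j::s] for i in range(s) for j in range(s)],
                      dim=1)
        return self.act(self.bn(self.conv(x)))

# Replacing a stride-2 convolution in the backbone:
# SPDConv(64, 128)(torch.randn(1, 64, 160, 160)).shape  # -> (1, 128, 80, 80)
```

Because the sub-grid slicing is lossless, small and low-resolution pedestrians keep their few informative pixels through downsampling, which is where strided layers hurt most.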
3.3. Object Tracker
3.3.1. Overview of DeepSORT
- (a) The appearance feature extraction network is a generic image classification network; it does not account for the large intra-class variation and high inter-class similarity of traffic scenes, so it struggles to produce accurate appearance feature vectors for the association step (sketched below).
- (b) The feature extraction network cannot learn full-scale features, so matching errors occur when the scale of a pedestrian object changes.
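To make the role of these feature vectors concrete, here is a minimal, appearance-only sketch of the association step they feed, in the spirit of DeepSORT [16]. The real tracker additionally gates candidates by Mahalanobis distance under a Kalman motion model and matches in a cascade; the function name and threshold below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats: np.ndarray, det_feats: np.ndarray,
              max_cos_dist: float = 0.2) -> list[tuple[int, int]]:
    """Appearance-only association: cosine distance between L2-normalised
    embeddings, solved as a linear assignment problem."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                      # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style matching
    # Keep only pairs whose appearance distance passes the gate.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cos_dist]
```

A track and a detection are matched only when their embeddings are close, which is exactly where an inaccurate appearance network causes ID switches.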
3.3.2. Improved Feature Extraction Network
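A minimal sketch of the two ingredients this section combines, assuming the paper's block follows the RepVGG [59] re-parameterization idea and OSNet-style [62] multi-receptive-field aggregation; class names, widths, and the number of streams are illustrative, not the authors' exact FSNet design.

```python
import torch
import torch.nn as nn

class RepBlock(nn.Module):
    """Training-time re-parameterizable unit (RepVGG idea [59]): parallel
    3x3, 1x1 and identity branches whose outputs are summed; at deploy time
    the three branches can be fused into a single 3x3 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_id = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.branch3x3(x) + self.branch1x1(x) + self.branch_id(x))

class FullScaleBlock(nn.Module):
    """Full-scale feature unit: streams of stacked RepBlocks approximate
    different receptive fields (OSNet-style [62]); summing them covers
    appearance cues from small body parts up to the whole pedestrian."""

    def __init__(self, channels: int, streams: int = 3):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.Sequential(*[RepBlock(channels) for _ in range(d + 1)])
            for d in range(streams))

    def forward(self, x):
        return sum(s(x) for s in self.streams)

# Usage: FullScaleBlock(256)(torch.randn(8, 256, 32, 16)).shape  # unchanged
```

The payoff of re-parameterization is that the richer training-time topology costs nothing at inference: after branch fusion, each RepBlock runs as one plain 3×3 convolution with identical outputs.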
4. Experiments
4.1. Experimental Environment
4.2. Datasets
4.2.1. Public Datasets
4.2.2. Driving Datasets
4.3. Implementation Results
4.3.1. Quantitative Analysis
4.3.2. Qualitative Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Wang, Y.; Liu, Y.; Ma, M.; Mei, S. A Spectral–Spatial Transformer Fusion Method for Hyperspectral Video Tracking. Remote Sens. 2023, 15, 1735.
2. Luo, Y.; Yin, D.; Wang, A. Pedestrian tracking in surveillance video based on modified CNN. Multimed. Tools Appl. 2018, 77, 24041–24058.
3. Hao, J.X.; Zhou, Y.M.; Zhang, G.S. A review of objects tracking algorithm based on UAV. In Proceedings of the 2018 IEEE International Conference on Cyborg and Bionic Systems, Shenzhen, China, 25–27 October 2018; pp. 328–333.
4. Li, Y.; Wei, P.; You, M.; Wei, Y.; Zhang, H. Joint Detection, Tracking, and Classification of Multiple Extended Objects Based on the JDTC-PMBM-GGIW Filter. Remote Sens. 2023, 15, 887.
5. Zhang, J.; Xiao, W.; Mills, J.P. Optimizing Moving Object Trajectories from Roadside Lidar Data by Joint Detection and Tracking. Remote Sens. 2022, 14, 2124.
6. Peng, X.; Shan, J. Detection and Tracking of Pedestrians Using Doppler LiDAR. Remote Sens. 2021, 13, 2952.
7. Ciaparrone, G.; Sánchez, F.L.; Tabik, S. Deep learning in video multi-object tracking: A survey. Neurocomputing 2020, 381, 61–88.
8. Xu, Y.; Zhou, X.; Chen, S. Deep learning for multiple object tracking: A survey. IET Comput. Vis. 2019, 13, 355–368.
9. Tang, S.; Andriluka, M.; Andres, B. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3539–3548.
10. Keuper, M.; Tang, S.; Andres, B. Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 140–153.
11. Henschel, R.; Zou, Y.; Rosenhahn, B. Multiple people tracking using body and joint detections. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019.
12. Zhou, W.; Luo, Q.; Wang, J.; Xing, W. Distractor-aware discrimination learning for online multiple object tracking. Pattern Recognit. 2020, 107, 107512.
13. Yang, J.; Ge, H.; Yang, J. Online multi-object tracking using multi-function integration and tracking simulation training. Appl. Intell. 2022, 52, 1268–1288.
14. Liu, Q.; Chu, Q.; Liu, B.; Yu, N. GSM: Graph similarity model for multi-object tracking. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), Online, 7–15 January 2021; pp. 530–536.
15. Bewley, A.; Ge, Z.Y.; Ott, L. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
16. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
17. Azimi, S.M.; Kraus, M.; Bahmanyar, R.; Reinartz, P. Multiple Pedestrians and Vehicles Tracking in Aerial Imagery Using a Convolutional Neural Network. Remote Sens. 2021, 13, 1953.
18. Zhang, Y.F.; Wang, C.Y.; Wang, X.G. FairMOT: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087.
19. Duan, K.W.; Bai, S.; Xie, L.X. CenterNet: Keypoint triplets for object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577.
20. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 474–490.
21. Lu, Z.C.; Rathod, V.; Votel, R. RetinaTrack: Online single stage joint detection and tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14656–14666.
22. Liang, C.; Zhang, Z.; Lu, Y. Rethinking the competition between detection and ReID in multi-object tracking. arXiv 2020, arXiv:2010.12138.
23. Liang, C.; Zhang, Z.P.; Zhou, X. One more check: Making "fake background" be tracked again. arXiv 2021, arXiv:2104.09441.
24. Yu, E.; Li, Z.L.; Han, S.D. RelationTrack: Relation-aware multiple object tracking with decoupled representation. arXiv 2021, arXiv:2105.04322.
25. Li, J.X.; Ding, Y.; Wei, H.L. SimpleTrack: Rethinking and improving the JDE approach for multi-object tracking. arXiv 2022, arXiv:2203.03985.
26. Wan, X.Y.; Zhou, S.P.; Wang, J.J. Multiple object tracking by trajectory map regression with temporal priors embedding. In Proceedings of the 2021 ACM Multimedia Conference, New York, NY, USA, 17 October 2021; pp. 1377–1386.
27. Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. arXiv 2017, arXiv:1706.03762.
28. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460.
29. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with dense queries for multiple-object tracking. arXiv 2021, arXiv:2103.15145.
30. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L. TrackFormer: Multi-object tracking with transformers. arXiv 2021, arXiv:2101.02702.
31. Carion, N.; Massa, F.; Synnaeve, G. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872.
32. Zeng, F.G.; Dong, B.; Wang, T.C. MOTR: End-to-end multiple-object tracking with transformer. arXiv 2021, arXiv:2105.03247.
33. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
34. Girshick, R.; Donahue, J.; Darrell, T. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
35. He, K.M.; Zhang, X.Y.; Ren, S.Q. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
36. Lin, T.Y.; Dollár, P.; Girshick, R. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
37. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
38. Ren, S.Q.; He, K.M.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
39. Redmon, J.; Divvala, S.; Girshick, R. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
40. Liu, W.; Anguelov, D.; Erhan, D. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
41. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.M.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
42. Zhang, Y.F.; Sun, P.Z.; Jiang, Y.J. ByteTrack: Multi-object tracking by associating every detection box. arXiv 2021, arXiv:2110.06864.
43. Shan, C.B.; Wei, C.B.; Deng, B. Tracklets Predicting Based Adaptive Graph Tracking. arXiv 2020, arXiv:2010.09015.
44. Cao, J.; Weng, X.; Khirodkar, R. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. arXiv 2022, arXiv:2203.14360.
45. He, K.M.; Zhang, X.Y.; Ren, S.Q. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
47. Szegedy, C.; Liu, W.; Jia, Y.Q. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
48. Yang, F.; Chang, X.; Sakti, S. ReMOT: A model-agnostic refinement for multiple object tracking. Image Vis. Comput. 2021, 106, 104091.
49. Baisa, N.L. Robust online multi-object visual tracking using a HISP filter with discriminative deep appearance learning. J. Vis. Commun. Image Represent. 2021, 77, 102952.
50. Chen, L.; Ai, H.Z.; Zhuang, Z.J. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo, San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
51. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T. StrongSORT: Make DeepSORT great again. IEEE Trans. Multimed. 2023.
52. Karthik, S.; Prabhu, A.; Gandhi, V. Simple unsupervised multi-object tracking. arXiv 2020, arXiv:2006.02609.
53. Baisa, N.L. Occlusion-robust online multi-object visual tracking using a GM-PHD filter with a CNN-based re-identification. J. Vis. Commun. Image Represent. 2021, 80, 103279.
54. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. TransMOT: Spatial-temporal graph transformer for multiple object tracking. arXiv 2021, arXiv:2104.00194.
55. Xu, Y.; Ošep, A.; Ban, Y.; Horaud, R.; Leal-Taixé, L.; Alameda-Pineda, X. How to train your deep multi-object tracker. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6787–6796.
56. Son, J.; Baek, M.; Cho, M.; Han, B. Multi-object tracking with quadruplet convolutional neural networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5620–5629.
57. Sajjadi, M.S.; Vemulapalli, R.; Brown, M. Frame-recurrent video super-resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6626–6634.
58. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. arXiv 2022, arXiv:2208.03641.
59. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742.
60. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
61. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
62. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712.
63. Dendorfer, P.; Ošep, A.; Milan, A. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. Int. J. Comput. Vis. 2021, 129, 845–881.
64. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123.
65. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221.
66. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Bu, J.; Tian, Q. Person re-identification meets image search. arXiv 2015, arXiv:1502.02171.
| Stage | FSNet | Output (H × W, C) |
|---|---|---|
| Conv1 | conv 3×3, stride 2 | 128 × 64, 64 |
|       | max pooling 3×3, stride 2 | 64 × 32, 64 |
| Conv2 | block × 2 | 64 × 32, 256 |
| Conv3 | conv 1×1 | 64 × 32, 256 |
|       | average pooling 2×2, stride 2 | 32 × 16, 256 |
| Conv4 | block × 2 | 32 × 16, 384 |
| Conv5 | conv 1×1 | 32 × 16, 384 |
|       | average pooling 2×2, stride 2 | 16 × 8, 384 |
| Conv6 | block × 2 | 16 × 8, 512 |
| Conv7 | conv 1×1 | 16 × 8, 512 |
| Gap | global average pooling | 1 × 1, 512 |
| Fc | fully connected | 1 × 1, 512 |
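Read as a plain feed-forward stack, the table corresponds to the following shape flow (a sketch assuming 256 × 128 RGB input crops; the single 3×3 convolution standing in for each `block` is only a placeholder for the re-parameterized full-scale unit of Section 3.3.2, used here to check dimensions):

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k, s=1, p=0):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU())

def block(c_in, c_out):
    # Placeholder for the paper's full-scale block; shapes only.
    return conv_bn_relu(c_in, c_out, 3, p=1)

fsnet_shapes = nn.Sequential(
    conv_bn_relu(3, 64, 3, s=2, p=1),      # Conv1: 256x128 -> 128x64, 64
    nn.MaxPool2d(3, stride=2, padding=1),  #        128x64 -> 64x32, 64
    block(64, 256), block(256, 256),       # Conv2: 64x32, 256
    conv_bn_relu(256, 256, 1),             # Conv3: 1x1 conv, 64x32, 256
    nn.AvgPool2d(2, stride=2),             #        64x32 -> 32x16, 256
    block(256, 384), block(384, 384),      # Conv4: 32x16, 384
    conv_bn_relu(384, 384, 1),             # Conv5: 1x1 conv, 32x16, 384
    nn.AvgPool2d(2, stride=2),             #        32x16 -> 16x8, 384
    block(384, 512), block(512, 512),      # Conv6: 16x8, 512
    conv_bn_relu(512, 512, 1),             # Conv7: 16x8, 512
    nn.AdaptiveAvgPool2d(1),               # Gap:   1x1, 512
    nn.Flatten(),
    nn.Linear(512, 512),                   # Fc:    512-d embedding
)

print(fsnet_shapes(torch.randn(1, 3, 256, 128)).shape)  # torch.Size([1, 512])
```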
Method | HOTA (↑) | IDF1 (↑) | MOTA (↑) | IDs (↓) | FPS (↑) |
---|---|---|---|---|---|
SORT [15] | 32.9 | 38.1 | 41.2 | 4816 | 114 |
FairMOT [18] | 56.0 | 70.3 | 71.2 | 3398 | 22 |
DeepMOT [55] | 40.9 | 52.6 | 52.9 | 1990 | 6 |
CenterTrack [20] | 51.2 | 63.3 | 66.8 | 3139 | 5 |
TransTrack [28] | 53.3 | 62.6 | 72.8 | 3678 | 43 |
TransCenter [29] | 52.3 | 62.1 | 71.7 | 4911 | 2 |
TransMOT [30] | 60.1 | 72.9 | 73.7 | 2640 | 2 |
ByteTrack [42] | 58.9 | 75.6 | 76.2 | 2298 | 22 |
StrongSORT [51] | 63.3 | 74.5 | 75.8 | 1646 | 18 |
DeepSORT [16] | 60.1 | 72.8 | 74.6 | 1898 | 14 |
Ours | 62.4 | 74.7 | 77.0 | 1591 | 17 |
Method | HOTA (↑) | IDF1 (↑) | MOTA (↑) | IDs (↓) | FPS (↑) |
---|---|---|---|---|---|
SORT [15] | 34.4 | 42.9 | 38.1 | 4968 | 48 |
FairMOT [18] | 52.4 | 64.8 | 60.2 | 5419 | 13 |
DeepMOT [55] | 38.2 | 49.7 | 49.4 | 2065 | 3 |
CenterTrack [20] | 48.1 | 59.7 | 62.4 | 3320 | 3 |
TransTrack [28] | 50.8 | 58.6 | 72.1 | 3875 | 43 |
TransCenter [29] | 41.2 | 47.5 | 55.1 | 5023 | 2 |
TransMOT [30] | 54.1 | 68.7 | 65.8 | 2891 | 2 |
ByteTrack [42] | 57.1 | 72.3 | 71.5 | 2361 | 13 |
StrongSORT [51] | 59.8 | 74.1 | 70.9 | 1809 | 9 |
DeepSORT [16] | 56.5 | 66.1 | 67.3 | 2269 | 8 |
Ours | 58.9 | 74.3 | 71.8 | 1788 | 10 |
Method | HOTA (↑) | IDF1 (↑) | MOTA (↑) | IDs (↓) | FPS (↑) |
---|---|---|---|---|---|
SORT [15] | 33.7 | 38.9 | 42.4 | 4796 | 112 |
FairMOT [18] | 57.2 | 71.2 | 72.5 | 3306 | 20 |
DeepMOT [55] | 42.4 | 53.2 | 53.4 | 1954 | 5 |
CenterTrack [20] | 52.2 | 64.7 | 67.3 | 3039 | 4 |
TransTrack [28] | 54.0 | 63.1 | 74.2 | 3609 | 42 |
TransCenter [29] | 54.3 | 64.3 | 72.5 | 4710 | 2 |
TransMOT [30] | 60.5 | 74.6 | 75.1 | 2340 | 2 |
ByteTrack [42] | 60.1 | 76.2 | 78.4 | 2236 | 22 |
StrongSORT [51] | 63.5 | 75.1 | 76.3 | 1446 | 18 |
DeepSORT [16] | 61.1 | 74.2 | 76.0 | 1837 | 14 |
Ours | 63.7 | 78.5 | 78.3 | 1453 | 16 |
Method | HOTA (↑) | IDF1 (↑) | MOTA (↑) | IDs (↓) | FPS (↑) |
---|---|---|---|---|---|
SORT [15] | 35.7 | 44.7 | 40.4 | 4856 | 46 |
FairMOT [18] | 53.9 | 66.1 | 60.5 | 5235 | 12 |
DeepMOT [55] | 39.3 | 50.2 | 49.9 | 2098 | 3 |
CenterTrack [20] | 48.9 | 60.1 | 63.6 | 3423 | 2 |
TransTrack [28] | 51.6 | 59.7 | 73.2 | 3891 | 24 |
TransCenter [29] | 42.1 | 48.2 | 56.4 | 4690 | 1 |
TransMOT [30] | 54.3 | 69.2 | 67.9 | 2451 | 1 |
ByteTrack [42] | 59.7 | 74.5 | 73.8 | 2531 | 12 |
StrongSORT [51] | 60.2 | 74.7 | 71.4 | 1782 | 8 |
DeepSORT [16] | 57.4 | 68.1 | 69.7 | 2095 | 7 |
Ours | 60.4 | 75.1 | 73.9 | 1546 | 8 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).