Cross-Video Pedestrian Tracking Algorithm with a Coordinate Constraint
Abstract
1. Introduction
- The introduction of spatial affinity improves the accuracy and reliability of pedestrian re-identification.
- The proposed linear weighted algorithm enhances the precision and robustness of cross-video tracking effectively.
2. Related Work
2.1. Pedestrian Localization
2.2. Across-Video Pedestrian Tracking
3. Materials and Methods
3.1. Mapping the Coordinates of Pedestrians in the Video
3.1.1. Coordinate System Transformation
- The internal and external parameters of the camera — such as the focal length, pixel size, rotation matrix, and translation vector — are obtained via camera calibration. In this paper, Zhang Zhengyou’s checkerboard calibration method [36] is used to obtain the internal and external parameters of the surveillance cameras.
- The relationship between the pixel and the image coordinate system is a translation of the origin combined with a unit conversion from physical lengths to pixels [37].
- The relationship between the camera and the image coordinate system is based on perspective and can be calculated using the principle of similar triangles.
- The relationship between the world and the camera coordinate system is a rigid transformation involving rotation and translation.
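The three coordinate relationships above chain together into a single world-to-pixel projection. The sketch below illustrates that chain with assumed intrinsics and extrinsics (the focal length, principal point, rotation, and translation values are illustrative, not the paper's calibrated parameters):

```python
import numpy as np

# Illustrative camera parameters (assumed values, not from the paper).
f = 800.0                      # focal length, expressed in pixels
cx, cy = 640.0, 360.0          # principal point in pixel coordinates
R = np.eye(3)                  # world -> camera rotation matrix
t = np.array([0.0, 0.0, 5.0])  # world -> camera translation (metres)

def world_to_pixel(Pw):
    """Chain the three coordinate transforms described above."""
    # 1) world -> camera: rigid transform (rotation + translation)
    Pc = R @ Pw + t
    # 2) camera -> image: perspective projection (similar triangles)
    x = f * Pc[0] / Pc[2]
    y = f * Pc[1] / Pc[2]
    # 3) image -> pixel: shift the origin to the principal point
    return np.array([x + cx, y + cy])

u, v = world_to_pixel(np.array([1.0, 0.5, 0.0]))  # -> (800.0, 440.0)
```

Inverting this chain — recovering a ground-plane world coordinate from a pixel — is what allows pedestrian positions from two calibrated cameras to be compared in a common frame.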
3.1.2. Coordinate System Unification
3.2. Cross-Video Pedestrian Tracking Method
3.2.1. Traditional Pedestrian Detection and Matching
3.2.2. Pedestrian Matching by Introducing the Coordinate Constraint
3.2.3. Cross-Video Pedestrian Tracking with Linear Weighting
3.2.4. Cross-Video Pedestrian Tracking Algorithm Workflow
- Pedestrian X is detected and tracked using the CSRT algorithm to acquire its position in each frame.
- The coordinates of the two surveillance systems are unified using an overlapping view, principles of photography, and rotation and translation matrices.
- When pedestrian X enters the overlapping view, surveillance A obtains the position of X, and the candidate pedestrians in surveillance B are matched using the positional features and the features extracted via the neural network.
- The position of pedestrian X is projected onto the weighted auxiliary line, and weights are assigned to fuse the positions of X obtained from surveillances A and B.
- The weights of both surveillances are adjusted as pedestrian X moves from Cam_A toward Cam_B, enhancing the robustness of pedestrian tracking between the two surveillances.
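The distance-dependent weighting in the last two steps can be sketched as a simple linear blend. The complementary weighting below is an illustrative assumption standing in for the paper's exact formula; `s` denotes the pedestrian's normalised progress along the weighted auxiliary line from Cam_A (`s = 0`) toward Cam_B (`s = 1`):

```python
import numpy as np

def fuse_positions(pos_a, pos_b, s):
    """Linearly weight the position estimates from the two cameras.

    Near Cam_A the estimate from surveillance A dominates; as the
    pedestrian approaches Cam_B, surveillance B's estimate takes over.
    """
    s = min(max(s, 0.0), 1.0)   # clamp to the overlapping view
    w_a, w_b = 1.0 - s, s       # complementary linear weights
    return w_a * np.asarray(pos_a) + w_b * np.asarray(pos_b)

# Midway through the overlap, both cameras contribute equally.
fused = fuse_positions((10.0, 2.0), (12.0, 2.4), 0.5)  # -> [11.0, 2.2]
```

Down-weighting the camera that views the pedestrian from farther away suppresses the estimate with the larger expected tracking error, which is the stated purpose of the weighted auxiliary line.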
4. Results
4.1. Experimental Protocol
4.1.1. Experimental Scene
4.1.2. Experimental Hardware and Parameters
4.1.3. Pedestrian Tracking Performance Metrics
4.2. Pedestrian Detection and Matching
4.3. Cross-Video Pedestrian Tracking Experiment Results
4.3.1. Cross-Video Pedestrian Tracking without Coordinate Constraints
4.3.2. Cross-Video Pedestrian Tracking with Coordinate Constraint
5. Discussion
- The clarity and color of the surveillance footage affect the appearance features of monitored pedestrians; by introducing coordinates as additional feature information, the matching process considers not only appearance similarity but also the closeness of pedestrian locations. When numerous pedestrians are monitored, this greatly decreases the probability of incorrect matching.
- Because a pedestrian's location is obtained from multiple cameras, tracking is not limited to a single view and is strongly robust. At the same time, the weights of the pedestrian locations are assigned based on a weighted auxiliary line, which down-weights values with large tracking errors and yields more precise pedestrian tracking results.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Valera, M.; Velastin, S.A. Intelligent Distributed Surveillance Systems: A Review. IEE Proc. Vis. Image Process. 2005, 152, 192. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
- Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep Learning for Visual Tracking: A Comprehensive Survey. IEEE Trans. Intell. Transport. Syst. 2022, 23, 3943–3968. [Google Scholar] [CrossRef]
- Marčetić, D.; Maleš, L.; Ribarić, S. Crowd Motion Pattern Detection at the Microscopic Level. In Proceedings of the 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 20–24 May 2019; pp. 1093–1098. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
- Wei, L.; Zhang, S.; Yao, H.; Gao, W.; Tian, Q. Glad: Global-Local-Alignment Descriptor for Pedestrian Retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 420–428. [Google Scholar]
- Yan, W.; Forsyth, D.A. Learning the Behavior of Users in a Public Space through Video Tracking. In Proceedings of the 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05)—Volume 1, Breckenridge, CO, USA, 5–7 January 2005; pp. 370–377. [Google Scholar]
- Rao, J.; Xu, K.; Chen, J.; Lei, J.; Zhang, Z.; Zhang, Q.; Giernacki, W.; Liu, M. Sea-Surface Target Visual Tracking with a Multi-Camera Cooperation Approach. Sensors 2022, 22, 693. [Google Scholar] [CrossRef] [PubMed]
- Zhou, S.; Ke, M.; Qiu, J.; Wang, J. A Survey of Multi-Object Video Tracking Algorithms. In Proceedings of the International Conference on Applications and Techniques in Cyber Security and Intelligence ATCI 2018, Shanghai, China, 11–13 July 2018; Abawajy, J., Choo, K.-K.R., Islam, R., Xu, Z., Atiquzzaman, M., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 351–369. [Google Scholar]
- Han, S.; Miao, S.; Hao, X.; Chen, R. Spatial Localization Method for Dynamic Objects in Surveillance Videos. Bull. Surv. Mapp. 2022, 87–92. [Google Scholar] [CrossRef]
- Milosavljević, A.; Rančić, D.; Dimitrijević, A.; Predić, B.; Mihajlović, V. A Method for Estimating Surveillance Video Georeferences. ISPRS Int. J. Geo-Inf. 2017, 6, 211. [Google Scholar] [CrossRef]
- Gündoğdu, E.; Alatan, A.A. The Visual Object Tracking VOT2016 Challenge Results. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; Proceedings, Part II. Springer International Publishing: Cham, Switzerland, 2016; pp. 777–823. [Google Scholar]
- Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Zajc, L.Č.; Vojír, T.; Bhat, G.; Lukežič, A.; Eldesokey, A.; et al. The Sixth Visual Object Tracking VOT2018 Challenge Results. In Computer Vision–ECCV 2018 Workshops; Leal-Taixé, L., Roth, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11129, pp. 3–53. ISBN 978-3-030-11008-6. [Google Scholar]
- Li, S.; Yeung, D.-Y. Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
- Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep-IRTarget: An Automatic Target Detector in Infrared Imagery Using Dual-Domain Feature Extraction and Allocation. IEEE Trans. Multimed. 2021, 24, 1735–1749. [Google Scholar] [CrossRef]
- Sreenu, G.; Saleem Durai, M.A. Intelligent Video Surveillance: A Review through Deep Learning Techniques for Crowd Analysis. J. Big Data 2019, 6, 48. [Google Scholar] [CrossRef]
- Yang, L.; Hu, G.; Song, Y.; Li, G.; Xie, L. Intelligent Video Analysis: A Pedestrian Trajectory Extraction Method for the Whole Indoor Space without Blind Areas. Comput. Vis. Image Underst. 2020, 196, 102968. [Google Scholar] [CrossRef]
- Gupta, S.; Arbelaez, P.; Girshick, R.; Malik, J. Aligning 3D Models to RGB-D Images of Cluttered Scenes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4731–4740. [Google Scholar]
- Koppanyi, Z.; Toth, C.; Soltesz, T. Deriving Pedestrian Positions from Uncalibrated Videos. In Proceedings of the ASPRS Imaging & Geospatial Technology Forum (IGTF), Tampa, FL, USA, 4–8 May 2017. [Google Scholar]
- Varga, D.; Szirányi, T.; Kiss, A.; Spórás, L.; Havasi, L. A Multi-View Pedestrian Tracking Method in an Uncalibrated Camera Network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 37–44. [Google Scholar]
- Guo, Y.; Liu, Z.; Luo, H.; Pu, H.; Tan, J. Multi-Person Multi-Camera Tracking for Live Stream Videos Based on Improved Motion Model and Matching Cascade. Neurocomputing 2022, 492, 561–571. [Google Scholar] [CrossRef]
- Wang, Q.; Zhang, W.; Yang, W.; Xu, C.; Cui, Z. Prototype-Guided Instance Matching for Multiple Pedestrian Tracking. Neurocomputing 2023, 538, 126207. [Google Scholar] [CrossRef]
- Yang, H.; Cai, J.; Liu, C.; Ke, R.; Wang, Y. Cooperative Multi-Camera Vehicle Tracking and Traffic Surveillance with Edge Artificial Intelligence and Representation Learning. Transp. Res. Part C Emerg. Technol. 2023, 148, 103982. [Google Scholar] [CrossRef]
- Rajjak, S.S.A.; Kureshi, A.K. Multiple-Object Detection and Segmentation Based on Deep Learning in High-Resolution Video Using Mask-RCNN. Int. J. Patt. Recogn. Artif. Intell. 2021, 35, 2150038. [Google Scholar] [CrossRef]
- Wu, Y.-C.; Chen, C.-H.; Chiu, Y.-T.; Chen, P.-W. Cooperative People Tracking by Distributed Cameras Network. Electronics 2021, 10, 1780. [Google Scholar] [CrossRef]
- Gheissari, N.; Sebastian, T.B.; Hartley, R. Person Reidentification Using Spatiotemporal Appearance. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Volume 2 (CVPR’06), New York, NY, USA, 21–23 June 2006; Volume 2, pp. 1528–1535. [Google Scholar]
- Liu, C.; Zhang, L.; Huang, H. Visualization of Cross-View Multi-Object Tracking in Traffic Intersection Surveillance Videos. J. Comput. Sci. 2018, 41, 221–235. [Google Scholar]
- Khan, S.; Shah, M. Consistent Labeling of Tracked Objects in Multiple Cameras with Overlapping Fields of View. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 1355–1360. [Google Scholar] [CrossRef]
- Yang, X. A Moving Target Tracking Algorithm Based on Multi-Camera Coordination. In Proceedings of the 2021 IEEE 3rd Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 29–31 October 2021; pp. 182–185. [Google Scholar]
- Xu, X.; Li, X.; Zhao, H.; Liu, M.; Xu, A.; Ma, Y. A Real-Time, Continuous Pedestrian Tracking and Positioning Method with Multiple Coordinated Overhead-View Cameras. Measurement 2021, 178, 109386. [Google Scholar] [CrossRef]
- Luhmann, T.; Fraser, C.; Maas, H.-G. Sensor Modelling and Camera Calibration for Close-Range Photogrammetry. ISPRS J. Photogramm. Remote Sens. 2016, 115, 37–46. [Google Scholar] [CrossRef]
- Linder, W. Digital Photogrammetry; Springer: Berlin/Heidelberg, Germany, 2009; ISBN 978-3-540-92724-2. [Google Scholar]
- Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Machine Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Klema, V.; Laub, A. The Singular Value Decomposition: Its Computation and Some Applications. IEEE Trans. Automat. Contr. 1980, 25, 164–176. [Google Scholar] [CrossRef]
- Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013; ISBN 1-4214-0859-7. [Google Scholar]
- Liu, Z.; Gu, X.; Yang, H.; Wang, L.; Chen, Y.; Wang, D. Novel YOLOv3 Model with Structure and Hyperparameter Optimization for Detection of Pavement Concealed Cracks in GPR Images. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22258–22268. [Google Scholar] [CrossRef]
- Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A Survey of the Recent Architectures of Deep Convolutional Neural Networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Zhou, W.; Wang, H.; Wan, Z. Ore Image Classification Based on Improved CNN. Comput. Electr. Eng. 2022, 99, 107819. [Google Scholar] [CrossRef]
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
- Verma, K.K.; Kumar, P.; Tomar, A. Analysis of Moving Object Detection and Tracking in Video Surveillance System. In Proceedings of the 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015; pp. 1758–1762. [Google Scholar]
- Mohan, A.; Kaseb, A.S.; Gauen, K.W.; Lu, Y.-H.; Reibman, A.R.; Hacker, T.J. Determining the Necessary Frame Rate of Video Data for Object Tracking under Accuracy Constraints. In Proceedings of the 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018; pp. 368–371. [Google Scholar]
- Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science. Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9909, pp. 472–488. [Google Scholar]
- Dardagan, N.; Brđanin, A.; Džigal, D.; Akagic, A. Multiple Object Trackers in OpenCV: A Benchmark. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021. [Google Scholar]
- Ding, B.; Zhang, R.; Xu, L.; Liu, G.; Yang, S.; Liu, Y.; Zhang, Q. U2D2 Net: Unsupervised Unified Image Dehazing and Denoising Network for Single Hazy Image Enhancement. IEEE Trans. Multimed. 2023, 26, 202–217. [Google Scholar] [CrossRef]
- Gu, H.; Li, J.; Fu, G.; Wong, C.; Chen, X.; Zhu, J. Autoloss-Gms: Searching Generalized Margin-Based Softmax Loss Function for Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4744–4753. [Google Scholar]
- Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; Tian, Q. Pose-Driven Deep Convolutional Model for Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3960–3969. [Google Scholar]
- Guo, Y.; Li, W.; Yang, G.; Jiao, Z.; Yan, J. Combining Dilution of Precision and Kalman Filtering for UWB Positioning in a Narrow Space. Remote Sens. 2022, 14, 5409. [Google Scholar] [CrossRef]
- Ho, Y.H.; Chan, H.C. Decentralized Adaptive Indoor Positioning Protocol Using Bluetooth Low Energy. Comput. Commun. 2020, 159, 231–244. [Google Scholar] [CrossRef]
- Bi, J.; Zhao, M.; Yao, G.; Cao, H.; Feng, Y.; Jiang, H.; Chai, D. PSOSVRPos: WiFi Indoor Positioning Using SVR Optimized by PSO. Expert Syst. Appl. 2023, 222, 119778. [Google Scholar] [CrossRef]
| Candidate Pedestrians | Coordinates | Pcnn | f(i) | p |
|---|---|---|---|---|
| 1 | (405, 345) | 0.4140 | 2.032 × 10⁻² | 0.0084 |
| 2 | (1520, 284) | 0.1570 | 0 | 0 |
| 3 | (536, 308) | 0.4280 | 1.281 × 10⁻² | 0.0055 |
| 4 | (986, 878) | 7.249 × 10⁻⁵ | 4.033 × 10⁻¹⁷ | 0 |
| 5 | (591, 1146) | 0.0005 | 6.725 × 10⁻²² | 3.36 × 10⁻²⁷ |
| 6 | (386, 517) | 7.199 × 10⁻⁶ | 9.216 × 10⁻⁹ | 6.64 × 10⁻¹⁶ |
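The table combines a CNN appearance probability (Pcnn) with a positional affinity term f(i) into an overall matching score p. The sketch below shows one plausible realisation of this combination — a Gaussian closeness term multiplied by the appearance probability. Both the Gaussian form of f(i), the `sigma` scale, and the product rule are assumptions for illustration; the paper's exact definitions may differ:

```python
import math

def positional_affinity(cand_xy, pred_xy, sigma=50.0):
    """Gaussian closeness term f(i); sigma (pixels) is an assumed scale.

    Candidates far from the predicted position receive an affinity
    near zero, regardless of how similar they look.
    """
    d2 = (cand_xy[0] - pred_xy[0]) ** 2 + (cand_xy[1] - pred_xy[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def match_score(p_cnn, cand_xy, pred_xy):
    """Combine appearance probability with positional affinity.

    A simple product reproduces the pattern visible in the table
    (e.g. candidate 1: 0.4140 * 2.032e-2 is approximately 0.0084).
    """
    return p_cnn * positional_affinity(cand_xy, pred_xy)
```

Under this scheme a visually similar but distant candidate (such as candidate 2 in the table) is suppressed by its near-zero f(i), which is exactly the benefit the coordinate constraint provides.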
Share and Cite
Huang, C.; Li, W.; Yang, G.; Yan, J.; Zhou, B.; Li, Y. Cross-Video Pedestrian Tracking Algorithm with a Coordinate Constraint. Sensors 2024, 24, 779. https://doi.org/10.3390/s24030779