1. Introduction
Ship wake detection and tracking are critical for ensuring maritime safety, conducting effective ocean monitoring, managing maritime affairs, ensuring national defense and military security, carrying out search and rescue operations, and advancing scientific research. Ship tracking technology helps to monitor maritime traffic, prevent collisions and accidents, and protect the safety of crew members and cargo. In ocean monitoring, it helps to oversee vessel activities at sea and is vital for preventing illegal fishing, smuggling, and other maritime crimes. Moreover, ship tracking plays a significant role in maritime management, assisting authorities in regulating sea traffic, optimizing shipping routes, and enhancing efficiency. In the domain of national defense and military security, this technology aids in monitoring and controlling territorial waters, safeguarding national security, and promptly identifying and responding to potential threats. During search and rescue missions, ship tracking technology can rapidly locate distressed vessels or individuals at sea, thereby improving the efficiency of rescue operations, and is essential for reducing loss of life and property in maritime disasters. Finally, ship tracking technology provides a wealth of data for marine scientific research, aiding in the study of changes in the marine environment, marine biodiversity, and the sustainable use of marine resources. As technology progresses, the application of ship tracking technology in these fields is anticipated to become more extensive and profound.
Compared to vessels themselves, the wakes left by ships cover a significantly larger area in satellite imagery. This characteristic enables detection under conditions that require lower camera altitude and resolution, making it more accessible and practical for a variety of monitoring applications. However, the pursuit of ship wake tracking is riddled with challenges such as trajectory overlap, data deficiency, and the dynamics of various environmental changes. In heavily trafficked sea areas, the wakes of multiple vessels may intersect and overlap, complicating the process of differentiation. The limitations of monitoring equipment along with environmental factors such as adverse weather conditions can lead to incomplete data. The fluctuating marine environment, including changes in currents and wind directions, also impacts tracking of wakes. Variations in a ship’s speed and heading contribute to the complexity and variability of wake patterns. Furthermore, sensor noise and rapid dispersion of wakes in the ocean increase the difficulty of tracking. Different lighting and imaging conditions such as nighttime and cloudy weather amplify the difficulty of tracking wakes through satellite or aerial imagery. Moreover, sophisticated data processing and analytical techniques are required for automatic detection and discrimination of wakes. At times, vessels may deliberately take measures to conceal their wakes in order to protect privacy or increase security, adding an extra layer of challenge to the tracking process.
Existing methods for ship tracking face certain challenges and limitations when dealing with spatial and temporal variations. In sea areas with dense ship traffic, such as ports or confluences of shipping lanes, distinguishing and tracking the trajectory of each vessel becomes highly complex. The dynamics of the maritime environment, including changes in ocean currents, wind speed, and wind direction, can affect the movement of ships, and traditional algorithms may fail to accurately anticipate these fluctuations. The collected data may be incomplete or noisy due to weather conditions or sensor malfunction, affecting the continuity and accuracy of tracking. Variations in the size and shape of ships can impact the detection capabilities of algorithms as well. Ships may be obscured by other vessels, islands, or coastlines, leading to interruptions in the tracking process. Traditional tracking algorithms often rely on simple prediction models that may not capture the complex patterns of ship movement. As the number of tracking targets increases, the computational burden on the algorithm increases significantly, especially in applications requiring real-time processing. Some algorithms may perform well on specific datasets but fail to function effectively in new or unknown environments. Furthermore, ship tracking systems must operate while protecting privacy and ensuring data security, which may constrain the design and implementation of algorithms.
Trajectory correlation in maritime surveillance is instrumental in linking consecutive observations of maritime targets into comprehensive trajectories. This process aids in the accurate identification and classification of targets, mitigates false alarms, anticipates target behaviors, and optimizes the allocation of surveillance resources. In addition, it enhances tracking precision, bolsters situational awareness, supports compliance checks, and aids in search and rescue operations. Moreover, trajectory correlation plays a crucial role in environmental protection by monitoring illegal discharge activities. These functionalities are realized through technologies such as data fusion, pattern recognition, and machine learning, thereby enhancing the efficiency and effectiveness of maritime surveillance systems.
2. Related Works
The luminous pixels of ship targets are prone to interference from terrestrial harbors, oceanic islands, and other artificial structures. Furthermore, the patterns of ship wakes often blend with linear structural features such as coastlines, oil spills, and internal ocean waves, making them difficult to distinguish. Such interference significantly undermines the robustness of many conventional ship and wake detection algorithms; hence, there is an urgent need to develop automated techniques that ensure efficient and accurate detection. Despite extensive research and demonstrations of powerful feature learning capabilities by various methods, the focus has primarily been on ship detection rather than wake detection. Researchers including Wang et al. [1], Wei et al. [2], Zhang et al. [3], Zhang et al. [4], and Ding et al. [5] have each produced distinct datasets of remotely sensed ships, which serve as a foundation for ship target detection studies [6]. Zhang et al. [7], Shao et al. [8], and Li et al. [9] have explored algorithmic models for ship detection, aiming to construct more sophisticated deep learning networks for this purpose. In the realm of wake detection, Kang and Kim [10] utilized convolutional neural networks (CNNs) to discern ships and wakes in Synthetic Aperture Radar (SAR) imagery, achieving relatively precise velocity estimations; although their study was based on a limited sample of 200 ship and wake instances, the results were promising. Del Prete et al. developed a novel dataset comprising 250 Sentinel-1 wake images for evaluating wake detection with various standard CNNs, confirming the efficacy of CNNs in this domain despite the dataset's constraints [11]. Additionally, there is ongoing research into simulating ship wake data, which is pivotal for enabling deep learning-based wake detection and ship classification tasks [12]. Xue et al. created the SWIM dataset, an optical remote sensing collection consisting of 15,356 ship wake instances, and proposed WakeNet, which integrates frequency-channel attention with the wake's frequency characteristics to highlight the capability of frequency-domain feature extraction for wake detection [13]. Ding et al. manually curated and labeled Gaofen-3 images to assemble a SAR remote sensing dataset encompassing 862 ship and wake pairs [14]. Through the observation and documentation of a ship's wake, valuable insights into the ship's position, velocity, trajectory, and conduct can be gleaned. Analyzing the wake's attributes and patterns allows inferences to be made regarding the ship's category, dimensions, functionality, propulsion system, maneuverability, and hydrodynamics. Moreover, it aids in assessing and monitoring the environmental impact of maritime activities, including the discharge of wastewater and oil pollution [15,16].
The goal of target detection is to predict a set of bounding boxes and categorical labels for each object of interest. Modern detectors address this challenge indirectly by formulating regression and classification problems over numerous proposals [17], anchor boxes [18], or window centers [19,20]. The effectiveness of these detectors is substantially influenced by postprocessing procedures, anchor box design, and heuristics for associating target boxes with anchor boxes. To streamline these processes, Carion et al. introduced a direct set prediction approach known as the DEtection TRansformer (DETR), which obviates the need for intermediary surrogate tasks [21]. DETR simplifies the training process by treating target detection as a straightforward set prediction task and employs an encoder–decoder architecture based on transformers [22]. Such architectures explicitly model interactions between sequence elements through the transformer's self-attention mechanism, making them particularly well suited for handling set prediction-specific constraints such as avoiding duplicate predictions. However, DETR faces challenges due to the limitations of the transformer's attention module when dealing with image feature maps, exhibiting slow convergence, limited feature space resolution, and unclear query interpretation. To address these issues, extensive research has focused on enhancing DETR's performance through innovations in pairwise attention, querying, positional embedding, and bipartite matching.
Due to the high computational complexity associated with attention mechanisms, especially when dealing with a large number of image pixels, Zhu et al. introduced Deformable DETR [23]. This method builds upon the concept of deformable convolution and implements deformable attention [24], merging the sparse spatial sampling of deformable convolution with the relational modeling capabilities of transformers. This approach addresses the issues of slow convergence and high complexity encountered with DETR, introducing multiscale deformable attention and various versions of Deformable DETR to tackle the challenge of limited detection performance for small targets. Dai et al. introduced Dynamic DETR, which implicitly incorporates a coarse-to-fine box coding approach by introducing dynamic attention and stacking dynamic decoders at the codec stage [25]. To address the computational expense and slow convergence issues associated with DETR, Yao et al. proposed Efficient DETR [26], which improves target query initialization and reduces the number of required decoders; furthermore, the number of proposals can be dynamically adjusted during training, further reducing DETR's computational demands. Zhang et al. presented DDQ DETR [27], which introduces dense distinct queries into the framework; these dense queries are then progressively filtered to produce the final sparse output. To tackle the issue of poor-quality content embedding in DETR, which hampers precise object localization, Meng et al. introduced Conditional DETR [28]. This approach decouples content information from location information in the original cross-attention, enabling the decoder to attend to the extremity regions of the target, in turn improving object localization and reducing DETR's reliance on content embedding during training. To address the issue of unclear and ambiguous learnable embeddings in DETR, Wang et al. introduced Anchor DETR [29]. This approach draws inspiration from the concept of anchors in CNN target detectors, with each query in Anchor DETR based on a specific anchor, allowing queries to focus on targets located near their respective anchors. Liu et al. expanded on this approach by introducing DAB DETR, which adds scale information to the anchor points to make DETR's queries more interpretable [30]. This scale information enhances query interpretability and acts as a positional prior for faster model convergence, while also modulating the attention map with scale information related to the bounding boxes. Sun et al. attributed the slow convergence of DETR to the Hungarian loss applied to its bipartite matches, arguing that the noise present at different training stages leads to unstable matching; to address this, they introduced TSP-FCOS and TSP-RCNN [31]. These methods eliminate this stochasticity by using a trained, fixed teacher model whose predicted bipartite matches serve as ground truth labels, thereby removing randomness from the training process. Li et al. first tackled the problem of unstable bipartite matching in DETR by conducting denoising training, leading to the development of DN DETR [32]. This approach not only accelerates model convergence but also significantly improves detection results. DN DETR focuses on predicting anchor boxes with nearby ground truths, but struggles when there are anchor boxes without nearby objects. To overcome this limitation, Zhang et al. merged DN DETR, DAB DETR, and Deformable DETR to create DINO [33]. DINO employs contrastive denoising training to discard unnecessary anchor boxes, thereby enhancing overall performance. The notion of end-to-end target detection has garnered significant attention and research thanks to its simplified structure and efficient performance. Ongoing efforts to enhance model convergence and reduce computational complexity have highlighted the impressive capabilities and potential of the end-to-end target detection approach. Within this paradigm, research on oriented target detection methods is essential for addressing the challenges of DETR, including slow convergence and high computational requirements. This investigation holds immense importance for the field of oriented target detection.
Target tracking has evolved significantly with the advent of advanced computational techniques, becoming a cornerstone in the domains of computer vision and artificial intelligence [34]. Target tracking involves the localization and motion prediction of objects within a sequence of frames, which is crucial for a myriad of applications such as surveillance, autonomous driving, and human–computer interaction [35]. Over the past decades, research into target tracking has transitioned from traditional algorithms based on Kalman filters and mean shift to more complex models that leverage machine learning and deep learning [36]. These modern approaches have shown remarkable improvements in handling challenges such as occlusion, scale variation, and fast motion [37]. In particular, deep learning has revolutionized the way in which features are extracted and learned for tracking tasks. CNNs have become instrumental in automatically capturing rich representations from data, which has led to significant enhancements in tracking accuracy and robustness [38]. Moreover, the integration of recurrent neural networks and attention mechanisms has further improved the ability of trackers to model complex object behaviors and interactions [39].
SORT, an acronym for Simple Online and Realtime Tracking, has emerged as a pivotal algorithm in the field of object tracking due to its efficiency and simplicity [40]. Since its inception, SORT has been widely adopted for applications where real-time performance is crucial, such as surveillance systems and autonomous vehicles. SORT operates on the principle of tracking-by-detection, utilizing a detection model to identify objects in each frame and a Kalman filter to predict the motion of these objects [41]. This approach allows SORT to handle multiple targets simultaneously while maintaining low computational overhead. Despite its effectiveness, SORT faces challenges in complex scenarios with heavy occlusion or rapid movements, where its reliance on detection accuracy becomes a limiting factor [42]. To address these issues, researchers have proposed enhancements to the SORT framework, including the incorporation of deep learning-based detectors and more sophisticated state estimation techniques [43]. The SORT algorithm also stands out for its ability to provide a strong baseline in multi-object tracking benchmarks, often serving as a starting point for more advanced algorithms that aim to improve tracking accuracy while preserving real-time capabilities [44]. As the field of object tracking continues to evolve, SORT remains a foundational algorithm, inspiring various extensions and improvements; the pursuit of more robust and adaptive tracking solutions is an ongoing endeavor, with SORT being a key reference in this dynamic research landscape.
Despite these advancements, there are numerous outstanding challenges that the research community aims to address. These include the need for real-time tracking capabilities, the generalization of trackers to diverse datasets, and the development of more efficient algorithms that can operate on resource-constrained platforms [45]. The pursuit of more accurate and efficient tracking algorithms is an active area of research, with ongoing efforts focused on improving the computational efficiency, adaptability, and robustness of tracking models [46]. As the field progresses, it is expected that future trackers will become more intelligent, requiring less manual intervention while providing more reliable tracking results in various scenarios [47].
In conclusion, ship wake tracking is a vital area of research with significant implications for maritime safety, environmental monitoring, and security. The integration of advanced machine learning techniques, particularly deep learning, has revolutionized the field, providing more accurate and efficient methods for tracking and analyzing ship wakes [48].
This paper introduces a novel approach to ship tracking that leverages the TriangularSORT algorithm to monitor vessels through their triangular wake tracks. This method addresses the issue in traditional bounding box detection that the center of the rectangle lacks specific physical meaning. By closely associating the vertices of the triangular wake with the coordinates of the ship, the proposed approach provides enhanced ship tracking ability. The method shifts the focus from rectangular bounding boxes to triangular tracks, which better represent the physical characteristics of a ship's wake. The vertices of the triangle are used to pinpoint the key points of the ship's movement, providing a more accurate and physically meaningful representation of the ship's trajectory. This tracking technique offers several advantages over conventional rectangular bounding box methods, particularly in scenarios where the dynamics of the ship's wake are critical to understanding its movement. The TriangularSORT algorithm is designed to handle multiple targets efficiently in real-time tracking scenarios. By integrating this algorithm with our triangular wake-tracking approach, it is possible to significantly improve the precision and reliability of ship tracking. This integration allows for the effective management of complex maritime traffic in situations where multiple vessels and their wakes must be monitored simultaneously. In summary, the use of triangular wake tracks in conjunction with the TriangularSORT algorithm presents a promising advancement in the field of ship tracking. This method not only overcomes the limitations of traditional detection methods but also opens up new possibilities for enhanced maritime surveillance and vessel management.
3. The Principle of TriangularSORT
This section initially introduces detection-based tracking methods. These rely on independent object detection in each frame of the video and subsequent association of detection results to form continuous tracking trajectories. The rest of this section elaborates on the combination of DETR with Triangular Intersection over Union (IoU), an evaluation metric specifically designed to measure the similarity between predicted and actual triangular bounding boxes. Furthermore, we introduce the Triangular Attention Mechanism, which guides the model’s focus to key areas of the image by defining triangular points on the feature map. This enhances the model’s ability to recognize and analyze local features in visual tasks. Lastly, the design of the TriangularSORT multi-object tracking algorithm is described. This algorithm significantly improves the accuracy and robustness in handling multi-object tracking tasks through the integration of deep learning detectors, optimized feature extraction techniques, and sophisticated data association strategies. In addition, it incorporates the AFLink and GSI algorithms, with Triangle IoU serving as the matching evaluation metric. This section presents the methodological foundations of our proposal for achieving efficient and accurate ship tracking.
3.1. Detection-Based Tracking
Detection-based tracking is a method for tracking objects in video sequences. It relies on independent object detection in each frame and association of the detection results to form continuous tracking trajectories. The main steps of the method are as follows:
Initialization: Object detection is used to identify and locate objects in the initial frame of the video and assign a unique tracking ID to each object.
Frame-wise Detection: Object detection algorithms are run in each frame to identify and locate objects and to obtain an associated bounding box, category, and confidence level.
Data Association: The correspondence between the detection results of the current frame and the tracked objects in the previous frames is determined using the nearest neighbor, Kalman filter, or Hungarian algorithm method.
Tracking Maintenance: The state of the objects is updated to handle any object occlusion, loss, or reappearance and to merge or split closely adjacent objects.
Tracking Evaluation: Tracking performance is evaluated using metrics such as accuracy, robustness, and computational efficiency, with the parameters then adjusted based on the results.
Feedback and Adjustment: The detector parameters are adjusted based on tracking results to improve the model’s accuracy and robustness.
Detection-based tracking methods have significant advantages in the field of video analysis. The simplicity and flexibility of this approach make it easy to adapt to various scenarios, including the sudden appearance and disappearance of objects. Because it can independently utilize powerful object detection algorithms, it can handle complex backgrounds and changing environmental conditions. However, this method also has some disadvantages, mainly due to the need for full object detection in every frame of the video, which may lead to high computational costs and affect real-time performance. In addition, the stability and continuity of tracking may be affected if the accuracy of the detection algorithm is insufficient or if errors occur in frame-to-frame association. Therefore, it is necessary to improve the stability and continuity of the detector. In this paper, we propose a triangular IoU evaluation metric for DETR-based detection of triangular ship wakes. This not only improves the model's detection accuracy for triangular objects but also enhances the model's robustness in complex scenarios. At the same time, triangular detection boxes provide a more intuitive physical meaning than rectangular detection boxes during the subsequent tracking processes.
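As a concrete illustration of this loop, the sketch below strings the listed steps together for a generic detector; the function and class names (detector, associate, Track) are illustrative placeholders rather than part of the proposed method.

```python
# Minimal sketch of a detection-based tracking loop (illustrative interfaces only).
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple          # last associated bounding box, e.g. (x, y, w, h)
    misses: int = 0     # consecutive frames without a matched detection

def detection_based_tracking(frames, detector, associate, max_misses=30):
    """Run per-frame detection and frame-to-frame association.

    `detector(frame)` is assumed to return a list of bounding boxes;
    `associate(tracks, detections)` is assumed to return
    (matches, unmatched_track_indices, unmatched_detection_indices).
    """
    tracks, next_id, results = [], 0, []
    for frame in frames:
        detections = detector(frame)                              # frame-wise detection
        matches, um_trk, um_det = associate(tracks, detections)   # data association
        for t_idx, d_idx in matches:                              # tracking maintenance
            tracks[t_idx].box = detections[d_idx]
            tracks[t_idx].misses = 0
        for t_idx in um_trk:                                      # unmatched tracks age out
            tracks[t_idx].misses += 1
        tracks = [t for t in tracks if t.misses <= max_misses]
        for d_idx in um_det:                                      # initialize new tracks
            tracks.append(Track(track_id=next_id, box=detections[d_idx]))
            next_id += 1
        results.append([(t.track_id, t.box) for t in tracks])
    return results
```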
3.2. DETR with Triangular IoU
Detection with Transformer (DETR) is a cutting-edge object detection framework constructed upon the transformer architecture. DETR treats the task of object detection as a direct set prediction problem. Unlike traditional anchor-based methods, which rely on region proposal networks (RPNs) or anchors, DETR predicts a set of candidate objects’ categories and locations without such dependencies. This approach theoretically allows DETR to sidestep the complexities and limitations found in conventional detection models.
3.2.1. Triangular IoU
The Triangular IoU is a variant of the IoU metric that is specifically designed to measure the similarity between predicted and ground-truth triangular bounding boxes. The IoU is a widely used metric in object detection for assessing the degree of overlap between predicted and actual bounding boxes. For triangular targets, including certain types of ship wakes, the traditional rectangular IoU may not adequately describe the shape and position of the target. Thus, the triangular IoU serves as an assessment indicator. This loss function measures the overlap between the predicted triangular bounding box and the actual triangular bounding box. By minimizing this loss, the model learns to accurately predict the position and size of triangular targets.
In summary, combining DETR with the triangular IoU offers a flexible and robust framework for detecting targets with complex shapes, such as triangular ship wakes. The advantage of this method is its ability to directly handle shape variations of targets within the model without the need for predefined anchors or region proposals.
Calculating the triangular IoU loss involves the following steps.
Given the following predicted and ground truth parameters of the bounding box and landmark:
Predicted bounding box width and height
Predicted landmark coordinates and angles
Ground truth bounding box width and height
Ground truth landmark coordinates and angles
First, calculate the side lengths of the predicted and ground truth triangles from the corresponding bounding box widths and heights.
Next, use the landmark angles and side lengths to compute the coordinates of the triangle vertices for both the predicted and ground truth triangles.
Then, form the oriented bounding boxes for both triangles from these vertices.
Finally, calculate the triangular IoU loss from the overlap between the predicted and ground truth oriented bounding boxes.
Minimizing the loss function based on the triangular IoU helps the DETR model to refine its predictions, ensuring that the detected triangular targets closely match their actual shapes and positions.
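To make the metric concrete, the sketch below computes the overlap ratio between two triangles given directly by their vertex coordinates, using the shapely library; it is only a schematic illustration of the triangular IoU described above, not the paper's implementation, and the construction of vertices from bounding box sizes and landmark angles is omitted.

```python
# Schematic triangular IoU between two triangles defined by their vertices.
from shapely.geometry import Polygon

def triangular_iou(tri_pred, tri_gt):
    """tri_pred, tri_gt: sequences of three (x, y) vertex coordinates."""
    p, g = Polygon(tri_pred), Polygon(tri_gt)
    if not p.is_valid or not g.is_valid:
        return 0.0
    inter = p.intersection(g).area
    union = p.union(g).area
    return inter / union if union > 0 else 0.0

def triangular_iou_loss(tri_pred, tri_gt):
    # The loss decreases as the predicted triangle overlaps the ground truth more.
    return 1.0 - triangular_iou(tri_pred, tri_gt)

# Example: two overlapping triangles.
print(triangular_iou([(0, 0), (4, 0), (0, 4)], [(1, 0), (5, 0), (1, 4)]))
```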
3.2.2. Triangular Attention
Attention mechanisms are computational approaches that mimic the human visual attention system, enabling models to focus on the most relevant parts when processing information. By assigning varying weights to different parts of the input data, these mechanisms allow models to selectively concentrate on important features amid a wealth of information. Attention mechanisms have been widely applied in natural language processing and computer vision to enhance model performance, especially when dealing with long sequences or complex imagery.
Unlike traditional neural networks, in which all input features are typically considered to have equal importance, attention mechanisms enable dynamic adjustment of focus based on context. Specifically, they calculate the similarity between queries, keys, and values to generate a weight distribution that indicates which input features are more significant for the current task. This process can be achieved through dot-product or other similarity metrics, followed by a softmax function to normalize the weights.
In practical applications, attention mechanisms have been shown to improve model performance, particularly in handling long-range dependencies; they enhance not only a model’s expressive power but also its ability to capture long-distance relationships. In this way, attention mechanisms allow models to understand and generate information more effectively, propelling the success of transformers and similar architectures while leading to significant advancements across multiple tasks.
This paper integrates the previously proposed triangular IoU with a new triangular attention mechanism, introduced below. As shown in Figure 1, the triangular attention mechanism guides the model's focus to key areas of the image by defining triangular points on the feature map. The attention mechanism initially generates a set of triangular points using a query landmark guide, identifying local regions that require the model's special attention; then, it calculates sampling offsets for each point to extract local features related to these points from the feature map. Under the multihead attention framework, each head independently weights the local features to create attention weights. By subsequently aggregating these weighted features, the model's ability to capture key information is enhanced. Ultimately, after integrating the outputs of different heads through a linear layer, the model produces the final response, achieving more accurate local feature recognition and analysis in visual tasks. This mechanism is particularly suitable for tasks that require precise localization, such as object detection and tracking, where it significantly improves model performance and accuracy.
The core concept of the proposed triangular attention mechanism is to integrate a deformable structure into the attention computation process, allowing more precise modeling of the feature regions. Given an input feature map and a 2D reference point, the triangular attention is computed over J attention heads. For each head, a rotation matrix orients the triangular sampling pattern around the reference point, learnable projection weights map the sampled features into the output space, and the attention weights over the sampling points are normalized so that they sum to one. The sampling point offsets are likewise learnable, and the query is formed from the concatenation (or sum) of the element content embeddings and position embeddings.
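For illustration only, the following PyTorch sketch shows one way such a triangular, deformable-style attention step could be organized, with J heads and three sampling points per head arranged as a rotated triangle around the reference point; the module structure, tensor shapes, and default values here are assumptions and do not reproduce the paper's exact formulation.

```python
# Simplified sketch of a triangular, deformable-style attention module (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriangularAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_points=3):
        super().__init__()
        self.num_heads, self.num_points = num_heads, num_points
        self.offsets = nn.Linear(dim, num_heads * num_points * 2)  # sampling offsets
        self.angles = nn.Linear(dim, num_heads)                    # rotation angle per head
        self.attn = nn.Linear(dim, num_heads * num_points)         # attention weights
        self.proj = nn.Linear(dim, dim)                            # output projection

    def forward(self, query, ref_points, feat_map):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1]; feat_map: (B, C, H, W)
        B, Q, C = query.shape
        J, K = self.num_heads, self.num_points
        offsets = self.offsets(query).view(B, Q, J, K, 2)
        theta = self.angles(query).view(B, Q, J, 1)
        cos_t, sin_t = theta.cos(), theta.sin()
        # Rotate the triangular sampling pattern around the reference point.
        rot_x = offsets[..., 0] * cos_t - offsets[..., 1] * sin_t
        rot_y = offsets[..., 0] * sin_t + offsets[..., 1] * cos_t
        sample_pts = ref_points[:, :, None, None, :] + torch.stack((rot_x, rot_y), -1)
        # Bilinearly sample features at the rotated triangle vertices.
        grid = sample_pts.view(B, Q, J * K, 2)
        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, Q, J*K)
        sampled = sampled.permute(0, 2, 3, 1).reshape(B, Q, J, K, C)
        weights = self.attn(query).view(B, Q, J, K).softmax(-1)       # normalize per head
        out = (weights.unsqueeze(-1) * sampled).sum(dim=(2, 3)) / J   # aggregate heads
        return self.proj(out)
```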
Introduction of the triangular attention mechanism provides deep learning models with a novel tool for handling tasks involving sequential data, potentially advancing the field of sequential data processing. By understanding and applying this mechanism, it is possible to uncover additional model optimization strategies and enhance project performance.
3.3. The TriangularSORT Multi-Object Tracking Algorithm
In the realm of ship tracking, learning about a vessel’s motion characteristics and potential intentions from historical movement data requires continuous surveillance of that particular vessel. Therefore, in this paper we construct a multi-object ship tracking model from a space-based perspective. The challenge of ship tracking lies in associating the same ship that appears in the field of view at any given moment with a unique identifier (ID). There are two main difficulties: matching moving targets, and avoiding ID switching phenomena when targets are occluded and reappear over time. The SORT algorithm and its variants effectively address both of these issues.
3.3.1. Target Tracking Design
Multi-object tracking involves assigning unique identifiers to multiple targets of interest, each labeled with distinct IDs such as 0, 1, 2, etc., and continuously tracking these identified targets throughout a video or image sequence. This process allows for the acquisition of temporal information related to the targets.
The typical tracking process generally encounters the following three scenarios (a minimal track lifecycle sketch is provided after this list):
Target Initialization: At the beginning of a video sequence or when a target first enters the field of view, the tracking algorithm must detect the target and initiate a new tracking trajectory. This typically involves detecting the presence of a target in the first frame or several consecutive frames and assigning it a unique identifier (ID).
Target Matching: In each frame, the algorithm must match newly detected targets with existing tracking trajectories. If an existing trajectory finds a matching target in the current frame, the trajectory is considered to have successfully continued to track the target. The matching process may involve calculating the distance, similarity, or other relevant correlation measures between the detection box and the predicted trajectory.
Target Termination: When a target leaves the field of view, is occluded, or cannot be detected for other reasons, the tracking algorithm must recognize this situation and appropriately terminate or pause the corresponding tracking trajectory. This may involve setting a time threshold; if a trajectory is not detected for several consecutive frames, then the target is considered to have disappeared, and its corresponding trajectory can be terminated or marked as ‘lost’.
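The three scenarios above can be captured by a small track state machine, sketched below; the state names and the thresholds n_init and max_age are illustrative assumptions rather than the settings used in this paper.

```python
# Illustrative track lifecycle: initialization, matching, and termination.
from enum import Enum, auto

class TrackState(Enum):
    TENTATIVE = auto()   # just initialized, not yet confirmed
    CONFIRMED = auto()   # matched in enough consecutive frames
    LOST = auto()        # missed recently, may still be recovered
    DELETED = auto()     # terminated

class TrackLifecycle:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 0        # consecutive successful matches
        self.misses = 0      # consecutive missed frames
        self.n_init = n_init
        self.max_age = max_age

    def on_match(self):
        self.hits += 1
        self.misses = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED
        elif self.state is TrackState.LOST:
            self.state = TrackState.CONFIRMED

    def on_miss(self):
        self.misses += 1
        self.hits = 0
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.DELETED      # never confirmed: drop immediately
        elif self.misses > self.max_age:
            self.state = TrackState.DELETED      # lost for too long: terminate
        else:
            self.state = TrackState.LOST         # temporarily lost, keep predicting
```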
The multi-object tracking model based on DeepSORT effectively tracks multiple targets within a video. This streamlined and efficient process is capable of handling multi-object tracking issues within videos, and can operate effectively even when targets undergo occlusions or rapid movements.
The StrongSORT multi-object tracking algorithm is an innovative approach that enhances the traditional SORT (Simple Online and Realtime Tracking) methodology. SORT is known for its simplicity and effectiveness in real-time tracking based on detection results from object detection algorithms; however, the initial version did not account for the appearance features of targets, leading to tracking failures or ID switches, particularly for occluded targets. DeepSORT, an improved iteration of SORT, mitigates these issues by incorporating appearance features and a cascading matching mechanism, resulting in significantly reduced ID switches and enhanced tracking performance.
StrongSORT further elevates the performance of DeepSORT through the integration of advanced detectors, optimized feature extraction techniques, refined data association strategies, and incorporation of the AFLink and GSI algorithms. These advancements notably bolster the accuracy and robustness of DeepSORT in handling multi-object tracking tasks, providing more reliable and continuous tracking even in complex scenarios such as occlusion and swift movements.
Kalman filtering is widely applied in dynamic systems. It utilizes the system’s linear state equations to observe system inputs and outputs, thereby achieving optimal state estimation. In practice, inaccuracies in the observed data due to interference make this estimation process also a filtering one.
To make a prediction, the state equation of the Kalman filter is first constructed based on the motion model acquired by the detection algorithm. Let the state of the target at time $k-1$ be $\mathbf{x}_{k-1} = [p_{k-1}, v_{k-1}]^{\mathrm{T}}$, where $p$ denotes position and $v$ denotes velocity. Within the neighborhood of time $k-1$, the target can be approximated as moving at a constant speed; hence, the state information at time $k$ can be found by
$$p_k = p_{k-1} + v_{k-1}\,\Delta t, \qquad v_k = v_{k-1}.$$
In addition, the possible acceleration $a$ during the target's motion should also be considered. Let $\Delta t$ represent the difference between times $k$ and $k-1$; then, the state at time $k$ can be expressed as
$$p_k = p_{k-1} + v_{k-1}\,\Delta t + \tfrac{1}{2}\, a\,\Delta t^{2}, \qquad v_k = v_{k-1} + a\,\Delta t.$$
Rewriting the above in matrix multiplication form, we have
$$\mathbf{x}_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \mathbf{x}_{k-1} + \begin{bmatrix} \tfrac{1}{2}\Delta t^{2} \\ \Delta t \end{bmatrix} a .$$
The state transition equation for the Kalman filter during the prediction process can be written as
$$\hat{\mathbf{x}}_k = F\,\hat{\mathbf{x}}_{k-1} + B\,\mathbf{u}_{k-1},$$
where
$$F = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} \tfrac{1}{2}\Delta t^{2} \\ \Delta t \end{bmatrix}.$$
Here, $F$ is known as the state transition matrix, which is used to transfer the estimated value from the previous moment to the new predicted position, $B$ is known as the control matrix, and $\mathbf{u}_{k-1}$ represents the acceleration $a$. In actual situations, the motion of the target is affected by various factors; $\mathbf{u}_{k-1}$ represents the control input vector, and the product $B\,\mathbf{u}_{k-1}$ represents known external influencing factors, while $\hat{\mathbf{x}}_k$ represents the current state estimated using the state estimate from the previous moment, which is an estimated value rather than the true value.
From the analysis above, it is known that the estimates obtained at each moment during prediction have a certain degree of uncertainty. Therefore, it is necessary to represent the uncertainty of the estimated state and the correlation between the state quantities. The covariance matrix $P_k$ is used to represent this correlation, with the following calculation formula:
$$P_k = F\,P_{k-1}\,F^{\mathrm{T}}.$$
The mathematical model is generally described by the covariance matrix of the noise. Considering the prediction uncertainty and the impact of the process noise $Q$ comprehensively, the second prediction formula of the Kalman filter can be obtained as follows:
$$P_k = F\,P_{k-1}\,F^{\mathrm{T}} + Q.$$
The system observation equation is then obtained through the transformation matrix $H$:
$$\mathbf{z}_k = H\,\mathbf{x}_k + \mathbf{v}_k,$$
where $\mathbf{z}_k$ represents the observable measurement of the system and $\mathbf{v}_k$ represents the noise in the observation process.
In actual measurements, the predicted noise and the actual measured noise are assumed to both follow a Gaussian distribution and to be independent of each other; thus, their product also follows a Gaussian distribution. The gain matrix $K$ is calculated as follows:
$$K = P_k H^{\mathrm{T}} \left( H P_k H^{\mathrm{T}} + R \right)^{-1},$$
where $R$ is the covariance matrix of the observation noise. Finally, using the above formula to correct the prediction, the optimal estimate of the target can be obtained as
$$\hat{\mathbf{x}}'_k = \hat{\mathbf{x}}_k + K\left( \mathbf{z}_k - H\,\hat{\mathbf{x}}_k \right).$$
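The prediction and correction steps above can be summarized in a compact numerical sketch. The numpy code below uses the constant-velocity model with an acceleration input; the noise covariances Q and R are illustrative values, and the covariance correction at the end is the standard Kalman form.

```python
# Constant-velocity Kalman filter: predict and correct steps (one spatial dimension).
import numpy as np

def kalman_predict(x, P, F, B, u, Q):
    x_pred = F @ x + B @ u                  # state transition with acceleration input
    P_pred = F @ P @ F.T + Q                # propagate uncertainty plus process noise
    return x_pred, P_pred

def kalman_correct(x_pred, P_pred, z, H, R):
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)   # optimal estimate
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred  # standard covariance update
    return x_new, P_new

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])       # state transition matrix
B = np.array([[0.5 * dt**2], [dt]])         # control matrix (acceleration input)
H = np.array([[1.0, 0.0]])                  # observe position only
Q = 0.01 * np.eye(2)                        # process noise covariance (illustrative)
R = np.array([[0.1]])                       # observation noise covariance (illustrative)

x = np.array([[0.0], [1.0]])                # initial position and velocity
P = np.eye(2)
x, P = kalman_predict(x, P, F, B, u=np.array([[0.0]]), Q=Q)
x, P = kalman_correct(x, P, z=np.array([[1.05]]), H=H, R=R)
```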
3.3.2. Data Association Design
Data association refers to the process of matching targets detected in each frame with those detected in the previous and even earlier frames, ensuring that the associated targets correspond to the same vessel and are assigned the same ID. Matching is primarily based on the similarity between targets, including motion similarity and representation (appearance) feature similarity, which are measured by different cost matrices. To ensure that the inter-frame association is as complete as possible, two rounds of matching are conducted: the first consists of cascaded matching of representation and motion features, while the second is based solely on the IoU metric.
- (a)
Cascaded Matching
The representation features are extracted from the detected 2D bounding boxes of the targets, and the cost matrix is calculated from these features. The representation matching involves two steps: computing the cosine distance cost matrix, and computing the motion feature similarity. The cosine similarity represents the cosine of the angle between two feature vectors, while the cosine distance is one minus the cosine similarity:
$$d^{(1)}(i,j) = \min \left\{ 1 - \mathbf{r}_i^{\mathrm{T}} \mathbf{r}_k^{(j)} \;\middle|\; \mathbf{r}_k^{(j)} \in \mathcal{R}_j \right\},$$
where $\mathbf{r}_k^{(j)}$ represents the $k$-th retained representation feature of the $j$-th tracked target and $\mathcal{R}_j$ is the set of representation features retained for the last 40 frames of that track.
The motion feature similarity measures the spatial proximity between the detected position in the current frame and the predicted position from the previous frame, assuming that the target follows a hypothetical motion model. The Mahalanobis distance is used for this purpose, as it corrects for scale inconsistencies across different dimensions and is invariant to the units of measurement:
$$d^{(2)}(i,j) = \left( \mathbf{d}_i - \mathbf{y}_j \right)^{\mathrm{T}} S_j^{-1} \left( \mathbf{d}_i - \mathbf{y}_j \right),$$
where $\mathbf{d}_i$ represents the detected position of the $i$-th bounding box, $\mathbf{y}_j$ represents the predicted position of the $j$-th tracked target, and $S_j$ is the covariance matrix from the Kalman filter.
The motion feature similarity is used to gate the cosine distance cost matrix, preventing targets from matching with detections that are spatially too far away. The gate threshold is set to the 95th percentile of the $\chi^2$ distribution with four degrees of freedom, $t^{(2)} = 9.4877$:
$$b_{i,j} = \mathbb{1}\left[ d^{(2)}(i,j) \le t^{(2)} \right].$$
By combining the motion similarity, which provides information about the target's short-term predicted position, with the appearance similarity, which provides information about the target's long-term motion features, we obtain the final cascaded matching similarity matrix through linear weighting:
$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1-\lambda)\, d^{(2)}(i,j),$$
where $\lambda$ is the weight balancing the appearance and motion features, set to 0.98 in this paper.
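The following sketch assembles the cascaded matching cost from the two terms above; the array shapes and the infeasible-cost constant are illustrative assumptions, while the gate value 9.4877 and the weight of 0.98 follow the settings quoted in the text.

```python
# Schematic cascaded matching cost: appearance (cosine) and motion (Mahalanobis) terms.
import numpy as np

CHI2_95_4DOF = 9.4877   # 95th percentile of the chi-squared distribution, 4 DoF

def cosine_distance_matrix(det_feats, track_galleries):
    """det_feats: (N, d) L2-normalized features; track_galleries: list of (L_j, d) arrays."""
    cost = np.zeros((len(det_feats), len(track_galleries)))
    for j, gallery in enumerate(track_galleries):
        sims = det_feats @ gallery.T                 # cosine similarity to each stored feature
        cost[:, j] = 1.0 - sims.max(axis=1)          # distance to the closest gallery feature
    return cost

def mahalanobis_matrix(det_positions, track_means, track_covs):
    """det_positions: (N, 4); track_means: (M, 4); track_covs: list of (4, 4) arrays."""
    cost = np.zeros((len(det_positions), len(track_means)))
    for j, (mean, cov) in enumerate(zip(track_means, track_covs)):
        diff = det_positions - mean                  # (N, 4)
        inv_cov = np.linalg.inv(cov)
        cost[:, j] = np.einsum('ni,ij,nj->n', diff, inv_cov, diff)
    return cost

def cascaded_cost(app_cost, motion_cost, lam=0.98):
    cost = lam * app_cost + (1.0 - lam) * motion_cost
    cost[motion_cost > CHI2_95_4DOF] = 1e6           # gate spatially implausible pairs
    return cost
```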
- (b)
Triangular IoU Matching
IoU matching is a method employed within the SORT (Simple Online and Realtime Tracking) algorithm to assess the similarity between two bounding boxes. This approach takes into account not only the overlapping area between the bounding boxes but also their shape and size.
In this work, we deviate from traditional SORT’s utilization of GIoU and instead adopt the triangular IoU, a novel evaluation metric proposed in this paper. The triangular IoU provides a more precise description of the target’s shape and position, enabling more accurate localization of ship coordinates. This enhancement leads to improved accuracy in target matching, allowing for the more effective association between targets detected in the current frame and those tracked in the previous frame. This improvement is particularly beneficial for SORT in maintaining accurate tracking in complex scenarios, such as when targets are occluded, moving rapidly, or undergoing shape changes.
The use of triangular IoU in SORT offers several advantages; notably, it enhances the algorithm’s robustness in handling various challenging conditions that traditional IoU metrics might fail to address adequately. By incorporating the geometric properties unique to triangular shapes, the triangular IoU better captures the nuances of targets such as ship wakes, leading to more reliable tracking results.
- (c)
Hungarian Algorithm
The Hungarian algorithm is a combinatorial optimization technique utilized to identify an optimal matching within a bipartite graph, aiming to maximize or minimize the sum of the weights of all matched edges. In the context of the SORT (Simple Online and Realtime Tracking) algorithm, this method addresses the challenge of how to optimally assign the best matches between a set of targets detected in the current frame and those tracked in the previous frame by minimizing the total distance of matches. By minimizing the cost of assignments, the Hungarian algorithm effectively enhances the stability and precision of tracking, reducing the loss or errors in tracking that can result from incorrect matches.
In summary, the Hungarian algorithm offers SORT a highly efficient and reliable approach to managing the problem of target assignment, which is especially crucial in scenarios involving multi-object tracking. This algorithm is pivotal in maintaining the continuity and accuracy of target tracking.
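In practice, this assignment step can be performed with a standard linear sum assignment solver; the sketch below applies scipy.optimize.linear_sum_assignment to a cost matrix such as the one produced by the cascaded matching stage, with an illustrative rejection threshold for implausible matches.

```python
# Optimal track-detection assignment over a cost matrix via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tracks_to_detections(cost, max_cost=0.7):
    """cost: (num_detections, num_tracks) matrix; returns matches and leftovers."""
    det_idx, trk_idx = linear_sum_assignment(cost)   # minimizes total assignment cost
    matches = []
    unmatched_det = set(range(cost.shape[0]))
    unmatched_trk = set(range(cost.shape[1]))
    for d, t in zip(det_idx, trk_idx):
        if cost[d, t] <= max_cost:                   # reject implausible assignments
            matches.append((d, t))
            unmatched_det.discard(d)
            unmatched_trk.discard(t)
    return matches, sorted(unmatched_det), sorted(unmatched_trk)

# Example with a small cost matrix (rows: detections, columns: tracks).
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.95]])
print(assign_tracks_to_detections(cost))
```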
3.4. Overall Framework of the TriangularSORT Algorithm
The TriangularSORT algorithm addresses several distinct situations when matching detection boxes with tracks, which dictate how the algorithm handles the relationship between detected targets and existing tracks. As shown in Figure 2, the following are detailed explanations of these scenarios, applicable to detection-based multi-object tracking methods:
Matched Tracks: In this scenario, the detection boxes obtained from the detection algorithm successfully match the predicted boxes in a determined state by the Kalman filter. This implies that the algorithm can continuously track the target and the detection box is associated with an existing track. This is the most common situation in the tracking process, indicating that the tracking system can effectively maintain tracking of the target.
Handling of Unmatched Tracks and Detections: For tracks that are initialized and in an unconfirmed state (Unconfirmed Tracks), tracks that fail to match successfully through cascaded matching (Unmatched Tracks), and unmatched detection boxes (Unmatched Detections), the TriangularSORT algorithm attempts to match using an appearance model such as deep learning-based feature extraction. If a match is successful, then the track is updated using the Kalman filter. If the matches fail consecutively, these detection boxes may be initialized as new tracks or deleted according to certain strategies to avoid redundant or erroneous tracking.
Mismatched Tracks: If a track fails to match through IoU and is in a determined state, but has fewer mismatches than a set threshold, then the algorithm continues to predict the track and attempts to match it again in the next frame. If the number of mismatches exceeds the threshold, the track may be deleted to prevent tracking loss or erroneous tracks. This mechanism helps the algorithm to maintain effective tracking of other targets even when a target is occluded or temporarily leaves the field of view.
These scenarios cover various situations that the TriangularSORT algorithm may encounter when matching detection boxes with tracks, ensuring the robustness and accuracy of the algorithm. By dynamically handling matched, unmatched, and mismatched situations, TriangularSORT can effectively track multiple targets in a video even in cases of occlusion or rapid movement between targets.
The overall process of the TriangularSORT algorithm is as follows:
Step 1: Initialization and Prediction: In the initial phase of the TriangularSORT algorithm, object detection algorithms are first used to process the first frame of the video stream to detect targets in the scene and create corresponding tracks. Then, a Kalman filter is initialized for each detected target to predict the target’s motion status. In subsequent frames, the Kalman filter predicts the position of the target in the current frame based on the target’s historical motion information, a step that does not rely on the detection results of the current frame.
Step 2: Object Detection and Matching: In each frame, the TriangularSORT algorithm uses object detection algorithms to detect targets, resulting in a series of detection boxes. Then, appearance features are extracted from these detection boxes using a feature embedding model. Next, these detection results are matched with the predicted positions by the Kalman filter. The matching process typically involves computing matching costs such as the Mahalanobis distance, cosine similarity, etc., to determine the best track–detection match pair.
Step 3: Track Update and Interpolation Processing: For successfully matched tracks, the TriangularSORT algorithm uses detection box information to update the Kalman filter and track status to reflect the target’s new position and possible new velocity or direction. For missing detection cases, i.e., targets not detected in the current frame, the algorithm uses the Gaussian-Smoothed Interpolation (GSI) algorithm for track interpolation to fill gaps in the track, thus maintaining continuous tracking of the target.
Step 4: Global Association and Optimization: For tracks and detection boxes that fail to match successfully, the TriangularSORT algorithm uses the AFLink algorithm for global association to solve the problem of missing associations. The AFLink algorithm uses spatiotemporal information to predict whether two track segments belong to the same target, thereby improving the accuracy of global association. In addition, optimization techniques such as the Hungarian algorithm are used to find the optimal track–detection match pair, ensuring the optimization of tracking results.
Step 5: Result Output and Iteration: After completing all matching and updating steps, the TriangularSORT algorithm outputs the tracking results of the current frame, including the track and identity information of each target. Then, the algorithm iteratively processes the next frame, repeating the above steps until the end of the video stream. This iterative processing ensures continuous tracking of all targets in the video even in cases of occlusion or rapid movement between targets.
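The five steps can be summarized in the pseudocode-style sketch below of a single frame iteration; all component names and interfaces (detector, embedder, the two matching routines, AFLink relinking, GSI interpolation, and the track objects) are placeholders for the modules described above rather than their actual APIs.

```python
# Pseudocode-style sketch of one TriangularSORT frame iteration.
# Every component is passed in as a callable; names and interfaces are illustrative.
def triangularsort_step(frame, tracks, detector, embedder,
                        cascaded_match, triangular_iou_match,
                        aflink_relink, gsi_interpolate, new_track):
    # Step 1: predict each existing track forward with its Kalman filter.
    for track in tracks:
        track.predict()

    # Step 2: detect triangular wakes, extract appearance embeddings, and match.
    detections = detector(frame)
    features = [embedder(frame, det) for det in detections]
    matches, um_trk, um_det = cascaded_match(tracks, detections, features)
    more, um_trk, um_det = triangular_iou_match(tracks, detections, um_trk, um_det)

    # Step 3: update matched tracks; fill gaps in missed tracks by interpolation (GSI).
    for t_idx, d_idx in matches + more:
        tracks[t_idx].update(detections[d_idx], features[d_idx])
    for t_idx in um_trk:
        gsi_interpolate(tracks[t_idx])

    # Step 4: global association of broken track segments (AFLink).
    tracks = aflink_relink(tracks)

    # Step 5: initialize new tracks from unmatched detections and output results.
    tracks = tracks + [new_track(detections[d_idx], features[d_idx]) for d_idx in um_det]
    return tracks, [(t.track_id, t.vertices) for t in tracks if t.confirmed]
```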