In this section, we will evaluate the performance of TrackAISNet. First,
Section 5.1 presents the experimental setup, including the environment, dataset, and evaluation metrics.
Section 5.2 provides relevant statistical information on the four types of vessels in the custom dataset, including dataset partitioning. Next, in
Section 5.3, we assess the performance of TrackAISNet on the custom AIS dataset.
Section 5.4 conducts ablation experiments to demonstrate the performance enhancement effects of each module within TrackAISNet. Finally, in
Section 5.5, we further validate the model’s effectiveness by comparing TrackAISNet against state-of-the-art multimodal algorithms on a public AIS dataset related to three types of fishing activities.
5.1. Experimental Setup
Environment: All experiments were conducted on a machine equipped with an Intel i5-13400F CPU and an NVIDIA RTX 4060 Ti GPU. Python 3.8 and PyTorch 2.2.2 were used to build and train our model.
Datasets: (1) A custom multivariate variable-length AIS time series dataset. (2) A publicly available AIS dataset comprising three types of fishing activities.
Experimental Details: The classification model was trained with a weighted cross-entropy loss function to mitigate class imbalance, and the Adam optimizer was used to accelerate convergence and improve training efficiency. Hyperparameters were tuned with a grid search over hidden layer dimensions (64 and 128), the number of network layers (1 to 3), dropout rates (0.1 and 0.2), and learning rates (0.0001, 0.0005, and 0.001).
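A minimal sketch of this training and tuning configuration is given below; the model builder, class weights, and data batch are illustrative placeholders rather than the authors' implementation:

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical stand-in for the real classifier (not the authors' code).
def build_model(hidden_dim: int, num_layers: int, dropout: float) -> nn.Module:
    layers, in_dim = [], 7                      # assume 7 AIS-derived features per sample
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, 4))         # 4 vessel classes
    return nn.Sequential(*layers)

# Weighted cross-entropy mitigates class imbalance; the weights here are
# illustrative inverse class frequencies, not the values used in the paper.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.3, 1.7, 4.1]))

# Grid search over the hyperparameter ranges reported above.
for hidden, n_layers, dropout, lr in itertools.product(
        [64, 128], [1, 2, 3], [0.1, 0.2], [1e-4, 5e-4, 1e-3]):
    model = build_model(hidden, n_layers, dropout)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    x, y = torch.randn(32, 7), torch.randint(0, 4, (32,))   # dummy batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    # ...train to convergence, evaluate on a validation split, keep the best configuration...
```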
Evaluation Metrics: The classification accuracy, recall, precision, and F1 score were used as evaluation metrics. Additionally, in the comparative experiments with the publicly available dataset and state-of-the-art algorithms, we also included time per sample as an evaluation metric, i.e., the time taken to evaluate a single test sample, measured in seconds. Furthermore, it is essential to understand the confusion matrix before introducing the other metrics, as shown in
Table 2:
TP (True Positive): Instances that are correctly predicted as positive. In other words, the actual value is positive, and the prediction is also positive.
TN (True Negative): Instances that are correctly predicted as negative. This refers to situations where the actual value is negative, and the prediction is also negative.
FP (False Positive): Instances that are incorrectly predicted as positive. This occurs when the actual value is negative, but it is mistakenly predicted as positive.
FN (False Negative): Instances that are incorrectly predicted as negative. This occurs when the actual value is positive, but it is mistakenly predicted as negative.
Based on these definitions, we can derive the definitions of accuracy, precision, recall, and F1 score.
Accuracy represents the proportion of correctly classified samples among the total number of samples, and it is expressed as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision represents the proportion of samples predicted as positive that are actually positive, and it is expressed as follows:
Precision = TP / (TP + FP)
Recall represents the proportion of actual positive samples that are correctly identified as positive, and it is expressed as follows:
Recall = TP / (TP + FN)
The F1 score is the harmonic mean of precision and recall, and it is expressed as follows:
F1 = (2 × Precision × Recall) / (Precision + Recall)
Precision reflects the model’s ability to avoid misclassifying negative samples as positive; a higher precision indicates fewer false positives. Recall reflects the model’s ability to identify positive samples; a higher recall indicates that more of the actual positives are recognized. The F1 score combines both metrics, and a higher F1 score suggests a more robust model.
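In practice, these metrics can be computed directly from the predicted and true labels; the sketch below uses scikit-learn with synthetic labels for a four-class problem and assumes macro averaging, which is not specified in the paper:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic ground-truth labels and predictions for a 4-class problem.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=1000)
y_pred = np.where(rng.random(1000) < 0.8, y_true, rng.integers(0, 4, size=1000))

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```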
The ROC curve is a graphical tool for representing the performance of a classification model. By plotting the True Positive Rate (TPR) on the Y axis against the False Positive Rate (FPR) on the X axis, it illustrates the classifier’s performance across different threshold settings.
True Positive Rate (TPR): Also known as recall, the TPR measures the classifier’s ability to correctly identify positive instances. It can be understood as the detection rate among all actual positive instances, where a higher TPR indicates better performance. Its calculation formula is the same as that of the recall metric:
TPR = TP / (TP + FN)
False Positive Rate (FPR): The FPR represents the proportion of negative instances that the model incorrectly classifies as positive. It can be understood as the rate of false positives among all actual negative instances (also known as the false alarm rate), where a lower FPR indicates better performance. Its calculation formula is as follows:
FPR = FP / (FP + TN)
AUC (Area Under the Curve): The AUC is the area under the ROC curve and serves as a metric to evaluate the classifier’s performance. The higher the AUC value, the better the classifier performs; conversely, a lower AUC indicates poorer performance.
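For a multi-class problem, the per-class and micro-average ROC curves and AUC values (as used in the figures later in this section) can be computed with scikit-learn; the sketch below is generic and uses synthetic labels and scores rather than the authors' evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# Synthetic scores for a 3-class problem (e.g., the three fishing activities).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=500)
scores = rng.random((500, 3))
scores[np.arange(500), y_true] += 1.0          # make the scores weakly informative
scores /= scores.sum(axis=1, keepdims=True)

y_bin = label_binarize(y_true, classes=[0, 1, 2])

# Per-class ROC curves and AUCs (one-vs-rest).
for k in range(3):
    fpr, tpr, _ = roc_curve(y_bin[:, k], scores[:, k])
    print(f"class {k}: AUC = {auc(fpr, tpr):.3f}")

# Micro-average ROC: pool all (label, score) pairs across classes.
fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), scores.ravel())
print("micro-average AUC =", round(auc(fpr_micro, tpr_micro), 3))
```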
5.3. Experiments on a Self-Constructed Dataset
We compared the performance of our proposed TrackAISNet and TCN-GA against various temporal algorithms, including LSTM, BiLSTM, GRU, and BiLSTM-CNN, as well as several lightweight convolutional neural networks such as MobileNetV2 [
25], ShuffleNetV2 [
26], and EfficientNet [
27] under different modalities. For the comparison algorithms, hyperparameter settings were optimized using a grid search method. The performance comparison of different algorithms is shown in
Table 4.
The data from the aforementioned table indicate that the TCN-GA model exhibited the best performance among the time modality algorithms, with an accuracy of 80.35%, an F1 score of 80.26%, a precision of 80.18%, and a recall of 80.35%. This was followed closely by the BiLSTM-CNN model. Notably, although the TCN performed relatively weakly when used alone, the improved TCN-GA significantly enhanced model performance. This improvement may be related to the dataset being composed of variable-length time series data. When the time series data are padded, using the last time step for classification predictions may lead to lower identification accuracy. The incorporation of an attention mechanism helps focus on key features at critical moments, thereby improving recognition accuracy.
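The intuition behind this improvement can be illustrated with a masked attention-pooling readout over padded sequences; this is only a generic sketch of the idea, not the TCN-GA implementation, and the module name and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class MaskedAttentionPool(nn.Module):
    """Attention-weighted pooling that ignores padded time steps."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, dim); lengths: true sequence length of each sample.
        t = torch.arange(h.size(1), device=h.device)
        mask = t.unsqueeze(0) < lengths.unsqueeze(1)          # (batch, time)
        scores = self.score(h).squeeze(-1)                    # (batch, time)
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)                # zero weight on padding
        return (weights.unsqueeze(-1) * h).sum(dim=1)         # (batch, dim)

h = torch.randn(4, 50, 64)                 # padded temporal features (toy values)
lengths = torch.tensor([50, 32, 17, 45])   # true lengths before padding
pooled = MaskedAttentionPool(64)(h, lengths)

# Naive alternative: read only the final time step, which is padding for short tracks.
last_step = h[:, -1, :]
```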
In the image modality comparison, the EfficientNet-B0 network outperformed both MobileNetV2 and ShuffleNetV2 in terms of accuracy, F1 score, and precision, although the differences among the three were relatively small.
In the area of multimodality processing, TrackAISNet demonstrated excellent performance, achieving the highest accuracy of 81.38% compared to the various single-modality algorithms, along with correspondingly high F1 score, precision, and recall values. This underscores the robustness of TrackAISNet in handling noise, uncertainty, and AIS data loss. In practical applications, AIS data are often subject to noise or partial loss due to equipment malfunctions, transmission interference, or environmental factors. TrackAISNet incorporates a multimodal fusion mechanism within its architecture, effectively integrating information from both the temporal and image modalities. This multimodal approach enables the model to leverage complementary information from one modality when the other is affected by noise or anomalies, significantly mitigating the impact of single-modality noise on overall performance. This compensation mechanism greatly enhances the model’s adaptability and robustness in scenarios involving AIS data loss.
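For illustration, the general pattern of feature-level fusion described here can be sketched as follows; the branch encoders, feature dimensions, and class count are placeholders rather than the actual TrackAISNet architecture:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy two-branch fusion: temporal AIS features + trajectory-image features."""
    def __init__(self, ts_dim=64, img_dim=128, num_classes=4):
        super().__init__()
        self.ts_encoder = nn.GRU(7, ts_dim, batch_first=True)      # stand-in temporal branch
        self.img_encoder = nn.Sequential(                           # stand-in image branch
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, img_dim))
        self.head = nn.Linear(ts_dim + img_dim, num_classes)

    def forward(self, ts, img):
        _, h_n = self.ts_encoder(ts)             # h_n: (1, batch, ts_dim)
        fused = torch.cat([h_n[-1], self.img_encoder(img)], dim=1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(2, 120, 7), torch.randn(2, 3, 64, 64))  # shape (2, 4)
```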
Selecting the right algorithm for a specific task is critical. For tasks focused on time series analysis, TCN-GA stands out as the most suitable choice: with its balanced parameter size and low computational complexity, it is well suited for deployment in scenarios with high real-time performance requirements. When the task involves image data processing, EfficientNet is a strong candidate, offering excellent performance despite its relatively higher parameter count and computational demands, provided sufficient computational resources are available. For complex applications that require integrating multisource information to optimize outcomes, the multimodal approach proposed in this study, TrackAISNet, proves to be a robust and comprehensive solution.
As the primary model proposed in this study, TrackAISNet has a relatively large parameter size (16.64 M) and higher computational complexity. However, the rapid advancement of embedded systems and edge computing technologies ensures that modern high-performance hardware can fully support this scale. Moreover, optimization techniques such as quantization, pruning, and knowledge distillation can significantly reduce computational overhead, making the model more efficient for deployment in real-world environments.
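As one example of such optimization, the sketch below applies PyTorch post-training dynamic quantization to a placeholder model; no claim is made here about the actual speedup or accuracy impact for TrackAISNet:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for a trained network with large linear layers.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 4)).eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(model(x).shape, quantized(x).shape)   # same interface, smaller weights
```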
In maritime traffic management systems, real-time performance is critical, as it directly affects operational efficiency and safety. By combining time series and image data, TrackAISNet offers a holistic analytical capability, making it ideal for key decision support scenarios, such as intelligent route planning in high-risk areas and vessel collision risk prediction. Despite its relatively high computational demands, TrackAISNet can be efficiently supported by modern high-performance servers or embedded edge devices commonly used in maritime systems. This ensures the model can deliver accurate, reliable, and timely decision support, enhancing the safety and efficiency of maritime operations.
5.5. Experiments on a Public Dataset
To further validate the effectiveness of the TrackAISNet model presented in this paper, we conducted comparative experiments on a public AIS dataset that includes three types of fishing activities. The dataset comprises trajectory data from vessels in the East China Sea (anonymized) and represents authentic historical maritime tracking information, encompassing multiple dimensions of data. Each trajectory includes details such as vessel ID, latitude, longitude, speed, heading, timestamp, and operational mode (trawl, surround, and gillnet). This dataset can be found at
https://aistudio.baidu.com/datasetdetail/146541 (accessed on 19 January 2025). It contains 14,656 training samples and 3664 testing samples, with a balanced distribution of positive and negative samples. Most of the experimental results for the algorithms compared in the following table are derived from the latest literature [
28].
The results in
Table 6 indicate that TrackAISNet demonstrated significant performance improvements in comparative experiments on public datasets. After only three training epochs, the model achieved an accuracy of 82.76%, with recall and precision reaching 82.76% and 83.01%, respectively. At this stage, its performance was comparable to that of the latest multimodal algorithm, MFGTN, which reported an accuracy of 82.61% and a recall of 83.23%. As training progressed to 10 epochs, TrackAISNet underwent further optimization, reaching an accuracy of 89.33% and an F1 score of 89.32%, showcasing superior performance. Moreover, TrackAISNet delivered these results with an impressive processing speed of only 0.005 s per sample, underscoring its potential in ship trajectory classification tasks. While other models, such as SVP-T and MFGTN, also demonstrated strong results, TrackAISNet’s performance over extended training suggests it may serve as an effective solution for this task.
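For reference, time per sample is typically measured by averaging forward-pass wall-clock time over the test set; a minimal sketch with a placeholder model and dummy inputs might look like this:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 3)).eval()  # placeholder
samples = [torch.randn(1, 7) for _ in range(3664)]   # dummy stand-in for the 3664 test samples

with torch.no_grad():
    start = time.perf_counter()
    for x in samples:
        _ = model(x)
    elapsed = time.perf_counter() - start

print(f"time per sample: {elapsed / len(samples):.6f} s")
```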
Figure 9 and
Figure 10 present the ROC curves of our algorithm across different network models on the public dataset containing three types of fishing activities. The X axis represents the false positive rate, and the Y axis represents the true positive rate. The dotted line indicates the micro-average ROC curve, and the dashed line shows the macro-average ROC curve. The solid blue, green, and red lines represent the ROC curves for gillnetting, purse seining, and trawling, respectively.