An Anchor-Free Siamese Network with Multi-Template Update for Object Tracking
Round 1
Reviewer 1 Report
Good paper, the idea is not really original. Some minor revisions are needed:
- Figures 2 and 4 are not really clear; try to adjust their size and split the legend onto the two sides of the image, or orient it horizontally.
- The graphs of Figure 7 should not be joined in the same figure; the combined plot is not clear.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper proposes an anchor-free Siamese network for object tracking. It replaces the RPN with an anchor-free prediction network, which predicts object category and location at the pixel level. In addition, a fusion mechanism is used to combine high-level and low-level feature maps. The authors claim better IoU and speed. In general, this paper is well presented and the idea is easy to follow. However, my major concerns are:
- Object tracking is a hot topic not only in CV, but also in many other fields. CV researchers mainly work on vision-based tracking, designing good visual detectors to find object locations and then obtaining trajectories. In sequential data such as video and audio, temporal information is as important as spatial information. The authors should cover research in this direction as well, such as trackers based on filtering algorithms [1] or trackers based on sequential correlation [2].
- The experiments need to compare with state-of-the-art algorithms. There are many papers on object tracking, but in this paper most algorithms used for comparison predate 2017. Even from that earlier period, the tracking algorithm based on Mask R-CNN [3] should be included. I suggest comparing with the object tracking algorithm from CVPR 2020 [4].
- Line 162, page 3. Why use the last three residual blocks? Is this choice empirical?
- Line 258, page 7. The experiment is run on an Nvidia "Tian" 1080Ti; is this a typo?
- The experiment doesn't show the computation efficiency of the proposed approach.
- The ablation study should not focus on how many feature maps are used, but on removing a major component of the proposed approach, such as removing the multi-template update mechanism, or keeping the RPN instead of the anchor-free prediction, to observe the performance degradation.
[1] Distributed Mean-Field-Type Filter for Vehicle Tracking. American Control Conference (ACC), Seattle, WA, USA, May 2017.
[2] Correlative Mean-Field Filter for Sequential and Spatial Data Processing. IEEE International Conference on Computer as a Tool (EUROCON), Ohrid, Macedonia, July 2017.
[3] M. Runz, M. Buffier, and L. Agapito. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 2018, pp. 10-20. doi: 10.1109/ISMAR.2018.00024.
[4] Fast Template Matching and Update for Video Object Tracking and Segmentation. CVPR 2020.
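The per-pixel prediction the paper describes (replacing RPN anchors with a direct category score and box offsets at each feature-map location) can be sketched roughly as follows. This is an illustrative numpy decoding step under assumed conventions (a classification map, a 4-channel (l, t, r, b) regression map, and a feature stride of 8), not the authors' implementation:

```python
import numpy as np

def decode_anchor_free(cls_map, reg_map, stride=8):
    """Pick the highest-scoring location in the classification map and
    decode its (l, t, r, b) offsets into an (x1, y1, x2, y2) box in
    image coordinates. cls_map: (H, W); reg_map: (4, H, W)."""
    iy, ix = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    # Map the feature-map cell back to its centre on the image plane.
    cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
    l, t, r, b = reg_map[:, iy, ix]
    return cls_map[iy, ix], (cx - l, cy - t, cx + r, cy + b)

# Toy example: one confident location at cell (1, 2) with known offsets.
cls = np.zeros((4, 4)); cls[1, 2] = 0.9
reg = np.zeros((4, 4, 4)); reg[:, 1, 2] = [10, 5, 10, 5]
score, box = decode_anchor_free(cls, reg)  # box = (10.0, 7.0, 30.0, 17.0)
```

Because no anchor shapes or aspect-ratio hyper-parameters appear anywhere in this decoding, a comparison of hyper-parameter counts and runtime against an RPN baseline (as requested above) should be straightforward to report.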
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
This study proposes an anchor-free Siamese network (AFSN) with multi-template update for object tracking. The proposed AFSN is built on the ResNet-50 [8] network and combines multi-level features with rich spatial and semantic information to boost tracking performance. It combines multiple prediction results to make the final result more stable and smoother. The authors claim that the proposed anchor-free mechanism averts a large number of complicated calculations and complex parameters related to anchor boxes. Moreover, this study also investigates a multi-template update mechanism to determine whether the template should be updated. Experimental results validate that the proposed method achieves state-of-the-art performance on the GOT-10K, LaSOT, UAV123, and OTB100 tracking datasets.
Detailed comments on this paper are as follows:
- This study integrates several strategies for improving tracking accuracy and effectiveness. An ablation study without multi-feature fusion is provided, but there are no comparative results for the other proposed strategies, such as without the multi-template update mechanism.
- The authors claim that the anchor-free prediction network decreases the number of hyper-parameters, makes the tracker simpler, and speeds up the tracking process. However, no comparison of computational cost is provided in this paper.
- The sizes of the output feature maps of the second, third, and fourth layers should differ, and it is unclear how feature maps of different sizes can be directly concatenated.
- The explanation of the architecture of the proposed AFSN is insufficient. It is difficult to understand the overall process of the proposed AFSN from the current description in the paper.
- In the subsection 'Anchor-free Prediction Network', the authors mention that the final prediction output is obtained by a weighted sum of the results of three anchor-free prediction subnetworks, and that the combination weights can be optimized end-to-end offline together with the whole network. I cannot see how these are optimized offline in the tracking task. What are the optimized values of these hyper-parameters in your experiments?
- The captions of subsections 4.1 and 4.2 are the same.
- There are many typos and grammar errors in this paper.
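The two mechanisms questioned above, concatenating feature maps of different sizes and the weighted combination of subnetwork outputs, can both be sketched in a few lines. This is a hypothetical numpy illustration (nearest-neighbour upsampling to the largest level, then channel-wise concatenation, and a softmax-normalised weighted sum whose raw weights would be the end-to-end learned parameters), not the paper's actual implementation:

```python
import numpy as np

def upsample_nn(fmap, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def concat_multi_level(maps):
    """Resize every level to the largest spatial size, then
    concatenate along the channel axis."""
    target = max(m.shape[1] for m in maps)
    resized = [upsample_nn(m, target // m.shape[1]) for m in maps]
    return np.concatenate(resized, axis=0)

def weighted_fusion(outputs, weights):
    """Softmax-normalised weighted sum of per-level prediction maps;
    the raw weights are trainable scalars."""
    w = np.exp(weights) / np.exp(weights).sum()
    return sum(wi * o for wi, o in zip(w, outputs))

# Three pyramid levels: 4 channels each, at 32/16/8 spatial resolution.
maps = [np.ones((4, 32, 32)), np.ones((4, 16, 16)), np.ones((4, 8, 8))]
fused = concat_multi_level(maps)  # shape (12, 32, 32)
# Uniform weights average three constant prediction maps.
pred = weighted_fusion([np.full((32, 32), v) for v in (1.0, 2.0, 3.0)],
                       np.zeros(3))
```

If something along these lines is what the paper does, stating the resizing scheme and reporting the learned weight values would address the concerns above.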
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
Although some modifications have been made, the overall novelty is still low. The paper does not include sufficient background or all relevant references, especially research from the last three years; most references predate 2018. I highly recommend that the authors cover more recent work and narrow the research scope to focus on one specific use case, since general object tracking algorithms develop very fast and I did not see an advantage of this approach over the CVPR 2020 papers.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Basically, I am satisfied with the authors' response.
Author Response
Many thanks! We have supplemented and modified the content of this paper.