Skill Fusion in Hybrid Robotic Framework for Visual Object Goal Navigation
Round 1
Reviewer 1 Report
The paper presents an approach focused on the problem of object navigation, i.e., finding an object in an unknown environment. This is a classical application in mobile robotics, in which the robot must explore the environment while localizing itself in the map (SLAM) and, at the same time, observing its surroundings to find the object of interest.
The introductory section is clear and the authors provide enough references. The description of the presented methods is correct, and a fair number of experiments in simulation are reported. As mentioned below, the number of experiments carried out with the real platform should be increased.
Below, I propose a list of changes that, in my opinion, will enhance the quality of the paper:
Comment 1:
The "Conclusion" section usually begins with a short summary of the paper, some comments about the experiments carried out and, finally, some final thoughts. In my opinion, the authors should add a short summary at the begining of this section.
Comment 2:
Regarding the experiments carried out with a real Husky robot (Section 7.2), it would be interesting to add one or two photos of the environment. In this way, other researchers would get a general idea of the difficulties involved, for example, in finding an object. According to the authors, a single run for each experiment is presented in Table 4. In my opinion, the authors should include several runs (for example, starting from dissimilar initial locations) with the RL, Classic and SkillFusion methods.
Some minor errata and other considerations.
20 and the amount of engineering patches (also known as hacks). I consider that this concept is not relevant here.
63 from simulator to the real world --> from the simulator
76 enhancing --> enhance
139 Explore skill --> The explore skill
141 GoalReacher skill --> The GoalReacher skill
143 PointNav --> The Point Nav
145 Flee skill --> The flee skill
159 Husky has a Velodyne --> the Husky robot has a Velodyne.
190 we use the back projection of the depth map to build the occupancy grid. --> This sentence is unclear. Is the depth information projected onto a 2D grid map?
Author Response
Comment 1:
The "Conclusion" section usually begins with a short summary of the paper, some comments about the experiments carried out and, finally, some final thoughts. In my opinion, the authors should add a short summary at the begining of this section.
Answer: In the "Conclusion" section we added a short summary of our paper (Lines 421-428). We presented our final thoughts in Lines 430-437, and provided comments about the experiments in Lines 424-435.
Comment 2:
Regarding the experiments carried out with a real Husky robot (Section 7.2), it would be interesting to add one or two photos of the environment. In this way, other researchers would get a general idea of the difficulties involved, for example, in finding an object. According to the authors, a single run for each experiment is presented in Table 4. In my opinion, the authors should include several runs (for example, starting from dissimilar initial locations) with the RL, Classic and SkillFusion methods.
Answer: We added Figure 7, which includes a photo of the Husky robot and of our laboratory environment. In Table 4, as suggested, we included additional runs with three types of goal objects and three different initial locations.
Reviewer 2 Report
A well-written paper, with a few grammar and spelling mistakes.
1. It would be good to explain why the authors did not also consider simulating the same robot in the simulator. It would be useful to compare the performance difference between the simulation and the real world using the same robot.
2. line 103: learnable - not learnable
3. In line 358, the authors say they "increase the robustness of the encoder", but they freeze the pre-trained CLIP encoder, so this does not affect the encoder itself; the sentence should be rewritten. Lines 357-360 describe two different experimental set-ups with different intentions, so it would be better to rewrite them as well.
4. In line 361, the authors state: "To confirm that the neural network relies on CLIP embeddings rather than solely on embeddings from depth sensors,". Usually, neural networks rely on RGB data to extract as much information as possible, while depth data carries little information for object detection or recognition; depth mainly aids navigation. The sentence should be rephrased along the lines of "to analyze how much the depth sensors aid performance ... such experiments were carried out". This is supported by the authors' own statement in line 369.
5. In Section 7.2, the authors present their results. It would be good if the authors presented an ablation study with different target objects, for example keeping a chair but of a different color or shape than the intended goal object.
6. The authors present their results using the SPL metric; including other metrics from the literature would provide more confidence in their performance claims.
Author Response
Comment 1:
It would be good to explain why the authors did not also consider simulating the same robot in the simulator. It would be useful to compare the performance difference between the simulation and the real world using the same robot.
Answer: The Habitat environment, by default, models the agent as a circle. Our tests show that slightly increasing the radius of the virtual robot does not affect the results. However, simulating a full physical model of the robot is not supported by the simulator out of the box. The experimental section dedicated to deploying our method on a real robot demonstrates that, despite the non-physical nature of the robot's movements in the simulator during training, our algorithm generalizes well and transfers effectively to the real environment.
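For illustration, the snippet below is a minimal sketch of how the simulated agent's collision radius could be enlarged to roughly match the Husky footprint. It assumes the older yacs-style Habitat-Lab configuration API (habitat.get_config with defrost/freeze); the task config path and the numeric values are assumptions, not the exact settings used in the paper.

    import habitat

    # Hedged sketch: enlarge the simulated agent's collision cylinder so it roughly
    # matches the real robot's footprint. Config keys follow the yacs-style
    # Habitat-Lab API; the config path and numeric values are hypothetical.
    config = habitat.get_config("configs/tasks/objectnav_hm3d.yaml")
    config.defrost()
    config.SIMULATOR.AGENT_0.RADIUS = 0.35   # approximate half-width of the Husky, in meters
    config.SIMULATOR.AGENT_0.HEIGHT = 0.80   # camera height above the floor
    config.freeze()

    env = habitat.Env(config=config)         # environment with the enlarged agent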
Comment 2:
line 103: learnable - not learnable
Answer: We corrected this error.
Comment 3:
In line 358, the authors say they "increase the robustness of the encoder", but they freeze the pre-trained CLIP encoder, so this does not affect the encoder itself; the sentence should be rewritten. Lines 357-360 describe two different experimental set-ups with different intentions, so it would be better to rewrite them as well.
Answer: The main intention behind using and freezing CLIP is to make our entire neural network robust to novel scenes, given the visual richness of home environments and the limited number of scenes in the dataset. We have rewritten this part of the text to clarify the motivation behind each experimental setup (Lines 357-369).
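As a side illustration of what "freezing" means here, the sketch below keeps a pre-trained CLIP visual encoder fixed while it produces image embeddings for a downstream policy. It assumes the OpenAI clip package and a ResNet-50 backbone, which may differ from the exact implementation used in the paper.

    import clip
    import torch

    # Hedged sketch: load a pre-trained CLIP encoder and freeze its weights so that
    # only the downstream navigation policy is trained. The backbone choice ("RN50")
    # is an assumption, not necessarily the one used in the paper.
    clip_model, _ = clip.load("RN50")
    for p in clip_model.parameters():
        p.requires_grad = False      # encoder weights stay fixed during training
    clip_model.eval()

    def encode_rgb(rgb_batch: torch.Tensor) -> torch.Tensor:
        """Return CLIP image embeddings for a batch of preprocessed RGB frames."""
        with torch.no_grad():        # no gradients flow back into the frozen encoder
            return clip_model.encode_image(rgb_batch)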
Comment 4:
In line 361, the authors state: "To confirm that the neural network relies on CLIP embeddings rather than solely on embeddings from depth sensors,". Usually, neural networks rely on RGB data to extract as much information as possible, while depth data carries little information for object detection or recognition; depth mainly aids navigation. The sentence should be rephrased along the lines of "to analyze how much the depth sensors aid performance ... such experiments were carried out". This is supported by the authors' own statement in line 369.
Answer: The goal of this experiment (ablation of the depth sensor) was to demonstrate that CLIP embeddings can be used not only for object detection, but also to let the agent perform the navigation task based on them alone. We have clarified this statement in the text to make it more explicit (Lines 365-369).
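For clarity, a depth-sensor ablation of this kind can be thought of as zeroing out the depth channel of the agent's observations, as in the sketch below. The observation key is hypothetical and not taken from the paper's code.

    import numpy as np

    # Hedged sketch of a depth-sensor ablation: the depth observation is replaced
    # with zeros so the policy can rely only on the CLIP embedding of the RGB frame.
    # The "depth" key is a hypothetical observation name, not the paper's exact API.
    def ablate_depth(observation: dict) -> dict:
        obs = dict(observation)
        obs["depth"] = np.zeros_like(obs["depth"])
        return obs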
Comment 5:
In Section 7.2, the authors present their results. It would be good if the authors presented an ablation study with different target objects, for example keeping a chair but of a different color or shape than the intended goal object.
Answer: As suggested, we have conducted more real-world experiments with the Husky robot. We added a sofa and a blue chair as goal objects and ran our pipeline from three different starting points (Table 4, Figure 8). It should be noted that the semantic segmentation module in our pipeline is independent of the agent's policy and can therefore be replaced to detect any given type of object.
Comment 6:
The authors present their results using the SPL metric; including other metrics from the literature would provide more confidence in their performance claims.
Answer: In Table 4, we have included Path Length and Time metrics to provide more perspective on the robot’s trajectory. We have also added Figure 8, which contains more images of the map and resulting trajectories.
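For readers unfamiliar with it, SPL (Success weighted by Path Length, Anderson et al., 2018) is standardly computed as in the sketch below; this is the generic definition, not code from the paper.

    # Standard SPL definition: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    # where S_i is the binary success indicator for episode i, l_i the shortest-path
    # length from start to goal, and p_i the length of the path the agent actually took.
    def spl(successes, shortest_lengths, path_lengths):
        terms = [
            float(s) * l / max(p, l)
            for s, l, p in zip(successes, shortest_lengths, path_lengths)
        ]
        return sum(terms) / len(terms)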