Human Activity Recognition for Assisted Living Based on Scene Understanding
Round 1
Reviewer 1 Report
The paper presents a human activity recognition system for assisting people, based on scene understanding using video monitoring. The authors use two neural networks to detect interactions between humans and objects.
The paper is interesting but presents several weaknesses that should be properly addressed by the authors.
First of all, the article is structured into too many sections. Section 3 is too long, and the information it provides is not relevant and can be found in the references. On the other hand, Sections 4 and 5 refer to the SmartCare system, which is not the subject of this article. These three sections should be summarised. Moreover, the last two paragraphs of the initial Section 5, before 5.1, contain redundant information.
The authors present related work on activity recognition, but they do not clearly compare their proposal with the existing approaches. Even though the system is presented as an improvement to the SmartCare system by the same authors, there should be a clear comparison of the results obtained with and without this improvement.
Figures and tables should be revised, relocated, and correctly referenced in the text (for example, Table 7 is not referenced, and expressions such as "previous figure" are used). In addition, some figures (such as Figures 7 and 8) need to be regenerated due to their poor quality.
There is no adequate justification for the use of the two neural networks: why those and not others?
Author Response
We appreciate the thoroughness with which this review was done; all suggestions and best practices are welcome.
Significant changes have been made to the structure of the paper in order to comply with the suggestions from both reviewers:
- the old Section 3 (Equipment) was removed; in the new Section 3 we briefly describe the SmartCare system (part of the old Section 4) and explain the integration of HAR within the system (part of the old Section 5, without the redundant information at the end). The old Subsections 5.2 & 5.3 and the old Section 9 now become one subsection (3.2), in which we present the results on object detection and human-object interaction detection.
- the old Section 6 is now Section 4
- the old Section 7 is now Section 5
- the old Section 8 is now Section 6 (with the figures rearranged next to the paragraphs in which they are referenced). A new Subsection 6.1 explains how HAR results are used within the broader context of SmartCare.
- the old Section 10 is now Section 7 (Conclusions)
The major changes, in the order they appear in the text, are:
- updated the last paragraph of Section 1 to reflect the new structure of the paper
- two paragraphs were added at the end of Section 2 (state-of-the-art video-based approaches for HAR)
- the Equipment section was completely taken out
- the SmartCare system is introduced in fewer paragraphs (it is described in extenso in the paper referenced as [37])
- redundant information was removed (the paragraphs before the old Subsection 5.1)
- the old Tables 7, 8, and 9 (now Tables 1, 2, and 3) are better explained in Section 3.1
- a new paragraph is introduced at the beginning of Subsection 3.2 (the old 5.2), in which the practical constraints of HAR are better explained
- the old Section 9 is now at the end of 3.2 as it is closely related to the results on object detection and human-object interaction
- no changes were made to the text of Sections 4 & 5 (the old Sections 6 & 7)
- a new subsection (6.3) was added at the end of Section 6 (Design and Implementation, the old Section 8), in which we briefly describe how HAR results are used in the rule engine along with information from the different sensors/systems connected to SmartCare.
The majority of the figures in the paper are now larger so that they are easier to follow. Figures that had been displaced in the text were moved next to the paragraphs referring to them.
The justification for using the selected models is closely related to the constraints imposed by the deployability of the system (please see the new paragraphs added at the beginning of Subsection 3.2). The need for lightweight detection models is imposed by the limited computational power of the mobile platform, and the HAR solution uses an object detection network because it needs to detect interactions with specific objects. Moreover, the dataset used to train the object detector can easily be extended with new classes as the need arises. To select the model, a comparison between object detectors in the context of human-object interaction detection is provided in Tables 4 and 5.
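For illustration only, the following minimal sketch (our own, not taken from the paper) shows how the per-frame outputs of an object detector could be paired into human-object interaction candidates. The class names, the box format, and the overlap threshold are all assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def infer_interactions(detections, iou_threshold=0.02):
    """Pair each detected person with every overlapping object.

    `detections` is a list of (class_name, box) tuples, as produced by a
    lightweight single-frame object detector; any person/object pair whose
    boxes overlap beyond the (illustrative) threshold is reported as a
    candidate human-object interaction.
    """
    persons = [box for cls, box in detections if cls == "person"]
    objects = [(cls, box) for cls, box in detections if cls != "person"]
    return [
        (cls, round(iou(p, box), 3))
        for p in persons
        for cls, box in objects
        if iou(p, box) > iou_threshold
    ]

# Example: a person box overlapping a cup box suggests a drinking-related activity.
dets = [("person", (100, 50, 300, 400)), ("cup", (260, 200, 320, 260))]
print(infer_interactions(dets))  # -> [('cup', 0.034)]
```

This sketch only illustrates why an object detector is a natural building block when the activities of interest are defined by the objects being interacted with; the paper's actual pipeline combines two networks.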
Any other suggestions on the reorganized paper are welcome.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper presents an indoor human activity recognition (HAR) system based on scene understanding. The experimental results demonstrate that such a system can be integrated into the SmartCare platform to provide information about the activity of the monitored patient. The goal of the system and algorithm is clear, the design is reasonable, the description is detailed, and the results are reliable. It is recommended for publication after minor revision. The suggested modifications are as follows:
1. The system uses a ZED 2 depth camera. However, the use of the depth information is not mentioned in the system design and implementation. Please explain why a depth camera is used, what the role of the depth information is, and how it is integrated with the HOI recognition results.
2. From the perspective of human activity recognition, this paper adopts a two-stage method based on detection and recognition in a single frame. However, there are similar single-stage methods based on a single frame, as well as recognition algorithms that operate directly on video. Please analyze and compare them to demonstrate the soundness of the chosen algorithm.
Author Response
We appreciate the thoroughness with which this review was done; all suggestions and best practices are welcome.
Significant changes have been made to the structure of the paper in order to comply with the suggestions from both reviewers:
- the old Section 3 (Equipment) was removed; in the new Section 3 we briefly describe the SmartCare system (part of the old Section 4) and explain the integration of HAR within the system (part of the old Section 5, without the redundant information at the end). The old Subsections 5.2 & 5.3 and the old Section 9 now become one subsection (3.2), in which we present the results on object detection and human-object interaction detection.
- the old Section 6 is now Section 4
- the old Section 7 is now Section 5
- the old Section 8 is now Section 6 (with the figures rearranged next to the paragraphs in which they are referenced). A new Subsection 6.1 explains how HAR results are used within the broader context of SmartCare.
- the old Section 10 is now Section 7 (Conclusions)
The major changes, in the order they appear in the text, are:
- updated the last paragraph of Section 1 to reflect the new structure of the paper
- two paragraphs were added at the end of Section 2 (state-of-the-art video-based approaches for HAR)
- the Equipment section was completely taken out
- the SmartCare system is introduced in fewer paragraphs (it is described in extenso in the paper referenced as [37])
- redundant information was removed (the paragraphs before the old Subsection 5.1)
- the old Tables 7, 8, and 9 (now Tables 1, 2, and 3) are better explained in Section 3.1
- a new paragraph is introduced at the beginning of Subsection 3.2 (the old 5.2), in which the practical constraints of HAR are better explained
- the old Section 9 is now at the end of 3.2 as it is closely related to the results on object detection and human-object interaction
- no changes were made to the text of Sections 4 & 5 (the old Sections 6 & 7)
- a new subsection (6.3) was added at the end of Section 6 (Design and Implementation, the old Section 8), in which we briefly describe how HAR results are used in the rule engine along with information from the different sensors/systems connected to SmartCare.
The majority of the figures in the paper are now larger so that they are easier to follow. Figures that had been displaced in the text were moved next to the paragraphs referring to them.
The justification for using the selected models is closely related to the constraints imposed by the deployability of the system (please see the new paragraphs added at the beginning of Subsection 3.2). The need for lightweight detection models is imposed by the limited computational power of the mobile platform, and the HAR solution uses an object detection network because it needs to detect interactions with specific objects. Moreover, the dataset used to train the object detector can easily be extended with new classes as the need arises. To select the model, a comparison between object detectors in the context of human-object interaction detection is provided in Tables 4 and 5.
Indeed, the depth information from the ZED camera is not used for HAR. The comparison between the acquisition sensors was removed; it was done for an older feature that is now a stand-alone module of SmartCare: in the context of COVID-19, the physical distance between people (patient-patient, patient-caregiver, patient-doctor) needs to be monitored, hence the requirement of a depth camera for estimating people's locations. Since this sub-system is not the subject of the current paper, the constraint of using the ZED camera was taken as a given.
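For context only, a minimal sketch (our own illustration, not the SmartCare implementation) of why a depth camera enables distance monitoring: each detected person's pixel location can be back-projected into 3D using the depth value and the camera intrinsics, and the Euclidean distance between two people compared against a threshold. The intrinsics and the 1.5 m threshold are assumed values:

```python
import math

# Hypothetical sketch: estimating the physical distance between two detected
# people from a depth image. The pinhole intrinsics (fx, fy, cx, cy) and the
# 1.5 m alert threshold are illustrative values, not taken from the paper.
FX, FY, CX, CY = 700.0, 700.0, 640.0, 360.0  # assumed camera intrinsics

def pixel_to_3d(u, v, depth_m):
    """Back-project a pixel (u, v) with depth in metres into camera coordinates."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)

def people_too_close(p1_px, p2_px, threshold_m=1.5):
    """p1_px / p2_px are (u, v, depth_m) reference points for the two people."""
    a = pixel_to_3d(*p1_px)
    b = pixel_to_3d(*p2_px)
    dist = math.dist(a, b)  # Euclidean distance in metres
    return dist < threshold_m, dist

close, d = people_too_close((500, 300, 2.0), (900, 320, 2.4))
print(f"distance = {d:.2f} m, alert = {close}")  # distance = 1.35 m, alert = True
```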
Any other suggestions on the reorganized paper are welcome.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
The authors have made an effort to improve the article following the recommendations of the previous review.
Some details remain to be revised:
1. Figure 2 should be reprinted with a white background to facilitate its visualization.
2. Several figures appear before they are referenced in the text. They should be relocated so that they follow their references.
- Section 5 begins with a figure, which is unusual; moreover, the figure has not been referenced earlier in the text.
- Figure 8 also appears before being referenced in the text. Delete the "previous" reference in the text.
- The same is true for Figure 9.
There is no reference to future work in the Conclusions. Does this mean that the system has no room for improvement?
Author Response
Thank you for your prompt response; the paper was modified as follows:
- Figure 2 was reprinted as recommended, and the paragraph referring to it has been moved before it.
- Figures 2, 6, 7, 8, 9, 17 and 19 were relocated so that they appear after their mention in the text.
- Table 7 was relocated after the paragraph mentioning it.
- After this update, Section 5 begins with a text paragraph (the one referring to Figures 6 and 7).
- The paragraph referring to Figure 8 has been relocated before it, and the "previous" mention in the text was deleted because the figure now follows the paragraph.
- Indeed, future work was not mentioned in the last section. Two paragraphs discussing the next steps for the HAR module were added at the very end.
Any other suggestions are welcome as well.
Author Response File: Author Response.docx