Incremental Entity Blocking over Heterogeneous Streaming Data
Round 1
Reviewer 1 Report
The author presents a blocking approach for the entity resolution (ER) problem in a streaming scenario. It addresses several issues related to the general problem, such as the tolerance to noise and the selection of the best attributes to consider.
Entity Resolution (ER) is a well-known and deeply explored research topic. However, this paper introduces a technique in a scenario that has not yet been deeply explored. The research topic is very interesting, but the proposed solution should be better described w.r.t. the issues triggered by the novel scenario. Moreover, the presentation is too colloquial, and the text sometimes contains redundant content.
In particular:
1) the novelty of the proposal should be better highlighted,
2) the overall presentation should be improved,
3) an overview of the complete process should be introduced.
More specifically,
- The paper contains many sentences that appear repetitive. Please, try to remove redundant sentences.
- Section 2 should better present the problem statement. In particular, the authors should first present the ER problem in static scenarios, and then highlight the issues that arise with streaming data. This would make it possible to better emphasize the novelty of the proposal w.r.t. the novel scenario. The section should also present these issues in a more formal way.
- I suggest introducing a summary table containing all the symbols used throughout the paper.
- The splitting of the related work section should better reflect the problem issues that would be addressed by the presented proposal.
- Concerning the attributes selection issue, there exist many approaches that use metadata and/or data profiles to perform and/or enhance the ER processes. An example is:
http://www.vldb.org/pvldb/vol13/p712-koumarelas.pdf
This kind of proposal exploits the possibility to infer data dependencies from database instances through discovery algorithms, such as:
https://doi.org/10.1109/TKDE.2020.2967722
Such algorithms also started to be introduced for updating metadata in incremental scenarios, but they do not consider a value comparison method based on similarity. An example is: https://doi.org/10.1145/3397462
- Concerning the proposal, the authors should better emphasize the novelty of the proposal w.r.t. the problem issues. Moreover, the presentation should be better structured. The contents are hard to follow; I suggest presenting not only the system architecture but also a Figure providing an overview of the ER process, in which the application of the designed strategy is shown.
- Also the presentation of the experimental evaluation and results should be improved. For instance, Result paragraphs should directly recall the RQs.
- Try to rephrase RQ1. Its meaning remains obscure.
- The goal and contents of Section 7.1 are not clear.
- The graphical presentation of the Figures needs some improvements.
- The paper contains many acronyms, some of which could be avoided, such as SM and PT.
- Please, proofread the whole paper trying to remove errors and typos. Just to highlight a typo on Page 21: A comparative of our technique -> a comparison of our technique
Author Response
We are most grateful for all suggestions and comments, which have contributed toward improving the quality of our paper. Below is the list of suggestions and comments, together with the respective solutions implemented.
[TOPIC 1]
In particular:
1) the novelty of the proposal should be better highlighted,
2) the overall presentation should be improved,
3) an overview of the complete process should be introduced.
Please find several improvements throughout the paper regarding the novelty of the proposal in Sections 1, 2, 3, and 4. These sections were updated to clarify the main goals of the proposal. In particular, the Related Work section now presents the different challenges addressed in this work and highlights the novelty: no state-of-the-art work addresses all of these contexts simultaneously. Regarding the overall presentation, the text was updated to clarify and organize the flow of the paper. Section 1 was restructured to remove redundant material. Sections 4 and 7 were rewritten to highlight the main points of the proposal and the respective results. Regarding the overview of the complete process, an introductory part was added to Section 4 to introduce the ER task as a whole and how blocking contributes to it, singling out the contribution of the proposed blocking technique.
[TOPIC 2]
The paper contains many sentences that appear repetitive. Please, try to remove redundant sentences.
Following this advice, several parts of the paper were reorganized to avoid repetitive sentences and redundant content. These improvements appear throughout the paper, mostly in the first four sections.
[TOPIC 3]
Section 2 should better present the problem statement. In particular, the authors should first present the ER problem in static scenarios, and then highlight the issues that arise with streaming data. This would make it possible to better emphasize the novelty of the proposal w.r.t. the novel scenario. The section should also present these issues in a more formal way.
Section 2 was rewritten to formally define the traditional ER and also the blocking step. Now, ER and the blocking technique are introduced in the context of streaming data. Notice that, in Section 4 (Incremental Blocking over Streaming Data), some formalizations are also highlighted, reinforcing the challenges stated in Section 2.
[TOPIC 4]
I suggest introducing a summary table containing all the symbols used throughout the paper.
Following this advice, please, find Table 2 (Section 2), which summarizes all the symbols used throughout the paper.
[TOPIC 5]
The splitting of the related work section should better reflect the problem issues that would be addressed by the presented proposal.
The related work subsections were renamed to guide the reader toward the issues addressed in this work. Moreover, in the last paragraph of each subsection, we contrast the main issues related to each topic with the presented proposal.
[TOPIC 6]
Concerning the attributes selection issue, there exist many approaches that use metadata and/or data profiles to perform and/or enhance the ER processes. An example is:
http://www.vldb.org/pvldb/vol13/p712-koumarelas.pdf
This kind of proposal exploits the possibility to infer data dependencies from database instances through discovery algorithms, such as:
https://doi.org/10.1109/TKDE.2020.2967722
Such algorithms also started to be introduced for updating metadata in incremental scenarios, but they do not consider a value comparison method based on similarity. An example is: https://doi.org/10.1145/3397462
Section 3.3 was rewritten to add the idea of functional dependencies (FDs) as a basis for strategies related to attribute selection. In this sense, the suggested works were highlighted and discussed to ground the idea of applying schema relations (links between attributes) as a strategy to support matching tasks.
[TOPIC 7]
Concerning the proposal, the authors should better emphasize the novelty of the proposal w.r.t. the problem issues. Moreover, the presentation should be better structured. The contents are hard to follow; I suggest presenting not only the system architecture but also a Figure providing an overview of the ER process, in which the application of the designed strategy is shown.
Section 4 was restructured to introduce the main challenges and issues addressed by this work, highlighting the novelty of the proposal. To clarify the proposal, we accepted the suggestion to introduce a Figure that provides an overview of the ER process and shows how this work relates to it. Thus, Figure 1 was added to Section 4. Moreover, a proposal overview was inserted at the beginning of Section 4 to provide a better understanding of the challenges addressed in this work and the proposed solution for each of them.
[TOPIC 8]
Also the presentation of the experimental evaluation and results should be improved. For instance, Result paragraphs should directly recall the RQs.
Please note that all conclusions derived from the results are linked to the respective RQ. Our approach was first to discuss the result and then to recall the RQ answered by that discussion. This structure was maintained in both experimental discussion sections (efficiency and effectiveness) to keep a consistent textual style.
[TOPIC 9]
Try to rephrase RQ1. Its meaning remains obscure.
RQ1 was rewritten to highlight the sense of equivalence of the state-of-the-art and the proposed techniques in terms of effectiveness.
[TOPIC 10]
The goal and contents of Section 7.1 are not clear.
Section 7.1 was redesigned to clarify and highlight the main objectives of the section. Note that, since state-of-the-art blocking techniques do not work properly in scenarios involving incremental and streaming data, a comparison of our technique against these blocking techniques would be unfair and, in most cases, infeasible. Therefore, we developed a baseline technique, which is also a contribution of our work, able to handle the context of this work. This also reflects the lack of blocking techniques that address all the challenges faced in this work.
[TOPIC 11]
The graphical presentation of the Figures needs some improvements.
Following this advice, we have improved the quality of all figures using a professional image editor.
[TOPIC 12]
The paper contains many acronyms, some of which could be avoided, such as SM and PT.
Please consider that, due to the high number of combinations involving the Proposed Technique and the proposed strategies (top-$n$ neighborhood and attribute selection), spelling out the technique names in full would make the text repetitive. Furthermore, using the full names in the figures would clutter them and make the results harder to read and understand.
[TOPIC 13]
Please, proofread the whole paper trying to remove errors and typos. Just to highlight a typo on Page 21: A comparative of our technique -> a comparison of our technique
To improve its quality, the paper went through a complete textual revision, removing typos and rephrasing sentences.
Reviewer 2 Report
The paper proposed a novel schema-agnostic blocking technique capable of incrementally processing streaming data in parallel, where challenges involving heterogeneous data, parallel computing, noisy data, incremental processing, and streaming data have been addressed.
The topic of the paper is certainly of interest, and the amount of work done and the information provided define the work's contributions. The article is also well structured. The authors explained the contributions made in the paper well, as well as the structure of the other sections. Finally, the conclusion and future work are promising.
Author Response
We are most grateful for the evaluation of our work and the compliments on it.
Thanks for the feedback.
Round 2
Reviewer 1 Report
In this revised version, the paper has been improved substantially, and the authors addressed all my remarks. They better introduced the overall problem and provided an overview of the general process. I would only suggest swapping Section 2 and Section 3. It would be useful to read the related work first, and then focus on the problem statement.