Article
Peer-Review Record

Anomaly Detection Paradigm for Multivariate Time Series Data Mining for Healthcare

Appl. Sci. 2022, 12(17), 8902; https://doi.org/10.3390/app12178902
by Abdul Razaque 1,*, Marzhan Abenova 1, Munif Alotaibi 2,*, Bandar Alotaibi 3,4,*, Hamoud Alshammari 5, Salim Hariri 6 and Aziz Alotaibi 7
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 November 2021 / Revised: 25 June 2022 / Accepted: 31 August 2022 / Published: 5 September 2022

Round 1

Reviewer 1 Report

Dear Authors and Editors,

I'll be brief. In my opinion the submitted manuscript cannot be published. Besides dozens of editorial deficiencies (indices of notations, commas, colons, full stops and many more), there are serious mathematical errors. I'll mention a few of them:

  1. The equalities in (15) are incorrect if m is as given in (14); m would have to equal something else for (15) to be correct.
  2. In (19), an expectation is taken of a deterministic (non-random) variable. I would be very surprised if the metric d^2_{x,y} were random.
  3. The equality before (25) is incorrect.
  4. The third equality in (26) is incorrect.
  5. The inequality a>=1+a^2, used in the estimate on the line just after (26), has no real solutions.
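The claim in item 5 can be checked by completing the square. The short derivation below uses only the inequality as quoted above and assumes that a is a real number, which is an assumption about the manuscript's setting:

```latex
\begin{align*}
a \ge 1 + a^2
  &\iff a^2 - a + 1 \le 0 \\
  &\iff \left(a - \tfrac{1}{2}\right)^2 + \tfrac{3}{4} \le 0 .
\end{align*}
% The left-hand side is at least 3/4 for every real a,
% so no real a satisfies the inequality.
% Equivalently, the discriminant of a^2 - a + 1 is (-1)^2 - 4 = -3 < 0.
```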

There are many more errors, and I see no point in proceeding. If this is somehow fixable and the main result remains valid, please resubmit at a later time.

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

The manuscript sent to me for evaluation, entitled "Hybrid Algorithm for Anomaly Removal in Time Series Data Mining", is interesting and important for the development of modern science.

The introduction contains the necessary elements: an introduction to the topic, description of the issues, the importance of research and the purpose. It also contains a description of the structure.

The methodology is correctly described.

The authors' goal is understandable - to remove anomalies from the time series.

However, the experiment itself raises reservations on my part because it flattens the properties of the data (throwing them all into one bag). Normalization is not a panacea in this case.

As I understand it, the authors built the database by mixing different types of data together. Information is missing on how many observations were taken from each source; I assume 365, 34 and 480, or multiples of these quantities, which is at least what the database description in Section 4.1 suggests.

They used:

  1. Female birth: The total number of daily female births in California in 1959.
  2. ElectricProduction: Annual totals of electricity manufactured in California from 1985 to 2018.
  3. BeerProduction: Monthly Australian beer production data from 1956 to 1995.

They concern different spatial ranges (California and Australia), different time ranges (1959, 1985-2018, 1956-1995), different frequencies (daily, annual, monthly), and different subject areas (female births, electricity production, beer production).
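The observation counts assumed above (365, 34 and 480) follow from simple arithmetic on the stated ranges. The sketch below reproduces that arithmetic under the assumption that each series covers its stated range completely, with one value per period and no gaps; the function name `expected_counts` is hypothetical, and the actual number of rows in the authors' files may differ.

```python
import calendar


def expected_counts():
    """Number of observations implied by the stated ranges (assuming no gaps)."""
    # Daily series for the calendar year 1959
    daily_1959 = 366 if calendar.isleap(1959) else 365
    # Annual series, 1985 to 2018 inclusive
    annual_1985_2018 = 2018 - 1985 + 1
    # Monthly series, 1956 to 1995 inclusive
    monthly_1956_1995 = (1995 - 1956 + 1) * 12
    return daily_1959, annual_1985_2018, monthly_1956_1995


print(expected_counts())  # (365, 34, 480)
```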

To my knowledge, there must be a substantive premise, grounded in the phenomenon under study, for including such different variables in one study (female births, electricity production, beer production). I cannot see one. Moreover, the data must be of the same frequency (of the same type, with the same structure) and have a common point of attachment (the so-called key).

I understand that the authors compare the run time of the process using their proprietary algorithm, but they must not forget other layers that matter for the repeatability of the study, including the selection of variables. If the study implies that the selection of variables is not significant (and this is how I perceive the approach presented in the article: through the prism of randomness), then in my opinion it calls into question the merits of substantive data selection, that is, procedures already established in science. Data mining is used in processes where the data being analyzed share some common anchor point.

Besides, data with different frequencies have completely different characteristics, and the way anomalies are removed from them is completely different. Monthly data may be characterized by seasonality, short and medium cycles, a trend and an irregular element; annual data by a trend, medium or long cycles and an irregular element; and daily data may show elements of instability. Without proper filtering or decomposition of the data, using filters with different frequency parameters (lambda), some component (e.g. seasonality) ends up being treated as an anomaly, which is not the correct approach.
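The effect described above can be illustrated on synthetic data. The sketch below is not the authors' algorithm and does not use the frequency filters mentioned above; it swaps in a plain z-score detector and a very simple seasonal adjustment (subtracting the mean of each calendar month). All names, the threshold of 2.5 and the synthetic series (a peak every twelfth month plus one injected anomaly at index 50) are assumptions made for illustration.

```python
import numpy as np


def zscore_flags(x, threshold=2.5):
    """Flag points whose absolute z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold


def deseasonalize(x, period=12):
    """Subtract the mean of each position in the seasonal cycle."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for k in range(period):
        out[k::period] -= x[k::period].mean()
    return out


# Synthetic monthly series: 10 years, a peak every twelfth month,
# small noise, and one genuine anomaly injected at index 50.
rng = np.random.default_rng(0)
months = np.arange(120)
series = 10.0 * (months % 12 == 11) + rng.normal(0.0, 0.1, size=months.size)
series[50] += 4.0

flags_raw = zscore_flags(series)
flags_adjusted = zscore_flags(deseasonalize(series))

print("flagged on raw data:     ", np.flatnonzero(flags_raw))
print("flagged after adjustment:", np.flatnonzero(flags_adjusted))
```

On the raw series the detector flags the recurring seasonal peaks and misses the injected point; after the seasonal component is removed, only the injected point stands out.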

Besides, one of the measures refers in its mathematical construction to correlation (as the authors rightly noticed), but how do female births in California in 1959 correlate with monthly beer production in Australia in the years 1956-1995?

 

Data mining and taxonomy, including Euclidean metrics, are not used to search for patterns that explain the behavior of variables that are illogically related to each other (with no statistical verification of such a correlation). It is like examining the correlation between the arrivals of storks and the births of children (it will probably be high, but statistically insignificant), then including these variables in one study, applying distance metrics to show the diversity of the observations, then removing the noise from the given series, and claiming to have proved that the method is better at recognizing data patterns.
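One common way such a misleading correlation arises is a shared trend. The sketch below illustrates that mechanism only, on synthetic data: two series that have nothing in common except that both grow over time correlate almost perfectly in levels, and the correlation largely disappears once the trend is removed by differencing. The variable names `storks` and `births` are hypothetical, borrowed from the analogy above, and the data are random numbers rather than real observations.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(1)
t = np.arange(100, dtype=float)

# Two unrelated series that merely share an upward trend (synthetic data).
storks = 1.0 * t + rng.normal(0.0, 1.0, size=t.size)
births = 2.0 * t + rng.normal(0.0, 1.0, size=t.size)

r_levels = pearson(storks, births)                     # dominated by the trend
r_changes = pearson(np.diff(storks), np.diff(births))  # trend removed

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```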

I suggest the authors perform the experiment again, on data that are focused around one process or phenomenon, that is, related in some way in terms of content. The use of the authors' output (publications) must not raise objections at any point, from beginning to end.

Other cosmetic considerations:

  • Figure 1 - Prediction (should be capitalized); Stationary Testing (here I only propose Testing, because in addition to stationary testing, you should take into account cointegration, normality and outliers).
  • In Section 1.4, reference [15] is missing: [14] is followed directly by [16].
  • Please explain in Figure 3 where the different confidence intervals come from.
  • In the description of formula (31) T (i, m) - the opening brace should be moved from subscript to normal text.
  • The introduction lacks a short description of the methods and data used in the study.
  • Describe the databases (data) from Section 4.1 in more detail and cite them; they should also be added to the reference list.
  • Below the tables and graphs there should be information about the data used for the analysis.
  • There is also no separate section in which the authors would describe the limitations of research resulting from the selection of such data.
  • In reference [7] of the reference list, "Euclidean-like" should be capitalized.
  • Authors wrote: “Data Availability Statement: The data that supports the findings of this research is publicly available as indicated in the references”. I could not find data in the references.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

The article is interesting and, in general, well written; it introduces a new anomaly detection/removal algorithm. I consider that it should be published.

Some issues to consider are:

1) page 3, section 1.2: "This situation occurs when observations are not constant over time [due to time series data.]"

I suggest rewriting this part to improve clarity; the text inside the brackets seems irrelevant (?)

2) page 4: "However, the performance of these implementing algorithms is complex and counterintuitive."

Please rewrite this part.

3) page 5: "normalized by dividing them in the input window"

Please, explain better/rewrite this part.

4) page 9: "Considering ... that the correlation is limited to the range [ 1, 1 ]"

Should it be [-1,1]?

5) page 9: "Let us assume that we have [a] sequence s ∈ R^m and two noise sequences... Then, the estimated ... that are sampled out of a normal distribution ... distance between the two sequences obtained by applying the noise to the base sequence can be represented as follows:"

Please rewrite this part and the explanation that follows. An n' is defined but never used!

6) page 10: "Here, we use n as a random..."

The "n" should be in italic/math mode.

7) page 10: "We observe that µS..."

Should the "S" be a subscript?

8) page 11: "In steps 2-13, the distance of each"

Is the interval correct?

9) page 13: Equation 31, "where ∞ is the value of infinity..."

According to the equation, ∞ should be a set, not a single value. I suggest changing the notation.

10) page 14: Correct Eqns. 34 and 35

11) page 14: "proposed proofing next by using"

Please rewrite this part.

12) page 15: alignment in Eqn. 42.

13) page 15: "...a and b can be obtained:"

Parameters a and b should be in italics/math mode.

14) page 17: The authors show several run-time measurements, it would be interesting to see their respective uncertainties (+/-).

15) page 17: "The compared STAMP and STOMP..."

Suggestion: "In comparison STAMP and STOMP..."

16) page 18: "...that all representations have better results, but..."

I suggest using the word "good" instead of "better".

17) page 18: "...STAMP slows down to 96%..."

Replace "slows down" with "drops".

That's all.
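Regarding item 14, one simple way to attach an uncertainty to a run-time measurement is to repeat the measurement and report the mean together with the sample standard deviation. The sketch below assumes that approach; `time_with_uncertainty` and `workload` are hypothetical names, and `workload` is only a placeholder for a single run of whichever algorithm is being timed.

```python
import statistics
import time


def time_with_uncertainty(func, repeats=10):
    """Run func several times; return (mean, sample std dev) of run time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def workload():
    # Placeholder computation standing in for one run of the algorithm under test.
    return sum(i * i for i in range(10_000))


mean_s, std_s = time_with_uncertainty(workload)
print(f"run time: {mean_s:.6f} s +/- {std_s:.6f} s")
```

The number of repeats is a trade-off between measurement cost and the stability of the estimate; ten is an arbitrary choice here.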

 

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

 

In my opinion, the authors made only minor cosmetic changes and did not focus on the most important ones. The experiment still raises my doubts. My comments are the same as in review 1. Some issues have been hidden rather than eliminated, and the experiment has not been improved (mainly the data selection; I reiterate my previous comments).

 

Although I see potential in the material, I cannot recommend it further without appropriate improvement. If the authors introduce appropriate corrections, then further consideration may be given. At this stage, I maintain my previous assessment, in proportion to the corrections made.

 

In addition, other shortcomings identified in the previous review have not been eliminated:

The reference numbering used in the text is out of order: [13] is followed by [15], then [31] (see pages 1-2). Where are [14] and [16-30]?

 

The answer to my question from review 1, "Please explain in Figure 3 where the different confidence intervals come from", is still missing from the text.

 

The notation of the formulas should be clarified, including Formula 30. The remark is still the same as in review 1: Formula 30 (now 31) has been slightly modified but still contains the misplaced parenthesis (it should not be in the subscript).

Author Response

Please see the attachment

Author Response File: Author Response.docx

Round 3

Reviewer 2 Report

The authors reformulated the article to its benefit. They made significant amendments, which strongly emphasize their contribution. They explained, and thereby defended, the methodology and the related results. They dealt with weak points by turning them into strengths. The amendments satisfy me. Congratulations.
