Article
Peer-Review Record

Classification and Prediction of Fecal Coliform in Stream Waters Using Decision Trees (DTs) for Upper Green River Watershed, Kentucky, USA

Water 2021, 13(19), 2790; https://doi.org/10.3390/w13192790
by Abdul Hannan 1 and Jagadeesh Anmala 2,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 24 August 2021 / Revised: 2 October 2021 / Accepted: 5 October 2021 / Published: 8 October 2021
(This article belongs to the Special Issue Decision Support Tools for Water Quality Management)

Round 1

Reviewer 1 Report

The paper is interesting. The comments are listed below.

Major comments

  1. Line 240: the authors stated, "The agriculture land-use factor (ALUF or a) is highly and negatively correlated with the forest land use factor (FLUF or f)." Did the authors remove one of the two factors from the input variables to test the performance of different models? Was any testing conducted?
  2. In the methods section, I cannot find anything that was created or modified by the authors. Or does this paper aim to apply techniques created by others and compare which performs better?
  3. The authors adopt decision tree (DT) techniques to predict fecal coliform in streams. However, more information is needed to highlight the contribution of this work. For example, is it the first study to adopt DTs to predict fecal coliform, or does it improve on existing techniques? Otherwise, it is a case study.

Minor comments

  1. In the abstract, the decision tree algorithms should be given their full names at first mention (e.g., Random Forest, RF).
  2. The format of the citation does not fit the format of the journal.
  3. Figure 2 is poor in resolution; please update it. In addition, watershed boundaries should be added to Figure 1.
  4. The equations should be numbered. The land use equations should be moved to the methods section.

Author Response

Response to Reviewer 1 Comments

Reviewer evaluation (options: Yes / Can be improved / Must be improved / Not applicable):

Does the introduction provide sufficient background and include all relevant references? Yes

Is the research design appropriate? Yes

Are the methods adequately described? Can be improved

Are the results clearly presented? Can be improved

Are the conclusions supported by the results? Can be improved

Response to the above table: The introduction has been improved with a few more relevant references; the new text has been introduced in lines 83 to 88, 90 to 93, 95 to 98, 103 to 106, 109 to 110, 114 to 118, and 122 to 147.  The research design has been improved by introducing the objectives of the study and the plan of the manuscript in lines 135 to 147.  The discussion of the methods has been improved in lines 227 to 241, 278 to 287, 296 to 301, 307 to 313, 320 to 324, 333 to 338, and 342 to 348.  The section on results has been improved with clearer explanations in lines 386 to 392, 416 to 418, 421 to 427, and 456 to 461.  The section on conclusions has been improved in lines 628 to 647.

 

Comments and Suggestions for Authors

 

The paper is interesting. The comments are listed below.

 

Response 1:  The authors would like to thank the reviewer for the positive comments and are glad that the reviewer found the work interesting.

Major comments

Point 1: Line 240: the authors stated, "The agriculture land-use factor (ALUF or a) is highly and negatively correlated with the forest land use factor (FLUF or f)." Did the authors remove one of the two factors from the input variables to test the performance of different models? Was any testing conducted?

 

Response 1: The authors would like to thank the reviewer for pointing out an interesting aspect of the paper.  The corresponding author, together with other researchers, has examined this aspect in a separate study (Turuganti et al., 2020).  In that work, PCA and CCA were used to reduce the effective number of input variables (dimensions), and the performance of the ANN models was tested on the reduced inputs.  However, a more comprehensive output performance was observed when all the land use parameters were included.  The authors would like to explore this aspect using DTs in a forthcoming study; including dimension reduction in the current study would increase the length of the paper even further.  Nevertheless, this point has been discussed in lines 386 to 391.
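As an illustration of the kind of sensitivity test the reviewer suggests, the sketch below (not part of the manuscript) trains a decision tree with and without one of two strongly anti-correlated land-use factors and compares test accuracy. The column names ALUF and FLUF, the synthetic data, and the target definition are placeholders, not the study's dataset:

```python
# Hypothetical sketch: compare a decision tree trained with and without one of two
# highly correlated land-use factors. All data here is synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 250  # placeholder sample size, not the study's record count

aluf = rng.uniform(0, 1, n)                       # agriculture land-use factor (assumed name)
fluf = 1.0 - aluf + rng.normal(0, 0.05, n)        # forest land-use factor, strongly anti-correlated
rain = rng.uniform(0, 100, n)
temp = rng.uniform(5, 35, n)
X = pd.DataFrame({"ALUF": aluf, "FLUF": fluf, "rainfall": rain, "temperature": temp})
# Placeholder binary target standing in for a fecal-coliform threshold class.
y = (aluf * 0.6 + rain / 100 * 0.4 + rng.normal(0, 0.1, n) > 0.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Train once with all inputs and once with FLUF dropped, then compare test accuracy.
for cols in (list(X.columns), [c for c in X.columns if c != "FLUF"]):
    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr[cols], y_tr)
    print(cols, "test accuracy:", round(clf.score(X_te[cols], y_te), 3))
```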

 

Point 2: In the methods section, I cannot find anything that was created or modified by the authors. Or does this paper aim to apply techniques created by others and compare which performs better?

 

Response 2: Yes.  The primary objective of this paper is to study the applicability and potential of existing Decision Trees (DTs) in classifying fecal pollution based on climate and land use parameters, and eventually to suggest a Decision Support System based on Decision Tree modelling.

 

Point 3: The authors adopt decision tree (DT) techniques to predict fecal coliform in streams. However, more information is needed to highlight the contribution of this work. For example, is it the first study to adopt DTs to predict fecal coliform, or does it improve on existing techniques? Otherwise, it is a case study.

 

Response 3: Yes.  It is the first of its kind, in the sense that the prediction of fecal coliform is based on causal parameters such as climate and land use.  This has been clearly stated in the abstract.  At the same time, the authors agree with the point that it is a case study, as the methodology needs to be tested with more parameters across a variety of watersheds and climate changes.  The objectives of the paper have been added from line 135 to line 140 so that the contributions of the work are highlighted more clearly.

Minor comments

Point 1: In the abstract, the decision tree algorithms should be given their full names at first mention (e.g., Random Forest, RF).

 

Response 1: The authors have made the necessary change by writing the full form of the abbreviations in the abstract.

 

Point 2: The format of the citation does not fit the format of the journal.

 

Response 2: The format of the citation has been changed to suit the journal format.

 

Point 3: Figure 2 is poor in resolution; please update it. In addition, watershed boundaries should be added to Figure 1.

 

Response 3: Figure 1 and Figure 2 have been updated.  The changes can be seen in lines 221 and 223.

 

Point 4: The equations should be numbered. The land use equations should be moved to the methods section.

 

Response 4: The equations are now numbered consecutively.  The land use equations have been moved to the methodology section 3.1.

 

 

Author Response File: Author Response.pdf

Reviewer 2 Report

The article is worth correcting; that is my general conclusion.

 

I start the review with the biggest doubt; then some minor remarks are given.

The process of creating the DTC presented in Fig. 10 and described below it is quite smart.

However:

  1. Is there any proof that, when applying ID3, CART, and C4.5, exactly the same feature will be chosen for each node? Is the approach still correct if not?
  2. Is there any proof that the classification accuracy will be higher if the proposed classifier is applied (higher than that of a single tool)?
  3. Is there any proof that minimizing entropy separately at each of the tree's nodes achieves the minimum entropy in the end leaves? (Similarly, choosing the quickest route for each of 10 sections of a trip may not give the quickest whole trip.)

My answer is "no, there isn't" to all three questions. To partially demonstrate the superior performance of the proposed model, the calculations can be done (not just a proposal without verification). But even if it is verified, the verification will be valid only for the existing data; there will still be no proof for data with a new (not yet encountered) combination of values. After this kind of verification the model can stay as proposed, but for better reliability cross-validation should be performed.

To accept it as a general approach, proofs for points 1, 2, and 3 should be provided. I think this is very hard or even impossible.

 

Minor remarks:

Formulas should be numbered.

Formulas in lines 122-124: full names should be presented in the text and only their abbreviations in the formulas.

Study area and Data should be separated into subsections.

The data should be presented in more detail. Information is missing about the intervals between measurements at the same place and whether there were several sensors on one river.

An issue that is not discussed (I wonder whether the proposed approach can handle it): for example, if the day of measurement is sunny and hot, FC can differ depending on whether it follows 20 similar days or 20 rainy days.

The next issue is the time dependence of sensors located on the same river.

How many rows of data are based on one sensor (the same measurement point)?

In the formula for the gain ratio (line 216), wj is not explained. Split info is not defined. The j and k symbols are not defined.

The quite large differences between training and testing accuracies presented in Fig. 5 (for RF, GB, and ERT) may indicate the need for a cross-validation procedure, or that the trees are pruned too late (overtraining).

Support in Fig. 6 is not defined.

Writing the F1-score as "F1 - score" in the formula in line 344 can be misleading, as it reads like a subtraction.

How is it possible that FC (Table 1) gives a sum of 450,000, an average over 2000, and a variance of 11.9 million?

Why is the variance given rather than the standard deviation?

Based on Table 3 it may be concluded that, on average, FC levels were dangerous. A histogram of FC would be helpful.

It is better to present formulas before the figures that use the terms from those formulas (see Fig. 6).

Author Response

Response to Reviewer 2 Comments

Reviewer evaluation (options: Yes / Can be improved / Must be improved / Not applicable):

Does the introduction provide sufficient background and include all relevant references? Yes

Is the research design appropriate? Can be improved

Are the methods adequately described? Can be improved

Are the results clearly presented? Can be improved

Are the conclusions supported by the results? Must be improved

Response to the above table: The introduction has been improved with a few more relevant references; the new text has been introduced in lines 83 to 88, 90 to 93, 95 to 98, 103 to 106, 109 to 110, 114 to 118, and 122 to 147.  The research design has been improved by introducing the objectives of the study and the plan of the manuscript in lines 135 to 147.  The discussion of the methods has been improved in lines 227 to 241, 278 to 287, 296 to 301, 307 to 313, 320 to 324, 333 to 338, and 342 to 348.  The section on results has been improved with clearer explanations in lines 386 to 392, 416 to 418, 421 to 427, and 456 to 461.  The section on conclusions has been improved in lines 628 to 647.

 

Comments and Suggestions for Authors

 

Point 1: The article is worth correcting; that is my general conclusion.

 

Response 1:  The authors would like to thank the reviewer for the detailed review of the work.

Major comments

Point 1: I start the review with the biggest doubt; then some minor remarks are given.

The process of creating the DTC presented in Fig. 10 and described below it is quite smart.

However:

  1. Is there any proof that, when applying ID3, CART, and C4.5, exactly the same feature will be chosen for each node? Is the approach still correct if not?
  2. Is there any proof that the classification accuracy will be higher if the proposed classifier is applied (higher than that of a single tool)?
  3. Is there any proof that minimizing entropy separately at each of the tree's nodes achieves the minimum entropy in the end leaves? (Similarly, choosing the quickest route for each of 10 sections of a trip may not give the quickest whole trip.)

My answer is "no, there isn't" to all three questions. To partially demonstrate the superior performance of the proposed model, the calculations can be done (not just a proposal without verification). But even if it is verified, the verification will be valid only for the existing data; there will still be no proof for data with a new (not yet encountered) combination of values. After this kind of verification, the model can stay as proposed, but for better reliability cross-validation should be performed.

To accept it as a general approach, proofs for points 1, 2, and 3 should be provided. I think this is very hard or even impossible.

 

Response 1: The authors would like to thank the reviewer for pointing out several interesting aspects of the paper.  The authors agree with the reviewer's answers to the above three questions: there is no proof that exactly the same feature will be chosen at each node when applying the three Decision Trees.  The authors have presented the results succinctly and omitted the Decision Tree figures of the three methods to reduce the length of the paper.  Likewise, there is no guarantee that the classification accuracy of the proposed classifier will be higher; it needs to be tested for a variety of water quality parameters across different watersheds and climate changes.  Finally, being greedy at each step/node does not ensure overall minimization of entropy or global optimization of the process.  In the current paper, the authors have focused only on the classification capabilities of the Decision Trees for this particular dataset, and in that sense they agree that it is essentially a case study.  The present work explores the classification capabilities in the training and testing phases only.  The size of the dataset was one limitation, because of which the authors could not perform cross-validation.  However, detailed verification/cross-validation results are reported in a different paper by the corresponding author, which uses a Decision Tree regressor instead of a Decision Tree classifier.  These results can be found in the following paper:

Anmala, J., & Turuganti, V. (2021). Comparison of the performance of decision tree algorithms and ELM model in the prediction of water quality of the Upper Green River watershed. Water Environment Research, in press.
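To make the node-local criterion in this discussion concrete, the following minimal sketch (not taken from the manuscript) computes entropy and information gain for two toy candidate splits; choosing the split with the largest gain is exactly the greedy step that carries no guarantee of globally minimal leaf entropy:

```python
# Minimal sketch of the greedy, node-local entropy criterion discussed above.
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array (base 2).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting `parent` into `left` and `right`.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (0 = FC below threshold, 1 = FC above threshold), for illustration only.
parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split_a = (parent[:4], parent[4:])                       # one candidate split
split_b = (parent[[0, 1, 3, 5]], parent[[2, 4, 6, 7]])   # another candidate split

for name, (left, right) in {"A": split_a, "B": split_b}.items():
    print(f"split {name}: information gain = {information_gain(parent, left, right):.3f}")
```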

Minor comments

Point 1: Formulas should be numbered

 

Response 1: The Formulas have been numbered.

 

Point 2: Formulas in lines 122-124: full names should be presented in the text and only their abbreviations in the formulas.

 

Response 2: This has been changed as per the above suggestion.

 

Point 3: Study area and Data should be separated into subsections.

 

Response 3: Study area and Data are separated into two subsections 2.1 and 2.2.

 

Point 4: The data should be presented in more detail. Information is missing about the intervals between measurements at the same place and whether there were several sensors on one river.

 

Response 4: The data were collected manually.  The measurements were made monthly over a six-month period, as described in lines 212-214.

 

Point 5: An issue that is not discussed (I wonder whether the proposed approach can handle it): for example, if the day of measurement is sunny and hot, FC can differ depending on whether it follows 20 similar days or 20 rainy days.

 

Response 5: The data were collected manually in the afternoon, at a monthly interval from May 2002 to October 2002.  Yes, the data cover a few rainy months and a few non-rainy months.  A statement has been included in line 220.

 

Point 6: The next issue is the time dependence of sensors located on the same river.

 

Response 6: The data were collected monthly and manually; no sensors were used.

 

Point 7: How many rows of data are based on one sensor (the same measurement point)?

 

Response 7: Sensors were not used in the present study; the data were collected manually.  There were 42 sampling locations along the Green River, and data were collected at all locations every month for a period of six months.  Therefore, six observations/measurements were available at each location.

 

Point 8: In the formula for the gain ratio (line 216), wj is not explained. Split info is not defined. The j and k symbols are not defined.

 

Response 8: The formula is now explained with all the details in lines 320-324.
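For readers of this record, a standard C4.5-style gain-ratio formulation is reproduced below as a reference point. This is the generic textbook definition, not a transcription of the manuscript's equation; the weights wj the reviewer asks about usually correspond to the subset fractions |S_j|/|S|:

\[
\mathrm{Gain}(S,A) = H(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|}\, H(S_j),
\qquad
\mathrm{SplitInfo}(S,A) = -\sum_{j=1}^{k} \frac{|S_j|}{|S|}\, \log_{2}\frac{|S_j|}{|S|},
\]
\[
\mathrm{GainRatio}(S,A) = \frac{\mathrm{Gain}(S,A)}{\mathrm{SplitInfo}(S,A)},
\]

where \(S\) is the set of samples at the node, \(A\) is the candidate attribute, \(H(\cdot)\) is the entropy, and \(S_1,\dots,S_k\) are the subsets produced by splitting on \(A\), so that \(w_j = |S_j|/|S|\).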

 

Point 9: The quite large differences between training and testing accuracies presented in Fig. 5 (for RF, GB, and ERT) may indicate the need for a cross-validation procedure, or that the trees are pruned too late (overtraining).

 

Response 9: Because of data size limitations, a cross-validation procedure was not performed.  Yes, for RF, GB, and ERT the training accuracies are higher than the testing accuracies.  However, these accuracies were obtained after a number of trials and reasonable optimization of the cut points.  Yes, a training accuracy much higher than the testing accuracy indicates, in some sense, the possibility of overtraining.  This has been discussed in lines 456 to 461 and 644 to 647.
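If the dataset were large enough, the cross-validation the reviewer recommends could look like the sketch below, which uses scikit-learn's stratified k-fold utilities on placeholder data; the feature matrix, labels, fold count, and tree settings are assumptions for illustration, not the study's configuration:

```python
# Hedged sketch of stratified k-fold cross-validation for a decision-tree classifier.
# With only a few hundred records, a small number of folds (e.g. 5) keeps each fold usable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the climate/land-use inputs and FC classes.
X, y = make_classification(n_samples=250, n_features=7, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=0),
                         X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

A mean test accuracy reported with its spread across folds gives a more honest picture of generalization than a single train/test split, which is the reviewer's underlying concern about the training-versus-testing gap.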

 

Point 10: Support in Fig. 6 is not defined.

 

Response 10: The Support is defined in lines 484 to 485.
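Assuming the metrics were generated with scikit-learn's reporting tools (an assumption about tooling, not stated in the manuscript), "support" is simply the number of true samples of each class in the evaluated set, as the toy example below shows:

```python
# Toy illustration (not the study's data): the rightmost "support" column of the
# report is the count of true samples per class in y_true.
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(classification_report(y_true, y_pred, zero_division=0))
# support column: class 0 -> 2, class 1 -> 3, class 2 -> 1
```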

 

Point 11: Writing the F1-score as "F1 - score" in the formula in line 344 can be misleading, as it reads like a subtraction.

 

Response 11: It has been corrected.
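For context, the quantity in question is the harmonic mean of precision and recall (the standard definition, not copied from the manuscript), which is why typesetting it with a spaced hyphen as "F1 - score" can read as a subtraction:

\[
F_{1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]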

 

Point 12: How is it possible that FC (Table 1) gives a sum of 450,000, an average over 2000, and a variance of 11.9 million?

 

Response 12: The variability is large.  Descriptive statistics have been added to Figure 5.
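As a rough consistency check using only the summary values quoted by the reviewer (a back-of-the-envelope calculation, not a recomputation from the raw data), a variance of about \(1.19\times 10^{7}\) corresponds to a standard deviation of

\[
\sigma = \sqrt{1.19\times 10^{7}} \approx 3.4\times 10^{3},
\]

which exceeds a mean of roughly 2000 and is consistent with a strongly right-skewed fecal coliform distribution driven by a few very high counts.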

 

Point 13: Why is the variance given rather than the standard deviation?

 

Response 13: The variance has been removed, and the standard deviation is given in Table 1.  In addition, descriptive statistics have been added to Figure 5.

 

Point 14: Based on Table 3 it may be concluded that, on average, FC levels were dangerous. A histogram of FC would be helpful.

 

Response 14: The histogram has been provided in Figure 5 (lines 393-394).

 

Point 15: It is better to present formulas before the figures that use the terms from those formulas (see Fig. 6).

 

Response 15:  The formulae have been moved before Figure 7 (formerly Figure 6; lines 479-486).

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All comments are addressed. The main contribution and some of the content mentioned in the response should be incorporated into the manuscript (some have already been added).

ID3 (Bagging) and ID3 (Adaptive Boosting) should be given different abbreviations to distinguish the two (for example, ID3-Bagg and ID3-AB).

Author Response

Response to Reviewer 1 Comments

 

Comments and Suggestions for Authors

All comments are addressed. The main contribution and some of the content mentioned in the response should be incorporated into the manuscript (some have already been added).

ID3 (Bagging) and ID3 (Adaptive Boosting) should be given different abbreviations to distinguish the two (for example, ID3-Bagg and ID3-AB).

Response: The authors would like to thank the reviewer for the kind comments and the patient review.  The main contributions and content mentioned in the response have already been incorporated, as shown in the previous response; however, a few more lines have been added in lines 590 to 600 in this regard.  Yes, the two models above have been given different abbreviations, as can be seen in lines 416 to 417.

Author Response File: Author Response.docx

Reviewer 2 Report

The Authors carefully addressed all remarks given in round 1.

Now it is much clearer what has been achieved.

The manuscript is focused on comparing the selected classifiers.

However...

it is titled "Classification and Prediction (...) as a Decision Support System (DSS)".

This second part of the title is not described. To call something a decision support system, the types of decisions should be presented and discussed based on the output (the result of classification), along with how the achieved (relatively low) errors could influence the decisions potentially taken. This cannot be found in the article.

The conclusions prove that the Authors' focus is not a DSS but the comparison of classifiers.

Nevertheless, the article is consistent, its novelty is demonstrated, and the description of the advanced calculations is clearly presented, so I strongly recommend adjusting the title of the manuscript (and perhaps the abstract) to make them consistent with the core of the article: the comparison of classifiers for the specific dataset.  Even though the DSS is presented in Figure 10, its usage, the kinds of decisions supported, and the effect of improper decisions based on predictions (which are not perfectly accurate) are not sufficiently presented and discussed.

Author Response

Response to Reviewer 2 Comments

 

Comments and Suggestions for Authors

The Authors carefully addressed all remarks given in round 1.

Now it is much clearer what has been achieved.

The manuscript is focused on comparing the selected classifiers.

However...

it is titled "Classification and Prediction (...) as a Decision Support System (DSS)".

This second part of the title is not described. To call something a decision support system, the types of decisions should be presented and discussed based on the output (the result of classification), along with how the achieved (relatively low) errors could influence the decisions potentially taken. This cannot be found in the article.

The conclusions prove that the Authors' focus is not a DSS but the comparison of classifiers.

Nevertheless, the article is consistent, its novelty is demonstrated, and the description of the advanced calculations is clearly presented, so I strongly recommend adjusting the title of the manuscript (and perhaps the abstract) to make them consistent with the core of the article: the comparison of classifiers for the specific dataset.  Even though the DSS is presented in Figure 10, its usage, the kinds of decisions supported, and the effect of improper decisions based on predictions (which are not perfectly accurate) are not sufficiently presented and discussed.

Response: The authors would like to thank the reviewer for the kind comments and the encouraging discussion.  The authors have replaced the previous title, "Classification and Prediction of Fecal Coliform in Stream Waters using Decision Trees (DTs) as a Decision Support System (DSS)", with "Classification and Prediction of Fecal Coliform in Stream Waters using Decision Trees (DTs) for Upper Green River Watershed, Kentucky, USA".  The new title indicates the comparison of classifiers for a specific watershed/dataset and also addresses the other difficulties raised above.  In addition, a few more lines have been added in lines 590 to 600.

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

