Article
Peer-Review Record

Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants

Algorithms 2021, 14(12), 348; https://doi.org/10.3390/a14120348
by Zahra Tayebi, Sarwan Ali and Murray Patterson *
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 16 October 2021 / Revised: 11 November 2021 / Accepted: 25 November 2021 / Published: 29 November 2021
(This article belongs to the Special Issue Explainable Artificial Intelligence in Bioinformatic)

Round 1

Reviewer 1 Report

In the manuscript, Tayebi Z and co-authors cluster the available SARS-CoV-2 spike variants using different clustering algorithms and feature selection approaches. The authors suggest the k-means algorithm for efficient sequence clustering with high F1 scores and efficient runtime. The method and results are described appropriately.

However, the major claim by the authors, that their clustering approach is effective in studying the ‘behavior of different known variants’, is not supported by the data and analyses presented here. Moreover, the authors' use of a large dataset to suggest k-means as an efficient approach does not significantly add to or improve our understanding of the methodology. The approach here has already been applied to similar large datasets.

Moreover, the authors do suggest exploration of additional clustering methods. In fact, in a recent publication the same group describes a clustering approach that efficiently identifies SARS-CoV-2 variants and sub-types.

Author Response

We thank the reviewer for the helpful comments, which we believe have improved this work. Please see the attached point-by-point response to each comment, as well as a revised manuscript with the changes (in red) that were made as a result.

Author Response File: Author Response.pdf

Reviewer 2 Report

The topic of the study is very timely and important, and the approaches developed will assist in future studies of the SARS-CoV-2 virus. However, I would suggest some corrections to make the text and results clearer from the practical point of view.

  1. Fig 1 could be improved by showing the exact locations of the E, M and N proteins separately. In addition, the length of the S protein in aa does not look suitable for the Fig. 1 legend because the figure is about genes.
  2. Before claiming that the method proposed in this study is better than the basic method (line 63), please describe the basic method in more detail, show a direct comparison of the characteristics of the basic and new methods, and discuss their benefits and deficiencies.
  3. There are many models for phylogenetic studies (NJ, ME, etc.). Do you suggest including the new algorithms developed in your study in one of the existing software packages, or will you develop a new one?

Author Response

We thank the reviewer for the helpful comments, which we believe have improved this work. Please see the attached point-by-point response to each comment, as well as a revised manuscript with the changes (in red) that were made as a result.

Author Response File: Author Response.pdf

Reviewer 3 Report

Review of the article:

Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants.

In this article, the authors propose an approach to cluster spike protein sequences in order to study the behavior of different known SARS-CoV-2 variants that are increasing at a very high rate throughout the world. For this, they use a k-mer-based approach to first generate a fixed-length feature vector representation for the spike sequences. Finally, the authors state that they can efficiently and effectively cluster the spike sequences based on the different variants.
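As an illustration of the k-mer representation summarized above, the following is a minimal sketch of how a fixed-length frequency vector can be built from a sequence. It is not the authors' implementation: the function name `kmer_vector`, the default `k=3`, and the 20-letter amino-acid alphabet are assumptions made for this example.

```python
from itertools import product

# Assumption: the 20 standard amino-acid letters; the manuscript's alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return counts of every possible length-k substring over `alphabet`.

    The vector always has len(alphabet) ** k entries, so sequences of
    different lengths map to vectors of the same length.
    """
    # Fixed position for each possible k-mer, in lexicographic order of `alphabet`.
    # (Rebuilt on every call for simplicity; cache it when processing many sequences.)
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of width k over the sequence and count each k-mer.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip k-mers containing unknown characters
            vector[position] += 1
    return vector
```

With a toy two-letter alphabet, `kmer_vector("ABAB", k=2, alphabet="AB")` returns `[0, 2, 1, 0]`, i.e. the counts for AA, AB, BA and BB.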

 

The article seems interesting but it lacks scientific rigor.

They mention that the optimal number of clusters is calculated using the elbow method, which is a supervised strategy, but they do not state the number of clusters used in each strategy.

There is no indication of how many groups were obtained and why.

It would be interesting to add a table with this information.
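The kind of table requested here could be produced along the following lines. This is only a sketch: it uses a plain Lloyd's k-means with random initialization as a stand-in for whatever implementation the manuscript used, and the names `kmeans_distortion` and `distortion_table` are hypothetical. The elbow heuristic as usually described relies only on the within-cluster sum of squared distances (the distortion) for each candidate number of clusters.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_distortion(points, k, iterations=50, seed=0):
    """Run plain Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def distortion_table(points, max_k):
    """Map each candidate number of clusters to its distortion."""
    return {k: kmeans_distortion(points, k) for k in range(1, max_k + 1)}
```

Reporting such a table for each feature selection strategy, together with the value of k at which the distortion curve flattens, would show how many groups were chosen and why.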

 

There is no indication of the meaning of the colors used in the t-SNE plots. Also, these colors need to be synchronized amongst all plots. In other words, a group should have the same color in all plots.

For instance, in Figure 5a, there are a lot of purple points that not only form a circular cluster on the right but also form a “river-like” structure at the top of the plot. What does this mean? Do these points belong to the same group? Why do they behave so differently and still belong to the same group?

This could be due to a misrepresentation of the groups, meaning that the authors used fewer groups than necessary; this could be detected from the distortion score of this representation.
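The request for synchronized colors across plots can be met by building a single label-to-color mapping once and reusing it for every figure. The sketch below is one way to do this. The helper names and the variant labels in the usage note are hypothetical, and the color strings follow matplotlib's "tab" palette names only as an example.

```python
# Assumption: ten colors are enough; extend the palette for more groups.
PALETTE = [
    "tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple",
    "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan",
]


def build_color_map(all_labels, palette=PALETTE):
    """Assign one fixed color per group label, in sorted label order."""
    unique = sorted(set(all_labels))
    if len(unique) > len(palette):
        raise ValueError("more groups than available colors")
    return {label: palette[i] for i, label in enumerate(unique)}


def colors_for(labels, color_map):
    """Look up the fixed color of every point in one plot."""
    return [color_map[label] for label in labels]
```

The mapping should be built from the union of the labels in all plots, for example `build_color_map(labels_plot_a + labels_plot_b)`, so that a group such as "Alpha" receives the same color even in a plot where some other group is absent. The resulting dictionary also serves directly as the legend explaining what each color means.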

 

Also, the representations after feature selection show an expected particularity that was not discussed by the authors: certain points are located in isolation in the middle of a perfectly circular region that contains no other points. This is related to the feature selection and misrepresents the real data, so it has to be corrected.

 

In conclusion the authors mention:

“We propose a feature vector representation and a set of feature selection methods to eliminate the less important features, allowing many different clustering methods to successfully cluster SARS-CoV-2 spike protein sequences with high F1 scores.”

What do they mean by successfully clustering? 

The authors do not show any measurement or statistical metric to indicate that their clustering is successful.
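The quoted conclusion refers to F1 scores. For a clustering, one way such a score could be computed is sketched below. The sketch assumes that each cluster is labeled with the majority true variant among its members and that a support-weighted F1 is then taken over the variants. This is not necessarily how the authors computed their scores, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_predictions(true_labels, cluster_ids):
    """Label every point with the most common true label of its cluster."""
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)
    majority = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }
    return [majority[cluster] for cluster in cluster_ids]


def weighted_f1(true_labels, predicted_labels):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(true_labels)
    total = len(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += f1 * count / total
    return score
```

Reporting a score like this for each clustering method and feature selection strategy in one table would make explicit what "successfully cluster" means and allow the strategies to be compared directly.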

 

Also, they use several strategies but never compare them to determine which one is the best and why.

 

Unfortunately, after careful consideration, I feel that this manuscript is not sufficient for publication, even with a deep change of form and structure.

 

Author Response

We thank the reviewer for the helpful comments, which we believe have improved this work. Please see the attached point-by-point response to each comment, as well as a revised manuscript with the changes (in red) that were made as a result.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

After a very thorough review, I have no further questions or concerns. This work is interesting and I approve this version for publication.
