Classification of Left and Right Coronary Arteries in Coronary Angiographies Using Deep Learning
Round 1
Reviewer 1 Report
The work is complete and comprehensive, with convincing results. Would the authors provide some comments about integrating this research into an actual diagnostic situation?
Author Response
Response to Reviewer 1 Comments:
Dear reviewer
Thank you for taking the time to review this manuscript.
Point 1:
Comment: The work is complete and comprehensive, with convincing results. Would the authors provide some comments about integrating this research into an actual diagnostic situation?
Response: This is an excellent point. The findings presented in this paper can be used to prepare and curate CAG videos. The first important task is to distinguish between the left coronary artery (LCA) and the right coronary artery (RCA). Once LCA and RCA have been distinguished, separate models for automatic, quantitative stenosis assessment can be developed for each artery. These models must be trained and developed before this research can be integrated into a diagnostic situation (future work). After the models have been successfully trained and developed, they can be deployed in a diagnostic setting, where classifying LCA and RCA is essential to trigger the correct artery-specific model. We have tried to clarify the impact of this research and how to integrate it into a diagnostic situation.
Change in text: Please see lines 29-46, 60-61, 72-74, 86-89, and Figure 1.
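For illustration, a minimal sketch of this routing step, in which the LCA/RCA classifier selects the artery-specific stenosis model; all names are hypothetical placeholders, not the paper's code:

```python
# Hypothetical routing sketch: the LCA/RCA classifier decides which
# artery-specific stenosis model processes the video.
def assess_stenosis(video, artery_classifier, stenosis_models):
    artery = artery_classifier(video)      # assumed to return "LCA" or "RCA"
    return stenosis_models[artery](video)  # dispatch to the matching model
```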
Author Response File: Author Response.pdf
Reviewer 2 Report
Major concerns:
The authors used several different deep learning models to classify LCA and RCA. This is a typical classification task, and the technical innovation is overall very limited. Therefore, it is important to describe the clinical significance of this task. I do not think the authors explained the clinical importance of this classification task clearly. The authors should revise this part carefully; otherwise the significance of this work remains very unclear.
Although the authors were using existing model architectures, these models should be described in more detail.
In line 167, the author said, ‘using the previously described preprocessing in section 2.3’. However, the paper doesn’t have section 2.3.
Minor:
Line 176, 'non-linear relationship between the scanner parameters and LCA and RCA'. The expression is a bit strange. Between the parameters and the performance? That is definitely not a linear relationship.
The authors should carefully check the writing.
Author Response
Response to Reviewer 2 Comments:
Dear reviewer
Thank you for taking the time to review this manuscript.
Point 1:
Comment: The authors used several different deep learning models to classify LCA and RCA. This is a typical classification task, and the technical innovation is overall very limited. Therefore, it is important to describe the clinical significance of this task. I do not think the authors explained the clinical importance of this classification task clearly. The authors should revise this part carefully; otherwise the significance of this work remains very unclear.
Response: We see the point of being very specific and have clarified the clinical impact of this research.
Change in text: Please see lines 29-46, 60-61, 72-74, 86-89, and Figure 1.
Point 2:
Comment: Although the authors were using existing model architectures, these models should be described in more detail.
Response: Thank you for pointing this out. We acknowledge that the descriptions of the model architectures could be improved, and we have expanded them.
Change in text: Please see lines 119-120, 132-141, 145-153, 156-163 and Table 1.
Point 3:
Comment: In line 167, the author said, 'using the previously described preprocessing in section 2.3'. However, the paper does not have a Section 2.3.
Response: We apologize. This has been changed.
Change in text: Please see line 191.
Point 4:
Comment: Line 176, 'non-linear relationship between the scanner parameters and LCA and RCA'. The expression is a bit strange. Between the parameters and the performance? That is definitely not a linear relationship.
Response: We agree, and we have now removed the comment about the non-linear relationship between parameters and performance.
Change in text: Please see line 199.
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper proposes a method for classifying left and right coronary arteries using deep learning networks. It has been tested on multiple datasets, all of which achieve good results, and it has clear practical application value. However, the main network framework is not clear enough in Figure 1: the feature layers of the deep learning network are not shown, and there is not much description in the following text. The main problem of this article is that the proposed framework is not clearly expressed. After careful review, the following comments are put forward:
1. In the dataset setting section, the dataset is divided into training and validation sets in a ratio of four to one. Please explain the significance of this ratio.
2. Why does the baseline section choose to use two MLPs, ignoring more mature methods such as SVM?
3. In the second part, models such as R(2+1)D and X3D are also mentioned. What is the role of these models in the overall framework, and how are they stacked with the MLP structure mentioned above?
4. In the experimental results section, the results obtained by the MVIT-32 model show large fluctuations compared with the results of other models. Please explain this.
Author Response
Response to Reviewer 3 Comments:
Dear reviewer
Thank you for taking the time to review this manuscript.
Point 1.
Comment: This paper proposes a method for classifying left and right coronary arteries using deep learning networks. It has been tested on multiple datasets, all of which achieve good results, and it has clear practical application value. However, the main network framework is not clear enough in Figure 1: the feature layers of the deep learning network are not shown, and there is not much description in the following text. The main problem of this article is that the proposed framework is not clearly expressed.
Response: Thank you for pointing this out. We have improved the description of the framework. We agree that visualizing the feature maps learned by the deep learning models would be interesting. However, this is difficult in practice, as deep learning models learn thousands of feature maps, ranging from low-level features (blobs and edges) to higher-level features, so it would be challenging to pick out the feature maps that matter for the model's decisions. Methods do exist for visualizing the feature maps of significance for the classification task; these are known as class activation maps, e.g., Grad-CAM [1]. However, such methods are not perfect and are not guaranteed to attend to the same features as humans [2], even when the models have high predictive performance.
[1] Selvaraju, Ramprasaath R., et al. "Grad-CAM: Visual explanations from deep networks via gradient-based localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[2] Arun, Nishanth, et al. "Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging." Radiology: Artificial Intelligence 3.6 (2021).
Change in text: Please see lines 29-46, 60-61, 72-74, 86-89, and Figure 1.
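For illustration, a minimal sketch of how a Grad-CAM-style class activation map could be computed in PyTorch; the 2D ResNet-18 backbone, the hooked layer, and the random input are hypothetical stand-ins (the paper's models are video networks), not the authors' code:

```python
import torch
import torch.nn.functional as F
import torchvision

# Minimal Grad-CAM sketch. The 2D ResNet-18 and its last conv block are
# illustrative; for a video model, a 3D feature layer would be hooked instead.
model = torchvision.models.resnet18(weights=None).eval()
target_layer = model.layer4[-1]

activations, gradients = {}, {}
target_layer.register_forward_hook(
    lambda m, i, o: activations.update(value=o.detach()))
target_layer.register_full_backward_hook(
    lambda m, gi, go: gradients.update(value=go[0].detach()))

x = torch.randn(1, 3, 224, 224)        # placeholder input frame
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the top class score

# Weight each feature map by its average gradient, combine, and normalize.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heat map in [0, 1]
```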
Point 2.
Comment: In the dataset setting section, the dataset is divided into training and validation sets in a ratio of four to one. Please explain the significance of this ratio.
Response: We constructed two datasets. The first dataset contains 3,545 videos used for training/validation. This dataset was split into 80% training and 20% validation, giving 2,836 videos for training and 709 videos for validation. It is important to allocate the greater portion to training, as the models need enough data to learn from; the validation set is used only during development and to guard against overfitting. We chose 80% for training and 20% for validation because it is common practice in machine learning [3]. The second dataset is used for testing, meaning we make no model selection based on it and only report results on it (so the dataset is unbiased). This dataset contains 520 videos.
[3] Gholamy, Afshin, Vladik Kreinovich, and Olga Kosheleva. "Why 70/30 or 80/20 relation between training and testing sets: a pedagogical explanation." (2018).
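For concreteness, a minimal sketch of such an 80/20 stratified split in scikit-learn; the identifiers and labels below are placeholders, not the actual dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholders standing in for the 3,545 training/validation
# videos and their LCA/RCA labels; not the authors' actual data loading.
video_ids = [f"video_{i:04d}" for i in range(3545)]
labels = ["LCA" if i % 2 == 0 else "RCA" for i in range(3545)]

# 80/20 split, stratified so both classes keep their proportions.
train_ids, val_ids, train_labels, val_labels = train_test_split(
    video_ids, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_ids), len(val_ids))  # 2836 709
```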
Point 3.
Comment: Why does the baseline section choose to use two MLPs, ignoring more mature methods such as SVM?
Response: Thank you for this comment. We have now added an SVM model to the baseline section. The baseline experiments aim to highlight the performance gains of non-linear over linear models with the scanner parameters as inputs. In other words, we would like to test the predictive power of the scanner parameters without using the videos as inputs. For the non-linear models, we chose neural networks with one hidden layer and with two hidden layers, as neural networks are highly flexible models: even a shallow network (e.g., a two-layer neural network) can approximate almost any function, particularly those of interest when modeling tabular data. Based on these observations, we believe that a one-layer and a two-layer neural network are well suited as baselines.
Change in text: Please see lines 127-129 and Table 2.
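As an illustration, a minimal sketch of such baselines in scikit-learn; the synthetic features, hidden-layer sizes, and hyperparameters are assumptions for demonstration, not the paper's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scanner-parameter features and LCA/RCA labels.
X, y = make_classification(n_samples=3545, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

baselines = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "mlp_one_hidden": make_pipeline(
        StandardScaler(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000)
    ),
    "mlp_two_hidden": make_pipeline(
        StandardScaler(), MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000)
    ),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_val, y_val))  # validation accuracy per baseline
```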
Point 4.
Comment: In the second part, models such as R(2+1)D and X3D are also mentioned. What is the role of these models in the overall framework, and how are they stacked with the MLP structure mentioned above?
Response: The MLP structure mentioned above is used only for the baseline, to explore the predictive performance of the scanner parameters alone (no videos are used as input); it is not stacked with the video models. We experimented with three different model architectures to investigate which settings are important for training an LCA and RCA video classification model. Another reason for choosing three different architectures was to show robustness across architectures and avoid bias towards one specific model. We have improved the description of the models used for classifying coronary angiography videos.
Change in text: Please see lines 119-120, 132-141, 145-153, 156-163 and Table 1.
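For illustration, a minimal sketch of adapting one of the mentioned video backbones, R(2+1)D, to the binary LCA/RCA task; the torchvision model, clip shape, and label order are assumptions, not the authors' training code:

```python
import torch
import torchvision

# Hypothetical sketch: an R(2+1)D backbone with its classification head
# replaced for binary LCA/RCA prediction; not the authors' exact code.
model = torchvision.models.video.r2plus1d_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # two classes: LCA, RCA

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
logits = model(clip)
pred = "LCA" if logits.argmax(dim=1).item() == 0 else "RCA"  # label order assumed
```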
Point 5.
Comment: In the experimental results section, the results obtained by the MVIT-32 model show large fluctuations compared with the results of other models. Please explain this.
Response: The fluctuations observed in Figure 2 are probably caused by the fact that Vision Transformer models tend to require more data than CNN models, so pretraining might be crucial for good performance. This pattern was also observed by the inventors of the Vision Transformer (Dosovitskiy et al.) [4], who argue that when training on a small or mid-size dataset, CNN models are superior to Vision Transformers because CNNs introduce inductive biases such as translation equivariance and locality. This picture changes, however, if the Vision Transformer has been pretrained on a much larger dataset and then applied to the same small or mid-size dataset; the hypothesis is that the Vision Transformer can learn the same kind of inductive bias from the data itself.
[4] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
Change in text: Please see lines 229-233.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The authors have addressed my concerns.
Reviewer 3 Report
My concerns have been addressed.