Next Article in Journal
Accuracy of Three-Dimensional Soft-Tissue Prediction Considering the Facial Aesthetic Units Using a Virtual Planning System in Orthognathic Surgery
Next Article in Special Issue
Precision Medicine in Radiomics and Radiogenomics
Previous Article in Journal
Ultrathin Struts Drug-Eluting Stents: A State-of-the-Art Review
 
 
Article
Peer-Review Record

Neck Lymph Node Recurrence in HNC Patients Might Be Predicted before Radiotherapy Using Radiomics Extracted from CT Images and XGBoost Algorithm

J. Pers. Med. 2022, 12(9), 1377; https://doi.org/10.3390/jpm12091377
by Yi-Lun Tsai 1, Shang-Wen Chen 2,3,4,5, Chia-Hung Kao 5,6,7,8 and Da-Chuan Cheng 5,9,*
Reviewer 2:
Reviewer 3: Anonymous
J. Pers. Med. 2022, 12(9), 1377; https://doi.org/10.3390/jpm12091377
Submission received: 19 July 2022 / Revised: 20 August 2022 / Accepted: 23 August 2022 / Published: 25 August 2022
(This article belongs to the Special Issue Precision Medicine in Radiomics and Radiogenomics)

Round 1

Reviewer 1 Report

The authors have prevented a commendable study regarding the application of artificial intelligence in head and cancer management. Here are a few notes aimed at improving the manuscript.

Title:

The authors aimed at evaluating the effect of under-sampling and over-sampling on sensitivity. Both under-sampling and over-sampling are known techniques to handle imbalanced datasets. Therefore, these techniques definitely have an effect on important performance metrics such as sensitivity. Hence, it is important to properly justify the novelty and interest of this paper to the readership of this journal.

Abstract:

The performance of the model before under-sampling and over-sampling should be mentioned. Thus, the effects of the under-sampling and over-sampling can easily be evaluated by the reader. In addition, the under-sampling or over-sampling techniques should be mentioned.

Methodology

The composition of the data largely includes 36 patients with oropharyngeal cancer (OPC) and 43 patients with hypopharyngeal cancer (HPC). Therefore, can this still be considered viable for HNC where there was a lack of many HNC subsites in the data composition? How will this model perform if evaluated on new cases of tongue squamous cell carcinoma?

The methodology is logical. However, the model may not be generalized for HNC. Rather, it may be limited to OPC and HPC.

Obviously, there is also a case of gender bias in terms of the data used in this study. Therefore, it would be worthwhile to see how gender affects the target variable in this study.

Discussion

The paragraphs in the discussion need to be properly arranged to discuss the results obtained in this study. In this present form, it is a bit difficult to follow up with the discussion. The discussion in this present form is not discussing the results and previous studies. Thus, it becomes challenging to see the contribution of this study.

Author Response

Point1 - Title:

The authors aimed at evaluating the effect of under-sampling and over-sampling on sensitivity. Both under-sampling and over-sampling are known techniques to handle imbalanced datasets. Therefore, these techniques definitely have an effect on important performance metrics such as sensitivity. Hence, it is important to properly justify the novelty and interest of this paper to the readership of this journal.

Response 1: We have modified the title of this paper as follows: “A Study in Predicting Neck Lymph Nodes Recurrence in Patients with Head and Neck Cancer Based on an Unbalanced Dataset”

Point2 - Abstract:

The performance of the model before under-sampling and over-sampling should be mentioned. Thus, the effects of the under-sampling and over-sampling can easily be evaluated by the reader. In addition, the under-sampling or over-sampling techniques should be mentioned.

Response 2: Thank you for your comments and suggestions. The original dataset performed as follows: accuracy=0.76±0.07, sensitivity=0.44±0.22, specificity=0.88±0.06. After we used the over-sampling technique, the accuracy, sensitivity, specificity values were 0.80±0.05, 0.67±0.11, 0.84±0.05, respectively. Furthermore, after using the under-sampling technique, the accuracy, sensitivity, specificity values were 0.71±0.09, 0.73±0.13, 0.70±0.13, respectively. The suggested changes have already been added to the manuscript.  

Point3 - Methodology:

The composition of the data largely includes 36 patients with oropharyngeal cancer (OPC) and 43 patients with hypopharyngeal cancer (HPC). Therefore, can this still be considered viable for HNC where there was a lack of many HNC subsites in the data composition? How will this model perform if evaluated on new cases of tongue squamous cell carcinoma?

The methodology is logical. However, the model may not be generalized for HNC. Rather, it may be limited to OPC and HPC.

Response 3: In the era of artificial intelligence, this study aimed to use PET/CT-derived radiomics for predicting radiotherapy-based outcome in patients with head and neck cancer. Because definitive chemoradiotherapy/radiotherapy is a mainstay of treatment for patients with cancers of hypopharyx, larynx, and oropharynx, our goal is to optimize the individualized treatment decision between organ preservation and treatment modification. Of course, our findings need external validation. The base of the tongue is generally categorized as oropharyngeal cancer, whereas mobile tongue cancer belongs to cancer of oral cavity, and patients are always treated with surgery with or without adjuvant radiotherapy. In patients with buccal or lip cancer, radiotherapy is mainly used for adjuvant treatment.

Point 4: Obviously, there is also a case of gender bias in terms of the data used in this study. Therefore, it would be worthwhile to see how gender affects the target variable in this study.

Response 4: Regarding the gender bias in this study, we would like to explain that the incidence of human papilloma virus (HPV)-associated oropharyngeal cancer was very low in many Asian countries like Taiwan (<20% in our study cohort). More than 90% of our studied patients had a history of smoking, alcoholism, or betel nut squid, which is regarded a carcinogen of their cancers. On the other hand, the incidence of HPV-associated hypopharyngeal cancer was also very low in Asian countries (<10% in our study cohort). According to our cultural habit, very few women had a history of smoking, alcoholism, or betel nut chewing, this was the reason that the gender for most of the studied objects was male

Point 5: Discussion

The paragraphs in the discussion need to be properly arranged to discuss the results obtained in this study. In this present form, it is a bit difficult to follow up with the discussion. The discussion in this present form is not discussing the results and previous studies. Thus, it becomes challenging to see the contribution of this study.

Response 5: Thank you for your comment. The paragraphs have been re-arranged to make the chapter more comprehensive. We also discuss the results and previous studies.

Reviewer 2 Report

Tsai et al. report CT radiomics in head neck malignancies using over-sampling and under-sampling techniques. Kindly find my comments below

Comments:

1. Include details of CT acquisition parameters and machine details in the methodology.

2. Will suggest using “primary tumor/ disease” instead of “tumor in situ”

3. To elaborate on image preprocessing, details of radiomics feature

4. To mention who had done the image segmentation eg. Radiologist? (xx years of experience)

5. To mention in selection criteria if all patients were required to have minimal follow up duration

6. To specify validation techniques

7. Authors need to present classifier performances using lymph node data alone since clinical data alone has good results and values went down when combined features were used from primary, nodes along with clinical data 

Author Response

Point 1: Include details of CT acquisition parameters and machine details in the methodology.

Response 1: The patients were scanned using a PET/CT scanner (PET/CT-16 slice, Discovery STE; GE Medical System, Milwaukee, WI, USA). The patients were instructed to fast for at least 4 hours before the administration of 18F-FDG, and FDG PET/CT imaging was conducted approximately one hour after the administration of 370 MBq of 18F-FDG. The details of CT image acquisition has been described in detail in chapter 2.2.

Point 2: Will suggest using “primary tumor/ disease” instead of “tumor in situ”

Response 2: Thank you for your suggestion. We have replaced “tumor in situ” with “primary tumor” in our manuscript.

Point 3: To elaborate on image preprocessing, details of radiomics feature

Response 3:

The part describing radiomics has been included in chapter 2.4 and is as follows:

The term radiomics was first coined by Lambin et al. 2012, to describe quantitative medical imaging data. Radiomics involves extracting high-throughput data from medical images such as CT, PET, MRI or SPECT scans through advanced mathematical and statistical analysis. Radiomic has several types of features. These are Shape-based features, First-order statistics, Gray Level Run Length Matrix (GLRLM), Gray Level Co-Occurrence Matrix (GLCM), Gray Level Size Zone Matrix (GLSZM), Neighboring Gray Tone Difference Matrix (NGTDM), Gray Level Dependence Matrix (GLCM), A Laplacian of Gaussian (LoG) features and Wavelet features. Radiomics explores the intensity, shape and texture of the tumor in order to calculate thousands of advanced features.

Point 4: To mention who had done the image segmentation eg. Radiologist? (xx years of experience)

Response 4: The segmentation of the images was done by members of our team and the processed images were then approved by a physician with more than 30 years of experience.

Point 5: To mention in selection criteria if all patients were required to have minimal follow up duration

Response 5: The minimal follow up duration for patients was 12 months.

Point 6: To specify validation techniques

Response 6: After utilizing the over-sampling technique and under-sampling techniques, we used the XGBoost algorithm to analyze the data which was separated into a training set (70%) and test set (30%). The analysis was run one hundred times, each time with randomly chosen sets of patients. Due to the small cohort size, we didn’t have a validation set, therefore, the test data was used to the validate the algorithm.

Point 7: Authors need to present classifier performances using lymph node data alone since clinical data alone has good results and values went down when combined features were used from primary, nodes along with clinical data

Response 7: Local recurrence is defined as a cancer that recures in the same place as the original cancer or very close to it. Regional recurrence (neck recurrence in head and neck) means that the tumor has grown into lymph nodes or tissues near the original cancer (National Cancer institution https://www.cancer.gov/types/recurrent-cancer). Therefore, we used lymph nodes to predict neck recurrence. When we added the primary tumor data (used for predicting local recurrence) the performance in predicting neck recurrence decreased.

Reviewer 3 Report

general comments:

* low cohort size: the 2022 edition of the HECKTOR data challenge has 489 patients with recurrence free survival data and PET-CT images and will most likely be more impactful. Investigation should be made to see if this dataset can be used for your research, to validate your pipeline.

*  lengthy details of machine learning algorithm tuning: these details are moderately relevant for a medical journal and its target audience.

* not enough clarification on the choice of machine learning algorithm: why use xgboost when other (simpler) algorithms could be used ? You stated that random forest might be worse, but it would be easy to check it directly on your dataset and compare different algorithms performance using cross validation on the training set.

* the choice of the metrics do not correspond to the censored nature of your  data. Recurrence is not a simple classification task, it requires to take into account censorship. As such, classification metrics are not relevant, and C index should be used to evaluate model performance.: your problem should be reframe as a survival task rather than just a classification task.

* it is unclear if the performance reported are for the training set or the test set. Moreover, the xgboost algorithm uses a validation set to monitor the overfitting during training, like deep learning algorithms. It should be clarified if the test set has been used or not as a validation set for xgboost. If the case, all results are most likely over optimist.

materials and methods:

- this paragraph should not detail the patient characteristics but how the cohort has been  collected. It is unclear why do we go from 110 to 79 patients in the selection process.

- why not use the PET images to extract radiomics, which are most likely more relevant for RN/LR/DM ?

- table 1:

* please provide percentage rather than raw values

* max diameter of LN: in which unit ? cm I think.

* define Necrotic LN

Results

section 3.1 : "We found that radiomics derived from CT was superior 159 than that of PET images for predicting the neck node recurrence". This analysis was not described in the M&M section where you stated you will only use CT radiomics.

Figures:

* figure 8  : feature importance when feature name is not very self describing does not add value to this research.

Author Response

Point 1: low cohort size: the 2022 edition of the HECKTOR data challenge has 489 patients with recurrence free survival data and PET-CT images and will most likely be more impactful. Investigation should be made to see if this dataset can be used for your research, to validate your pipeline.

Response 1: Thank you for your suggestion. We have requested access to the open data set images from Dr.Vallières (the Canadian data set resource of HECKTOR data challenge) on March 2022 at https://www.cancerimagingarchive.net/. After we received the information, we found out the data set has three outcomes: Local recurrence, Distant Metastasis and Death. However, in our research, our end point is neck (lymph nodes) recurrence which wasn’t recorded in the HECKTOR data challenge clinical data. Hence, due to the different end point, we couldn’t use the dataset to validate our results.

Point 2: lengthy details of machine learning algorithm tuning: these details are moderately relevant for a medical journal and its target audience.
Response 2: Thank you for your suggestion. We edited the text to properly reflect the tuning process in chapter 3.2

Point 3: not enough clarification on the choice of machine learning algorithm: why use xgboost when other (simpler) algorithms could be used ? You stated that random forest might be worse, but it would be easy to check it directly on your dataset and compare different algorithms performance using cross validation on the training set.

Response 3: Thank you for your advice. We originally did use the Random Forest algorithm; however, the value of accuracy was 0.61, sensitivity=0.59, specificity = 0.63, which were unsatisfactory. Therefore, we opted to use the XGBoost algorithm instead which offered better results. The values when using Random Forest can be seen in Table 2 which we included to showcase the difference.

Point 4: the choice of the metrics do not correspond to the censored nature of your data. Recurrence is not a simple classification task, it requires to take into account censorship. As such, classification metrics are not relevant, and C index should be used to evaluate model performance.: your problem should be reframe as a survival task rather than just a classification task.

Response 4: We discussed this topic with an experienced clinical physician, and the reason why we decided to conduct research into recurrence rather than survival was because early recognition of patients at risk for nodal failure after curative nonsurgical treatment can optimize the individual treatment schemes by reducing the number of patients undergoing unsuitable treatment. Therefore, we focused on whether there will be recurrence rather than survival. This way, patients can be treated alternatively.

Point 5: it is unclear if the performance reported are for the training set or the test set. Moreover, the xgboost algorithm uses a validation set to monitor the overfitting during training, like deep learning algorithms. It should be clarified if the test set has been used or not as a validation set for xgboost. If the case, all results are most likely over optimist.

Response 5: Due to the small cohort of our patients, we only have training and test data sets. The test data was used as a validation set for the algorithm.

Point 6: materials and methods: this paragraph should not detail the patient characteristics but how the cohort has been collected. It is unclear why do we go from 110 to 79 patients in the selection process

Response 6: There are two reasons why we downsized the patient population from 110 to 79. First, the CT/PET images of 17 patients couldn’t be loaded in the Lifex segmentation software. This was most likely due to an incompatible format. We attempted to obtain the data with the compatible format, however, the data we received was the only data available. Second, 14 patients only had primary tumors, with no obvious lymph nodes that could be seen in the images.

Point 7: why not use the PET images to extract radiomics, which are most likely more relevant for RN/LR/DM ?

Response 7: The reason for choosing CT images is touched upon in chapter 3.1. After using the 8 extractors from radiomics to extract features from the 18F-FDG PET/CT images which were used to predict the neck recurrence, we found that, by an-alyzing PET images with lymph node masks there were 105 parameters in total with p-value < 0.005. By analyzing the CT images with lymph node masks, we found 258 parameters in total with p-value < 0.005. We found that radiomics derived from CT was superior than that of PET images for predicting the neck node recurrence. There-fore, the subsequent analysis used the CT scans with image features extracted by the exampleCT extractor combined with the clinical data for machine learning.

Point 8: table 1:

* please provide percentage rather than raw values

* max diameter of LN: in which unit ? cm I think.

* define Necrotic LN

Response 8: Thank you for your suggestions. All three points have been revised.

Point 9: Results - section 3.1 : "We found that radiomics derived from CT was superior 159 than that of PET images for predicting the neck node recurrence". This analysis was not described in the M&M section where you stated you will only use CT radiomics.

Response 9: Thank you for your comment. The selection process for the images has been described and included in chapter 2.2.

Point 10: Figures - figure 8 : feature importance when feature name is not very self describing does not add value to this research.

Response 10: A list of significant features with their names was added. It can be found in the Appendix.

 

 

Top 10 Features of Figure 8. (A)(B)

Feature 531

original_firstorder_Median

Feature 1037

wavelet-LHH_glszm_GrayLevelNonUniformityNormalized

Feature 715

wavelet-HHL_firstorder_Mean

Feature 787

wavelet-HHL_glszm_SizeZoneNonUniformityNormalized

Feature 87

diagnostics_Mask-interpolated_Mean

Feature 717

wavelet-HHL_firstorder_Median

Feature 529

original_firstorder_Mean

Feature 12

N-SUVmax >4.9

Feature 778

wavelet-HHL_glszm_GrayLevelNonUniformity

Feature 308

log-sigma-3-0-mm-3D_gldm_DependenceVariance

 

 

Top 10 Features of Figure 8. (C)(D)

Feature 8

lesion site 1)Opx   2)HPx

Feature 203

log-sigma-2-0-mm-3D_glcm_DifferenceAverage

Feature 1037

wavelet-LHH_glszm_GrayLevelNonUniformityNormalized

Feature 527

original_firstorder_Kurtosis

Feature 246

log-sigma-2-0-mm-3D_glrlm_ShortRunEmphasis

Feature 717

wavelet-HHL_firstorder_Median

Feature 469

log-sigma-5-0-mm-3D_glcm_InverseVariance

Feature 7

alcohol 1) Yes   2) No

Feature 78

diagnostics_Image-interpolated_Size

Feature 985

wavelet-LHH_glcm_ClusterProminence

 

 

Top 10 Features of Figure 8. (E)(F)

Feature 1037

wavelet-LHH_glszm_GrayLevelNonUniformityNormalized

Feature 7

alcohol 1) Yes   2) No

Feature 794

wavelet-HLH_firstorder_10Percentile

Feature 1

diagnosis: 1) OPC; 2) HPC

Feature 10

SUVmax

Feature 802

wavelet-HLH_firstorder_MeanAbsoluteDeviation

Feature 446

log-sigma-5-0-mm-3D_firstorder_Median

Feature 255

log-sigma-2-0-mm-3D_glszm_LargeAreaLowGrayLevelEmphasis

Feature 1300

wavelet-LLL_glszm_LargeAreaLowGrayLevelEmphasis

Feature 270

log-sigma-3-0-mm-3D_firstorder_Kurtosis

 

 

Round 2

Reviewer 1 Report

The authors have addressed most of the comments. Although, the rationale behind some of the responses may lead to further concerns by the readership of this journal.

Minor comments:

1. The title of this paper should relate to the stated objectives of the study as mentioned in the introduction. In the present form, there seems no correlation between the title and the objectives.

2. I strongly think that the composition of the dataset used in this study does not qualify to be addressed as head and neck cancer. Therefore, the limitations in terms of the absence of other subsites should be mentioned in the study.

3. Handling imbalance dataset with undersampling or oversampling is not novel. Handling imbalanced data should be standard practice when dealing with an imbalanced dataset. A study that would revolve around a new technique for handling data imbalance would be worthwhile. However, in the present form of this study, it is still uncertain what this study would add to the readership of this journal.

4. Some of these points may be mentioned as part of the limitations so that the readers can clearly understand this study.

Author Response

Minor comments:

Point 1: The title of this paper should relate to the stated objectives of the study as mentioned in the introduction. In the present form, there seems no correlation between the title and the objectives.

Response 1: Thank you for your suggestion. We have modified the title of this paper as follows:” Neck Lymph Node Recurrence of HNC Patients Might Be Predicted Before Radiotherapy - Using Radiomics Extracted From CT Images and XGBoost Algorithm”.

Point 2: I strongly think that the composition of the dataset used in this study does not qualify to be addressed as head and neck cancer. Therefore, the limitations in terms of the absence of other subsites should be mentioned in the study.

Response 2: Thank you very much for your suggestion. We have added the following section where we discuss this limitation to the discussion.

“This study had some limitations. First, external validation with independent data sets is essential to create the model’s clinical utility because this study was conducted at a single institute. Second, although organ preservation with definitive radiotherapy is a mainstay of treatment for patients with cancers of hypopharynx or oropharynx, our findings should be interpreted cautiously because of the absence of other subsites of head and neck cancers. Nonetheless, this study represents a step toward enabling customization of precision therapy for head and neck cancer patients when the optimal management of the metastatic neck node becomes an issue of debate. After further validation and inclusion of more tumor sites, oncologists may use the proposed model to advise patients on the relative suitability of treatment options.”

Point 3: Handling imbalance dataset with undersampling or oversampling is not novel. Handling imbalanced data should be standard practice when dealing with an imbalanced dataset. A study that would revolve around a new technique for handling data imbalance would be worthwhile. However, in the present form of this study, it is still uncertain what this study would add to the readership of this journal.

Response 3: Thank you for your comment. The under-sampling and over-sampling techniques were part of this study, however, the main propose of this study was to see how information gathered from lymph nodes, rather the primary tumors, could be used to predict regional recurrence (neck recurrence in HNC). We found that the imbalanced data strategy improved the results, however, the strategy itself was not the focus of this study. The main focus of this study was to show the importance of indicators of personal treatment. The outcome of metastatic neck nodes in patients with HNC receiving radiotherapy for organ preservation can be predicted based on the results of machine learning. This way, patients can receive personalized treatment.

We modified several parts of the paper to highlight the main objective of this study.

Point 4: Some of these points may be mentioned as part of the limitations so that the readers can clearly understand this study.

Response 4: Thank you for your suggestion. We addressed your points by modifying several parts of the paper, including the title, discussion, and conclusion.

Reviewer 2 Report

Thank you for answering my comments

Author Response

Dear Reviewer,

Thank you very much for your insightful suggestions. I am glad I was able to succesfully address your comments. 

Kind regards

Yi-Lun Tsai

Reviewer 3 Report

Response 1: thanks for the clarification.

Response 2: it is indeed much better.

Response 3: The clarification is nice, but the reason xgboost is better is overfit. During the training of xgboost you have to watch the validation set, and stop at an appopriate timing when the performance is best. For this reason, you may have overfit to your validation set. Lack of test set is critical for such algorithms, whereas not so much for random forest where you can use out of bag estimation of performance on the train set without "watching" your validation set for performance. There is indeed usually a better performance of boosted trees vs random forest, but not that much. 

Response 4: You missed the point here. It was not about local recurrence vs PFS, it was about the censored nature of your data. You cannot use simple classification when there is censored data, you have to use algorithm that take into account the censorship. I strongly suggest you drop completely xgboost, and use random survival forest (you can find a nice package in R that will do just that: ranger).

Response 5: This is a critical design flaw when using xgboost. cf Response 3 and 4 for improvement.

Response 6: This is a bit disappointing. There are lot of software out there that can be used to extract radiomics. I suggest you use Pyradiomics python package for this task. The removal of 17 patients due to software not working is not likely to induce any bias in the results, but what will happen in real life if you cannot extract the radiomics for a specific patient ?

Response 7: Ok

Response 8: define Necrotic LN not adressed.

Response 9: adressed

Response 10: adressed

Additionnal point: regarding clarification of PET CT acquisition, it is very strange that all patients were injected with the same FDG activity (370MBq). The dosage is weight dependant (usually between 2.5 and 3.5 MBq/kg).

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Back to TopTop