Focused and TSOM Images Two-Input Deep-Learning Method for Through-Focus Scanning Measuring
Round 1
Reviewer 1 Report
Dear Editor and Authors,
I have carefully read the paper by Zhang and co-workers entitled “Focused and TSOM images two-input deep-learning method for through-focus scanning measuring”. The authors propose the use of convolutional neural networks (CNNs) to predict possible fabrication defects in the semiconductor industry using an optical device. Their proposal allows errors to be detected even at sizes slightly below the optical diffraction limit. In particular, the authors feed the CNN with images obtained by means of through-focus scanning optical microscopy (TSOM). The authors describe in detail the TSOM technique, the TSOM experimental set-up, the data acquisition, and the CNN architecture. Finally, the authors compare the results of their CNN with other machine and deep learning approaches, and perform an analysis of the sensitivity of their approach to defocusing.
The authors have tackled a problem of great technological interest. The proposal is ingenious, and the results are promising. The experiments have been mostly adequately designed (see comments below) and mostly properly carried out. The manuscript is concise, well organized, well written (but for a few points, see comments below), and mostly understandable. Nevertheless, I identify a few weaknesses (see comments below) that should be addressed before I could recommend acceptance of this manuscript.
Comments on the CNN architecture:
Section 2.3: In the second and third channels (Figure 10), the final layers, before merging, have sizes of 6x6x96, while that of the first channel is 23x23x96. In the next step, the features are merged into an array of size 23x23x288. My question is: how can you stack arrays of size 6x6x96 with an array of size 23x23x96? Is some kind of zero padding (or similar) applied to the smaller ones so that their size increases to 23x23? Please, explain.
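To make the issue concrete, the following minimal sketch (assuming a PyTorch-style, channels-first layout and a channel-wise concatenation, which is what figure 10 seems to imply; the framework actually used in the paper is not stated) reproduces the mismatch:

```python
import torch

# Hypothetical feature maps with the sizes reported in figure 10 (batch size 1):
# 23x23x96 for the first channel, 6x6x96 for the second and third channels.
ch1 = torch.zeros(1, 96, 23, 23)
ch2 = torch.zeros(1, 96, 6, 6)
ch3 = torch.zeros(1, 96, 6, 6)

# Concatenating along the channel dimension requires identical spatial sizes,
# so this call raises a RuntimeError; to obtain the stated 23x23x288 array,
# ch2 and ch3 would first have to be zero-padded or upsampled to 23x23.
merged = torch.cat([ch1, ch2, ch3], dim=1)
```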
In addition, if you apply a 5x5 convolution layer to the input 89x89 image, there is no way you end up with an array of size 45x45. If no padding is used, the resulting feature maps would have size 85x85. Likewise, if padding is used, they would have size 89x89. Actually, this is what happens in the first channel. The same applies to the 7x7 convolution. Judging by the idea behind MCNN (ref. 23), and in view of my previous comment, there must be something wrong with the scheme of figure 10: not only the stated layer sizes, but also the depiction. The first convolutional layers for channels 1 to 3 within the “TSOM images processing” block should all have size 89x89xnumber_of_filters. Once the rest of the operations are applied (convolutions and max-pooling), the final convolutional layers of the three channels would all have size 23x23. I do not know whether this is only an error in the figure or in the implemented CNN architecture as well. Please, correct figure 10 and/or explain in detail.
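For reference, the standard output-size relation for a convolution with input size i, kernel size k, padding p and stride s is output = floor((i + 2p - k)/s) + 1; with i = 89, k = 5 and s = 1 this gives 85 without padding (p = 0) and 89 with “same” padding (p = 2), which is consistent with the sizes expected above.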
In addition, if the structure of reference 23 is used, then the third channel should always use convolutions of size 7x7, whereas figure 10 shows that the second convolution of the third channel uses 5x5. Please, correct or explain the change.
In the “merging and output” block, the authors use a dense NN, but no details are provided and, accordingly, the results cannot be reproduced. Please, describe the architecture of the dense NN: number of layers, number of neurons per layer, activation functions, etc. Do you apply dropout and/or batch normalization? In addition, which optimizer is used? What is the learning rate? And the batch size and number of training epochs? How do you avoid overfitting? All this information is needed.
Comments on comparison with other methods:
Section 3.1: The authors compare their method to other CLASSIFICATION algorithms, and use the MSE and MAE as comparison metrics. That comparison is not completely fair, as MSE and MAE are not metrics for classification problems. It is like comparing apples to pears. If the authors want to compare their method against others, it should be done with other REGRESSION models. In this particular case, they should use ResNet50 and DenseNet121 in regression mode, not in classification mode (!). AdaBoost and library-matching should be left aside, as there is no way of doing regression with them, although AdaBoost could optionally be replaced by LightGBM, XGBoost, or other boosting algorithms that allow for regression. Then, the comparison would be much fairer. Please, revisit this section. In addition, the authors should state in the manuscript whether the other models were used with the same hyperparameters reported in the corresponding papers, and add the corresponding references again as well.
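As an illustration of what “regression mode” means here, a minimal Keras-style sketch could look as follows (this is only an assumption of how it might be set up: the input size, backbone configuration and training settings are placeholders, and the authors' actual pipeline may differ):

```python
import tensorflow as tf

# Hypothetical input shape; replace with the actual focused/TSOM image dimensions.
inputs = tf.keras.Input(shape=(89, 89, 3))

# ResNet50 backbone without its 1000-class softmax head.
backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                          pooling="avg")
features = backbone(inputs)

# A single linear output neuron: the network now predicts a continuous linewidth
# (in nm) instead of a class label, so MSE and MAE become meaningful metrics.
linewidth = tf.keras.layers.Dense(1, activation="linear")(features)

model = tf.keras.Model(inputs, linewidth)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```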
Comment on sensitivity to “focused image” defocusing:
Section 3.2: That experiment does not seem to be well designed. To check the sensitivity or generalization capability of the CNN against defocusing of the “focused image”, there are two options: a) train and evaluate the CNN with a pool of both focused and defocused images (0nm, ±200nm, ±400nm AND ±600nm); b) train the CNN with images under a given [de]focusing condition (0nm, for example), and evaluate the performance with a test set of images under completely different defocusing conditions. With option a) the CNN would be capable of allowing for defocusing, whereas option b) would give a sense of the sensitivity, which is actually what the authors were trying to assess. As it is currently done, or at least as I understand it, the CNN is trained to recognize a defocusing configuration and, in principle, will work well only for those defocusing conditions, but not for others. That is, if you train the CNN with images defocused by +200 nm, it will perform well when other +200 nm defocused images are tested, but the performance would worsen if tested with a -400 nm defocused image. Thus, I would recommend the authors to carry out either experiment a) or experiment b). I guess option b) would be much more useful.
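To make option b) concrete, the split could be organized as in the following minimal sketch (purely illustrative: it assumes, hypothetically, that every sample is stored together with its defocus offset, which is not how the dataset is described in the manuscript):

```python
# Hypothetical dataset: a list of (focused_image, tsom_image, linewidth, defocus_nm)
# tuples; the actual data organization in the paper may differ.
def split_by_defocus(samples,
                     train_offsets=(0,),
                     test_offsets=(-600, -400, -200, 200, 400, 600)):
    """Option b): train on a single defocus condition, test on all the others."""
    train = [s for s in samples if s[3] in train_offsets]
    test = [s for s in samples if s[3] in test_offsets]
    return train, test
```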
Other comments and suggestions:
Line 124: “the field diaphragm lens” should be “the field diaphragm (FD), lens”
Figure 4 and Figure 6 could be merged into a single figure to shorten the manuscript. Although this is optional, of course.
In figure 7a, remove “nm” from the x-axis.
Section 2.2: The authors state that they have used 37 different sizes of target, and that the dataset has 34000 images. The question is: are there only 37 different targets, or are there several targets for each of the sizes? That is, do the 34000 images correspond to unique targets, or have several images been taken of a single target? Please, explain.
Same section: The gold lines are placed so that they are in the middle of the image. What is the effect of shifting them out of the center? Would it not be better to place them randomly or subject them to small deviations? You cannot always control this centering in a real scenario.
Lines 272-273: y_hat is the predicted value, not the measured value.
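For clarity, the standard definitions are MSE = (1/n) * Σ_i (y_i - y_hat_i)^2 and MAE = (1/n) * Σ_i |y_i - y_hat_i|, where y_hat_i denotes the value predicted by the model for sample i and y_i the corresponding measured (ground-truth) value; the text around these equations should be made consistent with this.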
Lines 282-284: The authors state that “Multiple measurements using our two-input deep-learning TSOM method are made for the gold lines target with the linewidths range of 247-1010 nm, and the average value is taken as the experimental results.” What do you mean by the “average”? Could you explain this in more detail?
Figure 11: does it depict the test set? If so, please specify this somewhere.
Figure 11: Could the authors add histograms for a couple of examples? That way, how well the model performs would be better appreciated. Currently, it is difficult to distinguish whether the points are clustered close to zero or evenly distributed. With a histogram, one could evaluate the mean, the mode, etc.
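As an illustration, such a residual histogram could be produced with something like the following sketch (the random numbers are placeholders used only to make the snippet self-contained; y_true and y_pred would be the measured and predicted linewidths on the test set):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the measured (y_true) and predicted (y_pred)
# linewidths, in nm, on the test set.
rng = np.random.default_rng(0)
y_true = rng.uniform(247, 1010, size=500)
y_pred = y_true + rng.normal(0.0, 5.0, size=500)

# Histogram of the prediction errors; mean, mode and spread are then easy to read off.
plt.hist(y_pred - y_true, bins=30)
plt.xlabel("Prediction error (nm)")
plt.ylabel("Count")
plt.show()
```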
Question on the method: Does it only work with vertical lines? What happens with oblique lines? Or a combination of vertical and horizontal lines? Would that be feasible? Please, explain briefly.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
In this study, the authors proposed a framework based on deep learning for through-focus scanning measuring. Although the model achieved promising performance, some major points should be addressed, as follows:
1. Overall, the English language should be improved in this manuscript. It contains grammatical errors, typos, and jargon. Thus, the authors should re-check and revise it carefully.
2. The authors should use an external validation set to evaluate the external validity of the model.
3. The authors should compare the predictive performance with previously published works on the same problem/dataset.
4. How did the authors tune the optimal hyperparameters of the deep learning models? Why was this architecture selected rather than other deep learning ones?
5. When comparing the performance among methods/models, the authors should conduct statistical tests to see whether the differences are significant.
6. Uncertainties of the model should be reported.
7. More discussion should be added.
8. Deep learning is well known and has been used in previous studies, e.g., PMID: 34812044 and PMID: 31380767. Thus, the authors are suggested to provide more references in this part to attract a broader readership.
9. Quality of figures should be improved.
10. Source code should be provided for replicating the study.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
My previous comments have been addressed.