1. Introduction
Metrology and defect inspection are essential steps in the fabrication of semiconductor integrated circuits and nanoscale devices. Several methods are currently used for nanoscale measurement, such as scanning electron microscopy (SEM) [1,2], based on measuring the scattered electrons from electronic excitation, and atomic force microscopy (AFM) [3], based on probe measurement. Although these methods show high measuring accuracy, they are difficult to apply to online real-time detection due to their low measuring efficiency, high cost and destructiveness. Additionally, methods based on optical scattering, though fast and efficient, are unable to achieve accurate nanoscale measurement [4] due to the diffraction limit of the optical imaging system.
Through-focus scanning optical microscopy (TSOM) is a novel, fast and non-destructive micro/nanoscale measurement method based on computational imaging [5,6,7,8,9]. Different from traditional optical measurement, TSOM takes the view that "what you get is not what you see". TSOM captures a series of images, consisting of one focused image and many defocused images of the imaging target, by scanning along the optical axis. The captured images are stacked according to their spatial positions to form an image cube (the TSOM cube), and the middle sectional view of the TSOM cube is taken to generate the TSOM image. The features of the TSOM image are sensitive to nanoscale size changes and nanoscale defects of the measuring target, so the parameters of the measuring target can be retrieved from the TSOM image. By using TSOM, the diffraction limit of the optical imaging system can be avoided, allowing an ordinary wide-field optical microscope to achieve dimensional measurement at the nanoscale. TSOM has been applied in mask-defect detection [10], nanoparticle shape evaluation [11], nanoscale dimensional measurement of three-dimensional structures [12] and overlay analysis [13,14].
Three kinds of TSOM methods are used to measure the nanoscale parameters of the target: the library-matching TSOM method [15,16], the machine-learning TSOM method [17] and the deep-learning TSOM method [18,19].
The library-matching TSOM method is the classical TSOM measurement method [15,16]. In this method, the experimental TSOM image of the target is matched against a simulated TSOM image library established by a simulation algorithm, and the parameters of the target are then retrieved from the best match. The library-matching TSOM method needs considerable time to establish the simulated TSOM image library, and its measuring accuracy is limited by the fineness of the established library.
Qu et al. proposed the machine-learning TSOM method based on TSOM image features [17]. In their method, three image feature extraction algorithms are used to extract TSOM image features: the gray-level co-occurrence matrix (GLCM), the local binary pattern (LBP) and the histogram of oriented gradients (HOG). The extracted TSOM image features are then input into different machine-learning models in different combinations for training and measuring. They obtain more accurate measurement results than the library-matching method. However, due to the limitations of the feature extraction algorithms, the machine-learning TSOM method is still not accurate enough.
Deep learning has developed rapidly in recent years and has been widely used in various fields, such as image recognition [20,21], biomedical sciences [22,23] and computational imaging [24]. The earliest deep-learning TSOM method, proposed by Cho et al., is based on the idea of image classification [18]. Setting different parameter sizes as the classification categories, this method uses a single-column convolutional neural network (CNN) to calculate the probability of the TSOM image being assigned to each category; the measurement result is then obtained as the weighted average over all categories. The method is simple and convenient, but the measurement range is small and the accuracy is limited. Nie et al. used ResNet50 and DenseNet121 models for TSOM measurement [19]. They achieve a larger range of nanoscale measurement than Cho's, but, being a classification method, its accuracy is still limited.
The common deficiency of the above three kinds of TSOM measurement methods is that only a single TSOM image is used to retrieve the measurement parameters. That means only one sectional view of the entire TSOM cube is used, which leads to low utilization of the image information and, in turn, to the limited accuracy of these methods. Joo et al. studied the relationship of the TSOM height and TSOM volume with the defects of nanodevices, indicating that there is a regression relationship between the entire TSOM cube and the parameters of the measuring target [25]. This conclusion indicates that the parameters of the measuring target can be retrieved more accurately by inputting more images of the TSOM cube rather than only a sectional view. Since only the defocus distance changes between the images that constitute the TSOM cube, and the focused image has the highest definition and contains relatively the most information within the cube, the focused image is the most suitable supplementary input alongside the TSOM image.
On this basis, a two-input deep-learning TSOM method based on a CNN is proposed in this paper. The two-input CNN uses the focused image and the TSOM image as its two inputs (instead of a single TSOM image, as in previous TSOM methods) and the measured parameters of the target as its output, making fuller use of the effective information in the TSOM cube. Additionally, a regression model is used so that the measurement accuracy is not limited by classification categories.
The structure of the rest of this paper is as follows: the second part introduces the materials and methods, including the overall process, the experimental devices, the acquisition of the dataset, the structure of the proposed two-input CNN and how it is trained. The third part introduces our experiment, including the model-evaluation indices, the measurement results and analysis, and the uncertainties of the model. The fourth part presents our discussion, mainly concerning the effect of focusing-position error and the influence of lateral shifts of the measuring target. The fifth part presents our conclusions.
2. Materials and Methods
The process of the two-input deep-learning TSOM method is shown in Figure 1. Firstly, the TSOM imaging system is built to capture experimental images. Secondly, by scanning along the optical axis, a series of images of the measuring target and of the background of the imaging region are obtained. After the background images are subtracted from the target images, a series of images of the target with the background noise removed is obtained, consisting of one focused image and many defocused images. These images are stacked according to their spatial positions to form an image data cube (the TSOM cube). Then, the image definition of all images in the TSOM cube is evaluated to find the best focused image. At the same time, a sectional view of the TSOM cube is extracted and processed by smoothing, interpolation and pseudo-color methods to generate the TSOM image. Next, the obtained focused images and TSOM images are input into the two-input CNN for training and testing. Finally, the network outputs the measured parameters of the corresponding target.
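As a minimal illustration of this pipeline, the following Python/NumPy sketch shows the cube-forming and section-extraction steps (the array names, shapes and helper functions are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def build_tsom_cube(target_stack, background_stack):
    """Subtract the background frames from the target frames and stack the
    result along the scan axis to form the TSOM cube."""
    cube = np.asarray(target_stack, float) - np.asarray(background_stack, float)
    return cube                       # shape: (n_scan_positions, H, W)

def middle_section(cube):
    """Middle transverse sectional view of the cube: one pixel row through
    the target center from every frame, stacked over the scan positions."""
    mid_row = cube.shape[1] // 2
    return cube[:, mid_row, :]        # raw material for the TSOM image
```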
2.1. TSOM Imaging System
The experimental images are acquired by the TSOM imaging system shown in Figure 2. The system is divided into two parts: the illumination system (the red light path in Figure 2a) and the imaging system (the blue light path in Figure 2a). In the illumination system, an LED with a wavelength of 520 nm is used as the light source. The light emitted by the LED passes through an objective lens (OL1), an aperture, a lens (L1), a polarizer, the field diaphragm (FD), a lens (L2) and the beam-splitting prism, and then converges on the conjugate rear focal plane of the objective lens (OL2). The light passing through OL2 irradiates the nanostructured target, achieving Köhler illumination. The imaging system consists of the objective lens (OL2), the beam-splitting prism and a lens (L3), which image the target on a CCD camera. During through-focus scanning, OL2 is driven along the optical axis by the piezoelectric objective locator (PZ) and the piezoelectric positioning mechanism. Background images and target images are captured every 200 nm during the scanning process. After the background images are subtracted from the target images, a series of images of the target without background noise is obtained, consisting of one focused image and many defocused images.
2.2. The Dataset
The image dataset is captured from a measuring target composed of a series of isolated gold (Au) lines with a length of 100 μm and a height of 100 nm. The measured parameter is the linewidth (LW) of the gold lines. The model of the gold line is shown in Figure 3. The linewidths to be measured range from 247 nm to 1010 nm, with 37 sizes in total. Scanning electron microscopy (SEM) results are used as the truth values of the linewidths of the gold lines.
The gold lines are placed on the displacement table, and the TSOM imaging system built in Section 2.1 is used to obtain a series of images of the lines, including one focused image and many defocused images. The images are then stacked according to their spatial positions to form a three-dimensional cube (the TSOM cube), as shown by the yellow cube in Figure 4.
Each image forming the TSOM cube, as shown in Figure 5, has a size of 89 × 89 pixels, and the gold line to be measured is located in the middle of each image. Images closer to the focused position have higher definition, and the definition gradually decreases as the defocus distance increases.
From each TSOM cube, one focused image and one TSOM image are extracted to construct the dataset. There are 10 identical gold lines for each linewidth (370 gold lines in total). For each linewidth, 1000 different positions on the 10 identical gold lines are selected randomly to collect 1000 TSOM cubes; therefore, 1000 focused images and 1000 TSOM images are captured for each linewidth. The entire dataset consists of 37,000 focused images and 37,000 TSOM images. Each TSOM image corresponds to one focused image, and both are labeled with the corresponding linewidth. The dataset is divided into a training set, a validation set and a test set in a 3:1:1 ratio.
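A minimal sketch of the 3:1:1 split (assuming a simple random split over the 37,000 paired images; whether the split is stratified by linewidth is not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(seed=0)                   # fixed seed for reproducibility
n_samples = 37_000                                    # paired (focused, TSOM) images
idx = rng.permutation(n_samples)
n_train, n_val = 3 * n_samples // 5, n_samples // 5   # 3:1:1 ratio
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
```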
2.2.1. Focused Image
In order to obtain the best focused image from the TSOM cube, it is necessary to evaluate the image definition of the images within the TSOM cube, as Figure 6 shows. The image with the highest definition is selected as the focused image of the cube for constructing the dataset.
The general methods used for image focusing and definition evaluation [26] include the Fourier transform, gradient energy maximization, high-pass filtering, histogram entropy, local change histogram and normalized variance, etc. This paper uses the normalized variance as the focusing evaluation function to evaluate the sharpness of the image sequence, as it reduces the influence of noise. The expression of the normalized variance is shown in Equation (1); generally, the higher the definition of the image, the larger the value of the focusing evaluation function.

$$F_{\mathrm{NV}} = \frac{1}{H \cdot W \cdot \mu} \sum_{x=1}^{W} \sum_{y=1}^{H} \bigl( f(x,y) - \mu \bigr)^{2} \qquad (1)$$

where H and W are the height and width of the image in pixels, f(x,y) is the brightness value of the pixel at coordinate (x,y) and μ is the average brightness value of all pixels of the image.
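A direct transcription of Equation (1) in Python/NumPy might look as follows (a sketch; `cube` is assumed to be an array of shape (n_scan, H, W)). Note that taking the global argmax also resolves the two-peak case discussed below, since it returns the position of the larger peak:

```python
import numpy as np

def normalized_variance(img):
    """Focusing evaluation function of Equation (1)."""
    img = np.asarray(img, float)
    H, W = img.shape
    mu = img.mean()
    return ((img - mu) ** 2).sum() / (H * W * mu)

def best_focused_index(cube):
    """Index of the sharpest frame in the cube; the global argmax also
    picks the larger peak when the curve has two peaks (Figure 7c,d)."""
    scores = [normalized_variance(frame) for frame in cube]
    return int(np.argmax(scores))
```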
The focusing evaluation function values are calculated one by one for each series of images corresponding to the different linewidths, and the image corresponding to the peak value of the function is selected as the focused image of the series. The focusing evaluation function curves of the image series corresponding to gold lines with linewidths of 247 nm, 298 nm, 625 nm and 768 nm are shown in Figure 7. Taking the evaluation result of the gold line with a 247 nm linewidth as an example, the peak position of the focusing evaluation function shows that image no. 96 has the highest definition, so it is selected as the best focused image of the series.
It should be noted that the focusing evaluation function curve may show two peaks, due to the uneven height of the measuring target and the non-ideal illumination system [5,27], as in the function curves of the image series corresponding to the gold lines with linewidths of 625 nm and 768 nm shown in Figure 7c,d. In this case, the larger of the two peaks is selected as the best focusing position.
Finally, a total of 37,000 focused images are obtained. Each focused image, as shown in Figure 5a (Section 2.2), has a size of 89 × 89 pixels, with the gold line to be measured located in the middle of the image.
2.2.2. TSOM Image
The other component of the dataset is the TSOM image. The TSOM image is obtained from the transverse sectional view of the TSOM cube; its generation process is shown in Figure 8. Taking the best focused image obtained in Section 2.2.1 as the center, 44 images are taken on each side to form a small TSOM cube (89 images in total: one focused image and 88 defocused images), and the transverse sectional view through the middle of the cube is extracted and spliced as shown in Figure 8a. Finally, the TSOM image is obtained by interpolation, smoothing and pseudo-color processing of the spliced sectional view. Each line of the TSOM image represents the light intensity information of one defocused image or of the focused image.
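The generation step can be sketched as follows (the filter type, sigma and interpolation method are assumptions; pseudo-color is a display-side mapping, e.g. a matplotlib colormap, and is not shown):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def make_tsom_image(cube, focus_idx, half_range=44, out_width=89):
    """Crop 44 frames on each side of the focused frame, take the middle
    transverse section and post-process it into the 89 x 89 TSOM image."""
    small = cube[focus_idx - half_range:focus_idx + half_range + 1]   # 89 frames
    mid_row = small.shape[1] // 2
    section = small[:, mid_row, :]                      # one intensity row per frame
    section = zoom(section, (1.0, out_width / section.shape[1]))  # interpolation
    return gaussian_filter(section, sigma=1.0)          # smoothing (sigma assumed)
```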
Finally, 37,000 TSOM images are obtained, each with a size of 89 × 89 pixels as shown in Figure 9, and each TSOM image corresponds to one focused image.
2.3. The Structure of the Two-Input CNN
The structure of the two-input CNN proposed in this paper for TSOM measurement (hereafter named Focus & TSOM-CNN) is shown in Figure 10. From left to right, Focus & TSOM-CNN includes an input part, a processing part for the TSOM images and the focused images, and a feature-merging and output part.
The left area of the network mainly contains the input layer, which is used to input the TSOM images and the focused images into the network. The two kinds of images are 89 × 89 pixels in size.
The middle area of the network, as Figure 10 shows, processes the two types of input images; the upper part is the processing network for TSOM images. Since TSOM images contain relatively more effective information, the TSOM image processing network is constructed with reference to MCNN [28], which integrates features of different scales in order to fully extract the features of the input image. After the TSOM image is input, three columns of convolution channels process it in parallel, using convolution kernels of 3 × 3, 5 × 5 and 7 × 7 pixels to extract features of different scales. Considering the small pixel size of TSOM images, three convolution layers with a stride of 1 are set in the first (3 × 3) convolutional channel, while two convolution layers with a stride of 2 are set in each of the other two channels. All three channels use zero padding in the convolution process. In all three channels, in order to retain the original characteristics, max pooling is used and the activation function is the ReLU function, $f(x) = \max(0, x)$. After feature extraction in the three convolution channels, the first (3 × 3) channel yields features of size 23 × 23 × 96 and the other two yield features of size 6 × 6 × 96; the 6 × 6 × 96 features are enlarged to 23 × 23 × 96 by zero padding. Then, all the features are merged into the TSOM image features of size 23 × 23 × 288.
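Under the stated kernel sizes, strides and output shapes, the TSOM branch can be sketched in PyTorch as follows; the per-layer channel counts and exact pooling positions are not given in the text and are chosen here so that the outputs match 23 × 23 × 96 and 6 × 6 × 96 (a sketch, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSOMBranch(nn.Module):
    """Three-column, multi-scale branch for the 89 x 89 TSOM image (MCNN-style)."""
    def __init__(self, in_ch=1):             # input channel count assumed
        super().__init__()
        # Column 1: three 3x3 convs (stride 1) with two pooling stages -> 23x23x96.
        self.col3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),               # 89 -> 45
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),               # 45 -> 23
            nn.Conv2d(64, 96, 3, stride=1, padding=1), nn.ReLU(),
        )
        # Columns 2 and 3: two convs (stride 2) with pooling -> 6x6x96.
        def column(k):
            return nn.Sequential(
                nn.Conv2d(in_ch, 48, k, stride=2, padding=k // 2), nn.ReLU(),  # 89 -> 45
                nn.MaxPool2d(2, ceil_mode=True),           # 45 -> 23
                nn.Conv2d(48, 96, k, stride=2, padding=k // 2), nn.ReLU(),     # 23 -> 12
                nn.MaxPool2d(2, ceil_mode=True),           # 12 -> 6
            )
        self.col5, self.col7 = column(5), column(7)

    def forward(self, x):
        f3 = self.col3(x)                                  # (N, 96, 23, 23)
        f5 = F.pad(self.col5(x), (8, 9, 8, 9))             # zero-pad 6x6 up to 23x23
        f7 = F.pad(self.col7(x), (8, 9, 8, 9))
        return torch.cat([f3, f5, f7], dim=1)              # (N, 288, 23, 23)
```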
The lower part of the middle area of the network processes the focused image. Considering that the texture of the focused image is relatively simple and its effective information is concentrated in the middle of the image, this part of the network refers to AlexNet [29] and uses a single-column convolution channel. The focused image features are extracted through three convolution layers and two pooling layers. The size of the convolution kernel is set to 3 × 3 pixels, max pooling is adopted and the activation function is the ReLU function. After this processing, the obtained focused image features have a size of 23 × 23 × 96.
The right area of the network is the feature-merging and output part. After feature extraction of the TSOM images and the focused images is completed, the extracted features of the two types of images are merged into features of size 23 × 23 × 384. The merged features are then flattened and mapped to the measured parameter (linewidth) through two fully connected layers for output. The activation function of the fully connected layers is the ReLU function.
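Combining this with the TSOMBranch sketch above, the focused-image branch and the merging head might look as follows (the hidden width of the first fully connected layer and the linear final output are assumptions):

```python
import torch
import torch.nn as nn

class FocusBranch(nn.Module):
    """Single-column branch for the 89 x 89 focused image (AlexNet-style)."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),               # 89 -> 45
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),               # 45 -> 23
            nn.Conv2d(64, 96, 3, padding=1), nn.ReLU(),    # -> (N, 96, 23, 23)
        )

    def forward(self, x):
        return self.net(x)

class FocusTSOMCNN(nn.Module):
    """Two-input regression network: merged features -> linewidth in nm."""
    def __init__(self):
        super().__init__()
        self.tsom, self.focus = TSOMBranch(), FocusBranch()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(384 * 23 * 23, 512), nn.ReLU(),      # hidden width assumed
            nn.Dropout(0.5),                               # drop-out (Section 2.3)
            nn.Linear(512, 1),                             # linear regression output
        )

    def forward(self, tsom_img, focus_img):
        merged = torch.cat([self.tsom(tsom_img), self.focus(focus_img)], dim=1)
        return self.head(merged)                           # (N, 1)
```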
The adaptive moment estimation (Adam) optimizer is used with an initial learning rate of 0.0001. Drop-out is used to avoid overfitting. The images in the datasets are normalized by the zero-mean normalization (Z-score) method before input. The definition of the Z-score is shown in Equation (2):

$$z = \frac{x - \mu}{\sigma} \qquad (2)$$

where $\mu$ is the mean of the data x and $\sigma$ is the standard deviation of the data x.
2.4. Training
Focus & TSOM-CNN is a deep-learning regression model; therefore, the mean square error (MSE) is used as the loss function for training. The definition of MSE is shown in Equation (3):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_{i} - \hat{y}_{i} \bigr)^{2} \qquad (3)$$

where $y_{i}$ indicates the truth value in each measurement, $\hat{y}_{i}$ indicates the predicted value and n indicates the number of measurements.
The batch size is set to 16 and the model is trained for 200 epochs.
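A condensed training loop reflecting these settings (Adam at a learning rate of 1e-4, Z-score input normalization, MSE loss, batch size 16, 200 epochs) might read as follows; `tsom_train`, `focus_train` and `lw_train` are assumed names for pre-built float tensors of the raw images and linewidth labels:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def z_score(x):
    """Zero-mean normalization of Equation (2); global statistics are used
    here, although per-image statistics would be an equally plausible reading."""
    return (x - x.mean()) / x.std()

train_set = TensorDataset(z_score(tsom_train), z_score(focus_train), lw_train)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = FocusTSOMCNN()                                    # sketch from Section 2.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()                              # Equation (3)

for epoch in range(200):                                  # 200 epochs
    for tsom_img, focus_img, lw in train_loader:
        optimizer.zero_grad()
        pred = model(tsom_img, focus_img).squeeze(1)
        loss = loss_fn(pred, lw)
        loss.backward()
        optimizer.step()
```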
3. Experiment
In order to evaluate the performance of the proposed method after training and testing, our experimental results are compared with those of two other deep-learning TSOM methods used in regression mode: the DenseNet121 model and the ResNet50 model.
3.1. Evaluation Indicators
In this paper, MSE and MAE, which are commonly used for regression models, are used to evaluate the trained models. The definition of MSE is shown in Equation (3) in Section 2.4, and the definition of MAE is shown in Equation (4):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y}_{i} \right| \qquad (4)$$
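Both indicators are one-liners in NumPy, shown here only to fix the conventions of Equations (3) and (4):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)     # Equation (3), in nm^2

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))    # Equation (4), in nm
```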
3.2. Linewidths Measurement Results
Multiple measurements using Focus & TSOM-CNN are made on the gold-line targets with linewidths in the range of 247–1010 nm. The measurement performance of Focus & TSOM-CNN and of the two other deep-learning TSOM methods (the DenseNet121 and ResNet50 models in regression mode) on the test set is shown in Figure 11, where the abscissa of the scatter plot is the measured linewidth, the ordinate is the absolute error of the prediction ($\left| \hat{y}_{i} - y_{i} \right|$, the absolute value of the difference between the measured value and the true value), and each point represents one measurement. The DenseNet121 and ResNet50 models are trained with the same hyperparameters as reported in Reference [19].
As can be seen from Figure 11, the measuring error of Focus & TSOM-CNN is generally lower than that of the other two regression models, and most of the test errors are less than 15 nm, with only a few exceptions that are potentially related to image noise from data collection. Figure 12 shows several examples for a more intuitive evaluation of the results: the multiple measurements of gold lines with 247 nm, 357 nm, 625 nm and 768 nm linewidths. From these figures, it can be concluded that the measured values cluster around the true values. The experiment shows that the two-input deep-learning TSOM method proposed in this paper is accurate in nanoscale measurement and has good repeatability.
Table 1 presents the MSE and MAE of the measurement results for the 247–1010 nm gold lines obtained by the three regression models on the test set. According to the data in Table 1, the MSE of our two-input deep-learning TSOM method is 5.18 nm² and the MAE is 1.67 nm, both far lower than those of the other two regressive deep-learning TSOM models. This suggests that Focus & TSOM-CNN achieves the highest accuracy in nanoscale measurement.
Because it extracts features automatically and uses a regression model, the two-input deep-learning TSOM method is limited neither by feature extraction algorithms nor by classification categories. Moreover, this method does not require a complex simulation modeling process such as that of the library-matching TSOM method. Therefore, the method performs well in both the precision and the convenience of measurement.
3.3. Uncertainties of the Model
In this section, the model uncertainties are evaluated by repeated training. In the experiment, ten models are obtained through training; the MSE and RMSE of the obtained models on the test set are shown in Figure 13. RMSE is the arithmetic square root of MSE, whose definition is shown in Equation (5):

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( y_{i} - \hat{y}_{i} \bigr)^{2}} \qquad (5)$$
The standard deviation of the mean of the RMSE (Equation (6)) is normally used to evaluate the uncertainty of the model:

$$\sigma_{\overline{\mathrm{RMSE}}} = \sqrt{\frac{1}{m(m-1)} \sum_{i=1}^{m} \bigl( \mathrm{RMSE}_{i} - \overline{\mathrm{RMSE}} \bigr)^{2}} \qquad (6)$$

where m is the number of trained models (here, m = 10), $\mathrm{RMSE}_{i}$ is the RMSE of the i-th model and $\overline{\mathrm{RMSE}}$ is their mean. The resulting model uncertainty is shown in Table 2.
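The computation behind Table 2 reduces to a few lines; the RMSE values below are placeholders, not the measured results of Figure 13:

```python
import numpy as np

# Placeholder RMSE values for the ten trained models (nm); substitute the
# measured values plotted in Figure 13.
rmse = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.5, 2.2, 2.3, 2.0])
mean_rmse = rmse.mean()
u_model = rmse.std(ddof=1) / np.sqrt(rmse.size)   # Equation (6)
```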
Many factors affect the uncertainties of deep-learning TSOM measurement, including the model, the parameters of the imaging system, the uncertainty of the truth values, the imaging noise, etc. In practical experiments, some methods can be used to reduce the measurement uncertainties, such as imaging system optimization and image denoising. Reducing these uncertainties will be one of the directions for improving TSOM in the future.