3.2. Evaluation of the RoIs Extraction
To validate the localization performance of the RoIs extraction model, we randomly selected 200 images to test YOLOv5.
Table 1 shows the mean Average Precision (mAP) of the different RoIs. The expression of mAP is as follows:

$$\mathrm{AP}=\frac{N_{\mathrm{IoU}>t}}{N},\qquad \mathrm{mAP}=\frac{1}{k}\sum_{i=1}^{k}\mathrm{AP}_i$$

Here, $N_{\mathrm{IoU}>t}$ denotes the number of predicted object boxes whose Intersection over Union (IoU) with the ground truth is greater than the threshold value $t$, $N$ denotes the total number of test samples, mAP is the average of all category APs, and $k$ is the number of different categories contained in the test set. In
Table 1,
mAP@0.5 denotes the mean average precision for $t=0.5$, and
mAP@0.5:0.95 denotes the mean average precision averaged over IoU thresholds from 0.5 to 0.95 (in steps of 0.05).
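As a minimal illustration of the metric defined above (a hypothetical helper following the simplified per-sample definition given in this paper, not standard COCO-style precision-recall AP and not the paper's evaluation code), the sketch below computes AP per class as the fraction of test samples whose predicted box exceeds the IoU threshold, mAP as the mean over the k classes, and mAP@0.5:0.95 as the average over thresholds from 0.5 to 0.95:

```python
import numpy as np

def average_precision(ious, t):
    """AP for one class: fraction of test samples whose predicted box
    overlaps the ground truth with IoU greater than threshold t."""
    ious = np.asarray(ious, dtype=float)
    return float((ious > t).sum()) / len(ious)

def mean_average_precision(ious_per_class, t=0.5):
    """mAP: average of the per-class APs over the k classes."""
    return sum(average_precision(v, t) for v in ious_per_class.values()) / len(ious_per_class)

def map_range(ious_per_class, lo=0.5, hi=0.95, step=0.05):
    """mAP@0.5:0.95: mean of mAP over IoU thresholds 0.5, 0.55, ..., 0.95."""
    thresholds = np.arange(lo, hi + 1e-9, step)
    return float(np.mean([mean_average_precision(ious_per_class, t) for t in thresholds]))

# Toy example: best-match IoUs for the "H", "P", and "C" RoIs on a few test images.
ious = {"H": [0.92, 0.88, 0.95], "P": [0.55, 0.49, 0.71], "C": [0.63, 0.58, 0.80]}
print(mean_average_precision(ious, 0.5), map_range(ious))
```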
It can be seen from
Table 1 that the localization accuracy of the “H”, “P”, “C”, and “H+P+C” RoIs is close to 1 when the IoU threshold is 0.5. When the threshold is set to 0.5, any prediction whose IoU with the ground truth reaches 0.5 is counted as a successful localization, which results in a high mAP. In addition, since the background of a hand radiograph is uniform and the pose and position of the hand are standardized when the X-ray is taken, the position of the hand bones in the image is relatively fixed, which is conducive to high localization accuracy. Second, the
mAP@0.5:0.95 of “H”, “P”, and “C” in the table gradually decreases because the “C” and “P” RoIs are relatively small objects, which the network cannot localize as precisely. In particular, the “P” RoI also suffers interference from the same parts of the other fingers. To ensure that the model can reliably crop the corresponding RoI boxes, the IoU threshold used when training the network is set to 0.5 in this paper.
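For reference, the IoU used throughout this evaluation can be computed as below; this is the standard axis-aligned box formulation (a self-contained sketch, not code from the paper), with boxes given in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted carpal box vs. its ground truth; IoU >= 0.5 counts as a successful localization.
print(iou((10, 10, 50, 50), (15, 12, 55, 48)))
```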
To validate the classification performance of the model for the RoIs, the normalized confusion matrix is also analyzed in this paper. The confusion matrix is a matrix whose vertical axis is the predicted category and whose horizontal axis is the ground-truth category. Each diagonal entry of the normalized matrix represents the probability that a sample of that ground-truth category is predicted correctly.
As shown in
Figure 8, the confusion matrix’s high diagonal values indicate the model’s high classification accuracy. We believe that this is due to the large morphological gap between the “H”, “C”, and “P” RoIs, which allows the model to classify them correctly with high accuracy.
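A minimal sketch of the normalized confusion matrix described above (hypothetical ground-truth and predicted RoI labels; following the convention in the text of predicted class on the vertical axis and ground truth on the horizontal axis, with each column normalized):

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, labels):
    """Rows: predicted class, columns: ground-truth class; each column is
    normalized so the diagonal gives per-class classification accuracy."""
    index = {c: i for i, c in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[index[p], index[t]] += 1
    col_sums = cm.sum(axis=0, keepdims=True)
    return cm / np.clip(col_sums, 1, None)

labels = ["H", "C", "P"]
y_true = ["H", "H", "C", "C", "P", "P"]
y_pred = ["H", "H", "C", "P", "P", "P"]
print(normalized_confusion_matrix(y_true, y_pred, labels))  # diagonal ~ per-class accuracy
```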
3.3. Evaluation of Bone Age Assessment
Baseline networks. We trained five baseline models to select a better-performing baseline network for the BAA task: ResNet50, MobileNet v3, AlexNet, VGG, and MobileNet v2. All baseline networks were trained from scratch; no pre-trained weights were loaded, and the networks were initialized randomly. The MAE achieved by each baseline model and the corresponding model size (number of parameters) are shown in
Table 2.
As observed from
Table 2, the MobileNet v2 baseline model achieves the best evaluation results with the fewest model parameters and, thus, runs efficiently. This is attributed to the fact that the feature extractor of MobileNet v2 is built from depthwise-separable convolutions, which keep the network lightweight, effectively prevent overfitting, and are well suited to small datasets such as medical datasets. Therefore, MobileNet v2 is chosen as the baseline network for the BAA task in this paper.
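As an illustration of such a baseline setup, the sketch below builds a randomly initialized MobileNet v2 from torchvision (no pre-trained weights, as in the experiments) and replaces its classifier with a single-output regression head trained with an MAE loss; the exact head, input size, and training details used in the paper may differ:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

# Baseline: MobileNet v2 trained from scratch (weights=None, recent torchvision API),
# with its classifier replaced by a single-output regression head for bone age (months).
model = mobilenet_v2(weights=None)
in_features = model.classifier[1].in_features  # 1280 for MobileNet v2
model.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(in_features, 1))

x = torch.randn(2, 3, 224, 224)    # dummy batch of hand radiograph crops
pred_age = model(x).squeeze(1)     # predicted bone age in months
loss = nn.L1Loss()(pred_age, torch.tensor([120.0, 96.0]))  # MAE loss for the regression head
```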
Fusion of background-denoised RoIs and prior-knowledge RoIs. In order to verify the effectiveness of background denoising and the fusion of prior knowledge for the BAA task, we compare the original image (O) and the background-denoised hand image (Hand, H) as model inputs. In clinical practice, if the difference between the chronological age and the bone age is within one year [
33], the skeletal development is considered to be within the normal range. This paper therefore sets three accuracy standards over different time ranges (namely, 6 months, 12 months, and 24 months, i.e., ±0.5, ±1, and ±2 years) to evaluate the accuracy of the proposed model at multiple scales. On top of this, multi-channel inputs built from the RUS-CHN RoIs (i.e., the Carpal “C” and Phalanx “P” regions) are compared, and the experimental results are shown in
Table 3.
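The three accuracy standards can be computed as the fraction of predictions whose absolute error from the ground truth is within 6, 12, or 24 months; a minimal sketch with a hypothetical helper and toy values:

```python
import numpy as np

def accuracy_within(pred_months, true_months, tol_months):
    """Fraction of samples whose |prediction - ground truth| <= tol_months."""
    err = np.abs(np.asarray(pred_months) - np.asarray(true_months))
    return float((err <= tol_months).mean())

pred = [118.0, 97.5, 150.0, 60.0]   # predicted bone ages in months (toy values)
true = [120.0, 90.0, 130.0, 61.0]   # ground-truth bone ages in months (toy values)
for tol in (6, 12, 24):             # the three standards: ±0.5, ±1, and ±2 years
    print(f"accuracy within ±{tol} months: {accuracy_within(pred, true, tol):.2f}")
```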
As shown in
Table 3, compared to using “O” as input, using “H” as the network input improves the results: the MAE decreased by 0.76 months, and the accuracy improved by 2.04%, 4.1%, and 0.36% for ground truth within ±0.5 year, ±1 year, and ±2 years, respectively. This illustrates that background denoising can improve the accuracy of model evaluation and has a positive effect on the BAA task. In addition,
Table 3 shows that the MAE of the model decreases by 1.04 months when “H+C+P” is used as input compared to “H”, and the accuracy improves by 2.19%, 9.66%, and 1.45% for ground truth within ±0.5 year, ±1 year, and ±2 years, respectively. The experimental results in
Table 3 illustrate that background denoising and the fusion of prior knowledge can effectively improve the evaluation accuracy of the BAA model and have a positive impact on the BAA task.
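The “H+C+P” input can be realized by stacking the background-denoised hand and the prior-knowledge RoIs as separate input channels. The sketch below illustrates one way to do this; the crop sizes, resizing, and normalization shown here are assumptions for illustration, not the paper's exact preprocessing:

```python
import cv2
import numpy as np
import torch

def fuse_rois(hand, carpal, phalanx, size=(224, 224)):
    """Stack the denoised hand ('H'), carpal ('C'), and phalanx ('P') grayscale crops
    into one 3-channel tensor so that each RoI occupies its own input channel."""
    channels = [cv2.resize(img, size) for img in (hand, carpal, phalanx)]
    stacked = np.stack(channels, axis=0).astype(np.float32) / 255.0
    return torch.from_numpy(stacked)  # shape: (3, 224, 224), ready for MobileNet v2

# hand, carpal, phalanx: grayscale uint8 crops produced by the YOLOv5 RoI extractor.
```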
BAA model ablation experiments. To evaluate the effectiveness of the improved feature extractor and the proposed bone age label representation for the BAA task, ablation experiments were conducted. The improved feature extractor contributes two modules to the ablation study, i.e., CBAM and the 5 × 5 convolution kernel (k: 5 × 5); “MP” in
Table 4 represents the proposed Multi-point distribution of bone age labels, and “G” denotes the gender information vector.
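For context, CBAM consists of a channel attention module (CAM) followed by a spatial attention module (SAM). The sketch below follows the original CBAM formulation; the reduction ratio, kernel size, and where the block is inserted into the MobileNet v2 feature extractor are assumptions here, not necessarily the configuration used in the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: squeeze spatial dims with avg/max pooling, then re-weight each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)

class SpatialAttention(nn.Module):
    """SAM: pool over channels, convolve, then re-weight each spatial location."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()
    def forward(self, x):
        return self.sam(self.cam(x))

# e.g., applied to a 96-channel feature map from an intermediate MobileNet v2 stage:
y = CBAM(96)(torch.randn(1, 96, 14, 14))
```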
From
Table 4, we can see that, compared to the baseline model, applying the improved feature extractor decreases the MAE by 1.16 months, and the accuracy improves by 7.66%, 3.36%, and 0.34% for ground truth within ±0.5 year, ±1 year, and ±2 years, respectively. We believe that enlarging the convolutional kernel expands the receptive field (RF), enhancing the local feature extraction capability of the model so that it can extract more local features of the hand bone and improve performance. Meanwhile, the adaptive assignment of channel weights by the CAM allows the model to automatically place larger weights on the more important RoI channels, improving the evaluation accuracy, and the SAM module enables the model to focus its attention on the spatial region of the hand bone, further improving accuracy. After adding the Multi-point distributed bone age labels containing rich semantic information to the model, the MAE decreased by 1.31 months compared to the baseline model, and the accuracy improved by 12.1%, 4.44%, and 0.36% for ground truth within ±0.5 year, ±1 year, and ±2 years, respectively. This shows that the proposed multi-point distribution bone age label, which well reflects the characteristics of skeletal development, embeds the semantic information of classification, regression, and distribution learning within the label, so that the model fully utilizes the label information and achieves a better assessment result. In addition, when we add the gender information to the network, the MAE decreases by a further 0.08 months, and the accuracy improves by 3.70% and 0.27% for ground truth within ±0.5 year and ±1 year, respectively. This is because adding gender information to the bone age label further enriches the semantic information of the label, allowing the model to learn more useful information to help bone age assessment.
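To make the idea of a distribution-style label with gender information concrete, the sketch below builds a discretized Gaussian over monthly age bins and appends a gender indicator. This is a common label-distribution-learning construction offered only as an illustrative assumption; the paper's multi-point distribution label may be defined differently:

```python
import numpy as np

def age_label_distribution(age_months, num_bins=216, sigma=3.0):
    """Distribution-style bone age label: a discretized Gaussian over monthly bins
    (0-18 years), centered on the ground-truth age and normalized to sum to 1.
    Illustrative only; not necessarily the paper's exact multi-point definition."""
    bins = np.arange(num_bins, dtype=float)
    dist = np.exp(-0.5 * ((bins - age_months) / sigma) ** 2)
    return dist / dist.sum()

def make_label(age_months, is_male):
    """Concatenate the age distribution with a gender indicator ('G')."""
    return np.concatenate([age_label_distribution(age_months), [1.0 if is_male else 0.0]])

label = make_label(age_months=132.0, is_male=True)  # e.g., an 11-year-old boy
```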
Loss function comparison experiment. In order to verify the effectiveness of the improved label distribution learning loss function, we conducted ablation experiments. At the same time, we compared it with the commonly used probability distribution learning loss function Kullback–Leibler Divergence (KL), and the experimental results are shown in
Table 5. Note that the regression layer in all cases uses the MAE loss.
From
Table 5, we can see that, compared to KL, training the model with the FL cascade decreases the MAE by 0.16 months, and the accuracy improves by 2.83%, 0.33%, and 0.54% for ground truth within ±0.5 year, ±1 year, and ±2 years, respectively. We believe this is because FL is better suited to data with an uneven sample distribution, such as medical datasets: it enables the model to adaptively adjust the weights of difficult samples, focusing the model’s attention on hard samples and improving recognition performance. In addition, evaluation performance is further enhanced by the combined FL + LS loss function: the MAE decreases by a further 0.09 months, and the accuracy improves by 1.32% and 0.66% for ground truth within ±0.5 year and ±1 year, respectively. That is, the proposed loss function can, to a certain extent, mitigate the uneven sample distribution of medical datasets and the overfitting caused by their small size.
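For illustration, the sketch below combines a focal-style weighting (FL) of the distribution-learning cross-entropy, label smoothing (LS) of the target distribution, and an MAE term for the regression output. The exact form of the paper's improved loss is not reproduced here; the formulation, gamma, smoothing factor, and weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_logits, target_dist, pred_age, true_age,
                  gamma=2.0, smoothing=0.1, alpha=1.0):
    """Illustrative FL + LS + MAE objective (assumed form, not the paper's exact one).
    pred_logits: (B, K) scores over age bins; target_dist: (B, K) label distributions;
    pred_age / true_age: (B,) regression outputs and ground-truth ages in months."""
    num_bins = target_dist.size(1)
    # Label smoothing: mix the target distribution with a uniform distribution.
    smoothed = (1.0 - smoothing) * target_dist + smoothing / num_bins
    log_p = F.log_softmax(pred_logits, dim=1)
    p = log_p.exp()
    # Focal weighting: down-weight bins the model already predicts confidently.
    focal_ce = -((1.0 - p) ** gamma * smoothed * log_p).sum(dim=1).mean()
    # MAE on the regression head, as used for the regression layer in the paper.
    mae = F.l1_loss(pred_age, true_age)
    return focal_ce + alpha * mae
```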
Assessment performance at different ages. In order to examine the performance of the model in each age group, this paper divides 0–18 years into 18 age intervals. A new evaluation metric, RMSE, is also introduced; it assigns a higher weight to large errors than MAE does, making it the more informative metric when large errors matter. Its expression is as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}$$

where $y_i$ denotes the ground-truth bone age of the $i$-th sample, $\hat{y}_i$ denotes the predicted bone age, and $N$ is the number of test samples.
From
Table 6, we can see that the model’s assessment results are more stable at (0, 15] years of age, with low MAE and RMSE, while the assessment error is larger at (15, 18] years of age, especially in the (17, 18] interval. We speculate that this is because, from 15 to 18 years of age, the skeletal development of the hand gradually matures and its morphology no longer changes significantly, making it difficult for the model to assess bone age accurately.
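The per-interval metrics can be computed as below; this is a hypothetical helper that selects the samples whose ground-truth age falls in a given (lo, hi] interval and reports MAE and RMSE in months:

```python
import numpy as np

def interval_metrics(pred_months, true_months, interval_years):
    """MAE and RMSE (in months) for samples whose ground-truth age is in (lo, hi] years."""
    pred, true = np.asarray(pred_months), np.asarray(true_months)
    lo, hi = [12.0 * y for y in interval_years]
    mask = (true > lo) & (true <= hi)
    err = pred[mask] - true[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse

# e.g., metrics for the (15, 16] age interval:
# mae, rmse = interval_metrics(pred, true, (15, 16))
```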
Comparison experiments with advanced models. To verify the performance and applicability of the proposed model, it is validated on the RSNA 2017 public dataset. Since this dataset is not annotated using the RUS-CHN method, only the background-denoised RoIs are used in the comparative experiments on the public dataset, and the results of the comparative experiments are shown in
Table 7.
Table 7 shows that when the model uses only the background-denoised RoIs, it achieves good accuracy on both the private and the public dataset, with evaluation accuracies of 93.80% and 95.20%, respectively, for ground truth within ±1 year. Compared with other BAA models that use background denoising on the public dataset, the MAE is lower and the evaluation accuracy is higher. We believe this is because the cascade model proposed in this paper is more consistent with the characteristics of skeletal development and makes full use of the bone age label information, thus improving the evaluation performance of the BAA model. Meanwhile,
Table 7 shows that the cascade model proposed in this paper also has a degree of applicability to the public dataset.