1. Introduction
Atypical Femoral Fracture (AFF) is a dangerous fracture type that occurs in the subtrochanteric or diaphyseal regions of the femur [1]. It can develop with slight or no injury and is characterized by radiographic findings such as a simple transverse or short oblique fracture [2]. Several factors have been identified as contributing to the occurrence of AFF, including excessive femoral curvature [3], vitamin D deficiency [4], and the use of proton pump inhibitors (PPIs) and corticosteroids [5]. Notably, a strong correlation has been observed between AFF and the prolonged use of bisphosphonates (BPs) and denosumab for osteoporosis treatment [6,7,8]; consequently, AFF predominantly occurs in the elderly population. Once an AFF occurs, surgical intervention becomes more challenging and can lead to functional impairment as well as complications such as nonunion and femoral head necrosis [9], increasing the mortality risk following AFF [10,11]. For these reasons, as the population ages, the incidence of AFF is expected to rise, emphasizing the importance of early diagnosis and intervention before AFF occurs to mitigate these risks.
In the early stages preceding the occurrence of AFF, cortical buckling develops in the lateral cortex of the femur due to repeated cycles of microfracture and healing [12]. This condition is termed Incomplete Atypical Femoral Fracture (IAFF). IAFF exhibits various characteristics and is classified by location, as shown in Figure 1: Diaphyseal IAFF (D-IAFF), which occurs in the femoral shaft, and Subtrochanteric IAFF (S-IAFF), which occurs in the subtrochanteric region [13]. Although IAFF is a crucial precursor to AFF, it is often asymptomatic or presents with vague features, making detection difficult and often resulting in delayed diagnosis. The progression from IAFF to AFF is illustrated in Figure 2. IAFF is typically diagnosed through bone scans [14] or Magnetic Resonance Imaging (MRI) [15]. However, these diagnostic methods have notable drawbacks, including high cost and time consumption. Furthermore, there remains a risk of misdiagnosis [14], which can lead to either unnecessary or delayed interventions, ultimately culminating in a complete fracture.
To mitigate these problems, the development of a diagnostic support system utilizing X-rays is essential, but several obstacles must be overcome: (1) an IAFF is often extremely small, (2) it lacks distinct characteristics, and (3) it resembles normal anatomical deformations, making it easy to overlook. Additionally, (4) the location and features of an IAFF vary depending on the type, and (5) its appearance may differ slightly depending on the radiographic view, even for the same patient (Figure 3). Due to these factors, even experienced orthopedic specialists may miss an IAFF if they do not examine the images meticulously.
To address these challenges, we propose a universal model, the Context-aware Level-wise Feature Fusion Network with Anomaly Focus (CFNet), inspired by the diagnostic methods employed by orthopedic specialists for IAFF diagnosis. This model is designed to be seamlessly integrated into various classification frameworks. When diagnosing IAFF, specialists typically begin by assessing the overall shape, curvature, and suspicious regions on a femur X-ray scan. They then carefully examine areas where IAFFs frequently occur and make the final diagnosis by comparing these regions with other potential conditions or deformities across all regions. Based on this diagnostic approach, CFNet extracts features at multiple levels from both a single entire X-ray image and high-resolution images segmented into four sections, utilizing Dual Context-aware Complementary Extractor (DCCE) blocks within each input branch. This design allows the model to capture the overall femoral features from the entire X-ray image while simultaneously identifying tiny and ambiguous IAFF characteristics from the high-resolution sliced images, so the two branches complement each other. The features extracted from each DCCE block are fused by the Level-wise Perspective-preserving Fusion Network (LPFN) to minimize information loss and ensure that features at the same level are integrated without interference. LPFN enhances the model’s representational capacity by learning features and correlations that are difficult to capture independently, thereby improving prediction accuracy. Moreover, we incorporate a Spatial Anomaly Focus Enhancer (SAFE) to focus on IAFF features and allow the model to comprehensively capture the correlations between the IAFF, its surrounding information, and the overall image. This approach prevents the model from overfitting and ensures effective learning of the scarce and subtle IAFF information. With these components, CFNet provides a highly accurate solution capable of detecting even tiny and ambiguous IAFFs while precisely distinguishing subtle differences. In our experiments, the proposed model demonstrates significant performance improvements over existing models, with each component effectively minimizing missed IAFFs and achieving accurate classification. Our main contributions are as follows:
We propose a novel model inspired by the diagnostic approach employed by specialists to enhance the classification performance of tiny and ambiguous IAFFs. To the best of our knowledge, this is the first model capable of effectively identifying all known types of IAFF features, regardless of prior surgical history or the presence of pathological fractures, while minimizing False Negatives. Furthermore, we are the first to utilize all major radiographic views (AP, ER, IR, LT), ensuring high accuracy in recognizing IAFFs across various imaging conditions.
We introduce the DCCE to overcome the challenges of information loss in small fractures and the limited contextual understanding encountered in conventional classification models. DCCE comprises two branches: one branch captures the overall characteristics of the femur and identifies potential IAFF regions across the entire X-ray image, while the other focuses on IAFF features together with their surrounding details in high-resolution images, thereby extracting complementary information.
We propose the LPFN to effectively learn the subtle features of small and ambiguous IAFFs and prevent the misclassification of noise and artifacts as IAFFs by leveraging complementary information. This approach preserves the unique meaning and perspective of the extracted features, integrating them without interference across different levels. By doing so, the model can utilize information from multiple levels and learn complex features and correlations that are difficult to capture independently.
We incorporate SAFE to minimize missed IAFFs and mitigate model bias toward regions unrelated to IAFFs. This approach captures comprehensive contextual information and long-range dependencies within the input, addressing the limitations of traditional Convolutional Neural Networks (CNNs) [16] and emphasizing anomalous regions. Consequently, it ensures that even subtle differences are not overlooked while preventing the model from overfitting to normal regions and backgrounds.
The remainder of this paper is organized as follows. Section 2 reviews related works, and Section 3 provides a detailed description of our proposed model. Section 4 presents the dataset utilized for the experiments, experimental details, evaluation metrics, and experimental results. Finally, the discussion and conclusion are provided in Section 5 and Section 6, respectively.
3. Proposed Model
3.1. Model Overview
The proposed CFNet is a universal approach that can be applied to a wide range of classification models, inspired by the diagnostic methods used by specialists for identifying IAFF. The classification performance of the proposed model is primarily enhanced by three key components, DCCE, LPFN, and SAFE, as illustrated in Figure 4.
CFNet consists of two input branches. Each branch of DCCE is composed of a feature extractor selected from classification models and is divided into four blocks based on the level of features being extracted. The first input branch processes the entire X-ray image, while the second input branch sequentially receives four high-resolution X-ray slices of equal size, divided from the top to the bottom of the original image. Features at various levels are then extracted through each block.
These features are integrated at different levels by our LPFN. It fuses the feature maps extracted from the corresponding blocks of each branch, enabling the model to learn richer features and correlations that are difficult to obtain individually. The fused results are then further combined with the output of the subsequent LPFN. Consequently, the classifier utilizes information that has been aggregated from all LPFN outputs. This approach ensures that features from various levels complement each other without interference, making them effectively utilized for accurate predictions.
Despite these advancements, there remains a possibility of missing tiny IAFFs, and the high proportion of normal regions and background information may limit the model’s ability to learn the characteristics of IAFF. To address this problem, we incorporate SAFE into the results of the first and last LPFN to emphasize anomalous regions and comprehensively learn the relationships among positional information. The results from the two SAFE modules are then combined and fed into the classifier of the selected model to generate the final classification result. By adopting and integrating these modules, the proposed model prevents misclassification and minimizes the risk of missing tiny and ambiguous IAFFs, thereby achieving high classification performance.
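To make this data flow concrete, the following PyTorch sketch outlines the forward pass described above. The module interfaces, the concatenation of the two SAFE outputs before the classifier, and the omission of the chained aggregation across LPFN outputs are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CFNetSkeleton(nn.Module):
    """Structural sketch of the CFNet forward pass (interfaces are illustrative)."""

    def __init__(self, branch1_blocks, branch2_blocks, lpfns, safe_first, safe_last, classifier):
        super().__init__()
        self.branch1 = nn.ModuleList(branch1_blocks)   # 4 DCCE blocks for the whole image
        self.branch2 = nn.ModuleList(branch2_blocks)   # 4 DCCE blocks for the slices
        self.lpfns = nn.ModuleList(lpfns)              # one LPFN per feature level
        self.safe_first, self.safe_last = safe_first, safe_last
        self.classifier = classifier

    def forward(self, x_full, x_slices):
        # x_full: whole X-ray image; x_slices: list of 4 high-resolution slices (top to bottom)
        fused_levels = []
        f1, f2 = x_full, list(x_slices)
        for blk1, blk2, lpfn in zip(self.branch1, self.branch2, self.lpfns):
            f1 = blk1(f1)                              # level-wise features of the whole image
            f2 = [blk2(s) for s in f2]                 # level-wise features of each slice
            fused_levels.append(lpfn(f1, f2))          # level-wise fusion
        a_first = self.safe_first(fused_levels[0])     # SAFE on the first and last LPFN outputs
        a_last = self.safe_last(fused_levels[-1])
        combined = torch.cat([a_first.flatten(1), a_last.flatten(1)], dim=1)
        return self.classifier(combined)
```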
3.2. Dual Context-Aware Complementary Extractor (DCCE)
In medical image classification, many models either rely on a single entire image or crop it into small patches for training. However, since IAFFs are tiny and lack distinct features, relying solely on a single entire image can lead to a significant loss or even a complete disappearance of IAFF information as the model deepens. Furthermore, the patch-based approach only utilizes information from a highly limited region, making it unable to utilize surrounding contextual information. Additionally, the severe class imbalance between normal and IAFF patches often leads to a model bias toward the majority class, hindering the accurate learning of IAFF characteristics. To address these limitations and effectively extract features across different feature levels, we propose the DCCE.
DCCE serves as the feature extractor of CFNet and is compatible with various classification models. It comprises two branches. The first branch processes the entire femoral X-ray to extract overall features, which emulates the approach specialists use when reviewing the entire femur. This branch captures the overall structure of the femur, including key IAFF characteristics such as curvature. It also identifies suspicious regions and distributions of IAFF, enabling the model to leverage location-based features. The second branch sequentially processes data that have been divided into four equal segments from top to bottom, based on the height of the entire image. This approach preserves high resolution, enabling a detailed analysis of IAFF boundaries and patterns. Unlike patch-based approaches, this strategy enables the utilization of surrounding information related to the IAFF, preserving contextual information and allowing the model to capture IAFF details more accurately through comprehensive analysis. This branch simulates how specialists zoom in and identify areas where IAFFs frequently occur, thereby capturing additional information about tiny IAFFs that may be missed or insufficiently addressed in the first branch.
In addition, each DCCE branch is organized into four blocks based on the level of features being extracted. To define these blocks, we first divide the total number of layers in the selected feature extractor into four equal groups. Within these divided groups, each DCCE block is defined using the nearest unit block or stage of the feature extractor as a division point. The first DCCE block focuses on extracting low-level features such as the edges and textures of the femur. The second and third blocks capture the shape and structural characteristics of the femur, extracting intermediate-level features, and the last block extracts high-level features, including contextual and semantic information.
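As an illustration of this grouping rule, the sketch below divides a VGG-style backbone into four DCCE blocks, treating each pooling layer as the nearest "unit block" boundary; the helper name and the use of torchvision's VGG16 are assumptions for demonstration only.

```python
import torch.nn as nn
from torchvision.models import vgg16

def split_into_dcce_blocks(features: nn.Sequential, n_blocks: int = 4):
    """Divide a sequential feature extractor into n roughly equal DCCE blocks,
    cutting at the nearest pooling layer (the 'unit block' boundary in VGG-style nets)."""
    layers = list(features.children())
    boundaries = [i + 1 for i, m in enumerate(layers) if isinstance(m, nn.MaxPool2d)]
    target = len(layers) / n_blocks
    blocks, start = [], 0
    for k in range(1, n_blocks):
        cut = min(boundaries, key=lambda b: abs(b - round(k * target)))  # nearest stage boundary
        blocks.append(nn.Sequential(*layers[start:cut]))
        start = cut
    blocks.append(nn.Sequential(*layers[start:]))
    return blocks

# Example: four DCCE blocks from an ImageNet-pretrained VGG16 backbone (torchvision >= 0.13)
backbone = vgg16(weights="IMAGENET1K_V1").features
dcce_blocks = split_into_dcce_blocks(backbone, n_blocks=4)
```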
DCCE overcomes the challenges of information loss and limited contextual understanding found in conventional methods by extracting both the overall features of the input image and IAFF characteristics in conjunction with surrounding context through two separate branches. This approach ensures that the model captures even ambiguous and small critical features without overlooking them. Additionally, the four DCCE block structures enable level-wise feature extraction at multiple levels and perspectives, allowing the model to accurately distinguish between IAFFs, anatomical deformities, and normal regions while recognizing subtle differences. Consequently, DCCE provides complementary and rich information to LPFN and SAFE, significantly enhancing the accuracy of IAFF detection.
3.3. Level-Wise Perspective-Preserving Fusion Network (LPFN)
The features extracted from different inputs each contain unique meanings and information. If these features are integrated randomly without considering their respective levels, their inherent meanings and correlations may be distorted, leading to the loss of useful patterns and relationships that the model could learn. It can hinder the effective utilization of key information extracted from each branch, such as the femur’s overall structural characteristics, IAFF positional information, and detailed IAFF features. Therefore, it is essential to integrate the features appropriately according to their levels.
To mitigate this challenge, we propose the Level-wise Perspective-preserving Fusion Network (LPFN), which integrates features extracted from the DCCE at different levels without interference. LPFN employs 1 × 1 convolution to align the channel dimensions of feature maps and preserve essential information from the outputs of the DCCE blocks extracted in the same sequence from each branch. For the second branch, which processes four inputs per data sample, the feature maps are sequentially merged according to their order and then consolidated into a single feature map. Subsequently, this result is resized to match the dimensions of the feature map from the first branch. To minimize information loss and preserve the unique perspectives of each feature, we fuse the two feature maps through concatenation. This approach expands the dimensionality of the feature map, allowing the model to learn from more comprehensive information. The resultant feature map is then processed with a 3 × 3 convolutional layer, followed by Batch Normalization [64], which ensures the stable learning of complex relationships and features that would be difficult to capture independently from each branch’s results alone. Moreover, this process enhances the model’s representational capacity and enables the comprehensive use of information from different levels. The result from each LPFN is then fused with the subsequent LPFN results, and ultimately, the classifier utilizes the information fused across all levels of LPFN results. The training procedure of the LPFN is illustrated in Figure 5, and the result of the nth LPFN is represented in Equation (1):

$$L_n = \mathrm{BN}\left(\mathrm{Conv}_{3\times 3}\left(\hat{F}_n^{(1)} \,\Vert\, \hat{F}_n^{(2)}\right)\right) \qquad (1)$$

where $F_n^{(1)}$ represents the feature map extracted from the first branch of the nth DCCE block, while $F_{n,i}^{(2)}$ denotes the feature map extracted from the ith slice (where i = 1, 2, 3, 4) in the second branch of the nth DCCE block. $\hat{F}_n^{(1)}$ and $\hat{F}_n^{(2)}$ refer to the feature maps from the first and second branches after processing for fusion, respectively, $\Vert$ represents concatenation, and $L_n$ is the output of the nth LPFN.
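A minimal PyTorch sketch of a single LPFN level is given below. The channel sizes, the vertical re-assembly of the four slice feature maps before resizing, and the absence of an activation after Batch Normalization are assumptions consistent with the description above; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LPFN(nn.Module):
    """Sketch of one Level-wise Perspective-preserving Fusion Network block."""

    def __init__(self, ch_branch1, ch_branch2, ch_out):
        super().__init__()
        self.align1 = nn.Conv2d(ch_branch1, ch_out, kernel_size=1)  # 1x1 conv, channel alignment (branch 1)
        self.align2 = nn.Conv2d(ch_branch2, ch_out, kernel_size=1)  # 1x1 conv, channel alignment (branch 2)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch_out, ch_out, kernel_size=3, padding=1),  # 3x3 conv on the concatenated map
            nn.BatchNorm2d(ch_out),                                   # Batch Normalization
        )

    def forward(self, f1, f2_slices):
        # f1: feature map of the whole image, shape (B, C1, H, W)
        # f2_slices: list of 4 slice feature maps, ordered top to bottom
        f1_hat = self.align1(f1)
        f2 = torch.cat([self.align2(s) for s in f2_slices], dim=2)   # merge slices along height
        f2_hat = F.interpolate(f2, size=f1_hat.shape[-2:],
                               mode="bilinear", align_corners=False)  # resize to branch-1 dimensions
        fused = torch.cat([f1_hat, f2_hat], dim=1)                    # concatenation (||)
        return self.fuse(fused)
```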
LPFN operates analogously to how specialists compare and analyze both the overall femur information and the details of regions where IAFFs frequently occur. By comprehending the overall characteristics of the image, LPFN aids in reducing the misinterpretation of noise and artifacts as IAFFs and provides positional information that indicates a higher probability of IAFF presence. Based on this guide, the model accurately learns the fine details of small and ambiguous IAFF from high-resolution information, ultimately enhancing prediction accuracy and sensitivity through the utilization of complementary information.
3.4. Spatial Anomaly Focus Enhancer (SAFE)
IAFF features are often ambiguous and resemble typical anatomical deformations, making them easy to overlook. Additionally, due to their small size, the model may be biased toward normal regions and backgrounds. In such cases, the model may fail to accurately understand the unique characteristics of IAFF, leading to an increased rate of False Negatives, where an IAFF is misclassified as a normal case. Therefore, it is crucial to emphasize IAFF features and understand their relationships with the surrounding information. While CNN models are adept at learning local patterns through convolutional kernels, they are limited in capturing long-range dependencies and broader contextual information. To overcome this limitation, we propose SAFE, an approach based on the self-attention mechanism [65].
SAFE treats each position of the LPFN output $L_n$ as a unique vector and learns the relationships between positions. To achieve this, spatial information is consolidated into a single dimension, converting the feature map into a sequence format (Equation (2)). A linear projection (Equation (3)) is then applied to derive three vectors: query (Q), key (K), and value (V) (Equation (4)). This reflects the input information directly, preserves the relationships between positional information, and flexibly transforms the dimensions based on data complexity and feature characteristics. Subsequently, SAFE is computed as shown in Equation (5):

$$S = \mathrm{Flatten}(L_n) \in \mathbb{R}^{HW \times C} \qquad (2)$$

$$f(X) = XA + b \qquad (3)$$

$$Q = f_Q(S), \quad K = f_K(S), \quad V = f_V(S) \qquad (4)$$

$$\mathrm{SAFE}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_K}}\right)V \qquad (5)$$

In Equation (3), $X$ represents the input vector or matrix, $A$ is the weight matrix, and $b$ denotes the bias vector. In Equation (5), $d_K$ refers to the dimension of $K$, $Q$ represents the feature vector of a specific location in the image, and $K$ contains information about all other locations. The similarity between $Q$ and $K$ is computed using the dot product to identify their interrelationships, assigning higher weights to more important positions. Subsequently, by applying the Softmax function [66] with a ‘soft-assignment’ approach, the values in the output vector are transformed into probabilities ranging from 0 to 1 with a total sum of 1. This prevents problems of divergence or convergence to 0, enabling the model to comprehensively learn relationships across multiple regions. $V$ represents the actual information at each location, and by weighting $V$ according to this probability distribution, more important information is emphasized with higher weights, allowing the model to focus on important features. This result is combined with the input sequence ($S$) after a linear transformation to generate the final outcome ($\hat{L}_n$), enabling efficient computation without distortion of the similarity calculation results.
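The following single-head self-attention sketch illustrates how SAFE could operate on an LPFN output; the projection dimensions, the residual combination with the input sequence, and the class and parameter names are assumptions consistent with the description above.

```python
import math
import torch
import torch.nn as nn

class SAFE(nn.Module):
    """Sketch of the Spatial Anomaly Focus Enhancer: single-head self-attention
    over the spatial positions of a feature map."""

    def __init__(self, channels, dim_qk=None):
        super().__init__()
        dim_qk = dim_qk or channels
        self.to_q = nn.Linear(channels, dim_qk)    # Eq. (3)/(4): linear projections to Q, K, V
        self.to_k = nn.Linear(channels, dim_qk)
        self.to_v = nn.Linear(channels, channels)
        self.proj = nn.Linear(channels, channels)  # linear transform before adding the input sequence
        self.scale = 1.0 / math.sqrt(dim_qk)       # 1 / sqrt(d_K)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)         # Eq. (2): (B, C, H, W) -> (B, HW, C)
        q, k, v = self.to_q(seq), self.to_k(seq), self.to_v(seq)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # Eq. (5)
        out = self.proj(attn @ v) + seq            # weighted values combined with the input sequence
        return out.transpose(1, 2).reshape(b, c, h, w)
```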
This approach preserves the original input information and enables the model to comprehensively understand the relationship between the input image context, surrounding information of IAFF, and IAFF itself by learning the correlations between each position in the image and all other positions. In addition, by assigning higher weights to anomalous regions and emphasizing IAFF, it prevents the model from overfitting to regions unrelated to IAFF. As a result, CFNet effectively focuses more on relevant regions and enhances the model’s ability to accurately detect even subtle differences.
Loss
We employ the Cross-Entropy loss function ($\mathcal{L}_{CE}$) to effectively model the mutually exclusive probability distributions between the IAFF and normal classes. $\mathcal{L}_{CE}$ measures the difference between the model’s predicted probability distribution and the target probability distribution, evaluating how closely the model’s output aligns with the target distribution. The equation is as follows:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(p_i)$$

where $C$ indicates the number of classes, $y_i$ denotes the target label, and $p_i$ represents the probability predicted by the model for class $i$. This loss function ensures that the predicted distribution aligns with the target distribution during training, thereby enhancing classification accuracy.
4. Experiments
4.1. Dataset
The University Hospital (UH) dataset was collected at Kyungpook National University Hospital (KNUH) between August 2010 and November 2022. It comprises 794 X-ray images from 236 patients, including 430 images from 92 patients with IAFF and 364 images from 144 patients in the normal group. The IAFF cases are further categorized into D-IAFF and S-IAFF, and three orthopedic specialists reviewed and classified the data into normal or IAFF types. The dataset comprises images of both the left and right femurs, obtained from different radiographic views, including Anteroposterior (AP), External Rotation (ER), Internal Rotation (IR), and Lateral (LT). To ensure data independence for each patient, reflecting real clinical settings, the training and evaluation sets are split by patient. Among the 794 collected images, 666 images (IAFF: 354, Normal: 312) were randomly selected for 5-fold cross-validation training, and the remaining 128 images (IAFF: 76, Normal: 52) were used for evaluation. This dataset was approved by the KNUH Institutional Review Board under approval number KNUH202402007-HE001 on 26 February 2024.
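A patient-wise split of this kind can be sketched with scikit-learn's GroupKFold, as below; the function and variable names are illustrative and this is not the exact splitting code used in the study.

```python
from sklearn.model_selection import GroupKFold

def patientwise_folds(image_paths, labels, patient_ids, n_splits=5):
    """Return train/validation index splits in which no patient appears in both sets."""
    gkf = GroupKFold(n_splits=n_splits)
    return list(gkf.split(image_paths, labels, groups=patient_ids))
```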
4.2. Data Preprocessing
To enhance data processing efficiency and optimize memory usage, we converted Digital Imaging and Communications in Medicine (DICOM) files into Numerical Python (NumPy) format. The data were then preprocessed using two different methods (Figure 6). (1) Crop (C): To reduce unnecessary information in the image, such as left/right markers and knee implants, we cropped the images to a size of 2200 × 2200, corresponding to the smallest data dimension. To preserve S-IAFF characteristics, only the bottom portion of the images was cropped for height, while both sides were symmetrically cropped for width. (2) Automated Extraction and Alignment (AEA): We employed a segmentation model to generate femur masks from the original X-ray images. After evaluating several models, U-Net++ was selected for its superior performance. To further refine the masks, the connectivity of pixel values in the generated mask was computed to eliminate noise, and the inlier set was extracted using the RANdom SAmple Consensus (RANSAC) algorithm [67]. The Hough transform [68] was then applied to determine the rotation angle for vertically aligning the mask. Additionally, the histogram was analyzed to exclude the knee and pelvis regions, ensuring that only the RoI of the femur was extracted from the mask. After preprocessing, the (C) images were resized to 1024 × 1024 and the (AEA) images to 1024 × 256 using bilinear interpolation. Finally, the pixel values were normalized to the range [0, 1].
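The sketch below outlines one possible AEA pipeline under simplifying assumptions: a placeholder `segment_femur` stands in for the trained U-Net++ model, the mask cleanup keeps only the largest connected component, and the rotation angle is derived from a RANSAC line fit to the femur axis rather than the Hough transform used here. It should be read as an approximation of the described steps, not the exact implementation.

```python
import numpy as np
import cv2
from skimage import measure
from sklearn.linear_model import RANSACRegressor

def aea_preprocess(xray, segment_femur, target_size=(256, 1024)):
    """Simplified Automated Extraction and Alignment (AEA) sketch for a 2D grayscale X-ray."""
    mask = segment_femur(xray)                               # femur mask (placeholder for U-Net++)

    # keep the largest connected component to suppress segmentation noise
    labeled = measure.label(mask > 0)
    largest = max(measure.regionprops(labeled), key=lambda r: r.area)
    mask = (labeled == largest.label).astype(np.uint8)

    # fit the femur axis with RANSAC and derive the angle needed to make it vertical
    ys, xs = np.nonzero(mask)
    ransac = RANSACRegressor().fit(ys.reshape(-1, 1), xs)    # x as a function of y (inliers only)
    angle = np.degrees(np.arctan(ransac.estimator_.coef_[0]))

    h, w = xray.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    xray = cv2.warpAffine(xray, rot, (w, h))
    mask = cv2.warpAffine(mask, rot, (w, h))

    # crop the femoral RoI using the row-wise mask histogram (excludes knee and pelvis rows)
    rows = np.where(mask.sum(axis=1) > 0)[0]
    roi = xray[rows.min():rows.max() + 1]

    # resize to width 256, height 1024 (i.e., 1024 x 256) and normalize to [0, 1]
    roi = cv2.resize(roi, target_size, interpolation=cv2.INTER_LINEAR).astype(np.float32)
    return roi / (roi.max() + 1e-8)
```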
4.3. Training Details
We conducted experiments using a 5-fold cross-validation approach to evaluate generalization performance and ensure an accurate comparison of results. For the experiments, we utilized models pretrained on ImageNet [69], fine-tuning each model without freezing any layers for 300 epochs with a batch size of 16 for each fold. We employed a Stochastic Gradient Descent (SGD) optimizer [70] with a learning rate set to and a momentum of 0.9. To enhance model robustness, flipping and rotation augmentations were applied. All methods were implemented in PyTorch, and experiments were conducted on a single NVIDIA RTX A6000 GPU (48 GB).
4.4. Evaluation Metrics
We evaluate the performance of the proposed model using several widely recognized classification metrics: accuracy, F1-score, AUROC, AUPRC, precision, recall, and specificity. These metrics provide a comprehensive assessment of the model’s ability to classify IAFF and normal cases. In these metrics, True Positive (TP) refers to cases where the model correctly predicts positive (IAFF) samples, True Negative (TN) represents cases where the model correctly identifies negative (normal) samples, False Positive (FP) indicates instances where the model incorrectly predicts negative samples as positive, and False Negative (FN) denotes cases where positive samples are misclassified as negative. All metrics range from 0 to 1, with values closer to 1 indicating higher performance.
Accuracy: This metric evaluates the overall correctness of the model’s predictions by calculating the ratio of correctly predicted cases (both TP and TN) to the total number of cases:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

F1-score: The F1-score is the harmonic mean of precision and recall, providing a balance between these two metrics. It is particularly useful when precision and recall are in a trade-off. A value closer to 1 suggests that both precision and recall are high, while a value closer to 0 indicates a deficiency in one or both metrics:

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

AUROC: AUROC represents the area under the Receiver Operating Characteristic (ROC) curve, where the x-axis plots the False Positive Rate (FPR) and the y-axis plots the True Positive Rate (TPR). This metric evaluates the model’s ability to distinguish between positive and negative classes across various thresholds. A higher AUROC indicates that the model can effectively reduce False Positives while maximizing True Positives at various threshold levels.

AUPRC: AUPRC measures the area under the Precision–Recall (PR) curve, where precision is plotted on the y-axis and recall on the x-axis. This metric focuses on the model’s ability to identify positive samples, excluding negative class performance. A higher AUPRC represents a model’s effectiveness in identifying positive cases while minimizing False Positives. An AUROC or AUPRC value of 1 indicates an ideal model, signifying perfect classification, while 0.5 suggests performance equivalent to random guessing. Values below 0.5 indicate that the model performs worse than random chance, reflecting a tendency to misclassify instances.

Precision: Precision measures the proportion of True Positive cases among all cases predicted as positive, highlighting the significance of False Positives. A high precision value implies fewer False Positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall: Recall measures the proportion of True Positive cases among all actual positive instances, focusing on minimizing False Negatives. A higher recall suggests that the model is less likely to miss positive cases:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Specificity: Specificity quantifies the model’s ability to correctly identify negative cases, with a focus on minimizing False Positives. A high specificity indicates that the model effectively avoids misclassifying negative samples as positive:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
4.5. Comparison of Classification Performance with Other State-of-the-Art Models
We evaluate the performance and suitability of the proposed model for IAFF classification by comparing it with state-of-the-art models, which also serve as our baseline models. As shown in Table 1, among the baseline models with crop preprocessing, ResNet-50, a widely used CNN model, exhibits relatively poor performance across all metrics, failing to accurately capture the features of small and ambiguous IAFFs. The latest model, MobileNetV4 [71], is highly lightweight, making it challenging to adequately capture small and ambiguous IAFF features; as a result, it shows very low performance, similar to ResNet-50. Although DenseNet-121, EfficientNet-B2, ConvNeXt V2 [72], and RDNet [73] show improved performance over traditional CNN models such as ResNet-50 and GoogLeNet [20], their results are still insufficient. Vision Transformer-based [74] models, such as RepViT [75] and FastViT [76], despite their recent success and remarkable performance on various tasks, show similar or even inferior performance compared to traditional CNN models due to the limited amount of data. In contrast, EdgeNeXt [77], FocalNet [78], and VGG16 outperform other models, with VGG16 achieving the highest performance among the state-of-the-art models. However, these baseline models face challenges in distinguishing noise, artifacts, and common deformations from IAFFs, resulting in high recall but lower precision and specificity.
Even with AEA preprocessing, ResNet-50 and MobileNetV4 still show relatively lower performance compared to others. In contrast, EfficientNet-B1, EfficientNet-B3, and FastViT, which perform poorly with crop preprocessing, show improved performance and achieve results comparable to or surpassing DenseNet-121 and EfficientNet-B2. ConvNeXt V2 and RDNet also demonstrate enhanced performance, attaining results comparable to EdgeNeXt, FocalNet, and VGG16. Notably, VGG16 achieves the highest performance among all models, even with AEA preprocessing. While AEA preprocessing reduces noise and artifacts, leading to a relatively higher precision due to decreased misclassifications, these models still struggle to capture IAFF characteristics accurately. Consequently, they tend to misclassify IAFFs as normal cases or confuse regular deformations and normal regions with IAFFs, leading to lower recall and specificity.
When comparing the effects of different preprocessing methods, it is observed that most baseline models, with the exception of EfficientNet-B2, EdgeNeXt, and FocalNet, achieve approximately 4% to 6% improvement in accuracy and F1-score with AEA preprocessing compared to the crop method. This finding suggests that excessive information from normal regions and backgrounds can hinder the model’s ability to learn IAFF characteristics. Moreover, the three aforementioned models also demonstrate enhanced performance in distinguishing between IAFF and normal cases, resulting in increased AUROC and AUPRC, as well as higher precision and specificity performance compared to crop preprocessing results. Despite these improvements, these models still tend to misclassify small and ambiguous IAFFs as normal cases, resulting in relatively lower recall performance.
By integrating our proposed model with these baselines, we observe notable improvements across all metrics and models, regardless of the preprocessing method. Notably, ResNet-50 shows significant improvements, with accuracy and F1-score increasing by approximately 9% with crop preprocessing and 6% with AEA preprocessing. Specifically, when CFNet is integrated with VGG16 and AEA preprocessing is applied, it outperforms all other models across every metric, achieving accuracy, F1-score, AUROC (Figure 7), and AUPRC scores of 0.931, 0.9456, 0.9694, and 0.9854, respectively. Additionally, the proposed method demonstrates a significant improvement in recall across all models compared to the baselines, showing its effectiveness in mitigating the False Negative problem. Furthermore, even when crop preprocessing is applied, the proposed method accurately learns and classifies IAFFs without bias, despite the abundance of normal regions and background information. The results of this experiment validate that the proposed method is applicable to various models and significantly enhances performance, thereby demonstrating the superiority and suitability of our approach for IAFF classification.
4.6. Experiments on DCCE and LPFN Group Configurations
We evaluate the performance of CFNet across different configurations of the DCCE and LPFN groups, utilizing VGG16 as the baseline due to its superior performance in previous experiments. In accordance with our proposed methodology, we first count the total number of layers and divide them into n (where n = 3, 4, 5) equal groups. Subsequently, each DCCE block is defined based on the nearest unit block or stage as a division point. As shown in Table 2, the 3-group configuration exhibits lower performance across all metrics for both preprocessing methods compared to the 4-group and 5-group configurations, primarily due to the reduced extraction and learning of feature details in the 3-group setup. Nevertheless, owing to the effectiveness of the SAFE component, recall remains excellent. The 5-group configuration extracts and integrates features at a more detailed level, utilizing relatively more information than the 4-group configuration. Consequently, while it exhibits slightly higher accuracy and F1-score, it shows lower AUROC and AUPRC, indicating a lack of robustness across all thresholds. Furthermore, although the increased sensitivity to IAFF in the 5-group configuration slightly enhances precision and recall, it also leads to a rise in False Positives, thereby reducing specificity, as the model misclassifies more normal cases as IAFF. Ultimately, while the 5-group configuration demonstrates comparable or better performance than the 4-group setup, the increased training time and resource consumption render it less practical. Therefore, for a more balanced and practical approach, we select the 4-group configuration for DCCE and LPFN in our final model.
4.7. Performance Comparison Based on the Number of Input Slices
We evaluate the performance based on the number of image slices fed into the second branch. Following the approach utilized in our proposed method, we divided the image into n slices (where n = 3, 4, 5) from top to bottom based on the height of the image prior to inputting them into the second branch. We utilize VGG16 as the baseline, with all conditions kept the same as the proposed model except for the number of input slices. As shown in Table 3, when using 3 slices, the model can capture a relatively broad range of context and IAFF characteristics. However, due to the lower resolution of each slice compared to n = 4 or 5, it struggles to capture the fine details of IAFF, resulting in slightly lower overall performance than the 4-slice configuration. Moreover, when using 5 slices, the performance decreases further, showing similar or even lower results compared to the 3-slice configuration. This decline is likely due to the height of the sliced images being too short relative to their width, limiting the model’s ability to utilize contextual information and features surrounding the IAFF in the femur. As a result, the understanding of the relationship between the IAFF features and the surrounding information diminishes, making it difficult to distinguish IAFFs from other deformities or normal regions, which in turn leads to a decline in performance. Based on these experimental results, we conclude that the optimal configuration for our model is to use 4 DCCE and LPFN groups, along with 4 input slices for the second branch.
4.8. Classification Performance Analysis Using Confusion Matrix
To implement the IAFF classification model in real medical settings, achieving high accuracy is essential; however, it is equally important to conduct a thorough analysis of False Positives (Type I errors) and False Negatives (Type II errors). In particular, Type II errors, where an actual IAFF is missed, can have severe consequences, as patients may not receive timely treatment. Therefore, we evaluate these errors and the overall performance of the model using a confusion matrix. VGG16 is set as the baseline, and its results are compared with CFNet. As shown in Table 4, VGG16 with crop preprocessing achieves a True Positive count of 69 and a True Negative count of 44, indicating acceptable overall performance. However, with 8 False Positives and 7 False Negatives, it relatively frequently misses IAFFs or misclassifies normal cases as IAFFs, especially in data with small and ambiguous IAFF features. This occurs because crop preprocessing includes unnecessary information outside the femur region, causing the model to misinterpret noise and artifacts in normal data as IAFF or to overlook IAFF features due to the distraction of irrelevant details. However, CFNet shows improved performance over the baseline model, with 6 False Positives and 4 False Negatives. As shown in Table 5, with AEA preprocessing, VGG16 reduces False Positives to 6, but the False Negatives remain unchanged compared to Table 4. This result indicates that using only the femur region as input reduces extraneous information, leading to fewer misclassifications. However, small and ambiguous IAFFs continue to be missed, resulting in a high False Negative rate. In contrast, CFNet improves overall performance by leveraging complementary, rich features and understanding the surrounding context, allowing it to better distinguish IAFF from other information in the input image. Additionally, SAFE emphasizes anomalous regions, enhancing the identification of these areas and minimizing both False Negatives and False Positives. As a result, CFNet based on VGG16 achieves the best performance, with 5 False Positives and 3 False Negatives, showcasing exceptional accuracy and reliability. These results highlight the potential of CFNet for practical application in real medical settings.
We also analyze the cases in which False Negatives and False Positives occur in CFNet. Typically, IAFF manifests in the lateral cortex of the femur and exhibits characteristics of cortical buckling. However, in the LT view, IAFF may present with entirely different features, such as fine line patterns, or in some instances, no noticeable features at all, as illustrated in Figure 8a. When no discernible features are present, specialists rely on information from other radiographic views to make a diagnosis. While the model accurately classifies data with different features, False Negatives are observed in cases where no features are visible in the LT view. Moreover, the femur is susceptible to a variety of deformities and diseases beyond IAFF, some of which display characteristics nearly identical to IAFF, as shown in Figure 8b. False Positives occur in such cases, where the model mistakenly identifies these deformities as IAFF.
4.9. Analysis of Classification Performance and Robustness Across Different X-Ray Radiographic Views
The shapes and characteristics of IAFFs vary depending on the radiographic view, with certain views showing nearly absent distinguishing features, making it challenging for the model to learn and leading to potential misclassifications. In this experiment, we evaluate the classification performance and robustness of the model across various radiographic views. We use VGG16 as the baseline model and compare performance using accuracy and F1-score. As shown in Table 6, the baseline with crop preprocessing achieves relatively high performance in the AP view, where IAFF features are more distinguishable from normal regions. However, the model shows lower performance in the LT view, where the distinguishing features are either insufficient or less prominent. With AEA preprocessing, the removal of irrelevant information allows the model to focus on essential features, leading to an overall improvement in baseline performance. Nonetheless, performance in the LT view remains lower compared to other views because the LT view exhibits completely different characteristics. Additionally, the limited data make it difficult for the baseline model to learn these features effectively. In contrast, the proposed method significantly improves performance across all radiographic views. With crop preprocessing, the proposed method improves accuracy by approximately 6.6% and F1-score by 4.1% in the ER view, where the baseline performance is initially low. In the LT view, the proposed method achieves notable improvements, with a 5.4% increase in accuracy and a 6.6% increase in F1-score. Similarly, AEA preprocessing with the proposed method also demonstrates overall performance enhancements, with LT-view accuracy increasing by 5.4% and F1-score by 5.7%. These experimental results validate that the proposed method effectively captures even insufficient and ambiguous features, as well as very subtle differences. It also demonstrates robustness and high performance across all radiographic views regardless of the preprocessing method, highlighting the model’s suitability and superiority.
4.10. Performance Analysis Based on Parameter and Execution Time
We apply our proposed CFNet to baseline models to evaluate performance in terms of the number of parameters as well as execution (training and inference) times. For comparison, we select ResNet-50 as a representative CNN model, VGG16 as a high-performance model, MobileNetV4 as the latest lightweight model, and FastViT as a Vision Transformer-based model. All models are compared under AEA preprocessing conditions. The training time is measured over 300 epochs, including both training and validation phases, while inference time is measured based on processing a total of 128 test samples. As shown in Table 7, the ResNet-50-based model requires a relatively large number of parameters and shows a longer execution time while demonstrating the lowest performance. The MobileNetV4-based model has the fewest parameters and the shortest execution time, making it suitable for deployment in medical devices. However, its low parameter count limits the model’s representational capacity, reducing its ability to accurately identify small and ambiguous IAFF features and resulting in low performance similar to the ResNet-50-based CFNet. In contrast, FastViT achieves relatively high performance with roughly half the parameters of the ResNet-50-based model. Notably, the VGG16-based model demonstrates even more robust and superior performance, with a parameter count and execution time comparable to FastViT. Additionally, its short inference time allows for prompt assistance to medical professionals, underscoring its suitability for real clinical application.
4.11. Ablation Study
We evaluate the contribution of each component of CFNet to the overall classification performance. Table 8 presents the results of our ablation study based on VGG16, which achieved the highest performance in previous experiments. As the results show, each of our novel components provides a significant performance enhancement to the baseline model. While the baseline model shows decent performance with both preprocessing methods, it struggles with noise, artifacts, and deformations, often misclassifying them as IAFF and failing to capture fine details, resulting in lower precision and specificity. Adding DCCE and LPFN to the baseline enables the model to learn both the overall characteristics of the femur and IAFF features in conjunction with the surrounding contextual information. This integration reduces errors in misclassifying normal regions as IAFF, leading to an improvement in precision. However, False Negatives persist, resulting in relatively low recall. When applying SAFE to the baseline, the model effectively learns long-range dependencies and relationships between IAFF features and broader contextual information, leading to notable improvements in AUROC and AUPRC. Additionally, SAFE emphasizes anomalous regions and captures tiny IAFFs, reducing the False Negative rate and improving recall. However, False Positives still affect the model’s specificity. In contrast, our proposed model, which integrates the strengths of each component, accurately learns IAFF characteristics while minimizing the misclassification of normal cases as IAFFs. Consequently, the proposed method achieves high performance across precision, recall, and specificity. Furthermore, the proposed approach clearly distinguishes between IAFF and normal cases, resulting in superior AUROC and AUPRC and achieving the highest performance across all metrics. These findings demonstrate that our proposed method offers high reliability in IAFF classification and holds significant potential for practical application in real clinical settings.
5. Discussion
In this study, we experimentally demonstrate that applying the IAFF diagnostic method commonly used by specialists to CFNet significantly enhances overall performance. We also show that integrating DCCE with LPFN, along with applying SAFE, is highly effective for classifying tiny and ambiguous IAFFs. CFNet can extract features from femoral X-ray images using various state-of-the-art models. In this process, using only a single entire image may result in the loss of critical details of tiny IAFFs, while relying exclusively on small patches can hinder accurate learning due to the restricted context information. To address these challenges, we propose DCCE, which consists of two branches to capture both overall features and fine-grained details along with the surrounding context of IAFFs. This approach mirrors the clinical process, where specialists first review the entire X-ray image and then focus on regions where IAFFs commonly occur. Additionally, to integrate features extracted from DCCE at different levels while preserving their unique significance, we propose LPFN. LPFN prevents the misclassification of noise and artifacts as IAFF, enhances sensitivity by capturing detailed information from high-resolution images, and provides complementary information. LPFN operates similarly to how clinicians analyze and compare overall structures and frequently occurring regions. However, since IAFF features are often ambiguous and difficult to distinguish from deformations, there remains a risk of missing IAFF, and their small size can lead to model bias towards normal and background regions. To address this, we incorporate SAFE, which captures long-range dependencies that are challenging for CNNs alone and helps understand spatial relationships more effectively. Furthermore, SAFE assigns higher weights to abnormal regions, emphasizing IAFF features and preventing the model from overfitting to normal regions.
Despite the promising results, there are still limitations and room for improvement. Experimental results show that in cases where IAFF features are either entirely absent, such as in some LT images, or at a very early stage, the model has difficulty accurately classifying IAFF. As a result, there is a risk of an IAFF going unnoticed by both patients and medical professionals, potentially delaying treatment. This could lead to progression to a complete fracture or significant worsening of the IAFF, making treatment more complex and challenging. To prevent this problem, based on previous studies indicating a close association between IAFF and conditions such as osteoporosis and femoral curvature, we plan to collect relevant metadata and extend CFNet to support multimodal learning alongside image data, further improving classification accuracy in all cases. Additionally, the proposed method has demonstrated significantly improved performance with limited data; however, to ensure the generalizability of our approach, more data are required. Currently, there are no publicly or privately available IAFF datasets, so we are actively collecting additional data from KNUH to validate the generalization performance of the proposed method. Furthermore, to demonstrate the model’s robustness through external validation, we are collecting data from other university hospitals with varying X-ray conditions. The goal of the proposed method is to be applied in real medical settings, and we will validate the model’s generalizability and robustness through the collection of additional data and further experiments.
6. Conclusions
In this article, we propose CFNet, a novel approach to accurately classify even tiny and ambiguous IAFFs without missing any. In our model, DCCE recognizes contextual information and extracts complementary information on both the overall femur features and detailed regions at multiple levels, addressing problems of information loss and limited contextual understanding. By introducing LPFN, our model preserves the unique meaning of features at each level, enabling seamless feature fusion without interference. This design supports the stable learning of complex relationships and enhances classification accuracy and sensitivity. In addition, SAFE comprehensively captures spatial dependencies and emphasizes anomalous regions, minimizing missed IAFFs and preventing overfitting to normal regions. Experimental results demonstrate that each component of CFNet contributes meaningfully to the model’s improved overall performance.
For future work, we note that the current approach divides the selected model into four equal segments based on the number of layers and assigns DCCE blocks according to adjacent unit blocks or stages. We plan to enhance the DCCE module by incorporating techniques such as receptive field analysis or filter response evaluation. These methods will allow for a more precise assessment of the feature map levels extracted at each layer and enable a more adaptive assignment of DCCE blocks tailored to each model. Additionally, since the femur can present with various diseases, deformities, and fractures that may mimic IAFF characteristics, we aim to incorporate a Large Language Model (LLM) [79] to leverage extensive domain knowledge, enabling CFNet to analyze predictions and offer insights into potential conditions beyond IAFF. This approach is expected to increase the model’s reliability, making it a valuable tool in clinical settings for accurate patient risk assessment and for guiding preventive measures.