1. Introduction
Cone beam computed tomography (CBCT) is a state-of-the-art 3D medical imaging technology that employs a cone-shaped X-ray beam to produce high-resolution images of anatomical structures. One of its standout attributes is high spatial resolution, enabling the visualization of fine details, particularly in hard tissues. This feature has made CBCT increasingly prominent in orthopedics, where it supports implant planning, joint assessment, and the evaluation of traumatic injuries, ultimately enhancing patient outcomes and delivering personalized care [1]. Additionally, compared to traditional CT imaging, CBCT can deliver detailed images at a significantly lower radiation dose, making it a valuable tool in emergency departments and surgical rooms, particularly for diagnosis and pre-surgical planning [2,3].
Currently, CBCT is especially beneficial in maxillofacial and oral surgery. For instance, Bhat et al. [4] proposed a workflow for dental pre-surgical planning using immersive virtual reality and CBCT data. Lalonde et al. [5] utilized 3D-printed models derived from CBCT data to treat a rare type of dens invaginatus in a mandibular incisor, highlighting the utility of 3D models in guiding proper treatment. The generation of accurate 3D models relies on precise bone segmentation [6].
Owing to the compact dimensions of the scanning machine, CBCT technology is well suited not only to scanning the head and maxillofacial region, but also to other extremity regions such as the foot, hand and wrist, limbs, and joints. However, bone segmentation in these anatomical regions is challenging. Unlike long bones, extremities and joints exhibit weak bone borders, fluctuating densities of cancellous tissue, and small inter-bone spacing. Furthermore, extremities consist of many small, asymmetrically shaped structures with different densities, such as those in the hand, wrist, and foot. Additionally, unlike conventional CT, CBCT grayscale values are directly associated with X-ray attenuation and lack the standardization provided by Hounsfield unit (HU) calibration, introducing variations in image intensity characteristics that complicate the segmentation process.
The existing literature extensively covers bone segmentation in CBCT, primarily in dental and maxillofacial surgery contexts [7]. The few examples of CBCT segmentation of extremities mainly focus on weight-bearing scans of the foot and ankle; studies addressing extremity and joint segmentation from CBCT scans remain scarce.
The U-Net architecture [8] is a widely used convolutional neural network (CNN) architecture for medical image segmentation. U-Net has been applied to conventional CT segmentation of various bony structures [9,10,11,12,13,14,15]. In the context of CBCT data, applications of U-Net have been demonstrated in the maxillofacial region. For instance, Lin et al. [16] used U-Net to accurately segment the mandibular canal in CBCT data, and Zhang et al. employed DBA-U-Net for maxillary sinus segmentation [17].
The first CNNs developed to segment three-dimensional images acquired using magnetic resonance imaging (MRI) or computed tomography (CT) were trained on two-dimensional image slices [18]. The majority of these CNNs used axial slices as the input [19,20], due to the high in-plane resolution compared to the slice thickness [21]. CBCT, by contrast, offers high volumetric resolution: 3D CBCT images have isotropic voxels, i.e., the voxel dimensions are the same in all three spatial directions. Axial training does not exploit any of this 3D information. Zhou et al. [22] proposed training separate CNNs for each of the three orthogonal slice orientations and classifying each voxel through a majority voting scheme.
This paper proposes a deep learning-based bone segmentation tool for CBCT orthopedic imaging. The aim is to provide an easy-to-use segmentation and 3D modeling workflow for intricate anatomical districts such as ankles, wrists, and joints. This workflow consists of three main steps: bone segmentation, separation, and 3D modeling, designed to be highly intuitive and user-friendly. The main innovation of our work lies in the tool's ability to deliver high-quality results with minimal user interaction. This work utilizes a U-Net architecture trained with a strategy particularly suited to cone beam CT, leveraging the isotropic nature of CBCT data. In a previous work [23], the authors introduced the workflow with a focus on segmentation results. This paper extends those findings by providing a detailed comparison between the proposed and state-of-the-art methods, along with an evaluation of the 3D models and the user interface.
2. Materials and Methods
This paper proposes a deep learning-based workflow using the U-Net architecture to accurately segment extremities and joints in high-resolution cone beam CT scans. As illustrated in Figure 1, bones in images acquired with a commercial CBCT scanner undergo segmentation and are separated from the surrounding soft tissues. Binary segmentation is performed using U-Net. The bones are then separated by applying the watershed algorithm. Finally, through the user interface, the user can select the bone to be modeled in three dimensions. A simple and intuitive user interface was developed for the bone segmentation and modeling tool. The workflow was optimized to require minimal user intervention, relying primarily on deep learning algorithms, which automate most of the process.
2.1. Binary Semantic Segmentation
2.1.1. Data
Anatomical preparations were scanned using a commercial CBCT device, the See Factor CT3 (Imaginalis, Florence, Italy). The Feldkamp, Davis, and Kress (FDK) algorithm was employed to reconstruct the scans [24]. The volumetric data have an isotropic resolution of 0.2 mm. Scans were performed under varying acquisition parameters (kV and mA).
To train and evaluate the proposed deep learning models, an in-house annotated dataset of CBCT scans was created. A total of fourteen CBCT scans, acquired with different parameters, were considered. These scans focus on extremities, where bone segmentation is particularly challenging due to the high number of adjacent bones and the similar gray levels between spongy tissues and soft tissues.
To generate ground truth labels, the scans were masked using 3D Slicer (version 5.6), an open-source software package developed by the Surgical Planning Laboratory at Brigham and Women's Hospital. Initially, a custom threshold was applied to each scan to isolate cancellous tissues. Subsequently, manual segmentation refinements were performed using the 3D Brush tool in 3D Slicer to ensure accuracy.
2.1.2. Network Architectures
The first stage of the proposed workflow involves binary semantic segmentation. A simple network architecture with fewer parameters was implemented to achieve faster inference times, which is critical for ensuring a good user experience in clinical applications. This decision was based on the need for a practical tool that can be easily adopted in routine workflows without requiring extensive computational resources. Consequently, the approach was compared with other methods that share a similar focus on simplicity and efficiency. To this end, a neural network based on the well-known U-Net architecture was implemented, and its performance was compared with another encoder–decoder architecture designed for accurate binary segmentation tasks: SegNet.
The U-Net architecture [8] is a CNN designed for semantic segmentation tasks, where the goal is to assign a label to each pixel in an input image. It has an encoder–decoder path. The encoder path consists of convolutional and pooling layers, which progressively downsample the input image. Each convolutional layer is followed by a rectified linear unit (ReLU) activation function. Max pooling operations are applied to reduce the spatial dimensions of the feature maps while increasing the receptive field. The decoder path consists of upsampling and convolutional layers, which gradually upsample the feature maps to the original input resolution. Skip connections are introduced between corresponding layers in the encoder and decoder paths to preserve spatial information. These connections directly link layers at the same spatial resolution in the encoder and decoder paths. By concatenating feature maps from the encoder with those in the decoder, skip connections enable the decoder to access high-resolution features from earlier stages of the network. This helps the decoder refine the segmentation masks by incorporating detailed spatial information that may have been lost during downsampling.
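To make this structure concrete, the following is a minimal 2D U-Net sketch in PyTorch; the depth, filter counts, and single-channel input/output are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal 2D U-Net sketch (illustrative depth and filter counts).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU, as in the original U-Net.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 4, base * 8)
        self.up3 = nn.ConvTranspose2d(base * 8, base * 4, 2, stride=2)
        self.dec3 = conv_block(base * 8, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)  # 1x1 conv -> per-pixel logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        # Skip connections: concatenate encoder features at equal resolution.
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)  # apply sigmoid + threshold for a binary mask
```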
SegNet has an encoder–decoder architecture with a symmetrical contracting and expanding structure. Unlike U-Net, which uses skip connections, SegNet uses pooling indices to perform upsampling. These indices store the locations of max pooling during the downsampling phase and are used to retain fine-grained details during the upsampling. SegNet was implemented as described by Badrinarayanan et al. [25].
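The pooling-indices mechanism can be illustrated in isolation with PyTorch; this is a sketch of the unpooling operation only, not the full SegNet of [25].

```python
# SegNet-style unpooling: max pooling returns the indices of the maxima,
# and MaxUnpool2d places values back at exactly those locations.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, indices = pool(x)           # indices record where each max came from
restored = unpool(pooled, indices)  # non-max positions are filled with zeros
print(restored.shape)               # torch.Size([1, 1, 4, 4])
```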
2.1.3. Training Strategies
Three distinct training strategies were assessed and compared for U-Net and SegNet. The first strategy, referred to as axial training, is the traditional 2D training method, in which the networks are trained on axial slices only. This strategy does not take into account any 3D information; to overcome this, the so-called 2.5D training strategies described below were considered.
The second strategy is majority voting (MV), in which separate convolutional neural networks (CNNs) are trained for each of the three orthogonal slices. The prediction for a voxel is determined by aggregating predictions from all three CNNs, and the voxel is considered part of the foreground if at least two of the CNNs predict it as the foreground.
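As a sketch, the voting step can be expressed in a few lines of NumPy, assuming three binary prediction volumes of equal shape produced by the axially, sagittally, and frontally trained networks:

```python
# Voxel-wise majority voting (a sketch): a voxel is foreground when at
# least two of the three orthogonal-plane networks predict it as bone.
import numpy as np

def majority_vote(pred_axial, pred_sagittal, pred_frontal):
    # Inputs: binary volumes of identical shape, one per trained network.
    votes = (pred_axial.astype(np.uint8)
             + pred_sagittal.astype(np.uint8)
             + pred_frontal.astype(np.uint8))
    return votes >= 2
```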
Lastly, an augmented 2D training strategy was evaluated, referred to as multi-planar training (MPT). In this approach, the network is trained with a dataset composed of slices from all three orthogonal planes (axial, sagittal, and frontal), so that each batch during training contains images from the three views. The networks under evaluation comprise six variations: U-Net with axial training, MPT, and MV, and SegNet with axial training, MPT, and MV training, as shown in Figure 2.
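A sketch of how such a multi-planar slice set can be assembled from an isotropic volume follows; the axis ordering is an illustrative assumption, and non-cubic volumes would yield different slice shapes per axis, requiring padding or cropping.

```python
# Building a multi-planar training set from an isotropic CBCT volume:
# slices are taken along all three orthogonal axes so that each training
# batch can mix axial, sagittal, and frontal views.
import numpy as np

def multiplanar_slices(volume):
    # volume: (Z, Y, X) array with isotropic voxels
    axial    = [volume[k, :, :] for k in range(volume.shape[0])]
    frontal  = [volume[:, k, :] for k in range(volume.shape[1])]
    sagittal = [volume[:, :, k] for k in range(volume.shape[2])]
    return axial + frontal + sagittal  # shuffle before batching
```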
A workstation with a GeForce RTX 2070 SUPER GPU (NVIDIA Corporation, Santa Clara, CA, USA) was used for training. We employed the Adam optimizer, and the networks were trained for 250 epochs using a batch size of 16. Early stopping and learning rate decay were integrated as callbacks. Before training, pixel intensity values were normalized to standardize the input images. Extensive data augmentation was applied during training, including random rotations, flips, shifts, and zooms, as well as the addition of noise and adjustments to brightness and contrast. Moreover, the dataset was split by volume to prevent data leakage, ensuring that every slice from a given volume belonged either to the training set or to the test set. In this way, the independence of the training and test data was preserved, supporting an accurate and unbiased evaluation of the model's generalization. Four volumes were used for testing and eight for training; two further volumes, distinct from the training and test sets, served as the validation set.
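The volume-wise split can be sketched as follows; the scan identifiers are hypothetical, and the 8/4/2 partition mirrors the counts reported above.

```python
# Volume-wise dataset split (a sketch) so that all slices of a given scan
# fall into exactly one subset; scan IDs are hypothetical placeholders.
volume_ids = [f"scan_{i:02d}" for i in range(14)]
train_ids, test_ids, val_ids = volume_ids[:8], volume_ids[8:12], volume_ids[12:]

def subset(slice_records, ids):
    # slice_records: list of (volume_id, image, mask) tuples
    wanted = set(ids)
    return [r for r in slice_records if r[0] in wanted]
```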
2.1.4. Metrics
To quantitatively assess the performance of our training strategies and facilitate comparison, the predictions of our networks were compared to the ground truth by calculating the following well-established segmentation metrics: the Jaccard index (JI) and the Dice coefficient (DC). The Jaccard index evaluates the similarity between two sets as the ratio of the size of their intersection to the size of their union.
The Dice coefficient serves as a metric for quantifying the similarity between two sets. Specifically, it is calculated by dividing the size of the intersection of the sets by the average size of the sets.
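Written out, with $|\cdot|$ denoting the number of pixels in a set, the two metrics are:

$$\mathrm{JI}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{DC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$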
In the given context, set A comprises pixels labeled as positive (e.g., bone pixels) in the ground truth, while set B encompasses pixels predicted as bone by the CNN being evaluated. Both coefficients have a range from 0 to 1, with 1 being the optimal value.
2.2. Instance Segmentation
The results of binary segmentation were processed to label each bone separately. Initially, a binary hole-filling operation was applied, effectively closing small voids within the segmented regions. Subsequently, a binary erosion operation was employed to refine the boundaries of the filled mask. Following these pre-processing stages, a distance transform was computed based on the filled binary mask. The resulting distance map encodes the Euclidean distance from each foreground pixel to the nearest background pixel. Using the computed distance transform, markers for the watershed algorithm are identified through a thresholding process.
The watershed algorithm was then applied, utilizing the negative of the distance transform as a gradient image and the identified markers as seeds for region segmentation.
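A sketch of this instance-separation step using SciPy and scikit-image is given below; the marker threshold is a hypothetical tuning value, and the exact operators used in this work may differ.

```python
# Instance-separation sketch following the steps above; `marker_thresh`
# (in voxels) is a hypothetical tuning value.
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def separate_bones(binary_mask, marker_thresh=5.0):
    filled = ndi.binary_fill_holes(binary_mask)      # close small voids
    refined = ndi.binary_erosion(filled)             # refine boundaries
    dist = ndi.distance_transform_edt(refined)       # distance to background
    markers, _ = ndi.label(dist > marker_thresh)     # one seed per bone core
    labels = watershed(-dist, markers, mask=filled)  # flood from the seeds
    return labels                                    # integer label per bone, 0 = background
```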
2.3. Three-Dimensional Model
To accurately model the bone in 3D, a function implementing the Lewiner algorithm [26] for extracting iso-surfaces from 3D volumetric data was used. This algorithm is an enhanced version of the original marching cubes algorithm [27], providing faster performance and resolving ambiguities to ensure topologically correct results. Specifically, it uses a refined set of lookup tables to handle all possible configurations of surface intersections within a cube. This approach not only improves the accuracy of surface reconstruction, but also ensures the robustness of the generated meshes, making it particularly suitable for complex and high-resolution datasets. The implementation allows for efficient processing and accurate depiction of iso-surfaces, contributing to the visualization and analysis of the 3D data generated by the segmentation process in our research.
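With scikit-image, the Lewiner variant is available through `measure.marching_cubes`; the following sketch assumes a binary mask of one selected bone from the instance segmentation step and the 0.2 mm isotropic voxels described above.

```python
# Surface extraction sketch via scikit-image's Lewiner marching cubes.
from skimage import measure

def bone_mesh(bone_mask, voxel_mm=0.2):
    # bone_mask: 3D binary array of the selected bone
    verts, faces, normals, values = measure.marching_cubes(
        bone_mask.astype(float),
        level=0.5,                               # iso-surface between soft tissue and bone
        spacing=(voxel_mm, voxel_mm, voxel_mm),  # isotropic 0.2 mm voxels
        method='lewiner',
    )
    return verts, faces
```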
To evaluate the results in terms of modeling, a surface-based metric was used to calculate the distance between the obtained mesh and a reference mesh derived from the manual segmentation of bone. The Hausdorff distance is a measure of the extent to which two subsets of a metric space are close to each other. More formally, given two non-empty subsets $A$ and $B$ of a metric space with a distance function $d$, the Hausdorff distance $d_H(A, B)$ is defined as:

$$d_H(A, B) = \max\left\{\, \sup_{a \in A} \inf_{b \in B} d(a, b),\; \sup_{b \in B} \inf_{a \in A} d(a, b) \,\right\}$$

where the following applies:
$\inf_{b \in B} d(a, b)$ measures the shortest distance from a point $a$ in set $A$ to any point in set $B$.
$\sup_{a \in A} \inf_{b \in B} d(a, b)$ then considers the farthest of these shortest distances over all points in $A$. This ensures that every point in $A$ is close to some point in $B$.
Similarly, $\sup_{b \in B} \inf_{a \in A} d(a, b)$ ensures that every point in $B$ is close to some point in $A$.
The maximum of these two quantities is taken to make the distance symmetric and to reflect the greatest extent to which one set can be far from the other.
In brief, the Hausdorff distance is the greatest of all the distances from a point in one set to the closest point in the other set, ensuring that both sets are close to each other in a symmetrical sense. Smaller values of the Hausdorff distance indicate better performance.
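Computed over mesh vertices, the symmetric Hausdorff distance can be sketched with SciPy as follows, treating the two vertex arrays as the point sets $A$ and $B$:

```python
# Symmetric Hausdorff distance between two vertex sets with SciPy
# (a sketch; verts_pred and verts_ref are (N, 3) arrays in mm).
from scipy.spatial.distance import directed_hausdorff

def hausdorff(verts_pred, verts_ref):
    d_ab = directed_hausdorff(verts_pred, verts_ref)[0]  # sup_a inf_b d(a, b)
    d_ba = directed_hausdorff(verts_ref, verts_pred)[0]  # sup_b inf_a d(a, b)
    return max(d_ab, d_ba)
```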
2.4. User Interface
The user interface (UI) of the bone segmentation and modeling tool has been designed with a focus on simplicity and usability; a schematic representation is given in Figure 3. The UI presents a clean and organized layout, with essential tools and options easily accessible. The main screen includes a central viewing area for the CBCT images and a sidebar with segmentation tools and options.
The UI provides real-time feedback, allowing users to see the immediate effects of their actions. Additionally, the UI guides users through the segmentation and modeling process with tooltips, minimizing the learning curve for new users.
The interface allows the user to load a DICOM folder and visualize the volume in multiplanar reformation (MPR) mode. The user can scroll through all the views. Labels are generated fully automatically, and the bone segmentation is superimposed over the MPR views in different colors. By clicking on a specific anatomical part, the user can visualize only the label related to the bone of interest, easily switching between different bones. Once the user selects the bone to be exported, he/she can save it as a triangular mesh. The UI is part of the Multimodal Biomedical Imaging Platform All-in-One software 2022 [28], developed by Imaginalis S.r.l. (Sesto Fiorentino, Italy); the software was tested using a reliable usability testing protocol [29,30]. The UI design follows Nielsen's usability heuristics, focusing on simplicity, consistency, and error prevention [31].
3. Experiments
While the study primarily focuses on bone segmentation and modeling in specific anatomical regions (limbs, joints, extremities) using CBCT images, the underlying methodology has the potential to generalize to other body parts and different types of medical imaging. The same deep learning framework and segmentation techniques were applied to assess the versatility of the tool. To evaluate the generalizability of the method, preliminary experiments were conducted on additional anatomical regions, including the spine and pelvis, as well as on veterinary CBCT scans. The tool was used in the pre-planning phase of a canine acetabular cup insertion, in particular to generate anatomical models of the femurs and pelvis for 3D printing.
4. Results
4.1. Binary Segmentation
To evaluate the first stage of this workflow, the developed U-Net was compared with the benchmark SegNet. To identify the most effective training strategies for the task, networks trained with the various approaches were evaluated on the same test subset of four volumes. The networks under evaluation included six variations: U-Net with axial training, MPT, and MV, and SegNet with axial training, MPT, and MV training. Each network was tested individually using axial, sagittal, and frontal slices. For majority voting, three separate training sessions were conducted for the axial, sagittal, and frontal orientations, and the results were then combined using a majority voting scheme. This comprehensive evaluation allowed us to rigorously compare the different training strategies and architectures, providing insight into the most effective methods for CBCT bone segmentation.
In Figure 4, segmented images of the human foot are presented, featuring one axial, one sagittal, and one frontal view, alongside binary masks obtained from the six networks under evaluation. The proposed networks performed well, with challenges emerging mainly in segmenting frontal and sagittal views using axial-trained networks.
The quantitative evaluation of segmentation performance on experimental CBCT images utilized two metrics: the Jaccard index and the Dice coefficient. These metrics were computed separately for each volume; the mean and standard deviation across the test volumes were then computed.
Table 1 displays the results in terms of the JI and DC. The MPT networks exhibited the highest metrics, while axial training evaluated on sagittal and frontal slices yielded the lowest. In terms of training and segmentation time, MPT requires more time than axial training, while MV requires three times the time of axial training due to the need to train three distinct networks for the voting scheme. Axial training took nearly 12 h, MPT 24 h, and MV training, involving three networks for the voting scheme, took almost 36 h. The segmentation of a volume took 70 s for the axial and MPT networks and 220 s for MV training. Although MV training achieved results as good as MPT, it required more computational time, as three predictions must be made to complete the majority voting scheme.
The results confirmed the superior performance of the proposed U-Net trained with MPT in handling complicated anatomical structures, underscoring its practical utility in extremity binary segmentation. It is worth noting that SegNet also achieved good results on the evaluated metrics, but with a significantly higher number of parameters. This indicates that, while both architectures are effective, U-Net, particularly with MPT, offers a more computationally efficient solution without compromising quantitative or qualitative results.
4.2. Three-Dimensional Model
To evaluate the quality of the 3D models derived from the proposed workflow, the models were compared using two pixel-based metrics, the DC and JI, and one distance-based metric, the Hausdorff distance. Specifically, the mesh derived from U-Net with MPT segmentation was compared to those obtained using other methods: SegNet with MPT, thresholded bone segmentation, and the graph cut algorithm proposed by Boykov et al. [32], implemented as described by Tiribilli et al. [33].
The results in terms of the Jaccard index and Dice coefficient are presented in Table 2.
The Hausdorff distance was computed between the mesh produced by each segmentation method and the corresponding ground truth mesh. The Hausdorff distance provides a measure of the maximum discrepancy between two point sets on the surfaces of the meshes, thus allowing the accuracy and precision of the segmentation methods to be assessed.
The results of this comparative analysis are presented in Table 3. This table highlights the max, mean, and standard deviation of the Hausdorff distances for each segmentation method on the target bone, thereby allowing us to determine which method produces the most accurate and reliable 3D models. By analyzing these distances, the effectiveness of each segmentation technique can be objectively evaluated. The proposed method, U-Net with MPT segmentation, demonstrated the lowest mean Hausdorff distance.
Figure 5 depicts a visualization of the Hausdorff distances for the target bone across the evaluated methods. In this context, blue areas on the models indicate minimal differences from the ground truth, while red areas signify substantial discrepancies. The model obtained via threshold-based segmentation shows high Hausdorff distances across the entire surface, indicating poor accuracy. The graph cut segmentation model has high distances in specific regions, reflecting localized segmentation errors. The U-Net and SegNet methods exhibit lower Hausdorff distances, with the U-Net model demonstrating the best overall performance, closely aligning with the ground truth. The color bar at the bottom provides a visual reference for the distance values, emphasizing the superiority of the U-Net segmentation approach.
4.3. User Interface
In
Figure 6 and
Figure 7, the user interface is shown, specifically showcasing the segmentation and modeling of a human talus and a human hamate. These figures illustrate the effectiveness of the segmentation process and the ability to handle complex anatomical structures. Labels are generated automatically and superimposed on the MPR views. The interface provides a clear and intuitive visualization that aids in the accurate identification and segmentation of these structures. The user can isolate a single bone with a click on the bone of interest and export it as a mesh.
4.4. Experiments
The preliminary results indicate that the method can be effectively extended to other anatomical regions. For example, when applied to spine and pelvis images, our tool achieved segmentation accuracies comparable to those obtained for limbs and joints. Similarly, experiments using veterinary scans demonstrated the robustness of our approach across different types of medical imaging.
Figure 8 shows the results achieved by printing the models obtained with the tool for the surgical planning of an acetabular cup insertion in a dog.
5. Discussion
The results of the presented study demonstrate the efficacy of using deep learning techniques, particularly the U-Net architecture, for the segmentation and 3D modeling of bones in CBCT orthopedic imaging.
The performances of the proposed U-Net and a benchmark SegNet architecture for CBCT bone segmentation were compared, employing different training strategies, including axial, MPT, and MV. The evaluation on the CBCT test volumes revealed that both the U-Net and SegNet architectures achieve high segmentation accuracy, but with distinct differences in computational efficiency and parameter count. U-Net trained with MPT exhibited the highest performance metrics, particularly excelling in handling complex anatomical structures. This training strategy effectively leverages information from multiple planes, enhancing segmentation robustness across different orientations. However, while MV also achieved high accuracy, it required significantly more computational resources and time due to the necessity of training and combining three separate networks. Although SegNet provided competitive results regarding the JI and DC, it required a substantially higher number of parameters than U-Net. This higher parameter count translates into increased computational load and longer training times, which may be a limiting factor in resource-constrained environments. The challenges observed in segmenting frontal and sagittal views using axial-trained networks highlight the importance of considering multiplanar information during training. The superior performance of MPT underscores its potential as a preferred training strategy for enhancing the accuracy and reliability of CBCT data segmentation.
The choice to focus on a simple network with fewer parameters was driven by the need to create a tool that balances technical performance with practical usability. While more advanced models may offer incremental improvements in segmentation accuracy, they often come at the cost of increased complexity and longer inference times. This approach demonstrates that a well-designed, simple architecture can provide excellent results with significant advantages in speed and user experience, making it highly suitable for real-world clinical applications.
By comparing the Hausdorff distances and the JI and DC between the meshes generated by the various segmentation methods, it is evident that the U-Net-MPT-based approach offers superior accuracy. The Hausdorff distance indicates that the U-Net MPT segmentation method yields models with closer alignment to the ground truth, which is critical for applications requiring high precision, such as pre-surgical planning and customized implant design. The workflow's user interface has also proven effective in facilitating the segmentation and visualization process. The intuitive design allows users to generate 3D models easily. The ability to accurately segment and model bones such as the talus, wrist, knee, elbow, and shoulder underscores the versatility of the proposed approach. It allows the generation of accurate instance segmentations of bones in different anatomical parts without the need to train different neural networks for each specific task.
The method may struggle with highly complex or irregular anatomical structures, such as bones with extensive deformities or fractures. In such cases, the segmentation accuracy could be reduced, potentially requiring manual intervention to correct the segmentation boundaries.
6. Conclusions
In conclusion, these findings suggest that U-Net with MPT is a highly effective and computationally efficient approach for CBCT bone segmentation. While SegNet also performs well, its higher parameter requirements pose a significant drawback.
The integration of accurate segmentation with advanced 3D modeling and a user-friendly interface significantly enhances the practical utility of the entire workflow. The tool not only improves the bone segmentation process in CBCT images technically, but also represents a significant advancement in usability. By reducing the need for user intervention, the process becomes more accessible and practical for everyday clinical use, which can lead to broader adoption of the technology in clinical settings. The positive results of the preliminary experiments suggest that this bone segmentation and modeling tool can be generalized to various anatomical regions and different types of medical imaging, including veterinary imaging. Future work will involve further validation on larger datasets and additional anatomical regions to fully establish the generalizability of this method. Moreover, additional case studies will be investigated to apply the tool to fracture visualization, bone growth assessment, and preoperative planning for other surgical procedures.