1. Introduction
The lumbar spine is an essential support and protective system in the human body. In recent years, with the aging population and changes in modern lifestyles, the incidence of spinal diseases has been gradually increasing [1]. X-ray imaging is a simple, fast, and economical modality for disease diagnosis, making it one of the basic methods for examining the spine. Traditionally, the diagnosis and treatment of spinal diseases depend heavily on the subjective experience of doctors: even when following the same diagnostic criteria for lumbar X-ray images, experienced doctors may reach different assessment conclusions [2]. Recently, with the development of deep learning in medical image analysis [3,4,5,6,7,8], automated vertebrae detection offers a promising solution to assist doctors in spinal treatment. The automatic vertebrae detection task aims to localize and recognize vertebrae by extracting features from X-ray images. By providing precise information about the position and class of each vertebra, this task helps surgeons conduct spinal diagnosis and treatment with greater accuracy and efficiency.
However, the task of automatic vertebrae detection in X-ray images encounters significant challenges, as shown in Figure 1. (1) Variations in the field of view. Only a partial subset of the spine is captured in each image, and the number of visible vertebrae varies. Since the presence of specific vertebrae (such as the sacrum or a thoracic vertebra) cannot be guaranteed in the input image, these vertebrae cannot be relied upon as references for classifying the others. (2) Similar visual characteristics. Vertebrae of different classes share similar visual appearances, while individual shapes and sizes vary, which further increases the difficulty of category recognition. (3) Excessive interference information. The images suffer from reduced contrast, blurred vertebral edges, and frequent confusion with surgical instruments such as Kirschner wires, PVP catheters, forceps, and screws, as illustrated in Figure 1a–d. These objects can occlude vertebrae or introduce artifacts, further complicating the detection task. In summary, the similarity and diversity of the data pose challenges for vertebrae detection. Therefore, there is an urgent need for a deep learning model that can extract robust spatial features of vertebrae from complex images, thereby improving the classification and localization of vertebrae.
Recently, deep learning-based detection techniques have demonstrated significant potential in accurately identifying and localizing spinal structures [9,10,11,12]. The complexity and limitations of traditional handcrafted feature extraction methods [13,14] have driven the adoption of convolutional neural networks (CNNs) as a direct and simple approach for localizing vertebrae in images. Compared to traditional methods, CNNs extract more robust features [9,10], resulting in better detection performance. However, these methods still face challenges such as blurred vertebral edges and occluded landmarks. To address these issues, CNN-based heatmap methods reformulate vertebrae detection as a keypoint prediction problem, directly predicting the centers of the vertebrae. For instance, Yi et al. [11] applied heatmaps generated by a 2D Gaussian function to locate the central points of the thoracic and lumbar vertebrae. Similarly, a Gaussian function is used in [12] to generate a heatmap for regressing the expected centroids of vertebrae and fitting the spinal curve. These approaches not only simplify the network but also handle higher-resolution images. Although methods based on the Gaussian function achieve success, they do not effectively represent the distance of each pixel inside the bounding box to the target center. To address this, FCOS [15] introduced the concept of centerness, which leverages the normalized distances between points within the object region and the center, suppressing inaccurate predictions during inference. Furthermore, performing spinal detection based solely on the center points of each vertebra fails to fully exploit the distinctive linear structure of the spine.
Utilizing spatial prior information is also particularly crucial in various medical imaging applications, especially in spinal imaging, which involves structures with linear features. While the similarity of vertebrae challenges their classification, exploiting the spatial relationships among these structures allows us to improve the accuracy of category recognition in detection tasks by leveraging knowledge about expected positions, sequential order, and other pertinent aspects of the spine. Deep learning-based methods in the computer vision domain have also extensively explored the spatial relationships between target objects [16,17]. For example, MonoPair [16] enhances the accuracy of occluded object detection by effectively capturing the spatial relationships between paired objects. GANet [17] devises a local information aggregation module that adaptively captures the localized correlations between neighboring keypoints, thereby augmenting their interconnectivity. However, despite the significance of spatial prior information and recent progress in computer vision detection tasks, spatial priors have not been extensively explored for spine vertebrae detection by recent deep learning-based approaches.
To address the aforementioned challenges, in this paper we propose a novel method that enhances spine vertebrae detection accuracy in lumbar X-ray images by leveraging both global and local spatial priors between vertebrae. Specifically, to capture the global spatial prior, we employ a two-stage detector. In the first stage, an intermediate heatmap is generated to encode the global spatial information of vertebrae, providing valuable clues for the subsequent stage. The second stage takes the intermediate heatmap and the original image as input to produce the final results. This two-stage architecture enables the model to benefit from the contextual information provided by the vertebrae, leading to enhanced detection performance. Furthermore, we introduce a novel detection head to capture local spatial information. This detection head is specifically designed to predict neighboring vertebrae information, enabling each vertebra to learn the visibility and relative offsets of its neighboring vertebrae. In the inference step, we design a fusion strategy that incorporates the spatial information of neighboring vertebrae, where each vertebra combines the spatial offsets of its neighbors with its central heatmap. By incorporating this local spatial information, our method enables the model to better understand the relationships and dependencies between neighboring vertebrae. In summary, by combining the global and local spatial priors, our method effectively captures the inherent spatial characteristics of the spine vertebrae, which in turn boosts the detection results.
Furthermore, we introduce a novel representation of object centers that offers advantages in the detection process, specifically by emphasizing critical regions. In this way, we further improve the model’s ability to concentrate on the spine center, mitigating the impact of limited data samples and excessive interference from intraoperative images.
To evaluate the effectiveness of our approach, we conduct a comprehensive evaluation on two lumbar spine datasets. The experimental results show that our method achieves promising performance compared with standard detection models, and the effectiveness of each component is validated via ablation studies. On the CM spine dataset, our model achieves an average performance improvement of 13.6% AP over the standard object detection model CenterNet [18]. On the anterior and lateral views of the BUU spine dataset [19], our model achieves average performance improvements of 6.5% AP and 4.8% AP, respectively, compared to CenterNet. Compared to YOLOv5 [20], our model demonstrates average performance improvements of 1.9% AP and 0.8% AP on BUU. Furthermore, our model surpasses Faster R-CNN [21] by 10.0% and 12.3% AP. Our code and data are available at https://github.com/zengyuyuyu/Neighbor (accessed on 19 March 2024).
Our main contributions are summarized as follows:
We propose a novel two-stage method for accurate spine vertebrae detection that captures global spatial priors by encoding spine information in an intermediate heatmap and feeding it into a second detection sub-network along with the original image.
We introduce a detection head that focuses on capturing the local spatial information, which is specifically designed to predict neighboring vertebrae information, allowing each vertebra to learn the spatial relationships and dependencies with its neighboring vertebrae.
We build a modified centermap representation upon the standard Gaussian function of heatmap-based detection methods to represent the centers of spine vertebrae, which enhances the accuracy and reliability of spine detection.
3. Method
We propose a novel approach to improve the accuracy of vertebrae detection by leveraging the spatial relationships between neighboring vertebrae. An overview of our proposed method is illustrated in Figure 2. We choose CenterNet [18] as the baseline for improvement; however, the proposed components are not limited to this specific framework and can be extended to other object detection algorithms.
Our method utilizes a two-stage approach to capture global spatial information of spine vertebrae. In the first stage, the backbone network extracts features from the input image, generating a heatmap specifically for vertebrae. The second stage takes the intermediate heatmap from the first stage, along with the original image, to produce enhanced vertebrae detection results. To leverage local spatial information of vertebrae, the second stage also predicts the results for neighboring vertebrae. During the inference step, a neighboring optimization strategy is employed to generate a fused heatmap, which subsequently produces enhanced vertebrae detection results.
3.1. Global Spatial Relationship
We propose a new two-stage precise spine detection method to capture global spatial priors. The global spatial priors represent the spatial contextual information of the entire image, i.e., the distribution, shape, and size relationships among all vertebrae in the image. Given an input image $I$ with the size of $H \times W \times 3$, in the first stage, the backbone network extracts features from $I$ and generates an intermediate heatmap $M$ specifically for the spine, with the size of $\frac{H}{R} \times \frac{W}{R}$, where $R$ is the output downsampling ratio. Then, we reshape $M$ to $H \times W$, the same size as the input image $I$. In the second stage, the resized heatmap $M$ is added to the original input image $I$ through element-wise summation, resulting in a new feature map $I'$. This map is fed into the second detection sub-network for prediction, yielding enhanced vertebrae detection results. The transformation from the output $M$ of the first-stage backbone network to the input $I'$ of the second-stage backbone network can be written as:

$$I' = I \oplus \mathrm{Softmax}(\mathrm{Up}(\mathrm{Conv}(M))),$$

where $\mathrm{Conv}(\cdot)$ represents the convolution function, $\mathrm{Up}(\cdot)$ is an upsampling function, $\mathrm{Softmax}(\cdot)$ represents the softmax operator, and $\oplus$ denotes element-wise summation. In this way, the output of the first stage $M$ encodes the global spatial information of vertebrae by providing the heatmap of all the vertebrae present in the input image. By fusing $M$ with the original input $I$ to generate the input $I'$ for the second stage, our method incorporates the global spatial relationships of vertebrae into the neural network, which in turn produces more accurate lumbar vertebrae detection results. Following the CenterNet defaults, $H = W = 512$ and $R = 4$ are used in the experiments.
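To make this fusion concrete, the following PyTorch sketch shows one plausible implementation of the transformation above; the 1x1 convolution, bilinear upsampling, and spatial softmax are our assumptions about the unspecified operator choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHeatmapFusion(nn.Module):
    """Fuse the first-stage vertebra heatmap M with the input image I.

    A minimal sketch of Section 3.1: Conv -> Up -> Softmax -> element-wise
    add. Channel counts and the 1x1 kernel are illustrative assumptions.
    """

    def __init__(self, heatmap_channels: int = 1, image_channels: int = 3):
        super().__init__()
        # Project the heatmap to the image channel count so the
        # element-wise summation with I is well defined.
        self.conv = nn.Conv2d(heatmap_channels, image_channels, kernel_size=1)

    def forward(self, image: torch.Tensor, heatmap: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); heatmap: (B, 1, H/R, W/R)
        m = self.conv(heatmap)
        # Upsample the heatmap back to the input resolution H x W.
        m = F.interpolate(m, size=image.shape[-2:], mode="bilinear",
                          align_corners=False)
        # Softmax over spatial positions (one possible reading of the
        # softmax operator), normalizing the map per channel.
        b, c, h, w = m.shape
        m = torch.softmax(m.view(b, c, -1), dim=-1).view(b, c, h, w)
        # Element-wise summation produces the second-stage input I'.
        return image + m
```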
3.2. Local Spatial Relationship
In addition to capturing global spatial relationships, we introduce a novel scheme to exploit the local spatial relationships of vertebrae. The local spatial priors represent the information between a target vertebra and its neighboring vertebrae in the image. Unlike global spatial priors, local spatial priors focus on analyzing and utilizing the features of local regions in the image to optimize vertebrae detection. There are strong spatial priors between a vertebra and its neighbors: for a given vertebra, the relative distance to its neighboring vertebra lies within a limited range, which makes it easy to estimate the rough location of the neighbor. To make use of such local spatial information, unlike standard detection models that predict only the individual vertebrae, our method additionally predicts the neighbors of each vertebra.
We first revisit the conventional heatmap-based detection network [18]. CenterNet [18] consists of three prediction heads: the object size $\hat{S} \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times 2}$, the offset of the object center to its downsampled center $\hat{O} \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times 2}$, and the heatmap of objects $\hat{H} \in [0,1]^{\frac{H}{R} \times \frac{W}{R} \times C}$. In addition to the CenterNet heads, our network also predicts information about each vertebra's upward and downward neighbors. Specifically, we add two output heads to compute the visibility of neighbors $V = (V^{\mathrm{up}}, V^{\mathrm{down}})$ and the relative offsets to the neighbors $D = (D^{\mathrm{up}}, D^{\mathrm{down}})$. For example, $V^{\mathrm{up}}$ indicates the probability map of the upward vertebra, and $V^{\mathrm{down}}$ indicates the existence of the downward vertebra; $D^{\mathrm{up}}$ computes the relative offset to the upward vertebra, and $D^{\mathrm{down}}$ records the relative distance to the downward vertebra.
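As an illustration, the two additional heads can be realized as light convolutional branches on the shared backbone feature map. This is a minimal sketch; the hidden width, head depth, and the sigmoid on the visibility output are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class NeighborHeads(nn.Module):
    """Extra prediction heads for neighboring vertebrae (Section 3.2).

    V: 2 channels -> visibility of the upward/downward neighbor.
    D: 4 channels -> (dx, dy) offsets to the upward and downward neighbor.
    A sketch; the 256-channel hidden layer is an assumption.
    """

    def __init__(self, in_channels: int = 64, hidden: int = 256):
        super().__init__()

        def head(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, out_channels, 1),
            )

        self.visibility = head(2)  # V^{up}, V^{down}
        self.offsets = head(4)     # D^{up} = (dx, dy), D^{down} = (dx, dy)

    def forward(self, feats: torch.Tensor):
        # feats: (B, in_channels, H/R, W/R) backbone feature map
        v = torch.sigmoid(self.visibility(feats))  # probabilities in [0, 1]
        d = self.offsets(feats)                    # offsets in feature-map units
        return v, d
```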
During training, we follow CenterNet to compute the standard vertebra detection loss $L_{\mathrm{det}}$. For the additional spatial relationship outputs, we compute a classification loss $L_{V}$ for $V$ and an $L_1$ loss $L_{D}$ for the relative distance prediction $D$. In summary, the training loss can be denoted as:

$$L = L_{\mathrm{det}} + \lambda_{V} L_{V} + \lambda_{D} L_{D},$$

where $\lambda_{V}$ and $\lambda_{D}$ are weighting coefficients that balance the neighbor visibility and neighbor offset terms against the detection loss.
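A hedged sketch of the combined objective follows; the CenterNet detection loss is assumed to be computed elsewhere and passed in, while the binary cross-entropy for $L_V$, the neighbor mask, and the default weights `lambda_v` and `lambda_d` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(det_loss: torch.Tensor,
               v_pred: torch.Tensor, v_gt: torch.Tensor,
               d_pred: torch.Tensor, d_gt: torch.Tensor,
               neighbor_mask: torch.Tensor,
               lambda_v: float = 1.0, lambda_d: float = 0.1) -> torch.Tensor:
    """L = L_det + lambda_V * L_V + lambda_D * L_D (Section 3.2).

    det_loss: the standard CenterNet loss (focal + size + offset terms).
    neighbor_mask: (B, 1, h, w), 1 at vertebra centers whose neighbor
    exists, else 0, so the L1 offset loss is only evaluated at valid
    locations. Loss weights and the BCE choice for L_V are assumptions.
    """
    # Classification loss for the neighbor visibility maps V.
    l_v = F.binary_cross_entropy(v_pred, v_gt)
    # L1 loss on relative offsets D, restricted to annotated neighbor pairs.
    mask = neighbor_mask.expand_as(d_pred)
    l_d = F.l1_loss(d_pred * mask, d_gt * mask)
    return det_loss + lambda_v * l_v + lambda_d * l_d
```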
In the inference step, a neighboring vertebrae optimization strategy is designed to adjust the central heatmap of each vertebra by utilizing neighboring vertebra information. An example of using upward neighborhood information to update the vertebra heatmap is shown in Figure 3. The spinal structure is ordered from L5, L4, L3, L2, to L1, with the L5 vertebra serving as the initial point. Taking the $i$-th vertebra as an example, Algorithm 1 shows the steps of the proposed neighboring vertebrae optimization method using the upward local spatial prior.
Finding the $i$-th vertebra: The location of the $i$-th vertebra can be directly inferred from the heatmap prediction $\hat{H}$ of the CenterNet-like head by taking the maximum value. Its probability value $p_i$ is then used to update its neighbor's existence value.
Finding the neighbor of the $i$-th vertebra: Given the neighbor location prediction $D$, $D^{\mathrm{up}}(x_i, y_i)$ outputs the relative offset to the upward neighbor at location $(x_i, y_i)$. Therefore, for the $i$-th vertebra at location $(x_i, y_i)$, its upward neighbor is at location $(x_i, y_i) + D^{\mathrm{up}}(x_i, y_i)$.
Updating the heatmap with the neighbor vertebra: After inferring the upward neighbor vertebra, we have additional information to update the original heatmap from the CenterNet output whenever the visibility value $V^{\mathrm{up}} > \tau$. Specifically, we use a weighted 2D Gaussian function as the heatmap inferred by the local spatial prior:

$$G^{\mathrm{up}}_{xy} = p_i \exp\left(-\frac{(x - \tilde{x})^2 + (y - \tilde{y})^2}{2\sigma^2}\right),$$

where $(\tilde{x}, \tilde{y})$ is the coordinates of the upward neighbor vertebra, $\sigma$ is a Gaussian variance, and $p_i$ weights the Gaussian function by the probability value of the $i$-th vertebra. Then, the heatmap of the upward neighbor vertebra (the $(i+1)$-th vertebra) can be updated by fusing $G^{\mathrm{up}}$ with the corresponding channel of $\hat{H}$.
Analogously, one can update the vertebrae heatmap by considering the downward local spatial prior. Finally, we obtain the updated vertebrae heatmap by using the local spatial relationship information. A simple post-processing algorithm [18] can then be applied to extract the vertebrae from the updated heatmap.
Algorithm 1: Pseudo-code of the neighboring vertebrae optimization strategy.
Input: vertebrae heatmap $\hat{H}$; visibility of neighbors $V$; relative offsets of neighbors $D$; threshold $\tau$; Gaussian parameter $\sigma$.
Output: optimized heatmap of objects $\hat{H}$.
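The sketch below mirrors Algorithm 1 for the upward direction: locate the $i$-th vertebra, check the neighbor's visibility against the threshold $\tau$, follow the predicted offset, and fuse a probability-weighted Gaussian into the neighbor's heatmap channel. The additive fusion, the default values of $\tau$ and $\sigma$, and the tensor layout are assumptions.

```python
import numpy as np

def optimize_upward(H: np.ndarray, V_up: np.ndarray, D_up: np.ndarray,
                    tau: float = 0.3, sigma: float = 2.0) -> np.ndarray:
    """Neighboring vertebrae optimization (upward direction), a sketch.

    H:    (C, h, w) per-class center heatmaps, ordered L5 -> L1.
    V_up: (h, w) visibility map of the upward neighbor.
    D_up: (2, h, w) predicted (dx, dy) offsets to the upward neighbor.
    tau/sigma: visibility threshold and Gaussian variance (assumed values).
    """
    H = H.copy()
    ys, xs = np.mgrid[0:H.shape[1], 0:H.shape[2]]
    for i in range(H.shape[0] - 1):
        # Step 1: find the i-th vertebra as the heatmap maximum.
        y, x = np.unravel_index(np.argmax(H[i]), H[i].shape)
        p_i = H[i, y, x]
        # Step 2: skip if the upward neighbor is predicted invisible.
        if V_up[y, x] <= tau:
            continue
        # Step 3: infer the neighbor location from the relative offset.
        nx = x + D_up[0, y, x]
        ny = y + D_up[1, y, x]
        # Step 4: weighted 2D Gaussian centered at the inferred neighbor.
        g = p_i * np.exp(-((xs - nx) ** 2 + (ys - ny) ** 2) / (2 * sigma ** 2))
        # Fuse with the (i+1)-th vertebra's heatmap (additive, an assumption).
        H[i + 1] = np.clip(H[i + 1] + g, 0.0, 1.0)
    return H
```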
3.3. CenterMap
Our proposed heatmap method is adapted from CenterNet [18] and FCOS [15]. CenterNet employs the center points of bounding boxes to represent objects. By leveraging a Gaussian kernel, it maps the center points of ground-truth bounding boxes onto a heatmap, thereby mitigating the penalty imposed on negative positions within the radius around each positive position. The 2D Gaussian disk $Y_{xy}$ is defined as follows:

$$Y_{xy} = \exp\left(-\frac{(x - \tilde{x})^2 + (y - \tilde{y})^2}{2\sigma_p^2}\right),$$

where $(\tilde{x}, \tilde{y})$ is the coordinates of the center point of the object on the feature map, and $\sigma_p$ is an object size-adaptive standard deviation.
However, the heatmap in CenterNet does not take into account the proximity of other positions within the bounding box to the target center. In order to suppress low-quality detection bounding boxes, FCOS introduces the concept of centerness, which describes the normalized distance from a position to the object center. The centerness $C_{xy}$ is defined as follows:

$$C_{xy} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}},$$

where $(l, t, r, b)$ represents the distances from a position inside the bounding box to its four sides. The value of $C_{xy}$ ranges from 0 to 1: a value of 1 means the pixel is precisely located at the center of the target, while a value of 0 means the pixel lies on the edge of the bounding box rather than inside it.
As shown in Figure 4, in this paper we propose a centermap representation that combines the 2D Gaussian heatmap $Y$ from CenterNet and the centerness heatmap $C$. By fusing these two heatmaps, we simultaneously consider both the center and the positional distribution of objects. This fusion effectively mitigates the penalty imposed on negative positions near the center while also enhancing the representation of information from other locations within the bounding box. As a result, our approach offers a more comprehensive depiction of object centrality. Mathematically, the centermap is defined as follows:

$$M^{\mathrm{cm}}_{xy} = Y_{xy} \times C_{xy},$$

where $Y_{xy}$ utilizes a Gaussian kernel for center encoding, and $C_{xy}$ quantifies the proximity of each position to the center. Both sets of heatmaps have a size of $\frac{H}{R} \times \frac{W}{R} \times C$.
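For illustration, the following sketch assembles the ground-truth centermap for a single box from the two components defined above; the element-wise product reflects our reading of the fusion equation and should be treated as an assumption.

```python
import numpy as np

def centermap(h: int, w: int, box: tuple, sigma_p: float) -> np.ndarray:
    """Build the centermap for one ground-truth box (Section 3.3), a sketch.

    box = (x1, y1, x2, y2) on the feature map; sigma_p is the
    size-adaptive standard deviation from CenterNet.
    """
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    # CenterNet-style 2D Gaussian around the box center.
    Y = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma_p ** 2))
    # FCOS-style centerness from the distances to the four box sides.
    l, t = xs - box[0], ys - box[1]
    r, b = box[2] - xs, box[3] - ys
    inside = (l > 0) & (t > 0) & (r > 0) & (b > 0)
    C = np.zeros((h, w))
    C[inside] = np.sqrt(
        (np.minimum(l, r)[inside] / np.maximum(l, r)[inside])
        * (np.minimum(t, b)[inside] / np.maximum(t, b)[inside])
    )
    # Fuse the two maps (element-wise product, our assumption).
    return Y * C
```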