1. Introduction
A large number of clinical studies have shown that diseases such as diabetic retinopathy (DR) [1], cataracts [2], dry eye syndrome (DES) [3], and glaucomatous lesions [4] are associated with structural and morphological alterations of retinal vessels. As part of ophthalmic diagnostic criteria, optical coherence tomography angiography (OCTA) enables the identification and measurement of blood flow to obtain high-resolution images of the blood vessels in the retina, choroid, and conjunctival areas [5]. Compared with traditional fluorescein fundus angiography and indocyanine green angiography, OCTA offers non-invasive, rapid, three-dimensional imaging, making it a very promising vascular imaging technique in ophthalmology [6]. As shown in Figure 1a, color fundus images obtained by conventional retinal imaging techniques have difficulty capturing fine vessels and capillaries, whereas OCTA [7] can generate images of the retinal vascular plexus at different depths, as shown in Figure 1b–d. High-quality OCTA images present microvascular information at different depth layers and can be readily applied in clinical research. To precisely identify and diagnose variations in retinal blood vessels, medical personnel need to extract the retinal vessels from the fundus image to observe the length, curvature, width, and other morphological properties of the retinal vascular trees. However, manual segmentation of retinal vessels is complicated, tedious, and time-consuming [8]. Consequently, automatic segmentation algorithms that can improve efficiency and reliability have attracted increasing attention in clinical practice.
In the past few decades, many efforts have been made to segment retinal vessels. For instance, Gao et al. [9] proposed an automated method for the diagnosis of diabetic retinopathy that could help physicians diagnose patients more quickly and accurately. However, the approach relies on annotating a large number of images, which demands substantial time and human resources, and the images must be re-annotated for different cases. Jin et al. [10] presented a new fundus image dataset for vessel segmentation, providing researchers with rich experimental data. However, the dataset is not especially large and covers only one disease (diabetic retinopathy), which may limit the generalization ability of algorithms developed on it. Song et al. [11] presented a machine-learning-based clinical decision model that uses a set of rules developed by physician experts and combines traditional feature extraction with automatic feature learning by convolutional neural networks (CNNs) to improve the diagnostic accuracy of pathological ptosis. However, the study lacks comparative experiments assessing the model's advantages and disadvantages relative to other methods. The state-of-the-art methods for retinal vessel segmentation derive from fully convolutional networks (FCNs), such as U-Net and its variants [12], which are based on the encoder–decoder architecture. U-Nets capture contextual semantic information through a cascade of convolutional layers and combine high-resolution feature maps via skip connections to achieve precise localization. Attention U-Net [13] improves the skip connections by introducing an attention module that weights encoder features and fuses them with the corresponding decoder features, enhancing the retention and reinforcement of critical vessel features in the decoder. However, such skip connections ignore interactions between features at different scales, as they enhance the vessel representation only by merging encoder channels into the decoder features at the same scale. Studies [14] have indicated that not all skip connections effectively connect the encoder and decoder, and that the original U-Net can even perform worse than a U-Net without skip connections on some datasets.
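To make the gating mechanism concrete, the following is a minimal PyTorch sketch of an additive attention gate in the spirit of Attention U-Net [13]; the channel sizes and the assumption that both inputs share the same spatial resolution are illustrative simplifications, not the exact configuration of [13].

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate on a skip connection (illustrative sketch).

    A decoder (gating) signal `g` re-weights encoder features `x` before
    they are fused across the skip connection.
    """

    def __init__(self, g_channels: int, x_channels: int, inter_channels: int):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, g: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # g and x are assumed to share spatial resolution here; in practice
        # one of them is resampled first.
        alpha = self.sigmoid(self.psi(self.relu(self.w_g(g) + self.w_x(x))))
        return x * alpha  # encoder features scaled by the learned attention map

# Example: gate 64-channel encoder features with a 128-channel decoder signal.
gate = AttentionGate(g_channels=128, x_channels=64, inter_channels=32)
g = torch.randn(1, 128, 32, 32)
x = torch.randn(1, 64, 32, 32)
print(gate(g, x).shape)  # torch.Size([1, 64, 32, 32])
```

The learned map `alpha` suppresses irrelevant encoder activations before fusion, which is what strengthens critical vessel features in the decoder.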
Many studies have focused on retinal vessel segmentation in OCTA images due to their superiority in visualizing the retinal plexuses. OCTA images are characterized by rich retinal vessels, complex branching structures, and a low signal-to-noise ratio, making it difficult to distinguish small capillaries, arterioles, and venous regions, which leads to poor segmentation. In addition, variation in vessel size, shadow artifacts, and retinal abnormalities further complicates segmentation. To address these challenges, Ma et al. [7] proposed a split-based coarse-to-fine OCTA image segmentation network (OCTA-Net) comprising a coarse segmentation stage and a fine segmentation stage. The coarse network generates preliminary confidence maps for pixel-level and centerline-level vessels, while the fine stage serves as a fusion network that produces the final refined segmentation. Although dividing OCTA image segmentation into two stages mitigates the discontinuity of segmented vessels, the training process is laborious and impractical.
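As a schematic illustration of such a coarse-to-fine design (not the actual OCTA-Net architecture, whose sub-networks are far more elaborate), the fine stage can be viewed as a fusion network over the two preliminary confidence maps:

```python
import torch
import torch.nn as nn

def tiny_head(in_ch: int) -> nn.Module:
    """Hypothetical stand-in for a real segmentation sub-network."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 8, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(8, 1, 1), nn.Sigmoid(),
    )

class CoarseToFine(nn.Module):
    """Schematic two-stage pipeline: coarse confidence maps, then fusion."""

    def __init__(self):
        super().__init__()
        self.pixel_branch = tiny_head(1)       # pixel-level confidence map
        self.centerline_branch = tiny_head(1)  # centerline-level confidence map
        self.fusion = tiny_head(2)             # fine stage: fuses both maps

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        pixel_map = self.pixel_branch(image)
        center_map = self.centerline_branch(image)
        return self.fusion(torch.cat([pixel_map, center_map], dim=1))

model = CoarseToFine()
print(model(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```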
Pissas et al. [15] presented an effective recurrent CNN for vessel segmentation in OCTA, which uses FCNs to segment the entire image in each forward pass and iteratively refines the vessel segmentation through weight sharing coupled with perceptual losses. Despite their good performance, CNN-based approaches generally struggle to capture long-range (global) dependencies because of the intrinsic locality of convolution operations. As a result, convolutional networks focus only on local features of the retinal vessel image and are prone to breaking or missing the many small blood vessels.
Existing studies have shown that the transformer architecture, built on the self-attention mechanism, can compensate for the information loss of convolution operations and effectively establish long-range dependencies. Self-attention is the key computational primitive of the transformer: it implements pairwise entity interactions through a context aggregation mechanism, giving the transformer the ability to model long-range dependencies. Preliminary studies with different forms of self-attention have demonstrated its practicality in various medical image segmentation tasks [16,17]. Despite their exceptional representational power, however, training transformer architectures poses formidable challenges. First, the complexity of the vanilla transformer module grows quadratically with the input size. Second, lacking the inductive biases of ConvNets, transformers do not perform well on small-scale datasets. These challenges make it difficult to process the relatively small numbers of high-resolution images typical of medical imaging, leaving considerable room for improvement.
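The quadratic cost is easy to see in a minimal single-head self-attention sketch (the identity Q/K/V projections below are a simplification; a real transformer learns these projections):

```python
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Vanilla single-head self-attention over N tokens of dimension d.

    The `scores` matrix has shape (N, N), so memory and compute grow
    quadratically with N. For an image tokenized per pixel, N = H * W:
    a 304 x 304 OCTA scan gives N = 92,416, i.e., an (N, N) matrix with
    roughly 8.5 billion entries, which is why vanilla self-attention is
    impractical at such resolutions.
    """
    d = x.shape[1]
    q, k, v = x, x, x  # identity Q/K/V projections, for illustration only
    scores = q @ k.transpose(0, 1) / d ** 0.5  # (N, N): the quadratic term
    return torch.softmax(scores, dim=-1) @ v   # (N, d)

x = torch.randn(16 * 16, 32)  # keep N small for the demonstration
print(self_attention(x).shape)  # torch.Size([256, 32])
```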
In summary, we identify several limitations of existing OCTA retinal vessel segmentation methods: (1) The continuity of retinal vessels amplifies the defects of convolution calculations; the convolutional network's weak ability to capture global context makes segmented vessels prone to breaks and omissions. (2) The skip connections in U-Net simply propagate vessel information from the encoder to the decoder at the same scale, resulting in limited interaction between features at different scales and failing to prevent information loss and blurring. (3) Although pure transformer structures can achieve global context interaction through the self-attention mechanism, the high computational complexity of self-attention remains a challenge, especially when processing larger images.
To address these issues, this paper introduces a transformer embedded in a convolutional U-shaped network, TCU-Net, which combines an advanced convolutional network with the self-attention mechanism for OCTA retinal image segmentation. Specifically, an efficient cross-fusion transformer (ECT) is proposed to replace the original skip connections. The ECT module leverages the complementary advantages of convolution and self-attention: the image inductive bias of convolution avoids the need for large-scale pre-training, while the transformer captures long-range relationships with linear computational complexity. Moreover, the encoder feeds features of different scales into an efficient multihead cross-attention mechanism to achieve interaction between scales and compensate for the loss of vessel information. Finally, an efficient channel-wise cross-attention (ECCA) module is introduced to fuse the transformer module's multiscale features with the decoder features, resolving the semantic inconsistency between them and enhancing effective vessel features. The main contributions of this work include the following:
We propose a novel end-to-end OCTA retinal vessel segmentation method that embeds convolution calculations into a transformer for global feature extraction.
We design an efficient cross-fusion transformer module to replace the original skip connections, achieving interaction between multiscale features and compensating for the loss of vessel information. The multihead cross-attention mechanism of the ECT module reduces the computational complexity compared with the original multihead self-attention.
To reduce the semantic difference between the output of the ECT module and the decoder features, we introduce a channel-wise cross-attention module to fuse and enhance effective vessel information (a generic sketch of this mechanism follows the list below).
Experimental evaluation on two OCTA retinal vessel segmentation datasets, ROSE-1 and ROSE-2, demonstrates the effectiveness of the proposed TCU-Net.
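As a purely illustrative sketch of the general idea behind channel-wise cross-attention (not the exact ECT/ECCA formulation, which is detailed later in the paper), attention can be computed over channels rather than spatial positions: the affinity matrix is then (C, C) instead of (N, N), so the cost grows only linearly with the number of pixels N.

```python
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Cross-attention along the channel axis (illustrative sketch only).

    Queries come from the decoder features and keys/values from the skip
    (encoder/transformer) features. Treating channels as tokens keeps the
    affinity matrix at (C_dec, C_enc), so compute is linear in pixel count.
    Both inputs are assumed to share spatial resolution.
    """

    def __init__(self, dec_channels: int, enc_channels: int):
        super().__init__()
        # 1x1 convolutions play the role of the Q/K/V projections.
        self.q = nn.Conv2d(dec_channels, dec_channels, 1)
        self.k = nn.Conv2d(enc_channels, enc_channels, 1)
        self.v = nn.Conv2d(enc_channels, enc_channels, 1)

    def forward(self, dec: torch.Tensor, enc: torch.Tensor) -> torch.Tensor:
        b, _, h, w = dec.shape
        q = self.q(dec).flatten(2)   # (B, C_dec, N)
        k = self.k(enc).flatten(2)   # (B, C_enc, N)
        v = self.v(enc).flatten(2)   # (B, C_enc, N)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        out = attn @ v               # (B, C_dec, N): enc info routed per channel
        return out.view(b, -1, h, w)

dec = torch.randn(2, 64, 76, 76)
enc = torch.randn(2, 128, 76, 76)
print(ChannelCrossAttention(64, 128)(dec, enc).shape)  # torch.Size([2, 64, 76, 76])
```

Treating channels as tokens is one common route to linear-complexity attention and also provides a natural place to reconcile encoder and decoder semantics channel by channel.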