Article

FERDNet: High-Resolution Remote Sensing Road Extraction Network Based on Feature Enhancement of Road Directionality

Bo Zhong, Hongfeng Dan, MingHao Liu, Xiaobo Luo, Kai Ao, Aixia Yang and Junjun Wu
1 College of Computer Science and Technology, University of Posts and Telecommunications, Chongqing 400065, China
2 State Key Laboratory of Remote Sensing Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
3 Hainan Aerospace Information Research Institute, Sanya 572029, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 376; https://doi.org/10.3390/rs17030376
Submission received: 9 December 2024 / Revised: 13 January 2025 / Accepted: 20 January 2025 / Published: 23 January 2025
(This article belongs to the Section Environmental Remote Sensing)

Abstract
The identification of roads from satellite imagery plays an important role in urban design, geographic referencing, vehicle navigation, geospatial data integration, and intelligent transportation systems. The use of deep learning methods has demonstrated significant advantages in the extraction of roads from remote sensing data. However, many previous deep learning-based road extraction studies overlook the connectivity and completeness of roads. To address this issue, this paper proposes a new high-resolution satellite road extraction network called FERDNet. In this paper, to effectively distinguish between road features and background features, we design a Multi-angle Feature Enhancement module based on the characteristics of remote sensing road data. Additionally, to enhance the extraction capability for narrow roads, we develop a High–Low-Level Feature Enhancement module within the directional feature extraction branch. Furthermore, experimental results on three public datasets validate the effectiveness of FERDNet in the task of road extraction from satellite imagery.

1. Introduction

Road information is a crucial input for urban planning applications [1,2] and autonomous driving [3,4]. With the rapid advancement of remote sensing technology, the resolution of remote sensing images now commonly reaches 0.5 m, and such reliable data sources have enabled rapid progress in road extraction. An elongated shape is a prominent characteristic of roads in remote sensing images, yet these images often contain complex backgrounds with numerous road-like features such as ditches, narrow rivers, and bare land. Additionally, since roads are human-made structures, they vary widely in type, appearance, and material; a single remote sensing image may simultaneously contain roads of different materials, such as dirt, cement, and asphalt roads.
Furthermore, in complex environmental backgrounds, issues such as buildings, trees, and shadow occlusions pose significant challenges for road extraction. These factors greatly increase the difficulty of road extraction tasks. In Figure 1, shadow occlusions and the presence of narrow and blurred roads in road images are illustrated, highlighting the challenges involved.
Traditional road extraction methods mainly rely on three approaches: template matching, knowledge-driven methods, and object-oriented techniques [5]. Manual feature design plays a central role in these methods, but it often proves insufficient in complex road scenarios. Early manual extraction techniques also require substantial time and labor and are prone to subjective bias. Single hand-crafted features often struggle to accomplish road extraction tasks effectively, and traditional methods still leave room for improvement in generalization.
In recent years, the rapid advancement of deep learning has substantially improved road extraction. Deep learning approaches demonstrate better robustness and generalization across complex environmental backgrounds while also reducing the required human effort and time.
In particular, the evolution of convolutional neural networks (CNNs) has provided efficient approaches for road extraction in remote sensing. Fully Convolutional Networks (FCNs) [6] introduced an innovative approach by replacing fully connected layers with convolutional layers, and researchers have since built efficient encoder–decoder structures on this basis that effectively extract road features. LeNet [7] implemented global average pooling to replace fully connected layers, which significantly reduces parameter redundancy and inefficiency while also improving the recognition efficiency and accuracy of satellite road images to some extent. U-Net [8] proposed a U-shaped network structure that fuses high- and low-level features through skip connections, addressing issues related to road occlusion and edge blurriness to some extent. The DeepLabv3+ [9] framework uses pyramid pooling and dilated convolution to enlarge the model's receptive field, allowing a finer capture of contextual associations in satellite road images and leading to improved road segmentation performance. D-LinkNet [10] introduced an encoder–decoder structure that uses dilated convolution to expand the receptive field without compromising image resolution, enabling the model to capture long-distance correlations of road information and effectively enhancing its capacity to extract road features. However, while the receptive field is expanded, fine low-level features are often irreversibly lost as the network depth increases, so the extraction performance for narrow roads remains inadequate.
Satellite road images often contain complex background information, and a model's ability to learn subtle object details typically depends on this intricate context. Elongated, narrow features such as roads are especially prone to being missed or falsely localized during detection, and the receptive fields of standard convolutions often fail to capture and extract road features adequately.
To address the issues of misdetection of road features, missed detections across multiple road categories, and the loss of small roads in high-resolution remote sensing road extraction, we propose a road-directional feature enhancement network called FERDNet. Specifically, we utilize VAN [11], which has a large receptive field, as the feature encoder to capture the long-distance features of roads. Simultaneously, we design a Multi-angle Feature Enhancement module (MFE) to extract the directional features induced by the arbitrarily located narrow and elongated characteristics of roads and a High–Low-Level Feature Information Compensation Module (ICM) to restore features such as boundaries and shapes that are lost during the learning process, enabling better extraction of narrow roads.
Our contribution consists of the following aspects:
(1)
We propose a new Multi-angle Feature Enhancement module (MFE) that captures road surface features more effectively through strip convolutions, which are better adapted to road characteristics. This enhances the model’s ability to represent road information in blurred environmental backgrounds and its capability to distinguish non-road features.
(2)
We introduce a new High–Low-Level Feature Information Compensation Module (ICM), which enhances and fuses features between two adjacent layers, effectively preserving spatial detail information and improving the model’s ability to extract narrow roads.
(3)
We evaluated FERDNet using three distinct satellite road datasets and compared it with alternative methods. The experimental outcomes demonstrate that the proposed FERDNet achieves superior performance.

2. Related Work

2.1. Semantic Segmentation-Based Road Extraction from RS Imagery

Semantic segmentation methods comprise one of the mainstream approaches in road extraction from satellite images. RCFSNet [12] integrates road contextual information with multi-scale features, enabling it to more completely capture road network labels and demonstrate excellent extraction capabilities in scenarios with occlusions. CR-HR-RoadNet [13] has the ability to model both local and global information, effectively preserving narrow road details as well as spatial information that is prone to loss. LDANet [14] introduces an improved Inception structure based on asymmetric convolution blocks, which enhances the ability to extract low-level features and effectively reduces the loss of details in narrow roads. RoadNet [15] is a multi-task, pixel-level, end-to-end CNN designed to predict road surfaces, road edges, and road centerlines simultaneously. Its refined segmentation approach enhances the ability to extract roads in complex environmental backgrounds. Similarly, in earlier work [16], three modules were designed: a pixel-level feature learning module, an edge feature learning module, and a road thematic feature learning module. These modules learn road features from three different levels. Additionally, a direction-aware module was implemented to maximally preserve the topological structure of roads, effectively enhancing the connectivity of road extraction. D-LinkNet [10] features an encoder–decoder structure with dilated convolutions, which significantly improves the model’s capability to capture long-range information about roads. The U-shaped network structure greatly reduces the loss of low-level features, enhancing the extraction ability for small roads and helping address shadow occlusion issues to some extent. DDU-net [17], based on the U-Net architecture, incorporates a dual-decoder structure, where the low-level decoder preserves low-level features, reducing the loss of boundary information and details in small roads. Dual-path Morph-UNet [18] employs a parallel U-Net structure with residual and dense connections, reducing parameter redundancy while improving road extraction performance. In NL-LinkNet [19], a non-local attention module is proposed, which allows each road spatial point to reference rich background information through non-local attention, resulting in more accurate road segmentation.
All of the aforementioned models incorporate various mechanisms to enhance their ability to capture long-distance road information, thereby improving road extraction performance. For instance, the CR-HR-RoadNet network connects two convolutions with different dilation rates in parallel to learn road features in a multi-scale manner. Similarly, D-LinkNet concatenates multiple convolutions with varying dilation rates in the deepest layer of the network to continuously expand the receptive field and capture long-distance dependencies of roads. In NL-LinkNet, a non-local attention mechanism is employed to effectively capture road information. Furthermore, nearly all of these models adopt a U-shaped, layer-by-layer decoding structure to enhance their ability to extract narrow roads.
In contrast, strip convolution is better suited to remote sensing road extraction. When road images suffer from an imbalance between positive and negative samples, methods that rely on continuously expanding the receptive field, such as dilated convolutions, tend to introduce excessive background information, which in turn hampers the model's ability to focus on learning road features. To address this challenge, our model uses strip convolutions to learn road features more effectively.

2.2. Application of Strip Convolution in Road Extraction

Due to the geometric characteristics of roads, strip convolutions are better suited for road extraction from satellite images. In earlier work [20], replacing 3 × 3 transposed convolutions with strip convolutions achieved greater accuracy in road extraction. CoANet [21] applies strip convolution to the model decoder. Through convolutions in four different directions, it captures long-distance context information and simultaneously reduces interference caused by information from irrelevant regions. In StripUNet [22], a strip convolution attention module was designed to enable the model to focus more on vertical or horizontal road features. Additionally, the authors developed a strip feature enhancement module using strip convolutions, effectively suppressing interference from abundant background information during feature restoration and highlighting road features more prominently. RCFSNet [12] features a multi-scale strip dilated convolution module, which is composed of standard dilated convolutions and dilated convolutions in both vertical and horizontal directions, aiming to more accurately extract road information. DPSDA-Net [23] constructs a strip position attention module using strip convolutions in four directions, vertical, horizontal, and both diagonal orientations, enhancing the network’s ability to learn road information. The strip position attention module is also integrated into a pyramid structure, further improving the model’s capacity to extract road features and enhancing the continuity of road extraction. MSPFE-Net [24] employs a parallel structure of vertical and horizontal pooling to design a strip pooling module, which uses multiple serial multi-level strip pooling modules of different sizes to strengthen the model’s ability to effectively integrate information at different scales, thereby improving its capability to capture details.
All the aforementioned models have integrated strip convolutions to enhance the learning of road features. Notably, CoANet replaces standard convolutions in the decoding process with four strip convolutions oriented in different directions, serving as a key source of inspiration for our design. Similarly, RCFSNet introduces a dilated strip convolution attention mechanism to effectively capture road information. In general, these models utilize relatively long strip convolutions to address the need for capturing long-distance road dependencies. However, when addressing the challenge of preserving narrow roads, the use of excessively long strip convolutions is clearly inadvisable.
In summary, for road extraction tasks, the model’s global modeling ability is critical for features like roads that exhibit strong long-range dependencies. However, relying solely on global modeling can result in the loss of small roads. Therefore, both local and global modeling capabilities are key focuses of our research while we also consider the elongated structural characteristics of roads. This study primarily aims to effectively enhance both local and global modeling abilities and to capture long-distance context information between pixels, thereby improving the interconnectedness and completeness of road extraction.

3. Methods

3.1. Overall Architecture

We propose an encoder–decoder structured network (FERDNet), as illustrated in Figure 2. This network primarily consists of the encoder, the Multi-angle Feature Enhancement module (MFE), and the Information Compensation Module (ICM). The encoder utilizes the pre-trained VAN-tiny as its backbone, while the MFE module and ICM serve as the key components for feature extraction in the decoder. We adopt a progressive structure to gradually integrate features from the encoder through the MFE modules and ICMs.
Specifically, given an input satellite road image of size H × W × 3, the encoder VAN-tiny produces multi-level feature maps at resolutions of 1/2, 1/4, 1/8, and 1/8 of the original image. The MFE module emphasizes the differences between road features and environmental features, enabling the model to capture road features in complex backgrounds. The ICM primarily focuses on enhancing the fusion of high-level and low-level features, thereby improving the extraction of detailed features. The feature maps generated at the four encoder stages are sequentially fed into the decoder for gradual integration, resulting in a feature map of size H/2 × W/2 × C. Finally, a 1 × 1 convolution followed by 2× upsampling produces the final road segmentation result.
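To make this data flow concrete, the following minimal sketch (PyTorch-style Python) wires a four-stage backbone, three MFE modules, and three ICMs into the progressive decoder described above. The class and argument names (FERDNetFlow, mfes, icms, the 64-channel head) and the exact wiring are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FERDNetFlow(nn.Module):
    """Structural sketch of the FERDNet forward pass. The backbone, MFE, and ICM
    arguments are placeholders for the components described in Sections 3.2-3.4;
    they only need to respect the shapes noted in the comments."""
    def __init__(self, backbone, mfes, icms, channels=64, num_classes=2):
        super().__init__()
        self.backbone = backbone            # x -> [f1, f2, f3, f4], shallow to deep
        self.mfes = nn.ModuleList(mfes)     # three MFE modules
        self.icms = nn.ModuleList(icms)     # three ICMs: (high-level, low-level) -> fused
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):                   # x: B x 3 x H x W
        f1, f2, f3, f4 = self.backbone(x)
        d = f4
        for mfe, icm, skip in zip(self.mfes, self.icms, (f3, f2, f1)):
            d = icm(mfe(d), skip)           # enhance with MFE, then compensate with the shallower feature
        logits = self.head(d)               # B x num_classes x H/2 x W/2
        # restore the H/2 x W/2 map to the full input resolution
        return F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)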

3.2. Encoder

The VAN [11] network is a type of architecture with a large-kernel perception capability, making it highly effective at capturing long-distance dependencies. It is particularly suitable for extracting features from objects like roads that exhibit strong long-distance dependencies. We utilize a pre-trained VAN-tiny network as the encoder to perform the preliminary extraction of road features.
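As background, the core of VAN is a large-kernel attention (LKA) block that decomposes a large convolution into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution. The sketch below follows the published VAN design (5 × 5 depthwise, 7 × 7 depthwise with dilation 3, then 1 × 1 pointwise); it is a simplified re-implementation for illustration, not the pre-trained VAN-tiny used in our encoder.

import torch.nn as nn

class LKASketch(nn.Module):
    """Large-kernel attention as used in VAN: the three convolutions together
    approximate a large receptive field at depthwise cost, and the result
    modulates the input as an attention map."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                 # local depthwise conv
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)  # long-range depthwise conv
        self.pw = nn.Conv2d(dim, dim, 1)                                        # channel mixing

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x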

3.3. The MFE Module

To enhance the proposed model’s ability to extract road features, we developed a Multi-Angle Feature Enhancement (MFE) module. Next, we provide the basic design principles and a detailed explanation of the proposed MFE module. The proposed MFE module is illustrated in Figure 3.
The design of the Multi-Angle Feature Enhancement (MFE) module is inspired by the data characteristics of satellite road images. To better extract road features, we utilize strip convolutions for feature extraction. In satellite images, there are significant differences between road surface features and environmental features. By applying strip convolutions in different orientations, we effectively capture these differences, which aids in distinguishing between the two.
The detailed structure of the MFE module is as follows: given a feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$, we apply the same 1 × 3 convolution to the feature map in four directions, obtaining four directional feature maps. The same convolutional kernel is shared across directions to minimize the influence of differing kernel weights on the subsequent operations. Next, we compute the cosine similarity between each pair of the four feature maps. In satellite images, there are significant differences between road and non-road features; when the directional convolutions extract road features, a higher similarity between two feature maps indicates a higher confidence in the road features.
In satellite images, the directionality of roads is often diverse rather than singular. Roads of the same type typically exhibit a relatively high similarity in pavement features, while significant differences emerge when crossing road boundaries. At road intersections, the features learned by the model in different directions often appear similar. Based on this observation, we determine the direction of roads by performing pairwise comparisons of feature similarities across different directions. To capture both the directional information of roads and the characteristics of road intersections, we first extract feature maps in four distinct directions. Since the specific direction of roads is unknown during the model’s learning process, we perform similarity operations between each pair of directional feature maps, resulting in six unique feature maps.
After that, we add the six feature maps together and obtain attention weights $W$ through a sigmoid operation. The input feature map is then multiplied by these weights $W$ to produce an enhanced feature map $E \in \mathbb{R}^{C \times H \times W}$. Finally, this enhanced feature map is combined with the input feature map through a residual connection to produce the final feature map $Y \in \mathbb{R}^{C \times H \times W}$, which exhibits more prominent road features than the input $F_{in}$. The overall process of the MFE module can be described as follows:
$$W = \sigma\left(\theta(F_v, F_h) + \theta(F_v, F_{ro}) + \theta(F_v, F_{lo}) + \theta(F_h, F_{ro}) + \theta(F_h, F_{lo}) + \theta(F_{ro}, F_{lo})\right)$$
$$Y = W \times F_{in} + F_{in}$$
where $F_v$, $F_h$, $F_{ro}$, and $F_{lo}$ represent the feature maps obtained from the 1 × 3 convolution kernels in the four directions, $\theta$ denotes the cosine similarity calculation, and $\sigma$ denotes the sigmoid function.
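A minimal sketch of the MFE computation is given below (PyTorch-style Python). It assumes that the four directional responses share one set of three kernel weights arranged as horizontal, vertical, and two diagonal 3-tap strips, and that the cosine similarity is taken along the channel dimension at each spatial location; these choices reflect our reading of the description above, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MFESketch(nn.Module):
    """Multi-angle Feature Enhancement: four strip responses sharing one kernel,
    pairwise cosine similarity, sigmoid gating, residual connection."""
    def __init__(self, channels):
        super().__init__()
        # three shared taps, applied per channel (depthwise)
        self.weight = nn.Parameter(torch.randn(channels, 1, 3) * 0.1)

    def _strip(self, x, direction):
        C = x.shape[1]
        k = torch.zeros(C, 1, 3, 3, device=x.device, dtype=x.dtype)
        idx = torch.arange(3)
        if direction == "h":                       # horizontal 1 x 3 strip
            k[:, 0, 1, :] = self.weight[:, 0, :]
        elif direction == "v":                     # vertical 3 x 1 strip
            k[:, 0, :, 1] = self.weight[:, 0, :]
        elif direction == "d1":                    # main diagonal strip
            k[:, 0, idx, idx] = self.weight[:, 0, :]
        else:                                      # anti-diagonal strip
            k[:, 0, idx, 2 - idx] = self.weight[:, 0, :]
        return F.conv2d(x, k, padding=1, groups=C)

    def forward(self, x):
        feats = [self._strip(x, d) for d in ("v", "h", "d1", "d2")]
        sims = []
        for i in range(4):
            for j in range(i + 1, 4):
                # cosine similarity over channels -> one map per pair (six in total)
                sims.append(F.cosine_similarity(feats[i], feats[j], dim=1, eps=1e-6))
        w = torch.sigmoid(torch.stack(sims, dim=1).sum(dim=1, keepdim=True))
        return w * x + x                           # enhanced map with residual connection

Because the six similarity maps are summed before the sigmoid, the gating weight is largest wherever several directional responses agree, which matches the behavior described above for road surfaces and intersections.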

3.4. The ICM

Due to the progressive reduction in the resolution of feature maps within the encoder, there is often an irreversible loss of spatial information. Therefore, in the ICM (High–Low-Level Feature Information Compensation Module), we enhance deep features using shallow features to preserve road edge information and reduce the loss of small roads. In the network, deeper semantic information is required to better distinguish between roads and the background, and this information is often lacking in low-level features. Meanwhile, high-level features frequently lose important low-level characteristics such as edges and shapes. Therefore, it is necessary to compensate high-level features according to their differences from low-level features to mitigate the loss of low-level information. The proposed ICM is illustrated in Figure 4. Specifically, the high-level feature $F_{high}$ from the decoder and the low-level feature $F_{low}$ from the previous stage of the encoder are first given. Both are processed through a 1 × 1 convolution to produce $\bar{F}_{high}$ and $\bar{F}_{low}$. Then, $\bar{F}_{high}$ is upsampled to match the size of $\bar{F}_{low}$. Afterward, a dimension swap is performed on both feature maps to unfold the channel dimension and highlight the spatial differences between the high-level and low-level features, yielding $\tilde{F}_{high}$ and $\tilde{F}_{low}$. Subsequently, the Euclidean distance between these two feature maps is calculated to obtain their difference $W$. This difference is transformed back to the original dimensions and used as the basis for enhancement: weights $\bar{W}$ are computed from the difference values using a softmax function. These weights $\bar{W}$ are multiplied with the low-level feature map, and the result is added to $\bar{F}_{high}$ to create an enhanced feature map $F_e$. Finally, via a residual connection, $F_e$ is added to the low-level feature map $\bar{F}_{low}$ to produce the final output $F_{out}$. The overall process of the ICM can be described as follows:
$$\bar{F}_{high} = \mathrm{Conv}_{1\times1}(F_{high})$$
$$\bar{F}_{low} = \mathrm{Conv}_{1\times1}(F_{low})$$
$$\tilde{F}_{high} = \mathrm{transpose}(\mathrm{up}(\bar{F}_{high}))$$
$$\tilde{F}_{low} = \mathrm{transpose}(\mathrm{up}(\bar{F}_{low}))$$
$$W = \rho(\tilde{F}_{high}, \tilde{F}_{low})$$
$$F_e = \mathrm{softmax}(\mathrm{transpose}(W)) \times \bar{F}_{low} + \bar{F}_{high}$$
$$F_{out} = F_e + \bar{F}_{low}$$
where $\rho$ represents the Euclidean distance calculation.
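The sketch below is one plausible realization of these equations. It assumes that the Euclidean distance is taken along the channel dimension at each spatial position (folding the dimension-swap bookkeeping into that per-position distance) and that the resulting difference map is normalized spatially with a softmax; the channel counts and module interface are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICMSketch(nn.Module):
    """High-Low-Level Feature Information Compensation: low-level detail is
    re-injected into the upsampled high-level feature, weighted by how much
    the two representations disagree at each position."""
    def __init__(self, high_channels, low_channels, out_channels):
        super().__init__()
        self.proj_high = nn.Conv2d(high_channels, out_channels, 1)
        self.proj_low = nn.Conv2d(low_channels, out_channels, 1)

    def forward(self, f_high, f_low):
        h = self.proj_high(f_high)
        l = self.proj_low(f_low)
        h = F.interpolate(h, size=l.shape[-2:], mode="bilinear", align_corners=False)
        # per-position Euclidean distance between the two projected features
        dist = torch.linalg.vector_norm(h - l, dim=1, keepdim=True)         # B x 1 x H x W
        b, _, hh, ww = dist.shape
        w = torch.softmax(dist.view(b, 1, -1), dim=-1).view(b, 1, hh, ww)   # spatial softmax weights
        enhanced = w * l + h        # compensate the high-level feature with weighted low-level detail
        return enhanced + l         # residual connection to the low-level projection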

3.5. Loss Function

In satellite road extraction, road pixels typically account for only a small proportion of an image, while background pixels dominate in terms of area coverage. The Dice loss is employed to address this sample imbalance, as it places greater emphasis on the overlapping region between the predicted results and the ground-truth labels. When cross-entropy loss alone is used on such imbalanced data, the model may bias its predictions toward negative samples to minimize the loss value. Therefore, a combined strategy of Dice loss and cross-entropy loss is adopted. The detailed definition is as follows:
$$L_{dice} = 1 - \frac{2 \cdot \bar{Y} \cdot \mathrm{softmax}(\tilde{Y})}{\bar{Y} + \mathrm{softmax}(\tilde{Y})}$$
$$L_{ce} = -\frac{1}{N}\sum_{i}\left[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\right]$$
$$L = L_{dice} + L_{ce}$$
where $\bar{Y}$ represents the true labels, $\tilde{Y}$ represents the predicted labels, $N$ denotes the total number of pixels in the image, $y_i \in \{0, 1\}$ indicates the true label of a pixel, and $\hat{y}_i$ represents the probability that a pixel belongs to category 1.
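A compact sketch of the combined objective is shown below, written for a two-class (road/background) softmax output. The smoothing constant and the use of the road-class probability map in the Dice term are our implementation choices and are not specified in the paper.

import torch
import torch.nn.functional as F

def combined_loss(logits, target, eps=1e-6):
    """logits: B x 2 x H x W raw scores; target: B x H x W with values in {0, 1}.
    Returns the Dice loss on the road channel plus pixel-wise cross-entropy."""
    prob_road = torch.softmax(logits, dim=1)[:, 1]           # probability of the road class
    target_f = target.float()
    inter = (prob_road * target_f).sum(dim=(1, 2))
    denom = prob_road.sum(dim=(1, 2)) + target_f.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)         # per-image Dice loss
    ce = F.cross_entropy(logits, target.long())              # averaged over all pixels
    return dice.mean() + ce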

4. Experiments

In this section, we validated the effectiveness of the FERDNet network using three public datasets: the CHN6-CUG dataset, the Massachusetts dataset, and the DeepGlobe dataset. Additionally, we performed ablation studies on the CHN6-CUG dataset to assess the impact of the proposed modules on the network.

4.1. Datasets

The CHN6-CUG [25] dataset primarily consists of satellite images from six cities in China: Beijing, Shanghai, Wuhan, Shenzhen, Hong Kong, and Macau. It includes a variety of roads reflecting different levels of development and urban scales. The CHN6-CUG dataset contains 4511 images, each with a size of 512 × 512 pixels. Among these, 3608 images are used as the training set, while the remaining 903 images constitute the test set.
The Massachusetts [26] road dataset consists of 1171 aerial images of Massachusetts, each with a resolution of 1500 × 1500 pixels. This dataset covers various urban, suburban, and rural areas. The training set includes 1108 images, while the test set comprises 14 images.
The DeepGlobe [27] road dataset comprises 6226 satellite images, each with a resolution of 1024 × 1024 pixels and a pixel resolution of 0.5 m. Thailand, India, and Indonesia are the main sampling areas for this dataset. It predominantly features rural areas, along with some images of urban regions, wilderness, and other landscapes. Roads in this dataset include concrete, asphalt, and mountain roads. The dataset contains a substantial amount of rural road data, which include numerous instances of tree occlusion and roads such as muddy paths that closely resemble the background.
In Figure 5, examples from the CHN6-CUG, Massachusetts, and DeepGlobe datasets are displayed from left to right, respectively.

4.2. Evaluation Metrics

To validate the effectiveness of our proposed model, we selected four effective evaluation metrics, which include precision, recall, F1 score, and Intersection over Union (IoU). The specific formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP represents the number of true-positive pixels, FP represents the number of false-positive pixels, and FN represents the number of false-negative pixels.
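For completeness, the four metrics can be computed directly from pixel counts; the helper below assumes binary prediction and label masks and simply restates the formulas above.

import numpy as np

def road_metrics(pred, label, eps=1e-9):
    """pred, label: binary arrays of the same shape (1 = road, 0 = background)."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.logical_and(pred, label).sum()          # road pixels correctly predicted
    fp = np.logical_and(pred, ~label).sum()         # background predicted as road
    fn = np.logical_and(~pred, label).sum()         # road predicted as background
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou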

4.3. Training Setup

In our experiments, all tests were conducted on an NVIDIA GeForce RTX 4090 GPU. To ensure comparability with other models, we trained and evaluated the proposed model separately on the CHN6-CUG, DeepGlobe, and Massachusetts datasets. For the CHN6-CUG dataset, we used 2426 images for training and 606 images for validation. For the Massachusetts dataset, the images were resized to 256 × 256, with 9220 images used for training and 572 for validation. Similarly, for the DeepGlobe dataset, the images were resized to 256 × 256, with 10,374 images used for training and 3760 for validation. The training batch size was set to 16, the initial learning rate to 0.0003, and the weight decay to 2.5 × 10⁻⁴, and we adopted a "poly" learning rate adjustment strategy with a factor of 0.9. Bilinear interpolation was used for resizing. All three datasets were trained for 80 epochs.
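For reference, the "poly" schedule scales the base learning rate by (1 − t/T)^0.9, interpreting the stated factor of 0.9 as the polynomial power; the sketch below applies it once per epoch with the hyperparameters listed above. The optimizer choice (AdamW) and the placeholder model are assumptions, since the paper specifies only the learning rate, weight decay, batch size, and schedule.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)               # placeholder model, not FERDNet
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=2.5e-4)

max_epochs = 80
poly = lambda epoch: (1 - epoch / max_epochs) ** 0.9  # "poly" decay with power 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for epoch in range(max_epochs):
    # ... run one training epoch: forward pass, loss, backward, optimizer.step() ...
    scheduler.step()                                  # decay the learning rate once per epoch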

4.4. Experimental Results

4.4.1. Ablation Experiments

To validate the effectiveness of the proposed MFE module and ICM, we adopted three baselines. Baseline 1 used VAN-tiny as the backbone network and FCN as the decoder. Baseline 2 employed VAN-tiny as the backbone network and Deeplabv3+ as the decoder. Baseline 3 incorporated VAN-tiny as the backbone network and U-Net as the decoder. Since both Baseline 3 and our model adopt the same layer-by-layer decoding structure, we primarily compared our model with Baseline 3. For simplicity, we conducted the ablation experiments only on the CHN6-CUG dataset and took the Intersection over Union (IoU) as the main reference indicator. The ICM primarily functions as an auxiliary component for the MFE module. When used independently, the ICM failed to capture sufficient road features; moreover, when low-level features were used to complement high-level features, the absence of semantic guidance became evident. Consequently, when the ICM was added alone, the IoU score improved by only 0.2 percentage points compared to Baseline 3, which uses the same layer-by-layer decoding structure. As shown in Table 1, the inclusion of the MFE module resulted in a 1.8 percentage point increase in IoU, indicating that it enhances the model's ability to distinguish between road features and environmental features. Compared to the three baseline models, the introduction of the MFE module improved the recall score by 2 percentage points, showing that the MFE module effectively reduces the missed detection rate. However, while reducing missed detections, it also introduced a higher risk of false positives; consequently, with the addition of the MFE module, the precision score decreased by up to 1 percentage point. By incorporating the MFE module, the model effectively learns road features; however, when roads are too narrow, the model tends to misclassify them as background, and as the network deepens, the features of these narrow roads are inevitably lost. To address this issue, low-level features are needed to compensate for high-level information, so we introduced the ICM to mitigate the loss of narrow road features. With the addition of the ICM, the IoU score improved by a further 0.6 percentage points. The ICM compensates for the loss of shapes, edges, and narrow roads in high-level features, enabling the model to recognize roads more accurately. When both modules were added, IoU increased by 2.4 percentage points over Baseline 3, demonstrating the effectiveness of the two proposed modules.

4.4.2. Comparative Experiments

To validate the effectiveness of our proposed model, we trained it from scratch on each of the three datasets (CHN6-CUG, DeepGlobe, and Massachusetts) and compared its performance with six other models: CR-HR-RoadNet [13], DDCTNet [28], DeepLabv3+ [9], D-LinkNet [10], PVTTransformer [29], and UNetFormer [30].
Table 2 shows the quantitative results of our proposed FERDNet compared to the six other models on the CHN6-CUG dataset. The results demonstrate that FERDNet achieves an IoU score of 67.6%, which is 1.4% higher than the second-best method, D-LinkNet. Furthermore, although our method ranks slightly below DDCTNet in recall, it outperforms all six models in F1 score. However, the relatively high recall increases the risk of false positives, resulting in a slightly lower precision score compared to some models.
The CHN6-CUG dataset contains road data from six cities at varying levels of development, covering areas such as highways, residential districts, and rural roads. This reflects the presence of complex challenges in the dataset, including multi-branch roads, building occlusions, and roads resembling background features. Experimental results show that our method is robust across different types of roads under various background conditions.
The quantitative experimental results on the DeepGlobe dataset are presented in Table 3. Notably, our method achieves a performance comparable to that on the CHN6-CUG dataset, with the IoU score improving by 1.1% compared to the second-best method, PVTTransformer. The DeepGlobe dataset contains a substantial number of rural roads and yellow dirt paths, where road information closely resembles background features. The experimental results demonstrate that our method effectively distinguishes road-like features, reliably mitigating interference from irrelevant regional features.
As shown in Table 4, the quantitative experimental results of our method on the Massachusetts dataset demonstrate that our method improves the IoU score by 1.8% compared to the CR-HR-RoadNet method. Notably, the recall score of our method is 4.4% higher than that of CR-HR-RoadNet. The Massachusetts dataset contains a large number of road branches and small roads. The experimental results highlight the superiority of our method in extracting small roads.
In Table 5, the parameters (Param) and floating-point operations (FLOPs) of our model and other popular models are compared. The CR-HR-RoadNet paper does not report the GFLOPs of its model, so a horizontal line is used as a placeholder. With only 4.16 M parameters and 27.33 GFLOPs, our model demonstrates clear advantages in both parameter count and computational cost, which further emphasizes its effectiveness.

4.4.3. Visualization of Experimental Results

To better showcase the performance of our model, we visualized the final experimental results. Based on the quantitative evaluations above, the PVTTransformer network, which features a pyramid structure and combines the advantages of CNNs and Transformers, is selected, along with the classic road segmentation network D-LinkNet and the classic semantic segmentation network DeepLabv3+, which possesses multi-scale information extraction capabilities, for the following visual comparisons.
As shown in Figure 6, the visualization results of our model on the CHN6-CUG dataset demonstrate its capabilities. From the first result, it can be observed that our model has a certain ability to extract roads under shadow occlusion and successfully predicts the narrow roads in the image, whereas the other three models fail to make effective predictions. In the fourth visualization result, as indicated by the red box, our model extracts correct roads present in the image that are not included in the ground truth, which the other three models fail to do. As highlighted by the red frames in the first and third images, our model successfully predicts narrow and small roads. This is largely attributable to the ICM, which mitigates the loss of small targets during the learning process; by integrating low-level features into high-level representations, the ICM enhances feature retention and ensures that narrow and small roads are extracted accurately. The visualization results of the third and sixth images indicate that the model can effectively reconstruct road features even when the background information closely resembles the road information. In these two images, the street trees and buildings adjacent to the roads provided the learning basis for our MFE module; the pronounced feature changes enabled the model to differentiate road features from background features, ultimately leading to improved prediction maps generated by our model.
As shown in Figure 7, the visualization results of our model on the DeepGlobe dataset demonstrate its capabilities. The DeepGlobe dataset is characterized by road features that closely resemble environmental features. Benefiting from the MFE module, the model exhibits a strong ability to distinguish background features from road features. The visualization results clearly show that our model performs better in scenarios where environmental features closely resemble road features.
In the first visualization result, it is evident that the integrity of the roads predicted by our method surpasses that of the other three models. As indicated by the red box, our method also predicts real roads that are missing from the ground truth. The second and third visualization results demonstrate that our method can make effective predictions even in the presence of occlusions caused by buildings and trees. In this dataset, our model also shows a certain capability to extract roads that are not correctly delineated in the labels, which partially explains why our model does not achieve optimal performance in the precision metric.
In Figure 8, the visualization results of our model on the Massachusetts dataset are presented. In these results, our model demonstrates a strong capability for extracting main roads. Compared to the other three models, the visualization results of our model exhibit a wider representation of the main roads and a lower false detection rate, indicating better integrity in the extraction of main roads. This can be attributed to the effectiveness of our ICM. The ICM successfully preserves low-level features, such as road boundaries and textures. Furthermore, in the MFE module, we assigned higher weights to road-related information, enabling our model to achieve superior feature extraction and deliver improved prediction results.
In Figure 9, the visualization results for areas affected by shadow occlusion are presented. The regions outlined in red frames highlight areas where street trees or building shadows obscure portions of the images. Despite the challenges posed by shadow occlusion, our model demonstrates relatively strong robustness. While shadow occlusion can partially affect the local features of the road, we mitigate this issue by enhancing road features through the use of six similarity feature maps within the MFE module. This design ensures that our model maintains reliable feature extraction capabilities even under shadow occlusion conditions.
To evaluate the generalization ability of our model after training on different datasets, we introduced a road type dataset named RDCME [31], which is not included in the CHN6-CUG, DeepGlobe, or Massachusetts datasets. The RDCME dataset primarily consists of data from mountainous areas, with a resolution of 0.3 m per pixel. In Figure 10, the road conditions in the RDCME dataset are illustrated.
In Figure 11, the visualization outcomes of our model, trained on diverse datasets and applied to the RDCME dataset, are presented. Notably, our model demonstrates the ability to make favorable predictions even when confronted with complex road scenarios.

5. Conclusions

In this paper, we propose a U-shaped network model (FERDNet) for road extraction from satellite images. The model addresses the connectivity and completeness of roads, effectively mitigating interference from similar background information and, to some extent, the occlusion caused by trees and shadows. The MFE module employs strip convolutions in several directions to exploit the narrow, elongated characteristics of roads and capture more complete, connected roads; the connectivity and completeness of the retrieved roads are visually presented in Figure 6, Figure 7 and Figure 8. Furthermore, with the help of the designed ICM, FERDNet effectively resolves issues related to road discontinuities and unclear boundaries, reducing the impact of occlusion by street trees. Experimental results on three datasets demonstrate the effectiveness and feasibility of our network; compared to other representative methods, our model achieves higher accuracy and better extraction results. In addition, all three datasets can be fed into the model for further training to increase its adaptability and accuracy.
Prospects: Graph Neural Networks (GNNs) have proven to be highly effective in remote sensing road extraction, particularly in addressing road connectivity challenges. Moving forward, we plan to focus on leveraging Graph Neural Networks to tackle the issue of disconnections in road extraction.

Author Contributions

H.D. proposed the concept, conducted the experiments, and drafted the initial manuscript. B.Z. suggested improvements, analyzed the experimental results, and revised the initial draft. K.A. proposed ideas for algorithm enhancements. M.L., X.L., A.Y. and J.W. reviewed and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Hainan Provincial Natural Science Foundation of China (422QN350), Science and Technology Fundamental Resources Investigation Program of China (2022FY100200), and Beijing Science and Technology Plan (Z241100005424006).

Data Availability Statement

The DeepGlobe dataset is available for download at http://deepglobe.org/challenge.html, the Massachusetts dataset is available for download at https://www.cs.toronto.edu/~vmnih/data/, and the CHN6-CUG dataset is available for download at http://cugurs5477.mikecrm.com/ZtMn5tR.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Akhtarmanesh, A.; Abbasi-Moghadam, D.; Sharifi, A.; Yadkouri, M.H.; Tariq, A.; Lu, L. Road extraction from satellite images using attention-assisted UNet. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1126–1136. [Google Scholar] [CrossRef]
  2. Luo, H.; Wang, Z.; Du, B.; Dong, Y. A deep cross-modal fusion network for road extraction with high-resolution imagery and lidar data. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4503415. [Google Scholar] [CrossRef]
  3. Chen, L.; Zhang, Y.; Tian, B.; Ai, Y.; Cao, D.; Wang, F.-Y. Parallel driving os: A ubiquitous operating system for autonomous driving in cpss. IEEE Trans. Intell. Veh. 2022, 7, 886–895. [Google Scholar] [CrossRef]
  4. Chen, L.; Li, Y.; Huang, C.; Li, B.; Xing, Y.; Tian, D.; Li, L.; Hu, Z.; Na, X.; Li, Z.; et al. Milestones in autonomous driving and intelligent vehicles: Survey of surveys. IEEE Trans. Intell. Veh. 2022, 8, 1046–1056. [Google Scholar] [CrossRef]
  5. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  7. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten digit recognition with a back-propagation network. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1989; Volume 2. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  10. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186. [Google Scholar]
  11. Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  12. Yang, Z.; Zhou, D.; Yang, Y.; Zhang, J.; Chen, Z. Road extraction from satellite imagery by road context and full-stage feature. IEEE Geosci. Remote Sens. Lett. 2022, 20, 8000405. [Google Scholar] [CrossRef]
  13. Chen, J.; Yang, L.; Wang, H.; Zhu, J.; Sun, G.; Dai, X.; Deng, M.; Shi, Y. Road extraction from high-resolution remote sensing images via local and global context reasoning. Remote Sens. 2023, 15, 4177. [Google Scholar] [CrossRef]
  14. Liu, B.; Ding, J.; Zou, J.; Wang, J.; Huang, S. LDANet: A lightweight dynamic addition network for rural road extraction from remote sensing images. Remote Sens. 2023, 15, 1829. [Google Scholar] [CrossRef]
  15. Liu, Y.; Yao, J.; Lu, X.; Xia, M.; Wang, X.; Liu, Y. RoadNet: Learning to comprehensively analyze road networks in complex urban scenes from high-resolution remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2043–2056. [Google Scholar]
  16. Li, X.; Wang, Y.; Zhang, L.; Liu, S.; Mei, J.; Li, Y. Topology-enhanced urban road extraction via a geographic feature-enhanced network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8819–8830. [Google Scholar] [CrossRef]
  17. Wang, Y.; Peng, Y.; Li, W.; Alexandropoulos, G.C.; Yu, J.; Ge, D.; Xiang, W. DDU-Net: Dual-Decoder-U-Net for road extraction using high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4412612. [Google Scholar] [CrossRef]
  18. Dey, M.S.; Chaudhuri, U.; Banerjee, B.; Bhattacharya, A. Dual-path morph-UNet for road and building segmentation from satellite images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 3004005. [Google Scholar] [CrossRef]
  19. Wang, Y.; Seo, J.; Jeon, T. NL-LinkNet: Toward lighter but more accurate road extraction with nonlocal operations. IEEE Geosci. Remote Sens. Lett. 2021, 19, 3000105. [Google Scholar] [CrossRef]
  20. Sun, T.; Di, Z.; Che, P.; Liu, C.; Wang, Y. Leveraging crowdsourced gps data for road extraction from aerial imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7509–7518. [Google Scholar]
  21. Mei, J.; Li, R.-J.; Gao, W.; Cheng, M.-M. Coanet: Connectivity attention network for road extraction from satellite imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef]
  22. Ma, X.; Zhang, X.; Zhou, D.; Chen, Z. Stripunet: A method for dense road extraction from remote sensing images. IEEE Trans. Intell. Veh. 2024, 1–13. [Google Scholar] [CrossRef]
  23. Zhao, L.; Ye, L.; Zhang, M.; Jiang, H.; Yang, Z.; Yang, M. DPSDA-Net: Dual-path convolutional neural network with strip dilated attention module for road extraction from high-resolution remote sensing images. Remote Sens. 2023, 15, 3741. [Google Scholar] [CrossRef]
  24. Wei, Z.; Zhang, Z. Remote sensing image road extraction network based on MSPFE-Net. Electronics 2023, 12, 1713. [Google Scholar] [CrossRef]
  25. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A global context-aware and batch-independent network for road extraction from vhr satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  26. Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto: Toronto, ON, Canada, 2013. [Google Scholar]
  27. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Raskar, R. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
  28. Gao, L.; Zhou, Y.; Tian, J.; Cai, W. Ddctnet: A deformable and dynamic cross transformer network for road extraction from high resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4407819. [Google Scholar] [CrossRef]
  29. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  30. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. Unetformer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  31. Zhang, X.; Jiang, Y.; Wang, L.; Han, W.; Feng, R.; Fan, R.; Wang, S. Complex mountain road extraction in high-resolution remote sensing images via a light roadformer and a new benchmark. Remote Sens. 2022, 14, 4729. [Google Scholar] [CrossRef]
Figure 1. As indicated by the red ellipses, the road images exhibit issues such as shadow occlusion, obstruction from street trees, and the presence of narrow and blurred roads.
Figure 2. The overall structure of the model consists of a VAN encoder along with three MFE modules and three ICMs.
Figure 3. MFE module, where the dashed box indicates different directions of the same convolution.
Figure 4. High- and low-level feature information compensation module.
Figure 5. Samples from the three datasets.
Figure 6. Visualization of segmentation results on the CHN6-CUG dataset. (a) Ours. (b) PVTTransformer. (c) DlinkNet. (d) Deeplabv3+.
Figure 7. Visualization of segmentation results on the DeepGlobe dataset. (a) Ours. (b) PVTTransformer. (c) DlinkNet. (d) Deeplabv3+.
Figure 8. Visualization of segmentation results on the Massachusetts dataset. (a) Ours. (b) PVTTransformer. (c) DlinkNet. (d) Deeplabv3+.
Figure 9. The visualization results in the shadow occlusion areas. (a) Ours. (b) PVTTransformer. (c) DlinkNet. (d) Deeplabv3+.
Figure 10. Samples from the RDCME dataset.
Figure 11. The visualization results of the model trained on different datasets on the RDCME dataset. (a) The model trained on the CHN6-CUG dataset. (b) The model trained on the Massachusetts dataset. (c) The model trained on the DeepGlobe dataset.
Table 1. Ablation experiments of proposed modules on the CHN6-CUG dataset, with the maximum values of evaluation metrics highlighted in bold.
Methods | MFE | ICM | Precision (%) | Recall (%) | F1 (%) | IoU (%)
Baseline 1 |  |  | 82.3 | 76.8 | 79.5 | 65.9
Baseline 2 |  |  | 81.2 | 77.3 | 79.2 | 65.6
Baseline 3 |  |  | 82.1 | 76.0 | 78.8 | 65.2
+ICM |  | ✓ | 82.3 | 76.1 | 79.1 | 65.4
+MFE | ✓ |  | 81.3 | 79.3 | 80.3 | 67.0
+MFE+ICM | ✓ | ✓ | 83.8 | 77.6 | 80.6 | 67.6
Table 2. CHN6-CUG dataset experimental comparison results, with the best results highlighted in bold.
Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%)
CR-HR-RoadNet | 78.4 | 77.4 | 77.9 | 63.8
DDCTNet | 80.1 | 81.1 | 80.5 | 64.7
DeepLabv3+ | 78.4 | 73.8 | 76.0 | 61.3
D-LinkNet | 84.5 | 75.4 | 79.7 | 66.2
PVTTransformer | 83.7 | 72.8 | 77.9 | 63.8
UNetFormer | 82.4 | 71.8 | 76.8 | 62.3
Ours | 83.8 | 77.7 | 80.6 | 67.6
Table 3. DeepGlobe dataset experimental comparison results, with the best results highlighted in bold.
Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%)
CR-HR-RoadNet | 76.4 | 77.1 | 76.8 | 62.3
DDCTNet | 80.3 | 78.1 | 79.1 | 64.3
DeepLabv3+ | 85.0 | 79.2 | 82.1 | 69.6
D-LinkNet | 84.5 | 81.3 | 82.9 | 70.8
PVTTransformer | 84.6 | 82.3 | 83.5 | 71.6
UNetFormer | 83.5 | 81.3 | 82.4 | 70.1
Ours | 83.4 | 84.1 | 84.2 | 72.7
Table 4. Massachusetts dataset experimental comparison results, with the best results highlighted in bold.
Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%)
CR-HR-RoadNet | 80.3 | 76.1 | 78.2 | 64.2
DDCTNet | 80.8 | 73.2 | 76.8 | 62.2
DeepLabv3+ | 79.5 | 74.0 | 76.7 | 62.1
D-LinkNet | 81.6 | 72.1 | 76.6 | 62.0
PVTTransformer | 81.9 | 73.9 | 77.7 | 63.6
UNetFormer | 80.1 | 75.2 | 77.6 | 63.4
Ours | 78.6 | 80.5 | 79.5 | 66.0
Table 5. Parameters (Param) and floating-point operations (FLOPs) among different models, where the best results are highlighted in bold.
Method | Param (M) | FLOPs (G)
CR-HR-RoadNet | 15.28 | -
DDCTNet | 63.45 | 128.78
DeepLabv3+ | 39.76 | 239.10
D-LinkNet | 31.56 | 139.52
PVTTransformer | 25.47 | 110.15
UNetFormer | 11.72 | 46.95
Ours | 4.10 | 27.33
