1. Introduction
Road segmentation is the process of predicting and classifying road pixels in an image to aid the generation of accurate road network information. Road segmentation using satellite images is a critical tool for efficient traffic management and planning. Proactive measures can be taken to identify potentially dangerous intersections and congestion areas, thereby reducing the risk of traffic-related accidents and improving the overall safety of a road network. Authorities can also use road segmentation to monitor changes over time and detect any obstacles or hazardous conditions that could make a route less safe, allowing constructive action to be taken.
Recently, advances in the field of artificial intelligence have resulted in various state-of-the-art model architectures and their optimizations being proposed to solve complex segmentation problems [1,2,3]. Machine learning and deep learning approaches are being applied to improve the quality of road segmentation from remote sensing data [4,5,6,7]. Road segmentation from remote sensing images is inherently a highly non-linear problem; hence, deep learning-based semantic segmentation solutions are more favorable in this matter [8,9,10,11]. Deep learning has been an epochal development; however, it is strongly data driven, which sometimes makes accurate predictions difficult. The performance of deep learning models is greatly influenced by both the quality and quantity of the data used in their training [12,13]. If the data are not representative of the real-world scenario, performance is poor. A major challenge is the shadow effect, which makes it difficult to segment road pixels [14]. The spatial resolution and spectral capability are highly interrelated, and their limitations result in false perceptions at the boundary points of the road [15]. The satellite image should provide high spatial resolution for road extraction; however, high-resolution images can be costly or impossible to obtain for qualified road segmentation. Wooded areas are also problematic because road information is largely missing in optical imagery. This missing information can be completed using a LiDAR point cloud. Therefore, it may be possible to overcome the limitations of satellite images in areas where the road information is problematic and to improve the accuracy of road segmentation by exploiting additional geometric relations, integrating LiDAR data into deep learning models [16,17].
LiDAR is seen as a valuable data source, and it provides useful information that cannot be extracted from optical images. The LiDAR scanner can be mounted on terrestrial, mobile, and airborne platforms. Mobile and terrestrial LiDAR data can be used for the semantic segmentation of road information [18,19], but the applications of these platforms are limited due to their small coverage areas, and the difficulties of point cloud measurement increase in challenging areas [20]. To conduct road segmentation over large areas, or even country-wide, airborne LiDAR platforms become more feasible, as they can cover cities and countries rapidly. There is substantial research on road segmentation using airborne LiDAR data. Li et al. [21] used a grid index structure to detect roads in LiDAR point clouds, applying a morphological gradient to ground points that had been filtered using the local intensity distribution. Li et al. [22] conducted a study based on similar parameters to the method of Li et al. [21]. After identifying the differences in shape, reflectance, and road width, road centrelines were extracted with local principal component analysis; road networks were then extracted using a global primitive grouping method. Hui et al. [23] introduced a novel approach consisting of skewness balancing, rotating neighbourhoods, and hierarchical fusion and optimization to extract roads from point cloud data. Tejenaki et al. [24] implemented a hierarchical method that refined intensity with mean shift segmentation and extracted road centrelines with a Voronoi diagram; the proposed method improved both road extraction results and water surface detection. Sánchez et al. [16] aimed to distinguish between road and ground points based on intensity constraints; in their method, an improved skewness balancing algorithm was used to calculate the intensity threshold. There are deep learning techniques, such as PointNet [25] and PointConvNet [26], that take point clouds as direct input for object classification and segmentation problems. Following the gradual development of these approaches, PointNet++ was published and used for labeling the road points in point cloud data [27]. There are, however, still issues to be addressed when relying solely on LiDAR data. For instance, LiDAR data contain no spectral information, which can lead to confusion between objects with similar density and texture, such as parking lots. Therefore, the combination of satellite images and LiDAR data can resolve such deficiencies, leading to more accurate segmentation of the road network [22,28,29].
There have been numerous studies conducted on the potential benefits of remote sensing data for segmenting images. However, the combination and integration of different data sources have yet to receive the same level of attention. Audebert et al. [30] conducted thorough research on integrating multiple sources and models and proposed an approach that combined LiDAR and multispectral images using different fusion strategies. They used an Infrared-Red-Green (IRRG) image and a combination of the Normalized Digital Surface Model (NDSM), Digital Surface Model (DSM), and Normalized Difference Vegetation Index (NDVI) obtained from LiDAR data. In a similar study, Zhang et al. [31] suggested using high-resolution images and an nDSM derived from LiDAR. Their proposed method involves segmenting the fused data, classifying image objects, generating a road network, and extracting the road centreline network using a multistage approach, including morphological thinning, Harris corner detection, and least squares fitting. Zhou et al. [32] introduced a novel method named FuNet (Fusion Network) that integrates satellite images with binary road images generated from GPS data collection. This approach, which relies on a multi-scale feature fusion strategy, was found to be more effective at resolving the road connectivity issue than using satellite imagery alone. Torun and Yuksel [33], on the other hand, proposed a technique that combines hyperspectral images with LiDAR data for unsupervised segmentation. They applied a Gaussian filter to the point cloud while performing principal component analysis on the images, and the results were used to create an affinity matrix. In another study, Gao et al. [34] presented multi-scale feature extraction for 3D road segmentation, combining characteristic features from high-resolution images and LiDAR data.
Although there are findings suggesting the benefits of integrating diverse data sources for multiple applications, existing research indicates that there are still challenges to be addressed in this area. The combination of different types of remotely sensed data can be challenging, yet rewarding, for segmentation studies considering the complexity of the solution. Specifically, the combination of 2D images and a 3D point cloud, which has an irregular data structure, requires complex solutions for deep learning architectures. An alternative and easier approach is to extract context information from the point cloud that represents the local 3D shapes and to fuse it with the high-level features extracted from 2D optical images [20]. Therefore, in this study, a feature-wise fusion strategy for 2D optical images and point cloud data was proposed to enhance the road segmentation capability of deep learning models in areas where the use of optical images is inadequate due to object blockage and shadow problems. High-resolution optical satellite images obtained from the Google Maps platform were combined with contextual feature images derived from the point cloud obtained using airborne LiDAR data. The combination of these two types of data was carried out feature-wise in a deep residual U-Net-based deep learning model. The features generated by different ResNet backbones from the optical satellite images alone were fused with the geometric features calculated from the LiDAR point cloud before the final convolution layer of the model.
The major objectives of the study were as follows: (1) the use of irregular point cloud data together with 2D optical images in an end-to-end deep learning model was introduced as a feature-wise fusion strategy; (2) the improvement brought about by the LiDAR data was outlined together with the statistical results of the models, and the prediction performance of the proposed fusion strategy was evaluated in areas where road segmentation is challenging; (3) the consistency of the strategy was evaluated with different ResNet backbones in the deep residual U-Net architecture. Finally, the relative importance of the optical and LiDAR-derived features was calculated using the best-performing combined model to outline which features contribute most to the improvement of road segmentation. The intent of this study is not to propose a new deep learning architecture, but to provide a combination strategy for optical images and point cloud data that improves the segmentation of roads; the proposed fusion strategy can be implemented in any model architecture with proper modifications. The motivation behind this study was to improve the road segmentation capability of deep learning models, which have proven successful in this regard, by combining 2D and 3D information properly.
2. Data and Methods
2.1. Study Area and Data Collection
In this study, open-source airborne LiDAR data collected by the U.S. Geological Survey were used together with high-resolution optical satellite images from the Google Maps platform. While the Google Maps API service is available to gather satellite images anywhere on Earth, airborne LiDAR is a valuable (and expensive) data source. The U.S. Geological Survey initiated the National Geospatial Program with airborne LiDAR campaigns that cover the United States to improve and deliver topographic information. This program contains products completed at different quality levels, from Level-0 to Level-3. In order to test the performance of the feature-wise fusion of optical images and LiDAR, only Level-1 quality data located over major cities were taken into account. In this context, the Florida Southeast project, which falls within the Florida counties of Broward, Collier, Hendry, Miami-Dade, Monroe, and Palm Beach, was chosen as the study area [35].
This LiDAR dataset was collected in June 2018 and published by the U.S. Geological Survey under the name Florida Southeast LiDAR-Block 1. The point cloud was generated at Level-1 quality with a source DEM of 0.5 m/pixel. The vertical accuracy of the point cloud is ±10 cm in terms of root mean square error. The nominal pulse spacing is smaller than 0.35 m, and the point density is 14.87 points/m². The data have seven classes, namely ground, low noise, water, bridge decks, high noise, ignored ground, and unclassified. The dataset consists of 526 tiles, each covering a 1 km × 1 km area. Owing to the data size, only 393 tiles, which cover the majority of the road network in the area, were used in the study.
After the LiDAR data were obtained, the required satellite images were generated using a Google Maps Static API-based tool. This tool was created by [36], and it generates satellite and corresponding mask images, randomly or in sequence, within a defined region based on latitude and longitude. It also produces a metadata file with which to rectify these satellite images if required. In this way, the registration of the LiDAR data and the satellite images could be performed. The Google Maps Static API provides images at various zoom levels, which correspond to different scales and resolutions on the Earth's surface. To generate images, zoom level 17 was chosen because of its suitable coverage area and pixel resolution. The satellite and mask images were extracted with a dimension of 512 × 512 pixels, which resulted in a spatial resolution of 1.07 m × 0.96 m per pixel. Consequently, a total of 1426 images were generated, which cover the LiDAR dataset and contain roads.
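For reference, the reported pixel size can be checked against the standard Web Mercator ground-resolution formula. The short sketch below is illustrative only; the latitude of roughly 26° for southern Florida is an assumption.

```python
import math

# Web Mercator ground resolution (metres per pixel) at a given latitude and zoom level.
# The constant 156543.03392 is the equatorial resolution at zoom 0 for 256 px tiles.
def ground_resolution(lat_deg: float, zoom: int) -> float:
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

print(round(ground_resolution(26.0, 17), 2))  # ~1.07 m/pixel near the latitude of southern Florida
```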
The optical satellite images and their corresponding masks are geo-referenced in the WGS84 datum. This enables the inter-usability of these images together with other geospatial datasets, in this case the LiDAR point cloud. As shown in the sample overlay presented in Figure 1, the horizontal coordinates of these datasets match each other harmoniously for use in road segmentation studies.
2.2. LiDAR Feature Extraction
LiDAR is an active sensing system that operates by measuring the round-trip time of a laser beam to an object in order to determine the distance from the sensor to the target. By analyzing the laser range in combination with the scan angle and the spatial coordinates of the laser scanner, the spatial coordinates (X, Y, Z) of an object are obtained. Along with spatial data, intensity values, number of returns, point classification values, and GPS times are recorded. However, the complete geometric properties of the targeted objects in the point cloud cannot be represented solely by their spatial information (X, Y, and Z); they can instead be characterized by geometric features. In order to avoid the heavy computational load of using complex point cloud data directly, semantic information extracted by generating geometric features can be used.
As presented in Figure 2, feature extraction was carried out in four steps. In the first step, points with high noise levels were eliminated from the LiDAR point cloud. In the second step, outlier detection was performed by removing the points whose average distance from their neighbouring points exceeded a given threshold value. In the third step, the statistical relationship of the points was formed by creating a new data structure based on neighbouring points, similar to the second step; however, this time the neighbouring points were determined on the cleaned point cloud by applying a k-dimensional tree (k-d tree) algorithm in 3D. In the final step, the eigenvalues and the 3D geometric features were calculated.
There are various methods available for determining the neighbourhood of points in LiDAR data. The neighbourhood of points can be constructed using a spherical radius or parameterized based on the number of closest neighbours in 2D or 3D space (i.e., k-NN methods). In this study, k-NN-based neighbourhood selection was constructed via the well-known k-d tree algorithm, in which the data are partitioned using a binary tree structure [37]. As a result of the neighbourhood selection, each query point $p$ and its $k$ neighbours were indexed, and the distances between the query point and its $k$ neighbours were determined and sorted from nearest to farthest.
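A minimal sketch of this neighbourhood construction using SciPy's k-d tree is given below; the value of k is a placeholder, as the number of neighbours used in the study is not stated in this section.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_neighbourhoods(points: np.ndarray, k: int = 20):
    """Index each query point and its k nearest neighbours in 3D.

    points : (N, 3) array of cleaned LiDAR coordinates; k is a placeholder value.
    """
    tree = cKDTree(points)                     # binary-tree partitioning of 3D space
    dists, idx = tree.query(points, k=k + 1)   # distances sorted from nearest to farthest
    return dists[:, 1:], idx[:, 1:]            # drop the first column (each point matches itself)
```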
In the first part of the feature extraction, the features that represent the geometric 3D properties of the neighbourhood were calculated using the height of each query point: the absolute height $H$, the height difference between the query point and its neighbouring points $\Delta H_{kNN}$, the standard deviation of the absolute height within the neighbouring points $\sigma_{H,kNN}$, the local 3D neighbourhood represented by the radius of the $k$ nearest neighbours $r_{kNN}$, and the local point density $D$, defined as
$$D = \frac{k + 1}{\tfrac{4}{3}\pi r_{kNN}^{3}},$$
where $r_{kNN}$ is the distance between the query point and its farthest neighbour.
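As a sketch of these height-based features for a single query point, the snippet below follows the definitions above; the exact conventions used in the study (for example, whether the query point is included in the statistics) are assumptions.

```python
import numpy as np

def height_features(z_query: float, z_neigh: np.ndarray, dists: np.ndarray):
    """Height-based features of one neighbourhood (definitions as above; a sketch)."""
    H = z_query                                        # absolute height of the query point
    dH = np.max(np.abs(z_neigh - z_query))             # height difference within the neighbourhood
    sigma_H = np.std(np.append(z_neigh, z_query))      # std. dev. of heights (query point included, assumed)
    r_knn = dists[-1]                                  # neighbourhood radius: distance to the farthest neighbour
    k = len(z_neigh)
    D = (k + 1) / ((4.0 / 3.0) * np.pi * r_knn ** 3)   # local point density
    return H, dH, sigma_H, r_knn, D
```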
In the other part of the feature extraction, eigenvalue-based features were generated after the neighbourhood was determined. In order to compute these features for a point in the 3D point cloud, the covariance matrix (denoted as $C$) needed to be calculated:
$$C = \frac{1}{k}\sum_{i=1}^{k}\left(p_{i} - \bar{p}\right)\left(p_{i} - \bar{p}\right)^{\mathsf{T}}.$$
In this equation, $p_{i}$ is the $i$th neighbour and $\bar{p}$ is the geometric centre of the neighbourhood, determined by
$$\bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_{i}.$$
After the covariance matrix was computed, the eigenvalues ($w$) and eigenvectors ($v$) were determined. The direction along which the dataset has the maximum variation is indicated by the eigenvector with the largest eigenvalue. These were used for the eigenvalue-based feature extraction. First, the eigenvalues, and correspondingly the eigenvectors, were sorted in descending order as $\lambda_{1} \geq \lambda_{2} \geq \lambda_{3}$. In order to perform accurate feature extraction, it had to be ensured that the eigenvectors were normalized between 0 and 1 and that the eigenvalues were greater than 0 [38].
The calculated eigenvalue-based features were the linearity $L_{\lambda}$, planarity $P_{\lambda}$, sphericity $S_{\lambda}$, omnivariance $O_{\lambda}$, anisotropy $A_{\lambda}$, eigenentropy $E_{\lambda}$, sum of eigenvalues $\Sigma_{\lambda}$, and change of curvature $C_{\lambda}$:
$$L_{\lambda} = \frac{\lambda_{1} - \lambda_{2}}{\lambda_{1}}, \quad P_{\lambda} = \frac{\lambda_{2} - \lambda_{3}}{\lambda_{1}}, \quad S_{\lambda} = \frac{\lambda_{3}}{\lambda_{1}}, \quad O_{\lambda} = \sqrt[3]{\lambda_{1}\lambda_{2}\lambda_{3}},$$
$$A_{\lambda} = \frac{\lambda_{1} - \lambda_{3}}{\lambda_{1}}, \quad E_{\lambda} = -\sum_{i=1}^{3}\lambda_{i}\ln\lambda_{i}, \quad \Sigma_{\lambda} = \lambda_{1} + \lambda_{2} + \lambda_{3}, \quad C_{\lambda} = \frac{\lambda_{3}}{\lambda_{1} + \lambda_{2} + \lambda_{3}}.$$
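A compact sketch of the eigenvalue-based features for one neighbourhood is shown below, using the standard definitions reconstructed above; the eigenvalue normalization for the entropy term and the small epsilon guard are assumptions.

```python
import numpy as np

def eigen_features(neigh_xyz: np.ndarray, eps: float = 1e-8) -> dict:
    """Eigenvalue-based 3D features of one neighbourhood (a sketch)."""
    C = np.cov(neigh_xyz.T)                  # 3x3 covariance matrix of the neighbourhood
    w, _ = np.linalg.eigh(C)                 # eigenvalues of a symmetric matrix, ascending order
    l3, l2, l1 = np.maximum(w, eps)          # reorder so that l1 >= l2 >= l3 > 0
    s = l1 + l2 + l3
    e1, e2, e3 = l1 / s, l2 / s, l3 / s      # eigenvalues normalized to sum to 1 (assumed)
    return {
        "linearity":    (l1 - l2) / l1,
        "planarity":    (l2 - l3) / l1,
        "sphericity":   l3 / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "anisotropy":   (l1 - l3) / l1,
        "eigenentropy": -(e1 * np.log(e1) + e2 * np.log(e2) + e3 * np.log(e3)),
        "sum":          s,
        "curvature":    l3 / s,
    }
```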
The 3D features calculated for each point were reduced to a gridded horizontal plane in order to be combined with the features calculated from the satellite images. Each tile covered by the LiDAR point cloud was divided into 1 m × 1 m cells in the horizontal plane. Since one or more points may fall within the same cell, the feature value of each cell was represented by the average value of these points. All gridded LiDAR features can be used directly, but the absolute height requires post-processing. The elevations derived from LiDAR data represent geographic elevations as absolute heights, in contrast with the other features, which are characterized by statistical relationships. To determine the distance between objects and the ground in this study, the digital elevation model was subtracted from the height values of the points in the point cloud. Finally, the LiDAR feature extraction was completed by clipping the features to the extents of the satellite images.
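The reduction of per-point features to a 1 m grid can be sketched as a simple average-binning step; the tile origin, tile size, and clipping behaviour below are assumptions for illustration.

```python
import numpy as np

def rasterize_feature(xy: np.ndarray, values: np.ndarray,
                      x0: float, y0: float, size: int = 1000, cell: float = 1.0) -> np.ndarray:
    """Average per-point feature values onto a (size x size) grid of 1 m cells (a sketch)."""
    col = np.clip(((xy[:, 0] - x0) / cell).astype(int), 0, size - 1)
    row = np.clip(((xy[:, 1] - y0) / cell).astype(int), 0, size - 1)
    acc = np.zeros((size, size))
    cnt = np.zeros((size, size))
    np.add.at(acc, (row, col), values)       # sum of feature values per cell
    np.add.at(cnt, (row, col), 1)            # number of points per cell
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)  # cell-wise average
```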
Figure 3 illustrates the rendering of the generated features, with blue and red colours indicating low and high feature values. In general, these features provide insight into the geometry of objects in the point cloud and their relation with the surrounding objects. For instance, anisotropy represents the uniformity of a point cloud, while linearity can be defined as a measure of linear attributes. Furthermore, the absolute height provides a distinction between the roads and the other objects above them [39]. Together, these features complete the geometric relation between the road and its neighbouring objects, which is not properly handled in satellite-only segmentation solutions.
2.3. U-Net Model Structure
The U-Net was initially proposed for segmenting biomedical images [40]. It is formed of encoder (contracting), bridge, and decoder (expansive) blocks. As its name implies, it is a U-shaped architecture in which high-resolution features are extracted by down-sampling the input image, which is then up-sampled to recover the original spatial resolution. The high-dimensional feature spaces created by up-sampling allow the architecture to feed context information into higher-resolution layers. In order to achieve a precise output, the high-resolution features from the encoder path are concatenated with the up-sampled output, which enables precise localization. Consequently, pixel-wise predictions can be made.
As with many deep learning models, U-Net is susceptible to the vanishing gradient problem. Although deepening a network is intended to extract complex information that cannot otherwise be obtained, it also tends to expose the vanishing gradient problem. The reason for this is that, in the back-propagation stage, the gradient of the loss propagates only very small updates to the earlier layers, or sometimes none at all. Additionally, as the depth of the network increases, it becomes more difficult to optimize and reaches a kind of accuracy saturation, which leads to higher training error rates [41]. Accordingly, He et al. [1] introduced residual learning, called ResNet, which can extract underlying features by increasing the depth. It was originally proposed as a deep learning architecture for image classification. By integrating shortcut connections into the plain network and copying identity mappings from a shallower model, residual learning prevents the training error of a deeper model from increasing. The ResNet architecture includes residual connections, which enable the gradients to be carried forward and resolve the vanishing gradient problem as the network goes deeper. A ResNet model is identified by the number of layers it contains; five ResNet models with 18, 34, 50, 101, and 152 layers have been published, with model depth increasing from ResNet-18 to ResNet-152. In order to resolve a highly non-linear problem, important features can be extracted by deepening the model architecture. However, deeper models require high computational costs, and hyperparameter tuning is more challenging compared with shallow models. By exploiting ResNet's ability to provide identity mapping between layers, these training problems can be overcome while extracting high-resolution, complex features.
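To illustrate the shortcut-connection idea, a minimal Keras residual block is sketched below; the layer order and the 1 × 1 projection are standard choices and not necessarily the exact configuration of the ResNet backbones used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters: int):
    """A minimal residual (shortcut) block; illustrative, not the paper's exact block."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:                 # project the identity if channel counts differ
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])                   # identity shortcut carries gradients forward
    return layers.ReLU()(y)
```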
Zhang et al. [8] introduced a novel method for segmenting roads by combining the strengths of U-Net and deep residual learning. This combination alleviates the vanishing gradient problem and performs powerful segmentation: the residual units make the network easier to train, while the U-Net structure greatly reduces the intricacies of the algorithm. As part of this approach, called the deep residual U-Net architecture, residual blocks are incorporated into the encoder stages. Thus, the ResNet encoders are responsible for extracting difficult-to-obtain features from images, while the U-Net decoders are responsible for generating segmentation masks with precise localization.
2.4. Feature-Wise Fusion Strategy
In this study, the deep residual U-Net, described as an end-to-end network with a ResNet backbone, was used. To implement feature-wise fusion for segmentation, training was carried out in a sequential model in which the satellite-based features and the LiDAR-based features were both integrated. The model consists of two sequential parts connected to each other at the end of the U-Net architecture. This was achieved by forwarding the optical images through the encoder and decoder segments of the deep residual U-Net architecture, which make up the first sequential part of the model, to extract high-level features from the satellite images. In the second part, the 13 geometric features derived from the LiDAR point cloud are fed to the sequential model without any further computation to extract new features from them, as they already represent explicit geometric information. Before the final convolution block of the model, these geometric features (512 × 512 × 13) were concatenated with the high-level features (512 × 512 × 16) along the channel dimension to form a multi-layer feature map of size 512 × 512 × 29. This combined feature map was passed to the final convolution layer and an output layer with a sigmoid activation function to predict the road pixels (see Figure 4).
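A minimal Keras sketch of the feature-wise fusion head is given below. It assumes a TensorFlow/Keras implementation; the single convolution standing in for the deep residual U-Net backbone and all layer names are placeholders, while the channel counts (16, 13, and 29) follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

optical_in = tf.keras.Input(shape=(512, 512, 3), name="optical_image")
lidar_in = tf.keras.Input(shape=(512, 512, 13), name="lidar_features")   # 13 gridded LiDAR features

# Placeholder for the deep residual U-Net encoder-decoder; here a single convolution
# stands in for the backbone that produces a 512 x 512 x 16 high-level feature map.
decoder_out = layers.Conv2D(16, 3, padding="same", activation="relu")(optical_in)

# Feature-wise fusion: concatenate along the channel axis (16 + 13 = 29 channels).
fused = layers.Concatenate(axis=-1)([decoder_out, lidar_in])

# Final convolution block and sigmoid output for binary road segmentation.
x = layers.Conv2D(16, 3, padding="same", activation="relu")(fused)
road_mask = layers.Conv2D(1, 1, activation="sigmoid", name="road_mask")(x)

model = tf.keras.Model([optical_in, lidar_in], road_mask)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```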
2.5. Model Setup
The deep residual U-Net architecture consists of five up-sampling (decoder) blocks with decreasing filter sizes of 256, 128, 64, 32, and 16. The convolution blocks consist of a sequence of convolution, batch normalization, activation, and zero-padding layers repeated twice in each block. The convolution layers use a fixed kernel size and the ReLU activation function. In the up-sampling stage, a strided transpose convolution is used, with the stride consistent with the size of the filter. In this stage, a dropout layer was added to each convolution block flow.
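One up-sampling block of the decoder described above can be sketched as follows; the kernel size of 3, stride of 2, and the concatenation with an encoder skip connection are assumptions, since the exact values are not given here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_block(x, skip, filters: int, dropout_rate: float = 0.2):
    """One decoder block: transpose convolution, skip concatenation, and two
    convolution-batch normalization-activation-padding sub-blocks with dropout (a sketch)."""
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)  # up-sampling (stride assumed)
    x = layers.Concatenate()([x, skip])                                   # encoder skip connection (assumed)
    for _ in range(2):                                                    # repeated twice per block
        x = layers.Conv2D(filters, 3, padding="valid")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.ZeroPadding2D(1)(x)                                    # restore the spatial size
        x = layers.Dropout(dropout_rate)(x)
    return x
```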
The model was trained with 80% of the dataset, randomly split from the entire set of optical images, LiDAR features, and corresponding mask images. The training data were further divided into 75% training and 25% validation data. The Optuna framework [42] was used to optimize the hyperparameters, such as the optimizer, learning rate, and dropout rate. Adam was selected as the optimizer, and binary cross-entropy was used as the loss function of the model. The learning rate was tested over a range of values, and the dropout rate was tested from 10% to 50%. The training process was stopped after 10 epochs if no improvement was observed in the validation accuracy.
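A hedged sketch of this hyperparameter search with Optuna is given below; `build_model`, `train_ds`, and `val_ds` are hypothetical stand-ins for the actual training pipeline, and the learning-rate bounds and trial count shown are placeholder values, not those used in the study.

```python
import optuna
import tensorflow as tf

def objective(trial: optuna.Trial) -> float:
    # Search spaces are placeholders; the dropout range follows the 10-50% stated above.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout_rate", 0.1, 0.5)
    optimizer_name = trial.suggest_categorical("optimizer", ["adam", "rmsprop", "sgd"])

    model = build_model(dropout_rate=dropout, optimizer=optimizer_name, learning_rate=lr)
    history = model.fit(
        train_ds, validation_data=val_ds, epochs=100,
        callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10)],
    )
    return max(history.history["val_accuracy"])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```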
2.6. Evaluation Metrics
The trained models were evaluated based on five metrics, namely Precision, Recall, F1-Score, Intersection over Union (IoU), and Cohen's kappa score ($\kappa$). Precision is the proportion of predicted road pixels that are actually road, while Recall is the proportion of actual road pixels that are correctly predicted as road. The F1-Score is the balance between Precision and Recall. IoU indicates the ratio of the overlapping area to the total area of the prediction and the ground truth. All metrics range from 0 to 1 and can be calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \quad \kappa = \frac{p_{o} - p_{e}}{1 - p_{e}},$$
where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative counts, respectively; $p_{o} = \frac{TP + TN}{TP + TN + FP + FN}$ is the observed agreement, and $p_{e} = \frac{(TP+FP)(TP+FN) + (FN+TN)(FP+TN)}{(TP + TN + FP + FN)^{2}}$ is the agreement expected by chance.
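The metrics above can be computed directly from binary prediction and ground-truth masks; the NumPy sketch below follows the reconstructed formulas and is illustrative rather than the evaluation code used in the study.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Pixel-wise Precision, Recall, F1, IoU, and Cohen's kappa from binary masks (a sketch)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = float(np.sum(pred & truth))
    tn = float(np.sum(~pred & ~truth))
    fp = float(np.sum(pred & ~truth))
    fn = float(np.sum(~pred & truth))
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    p_o = (tp + tn) / n                                            # observed agreement
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return dict(precision=precision, recall=recall, f1=f1, iou=iou, kappa=kappa)
```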
4. Conclusions
In this study, a feature-wise fusion strategy for optical images and point cloud data was applied to enhance the road segmentation performance of a deep learning model based on the deep residual U-Net architecture. In order to compensate for the missing information in optical satellite images that stems from obstacles and shadow effects in the scenery, and for the absence of depth information in those images, we proposed combining 2D and 3D information. For this purpose, high-resolution satellite images and their corresponding road masks, generated over the state of Florida, were combined with 3D geometric features computed from airborne LiDAR data. In the proposed fusion technique, the optical satellite images were fed into a U-Net-based model architecture to generate deep features. Before the final convolution layer, these high-level features were concatenated with the geometric features of the point cloud.
The experimental results validated the effectiveness of the proposed fusion approach for enhancing road segmentation. The combination of 2D and 3D information from satellite images and the LiDAR point cloud showed superior results in areas where the satellite image-only models could not predict the road pixels accurately. In challenging areas where observation of the road pixels is precluded, such as wooded areas and shadows cast by objects in the scenery, the deep learning models trained with the fusion approach predicted the road pixels and the continuity of the road network better than deep learning models trained with satellite images only. The study provided new insight into the relationship between the 2D and 3D features of satellite images and LiDAR data. Moreover, the findings of this study showed that the combination of data from various sources can be promising for enhancing the quality of road segmentation. The feature importance analysis showed that, while the optical images have a significant impact on the prediction of road pixels, the contribution of LiDAR features, specifically linearity and the height difference within neighbouring points, can lead to a more effective segmentation model for the extraction of the road network. These 3D features and the geometric relation between neighbouring points proved to be as significant as the optical images, indicating the importance of contextual information within point cloud data. It is necessary to further investigate the optimal integration of the most important LiDAR features with the optical satellite images, so as not only to capture the geometric properties of the road network efficiently but also to minimize computational costs. It is worth noting that advancements in remote sensing data collection techniques and in the mobility of LiDAR and laser scanning platforms can ease the generation of accurate and reliable point cloud data. Hence, with these technological developments, combining 2D and 3D information from road networks to increase the performance of deep learning models can be a feasible solution for road extraction studies over challenging areas. The aim of this study is not to exceed the performance of all existing models, but rather to show that the combination of optical images and LiDAR can exceed the performance of satellite-only segmentation models; the improved model showed superiority in the problematic areas. In future studies, multi-model fusion strategies will be analyzed to exploit the contribution of the LiDAR features that were found to be most effective in road network extraction.