Article

Optimizing Semantic Segmentation of Street Views with SP-UNet for Comprehensive Street Quality Evaluation

1 School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
2 Sichuan Key Provincial Research Base of Intelligent Tourism, Sichuan University of Science and Engineering, Yibin 644000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sustainability 2025, 17(3), 1209; https://doi.org/10.3390/su17031209
Submission received: 16 December 2024 / Revised: 23 January 2025 / Accepted: 26 January 2025 / Published: 2 February 2025

Abstract

Traditional street quality evaluations are often subjective and limited in scale, failing to capture the nuanced and dynamic aspects of urban environments. This paper presents a novel, data-driven approach for objective and comprehensive street quality evaluation using street view images and semantic segmentation. The proposed SP-UNet (Spatial Pyramid UNet) is a multi-scale segmentation model that leverages VGG16, the SimSPPF (Simplified Spatial Pyramid Pooling-Fast) module, and the MLCA (Mixed Local Channel Attention) attention mechanism. This integration effectively enhances feature extraction, context aggregation, and detail preservation. Compared with the baseline UNet, the model achieves improvements of 5.83%, 6.52%, and 2.37% in mean Intersection over Union (mIoU), Mean Pixel Accuracy (mPA), and overall accuracy, respectively. Further analysis using the CRITIC method highlights the model’s strengths in various street quality dimensions across different urban areas. The SP-UNet model not only improves the accuracy of street quality evaluation but also offers valuable insights for urban managers to enhance the livability and functionality of urban environments.

1. Introduction

To enhance urban image and livability, scientifically assessing the quality of city streets is particularly important. Streets, as the heart of urban public spaces, reflect the cultural heritage and level of social development of a city [1]. However, traditional assessment methods, such as observation, surveys, or small-sample analysis [2], struggle to achieve large-scale and nuanced evaluations [3]. The introduction of street view images has breathed new life into street quality assessment research, offering abundant data and a fresh perspective with their wealth of information, intuitive visualization, and low-cost data acquisition [4,5,6,7,8,9,10,11].
The combination of machine learning with street view data has significantly advanced street quality research by providing extensive high-resolution datasets [12,13,14]. Deep learning has enhanced the accuracy and efficiency of these analyses, offering strong tools for evaluating and optimizing street quality [15]. Researchers use deep learning models to assess street spatial quality through image segmentation, which automatically identifies key elements like pedestrians, vehicles, buildings, and greenery [16,17,18,19], providing a comprehensive view of street composition. Street view images also help track changes in urban street quality. Additionally, they provide insights into the built environment’s impact on street vitality, aiding in the creation of comprehensive, multi-dimensional street quality assessment models [20,21,22,23,24]. Scholars have refined these models by including metrics such as road motorization rate and visual entropy to better reflect specific urban contexts [25,26,27]. However, deep learning-based semantic segmentation models still face challenges in accurately recognizing and classifying complex street elements, especially in dense or variable lighting scenarios, where capturing fine details and contextual information is crucial for precise assessment [28].
To address these challenges, this study focuses on the main urban areas of Chongqing, developing a multi-scale semantic segmentation model, SP-UNet, specifically optimized for street view images. By integrating the VGG16 network with the UNet backbone and incorporating the SimSPPF module and MLCA attention mechanism, the model effectively captures contextual information, reduces overfitting, and enhances segmentation accuracy. Through an in-depth analysis of the segmentation results and Points of Interest (POI) data, we extract and calculate indicators across various dimensions, including environmental quality, facility convenience, and safety, to comprehensively assess the street quality characteristics of different areas in Chongqing. The findings of this research offer valuable insights and practical references for improving street space quality and shaping the urban image.

2. Study Area and Data

2.1. Study Area

This study focuses on the main urban area of Chongqing, a major city located in southwestern China. As the largest municipality in the region, Chongqing serves as a crucial transportation hub and economic center for the upper Yangtze River. The research specifically targets nine districts: Yuzhong, Jiangbei, Shapingba, Dadukou, Jiulongpo, Nan’an, Beibei, Yubei, and Banan, as depicted in Figure 1.

2.2. Data Collection and Processing

2.2.1. Street View Data

Road network data for Chongqing was sourced from OpenStreetMap. Road segments with street view imagery available on Baidu Maps were retained, while those without were excluded. To increase the density of sampling points, the road network was densified at 100 m intervals, yielding a set of street view sampling points; a random selection of 100 of these points is shown in Figure 1. A Python script, combined with the panoramic static street view API, was used to retrieve street view images at each sampling point, covering a 360° panorama at 90° intervals. After spatial sampling and data partitioning, this process yielded a final dataset of 2879 street view images.
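The crawling step can be sketched as follows. This is a minimal illustration assuming a static panorama endpoint: the URL, parameter names, and API key are placeholders, not the exact Baidu API contract used in the study.

```python
import requests

API_URL = "https://api.map.baidu.com/panorama/v2"  # hypothetical endpoint
HEADINGS = [0, 90, 180, 270]  # 360° panorama captured at 90° intervals

def fetch_views(points, api_key):
    """Download one image per heading for each (lng, lat) sampling point."""
    for i, (lng, lat) in enumerate(points):
        for heading in HEADINGS:
            params = {"ak": api_key, "location": f"{lng},{lat}",
                      "heading": heading, "fov": 90,
                      "width": 1024, "height": 512}
            resp = requests.get(API_URL, params=params, timeout=10)
            if resp.ok:
                with open(f"sv_{i:05d}_h{heading}.jpg", "wb") as f:
                    f.write(resp.content)
```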

2.2.2. Sample Dataset

The semantic segmentation dataset comprises the open-source Cityscapes dataset and a selection of self-collected street view images, as depicted in Figure 2. The training and validation sets contain a total of 2637 meticulously annotated images, with 2373 images in the training set and 264 images in the validation set. Each image has a resolution of 1024 × 2048 pixels. To optimize GPU memory usage during model training, the image resolution was reduced to 512 × 512 pixels, and data augmentation techniques were employed. Ultimately, the validation set was utilized to assess the model’s accuracy.
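The resizing and augmentation step might look like the following torchvision sketch. The specific augmentations are assumptions, since the paper does not enumerate them, and in practice geometric transforms must be applied jointly to the image and its label mask.

```python
import torchvision.transforms as T

# Down-sample 1024x2048 frames to 512x512 to fit GPU memory, as described.
train_transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomHorizontalFlip(p=0.5),   # assumed augmentation; apply to the mask too
    T.ColorJitter(0.2, 0.2, 0.2),    # assumed photometric augmentation
    T.ToTensor(),
])
val_transform = T.Compose([T.Resize((512, 512)), T.ToTensor()])
```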

2.2.3. Acquisition of POI Data

Infrastructure POI data was sourced from Gaode Maps, encompassing information on accommodation, dining, shopping, and healthcare services across the nine main districts of Chongqing, as detailed in Table 1. These data facilitated the calculation of the facility convenience dimension.

3. Methodology

The main goal of this study is to evaluate the spatial quality of urban streets using semantic segmentation technology. The research is structured into two key phases: semantic segmentation and street quality evaluation. Figure 3 illustrates the process of urban street space quality evaluation, commencing with the image segmentation phase. In this phase, a multi-scale SP-UNet model is employed to perform semantic segmentation on street view images and extract relevant street indicators. Subsequently, road network data and POI data are integrated, and spatial statistics and proximity analysis methods are utilized to calculate geographic indicators. These geographic and street view indicators are then combined, and the CRITIC method is applied to determine the weight of each indicator, facilitating a comprehensive and quantitative evaluation of street quality across different urban areas.

3.1. Semantic Segmentation

In the network architecture of SP-UNet, we adopt UNet [29] as the base model and replace the backbone network with VGG16. Additionally, the SimSPPF module and the MLCA attention mechanism are integrated, as shown in Figure 4. This configuration enhances the model’s ability to perform precise semantic segmentation of street view images. SP-UNet has a deeper and wider architecture than the standard UNet, designed to capture more complex features and handle the variability inherent in street view data. Skip connections link the encoder’s feature maps with the decoder at the same resolution level, preserving spatial information and enabling the decoder to refine segmentations with context from higher-resolution layers. Residual blocks are introduced within the network to simplify optimization and enable the training of deeper networks, incorporating identity mappings that facilitate gradient flow during backpropagation. The integrated attention mechanism allows the model to focus on features relevant to street quality, emphasizing regions of interest while suppressing less significant ones.
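To make the architecture concrete, here is a structural sketch in PyTorch under our reading of Figure 4. The encoder stage splits and channel sizes are assumptions, and the SimSPPF and MLCA slots are stubbed with nn.Identity here; both modules are sketched in the next two subsections.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class UpBlock(nn.Module):
    """One decoder stage: upsample, concatenate skip, fuse, apply attention."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.attn = nn.Identity()  # MLCA attention slot (Section 3.1.3)

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)
        return self.attn(self.fuse(x))

class SPUNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        f = vgg16(weights="IMAGENET1K_V1").features  # ImageNet-pretrained encoder
        # VGG16 stages split at max-pool boundaries to expose skip features
        self.enc1, self.enc2, self.enc3 = f[:4], f[4:9], f[9:16]
        self.enc4, self.enc5 = f[16:23], f[23:30]
        self.sppf = nn.Identity()  # SimSPPF slot at the bottleneck (Section 3.1.2)
        self.dec4 = UpBlock(512, 512, 512)
        self.dec3 = UpBlock(512, 256, 256)
        self.dec2 = UpBlock(256, 128, 128)
        self.dec1 = UpBlock(128, 64, 64)
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)
        s2, s3 = self.enc2(s1), None
        s3 = self.enc3(s2)
        s4 = self.enc4(s3)
        x = self.sppf(self.enc5(s4))
        x = self.dec4(x, s4)
        x = self.dec3(x, s3)
        x = self.dec2(x, s2)
        x = self.dec1(x, s1)
        return self.head(x)
```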

3.1.1. VGG16

To enhance the feature extraction capability of the UNet backbone, this paper integrates the deeper VGG16 network to capture richer high-level semantic information. The down-sampling path now includes three convolutional layers, up from two, thereby improving the network’s nonlinear transformation capacity and further preserving image details and edge information. The architecture of the VGG16 network, depicted in Figure 5, comprises convolutional and max-pooling layers and is engineered to leverage pre-trained weights from ImageNet, which contributes to improved model initialization and enhanced training efficiency.

3.1.2. SimSPPF

To capture multi-scale contextual information, mitigate overfitting, and enhance generalization capabilities, this paper introduces the SimSPPF module, positioned at the feature layer junction of the UNet network’s encoding–decoding pathway, as depicted in Figure 6. The integration of the SimSPPF module reduces computational complexity, accelerates inference speed, and facilitates the fusion of local and global features, ultimately improving detection accuracy. These enhancements significantly improve the network’s performance in image segmentation and object detection tasks, making it more suitable for complex image processing scenarios.
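A sketch of the SimSPPF block, assuming the YOLOv6-style design: a 1×1 reduction, three chained 5×5 max-pools whose outputs are concatenated (equivalent to pooling at growing receptive fields), and a 1×1 fusion, with ReLU activations in place of SiLU. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    def __init__(self, c_in, c_out, k=1):
        super().__init__(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class SimSPPF(nn.Module):
    def __init__(self, c_in, c_out, pool_k=5):
        super().__init__()
        c_hid = c_in // 2
        self.reduce = ConvBNReLU(c_in, c_hid)
        self.pool = nn.MaxPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.fuse = ConvBNReLU(4 * c_hid, c_out)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)    # 5x5 receptive field
        p2 = self.pool(p1)   # effectively 9x9
        p3 = self.pool(p2)   # effectively 13x13
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```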

3.1.3. MLCA

To enhance the representation of spatial and channel features, improve detail retention, and enhance boundary precision, this paper presents an MLCA module, integrated into the upsampling phase, as depicted in Figure 7. By adaptively integrating local spatial features with channel-wise information, the module enables the network to selectively focus on critical regions and relevant channels within the image, thereby improving feature extraction capabilities for complex images. This enhancement markedly augments the network’s performance in image segmentation tasks, particularly in scenarios involving complex backgrounds and high detail requirements, allowing the model to more accurately extract target regions and provide more precise segmentation results.
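The following is a simplified, assumption-laden sketch of an MLCA-style block: a local branch (average pooling to a small spatial grid) and a global branch are each passed through a shared ECA-style 1D convolution over channels, blended, and used to re-weight the input. The grid size, kernel size, and blending factor are our illustrative choices, not the paper’s settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLCA(nn.Module):
    def __init__(self, channels, grid=5, k=3, alpha=0.5):
        super().__init__()
        self.grid, self.alpha = grid, alpha
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)  # shared over both branches

    def _channel_weights(self, pooled):
        # pooled: (B, C, g, g) -> per-cell channel attention via 1D conv over C
        b, c, g, _ = pooled.shape
        y = pooled.permute(0, 2, 3, 1).reshape(b * g * g, 1, c)
        y = self.conv(y).reshape(b, g, g, c).permute(0, 3, 1, 2)
        return y

    def forward(self, x):
        local = self._channel_weights(F.adaptive_avg_pool2d(x, self.grid))
        glob = self._channel_weights(
            F.adaptive_avg_pool2d(x, 1).expand(-1, -1, self.grid, self.grid))
        attn = torch.sigmoid(self.alpha * local + (1 - self.alpha) * glob)
        return x * F.interpolate(attn, size=x.shape[-2:], mode="nearest")
```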

3.2. Construction of Evaluation Indicators

3.2.1. Evaluation Indicators

The selection and analysis of street indicators encompass a variety of urban features that contribute to the evaluation of street quality. In this study, the following indicators were selected for analysis: urban green view rate, sky openness, vehicle density, building density, interface enclosure, medical facility density, shopping facility density, catering service facility density, and accommodation service facility density [30], as detailed in Table 2.

3.2.2. Indicator Calculation

In this paper, a deep learning model is employed to semantically segment street view images, while POI data are utilized to characterize the urban visual environment, aiding in the quantitative analysis of street quality. Street quality is evaluated from three dimensions: environment, safety, and facility convenience. The indicators for each dimension are calculated as presented in Table 3.
The street view semantic segmentation indicators are used to calculate the environmental dimension feature f1, and the calculation formula is:
$$f_1 = c_1 w_1 + c_2 w_2$$
The street view semantic segmentation indicators are used to calculate the security dimension feature f2, and the calculation formula is:
$$f_2 = r_1 w_3 + r_2 w_4 + r_3 w_5$$
The POI index is used to calculate the facility convenience dimension feature f3, and the calculation formula is:
$$f_3 = t_1 w_6 + t_2 w_7 + t_3 w_8 + t_4 w_9$$
The weights of each indicator are calculated using the CRITIC method [31], which comprehensively considers both the variability and the correlation between indicators. Unlike methods that rely solely on the magnitude or fluctuation of individual data points, the CRITIC method allocates weights by analyzing the relationships among the indicators, ensuring a more comprehensive and objective evaluation process. The specific formula is as follows:
$$w_j = \frac{\sigma_j \sum_{i=1}^{n} (1 - r_{ij})}{\sum_{j=1}^{m} \sigma_j \sum_{i=1}^{n} (1 - r_{ij})}$$

In the formula, $w_j$ is the weight of the j-th indicator, $r_{ij}$ is the correlation coefficient between indicators i and j, and $\sigma_j$ is the standard deviation of indicator j.
Different indicators may have different dimensions and numerical ranges. Normalization eliminates these differences, making the data comparable, stabilizing model training, improving the model’s generalization ability, and accelerating convergence. When calculating the CRITIC weights, indicators for which a larger value reflects better street quality are normalized as:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Indicators for which a larger value reflects worse street quality are normalized as:

$$x' = \frac{x_{\max} - x}{x_{\max} - x_{\min}}$$

These two formulas map the original value $x$ to the range [0, 1], where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the dataset.
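A compact NumPy sketch of this weighting pipeline: min-max normalization (inverting cost-type indicators) followed by the CRITIC formula above. The toy matrix and the choice of which column is cost-type are illustrative assumptions.

```python
import numpy as np

def critic_weights(X, cost_cols=frozenset()):
    """X: (streets x indicators) raw matrix -> normalized matrix and weights."""
    Xn = X.astype(float).copy()
    for j in range(Xn.shape[1]):                 # min-max normalization
        lo, hi = Xn[:, j].min(), Xn[:, j].max()
        if j in cost_cols:                       # larger value = worse quality
            Xn[:, j] = (hi - Xn[:, j]) / (hi - lo)
        else:                                    # larger value = better quality
            Xn[:, j] = (Xn[:, j] - lo) / (hi - lo)
    sigma = Xn.std(axis=0, ddof=1)               # variability of each indicator
    r = np.corrcoef(Xn, rowvar=False)            # correlations between indicators
    info = sigma * (1.0 - r).sum(axis=0)         # conflict-weighted information
    return Xn, info / info.sum()                 # weights sum to 1

# Toy example with three indicators; treating the third as cost-type
# (e.g. vehicle density) is an illustrative assumption.
X = np.array([[0.14, 0.08, 0.047],
              [0.18, 0.14, 0.031],
              [0.32, 0.14, 0.023],
              [0.20, 0.18, 0.016]])
Xn, w = critic_weights(X, cost_cols={2})
scores = Xn @ w                                  # weighted sums, as in f1..f3
```

The returned weights sum to one and combine with the normalized indicators as weighted sums, mirroring the structure of the $f_1$–$f_3$ formulas.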

4. Results

4.1. Performance Evaluation and Comparative Analysis of SP-UNet for Street View Semantic Segmentation

4.1.1. Training Setting

The experiment on street quality evaluation was conducted using Python 3.8, with CUDA 12.2 and PyTorch 2.3.0 as the deep learning framework. The hardware configuration consisted of an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory and an Intel Core i9-10900K CPU with 64 GB of memory. The experimental parameters were as follows: a batch size of 4, training for 100 epochs, and using the Adam optimizer with an initial learning rate of 0.0001. Image preprocessing involved resizing the images to 512 × 512 pixels.
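The configuration above translates directly into a standard PyTorch training loop. This sketch assumes the SPUNet class from Section 3.1 and a dataset object yielding (image, mask) pairs, neither of which is shown here.

```python
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SPUNet(n_classes=10).to(device)                    # assumed from Section 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial LR 0.0001
criterion = torch.nn.CrossEntropyLoss()
# train_dataset: Dataset yielding (image, mask) pairs; definition not shown
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # batch size 4

model.train()
for epoch in range(100):                  # 100 epochs
    for images, masks in loader:          # images resized to 512x512
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```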

4.1.2. Model Accuracy

This study uses the mean intersection over union (mIoU), mean pixel accuracy (mPA), overall accuracy, and F1 score as evaluation metrics [32]. The street view images of Chongqing are categorized into ten classes: cars, sky, tricycles, motorcycles, electric bikes, vegetation, humans, buildings, walls, and fences. The SP-UNet model achieved an mIoU of 62.48%, an mPA of 72.57%, an overall accuracy of 91.09%, and an F1 score of 73.92%, representing increases of 5.83%, 6.52%, and 2.37% in mIoU, mPA, and accuracy, respectively, compared to the baseline model.
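For reference, all three pixel-level metrics can be derived from a single confusion matrix; a minimal NumPy sketch:

```python
import numpy as np

def confusion_matrix(pred, target, n_classes=10):
    """Accumulate pixel counts; pred and target are integer class maps."""
    idx = n_classes * target.reshape(-1) + pred.reshape(-1)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp)   # per-class IoU
    pa = tp / conf.sum(1)                         # per-class pixel accuracy
    return {"mIoU": np.nanmean(iou), "mPA": np.nanmean(pa),
            "accuracy": tp.sum() / conf.sum()}
```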

4.1.3. Ablation Experiment

To evaluate the efficacy of the SP-UNet model and validate its performance, we conducted an ablation study grounded in the original UNet architecture. This study entailed the systematic creation of various model iterations, each sequentially integrating the VGG16, SimSPPF, and MLCA modules. As depicted in Table 4, the integration of VGG16 into the UNet architecture yields a marked enhancement in performance, with respective increments in the mIoU, mPA, and accuracy metrics of 5.75%, 5.75%, and 2.22% compared to the baseline UNet. VGG16 serves as a robust feature extractor, leveraging its deep convolutional layers to capture intricate details. This improvement is consistent with findings from previous studies [33], which have demonstrated the effectiveness of VGG16 as a feature extractor in image segmentation tasks.
The addition of the SimSPPF module alone, however, led to a marked decrease in performance, with mIoU, mPA, and accuracy dropping by 9.81%, 9.97%, and 3.89%, respectively. SimSPPF enhances multi-scale feature learning, which is crucial for handling varying object sizes in segmentation tasks, but this result suggests that its standalone effect is not optimal: it may introduce redundancy or confusion between classes, negatively impacting segmentation accuracy. This observation aligns with the findings of Simonyan [34], who noted that multi-scale feature fusion modules can sometimes introduce noise if not properly integrated.
On the other hand, integrating the MLCA attention mechanism into the UNet model resulted in a notable improvement, with mIoU, mPA, and accuracy increasing by 3.31%, 4.44%, and 3.13%, respectively, demonstrating its ability to enhance image detail and boundary expression. MLCA introduces an attention mechanism that refines feature representation by emphasizing relevant spatial information. This is in line with the results reported by Shui Y [35], who showed that attention mechanisms can significantly improve the performance of segmentation models by focusing on relevant features. When combining VGG16 and SimSPPF, the model’s performance slightly improved, with mIoU, mPA, and accuracy increasing by 5.75%, 6.34%, and 2.09%, respectively, indicating that these two modules complement each other effectively.
Finally, the SP-UNet model, which incorporates all the improvements, achieved the best overall performance, with mIoU, mPA, and accuracy increasing by 5.83%, 6.52%, and 2.37%, respectively, compared to the original UNet model. This demonstrates that the integration of VGG16, SimSPPF, and MLCA attention mechanisms provides significant enhancements to the model’s segmentation accuracy and overall performance.
The mIoU values exhibit a progressive increase with the sequential addition of each of the three modules to the model, as shown in Figure 8. Significantly, the simultaneous integration of all three modules yields the highest mIoU, thereby illustrating the synergistic enhancement in model performance that arises from the combination of these components.

4.1.4. Comparative Experiment

To evaluate the performance of various semantic segmentation models for street view image analysis, this study conducted a comprehensive comparison between the proposed SP-UNet model and several contemporary advanced models. The experimental results, meticulously summarized in Table 5, provide a detailed overview of each model’s performance across key metrics such as mIoU, mPA, accuracy, and F1 score. The SP-UNet model emerged as the top performer, achieving an impressive mIoU of 62.48%, an mPA of 72.57%, an accuracy of 91.09%, and an F1 score of 73.92%. A comparative analysis with the widely used baseline model, UNet, highlights significant enhancements in SP-UNet’s capabilities, with notable increases of 5.83% in mIoU, 6.52% in mPA, 2.37% in accuracy, and 7.55% in F1 score. This study further expanded its evaluation to include other prominent models like DeeplabV3+, HRNet, PSPNet, LinkNet, and ESPNet. Across all four performance indicators, SP-UNet consistently demonstrated its prowess, underscoring its robustness and precision in the challenging domain of street view image segmentation.
While UNet has demonstrated its excellence in medical image segmentation, DeeplabV3+ has shown commendable performance in a variety of scenarios, HRNet is renowned for its deep feature fusion capabilities, PSPNet adeptly captures multi-scale contextual information through its pyramid pooling modules, LinkNet offers efficient inference times, and ESPNet is specifically designed for real-time applications. However, when it comes to the semantic segmentation of street view images, certain limitations become apparent. UNet’s feature extraction capabilities, although effective in medical contexts, do not match the power and sophistication of SP-UNet’s. DeeplabV3+’s atrous convolution-based feature extraction, while generally effective, lacks the refinement and precision that SP-UNet brings to the table. HRNet, despite its strengths in deep feature fusion, is not as finely optimized for the unique challenges of street scenes as SP-UNet is. PSPNet’s reliance on pyramid pooling modules for capturing multi-scale information falls short of the advanced feature representation optimization provided by SP-UNet’s MLCA attention mechanism. LinkNet, while offering fast inference, does not reach the same levels of accuracy and F1 score that SP-UNet achieves, indicating a trade-off between speed and precision. Lastly, ESPNet, although designed for real-time applications, lags behind SP-UNet in terms of mIoU and mPA, suggesting that its focus on speed comes at the expense of segmentation accuracy and completeness.

4.1.5. Street View Semantic Segmentation Results

The SP-UNet model performs semantic segmentation on street view images, dividing the elements within the images into categories such as sky, vegetation, buildings, roads, pedestrians, and vehicles, and calculating the pixel proportion of each category. The original image and its corresponding segmentation result are shown in Figure 9. More accurate segmentation means that the model can more precisely identify and classify different types of objects, and thus extract street space features from the images with greater fidelity. Combined with other data sources, these features support a comprehensive street quality assessment that can provide important references for urban planning and management, helping to build a more livable and attractive urban environment.
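Computing the per-class pixel proportions from a predicted class map is straightforward; in this sketch the label-id mapping is an illustrative assumption, not the dataset’s actual encoding.

```python
import numpy as np

LABELS = {0: "sky", 1: "vegetation", 2: "building", 3: "car",
          4: "wall", 5: "fence", 6: "human"}  # assumed id mapping

def element_proportions(mask):
    """mask: (H, W) integer class map -> {class name: pixel share}."""
    total = mask.size
    return {name: float((mask == idx).sum()) / total
            for idx, name in LABELS.items()}

# e.g. green view rate = proportions["vegetation"]; sky openness = proportions["sky"]
```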

4.2. Street Quality Evaluation Based on Semantic Segmentation

Utilizing the proportions of street view elements derived from semantic segmentation, the street view indices detailed in Table 2 are calculated to obtain urban indicators, including urban green visibility, sky openness, vehicle density, building density, and interface enclosure, among other metrics. The computed results are tabulated in Table 6. To enhance the interpretability of our findings, ArcGIS software was employed to visualize the results.

4.2.1. Green View Index

The Green View Index reflects the proportion of visible green spaces within a city, serving as an indicator of the urban greening level and ecological environment. As illustrated in Figure 10, Beibei District stands out with the highest Green View Index, signaling the area’s extensive green coverage and overall favorable green space environment. Both Dadukou and Yubei Districts also exhibit relatively high Green View Indices, although they fall short of Beibei’s levels. In contrast, Yuzhong and Banan Districts show lower Green View Indices, suggesting these areas may have limited green coverage and less favorable urban greening conditions.

4.2.2. Openness

The Open Space Index measures the proportion of open space within a city, which is often closely linked to urban accessibility and livability. As shown in Figure 11, Yubei District has the highest Open Space Index, indicating a greater availability of open spaces, making the area more suitable for residential living and recreational activities. Nan’an District and Shapingba District also feature spacious public and leisure areas. In contrast, Jiangbei District has a lower Open Space Index, suggesting potential challenges related to spatial density and insufficient open areas.

4.2.3. Vehicle Density

Vehicle density indicates the number of vehicles per unit area, which is typically associated with traffic flow and congestion. As shown in Figure 12, Jiulongpo District has a high vehicle density, suggesting that the area may face significant traffic congestion. Yuzhong District and Jiangbei District also exhibit relatively high vehicle densities, which could point to potential traffic issues. In contrast, Yubei District and Beibei District have lower vehicle densities, indicating less traffic congestion and potentially smoother traffic flow.

4.2.4. Building Density

Building density refers to the number of buildings per unit area, reflecting the level of spatial utilization and development density in a city. As shown in Figure 13, Yuzhong District and Jiulongpo District have the highest building densities, indicating higher levels of urbanization, more closely packed buildings, and possibly greater population densities. Jiangbei District also has a relatively high building density, suggesting more compact urban development. In contrast, Beibei District and Yubei District have lower building densities, indicating less urban development and a more spacious, relaxed residential environment.

4.2.5. Enclosure Degree

Interface enclosure refers to the degree of closure in urban streets or areas, often associated with a sense of security and clearly defined boundaries. As shown in Figure 14, Yuzhong District has the highest level of interface enclosure, with a more closed street layout, possibly indicating a commercial or central area with a strong sense of separation. Jiulongpo District and Jiangbei District also exhibit relatively high levels of enclosure, suggesting well-defined boundaries and more spatial separation. In contrast, Yubei District and Banan District show lower levels of interface enclosure, implying a greater sense of openness and better connectivity between neighborhoods.

4.3. Multi-Dimensional Feature Results in Different Urban Areas

Based on the calculated characteristic indices and integrated with POI data, the evaluation indices for all sampling points within the main urban area are computed. The urban comprehensive index is subsequently calculated using the weight coefficients provided in Table 3. The outcomes of these calculations are presented in Table 7. Furthermore, a comprehensive evaluation of the study area’s multi-dimensional attributes is conducted using the feature expression vector.
The comprehensive scores, as illustrated in Figure 15, reveal key insights into the livability of different districts. Beibei District stands out for its high Green View Index and Open Space Index, coupled with low vehicle density and building density, which suggests a higher standard of living. In contrast, Yuzhong District and Jiulongpo District, which are commercial and densely populated, have higher building density and vehicle density, but their low Green View and Open Space Indices indicate potential environmental quality concerns. Yubei District and Banan District offer a more open environment, despite having relatively lower Green View and Open Space Indices; improving greenery and public spaces in these areas could further enhance livability. Other districts also exhibit varying levels of greenery and spatial characteristics, and urban planning strategies that consider these unique characteristics can contribute to enhancing overall street quality and livability.
To examine the inter-relationships among the multi-dimensional features of the study area, Pearson correlation analysis was conducted on the feature vectors. The Pearson correlation coefficient, an extensively applied metric in statistical inquiries [41], quantifies both the intensity and direction of the linear association between any two variables. The computation of the Pearson correlation coefficient is delineated by the following formula:
$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

Here, $\rho(X, Y)$ is the Pearson correlation coefficient between variables X and Y, $\mathrm{cov}(X, Y)$ is their covariance, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.
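As a quick sanity check, the coefficient can be computed directly from the definition and compared against NumPy’s built-in; the two example vectors reuse the comprehensive scores and facility convenience values of five districts from Table 7.

```python
import numpy as np

x = np.array([0.3540, 0.4798, 0.2929, 0.3470, 0.4975])  # comprehensive scores
y = np.array([535.59, 611.30, 453.02, 630.51, 956.29])  # facility convenience

rho = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(round(rho, 3))  # strength and sign of the linear association
```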
Figure 16 illustrates the Pearson correlation coefficients for the multi-dimensional features within the study area. The comprehensive score exhibits a notable positive correlation with the facility convenience dimension, whereas it reveals only weak correlations with the environmental and safety dimensions. The feature vectors indicate that Yuzhong District, Jiangbei District, and Jiulongpo District have superior safety metrics and are labeled “safe”. Beibei District and Yubei District are distinguished by their exceptional environmental qualities and are labeled “beautiful”. Shapingba District and Yuzhong District are characterized by high levels of facility convenience and are labeled “convenient”. Conversely, Nan’an District and Banan District are noted for their low vehicle density and expansive skies and are categorized under the “landscape-oriented” label.

5. Conclusions

This study utilizes street view image data from Chongqing’s main urban areas to develop a multi-scale semantic segmentation model and a multi-dimensional feature evaluation system for comprehensive street quality analysis. The proposed SP-UNet model, enhanced with VGG16, the SimSPPF module, and the MLCA attention mechanism, demonstrates superior capability in accurately extracting and representing multi-scale features. Compared to traditional models, SP-UNet produces clearer boundaries, retains more details, and achieves significant improvements in key metrics such as mean intersection over union (mIoU) and accuracy. These advancements offer strong support for the quantitative analysis of street quality and highlight the model’s potential for complex urban street view applications.
By integrating segmentation results with POI data, a multi-dimensional evaluation system was created to assess urban characteristics such as environmental quality, facility convenience, and safety. The CRITIC method was employed to calculate indicator weights, revealing notable spatial variations in these dimensions across different urban districts in Chongqing. This approach not only provides a fresh perspective on urban structure and function but also supplies actionable data to inform targeted urban optimization strategies.
In the environmental dimension, this research underscores the significance of urban greening in enhancing street quality and livability. Urban planners should prioritize two key actions: increasing green spaces and improving sky visibility. These efforts will not only bolster the urban environment but also enhance aesthetics and overall livability. Policymakers can leverage these findings to develop targeted greening initiatives, for example, creating more parks to expand public green areas, planting street trees to enhance urban biodiversity, and promoting green roofs to mitigate heat island effects. Collectively, these measures can foster a healthier and more attractive urban landscape.
In the safety dimension, this study highlights a correlation between high building density, a sense of enclosure, and residents’ perceptions of safety. Effective planning is crucial for managing building density, preventing the “canyon effect”, designing broader streets, installing surveillance systems, and expanding public spaces to enhance the city’s security and reduce crime risks. Urban policies should be developed to regulate building heights and densities, particularly in high-density areas, ensuring that the built environment fosters a sense of safety and contributes to the overall security of the city.
In the facility convenience dimension, this research reveals a positive correlation between the comprehensive score and the convenience of facilities. Efforts should be made to develop business districts, enhance public transportation networks, ensure a balanced distribution of public services, and adopt smart urban management systems to improve residents’ quality of life and the efficiency of urban services. This insight can guide urban planners and policymakers in allocating resources more effectively, focusing on areas that need improvements in public facilities and transportation infrastructure to enhance the convenience and livability of urban areas.
Despite the valuable insights offered, this research has its limitations. For example, biases may arise from the acquisition of street view image data, which may not fully represent all environmental conditions. Although the model’s generalization ability has improved, challenges persist in processing images with extreme lighting or occlusions. Future studies could integrate multi-source data to refine the evaluation’s comprehensiveness, including remote sensing images for large-scale environmental monitoring, field survey data for ground-truth validation, and cross-platform street view data to mitigate sampling bias. Although the SP-UNet model has achieved performance improvements, computational efficiency is equally important in practical applications, especially in urban management systems that require real-time feedback. Therefore, researching lighter network architectures or optimizing the computational processes of existing models to reduce inference time is an important direction for future work. Moreover, future model optimization could explore more efficient attention mechanisms or multimodal fusion techniques to enhance segmentation performance and computational efficiency.

Author Contributions

Conceptualization, C.H. and W.L.; methodology, C.H. and W.L.; software, W.L.; validation, W.L.; formal analysis, C.H.; investigation, W.L.; resources, W.L. and C.H.; data curation, W.L.; writing—original draft preparation, W.L.; writing—review and editing, C.H.; visualization, W.L.; supervision, C.H.; project administration, C.H.; funding acquisition, C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 42471437), the Sichuan Key Provincial Research Base of Intelligent Tourism (Grant No. ZHZJ24-02), the Scientific Research and Innovation Team Program of Sichuan University of Science and Engineering (Grant No. SUSE652A006), and the Graduate Innovation Fund of Sichuan University of Science and Engineering (Grant No. Y2024126).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets can be provided by the corresponding author upon reasonable request.

Acknowledgments

The authors would like to express their heartfelt gratitude to those people who have helped with this manuscript and to the reviewers for their comments on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wei, X.; Wang, T.; Jie, Z. Research on Street Space Quality Measurement Based on Multi-Source Data: Taking Wuhu Downtown as an Example. J. Beijing Inst. Civ. Eng. Archit. 2024, 40, 37.
2. Li, S.; Ma, S.; Tong, D.; Jia, Z.; Li, P.; Long, Y. Associations between the quality of street space and the attributes of the built environment using large volumes of street view pictures. Environ. Plan. B Urban Anal. City Sci. 2022, 49, 1197–1211.
3. Tang, F.; Zeng, P.; Wang, L.; Zhang, L.; Xu, W. Urban Perception Evaluation and Street Refinement Governance Supported by Street View Visual Elements Analysis. Remote Sens. 2024, 16, 3661.
4. Wu, D.; Gong, J.; Liang, J.; Sun, J.; Zhang, G. Analyzing the influence of urban street greening and street buildings on summertime air pollution based on street view image data. ISPRS Int. J. Geo-Inf. 2020, 9, 500.
5. Yang, Z.; Yu, H.; Feng, M.; Sun, W.; Lin, X.; Sun, M.; Mao, Z.H.; Mian, A. Small object augmentation of urban scenes for real-time semantic segmentation. IEEE Trans. Image Process. 2020, 29, 5175–5190.
6. Zhou, W.; Lin, X.; Lei, J.; Yu, L.; Hwang, J.N. MFFENet: Multiscale feature fusion and enhancement network for RGB–thermal urban road scene parsing. IEEE Trans. Multimed. 2021, 24, 2526–2538.
7. Patel, M.J.; Kothari, A.M.; Koringa, H.P. A novel approach for semantic segmentation of automatic road network extractions from remote sensing images by modified UNet. Radioelectron. Comput. Syst. 2022, 161–173.
8. Yuan, F.; Zhu, Y.; Li, K.; Fang, Z.; Shi, J. An anisotropic non-local attention network for image segmentation. Mach. Vis. Appl. 2022, 33, 23.
9. Ai, Y.; Liu, X.; Zhai, H.; Li, J.; Liu, S.; An, H.; Zhang, W. Multi-Scale Feature Fusion with Attention Mechanism Based on CGAN Network for Infrared Image Colorization. Appl. Sci. 2023, 13, 4686.
10. Xu, J.; Wang, J.; Zuo, X.; Han, X. Spatial quality optimization analysis of streets in historical urban areas based on street view perception and multisource data. J. Urban Plan. Dev. 2024, 150, 05024036.
11. Li, J.; Zheng, K.; Gao, L.; Ni, L.; Huang, M.; Chanussot, J. Model-informed Multistage Unsupervised Network for Hyperspectral Image Super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516117.
12. Sun, D.; Ji, X.; Lyu, M.; Fu, Y.; Gao, W. Evaluation and diagnosis for the pedestrian quality of service in urban riverfront streets. J. Clean. Prod. 2024, 452, 142090.
13. Li, J.; Zheng, K.; Li, Z.; Gao, L.; Jia, X. X-shaped interactive autoencoders with cross-modality mutual learning for unsupervised hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518317.
14. Li, J.; Zheng, K.; Yao, J.; Gao, L.; Hong, D. Deep unsupervised blind hyperspectral and multispectral data fusion. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
15. Gong, F.Y.; Zeng, Z.C.; Zhang, F.; Li, X.; Ng, E.; Norford, L.K. Mapping sky, tree, and building view factors of street canyons in a high-density urban environment. Build. Environ. 2018, 134, 155–167.
16. Gao, S.; Yang, K.; Shi, H.; Wang, K.; Bai, J. Review on panoramic imaging and its applications in scene understanding. IEEE Trans. Instrum. Meas. 2022, 71, 1–34.
17. Li, J.; Zheng, K.; Liu, W.; Li, Z.; Yu, H.; Ni, L. Model-guided coarse-to-fine fusion network for unsupervised hyperspectral image super-resolution. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5508605.
18. Tang, S.L.; Ayob, N.; Puah, C.H.; Kim, Y. Web Text Analysis of Image Perception of Tourist Destinations. Int. J. Acad. Res. Econ. Manag. Sci. 2024, 13.
19. Yu, M.; Zheng, X.; Qin, P.; Cui, W.; Ji, Q. Urban Color Perception and Sentiment Analysis Based on Deep Learning and Street View Big Data. Appl. Sci. 2024, 14, 9521.
20. Tang, J.; Long, Y. Measuring visual quality of street space and its temporal variation: Methodology and its application in the Hutong area in Beijing. Landsc. Urban Plan. 2019, 191, 103436.
21. Wu, W.; Niu, X.; Li, M. Influence of built environment on street vitality: A case study of West Nanjing Road in Shanghai based on mobile location data. Sustainability 2021, 13, 1840.
22. Rui, Q.; Cheng, H. Quantifying the spatial quality of urban streets with open street view images: A case study of the main urban area of Fuzhou. Ecol. Indic. 2023, 156, 111204.
23. Navarrete-Hernandez, P.; Vetro, A.; Concha, P. Building safer public spaces: Exploring gender difference in the perception of safety in public space through urban design interventions. Landsc. Urban Plan. 2021, 214, 104180.
24. Ye, Y.; Zeng, W.; Shen, Q.; Zhang, X.; Lu, Y. The visual quality of streets: A human-centred continuous measurement based on machine learning algorithms and street view images. Environ. Plan. B Urban Anal. City Sci. 2019, 46, 1439–1457.
25. Wang, M.; He, Y.; Meng, H.; Zhang, Y.; Zhu, B.; Mango, J.; Li, X. Assessing street space quality using street view imagery and function-driven method: The case of Xiamen, China. ISPRS Int. J. Geo-Inf. 2022, 11, 282.
26. Ma, X.; Ma, C.; Wu, C.; Xi, Y.; Yang, R.; Peng, N.; Zhang, C.; Ren, F. Measuring human perceptions of streetscapes to better inform urban renewal: A perspective of scene semantic parsing. Cities 2021, 110, 103086.
27. Qiu, W.; Li, W.; Liu, X.; Huang, X. Subjectively measured streetscape perceptions to inform urban design strategies for Shanghai. ISPRS Int. J. Geo-Inf. 2021, 10, 493.
28. Zhang, F.; Zhang, D.; Liu, Y.; Lin, H. Representing place locales using scene elements. Comput. Environ. Urban Syst. 2018, 71, 153–164.
29. Li, X.; Qian, W.; Xu, D.; Liu, C. Image segmentation based on improved unet. In Proceedings of the Journal of Physics: Conference Series, Online, 2 November 2021; Volume 1815, p. 012018.
30. Li, X.; Pang, C. A Spatial Visual Quality Evaluation Method for an Urban Commercial Pedestrian Street Based on Streetscape Images—Taking Tianjin Binjiang Road as an Example. Sustainability 2024, 16, 1139.
31. Lucchi, E.; Buda, A. Urban green rating systems: Insights for balancing sustainable principles and heritage conservation for neighbourhood and cities renovation planning. Renew. Sustain. Energy Rev. 2022, 161, 112324.
32. Du, Z.; Ye, H.; Cao, F. A novel local–global graph convolutional method for point cloud semantic segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 4798–4812.
33. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
35. Shui, Y.; Yuan, K.; Wu, M.; Zhao, Z. Improved Multi-Size, Multi-Target and 3D Position Detection Network for Flowering Chinese Cabbage Based on YOLOv8. Plants 2024, 13, 2808.
36. Pan, S.; Li, J.; Jiang, J. A street view semantic segmentation algorithm based on DeeplabV3+ architecture. In Proceedings of the 3rd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2023), Hong Kong, China, 31 March–2 April 2023; Volume 12717, pp. 602–609.
37. Zhou, R.; Zhao, J. HISNet: A hybrid instance segmentation network for urban street scenes combining top-down and bottom-up approaches. In Proceedings of the Fourth International Conference on Computer Vision and Pattern Analysis (ICCPA 2024), Anshan, China, 17–19 May 2024; Volume 13256, pp. 407–411.
38. Ge, J.; Li, Y.; Jiu, M.; Cheng, Z.; Zhang, J. PFANet: A network improved by PSPNet for semantic segmentation of street scenes. In Proceedings of the International Conference in Communications, Signal Processing, and Systems, Online, 23–24 July 2022; pp. 128–134.
39. Zhang, L.; Liang, X. Image Segmentation of Plant Leaves in Natural Environments Based on LinkNet. J. Comput. Electron. Inf. Manag. 2023, 11, 67–72.
40. Shen, S.; Wang, J.; Feng, Y.; Li, Y. Multi-Modal Instrument Recognition Method Based on Improved YOLOv5s and ESPNet. In Proceedings of the 2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 26–28 July 2024; pp. 549–555.
41. Bermudez-Edo, M.; Barnaghi, P.; Moessner, K. Analysing real world data streams with spatio-temporal correlations: Entropy vs. Pearson correlation. Autom. Constr. 2018, 88, 87–100.
Figure 1. The spatial distribution of the study area and street sampling points. (a) The scope of the study area. (b) The spatial distribution of some street sampling points.
Figure 2. Street view.
Figure 3. Research framework diagram.
Figure 4. SP-UNet model diagram.
Figure 5. VGG16 network structure.
Figure 6. SimSPPF module diagram.
Figure 7. MLCA network structure.
Figure 8. Comparison of mIoU. Note: Model 1: VGG16; Model 2: VGG16 + SimSPPF; Model 3: VGG16 + SimSPPF + MLCA.
Figure 9. Street view elements extraction: (a) original image; (b) output image.
Figure 10. Spatial distribution of Green View Index.
Figure 11. Spatial distribution of openness.
Figure 12. Spatial distribution of vehicle density.
Figure 13. Spatial distribution of building density.
Figure 14. Spatial distribution of enclosure degree.
Figure 15. Spatial distribution of the comprehensive scores.
Figure 16. Pearson correlation matrix.
Table 1. The service point POIs of each district.

| District | Accommodation | Dining | Shopping | Healthcare |
|---|---|---|---|---|
| Banan | 751 | 1955 | 2782 | 1125 |
| Beibei | 928 | 2463 | 2854 | 1308 |
| Dadukou | 259 | 2082 | 2367 | 887 |
| Jiangbei | 1168 | 2475 | 2637 | 1514 |
| Nan’an | 1762 | 3978 | 3945 | 2138 |
| Shapingba | 2115 | 5021 | 5752 | 2587 |
| Yubei | 1899 | 2959 | 3136 | 1917 |
| Yuzhong | 3137 | 3731 | 4708 | 1546 |
| Jiulongpo | 1020 | 2802 | 3733 | 1827 |
Table 2. Indicator selection.

| Evaluating Indicator | Evaluation Description | Computing Formula |
|---|---|---|
| Urban green view rate ($c_1$) | The proportion of greenery (such as trees, lawns, and green belts) in the field of view of pedestrians on urban streets. | $S_{green} / S_{total}$ |
| Openness of the sky ($c_2$) | The proportion of the sky in the field of view of pedestrians at a specific location or area. | $S_{sky} / S_{total}$ |
| Vehicle density ($r_1$) | The spatial proportion occupied by vehicles within a specific area. | $S_{car} / S_{total}$ |
| Building density ($r_2$) | The number and density of buildings within a specific area. | $S_{building} / S_{total}$ |
| Enclosure degree of interface ($r_3$) | The degree to which the streets are enclosed by buildings, walls, fences, and trees on both sides. | $(S_{building} + S_{wall} + S_{fence} + S_{trees}) / S_{total}$ |
| Density of medical facilities ($t_1$) | The number and distribution density of medical facilities in a specific area. | — |
| Shopping facility density ($t_2$) | The number and distribution density of shopping facilities in a specific area. | — |
| Catering service facility density ($t_3$) | The number and distribution density of catering facilities in a specific area. | — |
| Accommodation service facility density ($t_4$) | The number and distribution density of accommodation service facilities in a specific area. | — |

Note: $S_{total}$ is the total area of the street view image and $S_{green}$ is the area of vegetation elements in the image; the remaining elements follow the same naming convention.
Table 3. Calculation results for each weight.

| Feature Dimension | Parameter | Weight |
|---|---|---|
| Environment | Urban green view rate ($c_1$) | 13.88% ($w_1$) |
| Environment | Openness of the sky ($c_2$) | 13.25% ($w_2$) |
| Safety | Vehicle density ($r_1$) | 13.69% ($w_3$) |
| Safety | Building density ($r_2$) | 14.07% ($w_4$) |
| Safety | Enclosure degree of interface ($r_3$) | 12.78% ($w_5$) |
| Facility Convenience | Density of medical facilities ($t_1$) | 8.07% ($w_6$) |
| Facility Convenience | Shopping facility density ($t_2$) | 8.17% ($w_7$) |
| Facility Convenience | Catering service facility density ($t_3$) | 8.03% ($w_8$) |
| Facility Convenience | Accommodation service facility density ($t_4$) | 8.06% ($w_9$) |
Table 4. Performance comparison of different network modules.

| Model | mIoU (%) | mPA (%) | Accuracy (%) |
|---|---|---|---|
| UNet | 56.65 | 66.05 | 88.72 |
| UNet + VGG16 | 62.40 | 71.80 | 90.94 |
| UNet + SimSPPF | 46.84 | 56.08 | 84.83 |
| UNet + MLCA | 53.96 | 63.49 | 87.94 |
| UNet + VGG16 + SimSPPF | 62.40 | 72.39 | 91.03 |
| SP-UNet | 62.48 | 72.57 | 91.09 |
Table 5. Performance comparison of different algorithms.

| Model | mIoU (%) | mPA (%) | Accuracy (%) | F1 (%) |
|---|---|---|---|---|
| UNet [29] | 56.65 | 66.05 | 88.72 | 66.37 |
| DeeplabV3+ [36] | 53.98 | 63.80 | 87.13 | 64.13 |
| HRNet [37] | 56.45 | 67.32 | 87.76 | 67.75 |
| PSPNet [38] | 52.01 | 63.30 | 86.26 | 64.90 |
| LinkNet [39] | 53.66 | 63.31 | 88.72 | 65.11 |
| ESPNet [40] | 45.24 | 55.02 | 84.55 | 55.83 |
| SP-UNet | 62.48 | 72.57 | 91.09 | 73.92 |
Table 6. Calculation results of the proportion of street view elements.

| District | GVI | Openness | Vehicle Density | Building Density | Enclosure Degree |
|---|---|---|---|---|---|
| Yuzhong | 0.1389 | 0.0824 | 0.0471 | 0.2202 | 0.4324 |
| Shapingba | 0.1798 | 0.1449 | 0.0312 | 0.1211 | 0.3620 |
| Jiangbei | 0.1715 | 0.0739 | 0.0392 | 0.1973 | 0.4119 |
| Dadukou | 0.2574 | 0.1041 | 0.0367 | 0.0885 | 0.3667 |
| Jiulongpo | 0.1886 | 0.0752 | 0.0499 | 0.1928 | 0.4078 |
| Nan’an | 0.1807 | 0.1457 | 0.0267 | 0.1103 | 0.3324 |
| Beibei | 0.3153 | 0.1368 | 0.0231 | 0.0567 | 0.3856 |
| Yubei | 0.1967 | 0.1847 | 0.0159 | 0.0579 | 0.2782 |
| Banan | 0.1409 | 0.1663 | 0.0204 | 0.1337 | 0.3014 |
Table 7. Multi-dimensional characteristic results.

| Sample | Environmental Dimension | Safety Dimension | Facility Convenience Dimension | Score |
|---|---|---|---|---|
| Banan | 4.16 | 6.01 | 535.59 | 0.3540 |
| Beibei | 6.19 | 6.04 | 611.30 | 0.4798 |
| Dadukou | 4.95 | 6.43 | 453.02 | 0.2929 |
| Jiangbei | 3.12 | 8.15 | 630.51 | 0.3470 |
| Nan’an | 4.43 | 6.13 | 956.29 | 0.4975 |
| Shapingba | 4.32 | 6.55 | 1252.36 | 0.6030 |
| Yubei | 5.18 | 4.59 | 801.58 | 0.4566 |
| Yuzhong | 2.99 | 9.24 | 1061.85 | 0.5093 |
| Jiulongpo | 3.58 | 8.01 | 759.36 | 0.3951 |
