Containment Control-Guided Boundary Information for Semantic Segmentation

Liu, Wenbo; Zhang, Junfeng; Zhao, Chunyu; Huang, Yi; Deng, Tao; Yan, Fei

doi:10.3390/app14167291

by Wenbo Liu, Junfeng Zhang, Chunyu Zhao, Yi Huang, Tao Deng and Fei Yan *

School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(16), 7291; https://doi.org/10.3390/app14167291

Submission received: 9 July 2024 / Revised: 12 August 2024 / Accepted: 13 August 2024 / Published: 19 August 2024

(This article belongs to the Special Issue Digital Image Processing: Novel Technologies and Applications)

Abstract:

Real-time semantic segmentation is a challenging task in computer vision, especially in complex scenes. In this study, a novel three-branch semantic segmentation model is designed, aiming to effectively use boundary information to improve the accuracy of semantic segmentation. The proposed model introduces the concept of containment control in a pioneering way, which treats image interior elements as well as image boundary elements as followers and leaders in containment control, respectively. Based on this, we utilize two learnable feature fusion matrices in the high-level semantic information stage of the model to quantify the fusion process of internal and boundary features. Further, we design a dedicated loss function to update the parameters of the feature fusion matrices based on the criterion of containment control, which enables fine-grained communication between target features. In addition, our model incorporates a Feature Enhancement Unit (FEU) to tackle the challenge of maximizing the utility of multi-scale features essential for semantic segmentation tasks through the meticulous reconstruction of these features. The proposed model proves effective on the publicly available Cityscapes and CamVid datasets, achieving a trade-off between effectiveness and speed.

Keywords:

semantic segmentation; containment control; features fusion

1. Introduction

Semantic segmentation is pivotal in computer vision, facilitating comprehensive scene analysis. Among the inherent challenges faced in semantic segmentation, the precise delineation of boundaries between different categories stands out. These boundaries often imply transition regions between objects or elements with different semantics. Accurately capturing the spatial relationships and transitions between different categories enables a more nuanced and context-aware segmentation, which in turn contributes to improved scene understanding. Thus, in the field of semantic segmentation, the quality of models is intrinsically linked to their ability to recognize complex boundaries [1,2,3,4]. A model’s proficiency in accurately delineating boundaries directly affects its ability to interpret complex scenes.

Traditional methods may have difficulty in capturing subtle transitions between categories, leading to less accurate segmentation [5,6]. Effective utilization of these transition cues can significantly improve segmentation accuracy, thereby improving downstream applications such as scene understanding. Therefore, our work aims to address the limitations of traditional semantic segmentation methods by introducing an innovative approach that not only effectively utilizes boundary information but also establishes structured constraints between boundaries and internal features.

Containment control is a prevalent issue in the study of multi-agent systems (MASs), aiming to drive a subset of agents, often referred to as followers, to ultimately reside within a convex hull defined by certain specific agents, commonly known as leaders [7,8]. The development and implementation of containment control strategies integrate control theory and graph theory to address issues such as information diversion, cooperation, and self-organization in MASs [9,10,11]. However, the application of containment control should not be limited to the field of control. By transferring the relationship between leaders and followers to semantic segmentation, specifically where leaders represent boundary features and followers represent internal representations, the precise modeling of the interrelationship between boundary features and internal representations can be achieved, guiding more accurate semantic segmentation in images.

Our work not only conceptually incorporates containment control but also applies the classical theory of containment control to practically determine the spatial relationship between boundary feature information and internal feature information. This approach helps to quantify the dynamic exchange between boundary features and internal features, which leads to a nuanced control of the model utilizing boundary feature information. This pioneering integration of containment control principles into semantic segmentation models is expected to become a new paradigm in the field of visual scene understanding. Meanwhile, our network achieves high accuracy and real-time requirements as shown in Figure 1.

In conclusion, our work brings forth significant contributions, including the following:

Boundary Feature Integration: We recognize the importance of boundary feature information in semantic segmentation tasks and dig deep into the way it is used. The dynamic fusion process of boundary features and internal representations is quantified by building a features fusion matrix based on the Laplacian matrix, which effectively improves the model’s ability to capture and utilize boundary information.
Containment Control-Guided Segmentation: We utilize the containment control criterion to construct a dedicated loss function for updating the features fusion matrix, which achieves finer control over the information fusion between boundary features and internal representations. Through the gradient descent algorithm, on the one hand, the internal representations and the external features are made to present a contained spatial relationship, and on the other hand, a suitable features fusion matrix can be established to guide the feature fusion.
Multi-scale Feature Enhancement: To improve the quality of features, we have designed and introduced the Feature Enhancement Unit. This innovative module is specifically crafted to tackle the challenge of maximizing the utility of multi-scale features essential for semantic segmentation tasks through the meticulous reconstruction of these features.
Performance Evaluation: We conduct extensive experiments and performance evaluations to validate the effectiveness of the proposed approach, and we compare the results with existing segmentation methods to assess how well the model handles complex scene structures.

2. Related Works

2.1. Semantic Segmentation with Multi-Scale Features

Since the inception of research in this field, numerous studies have concentrated on the influence of multi-scale feature information in semantic segmentation tasks. Notable examples include architectures such as PSPNet [2], the DeepLab series [4,12,13], and ICNet [3]. PP-LiteSeg [14] is a classical real-time semantic segmentation model that optimizes the traditional encoder–decoder structure. It pays much attention to the processing of multi-scale features and further proposes the Unified Attention Fusion Module and Simple Pyramid Pooling Module, both related to multi-scale features. Addressing the critical need for a balance between precision and speed, DDRNets [15] incorporate the innovative Deep Aggregation Pyramid Pooling Module (DAPPM). This module strategically enhances the effective receptive fields and fuses multi-scale context from low-resolution feature maps, contributing to the method’s exceptional performance. EMSNet [16] is a novel convolutional network for semantic segmentation that integrates an Enhanced Regional Module and a Multi-scale Convolution Module. This approach improves segmentation performance by effectively leveraging multi-scale context and pixel-to-pixel similarity in the channel direction, achieving notable results on benchmark datasets. Yan et al. [17] propose a novel multi-scale learner, Varying Window Attention, to address issues of scale inadequacy and field inactivation in semantic segmentation. Similarly, our study emphasizes the fine-grained treatment of multi-scale features. We build upon the advancements made in backbone fusion modules and introduce a novel Feature Enhancement Unit (FEU) aimed at enhancing the quality of multi-scale features. This module contributes to an improved and nuanced handling of information across different scales.

2.2. Semantic Segmentation of Geometric Relations

Several network architectures propose leveraging boundary information, achieving high accuracy while also considering inference speed. These can be summarized as follows. The authors of BNFNet [1] propose a model that integrates a fully convolutional network with boundary cues, aiming to enhance the object localization capability by leveraging boundary information. Graph-Segmenter [18] is a transformer-based semantic segmentation approach that enhances relation modeling between image regions and pixels using a graph transformer and boundary-aware attention module. AMKBANet [19] proposes an attention-based multi-kernelized and boundary-aware network that performs well in semantic segmentation tasks. CBLNet [20] provides a plug-and-play conditional boundary loss approach, which effectively improves the performance of boundary segmentation. BANet [21] adopts an encoder–decoder architecture and additionally incorporates boundary information to accurately delineate the shape and boundary of objects. AGLNet [22] introduces a global attention upsampling module to capture object shapes and boundaries. The authors of EdgeNet [23] propose a category-aware edge loss module that improves inference accuracy without compromising model inference speed. PIDNet [24] innovatively incorporates PID control methods into semantic segmentation research. The authors map proportional, integral, and derivative controllers to detail feature branches, context information branches, and boundary attention branches, achieving exceptional inference accuracy while maintaining high inference speed.

3. Method

3.1. How Containment Control Guides Semantic Segmentation

In this subsection, we introduce containment control and apply its ideas to semantic segmentation. Specifically, we regard image boundary features and internal features as the “leaders” and “followers” in containment control, respectively; see Figure 2. We use the containment control criterion to construct a dedicated loss function, the boundary-internal containment loss (BIC-loss). BIC-loss quantifies the connection between the boundary information and the internal information of the image and guides the updating of the weights of the features fusion matrix, which is introduced below. We begin with containment control.

A multi-agent system (MAS) normally consists of several leaders and followers. Containment control is a prevalent issue in the study of multi-agent systems, in which controllers are designed for the followers so that they ultimately reside within the convex hull formed by the leaders, as shown in Figure 3. The topology of interactions among agents in a MAS can be represented using a graph $G = (V, E)$ with a set of N nodes $V = \{v_{1}, v_{2}, \dots, v_{N}\}$ and a set of edges $E \subseteq V \times V$. The nodes $v_{i}$, $\forall i \in \{1, 2, \dots, N\}$, represent the agents in the MAS, while the edges represent the communication strength between the agents. The Laplacian matrix is defined as follows:

$$L = [l_{ij}], \quad l_{ij} = \begin{cases} -a_{ij}, & i \neq j, \\ \sum_{j = 1}^{N} a_{ij}, & i = j, \end{cases} \tag{1}$$

where $L \in R^{N \times N}$ is the Laplacian matrix of $G$, and $a_{ij}$ signifies the weights of the edges. Nodes j and i are adjacent if node i can directly receive information from node j, i.e., $(v_{j}, v_{i}) \in E$; in this case, $a_{ij} > 0$. Otherwise, $a_{ij} = 0$ means that there is no direct connection between i and j, i.e., $(v_{j}, v_{i}) \notin E$. Moreover, $a_{ii} = 0$, $\forall i \in \{1, 2, \dots, N\}$, implies the absence of self-loops. The Laplacian matrix $L$ reflects the internal communication situation within the system, which inspires us to utilize it as a bridge for the fine fusion between image internal features and boundary features in semantic segmentation.
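As a minimal sketch of Equation (1), the snippet below builds a Laplacian matrix from a weighted adjacency matrix with NumPy. The function name `laplacian` and the three-agent example graph are hypothetical and chosen only for illustration.

```python
import numpy as np


def laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Build L from Equation (1): l_ij = -a_ij for i != j, l_ii = sum_j a_ij."""
    a = np.asarray(adjacency, dtype=float)
    if a.ndim != 2 or a.shape[0] != a.shape[1]:
        raise ValueError("adjacency must be a square matrix")
    if np.any(np.diag(a) != 0):
        raise ValueError("self-loops are not allowed (a_ii must be 0)")
    # Diagonal holds the row sums of a; off-diagonal entries are -a_ij.
    return np.diag(a.sum(axis=1)) - a


# Hypothetical example: a_ij > 0 means agent i receives information from agent j.
A_EXAMPLE = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],  # this agent receives nothing, as a leader would
])
L_EXAMPLE = laplacian(A_EXAMPLE)
```

Every row of the result sums to zero by construction, and an agent that receives no information contributes an all-zero row.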

In the containment control of the MAS containing N nodes, let $G$ be a directed communication topology with $M$ ($M < N$) followers and $(N - M)$ leaders. Note that the leaders have no neighbors. Thus, the Laplacian matrix $L$ associated with $G$ can be partitioned as

$$L = \begin{bmatrix} L_{1} & L_{2} \\ 0_{(N - M) \times M} & 0_{(N - M) \times (N - M)} \end{bmatrix}, \tag{2}$$

in which $L_{1} \in R^{M \times M}$ and $L_{2} \in R^{M \times (N - M)}$. $L_{1}$ embodies the interaction among followers, while $L_{2}$ embodies the interaction between followers and leaders. As shown in [25], every element of $-L_{1}^{-1} L_{2}$ is non-negative, and each row of $-L_{1}^{-1} L_{2}$ sums to one. This is the key to the containment control criterion, and it is with the help of this criterion that we construct the link between the image boundary information and the internal information.
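The property cited from [25] can be checked numerically on a small case. The sketch below partitions a Laplacian as in Equation (2) and computes $-L_{1}^{-1} L_{2}$. The four-agent graph (two followers, two leaders) and the helper name `containment_weights` are assumptions made for illustration.

```python
import numpy as np


def containment_weights(laplacian: np.ndarray, num_followers: int) -> np.ndarray:
    """Return -L1^{-1} L2 for a Laplacian whose rows are ordered [followers, leaders]."""
    m = num_followers
    l1 = laplacian[:m, :m]   # follower-follower block
    l2 = laplacian[:m, m:]   # follower-leader block
    # Solve L1 X = L2 instead of forming the inverse explicitly.
    return -np.linalg.solve(l1, l2)


# Hypothetical graph: agents 0 and 1 are followers, agents 2 and 3 are leaders.
ADJ = np.array([
    [0.0, 1.0, 1.0, 0.0],  # follower 0 hears follower 1 and leader 2
    [1.0, 0.0, 0.0, 2.0],  # follower 1 hears follower 0 and leader 3
    [0.0, 0.0, 0.0, 0.0],  # leaders have no neighbors
    [0.0, 0.0, 0.0, 0.0],
])
LAP = np.diag(ADJ.sum(axis=1)) - ADJ
WEIGHTS = containment_weights(LAP, num_followers=2)
```

For this graph the result is [[0.6, 0.4], [0.2, 0.8]]: each row is a set of convex-combination weights over the leaders, which is why the followers end up inside the leaders' convex hull.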

We now briefly introduce the containment control model to derive the containment control criterion. Consider the following global linear MAS:

$$\begin{aligned} \dot{x}_{f}(t) &= (I_{M} \otimes A)\, x_{f}(t) + (I_{M} \otimes B)\, u(t), \\ \dot{x}_{l}(t) &= (I_{N - M} \otimes A)\, x_{l}(t), \end{aligned} \tag{3}$$

in which $x_{f}(t) \in R^{nM}$ is the global state of followers, $x_{l}(t) \in R^{n(N - M)}$ is the global state of leaders, and $u(t) \in R^{M}$ is the global control input. ⊗ denotes the Kronecker product, and $I_{M}$ denotes an M-dimensional identity matrix. Based on the conclusion provided in the preceding paragraph,

$$x_{f} + (L_{1}^{-1} L_{2} \otimes I_{n})\, x_{l} \to 0 \tag{4}$$

means that all followers in the MAS will ultimately reside within the convex hull formed by the leaders, thereby indicating the successful resolution of the containment control problem. Equation (4) is the containment control criterion.
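To make the criterion in Equation (4) concrete, the following sketch simulates a simplified special case. It assumes scalar single-integrator agents (A = 0, B = 1, n = 1), static leaders, and the consensus-style controller u = -(L1 x_f + L2 x_l). None of these choices is specified in the text; they are illustrative assumptions, and the function name `simulate_followers` is hypothetical.

```python
import numpy as np


def simulate_followers(l1, l2, x_f0, x_l, dt=0.01, steps=5000):
    """Forward-Euler integration of x_f' = -(L1 x_f + L2 x_l) with static leaders."""
    x_f = np.array(x_f0, dtype=float)
    for _ in range(steps):
        u = -(l1 @ x_f + l2 @ x_l)  # assumed controller, not taken from the paper
        x_f = x_f + dt * u
    return x_f


# Same hypothetical two-follower, two-leader graph blocks as in Equation (2).
L1 = np.array([[2.0, -1.0], [-1.0, 3.0]])
L2 = np.array([[-1.0, 0.0], [0.0, -2.0]])
X_LEADERS = np.array([0.0, 10.0])

X_FINAL = simulate_followers(L1, L2, x_f0=[50.0, -30.0], x_l=X_LEADERS)

# Left-hand side of Equation (4) with n = 1: x_f + L1^{-1} L2 x_l.
RESIDUAL = X_FINAL + np.linalg.solve(L1, L2) @ X_LEADERS
```

Although the followers start outside the interval spanned by the leaders, they settle at 4 and 8, inside [0, 10], and the residual of Equation (4) vanishes.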

In semantic segmentation, conventional feature fusion methods often struggle to fully exploit boundary information, which motivates us to use containment control to model the interplay between boundary and internal information. Our method takes containment control as a guiding principle and integrates it into the training process of the segmentation model. We establish a mapping between the containment control criterion in Equation (4) and the features within the model. In our segmentation task, a dedicated network branch extracts boundary information, which plays the role of the leaders and corresponds to $x_{l}$ in Equation (4). The other branches of the network extract the internal information used for the segmentation results, which plays the role of the followers and corresponds to $x_{f}$ in Equation (4). Drawing inspiration from neural architecture search principles, we introduce a set of learnable feature fusion matrices $\tilde{L}$ (i.e., $L_{1}^{-1} L_{2}$) to govern the communication between internal and boundary feature information within the model. Using the weights given by the features fusion matrix, each channel of the internal feature map is combined with a weighted sum of all boundary feature channels. Based on this approach, we introduce a corresponding loss function. Through linear neural mapping, the different features extracted by our network are collapsed into one dimension. Then, the internal information ($I$) and boundary information ($B$) are stacked as column vectors. Using the containment control criterion (4) and the features fusion matrix $\tilde{L}$, the loss function is formulated as follows:

$$Loss_{cc} = \frac{1}{C} \sum_{i = 1}^{C} {(I + \tilde{L} B)}_{i}, \tag{5}$$

in which C denotes the number of feature channels, and ${(I + \tilde{L} B)}_{i}$ denotes the i-th element of the vector $(I + \tilde{L} B)$. Here, we use the term “internal information” as a comprehensive representation of detailed features and contextual features, which are described in more detail below. Computing the loss on the condensed information reduces the computational complexity while preserving feature information.
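A minimal NumPy sketch of Equation (5) follows. It makes two assumptions: spatial averaging stands in for the feature compression step, and the learned linear mapping is omitted. The names `compress` and `loss_cc` are hypothetical. The loss is implemented exactly as printed, i.e., as a signed mean of the residual vector; the text does not say whether a norm or absolute value is applied in practice.

```python
import numpy as np


def compress(features: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature map to a length-C vector (assumed: spatial mean)."""
    return features.mean(axis=(1, 2))


def loss_cc(internal_vec: np.ndarray, boundary_vec: np.ndarray, fusion: np.ndarray) -> float:
    """Equation (5): mean over channels of the elements of I + L~ B."""
    residual = internal_vec + fusion @ boundary_vec
    return float(residual.mean())


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
boundary = rng.normal(size=(C, H, W))

# L~ stands for L1^{-1} L2, so it is -L~ that has non-negative rows summing to one.
fusion = -np.full((C, C), 1.0 / C)

b_vec = compress(boundary)
# Place the internal vector exactly at the convex combination of boundary values.
i_vec = -fusion @ b_vec
LOSS_AT_CONTAINMENT = loss_cc(i_vec, b_vec, fusion)
```

When the internal vector equals the convex combination of boundary values selected by $-\tilde{L}$, the residual in Equation (4) is zero and so is the loss.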

When we regard image boundary features and internal features as the “leaders” and “followers” in containment control, $L_{1}$ in Equation (2) reflects the connection between the internal information of the image, while $L_{2}$ reflects the connection between the image boundary information and the internal information. Since the internal information of the image has already been interpreted by the detail feature and contextual feature branches, the spatial constraints within the internal information are not taken into account when the relationship between the image boundary and the internal information is parsed with the containment control criterion. Thus, $L_{1}$ is set to an identity matrix, which avoids having the model interpret the internal features of the image again and improves the efficiency of model inference. In addition, since BIC-loss is calculated using one-dimensional information after linear neural mapping, $I_{n}$ in Equation (4) is actually equal to one.
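Under these two simplifications, the criterion reduces as sketched below. This is a short worked step rather than a derivation from the source. On the right-hand side of the second line, $I$ denotes the stacked internal information vector from Equation (5), not an identity matrix.

```latex
\begin{aligned}
L_{1} = I_{M},\; n = 1
\;&\Longrightarrow\; L_{1}^{-1} L_{2} \otimes I_{n} = L_{2} = \tilde{L}, \\
x_{f} + (L_{1}^{-1} L_{2} \otimes I_{n})\, x_{l} \to 0
\;&\Longrightarrow\; I + \tilde{L} B \to 0 .
\end{aligned}
```

In other words, the learnable matrix $\tilde{L}$ takes over the role of the follower-leader block $L_{2}$, and the residual that the loss in Equation (5) averages is the left-hand side of Equation (4).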

We visualize the construction process of the loss function in Figure 3. Stage 1 represents feature compression. Stage 2 is the linear neural mapping used to extract more representational features. Stage 3 is the computation process for $Loss_{cc}$. Stage 4 indicates the updating basis of the features fusion matrix $\tilde{L}$.

3.2. CcNet

In this work, we propose the Containment Control Network (CcNet) to improve the accuracy of complex semantic segmentation tasks by seamlessly integrating the containment control into the model as shown in Figure 4. CcNet adopts PIDNet as the backbone and is composed of three branches: the detailed branch (D), the contextual branch (C), and the boundary branch (B). The design of CcNet places particular emphasis on the relationships between the boundary features and internal features, leveraging containment control mechanisms. This strategic approach facilitates the improved integration of boundary information from the B branch into the internal features of the D and C branches.

To achieve a fine-grained fusion of internal and boundary feature information, we introduce a Feature-Weighted Fusion (FWF) mechanism. This mechanism ensures that each channel of the internal feature incorporates information from the boundary feature, represented as the weighted sum of all channels of the boundary feature. The features fusion matrix $\tilde{L}$ quantifies the compression applied to the boundary feature, tailoring it to provide customized boundary features for the internal features. ${\tilde{L}}_{1}$ guides the fusion between the D branch and the B branch, combining detailed feature information with boundary feature information. Similarly, ${\tilde{L}}_{2}$ guides the fusion of the C branch and the B branch, integrating contextual feature information with boundary feature information. Both feature fusion matrices have corresponding $Loss_{cc}$ terms dedicated to their updates. These loss terms help ensure the effective utilization of boundary feature information by regulating the fusion process in their respective branches.

In the FWF mechanism, we first weight the boundary feature channels, i.e.,

$$B_{i}(j) = \tilde{L}(i, j) \times B(j), \tag{6}$$

in which $B(j)$ denotes the $j^{th}$ channel feature map of the boundary feature, and $\tilde{L}(i, j)$ denotes the value of the ${(i, j)}^{th}$ element of the feature fusion matrix. Then, the weighted boundary feature channels are compressed and downsized, and summed with the internal feature channel to obtain the output, i.e.,

$$out(i) = \sum_{j = 1}^{C} B_{i}(j) + I(i), \tag{7}$$

in which C denotes the number of feature channels. The sum over channels $\sum_{j = 1}^{C} B_{i}(j)$ is akin to a 1 × 1 convolution, which is essentially a weighted sum across the channels. Therefore, the FWF mechanism, with its complexity of $C \times H \times W \times C$, is an efficient operation, especially given the functionality it provides in processing feature maps. This efficiency makes it well suited for scenarios where both computational cost and performance are critical.
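Equations (6) and (7) can be sketched in NumPy as follows. The first function follows the two equations literally; the second expresses the same computation as a single channel-mixing contraction, which is the 1 × 1 convolution view. Function names and tensor sizes are illustrative assumptions, and both feature maps are assumed to share the shape (C, H, W).

```python
import numpy as np


def feature_weighted_fusion(internal, boundary, fusion):
    """out(i) = sum_j fusion[i, j] * boundary[j] + internal[i], per Equations (6)-(7)."""
    # B_i(j): every boundary channel j weighted for every output channel i -> (C, C, H, W).
    weighted = fusion[:, :, None, None] * boundary[None, :, :, :]
    # Sum over the boundary-channel axis j, then add the internal feature.
    return weighted.sum(axis=1) + internal


def feature_weighted_fusion_1x1(internal, boundary, fusion):
    """Same result written as a 1 x 1 convolution (channel-mixing contraction)."""
    return np.einsum("ij,jhw->ihw", fusion, boundary) + internal


rng = np.random.default_rng(0)
C, H, W = 3, 4, 5
internal = rng.normal(size=(C, H, W))
boundary = rng.normal(size=(C, H, W))
fusion = rng.normal(size=(C, C))

OUT_EXPLICIT = feature_weighted_fusion(internal, boundary, fusion)
OUT_CONV = feature_weighted_fusion_1x1(internal, boundary, fusion)
```

Both forms perform C × C multiply-adds at each of the H × W positions, which matches the stated complexity. The contraction form avoids materializing the intermediate (C, C, H, W) tensor.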

3.3. Feature Enhancement Unit

To enhance the quality of multi-scale features in semantic segmentation models, we design a Feature Enhancement Unit (FEU) as shown in Figure 5. This module draws inspiration from classical feature enhancement models [26,27] and adopts a “split-restructure” paradigm.

In the splitting phase, each channel of the input feature map is associated with a set of learnable parameters α in the Group Normalization (GN) layer. On the one hand, SoftMax operations on α produce the parameter β, from which two different masks $M_{1}$ and $M_{2}$ are derived according to a predetermined threshold. These masks help to split the spatial locations of representation information and redundant feature information. On the other hand, the feature maps that have gone through the GN layer are multiplied with the parameter β and activated by a Sigmoid function to produce new weights for reweighting. Subsequently, the new weighting parameters together with the two masks generate two separated feature maps, i.e., crucial representation information and redundant feature information.

In the merge phase, the feature information considered important by the model is mined more deeply so that more computation is spent on the important features. To achieve this, we employ group-wise and point-wise processing for crucial representations, while adopting only point-wise processing for redundant features. Next, a dedicated compression module is designed to compress each channel within the two feature information streams. This module optimizes the spatial structure of features, taking into account the spatial relationships between adjacent channels when generating attention weights. These weights undergo competition at corresponding locations before being transmitted back to their respective feature information streams for weighted processing. The final step in this process involves directly merging the two feature information streams, preventing the loss of important information.
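The splitting phase leaves several details open, so the sketch below is only one possible reading of it. It assumes that α is the per-channel GN scale, that the masks are per-channel and obtained by thresholding β, and that the two streams are the reweighted input multiplied by each mask. All function names, the group count, and the threshold value are hypothetical.

```python
import numpy as np


def group_norm(x, alpha, num_groups, eps=1e-5):
    """Group Normalization over a (C, H, W) tensor with per-channel scale alpha."""
    c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    grouped = x.reshape(num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(1, 2, 3), keepdims=True)
    var = grouped.var(axis=(1, 2, 3), keepdims=True)
    normalized = ((grouped - mean) / np.sqrt(var + eps)).reshape(c, h, w)
    return alpha[:, None, None] * normalized


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def split_features(x, alpha, num_groups, threshold):
    """Return (crucial, redundant) streams under the assumptions stated above."""
    beta = softmax(alpha)                                      # per-channel importance
    m1 = (beta >= threshold).astype(x.dtype)[:, None, None]    # mask M1 (assumed per-channel)
    m2 = 1.0 - m1                                              # mask M2, complement of M1
    weights = sigmoid(group_norm(x, alpha, num_groups) * beta[:, None, None])
    reweighted = weights * x
    return m1 * reweighted, m2 * reweighted


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6))
alpha = np.array([2.0, 0.0, 0.0, -1.0])  # hypothetical learned GN scales
CRUCIAL, REDUNDANT = split_features(x, alpha, num_groups=2, threshold=0.25)
```

With these example scales, only the first channel has β above the threshold, so it alone lands in the crucial stream and the remaining channels form the redundant stream.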

4. Experiment

4.1. Dataset

Cityscapes

Cityscapes [28] comprises 5000 finely labeled images captured in 50 different cities. The images are partitioned into sets of 2975, 500, and 1525 for training, validation, and testing, respectively. Since the annotations of the Cityscapes test set are not publicly available, we used the validation set for evaluation. With an image resolution of 2048 × 1024, the dataset poses a considerable challenge to the model.

CamVid

Cambridge-driving Labeled Video Database (CamVid) [29] comprises 701 driving scenes, meticulously divided into sets of 367, 101, and 233 images for training, validation, and testing, respectively. The images have a resolution of 960 × 720, and annotations span across 32 categories. Notably, 11 of these categories have been deliberately chosen to ensure a fair and meaningful comparison with prior studies.

4.2. Quantitative Analysis

Cityscapes

Our quantitative analysis evaluates the performance of CcNet against state-of-the-art semantic segmentation models on the Cityscapes dataset. Table 1 provides an overview of the performance of various models, including CcNet, on different metrics. CcNet exhibits very competitive performance, with mIOUs of 78.5% and 81.2% for the small variant (CcNet-S) and the large variant (CcNet-L), respectively. In comparison with AMKBANet-T, CcNet-L attains superior performance while featuring a smaller parameter size. These results demonstrate the efficacy of the proposed models in achieving high segmentation accuracy. With 83.4 FPS for CcNet-S and 30.2 FPS for CcNet-L, both variants qualify as real-time semantic segmentation models. CcNet thus maintains competitive performance across model sizes. In conclusion, our quantitative analysis supports the view that containment control can have a positive impact on the results of semantic segmentation tasks.
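The text does not spell out how mIOU is computed. The sketch below assumes the standard definition: per-class intersection over union accumulated in a confusion matrix, then averaged over the classes that occur. The ignore label 255 is an assumption based on common Cityscapes practice, and the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_index  # drop pixels carrying the assumed ignore label
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def mean_iou(conf):
    """Average IoU over classes that appear in the prediction or the ground truth."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())


# Tiny hypothetical label maps with two classes and one ignored pixel.
target = np.array([[0, 0, 255], [1, 1, 1]])
pred = np.array([[0, 1, 0], [1, 1, 1]])
MIOU = mean_iou(confusion_matrix(pred, target, num_classes=2))
```

Here class 0 has IoU 1/2 and class 1 has IoU 3/4, giving an mIOU of 0.625. On a full dataset the confusion matrix would be accumulated over all images before the IoU is taken.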

CamVid

We also compare CcNet with state-of-the-art semantic segmentation models on the CamVid dataset, as shown in Table 2. In terms of mIOU, CcNet is among the top performers, achieving a score of 80.6%. While CcNet shares a similar scale with PIDNet-S, its accuracy does not surpass that of PIDNet-S-Wider; however, CcNet demonstrates faster inference. Additionally, CcNet-S attains 182 frames per second on an RTX 4090. This underscores the suitability of the overall CcNet model for GPU computation, with the inference speed exhibiting a notable dependence on the scale of the input image.

4.3. Qualitative Analysis

Our network places special emphasis on the constraint of boundary information on internal information, resulting in more detailed delineation for classification. The qualitative analysis in Figure 6 highlights the superiority of our approach. In the first two images, our proposed CcNet demonstrates its superior performance by accurately identifying and segmenting small distant objects based on boundary features. For example, in the second image, traffic signs are located to the left of both the car and the second tree. The second traffic signpost on the left blends into the background due to its similar color, resulting in its omission in the PIDNet segmentation output. Although these signs are not annotated in the ground truth, our model effectively detects them, highlighting its robustness in segmenting subtle features. In the third image, multiple overlapping bicycles are closely positioned behind a truck. In this scenario, PIDNet generates a segmentation result that abruptly segments the bicycles, failing to capture their continuity and spatial relationships. This limitation indicates the difficulty of PIDNet in maintaining consistency in scenarios involving overlapping objects. In contrast, CcNet produces higher-quality segmentation results by effectively preserving object continuity and spatial context. The fourth image illustrates the limitations of PIDNet in correctly interpreting the seat portions of bicycles, which are misclassified as pedestrians. This misclassification underscores the challenges of PIDNet in distinguishing between similar objects and fine-grained details within complex scenes, while CcNet demonstrates improved capability in leveraging boundary features for accurate classification. In the fifth image, PIDNet overlooks certain well-defined classes, exposing its shortcomings in boundary delineation and class representation. 
In the sixth image, CcNet significantly enhances semantic information retrieval, particularly for distant fences; its ability to provide detailed semantic information for objects at varying distances reflects its advanced boundary recognition capabilities and overall robustness in semantic segmentation tasks. Overall, the results indicate that CcNet more effectively utilizes boundary information, providing more accurate and detailed semantic segmentation in complex traffic scenarios compared to PIDNet.

Figure 7 demonstrates that after weighting through the FWF module, the boundary information achieves desirable constraint effects on both the detailed branch and the contextual branch. The boundaries between different categories are adequately delineated, further confirming the effectiveness of our approach.

4.4. Ablation Study

We perform ablation experiments on two variants of CcNet, named CcNet-S and CcNet-L.

Features fusion matrix $\tilde{L}$

The features fusion matrix $\tilde{L}$ serves as a quantifying tool for the fusion process between boundary and internal features. In this context, we focus on a comprehensive analysis of the impact of different configurations on the features fusion matrix within the CcNet architecture. To establish targeted communication channels between internal and boundary information, we introduce two sets of matrices. Two distinct losses guide the fusion of the D branch, C branch, and B branch, necessitating mappings from two sets of internal semantic information. Thus, for each variant of CcNet, we investigate the influence of three different combinations of internal feature information on the Laplacian matrix: D branch, C branch, and Head.

Experimental results are shown in Table 3, indicating consistent effects across both variants of CcNet for different combinations of feature information. The most effective approach involves mapping features from the D branch and C branch, utilizing the generated loss to guide the fusion of features between the self-branch and the B branch. This targeted configuration demonstrates optimal efficiency in leveraging the Laplacian matrix for enhanced fusion of internal and boundary features. These findings provide valuable insights into the role of the features fusion matrix in guiding feature fusion processes within CcNet.

FEU

This study assesses the performance of the FEU module with respect to its placement and analyzes its influence on the two variants of CcNet. The results are shown in Table 4. There are two candidate locations for the FEU module: the C branch and the Head. In the C branch, the FEU processes contextual information, while at the Head it processes the final feature information fed to the Head of the network. For both variants, CcNet-S and CcNet-L, introducing the FEU module into the C branch significantly improves mIOU. This enhancement is attributed to the inherent rich contextual information of the C branch, which undergoes further optimization through the DAPPM module. Hence, the demand for optimizing multi-scale features at this juncture is particularly pronounced.

The structure of the FEU is quite complex. To quantify its cost, we conduct comparative experiments based on CcNet-S. Multiple experiments show that CcNet-S gains 5 FPS after the FEU is removed. This overhead is likely due to the additional operations required during the “split-restructure” process, which improve feature quality but increase the computational load. The FEU can therefore be viewed as a mechanism for trading inference speed against accuracy: by including or excluding the FEU, one can adjust the performance of CcNet depending on the specific application requirements.

Additionally, relocating the FEU module from the C branch to the head in the CcNet-S variant raises the mIOU from 78.04 to 78.48. Conversely, for the CcNet-L variant, relocating the FEU module from the C branch to the head results in a sharp mIOU decline to 80.0. This indicates that, as the model size increases, the FEU module processing the results from the FWF module may have a detrimental effect. This phenomenon could be attributed to the high-quality feature maps generated by the FWF module when the model size increases, causing the FEU module to destroy the spatial relationships in the original feature maps. In summary, the FEU module enhances the quality of multi-scale features but exhibits sensitivity to its position within the model.

The loss function $Loss_{cc}$

We validate the effectiveness of the proposed $Loss_{cc}$ by comparing it against a random search (RS) baseline. Starting from the best model weights obtained from training, we perform three rounds of random search on $\tilde{L}$ and evaluate each result on the validation dataset, as shown in Table 5. The three random-search rounds reach mIOU values of 65.62, 70.52, and 69.17, whereas the model updated with $Loss_{cc}$ achieves an mIOU of 81.15 on the validation set, indicating that it finds a suitable feature fusion matrix. This result suggests that $Loss_{cc}$ can effectively guide the updating of the Laplacian matrix to achieve a fine-grained fusion of boundary features and interior features in the model.
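The random-search baseline can be pictured as sampling candidate fusion matrices, scoring each one, and keeping the best. The sketch below is only an illustration of that idea: the sampling range, the matrix shape, and `toy_score` (a stand-in for validation mIOU) are assumptions, since the text does not specify how candidates were drawn.

```python
import random
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Sample entries uniformly from [0, 1); the range is an assumption."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def random_search(score: Callable[[Matrix], float], rows: int, cols: int,
                  rounds: int, seed: int = 0) -> Tuple[Matrix, float]:
    """Draw `rounds` candidates and return the highest-scoring one."""
    rng = random.Random(seed)
    best: Matrix = []
    best_score = float("-inf")
    for _ in range(rounds):
        candidate = random_matrix(rows, cols, rng)
        value = score(candidate)
        if value > best_score:
            best, best_score = candidate, value
    return best, best_score


def toy_score(matrix: Matrix) -> float:
    # Hypothetical stand-in for validation mIOU: peaks when every entry is 0.5.
    return -sum((w - 0.5) ** 2 for row in matrix for w in row)


if __name__ == "__main__":
    weights, value = random_search(toy_score, rows=2, cols=2, rounds=3)
    print(weights, value)
```

With only a few rounds, the best sampled matrix depends heavily on luck, which is consistent with the spread among the three random-search rows in Table 5. A loss-driven update uses gradient information instead of blind sampling.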

5. Conclusions

In this paper, we proposed a novel three-branch semantic segmentation paradigm that leverages boundary information to enhance segmentation accuracy in complex scenes under real-time constraints. The new paradigm introduces containment control, which regards the interior and boundary elements of images as followers and leaders, respectively, in order to integrate detail information, context information, and boundary information. Additionally, a specialized loss function $Loss_{cc}$ and the FEU for multi-scale feature reconstruction further optimize the model’s performance. Our experimental results on the Cityscapes and CamVid datasets demonstrate the model’s effectiveness, achieving a balance between accuracy and speed. These results demonstrate the potential of using boundary information effectively in complex scenes and provide new ideas for the further development of real-time semantic segmentation.

Notably, the features from the boundary branch need to be propagated to the other branches, so the model relies heavily on the quality of edge information. When the quality of the edge information is suboptimal, the Laplacian matrix $L$, which is shaped by $Loss_{cc}$, can cause instability in the model’s performance. Furthermore, the need for a dedicated branch for edge information extraction results in a relatively large model structure. This added complexity makes it harder to improve inference speed.
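The dependence on edge quality can be illustrated with the classical containment iteration from multi-agent control, in which leaders stay fixed and followers repeatedly move toward their neighbours. The sketch below uses scalar states and hand-picked weights; it is a textbook-style toy under those assumptions, not CcNet's learned fusion, and all names and numbers in it are illustrative.

```python
from typing import List


def containment_step(followers: List[float], leaders: List[float],
                     w_ff: List[List[float]], w_fl: List[List[float]],
                     step: float = 0.2) -> List[float]:
    """One synchronous update: leaders stay fixed, followers move toward neighbours."""
    updated = []
    for i, x_i in enumerate(followers):
        pull = sum(w * (x_j - x_i) for w, x_j in zip(w_ff[i], followers))
        pull += sum(w * (x_l - x_i) for w, x_l in zip(w_fl[i], leaders))
        updated.append(x_i + step * pull)
    return updated


def run(followers: List[float], leaders: List[float],
        w_ff: List[List[float]], w_fl: List[List[float]],
        iters: int = 200) -> List[float]:
    """Iterate the update until the followers have effectively settled."""
    for _ in range(iters):
        followers = containment_step(followers, leaders, w_ff, w_fl)
    return followers


# Follower-to-follower and follower-to-leader weights (illustrative values).
W_FF = [[0.0, 1.0], [1.0, 0.0]]
W_FL = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    clean = run([5.0, -3.0], [0.0, 1.0], W_FF, W_FL)
    noisy = run([5.0, -3.0], [0.0, 1.6], W_FF, W_FL)  # second leader perturbed
    print(clean, noisy)
```

With leaders at 0 and 1, the followers settle at 1/3 and 2/3, inside the interval spanned by the leaders. Perturbing one leader to 1.6 shifts both followers, which mirrors how degraded boundary features would propagate into the interior features through the weights.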

Author Contributions

Methodology, W.L. and J.Z.; Project administration, F.Y.; Validation, C.Z.; Visualization, C.Z.; Writing—original draft, W.L. and J.Z.; Writing—review and editing, Y.H. and T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62373311, 62103342, 62106208), the Natural Science Foundation of Sichuan Province (2022NSFSC0892, 2023NSFSC1418) and the China Postdoctoral Science Foundation (2021TQ0272, 2021M702715).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bertasius, G.; Shi, J.; Torresani, L. Semantic Segmentation with Boundary Neural Fields. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
2. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
3. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420.
4. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
5. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation; Springer: New York, NY, USA, 2021.
6. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet For Real-time Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9711–9720.
7. Li, Z.; Ren, W.; Liu, X.; Fu, M. Distributed containment control of multi-agent systems with general linear dynamics in the presence of multiple leaders. Int. J. Robust Nonlinear Control 2013, 23, 534–547.
8. Zhang, J.; Yan, F.; Feng, T.; Deng, T.; Zhao, Y. Fastest containment control of discrete-time multi-agent systems using static linear feedback protocol. Inf. Sci. 2022, 614, 362–373.
9. Wang, X.; Xu, R.; Huang, T.; Kurths, J. Event-Triggered Adaptive Containment Control for Heterogeneous Stochastic Nonlinear Multiagent Systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 8524–8534.
10. Zuo, R.; Li, Y.; Lv, M.; Park, J.H.; Long, J. Event-triggered distributed containment control for networked hypersonic flight vehicles. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 5271–5280.
11. Yan, J.; Peng, S.; Yang, X.; Luo, X.; Guan, X. Containment Control of Autonomous Underwater Vehicles With Stochastic Environment Disturbances. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 5809–5820.
12. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
13. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
14. Peng, J.; Liu, Y.; Tang, S.; Hao, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Yu, Z.; Du, Y.; et al. Pp-liteseg: A superior real-time semantic segmentation model. arXiv 2022, arXiv:2204.02681.
15. Pan, H.; Hong, Y.; Sun, W.; Jia, Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3448–3460.
16. Li, T.; Cui, Z.; Han, Y.; Li, G.; Li, M.; Wei, D. Enhanced multi-scale networks for semantic segmentation. Complex Intell. Syst. 2024, 10, 2557–2568.
17. Yan, H.; Wu, M.; Zhang, C. Multi-Scale Representations by Varying Window Attention for Semantic Segmentation. arXiv 2024, arXiv:2404.16573.
18. Wu, Z.; Gan, Y.; Xu, T.; Wang, F. Graph-Segmenter: Graph transformer with boundary-aware attention for semantic segmentation. Front. Comput. Sci. 2024, 18, 185327.
19. Zhou, X.; Wu, G.; Sun, X.; Hu, P.; Liu, Y. Attention-Based Multi-Kernelized and Boundary-Aware Network for Image semantic segmentation. Neurocomputing 2024, 597, 127988.
20. Wu, D.; Guo, Z.; Li, A.; Yu, C.; Gao, C.; Sang, N. Conditional Boundary Loss for Semantic Segmentation. IEEE Trans. Image Process. 2023, 32, 3717–3731.
21. Zhou, Q.; Qiang, Y.; Mo, Y.; Wu, X.; Latecki, L.J. BANet: Boundary-Assistant Encoder-Decoder Network for Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25259–25270.
22. Zhou, Q.; Wang, Y.; Fan, Y.; Wu, X.; Zhang, S.; Kang, B.; Latecki, L.J. AGLNet: Towards real-time semantic segmentation of self-driving images via attention-guided lightweight network. Appl. Soft Comput. 2020, 96, 106682.
23. Han, H.Y.; Chen, Y.C.; Hsiao, P.Y.; Fu, L.C. Using Channel-Wise Attention for Deep CNN Based Real-Time Semantic Segmentation with Class-Aware Edge Information. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1041–1051.
24. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539.
25. Meng, Z.; Ren, W.; Zheng, Y. Distributed finite-time attitude containment control for multiple rigid bodies. Automatica 2010, 46, 2092–2099.
26. Zhang, Q.; Jiang, Z.; Lu, Q.; Han, J.; Zeng, Z.; Gao, S.; Men, A. Split to be slim: An overlooked redundancy in vanilla convolution. arXiv 2020, arXiv:2006.12085.
27. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
28. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
29. Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97.
30. Kumaar, S.; Lyu, Y.; Nex, F.; Yang, M.Y. CABiNet: Efficient Context Aggregation Network for Low-Latency Semantic Segmentation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13517–13524.
31. Nirkin, Y.; Wolf, L.; Hassner, T. HyperSeg: Patch-Wise Hypernetwork for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4061–4070.
32. Zhou, Y.; Zheng, X.; Yang, Y.; Li, J.; Mu, J.; Irampaye, R. Multi-directional feature refinement network for real-time semantic segmentation in urban street scenes. IET Comput. Vis. 2023, 17, 431–444.
33. Si, H.; Zhang, Z.; Lv, F.; Yu, G.; Lu, F. Real-Time Semantic Segmentation via Multiply Spatial Fusion Network. arXiv 2019, arXiv:1911.07217.
34. Hu, P.; Caba, F.; Wang, O.; Lin, Z.; Sclaroff, S.; Perazzi, F. Temporally Distributed Networks for Fast Video Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
35. Zhang, L.; Jiang, F.; Yang, J.; Kong, B.; Hussain, A. A real-time lane detection network using two-directional separation attention. Comput. Aided Civ. Infrastruct. Eng. 2024, 39, 86–101.

Figure 1. The trade-off between the inference speed and accuracy (reported) of real-time models on the Cityscapes dataset. Red stars refer to our models while yellow points represent others.

Figure 2. The connection between semantic segmentation and containment control.

Figure 3. The leaders and followers communicate based on the Laplacian matrix $L$ to eventually achieve containment. The suitable communication weights of the feature fusion matrix $\tilde{L}$ that guide the feature fusion process are derived through $Loss_{cc}$, which infers the relationship between internal representation and boundary information.

Figure 4. The primary inferential component of the Containment Control Network (CcNet). The Feature Weighting Fusion (FWF) module is used to fuse the boundary and internal features. The yellow arrows indicate the channel of the lightweight version, while the purple arrows represent the channel of the high-precision version. $out1$ and $out2$ represent the outputs after the weighted fusion of different branches.

Figure 5. Illustration of FEU module, consisting of a split stage and restructure stage, with the objective of enhancing the crucial features.

Figure 6. Qualitative analysis on Cityscapes.

Figure 7. Channel feature visualization.

Table 1. Comparison of accuracy and speed on Cityscapes.

Model	GPU	Resolution	mIOU (%)	#FPS	GFLOPs	Params (M)
CABiNet [30]	GTX 2080Ti	2048 × 1024	76.6	76.5	12.0	2.64
AMKBANet-T [19]	A100	2048 × 1024	81.1	-	884	64
BiSeNetV2 [5]	GTX 1080Ti	1536 × 768	74.8	65.5	55.3	49
BiSeNetV2-L [5]	GTX 1080Ti	1024 × 512	75.8	47.3	118.5	-
STDC1-Seg75 [6]	RTX 3090	1536 × 768	74.5	74.8	-	-
STDC2-Seg75 [6]	RTX 3090	1536 × 768	77.0	58.2	-	-
PP-LiteSeg-T2 [14]	RTX 3090	1536 × 768	76.0	96.0	-	-
PP-LiteSeg-B2 [14]	RTX 3090	1536 × 768	78.2	68.2	-	-
HyperSeg-M [31]	RTX 3090	1024 × 512	76.2	59.1	7.5	10.1
HyperSeg-S [31]	RTX 3090	1536 × 768	78.2	45.7	17.0	10.2
DDRNet-23-S [15]	RTX 3090	2048 × 1024	77.8	108.1	36.3	5.7
DDRNet-23 [15]	RTX 3090	2048 × 1024	79.5	51.4	143.1	20.1
MRFNet-S [32]	GTX 1080Ti	512 × 1024	79.3	144.5	-	-
MRFNet-L [32]	GTX 1080Ti	1024 × 1024	79.9	73.63	-	-
PIDNet-S-Simple [24]	RTX 3090	2048 × 1024	78.8	100.8	46.3	7.6
PIDNet-S [24]	RTX 4090	2048 × 1024	78.8	93.2	47.6	7.6
PIDNet-M [24]	RTX 3090	2048 × 1024	80.1	39.8	197.4	34.4
PIDNet-L [24]	RTX 4090	2048 × 1024	80.9	31.1	275.8	36.9
CcNet-S	RTX 4090	2048 × 1024	78.5	83.4	55.0	7.85
CcNet-L	RTX 4090	2048 × 1024	81.2	30.2	290.2	37.4

Table 2. Comparison of accuracy and speed on CamVid.

Model	mIOU (%)	#FPS	GPU
MSFNet [33]	75.4	91.0	GTX 2080Ti
PP-LiteSeg-T [14]	75.0	154.8	GTX 1080Ti
TD2-PSP50 [34]	76.0	11.0	TITAN X
BiSeNetV2 [5]	76.7	124.0	GTX 1080Ti
BiSeNetV2-L [5]	78.5	33.0	GTX 1080Ti
HyperSeg-S [31]	78.4	38.0	GTX 1080Ti
HyperSeg-L [31]	79.1	16.6	GTX 1080Ti
TSA-LNet [35]	79.7	143.0	GTX 2080Ti
DDRNet-23-S [15]	78.6	182.4	RTX 3090
DDRNet-23 [15]	80.6	116.8	RTX 3090
PIDNet-S [24]	80.1	153.7	RTX 3090
PIDNet-S-Wider [24]	82.0	85.6	RTX 3090
CcNet-S	80.6	182.0	RTX 4090

Table 3. Ablation study of FWF.

Model	FWF: D Branch	FWF: C Branch	FWF: Head	mIOU (%)
CcNet-S	-	✔	✔	78.12
CcNet-S	✔	-	✔	78.16
CcNet-S	✔	✔	-	78.48
CcNet-L	-	✔	✔	79.85
CcNet-L	✔	-	✔	80.55
CcNet-L	✔	✔	-	81.15

Table 4. Ablation study of FEU for CcNet.

Model	FEU: C Branch	FEU: Head	mIOU (%)
CcNet-S	-	-	77.78
CcNet-S	✔	-	78.04
CcNet-S	-	✔	78.48
CcNet-L	-	-	80.73
CcNet-L	✔	-	81.15
CcNet-L	-	✔	80.00

Table 5. Ablation study on the training methodology of $\tilde{L}$.

Weights of $\tilde{L}$: Random Search	Weights of $\tilde{L}$: D + C	mIOU (%)
✔	-	65.62
✔	-	70.52
✔	-	69.17
-	✔	81.15

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
