In this section, we conduct experiments with and validate the proposed MM-ED network model. In
Section 4.1, we introduce the dataset used and its annotation details.
Section 4.2 provides a detailed description of our experimental platform and parameter settings. Subsequently, in
Section 4.3, we present the metrics used to evaluate the improved model. In
Section 4.4, we conduct a series of comparative experiments and ablation studies aimed at demonstrating the effectiveness of the improvements in the backbone and attention mechanism on model performance. We compare the proposed method with representative methods in the segmentation field, including U-Net [32], HRNet [33], FCN [34], and DeepLabv3+. In
Section 4.5, we integrate the improved MM-ED network into the ORB-SLAM3 system for the establishment of a 3D point cloud semantic map, generating occupancy grid maps and OctoMaps. This step aims to validate the effectiveness of the mapping method proposed in this paper.
4.2. Hardware and Software Configuration
The entire model training process was implemented by renting a “Tianchi Cloud” shared GPU server. The server’s CPU is an Intel(R) Xeon(R) E5-2686 v4 @ 2.30 GHz with 60 GB of memory (Intel, Luoyang, China). The operating system was Ubuntu 20.04 (64-bit), and Python 3.10, PyTorch 2.0.1, CUDA 11.8, and cuDNN 8 were used for model training. A GPU (NVIDIA RTX A4000, NVIDIA, Santa Clara, CA, USA, 16 GB VRAM) was utilized to accelerate training. For the experiments establishing the lawn working environment, a handheld depth camera and a laptop were used. The model’s frames per second (FPS) value was evaluated using an NVIDIA GeForce RTX 3060 Laptop GPU (6 GB) (NVIDIA, Santa Clara, CA, USA) and a 12th Gen Intel Core i7-12700H CPU (Intel, Luoyang, China).
Table 2 provides detailed parameters for model training, such as the maximum iterations, learning rate, and the number of classes.
4.3. Evaluation Indicators
To objectively assess the semantic segmentation model’s performance on the lawn segmentation dataset and facilitate comparisons with various methods, the following evaluation metrics were adopted: the MIoU (Mean Intersection over Union) and MPA (Mean Pixel Accuracy).
The MIoU is widely used to measure the pixel-level overlap between the model-predicted segmentation results and the ground truth labels. For each class, it computes the ratio of the intersection to the union of the predicted and actual pixel sets, and the MIoU reflects a model’s ability to accurately segment image pixels by averaging these IoUs over all categories. In lawn segmentation tasks, a higher MIoU indicates that the model can accurately capture the boundaries and shapes of objects in the lawn environment. The formula, in its standard per-class form, is as follows (2):
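\mathrm{MIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i + FN_i} \quad (2)

where TP_i, FP_i, and FN_i denote the pixel-level true positive, false positive, and false negative counts for class i; this is the standard per-class form of the metric.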
The CPA (Category Pixel Accuracy) measures how many pixels predicted to be positive by the model are true positives. The calculation formula is as follows (3):
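\mathrm{CPA}_i = \frac{TP_i}{TP_i + FP_i} \quad (3)

where, following the definition above, TP_i and FP_i are the true positive and false positive pixel counts for class i.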
The MPA (Mean Pixel Accuracy) first calculates the proportion of correctly classified pixels for each class, i.e., the CPA, and then averages these values over all classes, as shown in Formula (4):
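\mathrm{MPA} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{CPA}_i = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i} \quad (4)

i.e., the average of the per-class CPA values over all k classes (standard form, consistent with the definitions above).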
The calculation of these metrics is carried out through a pixel-level comparison between the model’s predicted results and the ground truth labels, and they provide robust support for objectively evaluating the performance of semantic segmentation models. Here, the symbol “k” represents the number of segmentation classes, and the symbol “i” indexes the predicted category. The True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts are used, and this series of parameters forms a solid foundation for the in-depth analysis of the performance of semantic segmentation models.
In this study, the backbone network was replaced, and the original model underwent lightweight processing. Therefore, it was necessary to evaluate the model’s size and running speed to verify the effectiveness of the improvements. We used metrics such as Giga Floating-Point Operations (GFLOPs), the number of parameters (Params), and FPS for a comprehensive comparison.
Floating-Point Operations (FLOPs) is a critical metric for measuring the computational complexity of neural network models. It represents the total number of floating-point operations required for one forward pass, involving arithmetic on weights, inputs, and activation values. GFLOPs, denoting billions of floating-point operations, is therefore a key indicator for assessing model complexity. For resource-constrained devices such as mobile or edge devices, lower GFLOPs values are generally preferable to ensure the model runs smoothly on these devices.
Params refers to the number of parameters that the model needs to learn. These parameters include weights and bias terms which are adjusted during the model’s training process to enable it to learn from input data and produce appropriate outputs. This is also an important metric for evaluating the complexity of a neural network model.
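For illustration, the following minimal PyTorch sketch shows one way to obtain Params and GFLOPs for a model; it relies on the third-party thop package and an assumed 3 × 512 × 512 input, which are illustrative choices rather than the exact tools and resolution used in our experiments.

```python
import torch
from thop import profile  # third-party FLOPs/Params counter; assumed to be installed


def complexity_report(model, input_size=(1, 3, 512, 512)):
    """Return (GFLOPs, Params in millions) for one forward pass."""
    model.eval()
    dummy = torch.randn(*input_size)
    # thop reports multiply-accumulate operations (MACs); many papers report
    # this count directly as "FLOPs", so it is returned here in units of 1e9.
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    return macs / 1e9, params / 1e6
```

The parameter count can also be obtained directly as sum(p.numel() for p in model.parameters()).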
The FPS metric is used to evaluate detection speed, indicating the number of images processed per second or the time required to process a single image. A shorter time implies a higher speed. In this paper, we use the FPS metric to objectively evaluate the actual running speed of the proposed model.
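A minimal sketch of such an FPS measurement (timing repeated forward passes after a warm-up phase, with an assumed input size) is shown below.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 512, 512), warmup=20, iters=100):
    """Estimate inference FPS by averaging the time of repeated forward passes."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    dummy = torch.randn(*input_size, device=device)
    for _ in range(warmup):  # warm-up excludes one-off CUDA initialization costs
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```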
By comprehensively comparing these metrics, we can gain a comprehensive understanding of the proposed model’s advantages and disadvantages in terms of its complexity, scale, and practical running speed.
4.4. Experiment and Analysis
4.4.1. Backbone Comparison Experiment
In this study, we conducted experiments on different backbone networks for the encoder, aiming to compare their performance in lawn environment segmentation. Taking DeepLabV3+ as the baseline, we substituted the backbone network and carried out comparative experiments involving major backbones such as MobileNetV2, ResNet18 [37], ResNet50, and ResNet101.
By observing Table 3, we can see that among the various backbone networks, MobileNetV2 achieves satisfactory results owing to its depthwise separable convolutional structure. Not only does this model reduce GFLOPs and parameter count to a level comparable with ResNet18, but its recognition speed (FPS) is also almost twice that of ResNet50. It is noteworthy that although ResNet18 has a relatively small parameter count, its recognition accuracy is lower than that of MobileNetV2. Although the MIoU of MobileNetV2 is only 0.33% higher than that of ResNet18, the FPS value of ResNet18 is 16 higher than that of MobileNetV2, which might lead to the misconception that ResNet18 performs better. However, in practical applications, we need to input the semantic segmentation results into the ORB-SLAM3 system to generate 3D semantic maps, and the system is most suitable for processing speeds of around 30 FPS. Exceeding this speed would place an excessive burden on the system, possibly leading to processing delays. Therefore, when the FPS requirement is met, MobileNetV2, which has the higher MIoU, is the more appropriate choice.
Hence, in scenarios in which simplifying backbone networks, improving recognition speed, and adapting to ORB-SLAM3 systems are objectives, selecting MobileNetV2 as the backbone network is an extremely attractive option. Overall, the results of this experiment indicate that in the task of lawn environment segmentation, MobileNetV2 demonstrates excellent performance and efficiency as the backbone network, providing strong support for achieving model lightweighting and accelerating recognition speed.
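To illustrate how such a backbone substitution can be organized, the sketch below splits the torchvision MobileNetV2 feature extractor into a low-level feature map for the decoder skip connection and a high-level feature map for the ASPP module; the split points and the resulting output strides are illustrative assumptions, not the exact configuration of MM-ED.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class MobileNetV2Encoder(nn.Module):
    """Split MobileNetV2 features into a low-level map (decoder skip connection)
    and a high-level map (input to ASPP) for a DeepLabV3+-style network."""

    def __init__(self, pretrained=True):
        super().__init__()
        features = mobilenet_v2(weights="IMAGENET1K_V1" if pretrained else None).features
        self.low_level = features[:4]     # ~1/4 resolution, 24 channels (illustrative split)
        self.high_level = features[4:-1]  # deeper inverted-residual blocks, 320 channels

    def forward(self, x):
        low = self.low_level(x)
        high = self.high_level(low)
        return low, high


# Example: the two feature maps would be passed to an ASPP head and a decoder.
encoder = MobileNetV2Encoder(pretrained=False)
low, high = encoder(torch.randn(1, 3, 512, 512))
print(low.shape, high.shape)  # e.g., (1, 24, 128, 128) and (1, 320, 16, 16)
```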
4.4.2. Experiment Comparing Attentional Mechanisms
From Table 3, it can be observed that adopting MobileNetV2 as the backbone network undoubtedly reduces the model parameters and GFLOPs while improving the recognition speed. However, both the MIoU and MPA decrease compared to ResNet50, the original backbone network commonly used in DeepLabv3+. To retain MobileNetV2 as the backbone and preserve the lightweight design while improving recognition accuracy, we conducted experiments in which various attention mechanisms were introduced at the same position in the network.
We selected representative attention mechanisms for integration, including Efficient Channel Attention (ECA), a Squeeze-and-Excitation (SE) block [38], a Convolutional Block Attention Module (CBAM) [39], and Coordinate Attention (CA) [40]. These attention mechanisms were embedded at the same location within the backbone network to explore their impact on model performance.
The SE mechanism explicitly models the relationships between channels through its “Squeeze-and-Excitation” block structure, resulting in more effective attention allocation. ECA focuses primarily on the channel dimension, allowing the network to concentrate on crucial features in the image and enhancing its perception of global information. The CBAM considers both spatial and channel attention, capturing key information in the feature map. The CA mechanism encodes precise positional information, aiding in modeling channel relationships and long-range dependencies.
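For reference, a minimal PyTorch sketch of an ECA block following the original ECA-Net design is given below; the adaptive kernel-size rule and the embedding position are illustrative assumptions rather than the exact configuration used in MM-ED.

```python
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a
    1D convolution across channels and a sigmoid gate."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size derived from the channel count (must be odd).
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, H, W) -> channel descriptor (N, C, 1, 1)
        y = self.avg_pool(x)
        # 1D convolution over the channel dimension
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)
```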
The comparative experiments in Table 4 reveal that ECA yields the most significant improvement in the MIoU when embedded in the MobileNetV2-based model, showcasing its remarkable performance. Therefore, we opted for ECA as the attention mechanism for our model, aiming to further enhance its performance. In terms of the MIoU, introducing the ECA attention mechanism yields a 0.52% increase for the model with the lightweight backbone. The MIoU measures the degree of overlap between predicted results and ground truth labels in semantic segmentation, whereas the MPA predominantly focuses on the accuracy of the predicted results. In the comparison provided in Table 4, we place greater emphasis on the MIoU because it considers not only the accuracy of the predicted results but also their consistency with the ground truth labels. Therefore, in the comparison of these attention mechanisms, the MIoU holds more reference value.
4.4.3. Ablation Experiment
This section evaluates modular variables in different neural networks and analyzes the factors influencing neural network performance. We used DeepLabv3+ as the baseline, with ResNet50 as its backbone, and conducted improvement experiments on the following network structures. We first performed single-variable experiments: (1) Baseline + A, in which an ECA attention module is embedded in the baseline method; and (2) Baseline + B, in which MobileNetV2 replaces the baseline method’s backbone. (3) Baseline + A + B: both modifications are applied together to examine their combined effect on the model. Quantitative evaluations were performed on the test set for each experiment, and the results are presented in Table 5.
From Table 5, it can be observed that our method exhibits a steady improvement in segmentation results compared to the baseline. The addition of the ECA module, whether integrated into the baseline or combined with the MobileNetV2 backbone, increases the model’s MIoU. After replacing the backbone network, the computational cost and parameter count are significantly reduced, accompanied by a substantial improvement in recognition speed. In terms of the MIoU, the introduction of the ECA attention mechanism yields a 0.52% improvement for the model with the lightweight backbone.
4.4.4. Image Segmentation Results
Using the same training environment and parameters, we conducted comparative experiments on different semantic segmentation networks.
Figure 12 illustrates the prediction results of various networks. From top to bottom are the results of segmenting different types of images using FCN, HRNet18, U-Net, DeepLabv3+, and our proposed method, with the last row showing the annotated ground truth images. We chose to compare our method with FCN, HRNet18, U-Net, and DeepLabv3+, which are some of the most outstanding methods in the field of semantic segmentation, demonstrating excellent performance on various datasets. These methods embody different design characteristics: FCN transforms traditional convolutional networks into fully convolutional structures for segmentation tasks; U-Net addresses the issue of information loss in semantic segmentation by introducing skip connections; DeepLabv3+ enhances segmentation performance through dilated convolutions and multi-scale feature fusion; and HRNet improves segmentation accuracy by preserving high-resolution features. By comparing these representative methods, we can gain a comprehensive understanding of the strengths and limitations of different segmentation approaches.
Our model demonstrates superior segmentation results across the six object categories requiring segmentation, with masks that are smoother and more complete. It is observable that in the recognition of lawns, boundaries, and humans, the performances of these algorithms appear quite similar. However, in the recognition of tree trunks, although the FCN and U-Net networks can correctly identify tree trunks, their ability to recognize lawns is significantly lacking, which is undesirable. In the recognition of obstacles, the algorithm proposed in this paper produced results closest to the ground truth images, with other algorithms missing some details. In recognizing shrubs, especially in the hard-to-recognize area on the lower right side of the shrubs, our algorithm’s recognition effect is visibly better than that of the other algorithms.
Overall, as shown in
Table 6, our method achieves segmentation results that are nearly identical to those of the computationally expensive DeepLabV3+ architecture, with a reduction in both the number of parameters and computational cost in GFLOPs. This lays a solid foundation for our model’s future deployment in embedded systems, making it suitable for operation on machines with limited resources.
As observed in Table 7, the network model proposed in this study was compared with other types of networks. Clearly, our network model far surpasses the others in terms of FPS, approaching 30. The closer the FPS value is to 30, the better the compatibility of the model with the ORB-SLAM3 algorithm. In terms of the MIoU and MPA, which indicate recognition accuracy, our proposed network performs the best, exceeding the recognition results of the other networks.
Although our architecture does not appear to be lighter than HRNet18 in terms of GFLOPs and Params, as shown in Table 7, our method achieves a 1.03% higher MIoU and a 0.43% higher MPA than HRNet18. Relatively speaking, although HRNet18 is lighter, its recognition accuracy is lower and cannot achieve satisfactory results. Moreover, despite its smaller parameter count, its FPS improvement is not significant, showing no advantage over our proposed architecture. Considering all indicators, our proposed architecture is the optimal choice, and in terms of integrating a semantic segmentation module into the ORB-SLAM3 environment, our method is the most suitable.
4.5. Lawn Map Construction Experiment
This section validates the method proposed in this paper for constructing a 3D semantic map of a lawn scene through experimentation. The experiment utilized an Intel RealSense D455 depth camera with a resolution of 640 × 480, and the program ran on a laptop running Ubuntu 20.04. The depth camera was held by hand and carried around the lawn to be mapped. During the walk, the camera was continuously swept from side to side so as to cover the entire lawn to be captured.
Figure 13 illustrates the real-time recognition of the lawn during the walking process.
During the construction of the 3D semantic map, it can be seen that the architecture proposed in this paper recognizes the working lawn relatively well and, in general, can accurately identify all six target categories. According to the method described in Section 3.2, the semantic segmentation results are added to the ORB-SLAM3 system, from which a 3D semantic point cloud map is obtained. As shown in Table 8, compared with the original RGB point cloud map, the semantic point cloud map occupies much less space. Since the semantic map constructed in this paper serves as a working-space map for gardening pruning robots, only a semantic map covering the robot’s working height is needed. The crowns at the tops of trees have no effect on the robot’s lawn mowing task, so they are removed entirely. As a result, the saved point cloud map is more compact, and a larger share of the stored points carries information that is useful for the task.
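A minimal sketch of this height-based cropping step is shown below, assuming the semantic point cloud is stored as NumPy arrays of XYZ coordinates (in metres, ground near z = 0) and per-point labels; the 2.0 m working-height threshold is an illustrative value rather than the exact one used in our experiments.

```python
import numpy as np


def crop_to_working_height(points, labels, max_height=2.0):
    """Keep only points at or below the robot's working height.

    points: (N, 3) array of XYZ coordinates in metres, z up, ground near z = 0.
    labels: (N,) array of semantic class indices for each point.
    """
    keep = points[:, 2] <= max_height  # discard tree crowns and other high structure
    return points[keep], labels[keep]
```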
As relying solely on a point cloud map is insufficient for path navigation, the 3D semantic point cloud map needs to be converted into an octree map and an occupancy grid map suitable for future three-dimensional and two-dimensional path navigation.
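As a simplified illustration of the 2D conversion, the sketch below projects the cropped point cloud onto an occupancy grid with an assumed 5 cm resolution and obstacle-height threshold; the octree map mentioned above is built by the mapping pipeline itself, so this sketch only conveys the idea of the projection.

```python
import numpy as np


def points_to_occupancy_grid(points, resolution=0.05, obstacle_min_z=0.05):
    """Project 3D points onto a 2D grid: cells containing points above
    obstacle_min_z are marked occupied (1); all other cells remain free (0)."""
    xy = points[:, :2]
    origin = xy.min(axis=0)
    size = np.ceil((xy.max(axis=0) - origin) / resolution).astype(int) + 1
    grid = np.zeros((size[1], size[0]), dtype=np.uint8)  # rows = y, cols = x
    obstacles = points[points[:, 2] > obstacle_min_z]
    idx = ((obstacles[:, :2] - origin) / resolution).astype(int)
    grid[idx[:, 1], idx[:, 0]] = 1
    return grid, origin
```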
Figure 14 illustrates the colored point cloud map, semantic point cloud map, octree map, and occupancy grid map obtained using the MM-ED method proposed in this paper. As the presence of moving individuals in the environment can significantly impact the 3D semantic point cloud map, the point cloud maps established in this experiment exclusively feature scenes without any people.