1. Introduction
In recent years, with the rapid development of smart mine construction and unmanned driving technology, driverless open-pit mines have gradually been established. However, open-pit roads contain negative obstacles, such as potholes and road collapses. These negative obstacles blend with the road to a large extent, resulting in inconspicuous feature information. This poses a significant risk to the safe driving of driverless mine trucks in open-pit mines. Therefore, there is an urgent need to research rapid and accurate negative obstacle detection methods for the roads in open-pit mining areas [1].
Most existing detection methods rely on infrared or radar sensors to detect negative obstacles based on their local characteristics. Matthies et al. [2], for example, proposed a negative obstacle detection method based on thermal infrared images, which identifies negative obstacles through local intensity analysis of the infrared image. Cheng et al. [3] proposed a negative obstacle feature detection method based on radial distance and local dense features using multiple LiDAR sensors and synthetic features, achieving good negative obstacle detection but poor detection of small targets. Kang et al. [4] proposed a single-line LiDAR and vision fusion algorithm for environment perception to address the difficulty of detecting small targets. However, this method has a poor detection effect and cannot adapt to the harsh and complex environment of open-pit mines.
Vision sensors have higher reliability in harsh environments, such as those with high temperatures, magnetic fields, or dust, and have good adaptability to open-pit mining. Machine vision-based negative obstacle detection methods can mainly be divided into image analysis algorithms and deep learning algorithms. Pothole detection algorithms based on 2D image analysis usually have four main steps: (1) image pre-processing; (2) image segmentation; (3) shape extraction; and (4) object recognition [5]. For example, Li et al. [6] used morphological filters [7] to reduce image noise and enhance pothole contours. The preprocessed road images are then segmented using histogram-based thresholding, such as Otsu's method, used by Buza et al. [8], or the triangle method, used by Koch et al. [9]. Otsu's method minimizes the intra-class variance and is better at separating damaged and normal road areas. Finally, the extracted area is modeled using an ellipse, and the image texture inside the ellipse is compared with the texture of the undamaged road area. If the former is coarser than the latter, the ellipse is considered a pothole [6]. However, all the above-mentioned image processing techniques are severely affected by various factors, especially insufficient light, which hinders the implementation of obstacle recognition systems [10]. Therefore, some authors, such as Tsai et al. [11,12], have proposed performing segmentation on depth maps and showed that this gives better performance when segmenting damaged road areas. However, because potholes are always irregular in shape and often lack sufficient texture information, assumptions based on geometry and texture are sometimes unreliable.
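The histogram-based thresholding step described above can be illustrated with a minimal sketch of Otsu's method. It is not taken from the cited works; the function name `otsu_threshold` and the flat list of integer gray levels are assumptions made for illustration. The sketch maximizes the between-class variance, which is equivalent to minimizing the intra-class variance.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best splits pixels into two classes.

    Pixels with value <= t form the lower class, the rest the upper class.
    Maximizing between-class variance is equivalent to minimizing the
    intra-class variance. `pixels` is a flat list of integers in
    [0, levels), a simplification chosen for this illustration.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low = 0   # number of pixels in the lower class
    sum_low = 0     # intensity sum of the lower class
    for t in range(levels):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0:
            continue        # lower class still empty
        if count_high == 0:
            break           # upper class empty from here on
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Between-class variance up to the constant factor 1 / total^2.
        between = count_low * count_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```

For a clearly bimodal image, the returned threshold falls between the two modes, so the darker region (for example, a pothole candidate) is separated from the brighter road surface.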
Deep learning methods have made great progress in recent years [13]. Silvister et al. [14] and Thiruppathiraj et al. [15] used convolutional neural networks for detecting damage to pavements, and they achieved high accuracy, though the detection speed was slow. Suong et al. [16] proposed an improved Yolo-v2 target detection network for intelligent pothole detection with an average detection accuracy reaching 82.43% and a detection speed of 21 frames per second. Chen et al. [17] proposed a location-aware convolutional neural network-based pothole detection method that achieved an accuracy of 95.2% and a recall of 92.0%, which is better than most existing methods. However, the real-time detection of negative obstacles for unstructured roads in open-pit mines has rarely been studied.
Therefore, this paper provides an in-depth analysis of the characteristics of unstructured roads in open-pit mines and proposes a real-time detection method for negative obstacles on open-pit roads that can adapt to the complex and harsh environment of open-pit mines. The main contributions of this paper are as follows:
- (1) We construct a dataset of negative obstacles on roads in open-pit mines and propose a real-time detection method for negative obstacles on roads in open-pit mines. This method provides the real-time, efficient, and highly reliable detection of negative obstacles for the early warning of unmanned vehicles in open-pit mining areas.
- (2) We propose an improved Yolov4 negative obstacle detection convolutional neural network, using RepVGG as the backbone feature extraction network in the feature extraction phase and the SimAM attention mechanism in the feature fusion phase to solve the feature information loss problem. Finally, we use dynamic convolution to further improve the feature representation capability.
- (3) We propose a new non-maximum suppression algorithm that has better detection accuracy than the traditional NMS algorithm and effectively solves the problem of difficult detection when the targets overlap.
2. System Model and Definitions
The proposed method for the real-time detection of negative obstacles in open-pit mines is shown in
Figure 1.
In the training phase of the network: (1) the input image is preprocessed, and the backbone network is used to extract the features of the image; (2) the feature maps of different sizes are fused using the feature pyramid structure to achieve better multi-scale detection capability; (3) the features of different sizes are passed to the classification and bounding box prediction module to obtain the class and bounding box location of the target; (4) the loss between the predictions and the correct labels is backpropagated to update the weights of the network; (5) steps (1)–(4) are repeated until the network converges.
In the inference process of the network, the trained network model is used to predict the images to get the obstacle classification and bounding box prediction results, and the non-maximum suppression algorithm is then used to remove the redundant prediction boxes to ensure that there is only one detection result for an obstacle.
The road texture in open-pit mines is largely uniform, and negative obstacles are often blurred by complex lighting and stagnant water. Furthermore, the apparent size of a negative obstacle may vary over an enormous range depending on the angle of the picture, the distance, and the actual size of the road pothole. To ensure accuracy, the automated detection of negative obstacles should be a multitasking process that integrates feature extraction, target recognition, and localization. Specifically, the detection algorithm should have good capability in these three aspects: (1) extracting valid feature information about negative obstacles; (2) accurate target localization; (3) good multiscale target detection capabilities.
Mainstream target detection frameworks usually consist of a backbone network pre-trained on a target detection dataset as a feature extraction network and a head used to identify and locate the target [
18]. The backbone networks for feature extraction can be divided into two categories according to the platform on which they run. Backbone networks running on GPU platforms include VGG [
19], ResNet [
20], DenseNet [
21], etc., while backbone networks running on CPU platforms include SqueezeNet [
22], MobileNet [
23], and ShuffleNet [
24]. Since negative obstacle detection is used in driverless vehicles, real-time target recognition is a very important metric, and GPU platforms have a natural speed advantage for image processing. The heads for target identification and localization are usually divided into two categories, ‘one-stage’ and ‘two-stage’ target detectors. Two-stage detectors include Fast R-CNN [
25], Faster R-CNN [
26], R-FCN [
27], and Libra R-CNN [
28], while for one-stage detectors, the most representative models are Yolo [
29,
30,
31,
32], SSD [
33], and RetinaNet [
34]. Two-stage detectors have high target recognition precision due to the inclusion of candidate box selection in the target recognition process, though this also leads to unsatisfactory detection when the target is obscured or occupies a large area of the image. A good way to solve the problem of multi-scale target recognition is to extract feature images from different stages and fuse the features after multiple upsampling and subsampling. Networks that use this feature fusion mechanism include Feature Pyramid Network (FPN) [
35], Path Aggregation Network (PANet) [
36], BiFPN [
37], and NAS-FPN [
38]. They usually place the feature fusion module between the backbone network for feature extraction and the head for target detection and localization to further extract the feature information of multi-scale targets.
Negative obstacles on the roads in open-pit mines blend extensively with the road surface, and the obstacle boundary features may not be obvious, as shown in Figure 2a. This paper presents an in-depth analysis of the Yolov4 ("Yolov4: optimal speed and accuracy of object detection") [32] target detection theory and proposes a one-stage negative obstacle detection model for mining roads based on Yolov4. The principle is shown in Figure 2b–d. The problem of obstacle detection at different scales is solved by designing a priori boxes on feature maps of three different sizes (13 × 13, 26 × 26, and 52 × 52).
The bounding box coordinate prediction method follows the practice of YOLOv3 (as shown in Figure 3):

$$b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h},$$

where $t_x$, $t_y$, $t_w$, and $t_h$ represent the predicted output of the model, $c_x$ and $c_y$ represent the coordinates of the grid cell, $p_w$ and $p_h$ represent the size of the bounding box before prediction, $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates and size of the predicted bounding box, and the coordinate loss adopts the squared error loss. $\sigma(t_x)$ and $\sigma(t_y)$ represent the distance between the center coordinate of the predicted bounding box and the coordinate of the grid cell on the $x$ and $y$ axes.
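As an illustration of the YOLOv3-style decoding described above, the following minimal sketch turns raw network outputs into a box. The function name `decode_box`, the tuple layout, and the normalization of the center by the grid size are assumptions made for this example, not details given in the paper.

```python
import math


def sigmoid(x):
    """Logistic function, squashing the raw offset into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior, grid_size):
    """Decode one YOLOv3-style prediction.

    t         -- raw outputs (t_x, t_y, t_w, t_h)
    cell      -- grid cell indices (c_x, c_y)
    prior     -- a priori box size (p_w, p_h), here in normalized image units
    grid_size -- number of cells per side, e.g. 13, 26, or 52

    Returns (b_x, b_y, b_w, b_h) with the center normalized to [0, 1]
    (the normalization is an assumption of this sketch).
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    # The sigmoid keeps the predicted center inside its own grid cell.
    b_x = (sigmoid(t_x) + c_x) / grid_size
    b_y = (sigmoid(t_y) + c_y) / grid_size
    # The exponential scales the prior box and keeps the size positive.
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h
```

With all raw outputs equal to zero, the box is centered in its grid cell and has exactly the size of the a priori box, which is why well-chosen priors make the regression easier.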
3. The Proposed Optimization Method
The negative obstacle detection model proposed in this paper is composed of the backbone extraction network RepVGG, a spatial pyramid pooling SPP module, a multi-scale feature bilateral fusion PANet module, and classification and prediction branches. The network model is shown in
Figure 4.
3.1. Feature Extraction Network Based on RepVGG
Because the inference speed of the CSPDarkNet [39] backbone network used by Yolov4 is slow, this paper adopts RepVGG [40] as the feature extraction network to improve the Yolov4 model. During training, the network structure of RepVGG is similar to that of ResNet: a residual structure is added to the VGG network to solve the vanishing gradient problem of deep networks. In addition, a 1 × 1 convolution branch and an identity residual branch are used to make the network simpler and more efficient. The basic block structure in the training and inference phases is shown in Figure 5. In the model inference stage, the network first merges each convolutional layer with its BN layer in the residual block, then converts the convolutions with other kernel sizes into convolutions with 3 × 3 kernels, and finally merges the resulting 3 × 3 convolutions of the residual branches. That is, the weights and biases of all the branches are summed to obtain a single fused Conv3 × 3 layer. Since only 3 × 3 convolutions and ReLU activation functions are stacked in the model inference stage, the residual structure is discarded. Since most current inference engines specifically accelerate 3 × 3 convolution, this model is easier to deploy and accelerate than Yolov4's CSPDarkNet backbone network, which uses the Mish activation function and many different convolutional kernel sizes.
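The branch fusion described above can be sketched for the simplest possible case: one input channel and one output channel, with kernels stored as nested lists. This is an illustrative simplification, not the RepVGG implementation; all function names and the `(gamma, beta, mean, var)` tuple layout for BN statistics are assumptions. The ReLU that follows the block is omitted because it is unaffected by the fusion.

```python
def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN, y = gamma * (x - mean) / sqrt(var + eps) + beta, into a
    bias-free convolution. Returns the rescaled kernel and a bias."""
    scale = gamma / (var + eps) ** 0.5
    fused = [[w * scale for w in row] for row in kernel]
    return fused, beta - mean * scale


def pad_1x1_to_3x3(kernel_1x1):
    """Embed a 1x1 kernel at the center of a zero 3x3 kernel."""
    w = kernel_1x1[0][0]
    return [[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]]


# The identity branch is a 3x3 convolution whose only weight is a center 1.
IDENTITY_3X3 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]


def reparameterize(k3, bn3, k1, bn1, bn_id):
    """Merge the 3x3, 1x1, and identity branches into one 3x3 conv.

    Each bn* argument is a (gamma, beta, mean, var) tuple (hypothetical
    layout chosen for this sketch).
    """
    f3, b3 = fuse_conv_bn(k3, *bn3)
    f1, b1 = fuse_conv_bn(pad_1x1_to_3x3(k1), *bn1)
    fi, bi = fuse_conv_bn(IDENTITY_3X3, *bn_id)
    kernel = [[f3[r][c] + f1[r][c] + fi[r][c] for c in range(3)]
              for r in range(3)]
    return kernel, b3 + b1 + bi


def conv3x3(image, kernel, bias=0.0):
    """Zero-padded 3x3 cross-correlation with stride 1."""
    h, w = len(image), len(image[0])
    out = [[bias] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += image[rr][cc] * kernel[dr + 1][dc + 1]
    return out
```

Applying `conv3x3` with the fused kernel and bias gives the same result as evaluating the three branches with their own BN layers and summing them, which is why the multi-branch training structure can be dropped at inference time without changing the output.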
The spatial pyramid pooling (SPP) module is connected after the last feature layer of RepVGG to enlarge the receptive field of the neural network and separate significant contextual features, enhancing the receptive field of the network for small targets. The last feature layer of RepVGG first passes through three convolutions, which halve the number of channels, and is then processed using maximum pooling at four different scales: 13 × 13, 9 × 9, 5 × 5, and 1 × 1. The pooled feature maps are then concatenated, and a second three-convolution module halves the number of channels again to obtain feature maps of the original input dimension. Because SPP uses pooling kernels of different sizes to extract and re-aggregate features at multiple scales, the network is more robust and its detection performance is improved.
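The pooling-and-concatenation part of the SPP module can be sketched as follows, assuming stride-1 max pooling with padding so that every pooled map keeps the input size (a common way to build SPP in Yolo-style networks, assumed here rather than stated in the text). Feature maps are plain nested lists, and the function names are illustrative.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1; out-of-range positions are ignored,
    so the output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    return [[max(fmap[i][j]
                 for i in range(max(0, y - r), min(h, y + r + 1))
                 for j in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def spp(channels, kernel_sizes=(13, 9, 5, 1)):
    """Pool every channel at each scale and concatenate along the channel
    axis. With C input channels the result has 4 * C channels; the 1 x 1
    branch is simply a copy of the input."""
    pooled = []
    for k in kernel_sizes:
        pooled.extend(max_pool_same(channel, k) for channel in channels)
    return pooled
```

Because the concatenation multiplies the channel count by four, the convolutions that follow the SPP block are needed to bring the number of channels back down, as described in the paragraph above.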
3.2. Multiscale Feature Fusion Based on Channel-Wise and Spatial-Wise Attention
For unstructured negative road obstacles in open-pit mines with large size spans, irregular shapes, and multi-scale features, the PANet module is used in the detection network model for multi-layer feature fusion. The PANet structure performs multiple upsampling and downsampling operations for feature fusion to strengthen feature aggregation. However, during upsampling, nearest-neighbor interpolation loses a certain amount of local feature map information. Therefore, the SimAM [41] attention mechanism, which is both spatial-wise and channel-wise, is introduced to mine the important features of each neuron between channels.
Existing attention modules in computer vision focus on the channel domain or spatial domain corresponding to feature-based attention and spatial-based attention in the human brain. However, in humans, these two mechanisms coexist and together facilitate information selection during visual processing. SimAM defines the following energy function for each neuron:
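$$e_t(w_t, b_t, \mathbf{y}, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2$$

Here, $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are the linear transformations of $t$ and $x_i$, where $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$, $i$ is the index over the spatial dimension, $M = H \times W$ is the number of neurons on the channel, and $w_t$ and $b_t$ are the weight and bias of the transform. With binary labels ($y_t = 1$, $y_o = -1$) and a regularization term $\lambda w_t^2$ added, as in SimAM [41], minimizing this energy gives fast closed-form solutions for $w_t$ and $b_t$:

$$w_t = -\frac{2(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \qquad b_t = -\frac{1}{2}(t+\mu_t)\,w_t$$

Here, $\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i$ and $\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i-\mu_t)^2$ are the mean and variance calculated over all neurons except $t$ in that channel. Since the solutions shown in the above equation are obtained on a single channel, it is reasonable to assume that all pixels in a single channel follow the same distribution. Given this assumption, the mean and variance can be computed once over all neurons and reused for every neuron on that channel. This significantly reduces the computational cost and avoids iteratively calculating $\mu$ and $\sigma$ for each position. Thus, the following minimum energy equation is obtained:

$$e_t^* = \frac{4(\hat{\sigma}^2+\lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$$

Here, $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(x_i-\hat{\mu})^2$. The above equation implies that the lower the energy $e_t^*$, the more the neuron $t$ is distinguished from the surrounding neurons, and the higher its importance. Therefore, the importance of a neuron can be obtained by $1/e_t^*$. Finally, the features are augmented using sigmoid as follows:

$$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X$$

Here, $E$ groups all $e_t^*$ across the channel and spatial dimensions. Therefore, this paper uses the SimAM attention mechanism to calculate the importance of neurons in each channel during upsampling and to enhance the corresponding neuron features, so as to reduce the loss of feature information in the process of upsampling interpolation.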
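A minimal sketch of the SimAM re-weighting for a single channel is given below. It evaluates the inverse of the minimum energy, $1/e_t^* = (t-\hat{\mu})^2 / (4(\hat{\sigma}^2+\lambda)) + 0.5$, and scales each neuron by the sigmoid of that value. The function name `simam_channel`, the nested-list feature map, and the default $\lambda = 10^{-4}$ are assumptions made for illustration.

```python
import math


def simam_channel(channel, lam=1e-4):
    """Apply SimAM attention to one channel given as a 2D nested list.

    The mean and variance are computed once over all M neurons of the
    channel and reused for every position, as the text describes.
    lam is the regularization coefficient lambda (assumed default).
    """
    values = [v for row in channel for v in row]
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m

    refined = []
    for row in channel:
        new_row = []
        for t in row:
            # Inverse minimum energy: larger for neurons that stand out
            # from the channel mean, never smaller than 0.5.
            inv_energy = (t - mu) ** 2 / (4.0 * (var + lam)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
            new_row.append(t * weight)
        refined.append(new_row)
    return refined
```

The module adds no learnable parameters: the attention weight depends only on the channel statistics, so it can be inserted after an upsampling step without changing the model size.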
3.3. Optimization of Classification and Prediction Module
The negative obstacle recognition and localization module is responsible for the interaction of feature maps and local features using 3 × 3 regular convolution and 1 × 1 dynamic convolution for three different size feature images to complete the classification regression operation. However, the conventional 1 × 1 convolution does not permit strong feature characterization, and this limits the detection performance of the model. Therefore, dynamic convolution [
42] is introduced in the classification regression layer of the model to find a balance between network structure and computational consumption and to increase the expressive ability of the model without increasing the depth or width of the network. That is, the convolution parameters are adaptively adjusted according to the input image. Instead of using one convolutional kernel in each layer, it adjusts the weight of each convolutional kernel and makes targeted choices based on the dynamic aggregation of multiple parallel convolutional kernels. Appropriate parameters are used to extract features. Firstly, a general dynamic perceptron is defined, as is shown in
Figure 6.
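The output of a conventional perceptron is generally expressed as $y = g(W^{T}x + b)$, where $W$, $b$, and $g$ represent the weight, bias, and activation function, respectively. The dynamic perceptron is therefore defined as follows:

$$y = g\left(\tilde{W}^{T}(x)\,x + \tilde{b}(x)\right), \quad \tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{W}_k, \quad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,\tilde{b}_k, \quad \text{s.t. } 0 \le \pi_k(x) \le 1, \ \sum_{k=1}^{K}\pi_k(x) = 1.$$

Here, $\pi_k(x)$ represents the attention weight, which is not fixed but changes as the input changes. The computation therefore includes the attention weight calculation and the dynamic weight fusion, as shown in Equation (6).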
Like the dynamic perceptron, the dynamic convolution also has K convolution kernels, as shown in
Figure 7.
The dynamic convolution is followed by a BN layer and a ReLU layer. For a given layer, the K kernels have the same size and number of channels and are merged using their respective attention weights to obtain the convolution kernel parameters of the layer. In the attention branch, global average pooling is first performed to obtain global spatial features, which are then mapped to K dimensions by two fully connected layers. Finally, softmax normalization is performed so that the resulting attention weights can be assigned to the K kernels of the layer. The originally fixed convolutional kernels are thus selected dynamically according to the input, which significantly improves the feature representation capability.
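The kernel aggregation step can be sketched as follows. The attention logits, which in the text come from global average pooling followed by two fully connected layers, are passed in directly here; the function names and the nested-list kernel layout are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs are the attention weights
    pi_k, which lie in [0, 1] and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_kernels(kernels, biases, logits):
    """Blend K parallel kernels into one input-dependent kernel.

    kernels -- list of K kernels, each a 2D nested list of the same shape
    biases  -- list of K scalar biases
    logits  -- K attention logits computed from the input (hypothetical
               stand-in for the GAP + two-FC attention branch)
    """
    pi = softmax(logits)
    rows, cols = len(kernels[0]), len(kernels[0][0])
    kernel = [[sum(p * k[r][c] for p, k in zip(pi, kernels))
               for c in range(cols)]
              for r in range(rows)]
    bias = sum(p * b for p, b in zip(pi, biases))
    return kernel, bias, pi
```

Only one convolution is executed per input after aggregation, so the extra cost over a static layer is the small attention branch and the weighted sum of K kernels, not K separate convolutions.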
3.4. Optimization of Target Positioning
When the negative obstacle detection model localizes a target, the same obstacle will often produce multiple candidate detection frames with high confidence. These redundant detection frames must be removed so that each object has only one detection result, as is shown in Figure 8. The use of a non-maximum suppression (NMS) [43] algorithm to obtain the local maximum is common in computer vision. However, the traditional NMS is not accurate for bounding box localization, and it may easily cause false suppression for similar negative obstacles. In this paper, we propose a new non-maximum suppression method (CIoU Soft Non-Maximum Suppression, CS-NMS) that re-weights the confidence of each detection frame so as to achieve the accurate localization of negative obstacle targets.
Intersection over union (IoU) [44], which measures the overlap between the predicted and true boxes, is the most popular assessment method used in target detection benchmarks, but it provides no useful measure when the predicted box and the real box do not intersect, because it is zero regardless of how far apart they are. To solve this problem, three important geometric elements of the bounding box need to be considered: (1) overlapping area; (2) center point distance; and (3) aspect ratio. Complete intersection over union (CIoU) [45] adds to IoU a normalized center point distance and an influence factor $\alpha v$ that accounts for the aspect ratios of the predicted box and the true box, so that all three factors are considered:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v,$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1-\mathrm{IoU}) + v}$$

Here, $A$ is the real box and $B$ is the prediction box, $b$ and $b^{gt}$ denote the center points of the prediction box and the real box, $\rho(\cdot)$ is the Euclidean distance between the two center points, $c$ represents the diagonal length of the smallest enclosing region that contains both the predicted and real boxes, and $w$, $h$ and $w^{gt}$, $h^{gt}$ represent the width and height of the predicted and real boxes, respectively. The traditional non-maximum suppression algorithm directly sets the score of the current detection box to zero when its IoU with the highest-scoring detection box is greater than the threshold, which leads to target boxes with large overlapping areas being overlooked:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}$$

Here, $s_i$ indicates the score of the current detection frame $b_i$, $N_t$ is the threshold value of IoU, and $M$ is the highest scoring detection box. In Soft-NMS [46], the current detection box score is instead multiplied by a weight function that attenuates the scores of neighboring detection boxes that overlap with $M$. The more a detection box overlaps with $M$, the more strongly its score is attenuated. Soft-NMS chooses the Gaussian function as the weighting function to decay the score of the prediction box rather than directly zeroing it, thus modifying the rule for removing detection boxes. The Gaussian weight function is as follows, where $\sigma$ controls the strength of the decay:

$$s_i = s_i\, e^{-\frac{\mathrm{IoU}(M, b_i)^2}{\sigma}}$$
In the Soft-NMS, IoU values are used to suppress redundant detection boxes, so the overlap area between boxes is the only factor considered, which results in false suppression when the detected target is occluded. This paper uses CIoU, a more accurate measure for detection frames, as the Soft-NMS criterion, because the suppression criterion should consider not only the overlapping area but also the distance between the center points of the two frames and the aspect ratio of the current detection frame. However, the Soft-NMS reduces the confidence of all prediction boxes using the Gaussian function, which has a negative impact on prediction boxes whose CIoU scores are below the threshold. Simply substituting CIoU into the Soft-NMS therefore makes the results worse. This paper defines the new non-maximum suppression method CS-NMS as:

$$s_i = \begin{cases} s_i, & \mathrm{CIoU}(M, b_i) < \varepsilon \\ s_i\, e^{-\frac{\mathrm{CIoU}(M, b_i)^2}{\sigma}}, & \mathrm{CIoU}(M, b_i) \ge \varepsilon \end{cases}$$

where $s_i$ denotes the score of the current detection box $b_i$, $\varepsilon$ is the CS-NMS threshold, and $M$ is the detection box with the highest score. It uses the Gaussian function to reduce the scores of prediction boxes whose CIoU with $M$ is higher than the threshold value, while the scores of boxes below the threshold value are kept unchanged. The algorithm flow is as follows (Algorithm 1):
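Algorithm 1: The update rule of the original Soft-NMS, Equation (1), is replaced with the CS-NMS rule, Equation (2).
Input: $B = \{b_1, \ldots, b_N\}$, $S = \{s_1, \ldots, s_N\}$, $\varepsilon$ (B is the list of initial detection boxes, S contains corresponding detection scores, ε is the NMS threshold.)
Begin:
    $D \leftarrow \{\}$
    While $B \neq \emptyset$ do
        $m \leftarrow \arg\max S$; $M \leftarrow b_m$; $D \leftarrow D \cup M$; $B \leftarrow B - M$
        For $b_i$ in $B$ do
            $s_i \leftarrow s_i\, e^{-\mathrm{IoU}(M, b_i)^2/\sigma}$    Soft-NMS (1), replaced by:
            if $\mathrm{CIoU}(M, b_i) \ge \varepsilon$ then $s_i \leftarrow s_i\, e^{-\mathrm{CIoU}(M, b_i)^2/\sigma}$    CS-NMS (2)
        end
    end
    return D, S
end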
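The following is a minimal, unofficial sketch of CIoU and of the CS-NMS rule as described in the text: the score of a box is decayed with a Gaussian of its CIoU with the current best box only when that CIoU reaches the threshold. The box format `(x1, y1, x2, y2)`, the default values of `eps` and `sigma`, and the `score_floor` used to discard boxes whose score has decayed to almost nothing are assumptions made for illustration.

```python
import math


def ciou(a, b):
    """Complete IoU of two boxes given as (x1, y1, x2, y2) with positive
    width and height."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou = inter / (wa * ha + wb * hb - inter)
    # Squared distance between the two box centers.
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    # Aspect-ratio consistency term and its trade-off weight.
    v = 4.0 / math.pi ** 2 * (math.atan(wa / ha) - math.atan(wb / hb)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v


def cs_nms(boxes, scores, eps=0.5, sigma=0.5, score_floor=0.001):
    """CS-NMS sketch: returns the kept boxes and their (possibly decayed)
    scores, ordered by the moment each box was selected."""
    boxes, scores = list(boxes), list(scores)
    kept_boxes, kept_scores = [], []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        best_box, best_score = boxes.pop(m), scores.pop(m)
        kept_boxes.append(best_box)
        kept_scores.append(best_score)
        survivors = []
        for box, score in zip(boxes, scores):
            overlap = ciou(best_box, box)
            if overlap >= eps:
                # Gaussian decay only above the threshold; boxes below it
                # keep their original score.
                score *= math.exp(-overlap ** 2 / sigma)
            if score >= score_floor:  # assumed pruning of negligible scores
                survivors.append((box, score))
        boxes = [b for b, _ in survivors]
        scores = [s for _, s in survivors]
    return kept_boxes, kept_scores
```

A final confidence threshold is still applied by the caller: heavily overlapping duplicates end up with low scores and are filtered out, while a nearby but distinct obstacle below the CIoU threshold keeps its original confidence.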
4. Performance Analysis
4.1. Dataset Construction and Experimental Setup
Roads in the open-pit mines consist of unstructured roads and some temporary semi-structured roads. The datasets used in this paper were collected from a metallic open-pit mine in Henan Province and a non-metallic open-pit mine in Hubei Province. There are 4150 images in the dataset, including 895 images that contain only background texture information and do not contain negative obstacles. The image resolution is 1080 × 1080, and 3200 images are used for the training set and 950 images are used for the test set. Because stagnant water is characteristically different from road potholes, this dataset classifies negative obstacles into two categories, potholes and stagnant water. The background is involved as a separate category during training, so this model is actually a three-category target detection model.
The training of deep learning models requires a large number of samples, and the original training set includes only 3200 images. As shown in Figure 9, to achieve the ideal training effect, we carried out data augmentation on the original pictures, including horizontal and vertical flips, mirroring, changing brightness, adding Gaussian noise, and rotating by a small angle (−15° to 15°). After data augmentation, the training set was four times its original size, and the distribution of samples in the dataset before and after augmentation is shown in Table 1.
The computer used in this experiment is configured with an Intel i7-7800X CPU and an NVIDIA GeForce RTX 2080 Ti (11 GB) GPU, and its operating system is Windows 10 Professional. The network model is implemented in the PyTorch 1.2 framework and uses transfer learning, initializing the network with weights pre-trained on the VOC dataset. The initial learning rate was set to 0.0016 and the momentum to 0.9, and training was carried out in two stages. First, part of the feature extraction network was frozen to speed up training, with the batch size set to 16, the number of epochs to 20, and the attenuation factor to 0.001. All layers were then unfrozen for training, with the batch size set to 8, the number of epochs to 80, and the attenuation factor to 0.0001.
In order to validate the effectiveness of the negative obstacle risk detection model, the experiment used precision (P), recall (R), average precision (AP), mean average precision (mAP), and miss rate (MR) as criteria for quantitative evaluation:
Here, TP is a correctly detected target, FP is a background region incorrectly detected as a target, FN is a target that failed to be detected, n indicates the number of target classes, and N stands for the number of obstacles detected. Precision and recall trade off against each other; increasing one of them will usually reduce the other, and the most common way to balance these two metrics is to use AP, which measures the accuracy of the model for a single target class. mAP is the average of the AP over all classes, which measures the accuracy of the overall model. Better comprehensive performance of the model indicates that it can better detect negative obstacles in extreme situations and control the driving risks more effectively. Therefore, the model should score well on every performance index in order to adapt the risk detection to the harsh environment of the open-pit mines.
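The count-based metrics can be illustrated with a short sketch. It uses the standard definitions of precision and recall; treating the miss rate as FN / (TP + FN), that is, one minus recall, is an assumption of this example, as are the function names. AP itself requires the full precision-recall curve and is taken here as a given per-class value.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, miss_rate) from detection counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    miss_rate = FN / (TP + FN)   (assumed definition: 1 - recall)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    miss_rate = fn / (tp + fn) if tp + fn else 0.0
    return precision, recall, miss_rate


def mean_average_precision(ap_per_class):
    """mAP is the arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 90 correct detections with 10 false alarms and 30 missed obstacles give a precision of 0.9, a recall of 0.75, and a miss rate of 0.25, which shows how a detector can look precise while still missing a quarter of the hazards.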
4.2. Performance Validation of Models
In order to verify the performance of our model, the negative obstacle detection optimization model of this paper was compared with the current mainstream target detection network model for experiments, and the experimental results are shown in
Table 2.
The negative obstacle detection model proposed in this paper has good comprehensive detection performance, with an accuracy of 94.15% and a mAP of 96.35%. This means that our model can more accurately detect negative obstacles and avoid potential risks while the truck is in motion. In open-pit mining areas, trucks travel at 20–30 km/h. In such a motion scenario, higher demands are placed on the timeliness of the detector. The camera frame rate of driverless vehicles is 60 fps, so the detection speed needs to be higher than 60 fps to meet the requirements of real-time detection. Compared with Yolov4, the RepVGG backbone network used in this paper, due to its inter-layer fusion and discarding of residual branches in the inference stage, greatly improves the detection speed while ensuring accuracy, reaching 69.3 fps, which is enough to meet the requirements of safe driving in motion scenes. We chose the D0 version of EfficientDet [35], which has the fewest network layers, because its feature extraction layers are simpler and it has good timeliness. RetinaNet, one of the latest single-stage networks, has good feature extraction efficiency and high detection accuracy. RetinaNet incorporates focal loss to address the imbalance between difficult and easy samples, enabling more accurate target detection and localization, but it also suffers from low recall and poor timeliness. In addition, both EfficientDet and RetinaNet have high miss rates, which can lead to a large number of undetected targets and threaten driving safety. As an improved version of Yolov3, Yolov4 has made significant progress in detection accuracy, recall rate, false detection rate, and timeliness. Its CSPNet structure enhances the feature extraction efficiency and achieves a good balance of detection accuracy and speed. Experiments show that our model has good applicability for negative obstacle detection in open-pit mines, with high precision and high real-time characteristics, and can meet the driving safety requirements of unmanned vehicles in complex environments.
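To put the frame rates in context, a rough worked example is given below. It assumes the upper truck speed of 30 km/h quoted above and ignores sensor and actuation latency, so the numbers are only indicative.

```latex
v = 30\ \mathrm{km/h} = \frac{30}{3.6}\ \mathrm{m/s} \approx 8.33\ \mathrm{m/s}

\Delta t_{69.3\,\mathrm{fps}} = \frac{1}{69.3}\ \mathrm{s} \approx 14.4\ \mathrm{ms},
\qquad d = v\,\Delta t \approx 8.33 \times 0.0144 \approx 0.12\ \mathrm{m}

\Delta t_{60\,\mathrm{fps}} = \frac{1}{60}\ \mathrm{s} \approx 16.7\ \mathrm{ms},
\qquad d = v\,\Delta t \approx 8.33 \times 0.0167 \approx 0.14\ \mathrm{m}
```

Under these assumptions, the truck advances about 0.12 m during one detection at 69.3 fps, which is less than the roughly 0.14 m it travels between two consecutive camera frames at 60 fps, so each frame can be processed before the next one arrives.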
Figure 10 shows the negative obstacle recognition results, which include a variety of situations, such as potholes, stagnant water, complex environments, multiple targets, and complex light. The red boxes are the pothole targets, the blue boxes are the stagnant water targets, and the upper left corner of each detection box shows the target category and confidence level.
In Figure 10a, YOLO v3 has a classification error, our model and EfficientDet are not precise enough in locating negative obstacles, and RetinaNet performs best. In Figure 10b,e, where feature information was obscured and detection was difficult, our model and YOLO v3 performed the best. Due to the minimal number of layers in the EfficientDet-D0 network, it was difficult to extract more feature information, and this resulted in poor detection performance in complex situations. In general, our model performs well in complex and multi-target situations and has strong multi-scale target detection ability, enabling it to reliably meet the requirements of harsh open-pit mine environments.
4.3. Effectiveness of Model Training Methods
In the training of the negative obstacle detection model, this experiment used the VOC2007 + 2012 dataset as the pre-training data for transfer learning. The model was trained for 96,850 iterations, and the loss convergence status is shown in Figure 11.
It can be seen from Figure 11a that the model tended to converge after 60,000 iterations. It can be seen from Figure 11b that, owing to transfer learning, the model has learned a large number of common low-level image features and is able to reuse these low-level features while fine-tuning high-level abstract features in downstream tasks. This results in smaller initial loss values and faster convergence of the network, with the loss stabilizing within a small range of values after 5000 iterations. The model in this paper has better convergence than Yolov4.
As mentioned above, to give the deep learning model better accuracy and speed, the more efficient RepVGG network is used in the feature extraction stage, the SimAM attention mechanism is introduced in the feature fusion module, dynamic convolution is used in the classification regression, and a new CS-NMS algorithm is proposed for the post-processing stage. Because the dataset in this paper is too small to accurately reflect the detection accuracy of the network, the public VOC2007 test set is also used for testing during transfer learning. The results of the ablation experiment are shown in Table 3. Data augmentation effectively improves the detection accuracy and reduces the overfitting of the model on the negative obstacle dataset. Furthermore, adding dynamic convolution, SimAM attention, and CS-NMS to the model further improves its accuracy, yielding a performance improvement of 2–3% mAP on both the VOC dataset and the negative obstacle dataset.
4.4. Performance Analysis of CS-NMS Algorithm
Since most potholes on the road do not have clear boundaries, target localization is difficult. As is shown in Figure 12, after the introduction of the CS-NMS, the accuracy of bounding box localization improved. In addition, false suppression of obscured targets in multi-target situations was reduced, and the mAP improved by 0.25% compared with the network without CS-NMS. However, the test set in this paper is small and its samples lack variety, so it cannot accurately reflect the effect of the CS-NMS.
Therefore, we used Yolov4 and RetinaNet to verify the CS-NMS on the VOC2007 test set. The test result is shown in Figure 13. With the CS-NMS, the mAP of Yolov4 increases by 0.85%. The confidence threshold for detected targets in the standard test is 0.01, which generates a large number of low-confidence false targets and has little engineering application value, so this paper uses a confidence threshold of 0.5 and an input image size of 416 × 416.
As is shown in Figure 13, in most cases, the CS-NMS proposed in this paper performs better than the Soft-NMS. When the threshold is equal to 0.46, the performance of the two is similar, and when the threshold is equal to 0.49, the three kinds of NMS achieve their best performance, with the CS-NMS performing the best. Under each threshold, the CS-NMS is better than or equal to the original Soft-NMS. Even the worst-performing CS-NMS can at least achieve a performance close to the best performance of the Soft-NMS, which means that even if the threshold is not adjusted in other target detection networks, the CS-NMS proposed in this paper will perform better.
Because the CS-NMS code proposed in this article is very concise, it can be easily ported to other one-stage networks. For example, RetinaNet also obtained a 0.72% mAP improvement by using the CS-NMS on the VOC2007 test set. The test shows that the CS-NMS proposed in this paper can improve the accuracy of target detection and at the same time has good portability.
5. Conclusions and Future Work
In this paper, we propose a Yolov4-based target detection network model for negative obstacle recognition on unstructured roads in open-pit mining areas which can detect negative obstacles in open-pit mining areas more accurately and quickly. This method meets the demand for the fast and accurate recognition of negative obstacle targets in front of unmanned vehicles in complex road conditions such as open-pit mining areas. The experimental analysis shows that the negative obstacle detection model proposed in this paper has a good recognition effect in a variety of road scenarios in open-pit mining areas. The detection model achieves 96.35% mAP, 94.15% accuracy, and 94.18% recall, and is also in a leading position compared with mainstream target detection networks. The CS-NMS proposed in the paper is helpful for improving target detection accuracy and has good portability. This paper applies a deep learning target detection framework to negative obstacle detection in open-pit mines, demonstrating the feasibility of target detection applied in complex environments (e.g., mines) with obscured obstacle features, and provides a new solution for future unmanned mining truck obstacle warnings.
The difficulty and high risk involved in data collection in open-pit mines have resulted in small datasets. The next step should be to increase the number of samples and further improve the accuracy of the detection network. In addition, the improved model proposed in this paper is expected to be applied to more complex scenes and solve the problems associated with positive and negative obstacles in open-pit mining areas.