3.1. Training Results Analysis
During the training of the different models, performance is assessed on the validation set after each training epoch. Strictly speaking, the validation data are never encountered during training, so the performance indicators obtained by running inference on the validation set are convincing, and the inference results also reflect the model's capacity for generalization. The training batch size is set to 8.
Figure 7 shows how a batch of training data appears when it is input to the model. Because mosaic data augmentation is applied to the image data, each sample in the batch is not a single image but a composite spliced from several processed images. The annotation information is scaled and spliced accordingly, so that the instance coordinates still correspond to the spliced image. The objective remains to enhance the model's generalization capability. During validation, model training and weight updates are unaffected by the batch size setting, so, to speed up validation, the batch size of the validation set is set to 16.
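To make the splicing idea concrete, the following is a minimal sketch of a 2×2 mosaic with matching label adjustment. The function name and the fixed-quadrant layout are simplifications for illustration; YOLOv5's actual mosaic picks a random center point and rescales the tiles:

```python
import numpy as np

def mosaic_2x2(images, boxes_per_image, out_size=640):
    """Tile four images into one canvas and shift their (x1, y1, x2, y2)
    pixel boxes so the labels still match the spliced layout.
    Assumes each input image is at least (out_size // 2) pixels per side."""
    s = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    merged = []
    offsets = [(0, 0), (0, s), (s, 0), (s, s)]  # (y, x) origin of each quadrant
    for (oy, ox), img, boxes in zip(offsets, images, boxes_per_image):
        canvas[oy:oy + s, ox:ox + s] = img[:s, :s]  # crop each tile to its quadrant
        for x1, y1, x2, y2 in boxes:
            # Clip the box to the cropped tile, then translate it into its quadrant.
            merged.append((min(x1, s) + ox, min(y1, s) + oy,
                           min(x2, s) + ox, min(y2, s) + oy))
    return canvas, merged
```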
Figure 8 shows the validation process of the Yolov5n model for a batch of data in the validation set.
Table 4 and Figure 9 show the detection performance of the different Yolov5 variants on gas slug and dense bubble instances in the validation set. The highest detection precision for the gas slug and for dense bubbles is 97.9% and 86.3%, respectively, both achieved by Yolov5s. Yolov5m has the highest cmAP for the gas slug, reaching 73.46%, while Yolov5s has the lowest at 66.87%. For dense bubbles, Yolov5l's cmAP is the highest at 65.57%, and Yolov5s's is the lowest at 62.50%. In general, the gas slug is detected more reliably than dense bubbles because its features are more distinctive in the image data; moreover, when detecting slug flow, detecting the gas slug accurately and efficiently matters most. For either category, however, the difference in cmAP among the four models is small, meaning their average detection accuracy is similar. To compare the models' comprehensive performance, we therefore also compared their overall average detection metrics, including precision, recall, and mAP, and analyzed model complexity.
Table 5 shows the overall performance of the Yolov5 family on the validation set. The cmAP of Yolov5l is the highest, reaching 69.25%. However, Yolov5l is also the most complex model, with 46,113,663 parameters and 107.7 G FLOPs. Embedded devices are usually far less powerful than desktop computers, so deploying such a large, computationally intensive model would make inference very slow, which hinders applications in specialized scenarios. The cmAP of Yolov5n is 67.94%, only 1.31% below Yolov5l, and its precision and recall are only 2.2% and 1.4% lower, respectively. Yet Yolov5n has only 1,761,871 parameters and 4.1 G FLOPs, roughly 3.8% of Yolov5l, making it the least complex model in the comparison. Although its cmAP is slightly lower than that of Yolov5m and Yolov5l, a difference of about 1% rarely has a noticeable effect in practice, and the model still performs very well in real applications. Therefore, among the Yolov5 variants, Yolov5n offers the most outstanding comprehensive performance.
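For reference, parameter and FLOPs figures like those in Table 5 can be reproduced with a profiling tool such as thop, the counter the YOLOv5 repository itself uses for its GFLOPs readout. The sketch below is illustrative; the 640×640 input resolution is an assumption:

```python
import torch
from thop import profile  # pip install thop

# Load a YOLOv5n model from the hub (a locally trained checkpoint works the same way).
model = torch.hub.load("ultralytics/yolov5", "yolov5n")
dummy = torch.zeros(1, 3, 640, 640)  # assumed input resolution

# thop counts multiply-accumulate operations (MACs); FLOPs ~= 2 * MACs by convention.
macs, params = profile(model, inputs=(dummy,), verbose=False)
print(f"parameters: {params:,.0f}, FLOPs: {2 * macs / 1e9:.1f} G")
```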
Figure 10a shows the curves of several important indicators of the Yolov5n model during training, such as loss, precision, recall, and mAP. All loss terms decrease steadily, while precision, recall, and mAP increase, and all metrics stabilize in the later stages of training. No over-fitting is observed during training, so the model's capacity for generalization is assured. Precision and recall are a pair of mutually constraining indicators.
Figure 10b shows the precision–recall (PR) curve. It reflects the relationship between precision and recall, at an IoU threshold of 0.5, for the model with the best overall performance during training. The area enclosed by the PR curve and the x- and y-axes is mAP@0.5, with a value of 0.935.
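To make this relationship concrete, the sketch below computes the area under a PR curve with the standard all-point interpolation used for VOC/COCO-style AP; the toy recall/precision values are illustrative, not the paper's data:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    p = np.maximum.accumulate(p[::-1])[::-1]
    steps = np.where(r[1:] != r[:-1])[0]  # indices where recall increases
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))

# Toy example: four operating points along a hypothetical PR curve.
rec = np.array([0.2, 0.5, 0.8, 0.95])
prec = np.array([1.0, 0.95, 0.90, 0.70])
print(f"AP = {average_precision(rec, prec):.3f}")
```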
Figure 11 illustrates the inference results of the Yolov5 series models on sample data. Yolov5s and Yolov5l can detect dense bubbles located along the edges of the image, and the detection quality is good. The most important component, the gas slug, is detected by all models. However, the gas slug is sometimes occluded by bubbles, and a pure computer vision classification model would struggle to identify it accurately. A strength of object detection algorithms is that they still perform well on partially occluded objects: Yolov5l detects the gas slug hidden behind the bubbles completely, and although the other models miss a small part of it, they still detect most of the occluded gas slug. Taking an image from the validation set as an example, Figure 12 shows how the feature maps of Yolov5n evolve during inference. Every layer in the network plays an important role in feature extraction, and all layers learn and update their parameters. The deeper the feature extraction layer, the more abstract the resulting feature map, indicating that Yolov5n has strong feature extraction and integration capabilities and provides a solid foundation for the detection head. The input image has three channels (c = 3), corresponding to RGB. In the first two Conv stages, the number of channels in the feature map grows from 16 to 32, so the feature map carries more channel-wise feature information. As feature extraction and integration proceed and the network deepens, the spatial size of the feature map progressively decreases, making its receptive field larger and larger. The 23rd layer is the last layer of the overall network and has the largest receptive field, so its detect head is most sensitive to large objects. Conversely, the 17th layer is shallower with a smaller receptive field, so its detect head is most sensitive to small objects, while the detect head attached to the 20th layer focuses on medium-sized objects.
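Intermediate feature maps like those in Figure 12 can be captured with PyTorch forward hooks. Below is a minimal sketch; the checkpoint name, the 640×640 input, and the attribute path to the layer list are assumptions about a model loaded through the YOLOv5 hub interface:

```python
import torch

# Hypothetical trained checkpoint, loaded via the YOLOv5 hub interface.
model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5n_slug.pt")
# Assumed path: AutoShape -> DetectMultiBackend -> DetectionModel -> nn.Sequential
layers = model.model.model.model

feature_maps = {}
def grab(idx):
    def hook(module, inputs, output):
        feature_maps[idx] = output.detach()
    return hook

# Hook the layers feeding the three detect heads described above
# (17: small objects, 20: medium objects, 23: large objects).
for idx in (17, 20, 23):
    layers[idx].register_forward_hook(grab(idx))

with torch.no_grad():
    model(torch.zeros(1, 3, 640, 640))  # assumed input resolution

for idx, fmap in feature_maps.items():
    # Deeper layers yield smaller maps with larger receptive fields.
    print(f"layer {idx}: {tuple(fmap.shape)}")
```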
Similarly, the Yolov3 series of models also includes lightweight models and high-precision models.
Table 6 shows the overall performance of different models in the Yolov3 series for inference on the validation set data.
Figure 13 shows the detection results of the Yolov3 series models on slug flow images. The lightest Yolov3 variant is Yolov3-tiny, with only 8,669,002 parameters and 12.9 G FLOPs, about 80% of Yolov5s. Although Yolov3-tiny is small, its overall performance is the worst among the Yolov5 and Yolov3 series models: its mAP@0.5 and mAP@0.5:0.95 are the lowest, and its cmAP is only 62.52%. The cmAP of Yolov3 and Yolov3-spp is 68.27% and 67.47%, respectively, essentially on par with Yolov5n. However, Yolov3 and Yolov3-spp have about 35 times the parameters and 38 times the FLOPs of Yolov5n, so, owing to their huge model size, their inference speed is very low. With roughly equal cmAP but far greater complexity, their comprehensive performance is inferior, and Yolov3 and Yolov3-spp are not suitable for deployment on embedded devices for functional development.
Figure 14 summarizes the overall detection performance metrics of Yolov3 and Yolov5. In general, the different Yolov3 and Yolov5 models show little difference in slug flow detection, which is mainly reflected in the mAP index. From the perspective of model complexity, however, Yolov5n has the fewest parameters and FLOPs and is the model with the strongest comprehensive performance, laying the model foundation for functional development on embedded devices.
Table 7 and Figure 15 provide a comparison with methods from previous research. The contrastive models comprise two one-stage object detection models, SSD and Yolov5-bifpn, where Yolov5-bifpn is an enhancement of Yolov5l in which the neck is replaced with the bifpn structure, and two two-stage object detection models, Faster R-CNN and Mask R-CNN. The backbone of the SSD model is VGG16, whereas Faster R-CNN and Mask R-CNN employ ResNet50; both VGG16 and ResNet50 are well-established deep convolutional neural networks renowned for their exceptional feature extraction capabilities. The comparative results in Table 7 reveal that the cmAP of SSD and Mask R-CNN surpasses that of Yolov5n, reaching 68.05% and 68.36%, respectively; their mAP@0.5 and mAP@0.5:0.95 also outperform Yolov5n, owing to the robust feature extraction and integration capabilities of SSD and Mask R-CNN. For Faster R-CNN, mAP@0.5 is higher than Yolov5n, reaching 93.8%, but mAP@0.5:0.95 is slightly lower at 65.1%, which also brings its cmAP 0.06% below Yolov5n. For Yolov5-bifpn, the primary differences are in mAP@0.5:0.95 and cmAP, which are lower than Yolov5n by 1.2% and 1.07%, respectively. On the mAP metrics, the recognition performance of the five models does not differ significantly, and minor differences in average precision are unlikely to noticeably affect practical detection. In terms of model complexity, however, the five models differ substantially. Apart from Yolov5n, the model with the smallest parameter count is SSD at 34.31 M, approximately 19 times that of Yolov5n, and the model with the lowest FLOPs is Faster R-CNN at 20.7 G, roughly five times that of Yolov5n. In summary, compared with the models employed in previous research, Yolov5n achieves similarly excellent detection precision while holding a significant advantage in model complexity and computational load, laying a strong foundation for high-speed slug flow detection and embedded deployment.
3.3. Embedded Deployment Results of Yolov5n
The preceding analysis identified Yolov5n as the model with the best comprehensive performance on the slug flow image data. It offers strong average detection accuracy and, most importantly, a very small parameter count and only 4.1 G FLOPs, which provides the foundation for embedded deployment of the model and the development of specialized detection applications. To validate the practicality of the model for slug flow detection on embedded devices, we first collected video data showing slug flow phenomena. The dimensions of the video frames match those of the images used during training, ensuring the model can fully leverage its inference and detection capabilities on the video data. We then deployed the trained Yolov5n model (.pt file format) on the Jetson Nano embedded device to perform inference on the slug flow video data.
Figure 20 shows the detection performance of the Yolov5n model on slug flow video data on the Jetson Nano. Since the video frames come from the same source, and have the same size, as the images used to train the model, the detection accuracy matches that reported in the training results analysis. This part of the study therefore focuses on the detection speed of Yolov5n on the Jetson Nano, because the detection speed of the embedded device directly determines whether detection functions can be realized in specialized scenarios. During inference on the slug flow video, the average preprocessing, inference, and NMS (Non-Maximum Suppression) times per image are 2.1 ms, 50.2 ms, and 8.6 ms, respectively, about 60 ms in total, giving a video inference rate of about 16.7 FPS. Such inference is too slow to meet the basic requirements of industrial production, so accelerating model inference on the Jetson Nano is essential.
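A rough way to reproduce such per-frame timing figures is sketched below; the video path and checkpoint name are hypothetical, and the hub AutoShape wrapper is assumed to handle preprocessing and NMS internally:

```python
import time
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5n_slug.pt")  # hypothetical checkpoint
cap = cv2.VideoCapture("slug_flow.mp4")  # hypothetical video file

frames, t0 = 0, time.perf_counter()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # One call covers preprocessing, inference, and NMS.
    model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frames += 1
elapsed = time.perf_counter() - t0
print(f"avg {1000 * elapsed / frames:.1f} ms/frame -> {frames / elapsed:.1f} FPS")
# At ~60 ms per frame, this works out to 1000 / 60 ≈ 16.7 FPS.
```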
TensorRT is an integrated tool for deep learning inference acceleration that can convert the Yolov5n model into a dedicated TensorRT model and exploit the NVIDIA GPU to accelerate inference. First, the Yolov5n model (.pt) is converted into an intermediate format (.wts) using a Python script. Then, the conversion program is built by executing the CMake command, and the .wts weights are serialized into a final format (.engine) that TensorRT can read. Finally, the video data are inferred by loading the model file (.engine).
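These steps can be scripted as below. This sketch mirrors the widely used tensorrtx-style workflow; every path, script name, and executable here is an assumption about that kind of repository layout, not a fixed API:

```python
import subprocess

def run(cmd):
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Export the trained PyTorch weights (.pt) to the intermediate .wts format.
run(["python", "gen_wts.py", "-w", "yolov5n.pt", "-o", "yolov5n.wts"])

# 2. Build the TensorRT conversion/inference program with CMake.
run(["cmake", "-S", ".", "-B", "build"])
run(["cmake", "--build", "build"])

# 3. Serialize the .wts weights into a TensorRT engine (.engine);
#    "n" selects the Yolov5n variant in this hypothetical tool.
run(["./build/yolov5", "-s", "yolov5n.wts", "yolov5n.engine", "n"])
```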
Figure 21 shows the detection results after TensorRT acceleration. In the detection results, category 0 represents the gas slug and category 1 represents dense bubbles, and both are accurately detected during inference. Importantly, the total processing time per image of the video data drops to about 12 ms, and the FPS reaches about 83.3. After deploying the model with TensorRT, the inference speed therefore improves greatly, to five times that before acceleration, demonstrating the power of TensorRT for inference acceleration. This approach offers a new idea and method for real-time detection of slug flow.
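For completeness, here is a minimal sketch of loading and running the serialized engine through the TensorRT Python API; the TensorRT 8.x binding-based API, the engine file name, and a single input/output layout are assumptions:

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
with open("yolov5n.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate page-locked host buffers and device buffers for each binding.
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(engine.get_binding_shape(i))
    host = cuda.pagelocked_empty(size, dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host); dev_bufs.append(dev); bindings.append(int(dev))

stream = cuda.Stream()

def infer(image_chw: np.ndarray) -> np.ndarray:
    """One inference pass; the image must already match the engine's input shape."""
    np.copyto(host_bufs[0], image_chw.ravel())
    cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
    context.execute_async_v2(bindings, stream.handle)
    cuda.memcpy_dtoh_async(host_bufs[1], dev_bufs[1], stream)
    stream.synchronize()
    return host_bufs[1]  # raw detections; decoding and NMS still follow
```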