1. Introduction
Oceans contain abundant biological resources, and the utilization of these resources has strategic significance for economic development. Harnessing modern science and technology to upgrade marine resource processing and improve production efficiency is a shared objective among numerous researchers. In recent years, owing to the proliferation of deep learning-based vision technology, object detection algorithms based on deep learning have superseded traditional approaches due to their exceptionally high accuracy [1]. In the past, underwater image data needed to be processed at a land center. However, processing large amounts of data on ground computing entities results in high data transmission latency. In addition, the communication conditions for such architectures are often not met in marine environments: the information transmission rate of wireless communication systems at sea is relatively low, so only short messages and low-speed message services can be transmitted; ocean satellite communication is fast but costly; and shore-based marine communication covers only a small coastal area [2]. To avoid these problems, an effective approach is to use a mobile edge computing platform and process the underwater data at the edge, using edge nodes and edge servers.
As shown in Figure 1, mobile edge computing is a computing paradigm that uses mobile edge devices to provide relevant computing services at network edge nodes instead of relying on an Elastic Compute Service (ECS) [3]. Edge computing nodes can be seen as an extension of the Internet of Things into the ocean, with broad application prospects in water quality monitoring, pollution observation, marine resource exploration, etc. [4]. The advantage of an edge computing platform for underwater biological object detection is that it enables real-time processing and analysis while reducing the cost of data transmission and storage. Because edge computing platforms can process and analyze data in the underwater environment, they avoid the latency and instability of data transmission. In addition, to improve detection accuracy and efficiency, edge computing platforms can use machine learning algorithms to classify and identify underwater biological objects. This technology can be used for marine ecological environment monitoring, marine resource investigation, marine scientific research, and other fields.
Although edge platforms have natural advantages in performing marine organism detection tasks, hardware limitations mean that deep learning inference on mobile edge devices can easily lead to excessive load, resulting in insufficient object detection capability that fails to meet detection requirements. To enable edge nodes to execute deep learning-based detection algorithms more efficiently, mature deep neural network models can be compressed into lightweight models that better suit edge node computing tasks.
In this paper, the YOLOv8 [5] model was selected as the baseline model, and a new lightweight method based on neural architecture search and distillation is proposed to address the limited computing power, low recognition accuracy, high model complexity, and high latency of edge nodes in an edge computing environment. Specifically, there are two steps: (1) use the once for all neural architecture search algorithm to compress the YOLOv8 backbone network, obtaining a backbone with low model complexity and high accuracy; (2) to compensate for the loss of accuracy, design a mixed distillation method that integrates the Channel-wise Distillation (CWD) [6] and KD [7] algorithms to distill intermediate-layer knowledge and label knowledge synchronously, thereby improving accuracy. A service framework for the edge platform is also designed to implement the model application on the edge computing platform.
2. Related Works
Currently, the primary lightweight methods for relevant models include neural architecture search, pruning, quantization, and knowledge distillation.
Neural architecture search (NAS) is a method for the automated design of neural network structures. It mainly includes two key steps: search space design and search strategy selection. Search space design defines the structural space of a neural network, such as the number of layers and convolutional kernels. Search strategy selection determines the search method and objective function, for example, a genetic algorithm.
Pruning is a technique for reducing the number of parameters and the computational complexity of neural networks. It removes unnecessary weights and connections from a neural network, reducing the model's size and computational complexity while maintaining its accuracy.
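As a minimal, generic sketch (using PyTorch's built-in utilities, not the compression pipeline of this paper), L1-magnitude pruning of a single convolutional layer looks like this:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy convolutional layer standing in for a layer of a real detector.
conv = nn.Conv2d(16, 32, kernel_size=3)

# Zero out the 30% of weights with the smallest L1 magnitude via a binary mask.
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Fold the mask into the weights so the sparsity becomes permanent.
prune.remove(conv, "weight")
print(f"sparsity: {(conv.weight == 0).float().mean().item():.2%}")
```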
Quantization is another method of optimizing neural networks. It reduces the size and computational complexity of a model by reducing the precision of the model parameters and activation values, which usually incurs a certain degree of accuracy loss. Quantization techniques can effectively reduce model storage space and computational requirements.
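Likewise, a hedged illustration of post-training dynamic quantization with PyTorch (again generic, not this paper's method):

```python
import torch
import torch.nn as nn

# Toy floating-point model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert the linear layers' weights to int8; activations are quantized
# on the fly at inference time, trading a little accuracy for size/speed.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```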
Knowledge distillation is a model compression technique that utilizes large, high-performance teacher models to guide the training of small, compact student models. The core idea is to enable the student model to learn the output or intermediate representation of the teacher model so that the student model can approach the performance of the teacher model as closely as possible while maintaining a smaller model size.
The compression algorithm utilized in this article combines the neural architecture search and knowledge distillation methods. Related research on these methods is reviewed below.
- (1) Neural architecture search
At present, neural network structures are relatively complex and their design lacks theoretical guidance, which is not conducive to the use of deep learning. Therefore, researchers have begun to seek an automated way to design neural network structures independently, that is, through neural architecture search [8]. In the initial stage, Zoph et al. [9] used reinforcement learning, with a controller sampling substructures (child networks) from the search space under a certain probability distribution; the sampled structure was then trained and its performance tested on a validation set. Real et al. [10] first proposed a method similar to biological evolution, which mutates the best-performing models among randomly generated ones and gradually eliminates network structures with poor performance to finally obtain the best-performing model. However, such methods require a very large amount of computation, which greatly hinders research progress. Therefore, Pham et al. [11] proposed a parameter-sharing-based neural architecture search, i.e., efficient neural architecture search via parameter sharing (ENAS). This is a fast and low-cost automatic model design method: the controller searches for the best subgraph within a large computational graph, while weight sharing among submodels greatly reduces the computational overhead, thus initiating the second stage of NAS. Both of these stages still carry excessive overhead and are not suitable for practical model deployment, so researchers began introducing relevant prior knowledge to reduce the cost of neural architecture search, especially given the limited computing resources available on mobile edge devices; designing a resource-constrained mobile model is challenging. Based on the MobileNet search space, Stamoulis et al. [12] proposed the concept of a super kernel that unifies the 3 × 3 and 5 × 5 convolutions, making the network a single-path structure. The practical deployment of deep learning models also needs to adapt to different hardware platforms, and re-training for each platform is time-consuming. To address this deployment problem, Cai et al. [13] proposed the once for all (OFA) structure search algorithm to handle multiple deployment scenarios. This method separates model training from structure search and trains an OFA network that supports multiple structural settings such as depth, width, convolutional kernel size, and spatial resolution. Based on the actual deployment scenario, an appropriate substructure is simply selected from the OFA network without any additional training process.
- (2) Knowledge distillation
Knowledge distillation resembles the teacher–student relationship among humans: a pre-trained teacher model provides knowledge, while the student model acquires it through distillation training. It can transfer complex teacher model knowledge to simple student models at a slight performance loss. Label knowledge is the latent information contained in a neural network's final prediction output for sample data. Hinton et al. [7] first proposed knowledge distillation, which falls within this category. Because the soft labels obtained after adjusting the distillation temperature carry many uncertainties, Yang et al. [14] proposed using the label knowledge generated by an intermediate model, updated by the teacher model in each training cycle, to guide the student model. In order to fully exploit the label information and remove interference, Muller et al. [15] used a subclass distillation method that groups the original labels and lets them participate in soft label distillation learning. Because the label knowledge of the output layer carries incomplete information, some researchers aim to obtain more representative feature knowledge from the intermediate layers and transfer it to student models [16]. According to numerous techniques, differentiating between foreground and background regions during distillation is essential for object detection. MIMIC [17] used an L2 loss to force the feature maps of the student network's RPN to resemble those of the teacher network, and discovered that a direct pixel-level loss may negatively impact object detection performance. Wang et al. [18] proposed extracting fine-grained features near object anchor points. Zhang [19] achieved good results by using an attention function to construct masks that distinguish the foreground from the background. Recent research [20] has also focused on the information found in every channel: Zhou and colleagues computed the mean activation value of every channel and matched the weighted per-channel discrepancies in classification. CSC [21] calculates the pairwise relationships between all spatial positions and all channels for knowledge transfer. According to channel exchange [22], the data in each channel are universal and transferable between various models.
Deep learning-based approaches significantly increase object detection accuracy. CNN-based algorithms fall into two primary categories: two-stage and one-stage object detection [23]. The two-stage technique divides the detection problem into two phases: the first stage generates candidate regions, which are then further refined and classified by the second stage. Such algorithms include R-CNN [24], Fast R-CNN [25], Faster R-CNN [26], and Cascade R-CNN [27]. The one-stage object detection algorithm detects and classifies objects simultaneously, directly outputting the objects' location coordinates and class probabilities. Common models include the YOLO series [28,29,30,31,32], SSD [33], RetinaNet [34], FreeAnchor [35], FSAF [36], and FCOS [37]. There are also some lightweight studies on deep learning models. Fan [38] proposed a lightweight object detection algorithm, CM-YOLOv8, for coal mining working faces, introducing an adaptive predefined anchor box tailored to the dataset and an L1-norm-based pruning method, which compresses the computational and parameter complexity of a model without affecting accuracy. Yang [39] proposed an automatic detection method based on an enhanced YOLOv8s model, which utilizes depthwise separable convolution (DSConv) to generate a substantial number of feature maps, thereby reducing computational complexity. Guo [40] proposed an underwater object detection method that optimizes YOLOv8s by incorporating FasterNet as the backbone network, modifying the feature pyramid network into a fast feature pyramid network, and introducing a lightweight C2f structure. The aforementioned methods all offer impressive solutions, with a particular emphasis on lightweight enhancements of YOLOv8s. However, deploying YOLOv8s on a Jetson Xavier NX proves challenging due to its size and complexity. Hence, this article adopts YOLOv8n as the baseline model.
4. Neural Architecture Search Algorithm
The once for all (OFA) neural architecture search algorithm trains a single network that supports different architectures, separating the search from training to reduce costs. Specific subnetworks can then be rapidly selected from the OFA network without further training.
4.1. OFA Large Model Training
The OFA network provides a large model that accommodates several subnetworks of various sizes by taking into account four crucial dimensions of convolutional neural network (CNN) architecture, namely depth, width, kernel size, and resolution. The fine tuning of the depth and the convolutional kernel is shown in Figure 7 and Figure 8, respectively. The OFA network divides the CNN model into a series of units with progressively smaller feature map sizes and an increasing number of channels. Each unit is composed of a series of layers; if the feature map size decreases, only the first layer of the unit has a stride of 2, and all other layers have a stride of 1. The OFA large network includes several subnetworks of various sizes, with small subnetworks nested inside larger ones. To prevent interference between subnetworks, training proceeds gradually from large subnetworks to small ones. The OFA network starts by training the neural network with the largest possible kernel size, depth, and width. Smaller subnets are then gradually added to the sampling space so that the network is progressively fine tuned to support them. In particular, it first offers an elastic kernel size, which may be chosen at each layer from {3, 5, 7}, while the depth and width stay at their maximum values. Following that, elastic depth and then elastic width are supported in turn. Throughout the training process, the resolution is elastic, which is achieved by varying the image size of every training data batch.
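To make the progressive shrinking procedure concrete, the following sketch (hypothetical helper names, not the original OFA code; the resolution list is illustrative) samples one subnetwork configuration from the elastic dimensions described above:

```python
import random

# Elastic choices of the OFA space (kernel size, depth per unit, width ratio).
KERNELS, DEPTHS, WIDTHS = [3, 5, 7], [2, 3, 4], [3, 4, 6]
RESOLUTIONS = [128, 160, 192, 224]  # image size varied per training batch
NUM_UNITS = 5

def sample_subnet() -> dict:
    """Sample one subnetwork nested inside the trained OFA supernet."""
    max_depth = max(DEPTHS)
    return {
        "resolution": random.choice(RESOLUTIONS),
        "depths": [random.choice(DEPTHS) for _ in range(NUM_UNITS)],
        # One choice per (unit, layer); layers beyond a unit's depth are unused.
        "kernels": [[random.choice(KERNELS) for _ in range(max_depth)]
                    for _ in range(NUM_UNITS)],
        "widths": [[random.choice(WIDTHS) for _ in range(max_depth)]
                   for _ in range(NUM_UNITS)],
    }

subnet = sample_subnet()
```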
4.2. Dedicated Model Deployment for OFA
After training the OFA network, the next step is to obtain specific subnetworks for the target deployment scenario. The goal is to search for neural networks that meet the efficiency constraints of the target hardware (such as latency and energy) while maximizing accuracy. Since the OFA network separates model training from the neural architecture search, this stage requires no training expenditure. In addition, in order to provide quick feedback on the quality of a model, OFA builds neural network twins that forecast the latency and accuracy of a particular neural network architecture. By substituting predicted accuracy and latency for measured accuracy and waiting time, it eliminates the expense of repeated measurements during the search.
4.3. Application of the OFA Network in Object Detection
The OFA network demonstrates outstanding performance in classification tasks, and this paper extends its application to detection tasks by using the OFA neural architecture search algorithm to optimize the backbone network of the object detection algorithm. This paper does not retrain the OFA large model; it only uses the dedicated model search algorithm of the OFA network. The search is constrained by the model's computational complexity (in terms of FLOPs) and accuracy, where the large OFA model is already trained on the ImageNet dataset. Taking the target hardware and latency limitations into consideration, an evolutionary search is performed based on the neural network twins to obtain specialized subnetworks. As the cost of using neural network twins for the search is negligible, obtaining subnetworks does not take much time. The obtained subnetworks are then transplanted into the existing object detection algorithm, the NECK layer of the object detection algorithm is fine tuned, and the new object detection model is retrained to test its effectiveness. The process is shown in Figure 9.
The OFA subnetwork used in this paper is based on MobileNetV3 with a width multiplier of 1.2, supporting an elastic depth of (2, 3, 4), an elastic expansion ratio of (3, 4, 6), and an elastic kernel size of (3, 5, 7) for each stage. A subset of ImageNet containing 2000 images (~250 MB) is used for testing. After the accuracy prediction and complexity prediction functions are constructed, the search starts from a neural architecture constrained by FLOPs. The search algorithm is an evolutionary algorithm: in each generation, the population size is 10, the total number of generations is 500, the probability of mutation is 0.1, and the ratio of networks generated through mutation in each generation is 0.5. The evolutionary algorithm explores the structural space, with fitness gauged by the accuracy and complexity functions. Upon reaching the specified number of iterations, the optimal subnet information and weights are preserved.
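The evolutionary loop with the hyperparameters above can be sketched as follows (continuing the configuration sketch from Section 4.1; `predict_map` and `predict_flops` are placeholders standing in for OFA's trained accuracy and complexity "twins"):

```python
import copy
import random

POP_SIZE, GENERATIONS = 10, 500
MUTATE_PROB, MUTATION_RATIO = 0.1, 0.5
FLOPS_LIMIT = 2.4e9  # complexity constraint on the searched subnet

def predict_map(cfg):    # accuracy twin (placeholder for the trained predictor)
    return random.random()

def predict_flops(cfg):  # complexity twin (placeholder)
    return random.uniform(1e9, 4e9)

def mutate(cfg):
    """Resample each elastic choice with probability MUTATE_PROB."""
    child = copy.deepcopy(cfg)
    for unit in range(NUM_UNITS):
        if random.random() < MUTATE_PROB:
            child["depths"][unit] = random.choice(DEPTHS)
        for layer in range(max(DEPTHS)):
            if random.random() < MUTATE_PROB:
                child["kernels"][unit][layer] = random.choice(KERNELS)
            if random.random() < MUTATE_PROB:
                child["widths"][unit][layer] = random.choice(WIDTHS)
    return child

population = [sample_subnet() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Fitness from the twins: keep FLOPs-feasible subnets, rank by predicted mAP.
    feasible = [c for c in population if predict_flops(c) <= FLOPS_LIMIT]
    parents = sorted(feasible, key=predict_map, reverse=True)[: POP_SIZE // 2] \
              or [sample_subnet()]
    n_mut = int(POP_SIZE * MUTATION_RATIO)   # half of each generation from mutation
    population = parents + [mutate(random.choice(parents)) for _ in range(n_mut)]
    while len(population) < POP_SIZE:        # refill with fresh random samples
        population.append(sample_subnet())

best = max(population, key=predict_map)
```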
The backbone network resulting from the NAS search fulfills the anticipated requirements of this paper in terms of FLOPs. However, the network derived from the search process differs from the original YOLOv8n network, thereby exhibiting inadequate adaptability. A direct migration to this network may lead to a reduction in detection accuracy. Consequently, this paper opted to employ the knowledge distillation algorithm to enhance accuracy.
5. Knowledge Distillation Algorithm
In this paper, a neural network distillation architecture that targets both label knowledge and intermediate-layer knowledge simultaneously was designed, utilizing the detection ability of the large model as a prior. When training the object detection model with the replaced backbone network, the intermediate-layer features and output-layer features of the large model are used to correct the intermediate and output layers of the small model, yielding a more accurate model.
5.1. Basic Principles of Knowledge Distillation
Knowledge distillation treats a model as a “black box”: knowledge is the mapping from inputs to outputs. Its basic process is shown in Figure 8. A teacher network is trained first; afterward, the teacher network's output, Q, serves as the target for the student network, so that P, the student network's output, approaches Q. The loss function can therefore be designated as

$$L = H(y, p) + H(q, p)$$

where $y$ is the one-hot transformation of the real label, $q$ is the teacher network's output, $p$ is the student network's output, and $H(\cdot, \cdot)$ is the cross entropy. The loss function here adds the cross entropy with the teacher network's output as the label on top of the original cross entropy.
The class probabilities are generated by a softmax output layer that transforms the logit $z_i$ computed for each class into a probability $p_i$ by comparing $z_i$ with the other logits. The formula is as follows:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

This formula, called softmax with temperature, yields the probability of every class based on the logits. If $T$ is close to 0, it is identical to one-hot encoding: the maximum value approaches 1 and all other values approach 0. The larger $T$ is, the smoother the overall distribution of the final outcome, so the temperature acts as a smoothing function that retains similarity information between classes. As $T$ approaches infinity, the formula approximates a uniform distribution.
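A minimal PyTorch rendering of this softened-label distillation loss (a sketch, not the authors' training code; the weight `alpha` generalizes the plain sum in the loss above, and the $T^2$ factor keeps soft-label gradients on a comparable scale):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross entropy H(y, p) plus softened teacher term H(q, p)."""
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened distributions; it differs
    # from the cross entropy H(q, p) only by the (constant) teacher entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```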
5.2. Channel-Wise Distillation
Channel-wise distillation (CWD) is a knowledge distillation method that operates on channel-level knowledge. In order to better utilize the knowledge in each channel, the CWD algorithm softly aligns the activations of corresponding channels between the teacher and student networks. The fundamental procedure is displayed in Figure 10. To achieve this, CWD first converts the activations of each channel into a probability distribution, so that the discrepancy can be measured with probability distance metrics such as the KL divergence. Figure 11 illustrates how the activations of different channels tend to encode the saliency of the scene categories in the input image. As a novel channel-wise distillation paradigm, CWD lets student networks learn from teacher networks with higher model capacity.
Let $T$ and $S$ denote the teacher and student networks, and let $y^T$ and $y^S$ denote the activation maps from $T$ and $S$, respectively. The channel distillation loss takes the general form

$$\varphi\left(y^{T}, y^{S}\right)=\varphi\left(\phi\left(y_{c}^{T}\right), \phi\left(y_{c}^{S}\right)\right)$$

In the CWD algorithm, the activation values are transformed into probability distributions using the transformation $\phi(\cdot)$, as shown below:

$$\phi\left(y_{c}\right)=\frac{\exp \left(y_{c, i} / \mathcal{T}\right)}{\sum_{i=1}^{W \cdot H} \exp \left(y_{c, i} / \mathcal{T}\right)}$$

where $i$ indexes the spatial location within a channel, $c = 1, \ldots, C$ indexes the channel, and $\mathcal{T}$ is a hyperparameter (the temperature). The probability softens as $\mathcal{T}$ increases, indicating that a larger spatial region of each channel is attended to. Applying softmax normalization removes the influence of the differing activation magnitudes of large and compact networks, which is helpful for KD. If the number of channels between teacher and student does not match, the student network's channel count is upsampled using a 1 × 1 convolutional layer. Then, $\varphi(\cdot)$ evaluates the discrepancy between the channel distributions of the teacher and student networks using the KL divergence, as follows:

$$\varphi\left(y^{T}, y^{S}\right)=\frac{\mathcal{T}^{2}}{C} \sum_{c=1}^{C} \sum_{i=1}^{W \cdot H} \phi\left(y_{c, i}^{T}\right) \cdot \log \left[\frac{\phi\left(y_{c, i}^{T}\right)}{\phi\left(y_{c, i}^{S}\right)}\right]$$

KL divergence is an asymmetric metric. It is evident from the above equation that in order to reduce the KL divergence, $\phi(y^{S}_{c,i})$ should be as large as possible where $\phi(y^{T}_{c,i})$ is large, whereas the divergence is only weakly minimized where $\phi(y^{T}_{c,i})$ is small. Therefore, the student network tends to focus on foreground saliency by producing a similar activation distribution, and activations associated with the teacher network's background regions have a relatively small impact on learning.
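An illustrative PyTorch implementation of the CWD loss above (a sketch under the assumption that teacher and student channel counts already match):

```python
import torch
import torch.nn.functional as F

def cwd_loss(feat_s, feat_t, T=4.0):
    """KL divergence between per-channel spatial distributions (equation above).

    feat_s, feat_t: activation maps of shape (N, C, H, W). If channel counts
    differ, align the student with a 1 x 1 convolution beforehand.
    """
    n, c, h, w = feat_t.shape
    # phi(.): softmax over the W*H spatial positions of each channel.
    p_t = F.softmax(feat_t.reshape(n, c, -1) / T, dim=-1)
    log_p_s = F.log_softmax(feat_s.reshape(n, c, -1) / T, dim=-1)
    log_p_t = torch.log(p_t.clamp_min(1e-12))
    # Sum over spatial positions, average over channels (and the batch).
    kl = (p_t * (log_p_t - log_p_s)).sum(dim=-1)
    return (T ** 2) * kl.mean()
```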
5.3. The Distillation Structure Design of This Paper
To achieve a comprehensive understanding of the teacher's characteristics, this paper integrates the channel-wise distillation and knowledge distillation algorithms to devise a novel knowledge distillation architecture, as depicted in Figure 12. First, CWD is used to distill the feature pyramid network (FPN), targeting the regions with the most abundant feature information; this enables the feature layers of the student network to assimilate the teacher's prior learning outcomes, enhancing the salience of the feature pyramid generated by the student network. Moreover, KL divergence is incorporated into the HEAD section to expedite the convergence of the student network and offer guidance during training. Weighting the loss of the HEAD segment of YOLOv8n serves to stabilize the outcomes.
CWD is used at the NECK module to distill the feature pyramid, and KL divergence is used at the beginning of the output head to achieve soft label distillation. Finally, the two are added with certain weights to obtain the total distillation loss, which participates in backpropagation together with the conventional loss.
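Reusing the `cwd_loss` and `kd_loss` sketches above, the combined objective can be written as follows (the weights are illustrative placeholders, not the paper's tuned values):

```python
LAMBDA_CWD, LAMBDA_KD = 1.0, 0.5  # illustrative distillation weights

def total_loss(det_loss, neck_s, neck_t, head_s, head_t, labels):
    """Conventional detection loss + CWD on the NECK + soft labels on the HEAD."""
    l_cwd = sum(cwd_loss(fs, ft) for fs, ft in zip(neck_s, neck_t))
    l_kd = kd_loss(head_s, head_t, labels)
    return det_loss + LAMBDA_CWD * l_cwd + LAMBDA_KD * l_kd
```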
6. Experimental Analysis
6.1. Dataset and Experimental Environment Settings
The underwater biological image dataset used in this paper was officially provided by the 2020 China Underwater Vehicle (Zhanjiang) Competition. The collection contains 5534 photos of starfish, scallops, sea cucumbers, and sea urchins. Since many of the images in the original dataset had low contrast and color distortion, some low-quality photographs were chosen for augmentation, and the dataset was enhanced using the CLAHE underwater image enhancement technique. The final dataset consists of 7930 photos in total, created by combining the enhanced and original photographs.
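The CLAHE step can be reproduced with OpenCV, for example by equalizing the lightness channel in LAB space (a common recipe; the clip limit and tile size here are typical defaults, not necessarily the values used for this dataset):

```python
import cv2

def clahe_enhance(bgr, clip_limit=2.0, grid=(8, 8)):
    """Contrast-limited adaptive histogram equalization on the L channel."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

enhanced = clahe_enhance(cv2.imread("underwater_sample.jpg"))
```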
The experimental platform uses an E5-2660 v4 CPU and a GTX 2080 GPU, and the computing environment is PyTorch 1.1.4.
6.2. Baseline Model Detection Experiment
This paper first conducted pre-experiments on underwater biological objects, selecting several typical models for comparison to demonstrate the superiority of YOLOv8n. To select lightweight versions of the various models, the smallest ResNet network, ResNet18, was chosen as the backbone for non-YOLO models, while for the YOLO series the smallest network of each version was selected (v3 and v4 do not provide lightweight variants).
From Table 1, it can be seen that the classic two-stage and one-stage object detection algorithms, even with their backbone networks replaced by the smallest version of ResNet (i.e., ResNet18), still have a much higher computational complexity (FLOPs) than the lightweight models of the YOLO series. Among the lightweight YOLO models, YOLOv8n's computational complexity and model size are second only to YOLOv5n's, but its detection accuracy (mAP) is much higher. Therefore, this paper chose YOLOv8n as the baseline model for the experiments.
6.3. Experimental Results after Compression Optimization
This paper first used the pre-trained large-scale model in the OFA network to conduct a structural search in this paper's experimental environment. Model accuracy (mAP) and model complexity (FLOPs) were used as the search objectives for 300 iterations to obtain the model parameters and model structure files. The OFA algorithm uses MobileNetV3 as its basic large model, and, after 500 iterations, the resulting subnetwork's mAP and complexity (FLOPs) were 72.3% and 2.4 GFLOPs, respectively.
The YOLOv8n backbone network was replaced with the one found by the repeated OFA search, and the model was then trained to obtain the results. To increase the model's accuracy and enhance the training procedure, the distillation architecture proposed in this paper was incorporated into training. The teacher network used in this paper is the YOLOv8s model.
From Table 2 and Table 3, as well as Figure 13, it can be concluded that the performance of the YOLOv8n network model improved markedly over the version before the OFA and distillation algorithms were adopted. In terms of model complexity, the compressed model exhibits a computational complexity of 7.4 GFLOPs and a MAC count of 2.7 G, which aligns closely with the smallest YOLOv5n model within the YOLO series. Compared to its pre-compression performance, complexity is reduced by 1.3 GFLOPs and the MAC count decreases by approximately 32%. Thanks to the compensatory effect of the distillation algorithm, the compressed YOLOv8 also achieves the best precision among the small models of the YOLO series. From Figure 13, it can be seen that the confidence level for each object increased by nearly 10 points; in actual deployment, the confidence threshold can therefore be raised to filter similar objects and improve generalization. Compared with YOLOv8n, the compressed YOLOv8 increased AP50, AP75, and mAP by 2.0%, 3.0%, and 1.9%, respectively. Considering computational complexity and model precision together, the compressed YOLOv8n model of this paper is currently the better tiny model in the YOLO family, and it is also the most suitable network for deployment on an edge computing platform.
7. Edge Computing Platform Applications
This chapter mainly studies the deployment and application of the object detection algorithm on the edge computing platform, using the Jetson Xavier NX (shown in Figure 14) as the deployment platform for the improved YOLOv8n. It also shows how the model was accelerated with TensorRT, which further improves detection speed, and how the server framework was used to implement video streaming and the post-processing of detection data.

TensorRT is a software development kit from NVIDIA (Santa Clara, CA, USA) for optimizing trained deep learning models to achieve high-performance inference. TensorRT includes a deep learning inference optimizer for trained models, as well as a runtime engine for execution, allowing deep learning models to run with higher throughput and lower latency. In this paper, the model file is first converted into an ONNX intermediate-format file, and this intermediate file is then compiled into an engine file using the TensorRT compiler. Afterward, the C++/Python API of TensorRT can be called to implement end-to-end model applications. The specific process is shown in Figure 15.
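A sketch of this ONNX-to-engine compilation with the TensorRT Python API (TensorRT 8.x style; the file names are placeholders, and the `trtexec` command-line tool offers an equivalent route):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov8n_compressed.onnx", "rb") as f:  # exported ONNX model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 suits the Jetson Xavier NX
engine = builder.build_serialized_network(network, config)

with open("yolov8n_compressed.engine", "wb") as f:
    f.write(engine)
```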
This paper uses the edge computing platform as an edge computing server that can complete data collection and computation in a production environment. Within a network programming framework, the model compiled by TensorRT provides the application services. This paper implements video stream pushing based on TensorRT and the Django framework; Django is a Python-based network service framework that can easily provide the required network services. The main process of the video streaming is as follows:
Using the OpenCV and V4L2 frameworks to capture images from USB-driven cameras;
Processing the image information using the detection model compiled by TensorRT, then pushing the processed frames to the FFmpeg process;
Using FFmpeg to push the video streams over the RTMP protocol (a real-time video streaming protocol), with the Nginx server acting as the proxy server; a condensed sketch of this loop is shown below.
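In this sketch, the RTMP URL and the `run_detector` inference call are placeholders; FFmpeg reads raw BGR frames from stdin and pushes an RTMP stream to Nginx:

```python
import subprocess
import cv2

W, H, FPS = 1280, 720, 25
RTMP_URL = "rtmp://localhost/live/stream"  # proxied by the Nginx server

def run_detector(frame):
    """Placeholder for TensorRT inference plus box drawing."""
    return frame

# FFmpeg: raw BGR frames in on stdin -> H.264 -> RTMP out.
ffmpeg = subprocess.Popen([
    "ffmpeg", "-f", "rawvideo", "-pix_fmt", "bgr24",
    "-s", f"{W}x{H}", "-r", str(FPS), "-i", "-",
    "-c:v", "libx264", "-preset", "ultrafast", "-f", "flv", RTMP_URL,
], stdin=subprocess.PIPE)

cap = cv2.VideoCapture(0, cv2.CAP_V4L2)  # USB camera via the V4L2 driver
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame = run_detector(cv2.resize(frame, (W, H)))
    ffmpeg.stdin.write(frame.tobytes())
```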
Through the Nginx service, the video stream can be accessed on a local area network or an external network. The V4L2 framework is a video-capture driver framework used in Linux systems; it is widely applied in embedded devices, mobile devices, and personal computers for video capture. FFmpeg is an open-source program for recording digital audio and video and converting them into streams. Nginx is a high-performance HTTP and reverse proxy server characterized by low memory consumption and strong concurrency, making it suitable for network IO processing on edge computing platforms. The control of the video streams is determined by the requests sent by the client to the Django framework.
From Figure 16, it can be seen that the application architecture of this paper is divided into four layers, with the program processing order running bottom-up. The first layer contains the basic tools of the edge node: the V4L2 driver collects image information, the TensorRT model performs image processing, and UNIX network IO is used for communication. The second layer consists of the FFmpeg tool for pushing video streams, the Django framework for providing network services, and the SQLite database for storing data. The third layer utilizes Nginx's ability to handle high concurrency. The fourth layer comprises the application services, which can play FLV videos through web pages or directly consume the RTMP video streams through other decoding tools.
Figure 17 illustrates the remote access application of the edge computing platform. The edge computing platforms can be deployed as edge servers in offshore shallow waters. Through the deployment of an optical fiber communication network underwater, a water–shore network covering both underwater and shore areas can be established. This setup facilitates information sharing and collaborative work between underwater devices and shore-based servers. The edge servers have the capability to process and analyze underwater data in real-time. They can detect and statistically analyze marine organisms near the shore, thus enabling quick responses to the requirements of onshore clients. This integration is suitable for intelligent marine ranching, intelligent marine ecological protection, and other scenarios.