Based on our proposed inference latency profiling model, we endow the Bayesian search with hardware-resource awareness. In the search process, accuracy and inference latency are both taken as search targets. By finding the Pareto-optimal front, the search direction is guided, and a CNN with a trade-off between accuracy and inference latency for the edge device is obtained. Moreover, since inference latency is an additional search target, the search space is further narrowed and the search efficiency is enhanced.
4.1. The Kernel of Gaussian Process in Bayesian Search
Bayesian optimization is a model-based hyperparameter optimization method in which a Gaussian process (GP) is adopted to update the posterior distribution of an objective function, by continuously adding sample points, without knowing the internal structure of the objective function. It is usually defined as:

$$x^* = \arg\max_{x \in S} f(x)$$

From the above equation, it can be seen that the goal of Bayesian optimization is to find the $x$ in the candidate set $S$ that maximizes $f(x)$. When Bayesian optimization is applied to searching for a CNN, each input $x$ is a CNN structure, and its corresponding output $f(x)$ is the accuracy and inference latency of this structure. Here, $f(x)$ is assumed to follow a Gaussian distribution as its prior, and new CNNs are gradually added as samples to update this distribution. These new CNNs are obtained from previously generated network structures through network morphism, such as widening layers or increasing the number of layers. By continuously correcting the assumed prior distribution according to the newly added CNNs, we finally approach the real distribution and obtain a CNN with a trade-off between accuracy and inference latency that can be effectively deployed on edge devices.
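As a minimal illustration of this update loop, the sketch below uses a toy one-dimensional stand-in for the architecture space (real architectures are not points on a line, and the edit-distance kernel of the next subsection replaces the squared-exponential kernel used here). The UCB acquisition rule is also an assumption, since the section does not name an acquisition function:

```python
import numpy as np

def k(x1, x2, scale=1.0):
    # Squared-exponential kernel on a 1-D toy space; in the paper, the
    # edit-distance kernel over CNN structures plays this role.
    return np.exp(-((x1 - x2) ** 2) / scale)

def gp_posterior(X_train, y_train, X_cand, noise=1e-6):
    # Standard GP posterior mean and variance at the candidate points.
    K = k(X_train[:, None], X_train[None, :])
    K_s = k(X_cand[:, None], X_train[None, :])
    K_inv = np.linalg.inv(K + noise * np.eye(len(y_train)))
    mu = K_s @ K_inv @ y_train
    var = 1.0 - np.sum((K_s @ K_inv) * K_s, axis=1)
    return mu, var

def next_candidate(X_train, y_train, X_cand, beta=2.0):
    # Upper-confidence-bound acquisition (an assumed choice).
    mu, var = gp_posterior(X_train, y_train, X_cand)
    return X_cand[np.argmax(mu + beta * np.sqrt(np.maximum(var, 0.0)))]

X_train = np.array([0.0, 1.0])      # already-evaluated "structures"
y_train = np.array([0.80, 0.90])    # their observed objective values
X_cand = np.array([0.25, 0.5, 1.5, 2.0])
print(next_candidate(X_train, y_train, X_cand))
```

Each round, the chosen candidate would be evaluated (trained and profiled), appended to the training set, and the posterior re-fit, which is the "continuously adding sample points" step described above.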
Furthermore, the GP is applied to update the Gaussian distribution. In addition, to make Bayesian optimization more suitable for network morphism, the GP is redesigned as in [33]. In traditional Bayesian optimization, the GP is usually adopted in Euclidean space. However, neural network architectures do not lie in a Euclidean space and are difficult to parameterize into fixed-length vectors, so it is impractical to vectorize them directly. Therefore, an edit-distance-based GP kernel is applied, which measures how many operations are needed to transform an existing CNN into a new one.
Assuming $f_a$ and $f_b$ are two CNN structures, the kernel is defined as follows:

$$\kappa(f_a, f_b) = e^{-\rho^2\left(d(f_a, f_b)\right)}$$

where $d(f_a, f_b)$ is the edit-distance between the two CNN structures, and $\rho$ is a mapping function that maps the distance in the original metric space to the corresponding distance in the new space. Calculating the edit-distance of two neural networks can be seen as calculating the edit-distance of two graphs, and its approximate solution is denoted by:
$$d(f_a, f_b) = D_l(L_a, L_b) + \lambda D_c(C_a, C_b)$$

where $D_l$ is the edit-distance for morphing the layers and $D_c$ is the edit-distance for morphing the connections between layers; $L_a$ and $L_b$ are the sets of layers of CNNs $f_a$ and $f_b$, and $C_a$ and $C_b$ are their layer-connection sets. In addition, $D_l$ and $D_c$ are calculated by:

$$D_l(L_a, L_b) = \min_{\varphi_l} \sum_{i} d_l\big(l_a^{(i)}, \varphi_l(l_a^{(i)})\big) + \Big|\, |L_b| - |L_a| \,\Big|$$

and

$$D_c(C_a, C_b) = \min_{\varphi_c} \sum_{i} d_c\big(c_a^{(i)}, \varphi_c(c_a^{(i)})\big) + \Big|\, |C_b| - |C_a| \,\Big|$$

where $\varphi_l$ and $\varphi_c$ are the layer and connection injective matching functions, $d_l$ is the edit-distance of widening one layer into another, and $d_c$ is the edit-distance between two matched connections.
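Under the simplifying assumptions that a CNN is summarized by a list of layer widths plus a set of skip connections, a sketch of this kernel might look like the following. Note the shortcuts: layers are matched greedily in order rather than minimizing over injective matchings, unmatched connections cost 1, and $\rho$ is taken as the identity rather than a learned metric-space mapping:

```python
import math

def layer_distance(widths_a, widths_b):
    # Simplified D_l: in-order matching with relative widening cost,
    # plus the difference in layer counts (greedy, not the true min
    # over injective matchings).
    cost = sum(abs(wa - wb) / max(wa, wb)
               for wa, wb in zip(widths_a, widths_b))
    return cost + abs(len(widths_b) - len(widths_a))

def connection_distance(conns_a, conns_b):
    # Simplified D_c: each unmatched skip connection costs 1.
    matched = len(conns_a & conns_b)
    return (len(conns_a) - matched) + (len(conns_b) - matched)

def edit_distance(widths_a, conns_a, widths_b, conns_b, lam=1.0):
    # d(f_a, f_b) = D_l + lambda * D_c
    return (layer_distance(widths_a, widths_b)
            + lam * connection_distance(conns_a, conns_b))

def kernel(d):
    # kappa(f_a, f_b) = exp(-rho^2(d)); rho is the identity here.
    return math.exp(-d ** 2)

# Two hypothetical CNNs: (layer widths, skip-connection set).
a = ([64, 64, 128], {(0, 2)})
b = ([64, 128, 128, 256], {(0, 2), (1, 3)})
d = edit_distance(a[0], a[1], b[0], b[1])
print(d, kernel(d))
```

A small distance (structurally similar networks) yields a kernel value near 1, so the GP treats their accuracy/latency observations as strongly correlated, which is what lets one morphed network inform predictions about its neighbors.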
Through the above kernel function, the GP model can be updated and applied to guide CNN generation in the next round of the search process. Next, the details of the CNN model selection are given.
4.2. Pareto-Bayesian Search for Edge Computing
When searching for high-accuracy and low-latency CNNs for an edge device based on its computation resources, the search process is constrained along two dimensions. For a CNN structure, high accuracy and low latency conflict with each other: to obtain high accuracy, the structure must grow deeper and wider, whereas low latency can only be achieved with a structure of limited scale. Therefore, during our search process, these two objectives are optimized at the same time to guide the search direction, and finally a CNN with a trade-off between accuracy and inference latency is obtained.
Two commonly used multi-objective optimization methods are as follows: (1) converting the multi-objective problem into a mono-objective problem, where the key is how to combine the different objectives and assign them appropriate weights; and (2) Pareto optimality. Compared with the first method, Pareto optimality requires no manual intervention, so its results are more objective and accurate. Therefore, in the search process, Pareto optimality is chosen to find the trade-off between accuracy and latency.
Pareto optimal. In Pareto optimality, assume there are two objective functions $f_1$ and $f_2$:

Dominant solution: if the values of $f_1$ and $f_2$ for solution a are better than those for solution b, then solution a is superior to solution b; in other words, solution a dominates solution b.

Non-dominated solution: if no other solution is better than solution a, that is, solution a is not dominated by any other solution, then solution a is called a non-dominated solution or a Pareto solution.

Pareto front: the set of all non-dominated solutions; none of the solutions in the Pareto front is dominated by any other solution. An example of a Pareto front is shown in Figure 3.
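For the two objectives used here, the definitions above translate directly into code. The following sketch extracts the Pareto front from hypothetical (accuracy, latency) tuples, where higher accuracy and lower latency are better; the sample values are illustrative only:

```python
def pareto_front(points):
    # A point is on the front iff no other point is at least as good
    # in both objectives and strictly better in one.
    front = []
    for acc, lat in points:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for a, l in points
        )
        if not dominated:
            front.append((acc, lat))
    return front

# Hypothetical (accuracy, latency-in-ms) results for five searched CNNs.
cnns = [(0.91, 40.0), (0.89, 25.0), (0.93, 80.0),
        (0.88, 30.0), (0.90, 55.0)]
print(pareto_front(cnns))
# (0.88, 30.0) is dominated by (0.89, 25.0), and (0.90, 55.0)
# by (0.91, 40.0); the remaining three tuples form the front.
```

The quadratic scan is fine for the modest number of networks evaluated per search round; a sort-based sweep would be the usual optimization for larger sets.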
Pareto-Bayesian search. In the Bayesian search, a tree-structured search process is applied, as shown in Figure 4: starting from an initialized structure and expanding it, all of a node's children in the tree are the structures obtained after network morphism. The search process continuously selects nodes in the tree and expands them, finally obtaining a network structure that meets the requirements. When Pareto optimality is applied to the search process, in each round the newly morphed networks are trained, and their accuracy and inference latency are obtained through the profiling model. Each accuracy-latency pair is saved as a tuple, and finding the Pareto front of these tuples yields the Pareto-optimal results. Then, Bayesian optimization is performed on the searched network structures in the Pareto front, and these networks are sorted comprehensively by their accuracy and latency. Apart from that, A* search [40] is applied to determine which network structure will be morphed.
A* search is an efficient tree-structured search algorithm. On the tree structure, it maintains a priority queue and keeps expanding the best node in the queue.

With A* search, the priorities of the searched networks are obtained based on their accuracy and latency, and the networks are added to the priority queue. The network with the highest priority is taken out and expanded by network morphism, and then the priorities of the morphed networks are calculated. According to the acceptance function of the simulated annealing algorithm, it is determined whether the newly morphed networks are better than their parent node. If so, the new network is added to the top of the priority queue and is also taken as the starting point of the next network morphism.
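The expand-score-accept loop above can be sketched with a heap-based priority queue. The linear scalarization in `priority` is an assumption (the text only says networks are sorted "comprehensively" by accuracy and latency), and the child metrics are faked stand-ins for training and profiling:

```python
import heapq
import math
import random

def priority(acc, latency_ms, alpha=0.5):
    # Assumed scalarization of the two objectives; lower value means
    # higher priority so it works directly with heapq's min-heap.
    return -(alpha * acc - (1.0 - alpha) * latency_ms / 100.0)

def accept(parent_p, child_p, temperature):
    # Simulated-annealing acceptance: always keep an improvement,
    # otherwise accept with probability exp(-delta / T).
    if child_p <= parent_p:
        return True
    return random.random() < math.exp(-(child_p - parent_p) / temperature)

queue = [(priority(0.88, 35.0), "init")]   # (priority, network id)
T = 1.0
for _ in range(3):
    parent_p, net = heapq.heappop(queue)    # best node in the queue
    # morph(net) would generate children; their metrics are faked here.
    for acc, lat in [(0.90, 50.0), (0.87, 20.0)]:
        child_p = priority(acc, lat)
        if accept(parent_p, child_p, T):
            heapq.heappush(queue, (child_p, net + "->m"))
    heapq.heappush(queue, (parent_p, net))  # parent remains searchable
    T *= 0.5                                # temperature decays toward 0
print(queue[0])
```

Because expanded parents are pushed back onto the heap, earlier nodes stay eligible for morphism in later rounds, and the decaying temperature makes uphill acceptances increasingly rare, matching the local-optimum-avoidance behavior described below.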
It can be seen from Figure 4 that, in the first round of search, the initialized CNN is morphed into new CNNs along the morphism direction determined by the GP model. Then, we obtain the Pareto front of the accuracy and inference latency of these networks and sort them with the A* search. Furthermore, node 1 is selected after the first round of the search. Here, the structural features and morphism selections of node 1 are applied to update the GP model. Since node 1 satisfies the high-accuracy and low-latency search objectives, the changes in accuracy and resource consumption caused by the changes in network structure in this round are reflected in the Gaussian distribution and guide the subsequent network morphism. This shows that when the search algorithm has hardware-resource awareness, the search direction can be regulated. Later, in the second round of the search process, network morphism is performed on the basis of node 1, and the above operations are repeated. In addition, in each round of the search process, all the morphed CNNs are sorted, and each node has the opportunity to be selected for network morphism. Therefore, in round 4, node 3 is selected to be morphed rather than simply expanding the child nodes of node 2.

Through the above search method, by continuously evaluating both the CNNs that have already been selected and the new CNNs obtained by network morphism, we can understand more clearly how the operation selections and structure expansions made during network morphism affect accuracy and inference latency, and thus trade the two off against each other. Based on this, the GP model can be updated, and the adaptive relationship between operation selections and resource consumption is recorded. After the GP model has been updated multiple times, it develops a certain preference among network morphisms. At this point, the GP model is adopted to determine the direction of the network search, further guiding the search process more efficiently. In addition, the A* search can morph networks from nodes that are known but not yet searched, instead of only expanding the current optimal node, and the simulated annealing algorithm assigns an acceptance probability that is time-varying and tends to zero. The combination of these two methods prevents the search process from falling into a local optimum.
Compared with a search process in which only accuracy is set as the search target, taking inference latency as an additional target avoids unchecked growth of the network structure in pursuit of higher accuracy, which suppresses the growth of the tree structure and reduces the number of nodes to be searched. In addition, with two dimensions of search targets, the search direction shrinks into a narrower range, and a clearer search direction further helps to reduce the number of networks that need to be evaluated. Therefore, the Pareto-Bayesian search can obtain a CNN with high accuracy and low latency that is better suited for deployment on the edge device, while also improving the search efficiency.