In this section, we use the TrueSkill algorithm to refine the evaluation function that assesses network performance. Within the TrueSkill component, we first move down the gray arrow to update the prior probability, as shown in Figure 1. Two distinct networks are pitted against each other in a competition; given prior knowledge of each network's performance, we predict the outcome of the competition. By comparing the predicted outcome with the observed result, we then iteratively update the posterior probability along the orange arrow. The refined posterior probability is used to adjust the weighting of the two indicators in the evaluation function.
Following the TrueSkill refinement, we run the pruning algorithm to identify the optimal single-path network structure. We initialize the process with a super-network $\mathcal{N}$ comprising all possible edges and candidate operations. Then, leveraging the evaluation function, we compute the significance of each operation and prune the operation with the lowest significance. This iterative pruning continues until the network converges to a single-path configuration.
3.1. Training-Free Pruning-Based Architecture Search Algorithm
The primary challenge faced by the majority of Neural Architecture Search (NAS) methods is the substantial training cost involved. To illustrate, consider a network built from recurrently stacked Directed Acyclic Graph (DAG) cells, where each cell contains $E$ edges and each edge offers a choice among $|\mathcal{O}|$ distinct candidate operations. In such a scenario, the search space expands exponentially to $|\mathcal{O}|^{E}$. Even if we employ sampling techniques to explore only a fraction of this vast space, with a sampling ratio $\rho$, the training burden remains formidable: to obtain the best-performing network, one would still need to train approximately $\rho\,|\mathcal{O}|^{E}$ distinct architectures. The computational demands of training such a multitude of architectures pose a significant obstacle in the NAS process.
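To make these magnitudes concrete, the following is a minimal Python sketch using NAS-Bench-201-style values (6 edges per cell, 5 candidate operations per edge); the sampling ratio here is a hypothetical placeholder:

```python
# Back-of-the-envelope NAS training cost, assuming NAS-Bench-201-style values.
num_ops = 5        # |O|: candidate operations per edge
num_edges = 6      # E: edges per cell

search_space = num_ops ** num_edges      # |O|^E = 15,625 cell architectures
sample_ratio = 0.01                      # hypothetical sampling ratio rho

print(f"full search space        : {search_space}")
print(f"architectures to train   : {int(sample_ratio * search_space)}")
print(f"pruning-based evaluations: {num_ops * num_edges}")  # |O| x E = 30
```

Even at a 1% sampling ratio, over a hundred architectures would still require full training.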
If a method could be devised to narrow the search space down to $|\mathcal{O}| \times E$ candidates, the associated search costs would be notably mitigated, particularly when dealing with extensive search spaces. Additionally, if we could circumvent the need for training results altogether, opting instead for a training-free method to evaluate the performance of neural architectures, further savings in computational cost could be achieved. Together, these two ideas have the potential to substantially improve the efficiency and scalability of NAS, enabling faster and more cost-effective exploration of architectural design spaces.
In the subsequent sections, we propose a training-free methodology for improving the efficiency of conventional NAS approaches. Our approach involves three key steps. First, we identify pertinent indicators that can effectively evaluate the quality of child architectures; these indicators serve as the metrics for assessing different architectural configurations. Second, we develop a method for balancing these indicators and integrating them into a single function that guides the architecture search, ensuring a comprehensive evaluation of architectural designs. Finally, we introduce a holistic training-free algorithm centered on pruning, which leverages the identified indicators to refine the search space and expedite the identification of optimal architectural configurations. Through this systematic approach, we seek to streamline the NAS process and mitigate the computational costs of traditional training-based methodologies.
3.2. Evaluation of Quality of Child Architectures
Recent research has scrutinized numerous facets of neural architectures to assess their performance. In our study, we select two pivotal indicators to steer the architecture search: trainability and expressivity. These indicators were chosen for their significance in characterizing the effectiveness and efficiency of neural architectures; by focusing on them, we aim to develop a comprehensive understanding of architectural design principles and optimize the search process accordingly.
3.2.1. Trainability of Neural Architecture
The ability of a neural network to be optimized using gradient descent [13,14,15] is known as its trainability. Lechao et al. (2019) [16] examined the relationship between the conditioning of the trainable parameters and the trainability of wide networks.
In the finite-width NTK, the kernel function is defined as the product of the Jacobian of the last layer's output evaluated at point $x$ with the transpose of the Jacobian evaluated at point $x'$:
$$\Theta(x, x') = J(x)\,J(x')^{\top},$$
where the Jacobian entry $J_{i,\theta}(x) = \partial f_i(x)/\partial \theta$ is the derivative of the output of the $i$-th neuron in the last output layer with respect to the parameter $\theta$.
Lechao et al. [16] characterized the relationship between the conditioning of the trainable parameters and the trainability of networks as follows: the training process can be represented by Equation (1),
$$\mu_t = \left(I - e^{-\eta \Theta t}\right)\mathcal{Y}, \quad (1)$$
where $\mu_t$ is the training output, $I$ is the identity matrix, $e^{(\cdot)}$ is the exponential function, $\eta$ is the learning rate, and $\Theta$ is the NTK of the trainable parameters. The eigenvalues $\lambda_0 \ge \lambda_1 \ge \dots \ge \lambda_m$ of $\Theta$ are used in this equation, and it is hypothesized that the maximum feasible learning rate scales as $\eta \sim 2/\lambda_0$, where $\lambda_0$ is the largest eigenvalue. When this scaling is plugged into Equation (1), it can be observed that the smallest eigenvalue will converge at a rate given by $e^{-2t/\kappa}$, where $\kappa = \lambda_0/\lambda_m$ is the condition number. Therefore, if the condition number of the NTK associated with a neural network increases, the network becomes untrainable, and the condition number is used as a metric for trainability.
It has been observed that when neural networks have large depths, the spectrum of $\Theta$ typically has one large eigenvalue, $\lambda_0$, followed by a gap that is much larger than the rest of the spectrum. The condition number of the NTK of a specific neural network $\mathcal{N}$, denoted $\kappa_{\mathcal{N}}$, is defined as the ratio of the largest eigenvalue to the smallest eigenvalue, as shown in Equation (2):
$$\kappa_{\mathcal{N}} = \frac{\lambda_0}{\lambda_m}. \quad (2)$$
This value can be calculated without any gradient descent or labels. The correlation between $\kappa_{\mathcal{N}}$ and the test accuracy of architectures in NAS-Bench-201 has been found to be negative, with a Kendall-tau correlation of −0.42, which suggests that minimizing $\kappa_{\mathcal{N}}$ could lead to finding neural architectures with high performance.
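As an illustration, the following is a minimal PyTorch sketch of how $\kappa_{\mathcal{N}}$ can be estimated for an untrained network on a random batch. Summing each sample's outputs to form one Jacobian row is a simplification we assume here, not necessarily the exact implementation:

```python
import torch

def ntk_condition_number(net, x):
    """Estimate kappa = lambda_max / lambda_min of the finite-width NTK
    Theta = J J^T for a small batch x of inputs."""
    out = net(x)                                   # (batch, num_classes)
    rows = []
    for i in range(out.shape[0]):
        # One Jacobian row per sample: d out_i / d theta over all parameters.
        grads = torch.autograd.grad(out[i].sum(), net.parameters(),
                                    retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                          # (batch, num_params)
    theta = J @ J.t()                              # finite-width NTK
    eigvals = torch.linalg.eigvalsh(theta)         # ascending eigenvalues
    return (eigvals[-1] / eigvals[0]).item()       # condition number kappa
```

No labels or gradient descent steps are required: a single forward-backward pass on a random batch suffices.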
3.2.2. Expressivity by Number of Linear Regions
To ensure a balanced approach in selecting operations for neural architectures, we adopt a second evaluation criterion, expressivity, intended to complement trainability during the architecture search.
The expressivity of a neural network measures its capability to represent complex functions [17,18]. We adopt the approach proposed by Xiong et al. (2020) [19], which evaluates the expressivity of a neural network based on the number of linear regions that a ReLU network can separate. Following Xiong et al. [19], expressivity is evaluated by counting activation patterns and linear regions, defined as follows:
Let $\mathcal{N}$ be a ReLU CNN with $L$ hidden convolutional layers. An activation pattern of $\mathcal{N}$ is a function $P$ that maps the set of neurons to $\{1, -1\}$, meaning that for each neuron $z$ in $\mathcal{N}$, we have $P(z) \in \{1, -1\}$. Given a fixed set of parameters $\theta$ in $\mathcal{N}$ and an activation pattern $P$, we define the region corresponding to $P$ and $\theta$ as follows:
$$\mathcal{R}(P, \theta) := \left\{ x \in \mathbb{R}^{D} : z(x; \theta) \cdot P(z) > 0 \text{ for every neuron } z \text{ in } \mathcal{N} \right\},$$
where $z(x;\theta)$ is the pre-activation of a neuron $z$. A linear region of $\mathcal{N}$ at $\theta$ is defined as a non-empty set $\mathcal{R}(P, \theta)$ for some activation pattern $P$. The number of linear regions of $\mathcal{N}$ at $\theta$ is represented as follows:
$$R_{\mathcal{N},\theta} := \#\left\{\mathcal{R}(P, \theta) : \mathcal{R}(P, \theta) \neq \emptyset \text{ for some activation pattern } P\right\}. \quad (3)$$
Furthermore, we define $R_{\mathcal{N}}$ as shown in Equation (4) when $\theta$ ranges over the distribution of parameters:
$$R_{\mathcal{N}} := \mathbb{E}_{\theta}\left[R_{\mathcal{N},\theta}\right]. \quad (4)$$
From Equation (3), we can deduce that the number of linear regions $R_{\mathcal{N},\theta}$ reflects the number of distinct activation patterns that the network can separate. We approximate the expected value of $R_{\mathcal{N}}$ by calculating its average:
$$\hat{R}_{\mathcal{N}} = \frac{1}{K}\sum_{k=1}^{K} R_{\mathcal{N},\theta_k}, \quad (5)$$
where $\hat{R}_{\mathcal{N}}$ is an approximation of the expected value of the number of linear regions and the $\theta_k$ are $K$ samples drawn from the distribution of parameters $\theta$.
The results of an experiment conducted by Chen et al. [3] showed a positive correlation between the number of linear regions and the test accuracy of neural architectures, with a Kendall-tau correlation of 0.5. Therefore, it can be inferred that maximizing the number of linear regions can lead to finding architectures with high performance.
3.2.3. Defining the Importance
As demonstrated above, $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ serve as two key indicators for assessing the quality of a neural architecture. $\kappa_{\mathcal{N}}$ exhibits a negative correlation with the architecture's test accuracy, whereas $R_{\mathcal{N}}$ demonstrates a positive correlation. However, in order to decide which operation to prune in each iteration, it becomes imperative to devise a method for combining these two indicators into a unified function. This combination facilitates informed decision-making during the pruning process, enabling the identification of optimal architectural configurations.
3.2.4. Performance Estimation
As shown in Figure 2, $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ favor different operations on NAS-Bench-201. Both select the convolution operation at a large ratio, but $\kappa_{\mathcal{N}}$ favors the skip-connect operation relative to the convolution operation, while $R_{\mathcal{N}}$ favors the convolution operation at a much higher ratio than the skip-connect operation. If there is a conflict, where an operation that $\kappa_{\mathcal{N}}$ favors is not one that $R_{\mathcal{N}}$ favors, how can we determine which operation should be pruned?
Chen et al. (2021) [3] made a rough trade-off by directly using the two relative rankings of $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ as the selection criterion. More specifically, they sorted $\kappa_{\mathcal{N}}$ in descending order and $R_{\mathcal{N}}$ in ascending order: the higher an operation's $\kappa_{\mathcal{N}}$ value, the more likely it was to be pruned, and the lower its $R_{\mathcal{N}}$ value, the more likely it was to be pruned. They pruned the operation with the lowest sum rank of $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$. This indeed works in some specific cases, but the algorithm ignores two important facts: (1) trainability may not be as important as expressivity; and (2) the preferences of $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ may differ across types of datasets; for example, a neural network that performs well on an image dataset may not provide a good neural architecture if trained on text.
We assume that $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ have different importance coefficients depending on the quality of the neural architecture. The coefficients $\alpha$ and $\beta$ can be seen as the importance of $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$, respectively, to the performance of a neural network. Given these importances, we simplify the combined function as follows:
$$F(\mathcal{N}) = \beta R_{\mathcal{N}} - \alpha \kappa_{\mathcal{N}}. \quad (6)$$
We use $F(\mathcal{N})$ to estimate the quality of $\mathcal{N}$.
Understanding Equation (6) entails considering a neural architecture $\mathcal{N}$ composed of multiple candidate operations on its edges. If we prune one operation $o$ from $\mathcal{N}$, we use $\Delta F$ to represent the variation in $\mathcal{N}$'s estimated quality. The estimation of $F(\mathcal{N})$ is related to the two indicators $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$, so we can use a 2D plane to show the estimation state of $\mathcal{N}$, as in Figure 3.
Let us write the total derivative of $F$ as follows:
$$dF = \beta\, dR_{\mathcal{N}} - \alpha\, d\kappa_{\mathcal{N}}, \quad (7)$$
where $d\kappa_{\mathcal{N}}$ and $dR_{\mathcal{N}}$ are two independent unit vectors pointing in two directions. Since we do not know the scale of each direction, we use the two parameters $\alpha$ and $\beta$ to estimate them. The closer $\alpha$ and $\beta$ are to their ground truth, the less bias we suffer while pruning the candidate operations.
We consider two neural architectures, $\mathcal{N}_A$ and $\mathcal{N}_B$, which have at least one different operation on the same edge. We can use the combined function $F$ to evaluate the quality of $\mathcal{N}_A$ and $\mathcal{N}_B$ based on our current prior estimates of $\alpha$ and $\beta$ as follows:
$$F(\mathcal{N}_A) = \beta R_{\mathcal{N}_A} - \alpha \kappa_{\mathcal{N}_A}, \quad (8)$$
$$F(\mathcal{N}_B) = \beta R_{\mathcal{N}_B} - \alpha \kappa_{\mathcal{N}_B}. \quad (9)$$
3.2.5. Update the Importance
We initiate a competitive scenario in which two neural architectures are compared head-to-head, to check whether their actual performance agrees with our current estimation. We then use the outcome of this competition to update the estimation in reverse. This feedback loop continually improves the precision and reliability of the estimates, keeping them reflective of the practical performance exhibited by the neural architectures. If $F(\mathcal{N}_A) > F(\mathcal{N}_B)$, then we deduce that $\mathcal{N}_A$ should perform better than $\mathcal{N}_B$. We simplify the inequation $F(\mathcal{N}_A) > F(\mathcal{N}_B)$ as follows:
$$\beta\left(R_{\mathcal{N}_A} - R_{\mathcal{N}_B}\right) > \alpha\left(\kappa_{\mathcal{N}_A} - \kappa_{\mathcal{N}_B}\right). \quad (10)$$
According to Inequation (10), comparing $F(\mathcal{N}_A)$ and $F(\mathcal{N}_B)$ is equivalent to comparing $\beta\,\Delta R$ and $\alpha\,\Delta\kappa$, which means we can use the competition result to update $\alpha$ and $\beta$, where $\Delta R = R_{\mathcal{N}_A} - R_{\mathcal{N}_B}$ and $\Delta\kappa = \kappa_{\mathcal{N}_A} - \kappa_{\mathcal{N}_B}$ can be calculated as constants in each competition. By making architectures compete with each other multiple times, we can determine the importances more precisely. We propose a new method to combine $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ with their importances $\alpha$ and $\beta$ by applying the TrueSkill algorithm, as given in Algorithm 1.
In Algorithm 1, $\varepsilon$ is a constant used to measure the instability of the performance of the two competitors, and $v$ and $w$ are the additive and multiplicative correction terms for the mean and variance of a (doubly) truncated Gaussian, respectively:
$$v(t, \varepsilon) = \frac{\phi(t - \varepsilon)}{\Phi(t - \varepsilon)}, \qquad w(t, \varepsilon) = v(t, \varepsilon)\left(v(t, \varepsilon) + t - \varepsilon\right),$$
where $\phi$ and $\Phi$ denote the density and cumulative distribution functions of the standard Gaussian.
While the ELO algorithm remains a widely adopted method for evaluating performance, we opt for the TrueSkill algorithm in our approach for several reasons:
- TrueSkill can handle draws.
- TrueSkill can handle multi-player situations; although only two factors are considered here, we may use more factors in the future.
- TrueSkill has a rigorous theoretical foundation.
- TrueSkill converges faster than ELO.
- TrueSkill requires little fine-tuning of hyperparameters.
Algorithm 1 Measurement of $\alpha$ and $\beta$
Input: two different neural architectures $\mathcal{N}_A$ and $\mathcal{N}_B$; current estimates $(\mu_\alpha, \sigma_\alpha)$ and $(\mu_\beta, \sigma_\beta)$ for the importances of $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$
Initialization: $acc_A \leftarrow \mathrm{TRAIN}(\mathcal{N}_A)$; $acc_B \leftarrow \mathrm{TRAIN}(\mathcal{N}_B)$; if $acc_A > acc_B$, then $\mathcal{N}_A$ wins; otherwise, $\mathcal{N}_B$ wins. Let $c^2 = 2\varepsilon^2 + \sigma_\alpha^2 + \sigma_\beta^2$, $t = (\mu_{\mathrm{win}} - \mu_{\mathrm{lose}})/c$, $v = v(t, \varepsilon/c)$, $w = w(t, \varepsilon/c)$, where win and lose index the winning and losing importance according to the result
Based on the competition result $r$, update $\mu_{\mathrm{win}}$ and $\mu_{\mathrm{lose}}$ as $\mu_{\mathrm{win}} \leftarrow \mu_{\mathrm{win}} + \frac{\sigma_{\mathrm{win}}^2}{c} v$, $\mu_{\mathrm{lose}} \leftarrow \mu_{\mathrm{lose}} - \frac{\sigma_{\mathrm{lose}}^2}{c} v$; update each $\sigma^2$ as $\sigma^2 \leftarrow \sigma^2 \left(1 - \frac{\sigma^2}{c^2} w\right)$; return $(\mu_\alpha, \sigma_\alpha), (\mu_\beta, \sigma_\beta)$
TRAIN($\mathcal{N}$): train the neural architecture $\mathcal{N}$ with the dataset and return the final test accuracy.
3.2.6. Pruning Network
Traditional network pruning typically requires pre-trained weights and specific strategies to ensure successful learning of the pruned structure. However, recent advancements in pruning-from-scratch [20,21] have shown that the structure of pruned models can be learned directly from randomly initialized weights without sacrificing performance. This breakthrough enables network pruning without pre-trained weights, streamlining the pruning process and reducing dependencies on external resources.
Inspired by the prune-from-scratch approach, we extend our methodology to propose a pruning-operation-by-importance NAS method aimed at reducing the search space and enhancing search efficiency. In our approach, we iteratively prune the candidate operation with the least importance on each edge in a single round. This strategy effectively reduces the algorithmic complexity from $O(|\mathcal{O}|^{E})$ to $O(|\mathcal{O}| \times E)$. We encapsulate our training-free, pruning-based algorithm, referred to as TTNAS, in Algorithm 2. This framework streamlines the search process and facilitates efficient exploration of architectural configurations.
Algorithm 2 Training-free TrueSkill NAS
Input: $(\mu_\alpha, \sigma_\alpha)$, $(\mu_\beta, \sigma_\beta)$, super-network $\mathcal{N}$ stacked by cells; each cell has $E$ edges, and each edge has $|\mathcal{O}|$ candidate operations
Initialization: step $t \leftarrow 0$
1: while $\mathcal{N}$ is not a single-path network do
2: for each operation $o$ in $\mathcal{N}$ do
3: $\Delta\kappa_o \leftarrow \kappa_{\mathcal{N}} - \kappa_{\mathcal{N} \setminus o}$
4: $\Delta R_o \leftarrow R_{\mathcal{N}} - R_{\mathcal{N} \setminus o}$
5: end for
6: update $(\mu_\alpha, \sigma_\alpha)$ and $(\mu_\beta, \sigma_\beta)$ by Algorithm 1
7: $\alpha \leftarrow \mu_\alpha$; $\beta \leftarrow \mu_\beta$
8: for each edge $e$ do
9: $F(o) \leftarrow \beta\,\Delta R_o - \alpha\,\Delta\kappa_o$ for each candidate operation $o$ on $e$
10: prune $o^{*} = \arg\min_{o \in e} F(o)$ from edge $e$
11: end for
12: $t \leftarrow t + 1$
13: end while
14: return pruned single-path network $\mathcal{N}$
In Algorithm 2, we start with a super-network $\mathcal{N}$ stacked from multiple repeated cells [22], which is composed of all possible edges and candidate operations. In each outer loop, we prune the operation with the least importance to the whole architecture $\mathcal{N}$ on each edge, until $\mathcal{N}$ is a single-path network. In the inner loop, we calculate the change in $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$ before and after pruning each candidate operation, evaluate each operation's importance with $F$, and prune the candidate operation with the lowest importance on each edge.
The entire pruning process is fast and efficient. As we illustrate in subsequent sections, our approach is methodical and adaptable, requiring minimal modification for application across diverse architecture spaces and datasets. The prune-operation-by-importance mechanism can also be seamlessly extended beyond the indicators $\kappa_{\mathcal{N}}$ and $R_{\mathcal{N}}$: additional indicators or network properties can be incorporated, enhancing the versatility and applicability of our approach in various contexts and underscoring its potential as a robust solution for efficient neural architecture refinement.
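A condensed Python sketch of the prune-operation-by-importance loop follows. Here `indicator_delta` is a hypothetical helper returning $(\Delta\kappa_o, \Delta R_o)$ for pruning operation $o$, and the per-round TrueSkill re-estimation of $\alpha$ and $\beta$ is omitted for brevity:

```python
def prune_to_single_path(edges, indicator_delta, alpha, beta):
    """edges: dict mapping edge id -> list of candidate operations.
    Repeatedly remove the least important operation on each edge until
    every edge keeps exactly one operation (a single-path network)."""
    while any(len(ops) > 1 for ops in edges.values()):
        for edge_id, ops in edges.items():
            if len(ops) == 1:
                continue                          # edge already decided
            def importance(op):
                d_kappa, d_regions = indicator_delta(op)
                return beta * d_regions - alpha * d_kappa   # F-based importance
            ops.remove(min(ops, key=importance))  # prune least important op
    return {edge_id: ops[0] for edge_id, ops in edges.items()}
```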
3.3. TTNAS with Multiple Indicators
In this section, we introduce an indicator for evaluating the performance of neural architectures, namely discriminability. This section is structured as follows: First, we will elucidate the concept of discriminability in the context of neural networks and detail the methodology for calculating this metric. Following this, we will explore how the TrueSkill algorithm can be employed to ascertain the relative importance of discriminability within the evaluation framework. We will then discuss how to integrate discriminability with existing indicators to provide a comprehensive assessment of neural network performance. Lastly, we will outline how the prune-based search strategy of the training-free TrueSkill NAS can be adapted to incorporate this new evaluation method, thereby enhancing the search process for optimal network architectures.
3.3.1. Introduction to Discriminability
Discriminability, within the context of neural network performance evaluation, refers to the capability of a network to differentiate between distinct classes or inputs with a high degree of accuracy. This property directly influences the network's performance on classification tasks and other applications requiring precise distinctions. Explicitly defined, a robust and flexible network should not only be adept at discerning local linear operators for each individual data point but should also exhibit consistency in its outputs for similar data points. This dual capability ensures that the network can handle complex patterns while remaining coherent in its decision-making.
The ideal scenario would involve an untrained network that exhibits low correlations between different data points, while data points belonging to the same category cluster closely together, indicating a high degree of within-category similarity. This clustering eases the learning process during training, since the network encounters fewer challenges in distinguishing between categories and can therefore learn and generalize more effectively.
In this ideal scenario, the network's treatment of similar data points within the same category is balanced with its capacity to differentiate between distinct categories. This balance is essential for achieving high discriminability: it ensures that the network can accurately classify new, unseen data points based on the patterns learned during training, and it directly impacts the network's overall performance and reliability in real-world applications.
To further illustrate the concept, consider a neural network tasked with classifying images of animals. A network with high discriminability would accurately distinguish images of cats from images of dogs even when the images differ only in subtle details, and it would consistently classify images of similar animals (e.g., different breeds of dogs) correctly, demonstrating its ability to handle fine-grained distinctions.
3.3.2. Assessing the Discriminability of Neural Networks
We base our approach on a fundamental assumption: different networks can be effectively compared by evaluating their behavior using local linear operators at various data points. This assumption stems from the premise that the local behavior of neural networks can provide valuable insights into their overall performance and capabilities. By examining how networks respond to small perturbations in the input space, we can gain a deeper understanding of their decision-making processes and identify key differences between various network architectures.
To do this, we define a linear map $w_i: \mathbb{R}^{D} \to \mathbb{R}$, which maps the input $x_i$ through the network $f(x_i)$, where $x_i$ represents an image that belongs to a batch $X$ and $D$ is the input dimension. The linear map can be computed as follows:
$$w_i = \frac{\partial f(x_i)}{\partial x_i}.$$
In order to evaluate how a network behaves across different data points, we calculate the Jacobian matrix $J$ for the data points of the batch $X$:
$$J = \left(\frac{\partial f(x_1)}{\partial x_1},\ \frac{\partial f(x_2)}{\partial x_2},\ \dots,\ \frac{\partial f(x_N)}{\partial x_N}\right)^{\top}.$$
The Jacobian matrix $J$ contains information about the network output with respect to the input for several data points. We can then evaluate how points belonging to the same class correlate with each other.
To evaluate the discriminability of neural networks, we assess the correlation of the rows of $J$ with respect to their classes by computing a covariance matrix for each class present in $J$. This allows us to quantify the relationships between the Jacobian values and the class labels, providing insight into the network's ability to distinguish between different classes:
$$\Sigma_{J,c} = \left(J_c - M_{J,c}\right)\left(J_c - M_{J,c}\right)^{\top},$$
where $M_{J,c}$ is the matrix with elements:
$$\left(M_{J,c}\right)_{i,j} = \frac{1}{N_c}\sum_{n=1}^{N_c}\left(J_c\right)_{n,j},$$
where $c$ represents the class, $c \in \{1, \dots, C\}$, and $C$ is the number of classes present in the batch.
Then, it is possible to calculate the correlation matrix per class, $\Sigma_{J,c}^{\mathrm{corr}}$, for each covariance matrix $\Sigma_{J,c}$:
$$\left(\Sigma_{J,c}^{\mathrm{corr}}\right)_{i,j} = \frac{\left(\Sigma_{J,c}\right)_{i,j}}{\sqrt{\left(\Sigma_{J,c}\right)_{i,i}\left(\Sigma_{J,c}\right)_{j,j}}}.$$
Each individual correlation matrix provides a comprehensive analysis of the behavior of an untrained network across different classes. This analysis serves as a critical indicator of the local linear operators’ capability to distinguish between various class characteristics.
To facilitate the comparison among the various individual correlation matrices, which may exhibit varying sizes due to the differing number of data points per class, each matrix is evaluated separately:
$$E_c = \sum_{i=1}^{N_c}\sum_{j=1}^{N_c} \log\left(\left|\left(\Sigma_{J,c}^{\mathrm{corr}}\right)_{i,j}\right| + k\right),$$
where $k$ is a small constant. We denote $N_c$ as the number of elements of the set $\{x_i \in X : \mathrm{class}(x_i) = c\}$.
Finally, we use $S$ to evaluate the discriminability of neural networks:
$$S = \sum_{c=1}^{C} E_c,$$
where $E = (E_1, \dots, E_C)$ is the vector containing all correlation matrices' scores.
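A minimal PyTorch sketch of this score; the per-class centering, normalization, and the final aggregation of $E$ into $S$ follow our reading of the equations above and should be taken as assumptions:

```python
import torch

def discriminability_score(net, x, labels, k=1e-5):
    """Score S from per-class correlations of input Jacobians."""
    x = x.clone().requires_grad_(True)
    net(x).sum().backward()                      # scalar output per data point
    J = x.grad.reshape(x.shape[0], -1)           # one Jacobian row per point
    scores = []
    for c in labels.unique():
        Jc = J[labels == c]
        Jc = Jc - Jc.mean(dim=0, keepdim=True)   # subtract per-class mean M_Jc
        cov = Jc @ Jc.t()                        # per-class covariance
        d = torch.sqrt(torch.diag(cov)) + 1e-9   # guard against zero variance
        corr = cov / (d[:, None] * d[None, :])   # per-class correlation matrix
        scores.append(torch.log(corr.abs() + k).sum())  # per-class score E_c
    return torch.stack(scores).sum().item()      # aggregate scores into S
```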
3.3.3. Evaluation Method with Three Indicators
With an additional indicator included, we have three indicators: trainability $\kappa_{\mathcal{N}}$, expressivity $R_{\mathcal{N}}$, and discriminability $S_{\mathcal{N}}$.
We define the performance of three different neural networks, $\mathcal{N}_A$, $\mathcal{N}_B$, and $\mathcal{N}_C$, as follows:
$$F(\mathcal{N}_X) = \beta R_{\mathcal{N}_X} + \gamma S_{\mathcal{N}_X} - \alpha \kappa_{\mathcal{N}_X}, \quad X \in \{A, B, C\}.$$
We let the three neural networks compete with each other. An overview of the TrueSkill algorithm with the three indicators is shown in Figure 4.
Once two of the neural networks, e.g., $\mathcal{N}_A$ and $\mathcal{N}_B$, have competed, we fix one of the indicators and update the other two. Combined with the TrueSkill algorithm, we can update the equations of these three indicators.
Let $c^2 = 2\varepsilon^2 + \sigma_\alpha^2 + \sigma_\beta^2$, $t = (\mu_{\mathrm{win}} - \mu_{\mathrm{lose}})/c$, $v = v(t, \varepsilon/c)$, $w = w(t, \varepsilon/c)$, where $\varepsilon$ is a constant. According to the competition results of neural networks $\mathcal{N}_A$ and $\mathcal{N}_B$, we update $\mu_\alpha$, $\sigma_\alpha$, $\mu_\beta$, and $\sigma_\beta$ as follows:
After updating $\mu_\alpha$, $\sigma_\alpha$, $\mu_\beta$, and $\sigma_\beta$, we can update $\alpha$ and $\beta$ as follows:
Similarly, we update $\beta$ and $\gamma$ via the competition between the two neural networks $\mathcal{N}_B$ and $\mathcal{N}_C$:
Let $c^2 = 2\varepsilon^2 + \sigma_\beta^2 + \sigma_\gamma^2$, $t = (\mu_{\mathrm{win}} - \mu_{\mathrm{lose}})/c$, $v = v(t, \varepsilon/c)$, $w = w(t, \varepsilon/c)$, where $\varepsilon$ is a constant. According to the competition results of the two neural networks $\mathcal{N}_B$ and $\mathcal{N}_C$, we update $\mu_\beta$, $\sigma_\beta$, $\mu_\gamma$, and $\sigma_\gamma$.
After updating $\mu_\beta$, $\sigma_\beta$, $\mu_\gamma$, and $\sigma_\gamma$, we can update $\beta$ and $\gamma$ as follows:
We update $\alpha$ and $\gamma$ via the competition of the two neural networks $\mathcal{N}_A$ and $\mathcal{N}_C$: let $c^2 = 2\varepsilon^2 + \sigma_\alpha^2 + \sigma_\gamma^2$, $t = (\mu_{\mathrm{win}} - \mu_{\mathrm{lose}})/c$, $v = v(t, \varepsilon/c)$, $w = w(t, \varepsilon/c)$, where $\varepsilon$ is a constant. According to the competition results of the two neural networks $\mathcal{N}_A$ and $\mathcal{N}_C$, we update $\mu_\alpha$, $\sigma_\alpha$, $\mu_\gamma$, and $\sigma_\gamma$ with the same equations.
After updating $\mu_\alpha$, $\sigma_\alpha$, $\mu_\gamma$, and $\sigma_\gamma$, we can update $\alpha$ and $\gamma$ as follows:
With three indicators included in the TrueSkill training-free NAS, we first update the measurements of the three indicators, namely $\alpha$, $\beta$, and $\gamma$.
After updating $\alpha$, $\beta$, and $\gamma$, we use them to evaluate the performance of the neural network:
$$F(\mathcal{N}) = \beta R_{\mathcal{N}} + \gamma S_{\mathcal{N}} - \alpha \kappa_{\mathcal{N}}.$$
3.3.4. Prune-Based Search Strategy with Three Indicators
With the additional indicator taken into consideration, we summarize the evaluation method based on these indicators. We can conclude the following points regarding the prune-based search algorithm:
Similar to the training-free TrueSkill NAS with two indicators, in Algorithm 3 we initiate with a super-network $\mathcal{N}$ constructed by stacking multiple repeated cells [22]. This super-network encompasses all feasible edges and candidate operations. During each iteration of the outer loop, we systematically eliminate the operation with the least significance to the overall architecture $\mathcal{N}$ on each edge, continuing this process until $\mathcal{N}$ is reduced to a single-path network. Within the inner loop, we compute the variations in $\kappa_{\mathcal{N}}$, $R_{\mathcal{N}}$, and $S_{\mathcal{N}}$ before and after pruning each candidate operation. The importance of each operation is then assessed using $F$, and the candidate operation with the lowest importance on each edge is pruned accordingly.
Algorithm 3 Training-free TrueSkill NAS with three indicators
Input: $(\mu_\alpha, \sigma_\alpha)$, $(\mu_\beta, \sigma_\beta)$, $(\mu_\gamma, \sigma_\gamma)$, super-network $\mathcal{N}$ stacked by cells; each cell has $E$ edges, and each edge has $|\mathcal{O}|$ candidate operations
Initialization: step $t \leftarrow 0$
1: while $\mathcal{N}$ is not a single-path network do
2: for each operation $o$ in $\mathcal{N}$ do
3: $\Delta\kappa_o \leftarrow \kappa_{\mathcal{N}} - \kappa_{\mathcal{N} \setminus o}$
4: $\Delta R_o \leftarrow R_{\mathcal{N}} - R_{\mathcal{N} \setminus o}$
5: $\Delta S_o \leftarrow S_{\mathcal{N}} - S_{\mathcal{N} \setminus o}$
6: end for
7: update $(\mu_\alpha, \sigma_\alpha)$, $(\mu_\beta, \sigma_\beta)$, $(\mu_\gamma, \sigma_\gamma)$ via the pairwise TrueSkill competitions
8: $\alpha \leftarrow \mu_\alpha$; $\beta \leftarrow \mu_\beta$; $\gamma \leftarrow \mu_\gamma$
9: for each edge $e$ do
10: $F(o) \leftarrow \beta\,\Delta R_o + \gamma\,\Delta S_o - \alpha\,\Delta\kappa_o$ for each candidate operation $o$ on $e$
11: prune $o^{*} = \arg\min_{o \in e} F(o)$ from edge $e$
12: end for
13: $t \leftarrow t + 1$
14: end while
15: return pruned single-path network $\mathcal{N}$