1. Introduction
Artificial intelligence (AI) generally involves modeling intelligent behavior using a computer program with limited human interference [1]. AI manifests through numerous applications, such as autonomous driving, natural language processing, intelligent information retrieval, expert consulting systems, theorem proving systems, robotics, automatic programming, combinatorial and scheduling systems, and perception problems [2,3].
Machine learning (ML) is considered a subfield of AI [4]. However, some believe that only the intelligent part of ML should be regarded as a subset of AI [5]. Either way, both bodies of literature converge on the notion that in ML a computational algorithm is built that can make decisions or estimates based on training data, without being explicitly programmed to perform the task [4,6].
An ML model learns from experience for a given task if its performance on the task improves with experience [7]. Machine learning models are primarily classified into four groups: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [8]. Deep learning leads other machine learning tools in general imaging and computer vision [9].
Deep learning (DL) refers to techniques that build on artificial neural networks in which multiple network layers are added to increase the level of abstraction and performance [10]. For better generalizability, a good DL model is obtained through several processes: pre-processing of data, feature engineering, model generation, and model evaluation [11]. Model generation includes the design of DL models and their optimization during training. With transfer learning, model generation for state-of-the-art (SOTA) DL models is reduced to an optimization process.
The optimization process can involve either hyperparameter optimization (HPO) or architecture optimization (AO) [11]. The former focuses on tuning training-related parameters, for example, the batch size, learning rate, and number of training iterations. The latter deals with model-related parameters, for example, the number of hidden layers, filter size, and number of neurons per layer. Popular optimization methods include grid and random search [12,13], Bayesian optimization [14,15], and gradient-based optimization [16,17,18,19].
Standard DL models require determining both training- and model-related parameters before training. The model’s performance can be affected considerably by the choice of these parameters, yet finding good values is notoriously difficult [20]. The problem is exacerbated by the fact that some parameter values are integers (e.g., batch size), others are floating-point (e.g., learning rate), and others are categorical (e.g., optimizer). During training, identifying the right set of parameter values for better generalization remains an unclear procedure, hampering the replication of ML experiments.
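To illustrate how such mixed-type parameters can be expressed in practice, the following sketch defines a search space with KerasTuner (one of the libraries mentioned below); the model body, parameter ranges, and handling of the batch size are illustrative assumptions rather than the configuration used in this paper.

import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Mixed-type search space: integer (hidden units), floating-point (learning
    # rate), and categorical (optimizer). The batch size, also an integer, would
    # typically be tuned by overriding HyperModel.fit.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    lr = hp.Float("learning_rate", 1e-5, 1e-1, sampling="log")
    opt_name = hp.Choice("optimizer", ["adam", "sgd", "rmsprop"])
    optimizer = {"adam": tf.keras.optimizers.Adam(lr),
                 "sgd": tf.keras.optimizers.SGD(lr),
                 "rmsprop": tf.keras.optimizers.RMSprop(lr)}[opt_name]
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)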
Luckily, there has been growing interest in automating the ML pipeline, i.e., automated machine learning (AutoML), to free data scientists from burdensome tasks. Google introduced the publicly shared Cloud AutoML (https://cloud.google.com/automl/, accessed on 4 November 2022) systems, and others have released open-source optimization libraries, e.g., Hyperopt [21], Skopt [22], SMAC [23], and KerasTuner (https://keras.io/keras_tuner/, accessed on 29 June 2023). Similarly, the majority of the recent literature on AutoML for DL models concentrates on automating the design of DL models through the neural architecture search (NAS) process [24,25,26,27,28]. However, work on automating the optimization process after obtaining a candidate architecture through NAS is still insufficient.
To date, few studies on the automated optimization of DL models have been identified. Similarly, building high-quality DL models through the optimization process can be time-consuming for non-experts. These difficulties motivated the search for improved approaches to model optimization. Deliverables from such systems may contribute to ongoing efforts to free data scientists from the burdensome optimization task.
For that reason, this empirical work proposes to advance the understanding of this growing area of research by introducing automated optimization of DL models for computer vision, specifically image classification tasks. Four DL models, each with a different architecture, and prominent datasets were employed during the implementation of the proposed solution. The proposed solution was also experimented with on a transformer-based model to test its applicability to vision transformer models.
Data scientists, DL practitioners, experts, and non-experts working on image classification tasks can use the proposed solution without geographical limitations. A data scientist has control over the optimization attempts in the proposed solution, as opposed to using brute force and exhaustive searches. The main contributions of this paper can be summarized as follows:
Through a series of empirical tests, the association between two main training-related parameters, batch size and learning rate, and their impact on performance are explored;
A framework for automated optimization of the DL model for image classification tasks is presented;
An empirical demonstration showing that the proposed framework improves the performance of DL models is presented;
A set of training-related parameter values for better performance of DL models for further extensive empirical evaluations is recommended.
The remainder of this paper is structured as follows: Section 2 concerns materials and methods. Section 3 is dedicated to presenting results; subsequently, Section 4 discusses results and findings. Section 5 outlines a conclusion and some highlights of what to expect in future works.
2. Materials and Methods
This section summarizes the experimental setup; presents an overview of the selected deep learning models, the parameters used during automated optimization, the datasets, and the performance measures for empirical evaluation; and provides an overview of the problem at hand, followed by the optimization algorithm and the proposed framework.
2.1. Experimental Setup
Three sets of experiments were conducted in this study: the first to assess the association of learning rate and batch size; the second to assess the impact of optimizers; and the third to demonstrate the proposed solution. The first set involved twenty tests, the second set involved three, and the third involved three tests for each of the selected DL models. A total of 104 experimental runs were conducted.
All experiments were coded in Python and run in an online Jupyter Notebook executed on the Google Colab cloud computing platform with the TensorFlow framework. A graphics processing unit (GPU) offered on the Google platform was utilized to expedite the training process. At the beginning of each training session, Google offers up to 12.7 GB of RAM, 15 GB of GPU memory, and 78.2 GB of disk space, which are dynamically assigned and utilized.
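As a quick sanity check of this setup, a snippet along the following lines (illustrative, not part of the reported experiments) confirms that the Colab GPU runtime is visible to TensorFlow:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs available:", gpus)  # expect one GPU entry on a Colab GPU runtime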
2.2. Deep Learning Models
Four of the top state-of-the-art deep convolutional neural network (D-CNN) models, standard for image classification tasks, were selected in this work. The very deep convolutional network for large-scale image recognition (VGG-16) is beneficial as it generalizes well to databases other than ImageNet while achieving SOTA results [29]. GoogLeNet, or Inception, uses fewer computational resources and fewer parameters while achieving SOTA results [30]. In this paper, we considered Inception version V3.
Likewise, from the Residual Network (ResNet) family, ResNet50 was selected as it presents a novel architecture with skip connections that goes deeper by up to 152 layers (8 times deeper than VGG-19) while still maintaining low complexity [31]. The final selection was EfficientNet, which introduced a novel model scaling method called compound scaling [32]. This approach broke the mold of its predecessors, which previously used the conventional system of stacking many layers to scale up the model. EfficientNetB0 was considered in these experiments.
The four selected DL models achieved top results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and on ImageNet at different timeframes. In the ILSVRC 2014, VGG-16 was ranked second, and Inception was first. Likewise, ResNet50 was ranked first in the ILSVRC 2015. Lastly, EfficientNet achieved SOTA ImageNet results in 2019.
Each of the selected models presents a unique architecture that generalizes well on a variety of datasets (VGG-16), minimizes computational resources (Inception), gets deeper while reducing complexity (ResNet50), or scales up uniformly (EfficientNet). The diverse architectures of the selected models lay a foundation for most variants of D-CNN.
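Under transfer learning (as used in Section 2.8), these backbones can be instantiated with pre-trained ImageNet weights along the following lines; the 224 × 224 input size and the frozen-base setup are illustrative assumptions, not necessarily the exact configuration used here.

from tensorflow.keras import applications

backbones = {
    "VGG16": applications.VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "InceptionV3": applications.InceptionV3(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "ResNet50": applications.ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "EfficientNetB0": applications.EfficientNetB0(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
}
for base in backbones.values():
    base.trainable = False  # freeze the pre-trained layers before adding a classifier head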
A transformer-based model by Dosovitskiy et al. [33] was used to test the applicability of the proposed approach to vision transformers (ViT). The model contains multiple transformer blocks with a multi-head attention layer applied to the sequence of image patches. The final output of each transformer block is flattened and used as the representation of the image input to the classifier head. The classifier head uses a softmax activation to produce the output probability of each class.
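For concreteness, a compact Keras sketch of such a model is given below (patch sequence, transformer blocks, flatten, softmax head); the layer sizes are illustrative assumptions, and position embeddings are omitted for brevity, so this is not the exact configuration evaluated in this paper.

import tensorflow as tf
from tensorflow.keras import layers

def build_vit(input_shape=(72, 72, 3), patch_size=6, projection_dim=64,
              num_heads=4, transformer_layers=4, num_classes=10):
    inputs = layers.Input(shape=input_shape)
    num_patches = (input_shape[0] // patch_size) ** 2
    # Split the image into patches and linearly project each patch.
    patches = layers.Conv2D(projection_dim, kernel_size=patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, projection_dim))(patches)
    for _ in range(transformer_layers):
        # Multi-head self-attention over the sequence of patches.
        attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)(x, x)
        x = layers.LayerNormalization()(x + attn)
        mlp = layers.Dense(projection_dim * 2, activation="gelu")(x)
        mlp = layers.Dense(projection_dim)(mlp)
        x = layers.LayerNormalization()(x + mlp)
    # Flatten the final representation and classify with a softmax head.
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)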
In this paper, deep learning models based on ConvNets were mainly considered in the experiments. However, in recent years, the popularity of transformer-based models for computer vision applications has been growing. Yet recent ConvNets can be as robust and reliable as, or in some cases even more so than, transformers [34]. Therefore, the experiments and results presented in this paper primarily concentrate on ConvNet-based deep learning models; at the same time, the proposed approach was also evaluated with a transformer-based model.
2.3. Parameters
Four parameters were used in performing automated optimization during the empirical work: the number of trainable hidden layers, learning rate, batch size, and optimizer. The number of training iterations (epochs) remained the same because early stopping was used in the selected DL models. Early stopping monitored improvements in the validation loss, with patience set to 10 and mode set to min, since the goal is to minimize the validation loss; the best model weights were restored.
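In Keras terms, the early stopping settings described above correspond roughly to the following callback (a sketch of the stated settings, not the authors' exact code):

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor="val_loss", patience=10, mode="min",
                               restore_best_weights=True)
# passed to training as: model.fit(..., epochs=100, callbacks=[early_stopping])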
2.4. Datasets and Performance Measures
Standard datasets for image classification tasks used across research and hackathons were selected [35,36]. ImageNet (https://image-net.org/download, accessed on 4 November 2022) is the largest dataset for image classification and localization. In addition, we selected Fashion-MNIST [37], Stanford Cars [38], and the Google Cats and Dogs dataset (https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip, accessed on 4 November 2022). All datasets are publicly available and can be freely accessed for various image classification challenges.
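Two of these datasets can be fetched directly with TensorFlow utilities, as in the illustrative snippet below; the download handling is an assumption, and the Cats and Dogs URL is the one cited above.

import tensorflow as tf

# Fashion-MNIST ships with Keras: 60,000 training and 10,000 test 28x28 images.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Cats and Dogs: download and unpack the filtered archive cited above.
zip_path = tf.keras.utils.get_file(
    "cats_and_dogs_filtered.zip",
    origin="https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip",
    extract=True)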
With a balanced dataset used to evaluate the proposed approach, the performance metric used in this work is accuracy. Accuracy measures the proportion of images that are correctly classified. With initial accuracy A_0 and final accuracy A_f, the relative error drop can be calculated as:

relative error drop = [(1 − A_0) − (1 − A_f)] / (1 − A_0) = (A_f − A_0) / (1 − A_0)
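A small helper computing this quantity might look as follows (a sketch, assuming accuracies are given as fractions in [0, 1]; the numbers in the example are illustrative):

def relative_error_drop(initial_accuracy: float, final_accuracy: float) -> float:
    # Error is 1 - accuracy; the drop is reported relative to the initial error.
    initial_error = 1.0 - initial_accuracy
    final_error = 1.0 - final_accuracy
    return (initial_error - final_error) / initial_error

print(relative_error_drop(0.80, 0.946))  # ~0.73, i.e., a 73% relative error drop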
Evaluation metrics like precision, recall, and confusion matrix may be employed when the proposed approach uses imbalanced datasets. However, a data scientist must explicitly define the evaluation metric before optimizing.
2.5. Problem Definition
This work addresses the following problem. Given:
A set of deep learning models Ϻ(1),…, Ϻ(t), with t being the number of deep learning candidate models for a CV task T;
A number m of datasets as Ɗ(1),…, Ɗ(m);
A set of n parameters with domains Θ1,…, Θn;
Model Ϻ’s configuration space as a product of parameter domains given as Θ = Θ1 × … × Θn;
For each selected model Ϻ on dataset Ɗ during task T, a number K of pairs of parameter settings and their corresponding empirical performance measurements.
Our goal is to automate the process of determining, for the selected model on a given dataset for a particular image classification task, which parameter values are to be applied for the expected better performance of a model during the optimization process.
2.6. Optimization Algorithm
We employed a sequential model-based optimization (SMBO) algorithm to implement the proposed scheme. More specifically, we used Bayesian optimization (BO), a popular SMBO-based hyperparameter optimization method [14,15]. With an evaluation function f and an acquisition function S, SMBO can be articulated as in Algorithm 1 [14].
Algorithm 1: Sequential-Model-Based Optimization
Input: f, Θ, S, Ϻ
Ɗ ← INITSAMPLES(f, Θ)
for i in [1, 2, …, t] do
    p(y|θ, Ɗ) ← FITMODEL(Ϻ, Ɗ)
    θᵢ ← S(p(y|θ, Ɗ))
    yᵢ ← f(θᵢ)
    Ɗ ← Ɗ ∪ (θᵢ, yᵢ)
end for
After tuning the probabilistic model Ϻ to fit Ɗ, the acquisition function S selects the next promising neural architecture from Ϻ. S defines the balance between exploring new architectures from Ϻ and exploiting architectures already identified to have promising values. Function f evaluates the selected neural architecture after training and validation. The new pair of evaluation results (θᵢ, yᵢ) is appended to Ɗ.
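As a concrete illustration of SMBO, the sketch below uses Gaussian-process-based Bayesian optimization from Skopt (cited above); the objective, parameter ranges, budget, and the placeholder evaluation function are assumptions for illustration only.

from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical

space = [Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
         Integer(8, 128, name="batch_size"),
         Categorical(["adam", "sgd", "rmsprop"], name="optimizer")]

def evaluate_candidate(learning_rate, batch_size, optimizer):
    # Placeholder: in practice, compile and fit the DL model with these
    # parameters and return its validation accuracy.
    return 0.5

def objective(params):
    learning_rate, batch_size, optimizer = params
    return 1.0 - evaluate_candidate(learning_rate, batch_size, optimizer)  # minimize error

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, 1.0 - result.fun)  # best parameter setting and its validation accuracy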
2.7. Automated Optimization Framework
The proposed framework of the automated optimization scheme is summarized in Figure 1. The optimization scheme accepts a DL model, a classification task, and a control variable to exploit the configuration space. Finally, it yields the best-performing DL model, its associated parameters, and evaluation metrics.
At first, a data scientist specifies a DL model Ϻ, a dataset Ɗ, and a maximum number of optimization attempts Ɲ for a given image classification task. The algorithm compiles the model after preprocessing the data in Ɗ using default parameters θ₀. Afterwards, the algorithm fits Ϻ on Ɗ and records the initial performance measure P₀ obtained with θ₀. A comparison between P₀ and new performance values is made during the optimization process.
The algorithm then initializes automated optimization of Ϻ on Ɗ using the configuration space Θ. Parameters θᵢ are selected iteratively to compile and fit Ϻ on Ɗ while recording the new performance Pᵢ. If Ϻ improves when Pᵢ is compared with P₀, the algorithm updates the performance measure and the corresponding parameter values (θᵢ, Pᵢ) and saves Ϻ. Otherwise, the algorithm retrains after unfreezing the top layers of Ϻ and reducing the learning rate. In this paper, unfreezing happened layer by layer, since the training processes of the internal layers of DL models differ significantly [39].

In both cases, if there are no improvements, the parameters θᵢ and the compiled Ϻ are discarded upon selecting subsequent parameter values for the next optimization attempt. The counter for optimization attempts, n, is incremented. The process repeats, with Ɲ controlling the exploration of θ from Θ while exploiting the selected parameters. The automated optimization can be summarized in Algorithm 2.
Algorithm 2: Automated Optimization of DL Models
Input: Ϻ, Ɗ, Ɲ
Output: best Ϻ
Method: the algorithm works as follows.
1: Initialize Ϻ on Ɗ
2: Compile Ϻ with default parameters (θ₀)
3: Fit Ϻ on Ɗ and record the initial performance (P₀)
4: for θᵢ in Θ do
5:    Compile and fit Ϻ on Ɗ with θᵢ, recording (Pᵢ)
6:    if Pᵢ improves on P₀ then update and save (Ϻ, θᵢ, Pᵢ)
7:    else unfreeze Ϻ
8:       Reduce the learning rate, recompile, and refit Ϻ, recording (Pᵢ)
9:       if Pᵢ improves on P₀ then update and save (Ϻ, θᵢ, Pᵢ)
10:      else discard θᵢ and Ϻ; increment n
11:      end if
12:   end if
while n < Ɲ, reiterate
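The following Python sketch mirrors the loop in Algorithm 2 at a high level; it is not the authors' implementation, and the helper callables (build_model, compile_and_fit, evaluate, sample_parameters, default_parameters, unfreeze_top_layers) are hypothetical names supplied by the caller.

def automated_optimization(build_model, data, N, search_space,
                           compile_and_fit, evaluate,
                           sample_parameters, default_parameters,
                           unfreeze_top_layers):
    # Steps 1-3: initial compile/fit with default parameters to record P0.
    best_params = default_parameters(search_space)
    best_model = compile_and_fit(build_model(), data, best_params)
    best_perf = evaluate(best_model, data)
    n = 0
    while n < N:                                    # N caps the optimization attempts
        params = sample_parameters(search_space)    # select the next θ from Θ
        candidate = compile_and_fit(build_model(), data, params)
        perf = evaluate(candidate, data)
        if perf <= best_perf:
            # No improvement: unfreeze top layers, reduce the learning rate, retrain.
            unfreeze_top_layers(candidate)
            params["learning_rate"] *= 0.1          # reduction factor is an assumption
            candidate = compile_and_fit(candidate, data, params)
            perf = evaluate(candidate, data)
        if perf > best_perf:                        # keep the better model and its θ
            best_model, best_params, best_perf = candidate, params, perf
        # otherwise the candidate and its parameters are discarded
        n += 1
    return best_model, best_params, best_perf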
Most existing approaches optimize models for ML tasks other than computer vision. Optimizing a model for a computer vision task is often resource-intensive. The proposed method stands out from the existing literature for three reasons. First, it automates the optimization process of models for computer vision, a problem insufficiently addressed in the current literature. Second, it can be configured for, and supports, state-of-the-art deep learning models with transfer learning for any image classification task. Third, the user controls the optimization process and may limit the utilization of resources, as opposed to using brute force and exhaustive searches.
2.8. Implementation
Transfer learning was employed on the selected models with pre-computed weights from ImageNet. The Fashion-MNIST and Stanford Cars datasets were used to initially examine the association between batch sizes and learning rates during training with a generic model. The Stanford Cars dataset encompasses 16,185 images across 196 car classes. Fashion-MNIST holds 70,000 28 × 28 grayscale fashion images from 10 categories, divided into 60,000 training and 10,000 test examples. The proposed automated optimization approach was implemented and evaluated on the Stanford Cars and the Cats and Dogs datasets, the latter containing 1500 color images per class, split into 2000 images for training and 1000 for validation.
Three optimizers were used: RMSProp, SGD with momentum, and Adam. The batch size was doubled at each step, from 8 up to 128. The number of epochs was fixed at 100, with early stopping monitoring the validation loss with patience set to 10, a 0.5 dropout rate to prevent overfitting, and only the best model being saved.
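In Keras terms, this configuration corresponds roughly to the sketch below; the momentum value and checkpoint filename are assumptions, and the early stopping callback is the one shown in Section 2.3.

import tensorflow as tf

optimizers = {
    "rmsprop": lambda lr: tf.keras.optimizers.RMSprop(learning_rate=lr),
    "sgd": lambda lr: tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
    "adam": lambda lr: tf.keras.optimizers.Adam(learning_rate=lr),
}
batch_sizes = [8, 16, 32, 64, 128]  # doubled from 8 up to 128

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)  # keep only the best model
dropout = tf.keras.layers.Dropout(0.5)  # applied in the classifier head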
4. Discussion
This section discusses the results presented in the previous section and positions our contributions relative to knowledge in the existing literature.
4.1. Contributions to Related Literature
Probst et al. [40] employed a statistical approach to study the importance of hyperparameter tuning and introduced tunability, signifying the gain in a model’s performance achievable through hyperparameter tuning. However, that research was confined to six classical machine learning models.
Faes et al. [41] researched the practicability of automated deep learning design by medical practitioners without programming or DL expertise. Models developed through the automated stream were compared with SOTA DL models on medical image classification tasks using publicly available datasets. Results indicated comparable discriminative performance and diagnostic properties [41].
Xu et al. [42] introduced an automatic network adaptation framework for object detection that goes beyond searching for a classification backbone. The authors proposed the Auto-FPN framework with two modules performing an automatic search: Auto-Fusion and Auto-Head. For both modules, the gradient-based architecture search space was claimed to be efficient under resource constraints [42].
Weerts et al. [43] introduced tuning risk, the loss in performance incurred when a hyperparameter is not tuned but left at its default value. The authors asserted that a model with default values might, in some cases, even outperform a model with adjusted hyperparameters [43]. However, only two classical ML algorithms were experimented with: Support Vector Machine and Random Forest.
Guo et al. [44] proposed a hierarchical trinity search framework that automatically discovers effective architectures for object detection. Their solution employs an end-to-end approach to discover all components and architectures of object detection simultaneously. Experimental results suggest that the proposed framework outperforms both manually designed and NAS-based architectures with much less computational overhead [44].
X. He, Wang, et al. [45] conducted a series of empirical evaluations to benchmark DL models and automated model design for COVID-19 detection. The empirical results showed that random search can deliver DL models with SOTA performance [45]. However, automating the optimization process of the selected architectures after NAS remains unaddressed.
Yi and Bui [46] presented a deep learning model with automated hyperparameter search for highway traffic prediction. Their work employed a long short-term memory (LSTM) model, a recurrent neural network (RNN) variant, to build a time series model for an intelligent transport system. Bayesian optimization was claimed to be superior to manual search, grid search, and random search in finding hyperparameter configurations [46].
While literature on AutoML already exists, to our knowledge, it is limited to automating the design of DL models. Table 6 summarizes the research contribution to the existing body of knowledge. Thus far, some authors have attempted to highlight the significance of hyperparameter optimization [40,43]. Other researchers are more concerned with automating the design of deep learning models [41,42,44,45], while [46] focuses on automating the hyperparameter search for time series deep learning models.
Based on the reported literature, this paper concentrates on automating the optimization of deep learning models for image classification tasks. The experimental work examines the importance of hyperparameters and searches for the best values by automating the optimization process. Similarly, it empirically demonstrates that the proposed approach can improve the performance of deep learning models through automation.
4.2. Association between Batch Size and Learning Rate
Our results suggest a significant positive correlation between batch sizes and learning rates. This implies that lower rates should be used with smaller batches for better results, and the inverse is true for larger batches. In line with the present results, previous studies have suggested using smaller batches [47,48] and high learning rates with large batch sizes [49]. Furthermore, batch size 32 was recommended as the default value that provides good results [50].
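One simple way to encode such a positive coupling, shown here only as an illustration and not as this paper's prescription, is the linear scaling heuristic, in which the learning rate grows proportionally with the batch size; the base values below are assumptions.

def scaled_learning_rate(batch_size, base_lr=0.001, base_batch_size=32):
    # Learning rate scales linearly with the batch size relative to a reference.
    return base_lr * (batch_size / base_batch_size)

for bs in [8, 16, 32, 64, 128]:
    print(bs, scaled_learning_rate(bs))  # 8 -> 0.00025, ..., 128 -> 0.004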
Another critical finding, distinct from previously reported results, was observed when a learning rate of 0.1 was selected. Regardless of how the batch size was varied, validation accuracy remained at around 50% in all DL models. An implication of this finding is the possibility that high learning rates cause performance to plateau close to a specific value. However, given the stochastic nature of ML experiments, these results must be interpreted cautiously.
4.3. Optimization Implications
The post-automated-optimization results signify that the proposed scheme can improve the performance of DL models on a given image classification task. VGG-16 consumed approximately an hour of training and parameter tuning to reduce the error rate by 73%. Likewise, for EfficientNet, optimizing to the final accuracy took 2 h and 41 min.
Without automated optimization tools, data scientists spend hours manually training and optimizing DL models. Additionally, the optimization process requires a good knowledge of the impact of both model- and training-related parameters relative to the performance evaluation metrics. With such constraints, it is evident that building and optimizing DL models for image classification becomes a burdensome task.
However, with a fully automated machine learning pipeline, data scientists will be freed from the majority of the burdensome tasks involved in deploying DL models. For this reason, the proposed automated optimization scheme becomes significant. Data scientists can leverage such a tool during the design and optimization process, consequently becoming more productive and efficient in completing other valuable scientific tasks in the machine learning pipeline. This implies that data scientists will only be required to define a task, model, and dataset and then submit them to automated machine learning tools.
4.4. Proposed Approach on Vision Transformers
The proposed tool can be applied to fine-tune and optimize transformer-based models automatically. However, it has to be customized to incorporate ViT hyperparameters. ViT hyperparameters include the patch size, projection dimension, number of heads, number of transformer layers, and multilayer perceptron units, in addition to the standard ConvNet parameters. This implies that the search space will be more extensive than for ConvNets.
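An extended search space of this kind might be declared as in the sketch below; all ranges are illustrative assumptions rather than the values used in the experiments.

vit_search_space = {
    # transformer-specific hyperparameters
    "patch_size": [4, 8, 16],
    "projection_dim": [32, 64, 128],
    "num_heads": [2, 4, 8],
    "transformer_layers": [4, 6, 8],
    "mlp_units": [[128, 64], [256, 128]],
    # training-related parameters shared with the ConvNet experiments
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [256, 512],
}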
Transformer-based models must be trained on larger datasets for long enough to ensure convergence [51]. The number of training iterations can be set to 500 with early stopping to control overfitting. This signifies that ViT models can be trained for up to five times more iterations than ConvNet models.
Similarly, transformer-based models can be trained with larger batch sizes than ConvNet models. Our results suggest that mini-batches between 16 and 64 promise optimal performance with ConvNets, whereas ViT models can be trained with batch sizes of 256 and 512. The results presented in Section 3.6 were obtained when the ViT model was trained using a batch size of 256.
The proposed approach can be customized to train and optimize ViT models. The search space must be expanded to accommodate transformer-based hyperparameters. Similarly, training time must be long enough, with many iterations to guarantee convergence to an optimal solution.