The image classification models used are based on different convolutional neural network architectures. Some CNN architectures have several variations of the same basic design (e.g., a different number of layers, a different input image size, or different computational techniques) such as Inception, ResNet, VGG, MobileNet, NASNet, and DenseNet. A significant number of the models used were trained with TensorFlow, while some models were trained in other frameworks such as Caffe or Keras. Each model trained with Caffe or Keras was converted to a suitable TensorFlow format.
AlexNet model [
24,
25] is based on a deep CNN architecture of the same name [
26,
27], which was originally trained with the Caffe framework. AlexNet won the ImageNet competition in 2012. It uses features such as Rectified Linear Unit (ReLU) activation, data augmentation, dropout, and local response normalization, which are standard parts of modern classification neural networks [
28]. AlexNet is considered a predecessor of all modern CNNs. Densely Connected Convolutional Network (DenseNet) models [
27,
29] are also based on a deep CNN architecture [
30] and were originally trained with the Keras framework. In the present research, DenseNet-121 (k = 32), DenseNet-161 (k = 48), and DenseNet-169 (k = 32) were used in the image classification process. The number in the model name denotes the number of layers of the DenseNet model, while the parameter k denotes the growth rate, i.e., the number of feature maps added by each layer. Some advantages of DenseNet models are a reduced number of parameters, an alleviated vanishing-gradient problem, feature reuse, and concatenation of the feature maps learned by different layers, in order to improve efficiency [
31]. Google is the author of the Inception model, which is implemented in several versions: Inception v1 [
32], Inception v2 [
33], Inception v3 [
34], Inception v4 [
35], and a hybrid inception model Inception–Resnet [
35]. All Inception-based ImageNet pre-trained models used in this research were downloaded from the TensorFlow-Slim image classification library web page [
19]. The Inception v1 network architecture was introduced in 2014 and won the ImageNet challenge the same year. The authors of the architecture took into account the fact that the objects in an image may have different sizes—larger objects take up larger areas, while smaller objects take up smaller regions of the image. They proposed the implementation of inception blocks, which split the input into different parallel paths (or towers); at the end of the inception block, the outputs of the different paths are concatenated [
31]. The Inception architecture introduces 1 × 1 convolutions to reduce the depth of each path and uses global average pooling layers instead of fully connected ones. The Inception v2 version (or Inception-BN) uses batch normalization, in order to allow much higher learning rates and greater tolerance toward initialization issues. The improved second version also replaces 5 × 5 convolution kernels with two 3 × 3 kernels, which reduces the number of calculations and saves memory. The Inception v3 version factorizes convolutions into smaller convolutions and uses efficient grid size reduction, batch normalization in the auxiliary classifiers, and several inception grids [
31,
34]. Both architectures, Inception v4 and Inception–ResNet, are presented in the same paper. Inception v4 uses a “pure inception architecture” and is a simplified version of the Inception v3 architecture, with more inception blocks. It also introduces reduction blocks, which are used to change the width and height of the grid. Inception–ResNet is a hybrid architecture, i.e., a residual connection from the ResNet [
27,
31,
36] model is integrated into the convolution network in order to make the network deeper and faster during the training process [
35,
37]. The MobileNet architecture is specifically optimized for mobile and embedded applications in order to meet their resource constraints [
27,
31]. It uses two simple global hyperparameters that efficiently trade off latency against classification or recognition accuracy [
38]. There are three versions of the MobileNet architecture: the first version [
38] is based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep CNNs, while the second version [
39] additionally implements linear bottlenecks between the layers and shortcut connections between the bottlenecks. The MobileNet ver. 3 is the third version of the MobileNet architecture [
40]. This version uses two algorithms in order to construct suitable network architecture for a specific problem—the MnasNet [
41] is used to select optimal network configuration, and NetAdapt [
42] is used to fine-tune the proposed configuration. MobileNet ver. 3 is more accurate and faster than MobileNet ver. 2, but the authors of the algorithm report only top-1 accuracy, while top-5 accuracy is not mentioned at all. The models are released as MobileNetV3-Large and MobileNetV3-Small versions, which are targeted at high- and low-resource use cases [
40]. Both the large and small model versions use all the advanced properties of the MobileNetV3 architecture, while the so-called minimalistic models do not utilize advanced blocks such as 5 × 5 convolutions, squeeze-and-excite units, and hard swish. In our research, four MobileNets ver. 1 (Mob_v1_0.25, Mob_v1_0.50, Mob_v1_0.75, and Mob_v1_1.0) [
43], six MobileNets ver. 2 (Mob_v2_0.35, Mob_v2_0.50, Mob_v2_0.75, Mob_v2_1.0, Mob_v2_1.3, and Mob_v2_1.4) [
44], and four MobileNets ver. 3 (Mob_v3_lrg, Mob_v3_lrgm, Mob_v3_sml, and Mob_v3_smlm) [
44] pre-trained models were used, all with the same image input size of 224 × 224 pixels. The number next to the model name and version denotes the depth multiplier, which defines the number of channels in each layer—i.e., a value of 0.5 halves the number of channels, which cuts the number of computations and effectively speeds up the classification process, but with lower accuracy (a short code sketch illustrating this parameter is given at the end of this overview). The structure of the Neural Architecture Search Network (NASNet) architecture was not predefined by the authors; instead, it was searched for by a controller Recurrent Neural Network (RNN) [
27,
45]. The main structure cells (or blocks) were searched for on smaller datasets and then transferred to larger datasets. These cells are called the normal cell and the reduction cell. A normal cell is a convolution cell that returns a feature map of the same dimensions, while a reduction cell returns a feature map whose dimensions are halved. The authors used slightly differently structured normal and reduction cells in their research and introduced three model versions: NASNet-A, NASNet-B, and NASNet-C. In the research presented here, the NASNet-A architecture was used in two versions: NASNet large and NASNet mobile [
19]. The authors of the Progressive Neural Architecture Search (PNASNet) [
46] propose a Sequential Model-Based Optimization (SMBO) strategy instead of the reinforcement learning and evolutionary algorithms introduced in the (previously mentioned) NASNet architecture. In the same search space, PNASNet is eight times faster than NASNet in terms of total compute and up to five times more efficient [
46]. According to the number of blocks used (and the resulting complexity), the PNASNet architectures are denoted from PNASNet-1 (low complexity) to PNASNet-5 (high complexity). In the present research, the PNASNet architecture was used in two versions: PNASNet-5 large and PNASNet-5 mobile [
19]. The Residual Network (ResNet) architecture is focused on solving a problem with deep CNNs [
27,
36]—increasing the depth of a convolutional network can lead to accuracy degradation, even though network depth is crucial for achieving better model accuracy [
32,
36,
47]. The authors proposed the implementation of the residual block, which consists of two or three sequential convolutional layers and a shortcut connection between the input of the first and the output of the last layer [
31]. ResNet architectures can be extremely deep, but beyond a certain depth model accuracy decreases, i.e., a 1202-layer network is less accurate than a 110-layer network [
36]. The second version of the ResNet architecture introduced a restructured residual block, which implements identity mappings as skip connections and after-addition activation [
48]. ResNet v1 models [
49] were originally trained with the Caffe framework and converted to TensorFlow format, while ResNet v2 models were trained with TensorFlow. Both ResNet architecture versions in this research were used with 50-, 101-, and 152-layer deep networks (ResNet v1 50/101/152 and ResNet v2 50/101/152) [
19]. The pre-trained ResNet v2 models use Inception pre-processing and an input image size of 299 × 299 pixels [
19]. The authors from Oxford’s Visual Geometry Group (VGG) found that a convolution layer with a larger filter (one 5 × 5 filter) can be replaced with two convolution layers with smaller 3 × 3 filters (factorized convolution)—the proposed structure requires less computational capacity and a reduced number of parameters [
31,
50]. The VGG architecture [
47] consists of multiple blocks of stacked convolution layers, each combined with a max-pooling layer, and three fully connected layers; therefore, the final VGG models are computationally expensive and memory inefficient. In the present research, two ImageNet pre-trained models were used—namely, VGG-16 (16 layers) and VGG-19 (19 layers) [
19]. Both VGG models used were originally trained with Caffe and converted to a suitable TensorFlow format [
51]. The Extreme Inception (Xception) architecture uses depth-wise separable convolutions instead of Inception modules and shortcuts between convolution blocks (as in ResNet) [
52]. Xception is very similar to Inception v3 [
34] but shows better results [
53]. The Xception pre-trained model used was converted from the Keras framework into a TensorFlow checkpoint file [
54].
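For illustration, a minimal sketch of how such a model can be instantiated is given below, using the Keras applications API (an assumption made for readability; it is not the TF-Slim checkpoint loading used in this research). The alpha argument corresponds to the depth multiplier described above for the MobileNet models, and input_shape corresponds to the default input image size.

import tensorflow as tf

# Instantiate two MobileNet ver. 1 variants that differ only in the depth multiplier.
# weights=None skips downloading the ImageNet weights; only the structure matters here.
full = tf.keras.applications.MobileNet(input_shape=(224, 224, 3), alpha=1.0, weights=None)
half = tf.keras.applications.MobileNet(input_shape=(224, 224, 3), alpha=0.5, weights=None)

# Halving the number of channels in every layer roughly quarters the parameter count,
# trading classification accuracy for faster inference.
print(full.count_params(), half.count_params())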
4.1. Pre-Trained Classification Model Properties
The evaluated image classification models, with respect to their properties, structure, and inference speed, are listed in the next two tables. In the first table, in addition to the model name, the abbreviated name is noted in order to distinguish model versions. Each model has a default input image width and height in pixels. The pre-trained models have a certain top-n classification accuracy—top-1 considers only the model’s single inference with the highest probability with regard to the expected answer, while top-5 considers whether the expected answer is among the model’s five inferences with the highest probability. The top-1 and top-5 accuracy values presented in the table refer to the pre-trained models, not to our actual experiments. As stated, some models were not pre-trained with TensorFlow, so the original training ML library is noted for those model versions, which are listed alphabetically by model name (
Table 1).
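To make the top-1 and top-5 definitions concrete, the following short sketch computes both checks for a single hypothetical probability vector (the class indices and probabilities are invented for illustration only).

import numpy as np

def top_k_correct(probabilities, true_label, k):
    # The inference is counted as correct if the expected answer is among
    # the k classes with the highest predicted probability.
    top_k = np.argsort(probabilities)[::-1][:k]
    return true_label in top_k

# Hypothetical softmax output over six classes; class 2 is the expected answer.
probs = np.array([0.05, 0.08, 0.17, 0.55, 0.10, 0.05])
print(top_k_correct(probs, true_label=2, k=1))  # False: class 3 has the highest probability
print(top_k_correct(probs, true_label=2, k=5))  # True: class 2 is among the five most probable classes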
The structure and complexity of the models vary with each model and model variant—the number and types of layers, the number of filters, and the filter stride affect the number of parameters. The number of filters, the activation functions used, and the computation techniques affect the calculation speed. TensorFlow saves trainable CNN models in two ways—as a checkpoint, which captures the exact values of all parameters, and as a SavedModel, which, in addition to the checkpoint, includes a serialized description of the computation. The SavedModel format is independent of the source code that created the model, which makes it suitable for serving or for use in other programming languages [
55]. A trained model in the form of a checkpoint is stored in four files, which separate the model graph structure (model.ckpt.meta), the values of the variables (model.ckpt.data-00000-of-00001), the index of the variables (model.ckpt.index), and standalone checkpoint information for older versions of the TensorFlow framework (model.ckpt). Checkpoint files and a SavedModel can additionally be processed into the “frozen” and optimized Protocol Buffers (protobuf) format [
56]. Freezing a model is the process of converting all model graph variables into constants, while optimization is the process of removing all CNN layers that are not necessary for the inference process. Furthermore, the size of the network in memory and on disk is proportional to the number of parameters, while the latency and power usage of the network correspond to the number of Floating-point Operations (FLOP) of the model. Instead of FLOP, some authors use the number of Multiplication and Addition (MAdd) operations or the number of Multiply–Accumulates (MACs) [
40,
57]. In general, one MAC contains one multiplication and one addition operation, which indicates that 1 MAC = 2 FLOP, but some systems can perform fused multiplication and addition in a single step [
58], which indicates that in such cases, 1 MAC = 1 FLOP. The complexity of each evaluated model is presented with two values. The first value is the total number of parameters (trainable and non-trainable), retrieved from the model checkpoint files with a simple inspection script. As mentioned before, the checkpoint (ckpt) files were frozen and optimized in order to speed up the image classification process. The second presented value is the number of FLOPs, calculated from the frozen and optimized model (pb) file with the TensorFlow profiler application [
59]. According to some users, the TensorFlow profiler has some issues with its calculation procedure [
60]. All model checkpoint files were downloaded from the respective websites and converted to frozen files on a local computer. Both calculated values are noted in
Table 2.
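A simplified sketch of how these two values can be obtained is shown below; the file names are placeholders and the TF1 compatibility profiler API is assumed, so this is an illustration of the procedure rather than the exact inspection script used.

import numpy as np
import tensorflow as tf

# Total number of parameters: sum the element counts of every variable in the checkpoint.
variables = tf.train.list_variables("model.ckpt")            # placeholder checkpoint path
total_parameters = sum(int(np.prod(shape)) for _, shape in variables)
print("Parameters:", total_parameters)

# Number of FLOPs: load the frozen and optimized graph (.pb) and query the profiler.
with tf.io.gfile.GFile("frozen_model.pb", "rb") as f:        # placeholder frozen graph path
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.compat.v1.import_graph_def(graph_def, name="")
    opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.compat.v1.profiler.profile(graph, options=opts)
    # Since 1 MAC usually equals 2 FLOP, a MAC count reported elsewhere is roughly half of this value.
    print("FLOPs:", flops.total_float_ops)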
The top-5 and bottom-5 values of the total number of parameters (NoPs) in the checkpoint files and of the number of floating-point (FLOP) operations, in millions, for the frozen graphs are displayed in the graphs of
Figure 1.
In order to check the model inference (or classification) rate, 1000 images were resized to suitable input sizes, the model files were "frozen", and an SQLite database [
61] was prepared to store the top-5 inference results. All images, models, and the database were stored on a RAM disk. Model inference speed depends on the computer hardware and software configuration—the presented inference speed is the average value of three consecutive measuring runs on an (old and cheap) configuration: CPU: AMD A8-6600K APU [
62] and GPU: Gigabyte GeForce GTX 1070 8 GB [
63]. The inference rate of a particular model is presented in
Table 3 and
Figure 2.
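The following sketch outlines the kind of measurement loop described above: loading a frozen graph, classifying the pre-resized images, and storing the top-5 results in an SQLite database. All paths, tensor names, and the table layout are illustrative assumptions rather than the exact benchmarking code used in this research.

import sqlite3
import time
import numpy as np
import tensorflow as tf

FROZEN_GRAPH = "/mnt/ramdisk/frozen_model.pb"                 # placeholder paths on the RAM disk
DB_PATH = "/mnt/ramdisk/results.db"
INPUT_TENSOR, OUTPUT_TENSOR = "input:0", "predictions:0"      # tensor names differ per model

with tf.io.gfile.GFile(FROZEN_GRAPH, "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name="")

db = sqlite3.connect(DB_PATH)
db.execute("CREATE TABLE IF NOT EXISTS results "
           "(image TEXT, rank_position INTEGER, class_id INTEGER, probability REAL)")

image_paths = ["/mnt/ramdisk/images/img_%04d.npy" % i for i in range(1000)]  # pre-resized images
start = time.time()
with tf.compat.v1.Session(graph=graph) as sess:
    for path in image_paths:
        batch = np.load(path)[np.newaxis, ...]                      # add the batch dimension
        probs = sess.run(OUTPUT_TENSOR, {INPUT_TENSOR: batch})[0]
        for rank, cls in enumerate(np.argsort(probs)[::-1][:5], start=1):
            db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                       (path, rank, int(cls), float(probs[cls])))
db.commit()
print("Inference rate: %.1f images/s" % (len(image_paths) / (time.time() - start)))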
It is noticeable that CNN architecture complexity influences the models’ inference rate, which confirms the claim that CNN models are a trade-off between inference speed and accuracy—models with faster inference speeds are less accurate and vice versa [
43].
There is an additional procedure to speed up image inference on systems using an NVIDIA GPU—TensorRT [
64,
65]. TensorRT restructures the saved model or the frozen model graph by removing unused output layers and conducting horizontal and vertical layer fusion in order to speed up the inference process. TensorRT also supports different calculation precisions: FP32, FP16, and INT8, which can additionally improve performance.
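A minimal sketch of such a conversion with the TF-TRT converter is given below, assuming a SavedModel directory and FP16 precision; the exact converter options depend on the installed TensorFlow and TensorRT versions, so this is only an illustration of the workflow.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel with TF-TRT, requesting reduced (FP16) precision.
# Supported subgraphs are fused into TensorRT engines; unsupported layers remain in TensorFlow.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model",   # placeholder path
                                    conversion_params=params)
converter.convert()
converter.save("saved_model_trt")   # the optimized model is then loaded like any other SavedModel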