4.1. Construction Method
To verify the effectiveness and accuracy of methods for evaluating model generalization, a comprehensive model set is needed. The two existing model sets for generalizability studies were constructed on two simplified datasets, CIFAR and SVHN [1,3], and contain 756 and more than 10,000 CNN models, respectively. On the one hand, the image resolution in the CIFAR and SVHN datasets is only 32 × 32 pixels, so the data dimensionality is low. On the other hand, the models in these sets mostly consist of around 10 layers and are small in scale, which leaves a gap relative to practical application scenarios. To bridge this gap and achieve a more precise and exhaustive validation of generalization assessment methodologies, this study constructed a model set of more than 2000 CNN models trained on an ImageNet subset by systematically varying commonly used hyperparameters. The models were built on the ResNet family, specifically the ResNet18, ResNet34, ResNet50, and ResNet101 architectures, which are prevalent in practical applications. The model set constructed in this study is comparable to the above two model sets in terms of quantity. Moreover, the models are larger, and the images used for training are of higher resolution (496 × 387 pixels on average), aligning more closely with practical scenarios.
Taking into account the constraints of training duration, the study selected a random subset of 20 categories from the ImageNet dataset for experimental analysis. The data encompass a diverse array of categories, including animals, everyday objects, musical instruments, food items, and transportation-related items. This diversity is further accentuated by the extensive variation in the sizes and shapes of the objects represented within the dataset. A detailed breakdown of the data categories and their respective quantities used for model training is delineated in Table 1. For the purpose of determining test accuracy, 50 images per category were used, giving a total evaluation set of 1000 images.
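As a concrete illustration of this data preparation step, the following minimal sketch assembles a 20-class ImageNet subset and a 50-image-per-class test split. The directory layout, random seed, and helper names are assumptions introduced for illustration, not the exact pipeline used in this study.

```python
import random
from pathlib import Path

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Hypothetical paths: an ImageNet-style tree with one sub-folder per class.
IMAGENET_TRAIN = Path("/data/imagenet/train")
IMAGENET_VAL = Path("/data/imagenet/val")

random.seed(0)
# Randomly pick 20 of the ImageNet classes (folder names are WordNet IDs).
all_classes = sorted(p.name for p in IMAGENET_TRAIN.iterdir() if p.is_dir())
selected = random.sample(all_classes, 20)

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def subset(folder: Path):
    """Keep only samples whose class is among the 20 selected categories."""
    ds = datasets.ImageFolder(folder, transform=transform)
    keep = {ds.class_to_idx[c] for c in selected if c in ds.class_to_idx}
    ds.samples = [(p, y) for p, y in ds.samples if y in keep]
    ds.targets = [y for _, y in ds.samples]  # class indices keep their original mapping
    return ds

train_set = subset(IMAGENET_TRAIN)
test_set = subset(IMAGENET_VAL)  # the validation split provides 50 images per class
print(len(selected), "classes,", len(train_set), "train /", len(test_set), "test images")
```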
In the course of model training, we meticulously fine-tuned a suite of hyperparameters to generate a diverse array of CNN models. These hyperparameters, which encompass batch size, learning rate, optimization algorithm, regularization coefficients (weight decay), model architecture, data augmentation, and the utilization of pre-trained weights from the ResNet models on the ImageNet dataset, are widely recognized for their influence on the generalization capacity of machine learning models. Through an exhaustive exploration of various hyperparameter combinations, we successfully cultivated a spectrum of models with distinct generalization capabilities.
The hyperparameters for tuning can be formally defined as $\theta_i$, each taking values from a finite set $\Theta_i$, for $i = 1, \ldots, n$, with $n$ denoting the total number of hyperparameter types. In our study, 7 hyperparameters were selected, so $n = 7$. The selected hyperparameters were as follows:
Batch size: It determines the amount of data the model sees at each training step, which can influence the stability and diversity of the learning updates and thereby the model's exposure to the overall data distribution. We intended to use the same batch size for models with different network architectures but were limited by computing resources, so the maximum batch size used for training was 64. For experimental diversity, we also used batch sizes of 32 and 16, obtained by successively halving 64;
Learning rate: It controls the step size the model takes during optimization; larger rates can cause the model to overshoot minima, while smaller rates lead to slower convergence, both of which affect the model's ability to find a good balance between bias and variance. In our previous experiments, we identified a base learning rate at which the model could quickly be trained to convergence, so we additionally enlarged and reduced this rate by a factor of 100 to perturb the training process, yielding models with diverse generalization capabilities;
Optimization algorithm: It determines the path the model takes to minimize the loss function, which can influence the convergence speed and the quality of the solution found, thereby impacting the model's capacity to learn from the training data without overfitting. We chose two of the most common and basic optimization algorithms: SGD and Adam;
Regularization coefficient: It controls the balance between fitting the training data and maintaining model simplicity, which helps prevent overfitting and encourages the model to learn more generalizable patterns. We empirically chose two nonzero weight-decay coefficients and additionally included 0, i.e., no regularization term;
Model structure: It determines the complexity and representational capacity of the model, which directly influences its ability to capture underlying patterns without overfitting to the training data. Considering the computational resources and training time, the largest structure we chose was ResNet101; we also chose ResNet18, ResNet34, and ResNet50;
Data augmentation: It increases the diversity of the training data, which helps the model learn more robust features that better represent the underlying data distribution, thus improving performance on unseen data. Data augmentation is a common way to enhance the generalization ability of a model during training, so we trained models both with and without augmentation. In this study's experiments, augmentation consisted of random image flipping and random color enhancement;
Pre-trained weights: They provide a good starting point with features learned from a vast dataset, which can transfer useful knowledge to new tasks, reducing the need to learn from scratch and potentially improving the model's performance on similar data distributions. In our experiments, we found that using ResNet weights pre-trained on ImageNet significantly affects both the training time and the generalization ability of the model, so models were trained both with and without the ImageNet pre-trained weights (a minimal configuration sketch follows this list).
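To make the role of these hyperparameters concrete, the sketch below instantiates a model, optimizer, and augmentation pipeline from a single hyperparameter configuration. It is a minimal illustration assuming standard PyTorch/torchvision APIs; the specific configuration values and helper names are examples rather than the exact training code used in this study.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms

# One example configuration drawn from the hyperparameter grid (values are illustrative).
config = {
    "arch": "resnet50",        # resnet18 / resnet34 / resnet50 / resnet101
    "batch_size": 64,          # 64 / 32 / 16
    "lr": 1e-3,                # placeholder base rate; also scaled up/down by 100x in the grid
    "optimizer": "sgd",        # "sgd" or "adam"
    "weight_decay": 1e-4,      # placeholder; two nonzero values plus 0 were used
    "augment": True,           # with / without data augmentation
    "pretrained": True,        # with / without ImageNet pre-trained weights
}

def build_model_and_optimizer(cfg, num_classes=20):
    # Instantiate the ResNet backbone, optionally with ImageNet pre-trained weights.
    model = getattr(models, cfg["arch"])(pretrained=cfg["pretrained"])
    # Replace the final classifier for the 20-class subset.
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

    if cfg["optimizer"] == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=cfg["lr"],
                              weight_decay=cfg["weight_decay"])
    else:
        opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"],
                               weight_decay=cfg["weight_decay"])
    return model, opt

# Augmentation: random flipping and an illustrative color transform when enabled.
train_transform = transforms.Compose(
    ([transforms.RandomHorizontalFlip(), transforms.ColorJitter(0.4, 0.4, 0.4)]
     if config["augment"] else [])
    + [transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
)

model, optimizer = build_model_and_optimizer(config)
```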
Table 2 delineates the specific values assigned to each hyperparameter. A single training configuration is then a tuple $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ with $\theta_i \in \Theta_i$, and the candidate model set is obtained by enumerating all such combinations; a minimal sketch of this enumeration is given below.
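The following sketch enumerates the hyperparameter grid with itertools.product. The discrete choices mirror those described above; the numeric learning-rate and weight-decay values are placeholders standing in for the values specified in Table 2.

```python
from itertools import product

# Hypothetical stand-ins for the numeric values listed in Table 2.
LR_BASE = 1e-3           # placeholder base learning rate
WD_1, WD_2 = 1e-4, 1e-3  # placeholder nonzero weight-decay coefficients

grid = {
    "batch_size": [64, 32, 16],
    "lr": [LR_BASE / 100, LR_BASE, LR_BASE * 100],
    "optimizer": ["sgd", "adam"],
    "weight_decay": [0.0, WD_1, WD_2],
    "arch": ["resnet18", "resnet34", "resnet50", "resnet101"],
    "augment": [False, True],
    "pretrained": [False, True],
}

# Every combination of the n = 7 hyperparameters defines one training configuration.
configs = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
print(len(configs), "hyperparameter combinations")  # 3*3*2*3*4*2*2 = 864
```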
In order to ensure that the model is trained to convergence, we established the following stopping conditions:
1. The training loss (cross-entropy) falls below a threshold, which was set at 0.1;
2. The model's accuracy on the training set exceeds a threshold of 0.95;
3. The training loss decreases for two consecutive epochs while the test error rate increases, which indicates overfitting and signals that training should stop;
4. The training process does not exceed 150 epochs, which ensures that training completes within a limited time frame.
For more implementation details, see Appendix A.
The first two termination conditions ensure that the model can converge, and the last two ensure that training ends within a limited period. Training stops when any two of the first three conditions are met, or directly when the fourth condition is met (a minimal sketch of this stopping rule follows). We saved one or two models for each set of training parameters and repeated the experiment twice for each parameter combination. In theory, the full grid of hyperparameter combinations, together with the repeated runs and saved checkpoints, could yield several thousand models; taking the time factor into account, we utilized four RTX 2080 Ti GPUs and dedicated nearly 30 days to training a total of over 2100 models. Eventually, we selected 2000 models that achieved a training accuracy greater than 80% to comprise our model set, with 500 models chosen for each network structure. This criterion retains models whose training is near convergence, which aligns well with practical application scenarios; only a small number of models did not converge, for example, reaching a training accuracy of only 20% after 150 epochs of training.
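The sketch below expresses this stopping rule as code. The threshold values are those stated above; the function and variable names are illustrative rather than taken from the study's implementation.

```python
def should_stop(epoch, train_loss_history, train_acc, test_err_history,
                loss_threshold=0.1, acc_threshold=0.95, max_epochs=150):
    """Return True when training should terminate, following the four conditions."""
    # Condition 4: hard limit on the number of training epochs.
    if epoch >= max_epochs:
        return True

    # Condition 1: training (cross-entropy) loss below the threshold.
    c1 = train_loss_history[-1] < loss_threshold
    # Condition 2: training accuracy above the threshold.
    c2 = train_acc > acc_threshold
    # Condition 3: loss decreased for two consecutive epochs while the test error
    # increased, an indicator of overfitting.
    c3 = (len(train_loss_history) >= 3 and len(test_err_history) >= 3
          and train_loss_history[-1] < train_loss_history[-2] < train_loss_history[-3]
          and test_err_history[-1] > test_err_history[-2] > test_err_history[-3])

    # Stop when any two of the first three conditions hold simultaneously.
    return sum([c1, c2, c3]) >= 2
```

In the training loop, this check would run once per epoch, after the training loss, training accuracy, and test error for that epoch have been recorded.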
4.2. Model Distribution
The construction of a high-quality model set is fundamental to the investigation of methodologies for assessing model generalization capabilities. The distribution of generalization capabilities across the set is a critical metric for evaluating its quality.
A box plot, a versatile graphical tool, provides a comprehensive statistical summary of a data distribution, delineating critical aspects such as central tendency and dispersion. Specifically, it captures the median; the interquartile range (IQR), which includes the central 50% of the data; and the lower (Q1) and upper (Q3) quartiles corresponding to the 25th and 75th percentiles, respectively. The plot also highlights potential outliers, represented as individual points beyond the whiskers that extend to 1.5 times the IQR from the quartiles. Collectively, these elements offer a succinct yet informative overview of the variability and central tendency of the generalization gap values across the spectrum of model architectures within the established model set.
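As an illustration, the snippet below draws such a box plot of generalization-gap values grouped by architecture using matplotlib. The gap values and architecture labels are placeholder data standing in for the constructed model set.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
architectures = ["resnet18", "resnet34", "resnet50", "resnet101"]

# Placeholder generalization-gap values standing in for the 500 models per architecture.
gaps = {a: rng.uniform(0.0, 0.6, size=500) for a in architectures}

fig, ax = plt.subplots(figsize=(6, 4))
# Whiskers at 1.5 * IQR (matplotlib's default); outliers shown as individual points.
ax.boxplot([gaps[a] for a in architectures], labels=architectures, whis=1.5)
ax.set_xlabel("Model architecture")
ax.set_ylabel("Generalization gap")
plt.tight_layout()
plt.show()
```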
Figure 4 illustrates the distribution of generalization capabilities across the various structural models within the curated model set, with the generalization gap used to quantify each model's generalization ability. Upon scrutiny of Figure 4, it becomes evident that the generalization gap values span a wide range, signifying considerable heterogeneity in the generalization capabilities of the constituent models. Furthermore, the consistency of this distribution across different model architectures precludes confounding effects of structural variance when validating the generalization assessment methodology.
Figure 5 illustrates the quantity distribution of the models across various ranges of generalization gaps. The results reveal that the majority of models were concentrated at the extremes of very small or very large generalization gaps, while only a few models exhibited intermediate values. This distribution pattern can be attributed to the training termination conditions employed during model training, which aimed to achieve convergence: converged models tend to exhibit either minimal or substantial generalization gaps. A small generalization gap indicates proficient generalization to unseen data, whereas a large gap suggests overfitting, wherein the model incorporates task-irrelevant features from the training data, resulting in impressive performance on the training set but notable degradation on new data. Models displaying intermediate generalization gap values are in a transitional state, indicating that they have not yet attained the optimal balance between bias and variance; such models may benefit from additional training. In summary, the quantity distribution of models across different generalization gap ranges, as depicted in Figure 5, indicates that the majority of models in the set had reached a state of convergence. This serves as a foundation for the subsequent methodological investigations into assessing model generalization capability and validating the methodology's efficacy. Furthermore, this distribution validates the efficacy of the established training termination conditions, ensuring both the quality and efficiency of model training.
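For completeness, the sketch below shows one way to compute each model's generalization gap and bin the models by gap range, as in Figure 5. It assumes the gap is measured as training accuracy minus test accuracy, and the accuracy arrays are placeholders for the recorded results of the 2000 models.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Placeholder recorded accuracies for the 2000 models in the set.
train_acc = rng.uniform(0.80, 1.00, size=2000)
test_acc = np.clip(train_acc - rng.uniform(0.0, 0.6, size=2000), 0.0, 1.0)

# Generalization gap assumed here as training accuracy minus test accuracy.
gap = train_acc - test_acc

# Count models falling into equal-width generalization-gap ranges.
bins = np.arange(0.0, 0.7 + 1e-9, 0.1)
counts, _ = np.histogram(gap, bins=bins)
for lo, hi, c in zip(bins[:-1], bins[1:], counts):
    print(f"gap in [{lo:.1f}, {hi:.1f}): {c} models")

plt.hist(gap, bins=bins, edgecolor="black")
plt.xlabel("Generalization gap")
plt.ylabel("Number of models")
plt.show()
```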
In conclusion, this section introduced a large-scale model set that addresses the limitations of existing sets. The diverse range of generalization abilities and the consistent distribution across different model structures make this model set valuable for studying and validating generalization assessment methods.