
Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks

Faculty of Science and Engineering, Doshisha University, Kyoto 602-0898, Japan
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Signals 2024, 5(4), 721-735
Submission received: 29 June 2024 / Revised: 27 October 2024 / Accepted: 28 October 2024 / Published: 4 November 2024


Neural networks have inductive biases owing to the assumptions associated with the selected learning algorithm, datasets, and network structure. Specifically, convolutional neural networks (CNNs) are known for their tendency to exhibit textural biases. This bias is closely related to image classification accuracy. Aligning the model’s bias with the dataset’s bias can significantly enhance performance in transfer learning, leading to more efficient learning. This study aims to quantitatively demonstrate that increasing shape bias within the network by varying kernel sizes and dilation rates improves accuracy on shape-dominant data and enables efficient learning with less data. Furthermore, we propose a novel method for quantitatively evaluating the balance between texture bias and shape bias. This method enables efficient learning by aligning the biases of the transfer learning dataset with those of the model. Systematically adjusting these biases allows CNNs to better fit data with specific biases. Compared to the original model, an accuracy improvement of up to 9.9% was observed. Our findings underscore the critical role of bias adjustment in CNN design, contributing to developing more efficient and effective image classification models.

1. Introduction

Neural networks have inherent inductive biases because of their learning algorithms, datasets, and architectural designs. These biases significantly influence the information that the network prioritizes while learning patterns from data and making predictions. In image processing, various model architectures, such as vision transformers and convolutional neural networks (CNNs), have been proposed and have demonstrated outstanding performance [1,2,3,4,5,6]. Among these architectures, CNNs have been reported to exhibit texture bias, which is an inductive bias inherent to these networks [7,8,9]. Owing to this, CNNs are particularly sensitive to local features and textures in images, often preferring their interpretation to the objects’ overall shape.
A primary reason for texture bias in CNNs is the convolutional layer structure. These layers apply filters to the local regions of an image and extract features, such as edges and textures [10,11,12]. This process is repeated over the entire image, and the resulting feature maps are propagated to subsequent layers. This convolution operation makes CNNs highly sensitive to local image details, which is considered to contribute to texture bias. Texture bias, in turn, is closely related to model performance in image-classification tasks. CNNs can perform highly accurate classifications on datasets rich in texture information, such as natural images. However, performance can decrease in tasks where shapes and contours are crucial factors, such as classifying abstract images like sketches and cartoons. Therefore, designing and training the model with an appropriate bias for specific tasks is crucial for maximizing its recognition accuracy.
Moreover, these biases significantly affect the performance of pre-trained CNN models for transfer learning in specific domains [13]. In domains wherein texture features are crucial, models with a texture bias can learn more efficiently. Conversely, models with a shape bias are expected to be more advantageous in domains where the shape is more important.
Recently, considerable advances have been made in understanding and adjusting CNN biases. For example, Geirhos et al. [9] revealed the differences in biases between CNNs and the human visual system and explored their impact on recognition robustness. Subsequent studies [14] discovered that models trained on large-scale datasets, such as vision transformers, exhibit less texture bias than CNNs. To address this texture bias, various approaches have been proposed, including methods that utilize multiple models [15,16,17] and image preprocessing techniques [18,19]. Furthermore, Li et al. [20] proposed learning methods to eliminate shape/texture biases, which led to improved adversarial robustness. Our previous work [13] introduced a method for amplifying shape bias and mitigating texture bias in CNNs by modifying the ImageNet dataset using simple image separation techniques, which resulted in improved accuracy for specific domain datasets.
This study builds on these findings by focusing on CNNs’ texture and shape biases and their responses to adjustments in the kernel size and layers of the network. Modifications to the kernel size and dilation can be easily implemented. The results indicate that appropriate tuning of these elements can improve the recognition performance of CNNs for specific datasets. Furthermore, we propose a new method for quantitatively assessing shape and texture biases, and demonstrate that it can be applied to a broader range of datasets than traditional methods. Aligning the data biases with those of the model enables effective and efficient learning during transfer learning. The contributions of this paper can be summarized as follows:
  • It quantitatively shows that increasing the kernel size and dilation step improves the shape bias. Additionally, it demonstrates that altering the kernel sizes and dilation step can significantly enhance the performance of CNNs for shape-dominant domains, such as sketches and cartoons.
  • It presents a new method for quantitatively assessing texture and shape biases. The proposed method offers the advantage of application to a broader range of datasets compared with conventional methods [9], which tend to rely on specific datasets. This new evaluation method enables more effective incorporation of texture and shape biases in the CNN transfer-learning process, improving the efficiency of transfer learning.
The remainder of this paper is organized as follows. Section 2 discusses previous studies on CNN biases, Section 3 describes the proposed methodology, Section 4 presents and discusses the experimental results, Section 5 discusses the considerations and limitations of this study, and Section 6 concludes the paper and presents directions for future work.

2. Related Works

2.1. Inductive Bias in Neural Networks

Neural networks exhibit various inductive biases, and extensive research has been conducted to understand these biases to aid in deciphering their black-box nature and improving learning efficiency [14,21,22,23,24]. Geirhos et al. [9] demonstrated that CNN models tend to prioritize texture information when performing object recognition, exhibiting an algorithmic approach different from the shape bias inherent in human perception. Cao et al. [25] showed that even randomly initialized CNNs possess an inductive bias that naturally focuses on objects. Additionally, transformers’ self-attention mechanisms function as filters that emphasize low-frequency information while suppressing high-frequency information [26]. Other studies have leveraged the unique characteristics of CNNs [27,28,29].
Graph neural networks (GNNs) exhibit relational inductive biases [30]. This bias enhances the capability of deep learning models to handle structured data, a critical component of human-like intelligence [31,32,33]. There are also proposals for neural network designs focusing on directed acyclic graphs, introducing a strong inductive bias that emphasizes partial orders [34].
Language models also possess inductive biases [35,36]. Studies, such as Papadimitriou et al. [37], have investigated the incorporation of structural cues, such as recursive processing and non-context-free grammatical relationships, into transformer language models, examining how different structural biases affect language learning. Further research has explored the effectiveness of syntactic inductive biases in learning low-resource languages [38] and leveraging the inductive biases of large language models to perform abstract textual reasoning [39].

2.2. Analyzing Shape/Texture Bias in Vision Models

Deep learning-based vision models are known to possess inductive biases related to shape and texture. Geirhos et al. [14] pointed out that CNN models rely heavily on texture information, leading to misclassifications of images with similar textures. Models that successfully classify images using only local texture information have also been reported. This texture bias is considered a factor contributing to the existence of adversarial examples. Moreover, multimodal models, such as vision and language models, possess a stronger shape bias than other models but still exhibit a stronger texture bias than humans [40]. In contrast, large-scale vision transformers have shown tendencies more aligned with human perception [41].

2.3. Improving Shape Bias

CNNs exhibit object-recognition capabilities similar to or better than those of humans; however, unlike humans, CNNs tend to recognize objects by primarily emphasizing local texture information rather than shape information [9]. Geirhos et al. [14] constructed a dataset to test this bias and analyzed the texture bias. Subsequently, they evaluated the biases of humans and a wide range of vision models and proposed methods for aligning them in terms of shape and texture.
Lee et al. [18] noted that methods aimed at enhancing existing shape biases could result in dataset alterations, leading to deterioration in the accuracy on the original dataset. Therefore, they improved the shape bias without compromising the integrity of the dataset by separating images into foreground and background and recombining them, which also improved the accuracy on the original dataset. Yoshihara et al. [19] discovered that training on both normal and blurred images could make CNNs robust against image blurring and slightly reduce their texture bias; however, they did not achieve fully robust object-recognition capabilities similar to those of humans. Chung et al. [42] focused on biases and suggested that the root cause of texture bias lies within the data distribution. They confirmed that texture bias can be mitigated and out-of-distribution (OOD) performance can be improved by redesigning the label space. Lukasik et al. [43] improved OOD performance and model robustness using frequency normalization for the kernel parameters. By contrast, generative models that have been used for classification have demonstrated biases resembling human shape bias, as well as human-level OOD performance.
Ding et al. [44] addressed the effective receptive field of vision transformers and suggested that enlarging the kernel size was more effective than deepening the model layers for expanding the receptive field. Li et al. [20] acknowledged that although shape and texture serve as complementary cues, datasets tend to be biased toward either shape or texture. They developed a straightforward algorithm that utilizes images with conflicting shape and texture information to augment the training data with the aim of eliminating both shape and texture biases.
Some studies have also attempted to expand networks to mitigate texture bias. Ye et al. [15] supplemented the conventional learning network with a separate network designed to extract edge features from input images, and employed distinct learning processes for edge and texture features. By simply introducing a new network, they enabled learning without altering the original network structure. Using these two networks improved the image classification and detection accuracy and increased the shape bias.
Yunhao et al. [16] divided images into shape, texture, and color components, and learned each component to organize and summarize their information to determine their respective contributions. Satyam et al. [17] further decomposed images into grayscale, edges, silhouettes, and texture, and extracted features in specific modalities and discussed explainability using biases. Padmanabhan et al. [45] leveraged implicit prior knowledge within data for few-shot learning to learn more generalized features. This approach enhanced the generalization and robustness of the model, and lowered its vulnerability to texture biases and adversarial perturbations.
Discussions regarding shape and texture biases in vision models often focus on differences between human perception and robustness against noise. Additionally, as shown in Table 1, methods to increase shape bias frequently involve data preprocessing or utilizing multiple networks. This study proposes evaluation metrics that leverage texture and shape biases to achieve efficient and effective transfer learning. Moreover, we focus on fundamental architectural components, such as kernel size and dilation rate, to enhance shape bias.

3. Proposed Methods

3.1. Shape/Texture Bias Metric

Establishing a metric for assessing bias is essential to preemptively comprehend the potential prejudices exhibited by pre-trained models. This study mainly considers the congruity between model biases and data. Although Geirhos et al. [9] have proposed a method, it can be applied to only a limited range of datasets and requires a greater degree of versatility. Furthermore, their method is primarily designed to compare human perception with vision models and is unsuited for comparing them with datasets. The proposed method introduces a more generic bias metric to address this issue.
Figure 1 presents an overview of the procedure for measuring shape/texture bias, and Algorithm 1 details how the combined shape/texture images are generated. Shape- and texture-dominant images were created using an edge-preserving smoothing method [46]. The original image $I_{\mathrm{orig}}$ is smoothed to obtain a shape-dominant image $I_{\mathrm{shape}}$, and a texture-dominant image $I_{\mathrm{tex}}$ is then created by offsetting the difference between $I_{\mathrm{orig}}$ and $I_{\mathrm{shape}}$ by half the maximum pixel value (128). Subsequently, the two images are concatenated to obtain a combined shape/texture image $I_{\mathrm{comb}}$.
Shah et al. [47] reported that neural networks favor more straightforward features. This implies that a CNN model classifies using whichever features it finds more favorable. Accordingly, models trained on a simple two-class classification task are used to measure the shape or texture bias.
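Algorithm 1: Algorithm for combining shape and texture images
Input: $I_{\mathrm{orig}}$: Original image
Output: $I_{\mathrm{comb}}$: Combined shape/texture image
Data: $L_0\,\mathrm{smoothing}$: Smoothing algorithm, $I_{\mathrm{shape}}$: Shape image, $I_{\mathrm{tex}}$: Texture image
$I_{\mathrm{shape}} \leftarrow L_0\,\mathrm{smoothing}(I_{\mathrm{orig}})$
$I_{\mathrm{tex}} \leftarrow (I_{\mathrm{orig}} - I_{\mathrm{shape}}) + 128$
$I_{\mathrm{comb}} \leftarrow \mathrm{concatenate}(I_{\mathrm{shape}}, I_{\mathrm{tex}})$
return $I_{\mathrm{comb}}$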
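Algorithm 1 can be sketched in a few lines of NumPy. The block below is only an illustration under stated assumptions: `box_smooth` is a plain box blur used as a stand-in so that the example is self-contained, and it is not the edge-preserving smoothing of [46]; the function names, the vertical stacking order (shape on top, texture below), and the synthetic input image are likewise assumptions made for this sketch.

```python
import numpy as np

def box_smooth(img: np.ndarray, radius: int = 2) -> np.ndarray:
    """Stand-in smoother (plain box blur); NOT the edge-preserving method of [46]."""
    pad = ((radius, radius), (radius, radius), (0, 0))
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    size = 2 * radius + 1
    # Sum all shifted copies of the image inside the window, then average.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1], :]
    return out / (size * size)

def combine_shape_texture(i_orig: np.ndarray, smooth=box_smooth) -> np.ndarray:
    """Stack the shape image (top) on the texture image (bottom)."""
    i_shape = smooth(i_orig)
    # Residual shifted by 128 so that a zero difference maps to mid-gray.
    i_tex = (i_orig.astype(np.float32) - i_shape) + 128.0
    i_comb = np.concatenate([i_shape, i_tex], axis=0)  # vertical concatenation
    return np.clip(i_comb, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # synthetic RGB image
combined = combine_shape_texture(image)
print(combined.shape)  # twice the input height, same width and channels
```

Passing a different callable as `smooth` swaps in another smoothing method without changing the rest of the pipeline.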
Finally, the focus area is calculated using Grad-CAM [48]. The method for measuring bias using Grad-CAM is outlined in Algorithm 2. An attention area predominantly on the upper side (shape-dominant image) indicates shape bias, whereas that on the lower side (texture-dominant image) suggests texture bias. The specific bias values are calculated using the attention region heatmaps. Additionally, an object mask ensures that only the classification object is evaluated and only the area where the background image has been removed is assessed. Shape bias is calculated using Grad-CAM heatmaps: first, for both shape-dominant and texture-dominant images, the sum of the Grad-CAM heatmap values is divided by the image size, excluding the background areas. $H_s$ and $H_t$ denote the values for the shape- and texture-dominant images, respectively. To evaluate whether the CNN model exhibits shape or texture bias, these two values are used to compute the proposed shape/texture bias $B$ as follows:
$$B = \frac{H_s}{H_s + H_t} \tag{1}$$
As indicated in Equation (1), the shape-bias value ranges from zero to one, with values closer to one indicating a higher shape bias and those closer to zero indicating a lower shape bias (higher texture bias).
Algorithm 2: Measurement of shape/texture bias using a Grad-CAM implementation in a PyTorch-like framework
    import numpy as np
    from torch.utils.data import DataLoader, Dataset
    from PIL import Image

    for combined_img, label in combined_image_data_loader:
        # Load the object mask image as an array (zero on the background)
        mask_img = np.asarray(Image.open(mask_img_path), dtype=np.float32)
        # Compute the Grad-CAM heatmap and suppress the background
        heatmap = GradCAM(combined_img, model, ...) * mask_img
        # Average heatmap values over the upper (shape) and lower (texture) halves
        half = heatmap.shape[0] // 2
        upper_mean = np.mean(heatmap[:half, :])
        lower_mean = np.mean(heatmap[half:, :])
        # Shape/texture bias: B = H_s / (H_s + H_t)
        shape_texture_bias = upper_mean / (upper_mean + lower_mean)
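Algorithm 2 averages each half of the masked heatmap over all of its pixels, whereas the text describes dividing by the image size excluding the background. A minimal sketch of the latter reading is given below; the function name, the synthetic heatmap, and the convention of returning 0.5 when both halves are empty are assumptions for illustration, not part of the original procedure.

```python
import numpy as np

def shape_texture_bias(heatmap: np.ndarray, mask: np.ndarray) -> float:
    """Return B = H_s / (H_s + H_t) for a vertically stacked shape/texture image."""
    half = heatmap.shape[0] // 2
    obj = mask > 0  # True on object pixels, False on background
    masked = np.where(obj, heatmap, 0.0)
    # Mean activation over object pixels only, computed separately per half.
    h_s = masked[:half].sum() / max(int(obj[:half].sum()), 1)
    h_t = masked[half:].sum() / max(int(obj[half:].sum()), 1)
    total = h_s + h_t
    # 0.5 (no preference) is an assumed convention for an all-zero heatmap.
    return float(h_s / total) if total > 0 else 0.5

# Synthetic example: stronger attention on the upper (shape) half.
heatmap = np.zeros((8, 4))
heatmap[:4] = 0.75
heatmap[4:] = 0.25
mask = np.ones((8, 4))
bias = shape_texture_bias(heatmap, mask)
print(bias)
```

Normalizing by the number of object pixels keeps the two halves comparable even when the masks of the shape and texture images cover different areas.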

3.2. Enhancing Shape Bias Through Architectural Modifications

In prior work, approaches such as replacing the texture of images in a training dataset with those from other images or removing textures altogether have been employed to enhance CNNs’ shape bias. These methods encourage the model to ignore local textures and prioritize global shape information during learning. Additionally, texture bias, an inductive bias inherent to CNNs, can be attributed to the convolution operation, which correlates with surrounding pixels. Due to their repeated convolutional operations, CNNs tend to emphasize local textures.
To address this, we propose strategies such as increasing the kernel size of convolutional layers or expanding the dilation rate. These strategies broaden the pixel range used in each convolutional operation, thereby expanding the CNN’s receptive field and enabling the network to consider a wider array of pixel arrangements. Moreover, when the dilation rate is increased, pixels between the convolved pixels remain unprocessed, which may allow the network to ignore local continuous pixel textures. For example, a convolution with a kernel size of three and a dilation rate of two covers the same 5 × 5 area in a single operation as a convolution with a kernel size of five. However, with a dilation rate of two, gaps are introduced between the pixels used for computation, allowing the model to disregard continuous local textures.
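The area covered by a single dilated convolution follows the standard relation $k_{\mathrm{eff}} = d\,(k - 1) + 1$ for kernel size $k$ and dilation rate $d$. The short sketch below evaluates it for the kernel and dilation settings considered in this study and lists which pixel offsets are actually sampled; the helper names are hypothetical.

```python
def effective_kernel_extent(kernel_size: int, dilation: int = 1) -> int:
    """Side length covered by one dilated convolution: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1

def tap_offsets(kernel_size: int, dilation: int = 1) -> list[int]:
    """1-D offsets (relative to the center) that an odd-sized kernel samples."""
    half = (kernel_size - 1) // 2
    return [dilation * i for i in range(-half, half + 1)]

# (kernel size, dilation) pairs: larger kernels, then larger dilation rates.
for k, d in [(3, 1), (5, 1), (7, 1), (3, 2), (3, 3)]:
    print(k, d, effective_kernel_extent(k, d), tap_offsets(k, d))
```

A kernel size of three with a dilation rate of two spans five pixels but samples only offsets -2, 0, and 2, which is the gap structure that may let the network skip over fine, continuous textures.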

4. Experiments

In this section, the effectiveness of the proposed bias metric, along with its accuracy, is verified on shape-dominant data using pre-trained models with different kernel sizes and dilation factors.

4.1. Pre-Training

The model was pre-trained with a larger kernel size and an extended model architecture. Typically, kernel size augmentation substantially increases the number of parameters in a standard convolution. To circumvent this, ResNeXt50 32 × 4d [49], which uses a grouped convolution, was employed. Many vision models composed of CNNs are derivatives of ResNet. ResNeXt was adopted due to its relatively smaller increase in the number of parameters compared to ResNet when increasing the kernel size. We compared architectures having kernel sizes of three, five, and seven and dilation factors of one, two, and three, with a kernel size of three and a dilation factor of one representing the original ResNeXt architecture.
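The effect of grouped convolution on parameter growth can be checked with simple arithmetic: a 2-D convolution without bias has $(C_{\mathrm{in}}/g)\,C_{\mathrm{out}}\,k^2$ weights for $g$ groups. The sketch below uses a width of 128 channels and 32 groups purely as illustrative values; these numbers and the helper name are assumptions and are not taken from the experiments.

```python
def conv_weight_count(in_ch: int, out_ch: int, kernel_size: int, groups: int = 1) -> int:
    """Number of weights in a 2-D convolution without bias."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return (in_ch // groups) * out_ch * kernel_size * kernel_size

channels, groups = 128, 32  # illustrative width and number of groups (assumed)
for k in (3, 5, 7):
    standard = conv_weight_count(channels, channels, k)
    grouped = conv_weight_count(channels, channels, k, groups)
    print(f"k={k}: standard={standard}, grouped={grouped}")
```

Both variants grow with $k^2$, but the grouped layer starts from a count that is smaller by a factor of $g$, so the absolute increase is much smaller. Changing only the dilation rate leaves the weight count unchanged.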
ImageNet-1k [50] was employed as the dataset, and training was conducted over 100 epochs. ImageNet is composed of natural images and is considered a texture-biased dataset. For a focused comparison based solely on the kernel size and dilation, all other hyperparameters were kept constant across the architectures. Python 3.11, CUDA 11.8, and cuDNN 8.6 were utilized in the experiment. The training was performed using PyTorch 2.0.1 (accessed on 27 October 2024). In pre-training, the initial learning rate was set to 0.4, and a cosine scheduler with warm-up was employed. The batch size was set to 256, and Stochastic Gradient Descent (SGD) was used for optimization. Four NVIDIA V100 GPUs were used for pre-training.
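A cosine schedule with warm-up can be written as a small function of the epoch index. The sketch below is one common formulation, assuming a linear warm-up followed by cosine decay to zero, updated once per epoch; the warm-up length of five epochs is a hypothetical value, since the text does not state it, and the function name is likewise an assumption.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 0.4, warmup_steps: int = 5) -> float:
    """Linear warm-up to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Ramp up linearly so that the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# One learning rate per epoch for a 100-epoch run (warm-up length assumed).
schedule = [lr_at(epoch, total_steps=100) for epoch in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

The same function covers a schedule without warm-up by setting `warmup_steps=0`.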
The pre-training results are listed in Table 2. During the pre-training phase, the accuracy on the ImageNet-1k validation data was the highest for the original ResNeXt, with a discernible decrease in accuracy as either the kernel size or dilation increased. Subsequently, these pre-trained models were employed for transfer learning and shape-bias evaluations.

4.2. Transfer Learning and Bias Evaluation

Models with shape bias are expected to exhibit enhanced accuracy and training efficiency for shape-dominant data. Thus, to evaluate the shape bias of the model, its accuracy was evaluated on shape-dominant data. The employed datasets encompassed logo [51], cartoon, sketch [52], and silhouette images. The silhouette datasets used were MPEG7 [53] and Animal2000 [54], which are considered relatively small in scale. The logo dataset comprised a collection of images featuring logos from various companies and products. The cartoon dataset consisted of cropped images of character faces from anime and similar media. The sketch dataset included line drawings of various objects, such as airplanes and apples, created in a short time. The MPEG7 dataset contained silhouette images of diverse objects, including apples and butterflies, while the Animal2000 dataset consisted specifically of animal silhouette images. The number of images included in each dataset is presented in Table 3.
Classification and evaluation were performed on shape-dominant data or images with little local texture. Logo images often feature distinctive shapes, whereas sketch images and silhouette images solely comprise object contours, devoid of color or local textures. Additionally, compared with natural images, cartoon images contain fewer local textures, with identical objects often depicted in the same color.
In transfer learning, only the weights of the last classification layer are updated, while those of the CNN-based feature extractors are fixed. This allows evaluation of the effect of the feature-extractor bias on the classification accuracy. We also evaluated the accuracy after one and two epochs to compare the learning efficiency and demonstrate the impact of shape bias on the learning speed. In transfer learning, the learning rate was set to 0.01, and a cosine scheduler was used. Additionally, the batch size was set to 256, and the optimizer used was SGD. The model was trained for 40 epochs.
Additionally, the proposed bias-evaluation metric was compared with those employed in previous studies. Conventional shape-bias metrics employ style transfer to stylistically transform images, generating images with two disparate labels for the global shape and local texture (e.g., the global shape of a car and the local texture of a cat). Additionally, they used 16 classes: airplanes, bears, bicycles, birds, boats, bottles, cars, cats, chairs, clocks, dogs, elephants, keyboards, knives, ovens, and trucks. Moreover, the number of model outputs matching the global shape label was defined as $N_s$, and the number matching the local texture label as $N_t$. Using these, they calculated the degree of the shape bias $S$ as follows:
$$S = \frac{N_s}{N_s + N_t} \tag{2}$$
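The conventional shape bias $S$ defined above reduces to counting predictions on cue-conflict images. A minimal sketch follows; the function name and the toy labels are hypothetical and serve only to illustrate the counting rule.

```python
def conventional_shape_bias(predictions, shape_labels, texture_labels) -> float:
    """S = N_s / (N_s + N_t), counted over cue-conflict images."""
    n_s = n_t = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            n_s += 1  # decision follows the global shape
        elif pred == texture:
            n_t += 1  # decision follows the local texture
        # Predictions matching neither cue are ignored.
    if n_s + n_t == 0:
        raise ValueError("no prediction matched either cue")
    return n_s / (n_s + n_t)

# Toy example: four cue-conflict images (labels are illustrative only).
preds = ["car", "cat", "car", "dog"]
shapes = ["car", "car", "car", "bear"]
textures = ["cat", "cat", "dog", "clock"]
score = conventional_shape_bias(preds, shapes, textures)
print(score)
```

Because only predictions that match one of the two cue labels are counted, this score depends on the classifier sharing the label set of the cue-conflict images, which is one reason it is tied to specific datasets.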
Subsequently, the quantity of data used for training was reduced to evaluate performance under constrained data conditions. Experiments were conducted to verify that aligning the model’s bias with the inherent bias of the data facilitated efficient learning, even with a minimal dataset. The datasets employed for these experiments included the cartoon, sketch, and logo datasets from Table 3, which possessed adequate data. Experiments were carried out using 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 40% of the total training data, with the same hyperparameter values as transfer learning.
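The reduced-data conditions can be reproduced by drawing a fixed fraction of the training indices. The sketch below uses a seeded uniform random subset; whether the original experiments sampled uniformly, per class, or with nested subsets is not stated, so the sampling scheme, the seed, and the dataset size of 1000 are all assumptions.

```python
import random

def subsample_indices(n_samples: int, fraction: float, seed: int = 0) -> list[int]:
    """Pick a reproducible random subset containing `fraction` of the indices."""
    k = max(1, round(n_samples * fraction))
    return sorted(random.Random(seed).sample(range(n_samples), k))

# Fractions of the training data used in the reduced-data experiments.
fractions = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
subsets = {f: subsample_indices(1000, f) for f in fractions}  # 1000 is a placeholder size
print({f: len(idx) for f, idx in subsets.items()})
```

The returned indices could then be used to restrict a dataset to the chosen subset before running the same transfer-learning configuration.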

4.3. Results

The experimental results are listed in Table 4, and the Grad-CAM visualization of the proposed method is shown in Figure 2.
For the shape-dominant datasets, the original model exhibited the lowest accuracy among all variants except the one with a kernel size of five. Although the original ResNeXt model exhibited the highest accuracy for the ImageNet validation set, the results suggest that changing the kernel size or dilation affects the shape/texture bias. The increase in the accuracy for the shape-dominant datasets as the kernel size and dilation factor increased also suggests that shape bias increased with these architectural parameters.
Comparisons of the shape/texture bias metrics indicated that the indices from prior studies increased the shape bias compared to the original ResNeXt. However, the shape-dominant dataset accuracies were inconsistent. According to the conventional metric, the model with the highest shape bias had a kernel size of five, followed by that with a dilation factor of three. The accuracy for sketch images, which are strongly indicative of shape bias, was the highest for a dilation factor of three, followed by a dilation factor of two, which was inconsistent with the learning accuracy after transfer learning. However, the proposed shape/texture bias metric exhibited congruence between the post-transfer-learning accuracy and the evaluation metric across all datasets. This may be because the conventional shape-bias evaluation metric was developed to compare human visual processing with that of vision models. Consequently, the relationship between transfer learning and shape/texture bias was not prioritized. By contrast, the proposed metric was designed to specifically determine whether an image’s shape or texture was preferred. Therefore, it was expected that the proposed method would be able to clearly elucidate the relationship between data and shape bias.
Next, the results of the transfer learning for each dataset were analyzed. As mentioned previously, sketch images represented the most shape-dominant data, followed by cartoons and logo images. Models with a high shape bias owing to an altered dilation factor showed the most significant improvement in accuracy for sketch images, followed by cartoons and logo images. In particular, on the logo dataset, models with larger kernel sizes achieved higher accuracy than models with larger dilation. This can be attributed to the relatively weak shape features and the presence of texture in the logo images. Additionally, the gaps introduced by dilation could have affected the performance, making a larger kernel size more beneficial. However, for the sketch images, which were the most shape-dominant, the two pre-trained models with modified dilation factors outperformed the other models. This suggests that dilation-factor changes, which expand the receptive field of a CNN and introduce gaps in the kernel, are more efficient for increasing the shape bias.
As shown in Table 5, in the early stages of training, all shape-biased variants surpassed the accuracy of the original architecture within two epochs. However, for the model with a kernel size of five, although the early accuracy exceeded that of the original, the final accuracy fell below it. A possible explanation is that the features extracted by this model’s encoder are readily usable by the classifier but capture the image characteristics less fully than those of the original encoder, so the accuracies invert as training progresses. This phenomenon was observed across all shape-dominant datasets and warrants further investigation of the model with a kernel size of five.
The results of the experiments with reduced training data are shown in Figure 3. Increasing either the kernel size or the dilation rate facilitated efficient learning with limited data. This was particularly evident in datasets where shape information was paramount, such as the sketch and cartoon datasets. These findings suggest that in transfer learning scenarios targeting shape-dominant datasets, employing models with a shape bias enables more efficient learning and enhances accuracy compared to traditional methods.
As demonstrated in the above experiments, increasing the kernel size and dilation enhanced shape bias. Furthermore, the increase in dilation contributed more significantly to the enhancement of shape bias compared to the increase in kernel size. This is likely because the increase in dilation introduced non-convolutional pixels between those involved in the convolution operation, allowing the model to ignore fine, continuous textures. Additionally, the proposed shape/texture bias measurement method offered a more quantitative evaluation of the relationship between the model and the data compared to existing bias measurement techniques.

5. Discussion and Limitations

Most existing research has increased shape bias by modifying datasets or preparing multiple images. However, this study focuses on the fundamental structures of CNN architecture, such as kernel size and dilation, to enhance shape bias. Therefore, this approach can be applied to almost all models utilizing CNNs. By enlarging these sizes, we have effectively increased the receptive field of CNN models and ignored small, continuous textures through dilation, thereby enhancing shape bias. Furthermore, by augmenting shape bias, we have improved the efficiency of transfer learning in datasets where shape information is paramount. Consequently, using models with texture bias for texture-rich natural images or fine-grained classification data, and models with shape bias for data where shape information is critical, such as sketch or cartoon images, can enhance accuracy and learning efficiency. Additionally, the proposed evaluation method enhances the efficiency of model transfer learning, making it a valuable approach for improving performance and learning with limited data. Moreover, this approach is compatible with any classification model since bias can be evaluated through short-term training of only the final layer. This method is particularly effective in high-precision scenarios and when training data is sparse, offering significant benefits in both cases.
However, this study has several limitations. When varying the kernel size, the use of standard convolution operations typically increases the number of parameters. In such cases, the use of grouped convolution can mitigate the increase in parameters; however, this approach inevitably results in higher computational costs, thereby increasing the computation time. Furthermore, while there may be a desire to enhance the texture bias further, in models where the kernel size is three and the dilation rate is set to one, achieving any further enhancement to the texture bias becomes challenging. It is impractical to experiment with all datasets that are either shape-dominant or otherwise; thus, this research utilizes data with shape features. Additionally, as this study focuses on learning efficiency with limited data, the datasets used for evaluation are relatively small, and learning is conducted via linear probing, which only trains the classification layer. Hence, the effects of bias in training with large-scale datasets and the changes in bias through full fine-tuning remain unverified. Additionally, the newly proposed method for measuring shape and texture bias evaluates the biases present in vision models. To more accurately demonstrate the relationship between data and the model, techniques that measure data bias or assess both the data and the model simultaneously must be developed.

6. Conclusions

This study proposed and investigated a novel shape/texture bias-evaluation metric. The results showed that, unlike the metrics used in previous studies, the proposed metric is consistent with the accuracy obtained through transfer learning. It therefore allows efficient exploitation of the shape/texture biases in transfer learning. Furthermore, we demonstrated that increasing the kernel size or dilation factor of the model can significantly improve performance on shape-dominant datasets. As a result, bias can be easily manipulated within models composed of CNNs.
In future work, extensive experimentation and analysis must be conducted for the model with a kernel size of five. Additionally, developing methods for measuring dataset bias could enhance the consistency of the proposed metric. Furthermore, although this study proposed and experimented with shape/texture biases in classification models, future work should extend the investigation of inductive biases in reconstruction and generative models.

Author Contributions

Conceptualization, A.I. and M.O.; Data curation, A.I.; Funding acquisition, M.O.; Investigation, A.I.; Methodology, A.I.; Project administration, A.I.; Software, A.I.; Supervision, M.O.; Validation, A.I.; Writing—original draft, A.I.; Writing—review and editing, M.O. All authors have read and agreed to the published version of the manuscript.


Funding

This work was supported in part by MEXT KAKENHI JSPS23K11174 and MEXT Promotion of Distinctive Joint Research Center Program JPMXP 0621467946.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.


Acknowledgments

We would like to express our gratitude to the members of the Intelligent Mechanism Laboratory for engaging in discussions with us regarding this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.


References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  2. Liu, H.; Zhang, C.; Deng, Y.; Xie, B.; Liu, T.; Li, Y.F. TransIFC: Invariant Cues-aware Feature Concentration Learning for Efficient Fine-grained Bird Image Classification. IEEE Trans. Multimed. 2023, 1–14.
  3. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14408–14419.
  4. Liu, H.; Zhang, C.; Deng, Y.; Liu, T.; Zhang, Z.; Li, Y.F. Orientation Cues-Aware Facial Relationship Representation for Head Pose Estimation via Transformer. IEEE Trans. Image Process. 2023, 32, 6289–6302.
  5. Liu, T.; Liu, H.; Yang, B.; Zhang, Z. LDCNet: Limb Direction Cues-Aware Network for Flexible HPE in Industrial Behavioral Biometrics Systems. IEEE Trans. Ind. Inform. 2024, 20, 8068–8078.
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  7. Azad, R.; Fayjie, A.R.; Kauffmann, C.; Ben Ayed, I.; Pedersoli, M.; Dolz, J. On the texture bias for few-shot CNN segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Event, 5–9 January 2021; pp. 2674–2683.
  8. Hermann, K.; Chen, T.; Kornblith, S. The origins and prevalence of texture bias in convolutional neural networks. Adv. Neural Inf. Process. Syst. 2020, 33, 19000–19015.
  9. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231.
  10. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
  12. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: (accessed on 2 June 2024).
  13. Iwata, A.; Okuda, M. CNN Pretrained Model with Shape Bias using Image Decomposition. APSIPA Trans. Signal Inf. Process. 2023, 12, e42.
  14. Geirhos, R.; Narayanappa, K.; Mitzkus, B.; Thieringer, T.; Bethge, M.; Wichmann, F.A.; Brendel, W. Partial success in closing the gap between human and machine vision. Adv. Neural Inf. Process. Syst. 2021, 34, 23885–23899.
  15. Ye, Z.; Gao, Z.; Cui, X.; Wang, Y.; Shan, N. DuFeNet: Improve the Accuracy and Increase Shape Bias of Neural Network Models. Signal Image Video Process. 2022, 16, 1153–1160.
  16. Ge, Y.; Xiao, Y.; Xu, Z.; Wang, X.; Itti, L. Contributions of shape, texture, and color in visual recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 369–386.
  17. Mohla, S.; Nasery, A.; Banerjee, B. Teaching CNNs to Mimic Human Visual Cognitive Process & Regularise Texture-Shape Bias. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 1805–1809.
  18. Lee, S.; Hwang, I.; Kang, G.C.; Zhang, B.T. Improving robustness to texture bias via shape-focused augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4323–4331.
  19. Yoshihara, S.; Fukiage, T.; Nishida, S. Does training with blurred images bring convolutional neural networks closer to humans with respect to robust object recognition and internal representations? Front. Psychol. 2023, 14, 1047694.
  20. Li, Y.; Yu, Q.; Tan, M.; Mei, J.; Tang, P.; Shen, W.; Yuille, A.; Xie, C. Shape-texture debiased neural network training. arXiv 2020, arXiv:2010.05981.
  21. Goyal, A.; Bengio, Y. Inductive biases for deep learning of higher-level cognition. Proc. R. Soc. A 2022, 478, 20210068.
  22. Zheng, J.; Li, X.; Lucey, S. Convolutional Initialization for Data-Efficient Vision Transformers. arXiv 2024, arXiv:2401.12511.
  23. Cohen, N.; Shashua, A. Inductive bias of deep convolutional networks through pooling geometry. arXiv 2016, arXiv:1605.06743.
  24. Wang, Z.; Wu, L. Theoretical Analysis of the Inductive Biases in Deep Convolutional Networks. In Proceedings of the 37th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
  25. Cao, Y.H.; Wu, J. A random CNN sees objects: One inductive bias of CNN and its applications. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2022; pp. 194–202.
  26. Shin, Y.; Choi, J.; Wi, H.; Park, N. An attentive inductive bias for sequential recommendation beyond the self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 8984–8992.
  27. Liu, H.; Zheng, C.; Li, D.; Shen, X.; Lin, K.; Wang, J.; Zhang, Z.; Zhang, Z.; Xiong, N.N. EDMF: Efficient Deep Matrix Factorization With Review Feature Learning for Industrial Recommender System. IEEE Trans. Ind. Inform. 2022, 18, 4361–4371.
  28. Liu, H.; Fang, S.; Zhang, Z.; Li, D.; Lin, K.; Wang, J. MFDNet: Collaborative Poses Perception and Matrix Fisher Distribution for Head Pose Estimation. IEEE Trans. Multimed. 2022, 24, 2449–2460.
  29. Liu, H.; Liu, T.; Zhang, Z.; Sangaiah, A.K.; Yang, B.; Li, Y. ARHPE: Asymmetric Relation-Aware Representation Learning for Head Pose Estimation in Industrial Human–Computer Interaction. IEEE Trans. Ind. Inform. 2022, 18, 7107–7117.
  30. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261.
  31. Ringsquandl, M.; Sellami, H.; Hildebrandt, M.; Beyer, D.; Henselmeyer, S.; Weber, S.; Joblin, M. Power to the relational inductive bias: Graph neural networks in electrical power grids. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021; pp. 1538–1547.
  32. Oliva, M.; Banik, S.; Josifovski, J.; Knoll, A. Graph neural networks for relational inductive bias in vision-based deep reinforcement learning of robot control. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–9.
  33. Veličković, P. Everything is connected: Graph neural networks. Curr. Opin. Struct. Biol. 2023, 79, 102538.
  34. Thost, V.; Chen, J. Directed acyclic graph neural networks. arXiv 2021, arXiv:2101.07965.
  35. Chang, Y.; Bisk, Y. Language Models Need Inductive Biases to Count Inductively. arXiv 2024, arXiv:2405.20131.
  36. White, J.C.; Cotterell, R. Examining the inductive bias of neural language models with artificial languages. arXiv 2021, arXiv:2106.01044.
  37. Papadimitriou, I.; Jurafsky, D. Injecting structural hints: Using language models to study inductive biases in language learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 8402–8413.
  38. Rytting, C.; Wingate, D. Leveraging the inductive bias of large language models for abstract textual reasoning. Adv. Neural Inf. Process. Syst. 2021, 34, 17111–17122.
  39. Gessler, L.; Schneider, N. Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages? arXiv 2023, arXiv:2311.00268.
  40. Gavrikov, P.; Lukasik, J.; Jung, S.; Geirhos, R.; Lamm, B.; Mirza, M.J.; Keuper, M.; Keuper, J. Are Vision Language Models Texture or Shape Biased and Can We Steer Them? arXiv 2024, arXiv:2403.09193.
  41. Dehghani, M.; Djolonga, J.; Mustafa, B.; Padlewski, P.; Heek, J.; Gilmer, J.; Steiner, A.P.; Caron, M.; Geirhos, R.; Alabdulmohsin, I.; et al. Scaling vision transformers to 22 billion parameters. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 7480–7512.
  42. Chung, H.; Park, K.H. Shape Prior is Not All You Need: Discovering Balance Between Texture and Shape Bias in CNN. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 4160–4175.
  43. Lukasik, J.; Gavrikov, P.; Keuper, J.; Keuper, M. Improving Native CNN Robustness with Filter Frequency Regularization. Trans. Mach. Learn. Res. 2023, 1–36. Available online: (accessed on 28 June 2024).
  44. Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. arXiv 2022, arXiv:2203.06717.
  45. Padmanabhan, D.C.; Gowda, S.; Arani, E.; Zonooz, B. LSFSL: Leveraging Shape Information in Few-shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4970–4979.
  46. Xu, L.; Lu, C.; Xu, Y.; Jia, J. Image Smoothing via L0 Gradient Minimization. ACM Trans. Graph. 2011, 30, 1–12.
  47. Shah, H.; Tamuly, K.; Raghunathan, A.; Jain, P.; Netrapalli, P. The pitfalls of simplicity bias in neural networks. Adv. Neural Inf. Process. Syst. 2020, 33, 9573–9585.
  48. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  49. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
  50. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  51. Wang, J.; Min, W.; Hou, S.; Ma, S.; Zheng, Y.; Wang, H.; Jiang, S. Logo-2K+: A large-scale logo dataset for scalable logo classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6194–6201.
  52. Eitz, M.; Hays, J.; Alexa, M. How Do Humans Sketch Objects? ACM Trans. Graph. 2012, 31, 1–10.
  53. Latecki, L.; Lakamper, R.; Eckhardt, T. Shape descriptors for non-rigid shapes with a single closed contour. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), Hilton Head Island, SC, USA, 13–15 June 2000; Volume 1, pp. 424–429.
  54. Bai, X.; Liu, W.; Tu, Z. Integrating contour and skeleton for shape classification. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan, 27 September–4 October 2009; pp. 360–367.
Figure 1. The combined shape/texture images used to calculate the bias metric (on the left side) included a shape-dominant image in the upper part and a texture-dominant image in the lower part. This combined image is used for transfer learning in a two-class classification. Subsequently, these test-combined images are input into the model, and the shape/texture bias is calculated through gradient-weighted class activation mapping (Grad-CAM).
Figure 2. Results of Grad-CAM visualization used with the proposed shape/texture bias metric. In the case of the original ResNeXt (a texture-biased model), the heat map of the lower image (which is texture-dominant) turns red, indicating that the model is focusing on it. On the other hand, for the ResNeXt with a dilation rate of three (a shape-biased model), the heat map of the upper image (which is shape-dominant) turns red, indicating its focus. This demonstrates that simply increasing the dilation rate results in a stronger bias towards shapes in the model. (a) The visualization images were obtained by applying Grad-CAM to the original ResNeXt (texture-biased model). (b) The visualization images were obtained by applying Grad-CAM to ResNeXt with a dilation of three (shape-biased model).
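The idea behind reading a bias score from these heat maps can be sketched in a few lines. The exact definition of the proposed metric is given in the method section and is not reproduced here; the code below is only an illustrative assumption in which the score is the share of Grad-CAM activation that falls on the upper (shape-dominant) half of the combined image. The function name, the half-image split, and the normalization are hypothetical and should not be read as the authors' formula.

```python
from typing import Sequence


def upper_half_share(heatmap: Sequence[Sequence[float]]) -> float:
    """Fraction of non-negative heat-map mass located in the upper half.

    Assumed layout (from the figure captions): shape-dominant image on top,
    texture-dominant image at the bottom. Values above 0.5 would then suggest
    that the model attends more to the shape-dominant image.
    """
    rows = len(heatmap)
    if rows < 2:
        raise ValueError("heatmap needs at least two rows")
    split = rows // 2  # with an odd row count, the middle row counts as lower half
    upper = sum(max(v, 0.0) for row in heatmap[:split] for v in row)
    total = sum(max(v, 0.0) for row in heatmap for v in row)
    if total == 0.0:
        return 0.5  # no activation anywhere: treat as no preference
    return upper / total


if __name__ == "__main__":
    # Toy 4 x 2 heat map with most activation in the top two rows.
    toy = [[0.9, 0.8],
           [0.7, 0.6],
           [0.1, 0.2],
           [0.0, 0.1]]
    print(round(upper_half_share(toy), 3))
```

In practice, such a score would be averaged over many test images, with the heat maps produced by an actual Grad-CAM implementation rather than the toy array used here.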
Figure 3. Results of limiting the training data for each dataset. The accuracy rate is defined as the accuracy achieved with limited training data divided by the accuracy achieved with the entire dataset. The data ratio represents the proportion of the data used for training. (a) Results of reducing the amount of training data in the Logo dataset. (b) Results of reducing the amount of training data in the Cartoon dataset. (c) Results of reducing the amount of training data in the Sketch dataset.
Table 1. A summary of related work on shape bias and this study. Previous research has primarily focused on increasing shape bias through data preprocessing, augmentation, and multi-network approaches. In contrast, this study focuses on model architecture to enhance shape bias. Additionally, we propose a new method for measuring shape and texture bias, which represents one of the few existing approaches in this domain.
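| Category | Approach | Studies |
|---|---|---|
| Shape/texture bias metric | | Our work, [9] |
| Enhancing shape bias | Data preprocessing or augmentation | [9,13,18,20,42,45] |
| Enhancing shape bias | Multi network | [15,16,17] |
| Enhancing shape bias | Model architecture | Our work |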
Table 2. The results represent the accuracy of the ImageNet validation set after pre-training for each architecture, along with the corresponding number of parameters. ImageNet consists of natural images, and the results reflect performance on a texture-biased image set. The bold text in the table indicates the highest accuracy, while the underlined text represents the second-highest accuracy. The accuracy is highest for the original and decreases as the architecture changes more.
| Architecture | ImageNet Val Accuracy (%) | Parameters |
|---|---|---|
| Kernel size: 5 | 75.80 | 27,543,848 |
| Kernel size: 7 | 75.75 | 31,316,264 |
| Dilation: 2 | 70.58 | 25,028,904 |
| Dilation: 3 | 69.03 | 25,028,904 |
Table 3. The number of data in each dataset was analyzed. Additionally, experiments reducing the data volume were conducted for the cartoon, sketch, and logo datasets.
Table 4. Results to evaluate shape bias for each architectural change: accuracy on the shape-dominant dataset, two measures of shape bias, and the number of parameters. The shape-dominant dataset was trained five times and the average accuracy of those five times was taken as the final result. Higher values for each of the shape/texture bias indices indicate stronger shape bias. The same is true for the prior shape bias. The bold text in the table indicates the highest accuracy, while the underlined text represents the second-highest accuracy.
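| Architecture | Shape/Texture Bias | Shape Bias | Accuracy (%) | Accuracy (%) | Accuracy (%) | Accuracy (%) | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| Kernel size: 5 | 0.4773 | 0.2896 | 30.41 | 47.38 | 31.28 | 52.78 | 54.90 |
| Kernel size: 7 | 0.4887 | 0.2959 | 35.18 | 54.53 | 34.80 | 64.91 | 63.60 |
| Dilation: 2 | 0.5036 | 0.2778 | 40.92 | 52.16 | 32.26 | 66.06 | 60.60 |
| Dilation: 3 | 0.5196 | 0.2935 | 41.96 | 57.74 | 33.85 | 67.21 | 61.00 |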
Table 5. Accuracy in the initial phase of transfer learning. In both cases, the accuracy of the original ResNeXt is exceeded by two epochs. The shape bias in the early stages of learning results in a faster increase in learning accuracy.
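| Architecture | Sketch (%), 1 Epoch | Sketch (%), 2 Epoch | Cartoon (%), 1 Epoch | Cartoon (%), 2 Epoch | Logo (%), 1 Epoch | Logo (%), 2 Epoch |
|---|---|---|---|---|---|---|
| Kernel size: 5 | 2.400 | 9.100 | 6.487 | 16.50 | 6.050 | 13.86 |
| Kernel size: 7 | 2.875 | 11.05 | 8.558 | 22.88 | 7.973 | 16.69 |
| Dilation: 2 | 3.525 | 14.80 | 7.591 | 18.95 | 6.315 | 14.24 |
| Dilation: 3 | 4.275 | 15.30 | 11.11 | 23.05 | 7.096 | 15.55 |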
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Iwata, A.; Okuda, M. Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks. Signals 2024, 5, 721-735.

AMA Style

Iwata A, Okuda M. Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks. Signals. 2024; 5(4):721-735.

Chicago/Turabian Style

Iwata, Akinori, and Masahiro Okuda. 2024. "Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks" Signals 5, no. 4: 721-735.

APA Style

Iwata, A., & Okuda, M. (2024). Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks. Signals, 5(4), 721-735.
