1. Introduction
The brain acquires information from the external world through sensory systems [1], with approximately half of the human cerebral cortex involved in processing visual information [2,3,4]. The primary visual system serves as the foundation and first stage of visual processing in the human cortex, playing an important role in perceiving fundamental visual information [5]. This system encompasses the biological structures and processing functions spanning from the retina, through the lateral geniculate nucleus (LGN), to the primary visual cortex (V1) [6,7]. By integrating and processing this visual information, the primary visual system and the cerebral cortex together enable accurate perception and interpretation of the surrounding visual environment [6,8]. Understanding these mechanisms is crucial for advancing artificial vision systems towards greater sophistication and robustness.
The retina is composed of various neuronal cell types, including photoreceptors, bipolar cells, ganglion cells, horizontal cells, and amacrine cells [7,8,9]. Photoreceptors convert light into electrical signals [9], which are subsequently transmitted by bipolar cells to ganglion cells [8,10]. Bipolar, horizontal, and amacrine cells contribute to lateral modulation processes that enhance contrast and edge detection while also facilitating the processing of more complex visual information, such as motion detection and the regulation of light intensity [11,12]. The information processed within the retina is then transmitted from ganglion cells to cells of the lateral geniculate nucleus (LGN) [13]. The LGN serves as an intermediary relay station between the retina and the brain, while also carrying out initial processing of visual information [14]. Moreover, LGN cells exhibit receptive fields organized into concentric excitatory (ON) and inhibitory (OFF) regions, characterized by center-surround antagonistic structures [15]. The processed visual information is subsequently transmitted from LGN cells to the primary visual cortex [14,16].
In the primary visual cortex (V1), neurons carry out the initial recognition of visual features, encompassing color, motion direction, and orientation [13]. These distinct recognitions are predominantly accomplished by discrete neuronal subtypes [13,17]. For instance, a specific subtype known as simple cells exhibits remarkable selectivity towards a preferred stimulus orientation within its receptive field while displaying diminished or no responses to other orientations [18,19,20]. This single-feature selectivity of V1 neurons forms the fundamental basis for intricate visual feature recognition at higher hierarchical levels [17,21]. Moreover, brain neurons possess plasticity and adaptability in response to environmental changes, particularly during critical periods following birth; motion and orientation recognition abilities are acquired and refined through visual stimulation and learning during this developmental phase [22,23].
The study of the primary visual system inspired the development of artificial neural networks. An important step in this regard was the Neocognitron, whose design drew on the functional mechanisms of V1 cells [24,25]. This breakthrough inspired LeCun to construct LeNet, an early implementation of the convolutional neural network (CNN), and subsequently the development of deep neural networks such as VGG and ResNet [26,27,28,29]. However, these deep networks have increasingly deviated from the actual structure of biological visual systems and lack interpretability. Although some researchers have reported that CNNs acquire certain feature extraction kernels during training whose response properties resemble those of visual neurons [27], it is hard to ascertain whether the interactions between these kernels are reliable and the same as those observed in actual neurons. In the real visual system, the interactions between neurons, such as orientation-selective neurons [13], ensure the strong robustness of the visual system, which highlights the gap between biological neurons and artificial neurons (convolutional kernels). Furthermore, although certain neural network models have attempted to simulate LGN or V1 cell functions within their architectures, they often suffer from computational complexity, learning costs, and limited flexibility [30,31,32,33].
In this paper, we introduce an artificial visual system (AVS), a bio-inspired, straightforward, and explainable pre-trained network for improving the overall performance of CNNs. The core idea of AVS is to simulate the mechanism of orientation feature extraction within the primary visual pathway and to apply it to regulate the feature preference of a CNN. AVS is first trained on artificial object orientation data to develop orientation feature selectivity, and is then employed as an image information regulator during CNN training. What sets the AVS–CNN framework apart is that AVS is pre-trained on orientation data and kept frozen in subsequent image tasks. This bio-inspired approach not only improves the backbone model’s stability and robustness, but also requires no extra training resources beyond a small amount of orientation object data. Additionally, AVS is used only during training to regulate the backbone network’s preference for image features, and is removed in the validation and testing stages. This flexibility and generality ensure that AVS can be employed in diverse image tasks without generating extra costs for hardware and software implementations. The contributions of this paper are as follows:
We propose AVS, a straightforward and bio-inspired pre-trained network to regulate CNN training preference. To the best of our knowledge, this is the first attempt to decouple the generation of initial feature selectivity from the extraction and recognition of high-dimensional information.
The pre-trained AVS can be integrated as a plug-and-play component to enhance the robustness of various convolutional backbone networks, such as ResNet-50.
Experimental results demonstrate that our AVS–CNN method achieves overall performance improvements over backbone networks across various datasets and image tasks. AVS effectively enhances both the generalization and stability of CNNs.
We visualize the differences in image-information preference between baseline models and their AVS-enhanced versions, improving the explainability of AVS–CNN.
4. Results
In this section, we present the results of experiments conducted on various image datasets to demonstrate the effectiveness of the AVS–CNN framework. The section is divided into two parts. In the first part, we provide preliminary information about the experiments, including the datasets, models, and training details. In the second part, we evaluate the overall performance of AVS–CNN from robustness and stability perspectives.
4.1. Preliminaries for Experiments
4.1.1. Datasets
An artificial object orientation dataset was generated and utilized to fine-tune the neurons’ orientation selectivity in AVS. The artificial object orientation dataset comprises 400 images for training and 2000 images for validation (image size: ). Each image contains an object with a specific orientation, a random background color, a random object color configuration, and a random object scale. We extensively evaluate the performance of the AVS–CNN framework on diverse datasets, including Stanford Dogs [66], Oxford 102 Flowers [67], MNIST-M [68], Oxford-IIIT Pet [69], STL-10 [38], and PASCAL VOC 2012 [70]. These datasets were selected to cover a range of visual diversity, domain variation, and object complexity, which contributes to evaluating the generalization of the AVS–CNN framework.
Furthermore, to evaluate the impact of AVS on enhancing CNN robustness against image noise, we generated multiple types of noise test sets based on the clean test sets of each image dataset. The noise types included Additive Gaussian noise (AGN), Exponential noise (EPN), Poisson noise (PSN), Rayleigh noise (RLN), Speckle noise (SKN), Salt and Pepper noise (SPN), and Uniform noise (UFN). Additionally, each noise subset was generated at five different noise intensities.
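The following sketch indicates how such noise subsets can be produced with NumPy; the parameterization of each noise type and the single intensity value shown are illustrative assumptions rather than the exact settings used for our test sets.

```python
# Hedged noise-corruption sketch covering the seven noise types (AGN, EPN, PSN, RLN, SKN, SPN, UFN).
import numpy as np

def corrupt(img, noise_type, intensity=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(img, dtype=np.float32) / 255.0              # work in [0, 1]
    if noise_type == "AGN":                                    # additive Gaussian noise
        x = x + rng.normal(0.0, intensity, x.shape)
    elif noise_type == "EPN":                                  # exponential noise
        x = x + rng.exponential(intensity, x.shape)
    elif noise_type == "PSN":                                  # Poisson (shot) noise
        x = rng.poisson(x * 255.0 * intensity) / (255.0 * intensity)
    elif noise_type == "RLN":                                  # Rayleigh noise
        x = x + rng.rayleigh(intensity, x.shape)
    elif noise_type == "SKN":                                  # speckle (multiplicative) noise
        x = x + x * rng.normal(0.0, intensity, x.shape)
    elif noise_type == "SPN":                                  # salt-and-pepper noise
        mask = rng.random(x.shape)
        x[mask < intensity / 2] = 0.0
        x[mask > 1 - intensity / 2] = 1.0
    elif noise_type == "UFN":                                  # uniform noise
        x = x + rng.uniform(-intensity, intensity, x.shape)
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```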
4.1.2. CNN Models
We investigated the performance improvement of AVS–CNN in image classification and object detection tasks. For the image classification task, ResNet-50 [29], DenseNet121 [71], and EfficientNet-B0 [72] were employed as the backbone models. For the object detection task, Faster R-CNN [73] and RetinaNet [74] were adopted as baselines, both using the ResNet-50 backbone. The selection of these models was driven by the following considerations:
Diversity in model architecture: ResNet-50 effectively solves the issue of gradient degradation through residual connections, which makes it a classic deep learning baseline; DenseNet121 uses dense connections for feature sharing and efficient gradient propagation, which provides a different perspective on feature extraction; EfficientNet-B0 uses a compound scaling strategy to balance model efficiency and performance, representing modern CNN architectures.
Object detection models: Faster R-CNN is a typical two-stage object detector and a preferred choice for high-precision detection, while RetinaNet is a one-stage detector particularly suited to detecting small objects in large-scale datasets. Together, they allow us to comprehensively evaluate the performance of the AVS–CNN framework under different object detection paradigms.
Model complexity: These backbone networks and object detection models differ in computational complexity and architectural design, allowing us to validate the robustness and adaptability of the AVS–CNN framework under different model complexity conditions.
4.1.3. Training Details
Before integration with a CNN, AVS was pre-trained for 10 epochs on the 400 artificial object orientation images (training settings for orientation selectivity tuning are provided in Appendix A, Table A1). For the image recognition tasks, both AVS–CNN and the base backbone models were initialized with weights pre-trained on ImageNet [75] or MS COCO [76]. During the fine-tuning stage for each image dataset, we set the maximum number of epochs to 15. All image recognition tasks were run three times with different random seeds to mitigate the effect of training randomness. Importantly, in AVS–CNNs, the AVS block was employed only during training, with frozen parameters, and was removed during the validation and testing stages.
4.2. Noise Robustness
We investigated the robustness improvement of the proposed AVS–CNN framework against image noise. All models were trained on clean data and tested on the various validation subsets. We used precision, recall, F1-score, and error rate as evaluation metrics to assess AVS–CNN performance. The precision, recall, and F1-score results are presented in Appendix A, Table A2 and Table A3. The error rate results are summarized in Table 1, which reports the top-1 error rate on the overall mean, noise mean, common noise corruptions, and clean images for AVS–CNN and the standard CNN methods, namely ResNet50 (RN), DenseNet121 (DN), and EfficientNetB0 (EN). The noise mean refers to the average error rate over the seven types of image noise corruption, and the overall mean is the average error rate over the noise subsets and the clean dataset. All results are reported as means.
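For clarity, the aggregated quantities in Table 1 can be computed as in the small sketch below; the per-subset error rates are placeholder numbers, and we assume the overall mean simply pools the seven noise subsets together with the clean test set.

```python
# Placeholder aggregation of top-1 error rates into the noise mean and overall mean.
import numpy as np

noise_err = {"AGN": 0.31, "EPN": 0.28, "PSN": 0.25, "RLN": 0.27,
             "SKN": 0.26, "SPN": 0.35, "UFN": 0.29}   # example per-noise-subset error rates
clean_err = 0.12                                        # example clean-test-set error rate

noise_mean = np.mean(list(noise_err.values()))                    # mean over the 7 noise subsets
overall_mean = np.mean(list(noise_err.values()) + [clean_err])    # noise subsets plus the clean set
```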
As expected, each implemented AVS–CNN model exhibited significant improvements over the base CNN in overall performance and on all types of noise. While the extent of improvement varied across datasets owing to backbone performance, most AVS–CNNs demonstrated progress on noise data with only a slight decline in performance on clean data. In the fifteen groups of model comparisons, AVS–CNN improved performance on noise data while maintaining or improving clean-data performance in seven groups, most notably compared to ResNet-50. Furthermore, in terms of precision, recall, and F1-score, the AVS–CNN framework also exhibited improved performance compared with the base CNNs. The improvement in these metrics indicates that the AVS–CNN framework’s predictions for the positive class are more accurate and suffer fewer missed detections. Overall, AVS–CNN outperforms the baseline models in terms of prediction accuracy, coverage, and overall performance, and the framework improves the robustness of CNN models against various types of noise without substantial extra computational cost.
The effectiveness of the AVS–CNN framework was also evaluated on object detection tasks. Detection results were evaluated using the PASCAL VOC metric (mAP@IoU = 0.5), as presented in Table 2. AVS–CNNs again exhibited significant improvements on noise data and in overall performance. These results further demonstrate the feasibility and superiority of the AVS–CNN framework in terms of robustness improvement.
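A minimal sketch of computing this operating point with the torchmetrics library is shown below; the boxes, scores, and labels are placeholder tensors rather than detections from our models.

```python
# Evaluating detections at the PASCAL VOC operating point (mAP@IoU = 0.5) with torchmetrics.
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])
preds = [{"boxes": torch.tensor([[40.0, 30.0, 120.0, 110.0]]),   # predicted box (xmin, ymin, xmax, ymax)
          "scores": torch.tensor([0.87]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[42.0, 28.0, 118.0, 112.0]]), # ground-truth box
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute()["map_50"])                                 # mAP at IoU threshold 0.5
```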
4.3. Model Stability
We summarize the error rates and standard deviations of overall model performance on the various datasets over the three repeated image classification runs to evaluate the stability and robustness of AVS–CNNs. The results are presented in Table 3, where values are reported as the mean error rate and standard deviation (n = 3 seeds). The standard deviation curves are depicted in Figure 8. Notably, the standard deviations of AVS–CNNs consistently remained within 2%, exhibiting significantly lower fluctuation than the corresponding base CNNs. In short, AVS enhanced the robustness of the backbone model, leading to stronger reliability and stability in the performance of AVS–CNNs.
5. Visualization and Discussions
In this section, we employ the Gradient-weighted Class Activation Mapping (Grad-CAM) [77] technique to visualize the weight activation distribution of AVS–CNN and CNN on images, aiming to investigate the impact of AVS on the decision-making process of the backbone network. Specifically, we generated weight activation heat maps for AVS-ResNet50 and ResNet50 on the flower data.
Figure 9 presents the clean image, the noise images, and the corresponding heat maps. By overlaying the heat map onto the original image, we visually show where the model focuses its attention during prediction, providing valuable insight into the salient features considered by the model. Red and yellow indicate the regions deemed most crucial for the model’s decision, with a shift towards red signifying higher importance in determining the output. Notably, the base ResNet50 struggled to focus on the object when confronted with the noise images, and its attention was often misled towards the background by the noise. In contrast, AVS-ResNet50 consistently captured object-related details regardless of whether the input was clean or noisy. These visualization instances elucidate how AVS–CNNs achieve their robustness.
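For reference, a heat map of the kind shown in Figure 9 can be reproduced with standard PyTorch hooks; the generic sketch below targets the last convolutional stage of a ResNet-50 classifier and is not our exact visualization code (the random tensor stands in for a preprocessed input image).

```python
# Minimal manual Grad-CAM sketch: capture activations and gradients at the last convolutional
# stage, weight the activations by the pooled gradients, and upsample to a heat map.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()

feats, grads = {}, {}
model.layer4[-1].register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4[-1].register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                 # placeholder for a preprocessed input image
logits = model(x)
logits[0, logits.argmax()].backward()           # gradient of the top predicted class score

weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalized heat map in [0, 1]
```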
Currently, the majority of CNNs are trained on the ImageNet dataset and subsequently fine-tuned for other image tasks to enhance their recognition capabilities. However, some studies have indicated that CNNs trained on ImageNet may exhibit a bias towards texture information rather than shape information in classification tasks and lack robustness when confronted with image noise [45,78]. In this study, we propose the AVS–CNN framework as a solution to address these limitations. Extensive experimental results demonstrate that incorporating the pre-trained AVS significantly improves the stability and robustness of ImageNet-based CNNs across various transfer-learning tasks. Future research can focus on refining the integration of orientation information to further enhance the overall performance of backbone networks, and on extending the framework to more complex settings such as large-scale datasets, object segmentation, and non-transfer learning.