1. Introduction
The brain acquires information from the external world through sensory systems [1], with approximately half of the human cerebral cortex involved in processing visual information [2,3,4]. The primary visual system serves as the foundation and first stage of visual processing in the human cortex, playing an important role in perceiving fundamental visual information [5]. This system encompasses the biological structures and processing functions spanning from the retina, through the lateral geniculate nucleus (LGN), to the primary visual cortex (V1) [6,7]. By integrating and processing this visual information, the primary visual system and the cerebral cortex together enable accurate perception and interpretation of the surrounding visual environment [6,8]. Understanding these mechanisms is crucial for advancing artificial vision systems towards greater sophistication and robustness.
The retina is composed of various neuronal cell types, including photoreceptors, bipolar cells, ganglion cells, horizontal cells, and amacrine cells [7,8,9]. Photoreceptors convert light into electrical signals [9], which are subsequently transmitted by bipolar cells to ganglion cells [8,10]. Bipolar, horizontal, and amacrine cells contribute to lateral modulation processes that enhance contrast and edge detection while also facilitating the processing of more complex visual information, such as motion detection and the regulation of light intensity [11,12]. The information processed within the retina is then transmitted from ganglion cells to cells of the lateral geniculate nucleus (LGN) [13]. The LGN serves as an intermediary relay station between the retina and the brain, while also carrying out initial processing of visual information [14]. Moreover, LGN cells exhibit receptive fields organized into concentric excitatory (ON) and inhibitory (OFF) regions, characterized by center-surround antagonistic structures [15]. The processed visual information is subsequently transmitted from LGN cells to the primary visual cortex [14,16].
In the primary visual cortex (V1), neurons carry out the initial recognition of visual features, encompassing color, motion direction, and orientation [13]. These distinct recognitions are predominantly accomplished by discrete neuronal subtypes [13,17]. For instance, a specific subtype known as simple cells exhibits remarkable selectivity towards a preferred stimulus orientation within its receptive field while displaying diminished or no responses to other orientations [18,19,20]. This single-feature selectivity of V1 neurons forms the fundamental basis for intricate visual feature recognition at higher hierarchical levels [17,21]. Moreover, brain neurons possess plasticity and adaptability in response to environmental changes, particularly during critical periods following birth; motion and orientation recognition abilities are acquired and refined through visual stimulation and learning during this developmental phase [22,23].
The study of the primary visual system inspired the development of artificial neural networks. An important step in this regard was the Neocognitron, whose design drew on the functional mechanisms of V1 cells [24,25]. This breakthrough inspired LeCun to construct LeNet, an early implementation of the convolutional neural network (CNN), and subsequently the development of deep neural networks such as VGG and ResNet [26,27,28,29]. However, these deep networks have increasingly deviated from the actual structure of biological visual systems and lack interpretability. Although some researchers have reported that CNNs acquire certain feature extraction kernels during training whose response properties resemble those of visual neurons [27], it is hard to ascertain whether the interactions between these kernels are reliable and the same as those observed in actual neurons. In the real visual system, the interactions between neurons, such as orientation-selective neurons [13], ensure the strong robustness of the visual system, which highlights the gap between biological neurons and artificial neurons (convolutional kernels). Furthermore, although certain neural network models have attempted to simulate LGN or V1 cell functions within their architectures, they often suffer from computational complexity, learning costs, and limited flexibility [30,31,32,33].
In this paper, we introduce an artificial visual system (AVS), a bio-inspired, straightforward, and explainable pre-trained network for improving the overall performance of CNNs. The core idea of AVS is to simulate the mechanism of orientation feature extraction within the primary visual pathway and to apply it to regulate the feature preference of a CNN. AVS is first trained on artificial object orientation data to develop orientation feature selectivity, and is then employed as an image information regulator during CNN training. What sets the AVS–CNN framework apart is that AVS is pre-trained on orientation data and kept frozen in subsequent image tasks. This bio-inspired approach not only improves the backbone model’s stability and robustness, but also requires no extra training resources beyond a small amount of orientation object data. Additionally, AVS is used only during training to regulate the backbone network’s preference for image features, and is removed in the validation and testing stages. This flexibility and generality ensure that AVS can be employed in diverse image tasks without generating extra costs for hardware and software implementations. The contributions of this paper are as follows:
We propose AVS, a straightforward and bio-inspired pre-trained network to regulate CNN training preference. To the best of our knowledge, this is the first attempt to decouple the generation of initial feature selectivity from the extraction and recognition of high-dimensional information.
The pre-trained AVS can be integrated as a plug-and-play component to enhance the robustness of various convolutional backbone networks, such as ResNet-50.
Experimental results demonstrate that our AVS–CNN method achieves overall performance improvements over backbone networks across various datasets and image tasks. AVS effectively enhances both the generalization and stability of CNNs.
We visualize the differences in image-information preference between baseline models and their AVS-enhanced versions, improving the explainability of AVS–CNN.
4. Results
In this section, we present the results of experiments conducted on various image datasets to demonstrate the effectiveness of the AVS–CNN framework. The section is divided into two parts. In the first part, we provide preliminary information about the experiments, including the datasets, models, and training details. In the second part, we evaluate the overall performance of AVS–CNN from robustness and stability perspectives.
4.1. Preliminaries for Experiments
4.1.1. Datasets
An artificial object orientation dataset was generated and utilized to fine-tune the neurons’ orientation selectivity in AVS. The artificial object orientation dataset comprises 400 images for training and 2000 images for validation (image size: ). Each image contains an object with a specific orientation, a random background color, a random object color configuration, and a random object scale. We extensively evaluate the performance of the AVS–CNN framework on diverse datasets, including Stanford Dogs [66], Oxford 102 Flowers [67], MNIST-M [68], Oxford-IIIT Pet [69], STL-10 [38], and PASCAL VOC 2012 [70]. These datasets were selected to cover a range of visual diversity, domain variation, and object complexity, which contributes to evaluating the generalization of the AVS–CNN framework.
Furthermore, to evaluate the impact of AVS on enhancing CNN robustness against image noise, we generated multiple types of noise test sets based on the clean test sets of each image dataset. The noise types included Additive Gaussian noise (AGN), Exponential noise (EPN), Poisson noise (PSN), Rayleigh noise (RLN), Speckle noise (SKN), Salt and Pepper noise (SPN), and Uniform noise (UFN). Additionally, each noise subset was generated at five different noise intensities.
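The following sketch indicates how such noise subsets can be produced with NumPy; the parameterization of each noise type and the single intensity value shown are illustrative assumptions rather than the exact settings used for our test sets.

```python
# Hedged noise-corruption sketch covering the seven noise types (AGN, EPN, PSN, RLN, SKN, SPN, UFN).
import numpy as np

def corrupt(img, noise_type, intensity=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(img, dtype=np.float32) / 255.0              # work in [0, 1]
    if noise_type == "AGN":                                    # additive Gaussian noise
        x = x + rng.normal(0.0, intensity, x.shape)
    elif noise_type == "EPN":                                  # exponential noise
        x = x + rng.exponential(intensity, x.shape)
    elif noise_type == "PSN":                                  # Poisson (shot) noise
        x = rng.poisson(x * 255.0 * intensity) / (255.0 * intensity)
    elif noise_type == "RLN":                                  # Rayleigh noise
        x = x + rng.rayleigh(intensity, x.shape)
    elif noise_type == "SKN":                                  # speckle (multiplicative) noise
        x = x + x * rng.normal(0.0, intensity, x.shape)
    elif noise_type == "SPN":                                  # salt-and-pepper noise
        mask = rng.random(x.shape)
        x[mask < intensity / 2] = 0.0
        x[mask > 1 - intensity / 2] = 1.0
    elif noise_type == "UFN":                                  # uniform noise
        x = x + rng.uniform(-intensity, intensity, x.shape)
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```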
4.1.2. CNN Models
We investigated the performance improvement of AVS–CNN in image classification and object detection tasks. For the image classification task, ResNet-50 [29], DenseNet121 [71], and EfficientNet-B0 [72] were employed as the backbone models. For the object detection task, Faster R-CNN [73] and RetinaNet [74] were adopted as baselines, both using the ResNet-50 backbone. The selection of these models was driven by the following considerations:
Diversity in model architecture: ResNet-50 effectively solves the issue of gradient degradation through residual connections, which makes it a classic deep learning baseline; DenseNet121 uses dense connections for feature sharing and efficient gradient propagation, which provides a different perspective on feature extraction; EfficientNet-B0 uses a compound scaling strategy to balance model efficiency and performance, representing modern CNN architectures.
Object detection models: Faster R-CNN is a typical two-stage object detector and a preferred choice for high-precision detection, while RetinaNet is a one-stage detector particularly suited to detecting small objects in large-scale datasets. Together, they allow us to comprehensively evaluate the performance of the AVS–CNN framework under different object detection paradigms.
Model complexity: These backbone networks and object detection models differ in computational complexity and architectural design, allowing us to validate the robustness and adaptability of the AVS–CNN framework under different model complexity conditions.
4.1.3. Training Details
Before integration with a CNN, AVS was pre-trained for 10 epochs on the 400 artificial object orientation images (training settings for orientation selectivity tuning are provided in Appendix A, Table A1). For the image recognition tasks, both AVS–CNN and the base backbone models were initialized with weights pre-trained on ImageNet [75] or MS COCO [76]. During the fine-tuning stage for each image dataset, we set the maximum number of epochs to 15. All image recognition tasks were run three times with different random seeds to mitigate the effect of training randomness. Importantly, in AVS–CNNs, the AVS block was employed only during training, with frozen parameters, and was removed during the validation and testing stages.
4.2. Noise Robustness
We investigated the robustness improvement of the proposed AVS–CNN framework against image noise. All models were trained on clean data and tested on the various validation subsets. We used precision, recall, F1-score, and error rate as evaluation metrics to assess AVS–CNN performance. The precision, recall, and F1-score results are presented in Appendix A, Table A2 and Table A3. The error rate results are summarized in Table 1, which reports the top-1 error rate on the overall mean, noise mean, common noise corruptions, and clean images for AVS–CNN and the standard CNN methods, namely ResNet50 (RN), DenseNet121 (DN), and EfficientNetB0 (EN). The noise mean refers to the average error rate over the seven types of image noise corruption, and the overall mean is the average error rate over the noise subsets and the clean dataset. All results are reported as means.
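For clarity, the aggregated quantities in Table 1 can be computed as in the small sketch below; the per-subset error rates are placeholder numbers, and we assume the overall mean simply pools the seven noise subsets together with the clean test set.

```python
# Placeholder aggregation of top-1 error rates into the noise mean and overall mean.
import numpy as np

noise_err = {"AGN": 0.31, "EPN": 0.28, "PSN": 0.25, "RLN": 0.27,
             "SKN": 0.26, "SPN": 0.35, "UFN": 0.29}   # example per-noise-subset error rates
clean_err = 0.12                                        # example clean-test-set error rate

noise_mean = np.mean(list(noise_err.values()))                    # mean over the 7 noise subsets
overall_mean = np.mean(list(noise_err.values()) + [clean_err])    # noise subsets plus the clean set
```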
As expected, each implemented AVS–CNN model exhibited significant improvements over the base CNN in overall performance and on all types of noise. While the extent of improvement varied across datasets owing to backbone performance, most AVS–CNNs demonstrated progress on noise data with only a slight decline in performance on clean data. In the fifteen groups of model comparisons, AVS–CNN improved performance on noise data while maintaining or improving clean-data performance in seven groups, most notably compared to ResNet-50. Furthermore, in terms of precision, recall, and F1-score, the AVS–CNN framework also exhibited improved performance compared with the base CNNs. The improvement in these metrics indicates that the AVS–CNN framework’s predictions for the positive class are more accurate and suffer fewer missed detections. Overall, AVS–CNN outperforms the baseline models in terms of prediction accuracy, coverage, and overall performance, and the framework improves the robustness of CNN models against various types of noise without substantial extra computational cost.
The effectiveness of the AVS–CNN framework was also evaluated on object detection tasks. Detection results were evaluated using the PASCAL VOC metric (mAP@IoU = 0.5), as presented in Table 2. AVS–CNNs again exhibited significant improvements on noise data and in overall performance. These results further demonstrate the feasibility and superiority of the AVS–CNN framework in terms of robustness improvement.
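A minimal sketch of computing this operating point with the torchmetrics library is shown below; the boxes, scores, and labels are placeholder tensors rather than detections from our models.

```python
# Evaluating detections at the PASCAL VOC operating point (mAP@IoU = 0.5) with torchmetrics.
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])
preds = [{"boxes": torch.tensor([[40.0, 30.0, 120.0, 110.0]]),   # predicted box (xmin, ymin, xmax, ymax)
          "scores": torch.tensor([0.87]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[42.0, 28.0, 118.0, 112.0]]), # ground-truth box
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute()["map_50"])                                 # mAP at IoU threshold 0.5
```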
4.3. Model Stability
We summarize the error rates and standard deviations of overall model performance on the various datasets over the three repeated image classification runs to evaluate the stability and robustness of AVS–CNNs. The results are presented in Table 3, where values are reported as the mean error rate and standard deviation (n = 3 seeds). The standard deviation curves are depicted in Figure 8. Notably, the standard deviations of AVS–CNNs consistently remained within 2%, exhibiting significantly lower fluctuation than the corresponding base CNNs. In short, AVS enhanced the robustness of the backbone model, leading to stronger reliability and stability in the performance of AVS–CNNs.
5. Visualization and Discussions
In this section, we employ the Gradient-weighted Class Activation Mapping (Grad-CAM) [77] technique to visualize the weight activation distribution of AVS–CNN and CNN on images, aiming to investigate the impact of AVS on the decision-making process of the backbone network. Specifically, we generated weight activation heat maps for AVS-ResNet50 and ResNet50 on the flower data.
Figure 9 presents the clean image, the noise images, and the corresponding heat maps. By overlaying the heat map onto the original image, we visually show where the model focuses its attention during prediction, providing valuable insight into the salient features considered by the model. Red and yellow indicate the regions deemed most crucial for the model’s decision, with a shift towards red signifying higher importance in determining the output. Notably, the base ResNet50 struggled to focus on the object when confronted with the noise images, and its attention was often misled towards the background by the noise. In contrast, AVS-ResNet50 consistently captured object-related details regardless of whether the input was clean or noisy. These visualization instances elucidate how AVS–CNNs achieve their robustness.
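For reference, a heat map of the kind shown in Figure 9 can be reproduced with standard PyTorch hooks; the generic sketch below targets the last convolutional stage of a ResNet-50 classifier and is not our exact visualization code (the random tensor stands in for a preprocessed input image).

```python
# Minimal manual Grad-CAM sketch: capture activations and gradients at the last convolutional
# stage, weight the activations by the pooled gradients, and upsample to a heat map.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()

feats, grads = {}, {}
model.layer4[-1].register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4[-1].register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                 # placeholder for a preprocessed input image
logits = model(x)
logits[0, logits.argmax()].backward()           # gradient of the top predicted class score

weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalized heat map in [0, 1]
```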
Currently, the majority of CNNs are trained on the ImageNet dataset and subsequently fine-tuned for other image tasks to enhance their recognition capabilities. However, some studies have indicated that CNNs trained on ImageNet may exhibit a bias towards texture information rather than shape information in classification tasks and lack robustness when confronted with image noise [45,78]. In this study, we propose the AVS–CNN framework as a solution to address these limitations. Extensive experimental results demonstrate that incorporating the pre-trained AVS significantly improves the stability and robustness of ImageNet-based CNNs across various transfer-learning tasks. Future research can focus on refining the integration of orientation information to further enhance the overall performance of backbone networks, and on extending the framework to more complex settings such as large-scale datasets, object segmentation, and non-transfer learning.