Article

Quotient Network-A Network Similar to ResNet but Learning Quotients

1 School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
2 School of Software Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(11), 521; https://doi.org/10.3390/a17110521
Submission received: 5 August 2024 / Revised: 1 November 2024 / Accepted: 11 November 2024 / Published: 13 November 2024

Abstract

The emergence of ResNet provides a powerful tool for training extremely deep networks. Its core idea is to change the learning goal of the network: instead of learning new features from scratch, it learns the difference between the target and existing features. However, this difference between two kinds of features does not have an independent and clear meaning, and the amount of learning is based on the absolute rather than the relative difference, which makes it sensitive to the size of the existing features. We propose a new network that addresses both problems while retaining the advantages of ResNet. Specifically, it learns the quotient of the target features and the existing features, so we call it the quotient network. We propose design rules that allow this network to be trained efficiently and to achieve better performance than ResNet. Experiments on the CIFAR10, CIFAR100, and SVHN datasets show that this network stably achieves considerable improvements over ResNet by making only tiny corresponding changes to the original ResNet architecture without adding new parameters.

1. Introduction

Convolutional neural networks have demonstrated strong performance in computer vision tasks [1,2,3,4]. In their continuous performance improvements, depth has been the key factor in success [5,6]. For successfully training deeper networks, in addition to initialization and regularization methods [7,8,9], a landmark breakthrough is the ResNet method [10]. Through a shortcut, it changes the learning target to the residual value, so that the network does not need to learn new features from scratch but instead learns the difference between the new and old features, thereby reducing the learning difficulty. When the weight parameters are relatively small, the identity mapping is easy to maintain, so a deep network is at least no worse than a shallow one.
However, this also brings two problems. (1) The arithmetic difference between the new and old features is an absolute difference that does not fully utilize the size information of the old features. Obviously, for the values 0.1 and 1, the effect of adding the same number 0.1 is quite different. Ignoring the size information of the old features makes the learned intermediate values sensitive to the size of the old features, so the transformation is too strong for some old features and too weak for others. (2) More importantly, a CNN differs from an RNN in that the feature types at its input and output differ: its layers learn low-, medium-, and high-level features [11]. Therefore, the difference between the new and old features is not an intermediate feature with a clear, independent meaning, which may make the functions the network has to learn too complex and increase the learning difficulty. These two problems ultimately prevent ResNet from fully exploiting the capacity of deep networks.
For these reasons, we propose a more natural network to make learning easier. The network changes the learning goal to the quotient of the new and old features; the final feature is then obtained by multiplying the old features by this quotient, so we call it the quotient network. This network solves both of the above problems of ResNet. It learns the relative difference (i.e., the quotient) between the new and old features, which allows the network to take full advantage of the size information of the old features and generates quantities insensitive to the size of the old features. This lets the quotient network transform each old feature more effectively than ResNet. Looking at the natural world, the quotient of two different classes of features is more likely than their difference to be a third feature with an independent and clear meaning. For example, mass divided by volume is more intuitively meaningful than mass minus volume, the former being density; force divided by mass is more meaningful than force minus mass, the former being acceleration; voltage divided by current is more meaningful than voltage minus current, the former being resistance; and so on. In the quotient network, the learning goal is the quotient of different features, which tends to have a specific meaning. Introducing such prior knowledge can reduce the complexity of the function to be learned by the network.
Moreover, our network retains ResNet’s advantages. Instead of learning the target features from scratch, it learns them by building on top of the old features. By making the activation function of the last layer of the quotient module pass through the (0, 1) point, the quotient network can also maintain the identity mapping more easily. Therefore, we have reason to believe that our network can outperform ResNet.
We propose several empirical guidelines for designing such networks, including the choice of activation functions and their placement. Based on these criteria, we obtain a network with strong performance that stably outperforms ResNet without changing its number of parameters and with only a slight increase in computation. In experiments on the CIFAR10 [12], CIFAR100 [12], and SVHN [13] datasets, we make only slight corresponding modifications to ResNets of different depths and then stably achieve better performance than the ResNets, demonstrating this network’s ease of use and power. Furthermore, by visualizing the quotient feature maps, we justify our motivations.
Comparison with attention mechanisms. In the design of neural networks, multiplication is also widely used in attention mechanisms. For example, SENet [14] and CBAM [15] allow networks to focus on more valuable information by multiplying channel or spatial weights. In contrast, our method does not add weights to existing features but learns new, different features. Self-attention [16,17] updates each feature by computing correlations with other features, whereas our network has no Q, K, and V operations before generating features. In addition, when generating each new token, self-attention multiplies all old tokens by different weights and then sums them, whereas our network performs only one point-to-point multiplication over all old features. Viewed from a perspective similar to that of the quotient network and ResNet, the attention scores of a particular token with respect to the other tokens act as the weights of the neuron that generates the new token.

2. Related Work

2.1. Branches

Developing alongside the increasing depth and width of networks is the concept of branches. Inception [6,9,18,19] uses branches to concatenate the features of filters of different sizes. DenseNet [20] connects each layer to every other layer to enrich features. ResNeXt [21] computes more channel information by grouping channels without increasing computation or parameters. In object detection and semantic/instance segmentation, branches are also used to enrich feature information: the FPN [22] uses branches to fuse the positional information of lower layers with the semantic information of higher layers, and UNet [4] supplements the detailed information of the image through the horizontal branches of its U-shaped architecture. Branches are also used to reduce computation and parameters: MobileNet [23] lightens the network with group convolutions in which the number of groups equals the number of channels, and ShuffleNet [24] further groups 1 × 1 convolutions through channel shuffling. ResNet [10] is different from the above: it uses branches to change the objective function of network learning. Its unique perspective has achieved great success, making residual learning a widely used operation today [16,25,26].

2.2. Gates and Attention Mechanisms

Multiplication is widely used in gates and attention mechanisms. In RNNs, gates solve the long-distance dependency problem by controlling the flow of information [27,28]. Attention mechanisms allocate limited computing resources to more valuable feature areas by multiplying weights. SENet [14] allocates attention to channels through squeeze and excitation operations; building on this, there are improvements to the pooling operation used in the squeeze step [29,30] and to the fully connected layers of the excitation step [31]. CBAM [15] uses average and maximum pooling to allocate attention to channels and spatial locations. Unlike CBAM, which applies channel attention before spatial attention, BAM [32] computes channel and spatial attention in parallel. A breakthrough in attention mechanisms is the transformer [16], initially used in NLP. ViT [17] splits the image into patches for encoding and introduces the transformer into visual tasks; however, the amount of data required is enormous, so DeiT [33] uses a distillation token to reduce this data requirement. There are also many architectural improvements: the Swin transformer [34] uses shifted local windows, and the pyramid vision transformer [35] uses a shrinking pyramid to reduce computation and produce high-resolution outputs; these networks can serve as backbones for a variety of visual tasks. In object detection [36,37,38] and semantic/instance segmentation [39,40,41], many transformer-based methods have achieved good performance.

3. Quotient Network

3.1. Reviewing Residual Learning

In order to solve the problem of training accuracy decreasing as the number of layers increases, ResNet changes the learning goal. Assume that the function a specific network block wants to learn is H(x). ResNet does not learn this goal directly from scratch but instead learns F(x) = H(x) − x, so the block’s output becomes H(x) = F(x) + x. This operation reduces the network’s learning difficulty and helps keep the information unchanged: compared with directly learning the identity mapping, this method can approximate the identity mapping as long as the parameters are small enough that F(x) is close to 0. The ease of learning the identity mapping ensures that the training loss of deep networks will not be greater than that of shallow networks.
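For concreteness, a minimal PyTorch-style sketch of such a residual block is shown below; the use of batch normalization and the post-addition ReLU follow the common ResNet recipe and are illustrative assumptions rather than the authors’ exact code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.relu(self.bn1(self.conv1(x)))  # first 3x3 convolution
        f = self.bn2(self.conv2(f))             # second 3x3 convolution (no activation yet)
        return self.relu(f + x)                 # add the shortcut, then activate
```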
However, as discussed in the introduction, the arithmetic difference between different feature types is not an independent feature with a clear meaning. Although nonlinear multi-layer networks can approximate complex functions, the increased complexity of the objective function caused by a difference without clear meaning can degrade network performance. Moreover, the arithmetic difference itself cannot make good use of the size information of the old features: the same increment over-updates smaller values and under-updates larger ones, which is detrimental to the learning of the network.

3.2. Quotient Learning

We learn the quotient between two features to reduce the difficulty of network learning. As discussed in the introduction, the transformation between two different features is often achieved by multiplying or dividing by a third feature with an independent and clear meaning. As a result, the quotient we learn is more likely to be a meaningful feature, which reduces the complexity of the objective function. Moreover, through the multiplication operation, the same quotient value can update old features of different sizes effectively. Specifically, assuming that the function a particular network block wants to learn is H(x), we change its goal to F(x) = H(x)/x, and the block’s output becomes H(x) = F(x) × x. The comparison of ResNet with the quotient network is shown in Figure 1. Unlike ResNet, our module is activated before the final multiplication.
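The corresponding quotient module can be sketched as follows, with the quotient activation passed in as a module (its concrete form is the subject of Section 3.3 and Equation (1)); the batch normalization placement is an assumption for illustration.

```python
import torch.nn as nn

class QuotientBlock(nn.Module):
    """Two-layer quotient module: output = F(x) * x, where F(x) is the learned quotient."""
    def __init__(self, channels, quotient_act: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.quotient_act = quotient_act        # bounded, positive, passes through (0, 1)

    def forward(self, x):
        q = self.relu(self.bn1(self.conv1(x)))          # first 3x3 convolution
        q = self.quotient_act(self.bn2(self.conv2(q)))  # activate the quotient before multiplying
        return q * x                                    # point-wise product with the old features
```

The only structural difference from the residual block above is the final line: a point-wise product instead of a sum, with a specially chosen activation applied before it.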
For the network to successfully learn the required quotient values, we need to design it carefully. Methods commonly used in convolutional networks may not be applicable here. For example, the most commonly used activation function, ReLU [42], places half of its domain in the unsaturated zone and effectively solves the problem of vanishing gradients; however, if the quotient values are activated with ReLU, features may grow exponentially, making the network untrainable. Section 3.3 introduces some empirical principles for selecting activation functions and constructing the model.

3.3. The Rules of Designing

After many attempts and failures, in which we repeatedly analyzed the causes and made corrections, we propose the following empirical principles for quotient network design and explain the likely reasons behind them.

3.3.1. Choice of Activation Function for Quotients

The value range should not be too large or too small and should avoid negative numbers. If it is too large, the desired features become difficult to learn: in the quotient network they are computed by a large number of point-to-point scalar multiplications, y = h_n × h_{n−1} × ⋯ × h_2 × h_1 × x, so this direct multiplicative growth makes an explosion of the computed values more likely than in traditional networks, and a gradient explosion more likely during backpropagation. If there is no restriction on the range of h at each step, training easily fails; in our experiments the outputs then behave as white noise. If the range is too small, the amount by which features can be updated at each step is too small, and more layers are needed to achieve the same effect, reducing the network’s performance. Moreover, the value range should remain positive because, with common activation functions (e.g., ReLU, Sigmoid), learned useful features tend to be represented by positive values and useless ones by values close to or equal to 0; multiplying by a negative number destroys this structure, especially when used together with ReLUs in the same network. If we want to enlarge or reduce features, multiplying by a positive number is more straightforward and intuitive.
The function should pass through the (0, 1) point. Like ResNet, we want to make it easier for the network to learn the identity mapping. When the weight parameters of the network are small, the weighted sum tends toward 0; ensuring that the activated value is then close to 1 helps keep the previous features unchanged when they are multiplied by it.
The function should be globally differentiable. Since the value range of the activation function is bounded, it cannot, like ReLU, have zero gradients in the regions where the function values approach the upper and lower bounds. Half of ReLU’s domain lies in the unsaturated area, whereas for a bounded activation most of the domain maps to values near the upper and lower bounds. Therefore, if the gradient in these regions were exactly zero, a large number of neurons would not be updated, hurting the performance and efficiency of the quotient network’s learning.
Equation (1), used as the activation function, meets the above requirements: it is positive, passes through the (0, 1) point, is globally differentiable, and its value range can be controlled through α. In Appendix A, we compare the experimental results of different activation functions to illustrate the effectiveness of these design principles and show that a network using the activation function of Equation (1) indeed performs well.
activate(x) = sigmoid(x − ln(α − 1)) × α        (1)
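A direct implementation of Equation (1) as a module might look as follows; it illustrates the three properties above (range (0, α), value 1 at input 0, non-zero gradient everywhere), and α = 1.7 is simply one of the values used in the later experiments.

```python
import math
import torch
import torch.nn as nn

class QuotientActivation(nn.Module):
    """activate(x) = sigmoid(x - ln(alpha - 1)) * alpha, i.e., Equation (1)."""
    def __init__(self, alpha: float = 1.7):
        super().__init__()
        assert alpha > 1.0, "alpha must exceed 1 so that ln(alpha - 1) exists and (0, 1) lies on the curve"
        self.alpha = alpha
        self.shift = math.log(alpha - 1.0)

    def forward(self, x):
        # Bounded in (0, alpha), strictly positive, and differentiable everywhere.
        return torch.sigmoid(x - self.shift) * self.alpha

act = QuotientActivation(alpha=1.7)
print(act(torch.zeros(1)))  # approximately tensor([1.]): the identity mapping is easy to keep
```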

3.3.2. How to Change the Number of Channels

In ResNet, when the number of channels increases, the number of channels of the old features no longer matches that of the learned residual features, so the two cannot be added directly. The authors designed three ways of increasing the number of channels of the original features. The first adds new channels directly to the old features, with all values in the new channels set to 0. The second increases the number of channels of the old features by a convolutional transformation whenever the number of channels increases. The third convolves the old features regardless of whether the number of channels increases and then adds the residuals to the result.
In our network, because we use multiplication, a channel filled with zeros would make the results of all subsequent operations on it remain 0, so no useful information could be obtained from the first option. The third option requires a large amount of computation. Therefore, as ResNet generally does, we adopt the second method of increasing channels. Formally, the block’s output changes from H(x) = F(x) × x to H(x) = F(x) × C(x), where C stands for the convolution-related operations.
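A sketch of a channel-changing quotient block along these lines is given below, reusing the QuotientActivation module sketched above; the shortcut C(x) is a strided 3 × 3 convolution (matching the choice in Section 3.4), and, as discussed in Section 3.3.3, it is followed by the designed activation rather than ReLU. The normalization details are assumptions.

```python
import torch.nn as nn

class DownsampleQuotientBlock(nn.Module):
    """Quotient block that changes channels and halves resolution: output = F(x) * C(x)."""
    def __init__(self, in_ch, out_ch, alpha=1.7):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.quotient_act = QuotientActivation(alpha)   # from the sketch above
        # C(x): strided 3x3 convolution that matches the new shape, then the designed activation.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            QuotientActivation(alpha),
        )

    def forward(self, x):
        q = self.relu(self.bn1(self.conv1(x)))
        q = self.quotient_act(self.bn2(self.conv2(q)))
        return q * self.shortcut(x)
```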

3.3.3. Activation Functions for Other Areas of the Network

In addition to the activation function of the last convolution before the multiplication, which needs to be specially designed, two other places are worth noting. Before stacking quotient modules, the network needs to convolve the three-channel RGB image to generate features with more channels. Here, we find it better to use the same activation function as the last layer of the quotient module, as shown in Figure 2. This avoids the excessively large outputs that ReLU would allow and helps keep feature magnitudes consistent. For the same reason, when the number of channels increases, this activation function should also be applied after the old features are convolved to increase the number of channels, as shown in Figure 3.

3.4. An Example

As an example, this section presents the simple residual networks used in the experiments on CIFAR10 and then gives the corresponding quotient-network version. We base our network on a model similar to the one used in the CIFAR10 experiments of the ResNet paper [10]. The first layer of the network is a 3 × 3 convolution with stride one and 16 output channels, followed by three stages. Each stage stacks the same number of two-layer residual modules (both layers are 3 × 3 convolutions). The first convolution of the first residual module of stages 2 and 3 is a 3 × 3 convolution with stride two that doubles the number of channels. Therefore, the numbers of channels in the three stages are 16, 32, and 64, and the feature map sizes are 32, 16, and 8, respectively. Finally, after global average pooling, a 10-way fully connected layer outputs the predictions. Depending on the number of modules stacked in each stage, networks of different depths can be formed, such as 20, 32, 44, 56, and 110 layers (with 3, 5, 7, 9, and 18 modules per stage, respectively). Unlike the original paper’s shortcuts, which add new all-zero channels when the number of channels increases, we use a 3 × 3 convolution with stride two to increase the number of channels.
We modify the above network to obtain the corresponding quotient network. First, all residual modules in the three stages are replaced with quotient modules, as shown in Figure 1. Moreover, unlike ResNet, which uses ReLU activation functions, we replace the activation functions in the first layer of the network and in the convolutions that increase the number of channels in the shortcuts, as shown in Figure 2 and Figure 3.
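Putting the pieces together, the following is a sketch of the CIFAR10 quotient network described above, reusing the QuotientBlock, DownsampleQuotientBlock, and QuotientActivation sketches from earlier in this section (n = 3, 5, 7, 9, or 18 modules per stage gives 20, 32, 44, 56, or 110 layers); initialization and other training details are omitted and assumed.

```python
import torch.nn as nn

class QuotientNet(nn.Module):
    """CIFAR10 quotient network: head conv, three stages of quotient blocks, 10-way classifier."""
    def __init__(self, n=9, num_classes=10, alpha=1.7):   # n = 9 gives the 56-layer network
        super().__init__()
        # Head: 3x3 convolution to 16 channels, followed by the designed activation (Figure 2).
        self.head = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            QuotientActivation(alpha),
        )
        self.stage1 = self._make_stage(16, 16, n, alpha, downsample=False)   # 32x32 maps
        self.stage2 = self._make_stage(16, 32, n, alpha, downsample=True)    # 16x16 maps
        self.stage3 = self._make_stage(32, 64, n, alpha, downsample=True)    # 8x8 maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def _make_stage(self, in_ch, out_ch, n, alpha, downsample):
        first = (DownsampleQuotientBlock(in_ch, out_ch, alpha) if downsample
                 else QuotientBlock(out_ch, QuotientActivation(alpha)))
        rest = [QuotientBlock(out_ch, QuotientActivation(alpha)) for _ in range(n - 1)]
        return nn.Sequential(first, *rest)

    def forward(self, x):
        x = self.head(x)
        x = self.stage3(self.stage2(self.stage1(x)))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```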

3.5. Limitations and Complexity Analysis

On the surface, our network does not increase the amount of computation, because the FLOPs (multiply-adds) do not change. In practice, however, it adds a certain amount of computation, for two reasons. First, point-by-point multiplication is somewhat more expensive than point-by-point addition. Second, our designed activation function costs more to evaluate than ReLU. As a result, our network has slightly longer training and prediction times than a ResNet with the same architecture. In our measurements, with a training mini-batch of 128 images, training one CIFAR10 epoch with the 56-layer model takes 22.603 s versus 21.966 s for ResNet, and predicting 128 images takes 41.6 ms versus 40.7 ms (all performed on a single NVIDIA RTX 4090 GPU; NVIDIA Corporation, Santa Clara, CA, USA).
However, in terms of parameters, we add none compared with ResNet. Moreover, as shown in Table 1, the accuracy of our 56-layer network is higher not only than that of the 56-layer ResNet but also than that of the 110-layer ResNet. Measuring the 110-layer ResNet, we find it takes 29.553 s to train one epoch and 45.5 ms to predict 128 images, so its time consumption is much higher than that of our 56-layer network.
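The exact measurement protocol is not described here, but a latency comparison of this kind can be sketched roughly as follows (assuming a CUDA device, warm-up iterations, and the QuotientNet sketch above).

```python
import time
import torch

def measure_inference_ms(model, batch, device="cuda", warmup=10, iters=50):
    """Average forward-pass time, in milliseconds, for one mini-batch."""
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and allocator caches
            model(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0

# Example: measure_inference_ms(QuotientNet(n=9), torch.randn(128, 3, 32, 32))
```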

4. Experiments

We empirically demonstrate the effectiveness of quotient networks on different datasets and compare them against ResNet models to show a steady and considerable improvement in performance. Finally, we visualize the learned feature maps to justify the motivations of quotient networks.

4.1. Datasets

CIFAR10/CIFAR100. Both datasets consist of 32 × 32 RGB images; CIFAR10 has 10 categories and CIFAR100 has 100. The two datasets have the same split sizes: the training set consists of 50,000 images and the test set of 10,000 images. We randomly select 5000 images from the training set as the validation set, and the remaining 45,000 are used for training. We report results on the test set. For data augmentation, we randomly flip the image horizontally, pad all sides with 4 pixels, and then randomly crop a 32 × 32 image. Finally, we normalize the data using the channel means and standard deviations. For testing, we evaluate only the single view of the original 32 × 32 image (normalized with the same means and standard deviations).
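This preprocessing corresponds to a standard torchvision pipeline along the lines below; the normalization statistics shown are the commonly quoted CIFAR10 values and are an assumption, since the paper computes the channel means and standard deviations from the data.

```python
import torchvision.transforms as T

# Assumed CIFAR10 channel statistics; in the paper they are computed from the training data.
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),   # pad all sides with 4 pixels, then randomly crop 32x32
    T.ToTensor(),
    T.Normalize(mean, std),
])

test_transform = T.Compose([       # single view of the original image, normalization only
    T.ToTensor(),
    T.Normalize(mean, std),
])
```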
SVHN. The street view house numbers (SVHN) dataset consists of 32 × 32 images, with 73,257 as the training data set and 26,032 as the test data set. We randomly select 6000 images from the training set as the validation set, train on the remaining 67,257 images, and then test on the test set and report the results. In data preprocessing, we normalize the data using the channel means and standard deviations.

4.2. Training

We follow the training method of the ResNet paper [10]. All networks are trained with stochastic gradient descent (SGD) on all three datasets, with a momentum of 0.9, a batch size of 128, and an initial learning rate of 0.1. The learning rate is divided by 10 at epochs 92 and 136, and training runs for 182 epochs in total.
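In PyTorch terms, this schedule corresponds roughly to the configuration below (reusing the QuotientNet sketch from Section 3.4); weight decay is not stated in this section, so it is omitted here.

```python
import torch

model = QuotientNet(n=9)  # 56-layer sketch from Section 3.4
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[92, 136], gamma=0.1)

for epoch in range(182):
    # ... one pass over the training set with a batch size of 128 ...
    scheduler.step()      # lr: 0.1 -> 0.01 at epoch 92 -> 0.001 at epoch 136
```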

4.3. Classification on CIFAR10

We compare our models with ResNets of different depths. Following our proposed design rules, we choose Equation (1) as the activation function, and this activation function is also used in the head and in the shortcuts (when channels increase). For α, we found experimentally that the optimal value depends on depth: 1.8, 1.7, and 1.5 for the 44-, 56-, and 110-layer networks, respectively; the deeper the network, the smaller the optimal α.
To make the experiments more credible, we run multiple trials and report statistics. The results are shown in Table 1. When comparing networks with the same number of layers, the accuracy of our network is always higher than that of the corresponding ResNet, and in the best runs the error rate of our network is 6% lower than ResNet’s. The accuracy of our 44-layer network is already close to that of the 56-layer ResNet, and when we increase the depth to 56 layers, our network’s accuracy exceeds even that of the 110-layer ResNet. This demonstrates that our network performs considerably better than ResNet.

4.4. Classification on CIFAR100 and SVHN

To verify that our proposed network is suitable for a variety of datasets and does not merely show higher accuracy on CIFAR10, we conduct experiments on both CIFAR100 and SVHN. Since the goal is to show that our network outperforms ResNet on multiple datasets rather than to maximize accuracy on these datasets, we do not perform any dataset-specific tuning. Specifically, on SVHN we use the same network as on CIFAR10. On CIFAR100, we double the number of channels so that the three stages have 32, 64, and 128 channels, respectively, and replace the 10-way fully connected layer with a 100-way one. The value of α in the activation function is kept unchanged on both datasets, i.e., 1.8, 1.7, and 1.5 for the 44-, 56-, and 110-layer networks, respectively.
As on CIFAR10, we run multiple trials and report statistics. The results are shown in Table 2 and Table 3. Our networks stably outperform ResNets on both SVHN and CIFAR100, and, comparing the best results, our network reduces the error rate by 4.3% and 1.7%, respectively, relative to ResNet. Although both the quotient network and ResNet show some overfitting at 110 layers, our network still maintains higher accuracy than ResNet. Moreover, our 44-layer network already outperforms ResNets of all depths. All of this shows that our network has a general advantage over ResNet.

4.5. Visualization

To verify the motivations proposed in the introduction, we visualize the intermediate feature maps (quotients for the quotient network, residuals for ResNet) computed in the first three stacked modules of the 110-layer networks trained on CIFAR10. The feature maps for a frog input image are shown in Figure 4, and feature maps for other categories are shown in Appendix B. As the figure shows, the quotient feature maps are clearer than the residual feature maps, and the complete structure of the frog is easier to see in the quotient features. This is in line with our conjecture: the quotient of new and old features is more likely to be an independent, meaningful feature that reflects a particular aspect of the frog, so it produces more clearly identifiable feature maps. By contrast, ResNet learns the arithmetic difference of different types of features, which lacks an independent attribute meaning, so its feature maps are blurrier and harder to recognize. Moreover, our network learns the relative difference (i.e., the quotient), which is insensitive to the size of the old features and can exert a stable, effective influence on feature values of different sizes. This, in turn, makes our feature maps richer in information than ResNet’s; some ResNet channels even produce feature maps that are approximately uniform in color.
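The intermediate maps can be captured with forward hooks, for example as in the sketch below, which assumes the QuotientBlock sketch from Section 3 where the learned quotient is the output of quotient_act; the paper does not describe its exact extraction code.

```python
import torch

captured = []

def save_quotient(module, inputs, output):
    # Hook on the quotient activation: its output is the learned quotient F(x).
    captured.append(output.detach().cpu())

model = QuotientNet(n=18).eval()                 # 110-layer sketch from Section 3.4
hooks = [blk.quotient_act.register_forward_hook(save_quotient)
         for blk in list(model.stage1)[:3]]      # first three stacked modules

with torch.no_grad():
    model(torch.randn(1, 3, 32, 32))             # replace with a normalized CIFAR10 image

for h in hooks:
    h.remove()
# Each captured[i] has shape (1, 16, 32, 32); every channel can be plotted as one feature map.
```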

4.6. Discussion

Experiments on large datasets have become more important with the rapid increase in computational resources and data volume. However, due to time and hardware constraints, we did not train on large-scale datasets such as ImageNet and did not perform tasks such as object detection based on models pre-trained on ImageNet. Moreover, much about this network remains to be studied. Since ResNet was proposed, many models have adopted residual learning (including transformers); applying quotient learning to these models may also bring good performance or raise interesting questions.

5. Conclusions

We propose a new network architecture called the quotient network. To solve two problems of ResNet, namely that (1) the difference between the two kinds of features does not have an independent and clear meaning and (2) the amount of learning is based on the absolute rather than the relative difference, which is sensitive to the size of existing features, the quotient network changes the learning objective of a network block to the quotient of the target feature and the current feature. We present several guidelines for designing such networks, including the choice of activation function for the quotients, how to change the number of channels, and where to apply the activation functions. Through experiments on three datasets, we demonstrate the strong performance of this network, which consistently outperforms the fully corresponding ResNet; comparing the best results, we reduce the error rate by 6%, 4.3%, and 1.7% relative to ResNet on CIFAR10, SVHN, and CIFAR100, respectively. Moreover, the design of this kind of network is straightforward: it can be obtained by directly changing the sum operation of ResNet to a product operation.

Author Contributions

Conceptualization, P.H.; methodology, P.H.; validation, P.H.; formal analysis, P.H. and J.Z.; investigation, P.H. and C.L.; resources, Q.Z.; data curation, P.H.; writing—original draft preparation, P.H.; writing—review and editing, Q.Z.; visualization, P.H. and J.Z.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Jiangsu Province (Grant No. BK20230548).

Data Availability Statement

The datasets used in this study are publicly available. CIFAR-10 and CIFAR-100 can be accessed at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 1 November 2024), and SVHN is available at http://ufldl.stanford.edu/housenumbers/ (accessed on 1 November 2024). The implementation of the proposed model, as well as our custom ResNet model, will be provided at https://github.com/penghui2024/Quotient-Networks (accessed on 1 November 2024) to support reproducibility.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this section, we verify the design guidelines presented in Section 3.3 of the main text through experiments on the CIFAR10 dataset. Specifically, we compare the accuracy obtained with activation functions of different value ranges, with or without negative values, passing or not passing through the (0, 1) point, globally differentiable or not, and used or not used at the beginning of the network and in the shortcuts when increasing the number of channels.
The value range of the activation function. We use 20-layer quotient networks as the basis for comparison and two classes of activation functions: modified linear functions and modified sigmoid functions. To keep the results comparable, every activation function is fixed to pass through the (0, 1) point, and the convolutional layer at the beginning of the network and the shortcuts used when increasing the number of channels all use the same activation function. The results are shown in Table A1. For the 20-layer networks, accuracy is highest when the value range of the modified linear function is [0, 4] or when the value range of the modified sigmoid function is (0, 2); enlarging or shrinking the range reduces accuracy. In particular, when the value range is unbounded, training does not succeed.
Table A1. Using activation functions with different value ranges.

Activate Function (Mod Linear) | Value Range | Accuracy (%) | Activate Function (Mod Sigmoid) | Value Range | Accuracy (%)
ReLU | [0, +∞) | 10 | | |
min(max(0, x + 1), 8) | [0, 8] | 88.32 | sigmoid(x − ln3) × 4 | (0, 4) | 91.15
min(max(0, x + 1), 4.5) | [0, 4.5] | 90.75 | sigmoid(x − ln1.5) × 2.5 | (0, 2.5) | 91.57
min(max(0, x + 1), 4) | [0, 4] | 91.01 | sigmoid(x) × 2 | (0, 2) | 91.72
min(max(0, x + 1), 3.5) | [0, 3.5] | 90.98 | sigmoid(x − ln0.5) × 1.5 | (0, 1.5) | 91.44
min(max(0, x + 1), 2) | [0, 2] | 90.71 | | |
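For clarity, the two function families compared in Table A1 can be written as simple functions of the upper bound c, constrained (in the later tables) to pass through the (0, 1) point; this is a sketch, not the exact ablation code.

```python
import math
import torch

def mod_linear(x, c):
    """Modified linear activation with range [0, c], passing through (0, 1)."""
    return torch.clamp(x + 1.0, min=0.0, max=c)

def mod_sigmoid(x, c):
    """Modified sigmoid activation with range (0, c), passing through (0, 1); requires c > 1."""
    return torch.sigmoid(x - math.log(c - 1.0)) * c

x = torch.zeros(1)
print(mod_linear(x, 4.0), mod_sigmoid(x, 2.0))   # both evaluate to 1 at the origin
```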
Whether to include a negative region. We continue to use 20-layer networks as the basis for comparison. For the modified linear functions, the size of the value range is kept fixed at 4; for the modified sigmoid functions, it is kept at 2. Every function still passes through the (0, 1) point, and the convolution at the beginning of the network and the shortcuts used when increasing the number of channels all use the same activation function. We vary only whether the value range includes a negative region and how large that region is, as shown in Table A2. For both the modified linear and the modified sigmoid functions, accuracy drops when the value range contains negative values, and the larger the negative region, the greater the drop.
Table A2. Whether the activation functions have negative values.

Activate Function (Mod Linear) | Value Range | Accuracy (%) | Activate Function (Mod Sigmoid) | Value Range | Accuracy (%)
min(max(0, x + 1), 4) | [0, 4] | 91.01 | sigmoid(x) × 2 | (0, 2) | 91.72
min(max(−0.5, x + 1), 3.5) | [−0.5, 3.5] | 90.56 | sigmoid(x + ln3) × 2 − 0.5 | (−0.5, 1.5) | 91.21
min(max(−1, x + 1), 3) | [−1, 3] | 90.37 | sigmoid(x + ln9) × 2 − 0.8 | (−0.8, 1.2) | 91.06
Whether to pass through the point (0, 1). As in ResNet, passing through the (0, 1) point is key to whether a deep network can easily maintain its features, so we compare networks of two depths, 20 and 32 layers. As before, the value range of the modified linear functions is kept at [0, 4] and that of the modified sigmoid functions at (0, 2), and the convolution at the beginning of the network and the shortcuts used when increasing the number of channels all use the designed activation function; only the function value at 0 is varied. The comparison for 20-layer networks is shown in Table A3 and for 32-layer networks in Table A4. Whenever the value at 0 is greater or less than 1, accuracy drops, and the drop becomes more pronounced as depth increases. We also find that, for the 32-layer networks with the modified linear function, accuracy is lower than for the 20-layer networks regardless of whether the function passes through (0, 1). This may be because [0, 4] is no longer a suitable range for 32-layer networks using modified linear functions, but it does not affect the purpose of this experiment.
Table A3. Whether the activation functions pass through the (0, 1) point (20-layer networks).

Activate Function (Mod Linear) | Passing Point | Accuracy (%) | Activate Function (Mod Sigmoid) | Passing Point | Accuracy (%)
min(max(0, x + 0.5), 4) | (0, 0.5) | 90.07 | sigmoid(x − ln3) × 2 | (0, 0.5) | 91.43
min(max(0, x + 1), 4) | (0, 1) | 91.01 | sigmoid(x) × 2 | (0, 1) | 91.72
min(max(0, x + 1.5), 4) | (0, 1.5) | 90.58 | sigmoid(x + ln3) × 2 | (0, 1.5) | 91.46
Table A4. Whether the activation functions pass through the (0, 1) point (32-layer networks).

Activate Function (Mod Linear) | Passing Point | Accuracy (%) | Activate Function (Mod Sigmoid) | Passing Point | Accuracy (%)
min(max(0, x + 0.5), 4) | (0, 0.5) | 82.84 | sigmoid(x − ln3) × 2 | (0, 0.5) | 91.72
min(max(0, x + 1), 4) | (0, 1) | 86.47 | sigmoid(x) × 2 | (0, 1) | 92.51
min(max(0, x + 1.5), 4) | (0, 1.5) | 85.84 | sigmoid(x + ln3) × 2 | (0, 1.5) | 91.9
Globally differentiable or not. From the results above, the accuracy of the modified linear functions is consistently and noticeably lower than that of the modified sigmoid functions. For the quotient network, the globally differentiable modified sigmoid is therefore the more appropriate activation function.
The head and shortcuts (when channels increase). Finally, we compare the accuracy changes caused by whether the designed activation function is used in the first convolution and in the shortcuts when the number of channels increases. We use 20-layer networks as the basis, with a modified linear function of range [0, 4] and a modified sigmoid function of range (0, 2), both passing through the (0, 1) point. We compare three placements: not using the activation function in either place, using it only in the head, and using it in both the head and the shortcuts (when channels increase), as shown in Table A5. Accuracy is worst when the activation function is used in neither place, second when it is used only in the head, and best when it is used in both places, where the improvement is significant.
Table A5. Placing the designed activation function at different positions.

Activate Function (Mod Linear) | Position | Accuracy (%) | Activate Function (Mod Sigmoid) | Position | Accuracy (%)
min(max(0, x + 1), 4) | null | 89.15 | sigmoid(x) × 2 | null | 90.67
min(max(0, x + 1), 4) | head | 89.57 | sigmoid(x) × 2 | head | 90.93
min(max(0, x + 1), 4) | head + shortcuts | 90.01 | sigmoid(x) × 2 | head + shortcuts | 91.72

Appendix B

In this section, we visualize the intermediate feature maps (quotients for the quotient network, residuals for ResNet) of the first three stacked modules when the inputs are images of other categories: a bird in Figure A1, a plane in Figure A2, a dog in Figure A3, a ship in Figure A4, and a horse in Figure A5. All of these images support the motivations of the quotient network.
Figure A1. The middle feature maps when the input image is a bird. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.
Figure A2. The middle feature maps when the input image is a plane. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.
Figure A3. The middle feature maps when the input image is a dog. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.
Figure A4. The middle feature maps when the input image is a ship. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.
Figure A5. The middle feature maps when the input image is a horse. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  3. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  5. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  6. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  7. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Cagliari, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  9. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  11. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
  12. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  13. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Vancouver, BC, Canada, 4–9 December 2011; p. 4. [Google Scholar]
  14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  18. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  19. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  20. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  21. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  22. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  24. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  25. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  26. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  27. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  28. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  29. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3024–3033. [Google Scholar]
  30. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  32. Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  33. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  37. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  38. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296. [Google Scholar]
  39. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  40. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  41. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 12077–12090. [Google Scholar]
  42. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
Figure 1. The residual module (left) and the quotient module (right).
Figure 2. Convolution processing before stacking quotient modules.
Figure 3. The residual module (left) and the quotient module (right) when changing the number of channels.
Figure 4. The middle feature maps when the input image is a frog. The left is for the quotient network, and the right is for ResNet. From top to bottom, it is for the first, second, and third stacked modules in order.
Table 1. Comparison with ResNets of different layers on the CIFAR10 dataset. Results are expressed as “mean ± std”.
#Params | Quotient Network | Accuracy (%) | ResNet | Accuracy (%)
0.68 M | quotient network 44 | 92.78 ± 0.25 | ResNet 44 | 92.61 ± 0.33
0.87 M | quotient network 56 | 93.1 ± 0.15 | ResNet 56 | 92.84 ± 0.18
1.75 M | quotient network 110 | 93.44 ± 0.17 | ResNet 110 | 93.02 ± 0.33
Table 2. Comparison with ResNets of different layers on the SVHN dataset. Results are expressed as “mean ± std”.
#Params | Quotient Network | Accuracy (%) | ResNet | Accuracy (%)
0.68 M | quotient network 44 | 96.17 ± 0.12 | ResNet 44 | 95.98 ± 0.04
0.87 M | quotient network 56 | 96.20 ± 0.11 | ResNet 56 | 95.96 ± 0.06
1.75 M | quotient network 110 | 96.12 ± 0.05 | ResNet 110 | 96.03 ± 0.01
Table 3. Comparison with ResNets of different layers on the CIFAR100 dataset. Results are expressed as “mean ± std”.
#Params | Quotient Network | Accuracy (%) | ResNet | Accuracy (%)
2.72 M | quotient network 44 | 73.25 ± 0.27 | ResNet 44 | 72.66 ± 1.24
3.50 M | quotient network 56 | 73.53 ± 0.18 | ResNet 56 | 73.07 ± 0.24
6.99 M | quotient network 110 | 73.00 ± 0.55 | ResNet 110 | 72.34 ± 0.95
