Appendix A
In this section, we verify the design guidelines presented in
Section 3.3 of the main text by conducting experiments on the CIFAR-10 dataset. Specifically, we compare accuracy across activation functions with different value ranges, with and without negative values, passing or not passing through the point (0, 1), globally differentiable or not, and used or not used at the beginning of the network and in the shortcuts that increase the number of channels.
The value range of the activation function. We use 20-layer quotient networks as the basis for comparison, with two classes of activation functions: modified linear functions and modified sigmoid functions. To keep the results comparable, each activation function is fixed to pass through the point (0, 1), and both the convolutional layer at the beginning of the network and the shortcuts that increase the number of channels use this activation function. The experimental results are shown in
Table A1. As the table shows, for the 20-layer networks, accuracy is highest when the value range of the modified linear function is [0, 4] or the value range of the modified sigmoid function is (0, 2). Accuracy drops whether the value range is enlarged or shrunk; in particular, when the value range is infinite, training does not succeed.
Table A1.
Using activation functions with different value ranges.
| Activation Function (Mod Linear) | Value Range | Accuracy (%) | Activation Function (Mod Sigmoid) | Value Range | Accuracy (%) |
|---|---|---|---|---|---|
| ReLU | [0, +∞) | 10 | | | |
| min(max(0, x + 1), 8) | [0, 8] | 88.32 | sigmoid(x − ln3) × 4 | (0, 4) | 91.15 |
| min(max(0, x + 1), 4.5) | [0, 4.5] | 90.75 | sigmoid(x − ln1.5) × 2.5 | (0, 2.5) | 91.57 |
| min(max(0, x + 1), 4) | [0, 4] | 91.01 | sigmoid(x) × 2 | (0, 2) | 91.72 |
| min(max(0, x + 1), 3.5) | [0, 3.5] | 90.98 | sigmoid(x − ln0.5) × 1.5 | (0, 1.5) | 91.44 |
| min(max(0, x + 1), 2) | [0, 2] | 90.71 | | | |
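For reference, the two activation families in Table A1 can be written as simple elementwise functions. The following is a minimal PyTorch sketch (the function names mod_linear and mod_sigmoid are ours, not from the main text); each function is constructed to pass through (0, 1), with the size of the value range as a parameter.

```python
import math
import torch

def mod_linear(x: torch.Tensor, high: float) -> torch.Tensor:
    """Modified linear activation: clamp(x + 1, 0, high).
    Value range [0, high]; passes through (0, 1)."""
    return torch.clamp(x + 1.0, min=0.0, max=high)

def mod_sigmoid(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Modified sigmoid activation: sigmoid(x + shift) * scale.
    Value range (0, scale); the shift -ln(scale - 1) makes it pass through (0, 1),
    e.g. scale = 4 gives sigmoid(x - ln 3) * 4 and scale = 2 gives sigmoid(x) * 2."""
    shift = -math.log(scale - 1.0)  # requires scale > 1
    return torch.sigmoid(x + shift) * scale
```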
Whether the value range contains a negative region. We continue to use 20-layer networks as the basis for comparison. For the modified linear functions, the size of the value range is kept at 4; for the modified sigmoid functions, it is kept at 2. Each function still passes through the point (0, 1), and the convolution at the beginning of the network and the shortcuts that increase the number of channels use this activation function. We only change whether the value range includes a negative region and how large that region is, as shown in
Table A2. For both the modified linear and the modified sigmoid functions, accuracy is reduced when the value range contains a negative region, and the larger the negative region, the greater the drop in accuracy.
Table A2.
Whether the activation functions have negative values.
| Activation Function (Mod Linear) | Value Range | Accuracy (%) | Activation Function (Mod Sigmoid) | Value Range | Accuracy (%) |
|---|---|---|---|---|---|
| min(max(0, x + 1), 4) | [0, 4] | 91.01 | sigmoid(x) × 2 | (0, 2) | 91.72 |
| min(max(−0.5, x + 1), 3.5) | [−0.5, 3.5] | 90.56 | sigmoid(x + ln3) × 2 − 0.5 | (−0.5, 1.5) | 91.21 |
| min(max(−1, x + 1), 3) | [−1, 3] | 90.37 | sigmoid(x + ln9) × 2 − 0.8 | (−0.8, 1.2) | 91.06 |
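The sigmoid entries in Table A2 follow a single construction: keep the size of the value range fixed, shift the lower bound below zero, and re-solve for the horizontal shift so that the function still passes through (0, 1). A minimal sketch of this rule (the helper name mod_sigmoid_range is ours):

```python
import math
import torch

def mod_sigmoid_range(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Sigmoid-based activation with value range (lo, hi) that passes through (0, 1):
    f(x) = sigmoid(x + c) * (hi - lo) + lo, with c chosen so that f(0) = 1.
    For (lo, hi) = (-0.5, 1.5) this gives c = ln 3; for (-0.8, 1.2), c = ln 9,
    matching the entries in Table A2. Requires lo < 1 < hi."""
    p = (1.0 - lo) / (hi - lo)       # required value of sigmoid(c)
    c = math.log(p / (1.0 - p))      # logit(p)
    return torch.sigmoid(x + c) * (hi - lo) + lo
```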
Whether the function passes through the point (0, 1). As in ResNet, passing through the point (0, 1) is the key to whether a deep network can easily maintain its features, so we compare two depths, 20 and 32 layers. As before, the value range of the modified linear function is kept as [0, 4] and that of the modified sigmoid function as (0, 2), and the convolution at the beginning of the network and the shortcuts that increase the number of channels use the designed activation function; only the function value at 0 is varied. The comparison of 20-layer networks is shown in
Table A3, and the comparison of 32-layer networks is shown in
Table A4. Whether the value at 0 is greater or less than 1, accuracy is reduced, and the reduction becomes more pronounced as the depth increases. We also find that with the modified linear function, the 32-layer networks are less accurate than the 20-layer networks regardless of whether the function passes through (0, 1); this may be because [0, 4] is no longer suitable for 32-layer networks with modified linear functions, but it does not affect our experimental conclusions.
Table A3.
Whether the activation functions pass through the (0, 1) point (20-layer networks).
| Activation Function (Mod Linear) | Passing Point | Accuracy (%) | Activation Function (Mod Sigmoid) | Passing Point | Accuracy (%) |
|---|---|---|---|---|---|
| min(max(0, x + 0.5), 4) | (0, 0.5) | 90.07 | sigmoid(x − ln3) × 2 | (0, 0.5) | 91.43 |
| min(max(0, x + 1), 4) | (0, 1) | 91.01 | sigmoid(x) × 2 | (0, 1) | 91.72 |
| min(max(0, x + 1.5), 4) | (0, 1.5) | 90.58 | sigmoid(x + ln3) × 2 | (0, 1.5) | 91.46 |
Table A4.
Whether the activation functions pass through the (0, 1) point (32-layer networks).
| Activation Function (Mod Linear) | Passing Point | Accuracy (%) | Activation Function (Mod Sigmoid) | Passing Point | Accuracy (%) |
|---|---|---|---|---|---|
| min(max(0, x + 0.5), 4) | (0, 0.5) | 82.84 | sigmoid(x − ln3) × 2 | (0, 0.5) | 91.72 |
| min(max(0, x + 1), 4) | (0, 1) | 86.47 | sigmoid(x) × 2 | (0, 1) | 92.51 |
| min(max(0, x + 1.5), 4) | (0, 1.5) | 85.84 | sigmoid(x + ln3) × 2 | (0, 1.5) | 91.9 |
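The passing points in Tables A3 and A4 can be checked directly by evaluating each activation at x = 0. A short plain-Python check (it verifies the function values, not the reported accuracies):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Value at x = 0 for each activation in Tables A3 and A4
# (should match the "Passing Point" column).
print(sigmoid(0 - math.log(3)) * 2)   # 0.5
print(sigmoid(0) * 2)                 # 1.0
print(sigmoid(0 + math.log(3)) * 2)   # 1.5
print(min(max(0, 0 + 0.5), 4))        # 0.5
print(min(max(0, 0 + 1.0), 4))        # 1.0
print(min(max(0, 0 + 1.5), 4))        # 1.5
```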
Globally differentiable or not. From the results above, it is clear that the accuracy of the modified linear function is consistently much lower than that of the modified sigmoid function. For the quotient network, the modified sigmoid function, which is globally differentiable, is therefore the more appropriate activation function.
The head and the shortcuts that increase channels. Finally, we compare the change in accuracy caused by whether the designed activation function is used after the first convolution and in the shortcuts that increase the number of channels. Again, 20-layer networks are used as the basis, with a modified linear function of value range [0, 4] and a modified sigmoid function of value range (0, 2), both passing through the point (0, 1). We compare three settings: using the activation function in neither place, only in the head, and in both the head and the channel-increasing shortcuts, as shown in
Table A5. Accuracy is worst when the activation is used in neither place, second when it is used only in the head, and best, by a clear margin, when it is used in both places.
Table A5.
Placing the designed activation function at different positions.
| Activation Function (Mod Linear) | Position | Accuracy (%) | Activation Function (Mod Sigmoid) | Position | Accuracy (%) |
|---|---|---|---|---|---|
| min(max(0, x + 1), 4) | Neither | 89.15 | sigmoid(x) × 2 | Neither | 90.67 |
| min(max(0, x + 1), 4) | Head | 89.57 | sigmoid(x) × 2 | Head | 90.93 |
| min(max(0, x + 1), 4) | Head + shortcuts | 90.01 | sigmoid(x) × 2 | Head + shortcuts | 91.72 |
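To make the three settings in Table A5 concrete, the sketch below shows in PyTorch where the designed activation can be toggled: after the first convolution of the network (the head) and after the 1 × 1 projection in a shortcut that increases the number of channels. The module structure (kernel sizes, batch normalization, the 1 × 1 projection, and what the baseline uses instead of the designed activation) is an assumption for illustration and is not taken from the main text.

```python
import torch
import torch.nn as nn

def designed_act(x: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # Designed activation: value range (0, scale), passes through (0, 1).
    return torch.sigmoid(x) * scale

class Head(nn.Module):
    """First convolution of the network, optionally followed by the designed activation."""
    def __init__(self, out_ch: int = 16, use_designed_act: bool = True):
        super().__init__()
        self.conv = nn.Conv2d(3, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.use_designed_act = use_designed_act

    def forward(self, x):
        y = self.bn(self.conv(x))
        # If the designed activation is not used here, the head output is passed on as-is.
        return designed_act(y) if self.use_designed_act else y

class IncreasingShortcut(nn.Module):
    """Shortcut used when the number of channels increases (a 1x1 projection here),
    optionally followed by the designed activation."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2, use_designed_act: bool = True):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.use_designed_act = use_designed_act

    def forward(self, x):
        y = self.bn(self.proj(x))
        return designed_act(y) if self.use_designed_act else y
```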
Appendix B
In this section, we visualize the first three intermediate feature maps (the quotient for the quotient network, the residual for ResNet) when the inputs are images of other categories: the bird in
Figure A1, the plane in
Figure A2, the dog in
Figure A3, the ship in
Figure A4, and the horse in
Figure A5. All of these images support the motivation of the quotient network.
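One way to obtain such intermediate maps is to register forward hooks on the modules of interest and run a single forward pass. A minimal PyTorch sketch follows; the choice of module names, and the fact that the hook captures module outputs rather than the quotient or residual quantities themselves, are assumptions for illustration.

```python
import torch

def collect_feature_maps(model, image, module_names):
    """Capture the outputs of the named submodules (e.g., the first three stacked
    modules) during one forward pass, using forward hooks."""
    feature_maps, handles = {}, []
    modules = dict(model.named_modules())
    for name in module_names:
        hook = lambda _m, _inp, out, name=name: feature_maps.__setitem__(name, out.detach())
        handles.append(modules[name].register_forward_hook(hook))
    with torch.no_grad():
        model(image.unsqueeze(0))  # add a batch dimension
    for handle in handles:
        handle.remove()
    return feature_maps
```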
Figure A1.
The intermediate feature maps when the input image is a bird. Left: the quotient network; right: ResNet. From top to bottom: the first, second, and third stacked modules.
Figure A2.
The intermediate feature maps when the input image is a plane. Left: the quotient network; right: ResNet. From top to bottom: the first, second, and third stacked modules.
Figure A3.
The intermediate feature maps when the input image is a dog. Left: the quotient network; right: ResNet. From top to bottom: the first, second, and third stacked modules.
Figure A4.
The intermediate feature maps when the input image is a ship. Left: the quotient network; right: ResNet. From top to bottom: the first, second, and third stacked modules.
Figure A5.
The intermediate feature maps when the input image is a horse. Left: the quotient network; right: ResNet. From top to bottom: the first, second, and third stacked modules.