1. Introduction
The convolutional neural network (CNN) is the fundamental neural network for solving various machine learning tasks in computer vision, such as classification [1], segmentation, and denoising. One problem with CNN training is that the convolution operations of all convolutional layers incur considerable cost. In particular, as the size of the image or kernel increases, the amount of computation inevitably increases, delaying learning. One method proposed to solve this problem is to change the domain through the Fourier transform and construct the CNN in the frequency domain, because convolution in the spatial domain is equivalent to point-wise multiplication in the Fourier domain. In general, point-wise multiplication is simpler and computationally cheaper than convolution. Prior approaches have focused on improving computational speed to handle this time cost problem [2,3,4].
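To illustrate the convolution theorem that motivates this line of work, the following minimal NumPy sketch (our own illustrative example, shown in 1-D for brevity; np.fft.fft2 extends it to images) verifies the equivalence numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)  # signal (e.g., one image row)
k = rng.standard_normal(64)  # kernel, zero-padded to the signal length

# Direct circular convolution in the spatial domain: O(N^2).
direct = np.array([sum(x[i] * k[(n - i) % 64] for i in range(64))
                   for n in range(64)])

# Fourier route: transform, multiply point-wise, transform back: O(N log N).
fourier = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

assert np.allclose(direct, fourier)  # identical up to floating-point error
```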
Two factors determine computational complexity. One is time complexity, which existing studies have addressed; the other is memory complexity, which has been studied for model weight reduction in the spatial domain [5,6,7,8,9,10]. However, previous studies in the Fourier domain have not addressed methods of reducing the number of parameters, which directly affects memory complexity. Efficient use of memory is a critical issue, since unlimited resources are not available in the real world. A recent example is applications on mobile devices, which require both high speed and lightweight models. Moreover, since GPUs and memory are limited, building a neural network that can be trained sufficiently with few GPUs is important. CNNs are also used for practical applications in a variety of engineering areas, such as condition monitoring of marine vehicles [11] and fault diagnosis of the cerebral cortex [12]. Therefore, as a step toward performing deep learning in such limited environments, we propose a method of reducing the number of parameters of the convolutional neural network, the base of these models. Since previous studies on CNNs in the Fourier domain have focused on efficient neural networks for high-speed training, while parameter reduction has been actively investigated in the spatial domain rather than the Fourier domain, our goal is to design an efficient CNN by applying kernel methods with few parameters.
In previous studies, implementations of the convolutional layer and of pooling corresponding to their spatial-domain counterparts were actively investigated in the Fourier domain [13]. While ReLU-based activation functions are known to be effective in the spatial domain, an appropriate activation function for the Fourier domain has not been established. Previous approaches mainly rely on approximations of ReLU; however, these are limited in that they carry a huge computational cost, cannot act as nonlinear operations, and cannot be expected to match, let alone exceed, the accuracy of ReLU in the spatial domain [14,15,16]. In contrast, we present an activation function for the Fourier domain that performs the same operation as ReLU in the spatial domain, and we introduce a new Fourier convolutional network that applies this new activation function. The novel activation function is built upon the characteristics of the Fourier-transformed image, which consists of phase and magnitude; this is covered in detail in Section 3.
There are two major components to weight reduction in the Fourier CNN. The first is to adjust the kernel size in our proposed Fourier CNN, and the second is to learn the standard deviation (std.) of a Gaussian distribution for a random kernel based on compressed sensing. First, unlike the spatial domain, where information is local, Fourier-transformed images carry global information. In the spatial domain, a large kernel may be needed to locate pixels and extract image characteristics locally. In the Fourier domain, by contrast, even a small kernel is expected to capture the characteristics of an image and perform classification by using the global information contained in the low- and high-frequency components. Second, in compressed sensing, a random vector is generated by multiplying a sparse signal by a random matrix with Gaussian-distributed entries, from which the original signal can be compressively restored. Based on this theory, we assume that a random filter can be used by learning scalar values for a fixed Gaussian random matrix; that is, by learning a standard deviation for the Gaussian distribution, image classification can be performed with few parameters (see the sketch below).
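A minimal sketch of this second idea, under our own simplifying assumptions (one shared learnable std. per filter, plain NumPy instead of a deep learning framework), is the following:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random basis drawn once from N(0, 1); it is never updated.
base_kernel = rng.standard_normal((3, 3))

# The only trainable parameter: a scalar standard deviation sigma.
# The effective kernel sigma * base_kernel has entries ~ N(0, sigma^2),
# so a whole 3x3 filter costs one parameter instead of nine.
sigma = 1.0

def effective_kernel(sigma: float) -> np.ndarray:
    return sigma * base_kernel

# In backpropagation, the gradient w.r.t. sigma is simply
# sum(grad_of_loss_wrt_kernel * base_kernel), a single scalar.
```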
In conclusion, the contributions of this paper are as follows: we present a new activation function in the Fourier domain; discard the unnecessary shift in the Fourier transform process; introduce a novel convolutional neural network that uses a small-sized kernel in the Fourier domain based on our proposed activation function; and investigate an efficient convolutional neural network based on a random kernel in the Fourier domain to reduce the number of weight parameters.
2. Related Work
Convolutional neural networks (CNNs) have been used to extract and learn image features in computer vision tasks such as classification and segmentation [1]. In addition, deep neural networks such as AlexNet, VGG, DenseNet, and ResNet [17,18,19,20] have been widely developed and effectively applied to various tasks using large datasets such as ImageNet [20].
However, deep neural network models have the problem that the number of learnable parameters grows as the layers become deeper, requiring significant memory for storing the model and high computational costs for training.
In particular, such models are difficult to apply in mobile environments with limited hardware resources. As one method of solving this problem, model compression [6,8] has been explored. For example, ShuffleNets are designed and optimized for the mobile device environment [7]; MobileNet-based models reduce the number of parameters through depth-wise separable convolution [5,9]; and SqueezeNet achieves performance similar to AlexNet with fewer parameters by reducing the number of input channels to its 3 × 3 filters and replacing 3 × 3 filters with 1 × 1 filters [10].
Other techniques for model compression include pruning [21,22], distillation [23], and quantization [24]. First, pruning removes neurons or weights that carry less important information [21,22]. Second, distillation transfers the knowledge of a larger, ensembled neural network to a relatively small single network, addressing the inefficient use of memory resources that typically arises when models are ensembled [23]. Third, quantization minimizes the loss of accuracy relative to full precision while using a low bit width [24].
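To make the first of these concrete, magnitude-based weight pruning can be sketched in a few lines (a generic illustration of our own, not the specific procedures of [21,22]):

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_pruned = prune_by_magnitude(w, sparsity=0.9)  # keep only the largest 10%
```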
However, these methods have a fundamental limitation: they cannot address the time cost of convolving the image with the kernel in the convolutional layer, the stage at which image features are learned. In recent years, to solve this time cost problem, CNNs in the Fourier domain have been actively studied, exploiting the fact that convolution in the spatial domain is equivalent to point-wise multiplication in the Fourier domain [14,16,25,26,27].
The convolutional layer in the Fourier domain has been widely explored for its time complexity, because point-wise multiplication in the Fourier domain is much faster than convolution in the spatial domain. The fast Fourier transform (FFT) computes the discrete Fourier transform (DFT) efficiently, and the Cooley–Tukey algorithm is the most widely used FFT algorithm. However, training only the convolutional layer in the Fourier domain requires an additional operation, the inverse Fourier transform; therefore, pooling and activation functions that allow CNNs to be trained fully in the frequency domain have been studied in recent years [2,3,4].
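For reference, the recursive radix-2 form of the Cooley–Tukey algorithm, which reduces the O(N²) DFT to O(N log N) for power-of-two lengths, can be written as follows (a textbook sketch, not production code):

```python
import numpy as np

def fft_radix2(x: np.ndarray) -> np.ndarray:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.default_rng(0).standard_normal(8)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```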
First, spectral pooling has been proposed: the Fourier-transformed image is truncated to a predetermined size so that only the low-frequency components, which serve as a spectral representation and carry the most important information, are retained. However, because a Fourier transform is performed before spectral pooling and an inverse Fourier transform after each pooling, iterating this process incurs additional computational cost. Moreover, this method did not consider training the convolutional layer in the Fourier domain and did not examine the resulting computational cost [13]. Another proposed pooling method is discrete Fourier transform (DFT)-based magnitude pooling. The first component of the Fourier-transformed image is the DC component; DC stands for direct current in electrical engineering, but in the Fourier domain, it simply refers to the zero frequency, i.e., the mean value of the image. The whole training process is implemented by calculating the magnitude of the DFT and reducing the resolution while keeping this first component. However, phase information is not considered in DFT-based pooling, even though preserving both the phase and the magnitude of the Fourier-transformed image is vital for reconstructing the image after the inverse Fourier transform. In addition, a method that uses a large number of parameters to create ensemble networks can hardly be regarded as an efficient network in the frequency domain [28].
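Under our reading of [13], spectral pooling amounts to a center crop of the shifted spectrum; the following minimal sketch illustrates the idea (the exact implementation in the original work may differ):

```python
import numpy as np

def spectral_pool(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Downsample by keeping only the central (low-frequency) spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # low frequencies at center
    h, w = spectrum.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    cropped = spectrum[top:top + out_h, left:left + out_w]
    # Back to the spatial domain; rescale to preserve the mean intensity.
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped))) * (out_h * out_w) / (h * w)

rng = np.random.default_rng(0)
pooled = spectral_pool(rng.standard_normal((32, 32)), 16, 16)
```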
Second, suitable activation functions in the Fourier domain have been actively investigated, in two main directions. One is to approximate an activation function of the spatial domain with a similarly shaped function in the Fourier domain [15,16]. The other is to derive a new formula for a Fourier-based activation function that takes the properties of the frequency components into account [14]. One of the most popular functions of the first kind is spectral ReLU (SReLU) [15], which approximates the conventional ReLU with a quadratic function; the basic idea is to find a quadratic that is roughly similar to ReLU in the spatial domain. However, calculating the quadratic for each activation incurs considerable time complexity [29]: when the input is large or the network is deep, evaluating SReLU becomes a computational burden.
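To make the approach concrete, one way to obtain such a quadratic is a least-squares fit of ReLU on a working interval; the sketch below is our own illustration, and the fitted coefficients are not the ones used in [15]:

```python
import numpy as np

# Fit a*x^2 + b*x + c to ReLU(x) on [-1, 1] by least squares.
x = np.linspace(-1.0, 1.0, 1001)
relu = np.maximum(x, 0.0)
a, b, c = np.polyfit(x, relu, deg=2)  # roughly a = 0.47, b = 0.50, c = 0.09

def srelu_like(x: np.ndarray) -> np.ndarray:
    """Quadratic stand-in for ReLU, evaluable element-wise on coefficients."""
    return a * x**2 + b * x + c
```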
Another approach to approximation in the Fourier domain is to find a linear function similar to the tanh and sigmoid functions of the spatial domain by using the linearity of the Fourier transform. However, such linear functions cannot act as nonlinearities, making it difficult to train complex models.
A more recent approach to the activation function uses the properties of the low- and high-frequency components in the Fourier domain. For instance, the second harmonics superposition activation function (2SReLU) has been proposed to superpose the first and second harmonics of the Fourier-transformed image, including its DC component. Since the first harmonic of the image contains the low frequencies and the second harmonic contains some high frequencies, the neural network can be trained on both low and high frequencies. In addition, because the Fourier-transformed image is composed of complex numbers, it can be expressed through the magnitudes of its real and imaginary parts together with periodic cosine and sine functions, respectively. Given this composition, adding several harmonics causes the sine-wave terms to converge toward zero, much like the negative part of ReLU [14]. In Equation (1), $F$ denotes the Fourier transform, $F_i$ is the $i$th harmonic (interval), and the hyper-parameters $\alpha$ and $\beta$ are predetermined as 0.7 and 0.3. After weighting the first and second harmonics by $\alpha$ and $\beta$, respectively, the two harmonics are added as follows:

$$\mathrm{2SReLU}(F) = \alpha F_1 + \beta F_2, \qquad \alpha = 0.7,\; \beta = 0.3. \tag{1}$$
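A sketch of this superposition is given below, under our simplifying assumption that the two harmonics are the two equal sub-bands of a 1-D spectrum; the band boundaries used in [14] may differ:

```python
import numpy as np

ALPHA, BETA = 0.7, 0.3  # harmonic weights from Equation (1)

def two_srelu(spectrum: np.ndarray) -> np.ndarray:
    """Superpose the first two harmonic bands of a 1-D spectrum.

    Here the i-th 'harmonic' F_i is taken to be the i-th of two equal
    sub-bands (an assumption on our part); F_1 keeps the DC component.
    """
    n = len(spectrum)
    f1 = spectrum[: n // 2]  # low-frequency band, includes DC
    f2 = spectrum[n // 2 :]  # higher-frequency band
    return ALPHA * f1 + BETA * f2

spec = np.fft.fft(np.random.default_rng(0).standard_normal(64))
out = two_srelu(spec)  # output has half the spectral resolution
```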
Yet, in terms of classification accuracy, 2SReLU [14] fits Fourier-based CNNs poorly compared with the earlier activation function SReLU [15]. Therefore, our novel activation function focuses on fitting the Fourier domain while accounting for the characteristics of frequency-domain images.
Recently, several complex value-based activation functions, such as modReLU [30], zReLU [31], and complex ReLU ($\mathbb{C}$ReLU) [32], were introduced. First, modReLU is defined as Equation (3): the activation is applied only where the magnitude offset by a learnable bias term $b$ is positive, where $z$ is a complex number whose phase is denoted as $\theta_z$. The equation is designed to preserve the pre-activation phase information [30]:

$$\mathrm{modReLU}(z) = \mathrm{ReLU}(|z| + b)\, e^{i\theta_z} = \begin{cases} (|z| + b)\,\dfrac{z}{|z|}, & \text{if } |z| + b \geq 0,\\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Second, zReLU is described as Equation (4): zReLU keeps the input number $z$ when its phase lies in the first quadrant and otherwise replaces it with 0 [31]:

$$\mathrm{zReLU}(z) = \begin{cases} z, & \text{if } \theta_z \in [0, \pi/2],\\ 0, & \text{otherwise.} \end{cases} \tag{4}$$
Third, $\mathbb{C}$ReLU is the most recent complex number-based activation function and retains more information than modReLU and zReLU; it is given as Equation (5):

$$\mathbb{C}\mathrm{ReLU}(z) = \mathrm{ReLU}(\Re(z)) + i\,\mathrm{ReLU}(\Im(z)). \tag{5}$$
Assuming that the positive and negative values of a complex-valued image are distributed over the four quadrants, $\mathbb{C}$ReLU has the advantage that information is retained in three of them: because ReLU is applied to the real and imaginary components separately, only inputs whose real and imaginary components are both below 0 are zeroed out. In contrast, for modReLU, the learnable bias term $b$ renders a circle of radius $|b|$ inactive, with the region outside the circle active, and zReLU is active only when the input phase lies in the first quadrant.
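The three activations of Equations (3)–(5) translate directly into code; the following NumPy sketch illustrates the quadrant behavior just described:

```python
import numpy as np

def mod_relu(z: np.ndarray, b: float) -> np.ndarray:
    """Equation (3): active outside a circle of radius |b|, phase preserved."""
    mag = np.abs(z)
    return np.where(mag + b >= 0, (mag + b) * z / np.maximum(mag, 1e-12), 0)

def z_relu(z: np.ndarray) -> np.ndarray:
    """Equation (4): pass z through only when its phase is in the first quadrant."""
    first_quadrant = (z.real >= 0) & (z.imag >= 0)
    return np.where(first_quadrant, z, 0)

def c_relu(z: np.ndarray) -> np.ndarray:
    """Equation (5): apply ReLU separately to the real and imaginary parts."""
    return np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0)

z = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j])  # one point per quadrant
print(z_relu(z))  # only the first-quadrant input survives
print(c_relu(z))  # zero only for the third-quadrant input
```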
Furthermore, Fourier-based CNNs have been studied that complete training entirely in the frequency domain before the fully connected layers [25,26,27], eliminating the additional computation of the inverse Fourier transform otherwise needed after activation [16] or pooling layers [15,33]. However, previous methods of training CNNs within the Fourier domain have several limitations. First, an activation function in the Fourier domain corresponding to ReLU in the conventional spatial-domain CNN has not yet been explicitly described [25,26,27,33]. Second, existing research on reducing memory cost in the Fourier domain was not conducted entirely in the Fourier domain. In addition, the previously presented tanh-based activation function in the spectral domain operates on a different principle from ReLU in conventional CNNs [33]. Third, earlier studies on memory cost considered only zero sparsity via model compression. However, applying a kernel with a small number of parameters is another route to efficient training, with the advantage that it can be applied to various CNNs regardless of architecture rather than being limited to a compressed model. Therefore, we intend to design an efficient CNN in the Fourier domain by reducing the number of parameters, using kernel methods that directly reduce memory consumption.