Tutorial

Understanding the Flows of Signals and Gradients: A Tutorial on Algorithms Needed to Implement a Deep Neural Network from Scratch

by
Przemysław Klęsk
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Żołnierska 49, 71-210 Szczecin, Poland
Appl. Sci. 2024, 14(21), 9972; https://doi.org/10.3390/app14219972
Submission received: 23 September 2024 / Revised: 18 October 2024 / Accepted: 23 October 2024 / Published: 31 October 2024
(This article belongs to the Special Issue Advanced Digital Signal Processing and Its Applications)

Abstract:
Theano, TensorFlow, Keras, Torch, PyTorch, and other software frameworks have remarkably stimulated the popularity of deep learning (DL). Apart from all the good they achieve, the danger of such frameworks is that they unintentionally spur a black-box attitude. Some practitioners play around with building blocks offered by frameworks and rely on them, having a superficial understanding of the internal mechanics. This paper constitutes a concise tutorial that elucidates the flows of signals and gradients in deep neural networks, enabling readers to successfully implement a deep network from scratch. By “from scratch”, we mean with access to a programming language and numerical libraries but without any components that hide DL computations underneath. To achieve this goal, the following five topics need to be well understood: (1) automatic differentiation, (2) the initialization of weights, (3) learning algorithms, (4) regularization, and (5) the organization of computations. We cover all of these topics in the paper. From a tutorial perspective, the key contributions include the following: (a) proposition of R and S operators for tensors—reshape and stack, respectively—that facilitate algebraic notation of computations involved in convolutional, pooling, and flattening layers; (b) a Python project named hmdl (“home-made deep learning”); and (c) consistent notation across all mathematical contexts involved. The hmdl project serves as a practical example of implementation and a reference. It was built using NumPy and Numba modules with JIT and CUDA amenities applied. In the experimental section, we compare hmdl implementation to Keras (backed with TensorFlow). Finally, we point out the consistency of the two in terms of convergence and accuracy, and we observe the superiority of the latter in terms of efficiency.

1. Introduction

Software frameworks and libraries, such as Theano, TensorFlow, Keras, Torch, PyTorch, and others, have undoubtedly and remarkably stimulated the popularity of research on deep learning (DL) and its applications among practitioners from different fields. Imagine someone trying to select a few significant research papers on DL, e.g., from a particular month of the current year. Those are so numerous that the task seems hopeless and pointless. In fact, the “hype” around DL is so strong that it is even hard to publish a new paper on machine learning if the proposed ideas do not address DL in some way.
It is easy to notice that the large majority of research on DL is, in fact, application work conducted by practitioners capable of using frameworks. That is understandable and not a fault in itself. Yet, apart from all the good they achieve, the danger of frameworks, like that of any other powerful tool, is that they unintentionally spur a black-box attitude towards the method being applied. It is fair to say that some researchers play around with building blocks offered by frameworks and rely on them, having a superficial understanding of the internal mechanics.
This paper is intended to constitute a concise yet thorough tutorial on DL’s mathematics and algorithms, allowing one to successfully implement a deep neural network from scratch. By “from scratch”, we mean with access to a programming language, and with numerical libraries at one’s disposal, but without frameworks or components that hide DL computations underneath. To achieve this goal, the following five topics need to be well understood:
  • automatic differentiation—i.e., a backpropagation algorithm with suitable inductive steps for different layer types and the activation functions;
  • initialization of weights—to avoid exploding or vanishing signals/gradients and to guarantee convergence;
  • learning algorithms (optimizers)—to guarantee suitably fast convergence;
  • regularization—to avoid overfitting;
  • organization of computations—to plan when and how to use CPU or GPU computations.
We discuss those topics in the paper and concentrate on providing the reader with selected state-of-the-art methods to suitably guide an implementation.

1.1. Motivation and Contributions

First of all, let us clearly emphasize that this manuscript is a tutorial (invited and submitted as the tutorial article type to the special issue “Advanced Digital Signal Processing and Its Applications” of the Applied Sciences). It is not a regular research paper. Therefore, the primary goal of this tutorial is educational, and its value should be assessed in terms of educational contributions to build a better understanding of the flows of signals and gradients in deep neural networks.
Let us refer here to two remarkable examples of tutorials well known in the machine learning community. The first one is the tutorial on hidden Markov models by Rabiner [1] from 1989. The second is the tutorial on boosting (the ensemble learning technique) by Friedman, Hastie, and Tibshirani [2] from the year 2000. Both of these works have been, and continue to be up to this date, great sources of knowledge and tutoring material for researchers in those fields. Our motivation and ambition is that this tutorial on deep neural networks can also become well recognized in the machine learning community.
We regard the following elements as the key contributions of this tutorial:
  • The proposition of R and S operators for tensors—reshape and stack, respectively—that facilitate algebraic notation for computations (forward and backward) involved in convolutional, pooling, and flattening layers;
  • A Python project named hmdl (“home-made deep learning”)—a practical implementation of the ideas presented in this tutorial;
  • A careful analysis of a small example—tracking flows of signals and gradients through a four-layer convolutional network (Section 3);
  • A concise explanation of DL computations (the aforementioned topics)—automatic differentiation, initialization of weights, learning algorithms, and regularization;
  • Clear and consistent notation across all mathematical contexts involved.
Let us briefly elaborate here on the second element—the hmdl project. It constitutes the postulated “from scratch” implementation that can be treated as an example and a reference. It provides the common layer types and functionalities, and it does that in approximately 2 k lines of code. Therefore, the task of developing a “home-made” DL framework is not as difficult as it might seem. In sum, we think the hmdl project helps the reader understand how DL’s mathematics and algorithms become translated into the realm of programming and computations.
Apart from tutoring contributions, this manuscript is additionally intended to constitute a short survey of crucial scientific contributions in the development of DL. We focus especially on the ones that took place during the last thirteen years (since the “DL big bang”) and have paved the way for DL’s popularity and frameworks. We present those contributions on the canvas of the five aforementioned topics and with the main goal in mind—guiding the reader towards implementation.

1.2. Organization of the Remaining Content

The rest of this paper is organized as follows. Section 2 presents the general background, that is, the history of neural networks and deep learning, and the core computational principles involved in the backpropagation method. Additionally, in that section, we introduce the notational conventions applied throughout the paper. With automatic differentiation in mind, in Section 3, we analyze in detail a small case example of a network consisting of four layers (two of them convolutional). We study the flows of signals (forward pass) and the flow of gradients (backward pass) in this network, and we explicitly derive all necessary formulas. Building upon the small case example, Section 4 provides general formulas for forward and backward computations in the most common layer types: convolutional, max pooling, average pooling, flattening, dropout, and dense. Particular attention is also paid to the softmax activation function and the differentiation involved. Section 5 is devoted to the topic of the random initialization of weights. Recursive equations for the variances of signals/gradients, derived in that section, help explain the problems of vanishing or exploding gradients. Section 6 discusses optimizers, i.e., the state-of-the-art learning algorithms applied in modern neural networks. In particular, the section explains exponential moving averages (EMAs) and the role they play in contemporary learning algorithms. In Section 7, we turn our attention to regularization techniques applied to DL in order to temper overfitting. Section 8 pertains to the aforementioned Python project, entitled hmdl. We present its design, structure, and key computational elements (with source code examples). The section ends with multiple deep learning experiments comparing hmdl with Keras on three well-known data sets, using various network architectures, two hardware environments, and several approaches to compute convolutions. We instruct the reader as to how the experiments and results can be reproduced. Section 9 concludes the paper with final thoughts and remarks.

2. Background

2.1. Milestones in the Not-So-Long History of Deep Learning

The history of neural networks and (then) deep learning is peculiar. It includes excitements, stagnation periods, and breakthroughs. Figure 1 presents the selected milestones in that history along the timeline. We categorized most of them according to the five key topics listed in the previous section, hence the following labels: “backpropagation”, “initialization”, “learning”, “regularization”, and “computations”. Obviously, these threads were interwoven along the way, but certain periods of specific interests can be observed. An additional category, “structure”, is introduced to label various structure-related ideas—early models of a neuron, novel layer types or activations, contemporary deep architectures, etc.
As is commonly known, it all began in the 1940s and 1950s with the first mathematical models of biological neurons proposed in the works of [3] and then [4]. Despite some initial interest, a bizarre, 20-year-long stagnation period followed, roughly from 1960 to 1980. This was, to some extent, caused by a misunderstanding involving the so-called XOR problem and the inability of the simple perceptron to cope with it, which made some people draw false general conclusions. An interesting “lonely” event in that period (apart from minor exceptions) is a master’s thesis by [5] on automatic differentiation. The work contained a suitable inductive algorithm but did not pertain to neural networks explicitly. It was not until 1986 that Rumelhart, Hinton, and Williams [7] (re)discovered the algorithm for the realm of neural nets and coined the name backpropagation for it. Two other works from the 1980s deserve attention: firstly, the one on the Neocognitron by [6], who presented the early ideas of convolutional and pooling layers, and secondly, the one on zip code recognition by LeCun [8]. The latter turned into the first real-world application of a neural network, trained end to end with backpropagation and without pre-training. LeCun’s net consisted of over 4 k units with 2 k weights (some shared), and it was organized into four hidden layers, including convolution-like and average pooling ones. Thus, it may be regarded as the first deep learning application, even though the term did not exist at the time.
The second stagnation period, lasting for about 10 years, was mainly due to difficulties experienced in training the networks, even medium-sized ones. Convergence problems and the large computational power that had to be invested were the reasons for this frustration. Researchers started to notice the problem of vanishing gradients. As a result, in the 1990s and early 2000s, the focus of the machine learning community moved to other approaches (e.g., SVMs, boosting, HMMs, etc.) with hand-engineered features or moment-based descriptors. A notable exception was Larry Heck’s research group at SRI International, funded by the US government, which kept working on speech processing using neural networks. Their successful speaker recognition system [11], developed in 1998–2000, proved that automatically extracted raw features may outperform hand-crafted ones, in particular the Mel–Cepstral features popular in speech processing. Also around that time, the term “deep learning”, introduced by [10], started to appear in common use.
In 2006, some revival of interest in DL occurred due to the work on deep belief nets [13]. It indicated new ways of overcoming vanishing gradients by pre-training one layer at a time and then fine-training with backpropagation. Some ideas on pre-training are still valid today (transfer learning and stacked autoencoders).
The major breakthrough came around 2009, and it is frequently referred to as the “DL big bang”. It arose due to a combination of three reasons: GPU technology due to NVIDIA [15], large data sets [14,23], and convenient software that appeared shortly after. As Bengio, LeCun, and Hinton (the 2019 Turing award winners) describe it now: “(…) the emergence of GPUs and the availability of large datasets were key enablers of deep learning and they were greatly enhanced by the development of open source, flexible software platforms with automatic differentiation such as Theano, Torch, Caffe, TensorFlow, and PyTorch (…)” [36].
Looking at the timeline (Figure 1), one can see that after 2009, key events start to become dense. In 2010, [17] published their groundbreaking work on vanishing or exploding signals/gradients in deep structures. They derived a suitable recurrence equation to analyze variance in signals and provided a method for random weight initialization that keeps variance approximately constant—a state-of-the-art method up to this day. The reader should be strongly reminded that the carefree initialization of weights (making the signals vanish or explode) was one of the main reasons for DL stagnation in the 1990s.
Also in 2010, [18] demonstrated the usefulness of activation functions called rectified linear units (ReLUs). As a form of regularization, ReLUs contribute to higher accuracy and faster convergence. Other important structure- and regularization-related contributions followed in 2012: max pooling [20] and dropout due to Ciresan, Hinton, et al. In parallel to those, the research on faster convergence produced new algorithms with adaptive learning rates: AdaGrad [19] and RMSProp [20] in 2011 and 2012, respectively.
A combination of novel ideas from that time allowed Alex Krizhevsky [23] to construct a structure with 8 layers, named AlexNet, that drew the attention of the whole computer vision community. AlexNet won the ImageNet 2012 competition by a large margin [14,36], being able to accurately recognize objects from one thousand classes in natural images. The key elements of that victory were as follows: ReLUs, convolutions with max pooling, dropout regularization, and the use of multiple GPUs. Great credit must also be given to the efforts of Fei-Fei’s group for collecting a training set of over 1 million labeled images [14]. It is interesting to note that many elements applied in AlexNet—inherent components of DL still nowadays, such as: “weight sharing” [37], convolutions and pooling [6], and ReLU-like activation [38]—had already appeared before 1990 and thus significantly predated the contemporary era. Within barely three years since AlexNet, new impressive deeper structures came to life: VGGNet [26], GoogLeNet [27], and ResNet [30].
Around the same time, studies on efficient computations brought about novelties. Initial works on fast convolutions based on FFT [24] or Winograd transformations [28], and their GPU/CUDA implementations opened an important research topic, which is still lively today [39]. Undoubtedly, convolutional layers play a crucial role in contemporary DL (hence the commonness of the names CNNs or DCNNs, the latter standing for deep convolutional neural networks). Simultaneously, they are the most time-consuming layers, with complexity scaling roughly as a product of the number of input pixels, channels, kernels (or feature maps), and the size of a filter and batch. As a side remark, it is worth pointing out that the name “convolution”, strongly rooted in DL’s nomenclature, is not strictly correct. In forward computations, the operation commonly called a “2D convolution” is, in fact, a 2D correlation in the mathematical sense. It is during backward computations that one deals with convolution due to weights being flipped up–down and left–right.
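To make the nomenclature remark concrete, the following small check (a sketch assuming NumPy and SciPy are available; the arrays and sizes are arbitrary) shows that what DL frameworks call a “2D convolution” coincides with SciPy’s 2D correlation, and that a true convolution is recovered by flipping the kernel up–down and left–right:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

# A "2D convolution" in DL frameworks is, mathematically, a 2D correlation.
# Correlating with a kernel equals convolving with the same kernel flipped
# up-down and left-right (and vice versa).
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

corr = correlate2d(image, kernel, mode="same")
conv_flipped = convolve2d(image, kernel[::-1, ::-1], mode="same")

print(np.allclose(corr, conv_flipped))  # True
```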
Regarding studies on optimizers, in 2014, Kingma and Ba published their great proposition of Adam [25]—a learning algorithm that applies first- and second-order gradient moments estimated via exponential moving averages. Adam’s effectiveness and simplicity quickly made it a highly popular state-of-the-art optimizer, recommended as the default one by many sources. It is fair to remark that two later propositions, marked on our timeline in 2016 and 2018, called Nadam [31] and AMSGrad [33], respectively, represent only minor modifications of the main strong idea behind Adam. On the other hand, also worth attention is a recent, yet already widely recognized, proposition of EDEAdam from 2022 [35]. It integrates the main idea of Adam with an ensemble of differential evolution algorithms and helps further prevent getting stuck in local optima.
The topic of weight initialization resurfaced in 2015. [29] demonstrated the proper way to generate random weights when a layer is equipped with the ReLU activation. It turned out that Glorot’s method from five years before was not complete regarding that particular case.
Last, but certainly not least, we end the timeline from Figure 1 with some entries that represent recent advances in natural language processing. One should emphasize at this point novelties like the following: self-attention [32], recurrent independent mechanisms [34], and transformer architectures [32,40]. Owing to the above advances and a proper combination of deep and reinforcement learning, the researchers from the OpenAI company were able to develop a family of large language models (called GPT-3) and release, in November 2022, a truly astonishing chatbot named ChatGPT [41].
The research on DL is clearly very lively nowadays. In the most recent few years (2021–2024), several fresh and interesting research topics have emerged, such as capsule networks [42], self-attention with/without routing [43], hierarchical aware CNNs [44], vision transformers [45], and federated or adaptive ensemble learning [46] for deep neural networks. Those topics are outside the scope of this manuscript, as we have decided to focus on the contemporary but well-established ideas of DL and on state-of-the-art algorithms.
To conclude this section, we would like to point out a certain contrast. On the one hand, important mathematical results are known, e.g., due to [47], proving that one hidden layer with sufficiently many non-linear neurons is sufficient for universal approximation. On the other hand, the practice shows that deep nets generalize better than shallow ones (with the same number of parameters). The question “why depth?” was recently considered by DL authorities—Bengio, LeCun, and Hinton [36]. They believe depth helps build internal representations of concepts. It is known that human intelligence emerges from highly parallel networks of fairly simple neurons that learn by adjusting the strengths of their connections. Apparently, this simple mechanism allows humans to perform complicated tasks (recognize objects, understand language, etc.). Deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer [48,49]. That compositionality and successive levels of abstraction work very well for perception-related tasks, and there is strong evidence that this mechanism is used in biological perceptual systems [36].

2.2. Backpropagation

In this section, we turn our attention to automatic differentiation and the backpropagation algorithm. Obviously, various structural novelties that appeared throughout the history—weight sharing, convolutions, pooling, ReLUs, softmax activation, dropout, etc.—and possibly others still to come in the future, need to be handled properly during the computations of derivatives, which allow for tuning network parameters. Therefore, apart from being correct and efficient, the algorithm for that purpose needs to be (and is) universal. With that universality in mind, here, we discuss the general principles that underlie backpropagation.
The usual goal is to train (using data examples) a neural network that can be seen as a large graph of connections with tunable parameters called weights. Commonly, the connections are organized as a layered structure; see, e.g., Figure 2. The training is carried out through some variant of the gradient descent in order to minimize a certain error function for the whole training data set with respect to weights. That error is, in turn, defined by the so-called loss function at the level of a single training pair, $(x, y^*)$—categorical cross-entropy and squared loss being the common choices for loss functions in classification and regression tasks, respectively.
For the large majority of networks (graphs of connections), automatic differentiation can be constructed from two rules well known from calculus: (i) the chain rule—stating that, for a composition of two differentiable functions, the derivative of the composition is equal to the derivative of the outer function (calculated with respect to the inner one) multiplied by the derivative of the inner function; (ii) the sum rule—stating that a derivative of a sum of functions is the sum of their derivatives. In practice, the rules imply that intermediate derivatives leading towards a particular weight along a single path of connections must be multiplied, whereas the ones along alternate paths must be summed.
Although a derivative for any individual weight could be computed separately via the above approach, doing so would be inefficient because of unnecessary recomputations of many intermediate values. Therefore, one can do better by applying an induction. For a fixed training example (or a batch of examples), the induction proceeds from the back of the network towards the front, layer by layer, and computes derivatives specific to the given layer—components of the final gradient—based on values already computed. To achieve this, one introduces an auxiliary quantity, traditionally named δ (we skip its subscripts for a moment), associated with every neuron and computed inductively. That quantity is interpreted as an “error”, but more accurately, it should be seen as a partial contribution to the final loss derivative that takes place from the point of the given neuron onwards in the network. We must clarify that δ values themselves are not equal to loss derivatives. To turn them into derivatives, an additional multiplication by inputs of the given layer is necessary.
Let $l$ denote the loss function, and suppose that we are about to perform backward computations for the $i$-th layer, having already performed the computations for layers $i+1, i+2, \dots$. Suppose also that the layer is equipped with a collection (a tensor) of weights, denoted here as $W_i$. The algorithmic goal is to be able to compute $\partial l/\partial w$ for any $w \in W_i$, based on error quantities $\delta_i$. The key argument to realize is that $w$ can affect the loss of the whole network only through its direct effect on the very next layer, $i+1$. Therefore, to compute $\delta_i$—the collection of error quantities for the current layer—one needs to know only $\delta_{i+1}$ and $W_{i+1}$, i.e., the direct subsequent errors and the weights of outgoing connections through which these errors are reached, respectively. Error quantities for layers $i+2, i+3, \dots$ are not needed. In fact, their influence is already present inside $\delta_{i+1}$.
Below, we present two algorithmic rules that govern backpropagation. We purposely wrote them down in a mnemonic verbal form.
Rule 1 (inductive step):
“current δ : = activation derivative times weighted sum of subsequent δs taken over outgoing connections.”
Rule 2 (gradient):
“∇ := δs times inputs.”
The first rule tells us how a δ quantity should be computed based on subsequent, already computed, δ s (or, equivalently, how the subsequent δ s should be propagated back). This rule has to be executed always, regardless of the layer type. The second rule tells us how to compute the gradient, denoted as ∇ (again, subscripts are skipped for now), for weights present in the current layer (if any). The gradient can be applied later in formulas correcting the weights, in accordance with a particular variant of gradient descent (e.g., simple SGD, Adam, AMSGrad, etc.). The type of multiplication hidden under the word ‘times’ in the rules (e.g., algebraic or element-wise multiplication) is implied by the layer type and, in some cases, also by the activation. In particular, please note that, for convolutional layers, the weights of each filter are shared by many inputs—a fact that must be taken into account when applying both Rule 1 and Rule 2 for such layers.
In many further places in the text, we will refer to the above two rules: in Section 3, where we provide a “manual” derivation of backpropagation for a specific small network, and in Section 4, where we present general formulas.

2.3. Notation

We choose to index elements in vectors, matrices, and tensors starting from 0 (not from 1), as in C or Python languages. This choice is slightly more convenient than natural indexing for some operations on tensors like reshaping or taking subranges.
Additionally, for convolutional filters, we apply a convention that allows for negative indexes of weights. This convention treats the point (pixel) at which the filter is currently anchored as a temporary center, i.e., the $(0,0)$ point. More specifically, assuming an odd-sized $F \times F$ filter ($1\times 1$, $3\times 3$, $5\times 5$, etc.), one can write its indexed weights as $w_{a,b}$, where both $a, b$ take values in the set $\{-\lfloor F/2\rfloor, \dots, -1, 0, 1, \dots, \lfloor F/2\rfloor\}$. Let us consider an example; suppose that X denotes an input image, and Y denotes the resulting convolution image (for simplicity, both with 1 channel). Additionally, let $F = 7$. Then, elements of Y are computed as follows:
$$y_{j,k} = \sum_{-3 \le a,b \le 3} w_{a,b} \cdot x_{j+a,\,k+b}.$$
Obviously, $x_{j+a,\,k+b}$ must be well defined for regions outside the proper borders of X (padding with zeros is a standard practice that we address later on). We remark that, in general, not pairs but quadruples shall be used as indexes of filter weights to suitably handle input and output channels.
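For illustration, the scalar formula above can be coded directly as nested loops. The sketch below is a minimal, hypothetical NumPy implementation (our own function and variable names, not the hmdl code) for a single-channel image and an odd-sized filter, with the out-of-range check playing the role of zero padding:

```python
import numpy as np

def conv2d_same_single_channel(X, W):
    """Scalar-form 'same' convolution (mathematically: a correlation) of a
    single-channel image X with an odd-sized filter W; out-of-range pixels
    are treated as zeros, i.e., the indicator [0 <= j+a < H][0 <= k+b < V]."""
    H, V = X.shape
    F = W.shape[0]          # odd filter size, e.g., 7
    half = F // 2
    Y = np.zeros((H, V))
    for j in range(H):
        for k in range(V):
            for a in range(-half, half + 1):
                for b in range(-half, half + 1):
                    if 0 <= j + a < H and 0 <= k + b < V:
                        Y[j, k] += W[a + half, b + half] * X[j + a, k + b]
    return Y
```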
In many places where indicator functions are needed, we apply the so-called Iverson convention. It represents an indicator with a square bracket, [ · ] , yielding 1 if its argument (any logical statement) is true, and 0 otherwise.
As regards multiplications, we use the symbol “·” for standard algebraic products of numbers or matrices, whereas we use “∘” for element-wise products of tensors with the same size (also known as the Hadamard product). In some places, even if the multiplication symbol is mathematically redundant, we still write it explicitly in order to emphasize the particular terms involved in the backpropagation chain. Identity matrices are denoted as $I$ with a suitable subscript to define the size, whereas bold symbols, $\mathbf{1}$, stand for tensors consisting of ones, e.g., $\mathbf{1}_{5\times 1} = (1, 1, 1, 1, 1)^T$.
We apply the “:=” sign to indicate assignments, algorithmic or computational, and we use it distinctly from purely mathematical equality signs, “=”.
In places related to tensor-form formulas, it shall be convenient for us to introduce two operators: R—performing a reshape of a tensor, and S—generating a stack of tensors as a function of certain iterators. For example, given a vector, $x = (x_0, \dots, x_{23})$, the expression $R_{3\times 2\times 4}\,x$ yields a three-dimensional tensor of shape $3\times 2\times 4$ with the following successive slices:
$$\begin{pmatrix} x_0 & x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 & x_7 \end{pmatrix}, \quad \begin{pmatrix} x_8 & x_9 & x_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & x_{15} \end{pmatrix}, \quad \begin{pmatrix} x_{16} & x_{17} & x_{18} & x_{19} \\ x_{20} & x_{21} & x_{22} & x_{23} \end{pmatrix},$$
where the last dimension changes the fastest. Obviously, the superposition $R_{24} R_{3\times 2\times 4}\,x$ yields back the original x. Shaping x explicitly to become a row or a column vector is also possible by writing $R_{1\times 24}\,x$ or $R_{24\times 1}\,x$, respectively. As regards the second operator—S, it can be understood as a parameterized generator of larger tensors produced from smaller ones. It works by iterating over a certain variable (or variables) and stacks the successive operands together, thereby generating a tensor extended to an extra dimension with respect to the operands. When the argument under S is a vector, then the resulting tensor will be a matrix consisting of stacked vectors. When the argument is a matrix, the result will be a three-dimensional tensor consisting of stacked matrices, and so forth. The expression below provides a simple example:
$$\mathop{S}_{10 \le a < 13} a \cdot \mathbf{1}_{1\times 5} = \begin{pmatrix} 10 & 10 & 10 & 10 & 10 \\ 11 & 11 & 11 & 11 & 11 \\ 12 & 12 & 12 & 12 & 12 \end{pmatrix}.$$
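In NumPy terms, R corresponds to reshape with the default row-major order (last index changing fastest) and S to stacking operands generated over an iterator; a small sketch reproducing the two examples above (variable names are ours):

```python
import numpy as np

# R: reshape with the last dimension changing the fastest (NumPy's row-major default).
x = np.arange(24)                        # x_0, ..., x_23
t = x.reshape(3, 2, 4)                   # R_{3x2x4} x
assert np.array_equal(t.reshape(24), x)  # R_24 R_{3x2x4} x gives back x

# S: stack operands generated over an iterator into an extra (leading) dimension.
s = np.stack([a * np.ones((1, 5)) for a in range(10, 13)])  # S_{10<=a<13} a * 1_{1x5}
print(s.squeeze())                       # rows of 10s, 11s, 12s
```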
As regards convolutions, there exist three modes according to which they can be computed, commonly named “same”, “full”, and “valid” (see, e.g., the documentation of scipy.signal.convolve2d). For simplicity, we choose to apply the mode “same” throughout the paper. This means the resulting convolution image is of the same size as the input, and it is achieved by aligning the top-left input pixel in every channel with the central weight of a filter. It also means that suitable image padding with zeros must be introduced. We use superscript P to denote a zero-padded tensor; e.g., for a tensor, X, of shape H × V × C , the elements of X P ought to be understood (we use V for width since W is occupied by weights) as
$$x^P_{j,k,q} = \begin{cases} x_{j,k,q}, & \text{if } 0 \le j < H \text{ and } 0 \le k < V; \\ 0, & \text{otherwise}, \end{cases}$$
regardless of the channel index, q.
Frequently, formulas related to convolutions and padding go together with extracting subtensors, i.e., taking subranges from original tensors. For that purpose, we shall use the “:” symbol, working as the range operator in a way known from programming languages, where the index written before the colon is treated inclusively and the one afterwards exclusively. E.g., $X_{j_1:j_2,\,k_1:k_2}$ represents a subtensor consisting of the following elements: $x_{j,k,q}$ for all $j_1 \le j < j_2$, $k_1 \le k < k_2$, and all q (if not restricted; otherwise, subtensors are taken along all the channels).
Furthermore, $\phi$ shall stand for activation functions, and we are going to use that same overloaded symbol for all layers. The reader will be able to understand it correctly from the context, based on the name of $\phi(\cdot)$’s argument. When no activation function for a layer is declared, one should simply treat $\phi$ as an identity function and its derivative as a 1.
Finally, we explain that, in actual DL implementations (also in ours based on NumPy and Numba), all tensors are typically stored with an extra dimension pertaining to a batch of images (data examples) currently processed, rather than a single image. For example, a layer producing 16 convolutional images of size 32 × 32 each, for a batch of 100 input images, can be stored as a 4D tensor of the following shape: 100 × 32 × 32 × 16 . In spite of this, in the mathematical and algorithmic sections, we choose to simplify the considerations (and notation) by not taking that extra dimension into account. In other words, we look at computations as if they were carried for a single data example.

3. Flows of Signals and Gradients—Direct Derivation for a Small Network Example

A good general understanding can be developed by studying small examples first. In this section, we provide a direct “manual” derivation of relevant formulas, working with an illustrative small network consisting of four layers, two of them convolutional. We believe this exercise helps abstract out the general formulas later on.
The structure to consider is shown in Figure 3. Suppose it is meant to perform a classification task with 10 classes, based on 64 × 64 input images with 3 channels. The letters $X, \alpha, \beta, \gamma, Y$ shall be used as names (to refer to layers) and also to represent the tensors storing actual values. In this section, we purposely use explicit numbers to define the sizes of tensors, filters, etc., rather than symbolic constants. Those explicit sizes shall thus be visible in the formulas, allowing the reader to clearly see their impact. Also purposely, we choose the sigmoid activation function for the output layer. We are aware that, e.g., the softmax activation would be a more natural choice for the classification task, but we prefer to postpone the discussion on softmax until Section 4.7.
Below, we specify the names for collections of weights in tunable layers of the network. Convolutional weights are kept as 4D tensors, weights of the dense layer as a matrix, and weights related to free terms (also known as biases) as 1D vectors.
$$W_\alpha = \big(w_{\alpha;a,b,p,q}\big)_{7\times 7\times 3\times 16}, \qquad W_{0,\alpha} = \big(w_{0,\alpha;q}\big)_{16},$$
$$W_\beta = \big(w_{\beta;c,d,q,r}\big)_{5\times 5\times 16\times 32}, \qquad W_{0,\beta} = \big(w_{0,\beta;r}\big)_{32},$$
$$W_Y = \big(w_{Y;e,f}\big)_{10\times 131072}, \qquad W_{0,Y} = \big(w_{0,Y;e}\big)_{10}.$$
For clarity, we picked distinct letters for indexes pertaining to different pieces of information. For example, q is the index of an output channel (kernel) produced via layer α and simultaneously the index of an input channel from the perspective of layer β . Those letters are written in subscripts after a layer name and a semicolon. Additionally, we use the prefix 0 in front of a layer name to denote the collection of its biases.
As regards activation functions in our example, we have $\phi(\beta_{j,k,r}) = [\beta_{j,k,r} > 0]\cdot\beta_{j,k,r}$ since, for layer $\beta$, the ReLU activation is chosen; then, $\phi(\alpha_{j,k,q}) = \alpha_{j,k,q}$ and $\partial\phi(\alpha_{j,k,q})/\partial\alpha_{j,k,q} = 1$, since no activation is declared for layer $\alpha$; and finally, $\phi(y_e) = 1/\big(1+\exp(-y_e)\big)$—the sigmoid function for layer Y.
In the following subsections, we discuss forward and backward computations for the described network. We do so layer by layer, writing first the more natural scalar-form formulas and then providing their tensor-form counterparts. Please note that at the implementation level, tensor formulas are typically well suited for CPU computations based on efficient libraries for algebra (e.g., BLAS and LAPACK underlying Python’s numpy), whereas scalar formulas are typically better suited to (or more easily translatable to) GPU computations.

3.1. Forward Pass

The first layer, named α , is a 2D convolutional layer equipped with 7 × 7 filters using 3 input channels and mapping to 16 output channels (also known as kernels or feature maps). These outputs should be computed as follows:
$$\alpha_{j,k,q} := \sum_{-3 \le a,b \le 3}\;\sum_{0 \le p < 3} w_{\alpha;a,b,p,q}\cdot x_{j+a,\,k+b,\,p}\cdot[0 \le j+a < 64]\cdot[0 \le k+b < 64] \;+\; w_{0,\alpha;q},$$
for all $0 \le j,k < 64$ and $0 \le q < 16$. The tensor-form counterpart of the above scalar formula is given below. It applies the stack and reshape operations, together with image padding.
$$\alpha := \mathop{S}_{0 \le q < 16} R_{64\times 64}\bigg( R_{1\times 7^2\cdot 3}\, W_{\alpha;\cdot,\cdot,\cdot,q} \cdot \underbrace{\mathop{S}_{0 \le j,k < 64} R_{7^2\cdot 3\times 1}\, X^P_{j-3:j+4,\;k-3:k+4}}_{7^2\cdot 3\;\times\;64^2} \bigg) \;+\; \mathop{S}_{0 \le q < 16} w_{0,\alpha;q}\cdot\mathbf{1}_{64\times 64}.$$
At the inner-most level of Formula (9), each image subregion of shape $7\times 7\times 3$ (implied by the filter size and input channels) becomes reshaped to a column vector and stacked into a temporary matrix of the final shape $7^2\cdot 3\times 64^2$. This matrix is then multiplied, from the left, by a subset of filter weights (for the q-th output channel) reshaped to a row vector of shape $1\times 7^2\cdot 3$. Then, the matrices resulting from such multiplications are stacked over all q indexes, $0 \le q < 16$, to produce the final output tensor. The addition of biases takes place in the second summand of (9). Such an approach is referred to as GEMM-based convolution [28,39] (general matrix-to-matrix multiplication, a part of BLAS). No activation is declared for layer $\alpha$; hence, $\phi(\alpha) = \alpha$.
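The sketch below illustrates the GEMM-based idea in plain NumPy for a generic “same” convolution: patches are gathered into columns (the so-called im2col step), multiplied by the filter weights reshaped to rows, and the result is reshaped back and shifted by the biases. It is a simplified illustration with our own function and variable names, not the hmdl code.

```python
import numpy as np

def conv2d_same_gemm(X, W, w0):
    """GEMM-based 'same' convolution.
    X: (H, V, C) input tensor, W: (F, F, C, K) filters, w0: (K,) biases."""
    H, V, C = X.shape
    F, _, _, K = W.shape
    half = F // 2
    Xp = np.pad(X, ((half, half), (half, half), (0, 0)))    # zero padding
    # im2col: each (F, F, C) patch becomes one column of a (F*F*C, H*V) matrix.
    cols = np.empty((F * F * C, H * V))
    for j in range(H):
        for k in range(V):
            patch = Xp[j:j + F, k:k + F, :]
            cols[:, j * V + k] = patch.reshape(-1)
    Wrow = W.reshape(F * F * C, K).T                         # (K, F*F*C)
    out = Wrow @ cols                                        # (K, H*V), the GEMM call
    return out.T.reshape(H, V, K) + w0                       # (H, V, K) plus biases
```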
Now, we move on to layer β . It is equipped with a 5 × 5 filter, and it computes 32 output channels as follows:
$$\beta_{j,k,r} := \sum_{-2 \le c,d \le 2}\;\sum_{0 \le q < 16} w_{\beta;c,d,q,r}\cdot\phi(\alpha_{j+c,\,k+d,\,q})\cdot[0 \le j+c < 64]\cdot[0 \le k+d < 64] \;+\; w_{0,\beta;r},$$
for $0 \le j,k < 64$ and $0 \le r < 32$; or via a tensor counterpart:
$$\beta := \mathop{S}_{0 \le r < 32} R_{64\times 64}\bigg( R_{1\times 5^2\cdot 16}\, W_{\beta;\cdot,\cdot,\cdot,r} \cdot \underbrace{\mathop{S}_{0 \le j,k < 64} R_{5^2\cdot 16\times 1}\, \phi\big(\alpha^P_{j-2:j+3,\;k-2:k+3}\big)}_{5^2\cdot 16\;\times\;64^2} \bigg) \;+\; \mathop{S}_{0 \le r < 32} w_{0,\beta;r}\cdot\mathbf{1}_{64\times 64}.$$
ReLU activation for this layer can be computed as follows:
$$\phi(\beta) := [\beta > 0] \circ \beta,$$
where $[\beta > 0]$ represents a $64\times 64\times 32$ tensor consisting of ones and zeros, depending on whether the condition $\beta > 0$ is satisfied or not element-wise. We remark that formula (12) represents the efficient way to implement the ReLU activation in Python. Even when the $[\beta > 0]$ tensor is a numpy array of bools, it is easy to check that, despite the multiplications involved, this version is faster than versions based on an if–else statement or conditional array indexing.
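For reference, the three Python variants mentioned above can be written as follows (a small illustrative snippet with our own names; timing them is left to the reader):

```python
import numpy as np

beta = np.random.standard_normal((64, 64, 32))

# Formula (12): multiply by the boolean mask [beta > 0].
phi_beta = (beta > 0) * beta

# Alternatives that give the same result:
phi_where = np.where(beta > 0, beta, 0.0)        # vectorized if-else
phi_index = beta.copy(); phi_index[beta <= 0] = 0.0  # conditional array indexing

assert np.allclose(phi_beta, phi_where) and np.allclose(phi_beta, phi_index)
```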
The subsequent layer, γ , flattens the tensor ϕ ( β ) , yielding
$$\gamma := R_{131072\times 1}\,\phi(\beta),$$
and further $\phi(\gamma) := \gamma$, since we have not imposed any activation on the flattening layer (we remark that equipping a flattening layer with an activation is theoretically possible, though seldom done in practice).
The flattened result is a column vector; hence, the subsequent dense layer, Y, can conveniently multiply such an input by its matrix of weights (standard multiplication):
$$Y := W_Y\cdot\phi(\gamma) + W_{0,Y}.$$
And lastly, through the application of the sigmoid activation function, the final network output becomes
$$\phi(Y) := 1\big/\big(1 + \exp(-Y)\big).$$

3.2. Backward Pass

Suppose that categorical cross-entropy is selected as the loss function to fit our network and that $Y^*$ stands for a one-hot encoded vector representing the output wanted for the given training example. The loss between Y and $Y^*$ is
$$l(Y, Y^*) = -\sum_{0 \le e < 10} y^*_e \log \phi(y_e).$$
In accordance with the chain rule, loss derivatives with respect to the $w_{Y;e,f}$ weights can be calculated as the following product:
$$\frac{\partial l}{\partial w_{Y;e,f}} = \frac{\partial l}{\partial \phi(y_e)}\cdot\frac{\partial \phi(y_e)}{\partial y_e}\cdot\frac{\partial y_e}{\partial w_{Y;e,f}} = \frac{-y^*_e}{\phi(y_e)}\cdot\phi(y_e)\big(1 - \phi(y_e)\big)\cdot\phi(\gamma_f),$$
where the second equality applies the actual derivatives.
Now, we start deriving the backpropagation algorithm for our network by writing down partial derivatives of the loss as the following column vector:
$$\delta_l := \bigg( \frac{\partial l}{\partial \phi(y_0)},\; \frac{\partial l}{\partial \phi(y_1)},\; \dots,\; \frac{\partial l}{\partial \phi(y_9)} \bigg)^T.$$
Even though the loss function is not an actual layer in the network, we purposely denote the above vector as $\delta_l$ and treat it as our initial error quantity from which the induction ought to be started. Having $\delta_l$, one can define errors $\delta_Y$ for the proper last layer using Rule 1 (see Section 2.2):
$$\delta_Y := \frac{\partial\phi(Y)}{\partial Y}\circ\big(I_{10}\cdot\delta_l\big) = \bigg( \frac{\partial \phi(y_0)}{\partial y_0}\cdot\frac{\partial l}{\partial \phi(y_0)},\; \frac{\partial \phi(y_1)}{\partial y_1}\cdot\frac{\partial l}{\partial \phi(y_1)},\; \dots,\; \frac{\partial \phi(y_9)}{\partial y_9}\cdot\frac{\partial l}{\partial \phi(y_9)} \bigg)^T,$$
where each entry, for our example, is $\delta_{Y;e} = \phi(y_e)\big(1 - \phi(y_e)\big)\cdot\big(-y^*_e/\phi(y_e)\big)$, $0 \le e < 10$. We explain that the identity matrix $I_{10}$, written purposely in (19), can be seen as a matrix of weights for outgoing connections. Every neuron in the output layer, Y, can be thought of as having exactly one outgoing connection, with weight 1, to a corresponding virtual loss neuron. Obviously, the presence of an identity matrix does not change anything mathematically, but we want to keep the writing comprehensive and consistent with the inductive Rule 1.
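As a quick illustration of (18) and (19) for the sigmoid output combined with categorical cross-entropy, one could compute $\delta_l$ and $\delta_Y$ in NumPy as below (a hypothetical sketch with our own variable names):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

Y = np.random.standard_normal((10, 1))          # weighted sums of the output layer
Y_star = np.zeros((10, 1)); Y_star[3] = 1.0     # one-hot target

phi_Y = sigmoid(Y)
delta_l = -Y_star / phi_Y                       # (18): derivative of the loss w.r.t. phi(y_e)
delta_Y = phi_Y * (1.0 - phi_Y) * delta_l       # (19): sigmoid derivative times delta_l
```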
Having computed δ Y , we can compute the gradient for weights in the last layer with the use of Rule 2. It yields
$$\nabla_Y := \delta_Y\cdot\phi(\gamma)^T,$$
$$\nabla_{0,Y} := \delta_Y\circ\mathbf{1}_{10\times 1},$$
respectively for proper connections and for biases.
We remark that, with (20) and (21) computed, corrections of weights could potentially be performed. For example, using the simple SGD, one would have $W_Y := W_Y - \eta\nabla_Y$ and $W_{0,Y} := W_{0,Y} - \eta\nabla_{0,Y}$, where $\eta > 0$ stands for the learning rate. However, correction steps are typically withheld until the complete backward pass through all the layers is finished.
We now move to the flattening layer, $\gamma$. Again, in accordance with Rule 1, a particular $\delta_{\gamma;f}$ must take into account its own activation derivative and the subsequent δs reached via outgoing connections. In the scalar form, this yields
$$\delta_{\gamma;f} := \underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\sum_{0 \le e < 10} w_{Y;e,f}\cdot\delta_{Y;e},$$
for $0 \le f < 131072$, which, in the tensor form, is equivalent to
$$\delta_\gamma := \frac{\partial\phi(\gamma)}{\partial\gamma}\circ\big(W_Y^T\cdot\delta_Y\big) = \mathbf{1}_{131072}\circ\big(W_Y^T\cdot\delta_Y\big).$$
Again, the initial multiplication by a column vector of ones is presented for the sake of consistency with Rule 1. This vector represents derivatives of the activation function (none declared).
There are no tunable weights in the flattening layer; hence, the gradient $\nabla_\gamma$ need not be computed. Yet, we do need to suitably propagate the δ quantities from layer $\gamma$ back to layer $\beta$. This must be performed via a suitable reshape operation, additionally taking into account the fact that $\beta$ is equipped with the ReLU activation. Suppose, for a moment, that an index, f, of a certain neuron in layer $\gamma$ corresponds to an index, $(g, h, r)$, of a neuron in layer $\beta$. The error quantity for that neuron should be computed as follows:
$$\delta_{\beta;g,h,r} := \frac{\partial\phi(\beta_{g,h,r})}{\partial\beta_{g,h,r}}\cdot 1\cdot\delta_{\gamma;f},$$
which, again, can be seen through the prism of Rule 1 as a “sum” over one outgoing connection with weight 1 that reaches $\delta_{\gamma;f}$. This implies the following tensor formula:
$$\delta_\beta := \frac{\partial\phi(\beta)}{\partial\beta}\circ R_{64\times 64\times 32}\big(I_{131072}\cdot\delta_\gamma\big) = [\beta > 0]\circ R_{64\times 64\times 32}\,\delta_\gamma.$$
Now, we move to loss derivatives with respect to convolutional weights: $w_{\beta;c,d,q,r}$. For mathematical purposes, we introduce the following auxiliary notations related to the mapping between indexes of $\beta$ and $\gamma$ layers. Given a fixed index, r, of an output channel in layer $\beta$, think of the set of indexes corresponding to it within the flattened layer:
$$\{f\}(r) = \{f : 0 \le f < 131072 \text{ and } f \bmod 32 = r\}.$$
E.g., for $r = 3$ in our example, the above set becomes $\{f\}(3) = \{3, 35, 67, \dots, 131043\}$. Furthermore, given a particular index, f, selected from that set, let $\phi(\beta_{g(f),h(f),r})$ denote the activated neuron connected subsequently to $\gamma_f$. Our example implies that $g(f) = \lfloor f/64 \rfloor$ and $h(f) = f \bmod 64$.
Using the above notation, the derivatives we look for are as follows:
$$\frac{\partial l}{\partial w_{\beta;c,d,q,r}} = \sum_{0\le e<10}\underbrace{\frac{\partial l}{\partial\phi(y_e)}\cdot\frac{\partial\phi(y_e)}{\partial y_e}}_{\delta_{Y;e}}\cdot\sum_{\{f\}(r)}\frac{\partial y_e}{\partial\phi(\gamma_f)}\cdot\underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\underbrace{\frac{\partial\gamma_f}{\partial\phi(\beta_{g(f),h(f),r})}}_{1}\cdot\frac{\partial\phi(\beta_{g(f),h(f),r})}{\partial\beta_{g(f),h(f),r}}\cdot\frac{\partial\beta_{g(f),h(f),r}}{\partial w_{\beta;c,d,q,r}}$$
$$= \sum_{0\le e<10}\delta_{Y;e}\cdot\sum_{\{f\}(r)} w_{Y;e,f}\cdot[\beta_{g(f),h(f),r}>0]\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{\{f\}(r)}[\beta_{g(f),h(f),r}>0]\cdot\underbrace{\sum_{0\le e<10} w_{Y;e,f}\cdot\delta_{Y;e}}_{\delta_{\gamma;f}}\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{\{f\}(r)}\underbrace{[\beta_{g(f),h(f),r}>0]\cdot\delta_{\gamma;f}}_{\delta_{\beta;g(f),h(f),r}}\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{0\le g,h<64}\delta_{\beta;g,h,r}\cdot\phi(\alpha_{g+c,\,h+d,\,q}).$$
Below, we write down tensor formulas corresponding to (27). Note that, due to the filter range $-2 \le c,d \le 2$ in layer $\beta$, the indexes of pixels present in $\alpha^P$ can take values within $-2 \le j,k < 64+2$. Therefore, in compliance with Rule 2, the gradient for weights in layer $\beta$ can be written as follows:
$$\nabla_\beta := \mathop{S}_{0\le r<32} R_{5\times 5\times 16}\bigg( R_{1\times 64^2}\,\delta_{\beta;\cdot,\cdot,r}\cdot\underbrace{\mathop{S}_{\substack{-2\le c,d\le 2\\ 0\le q<16}} R_{64^2\times 1}\,\phi\big(\alpha^P_{c:c+64,\;d:d+64,\;q}\big)}_{64^2\;\times\;5^2\cdot 16}\bigg),$$
$$\nabla_{0,\beta} := \mathop{S}_{0\le r<32} R_{1\times 64^2}\,\delta_{\beta;\cdot,\cdot,r}\cdot\mathbf{1}_{64^2\times 1}.$$
We remind the reader that ϕ ( α P ) is, in fact, equal to α P since no activation function for layer α is declared.
We have now reached the left-most layer, $\alpha$, and we need to compute $\delta_\alpha$. To do so, we need to remember that filter weights are shared among many outgoing connections. In particular, given a fixed index of a weight, say $(c, d, q, r)$, one must realize that, for all $j, k$, the quantities $\delta_{\alpha;j,k,q}$ reach the quantities $\delta_{\beta;j-c,\,k-d,\,r}$ (note the minuses) via connections associated with that weight, i.e., $w_{\beta;c,d,q,r}$. Therefore, the proper way to apply Rule 1 now is as follows:
$$\delta_{\alpha;j,k,q} = \underbrace{\frac{\partial\phi(\alpha_{j,k,q})}{\partial\alpha_{j,k,q}}}_{1}\cdot\sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;c,d,q,r}\cdot\delta_{\beta;j-c,\,k-d,\,r}\cdot[0\le j-c<64]\cdot[0\le k-d<64].$$
Please note that, with the substitutions $c := -c$ and $d := -d$, one can equivalently write $\delta_{\alpha;j,k,q}$ down as follows:
$$\delta_{\alpha;j,k,q} = \sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;-c,-d,q,r}\cdot\delta_{\beta;j+c,\,k+d,\,r}\cdot[0\le j+c<64]\cdot[0\le k+d<64].$$
This lets us see the δs resulting from the backward pass strictly as a convolution, rather than a 2D correlation (a nomenclature remark made in the introduction). This means that one needs to use flipped weights, bottom–up and left–right, which we shall denote as $\bar{W}_\beta = \big(\bar{w}_{\beta;c,d,q,r}\big) = \big(w_{\beta;-c,-d,q,r}\big)$ for all valid $c, d, q, r$.
The tensor-form formula for δ α can be written down as
$$\delta_\alpha := \mathop{S}_{0\le q<16} R_{64\times 64}\bigg( R_{1\times 5^2\cdot 32}\,\bar{W}_{\beta;\cdot,\cdot,q,\cdot}\cdot\underbrace{\mathop{S}_{0\le j,k<64} R_{5^2\cdot 32\times 1}\,\delta^P_{\beta;\,j-2:j+3,\;k-2:k+3}}_{5^2\cdot 32\;\times\;64^2}\bigg).$$
In Figure 4, we graphically illustrate computations, both forward and backward, taking place at the junction of layers α and β .
Now, we can derive the loss derivatives with respect to the $w_{\alpha;a,b,p,q}$ weights. The long chain of involved multiplications and summations proceeds in the following manner:
$$\frac{\partial l}{\partial w_{\alpha;a,b,p,q}} = \sum_{0\le e<10}\underbrace{\frac{\partial l}{\partial\phi(y_e)}\cdot\frac{\partial\phi(y_e)}{\partial y_e}}_{\delta_{Y;e}}\cdot\sum_{0\le f<131072}\frac{\partial y_e}{\partial\phi(\gamma_f)}\cdot\underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\underbrace{\frac{\partial\gamma_f}{\partial\phi(\beta_{g(f),h(f),r(f)})}}_{1}\cdot\frac{\partial\phi(\beta_{g(f),h(f),r(f)})}{\partial\beta_{g(f),h(f),r(f)}}$$
$$\qquad\cdot\sum_{-2\le c,d\le 2}\frac{\partial\beta_{g(f),h(f),r(f)}}{\partial\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})}\cdot\underbrace{\frac{\partial\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})}{\partial\alpha_{g(f)+c,\,h(f)+d,\,q}}}_{1}\cdot[0\le g(f)+c<64]\cdot[0\le h(f)+d<64]\cdot\frac{\partial\alpha_{g(f)+c,\,h(f)+d,\,q}}{\partial w_{\alpha;a,b,p,q}}$$
$$= \sum_{0\le f<131072}\underbrace{[\beta_{g(f),h(f),r(f)}>0]\cdot\underbrace{\sum_{0\le e<10} w_{Y;e,f}\cdot\delta_{Y;e}}_{\delta_{\gamma;f}}}_{\delta_{\beta;g(f),h(f),r(f)}}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r(f)}\cdot[0\le g(f)+c<64]\cdot[0\le h(f)+d<64]\cdot x_{g(f)+c+a,\;h(f)+d+b,\;p}$$
$$= \sum_{0\le g,h<64}\;\sum_{0\le r<32}\delta_{\beta;g,h,r}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r}\cdot[0\le g+c<64]\cdot[0\le h+d<64]\cdot x_{g+c+a,\;h+d+b,\;p}.$$
By looking at the subscript of x in the final line, one can observe that the scope of particular image inputs (pixels) involved in the derivative $\partial l/\partial w_{\alpha;a,b,p,q}$ is, in a sense, broadened, as we have passed backwards through two consecutive convolutional layers. This can be noticed in the composed shifts $c+a$ and $d+b$ with respect to a pixel index, $(g, h)$. In other words, by letting the convolutional indexes range as $-3 \le a,b \le 3$ (a $7\times 7$ filter) and $-2 \le c,d \le 2$ (a $5\times 5$ filter), the involved inputs range within a 12 × 12 neighborhood around each $(g, h)$ index of a pixel, disregarding, for a moment, the channel and kernel indexes $p, q, r$. In order to simplify (33) and inductively make use of the quantities $\delta_\alpha$, we introduce the following 1 : 1 substitutions: $g = j - c$ and $h = k - d$. This leads to
$$\frac{\partial l}{\partial w_{\alpha;a,b,p,q}} = \sum_{0\le j-c,\,k-d<64}\;\sum_{0\le r<32}\delta_{\beta;j-c,\,k-d,\,r}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r}\cdot[0\le j<64]\cdot[0\le k<64]\cdot x_{j+a,\,k+b,\,p}$$
$$= \sum_{0\le j,k<64}\;\sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;c,d,q,r}\cdot\delta_{\beta;j-c,\,k-d,\,r}\cdot[0\le j-c<64]\cdot[0\le k-d<64]\cdot x_{j+a,\,k+b,\,p}$$
$$= \sum_{0\le j,k<64}\delta_{\alpha;j,k,q}\cdot x_{j+a,\,k+b,\,p}.$$
In the second step, we exchanged the ranges of the external summations ($0 \le j-c,\,k-d < 64$) with the indicator functions $[0 \le j < 64]\cdot[0 \le k < 64]$ appearing under the sum (an equivalent replacement). Also, we changed the order of the inner sums (c, d before r) and substituted for $\delta_\alpha$ in accordance with (31).
The corresponding tensor formulas are as follows:
$$\nabla_\alpha := \mathop{S}_{0\le q<16} R_{7\times 7\times 3}\bigg( R_{1\times 64^2}\,\delta_{\alpha;\cdot,\cdot,q}\cdot\underbrace{\mathop{S}_{\substack{-3\le a,b\le 3\\ 0\le p<3}} R_{64^2\times 1}\, X^P_{a:a+64,\;b:b+64,\;p}}_{64^2\;\times\;7^2\cdot 3}\bigg),$$
$$\nabla_{0,\alpha} := \mathop{S}_{0\le q<16} R_{1\times 64^2}\,\delta_{\alpha;\cdot,\cdot,q}\cdot\mathbf{1}_{64^2\times 1}.$$
Our exercise—deriving backpropagation formulas for the network from Figure 3—is finally done. Below, we put it in a summarized algorithm-like form. We concentrate on the backward pass with references to Rule 1 and Rule 2. The rules take place alternately, and we signal their presence even if, in some cases, their execution is void. Weight corrections are written according to the simple SGD method. More advanced optimization approaches are discussed in Section 6.
  • Forward pass: Formulas (8)–(15).
  • Backward pass:
    Compute the initial error quantity $\delta_l$ — see (18).
    Rule 1: compute $\delta_Y$ as a function of $\phi(Y)$, $I_{10}$, $\delta_l$ — see (19).
    Rule 2: compute $\nabla_Y$, $\nabla_{0,Y}$ as a function of $\delta_Y$, $\phi(\gamma)$ — see (20), (21).
    Rule 1: compute $\delta_\gamma$ as a function of $\phi(\gamma)$, $W_Y$, $\delta_Y$ — see (23).
    Rule 2: do not compute $\nabla_\gamma$ (no tunable parameters in layer $\gamma$).
    Rule 1: compute $\delta_\beta$ as a function of $\phi(\beta)$, $I_{131072}$, $\delta_\gamma$ — see (25).
    Rule 2: compute $\nabla_\beta$, $\nabla_{0,\beta}$ as a function of $\delta_\beta$, $\phi(\alpha)$ — see (28), (29).
    Rule 1: compute $\delta_\alpha$ as a function of $\phi(\alpha)$, $W_\beta$, $\delta_\beta$ — see (32).
    Rule 2: compute $\nabla_\alpha$, $\nabla_{0,\alpha}$ as a function of $\delta_\alpha$ and X — see (35), (36).
    Rule 1: do not compute $\delta_X$ (not needed since X is the initial layer).
  • Corrections (simple SGD):
    1. Compute corrections for layer Y: $W_Y := W_Y - \eta\nabla_Y$, $W_{0,Y} := W_{0,Y} - \eta\nabla_{0,Y}$.
    2. Do not compute corrections for layer $\gamma$ (no tunable parameters).
    3. Compute corrections for layer $\beta$: $W_\beta := W_\beta - \eta\nabla_\beta$, $W_{0,\beta} := W_{0,\beta} - \eta\nabla_{0,\beta}$.
    4. Compute corrections for layer $\alpha$: $W_\alpha := W_\alpha - \eta\nabla_\alpha$, $W_{0,\alpha} := W_{0,\alpha} - \eta\nabla_{0,\alpha}$.
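To show how the above summary typically maps onto code, the following structural sketch organizes one SGD training step around the alternation of Rule 1 and Rule 2. The layer interface (forward, backward, gradients) is hypothetical and only illustrates the flow of signals and gradients; it is not the actual hmdl API.

```python
# Illustrative orchestration of one SGD training step (names are hypothetical).
# Each layer is assumed to expose forward(), backward(), and, if it has
# tunable weights, gradients() yielding (weights, gradient) pairs.
def train_step(layers, X, Y_star, loss, eta):
    # Forward pass: signals flow from the input towards the loss.
    signal = X
    for layer in layers:
        signal = layer.forward(signal)

    # Backward pass: start from delta_l and alternate Rule 1 / Rule 2.
    delta = loss.initial_delta(signal, Y_star)   # delta_l, see (18)
    for layer in reversed(layers):
        delta = layer.backward(delta)            # Rule 1 (gradients of Rule 2 stored inside)

    # Corrections (simple SGD), withheld until the whole backward pass is done.
    for layer in layers:
        for W, grad in layer.gradients():        # empty for layers without weights
            W -= eta * grad                      # in-place NumPy update
```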

4. General Formulas for Common Layer Types

Having analyzed a small structure with particular sizes of tensors and filters, in this section, we provide general formulas. The presentation is concise, with more comments given only for new layer types not discussed previously.
As regards notation, in all cases, we consider a junction of two layers (somewhere within a network) named X and Y. More precisely, in the forward pass, an activated tensor or vector, ϕ ( X ) , becomes mapped to Y and then possibly to ϕ ( Y ) . In the backward pass, we want to compute gradients with respect to tunable weights (if present), and to propagate a tensor of quantities δ Y to the preceding tensor, δ X . For layer types discussed previously, we skip forward computations and concentrate only on the backward pass. The shape of input tensors is denoted as H × V × C , understood as height × width × the number of input channels. We pick the letter V to represent widths because W was already used for weights. We reduce considerations to square-sized convolutional filters of the shape F × F , where F is odd, and square pools of shape S × S , all working with a stride of 1. In the case of dense layers, the mapping is performed from a flat vector of length N to another flat vector of size M.
The last subsection is devoted to the softmax activation function, or more generally, to such activations that depend on more than one argument.

4.1. Convolutional Layers (2D)

Let K stand for the number of output channels to be produced (also known as kernels or feature maps). Tensor $W_Y$, storing the convolutional weights, is of the shape $F\times F\times C\times K$, whereas vector $W_{0,Y}$, storing the biases, is of length K.
In the backward pass, the gradients ought to be computed as follows:
$$\nabla_Y := \mathop{S}_{0\le q<K} R_{F\times F\times C}\bigg( R_{1\times H\cdot V}\,\delta_{Y;\cdot,\cdot,q}\cdot\mathop{S}_{\substack{-\lfloor F/2\rfloor\le a,b\le\lfloor F/2\rfloor\\ 0\le p<C}} R_{H\cdot V\times 1}\,\phi\big(X^P_{a:a+H,\;b:b+V,\;p}\big)\bigg),$$
$$\nabla_{0,Y} := \mathop{S}_{0\le q<K} R_{1\times H\cdot V}\,\delta_{Y;\cdot,\cdot,q}\cdot\mathbf{1}_{H\cdot V\times 1}.$$
And the propagation of errors is carried out by computing the following:
$$\delta_X := \frac{\partial\phi(X)}{\partial X}\circ\mathop{S}_{0\le q<C} R_{H\times V}\bigg( R_{1\times F^2\cdot K}\,\bar{W}_{Y;\cdot,\cdot,q,\cdot}\cdot\mathop{S}_{\substack{0\le j<H\\ 0\le k<V}} R_{F^2\cdot K\times 1}\,\delta^P_{Y;\,j-\lfloor F/2\rfloor:j+\lfloor F/2\rfloor+1,\;k-\lfloor F/2\rfloor:k+\lfloor F/2\rfloor+1}\bigg).$$

4.2. Max Pooling Layers (2D)

Using a grid of non-overlapping two-dimensional pools of shape S × S each (or, more precisely, S × S × 1 if their one-channel depth is written explicitly), an input tensor ϕ ( X ) of shape H × V × C becomes mapped to an output tensor, Y, with a reduced shape, H / S × V / S × C .
Single outputs, for all $0 \le j < H/S$, $0 \le k < V/S$, and $0 \le q < C$, are computed as follows:
$$y_{j,k,q} := \max_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
Additionally, in the forward pass, one needs to memorize the coordinates for which the maxima are attained. They shall be needed later for backward propagation. We denote them as
$$a^*_{j,k,q} := \mathop{\arg\max}_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
At the implementation level, (41) can be established first (or in parallel with (40)), and using its entries, one can populate Y. For example, when $a^*_{j,k,q} = (u^*, v^*)$, then $y_{j,k,q} := \phi(x_{u^*,v^*,q})$.
We remark that max pooling layers may potentially be equipped with an activation function, mapping Y non-linearly to ϕ ( Y ) , although this is seldom met in practice.
As regards the backward pass, the calculus-based rationale for max pooling layers is as follows. For suitably small changes in all $\phi(x_{u,v,q})$ values, the maxima are attained at the same locations. Then, each entry, $y_{j,k,q}$, can be, in fact, treated as a function of only one input signal—$\phi(x_{u^*,v^*,q})$, where $a^*_{j,k,q} = (u^*, v^*)$. From this point of view, Formula (40) is equivalent to a summation over artificial outgoing connections—one connection with weight 1 (associated with the maximum) and all others with weights 0 (for non-maximum entries). Special cases where several non-strict (equal) maxima exist are ignored in practice, having a negligible influence on the learning procedure. We represent this in the following way:
$$y_{j,k,q} := 1\cdot\phi(x_{u^*,v^*,q}) + \sum_{(u,v)\ne(u^*,v^*)} 0\cdot\phi(x_{u,v,q}).$$
Therefore, all intermediate terms $\partial y_{j,k,q}/\partial\phi(x_{u,v,q})$ appearing along the chain(s) of derivatives can be replaced with $\partial y_{j,k,q}/\partial\phi(x_{u^*,v^*,q})$. At the implementation level, this implies that values from tensor $\delta_Y$ should be propagated only to the corresponding $(u^*, v^*, q)$ locations in $\delta_X$, whereas the remaining entries within $\delta_X$ should be populated with zeros:
$$\delta_{X;u,v,q} := \begin{cases} \dfrac{\partial\phi(x_{u,v,q})}{\partial x_{u,v,q}}\cdot\delta_{Y;\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q}, & \text{if } a^*_{\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q} = (u, v); \\ 0, & \text{otherwise}. \end{cases}$$
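A minimal NumPy sketch of (40)–(43) for non-overlapping $S\times S$ max pooling is given below; it assumes H and V are divisible by S and no activation on the pooled layer ($\phi$ = identity), and the function names are ours:

```python
import numpy as np

def maxpool_forward(X, S):
    """Non-overlapping S x S max pooling of X with shape (H, V, C).
    Returns the pooled tensor and the argmax coordinates needed backward."""
    H, V, C = X.shape
    Y = np.empty((H // S, V // S, C))
    argmax = np.empty((H // S, V // S, C, 2), dtype=int)
    for j in range(H // S):
        for k in range(V // S):
            pool = X[j * S:(j + 1) * S, k * S:(k + 1) * S, :]   # (S, S, C)
            flat = pool.reshape(S * S, C)
            idx = flat.argmax(axis=0)                           # per-channel argmax
            Y[j, k, :] = flat[idx, np.arange(C)]
            argmax[j, k, :, 0] = j * S + idx // S               # u* coordinates
            argmax[j, k, :, 1] = k * S + idx % S                # v* coordinates
    return Y, argmax

def maxpool_backward(delta_Y, argmax, X_shape):
    """Routes each delta only to the location where the maximum was attained."""
    delta_X = np.zeros(X_shape)
    Hs, Vs, C = delta_Y.shape
    for j in range(Hs):
        for k in range(Vs):
            for q in range(C):
                u, v = argmax[j, k, q]
                delta_X[u, v, q] = delta_Y[j, k, q]
    return delta_X
```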

4.3. Average Pooling Layers (2D)

Sometimes, a sum- or mean-based information reduction turns out to work better than a max-based reduction. Average pooling serves that purpose. Forward and backward computations are similar to the ones associated with max pooling, and, again, it is easy to understand them through the prism of Rule 1, thinking about artificial outgoing connections. This time, all such connections have equal weights of 1 / S 2 . The key formulas, given below, are counterparts of (40) and (43).
$$y_{j,k,q} := \frac{1}{S^2}\sum_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
$$\delta_{X;u,v,q} := \frac{\partial\phi(x_{u,v,q})}{\partial x_{u,v,q}}\cdot\frac{1}{S^2}\cdot\delta_{Y;\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q}.$$
Figure 5 shows an illustrative comparison between max pooling and average pooling forward computations.

4.4. Flattening Layers

A flattening layer is typically applied directly before a block of dense layers in a network. That is, when a 3D tensor from a previous layer needs to be flattened to become a 1D input to a dense layer. Although there is not much mathematics involved in flattening layers, and there are no tunable parameters, suitable forward and backward formulas can still be written in accordance with our general scheme:
$$Y := \mathrm{R}_{H \cdot V \cdot C \times 1}\big(\phi(X)\big).$$
$$\delta_X := \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(I_{H \cdot V \cdot C} \cdot \delta_Y\big) = \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(\delta_Y\big).$$
The identity matrix I H · V · C represents (temporarily) weights of outgoing connections. The left illustration in Figure 6 depicts backward computations for a flattening layer.
Just as for pooling layers, we remind the reader that Y can potentially be mapped non-linearly to ϕ ( Y ) if an activation function is declared for a flattening layer (seldom in practice).

4.5. Dropout Layers

Dropout layers can be regarded as probabilistic regularizers. For an imposed constant rate, $r^* \in [0, 1)$, a dropout layer allows a random subset of its input signals, approximately a fraction of $1 - r^*$, to pass through as outputs, whereas the remaining part, approximately $r^*$, is dropped, i.e., becomes zeroed.
Consider a tensor, $R = (r_{j,k,q})$, with the same shape as the input, $H \times V \times C$, whose entries are drawn randomly from a standard uniform distribution, $r_{j,k,q} \sim U(0,1)$. Treating $R$ as a tensor caters to the general case. Dropout layers can, in particular, also be applied after flat or dense layers with tensor $R$ reduced to a 1D vector. During the fit stage, the forward computations are performed as follows:
$$Y := \frac{1}{1 - r^*} \cdot [R > r^*] \odot \phi(X).$$
Note that the distribution of entries in tensor $[R > r^*]$ complies with the Bernoulli distribution (having the success parameter set to $1 - r^*$).
The $1/(1 - r^*)$ factor in (48) is introduced to increase the strength of signals passing through. E.g., for $r^* = 0.2$, the sum of surviving signals constitutes approximately $8/10$ of the original total (in the sense of the absolute value). Therefore, the multiplication by $10/8$ preserves a roughly similar flow compared to the one occurring when the dropout layer is switched off—namely, at the prediction stage. Then, all the signals are allowed to pass through unchanged. The complete justification behind the manner in which dropout operates is provided in the regularization-related Section 7.3. Perhaps surprisingly, the mechanism described above is connected to model averaging techniques known from ensemble methods.
As regards the backward pass, the δ s from the subsequent layer are propagated in the following way
$$\delta_X := \phi'(X) \odot \frac{1}{1 - r^*} \cdot [R > r^*] \odot \delta_Y.$$
Note that, at the implementation level, the whole tensor $\frac{1}{1 - r^*} \cdot [R > r^*]$ used during the forward pass (including the batch dimension) can be memorized and reused during backward computations. Figure 6 (right-hand side) illustrates backward computations for a dropout layer.
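A minimal NumPy sketch of the above inverted-dropout scheme (48)–(49) follows. It is independent of the hmdl classes and, as in the previous sketch, assumes the identity activation so that $\phi'$ can be omitted.

```python
import numpy as np

def dropout_forward(phi_x, r_star, rng):
    """Inverted dropout at the fit stage: zero a ~r_star fraction of signals
    and rescale the survivors by 1/(1 - r_star)."""
    R = rng.uniform(size=phi_x.shape)
    coeffs = (R > r_star) / (1.0 - r_star)   # memorized for the backward pass
    return coeffs * phi_x, coeffs

def dropout_backward(delta_Y, coeffs):
    """Backward pass: deltas flow only through the surviving connections."""
    return coeffs * delta_Y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3)).astype(np.float32)
y, coeffs = dropout_forward(x, 0.2, rng)
dx = dropout_backward(np.ones_like(y), coeffs)
```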

4.6. Dense Layers

Consider a dense layer mapping from a vector, ϕ ( X ) , with N inputs to a vector, Y, with M outputs. We choose the shape of matrix W Y to be M × N —the variant in which weights incoming to an output neuron are written as a row, i.e.,
$$W_Y = \big(w_{Y;j,k}\big)_{0 \le j < M,\ 0 \le k < N},$$
with biases $W_{0,Y} = (w_{0,Y;j})_{0 \le j < M}$ being a column vector. Formulas for forward and backward computations are given below.
$$Y := W_Y \cdot \phi(X) + W_{0,Y}.$$
$$\nabla_Y := \delta_Y \cdot \phi(X)^T.$$
$$\nabla_{0,Y} := \delta_Y \cdot \mathbb{1}_{N \times 1}.$$
$$\delta_X := \phi'(X) \odot \big(W_Y^T \cdot \delta_Y\big).$$
We must remark that, in (54), the activation function of the previous layer is assumed to depend only on one argument, i.e., only on the weighted sum of the neuron it activates. Possible examples are the sigmoid, the hyperbolic tangent, and ReLU, as opposed to, e.g., the softmax function, discussed in the next subsection.
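The following is a minimal NumPy sketch of the dense-layer computations (51)–(54) for a single example, with weights stored as an $M \times N$ matrix as above. It is a simplified illustration rather than the hmdl code: for a batch, the bias gradient additionally sums the $\delta_Y$ columns over the batch dimension, and the previous layer's derivative `dphi_x` is set to ones here (identity activation).

```python
import numpy as np

def dense_forward(phi_x, W, W0):
    """Y := W * phi(X) + W0; W has shape (M, N), phi_x shape (N, 1), W0 shape (M, 1)."""
    return W.dot(phi_x) + W0

def dense_backward(delta_Y, phi_x, W, dphi_x):
    """Gradients w.r.t. weights/biases (Rule 2) and the delta for the previous
    layer (Rule 1); dphi_x holds phi'(X) evaluated element-wise."""
    grad_W = delta_Y.dot(phi_x.T)           # (M, N)
    grad_W0 = delta_Y.copy()                # (M, 1); summed over a batch in general
    delta_X = dphi_x * (W.T.dot(delta_Y))   # (N, 1)
    return grad_W, grad_W0, delta_X

rng = np.random.default_rng(0)
N, M = 5, 3
phi_x = rng.standard_normal((N, 1))
W, W0 = rng.standard_normal((M, N)), np.zeros((M, 1))
delta_Y = rng.standard_normal((M, 1))
dphi_x = np.ones((N, 1))                    # identity activation in the previous layer
grad_W, grad_W0, delta_X = dense_backward(delta_Y, phi_x, W, dphi_x)
```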

4.7. Softmax Activation

Softmax activations play an important role since their outcomes can be interpreted as probability distributions. They are commonly imposed on final output layers in classification tasks. The crucial property of the softmax function is that it depends not only on one weighted sum incoming to the neuron it activates but on all weighted sums computed via the layer. In consequence, a certain adjustment has to be made in the scheme of backward computations—element-wise multiplication by a vector of activation derivatives is no longer possible. But, as we show below, this adjustment can also be seen through the prism of main Rule 1 if softmax computations are treated as a two-staged process.
Let us continue with the notation from the preceding subsection and suppose that layer $X$ is now equipped with the softmax activation, with individual outputs computed as
$$\phi(x_k) := \exp(x_k) \Big/ \sum_{0 \le z < N} \exp(x_z), \quad 0 \le k < N.$$
One should realize that a change in a single $x_z$ signal causes changes in all $\phi(x_k)$ values, $0 \le k < N$, because of the denominator, and thereby further changes in all neurons of layer $Y$. Therefore, to suitably propagate back the errors $\delta_Y$ onto $\delta_X$, it is convenient to think of an intermediate stage of errors that we denote as $\delta_{\phi(X)}$; see Figure 7.
Partial derivatives $\partial \phi(x_z) / \partial x_k$ express quantitatively how strongly $\phi(x_z)$ is affected by a change in $x_k$. Hence, one can write the following formulas for backward propagation by applying Rule 1 twice. In the second case, the $\partial \phi(x_z) / \partial x_k$ are treated as artificial weights of outgoing connections.
$$\delta_{\phi(X);k} := \sum_{0 \le j < M} w_{Y;j,k} \cdot \delta_{Y;j}, \quad 0 \le k < N,$$
$$\delta_{X;k} := \sum_{0 \le z < N} \frac{\partial \phi(x_z)}{\partial x_k} \cdot \delta_{\phi(X);z}, \quad 0 \le k < N.$$
All partial derivatives of the softmax function (or, in fact, any other) can be gathered into a matrix of the form
$$D_{\phi(X)} := \begin{pmatrix} \dfrac{\partial \phi(x_0)}{\partial x_0} & \dfrac{\partial \phi(x_1)}{\partial x_0} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_0} \\ \dfrac{\partial \phi(x_0)}{\partial x_1} & \dfrac{\partial \phi(x_1)}{\partial x_1} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_1} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial \phi(x_0)}{\partial x_{N-1}} & \dfrac{\partial \phi(x_1)}{\partial x_{N-1}} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_{N-1}} \end{pmatrix}$$
and applied in the following general formula for backward propagation through a dense layer:
$$\delta_X := D_{\phi(X)} \cdot W_Y^T \cdot \delta_Y.$$
Please note that Formula (54) is a special case of (59) for activation functions depending on a single argument.
As regards actual derivatives for the softmax function, it is easy to check that, after some algebra, one obtains the following:
$$\frac{\partial \phi(x_z)}{\partial x_k} = \begin{cases} \phi(x_z)\big(1 - \phi(x_z)\big), & \text{for } z = k; \\ -\phi(x_z)\,\phi(x_k), & \text{for } z \ne k. \end{cases}$$
This means that explicit analytic expressions are not needed at the implementation level, and one can reuse neurons' outcomes from the forward pass to compute the derivatives quickly during the backward pass. A similar property holds for sigmoid activation functions, where $\phi'(x) = \phi(x)(1 - \phi(x))$, which agrees with the softmax partial derivatives for $z = k$.
In short, (60) can be written as $\partial \phi(x_z)/\partial x_k = \phi(x_z)\big([z = k] - \phi(x_k)\big)$, and in matrix form, it becomes:
$$D_{\phi(X)} = \operatorname{diag}\big(\phi(X)\big) - \phi(X) \cdot \phi^T(X).$$
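As a quick illustration, the short NumPy sketch below builds the softmax Jacobian from forward outcomes only and checks one entry against the element-wise formula (60). It is a standalone example, not the hmdl code, and the max subtraction inside the softmax is an extra numerical-stability convenience.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(phi_x):
    """D_phi(X) = diag(phi(X)) - phi(X) phi(X)^T, built from forward outputs only."""
    phi_x = phi_x.reshape(-1, 1)
    return np.diagflat(phi_x) - phi_x.dot(phi_x.T)

x = np.array([0.2, -1.0, 0.5, 1.3])
p = softmax(x)
D = softmax_jacobian(p)
z, k = 2, 0
# the matrix entry agrees with phi(x_z) * ([z = k] - phi(x_k))
assert np.isclose(D[k, z], p[z] * ((z == k) - p[k]))
```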

5. Weight Initialization

A cautious initialization of weights in deep networks is of great importance. Its goal is to prevent signals that travel through many layers from exploding or vanishing. Without proper initialization, such problems are likely to occur, and their probability increases with the number of layers. They are commonly referred to as exploding gradients or vanishing gradients to indicate that the learning process no longer converges, but despite the names, the problem pertains not only to the backward flows—it is equally valid for the forward ones. To provide a simple and convincing numerical argument, below, we repeat one after [50].
Think of a network consisting of 100 dense layers, each with 512 neurons (without activations). Let x be the input vector also of length 512, and suppose its entries are drawn from a standard normal distribution, N ( 0 , 1 ) . We are going to check what happens if initial weights for all layers are also drawn from N ( 0 , 1 ) . The forward pass, represented by the following Python code (Figure 8), consists of 100 consecutive multiplications:
  • Its execution produces the printout "nan, nan", which informs us about the numerically unwanted behavior. In fact, it is easy to check that, after 28 iterations, all elements inside x are at least of order $10^{35}$ and some of order $10^{38}$—close to the maximum non-infinity value allowed for floating-point numbers of single precision. One iteration later, there are already 503 nan values and 9 inf values inside x.
On the other hand, it is easy to check that weights must not be too small either. The following carefree correction (Figure 9) scales weights by a factor of $10^{-2}$:
  • And the program execution produces the printout “0.0, 0.0”—the signals have vanished.
The mathematical rationale behind the above observations is that, for all layers, any element of the vector resulting from W.dot(x) arises as a sum of 512 statistically independent summands. Therefore, the variance of such an element is equal to the sum of component variances. Note that the fact that each summand in itself is a product (a weight times an input) can be neglected because weights and inputs are mutually independent. Strictly speaking, this is true for the considered simplified example due to zero-mean distributions (for both weights and inputs) and due to the lack of activation functions. More general analysis is provided in the subsequent sections.
Since each element of the resulting vector arises as a sum of 512 summands, a sensible first guess could be to scale weights by $1/\sqrt{512}$—the inverse of the standard deviation of the mentioned sum. More generally, when input sizes differ, a plausible scaling factor seems to be the inverse of the square root of the number of inputs. In the following code (Figure 10) we apply the guess:
  • And the obtained printout is “-0.016219143 0.81295455”—the signals have neither exploded nor vanished.
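For readers without access to Figures 8–10, the sketch below reconstructs the three experiments in one function under stated assumptions (float32 arithmetic, a modern NumPy random generator instead of the figures' exact code). The comments describe the qualitative outcomes; the exact printed numbers depend on the random generator and seed and need not match the article's printouts literally.

```python
import numpy as np

def deep_signal_flow(scale, layers=100, size=512, seed=0):
    """Push a standard-normal input through `layers` weight matrices drawn
    from N(0, 1) and multiplied by `scale`; return mean and std of the output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(size).astype(np.float32)
    for _ in range(layers):
        W = (scale * rng.standard_normal((size, size))).astype(np.float32)
        x = W.dot(x)
    return float(x.mean()), float(x.std())

print(deep_signal_flow(1.0))                 # explodes: nan values appear
print(deep_signal_flow(0.01))                # vanishes: everything becomes 0.0
print(deep_signal_flow(1.0 / np.sqrt(512)))  # stays within a sensible range
```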
The accurate analysis in subsequent sections shows that the above first guess is not far off from the truth. The analysis follows the works of Glorot [17] and He [29] that resulted in two state-of-the-art methods for weight initialization.

5.1. Randomness and Variance Analysis Setup

Consider a network with $L$ layers. In the analyses below, we use layer indexes, $i = 1, 2, \ldots, L$, instead of names. Let $\overrightarrow{N}_i$ denote the number of inputs incoming to each neuron in the $i$-th layer during the forward pass—hence, the symbolic left-to-right arrow. We take into account only layers with tunable weights (dense, convolutional) and assume that the following unified linear expression (one should think here of a standard matrix multiplication for dense layers, and a GEMM-based approach for convolutional layers) (62) describes the way a layer operates. In other words, multiplication by a transformation matrix $W_i$ with weights suitably arranged takes place:
$$Y_i = W_i \cdot X_i + W_{0,i},$$
$$X_{i+1} = \phi(Y_i).$$
Rows of $W_i$ and column(s) of $X_i$ are both of length $\overrightarrow{N}_i$; in particular, for convolutional layers, $\overrightarrow{N}_i = F_i^2 \cdot C_i$, where $F_i \times F_i$ represents the filter shape and $C_i$ the number of input channels. The following randomness-related conditions are satisfied:
  • Random elements w i in each W i are mutually independent and share the same distribution (i.i.d.—independent, identically distributed);
  • Likewise, elements x i in each X i are i.i.d.;
  • X i and W i are independent of each other;
  • Biases are initialized with zeros.
Knowing that any element $y_i$ of $Y_i$ in (62) becomes computed as a sum (of products) with $\overrightarrow{N}_i$ mutually independent summands having the same distribution, we have
$$\operatorname{Var}[y_i] = \overrightarrow{N}_i \operatorname{Var}[w_i x_i] = \overrightarrow{N}_i \Big( \mathbb{E}[w_i^2 x_i^2] - \mathbb{E}^2[w_i x_i] \Big) = \overrightarrow{N}_i \Big( \mathbb{E}[w_i^2]\, \mathbb{E}[x_i^2] - \underbrace{\mathbb{E}^2[w_i]}_{0}\, \mathbb{E}^2[x_i] \Big) = \overrightarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}[x_i^2] = \overrightarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}\big[\phi^2(y_{i-1})\big],$$
where $w_i$, $x_i$ stand for single elements in $W_i$ and $X_i$. Note that $\mathbb{E}[x_i^2] \ne \operatorname{Var}[x_i]$, unless $x_i$ has a zero mean. For example, for ReLU, $x_i = \phi(y_{i-1}) = [y_{i-1} > 0]\, y_{i-1}$; thus, clearly, $\mathbb{E}[x_i] \ne 0$.
To make use of Equation (64), one needs to transform it into a recurrence binding $\operatorname{Var}[y_i]$ with $\operatorname{Var}[y_{i-1}]$. Both Glorot [17] and He [29] achieve this, however, via different means. Glorot assumes symmetric and approximately linear activation functions $\phi$ around 0 (true, e.g., for a sigmoid or hyperbolic tangent but not necessarily in general), whereas He focuses on convolutional layers activated by ReLU, takes advantage of its properties, and achieves a more elegant result, though restricted specifically to ReLU (or a parametric family of ReLUs).

5.2. Glorot Initialization

The assumption that $\phi$ is approximately linear around 0 implies that $x_i = \phi(y_{i-1}) \approx y_{i-1}$ has a mean approximately equal to 0. Therefore, $\mathbb{E}\big[\phi^2(y_{i-1})\big] \approx \operatorname{Var}[y_{i-1}]$, and by inserting this approximation into (64), the following recurrence is obtained:
$$\operatorname{Var}[y_i] = \overrightarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[y_{i-1}].$$
Expanding it over L layers leads to the following product:
$$\operatorname{Var}[y_L] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \overrightarrow{N}_i \operatorname{Var}[w_i]$$
which is crucial to the design of weight initialization. Clearly, an incorrect design may cause the signals to magnify or vanish exponentially. Therefore, as Glorot [17] rightly remarks, to keep the information flowing normally during the forward computations, an obvious sufficient condition is to keep all factors of product (66) equal to 1:
$$\overrightarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
This suggests that one could draw initial weights, for example, from a zero-mean Gaussian distribution with the standard deviation $1/\sqrt{\overrightarrow{N}_i}$. However, this conclusion is premature because the backward flows of signals must also be taken into account.
Analogously to (62) and (63), during the backward pass, we can think of the following operation:
$$\delta_{X_i} = \phi'(X_i) \odot \big(\overline{W}_i \cdot \delta_{Y_i}\big),$$
where $\overline{W}_i$ denotes a suitable backward transformation matrix (e.g., a transposed matrix of weights for dense layers, or a flipped and rearranged matrix of filter weights for convolutional layers; see, e.g., (54) and (39)). This time, $\overleftarrow{N}_i$ stands for the number of "backward inputs" incoming to each neuron from the right. For convolutional layers, $\overleftarrow{N}_i = F_i^2 \cdot K_i$, where $K_i$ is the number of output channels (kernels).
Similarly to before, we look at the variance of any single element, $\delta_{x_i}$, in $\delta_{X_i}$. That variance is equal to the sum of variances of $\overleftarrow{N}_i$ independent summands, yielding
$$\operatorname{Var}[\delta_{x_i}] = \overleftarrow{N}_i \operatorname{Var}\big[\phi'(x_i)\, w_i\, \delta_{y_i}\big] = \overleftarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}\big[\big(\phi'(x_i)\, \delta_{y_i}\big)^2\big]$$
—a counterpart of (64). This time, the assumption that $\phi$ is approximately linear around 0 implies that $\phi'(x_i) \approx 1$ can be inserted into (69), and the following backward recurrence for the variance of the $\delta$s is obtained:
$$\operatorname{Var}[\delta_{x_i}] = \overleftarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[\delta_{x_{i+1}}],$$
where $\delta_{x_{i+1}} = \phi'(x_i)\, \delta_{y_i}$ defines the recursive quantity. Expanding the recurrence yields the product
$$\operatorname{Var}[\delta_{x_1}] = \operatorname{Var}[\delta_{x_L}] \prod_{i=2}^{L} \overleftarrow{N}_{L-i+1} \operatorname{Var}[w_{L-i+1}],$$
which, from the backward computations point of view, suggests the following sufficient condition:
$$\overleftarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Please note that, in general, one is unable to ensure that conditions (67) and (72) are satisfied simultaneously. Therefore, as a compromise, Glorot [17] proposed the application of an average, namely
$$\operatorname{Var}[w_i] = 2 \big/ \big(\overrightarrow{N}_i + \overleftarrow{N}_i\big), \quad \forall i.$$
This proposition implies the following two recipes for a random initialization of weights, based on normal and uniform distribution, respectively.
  • Glorot normal initialization:
    $w_i \sim N\big(0,\ 2/(\overrightarrow{N}_i + \overleftarrow{N}_i)\big)$.
  • Glorot uniform initialization:
    $w_i \sim U\big({-\sqrt{6/(\overrightarrow{N}_i + \overleftarrow{N}_i)}},\ \sqrt{6/(\overrightarrow{N}_i + \overleftarrow{N}_i)}\big)$.
The strange-looking $\sqrt{6}$ in (75) arises as the product of $\sqrt{2}$ and $\sqrt{3}$, since the standard deviation of $U(-1, 1)$ is $1/\sqrt{3}$.
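A minimal NumPy sketch of the two Glorot recipes (74) and (75) follows; `fan_in` and `fan_out` play the roles of the forward and backward input counts from the analysis above. It is an illustration under these naming assumptions, not the hmdl initialization code.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, 2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in)).astype(np.float32)

def glorot_uniform(fan_in, fan_out, rng):
    """Weights ~ U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in)).astype(np.float32)

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)
print(W.var(), 2.0 / (256 + 128))   # empirical variance close to the target
```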

5.3. He Initialization

With the ReLU function in mind, [29] also studied the variance in signals described by (64) and noticed that the following interesting property holds:
$$\mathbb{E}[x_i^2] = \mathbb{E}\big[([y_{i-1} > 0]\, y_{i-1})^2\big] = \int_0^{\infty} y_{i-1}^2\, p_{i-1}(y_{i-1})\, dy_{i-1} = \tfrac{1}{2} \operatorname{Var}[y_{i-1}].$$
The last equality holds because the distribution of $y_{i-1}$ is symmetric around 0 due to the symmetric distribution of $W_{i-1}$. Thus, (64) can be transformed into the following elegant recurrence:
$$\operatorname{Var}[y_i] = \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[y_{i-1}],$$
which, expanded over L layers, yields the product
$$\operatorname{Var}[y_L] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i].$$
Therefore, a sufficient condition to prevent the product from growing or vanishing exponentially is
$$\tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Note the presence of the factor $1/2$ in contrast to (67). In fact, for the first layer ($i = 1$), one could have $\overrightarrow{N}_1 \operatorname{Var}[w_1] = 1$ because there is no ReLU applied to the input signals. But a redundant factor, $1/2$, existing only on one layer does not matter; thus, He [29] keeps "for all $i$" in (79) for simplicity. The derived condition suggests that one could draw initial weights from a zero-mean Gaussian distribution with standard deviation $\sqrt{2/\overrightarrow{N}_i}$, or a uniform distribution, $U\big({-\sqrt{6/\overrightarrow{N}_i}},\ \sqrt{6/\overrightarrow{N}_i}\big)$. However, similarly as before, to make the analysis complete, one has to take backward computations into account as well.
The backward recurrence of interest includes the 1 / 2 factor:
$$\operatorname{Var}[\delta_{x_i}] = \tfrac{1}{2}\, \overleftarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[\delta_{x_{i+1}}],$$
and, when expanded, implies the product
$$\operatorname{Var}[\delta_{x_1}] = \operatorname{Var}[\delta_{x_L}] \prod_{i=2}^{L} \tfrac{1}{2}\, \overleftarrow{N}_{L-i+1} \operatorname{Var}[w_{L-i+1}]$$
and the following sufficient condition for the backward flow:
$$\tfrac{1}{2}\, \overleftarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Instead of looking for a compromise between (79) and (82), [29] noticed something more interesting—namely, for convolutional layers, it is sufficient to apply (79) or (82) alone. Recall that $\overrightarrow{N}_i = F_i^2 \cdot C_i$ and $\overleftarrow{N}_i = F_i^2 \cdot K_i$. Therefore, if, for example, (82) is applied in (78), then the product becomes
$$\operatorname{Var}[y_1] \prod_{i=2}^{L} \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \frac{\overrightarrow{N}_i}{\overleftarrow{N}_i} = \operatorname{Var}[y_1] \prod_{i=2}^{L} \frac{F_i^2 C_i}{F_i^2 K_i} = \operatorname{Var}[y_1]\, C_2 / K_L.$$
The last equality holds, as successive ratios cancel out, owing to $C_{i+1} = K_i$ for all $i \ge 2$. The final constant means that, if weight initialization properly scales the signals in the forward flow, then it simultaneously does so in the backward flow, or vice versa. Most libraries pick (79) and offer the following initialization for convolutional layers activated via ReLUs.
  • He normal initialization:
    $w_i \sim N\big(0,\ 2/\overrightarrow{N}_i\big) = N\big(0,\ 2/(F_i^2 C_i)\big)$,
  • He uniform initialization:
    $w_i \sim U\big({-\sqrt{6/\overrightarrow{N}_i}},\ \sqrt{6/\overrightarrow{N}_i}\big) = U\big({-\sqrt{6/(F_i^2 C_i)}},\ \sqrt{6/(F_i^2 C_i)}\big)$.
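Analogously, a short NumPy sketch of the He recipes (84) and (85) for a convolutional layer with $K$ kernels of shape $F \times F \times C$ is given below; again, this is an illustrative sketch rather than the hmdl code.

```python
import numpy as np

def he_normal_conv(F, C, K, rng):
    """Filter weights ~ N(0, 2 / (F*F*C)) for a conv layer activated by ReLU."""
    std = np.sqrt(2.0 / (F * F * C))
    return rng.normal(0.0, std, size=(F, F, C, K)).astype(np.float32)

def he_uniform_conv(F, C, K, rng):
    """Filter weights ~ U(-sqrt(6/(F*F*C)), +sqrt(6/(F*F*C)))."""
    limit = np.sqrt(6.0 / (F * F * C))
    return rng.uniform(-limit, limit, size=(F, F, C, K)).astype(np.float32)

rng = np.random.default_rng(0)
W = he_normal_conv(3, 64, 32, rng)
print(W.var(), 2.0 / (3 * 3 * 64))   # empirical variance close to the target
```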

6. Learning Algorithms (Optimizers)

Error functions under optimization (also known as objective or cost functions) are typically stochastic—hence the name stochastic gradient descent (SGD). There are several sources of stochasticity: evaluation at random data subsamples called batches (or mini-batches), inherent noise in acquired data, dropout regularization, etc. In this section, we discuss some learning algorithms for neural networks—variants of SGD. We devote the majority of space to Adam [25], arguably the most popular state-of-the-art optimizer.
Many SGD variants for large data sets are first-order methods (they do not use second or higher-order derivatives) that try to speed up convergence via adjustments made to a certain momentum term (or terms), learning rate(s), or both. The historical roots of these approaches can be seen in the classical momentum method from 1964 [51] and the RPROP algorithm from 1992 [52]. RPROP introduced individual learning rates for different dimensions (components) of a gradient. The rates were multiplied by 1.2 for dimensions preserving the minimization direction over successive steps and halved for dimensions where "turns" occurred. The contemporary techniques have similar mechanisms, but they are typically backed with a smoother adaptation using the idea of an exponential moving average.
For notational simplicity, throughout this section, we write just W and just ∇ to denote the vector (or tensor) of all network’s weights and the gradient of all partial derivatives, respectively; we shall not distinguish layer names. Instead, we write indexes of time moments in the subscripts (letter t), assuming t = 0 constitutes the initial moment. When the learning epochs are divided into batches, then successive optimization steps are, in fact, associated with updates after successive batches.

6.1. Exponential Moving Average (EMA)

Given a sequence, x 0 , x 1 , , consider a new sequence generated from it as follows:
$$y_t := \beta\, y_{t-1} + (1 - \beta)\, x_t, \quad t = 0, 1, \ldots;$$
where $y_{-1} = 0$ for initialization, and $\beta \in [0, 1)$ is the rate of exponential decay, frequently set to at least $0.9$. The expansion of (86) produces the following pattern:
$$y_t := \beta^2 y_{t-2} + (1-\beta)(x_t + \beta x_{t-1}) = \beta^3 y_{t-3} + (1-\beta)\big(x_t + \beta x_{t-1} + \beta^2 x_{t-2}\big) = \cdots = \beta^{t+1} \underbrace{y_{-1}}_{0} + (1-\beta) \sum_{i=0}^{t} \beta^{t-i} x_i,$$
where the weights under the summation change according to a geometric progression. The $\beta$ values close to 1 have the effect of "long memory" because many past terms, $x_i$, contribute to $y_t$. This can be described as exponential smoothing over a long period. Instead, $\beta$ values closer to 0 translate into a "short memory"—weights further in the past fade out quickly. Note that (87) is a biased (not normalized) average because the weights do not add up to 1, unless $\beta = 0$, which would imply $y_t = x_t$ for all $t$. More strictly, for a fixed $t$, the sum of weights is $(1 - \beta^{t+1})/(1 - \beta)$—a potential correcting normalizer.
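The few lines of NumPy below illustrate the bias of the EMA near the start of a sequence and the effect of the correcting normalization (the same correction that reappears later in Adam). A constant signal is used on purpose so that the initialization bias is clearly visible; this is a toy sketch, not library code.

```python
import numpy as np

def ema(xs, beta):
    """Biased exponential moving average (86) and its bias-corrected counterpart."""
    y, ys, ys_corrected = 0.0, [], []
    for t, x in enumerate(xs):
        y = beta * y + (1.0 - beta) * x
        ys.append(y)
        ys_corrected.append(y / (1.0 - beta ** (t + 1)))  # normalize by the sum of weights
    return np.array(ys), np.array(ys_corrected)

xs = np.ones(10)          # a constant signal makes the initialization bias visible
biased, corrected = ema(xs, 0.9)
print(biased[:3])         # starts far below 1 because y_{-1} = 0
print(corrected[:3])      # equals 1 from the very first step
```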

6.2. AdaGrad and RMSProp

The AdaGrad (Adaptive Gradient) algorithm [19] adapts the learning rate via a division by the square root of the cumulated past squared gradients (including the current one):
$$v_t := v_{t-1} + \nabla_t^2,$$
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, \nabla_t,$$
where $v_{-1} = 0$, $\epsilon = 10^{-7}$, and $\nabla_t^2 = \nabla_t \odot \nabla_t$. The $\epsilon$ constant is a safety valve to prevent a division by zero. Since the division and $\sqrt{\cdot}$ operations are performed element-wise, the effective learning rates for particular gradient dimensions become individual.
As regards the RMSProp algorithm [20], it was designed as a version of RPROP suitable for mini-batches, and it can be viewed as AdaGrad equipped with an EMA, namely
$$v_t := \beta\, v_{t-1} + (1 - \beta)\, \nabla_t^2,$$
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, \nabla_t,$$
where $v_{-1} = 0$, and again, $\beta = 0.9$ is a common choice.
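A minimal NumPy sketch of single update steps for both rules (88)–(91) follows; the learning rates used are illustrative defaults, and the gradient is assumed to be supplied by the caller.

```python
import numpy as np

def adagrad_step(W, grad, v, eta=0.01, eps=1e-7):
    """One AdaGrad update: accumulate squared gradients, divide element-wise."""
    v = v + grad * grad
    W = W - eta / (np.sqrt(v) + eps) * grad
    return W, v

def rmsprop_step(W, grad, v, eta=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update: an EMA of squared gradients instead of a raw sum."""
    v = beta * v + (1.0 - beta) * grad * grad
    W = W - eta / (np.sqrt(v) + eps) * grad
    return W, v

W = np.array([1.0, -2.0, 0.5])
v = np.zeros_like(W)
grad = np.array([0.3, -0.1, 0.7])
W, v = rmsprop_step(W, grad, v)
```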

6.3. Adam

The Adam algorithm (the name derived from "adaptive moment estimation"), proposed by Kingma and Ba [25], is a combination of the momentum and RMSProp approaches. It uses unbiased estimates of gradient moments to adapt individual non-explicit learning rates: the first moment (the mean) and the second raw moment (the uncentered variance). The algorithm tracks the EMAs of the gradient ($m_t$) and the squared gradient ($v_t$) using two hyper-parameters, $\beta_1, \beta_2 \in [0, 1)$, that control the decay. Their well-performing default values are $0.9$ and $0.999$, respectively.
The averages are initialized as tensors of zeros, leading to biased estimates, especially during the initial time steps, and especially when the decay rates are small (i.e.,  β s close to 1). Fortunately, the initialization bias can be counteracted easily, resulting in bias-corrected estimates named m ^ t and v ^ t [25].
When starting from $m_{-1} = v_{-1} = 0$, successive estimates of the first moment and the second raw moment are computed as follows:
$$m_t := \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_t,$$
$$v_t := \beta_2\, v_{t-1} + (1 - \beta_2)\, \nabla_t^2,$$
and their corresponding recursive expansions are as follows:
$$m_t = (1 - \beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \nabla_i, \qquad v_t = (1 - \beta_2) \sum_{i=0}^{t} \beta_2^{t-i}\, \nabla_i^2.$$
To technically check whether the moving averages constitute unbiased estimators of the true moments, one needs to take their expected values and see whether the following properties hold: $\mathbb{E}[m_t] = \mathbb{E}[\nabla_t]$ and $\mathbb{E}[v_t] = \mathbb{E}[\nabla_t^2]$. We check this for the first moment estimator (skipping the analogous derivation for the second):
$$\mathbb{E}[m_t] = \mathbb{E}\bigg[(1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \nabla_i\bigg] = (1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \mathbb{E}[\nabla_i] = (1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \mathbb{E}[\nabla_t] + \zeta = \big(1-\beta_1^{t+1}\big)\, \mathbb{E}[\nabla_t] + \zeta.$$
The term $\zeta$ is introduced as an error so that $\mathbb{E}[\nabla_i]$ can be approximated by $\mathbb{E}[\nabla_t]$. As explained in [25], one either has $\zeta = 0$ when the true moment $\mathbb{E}[\nabla_i]$ is stationary, or otherwise, $\zeta$ is kept sufficiently small since the exponential decay rate, $\beta_1$, can (and should) be chosen such that the EMA assigns very small weights to gradients far in the past. Apparently, the proposed estimators are biased, since $\mathbb{E}[m_t] \approx (1-\beta_1^{t+1})\, \mathbb{E}[\nabla_t] \ne \mathbb{E}[\nabla_t]$ and $\mathbb{E}[v_t] \approx (1-\beta_2^{t+1})\, \mathbb{E}[\nabla_t^2] \ne \mathbb{E}[\nabla_t^2]$, but fortunately, they can easily be corrected via scalar divisions:
$$\hat{m}_t := m_t \big/ \big(1 - \beta_1^{t+1}\big).$$
$$\hat{v}_t := v_t \big/ \big(1 - \beta_2^{t+1}\big).$$
Finally, the updates of weights are performed as
$$W_{t+1} := W_t - \eta\, \hat{m}_t \big/ \big(\sqrt{\hat{v}_t} + \epsilon\big),$$
where $\epsilon = 10^{-7}$.
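The complete update, gathering (92)–(98), fits into a few lines of NumPy; the sketch below applies Adam to a toy quadratic loss and is meant as an illustration of the formulas rather than the hmdl implementation discussed in Section 8.6.

```python
import numpy as np

def adam_step(W, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for time step t (t = 0, 1, ...), following (92)-(98)."""
    m = beta1 * m + (1.0 - beta1) * grad           # EMA of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad    # EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** (t + 1))           # bias corrections
    v_hat = v / (1.0 - beta2 ** (t + 1))
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.array([1.0, -2.0, 0.5])
m, v = np.zeros_like(W), np.zeros_like(W)
for t in range(3):
    grad = 2.0 * W                                 # gradient of a toy quadratic loss
    W, m, v = adam_step(W, grad, m, v, t)
```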
The following two properties of Adam should be emphasized. The first is that its effective step sizes are approximately bounded by $\eta$, since $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ stays approximately within $\pm 1$ in statistically common scenarios [25]. The second is that Adam is invariant to the scale of gradients; e.g., scaling all $\nabla$s by a factor, $c$, scales $\hat{m}_t$ also by a factor of $c$, and $\hat{v}_t$ by a factor of $c^2$. Those factors cancel out in the expression $\hat{m}_t / \sqrt{\hat{v}_t} = (c\, \hat{m}_t) \big/ \sqrt{c^2\, \hat{v}_t}$, neglecting the small $\epsilon$ for a moment.
Figure 11, repeated after the original article by Kingma and Ba (2014) [25], presents plots comparing the training costs (losses) of Adam and other SGD algorithms while fitting on the MNIST and CIFAR-10 data sets.

6.4. Other Ideas: Nadam, Adamax, AMSGrad

Owing to Nesterov's idea [53], some optimization techniques can be equipped with "prescience" by looking one step ahead and computing the gradient not with respect to the current position, i.e., the current weights, $W_t$, but the approximate future position. For example, introducing this idea into the momentum technique, known as NAG (the Nesterov accelerated gradient), can be written down as follows:
$$m_t := \beta\, m_{t-1} + (1 - \beta)\, \nabla\big|_{W_t - \eta m_{t-1}},$$
$$W_{t+1} := W_t - \eta\, m_t.$$
The subscript $W_t - \eta m_{t-1}$ at $\nabla$ in (99) indicates that we use the previous averaged velocity, $m_{t-1}$, to look at the hypothetical position $W_t - \eta m_{t-1}$ and treat it as the argument at which to calculate the gradient, rather than at $W_t$. In general, the anticipatory approach helps improve the convergence [20,53,54].
Using this idea, in 2016, Dozat proposed the Nadam algorithm (Adam + Nesterov) [31]. We present it concisely. Observe that Adam's update on weights can be equivalently written as
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1\, \hat{m}_{t-1} + \frac{1 - \beta_1}{1 - \beta_1^{t+1}}\, \nabla_t \right)$$
—obtained by inserting (92) into (96) and (98). With all other formulas preserved, Nadam arises by replacing, in (101), the previous moment, $\hat{m}_{t-1}$, with the current one, $\hat{m}_t$.
Adamax is a variant of Adam proposed by its very authors [25]. Instead of using the inverse of the $\ell_2$-norm (of past gradients) to scale the current one, the authors looked at the $\ell_p$-norm for the sake of generalization. They observed that, for the special case of $p \to \infty$, a surprisingly simple and stable algorithm emerges (not necessarily the case for a large but finite $p$). To derive Adamax, Kingma and Ba first defined a generalized moving average for $v_t$ as follows:
$$v_t = \beta_2^p\, v_{t-1} + (1 - \beta_2^p)\, |\nabla_t|^p = (1 - \beta_2^p) \sum_{i=0}^{t} \beta_2^{p(t-i)}\, |\nabla_i|^p,$$
where its hyperparameter is written on purpose as $\beta_2^p$, i.e., with an explicit $p$-th power (which can be regarded as equivalent to some other fractional parameter, $\tilde{\beta}_2 \in [0, 1)$, where $\tilde{\beta}_2 = \beta_2^p$, and for the moment $p$ is fixed). Then, they defined $u_t = \lim_{p \to \infty} v_t^{1/p}$, and they observed that
$$u_t = \lim_{p \to \infty} \big(1 - \beta_2^p\big)^{1/p} \bigg( \sum_{i=0}^{t} \beta_2^{p(t-i)}\, |\nabla_i|^p \bigg)^{1/p} = \max\big( \beta_2^{t}\, |\nabla_0|,\ \beta_2^{t-1}\, |\nabla_1|,\ \ldots,\ \beta_2\, |\nabla_{t-1}|,\ |\nabla_t| \big),$$
which corresponds to the simple and elegant recursive rule $u_t := \max\{\beta_2\, u_{t-1},\ |\nabla_t|\}$. We remind the reader that the $\max\{\cdot\}$ operation here is element-wise, as in previous contexts. Finally, the whole Adamax algorithm can be stated as follows:
$$m_t := \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_t,$$
$$u_t := \max\{\beta_2\, u_{t-1},\ |\nabla_t|\},$$
$$W_{t+1} := W_t - \eta\, \frac{m_t / \big(1 - \beta_1^{t+1}\big)}{u_t},$$
where $m_{-1} = u_{-1} = 0$. Note the lack of bias correction in $u_t$.
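For completeness, a single Adamax step (104)–(106) in the same sketch style as before; the toy gradient keeps `u` strictly positive, so no extra safeguard against division by zero is added here.

```python
import numpy as np

def adamax_step(W, grad, m, u, t, eta=1e-3, beta1=0.9, beta2=0.999):
    """One Adamax update following (104)-(106); u tracks an exponentially
    weighted infinity norm of past gradients, element-wise."""
    m = beta1 * m + (1.0 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))        # no bias correction needed for u
    W = W - eta * (m / (1.0 - beta1 ** (t + 1))) / u
    return W, m, u

W = np.array([1.0, -2.0, 0.5])
m, u = np.zeros_like(W), np.zeros_like(W)
for t in range(3):
    grad = 2.0 * W
    W, m, u = adamax_step(W, grad, m, u, t)
```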
Last but not least, we mention AMSGrad—a fairly new variant of Adam by [33]. AMSGrad sticks to the $\ell_2$-norm, but it ensures that the current $\hat{v}_t$ is not smaller than the previous one. With the other formulas preserved, the key difference boils down to the definition $\hat{v}_t := \max\{\hat{v}_{t-1},\ v_t\}$.

7. Regularization

Regularization mitigates overfitting, usually by imposing constraints on weights (but not only on weights). This section discusses three popular regularization techniques: $\ell_2$, $\ell_1$, and dropout. In fact, the first two have been known in machine learning and numerical methods since long before the "era" of DL. The last one was born in 2012 [22] specifically for DL.
Let us recall that, in the $\ell_p$ metric space, the $p$-norm of a vector of weights is defined as $\|W\|_p = \big( \sum_k |w_k|^p \big)^{1/p}$. For notational simplicity, we do not specify the number of weights, but we assume that $k$ iterates over all of them. Additionally, it shall be useful to explicitly define the error function as a sum of losses over data points, plus a suitable regularization term. The sum of losses can reflect different learning modes: on-line (if reduced to a single summand), off-line (a sum over the whole data set), and mini-batches (in between). We denote it as $\operatorname{err}(W) = \sum_i l_i(W) + \cdots$, emphasizing the functional dependence on weights.

7.1. $\ell_2$ Regularization (Weight Decay)

The optimization goal with $\ell_2$ regularization is to minimize the sum of losses, $\sum_i l_i(W)$, subject to the following constraint on the 2-norm (i.e., the Euclidean norm): $\|W\|_2 \le r$. Geometrically, this means that we want the weights to be kept within a ball (the dimensionality of the ball being equal to the number of weights) of radius $r$. Commonly, one incorporates the constraint into the cost function and applies the following equivalent form [55,56]:
$$\operatorname{err}(W) = \sum_i l_i(W) + \tfrac{1}{2}\, \lambda\, \|W\|_2^2,$$
where $\lambda \ge 0$ is a penalty hyperparameter controlling the strength of regularization. There exists a 1:1 correspondence between $r$ and $\lambda$. Large $\lambda$s translate into small radii.
Taking a derivative of (107) with respect to any w k yields
$$\frac{\partial \operatorname{err}(W)}{\partial w_k} = \sum_i \frac{\partial l_i(W)}{\partial w_k} + \lambda\, w_k.$$
Therefore, in the simple SGD scenario, the update of a single weight can be carried out in the following manner:
$$w_k := w_k - \eta \sum_i \frac{\partial l_i(W)}{\partial w_k} - \eta\, \lambda\, w_k.$$
Such an update rule is very simple to implement. Other advanced SGD variants (Adam, Adamax, AMSGrad, etc.) pose no difficulties either because $\nabla$, present in the various formulas in Section 6, should now be understood as the gradient of the whole cost function, including the regularization term, i.e., $\nabla = \big( \sum_i \partial l_i(W)/\partial w_k + \lambda w_k \big)_k$, generated for all $k$. It is also easy to see how (109) tries to shrink the weights at each update step [55], i.e., pull them towards zero, owing to the last summand, $-\eta \lambda w_k$.
Finally, it is worth pointing out a potential nomenclature ambiguity. In the literature on neural networks, 2 regularization is often known by the name of weight decay [57,58,59]. Please note that this should not be confused with learning rate decay—a scheduling technique [60,61], present, e.g., in Keras and PyTorch, that is, in fact, not a regularizer.

7.2. $\ell_1$ Regularization (Lasso)

The cost function with $\ell_1$ regularization is
$$\operatorname{err}(W) = \sum_i l_i(W) + \kappa\, \|W\|_1,$$
where we chose $\kappa \ge 0$ to denote the regularization hyperparameter. As a side remark, please note that mixtures of $\ell_2$ and $\ell_1$ regularizations are also possible, implying the following cost: $\sum_i l_i(W) + \tfrac{1}{2}\lambda \|W\|_2^2 + \kappa \|W\|_1$ [62,63]. The meaning of $\ell_1$ is that, again, the weights are constrained to stay within a certain ball, but this time a diamond-shaped ball implied by the 1-norm, i.e., the sum of absolute values of the weights. This "encourages" many weights not only to be small but to be exactly zero (or at least very close to zero in the numerical sense, e.g., less than $10^{-5}$). That is why $\ell_1$ is often regarded as a feature selection mechanism [12] and is known by the name lasso—least absolute shrinkage and selection operator [64].
Commonly, in practice, derivatives of (110) take the form
$$\frac{\partial \operatorname{err}(W)}{\partial w_k} = \sum_i \frac{\partial l_i(W)}{\partial w_k} + \kappa\, \operatorname{sgn}(w_k),$$
in spite of | w k | not being differentiable at zero [56,63,65].
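In code, both penalties reduce to adding simple terms to the gradient of the data loss before any optimizer from Section 6 is applied. The sketch below follows (108), (109), and (111); the function name and the plain SGD step are illustrative, not the hmdl API.

```python
import numpy as np

def regularized_gradient(grad_loss, W, lam=0.0, kappa=0.0):
    """Add the l2 (weight decay) and l1 (lasso) terms to the gradient of the
    data loss; lam and kappa are the penalty strengths from (107) and (110)."""
    return grad_loss + lam * W + kappa * np.sign(W)

W = np.array([0.5, -1.2, 0.0, 2.0])
grad_loss = np.array([0.1, -0.3, 0.2, 0.05])
grad = regularized_gradient(grad_loss, W, lam=1e-4, kappa=1e-5)
W = W - 0.01 * grad        # a plain SGD step; any optimizer from Section 6 applies
```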

7.3. Dropout Regularization

The dropout technique was proposed by Hinton, Srivastava, et al. and presented in two articles [22,66]. If it is known how dropout layers operate (Section 4.5), it may not be evident at first glance that they, in fact, perform model averaging—the approach well known from ensemble methods [2,67]. The averaging within dropout is performed in a “camouflaged” way since the direct way would be prohibitively time-consuming, facing three problems: 1. training multiple large networks (preferably with different structures) requires a lot of computations, 2. large networks require large amounts of data (preferably not the same per network), and 3. evaluating multiple responses at the test stage to combine predictions is unacceptable in many applications [66]. The dropout technique addresses all three problems in a probabilistic manner.
Consider a network with $N$ units (neurons), where each unit can be dropped (along with its connections) with probability $r^*$ at the training stage. Since random neurons are dropped, there is almost certainly a different network structure, simpler than the complete one, trained for each data presentation. However, all of these networks share the same weights for the units that remain present. This contrasts with traditional ensemble methods, where models are trained from scratch with independent parameters. From that perspective, a network with $N$ units can be seen as a family of $2^N$ "thinner" networks, and random dropout makes it possible to train a huge number of them in a reasonable time with an extensive sharing of weights [22].
Originally [66], it was postulated to leave the output signals of preserved units unchanged at the training stage and to multiply all signals by $1 - r^*$ at test time, when the dropout is switched off. Note that this constitutes a simple approximate averaging mechanism. Signals or, equivalently, weights of the final network are scaled-down versions of the trained ones. Multiplication by $1 - r^*$ at test time ensures that the expected output (under the training-stage distribution) is the same as the actual output at the test stage. Hence, owing to scaling, the hypothetical networks become combined into a single "mean" network.
In Section 4.5, we described dropout layers operating differently, namely with signals scaled by $1/(1 - r^*)$ at the training stage and unchanged at testing—the default version in many frameworks (Keras, TensorFlow, and PyTorch). One should realize that the two approaches are equivalent.
Based on multiple experiments, Hinton et al. found dropout to be superior in terms of generalization to other regularization methods [66]. How can dropout be motivated? According to its authors, dropout reduces complex co-adaptations. This means that individual units are driven to be more robust in doing something useful without relying on many other units. They are forced to work in the presence of a random subset of other units and, therefore, with little collaboration. Without dropout, a frequent tendency is that units become co-adapted, i.e., learn a certain complex collaboration, often driven by noise in training data, that relies on the presence of many of them or even all of them. A nice, non-technical explanation of this phenomenon comes from considerations of successful conspiracies—quoting [66]: "(…) Ten conspiracies each involving five people is probably a better way to create havoc than one big conspiracy that requires fifty people to all play their parts correctly (…)".

8. Implementation and Experiments

As declared in the introduction, the Python project named hmdl (“home-made deep learning”), associated with this tutorial, constitutes the referenced from-scratch implementation of crucial deep learning ideas. We devote this section to that project and to experiments carried out using it. We hope the contents that follow help the reader understand how DL’s mathematics and algorithms presented so far become translated into the realm of programming and computations.
We begin with a high-level view of hmdl (Section 8.1), presenting its object-oriented design, the main classes, and their responsibilities. In Section 8.3, we show how forward and backward computations are organized in compliance with the main inductive rules (Rule 1 and Rule 2), and we present excerpts of the source code responsible for those computations for common layer types. More attention in this respect is given to convolutional layers, whose various implementations are described separately in Section 8.4. Then, in Section 8.5, we present the compact functions responsible for the random initialization of weights in compliance with Glorot's or He's approaches. The next section, Section 8.6, presents the implementation of the Adam algorithm in hmdl, coupled with regularizations. Finally, in the last Section 8.7, we demonstrate thorough experiments carried out with hmdl and Keras, involving multiple network architectures and three well-known data sets. We instruct the reader as to how the experiments can be reproduced.
  • Note on source codes: When presenting excerpts of hmdl source code in this manuscript, we have removed, for clarity and brevity, the auxiliary code parts, such as verbosity printouts, time measurements, etc. (for full codes, we direct the reader to the repository).

8.1. “Home-Made Deep Learning” Project—A High-Level View

The repository for the hmdl project is placed at https://github.com/pklesk/hmdl. At this opportunity, we would like to give great credit to the Numba compiler (https://numba.pydata.org). Using numba.jit and numba.cuda amenities, important parts of hmdl were compiled into fast machine code (CPU and GPU, respectively). The main module is implemented in the hmdl.py file. The UML diagram in Figure 12 provides a high-level view of hmdl and the classes it involves.
The abstract class Layer is the backbone of the hierarchy, and there are five specialized classes that inherit from it. They represent frequently applied layer kinds discussed in former sections: Conv2D, MaxPool2D, Dense, Flatten, and Dropout. Other new extensions can be introduced. There are some pieces of information common to all layers, defined within the Layer class, in particular input_, output_, delta_backward_input_, delta_, delta_backward_output_, weights_, and gradient_. All of those should be understood as tensors stored as numpy arrays of numpy.float32 numbers. To be precise, weights_ and gradient_ are short lists of numpy arrays. This allows for independent access to proper weights and biases. The dimensionality of arrays is implied by the layer kind. For example, for the Conv2D or MaxPool2D layers, the input_ tensor is a 5-dimensional array, where the successive dimensions represent indexes of the data example (within a batch), image row, image column, input channel, and output channel. As regards Dense layers, their input_ and output_ are 2-dimensional arrays (indexing only data examples and neurons).
Additionally, one should notice the properties named activation_ and activation_d_ present in the Layer class. They are function pointers (or None values), representing the activation function and its derivative, respectively. We apply Python’s getattr(…) mechanism to obtain suitable pointers from string names specified by the user, e.g., "relu", "sigmoid", "softmax", etc.

8.2. Executions of Forward and Backward Computations

During the forward computations, every input_ is mapped to an output_, possibly with the use of weights_ and an invocation of activation_. As regards the backward flow, firstly, the delta_ tensor becomes computed based on activation_d_ and delta_backward_input_ (the latter stores weighted sums of subsequent δ s) in compliance with the inductive Rule 1. Then, the gradient_ becomes computed based on delta_ and input_ in compliance with Rule 2. Finally, the responsibility of each layer is to prepare a delta_backward_output_ tensor, using delta_ and weights_, so that it constitutes a delta_backward_input_ from the perspective of the preceding layer.
To suitably handle the described forward and backward flows, each layer is equipped with function pointers named do_forward_function, do_backward_function, and do_backward_output_function. The design of hmdl allows for the provision of more than one implementation for any of those. An actual implementation to be executed can be chosen either manually by the user or automatically by the framework itself in an attempt to improve efficiency (taking into account the sizes of tensors, filters, etc.). The source code shown below (Figure 13) presents two main functions that execute forward and backward computations from within the abstract class Layer. Inside the functions, one can note suitable invocations performed via the pointers do_forward_function, do_backward_function, and do_backward_output_function, assumed to be defined by subclasses.
Moreover, it should be remarked that layer instances belonging to the same class but present in different places of a network structure are allowed to execute different implementation variants of their forward/backward computations. For example, in the Conv2D class, there exist 7 implementation variants that can be inserted into the do_forward_function pointer. They differ with respect to computational resources (CPU: numpy or numba.jit; GPU: numba.cuda) or the algorithmic approach (direct, tiles, or via FFT). We discuss them in Section 8.4.
Networks built with hmdl are instances of the SequentialClassifier class. Its add(…) method allows the user to append successive layers in a manner similar to Keras’s API [68]. The class is additionally equipped with the fit, predict, and predict_proba methods, following the convention known from the sklearn module. In fact, the SequentialClassifier inherits from two classes, BaseEstimator and ClassifierMixin, that come from sklearn.base. It is also worth remarking that a classifier instance initializes every backward flow induction by calling the self.loss_d_(…) function, a pointer to the categorical_crossentropy_d or square_loss_d. Figure 14 presents the source code of the fit function for SequentialClassifier, whereas Figure 15 presents the core parts of its forward and backward functions involved in the process of fitting.

8.3. Automatic Differentiation (Backward Computations) for Common Layer Types

Flattening layers (class Flatten): We purposely start this subsection with the simplest layer type as far as differentiation is concerned—namely the flattening layer. We remind the reader that the chain rule (Rule 1) that computes the δ X tensor within the backpropagation induction says to apply the derivative of the activation function multiplied by the weighted sum of subsequent δ Y values taken over outgoing connections. In the case of flattening layers, each neuron—a signal in the input tensor X (of shape H × V × C )—has exactly one outgoing connection to its counterpart in the flattened vector Y. Therefore, the appropriate inductive formula (we repeat Formula (47) from Section 4) is as follows:
$$\delta_X := \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(I_{H \cdot V \cdot C} \cdot \delta_Y\big) = \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(\delta_Y\big).$$
Backward computations associated with this formula are reduced to a suitable call of the reshape function, preceded by a call of self.activation_d_() to compute the derivative of the activation function (if any activation was imposed by the user). The source code in Figure 16 presents those simple computations.
Dropout layers (class Dropout): In the case of dropout layers, automatic differentiation is also quite simple. Below, we repeat Formulas (48) and (49) for both the forward and backward computations of those layers. We remind the reader that r * represents the imposed rate of signals to be dropped (zeroed) in the current pass of computations, whereas the tensor R = ( r j , k , q ) , of shape H × V × C , has entries drawn randomly from a standard uniform distribution, r j , k , q U ( 0 , 1 ) .
$$Y := \frac{1}{1 - r^*} \cdot [R > r^*] \odot \phi(X), \qquad \delta_X := \phi'(X) \odot \frac{1}{1 - r^*} \cdot [R > r^*] \odot \delta_Y.$$
At the implementation level, it is useful to memorize, during the forward pass, the tensor of random entries, [ R > r * ] (the ones to be preserved), together with their multipliers, 1 / ( 1 r * ) , and then to make use of it during the backward pass. In the code from Figure 17, that combined information is memorized as the attribute self.coeffs_. As was the case for the flattening layers, also here, in the dropout layers, each “neuron” can be regarded as having exactly one outgoing connection.
Dense layers (class Dense): In contrast to flattening and dropout layers, dense layers do include tunable parameters—weights. Therefore, the backward pass for this layer type involves both the inductive computations of δ values (Rule 1) and the computations of the gradient (Rule 2). We remind the reader of the necessary Formulas (52)–(54) from Section 4 below.
$$\nabla_Y := \delta_Y \cdot \phi(X)^T, \qquad \nabla_{0,Y} := \delta_Y \cdot \mathbb{1}_{N \times 1}, \qquad \delta_X := \phi'(X) \odot \big(W_Y^T \cdot \delta_Y\big).$$
The code excerpts shown in Figure 18 are responsible for the above computations. It is also worth recalling (as noted in Section 8.1) that, inside hmdl tensors, the first dimension is always related to the data example index within the current batch. Therefore, in the code below, the attributes delta_backward_input_, delta_, and delta_backward_output_ are two-dimensional arrays, where the first dimension pertains to data examples and the second to neurons within the given layer. Owing to that fact, suitable summations over the current batch needed to compute the self.gradient_ are carried out by means of the .dot product, i.e., the standard algebraic matrix multiplication (see the function do_backward_numpy).
Max pooling layers (class MaxPool2D): In the hmdl project, there are three implementation variants for computations carried out within the max pooling layers. The first one is a “pure” numpy variant. The second is an accelerated version of the first, obtained by means of numba module and the @jit(nopython=True, …) decorator, allowing for low-level just-in-time compilation. The third variant also applies numba, but it performs GPU computations—it requires the @cuda.jit decorator over a suitably programmed kernel function in compliance with the CUDA computational model. For clarity of presentation, below, we show the source codes only for the simplest numpy-based variants of computations. Source codes for the two other variants are moved to Appendix A.
Recall the notation and suitable formulas from Section 4; the pools are assumed to be of size $S \times S$, and the single outputs, for all $0 \le j < H/S$, $0 \le k < V/S$, and $0 \le q < C$, are computed as follows:
$$y_{j,k,q} := \max_{\substack{jS \le u < (j+1)S \\ kS \le v < (k+1)S}} \phi(x_{u,v,q}).$$
Again, it is useful to memorize, during the forward pass, the coordinates for which the maxima are attained, denoted by
$$a^*_{j,k,q} := \operatorname*{arg\,max}_{\substack{jS \le u < (j+1)S \\ kS \le v < (k+1)S}} \phi(x_{u,v,q}),$$
as they are needed at the stage of backward computations. Figure 19 presents the source codes responsible for forward computations in the max pooling layers in the simplest numpy-based variant.
As regards the backward pass and automatic differentiation for this layer type, Formula (43), required for that purpose, is as follows:
$$\delta_{X;u,v,q} := \begin{cases} \phi'(x_{u,v,q}) \cdot \delta_{Y;\lfloor u/S\rfloor,\, \lfloor v/S\rfloor,\, q}, & \text{if } a^*_{\lfloor u/S\rfloor,\, \lfloor v/S\rfloor,\, q} = (u,v); \\ 0, & \text{otherwise.} \end{cases}$$
Backward computations compliant with this formula are presented in Figure 20 (numpy-based variant). Please note how the attribute self.argmax_, memorized during the forward pass, is reused at this stage. We remind the reader that other computational variants are available in Appendix A.
Convolutional layers (class Conv2D): In the hmdl project, there are seven implementation variants of forward computations for convolutional layers and four variants of backward computations. Since this subsection is on automatic differentiation, we discuss here only the backward variants. Forward variants and their efficiency are discussed in a separate section, Section 8.4.
As regards backward computations in convolutional layers, the variants available in the hmdl project are numpy_gemm, numba_jit_gemm, numba_cuda_direct, and numba_cuda_tiles. For clarity of presentation, here, we present source codes only for the numpy_gemm variant; the remaining variants are moved to Appendix B.
We start by reminding the reader of Formula (38) (from Section 4) pertaining to the computation of gradient for convolutions in compliance with the GEMM approach:
$$\nabla_Y := \mathop{\mathrm{S}}_{0 \le q < K}\ \mathrm{R}_{F \times F \times C}\!\left( \mathrm{R}_{1 \times H \cdot V}\big(\delta_{Y;\cdot,\cdot,q}\big) \cdot \mathop{\mathrm{S}}_{\substack{-\lfloor F/2 \rfloor \le a, b \le \lfloor F/2 \rfloor \\ 0 \le p < C}} \mathrm{R}_{H \cdot V \times 1}\Big( \phi\big(X\big)^{P}_{a : a + H,\ b : b + V,\ p} \Big) \right), \qquad \nabla_{0,Y} := \mathop{\mathrm{S}}_{0 \le q < K} \mathrm{R}_{1 \times H \cdot V}\big(\delta_{Y;\cdot,\cdot,q}\big) \cdot \mathbb{1}_{H \cdot V \times 1},$$
Figure 21, shown below, presents the source code responsible for those computations.
As regards error expressions for convolutional layers, their computation has been defined by Formula (39), which we remind the reader of below.
$$\delta_X := \phi'(X) \odot \mathop{\mathrm{S}}_{0 \le q < C}\ \mathrm{R}_{H \times V}\!\left( \mathrm{R}_{1 \times F^2 \cdot K}\big(\overline{W}_{Y;\cdot,\cdot,q,\cdot}\big) \cdot \mathop{\mathrm{S}}_{\substack{0 \le j < H \\ 0 \le k < V}} \mathrm{R}_{F^2 \cdot K \times 1}\Big( \delta^{P}_{Y;\ j - \lfloor F/2 \rfloor : j + \lfloor F/2 \rfloor + 1,\ k - \lfloor F/2 \rfloor : k + \lfloor F/2 \rfloor + 1} \Big) \right).$$
The source code corresponding to that computation is presented in Figure 22.

8.4. Efficiency of Forward Computations in Convolutional Layers for Different Implementation Variants

As mentioned earlier, there are seven implementation variants provided for forward computations in convolutional layers. Below, we list their code names with brief descriptions. The code names constitute suffixes of the actual full function names, e.g., do_forward_numba_cuda_tiles(.). Additionally, CUDA-related functions execute from within the so-called GPU kernels with computations designed for each single GPU thread (affix: _job). For more details, we direct the reader to the code and bibliographic references.
  • numpy_gemm—GEMM-based convolution; see (9), involving image-to-matrix and weights-to-matrix transformations, and a numpy.dot multiplication of matrices, executed from regular Python code (a minimal sketch of this idea is shown after this list).
  • numba_jit_direct—definition-based convolution; see (8), implemented directly by multiple loops and compiled and executed by means of numba.jit.
  • numba_jit_gemm—GEMM-based convolution (as in variant 1); compiled and executed by means of numba.jit.
  • cv2—convolution based on an OpenCV function cv2.filter2D (faster than scipy.signal.convolve2d); involves multiple invocations of the function to suitably handle input–output channels.
  • numba_cuda_direct—definition-based convolution (as in 2); compiled and executed on GPU via numba.cuda.
  • numba_cuda_tiles—definition-based convolution that partitions inputs into tiles associated with CUDA thread blocks and involves collaboration of threads in order to copy inputs and weights into efficient shared memory; compiled and executed on the GPU by means of numba.cuda.
  • numba_cuda_viafft—takes advantage of the fact that the convolution (for the $q$-th output channel) can be computed as $\mathrm{IFFT}_2\big( \sum_p \mathrm{FFT}_2(W_{\cdot,\cdot,p,q}) \cdot \mathrm{FFT}_2(X_{\cdot,\cdot,p}) \big) + w_{0;q}$, where $p$ iterates over input channels; involves invocations of multiple GPU kernels and is compiled and executed on the GPU by means of numba.cuda.
  • Winograd-based implementations are currently not provided in the hmdl project.
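As promised above, the following is a minimal NumPy sketch of the image-to-matrix (im2col) idea behind the numpy_gemm variant for a single example. It assumes "same" zero padding, an odd filter size, and stride 1, and it computes the correlation-style convolution used in DL frameworks; the exact index conventions and batching of the hmdl code may differ.

```python
import numpy as np

def conv2d_gemm(phi_x, W, w0):
    """2D 'same'-padded convolution via the GEMM (im2col) approach.
    phi_x: (H, V, C) input, W: (F, F, C, K) filters, w0: (K,) biases."""
    H, V, C = phi_x.shape
    F, _, _, K = W.shape
    pad = F // 2
    xp = np.pad(phi_x, ((pad, pad), (pad, pad), (0, 0)))
    # image-to-matrix: one row of length F*F*C per output position
    cols = np.empty((H * V, F * F * C), dtype=phi_x.dtype)
    for j in range(H):
        for k in range(V):
            cols[j * V + k, :] = xp[j:j + F, k:k + F, :].ravel()
    # weights-to-matrix and a single large matrix multiplication
    Wm = W.reshape(F * F * C, K)
    Y = cols.dot(Wm) + w0
    return Y.reshape(H, V, K)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3)).astype(np.float32)
W = rng.standard_normal((3, 3, 3, 16)).astype(np.float32)
w0 = np.zeros(16, dtype=np.float32)
print(conv2d_gemm(x, W, w0).shape)   # (8, 8, 16)
```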
Figure 23 compares execution times (averaged over 10 repetitions) of convolutional layers with 64 input and 64 output channels and a batch of size 32 for different implementations, as well as filter and image sizes. Note the logarithmic scale on the vertical axes. In the first two graphs, we additionally plotted the times achieved using Keras (backed with TensorFlow) for reference. Details of the hardware and software environment on which the executions were performed are provided below.
  • "NVIDIA Quadro M4000M" environment—hardware: Lenovo ThinkPad P70, Intel Xeon CPU E3-1505M v5 2.8 GHz (3.7 GHz boost), 64 GB RAM, and NVIDIA Quadro M4000M GPU (4 GB/32 GB of dedicated/shared memory), manufactured by: Lenovo Group Limited, Beijing, China; software: nvcc 11.6 (V11.6.112), Python 3.9.7, numpy 1.20.0, numba 0.54.1, cv2 4.5.5-dev, keras 2.8.0, and tensorflow 2.8.0; OS: Windows 10 Pro (22H2).
The following observations can be made based on the time measurements obtained and Figure 23. The numba_jit_direct and cv2 implementations are clearly the worst, with the latter being more efficient for larger filter and image sizes. The variants numpy_gemm and numba_jit_gemm perform similarly well, with a slight advantage of the latter. Interestingly, for a small 3 × 3 filter, their results are comparable with numba_cuda_tiles, or actually better for images not bigger than 8 × 8. For larger images and filters, the advantage of GPU parallelism manifests itself, and the numba_cuda_direct/tiles/viafft variants become clearly better than the _gemm CPU variants. For a 7 × 7 filter, numba_cuda_tiles starts to be competitive with numba_cuda_direct, and it beats the latter regularly for even larger filters and images not smaller than 16 × 16. As regards numba_cuda_viafft, one can benefit from its efficiency also for larger images and sufficiently large filters. In fact, one can note, in the last graph, that the execution times of numba_cuda_viafft become approximately equal (see the overlapping red curves) for all large filter sizes from 11 × 11 onwards and suitably large images—a fact compliant with the algorithmic properties of FFT-based convolutions. In those cases, numba_cuda_viafft was superior to both numba_cuda_direct and numba_cuda_tiles. Last, but certainly not least, one should notice the very high efficiency of Keras—the pink curves in the first two graphs. In particular, for the 7 × 7 filter and 64 × 64 images, the execution time of Keras was better by roughly one order of magnitude than that of numba_cuda_direct and numba_cuda_tiles. This demonstrates how highly optimized the underlying TensorFlow is in terms of exploiting GPU power and resources.
All tests on convolutional layers can be re-executed (script: tester.py), and their current log files are available in the repository (folder: tests/).

8.5. Implementation of Weight Initialization

In compliance with mathematical analysis presented in Section 5, weight initialization performed according to Glorot’s method [17] is conducted in the hmdl project via the functions shown in Figure 24. We remind the reader that this kind of initialization is appropriate for non-convolutional layer types.
For convolutional layers, hmdl performs weight initialization in compliance with the recipe due to He [29], using the functions presented in Figure 25.

8.6. Implementation of Adam Coupled with Regularizations

The source code presented below in Figure 26 constitutes the continuation of the one from Figure 15—a backward pass for a SequentialClassifier. It demonstrates hmdl's implementation of the Adam algorithm, explained in Section 6.3. If some form of regularization ($\ell_1$, $\ell_2$, or gradient clipping) is imposed by the user, the code below first applies the suitable updates to the gradient tensor for the current layer, and then it conducts the actual operations defined by Adam. Recall that those operations are the computation of the first and second gradient moments, removing the bias introduced via EMAs, and applying the final correction to weights.

8.7. Deep Learning Experiments

All experiments reported in this section can be reproduced by the reader using the script experimenter.py. It allows training (optional) and testing to be run for one of three well-known data sets, Olivetti (https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), MNIST (https://yann.lecun.com/exdb/mnist/), or CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) (accessed on 22 October 2024) [69,70,71] (see Figure 27), using two DL frameworks, hmdl and Keras. For every conducted experiment, we have generated a distinct 10-digit identifier, hashed based on given network structure and an additional description (data name, initial randomization seed, and number of repetitions). The log files and classifiers obtained can be found in the repository (folders experiments/, hmdl_clfs/, keras_clfs/) with their names prefixed using the appropriate identifier. Detailed instructions on how to run a new experiment or reproduce an existing one (when knowing its identifier) were moved to Section 8.8.
Two hardware environments were applied to conduct the experiments—the first one was a personal computer equipped with NVIDIA Quadro M4000M as the GPU device (details on page 42), and the second was a high-performance server equipped with the NVIDIA A100 GPU for virtualized computations (details provided below). We explain that the complete set of all experiments was carried out independently in each of the two environments.
  • “GRID A100-7-40C MIG 7g.40gb” environment—hardware: Ubuntu Server 20.4.03, 4 CPUs: AMD EPYC 7H12 64-Core (2.6 GHz), 128 GB RAM, and NVIDIA GRID A100-7-40C vGPU, manufactured by: Supermicro (Super Micro Computer, Inc.), San Jose, CA 95131, USA; software: nvcc 11.4 (V11.4.48), Python 3.8.10, numpy 1.23.5, numba 0.56.4, cv2 4.7.0, keras 2.12.0, and tensorflow 2.12.0; OS: Ubuntu 20.04.3 LTS.
Table 1, Table 2 and Table 3 gather all the results. In strings representing the structures of tested neural networks, the following notation was applied. The abbreviations "C", "M", "F", "D", and "DR" stand for different layer types: convolutional, max pooling, flattening, dense, and dropout, respectively. The parameters of a layer are written in parentheses. For example, C(3, 32, r) stands for a convolutional layer using a 3 × 3 filter and producing 32 output channels (kernels) activated via the ReLU function, denoted as "r". DR(0.125) stands for a dropout layer using the rate of 0.125. M(2) represents a max pooling layer that applies 2 × 2 regions to produce a down-sampled output tensor. D(10, sm) represents a dense layer with 10 neurons activated via the softmax function. The symbol "× 2" denotes a stack of the same two layers.
Due to the small size of the Olivetti data (400 examples in total), its experimental executions were repeated 10 times (with successive randomization seeds starting from zero, and the data were split randomly as 70:30 into training and test parts), and the results were averaged. MNIST and CIFAR-10 are moderately large sets (60 k and 50 k training examples, respectively, and 10 k testing examples for both); therefore, their experiments were conducted just once. For all data sets, we imposed 100 epochs of training but different counts of batches per epoch (n_batches parameter in hmdl library): 10 batches for Olivetti, 20 for MNIST, and 50 for CIFAR-10. These counts were suitably translated into the batch_size parameter in Keras.
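The translation we assume here is simply batch_size ≈ n_train / n_batches (this reading is ours; the exact conversion is done inside experimenter.py); e.g., for MNIST:

```python
# assumed translation of hmdl's n_batches into Keras' batch_size (our reading, for illustration)
n_train = 60_000                      # MNIST training examples
n_batches = 20                        # batches per epoch imposed in hmdl
batch_size = n_train // n_batches     # 3000, passed as batch_size to Keras' fit(...)
```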
We experimented with networks of varying depth (16 layers at most) and various structural ideas: dense layers alone, single convolutional layers in front of pooling layers, blocks of stacked convolutional layers (VGG-like structures), varying filter sizes and counts of output channels, dropout rates, etc.
We must emphasize that the main goal of these experiments was not to obtain excellent classifiers for the mentioned data sets. Instead, the goal was to verify the consistency of results from Keras and the “home-made” implementation, when the same architectures were applied. We remark that both implementations used a uniform He initialization of weights for layers with ReLUs, a uniform Glorot initialization otherwise, and the Adam optimizer (η = 10⁻³, β₁ = 0.9, β₂ = 0.999, and ϵ = 10⁻⁷).
Finally, we need to remark that, owing to the imposed randomization seeds, the reported accuracies and loss values can be reproduced exactly for a fixed GPU device. We imposed seed=0 as the default value across several modules—in particular, we executed random.seed(seed), tf.random.set_seed(seed), and np.random.seed(seed). Also, please see the set_tf_global_determinism(seed) function, where we forced the deterministic behavior of TensorFlow and CUDA; additional data set- or repetition-related seeds can be changed by the user via the data_description dictionary and its entry named "seed", placed in the main program function. One should realize that results do differ between devices due to the order and properties of computations carried out on the GPU (e.g., the number of streaming multiprocessors and CUDA cores, the scheduling scheme, etc.).
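A typical shape of such a seeding routine is sketched below; the actual set_tf_global_determinism function in the repository may differ in details (the TF_DETERMINISTIC_OPS variable and the enable_op_determinism() call are our assumptions about one common way of enforcing determinism in TensorFlow 2.x).

```python
import os
import random
import numpy as np
import tensorflow as tf

def set_global_determinism(seed=0):
    """Illustrative seeding routine; the hmdl function set_tf_global_determinism may differ."""
    random.seed(seed)                    # Python's built-in RNG
    np.random.seed(seed)                 # NumPy RNG
    tf.random.set_seed(seed)             # TensorFlow RNG
    os.environ["TF_DETERMINISTIC_OPS"] = "1"          # request deterministic GPU kernels
    tf.config.experimental.enable_op_determinism()    # available in TF >= 2.9
```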
The overall results confirm several well-known or intuitive facts. CIFAR-10, the hardest among the data sets, allowed only a few structures to surpass 80% accuracy, whereas it was fairly easy to beat 99.2% for MNIST. For these two sets, it is also easy to see that dense layers alone are not sufficient for good accuracy—convolutional layers are needed. Furthermore, blocks of stacked convolutional layers (VGG-like) led to the best results. Due to the small size of the Olivetti data, the above observations were not evident on that set. The results oscillated around 94% ± 2% with little dependence on the structure.
In every row of Table 1, Table 2 and Table 3, blue is used to mark the higher accuracy and green the smaller loss. The last two columns report relative differences in the results between the implementations, i.e., higher accuracy / lower accuracy − 1 and 1 − smaller loss / larger loss (as percentages). As regards accuracy, the results are highly consistent—for a large majority of cases, the relative difference is smaller than 1%, with few outliers (the worst: 3.54%). Naturally, categorical cross-entropy loss is a more sensitive measure than accuracy. See, e.g., the MNIST classifiers, where high accuracies of 99.4% or 99.5% are accompanied by losses ranging from about 0.016 to 0.032. Thus, relative loss differences are naturally of a larger magnitude.
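For instance, for experiment no. 7 on MNIST (Quadro M4000M device), the accuracies 0.9944 (hmdl) and 0.9937 (Keras) give 0.9944/0.9937 − 1 ≈ 0.07%, and the losses 0.0205 (hmdl) and 0.0301 (Keras) give 1 − 0.0205/0.0301 ≈ 31.9%, which are the values reported in the last two columns of Table 2.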
For the same architectures, the results of hmdl and Keras are highly consistent in terms of accuracy. One might also be interested in simply counting the times that hmdl or Keras actually returned a better model, neglecting the margin and relative difference in accuracy (typically smaller than 1%). For the Olivetti data, hmdl did so in 7 out of the total of 22 executions (hence, the Keras models were better in 15 cases). For MNIST, hmdl produced better models in 6 out of the 20 executions, whereas for CIFAR-10, the hardest data set, it did so in 15 out of 24 cases.
As regards some particular results, the following observations can be made. For the MNIST data set, the structures of neural networks from experiments 7–10 contained the aforementioned VGG blocks, i.e., stacks of convolutional layers (denoted by C(⋯)×2 strings), followed by max pooling layers. Those structures translated into higher accuracies (typically surpassing 99.40%) than the structures from experiments 1–6, without the stacked convolutions. A similar observation is true for experiments on the CIFAR-10 data set, where the gap between such groups of structures is even more noticeable. In experiments 8–12, all networks involving the stacked convolutions surpassed an accuracy of 76% (reaching 83% ± 1% in a few cases), whereas all networks from experiments 1–7, without stacked convolutions, had trouble reaching a 70% accuracy level. Some of the experiments were planned to verify whether dropout layers distributed along the structure translate into better results when working with constant rates (e.g., equal to 0.125) or rates increasing towards the “right end” of a network (e.g., 0.125, 0.25, and 0.5). In the case of the MNIST data, no clear difference in this respect was observed. Interestingly, for the CIFAR-10 data, models with increasing dropout rates (experiment no. 12) performed significantly better than their counterparts with constant rates (experiment no. 11). Differences of approximately 2% in accuracy were noticed. This observation was true for models that came from both frameworks (Keras and hmdl) and from both GPU devices (Quadro M4000M and A100). Finally, as regards comparisons of filter sizes for convolutions (7 × 7, 5 × 5, and 3 × 3) spread across various experiments, we did not observe any clear influence of filter size on accuracy. On the other hand, in the large majority of cases, larger filter sizes naturally translated into longer model fitting times.
In this context, it is worth mentioning some recent results, from 2021 to 2024, achieved for the MNIST and CIFAR-10 benchmarks using the newest non-standard deep learning ideas. As regards MNIST, the results based on so-called capsule networks stand out. For example, Mazzia et al. [42] report an accuracy of 99.84% using a capsule network with self-attention routing (Efficient-CapsNet) involving only 161 k trainable parameters. On the other hand, Byerly et al. [43] apply a similar approach, but they claim that no routing between capsules is needed. Their model achieves a 99.87% accuracy but employs significantly more trainable parameters—1.5 M. As regards CIFAR-10, recent results differ quite significantly, depending on the new ideas researchers experiment with. For example, Dhakad et al. [44] propose an interesting idea of a “scalable hierarchical aware CNN” (SHA-CNN). They carry out tests on MNIST, CIFAR-10, and CIFAR-100, but the accuracy achieved for CIFAR-10—83.35%—is not very high. In fact, some of our CNNs obtained using Keras or hmdl surpass that result. On the other hand, there exist recent articles that report outstanding accuracies for CIFAR-10. For example, Touvron et al. [45] apply so-called vision transformers and obtain an accuracy of 99.3%. Interestingly, some authors stick to traditional CNN architectures but focus on variations of the learning algorithm itself, using ensemble or federated learning. For example, Bruno, Moroni, and Martinelli [46] apply two networks with the so-called EfficientNet-b0 architecture. The two networks are trained on disjoint data subsets (bagging) and then as an adaptive ensemble by fine-tuning the trainable combination layer. The authors achieve a 99.612% accuracy for CIFAR-10.
Despite the consistency of hmdl and Keras observed in our experiments in terms of accuracy and convergence, we must fairly note the evident superiority of Keras (TensorFlow) over hmdl in terms of time efficiency. In many cases, the hmdl implementation was slower by at least an order of magnitude. For structures with multiple convolutional layers, the Keras fit times were roughly 40 times shorter (despite the application of numba.cuda in hmdl) in the case of the NVIDIA Quadro M4000M device and roughly 90 times shorter in the case of the NVIDIA Grid A100 device. The main reason working in favor of TensorFlow’s efficiency is the highly optimized cuDNN low-level library that directly supports DNN-related primitives present within TensorFlow and accelerates GPU computations, convolutions, and pooling in particular. On the other hand, some possible reasons working against hmdl’s time efficiency (in spite of numba.cuda) include Python’s object-oriented overhead, wrapping costs for Numba, and CPU–GPU memory management.

8.8. Reproducing Experiments (or Running New Ones)

Figure 28 presents the list of main settings that can be imposed on an experiment using the experimenter.py script. The very first variable, EXPERIMENT_ID, decides whether an existing experiment is going to be reproduced (in which case its 10-digit identifier is expected) or a new one executed (in which case the None value is expected). The example setting shown in the figure, EXPERIMENT_ID = "1224383397", corresponds to the 8th experiment carried out on the MNIST data set (refer back to Table 2). We remind the reader that log files from all reported experiments are available in the repository (https://github.com/pklesk/hmdl) in the folder experiments/ (accessed on 17 October 2024). Identifiers of those experiments are visible in the file names and also inside the logs. Once the EXPERIMENT_ID value is set, the user can carry out an experiment using the standard Python command, i.e., python experimenter.py or python3 experimenter.py.
The two subsequent variables, UNPICKLE_DATA and PICKLE_DATA, are provided for the user’s convenience and they allow, respectively, for reading and writing data arrays from/to a binary file (based on Python’s pickle module). Data sets such as MNIST and CIFAR-10 are fairly large in size, and the setting UNPICKLE_DATA = False causes those original data sets to be imported from within the keras.datasets module and read via mnist.load_data() or cifar10.load_data(), which may be time-consuming. By setting UNPICKLE_DATA = True, one gets the data arrays faster as they are read from pre-prepared pickled binary files.
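Purely for illustration, the top of this settings list can be thought of as a handful of module-level constants of the following shape (the values are examples; the actual list is shown in Figure 28):

```python
# illustrative shape of the main settings in experimenter.py (values are examples)
EXPERIMENT_ID = "1224383397"   # reproduce an existing experiment; use None to run a new one
UNPICKLE_DATA = True           # read data arrays from pre-prepared pickled binary files
PICKLE_DATA = False            # write freshly imported data arrays to pickled binary files
```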
It is also worth explaining the settings prefixed by COPY_INITIAL_WEIGHTS_ and ending with either KERAS_TO_HMDL or HMDL_TO_KERAS. They let the user initialize (or read) weights for a certain network architecture in one of the frameworks and then copy those exact weights to the same architecture in the second framework. This allows for two kinds of comparisons. One is to check whether forward passes of the two networks (from Keras and from hmdl) for the same batch of data are consistent, e.g., how many significant digits of the floating-point response from final output layers agree. The other kind of comparison is to see for how long the learning processes initiated from the same weights remain consistent. We believe the names of other main settings in Figure 28 are self-explanatory.
Figure 29 exemplifies how a structure for a neural network becomes described in the hmdl project (within the __main__ function of the script experimenter.py). The description is formed as a list of tuples. The first tuple defines the general settings for the SequentialClassifier. The other tuples define the subsequent layers, each described by its class name (e.g., Conv2D, MaxPool2D, Dropout, Flatten, and Dense) and a dictionary of its properties.
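As a purely illustrative sketch (the dictionary keys below are our assumptions; the actual property names used by hmdl are visible in Figure 29), such a description might look as follows:

```python
# hypothetical structure description: class names as strings, property keys are illustrative
structure = [
    ("SequentialClassifier", {"n_epochs": 100, "n_batches": 20}),   # general settings
    ("Conv2D",    {"filter_size": 3, "n_kernels": 32, "activation": "relu"}),
    ("Conv2D",    {"filter_size": 3, "n_kernels": 32, "activation": "relu"}),
    ("MaxPool2D", {"pool_size": 2}),
    ("Dropout",   {"rate": 0.125}),
    ("Flatten",   {}),
    ("Dense",     {"n_neurons": 128, "activation": "relu"}),
    ("Dense",     {"n_neurons": 10, "activation": "softmax"}),
]
```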
Finally, Figure 30 shows the code excerpt (in experimenter.py) responsible for the choice of the data set to experiment with and, in particular, the choice of the randomization seed. The same value for the randomization seed implies the same experimental results for all executions on a fixed GPU device (and independently of the CPU).

9. Conclusions

In this tutorial and survey paper, we have shown that a suitable combination of discoveries from several areas was necessary to pave the way for the current high status of deep learning. Automatic differentiation was, obviously, the first necessity. But many critical building blocks came after 2009: GPU computations, proper weight initialization schemes, robust learning algorithms (Adam and similar), or, e.g., dropout regularization that “magically” performs model averaging without actually training multiple models explicitly.
The accessibility of software frameworks for DL is, obviously, an asset, especially for practitioners. On the other hand, we believe implementing a deep neural network from scratch is a valuable exercise that strengthens the low-level understanding of the algorithms involved. In fact, such an understanding is a must for new discoveries to come. We hope this paper, aside from serving as a survey, can play the role of a useful tutorial for that purpose, especially for new researchers in the field.
There are two main directions in which our future work related to this tutorial may develop. The first one is purely technical and related to the hmdl project. Its efficiency and functionality could be enhanced through several new elements, such as reducing the amount of host–device memory transfers, the implementation of Winograd fast convolutions, introducing new layer types, etc. The second possible direction is educational. In this tutorial (concise but already quite long), we have covered the well-established elements of contemporary deep neural networks. We are aware that the freshest ideas, such as recurrent independent mechanisms, transformers, federated learning, gradient sharing, and several others, may require separate treatment, especially regarding automatic differentiation within those techniques.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The three data sets (known as Olivetti, MNIST, and CIFAR-10) applied in this research are open and publicly available benchmarks commonly applied to academic research on image recognition. Suitable references to the sources and authors of these data sets have been placed in the manuscript.

Acknowledgments

We would like to express our gratitude and give credit to the authors of Numba—an open-source JIT accelerating compiler for Python (https://numba.pydata.org).

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Implementation Variants Supported by Numba for Max Pooling Layers (Hmdl Project)

Appendix A.1. Variant Accelerated Using Low-Level Compilation by numba.jit

This variant applies when the setting do_forward_impl="numba_jit" is submitted to the constructor of MaxPool2D objects. Then, the function pointer self.do_forward_function_ points to the do_forward_numba_jit method presented in Figure A1. Please note the full specification of types for the arguments that come after the @jit decorator.
Figure A1. Forward computations for class MaxPool2D: variant based on numba’s just-in-time compilation.
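To convey the flavor of this variant without reproducing the figure, here is a minimal, self-contained sketch of a jit-compiled forward max pooling for a batch tensor of shape (m, H, W, C) and pool size S; the function name, the eager type signature, and the tensor layout are our assumptions, not the hmdl code from Figure A1.

```python
import numpy as np
from numba import jit

# A minimal sketch (not the hmdl code): forward max pooling compiled eagerly by numba.jit
# for an input tensor of shape (m, H, W, C) and pool size S.
@jit("float32[:,:,:,:](float32[:,:,:,:], int64)", nopython=True, cache=True)
def max_pool_forward_jit(X, S):
    m, H, W, C = X.shape
    out = np.empty((m, H // S, W // S, C), dtype=np.float32)
    for i in range(m):
        for j in range(H // S):
            for k in range(W // S):
                for c in range(C):
                    best = X[i, j * S, k * S, c]
                    for dj in range(S):
                        for dk in range(S):
                            v = X[i, j * S + dj, k * S + dk, c]
                            if v > best:
                                best = v
                    out[i, j, k, c] = best
    return out

# example usage:
# X = np.random.rand(32, 28, 28, 16).astype(np.float32)
# P = max_pool_forward_jit(X, 2)
```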
Figure A2 presents the analogous backward computations, used when the function pointer self.do_backward_function_ points to the do_backward_numba_jit function.
Figure A2. Backward computations for class MaxPool2D: variant based on numba’s just-in-time compilation.

Appendix A.2. Variant Accelerated Using GPU Computations and numba.cuda

Another variant of the max pooling computations in the hmdl project is implemented as a kernel function allowing for GPU computations. Instead of directly using NVIDIA’s CUDA API in the C language, hmdl takes advantage of the numba.cuda module, which offers an analogous API and CUDA mechanisms for Python. This variant applies when the setting do_forward_impl="numba_cuda" is submitted to the constructor of the MaxPool2D object.
Figure A3 and Figure A4 contain the source code of this variant for the forward computations. The function do_forward_numba_cuda_direct, presented in Figure A3, iterates over subbatches within the current batch stored in self.input_, and, using the CUDA streaming mechanism, it copies the subbatches from the host to the GPU device memory. Then, an invocation of the actual CUDA kernel function, named do_forward_numba_cuda_direct_job, presented in Figure A4, is performed using 512 threads per block (see the tpb value) and the number of blocks per grid (bpg), calculated as follows:
bpg = m · (H/S) · (W/S) · C / tpb,
where m denotes the subbatch size.
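To illustrate the one-thread-per-output-element organization implied by this formula, here is a simplified kernel sketch written in the spirit of (but not identical to) the hmdl code from Figure A4; the names, the flat index decomposition, and the omission of streams are simplifying assumptions.

```python
import numpy as np
from numba import cuda

# Simplified sketch of a forward max pooling CUDA kernel: one thread handles one element
# of the output tensor of shape (m, H//S, W//S, C).
@cuda.jit
def max_pool_forward_cuda_job(X, S, out):
    t = cuda.grid(1)                       # flat global thread index
    m, HS, WS, C = out.shape
    if t >= m * HS * WS * C:
        return
    c = t % C
    k = (t // C) % WS
    j = (t // (C * WS)) % HS
    i = t // (C * WS * HS)
    best = X[i, j * S, k * S, c]
    for dj in range(S):
        for dk in range(S):
            v = X[i, j * S + dj, k * S + dk, c]
            if v > best:
                best = v
    out[i, j, k, c] = best

# host-side launch for one subbatch (streams omitted for brevity; sizes are examples):
# X_d = cuda.to_device(X_subbatch)
# out_d = cuda.device_array((m, H // S, W // S, C), dtype=np.float32)
# tpb = 512
# bpg = (m * (H // S) * (W // S) * C + tpb - 1) // tpb
# max_pool_forward_cuda_job[bpg, tpb](X_d, S, out_d)
```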
Figure A3. Forward computations for class MaxPool2D: variant implemented for GPU computations using numba.cuda module.
Figure A4. CUDA kernel function do_forward_numba_cuda_direct_job for forward max pooling computations invoked via function do_forward_numba_cuda_direct.
Backward GPU computations are organized analogously, using the streaming mechanism (over data subbatches) and invocations of a CUDA kernel function. Figure A5 and Figure A6 present the source codes involved in those computations. One can note how the self.argmax_ information, memorized during the forward pass, is reused at the stage of backward computations.
Figure A5. Backward computations for class MaxPool2D: variant implemented for GPU computations using numba.cuda module.
Figure A6. CUDA kernel function do_backward_numba_cuda_direct_job for backward max pooling computations invoked via function do_backward_numba_cuda_direct.

Appendix B. Implementation Variants Supported by Numba for Convolutional Layers (Hmdl Project)—Backward Computations

Appendix B.1. Variant Accelerated Using Low-Level Compilation via numba.jit

This variant applies when the setting do_backward_impl="numba_jit_gemm" is submitted to the constructor of the Conv2D object. Then, the function pointer self.do_backward_function_ points to the do_backward_numba_jit_gemm method presented in Figure A7. Please note the full specification of types for arguments that come after the @jit decorator.
Figure A7. Backward computations of gradient for class Conv2D: variant based on numba’s just-in-time compilation.

Appendix B.2. Variant Based Directly on Definition Accelerated Using GPU Computations and numba.cuda

Similar to the CUDA-based max pooling implementation from Appendix A.2, the implementation for convolutional layers presented here is split into two functions. The outer function do_backward_numba_cuda_direct, presented in Figure A8, iterates over subbatches within the current batch stored in self.input_, and, using the CUDA streaming mechanism, it copies the subbatches from the host to the GPU device memory. Then, an invocation of the actual CUDA kernel function, named do_backward_numba_cuda_direct_job, presented in Figure A9, is performed using 512 threads per block (see the tpb value) and the number of blocks per grid (bpg), calculated as:
bpg = m · F² · C · K / tpb,
where m denotes the subbatch size.
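The launch configuration itself can be expressed in a few lines; in the sketch below the dimension values are examples, and the ceiling division (so that the grid covers all m · F² · C · K threads) is our assumption about how the plain division above is realized in practice.

```python
# Illustrative computation of the 1-D launch configuration described above
# (the real kernel invocation appears in Figure A9; dimension values are examples).
m, F, C, K = 16, 5, 32, 64                    # subbatch size, filter size, channels, kernels
tpb = 512                                     # threads per block
bpg = (m * F * F * C * K + tpb - 1) // tpb    # blocks per grid (rounded up)
# do_backward_numba_cuda_direct_job[bpg, tpb](...)   # launch; arguments omitted
```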
Figure A8. Backward computations of gradient for class Conv2D: variant based directly on definition, using GPU computations and numba.cuda.
Figure A9. CUDA kernel function do_backward_numba_cuda_direct_job for backward convolutional computations invoked via function do_backward_numba_cuda_direct.

Appendix B.3. Variant Based on Tiles Accelerated Using GPU Computations and numba.cuda

The CUDA-based variant presented in this section organizes computations into tiles of size 16 × 16. It applies 256 threads per block, defined by the three-dimensional tuple tpb = (1, 16, 16), and, accordingly, the three-dimensional grid of CUDA blocks defined as follows:
bpg = (m · C · K, bpg_h, bpg_w),
where m is the subbatch size, and bpg_h and bpg_w denote the counts of tiles along the last two dimensions, both calculated as F/16.
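For comparison with the 1-D grid of the previous variant, the 3-D launch configuration can be sketched as follows (dimension values are examples, and the rounding up of the tile counts is our assumption):

```python
# Illustrative 3-D launch configuration for the tiled variant (actual code in Figure A10).
m, C, K, F = 16, 32, 64, 28           # example dimensions (assumed for illustration)
tpb = (1, 16, 16)                     # 256 threads per block, arranged as one 16x16 tile
bpg_h = (F + 16 - 1) // 16            # tiles along the second-to-last dimension
bpg_w = (F + 16 - 1) // 16            # tiles along the last dimension
bpg = (m * C * K, bpg_h, bpg_w)       # three-dimensional grid of blocks
# do_backward_numba_cuda_tiles_job[bpg, tpb](...)   # kernel launch; arguments omitted
```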
Figure A10 presents the source code of the outer function (iterating over subbatches and applying the streaming CUDA mechanism). For each subbatch, the outer function invokes the actual kernel function presented in Figure A11.
Figure A10. Backward computations of gradient for class Conv2D: variant based on tiles, using GPU and numba.cuda.
Figure A11. CUDA kernel function do_backward_numba_cuda_tiles_job for backward convolutional computations invoked via function do_backward_numba_cuda_tiles.

References

  1. Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  2. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  3. McCulloch, W.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
  4. Rosenblatt, F. The Perceptron: A Perceiving and Recognizing Automaton; Technical Report 85–460–1; Cornell Aeronautical Laboratory, Inc.: Buffalo, NY, USA, 1957. [Google Scholar]
  5. Linnainmaa, S. Algoritmin Kumulatiivinen Pyoristysvirhe Yksittaisten Pyoristysvirheiden Taylor-Kehitelmana. Master’s Thesis, University of Helsinki, Tapiola, Finland, 1970. [Google Scholar]
  6. Fukushima, K. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
  7. Rumelhart, D.; Hinton, G.; Williams, R. Learning Representations by Back-propagating Errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  8. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  9. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  10. Aizenberg, I.; Aizenberg, N.N.; Vandewalle, J. Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
  11. Heck, L.; Konig, Y.; Sönmez, M.; Weintraub, M. Robustness to Telephone Handset Distortion in Speaker Recognition by Discriminative Feature Design. Speech Commun. 2000, 31, 181–192. [Google Scholar] [CrossRef]
  12. Ng, A. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the 21st International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; ICML’04. pp. 78–85. [Google Scholar] [CrossRef]
  13. Hinton, G.; Osindero, S.; Teh, Y. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  14. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the CVPR 2009, Miami, FL, USA, 22–24 June 2009; pp. 248–255. [Google Scholar]
  15. Huang, J. The Intelligent Industrial Revolution. Online. NVIDIA Article on GTC (GPU Technology Conference). 2016. Available online: https://blogs.nvidia.com/blog/intelligent-industrial-revolution/ (accessed on 10 August 2022).
  16. Raina, R.; Madhavan, A.; Ng, A. Large-scale Deep Unsupervised Learning Using Graphics Processors. In Proceedings of the ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 873–880. [Google Scholar]
  17. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track 2010, 9, 249–256. [Google Scholar]
  18. Nair, V.; Hinton, G. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the ICML 2010, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  19. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  20. Hinton, G.; Srivastava, N.; Swersky, K. Overview of Mini-Batch Gradient Descent. Online. Lecture 6, Unpublished as a Paper. 2012. Available online: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (accessed on 10 July 2022).
  21. Ciresan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the CVPR 2012, Providence, RI, USA, 16–21 June 2012; pp. 3642–3649. [Google Scholar]
  22. Hinton, G.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR 2012. Available online: http://arxiv.org/abs/1207.0580 (accessed on 22 October 2024).
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 1, pp. 1097–1105. [Google Scholar]
  24. Mathieu, M.; Henaff, M.; LeCun, Y. Fast Training of Convolutional Networks through FFTs. arXiv 2013. Available online: https://arxiv.org/abs/1312.5851 (accessed on 22 October 2024).
  25. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR (Poster), Banff, AB, Canada, 14–16 April 2014; Available online: https://arxiv.org/pdf/1412.6980.pdf (accessed on 22 October 2024).
  26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014. Available online: https://arxiv.org/abs/1409.1556v6 (accessed on 22 October 2024).
  27. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014. Available online: https://arxiv.org/abs/1409.4842 (accessed on 22 October 2024).
  28. Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. arXiv 2015. Available online: https://arxiv.org/abs/1509.09308 (accessed on 22 October 2024).
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015. Available online: https://arxiv.org/abs/1512.03385 (accessed on 22 October 2024).
  31. Dozat, T. Incorporating Nesterov Momentum into Adam. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, PR, USA, 2–4 May 2016; pp. 1–4. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017. Available online: https://arxiv.org/abs/1706.03762 (accessed on 22 October 2024).
  33. Reddi, S.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  34. Goyal, A.; Lamb, A.; Hoffmann, J.; Sodhani, S.; Levine, S.; Bengio, Y.; Schölkopf, B. Recurrent Independent Mechanisms. arXiv 2020. Available online: https://arxiv.org/abs/1909.10893 (accessed on 22 October 2024).
  35. Xue, Y.; Tong, Y.; Neri, F. An ensemble of differential evolution and Adam for training feed-forward neural networks. Inf. Sci. 2022, 608, 453–471. [Google Scholar] [CrossRef]
  36. Bengio, Y.; Lecun, Y.; Hinton, G. Deep Learning for AI. Commun. ACM 2021, 64, 58–65. [Google Scholar] [CrossRef]
  37. Rumelhart, D.; McClelland, J. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986. [Google Scholar]
  38. Fukushima, K. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 1969, 5, 322–333. [Google Scholar] [CrossRef]
  39. Tong, G.; Huang, L. Fast Convolution based on Winograd Minimum Filtering: Introduction and Development. CS & IT-CSCP 2021, pp. 177–191. Available online: https://arxiv.org/abs/2111.00977 (accessed on 22 October 2024).
  40. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the ACL’2019, Austin, TX, USA, 4–13 October 2019; pp. 4171–4186. [Google Scholar]
  41. Wolfram, S. What Is ChatGPT Doing… and Why Does It Work? Online. 2023. Available online: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work (accessed on 22 October 2024).
  42. Mazzia, V.; Salvetti, F.; Chiaberge, M. Efficient-CapsNet: Capsule network with self-attention routing. Sci. Rep. 2021, 11, 14634. [Google Scholar] [CrossRef] [PubMed]
  43. Byerly, A.; Kalganova, T.; Dear, I. No Routing Needed Between Capsules. arXiv 2021. Available online: https://arxiv.org/abs/2001.09136 (accessed on 22 October 2024). [CrossRef]
  44. Dhakad, N.; Malhotra, Y.; Vishvakarma, S.K.; Roy, K. SHA-CNN: Scalable Hierarchical Aware Convolutional Neural Network for Edge AI. arXiv 2024. Available online: https://arxiv.org/abs/2407.21370 (accessed on 22 October 2024).
  45. Touvron, H.; Cord, M.; El-Nouby, A.; Verbeek, J.; Jégou, H. Three things everyone should know about Vision Transformers. arXiv 2022. Available online: https://arxiv.org/abs/2203.09795 (accessed on 22 October 2024).
  46. Antonio, B.; Moroni, D.; Martinelli, M. Efficient adaptive ensembling for image classification. Expert Syst. 2023. [Google Scholar] [CrossRef]
  47. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  48. Huang, K.; Wang, Y.; Tao, M.; Zhao, T. Why do deep residual networks generalize better than deep feedforward networks?—A neural tangent kernel perspective. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. NIPS ’20. [Google Scholar]
  49. Ma, L.; Li, N.; Yu, G.; Geng, X.; Huang, M.; Wang, X. Pareto-Wise Ranking Classifier for Multiobjective Evolutionary Neural Architecture Search. IEEE Trans. Evol. Comput. 2024, 28, 570–581. [Google Scholar] [CrossRef]
  50. Dellinger, J. Weight Initialization in Neural Networks: A Journey from the Basics to Kaiming. Online. 2019. Available online: https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79 (accessed on 23 June 2022).
  51. Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
  52. Riedmiller, M.; Braun, H. RPROP—A Fast Adaptive Learning Algorithm. In Proceedings of the 1992 International Symposium on Computer and Information Sciences, Antalya, Turkey, 2–4 November 1992; pp. 279–285. [Google Scholar]
  53. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Proceedings of the Doklady ANSSSR (Translated as Soviet. Math. Docl.); 1983; Volume 269, pp. 543–547. Available online: https://api.semanticscholar.org/CorpusID:202149403 (accessed on 22 October 2024).
  54. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. 2016. Available online: https://ruder.io/optimizing-gradient-descent/index.html (accessed on 10 July 2022).
  55. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  56. Korzeń, M.; Jaroszewicz, S.; Klęsk, P. Logistic regression with weight grouping priors. Comput. Stat. Data Anal. 2013, 64, 281–298. [Google Scholar] [CrossRef]
  57. Tessier, H.; Gripon, V.; Léonardon, M.; Arzel, M.; Hannagan, T.; Bertrand, D. Rethinking Weight Decay for Efficient Neural Network Pruning. J. Imaging 2022, 8, 64. [Google Scholar] [CrossRef]
  58. Nowlan, S.; Hinton, G. Simplifying Neural Networks by Soft Weight-Sharing. Neural Comput. 1992, 4, 473–493. [Google Scholar] [CrossRef]
  59. Plaut, D.C.; Nowlan, S.J.; Hinton, G.E. Experiments on Learning by Back-Propagation; Technical Report CMU–CS–86–126; Carnegie–Mellon University: Pittsburgh, PA, USA, 1986. [Google Scholar]
  60. Kim, C.; Kim, S.; Kim, J.; Lee, D.; Kim, S. Automated Learning Rate Scheduler for Large-batch Training. arXiv 2021. Available online: https://arxiv.org/abs/2107.05855 (accessed on 22 October 2024).
  61. d’Ascoli, S.; Refinetti, M.; Biroli, G. Optimal learning rate schedules in high-dimensional non-convex optimization problems. arXiv 2022. Available online: https://arxiv.org/abs/2202.04509 (accessed on 22 October 2024).
  62. Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
  63. Centofani, F.; Fontana, M.; Lepore, A.; Vantini, S. Smooth LASSO estimator for the Function-on-Function linear regression model. Comput. Stat. Data Anal. 2022, 176, 107556. [Google Scholar] [CrossRef]
  64. Tibshirani, R. Regression Shrinkage and Selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  65. Hahn, G.; Lutz, S.M.; Laha, N.; Cho, M.H.; Silverman, E.K.; Lange, C. A fast and efficient smoothing approach to Lasso regression and an application in statistical genetics: Polygenic risk scores for chronic obstructive pulmonary disease (COPD). Stat. Comput. 2021, 31, 35. [Google Scholar] [CrossRef]
  66. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  67. Breiman, L. Combining Predictors; Technical Report; Department of Statistics, University of California: Berkeley, CA, USA, 1998. [Google Scholar]
  68. Keras. GitHub. 2015. Available online: https://github.com/fchollet/keras (accessed on 22 October 2024).
  69. Roweis, S. Olivetti Faces Dataset; AT&T Laboratories: Cambridge, UK, 1994. [Google Scholar]
  70. MNIST Handwritten Digit Database; Data Set. 1998. Available online: https://yann.lecun.com/exdb/mnist/ (accessed on 22 October 2024).
  71. Krizhevsky, A.; Nair, V.; Hinton, G. CIFAR-10; Data Set; Canadian Institute for Advanced Research: Toronto, ON, Canada, 2010. [Google Scholar]
Figure 1. Milestones in the history of neural networks and deep learning [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35].
Figure 2. Example medium-sized deep network built with popular layer types: convolutional, max pooling, dropout, flattening, and dense (prepared for the CIFAR-10 data set; see experiment with ID 3392187021 in Section 8).
Figure 3. Example of a simple neural network for 10-class classification.
Figure 4. Illustration of forward and backward computations at the junction of two convolutional layers α and β for the network structure from Figure 3.
Figure 5. Forward computations of max and average pooling for S = 4.
Figure 6. Illustration of backward computations for flattening and dropout layers. High dropout rate r* = 0.875 chosen for readability of surviving connections.
Figure 7. Illustration of two stages of backward computations for a dense layer using a softmax activation function.
Figure 8. Forward pass leading to exploding signals: a fake network consisting of 100 dense layers with 512 neurons each (no activations) and initial weights drawn from a standard normal distribution.
Figure 9. Fake forward pass leading to vanishing signals due to initial weights drawn from a normal distribution with standard deviation scaled by 10⁻², regardless of N.
Figure 10. Fake forward pass leading to a stable numerical behavior due to initial weights drawn from a properly scaled normal distribution with standard deviation 1/√N.
Figure 11. Training costs (losses) of Adam and other SGD algorithms on MNIST (a) and CIFAR-10 (b) data sets. Source: (Kingma and Ba, 2014) [25].
Figure 12. UML class diagram of the hmdl project (https://github.com/pklesk/hmdl/blob/main/uml/classes.pdf) (accessed on 17 October 2024).
Figure 13. Functions executing forward and backward computations from the abstract class Layer.
Figure 14. The core of the fit function of SequentialClassifier. The main training loop (over epochs and batches) executes forward and backward computations through layers for each batch.
Figure 15. Forward and backward passes for SequentialClassifier.
Figure 16. Automatic differentiation for class Flatten.
Figure 17. Automatic differentiation for class Dropout.
Figure 18. Automatic differentiation for class Dense.
Figure 19. Forward computations for class MaxPool2D: the simplest numpy-based variant.
Figure 20. Backward computations for class MaxPool2D: the simplest numpy-based variant.
Figure 21. Backward computations of gradient for class Conv2D: the simplest numpy-based variant (GEMM).
Figure 22. Backward computations of error propagation for class Conv2D: the simplest numpy-based variant (GEMM).
Figure 23. Execution times of convolutional layers (64 input and 64 output channels; batch size: 32) for different implementation variants, filter sizes, and image sizes. For details of the hardware and software environment, see page 42.
Figure 24. Functions for Glorot initialization of weights in the hmdl project.
Figure 25. Functions for He initialization of weights in the hmdl project.
Figure 26. Implementation of Adam (coupled with regularization) in the hmdl project.
Figure 27. Sample images from data sets applied in experiments.
Figure 28. Main settings for an experiment in the script experimenter.py.
Figure 29. Description of a network structure in the hmdl project, declared in experimenter.py.
Figure 30. Choice of data set, randomization seed, and other settings in the script experimenter.py.
Table 1. Results of deep learning experiments with hmdl and Keras on Olivetti data.

Data set: Olivetti. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(8,r);DR(0.25);D(40,sm); | 0979993065 | 33 136 | 2.2 | 0.6817 | 1.258 | 4.4 | 0.7058 | 1.186 | 3.54 | 5.7 |
| 2 | F();D(64,r);DR(0.25);D(40,sm); | 1781687523 | 264 808 | 8.7 | 0.9542 | 0.1738 | 4.6 | 0.9433 | 0.2555 | 1.16 | 32.0 |
| 3 | F();D(64,r);DR(0.25);D(64,r);DR(0.25);D(40,sm); | 3416029629 | 268 968 | 8.9 | 0.9467 | 0.2246 | 5.2 | 0.9475 | 0.2330 | 0.08 | 3.6 |
| 4 | C(5,8,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2855585004 | 133 944 | 38.1 | 0.9633 | 0.1456 | 7.2 | 0.9667 | 0.1380 | 0.35 | 5.2 |
| 5 | C(9,16,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 4283815439 | 266 120 | 62.8 | 0.9567 | 0.1764 | 9.1 | 0.9575 | 0.1699 | 0.08 | 3.7 |
| 6 | C(5,8,r);M(4);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1035804442 | 69 576 | 62.3 | 0.9492 | 0.1677 | 8.2 | 0.9558 | 0.1503 | 0.70 | 10.4 |
| 7 | C(5,8,r);M(2);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3391566488 | 266 184 | 82.4 | 0.9217 | 0.2963 | 9.1 | 0.9183 | 0.3415 | 0.37 | 13.2 |
| 8 | C(5,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3923102291 | 135 552 | 97.1 | 0.9583 | 0.1611 | 14.3 | 0.9600 | 0.1652 | 0.18 | 2.5 |
| 9 | C(3,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1455896911 | 134 400 | 85.4 | 0.9600 | 0.1528 | 12.3 | 0.9592 | 0.1939 | 0.08 | 21.2 |
| 10 | C(5,8,r)×2;M(2);DR(0.25);C(3,16,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2185109709 | 270 112 | 175.9 | 0.9358 | 0.3113 | 17.9 | 0.9233 | 0.3771 | 1.35 | 17.4 |
| 11 | C(5,8,r)×2;M(4);DR(0.25);C(3,32,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1466991051 | 147 136 | 148.0 | 0.9500 | 0.1941 | 16.4 | 0.9483 | 0.2060 | 0.18 | 5.8 |

Data set: Olivetti. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(8,r);DR(0.25);D(40,sm); | 0979993065 | 33 136 | 1.2 | 0.6817 | 1.258 | 2.5 | 0.6908 | 1.229 | 1.33 | 2.3 |
| 2 | F();D(64,r);DR(0.25);D(40,sm); | 1781687523 | 264 808 | 2.3 | 0.9542 | 0.1740 | 2.5 | 0.9550 | 0.2310 | 0.08 | 24.7 |
| 3 | F();D(64,r);DR(0.25);D(64,r);DR(0.25);D(40,sm); | 3416029629 | 268 968 | 2.5 | 0.9508 | 0.2212 | 5.2 | 0.9475 | 0.2266 | 0.35 | 2.4 |
| 4 | C(5,8,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2855585004 | 133 944 | 27.1 | 0.9625 | 0.1464 | 3.2 | 0.9608 | 0.1683 | 0.18 | 13.0 |
| 5 | C(9,16,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 4283815439 | 266 120 | 35.9 | 0.9542 | 0.1754 | 3.3 | 0.9575 | 0.1720 | 0.35 | 1.9 |
| 6 | C(5,8,r);M(4);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1035804442 | 69 576 | 43.7 | 0.9483 | 0.1799 | 3.9 | 0.9617 | 0.1295 | 1.41 | 28.4 |
| 7 | C(5,8,r);M(2);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3391566488 | 266 184 | 55.1 | 0.9242 | 0.2890 | 3.9 | 0.9442 | 0.2134 | 2.16 | 26.2 |
| 8 | C(5,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3923102291 | 135 552 | 52.5 | 0.9608 | 0.1532 | 3.8 | 0.9683 | 0.1362 | 0.78 | 11.1 |
| 9 | C(3,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1455896911 | 134 400 | 51.3 | 0.9533 | 0.1760 | 4.0 | 0.9708 | 0.1334 | 1.84 | 24.2 |
| 10 | C(5,8,r)×2;M(2);DR(0.25);C(3,16,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2185109709 | 270 112 | 97.6 | 0.9358 | 0.3081 | 4.9 | 0.9433 | 0.2224 | 0.80 | 27.8 |
| 11 | C(5,8,r)×2;M(4);DR(0.25);C(3,32,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1466991051 | 147 136 | 83.7 | 0.9517 | 0.1879 | 5.1 | 0.9558 | 0.1753 | 0.43 | 6.7 |
Table 2. Results of deep learning experiments with hmdl and Keras on MNIST data.

Data set: MNIST. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(128,r);DR(0.125);D(128,r);DR(0.125);D(10,sm); | 1321552096 | 118 282 | 133 | 0.9759 | 0.0769 | 18 | 0.9805 | 0.0765 | 0.47 | 0.5 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0643554048 | 669 706 | 275 | 0.9849 | 0.0584 | 31 | 0.9855 | 0.0752 | 0.06 | 22.3 |
| 3 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0491196249 | 932 362 | 352 | 0.9808 | 0.0800 | 39 | 0.9846 | 0.0902 | 0.39 | 11.3 |
| 4 | C(7,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2044490280 | 1 610 250 | 6 861 | 0.9875 | 0.0407 | 393 | 0.9876 | 0.0493 | 0.01 | 17.4 |
| 5 | C(7,64,r);M(2);DR(0.125);C(5,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2037082629 | 1 012 362 | 67 425 | 0.9927 | 0.0250 | 1 395 | 0.9936 | 0.0274 | 0.09 | 8.8 |
| 6 | C(5,64,r);M(2);DR(0.125);C(3,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0549589385 | 879 754 | 27 840 | 0.9920 | 0.0254 | 814 | 0.9932 | 0.0333 | 0.12 | 23.7 |
| 7 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1685817126 | 484 714 | 46 803 | 0.9944 | 0.0205 | 1 101 | 0.9937 | 0.0301 | 0.07 | 31.9 |
| 8 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1224383397 | 435 306 | 49 391 | 0.9959 | 0.0219 | 1 093 | 0.9951 | 0.0343 | 0.08 | 36.2 |
| 9 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3604054245 | 452 202 | 64 678 | 0.9953 | 0.0161 | 1 343 | 0.9943 | 0.0247 | 0.10 | 34.8 |
| 10 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 1584841448 | 452 202 | 65 518 | 0.9929 | 0.0241 | 1 776 | 0.9957 | 0.0198 | 0.28 | 17.8 |

Data set: MNIST. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(128,r);DR(0.125);D(128,r);DR(0.125);D(10,sm); | 1321552096 | 118 282 | 100 | 0.9750 | 0.0778 | 11 | 0.9816 | 0.0760 | 0.68 | 2.3 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0643554048 | 669 706 | 190 | 0.9838 | 0.0582 | 11 | 0.9840 | 0.0931 | 0.02 | 37.5 |
| 3 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0491196249 | 932 362 | 292 | 0.9801 | 0.0816 | 16 | 0.9841 | 0.0930 | 0.41 | 12.3 |
| 4 | C(7,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2044490280 | 1 610 250 | 2 681 | 0.9877 | 0.0409 | 83 | 0.9869 | 0.0602 | 0.08 | 32.1 |
| 5 | C(7,64,r);M(2);DR(0.125);C(5,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2037082629 | 1 012 362 | 8 254 | 0.9927 | 0.0220 | 82 | 0.9946 | 0.0250 | 0.19 | 12.0 |
| 6 | C(5,64,r);M(2);DR(0.125);C(3,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0549589385 | 879 754 | 5 432 | 0.9926 | 0.0248 | 68 | 0.9935 | 0.0316 | 0.09 | 21.5 |
| 7 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1685817126 | 484 714 | 6 592 | 0.9943 | 0.0189 | 82 | 0.9942 | 0.0312 | 0.01 | 39.4 |
| 8 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1224383397 | 435 306 | 8 308 | 0.9947 | 0.0243 | 82 | 0.9955 | 0.0239 | 0.08 | 1.6 |
| 9 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3604054245 | 452 202 | 9 666 | 0.9954 | 0.0174 | 93 | 0.9952 | 0.0259 | 0.02 | 32.8 |
| 10 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 1584841448 | 452 202 | 9 440 | 0.9928 | 0.0240 | 143 | 0.9954 | 0.0187 | 0.26 | 22.1 |
Table 3. Results of deep learning experiments with hmdl and Keras on CIFAR-10 data.

Data set: CIFAR-10. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(512,r);DR(0.25);D(512,r);DR(0.25);D(512,r);DR(0.25);D(10,sm); | 0112184563 | 2 103 818 | 619 | 0.5589 | 1.573 | 75 | 0.5700 | 1.642 | 1.99 | 4.2 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 3617925328 | 2 103 818 | 610 | 0.5671 | 2.259 | 80 | 0.5624 | 2.436 | 0.84 | 7.3 |
| 3 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1176787770 | 1 050 890 | 3 981 | 0.6557 | 1.116 | 234 | 0.6410 | 1.068 | 2.29 | 4.3 |
| 4 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm) | 4064952781 | 1 050 890 | 4 071 | 0.6582 | 1.009 | 234 | 0.6500 | 1.061 | 1.26 | 4.9 |
| 5 | C(7,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2176511806 | 1 054 730 | 5 699 | 0.6581 | 1.139 | 309 | 0.6542 | 1.062 | 0.60 | 6.8 |
| 6 | C(7,32,r);M(2);DR(0.125);C(5,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3250393598 | 581 706 | 20 149 | 0.7148 | 1.363 | 628 | 0.7248 | 1.134 | 1.40 | 16.8 |
| 7 | C(3,32,r);C(5,16,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm); | 4064194389 | 539 418 | 17 091 | 0.6665 | 1.607 | 777 | 0.6631 | 1.625 | 0.51 | 1.1 |
| 8 | C(7,32,r);C(5,32,r);M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 3392187021 | 611 498 | 49 819 | 0.7741 | 0.8160 | 1 282 | 0.7739 | 0.7324 | 0.03 | 10.2 |
| 9 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1676196221 | 591 274 | 33 930 | 0.7654 | 1.137 | 961 | 0.7607 | 1.178 | 0.62 | 3.5 |
| 10 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 4191965583 | 591 274 | 34 161 | 0.7968 | 0.6892 | 959 | 0.7855 | 0.6926 | 1.44 | 0.5 |
| 11 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0843263662 | 550 570 | 52 878 | 0.8093 | 1.005 | 1 229 | 0.8086 | 1.024 | 0.09 | 1.9 |
| 12 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 0715119487 | 550 570 | 52 614 | 0.8217 | 0.5949 | 1 228 | 0.8359 | 0.5497 | 1.73 | 7.6 |

Data set: CIFAR-10. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(512,r);DR(0.25);D(512,r);DR(0.25);D(512,r);DR(0.25);D(10,sm); | 0112184563 | 2 103 818 | 383 | 0.5646 | 1.567 | 29 | 0.5681 | 1.654 | 0.62 | 5.3 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 3617925328 | 2 103 818 | 386 | 0.5543 | 2.331 | 28 | 0.5595 | 2.575 | 0.94 | 9.5 |
| 3 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1176787770 | 1 050 890 | 1 491 | 0.6565 | 1.124 | 41 | 0.6450 | 1.069 | 1.78 | 4.9 |
| 4 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm) | 4064952781 | 1 050 890 | 1 496 | 0.6580 | 1.004 | 42 | 0.6406 | 1.052 | 2.72 | 4.6 |
| 5 | C(7,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2176511806 | 1 054 730 | 1 646 | 0.6579 | 1.167 | 49 | 0.6578 | 1.045 | 0.02 | 10.5 |
| 6 | C(7,32,r);M(2);DR(0.125);C(5,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3250393598 | 581 706 | 3 487 | 0.7145 | 1.335 | 83 | 0.7197 | 1.241 | 0.73 | 7.0 |
| 7 | C(3,32,r);C(5,16,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm); | 4064194389 | 539 418 | 2 804 | 0.6664 | 1.626 | 83 | 0.6624 | 1.606 | 0.45 | 1.2 |
| 8 | C(7,32,r);C(5,32,r);M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 3392187021 | 611 498 | 7 043 | 0.7741 | 0.7779 | 107 | 0.7774 | 0.7439 | 0.43 | 4.4 |
| 9 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1676196221 | 591 274 | 5 795 | 0.7738 | 1.106 | 143 | 0.7656 | 1.093 | 1.07 | 1.2 |
| 10 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 4191965583 | 591 274 | 5 784 | 0.7948 | 0.7195 | 89 | 0.7938 | 0.6612 | 0.13 | 8.1 |
| 11 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0843263662 | 550 570 | 8 625 | 0.8032 | 1.014 | 103 | 0.8172 | 0.9271 | 1.74 | 8.6 |
| 12 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 0715119487 | 550 570 | 8 759 | 0.8190 | 0.6119 | 103 | 0.8441 | 0.5184 | 3.06 | 15.3 |