Tutorial

Understanding the Flows of Signals and Gradients: A Tutorial on Algorithms Needed to Implement a Deep Neural Network from Scratch

by
Przemysław Klęsk
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Żołnierska 49, 71-210 Szczecin, Poland
Appl. Sci. 2024, 14(21), 9972; https://doi.org/10.3390/app14219972
Submission received: 23 September 2024 / Revised: 18 October 2024 / Accepted: 23 October 2024 / Published: 31 October 2024
(This article belongs to the Special Issue Advanced Digital Signal Processing and Its Applications)

Abstract:
Theano, TensorFlow, Keras, Torch, PyTorch, and other software frameworks have remarkably stimulated the popularity of deep learning (DL). Apart from all the good they achieve, the danger of such frameworks is that they unintentionally spur a black-box attitude. Some practitioners play around with building blocks offered by frameworks and rely on them, having a superficial understanding of the internal mechanics. This paper constitutes a concise tutorial that elucidates the flows of signals and gradients in deep neural networks, enabling readers to successfully implement a deep network from scratch. By “from scratch”, we mean with access to a programming language and numerical libraries but without any components that hide DL computations underneath. To achieve this goal, the following five topics need to be well understood: (1) automatic differentiation, (2) the initialization of weights, (3) learning algorithms, (4) regularization, and (5) the organization of computations. We cover all of these topics in the paper. From a tutorial perspective, the key contributions include the following: (a) proposition of R and S operators for tensors—reshape and stack, respectively—that facilitate algebraic notation of computations involved in convolutional, pooling, and flattening layers; (b) a Python project named hmdl (“home-made deep learning”); and (c) consistent notation across all mathematical contexts involved. The hmdl project serves as a practical example of implementation and a reference. It was built using NumPy and Numba modules with JIT and CUDA amenities applied. In the experimental section, we compare hmdl implementation to Keras (backed with TensorFlow). Finally, we point out the consistency of the two in terms of convergence and accuracy, and we observe the superiority of the latter in terms of efficiency.

1. Introduction

Software frameworks and libraries, such as Theano, TensorFlow, Keras, Torch, PyTorch, and others, have undoubtedly and remarkably stimulated the popularity of research on deep learning (DL) and its applications among practitioners from different fields. Imagine someone trying to select a few significant research papers on DL, e.g., from a particular month of the current year. Those are so numerous that the task seems hopeless and pointless. In fact, the “hype” around DL is so strong that it is even hard to publish a new paper on machine learning if the proposed ideas do not address DL in some way.
It is easy to notice that the large majority of research on DL is, in fact, application work conducted by practitioners capable of using frameworks. That is understandable and not a fault in itself. Yet, apart from all the good they achieve, the danger of frameworks, like that of any other powerful tool, is that they unintentionally spur a black-box attitude towards the method being applied. It is fair to say that some researchers play around with building blocks offered by frameworks and rely on them, having a superficial understanding of the internal mechanics.
This paper is intended to constitute a concise yet thorough tutorial on DL’s mathematics and algorithms, allowing one to successfully implement a deep neural network from scratch. By “from scratch”, we mean with access to a programming language, and with numerical libraries at one’s disposal, but without frameworks or components that hide DL computations underneath. To achieve this goal, the following five topics need to be well understood:
  • automatic differentiation—i.e., a backpropagation algorithm with suitable inductive steps for different layer types and the activation functions;
  • initialization of weights—to avoid exploding or vanishing signals/gradients and to guarantee convergence;
  • learning algorithms (optimizers)—to guarantee suitably fast convergence;
  • regularization—to avoid overfitting;
  • organization of computations—to plan when and how to use CPU or GPU computations.
We discuss those topics in the paper and concentrate on providing the reader with selected state-of-the-art methods to suitably guide an implementation.

1.1. Motivation and Contributions

First of all, let us clearly emphasize that this manuscript is a tutorial (invited and submitted as the tutorial article type to the special issue “Advanced Digital Signal Processing and Its Applications” of the Applied Sciences). It is not a regular research paper. Therefore, the primary goal of this tutorial is educational, and its value should be assessed in terms of educational contributions to build a better understanding of the flows of signals and gradients in deep neural networks.
Let us refer here to two remarkable examples of tutorials well known in the machine learning community. The first one is the tutorial on hidden Markov models by Rabiner [1] from 1989. The second is the tutorial on boosting (the ensemble learning technique) by Friedman, Hastie, and Tibshirani [2] from the year 2000. Both of these works have been, and continue to be up to this date, great sources of knowledge and tutoring material for researchers in those fields. Our motivation and ambition is that this tutorial on deep neural networks can also become well recognized in the machine learning community.
We regard the following elements as the key contributions of this tutorial:
  • The proposition of R and S operators for tensors—reshape and stack, respectively—that facilitate algebraic notation for computations (forward and backward) involved in convolutional, pooling, and flattening layers;
  • A Python project named hmdl (“home-made deep learning”)—a practical implementation of the ideas presented in this tutorial;
  • A careful analysis of a small example—tracking flows of signals and gradients through a four-layer convolutional network (Section 3);
  • A concise explanation of DL computations (the aforementioned topics)—automatic differentiation, initialization of weights, learning algorithms, and regularization;
  • Clear and consistent notation across all mathematical contexts involved.
Let us briefly elaborate here on the second element—the hmdl project. It constitutes the postulated “from scratch” implementation that can be treated as an example and a reference. It provides the common layer types and functionalities, and it does that in approximately 2 k lines of code. Therefore, the task of developing a “home-made” DL framework is not as difficult as it might seem. In sum, we think the hmdl project helps the reader understand how DL’s mathematics and algorithms become translated into the realm of programming and computations.
Apart from tutoring contributions, this manuscript is additionally intended to constitute a short survey of crucial scientific contributions in the development of DL. We focus especially on the ones that took place during the last thirteen years (since the “DL big bang”) and have paved the way for DL’s popularity and frameworks. We present those contributions on the canvas of the five aforementioned topics and with the main goal in mind—guiding the reader towards implementation.

1.2. Organization of the Remaining Content

The rest of this paper is organized as follows. Section 2 presents the general background, that is, the history of neural networks and deep learning, and the core computational principles involved in the backpropagation method. Additionally, in that section, we introduce the notational conventions applied throughout the paper. With automatic differentiation in mind, in Section 3, we analyze in detail a small case example of a network consisting of four layers (two of them convolutional). We study the flows of signals (forward pass) and the flow of gradients (backward pass) in this network, and we explicitly derive all necessary formulas. Building upon the small case example, Section 4 provides general formulas for forward and backward computations in the most common layer types: convolutional, max pooling, average pooling, flattening, dropout, and dense. Particular attention is also paid to the softmax activation function and the differentiation involved. Section 5 is devoted to the topic of the random initialization of weights. Recursive equations for the variances of signals/gradients, derived in that section, help explain the problems of vanishing or exploding gradients. Section 6 discusses optimizers, i.e., the state-of-the-art learning algorithms applied in modern neural networks. In particular, the section explains exponential moving averages (EMAs) and the role they play in contemporary learning algorithms. In Section 7, we turn our attention to regularization techniques applied to DL in order to temper overfitting. Section 8 pertains to the aforementioned Python project, entitled hmdl. We present its design, structure, and key computational elements (with source code examples). The section ends with multiple deep learning experiments comparing hmdl with Keras on three well-known data sets, using various network architectures, two hardware environments, and several approaches to compute convolutions. We instruct the reader as to how the experiments and results can be reproduced. Section 9 concludes the paper with final thoughts and remarks.

2. Background

2.1. Milestones in the Not-So-Long History of Deep Learning

The history of neural networks and (then) deep learning is peculiar. It includes excitements, stagnation periods, and breakthroughs. Figure 1 presents the selected milestones in that history along the timeline. We categorized most of them according to the five key topics listed in the previous section, hence the following labels: “backpropagation”, “initialization”, “learning”, “regularization”, and “computations”. Obviously, these threads were interwoven along the way, but certain periods of specific interests can be observed. An additional category, “structure”, is introduced to label various structure-related ideas—early models of a neuron, novel layer types or activations, contemporary deep architectures, etc.
As is commonly known, it all began in the 1940s and 1950s with the first mathematical models of biological neurons proposed in the works of [3] and then [4]. Despite some initial interest, a bizarre, 20-year-long stagnation period followed, roughly from 1960 to 1980. This was, to some extent, caused by a misunderstanding involving the so-called XOR problem and the inability of the simple perceptron to cope with it, which made some people draw false general conclusions. An interesting “lonely” event in that period (apart from minor exceptions) is a master’s thesis by [5] on automatic differentiation. The work contained a suitable inductive algorithm but did not pertain to neural networks explicitly. It was not until 1986 that Rumelhart, Hinton, and Williams [7] (re)discovered the algorithm for the realm of neural nets and coined the name backpropagation for it. Two other works from the 1980s deserve attention: firstly, the one on the Neocognitron by [6], who presented the early ideas of convolutional and pooling layers, and secondly, the one on zip code recognition by LeCun [8]. The latter turned into the first real-world application of a neural network, trained end to end with backpropagation and without pre-training. LeCun’s net consisted of over 4 k units with 2 k weights (some shared), and it was organized into four hidden layers, including convolution-like and average pooling ones. Thus, it may be regarded as the first deep learning application, even though the term did not exist at the time.
The second stagnation period, lasting for about 10 years, was mainly due to difficulties experienced in training the networks, even medium-sized ones. Convergence problems and the large computational power that had to be invested were the reasons for this frustration. Researchers started to notice the problem of vanishing gradients. As a result, in the 1990s and early 2000s, the focus of the machine learning community moved to other approaches (e.g., SVMs, boosting, HMMs, etc.) with hand-engineered features or moment-based descriptors. A notable exception was Larry Heck’s research group at SRI International, funded by the US government, which kept working on speech processing using neural networks. Their successful speaker recognition system [11], developed in 1998–2000, proved that automatically extracted raw features may outperform hand-crafted ones, in particular the Mel–Cepstral features popular in speech processing. Also around that time, the term “deep learning”, introduced by [10], started to appear in common use.
In 2006, some revival of interest in DL occurred due to the work on deep belief nets [13]. It indicated new ways of overcoming vanishing gradients by pre-training one layer at a time and then fine-training with backpropagation. Some ideas on pre-training are still valid today (transfer learning and stacked autoencoders).
The major breakthrough came around 2009, and it is frequently referred to as the “DL big bang”. It arose due to a combination of three reasons: GPU technology due to NVIDIA [15], large data sets [14,23], and convenient software that appeared shortly after. As Bengio, LeCun, and Hinton (the 2019 Turing award winners) describe it now: “(…) the emergence of GPUs and the availability of large datasets were key enablers of deep learning and they were greatly enhanced by the development of open source, flexible software platforms with automatic differentiation such as Theano, Torch, Caffe, TensorFlow, and PyTorch (…)” [36].
Looking at the timeline (Figure 1), one can see that after 2009, key events start to become dense. In 2010, [17] published their groundbreaking work on vanishing or exploding signals/gradients in deep structures. They derived a suitable recurrence equation to analyze variance in signals and provided a method for random weight initialization that keeps variance approximately constant—a state-of-the-art method up to this day. The reader should be strongly reminded that the carefree initialization of weights (making the signals vanish or explode) was one of the main reasons for DL stagnation in the 1990s.
Also in 2010, [18] demonstrated the usefulness of activation functions called rectified linear units (ReLUs). As a form of regularization, ReLUs contribute to higher accuracy and faster convergence. Other important structure- and regularization-related contributions followed in 2012: max pooling [20] and dropout due to Ciresan, Hinton, et al. In parallel to those, the research on faster convergence produced new algorithms with adaptive learning rates: AdaGrad [19] and RMSProp [20] in 2011 and 2012, respectively.
A combination of novel ideas from that time allowed Alex Krizhevsky [23] to construct a structure with 8 layers, named AlexNet, that drew the attention of the whole computer vision community. AlexNet won the ImageNet 2012 competition by a large margin [14,36], being able to accurately recognize objects from one thousand classes in natural images. The key elements of that victory were as follows: ReLUs, convolutions with max pooling, dropout regularization, and the use of multiple GPUs. Great credit must also be given to the efforts of Fei-Fei’s group for collecting a training set of over 1 million labeled images [14]. It is interesting to note that many elements applied in AlexNet—inherent components of DL still nowadays, such as: “weight sharing” [37], convolutions and pooling [6], and ReLU-like activation [38]—had already appeared before 1990 and thus significantly predated the contemporary era. Within barely three years since AlexNet, new impressive deeper structures came to life: VGGNet [26], GoogLeNet [27], and ResNet [30].
Around the same time, studies on efficient computations brought about novelties. Initial works on fast convolutions based on FFT [24] or Winograd transformations [28], and their GPU/CUDA implementations opened an important research topic, which is still lively today [39]. Undoubtedly, convolutional layers play a crucial role in contemporary DL (hence the commonness of the names CNNs or DCNNs, the latter standing for deep convolutional neural networks). Simultaneously, they are the most time-consuming layers, with complexity scaling roughly as a product of the number of input pixels, channels, kernels (or feature maps), and the size of a filter and batch. As a side remark, it is worth pointing out that the name “convolution”, strongly rooted in DL’s nomenclature, is not strictly correct. In forward computations, the operation commonly called a “2D convolution” is, in fact, a 2D correlation in the mathematical sense. It is during backward computations that one deals with convolution due to weights being flipped up–down and left–right.
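To make the nomenclature remark concrete, the following small check (a sketch assuming NumPy and SciPy are available; the arrays and sizes are arbitrary) shows that what DL frameworks call a “2D convolution” coincides with SciPy’s 2D correlation, and that a true convolution is recovered by flipping the kernel up–down and left–right:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

# A "2D convolution" in DL frameworks is, mathematically, a 2D correlation.
# Correlating with a kernel equals convolving with the same kernel flipped
# up-down and left-right (and vice versa).
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

corr = correlate2d(image, kernel, mode="same")
conv_flipped = convolve2d(image, kernel[::-1, ::-1], mode="same")

print(np.allclose(corr, conv_flipped))  # True
```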
Regarding studies on optimizers, in 2014, Kingma and Ba published their great proposition of Adam [25]—a learning algorithm that applies first- and second-order gradient moments estimated via exponential moving averages. Adam’s effectiveness and simplicity quickly made it a highly popular state-of-the-art optimizer, recommended as the default one by many sources. It is fair to remark that two later propositions, marked on our timeline in 2016 and 2018, called Nadam [31] and AMSGrad [33], respectively, represent only minor modifications of the main strong idea behind Adam. On the other hand, also worth attention is a recent, yet already widely recognized, proposition of EDEAdam from 2022 [35]. It integrates the main idea of Adam with an ensemble of differential evolution algorithms and helps further prevent getting stuck in local optima.
The topic of weight initialization resurfaced in 2015. [29] demonstrated the proper way to generate random weights when a layer is equipped with the ReLU activation. It turned out that Glorot’s method from five years before was not complete regarding that particular case.
Last, but certainly not least, we end the timeline from Figure 1 with some entries that represent recent advances in natural language processing. One should emphasize at this point novelties like the following: self-attention [32], recurrent independent mechanisms [34], and transformer architectures [32,40]. Owing to the above advances and a proper combination of deep and reinforcement learning, the researchers from the OpenAI company were able to develop a family of large language models (called GPT-3) and release, in November 2022, a truly astonishing chatbot named ChatGPT [41].
The research on DL is clearly very lively nowadays. In the most recent few years (2021–2024), several fresh and interesting research topics have emerged, such as capsule networks [42], self-attention with/without routing [43], hierarchical aware CNNs [44], vision transformers [45], and federated or adaptive ensemble learning [46] for deep neural networks. Those topics are outside the scope of this manuscript, as we have decided to focus on the contemporary but well-established ideas of DL and on state-of-the-art algorithms.
To conclude this section, we would like to point out a certain contrast. On the one hand, important mathematical results are known, e.g., due to [47], proving that one hidden layer with sufficiently many non-linear neurons is sufficient for universal approximation. On the other hand, the practice shows that deep nets generalize better than shallow ones (with the same number of parameters). The question “why depth?” was recently considered by DL authorities—Bengio, LeCun, and Hinton [36]. They believe depth helps build internal representations of concepts. It is known that human intelligence emerges from highly parallel networks of fairly simple neurons that learn by adjusting the strengths of their connections. Apparently, this simple mechanism allows humans to perform complicated tasks (recognize objects, understand language, etc.). Deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer [48,49]. That compositionality and successive levels of abstraction work very well for perception-related tasks, and there is strong evidence that this mechanism is used in biological perceptual systems [36].

2.2. Backpropagation

In this section, we turn our attention to automatic differentiation and the backpropagation algorithm. Obviously, various structural novelties that appeared throughout the history—weight sharing, convolutions, pooling, ReLUs, softmax activation, dropout, etc.—and possibly others still to come in the future, need to be handled properly during the computations of derivatives, which allow for tuning network parameters. Therefore, apart from being correct and efficient, the algorithm for that purpose needs to be (and is) universal. With that universality in mind, here, we discuss the general principles that underlie backpropagation.
The usual goal is to train (using data examples) a neural network that can be seen as a large graph of connections with tunable parameters called weights. Commonly, the connections are organized as a layered structure; see, e.g., Figure 2. The training is carried out through some variant of the gradient descent in order to minimize a certain error function for the whole training data set with respect to weights. That error is, in turn, defined by the so-called loss function at the level of a single training pair, $(x, y^*)$—categorical cross-entropy and squared loss being the common choices for loss functions in classification and regression tasks, respectively.
For the large majority of networks (graphs of connections), automatic differentiation can be constructed from two rules well known from calculus: (i) the chain rule—stating that, for a composition of two differentiable functions, the derivative of the composition is equal to the derivative of the outer function (calculated with respect to the inner one) multiplied by the derivative of the inner function; (ii) the sum rule—stating that a derivative of a sum of functions is the sum of their derivatives. In practice, the rules imply that intermediate derivatives leading towards a particular weight along a single path of connections must be multiplied, whereas the ones along alternate paths must be summed.
Although a derivative for any individual weight could be computed separately via the above approach, doing so would be inefficient because of unnecessary recomputations of many intermediate values. Therefore, one can do better by applying an induction. For a fixed training example (or a batch of examples), the induction proceeds from the back of the network towards the front, layer by layer, and computes derivatives specific to the given layer—components of the final gradient—based on values already computed. To achieve this, one introduces an auxiliary quantity, traditionally named δ (we skip its subscripts for a moment), associated with every neuron and computed inductively. That quantity is interpreted as an “error”, but more accurately, it should be seen as a partial contribution to the final loss derivative that takes place from the point of the given neuron onwards in the network. We must clarify that δ values themselves are not equal to loss derivatives. To turn them into derivatives, an additional multiplication by inputs of the given layer is necessary.
Let $l$ denote the loss function, and suppose that we are about to perform backward computations for the $i$-th layer, having already performed the computations for layers $i+1, i+2, \dots$. Suppose also that the layer is equipped with a collection (a tensor) of weights, denoted here as $W_i$. The algorithmic goal is to be able to compute $\partial l/\partial w$ for any $w \in W_i$, based on error quantities $\delta_i$. The key argument to realize is that $w$ can affect the loss of the whole network only through its direct effect on the very next layer, $i+1$. Therefore, to compute $\delta_i$—the collection of error quantities for the current layer—one needs to know only $\delta_{i+1}$ and $W_{i+1}$, i.e., the direct subsequent errors and the weights of outgoing connections through which these errors are reached, respectively. Error quantities for layers $i+2, i+3, \dots$ are not needed. In fact, their influence is already present inside $\delta_{i+1}$.
Below, we present two algorithmic rules that govern backpropagation. We purposely wrote them down in a mnemonic verbal form.
Rule 1 (inductive step):
“current δ : = activation derivative times weighted sum of subsequent δs taken over outgoing connections.”
Rule 2 (gradient):
“∇ := δs times inputs.”
The first rule tells us how a δ quantity should be computed based on subsequent, already computed, δ s (or, equivalently, how the subsequent δ s should be propagated back). This rule has to be executed always, regardless of the layer type. The second rule tells us how to compute the gradient, denoted as ∇ (again, subscripts are skipped for now), for weights present in the current layer (if any). The gradient can be applied later in formulas correcting the weights, in accordance with a particular variant of gradient descent (e.g., simple SGD, Adam, AMSGrad, etc.). The type of multiplication hidden under the word ‘times’ in the rules (e.g., algebraic or element-wise multiplication) is implied by the layer type and, in some cases, also by the activation. In particular, please note that, for convolutional layers, the weights of each filter are shared by many inputs—a fact that must be taken into account when applying both Rule 1 and Rule 2 for such layers.
In many further places in the text, we will refer to the above two rules: in Section 3, where we provide a “manual” derivation of backpropagation for a specific small network, and in Section 4, where we present general formulas.

2.3. Notation

We choose to index elements in vectors, matrices, and tensors starting from 0 (not from 1), as in C or Python languages. This choice is slightly more convenient than natural indexing for some operations on tensors like reshaping or taking subranges.
Additionally, for convolutional filters, we apply a convention that allows for negative indexes of weights. This convention treats the point (pixel) at which the filter is currently anchored as a temporary center, i.e., the $(0,0)$ point. More specifically, assuming an odd-sized $F \times F$ filter ($1\times 1$, $3\times 3$, $5\times 5$, etc.), one can write its indexed weights as $w_{a,b}$, where both $a, b$ take values in the set $\{-\lfloor F/2\rfloor, \dots, -1, 0, 1, \dots, \lfloor F/2\rfloor\}$. Let us consider an example; suppose that X denotes an input image, and Y denotes the resulting convolution image (for simplicity, both with 1 channel). Additionally, let $F = 7$. Then, elements of Y are computed as follows:
$$y_{j,k} = \sum_{-3 \le a,b \le 3} w_{a,b} \cdot x_{j+a,\,k+b}.$$
Obviously, $x_{j+a,\,k+b}$ must be well defined for regions outside the proper borders of X (padding with zeros is a standard practice that we address later on). We remark that, in general, not pairs but quadruples shall be used as indexes of filter weights to suitably handle input and output channels.
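For illustration, the scalar formula above can be coded directly as nested loops. The sketch below is a minimal, hypothetical NumPy implementation (our own function and variable names, not the hmdl code) for a single-channel image and an odd-sized filter, with the out-of-range check playing the role of zero padding:

```python
import numpy as np

def conv2d_same_single_channel(X, W):
    """Scalar-form 'same' convolution (mathematically: a correlation) of a
    single-channel image X with an odd-sized filter W; out-of-range pixels
    are treated as zeros, i.e., the indicator [0 <= j+a < H][0 <= k+b < V]."""
    H, V = X.shape
    F = W.shape[0]          # odd filter size, e.g., 7
    half = F // 2
    Y = np.zeros((H, V))
    for j in range(H):
        for k in range(V):
            for a in range(-half, half + 1):
                for b in range(-half, half + 1):
                    if 0 <= j + a < H and 0 <= k + b < V:
                        Y[j, k] += W[a + half, b + half] * X[j + a, k + b]
    return Y
```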
In many places where indicator functions are needed, we apply the so-called Iverson convention. It represents an indicator with a square bracket, [ · ] , yielding 1 if its argument (any logical statement) is true, and 0 otherwise.
As regards multiplications, we use the symbol “·” for standard algebraic products of numbers or matrices, whereas we use “∘” for element-wise products of tensors with the same size (also known as the Hadamard product). In some places, even if the multiplication symbol is mathematically redundant, we still write it explicitly in order to emphasize the particular terms involved in the backpropagation chain. Identity matrices are denoted as $I$ with a suitable subscript to define the size, whereas bold symbols, $\mathbf{1}$, stand for tensors consisting of ones, e.g., $\mathbf{1}_{5\times 1} = (1, 1, 1, 1, 1)^T$.
We apply the “:=” sign to indicate assignments, algorithmic or computational, and we use it distinctly from purely mathematical equality signs, “=”.
In places related to tensor-form formulas, it shall be convenient for us to introduce two operators: R—performing a reshape of a tensor, and S—generating a stack of tensors as a function of certain iterators. For example, given a vector, $x = (x_0, \dots, x_{23})$, the expression $R_{3\times 2\times 4}\,x$ yields a three-dimensional tensor of shape $3\times 2\times 4$ with the following successive slices:
$$\begin{pmatrix} x_0 & x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 & x_7 \end{pmatrix}, \quad \begin{pmatrix} x_8 & x_9 & x_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & x_{15} \end{pmatrix}, \quad \begin{pmatrix} x_{16} & x_{17} & x_{18} & x_{19} \\ x_{20} & x_{21} & x_{22} & x_{23} \end{pmatrix},$$
where the last dimension changes the fastest. Obviously, the superposition $R_{24} R_{3\times 2\times 4}\,x$ yields back the original x. Shaping x explicitly to become a row or a column vector is also possible by writing $R_{1\times 24}\,x$ or $R_{24\times 1}\,x$, respectively. As regards the second operator—S, it can be understood as a parameterized generator of larger tensors produced from smaller ones. It works by iterating over a certain variable (or variables) and stacks the successive operands together, thereby generating a tensor extended to an extra dimension with respect to the operands. When the argument under S is a vector, then the resulting tensor will be a matrix consisting of stacked vectors. When the argument is a matrix, the result will be a three-dimensional tensor consisting of stacked matrices, and so forth. The expression below provides a simple example:
$$\mathop{S}_{10 \le a < 13} a \cdot \mathbf{1}_{1\times 5} = \begin{pmatrix} 10 & 10 & 10 & 10 & 10 \\ 11 & 11 & 11 & 11 & 11 \\ 12 & 12 & 12 & 12 & 12 \end{pmatrix}.$$
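In NumPy terms, R corresponds to reshape with the default row-major order (last index changing fastest) and S to stacking operands generated over an iterator; a small sketch reproducing the two examples above (variable names are ours):

```python
import numpy as np

# R: reshape with the last dimension changing the fastest (NumPy's row-major default).
x = np.arange(24)                        # x_0, ..., x_23
t = x.reshape(3, 2, 4)                   # R_{3x2x4} x
assert np.array_equal(t.reshape(24), x)  # R_24 R_{3x2x4} x gives back x

# S: stack operands generated over an iterator into an extra (leading) dimension.
s = np.stack([a * np.ones((1, 5)) for a in range(10, 13)])  # S_{10<=a<13} a * 1_{1x5}
print(s.squeeze())                       # rows of 10s, 11s, 12s
```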
As regards convolutions, there exist three modes according to which they can be computed, commonly named “same”, “full”, and “valid” (see, e.g., the documentation of scipy.signal.convolve2d). For simplicity, we choose to apply the mode “same” throughout the paper. This means the resulting convolution image is of the same size as the input, and it is achieved by aligning the top-left input pixel in every channel with the central weight of a filter. It also means that suitable image padding with zeros must be introduced. We use superscript P to denote a zero-padded tensor; e.g., for a tensor, X, of shape H × V × C , the elements of X P ought to be understood (we use V for width since W is occupied by weights) as
$$x^P_{j,k,q} = \begin{cases} x_{j,k,q}, & \text{if } 0 \le j < H \text{ and } 0 \le k < V; \\ 0, & \text{otherwise}, \end{cases}$$
regardless of the channel index, q.
Frequently, formulas related to convolutions and padding go together with extracting subtensors, i.e., taking subranges from original tensors. For that purpose, we shall use the “:” symbol, working as the range operator in a way known from programming languages, where the index written before the colon is treated inclusively and the one afterwards exclusively. E.g., $X_{j_1:j_2,\,k_1:k_2}$ represents a subtensor consisting of the following elements: $x_{j,k,q}$ for all $j_1 \le j < j_2$, $k_1 \le k < k_2$, and all q (if not restricted; otherwise, subtensors are taken along all the channels).
Furthermore, $\phi$ shall stand for activation functions, and we are going to use that same overloaded symbol for all layers. The reader will be able to understand it correctly from the context, based on the name of $\phi(\cdot)$’s argument. When no activation function for a layer is declared, one should simply treat $\phi$ as an identity function and its derivative as a 1.
Finally, we explain that, in actual DL implementations (also in ours based on NumPy and Numba), all tensors are typically stored with an extra dimension pertaining to a batch of images (data examples) currently processed, rather than a single image. For example, a layer producing 16 convolutional images of size 32 × 32 each, for a batch of 100 input images, can be stored as a 4D tensor of the following shape: 100 × 32 × 32 × 16 . In spite of this, in the mathematical and algorithmic sections, we choose to simplify the considerations (and notation) by not taking that extra dimension into account. In other words, we look at computations as if they were carried for a single data example.

3. Flows of Signals and Gradients—Direct Derivation for a Small Network Example

A good general understanding can be developed by studying small examples first. In this section, we provide a direct “manual” derivation of relevant formulas, working with an illustrative small network consisting of four layers, two of them convolutional. We believe this exercise helps abstract out the general formulas later on.
The structure to consider is shown in Figure 3. Suppose it is meant to perform a classification task with 10 classes, based on 64 × 64 input images with 3 channels. The letters $X, \alpha, \beta, \gamma, Y$ shall be used as names (to refer to layers) and also to represent the tensors storing actual values. In this section, we purposely use explicit numbers to define the sizes of tensors, filters, etc., rather than symbolic constants. Those explicit sizes shall thus be visible in the formulas, allowing the reader to clearly see their impact. Also purposely, we choose the sigmoid activation function for the output layer. We are aware that, e.g., the softmax activation would be a more natural choice for the classification task, but we prefer to postpone the discussion on softmax until Section 4.7.
Below, we specify the names for collections of weights in tunable layers of the network. Convolutional weights are kept as 4D tensors, weights of the dense layer as a matrix, and weights related to free terms (also known as biases) as 1D vectors.
$$W_\alpha = \big(w_{\alpha;a,b,p,q}\big)_{7\times 7\times 3\times 16}, \qquad W_{0,\alpha} = \big(w_{0,\alpha;q}\big)_{16},$$
$$W_\beta = \big(w_{\beta;c,d,q,r}\big)_{5\times 5\times 16\times 32}, \qquad W_{0,\beta} = \big(w_{0,\beta;r}\big)_{32},$$
$$W_Y = \big(w_{Y;e,f}\big)_{10\times 131072}, \qquad W_{0,Y} = \big(w_{0,Y;e}\big)_{10}.$$
For clarity, we picked distinct letters for indexes pertaining to different pieces of information. For example, q is the index of an output channel (kernel) produced via layer α and simultaneously the index of an input channel from the perspective of layer β . Those letters are written in subscripts after a layer name and a semicolon. Additionally, we use the prefix 0 in front of a layer name to denote the collection of its biases.
As regards activation functions in our example, we have $\phi(\beta_{j,k,r}) = [\beta_{j,k,r} > 0]\cdot\beta_{j,k,r}$ since, for layer $\beta$, the ReLU activation is chosen; then, $\phi(\alpha_{j,k,q}) = \alpha_{j,k,q}$ and $\partial\phi(\alpha_{j,k,q})/\partial\alpha_{j,k,q} = 1$, since no activation is declared for layer $\alpha$; and finally, $\phi(y_e) = 1/\big(1+\exp(-y_e)\big)$—the sigmoid function for layer Y.
In the following subsections, we discuss forward and backward computations for the described network. We do so layer by layer, writing first the more natural scalar-form formulas and then providing their tensor-form counterparts. Please note that at the implementation level, tensor formulas are typically well suited for CPU computations based on efficient libraries for algebra (e.g., BLAS and LAPACK underlying Python’s numpy), whereas scalar formulas are typically better suited to (or more easily translatable to) GPU computations.

3.1. Forward Pass

The first layer, named α , is a 2D convolutional layer equipped with 7 × 7 filters using 3 input channels and mapping to 16 output channels (also known as kernels or feature maps). These outputs should be computed as follows:
$$\alpha_{j,k,q} := \sum_{-3 \le a,b \le 3}\;\sum_{0 \le p < 3} w_{\alpha;a,b,p,q}\cdot x_{j+a,\,k+b,\,p}\cdot[0 \le j+a < 64]\cdot[0 \le k+b < 64] \;+\; w_{0,\alpha;q},$$
for all $0 \le j,k < 64$ and $0 \le q < 16$. The tensor-form counterpart of the above scalar formula is given below. It applies the stack and reshape operations, together with image padding.
$$\alpha := \mathop{S}_{0 \le q < 16} R_{64\times 64}\bigg( R_{1\times 7^2\cdot 3}\, W_{\alpha;\cdot,\cdot,\cdot,q} \cdot \underbrace{\mathop{S}_{0 \le j,k < 64} R_{7^2\cdot 3\times 1}\, X^P_{j-3:j+4,\;k-3:k+4}}_{7^2\cdot 3\;\times\;64^2} \bigg) \;+\; \mathop{S}_{0 \le q < 16} w_{0,\alpha;q}\cdot\mathbf{1}_{64\times 64}.$$
At the inner-most level of Formula (9), each image subregion of shape $7\times 7\times 3$ (implied by the filter size and input channels) becomes reshaped to a column vector and stacked into a temporary matrix of the final shape $7^2\cdot 3\times 64^2$. This matrix is then multiplied, from the left, by a subset of filter weights (for the q-th output channel) reshaped to a row vector of shape $1\times 7^2\cdot 3$. Then, the matrices resulting from such multiplications are stacked over all q indexes, $0 \le q < 16$, to produce the final output tensor. The addition of biases takes place in the second summand of (9). Such an approach is referred to as GEMM-based convolution [28,39] (general matrix-to-matrix multiplication, a part of BLAS). No activation is declared for layer $\alpha$; hence, $\phi(\alpha) = \alpha$.
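The sketch below illustrates the GEMM-based idea in plain NumPy for a generic “same” convolution: patches are gathered into columns (the so-called im2col step), multiplied by the filter weights reshaped to rows, and the result is reshaped back and shifted by the biases. It is a simplified illustration with our own function and variable names, not the hmdl code.

```python
import numpy as np

def conv2d_same_gemm(X, W, w0):
    """GEMM-based 'same' convolution.
    X: (H, V, C) input tensor, W: (F, F, C, K) filters, w0: (K,) biases."""
    H, V, C = X.shape
    F, _, _, K = W.shape
    half = F // 2
    Xp = np.pad(X, ((half, half), (half, half), (0, 0)))    # zero padding
    # im2col: each (F, F, C) patch becomes one column of a (F*F*C, H*V) matrix.
    cols = np.empty((F * F * C, H * V))
    for j in range(H):
        for k in range(V):
            patch = Xp[j:j + F, k:k + F, :]
            cols[:, j * V + k] = patch.reshape(-1)
    Wrow = W.reshape(F * F * C, K).T                         # (K, F*F*C)
    out = Wrow @ cols                                        # (K, H*V), the GEMM call
    return out.T.reshape(H, V, K) + w0                       # (H, V, K) plus biases
```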
Now, we move on to layer β . It is equipped with a 5 × 5 filter, and it computes 32 output channels as follows:
$$\beta_{j,k,r} := \sum_{-2 \le c,d \le 2}\;\sum_{0 \le q < 16} w_{\beta;c,d,q,r}\cdot\phi(\alpha_{j+c,\,k+d,\,q})\cdot[0 \le j+c < 64]\cdot[0 \le k+d < 64] \;+\; w_{0,\beta;r},$$
for $0 \le j,k < 64$ and $0 \le r < 32$; or via a tensor counterpart:
$$\beta := \mathop{S}_{0 \le r < 32} R_{64\times 64}\bigg( R_{1\times 5^2\cdot 16}\, W_{\beta;\cdot,\cdot,\cdot,r} \cdot \underbrace{\mathop{S}_{0 \le j,k < 64} R_{5^2\cdot 16\times 1}\, \phi\big(\alpha^P_{j-2:j+3,\;k-2:k+3}\big)}_{5^2\cdot 16\;\times\;64^2} \bigg) \;+\; \mathop{S}_{0 \le r < 32} w_{0,\beta;r}\cdot\mathbf{1}_{64\times 64}.$$
ReLU activation for this layer can be computed as follows:
$$\phi(\beta) := [\beta > 0] \circ \beta,$$
where $[\beta > 0]$ represents a $64\times 64\times 32$ tensor consisting of ones and zeros, depending on whether the condition $\beta > 0$ is satisfied or not element-wise. We remark that formula (12) represents the efficient way to implement the ReLU activation in Python. Even when the $[\beta > 0]$ tensor is a numpy array of bools, it is easy to check that, despite the multiplications involved, this version is faster than versions based on an if–else statement or conditional array indexing.
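For reference, the three Python variants mentioned above can be written as follows (a small illustrative snippet with our own names; timing them is left to the reader):

```python
import numpy as np

beta = np.random.standard_normal((64, 64, 32))

# Formula (12): multiply by the boolean mask [beta > 0].
phi_beta = (beta > 0) * beta

# Alternatives that give the same result:
phi_where = np.where(beta > 0, beta, 0.0)        # vectorized if-else
phi_index = beta.copy(); phi_index[beta <= 0] = 0.0  # conditional array indexing

assert np.allclose(phi_beta, phi_where) and np.allclose(phi_beta, phi_index)
```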
The subsequent layer, γ , flattens the tensor ϕ ( β ) , yielding
$$\gamma := R_{131072\times 1}\,\phi(\beta),$$
and further $\phi(\gamma) := \gamma$, since we have not imposed any activation on the flattening layer (we remark that equipping a flattening layer with an activation is theoretically possible, though seldom done in practice).
The flattened result is a column vector; hence, the subsequent dense layer, Y, can conveniently multiply such an input by its matrix of weights (standard multiplication):
$$Y := W_Y\cdot\phi(\gamma) + W_{0,Y}.$$
And lastly, through the application of the sigmoid activation function, the final network output becomes
$$\phi(Y) := 1\big/\big(1 + \exp(-Y)\big).$$

3.2. Backward Pass

Suppose that categorical cross-entropy is selected as the loss function to fit our network and that $Y^*$ stands for a one-hot encoded vector representing the output wanted for the given training example. The loss between Y and $Y^*$ is
$$l(Y, Y^*) = -\sum_{0 \le e < 10} y^*_e \log \phi(y_e).$$
In accordance with the chain rule, loss derivatives with respect to the $w_{Y;e,f}$ weights can be calculated as the following product:
$$\frac{\partial l}{\partial w_{Y;e,f}} = \frac{\partial l}{\partial \phi(y_e)}\cdot\frac{\partial \phi(y_e)}{\partial y_e}\cdot\frac{\partial y_e}{\partial w_{Y;e,f}} = \frac{-y^*_e}{\phi(y_e)}\cdot\phi(y_e)\big(1 - \phi(y_e)\big)\cdot\phi(\gamma_f),$$
where the second equality applies the actual derivatives.
Now, we start deriving the backpropagation algorithm for our network by writing down partial derivatives of the loss as the following column vector:
$$\delta_l := \bigg( \frac{\partial l}{\partial \phi(y_0)},\; \frac{\partial l}{\partial \phi(y_1)},\; \dots,\; \frac{\partial l}{\partial \phi(y_9)} \bigg)^T.$$
Even though the loss function is not an actual layer in the network, we purposely denote the above vector as $\delta_l$ and treat it as our initial error quantity from which the induction ought to be started. Having $\delta_l$, one can define errors $\delta_Y$ for the proper last layer using Rule 1 (see Section 2.2):
$$\delta_Y := \frac{\partial\phi(Y)}{\partial Y}\circ\big(I_{10}\cdot\delta_l\big) = \bigg( \frac{\partial \phi(y_0)}{\partial y_0}\cdot\frac{\partial l}{\partial \phi(y_0)},\; \frac{\partial \phi(y_1)}{\partial y_1}\cdot\frac{\partial l}{\partial \phi(y_1)},\; \dots,\; \frac{\partial \phi(y_9)}{\partial y_9}\cdot\frac{\partial l}{\partial \phi(y_9)} \bigg)^T,$$
where each entry, for our example, is $\delta_{Y;e} = \phi(y_e)\big(1 - \phi(y_e)\big)\cdot\big(-y^*_e/\phi(y_e)\big)$, $0 \le e < 10$. We explain that the identity matrix $I_{10}$, written purposely in (19), can be seen as a matrix of weights for outgoing connections. Every neuron in the output layer, Y, can be thought of as having exactly one outgoing connection, with weight 1, to a corresponding virtual loss neuron. Obviously, the presence of an identity matrix does not change anything mathematically, but we want to keep the writing comprehensive and consistent with the inductive Rule 1.
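As a quick illustration of (18) and (19) for the sigmoid output combined with categorical cross-entropy, one could compute $\delta_l$ and $\delta_Y$ in NumPy as below (a hypothetical sketch with our own variable names):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

Y = np.random.standard_normal((10, 1))          # weighted sums of the output layer
Y_star = np.zeros((10, 1)); Y_star[3] = 1.0     # one-hot target

phi_Y = sigmoid(Y)
delta_l = -Y_star / phi_Y                       # (18): derivative of the loss w.r.t. phi(y_e)
delta_Y = phi_Y * (1.0 - phi_Y) * delta_l       # (19): sigmoid derivative times delta_l
```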
Having computed δ Y , we can compute the gradient for weights in the last layer with the use of Rule 2. It yields
$$\nabla_Y := \delta_Y\cdot\phi(\gamma)^T,$$
$$\nabla_{0,Y} := \delta_Y\circ\mathbf{1}_{10\times 1},$$
respectively for proper connections and for biases.
We remark that, with (20) and (21) computed, corrections of weights could potentially be performed. For example, using the simple SGD, one would have $W_Y := W_Y - \eta\nabla_Y$ and $W_{0,Y} := W_{0,Y} - \eta\nabla_{0,Y}$, where $\eta > 0$ stands for the learning rate. However, correction steps are typically withheld until the complete backward pass through all the layers is finished.
We now move to the flattening layer, $\gamma$. Again, in accordance with Rule 1, a particular $\delta_{\gamma;f}$ must take into account its own activation derivative and the subsequent δs reached via outgoing connections. In the scalar form, this yields
$$\delta_{\gamma;f} := \underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\sum_{0 \le e < 10} w_{Y;e,f}\cdot\delta_{Y;e},$$
for $0 \le f < 131072$, which, in the tensor form, is equivalent to
$$\delta_\gamma := \frac{\partial\phi(\gamma)}{\partial\gamma}\circ\big(W_Y^T\cdot\delta_Y\big) = \mathbf{1}_{131072}\circ\big(W_Y^T\cdot\delta_Y\big).$$
Again, the initial multiplication by a column vector of ones is presented for the sake of consistency with Rule 1. This vector represents derivatives of the activation function (none declared).
There are no tunable weights in the flattening layer; hence, the gradient $\nabla_\gamma$ need not be computed. Yet, we do need to suitably propagate the δ quantities from layer $\gamma$ back to layer $\beta$. This must be performed via a suitable reshape operation, additionally taking into account the fact that $\beta$ is equipped with the ReLU activation. Suppose, for a moment, that an index, f, of a certain neuron in layer $\gamma$ corresponds to an index, $(g, h, r)$, of a neuron in layer $\beta$. The error quantity for that neuron should be computed as follows:
$$\delta_{\beta;g,h,r} := \frac{\partial\phi(\beta_{g,h,r})}{\partial\beta_{g,h,r}}\cdot 1\cdot\delta_{\gamma;f},$$
which, again, can be seen through the prism of Rule 1 as a “sum” over one outgoing connection with weight 1 that reaches $\delta_{\gamma;f}$. This implies the following tensor formula:
$$\delta_\beta := \frac{\partial\phi(\beta)}{\partial\beta}\circ R_{64\times 64\times 32}\big(I_{131072}\cdot\delta_\gamma\big) = [\beta > 0]\circ R_{64\times 64\times 32}\,\delta_\gamma.$$
Now, we move to loss derivatives with respect to convolutional weights: $w_{\beta;c,d,q,r}$. For mathematical purposes, we introduce the following auxiliary notations related to the mapping between indexes of $\beta$ and $\gamma$ layers. Given a fixed index, r, of an output channel in layer $\beta$, think of the set of indexes corresponding to it within the flattened layer:
$$\{f\}(r) = \{f : 0 \le f < 131072 \text{ and } f \bmod 32 = r\}.$$
E.g., for $r = 3$ in our example, the above set becomes $\{f\}(3) = \{3, 35, 67, \dots, 131043\}$. Furthermore, given a particular index, f, selected from that set, let $\phi(\beta_{g(f),h(f),r})$ denote the activated neuron connected subsequently to $\gamma_f$. Our example implies that $g(f) = \lfloor f/64 \rfloor$ and $h(f) = f \bmod 64$.
Using the above notation, the derivatives we look for are as follows:
$$\frac{\partial l}{\partial w_{\beta;c,d,q,r}} = \sum_{0\le e<10}\underbrace{\frac{\partial l}{\partial\phi(y_e)}\cdot\frac{\partial\phi(y_e)}{\partial y_e}}_{\delta_{Y;e}}\cdot\sum_{\{f\}(r)}\frac{\partial y_e}{\partial\phi(\gamma_f)}\cdot\underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\underbrace{\frac{\partial\gamma_f}{\partial\phi(\beta_{g(f),h(f),r})}}_{1}\cdot\frac{\partial\phi(\beta_{g(f),h(f),r})}{\partial\beta_{g(f),h(f),r}}\cdot\frac{\partial\beta_{g(f),h(f),r}}{\partial w_{\beta;c,d,q,r}}$$
$$= \sum_{0\le e<10}\delta_{Y;e}\cdot\sum_{\{f\}(r)} w_{Y;e,f}\cdot[\beta_{g(f),h(f),r}>0]\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{\{f\}(r)}[\beta_{g(f),h(f),r}>0]\cdot\underbrace{\sum_{0\le e<10} w_{Y;e,f}\cdot\delta_{Y;e}}_{\delta_{\gamma;f}}\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{\{f\}(r)}\underbrace{[\beta_{g(f),h(f),r}>0]\cdot\delta_{\gamma;f}}_{\delta_{\beta;g(f),h(f),r}}\cdot\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})$$
$$= \sum_{0\le g,h<64}\delta_{\beta;g,h,r}\cdot\phi(\alpha_{g+c,\,h+d,\,q}).$$
Below, we write down tensor formulas corresponding to (27). Note that, due to the filter range $-2 \le c,d \le 2$ in layer $\beta$, the indexes of pixels present in $\alpha^P$ can take values within $-2 \le j,k < 64+2$. Therefore, in compliance with Rule 2, the gradient for weights in layer $\beta$ can be written as follows:
$$\nabla_\beta := \mathop{S}_{0\le r<32} R_{5\times 5\times 16}\bigg( R_{1\times 64^2}\,\delta_{\beta;\cdot,\cdot,r}\cdot\underbrace{\mathop{S}_{\substack{-2\le c,d\le 2\\ 0\le q<16}} R_{64^2\times 1}\,\phi\big(\alpha^P_{c:c+64,\;d:d+64,\;q}\big)}_{64^2\;\times\;5^2\cdot 16}\bigg),$$
$$\nabla_{0,\beta} := \mathop{S}_{0\le r<32} R_{1\times 64^2}\,\delta_{\beta;\cdot,\cdot,r}\cdot\mathbf{1}_{64^2\times 1}.$$
We remind the reader that ϕ ( α P ) is, in fact, equal to α P since no activation function for layer α is declared.
We have now reached the left-most layer, $\alpha$, and we need to compute $\delta_\alpha$. To do so, we need to remember that filter weights are shared among many outgoing connections. In particular, given a fixed index of a weight, say $(c, d, q, r)$, one must realize that, for all $j, k$, the quantities $\delta_{\alpha;j,k,q}$ reach the quantities $\delta_{\beta;j-c,\,k-d,\,r}$ (note the minuses) via connections associated with that weight, i.e., $w_{\beta;c,d,q,r}$. Therefore, the proper way to apply Rule 1 now is as follows:
$$\delta_{\alpha;j,k,q} = \underbrace{\frac{\partial\phi(\alpha_{j,k,q})}{\partial\alpha_{j,k,q}}}_{1}\cdot\sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;c,d,q,r}\cdot\delta_{\beta;j-c,\,k-d,\,r}\cdot[0\le j-c<64]\cdot[0\le k-d<64].$$
Please note that, with the substitutions $c := -c$ and $d := -d$, one can equivalently write $\delta_{\alpha;j,k,q}$ down as follows:
$$\delta_{\alpha;j,k,q} = \sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;-c,-d,q,r}\cdot\delta_{\beta;j+c,\,k+d,\,r}\cdot[0\le j+c<64]\cdot[0\le k+d<64].$$
This lets us see the δs resulting from the backward pass strictly as a convolution, rather than a 2D correlation (a nomenclature remark made in the introduction). This means that one needs to use flipped weights, bottom–up and left–right, which we shall denote as $\bar{W}_\beta = \big(\bar{w}_{\beta;c,d,q,r}\big) = \big(w_{\beta;-c,-d,q,r}\big)$ for all valid $c, d, q, r$.
The tensor-form formula for δ α can be written down as
$$\delta_\alpha := \mathop{S}_{0\le q<16} R_{64\times 64}\bigg( R_{1\times 5^2\cdot 32}\,\bar{W}_{\beta;\cdot,\cdot,q,\cdot}\cdot\underbrace{\mathop{S}_{0\le j,k<64} R_{5^2\cdot 32\times 1}\,\delta^P_{\beta;\,j-2:j+3,\;k-2:k+3}}_{5^2\cdot 32\;\times\;64^2}\bigg).$$
In Figure 4, we graphically illustrate computations, both forward and backward, taking place at the junction of layers α and β .
Now, we can derive the loss derivatives with respect to the $w_{\alpha;a,b,p,q}$ weights. The long chain of involved multiplications and summations proceeds in the following manner:
$$\frac{\partial l}{\partial w_{\alpha;a,b,p,q}} = \sum_{0\le e<10}\underbrace{\frac{\partial l}{\partial\phi(y_e)}\cdot\frac{\partial\phi(y_e)}{\partial y_e}}_{\delta_{Y;e}}\cdot\sum_{0\le f<131072}\frac{\partial y_e}{\partial\phi(\gamma_f)}\cdot\underbrace{\frac{\partial\phi(\gamma_f)}{\partial\gamma_f}}_{1}\cdot\underbrace{\frac{\partial\gamma_f}{\partial\phi(\beta_{g(f),h(f),r(f)})}}_{1}\cdot\frac{\partial\phi(\beta_{g(f),h(f),r(f)})}{\partial\beta_{g(f),h(f),r(f)}}$$
$$\qquad\cdot\sum_{-2\le c,d\le 2}\frac{\partial\beta_{g(f),h(f),r(f)}}{\partial\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})}\cdot\underbrace{\frac{\partial\phi(\alpha_{g(f)+c,\,h(f)+d,\,q})}{\partial\alpha_{g(f)+c,\,h(f)+d,\,q}}}_{1}\cdot[0\le g(f)+c<64]\cdot[0\le h(f)+d<64]\cdot\frac{\partial\alpha_{g(f)+c,\,h(f)+d,\,q}}{\partial w_{\alpha;a,b,p,q}}$$
$$= \sum_{0\le f<131072}\underbrace{[\beta_{g(f),h(f),r(f)}>0]\cdot\underbrace{\sum_{0\le e<10} w_{Y;e,f}\cdot\delta_{Y;e}}_{\delta_{\gamma;f}}}_{\delta_{\beta;g(f),h(f),r(f)}}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r(f)}\cdot[0\le g(f)+c<64]\cdot[0\le h(f)+d<64]\cdot x_{g(f)+c+a,\;h(f)+d+b,\;p}$$
$$= \sum_{0\le g,h<64}\;\sum_{0\le r<32}\delta_{\beta;g,h,r}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r}\cdot[0\le g+c<64]\cdot[0\le h+d<64]\cdot x_{g+c+a,\;h+d+b,\;p}.$$
By looking at the subscript of x in the final line, one can observe that the scope of particular image inputs (pixels) involved in the derivative $\partial l/\partial w_{\alpha;a,b,p,q}$ is, in a sense, broadened, as we have passed backwards through two consecutive convolutional layers. This can be noticed in the composed shifts $c+a$ and $d+b$ with respect to a pixel index, $(g, h)$. In other words, by letting the convolutional indexes range as $-3 \le a,b \le 3$ (a $7\times 7$ filter) and $-2 \le c,d \le 2$ (a $5\times 5$ filter), the involved inputs range within a 12 × 12 neighborhood around each $(g, h)$ index of a pixel, disregarding, for a moment, the channel and kernel indexes $p, q, r$. In order to simplify (33) and inductively make use of the quantities $\delta_\alpha$, we introduce the following 1 : 1 substitutions: $g = j - c$ and $h = k - d$. This leads to
$$\frac{\partial l}{\partial w_{\alpha;a,b,p,q}} = \sum_{0\le j-c,\,k-d<64}\;\sum_{0\le r<32}\delta_{\beta;j-c,\,k-d,\,r}\cdot\sum_{-2\le c,d\le 2} w_{\beta;c,d,q,r}\cdot[0\le j<64]\cdot[0\le k<64]\cdot x_{j+a,\,k+b,\,p}$$
$$= \sum_{0\le j,k<64}\;\sum_{-2\le c,d\le 2}\;\sum_{0\le r<32} w_{\beta;c,d,q,r}\cdot\delta_{\beta;j-c,\,k-d,\,r}\cdot[0\le j-c<64]\cdot[0\le k-d<64]\cdot x_{j+a,\,k+b,\,p}$$
$$= \sum_{0\le j,k<64}\delta_{\alpha;j,k,q}\cdot x_{j+a,\,k+b,\,p}.$$
In the second step, we exchanged the ranges of the external summations ($0 \le j-c,\,k-d < 64$) with the indicator functions $[0 \le j < 64]\cdot[0 \le k < 64]$ appearing under the sum (an equivalent replacement). Also, we changed the order of the inner sums (c, d before r) and substituted for $\delta_\alpha$ in accordance with (31).
The corresponding tensor formulas are as follows:
$$\nabla_\alpha := \mathop{S}_{0\le q<16} R_{7\times 7\times 3}\bigg( R_{1\times 64^2}\,\delta_{\alpha;\cdot,\cdot,q}\cdot\underbrace{\mathop{S}_{\substack{-3\le a,b\le 3\\ 0\le p<3}} R_{64^2\times 1}\, X^P_{a:a+64,\;b:b+64,\;p}}_{64^2\;\times\;7^2\cdot 3}\bigg),$$
$$\nabla_{0,\alpha} := \mathop{S}_{0\le q<16} R_{1\times 64^2}\,\delta_{\alpha;\cdot,\cdot,q}\cdot\mathbf{1}_{64^2\times 1}.$$
Our exercise—deriving backpropagation formulas for the network from Figure 3—is finally done. Below, we put it in a summarized algorithm-like form. We concentrate on the backward pass with references to Rule 1 and Rule 2. The rules take place alternately, and we signal their presence even if, in some cases, their execution is void. Weight corrections are written according to the simple SGD method. More advanced optimization approaches are discussed in Section 6.
  • Forward pass: Formulas (8)–(15).
  • Backward pass:
    Compute the initial error quantity $\delta_l$ — see (18).
    Rule 1: compute $\delta_Y$ as a function of $\phi(Y)$, $I_{10}$, $\delta_l$ — see (19).
    Rule 2: compute $\nabla_Y$, $\nabla_{0,Y}$ as a function of $\delta_Y$, $\phi(\gamma)$ — see (20), (21).
    Rule 1: compute $\delta_\gamma$ as a function of $\phi(\gamma)$, $W_Y$, $\delta_Y$ — see (23).
    Rule 2: do not compute $\nabla_\gamma$ (no tunable parameters in layer $\gamma$).
    Rule 1: compute $\delta_\beta$ as a function of $\phi(\beta)$, $I_{131072}$, $\delta_\gamma$ — see (25).
    Rule 2: compute $\nabla_\beta$, $\nabla_{0,\beta}$ as a function of $\delta_\beta$, $\phi(\alpha)$ — see (28), (29).
    Rule 1: compute $\delta_\alpha$ as a function of $\phi(\alpha)$, $W_\beta$, $\delta_\beta$ — see (32).
    Rule 2: compute $\nabla_\alpha$, $\nabla_{0,\alpha}$ as a function of $\delta_\alpha$ and X — see (35), (36).
    Rule 1: do not compute $\delta_X$ (not needed since X is the initial layer).
  • Corrections (simple SGD):
    1. Compute corrections for layer Y: $W_Y := W_Y - \eta\nabla_Y$, $W_{0,Y} := W_{0,Y} - \eta\nabla_{0,Y}$.
    2. Do not compute corrections for layer $\gamma$ (no tunable parameters).
    3. Compute corrections for layer $\beta$: $W_\beta := W_\beta - \eta\nabla_\beta$, $W_{0,\beta} := W_{0,\beta} - \eta\nabla_{0,\beta}$.
    4. Compute corrections for layer $\alpha$: $W_\alpha := W_\alpha - \eta\nabla_\alpha$, $W_{0,\alpha} := W_{0,\alpha} - \eta\nabla_{0,\alpha}$.
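To show how the above summary typically maps onto code, the following structural sketch organizes one SGD training step around the alternation of Rule 1 and Rule 2. The layer interface (forward, backward, gradients) is hypothetical and only illustrates the flow of signals and gradients; it is not the actual hmdl API.

```python
# Illustrative orchestration of one SGD training step (names are hypothetical).
# Each layer is assumed to expose forward(), backward(), and, if it has
# tunable weights, gradients() yielding (weights, gradient) pairs.
def train_step(layers, X, Y_star, loss, eta):
    # Forward pass: signals flow from the input towards the loss.
    signal = X
    for layer in layers:
        signal = layer.forward(signal)

    # Backward pass: start from delta_l and alternate Rule 1 / Rule 2.
    delta = loss.initial_delta(signal, Y_star)   # delta_l, see (18)
    for layer in reversed(layers):
        delta = layer.backward(delta)            # Rule 1 (gradients of Rule 2 stored inside)

    # Corrections (simple SGD), withheld until the whole backward pass is done.
    for layer in layers:
        for W, grad in layer.gradients():        # empty for layers without weights
            W -= eta * grad                      # in-place NumPy update
```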

4. General Formulas for Common Layer Types

Having analyzed a small structure with particular sizes of tensors and filters, in this section, we provide general formulas. The presentation is concise, with more comments given only for new layer types not discussed previously.
As regards notation, in all cases, we consider a junction of two layers (somewhere within a network) named X and Y. More precisely, in the forward pass, an activated tensor or vector, ϕ ( X ) , becomes mapped to Y and then possibly to ϕ ( Y ) . In the backward pass, we want to compute gradients with respect to tunable weights (if present), and to propagate a tensor of quantities δ Y to the preceding tensor, δ X . For layer types discussed previously, we skip forward computations and concentrate only on the backward pass. The shape of input tensors is denoted as H × V × C , understood as height × width × the number of input channels. We pick the letter V to represent widths because W was already used for weights. We reduce considerations to square-sized convolutional filters of the shape F × F , where F is odd, and square pools of shape S × S , all working with a stride of 1. In the case of dense layers, the mapping is performed from a flat vector of length N to another flat vector of size M.
The last subsection is devoted to the softmax activation function, or more generally, to such activations that depend on more than one argument.

4.1. Convolutional Layers (2D)

Let K stand for the number of output channels to be produced (also known as kernels or feature maps). Tensor $W_Y$, storing the convolutional weights, is of the shape $F\times F\times C\times K$, whereas vector $W_{0,Y}$, storing the biases, is of length K.
In the backward pass, the gradients ought to be computed as follows:
$$\nabla_Y := \mathop{S}_{0\le q<K} R_{F\times F\times C}\bigg( R_{1\times H\cdot V}\,\delta_{Y;\cdot,\cdot,q}\cdot\mathop{S}_{\substack{-\lfloor F/2\rfloor\le a,b\le\lfloor F/2\rfloor\\ 0\le p<C}} R_{H\cdot V\times 1}\,\phi\big(X^P_{a:a+H,\;b:b+V,\;p}\big)\bigg),$$
$$\nabla_{0,Y} := \mathop{S}_{0\le q<K} R_{1\times H\cdot V}\,\delta_{Y;\cdot,\cdot,q}\cdot\mathbf{1}_{H\cdot V\times 1}.$$
And the propagation of errors is carried out by computing the following:
$$\delta_X := \frac{\partial\phi(X)}{\partial X}\circ\mathop{S}_{0\le q<C} R_{H\times V}\bigg( R_{1\times F^2\cdot K}\,\bar{W}_{Y;\cdot,\cdot,q,\cdot}\cdot\mathop{S}_{\substack{0\le j<H\\ 0\le k<V}} R_{F^2\cdot K\times 1}\,\delta^P_{Y;\,j-\lfloor F/2\rfloor:j+\lfloor F/2\rfloor+1,\;k-\lfloor F/2\rfloor:k+\lfloor F/2\rfloor+1}\bigg).$$

4.2. Max Pooling Layers (2D)

Using a grid of non-overlapping two-dimensional pools of shape S × S each (or, more precisely, S × S × 1 if their one-channel depth is written explicitly), an input tensor ϕ ( X ) of shape H × V × C becomes mapped to an output tensor, Y, with a reduced shape, H / S × V / S × C .
Single outputs, for all $0 \le j < H/S$, $0 \le k < V/S$, and $0 \le q < C$, are computed as follows:
$$y_{j,k,q} := \max_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
Additionally, in the forward pass, one needs to memorize the coordinates for which the maxima are attained. They shall be needed later for backward propagation. We denote them as
$$a^*_{j,k,q} := \mathop{\arg\max}_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
At the implementation level, (41) can be established first (or in parallel with (40)), and using its entries, one can populate Y. For example, when $a^*_{j,k,q} = (u^*, v^*)$, then $y_{j,k,q} := \phi(x_{u^*,v^*,q})$.
We remark that max pooling layers may potentially be equipped with an activation function, mapping Y non-linearly to ϕ ( Y ) , although this is seldom met in practice.
As regards the backward pass, the calculus-based rationale for max pooling layers is as follows. For suitably small changes in all $\phi(x_{u,v,q})$ values, the maxima are attained at the same locations. Then, each entry, $y_{j,k,q}$, can be, in fact, treated as a function of only one input signal—$\phi(x_{u^*,v^*,q})$, where $a^*_{j,k,q} = (u^*, v^*)$. From this point of view, Formula (40) is equivalent to a summation over artificial outgoing connections—one connection with weight 1 (associated with the maximum) and all others with weights 0 (for non-maximum entries). Special cases where several non-strict (equal) maxima exist are ignored in practice, having a negligible influence on the learning procedure. We represent this in the following way:
$$y_{j,k,q} := 1\cdot\phi(x_{u^*,v^*,q}) + \sum_{(u,v)\ne(u^*,v^*)} 0\cdot\phi(x_{u,v,q}).$$
Therefore, all intermediate terms $\partial y_{j,k,q}/\partial\phi(x_{u,v,q})$ appearing along the chain(s) of derivatives can be replaced with $\partial y_{j,k,q}/\partial\phi(x_{u^*,v^*,q})$. At the implementation level, this implies that values from tensor $\delta_Y$ should be propagated only to the corresponding $(u^*, v^*, q)$ locations in $\delta_X$, whereas the remaining entries within $\delta_X$ should be populated with zeros:
$$\delta_{X;u,v,q} := \begin{cases} \dfrac{\partial\phi(x_{u,v,q})}{\partial x_{u,v,q}}\cdot\delta_{Y;\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q}, & \text{if } a^*_{\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q} = (u, v); \\ 0, & \text{otherwise}. \end{cases}$$
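A minimal NumPy sketch of (40)–(43) for non-overlapping $S\times S$ max pooling is given below; it assumes H and V are divisible by S and no activation on the pooled layer ($\phi$ = identity), and the function names are ours:

```python
import numpy as np

def maxpool_forward(X, S):
    """Non-overlapping S x S max pooling of X with shape (H, V, C).
    Returns the pooled tensor and the argmax coordinates needed backward."""
    H, V, C = X.shape
    Y = np.empty((H // S, V // S, C))
    argmax = np.empty((H // S, V // S, C, 2), dtype=int)
    for j in range(H // S):
        for k in range(V // S):
            pool = X[j * S:(j + 1) * S, k * S:(k + 1) * S, :]   # (S, S, C)
            flat = pool.reshape(S * S, C)
            idx = flat.argmax(axis=0)                           # per-channel argmax
            Y[j, k, :] = flat[idx, np.arange(C)]
            argmax[j, k, :, 0] = j * S + idx // S               # u* coordinates
            argmax[j, k, :, 1] = k * S + idx % S                # v* coordinates
    return Y, argmax

def maxpool_backward(delta_Y, argmax, X_shape):
    """Routes each delta only to the location where the maximum was attained."""
    delta_X = np.zeros(X_shape)
    Hs, Vs, C = delta_Y.shape
    for j in range(Hs):
        for k in range(Vs):
            for q in range(C):
                u, v = argmax[j, k, q]
                delta_X[u, v, q] = delta_Y[j, k, q]
    return delta_X
```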

4.3. Average Pooling Layers (2D)

Sometimes, a sum- or mean-based information reduction turns out to work better than a max-based reduction. Average pooling serves that purpose. Forward and backward computations are similar to the ones associated with max pooling, and, again, it is easy to understand them through the prism of Rule 1, thinking about artificial outgoing connections. This time, all such connections have equal weights of 1 / S 2 . The key formulas, given below, are counterparts of (40) and (43).
$$y_{j,k,q} := \frac{1}{S^2}\sum_{\substack{jS\le u<(j+1)S\\ kS\le v<(k+1)S}}\phi(x_{u,v,q}).$$
$$\delta_{X;u,v,q} := \frac{\partial\phi(x_{u,v,q})}{\partial x_{u,v,q}}\cdot\frac{1}{S^2}\cdot\delta_{Y;\lfloor u/S\rfloor,\,\lfloor v/S\rfloor,\,q}.$$
Figure 5 shows an illustrative comparison between max pooling and average pooling forward computations.

4.4. Flattening Layers

A flattening layer is typically applied directly before a block of dense layers in a network. That is, when a 3D tensor from a previous layer needs to be flattened to become a 1D input to a dense layer. Although there is not much mathematics involved in flattening layers, and there are no tunable parameters, suitable forward and backward formulas can still be written in accordance with our general scheme:
$$Y := \mathrm{R}_{H \cdot V \cdot C \times 1}\big(\phi(X)\big).$$
$$\delta_X := \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(I_{H \cdot V \cdot C} \cdot \delta_Y\big) = \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(\delta_Y\big).$$
The identity matrix I H · V · C represents (temporarily) weights of outgoing connections. The left illustration in Figure 6 depicts backward computations for a flattening layer.
Just as for pooling layers, we remind the reader that Y can potentially be mapped non-linearly to ϕ ( Y ) if an activation function is declared for a flattening layer (seldom in practice).

4.5. Dropout Layers

Dropout layers can be regarded as probabilistic regularizers. For an imposed constant rate, $r^* \in [0, 1)$, a dropout layer allows a random subset of its input signals, approximately a fraction of $1 - r^*$, to pass through as outputs, whereas the remaining part, approximately $r^*$, is dropped, i.e., becomes zeroed.
Consider a tensor, $R = (r_{j,k,q})$, with the same shape as the input, $H \times V \times C$, whose entries are drawn randomly from a standard uniform distribution, $r_{j,k,q} \sim U(0,1)$. Treating $R$ as a tensor caters to the general case. Dropout layers can, in particular, also be applied after flat or dense layers with tensor $R$ reduced to a 1D vector. During the fit stage, the forward computations are performed as follows:
$$Y := \frac{1}{1 - r^*} \cdot [R > r^*] \odot \phi(X).$$
Note that the distribution of entries in tensor $[R > r^*]$ complies with the Bernoulli distribution (having the success parameter set to $1 - r^*$).
The $1/(1 - r^*)$ factor in (48) is introduced to increase the strength of signals passing through. E.g., for $r^* = 0.2$, the sum of surviving signals constitutes approximately $8/10$ of the original total (in the sense of the absolute value). Therefore, the multiplication by $10/8$ preserves a roughly similar flow compared to the one occurring when the dropout layer is switched off—namely, at the prediction stage. Then, all the signals are allowed to pass through unchanged. The complete justification behind the manner in which dropout operates is provided in the regularization-related Section 7.3. Perhaps surprisingly, the mechanism described above is connected to model averaging techniques known from ensemble methods.
As regards the backward pass, the δ s from the subsequent layer are propagated in the following way
$$\delta_X := \phi'(X) \odot \frac{1}{1 - r^*} \cdot [R > r^*] \odot \delta_Y.$$
Note that, at the implementation level, the whole tensor $\frac{1}{1 - r^*} \cdot [R > r^*]$ used during the forward pass (including the batch dimension) can be memorized and reused during backward computations. Figure 6 (right-hand side) illustrates backward computations for a dropout layer.
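A minimal NumPy sketch of the above inverted-dropout scheme (48)–(49) follows. It is independent of the hmdl classes and, as in the previous sketch, assumes the identity activation so that $\phi'$ can be omitted.

```python
import numpy as np

def dropout_forward(phi_x, r_star, rng):
    """Inverted dropout at the fit stage: zero a ~r_star fraction of signals
    and rescale the survivors by 1/(1 - r_star)."""
    R = rng.uniform(size=phi_x.shape)
    coeffs = (R > r_star) / (1.0 - r_star)   # memorized for the backward pass
    return coeffs * phi_x, coeffs

def dropout_backward(delta_Y, coeffs):
    """Backward pass: deltas flow only through the surviving connections."""
    return coeffs * delta_Y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3)).astype(np.float32)
y, coeffs = dropout_forward(x, 0.2, rng)
dx = dropout_backward(np.ones_like(y), coeffs)
```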

4.6. Dense Layers

Consider a dense layer mapping from a vector, ϕ ( X ) , with N inputs to a vector, Y, with M outputs. We choose the shape of matrix W Y to be M × N —the variant in which weights incoming to an output neuron are written as a row, i.e.,
$$W_Y = \big(w_{Y;j,k}\big)_{0 \le j < M,\ 0 \le k < N},$$
with biases $W_{0,Y} = (w_{0,Y;j})_{0 \le j < M}$ being a column vector. Formulas for forward and backward computations are given below.
$$Y := W_Y \cdot \phi(X) + W_{0,Y}.$$
$$\nabla_Y := \delta_Y \cdot \phi(X)^T.$$
$$\nabla_{0,Y} := \delta_Y \cdot \mathbb{1}_{N \times 1}.$$
$$\delta_X := \phi'(X) \odot \big(W_Y^T \cdot \delta_Y\big).$$
We must remark that, in (54), the activation function of the previous layer is assumed to depend only on one argument, i.e., only on the weighted sum of the neuron it activates. Possible examples are the sigmoid, the hyperbolic tangent, and ReLU, as opposed to, e.g., the softmax function, discussed in the next subsection.
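The following is a minimal NumPy sketch of the dense-layer computations (51)–(54) for a single example, with weights stored as an $M \times N$ matrix as above. It is a simplified illustration rather than the hmdl code: for a batch, the bias gradient additionally sums the $\delta_Y$ columns over the batch dimension, and the previous layer's derivative `dphi_x` is set to ones here (identity activation).

```python
import numpy as np

def dense_forward(phi_x, W, W0):
    """Y := W * phi(X) + W0; W has shape (M, N), phi_x shape (N, 1), W0 shape (M, 1)."""
    return W.dot(phi_x) + W0

def dense_backward(delta_Y, phi_x, W, dphi_x):
    """Gradients w.r.t. weights/biases (Rule 2) and the delta for the previous
    layer (Rule 1); dphi_x holds phi'(X) evaluated element-wise."""
    grad_W = delta_Y.dot(phi_x.T)           # (M, N)
    grad_W0 = delta_Y.copy()                # (M, 1); summed over a batch in general
    delta_X = dphi_x * (W.T.dot(delta_Y))   # (N, 1)
    return grad_W, grad_W0, delta_X

rng = np.random.default_rng(0)
N, M = 5, 3
phi_x = rng.standard_normal((N, 1))
W, W0 = rng.standard_normal((M, N)), np.zeros((M, 1))
delta_Y = rng.standard_normal((M, 1))
dphi_x = np.ones((N, 1))                    # identity activation in the previous layer
grad_W, grad_W0, delta_X = dense_backward(delta_Y, phi_x, W, dphi_x)
```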

4.7. Softmax Activation

Softmax activations play an important role since their outcomes can be interpreted as probability distributions. They are commonly imposed on final output layers in classification tasks. The crucial property of the softmax function is that it depends not only on one weighted sum incoming to the neuron it activates but on all weighted sums computed via the layer. In consequence, a certain adjustment has to be made in the scheme of backward computations—element-wise multiplication by a vector of activation derivatives is no longer possible. But, as we show below, this adjustment can also be seen through the prism of main Rule 1 if softmax computations are treated as a two-staged process.
Let us continue with the notation from the preceding subsection and suppose that layer $X$ is now equipped with the softmax activation, with individual outputs computed as
$$\phi(x_k) := \exp(x_k) \Big/ \sum_{0 \le z < N} \exp(x_z), \quad 0 \le k < N.$$
One should realize that a change in a single $x_z$ signal causes changes in all $\phi(x_k)$ values, $0 \le k < N$, because of the denominator, and thereby further changes in all neurons of layer $Y$. Therefore, to suitably propagate back the errors $\delta_Y$ onto $\delta_X$, it is convenient to think of an intermediate stage of errors that we denote as $\delta_{\phi(X)}$; see Figure 7.
Partial derivatives $\partial \phi(x_z) / \partial x_k$ express quantitatively how strongly $\phi(x_z)$ is affected by a change in $x_k$. Hence, one can write the following formulas for backward propagation by applying Rule 1 twice. In the second case, the $\partial \phi(x_z) / \partial x_k$ are treated as artificial weights of outgoing connections.
$$\delta_{\phi(X);k} := \sum_{0 \le j < M} w_{Y;j,k} \cdot \delta_{Y;j}, \quad 0 \le k < N,$$
$$\delta_{X;k} := \sum_{0 \le z < N} \frac{\partial \phi(x_z)}{\partial x_k} \cdot \delta_{\phi(X);z}, \quad 0 \le k < N.$$
All partial derivatives of the softmax function (or, in fact, any other) can be gathered into a matrix of the form
$$D_{\phi(X)} := \begin{pmatrix} \dfrac{\partial \phi(x_0)}{\partial x_0} & \dfrac{\partial \phi(x_1)}{\partial x_0} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_0} \\ \dfrac{\partial \phi(x_0)}{\partial x_1} & \dfrac{\partial \phi(x_1)}{\partial x_1} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_1} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial \phi(x_0)}{\partial x_{N-1}} & \dfrac{\partial \phi(x_1)}{\partial x_{N-1}} & \cdots & \dfrac{\partial \phi(x_{N-1})}{\partial x_{N-1}} \end{pmatrix}$$
and applied in the following general formula for backward propagation through a dense layer:
$$\delta_X := D_{\phi(X)} \cdot W_Y^T \cdot \delta_Y.$$
Please note that Formula (54) is a special case of (59) for activation functions depending on a single argument.
As regards actual derivatives for the softmax function, it is easy to check that, after some algebra, one obtains the following:
$$\frac{\partial \phi(x_z)}{\partial x_k} = \begin{cases} \phi(x_z)\big(1 - \phi(x_z)\big), & \text{for } z = k; \\ -\phi(x_z)\,\phi(x_k), & \text{for } z \ne k. \end{cases}$$
This means that explicit analytic expressions are not needed at the implementation level, and one can reuse neurons' outcomes from the forward pass to compute the derivatives quickly during the backward pass. A similar property holds for sigmoid activation functions, where $\phi'(x) = \phi(x)(1 - \phi(x))$, which agrees with the softmax partial derivatives for $z = k$.
In short, (60) can be written as $\partial \phi(x_z)/\partial x_k = \phi(x_z)\big([z = k] - \phi(x_k)\big)$, and in matrix form, it becomes:
$$D_{\phi(X)} = \operatorname{diag}\big(\phi(X)\big) - \phi(X) \cdot \phi^T(X).$$
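As a quick illustration, the short NumPy sketch below builds the softmax Jacobian from forward outcomes only and checks one entry against the element-wise formula (60). It is a standalone example, not the hmdl code, and the max subtraction inside the softmax is an extra numerical-stability convenience.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(phi_x):
    """D_phi(X) = diag(phi(X)) - phi(X) phi(X)^T, built from forward outputs only."""
    phi_x = phi_x.reshape(-1, 1)
    return np.diagflat(phi_x) - phi_x.dot(phi_x.T)

x = np.array([0.2, -1.0, 0.5, 1.3])
p = softmax(x)
D = softmax_jacobian(p)
z, k = 2, 0
# the matrix entry agrees with phi(x_z) * ([z = k] - phi(x_k))
assert np.isclose(D[k, z], p[z] * ((z == k) - p[k]))
```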

5. Weight Initialization

A cautious initialization of weights in deep networks is of great importance. Its goal is to prevent signals that travel through many layers from exploding or vanishing. Without proper initialization, such problems are likely to occur, and their probability increases with the number of layers. They are commonly referred to as exploding gradients or vanishing gradients to indicate that the learning process no longer converges, but despite the names, the problem pertains not only to the backward flows—it is equally valid for the forward ones. To provide a simple and convincing numerical argument, below, we repeat one after [50].
Think of a network consisting of 100 dense layers, each with 512 neurons (without activations). Let x be the input vector also of length 512, and suppose its entries are drawn from a standard normal distribution, N ( 0 , 1 ) . We are going to check what happens if initial weights for all layers are also drawn from N ( 0 , 1 ) . The forward pass, represented by the following Python code (Figure 8), consists of 100 consecutive multiplications:
  • Its execution produces the printout "nan, nan", which informs us about the numerically unwanted behavior. In fact, it is easy to check that, after 28 iterations, all elements inside x are at least of order $10^{35}$ and some of order $10^{38}$—close to the maximum non-infinity value allowed for floating-point numbers of single precision. One iteration later, there are already 503 nan values and 9 inf values inside x.
On the other hand, it is easy to check that weights must not be too small either. The following carefree correction (Figure 9) scales weights by a factor of $10^{-2}$:
  • And the program execution produces the printout “0.0, 0.0”—the signals have vanished.
The mathematical rationale behind the above observations is that, for all layers, any element of the vector resulting from W.dot(x) arises as a sum of 512 statistically independent summands. Therefore, the variance of such an element is equal to the sum of component variances. Note that the fact that each summand in itself is a product (a weight times an input) can be neglected because weights and inputs are mutually independent. Strictly speaking, this is true for the considered simplified example due to zero-mean distributions (for both weights and inputs) and due to the lack of activation functions. More general analysis is provided in the subsequent sections.
Since each element of the resulting vector arises as a sum of 512 summands, a sensible first guess could be to scale weights by $1/\sqrt{512}$—the inverse of the standard deviation of the mentioned sum. More generally, when input sizes differ, a plausible scaling factor seems to be the inverse of the square root of the number of inputs. In the following code (Figure 10) we apply the guess:
  • And the obtained printout is “-0.016219143 0.81295455”—the signals have neither exploded nor vanished.
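For readers without access to Figures 8–10, the sketch below reconstructs the three experiments in one function under stated assumptions (float32 arithmetic, a modern NumPy random generator instead of the figures' exact code). The comments describe the qualitative outcomes; the exact printed numbers depend on the random generator and seed and need not match the article's printouts literally.

```python
import numpy as np

def deep_signal_flow(scale, layers=100, size=512, seed=0):
    """Push a standard-normal input through `layers` weight matrices drawn
    from N(0, 1) and multiplied by `scale`; return mean and std of the output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(size).astype(np.float32)
    for _ in range(layers):
        W = (scale * rng.standard_normal((size, size))).astype(np.float32)
        x = W.dot(x)
    return float(x.mean()), float(x.std())

print(deep_signal_flow(1.0))                 # explodes: nan values appear
print(deep_signal_flow(0.01))                # vanishes: everything becomes 0.0
print(deep_signal_flow(1.0 / np.sqrt(512)))  # stays within a sensible range
```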
The accurate analysis in subsequent sections shows that the above first guess is not far off from the truth. The analysis follows the works of Glorot [17] and He [29] that resulted in two state-of-the-art methods for weight initialization.

5.1. Randomness and Variance Analysis Setup

Consider a network with $L$ layers. In the analyses below, we use layer indexes, $i = 1, 2, \ldots, L$, instead of names. Let $\overrightarrow{N}_i$ denote the number of inputs incoming to each neuron in the $i$-th layer during the forward pass—hence, the symbolic left-to-right arrow. We take into account only layers with tunable weights (dense, convolutional) and assume that the following unified linear expression (one should think here of a standard matrix multiplication for dense layers, and a GEMM-based approach for convolutional layers) (62) describes the way a layer operates. In other words, multiplication by a transformation matrix $W_i$ with weights suitably arranged takes place:
$$Y_i = W_i \cdot X_i + W_{0,i},$$
$$X_{i+1} = \phi(Y_i).$$
Rows of $W_i$ and column(s) of $X_i$ are both of length $\overrightarrow{N}_i$; in particular, for convolutional layers, $\overrightarrow{N}_i = F_i^2 \cdot C_i$, where $F_i \times F_i$ represents the filter shape and $C_i$ the number of input channels. The following randomness-related conditions are satisfied:
  • Random elements w i in each W i are mutually independent and share the same distribution (i.i.d.—independent, identically distributed);
  • Likewise, elements x i in each X i are i.i.d.;
  • X i and W i are independent of each other;
  • Biases are initialized with zeros.
Knowing that any element $y_i$ of $Y_i$ in (62) becomes computed as a sum (of products) with $\overrightarrow{N}_i$ mutually independent summands having the same distribution, we have
$$\operatorname{Var}[y_i] = \overrightarrow{N}_i \operatorname{Var}[w_i x_i] = \overrightarrow{N}_i \Big( \mathbb{E}[w_i^2 x_i^2] - \mathbb{E}^2[w_i x_i] \Big) = \overrightarrow{N}_i \Big( \mathbb{E}[w_i^2]\, \mathbb{E}[x_i^2] - \underbrace{\mathbb{E}^2[w_i]}_{0}\, \mathbb{E}^2[x_i] \Big) = \overrightarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}[x_i^2] = \overrightarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}\big[\phi^2(y_{i-1})\big],$$
where $w_i$, $x_i$ stand for single elements in $W_i$ and $X_i$. Note that $\mathbb{E}[x_i^2] \ne \operatorname{Var}[x_i]$, unless $x_i$ has a zero mean. For example, for ReLU, $x_i = \phi(y_{i-1}) = [y_{i-1} > 0]\, y_{i-1}$; thus, clearly, $\mathbb{E}[x_i] \ne 0$.
To make use of Equation (64), one needs to transform it into a recurrence binding $\operatorname{Var}[y_i]$ with $\operatorname{Var}[y_{i-1}]$. Both Glorot [17] and He [29] achieve this, however, via different means. Glorot assumes symmetric and approximately linear activation functions $\phi$ around 0 (true, e.g., for a sigmoid or hyperbolic tangent but not necessarily in general), whereas He focuses on convolutional layers activated by ReLU, takes advantage of its properties, and achieves a more elegant result, though restricted specifically to ReLU (or a parametric family of ReLUs).

5.2. Glorot Initialization

The assumption that $\phi$ is approximately linear around 0 implies that $x_i = \phi(y_{i-1}) \approx y_{i-1}$ has a mean approximately equal to 0. Therefore, $\mathbb{E}\big[\phi^2(y_{i-1})\big] \approx \operatorname{Var}[y_{i-1}]$, and by inserting this approximation into (64), the following recurrence is obtained:
$$\operatorname{Var}[y_i] = \overrightarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[y_{i-1}].$$
Expanding it over L layers leads to the following product:
$$\operatorname{Var}[y_L] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \overrightarrow{N}_i \operatorname{Var}[w_i]$$
which is crucial to the design of weight initialization. Clearly, an incorrect design may cause the signals to magnify or vanish exponentially. Therefore, as Glorot [17] rightly remarks, to keep the information flowing normally during the forward computations, an obvious sufficient condition is to keep all factors of product (66) equal to 1:
$$\overrightarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
This suggests that one could draw initial weights, for example, from a zero-mean Gaussian distribution with the standard deviation $1/\sqrt{\overrightarrow{N}_i}$. However, this conclusion is premature because the backward flows of signals must also be taken into account.
Analogously to (62) and (63), during the backward pass, we can think of the following operation:
$$\delta_{X_i} = \phi'(X_i) \odot \big(\overline{W}_i \cdot \delta_{Y_i}\big),$$
where $\overline{W}_i$ denotes a suitable backward transformation matrix (e.g., a transposed matrix of weights for dense layers, or a flipped and rearranged matrix of filter weights for convolutional layers; see, e.g., (54) and (39)). This time, $\overleftarrow{N}_i$ stands for the number of "backward inputs" incoming to each neuron from the right. For convolutional layers, $\overleftarrow{N}_i = F_i^2 \cdot K_i$, where $K_i$ is the number of output channels (kernels).
Similarly to before, we look at the variance of any single element, $\delta_{x_i}$, in $\delta_{X_i}$. That variance is equal to the sum of variances of $\overleftarrow{N}_i$ independent summands, yielding
$$\operatorname{Var}[\delta_{x_i}] = \overleftarrow{N}_i \operatorname{Var}\big[\phi'(x_i)\, w_i\, \delta_{y_i}\big] = \overleftarrow{N}_i \operatorname{Var}[w_i]\, \mathbb{E}\big[\big(\phi'(x_i)\, \delta_{y_i}\big)^2\big]$$
—a counterpart of (64). This time, the assumption that $\phi$ is approximately linear around 0 implies that $\phi'(x_i) \approx 1$ can be inserted into (69), and the following backward recurrence for the variance of the $\delta$s is obtained:
$$\operatorname{Var}[\delta_{x_i}] = \overleftarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[\delta_{x_{i+1}}],$$
where $\delta_{x_{i+1}} = \phi'(x_i)\, \delta_{y_i}$ defines the recursive quantity. Expanding the recurrence yields the product
$$\operatorname{Var}[\delta_{x_1}] = \operatorname{Var}[\delta_{x_L}] \prod_{i=2}^{L} \overleftarrow{N}_{L-i+1} \operatorname{Var}[w_{L-i+1}],$$
which, from the backward computations point of view, suggests the following sufficient condition:
$$\overleftarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Please note that, in general, one is unable to ensure that conditions (67) and (72) are satisfied simultaneously. Therefore, as a compromise, Glorot [17] proposed the application of an average, namely
$$\operatorname{Var}[w_i] = 2 \big/ \big(\overrightarrow{N}_i + \overleftarrow{N}_i\big), \quad \forall i.$$
This proposition implies the following two recipes for a random initialization of weights, based on normal and uniform distribution, respectively.
  • Glorot normal initialization:
    $w_i \sim N\big(0,\ 2/(\overrightarrow{N}_i + \overleftarrow{N}_i)\big)$.
  • Glorot uniform initialization:
    $w_i \sim U\big({-\sqrt{6/(\overrightarrow{N}_i + \overleftarrow{N}_i)}},\ \sqrt{6/(\overrightarrow{N}_i + \overleftarrow{N}_i)}\big)$.
The strange-looking $\sqrt{6}$ in (75) arises as the product of $\sqrt{2}$ and $\sqrt{3}$, since the standard deviation of $U(-1, 1)$ is $1/\sqrt{3}$.
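A minimal NumPy sketch of the two Glorot recipes (74) and (75) follows; `fan_in` and `fan_out` play the roles of the forward and backward input counts from the analysis above. It is an illustration under these naming assumptions, not the hmdl initialization code.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, 2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in)).astype(np.float32)

def glorot_uniform(fan_in, fan_out, rng):
    """Weights ~ U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in)).astype(np.float32)

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)
print(W.var(), 2.0 / (256 + 128))   # empirical variance close to the target
```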

5.3. He Initialization

With the ReLU function in mind, [29] also studied the variance in signals described by (64) and noticed that the following interesting property holds:
$$\mathbb{E}[x_i^2] = \mathbb{E}\big[([y_{i-1} > 0]\, y_{i-1})^2\big] = \int_0^{\infty} y_{i-1}^2\, p_{i-1}(y_{i-1})\, dy_{i-1} = \tfrac{1}{2} \operatorname{Var}[y_{i-1}].$$
The last equality holds because the distribution of $y_{i-1}$ is symmetric around 0 due to the symmetric distribution of $W_{i-1}$. Thus, (64) can be transformed into the following elegant recurrence:
$$\operatorname{Var}[y_i] = \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[y_{i-1}],$$
which, expanded over L layers, yields the product
$$\operatorname{Var}[y_L] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i].$$
Therefore, a sufficient condition to prevent the product from growing or vanishing exponentially is
$$\tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Note the presence of the factor $1/2$ in contrast to (67). In fact, for the first layer ($i = 1$), one could have $\overrightarrow{N}_1 \operatorname{Var}[w_1] = 1$ because there is no ReLU applied to the input signals. But a redundant factor, $1/2$, existing only on one layer does not matter; thus, He [29] keeps "for all $i$" in (79) for simplicity. The derived condition suggests that one could draw initial weights from a zero-mean Gaussian distribution with standard deviation $\sqrt{2/\overrightarrow{N}_i}$, or a uniform distribution, $U\big({-\sqrt{6/\overrightarrow{N}_i}},\ \sqrt{6/\overrightarrow{N}_i}\big)$. However, similarly as before, to make the analysis complete, one has to take backward computations into account as well.
The backward recurrence of interest includes the 1 / 2 factor:
$$\operatorname{Var}[\delta_{x_i}] = \tfrac{1}{2}\, \overleftarrow{N}_i \operatorname{Var}[w_i] \operatorname{Var}[\delta_{x_{i+1}}],$$
and, when expanded, implies the product
$$\operatorname{Var}[\delta_{x_1}] = \operatorname{Var}[\delta_{x_L}] \prod_{i=2}^{L} \tfrac{1}{2}\, \overleftarrow{N}_{L-i+1} \operatorname{Var}[w_{L-i+1}]$$
and the following sufficient condition for the backward flow:
$$\tfrac{1}{2}\, \overleftarrow{N}_i \operatorname{Var}[w_i] = 1, \quad \forall i.$$
Instead of looking for a compromise between (79) and (82), [29] noticed something more interesting—namely, for convolutional layers, it is sufficient to apply (79) or (82) alone. Recall that $\overrightarrow{N}_i = F_i^2 \cdot C_i$ and $\overleftarrow{N}_i = F_i^2 \cdot K_i$. Therefore, if, for example, (82) is applied in (78), then the product becomes
$$\operatorname{Var}[y_1] \prod_{i=2}^{L} \tfrac{1}{2}\, \overrightarrow{N}_i \operatorname{Var}[w_i] = \operatorname{Var}[y_1] \prod_{i=2}^{L} \frac{\overrightarrow{N}_i}{\overleftarrow{N}_i} = \operatorname{Var}[y_1] \prod_{i=2}^{L} \frac{F_i^2 C_i}{F_i^2 K_i} = \operatorname{Var}[y_1]\, C_2 / K_L.$$
The last equality holds, as successive ratios cancel out, owing to $C_{i+1} = K_i$ for all $i \ge 2$. The final constant means that, if weight initialization properly scales the signals in the forward flow, then it simultaneously does so in the backward flow, or vice versa. Most libraries pick (79) and offer the following initialization for convolutional layers activated via ReLUs.
  • He normal initialization:
    $w_i \sim N\big(0,\ 2/\overrightarrow{N}_i\big) = N\big(0,\ 2/(F_i^2 C_i)\big)$,
  • He uniform initialization:
    $w_i \sim U\big({-\sqrt{6/\overrightarrow{N}_i}},\ \sqrt{6/\overrightarrow{N}_i}\big) = U\big({-\sqrt{6/(F_i^2 C_i)}},\ \sqrt{6/(F_i^2 C_i)}\big)$.
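Analogously, a short NumPy sketch of the He recipes (84) and (85) for a convolutional layer with $K$ kernels of shape $F \times F \times C$ is given below; again, this is an illustrative sketch rather than the hmdl code.

```python
import numpy as np

def he_normal_conv(F, C, K, rng):
    """Filter weights ~ N(0, 2 / (F*F*C)) for a conv layer activated by ReLU."""
    std = np.sqrt(2.0 / (F * F * C))
    return rng.normal(0.0, std, size=(F, F, C, K)).astype(np.float32)

def he_uniform_conv(F, C, K, rng):
    """Filter weights ~ U(-sqrt(6/(F*F*C)), +sqrt(6/(F*F*C)))."""
    limit = np.sqrt(6.0 / (F * F * C))
    return rng.uniform(-limit, limit, size=(F, F, C, K)).astype(np.float32)

rng = np.random.default_rng(0)
W = he_normal_conv(3, 64, 32, rng)
print(W.var(), 2.0 / (3 * 3 * 64))   # empirical variance close to the target
```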

6. Learning Algorithms (Optimizers)

Error functions under optimization (also known as objective or cost functions) are typically stochastic—hence the name stochastic gradient descent (SGD). There are several sources of stochasticity: evaluation at random data subsamples called batches (or mini-batches), inherent noise in acquired data, dropout regularization, etc. In this section, we discuss some learning algorithms for neural networks—variants of SGD. We devote the majority of space to Adam [25], arguably the most popular state-of-the-art optimizer.
Many SGD variants for large data sets are first-order methods (they do not use second or higher-order derivatives) that try to speed up convergence via adjustments made to a certain momentum term (or terms), learning rate(s), or both. The historical roots of these approaches can be seen in the classical momentum method from 1964 [51] and the RPROP algorithm from 1992 [52]. RPROP introduced individual learning rates for different dimensions (components) of a gradient. The rates were multiplied by 1.2 for dimensions preserving the minimization direction over successive steps and halved for dimensions where "turns" occurred. The contemporary techniques have similar mechanisms, but they are typically backed with a smoother adaptation using the idea of an exponential moving average.
For notational simplicity, throughout this section, we write just W and just ∇ to denote the vector (or tensor) of all network’s weights and the gradient of all partial derivatives, respectively; we shall not distinguish layer names. Instead, we write indexes of time moments in the subscripts (letter t), assuming t = 0 constitutes the initial moment. When the learning epochs are divided into batches, then successive optimization steps are, in fact, associated with updates after successive batches.

6.1. Exponential Moving Average (EMA)

Given a sequence, x 0 , x 1 , , consider a new sequence generated from it as follows:
$$y_t := \beta\, y_{t-1} + (1 - \beta)\, x_t, \quad t = 0, 1, \ldots;$$
where $y_{-1} = 0$ for initialization, and $\beta \in [0, 1)$ is the rate of exponential decay, frequently set to at least $0.9$. The expansion of (86) produces the following pattern:
$$y_t := \beta^2 y_{t-2} + (1-\beta)(x_t + \beta x_{t-1}) = \beta^3 y_{t-3} + (1-\beta)\big(x_t + \beta x_{t-1} + \beta^2 x_{t-2}\big) = \cdots = \beta^{t+1} \underbrace{y_{-1}}_{0} + (1-\beta) \sum_{i=0}^{t} \beta^{t-i} x_i,$$
where the weights under the summation change according to a geometric progression. The $\beta$ values close to 1 have the effect of "long memory" because many past terms, $x_i$, contribute to $y_t$. This can be described as exponential smoothing over a long period. Instead, $\beta$ values closer to 0 translate into a "short memory"—weights further in the past fade out quickly. Note that (87) is a biased (not normalized) average because the weights do not add up to 1, unless $\beta = 0$, which would imply $y_t = x_t$ for all $t$. More strictly, for a fixed $t$, the sum of weights is $(1 - \beta^{t+1})/(1 - \beta)$—a potential correcting normalizer.
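The few lines of NumPy below illustrate the bias of the EMA near the start of a sequence and the effect of the correcting normalization (the same correction that reappears later in Adam). A constant signal is used on purpose so that the initialization bias is clearly visible; this is a toy sketch, not library code.

```python
import numpy as np

def ema(xs, beta):
    """Biased exponential moving average (86) and its bias-corrected counterpart."""
    y, ys, ys_corrected = 0.0, [], []
    for t, x in enumerate(xs):
        y = beta * y + (1.0 - beta) * x
        ys.append(y)
        ys_corrected.append(y / (1.0 - beta ** (t + 1)))  # normalize by the sum of weights
    return np.array(ys), np.array(ys_corrected)

xs = np.ones(10)          # a constant signal makes the initialization bias visible
biased, corrected = ema(xs, 0.9)
print(biased[:3])         # starts far below 1 because y_{-1} = 0
print(corrected[:3])      # equals 1 from the very first step
```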

6.2. AdaGrad and RMSProp

The AdaGrad (Adaptive Gradient) algorithm [19] adapts the learning rate via a division by the square root of the cumulated past squared gradients (including the current one):
$$v_t := v_{t-1} + \nabla_t^2,$$
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, \nabla_t,$$
where $v_{-1} = 0$, $\epsilon = 10^{-7}$, and $\nabla_t^2 = \nabla_t \odot \nabla_t$. The $\epsilon$ constant is a safety valve to prevent a division by zero. Since the division and $\sqrt{\cdot}$ operations are performed element-wise, the effective learning rates for particular gradient dimensions become individual.
As regards the RMSProp algorithm [20], it was designed as a version of RPROP suitable for mini-batches, and it can be viewed as AdaGrad equipped with an EMA, namely
$$v_t := \beta\, v_{t-1} + (1 - \beta)\, \nabla_t^2,$$
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, \nabla_t,$$
where $v_{-1} = 0$, and again, $\beta = 0.9$ is a common choice.
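A minimal NumPy sketch of single update steps for both rules (88)–(91) follows; the learning rates used are illustrative defaults, and the gradient is assumed to be supplied by the caller.

```python
import numpy as np

def adagrad_step(W, grad, v, eta=0.01, eps=1e-7):
    """One AdaGrad update: accumulate squared gradients, divide element-wise."""
    v = v + grad * grad
    W = W - eta / (np.sqrt(v) + eps) * grad
    return W, v

def rmsprop_step(W, grad, v, eta=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update: an EMA of squared gradients instead of a raw sum."""
    v = beta * v + (1.0 - beta) * grad * grad
    W = W - eta / (np.sqrt(v) + eps) * grad
    return W, v

W = np.array([1.0, -2.0, 0.5])
v = np.zeros_like(W)
grad = np.array([0.3, -0.1, 0.7])
W, v = rmsprop_step(W, grad, v)
```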

6.3. Adam

The Adam algorithm (the name derived from "adaptive moment estimation"), proposed by Kingma and Ba [25], is a combination of the momentum and RMSProp approaches. It uses unbiased estimates of gradient moments to adapt individual non-explicit learning rates: the first moment (the mean) and the second raw moment (the uncentered variance). The algorithm tracks the EMAs of the gradient ($m_t$) and the squared gradient ($v_t$) using two hyper-parameters, $\beta_1, \beta_2 \in [0, 1)$, that control the decay. Their well-performing default values are $0.9$ and $0.999$, respectively.
The averages are initialized as tensors of zeros, leading to biased estimates, especially during the initial time steps, and especially when the decay rates are small (i.e.,  β s close to 1). Fortunately, the initialization bias can be counteracted easily, resulting in bias-corrected estimates named m ^ t and v ^ t [25].
When starting from $m_{-1} = v_{-1} = 0$, successive estimates of the first moment and the second raw moment are computed as follows:
$$m_t := \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_t,$$
$$v_t := \beta_2\, v_{t-1} + (1 - \beta_2)\, \nabla_t^2,$$
and their corresponding recursive expansions are as follows:
$$m_t = (1 - \beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \nabla_i, \qquad v_t = (1 - \beta_2) \sum_{i=0}^{t} \beta_2^{t-i}\, \nabla_i^2.$$
To technically check whether the moving averages constitute unbiased estimators of the true moments, one needs to take their expected values and see whether the following properties hold: $\mathbb{E}[m_t] = \mathbb{E}[\nabla_t]$ and $\mathbb{E}[v_t] = \mathbb{E}[\nabla_t^2]$. We check this for the first moment estimator (skipping the analogous derivation for the second):
$$\mathbb{E}[m_t] = \mathbb{E}\bigg[(1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \nabla_i\bigg] = (1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \mathbb{E}[\nabla_i] = (1-\beta_1) \sum_{i=0}^{t} \beta_1^{t-i}\, \mathbb{E}[\nabla_t] + \zeta = \big(1-\beta_1^{t+1}\big)\, \mathbb{E}[\nabla_t] + \zeta.$$
The term $\zeta$ is introduced as an error so that $\mathbb{E}[\nabla_i]$ can be approximated by $\mathbb{E}[\nabla_t]$. As explained in [25], one either has $\zeta = 0$ when the true moment $\mathbb{E}[\nabla_i]$ is stationary, or otherwise, $\zeta$ is kept sufficiently small since the exponential decay rate, $\beta_1$, can (and should) be chosen such that the EMA assigns very small weights to gradients far in the past. Apparently, the proposed estimators are biased, since $\mathbb{E}[m_t] \approx (1-\beta_1^{t+1})\, \mathbb{E}[\nabla_t] \ne \mathbb{E}[\nabla_t]$ and $\mathbb{E}[v_t] \approx (1-\beta_2^{t+1})\, \mathbb{E}[\nabla_t^2] \ne \mathbb{E}[\nabla_t^2]$, but fortunately, they can easily be corrected via scalar divisions:
$$\hat{m}_t := m_t \big/ \big(1 - \beta_1^{t+1}\big).$$
$$\hat{v}_t := v_t \big/ \big(1 - \beta_2^{t+1}\big).$$
Finally, the updates of weights are performed as
$$W_{t+1} := W_t - \eta\, \hat{m}_t \big/ \big(\sqrt{\hat{v}_t} + \epsilon\big),$$
where $\epsilon = 10^{-7}$.
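The complete update, gathering (92)–(98), fits into a few lines of NumPy; the sketch below applies Adam to a toy quadratic loss and is meant as an illustration of the formulas rather than the hmdl implementation discussed in Section 8.6.

```python
import numpy as np

def adam_step(W, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for time step t (t = 0, 1, ...), following (92)-(98)."""
    m = beta1 * m + (1.0 - beta1) * grad           # EMA of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad    # EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** (t + 1))           # bias corrections
    v_hat = v / (1.0 - beta2 ** (t + 1))
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.array([1.0, -2.0, 0.5])
m, v = np.zeros_like(W), np.zeros_like(W)
for t in range(3):
    grad = 2.0 * W                                 # gradient of a toy quadratic loss
    W, m, v = adam_step(W, grad, m, v, t)
```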
The following two properties of Adam should be emphasized. The first is that its effective step sizes are approximately bounded by $\eta$, since $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ stays approximately within $\pm 1$ in statistically common scenarios [25]. The second is that Adam is invariant to the scale of gradients; e.g., scaling all $\nabla$s by a factor, $c$, scales $\hat{m}_t$ also by a factor of $c$, and $\hat{v}_t$ by a factor of $c^2$. Those factors cancel out in the expression $\hat{m}_t / \sqrt{\hat{v}_t} = (c\, \hat{m}_t) \big/ \sqrt{c^2\, \hat{v}_t}$, neglecting the small $\epsilon$ for a moment.
Figure 11, repeated after the original article by Kingma and Ba (2014) [25], presents plots comparing the training costs (losses) of Adam and other SGD algorithms while fitting on the MNIST and CIFAR-10 data sets.

6.4. Other Ideas: Nadam, Adamax, AMSGrad

Owing to Nesterov's idea [53], some optimization techniques can be equipped with "prescience" by looking one step ahead and computing the gradient not with respect to the current position, i.e., the current weights, $W_t$, but the approximate future position. For example, introducing this idea into the momentum technique, known as NAG (the Nesterov accelerated gradient), can be written down as follows:
$$m_t := \beta\, m_{t-1} + (1 - \beta)\, \nabla\big|_{W_t - \eta m_{t-1}},$$
$$W_{t+1} := W_t - \eta\, m_t.$$
The subscript $W_t - \eta m_{t-1}$ at $\nabla$ in (99) indicates that we use the previous averaged velocity, $m_{t-1}$, to look at the hypothetical position $W_t - \eta m_{t-1}$ and treat it as the argument at which to calculate the gradient, rather than at $W_t$. In general, the anticipatory approach helps improve the convergence [20,53,54].
Using this idea, in 2016, Dozat proposed the Nadam algorithm (Adam + Nesterov) [31]. We present it concisely. Observe that Adam's update on weights can be equivalently written as
$$W_{t+1} := W_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1\, \hat{m}_{t-1} + \frac{1 - \beta_1}{1 - \beta_1^{t+1}}\, \nabla_t \right)$$
—obtained by inserting (92) into (96) and (98). With all other formulas preserved, Nadam arises by replacing, in (101), the previous moment, $\hat{m}_{t-1}$, with the current one, $\hat{m}_t$.
Adamax is a variant of Adam proposed by its very authors [25]. Instead of using the inverse of the $\ell_2$-norm (of past gradients) to scale the current one, the authors looked at the $\ell_p$-norm for the sake of generalization. They observed that, for the special case of $p \to \infty$, a surprisingly simple and stable algorithm emerges (not necessarily the case for a large but finite $p$). To derive Adamax, Kingma and Ba first defined a generalized moving average for $v_t$ as follows:
$$v_t = \beta_2^p\, v_{t-1} + (1 - \beta_2^p)\, |\nabla_t|^p = (1 - \beta_2^p) \sum_{i=0}^{t} \beta_2^{p(t-i)}\, |\nabla_i|^p,$$
where its hyperparameter is written on purpose as $\beta_2^p$, i.e., with an explicit $p$-th power (which can be regarded as equivalent to some other fractional parameter, $\tilde{\beta}_2 \in [0, 1)$, where $\tilde{\beta}_2 = \beta_2^p$, and for the moment $p$ is fixed). Then, they defined $u_t = \lim_{p \to \infty} v_t^{1/p}$, and they observed that
$$u_t = \lim_{p \to \infty} \big(1 - \beta_2^p\big)^{1/p} \bigg( \sum_{i=0}^{t} \beta_2^{p(t-i)}\, |\nabla_i|^p \bigg)^{1/p} = \max\big( \beta_2^{t}\, |\nabla_0|,\ \beta_2^{t-1}\, |\nabla_1|,\ \ldots,\ \beta_2\, |\nabla_{t-1}|,\ |\nabla_t| \big),$$
which corresponds to the simple and elegant recursive rule $u_t := \max\{\beta_2\, u_{t-1},\ |\nabla_t|\}$. We remind the reader that the $\max\{\cdot\}$ operation here is element-wise, as in previous contexts. Finally, the whole Adamax algorithm can be stated as follows:
$$m_t := \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_t,$$
$$u_t := \max\{\beta_2\, u_{t-1},\ |\nabla_t|\},$$
$$W_{t+1} := W_t - \eta\, \frac{m_t / \big(1 - \beta_1^{t+1}\big)}{u_t},$$
where $m_{-1} = u_{-1} = 0$. Note the lack of bias correction in $u_t$.
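For completeness, a single Adamax step (104)–(106) in the same sketch style as before; the toy gradient keeps `u` strictly positive, so no extra safeguard against division by zero is added here.

```python
import numpy as np

def adamax_step(W, grad, m, u, t, eta=1e-3, beta1=0.9, beta2=0.999):
    """One Adamax update following (104)-(106); u tracks an exponentially
    weighted infinity norm of past gradients, element-wise."""
    m = beta1 * m + (1.0 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))        # no bias correction needed for u
    W = W - eta * (m / (1.0 - beta1 ** (t + 1))) / u
    return W, m, u

W = np.array([1.0, -2.0, 0.5])
m, u = np.zeros_like(W), np.zeros_like(W)
for t in range(3):
    grad = 2.0 * W
    W, m, u = adamax_step(W, grad, m, u, t)
```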
Last but not least, we mention AMSGrad—a fairly new variant of Adam by [33]. AMSGrad sticks to the $\ell_2$-norm, but it ensures that the current $\hat{v}_t$ is not smaller than the previous one. With the other formulas preserved, the key difference boils down to the definition $\hat{v}_t := \max\{\hat{v}_{t-1},\ v_t\}$.

7. Regularization

Regularization mitigates overfitting, usually by imposing constraints on weights (but not only on weights). This section discusses three popular regularization techniques: $\ell_2$, $\ell_1$, and dropout. In fact, the first two have been known in machine learning and numerical methods since long before the "era" of DL. The last one was born in 2012 [22] specifically for DL.
Let us recall that, in the $\ell_p$ metric space, the $p$-norm of a vector of weights is defined as $\|W\|_p = \big( \sum_k |w_k|^p \big)^{1/p}$. For notational simplicity, we do not specify the number of weights, but we assume that $k$ iterates over all of them. Additionally, it shall be useful to explicitly define the error function as a sum of losses over data points, plus a suitable regularization term. The sum of losses can reflect different learning modes: on-line (if reduced to a single summand), off-line (a sum over the whole data set), and mini-batches (in between). We denote it as $\operatorname{err}(W) = \sum_i l_i(W) + \cdots$, emphasizing the functional dependence on weights.

7.1. $\ell_2$ Regularization (Weight Decay)

The optimization goal with $\ell_2$ regularization is to minimize the sum of losses, $\sum_i l_i(W)$, subject to the following constraint on the 2-norm (i.e., the Euclidean norm): $\|W\|_2 \le r$. Geometrically, this means that we want the weights to be kept within a ball (the dimensionality of the ball being equal to the number of weights) of radius $r$. Commonly, one incorporates the constraint into the cost function and applies the following equivalent form [55,56]:
$$\operatorname{err}(W) = \sum_i l_i(W) + \tfrac{1}{2}\, \lambda\, \|W\|_2^2,$$
where $\lambda \ge 0$ is a penalty hyperparameter controlling the strength of regularization. There exists a 1:1 correspondence between $r$ and $\lambda$. Large $\lambda$s translate into small radii.
Taking a derivative of (107) with respect to any w k yields
$$\frac{\partial \operatorname{err}(W)}{\partial w_k} = \sum_i \frac{\partial l_i(W)}{\partial w_k} + \lambda\, w_k.$$
Therefore, in the simple SGD scenario, the update of a single weight can be carried out in the following manner:
$$w_k := w_k - \eta \sum_i \frac{\partial l_i(W)}{\partial w_k} - \eta\, \lambda\, w_k.$$
Such an update rule is very simple to implement. Other advanced SGD variants (Adam, Adamax, AMSGrad, etc.) pose no difficulties either because $\nabla$, present in the various formulas in Section 6, should now be understood as the gradient of the whole cost function, including the regularization term, i.e., $\nabla = \big( \sum_i \partial l_i(W)/\partial w_k + \lambda w_k \big)_k$, generated for all $k$. It is also easy to see how (109) tries to shrink the weights at each update step [55], i.e., pull them towards zero, owing to the last summand, $-\eta \lambda w_k$.
Finally, it is worth pointing out a potential nomenclature ambiguity. In the literature on neural networks, 2 regularization is often known by the name of weight decay [57,58,59]. Please note that this should not be confused with learning rate decay—a scheduling technique [60,61], present, e.g., in Keras and PyTorch, that is, in fact, not a regularizer.

7.2. $\ell_1$ Regularization (Lasso)

The cost function with $\ell_1$ regularization is
$$\operatorname{err}(W) = \sum_i l_i(W) + \kappa\, \|W\|_1,$$
where we chose $\kappa \ge 0$ to denote the regularization hyperparameter. As a side remark, please note that mixtures of $\ell_2$ and $\ell_1$ regularizations are also possible, implying the following cost: $\sum_i l_i(W) + \tfrac{1}{2}\lambda \|W\|_2^2 + \kappa \|W\|_1$ [62,63]. The meaning of $\ell_1$ is that, again, the weights are constrained to stay within a certain ball, but this time a diamond-shaped ball implied by the 1-norm, i.e., the sum of absolute values of the weights. This "encourages" many weights not only to be small but to be exactly zero (or at least very close to zero in the numerical sense, e.g., less than $10^{-5}$). That is why $\ell_1$ is often regarded as a feature selection mechanism [12] and is known by the name lasso—least absolute shrinkage and selection operator [64].
Commonly, in practice, derivatives of (110) take the form
$$\frac{\partial \operatorname{err}(W)}{\partial w_k} = \sum_i \frac{\partial l_i(W)}{\partial w_k} + \kappa\, \operatorname{sgn}(w_k),$$
in spite of | w k | not being differentiable at zero [56,63,65].
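In code, both penalties reduce to adding simple terms to the gradient of the data loss before any optimizer from Section 6 is applied. The sketch below follows (108), (109), and (111); the function name and the plain SGD step are illustrative, not the hmdl API.

```python
import numpy as np

def regularized_gradient(grad_loss, W, lam=0.0, kappa=0.0):
    """Add the l2 (weight decay) and l1 (lasso) terms to the gradient of the
    data loss; lam and kappa are the penalty strengths from (107) and (110)."""
    return grad_loss + lam * W + kappa * np.sign(W)

W = np.array([0.5, -1.2, 0.0, 2.0])
grad_loss = np.array([0.1, -0.3, 0.2, 0.05])
grad = regularized_gradient(grad_loss, W, lam=1e-4, kappa=1e-5)
W = W - 0.01 * grad        # a plain SGD step; any optimizer from Section 6 applies
```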

7.3. Dropout Regularization

The dropout technique was proposed by Hinton, Srivastava, et al. and presented in two articles [22,66]. If it is known how dropout layers operate (Section 4.5), it may not be evident at first glance that they, in fact, perform model averaging—the approach well known from ensemble methods [2,67]. The averaging within dropout is performed in a “camouflaged” way since the direct way would be prohibitively time-consuming, facing three problems: 1. training multiple large networks (preferably with different structures) requires a lot of computations, 2. large networks require large amounts of data (preferably not the same per network), and 3. evaluating multiple responses at the test stage to combine predictions is unacceptable in many applications [66]. The dropout technique addresses all three problems in a probabilistic manner.
Consider a network with $N$ units (neurons), where each unit can be dropped (along with its connections) with probability $r^*$ at the training stage. Since random neurons are dropped, there is almost certainly a different network structure, simpler than the complete one, trained for each data presentation. However, all of these networks share the same weights for the units that remain present. This contrasts with traditional ensemble methods, where models are trained from scratch with independent parameters. From that perspective, a network with $N$ units can be seen as a family of $2^N$ "thinner" networks, and random dropout makes it possible to train a huge number of them in a reasonable time with an extensive sharing of weights [22].
Originally [66], it was postulated to leave the output signals of preserved units unchanged at the training stage and to multiply all signals by $1 - r^*$ at test time, when the dropout is switched off. Note that this constitutes a simple approximate averaging mechanism. Signals or, equivalently, weights of the final network are scaled-down versions of the trained ones. Multiplication by $1 - r^*$ at test time ensures that the expected output (under the training-stage distribution) is the same as the actual output at the test stage. Hence, owing to scaling, the hypothetical networks become combined into a single "mean" network.
In Section 4.5, we described dropout layers operating differently, namely with signals scaled by $1/(1 - r^*)$ at the training stage and unchanged at testing—the default version in many frameworks (Keras, TensorFlow, and PyTorch). One should realize that the two approaches are equivalent.
Based on multiple experiments, Hinton et al. found dropout to be superior in terms of generalization to other regularization methods [66]. How can dropout be motivated? According to its authors, dropout reduces complex co-adaptations. This means that individual units are driven to be more robust in doing something useful without relying on many other units. They are forced to work in the presence of a random subset of other units and, therefore, with little collaboration. Without dropout, a frequent tendency is that units become co-adapted, i.e., learn a certain complex collaboration, often driven by noise in training data, that relies on the presence of many of them or even all of them. A nice, non-technical explanation of this phenomenon comes from considerations of successful conspiracies—quoting [66]: "(…) Ten conspiracies each involving five people is probably a better way to create havoc than one big conspiracy that requires fifty people to all play their parts correctly (…)".

8. Implementation and Experiments

As declared in the introduction, the Python project named hmdl (“home-made deep learning”), associated with this tutorial, constitutes the referenced from-scratch implementation of crucial deep learning ideas. We devote this section to that project and to experiments carried out using it. We hope the contents that follow help the reader understand how DL’s mathematics and algorithms presented so far become translated into the realm of programming and computations.
We begin with a high-level view of hmdl (Section 8.1), presenting its object-oriented design, the main classes, and their responsibilities. In Section 8.3, we show how forward and backward computations are organized in compliance with the main inductive rules (Rule 1 and Rule 2), and we present excerpts of the source code responsible for those computations for common layer types. More attention in this respect is given to convolutional layers, whose various implementations are described separately in Section 8.4. Then, in Section 8.5, we present the compact functions responsible for the random initialization of weights in compliance with Glorot's or He's approaches. The next section, Section 8.6, presents the implementation of the Adam algorithm in hmdl, coupled with regularizations. Finally, in the last Section 8.7, we demonstrate thorough experiments carried out with hmdl and Keras, involving multiple network architectures and three well-known data sets. We instruct the reader as to how the experiments can be reproduced.
  • Note on source codes: When presenting excerpts of hmdl source code in this manuscript, we have removed, for clarity and brevity, the auxiliary code parts, such as verbosity printouts, time measurements, etc. (for full codes, we direct the reader to the repository).

8.1. “Home-Made Deep Learning” Project—A High-Level View

The repository for the hmdl project is placed at https://github.com/pklesk/hmdl. At this opportunity, we would like to give great credit to the Numba compiler (https://numba.pydata.org). Using numba.jit and numba.cuda amenities, important parts of hmdl were compiled into fast machine code (CPU and GPU, respectively). The main module is implemented in the hmdl.py file. The UML diagram in Figure 12 provides a high-level view of hmdl and the classes it involves.
The abstract class Layer is the backbone of the hierarchy, and there are five specialized classes that inherit from it. They represent frequently applied layer kinds discussed in former sections: Conv2D, MaxPool2D, Dense, Flatten, and Dropout. Other new extensions can be introduced. There are some pieces of information common to all layers, defined within the Layer class, in particular input_, output_, delta_backward_input_, delta_, delta_backward_output_, weights_, and gradient_. All of those should be understood as tensors stored as numpy arrays of numpy.float32 numbers. To be precise, weights_ and gradient_ are short lists of numpy arrays. This allows for independent access to proper weights and biases. The dimensionality of arrays is implied by the layer kind. For example, for the Conv2D or MaxPool2D layers, the input_ tensor is a 5-dimensional array, where the successive dimensions represent indexes of the data example (within a batch), image row, image column, input channel, and output channel. As regards Dense layers, their input_ and output_ are 2-dimensional arrays (indexing only data examples and neurons).
Additionally, one should notice the properties named activation_ and activation_d_ present in the Layer class. They are function pointers (or None values), representing the activation function and its derivative, respectively. We apply Python’s getattr(…) mechanism to obtain suitable pointers from string names specified by the user, e.g., "relu", "sigmoid", "softmax", etc.

8.2. Executions of Forward and Backward Computations

During the forward computations, every input_ is mapped to an output_, possibly with the use of weights_ and an invocation of activation_. As regards the backward flow, firstly, the delta_ tensor becomes computed based on activation_d_ and delta_backward_input_ (the latter stores weighted sums of subsequent δ s) in compliance with the inductive Rule 1. Then, the gradient_ becomes computed based on delta_ and input_ in compliance with Rule 2. Finally, the responsibility of each layer is to prepare a delta_backward_output_ tensor, using delta_ and weights_, so that it constitutes a delta_backward_input_ from the perspective of the preceding layer.
To suitably handle the described forward and backward flows, each layer is equipped with function pointers named do_forward_function, do_backward_function, and do_backward_output_function. The design of hmdl allows for the provision of more than one implementation for any of those. An actual implementation to be executed can be chosen either manually by the user or automatically by the framework itself in an attempt to improve efficiency (taking into account the sizes of tensors, filters, etc.). The source code shown below (Figure 13) presents two main functions that execute forward and backward computations from within the abstract class Layer. Inside the functions, one can note suitable invocations performed via the pointers do_forward_function, do_backward_function, and do_backward_output_function, assumed to be defined by subclasses.
Moreover, it should be remarked that layer instances belonging to the same class but present in different places of a network structure are allowed to execute different implementation variants of their forward/backward computations. For example, in the Conv2D class, there exist 7 implementation variants that can be inserted into the do_forward_function pointer. They differ with respect to computational resources (CPU: numpy or numba.jit; GPU: numba.cuda) or the algorithmic approach (direct, tiles, or via FFT). We discuss them in Section 8.4.
Networks built with hmdl are instances of the SequentialClassifier class. Its add(…) method allows the user to append successive layers in a manner similar to Keras’s API [68]. The class is additionally equipped with the fit, predict, and predict_proba methods, following the convention known from the sklearn module. In fact, the SequentialClassifier inherits from two classes, BaseEstimator and ClassifierMixin, that come from sklearn.base. It is also worth remarking that a classifier instance initializes every backward flow induction by calling the self.loss_d_(…) function, a pointer to the categorical_crossentropy_d or square_loss_d. Figure 14 presents the source code of the fit function for SequentialClassifier, whereas Figure 15 presents the core parts of its forward and backward functions involved in the process of fitting.

8.3. Automatic Differentiation (Backward Computations) for Common Layer Types

Flattening layers (class Flatten): We purposely start this subsection with the simplest layer type as far as differentiation is concerned—namely the flattening layer. We remind the reader that the chain rule (Rule 1) that computes the δ X tensor within the backpropagation induction says to apply the derivative of the activation function multiplied by the weighted sum of subsequent δ Y values taken over outgoing connections. In the case of flattening layers, each neuron—a signal in the input tensor X (of shape H × V × C )—has exactly one outgoing connection to its counterpart in the flattened vector Y. Therefore, the appropriate inductive formula (we repeat Formula (47) from Section 4) is as follows:
$$\delta_X := \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(I_{H \cdot V \cdot C} \cdot \delta_Y\big) = \phi'(X) \odot \mathrm{R}_{H \times V \times C}\big(\delta_Y\big).$$
Backward computations associated with this formula are reduced to a suitable call of the reshape function, preceded by a call of self.activation_d_() to compute the derivative of the activation function (if any activation was imposed by the user). The source code in Figure 16 presents those simple computations.
Dropout layers (class Dropout): In the case of dropout layers, automatic differentiation is also quite simple. Below, we repeat Formulas (48) and (49) for both the forward and backward computations of those layers. We remind the reader that r * represents the imposed rate of signals to be dropped (zeroed) in the current pass of computations, whereas the tensor R = ( r j , k , q ) , of shape H × V × C , has entries drawn randomly from a standard uniform distribution, r j , k , q U ( 0 , 1 ) .
$$Y := \frac{1}{1 - r^*} \cdot [R > r^*] \odot \phi(X), \qquad \delta_X := \phi'(X) \odot \frac{1}{1 - r^*} \cdot [R > r^*] \odot \delta_Y.$$
At the implementation level, it is useful to memorize, during the forward pass, the tensor of random entries, [ R > r * ] (the ones to be preserved), together with their multipliers, 1 / ( 1 r * ) , and then to make use of it during the backward pass. In the code from Figure 17, that combined information is memorized as the attribute self.coeffs_. As was the case for the flattening layers, also here, in the dropout layers, each “neuron” can be regarded as having exactly one outgoing connection.
Dense layers (class Dense): In contrast to flattening and dropout layers, dense layers do include tunable parameters—weights. Therefore, the backward pass for this layer type involves both the inductive computations of δ values (Rule 1) and the computations of the gradient (Rule 2). We remind the reader of the necessary Formulas (52)–(54) from Section 4 below.
$$\nabla_Y := \delta_Y \cdot \phi(X)^T, \qquad \nabla_{0,Y} := \delta_Y \cdot \mathbb{1}_{N \times 1}, \qquad \delta_X := \phi'(X) \odot \big(W_Y^T \cdot \delta_Y\big).$$
The code excerpts shown in Figure 18 are responsible for the above computations. It is also worth recalling (as noted in Section 8.1) that, inside hmdl tensors, the first dimension is always related to the data example index within the current batch. Therefore, in the code below, the attributes delta_backward_input_, delta_, and delta_backward_output_ are two-dimensional arrays, where the first dimension pertains to data examples and the second to neurons within the given layer. Owing to that fact, suitable summations over the current batch needed to compute the self.gradient_ are carried out by means of the .dot product, i.e., the standard algebraic matrix multiplication (see the function do_backward_numpy).
Max pooling layers (class MaxPool2D): In the hmdl project, there are three implementation variants for computations carried out within the max pooling layers. The first one is a “pure” numpy variant. The second is an accelerated version of the first, obtained by means of numba module and the @jit(nopython=True, …) decorator, allowing for low-level just-in-time compilation. The third variant also applies numba, but it performs GPU computations—it requires the @cuda.jit decorator over a suitably programmed kernel function in compliance with the CUDA computational model. For clarity of presentation, below, we show the source codes only for the simplest numpy-based variants of computations. Source codes for the two other variants are moved to Appendix A.
Recall the notation and suitable formulas from Section 4; the pools are assumed to be of size $S \times S$, and the single outputs, for all $0 \le j < H/S$, $0 \le k < V/S$, and $0 \le q < C$, are computed as follows:
$$y_{j,k,q} := \max_{\substack{jS \le u < (j+1)S \\ kS \le v < (k+1)S}} \phi(x_{u,v,q}).$$
Again, it is useful to memorize, during the forward pass, the coordinates for which the maxima are attained, denoted by
$$a^*_{j,k,q} := \operatorname*{arg\,max}_{\substack{jS \le u < (j+1)S \\ kS \le v < (k+1)S}} \phi(x_{u,v,q}),$$
as they are needed at the stage of backward computations. Figure 19 presents the source codes responsible for forward computations in the max pooling layers in the simplest numpy-based variant.
As regards the backward pass and automatic differentiation for this layer type, Formula (43), required for that purpose, is as follows:
$$\delta_{X;u,v,q} := \begin{cases} \phi'(x_{u,v,q}) \cdot \delta_{Y;\lfloor u/S\rfloor,\, \lfloor v/S\rfloor,\, q}, & \text{if } a^*_{\lfloor u/S\rfloor,\, \lfloor v/S\rfloor,\, q} = (u,v); \\ 0, & \text{otherwise.} \end{cases}$$
Backward computations compliant with this formula are presented in Figure 20 (numpy-based variant). Please note how the attribute self.argmax_, memorized during the forward pass, is reused at this stage. We remind the reader that other computational variants are available in Appendix A.
Convolutional layers (class Conv2D): In the hmdl project, there are seven implementation variants of forward computations for convolutional layers and four variants of backward computations. Since this subsection is on automatic differentiation, we discuss here only the backward variants. Forward variants and their efficiency are discussed in a separate section, Section 8.4.
As regards backward computations in convolutional layers, the variants available in the hmdl project are numpy_gemm, numba_jit_gemm, numba_cuda_direct, and numba_cuda_tiles. For clarity of presentation, here, we present source codes only for the numpy_gemm variant; the remaining variants are moved to Appendix B.
We start by reminding the reader of Formula (38) (from Section 4) pertaining to the computation of gradient for convolutions in compliance with the GEMM approach:
$$\nabla_Y := \mathop{\mathrm{S}}_{0 \le q < K}\ \mathrm{R}_{F \times F \times C}\!\left( \mathrm{R}_{1 \times H \cdot V}\big(\delta_{Y;\cdot,\cdot,q}\big) \cdot \mathop{\mathrm{S}}_{\substack{-\lfloor F/2 \rfloor \le a, b \le \lfloor F/2 \rfloor \\ 0 \le p < C}} \mathrm{R}_{H \cdot V \times 1}\Big( \phi\big(X\big)^{P}_{a : a + H,\ b : b + V,\ p} \Big) \right), \qquad \nabla_{0,Y} := \mathop{\mathrm{S}}_{0 \le q < K} \mathrm{R}_{1 \times H \cdot V}\big(\delta_{Y;\cdot,\cdot,q}\big) \cdot \mathbb{1}_{H \cdot V \times 1},$$
Figure 21, shown below, presents the source code responsible for those computations.
As regards error expressions for convolutional layers, their computation has been defined by Formula (39), which we remind the reader of below.
$$\delta_X := \phi'(X) \odot \mathop{\mathrm{S}}_{0 \le q < C}\ \mathrm{R}_{H \times V}\!\left( \mathrm{R}_{1 \times F^2 \cdot K}\big(\overline{W}_{Y;\cdot,\cdot,q,\cdot}\big) \cdot \mathop{\mathrm{S}}_{\substack{0 \le j < H \\ 0 \le k < V}} \mathrm{R}_{F^2 \cdot K \times 1}\Big( \delta^{P}_{Y;\ j - \lfloor F/2 \rfloor : j + \lfloor F/2 \rfloor + 1,\ k - \lfloor F/2 \rfloor : k + \lfloor F/2 \rfloor + 1} \Big) \right).$$
The source code corresponding to that computation is presented in Figure 22.

8.4. Efficiency of Forward Computations in Convolutional Layers for Different Implementation Variants

As mentioned earlier, there are seven implementation variants provided for forward computations in convolutional layers. Below, we list their code names with brief descriptions. The code names constitute suffixes of the actual full function names, e.g., do_forward_numba_cuda_tiles(.). Additionally, CUDA-related functions execute from within the so-called GPU kernels with computations designed for each single GPU thread (affix: _job). For more details, we direct the reader to the code and bibliographic references.
  • numpy_gemm—GEMM-based convolution; see (9), involving image-to-matrix and weights-to-matrix transformations, and a numpy.dot multiplication of matrices, executed from regular Python code (a minimal sketch of this idea is shown after this list).
  • numba_jit_direct—definition-based convolution; see (8), implemented directly by multiple loops and compiled and executed by means of numba.jit.
  • numba_jit_gemm—GEMM-based convolution (as in variant 1); compiled and executed by means of numba.jit.
  • cv2—convolution based on an OpenCV function cv2.filter2D (faster than scipy.signal.convolve2d); involves multiple invocations of the function to suitably handle input–output channels.
  • numba_cuda_direct—definition-based convolution (as in 2); compiled and executed on GPU via numba.cuda.
  • numba_cuda_tiles—definition-based convolution that partitions inputs into tiles associated with CUDA thread blocks and involves collaboration of threads in order to copy inputs and weights into efficient shared memory; compiled and executed on the GPU by means of numba.cuda.
  • numba_cuda_viafft—takes advantage of the fact that the convolution (for the $q$-th output channel) can be computed as $\mathrm{IFFT}_2\big( \sum_p \mathrm{FFT}_2(W_{\cdot,\cdot,p,q}) \cdot \mathrm{FFT}_2(X_{\cdot,\cdot,p}) \big) + w_{0;q}$, where $p$ iterates over input channels; involves invocations of multiple GPU kernels and is compiled and executed on the GPU by means of numba.cuda.
  • Winograd-based implementations are currently not provided in the hmdl project.
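As promised above, the following is a minimal NumPy sketch of the image-to-matrix (im2col) idea behind the numpy_gemm variant for a single example. It assumes "same" zero padding, an odd filter size, and stride 1, and it computes the correlation-style convolution used in DL frameworks; the exact index conventions and batching of the hmdl code may differ.

```python
import numpy as np

def conv2d_gemm(phi_x, W, w0):
    """2D 'same'-padded convolution via the GEMM (im2col) approach.
    phi_x: (H, V, C) input, W: (F, F, C, K) filters, w0: (K,) biases."""
    H, V, C = phi_x.shape
    F, _, _, K = W.shape
    pad = F // 2
    xp = np.pad(phi_x, ((pad, pad), (pad, pad), (0, 0)))
    # image-to-matrix: one row of length F*F*C per output position
    cols = np.empty((H * V, F * F * C), dtype=phi_x.dtype)
    for j in range(H):
        for k in range(V):
            cols[j * V + k, :] = xp[j:j + F, k:k + F, :].ravel()
    # weights-to-matrix and a single large matrix multiplication
    Wm = W.reshape(F * F * C, K)
    Y = cols.dot(Wm) + w0
    return Y.reshape(H, V, K)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3)).astype(np.float32)
W = rng.standard_normal((3, 3, 3, 16)).astype(np.float32)
w0 = np.zeros(16, dtype=np.float32)
print(conv2d_gemm(x, W, w0).shape)   # (8, 8, 16)
```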
Figure 23 compares execution times (averaged over 10 repetitions) of convolutional layers with 64 input and 64 output channels and a batch of size 32 for different implementations, as well as filter and image sizes. Note the logarithmic scale on the vertical axes. In the first two graphs, we additionally plotted the times achieved using Keras (backed with TensorFlow) for reference. Details of the hardware and software environment on which the executions were performed are provided below.
  • "NVIDIA Quadro M4000M" environment—hardware: Lenovo ThinkPad P70, Intel Xeon CPU E3-1505M v5 2.8 GHz (3.7 GHz boost), 64 GB RAM, and NVIDIA Quadro M4000M GPU (4 GB/32 GB of dedicated/shared memory), manufactured by: Lenovo Group Limited, Beijing, China; software: nvcc 11.6 (V11.6.112), Python 3.9.7, numpy 1.20.0, numba 0.54.1, cv2 4.5.5-dev, keras 2.8.0, and tensorflow 2.8.0; OS: Windows 10 Pro (22H2).
The following observations can be made based on the time measurements obtained and Figure 23. The numba_jit_direct and cv2 implementations are clearly the worst, with the latter being more efficient for larger filter and image sizes. The variants numpy_gemm and numba_jit_gemm perform similarly well, with a slight advantage of the latter. Interestingly, for a small 3 × 3 filter, their results are comparable with numba_cuda_tiles, or actually better for images not bigger than 8 × 8. For larger images and filters, the advantage of GPU parallelism manifests itself, and the numba_cuda_direct/tiles/viafft variants become clearly better than the _gemm CPU variants. For a 7 × 7 filter, numba_cuda_tiles starts to be competitive with numba_cuda_direct, and it beats the latter regularly for even larger filters and images not smaller than 16 × 16. As regards numba_cuda_viafft, one can benefit from its efficiency also for larger images and sufficiently large filters. In fact, one can note, in the last graph, that the execution times of numba_cuda_viafft become approximately equal (see the overlapping red curves) for all large filter sizes from 11 × 11 onwards and suitably large images—a fact compliant with the algorithmic properties of FFT-based convolutions. In those cases, numba_cuda_viafft was superior to both numba_cuda_direct and numba_cuda_tiles. Last, but certainly not least, one should notice the very high efficiency of Keras—the pink curves in the first two graphs. In particular, for the 7 × 7 filter and 64 × 64 images, the execution time of Keras was better by roughly one order of magnitude than that of numba_cuda_direct and numba_cuda_tiles. This demonstrates how highly optimized the underlying TensorFlow is in terms of exploiting GPU power and resources.
All tests on convolutional layers can be re-executed (script: tester.py), and their current log files are available in the repository (folder: tests/).

8.5. Implementation of Weight Initialization

In compliance with mathematical analysis presented in Section 5, weight initialization performed according to Glorot’s method [17] is conducted in the hmdl project via the functions shown in Figure 24. We remind the reader that this kind of initialization is appropriate for non-convolutional layer types.
For convolutional layers, hmdl performs weight initialization in compliance with the recipe due to He [29], using the functions presented in Figure 25.

8.6. Implementation of Adam Coupled with Regularizations

The source code presented below in Figure 26 constitutes the continuation of the one from Figure 15—a backward pass for a SequentialClassifier. It demonstrates hmdl's implementation of the Adam algorithm, explained in Section 6.3. If some form of regularization ($\ell_1$, $\ell_2$, or gradient clipping) is imposed by the user, the code below first applies the suitable updates to the gradient tensor for the current layer, and then it conducts the actual operations defined by Adam. Recall that those operations are the computation of the first and second gradient moments, removing the bias introduced via EMAs, and applying the final correction to weights.

8.7. Deep Learning Experiments

All experiments reported in this section can be reproduced by the reader using the script experimenter.py. It allows training (optional) and testing to be run for one of three well-known data sets, Olivetti (https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), MNIST (https://yann.lecun.com/exdb/mnist/), or CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) (accessed on 22 October 2024) [69,70,71] (see Figure 27), using two DL frameworks, hmdl and Keras. For every conducted experiment, we have generated a distinct 10-digit identifier, hashed based on given network structure and an additional description (data name, initial randomization seed, and number of repetitions). The log files and classifiers obtained can be found in the repository (folders experiments/, hmdl_clfs/, keras_clfs/) with their names prefixed using the appropriate identifier. Detailed instructions on how to run a new experiment or reproduce an existing one (when knowing its identifier) were moved to Section 8.8.
Two hardware environments were applied to conduct the experiments—the first one was a personal computer equipped with NVIDIA Quadro M4000M as the GPU device (details on page 42), and the second was a high-performance server equipped with the NVIDIA A100 GPU for virtualized computations (details provided below). We explain that the complete set of all experiments was carried out independently in each of the two environments.
  • “GRID A100-7-40C MIG 7g.40gb” environment—hardware: Ubuntu Server 20.4.03, 4 CPUs: AMD EPYC 7H12 64-Core (2.6 GHz), 128 GB RAM, and NVIDIA GRID A100-7-40C vGPU, manufactured by: Supermicro (Super Micro Computer, Inc.), San Jose, CA 95131, USA; software: nvcc 11.4 (V11.4.48), Python 3.8.10, numpy 1.23.5, numba 0.56.4, cv2 4.7.0, keras 2.12.0, and tensorflow 2.12.0; OS: Ubuntu 20.04.3 LTS.
Table 1, Table 2 and Table 3 gather all the results. In strings representing the structures of tested neural networks, the following notation was applied. The abbreviations "C", "M", "F", "D", and "DR" stand for different layer types: convolutional, max pooling, flattening, dense, and dropout, respectively. The parameters of a layer are written in parentheses. For example, C(3, 32, r) stands for a convolutional layer using a 3 × 3 filter and producing 32 output channels (kernels) activated via the ReLU function, denoted as "r". DR(0.125) stands for a dropout layer using the rate of 0.125. M(2) represents a max pooling layer that applies 2 × 2 regions to produce a down-sampled output tensor. D(10, sm) represents a dense layer with 10 neurons activated via the softmax function. The symbol "× 2" denotes a stack of the same two layers.
Due to the small size of the Olivetti data (400 examples in total), its experimental executions were repeated 10 times (with successive randomization seeds starting from zero, and the data were split randomly as 70:30 into training and test parts), and the results were averaged. MNIST and CIFAR-10 are moderately large sets (60 k and 50 k training examples, respectively, and 10 k testing examples for both); therefore, their experiments were conducted just once. For all data sets, we imposed 100 epochs of training but different counts of batches per epoch (n_batches parameter in hmdl library): 10 batches for Olivetti, 20 for MNIST, and 50 for CIFAR-10. These counts were suitably translated into the batch_size parameter in Keras.
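The translation we assume here is simply batch_size ≈ n_train / n_batches (this reading is ours; the exact conversion is done inside experimenter.py); e.g., for MNIST:

```python
# assumed translation of hmdl's n_batches into Keras' batch_size (our reading, for illustration)
n_train = 60_000                      # MNIST training examples
n_batches = 20                        # batches per epoch imposed in hmdl
batch_size = n_train // n_batches     # 3000, passed as batch_size to Keras' fit(...)
```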
We experimented with networks of varying depth (16 layers at most) and various structural ideas: dense layers alone, single convolutional layers in front of pooling layers, blocks of stacked convolutional layers (VGG-like structures), varying filter sizes and counts of output channels, dropout rates, etc.
We must emphasize that the main goal of these experiments was not to obtain excellent classifiers for the mentioned data sets. Instead, the goal was to verify the consistency of results from Keras and the “home-made” implementation, when the same architectures were applied. We remark that both implementations used a uniform He initialization of weights for layers with ReLUs, a uniform Glorot initialization otherwise, and the Adam optimizer (η = 10⁻³, β₁ = 0.9, β₂ = 0.999, and ϵ = 10⁻⁷).
Finally, we need to remark that, owing to the imposed randomization seeds, the reported accuracies and loss values can be reproduced exactly for a fixed GPU device. We imposed seed=0 as the default value across several modules—in particular, we executed random.seed(seed), tf.random.set_seed(seed), and np.random.seed(seed). Also, please see the set_tf_global_determinism(seed) function, where we forced the deterministic behavior of TensorFlow and CUDA; additional data set- or repetition-related seeds can be changed by the user via the data_description dictionary and its entry named "seed", placed in the main program function. One should realize that results do differ between devices due to the order and properties of computations carried out on the GPU (e.g., the number of streaming multiprocessors and CUDA cores, the scheduling scheme, etc.).
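A typical shape of such a seeding routine is sketched below; the actual set_tf_global_determinism function in the repository may differ in details (the TF_DETERMINISTIC_OPS variable and the enable_op_determinism() call are our assumptions about one common way of enforcing determinism in TensorFlow 2.x).

```python
import os
import random
import numpy as np
import tensorflow as tf

def set_global_determinism(seed=0):
    """Illustrative seeding routine; the hmdl function set_tf_global_determinism may differ."""
    random.seed(seed)                    # Python's built-in RNG
    np.random.seed(seed)                 # NumPy RNG
    tf.random.set_seed(seed)             # TensorFlow RNG
    os.environ["TF_DETERMINISTIC_OPS"] = "1"          # request deterministic GPU kernels
    tf.config.experimental.enable_op_determinism()    # available in TF >= 2.9
```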
The overall results confirm several well-known or intuitive facts. CIFAR-10, the hardest among the data sets, allowed only a few structures to surpass 80% accuracy, whereas it was fairly easy to beat 99.2% for MNIST. For these two sets, it is also easy to see that dense layers alone are not sufficient for good accuracy—convolutional layers are needed. Furthermore, blocks of stacked convolutional layers (VGG-like) led to the best results. Due to the small size of the Olivetti data, the above observations were not evident on that set. The results oscillated around 94% ± 2% with little dependence on the structure.
In every row of Table 1, Table 2 and Table 3, blue is used to mark the higher accuracy and green the smaller loss. The last two columns report relative differences in the results between the implementations, i.e., higher accuracy / lower accuracy − 1 and 1 − smaller loss / larger loss (as percentages). As regards accuracy, the results are highly consistent—for a large majority of cases, the relative difference is smaller than 1%, with few outliers (the worst: 3.54%). Naturally, categorical cross-entropy loss is a more sensitive measure than accuracy. See, e.g., the MNIST classifiers, where high accuracies of 99.4% or 99.5% are accompanied by losses ranging from about 0.016 to 0.032. Thus, relative loss differences are naturally of a larger magnitude.
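For instance, for experiment no. 7 on MNIST (Quadro M4000M device), the accuracies 0.9944 (hmdl) and 0.9937 (Keras) give 0.9944/0.9937 − 1 ≈ 0.07%, and the losses 0.0205 (hmdl) and 0.0301 (Keras) give 1 − 0.0205/0.0301 ≈ 31.9%, which are the values reported in the last two columns of Table 2.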
For the same architectures, the results of hmdl and Keras are highly consistent in terms of accuracy. One might also be interested in simply counting the times that hmdl or Keras actually returned a better model, neglecting the margin and relative difference in accuracy (typically smaller than 1%). For the Olivetti data, hmdl did so in 7 out of the total of 22 executions (hence, the Keras models were better in 15 cases). For MNIST, hmdl produced better models in 6 out of the 20 executions, whereas for CIFAR-10, the hardest data set, it did so in 15 out of 24 cases.
As regards some particular results, the following observations can be made. For the MNIST data set, the structures of neural networks from experiments 7–10 contained the aforementioned VGG blocks, i.e., stacks of convolutional layers (denoted by C(⋯)×2 strings), followed by max pooling layers. Those structures translated into higher accuracies (typically surpassing 99.40%) than the structures from experiments 1–6, without the stacked convolutions. A similar observation is true for experiments on the CIFAR-10 data set, where the gap between such groups of structures is even more noticeable. In experiments 8–12, all networks involving the stacked convolutions surpassed an accuracy of 76% (reaching 83% ± 1% in a few cases), whereas all networks from experiments 1–7, without stacked convolutions, had trouble reaching a 70% accuracy level. Some of the experiments were planned to verify whether dropout layers distributed along the structure translate into better results when working with constant rates (e.g., equal to 0.125) or rates increasing towards the “right end” of a network (e.g., 0.125, 0.25, and 0.5). In the case of the MNIST data, no clear difference in this respect was observed. Interestingly, for the CIFAR-10 data, models with increasing dropout rates (experiment no. 12) performed significantly better than their counterparts with constant rates (experiment no. 11). Differences of approximately 2% in accuracy were noticed. This observation was true for models that came from both frameworks (Keras and hmdl) and from both GPU devices (Quadro M4000M and A100). Finally, as regards comparisons of filter sizes for convolutions (7 × 7, 5 × 5, and 3 × 3) spread across various experiments, we did not observe any clear influence of filter size on accuracy. On the other hand, in the large majority of cases, larger filter sizes naturally translated into longer model fitting times.
In this context, it is worth mentioning some recent results, from 2021 to 2024, achieved for the MNIST and CIFAR-10 benchmarks using the newest non-standard deep learning ideas. As regards MNIST, the results based on so-called capsule networks stand out. For example, Mazzia et al. [42] report an accuracy of 99.84% using a capsule network with self-attention routing (Efficient-CapsNet) involving only 161 k trainable parameters. On the other hand, Byerly et al. [43] apply a similar approach, but they claim that no routing between capsules is needed. Their model achieves a 99.87% accuracy but employs significantly more trainable parameters—1.5 M. As regards CIFAR-10, recent results differ quite significantly, depending on the new ideas researchers experiment with. For example, Dhakad et al. [44] propose an interesting idea of a “scalable hierarchical aware CNN” (SHA-CNN). They carry out tests on MNIST, CIFAR-10, and CIFAR-100, but the accuracy achieved for CIFAR-10—83.35%—is not very high. In fact, some of our CNNs obtained using Keras or hmdl surpass that result. On the other hand, there exist recent articles that report outstanding accuracies for CIFAR-10. For example, Touvron et al. [45] apply so-called vision transformers and obtain an accuracy of 99.3%. Interestingly, some authors stick to traditional CNN architectures but focus on variations of the learning algorithm itself, using ensemble or federated learning. For example, Bruno, Moroni, and Martinelli [46] apply two networks with the so-called EfficientNet-b0 architecture. The two networks are trained on disjoint data subsets (bagging) and then as an adaptive ensemble by fine-tuning the trainable combination layer. The authors achieve a 99.612% accuracy for CIFAR-10.
Despite the consistency of hmdl and Keras observed in our experiments in terms of accuracy and convergence, we must fairly note the evident superiority of Keras (TensorFlow) over hmdl in terms of time efficiency. In many cases, the hmdl implementation was slower by at least an order of magnitude. For structures with multiple convolutional layers, the Keras fit times were roughly 40 times shorter (despite the application of numba.cuda in hmdl) in the case of the NVIDIA Quadro M4000M device and roughly 90 times shorter in the case of the NVIDIA Grid A100 device. The main reason working in favor of TensorFlow’s efficiency is the highly optimized cuDNN low-level library that directly supports DNN-related primitives present within TensorFlow and accelerates GPU computations, convolutions, and pooling in particular. On the other hand, some possible reasons working against hmdl’s time efficiency (in spite of numba.cuda) include Python’s object-oriented overhead, wrapping costs for Numba, and CPU–GPU memory management.

8.8. Reproducing Experiments (or Running New Ones)

Figure 28 presents the list of main settings that can be imposed on an experiment using the experimenter.py script. The very first variable, EXPERIMENT_ID, decides whether an existing experiment is going to be reproduced (in which case its 10-digit identifier is expected) or a new one executed (in which case the None value is expected). The example setting shown in the figure, EXPERIMENT_ID = "1224383397", corresponds to the 8th experiment carried out on the MNIST data set (refer back to Table 2). We remind the reader that log files from all reported experiments are available in the repository (https://github.com/pklesk/hmdl) in the folder experiments/ (accessed on 17 October 2024). Identifiers of those experiments are visible in the file names and also inside the logs. Once the EXPERIMENT_ID value is set, the user can carry out an experiment using the standard Python command, i.e., python experimenter.py or python3 experimenter.py.
The two subsequent variables, UNPICKLE_DATA and PICKLE_DATA, are provided for the user’s convenience and they allow, respectively, for reading and writing data arrays from/to a binary file (based on Python’s pickle module). Data sets such as MNIST and CIFAR-10 are fairly large in size, and the setting UNPICKLE_DATA = False causes those original data sets to be imported from within the keras.datasets module and read via mnist.load_data() or cifar10.load_data(), which may be time-consuming. By setting UNPICKLE_DATA = True, one gets the data arrays faster as they are read from pre-prepared pickled binary files.
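Purely for illustration, the top of this settings list can be thought of as a handful of module-level constants of the following shape (the values are examples; the actual list is shown in Figure 28):

```python
# illustrative shape of the main settings in experimenter.py (values are examples)
EXPERIMENT_ID = "1224383397"   # reproduce an existing experiment; use None to run a new one
UNPICKLE_DATA = True           # read data arrays from pre-prepared pickled binary files
PICKLE_DATA = False            # write freshly imported data arrays to pickled binary files
```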
It is also worth explaining the settings prefixed by COPY_INITIAL_WEIGHTS_ and ending with either KERAS_TO_HMDL or HMDL_TO_KERAS. They let the user initialize (or read) weights for a certain network architecture in one of the frameworks and then copy those exact weights to the same architecture in the second framework. This allows for two kinds of comparisons. One is to check whether forward passes of the two networks (from Keras and from hmdl) for the same batch of data are consistent, e.g., how many significant digits of the floating-point response from final output layers agree. The other kind of comparison is to see for how long the learning processes initiated from the same weights remain consistent. We believe the names of other main settings in Figure 28 are self-explanatory.
Figure 29 exemplifies how a structure for a neural network becomes described in the hmdl project (within the __main__ function of the script experimenter.py). The description is formed as a list of tuples. The first tuple defines the general settings for the SequentialClassifier. The other tuples define the subsequent layers, each described by its class name (e.g., Conv2D, MaxPool2D, Dropout, Flatten, and Dense) and a dictionary of its properties.
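As a purely illustrative sketch (the dictionary keys below are our assumptions; the actual property names used by hmdl are visible in Figure 29), such a description might look as follows:

```python
# hypothetical structure description: class names as strings, property keys are illustrative
structure = [
    ("SequentialClassifier", {"n_epochs": 100, "n_batches": 20}),   # general settings
    ("Conv2D",    {"filter_size": 3, "n_kernels": 32, "activation": "relu"}),
    ("Conv2D",    {"filter_size": 3, "n_kernels": 32, "activation": "relu"}),
    ("MaxPool2D", {"pool_size": 2}),
    ("Dropout",   {"rate": 0.125}),
    ("Flatten",   {}),
    ("Dense",     {"n_neurons": 128, "activation": "relu"}),
    ("Dense",     {"n_neurons": 10, "activation": "softmax"}),
]
```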
Finally, Figure 30 shows the code excerpt (in experimenter.py) responsible for the choice of the data set to experiment with and, in particular, the choice of the randomization seed. The same value for the randomization seed implies the same experimental results for all executions on a fixed GPU device (and independently of the CPU).

9. Conclusions

In this tutorial and survey paper, we have shown that a suitable combination of discoveries from several areas was necessary to pave the way for the current high status of deep learning. Automatic differentiation was, obviously, the first necessity. But many critical building blocks came after 2009: GPU computations, proper weight initialization schemes, robust learning algorithms (Adam and similar), or, e.g., dropout regularization that “magically” performs model averaging without actually training multiple models explicitly.
The accessibility of software frameworks for DL is, obviously, an asset, especially for practitioners. On the other hand, we believe implementing a deep neural network from scratch is a valuable exercise that strengthens the low-level understanding of the algorithms involved. In fact, such an understanding is a must for new discoveries to come. We hope this paper, aside from serving as a survey, can play the role of a useful tutorial for that purpose, especially for new researchers in the field.
There are two main directions in which our future work related to this tutorial may develop. The first one is purely technical and related to the hmdl project. Its efficiency and functionality could be enhanced through several new elements, such as reducing the amount of host–device memory transfers, the implementation of Winograd fast convolutions, introducing new layer types, etc. The second possible direction is educational. In this tutorial (concise but already quite long), we have covered the well-established elements of contemporary deep neural networks. We are aware that the freshest ideas, such as recurrent independent mechanisms, transformers, federated learning, gradient sharing, and several others, may require separate treatment, especially regarding automatic differentiation within those techniques.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The three data sets (known as Olivetti, MNIST, and CIFAR-10) applied in this research are open and publicly available benchmarks commonly applied to academic research on image recognition. Suitable references to the sources and authors of these data sets have been placed in the manuscript.

Acknowledgments

We would like to express our gratitude and give credit to the authors of Numba—an open-source JIT accelerating compiler for Python (https://numba.pydata.org).

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Implementation Variants Supported by Numba for Max Pooling Layers (Hmdl Project)

Appendix A.1. Variant Accelerated Using Low-Level Compilation by numba.jit

This variant applies when the setting do_forward_impl="numba_jit" is submitted to the constructor of MaxPool2D objects. Then, the function pointer self.do_forward_function_ points to the do_forward_numba_jit method presented in Figure A1. Please note the full specification of types for the arguments that come after the @jit decorator.
Figure A1. Forward computations for class MaxPool2D: variant based on numba’s just-in-time compilation.
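To convey the flavor of this variant without reproducing the figure, here is a minimal, self-contained sketch of a jit-compiled forward max pooling for a batch tensor of shape (m, H, W, C) and pool size S; the function name, the eager type signature, and the tensor layout are our assumptions, not the hmdl code from Figure A1.

```python
import numpy as np
from numba import jit

# A minimal sketch (not the hmdl code): forward max pooling compiled eagerly by numba.jit
# for an input tensor of shape (m, H, W, C) and pool size S.
@jit("float32[:,:,:,:](float32[:,:,:,:], int64)", nopython=True, cache=True)
def max_pool_forward_jit(X, S):
    m, H, W, C = X.shape
    out = np.empty((m, H // S, W // S, C), dtype=np.float32)
    for i in range(m):
        for j in range(H // S):
            for k in range(W // S):
                for c in range(C):
                    best = X[i, j * S, k * S, c]
                    for dj in range(S):
                        for dk in range(S):
                            v = X[i, j * S + dj, k * S + dk, c]
                            if v > best:
                                best = v
                    out[i, j, k, c] = best
    return out

# example usage:
# X = np.random.rand(32, 28, 28, 16).astype(np.float32)
# P = max_pool_forward_jit(X, 2)
```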
Figure A2 presents the analogous backward computations, used when the function pointer self.do_backward_function_ points to the do_backward_numba_jit function.
Figure A2. Backward computations for class MaxPool2D: variant based on numba’s just-in-time compilation.

Appendix A.2. Variant Accelerated Using GPU Computations and numba.cuda

Another variant of the max pooling computations in the hmdl project is implemented as a kernel function allowing for GPU computations. Instead of directly using NVIDIA’s CUDA API in the C language, hmdl takes advantage of the numba.cuda module, which offers an analogous API and CUDA mechanisms for Python. This variant applies when the setting do_forward_impl="numba_cuda" is submitted to the constructor of the MaxPool2D object.
Figure A3 and Figure A4 contain the source code of this variant for the forward computations. The function do_forward_numba_cuda_direct, presented in Figure A3, iterates over subbatches within the current batch stored in self.input_, and, using the CUDA streaming mechanism, it copies the subbatches from the host to the GPU device memory. Then, an invocation of the actual CUDA kernel function, named do_forward_numba_cuda_direct_job, presented in Figure A4, is performed using 512 threads per block (see the tpb value) and the number of blocks per grid (bpg), calculated as follows:
bpg = m · (H/S) · (W/S) · C / tpb,
where m denotes the subbatch size.
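To illustrate the one-thread-per-output-element organization implied by this formula, here is a simplified kernel sketch written in the spirit of (but not identical to) the hmdl code from Figure A4; the names, the flat index decomposition, and the omission of streams are simplifying assumptions.

```python
import numpy as np
from numba import cuda

# Simplified sketch of a forward max pooling CUDA kernel: one thread handles one element
# of the output tensor of shape (m, H//S, W//S, C).
@cuda.jit
def max_pool_forward_cuda_job(X, S, out):
    t = cuda.grid(1)                       # flat global thread index
    m, HS, WS, C = out.shape
    if t >= m * HS * WS * C:
        return
    c = t % C
    k = (t // C) % WS
    j = (t // (C * WS)) % HS
    i = t // (C * WS * HS)
    best = X[i, j * S, k * S, c]
    for dj in range(S):
        for dk in range(S):
            v = X[i, j * S + dj, k * S + dk, c]
            if v > best:
                best = v
    out[i, j, k, c] = best

# host-side launch for one subbatch (streams omitted for brevity; sizes are examples):
# X_d = cuda.to_device(X_subbatch)
# out_d = cuda.device_array((m, H // S, W // S, C), dtype=np.float32)
# tpb = 512
# bpg = (m * (H // S) * (W // S) * C + tpb - 1) // tpb
# max_pool_forward_cuda_job[bpg, tpb](X_d, S, out_d)
```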
Figure A3. Forward computations for class MaxPool2D: variant implemented for GPU computations using numba.cuda module.
Figure A4. CUDA kernel function do_forward_numba_cuda_direct_job for forward max pooling computations invoked via function do_forward_numba_cuda_direct.
Backward GPU computations are organized analogously, using the streaming mechanism (over data subbatches) and invocations of a CUDA kernel function. Figure A5 and Figure A6 present the source codes involved in those computations. One can note how the self.argmax_ information, memorized during the forward pass, is reused at the stage of backward computations.
Figure A5. Backward computations for class MaxPool2D: variant implemented for GPU computations using numba.cuda module.
Figure A6. CUDA kernel function do_backward_numba_cuda_direct_job for backward max pooling computations invoked via function do_backward_numba_cuda_direct.

Appendix B. Implementation Variants Supported by Numba for Convolutional Layers (Hmdl Project)—Backward Computations

Appendix B.1. Variant Accelerated Using Low-Level Compilation via numba.jit

This variant applies when the setting do_backward_impl="numba_jit_gemm" is submitted to the constructor of the Conv2D object. Then, the function pointer self.do_backward_function_ points to the do_backward_numba_jit_gemm method presented in Figure A7. Please note the full specification of types for arguments that come after the @jit decorator.
Figure A7. Backward computations of gradient for class Conv2D: variant based on numba’s just-in-time compilation.

Appendix B.2. Variant Based Directly on Definition Accelerated Using GPU Computations and numba.cuda

Similar to the CUDA-based max pooling implementation from Appendix A.2, the implementation for convolutional layers presented here is split into two functions. The outer function do_backward_numba_cuda_direct, presented in Figure A8, iterates over subbatches within the current batch stored in self.input_, and, using the CUDA streaming mechanism, it copies the subbatches from the host to the GPU device memory. Then, an invocation of the actual CUDA kernel function, named do_backward_numba_cuda_direct_job, presented in Figure A9, is performed using 512 threads per block (see the tpb value) and the number of blocks per grid (bpg), calculated as:
bpg = m · F² · C · K / tpb,
where m denotes the subbatch size.
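The launch configuration itself can be expressed in a few lines; in the sketch below the dimension values are examples, and the ceiling division (so that the grid covers all m · F² · C · K threads) is our assumption about how the plain division above is realized in practice.

```python
# Illustrative computation of the 1-D launch configuration described above
# (the real kernel invocation appears in Figure A9; dimension values are examples).
m, F, C, K = 16, 5, 32, 64                    # subbatch size, filter size, channels, kernels
tpb = 512                                     # threads per block
bpg = (m * F * F * C * K + tpb - 1) // tpb    # blocks per grid (rounded up)
# do_backward_numba_cuda_direct_job[bpg, tpb](...)   # launch; arguments omitted
```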
Figure A8. Backward computations of gradient for class Conv2D: variant based directly on definition, using GPU computations and numba.cuda.
Figure A9. CUDA kernel function do_backward_numba_cuda_direct_job for backward convolutional computations invoked via function do_backward_numba_cuda_direct.

Appendix B.3. Variant Based on Tiles Accelerated Using GPU Computations and numba.cuda

The CUDA-based variant presented in this section organizes computations into tiles of size 16 × 16. It applies 256 threads per block, defined by the three-dimensional tuple tpb = (1, 16, 16), and, accordingly, the three-dimensional grid of CUDA blocks defined as follows:
bpg = (m · C · K, bpg_h, bpg_w),
where m is the subbatch size, and bpg_h and bpg_w denote the counts of tiles along the last two dimensions, both calculated as F/16.
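For comparison with the 1-D grid of the previous variant, the 3-D launch configuration can be sketched as follows (dimension values are examples, and the rounding up of the tile counts is our assumption):

```python
# Illustrative 3-D launch configuration for the tiled variant (actual code in Figure A10).
m, C, K, F = 16, 32, 64, 28           # example dimensions (assumed for illustration)
tpb = (1, 16, 16)                     # 256 threads per block, arranged as one 16x16 tile
bpg_h = (F + 16 - 1) // 16            # tiles along the second-to-last dimension
bpg_w = (F + 16 - 1) // 16            # tiles along the last dimension
bpg = (m * C * K, bpg_h, bpg_w)       # three-dimensional grid of blocks
# do_backward_numba_cuda_tiles_job[bpg, tpb](...)   # kernel launch; arguments omitted
```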
Figure A10 presents the source code of the outer function (iterating over subbatches and applying the streaming CUDA mechanism). For each subbatch, the outer function invokes the actual kernel function presented in Figure A11.
Figure A10. Backward computations of gradient for class Conv2D: variant based on tiles, using GPU and numba.cuda.
Figure A11. CUDA kernel function do_backward_numba_cuda_tiles_job for backward convolutional computations invoked via function do_backward_numba_cuda_tiles.

References

  1. Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  2. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  3. McCulloch, W.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
  4. Rosenblatt, F. The Perceptron: A Perceiving and Recognizing Automaton; Technical Report 85–460–1; Cornell Aeronautical Laboratory, Inc.: Buffalo, NY, USA, 1957. [Google Scholar]
  5. Linnainmaa, S. Algoritmin Kumulatiivinen Pyoristysvirhe Yksittaisten Pyoristysvirheiden Taylor-Kehitelmana. Master’s Thesis, University of Helsinki, Tapiola, Finland, 1970. [Google Scholar]
  6. Fukushima, K. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
  7. Rumelhart, D.; Hinton, G.; Williams, R. Learning Representations by Back-propagating Errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  8. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  9. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  10. Aizenberg, I.; Aizenberg, N.N.; Vandewalle, J. Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
  11. Heck, L.; Konig, Y.; Sönmez, M.; Weintraub, M. Robustness to Telephone Handset Distortion in Speaker Recognition by Discriminative Feature Design. Speech Commun. 2000, 31, 181–192. [Google Scholar] [CrossRef]
  12. Ng, A. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the 21st International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; ICML’04. pp. 78–85. [Google Scholar] [CrossRef]
  13. Hinton, G.; Osindero, S.; Teh, Y. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  14. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the CVPR 2009, Miami, FL, USA, 22–24 June 2009; pp. 248–255. [Google Scholar]
  15. Huang, J. The Intelligent Industrial Revolution. Online. NVIDIA Article on GTC (GPU Technology Conference). 2016. Available online: https://blogs.nvidia.com/blog/intelligent-industrial-revolution/ (accessed on 10 August 2022).
  16. Raina, R.; Madhavan, A.; Ng, A. Large-scale Deep Unsupervised Learning Using Graphics Processors. In Proceedings of the ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 873–880. [Google Scholar]
  17. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track 2010, 9, 249–256. [Google Scholar]
  18. Nair, V.; Hinton, G. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the ICML 2010, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  19. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  20. Hinton, G.; Srivastava, N.; Swersky, K. Overview of Mini-Batch Gradient Descent. Online. Lecture 6, Unpublished as a Paper. 2012. Available online: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (accessed on 10 July 2022).
  21. Ciresan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the CVPR 2012, Providence, RI, USA, 16–21 June 2012; pp. 3642–3649. [Google Scholar]
  22. Hinton, G.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR 2012. Available online: http://arxiv.org/abs/1207.0580 (accessed on 22 October 2024).
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 1, pp. 1097–1105. [Google Scholar]
  24. Mathieu, M.; Henaff, M.; LeCun, Y. Fast Training of Convolutional Networks through FFTs. arXiv 2013. Available online: https://arxiv.org/abs/1312.5851 (accessed on 22 October 2024).
  25. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR (Poster), Banff, AB, Canada, 14–16 April 2014; Available online: https://arxiv.org/pdf/1412.6980.pdf (accessed on 22 October 2024).
  26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014. Available online: https://arxiv.org/abs/1409.1556v6 (accessed on 22 October 2024).
  27. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014. Available online: https://arxiv.org/abs/1409.4842 (accessed on 22 October 2024).
  28. Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. arXiv 2015. Available online: https://arxiv.org/abs/1509.09308 (accessed on 22 October 2024).
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015. Available online: https://arxiv.org/abs/1512.03385 (accessed on 22 October 2024).
  31. Dozat, T. Incorporating Nesterov Momentum into Adam. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, PR, USA, 2–4 May 2016; pp. 1–4. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017. Available online: https://arxiv.org/abs/1706.03762 (accessed on 22 October 2024).
  33. Reddi, S.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  34. Goyal, A.; Lamb, A.; Hoffmann, J.; Sodhani, S.; Levine, S.; Bengio, Y.; Schölkopf, B. Recurrent Independent Mechanisms. arXiv 2020. Available online: https://arxiv.org/abs/1909.10893 (accessed on 22 October 2024).
  35. Xue, Y.; Tong, Y.; Neri, F. An ensemble of differential evolution and Adam for training feed-forward neural networks. Inf. Sci. 2022, 608, 453–471. [Google Scholar] [CrossRef]
  36. Bengio, Y.; Lecun, Y.; Hinton, G. Deep Learning for AI. Commun. ACM 2021, 64, 58–65. [Google Scholar] [CrossRef]
  37. Rumelhart, D.; McClelland, J. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986. [Google Scholar]
  38. Fukushima, K. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern. 1969, 5, 322–333. [Google Scholar] [CrossRef]
  39. Tong, G.; Huang, L. Fast Convolution based on Winograd Minimum Filtering: Introduction and Development. CS & IT-CSCP 2021, pp. 177–191. Available online: https://arxiv.org/abs/2111.00977 (accessed on 22 October 2024).
  40. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the ACL’2019, Austin, TX, USA, 4–13 October 2019; pp. 4171–4186. [Google Scholar]
  41. Wolfram, S. What Is ChatGPT Doing… and Why Does It Work? Online. 2023. Available online: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work (accessed on 22 October 2024).
  42. Mazzia, V.; Salvetti, F.; Chiaberge, M. Efficient-CapsNet: Capsule network with self-attention routing. Sci. Rep. 2021, 11, 14634. [Google Scholar] [CrossRef] [PubMed]
  43. Byerly, A.; Kalganova, T.; Dear, I. No Routing Needed Between Capsules. arXiv 2021. Available online: https://arxiv.org/abs/2001.09136 (accessed on 22 October 2024). [CrossRef]
  44. Dhakad, N.; Malhotra, Y.; Vishvakarma, S.K.; Roy, K. SHA-CNN: Scalable Hierarchical Aware Convolutional Neural Network for Edge AI. arXiv 2024. Available online: https://arxiv.org/abs/2407.21370 (accessed on 22 October 2024).
  45. Touvron, H.; Cord, M.; El-Nouby, A.; Verbeek, J.; Jégou, H. Three things everyone should know about Vision Transformers. arXiv 2022. Available online: https://arxiv.org/abs/2203.09795 (accessed on 22 October 2024).
  46. Antonio, B.; Moroni, D.; Martinelli, M. Efficient adaptive ensembling for image classification. Expert Syst. 2023. [Google Scholar] [CrossRef]
  47. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  48. Huang, K.; Wang, Y.; Tao, M.; Zhao, T. Why do deep residual networks generalize better than deep feedforward networks?—A neural tangent kernel perspective. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. NIPS ’20. [Google Scholar]
  49. Ma, L.; Li, N.; Yu, G.; Geng, X.; Huang, M.; Wang, X. Pareto-Wise Ranking Classifier for Multiobjective Evolutionary Neural Architecture Search. IEEE Trans. Evol. Comput. 2024, 28, 570–581. [Google Scholar] [CrossRef]
  50. Dellinger, J. Weight Initialization in Neural Networks: A Journey from the Basics to Kaiming. Online. 2019. Available online: https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79 (accessed on 23 June 2022).
  51. Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
  52. Riedmiller, M.; Braun, H. RPROP—A Fast Adaptive Learning Algorithm. In Proceedings of the 1992 International Symposium on Computer and Information Sciences, Antalya, Turkey, 2–4 November 1992; pp. 279–285. [Google Scholar]
  53. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Proceedings of the Doklady ANSSSR (Translated as Soviet. Math. Docl.); 1983; Volume 269, pp. 543–547. Available online: https://api.semanticscholar.org/CorpusID:202149403 (accessed on 22 October 2024).
  54. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. 2016. Available online: https://ruder.io/optimizing-gradient-descent/index.html (accessed on 10 July 2022).
  55. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  56. Korzeń, M.; Jaroszewicz, S.; Klęsk, P. Logistic regression with weight grouping priors. Comput. Stat. Data Anal. 2013, 64, 281–298. [Google Scholar] [CrossRef]
  57. Tessier, H.; Gripon, V.; Léonardon, M.; Arzel, M.; Hannagan, T.; Bertrand, D. Rethinking Weight Decay for Efficient Neural Network Pruning. J. Imaging 2022, 8, 64. [Google Scholar] [CrossRef]
  58. Nowlan, S.; Hinton, G. Simplifying Neural Networks by Soft Weight-Sharing. Neural Comput. 1992, 4, 473–493. [Google Scholar] [CrossRef]
  59. Plaut, D.C.; Nowlan, S.J.; Hinton, G.E. Experiments on Learning by Back-Propagation; Technical Report CMU–CS–86–126; Carnegie–Mellon University: Pittsburgh, PA, USA, 1986. [Google Scholar]
  60. Kim, C.; Kim, S.; Kim, J.; Lee, D.; Kim, S. Automated Learning Rate Scheduler for Large-batch Training. arXiv 2021. Available online: https://arxiv.org/abs/2107.05855 (accessed on 22 October 2024).
  61. d’Ascoli, S.; Refinetti, M.; Biroli, G. Optimal learning rate schedules in high-dimensional non-convex optimization problems. arXiv 2022. Available online: https://arxiv.org/abs/2202.04509 (accessed on 22 October 2024).
  62. Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
  63. Centofani, F.; Fontana, M.; Lepore, A.; Vantini, S. Smooth LASSO estimator for the Function-on-Function linear regression model. Comput. Stat. Data Anal. 2022, 176, 107556. [Google Scholar] [CrossRef]
  64. Tibshirani, R. Regression Shrinkage and Selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  65. Hahn, G.; Lutz, S.M.; Laha, N.; Cho, M.H.; Silverman, E.K.; Lange, C. A fast and efficient smoothing approach to Lasso regression and an application in statistical genetics: Polygenic risk scores for chronic obstructive pulmonary disease (COPD). Stat. Comput. 2021, 31, 35. [Google Scholar] [CrossRef]
  66. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  67. Breiman, L. Combining Predictors; Technical Report; Department of Statistics, University of California: Berkeley, CA, USA, 1998. [Google Scholar]
  68. Keras. GitHub. 2015. Available online: https://github.com/fchollet/keras (accessed on 22 October 2024).
  69. Roweis, S. Olivetti Faces Dataset; AT&T Laboratories: Cambridge, UK, 1994. [Google Scholar]
  70. MNIST Handwritten Digit Database; Data Set. 1998. Available online: https://yann.lecun.com/exdb/mnist/ (accessed on 22 October 2024).
  71. Krizhevsky, A.; Nair, V.; Hinton, G. CIFAR-10; Data Set; Canadian Institute for Advanced Research: Toronto, ON, Canada, 2010. [Google Scholar]
Figure 1. Milestones in the history of neural networks and deep learning [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35].
Figure 2. Example medium-sized deep network built with popular layer types: convolutional, max pooling, dropout, flattening, and dense (prepared for the CIFAR-10 data set; see experiment with ID 3392187021 in Section 8).
Figure 3. Example of a simple neural network for 10-class classification.
Figure 4. Illustration of forward and backward computations at the junction of two convolutional layers α and β for the network structure from Figure 3.
Figure 5. Forward computations of max and average pooling for S = 4.
Figure 6. Illustration of backward computations for flattening and dropout layers. High dropout rate r* = 0.875 chosen for readability of surviving connections.
Figure 7. Illustration of two stages of backward computations for a dense layer using a softmax activation function.
Figure 8. Forward pass leading to exploding signals: a fake network consisting of 100 dense layers with 512 neurons each (no activations) and initial weights drawn from a standard normal distribution.
Figure 9. Fake forward pass leading to vanishing signals due to initial weights drawn from a normal distribution with standard deviation scaled by 10⁻², regardless of N.
Figure 10. Fake forward pass leading to a stable numerical behavior due to initial weights drawn from a properly scaled normal distribution with standard deviation 1/√N.
Figure 11. Training costs (losses) of Adam and other SGD algorithms on MNIST (a) and CIFAR-10 (b) data sets. Source: (Kingma and Ba, 2014) [25].
Figure 12. UML class diagram of the hmdl project (https://github.com/pklesk/hmdl/blob/main/uml/classes.pdf) (accessed on 17 October 2024).
Figure 13. Functions executing forward and backward computations from the abstract class Layer.
Figure 14. The core of the fit function of SequentialClassifier. The main training loop (over epochs and batches) executes forward and backward computations through layers for each batch.
Figure 15. Forward and backward passes for SequentialClassifier.
Figure 16. Automatic differentiation for class Flatten.
Figure 17. Automatic differentiation for class Dropout.
Figure 18. Automatic differentiation for class Dense.
Figure 19. Forward computations for class MaxPool2D: the simplest numpy-based variant.
Figure 20. Backward computations for class MaxPool2D: the simplest numpy-based variant.
Figure 21. Backward computations of gradient for class Conv2D: the simplest numpy-based variant (GEMM).
Figure 22. Backward computations of error propagation for class Conv2D: the simplest numpy-based variant (GEMM).
Figure 23. Execution times of convolutional layers (64 input and 64 output channels; batch size: 32) for different implementation variants, filter sizes, and image sizes. For details of the hardware and software environment, see page 42.
Figure 24. Functions for Glorot initialization of weights in the hmdl project.
Figure 25. Functions for He initialization of weights in the hmdl project.
Figure 26. Implementation of Adam (coupled with regularization) in the hmdl project.
Figure 27. Sample images from data sets applied in experiments.
Figure 28. Main settings for an experiment in the script experimenter.py.
Figure 29. Description of a network structure in the hmdl project, declared in experimenter.py.
Figure 30. Choice of data set, randomization seed, and other settings in the script experimenter.py.
Table 1. Results of deep learning experiments with hmdl and Keras on Olivetti data.

Data set: Olivetti. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(8,r);DR(0.25);D(40,sm); | 0979993065 | 33 136 | 2.2 | 0.6817 | 1.258 | 4.4 | 0.7058 | 1.186 | 3.54 | 5.7 |
| 2 | F();D(64,r);DR(0.25);D(40,sm); | 1781687523 | 264 808 | 8.7 | 0.9542 | 0.1738 | 4.6 | 0.9433 | 0.2555 | 1.16 | 32.0 |
| 3 | F();D(64,r);DR(0.25);D(64,r);DR(0.25);D(40,sm); | 3416029629 | 268 968 | 8.9 | 0.9467 | 0.2246 | 5.2 | 0.9475 | 0.2330 | 0.08 | 3.6 |
| 4 | C(5,8,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2855585004 | 133 944 | 38.1 | 0.9633 | 0.1456 | 7.2 | 0.9667 | 0.1380 | 0.35 | 5.2 |
| 5 | C(9,16,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 4283815439 | 266 120 | 62.8 | 0.9567 | 0.1764 | 9.1 | 0.9575 | 0.1699 | 0.08 | 3.7 |
| 6 | C(5,8,r);M(4);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1035804442 | 69 576 | 62.3 | 0.9492 | 0.1677 | 8.2 | 0.9558 | 0.1503 | 0.70 | 10.4 |
| 7 | C(5,8,r);M(2);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3391566488 | 266 184 | 82.4 | 0.9217 | 0.2963 | 9.1 | 0.9183 | 0.3415 | 0.37 | 13.2 |
| 8 | C(5,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3923102291 | 135 552 | 97.1 | 0.9583 | 0.1611 | 14.3 | 0.9600 | 0.1652 | 0.18 | 2.5 |
| 9 | C(3,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1455896911 | 134 400 | 85.4 | 0.9600 | 0.1528 | 12.3 | 0.9592 | 0.1939 | 0.08 | 21.2 |
| 10 | C(5,8,r)×2;M(2);DR(0.25);C(3,16,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2185109709 | 270 112 | 175.9 | 0.9358 | 0.3113 | 17.9 | 0.9233 | 0.3771 | 1.35 | 17.4 |
| 11 | C(5,8,r)×2;M(4);DR(0.25);C(3,32,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1466991051 | 147 136 | 148.0 | 0.9500 | 0.1941 | 16.4 | 0.9483 | 0.2060 | 0.18 | 5.8 |

Data set: Olivetti. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(8,r);DR(0.25);D(40,sm); | 0979993065 | 33 136 | 1.2 | 0.6817 | 1.258 | 2.5 | 0.6908 | 1.229 | 1.33 | 2.3 |
| 2 | F();D(64,r);DR(0.25);D(40,sm); | 1781687523 | 264 808 | 2.3 | 0.9542 | 0.1740 | 2.5 | 0.9550 | 0.2310 | 0.08 | 24.7 |
| 3 | F();D(64,r);DR(0.25);D(64,r);DR(0.25);D(40,sm); | 3416029629 | 268 968 | 2.5 | 0.9508 | 0.2212 | 5.2 | 0.9475 | 0.2266 | 0.35 | 2.4 |
| 4 | C(5,8,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2855585004 | 133 944 | 27.1 | 0.9625 | 0.1464 | 3.2 | 0.9608 | 0.1683 | 0.18 | 13.0 |
| 5 | C(9,16,r);M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 4283815439 | 266 120 | 35.9 | 0.9542 | 0.1754 | 3.3 | 0.9575 | 0.1720 | 0.35 | 1.9 |
| 6 | C(5,8,r);M(4);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1035804442 | 69 576 | 43.7 | 0.9483 | 0.1799 | 3.9 | 0.9617 | 0.1295 | 1.41 | 28.4 |
| 7 | C(5,8,r);M(2);DR(0.25);C(3,16,r);M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3391566488 | 266 184 | 55.1 | 0.9242 | 0.2890 | 3.9 | 0.9442 | 0.2134 | 2.16 | 26.2 |
| 8 | C(5,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 3923102291 | 135 552 | 52.5 | 0.9608 | 0.1532 | 3.8 | 0.9683 | 0.1362 | 0.78 | 11.1 |
| 9 | C(3,8,r)×2;M(4);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1455896911 | 134 400 | 51.3 | 0.9533 | 0.1760 | 4.0 | 0.9708 | 0.1334 | 1.84 | 24.2 |
| 10 | C(5,8,r)×2;M(2);DR(0.25);C(3,16,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 2185109709 | 270 112 | 97.6 | 0.9358 | 0.3081 | 4.9 | 0.9433 | 0.2224 | 0.80 | 27.8 |
| 11 | C(5,8,r)×2;M(4);DR(0.25);C(3,32,r)×2;M(2);DR(0.25);F();D(64,r);DR(0.25);D(40,sm); | 1466991051 | 147 136 | 83.7 | 0.9517 | 0.1879 | 5.1 | 0.9558 | 0.1753 | 0.43 | 6.7 |
Table 2. Results of deep learning experiments with hmdl and Keras on MNIST data.

Data set: MNIST. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(128,r);DR(0.125);D(128,r);DR(0.125);D(10,sm); | 1321552096 | 118 282 | 133 | 0.9759 | 0.0769 | 18 | 0.9805 | 0.0765 | 0.47 | 0.5 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0643554048 | 669 706 | 275 | 0.9849 | 0.0584 | 31 | 0.9855 | 0.0752 | 0.06 | 22.3 |
| 3 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0491196249 | 932 362 | 352 | 0.9808 | 0.0800 | 39 | 0.9846 | 0.0902 | 0.39 | 11.3 |
| 4 | C(7,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2044490280 | 1 610 250 | 6 861 | 0.9875 | 0.0407 | 393 | 0.9876 | 0.0493 | 0.01 | 17.4 |
| 5 | C(7,64,r);M(2);DR(0.125);C(5,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2037082629 | 1 012 362 | 67 425 | 0.9927 | 0.0250 | 1 395 | 0.9936 | 0.0274 | 0.09 | 8.8 |
| 6 | C(5,64,r);M(2);DR(0.125);C(3,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0549589385 | 879 754 | 27 840 | 0.9920 | 0.0254 | 814 | 0.9932 | 0.0333 | 0.12 | 23.7 |
| 7 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1685817126 | 484 714 | 46 803 | 0.9944 | 0.0205 | 1 101 | 0.9937 | 0.0301 | 0.07 | 31.9 |
| 8 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1224383397 | 435 306 | 49 391 | 0.9959 | 0.0219 | 1 093 | 0.9951 | 0.0343 | 0.08 | 36.2 |
| 9 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3604054245 | 452 202 | 64 678 | 0.9953 | 0.0161 | 1 343 | 0.9943 | 0.0247 | 0.10 | 34.8 |
| 10 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 1584841448 | 452 202 | 65 518 | 0.9929 | 0.0241 | 1 776 | 0.9957 | 0.0198 | 0.28 | 17.8 |

Data set: MNIST. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(128,r);DR(0.125);D(128,r);DR(0.125);D(10,sm); | 1321552096 | 118 282 | 100 | 0.9750 | 0.0778 | 11 | 0.9816 | 0.0760 | 0.68 | 2.3 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0643554048 | 669 706 | 190 | 0.9838 | 0.0582 | 11 | 0.9840 | 0.0931 | 0.02 | 37.5 |
| 3 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 0491196249 | 932 362 | 292 | 0.9801 | 0.0816 | 16 | 0.9841 | 0.0930 | 0.41 | 12.3 |
| 4 | C(7,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2044490280 | 1 610 250 | 2 681 | 0.9877 | 0.0409 | 83 | 0.9869 | 0.0602 | 0.08 | 32.1 |
| 5 | C(7,64,r);M(2);DR(0.125);C(5,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2037082629 | 1 012 362 | 8 254 | 0.9927 | 0.0220 | 82 | 0.9946 | 0.0250 | 0.19 | 12.0 |
| 6 | C(5,64,r);M(2);DR(0.125);C(3,128,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0549589385 | 879 754 | 5 432 | 0.9926 | 0.0248 | 68 | 0.9935 | 0.0316 | 0.09 | 21.5 |
| 7 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1685817126 | 484 714 | 6 592 | 0.9943 | 0.0189 | 82 | 0.9942 | 0.0312 | 0.01 | 39.4 |
| 8 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1224383397 | 435 306 | 8 308 | 0.9947 | 0.0243 | 82 | 0.9955 | 0.0239 | 0.08 | 1.6 |
| 9 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3604054245 | 452 202 | 9 666 | 0.9954 | 0.0174 | 93 | 0.9952 | 0.0259 | 0.02 | 32.8 |
| 10 | C(5,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 1584841448 | 452 202 | 9 440 | 0.9928 | 0.0240 | 143 | 0.9954 | 0.0187 | 0.26 | 22.1 |
Table 3. Results of deep learning experiments with hmdl and Keras on CIFAR-10 data.

Data set: CIFAR-10. GPU device: NVIDIA Quadro M4000M.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(512,r);DR(0.25);D(512,r);DR(0.25);D(512,r);DR(0.25);D(10,sm); | 0112184563 | 2 103 818 | 619 | 0.5589 | 1.573 | 75 | 0.5700 | 1.642 | 1.99 | 4.2 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 3617925328 | 2 103 818 | 610 | 0.5671 | 2.259 | 80 | 0.5624 | 2.436 | 0.84 | 7.3 |
| 3 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1176787770 | 1 050 890 | 3 981 | 0.6557 | 1.116 | 234 | 0.6410 | 1.068 | 2.29 | 4.3 |
| 4 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm) | 4064952781 | 1 050 890 | 4 071 | 0.6582 | 1.009 | 234 | 0.6500 | 1.061 | 1.26 | 4.9 |
| 5 | C(7,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2176511806 | 1 054 730 | 5 699 | 0.6581 | 1.139 | 309 | 0.6542 | 1.062 | 0.60 | 6.8 |
| 6 | C(7,32,r);M(2);DR(0.125);C(5,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3250393598 | 581 706 | 20 149 | 0.7148 | 1.363 | 628 | 0.7248 | 1.134 | 1.40 | 16.8 |
| 7 | C(3,32,r);C(5,16,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm); | 4064194389 | 539 418 | 17 091 | 0.6665 | 1.607 | 777 | 0.6631 | 1.625 | 0.51 | 1.1 |
| 8 | C(7,32,r);C(5,32,r);M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 3392187021 | 611 498 | 49 819 | 0.7741 | 0.8160 | 1 282 | 0.7739 | 0.7324 | 0.03 | 10.2 |
| 9 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1676196221 | 591 274 | 33 930 | 0.7654 | 1.137 | 961 | 0.7607 | 1.178 | 0.62 | 3.5 |
| 10 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 4191965583 | 591 274 | 34 161 | 0.7968 | 0.6892 | 959 | 0.7855 | 0.6926 | 1.44 | 0.5 |
| 11 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0843263662 | 550 570 | 52 878 | 0.8093 | 1.005 | 1 229 | 0.8086 | 1.024 | 0.09 | 1.9 |
| 12 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 0715119487 | 550 570 | 52 614 | 0.8217 | 0.5949 | 1 228 | 0.8359 | 0.5497 | 1.73 | 7.6 |

Data set: CIFAR-10. GPU device: GRID A100-7-40C MIG 7g.40gb.

| No. | Structure string | Expt. ID (hash) | Params | hmdl fit time [s] | hmdl test acc. | hmdl test loss | Keras fit time [s] | Keras test acc. | Keras test loss | Diff. acc. [%] | Diff. loss [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | F();D(512,r);DR(0.25);D(512,r);DR(0.25);D(512,r);DR(0.25);D(10,sm); | 0112184563 | 2 103 818 | 383 | 0.5646 | 1.567 | 29 | 0.5681 | 1.654 | 0.62 | 5.3 |
| 2 | F();D(512,r);DR(0.125);D(512,r);DR(0.125);D(512,r);DR(0.125);D(10,sm); | 3617925328 | 2 103 818 | 386 | 0.5543 | 2.331 | 28 | 0.5595 | 2.575 | 0.94 | 9.5 |
| 3 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1176787770 | 1 050 890 | 1 491 | 0.6565 | 1.124 | 41 | 0.6450 | 1.069 | 1.78 | 4.9 |
| 4 | C(3,32,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm) | 4064952781 | 1 050 890 | 1 496 | 0.6580 | 1.004 | 42 | 0.6406 | 1.052 | 2.72 | 4.6 |
| 5 | C(7,32,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 2176511806 | 1 054 730 | 1 646 | 0.6579 | 1.167 | 49 | 0.6578 | 1.045 | 0.02 | 10.5 |
| 6 | C(7,32,r);M(2);DR(0.125);C(5,64,r);M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 3250393598 | 581 706 | 3 487 | 0.7145 | 1.335 | 83 | 0.7197 | 1.241 | 0.73 | 7.0 |
| 7 | C(3,32,r);C(5,16,r);M(2);DR(0.125);F();D(128,r);DR(0.25);D(10,sm); | 4064194389 | 539 418 | 2 804 | 0.6664 | 1.626 | 83 | 0.6624 | 1.606 | 0.45 | 1.2 |
| 8 | C(7,32,r);C(5,32,r);M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 3392187021 | 611 498 | 7 043 | 0.7741 | 0.7779 | 107 | 0.7774 | 0.7439 | 0.43 | 4.4 |
| 9 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 1676196221 | 591 274 | 5 795 | 0.7738 | 1.106 | 143 | 0.7656 | 1.093 | 1.07 | 1.2 |
| 10 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);F();D(128,r);DR(0.5);D(10,sm); | 4191965583 | 591 274 | 5 784 | 0.7948 | 0.7195 | 89 | 0.7938 | 0.6612 | 0.13 | 8.1 |
| 11 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.125);C(3,128,r)×2;M(2);DR(0.125);F();D(128,r);DR(0.125);D(10,sm); | 0843263662 | 550 570 | 8 625 | 0.8032 | 1.014 | 103 | 0.8172 | 0.9271 | 1.74 | 8.6 |
| 12 | C(3,32,r)×2;M(2);DR(0.125);C(3,64,r)×2;M(2);DR(0.25);C(3,128,r)×2;M(2);DR(0.5);F();D(128,r);DR(0.5);D(10,sm); | 0715119487 | 550 570 | 8 759 | 0.8190 | 0.6119 | 103 | 0.8441 | 0.5184 | 3.06 | 15.3 |