1. Introduction
Despite the remarkable success and advancements of deep learning (DL) models in computer vision tasks, there are serious obstacles to the deployment of AI in different domains, related to the challenge of developing deep neural networks that are both robust and able to generalise well beyond the training data [1]. Accurate and stable numerical algorithms play a significant role in creating robust and reliable computational models [2]. The source of numerical instability in DL models is partially due to the use of a large number of parameters/hyperparameters and data that suffer from floating-point errors and inaccurate results. In the case of convolutional neural networks (CNNs), an obvious contributor to the instability of their large volume of weights is the repeated action of the backpropagation algorithm, which adjusts the gradient-descent updates to fit the model's performance to the different batches of training samples. This paper is concerned with the empirical estimation of training-caused fluctuations in the condition numbers of the various weight matrices of a CNN as a potential source of instability at the convolutional layers and of their negative effects on overall model performance. We shall propose a spectral-based approach to reduce and control this undesirable fluctuation.
The condition number of a square matrix A, viewed as a linear transformation, measures the sensitivity of computing its action to perturbations of the input data and to round-off errors. It is defined as $\kappa(A) = \|A\|\,\|A^{-1}\|$, where $\|A\| = \max_{x \neq 0} \|Ax\|/\|x\|$ is taken over the set of nonzero $x$. The condition number also reflects how much the calculation of the inverse of A suffers from underflow (i.e., how far its smallest singular value is from 0). Stable action of A means that small changes in the input data are expected to lead to small changes in the output data, and these changes are bounded in terms of the reciprocal of the condition number. Hence, the higher the condition number of A is, the more unstable A's action is in response to small data perturbations, and such matrices are said to be ill-conditioned. Indeed, the distribution of the condition numbers of a random matrix simply describes the loss in precision, in terms of the number of digits, as well as the speed of convergence due to ill-conditioning when solving linear systems of equations iteratively [3]. Originally, the condition number of a matrix was introduced by A. Turing in [4]. Afterwards, the condition numbers of matrices and numerical problems were comprehensively investigated in [5,6,7]. The most common efficient and stable way of computing $\kappa(A)$ is by computing the SVD of A and calculating the ratio of A's largest singular value to its smallest non-zero one [8].
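For illustration, the 2-norm condition number can be computed directly from the singular values; the following NumPy sketch (the helper name is ours, chosen for illustration) mirrors that definition and agrees with numpy.linalg.cond for invertible matrices:

```python
import numpy as np

def condition_number(A):
    """2-norm condition number: ratio of the largest to the smallest
    non-zero singular value of A."""
    s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
    s = s[s > np.finfo(float).eps * s[0]]         # discard numerically zero ones
    return s[0] / s[-1]

A = np.random.default_rng(0).standard_normal((3, 3))
print(condition_number(A), np.linalg.cond(A, 2))  # the two values should agree
```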
J. W. Demmel, in [6], investigated the upper and lower bounds of the probability distribution of condition numbers of random matrices and showed that the sets of ill-posed problems, including matrix inversion, eigenproblems, and polynomial zero finding, all have a common algebraic and geometric structure. In particular, Demmel showed that, in the case of matrix inversion, the further away a matrix is from the set of non-invertible matrices, the smaller is its condition number. Accordingly, the spatial distributions of random matrices in their domains are indicators of the distributions of their condition numbers. These results provide clear evidence of the viability of our approach of exploiting the tools of topological data analysis (TDA) to investigate the condition number stability of point clouds of random matrices. In general, TDA can be used to capture information about complex topological and geometric structures of point clouds in metric spaces with or without prior knowledge about the data (see [9] for more detail). Since the early 2000s, applied topology has entered a new era, exploiting the persistent homology (PH) tool to investigate the global and local shape of high-dimensional datasets. Various vectorisations of persistence diagrams (PDs) generated by the PH tool encode information about both the local geometry and global topology of the clouds of convolution filters of CNN models [10]. Here, we shall attempt to determine the impact of the SVD surgery procedure on the PDs of point clouds of CNNs' well- and ill-conditioned convolution filters.
Contribution: We introduce a singular-value-decomposition-based matrix surgery (SVD surgery) technique that modifies matrix condition numbers and is suitable for stabilising the actions of ill-conditioned convolution filters on image datasets. The various versions of our SVD surgery preserve the norm of the input matrix while reducing the norm of its inverse, moving it away from the set of non-invertible matrices. PH analyses of point clouds of matrices (and those of their inverses) post SVD surgery bring the PDs of point clouds of convolution filters and those of their inverses closer to each other.
2. Background to the Motivating Challenge
The ultimate motivation for this paper relates to specific requirements that arose in our challenging investigation of how to “train an efficient slim convolutional neural network model capable of learning discriminating features of ultrasound (US) images, or any radiological images, for supporting clinical diagnostic decisions”. In particular, the developed model's predictions are required to be robust against tolerable data perturbations and less prone to overfitting effects when tested on unseen data.
In machine learning and deep learning, vanishing or exploding gradients and poor convergence are generally due to an ill-conditioning problem. The most common approaches to overcoming ill-conditioning are regularisation, data normalisation, re-parameterisation, standardisation, and random dropouts. When training a deep CNN with extremely large datasets of “natural” images, the convolution filter weights/entries are randomly initialised and then changed through an extensive training procedure using many image batches over a number of epochs, at the end of each of which the backpropagation procedure updates the filter entries for improved performance. The frequent updates of filter entries result in non-negligible to significant fluctuation and instability of their condition numbers, causing sensitivity of the trained CNN models [11,12]. CNN model sensitivity is manifested by overfitting, reduced robustness against noise, and vulnerability to adversarial attacks [13].
Transfer learning is a common approach when developing CNN models for the analysis of US (or other radiological) image datasets, wherein the pretrained filters and other model weights of an existing CNN model (trained on natural images) are used as initialising parameters for retraining. However, condition number instabilities increase in the transfer learning mode when it is used for small datasets of non-natural images, resulting in suboptimal performance and a model that suffers from overfitting.
3. Related Work
Deep learning CNN models involve a variety of parameters, the complexity of which is dominated by the entries of the sets of convolution filters at the various convolution layers as well as those of the fully connected neural network layers. The norms and/or variances of these parameters are the main factors considered when designing initialisation strategies to speed up training optimisation and improve model performance in machine and deep learning tasks. Currently, most popular CNN architectures initialise these weights using zero-mean Gaussian distributions with controlled layer-dependent/independent variances. Krizhevsky et al. [14] use a constant standard deviation to initialise the weights in each layer. Due to the exponentially vanishing/growing gradient and for compatibility with activation functions, Glorot [15] and He [16] initialise the weights with controllable variances per layer. For Glorot, the initialised variances depend on the number of in/out neurons, while He initialisation of the variances is closely linked to their proposed parameterised rectified activation unit (PReLU), which is designed to improve model fitting with little overfitting risk. In all these initialisation strategies, no explicit consideration is given to the filters' condition numbers or their stability during training. In these cases, our investigations found that, post training, almost all convolution filters are highly ill-conditioned, which adversely affects their use in transfer learning for non-natural images. More recent attempts to control the norms of the network layers were proposed in GradInit [17] and MetaInit [18]. These methods can accelerate convergence while improving model performance and stability. However, both approaches require extra trainable parameters, and controlling the condition number during training is not guaranteed.
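For reference, the Glorot and He initialisation rules can be written down directly from their published variance formulas; a minimal NumPy sketch (helper names are ours) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out, shape):
    # Glorot/Xavier initialisation: Var(w) = 2 / (fan_in + fan_out)
    return rng.standard_normal(shape) * np.sqrt(2.0 / (fan_in + fan_out))

def he_normal(fan_in, shape):
    # He initialisation: Var(w) = 2 / fan_in, designed around (P)ReLU activations
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

# e.g., a bank of 128 3x3 convolution filters over 64 input channels
W = he_normal(fan_in=3 * 3 * 64, shape=(128, 64, 3, 3))
```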
Recently, many research works have investigated issues closely related to our objectives by imposing orthogonality conditions on trainable DL model weights. These include orthonormal and orthogonal weight initialisation techniques [19,20,21], orthogonal convolution [22], orthogonal regularisers [23], orthogonal deep neural networks [24], and orthogonal weight normalisation [25]. Recalling that orthogonal/orthonormal matrices are optimally well conditioned, these publications indirectly support our hypothesis on the link between DL overfitting and the condition numbers of learnt convolution filters. Although the instability of weight matrices' condition numbers is not discussed explicitly, these related works fit into the emerging paradigm of spectral regularisation of NN layer weight matrices. For example, J. Wang et al. [22] assert that imposing orthogonality on convolutional filters is ideal for overcoming the training instability of DCNN models and improves performance. Furthermore, A. Sinha et al. [23] point out that an ill-conditioned learnt weight matrix contributes to a neural network's susceptibility to adversarial attacks. In fact, their orthogonal regularisation aims to keep the learnt weight matrix's condition number sufficiently low, and they demonstrate its increased adversarial accuracy when tested on the natural image datasets MNIST and F-MNIST. S. Li et al. [24] note that existing spectral regularisation schemes are mostly motivated by improving training for empirical applications, and they conduct a theoretical analysis of such methods using bounds on generalisation error (GE) measures that are defined in terms of the training algorithms and the isometry of the application feature space. They conclude that the optimal bound of the GE is attained when each weight matrix of a DNN has a spectrum of equal singular values, and they call such models OrthDNNs. To overcome the high computational requirements of strict OrthDNNs, they define approximate OrthDNNs by periodically applying their singular value bounding (SVB) scheme of hard regularisation. In general, controlling the behaviour of weights during training has proven to accelerate the training process and reduce the likelihood of overfitting the model to the training set, e.g., weight standardisation [26], weight normalisation/reparameterisation [27], centred weight normalisation [28], and controllable orthogonalisation using Newton's iteration [29]. Most of these techniques have been developed specifically to deal with trainable DL models for the analysis of natural images, and one may assume that they are applied frequently during training after each epoch/batch. However, none of the known state-of-the-art DL models seem to incorporate these techniques by default. In fact, our investigations of these commonly used DL models revealed that the final convolution filters are highly ill-conditioned [11].
Our literature review revealed that reconditioning and regularisation have long been used in analytical applications to reduce/control ill-conditioned computations. In the late 1980s, E. Rothwell and B. Drachman [30] proposed an iterative method to reduce the condition number in ill-conditioned matrix problems that is based on regularising the non-zero singular values of the matrix. At each iteration, each diagonal entry in the SVD of the matrix is augmented by the ratio of a regularising parameter to the singular value. This algorithm is not efficient enough to be used for our motivating challenge. In addition, the change in the norm depends on the regularising parameter.
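Our reading of this scheme (a sketch under our interpretation of the description above, not the authors' code) is that each singular value $\sigma_i$ is augmented by the ratio of a regularising parameter $\alpha$ to $\sigma_i$:

```python
import numpy as np

def regularise_singular_values(A, alpha=1e-3):
    # One reconditioning step: sigma_i -> sigma_i + alpha / sigma_i.
    # The smallest singular values are lifted the most, so the condition
    # number decreases, at the cost of a parameter-dependent change in norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(s + alpha / s) @ Vt
```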
In recent years, there has been growing interest in using TDA to analyse point clouds of various types and complexities of datasets. For example, significant advances and insights have been made in capturing local and global topological and geometric features in high-dimensional datasets using PH tools, including conventional methods [31]. TDA has also been deployed to interpret deep learning and CNN learning parameters at various layers [11,32,33] and to integrate topology-based methods in deep learning [34,35,36,37,38,39]. We shall use TDA to assess the spatial distributions of point clouds of matrices/filters (and their inverses) before and after SVD surgery for well- and ill-conditioned random matrices.
4. Topological Data Analysis
In this section, we briefly introduce persistent homology preliminaries and describe the point cloud settings of randomly generated matrices to investigate their topological behaviours.
Persistent homology of point clouds: Persistent homology is a computational tool of TDA that encapsulates the spatial distribution of point clouds of data records sampled from metric spaces by recording the topological features of a gradually triangulated shape built by connecting pairs of data points according to an increasing sequence of distance/similarity thresholds. For a point cloud $X$ and a list $\varepsilon_1 < \varepsilon_2 < \cdots < \varepsilon_m$ of increasing thresholds, the shape generated by this TDA process is a sequence $S_1 \subseteq S_2 \subseteq \cdots \subseteq S_m$ of simplicial complexes ordered by inclusion. The Vietoris–Rips (VR) simplicial complex is the most commonly used approach to construct this sequence due to its simplicity, and Ripser [40] is used to construct the VR complexes. The sequence of distance thresholds is referred to as a filtration. The topological features of each complex in the sequence consist of the numbers of holes or voids of different dimensions, known as the Betti numbers. For $j = 0, 1, 2$, the $j$-th Betti number $\beta_j$ is obtained, respectively, by counting $\beta_0$ = #(connected components), $\beta_1$ = #(empty loops with more than three edges), and $\beta_2$ = #(3D cavities bounded by more than four faces), etc. Note that each count corresponds to the set of generators of the $j$-th singular homology of the simplicial complex. The TDA analysis of $X$ with respect to a filtration is based on the persistency of each such generator as the threshold increases. Here, the persistency of each element is defined as the difference between its birth (first appearance) and its death (disappearance). It is customary to represent this information visibly as a vertically stacked set of barcodes, with each element having a horizontal straight line joining its birth to its death. For more detailed and rigorous descriptions, see [41,42,43]. For simplicity, the barcode set and the PD of a point cloud in a given dimension are referred to interchangeably in what follows.
Analysis of the resulting PH barcodes of point clouds in any dimension is provided by the persistence diagram (PD), formed by a multi-set of points in the first quadrant of the (birth, death) plane, above or on the line death = birth. Each marked point in the PD corresponds to a generator of the persistent homology group of the given dimension and is represented by the pair of coordinates (birth, death). To illustrate these visual representations of PH information, we created a point cloud of 1500 points sampled randomly on the surface of a torus: Figure 1 and Figure 2 below display this point cloud together with the barcodes and the PD representation of its PH in both dimensions. The two long-persisting dimension-one barcodes represent the two empty discs whose Cartesian product generates the torus. The persistency lengths of these two holes depend on the radii of the two generating circles. The persistency lengths of the set of shorter barcodes are inversely related to the point cloud size. Noisy sampling will only have an effect on the shorter barcodes.
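A minimal sketch of how such a torus point cloud and its persistence information can be produced with the ripser Python package (the radii, sample size, and seed here are illustrative choices, not necessarily those used for Figures 1 and 2):

```python
import numpy as np
from ripser import ripser  # Vietoris-Rips persistent homology

# Sample 1500 points from a torus with illustrative radii R (centre circle) and r (tube)
rng = np.random.default_rng(0)
R, r, n = 2.0, 0.5, 1500
theta, phi = rng.uniform(0, 2 * np.pi, n), rng.uniform(0, 2 * np.pi, n)
X = np.column_stack([(R + r * np.cos(theta)) * np.cos(phi),
                     (R + r * np.cos(theta)) * np.sin(phi),
                     r * np.sin(theta)])

# Persistence diagrams in dimensions 0 and 1; dgms[1] should show two
# prominent long-lived points corresponding to the torus's two essential loops
dgms = ripser(X, maxdim=1)['dgms']
```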
Demmel's general assertion that the further away a matrix is from the set of non-invertible matrices, the smaller is its condition number [6] implies that the distribution of condition numbers of a point cloud of filters is linked to its topological profile as well as to that of the point cloud of their inverses. In relation to our motivating application, the more ill-conditioned a convolutional filter is, the closer it is to being non-invertible, resulting in unstable feature learning. Accordingly, the success of condition-number-reducing matrix surgery can be indirectly inferred from its ability to reduce the differences between the topological profiles (expressed by PDs) of point clouds of filters and those of their inverses. We shall first compare the PDs of point clouds of well-conditioned matrices and ill-conditioned ones, and we do the same for the PDs of their respective inverse point clouds.
Determining the topological profiles of point clouds through visual assessment of the corresponding persistent barcodes/diagrams is subjective and cumbersome. A more quantitatively informative way of interpreting the visual display of PBs and PDs can be obtained by constructing histograms of barcode persistency records in terms of uniform binning of the birth data. Bottleneck and Wasserstein distances provide an easy quantitative comparison approach but may not fully explain the differences between the structures of PDs of different point clouds. In recent years, several feature vectorisations of PDs have been proposed that can be used to formulate numerical measures to distinguish the topological profiles of different point clouds. The easiest scheme to interpret is the statistical vectorisation of persistent barcode modules [44]. Whenever reasonable, we shall complement the visual display of PDs with an appropriate binning histogram of barcode persistency, alongside computing the bottleneck and Wasserstein distances using the GUDHI library [45] to compare the topological profiles of point clouds of matrices.
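Once two diagrams are available as arrays of (birth, death) pairs, the GUDHI distances can be computed along the following lines; this is a sketch with made-up toy diagrams, and the Wasserstein routine additionally requires the POT (Python Optimal Transport) package:

```python
import numpy as np
import gudhi
from gudhi.wasserstein import wasserstein_distance  # needs the POT package

# Two toy persistence diagrams given as (birth, death) pairs
dgm_a = np.array([[0.0, 0.8], [0.0, 1.3], [0.2, 0.9]])
dgm_b = np.array([[0.0, 0.7], [0.1, 1.1]])

d_bottleneck = gudhi.bottleneck_distance(dgm_a, dgm_b)
d_wasserstein = wasserstein_distance(dgm_a, dgm_b, order=1.0, internal_p=2.0)
print(d_bottleneck, d_wasserstein)
```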
To illustrate the above process, we generated a set of random Gaussian filters of size 3 × 3, sorted in ascending order of their condition numbers, and we created two point clouds: (1) a well-conditioned point cloud consisting of the 64 matrices with the lowest condition numbers and (2) an ill-conditioned point cloud consisting of the 64 matrices with the highest condition numbers. The first point cloud is well-conditioned, with condition numbers in the range [1.19376, 1.67], while the second is highly ill-conditioned, with condition numbers in the range [621.3677, 10,256.2265]. Below, we display the PDs in both dimensions of the two point clouds and of their inverse point clouds in Figure 3.
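A sketch of how these two point clouds (and their inverse point clouds) can be generated is given below; the total number of sampled filters is our illustrative assumption, as the original set size is not stated here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_filters = 10_000                                   # illustrative sample size
filters = rng.standard_normal((n_filters, 3, 3))

# Sort the filters by their 2-norm condition numbers
order = np.argsort([np.linalg.cond(F, 2) for F in filters])
best, worst = filters[order[:64]], filters[order[-64:]]

# Point clouds in R^9: flattened well-/ill-conditioned filters and their inverses
pc_well = best.reshape(64, -1)
pc_ill = worst.reshape(64, -1)
pc_well_inv = np.array([np.linalg.inv(F) for F in best]).reshape(64, -1)
pc_ill_inv = np.array([np.linalg.inv(F) for F in worst]).reshape(64, -1)
```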
In dimension zero, there are marginal differences between the connected component persistency of the well-conditioned point cloud and that of its inverse point cloud. In contrast, considerable differences can be found between the persistence of the connected components of the ill-conditioned point cloud and that of its inverse. In dimension one, there are slightly more marginal differences between the hole persistency of the well-conditioned point cloud and that of its inverse. However, these differences are considerably more visible between the hole persistency of the ill-conditioned point cloud and that of its inverse. One easy observation in both inverse point clouds, as opposed to the original ones, is the early appearance of a hole that dies almost immediately, lying very near to the line death = birth.
A more informative comparison between the various PDs can be discerned by examining Table 1 below, which displays the persistency-death-based binning of the various PDs. Note that in all cases, there are 64 connected components born at time 0. The pattern and timing of the deaths (i.e., mergers) of connected components in the well-conditioned point cloud and in its inverse point cloud are nearly similar; however, in the case of the ill-conditioned point cloud, the death times of its connected components differ markedly from those of its inverse point cloud, with most connected components of one merging much earlier than those of the other.
The above results are analogous to Demmel's result in that the well-conditioned point cloud exhibits a topological profile similar to that of its inverse point cloud, while the topological profile of the ill-conditioned point cloud differs significantly from that of its inverse. In order to estimate the proximity of the PDs of the well- and ill-conditioned point clouds to those of their inverses, we computed both the bottleneck and the Wasserstein distances. The results are included in Table 2 below, which also includes these distances between other pairs of PDs. Again, both distance functions confirm the close proximity of the PD of the well-conditioned point cloud to that of its inverse, in comparison to the significantly bigger distances between the PDs of the ill-conditioned point cloud and its inverse.
Next, we introduce our matrix surgery strategy and the effects of various implementations on point clouds of matrices, with emphasis on the relations between the PDs of the output matrices and those of their inverse point clouds.
5. Matrix Surgery
In this section, we describe the research framework for performing matrix surgery, which aims to reduce and control the condition numbers of matrices. Suppose a matrix $A \in \mathbb{R}^{n \times n}$ is non-singular and its entries are drawn from a random Gaussian or uniform distribution. The condition number of A is defined as:

$$\kappa(A) = \|A\|\,\|A^{-1}\|,$$

where $\|\cdot\|$ is a matrix norm. In this investigation, we focus on the Euclidean norm ($L_2$-norm), for which $\kappa(A)$ can be expressed as:

$$\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$

where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and smallest singular values of A, respectively. A matrix is said to be ill-conditioned if any small change in the input results in big changes in the output, and it is said to be well-conditioned if any small change in the input results in a relatively small change in the output. Equivalently, a matrix with a low condition number (close to one) is said to be well-conditioned, while a matrix with a high condition number is said to be ill-conditioned; the ideal condition number, attained by orthogonal matrices, is one. Next, we describe our simple approach of modifying the singular value matrix of the SVD, since the condition number is determined by the largest and smallest singular values. We recall that the singular value decomposition of a square matrix $A \in \mathbb{R}^{n \times n}$ is defined by:

$$A = U \Sigma V^{T},$$

where $U$ and $V$ are the left and right orthogonal matrices of singular vectors (unitary matrices) and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ is the diagonal matrix of singular values, with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$. SVD surgery, described below, is equally applicable to rectangular matrices.
5.1. SVD-Based Surgery
In the wide context, SVD surgery refers to the process of transforming matrices to improve their condition numbers. In particular, it targets matrices that are far from having orthogonality/orthonormality characteristics and replaces them with improved, well-conditioned matrices by deploying their left and right orthogonal singular vectors along with a new singular value diagonal matrix. SVD surgery can be realised in a variety of ways according to the expected properties of the output matrices, to fit the use case. Given any matrix A with SVD $A = U \Sigma V^{T}$, SVD surgery on A outputs a new matrix of the same size as follows:

$$\widetilde{A} = U \widetilde{\Sigma} V^{T}, \qquad \widetilde{\Sigma} = \mathrm{diag}(\tilde{\sigma}_1, \tilde{\sigma}_2, \ldots, \tilde{\sigma}_n), \qquad \tilde{\sigma}_1 \geq \tilde{\sigma}_2 \geq \cdots \geq \tilde{\sigma}_n > 0.$$

Changes to the singular values amount to rescaling the effect of the matrix action along the left and right orthogonal vectors of $U$ and $V$, and the monotonicity requirement ensures reasonable control of the various rescalings. The orthogonal regularisation scheme of [22] and the SVB scheme of [24] do reduce the condition numbers when applied for improved control of overfitting of DL models trained on natural images, but both make changes to all the singular values and cannot guarantee success for DL training on US image datasets. Furthermore, the SVB scheme is a rather strict form of SVD-based matrix surgery for controlling the condition numbers, and no analysis is conducted on the norms of these matrices or their inverses.
Our strategy for using SVD surgery is specifically designed for the motivating application and aims to reduce extremely high condition number values, preserve the norm of the input filters, and reduce the norm of their inverses, keeping them away from non-invertible matrices. Replacing all diagonal singular value entries with the largest singular value will produce an orthogonal matrix with a condition number equal to one, but this approach ignores or reduces the effect of significant variations in the training data along some of the singular vectors, leading to less effective learning. Instead, we propose a less drastic, application-dependent strategy for altering singular values. In general, our approach involves rescaling the singular values so that they remain no larger than $\sigma_1$, in order to reduce $\|A^{-1}\|$ while ensuring the maintenance of their monotonicity property. To reduce the condition number of an ill-conditioned matrix, it may only be necessary to adjust the relatively low singular values to bring them closer to $\sigma_1$. There are numerous methods for implementing such strategies, including the following linear combination scheme, in which we follow a less drastic strategy to change the singular values: for a chosen index $j$, each singular value $\sigma_i$ with $i \geq j$ is replaced by a convex linear combination involving the larger singular values, while $\sigma_1, \ldots, \sigma_{j-1}$ are left unchanged.

The index $j$ can be chosen so that $\sigma_j$ is very close to the smallest singular value, and the linear combination parameters can be customised based on the application and can possibly be determined empirically. In extreme cases, this strategy allows for the possibility of replacing all the singular values. This is rather timid in comparison to the orthogonal regularisation strategies, while preserving the monotonicity of the singular values. Regarding our motivating application, parameter choices would vary depending on the layer, but the linear combination parameters should not significantly rescale the training dataset features along the singular vectors. While SVD surgery can be applied to inverse matrices, employing the same replacement strategy and reconstruction may not necessarily result in a significant reduction in the condition number.
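The exact linear combination is application-dependent; the sketch below shows one simple instance of the scheme (the choice of $j$ and the weight $\alpha$ are illustrative, not the specific values used in our experiments):

```python
import numpy as np

def svd_surgery(A, j=1, alpha=0.5):
    """Linear-combination SVD surgery (one illustrative instance).

    Singular values with index >= j (0-indexed, in descending order) are
    replaced by a convex combination of sigma_{j-1} and themselves:
        sigma_i -> alpha * sigma_{j-1} + (1 - alpha) * sigma_i,  for i >= j.
    This keeps ||A||_2 unchanged (sigma_0 is untouched for j >= 1), preserves
    the monotonicity of the singular values, and raises the smallest singular
    value, thereby lowering ||A^{-1}||_2 and the condition number.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_new = s.copy()
    s_new[j:] = alpha * s[j - 1] + (1 - alpha) * s[j:]
    return U @ np.diag(s_new) @ Vt

# Example: condition number of a 3x3 Gaussian matrix before and after surgery
A = np.random.default_rng(1).standard_normal((3, 3))
print(np.linalg.cond(A, 2), np.linalg.cond(svd_surgery(A, j=2, alpha=0.9), 2))
```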
Example: Suppose B is a square matrix whose entries are drawn from a normal distribution with a given mean and standard deviation. Starting from the singular values of B, it is possible to modify and reconstruct three new matrices by replacing one and/or two of the smaller singular values, where the new singular values are convex linear combinations of the original ones. After reconstruction, the condition numbers of the three modified matrices are significantly lower than that of the original matrix, as shown in Table 3, using the Euclidean norm.
5.2. Effects of SVD Surgery on Large Datasets of Convolution Filters and Their Inverses
In training CNN models, it is customary to initialise the convolution filters of each layer using random Gaussian matrices of sizes that are layer- and CNN-architecture-dependent. Here, we shall focus on the effect of surgery on 3 × 3 Gaussian matrices. To illustrate the effect of SVD surgery on point clouds of convolution filters, we generate a set of 3 × 3 matrices drawn from a Gaussian distribution. We use the norm of the original matrix, the norm of the inverse, and the condition number to illustrate the effects of SVD surgery and observe the distribution of these parameters per set. Figure 4 below shows a clear reduction in the condition numbers of the modified matrices compared to the original ones. The reduction in the condition numbers is a result of reducing the norms of the inverses of the matrices (see Figure 5). The minimum and maximum condition numbers for the original set are approximately 1.2 and 10,256, respectively. After replacing only the smallest singular value $\sigma_3$ via the linear combination scheme and reconstructing, the new minimum and maximum values are 1.006 and 17.14, respectively.
Figure 4 shows a significant change in the distribution of the norms of the inverses of the 3 × 3 matrices post-surgery, which is consequently reflected in their condition number distribution. The use of a linear combination formula helps keep the range of condition numbers below a certain threshold, depending on the range of singular values. For instance, the 3D illustrations in Figure 5 show a significant reduction in the condition number, with the ranges kept below 3 in (b) and below 2 in (c), where the smaller singular values are replaced with different linear combinations in each case. The new minimum and maximum condition number values for both sets after matrix surgery are correspondingly much smaller than those of the original set.
5.3. Effects of SVD Surgery on PDs of Point Clouds of Matrices
For the motivating application, we need to study the impact of SVD surgery on point clouds of matrices (e.g., layered sets of convolution filters) rather than single matrices. Controlling the condition numbers of the layered point clouds of CNN filters (in addition to the fully connected layer weight matrices) during training affects the model’s learning and performance. The implementation of SVD surgery can be integrated into customised CNN models as a filter regulariser for the analysis of natural and US image datasets. It can be applied at filter initialisation when training from scratch, on pretrained filters during transfer learning, and on filters modified during training by backpropagation after every batch/epoch.
In this section, we investigate the topological behaviour of a set of matrices represented as a point cloud using persistent homology tools, as discussed in Section 4. For filters of any size $n \times n$, we first generate a set of random Gaussian matrices. By normalising their entries and flattening them, we obtain a point cloud in $\mathbb{R}^{n^2}$ residing on its unit sphere. Subsequently, we construct a second point cloud in $\mathbb{R}^{n^2}$ by computing the inverse matrices, normalising their entries, and flattening them. Here, we only illustrate this process for a specific point cloud of 3 × 3 matrices for two different linear combinations of the two lower singular values. The general case of larger-size filters is discussed in the first author's PhD thesis [46].
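A sketch of this construction is given below (normalisation by the Frobenius norm is our assumption, and the point cloud size is illustrative):

```python
import numpy as np
from ripser import ripser

def to_point_cloud(mats):
    # Flatten n x n matrices into points of R^(n*n) and normalise each point
    # to unit (Frobenius) norm, so the cloud resides on the unit sphere
    flat = mats.reshape(len(mats), -1)
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)

rng = np.random.default_rng(0)
mats = rng.standard_normal((200, 3, 3))                 # illustrative cloud size
inv_mats = np.array([np.linalg.inv(M) for M in mats])

pc_orig = to_point_cloud(mats)
pc_inv = to_point_cloud(inv_mats)

# Dimension-0 and dimension-1 persistence diagrams of both point clouds
dgms_orig = ripser(pc_orig, maxdim=1)['dgms']
dgms_inv = ripser(pc_inv, maxdim=1)['dgms']
```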
Figure 6 below shows the dimension-0 and dimension-1 persistence diagrams for the point clouds (originals and inverses), together with those obtained post-matrix-surgery with respect to two linear combinations: (1) a strict surgery replacing both $\sigma_2$ and $\sigma_3$ with $\sigma_1$ (i.e., producing matrices with condition number equal to one) and (2) a more relaxed surgery replacing $\sigma_3$ with a convex linear combination of the singular values. The first row corresponds to the effect of the SVD surgery on the PDs of the original point cloud, while the second row corresponds to the inverse point cloud.
The original point cloud includes matrices with an extremely wide range of condition numbers, which means that their proximity to the set of non-invertible matrices is also wide-ranging. That accounts for the observable visual differences between the PDs of the original point cloud and those of its inverse point cloud in both dimensions. The PDs of the original point cloud and of the point cloud produced by the relaxed surgery are not significantly dissimilar in dimension 0, but in dimension 1, we can notice that many holes in the post-surgery point cloud have longer lifespans, while many others are born later than the time at which all holes in the original point cloud vanish. In fact, in dimension 0, the dissimilarities between the original point cloud and its inverse appear as a result of many connected components in one of them living longer than those in the other. The PDs of the strict-surgery point cloud and those of its inverse are visually equivalent in both dimensions, as a reflection of the fact that this surgery produces optimally well-conditioned orthonormal matrices (i.e., the inverse matrices are simply the transposes of the original ones). This means that the strict surgery is useful for applications that require orthogonality, whereas the more relaxed surgery is beneficial for applications where condition numbers need only lie within a reasonable range of values, as long as they are not ill-conditioned.
For a more informative description of these observations, we computed the death-based binning table, which is shown below as Table 4. The results confirm that the topological profiles (represented by their PDs) of the original point cloud and its inverse are indeed different in both dimensions. There is less quantitative similarity in dimension 0 between the PDs of the original point cloud and the relaxed-surgery point cloud than the visual examination suggested. In dimension 1, the visual observations are to some extent supported by the number of holes in the various bins. The table also confirms the exact similarity, in both dimensions, of the PDs of the strict-surgery point cloud and its inverse, as reported from the visual examination.
Again, we estimated the proximities of the PDs of the various related pairs of point clouds of matrices and their inverses in terms of the bottleneck and the Wasserstein distance functions. The results are shown in Table 5 below. The significantly large distances in dimension 0 explain the noted differences between the PD of the original point cloud and that of its inverse. In dimension 1, the surprisingly small bottleneck distance between the PD of the original point cloud and that of its inverse indicates that bottleneck distances may not reflect the dissimilarities seen in the visual representations. The distances between the PDs of the post-surgery point clouds and those of their inverses are reasonably small in both dimensions, except that in dimension 1, the distance increased slightly after the relaxed surgery. This may be explained by the observation made earlier that “many holes have longer lifespans, while many others are born later than the time at which all holes in the original point cloud vanish” when visually examining the 1-dimensional PDs. Finally, these distance computations confirm the strict similarity reported above between the PDs of the strict-surgery point cloud and those of its inverse.
SVD Surgery for the Motivating Application
The need for matrix surgery to improve the condition number arose during our previous investigation [46], which aimed to develop a CNN model for ultrasound breast tumour images that has reduced overfitting and is robust to reasonable noise. During model training, we observed that the condition numbers of a large number of the initialised convolution filters fluctuated significantly over the different iterations [12]. Having experimented with various linear-combination-based SVD surgery techniques, the work eventually led to a modestly performing customised CNN model with reasonable robustness to tolerable data perturbations and generalisability to unseen data. This was achieved with a carefully selected constant linear combination SVD surgery applied to all convolutional layer filters (1) at initialisation when training from scratch, (2) on pretrained filters, and (3) during training after batches and/or epochs.
Our ongoing work to improve on the previous CNN model's performance is based on using more convolution layers and on investigating the conditioning of the large non-square weight matrices of the fully connected layers (FCLs) of neurons. A major obstacle to the training aspects of this work is the selection of appropriate linear-combination-based SVD surgery for different point clouds over a larger range of filter sizes. In our motivating application, as well as in many other tasks, it is specifically desirable to control the condition numbers of filters/matrices within a specific range and with reasonable upper bounds. Such requirements significantly increase the difficulty of finding different linear-combination-based surgery schemes (suitable for the various convolutional layers and FCLs) that guarantee maintaining condition numbers within specified ranges.
There may exist many alternatives to linear-combination-based reconditioning SVD surgery. The PH investigations of the last section indicate the need to avoid adopting crude/strong reconditioning algorithms, so as not to slow down learning and/or cause underfitting. Below is pseudocode, Algorithm 1, for a simple but efficient SVD surgery strategy that we developed more recently for “reconditioning” each of the convolution filters (as well as the components of the FCL weight matrices) after each training epoch; it maintains the condition numbers within a desired range.
Algorithm 1 SVD surgery and condition number threshold.
1: Compute the SVD of $F = U \Sigma V^{T}$, and let $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$ be the singular values of $F$ in descending order.
2: Set the initial condition number threshold. ▹ Initial threshold
3: while the condition number $\sigma_1/\sigma_n$ exceeds the threshold do
4:  Raise the singular values that fall below the threshold-determined floor. ▹ Threshold small singular values
5:  Recompute the condition number from the modified singular values.
6:  Update the threshold. ▹ Update threshold for next iteration
7:  for each of the remaining singular values do ▹ Smoothen the remaining singular values
8:   Adjust it so that the monotonicity of the singular values is preserved.
9: Reconstruct F using the modified singular values: $F = U \widetilde{\Sigma} V^{T}$.
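A hedged, runnable reading of Algorithm 1 (our interpretation; the iterative threshold update and the smoothing loop of the pseudocode are collapsed here into a single minimal clipping step) might look as follows:

```python
import numpy as np

def recondition(F, max_cond=100.0):
    """Sketch of the reconditioning step described by Algorithm 1 (our reading,
    not the authors' exact code): leave F unchanged if its condition number is
    already within range; otherwise lift the small singular values just enough
    to bring the condition number down to max_cond, keeping the largest
    singular value (and hence ||F||_2) fixed and the spectrum monotone."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    if s[0] / s[-1] <= max_cond:
        return F                       # already within the desired range
    floor = s[0] / max_cond            # smallest admissible singular value
    s_new = np.maximum(s, floor)       # minimal essential adjustment
    return U @ np.diag(s_new) @ Vt

# Example: recondition a randomly initialised 3x3 filter after a training epoch
F = np.random.default_rng(0).standard_normal((3, 3))
print(np.linalg.cond(F, 2), np.linalg.cond(recondition(F, max_cond=10.0), 2))
```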
Note that the above algorithm does not change any input matrix whose condition number is already within the specified range, and it makes only minimal essential adjustments to the singular values otherwise. We are incorporating this efficient SVD-based reconditioning procedure into the training of specially designed slim CNN models for tumour diagnosis from ultrasound images. The results are encouraging, and future publications will cover the implications of such “reconditioning” matrix surgery on the performance of slim CNN models and on the topological profiles of the filters' point clouds during training.
Future work includes (1) the assessment of topological profiles of point clouds of matrices (and those of their inverses) in terms of their condition number distributions and (2) quantifying Demmel's assertion that links the condition numbers of matrices to their proximity to non-invertible matrices. For such investigations, the SVD surgery scheme is instrumental in generating sufficiently large point clouds of matrices for any range of condition numbers.