1. Introduction
To address ubiquitous computing, edge computing, and distributed sensor networks, as well as the significant increase in device density and sensor deployment towards smart and self-aware sensors, sophisticated and dependable data processing architectures are required. Tiny machine learning (ML) is an emerging field posing challenges that are only partially addressed [
1]. Floating-point arithmetic, with its high dynamic range and sufficient precision, is frequently used to compute machine learning models. Very low-resource tiny embedded systems, e.g., ARM Cortex M0-based systems, support only integer arithmetic (8–32 bit); hence, training with integer arithmetic must be performed directly on the target device [
2] or by model modification and freezing [
3]. The computation of complex deep learning (DL) models is further limited by memory and computing power constraints of ultra-low-power devices [
4]. To overcome software limitations and limited computability, hardware designs are becoming more popular [
We focus on the software processing of ML models on low-resource and low-power devices by transforming models to fit such devices.
The present work addresses virtualization on the programming level in IoT and sensor networks using very low-resource computers, typically with less than 64 kB of available RAM, with a particular focus on machine learning (ML) provided as a virtualized service. Compared with [
6], we provide new algorithms and ML applications, including ANN/CNN-based regression and classification tasks, with a rigorous evaluation of discretization errors. The set of discretized non-linear transfer functions is extended, optimized, and evaluated rigorously. A new unified data set derived from physical simulation is used to demonstrate the capability and accuracy of the ML VM service, which can be deployed on any very low-resource micro-controller providing integer arithmetic only, as well as on desktop computers. We evaluate the VM MLISA and the model transformation process with respect to constraints and its application in Structural Health Monitoring.
A functional prediction or regression model is composed of linear and non-linear functions. The most critical part in the transformation and scaling process of ML models towards integer arithmetic is the non-linear function, e.g., the tanh transfer function. A chained composition can introduce high non-linearity, which must be handled carefully to avoid exploding model errors. For this purpose, we will train ANN models with data from highly non-linear analytical functions for the implementation of surrogate models. There is an extended evaluation of computational complexity and requirements for most relevant applications. The advantage of using a textual programming language instead of binary code is demonstrated by extended examples.
This work focuses on a universal VM suitable for implementation on very low-resource embedded systems and providing ML as a programmable service. Although “as a service” is a common cloud-based paradigm, we use the term here to denote virtualization at the node and single-computer level.
There is ongoing work to implement the computation of ML models with 8-bit or lower integer arithmetic (and storage data size) on micro-controllers [
7,
8], commonly called Tiny ML [
9] or ML on commodity devices [
1]. We will relax this hard constraint by assuming a 32-bit microprocessor, e.g., the widely used Arm Cortex M series with chip die areas below 0.1 mm² and a power consumption of about 10 mW (active mode). Finally, we implement ML with 16-bit data-size storage (input, intermediate, and output data as well as model parameters). Overflow issues are relaxed by using 32-bit arithmetic internally. In [
9], the authors outlined the benefits of Tiny ML in the context of sensor networks. Tiny ML enables the local processing of sensor data without extensive periodic communication to external servers, especially in the context of real-time capable structural health monitoring (SHM) systems. In [
9], the computation of complex and deep convolutional neural networks (CNNs) was implemented at the sensor node level, on nodes often equipped with digital signal processors (DSPs) with optimized vector operations and sometimes floating-point arithmetic units (FPUs). The authors chose an Arm Cortex M4-based micro-controller, which provides dedicated DSP and FPU operations and 320 kB of RAM. In [
3], the authors presented a software framework to run lightweight neural networks on micro-controllers based on both the ARM Cortex-M series and the RISC-V-based parallel ultra-low-power (PULP) platform, especially addressing energy-efficient computing, which is a high constraint on self-powered autonomous sensor nodes. The efficient implementation of ML models using a dedicated model and training library Fast ANN on the Arm Cortex M architecture is also demonstrated in [
10]. They claim that the model computation is still possible on FPU-less micro-controllers but do not give evaluations for real use cases. In addition, all these frameworks create static model code, which cannot easily be updated at run-time (no service).
Instead of performing software model transformations to map a model using arithmetic A, defined by an accuracy, value range, and dynamic range, onto a model using arithmetic B with lower accuracy and reduced value and dynamic ranges, the model can be mapped onto a dedicated hardware architecture providing the elementary core operations of ML models, as described by [
1]. Although this is the most efficient method, this approach prevents the use of widely used and cheap electronic components and universal and flexible model computations, as well as update services (of the model and the ML services). This compromises the deployment of the proposed ML/VM architecture in embedded systems with limited or no reprogramming capabilities (like material-integrated systems, discussed in
Section 2).
This paper is organized as follows. A short introduction to material-integrated sensor nodes for structural health monitoring (SHM) is given to outline the motivation as well as the communication architecture, defining constraints for the VM and its ML service introduced and used in this work. An extended section describes the REXA VM with a focus on ML. The transformation process of continuous ML models to discretized integer-scaled arithmetic models is described in detail, and approximation methodologies and discretization errors are discussed for non-linear transfer functions. Two use cases demonstrate the capability of the transformation process, the REXA VM ML operations, and typical accuracy losses that can be expected in real applications for regression and classification tasks.
We introduce two different model scaling algorithms (static and dynamic) and evaluate the effect of discretization on the model accuracy with two use-case examples. One use case uses synthetic sensor data created by wave propagation simulation, and the second uses input data from a mathematical model. The first use case combines classification and regression tasks into one model, and the second is a pure regression model. The main focus is on the non-linear activation functions used in neural network models and their discretization errors. The two use cases will demonstrate the quality of the proposed scaling approach and the simplicity of the VM programming for classification and regression models, including CNN architectures.
2. Sensor Node Architecture for SHM and Communication Architecture
The virtualization and deployment of VMs onto tiny micro-controllers was inspired by the wireless material-integrated sensor node [
11,
12], which is embedded between two layers of a fiber–metal laminate plate and developed in the DFG research group 3022 for automated diagnostics of hidden damages in fiber–metal laminates using guided ultrasonic waves (GUW). The sensor node is supplied with power via RFID/NFC communication technology only [
13]. The energy harvester is able to deliver up to 15 mW of continuous power, significantly constraining the selection and operation of the electronic parts, as shown in
Figure 1. After the sensor node is integrated into the plate, no software updates or maintenance can be applied. The communication with the micro-controller takes place only via the NFC tag. The communication is bidirectional, originally via the NFC tag’s EEPROM (code and data). Alternatively, wireless communication can be realized directly with a write-through mode (because the lifetime of the EEPROM is limited to about 1000–10,000 write cycles), writing message data directly to the micro-controller. The sensor node uses an ARM Cortex STM32 L031 device, an NFC tag with a power supply, and a pre-amplifier for piezoelectric sensors. The features are summarized in
Table 1. More details and descriptions of the sensor node can be found in [
12].
Although this work focuses on the design and implementation of a universal VM suitable for implementation on very low-resource embedded systems and providing ML as a programmable service, communication aspects are relevant and must be considered as constraints, both for the design and operation of the VM and the communication design. Communication with material-integrated sensor nodes is commonly performed by using wireless technologies, but in the presence of metals and dielectric materials, wireless communication is a challenge. Low- and mid-frequency RFID technologies are widely used. In this work, it is assumed that the REXA VM is accessible by RFID/NFC communication, as shown in
Figure 2. Sensor nodes communicate wirelessly via RFID/NFC tag circuits with a reader, which is connected short-range to a “remote” VM instance. The reader nodes are connected long-range (wired or wireless) to establish a distributed network. The reader nodes are communication end-points and message routers.
The communication capabilities have an impact on the ML models that can be implemented with respect to code and data size. It can be expected that the typical message size will not exceed 1 kByte. The transfer of 1 kByte of payload data requires about one second of transfer time, including packet fragmentation. ML models, including the forward prediction functions, are submitted to the VM as text, including the fixed model parameters. In the use-case section, the text and code + data sizes are measured, showing that small and moderately complex models, including CNNs, can be transferred over such a resource-constrained communication channel.
3. REXA VM Architecture and Programming Language
The REXA VM is the core component for implementing and virtualizing ML on low-resource computers using a programmable approach. The next sub-sections describe the VM architecture briefly to aid in understanding the implementation details of the following use cases.
3.1. Architecture
The real-time capable and extensible architecture (REXA) VM is a full-featured script engine based on a stack processor architecture. A detailed description can be found in [
11,
14]. Although any programming language can be used, we implemented a modified subset of the stack-based Forth programming language [
15]. Any ISA can be implemented by addressing stack-based computation, but Forth is a well-known and long-standing programming language that combines high- and low-level programming, can be implemented efficiently on low-resource systems, is extensible (e.g., by the MLISA introduced in this work), and requires only a simple compiler. In contrast, C/C++ is a compiled language not intended for script execution as intended in this work, and C/C++ compilers are complex, whereas Forth compilers are not.
The full version embeds a text-to-byte-code compiler that translates Forth programs into byte-code. The Forth dialect provides high-level constructs like loops and functions (words in Forth terminology) while keeping the VM implementation compact, including a hand-written language parser and an incremental direct compiler producing VM code. One main feature is the binding of data and code in frames without the need for dynamic memory management driven by free/used memory block lists. All dynamic run-time data are stored on multiple data stacks. Such binary byte-code frames with embedded data can be exchanged among different processors and sensor nodes. The compilation, as well as the code execution, can be performed under soft real-time constraints. The run-time can be estimated in advance, e.g., offline, by a VM twin with check-pointing, tracing, and monitoring capabilities. The REXA VM is written in plain C and is highly portable. Alternatively, the REXA VM can be implemented in other programming languages like JavaScript to support embedding in UI applications.
The REXA VM was designed especially for deployment on low-resource micro-controllers with less than 64 kB RAM and low clock frequencies below 50 MHz. It utilizes a freely programmable ISA, but the ISA of the VM used in this work is closely related to the Forth programming language [
15]. The VM is a pure stack processor, i.e., most operations process data via multiple stack memories with a zero-operand instruction format. There is support for arrays and access to external buffers via the DIOS (see below). The VM instruction loop processes byte-code programs stored in a code segment (CS).
Figure 3 shows the architecture design of the REXA VM and its interoperability with the closely coupled just-in-time (JIT) compiler. The JIT compiler depends on the VM ISA, which can be freely defined, although this work is strongly related to the Forth programming language. The architecture details depend on the configuration (single- or multi-tasking, number of stacks, and customized extensions and accelerators). The principal architecture is the same for software and hardware implementations. Profiling is an optional feature used for predictive real-time scheduling, as well as for the energy-aware real-time scheduler.
The code segment (CS) is the central storage for source code, byte-code, and embedded data. The CS is partitioned into dynamically sized code frames, commonly assigned to a task (depending on the scheduling model), as shown in
Figure 4. Assuming a 16-bit VM, the CS is limited to 32k Bytes in size. The scheduler controls and monitors the byte-code loop (
vmloop). Code operations can suspend task execution by waiting for events handled by an event table. The input–output system (IOS), similar to the widely used foreign function interface (FFI), extends the code and data space of the VM and is the central bridge between the core VM and the host application. The VM architecture is optimized for resource sharing, e.g., using an ADC sample buffer for computations from the VM programming level.
Temporary (short lifetime) data are stored and manipulated directly on fixed-size stack memories:
The data stack (DS) holds most of the processing data and instruction operands;
The return stack (RS) is used for function calls (not accessible from the programming level for security reasons);
The optional loop stack (FS) is used for loop counters and secondary user data (it can be merged with the RS for memory efficiency).
All non-temporary data are either embedded in the code frames or provided by the host application via the data input–output system layer (DIOS) API or by providing I/O functions using the function input–output system layer (FIOS) API. All MLISA operations are attached to the VM using the FIOS; ADC buffers are attached by the DIOS.
The data width of the stack cell is always 16 bit (single word width). The REXA VM also supports double-word operations (as a configurable option). Double words are composed of two single data words (word order depends on the native byte order of the underlying processor). The VM can read and write double words directly from and to stacks (single memory access). The access time of multiplexed single and double word access to the stacks by memory pointer casting is commonly identical (assuming 32-bit microprocessors). The push and pop operations involved in most of the VM instruction code words modify stack pointers (dstop, rstop, fstop). For security reasons, the return stack (which holds code pointers on function calls) should not be accessed directly by program code.
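To illustrate the double-word access by pointer casting mentioned above, the following minimal C sketch (hypothetical names, not the REXA VM API) reads and writes two adjacent 16-bit stack cells as one 32-bit value with a single memory access; the resulting word order follows the native byte order of the underlying processor, as assumed here.

#include <stdint.h>

/* Hypothetical illustration: two adjacent 16-bit stack cells are accessed as one
   32-bit double word by pointer casting. This assumes the stack base is 32-bit
   aligned and the given cell index is even. */
static int16_t ds[64];

static void write_double(int top, int32_t v) { *(int32_t *)&ds[top] = v; }
static int32_t read_double(int top)          { return *(int32_t *)&ds[top]; }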
Besides hardcore stacks implemented inside the VM, soft-core stacks can be implemented on the programming level in data arrays (embedded in code frames). Push and pop operations are provided by the core instruction word set:
array mystack 100
1 mystack push
mystack pop . cr
3 mystack get ( Gets copy of n-th value from top )
The compiler translates the source code text into byte-code instructions. It is a just-in-time (JIT) compiler that can compile code incrementally and on demand. Since the ISA of stack processors consists mostly of zero-operand instructions, it supports fine-grained compilation at the token level, including ML models. The source text can be directly stored in the code segment referenced by a code frame (or any other data buffer, alternatively). Most instruction words can be directly mapped to a consecutively numbered operation code. Therefore, the compiler translates the source code into byte-code in place, i.e., by replacing the text with binary byte-code, saving additional target memory buffers. An instruction word consists of at least one character and thus can always be replaced by the op-code (one byte). Although a literal value can consist of only one digit and the data of a single word value occupies two bytes, there is always a space or newline character after a literal value, providing the required data space. Extension of the current code frame at the end is always possible (as long as there is free space in the CS). One exception is a double-word literal value requiring at least two characters and the suffix “l”, followed by an obligatory separator character and the space, providing four bytes of data space in total.
Data are either stored on the stacks during run-time or embedded in the code frame during translation. Scalar variables and initialized arrays can always be embedded in place. Non-initialized arrays are appended to the end of the compiled code frame (their placement is delayed until the code frame is compiled).
3.2. Programming Language
The programming language consists of the Forth core set, which contains mainly zero-operand words. A word is either a literal (a numerical or string value that is pushed onto the data stack) or an instruction word such as an arithmetic or control-flow operation. Zero-operand instructions get their operands from the stacks and store results on the stack again. User functions (words) can be defined by using the colon definition operator, as shown in Example 1. Because user words, as well as core words, get their operands via the stack and store results on the stack, a comment in the form LHS -- RHS is commonly used to specify the input and output function interface. The left-hand side specifies the input arguments (rightmost is the top of the stack), and the right-hand side specifies the output (if any).
A typical REXA VM program for ML consists of a head section defining initialized and non-initialized arrays, and words computing data, as shown in Example 1 and discussed in detail in the MLISA
Section 4.
array X 3
array P { 1 2 3 }
array Y 3
( n -- sum )
: productsum
0 ( sum )
swap
( n ) 0 do
X i cell+ @ ( X[i] )
P i cell+ @ ( P[i] )
* ( X[i]*P[i] )
+ ( +sum )
loop
( sum )
;
3 productsum
Y 1 cell+ ! ( Y[1]=sum )
Example 1. REXA Forth program sketch.
4. ML Instruction Set Architecture
The extensible REXA VM is a stack-based processor that provides ML as a programmable service via its input–output system bridge. To enable and support the efficient processing of ML models, a set of basic ML operations is added to the ISA of the VM. This ISA can be extended at any time. The ML instruction set architecture extension MLISA provides universal ML micro-service operations (ML and MLISA as a service, MLaaS). A code frame describes the ML model structure and defines an inference function that evaluates a specific model by applying the following ML core and vector operations to the input data, e.g., a measured time-resolved sensor signal. There are primarily three classes of ML models supported by the MLISA:
An ML task consists of a prior training phase using example data and a post-application inference phase using new unknown data. Currently, we support only the online and on-site inference of models that were already trained offline. The following MLISA provides only operations for the application of mathematical models based on discrete integer arithmetic. The original models were trained with standard numerical methods using floating-point arithmetic and then transformed into integer-scaled models. Training using classical error back-propagation methods is currently not supported due to the requirement of storing a suitable training and test data set on the device, which is not available on very low-resource microcontrollers.
4.1. ML Core Operations
ANN and CNN computations require efficient and generic vector operations crucial to implementing ML on microcontrollers, at least for model inference. The REXA VM provides a unified core set of vector operations that can be used for the iterative computation of ANN and CNN models. It is assumed that the integer data width of the models is N-bit and that there is 2N-bit arithmetic. In our case, we have 16-bit model data and 32-bit native integer arithmetic. The set of basic operations needed to implement ANN and CNN models and perform forward activation computations consists of the following:
Element-wise vector operations, i.e., addition and multiplication, vecmul: op1vec op2vec dstvec scalevec;
Dot-product operations performing a sum of product data fusion (vecprod: veca vecb scale → number);
A folding operation for node layer computations (vecfold: invec wgtvec outvec scalevec);
A convolution operation for CNN computations (vecconv: invec wgtvec outvec scale inwidth kwidth stride pad);
A pooling operation for CNN computations reusing the vecconv operation. The second argument combines the kernel height and a pooling function index (max, min, and so on) if the kernel width argument is negative (vecconv: invec kheight + poolfun outvec scale inwidth -kwidth stride pad);
A mapping operation applying a function elementwise (vecmap: srcvec dstvec func scalvec);
A reduction operation applying a function to all elements returning an aggregate value (vecred: vec vecoff veclen op) with the supported functions min, max, sum, and average;
A vector reshape operation shrinking or expanding a vector (vecshape: srcvec dstvec scale);
A generic scaling operation (vecscale: srcvec dstvec scalevec).
Vector operations commonly operate on arrays embedded in code frames, as shown in Definition 1. Scaling is typically applied after an aggregation operation, e.g., after computing a vector dot product sum of products (using 2N arithmetic), to avoid overflow. Some operations use one scaling factor for all elements, as discussed in the following section.
Definition 1. Initialized arrays embedded in place in code frames and non-initialized arrays stored at the end of the compiled code frame.
4.2. Vector Operations
The dynamic ranges of different integer (fixed point) and floating-point coding are shown in
Table 2.
The core set of vector operations provided by the REXA VM supporting (16-bit) integer arithmetic ANN and CNN computations are summarized in
Table 3 and
Table 4. These operations are the primary part of the MLISA.
Vector operations always operate on single data words (16 bit), but internally, 32-bit arithmetic is used to avoid over- and underflows. To scale back to a signed 16-bit integer, some of the operations use a scale factor or a scale factor vector (negative scale values reduce, i.e., divide, and positive values expand, i.e., multiply, the values by the scale factor) to avoid overflows or underflows in subsequent computations, similar to scaled tensors in [
7,
8]. Vector operations can access arrays stored in code frames or provided externally by the host application (e.g., a signal buffer).
The computation of these operations is defined by the following formulas:
with
cn as the kernel size (width multiplied by height) and
s as the striding value (default is one).
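As a concrete illustration of the folding operation and the internal 2N-bit accumulation, the following C sketch shows a possible vecfold-style computation; the function and parameter names, the flat weight layout, and the bias handling are assumptions for illustration only and not the REXA VM source. The scale convention follows the one used throughout this work: a negative scale value divides the accumulated sum, a positive one multiplies it.

#include <stdint.h>

/* Sketch of a vecfold-like operation: out[j] = scale(sum_i w[j*inLen+i]*in[i] + b[j]). */
static int16_t iscale(int32_t v, int16_t s) {
  return (int16_t)(s < 0 ? v / (-(int32_t)s) : v * (int32_t)s);
}

void vecfold_sketch(const int16_t *in, int inLen,
                    const int16_t *w, const int16_t *b,
                    int16_t *out, int outLen, int16_t scale) {
  for (int j = 0; j < outLen; j++) {
    int32_t acc = b ? b[j] : 0;            /* 32-bit accumulator avoids overflow */
    for (int i = 0; i < inLen; i++)
      acc += (int32_t)w[j * inLen + i] * (int32_t)in[i];
    out[j] = iscale(acc, scale);           /* e.g., scale = -5000 divides by 5000 */
  }
}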
The vecconv operation can be used for convolutional and pooling layers (pooling is used if wgtwidth is negative and the wgtvec value contains the weight matrix height combined with the pooling function selector). An activation function must be applied separately using the vecmap operation, e.g., by applying a sigmoid function to all elements of a vector.
The vecsumn function is required for sliced convolution. In this case (see architectural description), a partial convolution is performed for one input array and stored in an accumulator array. After all partial convolutions are calculated, the sum of all vectors must be calculated (with final scaling). This can be done in principle with the vecadd operation, but due to the accumulator (single word size), overflows can occur before final scaling.
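Similarly, the vecconv operation described above can be sketched in C as a 1-D convolution with a single filter, zero padding, and stride s; pooling (negative kernel width) and depth slicing are omitted. This is an assumed illustration, not the VM implementation; products are accumulated in 32-bit arithmetic and the signed scale factor is applied at the end.

#include <stdint.h>

/* Sketch of a 1-D vecconv-like operation: one filter, zero padding, stride, pad. */
void vecconv_sketch(const int16_t *in, int inWidth,
                    const int16_t *kernel, int kWidth,
                    int16_t *out, int16_t scale, int stride, int pad) {
  int outWidth = (inWidth + 2 * pad - kWidth) / stride + 1;
  for (int j = 0; j < outWidth; j++) {
    int32_t acc = 0;                        /* 2N-bit accumulator */
    for (int k = 0; k < kWidth; k++) {
      int idx = j * stride + k - pad;
      if (idx >= 0 && idx < inWidth)
        acc += (int32_t)kernel[k] * (int32_t)in[idx];
    }
    out[j] = (int16_t)(scale < 0 ? acc / (-(int32_t)scale) : acc * (int32_t)scale);
  }
}

With the parameters used later in Algorithm 3 (inwidth = 62, kwidth = 3, stride = 1, pad = 0), this yields an output width of (62 − 3)/1 + 1 = 60, matching the layer output arrays.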
4.3. Activation Functions
Besides the vector operations, which can be simply implemented in closed form with integer arithmetic, transfer functions with non-linear behavior are the most critical part of the integer computation of ML models. Different transfer (activation) functions are used in ANN and CNN models; the most prominent examples are:
Linear function (linear) without x- and y-limits;
Logistic or sigmoid function (sigmoid) with y-limit = [0, 1];
Hyperbolic tangent function (tanh) with y-limit = [−1, 1];
Rectifying linear unit (relu) with one-side open y-limit = [0, ∞).
The linear and relu functions can be directly implemented with integer arithmetic without loss of accuracy (except due to the integer discretization). The highly non-linear sigmoid and tanh functions require an approximation by using a hybrid approach combining a (compacted) look-up table (LUT) and an interpolation function. The tanh function can be neglected since it can be replaced, in most cases, by the sigmoid function without loss of generalization (of course, prior to training).
Trigonometric functions and functions composed of trigonometric functions are implemented with piecewise linear and non-linear look-up tables. The approximated discretized
sigmoid function algorithm is shown in Algorithm 1. For example, the error of the discretized sigmoid function is always less than 1% or below 10 digital values while only requiring 30 bytes of LUT space and less than 10 unit operations (in addition to the LUT size of the
fplog10 function). These software functions can be immediately implemented in hardware, too. The LUTs are computed with Algorithm 2. The approximation of the
tanh function is much more complex and computationally intensive as it involves the computation of two exponential terms, which exhibit exploding behavior for larger negative and positive x-values, as shown in
Figure 5. The exploding functional behavior is relaxed for the sigmoid function by computing the
sigmoid function only for positive x-values and using a logarithmic base function, finally mirroring and flipping the result for negative x-values, which does not prevent exploding gradients in the case of the
tanh function.
tanh can be rewritten as shown in the following Eq., computing the discretized
tanh using the same approach as used for the
sigmoid function:
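A standard rewriting that links tanh to the sigmoid-style approximation (and which presumably underlies the equation referenced here) is the identity

tanh(x) = 2 · sigmoid(2x) − 1 = (e^(2x) − 1) / (e^(2x) + 1),

which involves only a single exponential term and allows the same mirroring, segmentation, and logarithmic index stretching to be applied as for the sigmoid function.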
Algorithm 1. Range-segmented and LUT-based implementation of the sigmoid and hyperbolic tangent functions with less than 1% approximation error for a wide range of x-values (using the approximated LUT-based log10 function). Shown is the C program code. The data types are in the format: s = signed or u = unsigned, b = byte, and the number gives the number of bytes.
static ub1 log10lut[] = { <90/DX values> }; // x-scale is 1:10 and log10-scale is 1:100

sb2 fplog10(sb2 x) {
  sb2 shift = 0;
  while (x >= 100) { shift++; x /= 10; }
  return shift*100 + (sb2)log10lut[x-10];
}

static ub1 sglutA[] = { <24 values> };   // alt. ub2
static ub1 sglutB[] = { <6 elements> };  // y scale 1:1000 [0,1], x scale 1:1000

sb2 fpsigmoid(sb2 x) {
  sb2 y;
  ub1 mirror = x<0? 1:0;
  if (mirror) x = -x;
  if (x >= RC1*1000) return mirror? 0:1000;
  if (x <= RA1*1000) {
    y = 500 + ((x*231)/1000);
    return mirror? 1000-y : y;
  } else if (x < RB1*1000) {
    ub2 i10 = fplog10((x/5)/2) - OA1;
    y = ((sb2)sglutA[i10]) + YA1;
    return mirror? 1000-y : y;
  } else {
    ub2 i10 = fplog10((x/10)/10) - OB1;
    y = ((sb2)sglutB[i10]) + YB1;
    return mirror? 1000-y : y;
  }
  return 0;
}

static ub1 thlutA[] = { <24 values> };   // alt. ub2
static ub1 thlutB[] = { <6 elements> };  // y scale 1:1000 [0,1], x scale 1:1000

sb2 fptanh(sb2 x) {
  sb2 y;
  ub1 mirror = x<0? 1:0;
  if (mirror) x = -x;
  if (x >= RC2*1000) return mirror? -1000:1000;
  if (x <= RA2*1000) {
    y = (x*920)/1000;
    return mirror? -y : y;
  } else if (x < RB2*1000) {
    ub2 i10 = fplog10((x/2)/2) - OA2;
    y = thlutA[i10] + YA2;
    return mirror? -y : y;
  } else {
    ub2 i10 = fplog10((x/10)/10) - OB2;
    y = thlutB[i10] + YB2;
    return mirror? -y : y;
  }
  return 0;
}
The LUT can be computed with a stretched x distribution as follows, assuming Δ
x = 1, 2, 3, ...:
with % as the modulo operation that creates an equidistant series of values. The
log10lut table size is 90/Δ
x with an unsigned byte data type. The accuracy (relative error) of the sigmoid approximation is plotted in
Figure 6 with an input and output scaling factor of 10 for different LUT sizes. The LUT sizes were 90, 45, and 23, respectively. Using Δ
x larger than 1 results in a significantly increased approximation error for small x-values (20%), but the average relative error rises only from 1% to 3%.
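A possible generator for the stretched log10 LUT consistent with the fplog10 code above (x-scale 1:10, log10-scale 1:100, table size 90/Δx) is sketched below; the index convention for Δx > 1 is an assumption made for illustration.

#include <math.h>

/* Sketch: generate log10lut for scaled inputs x = 10..99 (representing 1.0..9.9),
   keeping only every DX-th entry (stretched, equidistant x series). */
#define DX 1
unsigned char log10lut[90 / DX];

void make_log10lut(void) {
  for (int x = 10; x < 100; x += DX)
    log10lut[(x - 10) / DX] = (unsigned char)lround(log10(x / 10.0) * 100.0);
}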
The
fpsigmoid function LUTs are computed iteratively using the
fplog10 function for LUT index stretching, described by the following pseudo-code in Algorithm 2. The symmetry of the sigmoid function is exploited by just computing the positive normalized x-range and applying mirroring and flipping for negative values. The positive function is approximated by four segments. The normalized x-range [0, 1] is handled by a linear function directly computable in the range [0,
RA], followed by the first highly non-linear segment in the range (
RA,
RB), and the converging segment in the range (
RB,
RC), and finally a constant segment (x ≥
RC).
Algorithm 2. Computation of the segmented LUTs A/B for the integer-scaled sigmoid and hyperbolic tangent functions. F is the real-valued generator function for sigmoid or tanh LUT computation. The log10 function is used to stretch the x distribution in the LUTs.
RA := 1   RB := 3   RC := 10   δ := 0.01
OA := int(fplog10(int(x*1000/5))/2)
OB := int(fplog10(int(x*1000/10))/10)
YA := int(F(RA)*1000)
YB := int(F(RB)*1000)
lutA := []
for x = RA to RB step δ do
  i10 := int(fplog10(int(x*1000/5))/2) - OA
  if lutA[i10] = undefined then
    lutA[i10] := int(F(x)*1000) - YA
  endif
done
lutB := []
for x = RB to RC step δ do
  i10 := int(fplog10(int(x*1000/10))/10) - OB
  if lutB[i10] = undefined then
    lutB[i10] := int(F(x)*1000) - YB
  endif
done
The accuracy (relative error) of the sigmoid approximation is plotted in
Figure 7 with an input and output scaling factor of 10,000 (i.e., 1:10,000). For
x > −3, the error is below 5% and decreases to 1% on average. For
x < −3, the relative error increases significantly due to the integer discretization error. The error increases in some x-ranges for lower
fplog10 resolution (smaller LUT sizes, Δ
x = 4) but can be improved if the sigmoid interval ranges
R are shifted towards larger values (increasing the sigmoid LUT, too). The red and green areas show lowered or increased accuracy. The accuracy of the transfer function itself is not a measure of the accuracy of ML models using this function, especially if post-trained (adapted) using the discretized function instead of the continuous function. For the segment ranges
R = [
A = 1,
B = 3,
C = 10], the sizes of the LUTs are |
sglutA| = 24 and |
sglutB| = 6, for
R = [1, 7, 15], the sizes are |
sglutA| = 43 and |
sglutB| = 8, approximately |
sglutA| ≈ 6(
B −
A + 1) and |
sglutB| ≈
C −
B, in addition to the LUT size of the
fplog10 function (90/Δ
x), requiring in total 6(
B −
A + 1) +
C −
B + 90/Δ
x Bytes of static storage.
The implementation of the integer version of the
tanh function requires extended LUTs due to the higher gradient in the x-range [0, 1], i.e., choosing
RA < 1. Typical results compared with the real-valued function are shown in
Figure 8. Selected error statistics of the
fpsigmoid and
fptanh functions are shown in
Table 5. The median error is mostly below 1%. Higher errors commonly occur with small y values as a result of integer discretization and not the approximation itself. The LUT sizes vary between 20 and 50 elements and are small enough to be stored in very low-resource micro-controllers, even if a 16-bit data word size is required.
To summarize, the accurate approximation of the highly non-linear and widely used sigmoid and tanh functions is possible with a segmented LUT approach. The computational complexity is low (less than 20 unit operations are required for one function evaluation), and the storage requirement is low (about 200 Bytes for each function). The average relative error is below 1%, except for small integer values limited by the discretization error.
5. Transformation Process
In the following, we introduce the transformation process of a floating-point arithmetic model to a scaled fixed-point (FP) integer arithmetic model. For historical reasons, we call the floating-point model Foo and the fixed-point integer-scaled model FooFP, and the same is true for the activation function, Act and ActFP, respectively. The transformation process can be summarized as follows:
Training of the model using floating-point arithmetic and a generic ML software framework like Neataptic [
16] or ConvNetJS [
17];
Validation of model using test data to estimate model accuracy;
Transformation of the original model into an annotated unified surrogate Foo model for later analysis, FooFP transformation, and model comparison;
Restructuring and refactoring of the Foo model (layer expansion, i.e., mapping three-dimensional tensors onto vectors suitable for vector operations);
Statistical analysis (value boundary scans) of data flow in the Foo model using the entire data set;
Calculation of scaling factors (static or dynamic scaling);
Transformation of the unified Foo model to a surrogate FooFP model using integer-scaled arithmetic;
Test of FooFP using discretized and scaled test data and comparison with results from Foo model with respect to overflow (should not occur) and model accuracy deviation;
Transforming layers into a sequence of MLISA vector operations;
Testing of the MLISA integer model using the (simulated) VM with the integer data set, and validation against the Foo/FooFP models.
For the sake of simplicity and modifiability, we use two JavaScript-based ML frameworks:
Neataptic primarily for pure ANN models [
16];
ConvNetJS for CNN models (including ANN) [
17].
Both frameworks provide direct access to the network layers, the forward and backward functions, and enable easy modification to support the model transformation process and perform statistical analysis. Our approach can be used with other frameworks, e.g., PyTorch and Tensorflow, although it is more difficult to implement our algorithms.
There are three phases in the model transformation:
Mapping the internal framework model (e.g., ConvNetJS) to an annotated functional unified standard model (USM) aligned to the operational vector semantics of the VM MLISA and refactoring if necessary (see
Figure 9);
Annotation of the USM with statistics derived from analysis of model parameters, input, intermediate, and output data;
Calculation of scale factors based on the statistical analysis and replacement of floating-point vector operations with integer-scaled vector operations (for simulation), arranged in a sequential list of MLISA vector and scale operations.
The sliced and sequential accumulated convolution, pooling, and product-sum calculation of fully connected neuronal layers is required to match the MLISA vector operations provided by the REXA VM, as shown principally in Algorithm 3. If a previous layer has a depth (z) ordering, i.e., a result of a multi-filter convolution operation, the following layer must process the output for each z-layer independently using slicing and accumulators and, finally, fusion by the
vecsumn operation. Each slice has its own weight or filter coefficient set.
Algorithm 3. Principle factorization of sliced accumulated operations.
( Layer i-1=3 ) ( Output of layer i-1 )
array yL3N0 62  array yL3N1 62  array yL3N2 62  array yL3N3 62

( Layer i=4 ) ( Output of layer i )
array yL4N0 60  array yL4N1 60  array yL4N2 60  array yL4N3 60

( Accumulators of layer i shared by all nodes )
array aL4S0 60  array aL4S1 60  array aL4S2 60  array aL4S3 60

( Filter weights for each slice and node )
array wL4N0S0 { .. }  array wL4N0S1 { .. } ..

: forward
  ( First Conv filter operation N=0 )
  yL3N0 wL4N0S0 aL4S0 -5000 62 3 1 0 vecconv
  yL3N1 wL4N0S1 aL4S1 -5000 62 3 1 0 vecconv
  yL3N2 wL4N0S2 aL4S2 -5000 62 3 1 0 vecconv
  yL3N3 wL4N0S3 aL4S3 -5000 62 3 1 0 vecconv
  aL4S0 aL4S1 aL4S2 aL4S3 yL4N0 0 4 vecsumn
  yL4N0 yL4N0 $ relu 0 vecmap
The statistical analysis is split into a static and a dynamic analysis. The dynamic analysis applies the phase-1 transformed Foo ML model (still using floating-point arithmetic) to all training and test data samples. The fivenum statistics (minimum, maximum, median, and first and third quartiles) are recorded for the input vectors x of a layer, the output vectors y, and intermediate values like the sum output of a neuron. Additionally, statistics of static parameters like weights and biases are recorded.
5.1. Scaling
The floating-point numbers must be scaled for the target data type range, e.g., N = 16 bit signed integer. The set of values consists of input, output, and intermediate data, convolutional filter coefficients, and weight parameters. The following three significant issues arise:
To avoid overflows, the scaling factor should be lowered such that the maximum value does not exceed about 0.7 max(N), i.e., introducing a safety margin, e.g., limiting the secure value range of a signed 16-bit integer value to [−10,000, 10,000];
If an ML model is applied to new unknown input data, this secure range can be exceeded even without exceeding the real value range of any input, intermediate, or output variable; if the model poses (high) non-linearity, the behavior is unpredictable for unknown data (layer-accumulative over- or underflow errors);
To avoid increased discretization and underflow errors, the scaling should be lifted, especially between layers, but with some constraints discussed later on.
The idealistic scaling function is:
with
xi as the target scaled (fixed point) integer value and
xr as the original real (floating point) value. This scaling would exploit the full integer value range (including the safety margin), but accumulative (sum) or sign-dependent operations like
relu would fail due to the origin shift. An improved origin and sign-preserving scaling is then:
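An origin- and sign-preserving scaling consistent with the description above and with the static scaling in Algorithm 6 is, for example,

x_i = round(s · x_r),   with   s = range / max(|x_r,min|, |x_r,max|),

so that x_r = 0 maps exactly to x_i = 0 and the sign of every value is preserved.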
The scaling is not part of the values. Instead, the scaling factors are static parameters of the model.
Different scaling architectures for functional nodes (neurons) and convolution and pooling nodes are shown in
Figure 10. There is symmetric scaling with the same input and output scaling and asymmetric scaling with different scaling factors on the input and output. The activation function expects a specific input and output scaling, therefore requiring intermediate re-scaling to meet this constraint. For instance, the
fpsigmoid and
fptanh functions discussed in
Section 4.3 expect a static x- and y-scaling of 1000. The weights of neuronal nodes and the kernel coefficients of convolutional nodes are scaled based on the model analysis (minimum and maximum). The bias scaling is the same as for the weights. A convolutional layer applies
n different filters to the input data, which can be one linear vector or multiple vectors from a previous convolutional layer. The filter dimension can be considered an additional data depth dimension. A neuronal network layer always flattens the depth dimension. The processing of one convolutional operation involves an accumulator that sums the results of the filter application to all input (depth) vectors, as shown in
Figure 10 (Bottom).
The dynamic and adaptive re-scaling of intermediate variables and parameters has no effect if a following layer is a discretized LUT-based function (e.g., tanh) but can have an effect if there is a non-LUT-based function or if another accumulative (FC/CONV) function is applied.
Things become more difficult if we assume different (optimized) scaling of intermediate values. The finest granularity of dynamic scaling is one vector, e.g., the entire output of a neural node layer or one (depth) output vector of a convolutional or pooling layer gets the same scaling. The following operation processes multiple input vectors sequentially by using an accumulator. If different input vectors have different scaling (fractional to normalized scaling), the scaling must be corrected before summing up the results in the accumulator. Convolutional and neuronal node operations are always product–sum operations, as shown in
Figure 11. A scaled product–sum is then given by the following re-scaling:
To evaluate the dynamic fine-grained scaling, we compare these models with a globally statically scaled model, i.e., applying a fixed scaling factor to all values, e.g., s0 = 10,000 for 16-bit signed integer values if the statistical analysis returned a value range within the limits [−1, 1] for all model parameters, input, intermediate, and output values. Therefore, a best guess static scaling is given by range/max(M), where max(M) is the maximum absolute value of any parameter and any value of the input, intermediate, and output data.
Any product–sum calculation with scaled weights (scaling factor
sw) requires a downscaling of 1/
sw afterward, performed directly with the MLISA vector operations, as shown in Algorithm 4.
Algorithm 4. Downscaling of MLISA vector functions assuming a weight scale factor of 5000. The X/Y scaling is not relevant here and must not be adjusted.
array X 4
array W { 1000 2000 3000 4000 }
array Y 4
X W Y -5000 vecfold
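As a brief worked check of Algorithm 4 (using the stated weight scale factor of 5000): if the inputs X carry a scaling of s0 and the weights W a scaling of sw = 5000, each product is scaled by s0 · 5000, so the accumulated sum must be divided by 5000 (the negative scale argument) to restore the output scaling s0:

y_FP = (1/s_w) · Σ_i w_i^FP · x_i^FP = (1/5000) · Σ_i (5000 · w_i)(s0 · x_i) = s0 · Σ_i w_i · x_i.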
5.2. Workflow
The transformation of the continuous floating-point arithmetic model into scaled discrete integer arithmetic is an iterative process and depends on the specific use case and the training data.
The entire workflow and model processing pipeline is shown in
Figure 12. The USM is basically a layer table providing relevant information for each layer
L, the layer parameters (weights
w and bias
b), layer-specific statistical data analysis
S, including each layer latent variable output
z, a layer surrogate function
F using floating-point arithmetic, scaling factors
S for each layer and node of a layer (for weight and bias parameters
w and
b, input and output scaling for each layer,
zin and
zout, respectively), and finally the integer-scaled and transformed surrogate function
FP. The functions
F and
FP are used for model simulation, under- and overflow analysis, and model error analysis. The MLISA vector operations are derived from the
L,
P, and
S information. Tensor flattening and layer node restructuring are conducted in the first USM transformation phase.
The statistical analysis, as shown in Algorithm 5, must provide value distributions of all model parameters and average statistics of input, intermediate, and output nodes based on the available training (including validation) data. Under- and overflow of integer arithmetic operations must be prevented by choosing the scaling factors with a safety margin. Discretization and rounding errors using integer-scaled arithmetic are accumulative across model layers, requiring a simulation of the scaled model to detect range violations.
The scaled transformation can be static using one fixed model scaling factor s0 based on the absolute maximum value calculated from all (x,w,b,z,y) values, as shown in Algorithm 6, or dynamically adapting each layer scaling independently to fill the available integer value range optimally (reduced by the safety margin), as shown in Algorithm 7. This is completed by using layer-specific re-scaling factors applied to the global preset factor s0.
Note that layer parameters are vectors of vectors (i.e., a matrix). One vector is associated with one node of a layer, e.g., a neuron function (vector of weights) or one convolution operation of a convolutional layer (vector of kernel coefficients).
Algorithm 5. Static and dynamic analysis of the pre-transformed continuous USM/Foo model. The layer-specific F function is a surrogate and simulation function using floating-point arithmetic that also performs statistical collection on calling. The compute table implements all layer-specific computations, e.g., convolution or application of activation functions. Finally, the global model statistics stats = (min,max) of all values and parameters are computed.
updateStats = (l,x) => ( l.stats[$x] := l.stats[$x] + fivenum(x) )

for ∀ l ∈ L do:
  if l.parameters.w then updateStats(l,w=l.parameters.w)
  if l.parameters.b then updateStats(l,b=l.parameters.b)
  l.F = (l,x) => (
    updateStats(l,x)
    y = compute[l.type](x,l.parameters)
    updateStats(l,y)
    y
  )

predictFoo = (L,D) =>
  for ∀ d ∈ D do
    x := d
    for ∀ l ∈ L do:
      y := l.F(l,x)
      x := y
  L.stats = ∪ { ∀ l.stats ∈ L } // global model statistics
  y
Algorithm 6. Static scaling algorithm transforming a continuous into a discrete model. The layer-specific FP function is a surrogate and simulation function using bit-accurate integer-scaled arithmetic. The default scale is s0, applied to all model parameters and input values. The default range includes a safety margin, e.g., for a 16-bit integer, it could be range = 10,000 (but maximally about 30,000).
s0 = range / max(|L.stats.max|,|L.stats.min|)
for ∀ l ∈ L do
  if l.parameters.w then l.parametersFP.w = scale(l.parameters.w,s0)
  if l.parameters.b then l.parametersFP.b = scale(l.parameters.b,s0)
  l.FP = (l,x) => (
    y = computeFP[l.type](x,l.parametersFP,s0)
    y
  )
Algorithm 7. Simplified dynamic scaling algorithm transforming a continuous into a discrete model. The layer-specific FP function is a surrogate and simulation function using bit-accurate integer-scaled arithmetic. The default scale is s0, applied to all model parameters and input values, but re-scaling factors can modify the default scale, including input and output scaling of layer functions. It is important to keep track of the current layer input and output re-scaling (reScaleCurrent).
reScaleCurrent := 1
range = 10000 // default ± integer value range
imul = (x,k) => ( if k>0 then int(x*k) else int(x/k) )
idiv = (x,k) => ( if k>0 then int(x/k) else int(x*k) )
adaptScale = (max,scale,range) => (
  rescale = range/(max*scale)
  rescale<0? -int(-1/rescale) : int(rescale)
)

for ∀ l ∈ L do
  l.scale = s0*reScaleCurrent
  yrescale = adaptScale(max(|l.stats.y.min|,|l.stats.y.max|), s0*reScaleCurrent, range)
  if not layerHasFixedScale(l) and not layerHasFixedScale(next) and yrescale > reScaleCurrent then
    l.rescaleY = int(yrescale/reScaleCurrent)
  else if yrescale < 0 then
    if -yrescale > reScaleCurrent or (reScaleCurrent % -yrescale) ≠ 0 then
      yrescale = -reScaleCurrent
    l.rescaleY = yrescale
  else
    l.parametersFP.rescaleY = 1
  if l.parameters.w then
    l.scaleW = s0
    l.rescaleW = adaptScale(max(|l.stats.w.min|,|l.stats.w.max|), s0, range)
  if l.parameters.w then l.parametersFP.w = scale(l.parameters.w, imul(s0,l.rescaleP))
  if l.parameters.b then l.parametersFP.b = scale(l.parameters.b, imul(s0,reScaleCurrent))
  if layerHasFixedScale(l) then
    if reScaleCurrent ≠ 1 then
      l.scaleX = s0*reScaleCurrent
      l.scaleY = yscale
      l.rescaleX = -reScaleCurrent
      l.rescaleY = 1
      reScaleCurrent := 1
    else
      l.scaleX = xscale
      l.scaleY = yscale
      l.rescaleX = 1
      l.rescaleY = 1
  l.FP = (l,x) => (
    y = computeFP[l.type](x, l.parametersFP, l.scaleW, l.rescaleW, l.rescaleY, l.scaleX, l.scaleY)
    y
  )

predictFooFP = (L,D) =>
  for ∀ d ∈ D do
    x := d
    for ∀ l ∈ L do:
      y := l.FP(l,x)
      x := y
    y
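To illustrate the intent of adaptScale in Algorithm 7 with hypothetical numbers: with range = 10,000, s0 = 2500, and a layer whose output statistics show a maximum absolute value of 0.5, the product max · scale = 1250 uses only a fraction of the secure range, so a re-scale factor of

rescale = range / (max · scale) = 10,000 / (0.5 · 2500) = 8

can be applied to that layer output; conversely, if max · scale exceeded the range, a negative (dividing) re-scale factor would be chosen.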
The final MLISA REXA VM code synthesis creates the necessary data storage (input, intermediate, and output arrays, as well as the parameter arrays). The sharing of arrays is supported for a chain of 1:1 mapping operations, e.g., application of a transfer function. Sharing of dynamic data storage (array unions) is difficult to implement if the union would contain arrays of different lengths. REXA VM arrays always contain a length header at the beginning, preventing the sharing of different length arrays.
5.3. Unified Model Graph
The previous workflow, consisting of model pre-transformation, analysis, and post-transformation, is merged into one meta-graph model, as shown in Definition 2.
Definition 2. Unified model graph merging the original ML model, the USM with its Foo and FooFP surrogate models, and all transformation parameters.
6. Simulation and Data Set
Sampling of experimental measuring data originating from damaged structures is a difficult task with respect to parameter variance, i.e., damage position, size, sensor positions, and so on, and ground truth labeling. Therefore, for this study, we used simulated GUW time-resolved signal data. The signals were simulated using an extended version of the SimNDT simulator [
16] based on an elasto-dynamic finite integration technique [
18]. A transmission GUW experiment commonly utilizes two transducers, one generator (pitch signal), and one sensor (catch signal). The generator signal was a sine wave of base frequency 40 kHz and a Gaussian mask window (5 cycles). The simulation was carried out with a time step of 0.06 μs, a total of 5000 steps (300 μs), with each tenth step recorded. In total, 7 × 6 damage positions were simulated. Circular damage (air, 30 mm diameter) placed at a specific center location (
x,
y) modifies the GUW signals, as shown in
Figure 13. The host material was a plate of 500 × 500 mm with high absorbing damping material at each plate side (to minimize wave reflections at edges and plate sides).
7. Use Case 1: CNN for Damage Location Regression and Classification
The first use case uses the data delivered by the GUW simulation introduced in the previous section. In total, there were 43 different data sub-sets, each related to a specific position of the circular damage in the plate structure, including the baseline measurement without damage. A classical CNN model was chosen to predict the damage positions (
x,
y) and provide binary damage classification. The CNN input was a down-sampled and low-pass filtered GUW signal (128 data points). The outputs are two continuous variables,
px and
py, normalized to the full range of the damage location in the x- and y-direction with a 10% margin, i.e., the minimum location coordinate corresponds to 0.1, and the maximum value corresponds to 0.9. If
px < 0.1 and
py < 0.1, then no damage was detected (i.e., classification output). The model architecture and its parameters are shown in Definition 3. The CNN was implemented with the ConvNetJS framework [
17] and trained for the typical 500 epochs at a default learning rate of α = 0.01 using the adagrad trainer (a batch size of 1 was chosen due to the low sample count).
Definition 3. Architecture and parameters of the CNN model (ConvNetJS) using 1 dim convolution and pooling operations.
Convolutional Neural Network
============================
Classes: undefined
Input: [128,1,1]
Output: [2]
Layers: [L8 P13]
[1] input : out=[128,1,1]
[2] conv : in=[128,1,1] out=[124,1,4] k=[5,1] filters=4 stride=1
[3] relu : out=[124,1,4]
[4] pool : in=[124,1,4] out=[62,1,4] k=[2,1] filters=4 stride=2
[5] conv : in=[62,1,4] out=[60,1,4] k=[3,1] filters=4 stride=1
[6] relu : out=[60,1,4]
[7] pool : in=[60,1,4] out=[30,1,4] k=[2,1] filters=4 stride=2
[8] fc : in=[120] out=[16]
[9] tanh : out=[16]
[10] fc : in=[16] out=[8]
[11] tanh : out=[8]
[12] fc : in=[8] out=[2]
[13] regression : in=[2] out=[2]
Predictors: 128
A typical GUW signal and its low-pass-filtered and down-sampled version are shown in
Figure 14. The low-pass filter was a simple exponential filter with a filter coefficient of
k = 0.2.
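The preprocessing can be sketched as follows, assuming a standard first-order exponential smoothing filter with k = 0.2 followed by simple decimation to 128 samples; this is an illustrative sketch, and the exact filter and decimation used in the study may differ in detail.

#include <stdint.h>

/* Sketch: exponential low-pass filter y[i] = k*x[i] + (1-k)*y[i-1] with k = 0.2,
   implemented in integer arithmetic, followed by decimation to 128 output samples. */
void preprocess_sketch(const int16_t *x, int n, int16_t *out /* 128 values */) {
  int32_t y = x[0];
  int decim = n / 128;
  if (decim < 1) decim = 1;
  for (int i = 0, j = 0; i < n && j < 128; i++) {
    y = (2 * (int32_t)x[i] + 8 * y) / 10;   /* k = 0.2: y = 0.2*x + 0.8*y */
    if (i % decim == decim - 1)
      out[j++] = (int16_t)y;
  }
}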
To capture training variations, the original model was trained 100 times, each training starting with a randomly initialized model but always with the same training and test data set.
Figure 15 shows the comparison of the prediction results of the continuous Foo and discretized and scaled FooFP model for the regression tasks. The prediction delivers the damage position coordinates, and a non-damage detection is given by a (0,0) value pair (or close to). The RMSE and maximal position prediction errors are computed. Results for static and dynamic scaling were compared. The summary of results is as follows:
The total value range of all input, output, intermediate, and parameter values depends on the particular training of the original model but is mostly in the range [−4, 4] (see
Figure 15,
V column). Therefore, a static or preset scaling of
s0 is in the range [3000, 10,000].
The discretization error of integer arithmetic is negligible for all linear operations but depends slightly on the discretization parameters of the non-linear functions (FP1/FP2/FP5 in
Figure 15 represents LUT resolution with 1/ΔX = 1/2/5).
The dynamic scaling, compared with static scaling, shows no significant improvement in the model accuracy (maximally 5%) but, unexpectedly, exhibits a larger variance (see
Figure 15).
The overall discretization error depends on a particular model parameter set, i.e., with nearly the same floating-point accuracy, the integer model accuracy can differ significantly (RMSE and Emax). Multiple trained models should be analyzed with respect to the final discretization error, selecting the best model.
The prediction error of the discretized model (RMSE and Emax) differs only slightly from that of the continuous model.
There is no increase in the classification error compared with the continuous model, showing an overall stable prediction behavior.
To conclude, the discretization, even with a moderate static scaling, does not degrade the prediction accuracy.
The MLISA REXA VM Forth program of the discretized model is shown in
Appendix A in Algorithm A1. The model code occupies 1746 dynamic and 2166 static words in the CS (i.e., about 8 kBytes of RAM). The entire textual code size is 18,452 Bytes. The forward computation requires the execution of 1280 words, which is equivalent to about 85 ms/MHz, i.e., 1280 words/(15,000 words/s) ≈ 0.085 s at 1 MHz (assuming 15k words/s per MHz [
11]).
8. Use Case 2: ANN Polynomial Models
Based on the previous use case evaluation indicating non-linear functions as the primary source for approximation errors, we want to force high non-linearity in an ML surrogate model for a polynomial of degree
n:
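In general form, such a polynomial of degree n can be written as

p_n(x) = a_0 + a_1·x + a_2·x² + … + a_n·x^n = Σ_{i=0..n} a_i·x^i,

where the specific degree-4 coefficients used in this study are given by the parameters below.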
The ANN model architecture for the implementation of such a surrogate model is shown in Definition 4. It consists of only 17 neurons. A polynomial of degree 4 was chosen with the following parameters:
The ANN was trained with data generated by the polynomial model (500 epochs); 100 independent models were trained using the same training data consisting of 1000 randomly selected function samples. After model training, the Foo transformation process and a model analysis were performed.
Definition 4. Architecture and parameters of the ANN model (ConvNetJS) as a surrogate regression model for highly non-linear analytical model functions.
Artificial Neural Network
============================
Classes: undefined
Input: [1,1,1]
Output: [1]
Layers: [L5 P9]
[1] input : out=[1]
[2] fc : in=[1] out=[4]
[3] tanh : out=[4]
[4] fc : in=[4] out=[8]
[5] tanh : out=[8]
[6] fc : in=[8] out=[4]
[7] tanh : out=[4]
[8] fc : in=[4] out=[1]
[9] regression : in=[1] out=[1]
Predictors: 1
The following discretization parameters for the activation function were chosen, as shown in
Table 6.
The statistical analysis of the prediction results of the continuous Foo and the discretized FooFP model is shown in
Figure 16. The summary of the results is as follows:
The prediction errors of the discretized model (RMSE and Emax) are significantly higher compared to the continuous model.
The discretization error results from the non-linear tanh function, which is clearly highlighted if a scaled floating-point alternative is used in the discretized model, with errors then similar to the continuous model.
The choice of the tanh and underlying log10 discretization parameters determines the prediction error. Modification of the LUT partitioning and interval coefficients can improve the RMSE as well as the maximum error Emax.
Dynamic scaling shows no significant improvement in the model accuracy (maximal 5%).
As shown at the bottom of
Figure 16, the discretization error is not constant; instead, it introduces discontinuity.
The MLISA REXA VM Forth program of this model is shown in
Appendix A in Algorithm A2. The model code occupies 18 dynamic and 89 static words in the CS. The entire textual code size is 921 Bytes. The forward computation requires the execution of 70 words, which is equivalent to about 5 ms/MHz.
9. Discussion
The two use cases clearly showed the benefit of the proposed scaling approach and the simplicity of the VM programming for classification and regression models, including CNN architectures. The results can be summarized as follows:
The prediction error of the discretized model compared with the continuous model is comparable if there is no high non-linearity (use case 1) but significantly higher if there is a higher degree of non-linearity (use case 2).
Dynamic scaling compared to static scaling shows no significant improvement, except for very low default s0 scaling factors.
Model optimization with respect to the average classification error or the RMSE and peak regression errors is possible via the optimization of the non-linear piecewise-segmented and LUT-based activation functions (sigmoid, tanh). The optimization requires the modification of the function approximation parameters, but these functions are statically built into the VM (as a service).
Even if the micro-controller provides an FPU, the VM should continue using 16-bit integer arithmetic to satisfy the still remaining hard memory limits. Moreover, the JIT run-time compiler translates text to byte-code in place. For instance, a constant value “0” can always be replaced by a binary 16-bit container since tokens are separated by space or newline characters. A 32-bit engine using the FPU would make this in-place approach impossible, and the memory footprint would increase significantly (doubling at most).
The REXA VM implementation of the discretized models requires typical code sizes (including data) of about 1–20k Bytes, which can be transferred using (RFID) wireless communication. Even more complex models can be processed by a 16-bit VM with a code segment size limit of 32k Bytes.
The average computation times of models with the REXA VM range from 1 to 100 ms, which is fully sufficient even for ad-hoc remotely powered sensor nodes via RFID fields.
10. Conclusions
In previous work [
6], we investigated the effect of ML model discretization for classification tasks only. This work investigated the effect of integer-scaled discretization for regression tasks as well, using two use cases, and presented the model transformation and scaling algorithms in detail. The first use case considers time-dependent ultrasonic signals as an input for a damage location regression model. The second use case uses synthetic data from a highly non-linear polynomial function to investigate the impact of the discretization of the non-linear activation functions. Static scaling (one scaling factor for the entire model) was, surprisingly, fully sufficient, and dynamic fine-grained scaling of different stages of the model does not improve the overall prediction accuracy. The highest impact on the model accuracy comes from the discretization and step-wise approximation of the non-linear (activation) functions. As an outlook, the model activation functions could be generated for a specific model at run-time by the VM based on parameters provided with the transformed model. However, the generation of the approximated functions requires floating-point versions of at least the
log and
e functions, and optimally the floating-point versions of the activation functions themselves. If floating-point functions are available, the non-linear activation functions could be calculated directly without approximation and discretization errors (or at least with significantly lower errors). One solution could be generic activation function templates that can use different LUTs transferred separately to the VM. Finally, the impact of integer discretization on the accuracy of recurrent state-based neural networks should be investigated.