The Raspberry Pi Pico, ESP32, and Arduino Nano 33 BLE Sense are the embedded systems selected for the TinyML implementation. The technical criteria considered when choosing these devices are their hardware features, which are well suited to TinyML applications, and the EmSPS available for programming the CPU SoC of each device. It is also worth noting that the Pi Pico, ESP32, and Nano 33 BLE each have a 32-bit CPU architecture, a CPU clock frequency above 60 MHz, and an internal RAM size above 200 KB, which are key hardware features. Regarding the EmSPS for programming the devices, we used the Arduino IDE, which offers numerous tools and functionally stable libraries, among other advantages, for developing optimal TinyML algorithms.
3.1. Implementation Methodology
The implementation methodology consists of the following steps:
TinyML algorithm selection on a computer.
Training the selected model on a computer.
Testing and compressing the model on a computer.
Programming, deploying, and testing the trained model on the embedded system.
We use Python version 3.9.9 for programming, and TensorFlow and Keras are the libraries that support the computer-side implementation of the training algorithm. The TensorFlow Lite library in the Arduino IDE supports deploying the DNN model on the selected embedded systems. The following paragraphs explain each step in detail.
Remark 4. The computer used to train the DNN and program the embedded systems has the following hardware: a Core i5 CPU at 2.4 GHz, 12 GB of RAM, a 1 TB solid-state drive, and no dedicated graphics card.
TinyML algorithm selection on a computer. First, the dataset is selected. We utilize the Iris dataset [48]. This dataset comprises 150 samples; each sample is a set of four measurements describing a flower's dimensions.
Figure 6 represents the information in the Iris dataset. Each entry corresponds to a set of four values that indicate the length and width of the sepal and the length and width of the petal, in centimeters. The sepal and petal measurements determine whether the flower is setosa, virginica, or versicolor; hence, there are three classes.
Based on the results obtained in [88], where the authors used four hidden layers with the following distribution: 300, 200, 100, and 16 neurons, respectively, we decided to carry out a similar implementation and use the DNN architecture illustrated in Figure 7.
Other works that support this implementation can be seen in [89,90].
The Python code that illustrates the statements containing the libraries and plugins required for loading and preparing the data, as well as for selecting and configuring the DNN, appears in Figure 8. The Iris dataset is loaded through a CSV file.
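For illustration, a minimal sketch of this loading step is given below; the file name iris.csv and the column layout are assumptions, since the exact statements appear only in Figure 8.

```python
# Minimal sketch of loading the Iris dataset from a CSV file.
# The file name and column order are assumptions for illustration.
import pandas as pd

iris = pd.read_csv("iris.csv")      # 150 samples, 5 columns (hypothetical layout)
x = iris.iloc[:, 0:4].values        # sepal length/width, petal length/width (cm)
labels = iris.iloc[:, 4].values     # class names: setosa, versicolor, virginica
```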
The selected DNN architecture is the following:
Number of inputs: 4.
Total number of Layers: 4.
Number of Hidden Layers: 2.
Number of neurons per layer: 16 (input and hidden layers) and 3 (output layer).
Activation functions: ReLU (first three layers), sigmoid (output layer).
We modified the configuration implemented in [88]; our DNN architecture consists of one input layer, two hidden layers, and one output layer. The first layer is configured with sixteen neurons, each of the two hidden layers consists of sixteen neurons, and the output layer consists of three neurons (see Figure 7). The corresponding layer-definition statements configure the DNN model; see the code in Figure 8.
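A hedged sketch of this architecture, assuming the Keras Sequential API, is shown below; the exact statements used in this work are those of Figure 8.

```python
# Sketch of the DNN from Figure 7: 4 inputs, layers of 16, 16, 16, and 3 neurons,
# with ReLU on the first three layers and sigmoid on the output layer.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, activation="relu", input_shape=(4,)),  # input layer, 16 neurons
    Dense(16, activation="relu"),                    # hidden layer 1
    Dense(16, activation="relu"),                    # hidden layer 2
    Dense(3, activation="sigmoid"),                  # output layer, one neuron per class
])
```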
Neural network theory establishes many other considerations about DNN architectures, such as the correct activation functions and their dependence on the task performed (classification or regression), the number of hidden layers, and the number of neurons in each layer. The criteria for selecting a DNN architecture are based on the system type, the application, and the desired degree of prediction accuracy. This implementation does not address a detailed and specific analysis of the parameters and hyperparameters of the selected DNN or of the performance of hardware-constrained embedded systems under the selected architecture. However, the choice of an optimal DNN architecture plays a crucial role, and this analysis opens up exciting possibilities for future work addressing the optimization of TinyML algorithms.
We selected two activation functions: ReLU for the input layer and the two hidden layers, and sigmoid for the output layer.
Training the selected model on a computer. Using DL techniques, a training algorithm is programmed for the DNN. The training steps are executed on the computer, and the result is a file that contains the trained model. The training data (x_train, y_train) and testing data (x_test, y_test) are obtained using the code shown in Figure 8. As a good practice adopted from various reports in the state of the art [48,88], the "one-hot" labeling technique is used.
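Continuing the loading sketch above, the following lines illustrate one way to obtain the one-hot labels and the training and testing subsets; the concrete statements used in this work are those of Figure 8.

```python
# One-hot labelling and train/test split (sketch; x and labels come from the loading sketch).
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

codes = pd.factorize(labels)[0]               # class names -> integers 0, 1, 2
y = to_categorical(codes).astype("float32")   # e.g. "setosa" -> [1, 0, 0]
x = x.astype("float32")

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1)      # 80% training, 20% validation
```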
Figure 9 illustrates an example of the corresponding features (input data) and label matrix (output) for the DNN model obtained as a result of the code given in Figure 8. Note that the input matrix in Figure 6 is used for demonstration purposes in the explanation; for better understanding, the same matrix is presented again in Figure 9 together with the output labels in one-hot format.
Figure 10 depicts the code fragment for executing the training step, considering the architecture described in the code snippet shown in Figure 8. The hyperparameters and the number of epochs are set within the corresponding statements in Figure 10. Finally, the training parameterization is as follows:
Before launching the training process, the Iris dataset is divided into two groups. Figure 8 shows how the dataset is divided into 80% for training (x_train, y_train) and 20% for validation (x_test, y_test).
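Continuing the sketches above, the training step can be expressed as follows; the optimizer and loss shown here are assumptions, since the exact hyperparameters appear only in Figure 10.

```python
# Training sketch: hyperparameter choices are assumptions; the actual ones are in Figure 10.
model.compile(optimizer="adam",                   # hypothetical optimizer choice
              loss="categorical_crossentropy",    # matches the one-hot labels
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    epochs=100,                   # 100 training epochs, as in Figure 11
                    validation_data=(x_test, y_test))
```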
The graphical behavior of the DNN model during the training stage is shown in Figure 11, taking into account 100 training epochs. Figure 12 shows the training process on the computer using Python and TensorFlow.
An important issue when training an ANN with TensorFlow and Keras in Python is the numerical format of the dataset. In some cases, this format is float64 (64 bits), a data type that generates trained models with a large size. In this implementation, the data type of the Iris dataset was changed to float32, a 32-bit float type, with the statements x_train = x_train.astype('float32') and y_train = y_train.astype('float32'); thus, the trained model results in a smaller size without disregarding precision in the data, with the expectation that the accuracy of the model when making predictions does not decrease. Converting the Iris dataset from float64 to float32 also allows the compression process to achieve better results by reducing the size of the model file. It should be noted that performing this data type conversion directly on the Iris dataset before training does not constitute a quantization process; however, the possibility of applying quantization remains available.
Testing and compressing the model on a computer. As a first performance test, the classification processes through the DNN are tested on a computer using Python and Keras. However, the most important step is when the DNN is tested by performing the flower-type classification process within the embedded system.
Figure 13 depicts the portion of code in which the evaluation statement is executed to obtain the trained model's accuracy and the value of the loss function at the last epoch of the training process. The result in the Python IDLE output terminal is a loss of 0.6000 and an accuracy of 0.966666638.
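A minimal sketch of this evaluation step, continuing the sketches above, is shown below; the printed format and the H5 file name are assumptions.

```python
# Evaluate the trained model on the test data and save it as an H5 file (sketch).
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("loss:", loss, "accuracy:", accuracy)

model.save("exercise_model.h5")   # hypothetical file name for the trained model
```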
The final result of the training process is a file with the H5 extension that contains the trained model. Subsequently, the H5 file is compressed to a file with a tflite extension and then converted to a cc extension file. See Table 13, where the file compression steps and the resulting size in each step are summarized. The cc file is used in the methodology's next step. Obtaining the tflite file is an intermediate and necessary step, since it is not possible to directly convert the H5 file to the cc file.
However, although computers and some embedded systems with more powerful hardware, such as the Raspberry Pi 4 or FPGAs, are capable of deploying a DNN model trained with a tflite file, the tflite file is not yet suitable to be run by the three embedded systems under test (ESP32, Pi Pico, Arduino Nano BLE). For that reason, the H5-to-tflite-to-cc file conversion process is necessary. The code fragment to convert the file with the H5 extension to tflite is shown in Figure 14.
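A hedged sketch of this H5-to-tflite conversion with the TensorFlow Lite converter is given below; the file names are assumptions, and the exact statements are those of Figure 14.

```python
# Sketch: convert the trained H5 model to a tflite file (file names are assumptions).
import tensorflow as tf

model = tf.keras.models.load_model("exercise_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("exercise_model.tflite", "wb") as f:
    f.write(tflite_model)
```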
It should be noted that once the trained model file has been converted and compressed to a file format and size supported by the embedded system, a computer evaluation and testing step must be carried out again to verify that the accuracy of the model has not changed or that any change is not significant.
In addition, the model's performance is evaluated with the tflite file by introducing input features from the Iris dataset and verifying the flower-type prediction. The portion of code used to test, evaluate, and perform the prediction is shown in Figure 15.
As a result of the prediction, an output vector is obtained in the Python IDLE output terminal. The highest value of the output vector corresponds to the Virginica class; therefore, the predicted class is Virginica.
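As an illustration of this test, the sketch below loads the tflite file with the TensorFlow Lite interpreter and classifies a single feature vector; the file name and the example measurements are assumptions, and the class ordering follows the description given later for Figure 24.

```python
# Sketch: run one prediction with the tflite model (file name and sample are assumptions).
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="exercise_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

sample = np.array([[6.3, 2.9, 5.6, 1.8]], dtype=np.float32)  # sepal/petal measurements (cm)
interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()

output = interpreter.get_tensor(output_details[0]["index"])[0]
classes = ["Setosa", "Virginica", "Versicolor"]              # order as described for Figure 24
print(output, "->", classes[int(np.argmax(output))])
```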
The following step is converting the tflite file to cc using the instruction "!xxd -i exercise model.tflite - exercise model.cc" in Python, specifically in the Colab environment. Afterward, "!cat exercise model.cc" allows us to obtain an array of data in hexadecimal memory-address representation. The array is saved in a file with a "cc" extension, which is much more compact than the tflite file and can be recorded within the embedded systems that we consider (ESP32, Pi Pico, Nano 33 BLE) and supported in RAM. This whole procedure is shown in Figure 16.
Table 13 summarizes the process of obtaining the files that contain the trained DNN models, as well as the conversion and compression steps to which the model file is subjected for its adaptation in size and compatible format for execution on ESP32, Raspberry Pi Pico, and Arduino Nano 33 BLE Sense systems.
Figure 14 shows the statements through which the conversion of the H5 file to the tflite file is carried out, together with the statement that allows one to optimize the tflite file and quantize the model already trained and converted to a tflite file.
The previous instructions enable the quantization process in this compression step. Quantization is one of the main methods for reducing model size: it consists of converting the weights and activations of the 32-bit floating-point model to lower-precision values, typically 8-bit integers (int8). This process considerably reduces the space needed to store the model weights and prepares the model for the third step of the compression process: converting the tflite file to a cc file.
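A hedged sketch of how such post-training quantization can be configured with the TensorFlow Lite converter is shown below; the representative-dataset generator, the file names, and the use of x_train from the earlier sketches are assumptions and are not the exact statements of Figure 14.

```python
# Sketch: post-training int8 quantization (names and representative data are assumptions).
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("exercise_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enable weight quantization

def representative_data():
    # A few training samples let the converter calibrate the int8 ranges.
    for sample in x_train[:100]:
        yield [sample.reshape(1, 4).astype(np.float32)]

converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_quant_model = converter.convert()
with open("exercise_model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```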
An important aspect when creating a TinyML application is, as mentioned, the memory size that the trained model file will require, specifically in the RAM of the embedded system. As seen in Table 13, the tflite file has a size of less than 5000 KB; with this size, the model can be executed on a conventional computer or even on high-performance embedded systems such as the Raspberry Pi 4 or 5 and other equivalent embedded systems with RAM sizes greater than 10 MB. The ESP32, Pi Pico, and Arduino Nano 33 BLE have RAM larger than 200 KB, while the model with the cc extension has a size of 5.2 KB.
In the third step of the compression process, as in the H5-to-tflite conversion step, it is possible to carry out the conversion without quantization. This enables the use of the resulting model and its files for implementing the DNN on the selected embedded systems without the problem of a loss of prediction accuracy due to a quantization process; the statements for this compression step are shown in Figure 16. However, under another scenario that considers a larger dataset whose data type uses more memory space, a denser DNN architecture, and other training hyperparameters, there is a possibility that the file size of the trained model is not supported in the RAM of the embedded system that executes the DNN. In that case, data quantization will have to be applied after training, in the file compression stage, and an analysis of the possible loss of accuracy of the resulting model will have to be performed.
Programming, deploying, and testing the trained model on the embedded system.
The key hardware feature for a tiny embedded system to deploy ML models is mainly the RAM size, which must allocate and hold the compressed trained model, i.e., the cc file. The RAM size of the three embedded systems is smaller than 520 KB; this becomes a limitation for implementing TinyML with the tflite model file resulting from training, since in some cases the file size is up to 4000 KB. In contrast, the compressed file size in our experiment is 5.2 KB, as shown in Figure 16. Thus, it is important to show the relevant hardware features of the three embedded systems tested in our demonstrative implementation.
It should be noted that carrying out a training process with a larger dataset, with a greater amount of data and features, and with more demanding training parameters and hyperparameters will generate, as the final result of the training, a file that must be assigned a larger space in RAM in order to be executed.
However, it is reasonable to assume that training with a more complete dataset, a DNN architecture with a greater number of layers and neurons, and more demanding training parameters and hyperparameters will generate trained models with a higher degree of prediction accuracy. One of the most outstanding challenges in the field of TinyML is to achieve more accurate predictions with a given embedded system using more optimized, compact files that allow each embedded system to execute a TinyML application in real time without imposing high requirements on the size of the SRAM.
The analysis and discussion of file compression techniques for trained models used in TinyML applications is a broad topic addressed within the research line on the optimization of embedded systems in the field of artificial intelligence, with works such as [21,32,33,48,56,57,58,63,75,81,91] addressing the topic and bridging the gap for other related research in this field.
However, the main objective of the implementation in this work is not to show the results of executing the DNN with models optimized using different compression techniques on hardware-constrained embedded systems. Likewise, when presenting Table 14, Table 15, and Table 16, emphasis is placed on compliance with the SRAM size requirement of each system, so that increasingly accurate trained models can be supported without exceeding the available memory sizes and while allowing each system to operate in real time.
It is a fact that training the DNN with a greater amount of data will generate a model with a larger file size; thus, performing a compression process on the model file without compromising the degree of accuracy in the prediction was a challenge achieved and presented in this implementation.
Table 14 shows important hardware features that must be considered for implementing TinyML algorithms. Other considerations taken into account to select the embedded systems shown in Table 14 are mentioned below. Although the Pi Pico and the ESP32 are dual-core systems, it is not necessary to carry out the implementation using both cores to distribute the processing; in this way, the comparison between the three embedded systems is fair, since the TinyML deployment is performed on a single core.
On the other hand, the operating frequencies of the CPU SoCs could generate differences in the execution times in each case. This aspect is explored in Table 15. As can be seen in Table 14, the RAM size of the Pi Pico and the Arduino Nano 33 BLE does not exceed 260 KB, and that of the ESP32 does not exceed 520 KB, which is a key aspect considering that a TinyML implementation requires the embedded system to have sufficient memory to allocate and host the trained DNN model file, which can have a size of up to 20 KB.
Figure 17 and Figure 18 describe important hardware details of the ESP32, Raspberry Pi Pico, and Arduino Nano 33 BLE Sense. Figure 19 summarizes the TinyML implementation process on the Arduino IDE platform, considering the Iris dataset and the compressed hexadecimal model content in the data.h instance.
The data.h instance is a fragment of code that exclusively contains the hexadecimal data array from the cc file obtained in Figure 16 and the size in bytes that it occupies in the RAM of the embedded system.
Figure 20 indicates the libraries to be called in Arduino IDE and the invocation function for Tensor Arena.
The tensor-arena statement (see Figure 21) assigns space in the RAM of the embedded systems selected in this review. This instruction is not required in systems that do not have hardware restriction problems, such as FPGAs and computers.
The main function programmed within the ESP32, Raspberry Pi Pico, and Arduino Nano 33 BLE Sense to deploy the DNN appears in Figure 21. As mentioned above, we used the Arduino IDE, which supports the development of embedded software in C++.
3.2. Results Deploying the DNN in Embedded Systems
Our example implementation is based on the work published in [88]. To carry out the implementation test on the ESP32, Pi Pico, and Arduino Nano 33 BLE systems, it is necessary to place within the Arduino IDE a code snippet with the same data-array content as the exercice model.cc file shown in Figure 16.
Figure 22 presents a portion of exercice model.cc, particularly the first three lines, with the statements and syntax required by the Arduino IDE environment. The file is a more extensive data array of about 380 lines of hexadecimal data, which is copied directly from the cc file obtained in Figure 16 using Colab and pasted into the data.h instance. An important aspect of achieving the execution of the DNN within the ESP32, Pi Pico, and Nano 33 BLE is the size in bytes of the exercice model.cc file, which is approximately 5.2 KB. The RAM sizes of our devices are 512 KB for the ESP32, 264 KB for the Pi Pico, and 256 KB for the Nano 33 BLE. Thus, the Iris-dataset-based DNN implementation using the cc file for flower classification is supported on the three embedded systems.
An example of the input dataset to the DNN executing within our embedded systems (ESP32, Pi Pico, and Nano 33 BLE) is shown in Figure 23. Through the output terminal of the Arduino IDE, it is possible to interact with the three embedded systems, provide the input data for the DNN, and verify the result of the flower classification task performed by the DNN algorithm, whose trained model was obtained and compressed on the computer (see the following steps of our methodology: training the selected model on a computer, and programming, deploying, and testing the trained model).
Figure 24 shows the results of using the ESP32, Pi Pico, and Nano 33 BLE to predict types of flowers using the Iris dataset. The visualization of the results was performed through the output terminal of the Arduino IDE. The information in Figure 24 is interpreted as follows: when the model is deployed on the embedded system, a set of three values forming an output vector is obtained, and each value of the output vector corresponds to a weight assigned to each type of flower; the output vectors are depicted in the "Prediction-ESP32" column in Figure 24. The first component of each vector corresponds to Setosa, the second to Virginica, and the third to Versicolor. The highest weight in each output vector indicates the flower-type prediction given by the DNN model. In addition, the "Real Label" column (see Figure 24) shows the correct classification.
For example, the first prediction vector in the "Prediction-ESP32" column in Figure 24 is [−21.06, −5.13, −2.25]. The third weight (−2.25) is the highest, which indicates that the type of flower identified by the DNN from the input data is Versicolor. Note also that in the "Real Label" column, a "1" is activated in the third position, which indicates that the actual type of flower is also Versicolor. Therefore, the prediction made by the DNN for the first output vector provided a correct flower classification. The same interpretation applies to the rest of the table shown in Figure 24.
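This interpretation amounts to taking the index of the largest component of the output vector; a minimal worked example in Python, using the values quoted above, is shown below.

```python
# Worked example: interpret one output vector from Figure 24
# (class order as stated in the text: Setosa, Virginica, Versicolor).
import numpy as np

output = [-21.06, -5.13, -2.25]
classes = ["Setosa", "Virginica", "Versicolor"]
predicted = classes[int(np.argmax(output))]
print(predicted)   # "Versicolor": the third weight, -2.25, is the largest
```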
It is necessary to mention that the processing speeds of the systems differ because they have different hardware characteristics; see Table 14. Table 15 shows the time, in milliseconds, that each embedded system took to deploy the DNN using the same trained model and the same input data.
The processor clock frequency and the memory size are very important features to consider. In the case of 32- and 64-bit architectures, the resources available on the device do not represent a problem, since the implementation results have shown that they are sufficient for TinyML applications. For hardware-constrained embedded systems with an 8-bit architecture, the processor clock frequency does not represent a limitation for deploying TinyML algorithms; however, the RAM size still represents a significant problem. To face the problem of RAM size limitation, the TinyML community turns its attention to the optimization and compression of the files used in TinyML, such as trained models, to achieve optimal implementations on 8-bit architectures.
Table 15 shows, more specifically, the latency of the algorithm from the moment the trained model is invoked until a prediction is made with a vector of input features. Likewise, the energy consumed by each embedded system while running the algorithm continuously is presented, together with the RAM allocated for invoking the trained model and the results of predicting the flower type.
As described in Table 16, the Raspberry Pi Pico has lower latency and lower power consumption than the ESP32 and Arduino Nano BLE when the embedded systems deploy the DNN in real time. However, it is possible that for DNNs with more complicated and larger models, the performance of the Raspberry Pi Pico will begin to degrade compared to the ESP32 and Arduino Nano BLE, owing to the hardware resources of each system. The analysis of the performance of each embedded system when executing DNNs with more complex models for classification and regression can be addressed in depth in future work.
It should be mentioned that the implementation only shows the execution of the DNN on the embedded systems without analyzing its accuracy when performing classifications with inputs different from those contained in the Iris dataset. The optimization of the model starting from training is not analyzed, although improvements can be achieved throughout the process by choosing another dataset for training, changing the architecture of the DNN, or modifying training parameters and hyperparameters. The performance of the real-time DNN executed on hardware-constrained embedded systems when the training and the model's characteristics are modified can be analyzed in future work.