Article

Benchmarking In-Sensor Machine Learning Computing: An Extension to the MLCommons-Tiny Suite

by Fabrizio Maria Aymone and Danilo Pietro Pau *
System Research and Applications, STMicroelectronics, Business Center Colleoni, Building Andromeda 3, at the 7th Floor, Via Cardano 20, 20864 Agrate Brianza, Italy
* Author to whom correspondence should be addressed.
Information 2024, 15(11), 674; https://doi.org/10.3390/info15110674
Submission received: 28 August 2024 / Revised: 3 October 2024 / Accepted: 22 October 2024 / Published: 28 October 2024

Abstract

This paper proposes a new benchmark specifically designed for in-sensor digital machine learning computing to meet an ultra-low embedded memory requirement. With the exponential growth of edge devices, efficient local processing is essential to mitigate the economic costs, latency, and privacy concerns associated with centralized cloud processing. Emerging intelligent sensors, equipped with computing assets to run neural network inferences and embedded in the same package that hosts the sensing elements, present new challenges due to their limited memory resources and computational capabilities. This benchmark evaluates models trained with Quantization Aware Training (QAT) and compares their performance with Post-Training Quantization (PTQ) across three use cases: Human Activity Recognition (HAR) by means of the SHL dataset, Physical Activity Monitoring (PAM) by means of the PAMAP2 dataset, and superficial electromyography (sEMG) regression with the NINAPRO DB8 dataset. The results demonstrate the effectiveness of QAT over PTQ in most scenarios, highlighting the potential for deploying advanced AI models on highly resource-constrained sensors. The INT8 versions of the models always outperformed their FP32 counterparts in terms of memory and latency reductions, except for the activations of the CNN. The CNN model exhibited reduced memory usage and latency with respect to its Dense counterpart, allowing it to meet the stringent 8 KiB data RAM and 32 KiB program RAM limits of the ISPU. The TCN model proved to be too large to fit within the memory constraints of the ISPU, primarily due to its greater capacity in terms of number of parameters, designed for processing more complex signals like EMG. This benchmark aims to guide the development of efficient AI solutions for In-Sensor Machine Learning Computing, fostering innovation in the field of Edge AI benchmarking, such as that conducted by the MLCommons-Tiny working group.

Graphical Abstract

1. Introduction

The number of edge devices is increasing at an exponential rate, with the global Internet-of-Things (IoT) market expected to reach over a trillion U.S. dollars by the end of 2024 [1]. In such a context, centralized cloud processing, where data streamed from edge devices are sent to and processed on remote servers, poses significant challenges due to the economic costs and latency associated with wireless communication, particularly exacerbated with the advent of generative artificial intelligence (AI). On the other hand, edge processing of information safeguards privacy while enhancing responsiveness and scalability and reducing energy consumption. Over the past decade, advancements in AI and machine learning (ML) [2] have surpassed human capabilities in various domains, including computer vision, automatic speech recognition (ASR), and Natural Language Processing (NLP). Harnessing the power of AI at the edge has become increasingly compelling [3], giving rise to the recent trend of “Edge AI” and fostering research communities such as the one represented by the TinyML Foundation. Deploying neural networks (NNs) on Micro Controller Units (MCUs) opens up a plethora of always-on, battery-powered edge applications ranging from industry (e.g., predictive maintenance) to smart homes, consumer, wellness and wearable devices (e.g., keyword spotting), robotics, and automation. These MCUs are typically resource-constrained, with read-only memory (ROM) and random-access memory (RAM) usually under 1 MB, and require power-efficient inference under one milliwatt (mW). To meet such tight memory requirements, researchers have handcrafted NNs and compressed their model size using techniques such as quantization [4,5], pruning [6], and knowledge distillation [7], while preserving accuracy. In this configuration, data are collected by the sensors and transmitted to the MCU where AI inferences run. In always-on edge applications, this demands continuous communication between the sensor and the operating MCU, potentially leading to increased energy consumption. To mitigate this, hardware manufacturers have recently begun integrating small digital signal processing (DSP) units directly into the same packages that host the sensing elements, bypassing the communication bottleneck and allowing the MCU to enter sleep mode, waking up only when further processing is required. Unfortunately, little attention has been given to characterizing the performance of such sensors running NN workloads, a gap this manuscript aims to address.

2. Key Contributions of This Work

The super integration of ML computing into the sensor package (also known as In-Sensor ML Computing) is still in its infancy; however, some sensors on the market already implement it, such as STMicroelectronics’ Intelligent Sensor Processing Unit (ISPU (https://www.st.com/content/st_com/en/campaigns/ispu-ai-in-sensors.html, accessed on 21 October 2024)) and Machine Learning Core (MLC (https://github.com/STMicroelectronics/STMems_Machine_Learning_Core, accessed on 21 October 2024)). Such sensors shrink the embedded memory by more than 10 times with respect to that integrated into MCUs, posing new challenges to machine learning developers. This emerging and challenging context would benefit from a common benchmark to evaluate associated hardware and software solutions. Therefore, the contributions of this work are summarized as follows:
  • To propose a benchmark to target in-sensor ML computing constrained by an ultra-low memory footprint.
  • To introduce ML models trained with Quantization Aware Training (QAT) and compare their performance with Post-Training Quantization (PTQ).
  • To include three different use cases: Human Activity Recognition (HAR), Physical Activity Monitoring (PAM), and superficial electromyography (sEMG).

3. Related Works

3.1. Tools and Techniques in TinyML

The tiny memory budget of MCU-class devices, typically based on ARM Cortex-M cores, has prompted engineers to develop tools and techniques to compress NNs. Pruning [6,8,9] consists of selectively eliminating specific weights/activations, exploiting the resulting model sparsity to accelerate inference execution (a minimal during-training example is sketched after the list below). Pruning can be divided into three classes according to when it is applied:
  • Before training (PBT) [10,11,12];
  • During training (PDT) [6,13,14];
  • After training (PAT) [15,16,17].
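As a concrete illustration of the during-training (PDT) class, the sketch below applies magnitude pruning with the TensorFlow Model Optimization toolkit. This is not necessarily the tooling used by the cited works; the sparsity schedule values and the training setup are illustrative assumptions only.

```python
# Hypothetical during-training (PDT) magnitude-pruning sketch using the
# TensorFlow Model Optimization toolkit; schedule values are arbitrary examples.
import tensorflow as tf
import tensorflow_model_optimization as tfmot


def make_pruned(model: tf.keras.Model, end_step: int) -> tf.keras.Model:
    # Gradually raise sparsity from 0% to 50% of the weights during training.
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=end_step)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
    pruned.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
    return pruned


# During fit(), the UpdatePruningStep callback must be passed so the pruning
# masks are updated at each training step, e.g.:
# pruned.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```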
Knowledge distillation (KD) [7,18] is another often-used compression technique. This approach exploits the internal representations generated by a larger, complex neural network, named the teacher, to train a tinier, compact network, named the student. By transferring knowledge from the teacher to the student, KD allows for substantial reductions in a model’s footprint and computational complexity, while retaining most of the original model’s accuracy and performance. The most widely adopted technique for memory optimization in NNs is quantization. Quantization [19,20,21] involves reducing the bit precision of the weights and activations, typically from 32-bit floating-point (FP32) representations to lower-bit integer representations, such as 16 bits, 8 bits, or even down to a binary level [22]. Quantization can be divided into two categories according to when it is applied:
  • Post-Training Quantization (PTQ);
  • Quantization Aware Training (QAT).
PTQ consists of applying the quantization process as a post-processing step after the training of the model has been completed. For example, TensorFlow Lite (TFLite) [23], a popular tool that is part of the TensorFlow deep learning framework [24], supports PTQ for various precisions, including INT8. With QAT, on the other hand, models are trained with reduced precision, allowing the network to learn in a quantized environment from the start. This approach often leads to more robust quantized models (also against adversarial attacks [25]), as the network is optimized for lower precision throughout training. QKeras [26,27,28], a drop-in API replacement extending the Keras framework, was designed for quantization and supports only QAT, offering a wide range of layers and activation functions optimized for quantization. QKeras allows users to create heterogeneously quantized integer or fractional deep learning models by simply replacing the Keras layers of the original model with the corresponding QKeras versions.
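To make the PTQ/QAT distinction concrete, the sketch below shows how an FP32 Keras model could be converted to INT8 with the TFLite converter (PTQ) and how a Keras Dense block could be replaced by its QKeras counterpart for QAT. The function names and quantizer bit widths are assumptions for illustration, not the exact configuration used in this paper.

```python
# Minimal sketch contrasting PTQ (TFLite converter) with QAT (QKeras layers).
import tensorflow as tf
from qkeras import QDense, QActivation, quantized_bits, quantized_relu


def ptq_int8(keras_model, representative_samples):
    """Post-Training Quantization: full-integer INT8 conversion with TFLite."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: (
        [s[None, ...].astype("float32")] for s in representative_samples)
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized INT8 .tflite flatbuffer


def qat_dense_block(x, units):
    """Quantization-Aware Training: drop-in QKeras replacement for Dense + ReLU."""
    x = QDense(units,
               kernel_quantizer=quantized_bits(8, 0, alpha=1),
               bias_quantizer=quantized_bits(8, 0, alpha=1))(x)
    return QActivation(quantized_relu(8))(x)
```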

3.2. Industry Benchmarks

There are several industry benchmarks in the ML community that target tiny devices belonging to the MCU class. Nevertheless, to the best of the authors’ knowledge, there is no standard industrial benchmark targeting in-sensor ML computing with RAM and ROM on the order of a few tens of kibibytes (KiB). A few benchmarks for TinyML are reported here. MLCommons [29] is an AI engineering, community-driven association whose objective is to create open benchmarks to measure the performance of ML systems. The MLCommons Benchmark suite allows the evaluation of DL models across a broad range of hardware platforms, ranging from data centers to edge devices. In particular, the MLCommons Tiny Benchmark [30] is designed to target MCUs with memory sizes on the order of hundreds of KiB. It encompasses four use cases:
  • Keyword Spotting (KWS), which evaluates the ability to recognize specific keywords from audio input, such as voice commands in smart devices;
  • Visual Wake Words (VWW), which assesses the ability to detect the presence of a person in an image, often used in smart cameras or other visual recognition tasks;
  • Image Classification (IC), which consists of recognizing the object depicted by the image;
  • Anomaly Detection (AD) of machine faults from audio recordings, crucial for applications like predictive maintenance.
The last use case (AD) adopted an Auto-Encoder (AE) Fully Connected NN, while the others used convolutional neural networks (CNNs).
The Embedded Microprocessor Benchmark Consortium (EEMBC) is an organization that creates industry-standard benchmarks for evaluating the performance of embedded systems in terms of latency and energy, with applications in IoT, autonomous driving, mobile devices, and others. EEMBC CoreMark [31] is a widely accepted standard for measuring MCU performance on basic computational tasks; however, it does not include ML inference workloads. MLMark [32] is another EEMBC benchmark, focused on evaluating ML inference workloads. The models it adopts, however, are too memory-demanding to fit into a common MCU-class device, which makes MLMark unrepresentative of typical TinyML workloads.

4. Proposed Benchmark

The benchmark this paper proposes targets three different use cases for in-sensor DSP. As in-sensor DSP is severely memory-constrained, the requirements for a suitable solution are set out below. The use cases are then described together with the choice of the models. Lastly, the models’ performance, size, and latency are reported. The model architectures are precisely reported in Table 1, where shape denotes the output dimension of the layer. After each layer, ReLU was used as the activation function; before the activation function, Batch Normalization [33] was always applied. The shape of the last layer of the Dense and CNN models depends on the dataset considered (8 for SHL, 12 for PAMAP2). The code implemented in this study is open-source and publicly available at [34].
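For reference, a minimal Keras reconstruction of the Dense topology in Table 1 is sketched below, assuming the Dense → Batch Normalization → ReLU ordering described above; details such as the optimizer are illustrative assumptions, not the exact training configuration of this work.

```python
# Hypothetical reconstruction of the Dense model of Table 1 (24 x 6 IMU window).
import tensorflow as tf
from tensorflow.keras import layers


def build_dense_model(num_classes: int = 8) -> tf.keras.Model:
    """num_classes = 8 for SHL, 12 for PAMAP2."""
    inputs = tf.keras.Input(shape=(24, 6))        # 24 time steps x 6 IMU channels
    x = inputs
    for units in (128, 64, 128):                  # Dense 24x128, 24x64, 24x128
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.Flatten()(x)                       # 24 * 128 = 3072
    x = layers.Dense(32)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_dense_model(num_classes=12)         # e.g., PAMAP2
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```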

4.1. Requirements

The ISPU [35] is probably one of the most industry-mature products available for in-sensor ML computing. As such, its technical specifications may serve as the reference for this proposed benchmark. The ML DSP processor, super integrated into the package together with the accelerometer and gyroscope sensing elements, features 8 KiB of data SRAM and 32 KiB of program SRAM, which sets the target hardware for the solutions evaluated by this benchmark. When it comes to latency and power consumption, it is difficult to set hard constraints. The required latency often depends on the specific needs of the application, while accurate power consumption estimates are challenging due to various tradeoffs, e.g., reducing latency can lead to higher power consumption and vice versa. These factors must be carefully balanced to meet the unique requirements of different applications. Lastly, the benchmark shall take into account the precision of the data processed by the sensor, which is typically INT16, reflecting the bit width of the accelerometer and gyroscope outputs.
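A minimal sketch of the resulting feasibility check is given below, assuming INT8 tensors (1 byte per value) and neglecting library code, mirroring the simplification adopted later in Section 4.3; the helper name and byte counts are assumptions for illustration.

```python
# Illustrative check of a candidate model against ISPU-style memory budgets.
DATA_RAM_KIB = 8       # holds activations
PROGRAM_RAM_KIB = 32   # holds weights (library/code overhead neglected here)


def fits_ispu(num_params: int, peak_activation_values: int,
              bytes_per_value: int = 1) -> bool:
    """Return True if an INT8 model roughly fits the ISPU memory budgets."""
    weights_kib = num_params * bytes_per_value / 1024
    activations_kib = peak_activation_values * bytes_per_value / 1024
    return weights_kib <= PROGRAM_RAM_KIB and activations_kib <= DATA_RAM_KIB


# e.g., a ~6.5k-parameter CNN with ~4.3 KiB of peak INT8 activations fits:
print(fits_ispu(num_params=6_500, peak_activation_values=4_400))   # True
```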

4.2. The Use Cases

The selection of use cases was driven by the principles of representativeness and innovation. The SHL and PAMAP2 datasets were chosen because they are well-established benchmarks in the ML community, commonly used for Human Activity Recognition and Physical Activity Monitoring, respectively. These datasets consist of accelerometer and gyroscope data, the typical outputs of MEMS IMUs, which are the sensing elements prevalent in commercially available ISPU devices. As such, these data types are ideal for exploring the capabilities of in-sensor ML computing benchmarks.
For all three use cases, a five-fold cross-validation approach was adopted to ensure robust evaluation. The data across all datasets were initially in FP32 format. Where applicable, the Full-Scale Range (FSR) and sensor precision were used to guide the quantization process, converting the data to INT16. Specifically, the sensor precision (i.e., the minimum measurable value) was derived by dividing the FSR of the sensor by the maximum number achievable at the bit precision of the sensor data (e.g., 2^16 − 1 for INT16). The FP32 data were then divided by the sensor precision and rounded to the nearest integer to obtain INT16 values. In cases where the FSR was not provided, such as with the SHL dataset, quantization was performed based on the maximum absolute value recorded; a minimal sketch of this conversion is given below. Although all three datasets featured data from multiple users, we opted to focus on data from a single user to streamline the training process and reduce computational overhead. The main objective of this study was to explore and benchmark the feasibility and performance of in-sensor ML computing on representative sensor data, rather than to address inter-subject variability. By limiting the scope to a single user, the training time was significantly reduced without compromising the validity of comparisons across different algorithms and datasets. Nevertheless, relying on a single user’s data may introduce the risk of overfitting, potentially limiting the generalizability of the results obtained. This aspect should be considered a limitation of this study.
In addition, to push the boundaries of current applications and spark interest in new domains, a novel data type was also included. Biopotential electrode sensors, which are integral to a wide range of time-sensitive applications, stand to benefit significantly from localized ML processing. These sensors are used in diverse fields, from biomedical applications to augmented and virtual reality. To represent this category, the NINAPRO DB8 dataset, which features electromyography (EMG) data collected from the forearm to estimate finger positions, was incorporated.
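The conversion described above can be summarized with the following sketch, where the example FSR value and the fallback scaling for datasets without an FSR (such as SHL) are assumptions reflecting one plausible reading of the procedure.

```python
import numpy as np


def quantize_to_int16(x_fp32, fsr=None):
    """Convert FP32 sensor samples to INT16 using the sensor precision."""
    if fsr is not None:
        precision = fsr / (2**16 - 1)            # minimum measurable increment
    else:
        # SHL-style fallback: scale by the maximum absolute value recorded.
        precision = np.abs(x_fp32).max() / (2**15 - 1)
    q = np.rint(x_fp32 / precision)
    return np.clip(q, -32768, 32767).astype(np.int16)


# e.g., samples quantized with an assumed FSR of 16 (sensor units):
samples = np.array([0.012, -0.75, 1.5], dtype=np.float32)
print(quantize_to_int16(samples, fsr=16.0))
```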

4.2.1. SHL Dataset

The first use case of the benchmark is Transportation Mode and Human Activity Recognition (HAR). The dataset considered is the Sussex-Huawei Locomotion (SHL) dataset [36,37] (preview version), which was collected over a period of 7 months in 2017 by three participants engaging in eight different modes of transportation in real-life settings. The dataset includes multi-modal data from smartphones carried at typical body locations [36]. Nevertheless, in the benchmark proposed by this paper, only the data from the three-axis gyroscope and three-axis accelerometer were used, and only one user was considered. The dataset was shuffled and split with 20% test data. The classes to be predicted were the eight transportation modes (Still, Walking, Run, Bike, Car, Bus, Train, Subway), and two different NN architectures were adopted: Dense and CNN.

4.2.2. PAMAP2 Dataset

The Physical Activity Monitoring Dataset (PAMAP2) [38,39] features information on 18 distinct physical activities, ranging from walking and cycling to playing soccer. The dataset was collected from nine individuals, each equipped with three inertial measurement units (IMUs) and a heart rate monitor. In this case as well, only data from the three-axis gyroscope and three-axis accelerometer were used, and only one user was considered. The dataset was shuffled and split with 20% test data. It comprised 12 classes (lying, sitting, standing, walking, running, cycling, Nordic walking, ascending stairs, descending stairs, vacuum cleaning, ironing, rope jumping), and the NN architectures adopted were the same as for the SHL dataset. The number of neurons in the output layer of the NNs was different from the SHL dataset, as the number of classes in the two datasets differs.

4.2.3. NINAPRO DB8 Dataset

The NINAPRO DB8 dataset [40] is part of the Ninapro project, a comprehensive series of datasets aimed at providing benchmark data for the study and development of algorithms related to hand movement recognition and prosthetic control. DB8, specifically, was designed for estimating finger movement using superficial electromyography (sEMG) and inertial measurement unit (IMU) data. The dataset comprises data from 12 participants, 10 of whom were able-bodied, while 2 were right-hand trans-radial amputees. EMG and IMU data were collected from the right forearm, while the corresponding kinematic data were extracted from a data glove worn on the contra-lateral hand. The data collection involved 16 active double-differential wireless sensors placed on the forearm, capturing EMG signals and nine-axis IMU data (from accelerometers, gyroscopes, and magnetometers). The dataset also included data from a Cyberglove 2 data glove, worn on the left hand, providing kinematic data on finger movements with 18 degrees of freedom. These data were reduced to five degrees of freedom by multiplication with the projection matrix indicated in [40], as sketched below. These five degrees of freedom were the ground truth to be predicted on the basis of the EMG signals. Only the EMG signals were utilized, as the focus was on EMG data, with IMU data already being addressed in the other two use cases. Each subject’s dataset consisted of three separate acquisitions, with the authors recommending the last acquisition for testing. In this benchmark, Acquisitions 1 and 2 were combined and used for training, while Acquisition 3 was reserved for testing. As with the previous cases, only data from User 1 were included. The model employed for this benchmark was TEMPONet (Temporal Embedded Muscular Processing Online Network) [41], a Temporal Convolutional Network (TCN) [42] designed specifically for sEMG classification. This model was chosen based on the description in [43]. Since the original code for the neural network in [43] was not publicly available, the network was re-implemented from scratch in this work using the architecture details outlined in that paper. Unlike the original implementation, this benchmark does not use the Exponential Moving Average (EMA) in post-processing, as this step was considered outside the scope of the benchmark and not part of the NN architecture.
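The dimensionality reduction of the glove kinematics amounts to a single matrix multiplication, sketched below; the matrix P here is only a zero placeholder standing in for the actual 18 × 5 projection matrix published in [40].

```python
import numpy as np

# Placeholder for the 18 x 5 projection matrix from [40]; the real values must
# be taken from that publication, this zero matrix only illustrates the shapes.
P = np.zeros((18, 5), dtype=np.float32)


def reduce_dof(glove_angles: np.ndarray) -> np.ndarray:
    """Map an (N, 18) array of Cyberglove joint angles to (N, 5) target angles."""
    return glove_angles @ P
```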

4.3. Model Performance, Size and Latency

All the models considered by this benchmark are provided in three versions. The first is full-precision FP32, obtained by five-fold cross-validation training on the training set using Keras-TensorFlow [24]. The second is INT8, obtained through PTQ using TFLite [23]. The third is also INT8, obtained through QAT using the QKeras [26,27,28] framework. The ML workflow described above is illustrated in Figure 1. As five-fold cross-validation was adopted, five different models were saved for each NN topology. For SHL and PAMAP2, where the IMU sensors have different carry positions, a separate model was trained for each position.
Table 2 reports the average accuracy (over the five folds) of the Keras, TFLite, and QKeras versions of the Dense and CNN models on the SHL and PAMAP2 datasets. A clear trend emerges: Keras almost always performs best, followed by QKeras, while TFLite tends to perform worst. In only one case, i.e., CNN on PAMAP2 with the hand as the carry position, QKeras performed worse than TFLite, by 1.6%. The reasons behind this result will be the subject of further investigation. Nevertheless, the wider range of gyroscope and accelerometer data produced by hand movements, compared with the ankle and chest, could play an important role, since the quantization of activations and weights is sensitive to large value ranges. Despite this corner case, the reported results confirm the superiority of QAT over PTQ in terms of accuracy. Input data standardization was omitted to simplify the pipeline, as the models were intentionally designed to process raw sensor signals directly without any preprocessing steps. Future work could explore incorporating preprocessing techniques to assess their impact on model performance.
Concerning NINAPRO DB8, the regression performance of the TCN was tested with the same metrics adopted in the original paper [43], namely the mean absolute error (MAE) and the accuracy at 10° and 15°. The latter counts a prediction as correct when it falls within 10° or 15°, respectively, of the ground truth. The metrics and their standard deviations are reported in Figure 2. In particular, the Keras, TFLite, and QKeras versions of the TCN show a MAE of 8.44°, 8.88°, and 9.4° with standard deviations of 0.42°, 0.41°, and 0.07°, respectively; an accuracy at 10° of 78.6%, 76%, and 77.9% with standard deviations of 0.003%, 0.021%, and 0.001%, respectively; and an accuracy at 15° of 87.5%, 86.6%, and 85.8% with standard deviations of 0.009%, 0.007%, and 0.002%, respectively. To assess the statistical significance of these results, a one-way ANOVA was performed for all three metrics. The analysis yielded F(2, 12) values of 8.6 for the MAE, 5.03 for accuracy at 10°, and 6.8 for accuracy at 15°, corresponding to p-values of 0.005, 0.026, and 0.01, respectively, confirming statistical significance. In contrast with the macro-trend observed on SHL and PAMAP2, QKeras performed worse than TFLite by approx. 0.5° in MAE and by less than 1% in accuracy at 15°. Nevertheless, the under-performance of QKeras relative to TFLite is marginal, and QKeras still performed better than TFLite on one metric, the accuracy at 10°.
The results could be further improved by applying EMA post-processing, as in [43]; however, as it is not part of the TCN topology, it has not been considered by this benchmark.
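For clarity, the regression metrics used above can be computed as in the following sketch of the MAE and the within-threshold accuracy; the function names are chosen here purely for illustration.

```python
import numpy as np


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error, in degrees."""
    return float(np.abs(y_true - y_pred).mean())


def accuracy_at(y_true: np.ndarray, y_pred: np.ndarray, threshold_deg: float) -> float:
    """Fraction of predictions within `threshold_deg` degrees of the ground truth."""
    return float((np.abs(y_true - y_pred) <= threshold_deg).mean())


# e.g., acc10 = accuracy_at(y_true, y_pred, 10.0); acc15 = accuracy_at(y_true, y_pred, 15.0)
```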
The NN architectures adopted in this benchmark comprise three models: Dense, CNN, and TCN. To benchmark their size and inference latency, two versions are distinguished: FP32 (Keras) and INT8 (TFLite and QKeras).
The memory footprint of the NNs was divided into two contributions, activations and weights, as they reside in two different types of memory: data RAM (8 KiB) and program RAM (32 KiB), respectively. There are also contributions due to the software library footprints; however, these were neglected as they are software- and hardware-dependent, which would undermine the generalizability of this benchmark. To calculate activation and weight memories, the Keras and TFLite versions of the NNs were profiled with the latest version of the ST Edge AI Unified Core Technology tool, freely available online (https://stm32ai-cs.st.com/home, accessed on 21 October 2024). Lastly, the inference latency was measured with the same tool on the STM32U5 MCU. The results are reported in Figure 3, where the ISPU’s data RAM (8 KiB) and program RAM (32 KiB) limits are highlighted. Note that the PAMAP2 version of the Dense and CNN models was used, as it has more output neurons than the SHL version (12 vs. 8) and therefore represents the larger size and latency.
The activations’ memory occupies 18 KiB (FP32) and 9 KiB (INT8) for the Dense model; 1.31 KiB (FP32) and 4.31 KiB (INT8) for the CNN model; and 17.94 KiB (FP32) and 17.13 KiB (INT8) for the TCN model. The weights’ memory footprint encompasses 465 KiB (FP32) and 114 KiB (INT8) for the Dense model; 26.3 KiB (FP32) and 6.84 KiB (INT8) for the CNN model; and 492 KiB (FP32) and 197 KiB (INT8) for the TCN model. Lastly, the inference latency is 23.17 ms (FP32) and 10.78 ms (INT8) for the Dense model; 2.231 ms (FP32) and 1.54 ms (INT8) for the CNN model; and 143.3 ms (FP32) and 111.8 ms (INT8) for the TCN model. The INT8 versions of the models always outperformed their FP32 counterparts in terms of memory and latency reductions, except for the activations of the CNN. This result is probably attributable to the various mechanisms and optimizations that occur at the code-generation level in the employed tool. For the Dense model, INT8 halves the activations’ memory, whereas for the TCN the memory is left almost untouched; the reasons could again lie in the factors mentioned previously. Regarding weights, the CNN model achieves approximately a 20× reduction with respect to the Dense model and the TCN architecture, which has a weight footprint similar to Dense. This is due to the higher number of parameters involved in a Dense layer compared to a Convolutional one. As a matter of fact, the Dense model has 116 thousand parameters, the TCN 127 thousand, and the CNN only 6.5 thousand. The TCN also uses Convolutional layers; however, it is far deeper than the CNN architecture. This aspect emerges prominently in the latency analysis: the TCN requires approx. 10× more time than the Dense model and 100× more time than the CNN model. While the Dense and CNN models have the same depth, the CNN requires 10× less time. This can be attributed to several factors, which depend on software and hardware optimizations. Considering the limits imposed by the ISPU [35] sensor, only the CNN architecture fulfilled them. Additionally, considering the overhead of software libraries, only the INT8 version of the CNN met the requirements. Its latency was measured directly on the ISPU hardware, resulting in 128.5 ms. The higher latency compared with the values measured on the STM32U5 can be attributed to differences in hardware performance and clock frequency.
This indicates that the CNN should be preferred over the Dense model, as its number of parameters is drastically lower, enabling it to fit within the tight memory requirements of the sensor. The TCN is still not able to fit within the ISPU requirements, as it is intrinsically deep due to the complex nature of its task. A rough estimate of the weight footprints from the parameter counts is sketched below.
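This back-of-the-envelope sketch assumes 4 bytes per parameter for FP32 and 1 byte per parameter for INT8; it is a lower-bound estimate that ignores quantization metadata, tensor alignment, and code size, so it does not exactly match the profiled figures reported above.

```python
# Rough weight footprints derived from the parameter counts in Section 4.3.
PARAMS = {"Dense": 116_000, "TCN": 127_000, "CNN": 6_500}

for name, n in PARAMS.items():
    fp32_kib = n * 4 / 1024   # 4 bytes per FP32 parameter
    int8_kib = n * 1 / 1024   # 1 byte per INT8 parameter
    print(f"{name:5s}  FP32 ~{fp32_kib:6.1f} KiB   INT8 ~{int8_kib:5.1f} KiB")
```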

5. Conclusions and Future Works

This study introduced a novel benchmark tailored for in-sensor ML computing within challenging KiB-range memory constraints. The benchmark evaluated NNs across three distinct use cases: Transportation Mode and Human Activity Recognition, Physical Activity Monitoring, and superficial electromyography regression. The performance of these models was assessed in terms of accuracy, memory footprint, and inference latency. Quantization techniques, including Post-Training Quantization and Quantization-Aware Training, were utilized to adapt the models for resource-constrained environments. The results highlighted the efficacy of the latter technique over the former in maintaining model accuracy while reducing memory and computational demands. However, certain edge cases revealed that the latter might not always outperform the former, suggesting the need for further investigation into the specific conditions affecting quantization performance. The benchmark revealed significant differences in model size and latency, with the CNN model demonstrating substantially greater memory and latency efficiency compared to the Dense and TCN models. The findings underscore the need for careful model selection and optimization to meet the stringent requirements of in-sensor ML computing paradigms. However, a limitation of this study is the adoption of data from a single user, which may introduce the risk of overfitting and limit the generalizability of the results. While this approach allowed for a reduced computational overhead, future studies should incorporate data from multiple subjects to better assess the models’ robustness across inter-subject variability.
Future works should focus on several aspects, including the usage of advanced hybrid quantization techniques, e.g., INT4 and binary, as well as the adoption of a broader range of model architectures, such as recurrent neural networks (e.g., the Legendre Memory Unit [44]). Furthermore, energy consumption analysis should be added as a metric to the benchmark. In conclusion, the proposed benchmark provides a critical step toward optimizing and quantitatively measuring neural network performance for in-sensor ML computing. Future research and development in this area will contribute to more efficient and capable AI systems in edge computing, fostering the growth of intelligent IoT applications as close as possible to the sensing element where data are generated.

Author Contributions

Conceptualization, F.M.A. and D.P.P.; methodology, F.M.A. and D.P.P.; investigation, F.M.A. and D.P.P.; resources, F.M.A. and D.P.P.; writing—original draft preparation and writing—review and editing, F.M.A. and D.P.P.; supervision, D.P.P.; project administration, D.P.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any type of funding.

Institutional Review Board Statement

This study did not require ethical approval. It did not involve humans or animals.

Informed Consent Statement

Not applicable since this study did not involve humans.

Data Availability Statement

No new data were created in this work. Only public datasets (not from the authors of this manuscript) were used.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Al-Sarawi, S.; Anbar, M.; Abdullah, R.; Al Hawari, A.B. Internet of Things Market Analysis Forecasts, 2020–2030. In Proceedings of the 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), London, UK, 27–28 July 2020; pp. 449–453. [Google Scholar] [CrossRef]
  2. Zhang, C.; Lu, Y. Study on artificial intelligence: The state of the art and future prospects. J. Ind. Inf. Integr. 2021, 23, 100224. [Google Scholar] [CrossRef]
  3. Zhang, J.; Tao, D. Empowering Things with Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet Things J. 2021, 8, 7789–7817. [Google Scholar] [CrossRef]
  4. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  5. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  6. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
  7. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  8. Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations. arXiv 2023, arXiv:2308.06767. [Google Scholar] [CrossRef]
  9. Fu, Y.; Yang, H.; Yuan, J.; Li, M.; Wan, C.; Krishnamoorthi, R.; Chandra, V.; Lin, Y. DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Volume 162, pp. 6849–6862. [Google Scholar]
  10. Lee, N.; Ajanthan, T.; Torr, P.H. Snip: Single-shot network pruning based on connection sensitivity. arXiv 2018, arXiv:1810.02340. [Google Scholar]
  11. Su, J.; Chen, Y.; Cai, T.; Wu, T.; Gao, R.; Wang, L.; Lee, J.D. Sanity-checking pruning methods: Random tickets can win the jackpot. Adv. Neural Inf. Process. Syst. 2020, 33, 20390–20401. [Google Scholar]
  12. Wang, C.; Zhang, G.; Grosse, R. Picking winning tickets before training by preserving gradient flow. arXiv 2020, arXiv:2002.07376. [Google Scholar]
  13. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  14. Huang, Z.; Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 304–320. [Google Scholar]
  15. Bai, Y.; Wang, H.; Tao, Z.; Li, K.; Fu, Y. Dual lottery ticket hypothesis. arXiv 2022, arXiv:2203.04248. [Google Scholar]
  16. Chen, T.; Zhang, Z.; Liu, S.; Chang, S.; Wang, Z. Long live the lottery: The existence of winning tickets in lifelong learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  17. Chen, T.; Sui, Y.; Chen, X.; Zhang, A.; Wang, Z. A unified lottery ticket hypothesis for graph neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 1695–1706. [Google Scholar]
  18. Huang, Y.; Aloufi, R.; Cadet, X.; Zhao, Y.; Barnaghi, P.; Haddadi, H. MicroT: Low-Energy and Adaptive Models for MCUs. arXiv 2024, arXiv:2403.08040. [Google Scholar]
  19. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-bit quantization of neural networks for efficient inference. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3009–3018. [Google Scholar]
  20. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2018, 18, 1–30. [Google Scholar]
  21. Lin, D.; Talathi, S.; Annapureddy, S. Fixed point quantization of deep convolutional networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2849–2858. [Google Scholar]
  22. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  23. David, R.; Duke, J.; Jain, A.; Reddi, V.J.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Regev, S.; et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. arXiv 2021, arXiv:2010.08678. [Google Scholar]
  24. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016. [Google Scholar]
  25. Ayaz, F.; Zakariyya, I.; Cano, J.; Keoh, S.L.; Singer, J.; Pau, D.; Kharbouche-Harrari, M. Improving Robustness Against Adversarial Attacks with Deeply Quantized Neural Networks. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
  26. Coelho, C.N.; Kuusela, A.; Li, S.; Zhuang, H.; Ngadiuba, J.; Aarrestad, T.K.; Loncar, V.; Pierini, M.; Pol, A.A.; Summers, S. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nat. Mach. Intell. 2021, 3, 675–686. [Google Scholar] [CrossRef]
  27. Coelho, C.N.; Kuusela, A.; Zhuang, H.; Aarrestad, T.; Loncar, V.; Ngadiuba, J.; Pierini, M.; Summers, S. Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and hls4ml. arXiv 2020, arXiv:2006.10159. [Google Scholar]
  28. Wang, E.; Davis, J.J.; Moro, D.; Zielinski, P.; Lim, J.J.; Coelho, C.; Chatterjee, S.; Cheung, P.Y.; Constantinides, G.A. Enabling binary neural network training on the edge. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, Virtual, 25 June 2021; pp. 37–38. [Google Scholar]
  29. Mattson, P.; Reddi, V.J.; Cheng, C.; Coleman, C.; Diamos, G.; Kanter, D.; Micikevicius, P.; Patterson, D.; Schmuelling, G.; Tang, H.; et al. MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance. IEEE Micro 2020, 40, 8–16. [Google Scholar] [CrossRef]
  30. Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLPerf Tiny Benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Virtual, 6–14 December 2021. [Google Scholar]
  31. Gal-On, S.; Levy, M. Exploring Coremark a Benchmark Maximizing Simplicity and Efficacy; The Embedded Microprocessor Benchmark Consortium: Gainesville, VA, USA, 2012. [Google Scholar]
  32. Torelli, P.; Bangale, M. Measuring Inference Performance of Machine-Learning Frameworks on Edge-Class Devices with the MLMark Benchmark. Technical Report. Available online: https://api.semanticscholar.org/CorpusID:232220731 (accessed on 5 April 2021).
  33. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  34. Benchmarking in Sensor Machine Learning: An Extension to MLCommons-Tiny Github Repository. Available online: https://github.com/fabrizioaymone/sensor (accessed on 28 June 2024).
  35. Update: ISM330ISN and ISM330IS, Sensors with Intelligent Sensor Processing Unit for Greater AI at the Edge. Available online: https://www.st.com/content/st_com/en/campaigns/ispu-ai-in-sensors.html (accessed on 28 May 2024).
  36. Gjoreski, H.; Ciliberto, M.; Wang, L.; Ordonez Morales, F.J.; Mekki, S.; Valentin, S.; Roggen, D. The University of Sussex-Huawei Locomotion and Transportation Dataset for Multimodal Analytics With Mobile Devices. IEEE Access 2018, 6, 42592–42604. [Google Scholar] [CrossRef]
  37. Wang, L.; Gjoreski, H.; Ciliberto, M.; Mekki, S.; Valentin, S.; Roggen, D. Enabling Reproducible Research in Sensor-Based Transportation Mode Recognition With the Sussex-Huawei Dataset. IEEE Access 2019, 7, 10870–10891. [Google Scholar] [CrossRef]
  38. Reiss, A.; Stricker, D. Introducing a New Benchmarked Dataset for Activity Monitoring. In Proceedings of the 2012 16th International Symposium on Wearable Computers, Newcastle, UK, 18–22 June 2012; pp. 108–109. [Google Scholar] [CrossRef]
  39. Reiss, A.; Stricker, D. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on Pervasive Technologies Related to Assistive Environments, Heraklion Crete, Greece, 6–8 June 2012; pp. 1–8. [Google Scholar]
  40. Krasoulis, A.; Vijayakumar, S.; Nazarpour, K. Effect of user practice on prosthetic finger control with an intuitive myoelectric decoder. Front. Neurosci. 2019, 13, 461612. [Google Scholar] [CrossRef]
  41. Zanghieri, M.; Benatti, S.; Burrello, A.; Kartsch, V.; Conti, F.; Benini, L. Robust Real-Time Embedded EMG Recognition Framework Using Temporal Convolutional Networks on a Multicore IoT Processor. IEEE Trans. Biomed. Circuits Syst. 2020, 14, 244–256. [Google Scholar] [CrossRef] [PubMed]
  42. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks: A Unified Approach to Action Segmentation. arXiv 2016, arXiv:cs.CV/1608.08242. [Google Scholar]
  43. Zanghieri, M.; Benatti, S.; Burrello, A.; Kartsch Morinigo, V.J.; Meattini, R.; Palli, G.; Melchiorri, C.; Benini, L. sEMG-based Regression of Hand Kinematics with Temporal Convolutional Networks on a Low-Power Edge Microcontroller. In Proceedings of the 2021 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Barcelona, Spain, 23–25 August 2021; pp. 1–6. [Google Scholar] [CrossRef]
  44. Voelker, A.; Kajić, I.; Eliasmith, C. Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
Figure 1. ML framework adopted in this work.
Figure 2. MAE, Acc. 10°, and Acc. 15° results of TCN on NINAPRO DB8.
Figure 3. Activations, weights, and latencies of the FP32 and INT8 versions of the Dense, CNN, and TCN models.
Table 1. Architecture of Dense, CNN, and TCN neural networks.

Dense                      CNN                        TCN
Layers      Shape          Layers      Shape          Layers       Shape
Input       24, 6          Input       24, 6          Input        256, 16
Dense       24, 128        Conv2D      6, 4, 8        3X Conv1D    256, 16
Dense       24, 64         Conv2D      6, 4, 8        AvgPool1D    128, 16
Dense       24, 128        Flatten     48             2X Conv1D    128, 32
Flatten     3072           Dense       64             Conv1D       64, 32
Dense       32             Dense       8 or 12        AvgPool1D    32, 32
Dense       8 or 12        Softmax     8 or 12        2X Conv1D    32, 64
Softmax     8 or 12                                   Conv1D       8, 64
                                                      AvgPool1D    4, 64
                                                      Flatten      256
                                                      Dense        256
                                                      Dense        32
                                                      Dense        5

The “Layers” column specifies the type of neural network layer used, presented in sequential order from the input to the output of the network. The “Shape” column represents the dimensions of each corresponding layer. For the Dense and CNN Softmax output layers, the number of units may vary between 8 and 12, depending on the dataset being used.
Table 2. Accuracy of Dense and CNN models on SHL and PAMAP2 datasets.

Dataset                         SHL                                PAMAP2
Model   Version     Hand     Hips     Torso    Bag      Chest    Ankle    Hand
Dense   Keras       81.1%    92.9%    84.2%    95.3%    90.7%    86.9%    87.6%
Dense   TFLite      77.8%    88.9%    72.2%    93.9%    85.2%    76.8%    87.1%
Dense   QKeras      79.9%    92.4%    81.0%    94.9%    90.4%    87.4%    87.5%
CNN     Keras       77.1%    90.4%    83.0%    93.7%    88.8%    83.6%    86.5%
CNN     TFLite      75.9%    89.4%    74.6%    93.5%    87.7%    80.6%    86.3%
CNN     QKeras      77.4%    90.5%    81.9%    93.6%    88.3%    82.1%    84.7%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
