1. Introduction
Perception is crucial for autonomous vehicles (AVs). While AVs have the potential to be highly safe, they are not infallible. All automotive manufacturers prioritize safety and efficiency in their designs, yet unforeseen challenges can still arise. The foremost responsibility of an AV is to ensure the safety of its passengers, making it essential for the vehicle to make critical driving decisions based on its surrounding environment. Additionally, the AV must process data efficiently—from sensing to driving—to respond in a timely manner to both its own actions and those of nearby entities. The inherent nature of AVs presents a gap within traditional edge computing research, and we seek to address some of these challenges below.
In a typical AV, the primary processing pipeline involves data flowing from sensors to perception, then to path planning, and finally to actuation. Among these components, the object detection module is arguably the most vital, as underscored by numerous news reports [1,2,3]. Object detection demands both speed and accuracy, making Deep Neural Networks (DNNs) the most fitting choice for this module. DNNs have a robust history and are actively researched, finding applications in both academic and industrial settings. Various DNN implementations, such as YOLO, SSD, Faster R-CNN, DETR, Deformable DETR, and PointPillars for object detection [4,5,6,7,8,9], DeepLab, DeepLabV3+, and ViT for image segmentation [10,11,12], and LaneNet for lane detection [13], demonstrate their effectiveness across tasks.
These DNN workloads are typically executed as edge workloads; however, for a DNN to remain effective, it must be continuously trained and optimized to address new challenges that arise over time, such as new climates or unmarked older roads. Consequently, many companies employ human operators and semi-supervised methods to label and train on the vast amounts of data generated by vehicles on the road, with each vehicle producing nearly 11 terabytes of data daily [14].
Moreover, while edge computing offers significantly lower latency than cloud solutions, it also comes with more limited resources. This scarcity of computing power makes understanding perception workloads on edge devices a critical aspect of vehicle and edge computing [15].
As a mode of transport, AVs must adhere to stringent safety standards, where latency and detection accuracy are paramount [16]. However, meeting real-time requirements is notoriously challenging, as reflected by the extensive research in this domain driven by continuous technological advancements [17]. Given the widespread use of DNNs in perception tasks, the pressing issue is how AVs and edge nodes can effectively manage these workloads without necessitating constant upgrades of sensors and models. Since most AVs rely on their own data for real-time object detection, they face inherent limitations, particularly regarding sensor blind zones and obstructions [18].
In this paper, we investigate memory contention not only at the level of full convolutional neural networks (CNNs) but also across individual layers, focusing on both characterization and prediction. To the best of our knowledge, this work is the first to comprehensively analyze the impact of memory contention on the performance of perception workloads in edge environments. Existing studies often highlight the CPU and GPU as the limiting factors; for example, [19] systematically discusses the configuration of CPUs and GPUs in autonomous driving systems while neglecting the crucial role of memory.
Memory usage is a significant consideration that can profoundly affect the deployment of machine learning models for inference on edge devices. These models typically require substantial memory for training and operation, and their size can escalate rapidly with increased complexity. Consequently, deploying a memory-intensive machine learning model on a resource-constrained edge device can lead to performance issues, such as slower execution times or system crashes. In scenarios where multiple tasks are scheduled simultaneously, only a few may complete even though each has sufficient compute resources, because they contend for memory. Therefore, optimizing memory usage is essential for enhancing the performance and reliability of machine learning systems, reducing hardware costs, and improving overall workflow efficiency. Understanding how memory utilization affects workloads at various levels is critical for diverse deployment scenarios.
Our analysis indicates that memory behavior can be effectively generalized through characterization functions to model and predict workload behavior across different machine learning methods and platforms.
By conducting a thorough analysis, we can extract the performance characteristics for each layer within a perception network. We monitor the behavior of individual layers during the inference process under varying constraints and conditions. This information is vital for understanding model performance and for optimizing its architecture and parameters. However, monitoring the behavior of individual layers can be complex, particularly with intricate models that contain many layers.
Our research reveals that convolutional layers, along with routing, shortcut, and ReLU activation layers, are particularly sensitive to factors such as memory availability. For instance, under constrained memory conditions, some layers may experience significant increases in computation time, with convolutional layers seeing over a 2849% rise, routing layers over 1053%, ReLU layers over 1173.34%, and shortcut layers over 271%. These findings are instrumental in identifying memory bottlenecks in perception networks and developing dynamic memory allocation strategies to enhance performance.
Our main contributions include the following:
Characterizing and modeling the impact of memory on machine learning inference workloads for edge devices.
Conducting in-depth layer analysis to identify the specific layers affected by memory contention and over-subscription.
Generalizing the edge components that influence machine learning inference workloads and outlining challenges for future research.
The remainder of this paper is structured as follows: Section 2 discusses the background and motivation; Section 3 details the profiling setup and variables; Section 4 presents and analyzes the results; Section 5 elaborates on the findings; and Section 6 concludes the paper.
2. Background and Motivation
As the world increasingly embraces the next generation of sensor-equipped vehicles, which have great potential to enhance urban planning and transportation management [20], the supporting infrastructure must evolve accordingly. Modern vehicles, whether fully autonomous or semi-autonomous, demand sophisticated and extensive computational resources to process the data necessary for ensuring passenger safety. To guarantee this safety, computations must be conducted in real time and be relevant to the vehicle's spatial context. Studies such as [14] explore the future landscape of connected vehicles and clarify the associated challenges, notably the issue of offloading computational tasks away from the vehicle itself.
In recent years, the boundaries of edge computing have become less distinct, as many edge devices, from proprietary Nvidia platforms to mobile phones, can perform similar tasks. Edge computing finds applications across various domains, including instant diagnosis in healthcare [21], predictive maintenance in manufacturing [22], smart farming in agriculture [23], unmanned aerial vehicles (UAVs) [24], and enhancing Android-based device capabilities [25]. From these applications, the concept of edge Artificial Intelligence [26] has gradually emerged. Likewise, edge computing is vital for connected autonomous vehicles (CAVs). Research such as [27] highlights the inefficiencies of onboard computation in CAVs and demonstrates how edge computing can alleviate issues like network congestion and limited processing resources.
Recent studies, including [18,28,29], have investigated and validated the potential of using edge computing for sensor fusion in CAVs. Further works, such as [27,30], examine real-world applications of offloading computations to the edge, revealing a range of possibilities. Within this context, the discussion around Deep Neural Networks (DNNs) has emerged as a significant focus area, as indicated by [14].
Most AVs rely on DNN inferencing for their self-driving capabilities, utilizing state-of-the-art models for object detection, such as YOLO [4], Faster R-CNN [6], Mask R-CNN [31], SSD [5], DETR [7], Deformable DETR [8], and PointPillars [9]. YOLO [4] pioneers single-shot real-time object detection, balancing speed and accuracy, but struggles with small and overlapping objects, as its grid division can lead to imprecise localization in those cases. Faster R-CNN [6] introduces the Region Proposal Network (RPN), improving object detection by generating region proposals, and Mask R-CNN [31] extends it with instance segmentation, allowing both object localization and segmentation. SSD [5] improves single-stage detection by using multi-scale feature maps and default boxes to handle objects at different scales. DETR [7] and Deformable DETR [8] adopt and refine a transformer-based approach to object detection, eliminating region proposals and anchor boxes for a simpler end-to-end pipeline, but both are slower than traditional methods such as YOLO [4], Faster R-CNN [6], and SSD [5] due to the complexity of the transformer model. PointPillars [9] revolutionizes 3D object detection by applying a CNN to LiDAR point clouds, enabling real-time detection in autonomous driving. While these models perform well, there are trade-offs and optimization opportunities to consider. For instance, to facilitate deployment on lower-end computational hardware, techniques such as quantization, pruning, and architectural modifications are often employed [32,33,34].
Deploying machine learning models on edge devices, especially those with limited resources, presents several challenges [35]. One significant hurdle is the constrained computing and memory resources available on these devices, which restrict the complexity and size of deployable models. Additionally, power limitations on edge devices further constrain computational energy use [36].
This challenge is particularly pronounced in vehicular edge computing, where most machine learning models require large-weight configurations to function optimally. Thus, models deployed on edge devices must be meticulously optimized for both power and speed, a balance that can be difficult to achieve without compromising accuracy or overall performance.
Another challenge is the lack of standardization in hardware and software platforms, which complicates the development and deployment of machine learning models that maintain consistency across AVs and their associated edge devices. Furthermore, deploying machine learning models on edge devices raises concerns about data privacy and security, as sensitive vehicle data may be processed and stored locally. Addressing these challenges will necessitate advancements in hardware and software optimization, as well as improvements in standardization and data privacy measures [14,37,38,39,40].
Traditional improvements in machine learning and neural networks often focus on overall latency per frame and on optimizing accuracy through loss function adjustments, typically centering on CPU and GPU impacts while neglecting memory considerations. Although targeted methods like the Pyramid Attention Network [41] and the Region Proposal Network [6] have led to many real-time detection algorithms, these are generally developed without the constraints of edge resources in mind.
Common strategies in the literature for addressing edge memory limitations often involve trade-offs or sacrifices. For example, the authors in [42,43] reduce the full YOLO architecture to decrease computational demands. Similarly, ref. [44] proposes distributing workloads through a pipeline to mitigate hardware constraints like memory contention. Traditional methods, such as [45], consider and attempt to eliminate memory contention for individual tasks.
The contemporary literature addressing memory contention for DNNs on edge devices also has limitations. Memory-aware middleware approaches such as MASA [46], Deepeye [47], NestDNN [48], and DART [49] tackle CPU and memory issues for low-end edge devices from a middleware perspective, considering the impacts of individual layers on CPU and memory usage. However, even in the case of MASA, the comprehensive effects of memory are not fully explored. Additionally, due to the methodologies and hardware platforms employed in MASA, the analysis is limited to two layer types and neglects single-stage and two-stage foundations. Furthermore, these works do not model memory characteristics as a variable impacting performance; instead, they facilitate DNN processes. Lastly, most modern DNNs leverage PyTorch [50] as their backbone, which is absent from these motivating studies.
Challenges of Characterizing Machine Learning on Edge
Characterizing machine learning workloads on edge devices involves several significant challenges. A primary issue is the wide variety of edge devices and their distinct hardware configurations, which makes it difficult to generalize performance metrics and operational timings across different systems. Energy consumption is also closely tied to workload intensity and processing duration, further complicating the characterization process. Additionally, the inherent complexity and variability of machine learning models, coupled with diverse input and network conditions, complicate predictions about model behavior on various edge devices, even when those devices share identical specifications.
Moreover, conducting detailed profiling of machine learning workloads on edge devices necessitates specialized tools and expertise, which may not be readily accessible to developers or end users. Compounding these challenges is the need to protect client data privacy, as profiling often requires monitoring the processing of sensitive information directly on the device. To address these issues, advancements in standardized benchmarking tools and privacy-preserving methods for edge devices are essential, as highlighted in surveys like [51].
Studies such as [52,53] explore the relationship between memory and layer-wise connections to privacy. These papers demonstrate that, through layer-wise encoding, it is feasible to encrypt and process data at the layer level with minimal impact on performance for connected autonomous vehicles (CAVs). Similarly, traditional privacy concerns related to memory usage also warrant further investigation.
3. Measuring and Characterizing Perception Workloads on the Edge
Deep learning-based perception networks can be categorized into two main types: single-stage and two-stage object detection. In this study, we focus on both categories, selecting widely recognized representative networks: single-stage YOLO (both Darknet and PyTorch variants) and two-stage Faster R-CNN (PyTorch based). By including both single-stage and two-stage methods, we aim to generalize our findings to other methodologies that employ similar principles.
Our objective is to explore the distinctive resource demands of each category, particularly regarding memory impacts and the performance bottlenecks that arise from these resource requirements.
Beyond analyzing individual resource impacts, we aim to characterize the workload in a way that allows for predictive modeling. As illustrated in Figure 1, a generalized prediction model for incoming workloads can enhance scheduling on edge nodes. The edge network architecture thus serves as both a communication and computational hub for connected autonomous vehicles (CAVs) utilizing the edge network.
3.1. Profiling Testbed and Setup
To rigorously evaluate memory contention effects and layer-wise performance, we designed a test environment that closely resembles real vehicular edge scenarios. Our experiments were conducted on two types of edge hardware configurations: (1) a high-end laptop simulating a powerful edge node (Intel Core i7-10750H, Nvidia GeForce RTX 2070, 16 GB DDR4 RAM, 1 TB NVMe SSD), and (2) an Nvidia Jetson Xavier NX as a representative low-end to mid-range edge device. We selected these two platforms to capture a broad spectrum of the resource capabilities commonly found in vehicular edge deployments.
Dataset and Workload Parameters: For the object detection tasks, we used inputs derived from standard AV-oriented image sets (e.g., subsets of KITTI or COCO) [54,55]. The input resolutions varied (e.g., 320 × 320, 640 × 480, and 1280 × 720) to emulate camera feeds from different AV sensors. Each resolution setting was run multiple times (10 runs per setting) to capture performance variability and to mitigate transient system effects (such as GPU thermal throttling or background OS processes). The size of each workload was selected to reflect realistic AV perception tasks, where real-time performance is vital. For each test run, we tracked not only the inference latency but also the peak and average memory consumption, using nvidia-smi, PyTorch's built-in profiler, and Linux-based resource monitoring tools. Note that our scope is system-level profiling rather than perception accuracy metrics; the approach is dataset-agnostic, and we therefore focus on memory usage and layer execution time behavior, which remain consistent across datasets.
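As a minimal sketch of this measurement loop (not the exact scripts used in our experiments), the Python snippet below times repeated inferences at several input resolutions and records peak GPU memory with standard PyTorch utilities; the torchvision detector and the synthetic frames are placeholders for the actual workloads and datasets.

```python
import time
import torch
import torchvision

# Placeholder detector standing in for the profiled YOLO/Faster R-CNN workloads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

resolutions = [(320, 320), (640, 480), (1280, 720)]  # width x height, as in the test matrix
runs_per_setting = 10

for width, height in resolutions:
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
    latencies = []
    for _ in range(runs_per_setting):
        frame = torch.rand(3, height, width, device=device)  # synthetic frame in [0, 1)
        start = time.perf_counter()
        with torch.no_grad():
            model([frame])                                    # torchvision detectors take a list of images
        if device == "cuda":
            torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    peak_mib = (torch.cuda.max_memory_allocated(device) / 2**20) if device == "cuda" else float("nan")
    print(f"{width}x{height}: mean {sum(latencies) / len(latencies):.3f} s, peak GPU memory {peak_mib:.1f} MiB")
```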
Simulation Topology: To approximate a distributed edge environment, we modeled each device as if it were an individual edge node receiving inference tasks from one or more autonomous vehicles. Each node could be allocated a certain fraction of tasks, simulating real traffic scenarios where multiple vehicles offload to the same edge. The exact number of concurrent requests varied to provoke different degrees of resource contention. In more constrained tests, we artificially reduced the available RAM or GPU memory to simulate heavy oversubscription.
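One simple way to emulate such oversubscription in software, sketched below under the assumption of a Linux host, is to cap the process address space with the standard `resource` module and the CUDA caching allocator with PyTorch; the specific limits shown are illustrative, not the values used in our runs.

```python
import resource
import torch

def cap_host_memory(max_bytes: int) -> None:
    """Limit this process's virtual address space (Linux) to emulate scarce RAM."""
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

def cap_gpu_memory(fraction: float, device: int = 0) -> None:
    """Restrict the CUDA caching allocator to a fraction of total GPU memory."""
    torch.cuda.set_per_process_memory_fraction(fraction, device)

if __name__ == "__main__":
    # Example: emulate a heavily constrained edge node (~800 MB of RAM, 25% of the GPU).
    cap_host_memory(800 * 1024 * 1024)
    if torch.cuda.is_available():
        cap_gpu_memory(0.25)
```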
Reliability and Validity Measures: We repeated each configuration (device × resolution × concurrency level) multiple times to ensure stable averages. We computed standard deviations and confidence intervals (95% CI) to check for statistical consistency. In addition, we systematically logged CPU frequency, GPU utilization, and GPU temperature to verify that the results were not skewed by thermal throttling or one-off system states. Where possible, we cross-checked the measured layer-by-layer timings with independent profiling scripts to validate the internal PyTorch logs. Although our dataset was substantial enough to capture meaningful trends in resource usage, we acknowledge that a larger or more diverse dataset (e.g., a full AV sensor suite) could yield additional insights. This limitation is further discussed in Section 5 and Section 6.
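The statistical summary itself is straightforward; the sketch below shows one way to compute the mean, standard deviation, and an approximate 95% confidence interval for a set of repeated latency measurements (the sample values are illustrative only).

```python
import math
import statistics

def summarize(latencies: list[float]) -> tuple[float, float, float]:
    """Return mean, sample standard deviation, and 95% CI half-width for repeated runs."""
    mean = statistics.mean(latencies)
    std = statistics.stdev(latencies)
    # 1.96 assumes a normal approximation; with very few runs, a t-value is more appropriate.
    half_width = 1.96 * std / math.sqrt(len(latencies))
    return mean, std, half_width

runs = [0.112, 0.108, 0.115, 0.110, 0.109, 0.121, 0.111, 0.113, 0.117, 0.110]  # illustrative
mean, std, ci95 = summarize(runs)
print(f"latency = {mean:.3f} s ± {ci95:.3f} s (95% CI), std = {std:.3f} s")
```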
3.2. Features and Formulation
To assess how incoming AV workloads will affect available edge nodes, we consider two sets of variables: the inputs from AVs and the resources available on edge nodes. We define one set containing the AV input requests and another containing the edge nodes; each element of both sets is monitored and managed by the Prediction Module.
As depicted in Figure 1, upon receiving input requests from vehicles, the edge nodes within the service range can process these tasks. Each edge node evaluates the incoming workload requests and determines its capacity to handle them. The nodes then communicate their task requirements and resource availability to the edge manager, which efficiently allocates the workload to an edge node capable of completing the task within an acceptable timeframe.
To formulate potential edge configurations, we define the following key variables: the processor (compute) resource, RAM (the memory resource), workload size (derived from the requested service and input size), and time (the quantifying metric).
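These variables can be grouped into a small record handed to the Prediction Module; the field names below are illustrative rather than the paper's notation.

```python
from dataclasses import dataclass

@dataclass
class WorkloadRequest:
    """AV-side input to the Prediction Module."""
    service: str        # requested ML service, e.g., "yolo" or "faster_rcnn"
    input_pixels: int   # workload size derived from the requested service and input resolution

@dataclass
class EdgeNodeState:
    """Edge-side resource snapshot used for placement decisions."""
    cpu_freq_ghz: float  # processor resource
    free_ram_mb: float   # memory resource
    queued_tasks: int    # current load on the node

# Inference time is the quantity the prediction model estimates from these fields.
```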
4. Analysis and Key Insights
The variability in edge deployment hardware and task types presents a significant challenge. Diverse AV workload requests and differences in hardware performance on edge nodes complicate planning, ultimately affecting the reliability of DNN inferencing and presenting a fundamental challenge for robust workload characterization.
4.1. Memory and Computation Load
As discussed in Section 2, memory contention is a critical factor when latency or performance guarantees are essential. Figure 2 illustrates two distinct states, emphasizing the necessity of managing memory contention effectively in these scenarios.
In our profiling, we identify three scenarios regarding memory resources:
Scenario 1: No memory resource contention.
Scenario 2: Memory resource contention typically encountered under heavy workloads.
Scenario 3: Extreme lack of memory resources.
In Scenario 1, memory resources are abundant for all tasks, with no bottlenecks. In Scenario 2, however, memory resources are strained due to access contention or limited availability. Notably, temperature and system bus bottlenecks are not considered in Scenario 2, as these conditions may lead to system instability. Scenario 3 accounts for severe memory allocation constraints, from low to minimal operational limits.
In Scenario 1 (Figure 2(left)), the behavior remains consistent with a CPU-dominated linear model, expressed in Equation (1), where the workload term represents the computational load required by the task.
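A minimal sketch of fitting such a CPU-dominated linear model is shown below; the load and latency arrays are illustrative placeholders, and the exact form of Equation (1) is the one given in the paper.

```python
import numpy as np

# Fit t ≈ a*C + b, where C is the computational load of the workload (illustrative values).
compute_load = np.array([1.2, 2.5, 3.4, 5.1, 6.8])    # e.g., GFLOPs per frame
latency_s = np.array([0.05, 0.10, 0.13, 0.20, 0.27])  # measured inference times

a, b = np.polyfit(compute_load, latency_s, deg=1)
predicted = a * compute_load + b
ss_res = np.sum((latency_s - predicted) ** 2)
ss_tot = np.sum((latency_s - latency_s.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"t ≈ {a:.4f}*C + {b:.4f}, R² = {r_squared:.3f}")
```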
However, when applying Equation (1) to the same set of workloads under memory limitations, the model no longer holds, as depicted in Figure 2(right). Traditional optimization and scheduling methods may alleviate some issues, but they cannot fully resolve the challenges posed by memory contention.
Thus, to accurately characterize the impact of memory contention across various platforms, further characterization is necessary for workloads in Scenarios 2 and 3, which encounter moderate to heavy memory contention.
By modeling the behavior of the algorithms under varying resource constraints, we can profile and characterize the effects of memory contention for both Faster R-CNN and YOLO. Initially, we profile the maximum amount of RAM needed by each machine learning method; each method is then tested with progressively less available RAM until task failure occurs.
As shown in Figure 3, we profile the workload characteristics under Scenario 3 and observe a clear correlation with the amount of memory available. From the data collected, we model the behaviors of the two ML workloads using Equations (2) and (3), respectively, with CPU frequency and available memory as the model variables. It is important to note that while the goodness of fit is acceptable over most of the memory range, extreme cases, such as operating with 100 MB or less of available memory, exhibit higher variance when predicted using this model.
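As an illustration of this kind of memory-dependent fit, the sketch below uses an inverse-in-memory functional form, which is an assumption made for the example only; Equations (2) and (3) in the paper define the actual models.

```python
import numpy as np
from scipy.optimize import curve_fit

def mem_model(mem_mb, a, b):
    """Assumed illustrative form: latency grows as available memory shrinks."""
    return a / mem_mb + b

mem_mb = np.array([2000.0, 1500.0, 1000.0, 800.0, 600.0, 400.0, 200.0])  # available RAM (illustrative)
latency_s = np.array([0.21, 0.22, 0.25, 0.29, 0.36, 0.55, 1.10])         # measured times (illustrative)

(a, b), _ = curve_fit(mem_model, mem_mb, latency_s)
predicted = mem_model(mem_mb, a, b)
r_squared = 1.0 - np.sum((latency_s - predicted) ** 2) / np.sum((latency_s - latency_s.mean()) ** 2)
print(f"t(m) ≈ {a:.1f}/m + {b:.3f}, R² = {r_squared:.3f}")
```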
To ensure consistency in edge workload parameters for Scenario 3 across different platforms, we analyze the same scenario on the Jetson device, as shown in Figure 4. While the Jetson device exhibits less variance in the data, its behavior aligns closely with our previous findings. The second set of models, derived for Faster R-CNN and YOLO on the Jetson, is expressed in Equations (4) and (5). With the Jetson memory model, we again obtain close fits (measured by R²) for both Faster R-CNN and YOLO.
This characterization is beneficial for industrial cost estimation, providing insights into how many incoming tasks an edge node can manage. With the impact of memory clearly defined, we now turn our attention to how it affects each individual layer of the models.
4.2. Layer-Wise Characterization
Building on the insights gained from memory profiling, we can delve into the layer-specific analysis of our machine learning methods. Both YOLO and Faster R-CNN utilize several common layers, including convolutional layers, activation layers, and operations such as max-pooling.
Figure 5 illustrates the differences in layer performance under normal operation versus resource-constrained conditions, highlighting how memory contention disproportionately affects specific layers within the architecture.
Given the shared building blocks, we conduct layer-by-layer workload profiling for both scenarios. The layers that experience significant impacts are depicted in Figure 6 for YOLO and Figure 7 for Faster R-CNN. A detailed analysis of the gathered data reveals that the most affected layers are primarily located at the beginning of the architecture for both models. Table 1 summarizes the most impacted layers in YOLO, while Table 2 details the affected layers in Faster R-CNN.
Starting with Figure 6(left), we observe that two layers, Layer 1 and Layer 6, are significantly impacted by memory contention. Both are convolutional layers with a theoretical computational load of 3.407 GFLOPs. Despite this seemingly substantial load, similar loads are distributed across various other layers in the architecture. Notably, Layer 1 exhibits a remarkable 27.65-fold increase in execution time, while Layer 6 demonstrates a 29.49-fold increase compared to the non-contention scenario. These findings provide valuable insights into the areas most affected by resource constraints within the YOLO architecture, highlighting the critical layers that may benefit from targeted optimization efforts.
However, in Figure 6(right), after removing the two heavily impacted layers from Figure 6(left), we identify other layers that also show notable increases in latency. In our profiling results, Layers 9 and 20 reveal an unusual degree of increase relative to both their theoretical GFLOP values and their non-contention latency. Specifically, Layer 9, a routing layer with near-zero GFLOPs, experiences a staggering latency increase of over 1053%. Similarly, Layer 20, a shortcut layer with an almost negligible load of 0.001 GFLOPs, demonstrates a latency increase of over 271%.
While it is expected that the initial convolutional layers would be heavily affected by memory constraints—primarily due to the necessity of loading input data from storage—the significant impacts observed in Layers 9 and 20 suggest intriguing avenues for future research.
Next, we analyze the layer data for Faster R-CNN, as shown in Figure 7. Here, five layers are identified as most affected, with Layers 3 and 7 standing out prominently. Layer 3, a ReLU activation layer, exhibits a dramatic increase of over 1173%, while Layer 7, a convolutional layer, shows an even more severe increase of over 2849%.
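Per-layer timings of this kind can be collected in several ways; the sketch below uses PyTorch forward hooks on leaf modules as one simple, framework-level approach (our measurements combined the frameworks' own logs with independent scripts, as described in Section 3).

```python
import time
import torch
import torch.nn as nn

def layer_times(model: nn.Module, inputs) -> dict:
    """Wall-clock time per leaf module for a single forward pass, via forward hooks."""
    times, handles = {}, []

    def make_hooks(name):
        def pre_hook(module, args):
            times[name] = time.perf_counter()
        def post_hook(module, args, output):
            times[name] = time.perf_counter() - times[name]
        return pre_hook, post_hook

    for name, module in model.named_modules():
        if not list(module.children()):                      # leaf layers only
            pre, post = make_hooks(name)
            handles.append(module.register_forward_pre_hook(pre))
            handles.append(module.register_forward_hook(post))
    with torch.no_grad():
        model(inputs)
    for handle in handles:
        handle.remove()
    return times

# Usage: time the same input once unconstrained and once under memory pressure,
# then compute per-layer slowdown factors.
# baseline = layer_times(model, frame)
# constrained = layer_times(model, frame)   # after capping memory as in Section 3.1
# slowdown = {k: constrained[k] / baseline[k] for k in baseline if baseline[k] > 0}
```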
4.3. Compute Power and Processor Characterization
On both of our platforms, we began with a consistent input load for the two machine learning methods.
Figure 8 illustrates the performance of Faster R-CNN across four distinct CPU power (Thermal Design Power, TDP) and core configurations, using inference time as the key metric to characterize the workload. Based on the data points collected in Figure 8(right), we can model the impact of Scenario 1 for Faster R-CNN as follows, with frequency expressed in gigahertz.
Similarly, Figure 8(left) presents the profiling results for YOLO under the same CPU power and core configurations. We can likewise formulate the model for the impact of Scenario 1 on the YOLO workload as follows:
Incorporating the memory variable from Scenario 2, we arrive at Equation (8) for Faster R-CNN, with memory expressed in gigabytes and frequency in gigahertz:
Likewise, when characterizing the complete workload, Scenario 2 for YOLO is represented by Equation (9), expressed as follows:
4.4. Workload Size and Resource Requirement
The CPU profiling reveals a direct linear relationship between CPU computational power and inference performance for both models, as illustrated in Figure 8.
At the same time, the input size of the task also influences the amount of computational power required. For instance, processing higher-resolution sensor data demands significantly more computational power than low-resolution data. To explore this further, we perform profiling with various input sizes, as depicted in Figure 9.
However, when we introduce an additional variable, memory, into the equation, the profiling results change significantly, as illustrated in Figure 2. Applying the direct linear formula with adjusted constants to Figure 2(right) shows that the R² fit of the formulation based solely on the CPU variable becomes increasingly inaccurate.
4.5. Insight Analysis and Generalization
Using the data from our profiling to identify the variables that influence ML workloads, we can now work on generalizing our prediction model.
Drawing from the key insights in Formulas (2)–(5), we summarize the findings for Scenario 3 in Equation (10), whose variable coefficients and constant term are specific to each edge node and whose memory term denotes the currently available memory. To represent Scenario 3 accurately, the CPU variable is removed from Equation (10), since memory dominates in this regime. We account for the variance introduced by thermal limitations using the rolling window average in Equation (11), where n marks a specific point in time, enhancing accuracy; the rolling window over the node coefficients is defined with respect to the current time constant used in Equation (10):
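A rolling window average of this kind can be implemented in a few lines; the sketch below smooths a per-node coefficient as thermal conditions drift, with the window length and sample values chosen purely for illustration.

```python
from collections import deque

class RollingCoefficient:
    """Rolling-window average of a per-node model coefficient sampled over time."""

    def __init__(self, window: int):
        self.samples = deque(maxlen=window)

    def update(self, value: float) -> float:
        """Add the latest observation and return the windowed average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

coeff = RollingCoefficient(window=5)
for observed in [0.42, 0.44, 0.47, 0.52, 0.49]:   # illustrative coefficient measurements
    smoothed = coeff.update(observed)
print(f"current smoothed coefficient: {smoothed:.3f}")
```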
By integrating the two characterization models, we can formulate a decision tree for the task queue c running on the edge. Here, each task i has a memory threshold for Scenario 3, below which the influence of the CPU variable is overshadowed by the effects of memory. This threshold varies depending on the ML algorithm; in our results, it is approximately 100 MB for YOLO and 800 MB for Faster R-CNN. The decision tree for predicting workload is outlined as follows:
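As a hedged illustration of this thresholding logic only (not a reproduction of the exact decision tree), the sketch below switches between a memory-dominated and a CPU-dominated latency model per task; the coefficient and attribute names are placeholders, and only the two thresholds come from our results.

```python
MEM_THRESHOLD_MB = {"yolo": 100, "faster_rcnn": 800}   # Scenario 3 thresholds from our profiling

def predict_latency(task, node):
    """Placeholder coefficients (node.k_*, node.c_*) stand in for the fitted per-node constants."""
    if node.free_ram_mb <= MEM_THRESHOLD_MB[task.service]:
        # Scenario 3: memory dominates, so the CPU term is dropped (Equation (10) style).
        return node.k_mem[task.service] / node.free_ram_mb + node.c_mem[task.service]
    # Scenarios 1-2: CPU-dominated linear model in computational load.
    return node.k_cpu[task.service] * task.gflops / node.cpu_freq_ghz + node.c_cpu[task.service]

def admit(task, node, deadline_s):
    """Edge manager check: accept the task only if the predicted time meets its deadline."""
    return predict_latency(task, node) <= deadline_s
```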
Based on the profiling results, the first key insight highlights the effects of limited memory resources on a given workload. Traditional CPU-based scheduling approaches fail to consider physical memory constraints, which can lead to issues such as the over-subscription of an edge node. Furthermore, the amount of data needed for a straightforward predictive model derived from our insights is quite small, enabling quick and efficient customization.
We identify the characteristics of the various edge node components relevant to CAV-based inference workloads, categorizing the CPU and input size variables as linear while implementing specific measures to address over-subscription scenarios.
Further profiling of layer performance reveals that only a few specific layers are impacted by memory contention. As shown in Figure 6 and Figure 7, we identify the affected layers for YOLO and Faster R-CNN, respectively. These findings facilitate a more targeted approach to researching and optimizing ML methods for edge devices, particularly in scenarios involving memory contention or lower hardware specifications.
5. Implications and Discussion
In our generalization, we initially explored a non-linear approach to account for all the variables discussed in Section 4 simultaneously. To achieve this, we tested various modeling methods. The results of including all our targeted variables are presented in Table 3. While Linear Regression provides a reasonable level of model accuracy, it is outperformed by methods such as MLP [56] and Correlated Nystrom Views [57].
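For readers who wish to reproduce this kind of comparison, the sketch below contrasts a linear model, an MLP, and a standard Nyström kernel approximation (used here only as a stand-in for the Correlated Nystrom Views method of [57]) on synthetic profiling features; the data and scores are illustrative, not Table 3.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the profiling table: features are CPU GHz, free RAM (MB),
# and input pixels; the target is inference latency in seconds.
rng = np.random.default_rng(0)
X = rng.uniform([1.0, 100.0, 320 * 320], [4.0, 8000.0, 1280 * 720], size=(200, 3))
y = 0.02 * X[:, 2] / 1e5 / X[:, 0] + 50.0 / X[:, 1] + rng.normal(0.0, 0.01, 200)

models = {
    "linear": LinearRegression(),
    "mlp": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)),
    "nystroem+ridge": make_pipeline(StandardScaler(), Nystroem(n_components=50, random_state=0), Ridge()),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R² = {r2:.3f}")
```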
Through additional testing, we attempted to train and fine-tune the aforementioned models using actual data, comparing their predictive capabilities across different platforms. However, due to the limited sample size, the models encountered overfitting issues. To address this, we applied SMOTE [58] to upsample our data where necessary, aiming for better generalization. Unfortunately, this process proved to be cumbersome and requires further optimization through research.
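SMOTE operates on class labels, so in a sketch like the one below the profiling runs are grouped under a categorical target (for example, a contention-scenario label) before oversampling; the data are synthetic placeholders, and our exact upsampling setup may differ.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Synthetic profiling features (CPU GHz, free RAM, input size) with an imbalanced
# categorical target, here an assumed contention-scenario label.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = np.array([1] * 100 + [3] * 20)

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_resampled))
```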
While we are confident that the ML modeling approach based on the identified variables will outperform the generalized formulations presented, it will necessitate a larger dataset for effectiveness and will require retraining for each individual edge node in a self-updating loop, which will demand additional resources.
5.1. Unusual Findings
Several unusual findings emerged during our research that warrant further attention. Firstly, our profiling revealed that oversubscribed workloads trigger the Out-Of-Memory (OOM) killer on the Nvidia Jetson but not on other platforms using the same Linux kernel and OOM policy. Since this issue exceeds the scope of this paper, we did not explore it further, but it presents a potential avenue for optimizing CAV workload scheduling.
Another anomaly noted during profiling was the discrepancy between the theoretical computational load and the actual latency observed in the workload. Specifically, as discussed in the layer-wise profiling section, we found that shortcut layers and routing layers exhibited significantly higher latency than standard convolutional layers. While this could be attributed to insufficient memory, it does not fully account for the substantial increase in latency we measured. Additionally, both shortcut and routing layers have a minimal computational footprint, making this finding particularly relevant for optimization research. Once again, due to the technical complexities involved, we did not investigate this phenomenon further. We hypothesize that this latency increase may stem from a linear or sequential architecture workflow. Future research could explore base ML algorithms and those that can be parallelized, such as transformers, to determine if a method can be developed to improve the consistency between the theoretical performance and real-world outcomes.
Although these findings did not impact the workload profiling directly, they highlight additional opportunities for future investigation.
5.2. Limitations
There are several limitations that are noteworthy and go beyond the scope of a single paper. However, we believe that addressing them offers promising directions for future work.
First is the partial consideration of AV dynamics. This paper zeroes in on memory contention and latency in deep learning workloads but does not explicitly account for vehicular mobility models or dynamic network conditions (e.g., handoff between edge nodes as a vehicle moves). While we believe the memory-contention insights are broadly applicable, real-time vehicular scenarios might involve intermittent connectivity or variable network latency that further affects resource allocation.
Second is the limitation of open-source datasets. Publicly available datasets capture only a small fraction of the conditions faced on real roads, so any models created or studied solely within their scope risk being narrow and inapplicable to real-life needs. We believe that more diverse datasets are needed to refine the observations made in this paper.
Third, we also see that certain layers (e.g., routing and shortcut layers in YOLO) exhibited disproportionate latency spikes under memory contention. While we hypothesize that these anomalies stem from architecture design and memory-access bottlenecks, our analysis does not fully isolate all possible causes (e.g., overhead from framework-level memory allocation, PCIe contention, or caching effects). A deeper hardware or machine-level study could offer more conclusive explanations.
Finally, the current work does not quantitatively evaluate how encryption or obfuscation techniques impact memory contention. In safety-critical vehicular settings, ensuring data confidentiality might introduce extra overhead, which we have not yet modeled.
5.3. Further Explorations
A promising avenue for exploration based on our findings is the development of accurate and efficient profiling tools and methodologies that can manage the complexity and variability of machine learning models and edge device hardware. A comprehensive understanding of workload behavior can lead to the establishment of standardized benchmarks and metrics targeting variables for more generalized use cases.
Furthermore, our findings indicate a need to address privacy considerations. While our profiling does not incorporate privacy algorithms and methods applicable to the selected workloads, the integration of privacy-preserving techniques and data obfuscation methods may reveal different workload characteristics and provide additional insights. Considering privacy in workload profiling could offer a more complete understanding of the challenges and opportunities at hand.
6. Conclusions and Future Works
Machine learning and AI workloads have become staple components of modern society, driven by the widespread adoption of edge technology by major companies such as Microsoft Azure, Amazon AWS, and IBM [59,60,61]. However, utilizing on-road edge nodes to offload workloads from connected and autonomous vehicles (CAVs) introduces many uncertainties. Existing research on optimizing machine learning for edge devices lacks the in-depth insights necessary for edge users and device providers to accurately predict the behavior of machine learning workloads across various configurations. In this paper, we tackle these challenges and identify areas for potential improvement. We propose a novel approach to optimizing workloads by considering edge device parameters, workload inputs, and algorithm architectures. The profiling and insights provided in our approach can be generalized beyond the specific machine learning algorithms discussed here.
While our models are fundamental, they provide a clearer characterization of workload behavior compared to traditional machine learning (ML) models. Although some models, like MLP and Correlated Nystrom Views, can predict workload behavior, they require extensive training and struggle to accommodate diverse ML inference algorithms.
By characterizing and generalizing our findings, we offer valuable insights into the performance and capabilities of edge devices for machine learning workloads. Our analysis indicates that even under memory contention, algorithms such as YOLO and Faster R-CNN can successfully complete their tasks, suggesting that resource oversubscription on edge devices is a feasible option for non-time-sensitive applications.
Moreover, utilizing memory and layer-wise information allows for the optimization of both the architecture and deployment of machine learning models, enhancing the efficiency of edge devices and reducing energy consumption. Our analysis identifies critical bottlenecks in machine learning workloads, as detailed in Section 4 and Section 5.
The ability to predict the execution time of specific layers presents several advantages for future research. First, it enables the optimization of model architecture and hyperparameters by pinpointing the most time-consuming layers and enhancing their performance. Second, it improves model efficiency during training or inference by scheduling layers to maximize available resources and minimize overall execution time. Finally, it supports the development of new algorithms and techniques that consider the performance characteristics of individual layers, leading to the creation of more efficient and accurate machine learning models.
The contributions of our paper are significant. Our proposed approach offers a targeted method for predicting the performance and maximum workload capabilities of edge devices for machine learning inference tasks. These insights empower other researchers and service providers to pursue future research on more efficient algorithms and resource management strategies, including memory and processing power, ultimately reducing contention and enhancing overall performance. Additionally, exploring distributed systems and edge-to-cloud architectures could help mitigate resource contention and improve scalability.
In conclusion, there are numerous potential areas for future research in profiling machine learning workloads on edge devices. Addressing these challenges will be vital for facilitating the widespread deployment of machine learning models in edge environments. By investigating these areas, researchers can advance efficient algorithms, resource management strategies, profiling tools, and privacy-preserving techniques, ultimately enhancing the seamless integration of machine learning within edge computing frameworks.