1. Introduction
Computer vision has reached a mature stage and finds applications across a wide range of fields. These applications heavily rely on image sensors, which detect electromagnetic radiation, typically in the form of visible or infrared light. Indeed, image acquisition plays a pivotal role in machine vision, since high-quality images enable higher-quality processing and analysis. This stage involves both hardware and software components, and selecting the right components is crucial for the subsequent processing and analysis.
In general, images can be generated with active or passive methods. In the active method, a light source illuminates the object, while in the passive method, natural sunlight serves as the illumination. Usually, in production lines and industrial applications, the active method is more suitable, since it leads to more reproducible conditions; however, this method means that the choice of light source and illumination conditions becomes critical. Depending on the application, typical light sources can be lamps, LEDs, and/or laser sources. The wavelength of the electromagnetic wave also matters for image recording, with visible, infrared, or X-ray regions of the spectrum used for scene illumination [
1].
In both cases, computer vision has relied on frame-based cameras for a long time; they produce data corresponding to the captured light intensity at each pixel in a synchronous manner (i.e., the whole sensor records the scene at a defined frame rate). While this technology has been effective and superior to other camera types for many years, frame-based cameras still exhibit limitations that impact performance and accuracy. The challenges associated with frame-based cameras include the following:
High latency: the delays between the presentation of a physical stimulus, its transduction into analog values, and its encoding into a digital representation are high when the data are sampled at a fixed rate, independent of the stimulus.
Low dynamic range: frame-based cameras struggle to handle scenes with very large variations in brightness.
Power consumption: high-quality and feature-rich frame-based cameras exhibit high power consumption, making them impractical in many resource-constrained environments.
Motion blur: when capturing high-speed motion, frame-based cameras introduce motion blur, affecting subsequent image processing accuracy.
Limited frame rates: traditional sensors operate at relatively low frame rates (typically on the order of 20–240 fps), whereas high-speed recordings require specialized, highly complex, and expensive equipment.
Typical frame-based cameras acquire incident light at a frequency indicative of the temporal resolution and latency of the sensor. The two major reading schemes are the global and rolling shutter. In global shutter schemes, the entire array is read simultaneously at fixed timestamps; in this way, the camera exposure is synchronized to be the same for all the pixels of the array. Consequently, the sensor latency depends on the number of pixels to be transferred. To decrease latency, rolling shutter cameras read the array row by row; the time to read out a single row is known as
line time and can be of the order of 10–20 microseconds. This reading scheme reduces latency to the product of the line time and the number of rows but, as a consequence, the occurrence of motion blur artifacts increases. In general, lower exposure times mitigate blur at the cost of overly dark images and loss of detail. Higher frame rate cameras can mitigate motion blur, but this leads to increased power consumption and overheating issues, and the associated image processor or digital signal processor must handle substantially larger data volumes [
2].
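As a back-of-the-envelope illustration of the readout latencies discussed above, the sketch below computes the rolling-shutter readout time as the product of the line time and the number of rows, and a global-shutter transfer time that scales with the total pixel count. The line time, resolution, and per-pixel transfer time are illustrative assumptions, not values from a specific sensor datasheet.

```python
# Illustrative readout-latency estimates for the two shutter schemes described above.
# All numeric values (line time, resolution, per-pixel transfer time) are assumptions
# chosen only to make the orders of magnitude concrete.

def rolling_shutter_readout_us(n_rows: int, line_time_us: float) -> float:
    """Rolling shutter: rows are read sequentially, so the readout time is
    the line time multiplied by the number of rows."""
    return n_rows * line_time_us

def global_shutter_transfer_us(n_rows: int, n_cols: int, pixel_transfer_us: float) -> float:
    """Global shutter: the whole array is exposed at once, but the latency still
    depends on the number of pixels that have to be transferred."""
    return n_rows * n_cols * pixel_transfer_us

if __name__ == "__main__":
    rows, cols = 1080, 1920                                                # assumed Full-HD sensor
    print(rolling_shutter_readout_us(rows, line_time_us=15.0))             # 16200 us = 16.2 ms
    print(global_shutter_transfer_us(rows, cols, pixel_transfer_us=0.01))  # ~20.7 ms
```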
The aforementioned limitations impact the possibility of applying traditional computer vision in certain domains, depending on the very specific requirements of each application field. To address these challenges, researchers have started to explore other solutions; among them, over the last 60 years, neuromorphic cameras (also known as event cameras) have emerged, taking inspiration from biological vision systems. While attempts to electronically model the mammalian visual system date back to 1970 [
3], the seminal work on such sensors was introduced in 1988 [
4], which constructed an analog model of the early stages of retinal processing on a single silicon chip. Unlike traditional frame-based cameras, neuromorphic image sensors operate asynchronously, mimicking the spatio-temporal information extraction of biological vision systems to enhance data acquisition, sensitivity, dynamic range, and spatio-temporal resolution. This different paradigm results in a sparse event data output. In this way, their application can benefit from reduced temporal latency, high dynamic range, robustness to lighting conditions, high data transmission speed, and reduced power consumption. Moreover, neuromorphic cameras have the advantage of working under privacy-preserving conditions, since they capture pixel brightness variations instead of static imagery [
5]. As evidence of the potential of such sensors, both academia and industry have shown interest in neuromorphic cameras and, since their original introduction, several breakthroughs have improved performance and provided different working alternatives, leading to commercial solutions and sensors ready for the market. State-of-the-art company case studies show how event cameras have been successfully applied in several fields, from robotics to intelligent transportation systems, from surveillance and security to healthcare, and from facial analysis to space imaging.
Figure 1 reports two commercial examples of neuromorphic cameras, the Prophesee EVK4 [
6] (a) and the Inivation DAVIS346 [
7] (b).
Nevertheless, due to the distinct characteristics of such sensors, the methods and algorithms developed for frame-based computer vision are no longer directly applicable to event-based vision systems. Furthermore, the extensive datasets used for training and characterizing frame-based systems cannot be seamlessly transferred to their event-based counterparts due to the fundamental differences in data output. As a consequence, event cameras have garnered significant interest from researchers, and academic investigation in this domain has flourished as experts seek to address fundamental questions related to event camera technology. Given the transformative potential of event cameras, it is not surprising that a wealth of work has been proposed to unlock their full capabilities and overcome associated challenges.
Additionally, surveys and comprehensive analyses that consolidate existing knowledge, highlight trends, and provide valuable insights for both practitioners and academics have been proposed. Nevertheless, it is possible to observe that this, albeit extensive, literature is missing a critical analysis based on different application domains (see
Section 4), frequently preferring to list certain computer vision tasks instead. Often, a solution for a low-level computer vision task can serve as a building element for tasks hierarchically classified at a higher level. However, while it is acknowledged that the traditional hierarchical classification of computer vision tasks may evolve with the latest progress in deep learning, and in particular with the advent of end-to-end systems, it is still crucial to separate computer vision tasks from systems that have been implemented in a specific application domain, each with its unique features and challenges. This is particularly true for the relatively recent neuromorphic sensors, whose research is less well established compared to that on classic cameras, with consequently less clear links between results on specific tasks and applications.
Finally, the reduction in the cost of producing event cameras, the availability of event camera simulators and datasets, as well as the very rapid advances in deep learning, are leading to a rapid diffusion of this technology, widening the number of works in the field. Similarly, the number of workshops, academic conferences, and special issues in journals has increased tremendously over the years, creating the need to re-organize the works presented in previous surveys and to include the most recent research outcomes.
This work aims to fill the aforementioned gaps with the following contributions:
a collection of past surveys about neuromorphic cameras and computer vision with the introduction of a specific taxonomy to easily classify and refer to them;
a critical analysis driven by the different application domains, instead of low-, medium-, or high-level computer vision tasks;
an updated review that includes recent works and research outcomes.
The first two contributions, to the best of our knowledge, are introduced for the first time in the context of reviews about neuromorphic cameras.
This work is organized as summed up by the scheme shown in
Figure 2.
Section 2 briefly sums up the working principle of a neuromorphic camera and introduces the event-based paradigm. The methodology used to select the papers forming the subject of this survey and the reason behind the selected Scopus search queries are given in
Section 3. In
Section 4, existing surveys in the literature are discussed and cataloged using a proposed taxonomy that considers the main focus of each survey.
Section 5 briefly puts in context the proposed analysis by application domain with respect to low-, middle-, and high-level computer vision tasks. In
Section 6, a review analyzes works grouped by their application domain, and each subsection analyzes one of these groups. Given the large amount of information and works, a discussion that takes into account all the aspects presented in this manuscript, highlighting common outcomes and challenges, as well as the peculiarities of each application field and future directions, can be found in
Section 7.
Section 8 draws the conclusions.
2. Neuromorphic Cameras
The biological vision system has been optimized over hundreds of millions of years of evolution and has excellent image information perception and highly parallel information processing capabilities. The retina transmits information in the form of images, shadows, and colors to the brainstem through a crossover (the optic chiasm), where the final visual information is extracted and useless visual information is discarded; recognition then takes place in the brain once the final processing is complete. With the continuous development of vision sensor arrays, the performance of traditional imaging systems that capture brightness at a fixed rate is constantly being improved. However, the amount of raw data collected is also increasing, making data transmission and processing more and more complex and demanding. In contrast, living organisms efficiently process sensory information in complex environments thanks to a well-established hierarchical structure, the co-location of computation and storage, and very complex neural networks. Several types of image-sensing systems that simulate the biological visual structures of humans and animals have been proposed [
8]. We use
neuromorphic sensors to refer to those bio-inspired devices that try to mimic the sensing and early visual-processing characteristics of living organisms [
9].
Neuromorphic vision is generally divided into three levels: the
structural level, which imitates the retina; the
device functional level, which approaches the retina; and the
intelligence level, which surpasses the retina. In neuromorphic vision sensors, optoelectronic nanodevices simulate the biological visual sampling model and information processing functions, and a perception system with abilities matching or exceeding those of biological vision is constructed, under limited physical space and power consumption constraints, using simulation engineering techniques such as device function-level approximation [
10]. A schematic diagram that shows the analogies between the human visual system (top) and neuromorphic vision (bottom) is reported in
Figure 3.
Similarly to conventional cameras, a neuromorphic vision sensor, also known as an
event camera,
address-event representation (AER), or
silicon retina, is composed of pixels, but, instead of capturing intensity, each pixel asynchronously captures intensity changes in the scene that exceed user-defined thresholds. The camera outputs streams of events, where the event $e_k$ is defined by
$$e_k = (\mathbf{x}_k, t_k, p_k), \tag{1}$$
with $\mathbf{x}_k = (x_k, y_k)$ denoting the pixel position with the spatial coordinates where the event is triggered, with $t_k$ the timestamp, and with $p_k$ the polarity of the event, which can be defined as ON (positive)/OFF (negative), or also $p_k \in \{+1, -1\}$, to distinguish an increase or decrease in intensity from darker to brighter values or vice-versa, exceeding given thresholds. The event data in a time sequence are recorded as $\mathcal{E} = \{e_k\}_{k=1}^{N}$, with $N$ being the number of total events in the time interval $\Delta T$.
In other words, a single event occurs if there is a change in the brightness magnitude (from the last event or the initial brightness condition) that reaches a contrast threshold $+C$ for positive (ON) changes or $-C$ for negative (OFF) events. Events are triggered asynchronously, timestamped with microsecond resolution, and can be transmitted with very low latency (generally of the order of sub-milliseconds).
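As a minimal sketch of the event generation rule just described, assuming the common logarithmic-intensity model with a symmetric contrast threshold (the names C and last_log_I are illustrative and not tied to any specific sensor API):

```python
import numpy as np

C = 0.2            # assumed contrast threshold: +C for ON events, -C for OFF events
last_log_I = None  # per-pixel log intensity stored at the last emitted event

def generate_events(log_I: np.ndarray, t_us: int):
    """Compare the current log-intensity image against the per-pixel reference and
    emit (x, y, t, p) tuples wherever the change in magnitude reaches the threshold."""
    global last_log_I
    if last_log_I is None:                 # initial brightness condition
        last_log_I = log_I.copy()
        return []
    diff = log_I - last_log_I
    ys, xs = np.nonzero(np.abs(diff) >= C)
    events = [(int(x), int(y), t_us, 1 if diff[y, x] > 0 else -1) for y, x in zip(ys, xs)]
    last_log_I[ys, xs] = log_I[ys, xs]     # reset the reference only where events fired
    return events
```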
Figure 4 shows an example of different sensor outputs when recording a white rotating PC fan in the
x-y-t space: the frame-based camera (right) grabs a frame depending on its internal clock, and thus as a function of the configured exposure time and frame rate. This means that blur will be present when the rotation speed is high with respect to the camera frame rate, and that identical frames will be captured when the scene is static. In the case of event cameras, a continuous stream is captured, where only the pixels observing a change from dark to bright, or vice-versa, are activated by the movement. In the case of high rotation speeds, the motion is still completely captured. On the contrary, in the case of no motion and no noise, no events are streamed.
The fast response of a sensor capable of capturing events at a high rate makes it suitable for accumulating events over time, to better grasp the scene. To create an output representation comparable to the images of frame-based cameras, it is possible to proceed as follows: all the events that occurred in a time interval are mapped to image coordinates according to their pixel position, the most recent event is kept in the case of multiple events at the same pixel, and the image is thus formed at a frame rate that is a function of the accumulation time. The case of “no event” at a given pixel is usually modeled by assigning the value 0 to that pixel. In this way, a video stream can be created. An example of an image obtained by accumulating events over time, acquired with an event camera in an indoor navigation scenario, is reported in
Figure 5a. ON and OFF events are rendered as blue and black pixels, respectively, on a white background. At this point, it is fundamental to highlight how the functioning of event cameras introduces a paradigm shift, where the output is sparse and asynchronous instead of dense and synchronized. Moreover, the output is no longer a set of grayscale intensities, but a stream of events $\mathcal{E}$
, as defined in Equation (
1). This poses new challenges in terms of camera setup and computer vision algorithm design: first of all, classic computer vision algorithms are not easily or directly applicable, since they are designed to work with a fundamentally different information source. This has opened a new research area that investigates alternative representations that are more suitable for algorithms dedicated to event processing and/or that can facilitate the feature extraction phase of computer vision pipelines. Refer to [
11] for more details on event representation. Moreover, the captured stream of events strongly depends on the values of the different configurable thresholds (
biases) associated with the event camera acquisition scheme, making an experimental setup stage necessary, where the values depend on the specific application context and the environmental conditions.
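A minimal sketch of the accumulation scheme described above, assuming events are stored as an (N, 4) array with columns x, y, t, p following Equation (1); the function name and array layout are illustrative:

```python
import numpy as np

def accumulate(events: np.ndarray, height: int, width: int,
               t_start: float, t_end: float) -> np.ndarray:
    """Project the events falling in [t_start, t_end) onto the pixel grid, keeping
    the most recent polarity at each pixel; pixels with no events remain 0."""
    frame = np.zeros((height, width), dtype=np.int8)
    sel = events[(events[:, 2] >= t_start) & (events[:, 2] < t_end)]
    sel = sel[np.argsort(sel[:, 2])]                 # temporal order
    x = sel[:, 0].astype(int)
    y = sel[:, 1].astype(int)
    p = sel[:, 3].astype(np.int8)
    frame[y, x] = p                                  # later events overwrite earlier ones
    return frame
```

Calling such a function over consecutive windows yields a video stream whose frame rate is the inverse of the chosen accumulation time.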
Early event-based sensors presented high noise levels, making them unsuitable for real-world and commercial applications [
12]. The development of optoelectronic materials, together with advancements in robotics and biomedical fields, has led to a new generation of event-based cameras that have been adopted by research and industry. The major differentiating parameters of event cameras with respect to frame-based cameras are their latency, dynamic range, power consumption, and bandwidth.
The dynamic vision sensor (DVS) [
13] emulates a simplified three-layer retina to realize an abstraction of the information flow through the photoreceptor, bipolar, and ganglion cells. Internally, the photoreceptor stage outputs a voltage that logarithmically encodes the photocurrent, and a differencing circuit amplifies the changes with high precision. This output voltage is compared against global thresholds, offset from the reset voltage, to detect increasing (ON) and decreasing (OFF) changes.
In practice, DVS has become a synonym for event camera, although it only represents a possible technology and thus a type of neuromorphic camera. In fact, other families of event cameras are also available. For example, an asynchronous time-based image sensor (ATIS) [
14] encodes both relative changes and absolute exposure information, thanks to an exposure measurement unit. Absolute instantaneous pixel illuminance is acquired by converting the integrated photo charge into the timing of asynchronous pulse edges. The disadvantage is that the pixels are at least double the area of DVS pixels. Another way to combine relative changes with static information is in a dynamic and active pixel vision sensor (DAVIS) [
15]. The difference is that the ATIS is still a purely asynchronous, event-driven technology, while in the DAVIS an active pixel sensor (APS) retrieves static scene information as in a frame-based camera, and events can be overlaid on top of this representation. Two frames with both color and event information from the dataset released in [
16] are reported in
Figure 6. Being based on frame-based principles, DAVIS sensors have limited dynamic range compared to DVS, and they display redundancies in the case of static scenes. Refer to [
17] for more information on these families of sensors, and to [
12] for a discussion that also includes other emerging technologies.
7. Discussion
From the analysis of the state of the art in the different application fields, it is possible to derive findings that are application-domain-dependent, as well as observations that are directly related to the technology under examination. Before stating the conclusions related to the different application fields, it is important to analyze common challenges and opportunities.
First of all, consideration must be given to low- and high-level tasks. We already highlighted how a low-level component addresses a simple task that can be leveraged for downstream applications. In this survey, we did not focus on tasks, given the wide availability of task-oriented analyses in the literature, and moved the center of attention to the application domain instead. It is nevertheless useful to link computer vision tasks with the final application. A diagram showing this association, limited to the works introduced in
Section 6, is reported in
Figure 9. It is possible to observe how, apart from very specific tasks like pose estimation in the context of human pose analysis, single tasks serve as building components that transcend multiple application domains, illustrating the principle that a core computational function can be both pivotal and ubiquitous across the analyzed application domains. As with traditional imaging, this interconnectivity highlights the integral role of computer vision tasks in advancing the capabilities of technological systems.
Generally speaking, a critical analysis of the progress made by the research community on single lower-level tasks in neuromorphic computer vision makes it possible to state that the application of event cameras in both research and industry can only grow with time; in practice, the range of applications of such sensors will depend only on the creativity with which approaches are combined or research outcomes are adapted from a specific context to a different application field. Thus, integrating the application-driven analysis and the specific application outcomes of this work with other surveys that cover tasks, and/or the internal tasks associated with a single application domain, is crucial to obtaining a deep understanding of the challenges and opportunities.
The computer vision tasks required for the applications summed up in
Figure 9 also pose the interesting question of how the analyzed state of the art compares with traditional imaging systems. A quantitative comparison with traditional sensors has often been regarded as
unfair in the literature [
11,
26], due to the different time scales over which these tasks have been investigated by the neuromorphic engineering community and the traditional computer vision community, and the consequently different maturity levels of the respective solutions. In addition, the advantages of event cameras make neuromorphic sensors more or less suitable depending on the working conditions, so it is essential to consider the final application context and environment. Apart from these considerations, some qualitative observations can still be made. For instance, for detection and classification, traditional imaging algorithms applied to conventional full-scene acquisitions tend to perform better, due to a richer information content. In contrast, in the case of tracking, event cameras can offer advantages due to their high temporal resolution and low latency, particularly in high-speed or high-dynamic-range scenarios. As for segmentation, the performance can be highly dependent on the specific context. Finally, for SLAM (but this is very often also valid for the other aforementioned tasks), the best outcomes can be achieved when event cameras and traditional cameras are integrated together or as part of multi-sensor solutions (e.g., LiDAR, GPS, etc.), leveraging the strengths of both technologies. In all cases, however, the application requirements, e.g., the acquisition conditions, such as low-dynamic-range scenarios or the necessity to acquire data at very high frame rates, are fundamental to deciding on the most suitable sensor. To conclude, generally speaking, frame-based solutions currently lead in terms of algorithm maturity and performance in many computer vision tasks. Hence, there is a need to develop new algorithms and analysis methods to fully exploit the potential of event-based acquisitions.
On the other hand, application-domain-based analysis alone lets us observe how each application tries to exploit the different advantages of switching to the event-based paradigm. It is not easy to make a systematic comparison with other sensors, since they have been investigated for decades, while the off-the-shelf availability of event cameras is very recent in comparison. Moreover, it is possible to observe the absence of an outstanding neuromorphic sensor technology: the best option to employ from the different families of neuromorphic sensors strongly depends on the application specifications and constraints, without any privileged choice. The introduced paradigm shift has led to several ways to represent events and to model them in a machine-efficient way. As such, it is quite evident that there is no best representation, and very often the best choice depends on the specific application, as extensively analyzed in [
11,
186,
187]. More research is necessary, to investigate new event stacking models and encoding schemes that can use existing solutions and architecture to process event data, achieving a better performance. The biggest bottleneck in this field is possibly the lack of a comprehensive theoretical framework to formally describe and analyze event-based sensing and algorithms [
49]. As with event representation, this gap makes the translation of traditional machine learning algorithms difficult and, as a consequence, deep learning architectures must be adapted and modified to efficiently process events. Moreover, these sensors introduce a set of trade-offs, such as the optimal balance between latency and power consumption, and expose several parameters whose settings can drastically change the acquired data. We think that these issues cannot be mitigated without an eye to the application domain, working conditions, and specific priorities, e.g., processing time vs. accuracy.
All the analyzed application domains certainly share the necessity of obtaining ad hoc data to train and test neural network architectures. It can be observed how researchers building event-based approaches often have to start from scratch and have to acquire their own data due to a lack of available datasets. Very few task-oriented datasets are available with full-frame sequences. Having common datasets is fundamental to comparing and evaluating methods, as well as for implementing machine learning-based solutions, as happens with traditional imaging. In many cases, we observed how the dataset was particularly associated with the specific goal of the paper, making it impossible to summarize the datasets introduced in the analyzed literature. It is possible that the availability of such sensors will favor the introduction of shared data that researchers will use to evaluate and test their solutions. At the same time, many event camera simulators have been proposed in the literature [
188]. Simulators of event cameras based on data coming from frame-based cameras work by imitating changes in intensity with respect to time from standard imaging system data. However, upsampling strategies are usually employed to simulate the high temporal resolution of event cameras when using traditional sensors working at lower frame rates. Since continuous pixel visual signal data are not available, an interpolation step is necessary to reconstruct a linear approximation of the continuous underlying visual signal. Examples of simulators with available implementations can be found in [
188,
189,
190]. To this end, creating synthetic datasets with event camera simulators represents a unique opportunity with a two-fold advantage. First of all, it makes it possible to convert frame-camera datasets, which are much more numerous than event datasets; in this way, the quantity of data necessary to train deep neural networks can be obtained, or backbone networks can be pre-trained with (many) simulated data and fine-tuned with (fewer) real data. On the other hand, simulators allow fruitful research to be performed even when such sensors are not available. Nonetheless, it remains crucial for the scientific community to propose and make available specific datasets based on real event data, both to establish a common framework for algorithm evaluation and in light of the limitations of simulators in terms of realism, event camera bias/noise modeling, and how well neural networks trained on synthetic events generalize to real data.
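As an illustration of the simulation principle described above (not the method of any specific simulator cited here), the following sketch linearly interpolates log intensity between consecutive video frames and emits events whenever the interpolated signal crosses an assumed contrast threshold:

```python
import numpy as np

def frames_to_events(frames, timestamps, C=0.2, upsample=10):
    """frames: list of float grayscale images with the same shape; timestamps in seconds.
    Returns a list of (x, y, t, p) tuples. The threshold and upsampling factor are
    illustrative assumptions."""
    eps = 1e-3
    ref = np.log(frames[0] + eps)                    # per-pixel reference log intensity
    events = []
    for f0, f1, t0, t1 in zip(frames[:-1], frames[1:], timestamps[:-1], timestamps[1:]):
        l0, l1 = np.log(f0 + eps), np.log(f1 + eps)
        for k in range(1, upsample + 1):             # linear approximation of the signal
            alpha = k / upsample
            log_I = (1 - alpha) * l0 + alpha * l1
            t = (1 - alpha) * t0 + alpha * t1
            diff = log_I - ref
            ys, xs = np.nonzero(np.abs(diff) >= C)
            for y, x in zip(ys, xs):
                events.append((int(x), int(y), float(t), 1 if diff[y, x] > 0 else -1))
                ref[y, x] = log_I[y, x]              # reset reference where an event fired
    return events
```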
Specific observations can instead be stated by separately analyzing the application domains.
For animal and environmental monitoring, it is possible to conclude that, although cameras and inertial measurement units (IMUs) are fundamental for autonomous navigation in GPS-denied conditions, it is still necessary to explore new sensors to deal with problems like direct sunlight or darkness [
191]. It is true that weight and cost are two obstacles to the adoption of event cameras in scenarios like UAVs; however, considering how recent advancements are shrinking such chips to sensor sizes comparable to those of smartphone cameras, a rapid expansion in this application field appears to be only a matter of time.
The principal challenges related to the adoption of event cameras, as with other vision sensors for surveillance and security applications, are certainly related to privacy issues. In fact, interconnected and ubiquitous data acquisition systems not only become more powerful but also more vulnerable. While the increased adoption of AI presents an opportunity to address various challenges, AI models are also vulnerable to advanced and sophisticated attacks, such as evasion [
192] or poisoning [
193]. These vulnerabilities represent a challenge in the adoption of event cameras in security systems. The usage of sensors that by design preserve the person’s identity could certainly help to address these privacy-related constraints. One notable example shows how event cameras can be adopted to achieve anonymized re-identification [
194], because they can better guarantee the anonymity of subjects, without the privacy concerns of frame-based cameras in public spaces. With traditional images, it is common to apply face blurring or masking, as well as encryption techniques, but this is far from ensuring end-to-end privacy.
195,
196], and this might constrain event-based privacy-preserving person re-identification. Apart from the aforementioned weaknesses, while it is crucial to address these issues (e.g., protection from reconstruction attacks [
197]) before achieving robust privacy-preserving systems, neuromorphic computer vision certainly represents a step towards privacy-oriented surveillance.
For industrial applications and machine vision, event cameras represent a unique opportunity to achieve unprecedented quality control, even in complex lighting environments. It is important to highlight that our analysis showed that the nature of the sensor alone, although promising for such tasks, is not sufficient, and more complex algorithms are required before achieving performance comparable to multi-sensor systems or industrial ad hoc setups. Moreover, further research should address the monitoring of areas where workers and machines cooperate, to achieve next-generation safety levels; in this case, considerations similar to those previously stated for privacy issues are valid: the advantage of event cameras in this application field is that they capture only a fraction of the visual information compared to frame-based vision, naturally hiding sensitive visual details.
For space exploration and SSA, preliminary works have shown very promising results. Undoubtedly, the key features for the adoption of neuromorphic computer vision for such applications include lower bandwidth and power requirements, thus being suitable for remote processing, e.g., space-based platforms. This often implies new challenges, such as the necessity to modify electro-optic instruments to support the event-based sensors and the pre-existing equipment simultaneously, and/or a specific calibration setup for the different reference systems. The results are often very exploratory, showing that a lot of research must still be performed; considering the unique characteristics of such cameras, we can expect a growth in such applications in the next few years.
Eye tracking, gaze estimation, and driver monitoring systems are receiving growing interest in the neuromorphic computer vision research community. In fact, the nature of these problems related to the detection and tracking of specific features can certainly benefit from event cameras. From the analysis of the state of the art, it emerged how, as this new potential application is only in its initial stages, there is plenty of room for improvement; moreover, it was observed how works are progressively switching from multi-sensor systems to being purely event-driven. The future of such systems will also depend on the final application scenario; for example, for eye segmentation or an unconstrained gaze tracker, the brightness information that DAVIS sensors deliver can lead to single-sensor systems, while, when the objects of tracking are saccadic movements or the requirements demand keeping the frame rate high, multi-sensor systems seem to be privileged. However, both event and traditional camera solutions still have several gaps to fill. State-of-the-art analysis shows that frame-based and event cameras can bring together the best of both worlds and provide multiple modalities to deal with problems that appear to be better addressed in this way, rather than in the single data-source domain alone.
Similar considerations are also valid for gesture recognition, action recognition, and human pose estimation. Moreover, it is interesting to observe that the core problem here is how to learn spatio-temporal contextual information from events. In fact, using a predefined accumulation interval may result in adverse effects like motion blur and a weak texture [
143]. While research outcomes are showing promising results, the problem of converting event data into a proper representation remains an open issue, as widely observed.
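One commonly used way to retain timing information inside an accumulation window, instead of collapsing it into a single frame, is a voxel grid that spreads each event over a few temporal bins with bilinear weights. The sketch below is a generic illustration under an assumed array layout and bin count, not the representation of any specific work discussed here:

```python
import numpy as np

def voxel_grid(events: np.ndarray, height: int, width: int, bins: int = 5) -> np.ndarray:
    """events: (N, 4) array with columns x, y, t, p; returns a (bins, H, W) tensor
    where each event contributes its polarity to the two nearest temporal bins."""
    grid = np.zeros((bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3].astype(np.float32)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (bins - 1)  # scale to [0, bins-1]
    b0 = np.floor(t_norm).astype(int)
    w1 = t_norm - b0                                                    # weight for the upper bin
    np.add.at(grid, (b0, y, x), p * (1.0 - w1))                         # lower bin
    np.add.at(grid, (np.clip(b0 + 1, 0, bins - 1), y, x), p * w1)       # upper bin
    return grid
```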
An analysis of state-of-the-art works for medicine and healthcare showed the feasibility of using neuromorphic sensors for elderly activity monitoring, and many works on fall detection have been proposed. Furthermore, an active and growing research field is investigating the usage of event cameras coupled with other data sources to realize assistive devices, in particular for visual impairment. As already observed, when sensitive data are transmitted, it is clear that realizing privacy-preserving systems is crucial for the adoption of such sensors in real scenarios, as investigated in [
198,
199]. As a final note, the authors in [
157] stated that there are no established studies that have delved into the utilization of event cameras for specific biomedical purposes. However, their work suggested that event cameras possess attributes conducive to innovative applications in healthcare and medicine. Together with our proposed analysis, we can conclude that, as for other application fields with few works, an impressive growth in works in these domains can be expected.
The most investigated application field for neuromorphic cameras is certainly robotics and, as a consequence, intelligent transportation systems, where the primary research focus is on tasks associated with scene interpretation. Endowing robots with neuromorphic technologies represents a very promising path toward machines that can be seamlessly integrated into many automated tasks. Neuromorphic cameras could revolutionize the robot sensing landscape. In particular, these sensors are privileged because of their fast reaction time, which enables low-latency algorithms for perception and decision making. Nevertheless, the development of end-to-end systems that fully integrate event-based processing from the perception step to the actuation step still requires much research.
Future Directions
Event cameras hold significant potential in many scenarios, and their algorithms are rapidly evolving. Nevertheless, conventional frame-based vision has made unprecedented achievements, has been investigated for a longer time, and is continuously improving. The application-oriented analysis in the previous paragraphs provided evidence of how event-based vision represents a vibrant area of ongoing research. Where this was evident from the state-of-the-art analysis, future developments related to a specific application field were reported. Still, some general observations about the future directions of neuromorphic cameras can be stated. The capacity to revolutionize both the analyzed and new application fields will necessarily depend on several factors. First of all, the adoption of neuromorphic sensors will grow as the manufacturing costs of these cameras continue to decrease. Secondly, as researchers delve deeper into their potential, more innovative and successful uses will be identified. The future of event cameras also lies in addressing the challenges of deep-learning-based computer vision techniques, a consequence of the paradigm shift of dealing with streams of events. From these premises, the future of event cameras looks encouraging, with steady development and potential mass production on the horizon.
Moreover, to date, neuromorphic cameras have only implemented a small subset of the principles of the biological vision system. For example, a complete understanding of all ganglion cells is still lacking [
200]. New computational models and SNNs that can efficiently process the spiking nature of output data have been proposed and represent a very active research field. Implementing SNNs in conventional Von Neumann machines limits their computing efficiency, due to an asynchronous network activity that leads to quasi-random access of synaptic weights in time and space, as each spiking neuron should ideally be its own processor, without a central clock [
41], and because of their highly parallel nature with amalgamated memory and computational units. Ongoing research in neuromorphic hardware is targeted towards the optimization of the execution of SNNs, to close the gap between the simulations of SNNs on Von Neumann machines and biological SNNs [
201]. Eventually, neuromorphic cameras realized as silicon retinas will face limitations due to circuit complexity, large pixel areas, low fill factors, and poor pixel response uniformity [
12]. This shows how breakthroughs in neuroscience and neuromorphic research, from both algorithm and hardware perspectives, remain crucial.
A report from Gartner© from 2020 estimated that
“by 2025, traditional computing technologies will hit a digital wall, forcing the shift to new computing paradigms such as neuromorphic computing” [
202]. In light of this, it can be cautiously stated that neuromorphic cameras hold revolutionary potential for widespread application in fields requiring computer vision. However, this prediction underscores the need for continued research from both industry and academia.