1. Introduction
Event-based imaging is considered a new paradigm in computer vision because of its distinct representation of visual data [1,2,3,4]. Event cameras use bioinspired vision sensors that record relative pixel-level intensity changes in a scene [5], as opposed to capturing frame-based intensity images. These illumination changes are referred to as events. An event is a four-dimensional tuple e = (x, y, t, p) that records a pixel-intensity change at the position (x, y) and the time t. The polarity p is 0 for decreasing and 1 for increasing pixel-intensity changes. Event cameras do not rely on a global clock; each pixel operates independently, and an event is recorded only when a change in illuminance is detected at that pixel. This allows for a very high temporal resolution of an event stream, which is usually measured in microseconds [1]. In general, event cameras offer significant advantages over conventional cameras, such as ultra-low latency, no motion blur, low power consumption, and high dynamic range [4].
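To make the event format concrete, the sketch below stores a short stream of such (x, y, t, p) tuples in a NumPy structured array; the field names, dtypes, and microsecond timestamps are illustrative assumptions rather than the format of any particular camera driver.

```python
import numpy as np

# Illustrative layout for a stream of events e = (x, y, t, p);
# field names and dtypes are assumptions, not a specific camera's format.
event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column
    ("y", np.uint16),   # pixel row
    ("t", np.int64),    # timestamp in microseconds
    ("p", np.uint8),    # polarity: 0 = intensity decrease, 1 = increase
])

events = np.array(
    [(120, 45, 1_000_002, 1), (121, 45, 1_000_017, 0), (64, 200, 1_000_031, 1)],
    dtype=event_dtype,
)
print(events["t"])  # microsecond-resolution timestamps, one per asynchronous event
```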
These properties make event cameras suitable for robotics and computer vision applications. Indeed, after the release of the first commercial event camera in 2008 [1], these sensors have found use in a variety of settings. For example, an Internet of Things (IoT) crime detection application used an event camera to transmit data to a processing node only when an event was detected [6]. The reduced energy consumption and bandwidth afforded by the event camera made the IoT application more efficient than traditional surveillance systems, which remain operational regardless of whether activity is observed in the field of view. The absence of motion blur and the high dynamic range enable event cameras to acquire reliable visual data during high-speed motion. In [7], these features of event cameras were integrated with frame-based imaging and depth images to address the challenges of low-light scenes and rapid motion aboard an autonomous quadrotor. In addition, in [8], event cameras were used in micro air vehicles to obtain optical flow information for navigation. Event cameras were also tested for daytime and nighttime traffic data collection, monitoring vehicle speeds in the range of 20 to 300 km/h and achieving a near-zero mean error and a standard deviation of 2.3 km/h in speed estimation [9]. A recent work [10] has also shown that combined RGB and event-based modalities achieve high accuracy in lip reading.
Despite these numerous benefits and use cases, the adoption of event cameras remains limited. This is likely due to the scarcity of datasets collected with event cameras and the lack of tools and algorithms for event-based processing, especially high-performance deep learning pipelines.
Facial images contain a wealth of features that are highly relevant to conveying messages in interpersonal communication [11]. Facial expressions convey information not only about a person's emotional state, but also about cognitive states, such as interest, boredom, confusion, and stress [12]. From the perspective of cognitive psychology, the face is a key distinctive feature in determining an individual's identity. The human cognitive system has evolved to extract structural codes from faces, capturing the components of facial structure that enable us to distinguish between faces. In fact, there is a dedicated region in the brain for processing facial visual stimuli, the fusiform face area (FFA) [13].
As with the human cognitive system, faces play an important role in artificial intelligence applications. Face recognition systems are widely deployed, primarily as a biometric tool for human identity validation [14], in access control systems to enable secure environments [15], and to determine the location of a missing person [16]. Face detection is used for video conferencing, crowd surveillance, intelligent human–computer interfaces, content-based image retrieval, and video coding [17], while facial landmarks are used to create virtual avatars via head pose estimation [18] or even to diagnose health disorders, such as fetal alcohol spectrum disorder [19].
This paper focuses on face detection and facial landmark detection directly from event streams in controlled and uncontrolled environments. The main contributions of this work are as follows:
We release an open-source, large-scale, event-based dataset (689 min of recorded event streams) captured under different lighting conditions, from different viewpoints and distances, with multiple people in the scene, and with a larger number (73) and greater diversity of participants than existing event-based face datasets (see Figure 1).
The dataset is fully and accurately labeled with bounding boxes of faces and five-point facial landmarks (eye centers, nose tip, and mouth corners) for all subjects in a scene, providing over 1.6 million annotated faces in different environments labeled at a rate of 30 Hz.
We present a dedicated deep learning model for face detection and facial landmark detection. The model, adapted from [20], uses a hybrid architecture combining convolutional and recurrent layers. Our model is the first deep learning model to use recurrent layers for face detection from event streams. It outperforms its well-known frame-based counterpart [21] in accuracy.
With the open-source data and models, our work can serve as a benchmark for face and facial landmark detection for event-based cameras. We conducted extensive experiments showing real-time face detection from the direct output of event streams.
The rest of the paper is organized as follows: In Section 2, we provide an overview of existing event-based face datasets and state-of-the-art models for face and facial landmark detection from event streams. Section 3 presents our dataset, including the data collection and annotation steps. In Section 4, we explain the underlying model used in the face detection task, as well as enhancements made on the base model to improve the performance and to predict facial landmarks. Information on model training and experimental results is provided in Section 5, and a related discussion can be found in Section 6. We conclude our work in Section 7.
2. Related Work
In most applications, face detection and facial landmark detection (see Figure 2c) serve as the first mandatory steps. A substantial amount of research has been conducted in visual [22,23,24], thermal [25,26], depth [27,28], and other domains [29] for face and facial landmark detection. State-of-the-art face detection algorithms rely on deep learning networks [30], particularly convolutional neural networks (CNNs) [31,32], reinforcement learning [33], generative adversarial networks [34], and hybrid architectures [35].
Minaee et al. [36] presented a comprehensive review of face detection methods for conventional cameras, starting with a discussion of early methods, such as Haar cascade classifiers [37] and the histogram of oriented gradients (HOG) [38], and moving to more sophisticated approaches introduced during the deep learning wave. In addition to a comparison between model architectures, their performance on well-known benchmarks [39,40,41,42,43] was also reported. Deep-learning-based face detection models were classified into the following categories: cascade CNN-based models, R-CNN-based models, single-shot detector models, feature pyramid network-based models, Transformer-based models, and other architectures. One of the most impressive results was obtained by a single-shot detector called RetinaFace [44], which was reported to achieve an average precision (AP) score of 91.4% on the WIDER FACE hard test set.
Data-hungry deep learning models require large datasets for training. Therefore, extensive research has been dedicated to the collection and annotation of face and facial landmark datasets in different domains. WIDER FACE [39] and the MegaFace Challenge [45] for the visual domain contain 393,703 and 1 million faces, respectively. Several notable RGB face datasets are included in Table 1, along with metrics highlighting the specific features of each dataset. In the thermal domain, the recently presented TFW dataset [46] contains 9982 images of 147 subjects. For depth images, the Microsoft Kinect was utilized to create the KaspAROV database, featuring 108 subjects recorded in 432 videos [47]. Another depth dataset with extreme head poses, called Pandora, was presented in [48]. Created specifically for driver pose estimation, it contains 110 annotated sequences with 22 subjects. There are even datasets for artificial faces; for instance, Zheng et al. introduced iCartoonFace, containing 60,000 images with 109,810 cartoon faces [49].
Despite significant advances in face detection in the visual, thermal, and depth domains, to the best of our knowledge, there are only a few face datasets recorded with event-based cameras (see Table 1). The Face Pose Alignment dataset [50] consists of 108 clips with a total duration of 10.2 min, acquired during moderate and intense head motion. The eyes and mouth in this dataset are labeled with bounding boxes. The dataset has some drawbacks, such as the low resolution of the sensor (304 × 204), the small number of participants (18), the short duration of the recorded data, and the limited variability of face poses and camera angles. Another dataset was introduced by Lenz et al. [51], in which the authors recorded event streams of faces for eye blink detection. This dataset contains 50 videos, 25 of which are annotated for face and eye position. However, due to the short duration of the collected videos (13.5 min) and the small number of participants, this dataset does not provide the large amount of data needed to train deep neural networks.
One way to circumvent the dataset problem for event cameras is to synthesize event-based data from traditional frame-based data (RGB or grayscale images), as in [54], but this method may not capture the full advantage of the asynchronous nature of event-based cameras and requires significant computing power for processing. Another popular way to avoid adapting computer vision algorithms is to reconstruct grayscale images from event streams and apply deep learning models trained on visual images to the reconstructed data [54,55,56,57]. The appeal of this method lies in the availability of standard, well-established computer vision algorithms that become applicable through the grayscale transformation. While event data offer all the information needed for full frame-based image reconstruction [54] (see Figure 2), the intermediate transformation incurs a high computational cost and reintroduces the latency issues inherent to conventional frame-based images [58].
These issues motivated researchers to use event streams directly for face detection and facial landmark detection. In [51], researchers used event streams directly for real-time face detection based on a person's natural movement (i.e., eye blinking). Since the event camera stops generating events when there is no motion, this method relies on the repetitive nature of eye blinks: when the bounding boxes of both eye blinks are detected, the boundaries of the face are determined with an accumulation time of 250 ms. However, this method is sensitive to the camera resolution; specifically, the model can detect a face at a distance of up to 150 cm from the camera. In a similar study [59], the authors developed the GR-YOLO neural network, which detects the eye blink region and uses it to determine the face area based on facial landmarks.
One of the pioneering methods for the direct detection of faces from event streams is described in [60]. Barua et al. used the HOG [38] as input features and a random forest of 50 trees as the learning algorithm. The results showed that face detection with this method was comparable to the performance of the Viola–Jones detector on original and reconstructed intensity images at a rate of 2000 frames per second. Ramesh et al. [61] also presented a method for face detection directly from event streams. In this work, kernelized correlation filters (KCFs) were used along with an Adaboost classification framework for event-based face detection. The KCFs were reformulated to discriminate between facial and non-facial image areas instead of building upon handcrafted feature descriptors.
There are a number of advantages to performing face-related tasks specifically with event cameras. Apart from the inherent properties of event cameras that make them suitable for complex lighting conditions and dynamic scenes, it was shown in [52] that event cameras are also much better suited for capturing people's microexpressions. Becattini et al. [52] produced an event-based reaction dataset and a corresponding reaction classifier. As a preliminary step, the authors relied on a face alignment tool [62] for RGB images and an open event camera simulator (ESIM) [63] to extract cropped facial data by training a YOLOv3 [64] object detector. Another recent work on facial expression recognition was presented in [53], in which the authors introduced the NEFER dataset, consisting of paired RGB and event videos containing human faces annotated with face bounding boxes and facial landmarks as well as the corresponding emotions. The face detector was obtained using the same approach as in [52], except that it was trained on a YOLOv2 [65] object detection model. The authors reported twice the precision in recognizing seven emotions using the event-based approach compared to detection on RGB videos. In [52], the authors highlighted the absence of open-source direct face extractors for event data, and therefore both works [52,53] used alternative methods combining synthetic event data with face alignment to extract facial data before performing their primary tasks.
Overall, our dataset can be applied to a number of event-vision research tasks apart from face and facial landmark detection. The presented dataset can be used standalone or integrated with other event datasets [20] for intensity reconstruction. The dataset can also serve face recognition research.
4. Deep Faces in Event Streams (DFES) Model
In this section, we describe the architecture of our Deep Faces in Event Streams (DFES) model for face bounding box and five-point facial landmark detection. Our model is based on the model in [20] used for object detection in event streams. In that model, the researchers introduced a mechanism in which learning occurs through incoming events accumulated over a period of time and through the use of information from past events.
In our model, an event stream is represented as a sequence of events E = {e_i}, e_i = (x_i, y_i, t_i, p_i), where x_i and y_i are the pixel coordinates of an event with the polarity p_i, and t_i is the timestamp of the event e_i. The faces are denoted using axis-aligned bounding boxes b_j = (x_j, y_j, w_j, h_j, t_j), where x and y are the coordinates of the lower left pixel, w and h are the width and height of a bounding box, and t is the time at which a face is present in a scene. Similarly, five-point facial landmarks are denoted as a sequence of ordered points corresponding to the eyes, nose, and the left and right corners of the mouth, l_j = ((x1, y1), ..., (x5, y5), t_j).

The face detection problem can be formally defined as a mapping from E to the set of bounding boxes B = {b_j} and written as D_bb: E → B, where the detector should predict face bounding boxes using past events. Likewise, the combined detection of faces and facial landmarks can be defined as D_lm: E → (B, L), where L = {l_j} is the set of facial landmarks.
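The notation above maps naturally onto simple container types. The following sketch is a hypothetical illustration of the bounding box and landmark tuples defined in this section; it is not the annotation schema of the released dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FaceBox:
    """Axis-aligned face bounding box b = (x, y, w, h, t)."""
    x: float  # lower-left corner, pixels
    y: float
    w: float  # width, pixels
    h: float  # height, pixels
    t: float  # time at which the face is present in the scene

@dataclass
class FaceLandmarks:
    """Five ordered points (eyes, nose tip, left/right mouth corners) at time t."""
    points: List[Tuple[float, float]]  # exactly five (x, y) pairs
    t: float

# A face detector realizes the mapping D_bb: E -> B; the combined detector
# D_lm: E -> (B, L) additionally returns one FaceLandmarks entry per FaceBox.
```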
In this work, the detectors D_bb and D_lm are implemented using deep learning models (see Figure 6). Since applying a detector D to all past events is computationally intractable, incoming events during the period Δt are collected in an array for the interval [t − Δt, t]. This array is then transformed into a tensor map M_t using the histogram preprocessing function H, which groups each event during Δt into the corresponding cell depending on the x and y coordinates of the pixel and its polarity. The resulting tensor map M_t has the dimensions 2 × K × L, where K is the width and L is the height of the tensor map.
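A minimal sketch of this histogram preprocessing step is shown below, assuming the events of one time slice are given as NumPy arrays of pixel coordinates and polarities; the channel ordering and the absence of any normalization are our assumptions, not the exact preprocessing of [20].

```python
import numpy as np

def event_histogram(x, y, p, width, height):
    """Accumulate the events of one time slice Delta-t into a 2 x L x K tensor map.

    x, y, p are 1-D integer arrays of equal length (pixel coordinates and polarity).
    Channel 0 counts negative-polarity events, channel 1 positive-polarity events;
    this channel ordering and layout are illustrative assumptions.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(hist, (p.astype(np.int64), y.astype(np.int64), x.astype(np.int64)), 1.0)
    return hist

# Usage: events collected over Delta-t are binned into one tensor map M_t.
x = np.array([10, 10, 42]); y = np.array([5, 5, 17]); p = np.array([1, 1, 0])
m_t = event_histogram(x, y, p, width=640, height=480)  # shape (2, 480, 640), example resolution
```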
Afterward, the features F_t are extracted from the tensor map M_t. In [20], squeeze-excite layers [66] were utilized to obtain spatial information from the features. In our work, we also used the residual neural network (ResNet) architecture [67] for feature extraction. The advantages of the ResNet feature extractor [67] motivated us to test it in our work; the key advantage of this network is the residual block, which binds two convolutional layers using a skip connection. Specifically, we incorporated the ResNet-18, ResNet-34, and ResNet-50 variants into our network. The architecture and parameters of the feature extractors used for the DFES model are summarized in Table 5.

To exploit prior events, the detectors D_bb and D_lm depend not only on the events accumulated during the present period Δt, but also on the encoded information from the past, stored as an internal state h_{t−1}. A recurrent neural network architecture was utilized to generate the internal state vector h_t. Therefore, the output of the feature extractor was connected to a five-layer convolutional long short-term memory (ConvLSTM) network [68] that receives the internal state information h_{t−1} in addition to the extracted features F_t. Finally, the output of each ConvLSTM layer is fed into a two-head regression predictor consisting of convolutional layers to predict face bounding boxes and five-point facial landmarks, as implemented in [20]. This two-regression-head predictor detects objects at multiple scales: the feature extractor and ConvLSTM layers produce feature maps at different scales, and each scaled feature map from the ConvLSTM layers is fed into both regression branches as input to the predictor.
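To illustrate the overall data flow (histogram tensor → feature extractor → ConvLSTM → two regression heads), the PyTorch sketch below uses a ResNet-18 backbone, a single minimal ConvLSTM cell, and a single scale; the cell formulation, channel sizes, and head output layouts are simplifying assumptions and do not reproduce the exact DFES layers or hyperparameters.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (assumed variant, cf. [68])."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class DFESLikeDetector(nn.Module):
    """Single-scale sketch: histogram tensor -> ResNet features -> ConvLSTM -> two heads."""
    def __init__(self, hid_ch=256, num_anchors=1):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.conv1 = nn.Conv2d(2, 64, 7, stride=2, padding=3, bias=False)  # 2 polarity channels
        self.features = nn.Sequential(*list(backbone.children())[:-2])         # (B, 512, H/32, W/32)
        self.rnn = ConvLSTMCell(512, hid_ch)
        self.box_head = nn.Conv2d(hid_ch, num_anchors * (4 + 1), 3, padding=1)  # box offsets + face score
        self.lmk_head = nn.Conv2d(hid_ch, num_anchors * 10, 3, padding=1)       # five (x, y) landmark offsets

    def forward(self, m_t, state=None):
        f_t = self.features(m_t)
        if state is None:
            z = torch.zeros(m_t.size(0), self.rnn.hid_ch, *f_t.shape[-2:], device=m_t.device)
            state = (z, z.clone())
        h_t, c_t = self.rnn(f_t, state)
        return self.box_head(h_t), self.lmk_head(h_t), (h_t, c_t)

# Usage: feed consecutive histogram tensors, carrying the recurrent state between steps.
model = DFESLikeDetector()
m_t = torch.zeros(1, 2, 256, 320)      # one preprocessed event slice
boxes, landmarks, state = model(m_t)   # reuse `state` for the next time step
```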
To train our neural network, we used the cost function L = L_r + L_c, where L_r was a smooth L1 loss [69] for regression and L_c was a softmax focal loss [21] for classification (between the background and face classes). Our implementation of the softmax focal loss is identical to that in [20]. To describe the smooth L1 loss in our model, let b and l denote the ground truth values for the face bounding boxes and five-point facial landmarks, respectively. Similarly, b̂ and l̂ denote the predicted versions. If we combine the bounding box and facial landmark values for the ground truth and predicted values as g = (b, l) and ĝ = (b̂, l̂), we can write the smooth L1 loss function as follows:

L_r(g, ĝ) = Σ_i z_i,   z_i = 0.5 (g_i − ĝ_i)^2 / β   if |g_i − ĝ_i| < β,   and   z_i = |g_i − ĝ_i| − 0.5 β   otherwise,

where β is a tunable parameter to control the contributions of the L1 (mean absolute error) and L2 (mean squared error) losses. In this work, this value is the same as in [20], i.e., β = 0.11.
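This β-parameterized smooth L1 term matches the standard formulation available in PyTorch. A minimal sketch follows; the packing of the box and landmark targets into a single flat vector is our assumption.

```python
import torch
import torch.nn.functional as F

def regression_loss(pred, target, beta=0.11):
    """Smooth L1 loss over concatenated box and landmark regression targets.

    Behaves like an L2 loss for errors below beta and like an L1 loss above it;
    beta = 0.11 follows [20]. Packing pred/target into flat vectors is assumed.
    """
    return F.smooth_l1_loss(pred, target, beta=beta, reduction="sum")

# Example: 4 box values + 10 landmark values for one matched face.
pred = torch.zeros(14, requires_grad=True)
target = torch.full((14,), 0.05)
loss = regression_loss(pred, target)
loss.backward()
```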
7. Conclusions
Event cameras, a new type of retinomorphic sensor, have advantages over conventional frame-based cameras and find various applications in computer vision. However, due to the relative novelty of such cameras, there are only a few datasets for applying deep learning to event streams, which limits the application of these bioinspired sensors.
To address this problem, we prepared the FES dataset. To the best of our knowledge, this is the first large and diverse event-based camera dataset for face detection. The FES dataset contains 689 min of raw event streams recorded in in-the-wild and controlled environments with multiple face poses and distances. In addition, the dataset contains accurately annotated bounding box and facial landmark coordinates. Thus, the dataset is ready for the application of machine learning and other algorithms.
To validate the efficacy of our dataset, we trained six models for bounding box detection only and another six models for the simultaneous detection of facial landmarks and bounding box coordinates. The models were able to detect faces and facial landmarks accurately, showing the usefulness of the FES dataset, and they can run in real time. To stimulate further research in this area, we share our dataset, code, and trained models at https://github.com/IS2AI/faces-in-event-streams (accessed on 2 December 2023) under the MIT license.
Direct event-based face extraction can serve as a solid base for further face-related tasks such as emotion classification or face recognition. Previous works [52,53] demonstrated the superiority of the neuromorphic event-based approach over RGB-based models for emotion and microexpression recognition, which points to the richness of the features obtained from event-based facial representations. The current study showed the better performance of the DFES models in comparison with the Event-RetinaNet model, which relies on frame-based representations of event streams. Nonetheless, to fully establish the efficiency of direct event-based face detection over RGB-based face detection, a separate comparison study on event streams paired with RGB frames should be conducted to investigate the matter in broader terms. In terms of future work, we will focus on increasing the size of the dataset and improving the performance of the models by employing data augmentation and new deep network architectures. We also intend to expand upon this research topic by leveraging the presented dataset for face recognition tasks involving event streams.