1. Introduction
Since its creation at IROS 2016, the Autonomous Drone Racing (ADR) competition has posed the challenge of developing an autonomous drone capable of beating a human in a drone race. The first editions of this competition gathered research groups whose initial solutions broke the challenge down into three main problems: (1) gate detection; (2) drone localisation on the race track for control and navigation; (3) suitable hardware for onboard processing. As reported in [1,2], gate detection was first attempted with colour-based segmentation algorithms, followed by initial efforts using convolutional neural networks for more robust gate detection [3].
For drone localisation, visual simultaneous localisation and mapping (SLAM) and visual odometry techniques were employed to provide global localisation on the race track, which teams exploited to implement waypoint-based navigation systems [2]. Further improvements added local pose estimation with respect to the gate in a top-down navigation scheme, where the controller drives the drone along a global trajectory that is refined as the drone flies toward the next gate [4,5]. The Game of Drones competition at NeurIPS [6] called upon researchers to set aside hardware constraints and runtime efficiency and focus on high-level navigation strategies, while pushing for the incorporation of machine learning methods into the solution. These included gate detection and direct flight command prediction from the drone's camera images using convolutional neural networks (CNNs) [7,8].
The Alpha Pilot competition pushed again for solutions that consider real flight conditions and the limitations imposed by having to process data on board the drone [9,10,11]. Similar to the efforts in previous editions of the ADR at IROS, teams relied on CNNs for gate detection, visual SLAM/odometry for global and local pose estimation, and control techniques for waypoint-based navigation. Nevertheless, across most of the implementations presented in these competitions, from IROS to Alpha Pilot, custom drones stand out as the best performers, since they allow developers to accommodate specialised hardware, for instance, depth cameras, high-speed inertial measurement units (IMUs), and, in particular, graphics processing units (GPUs) used to run CNNs.
Recently, Foehn et al. [12] presented a solution that, for the first time, enabled a drone to beat humans on a race track, achieving flight speeds of up to 18 m/s. However, the backbone of this solution is a motion capture system that provides global pose estimation of the drone. Although the results are impressive and show that an autonomous drone can beat a human on a race track, humans do not use motion capture systems to succeed in the competition, and deploying such a system on large, real-life race tracks would not be feasible. Thus, the question remains open as to whether onboard processing for state estimation and localisation (global and/or local) can match the performance achieved with a motion capture system.
However, if onboard processing has to be performed for computationally heavy tasks such as gate detection and drone localisation (state estimation), solutions are expected to rely heavily on specialised hardware such as GPUs. An alternative is smart sensors such as event cameras [13], where only relevant information is acquired, or SCAMP cameras, where each pixel has its own processing unit [14]. Preprocessing sensor data in this way could reduce the computational load and enable faster processing, ultimately helping to bridge the gap between the demand for high-speed state estimation and the constraints of onboard processing. Motivated by the above, in this paper we present preliminary results of adapting a novel sensor, the OpenCV AI Kit (OAK-D), to the autonomous drone racing challenge. The OAK-D (Luxonis Holding Corporation, Littleton, CO, USA) is a smart camera capable of performing neural inference on the chip while acquiring images from three cameras simultaneously: a 4K colour camera and two monochromatic cameras working as a stereo rig for depth estimation. The OAK-D does not use a GPU for neural inference but instead a Movidius Myriad X processor specialised for neural network operations. Inspired by the first works in the IROS ADR, we trained a CNN for gate detection and developed a flight controller that uses the detections to pilot the drone toward each gate and cross it. This model runs on the OAK-D, using the colour camera as the input to the neural model. Thus, gate detection is performed on a frame-by-frame basis, and its output is acquired directly from the OAK-D and sent to the host computer on board the drone, an Intel Compute Stick, where the controller runs.
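As a concrete illustration, the following is a minimal sketch of how such an on-device detection pipeline can be configured with Luxonis' DepthAI Python API. The blob file name, input resolution, number of classes, and anchor values are illustrative assumptions, not the exact configuration used in our system.

    import depthai as dai

    # Build a pipeline in which the gate detector runs on the OAK-D's Myriad X,
    # so the host only receives ready-made bounding boxes.
    pipeline = dai.Pipeline()

    cam = pipeline.create(dai.node.ColorCamera)
    cam.setPreviewSize(416, 416)              # input resolution expected by the network (assumed)
    cam.setInterleaved(False)

    nn = pipeline.create(dai.node.YoloDetectionNetwork)
    nn.setBlobPath("gate_tiny_yolo_v3.blob")  # placeholder path to the compiled model
    nn.setConfidenceThreshold(0.3)
    nn.setNumClasses(2)                       # assumed: background and gate classes
    nn.setCoordinateSize(4)
    nn.setAnchors([10, 14, 23, 27, 37, 58, 81, 82, 135, 169, 344, 319])  # standard Tiny-YOLOv3 anchors
    nn.setAnchorMasks({"side26": [0, 1, 2], "side13": [3, 4, 5]})
    cam.preview.link(nn.input)

    xout = pipeline.create(dai.node.XLinkOut)
    xout.setStreamName("detections")
    nn.out.link(xout.input)

    with dai.Device(pipeline) as device:
        q = device.getOutputQueue("detections", maxSize=4, blocking=False)
        while True:
            for det in q.get().detections:    # normalised [0,1] bounding-box coordinates
                print(det.label, det.confidence, det.xmin, det.ymin, det.xmax, det.ymax)

In this arrangement, the host never touches raw frames for detection purposes; it only reads the detection packets streamed over USB.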
According to our experiments, a processing speed of 40 Hz was achieved for both detection and flight command output. This allowed the drone to fly at a speed of 2 m/s. Given the space restrictions in our laboratory, our experiments were limited but sufficient to show that the OAK-D can be coupled effectively on board the drone. Furthermore, we designed a new OAK-D back case to reduce weight, making the sensor lighter to carry on board, which also benefits battery consumption. The CAD models for this new back-case design will be made available as open-source material without restriction. From our experiments, we consider the OAK-D an alternative for efficient onboard processing for the ADR challenge that could also be exploited to perform other computationally heavy tasks such as localisation and navigation.
Therefore, to the best of our knowledge, this is the first study on the use of the OAK-D sensor for the ADR problem. Our experimental results provide insights into the potential of this sensor and its spatial AI capabilities to address the first, but essential, part of the ADR challenge: the gate detection problem. The latter can be compared against other solutions implemented in previous ADR competitions using specialised hardware such as GPUs. As more research is carried out on the ADR challenge, it is clear that the new methods and hardware used to tackle it will pave the way to developing new applications requiring agile flight with capabilities for autonomous navigation and object detection. For instance, the primary task of gate detection in ADR could be exploited in applications such as parcel delivery or site inspection, where a drone would be required to enter a house or facility through a window detected automatically by its perception system [15,16]; additionally, autonomous navigation and agile flight could be used in search and rescue applications, where a drone has to fly over a disaster zone as quickly as possible, not only to find survivors but also because battery life is limited [17,18].
To present our approach, this paper is organised as follows:
Section 2 presents a review of the state of the art;
Section 3 presents our framework, from the new back-case design for OAK-D to the trained neural network and our flight controller based on gate detection;
Section 4 describes our experiments; finally, our concluding remarks are presented in the Conclusions.
2. Related Work
In autonomous navigation for drone racing, data processing is one of the critical tasks, as the data received from the sensors are used to interpret the environment. One of the main problems is the perception of the environment and the latency between one frame and the next, which directly affects the vehicle's reaction and its speed. To mitigate this problem, researchers have resorted to complementary information obtained from sensors such as IMUs, LiDAR, ultrasonic sensors, and optical flow. In addition, classical computer vision techniques have increasingly been replaced by deep learning, since they are susceptible to lighting conditions and less reliable at detecting gates, for example, in the presence of overlapping structures or changing environments. In [3], the authors report that colour-based detection methods achieved a considerably lower success rate than deep learning when detecting gates in the same environment.
The use of deep learning has revolutionised the development of autonomous vehicles thanks to its robustness and versatility in providing solutions for autonomous drone racing, for example, gate detection, relative positioning [19,20,21,22,23], flight commands [8,24,25], actions [2], speed, and even the direction of the drone [4,19]. In addition, deep learning makes it possible to transfer the knowledge acquired in simulation environments to the real world [21,26].
While deep learning increases the drone's capability to interpret its environment, it places high demands on computational resources, creating bottlenecks in the processing pipeline. Therefore, computers with specialised image-processing units or GPUs are employed to distribute the operations and avoid saturating the CPU. For example, various authors [3,27] used embedded computers with GPUs, such as the NVIDIA Jetson TX1 and TX2 (NVIDIA Corporation, Santa Clara, CA, USA), to improve navigation performance, as the GPU speeds up the inference of deep learning networks. Jung et al. [27] used an NVIDIA Jetson TX1, which has a quad-core ARM processor and a 192-core Kepler GPU. They employed a single shot multibox detector (SSD) network for gate detection and reported that, using input images with VGA resolution at 60 Hz, they obtained detections at a rate of 10 Hz. In [3], the authors used an NVIDIA Jetson TX2 board, which has a quad-core ARM processor, an NVIDIA Denver2 dual-core CPU, and a 256-core Pascal GPU. In addition, they designed a new network (ADR-Net) that allowed them to speed up gate detection, reporting that, using VGA-resolution input images at 30 Hz, the gate was detected at a rate of 28.95 Hz with an accuracy rate of 75%. In other cases, such as [10], the authors used an NVIDIA Xavier board, which has an eight-core CPU, an NVIDIA Volta GPU, a vision accelerator, and a deep learning accelerator; these were used to process camera images captured at a frequency of 60 Hz and to obtain position estimates at 35 Hz. However, running network inference on computers without a GPU reduces the detection speed. For example, Cabrera et al. [28] reported a rate of 2.36 Hz using an Intel Compute Stick (Intel Corporation, Santa Clara, CA, USA) with 2 GB of RAM. Later, Cabrera et al. [29] reported a detection speed of up to 13.01 Hz using an Intel Compute Stick with 4 GB of RAM. Kaufmann et al. [19] used an Intel UpBoard and reported a detection speed of 10 Hz. An alternative for improving detection speed is to use cameras with onboard processors that facilitate visual information processing, such as the 3D cameras manufactured by Intel, event cameras, SCAMP (pixel-processor array) cameras, and, more recently, the smart cameras created by Luxonis [30].
The development of drones as part of Industry 4.0 is of international interest, boosting research, innovation, and technological development of unmanned aerial systems (UASs), which include unmanned aerial vehicles (UAVs), popularly known as drones. Drones have been used to improve areas such as agriculture, construction, logistics, search and rescue operations, and inspection tasks. For example, in agriculture [31,32], drones can provide analyses of crop diseases, yield assessments, and identification of where fertiliser is needed and where water is wasted in farming; this information improves farmer productivity for growing more food in a time-efficient manner. In construction [33,34], drones provide geographic measurements, cartography, overall protection of the area, inspection of critical infrastructure, transportation inspection, and real-time measurement of work progress. Additionally, drones have been used in logistics to perform warehouse inventory and to transport and deliver packages [15,31,35], reducing risks and ensuring timely execution.
Drones used in search and rescue operations [32], such as after hurricanes, forest fires, and accidents along rivers and coastlines, can quickly sweep a given search area and detect humans. They can also be used to protect endangered species from extinction [32]: scientists can analyse historical data on poaching incidents using recorded videos and deep learning to predict when and where animals are likely to be hunted, so that rangers and drones can patrol these target areas to prevent poaching. Finally, the inspection of power lines, infrastructure, wind turbines, pipelines, chimneys, and gas leaks [35,36,37] using drones facilitates the work of companies in the industrial sector, since drones provide thermography, photogrammetry, and scanning information that minimise expenses and risks.
Autonomous flight in confined spaces presents scientific and technical challenges due to the energetic cost of staying airborne and the spatial AI required to navigate complex environments. Moreover, these applications demand drones capable of providing large amounts of data and analysing them in real time. In addition, drones must detect obstacles using techniques such as colour segmentation [38], deep learning with RGB cameras [39,40], or depth information combined with deep learning [41], while supporting obstacle-free navigation [42], monocular depth prediction [43], and agile and precise flight [44].
The latter tasks require continuous information with low latency in the communication among processes to avoid failures such as collisions. Hence, one strategy is to divide the tasks into several modules executed in parallel, so that the drone can provide information through its cameras and, at the same time, detect risky situations. Additionally, deep learning has been used to detect multiple objects of interest, improve navigation [45], and obtain 6DOF positioning [46] or relative positioning [5,23]. Even reinforcement learning has been used to enhance flight precision [39,47]. However, for these processes to run in real time, specialised hardware such as GPUs or smart cameras is required to speed up the execution and distribution of the processes.
4. Results and Discussion
To develop our system, we used the Robot Operating System (ROS) [53]. This allowed us to implement independent task-oriented node packages that communicate through the ROS network protocol via publishers and subscribers. In this manner, we were able to run nodes on the drone's onboard computer, an external computer carried by the drone whose purpose is to serve as host to the OAK-D smart camera, to run the flight controller, and to publish low-resolution colour images from the OAK-D. These published data are received by the ground control station (GCS). For network communication, we used the local network opened by the Bebop; the only necessary condition was that the onboard host computer run the ROS master and that the GCS connect to it as a client.
A general block diagram of the ROS nodes implemented in our system is shown in Figure 9. This diagram indicates which nodes run on the onboard computer, the Intel Compute Stick (green boxes), and which run on the GCS (orange box), an Alienware laptop; as stated before, both computers are connected via the Bebop's local network (documentation on how to use the Bebop's network with ROS can be found at https://bebop-autonomy.readthedocs.io/en/latest/ (accessed on 1 November 2021)). The same figure shows the ROS node that communicates with the OAK-D camera to receive the colour, right, and depth images (we do not use the left image) and the gate detections. This node encodes these data into ROS messages that are published to ROS topics. This is the main data stream to which the rest of the ROS nodes subscribe (green arrows). After each node performs its corresponding task, its results are also published to ROS topics (red arrows in the figure). To carry out our experiments, we used an Alienware laptop (Alienware Corporation, Miami, FL, USA) as the GCS. The ROS master ran on the Intel Compute Stick on board the drone, as shown in Figure 10. It is important to remark that the GCS was used only for data visualisation, that is, for the camera images published by the ROS nodes running on the Intel Compute Stick and for telemetry. The Bebop local network was enough to establish communication and receive these data; however, the video streaming had a lag of 2 or 3 s and, sometimes, due to external signal interference in the facility, this lag increased by a few more seconds.
We considered the Bebop and the OAK-D as external processing units, since these devices have their own processors. The Bebop has its own internal computer responsible for creating and managing its local network, running its inner controller, and receiving high-level flight control commands (from manual control or autonomous flight). The OAK-D smart camera has an internal neural processor that manages the data stream and runs the convolutional neural network for object detection. This is a great advantage, since it is similar to having a GPU but without the significant space and weight that an embedded GPU would require.
Object detection therefore becomes computationally inexpensive for the host computer, our onboard Intel Compute Stick, which only consumes the data through the ROS nodes. From our point of view, these are the reasons that make the OAK-D smart camera a valuable asset for small drones with limited payload and modest computing power.
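To illustrate how the detections produced on the OAK-D reach the rest of the system, the following sketch shows a minimal ROS node that forwards the centre of each detected bounding box as a ROS message. The topic name, the message type, the preview size, and the helper that builds the DepthAI pipeline are illustrative assumptions, not the exact implementation used in our system.

    import rospy
    import depthai as dai
    from geometry_msgs.msg import PointStamped

    def main():
        rospy.init_node("oakd_gate_bridge")
        pub = rospy.Publisher("/oakd/gate_center", PointStamped, queue_size=1)

        # Assumes a DepthAI pipeline with an XLinkOut stream named "detections"
        # feeding the on-camera gate detector.
        pipeline = build_gate_detection_pipeline()   # hypothetical helper

        with dai.Device(pipeline) as device:
            q = device.getOutputQueue("detections", maxSize=4, blocking=False)
            img_w, img_h = 416, 416                  # preview size used by the network (assumed)
            while not rospy.is_shutdown():
                packet = q.tryGet()                  # non-blocking read from the camera
                if packet is None:
                    continue
                for det in packet.detections:
                    out = PointStamped()
                    out.header.stamp = rospy.Time.now()
                    out.header.frame_id = "oakd_color"
                    # Convert normalised coordinates to the pixel centre of the bounding box.
                    out.point.x = 0.5 * (det.xmin + det.xmax) * img_w
                    out.point.y = 0.5 * (det.ymin + det.ymax) * img_h
                    out.point.z = det.confidence     # confidence stored in z for convenience (illustrative)
                    pub.publish(out)

    if __name__ == "__main__":
        main()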
We performed flight tests in an indoor environment, as shown in Figure 11, an area of approximately 4.8 × 19.2 m. Due to the limited space, we set up only two gates, 2 and 2.5 m high, each with an aperture of 1 × 1 m. These are very narrow gates, but we wanted to assess the performance of our approach under these conditions, seeking to evaluate the effectiveness of the controller in commanding the drone to cross each gate. For our experiments, we varied the position of the gates to obtain different combinations. At the outset, the drone was placed 9 m before the first gate, and the second gate was placed 8.5 m after the first one. This is illustrated in the schematic view in Figure 12. For the control system, we used an approach similar to that implemented in [3]. This is a state-machine controller that seeks to centre the drone with respect to the gate in terms of the X and Y image axes. For that, the controller uses as input the gate detection, represented in terms of pixel coordinates corresponding to the centre of the bounding box where the gate is detected by the neural network. Recall that the neural network for gate detection runs on the OAK-D camera and its output is published to a ROS topic, which is consumed by the controller running on the Intel Compute Stick.
Regarding the state machine, first, we require that the confidence value of a class 1 (gate) detection be greater than 0.3. Then, we obtain the bounding box dimensions, its centroid, and the maximum and minimum coordinate positions. Finally, we determine the detection closest to the drone using the bounding box area, since more than one gate may be detected simultaneously in one image.
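A minimal sketch of this selection step is shown below, assuming that each detection carries normalised bounding-box coordinates, a class label, and a confidence value as returned by the OAK-D; the function and field names are illustrative.

    def select_closest_gate(detections, img_w, img_h, conf_threshold=0.3, gate_label=1):
        """Pick the gate detection with the largest bounding-box area (the closest gate)."""
        best, best_area = None, 0.0
        for det in detections:
            if det.label != gate_label or det.confidence < conf_threshold:
                continue  # discard other classes and low-confidence detections
            w = (det.xmax - det.xmin) * img_w
            h = (det.ymax - det.ymin) * img_h
            area = w * h
            if area > best_area:
                cx = 0.5 * (det.xmin + det.xmax) * img_w
                cy = 0.5 * (det.ymin + det.ymax) * img_h
                best = {"centroid": (cx, cy), "area": area,
                        "bbox": (det.xmin, det.ymin, det.xmax, det.ymax)}
                best_area = area
        return best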
In state 1, two proportional controllers are applied: one to control roll movements and one to control elevation. These controllers seek to reduce the error between the gate's centroid and the image centre. This error in pixels is calculated as follows:

e_x = \frac{x_g - x_I}{W}, \qquad e_y = \frac{y_g - y_I}{W},

where x_I and y_I are the image centre in pixels, and x_g and y_g correspond to the centroid of the detected gate, both coordinates in the colour image. Note that these errors are normalised with respect to the image width W. These normalised errors are used to implement the proportional controllers for both axes, as mentioned above. We can then define our proportional controllers as follows:

u_{roll} = K_{roll}\, e_x, \qquad u_{elevation} = K_{elev}\, e_y,

where K_{roll} and K_{elev} are the proportional gains for each axis.
Note that the rotation controller maintains the orientation of the drone according to a reference yaw angle. For the signal in the pitch angle (forward/backward motion), we used a constant value that makes the drone keep a fixed angle with respect to its body frame. This value was set experimentally to 0.4, which provided a trade-off between speed and enough time for the other controllers to align the drone with the gate centre, thus enabling the crossing and yielding an average flight speed of 2.0 m/s.
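The sketch below shows how these proportional controllers could be implemented on the host, assuming the gate centroid comes from the selection step above and that flight commands are published as a geometry_msgs/Twist on the Bebop's cmd_vel topic (as exposed by bebop_autonomy). The gain values and sign conventions are placeholders, not the values tuned in our experiments.

    import rospy
    from geometry_msgs.msg import Twist

    K_ROLL, K_ELEV = 0.8, 0.8     # illustrative proportional gains
    PITCH_CMD = 0.4               # constant forward command, as described in the text

    def centre_on_gate(gate_centroid, img_w, img_h, cmd_pub):
        """One control step of state 1: align the drone with the detected gate."""
        x_g, y_g = gate_centroid
        x_i, y_i = img_w / 2.0, img_h / 2.0

        # Errors between gate centroid and image centre, normalised by the image width.
        e_x = (x_g - x_i) / img_w
        e_y = (y_g - y_i) / img_w

        cmd = Twist()
        cmd.linear.y = -K_ROLL * e_x   # roll: translate left/right to cancel the horizontal error
        cmd.linear.z = -K_ELEV * e_y   # elevation: image y grows downward, hence the sign
        cmd.linear.x = PITCH_CMD       # constant forward motion toward the gate
        cmd_pub.publish(cmd)
        return e_x, e_y

    # Usage (assumed bebop_autonomy topic name):
    # pub = rospy.Publisher("/bebop/cmd_vel", Twist, queue_size=1)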
Once centred, if the area of the bounding box remains greater than a threshold, the drone transitions to state 2. Otherwise, the drone must approach the gate and, if it remains centred while doing so, it proceeds to state 3.
In state 2, the drone reduces its speed to reach an almost hovering state and then transitions to state 3. In state 3, the drone is considered centred with respect to the gate and moves forward until it crosses it, continuing forward until a new gate is observed with a sufficiently large bounding-box area in pixels; when this occurs, it transitions back to state 1.
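The following skeleton summarises the three states described above. The centring tolerance and area thresholds are illustrative assumptions, and the detection dictionary follows the selection sketch shown earlier.

    ALIGN, BRAKE, CROSS = 1, 2, 3

    def step_state_machine(state, det, e_x, e_y,
                           centred_tol=0.05, near_area=30000.0, new_gate_area=5000.0):
        """Return the next state given the current detection and centring errors (illustrative)."""
        centred = det is not None and abs(e_x) < centred_tol and abs(e_y) < centred_tol
        if state == ALIGN:
            if centred and det["area"] > near_area:
                return BRAKE              # close to the gate and centred: slow down before crossing
            if centred:
                return CROSS              # centred while approaching: go through the gate
            return ALIGN
        if state == BRAKE:
            return CROSS                  # after reaching near-hover, move forward to cross
        if state == CROSS:
            if det is not None and det["area"] > new_gate_area:
                return ALIGN              # a new gate is visible: align with it
            return CROSS
        return ALIGN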
We note that we did try to use the depth inference of the OAK-D camera. However, the depth reported for the gate was inconsistent. This was expected, since the depth reported by the sensor is an average of the depth values found within the detected bounding box and, physically, the gate frame was less than 4 cm thick. In future work, we will explore whether the monochromatic cameras and the depth can be exploited to improve gate detection. For now, running the neural model on the colour image was sufficient for the experiments in this work.
Figure 13 shows images of the GCS while testing the drone in indoor environments. Telemetry, detections, and control signals are displayed in the image.
Figure 14 shows images of the drone’s performance recorded with an external camera. In this scenario, ten flights were performed where the drone successfully crossed the two gates, achieving a processing frequency of 40 Hz and flying at an average speed of 2.0 m/s. A video illustrating the flight performance can be found at
https://youtu.be/P1187astpe0 (accessed on 1 November 2021).
Figure 15 presents an example of the error in pixels of the gate detection with respect to the image centre in both the X and Y image axes.
According to the evaluation of the trained model described in Section 3.3, carried out on an evaluation image dataset, the expected accuracy for gate detection is 82.52%. However, to measure the network's performance during the real flight experiments, we randomly selected three out of the ten flights described before. For these flights, we stored each camera image with its corresponding bounding box returned by the CNN running on the OAK-D camera. This bounding box indicates where the network detected the gate in the image, as shown in Figure 13. To carry out the analysis in terms of the metrics used to evaluate our trained model, we considered the following: if the gate was found within this bounding box, or with at least 90% intersection over union, then it was counted as a true positive detection; otherwise, it was counted as a false positive. If the gate was observed by the camera but no detection was returned by the OAK-D, then it was counted as a false negative. In these experiments, the gate was always visible and close enough to the drone to be detected by the network; hence, there were no true negatives. In total, we analysed 3389 images; the results in terms of accuracy, precision, recall, and F1 score are shown in Table 1. Note that the precision increased to 96% with an F1 score of 0.89, meaning that the gate detector performed slightly better than expected. This analysis helps us appreciate the drone's success in detecting the gate correctly and flying toward each gate, since the number of false positives and false negatives did not affect the overall performance.
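For reference, the following sketch shows how the counts of true positives, false positives, and false negatives translate into the reported metrics when no true negatives occur. The IoU helper and the per-image loop are illustrative; the criterion of also accepting a gate fully contained in the bounding box is omitted for brevity.

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    def evaluate(samples, iou_threshold=0.9):
        """samples: list of (ground_truth_box, detected_box or None); no true negatives occur."""
        tp = fp = fn = 0
        for gt_box, det_box in samples:
            if det_box is None:
                fn += 1                                   # gate visible but not detected
            elif iou(gt_box, det_box) >= iou_threshold:
                tp += 1                                   # detection matches the gate
            else:
                fp += 1                                   # detection does not match the gate
        accuracy = tp / (tp + fp + fn)                    # no true negatives in these experiments
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1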
In Table 2, we present a comparison of relevant works using artificial neural networks for gate detection in the context of ADR. First, we compare the device responsible for running the network, indicating whether it has a CPU, a GPU, or a specialised processing unit (SPU); we also compare the type of network, the input image (resolution and frame rate), the rate at which the network performs inference, and the precision. Note that some authors [19,28] used devices without a GPU or SPU and obtained inference rates of 10 and 2.36 Hz, respectively; the authors in [28] reported an accuracy of 96%, although no precision was indicated. Jung et al. [3] obtained a high inference rate of 28.9 Hz, which is almost the camera's frame rate, using the NVIDIA Jetson TX2 GPU and a network called ADR-Net for gate detection, reporting an accuracy of 85.2% and a precision of 75.5%. The next work, reported in [10], used the NVIDIA Jetson AGX Xavier, a GPU card with superior capabilities specialised for running AI methods. This card has a CPU and a GPU, and implements mixed-precision and integer matrix multiply-and-accumulate instructions to accelerate dense linear algebra computations, signal processing, and deep learning inference. Furthermore, the card has two NVIDIA Deep Learning Accelerator (DLA) engines and two Vision Accelerator engines, which improve power efficiency and free up the GPU to run more complex networks. Given these features, it is not surprising that gate detection is performed at 60 Hz, with an accuracy of 96.4% and a precision of 99.7%. However, according to the authors, their visual odometry thread runs at 35 Hz, forcing them to use a filtering technique to synchronise the outputs of both modules, which means that the fast gate detection is not fully exploited.
Finally, although the NVIDIA Jetson AGX Xavier is the highest-performance embedded device that can be carried on board a drone, the OAK-D camera also achieves a high inference speed of 40 Hz using only the Intel Movidius Myriad X. This SPU combines vision, image processing, and deep learning inference in real time, allowing us to implement the Tiny YOLOv3 network for gate detection with an accuracy of up to 81% and a precision of 96%. Although this precision is lower than that obtained with the Xavier card, we argue that a potential advantage is that the data are processed on the chip. That is, the OAK-D camera receives the camera image, performs the neural inference directly on its neural processor, and sends the detection to the host computer. In contrast, the camera used in [10] has to transmit image data through the communication system on board the drone to the GPU card, introducing latency even when running at 60 Hz. This could be an issue for agile flight, as mentioned in [54]. Currently, the OAK-D's colour camera can acquire images at a frame rate of 60 fps, whereas the stereo monochromatic cameras can reach 120 fps; hence, in our future work, we will investigate neural architectures that can run at these frame rates, seeking to determine whether this could help address the latency issue.
5. Conclusions
We presented the initial results of adapting a novel smart camera to become part of a solution for Autonomous Drone Racing. This new camera is the OpenCV AI Kit, also known as the OAK-D, a sensor capable of performing neural inference without using a GPU while acquiring images from a colour camera and from two monochromatic cameras arranged as a stereo rig from which a depth image can be obtained. All of this image processing and neural inference runs on the camera's chip, freeing the host computer from these intensive operations. To the best of our knowledge, this is the first study on the use of this sensor as part of a solution to the ADR challenge.
To carry out this adaptation, we first modified the back case of the OAK-D camera, since its original weight of 187 g is considerable for our drone, whose maximum payload is 200 g. Thus, we designed a new back case and produced it with a 3D printer; we have released this model to be used by the community. Once the drone maintained a stable flight with the OAK-D on board, we designed and implemented a navigation controller based on gate detection using a neural model. Our gate detector runs at 40 Hz, with the controller producing flight commands at the same processing speed. This controller runs on an Intel Compute Stick on board the drone. We carried out several experiments in which the drone crossed gates in an indoor environment, achieving an average flight speed of 2 m/s.
We argue that these initial results show that the gap between intensive processing and limited computing power can be bridged by using this type of novel sensor, which does not require a GPU, whose processed data impose no cost on the host computer, and whose frame rate has the potential to be increased further. In this context, this is a promising result toward developing an autonomous drone that does not require external sensors such as a motion capture system and that can process complex tasks on board, aiming to achieve truly autonomous performance.
Finally, autonomous behaviour, i.e., operation without human intervention, is relevant to the ADR challenge but also to other applications that may benefit from the software and hardware solutions designed and implemented to address this problem. Today, Industry 4.0 calls for the development of technology that can support operations with high demand, such as logistics and delivery tasks, where drones have begun to be used as a viable option to overcome critical situations such as traffic jams and limited accessibility to remote places [15,31]. However, the deployment of multiple drones raises issues regarding interconnectivity, reliability, and safety [36,55]. The operation of autonomous drones may still seem distant, especially when their AI-based solutions face the problem of explainability for the sake of liability [56]. However, sensors such as the OAK-D have the potential to showcase what can be achieved with drones and AI, whose applications will extend beyond typical aerial video/photography and precision agriculture to other domains where AI-based operation and decision-making could aid or even outperform humans, for instance, in search and rescue applications, where agile and smart visual identification could rapidly find victims, and in safe parcel delivery through intelligent sense-and-avoid systems [34,35].
In the future, we will investigate more sophisticated methods for drone localisation, state estimation, and even flight command inference running onboard the OAK-D smart camera in tandem with the host onboard computer.