1. Introduction
Motion estimation (ME) is a critical component in many video processing applications, such as video compression, object tracking, and video stabilization. ME works by predicting the movement of objects between video frames, allowing for more efficient data encoding and transmission. Common ME algorithms include Full Search (FS), Diamond Search (DS), Hexagonal Search (HS), and Test-Zone Search (TZS), each offering a different trade-off between computational complexity and accuracy. FS, while the most accurate, is computationally intensive, making it less suitable for real-time applications. DS, HS, and TZS, on the other hand, provide faster computation with slightly lower accuracy, making them more practical for applications requiring low latency.
In recent years, there has been growing interest in effective video compression for remote-controlled vehicles (RCVs), such as drones and other unmanned aerial vehicles (UAVs) [
1]. This is due to the increasing popularity of UAVs for applications such as aerial photography, videography, military operations, and entertainment [
2]. The entertainment sector has witnessed a significant boost with the rise of drone racing. This engaging sport involves pilots skillfully maneuvering UAVs through obstacle courses, aiming to complete the course in the quickest time possible. Because racing drones fly at very high speeds, it is crucial that the latency of video transmission from the drone to the pilot’s first-person view (FPV) goggles is kept as low as possible. This is important not only for the pilot’s performance, but also for the safety of the drone, as real-time video adaptation can help avoid collisions [
3]. Statistics indicate a growing need for low-latency video processing in drones and other technologies. For instance, drone racing requires video transmission latencies below 150 milliseconds to ensure that pilots can react swiftly to their environment. Similarly, applications in military surveillance and emergency response demand near-instantaneous video feedback to make timely and informed decisions.
To address this, our research focuses on reducing latency through video compression before network transmission, a challenging task under limited computing resources. Since ME is typically the most time-consuming part of video coding, we propose an innovative approach to accelerate it. This paper introduces a novel ME algorithm, Motion Dynamics Input Search (MDIS), which builds upon the foundation of the well-known DS algorithm while incorporating motion dynamics estimation that responds to real-time user interaction with the controller. Instead of searching all the points of the diamond for each block, MDIS searches only selected locations, based on the user input from the controller. Additionally, the search range is modified based on the location of the pixel in the image and the current drone movement type. While our research is centered on UAVs and FPV systems, the MDIS algorithm can be applied to any RCV system.
We evaluate MDIS on a drone-flying simulator which records user inputs for each frame. We also evaluate it on real 4K drone footage. By leveraging motion dynamics estimation in the ME process, our algorithm aims to enhance the efficiency, accuracy, and adaptability of ME for RCVs, with a primary focus on UAVs.
In summary, the main contributions of this paper are as follows:
The development of the MDIS algorithm, which integrates motion dynamics estimation and user interaction to enhance ME efficiency.
The comprehensive evaluation of MDIS using both simulated and real drone footage, demonstrating significant improvements in encoding speed and compression ratio without compromising video quality.
The implementation and testing of MDIS within an open-source HEVC video encoder, showcasing its practical applicability and performance advantages.
2. Related Work
The problem of ME in general and for UAVs has been a topic of significant interest in recent years. A key focus has been on improving the efficiency and accuracy of block-matching ME algorithms.
The authors of [
4] proposed an All-Direction Search (ADS) pattern, which searches for the best block in all possible directions. To further improve search speed, a halfway-stop technique was applied in the search process. The results showed that the proposed ADS algorithm outperformed other state-of-the-art ME algorithms. This work is particularly relevant as it presents a novel approach to block matching that can potentially enhance the performance of the DS algorithm. Wang and Yang [
5] proposed an improved fast ME algorithm for H.266/VVC, where they replaced the DS model with an HS model. They also increased the number of initial matching search points of the rotation HS model, which led to a significant reduction in the overall coding time. Hassan and Butt [
6] studied the effects of a meta-heuristic algorithm on ME. They proposed the Firefly algorithm for ME, which saved a considerable amount of time with a comparable encoding efficiency. Multiple other ME optimizations have also been successfully implemented in [
7,
8,
9,
10].
In the realm of user interaction, Yu et al. [
11] presented EyeRobot, a concept that enables dynamic viewpoints for telepresence using the intuitive control of the user’s head motion. This approach could potentially be adapted for the MDIS ME algorithm. Yoo et al. [
12] proposed a hybrid hand-gesture system that combines an inertial measurement unit (IMU)-based motion capture system and a vision-based gesture system to increase real-time performance. Their approach separates IMU-based commands and vision-based commands according to whether drone operation commands are input continuously.
The work by Varga et al. [
13] on the validation of a Limit Ellipsis Controller (LEC) for rescue drones presents an innovative control algorithm designed for UAVs in emergency situations. In their study, the LEC was adapted for indoor environments, showcasing its potential to improve drone stability and maneuverability in semi-structured settings. The integration of our algorithm could further improve their system by optimizing ME for both the visual and thermal cameras mounted on the UAV. Similarly, [
14] illustrates the significant utility of drones in outdoor safety and rescue operations, particularly in expediting the search for missing persons. The real-time video feed provided by UAVs is essential for drone pilots to make timely and informed decisions. In these scenarios, it is critical that the video feed is both promptly available and of high quality. Our algorithm aligns well with these requirements by enhancing the overall encoding speed, thereby ensuring quicker end-to-end video transmission. Generally, our algorithm could be implemented to improve ME in various domains, like construction, agriculture, or mining [
15,
16,
17,
18].
Classical forms of user interaction, such as joysticks, have been fundamental in RCV control, providing direct manipulation of the vehicle’s movement. Modern systems continue to use these traditional controls while integrating advanced features like vibration feedback and ergonomic design to enhance user experience and control accuracy [
19]. These systems are often complemented by templates or predefined paths that can guide the drone’s motion, as seen in traditional remote-controlled aircraft systems. Semi-autonomous systems, where a human operator sets high-level goals or waypoints for the drone to follow, also play a crucial role. For example, recent advancements have shown the integration of intuitive control systems like handheld controllers with high-resolution displays and robust wireless communication, significantly enhancing operational efficiency and user interaction. Systems like those described by Yogi et al. [
20] and Chen et al. [
21], where operators set goal points and the UAV autonomously navigates to them, demonstrate the balance between manual control and autonomy. Our algorithm could be applied in such systems, which require a video feed and where drone movements are roughly known in advance. With the estimated drone movements known, a drone direction output file (discussed in later sections) could be generated and applied in ME.
Our research aims to address the challenge of reducing search locations in the ME process by leveraging user inputs from the controller. In previous work, we introduced a basic version of the MDIS algorithm, called User Input Search [
22]. To our knowledge, no other research has optimized ME algorithms using controller input, making our approach unique and innovative. The main contributions are highlighted in the previous section.
3. Motion Dynamics Input Search Algorithm
In order to implement the MDIS algorithm, an open-source High-Efficiency Video Coding (HEVC) video encoder, Kvazaar [
23], was modified. Specifically, the DS algorithm, which is part of Kvazaar’s source code, was upgraded to incorporate MDIS. After conducting several preliminary tests, in which we encoded both simulator and real drone footage, we selected the DS algorithm as the most suitable basis for our specific use case. Among DS, TZS, and HS, there was no clear winner in terms of encoding quality; the superior method varied depending on the specific content of the video being encoded, with the quality difference, measured in terms of PSNR, never exceeding 1.7%. In terms of compression ratio, TZS consistently demonstrated superior performance, with files encoded using TZS being on average 1.1% smaller than the smallest file produced by either DS or HS. Despite this, DS emerged as the fastest of the three, outperforming the TZS algorithm by an average of 56% and the HS algorithm by an average of 2.3%. DS’s straightforward implementation in Kvazaar further solidified our choice. In summary, the DS and HS algorithms displayed comparable performance levels, whereas TZS appeared to prioritize a higher compression ratio while maintaining encoding quality, at the cost of increased encoding time.
Kvazaar’s ME algorithm flow with the use of DS is depicted in
Figure 1.
The first step is to determine the starting motion vector (MV), which is done with Advanced Motion-Vector Prediction (AMVP). As described in [
24], AMVP works by predicting the MV of a block in the current frame based on the MVs of spatially and temporally neighboring blocks. In
Figure 2, the candidates A0, A1, B0, B1, and B2 represent spatial candidates and TB and TC represent temporal candidates. This prediction is then used as a starting point for the ME process, which can significantly reduce the search time and computational complexity. AMVP is particularly effective in scenarios where the motion in the video is relatively consistent or predictable.
Figure 2.
HEVC AMVP scheme showing red rectangles as spatial candidates from the same frame and green rectangles as temporal candidates from different frames.
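As a rough illustration of AMVP’s role as a starting-point predictor, consider the following deliberately simplified sketch (real AMVP builds a pruned candidate list from the A and B neighbors and signals the chosen index; the function below only mimics the idea of reusing a neighbor’s MV):

```python
def amvp_start_mv(spatial_mvs, temporal_mvs):
    """Pick a starting MV from neighboring blocks (simplified stand-in).

    spatial_mvs:  MVs of the A0, A1, B0, B1, B2 neighbors (None if unavailable).
    temporal_mvs: MVs of the co-located TB, TC candidates (None if unavailable).
    """
    for mv in list(spatial_mvs) + list(temporal_mvs):
        if mv is not None:
            return mv      # first available candidate becomes the search start
    return (0, 0)          # no candidates: fall back to the zero vector

# Example: only B1 and TB are available; B1 is used as the starting MV.
print(amvp_start_mv([None, None, None, (3, -1), None], [(2, 0), None]))
```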
The next step is finding the best MV by modifying the starting MV. This process is carried out by iteratively searching the five points forming a diamond shape. The center is then shifted to the point with the best match. This iteration continues until the best match is found at the center point, or the number of steps exceeds the step limit. Searching a single point refers to calculating the Sum of Absolute Differences (SAD) and comparing it with the current best SAD. The SAD is calculated as

$$\mathrm{SAD}(A,B) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left| A(i,j) - B(i,j) \right| \qquad (1)$$

where A and B are matrices representing the pixel values of the two blocks being compared, and n and m are the dimensions of the blocks. If the calculated SAD is lower than the best (lowest) SAD found so far, the current point becomes the new best match. This second step is performed at integer-pixel precision, in other words, on the coarser full-pixel grid.
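For illustration, a minimal NumPy version of this cost computation follows (a sketch with our own naming, not Kvazaar’s implementation):

```python
import numpy as np

def sad(block_a: np.ndarray, block_b: np.ndarray) -> int:
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    # Cast to a signed type so the subtraction cannot wrap around on uint8.
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())

# Compare an 8x8 block of the current frame with a candidate block from the
# reference frame; the candidate with the lowest SAD is the best match.
rng = np.random.default_rng(0)
current = rng.integers(0, 256, (8, 8), dtype=np.uint8)
candidate = rng.integers(0, 256, (8, 8), dtype=np.uint8)
print(sad(current, candidate))
```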
The last step is fine-tuning the MV, by performing a fractional search. A fractional search, also known as a sub-pixel search, is a refinement step in ME that allows for more precise MV prediction. After the best match has been found at an integer pixel position, a fractional search is conducted around this position at a sub-pixel level. This is typically achieved by using interpolation methods to create a high-resolution grid between the integer pixels. The search is then conducted on this grid to find the best match with even greater precision. This process can significantly improve the accuracy of the MV and the overall quality of the video compression.
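For intuition, the following sketch builds such a grid with simple bilinear interpolation (HEVC encoders such as Kvazaar actually use longer interpolation filters, so this illustrates the principle rather than the codec’s filter):

```python
import numpy as np

def half_pel_plane(frame: np.ndarray) -> np.ndarray:
    """Upsample a pixel plane 2x, creating half-pixel positions between
    the integer pixels via bilinear interpolation."""
    f = frame.astype(np.float32)
    h, w = f.shape
    up = np.zeros((2 * h - 1, 2 * w - 1), dtype=np.float32)
    up[::2, ::2] = f                                  # integer positions
    up[1::2, ::2] = (f[:-1, :] + f[1:, :]) / 2        # vertical half-pels
    up[::2, 1::2] = (f[:, :-1] + f[:, 1:]) / 2        # horizontal half-pels
    up[1::2, 1::2] = (f[:-1, :-1] + f[1:, :-1] +
                      f[:-1, 1:] + f[1:, 1:]) / 4     # diagonal half-pels
    return up
```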
In MDIS, steps 1 and 3 are identical to those in DS. The difference lies in step 2, the integer block search. The main idea behind the algorithm is to consider the motion dynamics estimation that is influenced by user inputs received from the controller, as well as other case-specific parameters. The information received from the controller includes which joysticks were pushed at which video frame, as well as the push intensities.
For instance, if the drone is directed to move upwards, then it makes sense to focus the search only on the upward direction. The same goes for every other direction. The rationale behind this focused approach is the predominantly global nature of the movement in drone-captured footage, especially evident in drone racing scenes. In such scenarios, the drone’s motion effectively dictates the movement trajectory of nearly every pixel within the frame, rendering a comprehensive search across all potential diamond locations unnecessary. By knowing the drone’s directional movement within the current frame, MDIS optimizes the ME process by selectively searching only the relevant diamond locations, thereby reducing both ME and overall encoding times.
Figure 3 illustrates the difference between searched locations in DS and MDIS algorithms—locations marked with a red ‘X’ are searched, while the locations marked with a faded red ‘O’ are not searched in MDIS. In this particular case, the drone appears to be moving to the right, as evidenced by the exclusion of the left and middle search locations. This is merely one example; search locations can be selectively filtered based on the type of movement being analyzed.
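As a sketch of this location filtering for the simple movement types (the "right" entry mirrors the Figure 3 example, where a rightward-moving drone skips the left and middle locations; the remaining entries are our analogous guesses, not the exact MDIS masks):

```python
# Offsets of the diamond search points around the current centre.
DIAMOND = {"center": (0, 0), "up": (0, -1), "down": (0, 1),
           "left": (-1, 0), "right": (1, 0)}

# Locations kept for each movement type; '0'/undefined falls back to full DS.
SEARCH_MASK = {
    "undefined": {"center", "up", "down", "left", "right"},
    "right": {"up", "right", "down"},
    "left": {"up", "left", "down"},
    "up": {"left", "up", "right"},
    "down": {"left", "down", "right"},
}

def candidate_points(cx: int, cy: int, movement: str):
    """Yield only the diamond points allowed for the current movement type."""
    for name in SEARCH_MASK.get(movement, SEARCH_MASK["undefined"]):
        dx, dy = DIAMOND[name]
        yield cx + dx, cy + dy

# Example: a drone moving right searches three of the five diamond points.
print(sorted(candidate_points(100, 100, "right")))
```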
The Motion Dynamics Estimator Block in
Figure 4 can be imagined as a ‘black box’ which estimates motion dynamics depending on several parameters, such as the mounting point of the camera (given as an offset from the origin of the vehicle’s frame of reference), the default pose of its viewport, and user inputs from the controller. Whether the camera is firmly fixed or mobile (attached to a gimbal) is also considered in the MDIS algorithm. The Motion Dynamics Estimator Block integrates into the standard video encoder and instructs the Motion Estimation block which locations to search. It is located in the processing unit of the Unmanned System (UxS).
Given that tests in this study were performed on simulator footage as well as on real 4K drone sequences, the Motion Dynamics Estimator Block was customized to suit these particular scenarios.
To recapitulate, the semantic description of the MDIS algorithm is as follows:
Determine the starting MV—Use the AMVP scheme.
Collect data—Obtain data from the Motion Dynamics Estimator block (globally defined parameters mentioned earlier and user input from the controller).
Process data—Analyze the collected data to determine the appropriate search model for the current frame (detailed in
Section 3.2 and
Section 3.3).
Perform integer-block search—Use the determined search model to conduct the integer-block search.
Refine the MV—Apply fractional search to refine the MV.
3.1. Understanding Image Changes Based on Drone Movements
The movement of a vehicle in motion can be described by a combination of linear and angular motion in different dimensions. When it comes to drones, linear motion can be decomposed into velocity vectors, each parallel to one of the three axes (x, y, and z), while the angular motion can be decomposed into three rotational components, each around one of the three axes (
Figure 5).
Each rotational component of an aerial vehicle has a specific name: the rotation around the Longitudinal Axis of the vehicle (axis X) is called “roll”, the rotation around the Lateral Axis of the vehicle (axis Y) is called “pitch”, while the rotation around the Vertical Axis of the vehicle (axis Z) is called “yaw”.
The most common movement in drone scenes is the forward movement, which requires the drone to slightly pitch. When the drone moves forward, pixels tend to spread away from the center of the image (
Figure 6). The further away the pixel is from the center, the more it will expand (move toward the corner of the image) in the next frame. An example of a drone forward motion in a simulator can be seen in
Figure 7. Similarly, when the drone moves backward, pixels compress toward the center of the image. The next movement type is the yaw rotation movement. When the drone rotates to the right, pixels tend to follow the movement scheme illustrated in
Figure 8. Pixels located closer to the top or bottom of the image exhibit a more pronounced arc in their movement, while pixels located in the center of the image move in a straight line. Of course, when the drone is rotating to the left, the pixels move to the right, and vice versa for the right rotation. Additionally, the arc angle depends on the camera lens characteristics.
Pixel movements are much simpler when it comes to left–right and up–down drone motion. When the drone moves to the left it must perform a slight roll rotation first, and afterward, the pixels move straight to the right. If the drone is in an upward motion, the pixels move straight downwards. Hence, no complex search models were created for the roll and throttle movements. Simply, only the diamond locations that match the current direction of the drone motion are searched.
3.2. Search Model Creation
The MDIS model was designed with the described pixel movements in mind. The model creation starts by segmenting the image into nine equal regions. This is achieved by partitioning both the vertical and horizontal axes into three equal segments, resulting in a 3×3 grid. The next step is determining which diamond locations should be searched in each of the nine segments. We developed several search model versions, which varied in several parameters. For example, in one version of MDIS, the diamond shape is extended so that it also includes corner locations (
Figure 9). Different versions also varied in parameters such as the weight in the weighted probability function, or the probability threshold, described later on.
The general model design process will be described through the forward motion model. The search model for left–right motion is created analogously.
Figure 10 shows the forward motion search model in MDIS. The whole box represents a single image, which is segmented into nine segments, as explained earlier. In each segment, a modified diamond shape is drawn, where locations marked with ‘X’ represent locations that should be searched, and locations marked with ‘0’ represent locations that should be skipped. In other words, depending on the block location in the image, some search locations will be included and some will be excluded.
For example, in the forward motion search model shown in
Figure 10, blocks that belong in the center grid segment are searched for the most similar block in all possible directions. Similarly, blocks in the upper-left grid segment are only searched in the right and bottom directions. In this particular case, the diagonal search locations are also shown, which can be excluded in some search models.
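A minimal sketch of this per-segment filtering follows. Only the center segment (all directions) and the upper-left segment (right and bottom) are stated explicitly above; the remaining masks extend the same point-toward-the-center pattern and are our reconstruction of Figure 10:

```python
# Our reconstruction of the forward-motion DSM: for each of the nine grid
# segments (col, row), the diamond directions that are still searched.
FORWARD_DSM = {
    (0, 0): {"right", "down"}, (1, 0): {"down"}, (2, 0): {"left", "down"},
    (0, 1): {"right"},         (1, 1): {"up", "down", "left", "right"},
    (2, 1): {"left"},
    (0, 2): {"right", "up"},   (1, 2): {"up"},   (2, 2): {"left", "up"},
}

def grid_segment(x: int, y: int, w: int, h: int) -> tuple:
    """Map the block's upper-left pixel (x, y) to its 3x3 grid segment."""
    return min(3 * x // w, 2), min(3 * y // h, 2)

def allowed_directions(x: int, y: int, w: int, h: int) -> set:
    """Diamond directions to search for a block at (x, y) in a w x h frame."""
    return FORWARD_DSM[grid_segment(x, y, w, h)]
```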
We call this model the Discrete Search Model (DSM). While MDIS could be implemented following the DSM, we decided to improve it by creating the so-called Continuous Search Model (CSM). Instead of simply deciding whether a certain location should be searched depending on the block location, the CSM calculates a search probability for each search location.
The search probability is calculated from the block coordinates, and each search location uses a different function, inspired by the DSM. All the functions for each search location in a forward motion are displayed in
Figure 11. Every function was crafted to emulate the DSM location-filtering approach, with the additional goal of minimal computational complexity. The variable x denotes the x-coordinate of the upper-left corner of the current block, y denotes the y-coordinate of the same corner, w stands for the picture width, and h represents the picture height. Each grid segment refers to one search location. The small grayscale images inside each grid segment represent the 2D probability functions: the brighter the area, the higher the probability. These grayscale images match the resolution of the frames themselves. Consider the upper-left search location as an example: the closer the block is to the bottom-right corner of the image, the higher the probability that the upper-left search location will be searched. This can be seen in the small grayscale image in the upper-left grid segment, which grows brighter toward the bottom-right corner. Search probabilities are calculated for each block independently. At the start of the MDIS algorithm, a probability threshold is defined (0.5, for instance), and each search location whose search probability exceeds the threshold is searched.
Figure 12 is a snippet of the MVs generated by the DS algorithm in a case where the drone moved forward: all the vectors point toward the center of the image. This pattern consistently appears whenever the drone advances, illustrating that the MVs adhere to the forward motion model outlined by MDIS. One very important observation, already discussed in the previous subsection, is that the further away a pixel is from the center, the more it moves towards the edge of the image. This is also considered in the search model. When searching for the most similar block, the search range is multiplied by the weighted probability function—the same probability function that was calculated earlier. This is a great benefit of the CSM in comparison with the DSM. Not only do the probabilities determine whether a search location should be considered at all, but they also dictate the search range. This can also be seen in
Figure 12. Notice how blocks that are further away from the center have bigger MVs.
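A short sketch of this range scaling, assuming the weight simply multiplies the probability onto the base range (consistent with the weight factor of 2 reported in Section 4, where a probability of 1.0 doubles the range):

```python
def scaled_search_range(base_range: int, probability: float, weight: float = 2.0) -> int:
    """Search range scaled by the weighted search probability (our assumption
    about the exact form; probability 1.0 with weight 2 doubles the range)."""
    return max(1, round(base_range * weight * probability))

print(scaled_search_range(16, 1.0))   # block far from the centre -> 32
print(scaled_search_range(16, 0.55))  # block near the centre -> 18
```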
Figure 13 shows the DSM for the left drone motion. The CSM was designed as well and is shown in
Figure 14. Again, all the functions were designed with the intention of minimizing computational complexity; hence, some functions have sharp transitions in probability values, even though better probability functions might exist. The model for the right drone motion is analogous. These models use two helper functions: the first sets all the values within a specified region to 0, and the second assigns a value of 0 to all elements that fall below a certain threshold. The parameter A from function (2) is a two-dimensional area representing a part of an image, and the parameter T is a scalar value.
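The original defines these helpers as equations; the NumPy versions below are our reconstruction with the described behavior (the tuple representation of the region is an assumption):

```python
import numpy as np

def zero_region(A: np.ndarray, region: tuple) -> np.ndarray:
    """Set all values of A inside `region` to 0. A is a two-dimensional
    area representing a part of an image; `region` is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    out = A.copy()
    out[y0:y1, x0:x1] = 0
    return out

def zero_below(A: np.ndarray, T: float) -> np.ndarray:
    """Assign 0 to every element of A that falls below the scalar threshold T."""
    return np.where(A < T, 0, A)
```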
3.3. Determining Search Directions Based on the Received User Input
Even though a system has been developed that allows users to try the MDIS algorithm in real time, the tests for this research were performed using pre-recorded footage and accompanying log files that record the user input. Each received input record contains a timestamp, the identifier of the joystick that was pushed (for example, Right Stick Vertical, which corresponds to the forward motion), and the push intensity.
The timestamp can easily be converted into the frame number, and the intensities range from −1 to 1. The idea is to use the received input to generate a textual file that contains numbers from ‘0’ to ‘8’, which correspond to all the possible movements, starting with ‘0’, which means undefined (perform a normal DS), ‘1’ for forward motion, ‘2’ for backward motion, ‘3’ for moving to the left, and so on (
Table 1). The MDIS algorithm then reads those numbers and uses a specific search model depending on the read number.
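For instance, with the 60 fps footage used in Section 4, the timestamp-to-frame conversion is straightforward (a sketch; the names are ours):

```python
FPS = 60  # frame rate of the test footage (Section 4)

def timestamp_to_frame(t_seconds: float, fps: int = FPS) -> int:
    """Map a controller timestamp to the index of the frame it falls in."""
    return int(t_seconds * fps)

print(timestamp_to_frame(2.5))  # an input at 2.5 s affects frame 150
```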
The accuracy of the utilized motion dynamics estimation is ensured by relying on the same joystick input data already used for controlling the drone, which is handled and secured by the drone’s manufacturers. Since this information is integral to the drone’s operation, it is inherently accurate and reliable. Even if an error occurs, it would not cause a significant issue because such errors are rare and because ME uses the AMVP scheme described earlier in this section, which is robust to rare disturbances. For example, we tested this by intentionally inserting a random value in the generated textual file at every 20th character location, and the impact on ME quality was negligible.
Transforming the received user input data into the desired formatted output presents a complex challenge. It is not sufficient to merely identify the joystick with the greatest push intensity at a given frame and assume that it determines the final search model. This approach would not yield optimal results. Consider the following scenario: a drone is stationary and begins to accelerate forward. Naturally, the highest intensity would be for the Right Stick Vertical. It might seem intuitive to immediately start using the forward motion search model. However, it is important to note that due to physical forces, the drone initially leans downward (pitch) as it starts to accelerate. This downward tilt can be leveraged by employing a combination of the downward and forward motion search models. Relying solely on the standard forward motion model during the several initial frames of drone acceleration would not yield optimal results. Similarly, when the drone starts moving left or right, it first slightly tilts to the side (roll).
While integrating additional sensor data such as radar or laser data could potentially enhance the precision of motion dynamics estimation, our primary focus is on leveraging joystick input data. This decision is based on the following factors:
Data availability and reliability: joystick input data are readily available and reliably processed by the drone’s control system. These data are directly tied to the user’s commands and provide immediate feedback on the intended movements.
Integration complexity: incorporating radar or laser data would introduce additional complexity into the system. These sensors require calibration, synchronization, and integration with the existing control and processing framework. This can be challenging and may necessitate significant changes to the hardware and software architecture.
In scenarios where user inputs are assisted by systems that modify them, such as stabilization systems, the raw input data may not map directly to the proper movement code. For instance, if a stabilization system smoothens abrupt joystick movements to prevent sudden jerks, the raw joystick data could yield an incorrect movement code, because they no longer reflect the control inputs actually applied to the drone. The MDIS algorithm should match the stabilization mechanism’s adjustments and use the modified input data when calculating movement codes, as these more accurately represent the actual control inputs being applied to the drone.
Another important factor to consider is that our primary interest lies in the direction of acceleration, not just the drone’s movement. This is because MVs are refined starting from an initial MV. For instance, if the drone is moving forward but begins to decelerate, the backward motion model should be employed, despite the drone’s continued forward motion. The same applies to all other movement types, not just forward/backward motion. Very small accelerations and decelerations are not problematic, because the fractional pixel search, which is unchanged in MDIS, can handle small changes in MVs. Specifically, this last case of small changes in acceleration refers to situations in which the user’s fingers are not completely steady, so the joystick push intensities vary slightly from frame to frame.
The MDIS version tested in this article determined the final movement codes according to the following principles (a sketch of this logic follows the list):
Consecutive frames condition—A joystick had to maintain the highest push intensity for several consecutive frames (eight in the tested MDIS version). Only after this consistency was maintained over several frames did the corresponding movement code start being written to the final direction output file.
Acceleration threshold condition—To ensure that acceleration is being captured, the stick push intensity in the current frame had to differ from that of the previous frame by a specific threshold (e.g., 0.05). If either of these conditions was not met, a ‘0’ was written to the final direction output file.
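The sketch below condenses these two conditions into a single per-frame decision (our own formulation; the stick-to-code mapping is passed in, following Table 1):

```python
CONSECUTIVE_FRAMES = 8   # frames a stick must stay dominant (tested version)
ACCEL_THRESHOLD = 0.05   # minimum frame-to-frame change in push intensity

def movement_code(dominant_sticks, intensities, code_for_stick):
    """Decide the movement code for the newest frame.

    dominant_sticks: per-frame name of the most-pushed stick, newest last.
    intensities:     per-frame push intensity of that stick, newest last.
    code_for_stick:  mapping from stick/direction to the codes of Table 1.
    """
    recent = dominant_sticks[-CONSECUTIVE_FRAMES:]
    consistent = len(recent) == CONSECUTIVE_FRAMES and len(set(recent)) == 1
    accelerating = (
        len(intensities) >= 2
        and abs(intensities[-1] - intensities[-2]) >= ACCEL_THRESHOLD
    )
    if consistent and accelerating:
        return code_for_stick.get(recent[-1], "0")
    return "0"  # fall back to a plain diamond search for this frame
```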
The current MDIS implementation supports only the nine search modes. Combinations of multiple models, such as the forward and downward motion combination explained earlier, are not yet implemented. Whenever it is unclear which search model should be used, a ‘0’ is written to the textual movement file, meaning a normal DS is used for that frame. This is left as an improvement for future research.
4. Experimental Results
Over 70 test cases, encompassing all types of movements, were carried out using footage from a simulator and real drone footage. Pre-recorded data were used for verifying the algorithm to ensure consistent and repeatable testing conditions. Moreover, pre-recorded data are in the highest-quality raw format so that there are no coding artifacts in the input video. This allows us to thoroughly evaluate the algorithm’s performance under controlled scenarios, using both simulator footage with precise user input data and real drone footage. By doing so, we can accurately measure and compare the performance of the MDIS algorithm against the DS algorithm without the variability that might arise from live data collection.
The simulator sequence under test spans 46 s and is presented in 4K resolution at 60 frames per second. The real drone footage consists of four short sequences (4–20 s each), also in 4K resolution at 60 frames per second. The tests were performed on a 64-bit Windows personal computer with 16 GB of RAM, an 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60 GHz processor (Intel, Santa Clara, CA, USA), and an NVIDIA GeForce GTX 1650 graphics card (NVIDIA, Santa Clara, CA, USA). The simulator footage was given particular attention because it provided precise user input data from the controller corresponding to the footage. For the real drone footage, we created the textual user input files manually, by individually inspecting each sequence and tracking the drone movements. This approach is also valid, because the data from the controller undergo the same processing as the approximated manual input, ensuring a consistent methodology across both types of footage.
In order to precisely measure the encoding times and times occupied by each function separately, Intel’s VTune Profiler software version 2023.0.0 was used [
25]. The Flame Graph analysis was used the most, since it shows exactly how much CPU time each function occupied. In order to optimize each search model independently, the simulator footage was segmented into several shorter sequences, each approximately 300 frames long. Each of these sequences was characterized by a single specific type of movement, namely left–right rotation, forwards–backwards motion, left–right motion, or upwards–downwards motion.
Several versions of MDIS were tested, and the following results refer to the best version, which will be explained in detail.
Figure 15 presents a comparison chart illustrating CPU time consumption, derived from Flame graph analysis, for encoding the entire simulator sequence using the MDIS and DS algorithms.
The bars on the left side of the chart represent the CPU time used exclusively by the search algorithm functions. In contrast, the bars on the right side illustrate the CPU time consumed by the broader search_ref function, which includes not only the search functions but also other operations, like early termination and selecting starting points, that are key elements of the whole ME process. This comparison shows that MDIS speeds up the overall ME process and does not negatively affect the other ME operations. It is noteworthy that all other operations executed within a similar time frame for both the DS and MDIS algorithms, confirming that MDIS is the primary factor in achieving faster search times. The DS algorithm accounted for 529 CPU seconds in total, which makes up 3.5% of the total encoding time. In comparison, the MDIS function accounted for 360 CPU seconds in total, representing 2.3% of the total encoding time. In percentage terms, the MDIS algorithm is approximately 32% faster than the DS algorithm.
In addition to accelerating the encoding speed, the MDIS algorithm also maintains the same video quality as the DS algorithm, as evidenced by PSNR values which were consistently within a ±0.6% range. Furthermore, MDIS offers a superior compression ratio. While the video encoded with DS amounts to 264 MB, the MDIS-encoded video is smaller, with a size of 257 MB.
Table 2 presents encoding data for the real drone footage. Encoded video file size refers to the final video file size of the corresponding real drone sequence, expressed in MB. Search time refers to the CPU time consumed solely by the search function—by MDIS or DS. PSNR is calculated using FFmpeg [
26] by comparing the final encoded video with the original, raw video. Even better results were achieved when testing on real drone footage, but it is worth noting that the real drone footage was filmed using only basic movements. No complex movements, such as combined forward motion and left–right rotation, were used. On real drone footage, MDIS was up to 4.25 times faster than DS while preserving video quality and compression rate.
This version of MDIS excluded the extra corner search locations that were added in one of the MDIS versions (
Figure 9). Including the corners resulted in a slightly smaller video file size, but at the cost of a longer encoding time. The reduction in file size was minimal (around 1 MB), while the MDIS execution time increased by approximately 15%. Therefore, the disadvantages of using the corner search locations outweigh the benefits, at least in this case of encoding the simulator footage. The search range weight factor used in this version was 2. This factor scales the search range based on the calculated search probabilities, meaning that a block with a search probability of 1.0 had its search range doubled. It is worth noting that the process of calculating the search probabilities has a negligible impact on the encoding time.
We found it very interesting that MDIS resulted in smaller video file sizes. Hence, a detailed inspection was performed using VQAnalyzer software version 7.2.0.73697 [
27]. Oftentimes, MDIS guides the ME process in the right direction. In DS, there can be cases where blocks ‘accidentally’ find the most similar blocks on search locations that do not align with the drone’s movement. For instance, during the first iteration of DS, the center might shift to the left search location even though the drone is moving right. While the left search location may indeed be the best match in the first iteration, DS might not find better blocks in subsequent iterations. In contrast, MDIS, which might not find the best match in the first iteration, often identifies much better blocks in later iterations.
This discrepancy can also impact neighboring blocks due to the process of selecting the initial MV. At times, DS might conclude that the best match is in the left search location, causing neighboring blocks to start their search from that point, which can lead to ‘dead-ends’. Although this can also occur with MDIS, it happens less frequently due to the enforced search direction, which is grounded in theoretical principles.
Figure 16 and
Figure 17 show this effect—they are snippets of the same video frame, one from the video encoded with DS and one from the video encoded with MDIS. There are far fewer MVs in the case of DS, which makes the DS frame considerably larger. In this case, the frame encoded with DS takes 2.27 MB, while the frame encoded with MDIS takes 1.95 MB.
A similar situation where MDIS identifies more MVs than DS arises due to the extended search range facilitated by search probabilities. With DS, blocks located near the edges of frames often end up without assigned MVs. However, MDIS more frequently succeeds in finding MVs for such blocks due to its extended search range capability.
The final search directions file was generated using a Python (version 3.9.13) script. The script processes the input files and generates an output file containing the final search directions for the MDIS algorithm. The four input files follow the scheme explained earlier, and each file corresponds to one joystick movement. The script checks which joystick had the highest push intensity at each moment and writes only that information to a separate file. That newly created file is then processed in the following way. If the push intensity is below 0.1, a ‘0’ is written to the final directions file. Otherwise, the intensity and movement category are compared with the previous values, following the logic described earlier in this research. For example, if the movement category is forward motion and high acceleration is detected, a ‘0’ is still written to the output file, because the drone is currently leaning forward and the classical forward motion model cannot yet be applied. Only once the high acceleration stops does the script start applying the forward motion model, or in other words, writing the corresponding number to the output file.
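A condensed sketch of that per-frame logic is given below (file handling is omitted, and the "high acceleration" bound and all names are our assumptions, not the script's actual values):

```python
LOW_PUSH = 0.1     # below this intensity the frame gets code '0'
HIGH_ACCEL = 0.3   # assumed bound for the "still leaning in" phase

def frame_direction(stick, intensity, prev_intensity, code_table):
    """Final direction code for one frame of the output file (our sketch)."""
    if abs(intensity) < LOW_PUSH:
        return "0"                        # stick barely pushed: plain DS
    if abs(intensity - prev_intensity) > HIGH_ACCEL:
        return "0"                        # drone still pitching/rolling into the move
    return code_table.get(stick, "0")     # stable motion: apply its model

# Example: forward stick held steadily at 0.8 after the lean-in phase.
print(frame_direction("right_stick_vertical", 0.8, 0.78,
                      {"right_stick_vertical": "1"}))  # -> '1'
```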
5. Conclusions and Future Work
This research has presented a novel ME algorithm, MDIS, designed to enhance video compression for RCVs, with a particular focus on UAVs. The MDIS algorithm builds upon the well-established DS algorithm, incorporating information from vehicle motion dynamics estimation and real-time user interaction data from the controller to guide the search process. This approach has been shown to significantly improve the efficiency and accuracy of ME, reducing the computational complexity and latency of video transmission, which is of paramount importance in applications such as drone racing, FPV systems, construction, agriculture, or mining. Overall, the MDIS algorithm’s improvements in video encoding speed and quality are crucial across these diverse applications, enhancing operational efficiency, safety, and decision-making processes.
The MDIS algorithm has been thoroughly tested using both simulator and real drone footage. The results demonstrate that MDIS reduced the search time by approximately 32% compared to the DS algorithm on the simulator footage, and by up to 4.25 times on the real drone footage, while maintaining the same video quality. Furthermore, MDIS has been shown to provide a superior compression ratio, resulting in smaller video file sizes. This is a significant achievement, as it allows for a more efficient use of network bandwidth and storage resources. It also means that MDIS reduces the overall end-to-end video transmission latency and can increase the final frame rate of the displayed video, enhancing the pilot’s user experience.
This research has also highlighted the importance of considering the direction of acceleration, rather than just the direction of movement, in the ME process. This insight has been incorporated into the MDIS algorithm, allowing it to more accurately predict the MVs and thus improve the efficiency of the search process.
MDIS presents a significant enhancement over DS through its improved search range modification, achieved by incorporating search probabilities. This enhancement enables the MDIS algorithm to identify a greater number of MVs compared to DS. In DS, blocks situated near the frame edges frequently lack assigned MVs. In contrast, MDIS demonstrates a higher success rate in obtaining MVs for such blocks, thanks to its expanded search range capability.
Despite these promising results, there are several potential avenues for further research and improvement. One such area is the implementation of combined search models in MDIS. Currently, MDIS uses a single search model based on the dominant movement type. However, in certain scenarios, such as when the drone is accelerating or moving forward and rotating at the same time, a combination of different search models may yield better results. Implementing this feature in MDIS could further improve its performance.
Another area for future work is the refinement of the process for determining final search directions based on user input. The current approach, while effective, could potentially be improved by incorporating more sophisticated machine learning techniques to better predict the final search directions.
Moreover, the process of early termination, which has not been discussed in detail in this research, could be modified to follow the logic of MDIS. The purpose of early termination in ME is to optimize computational efficiency: instead of exhaustively searching through all possible MVs, the encoder may terminate the search early if it determines that further exploration is unlikely to yield a significantly better result. This is based on heuristics and thresholds calculated during the ME process. Early termination takes up a lot of CPU time because it performs an HS in a small area, similar to the DS, but at sub-pixel-level precision. These HS search locations could also be modified to follow the same logic depending on the user input.
In conclusion, the MDIS algorithm represents a significant step forward in the field of video compression for RCVs. By intelligently leveraging user input data, MDIS has the potential to greatly enhance the performance and user experience of applications such as drone racing and FPV systems. As such, it provides a promising foundation for future research and development in this exciting area.