1. Introduction
Video inpainting aims to fill in missing or corrupted regions of a video with plausible content and has been widely used in practical applications, including video editing, damaged video restoration, and watermark removal. Compared to image inpainting, video inpainting presents greater challenges due to the additional temporal dimension. In addition to generating visually plausible content for each frame, video inpainting has to maintain temporal coherence in the missing regions. While significant progress has been made in image inpainting, directly processing a video frame by frame with an image inpainting method leads to temporal inconsistencies and severe artifacts due to the complex motion of objects and the camera.
Although the temporal dimension brings challenges to video inpainting, it inherently provides more information for the restoration of the missing regions. Therefore, the effective utilization of complementary information across frames to synthesize high-quality content is critical for video inpainting. Advances in deep learning and computer vision have enabled the development of deep video inpainting, and a number of deep learning-based video inpainting methods have been proposed [1,2,3,4,5,6,7,8,9,10,11]. These methods can be roughly classified into two classes: pixel-based methods and flow-based methods. The first class generally utilizes 3D convolution [1,2,3], attention mechanisms [4,5,6], or transformers [10,11] to capture the spatio-temporal correlations among video frames. These methods take corrupted frames as input and employ the learned spatio-temporal correlations to directly infer the missing regions without going through complicated transformations. Flow-based methods [7,9] argue that completing optical flow is much easier than completing the pixels in missing regions and formulate video inpainting as a pixel propagation problem. These methods complete the optical flow first and use the synthesized flow to guide pixel propagation from the valid region to the missing region. Compared with pixel-based methods, flow-based methods can produce inpainting results with high-frequency details. However, flow-based video inpainting methods are highly dependent on the accuracy of the completed flow: incorrect optical flow will greatly degrade the final inpainting quality. Furthermore, errors in early stages inevitably propagate to subsequent stages, yielding inconsistent results.
Recently, transformers have drawn great attention. The powerful long-range spatio-temporal modeling capability of video transformers has driven great progress in video-related tasks such as video super-resolution [12,13] and video action recognition [14,15]. Not surprisingly, more and more researchers have begun to employ video transformers for deep video inpainting [10,11]. These methods take multiple frames as input and utilize various transformer blocks to establish correspondences between missing-region tokens and valid-region tokens. These correspondences are then used to hallucinate the missing regions and generate the final inpainting results. However, the presence of hole-region tokens easily leads to inaccurate results when estimating the correspondences. To achieve better video inpainting results, some researchers have made efforts to integrate optical flow with video transformers, e.g., E2FGVI [16] and FGT [17]. In these methods, the corrupted optical flows are first completed, and content is propagated across frames using the completed flows. The propagated information offers more effective cues for the subsequent transformer-based integration. While these methods have shown promising inpainting results, they do not fully exploit the guidance of optical flow: the flow is only used to propagate information in the early stage of the network, without considering its effect on the subsequent transformer blocks. Moreover, video scenes are highly variable due to the complex motion of cameras and objects; for example, when the video is almost static, it is difficult for the network to capture valid information from adjacent frames. The transformer blocks proposed in [16,17], which consist of a temporal transformer block or separated spatial and temporal blocks, cannot effectively integrate spatio-temporal information across frames.
To address the above problems, in this paper we propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture for deep video inpainting, which aims to effectively utilize the remarkable spatio-temporal modeling capability of transformers under the guidance of optical flow. More specifically, to mitigate the degradation caused by hole-region pixels when establishing correspondences between missing regions and valid regions, we introduce the completed optical flow into each transformer block and design a Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module to replace the two-layer MLPs in the conventional transformer architecture. The Flow-Guided F3N module propagates information across video frames along the optical flow trajectories, providing more effective information for the subsequent attention operations. For video inpainting, when the neighboring frames cannot provide sufficient information for the missing regions, the spatial information within the current frame can also be utilized. Based on this observation, a decomposed spatial temporal MHSA (multi-head self-attention) mechanism is proposed, in which a temporal MHSA module captures the temporal information in videos and a spatial MHSA module further integrates the spatial information within each frame. To improve the efficiency of the network, we further design a global–local attention mechanism for the temporal MHSA module, called Global–Local Temporal MHSA: we employ local temporal attention within a small window and integrate global temporal information in a coarse-fine-grained way. Through these designs, our network effectively and efficiently leverages the complementary content across video frames, producing results of high visual quality.
We conduct extensive quantitative and qualitative experiments on two popular video inpainting datasets, DAVIS [18] and YouTube-VOS [19], to validate the effectiveness of the proposed network. The experimental results show the superiority of our method over state-of-the-art methods.
The main contributions of the proposed method are summarized as follows:
We propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture for high-quality video inpainting.
We propose a Flow-Guided F3N module to alleviate the inaccuracy caused by hole pixels when performing MHSA.
We propose a decomposed spatial temporal MHSA to effectively integrate the spatio-temporal information across frames. A global–local temporal attention mechanism is further designed to improve the efficiency of the FSTT.
3. Proposed Methods
Given a set of corrupted video frames $X = \{X_1, X_2, \dots, X_t\}$ of height $h$, width $w$, and length $t$ in RGB space, with their corresponding missing-region masks $M = \{M_1, M_2, \dots, M_t\}$, where $M$ is a binary mask and the value '0' represents known pixels, our network aims to learn a function that generates the inpainting results $\hat{Y} = \{\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_t\}$. The results $\hat{Y}$ should be spatially and temporally consistent and as close as possible to the ground-truth video $Y = \{Y_1, Y_2, \dots, Y_t\}$.
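To make the formulation concrete, the following minimal sketch (with illustrative tensor names and sizes that are not prescribed by the paper) shows how a corrupted input relates to the ground-truth video and binary masks under the convention that '0' marks known pixels:

```python
import torch

t, h, w = 8, 256, 256                         # illustrative clip length and spatial size
Y = torch.rand(t, 3, h, w)                    # ground-truth frames in RGB space
M = (torch.rand(t, 1, h, w) > 0.9).float()    # binary masks: 1 = missing, 0 = known pixel

# Corrupted input: ground-truth content is kept only where the mask marks known pixels.
X = Y * (1.0 - M)

# The network learns a mapping f such that Y_hat = f(X, M) is spatially and
# temporally consistent and close to Y inside the missing regions.
```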
To achieve this goal, we propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) for video inpainting. In the following, an overview of FSTT is first introduced. Then, we present the detailed design of the FSTT. Finally, the loss functions that are utilized to train the network are given.
3.1. Network Overview
The overall framework of the proposed FSTT is illustrated in Figure 1. It mainly consists of two stages, optical flow completion and corrupted frame inpainting, which comprise four components: (1) the optical flow completion module, (2) the frame feature encoder module, (3) the flow-guided spatial temporal transformer blocks, and (4) the decoder module.
Specifically, taking a corrupted video sequence and its corresponding binary masks as inputs, FSTT first completes the forward and backward optical flows between the input frames at 1/4 resolution through the optical flow completion module. These completed flows are utilized to guide the restoration of the corrupted frames at the feature level.
In the frame inpainting stage, the convolutional encoder, built on stacked 2D convolution layers, extracts contextual features from the input frames and obtains $c$-channel feature maps $E = \{E_1, E_2, \dots, E_t\}$. Then, valid information is propagated between the feature maps with the help of the completed bidirectional flows, which provides more effective information for the missing regions. Thirdly, the propagated features are split into smaller patches and flattened into one-dimensional tokens $Z \in \mathbb{R}^{t \times n \times d}$, where $t$ is the frame length, $n$ is the number of tokens in one feature map, and $d$ is the token channel dimension. Next, $Z$ is fed into the core component, the flow-guided spatial temporal transformer blocks, to integrate the spatio-temporal information in the video under the guidance of the completed optical flows, producing refined tokens $\hat{Z}$. The refined tokens are then linearly transformed and reshaped to obtain feature maps. Finally, similar to the encoder module, the decoder module with a series of deconvolution layers decodes the features back to the completed RGB results $\hat{Y}$.
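The overall data flow can be summarized in the structural sketch below. This is an illustration only: the module and argument names are assumptions, not the released implementation.

```python
import torch.nn as nn

class FSTT(nn.Module):
    """Structural sketch of the pipeline described above (hypothetical module names)."""
    def __init__(self, flow_net, encoder, propagate, patchify, fstt_blocks, unpatchify, decoder):
        super().__init__()
        self.flow_net = flow_net                   # optical flow completion module (Section 3.2)
        self.encoder = encoder                     # stacked 2D convolution layers
        self.propagate = propagate                 # flow-guided feature propagation (Section 3.3)
        self.patchify = patchify                   # split feature maps into tokens Z
        self.blocks = nn.ModuleList(fstt_blocks)   # flow-guided spatial temporal transformer blocks
        self.unpatchify = unpatchify               # linear transform + reshape tokens back to features
        self.decoder = decoder                     # stacked deconvolution layers

    def forward(self, frames, masks):
        flow_fwd, flow_bwd = self.flow_net(frames, masks)   # completed flows at 1/4 resolution
        feats = self.encoder(frames, masks)                 # contextual features E
        feats = self.propagate(feats, flow_fwd, flow_bwd)   # bidirectional flow-guided propagation
        tokens = self.patchify(feats)                       # tokens Z of shape (t, n, d)
        for block in self.blocks:
            tokens = block(tokens, flow_fwd, flow_bwd)      # flow guidance inside every block
        return self.decoder(self.unpatchify(tokens))        # completed RGB frames
```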
3.2. Optical Flow Completion Module
The motion information between video frames provides significant assistance in solving various video-related tasks, such as video segmentation and video object detection. Similarly, the motion information is also crucial for the video inpainting task. In this paper, we introduce optical flow into video inpainting to better integrate the spatial temporal information of videos.
To leverage the motion information in the video, we first need to obtain the completed optical flow maps. Considering the efficiency of the proposed network, we exploit a lightweight optical flow estimation network $\mathcal{F}_c$ to complete the flow. $\mathcal{F}_c$ adopts a similar architecture to SpyNet [46], which is widely used for optical flow-related tasks.

Specifically, we down-sample the input corrupted frames $X$ to 1/4 resolution (denoted as $\tilde{X}$), which matches the spatial resolution of the encoder feature maps. The completed forward optical flow $\hat{F}_{i \to i+1}$ between frames $\tilde{X}_i$ and $\tilde{X}_{i+1}$ is computed by the flow estimation module as

$\hat{F}_{i \to i+1} = \mathcal{F}_c(\tilde{X}_i, \tilde{X}_{i+1}), \quad (1)$

and the backward optical flow $\hat{F}_{i+1 \to i}$ is computed in a similar manner.
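A minimal sketch of this completion step is given below, assuming a SpyNet-like estimator `flow_net` that maps a pair of frames to a flow field; the exact interface of the module used in the paper is not specified here.

```python
import torch
import torch.nn.functional as F

def complete_flows(frames, flow_net):
    """frames: (t, 3, H, W) corrupted frames; returns forward/backward flows at 1/4 resolution."""
    small = F.interpolate(frames, scale_factor=0.25, mode="bilinear", align_corners=False)
    fwd = [flow_net(small[i:i + 1], small[i + 1:i + 2]) for i in range(len(small) - 1)]
    bwd = [flow_net(small[i + 1:i + 2], small[i:i + 1]) for i in range(len(small) - 1)]
    return torch.cat(fwd), torch.cat(bwd)   # each of shape (t-1, 2, H/4, W/4)
```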
3.3. Flow-Guided Feature Propagation
With the completed optical flow, we can propagate valid information from neighboring frames to the corrupted regions of the current frame through a warping operation $\mathcal{W}$:

$\bar{E}_{i+1 \to i} = \mathcal{W}(E_{i+1}, \hat{F}_{i \to i+1}). \quad (2)$

However, directly obtaining accurate optical flow is challenging due to the existence of missing regions. Inaccurate optical flow will lead to irrelevant information propagation, significantly degrading the quality of the inpainting results. To alleviate this problem, inspired by [47], we combine deformable convolution with optical flow propagation to improve the effectiveness of information propagation.
Figure 2 shows the improved pipeline for feature propagation from frame $i+1$ to frame $i$. The neighboring frame feature $E_{i+1}$ is first aligned with the current frame feature $E_i$ through Equation (2), obtaining the pre-aligned feature map $\bar{E}_{i+1 \to i}$. Then, we concatenate $\bar{E}_{i+1 \to i}$ and $E_i$ and estimate the offsets $o_{i+1 \to i}$ and modulation masks $m_{i+1 \to i}$ between them with an offset prediction network:

$o_{i+1 \to i} = \mathcal{C}_o\big([\bar{E}_{i+1 \to i}, E_i]\big), \qquad m_{i+1 \to i} = \sigma\big(\mathcal{C}_m([\bar{E}_{i+1 \to i}, E_i])\big),$

where $\mathcal{C}_o$ and $\mathcal{C}_m$ denote stacks of convolutional layers, and $\sigma$ is the sigmoid function.

The learned offsets $o_{i+1 \to i}$ contain the motion information between frames, which further compensates for inaccurate optical flow. Thus, we add the learned offsets $o_{i+1 \to i}$ and the completed optical flow map $\hat{F}_{i \to i+1}$ to generate the refined offsets $\hat{o}_{i+1 \to i}$. The deformable convolution operation $\mathcal{D}$ is then applied to $E_{i+1}$ with the refined offsets and modulation masks to generate the final propagated feature $\hat{E}_{i+1 \to i}$:

$\hat{E}_{i+1 \to i} = \mathcal{D}\big(E_{i+1};\, \hat{o}_{i+1 \to i},\, m_{i+1 \to i}\big).$
Finally, we merge the propagated feature $\hat{E}_{i+1 \to i}$ with the current frame feature $E_i$ using several convolution layers. In the implementation, we perform bidirectional propagation across the frame features, and a convolution layer is utilized to fuse the forward and backward propagation features.
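The propagation step can be sketched with `torch.nn.functional.grid_sample` for the flow warp and `torchvision.ops.deform_conv2d` for the flow-compensated deformable alignment. Module names, the (dx, dy) flow convention, and the offset layout below are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

def flow_warp(feat, flow):
    # Backward-warp neighbour features to the current frame; flow is (B, 2, H, W) in (dx, dy) order.
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow                    # absolute sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                 # (B, H, W, 2), normalized to [-1, 1]
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

class FlowGuidedPropagation(nn.Module):
    """Sketch: pre-align with flow, then refine with a flow-compensated deformable convolution."""
    def __init__(self, c, k=3):
        super().__init__()
        self.k = k
        self.offset_head = nn.Conv2d(2 * c, 2 * k * k, 3, padding=1)   # residual offsets o
        self.mask_head = nn.Conv2d(2 * c, k * k, 3, padding=1)         # modulation masks m
        self.weight = nn.Parameter(torch.empty(c, c, k, k))            # deformable conv weight
        nn.init.kaiming_uniform_(self.weight)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)                  # merge with current feature

    def forward(self, feat_cur, feat_nbr, flow):
        pre_aligned = flow_warp(feat_nbr, flow)                        # Equation (2)
        x = torch.cat([pre_aligned, feat_cur], dim=1)
        offsets = self.offset_head(x)
        masks = torch.sigmoid(self.mask_head(x))
        # Refine offsets with the completed flow, repeated for every kernel sampling point
        # (assumed (dy, dx) interleaving expected by torchvision's deform_conv2d).
        refined = offsets + flow.flip(1).repeat(1, self.k * self.k, 1, 1)
        propagated = deform_conv2d(feat_nbr, refined, self.weight,
                                   padding=self.k // 2, mask=masks)
        return self.fuse(torch.cat([propagated, feat_cur], dim=1))
```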
3.4. Flow-Guided Spatial Temporal Transformer Block
To effectively leverage the spatio-temporal information across video frames, we introduce optical flow into the transformer block to guide the information propagation. Furthermore, for video, in addition to the temporal information across frames, the spatial information within the current frame can also be leveraged to fill in the missing regions. Thus, both spatial and temporal attention are utilized in the transformer block to integrate the complementary content across frames. To this end, a novel Flow-Guided Spatial Temporal Transformer block (FSTT block) is proposed.
The illustration of the FSTT block is shown in Figure 1. The FSTT block mainly consists of three parts: Spatial Multi-Head Self-Attention (Spatial MHSA), Global–Local Temporal Multi-Head Self-Attention (Global–Local Temporal MHSA), and Flow-Guided Fusion Feed-Forward (Flow-Guided F3N). In detail, the input of the FSTT block $Z \in \mathbb{R}^{t \times n \times d}$ is first projected to the query, key, and value features $q$, $k$, and $v$, respectively, as follows:

$q = W_q(Z), \qquad k = W_k(Z), \qquad v = W_v(Z),$

where $W_q$, $W_k$, and $W_v$ denote linear projection layers. Then, we conduct MHSA on the temporal dimension and the spatial dimension separately. The query, key, and value features are split into different heads along the channel dimension. For temporal MHSA, the attention retrieval is performed on the tokens across all input frames simultaneously, formulated as

$\mathrm{Attention}(q, k, v) = \mathrm{softmax}\!\left(\frac{q\, k^{\top}}{\sqrt{d_h}}\right) v,$

where $q$, $k$, and $v$ are rearranged into the shape $n_h \times (t \cdot n) \times d_h$ ($n_h$ denotes the number of heads and $d_h = d / n_h$). The Spatial MHSA adopts a similar computational approach to the Temporal MHSA, but it only conducts attention within each frame; its $q$, $k$, and $v$ are of shape $(n_h \cdot t) \times n \times d_h$.
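The decomposition can be illustrated by how the token tensor is reshaped before a standard multi-head attention call. The module below is a simplified sketch: the head count, the use of `nn.MultiheadAttention`, and the additive fusion of the two branches are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DecomposedSTAttention(nn.Module):
    """Sketch: temporal attention across all frames, spatial attention within each frame."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z):
        # z: (t, n, d) tokens for t frames with n tokens each.
        t, n, d = z.shape
        # Temporal MHSA: every token attends to the tokens of all frames simultaneously.
        zt = z.reshape(1, t * n, d)
        zt, _ = self.temporal(zt, zt, zt)
        # Spatial MHSA: the batch dimension is the frame index, so attention stays within a frame.
        zs, _ = self.spatial(z, z, z)
        return zt.reshape(t, n, d) + zs   # simple additive fusion, for illustration only
```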
However, calculating temporal attention directly over all input video frames results in a significant computational cost. To address this issue, inspired by the window partition strategy proposed in a recent transformer method [48], we further design a Global–Local Temporal Multi-Head Self-Attention (Global–Local Temporal MHSA) with a coarse-fine-grained temporal attention calculation, which improves network efficiency while maintaining performance.
After performing temporal and spatial attention, we feed the features into a Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module. The Flow-Guided F3N utilizes the completed flow to propagate information across frame features, providing more effective information for the subsequent blocks and compensating for the inaccuracy introduced by the missing regions when conducting MHSA. An FSTT block therefore applies the Global–Local Temporal MHSA and the Spatial MHSA to the input tokens and then refines the result with the Flow-Guided F3N module.
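One plausible composition of the three parts inside a single block is sketched below; the residual connections and layer normalization are common transformer conventions and are assumptions here, not taken from the paper.

```python
import torch.nn as nn

class FSTTBlock(nn.Module):
    """Hypothetical composition of the three parts of an FSTT block."""
    def __init__(self, dim, attention, flow_guided_f3n):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attention = attention          # decomposed spatial/temporal MHSA
        self.ffn = flow_guided_f3n          # Flow-Guided F3N (Section 3.4.2)

    def forward(self, z, flow_fwd, flow_bwd):
        z = z + self.attention(self.norm1(z))                 # assumed residual form
        z = z + self.ffn(self.norm2(z), flow_fwd, flow_bwd)   # flow guidance in the feed-forward
        return z
```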
3.4.1. Global–Local Temporal MHSA
We introduce a window partition strategy into the temporal MHSA to improve network efficiency. As shown in Figure 3, given the token map, we divide it into several non-overlapping windows, and the local tokens in a window are denoted as $Z_l$. For each token in the hole region, the local temporal window provides fine-grained information integration. To obtain global information for the temporal MHSA, a convolution layer with kernel size $k$ and stride $s$ is applied to perform window pooling spatially and generate the global tokens $Z_g$.
In order to obtain local and global information simultaneously, we calculate attention with local–global interactions. Specifically, for a query in a local window, we find correspondences not only within the local window but also from the global window. We concatenate the local tokens $Z_l$ and the global tokens $Z_g$ and then project them into the key and value. The query $q$, key $k$, and value $v$ are generated as follows:

$q = W_q(Z_l), \qquad k = W_k\big([Z_l, Z_g]\big), \qquad v = W_v\big([Z_l, Z_g]\big),$

where $W_q$, $W_k$, and $W_v$ are linear projection layers. The generated $q$, $k$, and $v$ are then used for the temporal MHSA computation.
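The key/value construction can be illustrated as follows: local tokens come from non-overlapping spatial windows, and global tokens from a strided convolution that pools windows. The window size, pooling kernel, stride, and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalLocalTokens(nn.Module):
    """Sketch: build local window tokens and pooled global tokens for temporal MHSA."""
    def __init__(self, dim, window=8, pool_kernel=8, pool_stride=8):
        super().__init__()
        self.window = window
        self.pool = nn.Conv2d(dim, dim, kernel_size=pool_kernel, stride=pool_stride)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, token_map):
        # token_map: (t, d, H, W) token maps for t frames; H and W divisible by the window size.
        t, d, hh, ww = token_map.shape
        w = self.window
        # Local tokens: non-overlapping w x w windows, gathered across all frames per window.
        local = token_map.reshape(t, d, hh // w, w, ww // w, w)
        local = local.permute(2, 4, 0, 3, 5, 1).reshape((hh // w) * (ww // w), t * w * w, d)
        # Global tokens: spatial window pooling, shared by every local window.
        glob = self.pool(token_map).flatten(2).permute(0, 2, 1).reshape(1, -1, d)
        glob = glob.expand(local.shape[0], -1, -1)
        q = self.to_q(local)                                   # queries come from local tokens only
        kv = torch.cat([local, glob], dim=1)                   # keys/values see local + global tokens
        return q, self.to_k(kv), self.to_v(kv)
```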
3.4.2. Flow-Guided Fusion Feed-Forward Module
To enhance the sub-token fusion capability and learn fine-grained features when applying transformers to video inpainting, the Soft Split (SS) and Soft Composition (SC) operations were proposed in F3N [11] to replace the two-layer MLPs of the conventional transformer. The SS operation softly splits video frames into patches with overlapping regions, and the SC operation softly composites these overlapping patches back into images, improving the quality of the inpainted results. Our Flow-Guided Fusion Feed-Forward module (Flow-Guided F3N, FGF3N) is built on F3N and inserts flow-guided feature propagation between the SC and SS operations. The processing of FGF3N is illustrated in Figure 4.
Let $Z_{in}$ represent the token vectors input to FGF3N. FGF3N first processes $Z_{in}$ with an MLP layer, generating a token map. Then, the Soft Composition operation, the flow-guided feature propagation, and the Soft Split operation are applied to this token map step by step: the Soft Composition operation composes the one-dimensional token vectors into a 2D feature map, which enables FGF3N to perform the flow-guided feature propagation (FGFP) described in Section 3.3, and the Soft Split operation then splits the propagated feature map back into tokens for the subsequent blocks.
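The FGF3N pipeline can be sketched with `nn.Unfold`/`nn.Fold` standing in for the soft split and soft composition of overlapping patches. The patch size, stride, feature size, the overlap normalization, and the final projection back to the token dimension are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FlowGuidedF3N(nn.Module):
    """Sketch: MLP -> soft composition (fold) -> flow-guided propagation -> soft split (unfold)."""
    def __init__(self, dim, channels=128, feat_size=(60, 108),
                 patch=(7, 7), stride=(3, 3), propagate=None):
        super().__init__()
        kh, kw = patch
        self.to_patch = nn.Linear(dim, channels * kh * kw)     # tokens -> patch pixels
        self.to_token = nn.Linear(channels * kh * kw, dim)     # patch pixels -> tokens
        self.fold = nn.Fold(output_size=feat_size, kernel_size=patch, stride=stride)
        self.unfold = nn.Unfold(kernel_size=patch, stride=stride)
        self.propagate = propagate                             # flow-guided feature propagation

    def forward(self, tokens, flow_fwd, flow_bwd):
        # tokens: (t, n, dim) with n overlapping patches per frame.
        x = self.to_patch(tokens).permute(0, 2, 1)             # (t, C*kh*kw, n)
        # Soft composition: fold overlapping patches into a feature map, normalizing the overlaps.
        ones = torch.ones_like(x)
        feat = self.fold(x) / self.fold(ones).clamp(min=1e-6)
        if self.propagate is not None:
            feat = self.propagate(feat, flow_fwd, flow_bwd)    # propagate along flow trajectories
        # Soft split: re-extract overlapping patches and project back to the token dimension.
        return self.to_token(self.unfold(feat).permute(0, 2, 1))
```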
3.5. Training Losses
Three loss functions are utilized to train the FSTT: the flow estimation loss $L_{flow}$, the reconstruction loss $L_{rec}$, and the adversarial loss $L_{adv}$.
We use the flow estimation loss to train the optical flow completion network. It measures the distance between the completed bidirectional flows and the ground-truth flows:

$L_{flow} = \sum_{i} \Big( \big\| \hat{F}_{i \to i+1} - F_{i \to i+1} \big\|_1 + \big\| \hat{F}_{i+1 \to i} - F_{i+1 \to i} \big\|_1 \Big),$

where $F_{i \to i+1}$ and $F_{i+1 \to i}$ represent the ground-truth forward and backward optical flows, respectively. These flows are extracted from the original uncorrupted video frames with a pre-trained flow extraction network.
In addition, we apply a pixel-wise reconstruction loss to both the hole region and the valid region to constrain the inpainted results to approximate the ground-truth frames. This loss is defined as

$L_{rec} = \big\| \hat{Y} - Y \big\|_1.$
Inspired by recent video inpainting works [10,16], the T-Patch GAN loss [10] is also leveraged to supervise the training process. T-Patch GAN improves the perceptual quality and spatio-temporal coherence of video inpainting results through adversarial training. The generator loss for FSTT is

$L_{adv} = -\,\mathbb{E}_{\hat{Y}}\big[D(\hat{Y})\big],$

and the optimization function of the T-Patch GAN discriminator $D$ is formulated as

$L_{D} = \mathbb{E}_{Y}\big[\mathrm{ReLU}(1 - D(Y))\big] + \mathbb{E}_{\hat{Y}}\big[\mathrm{ReLU}(1 + D(\hat{Y}))\big].$
The overall optimization function is

$L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv} + \lambda_{flow} L_{flow}.$

Following the previous work [16], the weights of $L_{rec}$, $L_{adv}$, and $L_{flow}$ are set to 1, 0.01, and 1, respectively. All modules in the proposed FSTT are jointly optimized in an end-to-end manner.
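The sketch below illustrates how the three losses could be combined during training with the weights reported above. The exact distance terms (for example, which norm is used) and the hinge form of the discriminator loss follow common practice in prior flow-guided inpainting work and are assumptions here.

```python
import torch.nn.functional as F

def fstt_loss(pred, gt, flows_pred, flows_gt, disc,
              w_rec=1.0, w_adv=0.01, w_flow=1.0):
    """pred/gt: (t, 3, H, W); flows_*: (forward, backward) flow pairs; disc: T-PatchGAN critic."""
    # Reconstruction loss over hole and valid regions (L1 assumed).
    l_rec = F.l1_loss(pred, gt)
    # Flow estimation loss between completed and ground-truth bidirectional flows (L1 assumed).
    l_flow = sum(F.l1_loss(p, g) for p, g in zip(flows_pred, flows_gt))
    # Generator adversarial term: encourage the discriminator to score the result as real.
    l_adv = -disc(pred).mean()
    return w_rec * l_rec + w_adv * l_adv + w_flow * l_flow

def discriminator_loss(disc, real, fake):
    # Hinge loss for the spatio-temporal discriminator (a common choice, assumed here).
    return F.relu(1.0 - disc(real)).mean() + F.relu(1.0 + disc(fake.detach())).mean()
```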
4. Experiments
4.1. Datasets
Two widely used datasets for video inpainting, DAVIS [18] and YouTube-VOS [19], are adopted to train and evaluate the proposed network.
The YouTube-VOS dataset is a large-scale benchmark for video object segmentation. The videos are sourced from YouTube and cover a wide range of categories, such as animals, sports, music, and news. The dataset contains 4453 high-resolution video sequences with diverse object types and motion patterns; the training, validation, and test sets contain 3471, 474, and 508 sequences, respectively.
The DAVIS dataset consists of videos with varying degrees of complexity, containing multiple objects with different shapes, sizes, and motions. It contains 150 high-quality video sequences (90 for testing and 60 for training) and has been used in a wide range of research projects and competitions. A subset of 90 video sequences provides all frames densely annotated with pixel-level segmentation masks for the objects of interest.
Following [16], two types of free-form masks, moving masks and stationary masks, are used to train the network. Moving masks simulate real-world applications such as object removal, while stationary masks correspond to tasks such as watermark removal. We train the proposed model on the training set of the YouTube-VOS dataset and evaluate it on both the DAVIS and YouTube-VOS datasets. For the ablation studies, we conduct experiments on the DAVIS dataset.
4.2. Implementation Details
The channel dimension of the encoder and decoder in our model is set to 128. Eight stacked flow-guided spatial temporal transformer blocks are utilized in FSTT, and the dimension of the tokens is set to 512. We conduct experiments on videos at a fixed resolution. We initialize the optical flow estimation network with the pre-trained weights of SpyNet to provide prior knowledge about optical flow.
The Adam optimizer [49] is adopted to train the model. The initial learning rate is set to 0.0001 and decreased by a factor of 10 at 400 K iterations. When the interval between sampled video frames is too large, the object motion becomes larger, which degrades the accuracy of the flow estimation. Thus, during training, we sample five temporally adjacent frames as local frames and randomly sample an additional three frames as non-local frames; the non-local frames do not undergo the flow-guided feature propagation operations in the model, enabling the model to capture information from temporally distant frames. The model is trained on NVIDIA 2080 Ti GPUs with a batch size of 8. For the ablation studies, we conduct experiments on the DAVIS dataset and train the model for 250 K iterations.
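A small sketch of the training-time frame sampling described above: five temporally adjacent local frames plus three randomly chosen non-local frames. The index handling is illustrative.

```python
import random

def sample_training_frames(num_frames, n_local=5, n_nonlocal=3):
    """Pick n_local adjacent frames and n_nonlocal distant frames from a clip of num_frames."""
    start = random.randint(0, num_frames - n_local)
    local = list(range(start, start + n_local))
    remaining = [i for i in range(num_frames) if i not in local]
    nonlocal_idx = random.sample(remaining, min(n_nonlocal, len(remaining)))
    return local, sorted(nonlocal_idx)

# Example: indices for a 60-frame clip.
local_idx, nonlocal_idx = sample_training_frames(60)
```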
4.3. Quantitative Evaluation
For quantitative comparison, we utilize PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and VFID (Video-based Fréchet Inception Distance) [50] as evaluation metrics. PSNR and SSIM are commonly used in image and video processing to evaluate the quality of a compressed or distorted image or video against the original. We calculate the scores frame by frame and report the mean value. To further evaluate the quality and temporal consistency of the videos, we also adopt the VFID metric. VFID is an extension of the Fréchet Inception Distance (FID) that measures the perceptual similarity between videos. In practice, we utilize a pre-trained I3D video recognition model to calculate the VFID.
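For reference, PSNR can be computed per frame and averaged over the video, as in the evaluation protocol described above; SSIM follows the same frame-averaging idea, and VFID is computed on I3D features. The snippet below covers PSNR only and is an illustrative helper, not the official evaluation code.

```python
import torch

def video_psnr(pred, gt, max_val=1.0):
    """pred, gt: (t, 3, H, W) tensors in [0, max_val]; returns the mean per-frame PSNR in dB."""
    mse = ((pred - gt) ** 2).flatten(1).mean(dim=1)            # per-frame mean squared error
    psnr = 10.0 * torch.log10(max_val ** 2 / mse.clamp(min=1e-12))
    return psnr.mean().item()
```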
To evaluate the effectiveness of the proposed model, we compare FSTT with several recent competitive video inpainting methods, including VINet [8], DFGVI [7], CAP [4], FGVC [9], STTN [10], FuseFormer [11], and E2FGVI [16]. Among them, DFGVI and FGVC are flow-based methods, while STTN and FuseFormer are transformer-based methods. E2FGVI combines flow and transformer techniques to solve the video inpainting problem. We utilize the stationary masks generated by [16] to perform the quantitative evaluation.
The quantitative evaluation results on the DAVIS and YouTube-VOS datasets are shown in Table 1. It can be seen that the transformer-based methods [10,11] greatly improve the performance of video inpainting, and E2FGVI further introduces optical flow into the transformer, achieving better results. Our method takes full advantage of optical flow guidance in the transformer to alleviate the inaccuracies caused by missing pixels and employs a decomposed spatial temporal MHSA to effectively integrate spatio-temporal information in videos, outperforming all the state-of-the-art video inpainting methods in terms of PSNR, SSIM, and VFID. This demonstrates the superiority of the proposed method.
4.4. Qualitative Evaluation
To present the visual quality of the inpainting results, we select three representative methods, DFGVI [7], FuseFormer [11], and E2FGVI [16], for qualitative evaluation. The qualitative results compared with these baselines are shown in Figure 5 and Figure 6, where Figure 5 shows the completion results for stationary masks and Figure 6 shows the object removal results. From the visual results, we find that our proposed method generates more perceptually pleasing and temporally coherent results than the baselines.
Among the compared methods, the flow-based method DFGVI utilizes completed optical flow to propagate information between frames, which makes it sensitive to the quality of the flow: inaccurate flow estimation misleads the temporal information propagation and degrades the result quality. As shown in Figure 5 and Figure 6, its inpainted results contain obvious artifacts. The transformer-based method FuseFormer has difficulty finding high-quality correspondences when dealing with complex motion, which leads to blurry results. Compared to E2FGVI, our model generates more visually pleasing results in both stationary and motion scenes.
Furthermore, we conduct a user study to assess the visual quality of the different methods for a more comprehensive comparison. We choose two state-of-the-art video inpainting methods, FuseFormer [11] and E2FGVI [16], for the user study. In practice, 15 participants are invited, and 20 video sequences under the two types of masks are sampled for evaluation. In each trial, participants are shown the inpainting results of the different methods and asked to rank them. The results of the user study are presented in Figure 7. As we can see, our method achieves better rankings than the other methods in most cases.
4.5. Efficiency Analysis
We use FLOPs and inference time to evaluate the efficiency of each method, comparing our method with the transformer-based methods that have achieved promising results among existing video inpainting methods on the DAVIS dataset. The comparison results are reported in Table 2. Although the FLOPs and running time of FSTT are slightly higher in some cases, our method achieves competitive results, demonstrating the effectiveness of the proposed method.
4.6. Ablation Study
In this section, we validate the effectiveness of the designed modules in FSTT. We mainly perform effectiveness studies on the Flow-Guided F3N module and the decomposed spatial temporal MHSA module. Furthermore, we also analyze the temporal consistency of inpainting results and the effectiveness of the optical flow completion network.
4.6.1. Effectiveness of Flow-Guided F3N Module
The Flow-Guided F3N module propagates information between frames based on the completed optical flow. This module provides more effective information for the hole regions and mitigates the degradation caused by pixels within the hole region when performing MHSA in the subsequent stage, improving the inpainting quality. To investigate the impact of the Flow-Guided F3N module, we replace it with F3N and take this model as the baseline. We analyze the effectiveness of the Flow-Guided F3N module in detail under four settings: (a) without any feature propagation; (b) only involving deformable convolution to perform feature propagation; (c) only utilizing optical flow; (d) combining deformable convolution with flow guidance. The quantitative comparison results are reported in Table 3, and the visual comparison results are shown in Figure 8.
As shown in Table 3, all propagation settings improve the quantitative performance compared to the model without any feature propagation, which validates the importance of performing feature propagation between frames. Furthermore, the combination of deformable convolution and flow guidance helps our model propagate more accurate information between frames, achieving the best inpainting quality. In Figure 8, the inpainting results produced by the model with F3N tend to contain discontinuous content. With the assistance of deformable convolution and optical flow guidance, effective information is propagated from adjacent frames to the hole regions, enabling more accurate correspondences to be found; the structures in the inpainted results gradually become smoother and more accurate.
4.6.2. Effectiveness of Decomposed Spatial Temporal MHSA Module
To evaluate the effectiveness of the decomposed spatial temporal MHSA, we compare the performance of models with different attention mechanisms, including the model without spatial MHSA, the model with general temporal MHSA, and the model with the proposed global–local temporal MHSA. As shown in Table 4, the introduction of spatial MHSA effectively integrates the spatial information in video frames, which improves the performance of the model. The improvement is modest because most scenes in the DAVIS dataset are dynamic; therefore, the temporal MHSA has a greater impact and the spatial MHSA has less influence, which reflects the real-world situation. The general temporal MHSA achieves the best quantitative performance, but it suffers from heavy computation. Our proposed global–local temporal MHSA achieves comparable performance at a reduced computational cost.
4.6.3. Optical Flow Completion
We study the effectiveness of the optical flow completion network by comparing the optical flow completed by our proposed model with that produced by DFGVI. The comparison results are shown in Figure 9. DFGVI employs a multi-stage network for optical flow completion. However, as can be seen from Figure 9, DFGVI fails to inpaint the optical flow well, resulting in severe artifacts in the completed results. Our method trains the network in an end-to-end manner, allowing the model to learn the optical flow adaptively. While our model does not recover the optical flow exactly, it produces similar flow fields that provide valuable information for the subsequent transformer blocks, leading to promising results.
4.6.4. Temporal Consistency
Furthermore, to show the temporal consistency of the proposed method, we visualize the temporal profiles of the corresponding videos. The results are shown in Figure 10. In the temporal profiles, we observe that our method produces sharp and smooth edges, indicating that the completed videos contain far fewer flickering artifacts and maintain temporal consistency.
5. Conclusions and Future Work
In this paper, a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture is proposed for deep video inpainting. FSTT explores how to effectively utilize the transformer to establish correspondences between missing regions and valid regions in both the spatial and temporal dimensions under the guidance of completed optical flow, capturing spatio-temporal information to perform video inpainting. Two elaborately designed modules, the Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module and the decomposed spatial temporal MHSA module, are utilized to address the problems of previous methods. The Flow-Guided F3N module provides more effective information for the subsequent stages through flow-guided propagation and alleviates the inaccuracy caused by hole pixels when performing MHSA. The decomposed spatial temporal MHSA module effectively integrates the spatio-temporal information in videos. Furthermore, a Global–Local Temporal Attention mechanism based on the window partition strategy is designed to improve the efficiency of the proposed model. The quantitative and qualitative experimental results on the DAVIS and YouTube-VOS datasets demonstrate the superiority of the proposed FSTT.
An improved version [51] of [17] also explores how to make full use of optical flow guidance in transformers for video inpainting and achieves promising results. Different from our method, the solution in [51] further introduces the completed optical flow into the temporal MHSA and spatial MHSA to enhance feature integration. Furthermore, reference [51] elaborately designs an individual flow completion network and introduces an edge loss to train it, which improves the quality of the completed flow. The way reference [51] combines optical flow and transformers provides a useful direction for us. However, as mentioned in reference [51], that method also has some significant limitations. First, it depends highly on the quality of the completed flows, and incorrect optical flow will greatly degrade the final inpainting quality. Additionally, its computational speed is slow because of operations such as Poisson blending and pixel-level propagation. In contrast, our method performs information propagation at the feature level and adaptively learns the optical flow in an end-to-end manner, which improves the efficiency and effectiveness of the model.
We also note that FSTT still has some limitations. When the video is occluded by a large mask, FSTT tends to generate blurry inpainting results, as shown in Figure 11a. We infer that when the missing region is too large, the amount of information within the receptive field is limited; as a result, it becomes difficult to capture enough image patches from the valid regions, leading to blurry results. A possible remedy is a scheme that progressively fills the hole region starting from the hole boundary, which would allow the receptive field to capture more relevant contextual information and improve the quality of the inpainting results. Additionally, as shown in Figure 11b, FSTT fails to generate plausible content when moving objects have a large number of missing details. Due to the motion inconsistency between foreground and background, completing moving foreground objects with many missing details is very challenging for current video inpainting methods. One promising direction is to separate the foreground and background and inpaint them separately.