Joint Video Super-Resolution and Frame Interpolation via Permutation Invariance
Abstract
1. Introduction
- We propose the permutation invariant residual block (PIRB), which processes the input frames in a permutation invariant manner while effectively extracting their complementary features. The learned features in turn guide both tasks, yielding visually pleasing results.
- We propose a feature attention mechanism for the joint task that focuses the network on important regions and suppresses unwanted artifacts.
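The core idea behind the PIRB can be illustrated with a minimal NumPy sketch (this is an illustrative toy, not the authors' implementation): the same weights are applied to every frame, per-frame features are aggregated with a symmetric pooling operation (average or max), and the pooled set-level features are concatenated back onto each frame's own features. All function and variable names here are hypothetical, and the per-frame "convolution" is stood in for by a simple matrix product.

```python
import numpy as np

def shared_transform(x, w):
    # Toy stand-in for a shared 3x3 convolution + ReLU: the same
    # weights w are applied to every frame, so no frame is treated
    # specially by its position in the input set.
    return np.maximum(x @ w, 0.0)

def pirb(frames, w1, w2, pool="avg"):
    # frames: list of (H*W, C) feature maps, one per input frame.
    h = [shared_transform(f, w1) for f in frames]        # per-frame branch
    g = (np.mean(h, axis=0) if pool == "avg"
         else np.max(h, axis=0))                         # symmetric pooling
    # Concatenate the pooled (set-level) features back onto each
    # frame's own features, then transform and pool once more.
    fused = [np.concatenate([hi, g], axis=-1) for hi in h]
    out = [shared_transform(f, w2) for f in fused]
    return np.mean(out, axis=0)                          # final symmetric pool

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((32, 16))
frames = [rng.standard_normal((10, 8)) for _ in range(3)]

y1 = pirb(frames, w1, w2)
y2 = pirb(frames[::-1], w1, w2)   # reversed frame order
assert np.allclose(y1, y2)        # output is invariant to frame permutation
```

Because every per-frame operation shares weights and the only cross-frame interaction is a symmetric pooling, reordering the input frames provably leaves the output unchanged, which is the property the block is named for.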
2. Related Work
2.1. Super Resolution
2.2. Frame Interpolation
2.3. Spatio-Temporal Super Resolution
2.4. Permutation Invariance
3. Proposed Method
3.1. Network Architecture
3.1.1. Flow Estimation Module
3.1.2. Permutation Invariant Residual Network (PIRN)
3.1.3. Upsampling CNN Decoder
3.1.4. Network Architecture Details
3.2. Training Details
4. Experimental Results
4.1. Quantitative Results
4.2. Qualitative Comparisons
4.3. Ablation Study
4.4. Application: Video Compression Effect
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1664–1673. [Google Scholar]
- Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent Back-Projection Network for Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3897–3906. [Google Scholar]
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
- Wang, X.; Chan, K.C.; Yu, K.; Dong, C.; Change Loy, C. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
- Bao, W.; Lai, W.S.; Ma, C.; Zhang, X.; Gao, Z.; Yang, M.H. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3703–3712. [Google Scholar]
- Jiang, H.; Sun, D.; Jampani, V.; Yang, M.H.; Learned-Miller, E.; Kautz, J. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9000–9008. [Google Scholar]
- Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Aittala, M.; Durand, F. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2018; pp. 731–747. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R.R.; Smola, A.J. Deep sets. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3391–3401. [Google Scholar]
- Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video enhancement with task-oriented flow. Int. J. Comput. Vision 2019, 127, 1106–1125. [Google Scholar] [CrossRef] [Green Version]
- Liu, C.; Sun, D. A Bayesian approach to adaptive video super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 20–25 June 2011. [Google Scholar]
- Tao, X.; Gao, H.; Liao, R.; Wang, J.; Jia, J. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4472–4480. [Google Scholar]
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2014; pp. 184–199. [Google Scholar]
- Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
- Kim, J.; Kwon Lee, J.; Mu Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
- Kim, J.; Kwon Lee, J.; Mu Lee, K. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
- Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2018; pp. 286–301. [Google Scholar]
- Liao, R.; Tao, X.; Li, R.; Ma, Z.; Jia, J. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 531–539. [Google Scholar]
- Kappeler, A.; Yoo, S.; Dai, Q.; Katsaggelos, A.K. Video super-resolution with convolutional neural networks. IEEE Trans. Comput. Imaging 2016, 2, 109–122. [Google Scholar] [CrossRef]
- Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally Deformable Alignment Network for Video Super-Resolution. arXiv 2018, arXiv:1812.02898. [Google Scholar]
- Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4778–4787. [Google Scholar]
- Liu, D.; Wang, Z.; Fan, Y.; Liu, X.; Wang, Z.; Chang, S.; Huang, T. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2507–2515. [Google Scholar]
- Jo, Y.; Wug Oh, S.; Kang, J.; Joo Kim, S. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3224–3232. [Google Scholar]
- Kim, S.Y.; Lim, J.; Na, T.; Kim, M. 3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks. arXiv 2018, arXiv:1812.09079. [Google Scholar]
- Wronski, B.; Garcia-Dorado, I.; Ernst, M.; Kelly, D.; Krainin, M.; Liang, C.K.; Levoy, M.; Milanfar, P. Handheld Multi-Frame Super-Resolution. ACM Trans. Graph. 2019, 38, 1–18. [Google Scholar] [CrossRef] [Green Version]
- Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Niklaus, S.; Liu, F. Context-aware Synthesis for Video Frame Interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4463–4471. [Google Scholar]
- Oh, T.H.; Jaroensri, R.; Kim, C.; Elgharib, M.; Durand, F.; Freeman, W.T.; Matusik, W. Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2018. [Google Scholar]
- Kim, S.Y.; Oh, J.; Kim, M. FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss. arXiv 2019, arXiv:1912.07213. [Google Scholar] [CrossRef]
- Li, T.; He, X.; Teng, Q.; Wang, Z.; Ren, C. Space-time super-resolution with patch group cuts prior. Signal Process. Image Commun. 2015, 30, 147–165. [Google Scholar] [CrossRef]
- Shahar, O.; Faktor, A.; Irani, M. Space-time super-resolution from a single video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 20–25 June 2011. [Google Scholar]
- Sharma, M.; Chaudhury, S.; Lall, B. Space-Time Super-Resolution Using Deep Learning Based Framework. In Proceedings of the International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, 5–8 December 2017; Springer: Berlin, Germany, 2017; pp. 582–590. [Google Scholar]
- Shechtman, E.; Caspi, Y.; Irani, M. Space-time super-resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 531–545. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Zhang, T.; Gao, K.; Ni, G.; Fan, G.; Lu, Y. Spatio-temporal super-resolution for multi-videos based on belief propagation. Signal Process. Image Commun. 2018, 68, 1–12. [Google Scholar] [CrossRef]
- Ilse, M.; Tomczak, J.M.; Welling, M. Attention-based deep multiple instance learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Tsai, R.; Huang, T. Multiframe image restoration and registration. In Advances in Computer Vision and Image Processing; JAI Press, Inc.: Greenwich, CT, USA, 1984; pp. 317–339. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8934–8943. [Google Scholar]
- Haris, M.; Shakhnarovich, G.; Ukita, N. Space-Time-Aware Multi-Resolution Video Enhancement. arXiv 2020, arXiv:2003.13170. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Pickup, L.C.; Pan, Z.; Wei, D.; Shih, Y.; Zhang, C.; Zisserman, A.; Scholkopf, B.; Freeman, W.T. Seeing the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2035–2042. [Google Scholar]
- Cisco. Visual Networking Index: Forecast and Trends, 2017–2022 White Paper. Available online: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html (accessed on 16 November 2022).
- Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2020. [Google Scholar]
- Byung-Ki, K.; Hyeon-Woo, N.; Kim, J.Y.; Oh, T.H. DFlow: Learning to Synthesize Better Optical Flow Datasets via a Differentiable Pipeline. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Layer Name | Channels | Stride | Upscale | Activation
---|---|---|---|---
Conv0 | 3 | 1 | - | ReLU
Conv1 | 64 | 1 | - | ReLU
Sym. pooling | 64 | 1 | - | Max/Avg
Concat (w/ Conv1) | 64 + 64 | - | - | -
PConv_R0 | 64 | 1 | - | ReLU
PConv_R1 | 64 | 1 | - | ReLU
PConv_S0 | 64 | 1 | - | ReLU
PConv_S1 | 64 | 1 | - | Sigmoid
Sym. pooling | 64 | 1 | - | Max/Avg
Conv3 | 64 | 1 | - | ReLU
UpScale | 64 | - | - | -
Conv4 | 64 | 1 | - | ReLU
Conv5 | 3 | 1 | - | ReLU
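The PConv_R and PConv_S rows of the architecture table suggest a two-branch design: a ReLU residual branch and a sigmoid-terminated branch that produces a soft attention mask in [0, 1], which scales the residual features element-wise. The following NumPy sketch illustrates that gating pattern under these assumptions (all names are hypothetical; matrix products stand in for the convolutions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(feat, wr0, wr1, ws0, ws1):
    # Residual branch (PConv_R0 -> PConv_R1), both ReLU per the table.
    r = np.maximum(feat @ wr0, 0.0)
    r = np.maximum(r @ wr1, 0.0)
    # Attention branch (PConv_S0 ReLU -> PConv_S1 sigmoid): a soft
    # mask in [0, 1] applied element-wise to the residual features,
    # emphasizing important regions and damping artifact-prone ones.
    s = np.maximum(feat @ ws0, 0.0)
    s = sigmoid(s @ ws1)
    return r * s

rng = np.random.default_rng(1)
feat = rng.standard_normal((10, 64))
weights = [rng.standard_normal((64, 64)) for _ in range(4)]
out = attention_gate(feat, *weights)
assert out.shape == feat.shape
```

Because the mask is bounded in [0, 1], the gate can only attenuate the residual features, never amplify them, which is one common way such attention modules suppress unwanted responses.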
Method | #param. (Million) | Vimeo90k PSNR | Vimeo90k SSIM | Vid4 PSNR | Vid4 SSIM | SPMCS PSNR | SPMCS SSIM
---|---|---|---|---|---|---|---
SepConv- [8] → Bicubic | 21.6 | 33.1487 | 0.9589 | 30.0614 | 0.8760 | 31.0992 | 0.9174
SepConv- [8] → RBPN [2] | 34.4 | 32.4599 | 0.9283 | 29.5295 | 0.8224 | 31.2464 | 0.9034
SepConv- [8] → DBPN [1] | 32.0 | 32.6833 | 0.9349 | 29.7292 | 0.8337 | 31.2743 | 0.9043
SuperSlomo [7] → Bicubic | 19.8 | 32.6034 | 0.9556 | 29.7232 | 0.8627 | 30.9245 | 0.9115
SuperSlomo [7] → RBPN [2] | 32.6 | 32.9948 | 0.9612 | 29.8192 | 0.8711 | 31.0364 | 0.9152
SuperSlomo [7] → DBPN [1] | 30.2 | 32.9835 | 0.9612 | 29.8260 | 0.8710 | 31.0405 | 0.9152
DAIN [6] → Bicubic | 24.0 | 33.0474 | 0.9628 | 30.0717 | 0.8931 | 31.0960 | 0.9167
DAIN [6] → RBPN [2] | 36.8 | 33.8300 | 0.9730 | 30.4270 | 0.9201 | 31.2514 | 0.9024
DAIN [6] → DBPN [1] | 34.4 | 33.7916 | 0.9737 | 30.4284 | 0.9196 | 31.2758 | 0.9029
Ours (Max-pooling) | 12.0 | 34.3556 | 0.9730 | 30.6366 | 0.9117 | 31.2392 | 0.9192
Ours (Avg.-pooling) | 12.0 | 34.4841 | 0.9739 | 30.7144 | 0.9169 | 31.2145 | 0.9172
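The PSNR figures reported above follow the standard peak signal-to-noise ratio definition. As a quick reference, a minimal implementation (the function name and normalization to a [0, 1] intensity range are assumptions, not the authors' exact evaluation code):

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE),
    # computed between a reference frame and a restored frame.
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
img = np.full((4, 4), 0.1)   # uniform error of 0.1 -> MSE = 0.01
assert abs(psnr(ref, img) - 20.0) < 1e-9  # 10 * log10(1 / 0.01) = 20 dB
```

Higher is better; SSIM, the companion metric in the table, additionally accounts for local structure and luminance/contrast similarity rather than raw pixel error.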
Method | Vimeo90k SSIM
---|---
TOFlow [12] → DBPN [1] | 0.897
DBPN [1] → DAIN [6] | 0.918
STAR- [42] | 0.926
STAR-ST- [42] | 0.927
STAR-ST- [42] | 0.927
Ours | 0.974
Configuration | Vimeo90k PSNR | Vimeo90k SSIM | Vid4 PSNR | Vid4 SSIM | SPMCS PSNR | SPMCS SSIM
---|---|---|---|---|---|---
Order independent (Avg) | 34.4841 | 0.9739 | 30.7144 | 0.9169 | 31.2145 | 0.9172
Order dependent (,) | 34.2758 | 0.9721 | 30.5972 | 0.9100 | 31.2385 | 0.9192
Order dependent (,) | 34.2761 | 0.9721 | 30.5945 | 0.9099 | 31.2386 | 0.9192
w/o Attention module | 34.3954 | 0.9732 | 30.6194 | 0.9125 | 31.1806 | 0.9165
Order independent (Max) | 34.3556 | 0.9730 | 30.6366 | 0.9117 | 31.2392 | 0.9192
Order independent (4-frame) | 34.7363 | 0.9746 | 30.5990 | 0.9154 | 31.2914 | 0.9198
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Choi, J.; Oh, T.-H. Joint Video Super-Resolution and Frame Interpolation via Permutation Invariance. Sensors 2023, 23, 2529. https://doi.org/10.3390/s23052529