ATGT3D: Animatable Texture Generation and Tracking for 3D Avatars
Abstract
1. Introduction
- We propose the Eye Diffusion Module (EDM), which refines 3D human texture parameters by combining texture seeds with diffusion models, generating nearly complete 3D human texture maps (a minimal inpainting sketch follows this list).
- We introduce the Pose Tracking Diffusion Module (PTDM), a human motion module for mesh-level animatable human model datasets: a simple and effective texture motion tracking framework that generates temporally coherent texture motion from a single image.
- Using the BEAT2 and AMASS datasets, we train a human pose synchronization model that generates body and facial gestures from only three seed poses, significantly improving the fidelity and diversity of the results.
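To make the EDM contribution above concrete, the following is a minimal sketch of mask-conditioned diffusion inpainting over a partial UV texture map, using a RePaint-style known-region constraint and a DDIM-style update. The `denoiser` network, step count, and noise schedule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: fill the unobserved texels of a UV texture with a diffusion
# model while keeping observed texels ("texture seeds") fixed.
import torch

def inpaint_texture(denoiser, seed_tex, known_mask, steps=50):
    """seed_tex: (1,3,H,W) partial UV texture; known_mask: (1,1,H,W), 1 = observed."""
    x = torch.randn_like(seed_tex)                 # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)      # assumed linear schedule
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):
        a_t = alphas_cum[t]
        # Re-noise the observed texels to the current noise level and paste
        # them in, so known regions steer the generated regions (RePaint-style).
        noised_known = a_t.sqrt() * seed_tex + (1 - a_t).sqrt() * torch.randn_like(seed_tex)
        x = known_mask * noised_known + (1 - known_mask) * x
        eps = denoiser(x, torch.tensor([t]))       # predict the added noise
        a_prev = alphas_cum[t - 1] if t > 0 else torch.tensor(1.0)
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # clean-texture estimate
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # DDIM-style step
    return known_mask * seed_tex + (1 - known_mask) * x  # observed texels stay exact
```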
2. Related Work
2.1. Model Transformation
2.2. Texture Repair
2.3. Motion Tracking
2.4. Literature Review
3. Datasets and Preprocessing
3.1. HUMBI Dataset
3.2. AMASS Dataset
3.3. BEAT2 Dataset
4. Generation Architecture
4.1. Eye Diffusion Module (EDM)
4.2. Pose Tracking Diffusion Module (PTDM)
4.3. ATGT3D Network Architecture
5. Experiments and Results
5.1. Eye Reconstruction
5.2. Motion Texture Reconstruction
| Method | FGD ↓ | FID ↓ | Diversity ↑ | MSE ↓ | LVD ↓ |
|---|---|---|---|---|---|
| Baseline | 13.080 | 6.941 | 8.3145 | 1.442 | 9.317 |
| +VQVAE | 9.787 | 6.673 | 10.624 | 1.619 | 9.473 |
| +4 VQVAE | 7.397 | 6.698 | 12.544 | 1.243 | 8.938 |
| FACT | 6.673 | 6.371 | 12.954 | 1.203 | 8.998 |
| +Masked Hints | 5.423 | 6.794 | 13.057 | 1.180 | 9.015 |
| PTDM (ours) | 5.214 | 6.641 | 13.213 | 1.091 | 8.265 |
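The FGD and FID columns above are both instances of the Fréchet distance between Gaussians fitted to feature embeddings (gesture features for FGD, Inception features for FID); lower is better. A minimal sketch of that shared computation, assuming the feature extraction has already been done:

```python
# Frechet distance between two Gaussians fitted to real vs. generated features:
# ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """real_feats, fake_feats: (N, D) arrays of embedding vectors."""
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root
    covmean = covmean.real                                # drop numerical imaginary parts
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * covmean))
```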
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fu, J.; Li, S.; Jiang, Y.; Lin, K.Y.; Qian, C.; Loy, C.C.; Wu, W.; Liu, Z. Stylegan-human: A data-centric odyssey of human generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–19. [Google Scholar]
- Grigorev, A.; Iskakov, K.; Ianina, A.; Bashirov, R.; Zakharkin, I.; Vakhitov, A.; Lempitsky, V. Stylepeople: A generative model of fullbody human avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5151–5160. [Google Scholar]
- Lewis, K.M.; Varadharajan, S.; Kemelmacher-Shlizerman, I. Tryongan: Body-aware try-on via layered interpolation. ACM Trans. Graph. 2021, 40, 1–10. [Google Scholar] [CrossRef]
- Men, Y.; Mao, Y.; Jiang, Y.; Ma, W.Y.; Lian, Z. Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5084–5093. [Google Scholar]
- Pumarola, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8620–8628. [Google Scholar]
- Sarkar, K.; Golyanik, V.; Liu, L.; Theobalt, C. Style and pose control for image synthesis of humans from a single monocular view. arXiv 2021, arXiv:2102.11263. [Google Scholar]
- Sarkar, K.; Liu, L.; Golyanik, V.; Theobalt, C. Humangan: A generative model of human images. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 258–267. [Google Scholar]
- Vahdat, A.; Kreis, K. Improving Diffusion Models as an Alternative to GANs, Part 1. NVIDIA Technical Blog; NVIDIA Developer: Santa Clara, CA, USA, 2022. [Google Scholar]
- Guo, C.; Zuo, X.; Wang, S.; Cheng, L. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 580–597. [Google Scholar]
- Ebert, D. Texturing & Modeling: A Procedural Approach; Morgan Kaufmann: San Francisco, CA, USA, 2002. [Google Scholar]
- Jiang, W.; Yi, K.M.; Samei, G.; Tuzel, O.; Ranjan, A. Neuman: Neural human radiance field from a single video. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 402–418. [Google Scholar]
- Noguchi, A.; Sun, X.; Lin, S.; Harada, T. Neural articulated radiance field. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 5762–5772. [Google Scholar]
- Peng, S.; Dong, J.; Wang, Q.; Zhang, S.; Shuai, Q.; Zhou, X.; Bao, H. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 14314–14323. [Google Scholar]
- Prokudin, S.; Black, M.J.; Romero, J. Smplpix: Neural avatars from 3d human models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 1810–1819. [Google Scholar]
- Weng, C.Y.; Curless, B.; Srinivasan, P.P.; Barron, J.T.; Kemelmacher-Shlizerman, I. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16210–16220. [Google Scholar]
- Wang, J.; Zhong, Y.; Li, Y.; Zhang, C.; Wei, Y. Re-identification supervised texture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 11846–11856. [Google Scholar]
- Jiang, Y.; Yang, S.; Qiu, H.; Wu, W.; Loy, C.C.; Liu, Z. Text2human: Text-driven controllable human image generation. ACM Trans. Graph. (TOG) 2022, 41, 1–11. [Google Scholar] [CrossRef]
- Neverova, N.; Guler, R.A.; Kokkinos, I. Dense pose transfer. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 123–138. [Google Scholar]
- Xu, X.; Loy, C.C. 3D human texture estimation from a single image with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13849–13858. [Google Scholar]
- Zhao, F.; Liao, S.; Zhang, K.; Shao, L. Human parsing based texture transfer from single image to 3D human via cross-view consistency. Adv. Neural Inf. Process. Syst. 2020, 33, 14326–14337. [Google Scholar]
- Lazova, V.; Insafutdinov, E.; Pons-Moll, G. 360-degree textures of people in clothing from a single image. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 643–653. [Google Scholar]
- Alldieck, T.; Zanfir, M.; Sminchisescu, C. Photorealistic monocular 3d reconstruction of humans wearing clothing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 1506–1515. [Google Scholar]
- He, T.; Xu, Y.; Saito, S.; Soatto, S.; Tung, T. Arch++: Animation-ready clothed human reconstruction revisited. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11046–11056. [Google Scholar]
- Li, Z.; Zheng, Z.; Zhang, H.; Ji, C.; Liu, Y. Avatarcap: Animatable avatar conditioned monocular human volumetric capture. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 322–341. [Google Scholar]
- Natsume, R.; Saito, S.; Huang, Z.; Chen, W.; Ma, C.; Li, H.; Morishima, S. Siclope: Silhouette-based clothed people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4480–4490. [Google Scholar]
- Zheng, Z.; Yu, T.; Liu, Y.; Dai, Q. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3170–3184. [Google Scholar] [CrossRef] [PubMed]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 2015, 34, 248:1–248:16. [Google Scholar] [CrossRef]
- Kurita, T. Principal component analysis (PCA). In Computer Vision: A Reference Guide; Springer: Cham, Switzerland, 2019; pp. 1–4. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2021, arXiv:2112.10752. [Google Scholar]
- Grigorev, A.; Sevastopolsky, A.; Vakhitov, A.; Lempitsky, V. Coordinate-based texture inpainting for pose-guided human image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 12135–12144. [Google Scholar]
- Liu, H.; Zhu, Z.; Becherini, G.; Peng, Y.; Su, M.; Zhou, Y.; Iwamoto, N.; Zheng, B.; Black, M.J. Emage: Towards unified holistic co-speech gesture generation via masked audio gesture modeling. arXiv 2024, arXiv:2401.00374. [Google Scholar]
- Cheong, S.Y.; Mustafa, A.; Gilbert, A. Kpe: Keypoint pose encoding for transformer-based image generation. arXiv 2022, arXiv:2203.04907. [Google Scholar]
- Hong, F.; Zhang, M.; Pan, L.; Cai, Z.; Yang, L.; Liu, Z. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv 2022, arXiv:2205.08535. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Loper, M.; Mahmood, N.; Black, M.J. MoSh: Motion and shape capture from sparse markers. ACM Trans. Graph. 2014, 33, 220. [Google Scholar] [CrossRef]
- Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5442–5451. [Google Scholar]
- Yu, Z.; Yoon, J.S.; Lee, I.K.; Venkatesh, P.; Park, J.; Yu, J.; Park, H.S. Humbi: A large multiview dataset of human body expressions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2990–3000. [Google Scholar]
- Krebs, F.; Meixner, A.; Patzer, I.; Asfour, T. The KIT Bimanual Manipulation Dataset. In Proceedings of the IEEE/RAS International Conference on Humanoid Robots (Humanoids), Munich, Germany, 18–20 July 2021; pp. 499–506. [Google Scholar]
- Firmani, F.; Park, E.J. A framework for the analysis and synthesis of 3D dynamic human gait. Robotica 2012, 30, 145–157. [Google Scholar] [CrossRef]
- Cai, Y.; Wang, Y.; Zhu, Y.; Cham, T.J.; Cai, J.; Yuan, J.; Liu, J.; Zheng, C.; Yan, S.; Ding, H.; et al. A unified 3d human motion synthesis model via conditional variational auto-encoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11645–11655. [Google Scholar]
- Ghorbani, S.; Mahdaviani, K.; Thaler, A.; Kording, K.; Cook, D.J.; Blohm, G.; Troje, N.F. MoVi: A large multi-purpose human motion and video dataset. PLoS ONE 2021, 16, e0253157. [Google Scholar] [CrossRef] [PubMed]
- Mandery, C.; Terlemez, O.; Do, M.; Vahrenkamp, N.; Asfour, T. The KIT Whole-Body Human Motion Database. In Proceedings of the International Conference on Advanced Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015; pp. 329–336. [Google Scholar]
- Mandery, C.; Terlemez, O.; Do, M.; Vahrenkamp, N.; Asfour, T. Unifying Representations and Large-Scale Whole-Body Motion Databases for Studying Human Motion. IEEE Trans. Robot. 2016, 32, 796–809. [Google Scholar] [CrossRef]
- Güler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense Human Pose Estimation in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. Available online: https://github.com/facebookresearch/detectron2 (accessed on 1 November 2019).
- Popescu, M.C.; Balas, V.E.; Perescu-Popescu, L.; Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 2009, 8, 579–588. [Google Scholar]
- Kim, J.; Cho, H.; Kim, J.; Tiruneh, Y.Y.; Baek, S. Sddgr: Stable diffusion-based deep generative replay for class incremental object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 28772–28781. [Google Scholar]
- Yoon, Y.; Cha, B.; Lee, J.H.; Jang, M.; Lee, J.; Kim, J.; Lee, G. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. (TOG) 2020, 39, 1–16. [Google Scholar] [CrossRef]
- Soloveitchik, M.; Diskin, T.; Morin, E.; Wiesel, A. Conditional Fréchet inception distance. arXiv 2021, arXiv:2103.11521. [Google Scholar]
- Li, J.; Kang, D.; Pei, W.; Zhe, X.; Zhang, Y.; He, Z.; Bao, L. Audio2gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11293–11302. [Google Scholar]
- Xing, J.; Xia, M.; Zhang, Y.; Cun, X.; Wang, J.; Wong, T.T. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12780–12790. [Google Scholar]
- Yi, H.; Liang, H.; Liu, Y.; Cao, Q.; Wen, Y.; Bolkart, T.; Tao, D.; Black, M.J. Generating holistic 3d human motion from speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 469–480. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
- Kanazawa, A.; Tulsiani, S.; Efros, A.A.; Malik, J. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 371–386. [Google Scholar]
- Casas, D.; Comino-Trinidad, M. SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image. In Proceedings of the British Machine Vision Conference (BMVC), Aberdeen, UK, 20–24 November 2023. [Google Scholar]
| Dataset | Subjects | Motions | Minutes |
|---|---|---|---|
| KIT [39] | 55 | 4232 | 661.84 |
| BMLrub [40] | 111 | 3061 | 522.69 |
| WEIZMANN [39] | 5 | 2222 | 505.35 |
| CMU [41] | 96 | 1983 | 543.49 |
| BMLmovi [42] | 89 | 1864 | 174.39 |
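For orientation, AMASS distributes each of the sub-datasets above as SMPL-compatible `.npz` sequences whose keys typically include `poses`, `betas`, `trans`, and `mocap_framerate`. The sketch below shows how one such sequence might be read; the file path is hypothetical, and the body-pose slice assumes the usual 156-parameter SMPL+H layout (root 0:3, 21 body joints 3:66).

```python
# Read one AMASS motion sequence from its .npz archive (path is illustrative).
import numpy as np

data = np.load("ACCAD/subject/sequence_poses.npz")
poses = data["poses"]        # (T, 156) axis-angle pose parameters per frame
betas = data["betas"]        # (16,) body shape coefficients
trans = data["trans"]        # (T, 3) global root translation
fps = float(data["mocap_framerate"])
root_orient = poses[:, :3]   # global orientation
body_pose = poses[:, 3:66]   # 21 body joints x 3 axis-angle values
print(f"{poses.shape[0]} frames at {fps:.0f} fps")
```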
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, F.; Choi, J. ATGT3D: Animatable Texture Generation and Tracking for 3D Avatars. Electronics 2024, 13, 4562. https://doi.org/10.3390/electronics13224562