Open Access Journal Article

Improved Pose-Controlled Animation: A Quantitative and Qualitative Analysis

by Qinghui Xu 1, YanLin Wu 2, Yajun Yuan 3, Zongqi Ge 4 and Khang Wen Goh 1,*
1 Faculty of Data Science and Information Technology, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
2 School of Mathematics Education Management, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
3 University of East London Singapore Campus, 069542, Singapore
* Author to whom correspondence should be addressed.
Received: 1 December 2024 / Accepted: 18 December 2024 / Published Online: 26 December 2024

Abstract

Character animation, which aims to generate dynamic character videos from static images, has gained significant attention in recent years. Although diffusion models have established themselves as the leading approach to visual generation owing to their strong generative capabilities, challenges remain in image-to-video synthesis, particularly for character animation: preserving temporal consistency and retaining fine-grained character details across frames continue to pose significant obstacles. In this work, we propose a novel framework specifically designed for character animation that leverages the potential of diffusion models. To maintain intricate appearance details from the reference image, we introduce ReferenceNet, a network that injects detailed reference features through spatial attention. To enhance controllability and ensure smooth motion transitions, we present an efficient pose guider that directs the character's movements, and we incorporate an effective temporal modeling strategy for seamless inter-frame consistency. By expanding the training data, our framework can animate arbitrary characters and outperforms existing image-to-video methods on character animation tasks. Experimental evaluations on benchmark image animation datasets demonstrate that our approach achieves state-of-the-art performance, setting a new standard for this domain.
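The architecture outlined above couples a ReferenceNet-style appearance branch, a pose guider, and temporal modeling around a diffusion backbone. The following is a minimal PyTorch sketch of the two conditioning paths, not the authors' implementation: it assumes that reference features are injected by letting the denoising features attend over the concatenation of themselves and the ReferenceNet features, and that the pose guider is a small convolutional encoder whose output is added to the noise latent. All class names, channel sizes, and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of the two conditioning paths
# described in the abstract. Names, channel counts, and layer choices below are
# illustrative assumptions.
import torch
import torch.nn as nn


class SpatialReferenceAttention(nn.Module):
    """Denoising features attend over themselves plus ReferenceNet features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x, ref: (batch, tokens, dim) -- spatial feature maps flattened over H*W.
        # Keys/values include the reference tokens, so fine-grained appearance
        # details from the reference image can be carried into every frame.
        kv = torch.cat([self.norm(x), self.norm(ref)], dim=1)
        out, _ = self.attn(self.norm(x), kv, kv, need_weights=False)
        return x + out


class PoseGuider(nn.Module):
    """Encodes a rendered pose map to latent resolution and adds it to the latent."""

    def __init__(self, pose_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        # Three stride-2 convolutions bring a 512x512 pose image down to the
        # 64x64 resolution of a typical latent-diffusion noise latent.
        self.encoder = nn.Sequential(
            nn.Conv2d(pose_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, pose: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        return latent + self.encoder(pose)


if __name__ == "__main__":
    x = torch.randn(2, 64 * 64, 320)    # denoising UNet features, flattened 64x64 grid
    ref = torch.randn(2, 64 * 64, 320)  # ReferenceNet features from the matching layer
    print(SpatialReferenceAttention(320)(x, ref).shape)  # torch.Size([2, 4096, 320])

    pose = torch.randn(2, 3, 512, 512)  # rendered skeleton frame
    latent = torch.randn(2, 4, 64, 64)  # noise latent for that frame
    print(PoseGuider()(pose, latent).shape)              # torch.Size([2, 4, 64, 64])
```

In a full pipeline, blocks like SpatialReferenceAttention would augment the self-attention layers of the denoising UNet, the pose guider's output would be added before the UNet's first convolution, and temporal attention across frames would be stacked on top to provide the inter-frame consistency the abstract refers to.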


Copyright: © 2024 by Xu, Wu, Yuan, Ge and Goh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (Creative Commons Attribution 4.0 International License). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Share and Cite

ACS Style
Xu, Q.; Wu, Y.; Yuan, Y.; Ge, Z.; Goh, K. W. Improved Pose-Controlled Animation: A Quantitative and Qualitative Analysis. Scientific Innovation in Asia, 2024, 2, 29. doi:10.12410/sia0201013
AMA Style
Xu Q, Wu Y, Yuan Y, Ge Z, Goh KW. Improved Pose-Controlled Animation: A Quantitative and Qualitative Analysis. Scientific Innovation in Asia. 2024; 2(1):29. doi:10.12410/sia0201013
Chicago/Turabian Style
Xu, Qinghui, YanLin Wu, Yajun Yuan, Zongqi Ge, and Khang Wen Goh. 2024. "Improved Pose-Controlled Animation: A Quantitative and Qualitative Analysis" Scientific Innovation in Asia 2, no. 1: 29. doi:10.12410/sia0201013
