
Abstract

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, a framework based on the Image-to-Video (I2V) paradigm that achieves harmonized and coherent animation and is the first to robustly ensure first-frame preservation. First, we propose a Condition-Reconciliation Mechanism that harmonizes the two conflicting conditions, enabling precise control without sacrificing fidelity. Second, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.

Motivation


  • Spatio-temporal Misalignments: We identify and tackle two issues prevalent in real-world scenarios: spatial-structural inconsistencies and temporal start-gaps between the source image and the driving video, both of which often lead to identity drift in generated animations.
  • Image-to-Video (I2V) vs. Reference-to-Video (R2V) paradigm: The R2V paradigm treats animation as binding a reference image to a driving pose. However, this binding relaxes alignment constraints and fails under spatio-temporal misalignments, producing visual artifacts when the spatial structure is inconsistent and abrupt transitions when a temporal start-gap exists. Conversely, the I2V paradigm is superior: it inherently guarantees first-frame preservation, and its Motion-to-Image Alignment yields high-fidelity, coherent video generation that starts directly from the reference state (a toy sketch contrasting the two conditioning schemes follows this list).
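
To illustrate the conditioning difference, here is a toy sketch (not the paper's code) contrasting the two paradigms under an assumed denoiser interface and video tensor layout: R2V injects the reference only as a conditioning feature, whereas I2V additionally pins the first frame to the reference, which is what guarantees first-frame preservation.

import torch

def r2v_step(denoiser, noisy_video, ref_feat, pose_seq):
    # R2V: the reference image enters only as a conditioning feature,
    # so nothing forces the output to start from the reference state.
    return denoiser(noisy_video, cond=ref_feat, pose=pose_seq)

def i2v_step(denoiser, noisy_video, ref_latent, ref_feat, pose_seq):
    # I2V: the reference latent replaces frame 0 of the noisy video
    # (assumed shape B x T x C x H x W) before each denoising step,
    # so generation starts directly from the reference image.
    pinned = torch.cat([ref_latent.unsqueeze(1), noisy_video[:, 1:]], dim=1)
    return denoiser(pinned, cond=ref_feat, pose=pose_seq)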

Method


An overview of SteadyDancer, a Human Image Animation framework based on the Image-to-Video (I2V) paradigm. First, it employs a Condition-Reconciliation Mechanism to reconcile the appearance and motion conditions, achieving precise control without sacrificing first-frame preservation. Second, it utilizes Synergistic Pose Modulation Modules to resolve critical spatio-temporal misalignments. Finally, it is trained with a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence.
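
To make the staged training idea concrete, the following is a minimal PyTorch sketch of a three-stage, decoupled-objective loop. The stage ordering mirrors the description above (motion fidelity, then visual quality, then temporal coherence), but the loss functions, batch keys, and model interface are illustrative assumptions rather than the released implementation.

import torch

def train_staged(model, loader, epochs_per_stage=1, lr=1e-5):
    # Each stage swaps in a different (assumed) objective while continuing
    # to fine-tune the same model weights, following the stage hierarchy.
    stages = [
        ("motion_fidelity",
         lambda out, batch: (out - batch["pose_target"]).abs().mean()),
        ("visual_quality",
         lambda out, batch: (out - batch["frame_target"]).pow(2).mean()),
        ("temporal_coherence",
         lambda out, batch: (out[:, 1:] - out[:, :-1]).pow(2).mean()),
    ]
    for stage_name, loss_fn in stages:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        for _ in range(epochs_per_stage):
            for batch in loader:
                # Assumed interface: reference image + pose sequence in,
                # video tensor of shape (B, T, C, H, W) out.
                out = model(batch["reference"], batch["pose_seq"])
                loss = loss_fn(out, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()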

Cases in X-Dance Benchmark

The X-Dance Benchmark focuses on 1) spatio-temporal misalignments introduced by different-source image-video pairs; and 2) visual identity preservation, temporal coherence, and motion accuracy under complex motion and appearance variations.


Cases in RealisDance-Val Benchmark

The RealisDance-Val Benchmark focuses on 1) real-world dance videos with same-source image-video pairs; and 2) the ability to synthesize realistic object dynamics that are physically consistent with the driving actions.


X-Dance Benchmark


To fill the void left by existing same-source benchmarks (such as TikTok), which fail to evaluate spatio-temporal misalignments, we propose X-Dance, a new benchmark that focuses on these challenges. The X-Dance benchmark is constructed from diverse image categories (male/female/cartoon, and upper-/full-body shots) and challenging driving videos (complex motions with blur and occlusion). Its curated set of pairings intentionally introduces spatial-structural inconsistencies and temporal start-gaps, allowing for a more robust evaluation of model generalization in the real world. You can download the X-Dance benchmark from Hugging Face.
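
If you want to fetch the benchmark programmatically, here is a minimal sketch using the huggingface_hub client; the repo_id below is a placeholder, since the exact dataset identifier is only given via the link on this page.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/X-Dance",   # placeholder, not a confirmed identifier
    repo_type="dataset",
    local_dir="./x-dance",
)
print("X-Dance benchmark downloaded to", local_dir)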

BibTeX

      
@misc{zhang2025steadydancer,
      title={SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation},
      author={Jiaming Zhang and Shengming Cao and Rui Li and Xiaotong Zhao and Yutao Cui and Xinglin Hou and Gangshan Wu and Haolan Chen and Yu Xu and Limin Wang and Kai Ma},
      year={2025},
      eprint={2511.19320},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.19320},
}