💫 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang1,†, Zixiang Zhou2,†, Teng Hu3, Ziqiao Peng4, Youliang Zhang5
Yi Chen2, Yuan Zhou2, Qinglin Lu2, Limin Wang1,6,‡
1State Key Laboratory for Novel Software Technology, Nanjing University 2Tencent Hunyuan
3Shanghai Jiao Tong University 4Renmin University of China
5Tsinghua University 6Shanghai AI Lab
†Equal Contribution | ‡Corresponding Author
zgzaacm@gmail.com, lmwang@nju.edu.cn

📝 Overview

UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.

At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three key innovations: (1) Asymmetric Cross-Modal Interaction, which enables bidirectional, temporally aligned cross-attention between the two branches; (2) Face-Aware Modulation, which prioritizes salient facial regions during interaction; and (3) Modality-Aware Classifier-Free Guidance, which amplifies cross-modal correlations during inference.
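
To make the dual-branch design concrete, below is a minimal, hypothetical PyTorch sketch of one bidirectional cross-modal interaction step: each branch's tokens attend to the other modality's tokens. Module names, dimensions, and the residual wiring are our assumptions (and Face-Aware Modulation, which would reweight facial regions, is omitted); this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Sketch of one bidirectional cross-modal interaction step between
    parallel audio and video DiT branches (illustrative, not the paper's code)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries, audio keys/values
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries, video keys/values
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Video branch pulls in audio context (e.g., phoneme timing for lip sync).
        v_ctx, _ = self.v2a(self.norm_v(video_tokens), audio_tokens, audio_tokens)
        # Audio branch pulls in video context (e.g., facial expression cues).
        a_ctx, _ = self.a2v(self.norm_a(audio_tokens), video_tokens, video_tokens)
        return video_tokens + v_ctx, audio_tokens + a_ctx

# Example: 16 video tokens and 50 audio tokens for one clip.
block = CrossModalBlock()
v, a = torch.randn(1, 16, 512), torch.randn(1, 50, 512)
v_out, a_out = block(v, a)
print(v_out.shape, a_out.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 50, 512])
```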

UniAVGen Framework Overview
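
Modality-Aware Classifier-Free Guidance is described above only at a high level; one plausible reading is a CFG variant that applies a dedicated guidance scale to the cross-modal term. The sketch below illustrates that reading; the three-way condition split (`eps_cond` / `eps_no_cross` / `eps_uncond`) and the scale values are our assumptions, not the paper's exact formulation.

```python
import torch

def modality_aware_cfg(eps_cond, eps_no_cross, eps_uncond, s_text=5.0, s_cross=2.0):
    """Hypothetical CFG with a separate scale for the cross-modal signal.
    eps_* are noise predictions under (full conditions), (text/reference only,
    cross-modal interaction dropped), and (fully unconditional)."""
    return (eps_uncond
            + s_text * (eps_no_cross - eps_uncond)   # guidance toward text/reference conditions
            + s_cross * (eps_cond - eps_no_cross))   # extra push along the cross-modal direction

eps_c, eps_nc, eps_u = torch.randn(3, 4, 8, 8).unbind(0)
guided = modality_aware_cfg(eps_c, eps_nc, eps_u)
print(guided.shape)  # torch.Size([4, 8, 8])
```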

🌊 Multi-Task Capabilities

1. Joint Audio-Video Generation

Input: Reference image, video caption, speech content.

Output: Temporally aligned audio and video.

Generated results:

Emotion consistency (Happy / Calm / Angry):

2. Joint Generation with Reference Audio

Input: Reference image, video caption, speech content, reference audio for timbre control.

Output: Aligned audio-video with timbre matching the reference audio.

Reference audio:
Generated results:

3. Joint Audio-Video Continuation

Input: Reference image, video caption, speech content, conditional audio/video (in the examples below, the first 2.5 seconds serve as the condition).

Output: Seamless continuation of the audio and video; cross-modal interaction keeps the generated content temporally consistent with the conditional segment.
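
One standard way to realize such continuation in a diffusion model is prefix-conditioned sampling: at every denoising step, the latents of the conditional segment are overwritten with a re-noised copy of the ground truth, so only the suffix is generated. The sketch below shows this generic recipe; the loop structure, `denoiser` callable, and sigma schedule are placeholders, and UniAVGen's actual conditioning scheme may differ.

```python
import torch

def continue_with_prefix(denoiser, x_prefix, total_len, sigmas):
    """Generic prefix-conditioned sampling sketch (not necessarily UniAVGen's):
    the first k latent frames are clamped to the re-noised condition each step."""
    b, k = x_prefix.shape[:2]
    x = torch.randn(b, total_len, *x_prefix.shape[2:])
    for sigma in sigmas:
        x[:, :k] = x_prefix + sigma * torch.randn_like(x_prefix)  # clamp the 2.5 s condition
        x = denoiser(x, sigma)                                    # one denoising update on the whole clip
    x[:, :k] = x_prefix                                           # restore the clean prefix
    return x

# Toy denoiser standing in for the audio/video DiT branch.
dummy = lambda x, sigma: x * 0.9
out = continue_with_prefix(dummy, torch.randn(1, 5, 16), total_len=20,
                           sigmas=[1.0, 0.5, 0.1])
print(out.shape)  # torch.Size([1, 20, 16])
```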

Generated results:

Original videos:

4. Video-to-Audio Dubbing

Input: Conditional silent video, video caption, speech content, optional reference audio.

Output: Audio aligned with the emotions and expressions in the video.

Generated results:

Original videos:

5. Audio-Driven Video Synthesis

Input: Reference image, video caption, conditional audio.

Output: Video with expressions/motions aligned to the driving audio.

Generated results:
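
All five tasks above follow one conditioning pattern: each is obtained by choosing which inputs to supply. The hypothetical schema below summarizes that routing; the field names are illustrative and do not correspond to a released interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UniAVGenRequest:
    """Hypothetical request schema summarizing the five tasks (illustrative only)."""
    reference_image: Optional[str] = None    # identity/appearance anchor
    video_caption: Optional[str] = None      # scene/motion description
    speech_content: Optional[str] = None     # text to be spoken
    reference_audio: Optional[str] = None    # timbre control (tasks 2 and 4)
    conditional_audio: Optional[str] = None  # prefix or driving audio (tasks 3 and 5)
    conditional_video: Optional[str] = None  # prefix or silent video (tasks 3 and 4)

# Task 1, joint generation: image + caption + speech only.
joint = UniAVGenRequest("face.png", "a woman speaking calmly", "Hello there.")
# Task 5, audio-driven video synthesis: swap speech text for a driving audio track.
driven = UniAVGenRequest("face.png", "a man talking", conditional_audio="drive.wav")
```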

Cite Our Work

@misc{zhang2025uniavgenunifiedaudiovideo,
  title={UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions},
  author={Guozhen Zhang and Zixiang Zhou and Teng Hu and Ziqiao Peng and Youliang Zhang and Yi Chen and Yuan Zhou and Qinglin Lu and Limin Wang},
  year={2025},
  eprint={2511.03334},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.03334},
}