💫 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang1,†, Zixiang Zhou2,†, Teng Hu3, Ziqiao Peng4, Youliang Zhang5
Yi Chen2, Yuan Zhou2, Qinglin Lu2, Limin Wang1,6,‡
1State Key Laboratory for Novel Software Technology, Nanjing University 2Tencent Hunyuan
3Shanghai Jiao Tong University 4Renmin University of China
5Tsinghua University 6Shanghai AI Lab
†Equal Contribution | ‡Corresponding Author
zgzaacm@gmail.com, lmwang@nju.edu.cn

📝 Overview

UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.

At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three key innovations: (1) Asymmetric Cross-Modal Interaction, which enables bidirectional, temporally aligned cross-attention between the two branches; (2) Face-Aware Modulation, which prioritizes salient facial regions during interaction; and (3) Modality-Aware Classifier-Free Guidance, which amplifies cross-modal correlations during inference.
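
To make the dual-branch design concrete, below is a minimal, hypothetical PyTorch sketch of one bidirectional cross-modal interaction step: each branch's tokens attend to the other modality's tokens. Module names, dimensions, and the residual wiring are our assumptions (and Face-Aware Modulation, which would reweight facial regions, is omitted); this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Sketch of one bidirectional cross-modal interaction step between
    parallel audio and video DiT branches (illustrative, not the paper's code)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries, audio keys/values
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries, video keys/values
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Video branch pulls in audio context (e.g., phoneme timing for lip sync).
        v_ctx, _ = self.v2a(self.norm_v(video_tokens), audio_tokens, audio_tokens)
        # Audio branch pulls in video context (e.g., facial expression cues).
        a_ctx, _ = self.a2v(self.norm_a(audio_tokens), video_tokens, video_tokens)
        return video_tokens + v_ctx, audio_tokens + a_ctx

# Example: 16 video tokens and 50 audio tokens for one clip.
block = CrossModalBlock()
v, a = torch.randn(1, 16, 512), torch.randn(1, 50, 512)
v_out, a_out = block(v, a)
print(v_out.shape, a_out.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 50, 512])
```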

UniAVGen Framework Overview
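
Modality-Aware Classifier-Free Guidance is described above only at a high level; one plausible reading is a CFG variant that applies a dedicated guidance scale to the cross-modal term. The sketch below illustrates that reading; the three-way condition split (`eps_cond` / `eps_no_cross` / `eps_uncond`) and the scale values are our assumptions, not the paper's exact formulation.

```python
import torch

def modality_aware_cfg(eps_cond, eps_no_cross, eps_uncond, s_text=5.0, s_cross=2.0):
    """Hypothetical CFG with a separate scale for the cross-modal signal.
    eps_* are noise predictions under (full conditions), (text/reference only,
    cross-modal interaction dropped), and (fully unconditional)."""
    return (eps_uncond
            + s_text * (eps_no_cross - eps_uncond)   # guidance toward text/reference conditions
            + s_cross * (eps_cond - eps_no_cross))   # extra push along the cross-modal direction

eps_c, eps_nc, eps_u = torch.randn(3, 4, 8, 8).unbind(0)
guided = modality_aware_cfg(eps_c, eps_nc, eps_u)
print(guided.shape)  # torch.Size([4, 8, 8])
```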

🌊 Multi-Task Capabilities

1. Joint Audio-Video Generation

Input: Reference image, video caption, speech content.

Output: Temporally aligned audio and video.

Generated results:

Emotion consistency (Happy / Calm / Angry):

2. Joint Generation with Reference Audio

Input: Reference image, video caption, speech content, reference audio for timbre control.

Output: Aligned audio-video with timbre matching the reference audio.

Reference audio:
Generated results:

3. Joint Audio-Video Continuation

Input: Reference image, video caption, speech content, conditional audio/video (in the examples below, the first 2.5 seconds serve as the condition).

Output: Seamless continuation of the audio and video; cross-modal interaction keeps the generated content temporally consistent with the conditional segment.
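
One standard way to realize such continuation in a diffusion model is prefix-conditioned sampling: at every denoising step, the latents of the conditional segment are overwritten with a re-noised copy of the ground truth, so only the suffix is generated. The sketch below shows this generic recipe; the loop structure, `denoiser` callable, and sigma schedule are placeholders, and UniAVGen's actual conditioning scheme may differ.

```python
import torch

def continue_with_prefix(denoiser, x_prefix, total_len, sigmas):
    """Generic prefix-conditioned sampling sketch (not necessarily UniAVGen's):
    the first k latent frames are clamped to the re-noised condition each step."""
    b, k = x_prefix.shape[:2]
    x = torch.randn(b, total_len, *x_prefix.shape[2:])
    for sigma in sigmas:
        x[:, :k] = x_prefix + sigma * torch.randn_like(x_prefix)  # clamp the 2.5 s condition
        x = denoiser(x, sigma)                                    # one denoising update on the whole clip
    x[:, :k] = x_prefix                                           # restore the clean prefix
    return x

# Toy denoiser standing in for the audio/video DiT branch.
dummy = lambda x, sigma: x * 0.9
out = continue_with_prefix(dummy, torch.randn(1, 5, 16), total_len=20,
                           sigmas=[1.0, 0.5, 0.1])
print(out.shape)  # torch.Size([1, 20, 16])
```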

Generated results:

Original videos:

4. Video-to-Audio Dubbing

Input: Conditional silent video, video caption, speech content, optional reference audio.

Output: Audio aligned with the emotions and expressions in the video.

Generated results:

Original videos:

5. Audio-Driven Video Synthesis

Input: Reference image, video caption, conditional audio.

Output: Video with expressions/motions aligned to the driving audio.

Generated results:
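
All five tasks above follow one conditioning pattern: each is obtained by choosing which inputs to supply. The hypothetical schema below summarizes that routing; the field names are illustrative and do not correspond to a released interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UniAVGenRequest:
    """Hypothetical request schema summarizing the five tasks (illustrative only)."""
    reference_image: Optional[str] = None    # identity/appearance anchor
    video_caption: Optional[str] = None      # scene/motion description
    speech_content: Optional[str] = None     # text to be spoken
    reference_audio: Optional[str] = None    # timbre control (tasks 2 and 4)
    conditional_audio: Optional[str] = None  # prefix or driving audio (tasks 3 and 5)
    conditional_video: Optional[str] = None  # prefix or silent video (tasks 3 and 4)

# Task 1, joint generation: image + caption + speech only.
joint = UniAVGenRequest("face.png", "a woman speaking calmly", "Hello there.")
# Task 5, audio-driven video synthesis: swap speech text for a driving audio track.
driven = UniAVGenRequest("face.png", "a man talking", conditional_audio="drive.wav")
```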

Cite Our Work

@misc{zhang2025uniavgenunifiedaudiovideo,
  title={UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions},
  author={Guozhen Zhang and Zixiang Zhou and Teng Hu and Ziqiao Peng and Youliang Zhang and Yi Chen and Yuan Zhou and Qinglin Lu and Limin Wang},
  year={2025},
  eprint={2511.03334},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.03334},
}