UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.
At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three key innovations: (1) Asymmetric Cross-Modal Interaction for bidirectional temporal alignment, (2) Face-Aware Modulation to prioritize salient facial regions during interaction, and (3) Modality-Aware Classifier-Free Guidance to amplify cross-modal correlations at inference time.
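To make the third point concrete, the sketch below shows the general shape of a modality-aware classifier-free guidance step. This is an illustrative assumption, not UniAVGen's exact formulation: the function name `modality_aware_cfg`, the two guidance scales, and the decomposition into a same-modality term plus a cross-modal term are hypothetical, chosen to show how a separate scale can amplify the cross-modal correlation relative to standard single-scale CFG.

```python
import numpy as np

def modality_aware_cfg(eps_uncond, eps_self_cond, eps_full_cond,
                       w_self, w_cross):
    """Illustrative two-term CFG combination (not the paper's exact formula).

    eps_uncond:    noise prediction with all conditions dropped
    eps_self_cond: prediction conditioned only on the branch's own inputs
    eps_full_cond: prediction conditioned on both modalities
    w_self:        guidance scale for the same-modality condition
    w_cross:       extra scale amplifying the cross-modal signal
    """
    return (eps_uncond
            + w_self * (eps_self_cond - eps_uncond)
            + w_cross * (eps_full_cond - eps_self_cond))

# Toy example with scalar-like arrays standing in for noise predictions.
uncond = np.zeros(4)
self_cond = np.ones(4)
full_cond = np.full(4, 2.0)

# With both scales at 1.0 the terms telescope back to the fully
# conditioned prediction; raising w_cross above 1.0 pushes the output
# further along the cross-modal direction.
out = modality_aware_cfg(uncond, self_cond, full_cond, 1.0, 1.0)
```

Raising `w_cross` above 1.0 extrapolates along the cross-modal direction, which is the intuition behind strengthening audio-video correlation at inference time without retraining.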