SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

HKUST

Abstract

Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine the spatial attributes deeply encoded in the audio, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: (1) Audio-guided Video Planning: we meticulously adapt a state-of-the-art MLLM to the novel task of harnessing spatial and semantic cues from the input audio to construct Video Scene Layouts (VSLs), which serve as an intermediate representation bridging the gap between the audio and video modalities; (2) Layout-grounded Video Generation: we develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
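To make such spatial indicators concrete, the short sketch below estimates the inter-channel level difference of a stereo recording: a louder right channel hints that the sounding source sits toward the right of the scene. This is only an illustrative cue under our own assumptions; the function name and the librosa-based loading are not part of SpA2V's implementation.

import numpy as np
import librosa  # assumed to be available for stereo audio loading

def interchannel_level_difference(audio_path: str, sr: int = 16000) -> float:
    """Return the right-minus-left level difference (in dB) of a stereo clip.

    Positive values suggest the sounding source sits toward the right of the
    scene, negative values toward the left. Illustrative cue only, not
    SpA2V's actual feature extraction.
    """
    audio, _ = librosa.load(audio_path, sr=sr, mono=False)  # shape: (2, T)
    left, right = audio[0], audio[1]
    rms_left = np.sqrt(np.mean(left ** 2) + 1e-12)
    rms_right = np.sqrt(np.mean(right ** 2) + 1e-12)
    return 20.0 * np.log10(rms_right / rms_left)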

SpA2V Framework

SpA2V decomposes the generation process into two stages: Audio-guided Video Planning and Layout-grounded Video Generation. In the first stage, given an input audio, example conversations are retrieved from a candidate database via the Retrieval Module. Together with a System Instruction and the audio, they are fed into the MLLM Video Planner, which performs reasoning and generates the desired Video Scene Layout (VSL) sequence along with a global video-wise caption and local frame-wise captions. In the second stage, these outputs are incorporated to guide a video diffusion model in generating the final video, which is semantically and spatially coherent with the input audio.
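The sketch below summarizes the two-stage pipeline in code. All class and method names (the retriever, planner, and diffusion-sampler interfaces, as well as the VSL container) are hypothetical placeholders under our assumptions, not the released implementation.

from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized

@dataclass
class VideoSceneLayout:
    """Video Scene Layout (VSL): per-frame boxes for each sounding object."""
    global_caption: str                        # video-wise caption
    frame_captions: List[str]                  # local frame-wise captions
    frame_boxes: List[List[Tuple[str, BBox]]]  # frame_boxes[t] = [(label, box), ...]

def plan_video(audio_path: str, retriever, planner, system_instruction: str) -> VideoSceneLayout:
    """Stage 1: Audio-guided Video Planning with an MLLM."""
    examples = retriever.retrieve(audio_path, k=3)        # in-context example conversations
    prompt = [system_instruction, *examples, audio_path]  # assembled multimodal prompt
    return planner.generate_vsl(prompt)                   # MLLM reasoning -> VSL + captions

def generate_video(vsl: VideoSceneLayout, diffusion_model):
    """Stage 2: Layout-grounded Video Generation (training-free guidance)."""
    return diffusion_model.sample(
        prompt=vsl.global_caption,
        frame_prompts=vsl.frame_captions,
        layout_guidance=vsl.frame_boxes,  # steer generation toward the box regions
    )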

AVLBench Construction

Due to the novelty of our Audio → Video Scene Layout → Video approach, we construct a new benchmark, dubbed AVLBench, for experiments and evaluation.

The construction process involves four steps: Sourcing, which crawls candidate data; Filtering, which selects data under quality control; Augmenting, which enriches data diversity; and Annotating, which produces annotations for the final data. In total, we obtain 7.2K real-world stereo audio-video pairs with 14.5K annotated sounding objects, covering several distinct scenarios.
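For illustration, one AVLBench sample could be represented roughly as follows; the field names below are our assumption and may not match the released benchmark format.

from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class SoundingObject:
    label: str         # e.g., "car" or "violin"
    boxes: List[BBox]  # one bounding box per annotated frame

@dataclass
class AVLBenchSample:
    video_path: str                # real-world video clip
    audio_path: str                # paired stereo recording
    global_caption: str            # video-wise description
    frame_captions: List[str]      # frame-wise descriptions
    objects: List[SoundingObject]  # annotated sounding objects in the clip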

Qualitative Comparison

System-level qualitative comparison of videos generated by SpA2V, TempoTokens, Seeing-and-Hearing, AC+LTX, and AC+LVD. Our method synthesizes high-quality videos with compelling semantic and spatial correspondence to the input audio across different scenarios.

Vehicles with translational movements in outdoor environments.

Instruments being played with stationary motions in indoor environments.

Quantitative Evaluation

Stage-by-stage quantitative comparison of SpA2V and the aforementioned baselines, together with ablation studies on Audio-guided Video Planning and Layout-grounded Video Generation, respectively.

In Stage 1, SpA2V outperforms the baseline AC+LVD and generates VSLs with high similarity to the ground-truth VSLs, indicating strong spatial alignment with the input audio. Additionally, we conduct ablations for this stage on the Prompting Mechanism, In-context Learning Setup, Example Selection, Choice of MLLM, and Response Randomness to choose the best setting in practice.
In Stage 2, SpA2V consistently surpasses AC+LVD in synthesizing videos grounded by input layouts, and outperforms the other baselines, including TempoTokens, Seeing-and-Hearing, and AC+LTX, in the system-level audio-to-video comparison. In addition, we assess the effectiveness of different Caption Selection strategies and the impact of VSL Quality on the final outcomes.
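As an example of how Stage-1 spatial alignment can be quantified, a simple layout-similarity measure is the mean per-frame IoU between a predicted and a ground-truth VSL box track. The sketch below is illustrative only and may differ from the metrics actually reported in the paper.

import numpy as np

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def layout_miou(pred_boxes, gt_boxes):
    """Mean per-frame IoU between matched predicted and ground-truth boxes.

    pred_boxes / gt_boxes: lists over frames of (x1, y1, x2, y2) tuples for
    one matched sounding object.
    """
    return float(np.mean([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]))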

User Study

Ranking scores from the user study highlight users' subjective preference for the videos generated by SpA2V over the other baselines in terms of Visual Quality and Audio-Video Alignment.

BibTeX

@article{pham2025spa2v,
  title={SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation},
  author={Pham, Kien T and He, Yingqing and Xing, Yazhou and Chen, Qifeng and Chen, Long},
  journal={arXiv preprint arXiv:2508.00782},
  year={2025}
}