ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Anonymous submission
ShoulderShot generates long dialogues maintaining spatial continuity and character consistency.

Abstract

Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. Key challenges involve maintaining character consistency, spatial continuity, and generating long, multi-turn dialogues efficiently. We introduce ShoulderShot, a framework leveraging dual-shot generation and looping video to enable extended, consistent dialogues. Our results surpass existing methods in shot-reverse-shot layout, spatial continuity, and dialogue length flexibility, thereby opening up new possibilities for practical dialogue video generation.

Comparing Dialogue Video Generation with MoCha

Videos: Comparison of dialogue videos generated by MoCha and by our method. Our method generates longer, multi-turn dialogue videos than MoCha. We use lip-sync from Tencent Cloud and CosyVoice for text-to-speech (TTS).

Comparing Storyboard Generation with State-of-the-Art Methods

Figure: Comparison of dual-shot over-the-shoulder images generated by existing methods and by ours. (a) MoCha, (b) In-Context LoRA, (c) GPT-4o, (d) Ours. Our result shows better character consistency and shot layout.

Looping Video Generation

Videos: Looping video generation results. The videos have no audio; in practical use, dialogue audio can be added with TTS and a lip-sync model such as LatentSync or MuseTalk.
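For context, the simplest way to make a clip loop is to play it forward and then backward. The sketch below illustrates that idea, which is how we read the "reverse playback" baseline in the quantitative comparison; the exact construction of that baseline is an assumption, and `pingpong_loop` is a hypothetical helper name, not part of ShoulderShot. It is not the loop denoising method.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def pingpong_loop(frames: Sequence[Frame]) -> List[Frame]:
    """Play a clip forward, then backward, so the result can be tiled as a loop.

    Hypothetical helper for illustration only. The first and last frames are
    not duplicated, so repeating the output gives 0 1 2 3 2 1 | 0 1 2 3 2 1 ...
    """
    frames = list(frames)
    if len(frames) < 3:
        # Too short to have interior frames to play back in reverse.
        return frames
    # Forward pass followed by the interior frames in reverse order.
    return frames + frames[-2:0:-1]


# Frames are stand-in strings here; in practice they would be image arrays.
clip = ["f0", "f1", "f2", "f3"]
loop = pingpong_loop(clip)
print(loop)
```

A loop built this way is seamless at the boundary by construction, but the motion visibly reverses halfway through the cycle.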

Quantitative Results

Method                      FVD↓   FID-VID↓
I2V with reverse playback   368    3.12
I2V with FLF2V playback     378    3.42
Loop denoising (ours)       284    2.35
Table: Quantitative comparison of loop video generation strategies: I2V with reverse playback, I2V with first-last-frame (FLF2V) playback, and loop denoising (ours). FVD and FID-VID are both Fréchet distance metrics; lower is better.
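As a reminder of what the two metrics measure, the standard Fréchet distance between two Gaussian-fitted feature distributions is given below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ denote the mean and covariance of features from reference videos, and $\mu_g, \Sigma_g$ those from generated videos. The feature extractors and reference set used for the numbers above are not stated on this page.

```latex
\[
d_F^2 \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

The distance is zero when the two distributions share the same mean and covariance, which is why lower values indicate generated videos that are statistically closer to the reference set.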