ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Anonymous submission

ShoulderShot generates long dialogue videos while maintaining spatial continuity and character consistency.

Abstract

Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. Key challenges include maintaining character consistency and spatial continuity, and generating long, multi-turn dialogues efficiently. We introduce ShoulderShot, a framework that combines dual-shot generation with looping video to enable extended, consistent dialogues. Our results surpass existing methods in shot-reverse-shot layout, spatial continuity, and dialogue length flexibility, opening up new possibilities for practical dialogue video generation.

Comparing Dialogue Video Generation with MoCha

Scene description	MoCha	ShoulderShot (ours)
A woman (Person1) with dark, shoulder-length wavy hair, fair skin tone, wearing a white blouse with red cherry patterns. She has a soft, natural expression. A man (Person2) with short dark hair, a well-groomed beard, medium to fair skin tone, wearing black sunglasses and a light grey or white t-shirt. He has a strong build and a calm demeanor. First clip: Person1 stands in a cozy kitchen, facing slightly right, speaking warmly while preparing food on the counter. Pots and pans hang in the background. Second clip: Person2 stands on the other side of the counter, speaking to Person1.

Scene description	MoCha	ShoulderShot (ours)
A man (Person1) with short, slicked-back black hair, fair skin tone, wearing a dark pinstripe suit with a white shirt and black tie. A woman (Person2) with brown hair styled in an Edwardian-era updo, fair skin tone, wearing a light blue lace dress with sheer sleeves and ornate trim. First clip: Person1 stands on the deck of a ship, facing slightly right and speaking to Person2 with a smug expression. The ocean and white columns are visible in the background. Person2 is facing away from the camera. Second clip: Person2 is seated on a wooden deck chair in the same scene, facing forward and speaking with surprise or concern. The ocean stretches behind her.

Scene description	MoCha	ShoulderShot (ours)
A woman (Person1) with shoulder-length black hair, light skin tone, wearing a green trench coat. A man (Person2) with short brown hair, medium skin tone, wearing a black hoodie. First clip: Person1 facing slightly right and speaking. The background is blurred but appears to be a street with cars passing by. Second clip: Person2 stands in the same scene, speaking slightly to the left.

Scene description	MoCha	ShoulderShot (ours)
A woman (Person1) with short brown hair floating slightly, pale skin tone, wearing a white and blue space suit with a name tag. A man (Person2) with thick, curly black hair, sideburns, a prominent forehead, and pale skin tone, wearing a similar space suit. First clip: Person1 near a circular window inside a space station, facing slightly right, speaking with mouth open to the right. Earth is visible in the background through the window, slowly rotating. Second clip: Person2 is in the same spaceship, facing slightly left and speaking to Person1. Person1 is out of focus in the background. Behind Person2, equipment panels and blinking lights can be seen.

Videos: Comparison of dialogue videos generated by MoCha and our method. Our method generates longer, multi-turn dialogue videos than MoCha. We use Tencent Cloud for lip-sync and CosyVoice for text-to-speech (TTS).

Comparing Storyboard Generation with State-of-the-Art Methods

Figure: Comparison of dual-shot over-the-shoulder images generated by existing methods and by ours. (a) MoCha, (b) In-Context LoRA, (c) GPT-4o, (d) Ours. Our results show better character consistency and shot layout.

Looping Video Generation

Reference image	I2V + FLF2V playback	Loop denoising (ours)

Videos: Looping video generation results. The clips have no audio; in practical use, dialogue audio can be added with TTS and a lip-sync model such as LatentSync or MuseTalk.
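For context on the baselines, the simplest way to make a clip loop is reverse ("ping-pong") playback: the clip is played forward and then backward so that it returns to its first frame. The sketch below is only one plausible reading of the "I2V with reverse playback" baseline reported in the quantitative results; the function name `ping_pong_loop` and the choice to drop the duplicated endpoint frames are assumptions made for illustration, not details taken from this page.

```python
def ping_pong_loop(frames):
    """Build a loopable frame sequence by appending the reversed clip.

    Assumption: the last frame and the first frame are not repeated at the
    turn-around points, so the sequence can be tiled without stutter.
    """
    frames = list(frames)
    # frames[-2:0:-1] walks back from the second-to-last frame down to
    # (but not including) the first frame.
    return frames + frames[-2:0:-1]


if __name__ == "__main__":
    clip = ["f0", "f1", "f2", "f3"]  # stand-ins for decoded video frames
    print(ping_pong_loop(clip))
```

Reverse playback guarantees a seamless loop point, but motion visibly runs backward in the second half, which is the kind of artifact a dedicated looping method aims to avoid.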

Quantitative Results

Method	FVD↓	FID-VID↓
I2V with reverse playback	368	3.12
I2V with FLF2V playback	378	3.42
Loop denoising (ours)	284	2.35

Table: Quantitative comparison of loop video generation strategies: I2V with reverse playback, I2V with first-last-frame (FLF2V) playback, and loop denoising (ours). Both metrics, FVD and FID-VID, are Fréchet distances; lower is better.
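To make the gaps in the table easier to read, the following sketch computes the relative reduction of each metric for loop denoising against the two baselines. The numbers are copied from the table; the dictionary layout and the helper name `relative_reduction` are illustrative assumptions, and the percentages are derived arithmetic rather than results reported by the authors.

```python
# Scores copied from the table above (lower is better for both metrics).
RESULTS = {
    "I2V with reverse playback": {"FVD": 368.0, "FID-VID": 3.12},
    "I2V with FLF2V playback": {"FVD": 378.0, "FID-VID": 3.42},
    "Loop denoising (ours)": {"FVD": 284.0, "FID-VID": 2.35},
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


def reductions_vs_baselines(results, ours_key="Loop denoising (ours)"):
    """Return {baseline: {metric: percent reduction}} for every other row."""
    ours = results[ours_key]
    return {
        name: {m: relative_reduction(scores[m], ours[m]) for m in scores}
        for name, scores in results.items()
        if name != ours_key
    }


if __name__ == "__main__":
    for name, metrics in reductions_vs_baselines(RESULTS).items():
        for metric, pct in metrics.items():
            print(f"{metric} vs. {name}: {pct:.1f}% lower")
```

By this arithmetic, loop denoising lowers FVD by roughly 23 to 25 percent and FID-VID by roughly 25 to 31 percent relative to the two playback baselines.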