Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The key challenges are maintaining character consistency, preserving spatial continuity, and generating long, multi-turn dialogues efficiently. We introduce ShoulderShot, a framework that leverages dual-shot generation and looping video to enable extended, consistent dialogues. ShoulderShot surpasses existing methods in shot-reverse-shot layout, spatial continuity, and dialogue length flexibility, opening up new possibilities for practical dialogue video generation.
Scene description | MoCha | ShoulderShot (ours) |
---|---|---|
A woman (Person1) with dark, shoulder-length wavy hair, fair skin tone, wearing a white blouse with red cherry patterns. She has a soft, natural expression. A man (Person2) with short dark hair, a well-groomed beard, medium to fair skin tone, wearing black sunglasses and a light grey or white t-shirt. He has a strong build and a calm demeanor. First clip: Person1 stands in a cozy kitchen, facing slightly right, speaking warmly while preparing food on the counter. Pots and pans hang in the background. Second clip: Person2 stands on the other side of the counter, speaking to Person1. |
Scene description | MoCha | ShoulderShot (ours) |
---|---|---|
A man (Person1) with short, slicked-back black hair, fair skin tone, wearing a dark pinstripe suit with a white shirt and black tie. A woman (Person2) with brown hair styled in an Edwardian-era updo, fair skin tone, wearing a light blue lace dress with sheer sleeves and ornate trim. First clip: Person1 stands on the deck of a ship, facing slightly right and speaking to Person2 with a smug expression. The ocean and white columns are visible in the background. Person2 is facing away from the camera. Second clip: Person2 is seated on a wooden deck chair in the same scene, facing forward and speaking with surprise or concern. The ocean stretches behind her. |
Scene description | MoCha | ShoulderShot (ours) |
---|---|---|
A woman (Person1) with shoulder-length black hair, light skin tone, wearing a green trench coat. A man (Person2) with short brown hair, medium skin tone, wearing a black hoodie. First clip: Person1 faces slightly right and speaks. The background is blurred but appears to be a street with cars passing by. Second clip: Person2 stands in the same scene, speaking slightly to the left. |
Scene description | MoCha | ShoulderShot (ours) |
---|---|---|
A woman (Person1) with short brown hair floating slightly, pale skin tone, wearing a white and blue space suit with a name tag. A man (Person2) with thick, curly black hair, sideburns, a prominent forehead, and pale skin tone, wearing a similar space suit. First clip: Person1 near a circular window inside a space station, facing slightly right, speaking with mouth open to the right. Earth is visible in the background through the window, slowly rotating. Second clip: Person2 is in the same spaceship, facing slightly left and speaking to Person1. Person1 is out of focus in the background. Behind Person2, equipment panels and blinking lights can be seen. |
Reference image | I2V + FLF2V playback | Loop denoising (ours) |
---|---|---|
(20 comparison examples; images and videos not recoverable) | | |
Method | FVD↓ | FID-VID↓ |
---|---|---|
I2V with reverse playback | 368 | 3.12 |
I2V with FLF2V playback | 378 | 3.42 |
Loop denoising (ours) | 284 | 2.35 |
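
To make the gap in the table easier to read, the following minimal sketch computes the relative reduction of each metric achieved by loop denoising against the two baselines. The metric values are copied from the table above; the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, not part of the ShoulderShot code.

```python
# Scores copied from the table above (lower is better for both metrics).
SCORES = {
    "I2V with reverse playback": {"FVD": 368.0, "FID-VID": 3.12},
    "I2V with FLF2V playback": {"FVD": 378.0, "FID-VID": 3.42},
    "Loop denoising (ours)": {"FVD": 284.0, "FID-VID": 2.35},
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (hypothetical helper)."""
    return (baseline - ours) / baseline * 100.0


ours = SCORES["Loop denoising (ours)"]
for name, base in SCORES.items():
    if name == "Loop denoising (ours)":
        continue
    fvd = relative_reduction(base["FVD"], ours["FVD"])
    fid = relative_reduction(base["FID-VID"], ours["FID-VID"])
    print(f"vs. {name}: FVD -{fvd:.1f}%, FID-VID -{fid:.1f}%")
# vs. I2V with reverse playback: FVD -22.8%, FID-VID -24.7%
# vs. I2V with FLF2V playback: FVD -24.9%, FID-VID -31.3%
```

Under these numbers, loop denoising lowers FVD by roughly 23 to 25 percent and FID-VID by roughly 25 to 31 percent relative to the two playback baselines.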