Multimodal input, @reference system, camera replication, creative templates, video extension, and more.
Ever since the days when we could only "tell stories" with text and first/last frames, we've dreamed of building a video model that truly understands what you want to express. Today, it's finally here!
JiMeng Seedance 2.0 now supports four input modalities — image, video, audio, and text — giving you richer ways to express yourself with more controllable generation.
You can use an image to set the visual style, a video to define character motion and camera movement, a few seconds of audio to establish rhythm and atmosphere... combine these with text prompts to make your creative process more natural, more efficient, and more like being a real "director."
In this upgrade, "reference capability" is the biggest highlight:
| Capability | Seedance 2.0 |
|---|---|
| Image Input | ≤ 9 images |
| Video Input | ≤ 3 videos, total duration no more than 15s (reference videos cost a bit more) |
| Audio Input | Supports MP3 upload, ≤ 3 files, total duration no more than 15s |
| Text Input | Natural language |
| Generation Duration | ≤ 15s, freely selectable from 4-15s |
| Sound Output | Built-in sound effects / background music |
Interaction limit: The current maximum for mixed input is 12 files total. We recommend prioritizing uploads that have the greatest impact on visuals or rhythm, and allocating file counts wisely across different modalities.
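The limits above can be summed up in a quick sanity check. This is purely an illustrative sketch, not an official API: the `validate` helper and the `LIMITS` table are hypothetical names, encoding only the per-modality caps listed in the table and the 12-file mixed-input ceiling.

```python
# Illustrative sketch (not an official API): checking a mixed-input
# upload set against the Seedance 2.0 limits listed above.

LIMITS = {
    "image": {"max_files": 9},
    "video": {"max_files": 3, "max_total_s": 15},
    "audio": {"max_files": 3, "max_total_s": 15},
}
MAX_TOTAL_FILES = 12  # current ceiling for mixed input

def validate(uploads):
    """uploads: list of (kind, duration_s) tuples; use duration 0 for images."""
    if len(uploads) > MAX_TOTAL_FILES:
        return False, f"more than {MAX_TOTAL_FILES} files in total"
    for kind, limit in LIMITS.items():
        durations = [d for k, d in uploads if k == kind]
        if len(durations) > limit["max_files"]:
            return False, f"too many {kind} files"
        if "max_total_s" in limit and sum(durations) > limit["max_total_s"]:
            return False, f"{kind} duration exceeds {limit['max_total_s']}s"
    return True, "ok"

# 9 images plus two videos totalling 17s: file counts are fine,
# but the combined video duration breaks the 15s cap.
ok, reason = validate([("image", 0)] * 9 + [("video", 8), ("video", 9)])
print(ok, reason)
```

The same check mirrors the advice above: when you hit the 12-file ceiling, drop the uploads that matter least to visuals or rhythm first.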



Method 1: Type "@" to invoke reference





After uploading, you can hover over any material (image, video, or audio) to preview it.



Below are some use cases and creative approaches for different scenarios, to help you better understand how Seedance 2.0 has improved in generation quality, control capabilities, and creative expression. If you're not sure where to start, check out these examples for inspiration!
Beyond multimodality, Seedance 2.0 has made significant improvements at the foundational level — physics are more realistic, motions are more natural and fluid, instruction comprehension is more precise, and style consistency is more stable. It can now reliably handle complex actions, continuous motion, and other challenging generation tasks, making overall video output more realistic and smoother. This is a comprehensive evolution of core capabilities!
A girl elegantly hanging laundry, after hanging one piece she reaches into the basket for another, giving it a firm shake.
The character in the painting has a guilty expression, eyes darting left and right as they peek out of the frame, quickly reaching out to grab a cola and take a sip, then showing a look of pure satisfaction. Then footsteps are heard, and the character hurriedly puts the cola back. A cowboy picks up the cola and walks away. Finally the camera pushes in as the screen fades to black with only a spotlight illuminating a canned cola, with stylish subtitles appearing at the bottom: "Yi Kou Cola — A Taste Not to Be Missed!"
Camera slowly pulls back (revealing the full street view) and follows the heroine as she walks along a 19th-century London street, the wind ruffling her skirt. A steam-powered car comes speeding from the right side of the street, rushing past her — the gust lifts her skirt and she gasps in shock, quickly pressing it down with both hands. Background sounds include footsteps, crowd noise, and car sounds.
Camera follows a man in black sprinting away, a crowd chasing behind him. Camera switches to a side tracking shot as he panics and crashes into a fruit stand, scrambles to his feet and keeps running. Sounds of a frantic crowd.
Seedance 2.0 = Multimodal Reference (reference anything) + Strong Creative Generation + Precise Instruction Response (excellent comprehension)
Supports uploading text, images, videos, and audio — all of which can be used as subjects or references. You can reference anything's motion, effects, style, camera movement, characters, scenes, and sound. As long as your prompt is clear, the model can understand it.
Just describe the visuals and actions you want in natural language — be clear about whether it's a reference or an edit. When you have multiple materials, we recommend double-checking that each @reference is properly labeled so images, videos, and characters don't get mixed up.
Have a first/last frame image? Also want to reference video actions?
→ Write it clearly in the prompt, e.g.: "@Image1 as first frame, reference @Video1's fighting actions"
Want to extend an existing video?
→ Specify the extension duration, e.g. "Extend @Video1 by 5s". Note: the selected generation duration should be for the "new portion" only.
Want to merge multiple videos?
→ Describe the composition logic in the prompt, e.g.: "I want to add a scene between @Video1 and @Video2, with content about xxx"
No audio files? You can directly reference the sound from a video.
Want to generate continuous action?
→ Add continuity descriptions to the prompt, e.g.: "The character transitions from a jump directly into a roll, keeping the motion smooth and fluid" @Image1 @Image2 @Image3...
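The patterns above all reduce to the same habit: label every @reference explicitly so materials don't get mixed up. As a minimal sketch (the product itself takes free-form text, and `tag` / `build_prompt` are hypothetical helper names, not part of any real API), assembling such a prompt might look like:

```python
# Hypothetical helpers for composing an @-referenced prompt in the
# style shown above; Seedance 2.0 itself accepts plain natural language.

def tag(kind, index):
    """Return an @-reference like '@Image1' or '@Video2'."""
    return f"@{kind}{index}"

def build_prompt(first_frame=None, motion_ref=None, action=""):
    parts = []
    if first_frame:
        # e.g. "@Image1 as first frame"
        parts.append(f"{first_frame} as first frame")
    if motion_ref:
        # e.g. "reference @Video1's actions"
        parts.append(f"reference {motion_ref}'s actions")
    if action:
        parts.append(action)
    return ", ".join(parts)

prompt = build_prompt(first_frame=tag("Image", 1),
                      motion_ref=tag("Video", 1),
                      action="keep the motion smooth and fluid")
print(prompt)
# → @Image1 as first frame, reference @Video1's actions, keep the motion smooth and fluid
```

The point is not the helper itself but the structure it enforces: one labeled reference per material, plus a continuity description, matches what the model parses most reliably.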
Making videos always comes with headaches: faces changing mid-shot, motions not matching, unnatural video extensions, editing that throws off the entire rhythm... This multimodal upgrade tackles all these "long-standing pain points" at once. Below are specific use cases.
You've probably experienced these frustrations: characters looking different between shots, product details getting lost, small text becoming blurry, scene jumps, inconsistent camera styles... These common consistency issues in creative work can now all be resolved in 2.0. From faces to clothing to font details, overall consistency is more stable and accurate.
The man @Image1 walks tiredly down a hallway after work, his steps slowing, finally stopping at his front door. Close-up on his face — he takes a deep breath, adjusts his emotions, lets go of the negativity and relaxes. Then a close-up of him fishing out his keys, inserting them into the lock. After entering, his little daughter and a pet dog run over joyfully to greet him with hugs. The interior is very warm and cozy. Natural dialogue throughout.
Replace the woman in @Video1 with a Chinese opera huadan performer on an elaborate stage. Reference @Video1's camera work and transitions, matching the camera to the character's movements for ultimate stage beauty and enhanced visual impact.
Reference all transitions and camera movements from @Video1, one continuous take, starting with a chess game.
0-2 seconds: Rapid four-panel flash cuts — red, pink, purple, and leopard-print bows each frozen in frame.

Create a commercial-style showcase of the handbag in @Image2. The side of the bag references @Image1, the surface texture references @Image3. All bag details should be displayed. Grand and majestic background music.

Use @Image1 as the first frame. First-person perspective, reference @Video1's camera movements. Upper scene references @Image2, left scene references @Image3, right scene references @Image4.
Previously, getting the model to mimic cinematic blocking, camera work, or complex actions required either writing extremely detailed prompts or was simply impossible. Now, just upload a reference video and you're set.
Reference the man's appearance from @Image1, he is in the elevator from @Image2, fully replicate all camera movements and the protagonist's facial expressions from @Video1.
Reference the man's appearance from @Image1, he is in the corridor from @Image2, fully replicate all camera movements from @Video1.




The tablet from @Image1 as the main subject, camera movements reference @Video1.

The female star from @Image1 as the main subject, reference @Video1's camera style for rhythmic push-pull-pan movements.
Reference @Image1 @Image2 for the spear-wielding character, @Image3 @Image4 for the dual-blade character. Mimic @Video1's actions, fighting in the maple leaf forest from @Image5.

Reference @Video1's character actions, reference @Video2's orbiting camera language, generate a fighting scene between Character 1 and Character 2.


Reference @Video1's camera movements and scene transition rhythm, replicate using the red supercar from @Image1.
Beyond generating images and writing stories, Seedance 2.0 also supports "follow-the-reference" — creative transitions, finished ads, film clips, complex edits. As long as you have reference images or videos, the model can identify action rhythms, camera language, visual structure, and precisely replicate them.
Replace the person in @Video1 with @Image1. @Image1 as first frame, the person wearing virtual sci-fi glasses. Reference @Video1's camera work.

Reference the model's facial features from the first image. The model wears the outfits from reference images 2-6 while approaching the camera.



Reference the video's ad concept, use the provided down jacket images with ad copy to generate a new down jacket commercial.
Black and white ink wash style. The character from @Image1 references @Video1's effects and movements, performing a segment of ink wash tai chi kung fu.
Replace the first frame character in @Video1 with @Image1, fully reference @Video1's effects and movements.

Starting from the ceiling in @Image1, reference @Video1's jigsaw-shattering effect for the transition.


Open with a black screen, reference @Video1's particle effects and material, golden gilded sand particles.

The character from @Image1 references @Video1's actions and expression changes, showcasing an exaggerated instant noodle eating performance.
Animate @Image1 as a comic strip, reading left to right, top to bottom.

Reference the storyboard from @Image1, create a 15s healing-style opening sequence about "The Four Seasons of Childhood."

Reference @Video1's audio, use @Image1-5 as inspiration to create an emotion-driven video.





Extend 15s of video. Reference the donkey-riding-motorcycle character from @Image1 and @Image2, add a whimsical ad segment.

Extend the video by 6s. An intense electric guitar riff kicks in, with "JUST DO IT" ad text appearing in the center of the video.

Extend @Video1 by 15 seconds. 1-5 seconds: Light and shadow slowly glide through venetian blinds across the wooden table and cup.
Extend backward by 10s. In the warm afternoon light, the camera begins at the row of awnings on the street corner, gently fluttering in the breeze.
Fixed camera, center fisheye lens peering downward through a circular opening.
Based on the provided office building promotional photos, generate a 15-second cinematic-realistic style real estate documentary.



A roast-style dialogue in the "Cat & Dog Roast Room," with rich emotions matching a stand-up comedy performance.

The opening instrumental of the classic Yu Opera segment "The Case of Chen Shimei" begins to play.

Generate a 15-second music video. Keywords: steady composition / gentle push-pull / low-angle heroic feel / documentary but premium.

The girl with a hat in the center of frame gently sings "I'm so proud of my family!"

Fixed camera. The standing muscular man (captain) clenches his fist, waves his arm and says in Spanish: "Assault in three minutes!"

0-3 seconds: Opening with an alarm clock ringing, the blurry image fades in to reveal Image 1.


The monkey from @Image1 walks toward the bubble tea shop counter, camera following behind him.



In a science-explainer style and tone, bring the content of Image 1 to life.
@Image1-5, a continuous one-take tracking shot, following a runner from the street up stairs, through a corridor, onto the rooftop, finally overlooking the city.





Starting with @Image1 as the first frame, the view zooms out to outside an airplane window.



Spy thriller style. @Image1 as the first frame, camera tracking the female spy in a red trench coat from the front.




From the exterior shot of @Image1, first-person POV with a fast push into the wooden cabin interior.




@Image1-5, a thrilling roller coaster ride from a first-person POV in one continuous take.





Sometimes you already have a video and don't want to start over finding images or rebuilding from scratch — you just want to tweak a motion segment, extend a few seconds, or make a character's performance better match your vision. Now you can directly use existing video as input and make targeted modifications to specific segments, actions, or rhythms without changing anything else.
Subvert the storyline in @Video1 — the man's gaze shifts from tender to ice-cold and ruthless.
Subvert the entire storyline of @Video1. 0-3 seconds: A man in a suit sits at a bar.
Replace the female lead singer in @Video1 with the male lead singer from @Image1, movements fully mimicking the original video.

Change the woman's hairstyle in @Video1 to long red hair. The great white shark from @Image1 slowly emerges.

@Video1's camera pans right, the fried chicken shop owner busily hands fried chicken to customers in line.

The girl in the poster keeps changing outfits, clothing styles reference @Image1 and @Image2.




Images from @Image1-7 sync to keyframes in @Video's visuals for beat-matching.






Scenic landscape images from @Image1-6, synced to @Video's visual rhythm for beat-matching.
8-second strategic-battle anime clip, matching a revenge theme.
The woman from @Image1 walks to a mirror, looks at her reflection, pauses in thought, then suddenly breaks down screaming.


This is a range hood commercial. @Image1 as the first frame, a woman elegantly cooking.




@Image1 as the first frame, camera rotates and pushes in, the character suddenly looks up and begins roaring.



