
Which AI Video Generation Model to Use When

A practical decision guide for choosing between Google Veo 3.1, Kuaishou Kling, ByteDance Seedance, and HeyGen — the AI video models actually shipping in production.

Salk Team · May 12, 2026 · 9 min read

TL;DR

There is no "best" AI video model. There are five questions about your shot (lipsync, length, audio, motion, budget), and each one rules a different model in or out. This guide is the rubric we use internally at Salk.

Every week someone asks "which AI video model should I use?" and the honest answer is: it depends entirely on what you are trying to shoot. A four-second product loop, a sixty-second talking-head explainer, and a cinematic crowd scene live in completely different model worlds. Picking by reputation costs you twice — once when the output disappoints, again when you re-run the job somewhere it would have just worked.

This is an opinionated, in-product guide. The four families below are the ones wired into Salk AI Studio because they each clearly win a category. Tens of thousands of clips have moved through them and the boundaries between their strengths have stayed remarkably stable across the last six months.

The four families currently shipping

As of May 2026, four model families do real production work for ad-grade video:

Google Veo 3.1 — the cinematic photoreal specialist. Three tiers (Standard / Fast / Lite), 4K on the top two, and dialogue audio baked in for free.
Kuaishou Kling — the motion-and-physics specialist. Eleven model variants in production; v2.1 Master, v2.6 Pro, and v3 do most of the work. Best in class for fast camera moves, fluid simulation, and natural human motion.
HeyGen Video Agent — the only model on this list that can lipsync a person to a script. If a subject needs to actually say words to the camera, this is the path.
ByteDance Seedance 2.0 — joint audio-video generation tuned for fast, short social cuts. Smaller catalog and tighter quotas than the others, but punches above its price on motion stability.

Every other production-grade model (Runway Gen-4, Pika 2.1, Sora, Luma Ray 2) sits somewhere on the curve these four already define. There is a short sidebar on them at the end.

The five questions that pick your model

Before picking a model, answer these five questions about the actual shot you need. Most of the time the first one that gets a hard "yes" picks the model for you.

1. Does the subject need to talk to the camera?

If yes — meaning lips must move in sync with words a viewer can understand — you need HeyGen Video Agent. None of the generative text-to-video models do reliable lipsync. They will generate a person whose mouth moves vaguely, but the words will not land. Stop reading and jump to the HeyGen section.

2. How long does the clip need to be?

Veo caps at 8 seconds. Legacy Kling (v1.6 through v2.6) caps at 10. Kling v3 and Seedance both extend to 15. HeyGen scales to multi-minute scripted output. If you need more than 8 seconds and you are not doing a talking-head, the candidates narrow to Kling or Seedance; past 10 seconds, only Kling v3 and Seedance remain.
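The duration caps above amount to a simple filter. Here is a minimal sketch in Python that uses only the limits quoted in this guide. The dictionary keys and the function name are illustrative labels, not official model identifiers, and HeyGen is left out because the guide gives no hard cap for it.

```python
# Maximum clip length in seconds, as quoted in this guide (May 2026).
# Keys are illustrative labels, not official model identifiers.
MAX_SECONDS = {
    "veo": 8,
    "kling-legacy": 10,  # v1.6 through v2.6
    "kling-v3": 15,
    "seedance": 15,
}


def models_for_duration(seconds: float) -> list[str]:
    """Return the generative families whose cap covers the requested length."""
    return sorted(name for name, cap in MAX_SECONDS.items() if seconds <= cap)


print(models_for_duration(8))   # every family qualifies
print(models_for_duration(10))  # Veo drops out
print(models_for_duration(12))  # only Kling v3 and Seedance remain
```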

3. Do you need audio baked in?

Veo 3.x ships with native dialogue plus ambient audio for free. Kling v2.6 Pro and v3 support audio as an opt-in (the parameter is generate_audio on v2.6, sound: "on" on v3). Seedance does native joint audio-video. Everything else (Kling v1.x–v2.5, Veo 2) is silent — you would compose audio separately, typically through ElevenLabs and a lipsync runner. If you need dialogue and fast motion in the same shot, the only real options are Kling v3 with sound on, or Veo 3.1 Standard.
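Because the opt-in flag changes name between Kling versions, it helps to isolate that difference in one place. The sketch below is hypothetical: only the field names generate_audio and sound come from this guide. The boolean value for generate_audio, the version labels, and the helper itself are assumptions, so check them against the API you are actually calling.

```python
def kling_audio_params(version: str, want_audio: bool) -> dict:
    """Return the audio-related request fields for a Kling version.

    Only the field names come from this guide. The boolean value for
    generate_audio and the version labels are assumptions.
    """
    if not want_audio:
        return {}
    if version == "v2.6-pro":
        return {"generate_audio": True}
    if version == "v3":
        return {"sound": "on"}
    # Kling v1.x through v2.5 are silent: audio has to be composed separately.
    raise ValueError(f"Kling {version} has no native audio option")


print(kling_audio_params("v2.6-pro", True))
print(kling_audio_params("v3", True))
```

Raising on the silent versions, rather than quietly returning an empty dictionary, makes it obvious early that the shot needs a separate audio pipeline.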

4. What kind of motion are you generating?

Kling consistently wins on physics: fabric, fluid, hair, particle effects, contact between objects. Veo wins on cinematic camera moves and people moving naturally through space. Seedance wins on tight motion-heavy shorts where iteration speed beats fidelity. HeyGen does not really compete on this axis — its subjects are mostly stationary presenters.

5. What is the per-clip budget?

This is where it gets stark. A 5-second Kling v2.1 Standard clip costs about $0.13. An 8-second Veo 3.1 Standard clip at 4K costs $4.80. That is a ~37x per-clip spread for outputs that often look identical at thumbnail size. The right strategy is almost always: draft on the cheap tier, finalize on the premium one.
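To make the draft-then-finalize argument concrete, here is a small worked calculation using only the two per-clip prices quoted above. The function name and the ten-draft scenario are illustrative assumptions. Note that the two clips differ in length (5 versus 8 seconds), so this is a per-clip comparison, not a per-second one.

```python
KLING_V21_STANDARD_5S = 0.13  # USD per 5-second clip, from this guide
VEO_31_STANDARD_4K_8S = 4.80  # USD per 8-second 4K clip, from this guide

# Per-clip price ratio between the premium and the cheap tier.
spread = VEO_31_STANDARD_4K_8S / KLING_V21_STANDARD_5S


def campaign_cost(drafts: int, finals: int) -> float:
    """Cost in USD of drafting on the cheap tier and finalizing on the premium one."""
    return round(drafts * KLING_V21_STANDARD_5S + finals * VEO_31_STANDARD_4K_8S, 2)


print(f"per-clip spread: {spread:.1f}x")
print(campaign_cost(drafts=10, finals=1))  # ten cheap drafts, one premium final
print(campaign_cost(drafts=0, finals=11))  # all eleven renders on the premium tier
```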

Veo 3.1 — cinematic photoreal with dialogue baked in

Veo 3.1 is the model to reach for when the scene needs to look like it could have been shot on a real camera, by a real cinematographer, with real people speaking. The 3.1 family is the first generative video model where you can genuinely confuse a clip for live-action footage at 1080p — particularly for human faces, skin texture, and natural lighting.

The Standard tier ships at $0.40 per second at 720p and 1080p. That is expensive enough that you do not run it casually, but cheap enough that a finished 8-second hero shot costs $3.20 — easily justified for a paid ad. The Fast and Lite tiers exist so you do not spend hero money on every draft.

Where Veo 3.1 falls short: complex physical interactions, action sequences with multiple moving subjects, and anything stylized. Veo is also the only model on this list that does not support 1:1 aspect ratio — pick 16:9 or 9:16 only.

Kling — when motion and physics are the point

Kling is the better model the moment your shot is about motion. A dancer mid-spin, water pouring, a car drifting, a person sprinting through a crowd: Kling renders these with a physical believability that Veo struggles to match. The v3 generation also introduces multi-shot storyboarding: a single submission can produce one to six consecutive shots, each with its own prompt, stitched into a clip of up to 15 seconds.
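The multi-shot limits are easy to check before a job is submitted. The following is a sketch under stated assumptions: the one-to-six shot count and the 15-second total come from this guide, while the Shot structure, the per-shot seconds field, and the validator are hypothetical and do not reflect the real request schema.

```python
from dataclasses import dataclass

MAX_SHOTS = 6           # from this guide: one to six shots per submission
MAX_TOTAL_SECONDS = 15  # from this guide: stitched clip length limit


@dataclass
class Shot:
    prompt: str
    seconds: float  # assumption: the guide does not say durations are set per shot


def validate_storyboard(shots: list[Shot]) -> float:
    """Check a storyboard against the quoted limits and return its total length."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(shots)}")
    if any(not shot.prompt.strip() for shot in shots):
        raise ValueError("every shot needs its own prompt")
    total = sum(shot.seconds for shot in shots)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"storyboard runs {total}s, limit is {MAX_TOTAL_SECONDS}s")
    return total


board = [
    Shot("wide shot of a skater dropping into a bowl", 5),
    Shot("close-up of wheels grinding the coping", 4),
    Shot("slow-motion landing, dust kicking up", 5),
]
print(validate_storyboard(board))
```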

The Kling catalog is genuinely confusing because Kuaishou ships aggressively — eleven model variants are in production right now. Three of them do nearly all the work: v2.1 Master for max-quality short clips, v2.6 Pro with audio for motion plus ambient SFX, and v3 for anything 11+ seconds or multi-shot.

Where Kling falls short: it is silent by default, so you build an audio pipeline around it unless you are on v2.6 Pro or v3. It also requires hosted HTTPS image URLs for image-to-video — Salk handles this transparently, but it adds a hop. And the v2.1 Master tier is the single most expensive 5-second clip on this entire list.

HeyGen Video Agent — when a person needs to say words

HeyGen Video Agent is in a different category from the rest. Veo, Kling, and Seedance are generative — they synthesize pixels from prompts. HeyGen is composed — it takes a script and an avatar (a real person’s likeness, or a stock photo-avatar) and produces a video of that person delivering that script with frame-accurate lipsync.

This is the only path that actually works for explainers, ad presenters, internal training videos, multilingual product walk-throughs, or anything where the message is the words. At $0.0333 per second, a 60-second explainer costs about $2 — a price point that is genuinely hard to beat for the production value.
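The per-second prices quoted in this guide make clip costs easy to estimate. A minimal sketch follows; the dictionary keys are illustrative labels rather than API identifiers, and the table includes only prices the guide states explicitly.

```python
# Per-second prices in USD, as quoted in this guide (May 2026).
# Keys are illustrative labels, not API identifiers.
PRICE_PER_SECOND = {
    "heygen-video-agent": 0.0333,
    "veo-3.1-standard-1080p": 0.40,
    "seedance-2.0-fast": 0.081,
}


def clip_cost(model: str, seconds: float) -> float:
    """Estimated cost in USD of one clip, rounded to cents."""
    return round(PRICE_PER_SECOND[model] * seconds, 2)


print(clip_cost("heygen-video-agent", 60))     # the 60-second explainer
print(clip_cost("veo-3.1-standard-1080p", 8))  # the 8-second hero shot
print(clip_cost("seedance-2.0-fast", 8))       # an 8-second social cut
```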

Where HeyGen falls short: it cannot generate the cinematic B-roll you would intercut around the presenter. That is still Veo or Kling. A good ad workflow uses HeyGen for the spoken segments and one of the generative models for everything else, then edits them together — which is exactly the case Salk’s canvas was built for.

Seedance 2.0 — fast iteration for short social cuts

Seedance is the newest entrant. ByteDance launched 2.0 in March 2026 with a unified multimodal architecture that handles text, image, audio, and video inputs in one model. Where it wins: very short (4–8 second) social-format clips where you need to iterate quickly. The Fast tier comes in at $0.081 per second, which is competitive with Veo 3.1 Lite — the difference is that Seedance produces natively at 720p with joint audio-video, while Veo Lite is silent unless you upgrade.

Where Seedance falls short: the catalog is small, durations cap at 15 seconds, and 4K is not supported. API access is also tighter than the others — at the time of writing it is available primarily through partner platforms rather than direct ByteDance keys. Salk routes Seedance through a stable fallback path so you do not have to think about that.

What about Runway, Pika, Sora, and Luma?

A fair question. All four are excellent models. Here is the short version of why they are not in Salk today:

Runway Gen-4 is competitive with Kling on motion and has the best in-browser editing tools, but its API quotas have been unreliable for production batch workloads.
Pika 2.1 is the best model for cartoony/stylized output and the only one with a mature Pikaffects inpainting flow. We currently lean on Kling and Veo for stylized prompts because their consistency at scale is better.
Sora is OpenAI’s flagship and produces stunning hero clips, but per-second pricing is materially higher than Veo 3.1 Standard for outputs that are not consistently better. The price-to-output ratio is the gate.
Luma Ray 2 does excellent slow, cinematic camera work and is a real choice for high-end commercial. We are tracking it for a future addition.

None of these are bad calls. They are simply second-place picks for the specific jobs the four shipping models already win.

Common mistakes

Using Veo 3.1 Standard for drafts. You are spending 8x more than Lite for output you are going to throw away. Draft on Lite, finalize on Standard.
Asking Kling for lipsync. Kling’s lips will move; the words will not be there. Use HeyGen for any spoken-word shot.
Using HeyGen for landscape scenery or product shots. The Video Agent is tuned for human presenters. It will produce something, but it will not be your hero clip.
Sending a 4K request to a model that does not support it. Veo 3.1 Standard and Fast can do 4K, but only at 8 seconds. Veo 3.1 Lite, Veo 3, and Kling cap at 1080p.
Picking a model by "what is newest." Kling v3 came out three months after v2.1 Master, and v2.1 Master is still the right pick for many shots. New does not equal better.
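The 4K mistake in particular can be caught with a pre-flight check. The sketch below encodes only the resolution limits stated in this list. The model labels and the function are hypothetical, and Seedance is omitted because the guide says only that it lacks 4K, without giving its exact ceiling.

```python
# Resolution ceilings from this guide. Labels are illustrative, not API identifiers.
MAX_RESOLUTION = {
    "veo-3.1-standard": "4k",
    "veo-3.1-fast": "4k",
    "veo-3.1-lite": "1080p",
    "veo-3": "1080p",
    "kling": "1080p",
}
RANK = {"720p": 0, "1080p": 1, "4k": 2}


def check_resolution(model: str, resolution: str, seconds: float) -> None:
    """Raise ValueError if the request exceeds what the guide says the model supports."""
    ceiling = MAX_RESOLUTION[model]
    if RANK[resolution] > RANK[ceiling]:
        raise ValueError(f"{model} caps at {ceiling}, requested {resolution}")
    if resolution == "4k" and seconds != 8:
        raise ValueError("4K on Veo 3.1 is only available at 8 seconds")


check_resolution("veo-3.1-standard", "4k", 8)  # passes silently
check_resolution("kling", "1080p", 5)          # passes silently
```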

Cheat sheet: pick by goal

If you only remember one section of this article:

Hero ad shot (cinematic, photoreal, with dialogue) → Veo 3.1 Standard at 1080p.
Same shot, drafting phase → Veo 3.1 Fast or Veo 3.1 Lite for ideation, switch to Standard on the keeper.
Product loop with fast motion, no dialogue → Kling v2.1 Master at 5s.
Storyboarded multi-shot scene (3–6 cuts, up to 15s total) → Kling v3 with multi_shot.
Action shot with ambient SFX → Kling v2.6 Pro with audio on.
Talking-head explainer or presenter ad → HeyGen Video Agent at landscape or portrait.
Short social cut, need it now, budget tight → Seedance 2.0 Fast.
Anything 4K → Veo 3.1 Standard or Veo 3.1 Fast at 8 seconds.
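The cheat sheet can also be read as a short chain of checks. The following sketch is one possible encoding, not Salk's routing logic: the recommendations are taken from the list above, while the parameter names and the order in which the checks fire are assumptions made for illustration.

```python
def pick_model(
    *,
    talks_to_camera: bool = False,
    needs_4k: bool = False,
    multi_shot: bool = False,
    seconds: float = 5,
    fast_motion: bool = False,
    ambient_sfx: bool = False,
    tight_budget: bool = False,
    drafting: bool = False,
) -> str:
    """Map a shot description to a cheat-sheet recommendation."""
    if talks_to_camera:
        return "HeyGen Video Agent"
    if seconds > 15:
        raise ValueError("no generative model in this guide goes past 15 seconds")
    if needs_4k:
        return "Veo 3.1 Standard or Fast at 8 seconds"
    if multi_shot or seconds > 8:
        # Kling v3 covers up to 15 s; legacy Kling would also handle 9 to 10 s.
        return "Kling v3"
    if tight_budget:
        return "Seedance 2.0 Fast"
    if fast_motion:
        return "Kling v2.6 Pro with audio on" if ambient_sfx else "Kling v2.1 Master at 5s"
    if drafting:
        return "Veo 3.1 Fast or Lite"
    return "Veo 3.1 Standard at 1080p"


print(pick_model(talks_to_camera=True))
print(pick_model(fast_motion=True, ambient_sfx=True))
print(pick_model(multi_shot=True, seconds=15))
```

Putting the lipsync check first mirrors the first of the five questions: a hard "yes" there settles the choice before budget or motion come into it.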

Try it in Salk

Salk’s canvas wires all four families behind one set of skill nodes. You do not pick a model from a flat dropdown — describe the shot to the chat agent, it queues the right one, and you can swap models on the same prompt without re-uploading reference assets. Image-to-video, text-to-video, multi-shot storyboards, and HeyGen avatars all live on the same canvas, which means you can intercut a Veo hero shot with a HeyGen presenter without leaving the page.

Start free. Pay only for the seconds you actually render. No subscription.
