Pose-driven vs audio-driven avatars

One-line summary: AI avatar generation splits into two driving-signal paradigms, motion-from-reference-video (pose-driven) and motion-from-audio (audio-driven), which are converging fast but currently produce models with very different input-to-output shapes.

The insight

Avatar generation models can be classified by the signal that drives the motion:

  • Pose-driven: a driving (reference) video supplies motion through extracted skeleton poses (DWPose, ViTPose) or denser representations (DensePose, SMPL-X). The model animates the reference identity through that motion sequence. Best for "puppeting": using a webcam feed or a recorded driving performance to animate a character. Examples: mimicmotion, animate-anyone-2, stableanimator, unianimate, wan-animate, runway-act-two.
  • Audio-driven: a voice or music track supplies the motion, and the model infers gestures, head movement, and lip sync from the audio. Best for "talking-head" content where the user has audio but no driving performance video. Examples: echomimic-v2, omnihuman-1-5, heygen-avatar-v.
  • Hybrid: a growing number of models accept both signals. EchoMimicV2 is primarily audio-driven but optionally accepts pose. Wan 2.2's S2V-14B has a --pose_video flag that lets a pose video drive generation in sync with the audio.
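
The three-way split above can be sketched as a small data structure. This is a minimal illustration in Python, assuming a model is characterised only by which driving signals it accepts. The names Paradigm, AvatarModel, and classify are hypothetical, and the per-model flags are read off the lists in this note rather than from any model's documentation.

```python
from dataclasses import dataclass
from enum import Enum


class Paradigm(Enum):
    POSE_DRIVEN = "pose-driven"
    AUDIO_DRIVEN = "audio-driven"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class AvatarModel:
    name: str
    accepts_pose: bool   # takes a driving video or pose sequence
    accepts_audio: bool  # takes a voice or music track


def classify(model: AvatarModel) -> Paradigm:
    """Classify a model by which driving signals it accepts."""
    if model.accepts_pose and model.accepts_audio:
        return Paradigm.HYBRID
    if model.accepts_pose:
        return Paradigm.POSE_DRIVEN
    if model.accepts_audio:
        return Paradigm.AUDIO_DRIVEN
    raise ValueError(f"{model.name} accepts no driving signal")


# Flags below mirror the lists in this note (an assumption, not model docs).
MODELS = [
    AvatarModel("mimicmotion", accepts_pose=True, accepts_audio=False),
    AvatarModel("omnihuman-1-5", accepts_pose=False, accepts_audio=True),
    AvatarModel("echomimic-v2", accepts_pose=True, accepts_audio=True),
]

for m in MODELS:
    print(f"{m.name}: {classify(m).value}")
```

Classifying by accepted inputs rather than by a fixed label means a model that gains an optional second signal moves to the hybrid bucket without any change to the taxonomy itself.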

Evidence

  • 2026-05-07-ai-avatar-motion-mimicking-models-survey documents this split across all 11+ models surveyed.
  • All pose-driven open-source models in the survey follow the same pipeline pattern: pose extraction → pose encoding → reference-image conditioning → diffusion denoising. The differentiators are which pose representation is used, how pose-detector noise is handled, and which backbone the model builds on.
  • Audio-driven models in the survey (echomimic-v2, omnihuman-1-5, heygen-avatar-v) all converge on a similar shape: image + audio → talking-head or semi-body video, with identity preserved across the generation.
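
The shared pose-driven pattern (pose extraction → pose encoding → reference-image conditioning → diffusion denoising) can be sketched as an ordered list of stages. This is a structural sketch only: every function below is a hypothetical stub that passes strings around, not a real detector, encoder, or diffusion model, and none of the names come from the surveyed codebases.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]


def extract_pose(state: State) -> State:
    # Stand-in for a pose detector: one pose per driving frame.
    return {**state, "poses": [f"pose[{f}]" for f in state["driving_frames"]]}


def encode_pose(state: State) -> State:
    # Stand-in for a pose encoder.
    return {**state, "pose_codes": [f"enc({p})" for p in state["poses"]]}


def condition_on_reference(state: State) -> State:
    # Pair every pose code with the same identity image.
    ref = state["reference_image"]
    return {**state, "conditions": [(ref, c) for c in state["pose_codes"]]}


def denoise(state: State) -> State:
    # Stand-in for diffusion denoising: one output frame per condition.
    frames = [f"frame<{ref}|{code}>" for ref, code in state["conditions"]]
    return {**state, "video": frames}


PIPELINE: List[Stage] = [extract_pose, encode_pose, condition_on_reference, denoise]


def run(state: State, stages: List[Stage] = PIPELINE) -> State:
    """Apply each stage in order, threading the state through."""
    for stage in stages:
        state = stage(state)
    return state


result = run({"reference_image": "avatar.png", "driving_frames": ["f0", "f1", "f2"]})
print(result["video"])
```

Under this framing, the differentiators named in the survey correspond to swapping individual stages (a different pose representation in extract_pose, a noise-handling step before encode_pose, a different backbone in denoise) while the ordering stays fixed.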

Design implications

  • Choose the paradigm by what input you actually have. If you have a driving video (or a webcam), pose-driven is more controllable. If you have only audio (voiceover, recorded podcast, AI-generated speech), audio-driven is the only option.
  • The "puppet your virtual influencer in real time from your webcam" workflow lives squarely in the pose-driven branch. runway-act-two is the leading closed-source product; wan-animate is the leading open option.
  • Closed-source state-of-the-art audio-driven systems (heygen-avatar-v, omnihuman-1-5) still outperform their open counterparts. The pose-driven open lineage has caught up much faster than the audio-driven one.
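
The selection rule in this section can be written down as a short helper. This is a hedged sketch: choose_paradigm, shortlist, and the LEADING table are hypothetical names, the candidates are only the ones this note names as leading, and preferring pose-driven whenever a driving video exists (even if audio is also available) is an assumption drawn from the "more controllable" remark. No open audio-driven candidate is listed because the note does not name one as leading.

```python
from typing import Dict, List, Tuple

# Candidates named in this note as leading, keyed by (paradigm, licence).
LEADING: Dict[Tuple[str, str], List[str]] = {
    ("pose-driven", "closed"): ["runway-act-two"],
    ("pose-driven", "open"): ["wan-animate"],
    ("audio-driven", "closed"): ["heygen-avatar-v", "omnihuman-1-5"],
}


def choose_paradigm(has_driving_video: bool, has_audio: bool) -> str:
    """Pick a paradigm from the inputs that are actually available."""
    if has_driving_video:
        # Assumption: prefer pose-driven for controllability, even with audio.
        return "pose-driven"
    if has_audio:
        return "audio-driven"
    raise ValueError("need a driving video or an audio track")


def shortlist(paradigm: str, open_only: bool) -> List[str]:
    """Return the note's leading candidates for a paradigm."""
    licences = ["open"] if open_only else ["closed", "open"]
    return [name for lic in licences for name in LEADING.get((paradigm, lic), [])]


paradigm = choose_paradigm(has_driving_video=True, has_audio=False)
print(paradigm, shortlist(paradigm, open_only=True))
```

An empty shortlist for open audio-driven models reflects a gap in this note's list, not a claim that no such model exists.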

Contradictions / tensions

  • Hybrid models (EchoMimicV2, Wan 2.2 S2V) blur the line. A single model that takes "image + (audio OR pose video OR both)" may eventually dominate, making the dichotomy historical.
  • "Audio-driven" is a misleading label for talking-head models that also take a reference video, such as heygen-avatar-v. That model needs a reference video for identity, and its motion comes mostly from learned static+dynamic identity priors, with audio providing only the rhythm.

Open questions

Sources
