HeyGen Avatar V
One-line summary: HeyGen's closed-source, commercial talking-head avatar model, which conditions on the full token sequence of a reference video (rather than a low-dimensional embedding) at every transformer layer.
What it is
A model that turns a reference video plus driving audio into a lip-synced talking-head video. It differs from most open-source counterparts in (a) its identity representation, which uses full-token reference attention rather than a face encoder plus ID adapter, and (b) its focus on talking heads rather than full-body animation.
Why it matters to ai-video-generation
The most architecturally distinctive identity-preservation approach surfaced in the survey. It also represents the closed-source talking-head frontier: together with omnihuman-1-5, it shows where commercial, closed AI avatar work is converging. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Vendor: HeyGen.
- Inputs: a single reference video (not just an image) + driving audio.
- Output: talking-head lip-synced video. Not full-body.
- Evaluation: vendor-reported pairwise preference rates of 68.9–85.7% over competitors in human evaluation.
Technical contributions
- Sparse Reference Attention: rather than compressing identity into a low-dimensional embedding, the model conditions on the full token sequence of the user's reference video at every transformer layer. Structured sparsity makes the cost scale near-linearly with reference length.
- Static vs dynamic identity disentanglement:
  - Static features = time-invariant traits (dental structure, skin texture, facial geometry, hair, accessories).
  - Dynamic features = behavioral patterns (talking rhythm, habitual micro-expressions, gestural tendencies).
- Five-stage training curriculum: text-to-video pretraining → audio-to-video pretraining → supervised personality fine-tuning → distillation → RLHF alignment.
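The near-linear scaling claim for Sparse Reference Attention can be made concrete with a rough count of query-key pairs. The sketch below is illustrative only: HeyGen has not published its sparsity pattern, so the pattern used here (generated tokens see everything, reference tokens see only a local window of other reference tokens) and the names `dense_pairs`, `sparse_pairs`, and `window` are assumptions chosen to show how a structured pattern can avoid the quadratic term in reference length.

```python
def dense_pairs(num_gen: int, num_ref: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    total = num_gen + num_ref
    return total * total


def sparse_pairs(num_gen: int, num_ref: int, window: int) -> int:
    """Query-key pairs under a HYPOTHETICAL structured pattern (an assumption,
    not HeyGen's published design):
      - each generated token attends to all generated and all reference tokens;
      - each reference token attends only to a local window of reference tokens.
    """
    gen_side = num_gen * (num_gen + num_ref)
    ref_side = num_ref * min(window, num_ref)
    return gen_side + ref_side


if __name__ == "__main__":
    # Hypothetical sizes: 256 generated tokens, growing reference length.
    for num_ref in (1_000, 2_000, 4_000, 8_000):
        d = dense_pairs(256, num_ref)
        s = sparse_pairs(256, num_ref, window=64)
        print(f"ref={num_ref:>5}  dense={d:>12,}  sparse={s:>10,}")
```

Under this assumed pattern, each extra reference token adds a fixed `num_gen + window` pairs, so cost grows linearly with reference length, while the dense count grows quadratically.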
Strengths
- Fine-grained identity preservation — dental, micro-expression, gestural style.
- Cross-scene generation: appearance and behavior transfer to new environments.
- Production-oriented training pipeline that ends with distillation and RLHF alignment.
Weaknesses
- Closed source / commercial.
- Talking-head only — explicitly not full-body animation.
- Self-reported preference numbers; no independent benchmarks surfaced.
Sources
Related
- omnihuman-1-5
- stableanimator — open-source identity-preservation counterpart with a different mechanism (Face Encoder + ID-Adapter + HJB).
- runway-act-two
- identity-preservation-video-diffusion
- pose-driven-vs-audio-driven-avatars