HeyGen Avatar V

One-line summary: HeyGen's closed/commercial talking-head avatar model that conditions on the full token sequence of a reference video (rather than a low-dimensional embedding) at every transformer layer.

What it is

A model that turns a reference video plus driving audio into a lip-synced talking-head video. It differs from most open-source counterparts in two ways: (a) its identity representation, which uses full-token reference attention rather than a face encoder plus ID adapter, and (b) its focus on talking heads only rather than full-body animation.

Why it matters to ai-video-generation

It is the most architecturally distinctive identity-preservation approach surfaced in the survey (2026-05-07-ai-avatar-motion-mimicking-models-survey). It also represents the closed-source talking-head frontier: together with omnihuman-1-5, it shows where commercial closed AI avatar work is converging.

Key facts

  • Vendor: HeyGen.
  • Inputs: a single reference video (not just an image) + driving audio.
  • Output: talking-head lip-synced video. Not full-body.
  • Evaluation: vendor-reported pairwise preference rate of 68.9–85.7% over competitors in human evaluation.
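
For readers unfamiliar with pairwise preference numbers, here is a minimal sketch of how such a rate is typically computed from rater votes. The function name, the vote counts, and the convention of counting a tie as half a win are all assumptions made for illustration; the vendor's actual evaluation protocol is not described in this note.

```python
def preference_rate(wins, losses, ties=0):
    """Share of pairwise comparisons in which the model was preferred.

    Assumption: a tie counts as half a win. This is a common convention,
    not something confirmed for the vendor's evaluation.
    """
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


# Hypothetical vote counts, not vendor data.
rate = preference_rate(wins=70, losses=30)
print(f"{rate:.1%}")
```

A rate above 50% means raters preferred the model more often than the competitor it was paired against. The reported range most likely reflects different competitors or evaluation criteria, though the note does not say which.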

Technical contributions

  • Sparse Reference Attention: rather than compressing identity into a low-dimensional embedding, the model conditions on the full token sequence of the user's reference video at every transformer layer. Structured sparsity makes the cost scale near-linearly with reference length.
  • Static vs dynamic identity disentanglement:
    • Static features = time-invariant traits (dental structure, skin texture, facial geometry, hair, accessories).
    • Dynamic features = behavioral patterns (talking rhythm, habitual micro-expressions, gestural tendencies).
  • Five-stage training curriculum: text-to-video pretraining → audio-to-video pretraining → supervised personality fine-tuning → distillation → RLHF alignment.
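
The cost claim for Sparse Reference Attention can be illustrated with a toy sketch. HeyGen's actual sparsity pattern is not public, so everything below is an assumption: reference and generated tokens are placed in one joint sequence, reference tokens attend only within fixed-size local blocks, and generated tokens attend to every token. The names `allowed`, `attention_pairs`, and `masked_attention` are hypothetical.

```python
import math


def allowed(i, j, n_ref, block):
    """Return True if position i may attend to position j.

    Assumed layout: positions [0, n_ref) are reference tokens and the
    remaining positions are generated tokens.
    """
    if i >= n_ref:
        # Generated tokens see the full reference and all generated tokens.
        return True
    # Reference tokens only see reference tokens in their own block.
    return j < n_ref and i // block == j // block


def attention_pairs(n_ref, n_gen, block):
    """Count allowed (query, key) pairs as a rough proxy for attention cost."""
    n = n_ref + n_gen
    return sum(allowed(i, j, n_ref, block) for i in range(n) for j in range(n))


def masked_attention(q, k, v, n_ref, block):
    """Scaled dot-product attention restricted to the allowed pairs."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        idx = [j for j in range(len(k)) if allowed(i, j, n_ref, block)]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in idx]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, idx)) / z
            for c in range(len(v[0]))
        ])
    return out


# Doubling the reference length roughly doubles the sparse cost,
# while dense attention over the joint sequence grows much faster.
for n_ref in (64, 128):
    sparse = attention_pairs(n_ref, n_gen=8, block=8)
    dense = (n_ref + 8) ** 2
    print(n_ref, sparse, dense)
```

Under these assumptions the sparse pair count goes from 1088 to 2112 when the reference doubles from 64 to 128 tokens, while the dense count goes from 5184 to 18496. That is the sense in which structured sparsity keeps cost near-linear in reference length while every generated token still conditions on the full reference sequence.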

Strengths

  • Fine-grained identity preservation — dental, micro-expression, gestural style.
  • Cross-scene generation: appearance and behavior transfer to new environments.
  • Production-oriented training pipeline that ends with distillation and RLHF alignment.

Weaknesses

  • Closed source / commercial.
  • Talking-head only — explicitly not full-body animation.
  • Self-reported preference numbers; no independent benchmarks surfaced.
