HeyGen Avatar V

One-line summary: HeyGen's closed-source, commercial talking-head avatar model, which conditions on the full token sequence of a reference video (rather than a low-dimensional embedding) at every transformer layer.

What it is

A model that takes a reference video plus driving audio and produces a lip-synced talking-head video. It differs from most open-source counterparts in (a) its identity representation strategy, which uses full-token reference attention rather than a face encoder plus ID-adapter, and (b) its talking-head-only focus rather than full-body animation.

Why it matters to ai-video-generation

The most architecturally distinctive identity-preservation approach surfaced in the survey. It also represents the closed-source talking-head frontier: together with omnihuman-1-5, it shows where commercial closed-source avatar work is converging. Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

Vendor: HeyGen.
Inputs: a single reference video (not just an image) + driving audio.
Output: talking-head lip-synced video. Not full-body.
Human evaluation: vendor-reported pairwise preference of 68.9–85.7% over competitors.
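
Because the preference figures above are self-reported and the source gives no sample sizes, it can help to see how much a pairwise preference rate depends on the number of comparisons behind it. The sketch below computes a preference rate and a Wilson score interval. The counts (689 of 1,000 and 69 of 100) are hypothetical, chosen only so the rate lands near 68.9%; they are not taken from HeyGen's evaluation.

```python
import math


def preference_rate(wins, total):
    """Fraction of pairwise comparisons in which raters preferred the model."""
    return wins / total


def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts, for illustration only.
for wins, total in [(689, 1000), (69, 100)]:
    low, high = wilson_interval(wins, total)
    print(f"{wins}/{total}: rate={preference_rate(wins, total):.3f}, "
          f"interval=({low:.3f}, {high:.3f})")
```

The same headline rate is far less certain at 100 comparisons than at 1,000, which is one reason unreported sample sizes limit how much weight the vendor numbers can carry.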

Technical contributions

Sparse Reference Attention: rather than compressing identity into a low-dimensional embedding, the model conditions on the full token sequence of the user's reference video at every transformer layer. Structured sparsity makes the cost scale near-linearly with reference length.
Static vs dynamic identity disentanglement:
- Static features = time-invariant traits (dental structure, skin texture, facial geometry, hair, accessories).
- Dynamic features = behavioral patterns (talking rhythm, habitual micro-expressions, gestural tendencies).
Five-stage training curriculum: text-to-video pretraining → audio-to-video pretraining → supervised personality fine-tuning → distillation → RLHF alignment.
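
The source does not describe the actual sparsity pattern behind Sparse Reference Attention, so the following is only a minimal Python sketch under an assumed one: generated tokens attend to every token, while reference tokens attend only within fixed-size local blocks. The function names, the block layout, and the token counts are all hypothetical. The sketch illustrates why structured sparsity can make the number of attended pairs grow roughly linearly with reference length instead of quadratically.

```python
import math


def sparse_reference_mask(n_gen, n_ref, block):
    """Boolean mask over (n_gen + n_ref) tokens; True means "may attend".

    Assumed pattern (not from the source): generated tokens come first and
    see everything; reference tokens see only their own local block.
    """
    n = n_gen + n_ref
    mask = [[False] * n for _ in range(n)]
    for i in range(n_gen):
        mask[i] = [True] * n
    for start in range(0, n_ref, block):
        lo = n_gen + start
        hi = n_gen + min(start + block, n_ref)
        for i in range(lo, hi):
            for j in range(lo, hi):
                mask[i][j] = True
    return mask


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


def masked_attention(q, k, v, mask):
    """Softmax attention that only evaluates the pairs allowed by the mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, allowed)) / total
                    for c in range(len(v[0]))])
    return out


# Compare sparse and dense pair counts as the reference grows.
for n_ref in (32, 64, 128):
    sparse = attended_pairs(sparse_reference_mask(8, n_ref, 8))
    dense = (8 + n_ref) ** 2
    print(n_ref, sparse, dense)
```

Under this assumed pattern the pair count is n_gen² + n_ref·(n_gen + block). With n_gen = 8 and block = 8, doubling the reference from 32 to 64 tokens raises the count from 576 to 1,088, while dense joint attention over the same tokens would go from 1,600 to 5,184.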

Strengths

Fine-grained identity preservation: dental structure, micro-expressions, gestural style.
Cross-scene generation: appearance and behavior transfer to new environments.
Production-grade training pipeline that ends with RLHF alignment.

Weaknesses

Closed source / commercial.
Talking-head only; explicitly not full-body animation.
Self-reported preference numbers; no independent benchmarks surfaced.

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

omnihuman-1-5
stableanimator: open-source identity-preservation counterpart with a different mechanism (Face Encoder + ID-Adapter + HJB-based optimization).
runway-act-two
identity-preservation-video-diffusion
pose-driven-vs-audio-driven-avatars
