Identity preservation in video diffusion

One-line summary: The recurring hard problem in avatar generation — keeping a subject's identity consistent across long video durations — is being attacked through several distinct architectural strategies in 2025.

The insight

Current video diffusion models drift in identity over time. The face slowly turns into someone else; the clothing pattern changes; the body proportions shift. Every credible avatar model has a deliberate strategy for this, and the strategies cluster into a small number of families.

Strategies surveyed

Face-encoder + ID-adapter inside diffusion (stableanimator):

A global content-aware Face Encoder refines face embeddings via interaction with image embeddings.
A distribution-aware ID-Adapter aligns embeddings to prevent temporal-layer interference.
An HJB (Hamilton-Jacobi-Bellman) inference-time face optimization runs in parallel with denoising.
Claimed to be the first end-to-end ID-preserving framework, with no post-hoc face swap.

Full-token sparse reference attention (heygen-avatar-v):

Skips the embedding bottleneck entirely. Conditions on the full token sequence of the reference video at every transformer layer.
Structured sparsity makes cost scale near-linearly with reference length.
Disentangles static identity (dental structure, skin texture) from dynamic identity (talking rhythm, gestural style).
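A minimal sketch of why structured sparsity can keep attention cost near-linear in reference length. The pattern here is an assumption chosen for illustration: a local window among reference tokens plus strided access from generated tokens. The source does not specify the actual sparsity pattern, and `sparse_reference_mask`, `window`, and `stride` are hypothetical names.

```python
import numpy as np

def sparse_reference_mask(n_ref: int, n_gen: int, window: int = 4, stride: int = 8) -> np.ndarray:
    """Boolean mask over the joint sequence [reference tokens | generated tokens].

    Hypothetical pattern (an assumption, not the published design):
      - reference tokens attend only to a local window of reference tokens
      - generated tokens attend to every `stride`-th reference token
        and to all generated tokens
    """
    n = n_ref + n_gen
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(n_ref)
    # Local band among reference tokens: about (2 * window + 1) keys per query.
    mask[:n_ref, :n_ref] = np.abs(idx[:, None] - idx[None, :]) <= window
    # Generated tokens see a strided subset of the reference.
    mask[n_ref:, 0:n_ref:stride] = True
    # Generated tokens see each other densely.
    mask[n_ref:, n_ref:] = True
    return mask

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention that ignores pairs where mask is False."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Attended query-key pairs as the reference doubles in length.
for n_ref in (256, 512):
    n_gen = 16
    sparse_pairs = int(sparse_reference_mask(n_ref, n_gen).sum())
    dense_pairs = (n_ref + n_gen) ** 2
    print(n_ref, sparse_pairs, dense_pairs)
```

Under this pattern the number of attended pairs roughly doubles when the reference doubles, while dense joint attention roughly quadruples. The dense boolean mask is only for counting; a real implementation would use a block-sparse kernel and never materialize the full matrix.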

3D-anchored hybrid (persona-avatar-3d):

Uses a 3D parametric body model (SMPL-X) plus 3D Gaussian Splatting as the identity anchor.
Diffusion (mimicmotion) is used upstream only as a synthetic-data generator; its output is then baked into the 3D representation.
Balanced sampling oversamples the input image during training to prevent diffusion drift.
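The balanced-sampling idea can be sketched as a weighted draw. This is an illustration only: `balanced_sample` and `input_share` are hypothetical names, and the 0.5 share is an assumed value, not one taken from the source.

```python
import random

def balanced_sample(input_image, synthetic_frames, batch_size, input_share=0.5, rng=None):
    """Draw a training batch that oversamples the single real input image.

    input_share is the (assumed) probability that a draw returns the real
    input image instead of one of the diffusion-generated frames.
    """
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        # Fall back to the input image if there are no synthetic frames.
        if not synthetic_frames or rng.random() < input_share:
            batch.append(input_image)
        else:
            batch.append(rng.choice(synthetic_frames))
    return batch

# One real image against 100 synthetic frames: uniform sampling would show
# the real image in about 1% of draws, balanced sampling in about half.
frames = [f"synthetic_{i}" for i in range(100)]
batch = balanced_sample("input", frames, batch_size=8, rng=random.Random(0))
```

The point of the design is that the one trustworthy image is not drowned out by a much larger pool of generated frames that may already carry identity drift.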

Implicit facial features (wan-animate):

Encodes raw face images directly (rather than landmarks) and injects the resulting features via cross-attention into dedicated Face Blocks.
Data augmentation (scaling, color jittering, noise) during training disentangles identity from expression.
Combined with spatially aligned skeleton signals from ViTPose for body identity.
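A minimal sketch of the three augmentations named above (scaling, color jittering, noise) applied to a face crop. The function name, parameter ranges, and nearest-neighbor resampling are assumptions for illustration; the source does not give the actual augmentation settings.

```python
import numpy as np

def augment_face(face, rng, scale_range=(0.8, 1.2), jitter=0.2, noise_std=0.02):
    """Perturb a face crop of shape (H, W, 3) with float values in [0, 1].

    All ranges are illustrative defaults, not values from the source.
    """
    h, w, _ = face.shape
    # Random scaling about the center, nearest-neighbor, output size unchanged.
    s = rng.uniform(*scale_range)
    ys = np.clip(((np.arange(h) - h / 2) / s + h / 2).round().astype(int), 0, h - 1)
    xs = np.clip(((np.arange(w) - w / 2) / s + w / 2).round().astype(int), 0, w - 1)
    out = face[ys][:, xs]
    # Color jitter: per-channel gain plus a global brightness shift.
    gain = rng.uniform(1 - jitter, 1 + jitter, size=3)
    shift = rng.uniform(-jitter, jitter)
    out = out * gain + shift
    # Additive Gaussian noise.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
face = rng.random((64, 64, 3))      # stand-in for a real face crop
augmented = augment_face(face, rng)
```

Each perturbation changes appearance cues (size, color, pixel statistics) while leaving the expression itself intact, which is the property the training recipe relies on.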

Evidence

All of the above strategies are documented in 2026-05-07-ai-avatar-motion-mimicking-models-survey.
That the field has converged on "we need a dedicated identity-preservation mechanism" is itself a signal: naive video diffusion does not preserve identity well enough.

Design implications

For long-form output, identity preservation is a first-class concern, not a polish step.
The full-token attention approach (HeyGen) is the most expensive but reportedly the highest-fidelity, particularly for fine micro-expressions and idiosyncratic talking style.
The 3D-anchored approach trades up-front compute (~1 hour bake per subject in PERSONA) for cheap per-frame inference — appealing for real-time use cases.
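The bake-versus-inference trade-off reduces to a break-even calculation. In this sketch only the roughly one-hour bake comes from the note; the per-frame costs and frame rate are hypothetical placeholders, not measured figures.

```python
def break_even_seconds(bake_s, anchored_s_per_frame, diffusion_s_per_frame, fps=25.0):
    """Seconds of output video after which the baked approach is cheaper overall."""
    saving_per_frame = diffusion_s_per_frame - anchored_s_per_frame
    if saving_per_frame <= 0:
        return float("inf")  # the bake never pays off
    frames_to_break_even = bake_s / saving_per_frame
    return frames_to_break_even / fps

# ~1 hour bake (from the note); per-frame costs below are assumed.
seconds = break_even_seconds(bake_s=3600, anchored_s_per_frame=0.02, diffusion_s_per_frame=2.02)
```

With these assumed costs the bake pays for itself after a little over a minute of output per subject; the real threshold depends entirely on measured per-frame costs.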

Contradictions / tensions

HeyGen's vendor numbers (68.9–85.7% pairwise preference) and Wan-Animate's author-reported wins on human eval over Act-Two and DreamActor-M1 are both self-reported. The relative quality of these strategies is not independently established. See independent-avatar-benchmarks.

Open questions

See independent-avatar-benchmarks and hand-fidelity-comparison-across-avatar-models.

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey
