Identity preservation in video diffusion

One-line summary: The recurring hard problem in avatar generation — keeping a subject's identity consistent across long video durations — is being attacked through several distinct architectural strategies in 2025.

The insight

Current video diffusion models drift on identity over time: the face slowly turns into someone else, the clothing pattern changes, the proportions shift. Every credible avatar model has a deliberate strategy for this, and the strategies cluster into a small number of families.
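
One common way to make "drift" measurable is to embed the face in every generated frame and track cosine similarity to the reference embedding. The sketch below assumes the embeddings already exist (for example from a face-recognition encoder); the toy vectors, the `identity_drift` helper, and the 0.8 threshold are illustrative assumptions, not values from any of the surveyed models.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identity_drift(reference, frames, threshold=0.8):
    """Return per-frame similarity to the reference and the index of the
    first frame whose similarity falls below `threshold` (None if none do)."""
    sims = [cosine(reference, f) for f in frames]
    first_bad = next((i for i, s in enumerate(sims) if s < threshold), None)
    return sims, first_bad

# Toy embeddings standing in for real face-encoder outputs (assumption).
reference = [1.0, 0.0, 0.0]
frames = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.6, 0.8, 0.0]]
sims, first_bad = identity_drift(reference, frames)
```

In this toy run the third frame is the first to fall below the threshold, which is the kind of slow slide the strategies below are designed to prevent.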

Strategies surveyed

Face-encoder + ID-adapter inside diffusion (stableanimator):

  • A global content-aware Face Encoder refines face embeddings via interaction with image embeddings.
  • A distribution-aware ID-Adapter aligns embeddings to prevent temporal-layer interference.
  • An inference-time face optimization based on the Hamilton-Jacobi-Bellman (HJB) equation runs alongside the denoising steps.
  • Claimed by its authors as the first end-to-end ID-preserving video diffusion framework, with no post-hoc face swap.
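
To make the "distribution-aware" idea concrete, here is a minimal sketch that swaps in plain moment matching: the face embedding is shifted and rescaled so its mean and standard deviation match those of the image embedding. This is a stand-in chosen for illustration, not StableAnimator's actual ID-Adapter; `align_moments` and the toy vectors are hypothetical.

```python
import statistics

def align_moments(face_emb, image_emb, eps=1e-6):
    """Shift and rescale face_emb so its mean and (population) standard
    deviation match those of image_emb. Simple moment matching, used here
    only as an illustrative stand-in for distribution alignment."""
    mu_f = statistics.fmean(face_emb)
    sd_f = statistics.pstdev(face_emb)
    mu_i = statistics.fmean(image_emb)
    sd_i = statistics.pstdev(image_emb)
    return [(x - mu_f) / (sd_f + eps) * sd_i + mu_i for x in face_emb]

# Toy embeddings on very different scales (assumption).
face = [10.0, 12.0, 14.0, 16.0]
image = [-0.5, 0.0, 0.5, 1.0]
aligned = align_moments(face, image)
```

The intuition is that features arriving on a very different scale from the rest of the conditioning signal are more likely to be distorted by downstream layers; matching statistics first removes that mismatch.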

Full-token sparse reference attention (heygen-avatar-v):

  • Skips the embedding bottleneck entirely. Conditions on the full token sequence of the reference video at every transformer layer.
  • Structured sparsity makes cost scale near-linearly with reference length.
  • Disentangles static identity (dental structure, skin texture) from dynamic identity (talking rhythm, gestural style).
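
The near-linear scaling claim can be illustrated by counting attended query-key pairs. The sparsity pattern below is hypothetical: reference tokens attend only within a local window, while target tokens attend to every reference and target token. HeyGen's actual pattern is not described in this note, so `sparse_pairs` and its `window` parameter are assumptions.

```python
def dense_pairs(ref_len, tgt_len):
    """Query-key pairs for full self-attention over [reference + target]."""
    n = ref_len + tgt_len
    return n * n

def sparse_pairs(ref_len, tgt_len, window=64):
    """Query-key pairs under a hypothetical structured pattern:
    - each reference token attends to at most `window` reference tokens
    - each target token attends to all reference and all target tokens
    """
    ref_cost = ref_len * min(window, ref_len)
    tgt_cost = tgt_len * (ref_len + tgt_len)
    return ref_cost + tgt_cost

# Doubling the reference length with a fixed 100-token target.
dense_growth = dense_pairs(4000, 100) / dense_pairs(2000, 100)
sparse_growth = sparse_pairs(4000, 100) / sparse_pairs(2000, 100)
```

Under these assumptions, doubling the reference roughly quadruples the dense cost but only about doubles the sparse cost, while target tokens still see every reference token.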

3D-anchored hybrid (persona-avatar-3d):

  • Uses a 3D parametric body model (SMPL-X) plus 3D Gaussian Splatting as the identity anchor.
  • Diffusion (mimicmotion) is used upstream only as a synthetic-data generator; its output is then baked into the 3D representation.
  • Balanced sampling oversamples the input image during training to prevent diffusion drift.
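
Balanced sampling can be sketched as a weighting scheme in which the real input image keeps a fixed share of the sampling probability no matter how many synthetic frames are generated. The 50% share and the helper names are assumptions for illustration; the ratio PERSONA actually uses is not given in this note.

```python
import random

def balanced_weights(n_real, n_synth, real_fraction=0.5):
    """Per-item sampling weights such that real items together receive
    `real_fraction` of the probability mass and synthetic items the rest."""
    if n_real <= 0 or n_synth <= 0:
        raise ValueError("need at least one real and one synthetic item")
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    w_real = real_fraction / n_real
    w_synth = (1.0 - real_fraction) / n_synth
    return [w_real] * n_real + [w_synth] * n_synth

def sample_batch(items, weights, batch_size, seed=0):
    """Draw a training batch with replacement using the given weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=weights, k=batch_size)

# One real input image and 99 diffusion-generated frames (toy setup).
items = ["input_image"] + [f"synthetic_{i}" for i in range(99)]
weights = balanced_weights(n_real=1, n_synth=99)
batch = sample_batch(items, weights, batch_size=8)
```

With uniform sampling the input image would appear in about 1% of draws; here it is drawn about half the time, so artifacts in the synthetic frames cannot dominate the fitted representation.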

Implicit facial features (wan-animate):

  • Encodes raw face images directly (not landmark-based) and injects via cross-attention into dedicated Face Blocks.
  • Data augmentation (scaling, color jittering, noise) during training disentangles identity from expression.
  • Combined with spatially aligned skeleton signals from ViTPose for body identity.
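
The cross-attention injection can be sketched as follows: each video token queries the face tokens and adds the attended result back as a residual. This is a single-head, pure-Python simplification with the learned query, key, and value projections omitted; the function names and toy tensors are hypothetical and not taken from Wan-Animate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(video_tokens, face_tokens):
    """Residual cross-attention: video tokens are queries, face tokens are
    keys and values. Projections are identity to keep the sketch short."""
    d = len(face_tokens[0])
    out = []
    for q in video_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in face_tokens]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, face_tokens))
                    for j in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out

# Two video tokens and a single face token (toy values).
video = [[0.1, 0.2], [0.3, 0.4]]
face = [[1.0, -1.0]]
out = cross_attend(video, face)
```

With only one face token the attention weight is exactly 1, so each output is the video token plus the face feature. That makes the residual path easy to check by hand.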

Evidence

  • All of the above strategies are documented in 2026-05-07-ai-avatar-motion-mimicking-models-survey.
  • The fact that the field has converged on "we need a dedicated identity-preservation mechanism" is itself a signal: naive diffusion video does not preserve identity well enough.

Design implications

  • For long-form output, identity preservation is a first-class concern, not a polish step.
  • The full-token attention approach (HeyGen) is the most expensive but reportedly the highest-fidelity, particularly for fine micro-expressions and idiosyncratic talking style.
  • The 3D-anchored approach trades up-front compute (~1 hour bake per subject in PERSONA) for cheap per-frame inference — appealing for real-time use cases.
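
The bake-versus-inference trade-off reduces to a break-even calculation. Only the roughly one-hour bake comes from the note above; the per-frame costs used here (2.0 s for diffusion, 0.02 s for the 3D-anchored renderer) are made-up placeholders, and `break_even_frames` is a hypothetical helper.

```python
import math

def break_even_frames(bake_seconds, diffusion_s_per_frame, anchored_s_per_frame):
    """Number of frames after which the up-front bake has paid for itself.
    Returns None if the anchored renderer is not cheaper per frame."""
    saving = diffusion_s_per_frame - anchored_s_per_frame
    if saving <= 0:
        return None
    return math.ceil(bake_seconds / saving)

# ~1 hour bake (from the note); per-frame costs are placeholder assumptions.
frames = break_even_frames(3600, 2.0, 0.02)
minutes_at_30fps = frames / 30 / 60
```

Under these placeholder numbers the bake pays for itself after roughly one minute of 30 fps output per subject. The point is the shape of the trade-off, not the specific figure.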

Contradictions / tensions

  • HeyGen's vendor numbers (68.9–85.7% pairwise preference) and Wan-Animate's author-reported wins on human eval over Act-Two and DreamActor-M1 are both self-reported. The relative quality of these strategies is not independently established. See independent-avatar-benchmarks.

Open questions

Sources

Related

Referenced by