concept, ai-video-generation
Identity preservation in video diffusion
One-line summary: Keeping a subject's identity consistent across long videos is the recurring hard problem in avatar generation, and as of 2025 it is being attacked through several distinct architectural strategies.
The insight
Current video diffusion models drift on identity over time: the face slowly turns into someone else, the clothing pattern changes, the proportions shift. Every credible avatar model has a deliberate strategy for this, and the strategies cluster into a small number of families.
Strategies surveyed
Face-encoder + ID-adapter inside diffusion (stableanimator):
- A global content-aware Face Encoder refines face embeddings via interaction with image embeddings.
- A distribution-aware ID-Adapter aligns embeddings to prevent temporal-layer interference.
- An HJB (Hamilton-Jacobi-Bellman) inference-time face optimization runs in parallel with denoising.
- Claimed by its authors as the first end-to-end ID-preserving framework, with no post-hoc face swap.
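To make "distribution-aware alignment" concrete, here is a minimal moment-matching sketch: it shifts and rescales one feature vector so its mean and standard deviation match another's. This is an illustrative assumption about what aligning two embedding distributions can look like, not StableAnimator's documented implementation; `align_distribution` and its arguments are hypothetical names.

```python
from statistics import fmean, pstdev


def align_distribution(face_feats, image_feats, eps=1e-6):
    """Rescale face_feats so its mean/std match those of image_feats.

    Hypothetical helper: plain moment matching on flat lists of floats.
    """
    mu_f, sd_f = fmean(face_feats), pstdev(face_feats)
    mu_i, sd_i = fmean(image_feats), pstdev(image_feats)
    # Normalise the face features, then re-express them in the
    # image-feature statistics so both branches share one distribution.
    return [(x - mu_f) / (sd_f + eps) * sd_i + mu_i for x in face_feats]


face = [0.0, 2.0, 4.0, 6.0]
image = [10.0, 10.5, 11.0, 11.5]
aligned = align_distribution(face, image)
```

After alignment the face-derived features keep their relative ordering but sit in the same numeric range as the image-derived ones, which is the intuition behind preventing one branch from being disturbed when later layers mix them.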
Full-token sparse reference attention (heygen-avatar-v):
- Skips the embedding bottleneck entirely. Conditions on the full token sequence of the reference video at every transformer layer.
- Structured sparsity makes cost scale near-linearly with reference length.
- Disentangles static identity (dental structure, skin texture) from dynamic identity (talking rhythm, gestural style).
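The note above gives no detail on the sparsity pattern, so the following is only a cost-counting sketch under an assumed one: reference tokens attend within a fixed local window, while generated tokens attend to everything. The pattern, the function names, and the token counts are all hypothetical; the point is just to show why a structured pattern grows near-linearly with reference length while dense attention over the concatenated sequence grows quadratically.

```python
def dense_pairs(num_ref: int, num_target: int) -> int:
    """Attention pairs when every token attends to every token."""
    n = num_ref + num_target
    return n * n


def sparse_pairs(num_ref: int, num_target: int, window: int) -> int:
    """Attention pairs under an assumed structured-sparse pattern."""
    # Each reference token only looks at a fixed-size local window.
    ref_pairs = num_ref * min(window, num_ref)
    # Each generated token still sees the full reference and all targets.
    target_pairs = num_target * (num_ref + num_target)
    return ref_pairs + target_pairs


for num_ref in (1000, 2000, 3000):
    print(num_ref, dense_pairs(num_ref, 100), sparse_pairs(num_ref, 100, 16))
```

Under this assumed pattern, each extra 1000 reference tokens adds the same number of pairs to the sparse count, whereas the dense count grows faster with every step.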
3D-anchored hybrid (persona-avatar-3d):
- Uses a 3D parametric body model (SMPL-X) plus 3D Gaussian Splatting as the identity anchor.
- Diffusion (mimicmotion) is used upstream only as a synthetic-data generator; its output is then baked into the 3D representation.
- Balanced sampling oversamples the input image during training to prevent diffusion drift.
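A minimal sketch of what balanced sampling could look like, assuming the training pool is one real input image (index 0) plus many diffusion-generated frames (indices 1..N). The 50% input probability, the index layout, and `make_balanced_sampler` are illustrative assumptions, not values or names taken from PERSONA.

```python
import random


def make_balanced_sampler(num_synthetic: int, input_prob: float = 0.5, seed: int = 0):
    """Return a sampler that oversamples the real input image (index 0)."""
    rng = random.Random(seed)

    def sample() -> int:
        # With probability input_prob, train on the real input image.
        if rng.random() < input_prob:
            return 0
        # Otherwise pick one of the synthetic frames uniformly.
        return rng.randint(1, num_synthetic)

    return sample


sampler = make_balanced_sampler(num_synthetic=500)
draws = [sampler() for _ in range(10_000)]
input_fraction = draws.count(0) / len(draws)
```

With uniform sampling the single real image would appear in roughly 1 of every 501 draws; here it appears in about half, so drift in the synthetic frames cannot dominate the identity signal.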
Implicit facial features (wan-animate):
- Encodes raw face images directly (not landmark-based) and injects via cross-attention into dedicated Face Blocks.
- Data augmentation (scaling, color jittering, noise) during training disentangles identity from expression.
- Combined with spatially aligned skeleton signals from ViTPose for body identity.
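The three augmentations named above can be sketched on a toy image as follows. This assumes the image is a list of rows of (r, g, b) floats in [0, 1]; the parameter ranges and `augment_face` are hypothetical, not Wan-Animate's actual settings. The idea is that scale, color, and noise perturb appearance cues while leaving the expression intact.

```python
import random


def augment_face(img, rng, scale_range=(0.8, 1.2), jitter=0.2, noise_std=0.02):
    """Apply random zoom, per-channel color gain, and Gaussian noise."""
    h, w = len(img), len(img[0])
    s = rng.uniform(*scale_range)  # zoom factor about the image centre
    gains = [1.0 + rng.uniform(-jitter, jitter) for _ in range(3)]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Nearest-neighbour lookup of the source pixel; output stays h x w.
            sy = min(h - 1, max(0, round((y - h / 2) / s + h / 2)))
            sx = min(w - 1, max(0, round((x - w / 2) / s + w / 2)))
            pixel = img[sy][sx]
            # Color jitter plus additive noise, clamped back to [0, 1].
            row.append(tuple(
                min(1.0, max(0.0, c * g + rng.gauss(0.0, noise_std)))
                for c, g in zip(pixel, gains)
            ))
        out.append(row)
    return out


rng = random.Random(0)
face = [[(0.5, 0.5, 0.5) for _ in range(4)] for _ in range(4)]
augmented = augment_face(face, rng)
```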
Evidence
- All of the above strategies are documented in 2026-05-07-ai-avatar-motion-mimicking-models-survey.
- The field's convergence on "we need a dedicated identity-preservation mechanism" is itself a signal: naive video diffusion doesn't preserve identity well enough.
Design implications
- For long-form output, identity preservation is a first-class concern, not a polish step.
- The full-token attention approach (HeyGen) is the most expensive but reportedly the highest-fidelity, particularly for fine micro-expressions and idiosyncratic talking style.
- The 3D-anchored approach trades up-front compute (~1 hour bake per subject in PERSONA) for cheap per-frame inference, which is appealing for real-time use cases.
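The bake-versus-inference trade-off can be framed as a break-even calculation. In the sketch below only the ~1 hour bake comes from this note; the per-frame costs are made-up placeholders and `break_even_frames` is a hypothetical helper, so treat the output as an illustration of the arithmetic rather than a measurement.

```python
import math


def break_even_frames(bake_seconds, anchored_per_frame, diffusion_per_frame):
    """Frames needed before a one-off bake pays for itself, or None."""
    saving = diffusion_per_frame - anchored_per_frame
    if saving <= 0:
        # The anchored renderer is not cheaper per frame, so it never pays off.
        return None
    return math.ceil(bake_seconds / saving)


# Assumed numbers: 3600 s bake, 0.02 s/frame rendering, 2.0 s/frame diffusion.
frames = break_even_frames(3600, 0.02, 2.0)
print(frames)  # 1819
```

Under these placeholder costs the bake is amortized after about 1819 frames, which is roughly a minute of video at 30 fps; with different per-frame costs the crossover moves accordingly.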
Contradictions / tensions
- HeyGen's vendor numbers (68.9–85.7% pairwise preference) and Wan-Animate's author-reported wins on human eval over Act-Two and DreamActor-M1 are both self-reported. The relative quality of these strategies is not independently established. See independent-avatar-benchmarks.
Open questions
Sources
Related
Referenced by