Wan-Animate
One-line summary: Unified DiT-based character animation and replacement model from Alibaba's HumanAIGC team (Tongyi Lab), released September 2025 under Apache 2.0; the strongest open-source motion-mimicking system according to the authors' own benchmarks.
What it is
A unified diffusion-transformer model that runs in two modes:
- Animation mode — apply a reference video's motion + expression to a static character image, preserving the original background.
- Replacement mode — substitute the character in a reference video with the source identity while inheriting scene lighting and color tone.
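The difference between the two modes comes down to which input supplies which part of the output. The sketch below encodes that mapping as plain data. It is only an illustration of the description above: `Mode`, `SourcePlan`, and `plan_sources` are hypothetical names, not part of the released Wan2.2-Animate code.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ANIMATION = "animation"
    REPLACEMENT = "replacement"


@dataclass(frozen=True)
class SourcePlan:
    """Which input supplies each aspect of the generated video (hypothetical)."""
    identity: str   # who appears in the output
    motion: str     # where body motion and facial expression come from
    scene: str      # where the background comes from
    relight: bool   # whether the character is harmonized to the scene lighting


def plan_sources(mode: Mode) -> SourcePlan:
    if mode is Mode.ANIMATION:
        # The static character image keeps its own background; only motion
        # and expression are taken from the reference video.
        return SourcePlan(identity="character_image", motion="reference_video",
                          scene="character_image", relight=False)
    if mode is Mode.REPLACEMENT:
        # The character is placed into the reference video's scene and
        # inherits its lighting and color tone.
        return SourcePlan(identity="character_image", motion="reference_video",
                          scene="reference_video", relight=True)
    raise ValueError(f"unknown mode: {mode!r}")


animation_plan = plan_sources(Mode.ANIMATION)
replacement_plan = plan_sources(Mode.REPLACEMENT)
```

In both modes the identity comes from the character image and the motion from the reference video; only the scene and the lighting treatment change.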
Built on the Wan-I2V foundation model. arXiv:2509.14055.
Why it matters to ai-video-generation
Wan-Animate is the open-source state of the art in the pose-driven motion-mimicking lineage as of late 2025: it uses a DiT backbone (replacing the SVD/UNet approach of mimicmotion) and ships under Apache 2.0. Per the survey, the Wan-Animate authors also evaluate it head-to-head against closed-source competitors such as Runway Act-Two and DreamActor-M1. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Authors: HumanAIGC Team, Tongyi Lab, Alibaba.
- License: Apache 2.0.
- Backbone: Wan-I2V (Diffusion Transformer), part of the Wan 2.2 family.
- Inputs: character image + reference video (skeleton poses + facial images extracted).
- Released model: Wan2.2-Animate (weights and inference code open-sourced).
Technical contributions
- Spatially-aligned skeleton signals: 2D poses extracted with ViTPose are VAE-compressed to match the latent dimensions, then merged into the noise latents with spatial alignment.
- Implicit facial features: raw face images encoded directly (rather than via manually-defined landmarks) and injected through cross-attention into dedicated Face Blocks. Data augmentations during training disentangle identity from expression.
- Relighting LoRA: auxiliary module for replacement mode that adjusts character lighting/color tone to harmonize with the new environment, trained on IC-Light-synthesized data.
- Modified I2V input paradigm: character image is treated as appearance reference (not as the first frame, the way standard I2V does it). Driving signals dictate content.
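The first three contributions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The tensor shapes, the function names (`inject_pose`, `face_cross_attention`, `lora_linear`), the use of element-wise addition as the merge operation, single-head attention, and the random toy tensors are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def inject_pose(noise_latents, pose_latents):
    """Merge pose latents into noise latents of identical shape.
    Element-wise addition is an assumption; the note only says 'merged'."""
    if noise_latents.shape != pose_latents.shape:
        raise ValueError("pose latents must be spatially aligned with noise latents")
    return noise_latents + pose_latents


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def face_cross_attention(video_tokens, face_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: video tokens query the face tokens,
    and the result is added back as a residual."""
    q = video_tokens @ w_q                           # (N, d)
    k = face_tokens @ w_k                            # (M, d)
    v = face_tokens @ w_v                            # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M), rows sum to 1
    return video_tokens + attn @ v


def lora_linear(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Frozen linear layer plus a trainable low-rank update:
    y = x W + scale * (x A) B."""
    return x @ w_frozen + scale * (x @ lora_a) @ lora_b


# Toy latent video: channels, frames, height, width (illustrative sizes).
C, T, H, W = 4, 2, 8, 8
noise = rng.standard_normal((C, T, H, W))
pose = rng.standard_normal((C, T, H, W))
latents = inject_pose(noise, pose)

# Toy token sequences for one "Face Block".
d = 16
video_tokens = rng.standard_normal((T * H * W, d))
face_tokens = rng.standard_normal((T * 4, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = face_cross_attention(video_tokens, face_tokens, w_q, w_k, w_v)

# LoRA with B initialized to zero, so the adapter starts as a no-op.
rank = 2
w_frozen = rng.standard_normal((d, d))
lora_a = rng.standard_normal((d, rank))
lora_b = np.zeros((rank, d))
relit = lora_linear(out, w_frozen, lora_a, lora_b)
```

Because the pose latents share the noise latents' shape, every skeleton pixel lines up with the latent location it controls, which is the point of spatial alignment. The LoRA form shows why a relighting adapter can be kept auxiliary: with the low-rank term disabled, the base weights behave exactly as before.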
Benchmark claims (author-reported)
- Portrait data: SSIM 0.834 / FVD 94.65, outperforming AnimateAnyone, Champ, and StableAnimator on automated metrics.
- Human eval: pairwise preference study against Runway Act-Two and DreamActor-M1, covering quality, identity consistency, motion accuracy, and expression fidelity.
⚠ These are self-reported. See independent-avatar-benchmarks.
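For readers unfamiliar with the first metric, here is a minimal sketch of SSIM computed globally over a whole grayscale frame. This is a simplification: standard evaluations compute SSIM over local windows and average the results, and nothing here reflects the authors' evaluation code. `global_ssim` is a hypothetical helper name, and the constants use the conventional defaults K1 = 0.01 and K2 = 0.03. FVD is not sketched because it depends on a pretrained video feature network.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM over the whole image (no sliding window); 1.0 means identical."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2   # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2   # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


rng = np.random.default_rng(0)
frame = rng.random((64, 64))                      # toy "ground truth" frame
noisy = np.clip(frame + 0.1 * rng.standard_normal(frame.shape), 0.0, 1.0)

same_score = global_ssim(frame, frame)            # identical inputs
noisy_score = global_ssim(frame, noisy)           # degraded copy scores lower
```

Higher SSIM is better (upper bound 1.0), whereas lower FVD is better, so the two reported numbers point in opposite directions.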
Strengths
- Apache 2.0 with weights — truly open.
- DiT backbone — better temporal coherence than UNet-based predecessors.
- Unifies animation + replacement in one model.
Weaknesses
- Self-reported benchmarks; independent verification gap.
- Compute-heavy (DiT on Wan-I2V foundation).
Sources
- arXiv:2509.14055
- 2026-05-07-ai-avatar-motion-mimicking-models-survey
Related
- mimicmotion — UNet/SVD-era predecessor in spirit.
- animate-anyone-2 — earlier same-team work, UNet-based.
- stableanimator — competitor; identity-preservation focus.
- unianimate — sibling DiT model (UniAnimate-DiT) on the Wan 2.1 backbone.
- runway-act-two — closed-source competitor named in human eval.
- heygen-avatar-v
- video-diffusion-unet-to-dit
- identity-preservation-video-diffusion
- pose-driven-vs-audio-driven-avatars