Wan-Animate

One-line summary: Apache-2.0 DiT-based unified character animation and replacement model from Alibaba's HumanAIGC team (Tongyi Lab), September 2025 — currently the strongest open-source motion-mimicking system per author benchmarks.

What it is

A unified diffusion-transformer model that runs in two modes:

Animation mode — apply a reference video's motion + expression to a static character image, preserving the original background.
Replacement mode — substitute the character in a reference video with the source identity while inheriting scene lighting and color tone.
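
The difference between the two modes can be summarized as a small lookup. This is only an illustrative sketch: `Mode`, `ModePlan`, and `plan_for` are hypothetical names and not part of the released Wan2.2-Animate code. The relighting flag reflects the Relighting LoRA, which these notes describe as specific to replacement mode.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ANIMATION = "animation"
    REPLACEMENT = "replacement"


@dataclass(frozen=True)
class ModePlan:
    motion_from: str         # where motion and expression come from
    identity_from: str       # whose appearance ends up in the output
    background_from: str     # which input supplies the scene
    relight_character: bool  # whether lighting/color tone is harmonized


def plan_for(mode: Mode) -> ModePlan:
    """Return which input supplies what in each mode (hypothetical helper)."""
    if mode is Mode.ANIMATION:
        # The static character image keeps its own background.
        return ModePlan("reference_video", "character_image", "character_image", False)
    if mode is Mode.REPLACEMENT:
        # The character is placed into the reference video's scene.
        return ModePlan("reference_video", "character_image", "reference_video", True)
    raise ValueError(f"unknown mode: {mode}")


print(plan_for(Mode.REPLACEMENT))
```

In both modes the motion comes from the reference video and the identity from the character image; only the scene source and the relighting step change.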

Built on the Wan-I2V foundation model. arXiv:2509.14055.

Why it matters to ai-video-generation

As of late 2025, Wan-Animate is the open-source state of the art in the pose-driven motion-mimicking lineage by its authors' benchmarks: it uses a DiT backbone (replacing the SVD/UNet approach of mimicmotion) and ships under Apache 2.0. Per the survey, the Wan-Animate authors also benchmark it against closed-source systems such as Runway Act-Two and DreamActor-M1. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

Authors: HumanAIGC Team, Tongyi Lab, Alibaba.
License: Apache 2.0.
Backbone: Wan-I2V (Diffusion Transformer), part of the Wan 2.2 family.
Inputs: character image + reference video (skeleton poses + facial images extracted).
Released model: Wan2.2-Animate (weights and inference code open-sourced).

Technical contributions

Spatially-aligned skeleton signals: 2D pose from ViTPose, VAE-compressed to match latent dimensions, merged into noise latents via spatial alignment.
Implicit facial features: raw face images encoded directly (rather than via manually-defined landmarks) and injected through cross-attention into dedicated Face Blocks. Data augmentations during training disentangle identity from expression.
Relighting LoRA: auxiliary module for replacement mode that adjusts character lighting/color tone to harmonize with the new environment, trained on IC-Light-synthesized data.
Modified I2V input paradigm: character image is treated as appearance reference (not as the first frame, the way standard I2V does it). Driving signals dictate content.
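
A toy sketch of the first two conditioning paths, in pure Python. It makes several simplifying assumptions: the spatially aligned merge is modeled as an element-wise sum over equally shaped grids, and the Face Block is modeled as single-head cross-attention with a residual connection and no learned projections. All function names are hypothetical; this is not the released implementation.

```python
import math


def add_pose_to_latents(noise, pose):
    """Element-wise sum of two equally shaped [H][W] grids (assumed merge rule)."""
    if len(noise) != len(pose) or any(len(a) != len(b) for a, b in zip(noise, pose)):
        raise ValueError("pose latents must match the noise latent grid")
    return [[n + p for n, p in zip(nrow, prow)] for nrow, prow in zip(noise, pose)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: video tokens query face tokens."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def face_block(tokens, face_tokens):
    """Residual update: each token plus its attention readout of the face tokens."""
    attended = cross_attention(tokens, face_tokens, face_tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


# Pose path: the pose grid has the same shape as the noise latent grid.
noise = [[0.2, -1.0], [0.7, 0.1]]
pose = [[1.0, 0.0], [0.0, 1.0]]
merged = add_pose_to_latents(noise, pose)

# Face path: two video tokens attend to three face tokens.
tokens = [[1.0, 0.0], [0.0, 1.0]]
face = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
updated = face_block(tokens, face)
print(merged, updated)
```

The split mirrors the notes above: body motion enters where spatial position matters (directly on the latent grid), while expression enters through attention, so the face features do not need to be spatially aligned with the output frame.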

Benchmark claims (author-reported)

Portrait data: SSIM 0.834 / FVD 94.65, outperforming AnimateAnyone, Champ, and StableAnimator on automated metrics.
Human eval: pairwise preference studies against Runway Act-Two and DreamActor-M1 across quality, identity consistency, motion accuracy, and expression fidelity; the authors report that raters preferred Wan-Animate.

⚠ These are self-reported. See independent-avatar-benchmarks.
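
For context on the SSIM figure, here is a minimal sketch of the standard SSIM formula computed over a single global window. It assumes pixel intensities in [0, 1] and the conventional constants k1 = 0.01 and k2 = 0.03. Published scores normally average SSIM over local windows across whole frames, so this toy version will not reproduce the paper's numbers; `ssim_global` is a hypothetical helper name.

```python
def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM for two equally sized flat lists of pixel intensities."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


frame = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - p for p in frame]
print(ssim_global(frame, frame), ssim_global(frame, inverted))
```

Identical inputs score 1 (up to floating-point rounding) and the score drops as structure diverges, so higher SSIM is better. FVD is a distance, so lower is better.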

Strengths

Apache 2.0 with weights — truly open.
DiT backbone — better temporal coherence than UNet-based predecessors.
Unifies animation + replacement in one model.

Weaknesses

Self-reported benchmarks; independent verification gap.
Compute-heavy (DiT on Wan-I2V foundation).

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

mimicmotion — UNet/SVD-era predecessor in spirit.
animate-anyone-2 — earlier same-team work, UNet-based.
stableanimator — competitor; identity-preservation focus.
unianimate — sibling DiT model (UniAnimate-DiT) on the Wan 2.1 backbone.
runway-act-two — closed-source competitor named in human eval.
heygen-avatar-v
video-diffusion-unet-to-dit
identity-preservation-video-diffusion
pose-driven-vs-audio-driven-avatars
