Wan-Animate

One-line summary: Apache-2.0, DiT-based unified character animation and replacement model from Alibaba's HumanAIGC team (Tongyi Lab), released September 2025. By the authors' own benchmarks it is currently the strongest open-source motion-mimicking system.

What it is

A unified diffusion-transformer model that runs in two modes:

  • Animation mode — apply a reference video's motion + expression to a static character image, preserving the original background.
  • Replacement mode — substitute the character in a reference video with the source identity while inheriting scene lighting and color tone.
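
The difference between the two modes comes down to which input supplies each property of the output. The sketch below only encodes the descriptions above as data; `Mode`, `Provenance`, and `provenance` are hypothetical names for illustration and are not part of the Wan2.2-Animate code. Treating relighting as replacement-only is an assumption drawn from these notes.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ANIMATION = "animation"
    REPLACEMENT = "replacement"


@dataclass(frozen=True)
class Provenance:
    """Which input supplies each property of the generated video."""
    identity: str
    motion: str
    background: str
    uses_relighting: bool  # assumption: relighting applies to replacement only


def provenance(mode: Mode) -> Provenance:
    # Hypothetical helper that restates the two mode descriptions as data.
    if mode is Mode.ANIMATION:
        return Provenance(
            identity="character image",
            motion="reference video",
            background="character image",
            uses_relighting=False,
        )
    return Provenance(
        identity="character image",
        motion="reference video",
        background="reference video",
        uses_relighting=True,
    )
```

In both modes identity comes from the character image and motion from the reference video; only the scene (background, lighting) changes source.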

Built on the Wan-I2V foundation model. arXiv:2509.14055.

Why it matters to ai-video-generation

Wan-Animate is the open-source SOTA in the pose-driven motion-mimicking lineage as of late 2025: it is DiT-backed (replacing the SVD/UNet approach of mimicmotion) and ships under Apache 2.0. Per the survey, the Wan-Animate authors also benchmark it against closed-source competitors such as Runway Act-Two and DreamActor-M1. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

  • Authors: HumanAIGC Team, Tongyi Lab, Alibaba.
  • License: Apache 2.0.
  • Backbone: Wan-I2V (Diffusion Transformer), part of the Wan 2.2 family.
  • Inputs: character image + reference video (skeleton poses + facial images extracted).
  • Released model: Wan2.2-Animate (weights and inference code open-sourced).

Technical contributions

  • Spatially-aligned skeleton signals: 2D pose from ViTPose, VAE-compressed to match latent dimensions, merged into noise latents via spatial alignment.
  • Implicit facial features: raw face images encoded directly (rather than via manually-defined landmarks) and injected through cross-attention into dedicated Face Blocks. Data augmentations during training disentangle identity from expression.
  • Relighting LoRA: auxiliary module for replacement mode that adjusts character lighting/color tone to harmonize with the new environment, trained on IC-Light-synthesized data.
  • Modified I2V input paradigm: character image is treated as appearance reference (not as the first frame, the way standard I2V does it). Driving signals dictate content.
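
The spatial-alignment idea in the first bullet can be sketched as follows. This is a minimal illustration, not the released implementation: the element-wise addition, the compression factors (4 temporal, 8 spatial), and the helper names `latent_shape`, `filled`, `shape_of`, and `merge_pose` are all assumptions made for the example. Nested lists stand in for tensors, and the channel dimension is omitted.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Latent grid size for a clip under assumed VAE compression factors."""
    # Assumption: the first frame is kept and the rest are grouped by t_factor.
    t = 1 + (frames - 1) // t_factor
    return (t, height // s_factor, width // s_factor)


def filled(shape, value):
    """Build a (t, h, w) nested list filled with a constant."""
    t, h, w = shape
    return [[[value for _ in range(w)] for _ in range(h)] for _ in range(t)]


def shape_of(x):
    return (len(x), len(x[0]), len(x[0][0]))


def merge_pose(noise_latents, pose_latents):
    """Add pose latents to noise latents position by position.

    Because both grids share one shape, each pose value lands on the
    latent cell that covers the same region of the same frames.
    """
    if shape_of(noise_latents) != shape_of(pose_latents):
        raise ValueError("pose latents must match the noise latent grid")
    return [
        [[n + p for n, p in zip(n_row, p_row)]
         for n_row, p_row in zip(n_frame, p_frame)]
        for n_frame, p_frame in zip(noise_latents, pose_latents)
    ]


shape = latent_shape(frames=81, height=480, width=832)  # (21, 60, 104)
noise = filled(shape, 0.5)   # stand-in for sampled noise
pose = filled(shape, 0.25)   # stand-in for VAE-encoded skeleton frames
merged = merge_pose(noise, pose)
```

The shape check is the point of the sketch: compressing the skeleton video with the same VAE is what lets the pose signal be merged without any resampling.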

Benchmark claims (author-reported)

  • Portrait data: SSIM 0.834 / FVD 94.65, outperforming AnimateAnyone, Champ, and StableAnimator on automated metrics.
  • Human eval: pairwise preference vs. Runway Act-Two and DreamActor-M1 across quality, identity consistency, motion accuracy, and expression fidelity.

⚠ These are self-reported. See independent-avatar-benchmarks.
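
A pairwise human evaluation like the one above is typically summarized as a per-criterion preference rate. The sketch below shows only that arithmetic; the vote records are made-up placeholders, not data from the paper, and `preference_rate` is a hypothetical helper.

```python
from collections import Counter


def preference_rate(votes, model):
    """Fraction of pairwise votes won by `model`, per criterion.

    `votes` is an iterable of (criterion, winner) pairs, one per comparison.
    """
    wins, totals = Counter(), Counter()
    for criterion, winner in votes:
        totals[criterion] += 1
        if winner == model:
            wins[criterion] += 1
    return {criterion: wins[criterion] / totals[criterion] for criterion in totals}


# Placeholder votes for illustration only; NOT results from the paper.
votes = [
    ("quality", "A"), ("quality", "B"), ("quality", "A"), ("quality", "A"),
    ("motion accuracy", "B"), ("motion accuracy", "A"),
]
rates = preference_rate(votes, "A")
```

An independent benchmark would need the raw votes (or at least counts per criterion) to reproduce rates like these, which is part of why self-reported numbers are hard to verify.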

Strengths

  • Apache 2.0 with weights — truly open.
  • DiT backbone — better temporal coherence than UNet-based predecessors.
  • Unifies animation + replacement in one model.

Weaknesses

  • Self-reported benchmarks; independent verification gap.
  • Compute-heavy (DiT on Wan-I2V foundation).
