entity · generic · ai-video-generation

UniAnimate

One-line summary: A unified video diffusion model from Alibaba's ali-vilab for consistent human image animation, with a 2025 DiT-based variant (UniAnimate-DiT) built on the Wan 2.1 backbone.

What it is

Two-generation lineage:

  • UniAnimate (SCIS 2025): a unified-noise-input video diffusion architecture for pose-driven human animation. It replaces the compute-heavy temporal Transformer with a state-space-model temporal architecture.
  • UniAnimate-DiT (April 2025): the successor, built on the Wan 2.1-14B I2V Diffusion Transformer foundation and fine-tuned via LoRA.
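
To make the state-space idea concrete, here is a minimal sketch of a linear state-space scan over a sequence of per-frame features. It is a toy scalar recurrence with fixed, hypothetical parameters `a`, `b`, and `c`, not UniAnimate's actual temporal module. The point is only that the cost grows linearly with the number of frames, whereas temporal self-attention compares every frame with every other frame.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    a, b, c are illustrative constants, not values from the paper.
    """
    h = 0.0
    ys = []
    for x in xs:  # a single pass over the frames: O(T) work
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at frame 0 decays geometrically through the later frames.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

Each output depends on all earlier frames through the running state `h`, which is how a recurrence of this kind carries temporal context without an attention matrix.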

Why it matters to ai-video-generation

UniAnimate's first-frame-conditioned iterative strategy lets it generate consistent one-minute videos; long generation has historically been a weak spot for this class of model. UniAnimate-DiT is one of the cleanest examples of the broader UNet→DiT migration in this space (see video-diffusion-unet-to-dit). Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

  • Authors: Alibaba ali-vilab.
  • Repos: github.com/ali-vilab/UniAnimate and github.com/ali-vilab/UniAnimate-DiT.
  • UniAnimate-DiT uses LoRA for parameter-efficient fine-tuning and a lightweight 3D-conv pose encoder.
  • Venue (V1, the original UniAnimate): SCIS (Science China Information Sciences) 2025.
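
As a rough illustration of what LoRA-style parameter-efficient fine-tuning means, the sketch below applies a frozen weight `W` plus a trainable low-rank update `A @ B`. The shapes, the rank, and the `alpha` scaling are illustrative assumptions; this note does not record the LoRA configuration UniAnimate-DiT actually uses, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x: Matrix, w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Compute x @ W + (alpha / r) * (x @ A) @ B.

    W (d_in x d_out) is the frozen base weight; A (d_in x r) and
    B (r x d_out) are the only trainable matrices.
    """
    rank = len(b)
    scale = alpha / rank
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(br, dr)] for br, dr in zip(base, delta)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Number of trainable parameters added by one LoRA adapter."""
    return rank * (d_in + d_out)


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight (identity here)
a = [[1.0], [1.0]]                    # d_in x r, with r = 1
b_zero = [[0.0, 0.0]]                 # zero-initialised B: adapter is a no-op
b_trained = [[0.5, -0.5]]             # pretend values after some training

y_before = lora_forward(x, w, a, b_zero, alpha=1.0)
y_after = lora_forward(x, w, a, b_trained, alpha=1.0)
```

With `B` at zero the adapted layer reproduces the base model exactly, and for a hypothetical 1024×1024 layer at rank 16 the adapter adds 32,768 trainable parameters against 1,048,576 frozen ones.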

Technical contributions

  • Unified noise input that supports both randomly noised input and first-frame-conditioned input. This enables long-video generation by conditioning each new chunk on the last frame of the previous chunk.
  • State-space-model temporal architecture as a Transformer alternative for the temporal dimension.
  • Lightweight 3D-conv pose encoder (DiT variant) that encodes motion information cheaply.
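
The chunk-by-chunk long-video strategy in the first bullet can be sketched as a simple loop. Everything here is a stand-in: `generate_chunk` is a hypothetical placeholder for one sampling pass of the model, frames are single-element lists rather than images, and the chunk length is arbitrary. The sketch only shows the control flow, in which the first chunk starts from noise and each later chunk has its first frame pinned to the last frame already generated.

```python
import random
from typing import List, Optional

Frame = List[float]  # toy stand-in for an image or latent frame


def generate_chunk(first_frame: Optional[Frame], poses: List[float],
                   rng: random.Random) -> List[Frame]:
    """Hypothetical stand-in for one sampling pass of the video model.

    With first_frame=None every frame starts from noise; otherwise the
    first frame of the chunk is pinned to first_frame.
    """
    frames: List[Frame] = []
    for i, pose in enumerate(poses):
        if i == 0 and first_frame is not None:
            frames.append(list(first_frame))
        else:
            frames.append([pose + rng.gauss(0.0, 0.01)])
    return frames


def generate_long_video(poses: List[float], chunk_len: int, seed: int = 0) -> List[Frame]:
    """Generate one frame per pose by chaining overlapping chunks."""
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2")
    if not poses:
        return []
    rng = random.Random(seed)
    # First chunk: no conditioning frame, so it starts from noise.
    video = generate_chunk(None, poses[:chunk_len], rng)
    start = chunk_len - 1
    while start < len(poses) - 1:
        # Later chunks overlap the previous one by a single frame.
        chunk = generate_chunk(video[-1], poses[start:start + chunk_len], rng)
        video.extend(chunk[1:])  # drop the pinned frame; it is already in the video
        start += chunk_len - 1
    return video


video = generate_long_video([float(t) for t in range(10)], chunk_len=4)
```

Because consecutive chunks share exactly one frame, each continuation chunk contributes `chunk_len - 1` new frames, and the output has one frame per pose.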

Strengths

  • One-minute consistent generation through first-frame conditioning.
  • The DiT variant aligns with where the field is heading (Wan family backbone).

Weaknesses

  • Because only LoRA adapters are fine-tuned, output quality is largely bounded by the underlying Wan 2.1-14B I2V foundation model.
