UniAnimate
One-line summary: A unified video diffusion model from Alibaba's ali-vilab for consistent human image animation, with a 2025 DiT-based variant (UniAnimate-DiT) built on the Wan 2.1 backbone.
What it is
Two-generation lineage:
- UniAnimate (SCIS 2025): a video diffusion architecture with a unified noise input for pose-driven human animation. It replaces the compute-heavy temporal Transformer with a state-space-model (SSM) temporal module.
- UniAnimate-DiT (April 2025): the successor, built on the Wan 2.1-14B I2V Diffusion Transformer foundation model and fine-tuned via LoRA.
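To make the compute argument concrete, here is a minimal sketch of why a state-space recurrence over the frame axis is cheaper than temporal self-attention. It uses a scalar linear recurrence with hypothetical coefficients `a`, `b`, `c`. It is an illustration of the general SSM idea under those assumptions, not UniAnimate's actual temporal module.

```python
from typing import Dict, List


def ssm_scan(inputs: List[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Scalar linear state-space recurrence over a sequence of per-frame features.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The coefficients are illustrative assumptions; a real SSM layer learns them
    and works on vectors rather than scalars.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # one constant-cost state update per frame
        outputs.append(c * h)  # readout for this frame
    return outputs


def temporal_cost(num_frames: int) -> Dict[str, int]:
    """Rough operation counts along the temporal axis only.

    Self-attention compares every frame with every frame (quadratic),
    while the scan above visits each frame once (linear).
    """
    return {
        "attention_pairs": num_frames * num_frames,
        "scan_steps": num_frames,
    }
```

With 32 frames, `temporal_cost(32)` reports 1024 frame pairs for attention against 32 scan steps, and the gap widens as clips get longer.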
Why it matters to ai-video-generation
UniAnimate's first-frame-conditioned iterative strategy lets it generate consistent one-minute videos; long generation has historically been a weak spot for this class of model. UniAnimate-DiT is one of the cleanest examples of the broader UNet→DiT migration in this space (see video-diffusion-unet-to-dit). Source note: 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Authors: Alibaba ali-vilab.
- Repos: github.com/ali-vilab/UniAnimate and github.com/ali-vilab/UniAnimate-DiT.
- UniAnimate-DiT uses LoRA for parameter-efficient fine-tuning and a lightweight 3D-conv pose encoder.
- Venue (original UniAnimate): SCIS (Science China Information Sciences) 2025.
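As a rough illustration of why LoRA counts as parameter-efficient, the sketch below applies a low-rank update to a frozen weight matrix and compares parameter counts. The matrix sizes, rank, and `alpha` are illustrative assumptions, not values taken from UniAnimate-DiT, and the helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float, rank: int) -> Vector:
    """Compute y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and B (d_out x rank)
    would be trained.
    """
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (parameters in the full matrix, trainable parameters in the LoRA pair)."""
    return d_out * d_in, rank * (d_in + d_out)
```

For a hypothetical 4096 x 4096 projection at rank 16, `lora_param_counts(4096, 4096, 16)` gives 16,777,216 frozen weights against 131,072 trainable ones, under 1% of the full matrix.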
Technical contributions
- Unified noise input that supports both a random-noise input and a first-frame-conditioned input. This enables long-video generation by iteratively conditioning each new chunk on the last frame of the previous chunk.
- State-space-model temporal architecture as a Transformer alternative for the temporal dimension.
- Lightweight 3D-conv pose encoder (DiT variant) that encodes motion information cheaply.
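The iterative long-video strategy in the first bullet can be sketched as a simple chunked rollout. Everything here is an assumption made for illustration: `generate_chunk` stands in for one diffusion sampling pass, `stub_generate_chunk` is a hypothetical placeholder that returns labelled strings instead of frames, and whether the conditioning frame is re-emitted at the start of each chunk is not specified by this note.

```python
from typing import Callable, List, Optional

Frame = str  # placeholder type; a real pipeline would use image tensors

# Records which conditioning frame each stub call received (for inspection only).
conditioning_log: List[Optional[Frame]] = []


def stub_generate_chunk(reference: str, pose_chunk: List[int], first_frame: Optional[Frame]) -> List[Frame]:
    """Hypothetical stand-in for one sampling pass of the model.

    first_frame is None for the random-noise mode, otherwise the frame
    the chunk is conditioned on.
    """
    conditioning_log.append(first_frame)
    return [f"frame@{pose}" for pose in pose_chunk]


def generate_long_video(
    reference: str,
    poses: List[int],
    chunk_len: int,
    generate_chunk: Callable[[str, List[int], Optional[Frame]], List[Frame]],
) -> List[Frame]:
    """Generate a long video chunk by chunk.

    The first chunk starts from pure noise; every later chunk is conditioned
    on the last frame of the chunk before it.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    video: List[Frame] = []
    first_frame: Optional[Frame] = None
    for start in range(0, len(poses), chunk_len):
        pose_chunk = poses[start:start + chunk_len]
        chunk = generate_chunk(reference, pose_chunk, first_frame)
        video.extend(chunk)
        first_frame = chunk[-1]  # carry the last frame into the next chunk
    return video
```

Calling `generate_long_video("ref", list(range(5)), 2, stub_generate_chunk)` produces three chunks. The design point is that the model only ever samples `chunk_len` frames at a time, so memory stays fixed while the total length grows.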
Strengths
- One-minute consistent generation through first-frame conditioning.
- The DiT variant aligns with where the field is heading (Wan family backbone).
Weaknesses
- LoRA-based fine-tuning means quality is bounded by the underlying Wan 2.1-14B I2V foundation.
Sources
Related
- wan-animate: sibling model on the Wan 2.2 backbone, with a dedicated avatar focus.
- mimicmotion
- stableanimator
- video-diffusion-unet-to-dit
Referenced by