DisPose
One-line summary: 2024–2025 plug-and-play module that disentangles sparse skeleton pose into motion-field guidance + keypoint correspondence, improving existing pose-driven animation models without fine-tuning their weights.
What it is
A hybrid ControlNet that bolts onto frozen base animation models (MusePose, MimicMotion) and improves their generation quality and consistency by replacing monolithic pose conditioning with two complementary signals. arXiv:2412.09349.
Why it matters to ai-video-generation
DisPose is the canonical reference for the dense vs sparse pose conditioning trade-off. Dense conditions (DensePose, depth maps, SMPL) impose strict geometric constraints that fail when body shapes differ between driving video and reference image. Sparse skeletons avoid the shape mismatch but under-constrain the motion and carry no appearance information. DisPose's disentanglement offers a third option: keep the sparse skeleton as input and derive richer control signals from it. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Inputs: reference image (static) + driving video (for pose extraction).
- Pose extractor: DWpose.
- Plug-and-play: freezes base-model weights, adds control residuals.
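A minimal sketch of the plug-and-play pattern in the last bullet, not DisPose's actual code. The names `FrozenBlock`, `ControlBranch` and `zero_proj` are hypothetical, the "blocks" are single matrix multiplies, and the zero-initialized output projection is borrowed from the original ControlNet design as an assumption. The point is only that the base weights stay untouched while a separate trainable branch contributes an additive residual.

```python
import numpy as np


class FrozenBlock:
    """Stand-in for one block of the base animation model (weights read-only)."""

    def __init__(self, weight):
        self.weight = weight.copy()
        self.weight.setflags(write=False)  # emulate frozen base-model weights

    def __call__(self, x, residual=None):
        h = x @ self.weight
        # The control signal enters only as an additive residual.
        return h if residual is None else h + residual


class ControlBranch:
    """Stand-in for the trainable control branch (hypothetical structure)."""

    def __init__(self, cond_dim, out_dim, rng):
        self.encoder = rng.standard_normal((cond_dim, out_dim)) * 0.1
        # Assumption: zero-initialized projection, so the branch starts as a no-op.
        self.zero_proj = np.zeros((out_dim, out_dim))

    def __call__(self, cond):
        return np.tanh(cond @ self.encoder) @ self.zero_proj


rng = np.random.default_rng(0)
base = FrozenBlock(rng.standard_normal((4, 3)))
control = ControlBranch(cond_dim=5, out_dim=3, rng=rng)

x = rng.standard_normal((2, 4))     # stand-in for latent features
cond = rng.standard_normal((2, 5))  # stand-in for pose-derived conditions
out = base(x, residual=control(cond))
```

Before any training, `out` equals `base(x)` because the projection is all zeros; only the control branch's parameters would ever be updated.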
Technical contributions
- Motion Field Guidance: generates both sparse and dense motion fields from skeleton keypoints. Dense field comes from conditional motion propagation on the reference image — region-level guidance that avoids strict shape constraints.
- Keypoint Correspondence: extracts DIFT diffusion features at skeleton keypoints from the reference image and transfers these point embeddings to target poses via multi-scale correspondence.
- Hybrid ControlNet: motion-field guidance added to noise-latent input; sparse-point features injected into convolutional layers; control residuals fed into the U-Net middle and up-sampling blocks.
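The first two contributions can be illustrated with a toy NumPy sketch. It is not the paper's implementation: the function names are hypothetical, a plain array stands in for DIFT features, the dense field from conditional motion propagation is omitted, and only a single scale is handled. It shows (a) a sparse motion field that stores each keypoint's displacement at its reference location and (b) features sampled at reference keypoints being written at the corresponding target-pose locations.

```python
import numpy as np


def sparse_motion_field(ref_kpts, tgt_kpts, height, width):
    """Store the displacement (target - reference) at each reference keypoint.

    Keypoints are (x, y) pixel coordinates; all other pixels stay zero.
    """
    field = np.zeros((height, width, 2), dtype=np.float32)
    for (x0, y0), (x1, y1) in zip(ref_kpts, tgt_kpts):
        col, row = int(round(x0)), int(round(y0))
        if 0 <= row < height and 0 <= col < width:
            field[row, col] = (x1 - x0, y1 - y0)
    return field


def bilinear_sample(feat, x, y):
    """Bilinearly sample an (H, W, C) feature map at a continuous (x, y)."""
    h, w, _ = feat.shape
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


def transfer_keypoint_features(ref_feat, ref_kpts, tgt_kpts):
    """Sample features at reference keypoints and place them at target keypoints.

    Assumes the feature map and the pose share the same resolution.
    """
    h, w, c = ref_feat.shape
    out = np.zeros((h, w, c), dtype=np.float32)
    for (xr, yr), (xt, yt) in zip(ref_kpts, tgt_kpts):
        col, row = int(round(xt)), int(round(yt))
        if 0 <= row < h and 0 <= col < w:
            out[row, col] = bilinear_sample(ref_feat, xr, yr)
    return out
```

In this sketch, the motion field would play the role of the signal added to the noise-latent input, and the sparse feature map the role of the point embeddings injected into the ControlNet's convolutional layers; the actual fusion is learned and is not modeled here.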
Performance claims
Improves VBench metrics and lowers both FID-FVD and FVD on TikTok and on unseen datasets, demonstrated with MusePose and MimicMotion as the baselines.
Strengths
- Doesn't require retraining the underlying animation model.
- Shape-agnostic — works across different body shapes.
Weaknesses
- Adds inference overhead (extra ControlNet branch).
- Tied to the assumption that DIFT features generalize from reference to target poses — may break for extreme pose deltas.
Sources
Related
- mimicmotion — base model improved by DisPose.
- pose-driven-vs-audio-driven-avatars