DisPose

One-line summary: 2024–2025 plug-and-play module that disentangles sparse skeleton pose into motion-field guidance + keypoint correspondence, improving existing pose-driven animation models without fine-tuning their weights.

What it is

A hybrid ControlNet that bolts onto frozen base animation models (MusePose, MimicMotion) and improves their generation quality and consistency by replacing monolithic pose conditioning with two complementary signals. arXiv:2412.09349.

Why it matters to ai-video-generation

DisPose is the canonical reference for the dense vs sparse pose conditioning trade-off. Dense conditions (DensePose, depth maps, SMPL) impose strict geometric constraints that fail when body shapes differ between driving video and reference image. Sparse skeletons avoid that failure but leave the motion under-constrained. DisPose's disentanglement gives a third option: keep the sparse skeleton as the only input, and derive region-level motion guidance and keypoint correspondence from it. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

  • Inputs: reference image (static) + driving video (for pose extraction).
  • Pose extractor: DWpose.
  • Plug-and-play: freezes base-model weights, adds control residuals.
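
The plug-and-play point can be illustrated with a toy sketch. Everything here is hypothetical and not taken from the DisPose code: `make_frozen_block` stands in for a pretrained block of the base model, and the residual lists stand in for whatever the control branch would output. The only idea shown is that the base parameters are never modified and control enters as additive residuals.

```python
from typing import Callable, Optional, Sequence

Vector = list[float]
Block = Callable[[Vector], Vector]


def make_frozen_block(scale: float, shift: float) -> Block:
    """Toy stand-in for one pretrained block; its parameters are closed over and never updated."""
    def block(x: Vector) -> Vector:
        return [scale * v + shift for v in x]
    return block


def run_with_residuals(
    blocks: Sequence[Block],
    x: Vector,
    residuals: Optional[Sequence[Vector]] = None,
) -> Vector:
    """Run the frozen blocks in order, adding one control residual after each block."""
    for i, block in enumerate(blocks):
        x = block(x)
        if residuals is not None:
            x = [a + r for a, r in zip(x, residuals[i])]
    return x


# Hypothetical two-block "base model" and a two-element feature vector.
base = [make_frozen_block(2.0, 0.0), make_frozen_block(1.0, 1.0)]
features = [1.0, -1.0]

plain = run_with_residuals(base, features)
zero = run_with_residuals(base, features, [[0.0, 0.0], [0.0, 0.0]])
steered = run_with_residuals(base, features, [[0.5, 0.5], [0.0, 0.25]])
```

With all-zero residuals the output matches the unmodified base model, which is why a branch of this kind can be attached without retraining the base: only the residuals change the result.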

Technical contributions

  • Motion Field Guidance: generates both sparse and dense motion fields from skeleton keypoints. Dense field comes from conditional motion propagation on the reference image — region-level guidance that avoids strict shape constraints.
  • Keypoint Correspondence: extracts DIFT diffusion features at skeleton keypoints from the reference image and transfers these point embeddings to target poses via multi-scale correspondence.
  • Hybrid ControlNet: motion-field guidance added to noise-latent input; sparse-point features injected into convolutional layers; control residuals fed into the U-Net middle and up-sampling blocks.
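
A minimal sketch of the two signals, under loud assumptions: keypoints are integer pixel coordinates, the sparse motion field simply stores each keypoint's displacement at its reference location, and an arbitrary array stands in for DIFT features. The function names are hypothetical. The learned parts (conditional motion propagation for the dense field, multi-scale correspondence) are not reproduced here.

```python
import numpy as np


def sparse_motion_field(ref_kpts, tgt_kpts, height, width):
    """Write each keypoint's (dx, dy) displacement at its reference location; zeros elsewhere."""
    field = np.zeros((height, width, 2), dtype=np.float32)
    for (x0, y0), (x1, y1) in zip(ref_kpts, tgt_kpts):
        if 0 <= x0 < width and 0 <= y0 < height:
            field[y0, x0] = (x1 - x0, y1 - y0)
    return field


def transfer_point_features(feature_map, ref_kpts, tgt_kpts):
    """Copy the feature vector under each reference keypoint to the matching target keypoint."""
    height, width, _ = feature_map.shape
    out = np.zeros_like(feature_map)
    for (x0, y0), (x1, y1) in zip(ref_kpts, tgt_kpts):
        inside_ref = 0 <= x0 < width and 0 <= y0 < height
        inside_tgt = 0 <= x1 < width and 0 <= y1 < height
        if inside_ref and inside_tgt:
            out[y1, x1] = feature_map[y0, x0]
    return out


# Two hypothetical keypoints given as (x, y), in the reference pose and in one target pose.
ref_kpts = [(1, 1), (2, 3)]
tgt_kpts = [(3, 1), (2, 0)]

field = sparse_motion_field(ref_kpts, tgt_kpts, height=4, width=4)

# Placeholder for a reference-image feature map (H, W, C); real DIFT features would go here.
features = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
transferred = transfer_point_features(features, ref_kpts, tgt_kpts)
```

The motion field says where each joint moves, while the transferred map says what the reference image looks like at that joint. That split is the disentanglement the note describes: motion and appearance correspondence travel as separate conditioning signals.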

Performance claims

The paper reports improved VBench metrics and lower FID-FVD and FVD on TikTok and unseen datasets, with MusePose and MimicMotion as the baselines.

Strengths

  • Doesn't require retraining the underlying animation model.
  • Tolerates body-shape differences between reference image and driving video, because its guidance is region-level rather than a strict geometric fit.

Weaknesses

  • Adds inference overhead (extra ControlNet branch).
  • Tied to the assumption that DIFT features generalize from reference to target poses — may break for extreme pose deltas.
