DisPose

One-line summary: 2024–2025 plug-and-play module that disentangles sparse skeleton pose into motion-field guidance + keypoint correspondence, improving existing pose-driven animation models without fine-tuning their weights.

What it is

A hybrid ControlNet that bolts onto frozen base animation models (MusePose, MimicMotion) and improves their generation quality and consistency by replacing monolithic pose conditioning with two complementary signals. arXiv:2412.09349.

Why it matters to ai-video-generation

DisPose is the canonical reference for the trade-off between dense and sparse pose conditioning. Dense conditions (DensePose, depth maps, SMPL) impose strict geometric constraints that fail when body shapes differ between the driving video and the reference image. Sparse skeletons avoid that shape dependence but provide too little guidance on their own. DisPose's disentanglement gives a third option: keep the sparse skeleton as the only pose input and derive richer control signals from it. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

Inputs: reference image (static) + driving video (for pose extraction).
Pose extractor: DWPose.
Plug-and-play: freezes base-model weights, adds control residuals.
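
The plug-and-play idea can be illustrated with a toy sketch: the base block's weights stay fixed, and a separate control branch contributes a residual that is added to the base output. All names here (`FrozenBlock`, `ControlBranch`, `forward_with_control`) are hypothetical, the scalar weights stand in for real network layers, and the zero-initialized control scale follows the general ControlNet convention rather than anything confirmed for DisPose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenBlock:
    """Stand-in for a base-model block; frozen=True forbids weight updates."""
    weight: float

    def __call__(self, x):
        return [self.weight * v for v in x]


@dataclass
class ControlBranch:
    """Stand-in for the trainable control branch (hypothetical)."""
    scale: float = 0.0  # zero init: the branch starts as a no-op

    def __call__(self, cond):
        return [self.scale * c for c in cond]


def forward_with_control(base, control, x, cond):
    """Base output plus the control residual, element by element."""
    hidden = base(x)
    residual = control(cond)
    return [h + r for h, r in zip(hidden, residual)]


base = FrozenBlock(weight=2.0)
control = ControlBranch()
x, cond = [1.0, 2.0], [10.0, 20.0]

# Before any training the residual is zero, so the output equals base(x).
untrained = forward_with_control(base, control, x, cond)

# Pretend training updated only the control branch.
control.scale = 0.5
trained = forward_with_control(base, control, x, cond)
```

The design point is that only `control` ever changes; the base model's behavior is recovered exactly whenever the residual is zero.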

Technical contributions

Motion Field Guidance: generates both sparse and dense motion fields from skeleton keypoints. Dense field comes from conditional motion propagation on the reference image — region-level guidance that avoids strict shape constraints.
Keypoint Correspondence: extracts DIFT diffusion features at skeleton keypoints from the reference image and transfers these point embeddings to target poses via multi-scale correspondence.
Hybrid ControlNet: motion-field guidance added to noise-latent input; sparse-point features injected into convolutional layers; control residuals fed into the U-Net middle and up-sampling blocks.
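
The sparse-versus-dense motion field distinction above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DisPose implementation: keypoints are integer pixel coordinates `(x, y)`, the motion at a keypoint is taken as target minus reference position, and the dense field is filled by plain inverse-distance weighting, which is swapped in here for the learned conditional motion propagation and ignores the reference image entirely. The function names are hypothetical.

```python
def sparse_motion_field(ref_kpts, tgt_kpts, height, width):
    """Displacement (dx, dy) stored only at reference keypoint pixels."""
    if len(ref_kpts) != len(tgt_kpts):
        raise ValueError("keypoint lists must have the same length")
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    for (rx, ry), (tx, ty) in zip(ref_kpts, tgt_kpts):
        if 0 <= rx < width and 0 <= ry < height:
            field[ry][rx] = (float(tx - rx), float(ty - ry))
            mask[ry][rx] = True
    return field, mask


def dense_motion_field_idw(ref_kpts, tgt_kpts, height, width, power=2.0):
    """Fill every pixel by inverse-distance weighting of keypoint motion.

    Assumption: a hand-written stand-in for learned motion propagation.
    """
    if not ref_kpts or len(ref_kpts) != len(tgt_kpts):
        raise ValueError("need matching, non-empty keypoint lists")
    dense = []
    for y in range(height):
        row = []
        for x in range(width):
            wsum = dx = dy = 0.0
            exact = None
            for (rx, ry), (tx, ty) in zip(ref_kpts, tgt_kpts):
                d2 = (x - rx) ** 2 + (y - ry) ** 2
                if d2 == 0:
                    # Pixel sits on a keypoint: use its motion directly.
                    exact = (float(tx - rx), float(ty - ry))
                    break
                w = 1.0 / d2 ** (power / 2.0)
                wsum += w
                dx += w * (tx - rx)
                dy += w * (ty - ry)
            row.append(exact if exact is not None else (dx / wsum, dy / wsum))
        dense.append(row)
    return dense


ref = [(1, 1), (3, 3)]
tgt = [(2, 1), (3, 4)]
sparse, mask = sparse_motion_field(ref, tgt, height=5, width=5)
dense = dense_motion_field_idw(ref, tgt, height=5, width=5)
```

In this toy case the sparse field is non-zero at only two pixels, while the dense field assigns every pixel a blend of the two keypoint motions, which is the kind of region-level guidance the dense branch is meant to provide.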

Performance claims

Reported to improve VBench metrics and reduce FID-FVD and FVD on the TikTok dataset and on unseen datasets, with MusePose and MimicMotion as baselines.

Strengths

Doesn't require retraining the underlying animation model.
Less sensitive to body-shape mismatch than dense conditioning, because the guidance is region-level rather than a strict geometric constraint.

Weaknesses

Adds inference overhead (extra ControlNet branch).
Tied to the assumption that DIFT features generalize from reference to target poses — may break for extreme pose deltas.
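
To make that assumption concrete, here is a minimal sketch of what transferring point features from reference keypoints to target keypoints could look like. Everything in it is an assumption for illustration: the feature map is a toy nested list rather than real DIFT features, keypoints are integer pixel coordinates `(x, y)`, "multi-scale" is approximated by integer-dividing coordinates, and the function names are hypothetical.

```python
def sample_point_features(feature_map, keypoints):
    """Read one feature vector per keypoint; feature_map[y][x] is a vector."""
    return [list(feature_map[y][x]) for x, y in keypoints]


def scatter_to_target(point_feats, tgt_kpts, height, width, channels, scale=1):
    """Place each reference feature at its target keypoint on a coarser grid."""
    h, w = height // scale, width // scale
    out = [[[0.0] * channels for _ in range(w)] for _ in range(h)]
    for feat, (x, y) in zip(point_feats, tgt_kpts):
        sx, sy = x // scale, y // scale
        if 0 <= sx < w and 0 <= sy < h:
            out[sy][sx] = list(feat)
    return out


def multi_scale_point_maps(feature_map, ref_kpts, tgt_kpts, scales=(1, 2, 4)):
    """Sparse feature maps at several resolutions, keyed by scale factor."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    feats = sample_point_features(feature_map, ref_kpts)
    return {
        s: scatter_to_target(feats, tgt_kpts, height, width, channels, scale=s)
        for s in scales
    }


# Toy 4x4 map whose feature at pixel (x, y) is simply [x, y].
feature_map = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
ref_kpts = [(1, 2)]  # where the keypoint sits in the reference image
tgt_kpts = [(3, 0)]  # where the same keypoint sits in the target pose

feats = sample_point_features(feature_map, ref_kpts)
maps = multi_scale_point_maps(feature_map, ref_kpts, tgt_kpts)
```

In this sketch the feature vector is copied unchanged from the reference location to the target location, so nothing in the transfer accounts for how that body region looks after a large pose change.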

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

mimicmotion: base model improved by DisPose.
pose-driven-vs-audio-driven-avatars
