MimicMotion

One-line summary: Pose-driven, image-to-video human animation model from Tencent built on Stable Video Diffusion, accepted at ICML 2025.

What it is

A model that takes a reference image and a driving pose sequence and generates a video. It extracts 2D skeleton keypoints with DWPose from a driving video, then animates the identity in the reference image through that motion sequence.

Why it matters to ai-video-generation

One of the foundational open-source models for "motion mimicking", that is, driving an avatar's motion from a reference video. It was trained on 4,436 dancing videos averaging about 20 seconds each. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

Backbone: Stable Video Diffusion (SVD), an image-to-video diffusion model.
Driving signal: 2D pose keypoints from DWPose, frame-by-frame.
License: open source via Tencent/MimicMotion on GitHub.
Venue: ICML 2025.

Technical contributions

Confidence-aware pose guidance: encodes pose-detector confidence as brightness in guidance frames so inaccurate keypoints contribute proportionally less, rather than being hard-thresholded.
Regional loss amplification on hands: hand regions whose keypoints are detected with high confidence receive amplified loss weights during training, targeting hand distortion (a notorious failure mode of diffusion video models).
Progressive latent fusion: overlapped video segments blend with position-aware weighting rather than simple averaging, enabling arbitrary-length output without boundary flicker.
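
The three ideas above can be illustrated with a toy sketch in plain Python. It is not MimicMotion's actual code: the `Keypoint` class, the function names, the 0.8 threshold, the 10x amplification factor, and the linear cross-fade ramp are all hypothetical choices made for illustration, and single floats stand in for latent frames. The sketch also blends segments once, whereas the list above describes fusion of latents, which in a diffusion model would happen during denoising.

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    x: float
    y: float
    confidence: float  # pose-detector score, assumed to lie in [0, 1]


def guidance_brightness(kp: Keypoint, base: int = 255) -> int:
    """Confidence-aware guidance: scale the drawn brightness by the score
    so uncertain keypoints are dimmed rather than hard-thresholded away."""
    c = min(max(kp.confidence, 0.0), 1.0)  # clamp out-of-range scores
    return round(base * c)


def hand_loss_weight(hand_kps, threshold=0.8, amplification=10.0):
    """Regional loss amplification: return a larger loss weight for a hand
    region only when all of its keypoints are confidently detected.
    The all-keypoints rule, threshold, and factor are assumptions."""
    reliable = bool(hand_kps) and all(k.confidence > threshold for k in hand_kps)
    return amplification if reliable else 1.0


def fuse_segments(segments, overlap):
    """Position-aware blending of consecutive segments that share `overlap`
    frames. Within the overlap, the weight of the newer segment ramps up
    linearly with frame position instead of using a flat 0.5 average."""
    fused = list(segments[0])
    for seg in segments[1:]:
        for j in range(overlap):
            w_new = (j + 1) / (overlap + 1)  # grows toward the newer segment
            idx = len(fused) - overlap + j   # matching frame already in `fused`
            fused[idx] = (1.0 - w_new) * fused[idx] + w_new * seg[j]
        fused.extend(seg[overlap:])          # append the non-overlapping tail
    return fused
```

For example, fusing `[0, 0, 0, 0]` with `[1, 1, 1, 1]` at an overlap of 2 yields six frames whose middle two ramp from about 0.33 to about 0.67, where simple averaging would give a flat 0.5 and a visible step at each boundary.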

Strengths

The confidence-aware guidance makes it more robust to noisy pose detection.
Open source in Tencent's GitHub repository, so results can be reproduced.
Standard DWPose extraction means easy integration with existing pipelines.

Weaknesses

Hand fidelity is still called out as a weak spot industry-wide; regional loss amplification mitigates the problem but doesn't fully solve it.
Trained primarily on dancing videos, so it may generalize less well to other motion types.

Open questions

See hand-fidelity-comparison-across-avatar-models.

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

stableanimator — competitor; identity preservation focus.
animate-anyone-2 — competitor; environment affordance focus.
wan-animate — successor in spirit; DiT backbone instead of SVD.
dispose-pose-conditioning — plug-and-play module that improves MimicMotion's pose handling.
persona-avatar-3d — uses MimicMotion as a synthetic-data generator for a 3D avatar pipeline.
pose-driven-vs-audio-driven-avatars
video-diffusion-unet-to-dit
