entity / generic / ai-video-generation
MimicMotion
One-line summary: Pose-driven, image-to-video human animation model from Tencent built on Stable Video Diffusion, accepted at ICML 2025.
What it is
A model that generates video from a reference image plus a driving pose sequence. It extracts 2D skeleton keypoints with DWPose from a driving video, then animates the reference identity through that motion sequence.
Why it matters to ai-video-generation
One of the foundational open-source models for "motion mimicking": driving an avatar's motion from a reference video. It was trained on 4,436 dancing videos averaging about 20 seconds each. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Backbone: Stable Video Diffusion (SVD), an image-to-video diffusion model.
- Driving signal: 2D pose keypoints from DWPose, frame-by-frame.
- License: open source via Tencent/MimicMotion on GitHub.
- Venue: ICML 2025.
Technical contributions
- Confidence-aware pose guidance: encodes pose-detector confidence as brightness in guidance frames so inaccurate keypoints contribute proportionally less, rather than being hard-thresholded.
- Regional loss amplification on hands: high-confidence hand keypoints receive amplified loss weights during training to address hand distortion (a notorious diffusion-video failure mode).
- Progressive latent fusion: overlapped video segments blend with position-aware weighting rather than simple averaging, enabling arbitrary-length output without boundary flicker.
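The three ideas above can be illustrated with a small NumPy sketch. This is not the MimicMotion implementation. The function names, the single-channel disc rendering, the threshold of 0.8, the amplification factor of 10, and the linear crossfade ramp are all assumptions chosen for illustration. In particular, the fusion function blends arrays that already exist, whereas the bullet above describes blending latents of overlapped segments; only the position-dependent weighting is shown.

```python
import numpy as np


def render_pose_frame(keypoints, confidences, height, width, radius=2):
    """Draw each (x, y) keypoint as a disc whose brightness is its confidence.

    Hypothetical helper: low-confidence keypoints appear dimmer instead of
    being dropped by a hard threshold.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (x, y), conf in zip(keypoints, confidences):
        disc = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        frame[disc] = np.maximum(frame[disc], conf)
    return frame


def hand_loss_weights(hand_keypoints, hand_confidences, height, width,
                      threshold=0.8, amplification=10.0, pad=2):
    """Per-pixel loss weights: 1 everywhere, amplified inside the hand box.

    The box is only amplified when every hand keypoint is confident
    (threshold and amplification values are assumed, not from the source).
    """
    weights = np.ones((height, width), dtype=np.float32)
    if len(hand_keypoints) and np.min(hand_confidences) >= threshold:
        pts = np.asarray(hand_keypoints)
        x0, y0 = np.floor(pts.min(axis=0)).astype(int) - pad
        x1, y1 = np.ceil(pts.max(axis=0)).astype(int) + pad
        x0, y0 = max(x0, 0), max(y0, 0)
        weights[y0:y1 + 1, x0:x1 + 1] = amplification
    return weights


def fuse_segments(segments, overlap):
    """Join segments of shape (T, ...) with a position-aware crossfade.

    Within each overlap, the incoming segment's weight rises linearly with
    frame position, so no frame is a plain 50/50 average of both segments
    unless it sits exactly in the middle.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1 frame")
    fused = segments[0].astype(np.float32)
    ramp = np.arange(1, overlap + 1, dtype=np.float32) / (overlap + 1)
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        w = ramp.reshape(-1, *([1] * (seg.ndim - 1)))  # broadcast over frame dims
        blended = (1 - w) * fused[-overlap:] + w * seg[:overlap]
        fused = np.concatenate([fused[:-overlap], blended, seg[overlap:]], axis=0)
    return fused
```

In a training loop, the weight map from `hand_loss_weights` would multiply the per-pixel (or per-latent-position) loss before averaging, and the frames from `render_pose_frame` would serve as the pose guidance input.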
Strengths
- The confidence-aware pose guidance makes it more robust to noisy pose detection.
- Open source in Tencent's GitHub repository, so results can be reproduced.
- Uses standard DWPose extraction, so it integrates easily with existing pose pipelines.
Weaknesses
- Hand fidelity remains a weak spot across the field; the regional loss amplification mitigates the problem but does not fully solve it.
- Trained primarily on dancing videos, so it may generalize less well to other motion types.
Open questions
Sources
Related
- stableanimator — competitor; identity preservation focus.
- animate-anyone-2 — competitor; environment affordance focus.
- wan-animate — successor in spirit; DiT backbone instead of SVD.
- dispose-pose-conditioning — plug-and-play module that improves MimicMotion's pose handling.
- persona-avatar-3d — uses MimicMotion as a synthetic-data generator for a 3D avatar pipeline.
- pose-driven-vs-audio-driven-avatars
- video-diffusion-unet-to-dit
Referenced by