MimicMotion

One-line summary: Pose-driven, image-to-video human animation model from Tencent built on Stable Video Diffusion, accepted at ICML 2025.

What it is

Takes a reference image plus a driving pose sequence and outputs a video. 2D skeleton keypoints are extracted from a driving video with DWPose, and the reference identity is then animated through that motion sequence.

Why it matters to ai-video-generation

One of the foundational open-source models for "motion mimicking", i.e. driving an avatar's motion from a reference video. Trained on 4,436 dancing videos averaging ~20 seconds each. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

  • Backbone: Stable Video Diffusion (SVD), an image-to-video diffusion model.
  • Driving signal: 2D pose keypoints from DWPose, frame-by-frame.
  • License: open source; code is published in the Tencent/MimicMotion repository on GitHub.
  • Venue: ICML 2025.

Technical contributions

  • Confidence-aware pose guidance: encodes pose-detector confidence as brightness in guidance frames so inaccurate keypoints contribute proportionally less, rather than being hard-thresholded.
  • Regional loss amplification on hands: high-confidence hand keypoints receive amplified loss weights during training to address hand distortion (a notorious diffusion-video failure mode).
  • Progressive latent fusion: overlapped video segments blend with position-aware weighting rather than simple averaging, enabling arbitrary-length output without boundary flicker.
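
The three ideas above can be sketched in a few lines of plain Python. This is an illustrative toy, not the code from the Tencent/MimicMotion repository: every function name here is hypothetical, the `threshold`, `gain`, and `pad` defaults are assumed values, keypoints are drawn as single pixels rather than skeleton limbs, each frame's latent is reduced to a single float, and the linear ramp in `fuse_segments` is only one possible position-aware weighting.

```python
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, detector confidence in [0, 1])


def keypoint_brightness(confidence: float) -> int:
    """Map detector confidence to an 8-bit brightness (no hard threshold)."""
    clipped = min(max(confidence, 0.0), 1.0)
    return round(255 * clipped)


def render_pose_frame(keypoints: Sequence[Keypoint], height: int, width: int) -> List[List[int]]:
    """Draw each keypoint as one pixel whose brightness encodes its confidence."""
    frame = [[0] * width for _ in range(height)]
    for x, y, conf in keypoints:
        col, row = int(x), int(y)
        if 0 <= row < height and 0 <= col < width:  # skip off-frame keypoints
            frame[row][col] = max(frame[row][col], keypoint_brightness(conf))
    return frame


def hand_loss_weights(height: int, width: int, hand_keypoints: Sequence[Keypoint],
                      threshold: float = 0.8, gain: float = 10.0, pad: int = 1) -> List[List[float]]:
    """Per-pixel loss weights: 1.0 everywhere, `gain` inside the padded hand box.

    The box is amplified only if every hand keypoint clears `threshold`
    (assumed rule; threshold, gain, and pad are illustrative values).
    """
    weights = [[1.0] * width for _ in range(height)]
    if not hand_keypoints or min(c for _, _, c in hand_keypoints) < threshold:
        return weights
    xs = [int(x) for x, _, _ in hand_keypoints]
    ys = [int(y) for _, y, _ in hand_keypoints]
    x0, x1 = max(min(xs) - pad, 0), min(max(xs) + pad, width - 1)
    y0, y1 = max(min(ys) - pad, 0), min(max(ys) + pad, height - 1)
    for row in range(y0, y1 + 1):
        for col in range(x0, x1 + 1):
            weights[row][col] = gain
    return weights


def fuse_segments(first: Sequence[float], second: Sequence[float], overlap: int) -> List[float]:
    """Join two segments whose last/first `overlap` frames cover the same timestamps.

    Inside the overlap, the weight of `second` grows with frame position,
    so the blend is position-aware rather than a flat 50/50 average.
    """
    if not 0 < overlap <= min(len(first), len(second)):
        raise ValueError("overlap must be positive and fit inside both segments")
    fused = list(first[:-overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the later segment
        fused.append((1 - w) * first[len(first) - overlap + i] + w * second[i])
    fused.extend(second[overlap:])
    return fused
```

In this toy, a keypoint with confidence 0.2 is drawn at brightness 51 instead of being dropped, a hand containing any low-confidence keypoint gets no extra loss weight, and two 4-frame segments with a 3-frame overlap fuse into 5 frames that ramp smoothly from the first segment to the second.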

Strengths

  • The confidence-aware approach makes it more robust to noisy pose detection.
  • Open source in Tencent's GitHub repo, so results can be reproduced.
  • Standard DWPose extraction makes it easy to integrate with existing pipelines.

Weaknesses

  • Hand fidelity is still called out as a weak spot industry-wide; the regional loss amplification mitigates the problem but does not fully solve it.
  • Trained primarily on dancing videos, so it may generalize less well to other motion types.
