How do current avatar models compare on hand fidelity?

The question

Hand quality is a notorious failure mode for diffusion video generation, called out repeatedly across the survey. MimicMotion's regional loss amplification on high-confidence hand keypoints was introduced specifically to address it. Where does hand fidelity actually stand across the models surveyed?

Why it matters

For full-body avatars, hands are visible, expressive, and high-frequency in a way that makes failure modes glaringly obvious. A model that can't render hands cleanly is unusable for many real applications (sign language, gesture-heavy presentations, demonstrations).

What we currently believe

MimicMotion explicitly addresses hand distortion via training-loss reweighting on high-confidence hand keypoints.
Runway Act-Two markets full hand tracking as a feature.
Wan-Animate and AnimateAnyone V2 do not describe a hand-specific strategy, but both report strong overall human-evaluation scores.
No direct head-to-head hand-fidelity comparison exists in the source survey.
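
To make the loss-reweighting idea concrete, here is a minimal sketch of amplifying a per-pixel loss inside a hand region built from high-confidence keypoints. It is not MimicMotion's implementation. The function names, the bounding-box mask, the 0.8 confidence threshold, the padding, and the 10x amplification factor are all assumptions chosen for illustration.

```python
def hand_weight_mask(height, width, hand_keypoints,
                     conf_threshold=0.8, pad=2, amplification=10.0):
    """Build a height x width grid of per-pixel loss weights.

    hand_keypoints: list of (x, y, confidence) in pixel coordinates.
    Pixels inside the padded bounding box of the confident keypoints get
    `amplification`; every other pixel keeps weight 1.0.
    All defaults are illustrative assumptions, not published values.
    """
    mask = [[1.0] * width for _ in range(height)]
    confident = [(x, y) for x, y, c in hand_keypoints if c >= conf_threshold]
    if not confident:
        # No trustworthy hand keypoints: leave the loss unweighted.
        return mask
    xs = [x for x, _ in confident]
    ys = [y for _, y in confident]
    x0 = max(0, int(min(xs)) - pad)
    x1 = min(width - 1, int(max(xs)) + pad)
    y0 = max(0, int(min(ys)) - pad)
    y1 = min(height - 1, int(max(ys)) + pad)
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            mask[y][x] = amplification
    return mask


def weighted_mse(pred, target, mask):
    """Mean over all pixels of weight * squared error."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2
            count += 1
    return total / count
```

Low-confidence keypoints are dropped before the mask is built, so an unreliable hand detection does not pull extra loss weight onto the wrong region.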

Evidence we have

2026-05-07-ai-avatar-motion-mimicking-models-survey notes this is an unresolved comparison.
MimicMotion's regional loss amplification is documented.

Evidence we need

Side-by-side hand renders of the same driving sequence across MimicMotion, Wan-Animate, StableAnimator, and Act-Two.
Whether hand-specific evaluation metrics (e.g., MPJPE on hand keypoints, hand-region FID) are used in the field.
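
For reference, MPJPE (mean per-joint position error) restricted to hand keypoints is the average Euclidean distance between predicted and reference joints. The sketch below assumes the keypoints have already been extracted by some pose estimator and are given in matching order. The function name is hypothetical. Hand-region FID is not sketched because it needs a learned feature extractor.

```python
import math


def hand_mpjpe(pred_keypoints, ref_keypoints):
    """Mean Euclidean distance between matching hand joints.

    Both arguments are equal-length sequences of coordinate tuples,
    (x, y) or (x, y, z), in the same units and the same joint order.
    """
    if len(pred_keypoints) != len(ref_keypoints) or not pred_keypoints:
        raise ValueError("keypoint lists must be non-empty and the same length")
    distances = [math.dist(p, r) for p, r in zip(pred_keypoints, ref_keypoints)]
    return sum(distances) / len(distances)
```

With 2D keypoints the result is in pixels, so scores are only comparable across models when all outputs share one resolution.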

How to resolve

Search 2025 video-generation surveys for papers that evaluate hands specifically.
Practical: run the same driving video through 2–3 open-source models and compare frames.
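
One rough way to turn the frame comparison into a number is to run a hand-keypoint detector over each model's output and average the detector's confidence per model. This is only a proxy for hand quality and is an assumption of this sketch, not a method taken from the survey. The function name and model names are hypothetical, and the detector is left out: the input is whatever confidences it produced.

```python
from statistics import mean


def rank_models_by_hand_confidence(per_model_confidences):
    """Rank models by mean detected hand-keypoint confidence.

    per_model_confidences maps a model name to a list of frames, where each
    frame is a list of hand-keypoint confidences in [0, 1]. A frame with no
    detected hand keypoints (empty list) scores 0.0.
    Returns (name, score) pairs sorted from best to worst.
    """
    scores = {}
    for name, frames in per_model_confidences.items():
        per_frame = [mean(conf) if conf else 0.0 for conf in frames]
        scores[name] = mean(per_frame) if per_frame else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A detector can be confidently wrong on a malformed hand, so the ranking should be checked against the side-by-side frames rather than trusted alone.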
