Are there independent benchmarks for avatar generation models, or only vendor-/author-reported numbers?

The question

Most performance claims in this space come from the model authors or the vendor selling the model:

wan-animate reports outperforming stableanimator, AnimateAnyone, and Champ on automated metrics, and outperforming runway-act-two and DreamActor-M1 in human evaluation.
heygen-avatar-v reports 68.9–85.7% pairwise preference over competitors.
Closed commercial roundups (a Kling 3.0 motion-fidelity score of 8.4, Aurora called "best in class") are vendor marketing.
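
As quoted here, the heygen-avatar-v preference figures give no count of the judgments behind them. The same headline percentage can come from a few dozen votes or from a thousand, and that changes how much it should be trusted. Below is a minimal sketch of that effect. It assumes a study that reports raw win counts with ties excluded. The function name and the example counts are hypothetical. The Wilson score interval is one standard choice and is not necessarily what any vendor used.

```python
import math


def win_rate_with_wilson_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win_rate, lower, upper) for a pairwise preference study.

    wins  -- judgments preferring the model under test (hypothetical input)
    total -- pairwise judgments, ties excluded (an assumption)
    z     -- normal quantile; 1.96 gives an approximate 95% interval
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    p = wins / total
    # Wilson score interval: shrinks the center toward 0.5 and widens for small totals.
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half


# Hypothetical counts: roughly the same 68.9% headline from a small and a large study.
small = win_rate_with_wilson_ci(31, 45)
large = win_rate_with_wilson_ci(689, 1000)
print(small)
print(large)
```

The two studies share a point estimate, but the 45-judgment interval is several times wider. This is one reason an independent benchmark with a published protocol and sample size carries more weight than a bare percentage.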

Is there an independent benchmarking effort, comparable to VBench, that compares avatar models on a level playing field?

Why it matters

Without independent benchmarks, picking a model means trusting the vendor or paper authors. For a thread specifically about evaluating model capabilities, this is the load-bearing question.

What we currently believe

VBench has been mentioned as a relevant benchmark for video generation broadly, and DisPose reports VBench scores in its results (dispose-pose-conditioning). Whether VBench has avatar-specific tasks, or whether there is a dedicated avatar leaderboard, is unknown.

Evidence we have

2026-05-07-ai-avatar-motion-mimicking-models-survey flags this gap in its "Contradictions and open questions" section.
DisPose reports VBench improvements over MimicMotion, suggesting VBench is at least applicable.

Evidence we need

Whether VBench has dedicated avatar-relevant categories.
Whether there's a standalone avatar leaderboard (e.g., on HuggingFace or paperswithcode).
Independent third-party comparisons (academic surveys, benchmarking papers).

How to resolve

Search VBench documentation for human-animation task definitions.
Look for 2025 survey papers on "human image animation" or "talking-head generation" benchmarking.
A targeted /academic-research pass on the topic.

Related