OmniHuman 1.5
One-line summary: ByteDance's closed, API-only, audio-driven full-body avatar model (August 2025), which bridges a multimodal LLM with a Diffusion Transformer in a "dual-system" cognitive design.
What it is
A model that turns a single image plus a voice track into video. Text prompts are optional and serve only for refinement. Notably, it does not accept a motion video as a driving signal: it is image+audio only, deriving expression and gesture from the audio. From 2026-05-07-ai-avatar-motion-mimicking-models-survey.
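The input contract described above can be sketched as a small validation helper. This is only an illustration of the image+audio-only rule: `AvatarRequest`, its field names, and `validate` are hypothetical and do not reflect the BytePlus or partner-platform API schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AvatarRequest:
    """Hypothetical request shape, not a real API schema."""
    image_path: str                          # required: single reference image
    audio_path: str                          # required: voice track that drives the motion
    prompt: Optional[str] = None             # optional: text refinement
    motion_video_path: Optional[str] = None  # kept only to show that it is rejected


def validate(req: AvatarRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the contract."""
    problems = []
    if not req.image_path:
        problems.append("a reference image is required")
    if not req.audio_path:
        problems.append("an audio track is required")
    if req.motion_video_path is not None:
        problems.append("motion video is not accepted as a driving signal")
    return problems
```

A request with an image, an audio track, and an optional prompt passes; one that also supplies a motion video is flagged, which is the practical meaning of "image+audio only".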
Why it matters to ai-video-generation
OmniHuman is one of the few closed-source models that produce expressive full-body (not just talking-head) avatar video from audio alone. Its "dual-system" architecture (slow planning via an MLLM, fast reaction via diffusion) is one of the more architecturally distinctive contributions in the audio-driven branch.
Key facts
- Vendor: ByteDance.
- Status: closed; available via BytePlus API and partner platforms (fal.ai, eachlabs, MindStudio).
- Released: August 2025 (v1.5; succeeds v1).
- Output: 1024×1024 at 30fps, up to ~30 seconds; the project page also claims "videos over one minute" with dynamic motion, continuous camera movement, and multi-character interactions.
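For a sense of scale, the output figures listed above imply roughly 900 frames for a full-length clip. The sketch below does that arithmetic and checks an audio duration against the cap. The constants come from the key facts; the ~30-second limit is approximate, so `MAX_SECONDS` is an assumption, and the helper names are hypothetical.

```python
import math

FPS = 30            # frames per second, from the key facts
MAX_SECONDS = 30.0  # approximate cap from the key facts; treat as an assumption
WIDTH = HEIGHT = 1024


def frame_count(seconds: float, fps: int = FPS) -> int:
    """Number of frames needed to cover a clip of the given duration."""
    return math.ceil(seconds * fps)


def fits_single_clip(audio_seconds: float, max_seconds: float = MAX_SECONDS) -> bool:
    """True if the audio track is within the assumed single-clip cap."""
    return 0 < audio_seconds <= max_seconds


frames = frame_count(MAX_SECONDS)    # 900 frames at 30 fps
pixels = frames * WIDTH * HEIGHT     # 943,718,400 pixels across the whole clip
```

Audio longer than the cap would need whichever longer-form path the project page's "over one minute" claim refers to; that path is not described in the facts gathered here.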
Technical contributions
- Dual-system architecture: bridges a Multimodal Large Language Model (slow, deliberate planning) with a Diffusion Transformer (fast, intuitive reaction). The project frames this as cognitive simulation.
- Handles multi-character interactions, anime, stylized art, anthropomorphic figures — not limited to realistic photographs.
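One way to picture the slow/fast split is a two-rate loop: a planner runs once per audio segment and a reactor runs once per frame, conditioned on the latest plan. The toy below illustrates only that control-flow pattern. Both functions are trivial stubs, the per-segment cadence is an assumption, and nothing here reflects how OmniHuman actually connects its MLLM and Diffusion Transformer, which this note does not describe.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    segment_index: int
    intent: str  # coarse, semantic description of what the body should do


def slow_planner(segment_index: int, transcript: str) -> Plan:
    """Stub for the deliberate step (the MLLM's role in the analogy)."""
    intent = "emphatic gesture" if "!" in transcript else "neutral gesture"
    return Plan(segment_index, intent)


def fast_reactor(plan: Plan, audio_level: float) -> str:
    """Stub for the per-frame step (the diffusion model's role in the analogy)."""
    mouth = "open" if audio_level > 0.5 else "closed"
    return f"{plan.intent}, mouth {mouth}"


def render(segments: list[tuple[str, list[float]]]) -> list[str]:
    """Each segment is (transcript, per-frame audio levels)."""
    frames = []
    for i, (transcript, levels) in enumerate(segments):
        plan = slow_planner(i, transcript)             # runs once per segment
        for level in levels:
            frames.append(fast_reactor(plan, level))   # runs once per frame
    return frames
```

Calling `render([("Hello!", [0.9, 0.1]), ("ok", [0.7])])` yields three frame descriptions: the first two share the "emphatic gesture" plan and differ only in mouth state, which is the division of labour the dual-system framing suggests.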
Capabilities
- Lip-sync coherent with the rhythm, prosody, and semantic content of the audio.
- Musical performances with rich expression beyond mere lip-sync.
- Context-aware gesturing tied to speech content.
Strengths
- Full-body, audio-driven generation in a single closed model, a rare combination.
- Stylistic range (real, anime, anthropomorphic).
Weaknesses
- Closed source and API-gated.
- No motion-video driving signal — can't be used for "puppeting" workflows.
- Benchmarks are vendor-reported; no independent comparison has surfaced.
Sources
- 2026-05-07-ai-avatar-motion-mimicking-models-survey
Related
- heygen-avatar-v — closed-source counterpart focused on talking-head with reference video.
- echomimic-v2 — open-source audio-driven counterpart.
- pose-driven-vs-audio-driven-avatars