OmniHuman 1.5

One-line summary: ByteDance's closed, API-only audio-driven full-body avatar model (August 2025), which bridges a multimodal LLM with a Diffusion Transformer in a "dual-system" cognitive design.

What it is

Generates video from a single image plus a voice track, with optional text prompts for refinement. Notably, it does not accept a motion video as a driving signal: input is image + audio only, and expression and gesture are derived from the audio. Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.
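The input contract described above can be sketched as a small validation helper. This is a minimal illustration only: the class name, field names, and error messages are hypothetical and do not reflect the BytePlus API or any partner platform's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvatarRequest:
    """Hypothetical request shape; field names are illustrative, not a vendor schema."""
    image_path: str                           # single reference image (required)
    audio_path: str                           # voice track that drives the motion (required)
    prompt: Optional[str] = None              # optional text prompt for refinement
    motion_video_path: Optional[str] = None   # included only to show that it is rejected


def validate(req: AvatarRequest) -> list[str]:
    """Return a list of problems; an empty list means the inputs match the note."""
    problems = []
    if not req.image_path:
        problems.append("a reference image is required")
    if not req.audio_path:
        problems.append("an audio track is required")
    if req.motion_video_path is not None:
        # The model is image + audio only, so a driving video has nowhere to go.
        problems.append("motion-video driving is not supported (image + audio only)")
    return problems
```

A request with an image, an audio track, and an optional prompt passes; adding a driving video is flagged, which is the practical meaning of "no puppeting workflows".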

Why it matters to ai-video-generation

OmniHuman 1.5 is one of the few closed-source models that produce expressive full-body (not just talking-head) avatar video from audio alone. Its "dual-system" architecture (slow planning via an MLLM, fast reaction via diffusion) is among the more architecturally distinctive contributions in the audio-driven branch.

Key facts

Vendor: ByteDance.
Status: closed; available via BytePlus API and partner platforms (fal.ai, eachlabs, MindStudio).
Released: August 2025 (v1.5; succeeds v1).
Output: 1024×1024 at 30 fps, up to ~30 seconds; the project page also claims "videos over one minute" with dynamic motion, continuous camera movement, and multi-character interactions.
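To put the output figures in perspective, the arithmetic below converts duration and frame rate into a frame count and an uncompressed RGB size. It is a rough sketch: the 3-bytes-per-pixel assumption and the function names are mine, and delivered videos are compressed, so real file sizes will be far smaller.

```python
def frame_count(duration_s: float, fps: int = 30) -> int:
    """Number of frames in a clip of the given length."""
    return round(duration_s * fps)


def raw_rgb_bytes(frames: int, width: int = 1024, height: int = 1024,
                  channels: int = 3) -> int:
    """Uncompressed size, assuming 8 bits per channel (an assumption, not a spec)."""
    return frames * width * height * channels


frames_30s = frame_count(30)   # the ~30 second limit
frames_60s = frame_count(60)   # the "over one minute" claim, at its lower bound
print(frames_30s, frames_60s, raw_rgb_bytes(frames_30s))
```

At 30 fps a 30-second clip is 900 frames and a one-minute clip is 1,800, so the "over one minute" claim implies at least twice as many frames as the ~30-second figure.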

Technical contributions

Dual-system architecture: bridges a Multimodal Large Language Model (slow, deliberate planning) with a Diffusion Transformer (fast, intuitive reaction). The design is framed as cognitive simulation.
Handles multi-character interactions, anime, stylized art, anthropomorphic figures — not limited to realistic photographs.
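The slow-planner / fast-reactor split can be illustrated with a toy two-rate loop. This is not the model's actual architecture or scheduling, which the source does not describe. It assumes, purely for illustration, that a slow planner is consulted once per segment and a fast step runs on every frame; `plan` and `react` are hypothetical stand-ins for the MLLM and the Diffusion Transformer.

```python
from typing import Callable, List


def run_dual_system(n_frames: int, plan_every: int,
                    plan: Callable[[int], str],
                    react: Callable[[int, str], str]) -> List[str]:
    """Call the slow planner every `plan_every` frames and the fast step on every frame."""
    outputs: List[str] = []
    current_plan = ""
    for frame in range(n_frames):
        if frame % plan_every == 0:
            current_plan = plan(frame)               # slow path: infrequent, deliberate
        outputs.append(react(frame, current_plan))   # fast path: runs for each frame
    return outputs


# Toy stand-ins: the planner names a gesture per segment, the reactor tags each frame.
result = run_dual_system(
    n_frames=6,
    plan_every=4,
    plan=lambda f: f"gesture-{f // 4}",
    react=lambda f, p: f"{p}@{f}",
)
print(result)
```

The point of the sketch is only the division of labor: high-level intent changes rarely, while per-frame output is produced continuously under the current plan.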

Capabilities

Lip-sync coherent with rhythm, prosody, semantic content.
Musical performances with rich expression beyond mere lip-sync.
Context-aware gesturing tied to speech content.

Strengths

Full-body audio-driven in a single closed model — rare combination.
Stylistic range (real, anime, anthropomorphic).

Weaknesses

Closed source and API-gated.
No motion-video driving signal — can't be used for "puppeting" workflows.
Benchmarks are vendor-reported only; no independent comparison surfaced.

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

heygen-avatar-v — closed-source counterpart focused on talking-head with reference video.
echomimic-v2 — open-source audio-driven counterpart.
pose-driven-vs-audio-driven-avatars
