EchoMimicV2

One-line summary: CVPR 2025 audio-driven semi-body human animation model from Ant Group, image-to-video, Apache-2.0.

What it is

A model that takes a reference image plus driving audio and produces a video animating the face and upper body. "Semi-body" means face plus upper body rather than a portrait crop, which distinguishes it from EchoMimic V1, which was portrait-only.

Why it matters to ai-video-generation

The strongest open-source audio-driven option in the survey. The audio-driven branch of avatar generation has historically been more closed (HeyGen, OmniHuman). EchoMimicV2 also accepts pose information from a driving video (see the RefImg-Pose Alignment demo), making it a hybrid in practice. Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

Authors: Ant Group.
Repo: github.com/antgroup/echomimic_v2.
License: Apache 2.0.
Venue: CVPR 2025.
Hardware: tested on A100 (80 GB), RTX 4090D (24 GB), and V100 (16 GB), so it runs on consumer GPUs.

Technical components (per repo)

Pretrained checkpoints include four main weights:

denoising_unet.pth
reference_unet.pth
motion_module.pth
pose_encoder.pth

(These file names indicate a UNet-based architecture rather than DiT; the model predates the video-diffusion-unet-to-dit migration.)
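
A quick way to confirm a download is complete is to check for the four weight files listed above. The following is a minimal sketch, not part of the repo: the function name and the "pretrained_weights" directory are assumptions for illustration, and the real repo may nest these files in subfolders.

```python
from pathlib import Path

# File names as listed in this note; the flat directory layout is an assumption.
EXPECTED_WEIGHTS = (
    "denoising_unet.pth",
    "reference_unet.pth",
    "motion_module.pth",
    "pose_encoder.pth",
)


def missing_weights(weights_dir: str) -> list[str]:
    """Return the expected checkpoint files that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]


if __name__ == "__main__":
    # "pretrained_weights" is a hypothetical path; point it at your own download.
    missing = missing_weights("pretrained_weights")
    print("all weights present" if not missing else f"missing: {missing}")
```

An empty result means all four files were found; anything else names what still needs to be downloaded.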

Capabilities

Primary input: audio (speech or music).
Optional: pose extracted from a driving video.
Output: image-to-video; semi-body (face + upper body).
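
The input combinations above can be pictured as a small data structure. This is a hypothetical sketch for this note only: AnimationRequest, its fields, and driving_mode are invented names and do not mirror the repo's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnimationRequest:
    """Hypothetical container for the inputs listed above (not the repo's API)."""

    reference_image: str               # path to the still image to animate
    audio: str                         # path to the driving speech or music clip
    pose_source: Optional[str] = None  # optional pose extracted from a driving video

    def driving_mode(self) -> str:
        """Name the driving signals in use: audio alone, or audio plus pose."""
        return "audio+pose" if self.pose_source else "audio"


if __name__ == "__main__":
    # File names here are placeholders.
    request = AnimationRequest("ref.png", "speech.wav")
    print(request.driving_mode())
```

Audio is required and pose is optional, which is what makes the model a hybrid in practice rather than a purely audio-driven one.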

Strengths

Open source under Apache 2.0.
Runs on a single 24 GB consumer GPU.
Hybrid audio + pose driving signals available.

Weaknesses

UNet-based backbone — not yet ported to DiT (where the field is heading).
Semi-body, not full-body.

Sources

2026-05-07-ai-avatar-motion-mimicking-models-survey

Related

omnihuman-1-5: closed-source audio-driven counterpart.
heygen-avatar-v
mimicmotion
pose-driven-vs-audio-driven-avatars
