EchoMimicV2
One-line summary: CVPR 2025 audio-driven semi-body human animation model from Ant Group, image-to-video, Apache-2.0.
What it is
An image-to-video model that takes a reference image and driving audio and animates the face and upper body. "Semi-body" means the output covers the upper body as well as the face, which distinguishes it from the portrait-only EchoMimic V1.
Why it matters to ai-video-generation
The strongest open-source audio-driven option in the survey. This matters because the audio-driven branch of avatar generation has historically been mostly closed-source (HeyGen, OmniHuman). EchoMimicV2 also accepts pose information extracted from a driving video (see the RefImg-Pose Alignment demo), which makes it a hybrid audio-and-pose model in practice. Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.
Key facts
- Authors: Ant Group.
- Repo: github.com/antgroup/echomimic_v2.
- License: Apache 2.0.
- Venue: CVPR 2025.
- Hardware: tested on A100 (80 GB), RTX 4090D (24 GB), and V100 (16 GB), so it runs on consumer GPUs.
Technical components (per repo)
The pretrained checkpoints include four main weight files:
- denoising_unet.pth
- reference_unet.pth
- motion_module.pth
- pose_encoder.pth
(These filenames indicate a UNet-based architecture rather than DiT, so the model predates the video-diffusion-unet-to-dit migration.)
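A quick way to confirm that a local download contains the four weight files listed above is a small existence check. This is a minimal sketch using only the standard library: the filenames come from the list above, but the helper name `missing_checkpoints` and the `pretrained_weights` directory are assumptions for illustration, and the repo may nest these files in subfolders.

```python
from pathlib import Path

# The four weight filenames listed above (per the repo's checkpoint list).
EXPECTED_WEIGHTS = (
    "denoising_unet.pth",
    "reference_unet.pth",
    "motion_module.pth",
    "pose_encoder.pth",
)


def missing_checkpoints(weights_dir):
    """Return the expected weight filenames not found directly in weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]


if __name__ == "__main__":
    # "pretrained_weights" is an assumed location; adjust to your checkout.
    missing = missing_checkpoints("pretrained_weights")
    print("missing:", missing or "none")
```

An empty list means all four files are present in the directory that was checked; it says nothing about whether the files are complete or uncorrupted.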
Capabilities
- Primary input: audio (speech or music).
- Optional: pose extracted from a driving video.
- Output: image-to-video; semi-body (face + upper body).
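The input combinations above can be summarized as a small data structure: a reference image and audio are always required, and a pose-driving video is optional. The sketch below is purely illustrative. `DrivingInputs` and its field names are hypothetical and are not the repo's API; they only encode which signals are mandatory and which mode results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrivingInputs:
    """Hypothetical container for one generation request (not the repo's API)."""

    reference_image: str              # path to the still image to animate
    audio: str                        # path to speech or music, the primary driver
    pose_video: Optional[str] = None  # optional driving video used for pose

    def mode(self) -> str:
        """Name the driving mode implied by which inputs are present."""
        if not self.reference_image or not self.audio:
            raise ValueError("reference image and audio are both required")
        return "audio+pose" if self.pose_video else "audio"
```

For example, `DrivingInputs("ref.png", "speech.wav").mode()` yields `"audio"`, while adding `pose_video="dance.mp4"` yields `"audio+pose"`, the hybrid case.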
Strengths
- Open source under Apache 2.0.
- Runs on a single 24GB consumer GPU.
- Hybrid audio + pose driving signals available.
Weaknesses
- UNet-based backbone — not yet ported to DiT (where the field is heading).
- Semi-body, not full-body.
Sources
Related
- omnihuman-1-5 — closed-source audio-driven counterpart.
- heygen-avatar-v
- mimicmotion
- pose-driven-vs-audio-driven-avatars