Tags: entity, generic, ai-video-generation

EchoMimicV2

One-line summary: CVPR 2025 audio-driven semi-body human animation model from Ant Group, image-to-video, Apache-2.0.

What it is

A model that takes a reference image plus driving audio and produces a video animating the face and upper body. "Semi-body" means it covers the face and upper body rather than the face alone, unlike the portrait-only EchoMimic V1.

Why it matters to ai-video-generation

EchoMimicV2 is the strongest open-source audio-driven option in the survey, in a branch of avatar generation that has historically been more closed (HeyGen, OmniHuman). It also accepts pose information from a driving video (RefImg-Pose Alignment demo), making it a hybrid in practice. Source: 2026-05-07-ai-avatar-motion-mimicking-models-survey.

Key facts

  • Authors: Ant Group.
  • Repo: github.com/antgroup/echomimic_v2.
  • License: Apache 2.0.
  • Venue: CVPR 2025.
  • Hardware: tested on A100 80G, RTX 4090D 24G, and V100 16G, so it runs on consumer GPUs such as the RTX 4090D.

Technical components (per repo)

Pretrained checkpoints include four main weights:

  • denoising_unet.pth
  • reference_unet.pth
  • motion_module.pth
  • pose_encoder.pth

(These file names indicate a UNet-based architecture, not DiT; the model predates the video-diffusion-unet-to-dit migration.)
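
A quick way to confirm a download is complete is to check for the four file names listed above. This is a minimal sketch, not code from the repo: the helper name missing_weights and the directory name pretrained_weights are assumptions, and it assumes all four files sit directly in one directory, which the repo layout may not match.

```python
from pathlib import Path

# Checkpoint file names as listed in these notes.
EXPECTED_WEIGHTS = (
    "denoising_unet.pth",
    "reference_unet.pth",
    "motion_module.pth",
    "pose_encoder.pth",
)


def missing_weights(weights_dir):
    """Return the expected checkpoint names not found directly under weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]


if __name__ == "__main__":
    # "pretrained_weights" is a placeholder path, not confirmed from the repo.
    missing = missing_weights("pretrained_weights")
    print("all four weights present" if not missing else f"missing: {missing}")
```

An empty list from missing_weights means every named file was found; it says nothing about whether the files are valid checkpoints.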

Capabilities

  • Primary input: audio (speech or music).
  • Optional: pose extracted from a driving video.
  • Output: image-to-video; semi-body (face + upper body).
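
The input combinations above can be modelled as a small record with one required driving signal and one optional one. This is an illustrative sketch only: DrivingInputs, its field names, and the mode labels are hypothetical and do not correspond to the repo's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrivingInputs:
    """Hypothetical container for one animation request (not the repo's API)."""

    reference_image: str              # path to the still image to animate
    audio: str                        # path to the driving speech or music
    pose_video: Optional[str] = None  # optional driving video used for pose

    @property
    def mode(self) -> str:
        # Audio is always present; pose makes the request a hybrid one.
        return "audio+pose" if self.pose_video else "audio"


if __name__ == "__main__":
    audio_only = DrivingInputs("ref.png", "speech.wav")
    hybrid = DrivingInputs("ref.png", "speech.wav", pose_video="driver.mp4")
    print(audio_only.mode, hybrid.mode)
```

Making audio a required field and pose optional mirrors the notes: audio is the primary signal, and pose from a driving video is an add-on.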

Strengths

  • Open source under Apache 2.0.
  • Runs on a single 24GB consumer GPU.
  • Hybrid audio + pose driving signals available.

Weaknesses

  • UNet-based backbone — not yet ported to DiT (where the field is heading).
  • Semi-body, not full-body.
