ElevenLabs


One-line summary: AI voice platform offering ultra-low-latency TTS, voice cloning, and a hosted conversational agent framework — a candidate voice layer for patia's Phase 2 roadmap item and the TTS dependency used by clicky.

What it is

A voice AI platform (elevenlabs.io) spanning three product lines:

  • ElevenCreative — text-to-speech in 70+ languages, music generation, sound effects, voice cloning
  • ElevenAgents — conversational AI deployable to phone, chat, email, WhatsApp; includes analytics, guardrails, workflow integration
  • ElevenAPI — developer APIs for the underlying capabilities

Why it matters to patia

Two reasons:

  1. Phase 2 voice candidate. patia's roadmap defers voice to Phase 2 (after the SMS + web chat retention signal is validated). ElevenLabs is one option to evaluate alongside Retell, Vapi, and Twilio voice. Flash v2.5's advertised ~75 ms latency is competitive for real-time conversational use, though that is a vendor-stated figure rather than a measured end-to-end number. The v3 model is marketed as "most expressive," which aligns with patia's product principle of a warm, patient, unhurried tone; for this audience that is arguably more important than raw latency.
  2. Used by clicky. If patia ever adopts a screen-aware mode (screen-aware-ai-for-seniors), ElevenLabs is already the working TTS reference in that pattern.

Key facts

  • Flagship models: Flash v2.5 (ultra-low latency, ~75 ms), Multilingual v2 (consistent/lifelike), v3 (most expressive)
  • Channels for ElevenAgents: phone, chat, email, WhatsApp
  • Pricing: free tier available; paid tiers on elevenlabs.io/pricing — not captured in this source
  • API posture: developer-first with REST + streaming options
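To make the "REST + streaming" posture concrete, here is a minimal sketch of what a TTS call might look like. It only builds the request and never sends it. The endpoint path, the `xi-api-key` header, and the `eleven_flash_v2_5` model ID reflect our understanding of the public API and are not captured in this source, so check them against the current API reference before relying on them. The voice ID and API key are placeholders, and `build_tts_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: base URL, path shape, header name, and model ID are taken from
# our reading of the public ElevenLabs API docs, not from this wiki's source.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key,
                      model_id="eleven_flash_v2_5", stream=False):
    """Build (but do not send) a text-to-speech HTTP request."""
    path = f"/text-to-speech/{voice_id}"
    if stream:
        # Assumed streaming variant: same route with a /stream suffix.
        path += "/stream"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


# Placeholders only; nothing is sent over the network here.
req = build_tts_request(
    "Hello, take your time.",
    voice_id="VOICE_ID_PLACEHOLDER",
    api_key="API_KEY_PLACEHOLDER",
    stream=True,
)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` and reading the audio bytes in chunks, which is where a streaming endpoint would matter for perceived latency.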

Strengths (from patia's perspective)

  • Latency is competitive for real-time voice conversation
  • Expressiveness matters for this audience — a flat, robotic voice cuts against patia's tone principle; v3's expressiveness is a differentiator relative to older TTS engines
  • Multi-channel agent framework could shorten some of the Phase 2 buildout if the hosted agent layer is good enough, though we would likely keep agent logic in-house for control
  • Already integrated in reference implementations of the screen-aware pattern
  • One third-party accuracy endorsement for ElevenLabs Scribe 2 (STT). A reply on the insanely-fast-whisper-stt launch thread (2026-05-02-insanely-fast-whisper, @AlxAndrws) names Scribe 2 as the most accurate STT model in their experience and worth the extra cost over open-source alternatives. This is a single anecdotal data point, but it is the only third-party STT accuracy comparison in the wiki so far. (Scribe is ElevenLabs' STT line; it sits outside the Creative / Agents / API product lines listed under "What it is" and isn't otherwise documented in this wiki yet.)
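The "keep agent logic in-house" idea from the list above can be sketched as a thin provider boundary: the agent decides what to say, and a swappable TTS provider only renders audio. Everything named here (`TTSProvider`, `FakeTTS`, `speak_reply`) is hypothetical and is not taken from the patia repo; it is one possible shape, not the project's actual design.

```python
from typing import Iterator, Protocol


class TTSProvider(Protocol):
    """Hypothetical boundary: any vendor adapter only has to stream audio bytes."""

    def synthesize(self, text: str) -> Iterator[bytes]: ...


class FakeTTS:
    """Stand-in provider for tests; yields the text's UTF-8 bytes in chunks."""

    def __init__(self, chunk_size: int = 8) -> None:
        self.chunk_size = chunk_size

    def synthesize(self, text: str) -> Iterator[bytes]:
        data = text.encode("utf-8")
        for start in range(0, len(data), self.chunk_size):
            yield data[start:start + self.chunk_size]


def speak_reply(reply: str, tts: TTSProvider) -> bytes:
    """The agent core owns the reply text; the provider only turns it into audio."""
    return b"".join(tts.synthesize(reply))


audio = speak_reply("Take your time.", FakeTTS(chunk_size=4))
```

With a boundary like this, an ElevenLabs adapter would be one implementation among several, which keeps the Retell / Vapi / Twilio comparison cheap to run.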

Weaknesses (from patia's perspective)

  • Vendor lock-in risk if the expressive voices become a product signature — voice cloning means switching providers later is not trivial
  • Price sensitivity unknown at pilot scale — need to pull the pricing sheet when Phase 2 planning starts
  • Pricing pressure from new entrants. xAI launched grok-voice in April 2026 with TTS at $4.20 per million characters and STT at $0.10–$0.20/hr, claiming to be "10x cheaper" than ElevenLabs (2026-04-20-grok-voice-thread). That claim isn't independently verified. Open-source and Chinese voice models are shipping in parallel; insanely-fast-whisper-stt is one concrete example, packaging Whisper variants for free local STT with diarization (2026-05-02-insanely-fast-whisper). The directional signal, that voice AI is commoditizing fast, weakens ElevenLabs' pricing power going into the Phase 2 evaluation.
  • Agent framework vs. our own agent — ElevenAgents is a full platform; patia likely wants to keep the agent core in-house (see src/lib/agent/ in the project repo) and only use ElevenLabs for TTS (and possibly STT) I/O
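To put the pricing-pressure point in pilot-scale terms, here is a rough arithmetic sketch. The $4.20 per million characters figure is the grok-voice number quoted above. The pilot volume (50 users, 2,000 characters per day, 30 days) is purely hypothetical, and the "implied" comparison figure is just the unverified "10x" claim taken at face value, not an actual ElevenLabs price (which this page does not capture).

```python
# From the grok-voice thread cited above.
GROK_TTS_USD_PER_MILLION_CHARS = 4.20


def tts_cost_usd(chars: int, usd_per_million_chars: float) -> float:
    """Cost of synthesizing `chars` characters at a per-million-character rate."""
    return round(chars / 1_000_000 * usd_per_million_chars, 2)


# Hypothetical pilot volume: 50 users x 2,000 chars/day x 30 days.
pilot_chars = 50 * 2_000 * 30

grok_cost = tts_cost_usd(pilot_chars, GROK_TTS_USD_PER_MILLION_CHARS)

# Unverified: what the "10x cheaper" claim would imply if taken literally.
# This is NOT a quoted ElevenLabs price.
implied_cost_if_claim_holds = round(grok_cost * 10, 2)

print(f"pilot characters: {pilot_chars:,}")
print(f"grok-voice TTS:   ${grok_cost:.2f}")
print(f"implied by claim: ${implied_cost_if_claim_holds:.2f}")
```

At this hypothetical volume either figure is small in absolute terms, so the pricing gap may matter more at scale than at pilot; the real comparison still needs the actual pricing sheet.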

Open questions

  • Does ElevenLabs v3 expressiveness translate to older adults perceiving the voice as warmer, or does expressiveness skew younger/performative in ways that feel off?
  • How does Flash v2.5 compare to Retell's end-to-end latency for a comparable conversational loop?
  • Are there compliance considerations (HIPAA, call-recording law) for using ElevenAgents over phone channels with senior users?
