entitygenericpatia
ElevenLabs
One-line summary: AI voice platform offering ultra-low-latency TTS, voice cloning, and a hosted conversational agent framework — a candidate voice layer for patia's Phase 2 roadmap item and the TTS dependency used by clicky.
What it is
A voice AI platform (elevenlabs.io) spanning three product lines:
- ElevenCreative — text-to-speech in 70+ languages, music generation, sound effects, voice cloning
- ElevenAgents — conversational AI deployable to phone, chat, email, WhatsApp; includes analytics, guardrails, workflow integration
- ElevenAPI — developer APIs for the underlying capabilities
Why it matters to patia
Two reasons:
- Phase 2 voice candidate. patia's roadmap defers voice to Phase 2 (after the SMS + web chat retention signal is validated). ElevenLabs is one option to evaluate alongside Retell, Vapi, and Twilio voice. Flash v2.5's advertised ~75 ms latency is competitive for real-time conversational use, though that figure still has to be checked against end-to-end latency in a full conversational loop. The v3 model is marketed as "most expressive," which aligns with patia's product principle of a warm, patient, unhurried tone; for this audience, tone is arguably more important than raw latency.
- Used by clicky. If patia ever adopts a screen-aware mode (screen-aware-ai-for-seniors), ElevenLabs is already the working TTS reference in that pattern.
Key facts
- Flagship models: Flash v2.5 (ultra-low latency, ~75 ms), Multilingual v2 (consistent/lifelike), v3 (most expressive)
- Channels for ElevenAgents: phone, chat, email, WhatsApp
- Pricing: free tier available; paid tiers on elevenlabs.io/pricing — not captured in this source
- API posture: developer-first with REST + streaming options
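To make the "REST + streaming" posture concrete, here is a minimal sketch of how a TTS request might be assembled. The base URL, the `/text-to-speech/{voice_id}` path with its `/stream` variant, the `xi-api-key` header, and the `eleven_flash_v2_5` model ID are assumptions based on ElevenLabs' public API and should be verified against the current API reference. `build_tts_request`, `VOICE_ID`, and `API_KEY` are hypothetical names used only for illustration. The sketch builds the request but does not send it.

```python
import json
import urllib.request

# Assumed base URL; verify against the current API reference.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key,
                      model_id="eleven_flash_v2_5", stream=False):
    """Build (but do not send) a text-to-speech request.

    Hypothetical helper: path, header name, and model ID are assumptions.
    """
    path = f"/text-to-speech/{voice_id}"
    if stream:
        # Streaming variant: audio is returned in chunks as it is generated.
        path += "/stream"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


# Placeholder credentials and voice; nothing is sent over the network here.
req = build_tts_request("Good morning. Take your time.",
                        "VOICE_ID", "API_KEY", stream=True)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easier to swap providers later, since only this helper would need to change.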
Strengths (from patia's perspective)
- Latency is competitive for real-time voice conversation
- Expressiveness matters for this audience — a flat, robotic voice cuts against patia's tone principle; v3's expressiveness is a differentiator relative to older TTS engines
- Multi-channel agent framework could short-circuit some Phase 2 buildout if the hosted agent layer is good enough — though likely we keep agent logic in-house for control
- Already integrated in reference implementations of the screen-aware pattern
- One third-party accuracy endorsement for ElevenLabs Scribe 2 (STT). A reply on the insanely-fast-whisper-stt launch thread (2026-05-02-insanely-fast-whisper, @AlxAndrws) names Scribe 2 as the most accurate STT model in their experience and worth the extra cost over open-source. Single anecdotal data point — the only third-party accuracy comparison for any STT in the wiki so far. (The existing product-line section above covers Creative / Agents / API; ElevenLabs also offers an STT line called Scribe that isn't otherwise documented in this wiki yet.)
Weaknesses (from patia's perspective)
- Vendor lock-in risk if the expressive voices become a product signature — voice cloning means switching providers later is not trivial
- Price sensitivity unknown at pilot scale — need to pull the pricing sheet when Phase 2 planning starts
- Pricing pressure from new entrants. xAI launched grok-voice in April 2026 with TTS at $4.20 per million characters and STT at $0.10–$0.20/hr, claiming to be "10x cheaper" than ElevenLabs (2026-04-20-grok-voice-thread). Open-source and Chinese voice models are shipping in parallel; insanely-fast-whisper-stt is one concrete example, packaging Whisper variants for free local STT with diarization (2026-05-02-insanely-fast-whisper). The "10x cheaper" claim isn't independently verified, but the directional signal (voice AI is commoditizing fast) weakens ElevenLabs' pricing power and should be factored into the Phase 2 evaluation.
- Agent framework vs. our own agent. ElevenAgents is a full platform; patia likely wants to keep the agent core in-house (see src/lib/agent/ in the project repo) and only use ElevenLabs for TTS (and possibly STT) I/O
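The pricing-pressure figures above can be turned into a rough cost comparison. This is a sketch under stated assumptions: the grok-voice rates come from the launch thread, the "implied" ElevenLabs rate is only what the unverified "10x cheaper" claim would mean if taken literally (it is not a published ElevenLabs price), and the monthly volumes are hypothetical pilot numbers.

```python
def tts_cost_usd(characters, usd_per_million_chars):
    """Cost of synthesizing `characters` at a per-million-character rate."""
    return characters / 1_000_000 * usd_per_million_chars


def stt_cost_range_usd(audio_hours, low_per_hour=0.10, high_per_hour=0.20):
    """Low/high cost of transcribing `audio_hours` at a quoted hourly range."""
    return audio_hours * low_per_hour, audio_hours * high_per_hour


# From the grok-voice launch thread (USD per million characters).
GROK_TTS_RATE = 4.20
# Only what the "10x cheaper" claim would imply; NOT a published price.
IMPLIED_ELEVENLABS_RATE = GROK_TTS_RATE * 10

# Hypothetical pilot volumes, chosen purely for illustration.
monthly_chars = 5_000_000
monthly_audio_hours = 100

grok_tts = tts_cost_usd(monthly_chars, GROK_TTS_RATE)
implied_el_tts = tts_cost_usd(monthly_chars, IMPLIED_ELEVENLABS_RATE)
stt_low, stt_high = stt_cost_range_usd(monthly_audio_hours)

print(f"grok-voice TTS: ${grok_tts:.2f}/month")
print(f"ElevenLabs TTS if the claim held: ${implied_el_tts:.2f}/month")
print(f"grok-voice STT: ${stt_low:.2f} to ${stt_high:.2f}/month")
```

Once the real ElevenLabs pricing sheet is pulled for Phase 2 planning, the implied rate should be replaced with the actual tier price.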
Open questions
- Does ElevenLabs v3 expressiveness translate to older adults perceiving the voice as warmer, or does expressiveness skew younger/performative in ways that feel off?
- How does Flash v2.5 compare to Retell's end-to-end latency for a comparable conversational loop?
- Are there compliance considerations (HIPAA, call-recording law) for using ElevenAgents over phone channels with senior users?
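For the latency question, a provider-neutral way to compare candidates is to measure time to first audio chunk from the caller's side rather than relying on vendor-quoted model latency. The sketch below assumes each provider can be wrapped as a zero-argument function returning an iterable of audio chunks; `time_to_first_chunk` and `fake_tts_stream` are hypothetical names, and the fake stream stands in for a real streaming call.

```python
import time


def time_to_first_chunk(open_stream):
    """Return (seconds until the first chunk arrives, the first chunk).

    `open_stream` is any zero-argument callable returning an iterable of
    audio chunks, so the same harness can wrap different providers.
    """
    start = time.perf_counter()
    stream = open_stream()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_tts_stream():
    """Stand-in for a real streaming TTS call (hypothetical)."""
    time.sleep(0.05)  # simulated network + synthesis delay
    yield b"audio-chunk-1"
    yield b"audio-chunk-2"


elapsed, chunk = time_to_first_chunk(fake_tts_stream)
print(f"first chunk after {elapsed * 1000:.0f} ms")
```

A fair comparison would run the same prompt many times per provider from the same network location and compare distributions, not single measurements.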
Sources
- 2026-04-17-clicky-cursor-aware-ai-assistants
- 2026-04-20-grok-voice-thread — pricing-pressure context from a new entrant
- 2026-05-02-insanely-fast-whisper — concrete open-source STT pressure point; in-thread anecdote endorsing ElevenLabs Scribe 2 accuracy
Related
- clicky — uses ElevenLabs as its TTS layer
- grok-voice — alternative Phase 2 voice candidate; launched at a materially lower price point
- insanely-fast-whisper-stt — open-source local STT pressure point on commercial voice pricing
- screen-aware-ai-for-seniors — the screen-aware pattern implies voice output by default
Referenced by