ElevenLabs

One-line summary: AI voice platform offering ultra-low-latency TTS, voice cloning, and a hosted conversational agent framework — a candidate voice layer for patia's Phase 2 roadmap item and the TTS dependency used by clicky.

What it is

A voice AI platform (elevenlabs.io) spanning three product lines:

ElevenCreative — text-to-speech in 70+ languages, music generation, sound effects, voice cloning
ElevenAgents — conversational AI deployable to phone, chat, email, WhatsApp; includes analytics, guardrails, workflow integration
ElevenAPI — developer APIs for the underlying capabilities

Why it matters to patia

Two reasons:

Phase 2 voice candidate. patia's roadmap defers voice to Phase 2 (after the SMS + web chat retention signal is validated). ElevenLabs is one option to evaluate alongside Retell, Vapi, and Twilio voice. Flash v2.5 at ~75 ms latency is competitive for real-time conversational use. The v3 model is marketed as "most expressive," which aligns with patia's product principle of a warm, patient, unhurried tone — arguably more important than raw latency for this audience.
Used by clicky. If patia ever adopts a screen-aware mode (screen-aware-ai-for-seniors), ElevenLabs is already the working TTS reference in that pattern.

Key facts

Flagship models: Flash v2.5 (ultra-low latency, ~75 ms), Multilingual v2 (consistent/lifelike), v3 (most expressive)
Channels for ElevenAgents: phone, chat, email, WhatsApp
Pricing: free tier available; paid tiers on elevenlabs.io/pricing — not captured in this source
API posture: developer-first with REST + streaming options
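
To make the REST + streaming posture concrete, here is a minimal sketch of how a TTS request might be assembled. The endpoint path, the xi-api-key header, and the eleven_flash_v2_5 model ID are recalled from ElevenLabs' public API documentation rather than captured in this wiki's sources, so treat them as assumptions to verify. The helper name build_tts_request and the placeholder voice ID and key are hypothetical. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Assumption: base URL and paths as described in ElevenLabs' public docs;
# confirm against the current API reference before relying on them.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text, voice_id, api_key,
                      model_id="eleven_flash_v2_5", stream=False):
    """Build (but do not send) a text-to-speech HTTP request."""
    # The streaming variant is assumed to live under a /stream suffix.
    path = f"/text-to-speech/{voice_id}" + ("/stream" if stream else "")
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


# Placeholder credentials and voice ID; hypothetical values for illustration.
req = build_tts_request(
    "Good morning. Take your time.",
    voice_id="VOICE_ID_PLACEHOLDER",
    api_key="API_KEY_PLACEHOLDER",
    stream=True,
)
print(req.get_method(), req.full_url)
```

Keeping the request builder separate from the network call would let the same function sit behind an in-house TTS adapter, so the agent core never touches vendor-specific details.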

Strengths (from patia's perspective)

Latency is competitive for real-time voice conversation
Expressiveness matters for this audience — a flat, robotic voice cuts against patia's tone principle; v3's expressiveness is a differentiator relative to older TTS engines
Multi-channel agent framework could short-circuit some Phase 2 buildout if the hosted agent layer is good enough, though we will likely keep agent logic in-house for control
Already integrated in reference implementations of the screen-aware pattern
One third-party accuracy endorsement for ElevenLabs Scribe 2 (STT). A reply on the insanely-fast-whisper-stt launch thread (2026-05-02-insanely-fast-whisper, @AlxAndrws) names Scribe 2 as the most accurate STT model in the replier's experience and worth the extra cost over open-source alternatives. This is a single anecdotal data point, but it is also the only third-party accuracy comparison for any STT in the wiki so far. (The product lines listed under "What it is" cover Creative, Agents, and API; ElevenLabs also offers an STT line called Scribe that isn't otherwise documented in this wiki yet.)

Weaknesses (from patia's perspective)

Vendor lock-in risk if the expressive voices become a product signature — voice cloning means switching providers later is not trivial
Price sensitivity unknown at pilot scale — need to pull the pricing sheet when Phase 2 planning starts
Pricing pressure from new entrants. xAI launched grok-voice in April 2026 with TTS at $4.20 per million characters and STT at $0.10–$0.20/hr, claiming to be "10x cheaper" than ElevenLabs (2026-04-20-grok-voice-thread). Open-source and Chinese voice models are shipping in parallel; insanely-fast-whisper-stt is one concrete example, packaging Whisper variants for free local STT with diarization (2026-05-02-insanely-fast-whisper). The "10x cheaper" claim isn't independently verified, but the directional signal (voice AI is commoditizing fast) weakens ElevenLabs' pricing power going into the Phase 2 evaluation.
Agent framework vs. our own agent — ElevenAgents is a full platform; patia likely wants to keep the agent core in-house (see src/lib/agent/ in the project repo) and only use ElevenLabs for TTS (and possibly STT) I/O
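
A back-of-envelope sketch of what the pricing gap could mean at pilot scale. Only the $4.20 per million characters figure comes from the grok-voice thread. The pilot volume (users, replies, characters per reply) is a hypothetical placeholder, and the "implied" ElevenLabs rate simply takes the unverified "10x cheaper" claim at face value. It is not ElevenLabs' actual pricing, which this wiki has not captured.

```python
GROK_TTS_USD_PER_MILLION_CHARS = 4.20  # from 2026-04-20-grok-voice-thread
CLAIMED_MULTIPLE = 10                  # unverified "10x cheaper" marketing claim


def tts_cost_usd(characters, usd_per_million_chars):
    """Cost of synthesizing `characters` at a per-million-character rate."""
    return characters / 1_000_000 * usd_per_million_chars


# Hypothetical pilot volume; replace with real usage once it is measured.
users = 50
replies_per_user_per_day = 20
chars_per_reply = 200
days = 30

monthly_chars = users * replies_per_user_per_day * chars_per_reply * days

grok_monthly = tts_cost_usd(monthly_chars, GROK_TTS_USD_PER_MILLION_CHARS)
# Implied rate if the claim were literally true; NOT a quoted ElevenLabs price.
implied_rate = GROK_TTS_USD_PER_MILLION_CHARS * CLAIMED_MULTIPLE
implied_monthly = tts_cost_usd(monthly_chars, implied_rate)

print(f"characters per month: {monthly_chars:,}")
print(f"grok-voice TTS: ${grok_monthly:.2f}")
print(f"implied ElevenLabs TTS (claim at face value): ${implied_monthly:.2f}")
```

Under these placeholder volumes both figures are small in absolute terms, which suggests the gap matters more at scale than during a pilot. The real comparison needs the actual pricing sheet.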

Open questions

Does ElevenLabs v3 expressiveness translate to older adults perceiving the voice as warmer, or does expressiveness skew younger/performative in ways that feel off?
How does Flash v2.5 compare to Retell's end-to-end latency for a comparable conversational loop?
Are there compliance considerations (HIPAA, call-recording law) for using ElevenAgents over phone channels with senior users?
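
On the latency question, a TTS model figure and an end-to-end figure are not directly comparable, because TTS is only one stage of a conversational turn. The sketch below shows one way to frame the comparison as a per-turn budget. Only the ~75 ms TTS figure comes from this page; the STT, LLM, and network values are hypothetical placeholders, and the class name TurnLatencyBudget is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnLatencyBudget:
    """Rough per-turn latency budget for a voice loop, in milliseconds."""
    stt_ms: float
    llm_first_token_ms: float
    tts_first_audio_ms: float
    network_ms: float

    def total_ms(self):
        # Simplifying assumption: stages run strictly one after another.
        return (self.stt_ms + self.llm_first_token_ms
                + self.tts_first_audio_ms + self.network_ms)

    def tts_share(self):
        # Fraction of the turn attributable to TTS alone.
        return self.tts_first_audio_ms / self.total_ms()


budget = TurnLatencyBudget(
    stt_ms=300,              # placeholder, not measured
    llm_first_token_ms=500,  # placeholder, not measured
    tts_first_audio_ms=75,   # Flash v2.5 figure cited on this page
    network_ms=150,          # placeholder, not measured
)
print(f"total: {budget.total_ms():.0f} ms, TTS share: {budget.tts_share():.1%}")
```

A fair comparison against Retell would fill every field with measured values for the same conversational loop, rather than setting one vendor's model latency against another's end-to-end number.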

Sources

2026-04-17-clicky-cursor-aware-ai-assistants
2026-04-20-grok-voice-thread — pricing-pressure context from a new entrant
2026-05-02-insanely-fast-whisper — concrete open-source STT pressure point; in-thread anecdote endorsing ElevenLabs Scribe 2 accuracy

Related

clicky — uses ElevenLabs as its TTS layer
grok-voice — alternative Phase 2 voice candidate; launched at a materially lower price point
insanely-fast-whisper-stt — open-source local STT pressure point on commercial voice pricing
screen-aware-ai-for-seniors — the screen-aware pattern implies voice output by default
