Voice: STT, TTS, latency, and quality tuning
How Call2Me's voice stack works — Deepgram Nova-3 STT, ElevenLabs/Cartesia/OpenAI TTS, sub-500ms target latency. Picking the right voice per language.
Updated May 1, 2026
The voice stack is what makes a phone call feel real. This page covers the engines, the trade-offs, and how to tune for your use case.
The pipeline
Three streaming stages run concurrently, each consuming the previous stage's output as it arrives:
- STT — streaming speech-to-text, transcribes as the caller talks.
- LLM — reads the transcript, generates a response.
- TTS — synthesizes the response audio and streams it back.
End-to-end target: under 500ms from "caller stops talking" to "agent starts speaking."
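The hand-off between the three stages can be sketched with queues. This is a minimal illustration and not Call2Me's runtime: the `stt`, `llm`, and `tts` coroutines are hypothetical stand-ins that only pass strings along, to show that each stage starts work as soon as the previous one emits something.

```python
import asyncio

async def stt(audio_chunks, transcripts):
    # Stand-in for streaming STT: emit one transcript per audio chunk.
    for chunk in audio_chunks:
        await transcripts.put(f"text({chunk})")
    await transcripts.put(None)  # end-of-stream marker

async def llm(transcripts, replies):
    # Stand-in for the LLM: respond to each transcript as it arrives.
    while (text := await transcripts.get()) is not None:
        await replies.put(f"reply to {text}")
    await replies.put(None)  # propagate end-of-stream

async def tts(replies, spoken):
    # Stand-in for TTS: "synthesize" each reply and collect the result.
    while (reply := await replies.get()) is not None:
        spoken.append(f"audio[{reply}]")

async def run_call(audio_chunks):
    transcripts, replies, spoken = asyncio.Queue(), asyncio.Queue(), []
    # All three stages are scheduled together and communicate via the queues.
    await asyncio.gather(
        stt(audio_chunks, transcripts),
        llm(transcripts, replies),
        tts(replies, spoken),
    )
    return spoken

result = asyncio.run(run_call(["hello", "world"]))
```

In a real stack each stand-in would wrap a streaming client for the chosen engine; the queue structure is the part the sketch is meant to show.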
STT: Deepgram Nova-3 (default)
Deepgram's latest Nova family. Streaming, multilingual auto-detect, ~150ms transcript latency.
Alternative: Whisper — slower, slightly more accurate on noisy lines, but the latency penalty rarely justifies it for live conversation.
For multilingual agents, leave the language set to auto-detect. Deepgram identifies the language from the first 1–2 seconds of speech and transcribes accordingly, so you don't need to declare it in advance.
TTS engines
| Engine | Strengths | Cost | Best for |
|---|---|---|---|
| ElevenLabs Multilingual v2 | Most natural across all 9 languages | Premium | Receptionist, sales (high-touch) |
| ElevenLabs Turbo Multilingual | Fast, lower latency | Mid | Outbound campaigns, support |
| Cartesia Sonic Multilingual | Cheap, surprisingly good in TR/AR | Low | High-volume bulk |
| OpenAI TTS Multilingual | Solid baseline, included in voice base | Included | Cost-sensitive, default |
Pick per agent on AgentDetail → Voice section. Preview every option before committing.
Per-language quality notes
| Language | Best engine | Notes |
|---|---|---|
| Turkish | ElevenLabs Multilingual v2 | Cartesia also strong |
| American English | All four high quality | OpenAI is the cheapest pick |
| British English | ElevenLabs (specific British voices) | OpenAI multilingual is close |
| German | ElevenLabs Multilingual v2 | Cartesia second |
| French | ElevenLabs Multilingual v2 | All engines acceptable |
| Spanish | ElevenLabs / Cartesia | Tied |
| Italian | ElevenLabs Multilingual v2 | Best quality |
| Brazilian Portuguese | ElevenLabs Multilingual v2 | OpenAI close behind |
| Arabic | Cartesia Sonic | Improving fastest, worth re-checking quarterly |
Latency budget
Approximate per-stage contributions to the 500ms budget:
- STT: ~150ms (per chunk)
- LLM first-token: ~150–300ms (depends on model)
- TTS first-audio: ~80–150ms
- Network: ~30–50ms
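Adding up the per-stage figures shows how little headroom there is. The sketch below only sums the approximate ranges listed above; the dictionary keys are illustrative names, not configuration fields.

```python
# (low_ms, high_ms) per stage, copied from the list above.
BUDGET_MS = {
    "stt": (150, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (80, 150),
    "network": (30, 50),
}

TARGET_MS = 500

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())

print(f"best case:  {best_case}ms ({TARGET_MS - best_case}ms under target)")
print(f"worst case: {worst_case}ms ({worst_case - TARGET_MS}ms over target)")
```

With every stage at its fast end the total is 410ms; with every stage at its slow end it is 650ms. LLM first-token time has the widest range, so it is the first place to look when a call runs over budget.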
To keep under 500ms median:
- Use GPT-4o-mini or Claude 3 Haiku for the LLM (faster than the larger siblings).
- Use ElevenLabs Turbo or OpenAI TTS for TTS (lower latency than ElevenLabs Multilingual v2).
- Keep system prompt under 800 words (longer prompts → slower first-token).
- Don't over-retrieve from KB (top-3 is usually enough).
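The prompt-length and retrieval guidelines are easy to check before deploying an agent. The helper below is a hypothetical pre-flight check, not part of Call2Me; `latency_warnings` and its parameters are names invented for this sketch, and the thresholds are the ones from the list above.

```python
from typing import List

MAX_PROMPT_WORDS = 800  # "keep system prompt under 800 words"
MAX_KB_TOP_K = 3        # "top-3 is usually enough"

def latency_warnings(system_prompt: str, kb_top_k: int) -> List[str]:
    """Return warnings for settings that are likely to slow first-token time."""
    warnings = []
    words = len(system_prompt.split())  # whitespace-delimited word count
    if words >= MAX_PROMPT_WORDS:
        warnings.append(
            f"system prompt is {words} words; keep it under {MAX_PROMPT_WORDS}"
        )
    if kb_top_k > MAX_KB_TOP_K:
        warnings.append(
            f"KB top_k is {kb_top_k}; top-{MAX_KB_TOP_K} is usually enough"
        )
    return warnings

for warning in latency_warnings("You are a friendly receptionist.", kb_top_k=5):
    print(warning)
```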
For deeper detail see the sub-500ms latency guide.
Background sound
You can layer ambient audio under the agent's voice: typing, café murmur, office hum. Configure it under AgentDetail → Speech Settings → Background sound. Useful for:
- Making AI calls feel less sterile.
- Matching the expected acoustic environment (office, restaurant, retail).
- Masking subtle audio glitches.
Pick from the gallery or upload your own short loop.
Voice consistency across languages
If your agent switches languages within a single call (rare, but it happens), pick a multilingual voice so the character stays consistent. Switching from en-US/nova to tr-TR/aysel mid-call is jarring; ElevenLabs Multilingual v2 keeps the same character across all 9 languages.
DTMF (keypad) input
Voice agents accept keypad presses. Use cases:
- Language menu: "Press 1 for English, 2 for Turkish."
- Account number entry.
- IVR-style routing on top of the voice agent.
Configure the menu in the system prompt; the runtime parses DTMF tones and feeds them to the LLM as transcript text.
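As a rough illustration of what "keypad presses as transcript text" can look like, the sketch below turns a run of keys into a single transcript line. The bracketed format and the `dtmf_to_transcript` name are assumptions made for this example; the exact text the runtime passes to the LLM is not specified here.

```python
VALID_KEYS = set("0123456789*#")

def dtmf_to_transcript(keys: str) -> str:
    """Render a run of keypad presses as one transcript line for the LLM."""
    pressed = [k for k in keys if k in VALID_KEYS]  # drop anything that is not a DTMF key
    if not pressed:
        return ""
    return f"[caller pressed: {''.join(pressed)}]"

print(dtmf_to_transcript("1"))       # [caller pressed: 1]
print(dtmf_to_transcript("40213#"))  # [caller pressed: 40213#]
```

Because the LLM sees the presses as ordinary text, the system prompt can describe the menu in plain language, for example: "If the caller presses 1, continue in English; if they press 2, continue in Turkish."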
Voicemail detection
Agents detect voicemail (the "leave a message after the beep" pattern) and respond per your config:
- Hang up — don't leave anything.
- Leave a scripted message — define what to say.
- Treat as human — risky; only with very short scripts.
Detection is on by default. Tunable per agent.
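The three options can be modeled as a small enum plus a dispatch function. This is a sketch of the decision logic only; `VoicemailAction`, `voicemail_steps`, and the step strings are hypothetical names, not Call2Me configuration keys.

```python
from enum import Enum
from typing import List, Optional

class VoicemailAction(Enum):
    HANG_UP = "hang_up"
    LEAVE_MESSAGE = "leave_message"
    TREAT_AS_HUMAN = "treat_as_human"

def voicemail_steps(action: VoicemailAction,
                    message: Optional[str] = None) -> List[str]:
    """Return the ordered steps the agent takes once voicemail is detected."""
    if action is VoicemailAction.HANG_UP:
        return ["hang_up"]
    if action is VoicemailAction.LEAVE_MESSAGE:
        if not message:
            raise ValueError("LEAVE_MESSAGE needs a scripted message")
        return [f"say: {message}", "hang_up"]
    # TREAT_AS_HUMAN: carry on as if a person answered.
    return ["continue_conversation"]

print(voicemail_steps(VoicemailAction.LEAVE_MESSAGE, "Please call us back."))
```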
Recording quality
Call recordings are MP3 at 16kHz mono, which is phone quality. Recordings are stored in our buckets, and retention is configurable per workspace.
If you need higher-quality audio (e.g. for legal review), turn on raw WAV recording — it costs more storage but preserves full fidelity.
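To get a feel for the storage difference, the arithmetic below compares uncompressed WAV with MP3 per minute of audio. The WAV format (16-bit PCM at 16kHz mono) and the 32 kbps MP3 bitrate are assumptions chosen for illustration; check your workspace's actual recording settings before estimating costs.

```python
def wav_bytes_per_minute(sample_rate_hz: int = 16_000,
                         bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Uncompressed PCM size per minute (file header ignored)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * 60

def mp3_bytes_per_minute(bitrate_kbps: int = 32) -> int:
    """Constant-bitrate MP3 size per minute."""
    return bitrate_kbps * 1000 // 8 * 60

wav = wav_bytes_per_minute()
mp3 = mp3_bytes_per_minute()
print(f"WAV: {wav / 1e6:.2f} MB/min, MP3: {mp3 / 1e6:.2f} MB/min, "
      f"ratio: {wav / mp3:.0f}x")
```

Under these assumptions a minute of WAV takes 1.92 MB against 0.24 MB for MP3, roughly eight times the storage.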
Frequently asked
Q. What speech-to-text engine does Call2Me use?
Deepgram Nova-3 is the default for STT, with multilingual auto-detection across 9 languages. Whisper is available as an alternative.
Q. What text-to-speech engines are available?
ElevenLabs (Multilingual v2 + Turbo), Cartesia Sonic, and OpenAI TTS. Pick one per agent in the voice gallery.
Q. What's the typical end-to-end voice latency on Call2Me?
Sub-500ms target end-to-end (caller stops speaking → agent starts replying). Real-world median is around 400ms with Deepgram + GPT-4o-mini + ElevenLabs Turbo.
Q. Which languages does Call2Me support for voice?
Nine: Turkish, American English, British English, German, French, Spanish, Italian, Brazilian Portuguese, and Arabic. Multilingual auto-detect mode handles all nine on a single agent.