Voice: STT, TTS, latency, and quality tuning

How Call2Me's voice stack works — Deepgram Nova-3 STT, ElevenLabs/Cartesia/OpenAI TTS, sub-500ms target latency. Picking the right voice per language.

Updated May 1, 2026

The voice stack is what makes a phone call feel real. This page covers the engines, the trade-offs, and how to tune for your use case.

The pipeline

Three streaming stages run concurrently, each consuming the previous stage's output as it arrives:

STT — streaming speech-to-text, transcribes as the caller talks.
LLM — reads the transcript, generates a response.
TTS — synthesizes the response audio and streams it back.

End-to-end target: under 500ms from "caller stops talking" to "agent starts speaking."
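The hand-off between stages can be illustrated with a minimal asyncio sketch. The stage functions below (`fake_stt`, `fake_llm`, `fake_tts`) are placeholders invented for illustration, not Call2Me or vendor APIs. They only show how stages connected by queues can run at the same time, so that TTS starts speaking as soon as the first reply token exists.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def fake_stt(audio_chunks, out: asyncio.Queue) -> None:
    """Placeholder STT: emits one 'word' per audio chunk as it arrives."""
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # yield control, as a real network read would
        await out.put(chunk.upper())
    await out.put(END)


async def fake_llm(inp: asyncio.Queue, out: asyncio.Queue) -> None:
    """Placeholder LLM: collects the transcript, then streams reply tokens."""
    words = []
    while (word := await inp.get()) is not END:
        words.append(word)
    for token in ["You", "said:", *words]:
        await out.put(token)
    await out.put(END)


async def fake_tts(inp: asyncio.Queue, spoken: list) -> None:
    """Placeholder TTS: 'speaks' each token as soon as it is available."""
    while (token := await inp.get()) is not END:
        spoken.append(token)


async def run_pipeline(audio_chunks) -> list:
    transcript_q: asyncio.Queue = asyncio.Queue()
    reply_q: asyncio.Queue = asyncio.Queue()
    spoken: list = []
    # All three stages are scheduled together and communicate through queues.
    await asyncio.gather(
        fake_stt(audio_chunks, transcript_q),
        fake_llm(transcript_q, reply_q),
        fake_tts(reply_q, spoken),
    )
    return spoken
```

Calling `asyncio.run(run_pipeline(["hi", "there"]))` drives the three coroutines to completion and returns the tokens the placeholder TTS "spoke".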

STT: Deepgram Nova-3 (default)

Nova-3 is the newest model in Deepgram's Nova family. It is a streaming model with multilingual auto-detect and roughly 150ms transcript latency.

Alternative: Whisper — slower, slightly more accurate on noisy lines, but the latency penalty rarely justifies it for live conversation.

For multilingual agents, leave the language set to auto-detect. Deepgram identifies the language from the first 1–2 seconds of speech and transcribes accordingly, so you don't need to declare it in advance.

TTS engines

Engine	Strengths	Cost	Best for
ElevenLabs Multilingual v2	Most natural across all 9 langs	Premium	Receptionist, sales — high-touch
ElevenLabs Turbo Multilingual	Fast, lower latency	Mid	Outbound campaigns, support
Cartesia Sonic Multilingual	Cheap, surprisingly good in TR/AR	Low	High-volume bulk
OpenAI TTS Multilingual	Solid baseline, included in voice base	Included	Cost-sensitive, default

Pick the engine per agent in the AgentDetail → Voice section. Preview every option before committing.

Per-language quality notes

Language	Best engine	Notes
Turkish	ElevenLabs Multilingual v2	Cartesia also strong
American English	All four high quality	OpenAI is the cheapest pick
British English	ElevenLabs (specific British voices)	OpenAI multilingual is close
German	ElevenLabs Multilingual v2	Cartesia second
French	ElevenLabs Multilingual v2	All engines acceptable
Spanish	ElevenLabs / Cartesia	Tied
Italian	ElevenLabs Multilingual v2	Best quality
Brazilian Portuguese	ElevenLabs Multilingual v2	OpenAI close behind
Arabic	Cartesia Sonic	Improving fastest, worth re-checking quarterly
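If you provision agents from a script, the table above can be encoded as a simple lookup. This is a sketch only: the locale codes and engine identifiers are illustrative labels chosen for this example, not values from the Call2Me API, and `pick_engine` is a hypothetical helper.

```python
# Recommended TTS engine per language, transcribed from the table above.
# Keys and values are illustrative labels, not Call2Me API identifiers.
RECOMMENDED_ENGINE = {
    "tr": "elevenlabs-multilingual-v2",
    "en-US": "openai-tts",  # all four are high quality; OpenAI is the cheapest
    "en-GB": "elevenlabs",  # specific British voices
    "de": "elevenlabs-multilingual-v2",
    "fr": "elevenlabs-multilingual-v2",
    "es": "elevenlabs",  # tied with Cartesia
    "it": "elevenlabs-multilingual-v2",
    "pt-BR": "elevenlabs-multilingual-v2",
    "ar": "cartesia-sonic",
}


def pick_engine(locale: str, fallback: str = "openai-tts") -> str:
    """Return the recommended engine, or the included baseline for unknown locales."""
    return RECOMMENDED_ENGINE.get(locale, fallback)
```

The fallback mirrors the engine table, where OpenAI TTS is listed as the included default. Since the Arabic row is flagged for quarterly re-checking, treat a mapping like this as data to revisit, not a constant.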

Latency budget

Where the 500ms goes:

STT: ~150ms (per chunk)
LLM first-token: ~150–300ms (depends on model)
TTS first-audio: ~80–150ms
Network: ~30–50ms
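Adding up the figures above shows why tuning matters: the best case fits comfortably, but the worst case overshoots the target. The sketch below is plain arithmetic on the listed ranges. It assumes the stages add up strictly one after another, which is a simplification, since streaming lets them overlap in practice.

```python
# (low, high) estimates in milliseconds, copied from the budget above.
BUDGET_MS = {
    "stt": (150, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (80, 150),
    "network": (30, 50),
}


def total_range(budget: dict) -> tuple:
    """Sum the low and high ends of every stage, assuming no overlap."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_range(BUDGET_MS)
print(f"best case {low} ms, worst case {high} ms")
# prints: best case 410 ms, worst case 650 ms
```

The LLM first-token range is the widest single contributor (150ms of spread), so model choice has the largest effect on whether a call lands under 500ms.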

To keep under 500ms median:

Use GPT-4o-mini or Claude 3 Haiku for the LLM (faster than the larger siblings).
Use ElevenLabs Turbo or OpenAI TTS for TTS (lower latency than v2).
Keep system prompt under 800 words (longer prompts → slower first-token).
Don't over-retrieve from KB (top-3 is usually enough).
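Two of these guidelines are easy to check automatically before deploying an agent. The helper below is a hypothetical pre-flight check, not part of Call2Me. The thresholds come from the list above, and the word count uses simple whitespace splitting as an approximation.

```python
from typing import List


def latency_warnings(
    system_prompt: str,
    kb_top_k: int,
    max_prompt_words: int = 800,
    max_top_k: int = 3,
) -> List[str]:
    """Return human-readable warnings for settings likely to slow first-token time."""
    warnings = []
    word_count = len(system_prompt.split())  # whitespace-delimited words
    if word_count >= max_prompt_words:
        warnings.append(
            f"system prompt is {word_count} words; keep it under {max_prompt_words}"
        )
    if kb_top_k > max_top_k:
        warnings.append(
            f"KB retrieval top-{kb_top_k} exceeds the suggested top-{max_top_k}"
        )
    return warnings
```

An empty list means both settings are within the suggested limits.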

For deeper detail see the sub-500ms latency guide.

Background sound

You can layer ambient audio under the agent's voice, such as typing, café murmur, or office hum. The setting lives under AgentDetail → Speech Settings → Background sound. Useful for:

Making AI calls feel less sterile.
Matching the expected acoustic environment (office, restaurant, retail).
Masking subtle audio glitches.

Pick from the gallery or upload your own short loop.

Voice consistency across languages

If your agent runs in multiple languages on the same call (rare, but it happens), pick a multilingual voice so the character stays consistent. Switching from en-US/nova to tr-TR/aysel mid-call is jarring; ElevenLabs Multilingual v2 keeps the same character across all 9 languages.

DTMF (keypad) input

Voice agents accept keypad presses. Use cases:

Language menu: "Press 1 for English, 2 for Turkish."
Account number entry.
IVR-style routing on top of the voice agent.

Configure the menu in the system prompt; the runtime parses DTMF tones and feeds them to the LLM as transcript text.
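To make the idea concrete, here is a minimal sketch of how keypad presses could be rendered as a transcript line for the LLM. The bracketed text format and the `dtmf_to_transcript` helper are assumptions made for illustration; the actual format the Call2Me runtime uses is not documented here.

```python
# Keys on a standard 12-key phone keypad.
VALID_KEYS = set("0123456789*#")


def dtmf_to_transcript(keys: str) -> str:
    """Render a run of keypad presses as one transcript line (format is assumed)."""
    if not keys:
        raise ValueError("no keys pressed")
    invalid = [k for k in keys if k not in VALID_KEYS]
    if invalid:
        raise ValueError(f"not DTMF keypad keys: {invalid}")
    return f"[caller pressed keys: {' '.join(keys)}]"
```

Under this assumed format, a caller pressing 2 at a language menu would reach the LLM as `[caller pressed keys: 2]`, and the system prompt would tell the model that 2 means Turkish.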

Voicemail detection

Agents detect voicemail (the "leave a message after the beep" pattern) and respond per your config:

Hang up — don't leave anything.
Leave a scripted message — define what to say.
Treat as human — risky; only with very short scripts.

Detection is on by default. Tunable per agent.
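The three behaviours can be modelled as a small configuration object. This is a hypothetical sketch: the class and field names are invented for illustration and do not reflect the Call2Me API, and the default action shown is an assumption (only "detection on by default" comes from this page).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VoicemailAction(Enum):
    HANG_UP = "hang_up"
    LEAVE_MESSAGE = "leave_message"
    TREAT_AS_HUMAN = "treat_as_human"


@dataclass
class VoicemailConfig:
    detection_enabled: bool = True  # detection is on by default
    action: VoicemailAction = VoicemailAction.HANG_UP  # assumed default
    message: Optional[str] = None  # only used with LEAVE_MESSAGE

    def __post_init__(self) -> None:
        # A scripted message must be defined when the agent is told to leave one.
        if self.action is VoicemailAction.LEAVE_MESSAGE and not self.message:
            raise ValueError("LEAVE_MESSAGE requires a scripted message")
```

Validating at construction time catches the most likely misconfiguration, choosing "leave a scripted message" without defining what to say.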

Recording quality

Call recordings are saved as MP3 at 16kHz mono, which is phone quality. They are stored in our buckets, and retention is configurable per workspace.

If you need higher-quality audio (e.g. for legal review), turn on raw WAV recording — it costs more storage but preserves full fidelity.
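For a rough sense of the storage difference, the arithmetic below compares uncompressed PCM with a compressed MP3 stream. The parameters are assumptions, not documented Call2Me values: 16-bit PCM at 16kHz mono for the WAV, and a 32 kbps bitrate for the MP3. Substitute your actual settings.

```python
def pcm_wav_bytes_per_minute(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Audio payload of uncompressed PCM per minute (ignores the small WAV header)."""
    return sample_rate_hz * (bit_depth // 8) * channels * 60


def mp3_bytes_per_minute(bitrate_kbps: int) -> int:
    """Size of a constant-bitrate MP3 stream per minute."""
    return bitrate_kbps * 1000 // 8 * 60


# Assumed parameters, for illustration only.
wav = pcm_wav_bytes_per_minute(sample_rate_hz=16_000, bit_depth=16, channels=1)
mp3 = mp3_bytes_per_minute(bitrate_kbps=32)
print(wav, mp3, wav / mp3)
# prints: 1920000 240000 8.0
```

Under these assumptions a WAV recording takes about 1.9 MB per minute, roughly eight times the MP3, so enable it only on agents whose calls genuinely need review-grade audio.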

Frequently asked

Q. What speech-to-text engine does Call2Me use?

Deepgram Nova-3 is the default for STT, with multilingual auto-detection across 9 languages. Whisper is available as an alternative.

Q. What text-to-speech engines are available?

ElevenLabs (Multilingual v2 + Turbo), Cartesia Sonic, and OpenAI TTS. Pick one per agent in the voice gallery.

Q. What's the typical end-to-end voice latency on Call2Me?

The target is under 500ms end-to-end (caller stops speaking → agent starts replying). The real-world median is around 400ms with Deepgram + GPT-4o-mini + ElevenLabs Turbo.

Q. Which languages does Call2Me support for voice?

Nine: Turkish, American English, British English, German, French, Spanish, Italian, Brazilian Portuguese, and Arabic. Multilingual auto-detect mode handles all nine on a single agent.
