
Voice: STT, TTS, latency, and quality tuning

How Call2Me's voice stack works — Deepgram Nova-3 STT, ElevenLabs/Cartesia/OpenAI TTS, sub-500ms target latency. Picking the right voice per language.

Updated May 1, 2026

The voice stack is what makes a phone call feel real. This page covers the engines, the trade-offs, and how to tune for your use case.

The pipeline

Three primitives running in parallel:

  1. STT — streaming speech-to-text, transcribes as the caller talks.
  2. LLM — reads the transcript, generates a response.
  3. TTS — synthesizes the response audio and streams it back.

End-to-end target: under 500ms from "caller stops talking" to "agent starts speaking."
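The three stages can be sketched as concurrent tasks wired together with queues. This is a toy illustration, not the Call2Me runtime; every function name and payload string here is a stand-in:

```python
import asyncio

async def stt(audio_chunks, transcripts):
    # STT stage: emit a partial transcript for each audio chunk as it arrives.
    async for chunk in audio_chunks:
        await transcripts.put(f"transcript({chunk})")
    await transcripts.put(None)  # caller stopped talking

async def llm(transcripts, replies):
    # LLM stage: read the transcript, then generate a response.
    parts = []
    while (t := await transcripts.get()) is not None:
        parts.append(t)
    await replies.put("reply to: " + " ".join(parts))
    await replies.put(None)

async def tts(replies, speaker):
    # TTS stage: "synthesize" the response and stream it back to the caller.
    while (r := await replies.get()) is not None:
        speaker.append(f"audio({r})")

async def call():
    async def mic():  # fake caller audio
        for chunk in ["hi", "there"]:
            yield chunk

    transcripts, replies, speaker = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        stt(mic(), transcripts),
        llm(transcripts, replies),
        tts(replies, speaker),
    )
    return speaker

print(asyncio.run(call()))  # ['audio(reply to: transcript(hi) transcript(there))']
```

A production pipeline streams at finer granularity (partial transcripts, per-sentence synthesis), which is where most of the latency savings come from, but the shape is the same: three stages, always running, handing off as early as possible.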

STT: Deepgram Nova-3 (default)

Deepgram's latest Nova family. Streaming, multilingual auto-detect, ~150ms transcript latency.

Alternative: Whisper — slower, slightly more accurate on noisy lines, but the latency penalty rarely justifies it for live conversation.

For multilingual agents, leave the language as auto-detect. Deepgram identifies the language from the first 1–2 seconds and transcribes accordingly. You don't pre-declare.

TTS engines

Engine | Strengths | Cost | Best for
ElevenLabs Multilingual v2 | Most natural across all 9 langs | Premium | Receptionist, sales — high-touch
ElevenLabs Turbo Multilingual | Fast, lower latency | Mid | Outbound campaigns, support
Cartesia Sonic Multilingual | Cheap, surprisingly good in TR/AR | Low | High-volume bulk
OpenAI TTS Multilingual | Solid baseline, included in voice base | Included | Cost-sensitive, default

Pick per agent on AgentDetail → Voice section. Preview every option before committing.

Per-language quality notes

Language | Best engine | Notes
Turkish | ElevenLabs Multilingual v2 | Cartesia also strong
American English | All four high quality | OpenAI is the cheapest pick
British English | ElevenLabs (specific BR voices) | OpenAI multilingual is close
German | ElevenLabs Multilingual v2 | Cartesia second
French | ElevenLabs Multilingual v2 | All engines acceptable
Spanish | ElevenLabs / Cartesia | Tied
Italian | ElevenLabs Multilingual v2 | Best quality
Brazilian Portuguese | ElevenLabs Multilingual v2 | OpenAI close behind
Arabic | Cartesia Sonic | Improving fastest, worth re-checking quarterly

Latency budget

Where the 500ms goes:

  • STT: ~150ms (per chunk)
  • LLM first-token: ~150–300ms (depends on model)
  • TTS first-audio: ~80–150ms
  • Network: ~30–50ms
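Summing the ranges above shows how tight the budget is; the target only holds toward the fast end of each range:

```python
# Per-stage estimates from the budget above, as (min_ms, max_ms).
budget = {
    "stt": (150, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (80, 150),
    "network": (30, 50),
}
best = sum(lo for lo, _ in budget.values())
worst = sum(hi for _, hi in budget.values())
print(best, worst)  # 410 650
```

Best case clears 500ms with ~90ms of headroom; worst case overruns it by 150ms, which is why the model and TTS choices below matter.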

To keep under 500ms median:

  • Use GPT-4o-mini or Claude 3 Haiku for the LLM (faster than the larger siblings).
  • Use ElevenLabs Turbo or OpenAI TTS for TTS (lower latency than v2).
  • Keep the system prompt under 800 words (longer prompts → slower first-token).
  • Don't over-retrieve from KB (top-3 is usually enough).

For deeper detail see the sub-500ms latency guide.

Background sound

You can layer ambient audio under the agent's voice — typing, café murmur, office hum. Lives on AgentDetail → Speech Settings → Background sound. Useful for:

  • Making AI calls feel less sterile.
  • Matching the expected acoustic environment (office, restaurant, retail).
  • Masking subtle audio glitches.

Pick from the gallery or upload your own short loop.

Voice consistency across languages

If your agent runs in multiple languages on the same call (rare but happens), pick a multilingual voice so the character stays consistent. Switching from en-US/nova to tr-TR/aysel mid-call is jarring; ElevenLabs Multilingual v2 stays the same character across all 9 languages.

DTMF (keypad) input

Voice agents accept keypad presses. Use cases:

  • Language menu: "Press 1 for English, 2 for Turkish."
  • Account number entry.
  • IVR-style routing on top of the voice agent.

Configure the menu in the system prompt; the runtime parses DTMF tones and feeds them to the LLM as transcript text.
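The tone-to-transcript step might look like this. A sketch only, assuming a hypothetical two-option menu; the runtime's actual transcript format may differ:

```python
MENU = {"1": "English", "2": "Turkish"}  # illustrative menu, defined in the system prompt

def dtmf_to_transcript(tone: str) -> str:
    # Fold a keypad press into plain text the LLM reads like any other utterance.
    label = MENU.get(tone)
    if label is None:
        return f"[caller pressed {tone}]"
    return f"[caller pressed {tone}: {label}]"

print(dtmf_to_transcript("2"))  # [caller pressed 2: Turkish]
```

Because the press arrives as transcript text, the system prompt can handle it with ordinary instructions ("if the caller presses 2, continue in Turkish") rather than separate routing logic.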

Voicemail detection

Agents detect voicemail (the "leave a message after the beep" pattern) and respond per your config:

  • Hang up — don't leave anything.
  • Leave a scripted message — define what to say.
  • Treat as human — risky; only with very short scripts.

Detection is on by default. Tunable per agent.
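In config terms, the three behaviors might look like this. Field names are hypothetical, not the real per-agent schema:

```python
# Hypothetical per-agent voicemail settings (illustrative field names).
voicemail = {
    "detection": True,                # on by default
    "on_voicemail": "leave_message",  # or "hang_up" / "treat_as_human"
    "message": "Hi, this is the front desk assistant. We'll call back shortly.",
}

def handle_machine_answer(cfg: dict) -> str:
    # Decide what the agent does when the beep pattern is detected.
    if not cfg.get("detection", True):
        return "treat_as_human"  # detection off: the agent never knows
    if cfg["on_voicemail"] == "leave_message":
        return f"say: {cfg['message']}"
    return cfg["on_voicemail"]  # "hang_up" or "treat_as_human"

print(handle_machine_answer(voicemail))
```
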

Recording quality

Call recordings are MP3 at 16kHz mono — phone-quality. Storage in our buckets, retention configurable per workspace.

If you need higher-quality audio (e.g. for legal review), turn on raw WAV recording — it costs more storage but preserves full fidelity.
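Rough storage math for a 10-minute call. The MP3 bitrate is an assumption (the page doesn't state one); the WAV figure follows directly from 16kHz mono 16-bit PCM:

```python
seconds = 10 * 60
wav_bytes = 16_000 * 2 * seconds   # 16kHz * 2 bytes/sample * duration (mono)
mp3_bytes = 32_000 // 8 * seconds  # assuming ~32 kbps MP3
print(wav_bytes // 1_000_000, mp3_bytes // 1_000_000)  # 19 2
```

Roughly 19 MB vs 2 MB, so at this assumed bitrate raw WAV costs about 8x the storage per call.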

Frequently asked

Q. What speech-to-text engine does Call2Me use?

Deepgram Nova-3 is the default for STT, with multilingual auto-detection across 9 languages. Whisper is available as an alternative.

Q. What text-to-speech engines are available?

ElevenLabs (Multilingual v2 + Turbo), Cartesia Sonic, and OpenAI TTS. Pick one per agent in the voice gallery.

Q. What's the typical end-to-end voice latency on Call2Me?

Sub-500ms target end-to-end (caller stops speaking → agent starts replying). Real-world median is around 400ms with Deepgram + GPT-4o-mini + ElevenLabs Turbo.

Q. Which languages does Call2Me support for voice?

Nine: Turkish, American English, British English, German, French, Spanish, Italian, Brazilian Portuguese, and Arabic. Multilingual auto-detect mode handles all nine on a single agent.


Ready to ship?

Spin up your first agent in 5 minutes — $10 free credit.

Start free