The silence problem: why AI voice demos fail

Most AI voice demos fail on the same thing. It isn't the words, the accent, or the knowledge base.
It's the silence.
People answer in about 300ms. Past ~800ms, the caller talks over the agent. A demo hides that gap; a real call doesn't. Latency is the product.
A conversation is a clock
In human conversation, the gap between turns is tiny — around 300ms. We don't think about it, but we feel it instantly when it's wrong. A pause that runs long reads as hesitation, confusion, or "the line dropped." So the caller does what people always do with silence: they fill it. They repeat the question, they talk over the reply, they hang up.
That's why a voice agent's timing isn't a polish item. It's the thing the caller reacts to first, before they've processed a single word.
Why demos lie
A demo is scripted and calm. The presenter knows when to speak and waits politely for the agent. Real callers don't. They interrupt, they trail off, they answer a question with another question. Every one of those moments is a timing test the demo never ran — which is why agents that dazzle on stage stall on the first real call.
Fixing the silence
Closing the gap is a latency-and-turn-taking problem: detect end-of-speech quickly, respond inside the human window, and yield cleanly when the caller starts speaking again. Get that right and the words finally get a chance to matter.
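To make "detect end-of-speech quickly" concrete, here is a minimal sketch of one common approach: wait for a short run of silence after the caller has spoken. It assumes some upstream voice-activity detector has already labeled each audio frame as speech or not. The 20ms frame length, the 200ms hangover, and the function name are illustrative assumptions, not values from any particular product.

```python
from typing import Iterable, Optional

FRAME_MS = 20      # assumed frame length of the (hypothetical) upstream VAD
HANGOVER_MS = 200  # assumed trailing silence required before calling end-of-speech


def end_of_speech_ms(
    frames: Iterable[bool],
    frame_ms: int = FRAME_MS,
    hangover_ms: int = HANGOVER_MS,
) -> Optional[int]:
    """Return the time in ms (from stream start) at which end-of-speech
    is declared, or None if the caller never finishes a turn."""
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            # Any speech frame resets the silence counter.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Only count silence that follows speech, not leading silence.
            silence_ms += frame_ms
            if silence_ms >= hangover_ms:
                return (i + 1) * frame_ms
    return None
```

The hangover is the trade-off to watch: every millisecond spent confirming the caller has stopped is a millisecond added to the gap before the agent can reply, while too short a hangover cuts callers off mid-pause.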
For the behavior at each delay band, see "The 700ms wall"; for handling interruptions specifically, see "Barge-in: the feature that separates a demo from a product."
Frequently asked
Q. Why do AI voice demos fail in real conversations?
Because they fail on silence, not words. In a real exchange a person replies in roughly 300ms; if the agent's reply lands after about 800ms, the caller assumes it's their turn and talks over it. A scripted demo hides this gap; a live, unpredictable call exposes it. The model can be excellent and the demo still falls apart on timing.
Q. What is turn-taking in voice AI?
Turn-taking is the timing of who speaks when — detecting that the caller has finished, responding in the human window, and yielding gracefully if they start again. It's the difference between a conversation and two parties talking past each other. Good turn-taking is mostly a latency and interruption-handling problem.
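One way to picture turn-taking is as a small state machine. The sketch below is an illustration only: the state and event names are hypothetical, and a real agent would also have to cancel in-flight audio and re-plan its reply when the caller cuts in.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"  # caller has the floor
    THINKING = "thinking"    # caller finished; reply is being prepared
    SPEAKING = "speaking"    # agent has the floor


class Event(Enum):
    CALLER_STARTED = "caller_started"
    CALLER_FINISHED = "caller_finished"
    REPLY_READY = "reply_ready"
    REPLY_DONE = "reply_done"


# Hypothetical transition table; any pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.LISTENING, Event.CALLER_FINISHED): State.THINKING,
    (State.THINKING, Event.REPLY_READY): State.SPEAKING,
    # Caller resumed before the reply started: drop it and keep listening.
    (State.THINKING, Event.CALLER_STARTED): State.LISTENING,
    # Caller interrupted mid-reply: yield the floor.
    (State.SPEAKING, Event.CALLER_STARTED): State.LISTENING,
    (State.SPEAKING, Event.REPLY_DONE): State.LISTENING,
}


def step(state: State, event: Event) -> State:
    """Apply one event and return the next state."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.LISTENING) -> State:
    """Fold a sequence of events over the state machine."""
    for event in events:
        state = step(state, event)
    return state
```

The two `CALLER_STARTED` transitions are the "yielding gracefully" part: whenever the caller speaks, the agent goes back to listening, whether or not its own reply has started.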
Q. How fast does a voice agent need to respond?
Aim for the human window — under about 500ms from end-of-speech to first audio, and certainly under 800ms. Past that, callers start talking over the agent and the conversation collapses.
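As a rough sketch of how those two thresholds might be used in practice, the snippet below adds up a per-stage latency budget and checks it against the 500ms target and the 800ms limit. The stage names and timings are made-up numbers for illustration, not measurements.

```python
from typing import Mapping

TARGET_MS = 500  # "under about 500ms" target from the answer above
LIMIT_MS = 800   # past this, callers start talking over the agent


def total_gap_ms(stages: Mapping[str, float]) -> float:
    """Sum per-stage latencies from end-of-speech to first audio."""
    return sum(stages.values())


def classify_gap(gap_ms: float) -> str:
    """Label a response gap against the two thresholds."""
    if gap_ms < TARGET_MS:
        return "human window"
    if gap_ms < LIMIT_MS:
        return "at risk"
    return "caller talks over"


# Hypothetical per-stage timings in milliseconds (assumed, not measured).
example_budget = {
    "endpointing": 200,
    "speech_to_text": 100,
    "model_first_token": 250,
    "text_to_speech_first_audio": 120,
}

gap = total_gap_ms(example_budget)
print(gap, classify_gap(gap))
```

With these assumed numbers the total comes to 670ms, which lands in the "at risk" band: under the 800ms limit but outside the 500ms target.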
Q. Is latency really more important than the language model?
For the feel of a live call, yes. A perfect answer delivered into a silence the caller has already filled is worse than a good answer delivered on time. Latency is the product.
Keep reading
Voice AI · The 700ms wall: what callers actually do at each latency band
Everyone benchmarks voice-AI latency. Almost nobody talks about what callers DO at each delay band. Under 500ms feels human; ~700ms reads as robotic; 900ms+ and they talk over the agent or hang up.
Jun 19, 2026 · 2 min
Voice AI · Latency, corrected: P95, P99, and the jitter you can't beat
A popular post on the 700ms latency wall drew pushback from the voice-infra community. Three corrections worth keeping: it's not 'fast' but 'well-timed'; track P95/P99 not the average; you don't beat the telco leg, you plan around it.
Jun 23, 2026 · 2 min
Voice AI · Barge-in: the feature that separates a demo from a product
You can tell an AI voice agent is fake the moment you try to interrupt it. Real conversation isn't strictly turn-based: people cut in. Barge-in means detecting that in under 300ms, stopping mid-word, and re-planning.
Jun 15, 2026 · 2 min