Notes on Voice Agent Latency (and Deepgram’s VAQI)
deepgram recently released their voice-agent quality index (VAQI). it got me thinking about everything i’ve noticed in building voice agents over the past year. this post is a dump of those thoughts.
1. is VAQI the first attempt at measuring what actually matters?
i’ve read latency benchmarks before: STT in 200 ms, LLM first token in 600 ms, TTS done in 300 ms. looks clean on paper.
but in a real call? users feel awkwardness when:
the bot talks over them (interruptions).
there’s a weird gap after they stop (missed response window).
the bot just feels slow (latency).
VAQI tries to measure these things, not just raw API speeds. a single number that says: did the whole convo feel smooth or shitty?
finally someone’s trying to quantify the vibe of the conversation, not just its parts.
and deepgram’s VAQI score shows what i always felt: being fast in STT doesn’t matter if your LLM or TTS adds seconds of dead air.
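i have no idea how deepgram actually weights things internally, but as a mental model, a toy score that folds the three failure modes into one number might look like the sketch below. purely illustrative — the weights are made up, the Turn fields are my own, and this is not the VAQI formula.

```python
# toy conversation-quality score -- NOT deepgram's VAQI math, just an illustration
# of folding interruptions, response gaps, and latency into a single number.
from dataclasses import dataclass

@dataclass
class Turn:
    latency_ms: float        # user stops speaking -> bot audio starts
    interrupted_user: bool   # bot started talking over the user
    gap_ms: float            # dead air beyond a "natural" pause

def toy_quality_score(turns: list[Turn]) -> float:
    """0-100, higher = smoother. all weights are invented for illustration."""
    if not turns:
        return 100.0
    penalty = 0.0
    for t in turns:
        if t.interrupted_user:
            penalty += 15.0                                           # talking over the user hurts most
        penalty += min(t.gap_ms / 100.0, 20.0)                        # ~1 point per 100 ms of dead air, capped
        penalty += min(max(t.latency_ms - 800.0, 0.0) / 200.0, 10.0)  # only punish latency above ~800 ms
    return max(0.0, 100.0 - penalty / len(turns))

print(toy_quality_score([Turn(1200, False, 300), Turn(2600, True, 900)]))  # one clean turn, one ugly one
```

the exact numbers don’t matter; the point is that one interruption or one long gap can tank a score that per-vendor latency dashboards would call “fine”.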
2. latency is death by a thousand cuts
our stack is STT (deepgram) → LLM (gpt-4.1 mini) → TTS (elevenlabs, smallest, cartesia, etc.). all good on paper. but in real test runs?
stt: 300–700 ms
llm: 800–1500 ms
tts: 600–2000 ms
and then there's the unseen stuff:
audio chunking
websocket setup time
backend orchestration
VAD (voice activity detection) not detecting end of speech fast enough
not to mention in-call function calling, RAG, etc.
in total? 2 to 6 seconds of perceived latency on real calls.
that’s why experience-first benchmarks like VAQI matter. they punish these gaps. they don’t care why the gaps happen, only that they ruin the conversational flow.
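first thing that actually helped us: measure the whole chain per turn, not per vendor dashboard. a minimal sketch — the stt/llm/tts callables are stand-ins for whatever wraps your vendor SDKs, and the demo delays at the bottom are fake.

```python
# minimal per-turn latency budget: time every stage and print where the turn went.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, budget: dict):
    start = time.perf_counter()
    yield
    budget[stage] = (time.perf_counter() - start) * 1000  # milliseconds

def handle_turn(audio, stt, llm, tts):
    """stt / llm / tts are whatever callables wrap your vendor SDKs."""
    budget = {}
    with timed("stt", budget):
        transcript = stt(audio)
    with timed("llm", budget):
        reply = llm(transcript)
    with timed("tts", budget):
        speech = tts(reply)
    total = sum(budget.values())
    print(f"turn: {total:.0f} ms total -> " + ", ".join(f"{k} {v:.0f} ms" for k, v in budget.items()))
    return speech

if __name__ == "__main__":
    # fake stages standing in for real vendor calls
    fake = lambda delay: (lambda x: (time.sleep(delay), x)[1])
    handle_turn(b"...", stt=fake(0.5), llm=fake(1.2), tts=fake(0.9))
```

this is the fully serial version; once you stream (next section), the number to watch is time-to-first-audio rather than the sum. but even this crude breakdown makes the 2–6 second total impossible to ignore.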
3. streaming everything is non-negotiable
if your STT isn't streaming? you're screwed.
if your LLM waits for full input before generating? you're screwed again.
if your TTS waits for the full sentence? triple screwed.
VAQI exposes this: the best scorers (deepgram’s own agent, duh!) stream everything.
partial STT goes into LLM instantly. LLM sends tokens to TTS as soon as possible. TTS streams audio while it’s still rendering the tail end.
that’s the only way to keep real perceived latency under 2 seconds.
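roughly, “stream everything” looks like the sketch below. stt_partials / llm_stream / tts_stream / play are stand-ins for your vendor SDKs (the real APIs all differ), but the overlap is the point: LLM tokens get flushed to TTS in sentence-ish chunks while the model is still generating, and audio plays while the tail is still rendering.

```python
# streaming pipeline sketch: STT partials -> LLM tokens -> TTS audio, all overlapped.
# stt_partials, llm_stream, tts_stream, play are stand-ins for real vendor SDKs.
import asyncio

async def streaming_turn(stt_partials, llm_stream, tts_stream, play):
    # 1. consume partial transcripts until STT/VAD signals end of speech
    transcript = ""
    async for partial in stt_partials():
        transcript = partial.text
        if partial.is_final:
            break

    # playback queue so synthesis and playback overlap without the agent talking over itself
    queue = asyncio.Queue()
    player = asyncio.create_task(playback_worker(queue, tts_stream, play))

    # 2. stream LLM tokens, flushing sentence-ish chunks to TTS as soon as they form
    buf = ""
    async for token in llm_stream(transcript):
        buf += token
        if buf.endswith((".", "?", "!", ",")) and len(buf) > 20:  # crude boundary heuristic
            await queue.put(buf)
            buf = ""
    if buf:
        await queue.put(buf)
    await queue.put(None)  # signal end of turn
    await player

async def playback_worker(queue, tts_stream, play):
    # 3. synthesize and play each chunk in order, streaming audio as it renders
    while (text := await queue.get()) is not None:
        async for audio_chunk in tts_stream(text):
            await play(audio_chunk)
```

the queue keeps playback in order, and the sentence-boundary flush is a heuristic you’d tune per TTS vendor (flush too early and prosody suffers; too late and you’re back to waiting on full sentences).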
4. prompt length still hurts
prompt size kills latency more than expected.
fat and lengthy system prompts?
context windows full of junk history?
all of this adds 300–500 ms before the first token even shows up.
i lowkey still think prompt-side latency is negligible, but VAQI says otherwise.
and the fact remains: the more context you stuff in, the later the first token arrives. so gotta keep shit tight.
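what keeping it tight looks like in practice: cap the history against a rough token budget before every turn, newest turns first. this sketch uses a chars/4 heuristic instead of a real tokenizer just to stay dependency-free — swap in your model’s tokenizer for real counts.

```python
# crude context trimming: keep the system prompt plus the most recent turns under a token budget.
# rough_tokens() is a chars/4 approximation, not a real tokenizer.

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, history: list, budget: int = 2000) -> list:
    """history is a list of {'role': ..., 'content': ...} dicts, newest last."""
    remaining = budget - rough_tokens(system_prompt)
    kept = []
    for msg in reversed(history):              # walk newest -> oldest
        cost = rough_tokens(msg["content"])
        if cost > remaining:
            break                              # everything older gets dropped
        kept.append(msg)
        remaining -= cost
    return [{"role": "system", "content": system_prompt}] + list(reversed(kept))
```

(summarizing the dropped tail into one short message is the obvious next step; the point here is just that the trimming has to happen on every turn, not once.)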
5. TTS is the silent killer
tts is supposed to be fast now (elevenlabs flash claims 75 ms inference). but if you want:
high quality
cloned voice
emotion
…it’s 600–2000 ms, easy.
smallest.ai and cartesia are doing better here, but VAQI makes one thing clear: if TTS adds even a little delay, the user feels it. gaps after the LLM? brutal.
pre-generating standard phrases is a must, or you’re dead.
TTS caching for semantically similar intents is a must as well.
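a rough sketch of both ideas — pre-generated stock phrases plus a fuzzy lookup for near-duplicate replies. embed() and synthesize() are stand-ins for whatever embedding model and TTS vendor you use, and the 0.92 similarity threshold is a made-up number you’d have to tune.

```python
# TTS cache sketch: exact-match cache for stock phrases, plus a cosine-similarity
# lookup for semantically similar replies. embed() and synthesize() are stand-ins.
import numpy as np

class TTSCache:
    def __init__(self, embed, synthesize, threshold=0.92):
        self.embed, self.synthesize, self.threshold = embed, synthesize, threshold
        self.exact = {}      # text -> audio bytes
        self.entries = []    # (unit-norm embedding, audio bytes)

    def warm(self, phrases):
        """pre-generate audio for standard phrases at deploy time, not mid-call."""
        for p in phrases:
            audio = self.synthesize(p)
            self.exact[p] = audio
            self.entries.append((self._unit(p), audio))

    def get(self, text):
        if text in self.exact:
            return self.exact[text]
        v = self._unit(text)
        for emb, audio in self.entries:            # linear scan; fine for a few hundred phrases
            if float(np.dot(v, emb)) >= self.threshold:
                return audio
        audio = self.synthesize(text)              # miss: pay the TTS latency once, then remember it
        self.entries.append((v, audio))
        return audio

    def _unit(self, text):
        v = np.asarray(self.embed(text), dtype=np.float32)
        return v / (np.linalg.norm(v) + 1e-9)
```

obvious caveat: only serve a fuzzy hit when the wording can be reused verbatim — slightly-wrong cached audio is worse than the latency you saved.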
6. india latency is unfixable (for now)
sad reality: we build for india.
and the cloud infra is US/EU-centric. even when STT/LLM/TTS are fast, india adds 300–400 ms just from distance.
VAQI won’t forgive this. users in bengaluru will feel that added 400 ms lag. every time.
unless we put everything on-prem (hard) or vendors give india-region endpoints (rare), this problem will stay.
7. speech-to-speech models are tempting... but not ready
openai’s realtime api. google gemini’s voice model. speech in, speech out. no STT/LLM/TTS separation.
VAQI-friendly by design.
but right now:
no custom voices.
no banking-grade prompt control.
no compliance readiness.
for regulation-heavy use cases like BFSI? still a no-go.
but this space moves fast. 6–8 months from now, i won’t be surprised if these models make VAQI scores look silly.
8. backend glue is the hidden problem
this is the part VAQI exposes that nobody talks about:
api hops
data serialization
service-to-service calls
VAD detection lag
even if the AI models are fast, the backend orchestration can screw it all up.
deepgram’s own stack wins partly because it’s a single integrated agent. no cross-vendor calls. no glue code.
that’s probably why a lot of the players in this space are trying to own the whole stack; it helps with pricing as well.
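one cheap glue fix that generalizes: open the vendor websockets once when the call starts and reuse them every turn, instead of paying handshake time in the middle of a conversation. the sketch below uses the python websockets package; the URIs and the send/recv protocol are placeholders, not real endpoints.

```python
# warm up vendor connections at call start so per-turn latency never includes
# websocket handshakes. URIs and the message protocol here are placeholders.
import websockets

class CallSession:
    def __init__(self, stt_uri, tts_uri):
        self.stt_uri, self.tts_uri = stt_uri, tts_uri
        self.stt_ws = None
        self.tts_ws = None

    async def __aenter__(self):
        # connect once per call, not once per turn
        self.stt_ws = await websockets.connect(self.stt_uri)
        self.tts_ws = await websockets.connect(self.tts_uri)
        return self

    async def __aexit__(self, *exc):
        await self.stt_ws.close()
        await self.tts_ws.close()

async def run_call(audio_turns):
    async with CallSession("wss://stt.example/stream", "wss://tts.example/stream") as s:
        for turn_audio in audio_turns:
            await s.stt_ws.send(turn_audio)   # reuse the warm socket every turn
            transcript = await s.stt_ws.recv()
            # ... LLM call + TTS over s.tts_ws would go here ...
```

same idea applies to HTTP clients (keep-alive sessions) and anything that does TLS per request — from the user’s point of view, handshakes are pure dead air.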
9. VAQI actually matters
most benchmarks don’t reflect real customer pain.
but VAQI does:
captures interruptions (the worst sin in voice ai).
captures response gaps (the awkward “did it crash?” moments).
captures total perceived latency.
deepgram’s own 71.5 score shows how hard this is: even the top solution isn’t near-perfect.
but now we have a number to chase.
closing thought
i used to care about STT latency. then LLM speed. then TTS inference time.
now?
i care about conversation quality as a whole. (duh!)
it’s not about shaving 100 ms off STT. it’s about making the whole pipeline feel fast, smooth, human. (again, duh!)
and fixing backend glue. and streaming literally everything. and trimming prompts to the bone. and maybe someday... switching to speech-to-speech.
but until then: VAQI shows the scoreboard we all have to play on.
deepgram didn’t pay me to write such good things about their benchmark (unfortunately) :(