Voice AI in Live Conversation — STT + LLM + TTS Under a 600ms Budget

Tue Jan 21 2025

A demo of a voice AI is a fundamentally different artifact from a production voice AI.

In a demo, a developer says one well-formed sentence. The system thinks for two or three seconds. The voice agent replies with a synthesized response. Everyone claps.

In production, a citizen calls a non-emergency line at 11pm. They've been in a fender bender. They're stressed. They start a sentence, pause, restart, talk over the agent's first reply. They expect a response that feels like talking to a person — fluid, fast, willing to be interrupted.

The gap between those two artifacts is most of the engineering. This post is about how I closed that gap at Aurelian.

The budget

Voice latency isn't one number. It's a stack.

Time-to-first-audible-response (TTFAR), the interval from the moment the citizen finishes speaking to the moment they hear the agent start speaking, is the number that actually matters to a user's perception of "is this thing alive?"

It decomposes:

TTFAR = endpoint detection
      + STT finalization
      + LLM time-to-first-token
      + TTS time-to-first-audio-frame
      + audio transport latency

I set the budget for a fluid conversation at around 600ms p95. Above 800ms it starts to feel like the agent is thinking. Above 1.2s the citizen will start to repeat themselves. Above 2s they'll assume the line dropped.

The naive serial pipeline blows that budget on day one. STT finalization alone can be 300-500ms if you wait for "I think the user is done" with a typical endpointing config. LLM TTFT is another 200-600ms depending on the model. TTS first-frame is often 150-400ms. Add it up: 650ms in the best case and 1.5s in the worst, before audio transport is even counted.
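The arithmetic behind that sum, as a quick sketch. The stage ranges are the ones quoted above; the dictionary keys and the function name are illustrative only.

```python
# Stage latency ranges in milliseconds, as quoted in the paragraph above.
SERIAL_STAGES_MS = {
    "stt_finalization": (300, 500),
    "llm_ttft": (200, 600),
    "tts_first_frame": (150, 400),
}

BUDGET_MS = 600  # target p95 TTFAR


def serial_total(stages):
    """Sum best-case and worst-case latency when stages run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = serial_total(SERIAL_STAGES_MS)
print(f"serial pipeline: {best}-{worst} ms against a {BUDGET_MS} ms budget")
```

Even the best case of a strictly serial pipeline lands over budget, which is why the only way out is overlapping stages.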

Every milestone in the project came from finding a place where I could overlap something instead of waiting for it.

The streaming architecture

The system that worked was streaming end-to-end. Nothing waits for "complete" before the next thing starts.

caller audio
   ↓ (frames in)
streaming STT ──── interim transcripts ────┐
   ↓ (final)                               │
endpoint signal                            │
   ↓                                       │
LLM (streaming output) ←───────────────────┘
   ↓ (tokens piped into TTS as they arrive)
streaming TTS
   ↓ (audio frames)
caller hears agent

Three things make this fast:

1. STT runs continuously. The pipeline doesn't wait for the user to "finish" before transcribing. It keeps a rolling transcript with confidence scores. Interim results feed downstream context. By the time the endpointer decides the user is done, the final transcript is already ready.

2. LLM starts before STT finalizes. The moment the endpointer says "probably done" with high confidence, the LLM kicks off with the current rolling transcript. If the user keeps talking, the in-flight request is cancelled and restarted with the updated transcript. Most of the time they don't, and that saves 200-400ms.

3. TTS streams from LLM tokens, not from a complete response. As soon as the LLM emits a token, it goes into the TTS request as part of a streaming text input. TTS frames stream back. The first audio frame goes out the door before the LLM has finished generating.

The cost of all this overlap is cancellation. Every layer has to support cancellation cleanly, because cancellations happen all the time. STT might decide on a re-segmentation. The LLM might get interrupted by the user starting a new turn. TTS might need to abort mid-utterance and switch to a new response. Cancellation paths are first-class, not exception cases.
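A minimal asyncio sketch of the start-early, cancel-and-restart pattern. This is an illustration under stated assumptions, not the production code: fake_llm is a hypothetical stand-in for a streaming LLM client, and SpeculativeTurn and its method names are invented for the example.

```python
import asyncio
from typing import List, Optional


async def fake_llm(transcript: str, out: List[str]) -> None:
    """Hypothetical stand-in for a streaming LLM: emits one token per word."""
    for word in transcript.split():
        await asyncio.sleep(0.01)  # simulated per-token latency
        out.append(word)


class SpeculativeTurn:
    """Starts generation on a probable endpoint; restarts if the caller resumes."""

    def __init__(self) -> None:
        self.task: Optional[asyncio.Task] = None
        self.tokens: List[str] = []

    async def cancel(self) -> None:
        """Abort the in-flight request, if any, and wait for it to unwind."""
        task, self.task = self.task, None
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass

    async def on_probable_endpoint(self, transcript: str) -> None:
        """Fire the LLM on the current rolling transcript."""
        await self.cancel()
        self.tokens = []  # fresh buffer so stale tokens never leak into the new turn
        self.task = asyncio.create_task(fake_llm(transcript, self.tokens))

    async def on_user_resumed(self) -> None:
        """Caller kept talking: throw away the speculative response."""
        await self.cancel()
        self.tokens = []


async def demo() -> List[str]:
    turn = SpeculativeTurn()
    await turn.on_probable_endpoint("I want to report a")
    await asyncio.sleep(0.015)  # caller resumes while tokens are streaming
    await turn.on_user_resumed()
    await turn.on_probable_endpoint("I want to report a cat in a tree")
    await turn.task  # let the second request run to completion
    return turn.tokens


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The detail that matters is that cancel() waits for the old task to unwind before a new one starts, so two responses are never in flight for the same turn.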

The non-obvious problems

The streaming pipeline is the easy part. The hard parts are the ones nobody mentions.

Endpointing

When did the user actually finish talking?

Default endpointing in most streaming STT providers relies on silence detection: a configurable hold time after the last voiced audio. Set it short and you cut people off mid-sentence. Set it long and your TTFAR explodes.

Real callers don't speak in clean sentences. They pause to think. They use filler words. They start a sentence, abandon it, restart. I needed an endpointer that understood semantics, not just silence.

I layered three signals:

  • Silence detection (the cheap one)
  • Acoustic prosody (rising vs falling pitch — questions tend to rise, statements fall)
  • A small model that scored "is this transcript a complete enough thought to act on?"

The third was the unlock. A 200ms pause after a complete-sounding thought triggers a response. A 700ms pause after a fragment ("I was driving down...") doesn't. Citizens stopped getting cut off, and TTFAR for clean utterances dropped to roughly 400ms.
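One way the three signals could be combined, sketched under assumptions. The 200ms hold for a complete thought comes from the text above; the score cutoffs, the other hold times, and the idea of reducing prosody to a single terminal_contour flag are placeholders made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndpointSignals:
    silence_ms: float       # time since the last voiced frame
    terminal_contour: bool  # assumed prosody cue: pitch ended in a clear rise or fall
    completeness: float     # 0..1 score from the "complete thought?" model


def required_silence_ms(sig: EndpointSignals) -> float:
    """Map semantic and prosodic cues to the pause needed before responding."""
    if sig.completeness >= 0.8:    # assumed cutoff for "complete-sounding"
        hold = 200.0
    elif sig.completeness >= 0.5:  # assumed cutoff for "ambiguous"
        hold = 500.0
    else:                          # fragment: wait much longer
        hold = 1200.0
    if not sig.terminal_contour:
        hold += 150.0              # flat pitch suggests the caller is mid-thought
    return hold


def should_respond(sig: EndpointSignals) -> bool:
    """True once the caller has been silent long enough for this utterance."""
    return sig.silence_ms >= required_silence_ms(sig)


# A complete thought followed by a 200ms pause triggers a response.
print(should_respond(EndpointSignals(200, True, 0.9)))
# A fragment followed by a 700ms pause does not.
print(should_respond(EndpointSignals(700, True, 0.2)))
```

Silence stays the trigger; the other two signals only move the threshold it has to cross.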

Barge-in

The citizen interrupts. Mid-sentence. While the agent is speaking.

What you want: agent stops speaking, listens to the new utterance, responds.

What's tricky: detecting that the new audio is the citizen interrupting and not the agent's own audio leaking through into the input channel. Echo cancellation isn't perfect. I had to combine:

  • Voice activity detection on the input channel
  • A confidence threshold that the input was speech, not echo
  • A "did the agent's response already convey the key information" check to decide whether to abandon the rest of the response or queue it

Barge-in is one of those features that sounds simple. It is not.
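The decision itself fits in a few lines once the signals exist. A sketch with hypothetical names throughout; the 0.7 threshold is an assumed value, and producing a trustworthy speech_confidence in the presence of echo is the part that takes the real work.

```python
from dataclasses import dataclass
from enum import Enum


class BargeInAction(Enum):
    IGNORE = "ignore"         # likely echo or noise: keep talking
    STOP_AND_DROP = "drop"    # stop; the key information was already spoken
    STOP_AND_QUEUE = "queue"  # stop; keep the unsaid remainder for later


@dataclass
class InputFrame:
    vad_active: bool          # voice activity detected on the input channel
    speech_confidence: float  # 0..1 that this is the caller, not the agent's echo


def decide_barge_in(frame: InputFrame, key_info_delivered: bool,
                    threshold: float = 0.7) -> BargeInAction:
    """Decide what the agent does when audio arrives while it is speaking."""
    if not frame.vad_active or frame.speech_confidence < threshold:
        return BargeInAction.IGNORE
    if key_info_delivered:
        return BargeInAction.STOP_AND_DROP
    return BargeInAction.STOP_AND_QUEUE


print(decide_barge_in(InputFrame(True, 0.9), key_info_delivered=False))
```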

Hallucinations on partial transcripts

If you fire the LLM on a rolling interim transcript before the user finishes, the LLM has to handle inputs that get rewritten retroactively.

User says: "I want to report a [pause] cat in a tree." The interim transcript at 400ms is "I want to report a." If the LLM fires there, it'll produce a confidently wrong response — "Okay, what would you like to report?" — that the user will then have to correct.

Two defenses:

  • Never fire on transcripts whose STT confidence falls below a threshold
  • Have the LLM explicitly check whether the input is a complete intent, and if not, return a low-cost continuation that buys time ("Mm-hmm")

I ended up with a class of agent responses I called acknowledgements: cheap, generic, polite phrases the agent could emit while waiting for more signal. They bought latency budget without making the conversation feel robotic.
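The two defenses plus the acknowledgement fallback can be sketched as a single gate. Everything named here is an assumption for illustration: the threshold value, the phrase list beyond "Mm-hmm", and especially looks_complete, which is a crude word-list stand-in for the LLM-side completeness check described above.

```python
import random

ACKNOWLEDGEMENTS = ("Mm-hmm.", "Okay.", "Go on.")
MIN_STT_CONFIDENCE = 0.6  # assumed threshold
DANGLING_WORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "down"}


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for the completeness check: no dangling final word."""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1] not in DANGLING_WORDS


def next_action(transcript: str, stt_confidence: float, rng=random):
    """Return (action, payload) for the current rolling transcript."""
    if stt_confidence < MIN_STT_CONFIDENCE:
        return ("wait", None)  # defense 1: do not fire on shaky transcripts
    if not looks_complete(transcript):
        # defense 2: buy time with a cheap acknowledgement
        return ("acknowledge", rng.choice(ACKNOWLEDGEMENTS))
    return ("respond", transcript)


print(next_action("I want to report a", 0.9)[0])
print(next_action("I want to report a cat in a tree", 0.9)[0])
```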

Prosody collapse on small TTS chunks

When you stream TTS in tiny chunks (5-10 tokens at a time), the synthesizer doesn't have enough context to do natural prosody. Inflections get flat. Sentence-final pitch falls don't happen. The voice sounds robotic.

Stream in larger chunks (40-50 tokens) and the prosody comes back, but you've added 200-300ms to time-to-first-audio.

I tuned chunk size per response type. Greetings and acknowledgements stream small (low-context, fixed prosody anyway). Substantive responses buffer up a sentence's worth of tokens before the first TTS frame goes out — sacrificing some latency to keep the voice human.
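A sketch of per-response-type chunking, assuming tokens arrive as whitespace-free word strings and that sentence boundaries can be spotted from trailing punctuation. The function name, the substantive flag, and the default chunk size are illustrative.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def chunk_for_tts(tokens: Iterable[str], substantive: bool,
                  small_chunk: int = 5) -> Iterator[str]:
    """Group streamed LLM tokens into the text chunks handed to TTS.

    Substantive responses are held until a sentence boundary so the
    synthesizer has enough context for prosody; everything else is
    flushed every `small_chunk` tokens to minimise time-to-first-audio.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        if substantive:
            ready = token.endswith(SENTENCE_END)
        else:
            ready = len(buffer) >= small_chunk
        if ready:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the LLM stream ends
        yield " ".join(buffer)


words = "Okay. I am sending an officer now.".split()
print(list(chunk_for_tts(words, substantive=True)))
print(list(chunk_for_tts(words, substantive=False, small_chunk=3)))
```

Because it is a generator, the first chunk reaches TTS as soon as its boundary condition is met, without waiting for the LLM to finish.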

Cancellation cascades

Every cancellation has to propagate. STT cancels → LLM in-flight request must abort → TTS request must abort → audio buffer must flush.

If the cancellation cascade has a bug, the system "double-talks": the old response continues while the new response starts. This is one of the worst-feeling failure modes in voice AI. Citizens reach for the disconnect button.

I wrote a test harness that fuzzed cancellation timing — random aborts at every layer — and ran it as part of CI. It caught a class of bugs that no functional test would have surfaced.
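A heavily reduced sketch of what such a fuzz test can look like. It assumes a two-stage turn (a fake LLM feeding a fake TTS through a queue) and aborts the whole turn at a random moment, whereas the real harness aborted at every layer. The invariant is the "no double-talk" rule: once a turn is cancelled, none of its audio may reach the sink. All class and function names are hypothetical.

```python
import asyncio
import random


class Turn:
    """One response: a fake LLM stage feeding a fake TTS stage feeding a sink."""

    def __init__(self, turn_id, sink):
        self.turn_id = turn_id
        self.sink = sink  # shared list standing in for the audio output buffer
        self.tokens = asyncio.Queue()
        self.tasks = []

    async def _llm(self):
        for i in range(20):
            await asyncio.sleep(0.001)
            await self.tokens.put(f"tok{i}")

    async def _tts(self):
        while True:
            token = await self.tokens.get()
            await asyncio.sleep(0.001)
            self.sink.append((self.turn_id, token))

    def start(self):
        self.tasks = [asyncio.create_task(self._llm()),
                      asyncio.create_task(self._tts())]

    async def cancel(self):
        """Cascade: abort every stage, then wait until all have unwound."""
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)


async def fuzz_once(rng):
    sink = []
    old = Turn(0, sink)
    old.start()
    await asyncio.sleep(rng.uniform(0, 0.02))  # random abort point
    await old.cancel()
    cut = len(sink)
    new = Turn(1, sink)
    new.start()
    await asyncio.sleep(0.01)
    await new.cancel()
    # Invariant: after the cancel point, only the new turn produces audio.
    return all(turn_id == 1 for turn_id, _ in sink[cut:])


async def fuzz(runs=20, seed=0):
    rng = random.Random(seed)
    results = [await fuzz_once(rng) for _ in range(runs)]
    return all(results)


if __name__ == "__main__":
    print("no double-talk:", asyncio.run(fuzz()))
```

Seeding the random generator keeps a failing abort schedule reproducible, which matters more in CI than raw coverage.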

Trade-offs that mattered

Smaller chunks vs better prosody. I compromised: cheap responses stream small, substantive responses buffer a sentence. Different chunk sizes for different jobs.

Lower-latency model vs higher-quality model. A faster model handles trivial turns (acknowledgements, intent gathering); a stronger model handles high-stakes turns (incident description, escalation reasoning). A small router decides which model to fire per turn, in code, based on the conversation state. The router itself rides on the confidence-aware policy layer — same primitive, different decision.
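As a sketch, a router of this kind can be little more than a lookup on the turn type. The enum members mirror the examples in the paragraph above; the model identifiers are placeholders, and the real router also drew on conversation state that this example leaves out.

```python
from enum import Enum


class TurnKind(Enum):
    ACKNOWLEDGEMENT = "acknowledgement"
    INTENT_GATHERING = "intent_gathering"
    INCIDENT_DESCRIPTION = "incident_description"
    ESCALATION = "escalation"


FAST_MODEL = "fast-model"      # placeholder identifier
STRONG_MODEL = "strong-model"  # placeholder identifier

HIGH_STAKES = {TurnKind.INCIDENT_DESCRIPTION, TurnKind.ESCALATION}


def route(kind: TurnKind) -> str:
    """Pick the model for this turn: stronger model only for high-stakes turns."""
    return STRONG_MODEL if kind in HIGH_STAKES else FAST_MODEL


for kind in TurnKind:
    print(kind.value, "->", route(kind))
```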

Cached responses vs always-live generation. Greetings, holds, and "let me get a dispatcher" responses are pre-rendered audio, played from cache. The cache is per-agency (each agency has its own preferred phrasing). Everything else is generated live. Caching where you can buys back latency budget for the cases where you can't.
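A minimal sketch of a per-agency cache with a live-synthesis fallback, assuming audio is held in memory as bytes and keyed by agency and phrase. The class, the key names, and the synthesize_live callback are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class PrerenderedCache:
    """Pre-rendered audio keyed by (agency, phrase), since phrasing varies per agency."""

    def __init__(self) -> None:
        self._audio: Dict[Tuple[str, str], bytes] = {}

    def put(self, agency_id: str, phrase_key: str, audio: bytes) -> None:
        self._audio[(agency_id, phrase_key)] = audio

    def get(self, agency_id: str, phrase_key: str) -> Optional[bytes]:
        return self._audio.get((agency_id, phrase_key))


def audio_for(cache: PrerenderedCache, agency_id: str, phrase_key: str,
              synthesize_live: Callable[[str], bytes]) -> Tuple[bytes, bool]:
    """Return (audio, was_cached); fall back to live synthesis on a miss."""
    cached = cache.get(agency_id, phrase_key)
    if cached is not None:
        return cached, True
    return synthesize_live(phrase_key), False


cache = PrerenderedCache()
cache.put("agency-a", "greeting", b"<pre-rendered greeting audio>")
fake_tts = lambda key: f"<live audio for {key}>".encode()  # stand-in for real TTS
print(audio_for(cache, "agency-a", "greeting", fake_tts)[1])
print(audio_for(cache, "agency-b", "greeting", fake_tts)[1])
```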

Transport vs synthesis. I initially deployed in a single region. Latency to West Coast callers was fine; latency to East Coast callers was not. Multi-region voice deployment turned out to be more about audio transport latency than inference latency — I'd been optimizing the wrong number.

What real production voice AI is

Most voice AI demos optimize for "the model is good." Production voice AI optimizes for "the experience feels human."

The former is about which model you pick. The latter is about:

  • A streaming pipeline with first-class cancellation everywhere
  • Endpointing that understands semantics, not just silence
  • Barge-in that handles echo, partial speech, and queue management
  • Prosody that survives small chunks
  • A response router that picks the right model for the moment
  • Caching for everything that doesn't need to be live
  • Multi-region transport for callers who don't live next to your inference region
  • Telemetry per stage of the pipeline, because "p95 latency" is meaningless unless you know which stage moved
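On the telemetry point, a sketch of per-stage p95 tracking using the nearest-rank method. The class name and stage labels are illustrative, and a production system would more likely use histograms in a metrics backend than raw sample lists.

```python
from collections import defaultdict


class StageTimer:
    """Collects latency samples per pipeline stage and reports p95 per stage."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)

    def record(self, stage: str, ms: float) -> None:
        self.samples[stage].append(ms)

    def p95(self, stage: str) -> float:
        """Nearest-rank 95th percentile of the samples recorded for a stage."""
        data = sorted(self.samples[stage])
        if not data:
            raise ValueError(f"no samples recorded for stage {stage!r}")
        rank = (95 * len(data) + 99) // 100  # ceil(0.95 * n) in integer math
        return data[rank - 1]

    def slowest_stage(self) -> str:
        """The stage with the highest p95: the one to look at first."""
        return max(self.samples, key=self.p95)


timer = StageTimer()
for ms in range(1, 101):
    timer.record("llm_ttft", float(ms))
    timer.record("tts_first_frame", float(ms) / 2)
print(timer.p95("llm_ttft"), timer.slowest_stage())
```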

None of those things are model improvements. All of them are platform engineering.

The number

The system shipped at sub-600ms p95 TTFAR for the typical conversational turn, and 250-400ms for cached or acknowledgement responses. The faster end of that range is fast enough that callers stop noticing the AI is an AI, which is — for the domain — exactly the bar.

The improvement that mattered most wasn't a single optimization. It was eliminating every place where one stage waited on another to finish. The architecture was the win.

If you're building voice AI for production, that's the principle: the model is one box in the diagram. The diagram is the product.