Building a Streaming Voice Pipeline: STT → LLM → TTS

A look under the hood of our intelligent front desk — how speech flows through transcription, reasoning, and synthesis in real time.

Michael Torres | May 21, 2026 | 1 min read

Our front desk answers a call in four stages. Each stage streams its output into the next, so no stage waits for the previous one to finish and the caller hears a response sooner.

1. Access

The call arrives over phone or SMS. We bridge the telephony layer to our local media server, which exposes a raw audio stream.
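To make "raw audio stream" concrete, here is a minimal sketch of slicing a byte stream into fixed-size frames for the next stage. The audio format (16 kHz, 16-bit mono PCM), the 20 ms frame size, and the name `iter_frames` are assumptions for illustration; this post does not specify what the media server actually emits.

```python
import io
from typing import BinaryIO, Iterator

SAMPLE_RATE_HZ = 16_000  # assumed sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
FRAME_MS = 20            # assumed frame duration

# 16,000 samples/s * 2 bytes * 0.020 s = 640 bytes per frame
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000


def iter_frames(stream: BinaryIO, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield fixed-size frames as soon as each one is complete."""
    buffer = b""
    while True:
        # Ask only for the bytes still missing from the current frame.
        data = stream.read(frame_bytes - len(buffer))
        if not data:
            break  # end of stream
        buffer += data
        if len(buffer) == frame_bytes:
            yield buffer
            buffer = b""
    if buffer:
        # Trailing partial frame: pad with zero bytes (silence in signed PCM)
        # so downstream stages always see uniform frame sizes.
        yield buffer.ljust(frame_bytes, b"\x00")


frames = list(iter_frames(io.BytesIO(b"\x01" * 1500)))
```

Because `iter_frames` is a generator, the transcription stage can start on the first frame while later frames are still arriving.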

2. Speech-to-Text (STT)

A streaming ASR model transcribes audio in chunks as it arrives — not after the caller finishes speaking. Partial hypotheses let us detect intent early.

partial:  "i'd like to book a"
final:    "i'd like to book a meeting room for tomorrow"
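One way to act on partial hypotheses is to check each one for an intent as it arrives and stop at the first match. The sketch below uses a keyword table purely for illustration; `INTENT_KEYWORDS`, `guess_intent`, `first_intent`, and the `cancel_booking` intent are hypothetical, and a real system would more likely use a classifier or the LLM itself.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical keyword table, for illustration only.
INTENT_KEYWORDS = {
    "book_meeting_room": ("book", "reserve"),
    "cancel_booking": ("cancel",),
}


def guess_intent(text: str) -> Optional[str]:
    """Return the first intent whose keyword appears as a whole word."""
    words = set(text.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return None


def first_intent(hypotheses: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Scan (kind, text) hypotheses in arrival order and stop at the first match.

    Returns (kind, intent) so the caller can tell whether a partial was enough.
    """
    for kind, text in hypotheses:
        intent = guess_intent(text)
        if intent is not None:
            return kind, intent
    return None


hypotheses = [
    ("partial", "i'd like to"),
    ("partial", "i'd like to book a"),
    ("final", "i'd like to book a meeting room for tomorrow"),
]
result = first_intent(hypotheses)
```

Here the second partial is already enough to guess the intent, so downstream work can start before the final transcript arrives. An early guess can be wrong, so it should be confirmed against the final hypothesis.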

3. LLM reasoning

The transcript is fed to the local LLM with conversation context and a system prompt. The model returns an intent, extracted entities, and a natural-language response.

intent:   book_meeting_room
entities: { date: "tomorrow" }
reply:    "Sure! Which room would you prefer?"
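The three fields above map naturally onto a small typed record. The sketch below assumes the model is prompted to answer in JSON, which this post does not state; `Turn` and `parse_turn` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Turn:
    """One model response: what the caller wants, and what to say back."""
    intent: str
    reply: str
    entities: Dict[str, str] = field(default_factory=dict)


def parse_turn(raw: str) -> Turn:
    """Parse the model's JSON output, failing loudly on malformed responses."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    missing = {"intent", "reply"} - data.keys()
    if missing:
        raise ValueError(f"model output missing fields: {sorted(missing)}")
    return Turn(
        intent=data["intent"],
        reply=data["reply"],
        entities=dict(data.get("entities", {})),  # entities are optional
    )


raw = (
    '{"intent": "book_meeting_room", '
    '"entities": {"date": "tomorrow"}, '
    '"reply": "Sure! Which room would you prefer?"}'
)
turn = parse_turn(raw)
```

Validating at this boundary keeps a malformed model response from reaching the synthesis stage.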

4. Text-to-Speech (TTS)

The reply is synthesized as streaming audio and played back the moment the first chunk is ready. Because TTS starts before the LLM finishes emitting, the perceived latency is the time to first audio, not the time to the full response.
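A common way to start synthesis before the model finishes is to group its tokens into sentence-sized chunks and hand each chunk to TTS as soon as it closes. The sketch below illustrates that idea and is not necessarily how our pipeline splits text; `sentence_chunks` is a hypothetical helper, and the token boundaries are made up.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences, yielding each one once it ends.

    Naive on purpose: abbreviations such as "e.g." would split early.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that never reached punctuation.
        yield buffer.strip()


tokens = ["Sure", "!", " Which", " room", " would", " you", " prefer", "?"]
chunks = list(sentence_chunks(tokens))
# chunks == ["Sure!", "Which room would you prefer?"]
```

The first chunk, "Sure!", is available after only two tokens, so playback can begin while the rest of the reply is still being generated.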

The whole pipeline runs on one box, so no stage pays a network round trip to reach the next. That's what makes sub-500 ms first audio realistic.

Why streaming matters

Streaming each stage — rather than buffering — is the single biggest lever on perceived latency. Callers judge responsiveness by the first word, not the last.
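The effect can be illustrated with a toy model that counts how many processing steps run before the first output exists. It assumes every stage handles one chunk per step at equal cost, which real STT, LLM, and TTS stages do not; the function names are hypothetical and the numbers are step counts, not measurements.

```python
from typing import Callable, Iterable, Iterator, List


def stage(name: str, log: List[str]) -> Callable[[Iterable[int]], Iterator[int]]:
    """Build a pass-through stage that records each unit of work it performs."""
    def run(items: Iterable[int]) -> Iterator[int]:
        for item in items:
            log.append(f"{name}:{item}")  # one step of work
            yield item
    return run


def streaming_steps_to_first_output(chunks: Iterable[int], names: List[str]) -> int:
    """Chain the stages lazily and pull only the first output."""
    log: List[str] = []
    stream: Iterator[int] = iter(chunks)
    for name in names:
        stream = stage(name, log)(stream)
    next(stream)
    return len(log)


def buffered_steps_to_first_output(chunks: Iterable[int], names: List[str]) -> int:
    """Run each stage to completion before the next one starts."""
    log: List[str] = []
    items = list(chunks)
    for name in names:
        items = list(stage(name, log)(items))
    return len(log)


names = ["stt", "llm", "tts"]
streaming = streaming_steps_to_first_output(range(4), names)  # 3 steps
buffered = buffered_steps_to_first_output(range(4), names)    # 12 steps
```

With four chunks and three stages, the streaming version produces its first output after 3 steps, while the buffered version needs all 12. The gap widens as the input gets longer, because the streaming cost to first output does not depend on the number of chunks.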