Building a Streaming Voice Pipeline: STT → LLM → TTS
A look under the hood of our intelligent front desk — how speech flows through transcription, reasoning, and synthesis in real time.

Our front desk answers a call in four stages. Each stage streams its output into the next, so no stage waits for the previous one to finish and the caller hears a reply sooner.
1. Access
The call arrives over phone or SMS. We bridge the telephony layer to our local media server, which exposes a raw audio stream.
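As a rough illustration of what consuming a raw audio stream can look like, the sketch below re-chunks arbitrarily sized reads into fixed-size frames for the next stage. The sample rate, sample width, and frame length are assumptions chosen for the example (8 kHz, 16-bit mono PCM, 20 ms frames); the media server's actual format is not specified here, and `frames` is a hypothetical helper name.

```python
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 8000   # assumption: narrowband telephony audio
BYTES_PER_SAMPLE = 2    # assumption: 16-bit linear PCM, mono
FRAME_MS = 20           # assumption: 20 ms frames

# Bytes in one frame: 8000 samples/s * 2 bytes * 0.020 s = 320
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000


def frames(stream: Iterable[bytes], frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Re-chunk reads of any size into fixed-size frames."""
    buf = bytearray()
    for data in stream:
        buf.extend(data)
        # Emit every complete frame currently sitting in the buffer.
        while len(buf) >= frame_bytes:
            yield bytes(buf[:frame_bytes])
            del buf[:frame_bytes]
    # A trailing partial frame is dropped in this sketch.
```

Fixed-size frames keep the downstream transcriber's input predictable regardless of how the telephony bridge happens to batch its reads.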
2. Speech-to-Text (STT)
A streaming ASR model transcribes audio in chunks as it arrives — not after the caller finishes speaking. Partial hypotheses let us detect intent early.
partial: "i'd like to book a"
final: "i'd like to book a meeting room for tomorrow"
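A minimal sketch of how partial hypotheses can surface intent before the final transcript arrives. The `Hypothesis` type, the keyword table, and both function names are hypothetical; a production system would more likely use a trained classifier than keyword matching.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Hypothesis:
    text: str
    is_final: bool


# Hypothetical trigger words, for illustration only.
INTENT_KEYWORDS = {"book": "book_meeting_room"}


def guess_intent(text: str) -> Optional[str]:
    """Return the intent of the first trigger word found, if any."""
    for word in text.lower().split():
        if word in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[word]
    return None


def first_intent(hyps: Iterable[Hypothesis]) -> Optional[Tuple[str, Hypothesis]]:
    """Scan hypotheses in arrival order; stop at the first one that yields an intent."""
    for hyp in hyps:
        intent = guess_intent(hyp.text)
        if intent is not None:
            return intent, hyp
    return None
```

Fed the two hypotheses from the example above, `first_intent` returns on the partial, so downstream work can begin while the caller is still speaking.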
3. LLM reasoning
The transcript is fed to the local LLM with conversation context and a system prompt. The model returns an intent, extracted entities, and a natural-language response.
intent: book_meeting_room
entities: { date: "tomorrow" }
reply: "Sure! Which room would you prefer?"
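One way to carry that structured result through the rest of the pipeline is a small typed record. The sketch below assumes the model is prompted to emit a JSON object with `intent`, `entities`, and `reply` keys; the wire format is an assumption, and `LLMResult` and `parse_result` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LLMResult:
    intent: str
    reply: str
    entities: Dict[str, str] = field(default_factory=dict)


def parse_result(raw: str) -> LLMResult:
    """Parse the model's JSON output; raises KeyError if a required key is missing."""
    data = json.loads(raw)
    return LLMResult(
        intent=data["intent"],
        reply=data["reply"],
        entities=data.get("entities", {}),  # entities are optional
    )
```

Failing loudly on a missing `intent` or `reply` makes malformed model output visible early instead of letting it reach synthesis.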
4. Text-to-Speech (TTS)
The reply is synthesized as streaming audio and played back the moment the first chunk is ready. Because TTS starts before the LLM finishes emitting tokens, the perceived latency is the time to first audio, not the time to the full response.
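Starting synthesis before the model finishes usually means cutting the token stream into speakable units. The sketch below groups tokens into sentences and hands each one off as soon as it is complete. Splitting on sentence-ending punctuation is an assumption made for the example, not necessarily how our pipeline segments text, and `sentences` is a hypothetical helper.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_END):
            yield buf.strip()
            buf = ""
    # Flush whatever remains when the token stream ends.
    if buf.strip():
        yield buf.strip()
```

With the reply from the example above, the first unit ("Sure!") is ready for synthesis after only two tokens. A naive splitter like this also breaks on abbreviations such as "Dr.", so a real segmenter needs more care.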
The whole pipeline runs on one box, so there are no network hops between stages. That's what makes a sub-500 ms time to first audio realistic.
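When every stage lives in one process, the hand-off between stages can be as simple as chained async generators. The sketch below uses toy stand-ins for each stage (the "STT" decodes bytes, the "LLM" upper-cases words, the "TTS" encodes them back) purely to show the wiring; none of these functions reflect the real models or their APIs.

```python
import asyncio
from typing import AsyncIterator, List


async def audio_source(events: List[str]) -> AsyncIterator[bytes]:
    # Toy input: three "audio chunks", each logged as it is read.
    for chunk in (b"book", b"a", b"room"):
        await asyncio.sleep(0)  # yield control, as a real audio read would
        events.append("in:" + chunk.decode())
        yield chunk


async def stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    async for chunk in audio:
        yield chunk.decode()  # stand-in for transcription


async def llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    async for word in words:
        yield word.upper()  # stand-in for reasoning


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    async for token in tokens:
        yield token.encode()  # stand-in for synthesis


async def run_pipeline() -> List[str]:
    events: List[str] = []
    # Each stage pulls from the previous one, one item at a time.
    async for audio in tts(llm(stt(audio_source(events)))):
        events.append("out:" + audio.decode())
    return events
```

Running `asyncio.run(run_pipeline())` produces an event log in which the first output appears before the last input has been read, which is the property the pipeline depends on.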
Why streaming matters
Streaming each stage — rather than buffering — is the single biggest lever on perceived latency. Callers judge responsiveness by the first word, not the last.
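The effect can be made concrete with an idealized model: if each stage starts as soon as the previous one emits its first output, time to first audio is roughly the sum of each stage's time to first output, whereas a fully buffered pipeline pays the sum of each stage's total time. The stage timings in the usage note are invented for illustration and are not measurements of our system; `Stage` and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    first_ms: int  # time until the stage emits its first output
    total_ms: int  # time until the stage emits its last output


def buffered_first_audio(stages: List[Stage]) -> int:
    """Every stage waits for the previous one to finish completely."""
    return sum(s.total_ms for s in stages)


def streaming_first_audio(stages: List[Stage]) -> int:
    """Every stage starts on the previous stage's first output (idealized)."""
    return sum(s.first_ms for s in stages)
```

For example, with made-up stages `Stage("stt", 50, 200)`, `Stage("llm", 150, 900)`, and `Stage("tts", 100, 1200)`, the buffered model gives 2300 ms to first audio and the streaming model gives 300 ms. Real stages overlap less cleanly than this, so treat the model as an upper bound on the benefit.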