# Building a Streaming Voice Pipeline: STT → LLM → TTS

A look under the hood of our intelligent front desk — how speech flows through transcription, reasoning, and synthesis in real time.

---

Our front desk answers a call in four stages. Each one streams into the next so the
caller never waits for a full stage to finish before the next begins.
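One way to picture stages that stream into each other is a chain of Python generators. The sketch below is illustrative only: `mic`, `transcribe`, `reason`, and `synthesize` are hypothetical stand-ins, not our actual components, and each one fakes its work with a trivial transformation.

```python
from typing import Iterable, Iterator, List


def mic(n_frames: int, trace: List[str]) -> Iterator[bytes]:
    """Hypothetical audio source: logs each frame it hands out."""
    for i in range(n_frames):
        trace.append(f"frame {i}")
        yield b"\x00" * 320  # silent placeholder audio


def transcribe(frames: Iterable[bytes]) -> Iterator[str]:
    """Stand-in STT: emits one fake word per audio frame."""
    for i, _frame in enumerate(frames):
        yield f"word{i}"


def reason(words: Iterable[str]) -> Iterator[str]:
    """Stand-in LLM: emits one reply token per transcript word."""
    for word in words:
        yield word.upper()


def synthesize(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stand-in TTS: emits one fake audio chunk per reply token."""
    for token in tokens:
        yield token.encode("utf-8")


def pipeline(frames: Iterable[bytes]) -> Iterator[bytes]:
    # Each stage pulls from the previous one lazily, one item at a time.
    return synthesize(reason(transcribe(frames)))


trace: List[str] = []
first_chunk = next(pipeline(mic(3, trace)))
# Only one frame has been read by the time the first audio chunk exists.
print(first_chunk, trace)
```

Because every stage is lazy, asking for the first output chunk pulls exactly one item through the whole chain, which is the property the four stages rely on.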

## 1. Access

A request arrives as a phone call or an SMS. For calls, we bridge the telephony layer to
our local media server, which exposes a raw audio stream.
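Downstream stages usually want fixed-size frames rather than whatever chunk sizes the network delivers. A minimal sketch of that re-framing follows. The audio format (8 kHz, 16-bit mono, 20 ms frames) is an assumption chosen for illustration, and `pcm_frames` is a hypothetical helper, not part of our media server.

```python
from typing import Iterable, Iterator

# Assumed format, for illustration only: 8 kHz, 16-bit mono PCM.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 320 bytes


def pcm_frames(chunks: Iterable[bytes], frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Re-slice arbitrarily sized byte chunks into fixed-size frames."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit every complete frame as soon as enough bytes have arrived.
        while len(buffer) >= frame_bytes:
            yield bytes(buffer[:frame_bytes])
            del buffer[:frame_bytes]
    # Any trailing partial frame is dropped in this sketch.


incoming = [b"\x00" * 500, b"\x00" * 200]  # uneven chunk sizes
frames = list(pcm_frames(incoming))
print(len(frames), [len(f) for f in frames])
```

A real bridge would also have to handle codecs and timing, which this sketch leaves out.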

## 2. Speech-to-Text (STT)

A streaming ASR model transcribes audio in chunks as it arrives — not after the caller
finishes speaking. Partial hypotheses let us detect intent early.

```text
partial:  "i'd like to book a"
final:    "i'd like to book a meeting room for tomorrow"
```
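To illustrate how a partial hypothesis can trigger early intent detection, here is a small sketch that uses keyword matching as a stand-in for a real classifier. The `INTENT_KEYWORDS` table, the coarse `booking` label, and both function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical keyword table; a real system would use a trained classifier.
INTENT_KEYWORDS = {"booking": ("book", "reserve")}


def early_intent(text: str) -> Optional[str]:
    """Return a coarse intent guess if any keyword appears as a whole word."""
    words = text.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return None


def first_intent(hypotheses: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Scan (kind, text) hypotheses in arrival order; stop at the first guess."""
    for kind, text in hypotheses:
        intent = early_intent(text)
        if intent is not None:
            return intent, kind
    return None


stream = [
    ("partial", "i'd like to book a"),
    ("final", "i'd like to book a meeting room for tomorrow"),
]
print(first_intent(stream))
```

In this sketch the guess is available on the partial hypothesis, before the final transcript arrives. The final transcript can then refine the coarse guess into a specific intent.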

## 3. LLM reasoning

The transcript is fed to the local LLM with conversation context and a system prompt.
The model returns an intent, extracted entities, and a natural-language response.

```text
intent:   book_meeting_room
entities: date=tomorrow
reply:    "Sure! Which room would you prefer?"
```
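Downstream code is easier to write when the model's output is parsed into a typed structure. The sketch below assumes the model has been prompted to emit a JSON object with `intent`, `entities`, and `reply` keys. That wire format is an assumption, and `LLMResult` and `parse_result` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LLMResult:
    intent: str
    reply: str
    entities: Dict[str, Any] = field(default_factory=dict)


def parse_result(raw: str) -> LLMResult:
    """Parse the model's JSON output; raises if required keys are missing."""
    data = json.loads(raw)
    return LLMResult(
        intent=data["intent"],
        reply=data["reply"],
        # Treat a missing or null entities field as "no entities extracted".
        entities=data.get("entities") or {},
    )


raw = '{"intent": "book_meeting_room", "entities": null, "reply": "Sure! Which room would you prefer?"}'
result = parse_result(raw)
print(result.intent, result.entities, result.reply)
```

Failing loudly on a missing `intent` or `reply` is a deliberate choice here: a malformed model response is easier to handle at the parsing boundary than halfway through synthesis.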

## 4. Text-to-Speech (TTS)

The reply is synthesized as streaming audio and played back the moment the first chunk is
ready. Because TTS starts before the LLM finishes emitting its reply, the perceived
latency is the time to *first audio*, not the time to *full response*.
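One common way to start TTS before the LLM finishes is to cut the token stream at sentence boundaries and hand each sentence to the synthesizer as soon as it closes. The sketch below shows that idea. `speakable_chunks` is a hypothetical helper, and splitting on `.`, `!`, and `?` is a simplifying assumption.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def speakable_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences that TTS can speak right away."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffered text ends a sentence.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached sentence-ending punctuation.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Sure", "!", " Which", " room", " would", " you", " prefer", "?"]
for sentence in speakable_chunks(tokens):
    print(sentence)
```

Here the first sentence is ready for synthesis after only two tokens, while the model is still producing the rest of the reply.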

The whole pipeline runs on one box, so no stage pays a network round trip to reach the
next. That's what makes sub-500 ms first audio realistic.

## Why streaming matters

Streaming each stage — rather than buffering — is the single biggest lever on perceived
latency. Callers judge responsiveness by the first word, not the last.
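A rough back-of-the-envelope model makes the difference concrete. It assumes every stage handles the same number of chunks at a fixed cost per chunk. The function name and the per-chunk costs below are made up for illustration and are not measurements from our system.

```python
from typing import Sequence


def first_audio_ms(stage_ms_per_chunk: Sequence[float], chunks: int, streaming: bool) -> float:
    """Estimate time until the first audio chunk under a simplified cost model."""
    per_chunk_total = sum(stage_ms_per_chunk)
    if streaming:
        # Only the first chunk has to pass through every stage.
        return per_chunk_total
    # Buffered: every stage finishes all chunks before the next one starts.
    return per_chunk_total * chunks


# Hypothetical per-chunk costs for STT, LLM, and TTS, in milliseconds.
costs = [40, 60, 50]
print(first_audio_ms(costs, chunks=10, streaming=True))
print(first_audio_ms(costs, chunks=10, streaming=False))
```

Under this model the streamed figure stays constant as the utterance gets longer, while the buffered figure grows with the number of chunks.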