App · Python · Paid

AI Voice Assistant — Smart Q&A with Voice I/O

Before: users had to type queries and read text responses — slow, inaccessible, and unnatural for mobile. After: speak a question, get an instant spoken answer. Full voice loop: speech recognition → LLM reasoning → voice synthesis, all in under two seconds.

Challenge

The client needed a conversational assistant embedded in their product — one that could answer questions about their domain, understand spoken input, and respond in natural voice. Off-the-shelf chatbot widgets covered text only. Voice required stitching together three separate AI services: speech-to-text, a language model, and text-to-speech — with latency low enough to feel like a real conversation.

Options Considered

  1. Browser Web Speech API only — free, zero latency for STT, but recognition quality was unreliable on non-English accents and mobile browsers. TTS was robotic. Rejected for quality reasons.
  2. Fully managed voice assistant platform (Voiceflow, VAPI) — fast to prototype, but locked the client into a vendor with limited customisation of the reasoning layer. Rejected to retain control over the LLM prompt and context.
  3. OpenAI Whisper + GPT-4 + ElevenLabs, custom orchestration — chosen. Best-in-class accuracy at each stage; streaming reduces perceived latency; the reasoning layer is fully customisable.

Decision

Audio captured in the browser is streamed to a Python backend. Whisper transcribes the audio in real time. The transcript is sent to GPT-4 with a system prompt scoped to the client's domain knowledge base. The response streams back token by token; as sentences complete, they are forwarded to ElevenLabs for synthesis. The resulting audio chunks play sequentially in the browser — the user hears the first sentence before the model has finished generating the full response.
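The sentence-level hand-off is the core latency trick. Below is a minimal sketch of that orchestration, assuming the OpenAI Python SDK (v1.x) for the GPT-4 stream; synthesize_speech() is a hypothetical stand-in for the ElevenLabs call, and the sentence-splitting regex is illustrative rather than the production logic.

```python
# Sketch: stream GPT-4 tokens and flush each complete sentence to TTS as it arrives,
# so the first audio chunk is ready long before the full response is generated.
import re
from openai import OpenAI

client = OpenAI()
SENTENCE_END = re.compile(r"([.!?])\s")  # naive sentence boundary, for illustration only

def synthesize_speech(sentence: str) -> bytes:
    """Hypothetical ElevenLabs wrapper: returns one audio chunk for one sentence."""
    raise NotImplementedError

def answer_and_speak(transcript: str, system_prompt: str):
    """Yield audio chunks sentence by sentence while GPT-4 is still generating."""
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # As soon as the buffer holds a complete sentence, hand it to TTS.
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[: match.end()].strip(), buffer[match.end():]
            yield synthesize_speech(sentence)
    if buffer.strip():  # flush any trailing partial sentence
        yield synthesize_speech(buffer.strip())
```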

Implementation

FastAPI backend exposes a WebSocket endpoint for audio streaming. A VAD (voice activity detection) layer on the frontend decides when the user has finished speaking before flushing the buffer to Whisper. The knowledge base is stored as vector embeddings (OpenAI Embeddings + Pinecone); relevant chunks are retrieved and injected into the GPT-4 context window at each turn. Conversation history is maintained per session to support follow-up questions. The frontend widget is a self-contained React component embeddable via a single script tag.
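As a rough shape of that server side, here is a minimal sketch: a FastAPI WebSocket receives one VAD-flushed utterance, transcribes it with Whisper, retrieves knowledge-base chunks from Pinecone, and streams synthesized audio back using the answer_and_speak() generator sketched above. The path /ws/audio, the index name kb-index, and the embedding model name are illustrative assumptions, not the production values, and the blocking OpenAI calls are kept synchronous for brevity.

```python
import os
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import OpenAI
from pinecone import Pinecone

app = FastAPI()
openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("kb-index")

def retrieve_context(query: str, top_k: int = 4) -> str:
    """Embed the query and pull the most relevant knowledge-base chunks."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    matches = index.query(vector=embedding, top_k=top_k, include_metadata=True).matches
    return "\n\n".join(m.metadata["text"] for m in matches)

@app.websocket("/ws/audio")
async def audio_ws(ws: WebSocket):
    await ws.accept()
    history = []  # per-session turns; in the full system these are folded into the GPT-4 messages
    try:
        while True:
            audio_bytes = await ws.receive_bytes()  # one utterance, flushed by the frontend VAD
            transcript = openai_client.audio.transcriptions.create(
                model="whisper-1",
                file=("utterance.webm", audio_bytes),  # (filename, bytes) upload
            ).text
            history.append({"role": "user", "content": transcript})
            context = retrieve_context(transcript)
            system_prompt = f"Answer using only this knowledge base:\n{context}"
            # answer_and_speak() is the sentence-streaming generator sketched earlier.
            for audio_chunk in answer_and_speak(transcript, system_prompt):
                await ws.send_bytes(audio_chunk)
    except WebSocketDisconnect:
        pass
```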

Outcome

End-to-end latency from end of speech to first audio syllable: under 1.8 seconds on average. Recognition accuracy on domain-specific terminology exceeded 95%. The widget deployed without changes to the host application — one script tag, zero backend coupling.

Open for contract collaboration

I am available for contract-based collaboration. If you have an interesting project idea, schedule a call via Calendly.

Schedule a 30-min call