App · Python · Paid

AI Voice Assistant — Smart Q&A with Voice I/O

Before: users had to type queries and read text responses — slow, inaccessible, and unnatural for mobile. After: speak a question, get an instant spoken answer. Full voice loop: speech recognition → LLM reasoning → voice synthesis, all in under two seconds.

Challenge

The client needed a conversational assistant embedded in their product — one that could answer questions about their domain, understand spoken input, and respond in natural voice. Off-the-shelf chatbot widgets covered text only. Voice required stitching together three separate AI services: speech-to-text, a language model, and text-to-speech — with latency low enough to feel like a real conversation.

Options Considered

  1. Browser Web Speech API only — free, zero latency for STT, but recognition quality was unreliable on non-English accents and mobile browsers. TTS was robotic. Rejected for quality reasons.
  2. Fully managed voice assistant platform (Voiceflow, VAPI) — fast to prototype, but locked the client into a vendor with limited customisation of the reasoning layer. Rejected to retain control over the LLM prompt and context.
  3. OpenAI Whisper + GPT-4 + ElevenLabs, custom orchestration — chosen. Best-in-class accuracy at each stage; streaming reduces perceived latency; the reasoning layer is fully customisable.

Decision

Audio captured in the browser is streamed to a Python backend. Whisper transcribes the audio in real time. The transcript is sent to GPT-4 with a system prompt scoped to the client's domain knowledge base. The response streams back token by token; as sentences complete, they are forwarded to ElevenLabs for synthesis. The resulting audio chunks play sequentially in the browser — the user hears the first sentence before the model has finished generating the full response.
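The sentence-level hand-off is the core latency trick. Below is a minimal sketch of that orchestration, assuming the OpenAI Python SDK (v1.x) for the GPT-4 stream; synthesize_speech() is a hypothetical stand-in for the ElevenLabs call, and the sentence-splitting regex is illustrative rather than the production logic.

```python
# Sketch: stream GPT-4 tokens and flush each complete sentence to TTS as it arrives,
# so the first audio chunk is ready long before the full response is generated.
import re
from openai import OpenAI

client = OpenAI()
SENTENCE_END = re.compile(r"([.!?])\s")  # naive sentence boundary, for illustration only

def synthesize_speech(sentence: str) -> bytes:
    """Hypothetical ElevenLabs wrapper: returns one audio chunk for one sentence."""
    raise NotImplementedError

def answer_and_speak(transcript: str, system_prompt: str):
    """Yield audio chunks sentence by sentence while GPT-4 is still generating."""
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # As soon as the buffer holds a complete sentence, hand it to TTS.
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[: match.end()].strip(), buffer[match.end():]
            yield synthesize_speech(sentence)
    if buffer.strip():  # flush any trailing partial sentence
        yield synthesize_speech(buffer.strip())
```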

Implementation

FastAPI backend exposes a WebSocket endpoint for audio streaming. A VAD (voice activity detection) layer on the frontend decides when the user has finished speaking before flushing the buffer to Whisper. The knowledge base is stored as vector embeddings (OpenAI Embeddings + Pinecone); relevant chunks are retrieved and injected into the GPT-4 context window at each turn. Conversation history is maintained per session to support follow-up questions. The frontend widget is a self-contained React component embeddable via a single script tag.
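As a rough shape of that server side, here is a minimal sketch: a FastAPI WebSocket receives one VAD-flushed utterance, transcribes it with Whisper, retrieves knowledge-base chunks from Pinecone, and streams synthesized audio back using the answer_and_speak() generator sketched above. The path /ws/audio, the index name kb-index, and the embedding model name are illustrative assumptions, not the production values, and the blocking OpenAI calls are kept synchronous for brevity.

```python
import os
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import OpenAI
from pinecone import Pinecone

app = FastAPI()
openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("kb-index")

def retrieve_context(query: str, top_k: int = 4) -> str:
    """Embed the query and pull the most relevant knowledge-base chunks."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    matches = index.query(vector=embedding, top_k=top_k, include_metadata=True).matches
    return "\n\n".join(m.metadata["text"] for m in matches)

@app.websocket("/ws/audio")
async def audio_ws(ws: WebSocket):
    await ws.accept()
    history = []  # per-session turns; in the full system these are folded into the GPT-4 messages
    try:
        while True:
            audio_bytes = await ws.receive_bytes()  # one utterance, flushed by the frontend VAD
            transcript = openai_client.audio.transcriptions.create(
                model="whisper-1",
                file=("utterance.webm", audio_bytes),  # (filename, bytes) upload
            ).text
            history.append({"role": "user", "content": transcript})
            context = retrieve_context(transcript)
            system_prompt = f"Answer using only this knowledge base:\n{context}"
            # answer_and_speak() is the sentence-streaming generator sketched earlier.
            for audio_chunk in answer_and_speak(transcript, system_prompt):
                await ws.send_bytes(audio_chunk)
    except WebSocketDisconnect:
        pass
```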

Outcome

End-to-end latency from end of speech to first audio syllable: under 1.8 seconds on average. Recognition accuracy on domain-specific terminology exceeded 95%. The widget deployed without changes to the host application — one script tag, zero backend coupling.

Open for contract collaboration

I am available for contract-based collaboration. If you have an interesting project idea, schedule a call via Calendly.

Schedule a 30-min call