Building a voice consultation engine that doesn't make pharmacists babysit it
Three problems — hear it, know when the speaker stopped, know what they meant. Six small models in a pipeline. One human safety net. The pharmacist never tunes anything.
The brief was unusual: replace a typed pharmacy consultation form with voice, in a Nigerian pharmacy where two people sit across a desk in a noisy room. The pharmacist asks the questions the AI generates; the patient answers. The system must hear the answer, decide which on-screen question it belongs to, and quietly cross it off the list.
Three problems, in order: hear it, know when the speaker stopped, know what they meant.
Hearing it
Browsers ship the Web Speech API; we use it as the baseline. When a device has a Deepgram key we bump to nova-2 over WebSocket; on devices with WebGPU and ≥ 4 GB RAM a Whisper-WebGPU scaffold takes over. An adaptive pipeline picks the highest tier the device supports.
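The tier selection can be sketched as a pure function over device capabilities. Names and the exact precedence order are illustrative, not the production code; the thresholds (Deepgram key present, WebGPU plus ≥ 4 GB RAM) are from the text.

```typescript
type AsrTier = "web-speech" | "deepgram-nova2" | "whisper-webgpu";

interface DeviceCaps {
  hasDeepgramKey: boolean;
  hasWebGPU: boolean;
  deviceMemoryGB: number; // e.g. from navigator.deviceMemory, where available
}

function pickAsrTier(caps: DeviceCaps): AsrTier {
  // Local Whisper over WebGPU takes over on capable hardware.
  if (caps.hasWebGPU && caps.deviceMemoryGB >= 4) return "whisper-webgpu";
  // Otherwise stream to Deepgram nova-2 if a key is provisioned.
  if (caps.hasDeepgramKey) return "deepgram-nova2";
  // Baseline that every browser ships.
  return "web-speech";
}
```

A capability probe runs once at startup, so every later mic session reuses the same decision.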
In a noisy pharmacy the hard part isn't the ASR — it's the audio that reaches the ASR. So we wired RNNoise (~112 KB WASM) upstream of the VAD to strip AC hum and chair scrapes before any model sees the buffer. For phone-call mode we deliberately turn off browser noise suppression so the patient's voice isn't smothered. The whole constraint matrix lives in one factory so every mic acquisition routes through the same source of truth.
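The constraint factory mentioned above might look like the following sketch (the shape and names are assumptions, not the actual codebase): one function owns the mic constraints, and phone-call mode is where browser noise suppression gets switched off so RNNoise sees the raw signal.

```typescript
type MicMode = "in-person" | "phone-call";

// Local stand-in for the browser's MediaTrackConstraints shape,
// so this sketch is self-contained.
interface AudioConstraints {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

function micConstraints(mode: MicMode): AudioConstraints {
  return {
    echoCancellation: true,
    // Phone-call mode: browser suppression off, RNNoise handles denoising
    // upstream so the patient's voice isn't smothered.
    noiseSuppression: mode !== "phone-call",
    autoGainControl: true,
  };
}
```

Every `getUserMedia` call would route through this one factory, which is what makes it a single source of truth.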
Knowing when the speaker stopped
Every previous version used a 1.5 s debounce on the final transcript, which fires wrongly when a patient pauses mid-sentence ("I have… long pause… a headache"). We replaced the debounce with Silero v5, a 2 MB neural VAD that emits onSpeechStart and onSpeechEnd directly. Then we layered smart-turn-v3 on top: when Silero says "speech ended", smart-turn classifies the captured audio against an end-of-turn model. If P(EOU) < 0.5 we wait for the next silence; above 0.5 we run intent.
mic ─► RNNoise ─► Silero VAD ─► smart-turn ─► runIntent
                  (onSpeechEnd)  (EOU gate)
This single change cut false matcher fires by an order of magnitude on our calibration set.
Knowing what they meant
The matcher is a deberta-v3-small NLI model running in the browser. It scores the candidate transcript against every visible question. Above 0.65 we auto-answer; in [0.55, 0.65) we record an "abstain" — the question stays on screen and the pharmacist confirms or corrects.
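The matcher's decision bands, sketched as a pure function (thresholds from the text, names illustrative):

```typescript
type MatchAction = "auto-answer" | "abstain" | "no-match";

function matchDecision(nliScore: number): MatchAction {
  if (nliScore >= 0.65) return "auto-answer"; // cross the question off
  if (nliScore >= 0.55) return "abstain"; // stays on screen; pharmacist decides
  return "no-match";
}
```

The abstain band is deliberate: a borderline score costs the pharmacist one confirmation tap instead of a silent wrong answer.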
Before any matching runs, a separate model decides whether the speaker was reading the question or answering it. Pharmacists read questions aloud; we don't want to cross one off because the pharmacist just said its words. The role classifier asks an on-device LLM (Chrome AI when present, SmolLM2 otherwise) and falls back to NLI similarity.
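The fallback chain for the role classifier could be wired like this sketch (backend names come from the text; the interface and wiring are assumptions). Each backend either decides a role, declines with null, or throws because it's unavailable on the device; we fall through in order.

```typescript
type Role = "reading" | "answering";

// A backend returns a role, or null when it cannot decide.
type RoleBackend = (transcript: string, question: string) => Role | null;

function classifyRole(
  transcript: string,
  question: string,
  backends: RoleBackend[], // e.g. [chromeAI, smolLM2, nliSimilarityFallback]
): Role {
  for (const backend of backends) {
    try {
      const role = backend(transcript, question);
      if (role !== null) return role;
    } catch {
      // Backend missing on this device (e.g. no Chrome AI): try the next one.
    }
  }
  // The NLI similarity fallback should always decide; default defensively.
  return "answering";
}
```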
This is where the production bug landed.
The bug
A pharmacist reported the matcher crossing off questions when the patient said "hello". The NLI was scoring "hello" against an 8-word question at 0.66, enough to flip the role classifier into READING, which then blocked the next real answer. The model wasn't broken; it was over-trusted. A single greeting word can't semantically paraphrase an 8-word question, no matter what cosine says.
The fix was a length-ratio guard. A candidate cannot be a reading if it has fewer than 3 words, or its segment-to-question word ratio is below 0.4 or above 2.0. We pinned the rule with 19 regression tests sourced verbatim from the production logs that day.
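The guard itself is a few lines. The constants (minimum 3 words, ratio bounds 0.4 and 2.0) are from the text; the function name is illustrative.

```typescript
// Cheap structural sanity check that runs before the NLI score is trusted:
// a candidate segment can only be a "reading" of the question if its length
// is plausibly comparable to the question's.
function couldBeReading(segment: string, question: string): boolean {
  const segWords = segment.trim().split(/\s+/).filter(Boolean).length;
  const qWords = question.trim().split(/\s+/).filter(Boolean).length;
  if (segWords < 3) return false; // "hello" can never count as a reading
  const ratio = segWords / qWords;
  return ratio >= 0.4 && ratio <= 2.0;
}
```

Note the shape of the fix: the model's score is untouched; a deterministic precondition simply bounds where that score is allowed to matter.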
The safety net
Even an "infallible" matcher will eventually misfire. We added a 5-second undo toast that fires every time the matcher auto-commits. Click it within five seconds and the question reappears at the head of the queue. We don't retract the server reply — the next answer naturally supersedes — so undo costs the pharmacist one tap, not a round-trip.
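The undo bookkeeping is small enough to sketch. The 5-second window is from the text; the class shape and the injected clock are illustrative (the clock makes the logic testable without real timers).

```typescript
const UNDO_WINDOW_MS = 5_000;

class ConsultationQueue {
  private lastCommit: { question: string; at: number } | null = null;

  constructor(
    public pending: string[],
    private now: () => number = Date.now,
  ) {}

  /** Matcher auto-committed an answer: remove the question, arm the toast. */
  commit(question: string): void {
    this.pending = this.pending.filter((q) => q !== question);
    this.lastCommit = { question, at: this.now() };
  }

  /** Toast clicked: within the window, the question reappears at the head. */
  undo(): boolean {
    if (!this.lastCommit) return false;
    if (this.now() - this.lastCommit.at > UNDO_WINDOW_MS) return false;
    this.pending.unshift(this.lastCommit.question);
    this.lastCommit = null;
    return true;
  }
}
```

Nothing here talks to the server, which is the point: undo is purely client-side state, and the next answer supersedes the stale one.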
The active card also gets a small breathing green dot the moment Silero detects speech, so the pharmacist sees the system noticed them. Reduced-motion users see a static dot. A diagnostic chip behind a flag exposes the last decision's role, score, and VAD method for QA.
The discipline
Every new model ships in shadow first — parallel, telemetry-only, no influence on user-visible behaviour. A bi-encoder role classifier, a WebGPU Whisper shadow, an incremental NLU on partial transcripts: all running today, all emitting Mixpanel events, all silent in the UI until we have ≥ 500 sessions of disagreement data. The matcher we use is the one we earned the right to use.
Three problems. Six small models in a pipeline. One human safety net. The pharmacist never tunes anything.