Building a voice consultation engine that doesn't make pharmacists babysit it
Three problems — hear it, know when the speaker stopped, know what they meant. Six small models in a pipeline. One human safety net. The pharmacist never tunes anything.
The brief was unusual: replace a typed pharmacy consultation form with voice, in a Nigerian pharmacy where two people sit across a desk in a noisy room. The pharmacist asks the questions the AI generates; the patient answers. The system must hear the answer, decide which on-screen question it belongs to, and quietly cross it off the list.
Three problems, in order: hear it, know when the speaker stopped, know what they meant.
Hearing it
Browsers ship the Web Speech API; we use it as the baseline. When a device has a Deepgram key we bump to nova-2 over WebSocket; on devices with WebGPU and ≥ 4 GB RAM a Whisper-WebGPU scaffold takes over. An adaptive pipeline picks the highest tier the device supports.
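The tier selection can be sketched as a pure function over device capabilities. Names and the exact precedence order are illustrative, not the production code; the thresholds (Deepgram key present, WebGPU plus ≥ 4 GB RAM) are from the text.

```typescript
type AsrTier = "web-speech" | "deepgram-nova2" | "whisper-webgpu";

interface DeviceCaps {
  hasDeepgramKey: boolean;
  hasWebGPU: boolean;
  deviceMemoryGB: number; // e.g. from navigator.deviceMemory, where available
}

function pickAsrTier(caps: DeviceCaps): AsrTier {
  // Local Whisper over WebGPU takes over on capable hardware.
  if (caps.hasWebGPU && caps.deviceMemoryGB >= 4) return "whisper-webgpu";
  // Otherwise stream to Deepgram nova-2 if a key is provisioned.
  if (caps.hasDeepgramKey) return "deepgram-nova2";
  // Baseline that every browser ships.
  return "web-speech";
}
```

A capability probe runs once at startup, so every later mic session reuses the same decision.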
In a noisy pharmacy the hard part isn't the ASR — it's the audio that reaches the ASR. So we wired RNNoise (~112 KB WASM) upstream of the VAD to strip AC hum and chair scrapes before any model sees the buffer. For phone-call mode we deliberately turn off browser noise suppression so the patient's voice isn't smothered. The whole constraint matrix lives in one factory so every mic acquisition routes through the same source of truth.
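The constraint factory mentioned above might look like the following sketch (the shape and names are assumptions, not the actual codebase): one function owns the mic constraints, and phone-call mode is where browser noise suppression gets switched off so RNNoise sees the raw signal.

```typescript
type MicMode = "in-person" | "phone-call";

// Local stand-in for the browser's MediaTrackConstraints shape,
// so this sketch is self-contained.
interface AudioConstraints {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

function micConstraints(mode: MicMode): AudioConstraints {
  return {
    echoCancellation: true,
    // Phone-call mode: browser suppression off, RNNoise handles denoising
    // upstream so the patient's voice isn't smothered.
    noiseSuppression: mode !== "phone-call",
    autoGainControl: true,
  };
}
```

Every `getUserMedia` call would route through this one factory, which is what makes it a single source of truth.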
Knowing when the speaker stopped
Every previous version used a 1.5 s debounce on the final transcript, which fires wrongly when a patient pauses mid-sentence ("I have… long pause… a headache"). We replaced the debounce with Silero v5, a 2 MB neural VAD that emits onSpeechStart and onSpeechEnd directly. Then we layered smart-turn-v3 on top: when Silero says "speech ended", smart-turn classifies the captured audio against an end-of-turn model. If P(EOU) < 0.5 we wait for the next silence; above 0.5 we run intent.
mic ─► RNNoise ─► Silero VAD ─► smart-turn ─► runIntent
                  (onSpeechEnd)  (EOU gate)
This single change cut false matcher fires by an order of magnitude on our calibration set.
Knowing what they meant
The matcher is a deberta-v3-small NLI model running in the browser. It scores the candidate transcript against every visible question. Above 0.65 we auto-answer; in [0.55, 0.65) we record an "abstain" — the question stays on screen and the pharmacist confirms or corrects.
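The matcher's decision bands, sketched as a pure function (thresholds from the text, names illustrative):

```typescript
type MatchAction = "auto-answer" | "abstain" | "no-match";

function matchDecision(nliScore: number): MatchAction {
  if (nliScore >= 0.65) return "auto-answer"; // cross the question off
  if (nliScore >= 0.55) return "abstain"; // stays on screen; pharmacist decides
  return "no-match";
}
```

The abstain band is deliberate: a borderline score costs the pharmacist one confirmation tap instead of a silent wrong answer.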
Before any matching runs, a separate model decides whether the speaker was reading the question or answering it. Pharmacists read questions aloud; we don't want to cross one off because the pharmacist just said its words. The role classifier asks an on-device LLM (Chrome AI when present, SmolLM2 otherwise) and falls back to NLI similarity.
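The fallback chain for the role classifier could be wired like this sketch (backend names come from the text; the interface and wiring are assumptions). Each backend either decides a role, declines with null, or throws because it's unavailable on the device; we fall through in order.

```typescript
type Role = "reading" | "answering";

// A backend returns a role, or null when it cannot decide.
type RoleBackend = (transcript: string, question: string) => Role | null;

function classifyRole(
  transcript: string,
  question: string,
  backends: RoleBackend[], // e.g. [chromeAI, smolLM2, nliSimilarityFallback]
): Role {
  for (const backend of backends) {
    try {
      const role = backend(transcript, question);
      if (role !== null) return role;
    } catch {
      // Backend missing on this device (e.g. no Chrome AI): try the next one.
    }
  }
  // The NLI similarity fallback should always decide; default defensively.
  return "answering";
}
```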
This is where the production bug landed.
The bug
A pharmacist reported the matcher crossing off questions when the patient said "hello". The NLI was scoring "hello" against an 8-word question at 0.66, enough to flip the role classifier into READING, which then blocked the next real answer. The model wasn't broken; it was over-trusted. A single greeting word can't semantically paraphrase an 8-word question, no matter what cosine says.
The fix was a length-ratio guard. A candidate cannot be a reading if it has fewer than 3 words, or its segment-to-question word ratio is below 0.4 or above 2.0. We pinned the rule with 19 regression tests sourced verbatim from the production logs that day.
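The guard itself is a few lines. The constants (minimum 3 words, ratio bounds 0.4 and 2.0) are from the text; the function name is illustrative.

```typescript
// Cheap structural sanity check that runs before the NLI score is trusted:
// a candidate segment can only be a "reading" of the question if its length
// is plausibly comparable to the question's.
function couldBeReading(segment: string, question: string): boolean {
  const segWords = segment.trim().split(/\s+/).filter(Boolean).length;
  const qWords = question.trim().split(/\s+/).filter(Boolean).length;
  if (segWords < 3) return false; // "hello" can never count as a reading
  const ratio = segWords / qWords;
  return ratio >= 0.4 && ratio <= 2.0;
}
```

Note the shape of the fix: the model's score is untouched; a deterministic precondition simply bounds where that score is allowed to matter.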
The safety net
Even an "infallible" matcher will eventually misfire. We added a 5-second undo toast that fires every time the matcher auto-commits. Click it within five seconds and the question reappears at the head of the queue. We don't retract the server reply — the next answer naturally supersedes — so undo costs the pharmacist one tap, not a round-trip.
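The undo bookkeeping is small enough to sketch. The 5-second window is from the text; the class shape and the injected clock are illustrative (the clock makes the logic testable without real timers).

```typescript
const UNDO_WINDOW_MS = 5_000;

class ConsultationQueue {
  private lastCommit: { question: string; at: number } | null = null;

  constructor(
    public pending: string[],
    private now: () => number = Date.now,
  ) {}

  /** Matcher auto-committed an answer: remove the question, arm the toast. */
  commit(question: string): void {
    this.pending = this.pending.filter((q) => q !== question);
    this.lastCommit = { question, at: this.now() };
  }

  /** Toast clicked: within the window, the question reappears at the head. */
  undo(): boolean {
    if (!this.lastCommit) return false;
    if (this.now() - this.lastCommit.at > UNDO_WINDOW_MS) return false;
    this.pending.unshift(this.lastCommit.question);
    this.lastCommit = null;
    return true;
  }
}
```

Nothing here talks to the server, which is the point: undo is purely client-side state, and the next answer supersedes the stale one.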
The active card also gets a small breathing green dot the moment Silero detects speech, so the pharmacist sees the system noticed them. Reduced-motion users see a static dot. A diagnostic chip behind a flag exposes the last decision's role, score, and VAD method for QA.
The discipline
Every new model ships in shadow first — parallel, telemetry-only, no influence on user-visible behaviour. A bi-encoder role classifier, a WebGPU Whisper shadow, an incremental NLU on partial transcripts: all running today, all emitting Mixpanel events, all silent in the UI until we have ≥ 500 sessions of disagreement data. The matcher we use is the one we earned the right to use.
Three problems. Six small models in a pipeline. One human safety net. The pharmacist never tunes anything.