Whisper — OpenAI's Open Speech-to-Text for Local Healthcare AI

Languages

99

Whisper supports speech-to-text and translation across 99 languages, trained on 680,000 hours of multilingual audio. English accuracy is strong on clean audio; non-English and accented English accuracy drops materially.

Hallucination rate

~1% – >8/10

~1% of segments in a formal ACM FAccT 2024 study; far higher in independent informal tests (one University of Michigan researcher found hallucinations in 8 of 10 sample audio files). Healthcare-specific failure modes documented in 2024–25 press.

Sizes

5 + turbo

tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1550M), and the newer large-v3-turbo. Larger models trade compute for accuracy; "turbo" trades a small accuracy hit for a roughly 8× speedup relative to large-v3.

OpenAI's own warning

High-risk

OpenAI's documentation explicitly recommends against using Whisper in "high-risk domains" and "decision-making contexts." Healthcare is a high-risk domain. The buyer's safety plan must assume the warning is binding.

What Whisper actually is

Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in 2022 under the MIT license. It is a transformer-based encoder-decoder trained on 680,000 hours of multilingual web audio with timestamps and translation pairs. That scale was uncommon for an open speech model and is the reason Whisper became the de facto open ASR baseline. Models come in five base sizes (tiny / base / small / medium / large-v3) plus a faster large-v3-turbo variant. Two popular reimplementations materially improve inference speed: faster-whisper (CTranslate2-based, CPU and CUDA) and whisper.cpp (C/C++, Apple Silicon and embedded devices).

The healthcare buyer relevance is double-edged. On the positive side, Whisper runs locally (no audio leaves your machine unless you send it somewhere), and the model quality is high enough that it powers a significant share of ambient AI scribes in production (Nabla is the most-cited example). On the negative side, OpenAI's training data was scraped audio, and the model has a documented tendency to "hallucinate" plausible-sounding sentences when audio is silent, noisy, or in an accent it handles poorly. These failures are exactly the most dangerous ones in clinical settings. Reporting in 2024–25 (Healthcare Brew, Fortune, PBS, Tom's Hardware) documented racial commentary, violent rhetoric, and imagined medical treatments appearing in Whisper-generated medical transcripts. OpenAI's own documentation says not to use it in high-risk domains. Healthcare is one.

What this means for a buyer: Whisper can be used in a hospital stack, but only with safeguards. The original audio must be retained for fact-checking. The transcription must be reviewed before any downstream draft is produced. The clinical workflow must assume the transcript is preliminary, not authoritative. Treat Whisper as an audio assistant, not as a source of truth.
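The review-before-draft rule above can be enforced in code rather than left to policy. The following is a minimal sketch, assuming a hypothetical `Transcript` record and a hypothetical `text_for_drafting` gate; none of these names come from Whisper or from any vendor product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    """Hypothetical record for one transcribed encounter."""
    text: str
    audio_path: str                     # original audio, retained for fact-checking
    reviewed_by: Optional[str] = None   # clinician ID, set only after review


class UnreviewedTranscriptError(RuntimeError):
    """Raised when a downstream step is attempted on an unreviewed transcript."""


def mark_reviewed(transcript: Transcript, clinician_id: str) -> Transcript:
    # Record who checked the transcript against the audio.
    if not clinician_id:
        raise ValueError("clinician_id is required")
    transcript.reviewed_by = clinician_id
    return transcript


def text_for_drafting(transcript: Transcript) -> str:
    # Gate: release text to note drafting only after a clinician has reviewed it.
    if transcript.reviewed_by is None:
        raise UnreviewedTranscriptError("transcript is preliminary; review required")
    return transcript.text
```

Routing every downstream consumer through one gate function keeps the "preliminary, not authoritative" rule in a single place that can be audited.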

Deployment posture

Whisper is distributed as model weights (PyTorch + Hugging Face) and runs locally on CPU, CUDA, ROCm, Apple Silicon (via whisper.cpp), and a range of edge accelerators. The standard production patterns are faster-whisper for CPU-and-CUDA serving with batching, whisper.cpp for Apple Silicon and embedded devices, and the original OpenAI repo for reference / accuracy comparisons. Real-time transcription typically uses a streaming wrapper that chunks audio into 30-second windows with overlap and runs incremental decoding.
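The windowing step of such a streaming wrapper can be sketched as follows. This is an illustration only: the 16 kHz sample rate matches the input Whisper expects, but the 5-second overlap and the function name `window_bounds` are assumptions, and real wrappers add voice-activity detection and transcript stitching on top.

```python
from typing import Iterator, Tuple


def window_bounds(total_samples: int, sample_rate: int = 16_000,
                  window_s: float = 30.0,
                  overlap_s: float = 5.0) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices for overlapping windows."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be non-negative and shorter than the window")
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        yield start, end
        if end == total_samples:
            break  # the final (possibly short) window has been emitted
        start += step


# 70 seconds of 16 kHz audio splits into three windows, each overlapping
# the previous one by 5 seconds.
bounds = list(window_bounds(70 * 16_000))
# bounds == [(0, 480000), (400000, 880000), (800000, 1120000)]
```

The overlap exists so that a word cut at one window boundary appears whole in the next window; deduplicating the overlapping text is left to the wrapper.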

SURFACE

Library + CLI + streaming

Original OpenAI Python library + CLI. faster-whisper for high-throughput CUDA serving. whisper.cpp for Apple Silicon and CPU. Streaming wrappers (such as whisper-live) for real-time use; WhisperX for word-level alignment and diarization.

HARDWARE

Modest by LLM standards

Whisper-large-v3 runs comfortably on a single workstation GPU or a 16 GB RAM Apple Silicon Mac. Quantized variants run on CPU. Hardware cost is far lower than an LLM serving stack.

LATENCY

Real-time achievable

With faster-whisper + INT8 on a single GPU, large-v3 can comfortably keep up with real-time speech with a latency budget that suits clinical workflow. Turbo cuts this further.
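Whether a given machine "keeps up" can be checked by measuring the real-time factor (processing time divided by audio duration). The sketch below assumes faster-whisper is installed and a CUDA GPU is available; `transcribe_local` and `real_time_factor` are illustrative names, and the `int8_float16` compute type is one plausible INT8 setting, not a recommendation from the source.

```python
import time


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 keeps up with speech."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def transcribe_local(audio_path: str, model_size: str = "large-v3"):
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU with INT8 weights and FP16 activations.
    # On a CPU-only machine, device="cpu" with compute_type="int8" is the analogue.
    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")

    started = time.perf_counter()
    segments, info = model.transcribe(audio_path, beam_size=5)
    segments = list(segments)  # decoding is lazy; iterating runs it
    elapsed = time.perf_counter() - started

    return segments, real_time_factor(elapsed, info.duration)
```

For example, 15 seconds of processing for a 60-second recording gives `real_time_factor(15.0, 60.0) == 0.25`. Model load time is excluded from the measurement here because a serving process loads the model once.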

RETENTION

Keep the audio

The single most important deployment decision: retain the original audio long enough that suspicious transcripts can be verified against the source. Vendors that delete audio immediately remove the safety net.
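One way to make retained audio verifiable is to store a content hash alongside each transcript, so a reviewer can confirm that the file on disk is the one the transcript was made from. This is a minimal sketch; the record fields and function names are hypothetical, and a real deployment would store the record in its audit system rather than as a JSON string.

```python
import hashlib
import json
from datetime import datetime, timezone


def audio_fingerprint(audio_bytes: bytes) -> str:
    """SHA-256 of the raw audio, used to tie a transcript to its source."""
    return hashlib.sha256(audio_bytes).hexdigest()


def retention_record(audio_bytes: bytes, audio_path: str, transcript_text: str) -> str:
    # One JSON document per transcript; the field names are illustrative.
    record = {
        "audio_path": audio_path,
        "audio_sha256": audio_fingerprint(audio_bytes),
        "transcript_sha256": hashlib.sha256(transcript_text.encode("utf-8")).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


def audio_matches(record_json: str, audio_bytes: bytes) -> bool:
    # Verify that the retained audio is the file the transcript was made from.
    return json.loads(record_json)["audio_sha256"] == audio_fingerprint(audio_bytes)
```

A failed `audio_matches` check means the audio was replaced or corrupted after transcription, which is itself worth alerting on.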

Healthcare fit

Whisper is the right speech engine when the workflow tolerates a transcript-plus-clinician-review pattern and the alternative would be sending audio to a third-party cloud. It is the wrong engine when "transcription is authoritative" — i.e., when nobody reads the transcript before it becomes part of a clinical record. The published 2024–25 evidence on Whisper hallucinations in medical settings is unambiguous: the failure modes happen in exactly the kind of audio (silent gaps, code-switching, dialect, noisy clinical environments) common to real patient encounters.

Good fit: ambient scribe pilots where the transcript is reviewed before a note is drafted. Internal dictation tools with mandatory human review. Research transcription where the original audio is retained.
Good fit: multilingual hospital settings. Whisper's 99-language coverage is genuinely unusual and often the deciding factor over commercial speech APIs that charge per language.
Good fit: de-identification batch jobs over historical clinical audio, where accuracy plus a review-required workflow is the right shape.
Bad fit: any workflow that treats the transcript as authoritative without human review. OpenAI's own documentation tells you not to do this.
Bad fit: dropping the original audio after transcription. The recoverability path is the safety mechanism.
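Because the documented failures cluster around silence and noise, a reviewer's attention can be steered toward the riskiest segments. The sketch below relies on the per-segment fields that Whisper implementations expose (`no_speech_prob`, `avg_logprob`, `compression_ratio`); the thresholds mirror the defaults in OpenAI's reference decoder, and using them as a review heuristic is an assumption. It narrows where to look first and does not detect every hallucination.

```python
from types import SimpleNamespace
from typing import Iterable, List


def flag_for_review(segments: Iterable, no_speech_max: float = 0.6,
                    logprob_min: float = -1.0,
                    compression_max: float = 2.4) -> List[str]:
    """Return one 'start-end: reasons' string per suspicious segment."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg.no_speech_prob > no_speech_max:
            reasons.append("likely silence")
        if seg.avg_logprob < logprob_min:
            reasons.append("low confidence")
        if seg.compression_ratio > compression_max:
            reasons.append("repetitive text")
        if reasons:
            flagged.append(f"{seg.start:.1f}-{seg.end:.1f}: {', '.join(reasons)}")
    return flagged


# Stand-in segments for illustration; real ones come from the ASR engine.
demo = [
    SimpleNamespace(start=0.0, end=4.0, no_speech_prob=0.02,
                    avg_logprob=-0.2, compression_ratio=1.4),
    SimpleNamespace(start=4.0, end=9.5, no_speech_prob=0.9,
                    avg_logprob=-1.3, compression_ratio=1.1),
]
print(flag_for_review(demo))  # ['4.0-9.5: likely silence, low confidence']
```

Unflagged segments still go through clinician review; the flags only set the order of attention.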

Privacy and governance

Whisper's privacy story is straightforward: the model is open-weight and downloadable, all inference can happen on hospital-controlled hardware, and transcription itself requires no outbound calls of any kind. That puts Whisper in a strong position for HIPAA / PIPEDA / PHIPA / Quebec Law 25 environments at the privacy layer. The harder governance question is safety: hallucinations in medical audio are documented and patterned, OpenAI tells you not to deploy in high-risk domains, and a workflow that ignores both warnings is the workflow that ends up in the press for the wrong reasons.

The Moneli Automation pattern for Whisper is: retain original audio for the same retention period as the chart, build the workflow with mandatory clinician review before any transcript-derived note is signed, monitor sample edit distance and hallucination rate against a held-out audio set, and have explicit stop conditions if either drifts. The engine is fine; the workflow around it is the work.
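The monitoring step can be sketched as a word-level edit distance against reference transcripts plus a simple stop rule. This is an illustration under stated assumptions: the function names are hypothetical, and the limits (15% mean word error rate, 1% hallucinated samples) are placeholders that each deployment would set from its own baseline.

```python
from typing import List, Sequence


def word_edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(ref, hyp) / len(ref)


def should_stop(wer_samples: List[float], hallucinated: int, total: int,
                wer_limit: float = 0.15,
                hallucination_limit: float = 0.01) -> bool:
    # Stop if either monitored metric drifts past its limit on the held-out set.
    if not wer_samples or total <= 0:
        raise ValueError("need at least one evaluated sample")
    mean_wer = sum(wer_samples) / len(wer_samples)
    return mean_wer > wer_limit or (hallucinated / total) > hallucination_limit
```

As a usage example, `word_error_rate("the patient denies chest pain", "the patient denies chest pain and fever")` returns 0.4: two inserted words against a five-word reference. Inserted words are the signature of hallucination, so a variant that counts insertions separately would be a natural refinement.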

Strengths and limitations

STRENGTHS

Why hospital stacks pick it

Best open-source speech-to-text accuracy on clean English audio. 99-language coverage including many languages commercial APIs charge a premium for. Runs locally with modest hardware. Free under MIT license. Active reimplementation ecosystem (faster-whisper, whisper.cpp, WhisperX) for production-grade serving. Powers a significant share of ambient AI scribes in production, meaning the engine itself has been battle-tested at scale even when the surrounding workflow varies.

LIMITATIONS

Where it does not fit

Documented hallucinations in medical audio — invented sentences, racial commentary, imagined treatments — covered in major 2024–25 reporting. OpenAI's own documentation warns against high-risk-domain use. Accuracy drops on accented English, non-English, and noisy clinical audio. No speaker diarization out of the box (use WhisperX or pyannote). No language model conditioning on medical vocabulary by default — fine-tune or post-process for terminology accuracy.

Where Whisper fits in a hospital stack

Layer	What Whisper contributes	What still has to be solved
Speech-to-text	Best open-weight ASR available, runs locally, multilingual.	Hallucination safeguards, original-audio retention, review-required workflow.
Diarization	None natively — pair with WhisperX or pyannote.audio.	Speaker attribution accuracy in clinical settings, multi-party audio handling.
Domain accuracy	Strong baseline; medical-terminology accuracy improves materially with fine-tuning or post-processing.	Vocabulary lists, prompt biasing, fine-tuning data curation, evaluation against held-out clinical audio.
Note drafting	None — Whisper produces a transcript, not a note. Pair with MedGemma or another LLM via vLLM.	Note-generation prompt, clinical review workflow, audit trail.
Safety / governance	None — operator must own retention, review, and stop conditions.	Audit retention, sample-edit-distance monitoring, hallucination evaluation, OpenAI's "do not deploy in high-risk" caveat acknowledged in writing.
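For the domain-accuracy row, one lightweight form of post-processing is to flag words that closely resemble a term on a vocabulary list without matching it exactly. The sketch below uses Python's standard `difflib`; the function name, the 0.8 similarity cutoff, and the two-drug vocabulary are assumptions for illustration. It suggests candidates for the reviewer and deliberately does not rewrite the transcript.

```python
import difflib
from typing import Dict, List


def near_miss_terms(transcript: str, vocabulary: List[str],
                    cutoff: float = 0.8) -> Dict[str, str]:
    """Map transcript words to the vocabulary term they closely resemble."""
    vocab = sorted({term.lower() for term in vocabulary})
    suggestions = {}
    for raw in transcript.split():
        word = raw.strip(".,;:!?()").lower()
        if not word or word in vocab:
            continue  # exact vocabulary hits need no suggestion
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        if match:
            suggestions[word] = match[0]
    return suggestions


print(near_miss_terms("Start metformin and lisinoprel daily.",
                      ["metformin", "lisinopril"]))
# {'lisinoprel': 'lisinopril'}
```

Automatic correction is avoided on purpose: two real drug names can be a single character apart, so the final call stays with the clinician.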

Whisper is the local-ASR baseline. It is the right engine for many private healthcare workflows — paired with the safeguards OpenAI's own documentation demands. Treat the engine as preliminary; let the workflow around it be authoritative.

Quick facts

Project	Whisper (OpenAI, open-source, MIT license). GitHub: openai/whisper.
Architecture	Transformer encoder-decoder ASR trained on 680,000 hours of multilingual audio.
Sizes	tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1550M), large-v3-turbo. Production default for medical: large-v3 or large-v3-turbo.
Languages	99 languages. Strongest accuracy on English; degrades on accents, code-switching, and underrepresented languages.
Notable reimplementations	faster-whisper (CTranslate2-based, CPU + CUDA), whisper.cpp (C/C++, Apple Silicon and embedded), WhisperX (diarization + alignment).
Production users	Nabla (~85,000 clinicians via downstream product). ~30,000 clinicians and ~40 U.S. health systems use Whisper-powered tools according to 2024 reporting.
Known failure modes	Invented sentences in silent gaps; racial / violent text generation in noisy audio; degraded accuracy on accents and non-English; medical terminology errors without fine-tuning.
Website	github.com/openai/whisper

Use Whisper with the safeguards OpenAI's own docs demand

Whisper is a powerful local-ASR engine and a documented hallucination risk in medical audio at the same time. Moneli Automation's typical pattern is to deploy Whisper behind a workflow that retains the original audio, enforces clinician review of the transcript, monitors hallucination rate against held-out audio, and has explicit stop conditions — the safeguards the engine's own publisher tells you to use.
