OPEN SOURCE · SPEECH-TO-TEXT · USE WITH SAFEGUARDS
Whisper
OpenAI's open-source speech-to-text model and the engine behind a significant share of ambient AI scribes, including Nabla. Through downstream products it reaches roughly 30,000 clinicians across about 40 U.S. health systems, according to 2024 reporting. Strong general accuracy, 99 languages, runs locally on hospital hardware. It is also the most-documented speech model for medical-context hallucinations: invented sentences in roughly 1% of segments in a formal study, and much higher rates in informal testing. OpenAI explicitly warns against use in high-risk domains. Read this page before pointing it at a patient encounter.
Whisper supports speech-to-text and translation across 99 languages, trained on 680,000 hours of multilingual audio. English accuracy is strong on clean audio; non-English and accented English accuracy drops materially.
Hallucinations appear in roughly 1% of segments in a formal ACM FAccT 2024 study and far more often in independent informal tests (one University of Michigan researcher found hallucinations in 8 of 10 sample audio files). Healthcare-specific failure modes were documented in 2024–25 press coverage.
Available sizes are tiny (39M parameters), base (74M), small (244M), medium (769M), large-v3 (1550M), and the newer large-v3-turbo. Larger models cost more compute in exchange for accuracy; "turbo" accepts a small accuracy hit for a roughly 8× speedup over the large model.
OpenAI's documentation explicitly recommends against using Whisper in "high-risk domains" and "decision-making contexts." Healthcare is a high-risk domain. The buyer's safety plan must assume the warning is binding.
What Whisper actually is
Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in 2022 under the MIT license. It is a transformer-based encoder-decoder trained on 680,000 hours of multilingual web audio with timestamps and translation pairs. That scale is uncommon for an open speech model and is the reason it became the de facto open ASR baseline. Models come in five base sizes (tiny / base / small / medium / large-v3) plus a faster large-v3-turbo variant. faster-whisper (CPU and CUDA) and whisper.cpp (CPU and Apple Silicon) are popular reimplementations that materially improve speed.
The healthcare buyer relevance is double-edged. On the positive side, Whisper runs locally, so no audio leaves your machine unless you send it somewhere, and the model quality is high enough that it powers a significant share of ambient AI scribes in production (Nabla is the most-cited example). On the negative side, the training data was scraped audio, and the model has a documented tendency to "hallucinate" plausible-sounding sentences when audio is silent, noisy, or in an accent it handles poorly. These are exactly the failures that are most dangerous in clinical settings. Reporting in 2024–25 (Healthcare Brew, Fortune, PBS, Tom's Hardware) documented racial commentary, violent rhetoric, and imagined medical treatments appearing in Whisper-generated medical transcripts. OpenAI's own documentation says not to use it in high-risk domains. Healthcare is one.
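Because hallucinations cluster around silent or low-confidence audio, one inexpensive first filter is to inspect the per-segment statistics that the reference openai-whisper `transcribe()` output carries (`no_speech_prob`, `avg_logprob`, `compression_ratio`). The sketch below is a minimal illustration, not a validated detector: the thresholds simply mirror the library's default decoding thresholds, the function names are hypothetical, and a segment that passes these checks can still be invented text.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed starting points: these mirror openai-whisper's default
# no_speech_threshold, logprob_threshold and compression_ratio_threshold.
# Tune them against held-out clinical audio before relying on them.
NO_SPEECH_PROB_MAX = 0.6
AVG_LOGPROB_MIN = -1.0
COMPRESSION_RATIO_MAX = 2.4


def flag_reasons(segment: Dict) -> List[str]:
    """Return the reasons a segment deserves a second look (empty if none)."""
    reasons = []
    if segment.get("no_speech_prob", 0.0) > NO_SPEECH_PROB_MAX:
        reasons.append("likely silence or non-speech")
    if segment.get("avg_logprob", 0.0) < AVG_LOGPROB_MIN:
        reasons.append("low decoder confidence")
    if segment.get("compression_ratio", 0.0) > COMPRESSION_RATIO_MAX:
        reasons.append("repetitive text")
    return reasons


def segments_for_review(segments: Iterable[Dict]) -> List[Tuple[Dict, List[str]]]:
    """Pair each suspicious segment with the reasons it was flagged."""
    flagged = []
    for segment in segments:
        reasons = flag_reasons(segment)
        if reasons:
            flagged.append((segment, reasons))
    return flagged
```

Flagged segments are candidates for a reviewer to check against the retained audio; the filter narrows where to listen first and does not replace reading the full transcript.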
What this means for a buyer: Whisper can be used in a hospital stack, but only with safeguards. The original audio must be retained for fact-checking. The transcription must be reviewed before any downstream draft is produced. The clinical workflow must assume the transcript is preliminary, not authoritative. Treat Whisper as an audio assistant, not as a source of truth.
Deployment posture
Whisper is distributed as model weights (PyTorch + Hugging Face) and runs locally on CPU, CUDA, ROCm, Apple Silicon (via whisper.cpp), and a range of edge accelerators. The standard production patterns are faster-whisper for CPU-and-CUDA serving with batching, whisper.cpp for Apple Silicon and embedded devices, and the original OpenAI repo for reference / accuracy comparisons. Real-time transcription typically uses a streaming wrapper that chunks audio into 30-second windows with overlap and runs incremental decoding.
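The 30-second windowing with overlap described above can be sketched as a small boundary calculation. This is an illustrative sketch only: the 5-second overlap is an assumed value, `window_bounds` is a hypothetical helper, and real streaming wrappers add voice-activity detection and transcript stitching on top.

```python
from typing import List, Tuple


def window_bounds(
    total_seconds: float, window: float = 30.0, overlap: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping windows covering the audio."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds
```

For a 70-second recording with these defaults, the windows are 0 to 30, 25 to 55, and 50 to 70 seconds; the overlap gives the decoder context across each boundary so that a word split by a cut is not lost.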
Original OpenAI Python library and CLI for reference use. faster-whisper for high-throughput CUDA serving. whisper.cpp for Apple Silicon and CPU. Streaming wrappers such as whisper-live for real-time use, and WhisperX for alignment and diarization.
Whisper-large-v3 runs comfortably on a single workstation GPU or a 16 GB RAM Apple Silicon Mac. Quantized variants run on CPU. Hardware cost is far lower than an LLM serving stack.
With faster-whisper + INT8 on a single GPU, large-v3 can comfortably keep up with real-time speech with a latency budget that suits clinical workflow. Turbo cuts this further.
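"Keeping up with real-time speech" can be made measurable as a real-time factor: processing time divided by audio duration, where values below 1.0 mean the engine is faster than the speech. The sketch below assumes the faster-whisper package and a CUDA GPU are available; `transcribe_int8` and `audio_path` are hypothetical names, and the measured factor will depend entirely on your hardware and audio.

```python
import time
from typing import Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 keeps up with speech."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def transcribe_int8(audio_path: str, model_size: str = "large-v3") -> Tuple[str, float]:
    """Transcribe one file with INT8 weights and report the real-time factor."""
    # Imported lazily so real_time_factor() works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="int8")
    started = time.perf_counter()  # model loading is deliberately not timed
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator: decoding only happens while it is iterated.
    text = " ".join(segment.text.strip() for segment in segments)
    elapsed = time.perf_counter() - started
    return text, real_time_factor(elapsed, info.duration)
```

Measuring this on representative clinic recordings, rather than on clean benchmark audio, gives a more honest picture of the latency budget.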
The single most important deployment decision: retain the original audio long enough that suspicious transcripts can be verified against the source. Vendors that delete audio immediately remove the safety net.
Healthcare fit
Whisper is the right speech engine when the workflow tolerates a transcript-plus-clinician-review pattern and the alternative would be sending audio to a third-party cloud. It is the wrong engine when "transcription is authoritative" — i.e., when nobody reads the transcript before it becomes part of a clinical record. The published 2024–25 evidence on Whisper hallucinations in medical settings is unambiguous: the failure modes happen in exactly the kind of audio (silent gaps, code-switching, dialect, noisy clinical environments) common to real patient encounters.
- Good fit: ambient scribe pilots where the transcript is reviewed before a note is drafted. Internal dictation tools with mandatory human review. Research transcription where the original audio is retained.
- Good fit: multilingual hospital settings. Whisper's 99-language coverage is genuinely unusual and often the deciding factor over commercial speech APIs that charge per language.
- Good fit: de-identification batch jobs over historical clinical audio, where accuracy plus a review-required workflow is the right shape.
- Bad fit: any workflow that treats the transcript as authoritative without human review. OpenAI's own documentation tells you not to do this.
- Bad fit: dropping the original audio after transcription. The recoverability path is the safety mechanism.
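The two bad-fit cases above amount to a gate: no note should be drafted from a transcript unless the source audio is still retained and a clinician has reviewed the text. A minimal sketch of that gate follows; `TranscriptRecord`, its fields, and `can_draft_note` are hypothetical names for illustration, not part of Whisper or any vendor product.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptRecord:
    """Links a preliminary transcript to the audio it came from."""
    audio_sha256: str                  # fingerprint of the retained source audio
    transcript: str                    # Whisper output, preliminary until reviewed
    audio_retained: bool = True        # set to False once the audio is deleted
    reviewed_by: Optional[str] = None  # reviewer identifier, None until reviewed


def fingerprint_audio(audio_bytes: bytes) -> str:
    """Hash the raw audio so a transcript can be matched to its source later."""
    return hashlib.sha256(audio_bytes).hexdigest()


def can_draft_note(record: TranscriptRecord) -> bool:
    """Allow downstream drafting only with retained audio and a named reviewer."""
    return record.audio_retained and record.reviewed_by is not None
```

Storing the hash alongside the transcript lets an auditor confirm that the audio being replayed is the audio that produced the text in question.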
Privacy and governance
Whisper's privacy story is straightforward: the model is open-weight and downloadable, all inference can happen on hospital-controlled hardware, and no outbound calls of any kind are required. That puts Whisper in a strong position for HIPAA / PIPEDA / PHIPA / Quebec Law 25 environments at the privacy layer. The harder governance question is safety: hallucinations in medical audio are documented and patterned, OpenAI tells you not to deploy in high-risk domains, and a workflow that ignores both warnings is the workflow that ends up in the press for the wrong reasons.
The Moneli Automation pattern for Whisper is: retain original audio for the same retention period as the chart, build the workflow with mandatory clinician review before any transcript-derived note is signed, monitor sample edit distance and hallucination rate against a held-out audio set, and have explicit stop conditions if either drifts. The engine is fine; the workflow around it is the work.
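The monitoring step can be sketched as a word-level edit distance between a reference transcript (for example, a clinician-corrected version or a held-out gold transcript) and Whisper's output, combined with an explicit stop check. The threshold values below are hypothetical placeholders, not recommended limits, and the function names are illustrative.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted in words (insert, delete, substitute)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the number of reference words."""
    reference_length = len(reference.split())
    if reference_length == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / reference_length


def should_stop(mean_wer: float, hallucination_rate: float,
                max_wer: float = 0.15, max_hallucination_rate: float = 0.01) -> bool:
    """Stop condition: either metric drifting past its (assumed) limit."""
    return mean_wer > max_wer or hallucination_rate > max_hallucination_rate
```

As a usage example, comparing "the patient denies chest pain" with "the patient has chest pain and fever" gives a distance of 3 over 5 reference words, a word error rate of 0.6. Note that inserted words, the signature of a hallucination, count against the score just as substitutions do.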
Strengths and limitations
Strengths: best open-source speech-to-text accuracy on clean English audio. 99-language coverage including many languages commercial APIs charge a premium for. Runs locally with modest hardware. Free under the MIT license. Active reimplementation ecosystem (faster-whisper, whisper.cpp, WhisperX) for production-grade serving. Powers a significant share of ambient AI scribes in production, meaning the engine itself has been battle-tested at scale even when the surrounding workflow varies.
Limitations: documented hallucinations in medical audio (invented sentences, racial commentary, imagined treatments) covered in major 2024–25 reporting. OpenAI's own documentation warns against high-risk-domain use. Accuracy drops on accented English, non-English, and noisy clinical audio. No speaker diarization out of the box (use WhisperX or pyannote). No language model conditioning on medical vocabulary by default, so fine-tune or post-process for terminology accuracy.
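One light form of terminology post-processing is to compare transcript words against a curated vocabulary list and surface near-misses for the reviewer, rather than rewriting the text automatically. The sketch below uses fuzzy string matching from the Python standard library; the vocabulary, the 0.85 cutoff, and the function name are assumptions for illustration.

```python
import difflib
from typing import Iterable, List, Tuple


def suggest_term_fixes(
    transcript: str, vocabulary: Iterable[str], cutoff: float = 0.85
) -> List[Tuple[int, str, str]]:
    """Return (word index, transcribed word, suggested term) for near-misses."""
    vocab = sorted({term.lower() for term in vocabulary})
    suggestions = []
    for index, token in enumerate(transcript.split()):
        word = token.strip(".,;:!?").lower()
        if not word or word in vocab:
            continue  # exact vocabulary hits need no suggestion
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        if matches:
            suggestions.append((index, word, matches[0]))
    return suggestions
```

Returning suggestions instead of applying them keeps the clinician in the loop: a near-match can be a misheard drug name, but it can also be a different drug.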
Where Whisper fits in a hospital stack
| Layer | What Whisper contributes | What still has to be solved |
|---|---|---|
| Speech-to-text | Best open-weight ASR available, runs locally, multilingual. | Hallucination safeguards, original-audio retention, review-required workflow. |
| Diarization | None natively — pair with WhisperX or pyannote.audio. | Speaker attribution accuracy in clinical settings, multi-party audio handling. |
| Domain accuracy | Strong baseline; medical-terminology accuracy improves materially with fine-tuning or post-processing. | Vocabulary lists, prompt biasing, fine-tuning data curation, evaluation against held-out clinical audio. |
| Note drafting | None — Whisper produces a transcript, not a note. Pair with MedGemma or another LLM via vLLM. | Note-generation prompt, clinical review workflow, audit trail. |
| Safety / governance | None — operator must own retention, review, and stop conditions. | Audit retention, sample-edit-distance monitoring, hallucination evaluation, OpenAI's "do not deploy in high-risk" caveat acknowledged in writing. |
Whisper is the local-ASR baseline. It is the right engine for many private healthcare workflows — paired with the safeguards OpenAI's own documentation demands. Treat the engine as preliminary; let the workflow around it be authoritative.
Quick facts
| Field | Details |
|---|---|
| Project | Whisper (OpenAI, open-source, MIT license). GitHub: openai/whisper. |
| Architecture | Transformer encoder-decoder ASR trained on 680,000 hours of multilingual audio. |
| Sizes | tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1550M), large-v3-turbo. Production default for medical: large-v3 or large-v3-turbo. |
| Languages | 99 languages. Strongest accuracy on English; degrades on accents, code-switching, and underrepresented languages. |
| Notable reimplementations | faster-whisper (CTranslate2-based, CPU + CUDA), whisper.cpp (C/C++, Apple Silicon and embedded), WhisperX (diarization + alignment). |
| Production users | Nabla (~85,000 clinicians via downstream product). Reporting from 2024 put Whisper-powered tools at ~30,000 clinicians across ~40 U.S. health systems. |
| Known failure modes | Invented sentences in silent gaps; racial / violent text generation in noisy audio; degraded accuracy on accents and non-English; medical terminology errors without fine-tuning. |
| Website | github.com/openai/whisper |
Use Whisper with the safeguards OpenAI's own docs demand
Whisper is a powerful local-ASR engine and a documented hallucination risk in medical audio at the same time. Moneli Automation's typical pattern is to deploy Whisper behind a workflow that retains the original audio, enforces clinician review of the transcript, monitors hallucination rate against held-out audio, and has explicit stop conditions — the safeguards the engine's own publisher tells you to use.
Further reading
- Whisper on GitHub (official OpenAI repo)
- faster-whisper — production-grade reimplementation
- whisper.cpp — C/C++ implementation for Apple Silicon and CPU
- Healthcare Brew: Whisper makes up words patients have never said
- Fortune: hallucinations and hospital use
- PBS NewsHour on Whisper in medical settings
- Healthcare IT News on Whisper limitations
- MedGemma profile — the LLM layer that turns the transcript into a draft note
- AI Scribes category — buyer guide for the full ambient-documentation category, including hallucination evaluation