AI Scribe Hallucinations and Omissions — The Published Numbers and Where They Cluster

Hallucination rate

1.47%

Across 12,999 clinician-annotated sentences from 18 model configurations — the npj Digital Medicine framework analysis (2025). Lower than headline coverage often implies; still material at clinical volume.

Omission rate

3.45%

From the same framework. Omissions are more than twice as common as hallucinations and dominate the practical failure profile of every ambient scribe a hospital evaluates.

Major hallucinations

44%

Share of hallucinations classified as major — capable of changing diagnosis or management if uncorrected. The headline rate is a small number; the per-error severity is the real exposure.

Omissions in current issues

55%

Share of major omissions that cluster in the "current issues" section of the note — exactly where a missed item becomes a clinical-safety event. The where-it-fails matters as much as the rate.

What "hallucination" and "omission" actually mean here

The two terms get used interchangeably in healthcare AI coverage; the published evidence distinguishes them carefully. A hallucination is content the AI generated that does not appear in the source audio: a fabricated symptom, an invented diagnosis, a documented physical exam that never happened. An omission is content the AI failed to document that was present in the audio: a missed medication change, a missed history item, a missed assessment finding. Both produce a clinically unsafe note. The mitigation patterns are different.

The 2025 npj Digital Medicine framework analysis ("A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation") is the cleanest reference in the literature. It annotated 12,999 sentences across 18 model configurations and produced the rates above — 1.47% hallucination, 3.45% omission. The same group's editorial, "Beyond human ears: navigating the uncharted risks of AI scribes in clinical practice" (npj Digital Medicine, 2025), is the companion paper a hospital safety committee should read alongside it.
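To make "material at clinical volume" concrete, the published rates can be scaled to a deployment. This is a minimal sketch: the rates and the 44% major share come from the figures above, while `SENTENCES_PER_NOTE` and `NOTES_PER_MONTH` are hypothetical values chosen only for illustration. It also assumes the rates apply uniformly per sentence, which the source does not establish for any particular deployment.

```python
# Rates reported by the framework analysis cited above.
HALLUCINATION_RATE = 0.0147        # share of annotated sentences
OMISSION_RATE = 0.0345             # share of annotated sentences
MAJOR_HALLUCINATION_SHARE = 0.44   # share of hallucinations classed as major

# Hypothetical deployment figures (assumptions, not from the source).
SENTENCES_PER_NOTE = 20
NOTES_PER_MONTH = 10_000


def expected_count(rate: float, sentences: int) -> float:
    """Expected number of affected sentences, treating the rate as per-sentence."""
    return rate * sentences


total_sentences = SENTENCES_PER_NOTE * NOTES_PER_MONTH
hallucinated = expected_count(HALLUCINATION_RATE, total_sentences)
omitted = expected_count(OMISSION_RATE, total_sentences)
major_hallucinated = hallucinated * MAJOR_HALLUCINATION_SHARE

print(f"hallucinated sentences per month: {hallucinated:.0f}")
print(f"major hallucinations per month:   {major_hallucinated:.0f}")
print(f"omitted sentences per month:      {omitted:.0f}")
```

Under these assumed volumes the arithmetic gives roughly 2,940 hallucinated sentences, about 1,294 of them major, and roughly 6,900 omissions per month. The point is the order of magnitude, not the exact figures.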

Where the failures cluster — the geography matters

The headline rates make AI scribes sound safer than they are in practice, because the errors do not distribute randomly across the note. They concentrate in the sections where a missing or invented item is most likely to change clinical action. The framework analysis broke major omissions down by note section:

Note section	Share of major omissions	Clinical exposure
Current issues / HPI	55%	A missed symptom or current concern at the top of the note shapes the rest of the workup. Highest-risk omission location.
Past medical, family, social history	35%	Missing comorbidity, family-history pattern, or social context that changes risk stratification or differential.
Assessment and plan	10%	Lower share but highest immediate consequence — a missed plan item is a missed clinical action.

The takeaway for evaluation: specialty-matched random sampling is the wrong audit strategy. Section-stratified sampling, which concentrates audit effort on the HPI and current-issues sections, should surface more major errors per hour of reviewer time than random sampling does, because that is where the major omissions cluster. Bake the section weights into the pilot's audit framework.
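One way to bake the section weights into an audit plan is to split a fixed reviewer budget in proportion to the shares in the table. The sketch below is illustrative only: the function name, the section keys, and the idea of an audit "slot" (one reviewed section of one note) are assumptions, and a real pilot might weight by severity as well as by share.

```python
# Share of major omissions by note section, taken from the table above.
SECTION_WEIGHTS = {
    "current_issues_hpi": 0.55,
    "history": 0.35,
    "assessment_plan": 0.10,
}


def allocate_audit_slots(total_slots: int, weights: dict[str, float]) -> dict[str, int]:
    """Split an audit budget across sections in proportion to the weights.

    Uses largest-remainder rounding so the result always sums to total_slots.
    """
    raw = {name: total_slots * weight for name, weight in weights.items()}
    slots = {name: int(value) for name, value in raw.items()}
    leftover = total_slots - sum(slots.values())
    # Give any remaining slots to the sections with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda name: raw[name] - slots[name], reverse=True)
    for name in by_remainder[:leftover]:
        slots[name] += 1
    return slots


weekly_plan = allocate_audit_slots(20, SECTION_WEIGHTS)
print(weekly_plan)
```

With a budget of 20 slots this yields 11 for current issues / HPI, 7 for history, and 2 for assessment and plan.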

The most-documented failure patterns

Beyond the rate and the location, the shape of the errors matters. The most-reported failure patterns in 2024–26:

Documented physical exams that never happened. The npj Digital Medicine analysis and several independent audits report the model producing complete physical-exam narratives when the audio contained no exam discussion. The hallucination is plausible-sounding, fits the visit type, and is hard to catch on review without checking the audio.
Dropped negations. "Denies chest pain" rendered as "chest pain." A documented and patterned error class: the model omits or flips negation under noisy audio or fast speech. Clinically this is the most dangerous error class because the note then asserts a positive symptom that contradicts the patient.
Pronoun and speaker attribution errors. The clinician's statement gets attributed to the patient or vice versa. Documented in the UCLA NEJM AI RCT (which reported one grade 1 adverse event during the trial window) and in the multi-system Mass General Brigham JAMA cohort.
Smoothed uncertainty. Patient status documented as "stable with good response to therapy" when the audio contained equivocation. The model defaults to confident phrasing; the clinical reality is often genuinely unclear. Particularly dangerous in handoff and inpatient settings.
Invented medications and dosing. Documented less frequently in the formal studies but visible in independent audits: the model generates a plausible-looking dose for a medication that was mentioned but not dosed in the audio.
Racial commentary and inappropriate content. Documented in 2024 Healthcare Brew and Fortune reporting on Whisper-powered scribes: invented sentences containing racial commentary, violent rhetoric, and imagined treatments. Rare per encounter but profoundly serious when they occur.
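The dropped-negation pattern in the list above lends itself to a simple regression check. The sketch below is deliberately crude: it only looks for a negation cue earlier in the same sentence as the finding. The cue list, the function names, and the test-case shape `(clip_id, note_text, finding)` are all hypothetical. A production check would need a proper clinical negation detector and, per the argument of this guide, comparison against the audio.

```python
import re

# Hypothetical cue list; real clinical negation detection needs far more coverage.
NEGATION_CUES = ("denies", "no", "not", "without", "negative for")
CUE_PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, NEGATION_CUES)) + r")\b")


def finding_is_negated(note: str, finding: str) -> bool:
    """True if every sentence mentioning the finding has a negation cue before it.

    A finding that never appears counts as a failure, since that is an omission.
    """
    target = finding.lower()
    sentences = re.split(r"(?<=[.!?])\s+", note.lower())
    mentions = [s for s in sentences if target in s]
    if not mentions:
        return False
    for sentence in mentions:
        prefix = sentence[: sentence.index(target)]
        if not CUE_PATTERN.search(prefix):
            return False
    return True


def failed_cases(cases):
    """Return clip ids whose generated note lost the expected negation."""
    return [clip_id for clip_id, note, finding in cases
            if not finding_is_negated(note, finding)]


# Made-up example notes standing in for scribe output on held-out audio clips.
cases = [
    ("clip-001", "Patient denies chest pain.", "chest pain"),
    ("clip-002", "Patient reports chest pain.", "chest pain"),
    ("clip-003", "No fever. Chest pain on exertion.", "chest pain"),
]
print(failed_cases(cases))
```

Running the same clips on every model or vendor update and tracking the failure list over time is the part that matters; the string matching itself is only a placeholder.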

The transcription layer is a second hallucination source

Many ambient AI scribes use OpenAI's Whisper as the speech-to-text engine. Whisper has its own hallucination problem, independent of the LLM that drafts the note. The 2024 ACM FAccT analysis reported invented sentences in roughly 1% of segments under controlled conditions. A widely cited University of Michigan analysis found hallucinations in 8 of 10 informal samples. OpenAI's own documentation warns against using Whisper in "high-risk domains" and "decision-making contexts," and names healthcare as a high-risk domain.

The compounding effect matters: if the transcription layer hallucinates a phrase and the note-generation layer then summarizes the transcript, the final note can contain a confident assertion that has no basis in the audio at all. Auditing the LLM layer in isolation does not catch this. The audit has to compare the signed note against the original audio, not against the transcript. The original audio retention policy is therefore a safety control, not a privacy concern.

Why "clinician reviews the draft" is not by itself a safety control

The most widely reported failure of AI-scribe workflows is the implicit assumption that mandatory clinician review will catch the errors. The published evidence indicates that it does not do so reliably. The 2025 npj Digital Medicine editorial is direct about it: when clinicians are asked to review a pre-filled, plausible-looking note, they miss errors more often than they realize. Three forces compound:

FORCE 01

Automation bias

A clinician reading a draft that looks correct will tend to accept it. The published literature on automation bias is large and consistent — a plausibly worded pre-filled note is harder to scrutinize than a blank one.

FORCE 02

Time pressure

Ambient documentation pilots succeed when they save time. Saved time gets spent on patients. Review time drops as familiarity grows — which is the moment the audit data degrades.

FORCE 03

Plausibility ≠ accuracy

The most dangerous hallucinations are the ones that read correctly. A physical exam that never happened narrates fluently; a dropped negation flows naturally with the rest of the sentence. Because the model is optimized to produce fluent text, its errors read as smoothly as its correct output.

"Signature is review" is not a safe workflow. A safe workflow needs explicit, structural safeguards beyond clinician review — sample-and-audit against the audio, section-stratified spot checks, edit-distance monitoring, and stop conditions tied to measurable thresholds.

Equity is unsolved

Accuracy degrades on non-English audio, on accented English, and on patient populations underrepresented in training data. The UCLA RCT and the multi-system Mass General Brigham cohort both reported degradation; Whisper accuracy drops on languages with smaller training-data shares. Vendors that ship strong multilingual coverage (Nabla, Suki) are not exempt from this — they are merely better instrumented than monolingual competitors. For a hospital serving a multilingual patient population, the evaluation rubric should include accuracy disaggregated by language and by demographic group. Aggregate accuracy hides systematic harm.
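How an aggregate number hides a group-level problem can be shown with a small calculation. The sketch below uses invented audit records purely for illustration; the field names `language` and `error` and the helper `error_rate_by_group` are assumptions, not part of any cited study or vendor report.

```python
from collections import defaultdict


def error_rate_by_group(records, group_key):
    """Error rate per group from audited sentences.

    records: iterable of dicts with a grouping field and a boolean "error" field.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        errors[group] += int(record["error"])
    return {group: errors[group] / totals[group] for group in totals}


# Invented audit data: 100 English sentences with 2 errors, 20 Spanish with 3.
records = (
    [{"language": "en", "error": i < 2} for i in range(100)]
    + [{"language": "es", "error": i < 3} for i in range(20)]
)

aggregate_rate = sum(r["error"] for r in records) / len(records)
rates = error_rate_by_group(records, "language")

print(f"aggregate: {aggregate_rate:.1%}")
for language, rate in rates.items():
    print(f"{language}: {rate:.1%}")
```

In this made-up sample the aggregate error rate is about 4.2%, while the Spanish-language rate is 15%. The same function can be pointed at any other grouping field, such as specialty or demographic group.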

The safeguards every hospital should require

Across the published evidence, six safeguards consistently separate pilots that catch errors safely from pilots that drift into normalized unsafe output:

1. Retain the original audio for the chart-retention period. The audio is the safety net, the only way to verify a suspicious transcript or note. Vendors that delete audio after note generation remove the audit trail; that is a documented warning sign, not a privacy feature.
2. Sample-and-audit against the audio, not against the transcript. A weekly stratified sample (10–20 encounters per specialty, weighted toward HPI / current-issues content) compared back to the original audio is the minimum continuous-audit baseline.
3. Edit-distance monitoring with thresholds. Track the diff between the model's draft and the signed note. Falling edit distance over time often signals automation bias, not improving model quality. A documented "edit distance drops below X" threshold should trigger a pause-and-review.
4. Stop conditions written before go-live. The steering committee should agree in advance on the patterns that pause the rollout: recurring section-stratified omissions, audited hallucination rate above a stated threshold, a clinician-reported safety event. The decision rule has to exist before the pressure to dismiss the signal exists.
5. Negation and physical-exam test sets. The two highest-consequence patterned errors. A held-out test set of audio clips containing negations and physical-exam discussions should be run against the model regularly and tracked over time.
6. Equity-disaggregated accuracy reporting. Accuracy reported by language, by demographic group, by specialty. Aggregate numbers are the wrong audit metric.
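Safeguard 3 can be sketched in a few lines. This is an assumption-laden illustration, not a reference implementation: it uses the similarity ratio from Python's `difflib.SequenceMatcher` on whitespace-split tokens as a stand-in for a true edit distance, and the threshold value is hypothetical because the "X" above is left for each steering committee to set.

```python
from difflib import SequenceMatcher
from statistics import mean


def edit_fraction(draft: str, signed: str) -> float:
    """Rough share of the draft the clinician changed (0.0 means signed unchanged).

    Proxy only: 1 minus difflib's similarity ratio over whitespace-split tokens.
    """
    matcher = SequenceMatcher(None, draft.split(), signed.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def weeks_below_threshold(weekly_pairs, threshold):
    """Indices of weeks whose mean edit fraction falls below the threshold.

    weekly_pairs: list of weeks, each a non-empty list of (draft, signed) tuples.
    """
    flagged = []
    for week, pairs in enumerate(weekly_pairs):
        average = mean(edit_fraction(draft, signed) for draft, signed in pairs)
        if average < threshold:
            flagged.append(week)
    return flagged


# Made-up example: week 0 shows real edits, week 1 shows notes signed unchanged.
weekly_pairs = [
    [("denies chest pain", "reports chest pain")],
    [("stable on therapy", "stable on therapy")],
]
HYPOTHETICAL_THRESHOLD = 0.05  # placeholder for the committee-agreed "X"
print(weeks_below_threshold(weekly_pairs, HYPOTHETICAL_THRESHOLD))
```

A flagged week is a prompt for the pause-and-review described above, not proof of automation bias; the audio-based audit in safeguard 2 is what distinguishes a better model from less careful review.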

What this means for the RFP

Translate the safeguards above into vendor questions. The six additions below to a standard RFP are ones that vendors who take safety seriously can answer cleanly, and the answers distinguish them from vendors who do not:

What is your measured hallucination and omission rate, against what evaluation framework, with what test set? (Acceptable: a specific framework reference. Red flag: "very rare.")
Provide your section-stratified error breakdown: share of errors in HPI, history, A&P. (Acceptable: a table. Red flag: "we don't break it down that way.")
Is your transcription engine Whisper-based, and what specific guardrails address the documented Whisper hallucination patterns? (Acceptable: named mitigations. Red flag: "proprietary speech recognition" without specifics.)
What is your default audio-retention policy, and is it customer-configurable? (Acceptable: retention configurable by the customer up to chart retention. Red flag: audio deleted by default.)
Provide a sample of your equity-disaggregated accuracy reporting from a current customer deployment. (Acceptable: a specific artifact. Red flag: "accuracy is high across all populations.")
What stop-condition language does your contract support if our audit finds the system unsafe? (Acceptable: termination-for-cause language and a documented pause path. Red flag: no contractual stop.)

The full procurement checklist lives in the AI Scribe RFP Questions guide — these six belong in the safety section alongside the others.

Where this fits in the WalledCare directory

This reference is the safety lens on the broader AI Scribes category page (workflows, evidence base, evaluation rubric), pairs with the How to Test an AI Scribe Safely guide (pilot design), and feeds the safety section of the RFP questions checklist. For the on-prem alternative perspective, the on-prem reference architecture covers the same evaluation rubric from the hospital-owned-stack side.
