AI Scribes
Ambient AI scribes are the documentation category healthcare buyers approach first — because clinical documentation is the workflow eating the most clinician time and driving the most burnout, and because the evidence base for measurable improvement is now real. This guide is the buyer's view: what the category does, what the published numbers actually say, what an evaluation rubric should contain, and where the cloud-only commercial vendors fit versus an on-prem stack.
For every hour of direct patient face-time, physicians spend nearly two additional hours on EHR and desk work — Sinsky et al., Annals of Internal Medicine, 2016. The category's reason for existing.
Family physicians log 86 minutes of after-hours EHR work per night on average. The metric ambient scribes are most directly trying to bend.
Burnout prevalence fell from 51.9% to 38.8% within 30 days of ambient-scribe rollout in a multi-system quality-improvement study (263 clinicians, six health systems, 2025), a 74% reduction in the odds of burnout.
Disclosed funding into ambient AI scribes in 2025 alone. Five cloud vendors raised the bulk of it; none of them ships an on-prem option.
What an AI scribe actually is
An AI scribe is a clinician-facing application that captures the patient encounter as audio, transcribes and structures it, and writes a draft note back into the EHR. Modern ambient scribes go beyond transcription: they synthesize SOAP-format notes, suggest ICD-10 / E/M / HCC codes, draft after-visit summaries and referral letters, and — increasingly — pull labs, vitals, and prior visit context forward to ground the draft. The clinician edits, signs, and the note is filed.
The category's name is misleading in two ways. First, "scribe" undersells the scope — most products are now closer to clinical assistants that touch documentation, coding, and patient communication on the same audio input. Second, "ambient" is doing a lot of work: the clinician still has to start the recording, edit the draft, and sign before anything reaches the chart. None of this is autonomous, and the published evidence is unambiguous that "signature is review" is not safe.
The four workflows that produce most of the value
The headline workflow. The UCLA NEJM AI randomized trial found Nabla cut time-on-notes by ~9.5% versus usual care (statistically significant). JAMA-published work across five academic medical centers reported ~16-minute reductions in documentation time per provider per day.
ICD-10, E/M, HCC, CPT — emerging as the surface that justifies the contract on revenue-cycle ROI grounds rather than clinician hours alone. Ambience and Suki lean hardest on this surface.
Plain-language after-visit summary auto-generated from the same audio. Reduces post-visit nurse calls and improves comprehension. Most vendors ship this; Nabla and Ambience have the most polished surface.
Referral letters and structured order staging from the encounter. Suki's voice-command surface and Ambience's AutoRefer are the public reference points. Reduces the "chart, write the referral, send" cycle to one approval step.
What the published evidence actually shows
The evidence base in 2026 is thicker than it was eighteen months ago, and uneven across vendors. The headline numbers a buyer should know — and the caveats that go with each:
- UCLA NEJM AI RCT (2025): Three-arm pragmatic randomized trial of Nabla, Microsoft DAX Copilot, and usual care. 238 outpatient physicians across 14 specialties. Nabla physicians cut time-on-notes by 41 seconds per note (4:30 → 3:49), a statistically significant 9.5% drop relative to usual care. DAX showed a smaller, non-significant drop. Both scribe arms reported ~7% improvement in burnout. NCT06792890.
- Mass General Brigham JAMA study (2026): Across five academic medical centers, ambient scribes reduced total EHR time by 13.4 minutes/day and documentation time by 16.0 minutes/day. Scribe usage was associated with 0.49 additional visits per week per clinician. Modest, durable, defensible.
- Multi-system burnout QI study (PMC, 2025): 263 clinicians across six health systems. Burnout dropped from 51.9% to 38.8% at 30 days. ~74% reduction in odds of burnout. Significant improvements in cognitive task load, after-hours documentation, and focused attention on patients.
- Single-system data points: Emory Healthcare reported a 30.7% increase in documentation-related well-being prevalence. Mass General Brigham reported a 21.2% reduction in burnout prevalence at 84 days. Cooper University Healthcare clinicians saved ~4.15 minutes per patient, about an hour daily.
- The Permanente Medical Group: ~15,791 physician hours saved across the system after ambient-scribe rollout, the most-cited single-system number in the category.
- Caveat, STAT News (April 2026): A large multi-site analysis found that ambient-scribe time savings are real but modest, and adoption is inconsistent; clinicians who use the tool for fewer than ten encounters often do not see durable gains. Pilots that report large numbers usually exclude low-utilization clinicians.
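The burnout figures above are quoted in three different units: percentage points, relative change, and odds. A minimal Python sketch, using only the 51.9% and 38.8% prevalences quoted above, shows how far apart those units sit. The unadjusted odds ratio it computes is an illustration, not a number from the study. The published ~74% reduction in odds is larger than the raw prevalences imply, which suggests it comes from an adjusted model; that is an assumption to confirm against the paper before quoting either figure internally.

```python
def odds(prevalence: float) -> float:
    """Convert a prevalence in the range 0-1 into odds."""
    return prevalence / (1.0 - prevalence)

baseline = 0.519   # burnout prevalence before rollout (quoted above)
followup = 0.388   # burnout prevalence at 30 days (quoted above)

absolute_drop = baseline - followup              # change in percentage points
relative_drop = absolute_drop / baseline         # relative change in prevalence
unadjusted_or = odds(followup) / odds(baseline)  # raw odds ratio, no covariates

print(f"absolute drop: {absolute_drop * 100:.1f} points")   # 13.1 points
print(f"relative drop: {relative_drop:.1%}")                # 25.2%
print(f"unadjusted odds ratio: {unadjusted_or:.2f}")        # 0.59
```

An unadjusted odds ratio of about 0.59 corresponds to roughly 41% lower odds. When a vendor quotes a burnout number, ask which of these three units it is and whether it was adjusted.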
Where ambient scribes go wrong
The error rates are small in percentage terms and patterned in a way buyers can plan around — but the failure modes are real and the published evidence is clear that "signature is not review."
- Hallucinations: ~1.47% of sentences. The npj Digital Medicine framework analysis (12,999 clinician-annotated sentences across 18 model configurations) reported a 1.47% hallucination rate and a 3.45% omission rate in AI-generated clinical notes. 44% of hallucinations were classified as major, meaning capable of changing diagnosis or management if uncorrected.
- Omissions cluster in the parts that matter most. 55% of major omissions occurred in the "current issues" section, 35% in past medical / family / social history, 10% in the assessment and plan. Exactly where a missed item becomes a clinical safety event.
- Pronoun and attribution errors. The UCLA RCT reported one mild patient-safety event during the trial window. Most ambient-scribe failure modes outside hallucination are pronoun swaps, mis-attributed quotes, and dropped negations ("denies chest pain" rendered as "chest pain").
- Automation bias is a documented risk. The 2025 npj Digital Medicine editorial "Beyond human ears" is direct about it: when clinicians are asked to review a draft, they miss errors more often than they realize. A pre-signed pre-filled note is harder to scrutinize than a blank one. Designing the workflow with this assumption is non-optional.
- Equity is unsolved. Accuracy degrades for non-English consultations and for patient populations underrepresented in training data. Vendors that ship strong multilingual coverage (Nabla, Suki) are not exempt; they are merely better instrumented than monolingual competitors.
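Per-sentence rates look small until they are compounded across a whole note and a whole clinic day. The sketch below does that arithmetic with the 1.47% hallucination rate and the 44% major share quoted above. It rests on assumptions the source does not state: that sentences fail independently, that the pooled rate applies to your vendor and case mix, and that the note lengths and daily volumes shown are hypothetical.

```python
HALLUCINATION_RATE = 0.0147  # per generated sentence (framework analysis, quoted above)
MAJOR_SHARE = 0.44           # share of hallucinations classified as major

def p_at_least_one_major(sentences_per_note: int) -> float:
    """Probability that a note contains one or more major hallucinations.

    Assumes every sentence fails independently, which is a simplification.
    """
    p_major_sentence = HALLUCINATION_RATE * MAJOR_SHARE
    return 1.0 - (1.0 - p_major_sentence) ** sentences_per_note

def expected_major_per_day(sentences_per_note: int, notes_per_day: int) -> float:
    """Expected number of major hallucinated sentences across one clinic day."""
    return HALLUCINATION_RATE * MAJOR_SHARE * sentences_per_note * notes_per_day

# Hypothetical note lengths, chosen only to show how the risk scales.
for length in (10, 30, 60):
    print(length, round(p_at_least_one_major(length), 3))
```

Under these assumptions a 30-sentence note carries a little under an 18% chance of containing a major hallucination, which is the practical argument for auditing at the note level rather than reporting a sentence-level rate alone.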
The evaluation rubric that survives the demo
Vendors will offer to run a pilot with their own measurement tooling. Don't take that offer at face value. The rubric your team should fix in writing before the pilot starts:
Not per note. Totals are what the CFO sees and what the burnout study uses. Pull EHR audit logs, not vendor self-reports.
Bigger edits mean the model is off pattern. Bake a sample-and-diff workflow into the pilot from day one.
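One way to run a sample-and-diff check is to compare the AI draft with the signed note word by word. The sketch below uses Python's standard-library `difflib`; the function names `edit_ratio` and `changed_spans` are hypothetical, and it assumes the pilot can export both the draft and the signed text for each sampled note.

```python
import difflib

def edit_ratio(draft: str, signed: str) -> float:
    """Share of the word sequence that changed between draft and signed note.

    0.0 means the draft was signed untouched; 1.0 means nothing survived.
    """
    matcher = difflib.SequenceMatcher(a=draft.split(), b=signed.split(), autojunk=False)
    return 1.0 - matcher.ratio()

def changed_spans(draft: str, signed: str) -> list:
    """Return (operation, draft_words, signed_words) for every non-matching span."""
    a, b = draft.split(), signed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

draft = "Patient reports chest pain for two days."
signed = "Patient denies chest pain for two days."
print(round(edit_ratio(draft, signed), 3))
print(changed_spans(draft, signed))
```

In this example the clinician changed one word out of seven, so the ratio is small, yet the edit reverses the clinical meaning. Track the score for trends, but have reviewers read the changed spans.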
Use a clinician-reviewed held-out sample audited weekly. Match the methodology in the npj Digital Medicine framework: classify majors versus minors and where in the note they cluster.
Percentage of factual claims with a verifiable source — either a transcript passage or a chart artifact. Suki's "evidence-linked documentation" is a vendor-side example of this idea.
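As a rough illustration of how this percentage could be computed from a reviewer's annotations, the sketch below counts a claim as grounded when it carries either a transcript passage or a chart artifact. The `Claim` structure, its field names, and the sample claims are hypothetical; this is not a description of any vendor's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Claim:
    text: str
    transcript_span: Optional[str] = None  # quoted transcript passage, if one was found
    chart_artifact: Optional[str] = None   # identifier of a lab, vital, or prior note, if any

def grounding_rate(claims: Sequence[Claim]) -> float:
    """Fraction of claims backed by a transcript passage or a chart artifact."""
    if not claims:
        raise ValueError("no claims to score")
    grounded = sum(1 for c in claims if c.transcript_span or c.chart_artifact)
    return grounded / len(claims)

# Made-up claims for illustration only.
sample_claims = [
    Claim("Patient denies chest pain.", transcript_span="no, no chest pain"),
    Claim("Potassium was 4.1 mmol/L.", chart_artifact="lab:example-123"),
    Claim("Patient quit smoking in 2019."),  # no source found by the reviewer
]
print(f"{grounding_rate(sample_claims):.0%}")
```

The ungrounded claims are the list to hand back to the vendor and to the clinician reviewers.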
Use a validated single-item or multi-item burnout instrument (e.g., the abbreviated Maslach scale). Pre / 30 day / 90 day. Don't use NPS.
Every metric above, split by patient primary language and demographic. The aggregated number is comforting and the disaggregated number is the truth.
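A small sketch of the split, in plain Python. The record fields (`language`, `major_error`) and the toy records are hypothetical and exist only to show how an acceptable aggregate can hide a subgroup problem; they are not results from any study.

```python
from collections import defaultdict

def rate_by_group(records, group_key, flag_key):
    """Return {group: share of records in that group where flag_key is truthy}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[flag_key]:
            flagged[group] += 1
    return {group: flagged[group] / totals[group] for group in totals}

# Toy audit sample: eight English-language notes and two Spanish-language notes.
records = (
    [{"language": "en", "major_error": False}] * 7
    + [{"language": "en", "major_error": True}]
    + [{"language": "es", "major_error": False}]
    + [{"language": "es", "major_error": True}]
)

overall = sum(r["major_error"] for r in records) / len(records)
by_language = rate_by_group(records, "language", "major_error")
print(overall, by_language)
```

With a subgroup of only two notes the split rate is very noisy, so set a minimum sample size per language and demographic before the pilot starts.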
Can you produce the prompt, retrieved context, output, and clinician edit for any signed note? If not, the deployment is not yet in production.
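One way to make that question testable is to define the record you expect back for any signed note and check it for completeness. The sketch below is a hypothetical shape, not a vendor schema: the class name, fields, and sample values are assumptions, and a real record would also need timestamps, model version, and access controls.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple

@dataclass(frozen=True)
class NoteAuditRecord:
    note_id: str
    prompt: str
    retrieved_context: Tuple[str, ...]  # chart passages supplied to the model
    model_output: str                   # the AI draft as generated
    signed_note: str                    # the text the clinician signed

    def missing_fields(self) -> list:
        """Names of required text artifacts that are empty or whitespace."""
        required = ("prompt", "model_output", "signed_note")
        return [name for name in required if not getattr(self, name).strip()]

    def fingerprint(self) -> str:
        """SHA-256 over the serialized record, for tamper-evidence checks."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

record = NoteAuditRecord(
    note_id="example-note-1",
    prompt="Summarize the encounter as a SOAP note.",
    retrieved_context=("Prior visit: hypertension follow-up.",),
    model_output="S: Patient reports headache.",
    signed_note="S: Patient reports intermittent headache.",
)
print(record.missing_fields(), record.fingerprint()[:12])
```

The clinician edit does not need its own field here because it can be derived by diffing `model_output` against `signed_note`.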
Track utilization per clinician per week. STAT's 2026 analysis was clear that tools used at fewer than ten encounters per clinician do not produce durable time savings. Plan the rollout to drive adoption, not just provisioning.
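A minimal sketch of that tracking, assuming the pilot can export one row per scribe-assisted encounter as a clinician identifier and a date. The function names are hypothetical. Applying the ten-encounter threshold per week is also an assumption, since the analysis summarized above does not specify the time window here.

```python
from collections import Counter
from datetime import date

def weekly_utilization(encounters):
    """Count scribe-assisted encounters per (clinician, ISO year, ISO week)."""
    counts = Counter()
    for clinician_id, day in encounters:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(clinician_id, iso[0], iso[1])] += 1
    return counts

def low_utilization(counts, threshold=10):
    """Clinician-weeks below the threshold, sorted for stable reporting.

    Clinicians with zero encounters in a week never appear in counts,
    so join against the provisioning roster to catch them.
    """
    return sorted(key for key, n in counts.items() if n < threshold)

# Hypothetical export: two clinicians in the same week.
encounters = [("clin-a", date(2026, 3, 2))] * 12 + [("clin-b", date(2026, 3, 3))] * 3
counts = weekly_utilization(encounters)
flagged = low_utilization(counts)
print(flagged)
```

Review the flagged list weekly during the pilot so low adopters get support early instead of being quietly excluded from the final numbers.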
For a published reference: the SCRIBE evaluation framework (npj Digital Medicine, 2025) and PDQI-9 are the most cited validated rubrics for AI-generated note quality. Mayo Clinic Proceedings: Digital Health published a 2025 study using simulated ambulatory encounters as the safety-evaluation methodology — useful for stress-testing under conditions that vendor demos cannot reproduce.
Cloud commercial vendors — the shortlist
The five commercial scribes that consistently make U.S. healthcare shortlists. All five are cloud-only — none currently offers a customer-tenanted on-prem deployment. Profiles, comparison, and decision short-circuits live in the vendors hub.
Largest deployment scale. Deepest Epic integration. Strongest peer-reviewed evidence after Nabla's. Best for large U.S. systems on Epic.
Best integrated revenue-cycle play. AutoCDI, AutoCoding, AutoAVS. Built on OpenAI. Cleveland Clinic, UCSF.
Specialty-tuned (oncology lead). Highest KLAS spotlight (98.8). Transparent ~$350–$500/user/month. Ochsner, Texas Oncology.
NEJM AI RCT evidence. No audio stored by default. 14-day retention. 35+ languages. CVS Health, CHLA.
All four major EHRs incl. deep MEDITECH Expanse. Voice commands beyond docs. Published $299–$399/user/month. MedStar Health.
When the on-prem stack is the right answer instead
Cloud commercial scribes are the right answer for most U.S. health systems whose security program already operates under signed BAAs. The architectural choice every cloud vendor on this page makes — vendor-controlled inference, customer audio leaving the network — does not work for every buyer. The patterns where the on-prem stack is the cleaner answer:
| Constraint | Cloud commercial scribe | On-prem (WalledCare) |
|---|---|---|
| PHI must remain inside the building | Audio leaves the network — non-starter. | Inside hospital data center. No outbound API. |
| Canadian provincial residency (PHIPA, HIA, Law 25) | U.S. cloud is the deal-breaker even where vendor is HIPAA-compliant. | Province-resident inference; auditable by the privacy officer. |
| 2026 HIPAA hardening (operating above floor) | Compliant with floor; not the cleanest answer for organizations that want to operate above it. | Encryption at rest mandatory regardless; no third-party audio path to harden. |
| Multi-app on-prem stack (scribe + Q&A + discharge + handoff) | Different vendors per app, each with its own cloud relationship. | One stack, shared infrastructure, shared audit log. |
| Self-hosted model dependency | Vendor chooses the model. You don't. | Llama 3.3, Mistral, MedGemma — swap as the open-weight ecosystem evolves. |
| EHR is non-Epic (Oscar, MEDITECH, custom) | Variable. Suki is best on MEDITECH; otherwise Epic depth is the moat. | HL7 / FHIR + sidecar. Built per customer. |
If one or more of the constraints above is binding, the right next step is not a different commercial vendor — it is a different architecture. The WalledCare on-prem reference stack is the case for that path: hardware footprint, model choices, FHIR-grounded RAG pattern, and the 2026 GPU budget cheatsheet.
Adjacent categories
Internal policy, SOP, and pathway lookup grounded in customer-approved sources. The compounding category.
Permission-aware retrieval across local clinical knowledge with traceability and access scopes.
Drafting and QA workflows for discharge instructions and patient-readable summaries.
Shift-change copilots that organize SBAR-style context without leaking PHI outside approved systems.
Pick the right pilot, not the right demo
The shortest path to a defensible AI-scribe decision is to lock the evaluation rubric before the demos start, fix the constraints in writing, and run a real pilot — same workflow, same metrics, same patient population — across the cloud and on-prem options that survive the constraint screen. WalledCare's pilot process is built around this comparison.
Further reading
- NEJM AI: Ambient AI Scribes in Clinical Practice — Randomized Trial (UCLA, 2025)
- UCLA Health press release on the randomized trial
- Mass General Brigham JAMA-published five-AMC study (2026)
- Multi-system QI study on ambient AI scribes and burnout (PMC, 2025)
- JAMA Network Open: Ambient AI scribes to reduce administrative burden and burnout
- npj Digital Medicine: Framework to assess hallucination + omission rates in clinical text summarization
- npj Digital Medicine: "Beyond human ears" — risks of AI scribes in clinical practice
- npj Digital Medicine: SCRIBE evaluation framework for ambient digital scribing tools
- Mayo Clinic Proceedings: Digital Health — simulated-encounter safety evaluation of ambient scribes
- NEJM Catalyst: Ambient AI scribes to alleviate clinical documentation burden
- Sinsky et al., Annals of Internal Medicine (2016): Time-motion study of physician time
- AMA: Family doctors and the 86-minute "pajama time" finding
- AMA: AI scribes save 15,000 hours at The Permanente Medical Group
- STAT News (April 2026): Large AI scribe study finds modest time savings, inconsistent use
- AHA: Six health systems enhancing care delivery with ambient AI scribes