AI Scribes — Buyer Guide for Ambient Clinical Documentation

EHR work per hour of patient time

~2 hr

For every hour of direct patient face-time, physicians spend nearly two additional hours on EHR and desk work — Sinsky et al., Annals of Internal Medicine, 2016. The category's reason for existing.

"Pajama time"

86 min

Family physicians log 86 minutes of after-hours EHR work per night on average. The metric ambient scribes are most directly trying to bend.

Burnout drop

−13.1 pts

From 51.9% to 38.8% within 30 days of ambient-scribe rollout in a multi-system quality-improvement study (263 clinicians, six health systems, 2025). 74% reduction in odds of burnout.

Capital deployed in 2025

> $1B

Disclosed funding into ambient AI scribes in 2025 alone. Five cloud vendors raised the bulk of it; none of them ships an on-prem option.

What an AI scribe actually is

An AI scribe is a clinician-facing application that captures the patient encounter as audio, transcribes and structures it, and writes a draft note back into the EHR. Modern ambient scribes go beyond transcription: they synthesize SOAP-format notes, suggest ICD-10 / E/M / HCC codes, draft after-visit summaries and referral letters, and — increasingly — pull labs, vitals, and prior visit context forward to ground the draft. The clinician edits, signs, and the note is filed.

The category's name is misleading in two ways. First, "scribe" undersells the scope: most products are now closer to clinical assistants that handle documentation, coding, and patient communication from the same audio input. Second, "ambient" is doing a lot of work: the clinician still has to start the recording, edit the draft, and sign before anything reaches the chart. None of this is autonomous, and the published evidence is unambiguous that treating the signature as the review is not safe.
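One way to make the "edit, sign, file" sequence concrete is a small state machine in which review and signature are separate steps and filing is impossible without both. This is a minimal sketch: the state names and the `advance` helper are hypothetical and do not describe any vendor's product.

```python
from enum import Enum


class NoteState(Enum):
    DRAFTED = "drafted"    # model produced a draft from the encounter audio
    REVIEWED = "reviewed"  # clinician opened and edited or confirmed the draft
    SIGNED = "signed"      # clinician attested to the reviewed text
    FILED = "filed"        # note written back to the chart


# Allowed transitions: there is no path from DRAFTED straight to SIGNED or FILED.
ALLOWED = {
    NoteState.DRAFTED: {NoteState.REVIEWED},
    NoteState.REVIEWED: {NoteState.REVIEWED, NoteState.SIGNED},
    NoteState.SIGNED: {NoteState.FILED},
    NoteState.FILED: set(),
}


def advance(current: NoteState, target: NoteState) -> NoteState:
    """Move a note to the target state, or raise if the step skips review."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = NoteState.DRAFTED
state = advance(state, NoteState.REVIEWED)
state = advance(state, NoteState.SIGNED)
state = advance(state, NoteState.FILED)
```

Keeping review as its own recorded state, rather than inferring it from the signature, is the design choice that matters here.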

The four workflows that produce most of the value

WORKFLOW 01

Encounter → SOAP note

The headline workflow. The UCLA NEJM AI randomized trial found Nabla cut time-on-notes by ~9.5% versus usual care (statistically significant). JAMA-published work across five academic medical centers reported ~16-minute reductions in documentation time per provider per day.

WORKFLOW 02

Encounter → coding suggestions

ICD-10, E/M, HCC, CPT — emerging as the surface that justifies the contract on revenue-cycle ROI grounds rather than clinician hours alone. Ambience and Suki lean hardest on this surface.

WORKFLOW 03

Encounter → patient-facing summary

Plain-language after-visit summary auto-generated from the same audio. Reduces post-visit nurse calls and improves comprehension. Most vendors ship this; Nabla and Ambience have the most polished surface.

WORKFLOW 04

Encounter → referral / orders

Referral letters and structured order staging from the encounter. Suki's voice-command surface and Ambience's AutoRefer are the public reference points. Reduces the "chart, write the referral, send" cycle to one approval step.

What the published evidence actually shows

The evidence base in 2026 is thicker than it was eighteen months ago, and uneven across vendors. The headline numbers a buyer should know — and the caveats that go with each:

UCLA NEJM AI RCT (2025): Three-arm pragmatic randomized trial of Nabla, Microsoft DAX Copilot, and usual care. 238 outpatient physicians across 14 specialties. Nabla physicians cut time-on-notes by 41 seconds per note (4:30 → 3:49), a statistically significant 9.5% reduction relative to usual care. DAX showed a smaller, non-significant drop. Both arms reported ~7% improvement in burnout. NCT06792890.
Mass General Brigham JAMA study (2026): Across five academic medical centers, ambient scribes reduced total EHR time by 13.4 minutes/day and documentation time by 16.0 minutes/day. Scribe usage was associated with 0.49 additional visits per week per clinician. Modest, durable, defensible.
Multi-system burnout QI study (PMC, 2025): 263 clinicians across six health systems. Burnout dropped from 51.9% to 38.8% at 30 days. ~74% reduction in odds of burnout. Significant improvements in cognitive task load, after-hours documentation, and focused attention on patients.
Single-system data points: Emory Healthcare reported a 30.7% increase in documentation-related well-being prevalence. Mass General Brigham reported a 21.2% reduction in burnout prevalence at 84 days. Cooper University Healthcare clinicians saved ~4.15 minutes per patient, about an hour daily.
The Permanente Medical Group: ~15,791 physician hours saved across the system after ambient-scribe rollout, the most-cited single-system number in the category.
Caveat, STAT News (April 2026): A large multi-site analysis found that ambient-scribe time savings are real but modest and that adoption is inconsistent; clinicians who use the tool for fewer than ten encounters often do not see durable gains. Pilots that report large numbers usually exclude low-utilization clinicians.
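The burnout figures above mix three different kinds of number: percentage points, relative change, and odds. The sketch below runs only the unadjusted arithmetic on the two quoted prevalences. The study's ~74% odds reduction is its own estimate and is not reproduced by this crude calculation, which suggests it comes from a different (for example adjusted or paired) analysis; that is an inference, not something stated above.

```python
# Unadjusted arithmetic on the two prevalence figures quoted in the text.
before = 0.519  # burnout prevalence at baseline
after = 0.388   # burnout prevalence at 30 days


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


point_drop = (before - after) * 100        # absolute change, in percentage points
relative_drop = (before - after) / before  # relative change in prevalence
crude_odds_ratio = odds(after) / odds(before)

print(f"absolute drop: {point_drop:.1f} points")
print(f"relative drop: {relative_drop:.1%}")
print(f"crude odds ratio: {crude_odds_ratio:.2f}")
```

When a vendor quotes a burnout result, ask which of these three numbers it is and whether it is adjusted.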

Where ambient scribes go wrong

The error rates are small in percentage terms and patterned in a way buyers can plan around — but the failure modes are real and the published evidence is clear that "signature is not review."

Hallucinations: ~1.47% of sentences. The npj Digital Medicine framework analysis (12,999 clinician-annotated sentences across 18 model configurations) reported a 1.47% hallucination rate and a 3.45% omission rate in AI-generated clinical notes. 44% of hallucinations were classified as major, capable of changing diagnosis or management if uncorrected.
Omissions cluster in the parts that matter most. 55% of major omissions occurred in the "current issues" section, 35% in past medical / family / social history, 10% in the assessment and plan. Exactly where a missed item becomes a clinical safety event.
Pronoun and attribution errors. The UCLA RCT reported one mild patient-safety event during the trial window. Most ambient-scribe failure modes outside hallucination are pronoun swaps, mis-attributed quotes, and dropped negations ("denies chest pain" rendered as "chest pain").
Automation bias is a documented risk. The 2025 npj Digital Medicine editorial "Beyond human ears" is direct about it: when clinicians are asked to review a draft, they miss errors more often than they realize. A pre-filled note is harder to scrutinize than a blank one. Designing the workflow with this assumption is non-optional.
Equity is unsolved. Accuracy degrades for non-English consultations and for patient populations underrepresented in training data. Vendors that ship strong multilingual coverage (Nabla, Suki) are not exempt; they are merely better instrumented than monolingual competitors.

The evaluation rubric that survives the demo

Vendors will offer to run a pilot with their own measurement tooling. Don't take that offer at face value. The rubric your team should fix in writing before the pilot starts:

METRIC 01

Time saved per clinician per shift

Not per note. Totals are what the CFO sees and what the burnout study uses. Pull EHR audit logs, not vendor self-reports.
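A per-shift total can be computed directly from an audit-log extract. The sketch below assumes a hypothetical export with one row per clinician per shift; the field names (`clinician`, `phase`, `ehr_minutes`) and the values are illustrative, not a real EHR export format.

```python
from statistics import mean

# Hypothetical audit-log extract: one row per clinician per shift.
rows = [
    {"clinician": "c1", "phase": "baseline", "ehr_minutes": 190},
    {"clinician": "c1", "phase": "baseline", "ehr_minutes": 210},
    {"clinician": "c1", "phase": "pilot", "ehr_minutes": 180},
    {"clinician": "c1", "phase": "pilot", "ehr_minutes": 184},
    {"clinician": "c2", "phase": "baseline", "ehr_minutes": 150},
    {"clinician": "c2", "phase": "pilot", "ehr_minutes": 152},
]


def minutes_saved_per_shift(rows):
    """Return {clinician: baseline mean minus pilot mean}, in EHR minutes per shift."""
    grouped = {}
    for row in rows:
        key = (row["clinician"], row["phase"])
        grouped.setdefault(key, []).append(row["ehr_minutes"])
    saved = {}
    for clinician in sorted({c for c, _ in grouped}):
        baseline = grouped.get((clinician, "baseline"))
        pilot = grouped.get((clinician, "pilot"))
        if baseline and pilot:  # skip clinicians missing either phase
            saved[clinician] = mean(baseline) - mean(pilot)
    return saved


saved = minutes_saved_per_shift(rows)
print(saved)
```

A negative value means the clinician spent more time in the EHR during the pilot. Keep those rows in the report rather than dropping them.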

METRIC 02

Edit distance from draft to signed note

Bigger edits mean the model is off pattern. Bake a sample-and-diff workflow into the pilot from day one.
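A sample-and-diff step can be sketched with Python's standard `difflib`. Note the substitution: `SequenceMatcher.ratio()` is a similarity score over tokens, used here as a stand-in for a true Levenshtein edit distance, not the same measure. The function names and sample sentences are hypothetical.

```python
import difflib


def edit_fraction(draft: str, signed: str) -> float:
    """Share of tokens that changed between draft and signed note (0.0 = identical)."""
    matcher = difflib.SequenceMatcher(a=draft.split(), b=signed.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def changed_spans(draft: str, signed: str):
    """List (operation, draft_tokens, signed_tokens) for every non-matching span."""
    a, b = draft.split(), signed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


draft = "Patient reports chest pain and shortness of breath."
signed = "Patient denies chest pain and shortness of breath."

print(edit_fraction(draft, signed))
print(changed_spans(draft, signed))
```

The sample pair shows why the score alone is not enough: a one-word change produces a small edit fraction but reverses the clinical meaning, so the changed spans need a clinician's eye as well.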

METRIC 03

Hallucination + omission rate

Use a clinician-reviewed held-out sample audited weekly. Match the methodology in the npj Digital Medicine framework: classify majors versus minors and where in the note they cluster.
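The weekly audit reduces to counting reviewer labels. The sketch below assumes a hypothetical annotation format with one row per audited sentence or missing item; the labels, section names, and sample rows are illustrative, and the toy rates are not comparable to the published figures.

```python
from collections import Counter

# Hypothetical reviewer annotations; "kind" is None when the item is correct.
annotations = [
    {"section": "current issues", "kind": None, "severity": None},
    {"section": "current issues", "kind": "omission", "severity": "major"},
    {"section": "current issues", "kind": None, "severity": None},
    {"section": "history", "kind": "hallucination", "severity": "minor"},
    {"section": "history", "kind": None, "severity": None},
    {"section": "assessment and plan", "kind": "hallucination", "severity": "major"},
    {"section": "assessment and plan", "kind": None, "severity": None},
    {"section": "assessment and plan", "kind": None, "severity": None},
]


def summarize(annotations):
    """Compute error rates, the major share per error kind, and where majors cluster."""
    total = len(annotations)
    errors = [a for a in annotations if a["kind"] is not None]
    by_kind = Counter(a["kind"] for a in errors)
    majors = [a for a in errors if a["severity"] == "major"]
    major_by_kind = Counter(a["kind"] for a in majors)
    return {
        "hallucination_rate": by_kind["hallucination"] / total,
        "omission_rate": by_kind["omission"] / total,
        "major_share": {k: major_by_kind[k] / n for k, n in by_kind.items()},
        "major_by_section": dict(Counter((a["kind"], a["section"]) for a in majors)),
    }


summary = summarize(annotations)
print(summary)
```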

METRIC 04

Citation coverage

Percentage of factual claims with a verifiable source — either a transcript passage or a chart artifact. Suki's "evidence-linked documentation" is a vendor-side example of this idea.
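Citation coverage is a simple ratio once each factual claim carries a pointer to its source. The sketch below uses a hypothetical `Claim` record; the field names and the sample identifiers are assumptions, not any vendor's schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Claim:
    text: str
    transcript_span: Optional[Tuple[float, float]] = None  # (start_sec, end_sec) in the audio
    chart_ref: Optional[str] = None                         # hypothetical chart artifact id


def citation_coverage(claims) -> float:
    """Fraction of claims backed by a transcript passage or a chart artifact."""
    if not claims:
        return 0.0
    sourced = sum(
        1 for c in claims
        if c.transcript_span is not None or c.chart_ref is not None
    )
    return sourced / len(claims)


claims = [
    Claim("Denies chest pain.", transcript_span=(62.0, 66.5)),
    Claim("HbA1c 7.2% last month.", chart_ref="lab-result-001"),
    Claim("Takes metformin twice daily.", transcript_span=(120.0, 124.0)),
    Claim("No known drug allergies."),  # no source attached
]

print(citation_coverage(claims))
```

The unsourced claims are the review queue: they are the statements a clinician cannot verify without replaying the encounter.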

METRIC 05

Burnout (validated instrument)

Use a validated single-item or multi-item burnout instrument (e.g., the abbreviated Maslach scale). Measure at baseline, 30 days, and 90 days. Don't use NPS.

METRIC 06

Equity breakdown

Every metric above, split by patient primary language and demographic. The aggregated number is comforting and the disaggregated number is the truth.
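The gap between the aggregated and disaggregated number is easy to demonstrate. The sketch below uses hypothetical audit counts per encounter; the languages and figures are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical audit results: (patient primary language, sentences audited, major errors).
encounters = [
    ("English", 400, 4),
    ("English", 500, 5),
    ("Spanish", 60, 3),
    ("Vietnamese", 40, 3),
]


def error_rate_by_group(encounters):
    """Return {language: major errors per audited sentence}."""
    sentences = defaultdict(int)
    errors = defaultdict(int)
    for language, n_sentences, n_errors in encounters:
        sentences[language] += n_sentences
        errors[language] += n_errors
    return {language: errors[language] / sentences[language] for language in sentences}


overall = sum(e for _, _, e in encounters) / sum(n for _, n, _ in encounters)
by_language = error_rate_by_group(encounters)

print(f"overall: {overall:.3f}")
for language, rate in sorted(by_language.items()):
    print(f"{language}: {rate:.3f}")
```

In this toy data the overall rate looks reassuring because the largest group dominates it, while the smaller groups carry several times the error rate.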

METRIC 07

Audit completeness

Can you produce the prompt, retrieved context, output, and clinician edit for any signed note? If not, the deployment is not ready for production.
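The four artifacts named above can be checked mechanically for every signed note. The sketch below assumes a hypothetical audit store keyed by note id; the field names and sample records are illustrative. An empty `clinician_edit` is treated as a valid record of "signed without changes", while a missing field is a gap.

```python
REQUIRED_FIELDS = ("prompt", "retrieved_context", "model_output", "clinician_edit")


def missing_fields(record: dict) -> list:
    """Return the required audit fields that are absent or None in a record."""
    return [f for f in REQUIRED_FIELDS if f not in record or record[f] is None]


def audit_completeness(records: dict) -> float:
    """Fraction of signed notes whose audit record has every required field."""
    if not records:
        return 0.0
    complete = sum(1 for r in records.values() if not missing_fields(r))
    return complete / len(records)


# Hypothetical audit store: {note_id: audit record}.
records = {
    "note-001": {
        "prompt": "encounter prompt text",
        "retrieved_context": ["lab-result-001"],
        "model_output": "draft note text",
        "clinician_edit": "",
    },
    "note-002": {
        "prompt": "encounter prompt text",
        "model_output": "draft note text",
        "clinician_edit": "changed dose",
    },
}

print(audit_completeness(records))
print(missing_fields(records["note-002"]))
```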

METRIC 08

Adoption, not just availability

Track utilization per clinician per week. STAT's 2026 analysis was clear that tools used for fewer than ten encounters per clinician do not produce durable time savings. Plan the rollout to drive adoption, not just provisioning.
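Weekly utilization tracking can be sketched as a count per clinician per week, checked against a threshold. The usage-log format and names below are hypothetical, and applying the ten-encounter threshold per week is an assumption; the text above does not state the window, so `min_encounters` is left as a parameter.

```python
from collections import Counter

# Hypothetical usage log: one (clinician, ISO week) entry per scribe-assisted encounter.
usage = (
    [("c1", "2026-W05")] * 22
    + [("c2", "2026-W05")] * 4
    + [("c1", "2026-W06")] * 18
    + [("c2", "2026-W06")] * 12
    + [("c3", "2026-W06")] * 9
)
provisioned = ["c1", "c2", "c3"]  # everyone with a license, used or not


def low_utilization(usage, provisioned, min_encounters=10):
    """Return {week: clinicians below min_encounters}, including provisioned non-users."""
    counts = Counter(usage)
    weeks = sorted({week for _, week in usage})
    return {
        week: sorted(c for c in provisioned if counts[(c, week)] < min_encounters)
        for week in weeks
    }


flagged = low_utilization(usage, provisioned)
print(flagged)
```

Iterating over the provisioned list, rather than over the log, is what surfaces clinicians who have a license and never use it.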

For a published reference: the SCRIBE evaluation framework (npj Digital Medicine, 2025) and PDQI-9 are the most cited validated rubrics for AI-generated note quality. Mayo Clinic Proceedings: Digital Health published a 2025 study using simulated ambulatory encounters as the safety-evaluation methodology — useful for stress-testing under conditions that vendor demos cannot reproduce.

Cloud commercial vendors — the shortlist

The five commercial scribes that consistently make U.S. healthcare shortlists. All five are cloud-only — none currently offers a customer-tenanted on-prem deployment. Profiles, comparison, and decision short-circuits live in the vendors hub.

VENDOR

Abridge

Largest deployment scale. Deepest Epic integration. Strongest peer-reviewed evidence after Nabla's. Best for large U.S. systems on Epic.

VENDOR

Ambience Healthcare

Best integrated revenue-cycle play. AutoCDI, AutoCoding, AutoAVS. Built on OpenAI. Cleveland Clinic, UCSF.

VENDOR

DeepScribe

Specialty-tuned (oncology lead). Highest KLAS spotlight (98.8). Transparent ~$350–$500/user/month. Ochsner, Texas Oncology.

VENDOR

Nabla

NEJM AI RCT evidence. No audio stored by default. 14-day retention. 35+ languages. CVS Health, CHLA.

VENDOR

Suki

All four major EHRs incl. deep MEDITECH Expanse. Voice commands beyond docs. Published $299–$399/user/month. MedStar Health.

COMPARE

Side-by-side comparison

EHR coverage, evidence base, default privacy, pricing, funding, and which dimension each vendor wins on.

When the on-prem stack is the right answer instead

Cloud commercial scribes are the right answer for most U.S. health systems whose security program already operates under signed BAAs. The architectural choice every cloud vendor on this page makes — vendor-controlled inference, customer audio leaving the network — does not work for every buyer. The patterns where the on-prem stack is the cleaner answer:

Constraint	Cloud commercial scribe	On-prem (WalledCare)
PHI must remain inside the building	Audio leaves the network — non-starter.	Inside hospital data center. No outbound API.
Canadian provincial residency (PHIPA, HIA, Law 25)	U.S. cloud is the deal-breaker even where vendor is HIPAA-compliant.	Province-resident inference; auditable by the privacy officer.
2026 HIPAA hardening (operating above floor)	Compliant with floor; not the cleanest answer for organizations that want to operate above it.	Encryption at rest mandatory regardless; no third-party audio path to harden.
Multi-app on-prem stack (scribe + Q&A + discharge + handoff)	Different vendors per app, each with its own cloud relationship.	One stack, shared infrastructure, shared audit log.
Self-hosted model dependency	Vendor chooses the model. You don't.	Llama 3.3, Mistral, MedGemma — swap as the open-weight ecosystem evolves.
EHR is non-Epic (Oscar, Meditech, custom)	Variable. Suki is best on MEDITECH; otherwise Epic depth is the moat.	HL7 / FHIR + sidecar. Built per customer.

If one or more of the constraints above is binding, the right next step is not a different commercial vendor — it is a different architecture. The WalledCare on-prem reference stack is the case for that path: hardware footprint, model choices, FHIR-grounded RAG pattern, and the 2026 GPU budget cheatsheet.

Adjacent categories

CATEGORY

Document Q&A

Internal policy, SOP, and pathway lookup grounded in customer-approved sources. The compounding category.

CATEGORY

Private Medical Search

Permission-aware retrieval across local clinical knowledge with traceability and access scopes.

CATEGORY

Discharge Summaries

Drafting and QA workflows for discharge instructions and patient-readable summaries.

CATEGORY

Handoff Tools

Shift-change copilots that organize SBAR-style context without leaking PHI outside approved systems.

Pick the right pilot, not the right demo

The shortest path to a defensible AI-scribe decision is to lock the evaluation rubric before the demos start, fix the constraints in writing, and run a real pilot — same workflow, same metrics, same patient population — across the cloud and on-prem options that survive the constraint screen. WalledCare's pilot process is built around this comparison.

