The UCLA NEJM AI RCT, Explained for Hospital Buyers

The 2025 UCLA pragmatic randomized trial is the cleanest peer-reviewed ambient-scribe comparison in the literature — Nabla vs Microsoft DAX Copilot vs usual care, 238 physicians, 14 specialties, ~48,000 visits. The headline numbers are widely cited and often misread. This is a buyer-side reading of what the study actually measured, what the result implies, and how to use it in procurement without overweighting one paper.

Physicians randomized
238

Three-arm pragmatic RCT (1:1:1 to Nabla, DAX, or usual care) across 14 specialties. Nov 2024 – Jan 2025. Published in NEJM AI 2025. NCT06792890.

Nabla effect
−9.5%

Time-in-note reduction versus control, statistically significant (p=0.02). 4:30 → 3:49 — 41 seconds per note.

DAX effect
−1.7% n.s.

Time-in-note change versus control, not statistically significant (p=0.66). Headline procurement-relevant finding.

What the study actually measured

Two design choices make this trial the most directly applicable to hospital procurement decisions of any ambient-scribe study to date. First, it is pragmatic — physicians used the tools in real outpatient practice across 14 specialties, not in a structured laboratory setting. The result is what a hospital actually gets when it deploys, not what the technology is theoretically capable of. Second, it ran head-to-head — Nabla and DAX in the same study, against the same control arm, with the same measurement framework. Most prior ambient-scribe evidence was vendor-curated cohort data; this was the first peer-reviewed apples-to-apples comparison.

The primary outcome was time-in-note — minutes from opening a clinical note to closing it. The secondary outcomes were physician burnout (Mini-Z), task load (PTL), professional fulfillment (PFI), and safety events. The visit volume was substantial: 24,696 visits for DAX, 23,653 for Nabla — enough power to detect modest effects.

The headline result

Nabla physicians cut time-in-note by 9.5% versus control (4:30 → 3:49, 41 seconds per note, p=0.02). DAX physicians showed a 1.7% reduction that was not statistically significant (p=0.66). Both arms reported burnout, task-load, and fulfillment improvements over control — the well-being benefits cluster together even when the time-savings story diverges.

The 41 seconds matters less than the comparison. Forty-one seconds per note across a 20-visit day is ~14 minutes saved — modest, real, defensible to a CFO, and very different from the "two hours a day" the vendor marketing has been selling. The number that should drive procurement is the peer-reviewed minute-count, not the vendor case study.
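The per-day figure is simple arithmetic and worth reproducing before it goes into a business case. A minimal sketch, assuming the 20-visit day used above; the function name and the alternative visit counts (10 and 30) are illustrative and do not come from the trial.

```python
def minutes_saved_per_day(seconds_per_note: float, visits_per_day: int) -> float:
    """Convert a per-note time saving into minutes saved per clinic day."""
    return seconds_per_note * visits_per_day / 60.0


# Reported Nabla time-in-note: 4:30 before, 3:49 after.
before_s = 4 * 60 + 30
after_s = 3 * 60 + 49
saved_s = before_s - after_s  # 41 seconds per note

# 20 visits/day is the article's example; 10 and 30 are hypothetical clinic loads.
for visits in (10, 20, 30):
    saved_min = minutes_saved_per_day(saved_s, visits)
    print(f"{visits} visits/day: {saved_min:.1f} min saved")
```

At 20 visits this comes to roughly 13.7 minutes, which the text rounds to ~14. Swapping in a clinic's own visit volume shows how quickly the estimate shrinks for lower-volume specialties.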

What the result does not mean: that DAX does not work. Many clinicians prefer it, the burnout signal is real, and Microsoft has shipped subsequent product generations since the trial window. What it does mean: a hospital signing DAX should know that the only peer-reviewed time-in-note measurement available is unfavorable, and the procurement RFP should ask Microsoft for the rebuttal — preferably a follow-up RCT, secondarily an explanation of what has changed since.

The safety result that often gets ignored

Physicians in both arms reported occasional "clinically significant inaccuracies" on a five-point Likert scale: DAX averaged 2.7, Nabla 2.8 (lower is better; 1 is "no inaccuracies" and 5 is "frequent inaccuracies"). One grade 1 (mild) adverse event was reported across the trial window. The safety profile is closely similar across the two products and consistent with the broader npj Digital Medicine literature on AI-scribe hallucinations and omissions (see the safety reference).

For a procurement committee weighing the trial, the implication is that the two arms differ on time-savings, not on safety. The safety story is effectively the same: both require clinician review, both produce occasional clinically significant inaccuracies, and "signature is review" is not a workflow that catches the errors.

What the trial does not tell you

Three honest caveats a procurement committee should hold in mind:

  • Outpatient-only. The trial enrolled outpatient physicians in 14 specialties. Inpatient, surgical, ED, and procedural workflows have different baselines and different failure modes. A buyer evaluating either product for inpatient use should ask for specialty-matched and acuity-matched data; neither vendor has it at this scale yet.
  • Single health system. UCLA is one large academic medical center. Multi-system generalization is plausible but not proven. The Mass General Brigham JAMA cohort (five academic medical centers, 13.4 min/day total EHR-time reduction) is the closest multi-system reference.
  • Snapshot in time. Both products have shipped substantial updates since the November 2024 – January 2025 trial window. The trial measures specific versions of specific products in a specific setting; durable performance characteristics are still emerging.

How to use the result in procurement

Three concrete moves the committee can make:

  • Add Question 19 from the RFP checklist. "List the peer-reviewed studies of your product, with author institutions and dates." A vendor that cites the UCLA RCT should be asked whether its arm's time-savings result reached statistical significance. Vendors who present their own case studies as "studies" should be questioned. See the full checklist.
  • Anchor the ROI calculator on the moderate evidence anchor. The UCLA result and the Mass General Brigham cohort both produce time-savings in the ~7-13 minutes/day range. Use that as the floor for your ROI math, not the vendor "two hours/day" envelope. The ROI calculator includes both anchors.
  • Demand a head-to-head pilot if both products are in the shortlist. The pragmatic RCT design is replicable at the pilot scale. Three arms (Vendor A, Vendor B, usual care) for three weeks each, with the same physicians rotating and the same measurement framework. Most vendors will resist this; the resistance itself is informative.
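One way to lay out the rotating pilot described in the last item is a small crossover schedule. The sketch below is an assumption about how a committee might organize it, not the UCLA protocol (which randomized physicians to a single arm): physicians are shuffled and dealt into three arm orders taken from a 3x3 Latin square, so every arm is in use during every three-week period. The function name, physician labels, and seed are hypothetical.

```python
import random

ARMS = ["Vendor A", "Vendor B", "Usual care"]
WEEKS_PER_PERIOD = 3  # "three weeks each", per the text


def rotation_schedule(physicians, seed=0):
    """Assign each physician an arm order drawn from a cyclic 3x3 Latin square.

    Physicians are shuffled, then dealt round-robin across the three
    sequences so each period has all three arms running.
    """
    rng = random.Random(seed)
    shuffled = list(physicians)
    rng.shuffle(shuffled)
    # Cyclic shifts of ARMS: each arm appears once per sequence and once per period.
    sequences = [ARMS[i:] + ARMS[:i] for i in range(len(ARMS))]
    return {doc: sequences[idx % len(sequences)] for idx, doc in enumerate(shuffled)}


if __name__ == "__main__":
    schedule = rotation_schedule([f"physician_{n}" for n in range(1, 7)])
    for doc, order in sorted(schedule.items()):
        periods = ", ".join(
            f"weeks {p * WEEKS_PER_PERIOD + 1}-{(p + 1) * WEEKS_PER_PERIOD}: {arm}"
            for p, arm in enumerate(order)
        )
        print(f"{doc}: {periods}")
```

A cyclic square like this keeps calendar effects from favoring one arm, but it does not balance carryover from one tool to the next; a committee that cares about that would want a washout gap or all six possible arm orders.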

What the result implies for the broader market

The UCLA trial is a signal, not a verdict. It established that pragmatic-RCT-grade evidence is achievable in this category and useful for buyers. It also established that the peer-reviewed time-savings numbers are modest compared with the vendor marketing. Other vendors will face the same scrutiny as their products mature; Abridge's 2025 JAMIA cohort study and the multi-system burnout QI study (PMC, 2025) point in the same direction — real but modest, durable when adoption is consistent, harder to find when adoption is shallow.

For the procurement committee, the right disposition is "weight peer-reviewed numbers above vendor case studies, ask for the rebuttal when the result is unfavorable, and run the head-to-head pilot when the shortlist contains two credible options." See the AI Scribes category guide and the side-by-side vendor comparison for the full evidence map.

