EVIDENCE · UCLA NEJM AI RCT · 7 min

The UCLA NEJM AI RCT, Explained for Hospital Buyers

The 2025 UCLA pragmatic randomized trial is the cleanest peer-reviewed ambient-scribe comparison in the literature — Nabla vs Microsoft DAX Copilot vs usual care, 238 physicians, 14 specialties, ~48,000 visits. The headline numbers are widely cited and often misread. This is a buyer-side reading of what the study actually measured, what the result implies, and how to use it in procurement without overweighting one paper.

Physicians randomized

238

Three-arm pragmatic RCT (1:1:1 to Nabla, DAX, or usual care) across 14 specialties. Nov 2024 – Jan 2025. Published in NEJM AI 2025. NCT06792890.

Nabla effect

−9.5%

Time-in-note reduction versus control, statistically significant (p=0.02). 4:30 → 3:49 — 41 seconds per note.

DAX effect

−1.7% n.s.

Time-in-note change versus control, not statistically significant (p=0.66). Headline procurement-relevant finding.

What the study actually measured

Two design choices make this trial the most directly applicable to hospital procurement decisions of any ambient-scribe study to date. First, it is pragmatic — physicians used the tools in real outpatient practice across 14 specialties, not in a structured laboratory setting. The result is what a hospital actually gets when it deploys, not what the technology is theoretically capable of. Second, it ran head-to-head — Nabla and DAX in the same study, against the same control arm, with the same measurement framework. Most prior ambient-scribe evidence was vendor-curated cohort data; this was the first peer-reviewed apples-to-apples comparison.

The primary outcome was time-in-note — minutes from opening a clinical note to closing it. The secondary outcomes were physician burnout (Mini-Z), task load (PTL), professional fulfillment (PFI), and safety events. The visit volume was substantial: 24,696 visits for DAX, 23,653 for Nabla — enough power to detect modest effects.

The headline result

Nabla physicians cut time-in-note by 9.5% versus control (4:30 → 3:49, 41 seconds per note, p=0.02). DAX physicians showed a 1.7% reduction that was not statistically significant (p=0.66). Both arms reported burnout, task-load, and fulfillment improvements over control — the well-being benefits cluster together even when the time-savings story diverges.

The 41 seconds matters less than the comparison. Forty-one seconds per note across a 20-visit day is ~14 minutes saved — modest, real, defensible to a CFO, and very different from the "two hours a day" the vendor marketing has been selling. The number that should drive procurement is the peer-reviewed minute-count, not the vendor case study.
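The arithmetic behind the ~14 minutes can be checked with a short sketch. It assumes one note per visit and a 20-visit day, as in the paragraph above; the function and variable names are illustrative and do not come from the study.

```python
def daily_minutes_saved(seconds_saved_per_note: float, notes_per_day: int) -> float:
    """Convert a per-note saving (seconds) into minutes saved per clinic day."""
    return seconds_saved_per_note * notes_per_day / 60.0


# Figures quoted in the article: time-in-note of 4:30 versus 3:49.
baseline_seconds = 4 * 60 + 30                      # 270
scribe_seconds = 3 * 60 + 49                        # 229
saved_per_note = baseline_seconds - scribe_seconds  # 41 seconds

# Assumption: a 20-visit day with exactly one note per visit.
trial_based = daily_minutes_saved(saved_per_note, notes_per_day=20)

# The "two hours a day" marketing figure, for comparison.
vendor_claim_minutes = 120.0

print(f"Trial-based estimate: {trial_based:.1f} min/day")
print(f"Share of vendor claim: {trial_based / vendor_claim_minutes:.0%}")
```

Changing `notes_per_day` to match a clinic's real visit volume is the quickest way to localize the estimate before it goes into an ROI model.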

What the result does not mean: that DAX does not work. Many clinicians prefer it, the burnout signal is real, and Microsoft has shipped subsequent product generations since the trial window. What it does mean: a hospital signing DAX should know that the only peer-reviewed time-in-note measurement available is unfavorable, and the procurement RFP should ask Microsoft for the rebuttal — preferably a follow-up RCT, secondarily an explanation of what has changed since.

The safety result that often gets ignored

Physicians in both arms rated how often notes contained "clinically significant inaccuracies" on a five-point Likert scale, and the averages fell in the "occasional" range: DAX 2.7, Nabla 2.8 (lower is better, where 1 is "no inaccuracies" and 5 is "frequent inaccuracies"). One grade 1 (mild) adverse event was reported across the trial window. The safety profile is similar across the two products and consistent with the broader npj Digital Medicine literature on AI-scribe hallucinations and omissions; see the safety reference.

For a procurement committee weighing the trial, the implication is that the two arms differ on time-savings, not on safety. The safety story is effectively the same for both: both require clinician review, both produce occasional clinically significant inaccuracies, and "signature is review" is not a workflow that catches the errors.

What the trial does not tell you

Three honest caveats a procurement committee should hold in mind:

Outpatient-only. The trial enrolled outpatient physicians in 14 specialties. Inpatient, surgical, ED, and procedural workflows have different baselines and different failure modes. A buyer evaluating either product for inpatient use should ask for specialty-matched and acuity-matched data; neither vendor has it at this scale yet.
Single health system. UCLA is one large academic medical center. Multi-system generalization is plausible but not proven. The Mass General Brigham JAMA cohort (five academic medical centers, 13.4 min/day total EHR-time reduction) is the closest multi-system reference.
Snapshot in time. Both products have shipped substantial updates since the November 2024 – January 2025 trial window. The trial measures specific versions of specific products in a specific setting; durable performance characteristics are still emerging.

How to use the result in procurement

Three concrete moves the committee can make:

Add Question 19 from the RFP checklist. "List the peer-reviewed studies of your product, with author institutions and dates." A vendor that cites the UCLA RCT should be held to the time-savings result the trial reported for its product, whether or not that result reached statistical significance. Vendors that present their own case studies as "studies" should be questioned. See the full checklist.
Base the ROI calculator on the moderate evidence anchor. The UCLA result and the Mass General Brigham cohort both produce time-savings in the ~7-13 minutes/day range. Use that as the floor for your ROI math, not the vendor "two hours/day" envelope. The ROI calculator includes both anchors.
Demand a head-to-head pilot if both products are on the shortlist. The pragmatic RCT design is replicable at pilot scale: three arms (Vendor A, Vendor B, usual care) for three weeks each, with the same physicians rotating through them under the same measurement framework. Most vendors will resist this; the resistance itself is informative.
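One way to set up the rotating three-arm pilot is sketched below. This is a minimal illustration under stated assumptions, not the UCLA trial's allocation method: the roster, the `rotation_schedule` helper, and the cyclic ordering are all hypothetical. Each physician uses every arm once, and each period has the arms spread evenly across physicians.

```python
import random
from collections import Counter

ARMS = ("Vendor A", "Vendor B", "Usual care")


def rotation_schedule(physicians, arms=ARMS, seed=0):
    """Assign each physician an order of arms so everyone uses every arm once.

    Physicians are shuffled, then dealt into cyclic sequences
    (A-B-C, B-C-A, C-A-B), so each period has the arms spread evenly.
    """
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    shuffled = list(physicians)
    rng.shuffle(shuffled)
    n = len(arms)
    return {
        doc: [arms[(i + period) % n] for period in range(n)]
        for i, doc in enumerate(shuffled)
    }


# Hypothetical roster; each period would run three weeks, as in the text.
roster = [f"physician_{k}" for k in range(1, 7)]
schedule = rotation_schedule(roster)

for doc, order in sorted(schedule.items()):
    print(doc, "->", " | ".join(order))

# Tally how many physicians are on each arm in each period.
for period in range(len(ARMS)):
    print(f"Period {period + 1}:", dict(Counter(o[period] for o in schedule.values())))
```

A cyclic rotation like this does not balance carry-over effects, because each arm always follows the same predecessor. A committee with statistical support may prefer a design that also varies the ordering, or one that adds washout time between periods.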

What the result implies for the broader market

The UCLA trial is a signal, not a verdict. It established that pragmatic-RCT-grade evidence is achievable in this category and useful for buyers. It also established that the peer-reviewed time-savings numbers are modest compared with the vendor marketing. Other vendors will face the same scrutiny as their products mature; Abridge's 2025 JAMIA cohort study and the multi-system burnout QI study (PMC, 2025) point in the same direction — real but modest, durable when adoption is consistent, harder to find when adoption is shallow.

For the procurement committee, the right disposition is "weight peer-reviewed numbers above vendor case studies, ask for the rebuttal when the result is unfavorable, and run the head-to-head pilot when the shortlist contains two credible options." See the AI Scribes category guide and the side-by-side vendor comparison for the full evidence map.
