PILOT PLAYBOOK · 30/60/90 DAYS · 9 min

AI Scribe Pilot Playbook

The canonical 30/60/90-day pilot framework Moneli Automation uses with hospital steering committees evaluating an AI scribe — three discrete phases, written stage gates, weekly cadence, the specific metrics that matter at each phase, the stop conditions that pause without drama, and the decision rubric that ends the pilot with a yes/no the committee can defend.

Phases

30 days infrastructure + shadow mode, 30 days clinician-supervised review, 30 days expanded use and decision gate.

Stage gates

Each phase ends with a documented go / pause / stop decision against pre-written criteria — not at the steering committee's discretion.

Decision artifact

Yes / no

The pilot ends with a written go / no-go recommendation the committee owns — not a vendor pitch deck.

Failure mode

Drift

Most pilots fail not on technology but on drift away from the evaluation rubric as adoption pressure builds. Stage gates exist to interrupt drift.

Pre-pilot: artifacts that must exist before day one

The pilot's pre-week-zero artifacts are the entire reason it works. Three documents the steering committee signs off on before any infrastructure stands up:

checkPilot scope memo (1 page). Specialty, clinic, clinician count, encounter types in scope, encounter types out of scope, expected weekly volume, data classes allowed, PIA status. Signed by clinical informatics, privacy, IT, and clinical safety leads.
checkEvaluation framework (2 pages). The metrics, the thresholds, the stop conditions, the restart criteria. Lifted from the safety reference's six-safeguard pattern plus the time-savings rubric from the ROI calculator's evidence anchors.
checkGovernance memo (1 page). Who can pause the pilot, who escalates safety events, what notification timelines apply, what the audit retention window is. Names individuals, not roles.

Phase 1 — Days 1-30: infrastructure + shadow mode

The first month is operational, not clinical. The infrastructure stands up, the data path is audited against the architecture diagram, and the AI runs in shadow mode — generating drafts that the clinical informatics team reviews but that never enter the chart. This is the lowest-risk validation phase and the one most pilots skip.

WEEK 01

Stand up the stack

Vendor onboarding, integration setup, EHR sandbox access. No clinicians, no real patient data. End-of-week artifact: a whiteboard diagram of the data path matching reality, signed by IT lead.

WEEK 02

Baseline week

Capture a week of documentation time and edit patterns under the existing workflow, with no AI in play. This is the comparison data the entire pilot will be measured against.

WEEK 03

Shadow mode

AI generates drafts in parallel with normal workflow. Clinical informatics team audits 10-20 encounters per specialty against the source audio. Section-stratified error tagging begins.

WEEK 04

Stage gate 1

Phase-1 review: did the data path match the diagram? Did the baseline week complete? Did the shadow-mode audit reveal hallucination / omission rates inside the evaluation thresholds? Go / pause / stop decision documented.

Stage gate 1 criteria: Data path verified. Baseline captured. Shadow-mode audit shows hallucination rate ≤ 2% and omission rate ≤ 5% (section-stratified). No critical-omission flags. If yes on all four, proceed to Phase 2. If no on any, pause and escalate.

Phase 2 — Days 31-60: clinician-supervised review

Clinicians begin reviewing drafts and signing notes. Audio retention stays on (the original audio is a safety net, not a privacy issue). Edit distance, time-to-sign, and clinician-rated quality get logged per encounter. The audit cadence continues at 10-20 encounters per week, with the section-stratified sampling concentrated on HPI / current-issues per the published evidence.

WEEK 05

First live week

Limited cohort (5-10 clinicians) review and sign drafts. All audio retained. Edit-distance logging on. Daily stand-up with clinical informatics; weekly review with steering committee.

WEEK 06

Audit deep-dive

Mid-phase audit: 30-40 encounters reviewed against audio. Edit-distance trend analyzed. Clinician feedback survey distributed.

WEEK 07

Calibration

Adjust prompts, vendor configuration, or workflow based on audit findings. Document each change with a before / after measurement.

WEEK 08

Stage gate 2

Phase-2 review: edit distance trending down? Section-stratified omission rate inside threshold? Critical-omission flags reviewed? Clinician confidence rising? Time-to-sign measurably better than baseline?

Stage gate 2 criteria: Edit distance flat or trending down. No recurring critical-omission patterns. Clinician-survey net positive. Measured time-to-sign improvement within the published evidence range (peer-reviewed 7-13 min/day per the ROI calculator anchors). If yes, proceed to Phase 3. If no, pause and escalate.

Phase 3 — Days 61-90: expanded use and decision gate

The cohort expands to the full pilot scope. The metrics from Phase 2 continue with the larger denominator. The steering committee compiles the evidence packet for the final decision. The last week is consolidation, not new experimentation.

WEEK 09

Expanded cohort

Scope widens to the full pilot population. Existing metrics continue. New cohort gets a 2-3 day onboarding window with the clinical informatics team.

WEEK 10

Steady state

Operate at expanded scale for one week. Audit continues at the higher denominator. Capacity utilization and concurrency patterns observed.

WEEK 11

Evidence consolidation

Compile the evidence packet: time savings, hallucination/omission rates by section, critical-flag count, clinician-confidence survey, edit-distance trend, capacity utilization, cost-comparison against the TCO worksheet.

WEEK 12

Decision gate

Steering committee reviews evidence against the four decision-rubric questions. Written go / no-go recommendation produced. Next-quarter plan (if go) or wind-down plan (if no-go) attached.

The four decision-gate questions

Phase 3 ends with the steering committee answering four questions in writing. These four are the entire purpose of the pilot:

QUESTION 01

Did the workflow save time?

Time-to-sign minus baseline, per clinician per shift, averaged across the pilot. Compare against UCLA NEJM AI 9.5% and Mass General Brigham 13.4 min/day. Results outside the published range deserve scrutiny in either direction.

QUESTION 02

Was the safety profile inside threshold?

Hallucination and omission rate, section-stratified, audited against the source audio. Stop-condition triggers documented. Critical-omission flags reviewed with named follow-up.

QUESTION 03

Did the workflow govern itself?

Were clinicians using the review process as designed, or improvising? Was the escalation path used? Did the audit log capture what it was supposed to capture? Governance posture matters as much as the technology result.

QUESTION 04

Is the next-quarter scale plan honest?

If the pilot succeeded, what does the next-quarter expansion look like in detail — clinicians, infrastructure, training, governance? If the answer is "we'll figure it out," the pilot is not actually done.

Stop conditions and the restart path

Any one of the following patterns pauses the pilot immediately, regardless of phase. The pause does not require steering-committee debate at the moment of trigger — the trigger and the pause are pre-decided:

closeAudited hallucination rate above 3% for two consecutive audit cycles. The published baseline is 1.47%; sustained 3%+ is a signal of vendor regression or workflow drift.
closeRecurring critical-omission patterns. Three or more critical-omission flags in the same note section, same week — concentrate in HPI / current-issues per the published evidence.
closeFalling edit distance with rising omission rate. The signature of automation bias — clinicians editing less while the model is producing more missed content.
closeAudit log gap. Any disruption to the audit-log capture during a clinical session is a stop condition, not a tolerable variation.
closeClinician-reported safety event. Single occurrence. Routine pause-and-review, restart only after the incident has been resolved.

The restart path: document the fix, retest in a narrow cohort (5-10 clinicians, one week), confirm the original evaluation framework still applies, get steering-committee sign-off on resumption. Pilots that resume without these steps tend to drift into normalized unsafe operation.

The governance memo template

One-page artifact. Reads like a contract between the pilot and the institution. Should be in writing before week one:

checkPilot owner: [Named clinical informatics lead]. Authority to pause the pilot at any time without prior steering-committee approval.
checkSafety lead: [Named clinical safety officer]. Reviews all critical-omission flags within 24 hours. Authority to require workflow changes.
checkPrivacy lead: [Named privacy officer]. Reviews all audit-log gaps and breach-related triggers. Authority to require contract escalation.
checkEscalation path: Clinical events → Safety lead → CMO. Privacy events → Privacy lead → General Counsel. Operational events → Pilot owner → CIO.
checkAudit retention: Audio + transcripts + signed-note diffs retained for [X years, matching the strictest applicable regime — Quebec Law 25 / PHIPA / HIA / HIPAA + state requirements].
checkCommunication cadence: Daily clinical-informatics stand-up. Weekly steering-committee review. Stage-gate decision memos signed by all three named leads.

Where this fits in the WalledCare directory

This playbook is the canonical extension of the shorter How to Test an AI Scribe Safely guide and the 30-Day Private AI Pilot blog post. Pair with the safety reference for the evaluation framework, the RFP questions for the procurement-side artifacts, and the ROI calculator + TCO worksheet for the economics.

sendRequest a WalledCare pilot menu_bookBack to guides