PILOT PLAYBOOK · 30 DAYS · 6 min
The 30-Day Private AI Pilot
Most healthcare AI pilots fail not because the technology underperforms but because the pilot was never structured to fail safely. A 30-day private pilot — small enough to scope cleanly, long enough to surface real patterns — is the right shape for a hospital deciding whether private AI is worth a bigger program. This is the week-by-week playbook.
Long enough to surface real workflow patterns, short enough that stop conditions can pause it cheaply. Most decisions earn their evidence in this window.
The right first-pilot scope is one specialty or one clinic where reviewers, escalation owners, and documentation baselines are easy to name before the first live session.
The most-common pilot failure mode is not technology underperformance — it is drift away from the documented evaluation framework as adoption pressure grows. Stop conditions exist to interrupt drift.
Pre-week zero: define the pilot before standing it up
The pilot's design before kickoff matters more than the technology choice. Three artifacts the steering committee should sign off on in writing before any infrastructure goes live:
- checkScope memo. Specialty, clinic, clinician count, encounter types in scope, encounter types out of scope, expected weekly volume, data classes allowed. One page.
- checkEvaluation framework. The metrics the pilot will measure — time saved per clinician per shift, edit distance, omission rate (with section weighting per the safety reference), critical-omission flags, clinician confidence, escalation volume. Each with a threshold that triggers review.
- checkStop conditions and restart criteria. The conditions that pause the pilot (recurring section-stratified omissions, audited hallucination rate above threshold, clinician-reported safety event) and the criteria that allow restart (documented fix, narrow-cohort retest, equivalent measurement against the original framework). These exist before pressure to dismiss them exists.
Week 1: Infrastructure and baseline
Stand up the local model runtime (Ollama for a workstation-scale pilot; vLLM if the pilot will share GPU capacity across two or three users). Confirm air-gapped operation. Document the data path on a whiteboard — every place audio, transcript, or generated note moves through the system. Capture a baseline week of documentation time and edit patterns under the existing workflow, without the AI in play. This is the comparison data the rest of the pilot will be measured against.
Avoid the temptation to involve clinicians yet. Week 1 is operational — the IT and clinical-informatics leads validate that the infrastructure does what the architecture diagram says it does. Patient-facing use comes after the infrastructure has been audited against the data-path artifact.
Week 2: Shadow mode
Run the AI in parallel with the existing workflow. Clinicians document normally; the model generates drafts that are reviewed by the clinical informatics team but do not enter the chart. This is the lowest-risk validation phase and the one most pilots skip. The point is to detect hallucinations and omissions before they have a clinical consequence — exactly the npj Digital Medicine framework analysis recommends as the highest-leverage pre-deployment check.
Sample 10–20 encounters per specialty. Audit against the original audio, not just the transcript — the Whisper transcription layer is a documented hallucination source, and the audit has to compare the signed note (or in this case, the shadow draft) against the source audio. Tag every error by section (HPI, history, A&P) and severity. Build the first calibration of what failure-frequency baseline looks like in your specialty.
Week 3: Clinician-supervised review
Clinicians begin reviewing drafts and signing notes. Audio retention stays on (the original audio is a safety control, not just a privacy concern). Edit distance, time-to-sign, and clinician-rated quality get logged per encounter. The steering committee meets at the midpoint to review the first three weeks of data against the stop conditions.
If the data is clean — edit distance trending down with use, section-stratified omission rate inside the threshold, no critical-omission flags — proceed. If the data shows drift (clinicians editing less than the audit suggests they should be editing, omission rate concentrated in HPI / current issues), pause for a documented fix before continuing. The pilot should be reversible at this checkpoint.
Week 4: Decision gate
The last week of the pilot is the evidence-consolidation phase, not new experimentation. The steering committee assembles the pilot's evidence against the decision rubric defined in pre-week zero, and answers four questions explicitly:
Time-to-sign minus baseline, per clinician per shift, averaged across the pilot population. Compare against the peer-reviewed range — UCLA NEJM AI 9.5%, Mass General Brigham 13.4 min/day. Pilot results outside the published range deserve scrutiny in either direction.
Hallucination and omission rate, section-stratified, audited against the source audio. Stop-condition triggers documented. Any critical-omission flags reviewed by the clinical-safety lead with named follow-up.
Were clinicians using the review process as designed, or improvising? Was the escalation path used? Did the audit log capture what it was supposed to capture? Governance posture matters as much as the technology result.
If the pilot succeeded, what does the next-quarter expansion look like in detail — clinicians, infrastructure, training, governance? If the answer is "we'll figure it out," the pilot is not actually done.
Why private (not vendor-cloud) for the first pilot?
A private pilot — running on hospital-owned infrastructure or in a customer-tenanted private deployment — produces evidence the hospital owns and can reuse against any vendor afterwards. A cloud-vendor pilot produces evidence the vendor partly owns, structured by the vendor's measurement tooling, with parameters the vendor at least partially controls. For a hospital that hasn't yet decided whether private AI is the right architecture, the private pilot is the cheaper experiment — and the evidence is more useful in subsequent vendor conversations.
The reference stack for a private pilot is documented in the on-premise clinical assistants architecture article — pilot-scale GPU options, clinical-surface choices, FHIR-grounded RAG patterns, and the 2026 hardware budget. The shape of the runtime layer is documented in the open-source profiles in the directory. The compliance posture under Canadian provincial regimes is documented in the compliance hub.
What changes after the 30-day pilot
The honest framing for the post-pilot conversation: a successful private pilot does not commit the hospital to building a hospital-wide private stack. It commits the hospital to having defensible evidence that a private workflow can produce a real, measurable result on its own infrastructure. That evidence becomes the comparison basis for a subsequent head-to-head pilot against a cloud vendor — the only honest way to make the cloud-vs-local decision is to have measured both, in the same workflow, with the same evaluation rubric.
WalledCare and Moneli Automation's pilot process is exactly this shape — short, scoped, evidence-producing, decision-enabling. The artifact at the end is a written go / no-go recommendation the hospital owns, not a vendor sales pitch.