On-Premise Clinical Assistants: a reference stack for hospital-owned AI
A clinical scribe, a discharge drafter, a shift handoff, and a document Q&A agent — all running on hardware you own, never calling an outside API. Here is what the stack looks like in 2026, what it costs, and where it breaks.
15,791 physician hours saved in one health system (The Permanente Medical Group) after rolling out an ambient AI scribe.
~70% of AI scribe notes contained at least one error in a 2025 primary-care analysis; 44% of hallucinations were classified as major.
Below 6%: the residual hallucination rate reported for a Self-RAG loop grounded in cited passages on structured clinical tasks.
Why on-prem, in 2026
Hospitals are under two pressures at once. Clinical documentation still eats roughly 16 minutes per ambulatory patient visit (the figure Overhage and McCallie reported in their 2020 Annals of Internal Medicine analysis of EHR use), and burnout driven by “pajama time” in the EHR is a headline problem. At the same time, the regulatory floor is moving up: the 2026 HIPAA Security Rule update reclassifies encryption as mandatory rather than addressable, adds vulnerability scanning requirements for AI infrastructure, and compresses incident notification to 72 hours.
In Canada, PIPEDA does not govern public hospitals’ core activities, as the Office of the Privacy Commissioner states plainly, but every province with meaningful hospital volume has its own health information act: PHIPA in Ontario, HIA in Alberta, PIPA in British Columbia, and PHIA in Manitoba and Nova Scotia. Each carries mandatory audit-log and data-residency expectations. Keeping compute close to the data is the simple path through all of this.
The commercial case has caught up. The Permanente Medical Group reported 15,791 physician hours saved after one ambient-scribe deployment. Intermountain Health recorded a 27% reduction in time in notes per appointment for clinicians who used their scribe on at least ten encounters. A JAMA Network Open quality-improvement study across six systems watched clinician burnout drop from 51.9% to 38.8% thirty days into the rollout.
The stack, layer by layer
A production on-prem clinical assistant has three layers, and each is a budget line.
Interface: a web portal, a bedside tablet, or a sidecar in Epic or Oscar. This is where the assistant is visible, and where the identity provider plugs in for SSO, role-based access, and session audit.
Model: Llama 3.3 70B, Mistral Large, or a medically tuned variant (Meditron 70B, MedGemma 27B), served through vLLM or NVIDIA Triton. Tensor parallelism across H100s with FP8 quantization is the 2026 sweet spot.
Data: the EHR over HL7 / FHIR, a vector index of approved policies and SOPs, and an append-only audit log. Outbound network traffic is blocked by default. Encryption at rest is now mandatory, not optional.
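One way to make an audit log append-only in a checkable sense is to chain each entry to the hash of the one before it, so that any later edit breaks the chain. The sketch below is an in-memory illustration only: the `AuditLog` class, its field names, and the choice of SHA-256 over sorted JSON are assumptions for the example, not part of any particular product or standard.

```python
import hashlib
import json
import time


class AuditLog:
    """In-memory, hash-chained audit log (illustrative, not production storage)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(entry):
        # Hash every field except the stored hash itself, with stable key order.
        material = {k: v for k, v in entry.items() if k != "hash"}
        encoded = json.dumps(material, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, actor, action, payload):
        """Add one event and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "payload": payload,
            "prev_hash": prev_hash,
        }
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; False means an entry was altered or reordered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True
```

A real deployment would persist entries to write-once storage and anchor the latest hash somewhere outside the system being audited; the chain only detects tampering, it does not prevent it.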
Four workflows that actually save time
Every ambient-scribe study measures time-on-notes, but four workflows produce most of the reclaimed clinician hours in practice. Pick one for the pilot.
Clinical scribe: UCLA’s randomized trial across 238 physicians and 72,000 encounters found Nabla users cut documentation time by about 10% versus usual care. Permanente reports about 16 minutes saved per 8 hours of patient care.
Discharge drafter: a structured draft built from the encounter plus the last visit’s plan. Fast to produce, still needs clinician sign-off. The gain compounds when discharge is the bottleneck in the unit.
Shift handoff: an SBAR generated from the last twelve hours of notes and vitals. Saves the hour of verbal handoff and catches the omissions that reliance on memory usually creates.
Document Q&A: semantic search over policies, formularies, and care pathways, answered with inline source links. This is the one that compounds, because every team member gets faster at finding the same answer.
What goes wrong
The error rates are not small, and the failure modes are patterned — which means they are addressable, but only if you design for them from day one.
- Hallucinations are common. A 2025 competitive analysis of primary-care scribes found ~70% of generated notes contained at least one error, with 44% of hallucinations classified as major (the kind that could alter diagnosis or management). Omissions and pronoun swaps are the two most frequent types.
- Automation bias is real. The 2025 npj Digital Medicine editorial on AI scribes is blunt: even when clinicians are expected to review the draft, they miss errors more often than they realize. A signature is not a review.
- Equity is not solved. Accuracy drops for non-English consultations and for patient populations underrepresented in training data. Monolingual evaluation misses real-world failure modes that show up in multilingual catchments.
- Without citations, nothing is verifiable. If a claim cannot be traced to an approved source in your grounding set, it cannot be audited, and it cannot be safely acted on.
A pattern that works: FHIR-grounded RAG with structured artifacts
The arXiv literature from the last twelve months has converged on a pattern worth copying. The Model Context Protocol (MCP) for FHIR lets a local LLM request exactly the resources it needs — a patient’s medications, last visit, active problem list — through a declarative interface rather than a blob-of-notes dump.
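To make the contrast with a blob-of-notes dump concrete, here is a minimal sketch of turning a handful of FHIR resources into a compact, labelled block of text for the prompt. It assumes FHIR R4 field names (`code.text`, `medicationCodeableConcept.text`, `dosageInstruction[].text`, `valueQuantity`) and plain Python dicts as input; the function name `render_patient_artifact` and the section titles are hypothetical, and the MCP transport itself is out of scope.

```python
def render_patient_artifact(resources):
    """Render FHIR R4 resource dicts as short labelled sections for a prompt."""
    problems, meds, vitals = [], [], []

    for res in resources:
        kind = res.get("resourceType")
        if kind == "Condition":
            problems.append(res.get("code", {}).get("text", "unknown condition"))
        elif kind == "MedicationRequest":
            name = res.get("medicationCodeableConcept", {}).get(
                "text", "unknown medication"
            )
            # dosageInstruction is a list in R4; use the first entry if present.
            dosage = (res.get("dosageInstruction") or [{}])[0].get("text", "")
            meds.append(f"{name} {dosage}".strip())
        elif kind == "Observation":
            name = res.get("code", {}).get("text", "unknown observation")
            qty = res.get("valueQuantity", {})
            vitals.append(
                f"{name}: {qty.get('value', '?')} {qty.get('unit', '')}".strip()
            )
        # Any other resource type is ignored in this sketch.

    sections = [
        ("Active problems", problems),
        ("Medications", meds),
        ("Recent vitals", vitals),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in (items or ["none recorded"]))
    return "\n".join(lines)
```

Writing "none recorded" explicitly, rather than omitting an empty section, is a deliberate choice here: it lets the model and the reviewing clinician tell an empty list from a missing one.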
A Self-RAG loop (generate an initial answer, list claims without citations, then refine using only cited passages) has been shown to push residual hallucinations below 6% on structured clinical tasks. Structured patient artifacts — a pre-rendered problem list, med list, and recent vitals passed into the context — consistently reduce hallucination versus handing the model a raw note.
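The generate, check, refine loop described above can be sketched in a few lines. This is an illustration of the loop as this article describes it, not a reproduction of any published Self-RAG implementation. The `generate` callable stands in for the local model and is hypothetical, as are the `[P1]`-style citation markers and the one-sentence-per-claim simplification.

```python
import re

CITATION = re.compile(r"\[(P\d+)\]")  # assumed marker format, e.g. "[P2]"


def split_claims(answer):
    """Treat each sentence as one claim (a deliberate simplification)."""
    parts = re.split(r"(?<=[.!?])\s+", answer)
    return [p.strip() for p in parts if p.strip()]


def uncited_claims(answer, passages):
    """Claims that cite nothing, or cite an ID missing from the grounding set."""
    flagged = []
    for claim in split_claims(answer):
        ids = CITATION.findall(claim)
        if not ids or any(pid not in passages for pid in ids):
            flagged.append(claim)
    return flagged


def self_rag(question, passages, generate, max_rounds=2):
    """Draft, list unsupported claims, and ask the model to refine.

    generate(question, passages, feedback) -> str is supplied by the caller.
    """
    answer = generate(question, passages, feedback=[])
    for _ in range(max_rounds):
        flagged = uncited_claims(answer, passages)
        if not flagged:
            break
        answer = generate(question, passages, feedback=flagged)

    # Whatever is still unsupported is dropped rather than shown.
    still_flagged = set(uncited_claims(answer, passages))
    kept = [c for c in split_claims(answer) if c not in still_flagged]
    return " ".join(kept)
```

Note that this check only confirms a citation marker points at a passage in the grounding set. Whether the passage actually supports the claim needs a separate entailment check or clinician review.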
The shape to build: MCP-FHIR for context retrieval, a vector index of policies and local protocols for document grounding, a Self-RAG loop for generation, and citation tracing so the clinician sees exactly what the draft is based on. Latency budget: under two seconds from user action to first token, or the workflow breaks.
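A latency budget is only useful if it is measured the way the user experiences it, so the timer should start before retrieval and prompt assembly rather than at the model call. The sketch below shows one way to do that. `start_stream` is a hypothetical callable that performs retrieval and returns a token iterator, and judging the budget at the 95th percentile is an assumption for the example, not something the sources specify.

```python
import math
import time

TTFT_BUDGET_S = 2.0  # the two-second budget from the text


def time_to_first_token(start_stream, clock=time.perf_counter):
    """Run start_stream() and time how long the first token takes to arrive.

    start_stream is expected to do retrieval and prompt assembly, then
    return an iterable of tokens. Returns (first_token, seconds).
    """
    started = clock()
    stream = start_stream()
    first = next(iter(stream), None)  # None if the stream is empty
    return first, clock() - started


def within_budget(samples, budget=TTFT_BUDGET_S, quantile=0.95):
    """True if the nearest-rank quantile of the samples is under budget."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1] < budget
```

Checking a tail quantile rather than the mean matters because a workflow that is fast on average but stalls on one request in ten still feels broken at the bedside.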
Hardware budget (2026 baseline)
Conservative numbers for a single hospital serving concurrent clinical assistants. Plan for a doubling every eighteen months as model sizes and context windows grow.
| Scale | Concurrent users | GPU | Total VRAM | Notes |
|---|---|---|---|---|
| Pilot | 10–20 | 1× A100 80GB | 80 GB | 8B model at FP16, or 70B at AWQ 4-bit with tight batch size. |
| Department | 50–100 | 2× A100 80GB or 2× H100 | 160 GB | 70B in AWQ 4-bit comfortably; room for longer contexts. |
| Hospital | 200–500 | 4–8× H100, NVLink | 320–640 GB | 70B at FP8 with tensor parallelism; multiple models; long context. |
Reference: the vLLM Llama 3.3 70B recipe is the standard starting point for Blackwell and Hopper hardware; FP8 on Hopper and NVFP4 on Blackwell give the best quality-to-throughput ratio.
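The rows in the table can be sanity-checked with back-of-envelope arithmetic: weight memory is parameter count times bytes per weight, and KV-cache memory grows linearly with the number of tokens held in flight. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 key-value heads, head dimension 128) and ignores quantization metadata, activations, and runtime overhead, so treat the results as lower bounds and verify the shape against the model's own config.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache memory in GB for `tokens` cached tokens.

    Defaults are the assumed Llama 3 70B shape with a 16-bit cache.
    """
    # One key vector and one value vector per layer per KV head.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


def headroom_gb(total_vram_gb, params_billion, bits_per_weight, tokens):
    """VRAM left after weights and KV cache; negative means it does not fit."""
    return total_vram_gb - weight_gb(params_billion, bits_per_weight) - kv_cache_gb(tokens)


if __name__ == "__main__":
    # Pilot row: 70B at 4-bit on a single 80 GB card, 100k cached tokens total.
    print(weight_gb(70, 4))                 # weights only
    print(kv_cache_gb(100_000))             # cache for all concurrent requests
    print(headroom_gb(80, 70, 4, 100_000))  # what is left for everything else
```

Under these assumptions a 70B model at 4 bits needs about 35 GB for weights alone, which is why the pilot row calls for a tight batch size: the remaining memory has to cover every concurrent user's context.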
Evaluation: measure what matters
Do not evaluate the model in isolation. Evaluate the workflow. The metrics that predict whether a deployment survives year two:
- Time saved per clinician per shift, not per note. Totals are what the CFO sees.
- Edit distance from the draft to the signed note. Bigger edits mean the model is off pattern.
- Citation coverage: the percentage of factual claims with a verifiable source in the grounding set.
- Hallucination rate on a clinician-reviewed held-out sample, audited weekly.
- Equity breakdown. Every metric above, split by primary language and patient demographic.
- Audit coverage. Can you produce the full prompt, retrieved context, output, and clinician edit for any event? If not, you are not yet in production.
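Three of the metrics above reduce to small computations once the data is logged. The sketch below shows one possible definition of each. The word-level similarity from Python's `difflib` stands in for a true edit distance, and the function names, record fields, and handling of empty input are all assumptions for illustration.

```python
import difflib
from collections import defaultdict
from statistics import mean


def edit_ratio(draft, signed):
    """0.0 if the signed note equals the draft, approaching 1.0 as they diverge.

    Uses word-level difflib similarity as a proxy for edit distance.
    """
    matcher = difflib.SequenceMatcher(
        a=draft.split(), b=signed.split(), autojunk=False
    )
    return 1.0 - matcher.ratio()


def citation_coverage(claims, grounding_ids):
    """Fraction of claims whose source ID is in the grounding set.

    claims is a list of (text, source_id_or_None) pairs.
    """
    if not claims:
        return 0.0  # assumption: an empty note is treated as uncovered
    covered = sum(1 for _, source in claims if source in grounding_ids)
    return covered / len(claims)


def breakdown(records, metric, group_key):
    """Mean of `metric` per value of `group_key`, e.g. per primary language."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[metric])
    return {group: mean(values) for group, values in groups.items()}
```

For example, `breakdown(records, "edit_ratio", "language")` over a week of signed notes gives the per-language view that the equity item asks for.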
Where WalledCare fits
We build the reference stack above — configured for your workflows, your EHR, and your compliance officer’s sign-off sheet. The nine apps on the home page (Chart Whisperer, Document Q&A, Policy Navigator, Data Forge, Equity Lens, Discharge Summary, Shift Handoff, Referral Letter, Vendor Scorecard) are what that looks like running on a hospital’s own hardware. Policy Navigator is often the first to go live — it carries the least clinical risk and earns trust fast.
The right first pilot is small, measurable, non-diagnostic, and documented. Pick one workflow. Give us thirty days.
Further reading
- AMA: AI scribes save 15,000 hours at The Permanente Medical Group
- UCLA randomized study of Nabla across 238 physicians
- Ambient AI scribes and burnout — quality-improvement study across six systems
- Usability, technical performance, and accuracy of AI scribes in primary care (JMIR, 2025)
- Beyond human ears: risks of AI scribes in clinical practice (npj Digital Medicine, 2025)
- Evaluating RAG variants for clinical decision support with secure on-prem deployment
- MCP-FHIR: an open-source framework for LLM-to-FHIR clinical assistants
- MedGemma — Google’s open medical model family
- vLLM quick-start recipe for Llama 3.3 70B on NVIDIA hardware
- Office of the Privacy Commissioner of Canada: PIPEDA and hospitals