IT REFERENCE · HARDWARE SIZING · 8 min
Hardware Sizing for Private Healthcare AI
Three reference architectures sized for the three workflows most hospital private-AI pilots actually deploy — single-clinician pilot, departmental multi-workflow, and hospital-system scale. With concrete GPU choices, throughput expectations, power and cooling budgets, storage requirements, and the network assumptions Moneli Automation uses when scoping new deployments in 2026.
Single workstation with a professional GPU (RTX 6000 Ada or A6000) — enough to run quantized 70B models locally with Ollama or llama.cpp. The cost of admission to a real pilot.
2× A100 80GB on a single inference node — Llama 3.x 70B at fp16 with full vLLM throughput. Fits departmental concurrent load with headroom.
4× H100 SXM5 on a single inference node — ~3,300 TPS aggregate for 70B-class models per H100, ~24,000 TPS streamable per H100 SXM5 under ideal batching. The reference point cited in the 4-month-ROI literature.
H100 SXM5 TDP. Plan for ~3.5 kW of conditioned, redundant power and ~12,000 BTU/hr of cooling per 4× H100 node. Often the binding constraint for retrofitting an existing hospital server room.
The three reference architectures
Most hospital private-AI deployments fall into one of three shapes. The architecture choice is driven by concurrent-user expectations and the number of distinct workflows the stack will run, not by which workflow comes first. A pilot stack and a department stack are different machines; trying to size one for both produces the worst-of-both-worlds compromise.
| Architecture | Concurrent users | Workflows supported | Reference build | 2026 capex |
|---|---|---|---|---|
| Single-clinician pilot | 1 | One workflow (typically ambient scribe pilot or document Q&A pilot) | 1× RTX 6000 Ada (48 GB) or A6000 (48 GB), Threadripper / Xeon workstation, 128 GB RAM, 4 TB NVMe | ~$5–8K |
| Department | 5–25 | Multi-workflow shared inference — scribe + document Q&A + discharge drafter | 2× A100 80GB (or 1× H100 80GB), dual-socket server, 256–512 GB RAM, 8 TB NVMe, redundant PSU | ~$30–50K |
| Hospital system | 50–500+ | Full reference stack — scribe, RAG search, discharge, handoff, internal knowledge, batch jobs | 4× H100 SXM5, dual-socket server with NVLink, 1 TB RAM, 30 TB NVMe, dual 10 / 25 Gbps networking | ~$120–180K |
GPU choice, plainly
Five GPUs cover essentially every 2026 hospital private-AI workload. The trade-off is memory (for model size), TDP (for power and cooling), and tensor-engine support (for FP8 throughput).
48 GB GDDR6, 300 W TDP, PCIe form factor. Fits a workstation. Runs quantized 70B (Q4_K_M, ~40 GB) at single-stream rates of ~40 TPS. The right entry GPU for a pilot that does not need concurrent multi-user serving.
48 GB GDDR6, 350 W TDP. Better tensor-core throughput than RTX 6000 Ada at similar memory. Strong fit for a pilot that needs higher single-stream speed without jumping to A100 / H100.
80 GB HBM2e, 400 W TDP, SXM4 or PCIe form factor. Runs Llama 3.x 70B at fp16 (~140 GB → needs tensor-parallel 2). ~200–300 TPS for 70B at moderate concurrency. The workhorse departmental GPU.
80 GB HBM3, 700 W TDP. Transformer Engine and FP8 deliver ~3× the throughput of A100 on LLM inference. ~3,300 TPS for 70B-class models per H100; ~24,000 TPS per H100 streamable under optimal batching.
NVIDIA Blackwell. Roughly 2–4× H100 inference throughput on the same envelope of power. Shipping in 2024–25, expected to be the H100-replacement reference architecture by 2027.
For audit-restrictive or air-gapped environments where a GPU is not procurable, llama.cpp + Q4_K_M on a modern EPYC or Xeon with AVX-512 / AMX delivers ~5–15 TPS for 70B. Slow but viable for batch / off-hours workflows.
Workflow-by-workflow sizing
Each of the WalledCare workflow categories has a distinct throughput profile. The size to plan for depends on which workflows the stack actually serves:
| Workflow | Token budget per encounter | Latency tolerance | Concurrency pattern | Minimum sizing |
|---|---|---|---|---|
| AI Scribe | ~3,000–6,000 tokens (transcript + draft) | Soft — clinician edits drafts | Peaks at clinic-day start / end | 1× A100 (department) or 1× H100 (hospital) |
| Document Q&A | ~2,000–4,000 tokens (retrieved context + answer) | Tight — interactive UI | Bursty — multiple concurrent queries | 1× A100 + retrieval-layer GPU memory |
| Private Medical Search | ~1,500–3,000 tokens (synthesized search results) | Tight — interactive search | Distributed across day | Shared with Document Q&A on same GPU |
| Discharge Summaries | ~5,000–10,000 tokens (full chart context + summary) | Soft — batch generation | Spikes at unit discharge times | 1× H100 (or share with scribe on A100) |
| Handoff Tools | ~4,000–8,000 tokens (12-hour chart synthesis) | Soft — pre-shift batch | Predictable — shift-change spikes | Shared with discharge drafter |
| Batch (de-identification, indexing) | Variable — high-throughput | None — overnight | Off-hours bulk | Reuse pilot GPU off-hours |
The departmental architecture is sized to serve ambient-scribe, document Q&A, and private medical search concurrently on shared A100s. The hospital-scale architecture adds enough headroom to serve discharge drafting and shift handoff during their peak windows without degrading scribe latency. Most pilots that fail on hardware ground up-sized one workflow without accounting for the others.
Power, cooling, and the server-room conversation
The GPU capex is often the smaller line item by the time facilities is involved. A 4× H100 node draws ~3.5 kW at peak — more than most hospital server rooms have allocated per rack in the existing power and cooling envelope. Three facility-level decisions need explicit planning:
- checkConditioned, redundant power. Plan for ~1 kW per A100, ~700 W per H100, plus host overhead (CPU, RAM, disks, fans, networking). A 4× H100 node should be on a 30 A circuit minimum, with UPS-backed redundant PSU on the server. Many hospital server rooms have 15 A circuits per rack and need an electrician revisit.
- checkCooling capacity. Roughly 12,000 BTU/hr per 4× H100 node at peak load. Existing 5-ton CRAC capacity handles two such nodes per rack; pre-existing density may force a separate aisle or row.
- checkNoise. H100 SXM5 servers run loud — ~80 dB sustained under load. Not a problem in a data center; very much a problem if the deployment is in a clinical-adjacent server closet. Plan placement accordingly.
Storage and network
Model weights are large but predictable. A full library of Llama 3.x (8B + 70B + 405B), Mistral, Gemma 2/3, MedGemma 4B + 27B, Whisper large-v3, plus the typical embedding models occupies ~600 GB at fp16. Q4_K_M quantized variants quarter that. Plan 4 TB of NVMe for the pilot, 8 TB for the department, 30 TB for the hospital-scale build (model weights + audit logs + RAG index storage). RAG indexes (Qdrant / Milvus / OpenSearch) typically need 1.5–3 GB of disk per million vectors plus matching RAM for hot indexes.
Network expectations are modest in steady state and bursty during model loads. Inference is GPU-bound, not network-bound; the 10 Gbps internal fabric most hospital server racks already have is sufficient. The bursty load is model pulls — a 70B fp16 model is ~140 GB to first load, ~35 GB quantized. Pull from a local Hugging Face mirror, not the public internet, for both speed and compliance.
Software stack the hardware has to host
For sizing purposes, the software stack consumes a predictable footprint above the model weights:
- checkInference engine: vLLM (Linux + CUDA, multi-GPU tensor parallel) for production. Ollama on workstations and pilots. llama.cpp on Apple Silicon / CPU. LocalAI as an OpenAI-compatible drop-in gateway when existing client code expects an OpenAI endpoint.
- checkRetrieval layer: Qdrant (single binary, simplest ops) or Milvus (K8s-native, scale to billions) or OpenSearch (hybrid keyword + vector). Plan 32–128 GB RAM for hot indexes plus disk.
- checkSpeech-to-text: Whisper large-v3 or large-v3-turbo (faster-whisper for CUDA, whisper.cpp for Apple). Small footprint compared with LLMs; can co-host on the same GPU as the LLM for low-throughput scribe pilots.
- checkOrchestration: Haystack as the Python framework. Runs in a container alongside the inference engine; minimal additional hardware footprint.
- checkGateway: A thin API gateway (Kong, Traefik, or a custom thin layer) in front of vLLM for SSO / mTLS / per-team quotas / prompt logging. Hosted on the same node or a small adjacent VM.
2026 reference build sheet
The exact bills-of-materials Moneli Automation has scoped recently. Treat as a starting point; specific quotes vary with VAR and date.
| Architecture | Component | Spec | Notes |
|---|---|---|---|
| Pilot | GPU | 1× RTX 6000 Ada 48 GB | Workstation-class, 300 W |
| CPU | AMD Threadripper Pro 7945WX (12c) or Xeon W7-2475X | 16 PCIe Gen5 lanes for the GPU | |
| RAM | 128 GB DDR5 ECC | Buffer for inference plus retrieval layer | |
| Storage | 4 TB NVMe Gen4 | Model weights + scratch | |
| Total | ~$5,500–$8,000 | Excludes facility power upgrade | |
| Department | GPU | 2× A100 80GB SXM4 (NVLink) | Tensor-parallel 70B fp16 |
| Host | Dual EPYC 9354 or Xeon Platinum 8462Y+ | Server-class, dual-socket | |
| RAM | 512 GB DDR5 ECC | Headroom for retrieval and orchestration on same node | |
| Storage | 8 TB NVMe RAID 1 + 32 TB SAS for audit log retention | Audit log volume sized for 7-year retention | |
| Total | ~$30,000–$50,000 | Plus rack power upgrade if existing rack capacity insufficient | |
| Hospital | GPU | 4× H100 80GB SXM5 (NVLink) | 4-way NVLink interconnect critical for tensor-parallel |
| Host | Dual EPYC 9654 or Xeon Platinum 8568Y+ | Top-tier dual-socket; PCIe Gen5 | |
| RAM | 1 TB DDR5 ECC | Includes RAG cache and Whisper inference | |
| Storage | 30 TB NVMe RAID 10 + 100 TB SAS | Larger model library, RAG indexes, longer audit retention | |
| Total | ~$120,000–$180,000 | Includes UPS + cooling upgrade if rack density requires |
What this does not capture
Three categories of cost that hide in the architecture diagram and should be in the procurement budget:
- closeFacility upgrades. Power, cooling, structured cabling, room access controls. Often 10–25% of the capex line item for existing-server-room retrofits; can exceed the GPU cost for closet-style deployments.
- closeOperational labor. A working private-AI stack needs ~0.25–1.0 FTE of ops support depending on scale — model approvals, capacity planning, audit-log review, vendor coordination. Most pilots underestimate this.
- closeRefresh cycle. GPU depreciation runs faster than typical hospital hardware; plan a 4–5 year refresh cycle for high-end GPUs, not the 7–10 year cycle most healthcare IT budgets use for general-purpose servers.
Where this fits in the WalledCare directory
This sizing guide is the IT-facing companion to the on-premise clinical assistants reference architecture and the ROI calculator, which both reference the $120K hospital-scale capex number anchored here. The hardware sizing also pairs with the open-source profiles for the software stack each architecture hosts, and the Canadian compliance hub for the residency angle that makes on-prem the right answer for many buyers.
send Request a WalledCare pilot menu_book Back to guides grid_view Back to directory
Further reading
- On-Premise Clinical Assistants — reference architecture
- 30-Day Private AI Pilot playbook
- AI Scribe ROI Calculator
- vLLM profile — production inference engine
- Ollama profile — workstation pilot runtime
- Qdrant profile — retrieval layer for the department / hospital builds
- vLLM distributed serving documentation
- NVIDIA H100 product page
- NVIDIA A100 product page