IT REFERENCE · HARDWARE SIZING · 8 min

Hardware Sizing for Private Healthcare AI

Three reference architectures sized for the three workflows most hospital private-AI pilots actually deploy — single-clinician pilot, departmental multi-workflow, and hospital-system scale. With concrete GPU choices, throughput expectations, power and cooling budgets, storage requirements, and the network assumptions Moneli Automation uses when scoping new deployments in 2026.

Pilot capex
~$5K

Single workstation with a professional GPU (RTX 6000 Ada or A6000) — enough to run quantized 70B models locally with Ollama or llama.cpp. The cost of admission to a real pilot.

Department capex
~$30K

2× A100 80GB on a single inference node — Llama 3.x 70B at fp16 with full vLLM throughput. Fits departmental concurrent load with headroom.

Hospital capex
~$120K

4× H100 SXM5 on a single inference node — ~3,300 TPS aggregate for 70B-class models per H100, ~24,000 TPS streamable per H100 SXM5 under ideal batching. The reference point cited in the 4-month-ROI literature.

Power per H100
700W

H100 SXM5 TDP. Plan for ~3.5 kW of conditioned, redundant power and ~12,000 BTU/hr of cooling per 4× H100 node. Often the binding constraint for retrofitting an existing hospital server room.

The three reference architectures

Most hospital private-AI deployments fall into one of three shapes. The architecture choice is driven by concurrent-user expectations and the number of distinct workflows the stack will run, not by which workflow comes first. A pilot stack and a department stack are different machines; trying to size one for both produces the worst-of-both-worlds compromise.

ArchitectureConcurrent usersWorkflows supportedReference build2026 capex
Single-clinician pilot 1 One workflow (typically ambient scribe pilot or document Q&A pilot) 1× RTX 6000 Ada (48 GB) or A6000 (48 GB), Threadripper / Xeon workstation, 128 GB RAM, 4 TB NVMe ~$5–8K
Department 5–25 Multi-workflow shared inference — scribe + document Q&A + discharge drafter 2× A100 80GB (or 1× H100 80GB), dual-socket server, 256–512 GB RAM, 8 TB NVMe, redundant PSU ~$30–50K
Hospital system 50–500+ Full reference stack — scribe, RAG search, discharge, handoff, internal knowledge, batch jobs 4× H100 SXM5, dual-socket server with NVLink, 1 TB RAM, 30 TB NVMe, dual 10 / 25 Gbps networking ~$120–180K

GPU choice, plainly

Five GPUs cover essentially every 2026 hospital private-AI workload. The trade-off is memory (for model size), TDP (for power and cooling), and tensor-engine support (for FP8 throughput).

PILOT
RTX 6000 Ada / A6000

48 GB GDDR6, 300 W TDP, PCIe form factor. Fits a workstation. Runs quantized 70B (Q4_K_M, ~40 GB) at single-stream rates of ~40 TPS. The right entry GPU for a pilot that does not need concurrent multi-user serving.

PILOT-PLUS
L40S

48 GB GDDR6, 350 W TDP. Better tensor-core throughput than RTX 6000 Ada at similar memory. Strong fit for a pilot that needs higher single-stream speed without jumping to A100 / H100.

DEPARTMENT
A100 80GB

80 GB HBM2e, 400 W TDP, SXM4 or PCIe form factor. Runs Llama 3.x 70B at fp16 (~140 GB → needs tensor-parallel 2). ~200–300 TPS for 70B at moderate concurrency. The workhorse departmental GPU.

HOSPITAL
H100 80GB SXM5

80 GB HBM3, 700 W TDP. Transformer Engine and FP8 deliver ~3× the throughput of A100 on LLM inference. ~3,300 TPS for 70B-class models per H100; ~24,000 TPS per H100 streamable under optimal batching.

FUTURE
B200 / GB200

NVIDIA Blackwell. Roughly 2–4× H100 inference throughput on the same envelope of power. Shipping in 2024–25, expected to be the H100-replacement reference architecture by 2027.

CPU FALLBACK
EPYC / Xeon

For audit-restrictive or air-gapped environments where a GPU is not procurable, llama.cpp + Q4_K_M on a modern EPYC or Xeon with AVX-512 / AMX delivers ~5–15 TPS for 70B. Slow but viable for batch / off-hours workflows.

Workflow-by-workflow sizing

Each of the WalledCare workflow categories has a distinct throughput profile. The size to plan for depends on which workflows the stack actually serves:

WorkflowToken budget per encounterLatency toleranceConcurrency patternMinimum sizing
AI Scribe ~3,000–6,000 tokens (transcript + draft) Soft — clinician edits drafts Peaks at clinic-day start / end 1× A100 (department) or 1× H100 (hospital)
Document Q&A ~2,000–4,000 tokens (retrieved context + answer) Tight — interactive UI Bursty — multiple concurrent queries 1× A100 + retrieval-layer GPU memory
Private Medical Search ~1,500–3,000 tokens (synthesized search results) Tight — interactive search Distributed across day Shared with Document Q&A on same GPU
Discharge Summaries ~5,000–10,000 tokens (full chart context + summary) Soft — batch generation Spikes at unit discharge times 1× H100 (or share with scribe on A100)
Handoff Tools ~4,000–8,000 tokens (12-hour chart synthesis) Soft — pre-shift batch Predictable — shift-change spikes Shared with discharge drafter
Batch (de-identification, indexing) Variable — high-throughput None — overnight Off-hours bulk Reuse pilot GPU off-hours

The departmental architecture is sized to serve ambient-scribe, document Q&A, and private medical search concurrently on shared A100s. The hospital-scale architecture adds enough headroom to serve discharge drafting and shift handoff during their peak windows without degrading scribe latency. Most pilots that fail on hardware ground up-sized one workflow without accounting for the others.

Power, cooling, and the server-room conversation

The GPU capex is often the smaller line item by the time facilities is involved. A 4× H100 node draws ~3.5 kW at peak — more than most hospital server rooms have allocated per rack in the existing power and cooling envelope. Three facility-level decisions need explicit planning:

  • checkConditioned, redundant power. Plan for ~1 kW per A100, ~700 W per H100, plus host overhead (CPU, RAM, disks, fans, networking). A 4× H100 node should be on a 30 A circuit minimum, with UPS-backed redundant PSU on the server. Many hospital server rooms have 15 A circuits per rack and need an electrician revisit.
  • checkCooling capacity. Roughly 12,000 BTU/hr per 4× H100 node at peak load. Existing 5-ton CRAC capacity handles two such nodes per rack; pre-existing density may force a separate aisle or row.
  • checkNoise. H100 SXM5 servers run loud — ~80 dB sustained under load. Not a problem in a data center; very much a problem if the deployment is in a clinical-adjacent server closet. Plan placement accordingly.

Storage and network

Model weights are large but predictable. A full library of Llama 3.x (8B + 70B + 405B), Mistral, Gemma 2/3, MedGemma 4B + 27B, Whisper large-v3, plus the typical embedding models occupies ~600 GB at fp16. Q4_K_M quantized variants quarter that. Plan 4 TB of NVMe for the pilot, 8 TB for the department, 30 TB for the hospital-scale build (model weights + audit logs + RAG index storage). RAG indexes (Qdrant / Milvus / OpenSearch) typically need 1.5–3 GB of disk per million vectors plus matching RAM for hot indexes.

Network expectations are modest in steady state and bursty during model loads. Inference is GPU-bound, not network-bound; the 10 Gbps internal fabric most hospital server racks already have is sufficient. The bursty load is model pulls — a 70B fp16 model is ~140 GB to first load, ~35 GB quantized. Pull from a local Hugging Face mirror, not the public internet, for both speed and compliance.

Software stack the hardware has to host

For sizing purposes, the software stack consumes a predictable footprint above the model weights:

  • checkInference engine: vLLM (Linux + CUDA, multi-GPU tensor parallel) for production. Ollama on workstations and pilots. llama.cpp on Apple Silicon / CPU. LocalAI as an OpenAI-compatible drop-in gateway when existing client code expects an OpenAI endpoint.
  • checkRetrieval layer: Qdrant (single binary, simplest ops) or Milvus (K8s-native, scale to billions) or OpenSearch (hybrid keyword + vector). Plan 32–128 GB RAM for hot indexes plus disk.
  • checkSpeech-to-text: Whisper large-v3 or large-v3-turbo (faster-whisper for CUDA, whisper.cpp for Apple). Small footprint compared with LLMs; can co-host on the same GPU as the LLM for low-throughput scribe pilots.
  • checkOrchestration: Haystack as the Python framework. Runs in a container alongside the inference engine; minimal additional hardware footprint.
  • checkGateway: A thin API gateway (Kong, Traefik, or a custom thin layer) in front of vLLM for SSO / mTLS / per-team quotas / prompt logging. Hosted on the same node or a small adjacent VM.

2026 reference build sheet

The exact bills-of-materials Moneli Automation has scoped recently. Treat as a starting point; specific quotes vary with VAR and date.

ArchitectureComponentSpecNotes
PilotGPU1× RTX 6000 Ada 48 GBWorkstation-class, 300 W
CPUAMD Threadripper Pro 7945WX (12c) or Xeon W7-2475X16 PCIe Gen5 lanes for the GPU
RAM128 GB DDR5 ECCBuffer for inference plus retrieval layer
Storage4 TB NVMe Gen4Model weights + scratch
Total~$5,500–$8,000Excludes facility power upgrade
DepartmentGPU2× A100 80GB SXM4 (NVLink)Tensor-parallel 70B fp16
HostDual EPYC 9354 or Xeon Platinum 8462Y+Server-class, dual-socket
RAM512 GB DDR5 ECCHeadroom for retrieval and orchestration on same node
Storage8 TB NVMe RAID 1 + 32 TB SAS for audit log retentionAudit log volume sized for 7-year retention
Total~$30,000–$50,000Plus rack power upgrade if existing rack capacity insufficient
HospitalGPU4× H100 80GB SXM5 (NVLink)4-way NVLink interconnect critical for tensor-parallel
HostDual EPYC 9654 or Xeon Platinum 8568Y+Top-tier dual-socket; PCIe Gen5
RAM1 TB DDR5 ECCIncludes RAG cache and Whisper inference
Storage30 TB NVMe RAID 10 + 100 TB SASLarger model library, RAG indexes, longer audit retention
Total~$120,000–$180,000Includes UPS + cooling upgrade if rack density requires

What this does not capture

Three categories of cost that hide in the architecture diagram and should be in the procurement budget:

  • closeFacility upgrades. Power, cooling, structured cabling, room access controls. Often 10–25% of the capex line item for existing-server-room retrofits; can exceed the GPU cost for closet-style deployments.
  • closeOperational labor. A working private-AI stack needs ~0.25–1.0 FTE of ops support depending on scale — model approvals, capacity planning, audit-log review, vendor coordination. Most pilots underestimate this.
  • closeRefresh cycle. GPU depreciation runs faster than typical hospital hardware; plan a 4–5 year refresh cycle for high-end GPUs, not the 7–10 year cycle most healthcare IT budgets use for general-purpose servers.

Where this fits in the WalledCare directory

This sizing guide is the IT-facing companion to the on-premise clinical assistants reference architecture and the ROI calculator, which both reference the $120K hospital-scale capex number anchored here. The hardware sizing also pairs with the open-source profiles for the software stack each architecture hosts, and the Canadian compliance hub for the residency angle that makes on-prem the right answer for many buyers.

send Request a WalledCare pilot menu_book Back to guides grid_view Back to directory

Further reading