Hardware Sizing for Private Healthcare AI

Pilot capex

~$5K

Single workstation with a professional GPU (RTX 6000 Ada or A6000) — enough to run quantized 70B models locally with Ollama or llama.cpp. The cost of admission to a real pilot.

Department capex

~$30K

2× A100 80GB on a single inference node — Llama 3.x 70B at fp16 with full vLLM throughput. Fits departmental concurrent load with headroom.

Hospital capex

~$120K

4× H100 SXM5 on a single inference node — ~3,300 TPS aggregate for 70B-class models per H100, ~24,000 TPS streamable per H100 SXM5 under ideal batching. The reference point cited in the 4-month-ROI literature.

Power per H100

700W

H100 SXM5 TDP. Plan for ~3.5 kW of conditioned, redundant power and ~12,000 BTU/hr of cooling per 4× H100 node. Often the binding constraint for retrofitting an existing hospital server room.

The three reference architectures

Most hospital private-AI deployments fall into one of three shapes. The architecture choice is driven by concurrent-user expectations and the number of distinct workflows the stack will run, not by which workflow comes first. A pilot stack and a department stack are different machines; trying to size one for both produces the worst-of-both-worlds compromise.

Architecture	Concurrent users	Workflows supported	Reference build	2026 capex
Single-clinician pilot	1	One workflow (typically ambient scribe pilot or document Q&A pilot)	1× RTX 6000 Ada (48 GB) or A6000 (48 GB), Threadripper / Xeon workstation, 128 GB RAM, 4 TB NVMe	~$5–8K
Department	5–25	Multi-workflow shared inference — scribe + document Q&A + discharge drafter	2× A100 80GB (or 1× H100 80GB), dual-socket server, 256–512 GB RAM, 8 TB NVMe, redundant PSU	~$30–50K
Hospital system	50–500+	Full reference stack — scribe, RAG search, discharge, handoff, internal knowledge, batch jobs	4× H100 SXM5, dual-socket server with NVLink, 1 TB RAM, 30 TB NVMe, dual 10 / 25 Gbps networking	~$120–180K

GPU choice, plainly

Five GPUs cover essentially every 2026 hospital private-AI workload. The trade-off is memory (for model size), TDP (for power and cooling), and tensor-engine support (for FP8 throughput).

PILOT

RTX 6000 Ada / A6000

48 GB GDDR6, 300 W TDP, PCIe form factor. Fits a workstation. Runs quantized 70B (Q4_K_M, ~40 GB) at single-stream rates of ~40 TPS. The right entry GPU for a pilot that does not need concurrent multi-user serving.

PILOT-PLUS

L40S

48 GB GDDR6, 350 W TDP. Better tensor-core throughput than RTX 6000 Ada at similar memory. Strong fit for a pilot that needs higher single-stream speed without jumping to A100 / H100.

DEPARTMENT

A100 80GB

80 GB HBM2e, 400 W TDP, SXM4 or PCIe form factor. Runs Llama 3.x 70B at fp16 (~140 GB → needs tensor-parallel 2). ~200–300 TPS for 70B at moderate concurrency. The workhorse departmental GPU.

HOSPITAL

H100 80GB SXM5

80 GB HBM3, 700 W TDP. Transformer Engine and FP8 deliver ~3× the throughput of A100 on LLM inference. ~3,300 TPS for 70B-class models per H100; ~24,000 TPS per H100 streamable under optimal batching.

FUTURE

B200 / GB200

NVIDIA Blackwell. Roughly 2–4× H100 inference throughput on the same envelope of power. Shipping in 2024–25, expected to be the H100-replacement reference architecture by 2027.

CPU FALLBACK

EPYC / Xeon

For audit-restrictive or air-gapped environments where a GPU is not procurable, llama.cpp + Q4_K_M on a modern EPYC or Xeon with AVX-512 / AMX delivers ~5–15 TPS for 70B. Slow but viable for batch / off-hours workflows.

Workflow-by-workflow sizing

Each of the WalledCare workflow categories has a distinct throughput profile. The size to plan for depends on which workflows the stack actually serves:

Workflow	Token budget per encounter	Latency tolerance	Concurrency pattern	Minimum sizing
AI Scribe	~3,000–6,000 tokens (transcript + draft)	Soft — clinician edits drafts	Peaks at clinic-day start / end	1× A100 (department) or 1× H100 (hospital)
Document Q&A	~2,000–4,000 tokens (retrieved context + answer)	Tight — interactive UI	Bursty — multiple concurrent queries	1× A100 + retrieval-layer GPU memory
Private Medical Search	~1,500–3,000 tokens (synthesized search results)	Tight — interactive search	Distributed across day	Shared with Document Q&A on same GPU
Discharge Summaries	~5,000–10,000 tokens (full chart context + summary)	Soft — batch generation	Spikes at unit discharge times	1× H100 (or share with scribe on A100)
Handoff Tools	~4,000–8,000 tokens (12-hour chart synthesis)	Soft — pre-shift batch	Predictable — shift-change spikes	Shared with discharge drafter
Batch (de-identification, indexing)	Variable — high-throughput	None — overnight	Off-hours bulk	Reuse pilot GPU off-hours

The departmental architecture is sized to serve ambient-scribe, document Q&A, and private medical search concurrently on shared A100s. The hospital-scale architecture adds enough headroom to serve discharge drafting and shift handoff during their peak windows without degrading scribe latency. Most pilots that fail on hardware ground up-sized one workflow without accounting for the others.

Power, cooling, and the server-room conversation

The GPU capex is often the smaller line item by the time facilities is involved. A 4× H100 node draws ~3.5 kW at peak — more than most hospital server rooms have allocated per rack in the existing power and cooling envelope. Three facility-level decisions need explicit planning:

checkConditioned, redundant power. Plan for ~1 kW per A100, ~700 W per H100, plus host overhead (CPU, RAM, disks, fans, networking). A 4× H100 node should be on a 30 A circuit minimum, with UPS-backed redundant PSU on the server. Many hospital server rooms have 15 A circuits per rack and need an electrician revisit.
checkCooling capacity. Roughly 12,000 BTU/hr per 4× H100 node at peak load. Existing 5-ton CRAC capacity handles two such nodes per rack; pre-existing density may force a separate aisle or row.
checkNoise. H100 SXM5 servers run loud — ~80 dB sustained under load. Not a problem in a data center; very much a problem if the deployment is in a clinical-adjacent server closet. Plan placement accordingly.

Storage and network

Model weights are large but predictable. A full library of Llama 3.x (8B + 70B + 405B), Mistral, Gemma 2/3, MedGemma 4B + 27B, Whisper large-v3, plus the typical embedding models occupies ~600 GB at fp16. Q4_K_M quantized variants quarter that. Plan 4 TB of NVMe for the pilot, 8 TB for the department, 30 TB for the hospital-scale build (model weights + audit logs + RAG index storage). RAG indexes (Qdrant / Milvus / OpenSearch) typically need 1.5–3 GB of disk per million vectors plus matching RAM for hot indexes.

Network expectations are modest in steady state and bursty during model loads. Inference is GPU-bound, not network-bound; the 10 Gbps internal fabric most hospital server racks already have is sufficient. The bursty load is model pulls — a 70B fp16 model is ~140 GB to first load, ~35 GB quantized. Pull from a local Hugging Face mirror, not the public internet, for both speed and compliance.

Software stack the hardware has to host

For sizing purposes, the software stack consumes a predictable footprint above the model weights:

checkInference engine: vLLM (Linux + CUDA, multi-GPU tensor parallel) for production. Ollama on workstations and pilots. llama.cpp on Apple Silicon / CPU. LocalAI as an OpenAI-compatible drop-in gateway when existing client code expects an OpenAI endpoint.
checkRetrieval layer: Qdrant (single binary, simplest ops) or Milvus (K8s-native, scale to billions) or OpenSearch (hybrid keyword + vector). Plan 32–128 GB RAM for hot indexes plus disk.
checkSpeech-to-text: Whisper large-v3 or large-v3-turbo (faster-whisper for CUDA, whisper.cpp for Apple). Small footprint compared with LLMs; can co-host on the same GPU as the LLM for low-throughput scribe pilots.
checkOrchestration: Haystack as the Python framework. Runs in a container alongside the inference engine; minimal additional hardware footprint.
checkGateway: A thin API gateway (Kong, Traefik, or a custom thin layer) in front of vLLM for SSO / mTLS / per-team quotas / prompt logging. Hosted on the same node or a small adjacent VM.

2026 reference build sheet

The exact bills-of-materials Moneli Automation has scoped recently. Treat as a starting point; specific quotes vary with VAR and date.

Architecture	Component	Spec	Notes
Pilot	GPU	1× RTX 6000 Ada 48 GB	Workstation-class, 300 W
	CPU	AMD Threadripper Pro 7945WX (12c) or Xeon W7-2475X	16 PCIe Gen5 lanes for the GPU
	RAM	128 GB DDR5 ECC	Buffer for inference plus retrieval layer
	Storage	4 TB NVMe Gen4	Model weights + scratch
	Total	~$5,500–$8,000	Excludes facility power upgrade
Department	GPU	2× A100 80GB SXM4 (NVLink)	Tensor-parallel 70B fp16
	Host	Dual EPYC 9354 or Xeon Platinum 8462Y+	Server-class, dual-socket
	RAM	512 GB DDR5 ECC	Headroom for retrieval and orchestration on same node
	Storage	8 TB NVMe RAID 1 + 32 TB SAS for audit log retention	Audit log volume sized for 7-year retention
	Total	~$30,000–$50,000	Plus rack power upgrade if existing rack capacity insufficient
Hospital	GPU	4× H100 80GB SXM5 (NVLink)	4-way NVLink interconnect critical for tensor-parallel
	Host	Dual EPYC 9654 or Xeon Platinum 8568Y+	Top-tier dual-socket; PCIe Gen5
	RAM	1 TB DDR5 ECC	Includes RAG cache and Whisper inference
	Storage	30 TB NVMe RAID 10 + 100 TB SAS	Larger model library, RAG indexes, longer audit retention
	Total	~$120,000–$180,000	Includes UPS + cooling upgrade if rack density requires

What this does not capture

Three categories of cost that hide in the architecture diagram and should be in the procurement budget:

closeFacility upgrades. Power, cooling, structured cabling, room access controls. Often 10–25% of the capex line item for existing-server-room retrofits; can exceed the GPU cost for closet-style deployments.
closeOperational labor. A working private-AI stack needs ~0.25–1.0 FTE of ops support depending on scale — model approvals, capacity planning, audit-log review, vendor coordination. Most pilots underestimate this.
closeRefresh cycle. GPU depreciation runs faster than typical hospital hardware; plan a 4–5 year refresh cycle for high-end GPUs, not the 7–10 year cycle most healthcare IT budgets use for general-purpose servers.

Where this fits in the WalledCare directory

This sizing guide is the IT-facing companion to the on-premise clinical assistants reference architecture and the ROI calculator, which both reference the $120K hospital-scale capex number anchored here. The hardware sizing also pairs with the open-source profiles for the software stack each architecture hosts, and the Canadian compliance hub for the residency angle that makes on-prem the right answer for many buyers.

send Request a WalledCare pilot menu_book Back to guides grid_view Back to directory