OPEN SOURCE · PRODUCTION INFERENCE · HOSPITAL STACK

vLLM

The open-source LLM serving engine that introduced PagedAttention — a virtual-memory KV cache that drops memory waste from ~60–80% to under 4% and delivers ~22× throughput over vanilla Hugging Face Transformers. Started at UC Berkeley, now the production inference layer behind much of the open-weight ecosystem. The right answer when "private hospital LLM" graduates from pilot to actual concurrent clinical or administrative workload.

Llama 2 70B throughput
~2,200 TPS

vLLM on 4× A100 GPUs at 256 concurrent users — ~2.3× Hugging Face TGI, ~3.1× vanilla PyTorch serving. Production-grade output for a single hospital deployment.

Memory waste cut
< 4%

PagedAttention models the KV cache as virtual-memory pages: fixed-size blocks allocated on demand. Cuts waste from ~60–80% in naive serving to under 4%, which is what unlocks the throughput gain.

GPU utilization
85–92%

Continuous batching plus PagedAttention keeps GPU utilization in the 85–92% range under concurrent load — the difference between "expensive idle hardware" and "the capex actually paid back."

vs Ollama
~5–20× at scale

At 8–10 concurrent users vLLM delivers ~800 TPS on a single A100 where Ollama drops to ~150 TPS. Single-user latency favors Ollama; concurrent load favors vLLM by a wide margin.

What vLLM actually is

vLLM is an open-source Python library and HTTP server for high-throughput LLM inference, originally published by the UC Berkeley Sky Computing Lab in 2023 alongside the PagedAttention paper. It loads any Hugging Face-compatible LLM (Llama, Mistral, Gemma, Qwen, DeepSeek, and many others), serves it over an OpenAI-compatible HTTP API, and uses two techniques that are now standard practice in production LLM serving: PagedAttention for KV-cache memory management, and continuous batching for filling GPU cycles with incoming requests in flight rather than waiting for a batch to fill.

For a hospital that has graduated past Ollama-class pilots, vLLM is usually the right next layer. It is the inference engine behind countless production deployments — from clinical NLP startups to large-language-model API providers — and the architecture choices are conservative enough to underwrite serious operational use: it supports tensor parallel across multiple GPUs (typical configurations: 2× A100 for a 70B model, 4× A100 for higher concurrency, 4× H100 for the budget-constrained equivalent at much higher TPS).

What vLLM is not: an end-user product. It is the engine. The full hospital stack still needs an OpenAI-compatible gateway (vLLM itself, or LocalAI in front of it), a retrieval layer (Qdrant / OpenSearch / Milvus), orchestration (Haystack or equivalent), and the governance plus evaluation surface around it. vLLM solves the hard performance problem; the operating model is still the operator's job.

Deployment posture

vLLM runs on Linux with CUDA, supports tensor parallelism across multiple GPUs in a single node, and pipeline parallelism across nodes. The HTTP server exposes an OpenAI-compatible API on a configurable port; the typical hospital pattern is to put it behind an internal API gateway, terminate TLS at the gateway, and apply SSO / mTLS / per-team quotas there rather than in vLLM itself. The runtime makes no outbound calls — model weights live on local disk, all inference happens on the served GPU.

SURFACE
OpenAI-compatible HTTP

/v1/chat/completions, /v1/completions, /v1/embeddings. Most existing client code works without modification — the same drop-in compatibility most teams already build around.

HARDWARE
Multi-GPU and multi-node

Tensor parallel across GPUs in a node, pipeline parallel across nodes. Typical sizing: 70B fp16 on 2× A100 80GB (tensor parallel 2); 70B with 4-bit quantization on a single A100 80GB; 405B on 8× H100.

PERFORMANCE
PagedAttention + continuous batching

KV cache in fixed-size virtual-memory blocks; new requests fill GPU cycles continuously rather than waiting for a batch boundary. Net: high throughput at high concurrency with predictable tail latency.

SECURITY
Operator-controlled boundary

No telemetry, no model-pull dependency once weights are local. Authentication / authorization / rate-limiting belong at the gateway in front of vLLM; the engine itself is intentionally minimal on identity.

Healthcare fit

vLLM is the right serving layer when a hospital workflow has moved past pilot status: multiple teams using a private model concurrently, latency SLOs to defend, a clinical or administrative workflow that has to keep working when ten people hit the API at the same time. Public examples include ambient-style draft generation on internal hardware, clinical knowledge retrieval over policy SOPs at department scale, claims-data analysis pipelines, and large-batch de-identification jobs over historical records.

  • checkGood fit: production hospital serving of Llama 3.1 / 3.3 (8B / 70B / 405B), Mistral, Qwen, Gemma 2/3, MedGemma, or fine-tuned variants — paired with a retrieval layer and an orchestration layer.
  • checkGood fit: multi-team internal inference platforms with shared GPU capacity and per-team quotas enforced at a gateway in front.
  • checkGood fit: high-volume batch jobs (de-identification, document classification, large RAG indexing) where throughput per GPU-hour is the deciding metric.
  • closeBad fit: single-clinician laptop pilots where the boundary is "one workstation, one user." Use Ollama or llama.cpp instead.
  • closeBad fit: Apple Silicon, Windows-native, or CPU-only environments. vLLM targets Linux + CUDA.

Privacy and governance

vLLM's privacy posture follows the standard production-engine pattern: the runtime itself does not exfiltrate data, but it also does not provide governance. No telemetry, no outbound calls, no model-pull dependency once weights are on disk — which is exactly what makes the engine suitable inside HIPAA / PIPEDA / PHIPA / Quebec Law 25 boundaries. Authentication, authorization, prompt logging, response logging, content filtering, and audit trails belong at the gateway in front of vLLM, not in vLLM itself.

Moneli Automation's typical pattern for a production hospital workflow is vLLM as the inference engine, a thin API gateway in front (SSO, mTLS, per-team quotas, prompt/response logging), a retrieval layer (Qdrant or OpenSearch) for grounded answers, and a clinician-facing surface that enforces review and signature. The engine is performant by design; the governance and evaluation wrapper is the work that earns clinical trust.

Strengths and limitations

STRENGTHS
Why hospital stacks pick it

Industry-standard high-throughput serving — the engine behind most production private-LLM deployments since 2024. PagedAttention + continuous batching deliver the GPU utilization (~85–92%) that justifies on-prem capex. OpenAI-compatible API; existing client code works. Strong tensor- and pipeline-parallel support across multi-GPU and multi-node configurations. Active upstream development; new models supported within days of upstream release.

LIMITATIONS
Where it isn't the whole answer

Linux + CUDA only — no Apple Silicon, no Windows-native, no AMD ROCm in production. Higher operational complexity than Ollama: requires capacity planning, observability, and gateway-side identity. No native multi-tenancy or per-user quotas in the engine itself. No built-in evaluation, prompt-management, or content-filtering surface. Pure inference; bring your own retrieval, orchestration, and governance.

Where vLLM fits in a hospital stack

LayerWhat vLLM contributesWhat still has to be solved
Production inferenceHigh-throughput, high-utilization serving of open-weight models on hospital-owned GPUs.Capacity planning, observability, alerting, on-call rotation.
API gatewayOpenAI-compatible endpoint, drop-in for existing clients.SSO, mTLS, per-team quotas, prompt/response logging, content filtering.
Retrieval / RAGNone — pure inference; bring Qdrant / OpenSearch / Milvus alongside.Embedding model choice, chunking, hybrid retrieval design, citation coverage.
OrchestrationNone — pair with Haystack or equivalent framework for multi-step pipelines.Workflow design, error handling, evaluation harness, regression testing.
Governance / auditNone at the engine layer — sits behind gateway-side logging.Model approval policy, data classes, prompt audit, review workflow, audit retention.

vLLM is the production inference engine. It is the right answer when concurrent throughput, GPU utilization, and scale matter more than the single-binary simplicity of a pilot runtime. Pair it with Ollama as the dev-and-pilot lane and LocalAI / Haystack / Qdrant for the surrounding stack.

Quick facts

ProjectvLLM (open-source, Apache 2.0). Originally published by UC Berkeley Sky Computing Lab (2023). GitHub: vllm-project/vllm.
TypeProduction LLM inference engine with PagedAttention KV-cache management and continuous batching.
PlatformsLinux + CUDA. Multi-GPU tensor parallel, multi-node pipeline parallel. Not supported: Apple Silicon, Windows-native, AMD ROCm (production).
APIOpenAI-compatible HTTP server: /v1/chat/completions, /v1/completions, /v1/embeddings.
ModelsLlama 3.1 / 3.3 / 4, Mistral / Mixtral, Gemma 2 / 3, MedGemma, Qwen, DeepSeek, Phi, fine-tuned variants. Any Hugging Face-compatible LLM.
Throughput class~2,200 TPS on Llama 2 70B / 4× A100 / 256 concurrent users. ~700 TPS on 4-bit-quantized 70B on a single A100 (LMDeploy comparison). 85–92% GPU utilization under load.
Typical hospital sizingDepartment scale: 2× A100 80GB for 70B fp16, or 1× A100 with 4-bit quantization. Hospital scale: 4× H100. System scale: 8× H100.
Websitedocs.vllm.ai · GitHub: github.com/vllm-project/vllm

Use vLLM as the production inference engine, behind a gateway

vLLM is the right serving layer once a private hospital AI workflow has earned production status. Moneli Automation typically deploys it behind a thin API gateway that owns identity, quotas, and logging, with vLLM doing the work the engine is best at — high-throughput, high-utilization inference on hospital-owned GPUs.

send Request a WalledCare pilot arrow_back All open-source profiles

Further reading