Ollama — Local Model Runtime for Private Healthcare AI Pilots

Model library

100+

Curated, signed open-weight catalog including Llama 3.x, Mistral, Gemma 2/3, MedGemma, Qwen, gpt-oss, DeepSeek, Phi, Code Llama, embedding models, vision models, and most major releases within days of upstream.

Default REST port

11434

HTTP API on localhost:11434 with an OpenAI-compatible endpoint at /v1/chat/completions. Drop-in for any client that already talks to an LLM API — no SDK switch needed.

Data egress

Zero

No telemetry, no model-pull dependency once the model is on disk, and air-gapped operation supported. The data-handling default that makes the runtime usable in a HIPAA / PIPEDA / PHIPA environment.

Throughput class

~40–150 TPS

Single-stream token-per-second range for ~7B–13B models on a single consumer or workstation GPU. Adequate for pilot demos and small-team usage; vLLM is the order-of-magnitude-faster path for production-serving.

What Ollama actually is

Ollama is a local LLM runtime — written in Go on top of llama.cpp — that wraps model download, format conversion, GPU/CPU dispatch, and an HTTP API into a single binary the operator can run on a workstation, a server, or a Mac. The healthcare-relevant property is that the runtime makes no outbound calls after the model is on disk: the data path is "browser or app → localhost:11434 → GPU → response," with nothing routed to a vendor cloud. That is the configuration that lets a hospital safely test a private model on PHI-adjacent data without first standing up a production serving stack.

The model catalog is the second buyer-relevant property. Ollama publishes a signed library with Llama 3.1 / 3.3 (8B / 70B / 405B), Mistral (7B / Mixtral 8x7B / 8x22B), Gemma 2 / 3 (2B / 9B / 27B), MedGemma (4B and 27B, Google's medical-tuned Gemma), Qwen 2.5, DeepSeek, gpt-oss, Code Llama, embedding models (nomic-embed-text, bge-large), and vision models. Pulls are reproducible and signed. For Canadian / Australian healthcare buyers, this matters: the model catalog is exactly the set most regulators consider auditable.

What Ollama is not: a high-throughput production server. Under load — 8+ concurrent requests, multi-team usage, latency SLOs — Ollama is outclassed by vLLM on throughput by roughly an order of magnitude. The right pattern is to use Ollama as the pilot runtime and graduate to vLLM (or vLLM + LocalAI / Haystack) when the workflow is real.

Deployment posture

Ollama installs as a single binary, runs on Linux / macOS / Windows / Docker, supports CUDA, ROCm, Metal (Apple Silicon), and CPU-only fallback. Models are stored under ~/.ollama/models by default and can be air-gapped after first pull. The HTTP API listens on 127.0.0.1:11434 out of the box; the operator opts in to wider exposure via environment variable. The healthcare deployment defaults are sound; the operator's job is to lock the boundary at the network layer and configure SSO / mTLS if the runtime is shared.

SURFACE

OpenAI-compatible HTTP API

POST /api/generate, /api/chat, /api/embed, plus an OpenAI-compatible /v1/chat/completions endpoint. Most existing LLM client code works without modification.

HARDWARE

Laptop to workstation

Best fit: a single consumer / workstation GPU (RTX 4090, A6000, M-series Mac) or single A100. Multi-GPU is supported but vLLM is the better path at that scale.

SECURITY

Loopback by default

Binds to 127.0.0.1 by default; widen via OLLAMA_HOST. Air-gapped operation after first model pull. No telemetry, no auto-updates that exfiltrate metadata.

SCALE LIMIT

Outclassed under load

At 10+ concurrent users Ollama throughput stays around ~150 TPS while vLLM hits ~800. The transition signal is "we have a real internal user base," not "we have a pilot."

Healthcare fit

Ollama is the right runtime when a healthcare team needs to test a private model on PHI-adjacent data this week, without first negotiating a multi-month inference-stack procurement. Public examples in 2025–26 include workstation-resident PHI de-identification pipelines built on Llama 3.x in Ollama, internal knowledge assistants over policy SOPs in a single department, and clinical-AI literacy programs for IT / informatics teams that want a local sandbox before approving a vendor pilot.

checkGood fit: workstation-scale pilots, secure demos for leadership, departmental document Q&A, internal knowledge assistants over policy SOPs, PHI de-identification experiments on de-identified test data.
checkGood fit: running MedGemma 4B / 27B locally for medical text and image reasoning, paired with Qdrant or OpenSearch for retrieval and Haystack for orchestration.
checkConditional fit: a clinician-adjacent drafting pilot where the team wants a single-tenant local proof point before moving to a shared serving layer.
closeBad fit: production hospital-wide inference, multi-team concurrent usage with latency SLOs, ambient-scribe workloads where peak throughput matters. Use vLLM (or vLLM + LocalAI) instead.

Privacy and governance

Ollama's default privacy posture is unusually clean for an open-source LLM runtime: no telemetry, no remote calls after the model is pulled, binds to loopback, fully air-gappable. That makes the runtime usable inside HIPAA / PIPEDA / PHIPA / Quebec Law 25 boundaries with the standard local-system controls (disk encryption, OS-level audit, network segmentation). The runtime itself does not, however, deliver governance — model approval, allowed data classes, prompt logging, output review, and audit trails are the operator's responsibility.

Moneli Automation's typical use of Ollama is exactly this: a tightly scoped, inspectable pilot inside a hospital network, with explicit model approval, allowed data classes, prompt logging, and a clear stop / graduate decision tied to evaluation metrics. The runtime is the easy part; the governance wrapper is the value.

Strengths and limitations

STRENGTHS

Why teams shortlist it

Single-binary install. 100+ signed model library including MedGemma. Zero telemetry. OpenAI-compatible HTTP API on a stable port. Air-gappable. Apple Silicon Metal acceleration is first-class — a real advantage for clinician laptops and small clinics. The fastest path from "approved model" to "running on hospital hardware."

LIMITATIONS

Why it isn't the production stack

Single-stream throughput; outclassed by vLLM under concurrent load (~150 TPS vs ~800 TPS at 10 users). Static GPU memory allocation (no PagedAttention). No native multi-tenancy or per-user quotas. No built-in evaluation, prompt-management, or audit-log surface. Treat it as a pilot runtime, not a production platform.

Where Ollama fits in a hospital stack

Layer	What Ollama contributes	What still has to be solved
Pilot runtime	Fast, signed, air-gapped serving of approved open-weight models on controlled hardware.	Model approval policy, allowed data classes, workflow design, evaluation rubric.
Data boundary	Loopback default, no telemetry, no outbound calls — the cleanest local boundary among comparable runtimes.	Network segmentation, access controls, prompt logging, retention rules.
Model catalog	Direct path to running MedGemma, Llama 3.x, Mistral, Gemma, embedding models — without per-tool integration work.	Choosing which model is right for which workflow, evaluation against held-out data.
Scale path	Workstation pilot today; graduate to vLLM or LocalAI when the user base or SLO grows.	Capacity planning, observability, per-user quotas, multi-tenant security model.
Procurement value	Demonstrates whether private AI is worth turning into a bigger program — with real numbers, on real hardware.	Business case for the next stage of investment.

Ollama is the pilot lane. It is the right answer when the binding constraint is "we need to validate a private AI workflow on our own hardware before negotiating any vendor contract." It is the wrong answer for production hospital-wide serving — graduate to vLLM when the workflow earns it.

Quick facts

Project	Ollama (open-source, MIT licensed). GitHub: ollama/ollama.
Type	Local LLM runtime + model packaging layer. Built on llama.cpp; written in Go.
Platforms	Linux, macOS, Windows, Docker. CUDA, ROCm, Metal, CPU.
Default API	`http://localhost:11434`. Native API and OpenAI-compatible `/v1/chat/completions` endpoint.
Notable models	Llama 3.1 / 3.3 (8B, 70B, 405B), Mistral, Mixtral, Gemma 2 / 3, MedGemma 4B / 27B, Qwen 2.5, DeepSeek, gpt-oss, Phi, Code Llama, nomic-embed-text, bge-large.
Throughput class	~40–150 TPS single-stream on a single workstation GPU. Order-of-magnitude lower than vLLM under concurrent load.
Healthcare suitability	Workstation-scale and small-team pilots, internal knowledge assistants, MedGemma deployment, PHI-adjacent testing inside hospital networks.
Website	ollama.com · Docs: docs.ollama.com · GitHub: github.com/ollama/ollama

Use Ollama as the pilot runtime, not the production platform

Ollama is the runtime to reach for when the question is "can we run a private model on our own hardware safely this month?" — not when the question is "can we run hospital-wide ambient documentation on this stack?" Moneli Automation typically uses Ollama for the first inspectable pilot, then graduates the workflow onto a production serving layer once the value is real.

send Request a WalledCare pilot arrow_back All open-source profiles