OPEN SOURCE · LOCAL MODEL RUNTIME · PRIVATE PILOT
Ollama
A local LLM runtime that serves 100+ open-weight models (Llama 3.1, Llama 3.3, Mistral, Gemma 2/3, MedGemma, Qwen, gpt-oss, DeepSeek, more) behind a simple REST API on localhost:11434 — fully air-gapped, no telemetry, OpenAI-compatible endpoint. The fastest way to put a private model on hospital-controlled hardware and prove the workflow before standing up a production-grade inference stack.
Curated, signed open-weight catalog including Llama 3.x, Mistral, Gemma 2/3, MedGemma, Qwen, gpt-oss, DeepSeek, Phi, Code Llama, embedding models, vision models, and most major releases within days of upstream.
HTTP API on localhost:11434 with an OpenAI-compatible endpoint at /v1/chat/completions. Drop-in for any client that already talks to an LLM API — no SDK switch needed.
No telemetry, no model-pull dependency once the model is on disk, and air-gapped operation supported. The data-handling default that makes the runtime usable in a HIPAA / PIPEDA / PHIPA environment.
Roughly 40–150 tokens per second (TPS) single-stream for ~7B–13B models on a single consumer or workstation GPU. Adequate for pilot demos and small-team usage; vLLM is the substantially faster path for production serving.
What Ollama actually is
Ollama is a local LLM runtime — written in Go on top of llama.cpp — that wraps model download, format conversion, GPU/CPU dispatch, and an HTTP API into a single binary the operator can run on a workstation, a server, or a Mac. The healthcare-relevant property is that the runtime makes no outbound calls after the model is on disk: the data path is "browser or app → localhost:11434 → GPU → response," with nothing routed to a vendor cloud. That is the configuration that lets a hospital safely test a private model on PHI-adjacent data without first standing up a production serving stack.
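The data path described above can be exercised with nothing but the Python standard library. The following is a minimal sketch, assuming a model has already been pulled and the server is listening on the default loopback address. The model tag `llama3.1:8b` is illustrative, and `build_generate_request` and `generate` are hypothetical helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default loopback address from the text


def build_generate_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a POST to Ollama's native /api/generate endpoint.

    The model tag is an assumption; use whichever model is already on disk.
    """
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this policy in one sentence.")` on a host with Ollama running should return the model's text; the request never leaves the loopback interface.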
The model catalog is the second buyer-relevant property. Ollama publishes a signed library with Llama 3.1 / 3.3 (8B / 70B / 405B), Mistral (7B / Mixtral 8x7B / 8x22B), Gemma 2 (2B / 9B / 27B) and Gemma 3 (1B / 4B / 12B / 27B), MedGemma (4B and 27B, Google's medical-tuned Gemma), Qwen 2.5, DeepSeek, gpt-oss, Code Llama, embedding models (nomic-embed-text, bge-large), and vision models. Pulls are reproducible and signed. For Canadian / Australian healthcare buyers, this matters: the model catalog is exactly the set most regulators consider auditable.
What Ollama is not: a high-throughput production server. Under load (8+ concurrent requests, multi-team usage, latency SLOs), vLLM outclasses Ollama on throughput by a wide margin: roughly 5x on the figures cited in this profile (~150 TPS versus ~800 TPS at 10 concurrent users). The right pattern is to use Ollama as the pilot runtime and graduate to vLLM (or vLLM + LocalAI / Haystack) when the workflow is real.
Deployment posture
Ollama installs as a single binary, runs on Linux / macOS / Windows / Docker, and supports CUDA, ROCm, Metal (Apple Silicon), and CPU-only fallback. Models are stored under ~/.ollama/models by default, and the host can be air-gapped after the first pull. The HTTP API listens on 127.0.0.1:11434 out of the box; the operator opts in to wider exposure via the OLLAMA_HOST environment variable. The healthcare deployment defaults are sound; the operator's job is to lock the boundary at the network layer and, because Ollama ships no built-in authentication, to put SSO / mTLS in front of it (for example at a reverse proxy) if the runtime is shared.
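One way to keep the loopback default honest is a pre-flight check that refuses to start a pilot when the bind address has been widened. The sketch below is an illustration under stated assumptions: it parses OLLAMA_HOST-style values in a simplified way ("host", "host:port", or "scheme://host:port"), and `is_loopback_bind` and `current_bind_is_loopback` are hypothetical helper names.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def is_loopback_bind(ollama_host: str) -> bool:
    """Return True if an OLLAMA_HOST-style value points at a loopback address.

    Simplified parsing (an assumption of this sketch): accepts "host",
    "host:port", or "scheme://host:port". Anything unparseable counts as
    not loopback, so the check fails closed.
    """
    value = ollama_host.strip()
    if "//" not in value:
        value = "//" + value  # let urlsplit treat the value as a network location
    try:
        host = urlsplit(value).hostname
    except ValueError:
        return False
    if not host:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other hostname; treat as exposed


def current_bind_is_loopback() -> bool:
    """Check the environment, falling back to the default bind address."""
    return is_loopback_bind(os.environ.get("OLLAMA_HOST", "127.0.0.1:11434"))
```

A value such as `0.0.0.0:11434` fails the check, which is the signal that network-layer controls and an authenticating proxy need to be in place before anyone else can reach the port.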
POST /api/generate, /api/chat, /api/embed, plus an OpenAI-compatible /v1/chat/completions endpoint. Most existing LLM client code works without modification.
Best fit: a single consumer / workstation GPU (RTX 4090, A6000, M-series Mac) or single A100. Multi-GPU is supported but vLLM is the better path at that scale.
Binds to 127.0.0.1 by default; widen via OLLAMA_HOST. Air-gapped operation after first model pull. No telemetry, no auto-updates that exfiltrate metadata.
At 10+ concurrent users, Ollama throughput plateaus at roughly 150 TPS while vLLM reaches roughly 800. The transition signal is "we have a real internal user base," not "we have a pilot."
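The OpenAI-compatible endpoint listed above means a client only needs a different base URL to talk to the local runtime. The following sketch builds such a request with the standard library. The model tag and the default `base_url` are illustrative, and `build_chat_request` and `extract_reply` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(messages, model="llama3.1:8b",
                       base_url="http://127.0.0.1:11434/v1"):
    """Build an OpenAI-style chat completion request aimed at a local server.

    model and base_url are illustrative defaults. Pointing base_url at another
    OpenAI-compatible server is the only change needed to switch backends.
    """
    body = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_json["choices"][0]["message"]["content"]
```

Keeping the base URL as a parameter is a deliberate choice here: the same client code can later be pointed at a different OpenAI-compatible serving layer when the pilot graduates.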
Healthcare fit
Ollama is the right runtime when a healthcare team needs to test a private model on PHI-adjacent data this week, without first negotiating a multi-month inference-stack procurement. Public examples in 2025–26 include workstation-resident PHI de-identification pipelines built on Llama 3.x in Ollama, internal knowledge assistants over policy SOPs in a single department, and clinical-AI literacy programs for IT / informatics teams that want a local sandbox before approving a vendor pilot.
- Good fit: workstation-scale pilots, secure demos for leadership, departmental document Q&A, internal knowledge assistants over policy SOPs, PHI de-identification experiments on de-identified test data.
- Good fit: running MedGemma 4B / 27B locally for medical text and image reasoning, paired with Qdrant or OpenSearch for retrieval and Haystack for orchestration.
- Conditional fit: a clinician-adjacent drafting pilot where the team wants a single-tenant local proof point before moving to a shared serving layer.
- Bad fit: production hospital-wide inference, multi-team concurrent usage with latency SLOs, ambient-scribe workloads where peak throughput matters. Use vLLM (or vLLM + LocalAI) instead.
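For the departmental document Q&A case in the list above, the retrieval step can be prototyped before any vector database is introduced. The sketch below builds a request for the `/api/embed` endpoint and ranks documents by cosine similarity in memory. It is a stand-in for Qdrant or OpenSearch, not a replacement; the helper names are hypothetical and the embedding model tag is an assumption.

```python
import json
import math
import urllib.request


def build_embed_request(texts, model="nomic-embed-text",
                        base_url="http://127.0.0.1:11434"):
    """Build a POST to /api/embed for a batch of texts (model tag is illustrative)."""
    body = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)
```

In a real pilot the vectors would come from the `embeddings` field of the server's response, and `rank` would be replaced by a query against the chosen vector store.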
Privacy and governance
Ollama's default privacy posture is unusually clean for an open-source LLM runtime: no telemetry, no remote calls after the model is pulled, binds to loopback, fully air-gappable. That makes the runtime usable inside HIPAA / PIPEDA / PHIPA / Quebec Law 25 boundaries with the standard local-system controls (disk encryption, OS-level audit, network segmentation). The runtime itself does not, however, deliver governance — model approval, allowed data classes, prompt logging, output review, and audit trails are the operator's responsibility.
Moneli Automation's typical use of Ollama is exactly this: a tightly scoped, inspectable pilot inside a hospital network, with explicit model approval, allowed data classes, prompt logging, and a clear stop / graduate decision tied to evaluation metrics. The runtime is the easy part; the governance wrapper is the value.
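Because the runtime provides none of these controls, the governance wrapper has to live in the calling code. The following is a rough sketch of what model approval, allowed data classes, and prompt logging might look like. Every name in it (`APPROVED_MODELS`, `ALLOWED_DATA_CLASSES`, `check_policy`, `audit_record`, `append_jsonl`) is hypothetical, and logging hashes instead of raw text is an assumed policy choice, not a recommendation from the source.

```python
import hashlib
import json
import time

APPROVED_MODELS = {"llama3.1:8b"}                      # hypothetical approval list
ALLOWED_DATA_CLASSES = {"synthetic", "de-identified"}  # hypothetical data classes


def check_policy(model: str, data_class: str) -> None:
    """Raise PermissionError unless both the model and the data class are approved."""
    if model not in APPROVED_MODELS:
        raise PermissionError(f"model not approved: {model}")
    if data_class not in ALLOWED_DATA_CLASSES:
        raise PermissionError(f"data class not allowed: {data_class}")


def audit_record(user_id, model, prompt, response, data_class, now=None):
    """Build one audit-log entry that stores hashes rather than raw text."""
    return {
        "ts": time.time() if now is None else now,
        "user_id": user_id,
        "model": model,
        "data_class": data_class,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }


def append_jsonl(path, record):
    """Append the record as one JSON line to an append-only log file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

The intended order is `check_policy` before the model call and `audit_record` plus `append_jsonl` after it, so that a rejected request never reaches the runtime and every accepted one leaves a trace.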
Strengths and limitations
Strengths: Single-binary install. 100+ signed model library including MedGemma. Zero telemetry. OpenAI-compatible HTTP API on a stable port. Air-gappable. Apple Silicon Metal acceleration is first-class, a real advantage for clinician laptops and small clinics. The fastest path from "approved model" to "running on hospital hardware."
Limitations: Single-stream throughput; outclassed by vLLM under concurrent load (~150 TPS vs ~800 TPS at 10 users). Static GPU memory allocation (no PagedAttention). No native multi-tenancy or per-user quotas. No built-in evaluation, prompt-management, or audit-log surface. Treat it as a pilot runtime, not a production platform.
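The concurrent-load figures above are easier to judge when translated into what a single user experiences. The arithmetic below is a back-of-envelope sketch: it assumes aggregate throughput is split evenly across users, and the 300-token reply length is a hypothetical value chosen for illustration.

```python
def per_user_tps(aggregate_tps: float, concurrent_users: int) -> float:
    """Tokens per second each user sees if throughput is shared evenly (an assumption)."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return aggregate_tps / concurrent_users


def seconds_for_reply(reply_tokens: int, tps: float) -> float:
    """Time to stream a reply of reply_tokens tokens at the given rate."""
    return reply_tokens / tps


# Figures from the text: ~150 TPS (Ollama) vs ~800 TPS (vLLM) at 10 users.
ollama_rate = per_user_tps(150, 10)  # 15 tokens/s per user
vllm_rate = per_user_tps(800, 10)    # 80 tokens/s per user

# Hypothetical 300-token reply.
ollama_wait = seconds_for_reply(300, ollama_rate)  # 20 seconds
vllm_wait = seconds_for_reply(300, vllm_rate)      # 3.75 seconds
```

Under these assumptions the gap is roughly 5x per user, which is the difference between a reply that streams comfortably and one that users will notice waiting for.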
Where Ollama fits in a hospital stack
| Layer | What Ollama contributes | What still has to be solved |
|---|---|---|
| Pilot runtime | Fast, signed, air-gapped serving of approved open-weight models on controlled hardware. | Model approval policy, allowed data classes, workflow design, evaluation rubric. |
| Data boundary | Loopback default, no telemetry, no outbound calls — the cleanest local boundary among comparable runtimes. | Network segmentation, access controls, prompt logging, retention rules. |
| Model catalog | Direct path to running MedGemma, Llama 3.x, Mistral, Gemma, embedding models — without per-tool integration work. | Choosing which model is right for which workflow, evaluation against held-out data. |
| Scale path | Workstation pilot today; graduate to vLLM or LocalAI when the user base or SLO grows. | Capacity planning, observability, per-user quotas, multi-tenant security model. |
| Procurement value | Demonstrates whether private AI is worth turning into a bigger program — with real numbers, on real hardware. | Business case for the next stage of investment. |
Ollama is the pilot lane. It is the right answer when the binding constraint is "we need to validate a private AI workflow on our own hardware before negotiating any vendor contract." It is the wrong answer for production hospital-wide serving — graduate to vLLM when the workflow earns it.
Quick facts
| Field | Details |
|---|---|
| Project | Ollama (open-source, MIT licensed). GitHub: ollama/ollama. |
| Type | Local LLM runtime + model packaging layer. Built on llama.cpp; written in Go. |
| Platforms | Linux, macOS, Windows, Docker. CUDA, ROCm, Metal, CPU. |
| Default API | http://localhost:11434. Native API and OpenAI-compatible /v1/chat/completions endpoint. |
| Notable models | Llama 3.1 / 3.3 (8B, 70B, 405B), Mistral, Mixtral, Gemma 2 / 3, MedGemma 4B / 27B, Qwen 2.5, DeepSeek, gpt-oss, Phi, Code Llama, nomic-embed-text, bge-large. |
| Throughput class | ~40–150 TPS single-stream on a single workstation GPU. Several times lower than vLLM under concurrent load (~150 vs ~800 TPS at 10 users). |
| Healthcare suitability | Workstation-scale and small-team pilots, internal knowledge assistants, MedGemma deployment, PHI-adjacent testing inside hospital networks. |
| Website | ollama.com · Docs: docs.ollama.com · GitHub: github.com/ollama/ollama |
Use Ollama as the pilot runtime, not the production platform
Ollama is the runtime to reach for when the question is "can we run a private model on our own hardware safely this month?" — not when the question is "can we run hospital-wide ambient documentation on this stack?" Moneli Automation typically uses Ollama for the first inspectable pilot, then graduates the workflow onto a production serving layer once the value is real.
Further reading
- Ollama official site
- Ollama on GitHub
- Ollama API reference
- Ollama model library
- De-identifying HIPAA PHI using local Ollama (DEV community)
- Healthcare team case study: replacing cloud AI with local Ollama
- vLLM profile — the production-serving counterpart when concurrent throughput matters
- llama.cpp profile — Ollama's underlying inference engine
- MedGemma profile — Google's medical Gemma, runnable directly in Ollama