Milvus — Distributed Vector Database for Billion-Scale Hospital RAG

Scale ceiling

Tens of billions

Milvus Distributed handles datasets from 100M vectors up to tens of billions. Sharded storage architecture; storage, query, and coordination scale independently. Designed for the corpora that don't fit on a single node.

Architecture

K8s-native

Fully Kubernetes-native, distributed from day one. Separates compute and storage: vector data and index files persist in object storage (S3 / MinIO), while metadata lives in etcd. Requires a real ops team, which is both the cost and the value.

GPU acceleration

Optional

Milvus supports GPU-accelerated index building and similarity search via the cuVS / RAFT integration, which makes it one of the highest-throughput open-source vector DB options when GPU capacity is available.

Deployment modes

3

Milvus Lite (embedded, for prototyping), Milvus Standalone (single-node for pilots), Milvus Distributed (production, billion-vector scale). The mode you choose is a real architectural decision.

What Milvus actually is

Milvus is an open-source vector database originally built by the Zilliz team in 2019 and now under LF AI & Data Foundation governance. The core differentiator versus Qdrant is architectural: Milvus is distributed by design, with separated storage and compute layers, sharded ANN indexes, and a fully Kubernetes-native operating model. That makes it heavier to run but also the most credible open-source option once a corpus passes the billion-vector mark or once HA / multi-region replication is a binding requirement.

For healthcare specifically, Milvus has become a common reference for FHIR-grounded medical RAG: public examples in 2025–26 include time-aware patient-history search built on FHIR + Milvus + BGE embeddings, visual medication-identification systems combining CLIP and Milvus, and large-scale knowledge bases that fuse internal clinical content with licensed external evidence. Zilliz, the commercial entity behind the project, publishes detailed documentation and maintains integrations with the broader open-source healthcare stack (Docling, Haystack, LangChain, LlamaIndex).
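A minimal sketch of the ingestion side of the time-aware pattern, assuming FHIR Observation resources already parsed into Python dicts. The output field names (patient_id, text, effective_ts) are hypothetical and the embedding step is left out. The idea is that the observation time is stored as a sortable integer next to the text, so retrieval can later be restricted to a time window.

```python
from datetime import datetime


def flatten_observation(obs: dict) -> dict:
    """Turn a FHIR Observation dict into a flat record ready for embedding.

    Assumes effectiveDateTime is a full timestamp with an offset or "Z".
    """
    # Normalise a trailing "Z" so fromisoformat also accepts it on older Pythons.
    when = obs["effectiveDateTime"].replace("Z", "+00:00")
    effective_ts = int(datetime.fromisoformat(when).timestamp())

    label = obs.get("code", {}).get("text", "unknown observation")
    qty = obs.get("valueQuantity", {})
    if "value" in qty:
        value = f'{qty["value"]} {qty.get("unit", "")}'.strip()
    else:
        value = "no value"

    return {
        "patient_id": obs["subject"]["reference"],  # hypothetical output field
        "text": f"{label}: {value}",                # text to embed
        "effective_ts": effective_ts,               # epoch seconds, filterable
    }


example = {
    "resourceType": "Observation",
    "subject": {"reference": "Patient/123"},
    "code": {"text": "Heart rate"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
    "effectiveDateTime": "2024-01-01T00:00:00Z",
}
record = flatten_observation(example)
print(record)
```

Storing the timestamp as a scalar field alongside the vector is one way to support "most recent first" or "within the last year" retrieval; the published examples may structure this differently.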

What Milvus is not: the right choice when the workload comfortably fits on a single node and operational simplicity matters more than horizontal-scale headroom. For most departmental hospital RAG deployments, Qdrant is the better starting point. Reach for Milvus when the corpus is system-wide and growing, when GPU-accelerated search is on the table, or when the org already operates a Kubernetes platform team.
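Whether a workload "comfortably fits on a single node" can be estimated with simple arithmetic. The sketch below computes only the raw float32 vector payload. The 768-dimension figure is an assumption (typical of base-size embedding models), and index overhead, replicas, and metadata are deliberately left out, so real footprints will be larger.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(n_vectors: int, dim: int) -> int:
    """Raw payload of n_vectors float32 embeddings, excluding index overhead."""
    return n_vectors * dim * BYTES_PER_FLOAT32


def to_terabytes(n_bytes: int) -> float:
    """Convert bytes to decimal terabytes."""
    return n_bytes / 1e12


# Assumed embedding width; substitute the dimension of the model in use.
DIM = 768

for n in (100_000_000, 1_000_000_000, 10_000_000_000):
    tb = to_terabytes(raw_vector_bytes(n, DIM))
    print(f"{n:>14,} vectors x {DIM} dims = {tb:.2f} TB raw")
```

Under these assumptions the raw payload is roughly 0.3 TB at 100M vectors and about 3 TB at one billion, which is the range where a single node stops being comfortable.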

Deployment posture

Milvus is delivered in three deployment modes: Milvus Lite (embedded Python library, for prototypes and Jupyter notebooks), Milvus Standalone (single-node Docker, for pilots up to ~100M vectors), and Milvus Distributed (Kubernetes-native, for production billion-vector workloads). Production Milvus uses S3-compatible object storage for persistence (MinIO is the typical on-prem choice), etcd for metadata, and a separation of query / index / coordinator pods that lets each scale independently.
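From the client side, the three modes differ mainly in what the client connects to. A minimal sketch, assuming PyMilvus's MilvusClient and the default port 19530: a local file path selects Milvus Lite, while an HTTP URI points at a Standalone or Distributed endpoint. The helper names, host name, and file path are hypothetical.

```python
DEFAULT_PORT = 19530  # default Milvus service port


def milvus_uri(mode: str, host: str = "localhost",
               db_path: str = "./milvus_demo.db") -> str:
    """Return the connection URI for a given deployment mode."""
    mode = mode.lower()
    if mode == "lite":
        # Milvus Lite runs embedded and stores data in a local file.
        return db_path
    if mode in ("standalone", "distributed"):
        # Both server modes are reached over the network in the same way.
        return f"http://{host}:{DEFAULT_PORT}"
    raise ValueError(f"unknown deployment mode: {mode!r}")


def connect(mode: str, **kwargs):
    """Open a client for the given mode (requires pymilvus to be installed)."""
    from pymilvus import MilvusClient  # imported lazily so the helper above has no dependency
    return MilvusClient(uri=milvus_uri(mode, **kwargs))


print(milvus_uri("lite"))
print(milvus_uri("distributed", host="milvus.internal"))
```

Because application code sees the same client in every mode, a prototype written against Milvus Lite can usually be pointed at a Standalone or Distributed cluster by changing the URI; the operational differences sit behind it.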

SURFACE

gRPC + clients

Native gRPC API with official SDKs in Python, Java, Go, Node.js, and C#, plus a RESTful API. Ruby clients exist but are community-maintained. PyMilvus is the most widely used client. Compatible with Haystack, LangChain, LlamaIndex, Docling.

INDEXING

Wide index family

HNSW, IVF_FLAT, IVF_PQ, DiskANN, SCANN, and GPU-accelerated cuVS-backed indexes. The widest index palette in the open-source category.

STORAGE

Object-storage backed

Vector data and index files persist in S3-compatible object storage (MinIO on-prem); etcd holds metadata and handles coordination. This is the architecture that makes billion-vector horizontal scaling possible.

OPS

Kubernetes-grade

Production Milvus expects a Kubernetes platform team. That is both the cost (compared with Qdrant's single-binary simplicity) and the value (headroom beyond what a single node can hold).
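The index families listed under INDEXING each take different build parameters. The sketch below collects them as plain dicts in the shape PyMilvus expects (index_type, metric_type, params). The numeric values are illustrative starting points, not tuned recommendations, and the helper name index_spec is hypothetical.

```python
# Build-time parameters per index family. Values are assumptions for
# illustration and should be tuned against recall and latency targets.
_BUILD_PARAMS = {
    "HNSW": {"M": 16, "efConstruction": 200},
    "IVF_FLAT": {"nlist": 1024},
    "IVF_PQ": {"nlist": 1024, "m": 96, "nbits": 8},
    "DISKANN": {},
}


def index_spec(index_type: str, dim: int = 768,
               metric_type: str = "COSINE") -> dict:
    """Return an index definition dict for one vector field."""
    index_type = index_type.upper()
    if index_type not in _BUILD_PARAMS:
        raise ValueError(f"unsupported index type in this sketch: {index_type}")
    params = dict(_BUILD_PARAMS[index_type])
    # Product quantization splits each vector into m sub-vectors,
    # so the dimension must divide evenly.
    if index_type == "IVF_PQ" and dim % params["m"] != 0:
        raise ValueError(f"dim={dim} is not divisible by m={params['m']}")
    return {"index_type": index_type, "metric_type": metric_type, "params": params}


for name in ("HNSW", "IVF_PQ"):
    print(index_spec(name))
```

With PyMilvus, a dict like this would typically be passed as keyword arguments to `add_index` on the object returned by `client.prepare_index_params()`, together with a `field_name`; check the exact call against the PyMilvus version in use.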

Healthcare fit

Milvus is the right retrieval layer when a healthcare workload genuinely needs billion-vector scale, multi-team concurrent usage at high throughput, GPU-accelerated similarity search, or operational architectures already aligned with Kubernetes. Real-world deployments are growing: FHIR-grounded patient history retrieval, multi-modal medical RAG (CLIP + Milvus for pill / pathology image search), large-scale literature search, and hospital-system knowledge bases that combine internal SOPs with licensed external evidence.

Good fit: system-wide RAG over EHR, claims, imaging metadata, and literature where corpus size genuinely exceeds single-node capacity.
Good fit: teams that already run Kubernetes and prefer to standardize one data-platform pattern across services.
Good fit: GPU-rich environments where cuVS-accelerated indexing and search materially improve cost-per-query.
Good fit: FHIR-grounded medical RAG patterns where the published reference architecture explicitly uses Milvus.
Bad fit: single-department pilots where Qdrant's operational simplicity wins. The decision rule of thumb: under 100M vectors and no HA requirement, start with Qdrant.
Bad fit: teams without a Kubernetes ops capability. Milvus Distributed is not a single-binary deployment.
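The fit criteria above can be written down as a small decision function. This is a sketch of the rule of thumb as stated here, not a sizing tool; the function name and return labels are hypothetical.

```python
SINGLE_NODE_VECTOR_LIMIT = 100_000_000  # the ~100M rule-of-thumb threshold


def suggest_engine(n_vectors: int, needs_ha: bool, has_k8s_team: bool) -> str:
    """Apply the rule of thumb: small corpus and no HA means start with Qdrant."""
    if n_vectors < SINGLE_NODE_VECTOR_LIMIT and not needs_ha:
        return "qdrant"
    if not has_k8s_team:
        # Scale or HA points to Milvus Distributed, but the ops prerequisite is unmet.
        return "milvus-distributed (build Kubernetes ops capability first)"
    return "milvus-distributed"


print(suggest_engine(20_000_000, needs_ha=False, has_k8s_team=False))
print(suggest_engine(2_000_000_000, needs_ha=True, has_k8s_team=True))
```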

Privacy and governance

In a self-hosted deployment, Milvus runs entirely on customer-controlled infrastructure: no telemetry by default, no remote dependency, and all data on local Kubernetes / object storage. That data-handling posture is consistent with HIPAA / PIPEDA / PHIPA / Quebec Law 25 environments. The caveat is that a real Milvus deployment has more moving parts than a single Qdrant binary, and each component (etcd, MinIO, query nodes, coordinator) needs its own network-boundary, access-control, and audit story.

Governance for Milvus is the same shape as for any distributed data system: schema ownership, ingestion-side population of permission metadata, query-side authorization at the gateway, audit log retention, backup and restore policy, and a documented capacity model. The engine does retrieval; the operator owns the operating model.
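Query-side authorization at the gateway often comes down to building a scalar filter from the caller's entitlements and attaching it to every search. A minimal sketch, assuming each stored chunk carries hypothetical `department` and `sensitivity` fields populated at ingestion; the expression uses Milvus's boolean filter syntax (`in [...]`, `and`, comparison operators).

```python
import json


def permission_filter(departments, max_sensitivity: int) -> str:
    """Build a Milvus filter expression from a caller's entitlements."""
    if not departments:
        # Fail closed: a caller with no departments must not get an unfiltered search.
        raise ValueError("caller has no department entitlements")
    # json.dumps yields a double-quoted, escaped list literal such as ["a", "b"].
    dept_list = json.dumps(sorted(departments))
    return f"department in {dept_list} and sensitivity <= {int(max_sensitivity)}"


expr = permission_filter({"icu", "cardiology"}, max_sensitivity=2)
print(expr)
```

The resulting string would be passed as the `filter` argument of a PyMilvus search call. Building it at the gateway, rather than trusting the client to supply it, keeps the authorization decision in one audited place.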

Strengths and limitations

STRENGTHS

Why hospital stacks pick it

Highest scale ceiling among open-source vector databases — credibly handles billions of vectors. Kubernetes-native architecture that aligns with existing hospital-system platform teams. GPU-accelerated indexing and search via cuVS / RAFT. Widest index family. Strong reference architectures for FHIR-grounded medical RAG. Backed by Zilliz with active commercial support if needed.

LIMITATIONS

Where it does not fit

Higher operational complexity than Qdrant; not a single-binary deployment. Kubernetes prerequisite is real. Less polished metadata-filter integration than Qdrant for permission-aware retrieval. Heavier learning curve for teams new to vector databases. Overkill for departmental pilots under ~100M vectors.

Where Milvus fits in a hospital stack

Layer	What Milvus contributes	What still has to be solved
Vector retrieval at scale	Billion-vector horizontal scaling on Kubernetes; storage / compute separation; GPU-accelerated search.	Capacity planning, ops team, backup and restore strategy, multi-region replication policy.
Index palette	HNSW, IVF, DiskANN, SCANN, GPU-accelerated cuVS — pick by recall / latency / cost trade-off.	Choosing the right index per collection; periodic re-indexing as the model or corpus evolves.
Healthcare reference patterns	FHIR-grounded patient history search, CLIP + Milvus visual medical RAG, large multi-modal corpora.	FHIR ingestion pipeline, embedding strategy per modality, evaluation per use case.
Embeddings	None — bring your own embedding model served through Ollama, vLLM, or LocalAI.	Embedding choice (BGE, E5, MedCPT, nomic-embed-text), evaluation, re-indexing cadence.
Orchestration	None — pair with Haystack or LangChain for the full RAG pipeline.	Workflow design, error handling, evaluation rubric, audit trail.

Milvus is the heavyweight retrieval layer. It is the right answer when a hospital corpus genuinely outgrows single-node engines or when Kubernetes-native operational architecture is already a given. For most starting-point hospital workloads, Qdrant is the lighter choice; graduate to Milvus when the workload earns it.

Quick facts

Project	Milvus (open-source, Apache 2.0). Under LF AI & Data Foundation governance. Commercial support: Zilliz. GitHub: milvus-io/milvus.
Type	Distributed, Kubernetes-native vector database with separated storage and compute and a wide index palette.
Deployment modes	Milvus Lite (embedded). Milvus Standalone (single-node Docker). Milvus Distributed (production Kubernetes).
Indexes	HNSW, IVF_FLAT, IVF_PQ, DiskANN, SCANN, GPU-accelerated (cuVS / RAFT).
Storage	S3-compatible object store (MinIO on-prem). etcd for coordination.
Scale class	Tens of billions of vectors. Multi-shard, multi-replica. Designed for HA and horizontal scaling.
Healthcare reference patterns	FHIR-grounded medical RAG (BGE + Milvus), CLIP + Milvus visual medication ID, Docling + Milvus document RAG.
Website	milvus.io · GitHub: github.com/milvus-io/milvus

Use Milvus when scale and Kubernetes are both already in the picture

Milvus is the retrieval layer to reach for when the hospital workload genuinely exceeds single-node capacity, or when the org already operates Kubernetes and wants one distributed pattern across services. Moneli Automation's typical Milvus deployment is paired with FHIR-grounded ingestion, BGE or MedCPT embeddings, and Haystack orchestration — the same shape as a Qdrant deployment, scaled up.
