OPEN SOURCE · VECTOR DATABASE · RETRIEVAL LAYER

Qdrant

A Rust-built open-source vector database designed for single-binary self-hosting, advanced metadata filtering, and best-in-class price-performance — >10,000 queries per second on a single server, sub-100ms latency at 100 million vectors, and operationally simpler than Milvus by a wide margin. The right retrieval layer when a hospital wants RAG-grade similarity search without a Kubernetes-grade ops investment.

Throughput
10,000+ QPS

Single-server query throughput in 2026 benchmarks, sufficient for hospital-scale RAG workloads with headroom. Production deployments handle 100M+ vectors with sub-100ms latency at 95% recall.

Footprint
Single binary

One Rust binary, no Kubernetes prerequisite for the single-node deployment. The smallest credible operational footprint among production-grade vector databases.

Self-host cost
$30–50 / mo

Independent 2026 benchmarks put millions-of-vectors self-hosted deployments on a small VPS at $30–50 per month — best price-performance ratio in the category.

Filter accuracy
Best in class

Qdrant's HNSW + payload-filter integration handles complex metadata filters (specialty, department, document type, date ranges, permission scopes) faster and more accurately than the comparable engines — critical for permission-aware hospital RAG.

What Qdrant actually is

Qdrant is an open-source vector similarity search engine and database written in Rust, with a strong emphasis on operational simplicity and high-performance metadata filtering. The data model is straightforward: collections of points, each with a vector and an arbitrary JSON payload, indexed with HNSW (Hierarchical Navigable Small World) plus integrated filter handling that turns "find nearest vectors where department = 'cardiology' and last_updated > '2025-01-01' and permission_scope contains 'physician'" into a single fast query.
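As a rough sketch, the quoted query could be expressed as a JSON search body like the one below. The field names come from the example above. The query vector, the `limit`, and the assumption that `last_updated` is stored as an RFC 3339 datetime string (so that a datetime range filter applies) are illustrative, and the exact body shape should be checked against the API reference for the Qdrant version in use.

```python
import json


def build_search_body(query_vector, limit=5):
    """Build a search body combining similarity ranking with three payload conditions.

    All conditions sit under "must", so a point has to satisfy every one of them.
    """
    return {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                # Exact match on a keyword field.
                {"key": "department", "match": {"value": "cardiology"}},
                # Assumes last_updated is stored as an RFC 3339 datetime string.
                {"key": "last_updated", "range": {"gt": "2025-01-01T00:00:00Z"}},
                # For an array-valued field, a value match succeeds if any element matches.
                {"key": "permission_scope", "match": {"value": "physician"}},
            ]
        },
    }


# Hypothetical 4-dimensional query vector; real embeddings are much longer.
body = build_search_body([0.12, 0.48, 0.33, 0.91])
print(json.dumps(body, indent=2))
```

The point of the shape is that ranking and filtering travel in one request, rather than the caller fetching nearest neighbours and discarding the ones that fail the conditions.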

For a hospital building a RAG stack — clinical document Q&A over policies and SOPs, private medical search over guidelines, or discharge-summary retrieval grounded in patient context — Qdrant's combination of speed, filter quality, and operational simplicity is usually the right starting point. It scales well past the point where most pilots earn graduation, runs as a single binary on commodity hardware, and ships an HTTP / gRPC API plus official clients for Python, TypeScript, Go, Rust, Java, and .NET. Milvus is the right answer when scale grows into the billions-of-vectors range and a Kubernetes ops team is already in place; for the typical hospital workload, Qdrant is operationally lighter.

What Qdrant is not: a full RAG framework. It does retrieval, not embedding generation, not chunking, not prompt construction, not orchestration. Pair it with Haystack for the framework layer and an embedding model (BGE, E5, nomic-embed-text) served through Ollama or vLLM.

Deployment posture

Qdrant ships as a single statically-linked Rust binary, an official Docker image, a Helm chart for Kubernetes clusters, and a managed Qdrant Cloud option (which is irrelevant for the residency-bound hospital use case). The single-binary deployment is what most pilots reach for; the Docker / Helm options scale into multi-replica clusters when needed. Storage is on local disk by default; cluster mode adds replication and sharding without changing the data model.

SURFACE
HTTP + gRPC

REST API on port 6333, gRPC API on port 6334. Native clients in Python, TypeScript, Go, Rust, Java, .NET. Compatible with most RAG frameworks (Haystack, LangChain, LlamaIndex) out of the box.
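To make the REST surface concrete, here is a minimal standard-library sketch that prepares (but does not send) a search request against port 6333. The host, the collection name `hospital_policies`, and the key value are hypothetical. The `api-key` header only matters when an API key is configured on the server.

```python
import json
import urllib.request


def make_search_request(base_url, collection, body, api_key=None):
    """Prepare a POST to the collection's search endpoint without sending it."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Qdrant reads the key from the "api-key" header when one is configured.
        headers["api-key"] = api_key
    return urllib.request.Request(
        url=f"{base_url}/collections/{collection}/points/search",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical local instance and collection name.
req = make_search_request(
    "http://localhost:6333",
    "hospital_policies",
    {"vector": [0.1, 0.2, 0.3], "limit": 3},
    api_key="change-me",
)
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing `req` to `urllib.request.urlopen` once an instance is running. In practice the official Python client is the more convenient route; this version only shows what goes over the wire.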

SCALE
Single-node to cluster

Single-node handles 100M+ vectors comfortably. Cluster mode (replication + sharding) for higher availability and higher scale. The graduation point between modes is usually 200–500M vectors or a strict HA requirement.
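The claim that cluster mode does not change the data model can be illustrated with the collection-creation body: in this sketch the only difference between the two modes is a pair of extra settings. The vector size, shard count, and replication factor are assumed values for illustration, not sizing recommendations.

```python
def collection_body(vector_size, clustered=False, shards=3, replicas=2):
    """Build a create-collection body (sent with PUT /collections/<name>).

    The vector definition is identical in both modes; cluster mode only adds
    sharding and replication settings on top.
    """
    body = {"vectors": {"size": vector_size, "distance": "Cosine"}}
    if clustered:
        body["shard_number"] = shards          # how many shards the collection is split into
        body["replication_factor"] = replicas  # how many copies of each shard are kept
    return body


single_node = collection_body(768)
cluster = collection_body(768, clustered=True)
print(single_node)
print(cluster)
```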

INDEXING
HNSW + payload filters

HNSW with quantization options (scalar, product, binary) and integrated payload filtering. The filter-quality story is the main differentiator over Faiss-based engines like Milvus.
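A hedged sketch of what these options look like at collection-creation time, plus the arithmetic behind why quantization matters at the 100M-vector scale. The HNSW values shown, the 768-dimension embedding size, and the quantile are assumptions to be tuned per deployment. The memory estimate covers raw vector storage only and ignores the HNSW graph and payloads.

```python
def indexed_collection_body(vector_size):
    """Create-collection body with explicit HNSW and scalar quantization settings."""
    return {
        "vectors": {"size": vector_size, "distance": "Cosine"},
        # m: links per graph node; ef_construct: candidate list size while building.
        "hnsw_config": {"m": 16, "ef_construct": 100},
        # int8 scalar quantization stores one byte per dimension instead of four.
        "quantization_config": {
            "scalar": {"type": "int8", "quantile": 0.99, "always_ram": True}
        },
    }


def raw_vector_gb(num_vectors, dims, bytes_per_dim):
    """Raw vector storage in decimal gigabytes (graph and payload overhead excluded)."""
    return num_vectors * dims * bytes_per_dim / 1e9


float32_gb = raw_vector_gb(100_000_000, 768, 4)
int8_gb = raw_vector_gb(100_000_000, 768, 1)
print(f"float32: {float32_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions the raw vectors shrink from about 307 GB to about 77 GB, which is the difference between needing a cluster and fitting the quantized vectors in RAM on one large server.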

SECURITY
API-key + TLS

API-key authentication, JWT support, TLS, role-based access control via configurable tokens. Designed to sit behind a proper network boundary; the engine itself is intentionally minimal on identity.

Healthcare fit

Qdrant is the right retrieval layer when a hospital RAG workflow needs strong filter accuracy (permission scopes, departments, specialty, document type) and predictable operational simplicity. The typical use cases are policy and SOP Q&A, clinical guideline retrieval, formulary search, discharge-summary retrieval grounded in patient context, and internal knowledge assistants that draw on hospital-curated corpora.

  • Good fit: permission-aware RAG over policies, SOPs, care pathways, and formulary tables, where every query needs department and role filters as well as similarity ranking.
  • Good fit: hospital-scale corpus sizes (1M–500M vectors) where operational simplicity beats horizontal-scale headroom.
  • Good fit: teams using Haystack or LangChain that want a vector store they can stand up in an afternoon without operating a separate cluster.
  • Bad fit: hospital-system corpora past the billion-vector mark or with strict multi-region HA requirements; Milvus is the better fit there.
  • Bad fit: teams that already operate Elasticsearch / OpenSearch heavily and would rather extend than add a service; use OpenSearch k-NN instead.
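Why filter handling matters so much for the permission-scoped workloads above can be seen in a toy comparison. The sketch below uses brute-force cosine similarity over a hypothetical five-point corpus, not HNSW and not Qdrant's implementation. It only contrasts filtering after the top-k cut with filtering as part of the search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical toy corpus: (id, vector, payload).
POINTS = [
    (1, [1.0, 0.0], {"department": "oncology"}),
    (2, [0.9, 0.1], {"department": "oncology"}),
    (3, [0.8, 0.2], {"department": "oncology"}),
    (4, [0.5, 0.5], {"department": "cardiology"}),
    (5, [0.1, 0.9], {"department": "cardiology"}),
]


def post_filter_search(query, department, k):
    """Take the top k by similarity first, then drop results that fail the filter."""
    top_k = sorted(POINTS, key=lambda p: cosine(query, p[1]), reverse=True)[:k]
    return [p[0] for p in top_k if p[2]["department"] == department]


def filtered_search(query, department, k):
    """Apply the filter as part of the search, then rank only the allowed points."""
    allowed = [p for p in POINTS if p[2]["department"] == department]
    top_k = sorted(allowed, key=lambda p: cosine(query, p[1]), reverse=True)[:k]
    return [p[0] for p in top_k]


query = [1.0, 0.0]
print(post_filter_search(query, "cardiology", k=2))  # []
print(filtered_search(query, "cardiology", k=2))     # [4, 5]
```

With post-filtering, the cardiology user gets nothing back because the two nearest points both belong to oncology. With the filter applied during search, the same query returns the two best documents that user is allowed to see.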

Privacy and governance

Qdrant runs entirely on hospital-controlled hardware in the typical self-hosted deployment, with no remote dependency and all data on local disk. One caveat: Qdrant reports anonymized usage telemetry by default, so set telemetry_disabled in the configuration before the instance goes near production. With that done, this is the standard local-engine privacy posture, and it fits HIPAA / PIPEDA / PHIPA / Quebec Law 25 environments cleanly. Authentication and authorization are present but intentionally minimal; the operator's job is to put Qdrant behind a proper API gateway with SSO, mTLS, per-team quotas, and audit logging.

The governance question Qdrant cannot answer alone is "who is allowed to see which documents." Payload filters (department, role, sensitivity level) are the mechanism that enforces permission-aware retrieval, but the operator owns the schema, the ingestion pipeline that populates it, and the query-side enforcement. Moneli Automation's typical Qdrant deployment is exactly this: the engine for similarity search, the operator's schema for permission control, and the gateway for identity.
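One way to picture the query-side enforcement is a gateway function that adds mandatory permission conditions to whatever filter the caller supplied. This is a minimal sketch under stated assumptions: the `user_claims` dictionary (with `role` and `departments`) is a hypothetical shape for what an SSO layer might provide, and the payload field names follow the examples used earlier on this page.

```python
def enforce_permissions(user_filter, user_claims):
    """Return a new filter with mandatory permission conditions added.

    Conditions under "must" are ANDed, so anything the caller adds can only
    narrow the result set further; it cannot widen access.
    """
    mandatory = [
        # The document must belong to one of the user's departments.
        {"key": "department", "match": {"any": sorted(user_claims["departments"])}},
        # The document's permission_scope must include the user's role.
        {"key": "permission_scope", "match": {"value": user_claims["role"]}},
    ]
    merged = dict(user_filter or {})  # shallow copy; the caller's filter is not mutated
    merged["must"] = mandatory + list(merged.get("must", []))
    return merged


# Hypothetical identity claims resolved by the gateway, plus a caller-supplied filter.
claims = {"role": "physician", "departments": ["cardiology", "icu"]}
requested = {"must": [{"key": "document_type", "match": {"value": "sop"}}]}
print(enforce_permissions(requested, claims))
```

The important design choice is where this runs: in the gateway, on every request, so that a client which omits or alters its own filter still receives only permitted documents.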

Strengths and limitations

STRENGTHS
Why hospital stacks pick it

Best operational simplicity among production-grade vector databases — single binary, no Kubernetes prerequisite. Best filter-quality story; HNSW + payload filters integrated rather than bolted on. Best price-performance ratio for self-hosted deployments. Strong client library coverage. Rust runtime — fast and low-overhead.

LIMITATIONS
Where it does not fit

Single-node ceiling is high but not unlimited — past billions of vectors, Milvus is the better fit. No native multi-tenant isolation at the collection level (the operator builds tenancy with naming conventions and gateway-side authorization). Less mature than OpenSearch / Elasticsearch for combined keyword + vector queries (though hybrid search is supported and improving fast).
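For readers unfamiliar with how keyword and vector results get combined in hybrid search, reciprocal rank fusion (RRF) is one common method. The sketch below is a generic illustration of the technique, not Qdrant's implementation; the constant `k=60` is the conventional default from the RRF literature, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id scores 1 / (k + rank) for every list it appears in, with ranks
    starting at 1; ids are returned by descending total score, ties broken by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (vector) and a sparse (keyword) retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['a', 'c', 'b', 'd']
```

Documents that both retrievers rank well ("a" and "c") rise to the top, while a document found by only one retriever keeps a lower but nonzero score.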

Where Qdrant fits in a hospital stack

Layer | What Qdrant contributes | What still has to be solved
Vector retrieval | Fast similarity search with high-quality metadata filtering on commodity hardware. | Embedding model choice, chunking strategy, hybrid retrieval design, evaluation harness.
Operational simplicity | Single binary, no Kubernetes prerequisite, sub-day deployment. | Backup / restore policy, replication strategy, capacity planning past single-node.
Permission-aware retrieval | Payload-filter integration makes department / role / sensitivity scoping fast and accurate. | Schema design, ingestion-side population of permission fields, gateway-side enforcement.
Embeddings | None; bring your own embedding model served through Ollama, vLLM, or LocalAI. | Domain-tuned embedding choice (BGE, E5, MedCPT, nomic-embed-text), evaluation, re-indexing policy.
Orchestration | None; pair with Haystack for full RAG pipelines. | Workflow design, error handling, citation coverage, evaluation rubric.

Qdrant is the retrieval layer. It is the right answer for most hospital-scale RAG workloads when operational simplicity matters as much as throughput. Pair with Haystack for orchestration, vLLM or Ollama for embedding and generation, and a gateway for identity.

Quick facts

Project: Qdrant (open-source, Apache 2.0). Built in Rust. GitHub: qdrant/qdrant.
Type: Vector similarity search engine + database with payload-filter integration.
Indexing: HNSW with optional scalar, product, and binary quantization. Integrated filter execution rather than post-filter.
API: REST (port 6333) and gRPC (port 6334). Official clients: Python, TypeScript / JavaScript, Go, Rust, Java, .NET.
Hybrid search: Supported, with dense + sparse + filter combinations.
Scale class: Single-node 100M+ vectors with sub-100ms latency at 95% recall. Cluster mode for higher scale and HA.
Self-host cost class: Small VPS for millions-of-vectors deployments. Best price-performance ratio in the category per 2026 benchmarks.
Website: qdrant.tech · GitHub: github.com/qdrant/qdrant

Use Qdrant as the retrieval layer when simplicity matters

Qdrant is the right vector database when the workload is hospital-scale rather than hyperscale and the operating model values operational simplicity over horizontal-scale headroom. Moneli Automation's typical pattern is Qdrant as the retrieval layer, BGE or nomic-embed embeddings served through Ollama or vLLM, and Haystack as the orchestration layer above it.


Further reading