Β palazzo
MCP server exposing a Qdrant-backed memory palace – typed wings, rooms, and halls instead of a generic blob store.
Cali's Rust daemon. Stdio or Streamable HTTP. No web UI, no auth, no drama.
What it is
A single-binary Rust MCP server that:
- Speaks MCP over stdio (run locally) or Streamable HTTP (run as a service for a team or a homelab)
- Embeds text via one of two backends – local ONNX (fastembed-rs, fully self-contained) or a remote Ollama (nomic-embed-text). Either way: 768-dim, same vector space.
- Stores and retrieves points in Qdrant with a structured palace schema (wing → room → hall) and temporal validity (valid_until/superseded_by) so the palace is a journal, not a snapshot
- Detects near-duplicates before writing (cosine ≥ 0.95 + exact text match)
- Keeps an append-only JSONL write-ahead log for every mutation
It is intentionally opinionated. If you want a generic (text, metadata) store, use qdrant/mcp-server-qdrant – this project starts from that interface and replaces the untyped metadata with an enum-validated palace schema.
Inspiration and prior art
- qdrant/mcp-server-qdrant – the official Qdrant MCP server (Python, FastMCP). palazzo borrows its store/find tool shape, collection configuration, and filter-wrapping pattern.
- MemPalace/mempalace – the wing / room / drawer terminology and the read-tool set (status, taxonomy, check_duplicate) are lifted from MemPalace's 29-tool MCP server. If you want a full palace with an agentic knowledge graph, cross-wing tunnels, and 96.6% R@5 retrieval on LongMemEval, go use MemPalace directly. palazzo is the minimum-viable single-user flavour of the same idea, Rust-native, Qdrant-backed.
Neither upstream is vendored. Both are linked above; please follow and star their work.
Tools
| Tool | What it does |
|---|---|
| palace_store | File a verbatim memory into a wing/room/hall. Returns a new point ID or the existing one on near-duplicate. |
| palace_find | Semantic search. Optional typed filters: wing, category, room, hall, since, until, recency_half_life_days. |
| palace_recall | Fetch by explicit IDs. Cheap – no embedding. |
| palace_status | Total point count plus facet breakdown by wing, hall, category. |
| palace_taxonomy | Flat facet dump of wing / room / hall / category counts. |
| palace_check_duplicate | Probe whether candidate text already exists above the 0.95 cosine threshold. |
| palace_supersede | Replace one or more existing memories with a corrected version. Marks the old points with valid_until, superseded_by, superseded_reason; default palace_find hides them. |
| palace_store_batch | Bulk-ingest up to 256 memories in one call. Embeds the whole batch in one ONNX/Ollama inference pass and bulk-upserts to Qdrant in one HTTP call (~3-5× faster than N single-item calls). Per-item dedup against the live palace; result returns per-item status, IDs, and dedup hits. Designed for migrations and bulk imports. |
| palace_gain | Token-savings report. Aggregates the per-tool gain log and returns a summary of how many tokens of agent context this server saved versus a hand-coded SSH+curl+jq equivalent. Optional since (RFC3339) and include_text flags. |
Input caps: 32 KB per text body, 100 IDs per recall batch, 1–20 results per find.
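For orientation, this is roughly what a palace_find call looks like from a stdio MCP client. A minimal sketch using the Python mcp SDK; the query and limit argument names are assumptions (only the filter names above come from this README), so check the tool's input schema before relying on them.

```python
# Sketch only: call a palazzo tool over stdio with the Python `mcp` SDK.
# "query" and "limit" are assumed argument names, not documented here.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    params = StdioServerParameters(
        command="/path/to/target/release/palazzo",
        env={"QDRANT_URL": "http://localhost:6333", "COLLECTION": "claude-memory"},
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "palace_find",
                {
                    "query": "why did we move the backups off NFS?",
                    "wing": "infrastructure",   # typed filter from the table above
                    "hall": "decisions",
                    "limit": 5,                 # 1-20 per the input caps
                },
            )
            print(result.content)


asyncio.run(main())
```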
Temporal filtering on palace_find
- since / until – inclusive RFC3339 second-precision UTC timestamps (e.g. 2026-04-01T00:00:00Z). Filter memories by when they were stored. A bad format is rejected with an explicit error.
- recency_half_life_days (f64) – opt-in recency bias. When set, palazzo fetches up to 4× the requested limit from Qdrant (capped at 80), re-ranks each hit by score × exp(-age_days / half_life), then returns the top limit. Omit or pass 0 for pure cosine. Typical values: 30 (aggressive), 90 (moderate), 365 (gentle – a year-old memory gets half its raw score).

Both knobs work alongside the wing/category/room/hall filters – they compose.
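For clarity, here is the recency re-rank as a standalone sketch (not palazzo's actual code, just the oversampling and decay formula described above):

```python
# Recency re-rank: oversample from Qdrant, decay each cosine score by age,
# keep the top `limit`. Illustrative only.
import math
from datetime import datetime, timezone


def recency_rerank(hits: list[dict], limit: int, half_life_days: float) -> list[dict]:
    """hits are Qdrant results fetched with an oversampled limit of
    min(4 * limit, 80); each carries a cosine `score` and an RFC3339 `timestamp`."""
    now = datetime.now(timezone.utc)
    for hit in hits:
        stored = datetime.fromisoformat(hit["timestamp"].replace("Z", "+00:00"))
        age_days = (now - stored).total_seconds() / 86_400
        hit["boosted"] = hit["score"] * math.exp(-age_days / half_life_days)
    return sorted(hits, key=lambda h: h["boosted"], reverse=True)[:limit]
```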
Temporal validity (palace_supersede)
Memories become wrong over time – infra gets renamed, services get rebuilt, decisions get reversed. palace_supersede lets you replace an old entry with a corrected one without losing the history:
- The new text is embedded and stored as a fresh point with supersedes: [<old_id>, ...].
- Each old point gets marked with valid_until = now, superseded_by = <new_id>, and your free-text reason.
- Default palace_find excludes any point with a past valid_until – agents only see current truth. Pass include_superseded: true to surface the full timeline for archaeology.
- palace_recall always exposes valid_until / superseded_by / superseded_reason on the returned point, so you can tell current from stale at a glance.
The palace becomes a journal, not a snapshot – every correction is an append, never a delete.
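A small sketch of the bookkeeping this implies, using only the field names named in this README (the dict shapes are made up for the example):

```python
# Illustrative: how a supersede call relates old and new points.
# Field names (supersedes, valid_until, superseded_by, superseded_reason)
# come from this README; everything else is hypothetical.
from datetime import datetime, timezone


def supersede(old_points: list[dict], new_point: dict, reason: str) -> None:
    now = datetime.now(timezone.utc).isoformat()
    new_point["supersedes"] = [p["id"] for p in old_points]
    for old in old_points:
        old["valid_until"] = now              # default palace_find now hides it
        old["superseded_by"] = new_point["id"]
        old["superseded_reason"] = reason     # free text, e.g. "host renamed"
```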
Palace schema
Every point carries:
- category: person | career | technical | infrastructure | project-memory | vibe | project
- wing: projects | infrastructure | nexpublica | personal | career | vibe
- room: free-text (project or topic)
- hall: facts | events | decisions | discoveries | preferences
- text: the memory itself, verbatim
- timestamp: RFC3339 UTC
- session: optional conversation identifier
- source_file: optional MD path when imported
IDs ≥ 1_000_000_000 are reserved for auto-generation (unix-millis). The palace schema enums are defined in src/schema.rs.
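If you prepare memories outside the server (e.g. for palace_store_batch or palazzo ingest), a quick client-side check against these enums can catch typos early. Sketch only; the canonical definitions live in src/schema.rs:

```python
# Client-side sanity check mirroring the enums listed above.
CATEGORIES = {"person", "career", "technical", "infrastructure",
              "project-memory", "vibe", "project"}
WINGS = {"projects", "infrastructure", "nexpublica", "personal", "career", "vibe"}
HALLS = {"facts", "events", "decisions", "discoveries", "preferences"}


def validate(memory: dict) -> None:
    assert memory["category"] in CATEGORIES, memory["category"]
    assert memory["wing"] in WINGS, memory["wing"]
    assert memory["hall"] in HALLS, memory["hall"]
    assert memory["room"], "room is free text but required"
    assert len(memory["text"].encode()) <= 32 * 1024, "32 KB text cap"
```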
Config
All via environment variables:
| Variable | Default | Notes |
|---|---|---|
| QDRANT_URL | http://localhost:6333 | |
| COLLECTION | claude-memory | |
| PALAZZO_WAL | ~/.palazzo/wal.jsonl | |
| PALAZZO_BIND | 127.0.0.1:6334 | only used by serve |
| PALAZZO_ALLOWED_HOSTS | localhost,127.0.0.1,::1 | DNS-rebinding guard for serve; set to * to disable |
| OLLAMA_URL | http://localhost:11434 | only read by the ollama backend |
| OLLAMA_MODEL | nomic-embed-text | only read by the ollama backend |
| FASTEMBED_CACHE_DIR | ~/.cache/fastembed | only used by the fastembed backend |
| PALAZZO_USAGE_LOG | /var/lib/palazzo/usage.jsonl | append-only JSONL backing palace_gain |
| PALAZZO_GAIN_ENABLED | 1 | set to 0/false/no/off to disable per-call recording |
| RUST_LOG | palazzo=info | |
Logging goes to stderr only. Stdout is the MCP transport – anything written there corrupts the JSON-RPC stream.
On startup, palazzo creates keyword payload indexes on wing, category, room, hall if they're missing. Idempotent; required for the facet-based tools. Adding indexes to an existing collection is non-destructive – Qdrant builds them in place and existing points stay.
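The equivalent step expressed with the Python qdrant-client, for anyone managing the collection out-of-band (palazzo does this natively in Rust on startup; this is just an illustration):

```python
# Create keyword payload indexes for the facet fields if they are missing,
# mirroring palazzo's startup behaviour described above.
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

client = QdrantClient(url="http://localhost:6333")
existing = set((client.get_collection("claude-memory").payload_schema or {}).keys())

for field in ("wing", "category", "room", "hall"):
    if field not in existing:
        client.create_payload_index(
            collection_name="claude-memory",
            field_name=field,
            field_schema=PayloadSchemaType.KEYWORD,
        )
```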
Embedding backends
palazzo ships two backends behind mutually-exclusive cargo features. Pick one at build time.
| Feature | How it embeds | When to use |
|---|---|---|
| fastembed (default) | Local ONNX inference of nomic-embed-text-v1.5-Q (INT8 dynamic-quantised) via fastembed-rs | You want palazzo fully self-contained – zero external services. Static binary, ~110 MB one-time model download into FASTEMBED_CACHE_DIR, ~1 GB resident. This is what every deployed palazzo runs. |
| ollama | HTTP calls to an Ollama server running nomic-embed-text | You already run Ollama on your LAN and prefer a tiny no-native-deps binary. Useful for dev rigs that don't want to pay the model-download / RSS cost. |
Select the variant via cargo features (release archives publish both per-platform):
cargo build --release # fastembed (default)
cargo build --release --no-default-features --features ollama
Both backends produce 768-dim vectors in the same vector space (nomic-embed-text-v1.5 architecture). Existing points embedded with one backend stay searchable with the other – including across the f32 → INT8-quantised swap that landed in v0.5.1. Expect ~0.98–0.99 cosine on the same text between any two precision combos, which is below the noise floor of typical palace queries.
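One way to spot-check that claim yourself, hedged: the sketch below assumes the fastembed Python package exposes the same quantised model id and that your Ollama instance serves the legacy /api/embeddings endpoint; neither is part of palazzo.

```python
# Embed the same text with a local fastembed model and a remote Ollama model,
# then compare cosine similarity. Expect a value near (not exactly) 1.0.
import math

import requests
from fastembed import TextEmbedding

text = "the backup host was renamed in March"

local = next(iter(TextEmbedding("nomic-ai/nomic-embed-text-v1.5-Q").embed([text])))
remote = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": text},
    timeout=60,
).json()["embedding"]

dot = sum(a * b for a, b in zip(local, remote))
norm = math.sqrt(sum(a * a for a in local)) * math.sqrt(sum(b * b for b in remote))
print(f"cross-backend cosine: {dot / norm:.4f}")
```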
Build
cargo build --release
Release profile is LTO-thin, single codegen unit, stripped. Binary ~28 MB with fastembed (default – static ONNX runtime included), ~8 MB with ollama. Resident memory at idle: ~1 GB (fastembed, model loaded) or ~30 MB (ollama).
Running
palazzo speaks two transports; pick one.
stdio (local)
palazzo
Stdout is the MCP channel – logging always goes to stderr. This is the default mode when the binary is invoked with no arguments. Best for single-user laptop use: no port to bind, no service to manage.
Register with Claude Code:
claude mcp add palazzo -- /path/to/target/release/palazzo
claude mcp list
Override env vars with -e KEY=VALUE before the --:
claude mcp add palazzo \
-e COLLECTION=my-palace \
-e OLLAMA_URL=http://localhost:11434 \
-- /path/to/target/release/palazzo
Streamable HTTP (service)
palazzo serve --bind 0.0.0.0:6334
Serves MCP over Streamable HTTP at POST /mcp. Useful when the binary lives on a server co-located with Qdrant + Ollama, and your laptop (or multiple clients) connect over the network.
Register with Claude Code as a remote server:
claude mcp add --transport http palazzo http://your-server:6334/mcp
Bind address can also be set via PALAZZO_BIND. Default is 127.0.0.1:6334.
Bulk ingest over HTTP (POST /ingest)
The serve mode exposes a sibling REST endpoint alongside /mcp:
curl -X POST http://palazzo-host:6334/ingest \
-H 'Content-Type: application/x-ndjson' \
--data-binary @batch.jsonl
Same backend as palace_store_batch – embed, dedup, WAL, upsert – but delivered as a plain HTTP request. When invoked from an MCP client via Bash(curl), the agent transcript only carries the curl command and the JSON response summary; the file's bytes flow through curl's body stream and never touch the LLM tokenizer. Use this for any bulk migration where the source data already exists on disk or a reachable URL. Same PALAZZO_ALLOWED_HOSTS allowlist as /mcp.
The response is streamed NDJSON – one progress line per processed batch (default 256 items each), then a final {"done": true, ...} line:
{"chunk":0,"items_in_chunk":256,"counts":{"stored":256,...},"running":{"stored":256,...}}
{"chunk":1,"items_in_chunk":256,"counts":{...},"running":{"stored":512,...}}
{"chunk":2,"items_in_chunk":88,"counts":{...},"running":{"stored":600,...}}
{"done":true,"total":600,"counts":{"stored":600,"duplicates_returned":0,"skipped_duplicates":0,"failed":0}}
Use curl -N to disable client-side buffering and watch progress live. Errors during processing emit a {"chunk":N,"error":"..."} line and close the stream. Body parse failures still return 400 with a plain-text body before any streaming starts.
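The same call from Python, keeping the streaming behaviour (equivalent to curl -N):

```python
# POST an NDJSON batch to /ingest and print progress lines as they arrive.
import json

import requests

with open("batch.jsonl", "rb") as body:
    resp = requests.post(
        "http://palazzo-host:6334/ingest",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        stream=True,          # don't buffer the NDJSON progress stream
        timeout=(10, None),   # connect timeout only; ingest can run a while
    )
resp.raise_for_status()       # 400 with a plain-text body on parse failure

for line in resp.iter_lines():
    if not line:
        continue
    event = json.loads(line)
    if event.get("done"):
        print("done:", event["counts"])
    elif "error" in event:
        raise RuntimeError(f"chunk {event.get('chunk')}: {event['error']}")
    else:
        print(f"chunk {event['chunk']}: running totals {event['running']}")
```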
Bulk ingest from a file (palazzo ingest)
palazzo ingest --file batch.jsonl
palazzo ingest --json < batch.jsonl
Same backend as palace_store_batch – embedding, dedup, WAL, upsert – but the texts never round-trip through the MCP transcript. Use this from migration scripts when the agent context can't afford the per-call cost of carrying the payloads. Input is JSON-Lines ({"text":..., "category":..., "wing":..., "room":..., "hall":...} per line, blank/#-prefixed lines ignored). Items are chunked into MAX_STORE_BATCH (256) groups and processed sequentially. Default output is a one-line summary on stderr; --json emits the full per-item result on stdout.
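A small sketch of producing that input from existing notes (field names follow the palace schema section; the source list is made up):

```python
# Write one JSON object per line -- the JSON-Lines shape `palazzo ingest` reads.
import json
from pathlib import Path

notes = [
    {
        "text": "vault-02 replaced vault-01 as the backup target",
        "category": "infrastructure",
        "wing": "infrastructure",
        "room": "backups",
        "hall": "events",
    },
    # ...one dict per memory
]

with Path("batch.jsonl").open("w", encoding="utf-8") as out:
    for note in notes:
        out.write(json.dumps(note, ensure_ascii=False) + "\n")
```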
Deploy as a systemd service
The deploy/ directory contains a hardened systemd unit, an env-file template, and an installer. On Debian / Ubuntu / any systemd host:
# On the target host, after placing the binary at e.g. ~/palazzo
sudo ./deploy/install.sh ~/palazzo
# Review /etc/palazzo/env, then:
sudo systemctl enable --now palazzo
The unit runs as a dedicated palazzo user, drops all needless privileges (ProtectSystem=strict, MemoryDenyWriteExecute=true, RestrictNamespaces=true, etc.), and persists the WAL at /var/lib/palazzo/wal.jsonl.
If you expose the service beyond a trusted LAN, put a reverse proxy with TLS + auth (e.g. nginx + basic auth, or an identity-aware proxy) in front of :6334. There is no built-in authentication β palazzo assumes a trusted network.
Testing
End-to-end smoke test against a throwaway Qdrant collection:
cargo build --release
python3 scripts/smoke.py
It creates palazzo-test, boots the binary, round-trips store / find / recall / status / check_duplicate / duplicate-skip / filtered find / since-until / recency-boost / supersede / superseded-hidden / superseded-surfaced / recall-temporal-metadata, and drops the collection. Fails loudly on any mismatch.
Requires live Qdrant and (for the ollama backend) Ollama reachable at the configured URLs. The fastembed backend has no external service requirement once the model cache is warm.
Security notes
- Stdio transport, single-user threat model. No network listener, no auth.
- Dependencies audited with cargo audit on every build bump.
- Every write goes through a WAL (~/.palazzo/wal.jsonl by default) with content previews truncated to 120 chars.
- OLLAMA_URL and QDRANT_URL are environment-controlled – anyone who can set env vars on this binary can already execute code as you, so the SSRF surface is accepted.
- MCP tool outputs (including stored text) are echoed back through the protocol; treat them as untrusted input to whatever LLM consumes them. This is a generic MCP concern, not specific to palazzo.
Non-goals
- No multi-tenant auth. Anyone reachable on the listener can read and write the palace. Put it behind a tailnet, a reverse proxy with auth, or a localhost-only bind. palazzo assumes a trusted network.
- No web UI. Use the Qdrant dashboard for raw inspection; the MCP tools are the supported interface.
- No knowledge graph, no agent diaries, no LLM rerankers, no embedding-model swaps to a different architecture. The palace stays a single 768-dim collection on nomic-embed-text. If you want any of those layers, MemPalace is purpose-built for it.
- No automatic collection migrations across architecture changes. Compatible variants of the same model (V15 → V15Q) work in the same collection because the vector space is identical. A different architecture would invalidate the existing points and isn't supported.
License
MIT β see LICENSE.
Credits
- qdrant/mcp-server-qdrant (Apache-2.0) for the MCP-over-Qdrant baseline.
- MemPalace/mempalace (MIT) for the palace terminology, read-tool set, and the idea that verbatim beats summarised.
- The MCP Rust SDK for the server harness.
