NCMS
See It Working • How It Works • Fine-Tune Your Own Adapter • Benchmarks • Quickstart Guide
Your AI agents forget everything between sessions. Every conversation starts from zero. Every insight, every architectural decision, every hard-won debugging breakthrough: gone.
NCMS fixes this. Permanently.
pip install ncms
from ncms.interfaces.mcp.server import create_ncms_services, create_mcp_server
memory, bus, snapshots, consolidation = await create_ncms_services()
server = create_mcp_server(memory, bus, snapshots, consolidation)
Three lines. Your agents now have persistent, searchable, shared memory with cognitive scoring: a system that learns while it sleeps, tracks how knowledge evolves through state-change grammar, and optionally runs a fine-tuned ingest-side classifier that replaces brittle regex with a 2.4 MB LoRA adapter you train on your own corpus. No vector database. No embedding pipeline. No external services.
What Makes NCMS Different
| Problem | Traditional Approach | NCMS |
|---|---|---|
| Memory retrieval | Dense vector similarity (lossy) | BM25 + SPLADE + graph expansion + cross-encoder + structured recall (precise) |
| "What's the current state?" | Recency sort or last-write-wins | TLG grammar retrieval β structural proof over typed state-transition edges, 32/32 rank-1 on ADR corpus |
| Admission / state-change / topic tagging | 5 separate regex & LLM code paths | One fine-tuned 2.4 MB LoRA adapter β five classification heads in a single forward pass |
| Agent coordination | Polling shared files, explicit tool calls | Embedded Knowledge Bus (osmotic) |
| Agent goes offline | Knowledge lost until restart | Snapshot surrogate response (always available) |
| Dependencies | Vector DB + graph DB + message broker | Zero. Single pip install. |
| Setup time | Hours of infrastructure | 3 seconds to first query |
See It Working
git clone https://github.com/AliceNN-ucdenver/ncms.git
cd ncms && uv sync
uv run ncms demo
Three collaborative agents run through a complete lifecycle (storing knowledge, asking questions, going offline with surrogate responses, and announcing breaking changes), all in-memory, in under 10 seconds.
uv run ncms dashboard # Real-time observability at http://localhost:8420
How It Works
NCMS organizes agent memory into a Hierarchical Temporal Memory Graph (HTMG): a four-level structure where raw facts crystallize into tracked states, states cluster into temporal episodes, and episodes consolidate into strategic insights. Think of it as giving your agents not just storage, but the ability to understand their knowledge. (V1 architecture)
NCMS Architecture (HTMG)
Every memory enters through an ingest pipeline that classifies it, like a bouncer deciding who gets into the club, but one who went to grad school. Raw facts become ATOMIC nodes. State changes ("Redis upgraded to v7.4") become ENTITY_STATE nodes with bitemporal validity tracking. Related events cluster into EPISODE nodes via a 7-signal hybrid linker. And overnight, dream cycles consolidate episodes into ABSTRACT insights: the system literally learns while it sleeps.
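As a mental model, the four levels can be pictured as plain Python types. This sketch is illustrative only; the field names are assumptions, not NCMS's actual schema:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class NodeLevel(Enum):
    ATOMIC = 1        # L1: raw facts and observations
    ENTITY_STATE = 2  # L2: tracked states with bitemporal validity
    EPISODE = 3       # L3: temporal clusters of related events
    ABSTRACT = 4      # L4: consolidated strategic insights


@dataclass
class MemoryNode:
    level: NodeLevel
    content: str
    # Only L2 nodes carry validity tracking in this sketch
    valid_from: str | None = None
    valid_to: str | None = None
    is_current: bool = True
    children: list[MemoryNode] = field(default_factory=list)
```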
The Fine-Tunable 5-Head SLM: Ingest-Voice Content Classifier
NCMS replaces five separate pieces of brittle pattern-matching code with a single fine-tuned LoRA adapter running a 5-head BERT classifier at ingest. One forward pass produces: admission routing, state-change detection, topic tagging, preference intent, and typed-span role classification. The output drives every downstream ingest decision: domain expansion, L2 entity-state creation, supersession edges, episode formation. (Current state: v9 · Domain plugin architecture)
Five heads, one forward pass (20-65 ms on MPS):
| Head | Output | What it replaces |
|---|---|---|
| admission | persist / ephemeral / discard | 4-feature regex heuristic (65.9% accuracy) |
| state_change | declaration / retirement / none → feeds L2 entity-state induction | 3-pattern state-declaration regex (8/8 FP on YAML templates) |
| topic | per-adapter taxonomy label (not a hardcoded enum) | LLM-based label_detector.py + manual Memory.domains tagging |
| intent | positive / negative / habitual / difficulty / choice / none | regex preference extractor (never shipped) |
| role (per-span) | primary / alternative / casual / not_relevant on each gazetteer-detected span | GLiNER-only entity extraction on closed-vocab domains (no role disambiguation) |
Signals at retrieval. Each head's output lands on memory.structured["intent_slot"] at ingest and becomes a typed signal the scoring pipeline can read:
- intent → boosts memories whose preference label matches the query's pattern intent (Phase H.1)
- state_change → boosts memories tagged as actual state changes on CHANGE_DETECTION queries (Phase H.2); gates the supersession/conflict reconciliation penalty so it only fires on CURRENT_STATE_LOOKUP (Phase G, the canonical bug fix)
- topic → auto-appended to Memory.domains so the domain filter narrows retrieval without manual tagging
- role → reserved for the grounding boost (memories where the query entity has role=primary); off by default pending v10 calibration
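To make the gating concrete, here is a minimal sketch of how a scoring pipeline could read these signals. The intent names and the memory.structured["intent_slot"] location come from the docs above; the weight and function shape are invented for the sketch:

```python
def slm_retrieval_signals(query_intent: str, structured: dict) -> tuple[float, bool]:
    """Return (score bonus, whether the supersession penalty applies).

    Assumes structured["intent_slot"] holds the 5-head output, e.g.
    {"state_change": "declaration", "intent": "positive", "topic": "infra"}.
    The 0.15 weight is invented for this sketch.
    """
    slot = structured.get("intent_slot", {})
    bonus = 0.0
    # Phase H.2: boost genuine state changes on change-detection queries
    if query_intent == "change_detection" and slot.get("state_change") in (
        "declaration", "retirement",
    ):
        bonus += 0.15
    # Phase G: the supersession/conflict penalty fires only on
    # current-state lookups
    return bonus, query_intent == "current_state_lookup"
```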
A per-query diagnostic event emits the signal vector for every search: which heads fired, which contributed to the rank-1 result's score, and whether grammar composition replaced the BM25 head.
3-tier fallback chain: the chain's presence at MemoryService construction is the kill-switch. Set NCMS_DEFAULT_ADAPTER_DOMAIN=<name> to load the LoRA at startup; leave it unset to ingest on the heuristic-only chain. No boolean flag.
primary: LoRA adapter (per-deployment, ~2.4 MB on disk)
  ↓ if adapter missing / head abstained
fallback: E5 zero-shot (cold-start: intent-only head, no training required)
  ↓ if torch unavailable
heuristic: null output (admission=persist, everything else None; ingest keeps working)
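The cascade itself is ordinary priority-chain code. A minimal sketch of the pattern, with illustrative class and callable names (not NCMS's internal API):

```python
class ClassifierChain:
    """Minimal sketch of a 3-tier fallback cascade (illustrative names).

    Each tier is a callable returning a dict of head outputs, or None to
    abstain / signal unavailability.
    """

    def __init__(self, lora=None, e5=None):
        self.lora = lora  # fine-tuned LoRA adapter; absent on cold start
        self.e5 = e5      # zero-shot E5 fallback; absent without torch

    def classify(self, text: str) -> dict:
        for tier in (self.lora, self.e5):
            if tier is not None:
                out = tier(text)
                if out is not None:   # a tier may abstain
                    return out
        # heuristic tier: null output so ingest keeps working
        return {"admission": "persist", "state_change": None,
                "topic": None, "intent": None, "role": None}
```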
Dynamic topics. The topic vocabulary lives in the adapter's manifest.json + taxonomy.yaml, not in the codebase. Swap adapter, swap topics. The dashboard enumerates them directly from the database (SQLiteStore.list_topics_seen()) with zero config coupling.
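Because the vocabulary is data, swapping topics never touches code. For instance, a taxonomy file with a topic_labels list (that key appears in the training example further down; the path here is illustrative) can be enumerated directly:

```python
import yaml  # third-party: pyyaml

# Illustrative path; shipped adapters live at ~/.ncms/adapters/<domain>/v9/
with open("taxonomy.yaml") as f:
    taxonomy = yaml.safe_load(f)

print(taxonomy["topic_labels"])  # e.g. ["framework", "testing", "infra"]
```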
Three reference adapters ship today, each ~2.4 MB at ~/.ncms/adapters/<domain>/v9/:
- conversational/v9: open-vocab domain (no gazetteer; role head idle, GLiNER provides entities)
- clinical/v9: 536 gazetteer entries × 6 slots (medication / procedure / symptom / severity / alternative / frequency)
- software_dev/v9: 712 gazetteer entries × 9 slots (library / language / framework / pattern / tool / database / service / alternative / frequency)
All three are baked into the NemoClaw hub Docker image; the hub defaults to software_dev.
CTLG: query-side cue tagging (planned, sibling adapter)
The 5-head SLM owns ingest voice. Query-voice semantic parsing is a different task, composing typed cues (causal / temporal / ordinal / modal) into a structured TLG query form, and it ships as a separate CTLG adapter loaded alongside the 5-head SLM at runtime. Two adapters in production, one per cognitive role (content classification vs cue tagging), NOT two-for-CTLG.
Why a sibling, not a 6th head: v8 attempted to add the cue tagger as a 6th head on the same encoder. Joint training of per-token BIO sequence labeling alongside the per-CLS classification heads saturated under shared encoder capacity: training loss oscillated, and several previously-healthy heads regressed. v9 dropped the 6th head; the CTLG adapter forks training while keeping the runtime architecture coherent. (CTLG design · v8 saturation retrospective)
Plumbing already in place: EdgeType.CAUSED_BY + EdgeType.ENABLES on graph edges; a cue_tags: list[dict] field on ExtractedLabel; the _extract_and_persist_causal_edges ingestion path gated on cue_tags presence; a rules-first synthesizer at domain/tlg/composition.py; the NCMS_TLG_LLM_FALLBACK_ENABLED knob reserved. The cue head is the only missing piece: corpus annotation + dedicated training.
Retrieval Pipeline
Traditional memory systems compress documents into dense vectors, losing precision. NCMS uses complementary mechanisms that work together without a single embedding:
Tier 0: Intent Classification. Queries are classified into one of 7 intent types (fact lookup, current state, historical, event reconstruction, change detection, pattern, strategic reflection) via a BM25 exemplar index. This shapes which memory types receive a scoring bonus downstream.
Tier 1: BM25 + SPLADE Hybrid Search. BM25 via Tantivy (Rust) provides exact lexical matching. SPLADE adds learned sparse neural retrieval, expanding "API specification" to also match "endpoint", "schema", "contract". Results fuse via Reciprocal Rank Fusion.
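Reciprocal Rank Fusion is the standard rank-based combiner. A minimal reference implementation (k=60 is the conventional constant, not necessarily the value NCMS uses):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of memory ids via Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a BM25 ranking with a SPLADE ranking
fused = rrf_fuse([["m3", "m1", "m7"], ["m1", "m3", "m9"]])
```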
Tier 1.5: Graph-Expanded Discovery. Entity relationships in the knowledge graph discover related memories that search missed lexically. A query matching "connection pooling" also finds memories about "PostgreSQL replication", because both share the PostgreSQL entity.
Tier 2: ACT-R Cognitive Scoring. Every memory has an activation level computed from access recency, frequency, and contextual relevance. Dream-learned association strengths weight entity connections; reconciliation penalties demote superseded or conflicted states.
Tier 2.5: Score Normalization. Per-query min-max normalization brings all signals to [0,1] scale before combining.
Tier 3: Selective Cross-Encoder Reranking. A 22M-parameter cross-encoder (ms-marco-MiniLM-L-6-v2) reranks candidates, but only for fact lookup, pattern, and strategic reflection queries. State and temporal queries skip reranking to preserve chronological and causal ordering.
Tier 4: Structured Recall. The recall() method layers structured context on top: entity state snapshots, episode membership with sibling expansion, causal chains from the HTMG. One call returns what takes 5+ tool calls elsewhere.
Tier 5: Temporal Linguistic Geometry (TLG). For state-evolution queries ("What's the current authentication scheme?", "What caused the payments delay?", "What came before MFA?"), TLG runs a grammar-based structural proof over typed state-transition edges. It produces an exact answer (or abstains) with a readable syntactic proof, and composes with BM25 via a zero-confidently-wrong invariant: when TLG's confidence is high, its rank-1 answer replaces BM25's head; when it abstains, BM25 ordering is returned unchanged.
Query intent today is BM25-exemplar classification. A small in-memory Tantivy index of ~70 exemplar queries classifies each search into one of 7 intent classes (fact_lookup, current_state_lookup, historical_lookup, event_reconstruction, change_detection, pattern_lookup, strategic_reflection). The SLM signals from ingest (intent / state_change / topic / role) feed retrieval bonuses gated on this classified intent. Query-side compositional parsing (cue tagging → structured TLG queries) is the next step; see the CTLG design. (Pre-paper · v9 findings)
activation(m) = base_level(m) + spreading_activation(m, query) + noise
                - supersession_penalty - conflict_penalty + hierarchy_bonus
base_level(m) = ln( sum( (time_since_access)^(-decay) ) )
spreading(m)  = sum( learned_PMI_weight(entity) )    # dream-learned associations
combined(m)   = bm25 * w_bm25 + splade * w_splade + activation * w_actr + graph * w_graph
→ TLG grammar answer (when has_confident_answer(), replaces rank-1)
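The pseudocode above transcribes almost directly into Python. The weights, noise scale, and penalty values here are placeholders, but the formulas match:

```python
import math
import random

def base_level(times_since_access: list[float], decay: float = 0.5) -> float:
    """ACT-R base-level activation: ln( sum(t ** -decay) ).

    times_since_access: seconds since each past access (all > 0).
    """
    return math.log(sum(t ** -decay for t in times_since_access))

def activation(times_since_access: list[float], spreading: float, *,
               supersession_penalty: float = 0.0, conflict_penalty: float = 0.0,
               hierarchy_bonus: float = 0.0, noise_scale: float = 0.01) -> float:
    noise = random.gauss(0.0, noise_scale)
    return (base_level(times_since_access) + spreading + noise
            - supersession_penalty - conflict_penalty + hierarchy_bonus)

def combined(bm25: float, splade: float, actr: float, graph: float,
             w_bm25: float = 0.4, w_splade: float = 0.3,
             w_actr: float = 0.2, w_graph: float = 0.1) -> float:
    # Placeholder weights; NCMS tunes its own. Inputs are assumed to be the
    # per-query min-max normalized signals from Tier 2.5.
    return bm25 * w_bm25 + splade * w_splade + actr * w_actr + graph * w_graph
```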
Memory Ingestion Pipeline
Entities, preferences, topics, admission routing, and state-change detection all run on the same memory at ingest time, but the SLM (when enabled) is the primary source of truth on admission / state-change / topic, with regex paths kept alive as fallback for cold-start deployments.
Content Classification: Incoming content passes through a dedup gate (SHA-256), then a two-class classifier. NAVIGABLE documents (ADRs, PRDs, YAML configs with headings/structure) get section-aware ingestion: one vocabulary-dense profile memory in the memory store, full document + sections in the document store. ATOMIC fragments (facts, observations, announcements) proceed through the standard pipeline.
5-Head SLM (optional, set NCMS_DEFAULT_ADAPTER_DOMAIN=<name>): Runs before admission. Produces all five classification outputs (intent / role / topic / admission / state_change) in one forward pass. Its admission_head replaces the regex admission scorer when confident; its state_change_head replaces the state-declaration regex; its topic_head auto-populates Memory.domains; its role_head classifies gazetteer-detected spans into primary / alternative / casual / not_relevant for downstream L2 entity-state grounding.
GLiNER NER: Zero-shot Named Entity Recognition using a 209M-parameter DeBERTa model. Extracts entities across any domain, running in parallel with the SLM: GLiNER's output feeds the knowledge graph (spreading activation, co-occurrence edges, entity-state reconciliation) while the SLM's output feeds ingest decisions. The two are complementary: GLiNER handles open-vocabulary NER, the SLM handles typed domain-specific slot extraction.
Admission Routing: A 3-way gate (discard, ephemeral cache, or persist). Either the SLM's admission_head (when confident) or the 4-feature regex heuristic (fallback) decides. Memories with importance >= 8.0 bypass admission entirely.
State Reconciliation: When a new entity state arrives ("Redis upgraded to v7.4"), NCMS classifies its relationship to existing states (supports / refines / supersedes / conflicts) and applies bitemporal truth maintenance (sketched after this pipeline). Superseded states get is_current=False with validity closure.
Episode Formation: Related memories are automatically grouped into temporal episodes via a 7-signal hybrid linker (BM25, SPLADE, entity overlap, domain match, temporal proximity, source agent, structured anchors like JIRA tickets).
Contradiction Detection (opt-in): LLM-powered post-ingest scan for factual contradictions against existing related memories.
Knowledge Consolidation (opt-in): Offline clustering + LLM synthesis of cross-memory patterns into searchable ABSTRACT insights.
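To make the State Reconciliation step concrete (the sketch promised above), bitemporal closure reduces to closing the old validity interval and opening a new one. Field names here are illustrative, not NCMS's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EntityState:
    entity: str
    value: str
    valid_from: datetime
    valid_to: datetime | None = None  # open interval while current
    is_current: bool = True


def supersede(old: EntityState, new_value: str) -> EntityState:
    """Close the old state's validity interval and open a new one."""
    now = datetime.now(timezone.utc)
    old.valid_to = now        # validity closure: the old state stays queryable
    old.is_current = False
    return EntityState(entity=old.entity, value=new_value, valid_from=now)


redis_72 = EntityState("Redis", "v7.2",
                       valid_from=datetime(2025, 1, 1, tzinfo=timezone.utc))
redis_74 = supersede(redis_72, "v7.4")   # "Redis upgraded to v7.4"
```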
Dream Cycles (Project Oracle)
Like biological sleep consolidation, NCMS runs three non-LLM passes during "sleep" to create the differential access patterns ACT-R cognitive scoring needs to contribute signal:
- Dream Rehearsal: Selects high-value memories via 5-signal weighted scoring (PageRank centrality 0.40, staleness 0.30, importance 0.20, frequency 0.05, recency 0.05) and injects synthetic access records.
- Association Learning: Computes pointwise mutual information (PMI) from entity co-access patterns in the search log, feeding learned weights into spreading_activation().
- Importance Drift: Compares recent access rates to older rates and adjusts memory.importance within bounded limits. Frequently accessed memories rise; neglected ones gracefully decay.
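PMI here is the standard pointwise mutual information over co-access counts from the search log. A minimal sketch with assumed count inputs:

```python
import math

def pmi(co_count: int, count_a: int, count_b: int, total: int) -> float:
    """PMI(a, b) = ln( p(a, b) / (p(a) * p(b)) ) from access-log counts."""
    p_ab = co_count / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab / (p_a * p_b))

# e.g. two entities co-accessed 30 times in 1000 logged searches,
# individually accessed 50 and 80 times: positive PMI => learned association
weight = pmi(30, 50, 80, 1000)  # ~2.01
```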
Knowledge Bus & Agent Sleep/Wake
Agents don't poll for updates. They don't call each other directly. Knowledge flows through domain-routed channels: osmotic knowledge transfer.
# API agent announces a change; the frontend agent gets it automatically
await agent.announce_knowledge(
event="breaking-change",
domains=["api:user-service"],
content="GET /users now returns role field",
breaking=True,
)
Ask/Respond: Non-blocking queries routed by domain.
Announce/Subscribe: Fire-and-forget broadcasts to interested agents.
Surrogate Response: When agents go offline, they publish knowledge snapshots. Other agents can still ask them questions through the snapshot.
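For symmetry with the announce example above, here is what an ask/respond call might look like. The method name ask_knowledge and its parameters are illustrative assumptions, not the documented API:

```python
# Hypothetical sketch: ask_knowledge and its signature are assumed for
# illustration of the non-blocking, domain-routed ask/respond pattern.
answer = await agent.ask_knowledge(
    question="What fields does GET /users return now?",
    domains=["api:user-service"],
)
# If the target agent is asleep, the bus can answer from its published
# snapshot (surrogate response) instead of failing the query.
```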
Fine-Tune Your Own Adapter
Three reference adapters ship today at ~/.ncms/adapters/{conversational,software_dev,clinical}/v9/, but the point of the architecture is that operators train their own for their own domain. The 5-head classifier does its best work when fine-tuned on the kind of content your users actually ingest.
One-command training
# Put your corpus JSONL + taxonomy YAML in a directory:
./my_corpus/
├── gold.jsonl    # hand-labeled examples (start with ~50-75 rows)
└── topics.yaml   # topic_labels: [framework, testing, infra, ...]
                  #   object_to_topic: map surface forms to topics
# Run the four-phase pipeline (takes ~5-15 min on Apple Silicon MPS):
uv run python -m experiments.intent_slot_distillation.train_adapter \
--domain my_domain \
--taxonomy ./my_corpus/topics.yaml \
--adapter-dir ./adapters/my_domain/v1 \
--target-size 500 \
--adversarial-size 300 \
--epochs 6 \
--lora-r 16
What happens:
- Bootstrap: loads your gold + any mixed-content seeds (admission / state-change variety). Auto-labels topic/admission/state_change from the taxonomy map where gold doesn't already have them.
- Expand (SDG): template-based synthetic data expansion. 500 target → ~400 deduped examples with full multi-head labels.
- Adversarial: generates 200-300 hard cases across 7 failure modes (quoted speech, negated positives, past-flip, third-first contrast, double negation, sarcasm, empty/minimal).
- Train + Gate: LoRA fine-tune with class-weighted slot loss. The gate refuses to promote an adapter that doesn't meet thresholds (intent F1 ≥ 0.70, slot F1 ≥ 0.75, confidently-wrong ≤ 10%) or regresses against a named baseline adapter.
Output: a 2.4 MB adapter directory with lora_adapter/ + heads.safetensors + manifest.json + taxonomy.yaml + eval_report.md (PASS/FAIL gate + per-head F1 table).
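The promotion gate reduces to a threshold check over the eval report. A sketch using the thresholds quoted above (the report keys are illustrative):

```python
def gate_passes(report: dict, baseline: dict | None = None) -> bool:
    """Refuse promotion unless the thresholds quoted above are met."""
    ok = (report["intent_f1"] >= 0.70
          and report["slot_f1"] >= 0.75
          and report["confidently_wrong_rate"] <= 0.10)
    if baseline is not None:  # never regress against a named baseline adapter
        ok = ok and report["intent_f1"] >= baseline["intent_f1"]
    return ok
```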
Point NCMS at your adapter
# Via config
NCMS_DEFAULT_ADAPTER_DOMAIN=my_domain \
NCMS_SLM_CHECKPOINT_DIR=./adapters/my_domain/v1 \
uv run ncms serve
# Or via benchmark runner
uv run python -m benchmarks longmemeval --features-on \
--intent-slot-domain my_domain
See Add a domain for the full walk-through of authoring a v9 domain plugin (gazetteer + diversity + archetypes), and v9 domain plugin architecture for the design rationale. Historical reading: P2 plan (the P2 sprint that produced the original 5-head adapter).
Benchmarks
NCMS achieves nDCG@10 = 0.7206 on SciFact (the BEIR dataset most aligned with factual knowledge retrieval), exceeding published ColBERTv2 (0.693, +4.0%) and SPLADE++ (0.710, +1.5%) without dense vectors or an LLM at query time. Cross-domain validation on NFCorpus (biomedical) shows consistent improvement: +10.0% over BM25 (0.3188 → 0.3506).
On SWE-bench Django (503 documents, 170 test queries), structured recall achieves Recall AR nDCG@10 = 0.2032, exceeding search-only AR (0.1759) by +15.5%. (Full SWE-bench results)
TLG: state-evolution retrieval (NEW, 2026-04)
Across 11 intent shapes on the hand-curated ADR / project / clinical corpus:
| Strategy | Top-5 accuracy | Rank-1 accuracy |
|---|---|---|
| BM25 | 13 / 32 (41 %) | 5 / 32 (16 %) |
| BM25 + observed_at DESC | 13 / 32 (41 %) | 0 / 32 (0 %) |
| Entity-scoped + path-rerank | 14 / 32 (44 %) | 6 / 32 (19 %) |
| TLG grammar | 32 / 32 (100 %) | 32 / 32 (100 %) |
Every TLG answer comes with a readable syntactic proof ("successor = ADR-010 (refines)", "walked 6 predecessors; root = ADR-001"). On LongMemEval's conversational subset the grammar correctly abstains (framing mismatch: LME isn't state-evolution content) and falls through to BM25+SPLADE unchanged. Full validation in docs/tlg-validation-findings.md; the reusable four-domain state-evolution benchmark (MSEB v1) ships at docs/mseb-results.md (original design).
5-Head SLM: ingest classifier (v9, 2026-04)
The shipped 5-head LoRA adapter classifies every memory at ingest into typed labels (intent / role / topic / admission / state_change). v8 attempted a 6th head for query-side cue tagging but joint training saturated; v9 ships the 5 production heads only, and the cue tagger is being designed as a separate sibling adapter (see CTLG design).
| Domain | Ingest heads (F1, v9 baseline) | Gazetteer | Head labels trained |
|---|---|---|---|
| conversational/v9 | intent 1.000, topic 1.000, admission 1.000, state_change 1.000 | open-vocab (none) | preference + topic taxonomy |
| software_dev/v9 | intent 1.000, role 0.79, topic 1.000, admission 1.000, state_change 1.000 | 712 entries × 9 slots | framework / library / language / pattern / tool / database / service taxonomy |
| clinical/v9 | intent 1.000, role 0.79, topic 1.000, admission 1.000, state_change 1.000 | 536 entries × 6 slots | medication / procedure / symptom / severity taxonomy |
The role head's "primary / alternative / casual / not_relevant" classification on gazetteer-detected spans is the v9 replacement for the v6 BIO slot tagger; it sources canonical state values for L2 entity-state nodes and feeds the role-grounding retrieval bonus (off-by-default pending v10 calibration evidence).
Compare to zero-shot baselines: E5 label-similarity hits intent F1 0.347-0.612 on the same gold; the LoRA adapters gain +0.22 to +0.49 absolute intent F1 while eliminating the 26.7-56.7% confidently-wrong rate.
For full per-head evidence + retrieval-side ablation results (Phase G + Phase H series), see docs/v9-mseb-slm-lift-findings.md and docs/mseb-results.md.
MSEB v1: state-evolution retrieval across four domains (NEW, 2026-04-21)
A pluggable, gold-audited benchmark for state-evolution memory retrieval: four domains (SWE-bench Verified diffs, PMC clinical case reports, ADR prose, LongMemEval conversations) × four query classes (general / temporal / preference / noise). Head-to-head against mem0 in a 12-cell single-pass run (747 hand-audited gold queries, locked):
| Domain | NCMS (hybrid) | mem0 (dense) | Δ (NCMS − mem0) |
|---|---|---|---|
| MSEB-SoftwareDev (ADR prose) | 0.745 | 0.455 | +0.29 |
| MSEB-Clinical (PMC case reports) | 0.672 | 0.224 | +0.45 |
| MSEB-SWE (SWE-bench Verified diffs) | 0.416-0.456 | 0.256 | +0.16 to +0.20 |
| MSEB-Convo (LongMemEval) | 0.345 | 0.207 | +0.14 |
Hybrid retrieval beats dense retrieval on state-evolution content by +0.14 to +0.45 rank-1 across every domain tested; not a borderline result. All three backends (NCMS tlg-on, NCMS tlg-off, mem0) correctly reject 100% of the 59 adversarial off-topic noise queries. Per-class breakdown, per-head SLM contribution analysis, honest TLG limitations, and the full reproducibility recipe are in docs/mseb-results.md.
./benchmarks/mseb/run_main_12.sh # One-shot: 12 cells, 4 domains × 3 backends
LongMemEval A/B (500-question non-regression check, 2026-04-20)
The SLM on vs. off on LongMemEval is an axis-mismatch test: conversational memory recall isn't the axis the SLM was built for, so the point of this run is confirming zero regression + acceptable latency, not headline accuracy:
| | Baseline (--features-on) | SLM (--intent-slot-domain conversational) | Δ |
|---|---|---|---|
| Recall@5 | 0.4680 | 0.4680 | 0.0000 (bit-identical across all 6 categories) |
| Elapsed | 10,562 s | 11,099 s | +537 s (~48 ms / memory overhead) |
| Memories stored | 10,960 | 10,960 | – |
| Errors / tracebacks / HTTP 4xx | 0 | 0 | – |
The classifier ran ~11k forward passes cleanly; it just didn't move the number because LongMemEval's retrieval path doesn't consume the SLM's outputs on the axes it classifies. Expected and desired: the real benchmark for the SLM's admission + state_change + topic heads is state-evolution retrieval, not conversational recall. See docs/completed/intent-slot-history/intent-slot-sprint-4-findings.md §10 for the full A/B breakdown.
Baseline Comparison (SWE-bench Django)
Compared against Mem0 and Letta on SWE-bench Django (850 issues, 80/20 chronological split). NCMS wins 3 of 4 metrics with zero OpenAI API calls; Mem0 and Letta both use OpenAI text-embedding-3-small dense vectors.
| Metric | NCMS | NCMS Recall | Mem0 | Letta |
|---|---|---|---|---|
| AR nDCG@10 | 0.1750 | 0.2031 | 0.1550 | 0.1412 |
| TTL Accuracy | 0.6529 | – | 0.5941 | 0.7412 |
| CR Temporal MRR | 0.0947 | – | 0.0150 | 0.0616 |
| LRU nDCG@10 | 0.3540 | – | 0.1979 | 0.1245 |
See the full ablation study, weight tuning results, and completed milestones for methodology, per-dataset metrics, and development history.
Get Started
pip install ncms # Core install
pip install "ncms[docs]" # + rich document support (DOCX/PPTX/PDF/XLSX)
pip install "ncms[dashboard]" # + observability dashboard
uv run ncms demo # See it in action
uv run ncms serve # Start MCP server
uv run ncms dashboard # Real-time dashboard
uv run ncms load file.md --domains arch # Matrix-style knowledge download
uv run ncms lint # Diagnose memory store health
uv run ncms export --output-dir wiki # Export as linked markdown wiki
Quickstart Guide: MCP server setup, Claude Code hooks, NeMo agent integration, configuration reference, and local LLM inference.
GPU-Accelerated LLM Inference
NCMS LLM features (contradiction detection, knowledge consolidation) can be accelerated with an NVIDIA DGX Spark running vLLM via the NGC vLLM container.
Deploy Nemotron on DGX Spark:
sudo docker run -d --gpus all --ipc=host --restart unless-stopped \
--name vllm-nemotron-nano \
-p 8000:8000 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--host 0.0.0.0 --port 8000 --trust-remote-code \
--max-model-len 524288 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder
Point NCMS at the Spark:
NCMS_CONTRADICTION_DETECTION_ENABLED=true \
NCMS_LLM_MODEL=openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
NCMS_LLM_API_BASE=http://spark-ee7d.local:8000/v1 \
NCMS_CONSOLIDATION_KNOWLEDGE_ENABLED=true \
NCMS_CONSOLIDATION_KNOWLEDGE_MODEL=openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
NCMS_CONSOLIDATION_KNOWLEDGE_API_BASE=http://spark-ee7d.local:8000/v1 \
uv run ncms serve
The Nemotron 3 Nano (30B total, 3B active MoE) fits entirely in the Spark's 128GB unified memory, delivering sub-second LLM inference.
Note: the ingest-side intent-slot SLM (bert-base-uncased + LoRA) runs happily on Apple Silicon MPS, CUDA, or CPU; no DGX required. The DGX is only for the LLM-dependent opt-in features (contradiction detection, knowledge consolidation, synthesis).
Completed Features
Core retrieval (Phases 0-11)
- BM25 + SPLADE + Graph hybrid retrieval (nDCG@10=0.72 SciFact)
- Selective cross-encoder reranking (intent-aware)
- Per-query score normalization
- Structured recall with episode / entity / causal context (+15.5% AR)
- 4-feature admission scoring with 3-way quality gate
- Bitemporal state reconciliation (supports / refines / supersedes / conflicts)
- 7-signal hybrid episode formation
- Intent-aware retrieval (7 intent classes)
- Hierarchical consolidation: episode summaries, state trajectories, recurring patterns
- Dream cycles: rehearsal, PMI association learning, importance drift
- ACT-R cognitive scoring with dream-learned association weights
P1: Temporal Linguistic Geometry (SHIPPED 2026-04-19)
- Grammar-based structural retrieval over typed state-transition edges
- 11 intent shapes (current_state, ordinal, causal_chain, sequence, predecessor, interval, transitive_cause, concurrent, before_named, range, noise)
- Zero-confidently-wrong composition invariant with BM25
- Readable syntactic proofs on every grammar answer
- 32/32 top-5 and rank-1 on ADR state-evolution corpus
- Full integration: NCMS_TEMPORAL_ENABLED, ncms tlg status|induce, --tlg benchmark flag
P2: Intent-Slot SLM (SHIPPED 2026-04-20)
- LoRA multi-head classifier (5 heads, one forward pass per memory)
- Replaces admission regex, state-change regex, LLM topic labeller, manual domain tagging, never-shipped preference extractor
- 3-tier fallback chain (LoRA adapter → E5 zero-shot → heuristic)
- Per-deployment adapter training: 4-phase pipeline with pass/fail gate
- 3 reference adapters shipped (conversational / software_dev / clinical, F1=1.000 on gold)
- Dynamic topics (no closed-vocab enum in code; lives in adapter manifest)
- Benchmark runner integration (--intent-slot-domain on LongMemEval)
- Dashboard event + SQLiteStore.list_topics_seen() for config-free topic enumeration
Content-aware ingestion & document model
- Two-class content gate: ATOMIC fragments vs NAVIGABLE documents
- Document Profile model (one profile memory + sections in document store)
- Content-hash deduplication (SHA-256) at store boundary
- Content size gating with importance-based exemptions
- Entity quality filtering (rejects junk: numeric %, hex IDs, count patterns)
Retrieval enhancements
- Level-first retrieval with intent-driven traversal strategies
- Synthesis pipeline with 5 modes (summary, detail, timeline, comparison, evidence)
- Emergent topic map from L4 abstract clustering
- Temporal query parsing with proximity boost
Tools & interfaces
- 26 MCP tools via FastMCP
- HTTP REST API with bearer token auth
- A2A JSON-RPC 2.0 bridge (agent discovery + task routing)
- CLI: ncms serve|demo|dashboard|info|load|lint|reindex|export|maintenance|watch|topics|state|episodes|topic-map|tlg
- Observability dashboard (SSE + D3 graph + entity / episode / state / intent-slot views)
Ingestion & monitoring
- Filesystem watcher with auto-domain classification (ncms watch)
- Matrix-style knowledge loader (MD, JSON, YAML, CSV, HTML, DOCX, PPTX, PDF, XLSX)
- Index rebuild utility (ncms reindex)
- Read-only diagnostics (ncms lint)
- Wiki export (ncms export)
- Background maintenance scheduler
- OpenTelemetry tracing integration
- Prometheus metrics endpoint
Deployment & integration
- NemoClaw integration (MCP config, OpenClaw skill, sandbox blueprint)
- NeMo Agent Toolkit MemoryEditor adapter
- Bus heartbeat + offline detection with auto-snapshot
- Helm chart for Kubernetes
- All-in-one Docker image with pre-baked models
- docker-compose multi-agent hub
Evaluation
- SciFact ablation: nDCG@10=0.7206, exceeds ColBERTv2 (+4.0%) and SPLADE++ (+1.5%)
- SWE-bench Django: Recall AR 0.2032, +15.5% over search; beats Mem0 and Letta on 3 / 4 metrics
- TLG ADR validation: 32 / 32 top-5 and rank-1 across 11 intent shapes
- Intent-Slot LoRA gate: F1=1.000 on gold across 5 heads, 3 reference domains
- Dream cycle benchmark (SciFact, NFCorpus, ArguAna)
- LongMemEval: Recall@5=0.4680 (500 questions, 6 categories)
- MemoryAgentBench harness (AR, TTL, LRU, selective forgetting)
Roadmap (Post-v1)
P3: SWE state-evolution benchmark (planned)
- MSEB v1: four-domain state-evolution benchmark (SWE-bench Verified / PMC Clinical / ADR prose / LongMemEval), 747 hand-audited gold queries stratified by general/temporal/preference/noise, head-to-head vs mem0; results in docs/mseb-results.md; full-scale rerun next
- Reusable JSONL artefact that other memory systems can consume without knowing NCMS internals
- Gates paper milestone M3 ("confidently-wrong = 0 at scale")
Adapter operations (follow-up from Sprint 4)
- ncms train-adapter / adapter-list / adapter-promote CLIs (thin wrappers over the experiment driver)
- Drift detection (dashboard watches per-head confidence distributions, warns on OOD content)
- Generic-domain adapter (one broad adapter shipped with NCMS as Tier-2 fallback)
- LoRA hyperparameter sweep automation
- Encoder comparison (RoBERTa / DistilBERT) for latency / quality tradeoff
Distributed infrastructure
- NATS / Redis-backed Knowledge Bus transport (implementing existing KnowledgeBusTransportProtocol)
- Neo4j / FalkorDB graph backend (implementing existing KnowledgeGraphProtocol)
- BM25-scored surrogate responses
Production validation (requires real agent workloads)
- Simulated Agent Workday benchmark (3-7 day multi-agent workload for ACT-R validation)
- ACT-R weight crossover demonstration (show ACT-R weight becomes beneficial with dream-learned access patterns)
- Rehearsal Boost Rate measurement (validate ≥85% of rehearsed memories show activation increase)
Dashboard & observability
- Historical replay and time-travel debugging (replay memory state at any point in time)
- Intent-slot confidence histogram + drift alerts
See completed milestones and V1 ablation results for development history.
Research Artefacts
Current state (v9):
- Main paper: architecture, SciFact/SWE-bench results, ablation studies
- v9 MSEB findings: Phase G/H/I SLM-signal ablation results, regex-vs-SLM retrieval audit
- MSEB v1 results: four-domain state-evolution benchmark (NCMS hybrid vs mem0 dense)
- v9 domain plugin architecture: YAML-native domain plugins (gazetteer + diversity + archetypes)
Forward-looking (planned):
- CTLG design: query-side cue tagger as a sibling adapter (post-v8 saturation pivot)
- CTLG cue guidelines: annotation rubric for the cue corpus
- CTLG grammar: composition rules from cue tags to TLGQuery
Background:
- Temporal Linguistic Geometry pre-paper: grammar-theoretic framework for state-evolution retrieval
- Intent-Slot Distillation pre-paper: original P2 motivation for the learned multi-head classifier
Sprint-level historical findings (v6/v7 era) live under docs/completed/.
Acknowledgments
- GLiNER: Zero-shot NER by Zaratiana et al. (NAACL 2024)
- SPLADE: Sparse neural retrieval by Formal et al. (SIGIR 2021), powered by sentence-transformers SparseEncoder
- Tantivy: Rust-based full-text search engine
- peft: LoRA adapter implementation (HuggingFace PEFT)
- transformers: BERT encoder for the intent-slot SLM
- safetensors: Adapter artifact serialization
- ACT-R: Cognitive architecture by John R. Anderson
- Linguistic Geometry: Game-state reduction framework by Boris Stilman; inspiration for TLG's zone / trajectory primitives
- BEIR: Heterogeneous IR benchmark by Thakur et al. (NeurIPS 2021)
- NetworkX: Graph library powering the knowledge graph
- litellm: Universal LLM API proxy
- aiosqlite: Async SQLite wrapper
License
MIT
Built for agents that remember, and reason over how knowledge changes.
By Shawn McCarthy / Chief Archeologist
