io.github.joesaby/doctree-mcp
BM25 search + tree navigation over markdown docs for AI agents. No embeddings, no LLM calls.
doctree-mcp
Agentic document retrieval over markdown, CSV, and JSONL. BM25 + tree navigation via MCP – no vector DB, no embeddings, no LLM calls at index time.
The pitch: MCP provides the structural primitives (a navigable tree, BM25, glossary, row lookup). The bundled skills provide the procedural knowledge (how to walk that tree). Together the agent behaves like a trained research librarian – not a one-shot searcher. See The Skill + MCP Pattern.
Quick Start
Have docs already? Point a client at them:
```jsonc
// In your AI tool's MCP config – see docs/CLIENTS.md for per-tool snippets
{ "mcpServers": { "doctree": {
  "command": "bunx", "args": ["doctree-mcp"],
  "env": { "DOCS_ROOT": "./docs", "WIKI_WRITE": "1" }
} } }
```
Restart the tool – ask "search the docs for X" or invoke the `doc-read` prompt.
Starting fresh? Scaffold a Karpathy-style LLM wiki:
```sh
bunx doctree-mcp init            # configure the current tool
bunx doctree-mcp init --all      # configure every supported client
bunx doctree-mcp init --dry-run  # preview without writing anything
```
Creates docs/wiki/ (LLM-maintained) + docs/raw-sources/ (your inputs), writes the MCP config, installs a post-write lint hook, appends wiki conventions to CLAUDE.md / AGENTS.md / .cursor/rules/.
Operation Modes
| Mode | Use when | Guide |
|---|---|---|
| stdio (default) | Local dev, agent on your machine | Client setup |
| HTTP (Streamable HTTP) | Teams, CI, hosted agents | Deployment – Railway · Fly · Render · Cloudflare Containers · Docker |
| CLI | `init`, `lint`, `debug-index` | Operation modes |
Full decision tree: Operation Modes.
How It Works – Retrieve · Curate · Add
```
Agent: "How does token refresh work?"
→ search_documents("token refresh")
    #1 auth/middleware.md § Token Refresh Flow      score: 12.4
    #2 auth/oauth.md      § Refresh Token Lifecycle score: 8.7
→ get_tree("docs:auth:middleware")
    [n1] # Auth Middleware
    [n4]   ## Token Refresh Flow
    [n5]     ### Automatic Refresh
→ navigate_tree("docs:auth:middleware", "n4")  →  n4 + descendants
```
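The ranking step above is plain BM25 over section-sized chunks. As a rough illustration of how such scoring works – a minimal sketch, not doctree-mcp's actual implementation; the tokenizer and the k1/b constants are assumptions:

```typescript
// Minimal BM25 sketch. Illustrative only – not doctree-mcp's real index.
type Doc = { id: string; text: string };

const K1 = 1.2;  // term-frequency saturation
const B = 0.75;  // document-length normalization

function tokenize(s: string): string[] {
  return s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Rank(docs: Doc[], query: string): { id: string; score: number }[] {
  const toks = docs.map((d) => tokenize(d.text));
  const avgLen = toks.reduce((n, t) => n + t.length, 0) / docs.length;
  const df = new Map<string, number>(); // document frequency per term
  for (const t of toks)
    for (const term of new Set(t)) df.set(term, (df.get(term) ?? 0) + 1);

  return docs
    .map((d, i) => {
      let score = 0;
      for (const q of tokenize(query)) {
        const tf = toks[i].filter((t) => t === q).length;
        if (tf === 0) continue;
        const n = df.get(q) ?? 0;
        const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
        const norm = K1 * (1 - B + (B * toks[i].length) / avgLen);
        score += (idf * tf * (K1 + 1)) / (tf + norm);
      }
      return { id: d.id, score };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = bm25Rank(
  [
    { id: "auth/middleware", text: "token refresh flow automatic token refresh" },
    { id: "auth/oauth", text: "refresh token lifecycle and grants" },
    { id: "deploy", text: "railway fly render docker" },
  ],
  "token refresh",
);
// ranked[0].id === "auth/middleware" – repeated query terms win
```

Because the score is a sum of explainable per-term contributions, the ranking can be audited term by term – one reason the README contrasts it with opaque cosine scores.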
Core read tools (always on):
| Tool | Purpose |
|---|---|
| `search_documents` | BM25 keyword search + facet filters + glossary expansion (markdown · CSV · JSONL) |
| `get_tree` | Table of contents – headings, word counts, summaries |
| `get_node_content` | Full text of a specific section by node ID |
| `navigate_tree` | A section plus all its descendants in one call |
| `lookup_row` | O(1) exact-key lookup for structured data rows (e.g. `PROJ-44`) |
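The O(1) claim for `lookup_row` implies a precomputed key index rather than a scan. A plausible sketch – the row shape and id-column handling are assumptions, not doctree-mcp's real internals:

```typescript
// Sketch: back an O(1) exact-key row lookup with a Map keyed on the
// detected id column, built once at index time.
type Row = Record<string, string>;

function buildRowIndex(rows: Row[], idColumn: string): Map<string, Row> {
  const index = new Map<string, Row>();
  for (const row of rows) {
    const key = row[idColumn];
    if (key !== undefined) index.set(key, row); // last row wins on duplicate keys
  }
  return index;
}

const rowIndex = buildRowIndex(
  [
    { id: "PROJ-44", title: "Fix token refresh", status: "open" },
    { id: "PROJ-45", title: "Add glossary entries", status: "done" },
  ],
  "id",
);
rowIndex.get("PROJ-44"); // hash lookup – no scan over rows
```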
Wiki write tools (opt-in with `WIKI_WRITE=1`):
| Tool | Purpose |
|---|---|
| `find_similar` | Duplicate detection with overlap ratios |
| `draft_wiki_entry` | Scaffold: suggested path, inferred frontmatter, glossary hits |
| `write_wiki_entry` | Validated write: path containment, schema, duplicate guards, dry-run |
Safety: path containment · frontmatter validation · duplicate detection · dry-run · overwrite protection.
Deprecated aliases (`list_documents`, `find_files`, `find_symbol`) are superseded by `search_documents` – still functional, no longer recommended.
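To illustrate the dry-run path: an agent can validate a write before touching disk. The `tools/call` envelope below is the standard MCP request shape, but the argument names are hypothetical – consult the tool's advertised input schema:

```jsonc
// Hypothetical arguments – check the server's actual tool schema.
{
  "jsonrpc": "2.0", "id": 7, "method": "tools/call",
  "params": {
    "name": "write_wiki_entry",
    "arguments": {
      "path": "wiki/auth/token-refresh.md",
      "content": "---\ntitle: \"Token Refresh\"\n---\n…",
      "dry_run": true
    }
  }
}
```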
The Skill + MCP Pattern
Most retrieval tools hand the agent a search box and hope for the best. doctree-mcp hands it a tree, and the bundled skills teach it how to walk one.
- MCP = structural primitives. `search_documents`, `get_tree`, `navigate_tree`, `get_node_content`, and `lookup_row` return tree positions the agent reasons over – not finished answers.
- Skills = procedural knowledge. `/doc-read`, `/doc-write`, and `/doc-lint` encode breadcrumb drill-down: search → outline → navigate → retrieve. The agent learns the policy, not just the API.
That pairing doesn't exist cleanly elsewhere:
| Approach | Primitive | Skill teaches | Gap |
|---|---|---|---|
| Managed hybrid RAG (Cloudflare AI Search, Nia) | Flat chunks + similarity | – | Black-box score, no audit trail |
| Tool-returns-answer (Context7) | 2 tools returning answers | Query shape | Agent can't reason about skipped content |
| Skill-over-CLI (QMD) | CLI over flat search | Query expansion | No tree to navigate |
| doctree-mcp + `/doc-read` | Navigable tree | Breadcrumbs, multi-instance routing, wiki compilation | – |
Why iterative retrieval wins:
- Context rot. Stuffing a 1M-token window with chunks degrades output. Breadcrumb navigation keeps working memory small.
- Auditability. `search_documents → get_tree → navigate_tree → get_node_content` is a replayable trail. A cosine score is not. Regulated domains can ship the former.
- Progressive disclosure. Fewer navigable primitives beat tool sprawl (cf. Cloudflare Code Mode).
Multi-instance = client-side federation. Register several doctree servers under different names; the `/doc-read` skill encodes the routing policy. Add or remove instances without touching the skill. See Client setup – Multi-instance routing.
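For example, two instances registered side by side – the names and paths here are illustrative, following the same config shape as the Quick Start:

```json
{ "mcpServers": {
    "doctree-wiki": {
      "command": "bunx", "args": ["doctree-mcp"],
      "env": { "DOCS_ROOT": "./docs/wiki", "WIKI_WRITE": "1" }
    },
    "doctree-rfcs": {
      "command": "bunx", "args": ["doctree-mcp"],
      "env": { "DOCS_ROOT": "./rfcs" }
    }
} }
```

The skill then decides per question which instance to query; adding or dropping an instance is a config-only change.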
The LLM Wiki Pattern
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Raw Sources    │     │     The Wiki     │     │    The Schema    │
│   (immutable)    │ ──▶ │ (LLM-maintained) │ ◀── │   (you define)   │
│   notes · logs   │     │  runbooks · refs │     │ CLAUDE.md rules  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
```
Inspired by Karpathy's LLM Wiki. Full walkthrough: docs/LLM-WIKI-GUIDE.md.
Configuration (summary)
```yaml
---
title: "Descriptive Title"
description: "One-line summary – boosts ranking"
tags: [relevant, terms]
type: runbook   # runbook | guide | reference | tutorial | architecture | adr
category: auth
---
```
All non-reserved frontmatter fields become filter facets:
```
search_documents("auth", filters: { type: "runbook", tags: ["production"] })
```
Common env vars:
| Variable | Default | Description |
|---|---|---|
| `DOCS_ROOT` | `./docs` | Docs folder |
| `DOCS_GLOB` | `**/*.md` | Comma-separated globs (`**/*.md,**/*.csv,**/*.jsonl`) |
| `DOCS_ROOTS` | – | Weighted multi-collection (`./wiki:1.0,./rfcs:0.5`) |
| `PORT` | `3100` | HTTP mode port |
| `WIKI_WRITE` | (unset) | `1` enables write tools |
| `GLOSSARY_PATH` | `$DOCS_ROOT/glossary.json` | Query-expansion glossary |
Full reference: docs/CONFIGURATION.md.
Glossary – place `glossary.json` in the docs root for bidirectional query expansion:
```json
{ "CLI": ["command line interface"], "K8s": ["kubernetes"] }
```
Acronym definitions like "TLS (Transport Layer Security)" are also auto-extracted.
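"Bidirectional" means a query mentioning an acronym also matches its expansion, and vice versa. A sketch of the idea – the naive substring matching here is illustrative, not doctree-mcp's real matching logic:

```typescript
// Sketch of bidirectional glossary expansion. Substring matching is naive
// (e.g. "cli" would match inside "client") – illustrative only.
type Glossary = Record<string, string[]>;

function expandQuery(query: string, glossary: Glossary): string[] {
  const q = query.toLowerCase();
  const terms = new Set([q]);
  for (const [acronym, expansions] of Object.entries(glossary)) {
    // forward: acronym in query -> add its expansions
    if (q.includes(acronym.toLowerCase()))
      for (const e of expansions) terms.add(e.toLowerCase());
    // reverse: expansion in query -> add the acronym
    if (expansions.some((e) => q.includes(e.toLowerCase())))
      terms.add(acronym.toLowerCase());
  }
  return [...terms];
}

const glossary: Glossary = { CLI: ["command line interface"], K8s: ["kubernetes"] };
expandQuery("K8s networking", glossary); // adds "kubernetes"
```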
Structured data – CSV/JSONL files become documents where each row is a tree node. Column roles (id, title, description, facets, URL) are auto-detected from headers. See docs/STRUCTURED-DATA.md.
Running from Source
```sh
git clone https://github.com/joesaby/doctree-mcp.git
cd doctree-mcp && bun install

DOCS_ROOT=./docs bun run serve       # stdio
DOCS_ROOT=./docs bun run serve:http  # HTTP (port 3100)
DOCS_ROOT=./docs bun run index       # CLI: inspect indexed output
bun test
```
Performance
| Operation | Time | Token cost |
|---|---|---|
| Full index (900 docs) | 2–5s | 0 |
| Incremental re-index | ~50ms | 0 |
| Search | 5–30ms | ~300–1K tokens |
| Tree outline | <1ms | ~200–800 tokens |
Docs
Setup & operation
- Operation Modes – stdio · HTTP · CLI
- Client Setup – Claude Code · Cursor · Windsurf · Codex · OpenCode · Claude Desktop
- Deployment – Railway · Fly.io · Render · Cloudflare Containers · Docker
- Configuration – env vars, frontmatter, ranking tuning
Patterns & concepts
- LLM Wiki Guide – agent-maintained knowledge base walkthrough
- Structured Data – CSV / JSONL indexing
- Architecture & Design – BM25 internals, tree navigation
- Competitive Analysis – PageIndex, QMD, GitMCP, Context7, managed RAG
Source
- Prompts – MCP prompt templates
- Skills: `/doc-read` · `/doc-write` · `/doc-lint`
Standing on Shoulders
- PageIndex – hierarchical tree navigation
- Pagefind by CloudCannon – BM25 scoring, positional index, facets
- Bun.markdown by Oven – native CommonMark parser
- Karpathy's LLM Wiki – the LLM-maintained wiki pattern
License
MIT
