Tome MCP
MCP server for managing a research paper library: PDFs, bibliography, semantic search, citations, and LaTeX document analysis.
Installation
npx tome-mcp
⚠️ Tome: DEPRECATED
This package is deprecated. Use precis-mcp instead.
precis-mcp unifies paper reading (formerly tome-mcp / acatome-mcp) and manuscript editing into a single MCP server with 4 tools: search(), get(), put(), move().
pip install precis-mcp
Tome (archived)
A Python MCP server that manages a research paper library: PDFs, bibliography, semantic search, figure tracking, and Semantic Scholar integration.
No LLM inside: pure deterministic code. The AI client provides the intelligence; Tome provides the tools.
Developed and tested with Windsurf + Claude Opus 4.6 (thinking). It should work with any MCP-capable client and a sufficiently capable model, but this is the combination it was developed against and works best with.
Installation
pip install tome-mcp
For development (tests, linting):
git clone https://github.com/retospect/tome-mcp.git
cd tome-mcp
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Dependencies
- chromadb: vector database for semantic search (includes built-in all-MiniLM-L6-v2 embeddings, no external server needed)
- PyMuPDF (fitz): PDF text extraction
- bibtexparser ≥ 2.0: BibTeX parsing and serialization
- httpx: HTTP client for CrossRef, Semantic Scholar, Unpaywall APIs
- mcp: Model Context Protocol SDK
- PyYAML: config file parsing
MCP configuration
Quickest setup, using uvx to run without a manual venv:
{
  "mcpServers": {
    "tome": {
      "command": "uvx",
      "args": ["tome-mcp"],
      "env": {
        "TOME_ROOT": "/path/to/your/project",
        "SEMANTIC_SCHOLAR_API_KEY": "optional"
      }
    }
  }
}
Or point your MCP client at a local install:
{
  "mcpServers": {
    "tome": {
      "command": "/path/to/tome/.venv/bin/python",
      "args": ["-m", "tome.server"],
      "env": {
        "TOME_ROOT": "/path/to/your/project",
        "SEMANTIC_SCHOLAR_API_KEY": "optional"
      }
    }
  }
}
Alternatively, use set_root(path='...') at the start of each session.
Quick start: your first session
Once Tome is installed and your MCP client is configured, open your project in the IDE and type these prompts in order:
1. Orient
This is a LaTeX project using the Tome MCP server for paper management. Call
guide('getting-started') to see the tool index, then set_root('/path/to/my/project') to connect.
2. Describe your project (so the LLM builds context)
The book/paper is about [your topic]. The main file is main.tex.
Run toc() to see the document structure and paper() to see the library.
3. Ingest your first paper
I dropped a PDF in tome/inbox/. Ingest it and verify the DOI.
4. Search and cite
Find papers in our library about [topic] and show me relevant quotes.
5. Compile
Compile the document and check for warnings.
That's it. The LLM discovers Tome's tools via guide() and learns your
project structure from the filesystem. From here, explore the built-in
guides: call guide() with no arguments to see all topics.
Environment variables (all optional)
| Variable | Default | Purpose |
|---|---|---|
| TOME_ROOT | (none) | Project root directory (alternative to set_root() or cwd) |
| SEMANTIC_SCHOLAR_API_KEY | (none) | Higher S2 rate limits |
| UNPAYWALL_EMAIL | (none) | Email for Unpaywall open-access PDF lookup |
Directory layout
User-facing (git-tracked)
project-root/
├── tome/
│   ├── references.bib   # AUTHORITATIVE bibliography
│   ├── inbox/           # Drop PDFs here for processing
│   ├── figures/         # Source figure screenshots
│   └── notes/           # LLM-curated paper notes (authorYYYY.yaml)
Cache (gitignored, fully regenerable via tome:rebuild)
project-root/
├── .tome/
│   ├── tome.json               # Derived metadata cache
│   ├── staging/                # Ingest prep area (transient)
│   ├── raw/                    # Extracted text: raw/xu2022/xu2022.p1.txt
│   ├── chroma/                 # ChromaDB persistent storage (embeddings + index)
│   ├── corpus_checksums.json   # Checksum manifest for .tex/.py files
│   └── tome.json.bak           # Safety backup before each write
Data model
Durability tiers
| Tier | Data | Location | Recovery |
|---|---|---|---|
| Source of truth | PDFs, figure screenshots | Vault (~/.tome-mcp/pdf/), tome/figures/ | Unrecoverable |
| Self-contained archives | .tome HDF5 files | Vault (~/.tome-mcp/tome/) | Unrecoverable (contain text + embeddings) |
| Authoritative metadata | Bibliography | tome/references.bib | Git rollback |
| Derived cache | Everything else | .tome-mcp/ | Rebuildable from .tome archives |
.tome archives: HDF5, not zip
Each ingested paper produces a .tome file in the vault. These are HDF5 archives
(opened with h5py, not zipfile). Each archive is fully self-contained:
import json
import os

import h5py

# h5py does not expand "~", so resolve the home directory explicitly.
path = os.path.expanduser('~/.tome-mcp/tome/x/xu2022.tome')
with h5py.File(path, 'r') as f:
    meta = json.loads(f['meta'][()])            # key, title, authors, year, doi, ...
    pages = f['pages'][:]                       # extracted page text (one string per page)
    chunks = f['chunks/texts'][:]               # chunked text for search
    embeds = f['chunks/embeddings'][:]          # (N, 384) float32 vectors
    content_hash = f.attrs['content_hash']      # SHA256 of the source PDF
    embedding_model = f.attrs['embedding_model']  # "all-MiniLM-L6-v2"
All databases (catalog.db, ChromaDB) can be rebuilt from .tome files alone.
references.bib β authoritative
The bib file is the single source of truth for paper metadata. Tome parses it
with bibtexparser and writes back using full parse-modify-serialize (not regex
surgery). A roundtrip test (parse → serialize → parse → compare) runs before
every write; if anything changed unexpectedly, the write aborts.
A .bak copy is made before every write.
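The write guard described above can be sketched roughly as follows. This is a minimal illustration, not Tome's actual code: `guarded_write` and its `parse` / `serialize` parameters are hypothetical names, standing in for whatever wraps bibtexparser in the real implementation.

```python
import os
import shutil
from typing import Any, Callable


def guarded_write(path: str, data: Any,
                  parse: Callable[[str], Any],
                  serialize: Callable[[Any], str]) -> None:
    """Write `data` to `path` only if it survives a roundtrip unchanged."""
    text = serialize(data)
    # Roundtrip check: the serialized text must parse back to the same
    # data and serialize to the same text again.
    reparsed = parse(text)
    if reparsed != data or serialize(reparsed) != text:
        raise ValueError("roundtrip mismatch, write aborted")
    # Keep a .bak copy of the previous version before overwriting it.
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

The point of the design is that a lossy serializer is caught before the file is touched, and the previous version stays on disk as `.bak` either way.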
x-fields (curated, survive .tome/ rebuild)
| Field | Values | Meaning |
|---|---|---|
| x-pdf | true/false | PDF has been ingested (stored in vault) |
| x-doi-status | valid/unchecked/rejected/missing | DOI verification state |
| x-tags | comma-separated | Freeform tags for search filtering |
Key format
authorYYYY[a-c]? is the first author surname + publication year. Collisions get
letter suffixes. Datasheets use manufacturer_partid. Patents use the patent
number.
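As a rough sketch of the authorYYYY scheme for ordinary papers: the function name `make_key`, the accent folding, and the choice to leave the first holder of a key unsuffixed are all assumptions made for illustration, not confirmed Tome behavior.

```python
import re
import unicodedata


def make_key(surname: str, year: int, existing: set[str]) -> str:
    """Return authorYYYY, adding a letter suffix (a-c) on collision."""
    # Assumption: fold accents to ASCII and drop anything that is not a letter.
    folded = unicodedata.normalize("NFKD", surname)
    ascii_name = folded.encode("ascii", "ignore").decode("ascii")
    base = re.sub(r"[^a-z]", "", ascii_name.lower()) + str(year)
    if base not in existing:
        return base
    for suffix in "abc":
        if base + suffix not in existing:
            return base + suffix
    raise ValueError(f"no free key for {base}")


print(make_key("Xu", 2022, {"xu2022"}))
```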
tome.json: derived cache
Rebuilt from references.bib + filesystem on rebuild. Contains
expensive-to-derive operational state:
{
  "version": 1,
  "papers": {
    "xu2022": {
      "title": "...",
      "authors": ["Xu, Y.", "..."],
      "year": 2022,
      "doi": "10.1038/s41586-022-04435-4",
      "s2_id": "CorpusId:12345678",
      "s2_fetched": "2026-02-13",
      "citation_count": 47,
      "cited_by_in_library": ["chen2023"],
      "references_in_library": ["lambert2015"],
      "abstract": "...",
      "file_sha256": "a1b2c3...",
      "pages_extracted": 12,
      "embedded": true,
      "doi_history": [],
      "crossref_fetched": "2026-02-13T19:29:00Z",
      "figures": {
        "fig3": {
          "status": "captured",
          "file": "figures/xu2022_fig3.png",
          "page": 3,
          "reason": "QI transfer diagram",
          "requested": "2026-02-13",
          "captured": "2026-02-13",
          "_caption": "Conductance measurements...",
          "_context": [{"page": 1, "text": "As shown in Fig. 3..."}],
          "_attribution": "Reproduced from Xu et al. (2022), Figure 3"
        }
      }
    }
  },
  "requests": {
    "ouyang2025": {
      "doi": "10.1063/5.0xxx",
      "tentative_title": "Fano interference...",
      "reason": "PDF behind paywall",
      "added": "2026-02-13",
      "resolved": null
    }
  }
}
Fields prefixed with _ are derived (regenerable from raw text extraction).
DOI lifecycle
| Status | Meaning | doi field |
|---|---|---|
| valid | CrossRef resolves, title/authors match | Present, verified |
| unchecked | DOI present, not yet verified | Present, unverified |
| rejected | Was wrong or hallucinated, DOI removed | Absent |
| missing | Never had a DOI | Absent |
Transitions:
- Added with DOI → unchecked
- Added without DOI → missing
- unchecked + check_doi succeeds → valid
- unchecked + check_doi fails → rejected (DOI removed, history in tome.json)
- rejected + set_paper with new DOI → unchecked
- missing + set_paper with DOI → unchecked
Invariant: if x-doi-status = valid, the DOI is trustworthy.
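The lifecycle can be read as a small state machine. The sketch below encodes only the transitions listed above; the event names (`check_ok`, `check_failed`, `set_doi`) and function names are hypothetical labels chosen for the example, not Tome identifiers.

```python
from typing import Optional

# (current status, event) -> next status, taken from the transition list.
TRANSITIONS = {
    ("unchecked", "check_ok"): "valid",
    ("unchecked", "check_failed"): "rejected",
    ("rejected", "set_doi"): "unchecked",
    ("missing", "set_doi"): "unchecked",
}


def initial_status(doi: Optional[str]) -> str:
    """A paper added with a DOI starts unchecked; without one, missing."""
    return "unchecked" if doi else "missing"


def next_status(status: str, event: str) -> str:
    """Apply one event, refusing any transition that is not listed."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"no transition from {status!r} on {event!r}") from None
```

Because `valid` is reachable only through a successful check, the invariant holds by construction in this sketch.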
Ingest pipeline
Two-phase commit
Phase 1: Prepare (writes only to .tome/staging/, reversible)
- Copy PDF from inbox to .tome/staging/{key}/
- Extract PDF metadata (title, authors from doc.metadata)
- Extract first-page text (DOI regex, title heuristic)
- If DOI found → query CrossRef → structured metadata
- If no DOI but title found → query Semantic Scholar → metadata
- Extract text page-by-page
- Chunk (500 chars, 100 overlap, sentence boundaries)
- Return proposal to LLM (suggested key, extracted vs API metadata)
The LLM reviews the proposal and confirms or corrects.
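The chunking step in Phase 1 (500 chars, 100 overlap, sentence boundaries) could look roughly like this. It is a simplified greedy sketch under stated assumptions, not the code in chunk.py: the name `chunk_text` and the regex-based sentence splitter are assumptions, and a single sentence longer than `size` is emitted whole rather than split.

```python
import re


def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Pack whole sentences into chunks of about `size` characters."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > size:
            chunks.append(" ".join(current))
            # Carry trailing sentences (up to `overlap` chars) into the
            # next chunk so neighbouring chunks share some context.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping on sentence boundaries keeps every chunk readable on its own, which matters when a chunk is returned to the LLM as a quote.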
Phase 2: Commit (fast, ordered for crash safety)
- Write bib entry to tome/references.bib (via bibtexparser)
- Copy PDF to vault (~/.tome-mcp/pdf/), write .tome archive
- Move staging artifacts → .tome-mcp/raw/, .tome-mcp/cache/
- Upsert into ChromaDB
- Update .tome/tome.json
- Clean up staging dir
If commit fails partway: staging dir still exists, inbox file may already be
gone but bib entry exists. rebuild reconciles.
Verification
The LLM performs title/author verification (not Tome). Tome extracts metadata from the PDF and from APIs and returns both to the LLM. The LLM handles fuzzy matching (encoding variants like ç/c, abbreviations, reordering).
Corpus indexing (.tex / .py files)
Separate from papers. Living documents that change frequently.
Sync model
sync_corpus or lazy sync on search_corpus:
- Scan glob patterns (e.g. sections/*.tex)
- Checksum each file (SHA256)
- Compare against .tome/corpus_checksums.json
- Changed files: delete old ChromaDB entries, re-chunk, re-embed, insert
- Deleted files: remove from ChromaDB
- New files: add to ChromaDB
- Unchanged files: skip
ChromaDB collections: paper_pages, paper_chunks, corpus_chunks (separate).
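The classification step of the sync model (new, changed, deleted, unchanged) can be sketched with the standard library alone. The names `sha256_file` and `plan_sync` are hypothetical, and the manifest is assumed here to be a flat mapping of relative path to hex digest; the actual layout of corpus_checksums.json may differ.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex SHA256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_sync(root: Path, patterns: list[str], manifest: dict[str, str]):
    """Classify files against stored checksums; return (plan, new manifest)."""
    current = {
        p.relative_to(root).as_posix(): sha256_file(p)
        for pattern in patterns
        for p in sorted(root.glob(pattern))
        if p.is_file()
    }
    plan = {
        "new": sorted(k for k in current if k not in manifest),
        "changed": sorted(
            k for k in current if k in manifest and manifest[k] != current[k]
        ),
        "deleted": sorted(k for k in manifest if k not in current),
    }
    return plan, current
```

Only the paths in `new` and `changed` would need re-chunking and re-embedding; unchanged files appear in neither list and are skipped.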
MCP tools
Many formerly separate tools have been unified into multi-action tools.
Call guide() for the full topic index, or guide('getting-started') for orientation.
Paper management
| Tool | Description |
|---|---|
| paper | Unified: get/set/list/remove/request/stats. No args = library stats. key = metadata + notes. action='list' = browse. |
| ingest | Process inbox PDFs. Without confirm: proposes key + metadata. With confirm=True: commits to library + vault. |
| notes | Read/write/clear paper notes or file meta. Paper notes in tome/notes/, file meta in % === FILE META blocks. |
| link_paper | Link/unlink a vault paper to the current project. No args = list linked papers. |
Search & navigation
| Tool | Description |
|---|---|
| search | Unified search: scope (all/papers/corpus/notes) × mode (semantic/exact). Filters: key, keys, tags, paths. |
| toc | Document structure: locate (heading/cite/label/index/tree). Replaces old doc_tree, find_cites, list_labels. |
Document analysis
| Tool | Description |
|---|---|
| doc_lint | Structural issues: undefined refs, orphan labels, shallow cites, tracked patterns. |
| dep_graph | Labels, refs, cites for a single .tex file. |
| review_status | Tracked marker counts from tome/config.yaml patterns. |
| validate_deep_cites | Verify deep-cite quotes against source PDF text in ChromaDB. |
Discovery & exploration
| Tool | Description |
|---|---|
| discover | Unified: federated search (S2 + OpenAlex), citation graph, shared citers, refresh, stats, lookup. |
| cite_graph | S2 citation graph (who cites this paper, what it cites). Flags in-library papers. |
| explore | LLM-guided citation beam search: fetch, triage, expand, dismiss. |
DOI & figures
| Tool | Description |
|---|---|
| doi | Unified DOI management: verify, reject, list rejected, fetch open-access PDF (via Unpaywall → inbox). |
| figure | Request, register, or list figures. No args = list all. |
Task tracking
| Tool | Description |
|---|---|
| needful | List N most urgent tasks, or mark a task as done. Ranked by never-done > changed > overdue. |
| file_diff | Git diff annotated with LaTeX section headings. |
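To make the needful ranking (never-done > changed > overdue) concrete, here is a small sorting sketch. The `Task` fields and both function names are hypothetical; how Tome actually stores task state is not described here, and breaking ties among overdue tasks by days overdue is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    never_done: bool = False
    changed: bool = False
    days_overdue: int = 0


def urgency(task: Task) -> tuple[int, int]:
    """Lower tuples sort first: never-done, then changed, then overdue."""
    if task.never_done:
        return (0, 0)
    if task.changed:
        return (1, 0)
    if task.days_overdue > 0:
        return (2, -task.days_overdue)  # assumption: most overdue first
    return (3, 0)


def most_urgent(tasks: list[Task], n: int) -> list[Task]:
    """Return the n most urgent tasks in ranked order."""
    return sorted(tasks, key=urgency)[:n]
```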
Maintenance
| Tool | Description |
|---|---|
| set_root | Switch project root. Scaffolds directories. Surfaces open issues. |
| reindex | Re-index papers, corpus files, or both. Rebuilds from vault archives. |
| guide | On-demand usage guides. Call without args for topic index. |
| report_issue | Log a tool issue to tome/issues.md (git-tracked). |
Tool descriptions
Every tool has a carefully written MCP description (~100 words) using consistent
terminology. Tool responses include a next_steps field when follow-up action
is needed.
Terminology (used in all descriptions)
| Term | Meaning |
|---|---|
| library | The collection of papers in tome/references.bib |
| key | The bib key, e.g. miller1999. Same as \cite{miller1999} |
| has_pdf | Whether a PDF has been ingested (exists in vault) |
| inbox | tome/inbox/: drop PDFs here for processing |
Error handling
All errors are specific exception classes with messages that tell the LLM what went wrong and what to do about it.
TomeError (base)
├── PaperNotFound: key not in library
├── PageOutOfRange: page N requested, paper has M pages
├── DuplicateKey: key already exists
├── DOIResolutionFailed: CrossRef error (404, 429, 5xx)
├── IngestFailed: could not identify paper from PDF
├── BibParseError: bib file could not be parsed
├── BibWriteError: roundtrip test failed, write aborted
├── ChromaDBError: search index init/query failed
├── ConfigError (base): project configuration issue
│   ├── ConfigMissing: no tome/config.yaml found
│   ├── RootNotFound: named root not in config
│   ├── RootFileNotFound: root .tex file doesn't exist on disk
│   ├── NoBibFile: no references.bib yet
│   ├── NoTexFiles: tex_globs matched no files
│   └── UnpaywallNotConfigured: no email for Unpaywall API
├── APIError: external API error (CrossRef, S2, Unpaywall)
├── TextNotExtracted: paper exists but no raw text yet
├── FigureNotFound: no such figure for paper
└── UnsafeInput: path traversal or unsafe characters
Every error message includes: what happened, why, and what to do next.
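A minimal sketch of how such a hierarchy might enforce the what/why/next-step convention. The class names come from the tree above; the constructor signature and message wording are assumptions for illustration, not the contents of errors.py.

```python
class TomeError(Exception):
    """Base class: each message states what happened, why, and what to do."""

    def __init__(self, what: str, why: str, next_step: str) -> None:
        super().__init__(f"{what} {why} {next_step}")
        self.next_step = next_step


class PaperNotFound(TomeError):
    """Raised when a key is not in the library."""

    def __init__(self, key: str) -> None:
        # Illustrative wording only; the real messages may differ.
        super().__init__(
            f"Paper '{key}' not found.",
            "The key is not in the library.",
            "List the library to see valid keys.",
        )
        self.key = key
```

Requiring all three parts in the base constructor means a subclass cannot be written without telling the LLM how to recover.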
Testing
- Every module gets a corresponding test_*.py
- Tests use small fixtures (2-entry bib, 1-page PDF mock)
- Error paths tested explicitly (more important than happy paths for MCP)
- External services (CrossRef, S2) are mocked
- Integration tests requiring live services marked @pytest.mark.integration
- pytest with no marks runs all unit tests (no network required)
Package structure
~/repos/tome/
├── pyproject.toml
├── README.md
├── LICENSE                       # AGPL-3.0
├── .gitignore
├── examples/
│   └── config.yaml               # Full config example (all features)
├── src/
│   └── tome/
│       ├── __init__.py
│       ├── __main__.py           # python -m tome.server entry point
│       ├── py.typed              # PEP 561 type marker
│       ├── server.py             # MCP server + tool handlers
│       ├── errors.py             # Exception hierarchy
│       ├── config.py             # Project config (config.yaml parsing)
│       ├── manifest.py           # tome.json read/write (atomic, backup)
│       ├── bib.py                # BibTeX parser + writer (bibtexparser)
│       ├── extract.py            # PDF text extraction (PyMuPDF)
│       ├── chunk.py              # Sentence-boundary overlapping chunker
│       ├── store.py              # ChromaDB management (built-in embeddings)
│       ├── checksum.py           # SHA256 file checksumming
│       ├── identify.py           # PDF identification + key generation
│       ├── crossref.py           # CrossRef API client
│       ├── semantic_scholar.py   # Semantic Scholar API client
│       ├── openalex.py           # OpenAlex API client
│       ├── unpaywall.py          # Unpaywall open-access PDF lookup
│       ├── http.py               # Shared HTTP client utilities
│       ├── figures.py            # Figure request/registration + caption extraction
│       ├── notes.py              # Paper notes (YAML + ChromaDB indexing)
│       ├── issues.py             # Issue tracking (tome/issues.md)
│       ├── analysis.py           # LaTeX document analysis (labels, refs, cites)
│       ├── latex.py              # LaTeX parsing utilities
│       ├── toc.py                # Table of contents parsing
│       ├── index.py              # Back-of-book index (.idx parsing)
│       ├── find_text.py          # Normalized .tex source search
│       ├── grep_raw.py           # Normalized PDF raw text grep
│       ├── validate.py           # Path traversal + input validation
│       ├── git_diff.py           # Git diff with LaTeX section annotations
│       ├── cite_tree.py          # Citation tree (S2 graph caching)
│       ├── s2ag.py               # Local S2AG database (offline citations)
│       ├── s2ag_cli.py           # S2AG CLI utilities
│       ├── needful.py            # Recurring task tracking
│       ├── summaries.py          # File content summaries
│       ├── guide.py              # On-demand usage guide loader
│       ├── filelock.py           # Cross-process file locking
│       └── docs/                 # Built-in guide markdown files (11)
└── tests/
    ├── conftest.py               # Shared fixtures
    ├── test_analysis.py
    ├── test_bib.py
    ├── test_checksum.py
    ├── test_chunk.py
    ├── test_cite_tree.py
    ├── test_concurrent_bib.py
    ├── test_config.py
    ├── test_crossref.py
    ├── test_discovery.py
    ├── test_errors.py
    ├── test_extract.py
    ├── test_figures.py
    ├── test_filelock.py
    ├── test_git_diff.py
    ├── test_grep_raw.py
    ├── test_guide.py
    ├── test_http.py
    ├── test_identify.py
    ├── test_index.py
    ├── test_issues.py
    ├── test_latex.py
    ├── test_manifest.py
    ├── test_needful.py
    ├── test_notes.py
    ├── test_openalex.py
    ├── test_semantic_scholar.py
    ├── test_store.py
    ├── test_summaries.py
    ├── test_toc.py
    ├── test_unpaywall.py
    └── test_validate.py
