PullMD
Self-hosted URL-to-Markdown service for humans and AI agents.
PullMD takes any web URL and returns clean, readable Markdown: no navigation, no ads, no boilerplate. It auto-detects Reddit threads (with full comment trees), uses Cloudflare's native Markdown when available, runs Mozilla Readability + Trafilatura on static HTML, and as a last resort renders JavaScript-heavy pages via headless Chromium (Playwright sidecar) before extracting.
It ships as:
- a PWA frontend with dark/paper themes, history, archive, and share links
- a REST API at GET /api?url=…
- an MCP server at POST /mcp (Streamable-HTTP transport, stateless)
- a Claude Code skill as a downloadable zip
Every conversion gets an 8-hex share id that works as a stable live endpoint: GET /s/:id returns the cached markdown and re-fetches from the source if it is older than one hour. Use the share id as a fixed URL that always returns fresh content, which is useful for subreddit feeds and similar.
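The one-hour refresh rule can be sketched as follows; `needsRefresh` and the `fetchedAt` field are hypothetical names for illustration, not PullMD's actual internals:

```javascript
// Minimal sketch of the /s/:id refresh rule: serve the cached markdown,
// but re-fetch from the source once the row is older than one hour.
const ONE_HOUR_MS = 60 * 60 * 1000;

function needsRefresh(row, now = Date.now()) {
  return now - row.fetchedAt > ONE_HOUR_MS;
}

console.log(needsRefresh({ fetchedAt: Date.now() - 2 * ONE_HOUR_MS })); // true
console.log(needsRefresh({ fetchedAt: Date.now() })); // false
```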
Quick start
Pre-built multi-arch images (linux/amd64, linux/arm64) live on Docker
Hub. Drop the compose file somewhere and run:
mkdir pullmd && cd pullmd
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.yml
docker compose up -d
# → http://localhost:3000
That's it. No .env needed: every variable has a sensible default
and PullMD listens on port 3000. Add a .env next to the compose
file to override anything (see Configuration).
docker-compose.yml (zero-config)
services:
pullmd:
image: aeternalabshq/pullmd:latest
container_name: pullmd
restart: unless-stopped
ports:
- "${PORT:-3000}:3000"
environment:
- PUBLIC_URL=${PUBLIC_URL:-http://localhost:${PORT:-3000}}
- TRAFILATURA_URL=http://trafilatura:8001/extract
- PLAYWRIGHT_URL=http://playwright:8002/render
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID:-}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET:-}
- REDDIT_USER_AGENT=${REDDIT_USER_AGENT:-}
volumes:
- ./data:/data
networks:
- pullmd-internal
depends_on:
- trafilatura
- playwright
trafilatura:
image: aeternalabshq/pullmd-trafilatura:latest
container_name: pullmd-trafilatura
restart: unless-stopped
networks:
- pullmd-internal
playwright:
image: aeternalabshq/pullmd-playwright:latest
container_name: pullmd-playwright
restart: unless-stopped
networks:
- pullmd-internal
networks:
pullmd-internal:
driver: bridge
Note: the Playwright sidecar adds ~3.7 GB to your image cache (Chromium + Firefox + WebKit binaries from the official Playwright base image). It's optional: leave PLAYWRIGHT_URL unset and the playwright service block out, and PullMD silently degrades to static extraction with a fallback note in the metadata.
Mirror on GHCR: ghcr.io/aeternalabshq/{pullmd,pullmd-trafilatura,pullmd-playwright}. Replace the image: lines if you prefer GitHub's registry.
Behind Traefik
For deployments behind Traefik with TLS, use docker-compose.traefik.yml
instead. Same images, but with Traefik labels and the proxy external
network. Set HOST_DOMAIN in .env:
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.traefik.yml
echo "HOST_DOMAIN=pullmd.example.com" > .env
docker compose -f docker-compose.traefik.yml up -d
Local development (no Docker)
git clone https://github.com/AeternaLabsHQ/pullmd.git
cd pullmd
npm install
npm start # http://localhost:3000
npm test # node --test
Configuration
All variables go in .env (copy from .env.example):
| Variable | Required | Purpose |
|---|---|---|
| HOST_DOMAIN | yes | Public hostname without scheme. Used by Traefik routing and as fallback for PUBLIC_URL. |
| PUBLIC_URL | no | Full public origin embedded in /help and the skill zip. Defaults to https://${HOST_DOMAIN}. |
| TRAFILATURA_URL | no | URL of the Trafilatura sidecar's /extract endpoint. Unset → skip Trafilatura, Readability only. |
| PLAYWRIGHT_URL | no | URL of the Playwright sidecar's /render endpoint. Unset → skip the Playwright fallback for JS pages. |
| REDDIT_CLIENT_ID | no | OAuth credentials for Reddit. Without them, PullMD uses the public JSON API (lower rate limit). |
| REDDIT_CLIENT_SECRET | no | OAuth secret, paired with REDDIT_CLIENT_ID. |
| REDDIT_USER_AGENT | no | Reddit requires a unique UA. Default: PullMD/1.0 (URL-to-Markdown service). |
| DISABLE_PUBLIC_HISTORY | no | When true, hides the global recent-conversions list and archive (/api/history + /api/archive return 403, frontend hides the section). /s/:id share links keep working. Default: false. |
| PULLMD_USER_AGENT | no | Pin a single outbound User-Agent for every web fetch. Disables rotation. Useful for CI or when one specific UA is known to work. |
| PULLMD_UA_FEED_URL | no | URL of a JSON feed of current real-world UAs. Default: WinFuture23/real-world-user-agents. Set to an empty string to disable live refresh and rely on the built-in seed pool. |
| PULLMD_AUTH_MODE | no | disabled (default) / single-admin / multi-user. See "Authentication" below. |
| PULLMD_ADMIN_EMAIL | on first startup, when PULLMD_AUTH_MODE != disabled | Bootstrap email for the first admin user. |
| PULLMD_ADMIN_PASSWORD | on first startup, when PULLMD_AUTH_MODE != disabled | Bootstrap password (min 8 chars). |
| PULLMD_AUTH_TOKEN | no | Legacy bearer-token compatibility (single-admin mode only, deprecated). |
PUBLIC_URL matters for self-hosting: the help page and downloadable
skill embed it as the canonical endpoint. Set it correctly and your
users get a copy-paste setup that points at your instance.
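As a concrete illustration, a hypothetical .env next to the compose file might look like this; every value below is a placeholder, not a recommended setting:

```shell
# Hypothetical .env overrides (placeholders only; adjust for your host).
PORT=8080
PUBLIC_URL="https://pullmd.example.com"
REDDIT_USER_AGENT="PullMD/1.0 (self-hosted at pullmd.example.com)"
DISABLE_PUBLIC_HISTORY=true
```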
PullMD rotates its outbound User-Agent for the web fetch path from a
pool of current desktop browsers, refreshed every 48 hours from a
live feed of real-world UAs
maintained by @WinFuture23. A built-in
seed pool ensures rotation works even when the feed is unreachable. Set
PULLMD_USER_AGENT to pin a single UA, or PULLMD_UA_FEED_URL to point
at your own feed. The Reddit path keeps its dedicated REDDIT_USER_AGENT
because Reddit's API expects a stable, identifying UA.
DISABLE_PUBLIC_HISTORY=true is the privacy switch for shared
instances (multi-tenant VPS, office deployments). Conversions still
get cached and assigned share IDs; users just can't see what other
users have fetched. Anyone with a known /s/:id link still gets
their markdown back. Use this as a stopgap until per-user scoping
lands.
Authentication (v2.0+)
Pulling v2.x: use the explicit :2 tag (or :2.0, :2.0.0). The :latest tag remains on v1.x for backward compatibility until v2.x has stabilized in real-world deployments:
services:
  pullmd:
    image: aeternalabshq/pullmd:2
PullMD ships with three auth modes. Pick one with PULLMD_AUTH_MODE:
| Mode | Behavior |
|---|---|
| disabled | Default. No auth, everything open. Existing v1.x behavior. |
| single-admin | One user, credentials from env vars. No self-signup. For homelab use. |
| multi-user | Self-signup at /signup, login at /login, per-user data isolation. |
In single-admin and multi-user modes, PULLMD_ADMIN_EMAIL + PULLMD_ADMIN_PASSWORD bootstrap the first admin user on first startup. After that, changing these env vars does not change the password β use the admin CLI:
docker compose exec pullmd node scripts/admin.js reset-password you@example.com
Auth boundary
| Endpoint | Auth required (when mode != disabled) |
|---|---|
| /, /help, static assets, /web-reader.zip | no |
| /login, /signup, /api/me (auth surface) | no |
| /s/:id (share links) | no |
| /api, /api/stream | yes |
| /mcp | yes |
| /api/history, /api/archive | yes |
| /api/cache/:id, DELETE /api/cache | yes |
| /api/stats, /api/storage, /api/config (aggregate) | no |
Authentication paths
- Session cookies: POST /login sets pullmd_session (HttpOnly, SameSite=Lax, Secure over HTTPS, 7-day TTL with sliding expiry). The PWA uses this automatically.
- API keys: generate at /settings, send via Authorization: Bearer pmd_<32-char-base62>. Stored as SHA-256 hashes; only shown once at creation.
- Legacy PULLMD_AUTH_TOKEN: deprecated, single-admin mode only. Maps to the admin user. Kept for migration compatibility, removed in v3.0.
Migration from v1.x
See MIGRATION.md for the full upgrade checklist. The TL;DR: leave PULLMD_AUTH_MODE unset and v2.0 behaves exactly like v1.x.
AI-agent integration
Three install paths. Once your instance is running, ${PULLMD_URL}/help
shows the same boxes with your URL pre-filled. Replace ${PULLMD_URL}
below with your hostname (e.g. https://pullmd.example.com).
1. Universal prompt
Drop into any chat agent (ChatGPT, Claude, Gemini, …):
When you need to read a web page, fetch via PullMD instead of raw HTML:
GET ${PULLMD_URL}/api?url=<URL>
Returns clean Markdown (text/markdown). Optional query params:
comments=false skip Reddit comments
comment_depth=N comment nesting depth (default 3)
frontmatter=true prepend YAML metadata block
format=text strip Markdown, return plain text
nocache=true bypass the 1h cache and refetch
render=force|skip override the auto Playwright fallback
lang=de|en language for the comments section header
Response headers worth checking:
X-Source reddit | cloudflare | readability | playwright
X-Quality 0.0-1.0 extraction confidence
X-Share-Id 8-hex permalink, openable as /s/<id>
Reddit URLs are auto-detected (incl. redd.it short links and /s/ shares).
Use this whenever you would otherwise fetch raw HTML; the markdown is
much cleaner and saves significant context-window space.
2. Claude Code skill
web-reader.zip is auto-built with your URL embedded:
curl -O ${PULLMD_URL}/web-reader.zip
mkdir -p ~/.claude/skills
unzip web-reader.zip -d ~/.claude/skills/
# Restart Claude Code; the skill activates on web-reading requests.
3. MCP server
Remote MCP server at ${PULLMD_URL}/mcp (Streamable-HTTP transport, stateless).
Three tools: read_url, get_share, list_recent. Server-side updates reach
every client automatically; no local install needed.
Claude Code (CLI):
claude mcp add --transport http pullmd ${PULLMD_URL}/mcp
Claude Desktop / Cursor / other MCP hosts β JSON config:
{
"mcpServers": {
"pullmd": {
"type": "http",
"url": "${PULLMD_URL}/mcp"
}
}
}
Once registered, the three tools surface natively in the agent. No prompt instructions are needed; the LLM picks them up via their schema descriptions.
MCP client compatibility (updated for v2.0)
| Client | Bearer (Authorization: Bearer pmd_...) | OAuth | Notes |
|---|---|---|---|
| Claude Code CLI | ✅ | ❌ | Recommended. Generate a key at /settings. |
| Cursor | ✅ | ❌ | Same as CLI. |
| Claude Desktop | ❌ | (#6) | UI lacks a header field. Phase 2 OAuth. |
| claude.ai (web) | ❌ | (#6) | Web requires OAuth. Phase 2. |
For Phase 1, Claude Desktop / claude.ai users still need the OAuth/proxy workaround documented in #10. Phase 2 (#6) layers OAuth on top of this user system.
Claude Desktop limitation
The Claude Desktop "Add custom connector" UI accepts URL + OAuth
Client ID/Secret but no custom-header field. Additionally,
claude_desktop_config.json entries with "type": "http" are silently
rewritten to {} after Desktop launches (current Desktop only honors
stdio servers in that file).
Until OAuth support lands (see #6), the practical workaround for Claude Desktop users is a reverse proxy that accepts the auth token as either a bearer header (for CLI) or as a URL path prefix (for Desktop, which has no header field).
Caddy workaround for Claude Desktop
Contributed by @WinFuture23:
@bearer header Authorization "Bearer {$AUTH_TOKEN}"
handle @bearer {
    reverse_proxy pullmd:3000
}

@token_path path /{$AUTH_TOKEN}/* /{$AUTH_TOKEN}
handle @token_path {
    uri strip_prefix /{$AUTH_TOKEN}
    reverse_proxy pullmd:3000
}
Then in Claude Desktop's connector dialog, use the URL with the token
path prefix: https://your-instance.com/<TOKEN>/mcp. CLI clients keep
using the Authorization header as normal.
This is a stopgap pattern; native OAuth (Phase 2) will remove the need for it.
API
| Endpoint | Returns |
|---|---|
| GET /api?url=… | Markdown (or JSON / plain text via format=). |
| GET /api/stream?url=… | Server-Sent Events stream of extraction-stage status, ending in a result event. Used by the PWA. |
| GET /s/:id | Cached Markdown by share id; refreshes from source if > 1 h old. |
| GET /api/history | Recent conversions (JSON). |
| GET /api/archive | Paginated full archive. |
| GET /api/storage | Cache size / hit-rate stats. |
| GET /api/stats | Extraction telemetry (sources, quality, latency). |
| POST /mcp | Streamable-HTTP MCP endpoint (3 tools: read_url, get_share, list_recent). |
| GET /web-reader.zip | Claude Code skill bundle, with this instance's URL baked in. |
| GET /help | Bilingual user/agent setup guide. |
/api parameters
| Param | Default | Notes |
|---|---|---|
| url | (none) | Required. |
| comments | true | Include Reddit comments. Ignored for non-Reddit URLs. |
| comment_depth | 3 | Max nesting depth (1–10). |
| comment_limit | 15 | Max top-level comments. |
| frontmatter | false | Prepend YAML metadata. |
| format | md | text strips Markdown; json returns a structured response. |
| nocache | false | Bypass the 1-hour cache. |
| render | auto | force → always render via Playwright; skip → never render. Bypasses the cache. |
| lang | de | Comments-section header language (de or en). |
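As a quick illustration of composing a request from the parameters above, a hypothetical client might build the URL like this (the base URL is a placeholder for your instance):

```javascript
// Compose an /api request URL; URLSearchParams handles the encoding.
const base = "https://pullmd.example.com"; // placeholder instance
const params = new URLSearchParams({
  url: "https://example.com/some-article",
  frontmatter: "true",
  format: "text",
  lang: "en",
});
const requestUrl = `${base}/api?${params}`;
console.log(requestUrl);
```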
Response headers
- X-Source: reddit · cloudflare · readability · readability-fallback · trafilatura · playwright
- X-Quality: 0.0–1.0 extraction confidence
- X-Share-Id: the 8-hex permalink id
Cache & TTLs
- /api?url=… re-fetches from the source if the cache row is older than 1 hour.
- /s/:id does the same on-demand refresh, so share links double as live endpoints.
- Cache rows are pruned 90 days after the last write. /s/:id hits keep a row alive (since they trigger refresh + write); read-only access does not extend the TTL.
- If the source is unreachable on refresh, the last good snapshot is served: share links keep working even when the original URL dies.
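The two TTLs can be sketched in one decision function; names and fields here are hypothetical, not PullMD's schema:

```javascript
// Sketch: a 1-hour freshness window and a 90-day prune horizon.
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

function classify(row, now = Date.now()) {
  if (now - row.lastWriteAt > 90 * DAY_MS) return "prune";   // row is dropped
  if (now - row.fetchedAt > HOUR_MS) return "refresh";       // re-fetch, then serve
  return "fresh";                                            // serve from cache
}

const now = Date.now();
console.log(classify({ lastWriteAt: now, fetchedAt: now - 2 * HOUR_MS })); // "refresh"
```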
Architecture
- server.js – Express app factory (createApp) with dependency injection for tests. Exposes /api and /api/stream (SSE).
- lib/reddit.js – Reddit URL normalization, redirect resolution, post + comment extraction.
- lib/web.js – Orchestrator: Cloudflare-Markdown short-circuit, then static Readability + Trafilatura with pickBest, then optional Playwright re-render + re-extract on body-soup / low-quality output.
- lib/render-decision.js – Predicate that decides when to fall back to Playwright (readability-fallback + thin output, body-soup signature, or quality < 0.5; plus force/skip overrides).
- lib/playwright-client.js – HTTP client for the Playwright sidecar with AbortSignal propagation for SSE-disconnect cancellation.
- lib/scoring.js – Quality scoring used to pick between extractors and as a render-trigger heuristic.
- lib/cache.js – SQLite cache (better-sqlite3) with 90-day TTL and 8-hex share ids.
- lib/mcp.js – Stateless MCP server registering the three tools.
- lib/distrib.js – Public-URL substitution in /help and /web-reader.zip.
- trafilatura-sidecar/ – Python sidecar (FastAPI) wrapping Trafilatura.
- playwright-sidecar/ – Python sidecar (FastAPI + Playwright + Chromium) for JS-rendered pages.
- public/ – PWA frontend (vanilla JS, dark/paper themes, service worker, EventSource client for /api/stream).
- skill/web-reader/ – Claude Code skill source (templated with __PULLMD_URL__).
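The pickBest step can be illustrated in a few lines; the candidate-object shape below is invented for illustration, but the idea matches the orchestration described above: run several extractors, score each result, keep the highest-scoring one.

```javascript
// Illustrative sketch of choosing between extractor outputs by quality score.
function pickBest(candidates) {
  return candidates.reduce((best, c) => (c.quality > best.quality ? c : best));
}

const best = pickBest([
  { source: "readability", quality: 0.42, markdown: "# …" },
  { source: "trafilatura", quality: 0.81, markdown: "# …" },
]);
console.log(best.source); // "trafilatura"
```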
License
GNU AGPL v3 · Copyright © 2026 Aeterna Labs.
PullMD is free software: you can redistribute it and modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 or later. If you run a modified version as a network service, you must make your modifications available to its users.
