Md Succ AI
URL to Markdown API · md.succ.ai
Clean Markdown from any URL. Fast, accurate, agent-friendly.
Quick Start • Features • API • How It Works • Self-Hosting • Monitoring • Security
Convert any webpage, document, feed, or video to clean, readable Markdown. Built for AI agents, MCP tools, and RAG pipelines. Powered by succ.
Quick Start
```bash
# Markdown output
curl https://md.succ.ai/https://example.com

# JSON output
curl -H "Accept: application/json" https://md.succ.ai/https://example.com

# Documents (PDF, DOCX, XLSX, CSV)
curl https://md.succ.ai/https://example.com/report.pdf

# YouTube transcript
curl https://md.succ.ai/https://youtube.com/watch?v=dQw4w9WgXcQ

# RSS/Atom feed
curl https://md.succ.ai/https://blog.example.com/feed.xml

# LLM-optimized (30-50% fewer tokens)
curl "https://md.succ.ai/https://example.com?mode=fit"

# Batch convert
curl -X POST https://md.succ.ai/batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://httpbin.org/html"]}'
```
That's it. No API key, no signup, no SDK. Just prepend `https://md.succ.ai/` to any URL.
Features
| Feature | Description |
|---|---|
| 9-Pass Extraction | Readability, Defuddle, Article Extractor, CSS selectors, Schema.org, Open Graph, text density, cleaned body – quality-checked at each step |
| 7 Formats | HTML, PDF, DOCX, XLSX, CSV, YouTube transcripts, RSS/Atom feeds |
| 4-Tier Pipeline | HTTP fetch → headless browser → LLM extraction → BaaS anti-bot bypass |
| Batch Conversion | Convert up to 50 URLs in one request with concurrent processing |
| Async + Webhooks | Submit long conversions and get results via polling or webhook callback |
| Structured Extraction | /extract – JSON schema in, structured data out (LLM-powered) |
| Quality Scoring | Each conversion scored 0-1 with an A-F grade |
| Fit Mode | LLM-optimized output – pruned boilerplate, 30-50% fewer tokens |
| Citation Links | Numbered references with footer instead of inline links |
| Redis Cache | Two-layer caching (Redis + in-memory fallback), SHA-256 hashed keys |
| Rate Limiting | Per-IP via Redis atomic pipeline, CF-Connecting-IP aware |
| Prometheus + Grafana | 11 custom metrics, pre-provisioned dashboard, auto-scraped |
| Structured Logging | JSON logs via Pino, per-request correlation IDs |
| OpenAPI Docs | Interactive API reference at /docs (Scalar UI) |
Supported formats
| Format | Content-Type | Method |
|---|---|---|
| HTML | text/html | 9-pass extraction + Turndown |
| PDF | application/pdf | Text extraction via unpdf |
| DOCX | application/vnd...wordprocessingml | mammoth → HTML → Turndown |
| XLSX/XLS | application/vnd...spreadsheetml | SheetJS → Markdown tables |
| CSV | text/csv | SheetJS → Markdown table |
| YouTube | youtube.com, youtu.be | Transcript extraction via innertube API |
| RSS/Atom | application/rss+xml, application/atom+xml | Feed parsing with item metadata |
Documents are also detected by URL extension (.pdf, .docx, .xlsx, .csv) when Content-Type is application/octet-stream.
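That fallback rule can be sketched as a small dispatcher. The code below illustrates the detection order described above — Content-Type first, URL extension when the server sends a generic binary type — and is not the service's actual implementation (the type map is abridged):

```python
from urllib.parse import urlparse

# Content-Type → format, abridged from the table above
TYPE_MAP = {
    "application/pdf": "pdf",
    "text/csv": "csv",
    "text/html": "html",
}
EXT_MAP = {".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx", ".csv": "csv"}

def detect_format(url: str, content_type: str) -> str:
    """Pick a converter by Content-Type, falling back to the URL
    extension when the server reports application/octet-stream."""
    ct = content_type.split(";")[0].strip().lower()
    if ct in TYPE_MAP:
        return TYPE_MAP[ct]
    if ct == "application/octet-stream":
        path = urlparse(url).path.lower()
        for ext, fmt in EXT_MAP.items():
            if path.endswith(ext):
                return fmt
    return "html"  # default: treat as a web page
```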
API
Base URL: https://md.succ.ai
Docs: /docs (interactive Scalar UI) | /openapi.json (OpenAPI 3.1 spec)
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /{url} | Convert URL to Markdown |
| GET | /?url={url} | Same, query-param form |
| POST | /extract | Structured data extraction via LLM (JSON schema) |
| POST | /batch | Batch convert up to 50 URLs |
| POST | /async | Async conversion with optional webhook |
| GET | /job/:id | Poll async job status |
| GET | /health | Health check (includes Redis status) |
| GET | /docs | Interactive API reference |
| GET | /openapi.json | OpenAPI 3.1 spec |
Query Parameters
| Parameter | Values | Description |
|---|---|---|
| url | URL | Target URL (alternative to path format) |
| links | citations | Convert inline links to numbered references with a footer |
| mode | fit | Prune boilerplate sections for a smaller LLM context |
| max_tokens | number | Truncate output to N tokens (use with mode=fit) |
Response Headers
| Header | Description |
|---|---|
| x-request-id | Unique request correlation ID |
| x-markdown-tokens | Token count (cl100k_base) |
| x-conversion-tier | fetch, browser, baas:scrapfly, llm, youtube, feed, document:pdf, etc. |
| x-conversion-time | Total conversion time in ms |
| x-extraction-method | Extraction pass used (readability, defuddle, browser-raw, etc.) |
| x-quality-score | Quality score 0-1 |
| x-quality-grade | Quality grade A-F |
| x-readability | true if Readability extracted clean content |
| x-cache | hit or miss (Redis-backed) |
| x-ratelimit-limit | Max requests per window |
| x-ratelimit-remaining | Requests remaining in current window |
| x-ratelimit-reset | Window reset timestamp (Unix seconds) |
Rate Limits
| Endpoint | Limit |
|---|---|
| GET /* | 60 req/min per IP |
| POST /extract | 10 req/min per IP |
| POST /batch | 5 req/min per IP |
| POST /async | 10 req/min per IP |
JSON response format
```json
{
  "title": "Example Domain",
  "url": "https://example.com",
  "content": "# Example Domain\n\nThis domain is for use in...",
  "fit_markdown": "# Example Domain\n\nThis domain is...",
  "fit_tokens": 20,
  "excerpt": "This domain is for use in documentation examples...",
  "tokens": 33,
  "tier": "fetch",
  "readability": true,
  "method": "readability",
  "quality": { "score": 0.85, "grade": "A" },
  "time_ms": 245
}
```
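The quality object pairs a 0-1 score with an A-F letter grade. The exact cutoffs are not documented here; one plausible mapping, consistent with the example's 0.85 → "A" (these thresholds are an assumption, not the service's):

```python
def grade(score: float) -> str:
    """Map a 0-1 quality score to a letter grade.
    Thresholds are illustrative assumptions."""
    for cutoff, letter in ((0.8, "A"), (0.6, "B"), (0.4, "C"), (0.2, "D")):
        if score >= cutoff:
            return letter
    return "F"
```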
Batch conversion
```bash
curl -X POST https://md.succ.ai/batch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://httpbin.org/html",
      "https://github.com"
    ],
    "options": {
      "mode": "fit",
      "links": "citations"
    }
  }'
```
Returns an array of results. Up to 50 URLs, processed with 10-way concurrency. Per-URL 60s timeout.
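The bounded-concurrency, per-URL-timeout pattern can be sketched with asyncio. Illustrative only — the `convert` callable is a stand-in for whatever performs one conversion:

```python
import asyncio

async def convert_batch(urls, convert, concurrency=10, timeout_s=60):
    """Convert up to 50 URLs with bounded concurrency and a per-URL
    timeout, returning one result (or error marker) per input URL,
    in input order."""
    sem = asyncio.Semaphore(concurrency)

    async def one(url):
        async with sem:
            try:
                return await asyncio.wait_for(convert(url), timeout_s)
            except Exception as exc:
                # a slow or failing URL doesn't sink the whole batch
                return {"url": url, "error": str(exc)}

    return await asyncio.gather(*(one(u) for u in urls[:50]))
```

`asyncio.gather` preserves input order, which matches the "array of results" shape described above.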
Async conversion with webhook
```bash
# Submit async job
curl -X POST https://md.succ.ai/async \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "callback_url": "https://your-server.com/webhook"
  }'
# → {"job_id": "abc12345", "status": "processing", "poll_url": "/job/abc12345"}

# Poll for result
curl https://md.succ.ai/job/abc12345
```
The webhook delivers a JSON POST to callback_url on completion or failure. HTTPS is required; delivery is retried 3 times with exponential backoff. Private/internal addresses are blocked (SSRF-safe).
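The retry behaviour reduces to exponential backoff around the POST. A sketch — the actual delay values used by the service are not documented, so the 1s/2s/4s schedule here is an assumption, and `post` is a stand-in for the HTTPS delivery call:

```python
import asyncio

async def deliver_webhook(post, payload, retries=3, base_delay=1.0):
    """Attempt delivery, then retry with exponential backoff
    (1s, 2s, 4s by default) before giving up."""
    for attempt in range(retries + 1):
        try:
            await post(payload)
            return True
        except Exception:
            if attempt == retries:
                return False                      # exhausted all retries
            await asyncio.sleep(base_delay * 2 ** attempt)
    return False
```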
Structured data extraction
```bash
curl -X POST https://md.succ.ai/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://github.com/trending",
    "schema": {
      "type": "object",
      "properties": {
        "repositories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "author": { "type": "string" },
              "description": { "type": "string" },
              "stars_today": { "type": "number" }
            }
          }
        }
      }
    }
  }'
```
Returns structured JSON matching the provided schema, extracted by LLM. Automatically retries with headless browser for SPA/JS-heavy sites when initial extraction returns empty data.
More examples
```bash
# Citation-style links (numbered references)
curl "https://md.succ.ai/?url=https://en.wikipedia.org/wiki/Markdown&links=citations"

# LLM-optimized output (pruned boilerplate)
curl "https://md.succ.ai/?url=https://htmx.org/docs/&mode=fit"

# Token limit
curl "https://md.succ.ai/?url=https://example.com&mode=fit&max_tokens=4000"

# RSS feed as markdown
curl https://md.succ.ai/https://hnrss.org/frontpage
```
How It Works
4-tier conversion pipeline – each tier activates only if the previous one produced insufficient quality:
```
URL ──► Cache hit? ──► Return cached result (Redis, dynamic TTL)
 │
 ├─ YouTube? ──► Transcript extraction (innertube API)
 │
 ├─ RSS/Atom feed? ──► Feed parsing with item metadata
 │
 ├─ Document? (PDF, DOCX, XLSX, CSV)
 │    └── Document converter → Markdown
 │
 ├─ Tier 1: HTTP fetch + 9-pass extraction
 │    └── Readability → Defuddle → Article Extractor → CSS selectors
 │        → Schema.org → Open Graph → Text density → Body fallback
 │
 ├─ Tier 2: Camoufox headless browser (SPA/JS-heavy)
 │    └── Same 9-pass pipeline on rendered DOM
 │
 ├─ Tier 2.5: LLM extraction (quality < B)
 │    └── nano-gpt API → content extraction
 │
 └─ Tier 3: BaaS anti-bot bypass (CF Turnstile / quality < D)
      ├── ScrapFly → ZenRows → ScrapingBee (rotation)
      └── Same 9-pass pipeline on returned HTML
```
Cloudflare challenge pages are detected automatically. When the plain fetch hits a CF challenge, the browser tier is skipped (saving its IP) and the BaaS providers handle the bypass.
When both the LLM and BaaS tiers are needed, they race in parallel, saving 30-45s versus running them sequentially.
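The parallel race can be sketched with asyncio: start both extractors, take whichever finishes first, cancel the loser. Illustrative, not the service's code:

```python
import asyncio

async def race(*coros):
    """Run candidate extractors concurrently and return the result of
    whichever finishes first, cancelling the rest."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        return done.pop().result()
    finally:
        for t in tasks:
            t.cancel()  # no-op for the finished task
```

A production version would also handle the case where the first finisher failed (fall through to the other task) rather than re-raising its exception.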
Caching
Two-layer cache system backed by Redis 7:
| Content | TTL | Key |
|---|---|---|
| HTML pages | 5 min | cache:{sha256(url+options)} |
| Browser renders | 10 min | Same |
| YouTube transcripts | 1 hr | Same |
| Documents | 2 hr | Same |
| /extract results | 1 hr | extract:{sha256(url)}:{sha256(schema)} |
Cache keys use SHA-256 hashes to prevent poisoning via long/malicious URLs. Tracking parameters (UTM, fbclid, gclid, etc.) are stripped before hashing. Falls back to in-memory Map when Redis is unavailable.
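The key-derivation scheme fits in a few lines. A sketch — the exact tracking-parameter list and options encoding beyond the UTM/fbclid/gclid examples above are assumptions:

```python
import hashlib
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

TRACKING = {"fbclid", "gclid"}  # utm_* handled by prefix below

def cache_key(url: str, options: str = "") -> str:
    """Strip tracking params, then SHA-256 hash URL+options so keys
    are fixed-length and can't be poisoned by hostile URLs."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in TRACKING and not k.startswith("utm_")]
    clean = urlunparse(parts._replace(query=urlencode(kept)))
    digest = hashlib.sha256(f"{clean}{options}".encode()).hexdigest()
    return f"cache:{digest}"
```

Because tracking parameters are removed before hashing, `?utm_source=x` variants of the same page share one cache entry.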
Stack
| Component | Role |
|---|---|
| Hono | HTTP framework |
| Pino | Structured JSON logging |
| Mozilla Readability | Primary content extraction |
| Defuddle | Obsidian team's content extraction |
| @extractus/article-extractor | Alternative extraction heuristics |
| Turndown | HTML β Markdown conversion |
| linkedom | Lightweight DOM parser |
| Camoufox | Firefox fork with C++ anti-detection |
| Redis + ioredis | Cache, rate limiting, job storage |
| prom-client | Prometheus metrics |
| unpdf | PDF text extraction |
| mammoth | DOCX β HTML conversion |
| SheetJS | XLSX/XLS/CSV parsing |
| NanoGPT | LLM API for Tier 2.5 and /extract |
| Ajv | JSON Schema validation for /extract |
| gpt-tokenizer | cl100k_base token counting |
| nanoid | Request/job IDs |
Self-Hosting
Docker (recommended)
```bash
git clone https://github.com/vinaes/md-succ-ai.git
cd md-succ-ai
cp .env.example .env   # edit with your API keys and passwords
docker compose up -d
```
This starts four containers:
| Container | Purpose | Port |
|---|---|---|
| md-succ-ai | API server with Camoufox browser fallback | 127.0.0.1:3100 |
| md-succ-redis | Redis 7 (cache, rate limiting, jobs) | internal |
| md-succ-prometheus | Prometheus metrics collector | internal |
| md-succ-grafana | Grafana dashboards | 127.0.0.1:3200 |
The API is available at http://localhost:3100.
Local (without Docker)
```bash
npm install
npx camoufox-js fetch
npm start
```
Redis is optional for local development. Without Redis, caching and rate limiting fall back to in-memory Map, and async jobs are unavailable.
Environment variables
| Variable | Default | Description |
|---|---|---|
| PORT | 3000 | Server port |
| ENABLE_BROWSER | true | Enable Camoufox browser fallback |
| NODE_ENV | production | Node environment |
| REDIS_URL | redis://redis:6379 | Redis connection URL (with password in Docker) |
| REDIS_PASSWORD | – | Redis authentication password (required in Docker) |
| GRAFANA_PASSWORD | – | Grafana admin password (required in Docker) |
| NANOGPT_API_KEY | – | nano-gpt API key for the LLM tier and /extract |
| NANOGPT_MODEL | meta-llama/llama-3.3-70b-instruct | LLM model for content extraction (Tier 2.5) |
| NANOGPT_EXTRACT_MODEL | same as NANOGPT_MODEL | LLM model for the /extract endpoint |
| SCRAPFLY_API_KEY | – | ScrapFly anti-bot bypass (1000 credits/mo free) |
| ZENROWS_API_KEY | – | ZenRows anti-bot bypass (1000-credit trial) |
| SCRAPINGBEE_API_KEY | – | ScrapingBee anti-bot bypass (1000 credits one-time) |
BaaS providers are optional. When configured, they activate as Tier 3 for Cloudflare-protected sites. Providers are tried in order; if one hits rate limits, the next is used automatically.
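The rotation reduces to a try-in-order loop. A sketch — the provider callables here are stand-ins; a real client would wrap each vendor's API and distinguish rate-limit responses from hard failures:

```python
def fetch_with_fallback(url, providers):
    """Try anti-bot providers in configured order, moving to the next
    on any error, and raising only if every provider fails.

    providers: list of (name, fetch_callable) pairs.
    """
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(url)
        except Exception as exc:   # rate limit, quota, provider error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```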
Nginx reverse proxy
An example nginx config is in nginx/md.succ.ai.conf:
- Rate limiting: 10 req/s per IP, burst 20
- Connection limit: 10 concurrent per IP
- Proxy timeouts: 60s read (for browser renders)
- POST endpoints with appropriate body limits
- HSTS, security headers (nosniff, X-Frame-Options, Referrer-Policy)
- /metrics blocked (403)
- /grafana/ proxied to the Grafana container with WebSocket support
Monitoring
The project ships with a full Prometheus + Grafana stack:
Prometheus scrapes the /metrics endpoint every 10s (internal Docker network only).
Grafana is pre-provisioned with a 15-panel dashboard:
- Request rate, response time percentiles (p50/p95/p99)
- Conversion tier distribution, cache hit rate
- Quality score distribution, tokens per conversion
- Rate limit rejections, async job status
- Browser pool utilization, webhook deliveries
- Node.js process metrics (CPU, memory, event loop lag)
Access Grafana at https://your-domain/grafana/ (proxied via nginx).
Custom Metrics
| Metric | Type | Labels |
|---|---|---|
| http_requests_total | Counter | method, route, status |
| http_request_duration_seconds | Histogram | method, route, status |
| conversion_tier_total | Counter | tier |
| conversion_tokens | Histogram | tier |
| conversion_quality | Histogram | tier |
| cache_hits_total | Counter | source |
| cache_misses_total | Counter | – |
| rate_limit_rejections_total | Counter | route |
| browser_pool_active | Gauge | – |
| async_jobs_total | Counter | status |
| webhook_deliveries_total | Counter | status |
Plus Node.js default metrics (CPU, memory, event loop, GC) via prom-client.
Security
- SSRF protection – URL validation, DNS resolution checks (IPv4 + IPv6), redirect validation per hop, Camoufox route blocking, webhook callback DNS validation
- Private IP blocking – 127/8, 10/8, 172.16/12, 192.168/16, 169.254/16, CGNAT, cloud metadata hostnames, hex/octal IP formats, IPv6-mapped addresses
- Input limits – 5MB response size, 5 max redirects, content-type validation, body size limits per endpoint
- Output sanitization – error messages stripped of internal paths/stack traces, URLs sanitized in responses
- Cache security – SHA-256 hashed keys (no URL poisoning), tracking params stripped, Redis LRU eviction (128MB cap)
- Redis authentication – `--requirepass` with password from .env, authenticated connection URL
- API key safety – BaaS API keys only used in outbound requests, never logged or exposed in responses
- LLM hardening – prompt injection protection (HTML sanitization, document delimiters, output validation), schema field whitelist, blocked schema keywords ($ref, $defs, etc.)
- Rate limiting – per-IP via Redis INCR+EXPIRE (atomic pipeline), CF-Connecting-IP support, in-memory fallback
- Security headers – HSTS, X-Content-Type-Options, X-Frame-Options, Referrer-Policy, Permissions-Policy
- CDN integrity – Subresource Integrity (SRI) on third-party scripts
- Container security – non-root user (mduser), no-new-privileges, pinned image versions
- CF challenge detection – Cloudflare challenge pages detected and handled without wasting browser/BaaS credits
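The private-address checks map naturally onto Python's `ipaddress` module. A sketch of the blocklist described above — hostname, hex/octal-format, and cloud-metadata checks would sit in front of this; only the resolved-IP test is shown:

```python
import ipaddress

def is_blocked_address(ip: str) -> bool:
    """Reject private, loopback, link-local, reserved, CGNAT, and
    IPv4-mapped IPv6 addresses before any outbound fetch (SSRF guard)."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 6 and addr.ipv4_mapped:
        addr = addr.ipv4_mapped          # unwrap ::ffff:a.b.c.d
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_reserved
            or addr in ipaddress.ip_network("100.64.0.0/10"))  # CGNAT
```

Unwrapping IPv4-mapped IPv6 first matters: `::ffff:192.168.1.1` must be treated as the private `192.168.1.1`, not as a routable IPv6 address.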
Architecture
```
                   ┌──────────────────┐
                   │    Cloudflare    │
                   │   (TLS + CDN)    │
                   └────────┬─────────┘
                            │
                   ┌────────▼─────────┐
                   │      nginx       │
                   │   (rate limit,   │
                   │   HSTS, proxy)   │
                   └────────┬─────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────▼────────┐  ┌───────▼───────┐  ┌────────▼────────┐
│   md-succ-ai   │  │  Prometheus   │  │     Grafana     │
│ (Node 22, Hono)│  │   (scrape     │  │  (dashboards,   │
│    Camoufox    │  │   /metrics)   │  │    alerting)    │
│  BaaS clients  │  └───────────────┘  └─────────────────┘
│  Pino logging  │
└───────┬────────┘
        │
┌───────▼────────┐
│    Redis 7     │
│  (cache, rate  │
│  limit, jobs)  │
└────────────────┘
```
License
FSL-1.1-Apache-2.0 – free for non-competitive use. Apache 2.0 after 2 years.
Disclaimer: Not affiliated with NanoGPT. LLM features use the NanoGPT API for pay-per-prompt model access.
Part of the succ ecosystem.