Data.world FastMCP Platform
This project develops a data.world MCP server that customers can deploy to make their data.world instance available for agentic workloads.
data.world FastMCP Platform
⚠️ Pre-release: In Active Development
This project is functional and has been tested end-to-end in a local environment, but it has not undergone formal QA, security audit, or enterprise support validation for public deployment. The Admin UI in particular is in a primitive early state. Use in production environments at your own discretion and with appropriate review.
A production-grade MCP (Model Context Protocol) gateway that exposes data.world's catalog, governance, and knowledge graph capabilities as AI-native tools, enabling AI agents to discover, understand, and reason about enterprise data before querying it.
What This Is
Most MCP servers for data platforms are thin API wrappers. This gateway is different: every tool is designed around what an agent wants to accomplish, not what API endpoint it maps to. `describe_dataset` doesn't expose the data.world schema endpoint; it returns everything an agent needs to understand a dataset in a single call: schema, governance, certification status, responsible parties, and compliance tags.
When paired with a SQL or file-system MCP (the "data access layer"), this gateway acts as the knowledge and intelligence layer in a multi-MCP agent architecture:
┌────────────────────────────────────────────────────┐
│            AI Agent (Claude, GPT, etc.)            │
└───────────┬───────────────────────────┬────────────┘
            │                           │
┌───────────▼───────────┐  ┌────────────▼────────────┐
│  data.world MCP       │  │  Data MCP               │
│  Knowledge + Wisdom   │  │  Access Layer           │
│  Layer                │  │                         │
│  • What data exists?  │  │  • SQL (Postgres,       │
│  • Who governs it?    │  │    Snowflake, BigQuery) │
│  • Is it certified?   │  │  • Files (S3, ADLS)     │
│  • What does it mean? │  │  • REST APIs            │
│  • Who can access it? │  │  • Streaming sources    │
└───────────────────────┘  └─────────────────────────┘
The agent asks the gateway which dataset to query, what columns exist and what they mean, whether the data is certified, and who to contact for governance questions. The data MCP then executes the actual retrieval. The user receives verified, cited, auditable answers.
Sub-project Status
| Sub-project | Description | Status |
|---|---|---|
| SP1 | Core MCP Gateway (7 tools) | ✅ Complete |
| SP2 | Enterprise Auth (Okta JWT) | ✅ Complete |
| SP3 | Admin API (control plane) | ✅ Complete |
| SP4 | AI-Powered Instance Discovery | ✅ Complete |
| SP5 | Admin UI (React frontend) | ✅ Complete (see Admin UI notes) |
| SP6 | Okta SSO for Admin UI | 🔲 Planned |
| - | Production hardening pass | 🔲 Planned |
The 7 MCP Tools
Catalog Layer
| Tool | What it does |
|---|---|
| `search_catalog` | Full-text search across the data.world catalog. Supports filtering by owner, tags, domain, and 8 responsible-party roles. Returns `source_url` for inline citation. |
| `describe_dataset` | Full schema (all tables, columns, types), governance metadata, certification status, quality score, compliance tags (GDPR, SOX, CCPA, HIPAA), and responsible-party contacts for a specific dataset. |
| `list_collections` | Enumerates curated domain collections with member counts and IRIs for further navigation. Enterprise tier. |
Governance Layer
| Tool | What it does |
|---|---|
| `get_access_policy` | Access level (open/restricted/private), policy description, approved groups, and compliance classification for a dataset. Enterprise tier. |
Knowledge Graph Layer
| Tool | What it does |
|---|---|
| `get_glossary_terms` | Business vocabulary definitions with synonyms, owning team, and linked datasets/columns. Resolves terms before data interpretation. Enterprise tier. |
| `get_lineage` | Upstream/downstream data dependency graph with configurable depth and direction. Enterprise tier. |
| `get_related_resources` | Graph traversal from any catalog IRI: finds linked datasets, glossary terms, and related resources. IRIs are provided by other tool responses. Enterprise tier. |
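To make the "configurable depth and direction" of `get_lineage` concrete, here is a minimal sketch of a depth-bounded breadth-first walk over a dependency graph. It only illustrates the idea: the edge list, the dataset names, the function name `walk_lineage`, and the direction values are assumptions for this example, not the gateway's implementation or response format.

```python
from collections import deque

# Hypothetical edges as (upstream, downstream) pairs; names are made up.
EDGES = [
    ("raw_orders", "clean_orders"),
    ("clean_orders", "daily_revenue"),
    ("daily_revenue", "exec_dashboard"),
]


def walk_lineage(start, direction="downstream", depth=1, edges=EDGES):
    """Return {node: hops} for every node within `depth` hops of `start`."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    # Build an adjacency map that points in the requested direction.
    neighbours = {}
    for up, down in edges:
        src, dst = (up, down) if direction == "downstream" else (down, up)
        neighbours.setdefault(src, []).append(dst)

    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)

    del seen[start]  # report only the related nodes, not the starting point
    return seen
```

Calling `walk_lineage("clean_orders", "downstream", depth=2)` on the sample edges reaches both `daily_revenue` and `exec_dashboard`, while `depth=1` stops at `daily_revenue`.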
Source Link Citations
Every tool response includes a source_url field pointing directly to the originating page in data.world, and a next_step hint guiding agents to the logical follow-on tool call. Agents can use source_url to produce inline markdown citations:
The Hospital Outcome of Care Surgical Measures dataset covers quality metrics for surgical procedures across US hospitals...
See docs/citation-system-prompt.md for a system prompt snippet that instructs agents to use inline citations.
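As an illustration of how a client might turn `source_url` into an inline citation, here is a small sketch. Only the `source_url` field name comes from this README; the `title` field, the `cite` helper, and the example URL are assumptions made for the example.

```python
def cite(result, text=None):
    """Render one tool result as an inline markdown link.

    `result` is a dict from a tool response. `title` is an assumed field name.
    """
    label = text or result.get("title") or "source"
    url = result.get("source_url")
    if not url:
        return label  # nothing to link to, so fall back to plain text
    # Escape square brackets so the label cannot break the link syntax.
    label = label.replace("[", "\\[").replace("]", "\\]")
    return f"[{label}]({url})"


# Hypothetical tool result, for illustration only.
example = {
    "title": "Example dataset",
    "source_url": "https://data.world/example-org/example-dataset",
}
sentence = f"The {cite(example)} covers quality metrics for surgical procedures."
```

The link is woven into the sentence rather than appended as a reference list, which matches the citation guidance in the agent system prompt.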
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│  Control Plane                                                       │
│                                                                      │
│  Admin API (FastAPI, port 8000)                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Tool config management · Discovery scan orchestration         │  │
│  │  Telemetry · Audit log · Recommendation review                 │  │
│  └──────────────────────────┬─────────────────────────────────────┘  │
│                             │ pg_notify('config_changed')            │
│                             ▼                                        │
│                 PostgreSQL (shared state)                            │
│                             │ asyncpg LISTEN                         │
│                             ▼                                        │
│  MCP Gateway (FastMCP, port 8001)                                    │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  7 MCP tools · Okta JWT auth · Telemetry middleware            │  │
│  │  Live tool toggle (no restart) · Source link citations         │  │
│  └──────────────────────────┬─────────────────────────────────────┘  │
│                             │                                        │
└─────────────────────────────┼────────────────────────────────────────┘
                              │ HTTPS
                    ┌─────────▼─────────┐
                    │  data.world API   │
                    │  (public or       │
                    │   enterprise)     │
                    └───────────────────┘
The Admin API and MCP Gateway communicate only through PostgreSQL LISTEN/NOTIFY, with no direct service-to-service calls. A crash or restart of either service therefore does not take the other down.
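The LISTEN side of this pattern can be sketched with asyncpg as follows. This is a minimal illustration under stated assumptions, not the contents of `config_listener.py`: the JSON payload shape (`{"tool": ..., "enabled": ...}`) and both function names are hypothetical. Only the `config_changed` channel name comes from the diagram above.

```python
import asyncio
import json


def apply_config_change(payload, enabled_tools):
    """Apply one notification payload to the in-memory set of enabled tools.

    Assumed payload shape: {"tool": "<name>", "enabled": true|false}.
    """
    change = json.loads(payload)
    if change["enabled"]:
        enabled_tools.add(change["tool"])
    else:
        enabled_tools.discard(change["tool"])
    return enabled_tools


async def listen_for_config(dsn, enabled_tools):
    """Subscribe to the config_changed channel and apply updates as they arrive."""
    import asyncpg  # third-party; imported lazily so the helper above stays stdlib-only

    conn = await asyncpg.connect(dsn)

    def on_notify(connection, pid, channel, payload):
        # asyncpg invokes listeners with (connection, pid, channel, payload).
        apply_config_change(payload, enabled_tools)

    await conn.add_listener("config_changed", on_notify)
    try:
        await asyncio.Event().wait()  # keep the connection open until cancelled
    finally:
        await conn.close()
```

On the publishing side, the equivalent SQL would be along the lines of `SELECT pg_notify('config_changed', '<json payload>')`. Because the only shared dependency is the database, the listener can reconnect and resume without the Admin API being aware of it.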
Prerequisites
- Python 3.10+
- Node.js 18+ (for the Admin UI)
- Docker Desktop (for PostgreSQL)
- A data.world account with an API token
Local Development Setup
1. Clone and install
git clone https://github.com/illegal-request/data.world-FastMCP-Platform.git
cd data.world-FastMCP-Platform
# Install the MCP gateway (editable)
pip install -e .
# Install the Admin API
pip install -e "admin_api/[dev]"
2. Configure environment
cp .env.example .env
# Edit .env and set DATAWORLD_API_TOKEN to your data.world token
For the Admin API, create a separate .env:
cp admin_api/.env.example admin_api/.env # if it doesn't exist, copy from root .env
# Set ADMIN_BOOTSTRAP_KEY to a random string (≥32 chars)
# Example: python -c "import secrets; print(secrets.token_urlsafe(48))"
3. Start PostgreSQL
docker compose up postgres
4. Run database migrations
cd admin_api
python -m alembic upgrade head
cd ..
5. Start the services
Terminal 1 (Admin API):
cd admin_api
python -m uvicorn dataworld_admin.main:app --host 0.0.0.0 --port 8000
Terminal 2 (Admin UI):
cd admin_ui
npm install
npm run dev
# Serves at http://localhost:5173
Terminal 3 (MCP Gateway, for Claude Code / Claude Desktop):
python -m dataworld_mcp
# Default transport: stdio (launched by the MCP client, not manually)
The gateway runs in stdio mode by default, meaning your MCP client (Claude Desktop, Claude Code, Cursor) starts it as a subprocess. You only start it manually if you want to run it as an HTTP server.
6. Running tests
# MCP gateway tests
pytest
# Admin API tests
cd admin_api && pytest
Connecting to AI Clients
Claude Desktop
Add to your claude_desktop_config.json:
{
"mcpServers": {
"dataworld": {
"command": "python",
"args": ["-m", "dataworld_mcp"],
"cwd": "/path/to/data.world-FastMCP-Platform",
"env": {
"DATAWORLD_API_TOKEN": "your_token_here"
}
}
}
}
Claude Code / Cursor (HTTP mode)
Start the gateway as an HTTP server:
MCP_TRANSPORT=streamable-http MCP_PORT=8001 python -m dataworld_mcp
The project includes a .mcp.json pre-configured for localhost:8001. If using Claude Code, this is picked up automatically when you open the project directory.
Alternatively, use SSE transport:
MCP_TRANSPORT=sse MCP_PORT=8001 python -m dataworld_mcp
And update .mcp.json:
{
"mcpServers": {
"dataworld": {
"type": "sse",
"url": "http://localhost:8001/sse"
}
}
}
Agent system prompt
For inline citations, add to your agent's system prompt (or paste from docs/citation-system-prompt.md):
When tool results include `source_url` fields, cite them as inline markdown hyperlinks, `[descriptive text](url)`, woven naturally into your response prose. Do not collect them into a reference list at the end.
Configuration Reference
All configuration is via environment variables. Copy .env.example to .env to get started.
MCP Gateway (.env)
| Variable | Required | Default | Description |
|---|---|---|---|
| `DATAWORLD_API_TOKEN` | ✅ Yes | - | Your data.world API token |
| `DATAWORLD_BASE_URL` | No | `https://api.data.world/v0` | API base URL. Override for enterprise single-tenant: `https://api.{company}.app.data.world/v0` |
| `DATAWORLD_UI_BASE_URL` | No | `https://data.world` | UI base URL for `source_url` construction. Override for enterprise: `https://{company}.app.data.world` |
| `MCP_TRANSPORT` | No | `stdio` | Transport protocol: `stdio`, `streamable-http`, or `sse` |
| `MCP_PORT` | No | `8001` | Port for HTTP/SSE transports |
| `AUTH_MODE` | No | `env_token` | Authentication mode: `env_token` (single token from env) or `okta` (per-user JWT) |
| `OKTA_ISSUER` | If `AUTH_MODE=okta` | - | Okta authorization server URL |
| `OKTA_AUDIENCE` | If `AUTH_MODE=okta` | - | Expected JWT audience |
| `OKTA_CLIENT_ID` | If `AUTH_MODE=okta` | - | Okta application client ID |
| `OKTA_CLIENT_SECRET` | If `AUTH_MODE=okta` | - | Okta application client secret |
| `DATABASE_URL` | No | - | PostgreSQL connection string. Required for live tool configuration and telemetry. |
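The defaults and conditional requirements in this table can be read as the following validation logic. This is a sketch only: `GatewaySettings` and `load_settings` are hypothetical names, and the project's own `config.py` may be structured differently. The variable names and default values are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

OKTA_VARS = ("OKTA_ISSUER", "OKTA_AUDIENCE", "OKTA_CLIENT_ID", "OKTA_CLIENT_SECRET")
TRANSPORTS = ("stdio", "streamable-http", "sse")
AUTH_MODES = ("env_token", "okta")


@dataclass(frozen=True)
class GatewaySettings:  # hypothetical name, not the project's config class
    api_token: str
    base_url: str
    ui_base_url: str
    transport: str
    port: int
    auth_mode: str
    database_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> GatewaySettings:
    """Read gateway settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env

    token = env.get("DATAWORLD_API_TOKEN")
    if not token:
        raise ValueError("DATAWORLD_API_TOKEN is required")

    transport = env.get("MCP_TRANSPORT", "stdio")
    if transport not in TRANSPORTS:
        raise ValueError(f"MCP_TRANSPORT must be one of {TRANSPORTS}")

    auth_mode = env.get("AUTH_MODE", "env_token")
    if auth_mode not in AUTH_MODES:
        raise ValueError(f"AUTH_MODE must be one of {AUTH_MODES}")
    if auth_mode == "okta":
        # All four OKTA_* variables are required in okta mode.
        missing = [name for name in OKTA_VARS if not env.get(name)]
        if missing:
            raise ValueError("AUTH_MODE=okta requires: " + ", ".join(missing))

    return GatewaySettings(
        api_token=token,
        base_url=env.get("DATAWORLD_BASE_URL", "https://api.data.world/v0"),
        ui_base_url=env.get("DATAWORLD_UI_BASE_URL", "https://data.world"),
        transport=transport,
        port=int(env.get("MCP_PORT", "8001")),
        auth_mode=auth_mode,
        database_url=env.get("DATABASE_URL"),
    )
```

Passing a plain dict instead of `os.environ` makes the rules easy to check in a test, for example `load_settings({"DATAWORLD_API_TOKEN": "t"})` yields the `stdio` transport on port 8001.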
Admin API (admin_api/.env)
| Variable | Required | Default | Description |
|---|---|---|---|
| `ADMIN_BOOTSTRAP_KEY` | ✅ Yes | - | Static token for Admin UI login (≥32 chars). Generate with `python -c "import secrets; print(secrets.token_urlsafe(48))"` |
| `DATABASE_URL` | ✅ Yes | - | PostgreSQL connection string |
| `DATAWORLD_API_TOKEN` | ✅ Yes | - | data.world API token (used by discovery scanner) |
| `DATAWORLD_BASE_URL` | No | `https://api.data.world/v0` | data.world API base URL |
| `DISCOVERY_LLM_MODEL` | No | - | LiteLLM model string for AI-powered scan analysis (e.g. `claude-sonnet-4-5`). Omit to use the template analyser fallback. |
| `ANTHROPIC_API_KEY` | If using Anthropic | - | Anthropic API key (required if `DISCOVERY_LLM_MODEL` uses an Anthropic model) |
Enterprise Deployment
What changes from local setup
| Concern | Local | Enterprise |
|---|---|---|
| API host | https://api.data.world/v0 | https://api.{company}.app.data.world/v0 |
| UI base URL | https://data.world | https://{company}.app.data.world |
| Authentication | Single API token (env_token) | Per-user Okta JWT (okta mode) |
| Transport | stdio (client-managed) | streamable-http or sse (hosted server) |
| Credentials | .env file | Secrets manager (AWS, Azure, Vault, K8s); planned, see Known Issues |
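For a quick view of what the Enterprise column means in terms of environment variables, the sketch below derives the overrides from a company slug. The helper name `enterprise_overrides` and the slug validation rule are assumptions for illustration. The URL patterns and variable names are the ones documented in this README, and the choice of `streamable-http` over `sse` is arbitrary.

```python
import re


def enterprise_overrides(company):
    """Return the env overrides for an enterprise single-tenant instance.

    `company` is the tenant slug used in the {company} URL placeholders.
    """
    # Assumed rule: the slug is a lowercase DNS label.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", company):
        raise ValueError("company must be a lowercase DNS label")
    return {
        "DATAWORLD_BASE_URL": f"https://api.{company}.app.data.world/v0",
        "DATAWORLD_UI_BASE_URL": f"https://{company}.app.data.world",
        "AUTH_MODE": "okta",
        "MCP_TRANSPORT": "streamable-http",
    }


# Example with a made-up tenant slug.
overrides = enterprise_overrides("acme")
```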
Okta authentication mode
Set AUTH_MODE=okta and provide all four OKTA_* variables. In this mode:
- Agents authenticate with their Okta JWT
- The gateway validates the JWT with your Okta authorization server
- The validated user's data.world token is fetched via RFC 8693 token exchange
- Tool calls execute with the individual user's data.world permissions
Note: The RFC 8693 token exchange endpoint for enterprise data.world instances has not been confirmed with data.world enterprise support. `OktaTokenProvider._exchange()` is intentionally isolated, so only this method needs to change when the endpoint is confirmed. See Known Issues.
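For orientation, an RFC 8693 token exchange request is a form-encoded POST whose required parameters are `grant_type`, `subject_token`, and `subject_token_type`. The sketch below builds only that request body. It is not the contents of `OktaTokenProvider._exchange()`: the function name is hypothetical, the use of the JWT token type (rather than the access-token type) is an assumption, and the endpoint URL is deliberately omitted because it is unconfirmed.

```python
from urllib.parse import urlencode

# Identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"


def build_exchange_body(okta_jwt, audience=None):
    """Return the form-encoded body for an RFC 8693 token exchange request."""
    params = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": okta_jwt,
        # Assumption: the validated Okta JWT is presented as a JWT-typed subject token.
        "subject_token_type": JWT_TOKEN_TYPE,
    }
    if audience:
        params["audience"] = audience  # optional in RFC 8693
    return urlencode(params)
```

The body would be sent with `Content-Type: application/x-www-form-urlencoded` to whichever token endpoint is eventually confirmed.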
Docker deployment
The project includes Dockerfiles for both the gateway and the Admin API:
# Build and run all services
docker compose up
Note: the Docker Compose file uses development defaults (plain dataworld/dataworld Postgres credentials). Update for any non-local deployment.
AI-Powered Discovery
The discovery engine scans your data.world instance and uses an LLM to generate instance-specific tool descriptions, so agents arrive pre-tuned to your actual domain taxonomy, collection structure, and responsible-party roles.
Configure DISCOVERY_LLM_MODEL with any LiteLLM-compatible model string:
| Provider | Model string example |
|---|---|
| Anthropic | claude-sonnet-4-5 |
| OpenAI | gpt-4o-mini |
| Azure OpenAI | azure/gpt-4o-mini |
| Google Gemini | gemini/gemini-1.5-flash |
| Amazon Bedrock | bedrock/anthropic.claude-sonnet-4-5 |
| None | Omit; the template analyser is used |
Run a discovery scan from the Admin UI at http://localhost:5173.
Admin UI
⚠️ The Admin UI is in a primitive early development state. The core functionality works (login, tool configuration, discovery scan management, telemetry dashboard), but the interface has not undergone design review, usability testing, or full feature development. Expect rough edges, missing polish, and incomplete workflows.
Access the Admin UI at http://localhost:5173 after starting the dev server (`npm run dev` in `admin_ui/`).
Log in with the `ADMIN_BOOTSTRAP_KEY` value from `admin_api/.env`.
What works:
- Viewing and toggling MCP tools on/off (changes propagate to the live gateway without restart)
- Viewing and editing tool descriptions
- Running basic and advanced discovery scans
- Reviewing and approving discovery recommendations
- Telemetry dashboard (tool call counts, latency)
SP6 (Okta SSO for Admin UI): The "Login with Okta SSO" button is a stub. Full Okta OIDC login for admin sessions is planned as SP6 and has not been implemented.
Developer Guide
Project structure
data.world-FastMCP-Platform/
├── src/dataworld_mcp/           # MCP Gateway (the core product)
│   ├── tools/                   # The 7 MCP tools
│   │   ├── catalog.py           # search_catalog, describe_dataset, list_collections
│   │   ├── governance.py        # get_access_policy
│   │   ├── knowledge.py         # get_glossary_terms, get_lineage, get_related_resources
│   │   └── url_builder.py       # Source URL construction utility
│   ├── auth/                    # Okta JWT validation
│   ├── client/                  # data.world API client (httpx + tenacity)
│   ├── telemetry/               # Tool call event buffering and middleware
│   ├── config.py                # Environment configuration (single source of truth)
│   ├── config_listener.py       # PostgreSQL LISTEN for live config updates
│   └── server.py                # FastMCP server instance
├── admin_api/                   # Control plane API (FastAPI)
│   ├── src/dataworld_admin/
│   │   ├── discovery/           # LLM-powered catalog scan engine
│   │   ├── tools/               # Tool config management
│   │   ├── telemetry/           # Telemetry persistence
│   │   └── auth/                # Admin API authentication
│   └── alembic/                 # Database migrations
├── admin_ui/                    # Admin frontend (React + TypeScript)
│   └── src/
├── tests/                       # MCP gateway test suite
├── docs/                        # Architecture docs, briefs, system prompt guidance
├── docker-compose.yml
├── Dockerfile.gateway
└── pyproject.toml
Adding a new tool
- Add the tool function to the appropriate file in `src/dataworld_mcp/tools/` using the `@mcp.tool()` decorator
- Follow the XML docstring format (`<usecase>` / `<instructions>`) for consistent tool selection behaviour
- Include `source_url` in the response (extract from API response or construct with `dataset_url()`)
- Append the citation hint to `next_step`: "source_url fields in results can be used as inline markdown citations [title](url) in agent responses."
- Add tests in `tests/`
- Import the module in `src/dataworld_mcp/__main__.py` to register the tool
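The response requirements in the steps above (a `source_url` field plus a `next_step` hint ending in the citation sentence) can be sketched as a small helper. The flat dict shape, the `build_response` name, and the sample values are assumptions for illustration. The citation hint text is quoted from the step above.

```python
CITATION_HINT = (
    "source_url fields in results can be used as inline markdown citations "
    "[title](url) in agent responses."
)


def build_response(data, source_url, next_step):
    """Assemble a tool response carrying the two fields every tool must return."""
    return {
        **data,
        "source_url": source_url,
        # The tool-specific hint comes first, followed by the shared citation hint.
        "next_step": f"{next_step} {CITATION_HINT}",
    }


# Hypothetical values; a real tool would return this from an @mcp.tool() function.
response = build_response(
    {"title": "Example dataset"},
    "https://data.world/example-org/example-dataset",
    "Call describe_dataset for the full schema.",
)
```

Centralising the hint in one constant keeps the wording identical across tools, which makes it easier to assert on in the tests added in `tests/`.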
Tool description format
@mcp.tool()
async def my_tool(param: str) -> dict:
"""
<usecase>
Use when the agent wants to [accomplish X]. Call after [Y] to [Z].
</usecase>
<instructions>
Provide [param description]. Returns [response structure].
</instructions>
"""
The `<usecase>` block drives tool selection. The `<instructions>` block drives tool use. Keep them separate: mixing them degrades tool selection accuracy.
Known Issues & Limitations
See KNOWN_ISSUES.md for the full list. Key items:
- Okta token exchange endpoint unconfirmed: `AUTH_MODE=okta` requires validation with data.world enterprise support before production use
- Secrets manager not implemented: credentials are read from `.env` files; AWS/Azure/Vault/K8s secrets integration is planned
- Admin UI is primitive: SP6 (Okta SSO), full UI polish, and complete workflow coverage are planned
- Enterprise source URLs (Group 2 tools): constructed URLs may not resolve on some enterprise instance configurations
- Lineage node-level citations: `get_lineage` returns a top-level `source_url` only; per-node source links are deferred to a future release
Roadmap
| Item | Description |
|---|---|
| SP6: Okta SSO for Admin UI | Replace bootstrap key login with full Okta OIDC for admin sessions |
| Production hardening | Secrets manager integration, confirmed token exchange endpoint, scanner service account provisioning |
| Dedicated citation agent | Opinionated agent for deployment to enterprise agent marketplaces (Gemini Enterprise, A2A), built on this gateway |
| MCP `resource_link` content type | Migrate to the MCP 2025-06-18 spec's typed `resource_link` content blocks when FastMCP adds first-class support |
| Lineage node-level `source_url` | Per-node source links in `get_lineage` responses (V2) |
Contributing
This project is in active development. Issues and pull requests are welcome.
Before contributing:
- Run the test suite: `pytest` (gateway) and `cd admin_api && pytest` (Admin API)
- Follow the existing tool description format (XML docstrings with `<usecase>` / `<instructions>`)
- New tools must include `source_url` in responses and append the citation hint to `next_step`
- Keep `url_builder.py` as the single source of truth for URL construction; do not read `DATAWORLD_UI_BASE_URL` directly in tool files
License
[License TBD: not yet specified for this pre-release]
