Dataworkers Claw Community
We're building a swarm of agents for all data tasks, free for anyone to use as an open-source community version.
Data Workers - Open-Source Community Edition
Open-source autonomous AI agents for data engineering
Stop writing boilerplate pipelines. Stop debugging data incidents manually.
Describe what you need in natural language. The agents handle execution.
11 AI agents · 160+ MCP tools · 15 connectors · 2,900+ tests · Zero config to start
What is Data Workers?
Data Workers is a coordinated swarm of AI agents that automate the full spectrum of data engineering workflows. Each agent is a standalone MCP (Model Context Protocol) server that exposes domain-specific tools to Claude Code, OpenCode, Cursor, VS Code, and any MCP-compatible client.
The problem: Data engineers spend 60%+ of their time on undifferentiated work -- writing pipeline boilerplate, debugging data incidents at 2am, chasing schema changes across teams, manually cataloging assets, and fighting governance paperwork.
The solution: 11 autonomous agents that understand your data stack end-to-end. They build pipelines, detect anomalies, manage catalogs, enforce governance, track ML experiments, and more -- all through natural language via the MCP protocol your AI tools already speak.
Everything runs locally with in-memory stubs by default. No external services required. No data leaves your machine. BYO model -- use any LLM provider.
Read more: Why We Open-Sourced Data Workers
Get Started
git clone https://github.com/DataWorkersProject/dataworkers-claw-community.git
cd dataworkers-claw-community
npm install
Then add agents to Claude Code (run from inside the cloned repo):
claude mcp add dw-pipelines -- "$(pwd)/start-agent.sh" dw-pipelines && \
claude mcp add dw-incidents -- "$(pwd)/start-agent.sh" dw-incidents && \
claude mcp add dw-catalog -- "$(pwd)/start-agent.sh" dw-context-catalog && \
claude mcp add dw-schema -- "$(pwd)/start-agent.sh" dw-schema && \
claude mcp add dw-quality -- "$(pwd)/start-agent.sh" dw-quality && \
claude mcp add dw-governance -- "$(pwd)/start-agent.sh" dw-governance && \
claude mcp add dw-usage -- "$(pwd)/start-agent.sh" dw-usage-intelligence && \
claude mcp add dw-observability -- "$(pwd)/start-agent.sh" dw-observability && \
claude mcp add dw-connectors -- "$(pwd)/start-agent.sh" dw-connectors && \
claude mcp add dw-ml -- "$(pwd)/start-agent.sh" dw-ml
Start Claude Code and ask:
- "Search the catalog for customer-related tables"
- "Show me the full lineage for the orders table"
- "Why did the orders table row count drop 40% yesterday?"
- "Scan the customer schema for PII and suggest masking policies"
- "Compare the last two ML experiments and explain the accuracy difference"
Everything works instantly with in-memory seed data -- no infrastructure required.
Client configuration
Each agent can be started via the start-agent.sh script, which handles working directory and dependency resolution. Replace /path/to/dataworkers-claw-community with your clone location.
Claude Code (.mcp.json in your project root):
{
"mcpServers": {
"dw-pipelines": {
"command": "/path/to/dataworkers-claw-community/start-agent.sh",
"args": ["dw-pipelines"]
},
"dw-catalog": {
"command": "/path/to/dataworkers-claw-community/start-agent.sh",
"args": ["dw-context-catalog"]
},
"dw-quality": {
"command": "/path/to/dataworkers-claw-community/start-agent.sh",
"args": ["dw-quality"]
}
}
}
Cursor (.cursor/mcp.json), same format:
{
"mcpServers": {
"dw-pipelines": {
"command": "/path/to/dataworkers-claw-community/start-agent.sh",
"args": ["dw-pipelines"]
},
"dw-incidents": {
"command": "/path/to/dataworkers-claw-community/start-agent.sh",
"args": ["dw-incidents"]
}
}
}
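The Claude Code and Cursor examples above share the mcpServers shape but list only a few agents. As a rough sketch, the full file could be generated from the server-name-to-package mapping used in the claude mcp add commands. The helper name buildMcpConfig is hypothetical and not part of the repo; only the names, packages, and start-agent.sh path come from this README.

```typescript
// Server name -> agent package, copied from the `claude mcp add` commands in Get Started.
const AGENTS: ReadonlyArray<[string, string]> = [
  ["dw-pipelines", "dw-pipelines"],
  ["dw-incidents", "dw-incidents"],
  ["dw-catalog", "dw-context-catalog"],
  ["dw-schema", "dw-schema"],
  ["dw-quality", "dw-quality"],
  ["dw-governance", "dw-governance"],
  ["dw-usage", "dw-usage-intelligence"],
  ["dw-observability", "dw-observability"],
  ["dw-connectors", "dw-connectors"],
  ["dw-ml", "dw-ml"],
];

interface StdioServer {
  command: string;
  args: string[];
}

// Hypothetical helper (an assumption for illustration, not shipped with the project).
function buildMcpConfig(repoPath: string): { mcpServers: Record<string, StdioServer> } {
  const root = repoPath.replace(/\/+$/, ""); // drop any trailing slashes
  const mcpServers: Record<string, StdioServer> = {};
  for (const [name, pkg] of AGENTS) {
    mcpServers[name] = { command: root + "/start-agent.sh", args: [pkg] };
  }
  return { mcpServers };
}

// JSON text suitable for .mcp.json or .cursor/mcp.json.
const mcpJson: string = JSON.stringify(
  buildMcpConfig("/path/to/dataworkers-claw-community"),
  null,
  2,
);
```

Writing mcpJson to disk is left out so the sketch stays free of Node-specific imports.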
OpenCode (opencode.json in your project root):
{
"mcp": {
"dw-pipelines": {
"type": "local",
"command": ["/path/to/dataworkers-claw-community/start-agent.sh", "dw-pipelines"],
"enabled": true
},
"dw-catalog": {
"type": "local",
"command": ["/path/to/dataworkers-claw-community/start-agent.sh", "dw-context-catalog"],
"enabled": true
}
}
}
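The OpenCode file differs from the Claude Code and Cursor files only in shape: the top-level key is mcp, and command and args are merged into a single array. A minimal sketch of that conversion, based solely on the two examples shown in this section (toOpenCodeConfig is a hypothetical helper, and other OpenCode options are not covered):

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

interface OpenCodeEntry {
  type: "local";
  command: string[];
  enabled: boolean;
}

// Hypothetical helper: converts the mcpServers shape into the opencode.json shape shown above.
function toOpenCodeConfig(
  mcpServers: Record<string, McpServerEntry>,
): { mcp: Record<string, OpenCodeEntry> } {
  const mcp: Record<string, OpenCodeEntry> = {};
  for (const name of Object.keys(mcpServers)) {
    const server = mcpServers[name];
    if (!server) continue;
    mcp[name] = {
      type: "local",
      // OpenCode takes the executable and its arguments as one array.
      command: [server.command].concat(server.args),
      enabled: true,
    };
  }
  return { mcp };
}

const openCodeExample = toOpenCodeConfig({
  "dw-catalog": {
    command: "/path/to/dataworkers-claw-community/start-agent.sh",
    args: ["dw-context-catalog"],
  },
});
```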
Agents
| Agent | Package | Description | Tools |
|---|---|---|---|
| Pipelines | dw-pipelines | NL-to-pipeline generation, template engine, Iceberg MERGE INTO, Kafka events, Airflow deployment. Write tools (generate_pipeline, deploy_pipeline) require Pro. | 4 |
| Incidents | dw-incidents | Statistical anomaly detection, graph-based root cause analysis, playbook execution | 5 |
| Catalog | dw-context-catalog | Hybrid search (vector + BM25 + graph), lineage traversal, Iceberg crawler | 35 |
| Schema | dw-schema | INFORMATION_SCHEMA diffs, rename detection, Iceberg snapshot evolution | 9 |
| Quality | dw-quality | Weighted 5-dimension scoring, z-score anomaly detection, 14-day baselines | 6 |
| Governance | dw-governance | Priority-based policy engine, 3-pass PII scanner (regex + values + LLM) | 6 |
| Usage Intelligence | dw-usage-intelligence | Practitioner analytics, workflow patterns, adoption dashboards, heatmaps (zero LLM) | 26 |
| Observability | dw-observability | SHA-256 audit trail, drift detection, agent metrics (p50/p95/p99), health monitoring | 6 |
| Connectors | dw-connectors | Unified MCP gateway to 15 catalog connectors | 56 |
| Orchestration | dw-orchestration | Priority scheduler, heartbeat monitor, agent registry, event choreography | internal (not MCP) |
| MLOps & Models | dw-ml | Experiment tracking, model registry, feature pipelines, SHAP explainability, drift detection, A/B testing. Write tools (train_model, deploy_model, create_experiment, log_metrics, register_model, create_feature_pipeline, ab_test_models) require Pro. | 16 |
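The Quality row above mentions z-score anomaly detection against 14-day baselines. The following is a minimal sketch of that general technique, not the agent's actual implementation: the function names, the threshold of 3, and the sample row counts are all assumptions for illustration.

```typescript
// z-score of `latest` relative to a baseline window (sample standard deviation).
function zScore(baseline: number[], latest: number): number {
  const n = baseline.length;
  if (n < 2) throw new Error("baseline needs at least two points");
  const mean = baseline.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    baseline.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (n - 1);
  const std = Math.sqrt(variance);
  if (std === 0) {
    // A perfectly flat baseline: any deviation is infinitely surprising.
    if (latest === mean) return 0;
    return latest > mean ? Infinity : -Infinity;
  }
  return (latest - mean) / std;
}

// Threshold of 3 standard deviations is an assumed default, not the agent's setting.
function isAnomalous(baseline: number[], latest: number, threshold = 3): boolean {
  return Math.abs(zScore(baseline, latest)) > threshold;
}

// Hypothetical 14 days of row counts for an orders table.
const last14Days = [
  1000, 1010, 990, 1005, 995, 1002, 998,
  1001, 999, 1003, 997, 1004, 996, 1000,
];

const dropFlagged = isAnomalous(last14Days, 600);    // a 40% drop from the mean
const normalFlagged = isAnomalous(last14Days, 1003); // within normal variation
```

With this baseline, a count of 600 lies far outside three standard deviations, while 1003 does not, which mirrors the "row count dropped 40%" prompt in Get Started.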
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                           MCP Clients                           │
│   Claude Code · OpenCode · Cursor · VS Code · Any MCP Client    │
└────────────────────────────┬────────────────────────────────────┘
                             │  MCP Protocol (JSON-RPC 2.0 / stdio)
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                    11 AI Agents (160+ tools)                    │
│                                                                 │
│  pipelines · incidents · catalog · schema · quality · governance│
│  usage-intelligence · observability · connectors · orchestration│
│  ml                                                             │
└────────────────────────────┬────────────────────────────────────┘
                             │  Factory-injected dependencies
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                   Core Platform (9 packages)                    │
│  MCP Framework · Context Layer · Agent Lifecycle · Validation   │
│  Conflict Resolution · Enterprise · Orchestrator · Platform     │
│  Medallion (Bronze → Silver → Gold lakehouse management)        │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│              Infrastructure Adapters (auto-detect)              │
│  Redis · Kafka · PostgreSQL · Neo4j · pgvector · PG FTS         │
│  LLM Bridge · Warehouse Bridge · Airflow                        │
│  (falls back to InMemory stubs when services unavailable)       │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                      15 Catalog Connectors                      │
│  Snowflake · BigQuery · Databricks · dbt · Iceberg · Glue       │
│  Hive · DataHub · OpenMetadata · Purview · Dataplex · Nessie    │
│  Polaris · OpenLineage · Lake Formation                         │
└─────────────────────────────────────────────────────────────────┘
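The diagram shows clients reaching agents over JSON-RPC 2.0 on stdio. As a rough illustration, this is what a single MCP tools/call request might look like when serialized for the stdio transport, which uses one JSON message per line. The tool name search_catalog and its query argument are hypothetical; the real tool names come from each agent's tools/list response, and a real session begins with an initialize handshake that is omitted here.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Builds an MCP `tools/call` request for a named tool.
function toolsCallRequest(
  id: number,
  tool: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  };
}

// The stdio transport is newline-delimited: one JSON message per line.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// "search_catalog" is a hypothetical tool name used only for illustration.
const exampleLine = frame(toolsCallRequest(1, "search_catalog", { query: "customer" }));
```

MCP clients such as Claude Code build and send these messages themselves, so this is only useful for debugging an agent by hand.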
Connectors
Data Workers includes 15 catalog connectors out of the box. Additional enterprise connectors are available in Pro/Enterprise editions.
Catalog Connectors (15)
| Connector | Description |
|---|---|
| Snowflake | Databases, tables, DDL, usage stats |
| BigQuery | Datasets, tables, schema, cost estimation |
| Databricks | Unity Catalog, tables, query history |
| AWS Glue | Databases, tables, partitions |
| Lake Formation | Permissions, grants, resource listing |
| Hive Metastore | Thrift-based database/table/partition access |
| dbt | Models, lineage, test results, run history |
| DataHub | Entity search, metadata, lineage, usage stats |
| OpenMetadata | Entity search, lineage, tags, glossary |
| Purview | Catalog search, entity metadata, classifications |
| Dataplex | Lakes, zones, assets, data quality, discovery |
| Nessie | Git-like branching, commits, merges, content versioning |
| Apache Iceberg | REST Catalog, time travel, schema evolution, statistics |
| Apache Polaris | Multi-catalog federation, OAuth2, permission policies |
| OpenLineage | Lineage graphs, job runs, column lineage, event emission |
Enterprise Connectors (35) -- available in Pro/Enterprise editions
| Category | Connectors |
|---|---|
| Orchestration (11) | Airflow, Dagster, Prefect, AWS Step Functions, Azure Data Factory, dbt Cloud, Cloud Composer, Temporal, Mage, Kestra, Argo |
| Alerting (5) | PagerDuty, Slack, Microsoft Teams, OpsGenie, New Relic |
| Quality (6) | Great Expectations, Soda, Monte Carlo, Anomalo, Bigeye, Elementary |
| BI (5) | Looker, Tableau, Metabase, Sigma, Superset |
| Observability (2) | OpenTelemetry, Datadog |
| Identity (2) | Okta, Azure AD |
| ITSM (2) | ServiceNow, Jira Service Management |
| Cost (1) | AWS Cost Explorer |
| Streaming (1) | Kafka Schema Registry |
Community Edition includes up to 3 enterprise connectors. See pricing for details.
Project Structure
dataworkers-claw-community/
├── agents/                      # 11 agent MCP servers
│   ├── dw-pipelines/            # Write tools (generate, deploy) require Pro
│   ├── dw-incidents/
│   ├── dw-context-catalog/
│   ├── dw-schema/
│   ├── dw-quality/
│   ├── dw-governance/
│   ├── dw-usage-intelligence/
│   ├── dw-observability/
│   ├── dw-connectors/
│   ├── dw-orchestration/
│   └── dw-ml/                   # Write tools require Pro
├── core/                        # 9 shared platform packages
│   ├── mcp-framework/           # Base MCP server class
│   ├── infrastructure-stubs/    # 9 interfaces + InMemory stubs + real adapters
│   ├── llm-provider/            # Multi-provider LLM abstraction
│   ├── medallion/               # Bronze/Silver/Gold lakehouse management
│   ├── enterprise/              # Enterprise middleware shim (no-op in Community Edition)
│   ├── orchestrator/            # Multi-agent coordination
│   ├── context-layer/           # Shared context for cross-agent communication
│   └── ...
├── connectors/                  # 15 catalog connectors
├── packages/                    # CLI (dw-claw) and VS Code extension
├── tests/                       # Contract, integration, e2e, and eval tests
├── docker/                      # Dockerfiles and compose
└── docs/                        # Architecture specs and guides
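The tree lists core/infrastructure-stubs as holding interfaces, InMemory stubs, and real adapters, and the architecture falls back to the stubs when a service is unavailable. A minimal sketch of that pattern follows. The interface, the class, the createStore factory, and the REDIS_URL variable name are all assumptions for illustration, not the project's actual API; see .env.example for the real configuration keys.

```typescript
// Hypothetical interface; the project's real interfaces may look different.
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// In-memory stub: no external service needed.
class InMemoryKeyValueStore implements KeyValueStore {
  private readonly data = new Map<string, string>();

  get(key: string): string | undefined {
    return this.data.get(key);
  }

  set(key: string, value: string): void {
    this.data.set(key, value);
  }
}

type StoreFactory = (url: string) => KeyValueStore;

// Uses the real adapter only when it is configured and can be constructed;
// otherwise falls back to the in-memory stub.
function createStore(
  env: Record<string, string | undefined>,
  realFactory?: StoreFactory,
): KeyValueStore {
  const url = env["REDIS_URL"]; // assumed variable name
  if (url && realFactory) {
    try {
      return realFactory(url);
    } catch {
      // Service unreachable or misconfigured: fall through to the stub.
    }
  }
  return new InMemoryKeyValueStore();
}

const defaultStore = createStore({});
defaultStore.set("greeting", "hello");
```

Passing the environment and the adapter factory in as arguments keeps the selection logic testable without any running service.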
Development
npm test # Run all tests (2,900+, no external services required)
npm run build # Build all packages
npm run lint # Lint
npm run typecheck # Type-check
cd agents/dw-pipelines && npm run dev # Run a single agent in dev mode
Troubleshooting
Agent fails to start: Ensure you're using start-agent.sh (not node directly). The script sets the working directory correctly for tsx module resolution. See docs/MCP-STARTUP-BUG-REPORT.md for details.
Module not found errors: Run npm install from the repo root. The monorepo uses npm workspaces -- all dependencies are hoisted.
Tests fail on fresh clone: Make sure Node.js >= 20 is installed. Run npm install before npm test.
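To check the Node.js requirement from a script before running npm test, a small version check along these lines may help. The function names are hypothetical, and the minimum of 20 is taken from the note above.

```typescript
// Extracts the major version from strings like "v20.11.0" or "18.19.1".
function nodeMajor(version: string): number {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) throw new Error("unrecognized version string: " + version);
  return Number(match[1]);
}

// Minimum major version of 20 comes from the troubleshooting note above.
function meetsMinimumNode(version: string, minMajor = 20): boolean {
  return nodeMajor(version) >= minMajor;
}

const okOn20 = meetsMinimumNode("v20.11.0");
const okOn18 = meetsMinimumNode("18.19.1");
```

In a real script, pass process.versions.node (or the output of node --version) as the version argument.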
Known Limitations
- npm packages require the cloned repo. npx dw-claw and npx data-context-mcp depend on workspace packages that aren't published individually. Use the start-agent.sh approach for now.
- dw-orchestration is an internal service, not an MCP agent. It provides task scheduling and agent coordination APIs used by other agents.
- Write operations require Pro. Tools like generate_pipeline, deploy_model, and train_model return upgrade prompts in the Community Edition.
Contributing
We welcome contributions. See CONTRIBUTING.md for guidelines on reporting bugs, setting up your dev environment, submitting PRs, and code style.
Join the Data Workers Community on Discord to ask questions and connect with other contributors.
Further Reading
| Topic | Link |
|---|---|
| Infrastructure details | docs/ARCHITECTURE.md |
| Configuration (env vars) | .env.example |
| Tiers & Pricing | dataworkers.io/pricing |
| Security | SECURITY.md |
| License | LICENSE (Apache 2.0) |
| LLM Data Disclosure | docs/LLM-DATA-DISCLOSURE.md |
| API Reference | docs/API.md |
Built by Data Workers · Discord · Twitter · Website
