Sample Agentic Data Lineage
MCP server: Sample Agentic Data Lineage
Installation
npx sample-agentic-data-lineageAsk AI about Sample Agentic Data Lineage
Powered by Claude Β· Grounded in docs
I know everything about Sample Agentic Data Lineage. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
Sample Agentic Data Lineage
A comprehensive data lineage and governance solution combining multiple technologies and approaches for modern data platforms.
ποΈ Project Overview
This repository contains a collection of integrated projects that demonstrate end-to-end data lineage tracking, from data ingestion through transformation to consumption, with intelligent agent-based analysis and governance.
π Project Structure
sample-agentic-data-lineage/
βββ dbt-redshift-openlineage/ # dbt + Redshift + OpenLineage integration
βββ glue-openlineage/ # AWS Glue + OpenLineage integration
βββ lakehouse-core/ # Core lakehouse infrastructure
βββ marquez-agents/ # Intelligent lineage analysis agents
βββ marquez-mcp/ # Marquez MCP server implementation
βββ redshift-enternal-tables-auto-creation/ # Automated external table creation
π Quick Start
1. Environment Setup
# Install Python dependencies
pip install -e .
# Or use uv (recommended)
uv venv --python 3.13.0
source .venv/bin/activate
uv sync --active
2. Installation Order & Setup
Follow this step-by-step installation order to set up the complete data lineage solution:
Step 1: Marquez MCP Server Preparation
# Prepare Marquez MCP server and build Docker image
cd marquez-mcp/
# Follow marquez-mcp/README.md for MCP server setup and image building
This prepares the MCP (Model Context Protocol) interface for Marquez integration and builds the required Docker image.
Step 2: Marquez Agents Preparation
# Prepare intelligent lineage analysis agents and build Docker image
cd marquez-agents/
# Follow marquez-agents/README.md for agent setup and image building
This prepares AI-powered agents for automated lineage analysis and governance and builds the required Docker image.
Step 3: Core Infrastructure Setup
# Deploy lakehouse-core infrastructure
cd lakehouse-core/
# Follow lakehouse-core/README.md for detailed setup
This sets up the foundational AWS infrastructure including Marquez, Karpenter, and Kubernetes environment.
Step 4: Glue OpenLineage Integration
# Deploy Glue OpenLineage infrastructure
cd glue-openlineage/
# Follow glue-openlineage/README.md for detailed setup
This configures AWS Glue with OpenLineage integration for metadata extraction.
Step 5: Manual Glue Database & Job Setup
After infrastructure deployment, manually perform these steps in AWS Console:
- Create test database in AWS Glue Data Catalog
- Run Glue jobs to populate metadata and generate lineage events
- Verify data catalog entries and lineage data flow
Step 6: External Table Mapping
# Set up automated external table creation
cd redshift-enternal-tables-auto-creation/
# Follow setup instructions to map Glue tables to Redshift
This creates external tables in Redshift based on Glue Data Catalog entries.
Step 7: dbt Project Execution
# Run dbt transformations with lineage tracking
cd dbt-redshift-openlineage/
# Execute dbt models to generate column-level lineage
dbt run
This generates detailed column-level lineage through dbt transformations.
Individual Project Documentation
Each subproject has its own README with detailed setup instructions:
- lakehouse-core: Core infrastructure and utilities
- glue-openlineage: AWS Glue metadata extraction
- redshift-enternal-tables-auto-creation: Automated Redshift external table management
- dbt-redshift-openlineage: dbt data transformation with advanced column-level lineage and multi-column dependency detection
- marquez-mcp: MCP server for Marquez integration
- marquez-agents: AI-powered lineage analysis and governance
π― Key Features
Data Lineage Tracking
- Advanced column-level lineage with multi-column dependency detection through dbt transformations
- Complex SQL pattern recognition for CASE WHEN, aggregations, and conditional logic
- Cross-system lineage between AWS Glue, Redshift, and S3
- OpenLineage compliance for standardized metadata exchange
- Real-time lineage updates with automated synchronization
Column-level lineage visualization in Marquez showing data flow and transformations
Dataset overview and metadata in Marquez interface
Intelligent Analysis
- AI-powered agents for lineage health analysis
- Automated governance recommendations and reporting
- Multi-agent collaboration for complex analysis tasks
- Natural language queries for lineage exploration
Multi-agent collaboration for complex data lineage analysis tasks
Infrastructure Automation
- Automated external table creation from Glue to Redshift
- Infrastructure as Code for reproducible deployments
- Environment management with standardized tooling
- Scalable architecture supporting enterprise workloads
π§ Technology Stack
Core Data Platform
- Data Transformation: dbt-core, dbt-redshift, SQL
- Data Warehousing: Amazon Redshift
- Data Lake: AWS S3, AWS Glue Data Catalog
- Lineage Management: Marquez, OpenLineage
- Column-level Lineage: dbt-colibri plugin
AI & Agent Framework
- AI Services: Amazon Bedrock (Claude Sonnet)
- Agent Framework: Strands Agents
- Multi-Agent System: Custom orchestrator with specialized agents
- Model Context Protocol: FastMCP, MCP
Infrastructure & DevOps
- Infrastructure as Code: Terraform
- Container Orchestration: Kubernetes, Karpenter
- Testing Environment: Kubernetes Pods
- CI/CD: GitLab CI/CD with comprehensive security scanning
- Security: Probe scanning, SAST, dependency scanning, secret detection
Development Stack
- Languages: Python 3.13+
- Package Management: uv (primary), pip
- Configuration Management: dynaconf, YAML
- HTTP Clients: httpx, aiohttp, requests
- Database Connectivity: psycopg2-binary (PostgreSQL/Redshift)
- AWS Integration: boto3
Web & UI
- Web Framework: Streamlit (agent interfaces)
- API Protocol: MCP (Model Context Protocol)
- Configuration: TOML, YAML, environment variables
π Use Cases
Data Engineering
- Track data transformations across the entire pipeline
- Understand impact analysis before making changes
- Automate data quality and freshness monitoring
- Generate comprehensive lineage documentation
Data Governance
- Identify data assets not under governance
- Monitor data freshness and quality metrics
- Analyze compliance with data policies
- Generate governance reports and recommendations
Analytics & BI
- Understand data provenance for reports and dashboards
- Validate data quality for critical business metrics
- Track data dependencies across analytical workflows
- Ensure data consistency across different systems
π Security
This project implements comprehensive security scanning in the CI/CD pipeline:
- π― Probe Security Scanning: Automated vulnerability detection on every commit
- π Static Analysis: SAST scanning for Python, SQL, and infrastructure code
- π¦ Dependency Scanning: Vulnerability detection in all project dependencies
- π Secret Detection: Prevents accidental commit of sensitive information
- ποΈ Infrastructure Security: Terraform and Docker security best practices
- π License Compliance: Automated license compatibility checking
Security Dashboard
In GitLab: Refer to the Probe scan results which are automatically running for every code commit. Access the Security Dashboard at:
- Project β Security & Compliance β Security Dashboard
- View detailed reports in merge request security widgets
- Download security artifacts from pipeline runs
For detailed security information, see SECURITY.md.
π€ Contributing
Each subproject follows its own development practices. Please refer to individual project READMEs for specific contribution guidelines.
Security Requirements for Contributors
- All commits must pass security scanning (Probe, SAST, dependency checks)
- No high-severity vulnerabilities allowed in new code
- Follow secure coding practices outlined in SECURITY.md
- Run pre-commit hooks before submitting changes
π License
This project is licensed under the MIT-0 License - see the LICENSE file for details.
All subprojects are covered under the same MIT-0 License unless otherwise specified.
Third-Party Licenses
This project uses OpenLineage, which is licensed under the Apache License 2.0.
- OpenLineage Copyright Β© The OpenLineage Authors
- Licensed under the Apache License, Version 2.0
- Full license text: https://github.com/OpenLineage/OpenLineage/blob/main/LICENSE
