PySpark MCP Server
SQL migration assistance, AWS Glue job generation, and Spark code optimization – packaged as an MCP server.
What It Does
- SQL Dialect Transpilation – Convert between PostgreSQL, Oracle, Redshift, MySQL, Snowflake, and Spark SQL using SQLGlot
- PySpark DataFrame API Generation – Generate DataFrame API code from SQL with optimization hints
- AWS Glue Integration – Job templates, DynamicFrame conversions, Data Catalog definitions, S3 optimization strategies
- Batch Processing – Process hundreds of SQL files concurrently
- Code Review & Optimization – Analyze existing PySpark code for performance improvements
- Pattern Detection – Find code duplication and suggest refactoring
What It Doesn't Do
- Recursive CTEs – provides a Spark SQL equivalent plus guidance (PySpark has no native recursive CTE support)
- MERGE/PIVOT/CONNECT BY – transpiles to Spark SQL, provides DataFrame API guidance
- Perfect 1:1 DataFrame API transpilation for all SQL – complex queries get Spark SQL plus optimization recommendations
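Because PySpark lacks recursive CTEs, the usual workaround is an iterative fixed-point loop: repeatedly union newly derived rows into the result until nothing changes. A minimal stdlib sketch of that pattern (plain Python over tuples rather than DataFrames, purely illustrative of the loop structure):

```python
# Emulate WITH RECURSIVE by iterating until a fixed point is reached.
# edges: (parent, child) pairs; we compute the transitive closure,
# which is what a recursive CTE over a hierarchy typically produces.
def transitive_closure(edges):
    closure = set(edges)
    while True:
        # Self-join the accumulated result, as the recursive step would.
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:  # no new rows: fixed point reached
            return closure
        closure |= new

edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(sorted(transitive_closure(edges)))
```

In PySpark the same shape applies, with `DataFrame.union` as the accumulation step and a row-count check as the termination test.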
Quick Start
pip install -e .   # from a clone of the repository
pyspark-mcp        # starts the MCP server
MCP Configuration
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "pyspark": {
      "command": "pyspark-mcp",
      "args": []
    }
  }
}
Hermes Agent
Add to ~/.hermes/config.yaml:
mcp:
  servers:
    pyspark:
      command: pyspark-mcp
      enabled_tools: all
Docker
docker compose up -d
Tools
SQL Conversion
- convert_sql_to_pyspark – Convert SQL to PySpark with dialect detection
- analyze_sql_context – Analyze SQL complexity and suggest an approach
AWS Glue
- generate_aws_glue_job_template – Generate complete Glue job scripts
- convert_dataframe_to_dynamic_frame – DataFrame → DynamicFrame conversion
- generate_data_catalog_table_definition – Data Catalog table definitions
- generate_incremental_processing_job – Incremental/CDC job generation
- analyze_s3_optimization_opportunities – S3 layout and partitioning analysis
Optimization
- review_pyspark_code – Code review with performance recommendations
- optimize_pyspark_code – Suggest optimizations for existing code
- recommend_join_strategy – Broadcast vs. shuffle join recommendations
- suggest_partitioning_strategy – Partitioning recommendations
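The broadcast-vs-shuffle recommendation follows Spark's own planner rule: broadcast the smaller side when it fits under `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), otherwise fall back to a sort-merge (shuffle) join. A hypothetical stdlib sketch of that decision (function name and return strings are illustrative, not the server's actual code):

```python
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10 MB

def recommend_join_strategy(left_bytes: int, right_bytes: int) -> str:
    """Mirror Spark's heuristic: broadcast the smaller relation if it
    fits under the threshold, otherwise use a sort-merge (shuffle) join."""
    if min(left_bytes, right_bytes) <= AUTO_BROADCAST_THRESHOLD:
        side = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast the {side} side"
    return "sort-merge (shuffle) join"

# A 2 MB dimension table joined against a 500 MB fact table:
print(recommend_join_strategy(2 * 1024 * 1024, 500 * 1024 * 1024))
# broadcast the left side
```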
Batch Processing
- batch_process_files – Process multiple SQL files concurrently
- batch_process_directory – Convert entire directories
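Concurrent batch conversion is the standard thread-pool fan-out pattern. A stdlib sketch of how many files can be processed in parallel (`convert_one` is a stand-in for the server's real conversion logic, not its actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def convert_one(sql_text: str) -> str:
    # Stand-in for the real SQL -> PySpark conversion step.
    return sql_text.strip().lower()

def batch_convert(sql_files: dict, max_workers: int = 8) -> dict:
    # Fan the files out across a thread pool, collect results by name.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(convert_one, text)
                   for name, text in sql_files.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = batch_convert({"a.sql": "SELECT 1", "b.sql": "SELECT 2"})
print(results)
```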
Development
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Test
pytest tests/ -v --cov=pyspark_tools
# Format
black pyspark_tools tests
isort pyspark_tools tests
# Lint
flake8 pyspark_tools tests
Architecture
pyspark_tools/
├── server.py               # FastMCP server + tool definitions
├── sql_converter.py        # SQLGlot-based transpilation + DataFrame API generation
├── aws_glue_integration.py # Glue job templates, DynamicFrame, Data Catalog
├── advanced_optimizer.py   # Performance analysis + optimization suggestions
├── batch_processor.py      # Concurrent file processing
├── code_reviewer.py        # PySpark code review patterns
├── duplicate_detector.py   # Code deduplication
├── data_source_analyzer.py # Data source analysis
└── file_utils.py           # File I/O utilities
CI/CD
- ✅ 256 tests passing
- ✅ 71% code coverage
- ✅ Code quality checks (black, isort, flake8)
- ✅ Python 3.11 tested
License
MIT – see LICENSE.
mcp-name: io.github.AnnasMazhar/pyspark-mcp
