Pydough Ce
PyDough text to analytics: Community Edition
Installation
npx pydough-ceAsk AI about Pydough Ce
Powered by Claude Β· Grounded in docs
I know everything about Pydough Ce. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
PyDough-CE
Welcome to the pydough-ce project! This repository contains the pydough-analytics toolkit, a powerful system that turns natural language questions into safe, executable analytics.
It combines a custom Domain-Specific Language (DSL) called PyDough with LLM-powered system to create a seamless text-to-analytics workflow.
This repository also includes a DSPy-based pipeline in dspy_pydough_pipeline_parallel that evaluates and generates PyDough queries from natural language using prompt-driven large language models, validating predictions against ground truth through multiple evaluation metrics while supporting ensemble selection methods, caching, and parallel processing with multiple API keys.
What It Does
At its core, this project lets you ask questions of your relational database in plain English. The pipeline handles the heavy lifting:
- Generate Metadata β Reflect your database schema into a PyDough knowledge graph.
- Ask a Question β Phrase your analytics request in natural language (e.g., βWhich cities have the highest sales?β).
- Translate to PyDough β The LLM converts the question into the PyDough DSL, a declarative language purpose-built for analytics.
- Execute Safely β PyDough compiles to SQL, runs against your database, and returns a tidy DataFrame.
Key Features
- Natural language interface β Query data without writing SQL.
- Automatic schema analysis β Works with SQLite, Snowflake, MySQL and PostgSQL.
- Safety by design β PyDough limits execution to declarative analytics, reducing blast radius.
- Developer friendly β Includes a CLI, Python API.
- Extensible β Plug in custom prompts, LLM providers.
Getting Started
Provider Setup β Env (Vertex vs APIβKey)
Below are concise .env examples reflecting the two modes we support for both Claude and Gemini and a variant with explicit region.
Do not commit real credentials or API keys to Git. Use placeholders in docs and local
.envfiles.
1) Minimal Vertex (recommended, default)
Use ADC + Vertex. No API key required. The SDK will use Vertex if you pass project/location in code or set GOOGLE_GENAI_USE_VERTEXAI=true.
# .env β minimal Vertex
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# Optional: explicit region selection (see #3), defaults noted below
# GOOGLE_REGION="us-east5" # e.g., Claude default region
# GOOGLE_REGION="us-central1" # e.g., Gemini default region
Defaults / notes
- Gemini on Vertex: default region on code if not provided is
us-central1. - Claude on Vertex: default region on code if not provided is
us-east5. - You can also use the SDK alt env names:
GOOGLE_CLOUD_PROJECT/GOOGLE_CLOUD_LOCATION. - Vertex can also use credentials via gcloud auth application-default login
- Ensure IAM role like
roles/aiplatform.userand Vertex AI API enabled.
2) APIβKey mode (no Vertex) β Gemini only
If you set GOOGLE_GENAI_USE_VERTEXAI=false, the code will use the Google AI Studio (APIβkey) SDK for Gemini.
In this mode, GOOGLE_API_KEY is required, and ADC / project / region are not used by the Gemini client.
# .env β APIβkey mode (Gemini via Google AI Studio API)
GOOGLE_API_KEY="your-google-api-key"
GOOGLE_GENAI_USE_VERTEXAI=false
# These may exist in your shell and are harmless here, but are not required by APIβkey mode:
# GOOGLE_PROJECT_ID="your-gcp-project-id"
# GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
# GOOGLE_REGION="us-central1"
Notes
- No IAM or Vertex regional control; intended for quick tests or limited environments.
3) Vertex with explicit region (Gemini & Claude)
Set an explicit region that supports the models you plan to use. if you do not set either one of the they have the next default values:
- Gemini β
us-central1 - Claude β
us-east5
# .env β Vertex with explicit region
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_REGION="us-east5" # or us-central1, europe-west4, etc., if supported
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# GOOGLE_API_KEY can be unset in Vertex mode
Recap
- Switch between modes using
GOOGLE_GENAI_USE_VERTEXAI:trueβ Vertex (ADC). RequiresGOOGLE_PROJECT_ID(+GOOGLE_REGIONoptional) and credentials.falseβ APIβkey mode for Gemini. RequiresGOOGLE_API_KEY.
- Claude in this repo runs only via Vertex (ADC), so it needs
projectand a supportedregion(e.g.,us-east5).
Using Local Models (Ollama)
PyDough-CE can also be used with locally hosted open-source models via Ollama.
However, itβs important to note that results may vary and are not fully standardized when using local models.
The current PyDough prompting strategy and DSL generation were primarily designed and optimized for high-performing hosted models, specifically Gemini 2.5 Pro and Anthropic Claude (Opus / Sonnet). As a result, local models may occasionally produce PyDough code that is syntactically close but semantically incorrect (for example, small naming mismatches or invalid expressions).
That said, local inference can still be useful for experimentation, development, or environments without external API access.
Recommended local models:
gemma3:12bqwen3:8bllama3.1:8b
Running PyDough-CE with Ollama
Make sure Ollama is installed and running locally:
ollama serve
Pull a model (example):
ollama pull gemma3:12b
Set the required environment variables via CLI(or .env file):
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CONTEXT_LENGTH=32768
A large context window is strongly recommended, as PyDough prompts are relatively large to perform well.
Recommendation
For the most reliable and consistent results, we recommend using Gemini 2.5 Pro or Anthropic Claude Opus 4.
Local models via Ollama are supported, but should be considered experimental at this time.
TPCH sample database (download helper)
To make local testing easy, this repo includes a small helper script to download the TPCH demo database.
- Script location:
setup_tpch.sh - What it does: If the target file already exists, it prints
FOUNDand exits. Otherwise it downloads the SQLite DB. - Where the DB should live:
./data/databases/TPCH.db(from the repo root). The rest of the docs/CLI examples assume this path.
One-liner (macOS/Linux)
Run from the repo root:
mkdir -p ./pydough-analytics/data/databases
bash pydough-analytics/setup_tpch.sh ./pydough-analytics/data/databases/TPCH.db
If you don't have wget, you can use curl instead:
mkdir -p ./pydough-analytics/data/databases
curl -L https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -o ./pydough-analytics/data/databases/TPCH.db
Verify the file is present:
ls -lh ./pydough-analytics/data/databases/TPCH.db
Windows (PowerShell)
New-Item -ItemType Directory -Force -Path .\pydough-analytics\data\databases | Out-Null
Invoke-WebRequest -Uri https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -OutFile .\pydough-analytics\data\databases\TPCH.db
Requirements
- Python 3.10 or newer (3.11 recommended).
- SQLite database file to introspect.
- PyDough 1.0.10 or newer.
Configure the environment
Make sure to use the following environment setup when running the app.
Here is the full shell sequence. Replace /path/to/pydough-ce with your clone path.
cd /path/to/pydough-ce
rm -rf .venv
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e pydough-analytics
export PATH="$(pwd)/.venv/bin:$PATH"
hash -r
pydough-analytics --version
Development Mode
pip install -e .[dev]
Adds:
- pytest
- pytest-mock
Snowflake
pip install -e .[snowflake]
Adds:
- snowflake-connector-python[pandas]
- snowflake-sqlalchemy
It is recommended to generate metadata or use Ask for Snowflake.
Snowflake Lite (Platform-Safe)
pip install -e .[snowflake-lite]
Adds:
- snowflake-connector-python
- snowflake-sqlalchemy
Recommended for platform deployments to avoid dependency conflicts.
Notebook Mode
pip install -e .[notebooks]
Adds:
- JupyterLab
- IPython kernel
- Matplotlib
MCP Mode
pip install -e .[mcp]
Adds:
- fastmcp
Allows running PyDough as an MCP server.
Terminal location:
Run all of the next commands from the pydough-analytics folder (the folder that contains data/, docs/, samples/, src/, etc.).
Quick check:
ls data
# β databases metadata metadata_markdowns prompts
Default paths (data at pydough-analytics folder)
We keep project files in ./data/ for consistency:
- Database (SQLite):
./data/databases/TPCH.db - Metadata JSON:
./data/metadata/Tpch_graph.json - Metadata Markdown:
./data/metadata_markdowns/Tpch.md
Generate metadata from SQLite
pydough-analytics generate-json --url sqlite:///data/databases/TPCH.db --graph-name tpch --json-path ./data/metadata/Tpch_graph.json
This inspects the SQLite file and creates a metadata graph definition under data/metadata/Tpch_graph.json.
Export Markdown docs
pydough-analytics generate-md --graph-name tpch --json-path ./data/metadata/Tpch_graph.json --md-path ./data/metadata_markdowns/Tpch.md
The Markdown file provides a human-friendly overview of the metadata: collections, properties, and relationships.
Ask the LLM (after generating JSON + Markdown)
Run natural-language questions on your dataset. The PyDough code is always printed; you can optionally include SQL, a DataFrame preview, and an explanation. The CE default is Google / Gemini 2.5 Pro.
pydough-analytics ask --question "Give me the name of all the suppliers from the United States" --url sqlite:///data/databases/TPCH.db --db-name tpch --md-path ./data/metadata_markdowns/Tpch.md --kg-path ./data/metadata/Tpch_graph.json --show-sql --show-df --show-explanation
Notes:
--db-nameshould match the--graph-nameused during metadata generation (here:TPCH).- To switch providers (e.g., Anthropic), pass a valid provider/model for your integration:
--provider anthropic --model claude-sonnet-4-5@20250929 - To run with Ollama you'll also need to add the provider and model as corresponds:
--provider ollama --model qwen3:8b
- Use
--rowsto control how many DataFrame rows are displayed (default: 20).
Run the test suite (optional)
Install the [dev] version and run the tests with:
python -m pip install -e .[dev]
pytest -q tests
Or, if you prefer to install the dependencies directly:
python -m pip install pytest
python -m pip install pytest-mock
pytest -q tests
With these steps you now have the full CE pipeline: SQLite DB β JSON metadata graph β Markdown documentation β LLM Ask.
Repository Structure
/
βββ dspy_pydough_pipeline_parallel
β βββ data # Sanmple and database files.
β βββ download_bird_database.sh # Script to download and setup BIRD-SQL databases
β βββ environment.yml # Conda environment
β βββ evaluation # folder with the validation scripts
β βββ main.py # Main entry point
β βββ predictors # Folder with the generation scripts
β βββ readme.md # In-depth package documentation.
β βββ utils # Utils functions
β
βββ pydough-analytics/ # Core Python package.
β βββ data/ # Sanmple metadata files.
β βββ docs/ # Additional guides.
β βββ samples/ # Sample code with notebooks.
β βββ src/ # Library source code.
β βββ tests/ # Unit and integration tests.
β βββ README.md # In-depth package documentation.
βββ README.md # You are here!
The PyDough DSL
PyDough is a Pythonic DSL designed for the LLM to emitβand for you to readβconcise analytics logic. Typical patterns include filtering, aggregation, and ranking.
# Top 3 sales by amount
result = sales.CALCULATE(city, amount).TOP_K(3, by=amount.DESC())
You can check the full PyDough repo and documentation here: https://github.com/bodo-ai/PyDough/tree/main
Whatβs Next?
We welcome ideas and contributions. Current focus areas include:
- Support for more databases.
