Env Doctor
Debug your GPU, CUDA, and AI stacks across local, Docker, and CI/CD (CLI and MCP server)
Installation
npx env-doctorAsk AI about Env Doctor
Powered by Claude ยท Grounded in docs
I know everything about Env Doctor. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
Env-Doctor
The missing link between your GPU and Python AI libraries
"Why does my PyTorch crash with CUDA errors when I just installed it?"
Because your driver supports CUDA 11.8, but
pip install torchgave you CUDA 12.4 wheels.
Env-Doctor diagnoses and fixes the #1 frustration in GPU computing: mismatched CUDA versions between your NVIDIA driver, system toolkit, cuDNN, and Python libraries.
It takes 5 seconds to find out if your environment is broken - and exactly how to fix it.
Doctor "Check" (Diagnosis)

Features
| Feature | What It Does |
|---|---|
| One-Command Diagnosis | Check compatibility: GPU Driver โ CUDA Toolkit โ cuDNN โ PyTorch/TensorFlow/JAX |
| Compute Capability Check | Detect GPU architecture mismatches โ catches why torch.cuda.is_available() returns False on new GPUs (e.g. Blackwell) even when driver and CUDA are healthy |
| Python Version Compatibility | Detect Python version conflicts with AI libraries and dependency cascade impacts |
| CUDA Auto-Installer | Execute CUDA Toolkit installation directly with --run; CI-friendly with --yes; preview with --dry-run |
| Safe Install Commands | Get the exact pip install command that works with YOUR driver |
| Extension Library Support | Install compilation packages (flash-attn, SageAttention, auto-gptq, apex, xformers) with CUDA version matching |
| AI Model Compatibility | Check if LLMs, Diffusion, or Audio models fit on your GPU before downloading |
| WSL2 GPU Support | Validate GPU forwarding, detect driver conflicts within WSL2 env for Windows users |
| Deep CUDA Analysis | Find multiple installations, PATH issues, environment misconfigurations |
| Container Validation | Catch GPU config errors in Dockerfiles before you build |
| MCP Server | Expose diagnostics to AI assistants (Claude Desktop, Zed) via Model Context Protocol |
| CI/CD Ready | JSON output, proper exit codes, and CI-aware env-var persistence (GitHub Actions, GitLab CI, CircleCI, Azure Pipelines, Jenkins) |
| Fleet Dashboard (optional) | Web UI for monitoring multiple GPU machines โ aggregate status, drill-down diagnostics, history timeline. Install with pip install "env-doctor[dashboard]" |
Installation
Core CLI
The core CLI has no heavy dependencies โ installs in seconds.
pip install env-doctor
# Or with uv (faster, isolated)
uv tool install env-doctor
uvx env-doctor check
With Fleet Dashboard
If you want to manage a distributed system of multiple GPU nodes, this dashboard can help you from a observability POV. It adds a web UI for monitoring multiple GPU machines and has no effect on the core CLI. You will be able to soon take action directly from the dashboard via distributed env-doctor cli instances on each VM!
pip install "env-doctor[dashboard]"
This adds: fastapi, uvicorn, sqlalchemy, aiosqlite
pip install env-doctor | pip install "env-doctor[dashboard]" | |
|---|---|---|
env-doctor check | โ | โ |
| All CLI commands | โ | โ |
| MCP server | โ | โ |
env-doctor check --report-to | โ | โ |
env-doctor report install/status | โ | โ |
env-doctor dashboard (web UI) | โ | โ |
MCP Server (AI Assistant Integration)
Env-Doctor includes a built-in Model Context Protocol (MCP) server that exposes diagnostic tools to AI assistants like Claude Code and Claude Desktop.
Quick Setup for Claude Desktop
-
Install env-doctor:
pip install env-doctor -
Add to Claude Desktop config (
~/Library/Application Support/Claude/claude_desktop_config.json):{ "mcpServers": { "env-doctor": { "command": "env-doctor-mcp" } } } -
Restart Claude Desktop - the tools will be available automatically.
Available Tools (11 Total)
env_check- Full GPU/CUDA environment diagnosticsenv_check_component- Check specific component (driver, CUDA, cuDNN, etc.)python_compat_check- Check Python version compatibility with installed AI librariescuda_info- Detailed CUDA toolkit informationcudnn_info- Detailed cuDNN library informationcuda_install- Step-by-step CUDA installation instructionsinstall_command- Get safe pip install commands for AI librariesmodel_check- Analyze if AI models fit on your GPUmodel_list- List all available models in databasedockerfile_validate- Validate Dockerfiles for GPU issuesdocker_compose_validate- Validate docker-compose.yml for GPU configuration
Demo โ Claude Code using env-doctor MCP tools
Example Usage
Ask your AI assistant:
- "Check my GPU environment"
- "Is my Python version compatible with my installed AI libraries?"
- "How do I install CUDA Toolkit on Ubuntu?"
- "Get me the pip install command for PyTorch"
- "Can I run Llama 3 70B on my GPU?"
- "Validate this Dockerfile for GPU issues"
- "What CUDA version does my PyTorch require?"
- "Show me detailed CUDA toolkit information"
Learn more: MCP Integration Guide
Fleet Dashboard (optional)
The core CLI works standalone. The dashboard is an observability layer for teams running multiple GPU machines.
pip install "env-doctor[dashboard]" unlocks a web UI that aggregates diagnostic results from every machine in your fleet into a single view โ no SSH required.
How It Works
There are two roles โ the dashboard host (receives and displays reports) and the GPU machines (run checks and send results). They communicate over a simple REST API.
Dashboard Host GPU Machine 1
โโโโโโโโโโโโโโโโโโโโโโ-โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ env-doctor dashboard โ โโโโโ POST โโโโ โ env-doctor check --report-to โ
โ (React UI + SQLite) โ /api/report โ (runs locally, POSTs result) โ
โโโโโโโโโโโโโโโโโโโโโโ-โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โฒ GPU Machine 2
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโ POST โโโโโโโโโโโโโโ env-doctor check --report-to โ
/api/report โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Step 1 โ Start the dashboard on any machine with a reachable IP (a cheap CPU instance is enough, no GPU needed):
pip install "env-doctor[dashboard]"
env-doctor dashboard
# โ Serving at http://localhost:8765
# โ ๐ Generated new API token at ~/.env-doctor/api-token
# Token: <copy this โ paste into the browser login + share with hosts>
On first launch the dashboard generates a shared API token (saved at ~/.env-doctor/api-token, mode 0600). The browser login screen and every host CLI need that token. Override the location with ENV_DOCTOR_API_TOKEN=<token> in the dashboard's environment.
Step 2 โ Report from each GPU machine (only needs the core CLI, not the [dashboard] extra):
pip install env-doctor
# One-time report
env-doctor check --report-to http://<dashboard-host>:8765 --token <token>
# Or: set up automatic reporting every 2 minutes
env-doctor report install \
--url http://<dashboard-host>:8765 \
--token <token> \
--interval 2m
The token is saved in ~/.env-doctor/report-config.json so the scheduled cron / Task Scheduler entry stays clean (no secrets in crontab -l).
report install creates a scheduled task on the GPU machine โ a cron job on Linux/macOS or a Windows Task Scheduler entry on Windows. That task runs env-doctor check --report-to <url> on the configured interval.
Smart Change Detection
The scheduled task fires every 2 minutes, but it does not POST every 2 minutes. Each run:
- Runs
env-doctor checklocally on the GPU machine - Hashes the result (status, checks โ excluding timestamps)
- Compares to the last sent hash stored in
~/.env-doctor/report-state.json
| Condition | Action |
|---|---|
| Hash changed (driver updated, library installed, new issue) | POST full report immediately |
| Hash unchanged, 30 min since last POST | POST lightweight heartbeat (confirms machine is alive) |
| Hash unchanged, heartbeat not due | Skip โ no network call, sub-second no-op |
On a stable machine, this means ~1 POST every 30 minutes instead of 720.
Reporting Commands (run on GPU machines)
These commands run on each GPU machine, not on the dashboard host:
| Command | Where | What it does |
|---|---|---|
env-doctor dashboard | Dashboard host | Starts the web UI and API server |
env-doctor check --report-to URL | GPU machine | Runs check locally, POSTs result to dashboard |
env-doctor report install --url URL | GPU machine | Creates a cron job / scheduled task on this machine |
env-doctor report status | GPU machine | Reads this machine's local config and last report time |
env-doctor report uninstall | GPU machine | Removes the scheduled task from this machine |
report status is purely local โ it reads ~/.env-doctor/report-state.json and prints when this machine last reported and whether the scheduler is active. No network call.
Setting Up on Cloud Instances
# AWS / GCP / Azure โ on your dashboard VM
pip install "env-doctor[dashboard]"
env-doctor dashboard --host 0.0.0.0 --port 8765
# Open port 8765 in your security group / firewall rules
# On each GPU instance (same VPC) โ one command
pip install env-doctor && env-doctor report install --url http://<vm-private-ip>:8765
For machines behind NAT (different networks), use Tailscale for zero-config networking:
# Install Tailscale on each machine, then use the Tailscale IP
env-doctor report install --url http://100.x.x.x:8765 --token <token>
Deploying to a Shared Host
If you're hosting one dashboard for a small team rather than running it on a single laptop:
- Use the API token. Generated automatically on first launch (see Step 1 above), or pin a known value with
ENV_DOCTOR_API_TOKEN. All/api/*routes requireAuthorization: Bearer <token>. - Put TLS in front. The dashboard speaks plain HTTP. Terminate TLS in nginx, Caddy, or a cloud load balancer and forward to
127.0.0.1:8765. Tokens travel in theAuthorizationheader โ they need TLS to stay private outside trusted networks. - Restrict CORS. Set
ENV_DOCTOR_CORS_ORIGINS=https://dashboard.example.comon the dashboard process so the browser only honours requests from your origin (defaults to*for backward compatibility with local use). - Tune staleness. A machine is flagged "stale" once its
last_seenis older thanENV_DOCTOR_STALE_SECONDSseconds (default 3600 = 1 hour, โ 2ร the default heartbeat). - Rotating the token. Edit
~/.env-doctor/api-token(or updateENV_DOCTOR_API_TOKENand restart). Each host CLI needsenv-doctor report install --token <new>re-run, or its~/.env-doctor/report-config.jsonupdated, before the next check-in.
What the Dashboard Shows
The web UI at http://<dashboard-host>:8765 displays:
- Topology view โ force-directed graph showing the dashboard hub and all GPU machines as nodes, colour-coded by health status, with click-to-zoom detail panels
- Fleet overview โ aggregate health pie chart, issues table with expandable per-machine diagnostics, health gauge, and remote command execution
- Machine detail โ full diagnostic breakdown identical to what
env-doctor checkprints locally - History timeline โ past snapshots per machine to track when issues appeared or were resolved
All data stored in ~/.env-doctor/dashboard.db (SQLite) on the dashboard host. No external database or cloud dependencies.
Remote Remediation
The dashboard can queue env-doctor CLI commands to run on remote machines โ no SSH required.
Dashboard GPU Machine
โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
โ Operator โ โ Scheduled check โ
โ clicks โ โ (cron / Task โ
โ "โถ Run" on โ โ Scheduler) โ
โ Fleet page โ โ โ
โโโโโโโโฌโโโโโโ โโโโโโโโโโฌโโโโโโโโโโ
โ โ
โผ โผ
Queue command env-doctor check
in database --report-to <url>
(status: pending) โ
โ โผ
โ Server returns pending
โ commands in response
โ โ
โ โผ
โ CLI executes commands,
โ posts results back
โ โ
โ โผ
โ CLI re-runs check to
โโโโโโโโโโโโโโโโโโโโโโโโโโโโ verify the fix
How it works: GPU machines check in periodically via env-doctor check --report-to. On each check-in, the server returns any pending commands in the HTTP response. The CLI executes them, posts the output back, and re-runs diagnostics to verify the fix. This pull-based model means:
- No inbound ports needed on GPU machines (works behind NATs, firewalls, VPNs)
- No SSH credentials stored on the dashboard
- Only
env-doctorcommands are accepted (security scoped) - Commands typically execute within one check-in interval (default: 5 minutes)
Set up scheduled check-ins on each GPU machine:
# Linux / macOS โ creates a cron job
env-doctor report install --url http://<dashboard-host>:8765 --interval 5m
# Windows โ creates a Task Scheduler entry
env-doctor report install --url http://<dashboard-host>:8765 --interval 5m
Learn more: Fleet Monitoring Guide
Usage
Diagnose Your Environment
env-doctor check
Example output:
๐ฉบ ENV-DOCTOR DIAGNOSIS
============================================================
๐ฅ๏ธ Environment: Native Linux
๐ฎ GPU Driver
โ
NVIDIA Driver: 535.146.02
โโ Max CUDA: 12.2
๐ง CUDA Toolkit
โ
System CUDA: 12.1.1
๐ฆ Python Libraries
โ
torch 2.1.0+cu121
โ
All checks passed!
On new-generation GPUs (e.g. RTX 5070 / Blackwell), env-doctor catches architecture mismatches and distinguishes between two failure modes:
Hard failure โ torch.cuda.is_available() returns False:
๐ฏ COMPUTE CAPABILITY CHECK
GPU: NVIDIA GeForce RTX 5070 (Compute 12.0, Blackwell, sm_120)
PyTorch compiled for: sm_50, sm_60, sm_70, sm_80, sm_90, compute_90
โ ARCHITECTURE MISMATCH: Your GPU needs sm_120 but PyTorch 2.5.1 doesn't include it.
This is likely why torch.cuda.is_available() returns False even though
your driver and CUDA toolkit are working correctly.
FIX: Install PyTorch nightly with sm_120 support:
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126
Soft failure โ torch.cuda.is_available() returns True via NVIDIA's PTX JIT, but complex ops may silently degrade:
๐ฏ COMPUTE CAPABILITY CHECK
GPU: NVIDIA GeForce RTX 5070 (Compute 12.0, Blackwell, sm_120)
PyTorch compiled for: sm_50, sm_60, sm_70, sm_80, sm_90, compute_90
โ ๏ธ ARCHITECTURE MISMATCH (Soft): Your GPU needs sm_120 but PyTorch 2.5.1 doesn't include it.
torch.cuda.is_available() returned True via NVIDIA's driver-level PTX JIT,
but you may experience degraded performance or failures with complex CUDA ops.
FIX: Install a newer PyTorch with native sm_120 support for full compatibility:
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126
Check Python Version Compatibility
env-doctor python-compat
๐ PYTHON VERSION COMPATIBILITY CHECK
============================================================
Python Version: 3.13 (3.13.0)
Libraries Checked: 2
โ 2 compatibility issue(s) found:
tensorflow:
tensorflow supports Python <=3.12, but you have Python 3.13
Note: TensorFlow 2.15+ requires Python 3.9-3.12. Python 3.13 not yet supported.
torch:
torch supports Python <=3.12, but you have Python 3.13
Note: PyTorch 2.x supports Python 3.9-3.12. Python 3.13 support experimental.
โ ๏ธ Dependency Cascades:
tensorflow [high]: TensorFlow's Python ceiling propagates to keras and tensorboard
Affected: keras, tensorboard, tensorflow-estimator
torch [high]: PyTorch's Python version constraint affects all torch ecosystem packages
Affected: torchvision, torchaudio, triton
๐ก Consider using Python 3.12 or lower for full compatibility
๐ก Cascade: tensorflow constraint also affects: keras, tensorboard, tensorflow-estimator
๐ก Cascade: torch constraint also affects: torchvision, torchaudio, triton
============================================================
Get Safe Install Command
env-doctor install torch
โฌ๏ธ Run this command to install the SAFE version:
---------------------------------------------------
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
---------------------------------------------------
Install CUDA Toolkit
Display instructions or execute the installation directly:
# Show platform-specific steps (default)
env-doctor cuda-install
# Preview what would run โ no changes made
env-doctor cuda-install --dry-run
# Execute interactively (asks [y/N] before running)
env-doctor cuda-install --run
# Execute headlessly โ great for CI/scripts
env-doctor cuda-install --run --yes
# Install a specific version, headless
env-doctor cuda-install 12.6 --run --yes
Example dry-run output (Windows):
[DRY RUN] [1/1] winget install Nvidia.CUDA --version 12.2
[DRY RUN] [1/1] nvcc --version
CUDA 12.2 installation completed successfully.
Verification: PASSED
Full log: C:\Users\you\.env-doctor\install.log
Every run writes a timestamped log to ~/.env-doctor/install.log for debugging.
Supported Platforms:
- Ubuntu 20.04, 22.04, 24.04
- Debian 11, 12
- RHEL 8, 9 / Rocky Linux / AlmaLinux
- Fedora 39+
- WSL2 (Ubuntu)
- Windows 10/11 (via
winget) - Conda (all platforms)
Exit codes for CI pipelines:
| Code | Meaning |
|---|---|
0 | Installation succeeded and verified |
1 | An installation step failed |
2 | Installed but nvcc --version failed |
Install Compilation Packages (Extension Libraries)
For extension libraries like flash-attn, SageAttention, auto-gptq, apex, and xformers that require compilation from source, env-doctor provides special guidance to handle CUDA version mismatches:
env-doctor install flash-attn
Example output (with CUDA mismatch):
๐ฉบ PRESCRIPTION FOR: flash-attn
โ ๏ธ CUDA VERSION MISMATCH DETECTED
System nvcc: 12.1.1
PyTorch CUDA: 12.4.1
๐ง flash-attn requires EXACT CUDA version match for compilation.
You have TWO options to fix this:
============================================================
๐ฆ OPTION 1: Install PyTorch matching your nvcc (12.1)
============================================================
Trade-offs:
โ
No system changes needed
โ
Faster to implement
โ Older PyTorch version (may lack new features)
Commands:
# Uninstall current PyTorch
pip uninstall torch torchvision torchaudio -y
# Install PyTorch for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install flash-attn
pip install flash-attn --no-build-isolation
============================================================
โ๏ธ OPTION 2: Upgrade nvcc to match PyTorch (12.4)
============================================================
Trade-offs:
โ
Keep latest PyTorch
โ
Better long-term solution
โ Requires system-level changes
โ Verify driver supports CUDA 12.4
Steps:
1. Check driver compatibility:
env-doctor check
2. Download CUDA Toolkit 12.4:
https://developer.nvidia.com/cuda-12-4-0-download-archive
3. Install CUDA Toolkit (follow NVIDIA's platform-specific guide)
4. Verify installation:
nvcc --version
5. Install flash-attn:
pip install flash-attn --no-build-isolation
============================================================
Check Model Compatibility
env-doctor model llama-3-8b
๐ค Checking: LLAMA-3-8B (8.0B params)
๐ฅ๏ธ Your Hardware: RTX 3090 (24GB)
๐พ VRAM Requirements:
โ
FP16: 19.2GB - fits with 4.8GB free
โ
INT4: 4.8GB - fits with 19.2GB free
โ
This model WILL FIT on your GPU!
List all models: env-doctor model --list
Cloud GPU Recommendations:
# Get cloud GPU recommendations for a model that doesn't fit
env-doctor model llama-3-70b --recommend
# Direct VRAM lookup (no model name needed)
env-doctor model --vram 80000 --recommend
โ๏ธ Cloud GPU Recommendations
FP16 (~140.0 GB):
$27.20 /hr azure ND96asr_v4 8x A100 (40GB each) 180.0GB free
$29.39 /hr gcp a2-highgpu-8g 8x A100 (40GB each) 180.0GB free
...
Automatic HuggingFace Support (New โจ) If a model isn't found locally, env-doctor automatically checks the HuggingFace Hub, fetches its parameter metadata, and caches it locally for future runs โ no manual setup required.
# Fetches from HuggingFace on first run, cached afterward
env-doctor model bert-base-uncased
env-doctor model sentence-transformers/all-MiniLM-L6-v2
Output:
๐ค Checking: BERT-BASE-UNCASED
(Fetched from HuggingFace API - cached for future use)
Parameters: 0.11B
HuggingFace: bert-base-uncased
๐ฅ๏ธ Your Hardware:
RTX 3090 (24GB VRAM)
๐พ VRAM Requirements & Compatibility
โ
FP16: 264 MB - Fits easily!
๐ก Recommendations:
1. Use fp16 for best quality on your GPU
Validate Dockerfiles
env-doctor dockerfile
๐ณ DOCKERFILE VALIDATION
โ Line 1: CPU-only base image: python:3.10
Fix: FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
โ Line 8: PyTorch missing --index-url
Fix: pip install torch --index-url https://download.pytorch.org/whl/cu121
More Commands
| Command | Purpose |
|---|---|
env-doctor check | Full environment diagnosis |
env-doctor python-compat | Check Python version compatibility with AI libraries |
env-doctor cuda-install | Step-by-step CUDA Toolkit installation guide |
env-doctor install <lib> | Safe install command for PyTorch/TensorFlow/JAX, extension libraries (flash-attn, auto-gptq, apex, xformers, SageAttention, etc.) |
env-doctor model <name> | Check model VRAM requirements |
env-doctor cuda-info | Detailed CUDA toolkit analysis |
env-doctor cudnn-info | cuDNN library analysis |
env-doctor dockerfile | Validate Dockerfile |
env-doctor docker-compose | Validate docker-compose.yml |
env-doctor init --github-actions | Generate GitHub Actions workflow |
env-doctor scan | Scan for deprecated imports |
env-doctor debug | Verbose detector output |
env-doctor dashboard | Start fleet monitoring web UI (requires [dashboard] extra) |
env-doctor report install | Set up periodic reporting via cron (Linux) or Task Scheduler (Windows) |
CI/CD Integration
Generate a GitHub Actions workflow with one command:
env-doctor init --github-actions
This creates .github/workflows/env-doctor.yml โ review, commit, and push. Your CI will validate the ML environment on every push and PR.
Or add manually:
# JSON output for scripting
env-doctor check --json
# CI mode with exit codes (0=pass, 1=warn, 2=error)
env-doctor check --ci
Exit code semantics for check --json / check --ci:
| Code | Meaning |
|---|---|
0 | All detected components are compatible (uninstalled libraries do not count as failures) |
1 | Installed components have warnings or version conflicts |
2 | One or more components are in an error state |
GitHub Actions example:
- run: pip install env-doctor
- run: env-doctor check --ci
Documentation
Full documentation: https://mitulgarg.github.io/env-doctor/
- Getting Started
- Command Reference
- MCP Integration Guide
- WSL2 GPU Guide
- CI/CD Integration
- Architecture
Video Tutorial: Watch Demo on YouTube
Platform Support
Env-Doctor is built for Linux and Windows โ the platforms where NVIDIA GPUs and CUDA are available. All GPU diagnostics (driver, CUDA, cuDNN, library compatibility) target these platforms.
macOS is supported for non-GPU features: Fleet Dashboard hosting, model memory checks (env-doctor model), Python compatibility, project import scanning, and the MCP server. This makes a Mac a great centralised dashboard host while your Linux/Windows VMs handle the GPU workloads.
macOS uses zsh, which treats
[]as a glob pattern. Quote extras when installing:pip install "env-doctor[dashboard]"
Contributing
Contributions welcome! See CONTRIBUTING.md for details.
License
MIT License - see LICENSE
