Judges Panel
39 specialized judges that evaluate AI-generated code for security, cost, and quality.
An MCP (Model Context Protocol) server that provides a panel of 39 specialized judges to evaluate AI-generated code, acting as an independent quality gate regardless of which project is being reviewed. It combines deterministic pattern matching and AST analysis (instant, offline, zero LLM calls) with LLM-powered deep-review prompts that let your AI assistant perform expert-persona analysis across all 39 domains.
Highlights:
- Includes an App Builder Workflow (3-step) demo for release decisions, plain-language risk summaries, and prioritized fixes; see Try the Demo.
- Includes V2 context-aware evaluation with policy profiles, evidence calibration, specialty feedback, confidence scoring, and uncertainty reporting.
- Includes public repository URL reporting to clone a repo, run the full tribunal, and output a consolidated markdown report.
Why Judges?
AI code generators (Copilot, Cursor, Claude, ChatGPT, etc.) write code fast, but they routinely produce insecure defaults, missing auth, hardcoded secrets, and poor error handling. Human reviewers catch some of this, but nobody reviews 39 dimensions consistently.
| | ESLint / Biome | SonarQube | Semgrep / CodeQL | Judges |
|---|---|---|---|---|
| Scope | Style + some bugs | Bugs + code smells | Security patterns | 39 domains: security, cost, compliance, a11y, API design, cloud, UX, … |
| AI-generated code focus | No | No | Partial | Purpose-built for AI output failure modes |
| Setup | Config per project | Server + scanner | Cloud or local | One command: `npx @kevinrabun/judges eval file.ts` |
| Auto-fix patches | Some | No | No | 114 deterministic patches: instant, offline |
| Non-technical output | No | Dashboard | No | Plain-language findings with What/Why/Next |
| MCP native | No | No | No | Yes: works inside Copilot, Claude, Cursor |
| SARIF output | No | Yes | Yes | Yes: upload to GitHub Code Scanning |
| Cost | Free | $$$$ | Free/paid | Free / MIT |
Judges doesn't replace linters; it covers the dimensions linters don't: authentication strategy, data sovereignty, cost patterns, accessibility, framework-specific anti-patterns, and architectural issues across multiple files.
Quick Start
Try it now (no clone needed)
# Install globally
npm install -g @kevinrabun/judges
# Evaluate any file
judges eval src/app.ts
# Pipe from stdin
cat api.py | judges eval --language python
# Single judge
judges eval --judge cybersecurity server.ts
# SARIF output for CI
judges eval --file app.ts --format sarif > results.sarif
# HTML report with severity filters and dark/light theme
judges eval --file app.ts --format html > report.html
# Fail CI on findings (exit code 1)
judges eval --fail-on-findings src/api.ts
# Suppress known findings via baseline
judges eval --baseline baseline.json src/api.ts
# Use a named preset
judges eval --preset security-only src/api.ts
# Use a config file
judges eval --config .judgesrc.json src/api.ts
# Set a minimum score threshold (exit 1 if below)
judges eval --min-score 80 src/api.ts
# One-line summary for scripts
judges eval --summary src/api.ts
# List all 39 judges
judges list
Additional CLI Commands
# Interactive project setup wizard
judges init
# Preview auto-fix patches (dry run)
judges fix src/app.ts
# Apply patches directly
judges fix src/app.ts --apply
# Watch mode: re-evaluate on file save
judges watch src/
# Project-level report (local directory)
judges report . --format html --output report.html
# Evaluate a unified diff (pipe from git diff)
git diff HEAD~1 | judges diff
# Analyze dependencies for supply-chain risks
judges deps --path . --format json
# Create a baseline file to suppress known findings
judges baseline create --file src/api.ts -o baseline.json
# Generate CI template files
judges ci-templates --provider github
judges ci-templates --provider gitlab
judges ci-templates --provider azure
judges ci-templates --provider bitbucket
# Generate per-judge rule documentation
judges docs
judges docs --judge cybersecurity
judges docs --output docs/
# Install shell completions
judges completions bash # eval "$(judges completions bash)"
judges completions zsh
judges completions fish
judges completions powershell
# Install pre-commit hook
judges hook install
# Uninstall pre-commit hook
judges hook uninstall
Use in GitHub Actions
Add Judges to your CI pipeline with zero configuration:
# .github/workflows/judges.yml
name: Judges Code Review
on: [pull_request]
jobs:
judges:
runs-on: ubuntu-latest
permissions:
contents: read
security-events: write # only if using upload-sarif
steps:
- uses: actions/checkout@v4
- uses: KevinRabun/judges@main
with:
path: src/api.ts # file or directory
format: text # text | json | sarif | markdown
upload-sarif: true # upload to GitHub Code Scanning
fail-on-findings: true # fail CI on critical/high findings
Outputs available for downstream steps: `verdict`, `score`, `findings`, `critical`, `high`, `sarif-file`.
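For example, a downstream step could read those outputs to post a score. The `id: judges` step id below is an assumption added for illustration (the action itself does not require one); the output names come from the list above:

```yaml
      - uses: KevinRabun/judges@main
        id: judges            # hypothetical id, needed only to reference outputs
        with:
          path: src/
      - name: Print verdict and score
        if: always()
        run: |
          echo "Verdict: ${{ steps.judges.outputs.verdict }}"
          echo "Score:   ${{ steps.judges.outputs.score }}"
```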
Use with Docker (no Node.js required)
# Build the image
docker build -t judges .
# Evaluate a local file
docker run --rm -v $(pwd):/code judges eval --file /code/app.ts
# Pipe from stdin
cat api.py | docker run --rm -i judges eval --language python
# List judges
docker run --rm judges list
Or use as an MCP server
1. Install and Build
git clone https://github.com/KevinRabun/judges.git
cd judges
npm install
npm run build
2. Try the Demo
Run the included demo to see all 39 judges evaluate a purposely flawed API server:
npm run demo
This evaluates examples/sample-vulnerable-api.ts (a file intentionally packed with security holes, performance anti-patterns, and code quality issues) and prints a full verdict with per-judge scores and findings.
The demo now also includes an App Builder Workflow (3-step) section. In a single run, you get both tribunal output and workflow output:
- Release decision (Ship now / Ship with caution / Do not ship)
- Plain-language summaries of top risks
- Prioritized remediation tasks and AI-fixable P0/P1 items
Sample workflow output (truncated):
╔════════════════════════════════════════════════════════════════╗
║              App Builder Workflow Demo (3-Step)                ║
╚════════════════════════════════════════════════════════════════╝
Decision    : Do not ship
Verdict     : FAIL (47/100)
Risk Counts : Critical 24 | High 27 | Medium 55
Step 2 - Plain-Language Findings:
- [CRITICAL] DATA-001: Hardcoded password detected
  What: ...
  Why : ...
  Next: ...
Step 3 - Prioritized Tasks:
- P0 | DEVELOPER | Effort L | DATA-001
  Task: ...
  Done: ...
AI-Fixable Now (P0/P1):
- P0 DATA-001: ...
Sample tribunal output (truncated):
╔════════════════════════════════════════════════════════════════╗
║               Judges Panel - Full Tribunal Demo                ║
╚════════════════════════════════════════════════════════════════╝
Overall Verdict : FAIL
Overall Score   : 43/100
Critical Issues : 15
High Issues     : 17
Total Findings  : 83
Judges Run      : 33
Per-Judge Breakdown:
────────────────────────────────────────────────────────────────
❌ Judge Data Security           0/100   7 finding(s)
❌ Judge Cybersecurity           0/100   7 finding(s)
❌ Judge Cost Effectiveness     52/100   5 finding(s)
⚠️ Judge Scalability            65/100   4 finding(s)
❌ Judge Cloud Readiness        61/100   4 finding(s)
❌ Judge Software Practices     45/100   6 finding(s)
❌ Judge Accessibility           0/100   8 finding(s)
❌ Judge API Design              0/100   9 finding(s)
❌ Judge Reliability            54/100   3 finding(s)
❌ Judge Observability          45/100   5 finding(s)
❌ Judge Performance            27/100   5 finding(s)
❌ Judge Compliance              0/100   4 finding(s)
⚠️ Judge Testing                90/100   1 finding(s)
⚠️ Judge Documentation          70/100   4 finding(s)
⚠️ Judge Internationalization   65/100   4 finding(s)
⚠️ Judge Dependency Health      90/100   1 finding(s)
❌ Judge Concurrency            44/100   4 finding(s)
❌ Judge Ethics & Bias          65/100   2 finding(s)
❌ Judge Maintainability        52/100   4 finding(s)
❌ Judge Error Handling         27/100   3 finding(s)
❌ Judge Authentication          0/100   4 finding(s)
❌ Judge Database                0/100   5 finding(s)
❌ Judge Caching                62/100   3 finding(s)
❌ Judge Configuration Mgmt      0/100   3 finding(s)
⚠️ Judge Backwards Compat       80/100   2 finding(s)
⚠️ Judge Portability            72/100   2 finding(s)
❌ Judge UX                     52/100   4 finding(s)
❌ Judge Logging Privacy         0/100   4 finding(s)
❌ Judge Rate Limiting          27/100   4 finding(s)
⚠️ Judge CI/CD                  80/100   2 finding(s)
3. Run the Tests
npm test
Runs automated tests covering all judges, AST parsers, markdown formatters, and edge cases.
4. Connect to Your Editor
VS Code (recommended: zero config)
Install the Judges Panel extension from the Marketplace. It provides:
- Inline diagnostics & quick-fixes on every file save
- `@judges` chat participant: type `@judges` in Copilot Chat, or just ask for a "judges panel review" and Copilot routes automatically
- Auto-configured MCP server: all 39 expert-persona prompts available to Copilot with zero setup
code --install-extension kevinrabun.judges-panel
VS Code β manual MCP config
If you prefer explicit workspace config (or want teammates without the extension to benefit), create .vscode/mcp.json:
{
"servers": {
"judges": {
"command": "npx",
"args": ["-y", "@kevinrabun/judges"]
}
}
}
Claude Desktop
Add to claude_desktop_config.json:
{
"mcpServers": {
"judges": {
"command": "npx",
"args": ["-y", "@kevinrabun/judges"]
}
}
}
Cursor / other MCP clients
Use the same npx command for any MCP-compatible client:
{
"command": "npx",
"args": ["-y", "@kevinrabun/judges"]
}
5. Use Judges in GitHub Copilot PR Reviews
Yes, you can include Judges as part of GitHub-based review workflows, with one important caveat:
- The hosted `copilot-pull-request-reviewer` on GitHub does not currently let you directly attach arbitrary local MCP servers the same way VS Code does.
- The practical pattern is to run Judges in CI on each PR, publish a report/check, and have Copilot + human reviewers use that output during review.
Option A (recommended): PR workflow check + report artifact
Create .github/workflows/judges-pr-review.yml:
name: Judges PR Review
on:
pull_request:
types: [opened, synchronize, reopened]
jobs:
judges:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
- name: Install
run: npm ci
- name: Generate Judges report
run: |
npx tsx -e "import { generateRepoReportFromLocalPath } from './src/reports/public-repo-report.ts';
const result = generateRepoReportFromLocalPath({
repoPath: process.cwd(),
outputPath: 'judges-pr-report.md',
maxFiles: 600,
maxFindingsInReport: 150,
});
console.log('Overall:', result.overallVerdict, result.averageScore);"
- name: Upload report artifact
uses: actions/upload-artifact@v4
with:
name: judges-pr-report
path: judges-pr-report.md
This gives every PR a reproducible Judges output your team (and Copilot) can reference.
Option B: Add Copilot custom instructions in-repo
Add .github/instructions/judges.instructions.md with guidance such as:
When reviewing pull requests:
1. Read the latest Judges report artifact/check output first.
2. Prioritize CRITICAL and HIGH findings in remediation guidance.
3. If findings conflict, defer to security/compliance-related Judges.
4. Include rule IDs (e.g., DATA-001, CYBER-004) in suggested fixes.
This helps keep Copilot feedback aligned with Judges findings.
CLI Reference
All commands support --help for usage details.
judges eval
Evaluate a file with all 39 judges or a single judge.
| Flag | Description |
|---|---|
| `--file <path>` / positional | File to evaluate |
| `--judge <id>` / `-j <id>` | Single judge mode |
| `--language <lang>` / `-l <lang>` | Language hint (auto-detected from extension) |
| `--format <fmt>` / `-f <fmt>` | Output format: text, json, sarif, markdown, html, junit, codeclimate |
| `--output <path>` / `-o <path>` | Write output to file |
| `--fail-on-findings` | Exit with code 1 if verdict is FAIL |
| `--baseline <path>` / `-b <path>` | JSON baseline file to suppress known findings |
| `--summary` | Print a single summary line (ideal for scripts) |
| `--config <path>` | Load a .judgesrc / .judgesrc.json config file |
| `--preset <name>` | Use a named preset: strict, lenient, security-only, startup, compliance, performance |
| `--min-score <n>` | Exit with code 1 if overall score is below this threshold |
| `--verbose` | Print timing and debug information |
| `--quiet` | Suppress non-essential output |
| `--no-color` | Disable ANSI colors |
judges init
Interactive wizard that generates project configuration:
- `.judgesrc.json`: rule customization, disabled judges, severity thresholds
- `.github/workflows/judges.yml`: GitHub Actions CI workflow
- `.gitlab-ci.judges.yml`: GitLab CI pipeline (optional)
- `azure-pipelines.judges.yml`: Azure Pipelines (optional)
judges fix
Preview or apply auto-fix patches from deterministic findings.
| Flag | Description |
|---|---|
| positional | File to fix |
| `--apply` | Write patches to disk (default: dry run) |
| `--judge <id>` | Limit to a single judge's findings |
judges watch
Continuously re-evaluate files on save.
| Flag | Description |
|---|---|
| positional | File or directory to watch (default: .) |
| `--judge <id>` | Single judge mode |
| `--fail-on-findings` | Exit non-zero if any evaluation fails |
judges report
Run a full project-level tribunal on a local directory.
| Flag | Description |
|---|---|
| positional | Directory path (default: .) |
| `--format <fmt>` | Output format: text, json, html, markdown |
| `--output <path>` | Write report to file |
| `--max-files <n>` | Maximum files to analyze (default: 600) |
| `--max-file-bytes <n>` | Skip files larger than this (default: 300000) |
judges hook
Manage a Git pre-commit hook that runs Judges on staged files.
judges hook install # add pre-commit hook
judges hook uninstall # remove pre-commit hook
Detects Husky (.husky/pre-commit) and falls back to .git/hooks/pre-commit. Uses marker-based injection so it won't clobber existing hooks.
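The marker-based approach can be sketched as follows. This is a conceptual illustration, not the hook's actual content: the marker strings, file name, and injected command are assumptions. The idea is that the managed block lives between BEGIN/END markers, so reinstalling replaces only that block and never clobbers pre-existing hook content:

```shell
# Hypothetical sketch of marker-based hook injection.
hook=pre-commit.demo
printf '#!/bin/sh\necho "existing user hook"\n' > "$hook"

install_judges_block() {
  # Drop any previous managed block, then append a fresh one.
  sed -i.bak '/# BEGIN JUDGES/,/# END JUDGES/d' "$hook"
  {
    echo '# BEGIN JUDGES'
    echo 'judges eval --fail-on-findings $(git diff --cached --name-only)'
    echo '# END JUDGES'
  } >> "$hook"
}

install_judges_block
install_judges_block   # idempotent: still exactly one managed block
```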
judges diff
Evaluate only the changed lines from a unified diff (e.g., git diff output).
| Flag | Description |
|---|---|
| `--file <path>` | Read diff from file instead of stdin |
| `--format <fmt>` | Output format: text, json, sarif, junit, codeclimate |
| `--output <path>` | Write output to file |
git diff HEAD~1 | judges diff
judges diff --file changes.patch --format sarif
judges deps
Analyze project dependencies for supply-chain risks.
| Flag | Description |
|---|---|
| `--path <dir>` | Project root to scan (default: .) |
| `--format <fmt>` | Output format: text, json |
judges deps --path .
judges deps --path ./backend --format json
judges baseline
Create a baseline file to suppress known findings in future evaluations.
judges baseline create --file src/api.ts
judges baseline create --file src/api.ts -o .judges-baseline.json
judges ci-templates
Generate CI/CD configuration templates for popular providers.
judges ci-templates --provider github # .github/workflows/judges.yml
judges ci-templates --provider gitlab # .gitlab-ci.judges.yml
judges ci-templates --provider azure # azure-pipelines.judges.yml
judges ci-templates --provider bitbucket # bitbucket-pipelines.yml (snippet)
judges docs
Generate per-judge rule documentation in Markdown.
| Flag | Description |
|---|---|
| `--judge <id>` | Generate docs for a single judge |
| `--output <dir>` | Write individual .md files per judge |
judges docs # all judges to stdout
judges docs --judge cybersecurity # single judge
judges docs --output docs/judges/ # write files to directory
judges completions
Generate shell completion scripts.
eval "$(judges completions bash)" # Bash
eval "$(judges completions zsh)" # Zsh
judges completions fish | source # Fish
judges completions powershell # PowerShell (Register-ArgumentCompleter)
Named Presets
Use --preset to apply pre-configured evaluation settings:
| Preset | Description |
|---|---|
| `strict` | All severities, all judges; maximum thoroughness |
| `lenient` | Only high and critical findings; fast and focused |
| `security-only` | Security judges only: cybersecurity, data-security, authentication, logging-privacy |
| `startup` | Skip compliance, sovereignty, i18n judges; move fast |
| `compliance` | Only compliance, data-sovereignty, authentication; regulatory focus |
| `performance` | Only performance, scalability, caching, cost-effectiveness |
judges eval --preset security-only src/api.ts
judges eval --preset strict --format sarif src/app.ts > results.sarif
CI Output Formats
JUnit XML
Generate JUnit XML for Jenkins, Azure DevOps, GitHub Actions, or GitLab test result viewers:
judges eval --format junit src/api.ts > results.xml
Each judge maps to a `<testsuite>`; each finding becomes a `<testcase>` with a `<failure>` element for critical/high severity.
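Illustratively, the resulting XML has this shape (attribute values here are made up; the exact names and counts come from your own evaluation):

```xml
<testsuites>
  <testsuite name="Judge: Authentication" tests="2" failures="1">
    <testcase name="AUTH-001: Hardcoded credentials">
      <failure message="Hardcoded password detected at line 12"/>
    </testcase>
    <testcase name="AUTH-002: Token in query params"/>
  </testsuite>
</testsuites>
```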
CodeClimate / GitLab Code Quality
Generate CodeClimate JSON for GitLab Code Quality or similar tools:
judges eval --format codeclimate src/api.ts > codequality.json
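Each finding becomes one entry in the standard CodeClimate issue format, roughly like this (values illustrative; the fingerprint is a stable hash used for deduplication):

```json
[
  {
    "type": "issue",
    "check_name": "DATA-001",
    "description": "Hardcoded password detected",
    "severity": "critical",
    "fingerprint": "d41d8cd98f00b204",
    "location": { "path": "src/api.ts", "lines": { "begin": 12 } }
  }
]
```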
Score Badges
Generate SVG or text badges for your README:
import { generateBadgeSvg, generateBadgeText } from "@kevinrabun/judges/badge";
const svg = generateBadgeSvg(85); // shields.io-style SVG
const text = generateBadgeText(85); // "β judges 85/100"
const svg2 = generateBadgeSvg(75, "quality"); // custom label
The Judge Panel
| Judge | Domain | Rule Prefix | What It Evaluates |
|---|---|---|---|
| Data Security | Data Security & Privacy | DATA- | Encryption, PII handling, secrets management, access controls |
| Cybersecurity | Cybersecurity & Threat Defense | CYBER- | Injection attacks, XSS, CSRF, auth flaws, OWASP Top 10 |
| Cost Effectiveness | Cost Optimization | COST- | Algorithm efficiency, N+1 queries, memory waste, caching strategy |
| Scalability | Scalability & Performance | SCALE- | Statelessness, horizontal scaling, concurrency, bottlenecks |
| Cloud Readiness | Cloud-Native & DevOps | CLOUD- | 12-Factor compliance, containerization, graceful shutdown, IaC |
| Software Practices | Engineering Best Practices | SWDEV- | SOLID principles, type safety, error handling, input validation |
| Accessibility | Accessibility (a11y) | A11Y- | WCAG compliance, screen reader support, keyboard navigation, ARIA |
| API Design | API Design & Contracts | API- | REST conventions, versioning, pagination, error responses |
| Reliability | Reliability & Resilience | REL- | Error handling, timeouts, retries, circuit breakers |
| Observability | Observability & Monitoring | OBS- | Structured logging, health checks, metrics, tracing |
| Performance | Performance & Efficiency | PERF- | N+1 queries, sync I/O, caching, memory leaks |
| Compliance | Regulatory Compliance | COMP- | GDPR/CCPA, PII protection, consent, data retention, audit trails |
| Data Sovereignty | Data, Technological & Operational Sovereignty | SOV- | Data residency, cross-border transfers, vendor key management, AI model portability, identity federation, circuit breakers, audit trails, data export |
| Testing | Testing & Quality Assurance | TEST- | Test coverage, assertions, test isolation, naming |
| Documentation | Documentation & Readability | DOC- | JSDoc/docstrings, magic numbers, TODOs, code comments |
| Internationalization | Internationalization (i18n) | I18N- | Hardcoded strings, locale handling, currency formatting |
| Dependency Health | Dependency Management | DEPS- | Version pinning, deprecated packages, supply chain |
| Concurrency | Concurrency & Async Safety | CONC- | Race conditions, unbounded parallelism, missing await |
| Ethics & Bias | Ethics & Bias | ETHICS- | Demographic logic, dark patterns, inclusive language |
| Maintainability | Code Maintainability & Technical Debt | MAINT- | Any types, magic numbers, deep nesting, dead code, file length |
| Error Handling | Error Handling & Fault Tolerance | ERR- | Empty catch blocks, missing error handlers, swallowed errors |
| Authentication | Authentication & Authorization | AUTH- | Hardcoded creds, missing auth middleware, token in query params |
| Database | Database Design & Query Efficiency | DB- | SQL injection, N+1 queries, connection pooling, transactions |
| Caching | Caching Strategy & Data Freshness | CACHE- | Unbounded caches, missing TTL, no HTTP cache headers |
| Configuration Mgmt | Configuration & Secrets Management | CFG- | Hardcoded secrets, missing env vars, config validation |
| Backwards Compat | Backwards Compatibility & Versioning | COMPAT- | API versioning, breaking changes, response consistency |
| Portability | Platform Portability & Vendor Independence | PORTA- | OS-specific paths, vendor lock-in, hardcoded hosts |
| UX | User Experience & Interface Quality | UX- | Loading states, error messages, pagination, destructive actions |
| Logging Privacy | Logging Privacy & Data Redaction | LOGPRIV- | PII in logs, token logging, structured logging, redaction |
| Rate Limiting | Rate Limiting & Throttling | RATE- | Missing rate limits, unbounded queries, backoff strategy |
| CI/CD | CI/CD Pipeline & Deployment Safety | CICD- | Test infrastructure, lint config, Docker tags, build scripts |
| Code Structure | Structural Analysis (AST) | STRUCT- | Cyclomatic complexity, nesting depth, function length, dead code, type safety |
| Agent Instructions | Agent Instruction Markdown Quality & Safety | AGENT- | Instruction hierarchy, conflict detection, unsafe overrides, scope, validation, policy guidance |
| AI Code Safety | AI-Generated Code Safety | AICS- | Prompt injection, insecure LLM output handling, debug defaults, missing validation, unsafe deserialization of AI responses |
| Framework Safety | Framework-Specific Safety | FW- | React hooks ordering, Express middleware chains, Next.js SSR/SSG pitfalls, Angular/Vue lifecycle patterns, framework-specific anti-patterns |
| IaC Security | Infrastructure as Code | IAC- | Terraform, Bicep, ARM template misconfigurations, hardcoded secrets, missing encryption, overly permissive network/IAM rules |
| False-Positive Review | False Positive Detection & Finding Accuracy | FPR- | Meta-judge reviewing pattern-based findings for false positives: string literal context, comment/docstring matches, test scaffolding, IaC template gating |
How It Works
The tribunal operates in three layers:
1. Pattern-Based Analysis: all tools (`evaluate_code`, `evaluate_code_single_judge`, `evaluate_project`, `evaluate_diff`) perform heuristic analysis using regex pattern matching to catch common anti-patterns. This layer is instant, deterministic, and runs entirely offline with zero external API calls.
2. AST-Based Structural Analysis: the Code Structure judge (`STRUCT-*` rules) uses real Abstract Syntax Tree parsing to measure cyclomatic complexity, nesting depth, function length, parameter count, dead code, and type safety with precision that regex cannot achieve. All supported languages (TypeScript, JavaScript, Python, Rust, Go, Java, C#, and C++) are parsed via tree-sitter WASM grammars (real syntax trees compiled to WebAssembly, in-process, zero native dependencies). A scope-tracking structural parser is kept as a fallback when WASM grammars are unavailable. No external AST server required.
3. LLM-Powered Deep Analysis (Prompts): the server exposes MCP prompts (e.g., `judge-data-security`, `full-tribunal`) that provide each judge's expert persona as a system prompt. When used by an LLM-based client (Copilot, Claude, Cursor, etc.), the host LLM performs deeper, context-aware probabilistic analysis beyond what static patterns can detect. This is where the `systemPrompt` on each judge comes alive: Judges itself makes no LLM calls, but it provides the expert criteria so your AI assistant can act as 39 specialized reviewers.
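The pattern layer can be pictured as a list of rules applied line by line. This toy sketch is an assumption about the general shape, not the server's real rule schema; `ERR-001` is a made-up id (the `ERR-` prefix is real), while `DATA-001` appears in the demo output:

```typescript
// Toy illustration of the deterministic pattern layer (layer 1).
interface Finding { ruleId: string; severity: string; line: number; message: string; }

const rules = [
  { ruleId: "DATA-001", severity: "critical", message: "Hardcoded password detected",
    pattern: /password\s*[:=]\s*["'][^"']+["']/i },
  { ruleId: "ERR-001", severity: "medium", message: "Empty catch block",
    pattern: /catch\s*\([^)]*\)\s*\{\s*\}/ },
];

function evaluatePatterns(code: string): Finding[] {
  const findings: Finding[] = [];
  code.split("\n").forEach((text, i) => {
    for (const r of rules) {
      if (r.pattern.test(text)) {
        findings.push({ ruleId: r.ruleId, severity: r.severity, line: i + 1, message: r.message });
      }
    }
  });
  return findings;
}

const sample = 'const password = "hunter2";\ntry { run(); } catch (e) {}';
const findings = evaluatePatterns(sample);
console.log(findings.map(f => `${f.ruleId} line ${f.line}`)); // ["DATA-001 line 1", "ERR-001 line 2"]
```

Because every rule is a pure regex over source text, this layer needs no network, no API keys, and produces identical results on every run.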
Composable by Design
Judges Panel is a dual-layer review system: instant deterministic tools (offline, no API keys) for pattern and AST analysis, plus 39 expert-persona MCP prompts that unlock LLM-powered deep analysis when connected to an AI client. It does not try to be a CVE scanner or a linter. Those capabilities belong in dedicated MCP servers that an AI agent can orchestrate alongside Judges.
Built-in AST Analysis (v2.0.0+)
Unlike earlier versions that recommended a separate AST MCP server, Judges Panel now includes real AST-based structural analysis out of the box:
- TypeScript, JavaScript, Python, Rust, Go, Java, C#, C++: all parsed with a unified tree-sitter WASM engine for full syntax-tree analysis (functions, complexity, nesting, dead code, type safety). Falls back to a scope-tracking structural parser when WASM grammars are unavailable.
The Code Structure judge (STRUCT-*) uses these parsers to accurately measure:
| Rule | Metric | Threshold |
|---|---|---|
| `STRUCT-001` | Cyclomatic complexity | > 10 per function (high) |
| `STRUCT-002` | Nesting depth | > 4 levels (medium) |
| `STRUCT-003` | Function length | > 50 lines (medium) |
| `STRUCT-004` | Parameter count | > 5 parameters (medium) |
| `STRUCT-005` | Dead code | Unreachable statements (low) |
| `STRUCT-006` | Weak types | `any`, `dynamic`, `Object`, `interface{}`, `unsafe` (medium) |
| `STRUCT-007` | File complexity | > 40 total cyclomatic complexity (high) |
| `STRUCT-008` | Extreme complexity | > 20 per function (critical) |
| `STRUCT-009` | Extreme parameters | > 8 parameters (high) |
| `STRUCT-010` | Extreme function length | > 150 lines (high) |
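To make the `STRUCT-001` metric concrete: cyclomatic complexity is conventionally the number of decision points plus one. The sketch below is a crude token-counting approximation for illustration only; the actual judge uses real tree-sitter parsing, not this regex:

```typescript
// Rough approximation: complexity = decision points + 1.
// Counts branching keywords and short-circuit/ternary operators.
function approxCyclomaticComplexity(source: string): number {
  const decisionTokens = /\b(if|for|while|case|catch)\b|&&|\|\||\?/g;
  const matches = source.match(decisionTokens) ?? [];
  return matches.length + 1;
}

const simple = "function add(a: number, b: number) { return a + b; }";
const branchy = `
function grade(n: number) {
  if (n > 90) return "A";
  if (n > 80) return "B";
  if (n > 70) return "C";
  return n >= 0 && n <= 70 ? "D" : "invalid";
}`;

console.log(approxCyclomaticComplexity(simple));  // 1 (straight-line code)
console.log(approxCyclomaticComplexity(branchy)); // 6 (3 ifs + && + ?: + 1)
```

A function would need complexity above 10 to trip `STRUCT-001` (high) and above 20 for `STRUCT-008` (critical).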
Recommended MCP Stack
When your AI coding assistant connects to multiple MCP servers, each one contributes its specialty:
┌───────────────────────────────────────────────────────────┐
│                   AI Coding Assistant                     │
│              (Claude, Copilot, Cursor, etc.)              │
└────────┬──────────────────┬───────────┬───────────────────┘
         │                  │           │
         ▼                  ▼           ▼
 ┌────────────────┐    ┌──────────┐ ┌──────────┐
 │     Judges     │    │   CVE /  │ │  Linter  │
 │     Panel      │    │   SBOM   │ │  Server  │
 └────────────────┘    └──────────┘ └──────────┘
  Heuristic judges      Vuln DB      Style &
  + AST judge           scanning     correctness
  (patterns +
   structural
   analysis)
| Layer | What It Does | Example Servers |
|---|---|---|
| Judges Panel | 39-judge quality gate: security patterns, AST analysis, cost, scalability, a11y, compliance, sovereignty, ethics, dependency health, agent instruction governance, AI code safety, framework safety | This server |
| CVE / SBOM | Vulnerability scanning against live databases: known CVEs, license risks, supply chain | OSV, Snyk, Trivy, Grype MCP servers |
| Linting | Language-specific style and correctness rules | ESLint, Ruff, Clippy MCP servers |
| Runtime Profiling | Memory, CPU, latency measurement on running code | Custom profiling MCP servers |
What This Means in Practice
When you ask your AI assistant "Is this code production-ready?", the agent can:
- Judges Panel: scan for hardcoded secrets, missing error handling, N+1 queries, accessibility gaps, and compliance issues, plus analyze cyclomatic complexity, detect dead code, and flag deeply nested functions via AST
- CVE Server: check every dependency in `package.json` against known vulnerabilities
- Linter Server: enforce team style rules, catch language-specific gotchas
Each server returns structured findings. The AI synthesizes everything into a single, actionable review; no single server needs to do it all.
MCP Tools
evaluate_v2
Run a V2 context-aware tribunal evaluation designed to raise feedback quality toward lead engineer/architect-level review:
- Policy profile calibration (`default`, `startup`, `regulated`, `healthcare`, `fintech`, `public-sector`)
- Context ingestion (architecture notes, constraints, standards, known risks, data-boundary model)
- Runtime evidence hooks (tests, coverage, latency, error rate, vulnerability counts)
- Specialty feedback aggregation by judge/domain
- Confidence scoring and explicit uncertainty reporting
Supports:
- Code mode: `code` + `language`
- Project mode: `files[]`
| Parameter | Type | Required | Description |
|---|---|---|---|
| `code` | string | conditional | Source code for single-file mode |
| `language` | string | conditional | Programming language for single-file mode |
| `files` | array | conditional | `{ path, content, language }[]` for project mode |
| `context` | string | no | High-level review context |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `policyProfile` | enum | no | default, startup, regulated, healthcare, fintech, public-sector |
| `evaluationContext` | object | no | Structured architecture/constraint context |
| `evidence` | object | no | Runtime/operational evidence for confidence calibration |
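An example single-file call (argument values are illustrative; parameter names are from the table above):

```json
{
  "tool": "evaluate_v2",
  "arguments": {
    "code": "export function handleWebhook(req, res) { /* ... */ }",
    "language": "typescript",
    "policyProfile": "fintech",
    "minConfidence": 0.6,
    "context": "Payment webhook handler; handles PCI-scoped data"
  }
}
```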
evaluate_app_builder_flow
Run a 3-step app-builder workflow for technical and non-technical stakeholders:
- Tribunal review (code/project/diff)
- Plain-language translation of top risks
- Prioritized remediation tasks with AI-fixable P0/P1 extraction
Supports:
- Code mode: `code` + `language`
- Project mode: `files[]`
- Diff mode: `code` + `language` + `changedLines[]`
| Parameter | Type | Required | Description |
|---|---|---|---|
| `code` | string | conditional | Full source content (code/diff mode) |
| `language` | string | conditional | Programming language (code/diff mode) |
| `files` | array | conditional | `{ path, content, language }[]` for project mode |
| `changedLines` | number[] | no | 1-based changed lines for diff mode |
| `context` | string | no | Optional business/technical context |
| `maxFindings` | number | no | Max translated top findings (default: 10) |
| `maxTasks` | number | no | Max generated tasks (default: 20) |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
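An example diff-mode call (values illustrative; parameters are from the table above):

```json
{
  "tool": "evaluate_app_builder_flow",
  "arguments": {
    "code": "...full file content...",
    "language": "typescript",
    "changedLines": [12, 13, 27],
    "maxFindings": 5,
    "context": "Pre-release check for a customer-facing API"
  }
}
```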
evaluate_public_repo_report
Clone a public repository URL, run the full judges panel across eligible source files, and generate a consolidated markdown report.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `repoUrl` | string | yes | Public repository URL (https://...) |
| `branch` | string | no | Optional branch name |
| `outputPath` | string | no | Optional path to write report markdown |
| `maxFiles` | number | no | Max files analyzed (default: 600) |
| `maxFileBytes` | number | no | Max file size in bytes (default: 300000) |
| `maxFindingsInReport` | number | no | Max detailed findings in output (default: 150) |
| `credentialMode` | string | no | Credential detection mode: standard (default) or strict |
| `includeAstFindings` | boolean | no | Include AST/code-structure findings (default: true) |
| `minConfidence` | number | no | Minimum finding confidence to include (0-1, default: 0) |
| `enableMustFixGate` | boolean | no | Enable must-fix gate summary for high-confidence dangerous findings (default: false) |
| `mustFixMinConfidence` | number | no | Confidence threshold for must-fix gate triggers (0-1, default: 0.85) |
| `mustFixDangerousRulePrefixes` | string[] | no | Optional dangerous rule prefixes for gate matching (e.g., AUTH, CYBER, DATA) |
| `keepClone` | boolean | no | Keep cloned repo on disk for inspection |
Quick examples
Generate a report from CLI:
npm run report:public-repo -- --repoUrl https://github.com/microsoft/vscode --output reports/vscode-judges-report.md
# stricter credential-signal mode (optional)
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --credentialMode strict --output reports/openclaw-judges-report-strict.md
# judge findings only (exclude AST/code-structure findings)
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --includeAstFindings false --output reports/openclaw-judges-report-no-ast.md
# show only findings at 80%+ confidence
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --minConfidence 0.8 --output reports/openclaw-judges-report-high-confidence.md
# include must-fix gate summary in the generated report
npm run report:public-repo -- --repoUrl https://github.com/openclaw/openclaw --enableMustFixGate true --mustFixMinConfidence 0.9 --mustFixDangerousPrefix AUTH --mustFixDangerousPrefix CYBER --output reports/openclaw-judges-report-mustfix.md
# opinionated quick-start mode (recommended first run)
npm run report:quickstart -- --repoUrl https://github.com/openclaw/openclaw --output reports/openclaw-quickstart.md
Call from MCP client:
{
"tool": "evaluate_public_repo_report",
"arguments": {
"repoUrl": "https://github.com/microsoft/vscode",
"branch": "main",
"maxFiles": 400,
"maxFindingsInReport": 120,
"credentialMode": "strict",
"includeAstFindings": false,
"minConfidence": 0.8,
"enableMustFixGate": true,
"mustFixMinConfidence": 0.9,
"mustFixDangerousRulePrefixes": ["AUTH", "CYBER", "DATA"],
"outputPath": "reports/vscode-judges-report.md"
}
}
Typical response summary includes:
- overall verdict and average score
- analyzed file count and total findings
- per-judge score table
- highest-risk findings and lowest-scoring files
Sample report snippet:
# Public Repository Full Judges Report
Generated from https://github.com/microsoft/vscode on 2026-02-21T12:00:00.000Z.
## Executive Summary
- Overall verdict: WARNING
- Average file score: 78/100
- Total findings: 412 (critical 3, high 29, medium 114, low 185, info 81)
get_judges
List all available judges with their domains and descriptions.
evaluate_code
Submit code to the full judges panel. All 39 judges evaluate independently and return a combined verdict.
| Parameter | Type | Required | Description |
|---|---|---|---|
code | string | yes | The source code to evaluate |
language | string | yes | Programming language (e.g., typescript, python) |
context | string | no | Additional context about the code |
includeAstFindings | boolean | no | Include AST/code-structure findings (default: true) |
minConfidence | number | no | Minimum finding confidence to include (0-1, default: 0) |
config | object | no | Inline configuration (see Configuration) |
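Mirroring the evaluate_public_repo_report call shown earlier, a minimal evaluate_code request might look like this (the argument values are illustrative, not required defaults):
```json
{
  "tool": "evaluate_code",
  "arguments": {
    "code": "const x = eval(input);",
    "language": "typescript",
    "context": "Snippet from an AI-generated request handler",
    "minConfidence": 0.5
  }
}
```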
evaluate_code_single_judge
Submit code to a specific judge for targeted review.
| Parameter | Type | Required | Description |
|---|---|---|---|
code | string | yes | The source code to evaluate |
language | string | yes | Programming language |
judgeId | string | yes | See judge IDs below |
context | string | no | Additional context |
minConfidence | number | no | Minimum finding confidence to include (0-1, default: 0) |
config | object | no | Inline configuration (see Configuration) |
evaluate_project
Submit multiple files for project-level analysis. All 39 judges evaluate each file, and cross-file architectural analysis detects code duplication, inconsistent error handling, and dependency cycles.
| Parameter | Type | Required | Description |
|---|---|---|---|
files | array | yes | Array of { path, content, language } objects |
context | string | no | Optional project context |
includeAstFindings | boolean | no | Include AST/code-structure findings (default: true) |
minConfidence | number | no | Minimum finding confidence to include (0-1, default: 0) |
config | object | no | Inline configuration (see Configuration) |
evaluate_diff
Evaluate only the changed lines in a code diff. Runs all 39 judges on the full file but filters findings to lines you specify. Ideal for PR reviews and incremental analysis.
| Parameter | Type | Required | Description |
|---|---|---|---|
code | string | yes | The full file content (post-change) |
language | string | yes | Programming language |
changedLines | number[] | yes | 1-based line numbers that were changed |
context | string | no | Optional context about the change |
includeAstFindings | boolean | no | Include AST/code-structure findings (default: true) |
minConfidence | number | no | Minimum finding confidence to include (0-1, default: 0) |
config | object | no | Inline configuration (see Configuration) |
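The changed-line filtering that evaluate_diff describes can be sketched as follows. This is a hypothetical illustration, not the engine's implementation: the Finding shape is simplified, but the idea is the same — judges run on the full file, then findings are kept only when they land on a changed line.

```typescript
// Simplified Finding shape for illustration only.
interface Finding {
  ruleId: string;
  line: number; // 1-based, matching changedLines
}

// Keep only findings whose line number is in the changed set.
function filterToChangedLines(findings: Finding[], changedLines: number[]): Finding[] {
  const changed = new Set(changedLines);
  return findings.filter((f) => changed.has(f.line));
}
```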
analyze_dependencies
Analyze a dependency manifest file for supply-chain risks, version pinning issues, typosquatting indicators, and dependency hygiene. Supports package.json, requirements.txt, Cargo.toml, go.mod, pom.xml, and .csproj files.
| Parameter | Type | Required | Description |
|---|---|---|---|
manifest | string | yes | Contents of the dependency manifest file |
manifestType | string | yes | File type: package.json, requirements.txt, etc. |
context | string | no | Optional context |
Judge IDs
data-security · cybersecurity · cost-effectiveness · scalability · cloud-readiness · software-practices · accessibility · api-design · reliability · observability · performance · compliance · data-sovereignty · testing · documentation · internationalization · dependency-health · concurrency · ethics-bias · maintainability · error-handling · authentication · database · caching · configuration-management · backwards-compatibility · portability · ux · logging-privacy · rate-limiting · ci-cd · code-structure · agent-instructions · ai-code-safety · framework-safety · iac-security · false-positive-review
MCP Prompts
Each judge has a corresponding prompt for LLM-powered deep analysis:
| Prompt | Description |
|---|---|
judge-data-security | Deep data security review |
judge-cybersecurity | Deep cybersecurity review |
judge-cost-effectiveness | Deep cost optimization review |
judge-scalability | Deep scalability review |
judge-cloud-readiness | Deep cloud readiness review |
judge-software-practices | Deep software practices review |
judge-accessibility | Deep accessibility/WCAG review |
judge-api-design | Deep API design review |
judge-reliability | Deep reliability & resilience review |
judge-observability | Deep observability & monitoring review |
judge-performance | Deep performance optimization review |
judge-compliance | Deep regulatory compliance review |
judge-data-sovereignty | Deep data, technological & operational sovereignty review |
judge-testing | Deep testing quality review |
judge-documentation | Deep documentation quality review |
judge-internationalization | Deep i18n review |
judge-dependency-health | Deep dependency health review |
judge-concurrency | Deep concurrency & async safety review |
judge-ethics-bias | Deep ethics & bias review |
judge-maintainability | Deep maintainability & tech debt review |
judge-error-handling | Deep error handling review |
judge-authentication | Deep authentication & authorization review |
judge-database | Deep database design & query review |
judge-caching | Deep caching strategy review |
judge-configuration-management | Deep configuration & secrets review |
judge-backwards-compatibility | Deep backwards compatibility review |
judge-portability | Deep platform portability review |
judge-ux | Deep user experience review |
judge-logging-privacy | Deep logging privacy review |
judge-rate-limiting | Deep rate limiting review |
judge-ci-cd | Deep CI/CD pipeline review |
judge-code-structure | Deep AST-based structural analysis review |
judge-agent-instructions | Deep review of agent instruction markdown quality and safety |
judge-ai-code-safety | Deep review of AI-generated code risks: prompt injection, insecure LLM output handling, debug defaults, missing validation |
judge-framework-safety | Deep review of framework-specific safety: React hooks, Express middleware, Next.js SSR/SSG, Angular/Vue patterns |
judge-iac-security | Deep review of infrastructure-as-code security: Terraform, Bicep, ARM template misconfigurations |
judge-false-positive-review | Meta-judge review of pattern-based findings for false positive detection and accuracy |
full-tribunal | All 39 judges in a single prompt |
Configuration
Create a .judgesrc.json (or .judgesrc) file in your project root to customize evaluation behavior. See .judgesrc.example.json for a copy-paste-ready template, or reference the JSON Schema for full IDE autocompletion.
{
"$schema": "https://github.com/KevinRabun/judges/blob/main/judgesrc.schema.json",
"preset": "strict",
"minSeverity": "medium",
"disabledRules": ["COST-*", "I18N-001"],
"disabledJudges": ["accessibility", "ethics-bias"],
"ruleOverrides": {
"SEC-003": { "severity": "critical" },
"DOC-*": { "disabled": true }
},
"languages": ["typescript", "python"],
"format": "text",
"failOnFindings": false,
"baseline": ""
}
| Field | Type | Default | Description |
|---|---|---|---|
$schema | string | β | JSON Schema URL for IDE validation |
preset | string | β | Named preset: strict, lenient, security-only, startup, compliance, performance |
minSeverity | string | "info" | Minimum severity to report: critical · high · medium · low · info |
disabledRules | string[] | [] | Rule IDs or prefix wildcards to suppress (e.g. "COST-*", "SEC-003") |
disabledJudges | string[] | [] | Judge IDs to skip entirely (e.g. "cost-effectiveness") |
ruleOverrides | object | {} | Per-rule overrides keyed by rule ID or wildcard – { disabled?: boolean, severity?: string } |
languages | string[] | [] | Restrict analysis to specific languages (empty = all) |
format | string | "text" | Default output format: text · json · sarif · markdown · html · junit · codeclimate |
failOnFindings | boolean | false | Exit code 1 when verdict is fail – useful for CI gates |
baseline | string | "" | Path to a baseline JSON file – matching findings are suppressed |
All evaluation tools (CLI and MCP) accept the same configuration fields via --config <path> or inline config parameter.
Advanced Features
Inline Suppressions
Suppress specific findings directly in source code using comment directives:
const x = eval(input); // judges-ignore SEC-001
// judges-ignore-next-line CYBER-002
const y = dangerousOperation();
// judges-file-ignore DOC-* – suppress globally for this file
Supported comment styles: //, #, /* */. Supports comma-separated rule IDs and wildcards (*, SEC-*).
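A matcher for these suppression patterns could look like the sketch below. This is a hypothetical reimplementation for illustration; the engine's real matcher may support additional pattern forms.

```typescript
// Match a single pattern against a rule ID: bare "*" matches everything,
// a trailing "*" matches by prefix, anything else must match exactly.
function ruleMatches(pattern: string, ruleId: string): boolean {
  if (pattern === "*") return true;
  if (pattern.endsWith("*")) return ruleId.startsWith(pattern.slice(0, -1));
  return pattern === ruleId;
}

// A comma-separated directive like "SEC-001, CYBER-*" suppresses a finding
// if any listed pattern matches.
function isSuppressed(directive: string, ruleId: string): boolean {
  return directive.split(",").some((p) => ruleMatches(p.trim(), ruleId));
}
```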
Auto-Fix Patches
Certain findings include machine-applicable patches in the patch field:
| Pattern | Auto-Fix |
|---|---|
new Buffer(x) | → Buffer.from(x) |
http:// URLs (non-localhost) | → https:// |
Math.random() | → crypto.randomUUID() |
Patches include oldText, newText, startLine, and endLine for automated application.
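Given those four fields, applying a patch can be as simple as the sketch below (a minimal whole-line-span illustration; a real applier may use more precise text ranges and validate that oldText still matches before writing).

```typescript
// Patch shape as documented: oldText/newText plus a 1-based inclusive line span.
interface Patch {
  oldText: string;
  newText: string;
  startLine: number;
  endLine: number;
}

// Replace the first occurrence of oldText within the patched line span.
function applyPatch(source: string, patch: Patch): string {
  const lines = source.split("\n");
  const span = lines.slice(patch.startLine - 1, patch.endLine).join("\n");
  const replaced = span.replace(patch.oldText, patch.newText);
  lines.splice(patch.startLine - 1, patch.endLine - patch.startLine + 1, ...replaced.split("\n"));
  return lines.join("\n");
}
```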
Cross-Evaluator Deduplication
When multiple judges flag the same issue (e.g., both Data Security and Cybersecurity detect SQL injection on line 15), findings are automatically deduplicated. The highest-severity finding wins, and the description is annotated with cross-references (e.g., "Also identified by: CYBER-003").
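A toy version of that dedup rule, keyed by line number only for brevity (the real engine fingerprints findings more precisely), could look like:

```typescript
type Severity = "critical" | "high" | "medium" | "low" | "info";
const RANK: Record<Severity, number> = { critical: 4, high: 3, medium: 2, low: 1, info: 0 };

interface Finding { ruleId: string; line: number; severity: Severity; description: string }

// Keep the highest-severity finding per line; annotate it with the
// rule IDs of the findings it absorbed.
function dedupe(findings: Finding[]): Finding[] {
  const byLine = new Map<number, Finding>();
  for (const f of findings) {
    const prev = byLine.get(f.line);
    if (!prev) {
      byLine.set(f.line, { ...f });
    } else if (RANK[f.severity] > RANK[prev.severity]) {
      byLine.set(f.line, { ...f, description: `${f.description} (Also identified by: ${prev.ruleId})` });
    } else {
      prev.description += ` (Also identified by: ${f.ruleId})`;
    }
  }
  return [...byLine.values()];
}
```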
Taint Flow Analysis
The engine performs inter-procedural taint tracking to trace data from user-controlled sources (e.g., req.body, process.env) through transformations to security-sensitive sinks (e.g., eval(), exec(), SQL queries). Taint flows are used to boost confidence on true-positive findings and suppress false positives where sanitization is detected.
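The core idea can be illustrated with a toy intra-procedural propagation pass (the real engine is inter-procedural and sanitizer-aware, so treat this purely as a conceptual sketch): assignments propagate taint from known sources, and any sink called with a tainted variable is flagged.

```typescript
// Minimal statement model: assignments and sink calls.
type Stmt =
  | { kind: "assign"; target: string; from: string }
  | { kind: "sink"; name: string; arg: string };

// Propagate taint forward through assignments; report sinks reached
// by tainted data.
function taintedSinks(stmts: Stmt[], sources: string[]): string[] {
  const tainted = new Set(sources);
  const flagged: string[] = [];
  for (const s of stmts) {
    if (s.kind === "assign" && tainted.has(s.from)) tainted.add(s.target);
    if (s.kind === "sink" && tainted.has(s.arg)) flagged.push(s.name);
  }
  return flagged;
}
```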
Positive Signal Detection
Code that demonstrates good practices receives score bonuses (capped at +15):
| Signal | Bonus |
|---|---|
| Parameterized queries | +3 |
| Security headers (helmet) | +3 |
| Auth middleware (passport, etc.) | +3 |
| Proper error handling | +2 |
| Input validation libs (zod, joi, etc.) | +2 |
| Rate limiting | +2 |
| Structured logging (pino, winston) | +2 |
| CORS configuration | +1 |
| Strict mode / strictNullChecks | +1 |
| Test patterns (describe/it/expect) | +1 |
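Using the bonus values from the table, the +15 cap can be sketched as below. The signal keys here are invented labels for illustration; actual signal detection is pattern-based inside the engine and not shown.

```typescript
// Bonus values from the positive-signal table; keys are illustrative labels.
const SIGNAL_BONUS: Record<string, number> = {
  "parameterized-queries": 3, "security-headers": 3, "auth-middleware": 3,
  "error-handling": 2, "input-validation": 2, "rate-limiting": 2,
  "structured-logging": 2, "cors": 1, "strict-mode": 1, "test-patterns": 1,
};

// Sum the bonuses for detected signals, capped at +15.
function positiveBonus(detected: string[]): number {
  const total = detected.reduce((sum, s) => sum + (SIGNAL_BONUS[s] ?? 0), 0);
  return Math.min(15, total);
}
```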
Framework-Aware Rules
Judges include framework-specific detection for Express, Django, Flask, FastAPI, Spring, ASP.NET, Rails, and more. Framework middleware (e.g., helmet(), express-rate-limit, passport.authenticate()) is recognized as mitigation, reducing false positives.
Cross-File Import Resolution
In project-level analysis, imports are resolved across files. If one file imports a security middleware module from another file in the project, findings about missing security controls are automatically adjusted with reduced confidence.
Scoring
Each judge scores the code from 0 to 100:
| Severity | Score Deduction |
|---|---|
| Critical | β30 points |
| High | β18 points |
| Medium | β10 points |
| Low | β5 points |
| Info | β2 points |
Verdict logic:
- FAIL – Any critical finding, or score < 60
- WARNING – Any high finding, any medium finding, or score < 80
- PASS – Score ≥ 80 with no critical, high, or medium findings
The overall tribunal score is the average of all 39 judges. The overall verdict fails if any judge fails.
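The deduction table and verdict rules above can be expressed directly in code. This is a sketch of the documented rules only; the real engine also applies positive-signal bonuses and confidence weighting.

```typescript
type Severity = "critical" | "high" | "medium" | "low" | "info";

// Per-finding deductions from the scoring table.
const DEDUCTION: Record<Severity, number> = { critical: 30, high: 18, medium: 10, low: 5, info: 2 };

// A judge's score: 100 minus deductions, clamped at 0.
function judgeScore(severities: Severity[]): number {
  const total = severities.reduce((sum, s) => sum + DEDUCTION[s], 0);
  return Math.max(0, 100 - total);
}

// Verdict rules: critical or score < 60 fails; high/medium or score < 80 warns.
function judgeVerdict(severities: Severity[]): "FAIL" | "WARNING" | "PASS" {
  const score = judgeScore(severities);
  if (severities.includes("critical") || score < 60) return "FAIL";
  if (severities.includes("high") || severities.includes("medium") || score < 80) return "WARNING";
  return "PASS";
}
```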
Project Structure
judges/
├── src/
│   ├── index.ts              # MCP server entry point – tools, prompts, transport
│   ├── api.ts                # Programmatic API entry point
│   ├── cli.ts                # CLI argument parser and command router
│   ├── types.ts              # TypeScript interfaces (Finding, JudgeEvaluation, etc.)
│   ├── config.ts             # .judgesrc configuration parser and validation
│   ├── errors.ts             # Custom error types (ConfigError, EvaluationError, ParseError)
│   ├── language-patterns.ts  # Multi-language regex pattern constants and helpers
│   ├── plugins.ts            # Plugin system for custom rules
│   ├── scoring.ts            # Confidence scoring and calibration
│   ├── dedup.ts              # Finding deduplication engine
│   ├── fingerprint.ts        # Finding fingerprint generation
│   ├── comparison.ts         # Tool comparison benchmark data
│   ├── cache.ts              # Evaluation result caching
│   ├── calibration.ts        # Confidence calibration from feedback data
│   ├── fix-history.ts        # Auto-fix application history tracking
│   ├── ast/                  # AST analysis engine (built-in, no external deps)
│   │   ├── index.ts              # analyzeStructure() – routes to correct parser
│   │   ├── types.ts              # FunctionInfo, CodeStructure interfaces
│   │   ├── tree-sitter-ast.ts    # Tree-sitter WASM parser (all 8 languages)
│   │   ├── structural-parser.ts  # Fallback scope-tracking parser
│   │   ├── cross-file-taint.ts   # Cross-file taint propagation analysis
│   │   └── taint-tracker.ts      # Single-file taint flow tracking
│   ├── evaluators/           # Analysis engine for each judge
│   │   ├── index.ts              # evaluateWithJudge(), evaluateWithTribunal(), evaluateProject(), etc.
│   │   ├── shared.ts             # Scoring, verdict logic, markdown formatters
│   │   └── *.ts                  # One analyzer per judge (39 files)
│   ├── formatters/           # Output formatters
│   │   ├── sarif.ts              # SARIF 2.1.0 output
│   │   ├── html.ts               # Self-contained HTML report (dark/light theme, filters)
│   │   ├── junit.ts              # JUnit XML output (Jenkins, Azure DevOps, GitHub Actions)
│   │   ├── codeclimate.ts        # CodeClimate/GitLab Code Quality JSON
│   │   ├── diagnostics.ts        # Diagnostics formatter
│   │   └── badge.ts              # SVG and text badge generator
│   ├── commands/             # CLI subcommands
│   │   ├── init.ts               # Interactive project setup wizard
│   │   ├── fix.ts                # Auto-fix patch preview and application
│   │   ├── watch.ts              # Watch mode – re-evaluate on save
│   │   ├── report.ts             # Project-level local report
│   │   ├── hook.ts               # Pre-commit hook install/uninstall
│   │   ├── ci-templates.ts       # GitLab, Azure, Bitbucket CI templates
│   │   ├── diff.ts               # Evaluate unified diff (git diff)
│   │   ├── deps.ts               # Dependency supply-chain analysis
│   │   ├── baseline.ts           # Create baseline for finding suppression
│   │   ├── completions.ts        # Shell completions (bash/zsh/fish/PowerShell)
│   │   ├── docs.ts               # Per-judge rule documentation generator
│   │   ├── feedback.ts           # False-positive tracking & finding feedback
│   │   ├── benchmark.ts          # Detection accuracy benchmark suite
│   │   ├── rule.ts               # Custom rule authoring wizard
│   │   ├── language-packs.ts     # Language-specific rule pack presets
│   │   └── config-share.ts       # Shareable team/org configuration
│   ├── presets.ts            # Named evaluation presets (strict, lenient, security-only, …)
│   ├── patches/
│   │   └── index.ts              # 53 deterministic auto-fix patch rules
│   ├── tools/                # MCP tool registrations
│   │   ├── register.ts           # Tool registration orchestrator
│   │   ├── register-evaluation.ts # Evaluation tools (evaluate_code, etc.)
│   │   ├── register-workflow.ts  # Workflow tools (app builder, reports, etc.)
│   │   ├── prompts.ts            # MCP prompt registrations (per-judge + full-tribunal)
│   │   └── schemas.ts            # Zod schemas for tool parameters
│   ├── reports/
│   │   └── public-repo-report.ts # Public repo clone + full tribunal report generation
│   └── judges/               # Judge definitions (id, name, domain, system prompt)
│       ├── index.ts              # JUDGES array, getJudge(), getJudgeSummaries()
│       └── *.ts                  # One definition per judge (39 files)
├── scripts/
│   ├── generate-public-repo-report.ts # Run: npm run report:public-repo -- --repoUrl <url>
│   ├── daily-popular-repo-autofix.ts  # Run: npm run automation:daily-popular
│   └── debug-fp.ts           # Debug false-positive findings
├── examples/
│   ├── sample-vulnerable-api.ts # Intentionally flawed code (triggers all judges)
│   ├── demo.ts               # Run: npm run demo
│   └── quickstart.ts         # Quick-start evaluation example
├── tests/
│   ├── judges.test.ts        # Core judge evaluation tests
│   ├── negative.test.ts      # Negative / FP-avoidance tests
│   ├── subsystems.test.ts    # Subsystem integration tests
│   ├── extension-logic.test.ts # VS Code extension logic tests
│   └── tool-routing.test.ts  # MCP tool routing tests
├── grammars/                 # Tree-sitter WASM grammar files
│   ├── tree-sitter-typescript.wasm
│   ├── tree-sitter-cpp.wasm
│   ├── tree-sitter-python.wasm
│   ├── tree-sitter-go.wasm
│   ├── tree-sitter-rust.wasm
│   ├── tree-sitter-java.wasm
│   └── tree-sitter-c_sharp.wasm
├── judgesrc.schema.json      # JSON Schema for .judgesrc config files
├── server.json               # MCP Registry manifest
├── package.json
├── tsconfig.json
└── README.md
Scripts
| Command | Description |
|---|---|
npm run build | Compile TypeScript to dist/ |
npm run dev | Watch mode – recompile on save |
npm test | Run the full test suite |
npm run demo | Run the sample tribunal demo |
npm run report:public-repo -- --repoUrl <url> | Generate a full tribunal report for a public repository URL |
npm run report:quickstart -- --repoUrl <url> | Run opinionated high-signal report defaults for fast adoption |
npm run automation:daily-popular | Analyze up to 10 rotating popular repos/day and open up to 5 remediation PRs per repo |
npm start | Start the MCP server |
npm run clean | Remove dist/ |
judges init | Interactive project setup wizard |
judges fix <file> | Preview auto-fix patches (add --apply to write) |
judges watch <dir> | Watch mode – re-evaluate on file save |
judges report <dir> | Full tribunal report on a local directory |
judges hook install | Install a Git pre-commit hook |
judges diff | Evaluate changed lines from unified diff |
judges deps | Analyze dependencies for supply-chain risks |
judges baseline create | Create baseline for finding suppression |
judges ci-templates | Generate CI pipeline templates |
judges docs | Generate per-judge rule documentation |
judges completions <shell> | Shell completion scripts |
judges feedback submit | Mark findings as true positive, false positive, or won't fix |
judges feedback stats | Show false-positive rate statistics |
judges benchmark run | Run detection accuracy benchmark suite |
judges rule create | Interactive custom rule creation wizard |
judges rule list | List custom evaluation rules |
judges pack list | List available language packs |
judges config export | Export config as shareable package |
judges config import <src> | Import a shared configuration |
judges compare | Compare judges against other code review tools |
judges list | List all 39 judges with domains and descriptions |
Daily Popular Repo Automation
This repo includes a scheduled workflow at .github/workflows/daily-popular-repo-autofix.yml that:
- selects up to 10 repositories per day from a default pool of 100+ popular repos (or a manually supplied target),
- runs the full Judges evaluation across supported source languages,
- applies only conservative, single-line remediations that reduce matching finding counts,
- opens up to 5 PRs per repository with attribution to both Judges and the target repository,
- skips repositories unless they are public and PR creation is possible with existing GitHub auth (no additional auth flow), and
- enforces hard runtime caps of 10 repositories/day and 5 PRs/repository.
Each run writes daily-autofix-summary.json (or SUMMARY_PATH) with per-repository telemetry, including:
- runAggregate – compact run-level totals and cross-repo top prioritized rules
- runAggregate.totalCandidatesDiscovered and runAggregate.totalCandidatesAfterLocationDedupe – signal how much overlap was removed before attempting fixes
- runAggregate.totalCandidatesAfterPriorityThreshold – candidates that remain after applying the minimum priority score
- runAggregate.dedupeReductionPercent – percent reduction from location dedupe for quick runtime-efficiency tracking
- runAggregate.priorityThresholdReductionPercent – percent reduction from minimum-priority filtering after dedupe
- priorityRulePrefixesUsed – dangerous rule prefixes used during prioritization
- minPriorityScoreUsed – minimum candidatePriorityScore applied for candidate inclusion
- candidatesDiscovered, candidatesAfterLocationDedupe, and candidatesAfterPriorityThreshold – per-repo candidate counts after each filter stage
- topPrioritizedRuleCounts – most common rule IDs among ranked candidates
- topPrioritizedCandidates – top-ranked candidate samples (rule, severity, confidence, file, line, priority score)
Optional runtime control:
AUTOFIX_MIN_PRIORITY_SCORE – minimum candidate priority score required after dedupe (default: 0, disabled).
Required secret:
JUDGES_AUTOFIX_GH_TOKEN – GitHub token with permission to fork/push/create PRs for target repositories.
Manual run:
gh workflow run "Judges Daily Full-Run Autofix PRs" -f targetRepoUrl=https://github.com/owner/repo
Programmatic API
Judges can be consumed as a library (not just via MCP). Import from @kevinrabun/judges/api:
import {
evaluateCode,
evaluateProject,
evaluateCodeSingleJudge,
getJudge,
JUDGES,
findingsToSarif,
} from "@kevinrabun/judges/api";
// Full tribunal evaluation
const verdict = evaluateCode("const x = eval(input);", "typescript");
console.log(verdict.overallScore, verdict.overallVerdict);
// Single judge
const result = evaluateCodeSingleJudge("cybersecurity", code, "typescript");
// SARIF output for CI integration
const sarif = findingsToSarif(verdict.evaluations.flatMap(e => e.findings));
Package Exports
| Entry Point | Description |
|---|---|
@kevinrabun/judges/api | Programmatic API (default) |
@kevinrabun/judges/server | MCP server entry point |
@kevinrabun/judges/sarif | SARIF 2.1.0 formatter |
@kevinrabun/judges/junit | JUnit XML formatter |
@kevinrabun/judges/codeclimate | CodeClimate/GitLab Code Quality JSON |
@kevinrabun/judges/badge | SVG and text badge generator |
@kevinrabun/judges/diagnostics | Diagnostics formatter |
@kevinrabun/judges/plugins | Plugin system API |
@kevinrabun/judges/fingerprint | Finding fingerprint utilities |
@kevinrabun/judges/comparison | Tool comparison benchmarks |
SARIF Output
Convert findings to SARIF 2.1.0 for GitHub Code Scanning, Azure DevOps, and other CI/CD tools:
import { findingsToSarif, evaluationToSarif, verdictToSarif } from "@kevinrabun/judges/sarif";
const sarif = verdictToSarif(verdict, "src/app.ts");
fs.writeFileSync("results.sarif", JSON.stringify(sarif, null, 2));
Custom Error Types
All thrown errors extend JudgesError with a machine-readable code property:
| Error Class | Code | When |
|---|---|---|
ConfigError | JUDGES_CONFIG_INVALID | Malformed .judgesrc or invalid inline config |
EvaluationError | JUDGES_EVALUATION_FAILED | Unknown judge, analyzer crash |
ParseError | JUDGES_PARSE_FAILED | Unparseable source code or input data |
import { ConfigError, EvaluationError } from "@kevinrabun/judges/api";
try {
evaluateCode(code, "typescript");
} catch (e) {
if (e instanceof ConfigError) console.error("Config issue:", e.code);
}
License
MIT