RAGScore
Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.
Generate QA datasets & evaluate RAG systems in 2 commands
Privacy-First • Lightning Fast • Any LLM • Local or Cloud • Multilingual
2-Line RAG Evaluation
# Step 1: Generate QA pairs from your docs
ragscore generate docs/
# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query
That's it. Get accuracy scores and incorrect QA pairs instantly.
============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
❌ 15 Incorrect Pairs:
1. Q: "What is RAG?"
Score: 2/5 - Factually incorrect
2. Q: "How does retrieval work?"
Score: 3/5 - Incomplete answer
Quick Start
Install
pip install ragscore # Core (works with Ollama)
pip install "ragscore[openai]" # + OpenAI support
pip install "ragscore[notebook]" # + Jupyter/Colab support
pip install "ragscore[all]" # + All providers
Option 1: Python API (Notebook-Friendly)
Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.
from ragscore import quick_test
# 1. Audit your RAG in one line
result = quick_test(
endpoint="http://localhost:8000/query", # Your RAG API
docs="docs/", # Your documents
n=10, # Number of test questions
)
# 1b. Tailored QA: target specific audiences
result = quick_test(
endpoint="http://localhost:8000/query",
docs="docs/",
audience="developers", # Who asks the questions?
purpose="api-integration", # What's the document for?
)
# 2. See the report
result.plot()
# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])
Rich Object API:
- result.accuracy - Accuracy score
- result.df - Pandas DataFrame of all results
- result.plot() - 3-panel visualization (4-panel with detailed=True)
- result.corrections - List of items to fix
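These attributes come straight off the object returned by quick_test, so a notebook cell can gate on them directly. A minimal sketch, assuming result.accuracy is a 0-1 fraction and an arbitrary 80% threshold:
# Sketch: fail fast when accuracy drops, then list what to fix.
if result.accuracy < 0.8:            # assumed 0-1 fraction; threshold is arbitrary
    print(f"Accuracy too low: {result.accuracy:.1%}")
    for item in result.corrections:  # items flagged for correction
        print(item)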
Option 2: CLI (Production)
Generate QA Pairs
# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."
# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10
# Tailored QA generation: target specific audiences
ragscore generate docs/ --audience developers --purpose faq
ragscore generate docs/ --audience customers --purpose "pre-sales"
ragscore generate docs/ --audience "compliance auditors" --purpose "security audit"
Evaluate Your RAG
# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query
# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
Detailed Multi-Metric Evaluation
Go beyond a single score. Add detailed=True to get 5 diagnostic dimensions per answer, all in the same single LLM call.
result = quick_test(
endpoint=my_rag,
docs="docs/",
n=10,
detailed=True, # Enable multi-metric evaluation
)
# Inspect per-question metrics
display(result.df[[
"question", "score", "correctness", "completeness",
"relevance", "conciseness", "faithfulness"
]])
# Radar chart + 4-panel visualization
result.plot()
==================================================
✅ PASSED: 9/10 correct (90%)
Average Score: 4.3/5.0
Threshold: 70%
--------------------------------------------------
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 4.1/5.0
Faithfulness: 4.6/5.0
==================================================
| Metric | What it measures | Scale |
|---|---|---|
| Correctness | Semantic match to golden answer | 5 = fully correct |
| Completeness | Covers all key points | 5 = fully covered |
| Relevance | Addresses the question asked | 5 = perfectly on-topic |
| Conciseness | Focused, no filler | 5 = concise and precise |
| Faithfulness | No fabricated claims | 5 = fully faithful |
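Because these metrics land as per-question columns in result.df, specific failure modes can be sliced directly; a small sketch, assuming the detailed=True run above and the column names listed in this table:
# Sketch: surface answers that are broadly correct but poorly grounded.
hallucinated = result.df[(result.df["faithfulness"] < 3) & (result.df["correctness"] >= 3)]
display(hallucinated[["question", "rag_answer", "faithfulness"]])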
CLI:
ragscore evaluate http://localhost:8000/query --detailed
Full demo notebook: build a mini RAG and test it with detailed metrics.
Audience & Purpose demo: generate tailored QA for developers, customers, auditors, and more.
Ollama local demo: 100% private RAG evaluation with no API keys.
100% Private with Local LLMs
# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query
Perfect for: Healthcare • Legal • Finance • Research
Ollama Model Recommendations
RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.
| Model | Size | Min RAM | QA Quality | Recommended |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |
Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
Ollama Performance Guide
# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b
# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5
Expected performance (28 chunks, 5 QA pairs per chunk):
| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |
RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
Supported LLMs
| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |
Output Formats
Generated QA Pairs (output/generated_qas.jsonl)
{
"id": "abc123",
"question": "What is RAG?",
"answer": "RAG (Retrieval-Augmented Generation) combines...",
"rationale": "This is explicitly stated in the introduction...",
"support_span": "RAG systems retrieve relevant documents...",
"difficulty": "medium",
"source_path": "docs/rag_intro.pdf"
}
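Each line of the JSONL file is a standalone JSON object, so it loads with the standard library alone; a minimal sketch, assuming the default output path shown above:
import json
with open("output/generated_qas.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f if line.strip()]
print(f"{len(qa_pairs)} QA pairs loaded")
print(qa_pairs[0]["question"], "->", qa_pairs[0]["difficulty"])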
Evaluation Results (--output results.json)
{
"summary": {
"total": 100,
"correct": 85,
"incorrect": 15,
"accuracy": 0.85,
"avg_score": 4.2
},
"incorrect_pairs": [
{
"question": "What is RAG?",
"golden_answer": "RAG combines retrieval with generation...",
"rag_answer": "RAG is a database system.",
"score": 2,
"reason": "Factually incorrect - RAG is not a database"
}
]
}
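The same file is convenient for gating a CI job; a small sketch, assuming the results.json layout shown above and an arbitrary 80% threshold:
import json
import sys
with open("results.json", encoding="utf-8") as f:
    report = json.load(f)
summary = report["summary"]
print(f"Accuracy: {summary['accuracy']:.1%} ({summary['correct']}/{summary['total']})")
for pair in report["incorrect_pairs"]:
    print(f"- {pair['question']}: {pair['reason']} (score {pair['score']})")
sys.exit(0 if summary["accuracy"] >= 0.80 else 1)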
Python API
from ragscore import run_pipeline, run_evaluation
# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)
# Generate tailored QA pairs for specific audiences
run_pipeline(
paths=["docs/"],
audience="support engineers",
purpose="fine-tuning a support chatbot",
)
# Evaluate RAG
results = run_evaluation(
endpoint="http://localhost:8000/query",
model="gpt-4o", # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")
AI Agent Integration
RAGScore is designed for AI agents and automation:
# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json
# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
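In practice an agent can drive the whole loop through subprocess calls and the exit code; a hedged sketch using only the commands shown above:
import json
import subprocess
# Run generation, then evaluation; exit codes are 0 = success, 1 = error.
subprocess.run(["ragscore", "generate", "docs/", "--concurrency", "5"], check=True)
proc = subprocess.run(["ragscore", "evaluate", "http://localhost:8000/query", "--output", "results.json"])
if proc.returncode == 0:
    with open("results.json", encoding="utf-8") as f:
        print(json.load(f)["summary"])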
CLI Reference:
| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore generate <paths> --audience <who> | Tailored QA for specific audience |
| ragscore generate <paths> --purpose <why> | Focus QA on document purpose |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore evaluate <endpoint> --detailed | Multi-metric evaluation |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |
Configuration
Zero config required. Optional environment variables:
export RAGSCORE_CHUNK_SIZE=512 # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5 # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir # Working directory
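The same knobs can be set from Python in a notebook before generation; a sketch, assuming these variables are read from the environment when the pipeline runs:
import os
from ragscore import run_pipeline
# Assumed: RAGScore picks these up from the environment at run time.
os.environ["RAGSCORE_CHUNK_SIZE"] = "512"          # chunk size for documents
os.environ["RAGSCORE_QUESTIONS_PER_CHUNK"] = "3"   # QAs per chunk
run_pipeline(paths=["docs/"])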
Privacy & Security
| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |
Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅
Development
git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest
Links
- GitHub • PyPI • Issues • Discussions
⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community