Local Rag System
Local RAG System for DevOps/SRE - Complete Documentation
Local RAG System for DevOps/SRE
Table of Contents
- Introduction and Project Goals
- System Architecture
- Infrastructure Components
- LLM Model Selection
- Documentation Indexing
- MCP Integrations and Extensions
- Deployment and Operations
- MCP Setup Guide
- Best Practices and Troubleshooting
- Complete Code Reference
1. Introduction and Project Goals
Business Objective
Create a fully local, private RAG (Retrieval-Augmented Generation) system that enables:
- Fast technical answers about DevOps/SRE/Cloud without browsing documentation.
- Data privacy - everything runs locally, zero data sent to external APIs.
- No API costs - unlimited queries without token fees.
- Technical knowledge - indexing O'Reilly books, AWS/Kubernetes/Terraform documentation.
- Tool integration - internet access, Kubernetes, Docker, filesystem.
Core Principles
- 100% local - no data leaves your computer.
- Production-ready - Docker Compose, health checks, monitoring.
- Scalable - easy to add new documents.
- Flexible - swappable components (models, vector databases).
- Apple Silicon optimized - MLX utilization for maximum performance.
2. System Architecture
Component Overview
┌─────────────────────────────────────────────────────────────┐
│                      USER / LM STUDIO                       │
│                   (Interface + LLM Model)                   │
│                   + MCP Tool Integration                    │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ├──> MCP Tool Calls (RAG, Web Search, K8s)
                 │
┌────────────────▼────────────────────────────────────────────┐
│                LANGCHAIN RAG SERVER (Docker)                │
│  • FastAPI endpoints (/query, /search, /health)             │
│  • LangChain orchestration                                  │
│  • HuggingFace Embeddings (sentence-transformers)           │
│  • Connection pooling                                       │
└────┬───────────────────────────────────┬────────────────────┘
     │                                   │
     │ Vector Search                     │ LLM Generation
     │                                   │
┌────▼─────────────────────┐    ┌────────▼────────────────────┐
│  QDRANT (Docker)         │    │  LM STUDIO LOCAL SERVER     │
│  • 23,389 chunks         │    │  • Qwen3 Coder 30B MLX      │
│  • Similarity search     │    │  • Magistral Small 2509     │
│  • Web Dashboard         │    │  • Tool calling support     │
│  • Port 6333/6334        │    │  • Port 1234                │
└──────────────────────────┘    └─────────────────────────────┘
Detailed Query Flow
- User asks a question in the LM Studio Chat.
- LM Studio determines if external tools are needed (RAG, web search, etc.).
- If RAG is needed, LM Studio calls the RAG MCP Server.
- The RAG MCP Server forwards the query to the RAG FastAPI Server.
- The RAG Server converts the question into an embedding using sentence-transformers.
- Qdrant searches for the 3-5 most similar documentation fragments.
- The RAG Server builds a prompt combining:
  - System instruction
  - Context from Qdrant (retrieved fragments)
  - User's question
- The RAG Server calls LM Studio Local Server for LLM generation.
- LM Studio (LLM model) generates an answer using:
  - Context from Qdrant (priority: specific examples, code, facts).
  - Pretrained knowledge (general understanding, syntax, best practices).
- The answer is returned to the user with source citations.
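The prompt-building step in this flow can be illustrated with a minimal sketch. It is not the code in `rag_server.py`: the `Fragment` shape, the `build_prompt` helper, the wording of the system instruction, and the demo fragment text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One retrieved chunk plus its origin (hypothetical shape)."""
    text: str
    source: str
    page: int


# Hypothetical system instruction; the real one is defined by the RAG server.
SYSTEM_INSTRUCTION = (
    "You are a DevOps/SRE assistant. Prefer the provided context and "
    "cite sources; fall back to general knowledge when context is missing."
)


def build_prompt(question: str, fragments: list[Fragment]) -> str:
    """Combine system instruction, retrieved context, and the user's question."""
    context_blocks = [
        f"[{i}] {frag.source}, page {frag.page}\n{frag.text}"
        for i, frag in enumerate(fragments, start=1)
    ]
    context = "\n\n".join(context_blocks) if context_blocks else "(no context retrieved)"
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Made-up fragment text, used only to show the prompt layout.
    demo = [Fragment("Example fragment text about EC2 instances.",
                     "terraformcookbook.pdf", 83)]
    print(build_prompt("How do I create an EC2 instance in Terraform?", demo))
```

Numbering each fragment and keeping its source and page next to the text is one simple way to let the model cite sources in its answer.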
Two Knowledge Sources
| Source | Description | Strengths | Weaknesses |
|---|---|---|---|
| Qdrant Retrieved Context | Specific fragments from indexed documents. Exact quotes, examples, code snippets. | Current, precise, verifiable. | Limited to indexed content. |
| Pretrained Model Knowledge | General knowledge about programming, DevOps, clouds, best practices, common patterns. | Broad, structural understanding. | May be outdated (training cutoff date). |
Technology Stack
- Backend: Python 3.12+, FastAPI, LangChain, Uvicorn
- Databases: Qdrant, Sentence Transformers all-MiniLM-L6-v2 (embeddings)
- Containerization: Docker & Docker Compose
- LLM: LM Studio, MLX quantized models
- MCP: Model Context Protocol for tool integration
3. Infrastructure Components
3.1. Qdrant Vector Database
- Role: Store and search vector representations of documentation.
- Specifications:
  - Version: `qdrant/qdrant:latest`
  - Ports: 6333 (HTTP API), 6334 (gRPC)
  - Storage: Docker volume `qdrant_storage`
  - Collection: `devops_docs` (23,389 chunks)
  - Vector dimensions: 384 (all-MiniLM-L6-v2)
  - Health Check: `http://localhost:6333` (every 10s)
  - Resource Usage:
    - CPU: ~0.5-1 core
    - RAM: ~200-500 MB
    - Disk: ~2-5 GB
3.2. LangChain RAG Server
- Role: RAG pipeline orchestration, API endpoint for queries.
- Specifications:
  - Framework: FastAPI + LangChain
  - Port: 8000
  - Base Image: `python:3.12-slim`
  - API Endpoints:
    - GET `/`: Service information
    - GET `/health`: Detailed health check
    - GET `/config`: Current configuration
    - POST `/query`: Main RAG query endpoint
    - POST `/search`: Direct search without an LLM
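As a rough illustration of calling the main endpoint from Python, the sketch below prepares a `POST /query` request with the standard library. It assumes the request body is a JSON object with a `question` field, as in the curl test in section 8.6; the response schema is not documented here, so it is returned as a plain dict. The helper names are hypothetical.

```python
import json
import urllib.request

RAG_SERVER_URL = "http://localhost:8000"  # port taken from the specification above


def build_query_request(question: str, base_url: str = RAG_SERVER_URL) -> urllib.request.Request:
    """Prepare (but do not send) a POST /query request."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_query_request("How do I create an EC2 instance in Terraform?")
    print(req.get_method(), req.full_url)
    # Uncomment once the Docker stack is running:
    # print(send(req))
```

Separating request construction from sending keeps the first part testable without a running server.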
3.3. LM Studio Local Server
- Role: Host LLM models, inference, tool calling.
- Specifications:
  - Port: 1234 (OpenAI-compatible API)
  - Platform: Apple Silicon M4 Pro, 48 GB RAM
  - Installed Models:
    - Qwen3 Coder 30B MLX 6BIT (~25 GB)
    - Magistral Small 2509 MLX 5BIT (~17 GB)
4. LLM Model Selection
4.1. Selection Criteria for DevOps/SRE
- Technical accuracy: Precision in Terraform, K8s, AWS.
- Code generation: HCL, YAML, Docker Compose.
- Tool calling support: Integration with MCP servers.
- M4 performance: MLX optimization.
4.2. Model Comparison
| Model | Specifications | Strengths | Weaknesses | Best for... |
|---|---|---|---|---|
| Qwen3 Coder 30B MLX (6-bit) | 30B parameters, ~25 GB, 30-40 tok/s | Best for code/infra, strong Terraform/K8s knowledge, fast on MLX | Requires ~35 GB RAM, slower than smaller models | Generating Terraform modules, debugging K8s manifests, code review. |
| Magistral Small 2509 MLX (5-bit) | 24B parameters, ~17 GB, 40-50 tok/s | Excellent reasoning, lighter and faster than Qwen3, good at technical writing | Slightly weaker in pure code generation | Architectural decisions, complex problem solving, best practice recommendations. |
4.3. Deployment Recommendations
- For 48 GB RAM:
  - Primary: Qwen3 Coder 30B (for code/infrastructure)
  - Secondary: Magistral Small 2509 (for reasoning/decisions)
- For 32 GB RAM:
  - Primary: Magistral Small 2509 (universal)
- For 64+ GB RAM:
  - Premium: Qwen3 Coder 30B 8BIT (max quality)
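These tiers can be written down as a small lookup. The sketch below treats each RAM figure as a lower bound, which is an assumption (the recommendations only name 32, 48, and 64+ GB), and the `recommend_models` helper is hypothetical.

```python
def recommend_models(ram_gb: int) -> list[str]:
    """Map installed RAM to the recommended models, best tier first.

    Assumption: each tier applies from its stated RAM size upward.
    """
    if ram_gb >= 64:
        return ["Qwen3 Coder 30B 8BIT"]
    if ram_gb >= 48:
        return ["Qwen3 Coder 30B", "Magistral Small 2509"]
    if ram_gb >= 32:
        return ["Magistral Small 2509"]
    return []  # below 32 GB this guide makes no recommendation


if __name__ == "__main__":
    for ram in (16, 32, 48, 64):
        print(ram, recommend_models(ram))
```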
5. Documentation Indexing
5.1. Document Preparation
- Supported formats: PDF, TXT, Markdown, HTML.
- Folder structure:
documents/
├── devops/
│   ├── terraformcookbook.pdf
│   ├── kubernetes-best-practices.pdf
│   └── aws_resources.pdf
├── sre/
│   └── site-reliability-engineering.pdf
└── cloud/
    └── aws-well-architected.pdf
- Recommended sources: O'Reilly books, official documentation, internal company documentation.
5.2. Indexing Process
- Loading: Read documents from folders (PyPDFLoader).
- Chunking:
  - Chunk size: 1000 characters
  - Overlap: 200 characters (to preserve context)
  - Result: 23,389 chunks from 8,677 pages.
- Embedding Generation:
  - Model: `sentence-transformers/all-MiniLM-L6-v2`
  - Dimensions: 384
- Storing in Qdrant:
  - Collection: `devops_docs`
  - Distance metric: Cosine similarity
  - Indexing time: ~30-60 minutes.
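To make the chunking parameters concrete, here is a plain sliding-window sketch. It is not the splitter used by `index_documents.py` (which may split on separators rather than at fixed offsets); the `chunk_text` helper is hypothetical and only shows how a 1000-character window with a 200-character overlap advances through a text.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # with the defaults, each window starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    pieces = chunk_text(sample)
    print([len(piece) for piece in pieces])        # [1000, 1000, 900]
    print(pieces[0][-200:] == pieces[1][:200])     # True: the overlap is shared
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.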
5.3. Running the Indexing Script
# Ensure Docker stack is running
docker compose up -d
# Run indexing (local Python)
python index_documents.py
# Or via Docker
docker compose exec langchain-server python /app/index_documents.py
# Verify indexing
curl http://localhost:8000/health | jq '.collection_vectors_count'
# Should show: 23389
6. MCP Integrations and Extensions
6.1. Model Context Protocol (MCP) Overview
- What is MCP: A protocol created by Anthropic that enables LLMs to access external tools.
- Architecture: LLM → MCP Server → External Service/API
- Supported in: LM Studio, Claude Desktop, VS Code, Cursor
6.2. Available MCP Integrations
| MCP Server | Description | Use Cases |
|---|---|---|
| RAG DevOps Docs | Search through your indexed DevOps/SRE documentation | Terraform questions, K8s best practices, AWS configurations |
| Web Search | Multi-engine web search (Bing, Brave, DuckDuckGo) | Latest package versions, breaking changes, new features |
| Kubernetes | Native K8s cluster management | List pods, get logs, check deployments, helm operations |
| Filesystem | Local file system access | Read configs, search code, analyze project structure |
| Docker | Container management | List containers, check logs, manage images |
6.3. Web Search Integration
- Implementation: `mrkrsl/web-search-mcp`
- Available Tools:
  - `full-web-search`: Comprehensive search with full content extraction.
  - `get-web-search-summaries`: Quick search with snippets.
  - `get-single-web-page-content`: Extract content from a specific URL.
6.4. Kubernetes MCP Server
- Implementation: `containers/kubernetes-mcp-server`
- Key Features:
  - Pod management (list, logs, exec).
  - CRUD for any K8s resource (Deployments, Services, etc.).
  - Helm operations (install, list, uninstall).
- Security Modes:
  - Read-only: View only.
  - Disable destructive: View and create, but no updates/deletes.
  - Full access: Full permissions (for dev environments).
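One way to keep the security mode explicit is to generate the `mcp.json` entry from it. The sketch below is illustrative only: `--disable-destructive` is the flag used in this guide's configuration, while `--read-only` is an assumption that should be confirmed with `npx kubernetes-mcp-server@latest --help`. The helper name and mode keys are hypothetical.

```python
import json

# "--disable-destructive" appears in this guide's mcp.json.
# "--read-only" is assumed; verify it against the server's --help output.
MODE_FLAGS = {
    "read-only": ["--read-only"],
    "disable-destructive": ["--disable-destructive"],
    "full-access": [],
}


def kubernetes_server_entry(mode: str, kubeconfig: str) -> dict:
    """Build the 'kubernetes' entry for mcp.json for a given security mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_FLAGS)}")
    return {
        "command": "npx",
        "args": ["-y", "kubernetes-mcp-server@latest", *MODE_FLAGS[mode]],
        "env": {"KUBECONFIG": kubeconfig},
    }


if __name__ == "__main__":
    entry = kubernetes_server_entry("disable-destructive", "/Users/YOUR_USERNAME/.kube/config")
    print(json.dumps({"kubernetes": entry}, indent=2))
```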
7. Deployment and Operations
7.1. Quick Start
# Clone the repository into local-mcp (the directory name used throughout this guide)
git clone https://github.com/pshq-ripe/local-rag-system local-mcp
cd local-mcp
# Start Docker services
docker compose up -d
# Check health
curl http://localhost:8000/health
curl http://localhost:6333
# Index documents (first time only)
python index_documents.py
# Verify indexing completed
curl http://localhost:8000/health | jq
7.2. Docker Compose Setup
- Services:
  - `qdrant`: The vector database.
  - `langchain-server`: The RAG server.
  - LM Studio: Runs natively on the host (outside of Docker).
7.3. Project Structure
local-mcp/
├── docker-compose.yaml          # Main orchestration file
├── Dockerfile                   # RAG server container image
├── requirements.txt             # Python dependencies
├── rag_server.py                # FastAPI RAG server
├── index_documents.py           # Document indexing script
├── Makefile                     # Convenience commands
│
├── documents/                   # Source documents for indexing
│   ├── devops/
│   ├── sre/
│   └── cloud/
│
├── logs/                        # Application logs
│
└── mcp-servers/                 # MCP server implementations
    ├── rag-mcp-server/          # RAG MCP integration
    └── README.md                # MCP setup instructions
7.4. Networking
- Docker Network: `rag-network` (bridge type).
- Port Mapping:
  - 6333 → Qdrant HTTP API
  - 6334 → Qdrant gRPC
  - 8000 → RAG Server API
  - 1234 → LM Studio (host)
- Host Access: `host.docker.internal` allows the RAG server to communicate with LM Studio.
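When a service seems unreachable, a quick TCP probe of the mapped ports can narrow things down. The following is a small sketch using only the standard library; the `is_port_open` helper and the `SERVICE_PORTS` table are illustrative names, and the probe is run from the host (from inside the RAG container, LM Studio would be probed at `host.docker.internal` instead of `localhost`).

```python
import socket

# Ports from the mapping above.
SERVICE_PORTS = {
    "Qdrant HTTP API": 6333,
    "Qdrant gRPC": 6334,
    "RAG Server API": 8000,
    "LM Studio": 1234,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False


if __name__ == "__main__":
    for name, port in SERVICE_PORTS.items():
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name:<16} {port:>5}  {state}")
```

An open port only shows that something is listening; the `/health` endpoint remains the better check that the RAG server itself is working.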
7.5. Makefile Commands
# Build images
make build
# Start stack
make up
# Stop stack
make down
# View logs
make logs
# Restart RAG server
make restart
# Index documents
make index
# Health check
make health
# Test query
make test
# Clean everything
make clean
7.6. Health Checks & Monitoring
- Qdrant Health: `GET http://localhost:6333`
- RAG Server Health: `GET http://localhost:8000/health`
{
  "status": "healthy",
  "qdrant_connected": true,
  "lm_studio_connected": true,
  "collection_exists": true,
  "collection_vectors_count": 23389,
  "qa_chain_initialized": true
}
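For scripted monitoring, the health payload above can be reduced to a list of problems. This is a minimal sketch that assumes the field names shown in the sample response; the `health_problems` helper and the `min_vectors` parameter are hypothetical.

```python
import json

REQUIRED_FLAGS = (
    "qdrant_connected",
    "lm_studio_connected",
    "collection_exists",
    "qa_chain_initialized",
)


def health_problems(payload: dict, min_vectors: int = 1) -> list[str]:
    """Return human-readable problems found in a /health response."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")
    for flag in REQUIRED_FLAGS:
        if payload.get(flag) is not True:
            problems.append(f"{flag} is not true")
    count = payload.get("collection_vectors_count") or 0
    if count < min_vectors:
        problems.append(f"collection has {count} vectors, expected at least {min_vectors}")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '{"status": "healthy", "qdrant_connected": true, "lm_studio_connected": true,'
        ' "collection_exists": true, "collection_vectors_count": 23389,'
        ' "qa_chain_initialized": true}'
    )
    print(health_problems(sample) or "all checks passed")
```

An empty list means every flag is true and the collection holds at least `min_vectors` vectors.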
8. MCP Setup Guide
8.1. Prerequisites
# Ensure Node.js is installed (v18+)
node --version
# Ensure npm is available
npm --version
# Ensure Docker Compose is available
docker compose version
8.2. MCP Servers Installation
Option 1: Install All MCP Servers (Recommended)
# Create MCP servers directory
mkdir -p ~/lm-studio-mcp
cd ~/lm-studio-mcp
# Run the complete setup script
curl -o setup-mcp.sh https://raw.githubusercontent.com/pshq-ripe/scripts/setup-mcp.sh
chmod +x setup-mcp.sh
./setup-mcp.sh
Option 2: Manual Installation
8.2.1. Web Search MCP Server
cd ~/lm-studio-mcp
git clone https://github.com/mrkrsl/web-search-mcp.git
cd web-search-mcp
npm install
npm run build
# Test
node dist/index.js --help
8.2.2. RAG MCP Server (Custom)
cd ~/lm-studio-mcp
mkdir rag-mcp-server
cd rag-mcp-server
# Create package.json
cat > package.json << 'EOF'
{
  "name": "rag-mcp-server",
  "version": "1.0.0",
  "type": "module",
  "description": "MCP server for DevOps RAG documentation",
  "main": "rag-mcp-server.js",
  "dependencies": {
    "@modelcontextprotocol/sdk": "^0.5.0",
    "node-fetch": "^3.3.2"
  }
}
EOF
# Install dependencies
npm install
# Copy rag-mcp-server.js from repository
curl -o rag-mcp-server.js https://raw.githubusercontent.com/pshq-ripe/mcp-servers/rag-mcp-server.js
chmod +x rag-mcp-server.js
# Test
node rag-mcp-server.js
8.2.3. Kubernetes MCP Server
# No installation needed - uses npx
# Will be installed on first use
8.2.4. Filesystem MCP Server
# No installation needed - uses npx
# Will be installed on first use
8.2.5. Docker MCP Server (Optional)
# No installation needed - uses npx
# Will be installed on first use
8.3. LM Studio MCP Configuration
8.3.1. Locate mcp.json
# On macOS, mcp.json is located at one of:
# 1. ~/Library/Application Support/LMStudio/mcp.json
# 2. ~/.config/lmstudio/mcp.json
# 3. ~/.lmstudio/mcp.json
# Find it:
find ~ -name "mcp.json" 2>/dev/null | grep -i lmstudio
8.3.2. Create/Update mcp.json
{
  "mcpServers": {
    "rag-devops-docs": {
      "command": "node",
      "args": [
        "/Users/YOUR_USERNAME/lm-studio-mcp/rag-mcp-server/rag-mcp-server.js"
      ],
      "env": {
        "RAG_SERVER_URL": "http://localhost:8000"
      }
    },
    "web-search": {
      "command": "node",
      "args": [
        "/Users/YOUR_USERNAME/lm-studio-mcp/web-search-mcp/dist/index.js"
      ],
      "env": {
        "MAX_BROWSERS": "3",
        "BROWSER_HEADLESS": "true",
        "DEFAULT_TIMEOUT": "6000",
        "MAX_CONTENT_LENGTH": "100000",
        "ENABLE_RELEVANCE_CHECKING": "true",
        "RELEVANCE_THRESHOLD": "0.3"
      }
    },
    "kubernetes": {
      "command": "npx",
      "args": [
        "-y",
        "kubernetes-mcp-server@latest",
        "--disable-destructive"
      ],
      "env": {
        "KUBECONFIG": "/Users/YOUR_USERNAME/.kube/config"
      }
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/YOUR_USERNAME/projects"
      ]
    },
    "docker": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-server-docker"
      ]
    }
  }
}
Important: Replace YOUR_USERNAME with your actual username!
# Get your username
whoami
# Or use full path
echo $HOME
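Before restarting LM Studio, it can help to check the file for the most common mistakes: invalid JSON, a leftover `YOUR_USERNAME` placeholder, or a relative script path. The sketch below is one possible checker, not part of the repository; the `check_mcp_config` helper is hypothetical and only inspects the `command`, `args`, and `env` fields used in the configuration above.

```python
import json
import os


def check_mcp_config(raw: str) -> list[str]:
    """Return a list of problems found in the text of an mcp.json file."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ["missing top-level 'mcpServers' object"]

    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict) or not entry.get("command"):
            problems.append(f"{name}: missing 'command'")
            continue
        values = [str(v) for v in entry.get("args", [])]
        values += [str(v) for v in entry.get("env", {}).values()]
        for value in values:
            if "YOUR_USERNAME" in value:
                problems.append(f"{name}: placeholder left in {value!r}")
            elif value.endswith(".js") and not os.path.isabs(value):
                problems.append(f"{name}: script path is not absolute: {value!r}")
    return problems


if __name__ == "__main__":
    # Inline sample; in practice, read the text of your own mcp.json instead.
    sample = json.dumps({
        "mcpServers": {
            "rag-devops-docs": {
                "command": "node",
                "args": ["/Users/YOUR_USERNAME/lm-studio-mcp/rag-mcp-server/rag-mcp-server.js"],
                "env": {"RAG_SERVER_URL": "http://localhost:8000"},
            }
        }
    })
    for problem in check_mcp_config(sample) or ["no problems found"]:
        print(problem)
```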
8.3.3. Restart LM Studio
# Close LM Studio completely
killall "LM Studio"
# Restart
open -a "LM Studio"
8.4. Verify MCP Setup
8.4.1. Check LM Studio Logs
In LM Studio:
- Go to Developer tab
- Check Developer Logs
- Look for:
[Plugin(mcp/rag-devops-docs)] stdout: [Tools Prvdr.] Register with LM Studio
[Plugin(mcp/web-search)] stdout: [Tools Prvdr.] Register with LM Studio
[Plugin(mcp/kubernetes)] stdout: [Tools Prvdr.] Register with LM Studio
8.4.2. Test RAG MCP Server
# Test directly
cd ~/lm-studio-mcp/rag-mcp-server
node rag-mcp-server.js
# Should print: "RAG MCP server running on stdio"
# Ctrl+C to exit
# Test RAG Server is accessible
curl http://localhost:8000/health | jq
8.4.3. Test in LM Studio Chat
Load a model with tool calling support (Qwen3 Coder or Magistral), then ask:
Search my DevOps documentation for information about creating EC2 instances in Terraform.
Expected behavior:
- Model recognizes it needs documentation
- Calls the `search_devops_docs` tool
- Returns answer with sources (e.g., "terraformcookbook.pdf, page 83")
8.5. Troubleshooting MCP
Problem: MCP Server not found
# Verify file exists
ls -la ~/lm-studio-mcp/rag-mcp-server/rag-mcp-server.js
# Check permissions
chmod +x ~/lm-studio-mcp/rag-mcp-server/rag-mcp-server.js
# Test execution
node ~/lm-studio-mcp/rag-mcp-server/rag-mcp-server.js
Problem: Module not found errors
cd ~/lm-studio-mcp/rag-mcp-server
# Clean and reinstall
rm -rf node_modules package-lock.json
npm install
# Verify dependencies
npm list
Problem: RAG Server connection refused
# Check Docker stack
docker compose ps
# Check RAG Server
curl http://localhost:8000/health
# Restart if needed
docker compose restart langchain-server
Problem: Tool not appearing in LM Studio
- Verify `mcp.json` syntax (use a JSON validator)
- Check file paths are absolute (not relative)
- Restart LM Studio completely
- Check Developer Logs for errors
- Ensure model supports tool calling (Qwen3 Coder, Magistral)
8.6. Advanced MCP Configuration
Custom System Prompt for Better Tool Usage
In LM Studio → Chat Settings → System Prompt:
You are a DevOps/SRE expert assistant with access to comprehensive tools:
- search_devops_docs: Search indexed documentation (Terraform, K8s, AWS, Docker)
- web_search: Search the internet for latest information
- kubernetes operations: Manage K8s clusters
- filesystem: Access local files and code
When answering questions about DevOps/Infrastructure:
1. Use search_devops_docs FIRST to check documentation
2. Use web_search for latest versions or breaking changes
3. Cite specific sources (book names, page numbers, URLs)
4. Combine documentation facts with your general knowledge
For general questions, answer directly without tools.
Testing Individual MCP Servers
# Test Web Search
cd ~/lm-studio-mcp/web-search-mcp
npm test # if available
# Test Kubernetes
kubectl get pods # Ensure kubectl works
npx kubernetes-mcp-server@latest --help
# Test RAG
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "Test query"}' | jq
9. Best Practices and Troubleshooting
9.1. Model Selection Strategy
- Code generation: Qwen3 Coder 30B
- Architectural decisions: Magistral Small 2509
- Quick queries: Magistral Small 2509
- Debugging: Qwen3 Coder 30B
9.2. RAG Query Optimization
- `max_results` (k): Default is 3. Increasing it gives the model more context but slows down the response.
- `temperature`: Default is 0.7. For DevOps tasks, 0.5-0.7 is recommended; lower values give more predictable answers.
- `score_threshold`: Default is 0.5. Use higher values (0.7) for more precise matches, lower values (0.3) for broader results.
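The interaction between `score_threshold` and `max_results` can be sketched as a filter followed by a top-k cut. This is an illustration under assumptions: the order of operations inside `rag_server.py` is not documented here, the scores below are made up, and the `select_fragments` helper is hypothetical. Scores are treated as cosine similarities, where higher means more similar.

```python
def select_fragments(
    scored: list[tuple[str, float]],
    max_results: int = 3,
    score_threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Keep hits scoring at least score_threshold, best first, at most max_results."""
    kept = [hit for hit in scored if hit[1] >= score_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:max_results]


if __name__ == "__main__":
    # Made-up (fragment id, similarity score) pairs.
    hits = [("a", 0.82), ("b", 0.41), ("c", 0.66), ("d", 0.58), ("e", 0.73)]
    print(select_fragments(hits))                                      # defaults: a, e, c
    print(select_fragments(hits, score_threshold=0.7))                 # stricter: a, e
    print(select_fragments(hits, max_results=5, score_threshold=0.3))  # broader: all five
```

Raising the threshold can return fewer than `max_results` fragments, which is usually preferable to padding the prompt with weak matches.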
9.3. Common Issues and Solutions
Problem: RAG Server won't start
Symptoms: Container crashes immediately, "Connection refused" in logs.
Solution:
# Check Qdrant
docker compose logs qdrant
# Verify network
docker network inspect local-mcp_rag-network
# Check dependencies
docker compose ps
# Restart
docker compose restart
Problem: Collection not found (404)
Symptoms: /query returns 503, /health shows collection_exists: false.
Solution:
# Run indexing
make index
# Or
python index_documents.py
# Verify
curl http://localhost:6333/collections
# Restart server
make restart
Problem: LM Studio connection error
Symptoms: "Connection error" in /query response, timeout errors.
Solution:
# Check LM Studio Local Server is running
curl http://localhost:1234/v1/models
# Verify model is loaded in LM Studio
# Check firewall settings
# Test host.docker.internal resolves
Problem: MCP tools not working
Symptoms: Model doesn't call tools, tools not visible in UI.
Solution:
- Verify model supports tool calling (Qwen3 Coder, Magistral)
- Check `mcp.json` syntax and paths
- Restart LM Studio completely
- Check Developer Logs for errors
- Test MCP servers individually (see section 8.6)
Problem: Slow inference
Symptoms: Query takes >30 seconds, timeouts.
Solution:
# Switch to smaller/faster model (Magistral 5BIT)
# Reduce max_results (3 → 2)
# Check system resources (Activity Monitor)
# Close other applications
# Consider MLX quantized models
Problem: Out of memory
Symptoms: LM Studio crashes, "Model loading stopped" error, system freeze.
Solution:
# Switch to smaller model
# Close other apps
# Use more aggressive (lower-bit) quantization (6BIT → 4BIT)
# Check available RAM: vm_stat
# Consider upgrading RAM
9.4. Backup and Maintenance
Backup Qdrant Data
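# Backup
# (confirm the volume name with `docker volume ls`; Compose usually prefixes it with the project name)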
docker run --rm \
-v qdrant_storage:/data \
-v $(pwd):/backup \
ubuntu tar czf /backup/qdrant-$(date +%Y%m%d).tar.gz /data
# Restore
docker run --rm \
-v qdrant_storage:/data \
-v $(pwd):/backup \
ubuntu tar xzf /backup/qdrant-20251117.tar.gz -C /
Update Dependencies
# Update Docker images
docker compose pull
# Rebuild
docker compose build --no-cache
# Update Python dependencies
pip install --upgrade -r requirements.txt
# Update MCP servers
cd ~/lm-studio-mcp/web-search-mcp
git pull
npm install
npm run build
10. Complete Code Reference
10.1. Key Files
- `docker-compose.yaml`: Docker orchestration
- `Dockerfile`: RAG server container
- `requirements.txt`: Python dependencies
- `rag_server.py`: FastAPI RAG server
- `index_documents.py`: Document indexing script
- `Makefile`: Convenience commands
- `mcp.json`: MCP configuration for LM Studio
- `rag-mcp-server.js`: RAG MCP integration
10.2. Environment Variables
| Variable | Default | Description |
|---|---|---|
| `QDRANT_URL` | http://qdrant:6333 | Qdrant server URL |
| `LM_STUDIO_URL` | http://host.docker.internal:1234/v1 | LM Studio API URL |
| `COLLECTION_NAME` | devops_docs | Qdrant collection name |
| `EMBEDDING_MODEL` | sentence-transformers/all-MiniLM-L6-v2 | Embedding model |
| `CHUNK_SIZE` | 1000 | Document chunk size |
| `CHUNK_OVERLAP` | 200 | Chunk overlap size |
| `TEMPERATURE` | 0.7 | LLM temperature |
| `MAX_RETRIEVAL_RESULTS` | 3 | Max chunks to retrieve |
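As an illustration of how these variables and defaults could be read, the sketch below loads them into a typed settings object. It is an assumption about structure, not the code in `rag_server.py`; the `Settings` class and `load_settings` helper are hypothetical, while the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    lm_studio_url: str
    collection_name: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    temperature: float
    max_retrieval_results: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        qdrant_url=env.get("QDRANT_URL", "http://qdrant:6333"),
        lm_studio_url=env.get("LM_STUDIO_URL", "http://host.docker.internal:1234/v1"),
        collection_name=env.get("COLLECTION_NAME", "devops_docs"),
        embedding_model=env.get("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_retrieval_results=int(env.get("MAX_RETRIEVAL_RESULTS", "3")),
    )


if __name__ == "__main__":
    print(load_settings({}))  # documented defaults
    print(load_settings({"TEMPERATURE": "0.5", "MAX_RETRIEVAL_RESULTS": "5"}))
```

Converting the numeric values at load time means a malformed value such as `CHUNK_SIZE=abc` fails at startup rather than in the middle of a query.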
10.3. API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service information |
| `/health` | GET | Detailed health check |
| `/config` | GET | Current configuration |
| `/query` | POST | Main RAG query |
| `/search` | POST | Direct vector search |
10.4. MCP Tools Available
| Tool | MCP Server | Description |
|---|---|---|
| `search_devops_docs` | rag-devops-docs | Search indexed documentation |
| `check_rag_health` | rag-devops-docs | Check RAG system status |
| `full-web-search` | web-search | Comprehensive web search |
| `get-web-search-summaries` | web-search | Quick web search |
| `pods_list` | kubernetes | List Kubernetes pods |
| `pods_log` | kubernetes | Get pod logs |
| `helm_install` | kubernetes | Install Helm chart |
| (many more) | kubernetes | K8s operations |
| `read_file` | filesystem | Read local file |
| `search_files` | filesystem | Search in files |
| `list_containers` | docker | List Docker containers |
| `container_logs` | docker | Get container logs |
Summary
This system is designed to be:
- Production-ready: Docker, health checks, graceful degradation
- Private: 100% local, zero external API calls
- Scalable: Easy to add documents and MCP servers
- Performant: MLX optimization, native Go MCP, proper indexing
- Secure: Read-only modes, isolated networks, local-only access
Achieved Goals:
- 23,389 chunks of DevOps/SRE documentation indexed
- 5+ MCP servers integrated (RAG, Web, K8s, Docker, FS)
- 2 LLM models (Qwen3 Coder, Magistral) ready for work
- Sub-second query latency for most queries
- Comprehensive tooling for DevOps workflows
Next Steps:
- Add more documents to expand knowledge base
- Test different models for specific use cases
- Expand MCP integrations (Git, Grafana, Slack)
- Fine-tune retrieval parameters based on usage patterns
- Set up automated backups and monitoring
For issues, questions, or contributions, please see the repository's issue tracker.
