SRE Sentinel
DevOps SRE agent, winner of the WeMakeDevs Future Stack GenAI Hackathon 2025
Installation
npx sre-sentinel
SRE Sentinel - AI DevOps Copilot
SRE Sentinel is an AI-powered monitoring and self-healing system for containerized applications. It continuously monitors Docker containers, detects anomalies using advanced AI analysis, performs root cause analysis, and executes automated fixes through the Model Context Protocol (MCP).
Features
- Real-time Monitoring: Continuously monitors Docker containers for logs, metrics, and events
- AI-Powered Anomaly Detection: Uses Cerebras AI for fast anomaly detection in container logs
- Deep Root Cause Analysis: Leverages Llama 4 Scout for comprehensive incident analysis
- Automated Remediation: Executes fixes through secure MCP Gateway with Docker control tools
- Dynamic Tool Discovery: Automatically discovers available tools from MCP servers
- Real-time Telemetry: Provides WebSocket-based real-time event streaming
- Human-Friendly Explanations: Generates stakeholder-friendly incident explanations
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Containers    │───▶│  SRE Sentinel   │───▶│    Event Bus    │
│                 │    │                 │    │                 │
│ - Logs          │    │ - Monitor       │    │ - Publish       │
│ - Metrics       │    │ - Detect        │    │ - Persist       │
│ - Events        │    │ - Analyze       │    │ - Distribute    │
└─────────────────┘    │ - Remediate     │    └─────────────────┘
                       └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   AI Services   │
                       │                 │
                       │ - Cerebras      │
                       │ - Llama         │
                       └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   MCP Gateway   │
                       │                 │
                       │ - Docker Control│
                       │ - Config Patcher│
                       └─────────────────┘
MCP Integration
SRE Sentinel uses the Model Context Protocol (MCP) to securely interact with container infrastructure through the Docker MCP Gateway. This architecture provides a secure, audited, and extensible way to execute container operations.
MCP Architecture Overview
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  SRE Sentinel   │───▶│   MCP Gateway   │───▶│   MCP Servers   │
│    (Python)     │    │    (Docker)     │    │    (Node.js)    │
│                 │    │                 │    │                 │
│ - AI Analysis   │    │ - Session Mgmt  │    │ - Docker API    │
│ - Fix Actions   │    │ - Tool Routing  │    │ - Config Mgmt   │
│ - SSE Client    │    │ - Security      │    │ - Validation    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
MCP Gateway Connection
The Python orchestrator connects to the MCP Gateway over Server-Sent Events (SSE). The connection provides:
- Session Initialization: Establishes a session with the MCP Gateway
- Tool Discovery: Automatically discovers available tools from all registered MCP servers
- Dynamic Execution: Executes tools with proper parameter validation and error handling
- Security Isolation: All container operations go through the Gateway's security layer
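The session handshake and SSE handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's client: the function names `build_initialize_request` and `iter_sse_json` are hypothetical, and the payload mirrors the `initialize` request shown in the manual testing section of this README.

```python
import json
from typing import Any, Iterator


def build_initialize_request(request_id: int = 1) -> dict[str, Any]:
    """Build the JSON-RPC 2.0 `initialize` message that opens a session."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "test-client", "version": "1.0.0"},
        },
    }


def iter_sse_json(body: str) -> Iterator[dict[str, Any]]:
    """Yield the JSON payload of each SSE event found in a response body.

    Events end at a blank line. Several `data:` lines inside one event are
    joined with newlines, as the SSE format specifies.
    """
    data_lines: list[str] = []
    for line in body.splitlines() + [""]:  # the extra "" flushes the last event
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):  # one optional space follows the colon
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            yield json.loads("\n".join(data_lines))
            data_lines = []
```

A real client would POST the request to the Gateway, read the session ID from the `Mcp-Session-Id` response header, and pass the response text through `iter_sse_json`.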
MCP Servers
1. Docker Control Server (mcp-servers/docker-control/)
Provides secure Docker container management tools:
- restart_container: Restart a Docker container
  - Parameters: container_name (required), reason (optional)
  - Example: {"container_name": "demo-api", "reason": "Memory leak detected"}
- health_check: Check container health status
  - Parameters: container_name (required)
  - Returns: Status, health state, restart count, start time
- update_resources: Update CPU and memory limits
  - Parameters: container_name (required), resources (required)
  - Example: {"container_name": "demo-api", "resources": {"memory": "1g", "cpu": "1.0"}}
- get_logs: Retrieve recent container logs
  - Parameters: container_name (required), tail (optional, default: 100)
  - Returns: Last N lines of container logs
- exec_command: Execute commands inside containers
  - Parameters: container_name (required), command (required, array), timeout (optional)
  - Example: {"container_name": "demo-api", "command": ["sh", "-c", "ps aux"], "timeout": 30}
2. Config Patcher Server (mcp-servers/config-patcher/)
Handles configuration updates for containers:
- update_env_vars: Update environment variables
  - Parameters: container_name (required), env_updates (required)
  - Process: Commits container as image, recreates with new environment
  - Example: {"container_name": "demo-api", "env_updates": {"DEBUG": "true", "LOG_LEVEL": "info"}}
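To make the env_updates semantics concrete, here is a small Python sketch of the merge step only. It assumes the container's existing environment is available as a list of "KEY=VALUE" strings, which is how Docker reports it when inspecting a container. The helper name `merge_env` is hypothetical; the actual server is a Node.js program and may handle this differently.

```python
def merge_env(current: list[str], updates: dict[str, str]) -> list[str]:
    """Merge updates into a Docker-style list of "KEY=VALUE" strings.

    Existing keys are overwritten in place; new keys are appended.
    """
    merged: dict[str, str] = {}
    for entry in current:
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = entry.partition("=")
        merged[key] = value
    merged.update(updates)
    return [f"{key}={value}" for key, value in merged.items()]
```

The merged list is what a recreated container would receive as its environment.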
Dynamic Tool Discovery
The MCP orchestrator (src/core/orchestrator.py) automatically discovers available tools. In outline:
# Tools are discovered at runtime from the MCP Gateway
async def _discover_tools(self) -> None:
    # Connects to MCP Gateway via SSE
    # Retrieves tool schemas and descriptions
    # Builds dynamic tool registry
    ...
This approach provides:
- Automatic Discovery: No hardcoded tool definitions
- Schema Validation: Dynamic parameter validation based on tool schemas
- Flexible Execution: Proper error handling and response parsing
- Easy Extension: Add new MCP servers without code changes
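As an illustration of what a dynamic tool registry with schema-based validation might look like, the sketch below indexes a `tools/list` result by tool name and checks a call's arguments against the `required` list in each tool's `inputSchema`. The function names are hypothetical and only required-parameter checking is shown; the orchestrator's real validation may be stricter.

```python
from typing import Any


def build_registry(tools_list_result: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Index the tools from a `tools/list` result by tool name."""
    return {tool["name"]: tool for tool in tools_list_result.get("tools", [])}


def missing_required(
    registry: dict[str, dict[str, Any]],
    tool_name: str,
    arguments: dict[str, Any],
) -> list[str]:
    """Return the required parameter names that `arguments` does not supply."""
    if tool_name not in registry:
        raise KeyError(f"unknown tool: {tool_name}")
    schema = registry[tool_name].get("inputSchema", {})
    return [name for name in schema.get("required", []) if name not in arguments]
```

Because the registry is built from whatever the Gateway reports, a newly registered server's tools become callable without editing this code.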
MCP Tool Execution Flow
- AI Analysis: Llama AI analyzes incidents and recommends FixAction objects
- Tool Mapping: FixAction.action maps to MCP tool names
- Parameter Preparation: FixAction.details contains JSON parameters for the tool
- Gateway Execution: Tool call is sent to MCP Gateway via SSE
- Result Processing: Response is parsed and returned as FixExecutionResult
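The mapping from a fix recommendation to a Gateway call can be sketched as follows. The class names FixAction and FixExecutionResult come from this README, but their fields here (`action`, `details`, `success`, `message`) are assumptions for illustration, as are the helper names `to_tool_call` and `to_result`.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class FixAction:
    action: str   # MCP tool name, e.g. "restart_container"
    details: str  # JSON-encoded parameters for that tool


@dataclass
class FixExecutionResult:  # assumed shape, not the project's definition
    success: bool
    message: str


def to_tool_call(fix: FixAction, request_id: int) -> dict[str, Any]:
    """Translate a FixAction into a JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": fix.action, "arguments": json.loads(fix.details)},
    }


def to_result(response: dict[str, Any]) -> FixExecutionResult:
    """Turn a JSON-RPC response into a FixExecutionResult."""
    if "error" in response:  # protocol-level failure
        return FixExecutionResult(False, response["error"].get("message", "unknown error"))
    result = response.get("result", {})
    # Tool output arrives as a list of content items; keep the text ones.
    text = "".join(
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    )
    return FixExecutionResult(not result.get("isError", False), text)
```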
Security Features
- Isolation: AI never directly accesses Docker socket
- Audit Trail: All tool executions are logged in the Gateway
- Parameter Validation: Strict schema validation prevents injection
- Session Management: Secure session-based communication
- Limited Scope: Each tool has specific, limited capabilities
Adding New MCP Servers
- Create server in mcp-servers/your-server/
- Implement tools following MCP specification
- Add to mcp-servers/catalog.yaml:
  your-server:
    description: "Your server description"
    image: "mcp-servers/your-server:latest"
    tools:
      - name: "your_tool"
        description: "What your tool does"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
- Build and restart:
  ./mcp-servers/build-servers.sh
  docker-compose restart mcp-gateway
Prerequisites
- Docker and Docker Compose
- Python 3.9+
- Node.js (for MCP servers)
- Redis (for event bus)
- API keys for Cerebras and Llama AI services
Quick Start
- Clone the repository:
  git clone https://github.com/your-org/sre-sentinel.git
  cd sre-sentinel
- Set up environment variables:
  cp .env-example .env
  # Edit .env with your API keys and configuration
- Run the project setup (installs deps and builds MCP images):
  ./scripts/setup.sh
- Start the system:
  docker-compose up -d
- View the dashboard: Open http://localhost:3000 in your browser
Configuration
Environment Variables
- CEREBRAS_API_KEY: API key for Cerebras AI service
- LLAMA_API_KEY: API key for Llama AI service
- LLAMA_API_BASE: Base URL for Llama API (default: https://openrouter.ai/api/v1)
- MCP_GATEWAY_URL: URL of the MCP Gateway (default: http://mcp-gateway:8811)
- API_PORT: Port for the API server (default: 8000)
- AUTO_HEAL_ENABLED: Enable automatic healing (default: true)
- REDIS_HOST: Redis server host (default: redis)
- REDIS_PORT: Redis server port (default: 6379)
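One way to read these variables in Python, with the defaults listed above, is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and treating the two API keys as mandatory is an assumption; the project's own configuration loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    cerebras_api_key: str
    llama_api_key: str
    llama_api_base: str
    mcp_gateway_url: str
    api_port: int
    auto_heal_enabled: bool
    redis_host: str
    redis_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying documented defaults."""
    truthy = {"1", "true", "yes", "on"}
    return Settings(
        cerebras_api_key=env["CEREBRAS_API_KEY"],  # assumed mandatory: KeyError if unset
        llama_api_key=env["LLAMA_API_KEY"],        # assumed mandatory: KeyError if unset
        llama_api_base=env.get("LLAMA_API_BASE", "https://openrouter.ai/api/v1"),
        mcp_gateway_url=env.get("MCP_GATEWAY_URL", "http://mcp-gateway:8811"),
        api_port=int(env.get("API_PORT", "8000")),
        auto_heal_enabled=env.get("AUTO_HEAL_ENABLED", "true").strip().lower() in truthy,
        redis_host=env.get("REDIS_HOST", "redis"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.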
Docker Compose Labels
Add these labels to containers you want to monitor:
labels:
  - "sre-sentinel.monitor=true"
  - "sre-sentinel.service=your-service-name"
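The opt-in behaviour of these labels can be illustrated with a small filter over container labels. This is a sketch only: the function name `monitored_services` is hypothetical, the input is a plain mapping of container name to label dict rather than a Docker client object, and falling back to the container name when the service label is missing is an assumption.

```python
def monitored_services(containers: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map container name to service name for containers that opt in."""
    selected: dict[str, str] = {}
    for name, labels in containers.items():
        if labels.get("sre-sentinel.monitor", "").lower() == "true":
            # Assumed fallback: use the container name if no service label is set.
            selected[name] = labels.get("sre-sentinel.service", name)
    return selected
```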
Monitoring Dashboard
The web dashboard provides real-time visibility into:
- Container status and resource usage
- Log streaming with anomaly highlighting
- Incident history and analysis results
- AI insights and recommendations
How It Works
- Monitoring: SRE Sentinel continuously monitors labeled containers
- Anomaly Detection: Cerebras AI analyzes log patterns for anomalies
- Incident Creation: Critical anomalies trigger incident creation
- Root Cause Analysis: Llama 4 Scout performs deep analysis
- Fix Execution: Recommended fixes are executed via MCP Gateway
- Health Verification: Container health is verified after fixes
- Resolution: Incidents are marked resolved with explanations
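The later stages of this pipeline (incident creation through resolution) can be pictured as one function with the AI and MCP calls injected as callables. Everything in this sketch is hypothetical: the `Incident` fields, the status strings, and the callable signatures are chosen for illustration and are not taken from the project's source.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    service: str
    anomaly: str
    status: str = "open"
    fixes: list[str] = field(default_factory=list)
    explanation: str = ""


def handle_anomaly(
    service: str,
    anomaly: str,
    analyze: Callable[[Incident], tuple[str, list[str]]],
    execute_fix: Callable[[str, str], bool],
    is_healthy: Callable[[str], bool],
) -> Incident:
    """Run one incident through analysis, fix execution, and verification."""
    incident = Incident(service=service, anomaly=anomaly)   # incident creation
    incident.explanation, recommended = analyze(incident)   # root cause analysis
    for fix in recommended:                                 # fix execution
        if execute_fix(service, fix):
            incident.fixes.append(fix)                      # record applied fixes only
    # Health verification decides whether the incident can be closed.
    incident.status = "resolved" if is_healthy(service) else "unresolved"
    return incident
```

Injecting the callables keeps the control flow testable without a live Gateway or model endpoint.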
Testing
Break a Service
Use the provided script to simulate a service failure:
./scripts/break-service.sh
This will:
- Stop the demo API service
- Trigger anomaly detection
- Create an incident
- Execute automated fixes
- Restore service health
Manual MCP Testing
Test MCP tools directly through the Gateway:
# First, initialize a session
curl -X POST http://localhost:8811/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2024-11-05",
"capabilities": {},
"clientInfo": {"name": "test-client", "version": "1.0.0"}
}
}'
# Then list available tools (using session ID from response)
curl -X POST http://localhost:8811/mcp \
-H "Content-Type: application/json" \
-H "Mcp-Session-Id: YOUR_SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
# Test container restart
curl -X POST http://localhost:8811/mcp \
-H "Content-Type: application/json" \
-H "Mcp-Session-Id: YOUR_SESSION_ID" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "restart_container",
"arguments": {"container_name": "demo-api", "reason": "Manual test"}
}
}'
MCP Configuration
The MCP Gateway is configured in docker-compose.yml:
mcp-gateway:
  image: docker/mcp-gateway:latest
  command:
    - --transport=streaming # Enable SSE for Python client
    - --port=8811
    - --catalog=/mcp-servers/catalog.yaml
    - --enable-all-servers
    - --verbose
    - --log-calls
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock # Docker access
    - ./mcp-servers:/mcp-servers:ro # Server definitions
Python MCP Client Implementation
The SRE Sentinel Python client implements the MCP protocol:
- Session Management (src/core/orchestrator.py):
  async def _initialize_session(self, url: str) -> None:
      # Initialize MCP session with Gateway
      # Extract session ID from response headers
      # Use session ID for subsequent requests
      ...
- Tool Discovery:
  async def _discover_tools(self) -> None:
      # List all available tools from Gateway
      # Parse tool schemas and descriptions
      # Build dynamic tool registry
      ...
- Tool Execution:
  async def _call_tool(self, tool_name: str, args: dict) -> FixExecutionResult:
      # Execute tool via MCP Gateway
      # Handle SSE response format
      # Parse and return results
      ...
MCP Message Flow
1. Python Client (SRE Sentinel)
   ↓ JSON-RPC 2.0 via SSE
2. MCP Gateway (Docker)
   ↓ Routes to appropriate server
3. MCP Server (Node.js)
   ↓ Executes Docker operation
4. Returns result through Gateway
   ↓ SSE response
5. Python Client receives result
Security
- MCP Gateway provides secure isolation between AI and container infrastructure
- All tool executions are logged and auditable
- Container access is limited to specific operations
- Sensitive environment variables are redacted in AI prompts
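Redacting sensitive environment variables before they reach an AI prompt could look like the sketch below. The key-name pattern and the `[REDACTED]` placeholder are assumptions chosen for illustration; the project's actual redaction rules are not documented here.

```python
import re

# Assumed heuristic: a variable is sensitive if its name contains one of these words.
SENSITIVE_NAME = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|PASSWD|CREDENTIAL", re.IGNORECASE)


def redact_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with the values of sensitive-looking names masked."""
    return {
        name: "[REDACTED]" if SENSITIVE_NAME.search(name) else value
        for name, value in env.items()
    }
```

Matching on variable names rather than values is simple but can miss secrets stored under innocuous names, so a stricter allowlist may be preferable in practice.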
Extending SRE Sentinel
MCP Server Development
Creating a New MCP Server
- Create Server Directory:
  mkdir mcp-servers/your-server
  cd mcp-servers/your-server
- Initialize Node.js Project:
  npm init -y
  npm install @modelcontextprotocol/sdk dockerode
- Implement Server (index.js):
  // Run as an ES module (e.g. "type": "module" in package.json) so that
  // import statements and top-level await work.
  import { Server } from "@modelcontextprotocol/sdk/server/index.js";
  import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
  import {
    CallToolRequestSchema,
    ListToolsRequestSchema,
  } from "@modelcontextprotocol/sdk/types.js";

  const server = new Server(
    {
      name: "your-server",
      version: "1.0.0",
    },
    {
      capabilities: { tools: {} },
    }
  );

  // Define tools
  server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [
      {
        name: "your_tool",
        description: "What your tool does",
        inputSchema: {
          type: "object",
          properties: {
            param1: { type: "string", description: "Parameter description" },
          },
          required: ["param1"],
        },
      },
    ],
  }));

  // Handle tool calls
  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    const { name, arguments: args } = request.params;
    // Implement your tool logic here; this placeholder echoes the call
    const result = { tool: name, received: args };
    return {
      content: [{ type: "text", text: JSON.stringify(result) }],
    };
  });

  // Start server
  const transport = new StdioServerTransport();
  await server.connect(transport);
- Add to Catalog (mcp-servers/catalog.yaml):
  your-server:
    description: "Your custom MCP server"
    title: "Your Server"
    type: "server"
    dateAdded: "2025-10-05T00:00:00Z"
    image: "mcp-servers/your-server:latest"
    tools:
      - name: "your_tool"
        description: "What your tool does"
    env:
      - name: "NODE_ENV"
        value: "production"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    metadata:
      category: "custom"
      tags: ["your", "tags"]
      license: "MIT License"
      owner: "Your Name"
- Build and Deploy:
  # Build all MCP servers
  ./mcp-servers/build-servers.sh
  # Restart Gateway to load new server
  docker-compose restart mcp-gateway
Best Practices for MCP Servers
- Validation: Always validate input parameters
- Error Handling: Return structured error responses
- Security: Never expose sensitive data in tool responses
- Logging: Use stderr for logging (MCP standard)
- Idempotency: Design tools to be idempotent where possible
Custom AI Analysis
Modify the AI analysis in:
- src/cerebras_client.py: Anomaly detection logic
- src/llama_analyzer.py: Root cause analysis logic
Custom Monitoring
Extend monitoring in src/monitor.py:
- Add new metrics collection
- Implement custom anomaly detection
- Add specialized fix actions
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Model Context Protocol for secure AI-tool integration
- Cerebras for fast AI inference
- Llama for advanced reasoning capabilities
- Docker for containerization platform
