Persistent Agent Runtime
Persistent runtime for long-running AI agent tasks with durable checkpoints, lease-based workers, retries, dead letters, and operational visibility
Durable execution infrastructure for AI agents.
Why This Exists
Running AI agents in production (not as chatbots, but as programmatic, multi-step tasks on remote compute) surfaces infrastructure problems that most agent frameworks don't address:
- Crash recovery: Agent tasks can run for minutes or hours across many LLM calls. On ephemeral compute (containers, spot instances), the process can be killed at any time. Without durable checkpointing, all completed steps are lost and must be re-executed, wasting time and LLM spend.
- Execution visibility: When agents run on remote workers instead of your terminal, you need per-step cost tracking, failure diagnostics, and execution history to understand what happened and why.
- Failure management: At scale, agent task failures need to be inspectable and actionable: structured dead-letter queues with redrive, not just a stack trace in logs.
This project is developer infrastructure for deploying agents at scale: the layer between "I have an agent" and "I can run 500 of them reliably in production." Target use cases are coding agents, batch document processing, and research tasks.
Key capabilities:
- Checkpoint-resume execution: Postgres-backed LangGraph checkpointer; tasks survive worker crashes at node boundaries (not deterministic replay)
- Lease-based crash recovery: any worker can resume any task; a reaper recovers expired leases
- Multi-provider LLM: Anthropic, OpenAI, Google, and AWS Bedrock, auto-discovered from configured API keys
- Per-agent cost tracking and budgets: incremental per-step cost, budget-based task pausing
- Human-in-the-loop: waiting_for_approval and waiting_for_input states with stateless pause/resume
- Customer-owned observability: per-task Langfuse endpoints, checkpoint-based cost/token aggregation
- Dead-letter with redrive: structured failure inspection, not log scraping
- Bring-your-own-tools (BYOT): customer-provided MCP tool servers registered by URL
- Sandboxed code execution: E2B-backed shell, file I/O, and artifact export; file input via multipart upload
Built as a working system with a Java/Spring Boot API, Python worker, React/Vite console, and PostgreSQL state store, deployable on ECS Fargate + Aurora Serverless v2.
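To make the checkpoint-resume capability concrete, here is a minimal sketch of resuming at node boundaries. It is not the project's checkpointer or the LangGraph API: `run_task`, `CheckpointStore`, and the in-memory dict standing in for PostgreSQL are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for the Postgres-backed checkpoint table:
# task_id -> (index of the next node to run, state after the last finished node)
CheckpointStore = Dict[str, Tuple[int, dict]]

Node = Callable[[dict], dict]


def run_task(task_id: str, nodes: List[Node], store: CheckpointStore) -> dict:
    """Run nodes in order, checkpointing after each one.

    If a checkpoint already exists for the task, execution resumes at the
    first node that has not completed instead of starting over.
    """
    next_index, state = store.get(task_id, (0, {}))
    for index in range(next_index, len(nodes)):
        # A crash inside this call loses only the work of the current node.
        state = nodes[index](dict(state))
        # Node boundary: record that this node is done before moving on.
        store[task_id] = (index + 1, state)
    return state
```

If a node raises partway through, calling `run_task` again with the same `task_id` and store re-runs only that node and the ones after it. This is resume-from-checkpoint, not deterministic replay: a node interrupted mid-way is executed again from its start.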
Current Architecture
The system uses a stateless worker pool with a database-as-queue model. Workers are interchangeable: any worker can claim any task, and any worker can resume a checkpointed task from another worker. State lives in PostgreSQL (shared), not on local disk, so tasks survive crashes, deployments, and scaling events.
- Clients submit a task through the API service, targeting a configured agent.
- The task is stored in PostgreSQL in queued state.
- A worker claims the task with FOR UPDATE SKIP LOCKED.
- The worker loads the agent's configuration (model, tools, system prompt) and executes the LangGraph workflow, writing checkpoints to PostgreSQL.
- Heartbeats extend the lease while work is in progress.
- Tasks can pause for human approval or input (waiting_for_approval, waiting_for_input), releasing the worker. Any worker can resume when the human responds.
- A reaper recovers expired leases and handles timeout/dead-letter transitions.
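The claim, heartbeat, and reaper steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the project's schema or worker code: `LeaseTable`, its field names, and the 30-second lease are hypothetical, and the dictionary scan stands in for a `SELECT ... FOR UPDATE SKIP LOCKED` query against PostgreSQL. The sketch only requeues expired tasks; it does not model the timeout or dead-letter transitions the reaper also handles.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    task_id: str
    status: str = "queued"  # "queued" or "running" in this sketch
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


class LeaseTable:
    """In-memory stand-in for the tasks table (all names are hypothetical)."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self.tasks: Dict[str, Task] = {}

    def submit(self, task_id: str) -> None:
        self.tasks[task_id] = Task(task_id)

    def claim(self, worker_id: str, now: float) -> Optional[Task]:
        # Take the first queued task and attach a lease to it.
        for task in self.tasks.values():
            if task.status == "queued":
                task.status = "running"
                task.lease_owner = worker_id
                task.lease_expires_at = now + self.lease_seconds
                return task
        return None

    def heartbeat(self, task_id: str, worker_id: str, now: float) -> bool:
        # Only the current lease owner may extend the lease.
        task = self.tasks[task_id]
        if task.status != "running" or task.lease_owner != worker_id:
            return False
        task.lease_expires_at = now + self.lease_seconds
        return True

    def reap(self, now: float) -> List[str]:
        # Requeue running tasks whose lease expired so another worker can resume.
        recovered = []
        for task in self.tasks.values():
            if task.status == "running" and task.lease_expires_at <= now:
                task.status = "queued"
                task.lease_owner = None
                recovered.append(task.task_id)
        return recovered
```

Passing `now` explicitly keeps the sketch deterministic. A worker that stops heartbeating simply lets its lease lapse, and the next `reap` call makes the task claimable again.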
Built-in Agent Tools
Agents have access to these tools at runtime, wired directly into the LangGraph executor. Actual availability per task depends on the agent's allowed_tools config and whether a sandbox is provisioned:
- web_search: web search via DuckDuckGo
- read_url: fetch and extract text from a public URL
- request_human_input: pause the task and wait for a human operator response
- create_text_artifact: save text output as a downloadable task artifact (S3). Available only to non-sandbox agents; sandbox-enabled agents use export_sandbox_file for output files instead
- sandbox_exec, sandbox_read_file, sandbox_write_file, export_sandbox_file: shell and file I/O inside an E2B sandbox (available when the agent is configured with sandbox.enabled: true)
Agents configured with a custom MCP tool server URL (BYOT) get the server's tools merged into the same tool list at task start.
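The availability rules above can be summarized in a few lines. The function below is a hedged sketch, not the worker's actual wiring: `resolve_tools` and its parameters are hypothetical, and it assumes that allowed_tools filters only the built-in tools while BYOT tools are appended unfiltered, which the text does not confirm.

```python
from typing import Iterable, List

# Tool names taken from the list above.
BASE_TOOLS = ["web_search", "read_url", "request_human_input"]
SANDBOX_TOOLS = [
    "sandbox_exec",
    "sandbox_read_file",
    "sandbox_write_file",
    "export_sandbox_file",
]


def resolve_tools(
    allowed_tools: Iterable[str],
    sandbox_enabled: bool,
    mcp_tools: Iterable[str] = (),
) -> List[str]:
    """Return the tool names a task would see under the rules described above."""
    # create_text_artifact is only for non-sandbox agents; sandbox-enabled
    # agents get the sandbox tools (including export_sandbox_file) instead.
    builtin = BASE_TOOLS + (SANDBOX_TOOLS if sandbox_enabled else ["create_text_artifact"])
    allowed = set(allowed_tools)
    tools = [name for name in builtin if name in allowed]
    # BYOT: tools from the customer's MCP server join the same list
    # (assumption: they are not filtered by allowed_tools).
    tools += [name for name in mcp_tools if name not in tools]
    return tools
```

For example, an agent that allows `web_search`, `create_text_artifact`, and `sandbox_exec` would see the first two without a sandbox, and `web_search` plus `sandbox_exec` with one.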
Repository Layout
- docs/product-specs/: vision, user stories, core concepts
- docs/design-docs/: architecture and design documents
- docs/exec-plans/: implementation plans (active and completed)
- services/api-service/: Spring Boot API service
- services/console/: React SPA for monitoring and controlling the runtime
- services/worker-service/: Python worker, checkpointer, executor, and tools
- tests/backend-integration/: cross-service integration tests (API + Worker + PostgreSQL, mocked LLMs)
- tests/e2e-langfuse/: Langfuse integration E2E tests (connectivity, trace publishing, cost tracking)
- tests/fixtures/: shared test infrastructure (Langfuse Docker Compose)
- experiments/langgraph/: proof-of-concept and validation work
- infrastructure/database/: schema migrations and verification
- infrastructure/cdk/: AWS CDK infrastructure (Task 8)
Getting Started
Prerequisites
- Java 21+, Python 3.11+, Node.js 18+, Docker
Quick Start
# 1. Bootstrap the database (first time only)
make init
# 2. Configure environment
cp .env.localdev.example .env.localdev
# At least one LLM key: ANTHROPIC_API_KEY, OPENAI_API_KEY
# Optional: TAVILY_API_KEY (web_search tool)
# 3. Install and run
make install
make start # single worker (default)
make start N=3 # or start with multiple workers
For detailed setup options, environment variables, timing configuration, and manual database setup, see docs/LOCAL_DEVELOPMENT.md.
Deploy to AWS
For a full AWS deployment (Aurora Serverless v2, ECS Fargate, internal ALB, SSM access), see infrastructure/README.md.
Useful Entry Points
- API service: services/api-service/README.md
- Console: services/console/README.md
- Worker service: services/worker-service/README.md
- Backend integration tests: tests/backend-integration/README.md
- Database schema: infrastructure/database/README.md
Common Commands
make install # install all dependencies
make start # start all services (1 worker)
make start N=3 # start all services with 3 workers
make scale-worker N=5 # scale workers up or down to N
make stop # stop all services
make status # show service statuses
make test-langfuse-up # start local Langfuse for testing
make test-langfuse-down # stop local Langfuse stack
make test-langfuse-status # inspect local Langfuse containers
make test-e2e-langfuse # run Langfuse E2E tests (requires Langfuse + full stack)
make check # verify prerequisites without starting services
make logs # tail background service logs
make api-test # API service tests
make worker-test # worker service tests
make e2e-test # backend integration tests
make db-reset-verify # reset and verify database schema (destructive)
make clean # remove build artifacts
Tip: use make -n <target> to preview the shell commands for a target without executing them. For example, make -n start N=3 shows the full startup flow for three workers.
Development Status
The repo is in active development.
Phase 1: Durable Execution (complete)
- Database schema with lease-based task claiming (FOR UPDATE SKIP LOCKED)
- REST API for task submission, listing, status, checkpoints, cancellation, dead-letter listing, and redrive
- Worker poller, heartbeat manager, and reaper
- Worker registry with self-registration, heartbeat, and stale worker cleanup
- PostgreSQL-backed LangGraph checkpointer
- Built-in tools: web_search, read_url, request_human_input, plus a standalone FastMCP server exposing web_search, read_url, calculator for external use
- Dev-only task controls for forced lease expiry and dead-letter transitions
- Dynamic model provider management: database-backed provider/model registry, auto-discovery from API keys, per-model cost tracking
- Console frontend: dashboard, task list, task dispatcher, execution telemetry, dead letter queue
- End-to-end test coverage for crash recovery and lifecycle behavior
Phase 2: Multi-Agent (in progress)
Completed tracks:
- Track 1 (Agent Control Plane): Agent as first-class entity with CRUD, configuration management, and per-agent task routing
- Track 2 (Runtime State Model): Human-in-the-loop workflows (waiting_for_approval, waiting_for_input), stateless pause/resume, task event history
- Track 3 (Scheduler and Budgets): Agent-aware round-robin scheduling, per-agent budgets, budget-based task pausing, incremental cost tracking
- Track 4 (Custom Tool Runtime, BYOT): Customer-provided MCP tool servers registered by URL, worker-side MCP client integration, bearer token auth
Cross-cutting (completed):
- Customer-owned Langfuse integration: per-task Langfuse endpoint configuration, CRUD API, connectivity testing, checkpoint-based cost/token aggregation
- Agent Capabilities: E2B sandbox integration (provision / pause / resume / destroy lifecycle, crash recovery via sandbox_id reconnect), sandbox tools (sandbox_exec, sandbox_read_file, sandbox_write_file, export_sandbox_file), output artifact storage on S3 (LocalStack locally), create_text_artifact tool, and multipart file-input task submission; see docs/design-docs/agent-capabilities/design.md
Upcoming:
- Track 5 (Memory): long-term memory extraction, append-only storage, compaction
- Track 6 (GitHub Integration): code agent input/output via pull requests
AWS Deployment (validated end-to-end)
- CDK infrastructure: VPC, Aurora Serverless v2, ECS Fargate, internal ALB, SSM access host
- Automated schema bootstrap and model discovery on deploy
- Full deployment walkthrough: see infrastructure/README.md
Design Documents
Start here if you want the actual system contract rather than the repo overview:
- docs/design-docs/core-beliefs.md: architectural invariants governing all phases
- docs/design-docs/phase-1/design.md: durable execution foundation
- docs/design-docs/phase-2/design.md: multi-agent, memory, scheduling, custom tools
- docs/design-docs/agent-capabilities/design.md: sandbox, artifacts, file input
- docs/design-docs/langfuse/design.md: customer-owned Langfuse integration
- docs/design-docs/phase-3-plus/design-notes.md: future reference material
Testing
There are four practical test layers in the repo:
- API service tests in services/api-service
- worker service tests in services/worker-service/tests
- backend integration tests in tests/backend-integration/
- Langfuse E2E tests in tests/e2e-langfuse/ (requires local Langfuse via make test-langfuse-up)
The integration suite is the best place to validate the intended runtime lifecycle:
- queued -> running -> completed
- retries and exponential backoff
- dead-letter behavior
- cancellation and redrive
- crash recovery and checkpoint resume
- multi-worker coordination
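As a rough illustration of the retry and dead-letter behavior the suite exercises, the sketch below computes an exponential backoff delay and decides whether a failed task is requeued or dead-lettered. The base delay, cap, attempt limit, and the `dead_letter` state name are assumptions made for this example; the runtime's actual values and names are defined in the design documents and schema, not here.

```python
def backoff_delay(attempt: int, base_seconds: float = 2.0, cap_seconds: float = 300.0) -> float:
    """Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))


def next_state(failed_attempts: int, max_attempts: int = 3) -> str:
    """After a failure, requeue for another try or move the task to dead letter."""
    # "dead_letter" is a hypothetical state name used only in this sketch.
    return "queued" if failed_attempts < max_attempts else "dead_letter"
```

With these defaults the first four retries wait 2, 4, 8, and 16 seconds, and a task that has failed three times moves to dead letter, where it stays until it is redriven.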
Local Validation Workflow
For local end-to-end validation of the Makefile workflow:
- Run make check to confirm prerequisites and local dependencies.
- Use make db-migrate for safe schema setup, or make db-reset-verify if you explicitly want a destructive reset plus schema verification.
- Start the stack with make start or make start N=3.
- Confirm the stack is healthy with make status, curl http://localhost:8080/actuator/health, and optionally curl -I http://localhost:5173.
- Tail logs with make logs if startup looks suspicious.
- Stop the stack with make stop when finished.
make start waits for the console, API, and requested worker count to come up before reporting success.
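If you want to script the health check from the workflow above instead of running curl by hand, a small poller along these lines may help. It is a sketch, not part of the repo: the function names are hypothetical, and it relies only on the health URL given above and on Spring Boot Actuator's standard response shape, where the overall state is reported as a top-level "status" field.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/actuator/health"  # URL from the workflow above


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bytes:
    """Return the raw response body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def is_up(body: bytes) -> bool:
    """True when the body is a JSON object whose "status" field is "UP"."""
    try:
        return json.loads(body).get("status") == "UP"
    except (ValueError, AttributeError):
        return False  # not JSON, or JSON that is not an object


def wait_until_up(
    fetch: Callable[[], bytes] = fetch_health,
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll until the API reports UP or the attempts run out."""
    for attempt in range(attempts):
        try:
            if is_up(fetch()):
                return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: not ready yet
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling `wait_until_up()` after `make start` returns `True` once the API answers with status UP. The `fetch` parameter exists so the polling logic can be exercised without a running stack.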
For code changes during local development, make start is source-based rather than build-artifact-based:
- Console runs the Vite dev server, so frontend edits hot-reload without a rebuild.
- API runs ./gradlew bootRun, so a separate ./gradlew build is not required before startup, but you do need to restart the API process to pick up Java code changes.
- Worker runs python main.py from source, so a separate build is not required, but you do need to restart worker processes to pick up Python code changes.
If services are already running, re-running make start does not restart them. Use make restart or the service-specific stop/start targets when you want local code changes to take effect for the API or worker.
When validating background service management, prefer running make start, make status, and make stop from a real interactive terminal. Some non-interactive runners can reap child processes when the parent command exits, which makes background-service checks look misleading.
Notes
- AGENTS.md is the primary agent-facing navigation file. CLAUDE.md redirects to it.
- Local .venv, build output, caches, and logs are intentionally not part of the committed project structure.
- .tmp/ is used for transient local runtime output such as E2E service logs.
