Build A Search Engine With AI Agents
Add links, crawl them, search their contents. Your personal search engine is here. Powered by pg_textsearch, pgvector, Postgres, Tiger Data Cloud free tier, and more.
tars
A personal search engine CLI that stores, crawls, and searches links using PostgreSQL with three search modes: BM25 keyword search, vector semantic search, and hybrid search combining both using Reciprocal Rank Fusion (RRF).
Features
- Link Management: Add, list, remove, and organize URLs
- Web Crawling: Extract titles, descriptions, and content from pages using Playwright
- BM25 Full-Text Search: Fast keyword matching with pg_textsearch
- Semantic Vector Search: Find conceptually similar content using OpenAI embeddings via pgvector
- Hybrid Search: Combine keyword and semantic search with RRF for best results
- Web Interface: Browser-based UI for searching and managing links
- MCP Server: Expose search as tools for LLM integration (Claude, etc.)
- Search Caching: Automatic caching of hybrid search results for performance
Prerequisites
- Python 3.12+
- uv (Python package manager)
- PostgreSQL with extensions:
  - pg_textsearch (BM25 search)
  - pgvector (vector similarity)
  - pgai (AI embeddings)
Database: Use TigerData (free tier available), which comes with all required extensions pre-installed and with managed OpenAI API keys for embeddings.
Installation
1. Clone and Install
git clone <repo-url>
cd search-engine
# Install globally as a uv tool
uv tool install -e .
# Or run without installing
uv run tars --help
2. Install Playwright Browser
playwright install chromium
3. Configure Database
Create a free database on TigerData, then add your connection string to a .env file:
# Option A: Single connection string
DATABASE_URL=postgresql://user:password@host:5432/dbname
# Option B: Individual variables
PGHOST=localhost
PGPORT=5432
PGDATABASE=tars
PGUSER=postgres
PGPASSWORD=secret
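The two configuration options above can be resolved into one connection string. The following is a minimal sketch, not the tars implementation: `resolve_dsn` is a hypothetical helper, and the precedence (DATABASE_URL wins over the individual PG* variables) is an assumption.

```python
from urllib.parse import quote


def resolve_dsn(env: dict[str, str]) -> str:
    """Build a PostgreSQL connection URL from an environment mapping.

    Hypothetical helper for illustration only.
    """
    # Option A: a single connection string (assumed to take precedence).
    url = env.get("DATABASE_URL")
    if url:
        return url

    # Option B: assemble the URL from individual PG* variables.
    host = env.get("PGHOST")
    dbname = env.get("PGDATABASE")
    user = env.get("PGUSER")
    if not (host and dbname and user):
        raise ValueError("Set DATABASE_URL or PGHOST, PGDATABASE, and PGUSER")

    port = env.get("PGPORT", "5432")  # 5432 is the documented default
    password = env.get("PGPASSWORD", "")

    # Percent-encode credentials so characters like '@' or ':' stay valid.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return f"postgresql://{auth}@{host}:{port}/{dbname}"
```

In a real program you would pass `dict(os.environ)` after loading the .env file (for example with python-dotenv).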
4. Initialize Database Schema
tars db init
This creates:
- links table with full-text search columns
- BM25 index for keyword search
- Search cache table for hybrid search results
5. Set Up Vector Search (Optional but Recommended)
# Add embedding column and HNSW index
tars db vector init
# Generate embeddings for existing links
tars db vector embed
Quick Start
# Add some links
tars add https://docs.python.org
tars add https://react.dev/learn
tars add https://www.postgresql.org/docs/
# Crawl to extract content
tars crawl
# Search (hybrid by default)
tars search "python web development"
# View all links
tars list
CLI Commands
Link Management
tars add <url> # Add a new link
tars list # List all stored links (paginated)
tars list -n 20 -p 2 # Page 2 with 20 results per page
tars remove <url> # Remove by URL
tars remove "*.example.com" # Remove by glob pattern
tars update <url> # Update timestamp for a link
tars clean-list # Remove duplicate links (CSV mode only)
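Removal by glob pattern can be pictured with Python's fnmatch module. This is only a sketch: how tars actually applies the pattern is not documented here, so matching against both the full URL and the hostname is an assumption, and `matches` and `select_for_removal` are hypothetical names.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches(pattern: str, url: str) -> bool:
    """Return True if the glob matches the full URL or its hostname.

    Assumption: a pattern like "*.example.com" targets the hostname,
    while a plain URL is compared against the whole stored URL.
    """
    host = urlsplit(url).hostname or ""
    return fnmatchcase(url, pattern) or fnmatchcase(host, pattern)


def select_for_removal(pattern: str, urls: list[str]) -> list[str]:
    """Return the stored URLs that the pattern would remove."""
    return [url for url in urls if matches(pattern, url)]


stored = [
    "https://docs.example.com/a",
    "https://example.org/",
    "https://www.postgresql.org/docs/",
]
to_remove = select_for_removal("*.example.com", stored)
```

Under this reading, "*.example.com" matches subdomains such as docs.example.com but not the bare example.com host.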
Search Commands
# Hybrid search (BM25 + vector with RRF) - recommended
tars search "<query>"
tars search "machine learning" --keyword-weight 0.7 --vector-weight 0.3
tars search "python" --min-score 0.01 -n 20
# BM25 keyword search only
tars text_search "<query>"
tars text_search "postgresql tutorial" -n 10 -p 1
# Vector semantic search only
tars vector "<query>"
tars vector "how to build web apps" -n 10
Web Crawling
tars crawl # Crawl uncrawled links (default)
tars crawl <url> # Crawl a specific URL
tars crawl --all # Re-crawl all links
tars crawl --missing # Only crawl links never crawled
tars crawl --old 7 # Crawl links not crawled in last 7 days
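The crawl flags above amount to a filter on each link's crawled_at timestamp. Here is a minimal sketch of that selection logic; `needs_crawl` and its `mode` values are hypothetical, and treating never-crawled links as stale under `--old` is an assumption.

```python
from datetime import datetime, timedelta, timezone


def needs_crawl(crawled_at, mode="missing", old_days=7, now=None):
    """Decide whether a link should be crawled.

    crawled_at: timezone-aware datetime of the last crawl, or None.
    mode: "missing" (default and --missing), "all" (--all), "old" (--old N).
    """
    now = now or datetime.now(timezone.utc)
    if mode == "all":
        return True  # re-crawl everything
    if mode == "missing":
        return crawled_at is None  # only links never crawled
    if mode == "old":
        if crawled_at is None:
            return True  # assumption: never-crawled links count as stale
        return crawled_at < now - timedelta(days=old_days)
    raise ValueError(f"unknown mode: {mode}")
```

With `mode="old"` and `old_days=7`, a link last crawled eight days ago is selected and one crawled three days ago is skipped.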
Database Management
tars db init # Initialize database schema
tars db migrate # Import links from CSV to database
tars db status # Show database connection status
# Vector embedding management
tars db vector init # Add embedding column and HNSW index
tars db vector embed # Generate embeddings for all pending links
tars db vector embed -n 50 # Generate embeddings for 50 links
tars db vector status # Show embedding status
Web Interface
tars web # Start web server at http://127.0.0.1:8000
tars web --port 3000 # Custom port
tars web --open # Open browser automatically
tars web --reload # Enable auto-reload for development
MCP Server (LLM Integration)
# Run as stdio server (for local Claude Code)
tars mcp
# Run as HTTP/SSE server (for remote connections)
tars mcp --sse --port 8000
Add to Claude Code's MCP config (~/.claude/claude_mcp_settings.json):
{
"mcpServers": {
"tars": {
"command": "tars",
"args": ["mcp"]
}
}
}
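If the settings file already lists other MCP servers, the tars entry needs to be merged in rather than pasted over them. A small sketch of that merge using only the standard library; `add_tars_server` and `update_config_file` are hypothetical helpers, not part of tars.

```python
import json
from pathlib import Path


def add_tars_server(config: dict) -> dict:
    """Return a copy of the config with the tars entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["tars"] = {"command": "tars", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the JSON config if it exists, merge, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_tars_server(existing), indent=2) + "\n")
```

Calling `update_config_file(Path.home() / ".claude" / "claude_mcp_settings.json")` would apply it to the path named above; back up the file first.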
Search Modes Explained
BM25 Keyword Search (text_search)
- Uses the pg_textsearch extension with the BM25 ranking algorithm
- Best for exact keyword matching
- Fast and efficient for known terms
tars text_search "PostgreSQL performance tuning"
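To give a feel for what BM25 ranking rewards, here is a toy scorer using the textbook Okapi BM25 formula. It is not how pg_textsearch computes scores internally: the whitespace tokenizer, the IDF variant, and the parameters `k1=1.2` and `b=0.75` are all illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with textbook BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "postgresql performance tuning guide",
    "react component tutorial",
    "postgresql backup basics",
]
scores = bm25_scores("postgresql tuning", docs)
```

The first document matches both query terms, including the rarer "tuning", so it scores highest; the second matches nothing and scores zero.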
Vector Semantic Search (vector)
- Uses OpenAI text-embedding-3-small via pgai
- Finds conceptually similar content even without exact matches
- Great for natural language queries
tars vector "how do databases store data efficiently"
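Vector search ranks links by how close their embedding is to the query embedding. The sketch below uses cosine distance, which pgvector exposes as the `<=>` operator; whether tars uses that metric is an assumption, and the 3-dimensional vectors stand in for real 1536-dimensional embeddings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec: list[float], doc_vecs: dict[str, list[float]], limit: int = 10):
    """Return (url, distance) pairs, closest first."""
    scored = [(url, cosine_distance(query_vec, vec)) for url, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:limit]


# Toy embeddings with hypothetical URLs, for illustration only.
toy_embeddings = {
    "https://example.com/databases": [1.0, 0.0, 0.0],
    "https://example.com/cooking": [0.0, 1.0, 0.0],
    "https://example.com/storage-engines": [0.9, 0.1, 0.0],
}
results = nearest([1.0, 0.0, 0.0], toy_embeddings, limit=2)
```

In the database this ranking is done by the HNSW index rather than a full scan, which is why `tars db vector init` creates one.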
Hybrid Search (search)
- Combines BM25 and vector search using Reciprocal Rank Fusion (RRF)
- Best of both worlds: keyword precision + semantic understanding
- Adjustable weights to favor keywords or semantics
# Equal weights (default)
tars search "python machine learning"
# Favor keyword matches
tars search "exact error message" --keyword-weight 0.8 --vector-weight 0.2
# Favor semantic similarity
tars search "feeling anxious" --keyword-weight 0.3 --vector-weight 0.7
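Weighted Reciprocal Rank Fusion can be sketched in a few lines. Each result list contributes weight / (k + rank) for every link it contains, and the contributions are summed. This is an illustration rather than the tars implementation: `rrf_fuse` is a hypothetical name, `k=60` is the commonly used constant, and equal weights of 0.5 are an assumed default.

```python
def rrf_fuse(
    keyword_ranked: list[str],
    vector_ranked: list[str],
    keyword_weight: float = 0.5,
    vector_weight: float = 0.5,
    k: int = 60,
    min_score: float = 0.0,
) -> list[tuple[str, float]]:
    """Fuse two ranked lists of URLs into one list of (url, score)."""
    scores: dict[str, float] = {}
    for weight, ranked in ((keyword_weight, keyword_ranked), (vector_weight, vector_ranked)):
        for rank, url in enumerate(ranked, start=1):
            # Only the rank matters, so BM25 and vector scores
            # never need to be put on a common scale.
            scores[url] = scores.get(url, 0.0) + weight / (k + rank)
    fused = [(url, score) for url, score in scores.items() if score >= min_score]
    # Highest score first; ties broken by URL for stable output.
    fused.sort(key=lambda item: (-item[1], item[0]))
    return fused


keyword_hits = ["a", "b", "c"]  # BM25 order, best first
vector_hits = ["c", "a", "d"]   # vector order, best first
fused = rrf_fuse(keyword_hits, vector_hits)
```

With equal weights, "a" ranks first because it sits near the top of both lists. Shifting the weights to 0.2 and 0.8 moves "c" to the top, and a `min_score` of 0.01 drops links that appear in only one list.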
Architecture
src/tars/
├── __init__.py # CLI entry point and argument parsing
├── db.py # PostgreSQL operations (CRUD, search, embeddings)
├── crawl.py # Web crawling with Playwright
├── mcp/ # MCP server for LLM integration
│   ├── __init__.py
│   ├── server.py # FastMCP server with tools
│   └── models.py # Pydantic models for MCP responses
└── web/ # Web interface
    ├── __init__.py
    ├── app.py # FastAPI application
    ├── routes/ # API and page routes
    └── templates/ # Jinja2 HTML templates
Database Schema
CREATE TABLE links (
id UUID PRIMARY KEY,
url TEXT UNIQUE NOT NULL,
title TEXT,
description TEXT,
content TEXT,
notes TEXT,
tags TEXT[],
hidden BOOLEAN DEFAULT FALSE,
added_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ,
crawled_at TIMESTAMPTZ,
http_status INTEGER,
crawl_error TEXT,
search_text TEXT GENERATED ALWAYS AS (...) STORED,
embedding vector(1536)
);
Step-by-Step Setup Guide
Complete setup from scratch:
# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone the repository
git clone <repo-url>
cd search-engine
# 3. Install tars globally
uv tool install -e .
# 4. Install browser for crawling
playwright install chromium
# 5. Create database on TigerData (free tier available)
# Sign up at: https://tsdb.co/jm-pgtextsearch
# Create a service and copy the connection string
cat > .env << 'EOF'
DATABASE_URL=postgresql://user:password@host:5432/dbname
EOF
# 6. Initialize database
tars db init
# 7. Initialize vector search
tars db vector init
# 8. Add your first links
tars add https://docs.python.org
tars add https://react.dev
tars add https://www.postgresql.org
# 9. Crawl links to extract content
tars crawl
# 10. Generate embeddings for semantic search
tars db vector embed
# 11. Verify everything works
tars db status
tars db vector status
# 12. Search!
tars search "web development"
tars text_search "python"
tars vector "building modern applications"
# 13. (Optional) Start web interface
tars web --open
# 14. (Optional) Set up MCP for Claude Code
# Add to ~/.claude/claude_mcp_settings.json
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | - |
| PGHOST | Database host | - |
| PGPORT | Database port | 5432 |
| PGDATABASE | Database name | - |
| PGUSER | Database user | - |
| PGPASSWORD | Database password | - |
| TARS_CACHE_TTL | Search cache TTL in seconds | 3600 |
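The effect of TARS_CACHE_TTL can be illustrated with a small time-based cache. tars keeps its search cache in a database table; the in-memory `SearchCache` class below is a hypothetical stand-in that only shows the expiry behaviour, and keying on the normalized query plus both weights is an assumption.

```python
import time


class SearchCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict[tuple, tuple[float, list]] = {}

    @staticmethod
    def make_key(query: str, keyword_weight: float, vector_weight: float) -> tuple:
        # Assumed normalization: trim whitespace and ignore case.
        return (query.strip().lower(), keyword_weight, vector_weight)

    def get(self, key: tuple):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired, so drop it
            return None
        return results

    def put(self, key: tuple, results: list) -> None:
        self._entries[key] = (self._clock(), results)
```

A caller could read the TTL with `float(os.environ.get("TARS_CACHE_TTL", "3600"))` and pass it as `ttl_seconds`.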
Dependencies
- rich - Terminal output formatting
- psycopg - PostgreSQL database access
- playwright - Web crawling
- python-dotenv - Environment configuration
- fastapi - Web interface API
- uvicorn - ASGI server
- jinja2 - HTML templating
- fastmcp - MCP server framework
License
MIT
