Multimodal Rag
No description available
Ask AI about Multimodal Rag
Powered by Claude Β· Grounded in docs
I know everything about Multimodal Rag. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
Multimodal RAG with Gemini Embedding
A Retrieval-Augmented Generation application that embeds multiple content types β text, images, video, audio, and PDFs β using Google's Gemini Embedding 2 model, stores vectors in Supabase (pgvector), and uses Gemini 3.1 Flash Lite for reasoning. Built as a single Streamlit app with an optional MCP server for programmatic access.
Setup
-
Install dependencies:
pip install -r requirements.txt -
Create a
.envfile with your API keys (see.env.example):GEMINI_API_KEY=your-key SUPABASE_URL=https://your-project.supabase.co SUPABASE_SERVICE_KEY=your-key -
Set up the Supabase database β run
schema.sqlin the Supabase SQL Editor (or viapsql). This creates thedocumentstable, enables pgvector, and adds the required RPC functions. -
Run the app:
streamlit run app.py
Features
Upload & Embed
- Upload one or multiple files at once (text, Markdown, images, PDFs, audio, video)
- Assign each batch to a named collection for thematic grouping
- Files that exceed size limits are automatically
chunked:
- Text / Markdown: ~6000 token chunks with 500 token overlap
- PDF: 5-page chunks via PyMuPDF
- Audio: 75-second segments via pydub
- Video: 120-second segments via moviepy
- Duplicate detection: chunks already stored for a given filename are skipped, allowing interrupted uploads to resume without re-embedding
- Per-file progress bar and status text during embedding
- All embeddings are L2-normalized before storage
Search
- Natural language queries embedded with
RETRIEVAL_QUERYtask type - Vector similarity search via Supabase RPC (cosine distance)
- Configurable top-k (1β50) and similarity threshold (0.0β1.0, default 0.3)
- Filter results by content type and by collection
- Images and videos displayed inline in search results
- Optional reasoning via Gemini 3.1 Flash Lite with numbered source citations
Browse
- View all stored documents in a table (ID, title, type, filename, chunk, collection, created)
- Delete all chunks of a file by filename
- Delete a single chunk by document ID
Sidebar
- Live database statistics (total chunks, counts per content type)
- Collection filter applied to searches
- Stats refresh on demand
Architecture
app.py Streamlit GUI
(upload, search, browse + sidebar)
mcp_server.py MCP server exposing RAG search
lib/
βββ gemini_client.py Shared google-genai client
βββ embedder.py Gemini Embedding 2 (3072d)
βββ chunker.py Content-aware chunking
βββ db.py Supabase vector operations
βββ rag.py RAG pipeline orchestration
βββ reasoning.py Gemini Flash Lite reasoning
MCP Server
mcp_server.py exposes four tools over the MCP
protocol (stdio transport), allowing any
MCP-compatible client (e.g. Claude Desktop) to query
the RAG index directly.
Available tools
| Tool | Description |
|---|---|
search_documents | Semantic vector search; returns ranked excerpts |
search_and_reason | Vector search + Gemini reasoning; returns answer |
list_collections | Lists all available document collections |
document_stats | Returns chunk count and per-type breakdown |
Tool parameters (search tools)
| Parameter | Type | Default | Description |
|---|---|---|---|
query | string | required | Search query |
top_k | int | 10 | Max results (1β50) |
threshold | float | 0.3 | Min similarity |
content_type | string | "all" | Type filter |
collection | string | "all" | Collection filter |
Running the MCP server
python mcp_server.py
The server runs over stdio by default.
The .mcp.json in the project root pre-configures
it for use with Claude Desktop using the local
.venv interpreter.
Tech Stack
| Component | Technology |
|---|---|
| Embeddings | Gemini Embedding 2 Preview (3072d) |
| Reasoning | Gemini 3.1 Flash Lite |
| Vector DB | Supabase + pgvector |
| GUI | Streamlit |
| MCP server | FastMCP (stdio) |
| PyMuPDF | |
| Audio | pydub |
| Video | moviepy |
