# 🚀 LLM-Assisted NoSQL Query Generation

**Natural Language Interface for Multiple NoSQL Databases.** Translate plain English questions into MongoDB, Neo4j, Redis, HBase, and RDF queries using LLMs and the Model Context Protocol (MCP).

## 📋 Table of Contents
- Overview
- Key Features
- Live Demo
- Architecture
- Supported Databases
- Quick Start
- Usage Examples
- Project Structure
- Implementation Highlights
- Documentation
- Research Background
- Contributing
- License
## 🎯 Overview
This project bridges the gap between non-technical users and NoSQL databases by leveraging Large Language Models (LLMs) to translate natural language queries into database-specific query languages. Built with the Model Context Protocol (MCP), it provides a unified interface across five different NoSQL database types.
### Why This Project?
- Barrier to Entry: NoSQL databases each have unique query languages (MongoDB's JSON, Neo4j's Cypher, SPARQL for RDF, etc.)
- Learning Curve: Users need to learn multiple query syntaxes to work with different databases
- Developer Productivity: Writing complex queries requires deep knowledge of each database's API
- Our Solution: Ask questions in plain English, get optimized database queries automatically
## ✨ Key Features

### 🗣️ Natural Language Query Translation
- Write queries in plain English
- Automatic database type detection based on query intent
- Context-aware query generation using database schemas
- Powered by Groq's LLaMA 3.3 70B model
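Database type detection could work along these lines; this is only an illustrative sketch, and the `detect_database_type` function and keyword lists are hypothetical, not the project's actual routing logic:

```python
# Hypothetical keyword heuristic for routing a natural language question
# to a database type; the real detection may use the LLM itself.
DB_KEYWORDS = {
    "neo4j": ["relationship", "connected", "graph", "path"],
    "redis": ["key", "cache", "session", "ttl"],
    "rdf": ["triple", "ontology", "sparql", "predicate"],
    "hbase": ["row key", "column family", "wide column"],
}

def detect_database_type(question: str) -> str:
    """Pick a target database from keywords; default to MongoDB."""
    q = question.lower()
    for db, keywords in DB_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return db
    return "mongodb"  # document queries are the common default
```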
### 🗄️ Multi-Database Support
Support for 5 NoSQL database types:
- MongoDB - Document store with JSON queries and aggregation pipelines
- Neo4j - Graph database with Cypher query language
- Redis - Key-value store with command-based interface
- HBase - Wide-column store for big data
- RDF/Apache Jena Fuseki - Triple store with SPARQL
### ✅ Query Validation & Explanation
- Pre-execution validation - Catch errors before running queries
- LLM-powered explanations - Understand what each query does in plain English
- Schema-aware validation - Verify field names, collections, and data types
- Support for complex queries - Aggregations, joins, filters, and more
### 🔍 Schema Exploration
- Automatic schema discovery - Explore collections, tables, and graph structures
- Interactive visualization - Browse schemas through web UI
- Field type inference - Automatic detection of data types
- Sample data preview - See example documents/records
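Field type inference can be as simple as scanning sample documents. The sketch below is illustrative (the function name and return shape are assumptions, not the project's API):

```python
# Infer field types by collecting the Python type names seen across
# a handful of sample documents (hypothetical helper).
from collections import defaultdict

def infer_field_types(samples: list[dict]) -> dict[str, set[str]]:
    """Map each field name to the set of type names seen in samples."""
    types: dict[str, set[str]] = defaultdict(set)
    for doc in samples:
        for field, value in doc.items():
            types[field].add(type(value).__name__)
    return dict(types)

docs = [
    {"title": "Inception", "year": 2010, "genres": ["Sci-Fi", "Thriller"]},
    {"title": "Up", "year": 2009, "rating": 8.3},
]
schema = infer_field_types(docs)
# e.g. schema["year"] == {"int"}, schema["genres"] == {"list"}
```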
### ⚖️ Cross-Database Comparison
- Syntax comparison - See how the same query translates across databases
- Performance metrics - Compare execution times
- Result analysis - Compare query results
- Visual charts - Bar charts for performance comparison
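A minimal sketch of how per-database timings could be collected for the comparison view; the executor callables here are stand-ins, not real database clients:

```python
# Run the same logical query against each backend and time it
# (illustrative harness; names are hypothetical).
import time

def compare_execution(executors: dict) -> dict:
    """Return row counts and wall-clock time per database."""
    results = {}
    for db_name, run_query in executors.items():
        start = time.perf_counter()
        rows = run_query()
        elapsed_ms = (time.perf_counter() - start) * 1000
        results[db_name] = {"rows": len(rows), "time_ms": round(elapsed_ms, 2)}
    return results

metrics = compare_execution({
    "mongodb": lambda: [{"year": 2020}] * 3,  # stand-in result sets
    "redis": lambda: ["user:1001"],
})
```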
### 🎨 Modern Web Interface
- Streamlit-based UI - Clean, intuitive interface
- Real-time query execution - Instant results
- Query history - Track and reuse previous queries
- Interactive components - Dropdowns, sliders, charts
### 🏗️ MCP Architecture
- Modular design - Each database has its own MCP server
- Direct tools mode - Fast execution by bypassing MCP protocol overhead
- Extensible - Easy to add new databases
- Production-ready - Error handling, logging, caching
## 🚀 Live Demo

### Web Interface (Streamlit)

```bash
streamlit run frontend/app.py
```
Features:
- 💬 Natural Language Query - Ask questions, get results
- 🗄️ Schema Explorer - Browse MongoDB and RDF schemas
- ✅ Query Validation - Validate and explain queries for all 5 databases
- ⚖️ Cross-Database Comparison - Compare query performance across databases

### Example Queries
| Natural Language | Database | Generated Query |
|---|---|---|
| "Find all movies from 2020" | MongoDB | `{"year": 2020}` |
| "Show people who know Python" | Neo4j | `MATCH (p:Person)-[:KNOWS_SKILL]->(t:Technology {name: 'Python'}) RETURN p` |
| "Get user profile for user:1001" | Redis | `GET user:1001` |
| "Count movies by year for 2010-2015" | MongoDB | `[{"$match": {"year": {"$gte": 2010, "$lte": 2015}}}, {"$group": {"_id": "$year", "count": {"$sum": 1}}}]` |
## 🏗️ Architecture

### Key Components

1. **Query Engine** (`src/main_app/query_engine.py`)
   - LLM prompt engineering for each database type
   - Schema context gathering
   - Query validation and explanation
   - Result formatting

2. **MCP Manager** (`src/main_app/mcp_manager.py`)
   - Manages connections to all MCP servers
   - Direct tools mode for fast execution
   - Error handling and logging

3. **MCP Servers** (`src/mcp_servers/`)
   - One server per database type
   - Implements tools for each database operation
   - Schema discovery and validation

4. **Streamlit Frontend** (`frontend/app.py`)
   - User-friendly web interface
   - Query validation and explanation
   - Cross-database comparison with charts
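To make the query engine's "prompt engineering" concrete, here is a hypothetical sketch of the kind of schema-aware prompt it could assemble; the template and names are illustrative, not the project's actual prompt:

```python
# Build an LLM prompt that grounds translation in the live schema
# (hypothetical helper; the real templates are per-database).
def build_translation_prompt(question: str, database_type: str,
                             schema_context: str) -> str:
    """Assemble a translation prompt for the target database."""
    return (
        f"You translate natural language into {database_type} queries.\n"
        f"Schema:\n{schema_context}\n"
        f"Question: {question}\n"
        "Return only the query, with no commentary."
    )

prompt = build_translation_prompt(
    "Find all movies from 2020",
    "mongodb",
    "collection movies: title (str), year (int), genres (list)",
)
```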
## 🗄️ Supported Databases

| Database | Type | Query Language | Status | Features |
|---|---|---|---|---|
| MongoDB Atlas | Document Store | JSON / Aggregation Pipeline | ✅ Complete | Schema discovery, find queries, aggregations, validation |
| Neo4j | Graph Database | Cypher | ✅ Complete | Node/relationship queries, pattern matching, schema |
| Redis | Key-Value Store | Commands / MCP Tools | ✅ Complete | Key operations, schema patterns, data structures |
| HBase | Wide-Column Store | Row operations | ✅ Complete | Table scans, column family queries, filters |
| RDF (Apache Jena) | Triple Store | SPARQL | ✅ Complete | Triple patterns, ontology queries, reasoning |
### Database Capabilities Matrix

| Feature | MongoDB | Neo4j | Redis | HBase | RDF |
|---|---|---|---|---|---|
| Schema Discovery | ✅ | ✅ | ✅ | ✅ | ✅ |
| Query Validation | ✅ | ✅ | ✅ | ✅ | ✅ |
| Query Explanation | ✅ | ✅ | ✅ | ✅ | ✅ |
| Aggregations | ✅ | ✅ | ⚠️ Limited | ✅ | ✅ |
| Cross-Database Compare | ✅ | ✅ | ✅ | ✅ | ✅ |
## 🚀 Quick Start

### Prerequisites
- Python 3.10+
- Docker & Docker Compose
- Groq API Key (Get free key)
### Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/LLM-Assisted_Query_Generation.git
   cd LLM-Assisted_Query_Generation
   ```

2. **Create a virtual environment**

   ```bash
   python -m venv venv
   # Windows
   venv\Scripts\activate
   # Linux/Mac
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Configure the environment**

   ```bash
   # Copy example configuration
   cp .env.example .env
   # Edit .env and add your Groq API key
   # GROQ_API_KEY=your_key_here
   ```

5. **Start the databases**

   ```bash
   docker-compose up -d
   # Wait for databases to be ready (~30 seconds)
   docker ps  # Verify all containers are running
   ```

6. **Load sample data**

   ```bash
   python datasets/mongodb_samples/load_mongodb.py  # MongoDB
   python datasets/neo4j_samples/load_neo4j.py      # Neo4j
   python datasets/redis_samples/load_redis.py      # Redis
   python datasets/hbase_samples/load_hbase.py      # HBase
   python datasets/rdf_samples/load_rdf.py          # RDF
   ```

7. **Launch the web interface**

   ```bash
   streamlit run frontend/app.py
   ```

   Open your browser to http://localhost:8501
## 💡 Usage Examples

### Example 1: MongoDB Aggregation Pipeline
Natural Language Query:
"Count how many movies per genre"
Generated MongoDB Query:
```json
[
  {"$unwind": "$genres"},
  {
    "$group": {
      "_id": "$genres",
      "count": {"$sum": 1}
    }
  },
  {"$sort": {"count": -1}}
]
```
Explanation:
This aggregation pipeline first breaks down the genres array so each genre becomes a separate document, then groups all movies by their genre and counts how many movies belong to each genre, finally sorting the results by count in descending order to show the most popular genres first.
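To make that explanation concrete, here is a pure-Python simulation of the three stages (no MongoDB required); it mirrors what the server does to each document, using toy data:

```python
# Simulate $unwind, $group, and $sort on sample documents.
from collections import Counter

movies = [
    {"title": "Inception", "genres": ["Sci-Fi", "Thriller"]},
    {"title": "Alien", "genres": ["Sci-Fi", "Horror"]},
]

# $unwind: one record per (movie, genre) pair
unwound = [g for m in movies for g in m["genres"]]
# $group + $sort: count per genre, most common first
counts = Counter(unwound).most_common()
# counts == [("Sci-Fi", 2), ("Thriller", 1), ("Horror", 1)]
```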
### Example 2: Neo4j Graph Query
Natural Language Query:
"Find people who work at TechCorp and know Python"
Generated Cypher Query:
```cypher
MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: 'TechCorp'}),
      (p)-[:KNOWS_SKILL]->(t:Technology {name: 'Python'})
RETURN p.name, p.age, p.city
```
Explanation:
This query finds Person nodes that have both a WORKS_AT relationship to TechCorp and a KNOWS_SKILL relationship to Python technology, returning their name, age, and city information.
### Example 3: Redis Data Retrieval
Natural Language Query:
"Get all user profile keys"
Generated Redis Query:
```json
{
  "tool": "get_keys",
  "arguments": {
    "pattern": "user:*"
  }
}
```
Explanation:
This operation retrieves all Redis keys that match the pattern "user:*", which will return all user profile keys stored in the database.
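For simple patterns like `user:*`, Redis glob matching behaves like a shell glob; the snippet below uses Python's `fnmatch` to illustrate the idea (an approximation for demonstration, not the server's matcher):

```python
# Approximate Redis glob matching with fnmatch on a toy keyspace.
from fnmatch import fnmatchcase

keys = ["user:1001", "user:1002", "session:abc", "item:42"]
user_keys = [k for k in keys if fnmatchcase(k, "user:*")]
# user_keys == ["user:1001", "user:1002"]
```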
### Example 4: Cross-Database Comparison
Natural Language Query:
"Find all items created in the last year"
Results:
| Database | Query Syntax | Execution Time | Results |
|---|---|---|---|
| MongoDB | `{"created_at": {"$gte": "2025-01-01"}}` | 45ms | 1,234 |
| Neo4j | `MATCH (n) WHERE n.created_at >= date('2025-01-01') RETURN n` | 67ms | 856 |
| Redis | `SCAN 0 MATCH item:* COUNT 100` | 12ms | 543 |
| HBase | `Scan 'items' FILTER created_at >= '2025-01-01'` | 89ms | 2,145 |
| RDF | `SELECT ?s WHERE { ?s :created_at ?d FILTER (?d >= "2025-01-01"^^xsd:date) }` | 134ms | 678 |
## 📁 Project Structure

```
LLM-Assisted_Query_Generation/
├── frontend/                      # Streamlit web interface
│   ├── app.py                     # Main Streamlit app
│   ├── run_app.sh                 # Launch script (Linux/Mac)
│   └── run_app.bat                # Launch script (Windows)
│
├── src/
│   ├── main_app/                  # Core application logic
│   │   ├── query_engine.py        # LLM query translation engine
│   │   ├── mcp_manager.py         # MCP connection manager
│   │   ├── cross_db_compare.py    # Cross-database comparison
│   │   ├── query_explainer.py     # Query explanation generation
│   │   ├── query_history.py       # Query history tracking
│   │   └── schema_validator.py    # Schema validation
│   │
│   ├── mcp_servers/               # Database MCP servers
│   │   ├── mongodb_mcp/           # MongoDB server
│   │   │   ├── server.py          # MCP server implementation
│   │   │   └── tools.py           # MongoDB-specific tools
│   │   ├── neo4j_mcp/             # Neo4j server
│   │   ├── redis_mcp/             # Redis server
│   │   ├── hbase_mcp/             # HBase server
│   │   └── rdf_mcp/               # RDF server (Jena Fuseki)
│   │
│   └── utils/                     # Utility modules
│       ├── config.py              # Configuration management
│       └── logger.py              # Structured logging
│
├── datasets/                      # Sample data for all databases
│   ├── mongodb_samples/
│   │   ├── load_mongodb.py
│   │   └── sample_users.json
│   ├── neo4j_samples/
│   ├── redis_samples/
│   ├── hbase_samples/
│   └── rdf_samples/
│
├── docs/                          # Documentation
│   ├── FRONTEND_UPDATES.md        # Frontend changelog
│   ├── QUERY_VALIDATION_COMPLETE.md
│   ├── REDIS_INTEGRATION_COMPLETE.md
│   └── literature_review.md
│
├── tests/                         # Test suite
│   ├── test_mongodb_mcp.py
│   ├── test_neo4j_mcp.py
│   └── ...
│
├── docker-compose.yml             # Database orchestration
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment template
└── README.md                      # This file
```
## 🔬 Implementation Highlights

### 1. LLM-Powered Query Explanation
All query explanations are generated using an LLM (LLaMA 3.3 70B via Groq) with database-specific prompts:
```python
async def explain_query(self, query: str, database_type: str, context: Dict) -> str:
    """Generate a natural language explanation using the LLM."""
    # Build a database-specific prompt
    if database_type == "mongodb":
        prompt = f"""Explain this MongoDB query in simple terms:
Query: {query}
Collection: {context['collection']}
"""
    # Call the LLM for an explanation
    response = await self.groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content
```
### 2. MongoDB Aggregation Pipeline Support
Automatically detects and validates both find queries and aggregation pipelines:
```python
# Detect the query type
parsed = json.loads(query_input)
if isinstance(parsed, list):
    # Aggregation pipeline
    validation = {
        "valid": True,
        "message": f"Valid aggregation with {len(parsed)} stages",
        "type": "aggregation",
    }
elif isinstance(parsed, dict):
    # Find query
    validation = await validate_find_query(parsed)
```
### 3. Redis Dual Format Support
Supports both raw Redis commands and JSON MCP tool format:
```python
# Try the JSON format first
try:
    redis_query = json.loads(query_input)
    if "tool" in redis_query:
        # MCP tool format: {"tool": "get_key", "arguments": {...}}
        return validate_mcp_tool(redis_query)
except json.JSONDecodeError:
    # Raw command format: "GET user:1001"
    return validate_redis_command(query_input)
```
### 4. Context-Aware HBase Explanations
Extracts filter information from HBase JSON queries:
```python
# Parse the HBase query for context
hbase_query = json.loads(query_input)
context = {
    "table": table_name,
    "operation": hbase_query.get("operation"),
    "filter_column": hbase_query.get("filter_column"),
    "filter_value": hbase_query.get("filter_value"),
}
# Generate a detailed explanation with context
explanation = await explain_query(query_input, "hbase", context)
```
### 5. Direct Tools Mode for Performance
Bypasses MCP protocol overhead by calling tools directly:
```python
# Instead of: await mcp_client.call_tool_via_protocol(...)
# Use:        await tools_module.execute_directly(...)
result = await self.tools.get_schema(database="test_db")
# ~3x faster than protocol mode
```
## 📚 Documentation
- Frontend Guide - Streamlit web interface usage
- Query Validation - Validation system documentation
- Redis Integration - Redis MCP server details
- Literature Review - Research background
## 🔬 Research Background

This project is inspired by recent research in natural language interfaces for NoSQL databases:

### Key Papers

1. Qin, Zhiqian, et al. "MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation." ACL 2025.
   - First multilingual benchmark for NoSQL query translation
   - Covers MongoDB, Neo4j, and Redis

2. Lu, Jinwei, et al. "Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases." arXiv 2025.
   - Focus on MongoDB query generation
   - Schema-aware query translation

3. Tola, Alessandro. "Towards User-Friendly NoSQL: A Synthetic Dataset Approach and LLMs." Politecnico di Torino, 2024.
   - Synthetic dataset generation
   - LLM evaluation for NoSQL

4. Yang, Tianhao. "LLM-Enhanced Data Management in Multi-Model Databases." 2025.
   - Multi-model database integration
   - Query optimization using LLMs
### Our Contributions

- ✅ Extended database support - 5 NoSQL types vs 2-3 in prior work
- ✅ Production-ready MCP architecture - Modular, extensible design
- ✅ LLM-powered explanations - Not just translation, but understanding
- ✅ Cross-database comparison - Unique feature for syntax learning
- ✅ Web interface - User-friendly UI for non-programmers
## 🤝 Contributing

This is an academic research project. Contributions are welcome!

### Development Setup

```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Code formatting
black src/ tests/ frontend/

# Type checking
mypy src/

# Linting
ruff check src/ tests/
```
### Adding a New Database

1. Create an MCP server in `src/mcp_servers/your_db_mcp/`
2. Implement tools in `tools.py`
3. Add a prompt template in `query_engine.py`
4. Update `mcp_manager.py` to register the server
5. Add a sample data loader in `datasets/your_db_samples/`
6. Write tests in `tests/test_your_db_mcp.py`
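One way a new `tools.py` could expose its operations is a small tool registry that `mcp_manager.py` iterates when registering the server. The real servers are built on MCP, so treat this as an illustrative sketch; the decorator and `get_schema` body are assumptions:

```python
# Hypothetical tool-registry pattern for a new database's tools.py.
from typing import Callable

TOOLS: dict[str, Callable] = {}

def tool(name: str):
    """Decorator that records a function in the server's tool registry."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return register

@tool("get_schema")
def get_schema(database: str) -> dict:
    # A real implementation would inspect the live database.
    return {"database": database, "collections": []}
```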
## 📄 License
MIT License - see LICENSE file for details.
## 🙏 Acknowledgments
- Groq - Free LLM API access (LLaMA 3.3 70B)
- Research Community - Papers and datasets that inspired this work
- Open Source Projects - FastMCP, Neo4j Python Driver, PyMongo, etc.
## 📊 Project Stats
- Lines of Code: ~15,000+
- Databases Supported: 5
- MCP Servers: 5 (one per database)
- Sample Queries: 100+
Built using Python, Streamlit, and LLMs
⭐ Star this repo if you find it useful!