Speech MCP
FastMCP 3.2 server plus web app for speech input and output
Speech-MCP
A modern multi-provider speech gateway featuring Gemini Live real-time voice chat, Gemini 3.1 Flash TTS, Hume AI Octave, and ElevenLabs voice cloning.
The Dual-Core Experience
- MCP Server: Advanced speech, RAG, and state management for agents and IDEs (Claude Desktop, Cursor, Windsurf).
- Modern Webapp: A browser-based cockpit for real-time voice conversations, Creative Labs polyglot synthesis, voice clone management, and system monitoring.
Providers
| Provider | Mode | Quality | API key |
|---|---|---|---|
| gemini_live | Real-time conversation | Very good | GOOGLE_API_KEY |
| gemini | Batch TTS | Highest | GOOGLE_API_KEY |
| gemma | Batch TTS/STT | SOTA local | None |
| hume | Batch TTS (Octave) | High | HUME_API_KEY |
| elevenlabs | Batch TTS + voice cloning | High | ELEVENLABS_API_KEY |
| windows | Batch TTS (SAPI5) | Low | None |
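
The key column above suggests a simple way to see which providers are usable in a given environment. The following is a minimal sketch, not part of the project: the `PROVIDER_KEYS` mapping and the `available_providers` helper are hypothetical names introduced here, and the mapping is copied from the table only.

```python
import os

# Provider -> required environment variable (None means no key is needed).
# Copied from the provider table; the helper itself is a hypothetical sketch.
PROVIDER_KEYS = {
    "gemini_live": "GOOGLE_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "gemma": None,
    "hume": "HUME_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "windows": None,
}


def available_providers(env=None):
    """Return providers whose required key is set, or that need no key."""
    env = os.environ if env is None else env
    return [
        name
        for name, key in PROVIDER_KEYS.items()
        if key is None or env.get(key)
    ]


if __name__ == "__main__":
    print(available_providers())
```

Note that a provider without a key requirement is not necessarily usable everywhere: `windows` relies on SAPI5, and `gemma` presumably needs a local model to be installed.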
Key Features
- Gemma 4 Native Multimodal: SOTA 2026 local engine integration. Features native audio/vision encoders for low-latency conversational reasoning. Supports prosody-aware interaction and local-first Zero-STT fallback. Optimized for A4B throughput (100+ t/s).
- Gemini 3.1 Flash TTS: Highest-quality cloud synthesis (gemini-3.1-flash-tts-preview). 31 prebuilt voices, 100+ languages, expressive audio tags ([whispers], [excited], etc.).
- Creative Labs: Polyglot synthesis demo with 19 languages (European, Slavic, Classical, Experimental, Global), literary samples, voice selection, prosody slider, and tongue-twister panel.
- Voice Cloning: ElevenLabs Instant Voice Clone (IVC) via file upload. 5-second minimum audio sample. Cloned voices appear in the voice library immediately.
- Offline Wake-Word: Privacy-first detection using openWakeWord (fully offline, Apache 2.0, no API key).
- RAG / Semantic Search: LanceDB + FastEmbed knowledge base over project docs. The ask_docs tool uses Claude sampling for grounded Q&A.
- Local AI: Ollama and LM Studio model discovery and grounded generation.
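
Because voice cloning requires a sample of at least 5 seconds, it can help to check a recording's length before uploading it. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file; `wav_duration_seconds` and `is_long_enough` are hypothetical helper names, not functions from this project or from ElevenLabs.

```python
import wave

# Minimum sample length stated in the Voice Cloning feature description.
MIN_CLONE_SECONDS = 5.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, minimum=MIN_CLONE_SECONDS):
    """True if the recording meets the minimum length for cloning."""
    return wav_duration_seconds(path) >= minimum
```

Other formats such as MP3 would need a different decoder; this check covers WAV input only.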
Documentation
- Installation
- Configuration reference
- Local voice alternatives (kyutai-mcp / offline)
- Gemini Live voice chat (new)
- Architecture
- openWakeWord
- Yahboom robot integration
- RAG technical overview
- Modern speech AI
Quick Start
# Clone and install
git clone https://github.com/sandraschi/speech-mcp
cd speech-mcp
uv sync
# Configure keys
cp .env.example .env
# Edit .env and add GOOGLE_API_KEY at minimum
# Start backend
uv run python -m speech_mcp.webapp
# Start frontend (separate terminal)
cd web && npm install && npm run dev
Backend: http://localhost:10918 | Frontend: http://localhost:10917
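
To confirm that both processes are listening after startup, a plain TCP connection test is usually enough. This is a minimal sketch using the standard `socket` module: the port numbers come from the line above, while `port_is_open` is a hypothetical helper. It only checks that something accepts connections on the port, not that the service is healthy.

```python
import socket

BACKEND_PORT = 10918   # from the Quick Start
FRONTEND_PORT = 10917  # from the Quick Start


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("backend", BACKEND_PORT), ("frontend", FRONTEND_PORT)):
        state = "up" if port_is_open(port) else "down"
        print(f"{name} on port {port}: {state}")
```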
For Claude Desktop MCP integration see docs/configuration.md.
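
As a rough orientation, Claude Desktop reads MCP servers from an `mcpServers` object in its configuration file. The sketch below builds such an entry in Python. Everything specific to this project is an assumption: the entry name `speech-mcp`, the launch command (`python -m speech_mcp`), and the helper `claude_desktop_entry` are placeholders, so take the real values from docs/configuration.md.

```python
import json


def claude_desktop_entry(repo_path, google_api_key):
    """Build an `mcpServers` entry. Command and args are assumed, not confirmed."""
    return {
        "mcpServers": {
            "speech-mcp": {  # ASSUMPTION: entry name chosen for illustration
                "command": "uv",
                # ASSUMPTION: the MCP server starts via `python -m speech_mcp`;
                # the source only documents `python -m speech_mcp.webapp`.
                "args": ["--directory", repo_path, "run", "python", "-m", "speech_mcp"],
                "env": {"GOOGLE_API_KEY": google_api_key},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(claude_desktop_entry("/path/to/speech-mcp", "your-key"), indent=2))
```

The printed JSON can be merged into an existing Claude Desktop configuration once the command and arguments have been checked against the project documentation.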
License
MIT. See LICENSE.
Contributors: @sandraschi. PRs welcome.
