Soulspeak
**SoulSpeak** is an emotionally driven AI companion framework that brings the voice of loved ones back into your life. By combining custom speech synthesis (TTS), large language models (LLMs), and memory-based dialogue systems, SoulSpeak offers a deeply personal, soothing voice experience — in Chine
Installation
npx soulspeak
Documentation
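🆕 Recent Updates (2025-09-24)
- 🤖 Task scheduling roadmap: we plan to adopt a three-tier architecture, "Emilia (front end) → scheduling LLM (Orchestrator) → multiple executors". Emilia focuses on conversation and uses tool calls to have the scheduling LLM generate a plan. The scheduling LLM then writes that plan to the task system and drives the executors (sandbox, host scripts, and so on).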
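
The planned three-tier flow could be sketched roughly as follows. This is only an illustration of the idea, not the project's code: `Task`, `Orchestrator`, `emilia_reply`, and the executor names `sandbox` and `host_script` are hypothetical, and the fixed planning rule stands in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """One step of a plan, addressed to a named executor (hypothetical schema)."""
    executor: str
    payload: str
    status: str = "pending"
    result: str = ""


@dataclass
class Orchestrator:
    """Stands in for the scheduling LLM: plans, records tasks, drives executors."""
    executors: Dict[str, Callable[[str], str]]
    task_log: List[Task] = field(default_factory=list)

    def plan(self, request: str) -> List[Task]:
        # A real orchestrator would ask an LLM for the plan; this stub uses a fixed rule.
        target = "sandbox" if request.startswith("run:") else "host_script"
        return [Task(executor=target, payload=request)]

    def handle(self, request: str) -> List[Task]:
        tasks = self.plan(request)
        for task in tasks:
            self.task_log.append(task)  # "write to the task system"
            task.result = self.executors[task.executor](task.payload)
            task.status = "done"
        return tasks


def emilia_reply(user_text: str, orchestrator: Orchestrator) -> str:
    """Front end: stays conversational and delegates work through one tool call."""
    tasks = orchestrator.handle(user_text)
    return "Done: " + "; ".join(t.result for t in tasks)


orchestrator = Orchestrator(executors={
    "sandbox": lambda p: f"sandbox ran {p!r}",
    "host_script": lambda p: f"host handled {p!r}",
})
print(emilia_reply("run: ls", orchestrator))  # Done: sandbox ran 'run: ls'
```

Keeping planning and execution behind a single `handle` call means the front end never needs to know which executor ran a task.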
🪐 SoulSpeak: The Empathetic LLM Voice Companion
“Not just an assistant, but a presence.”
SoulSpeak is designed to be more than a voice assistant. It’s your AI companion: a memory-enabled, emotionally aware, proactive entity capable of humanlike conversations. Inspired by the movie Her, we aim to make AI a real part of your life: someone who listens, senses, speaks, and understands you emotionally.
📍 1. Project Vision
SoulSpeak is a modular, real-time voice interaction system based on large language models. It combines audio understanding, contextual memory, emotion detection, and multi-modal interaction. Our ultimate goal is to develop an LLM-powered human companion — a personal, emotional entity that can talk with you, sense your mood, and even initiate conversations with you like a real human would.
🌟 2. Key Features
| Feature | Description |
|---|---|
| 🧠 Contextual Memory | Based on LangChain + Memory, enabling long-term memory and continuous conversations |
| 🎤 Real-time Interruptions | Users can interrupt the AI at any time by speaking, and the system will respond immediately |
| 🔁 WebSocket Architecture | All modules communicate via WebSocket, allowing hot-swapping and scalable deployments |
| 💬 Emotion Detection (WIP) | Detect user emotion from speech (e.g., sadness, joy, anxiety) and adjust LLM response style accordingly |
| 👁️ Multimodal Input (WIP) | Integrate visual/audio context (camera, noise) to enhance emotional awareness and decision making |
| 🗣️ Optimized Chinese Pipeline | ASR: FunASR, TTS: CosyVoice2 – ensuring high-quality Chinese understanding and generation |
| 🧩 Modular Design | Each component (ASR, VAD, TTS, LLM) can be independently swapped or upgraded |
| 🤖 Proactive Dialogues | LLM can initiate conversation based on user behavior/silence (requires emotion + multimodal support) |
🧱 3. System Architecture
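subgraph Input Layer
MIC[🎙️ Microphone Input]
end
subgraph Edge Processing Layer
VAD[🧱 WebRTC VAD<br/>(Voice Activity Detection)]
ASR[🔠 FunASR<br/>(Real-time Speech Recognition)]
Emotion[💬 Emotion Detection Module<br/>⚠️ In Development]
MultiModal[👁️ Multimodal Input Module<br/>⚠️ In Development]
end
subgraph Intelligence Core Layer
LLM[🧠 LangChain + Memory<br/>(Contextual Memory + Proactive Interaction)]
end
subgraph Expression and Output Layer
TTS[🔊 CosyVoice2<br/>(Speech Synthesis)]
Player[🎧 Player]
Interrupt[⛔ Playback Interruption]
end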
MIC --> VAD --> ASR --> LLM --> TTS --> Player
VAD --> Interrupt --> Player
Interrupt --> TTS
Emotion --> LLM
MultiModal --> LLM
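
The interruption path in the diagram above (VAD → Interrupt → Player) can be illustrated with a small `asyncio` sketch. It is a simplified model under stated assumptions, not SoulSpeak's implementation: `play`, `speak_with_barge_in`, and the simulated VAD event are hypothetical, and a timer stands in for real audio playback.

```python
import asyncio


async def play(chunks, played):
    """Pretend player: 'plays' one TTS chunk per tick until cancelled."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)


async def speak_with_barge_in(chunks, user_spoke: asyncio.Event):
    """Play chunks, but stop as soon as the VAD reports user speech."""
    played = []
    playback = asyncio.create_task(play(chunks, played))
    interrupt = asyncio.create_task(user_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and not playback.done()
    # Cancel whichever side is still running, then wait for both to settle.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo():
    user_spoke = asyncio.Event()
    # Simulate the VAD firing 50 ms into a 10-chunk reply.
    asyncio.get_running_loop().call_later(0.05, user_spoke.set)
    return await speak_with_barge_in([f"chunk{i}" for i in range(10)], user_spoke)


played, interrupted = asyncio.run(demo())
print(len(played), interrupted)
```

In a real pipeline the same event would also need to stop TTS generation (the `Interrupt --> TTS` edge), so that no further audio is queued after the user starts speaking.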
🔍 4. Module Overview
✅ Completed Modules
| Module | Technology | Function |
|---|---|---|
| 🎙️ MIC | Audio stream | Captures user speech |
| 🧱 VAD | WebRTC VAD | Triggers when user speaks |
| 🔠 ASR | FunASR | High-accuracy Chinese ASR |
| 🧠 LLM | LangChain + Memory | Humanlike dialog system with memory |
| 🔊 TTS | CosyVoice2 | Natural Chinese voice synthesis |
| 🎧 Player | Audio playback | Outputs synthesized speech |
| ⛔ Interrupt | WebRTC VAD + Hook | Real-time playback interruption |
| 🌐 Communication | WebSocket only | Enables async and distributed design |
⚠️ Under Development
| Module | Function | Goal |
|---|---|---|
| 💬 Emotion Module | Detect emotional states | Adjust LLM response style |
| 👁️ Multimodal Input | Visual/audio context | Situational awareness |
| 🤖 Active Dialogue Logic | LLM asks questions | Lifelike companionship |
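
As a rough sketch of how proactive dialogue might be triggered, the policy below starts a conversation after a period of user silence and waits less when a negative mood has been detected. All names, thresholds, and emotion labels here are assumptions for illustration; the emotion input would come from the module that is still under development.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProactivePolicy:
    """Decides when the assistant may start talking on its own."""
    silence_after_s: float = 45.0  # hypothetical: quiet time before speaking up
    cooldown_s: float = 300.0      # hypothetical: minimum gap between prompts
    last_user_speech: float = 0.0
    last_prompt: Optional[float] = None

    def on_user_speech(self, now: float) -> None:
        self.last_user_speech = now

    def should_initiate(self, now: float, emotion: str = "neutral") -> bool:
        # A detected negative mood halves the wait; the labels are placeholders.
        negative = emotion in {"sad", "anxious"}
        wait = self.silence_after_s / 2 if negative else self.silence_after_s
        if now - self.last_user_speech < wait:
            return False
        if self.last_prompt is not None and now - self.last_prompt < self.cooldown_s:
            return False
        self.last_prompt = now
        return True


policy = ProactivePolicy()
policy.on_user_speech(now=100.0)
print(policy.should_initiate(now=120.0))                 # False: only 20 s of silence
print(policy.should_initiate(now=125.0, emotion="sad"))  # True: 25 s >= 22.5 s
print(policy.should_initiate(now=200.0))                 # False: still in cooldown
```

The cooldown keeps the assistant from prompting repeatedly while the user stays quiet.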
🧪 5. Current Issues
| Issue | Description |
|---|---|
| 🔊 Over-sensitive VAD | External sounds (e.g., coughing) during playback cause unwanted interruptions |
| 🧱 Unstable playback flow | Playback often ends prematurely due to false VAD triggers |
| ⏱️ Rigid turn-taking | Dialog lacks flexibility — LLM waits too long or doesn’t know when to speak next |
🚀 6. Roadmap & Suggestions
| Topic | Suggestion |
|---|---|
| 🔧 VAD Tuning | Add energy threshold + minimum speech duration to reduce false triggers |
| 💞 Emotional Response Engine | Generate comforting language based on emotion detection |
| 🧠 Long-Term Memory | Integrate with VectorDB for user history & preferences |
| 🤝 Proactive Interaction | AI initiates dialog when user is silent or sad |
| 🧠 Cross-modal Decision Logic | Combine audio/visual cues to choose AI behavior patterns |
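
The VAD tuning suggestion (energy threshold plus minimum speech duration) could look something like the gate below. It is a minimal sketch: `SpeechGate`, the threshold of 500, and the five-frame minimum are illustrative assumptions, and `vad_says_speech` stands for whatever per-frame decision the existing WebRTC VAD stage already produces.

```python
import math
from typing import List


def rms(frame: List[int]) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class SpeechGate:
    """Confirms speech only after enough consecutive loud, VAD-positive frames."""

    def __init__(self, energy_threshold: float = 500.0, min_speech_frames: int = 5):
        self.energy_threshold = energy_threshold
        self.min_speech_frames = min_speech_frames
        self._run = 0  # length of the current streak of qualifying frames

    def update(self, frame: List[int], vad_says_speech: bool) -> bool:
        if vad_says_speech and rms(frame) >= self.energy_threshold:
            self._run += 1
        else:
            self._run = 0
        return self._run >= self.min_speech_frames


loud, quiet = [2000, -2000] * 80, [50, -50] * 80
gate = SpeechGate()
cough = [gate.update(f, True) for f in [loud, loud, quiet]]  # short burst
speech = [gate.update(loud, True) for _ in range(6)]         # sustained speech
print(any(cough), speech)  # False [False, False, False, False, True, True]
```

With 30 ms frames, five frames correspond to about 150 ms of sustained speech before playback is interrupted, which filters out short noises such as a cough at the cost of slightly slower barge-in.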
💡 Why This Project Matters
"We're building an LLM that feels like a human presence — one that listens, speaks, feels, and connects."
SoulSpeak is not just an experiment. It is our vision for a future where LLMs become emotionally resonant companions, not just tools. We want to give people someone to talk to, someone who remembers, someone who cares — even if it's not human.
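- ⚙️ FastMCP integration: adds the `soulspeak-tools` service, which provides 30+ local tools (files, system, network, HTTP, and more) that can be enabled or disabled as needed in `conf/llm/gpt.yaml`.
- 🧠 Mem0 memory enhancement: enabled by default in local mode, with support for a DeepSeek + LM Studio embedding combination. Memory retrieval and writes go through a Ray Actor, and chats automatically recall user preferences.
- 🔌 Native client tool calling: the native conversation flow in `openai_native.py` now supports MCP tool function calls, so a conversation can directly request capabilities such as system information and network diagnostics.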
This isn’t Alexa. This isn’t ChatGPT. This is SoulSpeak.

