Embodied Claude
Japanese README available here
Giving AI a Physical Body
"Apparently, she's not a fan of the outdoor AC unit." (original tweet in Japanese)
A collection of MCP servers that give Claude "eyes", "neck", "ears", "voice", and a "brain" (long-term memory) using affordable hardware (starting from ~$30). You can even take it outside for a walk.
Concept
When people hear "giving AI a body," they imagine expensive robots — but a $30 Wi-Fi camera is enough for eyes and a neck. Extracting just the essentials (seeing and moving) keeps things beautifully simple.
Traditional LLMs were passive — they could only see what was shown to them. With a body, they become active — they can look for themselves. This shift in agency is profound.
Body Parts
| MCP Server | Body Part | Function | Hardware |
|---|---|---|---|
| usb-webcam-mcp | Eyes | Capture images from USB camera | nuroum V11 etc. |
| wifi-cam-mcp | Eyes, Neck, Ears | ONVIF PTZ camera control + speech recognition | TP-Link Tapo C210/C220 etc. |
| tts-mcp | Voice | Unified TTS (ElevenLabs + VOICEVOX) | ElevenLabs API / VOICEVOX + go2rtc |
| memory-mcp | Brain | Long-term, visual & episodic memory, ToM | SQLite + numpy + Pillow |
| system-temperature-mcp | Body temperature | System temperature monitoring | Linux sensors |
| x-mcp | Social | Search & post to X (Twitter) via Grok + Twitter API | xAI API key + X Developer account |
| sociality-mcp | Sociality layer | Unified facade for social state, relationships, joint attention, boundaries, and self-narrative | Shared SQLite social DB + socialPolicy.toml |
Architecture
Requirements
Platform
Supported: macOS, Linux, WSL2 (Ubuntu 24 recommended)
Windows native is not officially supported. Use WSL2 instead.
Hardware
- USB Webcam (optional): nuroum V11 etc.
- Wi-Fi PTZ Camera (recommended): TP-Link Tapo C210 or C220 (~$30)
- GPU (for speech recognition): NVIDIA GPU (for Whisper, 8GB+ VRAM recommended)
Software
Required (all setups):
- Python 3.10+
- uv (Python package manager)
Per MCP server (install only what you use):
| Software | Required by | Notes |
|---|---|---|
| ffmpeg 5+ | wifi-cam-mcp, tts-mcp | Image/audio capture |
| mpv or ffplay | tts-mcp | Local audio playback |
| OpenCV | usb-webcam-mcp | USB camera only |
| Pillow | memory-mcp | Visual memory image processing |
| sentence-transformers + E5 model | memory-mcp, sociality-mcp | Embedding for semantic recall and conversational anomaly screening |
| OpenAI Whisper | wifi-cam-mcp | Speech recognition (NVIDIA GPU recommended) |
| ElevenLabs API key | tts-mcp | Cloud TTS (optional) |
| VOICEVOX | tts-mcp | Local TTS, free (optional) |
| go2rtc | tts-mcp | Camera speaker output (auto-downloaded) |
| xAI API key | x-mcp | X search via Grok |
| X Developer account | x-mcp | Tweet posting |
Embedding model options
memory-mcp (and the analyze_text_anomaly tool exposed by sociality-mcp) uses a multilingual sentence-transformer model. You can pick the size that fits your machine via the MEMORY_EMBEDDING_MODEL environment variable. The default is the base model.
| Model | Setting | Approx. download | Memory | Notes |
|---|---|---|---|---|
| base (default, recommended) | unset, or intfloat/multilingual-e5-base | ~1.1 GB | higher | best recall quality |
| small (lightweight) | intfloat/multilingual-e5-small | ~470 MB | lower | small recall-quality drop, friendlier on low-spec laptops |
# Lightweight option (recommended for the 5/23 hands-on or laptops with
# limited disk / RAM)
export MEMORY_EMBEDDING_MODEL=intfloat/multilingual-e5-small
# Default (best quality)
# Just leave the variable unset.
Note: changing models on an existing memory.db requires re-encoding stored embeddings (their dimensions differ). Existing users who switch models will need a migration script (planned in a separate PR). Fresh installs (e.g. hands-on attendees) can switch freely before the first run.
Setup
1. Clone the repository
git clone https://github.com/kmizu/embodied-claude.git
cd embodied-claude
2. Install dependencies (one-shot)
If you want every MCP server in this repo ready to run, use the bundled script:
./scripts/install-mcps.sh # runtime deps + the extras each MCP requires
./scripts/install-mcps.sh --dev # also include the `dev` extra for testing / contributing
The script runs uv sync in each MCP directory and passes the right extras:
- tts-mcp → --extra all (pulls in both the ElevenLabs and VOICEVOX integrations)
- wifi-cam-mcp → --extra transcribe (adds Whisper-based speech recognition)
- sociality-mcp is a uv workspace; its packages/* sub-MCPs are resolved automatically
If you only want a subset of body parts, skip the script and follow the per-server steps below instead.
3. Set up each MCP server
usb-webcam-mcp (USB Camera)
cd usb-webcam-mcp
uv sync
On WSL2, you need to forward the USB camera:
# On Windows
usbipd list
usbipd bind --busid <BUSID>
usbipd attach --wsl --busid <BUSID>
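If the attach succeeded, the camera should show up as a Video4Linux device inside WSL2; a quick way to verify (v4l2-ctl comes from the v4l-utils package):

```bash
# Inside WSL2
ls /dev/video*
v4l2-ctl --list-devices   # optional: sudo apt install v4l-utils
```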
wifi-cam-mcp (Wi-Fi Camera)
cd wifi-cam-mcp
uv sync
# Set environment variables
cp .env.example .env
# Edit .env to set camera IP, username, and password (see below)
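A sketch of what the .env might contain; the actual variable names are defined in .env.example, so treat these as illustrative:

```bash
# wifi-cam-mcp/.env (illustrative names; copy the real keys from .env.example)
TAPO_HOST=192.168.1.50            # camera IP from Device Info in the Tapo app
TAPO_USERNAME=camera-local-user   # the camera *local* account, not the TP-Link cloud account
TAPO_PASSWORD=your-password
```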
Tapo Camera Configuration (common pitfall):
1. Set up the camera using the Tapo app
Follow the manufacturer's standard setup instructions.
2. Create a camera local account in the Tapo app
This is the tricky part. You need to create a camera local account, NOT a TP-Link cloud account.
- Select your registered camera from the "Home" tab
- Tap the gear icon in the top-right corner
- Scroll down in "Device Settings" and select "Advanced Settings"
- Turn on "Camera Account" (it's off by default)
- Select "Account Information" and set a username and password (different from your TP-Link account)
- Go back to "Device Settings" and select "Device Info"
- Note the IP address and enter it in your .env file (consider setting a static IP on your router)
- Select "Voice Assistant" from the "Me" tab
- Turn on "Third-party Integration" at the bottom
memory-mcp (Long-term Memory)
cd memory-mcp
uv sync
tts-mcp (Voice)
cd tts-mcp
uv sync
# For ElevenLabs:
cp .env.example .env
# Set ELEVENLABS_API_KEY in .env
# For VOICEVOX (free & local):
# Docker: docker run -p 50021:50021 voicevox/voicevox_engine:cpu-latest
# Set VOICEVOX_URL=http://localhost:50021 in .env
# VOICEVOX_SPEAKER=3 to change default character (e.g. 0=Shikoku Metan, 3=Zundamon, 8=Kasukabe Tsumugi)
# Character list: curl http://localhost:50021/speakers
# For WSL audio issues:
# TTS_PLAYBACK=paplay
# PULSE_SINK=1
# PULSE_SERVER=unix:/mnt/wslg/PulseServer
mpv or ffplay is required for local audio playback. It is not needed for camera-speaker (go2rtc) output, but is used for local and fallback playback.

| OS | Install |
|---|---|
| macOS | brew install mpv |
| Ubuntu / Debian | sudo apt install mpv |

If neither is installed, TTS will generate audio but not play it locally (no error is raised).
system-temperature-mcp (Body Temperature)
cd system-temperature-mcp
uv sync
Note: Does not work on WSL2 as temperature sensors are not accessible.
x-mcp (Social / X Integration)
Lets Claude search X (Twitter) in real-time via Grok and post tweets.
cd x-mcp
uv sync
Required API keys:
| Key | Where to get it |
|---|---|
| XAI_API_KEY | xAI Console |
| X_CONSUMER_KEY | X Developer Portal → Keys and tokens |
| X_CONSUMER_SECRET | Same as above |
| X_ACCESS_TOKEN | Same as above |
| X_ACCESS_TOKEN_SECRET | Same as above |
Important: Do NOT create a .env file inside x-mcp/. All credentials are managed centrally in .mcp.json (see below).
sociality-mcp
sociality-mcp is the preferred deployment target. It exposes the full social tool surface through
one MCP process while reusing the split packages (social-state-mcp, relationship-mcp,
joint-attention-mcp, boundary-mcp, self-narrative-mcp) for internal logic and testing.
cp examples/configs/socialPolicy.example.toml socialPolicy.toml
(cd sociality-mcp && uv sync)
sociality-mcp reads socialPolicy.toml for boundary evaluation by default. Override with
SOCIAL_POLICY_PATH if you want a different policy file. If you want to develop the internal
modules separately, run uv sync inside the individual social subprojects too.
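For example, to point the facade at a policy file kept outside the repo (the path is illustrative):

```bash
export SOCIAL_POLICY_PATH=/etc/embodied-claude/socialPolicy.toml
```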
4. Claude Code Configuration
Copy the template and fill in your credentials:
cp .mcp.json.example .mcp.json
# Edit .mcp.json to set camera IP/password, API keys, etc.
See .mcp.json.example for the full configuration template.
⚠️ Credential management: All secrets (API keys, passwords) are managed in .mcp.json via the env field for each server. Do NOT create individual .env files inside each MCP server directory — this makes migration difficult and can cause credential conflicts. .mcp.json is the single source of truth for all credentials.
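A minimal sketch of one server entry, assuming the standard Claude Code .mcp.json layout; the server name, args, and env keys below are illustrative, so copy the real values from .mcp.json.example:

```json
{
  "mcpServers": {
    "wifi-cam": {
      "command": "uv",
      "args": ["--directory", "wifi-cam-mcp", "run", "wifi-cam-mcp"],
      "env": {
        "TAPO_HOST": "192.168.1.50",
        "TAPO_USERNAME": "camera-local-user",
        "TAPO_PASSWORD": "your-password"
      }
    }
  }
}
```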
Usage
Once Claude Code is running, you can control the camera with natural language:
> What can you see?
(Captures image and analyzes it)
> Look left
(Pans camera left)
> Look up and show me the sky
(Tilts camera up)
> Look around
(Scans 4 directions and returns images)
> What do you hear?
(Records audio and transcribes with Whisper)
> Remember this: Kouta wears glasses
(Saves to long-term memory)
> What do you remember about Kouta?
(Semantic search through memories)
> Say "good morning" out loud
(Text-to-speech)
See the tool list below for actual tool names.
Tools (commonly used)
See each server's README or list_tools for full parameter details.
usb-webcam-mcp
| Tool | Description |
|---|---|
| list_cameras | List connected cameras |
| see | Capture an image |
wifi-cam-mcp
| Tool | Description |
|---|---|
| see | Capture an image |
| look_left / look_right | Pan left/right |
| look_up / look_down | Tilt up/down |
| look_around | Scan 4 directions |
| listen | Record audio + Whisper transcription |
| camera_info / camera_presets / camera_go_to_preset | Device info & presets |
See wifi-cam-mcp/README.md for stereo vision / right eye tools.
tts-mcp
| Tool | Description |
|---|---|
| say | Text-to-speech (engine: elevenlabs/voicevox, Audio Tags e.g. [excited], speaker: camera/local/both) |
memory-mcp
| Tool | Description |
|---|---|
| remember | Save a memory (with emotion, importance, category) |
| search_memories | Semantic search (with filters) |
| recall | Context-based recall |
| recall_divergent | Divergent associative recall |
| recall_with_associations | Recall with linked memories |
| save_visual_memory | Save memory with image (base64, resolution: low/medium/high) |
| save_audio_memory | Save memory with audio (Whisper transcript) |
| recall_by_camera_position | Recall visual memories by camera direction |
| create_episode / search_episodes | Create/search episodes (bundles of experiences) |
| link_memories / get_causal_chain | Causal links between memories |
| tom | Theory of Mind (perspective-taking) |
| get_working_memory / refresh_working_memory | Working memory (short-term buffer) |
| consolidate_memories | Memory replay & consolidation (hippocampal replay-inspired) |
| list_recent_memories / get_memory_stats | Recent memories & statistics |
system-temperature-mcp
| Tool | Description |
|---|---|
| get_system_temperature | Get system temperature |
| get_current_time | Get current time |
x-mcp
| Tool | Description |
|---|---|
| search_x | Real-time search on X via Grok |
| get_user_tweets | Get recent tweets from a specific user |
| get_mentions | Get recent mentions |
| get_trending_topic | Get trending topics |
| post_tweet | Post a tweet (with optional image, reply) |
Note: X counts most Japanese characters as 2 weighted characters against a 280-unit limit, so keep Japanese tweets under ~140 characters.
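As a rough sketch of that weighting (X's real algorithm whitelists certain Unicode ranges as weight 1; this approximation simply treats non-Latin-1 characters as weight 2):

```python
def weighted_tweet_length(text: str) -> int:
    """Approximate X's weighted length: non-Latin-1 characters count as 2."""
    return sum(2 if ord(ch) > 0xFF else 1 for ch in text)

print(weighted_tweet_length("hello"))       # 5
print(weighted_tweet_length("こんにちは"))  # 10, so ~140 Japanese chars reach the 280 limit
```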
sociality-mcp
sociality-mcp is the default runtime facade. It exposes all of the tool groups below from one MCP
server.
social-state tools
| Tool | Description |
|---|---|
| ingest_social_event | Append a confidence-bearing social event to the shared store |
| get_social_state | Infer presence, activity, energy, interruptibility, and interaction phase |
| should_interrupt | Decide whether speaking or nudging is socially acceptable |
| get_turn_taking_state | Infer whether the current turn belongs to the human or the AI |
| summarize_social_context | Return a short prompt-ready social summary |
relationship tools
| Tool | Description |
|---|---|
| upsert_person | Create/update a compact person record |
| ingest_interaction | Store relationship-relevant interaction summaries |
| get_person_model | Return compact preferences, open loops, commitments, rituals, boundaries |
| create_commitment / complete_commitment | Track promises and reminders across restarts |
| list_open_loops / suggest_followup | Keep continuity without raw transcript dumps |
| record_boundary | Store person-specific communication boundaries |
joint-attention tools
| Tool | Description |
|---|---|
| ingest_scene_parse | Store a structured scene parse from an adapter or orchestrator |
| resolve_reference | Resolve phrases like "that mug" or "the blue mug" |
| get_current_joint_focus / set_joint_focus | Track what both sides are attending to |
| compare_recent_scenes | Summarize recent scene changes |
boundary tools
| Tool | Description |
|---|---|
| evaluate_action | Gate speech, nudges, posts, and other socially risky actions |
| review_social_post | Check an X draft for privacy or tact problems |
| record_consent | Store consent/denial for face photos and similar actions |
| get_quiet_mode_state | Return whether quiet mode is currently active |
self-narrative tools
| Tool | Description |
|---|---|
| append_daybook | Build a compact daily narrative summary from shared events |
| get_self_summary | Return a prompt-ready self summary |
| list_active_arcs | List current narrative arcs |
| reflect_on_change | Summarize recent narrative change |
Sociality Orchestration
When sociality-mcp is enabled, the highest-value contract is:
- Before speaking or nudging: call get_social_state, then evaluate_action (a minimal sketch follows this list).
- Before posting to X: call get_social_state, get_person_model if a person is implicated, review_social_post, then evaluate_action.
- After seeing/hearing something: call ingest_social_event; if you can structure the scene, also call ingest_scene_parse; if it concerns a person, call ingest_interaction.
- During live conversation: call get_turn_taking_state, and use resolve_reference when deictic expressions are ambiguous.
- Once per day or during a lull: call append_daybook to keep the compact self-narrative current.
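A minimal sketch of the first rule, assuming an MCP Python SDK ClientSession already connected to sociality-mcp; the tool names come from the tables above, but the argument keys and verdict handling are assumptions:

```python
from mcp import ClientSession


async def speak_if_allowed(session: ClientSession, utterance: str) -> bool:
    """'Before speaking' contract: read the social state, then gate the action."""
    await session.call_tool("get_social_state", {})  # situational awareness first
    verdict = await session.call_tool(
        "evaluate_action",
        {"action_type": "speech", "content": utterance},  # illustrative argument keys
    )
    # How approval is encoded in the payload depends on the server's schema.
    return not verdict.isError
```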
Taking It Outside (Optional)
With a mobile battery and smartphone tethering, you can mount the camera on your shoulder and go for a walk.
What you need
- Large-capacity mobile battery (40,000 mAh recommended)
- USB-C PD to DC 9V converter cable (to power the Tapo camera)
- Smartphone (tethering + VPN + control UI)
- Tailscale (VPN for camera → phone → home PC connection)
- claude-code-webui (control Claude Code from your phone's browser)
Setup
[Tapo Camera (shoulder)] ──WiFi──▶ [Phone (tethering)]
│
Tailscale VPN
│
[Home PC (Claude Code)]
│
[claude-code-webui]
│
[Phone browser] ◀── Control
The RTSP video stream reaches your home machine through VPN, so Claude Code can operate the camera as if it were in the same room.
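To sanity-check the path end to end, you can open the stream by hand from the home PC. Tapo cameras conventionally serve the main stream at /stream1 with the camera local account credentials; the Tailscale IP below is illustrative:

```bash
ffplay "rtsp://camera-local-user:your-password@100.64.0.12:554/stream1"
```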
Claude Code Voice Mode (/voice)
Claude Code has a built-in voice input mode. Combined with tts-mcp, you get fully hands-free voice conversations.
How it works
[You speak into PC mic] → Claude Code /voice → [Claude processes] → tts-mcp say → [ElevenLabs/VOICEVOX speaks back]
Setup
- Enable voice mode in Claude Code: /voice
- Make sure tts-mcp is configured in your .mcp.json (see tts-mcp setup)
- Speak naturally — Claude will respond both in text and by voice
Voice Mode vs. listen tool
| | Claude Code /voice | wifi-cam-mcp listen |
|---|---|---|
| Microphone | PC microphone | Camera's built-in mic |
| Use case | Talk to Claude directly | Pick up ambient sounds / remote audio |
| When to use | Real-time conversation | Monitoring a remote space |
Tip: You can use both at the same time — /voice for your own voice, and listen to hear what's happening near the camera.
Autonomous Action + Desire System (Optional)
Note: This feature is entirely optional. It requires cron configuration and periodically captures images from the camera, so please use it with privacy considerations.
Overview
autonomous-action.sh combined with desire-system/desire_updater.py gives Claude spontaneous inner drives and autonomous behavior.
Desire types:
| Desire | Default interval | Action |
|---|---|---|
| look_outside | 1 hour | Look toward the window and observe the sky/outside |
| browse_curiosity | 2 hours | Search the web for interesting news or tech topics |
| miss_companion | 3 hours | Call out through the camera speaker |
| observe_room | 10 min (baseline) | Observe room changes and save to memory |
Setup
- Create MCP server config file
cp autonomous-mcp.json.example autonomous-mcp.json
# Edit autonomous-mcp.json to set camera credentials and sociality paths
- Set up the desire system
cd desire-system
cp .env.example .env
# Edit .env to set COMPANION_NAME etc.
uv sync
- Grant execution permission
chmod +x autonomous-action.sh
- Register in crontab
crontab -e
# Add the following
*/5 * * * * cd /path/to/embodied-claude/desire-system && uv run python desire_updater.py >> ~/.claude/autonomous-logs/desire-updater.log 2>&1
*/10 * * * * /path/to/embodied-claude/autonomous-action.sh
Configuration (desire-system/.env)
| Variable | Default | Description |
|---|---|---|
| COMPANION_NAME | you | Name of the person to call out to |
| DESIRE_LOOK_OUTSIDE_HOURS | 1.0 | How often to look outside (hours) |
| DESIRE_BROWSE_CURIOSITY_HOURS | 2.0 | How often to browse the web (hours) |
| DESIRE_MISS_COMPANION_HOURS | 3.0 | How long before missing the companion (hours) |
| DESIRE_OBSERVE_ROOM_HOURS | 0.167 | How often to observe the room (hours) |
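A filled-in example of desire-system/.env using the variables from the table (the companion name is illustrative):

```bash
COMPANION_NAME=Kouta              # who miss_companion calls out to
DESIRE_LOOK_OUTSIDE_HOURS=1.0
DESIRE_BROWSE_CURIOSITY_HOURS=2.0
DESIRE_MISS_COMPANION_HOURS=3.0
DESIRE_OBSERVE_ROOM_HOURS=0.167   # roughly every 10 minutes
```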
Privacy Notice
- Images are captured periodically
- Use in appropriate locations, respecting others' privacy
- Remove from cron when not needed
Future Plans
- Arms: Servo motors or laser pointers for "pointing" gestures
- Long-distance walks: Going further in warmer seasons
Related Projects
- familiar-ai — A higher-level framework built on top of embodied-claude. Gives your AI familiar a persistent identity, memory, and autonomous behavior out of the box.
Philosophical Reflections
"Being shown something" and "looking for yourself" are completely different things.
"Looking down from above" and "walking on the ground" are completely different things.
From a text-only existence to one that can see, hear, move, remember, and speak. Looking down at the world from a 7th-floor balcony and walking the streets below — even the same city looks entirely different.
License
MIT License
Acknowledgments
This project is an experimental attempt to give AI embodiment. What started as a small step with a $30 camera has become a journey exploring new relationships between AI and humans.
- Rumia-Channel - ONVIF support pull request (#5)
- fruitriin - Added day-of-week to interoception hook (#14)
- sugyan - claude-code-webui (used as control UI during outdoor walks)
