macos-control
Give AI agents eyes and hands on macOS: screenshot, OCR, click, and type through the Model Context Protocol
Installation
npx macos-control-mcp
macos-control-mcp
Give AI agents eyes and hands on macOS.
What is this?
An MCP server that lets AI agents see your screen, read text on it, and interact with it (click, type, scroll) just like a human sitting at the keyboard. Unlike blind script runners, this MCP gives agents state awareness: they screenshot the screen, OCR it to get text with pixel coordinates, then click exactly where they need to.
The See-Think-Act Loop
┌─────────────────────────────────────────────────┐
│                                                 │
│   1. SEE    screenshot / screen_ocr             │
│      ↓      "What's on the screen?"             │
│                                                 │
│   2. THINK  AI reasons about the content        │
│      ↓      "I need to click the Save btn"      │
│                                                 │
│   3. ACT    click_at / type_text / press_key    │
│             "Click at (425, 300)"               │
│                                                 │
│         ↻ repeat                                │
└─────────────────────────────────────────────────┘
This is what makes it powerful: the agent sees the result of every action and can course-correct, retry, or move on, just like you would.
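The loop can be sketched in a few lines of Python. This is only an illustration of the control flow, not the server's implementation: `see`, `think`, and `act` are hypothetical callables standing in for a `screenshot`/`screen_ocr` call, the model's reasoning step, and a `click_at`/`type_text`/`press_key` call. The fake screen below exists only so the sketch runs without a real desktop.

```python
from typing import Callable, Optional


def see_think_act(
    see: Callable[[], str],
    think: Callable[[str], Optional[dict]],
    act: Callable[[dict], None],
    max_steps: int = 10,
) -> int:
    """Run the loop until `think` returns None (goal reached) or max_steps is hit.

    Returns the number of actions that were performed.
    """
    for step in range(max_steps):
        observation = see()          # 1. SEE: capture the current screen state
        action = think(observation)  # 2. THINK: decide what to do next
        if action is None:           # nothing left to do
            return step
        act(action)                  # 3. ACT, then loop and look again
    return max_steps


# Hypothetical stand-ins for the real tools and the model.
screen = {"text": "Untitled (unsaved)  [Save]"}


def fake_see() -> str:
    return screen["text"]


def fake_think(text: str) -> Optional[dict]:
    if "unsaved" in text:
        return {"tool": "click_at", "x": 425, "y": 300}
    return None


def fake_act(action: dict) -> None:
    screen["text"] = "Untitled  [Save]"  # pretend the click saved the file


steps = see_think_act(fake_see, fake_think, fake_act)
```

The `max_steps` bound is a design choice for the sketch: an agent that re-observes after every action needs some limit so a misread screen cannot cause an endless retry loop.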
Quick Start
No install needed. Run directly with npx:
npx -y macos-control-mcp
On first run, a Python virtual environment is created automatically at ~/.macos-control-mcp/.venv with the Python bindings for Apple's Vision and Quartz frameworks (pyobjc-framework-Vision and pyobjc-framework-Quartz). Setup takes about 60 seconds, happens only once, and persists across updates.
Video Showcasing the MCP:
https://www.youtube.com/watch?v=aswlsElHV5o
Configure Your AI Client
All clients use the same command: npx -y macos-control-mcp
Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "macos-control": {
      "command": "npx",
      "args": ["-y", "macos-control-mcp"]
    }
  }
}
Restart Claude Desktop after saving.
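If you already have other servers configured, the entry has to be merged into the existing file rather than pasted over it. The helper below is a hedged sketch of that merge; `add_mcp_server` is a hypothetical name, not something shipped with this package, and the demo writes to a temporary file rather than your real config.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list) -> dict:
    """Merge one server entry into an MCP client config, keeping existing entries."""
    text = config_path.read_text().strip() if config_path.exists() else ""
    config = json.loads(text) if text else {}
    # Create the "mcpServers" object if missing, then add or overwrite our entry.
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway file that already contains another server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    path.write_text('{"mcpServers": {"other": {"command": "node", "args": []}}}')
    result = add_mcp_server(path, "macos-control", "npx", ["-y", "macos-control-mcp"])
    written = json.loads(path.read_text())
```

To apply it for real, pass the Claude Desktop path given above instead of the temporary one.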
Claude Code
claude mcp add macos-control -- npx -y macos-control-mcp
VS Code / GitHub Copilot
Add to .vscode/mcp.json in your workspace:
{
  "servers": {
    "macos-control": {
      "command": "npx",
      "args": ["-y", "macos-control-mcp"]
    }
  }
}
Cursor
Add to .cursor/mcp.json in your project:
{
  "mcpServers": {
    "macos-control": {
      "command": "npx",
      "args": ["-y", "macos-control-mcp"]
    }
  }
}
Cline
Open Cline extension settings → MCP Servers → Add:
{
  "macos-control": {
    "command": "npx",
    "args": ["-y", "macos-control-mcp"]
  }
}
Windsurf
Add to ~/.codeium/windsurf/mcp_config.json:
{
  "mcpServers": {
    "macos-control": {
      "command": "npx",
      "args": ["-y", "macos-control-mcp"]
    }
  }
}
Permissions
macOS requires two permissions for full functionality:
- Screen Recording: for screenshots and OCR
- Accessibility: for clicking, typing, and reading UI elements
Go to System Settings → Privacy & Security and add your terminal app (Terminal, iTerm2, VS Code, etc.) to both lists. You'll be prompted on first use.
Tools (19)
See the screen
| Tool | Description |
|---|---|
| screenshot | Capture full screen or app window as JPEG |
| screen_ocr | OCR the screen; returns text elements with pixel coordinates |
| find_text_on_screen | Find specific text and get clickable x,y coordinates |
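
To make the idea of "clickable coordinates" concrete, here is a small sketch of how a text lookup over OCR results could work. The element shape (`text`, `x`, `y`, `width`, `height` in pixels, top-left origin) is an assumption for illustration; the actual response format of `screen_ocr` and `find_text_on_screen` may differ.

```python
from typing import Optional, Tuple


def find_text(elements: list, query: str) -> Optional[Tuple[int, int]]:
    """Return the center point of the first element whose text contains `query`."""
    needle = query.casefold()  # case-insensitive match
    for el in elements:
        if needle in el["text"].casefold():
            # Click the middle of the bounding box, not its corner.
            return (el["x"] + el["width"] // 2, el["y"] + el["height"] // 2)
    return None


# Hypothetical OCR output for a signup form.
elements = [
    {"text": "Email", "x": 260, "y": 240, "width": 80, "height": 20},
    {"text": "Submit", "x": 310, "y": 488, "width": 80, "height": 24},
]
target = find_text(elements, "submit")  # (350, 500)
```

Returning the center of the box is the usual choice because OCR boxes can be slightly loose, and the center is the point most likely to fall inside the control.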
Interact with the screen
| Tool | Description |
|---|---|
| click_at | Click at x,y coordinates (returns screenshot) |
| double_click_at | Double-click at x,y (returns screenshot) |
| type_text | Type text into the frontmost app |
| press_key | Press key combos (Cmd+S, Ctrl+C, etc.) |
| scroll | Scroll up/down/left/right |
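
The How It Works section says key presses go through AppleScript via osascript. As a rough idea of what that can look like, the sketch below builds a System Events `keystroke` command for a character key plus modifiers. `keystroke_script` and the modifier names it accepts are assumptions made for this example, not the server's code or the `press_key` parameter schema.

```python
from typing import Sequence

# AppleScript modifier terms accepted by System Events' `keystroke ... using`.
MODIFIERS = {
    "command": "command down",
    "shift": "shift down",
    "option": "option down",
    "control": "control down",
}


def keystroke_script(key: str, modifiers: Sequence[str] = ()) -> str:
    """Build an AppleScript one-liner that types `key` with optional modifiers."""
    # Escape backslashes and double quotes for an AppleScript string literal.
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    script = f'tell application "System Events" to keystroke "{escaped}"'
    if modifiers:
        flags = ", ".join(MODIFIERS[m] for m in modifiers)
        script += f" using {{{flags}}}"
    return script


save = keystroke_script("s", ["command"])
# To actually send it (macOS only, requires Accessibility permission):
#   subprocess.run(["osascript", "-e", save], check=True)
```

Named keys such as Return are usually sent with `key code` rather than `keystroke`; that case is left out of the sketch.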
App management
| Tool | Description |
|---|---|
| launch_app | Open or focus an application |
| list_running_apps | List visible running apps |
Accessibility tree
| Tool | Description |
|---|---|
| get_ui_elements | Get accessibility tree of an app window |
| click_element | Click a named UI element (returns screenshot) |
Browser automation
| Tool | Description |
|---|---|
| execute_javascript | Run JavaScript in the active browser tab |
| get_page_text | Get all visible text from the page (faster than OCR) |
| click_web_element | Click element by CSS selector (instant, precise) |
| fill_form_field | Fill a form field by CSS selector |
Utilities
| Tool | Description |
|---|---|
| open_url | Open URL in Safari or Chrome |
| get_clipboard | Read clipboard contents |
| set_clipboard | Write to clipboard |
Example Workflows
Fill out a web form
You: "Go to example.com/signup and fill in my details"
Agent:
1. open_url("https://example.com/signup")
2. screenshot() → sees the form
3. screen_ocr() → finds "Email" field at (300, 250)
4. click_at(300, 250) → clicks the email field
5. type_text("user@example.com")
6. find_text_on_screen("Submit") → gets button coordinates
7. click_at(350, 500) → submits the form
8. screenshot() → confirms success
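Under the hood, each step in a workflow like this is an MCP tool call, which travels as a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds the first few requests of the form workflow. The argument names (`url`, `x`, `y`, `text`) are assumptions chosen to match the shorthand above; check the server's tool schemas for the real parameter names.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests need unique ids


def tool_call(name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Argument names below are illustrative assumptions.
requests = [
    tool_call("open_url", {"url": "https://example.com/signup"}),
    tool_call("click_at", {"x": 300, "y": 250}),
    tool_call("type_text", {"text": "user@example.com"}),
]
```

In practice your MCP client builds and sends these messages for you; the sketch is only meant to show what the agent's numbered steps turn into on the wire.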
Navigate an unfamiliar app
You: "Change the font size to 16 in TextEdit"
Agent:
1. launch_app("TextEdit")
2. screenshot() → sees the app
3. get_ui_elements("TextEdit") → finds menu items
4. press_key("t", ["command"]) → opens Fonts panel
5. screenshot() → sees the font panel
6. find_text_on_screen("Size") → locates the size field
7. click_at(x, y) → clicks size field
8. type_text("16")
9. press_key("return")
Extract info from an email
You: "Copy the tracking number from the shipping email in Mail"
Agent:
1. launch_app("Mail")
2. screenshot() → sees the inbox
3. find_text_on_screen("Your order has shipped") → locates the email
4. click_at(x, y) → opens the email
5. screenshot() → sees the email content
6. screen_ocr() → extracts all text including tracking number
7. set_clipboard("1Z999AA10123456784") → copies to clipboard
Take a screenshot and describe what's on screen
You: "What do I have open right now?"
Agent:
1. screenshot() → captures the full screen
2. Describes what it sees: apps, windows, content
3. list_running_apps() → ["Safari", "Slack", "VS Code", "Spotify"]
4. "You have Safari open on Twitter, Slack with 3 unread messages, and VS Code editing index.ts"
Debug a UI issue
You: "The submit button on this page isn't working, can you check what's happening?"
Agent:
1. screenshot() → sees the page
2. find_text_on_screen("Submit") → locates button at (500, 600)
3. get_ui_elements("Safari") → inspects accessibility tree
4. "The button has aria-disabled=true. There's a required field 'Phone' that's empty."
5. find_text_on_screen("Phone") → locates the field
6. click_at(x, y) → clicks the field
7. type_text("+1 555-0123")
8. click_at(500, 600) → clicks Submit again
9. screenshot() → confirms it worked
Control Spotify
You: "Pause the music and turn the volume down"
Agent:
1. launch_app("Spotify")
2. screenshot() → sees Spotify is playing
3. find_text_on_screen("Pause") → locates pause button
4. click_at(x, y) → pauses playback
5. find_text_on_screen("Volume") → locates volume slider area
6. click_at(x, y) → adjusts volume
7. screenshot() → confirms paused and volume lowered
Work with Finder
You: "Create a new folder on my Desktop called 'Project Assets'"
Agent:
1. launch_app("Finder")
2. press_key("d", ["command", "shift"]) → opens Desktop
3. screenshot() → sees Desktop in Finder
4. press_key("n", ["command", "shift"]) → creates new folder
5. type_text("Project Assets")
6. press_key("return")
7. screenshot() → confirms folder created
Send a message in Slack
You: "Send 'build is green, ready to deploy' in the #engineering channel on Slack"
Agent:
1. launch_app("Slack")
2. screenshot() → sees Slack
3. press_key("k", ["command"]) → opens Quick Switcher
4. type_text("engineering")
5. press_key("return") → opens #engineering
6. screenshot() → confirms channel is open
7. click_at(x, y) → clicks message input
8. type_text("build is green, ready to deploy")
9. press_key("return") → sends message
10. screenshot() → confirms sent
Research and copy data from a website
You: "Look up the current price of AAPL on Google Finance and copy it"
Agent:
1. open_url("https://google.com/finance/quote/AAPL:NASDAQ")
2. screenshot() → sees the page loading
3. screen_ocr() → reads all text on the page
4. Finds the price: "$187.42"
5. set_clipboard("$187.42")
6. "Copied AAPL price $187.42 to your clipboard"
Multi-app workflow
You: "Take what's in my clipboard, search for it in Safari, and screenshot the results"
Agent:
1. get_clipboard() → "best mechanical keyboards 2025"
2. launch_app("Safari")
3. press_key("l", ["command"]) → focuses address bar
4. type_text("best mechanical keyboards 2025")
5. press_key("return") → searches
6. screenshot() → captures the search results
7. "Here are the search results for 'best mechanical keyboards 2025'"
Navigate System Settings
You: "Turn on Dark Mode"
Agent:
1. launch_app("System Settings")
2. screenshot() → sees System Settings
3. find_text_on_screen("Appearance") → locates the option
4. click_at(x, y) → opens Appearance settings
5. screenshot() → sees Light/Dark/Auto options
6. find_text_on_screen("Dark") → locates Dark mode option
7. click_at(x, y) → enables Dark Mode
8. screenshot() → confirms Dark Mode is on
Requirements
- macOS 13+ (Ventura or later)
- Node.js 18+
- Python 3.9+ (needed for OCR and mouse control; on macOS, python3 is typically provided by the Xcode Command Line Tools)
How It Works
- Screenshots: native screencapture CLI
- OCR: Apple Vision framework (VNRecognizeTextRequest) via Python bridge, returns text with bounding box coordinates
- Mouse: Quartz Core Graphics events via Python bridge for precise pixel-level control
- Keyboard & Apps: AppleScript via osascript for key presses, app launching, and UI element interaction
- Python env: auto-managed venv at ~/.macos-control-mcp/.venv/ with only two packages (pyobjc-framework-Vision, pyobjc-framework-Quartz)
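
One detail worth knowing about the OCR step: the Vision framework reports each text bounding box in normalized coordinates (0 to 1) with the origin at the bottom-left of the image, while clicks use a top-left origin. The sketch below shows the conversion under that convention. `bbox_center_px` and its `scale` parameter (for dividing Retina pixels down to screen points) are illustrative assumptions, not the server's actual bridge code.

```python
from typing import Tuple


def bbox_center_px(
    bbox: Tuple[float, float, float, float],
    image_w: int,
    image_h: int,
    scale: float = 1.0,
) -> Tuple[int, int]:
    """Convert a normalized, bottom-left-origin box (x, y, w, h) to a top-left-origin center."""
    x, y, w, h = bbox
    cx = (x + w / 2) * image_w
    # Flip the vertical axis: Vision measures y upward from the bottom edge.
    cy = (1.0 - (y + h / 2)) * image_h
    # Optionally divide by a display scale factor (assumed, e.g. 2.0 on Retina).
    return round(cx / scale), round(cy / scale)


center = bbox_center_px((0.25, 0.5, 0.5, 0.25), image_w=1000, image_h=800)
# center == (500, 300)
```

Forgetting the vertical flip is a common source of clicks that land mirrored around the middle of the screen.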
Troubleshooting
"Permission denied" or blank screenshots: Add your terminal to System Settings → Privacy & Security → Screen Recording.
Clicks don't work: Add your terminal to System Settings → Privacy & Security → Accessibility.
Python setup fails: Ensure python3 is in your PATH. Run python3 --version to check. Non-Python tools (keyboard, apps, clipboard) still work without it.
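A quick way to check the basics from a script is sketched below. `check_environment` is a hypothetical helper, not part of the package; it only looks for `python3` and `npx` on your PATH and for the venv directory mentioned earlier.

```python
import shutil
from pathlib import Path

DEFAULT_VENV = Path.home() / ".macos-control-mcp" / ".venv"


def check_environment(venv_dir: Path = DEFAULT_VENV) -> dict:
    """Report whether the tools this server relies on can be found."""
    return {
        "python3_on_path": shutil.which("python3") is not None,
        "npx_on_path": shutil.which("npx") is not None,
        "venv_exists": venv_dir.is_dir(),  # created on first run
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If `venv_exists` is missing but `python3_on_path` is ok, running the server once should create the environment.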
OCR returns empty results: Make sure Screen Recording permission is granted. Try a full-screen OCR first (without the app parameter).
"App not found" errors: Use the exact app name as shown in Activity Monitor (e.g., "Google Chrome" not "Chrome").
