Mlxstudio
MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)
Ask AI about Mlxstudio
Powered by Claude ยท Grounded in docs
I know everything about Mlxstudio. Ask me about installation, configuration, usage, or troubleshooting.
0/500
Reviews
Documentation
The native macOS desktop app for local AI on Apple Silicon
vMLX v2 โ native Swift + Metal, 50โ95 t/s on M-series.
Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.
The Python panel above remains available for legacy support.
Features โข Screenshots โข API Server โข Image Generation โข JANG Quantization โข Requirements โข Build โข ํ๊ตญ์ด
MLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on mlx-community -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. Built on vMLX Engine and Apple's MLX framework.
JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:
Quantization MMLU (200q) Size JANG_2L (2-bit) 74% 89 GB MLX 4-bit 26.5% 120 GB MLX 3-bit 24.5% 93 GB MLX 2-bit 25% 68 GB Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at jangq.ai. Models at JANGQ-AI.
Install
Option 1: Download the App (Recommended)
Download the latest DMG -- one file, ready to go.
- Download
vMLX-X.Y.Z-arm64.dmg - Open the DMG and drag to Applications
- Launch -- that's it
All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.
Option 2: CLI via pip (Engine Only)
The vMLX inference engine is published on PyPI as vmlx -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.
# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
Note: On macOS 14+,
pip install vmlxwithout a venv will fail with "externally-managed-environment". Useuv,pipx, or create a venv first.
Once running, your local OpenAI-compatible API server is live at http://localhost:8000. Point any OpenAI or Anthropic SDK client at it.
Quick Start
- Launch MLX Studio from Applications
- Pick a model -- browse HuggingFace models in the Server tab, or enter a repo name (e.g.,
mlx-community/Qwen3-8B-4bit) - Start the session -- the model downloads automatically and the server starts
- Chat -- switch to the Chat tab and start talking
That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.
Screenshots
![]() Chat Interface Streaming conversations with thinking mode, code highlighting, and markdown | ![]() Agentic Coding Full tool calling with file I/O, shell execution, and web search |
![]() Image Generation & Editing Flux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit | ![]() Anthropic API Compatible Drop-in /v1/messages endpoint for Anthropic SDK clients |
![]() Developer Tools Convert, inspect, and diagnose models | ![]() Model Conversion GGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision |
![]() HuggingFace Browser Search and download models directly in-app | ![]() Menu Bar Running models, GPU memory, and quick controls |
Features
Model Support (65+ Model Families)
Run any MLX model from HuggingFace -- thousands of models, zero configuration:
- Text LLMs -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, Mistral-Medium-3.5 (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, Laguna (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
- Vision LLMs (VL) -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat
- Multimodal Omni -- Nemotron-3-Nano-Omni (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across
/v1/chat/completions,/v1/messages,/v1/responses, and/api/chat - Mixture-of-Experts -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)
- Hybrid SSM Models -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)
- Image Generation -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
- Image Editing -- Qwen Image Edit (instruction-based editing, full precision)
- Audio -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- JANG Models -- Adaptive mixed-precision quantized models from JANGQ-AI, stay quantized in GPU memory via native
QuantizedLinear - GGUF Import -- Convert GGUF models to MLX format directly in-app
OpenAI-Compatible API Server
Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:
POST /v1/chat/completions-- Chat Completions API with streaming, tool calling, vision, structured outputPOST /v1/responses-- OpenAI Responses API (agentic format) with streamingPOST /v1/completions-- Text completionsPOST /v1/images/generations-- Image generation (Flux/Z-Image models, OpenAI format withusagefield)POST /v1/images/edits-- Image editing (Qwen Image Edit, instruction-based)POST /v1/embeddings-- Text embeddings with dimension control and batch processingPOST /v1/rerank-- Document rerankingPOST /v1/audio/speech-- Text-to-speech (Kokoro TTS)POST /v1/audio/transcriptions-- Speech-to-text (Whisper)GET /v1/models-- List loaded modelsGET /health-- Server health with VRAM usage, queue length, load times
Anthropic API Compatibility
Drop-in replacement for the Anthropic Claude API:
POST /v1/messages-- Anthropic Messages API format- Anthropic SDK tool calling format (auto-translated to internal format)
- Vision/multimodal support via Anthropic content blocks
- Use the Anthropic Python/TypeScript SDK -- just change the
base_urlto your local server - Copy-paste code snippets in the API tab for curl, Python, and JavaScript
Tool Calling & Agentic Workflows (14 Parsers)
Auto-detected tool call parsers for every major model family:
- Qwen (qwen3, qwen2.5) --
<tool_call>XML format - Llama 3 --
<function=name>format - Mistral --
[TOOL_CALLS]format - Hermes --
<tool_call>JSON format - DeepSeek -- function call blocks
- GLM-4.7 -- GLM tool format
- MiniMax -- MiniMax function calling
- Nemotron -- NVIDIA Nemotron tool format
- Granite -- IBM Granite format
- Functionary -- Functionary v3 format
- XLAM -- Salesforce xLAM format
- Kimi -- Moonshot Kimi format
- Step-3.5 -- StepFun format
- Auto-detection from
model_typein config.json with regex name fallback
26+ Built-in Tools:
- File I/O -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
- Search -- ripgrep file search with regex and glob, glob file finder, unified diff
- Execution -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
- Web -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
- Developer -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
- Interactive --
ask_usertool for human-in-the-loop interrupts - Per-category toggles: enable/disable file, search, shell, web tools independently
- Auto-continue agent loops (up to 10 tool iterations per request)
- MCP (Model Context Protocol) -- connect external tool servers, merge tool definitions, execute MCP tools via API
Reasoning Model Support (4 Parsers)
Collapsible thinking blocks with dedicated parsing for reasoning models:
- Qwen3 / Qwen3.5 --
<think>...</think>blocks - DeepSeek-R1 -- DeepSeek reasoning format
- OpenAI GPT-OSS / GLM-4.7 -- GPT-OSS thinking format
- Phi-4-reasoning -- reasoning content extraction
- Enable/disable thinking per request
- Reasoning effort control (low/medium/high)
- Streaming reasoning content with proper tokenization
Vision & Multimodal (VLM)
Full multimodal input support for vision-language models:
- Images -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
- Video -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
- Audio -- Base64 or URL audio input (Qwen3-Audio)
- Image detail levels: auto, low, high
- Dedicated MLLM cache for image/video embeddings (separate from KV cache)
- Send images directly in chat to any VL model
Continuous Batching & Concurrency
Production-grade multi-user serving:
- Continuous batching -- handle 32+ concurrent requests with dynamic slot allocation
- Prefill batching -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
- Completion batching -- batch token generation across sequences
- Stream interval control -- configure streaming frequency
- Request pooling -- efficiently share GPU memory across concurrent sequences
- Rate limiting -- optional per-client request limits
- API key authentication -- optional
--api-keyflag for secured access
5-Layer Cache Stack
Multi-tier caching for maximum throughput and memory efficiency:
- L1: Memory-Aware Prefix Cache -- token-level semantic caching with LRU eviction, configurable memory allocation
- L1 alt: Paged KV Cache -- block-aware cache with reduced fragmentation for long contexts
- L2: Disk Cache -- persistent spillover to disk for large context windows
- L2 alt: Block Disk Store -- block-level disk persistence
- KV Quantization -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)
- Hybrid SSM Cache -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
- Automatic cache type selection based on model architecture
- Cache warming API (
POST /v1/cache/warm) for pre-loading common prompts - Cache stats API (
GET /v1/cache/stats) for monitoring hit rates and memory usage
Sampling & Generation Control
Full control over text generation:
- Temperature (0.0 - 2.0) -- creativity control
- Top-P (0.0 - 1.0) -- nucleus sampling
- Top-K (integer) -- top-K token filtering
- Min-P (0.0 - 1.0) -- minimum probability threshold
- Repetition Penalty -- penalize repeated tokens
- Stop Sequences -- custom stopping strings
- Max Tokens -- output length limit (up to 131072)
- Request Timeout -- per-request timeout override
- Structured Output --
response_formatwithjson_objectorjson_schemamodes for guaranteed valid JSON - Streaming with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
- Usage stats in streaming responses (
stream_options.include_usage)
Model Conversion & Quantization
Convert models directly in-app via the Tools tab:
- 16-bit to MLX -- convert HuggingFace safetensors to MLX format
- 16-bit to quantized -- quantize to 2-bit, 4-bit, or 8-bit MLX
- GGUF to MLX -- import GGUF models into MLX safetensors format
- MLX to JANG -- adaptive mixed-precision quantization (different bits per layer type)
- Model Inspector -- view config.json, architecture, layer structure
- Model Doctor -- diagnostic checks (load test, token count, memory estimation)
- Progress tracking with real-time status
Image Generation
Generate images locally with Flux and Z-Image models:
- Flux Schnell -- 4-step fast generation
- Flux Dev -- 20-step high-quality generation
- Z-Image Turbo -- fast turbo generation (4-bit and 8-bit)
- Flux Klein -- lightweight 4B parameter model
- Flux Kontext -- subject-consistent editing
- Flux Krea -- aesthetic fine-tuned generation
- Configurable steps, guidance scale, height, width, seed, sampler
- Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
- Quantized model support (2-bit to 8-bit)
- Image gallery with generation history, save, and settings persistence
- OpenAI-compatible
/v1/images/generationsendpoint withusagefield
Chat Interface
Full-featured conversation UI:
- Persistent history -- SQLite (WAL mode) with full message, metrics, and tool call history
- Markdown rendering -- GitHub-flavored markdown with syntax highlighting
- Reasoning display -- collapsible thinking sections for reasoning models
- Tool call display -- inline tool execution with status and results
- Streaming metrics -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
- System prompts -- per-chat custom system message
- Chat settings -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
- Chat folders -- hierarchical organization
- Message search -- full-text search across chat history
- Export/Import -- ShareGPT format
- Voice chat -- STT + TTS integration
Model Management
- HuggingFace browser -- search, filter by text/image, and download models directly in-app
- Download queue -- multiple concurrent downloads with real-time progress bars and cancel support
- Model size display -- file sizes from safetensors metadata before downloading
- Local model discovery -- auto-scan
~/.mlxstudio/models,~/.cache/huggingface/hub,~/.exo/models, and custom directories - Deduplication -- strict format detection prevents false positive model matches
- Zero-config detection -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
- 65+ model families in the auto-detection registry with two-tier detection (config.json
model_typeprimary, name regex fallback)
Desktop Experience
- 5 app modes -- Chat, Server, Image, Tools, API
- Menu bar tray -- live server status, GPU memory, running models, quick controls
- Multi-session -- run multiple models simultaneously on different ports
- Dock icon -- restore on click, close-to-tray support
- Dark and light themes -- system-respecting
- Keyboard shortcuts -- common actions
- Toast notifications -- user feedback
- Update banner -- new version detection
Advanced Quantization
MLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as JANG adaptive mixed-precision -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.
- Convert in-app via the Tools tab, or via CLI:
vmlx convert model --jang-profile JANG_3M - Pre-quantized models available at JANGQ-AI on HuggingFace
- Stays quantized in GPU memory -- native MLX
QuantizedLinear+quantized_matmul - Compatible with all caching layers (prefix, paged, disk, KV quant)
See the vMLX source repo for profiles and conversion details.
Smelt Mode (Partial Expert Loading)
For MoE models that don't fit in RAM, Smelt loads only a subset of experts per layer from SSD and keeps the backbone resident. Response quality stays coherent while RAM usage drops; throughput scales inversely with expert % loaded because expert swaps hit SSD on the hot path.
Benchmarks on Nemotron-Cascade-2-30B-A3B-JANG_4M (23 MoE layers ร 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):
--smelt-experts | Active RAM | Decode tok/s | RAM saving | Coherent |
|---|---|---|---|---|
| off (baseline) | 17,408 MB | 89.9 | โ | โ |
50 | 9,529 MB | 66.5 | โ45% | โ |
25 | 5,590 MB | * | โ68% | โ |
* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.
All three configurations produced coherent, non-looping output. No quality degradation observed.
Credit: Smelt mode is inspired by Anemll's flash-moe โ a pure C / ObjectiveโC / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with
pread()on demand. vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight โ thanks to the flash-moe team for validating the approach.
Smelt is mutually exclusive with VLM mode. MLX Studio / vMLX v1.3.33+ automatically disables --is-mllm when smelt is active (with a warning) because the vision tower is not wired through the partial-expert loader โ image input on a smelt-loaded VLM would produce garbage logits. Use a text-only model when running smelt, or disable smelt when running a VLM.
Requires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).
System Requirements
| Requirement | Minimum |
|---|---|
| macOS | 14.0 Sonoma or later |
| Chip | Apple Silicon (M1 / M2 / M3 / M4) |
| RAM | 8 GB (16 GB+ recommended for larger models) |
| Disk | ~500 MB for app; models range from 1-50 GB each |
Build from Source
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx
# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir # .app bundle
npx electron-builder --mac dmg # DMG installer
Links
| Resource | Link |
|---|---|
| Source Code | github.com/jjang-ai/vmlx |
| PyPI | pypi.org/project/vmlx |
| MLX Models | huggingface.co/mlx-community |
| JANG Models | huggingface.co/JANGQ-AI |
| Website | vmlx.net |
License
Apache License 2.0
Built by Jinho Jang โข eric@jangq.ai โข JANGQ AI โข Support on Ko-fi
ํ๊ตญ์ด (Korean)
MLX Studio โ Apple Silicon์ ์ํ ๋ค์ดํฐ๋ธ macOS AI ์ฑ
Mac์์ LLM, VLM, ์ด๋ฏธ์ง ์์ฑ ๋ฐ ํธ์ง ๋ชจ๋ธ์ ์์ ํ ๋ก์ปฌ๋ก ์คํํ์ธ์.
JANG 2๋นํธ๊ฐ MLX 4/3/2๋นํธ๋ณด๋ค ๋์ ์ฑ๋ฅ โ ์ ์ํ ํผํฉ ์ ๋ฐ๋ ์์ํ(JANG_2S, JANG_2.6)๊ฐ MiniMax M2.5, Qwen3 ๋ฑ์์ ํ์ค MLX ์์ํ๋ฅผ ๋ฅ๊ฐํฉ๋๋ค. jangq.ai์์ ๋ฒค์น๋งํฌ ํ์ธ. JANGQ-AI์์ ์ฌ์ ์์ํ ๋ชจ๋ธ ๋ค์ด๋ก๋.
์ค์น: ์ต์ DMG ๋ค์ด๋ก๋ โ ๋๋๊ทธ ์ค ๋๋กญ์ผ๋ก ์ค์น.
์ฃผ์ ๊ธฐ๋ฅ
| ๊ธฐ๋ฅ | ์ค๋ช |
|---|---|
| ์ฑํ | ๋ํ ์ธํฐํ์ด์ค, ๋๊ตฌ ํธ์ถ, ์์ด์ ํธ ์ฝ๋ฉ |
| ์ด๋ฏธ์ง ์์ฑ | Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein |
| ์ด๋ฏธ์ง ํธ์ง | Qwen Image Edit (ํ ์คํธ ์ง์ ๊ธฐ๋ฐ ํธ์ง) |
| 5๋จ๊ณ ์บ์ฑ | ํ๋ฆฌํฝ์ค, ํ์ด์ง๋, KV ์์ํ, ๋์คํฌ ์บ์ |
| API ์๋ฒ | OpenAI + Anthropic ํธํ API |
| 30๊ฐ ๋๊ตฌ | ํ์ผ, ์น ๊ฒ์, Git, ํฐ๋ฏธ๋ ๋ด์ฅ ๋๊ตฌ |
๊ฐ๋ฐ์: ์ฅ์งํธ (eric@jangq.ai)
JANGQ AI โข
Ko-fi๋ก ํ์ํ๊ธฐ








