Kokoro TTS Kotlin
A pure-JVM text-to-speech server powered by the Kokoro-82M neural TTS model. Runs ONNX inference natively on the JVM — no Python, no external services. Serves a REST API via Ktor and an MCP endpoint for AI assistant integration.
Supports single-voice synthesis, multi-voice dialogue with natural turn gaps, voice blending, inline phoneme annotations for foreign words, and WAV/MP3 output at 24 kHz.
The entire process of building this project — from model research and G2P engineering to clean architecture, deployment, and performance tuning — is described in detail in How to Build Self-Hosted TTS That Actually Sounds Good.
Installation
Prerequisites
- JDK 25+ — download from Oracle or install via SDKMAN!: sdk install java 25-open
- curl — required by the data download script (pre-installed on macOS/Linux)
- AWS credentials (optional) — only needed for S3 storage mode; local mode requires no AWS setup
Clone and Setup
git clone https://github.com/alexsobolev/kokoro-tts-kotlin.git
cd kokoro-tts-kotlin
Download Model Files
The TTS pipeline requires model weights, voice embeddings, and pronunciation dictionaries (~400 MB total). The script skips files that already exist.
./scripts/download-data.sh
This downloads into the data/ directory:
| File | Size | Source | Description |
|---|---|---|---|
| kokoro-v1.0.int8.onnx | 92.3 MB | kokoro-onnx | Quantized ONNX TTS model |
| voices-v1.0.bin | 28.2 MB | kokoro-onnx | Voice style embeddings |
| config.json | 2.3 KB | Kokoro-82M | Tokenizer vocabulary |
| us_gold.json | 3.0 MB | misaki | US English gold pronunciation dict |
| us_silver.json | 3.0 MB | misaki | US English silver pronunciation dict |
| gb_gold.json | 2.8 MB | misaki | GB English gold pronunciation dict |
| gb_silver.json | 3.6 MB | misaki | GB English silver pronunciation dict |
| en-pos-perceptron.bin | 3.9 MB | Apache OpenNLP | OpenNLP POS tagger model |
| lexicon_fixes.json | 0.1 KB | Local | Custom pronunciation overrides |
Configure Environment
By default the server uses local storage — audio files are written to an output/ directory and served via HTTP. No AWS credentials needed.
For S3 storage, set:
export STORAGE_MODE=s3
export AWS_REGION=eu-central-1
export S3_BUCKET=my-tts-bucket
See Configuration for all available settings.
Build and Verify
./gradlew build # Compile, lint, static analysis, and tests
Quick Start
Download model and lexicon files (required once; see Download Model Files):
./scripts/download-data.sh
Then run the server:
./gradlew :app:run
The server starts on port 8080 with local file storage (no AWS required). Audio files are saved to output/ and served at http://localhost:8080/audio/.... Swagger UI at /swagger, OpenAPI spec at /openapi.
API
Synthesize Speech
POST /v1/tts
Single voice:
{
"turns": [
{ "voice": "af_heart", "text": "Hello, world!" }
],
"speed": 1.0,
"format": "wav"
}
Multi-voice dialogue (turns concatenated with randomized 250-500ms silence gaps):
{
"turns": [
{ "voice": "af_heart", "text": "How are you today?" },
{ "voice": "am_adam", "text": "I am doing great, thanks for asking!" }
],
"speed": 1.2,
"format": "mp3"
}
Voice blending (weighted average of style embeddings, weights must sum to 1.0):
{
"turns": [
{ "voice": "af_heart:0.6+af_bella:0.4", "text": "A blended voice." }
]
}
Inline phoneme annotations for foreign words and proper nouns:
{
"turns": [
{ "voice": "af_heart", "text": "We visited (Machu Picchu)[mˈɑːtʃuː pˈiːtʃuː] in (Peru)[pəɹˈuː]." }
]
}
Response (local mode):
{
"url": "http://localhost:8080/audio/af_heart/uuid.wav",
"key": "af_heart/uuid.wav",
"expiresInSeconds": 0,
"sizeBytes": 48044,
"format": "wav",
"voice": "af_heart"
}
Defaults: speed = 1.0, format = "wav", voice = "af_heart".
Limits: speed in [0.5, 2.0], text per turn <= 5,000 characters, at least one turn.
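The request limits above can be checked client-side before calling the API. This is a minimal sketch — the class, record, and method names are illustrative, not part of the actual service; only the limits themselves come from this section:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: client-side validation of the /v1/tts limits described above.
// TtsRequestValidator and Turn are hypothetical names, not the real API.
public class TtsRequestValidator {
    public record Turn(String voice, String text) {}

    public static List<String> validate(List<Turn> turns, double speed, String format) {
        List<String> errors = new ArrayList<>();
        if (turns.isEmpty()) errors.add("at least one turn is required");
        if (speed < 0.5 || speed > 2.0) errors.add("speed must be in [0.5, 2.0]");
        for (int i = 0; i < turns.size(); i++) {
            if (turns.get(i).text().length() > 5_000) {
                errors.add("turn " + i + " exceeds 5,000 characters");
            }
        }
        if (!Set.of("wav", "mp3").contains(format)) errors.add("format must be wav or mp3");
        return errors;
    }
}
```

A request that passes this check can still fail server-side (e.g., 404 for an unknown voice), so treat it as a fast pre-flight only.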
Other Endpoints
| Endpoint | Description |
|---|---|
| GET /health | Health check (returns OK) |
| GET /v1/voices | List available voices with IDs and languages |
| /mcp | MCP endpoint for AI assistant integration (SSE transport) |
| /swagger | Interactive API docs |
Errors
| Status | Condition |
|---|---|
| 400 | Text too long, speed out of range, empty dialogue, malformed JSON |
| 404 | Voice not found |
| 500 | Inference or storage failure |
MCP Integration
Both the Ktor server and Lambda expose MCP (Model Context Protocol) endpoints, enabling AI assistants like Claude Desktop to synthesize speech as a tool call.
Tools:
- list_voices — returns all voice IDs and languages
- synthesize_speech — single-voice synthesis with text, voice, speed, format parameters
- synthesize_dialogue — multi-turn dialogue with different voices per turn
Tool descriptions instruct LLMs to wrap foreign proper nouns in (word)[IPA] annotations for correct pronunciation.
Testing with MCP Inspector: connect to http://localhost:8080/mcp (SSE transport).
Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"kokoro-tts": {
"command": "npx",
"args": ["mcp-remote", "http://localhost:8080/mcp"]
}
}
}
Architecture
Clean architecture across five Gradle modules:

- domain — Pure value types (VoiceId, SpeechRate, AudioFormat, SynthesisException), zero dependencies
- core — Port interfaces (PhonemeGenerator, InferenceEngine, AudioEncoder, VoiceRepository, AudioStorage), DTOs, use cases, TTS service orchestration
- infra — Adapters: ONNX inference, POS-aware G2P (OpenNLP + misaki dictionaries), WAV/MP3 encoding, local/S3 storage, MCP server factory
- app — Ktor HTTP layer with Koin DI composition
- lambda — AWS Lambda handler with singleton cold-start initialization
TTS Pipeline
Text --> EnglishPhonemeGenerator (POS-aware G2P) --> KokoroTokenizer (IPA -> tokens) --> OnnxKokoroEngine (ONNX @ 24 kHz) --> SentencePostProcessor (volume envelopes) --> LocalAudioEncoder (WAV/MP3) --> AudioStorage (local disk or S3)
G2P (Grapheme-to-Phoneme)
EnglishPhonemeGenerator converts English text to IPA phonemes using a POS-aware hybrid approach matching misaki's logic. Four misaki JSON dictionaries are merged at startup (US gold > US silver > GB gold > GB silver) into a PosAwareLexicon that preserves per-POS pronunciation variants (e.g., "live" as adjective lˈIv vs verb lˈɪv, "record" as noun ɹˈɛkəɹd vs verb ɹəkˈɔɹd). A lexicon_fixes.json corrections file is deep-merged on top with the highest priority, fixing 3 upstream misaki VBP bugs (read, reread, wound had present-tense VBP mapped to past-tense pronunciations).
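The priority merge can be sketched as applying the dictionaries from lowest to highest priority so later entries overwrite earlier ones. This is a simplification under assumed names — the real PosAwareLexicon deep-merges per-POS variant maps with more structure:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: priority merge of pronunciation dictionaries. Dictionaries are
// applied lowest-priority first, so a higher-priority dict's per-POS entry
// overwrites a lower-priority one while unrelated POS variants survive.
// LexiconMerge is a hypothetical name, not the project's class.
public class LexiconMerge {
    @SafeVarargs
    public static Map<String, Map<String, String>> merge(Map<String, Map<String, String>>... dictsLowToHigh) {
        Map<String, Map<String, String>> merged = new HashMap<>();
        for (Map<String, Map<String, String>> dict : dictsLowToHigh) {
            for (var entry : dict.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new HashMap<>())
                      .putAll(entry.getValue()); // higher-priority variants win per POS key
            }
        }
        return merged;
    }
}
```

With the priorities from this section, the call order low-to-high would be: gb_silver, gb_gold, us_silver, us_gold, then lexicon_fixes last.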
Two-pass pipeline:
- First pass — POS-tag all tokens with Apache OpenNLP (perceptron model, Penn Treebank tags, original case preserved for tagger accuracy), resolve phonemes via POS-aware dictionary lookup with morphological stemming and letter-rule fallback
- Second pass — reverse-scan phonemes to compute futureVowel context, apply function word overrides ("the" → ði/ðə, "to" → tʊ/tə/tu), re-lookup sentence-final words with stressed None-key variants
Word resolution (in priority order):
- Inline phoneme annotations — (word)[IPA] syntax bypasses the entire G2P pipeline
- Contraction expansion — e.g., "I'm" is expanded to "I am" before phonemization
- Context-sensitive function words — "the" uses ði before vowels / ðə before consonants; "to" uses tʊ before vowels / tə before consonants / tu at sentence end
- Abbreviations — words with 2+ uppercase letters spelled out letter-by-letter
- Number expansion — integers, decimals, leading-zero sequences expanded to words
- POS-aware dictionary lookup — selects the correct variant based on POS tag with parent-tag normalization (VBD → VERB, NN → NOUN, JJ → ADJ, RB → ADV) and sentence-final stressed forms via the None key
- Morphological stemming — plurals, past tense, progressive, adverbial, agent, privative suffixes with US English T-flapping (stem-final 't' → 'ɾ' before vowels in -ed/-ing)
- Compound word splitting — tries all split positions (min 3 chars), demotes second-part stress
- Letter-to-phoneme fallback — ~100 English grapheme patterns matched greedily with context-sensitive vowel rules
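The priority order above amounts to a chain of resolvers, each returning phonemes or falling through to the next stage. A minimal sketch — the stage names, the regex, and the tiny dictionary are illustrative stand-ins, not the project's implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: priority-ordered word resolution as a fall-through chain.
// Only three of the nine stages are stubbed; WordResolver is hypothetical.
public class WordResolver {
    interface Stage { String resolve(String word); }

    static final Pattern INLINE = Pattern.compile("\\((.+)\\)\\[(.+)]");

    static final Stage inlineAnnotation = w -> {
        Matcher m = INLINE.matcher(w);
        return m.matches() ? m.group(2) : null; // annotation bypasses G2P entirely
    };
    static final Stage dictionary = w ->
        Map.of("hello", "həlˈoʊ").get(w.toLowerCase()); // stand-in lexicon
    static final Stage letterFallback = w -> "?" + w;   // stand-in for letter rules

    public static String phonemize(String word) {
        for (Stage s : List.of(inlineAnnotation, dictionary, letterFallback)) {
            String p = s.resolve(word);
            if (p != null) return p; // first stage that answers wins
        }
        throw new IllegalStateException("fallback stage always answers");
    }
}
```

The chain shape makes the priority explicit: an inline (word)[IPA] annotation always beats the dictionary, which always beats the letter-rule fallback.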
Intonation Post-Processing
The Kokoro model doesn't strongly differentiate intonation by punctuation. TtsService and SentencePostProcessor compensate:
- Questions (?) — 0.92x speed; rising volume ramp (1.0 -> 1.15x, quadratic) on the last 600 ms. Multi-clause questions split at the last clause boundary
- Exclamations (!) — gain boost (1.20x -> 1.0, linear fade) on the first 400 ms
- Statements — unmodified
RMS-windowed speech boundary detection ensures volume effects target voiced content, not model-generated silence.
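The question ramp can be sketched as a quadratic gain curve over the final 600 ms of samples. The constants (1.0 -> 1.15x, 600 ms, 24 kHz) are from this section; the class name and the assumption of mono float samples are illustrative:

```java
// Sketch: quadratic rising gain (1.0 -> 1.15) over the last 600 ms, as
// described for question intonation. Assumes 24 kHz mono float samples.
// QuestionRamp is a hypothetical name, not the project's class.
public class QuestionRamp {
    public static float[] apply(float[] samples, int sampleRate) {
        int rampLen = Math.min(samples.length, (int) (0.6 * sampleRate)); // 600 ms window
        int start = samples.length - rampLen;
        float[] out = samples.clone();
        for (int i = 0; i < rampLen; i++) {
            float t = rampLen > 1 ? (float) i / (rampLen - 1) : 1f; // 0..1 position in ramp
            out[start + i] *= 1.0f + 0.15f * t * t;                 // quadratic rise to 1.15x
        }
        return out;
    }
}
```

In the real pipeline the ramp would start at the RMS-detected speech boundary rather than a fixed offset from the end.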
Voice Blending
Blended voices (e.g., af_heart:0.6+af_bella:0.4) are created by weighted averaging of 256-dimensional style embeddings. After blending, the result vector is L2-renormalized to the weighted average of input norms — without this, blended vectors have smaller magnitude and produce degraded audio.
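The blend-and-renormalize step can be sketched as follows. The spec syntax and the renormalization rule are from this section; the class name and the flat float-array representation of the 256-dimensional embeddings are assumptions:

```java
import java.util.Map;

// Sketch: parse a spec like "af_heart:0.6+af_bella:0.4", take the weighted
// average of the style vectors, then rescale the result so its L2 norm
// equals the weighted average of the input norms, as described above.
// VoiceBlender is a hypothetical name, not the project's class.
public class VoiceBlender {
    public static float[] blend(String spec, Map<String, float[]> voices) {
        String[] parts = spec.split("\\+");
        int dim = voices.get(parts[0].split(":")[0]).length;
        float[] avg = new float[dim];
        float targetNorm = 0f;
        for (String part : parts) {
            String[] kv = part.split(":");
            float[] vec = voices.get(kv[0]);
            float w = Float.parseFloat(kv[1]);
            for (int i = 0; i < dim; i++) avg[i] += w * vec[i];
            targetNorm += w * norm(vec); // weighted average of input norms
        }
        float n = norm(avg);
        if (n > 0f) for (int i = 0; i < dim; i++) avg[i] *= targetNorm / n;
        return avg;
    }

    public static float norm(float[] v) {
        double sum = 0;
        for (float x : v) sum += (double) x * x;
        return (float) Math.sqrt(sum);
    }
}
```

The rescale matters because a plain weighted average of non-parallel vectors always has a smaller norm than its inputs, which is exactly the degraded-audio failure mode described above.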
ONNX Inference
- Lazy session loading on first call, not at startup
- Phoneme sequences truncated to 510 tokens (model's 512 context window minus BOS/EOS)
- 10ms fade-in/fade-out on every segment to eliminate click artifacts at boundaries
- All available CPU threads via setIntraOpNumThreads()
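The boundary fades can be sketched as a linear 10 ms ramp at each edge of a segment. The 10 ms figure is from this section; the class name and mono float-sample assumption are illustrative:

```java
// Sketch: linear 10 ms fade-in/fade-out at segment edges to suppress click
// artifacts at boundaries. Assumes 24 kHz mono float samples; SegmentFade
// is a hypothetical name, not the project's class.
public class SegmentFade {
    public static float[] fadeEdges(float[] samples, int sampleRate, int fadeMs) {
        int n = Math.min(samples.length / 2, sampleRate * fadeMs / 1000); // fade length in samples
        float[] out = samples.clone();
        for (int i = 0; i < n; i++) {
            float g = (i + 1) / (float) n;   // linear gain 0 -> 1
            out[i] *= g;                     // fade-in at segment start
            out[out.length - 1 - i] *= g;    // fade-out at segment end
        }
        return out;
    }
}
```

At 24 kHz a 10 ms fade is only 240 samples, short enough to be inaudible while still removing the discontinuity where two segments meet.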
Deployment
Docker (Ktor server)
docker build -t kokoro-tts .
docker run -p 8080:8080 kokoro-tts # local storage (default)
docker run -p 8080:8080 \
-e STORAGE_MODE=s3 \
-e AWS_REGION=eu-central-1 \
-e S3_BUCKET=my-tts-bucket \
kokoro-tts # S3 storage
Non-root user, 3 GB heap, ExitOnOutOfMemoryError. Multi-stage build with Gradle dependency caching.
Lambda
docker build -f Dockerfile.lambda -t kokoro-tts-lambda .
Custom JDK 25 runtime (AWS provides no managed runtime that recent), 8 GB heap to accommodate the ONNX model. Koin DI initializes once per container cold start. The handler auto-detects and decodes base64 request bodies from Lambda Function URLs.
Configuration
app/src/main/resources/application.yaml:
tts:
tokenizer:
configPath: "data/config.json"
voices:
path: "data/voices-v1.0.bin"
phonemizer:
goldDictPath: "data/us_gold.json"
silverDictPath: "data/us_silver.json"
gbGoldDictPath: "data/gb_gold.json"
gbSilverDictPath: "data/gb_silver.json"
fixesDictPath: "data/lexicon_fixes.json"
pos:
modelPath: "data/en-pos-perceptron.bin"
model:
onnxPath: "data/kokoro-v1.0.int8.onnx"
aws:
region: "$AWS_REGION:"
s3Bucket: "$S3_BUCKET:"
storage:
mode: "$STORAGE_MODE:local"
prefix: "$STORAGE_PREFIX:tts-audio"
localOutputDir: "$LOCAL_OUTPUT_DIR:output"
baseUrl: "$BASE_URL:http://localhost:8080"
All settings support environment variable overrides using Ktor's $ENV_VAR:default syntax. The Lambda handler reads the same settings from environment variables directly.
| Variable | Default | Description |
|---|---|---|
| STORAGE_MODE | local | Storage backend: local (disk + HTTP) or s3 |
| LOCAL_OUTPUT_DIR | output | Directory for local audio files |
| BASE_URL | http://localhost:8080 | Public base URL for local audio download links |
| AWS_REGION | — | AWS region (required for s3 mode) |
| S3_BUCKET | — | S3 bucket (required for s3 mode) |
| STORAGE_PREFIX | tts-audio | S3 object key prefix |
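The $ENV_VAR:default resolution used throughout application.yaml can be sketched like this. The placeholder syntax is from this section; the class name is hypothetical, and a plain map stands in for System.getenv() so the behavior is easy to see:

```java
import java.util.Map;

// Sketch: resolving a "$ENV_VAR:default" placeholder — the environment
// value wins when set, otherwise the default after the colon applies.
// PlaceholderResolver is a hypothetical name; `env` stands in for
// System.getenv().
public class PlaceholderResolver {
    public static String resolve(String value, Map<String, String> env) {
        if (!value.startsWith("$")) return value; // literal value, no expansion
        String body = value.substring(1);
        int idx = body.indexOf(':');
        String name = idx >= 0 ? body.substring(0, idx) : body;
        String def = idx >= 0 ? body.substring(idx + 1) : "";
        return env.getOrDefault(name, def);
    }
}
```

So with no environment set, "$STORAGE_MODE:local" yields "local", and exporting STORAGE_MODE=s3 flips the same setting to "s3" without touching the YAML.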
Building
./gradlew build # Compile + ktlint + detekt + tests
./gradlew :app:run # Dev server on port 8080
./gradlew buildFatJar # Fat JAR at app/build/libs/app-all.jar
./gradlew ktlintFormat # Auto-format code
./gradlew koverHtmlReport # Merged code coverage report → build/reports/kover/html/
./gradlew koverXmlReport # Merged XML coverage report (for CI)
Code Quality
The build runs ktlint (formatting), detekt (static analysis), and all tests as a single gate. Kover enforces a minimum 85% line coverage across all five modules with merged reporting at the root level. All tests follow the given-when-then pattern with // given, // when, // then section comments.
See GUIDELINES.md for detailed coding conventions, architecture rules, and design decisions.
Tech Stack
Kotlin 2.3.0, JDK 25, Ktor 3.4.0, Koin 4.1.1, ONNX Runtime 1.23.2, Apache OpenNLP 2.5.3, kotlinx.serialization 1.8.1, AWS SDK for Kotlin 1.6.12, MCP SDK 0.8.4, jump3r (LAME MP3 encoder).
