AILANG Parse
Deterministic Office parser for DOCX/PPTX/XLSX/ODT/EPUB/HTML/PDF. Unstructured.io alternative.
Universal document parsing in AILANG. Extracts structured content from DOCX, PPTX, XLSX, PDF, and image files into JSON and markdown.
Office formats (DOCX, PPTX, XLSX) use deterministic XML parsing: no AI, no cloud, instant results. PDFs and images delegate to whatever AI model you plug in (Gemini, Claude, local Ollama). AILANG Parse is AI-agnostic: swap --ai to change the backend, with zero code changes.
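To illustrate what deterministic XML parsing means for an Office file, here is a minimal Python sketch that reads paragraph text from a DOCX using only the standard library. It is not AILANG Parse's implementation (that lives in the AILANG modules). The helper name `docx_paragraphs` is hypothetical, and the sketch ignores tables, comments, tracked changes, and text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (ECMA-376).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a DOCX file.

    `source` is a path or a binary file-like object. Hypothetical helper,
    for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        # A DOCX is a ZIP archive; the main body is conventionally stored
        # at word/document.xml.
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate every text run (w:t) inside the paragraph (w:p).
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

The same input always yields the same output, with no model or network call involved, which is the property the Office parsers rely on.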
Install
Requires the AILANG CLI.
# Clone and symlink
git clone https://github.com/sunholo-data/ailang-parse.git
ln -s "$(pwd)/ailang-parse/bin/docparse" /usr/local/bin/docparse
SDKs
Use AILANG Parse from your language of choice:
pip install ailang-parse # Python
npm install @ailang/parse # JavaScript/TypeScript
go get github.com/sunholo-data/ailang-parse-go # Go
Quick Start
# Office documents (deterministic, no AI needed)
docparse report.docx
docparse slides.pptx
docparse spreadsheet.xlsx
# PDF and images (AI auto-enabled)
docparse document.pdf
docparse photo.png
# Options
docparse report.docx describe # AI image descriptions
docparse report.docx summarize # AI document summary
docparse scan.pdf --ai gemini-2.5-flash # Choose AI backend
# Format conversion
docparse report.docx --convert output.html
docparse data.csv --convert report.docx
docparse notes.md --convert slides.pptx
# AI document generation
ailang run --entry main --caps IO,FS,Env,AI --ai gemini-2.5-flash \
docparse/main.ail --generate report.docx --prompt "Q1 sales report with tables"
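The `docparse` commands in this section can also be driven from a script. The Python sketch below shells out to the CLI using only the positional file argument and the `--ai` flag shown above. The helper names `build_command` and `run_docparse` are hypothetical, and the sketch assumes `docparse` exits with a non-zero status on failure, which this README does not state.

```python
import shutil
import subprocess


def build_command(path, ai=None):
    """Build the argv for a docparse run, using only flags shown above."""
    command = ["docparse", path]
    if ai is not None:
        command += ["--ai", ai]
    return command


def run_docparse(path, ai=None):
    """Run docparse; raises CalledProcessError on a non-zero exit status.

    Assumption: docparse signals failure through its exit status.
    """
    if shutil.which("docparse") is None:
        raise FileNotFoundError("docparse is not on PATH; see the Install section")
    return subprocess.run(build_command(path, ai), check=True)
```

For example, `run_docparse("scan.pdf", ai="gemini-2.5-flash")` mirrors the "Choose AI backend" command above.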
Output
Every run produces:
- docparse/data/output.json: Structured JSON with typed blocks
- docparse/data/output.md: LLM-ready markdown
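A quick way to inspect the structured output is to tally the block kinds it contains. The JSON schema is not described in this README, so the Python sketch below makes one loud assumption: each typed block is an object carrying a string under a `"type"` key. The function name `count_block_types` is hypothetical; adjust the key to match the real output.

```python
import json
from collections import Counter
from pathlib import Path


def count_block_types(node, counts=None):
    """Recursively tally string values stored under a "type" key.

    The "type" key is an assumption about the output schema, not a
    documented field.
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        kind = node.get("type")
        if isinstance(kind, str):
            counts[kind] += 1
        for value in node.values():
            count_block_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_block_types(item, counts)
    return counts


if __name__ == "__main__":
    # Path from the Output section, assumed relative to the working directory.
    output = Path("docparse/data/output.json")
    if output.exists():
        data = json.loads(output.read_text(encoding="utf-8"))
        for kind, n in count_block_types(data).most_common():
            print(f"{kind}: {n}")
```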
What AILANG Parse Extracts
| Feature | DOCX | PPTX | XLSX | Best Competitor |
|---|---|---|---|---|
| Tables with merged cells | Yes | Yes | Yes | Raw OOXML only |
| Track changes (redlining) | Yes | - | - | Pandoc (3/3) |
| Comments (interleaved) | Yes | - | - | Raw OOXML (2/2) |
| Headers/footers | Yes | - | - | Kreuzberg (2/3) |
| Text boxes / VML shapes | Yes | Yes | - | Raw OOXML (1/2) |
| Equations (§22.1) | Yes | - | - | None |
| Field codes (§17.16) | Yes | - | - | Kreuzberg, OOXML |
| Speaker notes | - | Yes | - | None |
| Multi-sheet extraction | - | - | Yes | Kreuzberg |
OfficeDocBench (69 files, 11 formats, 7 metrics): AILANG Parse scores 93.9% composite with 100% coverage; the nearest competitor scores 68.0% coverage-adjusted. Eight parsers were compared, including Raw OOXML, Pandoc, Kreuzberg, MarkItDown, Unstructured, and Docling. Scores include aspirational ECMA-376 spec targets that intentionally lower our score.
Supported Formats
Parsing (15 formats): DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, Markdown, CSV, EPUB, EML, MBOX, TEX, PDF, images (JPG/PNG)
Generation (9 formats): DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, Markdown, QMD (Quarto)
Architecture
docparse/
├── types/document.ail          # Block ADT (9 variants)
├── services/
│   ├── format_router.ail       # Format detection (36 inline tests)
│   ├── zip_extract.ail         # ZIP layer (9 inline tests)
│   ├── docx_parser.ail         # DOCX XML → Blocks (6 inline tests)
│   ├── pptx_parser.ail         # PPTX slides → Blocks
│   ├── xlsx_parser.ail         # XLSX worksheets → Blocks
│   ├── direct_ai_parser.ail    # PDF/image → Blocks (AI)
│   ├── layout_ai.ail           # AI self-healing (optional)
│   ├── output_formatter.ail    # JSON + markdown output
│   └── docparse_browser.ail    # WASM browser adapter
└── main.ail                    # CLI entry point
28+ contracts, 50+ inline tests.
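The split between the deterministic parsers and the AI-backed parser can be pictured as a routing table. The Python sketch below is an illustration only: the real detection logic lives in `format_router.ail` and is not shown in this README. The extension mapping, the `ROUTES` table, and the `route` function are assumptions, and only parser modules that appear in the tree above are listed.

```python
from pathlib import Path

# Extension -> module name from the tree above. The mapping itself is an
# illustrative assumption, not the contents of format_router.ail.
ROUTES = {
    ".docx": "docx_parser",
    ".pptx": "pptx_parser",
    ".xlsx": "xlsx_parser",
    ".pdf": "direct_ai_parser",
    ".png": "direct_ai_parser",
    ".jpg": "direct_ai_parser",
}

# Parsers that delegate to an AI model rather than parsing XML directly.
AI_BACKED = {"direct_ai_parser"}


def route(path):
    """Return (parser_name, needs_ai); (None, False) for unlisted extensions."""
    parser = ROUTES.get(Path(path).suffix.lower())
    return parser, parser in AI_BACKED
```

Keeping the AI-backed set separate from the routing table makes it easy to see which inputs can be handled offline.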
AI Configuration
AILANG Parse uses AILANG's AI effect, so any model AILANG supports works:
docparse scan.pdf --ai gemini-2.5-flash # Google (default; fast)
docparse scan.pdf --ai gemini-3-flash-preview # Google (slower; thinking model)
docparse scan.pdf --ai granite-docling # Local Ollama (free)
docparse scan.pdf --ai claude-haiku-4-5 # Anthropic
AI usage is bounded by capability budgets (AI @limit=30), so costs are predictable.
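The idea behind a capability budget can be sketched outside AILANG as a wrapper that refuses calls once a limit is reached. This Python analogy is not how AILANG enforces `AI @limit=30`. It assumes the limit counts individual AI calls, and the names `BudgetedAI` and `AIBudgetExceeded` are hypothetical.

```python
class AIBudgetExceeded(RuntimeError):
    """Raised when a run tries to make more AI calls than its budget allows."""


class BudgetedAI:
    """Wrap a callable and refuse calls beyond a fixed limit.

    Assumption: the budget counts calls, mirroring `AI @limit=30`.
    """

    def __init__(self, call, limit=30):
        self._call = call
        self.limit = limit
        self.used = 0

    def __call__(self, prompt):
        if self.used >= self.limit:
            raise AIBudgetExceeded(f"AI budget of {self.limit} calls exhausted")
        self.used += 1
        return self._call(prompt)
```

Because the cap is fixed before the run starts, the worst-case number of model calls, and therefore the worst-case cost, is known in advance.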
Dev Commands
docparse --check # Type-check all modules
docparse --test # Run inline tests
docparse --prove # Static Z3 contract verification
Benchmarks
uv run benchmarks/run_benchmarks.py --suite office # Structural (no API, instant)
uv run benchmarks/run_benchmarks.py --suite pdf # PDF extraction (needs AI)
uv run benchmarks/run_benchmarks.py --competitors # Compare to Docling etc.
See benchmarks/ for details.
License
Apache 2.0
