# Docpick

Schema-driven document extraction with local OCR + LLM. Document in, structured JSON out.
docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document: invoices, receipts, bills of lading, tax forms, and more.

- Zero cloud dependency: runs entirely on your machine (CPU or GPU)
- Custom schemas: define your own Pydantic models or use 8 built-in document schemas
- Built-in validation: check-digit verification, cross-field rules, cross-document consistency
- Apache 2.0: no GPL/AGPL dependencies
## Install

```bash
pip install docpick            # core (LLM extraction only)
pip install docpick[paddle]    # + PaddleOCR (recommended)
pip install docpick[easyocr]   # + EasyOCR (Korean-optimized)
pip install docpick[got]       # + GOT-OCR2.0 (GPU, vision-language)
pip install docpick[all]       # all OCR backends
```

**Requirements:** Python 3.11+ and an LLM endpoint (vLLM, Ollama, or OpenAI-compatible)
## Quick Start

### Python API

```python
from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)        # Structured dict matching the schema
print(result.validation)  # Validation errors/warnings
print(result.confidence)  # Per-field confidence scores
```
### CLI

```bash
# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice
```
## Built-in Schemas

| Schema | Document Type | Key Validations |
|---|---|---|
| `invoice` | Commercial invoices | Line-item sums, tax ID check digit, date order |
| `receipt` | Retail/restaurant receipts | Total = subtotal + tax + tip |
| `bill_of_lading` | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| `purchase_order` | Purchase orders | PO total = line items, delivery date order |
| `kr_tax_invoice` | Korean e-tax invoice (세금계산서) | Business number check digit (x2), supply/tax/total sums |
| `bank_statement` | Bank statements | IBAN mod-97, period date order |
| `id_document` | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| `certificate_of_origin` | Certificate of Origin | ISO 3166 alpha-2 country codes |
## Custom Schemas

Define your own schema with Pydantic:

```python
from pydantic import BaseModel

from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule


class MyDocument(BaseModel):
    """Custom document schema."""

    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]


pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)
```

Or use a JSON Schema file:

```bash
docpick extract document.pdf --schema my_schema.json
```
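A minimal `my_schema.json` for the same fields might look like the sketch below. The standard keywords (`type`, `properties`, `required`) are plain JSON Schema; which additional keywords docpick honors is not documented above and is an assumption.

```json
{
  "title": "MyDocument",
  "type": "object",
  "properties": {
    "company_name": { "type": "string" },
    "total_amount": { "type": "number" },
    "tax_amount": { "type": "number" },
    "net_amount": { "type": "number" }
  },
  "required": ["company_name"]
}
```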
## Validation

### Check-Digit Algorithms

| Algorithm | Use Case |
|---|---|
| `kr_business_number` | Korean business registration number (10 digits) |
| `luhn` | Credit card numbers |
| `iso_6346` | Shipping container numbers |
| `iban_mod97` | International bank account numbers |
| `awb_mod7` | Air waybill numbers |
| `mrz` | Machine Readable Zone (passport/ID) |
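To illustrate what a check-digit algorithm verifies, here is a standalone sketch of the classic Luhn check (this is the textbook algorithm, not docpick's internal implementation):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check.

    Illustrative sketch of the standard algorithm -- not docpick's
    internal code.
    """
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))  # True  (standard Luhn test number)
print(luhn_valid("79927398710"))  # False (corrupted last digit)
```

A failed check digit is strong evidence of an OCR misread (e.g., `8` read as `3`), which is why these checks run on extracted fields.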
### Cross-Field Rules

| Rule | Description |
|---|---|
| `SumEqualsRule` | Sum of fields equals target (with tolerance) |
| `DateBeforeRule` | Date A must precede Date B |
| `RequiredFieldRule` | Field must be non-null and non-empty |
| `FieldEqualsRule` | Two fields must be equal |
| `RangeRule` | Numeric field within min/max bounds |
| `RegexRule` | Field matches a regex pattern |
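The core idea behind a sum rule with tolerance can be sketched as a plain function (the real `SumEqualsRule` signature and return type may differ; this is only the concept):

```python
def check_sum_equals(data: dict, sources: list[str], target: str,
                     tolerance: float = 0.01) -> bool:
    """Check that the source fields sum to the target within a tolerance.

    Standalone sketch of the idea behind SumEqualsRule; missing fields
    pass here, since null checks are a RequiredFieldRule concern.
    """
    values = [data.get(f) for f in sources + [target]]
    if any(v is None for v in values):
        return True
    return abs(sum(data[f] for f in sources) - data[target]) <= tolerance


doc = {"net_amount": 100.0, "tax_amount": 10.0, "total_amount": 110.0}
print(check_sum_equals(doc, ["net_amount", "tax_amount"], "total_amount"))  # True
```

A small tolerance matters in practice because OCR-extracted monetary fields often carry rounding differences of a cent.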
### Cross-Document Validation

Validate consistency across related documents (e.g., Invoice + B/L + Packing List):

```python
from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)
```
## OCR Engines

| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |
### 2-Tier Auto Engine

The default `auto` engine uses confidence-based fallback:

- Tier 1 (CPU): PaddleOCR → EasyOCR
- Tier 2 (GPU): GOT-OCR2.0 → VLM

If the average Tier 1 confidence falls below a threshold (default 0.7), the pipeline automatically escalates to Tier 2.
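The escalation decision itself is simple; a sketch of the logic (function names here are illustrative, not docpick's API):

```python
def average_confidence(confidences: list[float]) -> float:
    """Mean per-line OCR confidence; 0.0 when nothing was recognized."""
    return sum(confidences) / len(confidences) if confidences else 0.0


def choose_tier(confidences: list[float], threshold: float = 0.7) -> int:
    """Return 1 to keep the Tier 1 (CPU) result, 2 to escalate to GPU.

    Sketch of the escalation decision only; an empty result always
    escalates, since its average confidence is 0.0.
    """
    return 1 if average_confidence(confidences) >= threshold else 2


print(choose_tier([0.95, 0.88, 0.91]))  # 1 -> keep CPU OCR result
print(choose_tier([0.62, 0.55, 0.70]))  # 2 -> escalate to GPU
```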
## LLM Providers

| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | http://localhost:8000/v1 | Qwen/Qwen3.5-32B-AWQ |
| Ollama | http://localhost:11434 | qwen3.5:7b |

Configure via CLI or YAML:

```bash
docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
```
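The equivalent YAML would nest the dotted CLI keys; the file name and exact nesting below are assumptions inferred from those keys, not documented behavior:

```yaml
llm:
  provider: ollama
  base_url: http://localhost:11434
  model: qwen3.5:7b
```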
## Error Handling

The pipeline is designed to be resilient:

- OCR failure → automatic fallback to the next available engine
- LLM JSON parse failure → automatic retry with a correction prompt (up to 1 retry)
- Partial results → returns whatever was extracted, with errors logged in `result.errors`
- Document load failure → returns an empty result with an error message

```python
result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)
```
## Batch Processing

Process entire directories with parallel workers:

```python
from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)

print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")
```
## Architecture

```mermaid
flowchart TD
    A["📄 Document\n(PDF / Image)"] --> B["DocumentLoader\n(pypdfium2)"]
    B --> C["Tier 1: OCR\n(PaddleOCR / EasyOCR)\nCPU"]
    C --> D{"Confidence\n≥ threshold?"}
    D -->|"yes"| F["LLM Extractor\n(vLLM / Ollama)\nSchema prompt"]
    D -->|"no"| E["Tier 2: VLM\n(GOT / VLM)\nGPU"]
    E --> F
    F --> G["Pydantic Validation"]
    G --> H["✅ ExtractionResult"]
```
## License

Apache 2.0. All dependencies are Apache 2.0 or MIT licensed.

Part of the QuartzUnit ecosystem: composable Python libraries for data collection, extraction, search, and AI agent safety.
