⚙️ Open Source · ★★★ FEATURED

Chomper — One Document Core, Three Surfaces

One document-processing core that chomps 36 file extensions, fronted by a library, a CLI and an MCP server, with a swappable four-stage pipeline and a token-optimized wire format.

Overview

Chomper is an open-source (MIT) document-processing engine with a single design idea: build the parsing core once, expose it three ways. The same four-stage pipeline — Extract → Chunk → Enrich → Format — sits behind a Python library (import chomper), a command-line tool (chomper-parse), and a Model Context Protocol server that plugs straight into Claude and other MCP clients. Write the hard part once; the surfaces are thin.

It chomps 36 file extensions (PDF, DOCX, PPTX, Excel, CSV, HTML, Markdown, ten code languages, JSON/YAML/XML, EML/MSG email, EPUB, RTF) and can hand the result back as JSON or as TOON, a compact line-based wire format aimed at shrinking token usage in LLM context windows. It is beta: built in a few days, with lazily loaded heavy models, optional dependencies that degrade gracefully when absent, and a token-reduction number measured on samples rather than guaranteed. The architecture is the point; the honesty about its maturity is part of the brand.

The shape of the problem

Document parsing tooling usually forces a trade. Libraries are ergonomic but locked inside one runtime. CLIs are scriptable but throwaway. MCP servers are LLM-native but reinvent extraction badly. And almost all of them emit verbose JSON that burns tokens the moment an LLM has to read it back.

Chomper’s answer is to refuse the trade: factor the parsing into one orchestrated core, keep every format behind a uniform interface, and make the output encoding a swappable choice rather than a fixed cost. The three surfaces are then just three different front doors onto the same engine.

STAGE MAP — one core, three surfaces

flowchart LR
  subgraph SURFACES["Three surfaces · thin"]
    LIB["Python library<br/>import chomper"]
    CLI["CLI<br/>chomper-parse"]
    MCP["MCP server<br/>9 tools · 5 prompts"]
  end
  subgraph CORE["DocumentPipeline · one core"]
    EX["1 · Extract<br/>format-specific text + structure"]
    CH["2 · Chunk<br/>auto · semantic · fixed · recursive"]
    EN["3 · Enrich<br/>keywords · sections · titles · metadata"]
    FMT["4 · Format<br/>JSON or TOON"]
  end
  OUT["ParseResult · Chunks<br/>JSON · TOON · CSV · Markdown · XML"]
  LIB --> EX
  CLI --> EX
  MCP --> EX
  EX --> CH
  CH --> EN
  EN --> FMT
  FMT --> OUT

Every stage is a swappable strategy behind an abstract base, so the pipeline is a wiring diagram, not a monolith. You can hand DocumentPipeline a custom enricher list or a different formatter and the rest of the flow does not care.
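To make the wiring idea concrete, here is a minimal sketch of a pipeline that accepts any formatter behind an abstract base. It is not Chomper's code: MiniPipeline, JSONFormatter and LineFormatter are hypothetical names, and the format() signature on BaseFormatter is an assumption made for illustration.

```python
import json
from abc import ABC, abstractmethod


class BaseFormatter(ABC):
    """Abstract formatter: renders a list of chunks as one string.

    The method name and signature are assumptions for this sketch.
    """

    @abstractmethod
    def format(self, chunks: list[str]) -> str: ...


class JSONFormatter(BaseFormatter):
    """Hypothetical JSON renderer."""

    def format(self, chunks: list[str]) -> str:
        return json.dumps({"chunks": chunks})


class LineFormatter(BaseFormatter):
    """Hypothetical stand-in for a compact line-based renderer."""

    def format(self, chunks: list[str]) -> str:
        return "\n---\n".join(chunks)


class MiniPipeline:
    """Toy pipeline: split on blank lines, then delegate to the formatter."""

    def __init__(self, formatter: BaseFormatter):
        self.formatter = formatter

    def run(self, text: str) -> str:
        chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        return self.formatter.format(chunks)


# Same input, same pipeline code, two output encodings.
as_json = MiniPipeline(JSONFormatter()).run("alpha\n\nbeta")
as_lines = MiniPipeline(LineFormatter()).run("alpha\n\nbeta")
```

The pipeline never inspects which formatter it was given, which is the property the paragraph above describes.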

Stage 1 — Extract, with import-guarded graceful degradation

Each format has its own extractor (PDFExtractor, DOCXExtractor, ExcelExtractor, …) behind a common BaseExtractor. The interesting engineering is not the parsers themselves but how the registry is built under uncertainty about what is installed.

Heavy parsers depend on heavy libraries — pymupdf for PDF, python-docx for Word, openpyxl for Excel, beautifulsoup4 for HTML. Rather than hard-require them, every optional extractor is imported inside a try/except ImportError, falling back to None:

try:
    from .pdf_extractor import PDFExtractor
except ImportError:
    PDFExtractor = None

The pipeline then builds its extension-to-extractor map dynamically, registering a format only when its dependency actually imported. Code, text and Markdown are always available (no heavy deps); PDF, Office, web and the rest light up only when present. A bare install still runs and still chomps; a full install chomps everything. Nothing crashes on a missing wheel — it just quietly offers fewer formats. That is the whole point of import-guarded degradation: the surface area shrinks gracefully instead of failing loudly.
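The dynamic registry can be sketched in a few lines. This is an illustration under assumptions, not Chomper's actual registry: build_registry and the candidate table are hypothetical, and a deliberately nonexistent module name stands in for a missing optional dependency.

```python
import importlib


def build_registry(candidates: dict[str, tuple[str, str]]) -> dict[str, str]:
    """Register an extension only if its backing module imports cleanly."""
    registry = {}
    for ext, (module_name, extractor_name) in candidates.items():
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # dependency missing: skip this format, do not crash
        registry[ext] = extractor_name
    return registry


# Hypothetical candidate table: extension -> (module to probe, extractor label).
candidates = {
    ".json": ("json", "JSONExtractor"),                   # stdlib, always importable
    ".pdf": ("no_such_pdf_backend_xyz", "PDFExtractor"),  # simulates a missing wheel
}
registry = build_registry(candidates)
```

In this run only the .json entry is registered, so a lookup for .pdf simply finds nothing rather than raising at import time.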

Stage 2 — Chunk, via the strategy pattern

Chunking is where the design earns its keep. Four strategies sit behind a BaseChunker:

auto — format-aware: a PDFChunker, MarkdownChunker, ExcelChunker and friends, each respecting that format’s natural structure.
fixed — blunt character-count splitting.
recursive — paragraph- and sentence-boundary splitting.
semantic — embedding-based, for RAG.
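The two simplest strategies in the list can be sketched as follows. This is one plausible reading of "fixed" and "recursive", not Chomper's implementation; fixed_chunks, recursive_chunks and the max_chars parameter are hypothetical names.

```python
import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Blunt character-count splitting, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_chunks(text: str, max_chars: int) -> list[str]:
    """Split on paragraphs first; split oversized paragraphs on sentence ends."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        # Paragraph too long: pack whole sentences up to max_chars.
        # A single sentence longer than max_chars is kept whole in this sketch.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks


sample = "Short one.\n\nFirst sentence here. Second sentence here. Third."
pieces = recursive_chunks(sample, max_chars=40)
```

The recursive variant only falls back to the finer boundary when the coarser one produces a chunk that is too large, which is what keeps paragraphs intact where possible.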

The semantic chunker is the one worth opening up. It embeds small text units with sentence-transformers (all-MiniLM-L6-v2 “fast” ~80 MB, or all-mpnet-base-v2 “balanced” ~420 MB), computes cosine distance between adjacent units, then cuts the document where meaning shifts. Crucially, how it decides “a shift happened” is itself a swappable strategy:

flowchart LR
  T["Text"] --> U["Split into units<br/>paragraphs then sentences"]
  U --> E["Embed units<br/>sentence-transformers"]
  E --> D["Cosine distance<br/>between adjacent units"]
  D --> B{"Breakpoint strategy"}
  B -->|"percentile"| P["cut above Nth percentile"]
  B -->|"standard_deviation"| S["cut above mean + N·σ"]
  B -->|"interquartile"| Q["cut above Q3 + N·IQR"]
  P --> C["Coherent chunks<br/>small fragments merged"]
  S --> C
  Q --> C

All three breakpoint detectors run over the same distance array; they differ only in where they draw the line: a percentile cutoff, a standard-deviations-above-mean threshold, or an IQR-based outlier test. Sub-minimum fragments are merged forward so chunks stay coherent. The embedding model is lazy-loaded and cached on first use, so importing Chomper does not pay the model-loading cost until you actually ask for semantic chunking, and if sentence-transformers isn’t installed, that path raises a clear, actionable error instead of a bare ImportError.
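The three breakpoint rules can be written against one distance list. The sketch below uses only the standard library and toy values in place of real embeddings; the function names, the linear-interpolation percentile, the population standard deviation and the quartile method are all assumptions, not confirmed details of Chomper.

```python
import math
import statistics


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def adjacent_distances(vectors: list[list[float]]) -> list[float]:
    """Distance between each unit and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]


def threshold(distances: list[float], strategy: str, n: float) -> float:
    """Return the cutoff above which a distance counts as a topic shift."""
    if strategy == "percentile":
        # n is a percentile in [0, 100]; linear interpolation between ranks.
        ordered = sorted(distances)
        rank = (len(ordered) - 1) * n / 100
        lo, hi = math.floor(rank), math.ceil(rank)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)
    if strategy == "standard_deviation":
        return statistics.mean(distances) + n * statistics.pstdev(distances)
    if strategy == "interquartile":
        # statistics.quantiles needs at least two data points.
        q1, _, q3 = statistics.quantiles(distances, n=4)
        return q3 + n * (q3 - q1)
    raise ValueError(f"unknown breakpoint strategy: {strategy!r}")


def breakpoints(distances: list[float], strategy: str, n: float) -> list[int]:
    """Indices of units that start a new chunk (cut falls before unit i + 1)."""
    cut = threshold(distances, strategy, n)
    return [i + 1 for i, d in enumerate(distances) if d > cut]


# Toy distance array: one clear jump between units 4 and 5.
toy = [0.125, 0.125, 0.125, 0.125, 0.875, 0.125, 0.125, 0.125]
cuts = {
    name: breakpoints(toy, name, n)
    for name, n in [("percentile", 90), ("standard_deviation", 1), ("interquartile", 1.5)]
}
```

On very short documents the interquartile rule can find no outliers at all, so a real implementation would need a fallback for that case.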

Stage 3 — Enrich, and Stage 4 — Format

Enrichment is a configurable list of passes: RAKE keyword extraction, heuristic section detection, keyword-driven title generation, and a metadata enricher (author, page count, word count, reading time, complexity). Pass your own list to the pipeline and you override the defaults; code files skip enrichment by default because ranking keywords in source code isn’t useful.
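A list of enrichment passes can be modelled as plain functions applied in order. The sketch below is illustrative only: the dict-based chunk, the pass names and the 200 words-per-minute reading speed are assumptions, and the keyword pass is a simple frequency count standing in for RAKE, not RAKE itself.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def word_count(chunk: dict) -> dict:
    return {**chunk, "word_count": len(chunk["text"].split())}


def reading_time(chunk: dict, wpm: int = 200) -> dict:
    """Estimated reading time in seconds at an assumed words-per-minute rate."""
    words = len(chunk["text"].split())
    return {**chunk, "reading_seconds": round(words / wpm * 60)}


def top_keywords(chunk: dict, k: int = 3) -> dict:
    """Frequency-based keywords (a stand-in for RAKE, not RAKE)."""
    words = [w for w in re.findall(r"[a-z]+", chunk["text"].lower())
             if w not in STOPWORDS]
    return {**chunk, "keywords": [w for w, _ in Counter(words).most_common(k)]}


def enrich(chunk: dict, passes: list) -> dict:
    """Apply each pass in order; an empty list leaves the chunk untouched."""
    for enrich_pass in passes:
        chunk = enrich_pass(chunk)
    return chunk


prose = enrich({"text": "parse the document then format the document"},
               [word_count, reading_time, top_keywords])
code = enrich({"text": "x = 1"}, [])  # code files: no enrichment passes
```

Because the pass list is just data, overriding the defaults amounts to handing in a different list.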

Formatting is the swappable tail. BaseFormatter is implemented by a SimpleFormatter (JSON), a TOONFormatter, and stubs aimed at vector and graph backends (Weaviate, Neo4j). Formatting is treated as “render for a backend,” not a fixed serialization.

TOON — the token-optimized wire format

TOON (Token-Optimized Object Notation) is Chomper’s compact, line-based alternative to JSON for LLM contexts. It drops JSON’s structural overhead — braces, quotes, repeated keys — for minimal delimiters (| between fields, --- between sections):

d:report.pdf|t:pdf|w:5000|c:25000|n:10
m:author=John Doe,title=Annual Report
---
0|0-2500|text|Introduction
The document begins with an overview...
k:overview,introduction,summary

The ~40% token reduction versus JSON is a target, measured on sample documents, not a guarantee. Real savings depend on the document, the tokenizer, and how much metadata travels with the chunks. TOON is offered as an output_format: "toon" flag on every MCP tool, never forced; JSON stays the default. Treat the number as “what we’ve seen on our samples,” and verify on your own corpus before relying on it.
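To see where the savings come from, the sketch below renders one record in a TOON-like layout and compares its length with the JSON encoding of the same data. The layout is inferred from the sample above and is not a TOON specification; the dict keys and the to_toon_like function are hypothetical, and the comparison counts characters, not tokens.

```python
import json


def to_toon_like(doc: dict) -> str:
    """Render a document in a layout modelled on the sample above.

    No escaping is done, so values containing '|', ',' or newlines
    would need extra handling in a real encoder.
    """
    header = "|".join(f"{key}:{doc[key]}" for key in ("d", "t", "w", "c", "n"))
    meta = "m:" + ",".join(f"{k}={v}" for k, v in doc["meta"].items())
    lines = [header, meta]
    for ch in doc["chunks"]:
        lines.append("---")
        lines.append(f"{ch['index']}|{ch['start']}-{ch['end']}|{ch['type']}|{ch['title']}")
        lines.append(ch["text"])
        lines.append("k:" + ",".join(ch["keywords"]))
    return "\n".join(lines)


doc = {
    "d": "report.pdf", "t": "pdf", "w": 5000, "c": 25000, "n": 10,
    "meta": {"author": "John Doe", "title": "Annual Report"},
    "chunks": [{
        "index": 0, "start": 0, "end": 2500, "type": "text",
        "title": "Introduction",
        "text": "The document begins with an overview...",
        "keywords": ["overview", "introduction", "summary"],
    }],
}

toon_text = to_toon_like(doc)
json_text = json.dumps(doc)
char_ratio = len(toon_text) / len(json_text)  # character ratio, not a token count
```

Character length is only a rough proxy; measuring with the tokenizer of the target model is the way to check the reduction on a given corpus.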

MCP-native, by design

The MCP server is a thin surface over the same core, exposing 9 tools, including parse_document, parse_document_bytes (base64, for cloud-storage and DB BLOBs), get_document_chunk (offset pagination for documents larger than a context window), parse_document_chunked, extract_metadata, get_document_images, list_supported_formats and batch_parse, plus 5 ready-made prompts (summarize, extract key points, explain for an audience, extract entities, document Q&A). Token discipline is baked in: parse_document returns a 5000-char summary by default with a hint to paginate, rather than blowing the context window on a full dump.
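Offset pagination and base64 input are both small mechanisms. The sketch below shows one way they might look; paginate, decode_payload and the response keys are hypothetical and do not describe the server's actual tool schemas.

```python
import base64


def decode_payload(b64_data: str) -> bytes:
    """Turn a base64 string (e.g. from a BLOB column) back into raw bytes."""
    return base64.b64decode(b64_data)


def paginate(text: str, offset: int = 0, limit: int = 5000) -> dict:
    """Return one window of text plus a hint for fetching the next one."""
    page = text[offset:offset + limit]
    next_offset = offset + len(page)
    has_more = next_offset < len(text)
    return {
        "content": page,
        "offset": offset,
        "total_chars": len(text),
        # None signals that the caller has reached the end of the document.
        "next_offset": next_offset if has_more else None,
    }


# Walk a 12,000-character document in 5,000-character windows.
text = "x" * 12000
pages, offset = [], 0
while offset is not None:
    page = paginate(text, offset)
    pages.append(page["content"])
    offset = page["next_offset"]
```

Returning next_offset alongside the content lets a client keep each response inside its context budget and still reassemble the full document.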

It registers in one command — claude mcp add -s user chomper -- ... server.py — and runs fully local: no cloud round-trip, no API key, no per-page billing.

Honest status

Beta, built over a few days — useful and exercised, not battle-hardened across every one of those 36 formats.
TOON’s ~40% reduction is a sample-measured target, not a contractual guarantee.
Optional dependencies degrade gracefully — a partial install runs with fewer formats, which is a feature, but it means “supported” depends on what you installed.
Semantic chunking downloads an embedding model on first use; lovely for quality, a real cold-start cost to budget for.
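The cold-start cost in the last point follows from loading the model lazily. A minimal sketch of that pattern, assuming a module-level helper named get_embedder (a hypothetical name, as is the wording of the error message):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_embedder(model_name: str = "all-MiniLM-L6-v2"):
    """Load the embedding model on first call and cache it afterwards."""
    try:
        # Imported here, not at module top, so a bare install still imports.
        from sentence_transformers import SentenceTransformer
    except ImportError as exc:
        raise RuntimeError(
            "Semantic chunking needs the optional 'sentence-transformers' "
            "package; install it with: pip install sentence-transformers"
        ) from exc
    # First call may download the model weights; later calls hit the cache.
    return SentenceTransformer(model_name)
```

Defining the function loads nothing; the download and model initialisation happen only on the first get_embedder() call, which is the cost to budget for.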

What this project is really about

Chomper is a study in factoring: one document core, a four-stage pipeline where every stage is a strategy behind an abstract base, three thin surfaces over the top, and an output encoding promoted from a fixed cost to a runtime choice. The semantic chunker’s swappable breakpoint detectors and the import-guarded extractor registry are the two clearest expressions of the same instinct — make the variable parts pluggable, and let the system shrink gracefully when a piece is missing.

Stack

Python 3.10+ · MCP · sentence-transformers (all-MiniLM-L6-v2 · all-mpnet-base-v2) · NumPy · pymupdf · python-docx · python-pptx · openpyxl · beautifulsoup4 · RAKE keyword extraction · strategy-pattern pipeline · TOON wire format · MIT