TOON earns a convert verdict on 1 of our 19 test documents.
First honesty post · 2026-06-11 · external corpus: 19 public-repo docs across two same-day rounds · measured with doc2toon 0.3.x · context plans added 2026-06-12, measured with doc2toon 0.4.2 · every number below reproduces from the repo
CheapAgent converts agent documents to TOON when that saves tokens. Most tools in this space lead with their best case. We maintain a benchmark precisely so we can't: 19 documents — realistic agent docs at real sizes, deliberately problematic ones, and small examples — run through the same verdict engine that powers this site and the CLI. In lossless mode, where the output preserves the source content by construction, the engine's own verdict is convert on exactly one of them.
The headline rows
| Document | Profile | Lossless Δ chars | Verdict |
|---|---|---|---|
| CLAUDE.md, realistic (20,056 chars) | mixed | −37.8% | split_first |
| AGENTS.md, realistic (19,690 chars) | mixed | −62.1% | keep_markdown |
| SKILL.md, realistic (19,265 chars) | mixed | −38.9% | keep_markdown |
| Architecture RFC (44,790 chars) | requirements | −1.5% | keep_markdown |
| Config reference (32,319 chars) | table | +21.1% | convert |
| Worst case: a small mixed doc with one table | mixed | −142.3% | keep_markdown |
Negative means the TOON output is larger than the source. Yes, the worst case is a document with a table in it: TOON's quoting and structure overhead on the prose around the table more than doubles the document.
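To make the sign convention concrete, here is a minimal sketch of how a delta like the ones in the table can be read. It assumes the delta is the relative change in character count against the source; the post states the sign convention but not the exact formula, so treat `delta_pct` and `implied_output_chars` as illustrative names, not doc2toon's API.

```python
def delta_pct(source_chars: int, output_chars: int) -> float:
    """Relative size change against the source, in percent.

    Positive: the output is smaller. Negative: the output is larger.
    ASSUMED formula; the post only states the sign convention.
    """
    return (source_chars - output_chars) / source_chars * 100.0


def implied_output_chars(source_chars: int, delta: float) -> int:
    """Invert delta_pct: the output size that a reported delta implies."""
    return round(source_chars * (1.0 - delta / 100.0))


# The realistic CLAUDE.md row: 20,056 chars at -37.8%.
print(implied_output_chars(20_056, -37.8))   # 27637
# The config reference row: 32,319 chars at +21.1%.
print(implied_output_chars(32_319, 21.1))    # 25500
# A delta of -142.3% means the output is about 2.42x the source.
print(implied_output_chars(1_000, -142.3))   # 2423
```

Under that assumption, the worst-case row turns every 1,000 source characters into roughly 2,423 output characters, which is what "more than doubles" means here.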
Why TOON loses on real agent docs
Real CLAUDE.md, AGENTS.md, and SKILL.md files profile as mixed: prose, lists, headings, the occasional table. TOON is a tabular format. Encoding prose into it buys structure overhead with no repetition to amortize it against, so the measured delta lands between −37.8% and −62.1% on our realistic agent docs (a smaller AGENTS.md fixture measures −129.0%). Their honest verdicts are keep_markdown or split_first — which is the advice this site actually gives, because the verdict on this page, in the workbench, and in the CLI is the same function.
The one win, and why we trust it
The config reference is a 32k-character uniform table: 294 rows, consistent columns. TOON removes the per-row Markdown scaffolding and comes out 21.1% smaller (+21.1%), decode-verified row for row (294/294). It is the corpus's only safe_to_auto_apply: true row. One honest caveat even here: table conversion currently keeps 91% of content characters, because the title and captions around the table are dropped, a documented v1 limitation.
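A rough way to see where a uniform table's savings come from is to count the per-row scaffolding. The sketch below builds the same 294 synthetic rows as a Markdown table and as a simplified TOON-style tabular block (one header, then comma-separated rows). It is an illustration only: the row contents are made up, the writer does no quoting or escaping, and it is not the doc2toon encoder, so the percentage it prints will not match the +21.1% measured on the real fixture.

```python
def markdown_table(headers, rows):
    """Render rows as a Markdown pipe table (pipes repeated on every row)."""
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "|".join("---" for _ in headers) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def toon_style_table(name, headers, rows):
    """Render rows in a simplified TOON-style tabular layout.

    The field names appear once in the header; each row is just its values.
    Simplification: no quoting or escaping of commas inside values.
    """
    head = name + "[" + str(len(rows)) + "]{" + ",".join(headers) + "}:"
    return "\n".join([head] + ["  " + ",".join(row) for row in rows])


# Synthetic data, not the real config reference.
headers = ["key", "type", "default"]
rows = [[f"option_{i}", "string", "none"] for i in range(294)]

md = markdown_table(headers, rows)
toon = toon_style_table("options", headers, rows)
saving = (len(md) - len(toon)) / len(md) * 100.0
print(f"markdown={len(md)} chars, toon-style={len(toon)} chars, saving={saving:.1f}%")
```

With three columns, each Markdown row carries ten characters of pipes and padding against four characters of indent and commas in the TOON-style row. That fixed per-row difference only pays off when there are many uniform rows to spread it across, which prose does not have.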
The fake wins we refuse to count
Two ways to fake a savings number, and what the engine does about them:
- Deleting content. Record mode can claim +80.0% on a prose essay while retaining 18% of the source's content characters. The same trick claims +91.6% on the architecture RFC at 8% retained. Both runs now fire a low_coverage warning and land on review, never convert. A savings number that comes from deleting the document is not a savings number.
- Counting noise. A glossary measures +4.7% in record mode. That stays keep_markdown: savings under our 5% band don't justify a format change, because measurements that small sit inside encoding noise. Re-encoding line endings alone once moved a fixture's measurement by two points. We learned that one the hard way and pinned the corpus.
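The two guards can be sketched as a small decision function. This is a simplified reading of the rules described in this section, not the engine's verdict code: the 5% band comes from the text, but `MIN_RETAINED` is a hypothetical threshold (the post does not state the real one), and the sketch does not model split_first at all.

```python
BAND_PCT = 5.0       # from the post: savings under 5% do not justify a format change
MIN_RETAINED = 0.5   # HYPOTHETICAL low-coverage threshold; the real value is not stated


def sketch_verdict(delta: float, retained: float) -> str:
    """Toy verdict for one measurement.

    delta:    percent saved (positive means smaller output)
    retained: fraction of source content characters kept (0.0 to 1.0)
    """
    if delta < BAND_PCT:
        # Inside the noise band, or larger than the source: no format change.
        return "keep_markdown"
    if retained < MIN_RETAINED:
        # Savings that come from dropped content are flagged, never auto-applied.
        return "review"
    return "convert"


print(sketch_verdict(80.0, 0.18))   # prose essay in record mode -> review
print(sketch_verdict(91.6, 0.08))   # architecture RFC in record mode -> review
print(sketch_verdict(4.7, 1.0))     # glossary, retention assumed full -> keep_markdown
print(sketch_verdict(21.1, 0.91))   # config reference -> convert
```

The ordering matters: a large savings figure is checked against retention before it can ever reach convert.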
External corpus: real agent docs from public repos
One fair objection to everything above: the 19 internal documents are ours. They are original content describing fictional projects, committed in the repo, written by the team that built the verdict policy. So we ran the same released CLI against agent docs we don't control: 19 documents from 16 public repos (the matching count is coincidence; these are entirely different files). The repos include agent frameworks, but also mainstream tooling far outside the agent bubble (ruff, biome, assistant-ui, logfire, the OpenAI Agents SDKs). Each document is MIT-licensed or explicitly MIT for the content measured, and each was pinned to an exact commit SHA before measurement, with thresholds frozen in advance and the gates pre-registered before sourcing. We store and publish measurements with attribution, not copies of anyone's files.
| Document | Profile | Lossless Δ chars | Verdict |
|---|---|---|---|
| github/spec-kit · AGENTS.md | mixed | −41.3% | split_first |
| browser-use · AGENTS.md | mixed | −22.7% | split_first |
| browser-use · CLAUDE.md | mixed | −86.4% | split_first |
| browser-use · skills/browser-use/SKILL.md | mixed | −22.8% | split_first |
| langchain-ai/langchain · AGENTS.md | mixed | −50.7% | split_first |
| langchain-ai/langchainjs · AGENTS.md | mixed | −42.8% | keep_markdown |
| langflow-ai/langflow · AGENTS.md | mixed | −42.4% | split_first |
| OpenHands · AGENTS.md | mixed | −49.5% | split_first |
| BerriAI/litellm · CLAUDE.md | mixed | −78.8% | split_first |
| astral-sh/uv · AGENTS.md | requirements | −10.5% | split_first |
| openai/openai-agents-python · AGENTS.md | mixed | −78.0% | split_first |
| openai/openai-agents-js · AGENTS.md | mixed | −62.4% | split_first |
| pydantic/pydantic-ai · AGENTS.md | mixed | −44.2% | split_first |
| assistant-ui/assistant-ui · AGENTS.md | requirements | −3.8% | keep_markdown |
| astral-sh/ruff · AGENTS.md | requirements | −2.8% | split_first |
| biomejs/biome · AGENTS.md | mixed | −52.8% | split_first |
| pydantic/logfire · CLAUDE.md | mixed | −40.0% | keep_markdown |
| Infisical/agent-vault · AGENTS.md | mixed | −86.9% | split_first |
| Infisical/agent-vault · CLAUDE.md | mixed | −45.7% | split_first |
0 convert. 16 split_first. 3 keep_markdown. safe_to_auto_apply on none. Every one of these real files would get larger in TOON, by anywhere from 2.8% to 86.9%, and the engine says so. The closest any real-world doc came to parity is ruff's AGENTS.md at −2.8%, where the engine flagged duplicate_rule seven times; openai-agents-python's AGENTS.md fired it four times at −78.0%. Finding real context bloat in famous files is the product's actual job: the verdict tells you to fix the doc, not to change its format.
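The tally can be rechecked from the table itself. The snippet below copies the 19 rows (label, lossless delta, verdict) from the table above and recounts them. It does not run the CLI or fetch anything, so it verifies the arithmetic on this page, not the measurements.

```python
from collections import Counter

# (document, lossless delta in percent, verdict), copied from the table above.
ROWS = [
    ("github/spec-kit AGENTS.md", -41.3, "split_first"),
    ("browser-use AGENTS.md", -22.7, "split_first"),
    ("browser-use CLAUDE.md", -86.4, "split_first"),
    ("browser-use SKILL.md", -22.8, "split_first"),
    ("langchain AGENTS.md", -50.7, "split_first"),
    ("langchainjs AGENTS.md", -42.8, "keep_markdown"),
    ("langflow AGENTS.md", -42.4, "split_first"),
    ("OpenHands AGENTS.md", -49.5, "split_first"),
    ("litellm CLAUDE.md", -78.8, "split_first"),
    ("uv AGENTS.md", -10.5, "split_first"),
    ("openai-agents-python AGENTS.md", -78.0, "split_first"),
    ("openai-agents-js AGENTS.md", -62.4, "split_first"),
    ("pydantic-ai AGENTS.md", -44.2, "split_first"),
    ("assistant-ui AGENTS.md", -3.8, "keep_markdown"),
    ("ruff AGENTS.md", -2.8, "split_first"),
    ("biome AGENTS.md", -52.8, "split_first"),
    ("logfire CLAUDE.md", -40.0, "keep_markdown"),
    ("agent-vault AGENTS.md", -86.9, "split_first"),
    ("agent-vault CLAUDE.md", -45.7, "split_first"),
]

counts = Counter(verdict for _, _, verdict in ROWS)
deltas = [delta for _, delta, _ in ROWS]

print(len(ROWS), dict(counts))              # 19 documents, 16 split_first, 3 keep_markdown
print("closest to parity:", max(deltas))    # -2.8
print("largest growth:", min(deltas))       # -86.9
print("any savings:", any(d > 0 for d in deltas))  # False
```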
The pointer pattern is the quiet headline. Eight files in the corpus turned out to be pointers — a CLAUDE.md containing just @AGENTS.md (ruff, assistant-ui, pydantic-ai, openai-agents-python), an AGENTS.md routing to CLAUDE.md (logfire), even a CLAUDE.md pointing at CONTRIBUTING.md (biome). We record these as evidence and never count them as documents. But notice what they mean: the ecosystem's most sophisticated repos have already split their agent context into routed, structured pieces — which is precisely what split_first, the corpus's dominant verdict, recommends. The verdict isn't contrarian; it describes where the leaders already went. (Curated skill-pack ecosystems are a different population — we measure them in a separate lane in the repo and don't mix them into this denominator.)
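For readers who want to screen their own repos, a pointer file of the simplest kind (a file whose only content is an `@AGENTS.md`-style reference) is easy to spot. The check below is a hypothetical heuristic written for this post's example, not how the benchmark classifies pointers; it deliberately ignores prose-style pointers such as a file that says "see CLAUDE.md" in a sentence.

```python
def is_at_pointer(text: str) -> bool:
    """True when the file's only non-blank line is a single @path reference.

    HYPOTHETICAL heuristic: covers the bare '@AGENTS.md' form only.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        return False
    only = lines[0]
    return only.startswith("@") and len(only) > 1 and " " not in only


print(is_at_pointer("@AGENTS.md\n"))                 # True
print(is_at_pointer("# Guide\n\n@AGENTS.md\n"))      # False: has other content
print(is_at_pointer("See CLAUDE.md for details.\n")) # False: prose pointer, not covered
```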
On our internal calibration corpus, TOON earned convert on 1 of 19 documents. On public agent-context files it earned convert on none: CheapAgent recommended splitting first on 16, keeping Markdown on 3, and converts only when the structure earns it.
Context plans: finds the parts worth converting, refuses the rest
Everything above measures whole documents, and whole documents usually lose. But the losing docs are not empty of structure — the tables and uniform blocks are inside them. So doc2toon 0.4 added context plans: every heading-bounded section measured standalone as if it were its own document — same frozen policy, same 5% band, zero new tunable constants — with the splice overhead of stitching a hybrid counted, never hidden, and reassembly mechanically verified. The metric was pre-registered before any plan code ran against the corpus; the planning hypothesis it tested (“maybe a third of documents have an actionable plan”) was refuted by the data. We publish what fell out:
| Corpus | Plan-positive | Median net, plan-positive | Docs with ≥1 converting section |
|---|---|---|---|
| Internal (19 docs) | 1/19 | +20.9% | 4 |
| External, public repos (19 docs) | 1/19 | +6.8% | 3 |
| Combined | 2/38 | +13.8% | 7 |
The external plan-positive document is langchain-ai/langchainjs · AGENTS.md — and it is the whole product thesis in one file. Whole-document TOON makes it 42.8% larger; its row in the table above honestly says keep_markdown. The plan finds the two package tables inside it (“Key Packages” +49.5%, “Internal Packages” +52.0%), converts exactly those, keeps the other 48 sections byte-identical, and nets +6.8% across the whole document with plan-level safe_to_auto_apply: true — the first real-world document in the corpus where the tool has a positive, auto-applicable recommendation. Five more documents have at least one converting section but net only 0.2–0.9% — below the 5% band, so the plan’s honest answer for them stays “keep the whole document.” Reassembly verified on all 38; the whole-document denominators above are unchanged by any of this.
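The plan arithmetic can be sketched in a few lines. The version below assumes what the text describes: each section is measured standalone against the 5% band, converted sections pay a splice cost, and the plan counts as positive only if the net saving across the whole document also clears the band. The section sizes and the `SPLICE_OVERHEAD` value are invented for illustration, and `sketch_plan` is not the doc2toon plan implementation.

```python
from dataclasses import dataclass

BAND_PCT = 5.0         # the 5% band named in the post
SPLICE_OVERHEAD = 12   # HYPOTHETICAL chars added per converted section


@dataclass
class Section:
    title: str
    source_chars: int
    toon_chars: int    # size if this section alone were converted


def delta_pct(source_chars: int, output_chars: int) -> float:
    return (source_chars - output_chars) / source_chars * 100.0


def sketch_plan(sections: list[Section]) -> dict:
    # A section converts only if its standalone saving clears the band.
    convert = [s.title for s in sections
               if delta_pct(s.source_chars, s.toon_chars) >= BAND_PCT]
    source_total = sum(s.source_chars for s in sections)
    # Hybrid size: converted sections plus splice cost, everything else unchanged.
    hybrid_total = sum(
        s.toon_chars + SPLICE_OVERHEAD if s.title in convert else s.source_chars
        for s in sections
    )
    net = delta_pct(source_total, hybrid_total)
    return {"convert": convert, "net_pct": round(net, 1),
            "plan_positive": net >= BAND_PCT}


# Invented sizes: two sizeable tables inside a mostly-prose document.
table_heavy = [Section("Intro prose", 4000, 5600), Section("Table A", 1000, 500),
               Section("Table B", 800, 400), Section("More prose", 2000, 2700)]
# Invented sizes: one small table inside a long prose document.
prose_heavy = [Section("Long prose", 20000, 28000), Section("Small table", 300, 150)]

print(sketch_plan(table_heavy))  # net_pct 11.2, plan_positive True
print(sketch_plan(prose_heavy))  # net_pct 0.7, plan_positive False
```

The second case mirrors the five documents mentioned above: a section converts cleanly on its own, but the whole-document net stays under the band, so the answer remains "keep the whole document."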
The plan now renders in the workbench under every multi-section verdict, with the hybrid (converted sections as fenced TOON, everything else byte-identical) available for download. The CLI surface is doc2toon plan.
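The "byte-identical" and "reassembly verified" claims describe a property that is simple to state in code: split the document at headings, splice in replacements for chosen sections, and confirm that every untouched section is unchanged. The sketch below assumes ATX-style `#` headings and a fence labelled `toon`; both are assumptions for illustration, and it does not skip `#` lines inside existing code fences the way a real splitter would need to.

```python
import re

HEADING = re.compile(r"^#{1,6} ")   # ASSUMES ATX headings; ignores setext headings
FENCE = "`" * 3                     # built this way to avoid a literal fence in the source


def split_sections(text: str) -> list[str]:
    """Split into heading-bounded sections without losing a single character."""
    sections, current = [], []
    for line in text.splitlines(keepends=True):   # keepends preserves \n vs \r\n
        if HEADING.match(line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def build_hybrid(sections: list[str], replacements: dict[int, str]) -> str:
    """Replace chosen sections with fenced text; leave every other section as-is."""
    parts = []
    for index, section in enumerate(sections):
        if index in replacements:
            body = replacements[index].rstrip("\n")
            parts.append(FENCE + "toon\n" + body + "\n" + FENCE + "\n")
        else:
            parts.append(section)
    return "".join(parts)


doc = "intro line\n# A\nprose\r\n## B\n| k | v |\n|---|---|\n| a | 1 |\n"
sections = split_sections(doc)

# Reassembly check: with nothing replaced, the output must equal the input exactly.
assert "".join(sections) == doc
assert build_hybrid(sections, {}) == doc

hybrid = build_hybrid(sections, {2: "B[1]{k,v}:\n  a,1"})
print(hybrid)
```

Splitting with `keepends=True` matters for the line-ending issue mentioned earlier: the check compares the exact characters, so a silent `\r\n` to `\n` rewrite would fail it.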
Run yours
Your documents are not our corpus. The same verdict, three ways:
# Verdict v1 JSON for your own doc (Node 20+)
npx doc2toon@^0.4 profile --json CLAUDE.md
# Section-level context plan, with the hybrid written on request
npx doc2toon@^0.4 plan AGENTS.md
# Or paste it into the workbench - it never leaves your browser
# https://cheapagent.ai/
# Or reproduce this page's numbers from the corpus
git clone https://github.com/Profusion-AI/doc2toon
cd doc2toon && npm install && npm run build
node scripts/benchmark-honesty.mjs # the 19 internal documents
node scripts/benchmark-external.mjs # the 19 public documents, fetched at pinned commits
node scripts/benchmark-plans.mjs # the actionable-plan rate across all 38
The posture
CheapAgent does not promise TOON always saves tokens. It tells you when it does — and on real agent docs, the honest answer is usually “keep Markdown” or “split this first.” The verdict is the product. If a tool in this space shows you only its wins, ask to see its corpus.