---
theme: default
title: pdf2croissant
routerMode: hash
selectable: true
download: true
colorSchema: light
fonts:
  sans: Source Sans 3
  serif: Young Serif
  mono: JetBrains Mono
  weights: '400,500,600,700'
transition: slide-left
layout: cover
---

# pdf2croissant

Turn academic papers into MLCommons Croissant JSON-LD metadata

<span style="font-family: var(--deck-font-mono); font-size: 0.85rem; color: var(--deck-muted); margin-top: 1.5rem; display: inline-block;">github.com/jettyio/pdf2croissant</span>

<!--
The cover establishes the project identity. The subtitle is the one-line description from the README. The URL points to the GitHub repo. pdf2croissant solves the gap between what Croissant defines as a schema and the messy reality of extracting that metadata from academic papers.

Sources:
- https://github.com/jettyio/pdf2croissant — repository description and live app URL
-->

---
transition: fade
---

# This Was Extracted From a PDF

```json
{
  "@type": "sc:Dataset",
  "name": "SQuAD 2.0",
  "description": "Reading comprehension dataset combining
    100,000 answerable questions with 50,000 unanswerable ones",
  "license": "CC-BY-SA-4.0",
  "creator": [{ "name": "Pranav Rajpurkar" }]
}
```

The paper never had structured metadata. An agent read 20 pages of prose and produced this.

<!--
In media res opening. Before explaining what Croissant is, show what the system actually outputs — a fragment of a real Croissant JSON-LD file. The audience sees the end product first. The tension: this structured metadata did not exist anywhere in the paper as structured data. The dataset name was on page 1, the license was in a footnote, the description was scattered across the abstract and introduction. An agent extracted these fields, tagged their confidence, and assembled valid JSON-LD. The rest of the deck explains how.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — project overview: agent extracts Croissant JSON-LD from academic papers
- https://github.com/jettyio/pdf2croissant/tree/main/benchmarks — SQuAD 2.0 benchmark with ground-truth Croissant output
-->

---
transition: slide-left
---

# The Standard That Needs Filling

Croissant is the MLCommons standard for dataset metadata. HuggingFace, Kaggle, and OpenML adopted it. It defines 16 fields in JSON-LD: name, creators, license, distributions, record sets, data types.

<v-clicks>

- **Well-designed schema** — machine-readable, linked-data compatible, validated by `mlcroissant`
- **Widely adopted** — the three largest dataset platforms use it as their canonical format
- **But papers don't come with Croissant files** — someone has to read them and extract the fields

</v-clicks>

<!--
Now that the audience has seen the output, explain the context. Croissant was published by MLCommons in 2024 and rapidly adopted by HuggingFace, Kaggle, and OpenML as the standard way to describe ML datasets. It defines a JSON-LD vocabulary with 16 fields covering identity, provenance, structure, and data types. The mlcroissant Python library validates conformance and can even instantiate datasets from the metadata. The standard is excellent. The problem is that papers do not ship with Croissant files. The metadata Croissant needs — names, licenses, column types, download URLs — is scattered across 20 pages of academic prose. Someone has to read the paper and extract it.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — 16 field mappings from paper to Croissant, mlcroissant validation
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — MLCommons standard context, HuggingFace/Kaggle/OpenML adoption
-->

---
transition: slide-left
---

# The Extraction Gap

A researcher publishes a paper describing a new dataset. The dataset name is on page 1. The license is in a footnote on page 12. The column types are in a table on page 4. The download URL might be in an appendix — or missing entirely.

<v-clicks>

- Croissant needs **structured fields**
- Papers contain **scattered prose**
- Someone must read the paper, find each fact, and decide what is explicit vs. inferred
- That someone is now an agent

</v-clicks>

<!--
This slide makes the gap concrete. A single academic paper can scatter the metadata Croissant needs across 20+ pages. The dataset name is usually clear, but the license might be a footnote, the data format might be described in a methodology section, and the download URL might exist only in a GitHub link buried in the appendix. Some fields — like file sizes or update frequency — may not appear in the paper at all. The extraction problem is not just reading; it is distinguishing between what the paper explicitly states and what you are inferring from context.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — "Distinguish between what is explicitly stated vs what you are inferring"
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — problem statement
-->

---
layout: section
transition: iris
---

# An Agent That Reads Papers

<!--
Section break into the core product. The agent is an AI model (Claude or Gemini) running inside a sandboxed Python environment, guided by a runbook that tells it exactly how to extract metadata from academic papers. The word "agent" is precise here: it has a system prompt, access to tools (pip, Python, mlcroissant), and a multi-step workflow it follows autonomously.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — agent-based extraction overview
-->

---
transition: slide-left
---

# Three Outputs From Every Run

Upload a PDF of a paper that introduces an ML dataset. The agent reads it and delivers three files:

<v-clicks>

- **croissant.json** — the Croissant JSON-LD metadata file, ready for HuggingFace or Kaggle
- **validation_report.json** — what passed, what failed, how many iterations it took to fix errors
- **summary.md** — human-readable extraction report with confidence levels per field

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">The validation report and summary exist because the Croissant file alone does not tell you how much to trust it.</p>

<!--
The three-output design is deliberate. croissant.json is the machine-readable payload — the thing downstream systems consume. validation_report.json is the audit trail: it records which of the three validation stages passed, what errors were encountered, and how many self-healing iterations the agent ran. summary.md is for human review, showing each extracted field with its confidence tag (high, medium, low) and a table of gaps — fields the paper did not address. The summary is what a dataset maintainer reads to decide whether to accept the Croissant file as-is or manually verify specific fields.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — three output files and their purposes
-->

---
transition: slide-left
---

# The Upload Pipeline

Three stages keep the API server out of the file transfer path:

<v-clicks>

- **Presign** — client requests a signed URL from `/api/presign`
- **Blob** — client PUTs the PDF directly to Vercel Blob storage (15MB limit, drag-and-drop with progress)
- **Run** — client POSTs to `/api/run` with the blob reference and selected model

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1rem;">React Query polls for completion. The frontend never touches the PDF after upload.</p>

<!--
The presigned URL pattern means the Next.js server never buffers the full PDF — the client uploads directly to Vercel Blob storage. The 15MB limit is enforced client-side in the UploadForm component before the upload starts. The XMLHttpRequest-based upload provides real progress tracking for large papers. Once the blob is stored, the POST to /api/run sends only the blob reference and the user's model choice to the Jetty API backend. React Query handles polling the run status until completion. This architecture means the Next.js server handles only lightweight JSON, never large file payloads.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — upload pipeline: presign, blob, run
- https://github.com/jettyio/pdf2croissant/blob/main/package.json — React Query, Vercel Blob dependencies
-->

---
transition: slide-left
---

# Architecture

```mermaid {theme: 'base', scale: 0.65}
graph LR
  A["Upload PDF"] --> B["Next.js"]
  B --> C["Vercel Blob"]
  B --> D["Jetty API"]
  D --> E["Sandbox"]
  E --> F["Agent"]
  F --> G["croissant.json"]
  F --> H["report.json"]
  F --> I["summary.md"]
  style A fill:#a06c08,stroke:#a06c08,color:#fffbf5
  style B fill:#fffbf5,stroke:#2c1810,color:#2c1810
  style C fill:#fffbf5,stroke:#2c1810,color:#2c1810
  style D fill:#a06c08,stroke:#a06c08,color:#fffbf5
  style E fill:#fffbf5,stroke:#2c1810,color:#2c1810
  style F fill:#a06c08,stroke:#a06c08,color:#fffbf5
  style G fill:#fffbf5,stroke:#8b5e34,color:#2c1810
  style H fill:#fffbf5,stroke:#8b5e34,color:#2c1810
  style I fill:#fffbf5,stroke:#8b5e34,color:#2c1810
  linkStyle default stroke:#2c1810,stroke-width:2px
```

<p style="color: var(--deck-muted); font-size: 0.9rem; margin-top: 0.5rem;">Jetty API orchestrates the agent in a sandboxed Python 3.12 environment. One env var (JETTY_API_TOKEN) connects the frontend to the backend.</p>

<!--
The architecture separates concerns cleanly. The Next.js 15 frontend (React 19, Tailwind CSS 4) handles the upload UI and result display through 8 components: UploadForm, CroissantViewer, ValidationResults, SummaryReport, RunHistory, StepTimeline, RunStatusBanner, and RunbookContent. The Jetty API (flows-api.jetty.io) provides an OpenAI-compatible chat completions endpoint that orchestrates the agent. The sandbox gives each run Python 3.12 with 4 CPUs, 8GB RAM, a 1200-second timeout, and network access (for optional HuggingFace cross-referencing). The single env var deployment means the entire system connects through JETTY_API_TOKEN — no database, no queue, no additional infrastructure.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — architecture overview, Jetty API, sandbox constraints
- https://github.com/jettyio/pdf2croissant/blob/main/package.json — Next.js 15, React 19, Tailwind CSS 4, React Query
-->

---
layout: center
transition: fade
---

# The agent runs in a sandbox. 4 CPUs. 8 GB RAM. 20 minutes. No escape.

<!--
The sandbox constraints are worth pausing on. Each agent run gets an isolated Python 3.12 process with 4 CPUs, 8GB RAM, and a 1200-second (20-minute) timeout. The agent has network access — it can pip install mlcroissant and optionally query the HuggingFace API — but it runs inside Jetty's sandboxed environment. The 20-minute timeout accommodates complex papers that require multiple validation iterations: a paper with ambiguous metadata might need 2-3 self-healing cycles, each involving a full re-read of the PDF and a re-validation of the Croissant output.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — sandbox: Python 3.12, 4 CPUs, 8GB RAM, 1200s timeout, network access
-->

---
transition: slide-left
---

# Choose Your Extraction Engine

The model selector lets users pick an extraction engine per run:

<v-clicks>

- **Claude Opus** — highest comprehension quality, best for complex multi-table papers
- **Claude Sonnet** — faster extraction, good for straightforward papers
- **Gemini Pro** — the default backend model, balances speed and accuracy

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">Same runbook, same validation pipeline, different reading engine. The rules stay constant; the model varies.</p>

<!--
Model-agnosticism is an architectural choice, not a feature checkbox. The runbook is the same regardless of which model executes it. The Jetty API backend uses an OpenAI-compatible chat completions endpoint, so swapping models means changing one parameter. The default is gemini-3.1-pro-preview, but the frontend model selector lets users choose Claude Opus or Sonnet for any given run. This matters because different models have different strengths with academic papers: Opus handles dense multi-table papers better, while Gemini Pro is faster for straightforward single-dataset papers.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — model selector: Claude Opus, Claude Sonnet, Gemini Pro
-->

---
layout: section
transition: iris
---

# The Runbook Is the System Prompt

<!--
Section break into the most interesting part of the project. The runbook — RUNBOOK.md — is a 300+ line standard operating procedure that ships as the system prompt with every API request. It is embedded at build time by embed-runbook.ts and sent verbatim. The runbook is not documentation about the agent; it IS the agent's instructions. Every design decision about how extraction works is encoded here.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — 300+ line system prompt
-->

---
transition: slide-left
---

# Eight Steps, One Procedure

The runbook defines the agent's complete workflow:

<v-clicks>

- **1. Environment Setup** — `pip install mlcroissant`, create output directories
- **2. Read and Analyze** — extract dataset identity, creators, structure, characteristics
- **3. Cross-Reference** — optionally query HuggingFace for supplementary data
- **4. Build Croissant JSON-LD** — map 16 paper fields to Croissant properties
- **5. Validate** — three-stage pipeline (syntax, schema, semantics)

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1rem;">Steps 6-8: iterate on errors (max 3), write executive summary, run final checklist. If any item fails, go back and fix it.</p>

<!--
The 8-step runbook is the entire extraction discipline. Step 2 is where the agent does the actual reading — it extracts dataset identity (name, version), creators (authors, institutions), data structure (columns, types, splits), characteristics (size, format, modality), and additional metadata (license, citation, DOI). The key rule in step 2: "Distinguish between what is explicitly stated vs what you are inferring." Step 3 is optional — the agent can query the HuggingFace API for supplementary data, but the runbook warns: "Do NOT blindly copy. The paper is the primary source." Step 4 uses a field mapping table with 16 specific mappings from paper concepts to Croissant JSON-LD properties.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — 8-step extraction procedure
-->

---
transition: morph-fade
---

# Confidence Is a First-Class Output

Every extracted field gets a confidence tag. The agent cannot skip this.

<v-clicks>

- **HIGH** — explicitly stated in the paper: dataset name, authors, license, citation
- **MEDIUM** — inferred from context: column types, data splits, file formats
- **LOW** — not in the paper: download URLs, file sizes, update frequency
- **GAPS** — documented explicitly, never filled with plausible guesses

</v-clicks>

<p v-click style="color: var(--deck-accent); font-size: 0.95rem; margin-top: 1.5rem;">A Croissant file with documented gaps is more useful than one with confident-looking hallucinations.</p>

<!--
The confidence tracking system is what makes the extraction honest. HIGH means the paper literally says "this dataset contains 100,000 training examples" — the agent can point to the sentence. MEDIUM means the agent inferred the information from context — for example, deducing column types from a table in the paper even though the paper does not explicitly list them. LOW means the information is not in the paper at all — the agent is guessing or using external knowledge. GAPS are the most important category: they represent fields the paper simply does not address. The runbook requires the agent to document gaps rather than fill them. This is the core design rule in action.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — confidence tracking: HIGH, MEDIUM, LOW definitions and examples
-->

---
transition: slide-left
---

# Gap Documentation

The executive summary includes a mandatory gaps table:

| Gap | Why It Matters |
|-----|---------------|
| Download URL not in paper | Croissant distribution object is incomplete |
| File sizes not mentioned | Cannot verify download integrity |
| Update frequency unknown | Consumers cannot plan refresh schedules |
| Data split ratios implied but not stated | Record set definitions are approximations |

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1rem;">The agent is required to note absence rather than fabricate presence. Silence is data.</p>

<!--
Gap documentation is the most unusual requirement in the runbook. Most extraction tools optimize for completeness — fill every field, guess if you have to. pdf2croissant inverts this. The executive summary that ships with every run includes three tables: high-confidence fields, medium-confidence fields, and gaps. The gaps table is mandatory. If a paper does not mention where to download the dataset, the agent writes that down rather than guessing a URL from model knowledge. This means downstream consumers of the Croissant file can trust the fields that are present and know exactly which ones need manual verification.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — gap documentation requirements, executive summary template
-->

---
layout: two-cols-header
transition: wipe-right
---

# Paper In, Croissant Out

::left::

**What the agent reads**

<v-clicks>

- PDF of an academic paper
- Tables, figures, prose descriptions
- Scattered metadata across sections
- Implicit assumptions, missing URLs

</v-clicks>

::right::

**What the agent produces**

<v-clicks>

- `croissant.json` — valid JSON-LD
- `validation_report.json` — audit trail
- `summary.md` — confidence per field
- Gaps documented, not filled

</v-clicks>

<!--
This slide makes the transformation concrete. The left column is the messy reality of academic papers — metadata is never in one place. A dataset's record count might be in a table on page 4, the license in a footnote on page 12, and the download URL in an appendix or not mentioned at all. The right column is the structured output. The key insight is the last item on each side: "Implicit assumptions, missing URLs" becomes "Gaps documented, not filled." The agent does not invent what the paper omits.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — three output files
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — gap documentation and confidence tagging rules
-->

---
transition: slide-left
---

# Field Mapping

The runbook defines 16 specific mappings from paper concepts to Croissant JSON-LD:

<v-clicks>

- **Identity** — dataset name, version, description, URL, same-as DOI
- **Provenance** — creators, date published, citation, license
- **Structure** — distributions (files), record sets (tables), fields (columns)
- **Data types** — 7 mappings: text, integer, float, boolean, date, URL, enum

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">Each mapping has a source (paper) and a target (JSON-LD property). The agent fills what it can find and documents the rest as gaps.</p>

<!--
The field mapping table is the mechanical core of the extraction. It tells the agent exactly which paper concepts map to which Croissant JSON-LD properties. For example, "dataset name as stated in the paper" maps to the name property, "authors and institutions" maps to creator, "file format and access method" maps to distribution. The data type mapping covers 7 types that the mlcroissant library recognizes: sc:Text, sc:Integer, sc:Float, sc:Boolean, sc:Date, sc:URL, and sc:Enum. Each field mapping has a specific instruction about what to look for in the paper and what to do if it is not found.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — field mapping table (16 mappings), data type mapping (7 types)
-->

---
layout: center
transition: morph-fade
---

# The runbook is a file in the repo. Copy it. Use it with Claude Code, Codex, or Gemini CLI. The web app is optional.

<!--
This is the insight most people miss about the project. The web app at pdf2mlcroissant.vercel.app is a convenient wrapper, but the runbook — RUNBOOK.md — is the actual intellectual contribution. It encodes the extraction discipline: what to look for, how to tag confidence, when to document gaps, how to validate. The /runbook page in the app makes this explicit by inviting users to copy the runbook and use it with whatever agent they already have. This portability is by design. If the rules are right, any sufficiently capable model can follow them. The web app is one instantiation; the runbook works anywhere.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — portable runbook designed for use outside the app
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — /runbook page and portability description
-->

---
layout: section
transition: iris
---

# Three Stages, Three Chances to Be Wrong

<!--
Section break for the validation deep-dive. The three-stage validation pipeline is what turns a best-effort extraction into a verified one. Each stage catches a different class of error: malformed JSON, schema violations, and semantic errors that only surface when you try to load the data. The self-healing loop means the agent can fix its own mistakes — but only up to three times, because infinite retries mask fundamental extraction failures.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — three-stage validation: JSON syntax, Croissant schema, record set generation
-->

---
transition: slide-up
---

# The Validation Pipeline

Three stages catch three classes of error:

<v-clicks>

- **JSON syntax** — is the output valid JSON-LD? Catches malformed LLM output
- **Croissant schema** — does it conform to the MLCommons spec? Validated with the `mlcroissant` Python library
- **Record set generation** — can you actually load the described data? Catches semantic errors that pass schema checks

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">The escalation order matters. Each stage is more expensive and catches a different failure mode.</p>

<!--
The escalation order is deliberate. JSON syntax validation is cheap — it catches the most common failure mode of malformed output from the LLM. Croissant schema validation uses the official mlcroissant Python library, which checks field types, required properties, and cross-references between record sets and distributions. Record set generation is the most expensive check: it actually tries to instantiate the dataset from the metadata. This catches semantic errors like incorrect column names or mismatched file formats that pass schema validation but fail at load time. Each stage produces a pass/fail result with detailed error messages that feed into the self-healing loop.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — three-stage validation pipeline
-->

---
transition: slide-left
---

# Self-Healing: Read the Error, Fix the Output

When validation fails, the agent does not give up. It reads the error message, re-reads the relevant section of the paper, and fixes the Croissant output.

<v-clicks>

- **8 common error patterns** documented in the runbook with known fixes
- **Max 3 iterations** — most fixable errors resolve in 1-2 attempts
- **Errors past 3** are usually fundamental extraction failures, not fixable by retry
- Each iteration produces an updated `validation_report.json` showing what changed

</v-clicks>
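The loop itself is simple. A sketch, assuming `validate` and `repair` callables (where `repair` stands in for the agent re-reading the paper):

```python
MAX_ITERATIONS = 3  # past this, failures are usually extraction problems, not syntax

def self_heal(croissant, validate, repair):
    """Validate; on failure, repair from the error messages and retry, at most
    MAX_ITERATIONS times. `validate` returns (ok, errors). Illustrative."""
    history = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        ok, errors = validate(croissant)
        history.append({"attempt": attempt, "passed": ok, "errors": errors})
        if ok:
            return croissant, history
        croissant = repair(croissant, errors)
    return None, history  # give up: document the failure, don't fabricate a pass
```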

<!--
The runbook includes a table of 8 common error patterns the agent encounters during validation, along with their typical fixes. For example, a missing @context is a JSON-LD structural error fixed by adding the standard Croissant context block. A mismatched record set field name is a semantic error fixed by re-reading the paper's data description table. The 3-iteration limit was chosen empirically: in testing against the benchmark suite, most fixable errors resolve in 1-2 attempts. Errors that persist past 3 iterations are usually fundamental — the paper genuinely does not contain enough information to build a valid Croissant file for that field.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — common error table (8 patterns), max 3 iterations
-->

---
transition: slide-left
---

# The Executive Summary

Every run produces a human-readable summary with structured confidence tables:

<v-clicks>

- **High confidence table** — fields explicitly stated in the paper, with page references
- **Medium confidence table** — fields inferred from context, with reasoning
- **Low confidence table** — fields the agent guessed at, flagged for manual review
- **Gaps table** — fields the paper does not address at all

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">The summary is what a dataset maintainer reads to decide whether to accept the Croissant file or verify specific fields.</p>

<!--
The executive summary template is defined in step 7 of the runbook. It is not optional — the agent must produce it for every run. The high-confidence table lists fields like dataset name, authors, and license that the paper explicitly states. The medium-confidence table lists fields like column types and data splits that can be reasonably inferred from context. The low-confidence table lists fields the agent is uncertain about. The gaps table lists fields the paper simply does not mention — download URLs, file sizes, update frequency. This structured output means a human reviewer can focus their time on the medium, low, and gap categories rather than re-verifying everything.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/RUNBOOK.md — executive summary template with confidence tables
-->

---
transition: slide-left
---

# Five Benchmarks, Five Ground Truths

The repo ships ground-truth Croissant files for five datasets — sourced from HuggingFace:

<v-clicks>

- **SQuAD 2.0** — reading comprehension with unanswerable questions
- **GLUE** — natural language understanding benchmark suite (9 tasks, distinct schemas)
- **WikiText** — language modeling on Wikipedia articles
- **CNN/DailyMail** — abstractive text summarization
- **GSM8K** — grade school math word problems

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">Each pairs the original arXiv paper PDF with its known-good Croissant output. Diff agent output against reality.</p>

<!--
The benchmark suite is how you test an extraction agent honestly. Each entry has the original academic paper as a PDF (sourced from arXiv) and a ground-truth Croissant file from HuggingFace — metadata that was created by humans who actually read the paper and understood the dataset. When pdf2croissant processes the SQuAD 2.0 paper, you can diff the output against HuggingFace's canonical Croissant file to see what the agent got right, what it missed, and what it tagged as low-confidence. This is more rigorous than synthetic test cases because real papers have real ambiguities: GLUE is a suite of nine tasks with distinct schemas, and GSM8K focuses on methodology more than dataset format.

Sources:
- https://github.com/jettyio/pdf2croissant/tree/main/benchmarks — five benchmark datasets with PDFs and ground-truth Croissant JSON-LD from HuggingFace
-->

---
transition: slide-left
---

# Intentional V1 Constraints

What pdf2croissant does not do — by design:

<v-clicks>

- **No batch processing** — one paper at a time, because each extraction needs full attention
- **No direct HuggingFace analysis** — the paper is the source, not the platform
- **No metadata editing UI** — review the output, but edit it in your own tools
- **No auth system** — single env var deployment, JETTY_API_TOKEN connects everything

</v-clicks>

<p v-click style="color: var(--deck-muted); font-size: 0.95rem; margin-top: 1.5rem;">These are scope decisions, not missing features. Each one keeps the system focused on doing one thing well.</p>

<!--
The V1 constraints are worth stating explicitly because they reveal what the project values. No batch processing means the agent gives full attention to each paper — it reads the entire PDF, cross-references sections, and validates thoroughly. Batch mode would incentivize shortcuts. No direct HuggingFace analysis means the paper is always the primary source; HuggingFace metadata is supplementary at best (step 3 of the runbook). No editing UI means the project stays focused on extraction and validation, not on building a metadata editor. No auth system means deployment is one env var — JETTY_API_TOKEN — with no user management overhead.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — V1 limitations and scope decisions
-->

---
layout: center
transition: fade
---

# Millions of datasets. Metadata trapped in prose. pdf2croissant reads the paper and extracts it.

<!--
One sentence to land the through-line. The deck opened with a JSON-LD fragment extracted from a PDF — the audience saw the output before they understood the problem. Slides 3-4 established the complication: Croissant is a good standard, but papers do not come with Croissant files. Slides 5-23 showed how the agent reads, extracts, validates, and documents gaps. Now we compress the entire argument into a single claim: the metadata exists, it is trapped in prose, and pdf2croissant extracts it.

Sources:
- https://github.com/jettyio/pdf2croissant/blob/main/README.md — project description
-->

---
layout: end
transition: fade
---

# From 20 pages of prose to validated JSON-LD.

github.com/jettyio/pdf2croissant

<!--
The closing resolves the in media res opening. Slide 2 showed a JSON-LD fragment — dataset name, description, license, creator — and said "this was extracted from a PDF." The audience spent 23 slides learning how: the Croissant standard, the extraction gap, the agent, the runbook, the validation pipeline, the confidence tags, the gap documentation. Now we close the loop: the metadata was always in the paper. It is now in the JSON. The JSON on slide 2 is this sentence made concrete.

Sources:
- https://github.com/jettyio/pdf2croissant — project repository
-->
