Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.
Updated
2026-02-24
Install
npx clawhub@latest install pymupdf-pdf-parser-clawdbot-skill
Documentation
PyMuPDF PDF
Overview
Parse PDFs locally using PyMuPDF for fast, lightweight extraction into Markdown by default, with optional JSON and image/table outputs in a per-document directory.
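As a rough illustration of the kind of extraction described above (not the skill's actual script), the following sketch pulls plain text from each page with PyMuPDF and joins it into Markdown. The function names `extract_page_texts` and `pages_to_markdown`, and the one-heading-per-page layout, are assumptions made for this example.

```python
from typing import Iterable, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the plain text of each page, in page order."""
    # Imported lazily so the formatting helper below works without PyMuPDF.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]


def pages_to_markdown(page_texts: Iterable[str]) -> str:
    """Join page texts into one Markdown string with a heading per page.

    The "## Page N" layout is an assumption for illustration only.
    """
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(sections)
```

Keeping the formatting step separate from the PyMuPDF call means the Markdown layout can be checked without a PDF on hand, for example `pages_to_markdown(extract_page_texts("/path/to/file.pdf"))`.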
Prereqs / when to read references
If you hit import errors (PyMuPDF not installed) or Nix libstdc++ issues, read:
- references/pymupdf-notes.md
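Before running the parser, it may help to confirm that PyMuPDF is importable at all. The check below is a minimal sketch; the helper name `pymupdf_available` is hypothetical and is not part of the skill.

```python
import importlib.util


def pymupdf_available() -> bool:
    """Return True if PyMuPDF can be located under either module name."""
    # PyMuPDF has historically been imported as "fitz"; newer releases
    # also expose a "pymupdf" module.
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pymupdf", "fitz")
    )


if __name__ == "__main__":
    if pymupdf_available():
        print("PyMuPDF found")
    else:
        print("PyMuPDF not found; see references/pymupdf-notes.md")
```

This only checks that the module can be located. A shared-library problem such as the Nix libstdc++ issue would still surface later, at actual import time.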
Quick start (single PDF)
Run from the skill directory:
./scripts/pymupdf_parse.py /path/to/file.pdf \
--format md \
--outroot ./pymupdf-output
Options
- --format md|json|both (default: md)
- --images to extract images
- --tables to extract a simple line-based table JSON (quick/rough)
- --outroot DIR to change output root
- --lang adds a language hint into JSON output metadata
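The options listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of scripts/pymupdf_parse.py: the function name `build_parser`, the positional argument name `pdf`, and the help strings are invented for illustration, and the `--outroot` default is inferred from the default output directory.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Parse a single PDF (sketch)")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md",
                        help="output format")
    parser.add_argument("--images", action="store_true",
                        help="extract images")
    parser.add_argument("--tables", action="store_true",
                        help="extract rough line-based tables as JSON")
    parser.add_argument("--outroot", default="./pymupdf-output", metavar="DIR",
                        help="output root directory")
    parser.add_argument("--lang", default=None,
                        help="language hint stored in JSON metadata")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```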
Output conventions
- Create ./pymupdf-output/<pdf-basename>/ by default.
- Markdown output: output.md
- JSON output: output.json (includes lang)
- Images: images/ subdir
- Tables: tables.json (rough line-based)
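The layout above can be expressed as a small path helper. This is a sketch: the function name `output_paths` is hypothetical, and it assumes that `<pdf-basename>` means the file name without its `.pdf` extension.

```python
from pathlib import Path
from typing import Dict


def output_paths(pdf_path: str, outroot: str = "./pymupdf-output") -> Dict[str, Path]:
    """Map each documented output to its location for one PDF."""
    # Assumption: <pdf-basename> is the file name minus its extension.
    doc_dir = Path(outroot) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "json": doc_dir / "output.json",
        "images": doc_dir / "images",
        "tables": doc_dir / "tables.json",
    }
```

For `/path/to/file.pdf` with the default root, every output lands under `pymupdf-output/file/`, so parsing several PDFs in turn does not overwrite earlier results.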
Notes
- PyMuPDF is fast but less robust on complex PDFs.
- For more robust parsing, use a heavy-duty OCR parser (e.g., MinerU) if installed.