Document Parsers for Agentic Workflows: LiteParse, LlamaParse, and the Tools That Actually Matter

In 2026, the bottleneck in most agentic pipelines isn't the LLM. It's the 47-page PDF it can't read correctly.

Six months into building an internal document intelligence system, we had a parsing pipeline that looked fine in staging. It processed PDFs, extracted text, chunked it, embedded it. The retrieval scores were acceptable. Then we started running it against our actual document library - quarterly financial reports, scanned vendor invoices, multi-column regulatory filings. Six weeks later, we found 200+ corrupted chunks in the vector store. The financial tables had been silently mangled: merged cells collapsed into single values, column headers detached from their data, multi-line entries concatenated without a separator. The agent downstream was answering questions about revenue using garbage. Not hallucinating - retrieving accurately from corrupted source material. The parser had succeeded technically and failed operationally. That experience is what this article is built on.

Document parsing is an orchestration problem, not an implementation problem. The question is not which tool extracts text best in isolation - it is which combination of tools reliably transforms heterogeneous inputs into a consistent structure that downstream agents can reason over. And the answer has changed significantly in the last eighteen months. Documents are no longer treated as files to be read. They are treated as multimodal artifacts - combinations of text, spatial layout, tables, charts, images, and implicit relational structure - to be semantically reconstructed. The parsers that win in 2026 benchmarks win because they use vision-language models to rebuild meaning, not because they have better character-extraction logic. That shift changes everything about how you pick a tool.

Part I: The Landscape Has Shifted

Benchmarks first - and their limits

The two most-cited benchmarks in this space are ParseBench (LlamaIndex, 2026) and SCORE-Bench (Unstructured, 2025). Before trusting either, note what they are: ParseBench was built and published by the team that sells LlamaParse. SCORE-Bench was built and published by Unstructured. There is no independent neutral benchmark for document parsing yet - the equivalent of MLPerf for inference or SWE-bench for coding agents does not exist here. Treat both as directional signals rather than verdicts, and weight them accordingly.

ParseBench is the more methodologically transparent of the two. It tests approximately 2,000 human-verified enterprise pages drawn from insurance, financial services, and government documents across five distinct dimensions: table structure recognition, chart understanding, content fidelity, semantic formatting preservation, and visual grounding. No tool tested wins on all five dimensions. That is the main finding - and it has more practical weight than any individual accuracy number.

Sources: ParseBench arXiv 2604.08538 (LlamaIndex, 2026) for LlamaParse; SCORE-Bench (Unstructured, 2025) for Unstructured CCT; Docling arXiv 2501.17887 (IBM, 2025) for FinTabNet TEDS; AWS Textract 2025 invoice line-item benchmark. Google Document AI ~95%+ and Azure DI 99%+ are vendor-reported figures from official documentation, not independently benchmarked studies. Scores are NOT directly comparable: they come from different benchmarks, different datasets, and different measurement surfaces.

What parsing now costs

Cost per page varies by three orders of magnitude across the field. Open-source local tools cost nothing beyond compute. Cloud agentic parsers running vision-language inference can reach $0.056 per page for the most capable modes. At 10 million pages per month - a number any mid-market enterprise document workflow reaches quickly - that difference is $560,000 in monthly variable cost versus near-zero. The routing decision is a financial architecture decision, not just a tooling preference.

Cloud parsers only; open-source tools (PyMuPDF, LiteParse, Docling) have no per-page cost and are excluded. Pricing as of June 2026. LlamaParse: 3 credits/page (CE) to 10 credits/page (Agentic) at $1.25/1,000 credits. Unstructured: $0.03/page pay-as-you-go. Azure DI: $0.015 (Read) to $0.10 (Custom Extraction) per page. AWS Textract: $0.0015 (text detection). Google Document AI: ~$0.015/page.
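
To make the arithmetic above concrete, here is a minimal sketch that turns the per-page prices quoted in this section into a monthly bill. The function name and the dictionary keys are illustrative and not part of any vendor SDK, and the prices are the figures quoted above rather than live rates.

```python
# Per-page prices (USD) as quoted in this section. Treat them as
# illustrative inputs, not live pricing.
PRICE_PER_PAGE = {
    "llamaparse_top_mode": 0.056,
    "unstructured_payg": 0.03,
    "azure_di_read": 0.015,
    "google_document_ai": 0.015,
    "aws_textract_text": 0.0015,
    "local_open_source": 0.0,  # compute cost is not modelled here
}


def monthly_cost(pages_per_month: int, price_per_page: float) -> float:
    """Variable parsing cost in USD for one month, rounded to cents."""
    return round(pages_per_month * price_per_page, 2)


if __name__ == "__main__":
    volume = 10_000_000
    # Print the most expensive option first.
    for tool, price in sorted(PRICE_PER_PAGE.items(), key=lambda kv: -kv[1]):
        print(f"{tool:22s} ${monthly_cost(volume, price):>14,.2f}")
```

Running the same function over your own monthly volume is usually enough to show whether routing is worth building at all.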

Speed at the ingestion layer

For synchronous document ingestion, throughput limits pipeline latency. PyMuPDF processes roughly 10 pages per second on commodity hardware - fast enough that a 200-page annual report reaches the downstream agent in about 20 seconds. Cloud parsers are bounded by API response time, not compute: LlamaParse typically returns results in 5 to 20 seconds per document depending on complexity, making it unsuitable for real-time ingestion paths but acceptable for batch workflows.

Speed figures: PyMuPDF from pymupdf4llm official benchmarks (8 PDFs, 7,031 pages test suite). LiteParse v2.0: claimed 100x faster than pure-Python alternatives (Rust rewrite). Docling: IBM arXiv 2501.17887, ~30x speedup vs naive OCR. Cloud tools: typical API response time converted to pages/sec equivalent for a 50-page document. Cloud figures are network-bounded and will vary by region and load.
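
The throughput figures translate into latency with one division. The sketch below is a back-of-envelope helper, with hypothetical function names, that checks whether a document of a given length fits a latency budget at a given parser throughput. It ignores queueing, network time, and cold starts.

```python
def ingest_seconds(pages: int, pages_per_second: float) -> float:
    """Estimated wall-clock seconds to parse a document at a steady throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


def fits_budget(pages: int, pages_per_second: float, budget_seconds: float) -> bool:
    """True when the estimated parse time is within the latency budget."""
    return ingest_seconds(pages, pages_per_second) <= budget_seconds


if __name__ == "__main__":
    # 200-page report at the ~10 pages/sec figure quoted for PyMuPDF.
    print(ingest_seconds(200, 10.0))
    print(fits_budget(200, 10.0, budget_seconds=30.0))
```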

Landscape at a glance

Tool	Best benchmark signal	Cost/page	Local	Open source	VLM-backed	Best for
LiteParse	100x faster than pure-Python (claimed)	Free	Yes	MIT	No	Speed-critical local ingestion, 50+ formats
LlamaParse	84.9% (ParseBench overall)	0.4¢ - 5.6¢	No	No	Yes	Complex multi-modal documents, RAG pipelines
Unstructured	88.3% CCT (SCORE-Bench)	3¢	Partial	Eval only	Yes	Enterprise connectors, mixed-format ingestion
Docling	TEDS 0.97 on FinTabNet tables	Free	Yes	MIT	Yes (local)	Financial tables, air-gapped environments
PyMuPDF	~10 pages/sec throughput	Free	Yes	AGPL	No	Latency-critical digital-native PDF extraction
Google Document AI	~95%+ on prebuilt processors	~1.5¢	No	No	Yes	200+ prebuilt form types, Vertex AI integration
Azure DI	99%+ on prebuilt models	1.5¢ - 10¢	No	No	Yes	Compliance, audit trail, regulated industries
AWS Textract	82% invoice line-item accuracy	0.15¢ - 6.5¢	No	No	Partial	AWS-native pipelines, invoice and ID extraction

Part II: Tool Deep-Dives

2.1 LiteParse

LiteParse is the open-source parser from the LlamaIndex team, rewritten in Rust in version 2.0. It is model-free: no LLM inference, no cloud call, no API key. The value proposition is simple - if your documents are digital-native and you need to process them at volume with minimal marginal cost, LiteParse gives you a consistent normalisation layer across 50+ formats without the latency or expense of a cloud pipeline. It integrates natively with LlamaIndex's document node abstraction.

from liteparse import Parser

parser = Parser()
result = parser.parse("annual_report.pdf")

for node in result.nodes:
    print(f"[{node.metadata.get('page', '?')}] {node.text[:120]}")

# Batch mode for directories
from pathlib import Path

docs = [str(p) for p in Path("./contracts/").glob("*.pdf")]
results = [parser.parse(doc) for doc in docs]
all_nodes = [node for r in results for node in r.nodes]
print(f"Ingested {len(docs)} documents → {len(all_nodes)} nodes")

One constraint worth noting: LiteParse's strength is digital-native documents. For scanned PDFs it falls back to Tesseract OCR, which is accurate enough for clean scans but struggles with degraded images, handwriting, and rotated content. If your document library is more than 20% scanned pages, LiteParse alone is not sufficient - you need a routing layer that escalates those documents to a more capable OCR or VLM backend.
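
The 20% rule of thumb above is easy to check against a real library before committing to a parser. The following is a minimal sketch, assuming you already have per-document page counts and a scanned-page count from some upstream detector; the `DocStats` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocStats:
    name: str
    total_pages: int
    scanned_pages: int  # assumed to come from an upstream scanned-page detector


def scanned_share(docs: list[DocStats]) -> float:
    """Fraction of all pages in the library that are scanned."""
    total = sum(d.total_pages for d in docs)
    if total == 0:
        return 0.0
    return sum(d.scanned_pages for d in docs) / total


def needs_escalation_layer(docs: list[DocStats], threshold: float = 0.20) -> bool:
    """True when the scanned share exceeds the threshold from the text."""
    return scanned_share(docs) > threshold


if __name__ == "__main__":
    library = [
        DocStats("annual_report.pdf", 80, 0),
        DocStats("vendor_invoices.pdf", 20, 20),
        DocStats("signed_contract.pdf", 10, 10),
    ]
    print(f"scanned share: {scanned_share(library):.1%}")
    print("routing layer needed:", needs_escalation_layer(library))
```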

2.2 LlamaParse

LlamaParse is the cloud parser from LlamaIndex, positioned at the opposite end of the tradeoff space from LiteParse. It uses vision-language models to semantically reconstruct documents rather than extract characters. The agentic mode - which scored 84.9% on ParseBench across 2,000 enterprise pages - applies multi-step VLM reasoning to extract tables, charts, and complex layouts. The output is clean markdown with heading hierarchy preserved, which directly improves chunking quality downstream: a retrieval system that chunks on heading boundaries finds more semantically coherent passages than one chunking on character count.

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

parser = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    # Agentic mode - most accurate, 10 credits/page
    parsing_instruction=(
        "Extract all tables preserving row/column structure. "
        "Identify section headings and maintain hierarchy. "
        "Flag any charts or figures with [CHART: description]."
    ),
)

file_extractor = {".pdf": parser, ".docx": parser, ".pptx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query(
    "What was the gross margin in Q3 and how did it compare to Q2?"
)
print(response)

Two operational limits: LlamaParse Agentic times out on documents above roughly 200 pages in a single API call - split long documents before submission. And the 48-hour cache means re-parsing the same document within two days is free, which matters for development iteration but should not be relied on as a cost model in production.
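
Splitting long documents before submission starts with deciding where to cut. The sketch below only computes the page ranges; the 200-page default mirrors the approximate limit described above, the function name is hypothetical, and writing each range out as its own PDF is left to whichever PDF library you already use.

```python
def split_page_ranges(total_pages: int, max_pages: int = 200) -> list[tuple[int, int]]:
    """Return half-open (start, end) page ranges, each at most max_pages long."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages <= 0:
        raise ValueError("max_pages must be positive")
    return [
        (start, min(start + max_pages, total_pages))
        for start in range(0, total_pages, max_pages)
    ]


if __name__ == "__main__":
    # A 450-page filing becomes three submissions.
    for start, end in split_page_ranges(450):
        print(f"pages {start}..{end - 1}")
```

Cutting on fixed page counts can split a table that spans a boundary, so it is worth re-checking tables near each cut after parsing.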

2.3 Unstructured.io

Unstructured is the enterprise-tier option in the open-source-adjacent space. The platform positions itself as an end-to-end pipeline rather than a parser: it handles 70+ document types, integrates with S3, SharePoint, Confluence, Salesforce, and dozens of other source connectors, and outputs structured element trees (Title, NarrativeText, Table, Image, Header) rather than raw text. On SCORE-Bench - its own benchmark against 1,000+ expert-annotated enterprise pages - it achieved an adjusted CCT of 0.883 using its VLM partitioner.

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
from unstructured.documents.elements import Table, Title, NarrativeText

elements = partition(
    filename="q4_earnings_report.pdf",
    strategy="hi_res",           # model-based layout detection, slowest but most accurate
    infer_table_structure=True,  # reconstruct table HTML
    languages=["eng"],
)

# Inspect element types
for el in elements[:10]:
    print(f"{el.category:20s} | {str(el)[:80]}")

# Extract only tables as structured JSON
tables = [e for e in elements if isinstance(e, Table)]
print(f"Found {len(tables)} tables")
print(elements_to_json(tables, indent=2))

The hi_res strategy is the correct choice for enterprise documents, but it is slow - roughly 0.8 pages per second. For high-volume ingestion, use the fast strategy for digital-native documents and reserve hi_res for documents flagged as complex or scanned. Unstructured's API pricing of $0.03 per page is above the entry-level rates of the other cloud parsers listed in Part I; the self-hosted option requires a paid enterprise license.
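
The fast/hi_res split described above can be expressed as a small planning step that runs before any partitioning. This is a sketch under assumptions: the `is_scanned` and `is_complex` flags are hypothetical outputs of your own upstream checks, and only the strategy names `fast` and `hi_res` come from Unstructured itself.

```python
from dataclasses import dataclass


@dataclass
class IngestItem:
    filename: str
    is_scanned: bool  # hypothetical flag from an upstream detector
    is_complex: bool  # hypothetical flag, e.g. dense tables or multi-column layout


def choose_strategy(item: IngestItem) -> str:
    """Return the partition strategy name to use for one document."""
    return "hi_res" if (item.is_scanned or item.is_complex) else "fast"


def plan_batch(items: list[IngestItem]) -> dict[str, list[str]]:
    """Group filenames by the strategy they should be partitioned with."""
    plan: dict[str, list[str]] = {"fast": [], "hi_res": []}
    for item in items:
        plan[choose_strategy(item)].append(item.filename)
    return plan


if __name__ == "__main__":
    batch = [
        IngestItem("memo.pdf", is_scanned=False, is_complex=False),
        IngestItem("q4_earnings_report.pdf", is_scanned=False, is_complex=True),
        IngestItem("signed_form.pdf", is_scanned=True, is_complex=False),
    ]
    print(plan_batch(batch))
```

Each filename in the plan would then be passed to `partition(...)` with the matching `strategy` value.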

2.4 Docling (IBM)

Docling is IBM's open-source document conversion library, MIT-licensed, trained on approximately 81,000 labeled pages. It runs entirely locally and produces clean markdown or JSON output. On IBM's own evaluation against the FinTabNet benchmark - a dataset of financial tables from S&P 500 annual reports - Docling achieved a TEDS score of 0.97 after recent model improvements. TEDS (Tree Edit Distance-based Similarity) measures table structure reconstruction fidelity: a score of 0.97 means the reconstructed table structure differs from ground truth by only 3% on average across a challenging financial document set. For pipelines that need accurate financial table extraction without cloud dependency, Docling is the strongest open-source option available.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True       # enable OCR for mixed digital/scanned
pipeline_options.do_table_structure = True

# PDF pipeline options are passed per input format via format_options
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX, InputFormat.XLSX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    },
)

result = converter.convert("portfolio_companies_2025.pdf")

# Export to markdown (preserves heading hierarchy)
markdown = result.document.export_to_markdown()

# Export to a JSON-serialisable dict
doc_json = result.document.export_to_dict()

# Tables are exposed directly on the document object
tables = result.document.tables
print(f"Extracted {len(tables)} tables")
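
For readers who want the TEDS number unpacked: the metric is defined as one minus the tree edit distance between the predicted and ground-truth table trees, divided by the node count of the larger tree. The sketch below shows only that final normalisation. It takes the edit distance as an input, because computing it needs a tree edit distance algorithm that is not shown here, and the function name is illustrative.

```python
def teds(edit_distance: float, nodes_pred: int, nodes_true: int) -> float:
    """TEDS = 1 - EditDist(T_pred, T_true) / max(|T_pred|, |T_true|).

    edit_distance is assumed to be precomputed by a tree edit distance
    algorithm over the HTML trees of the two tables.
    """
    larger = max(nodes_pred, nodes_true)
    if larger == 0:
        return 1.0  # two empty tables are treated as identical
    return 1.0 - edit_distance / larger


if __name__ == "__main__":
    # Three edit operations on a table whose larger tree has 100 nodes.
    print(f"{teds(3, nodes_pred=100, nodes_true=98):.2f}")
```

Because the score is normalised by tree size, the same three errors cost far more on a small table than on a large one, which is worth remembering when comparing averages across datasets.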

2.5 PyMuPDF4LLM

PyMuPDF is the speed benchmark for document parsing. The C-based library processes roughly 10 pages per second on commodity hardware and returns markdown-formatted text with layout-aware chunking. At that rate, a 200-page annual report completes in about 20 seconds without any network call. The LLM-optimised wrapper (pymupdf4llm) adds heading detection and table formatting on top of raw extraction. For latency-critical pipelines - where a document arrives and an agent must respond within seconds - PyMuPDF is the only viable option at the local tier.

import pymupdf4llm
from pathlib import Path

# Basic extraction - returns markdown string
md_text = pymupdf4llm.to_markdown("board_deck_q2.pdf")

# Page-level extraction with metadata
pages = pymupdf4llm.to_markdown(
    "board_deck_q2.pdf",
    page_chunks=True,         # returns list of dicts, one per page
    write_images=False,       # skip image extraction for speed
    show_progress=False,
)

for page in pages[:3]:
    print(f"Page {page['metadata']['page']}: {len(page['text'])} chars")

# Targeted extraction: specific page ranges
subset = pymupdf4llm.to_markdown(
    "board_deck_q2.pdf",
    pages=[0, 1, 2],          # cover + exec summary only
    page_chunks=True,
)
print(f"Extracted {len(subset)} pages in targeted mode")

The tradeoff is fidelity on complex layouts. PyMuPDF reads embedded text coordinates and reconstructs reading order heuristically. On single-column documents with simple tables, this is excellent. On multi-column academic papers, complex financial layouts with spanning cells, or any scanned content, it degrades. PyMuPDF belongs in the fast tier of your routing architecture, not as the sole parser for a heterogeneous document library.

2.6 Google Document AI

Google Document AI is the most widely deployed enterprise document parser in production by page volume, processing billions of pages annually across Google Workspace, Google Cloud customers, and internal systems. It offers 200+ specialized processors - purpose-built models for invoices, receipts, contracts, payslips, driver's licenses, tax forms, and more - each fine-tuned on domain-specific training data. On standard enterprise document types within its prebuilt processor coverage, it achieves approximately 95%+ field extraction accuracy. The native integration with Vertex AI pipelines, BigQuery, and Google Cloud Storage makes it the obvious choice for teams already operating on GCP.

from google.cloud import documentai
from google.api_core.client_options import ClientOptions

project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"   # e.g. invoice processor

opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)
name = client.processor_path(project_id, location, processor_id)

with open("vendor_invoice_2025_q4.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(),
        mime_type="application/pdf",
    )

request = documentai.ProcessRequest(name=name, raw_document=raw_document)
result = client.process_document(request=request)
document = result.document

# Extract structured fields with confidence scores
for entity in document.entities:
    print(
        f"{entity.type_:30s} | "
        f"{entity.mention_text:40s} | "
        f"confidence: {entity.confidence:.2%}"
    )

# Access page-level text with layout information
for page in document.pages:
    for block in page.blocks:
        vertices = [(v.x, v.y) for v in block.layout.bounding_poly.vertices]
        text_anchor = block.layout.text_anchor
        block_text = document.text[
            text_anchor.text_segments[0].start_index:
            text_anchor.text_segments[0].end_index
        ]
        print(f"Block at {vertices[0]}: {block_text[:60]}")

One gap: Google Document AI's handwriting processor is separate from its printed-text processors. Mixed documents - printed forms with handwritten annotations - require routing through both processors and merging results. This is rarely documented clearly and is the most common source of incomplete extraction in legal and medical document workflows.

2.7 Azure Document Intelligence

Azure Document Intelligence (formerly Form Recognizer) covers 30+ prebuilt models across invoices, receipts, identity documents, tax forms, contracts, and more. On its prebuilt model set, Microsoft reports 99%+ field extraction accuracy. The compliance story is the strongest in the market: Azure DI supports SOC 2, ISO 27001, HIPAA, and GDPR, making it the default choice for regulated industries in Europe and North America. Version 4.0 added figure extraction, semantic chunking, and improved multi-page table handling. For any pipeline where auditability is a requirement, Azure DI generates a structured JSON response that can be logged verbatim as evidence of extraction.

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

endpoint = "https://your-resource.cognitiveservices.azure.com/"
key = "your-api-key"

client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

with open("regulatory_filing_2025.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        "prebuilt-layout",    # use prebuilt-invoice, prebuilt-contract etc for domain
        analyze_request=f,
        content_type="application/octet-stream",
        output_content_format="markdown",  # v4.0: direct markdown output
    )

result = poller.result()

# Extract all tables with cell-level confidence
for table_idx, table in enumerate(result.tables):
    print(f"Table {table_idx}: {table.row_count} rows x {table.column_count} cols")
    for cell in table.cells:
        print(f"  [{cell.row_index},{cell.column_index}] {cell.content}")

# Named field extraction (for domain-specific prebuilt models)
if result.documents:
    for field_name, field in result.documents[0].fields.items():
        if field.value_string:
            print(f"{field_name}: {field.value_string} ({field.confidence:.2%})")

Part III: What Parsers Actually Return

Benchmark scores tell you one thing. The raw output your agent receives tells you something more important: whether the downstream chunking strategy you have built will work with this parser's output format. The same source document - a paragraph followed by a financial table - produces structurally different representations across tools. That difference determines chunking quality, retrieval precision, and ultimately agent accuracy.

Source document used below: one paragraph of narrative text followed by a four-column revenue table (Q1-Q4, three business units).

LlamaParse Agentic output:

## Revenue Performance by Business Unit

Group revenue grew 18% year-on-year in FY2025, driven primarily by
strong performance in the Enterprise segment and margin expansion
in Professional Services.

| Business Unit   | Q1 (£M) | Q2 (£M) | Q3 (£M) | Q4 (£M) |
|-----------------|---------|---------|---------|---------|
| Enterprise      | 12.4    | 13.1    | 14.8    | 16.2    |
| SMB             | 4.2     | 4.0     | 3.9     | 4.5     |
| Prof. Services  | 2.1     | 2.3     | 2.4     | 2.8     |
| **Total**       | **18.7**| **19.4**| **21.1**| **23.5**|

Docling output:

## Revenue Performance by Business Unit

Group revenue grew 18% year-on-year in FY2025, driven primarily by
strong performance in the Enterprise segment and margin expansion
in Professional Services.

|Business Unit|Q1 (£M)|Q2 (£M)|Q3 (£M)|Q4 (£M)|
|---|---|---|---|---|
|Enterprise|12.4|13.1|14.8|16.2|
|SMB|4.2|4.0|3.9|4.5|
|Prof. Services|2.1|2.3|2.4|2.8|
|Total|18.7|19.4|21.1|23.5|

PyMuPDF4LLM output:

Revenue Performance by Business Unit

Group revenue grew 18% year-on-year in FY2025, driven primarily by strong performance in the
Enterprise segment and margin expansion in Professional Services.

Business Unit Q1 (£M) Q2 (£M) Q3 (£M) Q4 (£M)
Enterprise 12.4 13.1 14.8 16.2
SMB 4.2 4.0 3.9 4.5
Prof. Services 2.1 2.3 2.4 2.8
Total 18.7 19.4 21.1 23.5

The practical consequence: LlamaParse and Docling outputs can be chunked at heading boundaries using standard markdown parsers, giving the retrieval system semantically coherent passages. The PyMuPDF output shown here flattens the heading into body text and loses the table structure, which means character-count chunking will split the table arbitrarily - potentially separating "Q3 (£M)" from its values. An agent asked "what was Enterprise revenue in Q3?" and retrieving a PyMuPDF-chunked passage may receive a chunk containing only the header row, not the data. This is the output contract problem: the right parser choice depends on what your agent does with the output, not just on extraction accuracy scores.
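
Heading-boundary chunking, as described above, can be sketched in a few lines. This is a deliberately naive version for illustration, not how any particular framework implements it: it splits on ATX headings only and does not handle setext headings or `#` lines inside fenced code blocks. The function name is hypothetical.

```python
import re

# One to six '#' characters followed by whitespace marks an ATX heading.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(markdown: str) -> list[str]:
    """Split markdown into chunks that each start at a heading.

    Text before the first heading becomes its own chunk. Tables stay
    inside the chunk of the heading they follow, so header rows are
    never separated from their data rows.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


if __name__ == "__main__":
    sample = (
        "## Revenue\n\nGroup revenue grew.\n\n"
        "|Unit|Q1|\n|---|---|\n|Enterprise|12.4|\n\n"
        "## Outlook\n\nGuidance unchanged."
    )
    for chunk in chunk_by_heading(sample):
        print(repr(chunk[:40]))
```

Run against the plain-text PyMuPDF sample above, the same function returns a single chunk, because there is no heading marker to split on.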

Part IV: Silent Failure Modes

The most operationally damaging parser failures are not crashes or errors. They are cases where the parser returns something that looks correct but isn't - and the retrieval system indexes it faithfully, poisoning every downstream query that touches that content. These are the failure modes that took six weeks to surface in the pipeline described at the start. None of them appeared in any benchmark study.

Tool	Silent failure mode	Detection strategy	Mitigation
PyMuPDF	Garbled Unicode on CJK PDFs; ligature characters (fi, fl, ffi) silently dropped or rendered as □	Compare character count pre/post extraction; scan output for replacement characters (U+FFFD, □)	Route CJK or ligature-heavy documents to Docling or cloud OCR
LlamaParse	Hallucinates merged table cells on complex spanning layouts; reconstructed values may not match source	Cross-check row and column counts against source PDF metadata; validate numeric sums	Use `strict_mode=True`; add post-parse numeric validation for financial tables
Docling	Misses reading order on multi-column academic papers; left and right columns interleaved	Test on representative multi-column samples; check that paragraph sequence is coherent	Force single-column mode for academic PDFs; or route to LlamaParse for layout-complex sources
Unstructured	Categorises section headers as NarrativeText in dense technical documents, losing structure signal	Inspect element category distribution on representative sample pages	Use `hi_res` strategy; enable `infer_table_structure=True`; post-process with heading classifier
Google Document AI	Misses handwritten annotations on printed forms when using printed-text processor	Test with mixed printed/handwritten samples before production rollout	Route to handwriting-specific processor; merge results from both processors for hybrid documents
Azure DI	Incorrect field mapping on non-standard invoice layouts; custom fields labelled as nearest prebuilt equivalent	Run against a validation set of atypical documents, not just standard invoice formats	Fine-tune with custom extraction model using labeled examples of non-standard layouts

The underlying pattern is consistent across all six cases: the parser was evaluated on documents it handles well and deployed against a broader population that included documents it handles poorly. Every production deployment needs a validation set that represents the tail of the document distribution, not just the happy path. Build that set before you choose a parser, not after you find corrupted chunks.
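
Two of the detection strategies in the table, validating numeric sums and scanning for replacement characters, are cheap enough to run on every parsed document. The sketch below is one possible shape for those checks, not a complete validator. It assumes a rectangular markdown table whose data cells are plain numbers and whose totals row is labelled "Total"; all function names are hypothetical.

```python
def parse_markdown_table(table_md: str) -> list[list[str]]:
    """Parse a pipe-delimited markdown table into rows of cell strings."""
    rows = []
    for line in table_md.strip().splitlines():
        # Drop outer pipes, split on inner pipes, strip bold markers.
        cells = [c.strip().strip("*") for c in line.strip().strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---|
        rows.append(cells)
    return rows


def total_row_mismatches(
    rows: list[list[str]], total_label: str = "Total", tol: float = 0.05
) -> list[str]:
    """Return headers of columns whose totals row disagrees with the column sum."""
    header, body = rows[0], rows[1:]
    total = next((r for r in body if r[0] == total_label), None)
    if total is None:
        return []  # nothing to validate against
    data = [r for r in body if r is not total]
    bad = []
    for col in range(1, len(header)):
        expected = sum(float(r[col]) for r in data)
        if abs(expected - float(total[col])) > tol:
            bad.append(header[col])
    return bad


def has_replacement_chars(text: str) -> bool:
    """True if the text contains U+FFFD or the white-square glyph U+25A1."""
    return "\ufffd" in text or "\u25a1" in text


if __name__ == "__main__":
    table = (
        "|Business Unit|Q1|Q2|\n|---|---|---|\n"
        "|Enterprise|12.4|13.1|\n|SMB|4.2|4.0|\n"
        "|Prof. Services|2.1|2.3|\n|Total|18.7|19.4|"
    )
    print(total_row_mismatches(parse_markdown_table(table)))
```

A non-empty result is a reason to quarantine the chunk and re-parse with a different tier, rather than proof of which value is wrong.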

Part V: The Orchestration Layer

The right mental model is not "pick the best parser." It is "build a routing layer that sends each document to the cheapest parser capable of handling it correctly." Most documents in a typical enterprise library are digital-native PDFs or Office files that PyMuPDF or LiteParse handles well at near-zero cost. A minority are scanned, multi-modal, or structurally complex and need a VLM-backed tool. A smaller minority require compliance-grade processing with an audit trail. The economics only work if you route correctly.

Relative capability scores normalised to 0-10 within each dimension. Accuracy scores are not directly comparable across benchmarks - see Part I. Speed, Cost-Efficiency, and Local Execution are relative rankings within this set. Use this chart to understand capability shape, not to derive a single winner.

The following orchestrator implements the routing logic with three additions that production systems require but tutorials omit: a robust scanned-page detector that samples multiple pages rather than trusting the first, structured logging with input hashes for post-hoc debugging, and an async batch interface for concurrent document processing.

import asyncio
import hashlib
import time
import logging
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


class ParserTier(Enum):
    LOCAL_FAST = "local_fast"          # PyMuPDF: digital PDFs, latency-critical
    LOCAL_ACCURATE = "local_accurate"  # Docling: tables, financial, air-gapped
    CLOUD_AGENTIC = "cloud_agentic"   # LlamaParse: complex, multi-modal
    CLOUD_MANAGED = "cloud_managed"   # Azure DI / Google Doc AI: compliance


@dataclass
class ParsedDocument:
    text: str
    tables: list[dict]
    metadata: dict
    source: str
    parser_used: str
    confidence: Optional[float] = None
    parse_latency_ms: float = 0.0
    input_hash: str = ""


class DocumentParserOrchestrator:
    """Routes documents to the appropriate parser tier based on format and SLA.

    Tier selection priority:
      1. compliance_mode → always CLOUD_MANAGED
      2. simple formats (.txt, .md, .csv) → LOCAL_FAST
      3. scanned pages detected + cloud available → CLOUD_AGENTIC
      4. all other complex formats → LOCAL_ACCURATE
    """

    FAST_FORMATS = {".txt", ".md", ".html", ".csv", ".json"}

    def __init__(self, use_cloud: bool = True, compliance_mode: bool = False):
        self.use_cloud = use_cloud
        self.compliance_mode = compliance_mode

    def _input_hash(self, path: Path) -> str:
        """SHA-256 of file content, first 16 chars - stable identifier for debugging."""
        return hashlib.sha256(path.read_bytes()).hexdigest()[:16]

    def _detect_scanned(self, path: Path) -> bool:
        """Sample the first 5 pages and count text blocks.

        Checking only page 0 produces false positives on documents with
        image-only cover pages or blank introductory pages. Sampling 5 pages
        and counting text blocks is more reliable for mixed-content PDFs.
        """
        if path.suffix.lower() != ".pdf":
            return False
        import pymupdf
        with pymupdf.open(str(path)) as doc:
            sample_count = min(len(doc), 5)
            # get_text("dict") also returns image blocks (type 1), which is
            # all a scanned page contains, so count text blocks (type 0) only.
            text_blocks = sum(
                1
                for i in range(sample_count)
                for block in doc[i].get_text("dict").get("blocks", [])
                if block.get("type") == 0
            )
        return text_blocks == 0

    def _select_tier(self, path: Path, has_scanned_pages: bool) -> ParserTier:
        if self.compliance_mode:
            return ParserTier.CLOUD_MANAGED
        if path.suffix.lower() in self.FAST_FORMATS:
            return ParserTier.LOCAL_FAST
        if has_scanned_pages and self.use_cloud:
            return ParserTier.CLOUD_AGENTIC
        return ParserTier.LOCAL_ACCURATE

    def parse(self, file_path: str) -> ParsedDocument:
        path = Path(file_path)
        input_hash = self._input_hash(path)
        has_scanned = self._detect_scanned(path)
        tier = self._select_tier(path, has_scanned)

        t0 = time.perf_counter()
        result = self._dispatch(path, tier)
        result.parse_latency_ms = (time.perf_counter() - t0) * 1000
        result.input_hash = input_hash

        logger.info(
            "document_parsed",
            extra={
                "input_hash": input_hash,
                "tier": tier.value,
                "latency_ms": round(result.parse_latency_ms, 1),
                "pages": result.metadata.get("pages"),
                "source": str(path.name),
            },
        )
        return result

    def _dispatch(self, path: Path, tier: ParserTier) -> ParsedDocument:
        handlers = {
            ParserTier.LOCAL_FAST: self._parse_pymupdf,
            ParserTier.LOCAL_ACCURATE: self._parse_docling,
            ParserTier.CLOUD_AGENTIC: self._parse_llamaparse,
            ParserTier.CLOUD_MANAGED: self._parse_azure,
        }
        return handlers[tier](path)

    def _parse_pymupdf(self, path: Path) -> ParsedDocument:
        import pymupdf4llm
        text = pymupdf4llm.to_markdown(str(path))
        return ParsedDocument(
            text=text, tables=[], metadata={"parser": "pymupdf"},
            source=str(path), parser_used="pymupdf",
        )

    def _parse_docling(self, path: Path) -> ParsedDocument:
        from docling.document_converter import DocumentConverter
        result = DocumentConverter().convert(str(path))
        return ParsedDocument(
            text=result.document.export_to_markdown(),
            tables=[],
            metadata={"parser": "docling"},
            source=str(path),
            parser_used="docling",
        )

    def _parse_llamaparse(self, path: Path) -> ParsedDocument:
        from llama_parse import LlamaParse
        parser = LlamaParse(result_type="markdown", use_vendor_multimodal_model=True)
        docs = parser.load_data(str(path))
        return ParsedDocument(
            text="\n\n".join(d.text for d in docs),
            tables=[], metadata={"parser": "llamaparse"},
            source=str(path), parser_used="llamaparse",
        )

    def _parse_azure(self, path: Path) -> ParsedDocument:
        import os
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.core.credentials import AzureKeyCredential
        client = DocumentIntelligenceClient(
            os.environ["AZURE_DI_ENDPOINT"],
            AzureKeyCredential(os.environ["AZURE_DI_KEY"]),
        )
        with open(path, "rb") as f:
            poller = client.begin_analyze_document(
                "prebuilt-layout", analyze_request=f,
                content_type="application/octet-stream",
                output_content_format="markdown",
            )
        result = poller.result()
        return ParsedDocument(
            text=result.content, tables=[],
            metadata={"parser": "azure-di"},
            source=str(path), parser_used="azure-di",
        )


async def parse_batch(
    paths: list[str],
    orchestrator: DocumentParserOrchestrator,
) -> list[ParsedDocument | Exception]:
    """Fan-out document parsing concurrently.

    Uses asyncio.to_thread so each synchronous parser runs in the thread pool
    without blocking the event loop. Return exceptions rather than raising so
    one failed document doesn't abort the batch.
    """
    tasks = [asyncio.to_thread(orchestrator.parse, p) for p in paths]
    return await asyncio.gather(*tasks, return_exceptions=True)


# Usage
async def main():
    orchestrator = DocumentParserOrchestrator(use_cloud=True, compliance_mode=False)
    paths = [
        "reports/q4_2025.pdf",
        "contracts/vendor_agreement.docx",
        "data/metrics_export.csv",
    ]
    results = await parse_batch(paths, orchestrator)
    for path, result in zip(paths, results):
        if isinstance(result, Exception):
            logger.error("parse_failed", extra={"path": path, "error": str(result)})
        else:
            print(f"{path}: {result.parser_used} | {result.parse_latency_ms:.0f}ms")


if __name__ == "__main__":
    asyncio.run(main())

Part VI: Decision Playbook

Every routing decision reduces to four questions: What is the document format? Is it scanned or digital-native? What does the downstream agent need from the output? And what are the cost and compliance constraints? The table below maps these to tool choices.

Scenario	Recommended tool	Why	Breaks when
Zero marginal cost, digital-native PDFs	LiteParse or PyMuPDF	Open-source, local, MIT/AGPL, no per-page cost	Documents are scanned or contain complex multi-column layouts
Latency SLA under 200ms/page	PyMuPDF4LLM	~10 pages/sec, CPU-only, no network call	Input contains CJK text, ligatures, or scanned content
Financial tables requiring structural fidelity	Docling	TEDS 0.97 on FinTabNet; local, MIT-licensed	Multi-column academic layouts; rotated or skewed scans
Scanned or image-heavy PDFs	LlamaParse Agentic	VLM reconstruction; 84.9% ParseBench overall	Documents exceed ~200 pages (API timeout); real-time SLA requirements
Mixed enterprise formats at scale with connectors	Unstructured.io	70+ source connectors; element-level structured output	Dense technical documents where headers are miscategorised as body text
GCP-native pipeline with prebuilt form types	Google Document AI	200+ processors; native Vertex AI and BigQuery integration	Mixed printed/handwritten documents with a single processor
Compliance, audit trail, regulated industry	Azure Document Intelligence	99%+ prebuilt accuracy; SOC 2, HIPAA, GDPR certified	Non-standard invoice layouts outside prebuilt model training distribution
Air-gapped or compute-cost-constrained environment	Docling or LiteParse	Zero inference cost post-setup; no cloud dependency	Scanned content without a local OCR backend configured
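The four questions and the table can be condensed into a small routing function. This is a minimal sketch, not the orchestrator's actual logic: `DocumentProfile`, `choose_parser`, and the returned tool labels are hypothetical names introduced here, and the ordering of the checks (compliance first, then scan status, then output needs) is one reasonable reading of the table rather than the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical answers to the four routing questions for one document."""
    fmt: str                      # file format, e.g. "pdf", "docx", "csv"
    scanned: bool                 # scanned image vs. digital-native
    needs_table_fidelity: bool    # downstream agent reasons over tables
    regulated: bool               # compliance / audit constraints apply
    cloud_allowed: bool           # cost and data-residency rules permit cloud calls
    latency_sensitive: bool = False


def choose_parser(p: DocumentProfile) -> str:
    """Return a tool label following the decision table (illustrative only)."""
    # Compliance overrides everything else: regulated documents always
    # go to the compliance-grade parser regardless of format.
    if p.regulated:
        return "azure-document-intelligence"
    # Non-PDF enterprise formats go to the element-level parser.
    if p.fmt.lower() != "pdf":
        return "unstructured"
    # Scanned PDFs need VLM reconstruction, or a local parser with an
    # OCR backend configured when cloud calls are not allowed.
    if p.scanned:
        return "llamaparse-agentic" if p.cloud_allowed else "docling"
    # Digital-native PDFs: pick by what the downstream agent needs.
    if p.needs_table_fidelity:
        return "docling"
    if p.latency_sensitive:
        return "pymupdf4llm"
    return "liteparse"


if __name__ == "__main__":
    report = DocumentProfile("pdf", scanned=False, needs_table_fidelity=True,
                             regulated=False, cloud_allowed=True)
    print(choose_parser(report))
```

Putting the compliance check first mirrors the rule that regulated documents bypass format-based routing entirely. The "Breaks when" column is deliberately not encoded here; those failure modes are better caught by post-parse validation than by up-front routing.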

One factor that belongs in the decision matrix and rarely appears: compute cost at scale. Cloud parsers running vision-language inference are not just expensive per page - they represent continuous GPU inference running in a third-party data centre. LlamaParse Agentic at $0.012 per page across 10 million monthly pages is $120,000 per month in variable cost, plus a persistent inference workload you do not control. Local parsers - Docling, LiteParse, PyMuPDF - have near-zero marginal compute cost after initial model loading: the hardware is yours, so each additional document costs almost nothing. For teams with cost ceilings, data residency requirements, or simply a preference for predictable infrastructure economics, local-first is not a concession - it is the correct architecture.
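The arithmetic behind that figure is easy to reproduce and to rerun with your own volumes. The sketch below uses the per-page price and page count quoted above; the function names and the local fixed-cost figure are hypothetical, and the fixed cost in particular is an illustrative placeholder rather than a measured number.

```python
from decimal import Decimal


def monthly_variable_cost(pages_per_month: int, price_per_page: Decimal) -> Decimal:
    """Variable cloud cost for one month at a flat per-page price."""
    return price_per_page * pages_per_month


def break_even_pages(fixed_monthly_cost: Decimal, price_per_page: Decimal) -> int:
    """Monthly page volume at which a fixed-cost local setup matches the cloud bill."""
    return int(fixed_monthly_cost / price_per_page)


# Figures quoted in the text: $0.012 per page, 10 million pages per month.
PRICE_PER_PAGE = Decimal("0.012")
cost = monthly_variable_cost(10_000_000, PRICE_PER_PAGE)
print(f"${cost:,.0f} per month")  # → $120,000 per month

# Illustrative placeholder only: substitute your own amortised hardware and ops cost.
HYPOTHETICAL_LOCAL_FIXED_COST = Decimal("6000")
print(break_even_pages(HYPOTHETICAL_LOCAL_FIXED_COST, PRICE_PER_PAGE), "pages/month")
```

`Decimal` is used instead of floats so that per-page prices with three decimal places multiply out exactly.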

The architecture that emerges from these constraints is deliberately tiered. The happy path - digital-native documents, standard formats, speed-sensitive pipelines - runs entirely local and costs nothing per document. The complex path - scanned content, multi-modal layouts, documents requiring structural fidelity - escalates to cloud tools proportional to complexity. The compliance path is isolated from both: every regulated document routes to the compliance-grade parser regardless of format, and every extraction result is logged with its input hash for auditability. Three tiers, three cost profiles, one orchestrator.

The teams that get document parsing right are not the ones that found the highest benchmark score and deployed it everywhere. They are the ones that mapped their actual document distribution, measured failure rates on the tail, built routing logic that matches tool capability to document complexity, and instrumented every parse operation so that when corruption surfaces six weeks later, they can trace it to a specific file, a specific parser call, and a specific input hash.
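Tracing a corrupted chunk back to a file, a parser call, and an input hash requires that all three are written down at parse time. A minimal sketch of such an audit record follows, using only the standard library. The helper names and record layout are assumptions made for illustration; `parser_used` and `parse_latency_ms` mirror the field names used in the orchestrator code earlier.

```python
import hashlib
import json
import logging
import time
from pathlib import Path

logger = logging.getLogger("parse_audit")


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the raw input bytes in chunks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_audit_record(path: Path, parser_used: str, parse_latency_ms: float) -> dict:
    """Assemble one audit entry tying a parse call to the exact input bytes."""
    return {
        "source": str(path),
        "input_sha256": sha256_of_file(path),
        "parser_used": parser_used,
        "parse_latency_ms": round(parse_latency_ms, 1),
        "logged_at": time.time(),
    }


def log_parse(path: Path, parser_used: str, parse_latency_ms: float) -> dict:
    """Emit the record as a single JSON line and return it to the caller."""
    record = build_audit_record(path, parser_used, parse_latency_ms)
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Storing the same `input_sha256` in each chunk's vector-store metadata is what makes the trace work in reverse: a bad chunk leads to the hash, and the hash leads to the log line naming the parser that produced it.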

Sources

ParseBench - LlamaIndex (2026). "ParseBench: A Comprehensive Benchmark for Document Parsing." arXiv:2604.08538. arxiv.org/abs/2604.08538
ParseBench Dataset - LlamaIndex (2026). GitHub repository with 2,000 human-verified enterprise pages. github.com/run-llama/ParseBench
Docling Technical Paper - IBM Research (2025). "Docling: An Efficient Open-Source Toolkit for AI-Powered Document Conversion." arXiv:2501.17887. arxiv.org/pdf/2501.17887
SCORE-Bench - Unstructured (2025). "Introducing SCORE-Bench: An Open Benchmark for Document Parsing." unstructured.io/blog/score-bench
LlamaParse ParseBench Results - LlamaIndex (2026). Official blog post with benchmark methodology and per-dimension scores. llamaindex.ai/blog/parsebench
LiteParse - LlamaIndex (2025). GitHub repository and developer documentation. github.com/run-llama/liteparse | developers.llamaindex.ai/liteparse
Unstructured Benchmark Detail - Unstructured (2025). "Unstructured Leads in Document Parsing Quality: Benchmarks Tell the Full Story." unstructured.io/blog/benchmarks
Azure Document Intelligence Pricing - Microsoft (2026). Official pricing page including prebuilt, read, and custom extraction tiers. azure.microsoft.com/pricing/document-intelligence
AWS Textract 2025 Updates - Amazon Web Services (2025). New capabilities: superscripts, rotated text, visually similar characters, low-resolution documents. aws.amazon.com/whats-new/textract-2025
Reducto LLM-Ready Document Parsing - Reducto (2025). Best practices guide for high-fidelity document extraction in LLM workflows. llms.reducto.ai/best-llm-ready-document-parsers-2025
