Sunday, April 5

Document Structure Analyzer: The Claude Code Agent That Makes OCR Actually Work

Every developer who has built a document processing pipeline knows the frustration: you feed a PDF or scanned image into an OCR engine, and what comes back is a jumbled mess of text with no respect for columns, tables, or reading order. The OCR engine did its job — it recognized the characters — but the structural intelligence was missing upstream. That’s the gap the Document Structure Analyzer agent fills.

This agent sits at the critical junction between raw document input and OCR processing. By analyzing layout, mapping content hierarchies, and assigning semantic roles to visual regions before text extraction begins, it transforms OCR pipelines from character-recognition exercises into genuine document understanding workflows. For senior developers building production-grade document processing systems, that distinction is the difference between a prototype and a product.

What the Document Structure Analyzer Does

The Document Structure Analyzer is a specialist agent focused entirely on the structural layer of document analysis. It doesn’t compete with OCR — it prepares the ground for it. Its responsibilities span six core domains:

  • Layout segmentation: Identifies and classifies regions of a document — headers, footers, body columns, sidebars, tables, images, and form fields
  • Reading order determination: Resolves the correct logical sequence for multi-column, magazine-style, or complex layouts where left-to-right, top-to-bottom assumptions break down
  • Content hierarchy mapping: Builds structural trees that represent heading levels, subheadings, and body text relationships
  • Table and form recognition: Detects grid structures, cell boundaries, row/column relationships, and form field semantics
  • Template and pattern matching: Recognizes recurring document types — invoices, contracts, medical forms, academic papers — so downstream processing can apply type-specific logic
  • Semantic annotation with confidence scores: Assigns roles to visual elements and reports confidence levels so you can route uncertain cases for human review

The confidence score output deserves special mention. Most document processing pipelines fail silently or produce garbage output when the structural analysis is uncertain. This agent makes uncertainty explicit, giving you the data you need to build robust error handling.
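To make that concrete, here is a minimal sketch of confidence-based routing. The `Region` shape and the 0.8 threshold are assumptions for illustration — the agent's actual output format is the structure map shown in the scenarios below, not a Python API.

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str         # semantic role, e.g. "line_item_table"
    confidence: float  # 0.0-1.0, as reported by the analyzer

def route(regions, threshold=0.8):
    """Split regions into automatic processing vs. human review queues."""
    auto, review = [], []
    for r in regions:
        # Low-confidence regions go to the review queue instead of
        # silently flowing into the extraction stage
        (auto if r.confidence >= threshold else review).append(r)
    return auto, review

auto, review = route([Region("header", 0.95), Region("table", 0.62)])
# the 0.62 table region lands in the review queue
```

The threshold is a pipeline-level policy decision: tighten it for financial documents, relax it for low-stakes indexing.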

When to Use This Agent

Deploy the Document Structure Analyzer proactively — before your OCR pipeline runs — in any of the following scenarios:

Multi-Column Documents

Newspaper layouts, academic journals, marketing brochures, and product catalogs all use multi-column formats. Naive OCR processing reads across columns, producing nonsensical text. The analyzer determines correct reading order and column boundaries so each column is extracted as a coherent unit.

Invoice and Financial Document Processing

Accounts payable automation lives or dies on accurate table extraction. Invoices contain line item tables, header metadata, and footer totals — often in inconsistent layouts across vendors. This agent identifies table regions, maps cell relationships, and classifies semantic roles (item description, unit price, quantity, total) before extraction begins.

Legal and Contractual Documents

Contracts mix numbered clauses, defined terms, signature blocks, and exhibit references in complex hierarchical structures. Losing that hierarchy means losing the document’s legal meaning. The analyzer preserves clause relationships and nesting so downstream processing can work with structured legal data rather than undifferentiated text blocks.

Medical and Scientific Forms

Intake forms, lab reports, and clinical documents combine structured form fields with free-text narrative sections. The analyzer separates these regions, enabling a different processing strategy for each: structured extraction for form fields, an NLP pipeline for narrative text.

Mixed-Content Documents

Annual reports, technical manuals, and research papers combine prose, tables, charts, callout boxes, and images. Identifying which regions contain extractable text versus visual-only content prevents OCR engines from wasting cycles on images and producing garbage output from chart elements.

High-Volume Production Pipelines

When you’re processing thousands of documents daily, pre-classification by document type enables routing to specialized extraction templates. The pattern recognition capability identifies document types at intake, allowing the rest of your pipeline to apply optimized processing logic.
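A sketch of that routing step, assuming the analyzer's document-type classification is available as a string. The handler names and fallback behavior here are illustrative, not part of the agent's specification:

```python
def extract_invoice(doc):
    return {"doc_type": "invoice", **doc}

def extract_contract(doc):
    return {"doc_type": "contract", **doc}

def extract_generic(doc):
    # Fallback path for types with no specialized template yet
    return {"doc_type": "generic", **doc}

TEMPLATES = {
    "invoice": extract_invoice,
    "contract": extract_contract,
}

def process(doc_type, doc):
    """Dispatch to a type-specific extraction template, else generic."""
    handler = TEMPLATES.get(doc_type, extract_generic)
    return handler(doc)
```

New document types start on the generic path and graduate to a dedicated template once volume justifies it.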

Real-World Examples

Scenario 1: Invoice Processing Pipeline

A fintech team is building accounts payable automation. They’re ingesting invoices from hundreds of vendors, each with different layouts. OCR output is inconsistent — line items from one vendor’s table are being merged with totals from another section.

Developer: We're getting malformed line item extraction from vendor invoices.
            OCR is reading the quantity column into the description field.
            Here's a sample document image.

Agent: Analyzing document structure...

       Document Type: Vendor Invoice (confidence: 0.94)
       Regions identified:
       - Header block (top 18%): vendor logo, address, invoice number, date
       - Recipient block (left column, 18-32%): bill-to information
       - Line item table (32-74%): 5-column grid detected
         Columns: [Description | Qty | Unit Price | Tax | Total]
         Row count: 12 data rows + 1 header row
       - Summary block (right, 74-85%): subtotal, tax, total due
       - Footer block (85-100%): payment terms, bank details

       Reading order: Header → Recipient → Line Items (row-by-row,
       left-to-right) → Summary → Footer

       Table extraction recommendation: Apply strict column boundary
       detection at pixel positions [145, 290, 410, 520] before OCR pass.
       Current OCR failure likely due to missing column segmentation.
       Confidence for column boundary positions: 0.87.

The developer now has exact pixel boundaries for column segmentation, a document type classification for routing, and a concrete explanation of why extraction was failing.
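A minimal sketch of how those boundaries might be consumed: converting the four internal column positions into five per-column crop boxes, each of which can be OCR'd independently so columns never merge. The image dimensions here are invented for illustration; in practice they come from the scanned page.

```python
def column_boxes(boundaries, width, top, bottom):
    """Return (left, top, right, bottom) crop boxes, one per column."""
    edges = [0] + list(boundaries) + [width]
    return [(edges[i], top, edges[i + 1], bottom)
            for i in range(len(edges) - 1)]

# Four internal boundaries from the analysis yield five column boxes
boxes = column_boxes([145, 290, 410, 520], width=620, top=260, bottom=600)
# each box can be cropped (e.g. with PIL's Image.crop) and passed to the
# OCR engine as a separate region
```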

Scenario 2: Academic Paper Ingestion for a Research Tool

A developer is building a research assistant that needs to ingest PDF papers and index their content with structural awareness — distinguishing abstract, methodology, results, and references.

Developer: This paper has a two-column layout but the abstract
            spans the full width. Our chunking is treating it as
            a single column and the section boundaries are wrong.

Agent: Document structure analysis complete.

       Document Type: Academic Journal Article (confidence: 0.96)
       Layout classification: Hybrid — full-width header zone,
       two-column body, full-width reference section

       Region map:
       - Title + author block: full-width, rows 1-4
       - Abstract: full-width, rows 5-9 [NOTE: single-column exception]
       - Keywords: full-width, row 10
       - Body content: two-column layout begins row 11
         Left column: Introduction, Related Work, Methodology (rows 11-68)
         Right column: Introduction cont., Experiments, Results (rows 11-68)
       - Figures: 3 figure regions identified in right column (rows 32, 45, 61)
         Recommend: exclude from OCR pass, flag for caption extraction only
       - References: full-width, rows 69-82

       Hierarchy schema:
       H1: Section headings (bold, 14pt estimated)
       H2: Subsection headings (bold, 11pt estimated)
       Body: Regular weight, 9pt estimated

       Reading order sequence: Title → Abstract → Keywords →
       [Left col top] → [Right col top] → [Left col continues] →
       [Right col continues] → References

       Pre-processing recommendation: Segment at row 10/11 boundary
       before applying column split logic.
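That recommendation — segment at the row 10/11 boundary first, then apply column logic only to the body — can be sketched as a simple split. The row-record representation is an assumption for illustration:

```python
def segment_rows(rows, boundary):
    """Split row records at a 1-based boundary: full-width vs. columnar."""
    return rows[:boundary], rows[boundary:]

# Rows 1-10 (title, abstract, keywords) stay full width;
# rows 11-82 go through the two-column reading-order logic
full_width, two_column = segment_rows(list(range(1, 83)), 10)
```

The point is ordering: running the column split before this segmentation is exactly what produced the broken chunking in the first place.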

What Makes This Agent Powerful

Proactive Deployment Philosophy

The agent description specifies it should be used proactively — this is the right mental model. Don’t reach for it after extraction fails. Run it as the first stage of every document processing workflow. The cost of structural analysis is orders of magnitude lower than the cost of reprocessing malformed OCR output or, worse, shipping corrupt data downstream.

Confidence-Scored Outputs

Every structural decision comes with a confidence score. This makes the agent genuinely production-ready. Low-confidence regions can be flagged for human review, routed to fallback processing, or logged for pipeline improvement. You’re not flying blind on edge cases.

Pre-Processing Recommendations

The agent doesn’t just describe structure — it prescribes action. Outputs include specific recommendations for OCR optimization: which regions to exclude, where to apply deskewing, what segmentation boundaries to use. This closes the loop between analysis and execution.

Template and Pattern Recognition

Document type classification at intake enables systematic processing improvements over time. As your pipeline accumulates classified documents, you can build type-specific extraction templates that continuously improve accuracy for recurring document types.

Semantic Role Assignment

Identifying that a region is a table is useful. Knowing that the third column of that table represents unit prices is the foundation of real data extraction. The semantic labeling capability bridges the gap between structural recognition and business-meaningful data extraction.
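As a sketch of what that bridge looks like in practice: once the analyzer has labeled the table's columns, OCR'd cell text can be zipped with those roles to produce named fields instead of positional strings. The role names below mirror the invoice scenario and are assumptions, not a fixed schema.

```python
# Semantic column roles as reported by the structural analysis
ROLES = ["description", "qty", "unit_price", "tax", "total"]

def row_to_record(cells, roles=ROLES):
    """Pair OCR'd cell text with the analyzer's semantic column roles."""
    return dict(zip(roles, cells))

record = row_to_record(["Widget A", "3", "9.50", "0.95", "29.45"])
# → {"description": "Widget A", "qty": "3", "unit_price": "9.50",
#    "tax": "0.95", "total": "29.45"}
```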

How to Install

Installing the Document Structure Analyzer takes about sixty seconds. Claude Code loads agents automatically from the .claude/agents/ directory in your project.

Create the agent file:

mkdir -p .claude/agents
touch .claude/agents/document-structure-analyzer.md

Open .claude/agents/document-structure-analyzer.md and paste the following system prompt:

---
name: document-structure-analyzer
description: Document structure analysis specialist. Use PROACTIVELY for identifying document layouts, analyzing content hierarchy, and mapping visual elements to semantic structure before OCR processing.
---

You are a document structure analysis specialist with expertise in identifying
and mapping document layouts, content hierarchies, and visual elements to their
semantic meaning.

## Focus Areas

- Document layout analysis and region identification
- Content hierarchy mapping (headers, subheaders, body text)
- Table, list, and form structure recognition
- Multi-column layout analysis and reading order
- Visual element classification and semantic labeling
- Template and pattern recognition across document types

## Approach

1. Layout segmentation and region classification
2. Reading order determination for complex layouts
3. Hierarchical structure mapping and annotation
4. Template matching and document type identification
5. Visual element semantic role assignment
6. Content flow and relationship analysis

## Output

- Document structure maps with regions and labels
- Reading order sequences for complex layouts
- Hierarchical content organization schemas
- Template classifications and pattern recognition
- Semantic annotations for visual elements
- Pre-processing recommendations for OCR optimization

Focus on preserving logical document structure and content relationships.
Include confidence scores for structural analysis decisions.

That’s it. Claude Code detects agents in this directory automatically — no configuration files to update, no registration steps. The next time you open your project in Claude Code, the Document Structure Analyzer will be available as a subagent that can be invoked directly or orchestrated by a parent agent in a multi-agent pipeline.

For teams using this in production document pipelines, commit the .claude/agents/ directory to version control. Every developer on the team gets the same agent configuration without any setup steps.

Conclusion: Build the Foundation First

Document processing is one of the highest-leverage automation opportunities in enterprise software. The reason so many document automation projects stall is that they try to extract meaning from structure they never bothered to analyze. The Document Structure Analyzer addresses that root cause directly.

Start by integrating it as the first stage of any new document processing workflow you’re building. Use the confidence scores to identify where your document types are causing uncertainty. Build type-specific extraction templates around the pattern recognition output. Over time, you’ll accumulate a structural understanding of your document corpus that makes every downstream process — extraction, indexing, validation, transformation — measurably more accurate.

The agent is straightforward to install and immediately useful. The harder work is redesigning your pipelines to treat structural analysis as a first-class step rather than an afterthought. That redesign is worth doing.

Agent template sourced from the claude-code-templates open source project (MIT License).
