Case study · OCR pipeline

Signature Data Extraction

Scanned petition sheets → structured CSV via OCR, validation, and human-review flags.

Input

Scans

Output

CSV

Stack

Python

Review

HITL

Problem

Paper forms needed to become usable data

This project was part of an effort to evaluate public support for a measure across Arkansas. Volunteers had collected thousands of paper petition sheets, each with printed fields for name, address, county, and date, plus a handwritten signature per row. The data had to land in a structured CSV before any analysis could begin, and manual entry was a non-starter at that volume.

Approach

Extract fields, flag uncertainty

  • Scanned sheets ingested via Python; pre-processed for skew and contrast.
  • Field-level OCR with Azure Document Intelligence against a custom layout model trained on the petition format.
  • Extracted records normalized in pandas (county lookups, address splits) and written to CSV.
  • Validation pass on each row to flag low-confidence fields for human review.
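
The validation step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses only the standard library (the real pipeline normalized records in pandas), and the record shape, the `flag_row` and `write_csv` helpers, the `needs_review` and `review_fields` columns, and the 0.80 threshold are all assumptions made for the example.

```python
import csv

# Hypothetical cutoff; the actual threshold used in the project is not stated.
CONFIDENCE_THRESHOLD = 0.80

# Printed fields named in the case study.
FIELDS = ("name", "address", "county", "date")


def flag_row(record, threshold=CONFIDENCE_THRESHOLD):
    """Flatten one OCR record and mark the fields that need human review.

    `record` maps each field name to a (value, confidence) pair. That shape
    is assumed for illustration; it is not the OCR service's response format.
    """
    row = {}
    low_confidence = []
    for field in FIELDS:
        value, confidence = record.get(field, ("", 0.0))
        row[field] = value.strip()
        # Missing or empty values are flagged regardless of confidence.
        if not row[field] or confidence < threshold:
            low_confidence.append(field)
    row["needs_review"] = "yes" if low_confidence else "no"
    row["review_fields"] = ";".join(low_confidence)
    return row


def write_csv(records, handle):
    """Write flagged rows to an open text handle as CSV."""
    columns = [*FIELDS, "needs_review", "review_fields"]
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()
    for record in records:
        writer.writerow(flag_row(record))
```

Keeping the flagged field names in the row itself means a reviewer can filter on `needs_review` and see which cells to check without re-running the OCR step.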

Outcome

Useful automation with a clear human boundary

Printed text (names, addresses, counties, dates) came through with high reliability and was usable for aggregation work downstream. Handwritten signatures were the hard problem: OCR confidence was low, so validating them remained effectively human-in-the-loop. The pipeline shipped as “completed in part”: useful for the structured fields, with signature verification deferred to a follow-on workflow.

Retrospective

What I’d do differently

  • Stand up a lightweight reviewer UI for the low-confidence rows from day one.
  • Try a dedicated handwriting model (TrOCR or a specialized signature-verification network) for the signature box only.
  • Build a small held-out gold set up front so accuracy isn’t measured by eyeballing CSV diffs.
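
As a rough idea of what the gold-set check could look like, here is a minimal sketch that scores extracted rows against hand-labelled rows, field by field. The `field_accuracy` helper, the `row_id` join key, and the whitespace- and case-insensitive comparison are assumptions chosen for the example, not something the project implemented.

```python
from collections import Counter


def _norm(text):
    """Collapse whitespace and ignore case so trivial differences don't count as errors."""
    return " ".join(text.split()).casefold()


def field_accuracy(gold_rows, predicted_rows, fields, key="row_id"):
    """Return per-field exact-match accuracy against a hand-labelled gold set.

    Rows are dicts; `key` (hypothetical) identifies the same petition row in
    both lists. A gold row with no matching prediction counts as wrong for
    every field.
    """
    predicted_by_key = {row[key]: row for row in predicted_rows}
    correct = Counter()
    total = Counter()
    for gold in gold_rows:
        predicted = predicted_by_key.get(gold[key], {})
        for field in fields:
            total[field] += 1
            if _norm(predicted.get(field, "")) == _norm(gold[field]):
                correct[field] += 1
    return {field: correct[field] / total[field] for field in fields if total[field]}
```

Reporting accuracy per field rather than per row keeps a weak field, such as a hard-to-read date, from hiding behind strong ones.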

Stack

Pipeline pieces

  • Python (pandas, Pillow)
  • Azure Document Intelligence (custom layout model)
  • CSV / Excel for downstream consumers