Signature Data Extraction

Problem

Paper forms needed to become usable data

This project was part of an effort to evaluate public support for a measure across Arkansas. Volunteers had collected thousands of paper petition sheets, each with printed fields for name, address, county, and date, plus a handwritten signature on every row. The data had to land in a structured CSV before any analysis could begin, and manual entry was a non-starter at that volume.

Approach

Extract fields, flag uncertainty

Scanned sheets ingested via Python; pre-processed for skew and contrast.
Field-level OCR with Azure Document Intelligence against a custom layout model trained on the petition format.
Extracted records normalized in pandas (county lookups, address splits) and written to CSV.
Validation pass on each row to flag low-confidence fields for human review.
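
A minimal sketch of the normalize-and-flag step, not the project's actual code. It assumes each extracted row arrives as a mapping of field name to (text, confidence); that shape, the 0.80 threshold, the tiny county table, and the names `normalize_county` and `flag_row` are all hypothetical. It uses only the standard library so it runs anywhere; the same logic maps onto pandas columns.

```python
from __future__ import annotations

# Hypothetical cutoff; a real value would be tuned against reviewed rows.
CONFIDENCE_THRESHOLD = 0.80

# Tiny illustrative lookup; a full table would cover every Arkansas county.
COUNTY_LOOKUP = {
    "pulaski": "Pulaski",
    "pulaski county": "Pulaski",
    "benton": "Benton",
    "benton co.": "Benton",
}


def normalize_county(raw: str) -> str | None:
    """Map a raw OCR county string to a canonical name, or None if unknown."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return COUNTY_LOOKUP.get(key)


def flag_row(fields: dict[str, tuple[str, float]]) -> dict[str, str]:
    """Flatten one extracted row and list the fields that need human review.

    `fields` maps field name -> (text, confidence). This is an assumed
    shape for illustration, not the OCR service's response format.
    """
    row = {name: text.strip() for name, (text, _) in fields.items()}

    # Any field below the threshold goes to a reviewer.
    review = [name for name, (_, conf) in fields.items()
              if conf < CONFIDENCE_THRESHOLD]

    # A county that fails the lookup is flagged even if OCR was confident.
    county = normalize_county(row.get("county", ""))
    if county is None:
        if "county" not in review:
            review.append("county")
    else:
        row["county"] = county

    row["needs_review"] = ";".join(review)
    return row


example = {
    "name": ("Jane Doe", 0.97),
    "county": ("pulaski  county", 0.91),
    "signature": ("J~ D", 0.31),
}
print(flag_row(example))
```

Keeping the flag as a column on the row itself, rather than in a separate log, means the output CSV can be filtered directly for whatever still needs a human look.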

Outcome

Useful automation with a clear human boundary

Printed text — names, addresses, counties, dates — came through with high reliability and was usable for aggregation work downstream. Handwritten signatures were the hard problem: OCR confidence was low and validation was effectively human-in-the-loop. The pipeline shipped as “completed in part” — useful for the structured fields, with signature verification deferred to a follow-on workflow.

Retrospective

What I’d do differently

Stand up a lightweight reviewer UI for the low-confidence rows from day one.
Try a dedicated handwriting model (TrOCR or a specialized signature-verification network) for the signature box only.
Build a small held-out gold set up front so accuracy isn’t measured by eyeballing CSV diffs.
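
As a rough illustration of the gold-set idea, the sketch below scores extracted rows against hand-verified rows, field by field. Everything in it is an assumption made for the example: the row ids, the dict-of-dicts shape, the function name `field_accuracy`, and the choice of exact match after whitespace and case normalization as the metric.

```python
from __future__ import annotations

from collections import defaultdict


def _norm(value: str) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return " ".join(value.split()).casefold()


def field_accuracy(
    gold: dict[str, dict[str, str]],
    extracted: dict[str, dict[str, str]],
    fields: list[str],
) -> dict[str, float]:
    """Per-field exact-match accuracy over the gold rows.

    Rows are keyed by a shared row id (hypothetical). A gold row that is
    missing from `extracted` counts as a miss for every field.
    """
    hits: dict[str, int] = defaultdict(int)
    total = len(gold)
    for row_id, gold_row in gold.items():
        pred_row = extracted.get(row_id, {})
        for name in fields:
            if _norm(pred_row.get(name, "")) == _norm(gold_row[name]):
                hits[name] += 1
    return {name: hits[name] / total for name in fields} if total else {}


# Made-up rows, purely to show the call shape.
gold = {
    "s1-r1": {"name": "Jane Doe", "county": "Pulaski"},
    "s1-r2": {"name": "John Roe", "county": "Benton"},
}
extracted = {
    "s1-r1": {"name": "jane  doe", "county": "Pulaski"},
    "s1-r2": {"name": "John Rae", "county": "Benton"},
}
print(field_accuracy(gold, extracted, ["name", "county"]))
```

Reporting accuracy per field rather than per row keeps the strong printed fields from masking the weak ones, which matters when one column is expected to lag well behind the rest.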

Stack

Pipeline pieces

Python (pandas, Pillow)
Azure Document Intelligence (custom layout model)
CSV / Excel for downstream consumers
