Problem
Paper forms needed to become usable data
This project was part of an effort to evaluate public support for a measure across Arkansas. Volunteers had collected thousands of paper petition sheets, each row carrying printed fields for name, address, county, and date, plus a handwritten signature. The data had to land in a structured CSV before any analysis could begin, and manual entry was a non-starter at that volume.
Approach
Extract fields, flag uncertainty
- Scanned sheets ingested via Python; pre-processed for skew and contrast.
- Field-level OCR with Azure Document Intelligence against a custom layout model trained on the petition format.
- Extracted records normalized in pandas (county lookups, address splits) and written to CSV.
- Validation pass on each row to flag low-confidence fields for human review.
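The normalization and validation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses only the standard library rather than pandas, and it assumes each extracted field arrives as a (text, confidence) pair, similar to the per-field confidence that Document Intelligence reports. The threshold value, the county lookup entries, the `needs_review` column, and the sample record are all hypothetical.

```python
import csv
import io

# Hypothetical threshold: fields scoring below this go to a human reviewer.
CONFIDENCE_THRESHOLD = 0.80

# Hypothetical lookup from raw OCR variants to canonical county names.
COUNTY_LOOKUP = {
    "pulaski": "Pulaski",
    "pulaski co.": "Pulaski",
    "benton": "Benton",
    "washington": "Washington",
}

FIELDS = ["name", "address", "county", "date"]


def normalize_county(raw):
    """Return the canonical county name, or None if the value is not recognized."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return COUNTY_LOOKUP.get(key)


def validate_row(extracted):
    """Build one CSV row from {field: (text, confidence)} and list fields needing review."""
    row = {}
    flagged = []
    for field in FIELDS:
        # A missing field is treated as empty text with zero confidence.
        text, confidence = extracted.get(field, ("", 0.0))
        text = text.strip()
        if field == "county":
            canonical = normalize_county(text)
            if canonical is None:
                flagged.append(field)  # unknown county: keep raw text, send to review
            else:
                text = canonical
        if confidence < CONFIDENCE_THRESHOLD and field not in flagged:
            flagged.append(field)
        row[field] = text
    row["needs_review"] = ";".join(flagged)
    return row


def write_csv(rows):
    """Serialize validated rows to CSV text, including the review column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS + ["needs_review"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only.
sample = {
    "name": ("Jane Doe", 0.97),
    "address": ("12 Main St, Little Rock", 0.62),
    "county": ("  PULASKI Co. ", 0.91),
    "date": ("2024-03-02", 0.95),
}
row = validate_row(sample)
print(write_csv([row]))  # the address field is the only one flagged for review
```

Keeping the flags in a column next to the data, rather than dropping uncertain rows, means the CSV stays complete for aggregation while reviewers can filter on `needs_review` to find their work.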
Outcome
Useful automation with a clear human boundary
Printed text (names, addresses, counties, dates) came through reliably and was usable for downstream aggregation work. Handwritten signatures were the hard problem: OCR confidence was low, so validating them stayed effectively human-in-the-loop. The pipeline shipped as “completed in part”: useful for the structured fields, with signature verification deferred to a follow-on workflow.
Retrospective
What I’d do differently
- Stand up a lightweight reviewer UI for the low-confidence rows from day one.
- Try a dedicated handwriting model (TrOCR or a specialized signature-verification network) for the signature box only.
- Build a small held-out gold set up front so accuracy isn’t measured by eyeballing CSV diffs.
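As one way the gold-set idea might look in practice, here is a small sketch that scores pipeline output against hand-verified records, field by field. It assumes both sets are keyed by a shared record id and compares values by exact match after light normalization; the function names, the id scheme, and the sample records are hypothetical.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(value.lower().split())


def field_accuracy(predicted, gold, fields):
    """Return per-field exact-match accuracy of predicted records against the gold set.

    Both arguments map a record id to a {field: text} dict. A record or field
    missing from the predictions counts as wrong.
    """
    if not gold:
        return {}
    correct = {field: 0 for field in fields}
    for record_id, gold_row in gold.items():
        predicted_row = predicted.get(record_id, {})
        for field in fields:
            if normalize(predicted_row.get(field, "")) == normalize(gold_row[field]):
                correct[field] += 1
    return {field: correct[field] / len(gold) for field in fields}


# Made-up records for illustration only.
gold = {
    "sheet1-row1": {"name": "Jane Doe", "county": "Pulaski"},
    "sheet1-row2": {"name": "John Roe", "county": "Benton"},
}
predicted = {
    "sheet1-row1": {"name": "jane  doe", "county": "Pulaski"},
    "sheet1-row2": {"name": "John Rae", "county": "Benton"},
}
accuracy = field_accuracy(predicted, gold, ["name", "county"])
print(accuracy)  # {'name': 0.5, 'county': 1.0}
```

Reporting accuracy per field rather than per row keeps the easy printed fields from hiding how the harder ones perform.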
Stack
Pipeline pieces
- Python (pandas, Pillow)
- Azure Document Intelligence (custom layout model)
- CSV / Excel for downstream consumers