Module 11 · AI

Document OCR — Intelligent Document Processing

Structured extraction from policies, receipts, attachments and contracts with human-in-the-loop and confidence scoring.

What is the Document OCR (IDP) module?

Document OCR is the Intelligent Document Processing (IDP) module for the insurance vertical. It automatically classifies policies, receipts, contracts, claim attachments and KYC documents and extracts structured fields from them. It goes beyond traditional OCR: it understands document structure, recognises insurance-domain entities (premium, principal, beneficiary, effective date, sum insured, CIG, VAT, tax code), validates internal consistency (e.g. premium + taxes = total) and produces structured JSON output ready to feed the policy back-office, the claims module or KYC. Per-row confidence scoring and a human-in-the-loop workflow together guarantee > 97% post-review accuracy on structured fields.

For whom

Teams that handle high document volumes

Policy back-office: Bulk upload of policies from partner-insurer Excel/PDF
Claims handlers: Automatic FNOL-attachment ingestion (appraisals, invoices, photos)
AML / KYC officer: ID document OCR + facial match
Central operations: Email/PEC sorting with classification and routing
Key features

What the OCR module does

Classification & extraction
  • Automatic document-type classification (policy, receipt, claim, KYC)
  • Structured field extraction (insurance-specialised NER)
  • Entity recognition: principal, beneficiary, CIG, VAT, amounts
  • Internal consistency validation (sum check, format check)
  • Bounding box per extracted field (link to original document)
  • Layout preservation for re-issuance / translation
Quality & HITL
  • Per-row confidence scoring (0-100%)
  • Configurable HITL threshold (default 90%)
  • Human-review workflow with annotated UI
  • Incremental learning: corrections improve the model
  • Full audit trail: input, output, corrections, justifications
  • Structured JSON output for downstream integration
Typical workflow

From inbound document to structured data

01

Ingestion

The document arrives via PEC (certified email), ordinary email, portal upload or mobile app. Attachments are extracted, decrypted if encrypted, and pre-processed (deskewing, denoising).

02

Classification

Classification model identifies type: policy, receipt, appraisal, KYC. Probability over each class. Ambiguous documents flagged for review.
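A minimal sketch of how ambiguous documents could be flagged, assuming the classifier returns one probability per class. The function name `flag_ambiguous`, the parameters `min_top` and `min_margin`, and their default values are illustrative assumptions, not the module's actual API or settings.

```python
from typing import Dict, Tuple


def flag_ambiguous(probs: Dict[str, float],
                   min_top: float = 0.80,      # assumed value, for illustration only
                   min_margin: float = 0.20    # assumed value, for illustration only
                   ) -> Tuple[str, bool]:
    """Return (best_class, needs_review) from per-class probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    # Ambiguous when the winner is weak or too close to the runner-up.
    needs_review = best_p < min_top or (best_p - runner_up_p) < min_margin
    return best_label, needs_review


clear = flag_ambiguous({"policy": 0.93, "receipt": 0.04, "appraisal": 0.02, "kyc": 0.01})
unclear = flag_ambiguous({"policy": 0.48, "receipt": 0.45, "appraisal": 0.05, "kyc": 0.02})
```

Here the first document is accepted as a policy, while the second is flagged because the two top classes are almost tied.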

03

Field extraction

Specialised NER model extracts typical fields for the document type. For each field: value + confidence score + bounding box.
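As an illustration, each extracted field could be carried as a small record holding the three elements named above. This is a sketch under stated assumptions: the class name `ExtractedField`, its attribute names and the bounding-box convention (page number plus x, y, width, height as fractions of the page size) are hypothetical, not the module's documented schema.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # the 0-100% score expressed as a fraction in 0.0-1.0
    bbox: Tuple[int, float, float, float, float]  # (page, x, y, width, height), assumed layout

    def __post_init__(self) -> None:
        # Reject scores outside the valid range early.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


gross_premium = ExtractedField(
    name="gross_premium",
    value="1250.00",
    confidence=0.96,
    bbox=(1, 0.62, 0.41, 0.18, 0.03),
)
record = asdict(gross_premium)  # plain dict, ready for serialisation
```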

04

Validation & routing

Consistency checks (sum, format) are run. If every field's confidence is at or above the threshold, the document passes directly to the back-office. If any field is below the threshold, the document is routed to HITL.
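The routing rule can be sketched as follows. The function names `sum_check` and `route`, the field keys and the rounding tolerance are hypothetical. Sending a failed sum check to HITL is also an assumption made for this sketch, since the text above only specifies routing on confidence.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

DEFAULT_THRESHOLD = 0.90  # default HITL threshold (90%)


def sum_check(net: Decimal, taxes: Decimal, total: Decimal,
              tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when net + taxes matches total within an assumed rounding tolerance."""
    return abs(net + taxes - total) <= tolerance


def route(fields: Dict[str, Tuple[str, float]],
          threshold: float = DEFAULT_THRESHOLD) -> Tuple[str, List[str]]:
    """fields maps a field name to (value, confidence in 0..1).

    Returns the destination and the reasons that triggered review, if any.
    """
    reasons = [name for name, (_, conf) in fields.items() if conf < threshold]
    amounts = ("net_premium", "taxes", "gross_premium")  # hypothetical keys
    if all(name in fields for name in amounts):
        net, taxes, total = (Decimal(fields[name][0]) for name in amounts)
        if not sum_check(net, taxes, total):
            reasons.append("sum_check_failed")  # assumption: also sent to HITL
    return ("hitl" if reasons else "back_office", reasons)


clean = {
    "net_premium": ("1000.00", 0.98),
    "taxes": ("250.00", 0.95),
    "gross_premium": ("1250.00", 0.97),
}
doubtful = dict(clean, taxes=("250.00", 0.72))            # one low-confidence field
mismatch = dict(clean, gross_premium=("1350.00", 0.99))   # amounts do not add up
```

Amounts are parsed with `Decimal` rather than `float` so that the sum check is not affected by binary rounding.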

05

Human review (HITL)

Operator sees document + extraction + proposed alternatives. Confirms or corrects in a few clicks. Corrections feed model fine-tuning.
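One possible way to capture the operator's decisions so they can serve both the audit trail and later fine-tuning is sketched below. The names `Correction` and `apply_review` and the record layout are assumptions for illustration, not the module's real data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Correction:
    field: str
    extracted: str      # what the model proposed
    corrected: str      # what the operator entered
    justification: str  # free-text reason kept for the audit trail


def apply_review(extracted: Dict[str, str],
                 operator_values: Dict[str, str],
                 justification: str = "") -> Tuple[Dict[str, str], List[Correction]]:
    """Merge operator input into the extraction and log only real changes."""
    final = dict(extracted)
    corrections: List[Correction] = []
    for name, new_value in operator_values.items():
        old_value = extracted.get(name, "")
        if new_value != old_value:  # a confirmation produces no correction record
            final[name] = new_value
            corrections.append(Correction(name, old_value, new_value, justification))
    return final, corrections


final, corrections = apply_review(
    extracted={"beneficiary": "Mario Rosi", "gross_premium": "1250.00"},
    operator_values={"beneficiary": "Mario Rossi", "gross_premium": "1250.00"},
    justification="Surname misread on scan",
)
```

Confirmed fields pass through unchanged, so only genuine corrections become candidate training examples.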

06

Output & archival

Structured JSON passes to the destination module (back-office, claims, KYC). Original document archived with link to record. Full audit trail.
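To make the hand-off concrete, the sketch below builds one possible JSON payload for a destination module. The function name `build_output` and every key in the payload are hypothetical, since the actual output schema is not described here.

```python
import json
from typing import Any, Dict


def build_output(document_id: str, doc_type: str, destination: str,
                 fields: Dict[str, Dict[str, Any]]) -> str:
    """Serialise an extraction result as JSON (illustrative schema only)."""
    payload = {
        "document_id": document_id,   # links the record to the archived original
        "document_type": doc_type,
        "destination": destination,
        "fields": fields,
    }
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(payload, ensure_ascii=False, sort_keys=True, indent=2)


output = build_output(
    document_id="doc-0001",
    doc_type="policy",
    destination="back_office",
    fields={
        "gross_premium": {"value": "1250.00", "confidence": 0.96, "reviewed": False},
    },
)
parsed = json.loads(output)  # round-trips to the same structure
```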

Technologies

Technical stack

AI / ML
  • Multi-language OCR engine
  • Insurance-specialised NER (IT)
  • Document classifier
Pipeline & storage
  • Modular on-tenant pipeline
  • Encrypted document store
  • HITL annotation UI
Measurable results

Impact on document processes

> 97% · Post-HITL accuracy · On structured fields of Italian policies
−80% · Data-entry time · Automatic ingestion vs manual typing
≤ 10% · Documents in HITL · Typical for stabilised production customer
0 · Data to external providers · On-tenant pipeline, no public-LLM training
FAQ

Frequently asked questions about Document OCR

What's the difference between OCR and IDP?

OCR (Optical Character Recognition) turns an image into raw text. IDP (Intelligent Document Processing) goes further: it understands the document structure, extracts specific fields (e.g. "gross premium: 1,250 €"), validates consistency between related fields and classifies the document type. NewPicass 14.Net implements IDP for insurance documents: classify, extract and validate in a single pass.

What accuracy do you reach on Italian policies?

On structured documentation (policies issued by known Italian insurers) we reach > 97% accuracy on structured fields after human-in-the-loop review. On unstructured documents (letters, free-form attachments) accuracy ranges from 88% to 95%. The system uses per-row confidence scoring: if a field's confidence is below the threshold (configurable, default 90%), the field is automatically routed to human review.

What does human-in-the-loop mean?

The AI model doesn't operate on a "take it or leave it" basis. When confidence on a field is low, the system routes the document to an operator who sees the image, the extracted text and the proposed alternatives, and confirms or corrects in a few clicks. Corrections feed model fine-tuning: accuracy improves over time specifically on the customer's documents.

Is the original document layout preserved?

Yes. Extraction maintains row-by-row mapping: every extracted field has a bounding box on the original image and can be recalled as evidence. For documents that need re-issuance in other languages (see the AI Translation module), layout preservation covers tables, headers, footers and font formatting.
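As a sketch of how a bounding box could be turned back into a region of the original image, the helper below converts a box into pixel coordinates. It assumes boxes are stored as (x, y, width, height) fractions of the page size. Both that convention and the name `bbox_to_pixels` are hypothetical.

```python
from typing import Tuple


def bbox_to_pixels(bbox: Tuple[float, float, float, float],
                   page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a normalised (x, y, width, height) box to (left, top, right, bottom) pixels."""
    x, y, w, h = bbox
    if not all(0.0 <= v <= 1.0 for v in bbox) or x + w > 1.0 or y + h > 1.0:
        raise ValueError("bounding box must lie inside the page")
    left = round(x * page_width)
    top = round(y * page_height)
    right = round((x + w) * page_width)
    bottom = round((y + h) * page_height)
    return left, top, right, bottom


# A field occupying the middle band of a 2000 x 3000 pixel scan.
crop_box = bbox_to_pixels((0.25, 0.5, 0.5, 0.125), page_width=2000, page_height=3000)
```

The (left, top, right, bottom) order was chosen because it is the box format that common imaging libraries such as Pillow accept for cropping.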

Is the model trained on customer data?

Only if requested and with specific contractual clauses. The base model is trained on public corpora of Italian insurance documents. Customer-data fine-tuning is optional, executed on-tenant (data does not leave the customer perimeter), governed by a DPA addendum that excludes any secondary use.

What document types does the module handle?

Policies (Italian and Lloyd's slips), payment receipts, appraisals, binding-authority contracts, mandate contracts, ID documents (for KYC), claim attachments (invoices, statements, reconstructions). For new types not yet covered, fine-tuning on 50-200 customer examples is enough to reach production accuracy.

Let's talk · 45 minutes

Want to see Document OCR — Intelligent Document Processing in action on your real flows?

45 minutes with one of our engineers, no sales script. You show us your current process and we show you concretely how this module would solve the critical points.