IDP (Intelligent Document Processing)
The category of software that ingests unstructured documents and produces structured data, typically combining OCR, classification, extraction, and validation. Twenty years ago this was rule-based template matching. Today it's predominantly LLM-driven, and the category boundary has blurred with general-purpose AI platforms — but the buyer problem (turn this PDF into a row in my database, accurately, auditably) is unchanged.
OCR (Optical Character Recognition)
The conversion of pixel data (scanned images, photos of documents) into machine-readable text. Foundational input to document AI, and increasingly commoditized — modern OCR engines hit 99%+ character accuracy on clean documents, and even mediocre OCR is good enough when the downstream model is an LLM that can correct errors from context. Obsessing over OCR accuracy is the wrong metric in 2026; the differentiator is what you do with the text afterward.
HITL (Human-in-the-Loop)
A workflow pattern where uncertain or low-confidence machine outputs are routed to a human reviewer before downstream use. Done well, HITL is how you get to 99.9% accuracy on documents where 99% would be unacceptable — the human only sees the 1% the model wasn't sure about, and the corrections feed active learning. Done poorly, HITL is a euphemism for "we ship 80% accuracy and call it a feature."
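A minimal sketch of the routing split, assuming a single per-field confidence score; the 0.98 threshold and field structure are illustrative placeholders, not recommendations:

```python
# Minimal HITL routing sketch. Threshold and field structure are
# illustrative assumptions, not recommendations.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.98  # below this, a human sees the field

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float

def route(fields):
    """Split extracted fields into auto-approve and human-review queues."""
    auto, review = [], []
    for f in fields:
        (auto if f.confidence >= REVIEW_THRESHOLD else review).append(f)
    return auto, review

approved, needs_review = route([
    ExtractedField("invoice_number", "INV-4412", 0.997),
    ExtractedField("total", "1,840.00", 0.81),  # model unsure; human reviews
])
```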
ReAct (Reason + Act)
An agentic LLM pattern where the model alternates between reasoning steps ("the document looks like an invoice; the line items don't add up; I should call the totals tool") and tool-using actions. Distinct from single-pass extraction (one LLM call, hope for the best) and from pure RAG (retrieve documents, generate an answer). ReAct is a natural fit for document workflows where extraction is one step in a longer chain — classify, extract, validate, escalate.
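A schematic of the loop, with a scripted sequence standing in for real LLM calls; the tool names and payloads are illustrative assumptions, not a specific framework's API:

```python
# Schematic ReAct loop for a document workflow. The scripted steps stand in
# for real LLM calls; tool names and payloads are illustrative assumptions.
SCRIPT = iter([
    {"thought": "Looks like an invoice; classify to confirm.", "action": "classify", "input": "doc"},
    {"thought": "It is an invoice; extract the fields.", "action": "extract", "input": "doc"},
    {"thought": "Line items should sum to the total; check.", "action": "validate_totals",
     "input": {"total": "1,840.00"}},
    {"thought": "Totals reconcile; done.", "action": "finish", "input": {"total": "1,840.00"}},
])

def llm_step(history):
    """Stand-in for one LLM call returning a thought and a tool action."""
    return next(SCRIPT)

TOOLS = {
    "classify": lambda doc: "invoice",
    "extract": lambda doc: {"total": "1,840.00"},
    "validate_totals": lambda fields: {"ok": True},
}

def run(document, max_steps=8):
    history = [f"Document: {document[:200]}"]
    for _ in range(max_steps):
        step = llm_step(history)
        history.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":   # the model ends the chain itself
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Observation: {observation}")
    return {"escalate": True, "reason": "step budget exhausted"}

print(run("ACME Corp Invoice #4412 ..."))
```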
Schema-driven extraction
An extraction approach where the desired output schema (field names, types, validators) is the primary configuration: instead of training a per-document-type classifier, the model is instructed to populate the schema directly. Faster to iterate than supervised training, and it generalizes to documents you've never seen before. Trades some defensibility for speed: you're auditing the prompt and model version, not a frozen training artifact.
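One way this looks in practice, sketched with pydantic; the Invoice fields and the call_llm stand-in are illustrative, not a particular product's API:

```python
# Schema-driven extraction sketch using pydantic. The Invoice fields and
# the call_llm stand-in are illustrative assumptions.
from datetime import date
from pydantic import BaseModel, field_validator

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    total: float

    @field_validator("total")
    @classmethod
    def total_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("total must be positive")
        return v

def call_llm(text: str, schema: dict) -> str:
    """Stand-in for a real LLM call; returns canned JSON for the sketch."""
    return '{"invoice_number": "INV-4412", "invoice_date": "2026-04-18", "total": 1840.0}'

def extract(document_text: str) -> Invoice:
    # The schema itself is the configuration: it is sent with the prompt,
    # and the reply is validated against it. Failure -> retry or HITL.
    reply = call_llm(document_text, schema=Invoice.model_json_schema())
    return Invoice.model_validate_json(reply)

print(extract("ACME Corp Invoice #4412 ..."))
```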
Audit trail
An immutable, time-ordered record of every operation performed on a document — including model version, prompts, responses, human actions, and lineage — used for compliance, debugging, and dispute resolution. In an AI workflow, "the model decided X" is not an audit trail; "model version 2026.04.18, prompt hash abc123, response payload, reviewer Sarah confirmed at 14:22 UTC" is.
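A sketch of what one append-only audit event might carry; the field names are illustrative, not a standard:

```python
# Sketch of one append-only audit event; field names are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def audit_event(doc_id, actor, action, model_version, prompt, response):
    return {
        "doc_id": doc_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                    # model version or human identity
        "action": action,                  # "extract", "review", "approve", ...
        "model_version": model_version,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "response": response,              # full payload, for replay and disputes
    }

log = []  # append-only: events are written once and never mutated
log.append(audit_event("doc-881", "model", "extract",
                       "2026.04.18", "Extract fields ...", '{"total": 1840.0}'))
log.append(audit_event("doc-881", "reviewer:sarah", "confirm",
                       "2026.04.18", "", "total confirmed"))
print(json.dumps(log, indent=2))
```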
Confidence score
A model-emitted estimate of how likely an extracted value is to be correct. Used to drive HITL routing, auto-approval thresholds, and quality reporting. Worth knowing: LLM confidence scores are notoriously poorly calibrated out of the box. A 0.95 from one model is not the same as a 0.95 from another. Calibration is a per-platform engineering problem, not a free input.
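One common fix is to fit a per-model calibrator on labeled review outcomes, sketched here with scikit-learn's isotonic regression and toy labels:

```python
# Per-model calibration sketch: map raw confidences to observed correctness
# with isotonic regression (scikit-learn). Labels here are toy values.
from sklearn.isotonic import IsotonicRegression

# (raw confidence, was the field actually correct) pairs from labeled reviews
raw     = [0.99, 0.97, 0.95, 0.95, 0.90, 0.85, 0.80, 0.70]
correct = [1,    1,    1,    0,    1,    0,    0,    0]

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw, correct)

# A raw 0.95 from *this* model maps to its observed accuracy near 0.95,
# which is the number routing thresholds should actually be set against.
print(calibrator.predict([0.95]))
```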
Multi-LLM consensus
Running the same extraction request through multiple LLM providers (or multiple models) and reconciling results. Used as an accuracy lever (where two models agree, confidence is high) and as a hedge against single-vendor failure modes (one provider has an outage; another keeps the request flowing). Costs more compute than single-model extraction; worth it on regulated or high-stakes workflows.
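A sketch of the reconciliation step, with canned provider responses standing in for real API calls:

```python
# Consensus sketch: agreement raises confidence; disagreement routes to HITL.
# Provider wrappers are hypothetical stand-ins, not real client libraries.
def extract_with(provider, document):
    """Stand-in for a provider-specific extraction call."""
    canned = {
        "provider_a": {"total": "1840.00", "currency": "USD"},
        "provider_b": {"total": "1840.00", "currency": "EUR"},
    }
    return canned[provider]

def consensus(document):
    a = extract_with("provider_a", document)
    b = extract_with("provider_b", document)
    agreed, disputed = {}, []
    for field in a.keys() | b.keys():
        if a.get(field) == b.get(field):
            agreed[field] = a[field]      # agreement -> high confidence
        else:
            disputed.append(field)        # disagreement -> human review
    return agreed, disputed

agreed, disputed = consensus("...")
print(agreed)    # {'total': '1840.00'}
print(disputed)  # ['currency']
```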
Zero-retention API
An LLM API configuration where the provider commits to not training on, persisting, or logging customer payloads beyond operational requirements (typically 0–30 days for abuse monitoring, with the customer able to opt out of even that on enterprise tiers). Required for regulated workflows — without it, you're sending PHI, PII, or material non-public information to a vendor that retains and possibly trains on it. OpenAI and Anthropic both offer zero-retention configurations on enterprise plans.
Sub-processor
Under GDPR Article 28, any third-party processor engaged by a primary processor to handle personal data on behalf of a controller. AI products typically have 4–7 sub-processors per request — LLM provider, vector store, observability vendor, email/SMS sender, identity provider, payments. Your DPA needs to list them, your customer needs to be able to object, and you need a 30-day notice clock when you change them.
DPA (Data Processing Agreement)
The contractual instrument required by GDPR Article 28 between a controller and processor (and, recursively, between processor and sub-processor). Specifies handling, security, breach obligations, sub-processor governance, audit rights, and retention. In 2026 a DPA without an AI-specific addendum is undercooked — the Article 28 obligations apply to LLM-mediated processing the same way they apply to any other.
BAA (Business Associate Agreement)
The contractual instrument required by HIPAA between a covered entity (provider, payer, clearinghouse) and a business associate handling protected health information (PHI). Without a BAA, you cannot legally process PHI on behalf of the covered entity. AI vendors that don't offer BAAs are off the table for healthcare workflows by default.
PII (Personally Identifiable Information)
Data that can identify a specific individual either directly (name, government ID, biometric) or indirectly (a combination of attributes that uniquely identifies an individual in context). Subject to data-protection laws across jurisdictions (GDPR, CCPA/CPRA, LGPD, PIPEDA). Document AI workflows almost always touch PII — the controls that matter are encryption, access scoping, retention discipline, and audit lineage.
PHI (Protected Health Information)
Under HIPAA, individually identifiable health information held by covered entities and business associates. Subject to the HIPAA Privacy Rule (use and disclosure) and Security Rule (technical, physical, and administrative safeguards). PHI is a strict subset of PII with sharper handling rules — minimum-necessary access, breach notification thresholds, and BAA-mediated processor relationships.
Active learning
A training approach where a model selects uncertain or informative examples for human labeling, then incorporates those labels back into the model. In document AI, the loop typically runs: model extracts, low-confidence fields go to HITL review, reviewer corrections are added to a few-shot prompt or fine-tuning dataset. The cycle improves accuracy on edge cases without bulk retraining.
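A sketch of that loop; the threshold and in-memory stores are illustrative, and a real system would persist both:

```python
# Active-learning loop sketch; threshold and in-memory stores are
# illustrative assumptions.
FEW_SHOT_EXAMPLES = []   # corrections fed back into the prompt or a fine-tune set
REVIEW_QUEUE = []

def on_extraction(doc_id, field, value, confidence):
    if confidence < 0.95:            # uncertain -> ask a human
        REVIEW_QUEUE.append({"doc_id": doc_id, "field": field, "value": value})

def on_review(doc_id, field, corrected):
    # The human label becomes a few-shot example (or a fine-tuning row),
    # so the next similar edge case is handled without review.
    FEW_SHOT_EXAMPLES.append({"doc_id": doc_id, "field": field, "label": corrected})

on_extraction("doc-12", "total", "184000", 0.62)  # OCR dropped the decimal point
on_review("doc-12", "total", "1840.00")
```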
STP (Straight-Through Processing)
A workflow pattern where high-confidence machine outputs are auto-approved without human touch. Common in claims, payments, KYC, and AP automation. STP rates (% of documents that bypass human review) are the headline metric most IDP procurement evaluations compare on. Worth pairing with quality-of-bypass metrics: a 95% STP rate at 99% accuracy is good; a 95% STP rate at 90% accuracy is a recall problem dressed as an efficiency win.
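A sketch of pairing the two metrics, with toy counts that reproduce the 95%-STP-at-90%-accuracy case above:

```python
# Pairing STP rate with quality-of-bypass; counts are toy numbers that
# reproduce the "95% STP at 90% accuracy" case above.
def stp_report(auto_approved, reviewed, auto_errors_found):
    stp_rate = auto_approved / (auto_approved + reviewed)
    # Bypass accuracy comes from sampling and auditing auto-approved docs.
    bypass_accuracy = 1 - auto_errors_found / auto_approved
    return {"stp_rate": round(stp_rate, 3),
            "bypass_accuracy": round(bypass_accuracy, 3)}

print(stp_report(auto_approved=950, reviewed=50, auto_errors_found=95))
# {'stp_rate': 0.95, 'bypass_accuracy': 0.9} -- the recall problem in disguise
```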
FNOL (First Notice of Loss)
The initial report of a claim by a policyholder, triggering the claims workflow in insurance. FNOL packages typically arrive as multi-document bundles — claim form, photos, police reports, repair estimates — and the time to triage them is the gating step on every downstream metric (cycle time, loss adjustment expense, customer satisfaction).
KYC (Know Your Customer)
The regulatory process of verifying the identity of customers, typically using a combination of identity documents (passport, driver's license, national ID) and supporting records (proof of address, source of funds). Required across financial services, fintech, marketplaces, and increasingly any business handling significant payments. KYC document workflows are bursty, latency-sensitive, and intolerant of errors — three reasons document AI fits well.
Document classification
The step of identifying what kind of document is being processed (invoice vs. payslip vs. ID), typically as a precursor to extraction. In modern stacks, classification and extraction can be the same LLM call — the model identifies the type and emits the type-specific schema in one pass — but for high-volume systems, a separate fast classifier still earns its keep on cost and latency.
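A sketch of the two-stage pattern, with stand-ins for both the cheap classifier and the LLM call; the schemas and functions are illustrative:

```python
# Two-stage sketch: a cheap classifier picks the schema, one LLM call
# extracts against it. Classifier and extractor are hypothetical stand-ins.
SCHEMAS = {
    "invoice": ["invoice_number", "total", "due_date"],
    "payslip": ["employee_name", "gross_pay", "pay_period"],
    "id_document": ["full_name", "document_number", "expiry_date"],
}

def fast_classify(text):
    """Stand-in for a small, cheap classifier (keyword or distilled model)."""
    return "invoice" if "invoice" in text.lower() else "id_document"

def llm_extract(text, fields):
    """Stand-in for the type-specific LLM extraction call."""
    return {f: "..." for f in fields}

doc = "ACME Corp Invoice #4412 ..."
doc_type = fast_classify(doc)               # fast, cheap, high volume
print(llm_extract(doc, SCHEMAS[doc_type]))  # one schema-targeted LLM call
```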
Field validation
Post-extraction checks that confirm extracted values are well-formed, internally consistent, and (where possible) externally verified. Examples: checksum validation on tax IDs and bank routing numbers, date-range sanity (no future invoices), totals reconciliation (line items sum to total), external lookups (NPI valid, address resolves, EIN matches IRS file). Most extraction errors are caught here, not in the model.
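Two of those checks sketched out: the US ABA routing-number checksum and totals reconciliation (example values are illustrative):

```python
# Two of the checks named above: ABA routing-number checksum and
# line-item/total reconciliation. Example values are illustrative.
def valid_aba_routing(number):
    """US ABA routing checksum: weights 3, 7, 1 over the 9 digits, mod 10."""
    if len(number) != 9 or not number.isdigit():
        return False
    d = [int(c) for c in number]
    return (3 * (d[0] + d[3] + d[6])
            + 7 * (d[1] + d[4] + d[7])
            + 1 * (d[2] + d[5] + d[8])) % 10 == 0

def totals_reconcile(line_items, total, tol=0.01):
    """Line items must sum to the stated total (within rounding tolerance)."""
    return abs(sum(line_items) - total) <= tol

print(valid_aba_routing("011000015"))             # True: checksum passes
print(totals_reconcile([1200.0, 640.0], 1840.0))  # True
```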