"Why did your system extract this number from this document on this date?" is a question every document AI vendor eventually has to answer — for a customer, for an auditor, for a regulator. With deterministic systems the answer is straightforward: the rules produced the output, and the rules are version-controlled. With LLM-based extraction the answer is harder, and the architecture decisions you make on day one determine whether the question is tractable or impossible.

What "audit trail" actually means for LLM workflows

An audit trail is not a log file. Logs are what engineers read; an audit trail is what regulators read. The bar is higher:

  • Immutable. Once written, can't be changed. Append-only stores or write-once buckets.
  • Reproducible. Given an audit record, you can re-run the request and get the same output (or explain why you can't).
  • Queryable on the right keys. "All extractions for tenant X over the last 90 days" is a query that needs to return in under a second, not a Hadoop job.
  • Retention-aware. The retention period for audit metadata is usually longer than for primary data — and the data subject's right to deletion has to be reconciled with audit retention obligations.

The reproducibility requirement is the one that breaks most LLM-based extraction systems. We'll spend most of this post on it.

The reproducibility problem

To reproduce an extraction you need: the input document, the prompt, the model version, the model parameters, and the random seed (if any). LLM providers vary in how much of this is available (a capture sketch follows the list):

  • Input document — your problem. Store it (encrypted, retained per policy) or accept that you can't reproduce older requests.
  • Prompt — your problem. Templates change over time; "the prompt" for a request from three months ago is not "the prompt" today.
  • Model version — the provider's problem, partly. OpenAI and Anthropic both let you pin model versions. Customers who don't pin versions get surprised when their pipeline behavior shifts after a routine model update.
  • Parameters — your problem. Temperature, max-tokens, top-p, system prompt. Capture them.
  • Seed — provider-dependent. OpenAI's seed parameter gives near-determinism; Anthropic's API doesn't currently offer one. The closest you get to reproducibility is temperature=0 with a pinned model version, and even then expect occasional divergence.
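
To make the list concrete, here's a minimal capture-record sketch in Python. The field names, the placeholder document bytes, and the prompt text are illustrative, not fluex internals; the point is that all five inputs end up pinned in one hashable record.

# Sketch of a per-request capture record. Field names are illustrative.
from dataclasses import dataclass, asdict
import hashlib, json

def sha256_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

@dataclass(frozen=True)
class ExtractionCapture:
    document_sha256: str   # hash of the exact input bytes
    prompt_sha256: str     # hash of the fully rendered prompt, not the template
    model: str             # pinned model version string
    params: dict           # temperature, max_tokens, top_p, system prompt ref
    seed: int | None       # None where the provider exposes no seed

document_bytes = b"%PDF-1.4 example"                    # in practice: the uploaded bytes
rendered_prompt = "Extract the invoice total as JSON."  # fully rendered prompt text

capture = ExtractionCapture(
    document_sha256=sha256_of(document_bytes),
    prompt_sha256=sha256_of(rendered_prompt.encode("utf-8")),
    model="claude-3-5-sonnet-2026-04",  # a pinned version, never an alias
    params={"temperature": 0.0, "max_tokens": 1024},
    seed=None,
)
audit_json = json.dumps(asdict(capture), sort_keys=True)  # stable serialization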

What fluex captures per extraction

For every extraction request, the audit record contains:

{
  "request_id": "req_01H....",
  "tenant_id": "ten_...",
  "document_hash": "sha256:...",
  "document_storage_ref": "gs://...",
  "schema_version": "v3.4",
  "pipeline_version": "2026.04.18",
  "steps": [
    {
      "step": "planner",
      "model": "claude-3-5-sonnet-2026-04",
      "prompt_hash": "sha256:...",
      "prompt_storage_ref": "gs://...",
      "params": { "temperature": 0.0, "max_tokens": 1024 },
      "response_hash": "sha256:...",
      "response_storage_ref": "gs://...",
      "started_at": "...",
      "completed_at": "...",
      "tokens_in": 1820, "tokens_out": 412
    },
    ... extractor / validator / post-processor steps ...
  ],
  "final_extraction_hash": "sha256:...",
  "decision": "auto-approved",
  "completed_at": "..."
}

Two design choices to highlight:

Hashes plus storage references. The audit record itself is small (a few KB) and queryable. The actual prompts, responses, and document content live in encrypted object storage with longer-term retention rules. Linking by hash means we can reconstruct any request without bloating the audit index.
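
A sketch of that pattern, with a stand-in bucket client (write_once is a hypothetical method here; real write-once semantics come from your object store's retention or object-lock configuration):

# Sketch: store the large artifact once, keep only hash + reference in the audit row.
import hashlib

class WriteOnceBucket:
    # Stand-in for a write-once object store client.
    def __init__(self):
        self._blobs = {}
    def write_once(self, ref: str, blob: bytes):
        self._blobs.setdefault(ref, blob)  # never overwrite an existing object

def archive(blob: bytes, bucket: WriteOnceBucket, prefix: str) -> tuple[str, str]:
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    ref = f"gs://{prefix}/{digest}"        # content-addressed: duplicates dedupe for free
    bucket.write_once(ref, blob)
    return digest, ref

audit_bucket = WriteOnceBucket()
prompt_hash, prompt_ref = archive(b"...rendered prompt bytes...", audit_bucket, "fluex-audit/prompts")

Content-addressing also buys integrity checking: on reconstruction, re-hash the fetched blob and compare it to the hash in the audit record.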

Schema version + pipeline version. The schema (what fields the customer asked us to extract) and the pipeline (the orchestration code) both evolve, so both are captured. When we ship a pipeline change, replaying the previous week's audit records against the new pipeline tells us whether behavior shifted before production traffic does.
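
In sketch form, that replay check is a short loop over the audit store; replay here is a hypothetical callable that re-runs one stored record through the candidate pipeline and returns the new extraction hash:

# Sketch: which of last week's requests would change under the candidate pipeline?
def pipeline_drift(audit_records, replay) -> list[str]:
    # audit_records: iterable of stored audit dicts (schema as above)
    # replay: hypothetical helper re-running one record on the new pipeline
    return [
        rec["request_id"]
        for rec in audit_records
        if replay(rec) != rec["final_extraction_hash"]
    ]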

Tenant isolation in the audit store

The audit store is data. Same isolation rules as primary data: row-level security at the database layer, scoped by tenant on every connection. A tenant's audit records are invisible to other tenants by construction.
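
As an illustration of what that looks like with Postgres row-level security (table, column, and setting names are hypothetical):

# Sketch: RLS on the audit table, scoped per connection.
import psycopg  # psycopg 3

SETUP_SQL = """
ALTER TABLE audit_records ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON audit_records
    USING (tenant_id = current_setting('app.tenant_id'));
"""  # run once, by migrations

def audit_connection(dsn: str, tenant_id: str) -> psycopg.Connection:
    conn = psycopg.connect(dsn)
    # set_config scopes the tenant to this session; the policy then filters
    # every subsequent query regardless of what the application SQL says.
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant_id,))
    return conn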

We've seen customers with solid primary-data isolation but loose audit-data isolation. It's a common gap, auditors and regulators ask about it, and "we trust application code to filter audit queries by tenant" is the wrong answer.

Retention reconciliation

This is the awkward part. GDPR Art. 17 (right to erasure) gives data subjects a right to request deletion. SOX, HIPAA, financial-services regulators, and many DPAs require 7-year+ retention of audit records. These requirements collide.

The reconciliation we've landed on:

  • Primary data — document content, raw extractions — has the customer's configured retention. Default 90 days. Configurable to 0.
  • Audit metadata — the JSON record above, minus the storage refs to deleted primary data — retained per the customer's audit retention policy. Default 7 years.
  • Subject deletion — primary data is hard-deleted on request. Audit records are scrubbed (see the sketch after this list): PII fields are nulled out, but the audit fact (the request happened, with this hash, at this time, with this decision) is retained as required by the customer's other obligations.
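
A sketch of that scrub operation; the PII field names are hypothetical, and the exact set is defined per customer schema:

# Sketch: scrub a subject's audit record instead of hard-deleting it.
# Hashes, timestamps, and the decision survive; PII and dead storage refs do not.
PII_FIELDS = {"subject_name", "subject_email"}   # hypothetical per-schema fields
DELETED_REF_FIELDS = {"document_storage_ref"}    # refs to hard-deleted primary data

def scrub(record: dict) -> dict:
    scrubbed = dict(record)
    for f in PII_FIELDS & scrubbed.keys():
        scrubbed[f] = None
    for f in DELETED_REF_FIELDS & scrubbed.keys():
        scrubbed[f] = "deleted:subject-erasure"
    return scrubbed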

Customers configure the exact policy in their workflow. The default is conservative.

The customer-facing audit query API

A subtle but important capability: the audit trail is queryable by the customer, not just by us. Through the API, a customer can do the following (an example call follows the list):

  • List all extractions for a given subject identifier (in support of Art. 15 access requests).
  • Pull the full audit record for a specific request.
  • Export audit records as JSON-Lines or CSV for their own retention.
  • Replay historical extractions against the current pipeline (Enterprise only — tells you whether a pipeline upgrade changed behavior on your traffic).
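
As an example of the shape of such a query (the endpoint path, parameter names, and domain here are illustrative, not the documented API):

# Sketch of an audit query against a hypothetical endpoint.
import json, requests

resp = requests.get(
    "https://api.fluex.example/v1/audit/records",  # illustrative path
    headers={"Authorization": "Bearer <api-key>"},
    params={"subject_id": "sub_123", "since": "2026-01-01", "format": "jsonl"},
    timeout=30,
)
for line in resp.text.splitlines():
    record = json.loads(line)                      # one audit record per line
    print(record["request_id"], record["decision"])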

The replay capability is the surprising one. Customers preparing for their own audits use it as evidence that our behavior on their documents has been stable.

What we'd skip if you're starting today

Don't try to make every audit field human-readable in the audit table. We did that early and it inflated storage. The audit table should be machine-queryable JSON with hashes; everything human-readable lives in the linked storage. Engineers and auditors both prefer this once they get used to it.

Don't conflate the audit trail with observability traces. Traces are for engineers, retained briefly, sent to your observability vendor. Audit records are for customers and regulators, retained for years, in your own infrastructure under tight access control. Mixing the two creates the worst of both worlds.

Closing

Audit trails for non-deterministic outputs are a different engineering problem from debug logging. The reproducibility requirement forces architectural decisions — version-pinning, prompt versioning, structured capture — that pay back in customer trust and operational sanity, not just regulatory compliance.

The teams that build this in early discover that "why did this happen?" is a query, not an investigation. The teams that don't end up reverse-engineering their own behavior every time something goes wrong. The audit trail is the engineering investment that looks like a compliance investment until you actually need it.

For the architecture that produces these audit records, see our ReAct architecture pillar. For the observability side of the same pipeline, see tracing agentic extraction.