Engineering

Building Evaluator-Optimizer Loops for High-Stakes AI

2026-05-15 · Tensflare Engineering

Why verification loops matter

When a contract review system misses an auto-renewal clause, or a due diligence report fabricates a citation, the cost isn't measured in API credits.

Traditional RAG pipelines retrieve relevant chunks and let the LLM generate a response in a single pass. There's no second look. No verification that the output actually matches the source.

The evaluator pattern

We introduced a tenth specialist agent into our pipeline: the Evaluator. After the drafting agent produces an output, the Evaluator receives:

The original source document
The agent's instructions
The generated output

It then scores each claim in the output against the source document on three axes:

Attribution: Is this claim present in the source?
Accuracy: Does the claim match the source's meaning?
Omission: Is anything material from the source missing?

Results

Across 1,000 test runs on commercial contracts, the Evaluator caught:

94% of hallucinated clauses
87% of misattributed citations
72% of material omissions

The false positive rate, where the Evaluator flagged a correct output, was 3.2%.

Implementation notes

The Evaluator runs as a separate Claude API call with a strict system prompt. We use tools mode to structure its output as JSON scores rather than free text, making the results machine-readable for downstream logging and audit.

{
  "claim": "Section 4.2 grants a perpetual license",
  "attribution": true,
  "accuracy": true,
  "source_location": "Section 4.2, paragraph 1"
}

This structured output feeds directly into HalluCase for public benchmarking.

← All posts

Loading…