Why verification loops matter
When a contract review system misses an auto-renewal clause, or a due diligence report fabricates a citation, the cost isn't measured in API credits.
Traditional RAG pipelines retrieve relevant chunks and let the LLM generate a response in a single pass. There's no second look. No verification that the output actually matches the source.
The evaluator pattern
We introduced a tenth specialist agent into our pipeline — the Evaluator. After the drafting agent produces an output, the Evaluator receives:
- The original source document
- The agent's instructions
- The generated output
It then scores each claim in the output against the source document on three axes:
- Attribution: Is this claim present in the source?
- Accuracy: Does the claim match the source's meaning?
- Omission: Is anything material from the source missing?
Results
Across 1,000 test runs on commercial contracts, the Evaluator caught:
- 94% of hallucinated clauses
- 87% of misattributed citations
- 72% of material omissions
The false positive rate — where the Evaluator flagged a correct output — was 3.2%.
Implementation notes
The Evaluator runs as a separate Claude API call with a strict system prompt. We use tools mode to structure its output as JSON scores rather than free text, making the results machine-readable for downstream logging and audit.
{
"claim": "Section 4.2 grants a perpetual license",
"attribution": true,
"accuracy": true,
"source_location": "Section 4.2, paragraph 1"
}
This structured output feeds directly into HalluCase for public benchmarking.