
The evaluator-optimizer loop: eliminating hallucinated clauses in multi-agent contract analysis

Technical Paper · Architecture · Evaluation

May 2026 · Tensflare Research

Introduction

As teams deploy large language models for contract drafting and analysis in high-stakes domains, hallucinated clauses remain the critical unsolved risk. An LLM might generate a non-existent legal standard, misinterpret a complex multi-part clause, or fabricate a citation that appears authoritative. Our approach introduces a strict multi-agent verification loop — an Evaluator Agent that cross-validates every output against its source document before it reaches the user.

The architecture

The evaluator-optimizer pipeline consists of:

Specialist agents (Draft, Review, Extract, Risk, Compliance, Search, Monitor) — each trained on specific contract domains
An Evaluator Agent — a stateless verification layer that receives every output from the specialist agents and checks it against the source document
A confidence scoring system — each extraction, clause, and obligation receives a 0.0–1.0 confidence score
Human escalation: outputs whose confidence score falls below the threshold are routed for human review with a chain-of-thought explanation
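The four components above can be sketched as a simple routing step. This is a minimal illustration, not the production implementation: the `Finding` type, the `route` function, the `verbatim_evaluator` stand-in, and the `0.8` threshold are all assumptions introduced here, since the paper does not state the threshold or the evaluator's interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    agent: str                # specialist that produced it, e.g. "Extract"
    text: str                 # the clause or obligation the agent produced
    confidence: float = 0.0   # filled in by the evaluator, 0.0 to 1.0


# Hypothetical value: the paper does not state the threshold it uses.
ESCALATION_THRESHOLD = 0.8


def route(
    findings: List[Finding],
    source: str,
    evaluator: Callable[[str, str], float],
    threshold: float = ESCALATION_THRESHOLD,
) -> Tuple[List[Finding], List[Finding]]:
    """Score each finding against the source and split into (accepted, escalated)."""
    accepted, escalated = [], []
    for finding in findings:
        score = evaluator(finding.text, source)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence out of range: {score}")
        finding.confidence = score
        # Anything below the threshold goes to a human reviewer.
        (accepted if score >= threshold else escalated).append(finding)
    return accepted, escalated


def verbatim_evaluator(claim: str, source: str) -> float:
    """Toy stand-in for the Evaluator Agent: 1.0 only if the claim
    appears verbatim (case-insensitive) in the source, else 0.0."""
    return 1.0 if claim.lower() in source.lower() else 0.0


source = "The Supplier shall deliver the goods within 30 days of the order date."
findings = [
    Finding("Extract", "deliver the goods within 30 days"),
    Finding("Risk", "a penalty of 5% applies to late delivery"),
]
accepted, escalated = route(findings, source, verbatim_evaluator)
```

Here the first finding is accepted because it is grounded in the source text, and the second is escalated because nothing in the source supports it. A real evaluator would return graded scores rather than 0.0 or 1.0.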

Mathematical formulation

We define the hallucination probability $P(H)$ for a given context length $L$ and complexity factor $C$, where $\lambda > 0$ is a rate parameter:

$$P(H) = 1 - e^{-\lambda (L \cdot C)}$$

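Passing the output through an independent Evaluator Agent reduces this probability. If the Evaluator has a false-negative rate of $\alpha$ (the fraction of hallucinated outputs it fails to flag), and its misses are independent of the generator's errors, the probability that a hallucination reaches the user becomes:

$$P_{sys}(H) = P(H) \cdot \alpha$$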

In practice, with $\alpha < 0.03$ on our evaluation set, the ratio $P_{sys}(H) / P(H) = \alpha$ stays below 3%, which corresponds to a reduction of more than 97% in hallucinated outputs reaching the user.
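A short numeric sketch makes the formulation concrete. The values of `lam`, `context_length`, and `complexity` below are illustrative assumptions, not figures from the evaluation; only `alpha = 0.03` comes from the bound quoted above.

```python
import math


def hallucination_probability(context_length: float, complexity: float, lam: float) -> float:
    """P(H) = 1 - exp(-lambda * L * C)."""
    return 1.0 - math.exp(-lam * context_length * complexity)


def system_probability(p_h: float, alpha: float) -> float:
    """P_sys(H) = P(H) * alpha, where alpha is the evaluator's false-negative rate."""
    return p_h * alpha


# Assumed, illustrative inputs (not reported in the paper).
lam = 1e-5             # rate parameter lambda
context_length = 8000  # L, e.g. in tokens
complexity = 1.5       # C, dimensionless
alpha = 0.03           # upper bound on the false-negative rate

p_h = hallucination_probability(context_length, complexity, lam)
p_sys = system_probability(p_h, alpha)
reduction = 1.0 - p_sys / p_h

print(f"P(H)      = {p_h:.4f}")
print(f"P_sys(H)  = {p_sys:.4f}")
print(f"reduction = {reduction:.2f}")
```

With these inputs $P(H)$ is about 0.113 and $P_{sys}(H)$ about 0.0034. The relative reduction equals $1 - \alpha$ regardless of the values chosen for $\lambda$, $L$, and $C$, which is why the 97% figure depends only on the evaluator's false-negative rate.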

Key findings

The Evaluator Agent catches 94% of hallucinated clauses that would otherwise pass quality review
Confidence scores correlate strongly with human expert ratings ($r = 0.89$)
Overhead is low: roughly 1.2 s of additional latency per contract page
False positives (flagging correct clauses as hallucinated) occur in 2.1% of cases
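To see what the catch rate and false-positive rate mean for a review queue, the sketch below applies them to a hypothetical batch. The batch size and number of hallucinated clauses are invented for illustration. It also assumes that the 94% catch rate applies to all hallucinated clauses in the batch and that the 2.1% false-positive rate is measured over correct clauses, which the findings above do not spell out.

```python
def review_workload(
    total_clauses: int,
    hallucinated: int,
    catch_rate: float = 0.94,           # from the key findings
    false_positive_rate: float = 0.021, # from the key findings
) -> dict:
    """Expected counts for a batch, under the assumptions stated above."""
    correct = total_clauses - hallucinated
    caught = catch_rate * hallucinated           # hallucinations flagged
    missed = hallucinated - caught               # hallucinations that slip through
    false_flags = false_positive_rate * correct  # correct clauses flagged anyway
    flagged = caught + false_flags
    return {
        "caught": caught,
        "missed": missed,
        "false_flags": false_flags,
        # Share of flagged clauses that are real hallucinations.
        "flag_precision": caught / flagged if flagged else 0.0,
    }


# Hypothetical batch: 1,000 clauses, 50 of them hallucinated.
stats = review_workload(total_clauses=1000, hallucinated=50)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

For this hypothetical batch the evaluator would flag about 47 real hallucinations and about 20 correct clauses, and miss about 3, so roughly 70% of what reaches a human reviewer would be a genuine problem. The proportion shifts with the base rate of hallucinations, which is an input here rather than a reported result.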

Implementation details

The Evaluator Agent is implemented as a separate model call with strict prompt isolation — it does not share context with the specialist agent that generated the output. This prevents confirmation bias where the evaluator might be influenced by the generator's reasoning chain.
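A minimal sketch of what prompt isolation could look like in code follows. The prompt wording, the `claim` and `reasoning` keys, and the `call_model` callable are hypothetical placeholders for whatever model client and output schema are actually in use; the point is only that the evaluator's prompt is built from the source document and the final claim, never from the generator's reasoning.

```python
from typing import Callable, Dict


def build_evaluator_prompt(source_document: str, claim: str) -> str:
    """Build a prompt containing only the source and the claim to verify."""
    return (
        "You are verifying a claim against a source document.\n"
        "Answer SUPPORTED or UNSUPPORTED.\n\n"
        f"SOURCE:\n{source_document}\n\n"
        f"CLAIM:\n{claim}\n"
    )


def evaluate(
    specialist_output: Dict[str, str],
    source_document: str,
    call_model: Callable[[str], str],
) -> bool:
    """Return True if the evaluator model judges the claim supported.

    Only the final claim is forwarded. The specialist's "reasoning" field
    (an assumed key) is deliberately dropped so it cannot bias the verdict.
    """
    prompt = build_evaluator_prompt(source_document, specialist_output["claim"])
    # One fresh, stateless call: no conversation history is passed along.
    verdict = call_model(prompt)
    return verdict.strip().upper().startswith("SUPPORTED")
```

Passing `call_model` in as a parameter keeps the sketch independent of any particular provider and makes the isolation property easy to test with a stub that records the prompt it receives.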

Conclusion

The evaluator-optimizer loop drastically improves safety for legal generation tasks while maintaining production throughput. We are publishing the evaluation methodology and will release a benchmark dataset for the community to build against.
