Introduction
As teams deploy large language models for contract drafting and analysis in high-stakes domains, hallucinated clauses remain the critical unsolved risk. An LLM might generate a non-existent legal standard, misinterpret a complex multi-part clause, or fabricate a citation that appears authoritative. Our approach introduces a strict multi-agent verification loop in which an Evaluator Agent cross-validates every output against its source document before it reaches the user.
The architecture
The evaluator-optimizer pipeline consists of:
- Specialist agents (Draft, Review, Extract, Risk, Compliance, Search, Monitor) — each trained on specific contract domains
- An Evaluator Agent — a stateless verification layer that receives every output from the specialist agents and checks it against the source document
- A confidence scoring system — each extraction, clause, and obligation receives a 0.0–1.0 confidence score
- Human escalation — outputs below threshold are routed for human review with chain-of-thought explanation
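The components above can be sketched as a small routing step. This is a minimal illustration and not the production implementation: the names `AgentOutput`, `Verdict`, `route`, and `toy_evaluator` are hypothetical, and the value of `REVIEW_THRESHOLD` is an assumption, since the actual cutoff is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical cutoff; the real threshold is not given in this post.
REVIEW_THRESHOLD = 0.8


@dataclass(frozen=True)
class AgentOutput:
    agent: str  # which specialist produced it, e.g. "Extract" or "Risk"
    text: str   # the clause, extraction, or obligation


@dataclass(frozen=True)
class Verdict:
    output: AgentOutput
    confidence: float  # 0.0-1.0 score assigned by the evaluator
    explanation: str   # reasoning shown to the human reviewer on escalation


# An evaluator scores one output against the source document text.
Evaluator = Callable[[AgentOutput, str], Verdict]


def route(
    outputs: List[AgentOutput],
    source: str,
    evaluate: Evaluator,
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[Verdict], List[Verdict]]:
    """Split evaluated outputs into (released, escalated) by confidence."""
    released: List[Verdict] = []
    escalated: List[Verdict] = []
    for out in outputs:
        verdict = evaluate(out, source)
        if verdict.confidence >= threshold:
            released.append(verdict)
        else:
            escalated.append(verdict)
    return released, escalated


def toy_evaluator(output: AgentOutput, source: str) -> Verdict:
    """Stand-in for the model call: confident only on a verbatim match."""
    found = output.text in source
    reason = "verbatim match in source" if found else "text not found in source"
    return Verdict(output, 1.0 if found else 0.0, reason)
```

In this sketch the evaluator is passed in as a plain function, so the routing logic holds no state of its own and can be exercised with a stub before a real model call is wired in.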
Mathematical formulation
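We write $P_h(L, c)$ for the hallucination probability of a specialist agent's output, given context length $L$ and complexity factor $c$.
Passing the output through an independent Evaluator Agent reduces this probability. If the Evaluator has a false-negative rate of $\epsilon$, meaning it fails to flag a fraction $\epsilon$ of hallucinated outputs, the combined system error becomes:
$$P_{\text{sys}}(L, c) = \epsilon \cdot P_h(L, c)$$
In practice, with $\epsilon$ as measured on our evaluation set, this yields a 97%+ reduction in hallucinated outputs reaching the user.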
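As a worked illustration, the sketch below assumes the combined error is the generator's hallucination probability multiplied by the Evaluator's false-negative rate, which presumes the Evaluator's misses are independent of the generator's errors. The function names and the example values (a 10% hallucination probability and a 3% false-negative rate) are hypothetical and chosen only to show the arithmetic; they are not measurements from the evaluation set.

```python
def combined_error(p_hallucination: float, false_negative_rate: float) -> float:
    """Probability that a hallucinated output is generated AND missed.

    Assumes the evaluator's misses are independent of the generator's errors.
    """
    for name, value in (
        ("p_hallucination", p_hallucination),
        ("false_negative_rate", false_negative_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return p_hallucination * false_negative_rate


def relative_reduction(false_negative_rate: float) -> float:
    """Fraction of hallucinated outputs the evaluator stops from reaching users."""
    return 1.0 - false_negative_rate


# Illustrative values only (assumptions, not reported results).
p_h = 0.10
fn_rate = 0.03
print(f"combined error:     {combined_error(p_h, fn_rate):.4f}")
print(f"relative reduction: {relative_reduction(fn_rate):.0%}")
```

Under this model the relative reduction depends only on the false-negative rate, so a reduction of 97% or more corresponds to the Evaluator missing at most 3% of hallucinated outputs.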
Key findings
- The Evaluator Agent catches 94% of hallucinated clauses that would otherwise pass quality review
- Confidence scores correlate strongly with human expert ratings
- The overhead is minimal: ~1.2s additional latency per contract page
- False positives (flagging correct clauses as hallucinated) occur in 2.1% of cases
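For readers who want to compute comparable figures on their own labeled data, here is a minimal sketch of the catch rate and false-positive rate. The function name `evaluator_metrics` and the toy records are hypothetical. The sketch also assumes the false-positive rate is taken over correct clauses only; the finding above does not specify the denominator.

```python
from typing import Dict, Iterable, Tuple


def evaluator_metrics(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute evaluator rates from (is_hallucinated, was_flagged) pairs."""
    caught = missed = false_alarms = passed = 0
    for is_hallucinated, was_flagged in results:
        if is_hallucinated and was_flagged:
            caught += 1        # hallucination correctly flagged
        elif is_hallucinated:
            missed += 1        # hallucination that slipped through
        elif was_flagged:
            false_alarms += 1  # correct clause wrongly flagged
        else:
            passed += 1        # correct clause correctly released

    hallucinated = caught + missed
    correct = false_alarms + passed
    return {
        # Share of hallucinated clauses the evaluator flags.
        "catch_rate": caught / hallucinated if hallucinated else float("nan"),
        # Share of correct clauses the evaluator wrongly flags (assumed denominator).
        "false_positive_rate": false_alarms / correct if correct else float("nan"),
    }


# Toy labels for illustration only: 4 hallucinated clauses, 6 correct ones.
toy_results = (
    [(True, True)] * 3 + [(True, False)] * 1
    + [(False, True)] * 1 + [(False, False)] * 5
)
metrics = evaluator_metrics(toy_results)
```

The catch rate here is one minus the Evaluator's false-negative rate, so the two figures describe the same behavior from opposite sides.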
Implementation details
The Evaluator Agent is implemented as a separate model call with strict prompt isolation — it does not share context with the specialist agent that generated the output. This prevents confirmation bias where the evaluator might be influenced by the generator's reasoning chain.
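The isolation described above can be sketched as follows. This is an assumption-laden illustration: `GeneratorResult`, `build_evaluator_prompt`, `evaluate_isolated`, and the prompt wording are all hypothetical, and `call_model` stands in for whatever client function sends a prompt to the model and returns its reply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GeneratorResult:
    prompt: str     # what the specialist agent was asked
    reasoning: str  # the specialist's reasoning chain
    output: str     # the final clause or extraction


def build_evaluator_prompt(source_document: str, candidate_output: str) -> str:
    """Build the evaluator's input from the source and the output only."""
    return (
        "Verify the OUTPUT strictly against the SOURCE.\n"
        "List any claim in OUTPUT that SOURCE does not support.\n\n"
        f"SOURCE:\n{source_document}\n\n"
        f"OUTPUT:\n{candidate_output}\n"
    )


def evaluate_isolated(
    result: GeneratorResult,
    source_document: str,
    call_model: Callable[[str], str],
) -> str:
    """Run the evaluator as a fresh call with no generator context."""
    # Only result.output crosses the boundary. The generator's prompt and
    # reasoning are deliberately left out so they cannot sway the verdict.
    prompt = build_evaluator_prompt(source_document, result.output)
    return call_model(prompt)
```

Keeping the prompt builder's signature limited to the source and the candidate output makes the isolation a structural property of the code, since the generator's reasoning cannot be passed in by accident.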
Conclusion
The evaluator-optimizer loop drastically improves safety for legal generation tasks while maintaining production throughput. We are publishing the evaluation methodology and will release a benchmark dataset for the community to build against.