Case study · NEWƎИ HQ infrastructure
Multi-AI Arena.
Pit frontier AIs against each other in evidence-backed debates with a third AI as judge. Every claim cites URLs with timestamps and content hashes; rulings cite specific turns. Multi-dimensional truth scoring — never 100%. Per-provider Elo tracks who wins what kind of question over time.
Status
v1 scaffold
Providers
4
Tests
42 passing
Stack
Python + SQLite
What's different about this
- Never 100%.
accuracy_pctis hard-capped at 99 in three places — prompt, parser, schema. The system always reserves doubt. - Multi-dimensional truth scoring. Separate evidence strength, debate strength, accuracy probability, fact-check count, calibration, and absoluteness — plus
statistically_supported_side(analytically holds) vsfactually_proven_side(provable beyond reasonable doubt). - Discrepancy analysis. Every ruling carries structured data on why the two sides disagreed: source overlap, interpretation divergence, knowledge-cutoff implications.
- ALL OUT search + auto-pull.SQLite FTS5 indexes every claim, evidence excerpt, and judge ruling. Each new debate auto-pulls relevant prior arena history into both sides' prompts.
- Per-provider profiles. Each AI has an editable markdown profile with knowledge cutoff and behavioral notes injected into the system prompt at runtime.
The flow
Run debate form
↓
Orchestrator
├─ auto_pull_context (FTS5) → prior_arena_context
│
├─ Side A.debate_turn (×N rounds) ─┐
│ └─ profile + system prompt │ injected into both sides
├─ Side B.debate_turn (×N rounds) ─┘
│
├─ Judge.render_judgment
│ ├─ rubric (per question type)
│ ├─ multi-dim scoring
│ └─ NEVER 100% (clamped 3 places)
│
└─ SQLite + FTS5 triggers + Elo
↓
Streamlit dashboardCost discipline
Per debate (3 rounds, ~2k tokens per turn, judge re-reads everything):
| Provider | Model | ~Cost / slot |
|---|---|---|
| Anthropic | Claude Sonnet 4.6 | ~$0.05 |
| OpenAI | GPT-4o | ~$0.04 |
| xAI | Grok-2 | ~$0.02 |
| Gemini 2.0 Flash | ~$0.005 |
Why it's here
Hard decisions don't belong to one model. When something matters — a pricing call, an architecture pick, an irreversible move — the arena is the adjudicator. The point isn't to pick a winner. It's to see why they disagreed and weigh that against the evidence.
