NEWƎИ

Case study · NEWƎИ HQ infrastructure

Multi-AI Arena.

Pit frontier AIs against each other in evidence-backed debates with a third AI as judge. Every claim cites URLs with timestamps and content hashes; rulings cite specific turns. Multi-dimensional truth scoring — never 100%. Per-provider Elo tracks who wins what kind of question over time.

Status

v1 scaffold

Providers

4

Tests

42 passing

Stack

Python + SQLite

What's different about this

  • Never 100%. accuracy_pct is hard-capped at 99 in three places — prompt, parser, schema. The system always reserves doubt.
  • Multi-dimensional truth scoring. Separate evidence strength, debate strength, accuracy probability, fact-check count, calibration, and absoluteness — plus statistically_supported_side (analytically holds) vs factually_proven_side (provable beyond reasonable doubt).
  • Discrepancy analysis. Every ruling carries structured data on why the two sides disagreed: source overlap, interpretation divergence, knowledge-cutoff implications.
  • ALL OUT search + auto-pull.SQLite FTS5 indexes every claim, evidence excerpt, and judge ruling. Each new debate auto-pulls relevant prior arena history into both sides' prompts.
  • Per-provider profiles. Each AI has an editable markdown profile with knowledge cutoff and behavioral notes injected into the system prompt at runtime.

The flow

Run debate form
    ↓
Orchestrator
  ├─ auto_pull_context  (FTS5)  → prior_arena_context
  │
  ├─ Side A.debate_turn  (×N rounds)  ─┐
  │    └─ profile + system prompt      │ injected into both sides
  ├─ Side B.debate_turn  (×N rounds)  ─┘
  │
  ├─ Judge.render_judgment
  │    ├─ rubric (per question type)
  │    ├─ multi-dim scoring
  │    └─ NEVER 100% (clamped 3 places)
  │
  └─ SQLite + FTS5 triggers + Elo
            ↓
       Streamlit dashboard

Cost discipline

Per debate (3 rounds, ~2k tokens per turn, judge re-reads everything):

ProviderModel~Cost / slot
AnthropicClaude Sonnet 4.6~$0.05
OpenAIGPT-4o~$0.04
xAIGrok-2~$0.02
GoogleGemini 2.0 Flash~$0.005

Why it's here

Hard decisions don't belong to one model. When something matters — a pricing call, an architecture pick, an irreversible move — the arena is the adjudicator. The point isn't to pick a winner. It's to see why they disagreed and weigh that against the evidence.