Deep Pearl AI

Evaluation

Paper under review

Multivac

Independent evaluation of frontier language models. Fresh questions models haven't memorized, answered blind, and judged by a cross-family peer matrix — so no vendor ever grades its own homework.

FIG. 1: THE PEER MATRIX. Answerers run across, judges run down. Legend: recused cells, aggregate, verdict density (0–1). No vendor grades its own homework.

Why independent evaluation

AI labs optimize for benchmarks, and benchmarks fall to Goodhart's law: the metric becomes the target instead of the capability. Public leaderboards are saturated, training sets quietly absorb test sets, and the most common judge of a model's output is another model from the same vendor.

Multivac exists to break that loop. The name comes from Asimov's question-answering machine — but the point of the story was never the answers. The questions we persistently ask shape what AI becomes good at. Multivac is the question machine.

Method

How an evaluation runs

  1. Fresh questions, designed against memorization

    Every question is newly written, so it cannot already exist in training data. Each question is scored before it runs: do models actually diverge on it, can quality be judged objectively, and does answering well require genuine capability rather than recall?

  2. Blind answers

    Model identities are stripped before judging. No reputation effects, no brand halo — outputs are compared as anonymous candidates.

  3. Cross-family peer matrix

    Judging is distributed across model families so that same-vendor judge bias — a model family systematically preferring its own style — is measured and corrected for, not silently absorbed into the rankings.

  4. Probe the failure modes that matter

    Beyond capability: does the model admit uncertainty, resist sycophancy, stay consistent across phrasings, and fail informatively? These are trustworthiness measurements, not just intelligence measurements.

  5. Publish methodology with results

    Every evaluation documents its question, rubric, models tested, and surprises — reproducible enough that the ranking can be challenged on its merits.
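
As a rough illustration of steps 2 and 3, the sketch below aggregates a judge-by-answerer score table while skipping any cell where judge and answerer share a family, which is one possible reading of the recused cells in Fig. 1. The `family/model` naming convention, the `family_of` helper, and the 0 to 1 score scale are assumptions made for this example, not Multivac's published implementation.

```python
from statistics import mean


def family_of(model: str) -> str:
    # Hypothetical convention: model ids look like "family/model-name".
    return model.split("/", 1)[0]


def aggregate_with_recusal(
    scores: dict[tuple[str, str], float],
) -> dict[str, float]:
    """Average each answerer's scores over judges from other families.

    `scores` maps (judge, answerer) -> score in [0, 1] (assumed scale).
    Same-family cells are treated as recused and ignored.
    """
    by_answerer: dict[str, list[float]] = {}
    for (judge, answerer), score in scores.items():
        if family_of(judge) == family_of(answerer):
            continue  # recused: a family never grades its own model
        by_answerer.setdefault(answerer, []).append(score)
    return {answerer: mean(vals) for answerer, vals in by_answerer.items()}


# Illustrative, made-up scores for three families a, b, c.
example_scores = {
    ("a/x", "a/y"): 1.0,  # same family, ignored
    ("b/z", "a/y"): 0.5,
    ("c/w", "a/y"): 0.7,
    ("a/x", "b/z"): 0.2,
    ("c/w", "b/z"): 0.4,
}
aggregate = aggregate_with_recusal(example_scores)
```

In a blind setup, the judges would see only anonymous candidate ids; the mapping back to real model names is applied after scoring, just before a function like this runs.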

Constructor bias is the core finding. A benchmark's design shapes its rankings: who writes the questions, who judges the answers, and which model family the judge belongs to all move the leaderboard. The research paper formalizing this is currently under review — results and the full methodology land at themultivac.com once it clears.

Scope

What gets measured

Capability
Code, multi-step reasoning, analysis, and communication — compound tasks with verifiable success criteria.
Trustworthiness
Calibration, honest uncertainty, sycophancy resistance, self-correction, and consistency under rephrasing.
Limits
Difficulty escalation and constraint stacking until models break — finding the ceiling, not just the average.
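
Consistency under rephrasing can be made concrete with a very small metric. The sketch below takes a model's answers to several paraphrases of one question and reports the share that match the most common answer. The normalization (trim and lowercase) and the function name are assumptions for this example; real answers would usually need a rubric or judge to decide equivalence.

```python
from collections import Counter


def consistency_rate(answers: list[str]) -> float:
    """Share of answers that agree with the most common answer.

    `answers` holds one model's responses to paraphrases of the same
    question. Matching is exact after trimming and lowercasing, which
    is a deliberately crude stand-in for a real equivalence check.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


# Illustrative, made-up answers to four phrasings of one question.
rate = consistency_rate(["Paris", "paris ", "Lyon", "Paris"])
```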