Multivac

Independent evaluation of frontier language models. Fresh questions models haven't memorized, answered blind, and judged by a cross-family peer matrix — so no vendor ever grades its own homework.

themultivac.com

FIG. 1. The peer matrix. Answerers run across, judges run down. Legend: recused, aggregate, verdict density (0–1). No vendor grades its own homework.

Why independent evaluation

AI labs optimize for benchmarks, and benchmarks fall to Goodhart's law: the metric becomes the target instead of the capability. Public leaderboards are saturated, training sets quietly absorb test sets, and the most common judge of a model's output is another model from the same vendor.

Multivac exists to break that loop. The name comes from Asimov's question-answering machine — but the point of the story was never the answers. The questions we persistently ask shape what AI becomes good at. Multivac is the question machine.

Method

How an evaluation runs

Fresh questions, designed against memorization

Every question is new, so it cannot already exist in training data. Each question is scored before it runs: do models actually diverge on it, can quality be judged objectively, and does answering well require genuine capability rather than recall?

Blind answers

Model identities are stripped before judging. No reputation effects, no brand halo — outputs are compared as anonymous candidates.
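
As a rough illustration of the blinding step, the sketch below shuffles answers and replaces model names with neutral labels, keeping a private key for un-blinding after judging. The function name `blind`, the `candidate-N` label format, and the model names are hypothetical and not taken from Multivac's implementation.

```python
import random


def blind(answers, seed=None):
    """Return (candidates, key) with model identities removed.

    `answers` maps a model name to its answer text (hypothetical shape).
    `candidates` maps an anonymous label to the answer text.
    `key` maps the same label back to the model name and stays private
    until judging is finished.
    """
    rng = random.Random(seed)
    items = list(answers.items())
    rng.shuffle(items)  # shuffle so label order reveals nothing about the model
    candidates = {}
    key = {}
    for index, (model, text) in enumerate(items, start=1):
        label = f"candidate-{index}"
        candidates[label] = text
        key[label] = model
    return candidates, key


if __name__ == "__main__":
    demo = {"model-a": "Answer one", "model-b": "Answer two"}
    shown_to_judges, private_key = blind(demo, seed=7)
    print(shown_to_judges)
```

Relabeling only hides the name. An answer that identifies its own author in the text would need separate handling, which this sketch does not attempt.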

Cross-family peer matrix

Judging is distributed across model families so that same-vendor judge bias — a model family systematically preferring its own style — is measured and corrected for, not silently absorbed into the rankings.
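
One way to picture the peer matrix is as a judge-by-answerer table of scores. The sketch below assumes scores in the range 0 to 1, treats recusal of same-family judges as the correction, and reports a crude self-preference gap as the measurement. The data layout and the names `aggregate` and `self_preference` are illustrative assumptions, not the published method.

```python
from statistics import mean


def aggregate(scores, family):
    """Mean score per answerer, counting only judges from other families.

    scores[judge][answerer] is a score in [0, 1] (assumed layout).
    family[model] is the vendor family of that model.
    Same-family cells are recused, that is, left out of the mean.
    """
    answerers = sorted({a for row in scores.values() for a in row})
    result = {}
    for answerer in answerers:
        kept = [
            row[answerer]
            for judge, row in scores.items()
            if answerer in row and family[judge] != family[answerer]
        ]
        result[answerer] = mean(kept) if kept else None
    return result


def self_preference(scores, family):
    """Mean same-family score minus mean cross-family score.

    A positive value suggests judges favor their own family.
    Returns None when either group is empty.
    """
    same, cross = [], []
    for judge, row in scores.items():
        for answerer, score in row.items():
            bucket = same if family[judge] == family[answerer] else cross
            bucket.append(score)
    if not same or not cross:
        return None
    return mean(same) - mean(cross)
```

With two families, every answerer is scored only by the other family. With more families, each answerer keeps several independent judges after recusal, which makes the aggregate less sensitive to any single judge.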

Probe the failure modes that matter

Beyond capability: does the model admit uncertainty, resist sycophancy, stay consistent across phrasings, and fail informatively? These are trustworthiness measurements, not just intelligence measurements.
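
Consistency across phrasings can be approximated by asking the same question several ways and checking how often the answers agree. The sketch below assumes short answers that can be compared after simple normalization. Real answers would likely need a rubric or a judge to decide equivalence, and the function names here are hypothetical.

```python
from collections import Counter


def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def consistency(answers):
    """Share of answers that match the most common normalized answer.

    `answers` holds one model's answers to paraphrases of a single question.
    Returns 1.0 when every paraphrase gets the same answer.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)
```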

Publish methodology with results

Every evaluation documents its question, rubric, models tested, and surprises — reproducible enough that the ranking can be challenged on its merits.

Constructor bias is the core finding. A benchmark's design shapes its rankings: who writes the questions, who judges the answers, and which model family the judge belongs to all move the leaderboard. The research paper formalizing this is currently under review; results and the full methodology will be published at themultivac.com once the paper clears review.

Scope

What gets measured

Capability: Code, multi-step reasoning, analysis, and communication — compound tasks with verifiable success criteria.
Trustworthiness: Calibration, honest uncertainty, sycophancy resistance, self-correction, and consistency under rephrasing.
Limits: Difficulty escalation and constraint stacking until models break — finding the ceiling, not just the average.
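
Calibration, one of the trustworthiness measures listed above, is commonly scored with the Brier score: the mean squared gap between a stated confidence and whether the answer turned out to be correct. The source does not say which calibration metric Multivac uses, so the sketch below is one standard option rather than the platform's method.

```python
def brier_score(confidences, outcomes):
    """Mean squared error between stated confidence and actual correctness.

    confidences: probabilities in [0, 1] that the model assigned to its answers.
    outcomes: booleans, True where the answer was correct.
    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((c - float(o)) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)
```

A model that claims certainty and is wrong is penalized most heavily, which matches the emphasis on honest uncertainty.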
