Multivac Physics

Graduate-level physics problems run through the same blind peer matrix as Multivac — measuring whether frontier models reason about the physical world or pattern-match around it.


Instrument

The difference a thousandth makes

Three copies of the same double pendulum, integrated live with the fourth-order Runge–Kutta method (RK4) that a problem set would demand. Their starting angles differ by ε. For about six seconds they agree. Then they don't.

Parameters: RK4 · h = 4 ms · 3 systems · ε = 1e-3 rad · m₁ = m₂ = 1 · L₁ = L₂ = 1 · g = 9.81

Systems: S1 at θ − ε · S2 at θ · S3 at θ + ε. Separation ‖Δθ‖ between S1 and S3 at T+0.0 s: 2.0e-3 rad.

FIG. 2: SENSITIVE DEPENDENCE. Three systems, identical to one part in a thousand. Rigor, not vibes.
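
For readers who want to reproduce the experiment offline, here is a minimal sketch in Python. It uses the standard equations of motion for a planar double pendulum with point masses on massless rods, and the parameters listed above (h = 4 ms, ε = 1e-3 rad, unit masses and lengths, g = 9.81). Several things are assumptions rather than details taken from the instrument: the starting angle of 2.0 rad for both arms, release from rest, applying the ε offset to the first angle only (which is consistent with a 2.0e-3 rad initial separation between S1 and S3), and every function name.

```python
import math

# Parameters quoted in the figure.
G = 9.81          # gravitational acceleration
M1 = M2 = 1.0     # bob masses
L1 = L2 = 1.0     # rod lengths
H = 0.004         # RK4 step, 4 ms
EPS = 1e-3        # offset in starting angle, rad


def deriv(state):
    """Time derivative of (theta1, omega1, theta2, omega2)."""
    t1, w1, t2, w2 = state
    d = t1 - t2
    den = 2 * M1 + M2 - M2 * math.cos(2 * d)  # never below 2*M1, so no singularity
    a1 = (-G * (2 * M1 + M2) * math.sin(t1)
          - M2 * G * math.sin(t1 - 2 * t2)
          - 2 * math.sin(d) * M2 * (w2 * w2 * L2 + w1 * w1 * L1 * math.cos(d))
          ) / (L1 * den)
    a2 = (2 * math.sin(d) * (w1 * w1 * L1 * (M1 + M2)
                             + G * (M1 + M2) * math.cos(t1)
                             + w2 * w2 * L2 * M2 * math.cos(d))
          ) / (L2 * den)
    return (w1, a1, w2, a2)


def rk4_step(state, h):
    """Advance one state by a single classical fourth-order Runge-Kutta step."""
    k1 = deriv(state)
    k2 = deriv(tuple(x + 0.5 * h * k for x, k in zip(state, k1)))
    k3 = deriv(tuple(x + 0.5 * h * k for x, k in zip(state, k2)))
    k4 = deriv(tuple(x + h * k for x, k in zip(state, k3)))
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + e)
                 for x, a, b, c, e in zip(state, k1, k2, k3, k4))


def energy(state):
    """Total mechanical energy; a sanity check on the integrator."""
    t1, w1, t2, w2 = state
    kinetic = (0.5 * M1 * (L1 * w1) ** 2
               + 0.5 * M2 * ((L1 * w1) ** 2 + (L2 * w2) ** 2
                             + 2 * L1 * L2 * w1 * w2 * math.cos(t1 - t2)))
    potential = -(M1 + M2) * G * L1 * math.cos(t1) - M2 * G * L2 * math.cos(t2)
    return kinetic + potential


def separation(a, b):
    """Euclidean distance between two systems in (theta1, theta2)."""
    return math.hypot(a[0] - b[0], a[2] - b[2])


def simulate(theta0=2.0, seconds=10.0, h=H, eps=EPS):
    """Integrate S1, S2, S3 and return (times, S1-S3 separations, final states).

    theta0 and the rest start are assumed values, not the instrument's.
    """
    systems = [(theta0 + k * eps, 0.0, theta0, 0.0) for k in (-1, 0, 1)]
    times = [0.0]
    seps = [separation(systems[0], systems[2])]
    for n in range(1, round(seconds / h) + 1):
        systems = [rk4_step(s, h) for s in systems]
        times.append(n * h)
        seps.append(separation(systems[0], systems[2]))
    return times, seps, systems


if __name__ == "__main__":
    times, seps, _ = simulate()
    for i in range(0, len(times), 250):  # one line per simulated second
        print(f"T+{times[i]:4.1f}s  separation S1-S3 = {seps[i]:.3e} rad")
```

The exact moment at which the three systems part ways depends on the starting angle, so this sketch should not be expected to reproduce the six-second figure. Checking `energy` before and after a run is a cheap way to confirm that the divergence comes from the dynamics rather than from integrator error.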

Why physics

Physics is the cleanest stress test for reasoning. Problems have ground truth, partial credit is visible in the working, and the difference between a memorized formula and a derived result shows up immediately. A model that genuinely reasons can carry an argument from first principles through limiting cases to a checked answer; a model that pattern-matches produces something that merely looks like that.

It is also where evaluation meets the rest of the lab. Models that will ever act in the physical world — on robots, on edge devices, in homes — need physical reasoning, and Multivac Physics is how we measure who actually has it.

Method

The same discipline, harder questions

Problem set: Graduate-level problems across mechanics, electromagnetism, thermodynamics, and quantum — written fresh, not scraped.
Blind peer matrix: Solutions are anonymized and judged across model families, with cross-family confirmation before a result counts.
What's scored: Correctness of the final answer, validity of the derivation, dimensional sanity, and honesty about assumptions.

Cross-family confirmation is the heart of it: a solution is only credited when judges from different model families independently agree the derivation holds. Disagreement between families is itself a finding — it marks the problems where physics reasoning is least settled. Live results run at multivacphysics.com.
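
The crediting rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production logic: the `Verdict` type, the status labels, the two-family minimum, and the choice to ignore judges from the solver's own family are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judge_family: str  # hypothetical family label, e.g. "family_b"
    holds: bool        # the judge's view on whether the derivation holds


def family_positions(solver_family, verdicts):
    """Collapse verdicts to one position per judging family.

    Assumption: judges from the solver's own family are ignored, and a
    family approves only if every one of its judges approves.
    """
    positions = {}
    for v in verdicts:
        if v.judge_family == solver_family:
            continue
        positions[v.judge_family] = positions.get(v.judge_family, True) and v.holds
    return positions


def confirm(solver_family, verdicts, min_families=2):
    """Return "credited", "disputed", or "uncredited" for one solution."""
    positions = family_positions(solver_family, verdicts)
    approving = [family for family, ok in positions.items() if ok]
    if len(positions) >= min_families and len(approving) == len(positions):
        return "credited"      # enough independent families, all agree
    if approving and len(approving) < len(positions):
        return "disputed"      # families split: recorded as a finding
    return "uncredited"        # rejected, or too few outside families


if __name__ == "__main__":
    verdicts = [Verdict("family_b", True), Verdict("family_c", False)]
    print(confirm("family_a", verdicts))
```

Returning a separate "disputed" status, rather than folding it into a failure, keeps the cross-family disagreements available for later analysis.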