Multivac Physics
Graduate-level physics problems run through the same blind peer matrix as Multivac — measuring whether frontier models reason about the physical world or pattern-match around it.
Instrument
The difference a thousandth makes
Three copies of the same double pendulum, integrated live with the fourth-order Runge–Kutta method a problem set would demand. Their starting angles differ by ε = 10⁻³ rad. For about six seconds they agree. Then they don't.
[Live simulation readout: elapsed time T; separation ‖Δθ‖ between S1 and S3, initially 2.0e-3 rad. Legend: S1 starts at θ−ε, S2 at θ, S3 at θ+ε.]
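The instrument can be approximated offline with a short script. What follows is a minimal sketch, not the page's own implementation: it uses the standard equations of motion for a double pendulum with point masses on massless rods and a classical RK4 step. The masses, rod lengths, base angle, time step, and the choice to perturb only the upper angle are all assumptions, since the page does not state them. The time at which the copies part ways depends on those choices.

```python
import math

# Assumed parameters (not given in the source): unit masses and rod lengths.
G = 9.81
M1 = M2 = 1.0
L1 = L2 = 1.0

def derivs(state):
    """Time derivative of (theta1, omega1, theta2, omega2)."""
    t1, w1, t2, w2 = state
    d = t1 - t2
    den = 2 * M1 + M2 - M2 * math.cos(2 * d)
    a1 = (-G * (2 * M1 + M2) * math.sin(t1)
          - M2 * G * math.sin(t1 - 2 * t2)
          - 2 * math.sin(d) * M2 * (w2 * w2 * L2 + w1 * w1 * L1 * math.cos(d))
          ) / (L1 * den)
    a2 = (2 * math.sin(d) * (w1 * w1 * L1 * (M1 + M2)
                             + G * (M1 + M2) * math.cos(t1)
                             + w2 * w2 * L2 * M2 * math.cos(d))
          ) / (L2 * den)
    return (w1, a1, w2, a2)

def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))
    k1 = derivs(state)
    k2 = derivs(shifted(k1, dt / 2))
    k3 = derivs(shifted(k2, dt / 2))
    k4 = derivs(shifted(k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + e)
                 for s, a, b, c, e in zip(state, k1, k2, k3, k4))

def energy(state):
    """Total mechanical energy, useful as a sanity check on the integrator."""
    t1, w1, t2, w2 = state
    kinetic = (0.5 * M1 * (L1 * w1) ** 2
               + 0.5 * M2 * ((L1 * w1) ** 2 + (L2 * w2) ** 2
                             + 2 * L1 * L2 * w1 * w2 * math.cos(t1 - t2)))
    potential = -(M1 + M2) * G * L1 * math.cos(t1) - M2 * G * L2 * math.cos(t2)
    return kinetic + potential

def separation(a, b):
    """Euclidean distance between two states in (theta1, theta2)."""
    return math.hypot(a[0] - b[0], a[2] - b[2])

def simulate(theta, eps=1e-3, t_end=10.0, dt=1e-3):
    """Run S1, S2, S3 from rest; return the S1-to-S3 separation at each step.

    Assumption: only the upper angle is offset by -eps, 0, +eps.
    """
    copies = [(theta + k * eps, 0.0, theta, 0.0) for k in (-1, 0, 1)]
    seps = [separation(copies[0], copies[2])]
    for _ in range(int(round(t_end / dt))):
        copies = [rk4_step(s, dt) for s in copies]
        seps.append(separation(copies[0], copies[2]))
    return seps

if __name__ == "__main__":
    seps = simulate(theta=2.0)  # 2.0 rad is an arbitrary large-angle start
    for t in range(0, 11, 2):
        print(f"T+{t:.1f}s  separation = {seps[t * 1000]:.3e} rad")
```

Watching `energy` stay nearly constant over a run is a quick way to confirm that any divergence comes from the dynamics rather than from integration error.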
Why physics
Physics is the cleanest stress test for reasoning. Problems have ground truth, partial credit is visible in the working, and the difference between a memorized formula and a derived result shows up immediately. A model that genuinely reasons can carry an argument from first principles through limiting cases to a checked answer; a model that pattern-matches produces something that merely looks like that.
It is also where evaluation meets the rest of the lab. Models that will act in the physical world, whether on robots, on edge devices, or in homes, need physical reasoning, and Multivac Physics is how we measure who actually has it.
Method
The same discipline, harder questions
- Problem set: graduate-level problems across mechanics, electromagnetism, thermodynamics, and quantum mechanics, written fresh rather than scraped.
- Blind peer matrix: solutions are anonymized and judged across model families, with cross-family confirmation before a result counts.
- What's scored: correctness of the final answer, validity of the derivation, dimensional sanity, and honesty about assumptions.
Cross-family confirmation is the heart of it: a solution is only credited when judges from different model families independently agree the derivation holds. Disagreement between families is itself a finding — it marks the problems where physics reasoning is least settled. Live results run at multivacphysics.com.
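One way to picture the crediting rule is as a small decision function. This is an illustrative sketch only: the `Verdict` type, the `confirm` function, the status labels, the two-family minimum, and the exclusion of judges from the author's own family are all hypothetical choices, not details published by Multivac Physics.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    judge_family: str  # hypothetical label for the judging model's family
    holds: bool        # True if the judge finds the derivation valid

def confirm(author_family, verdicts, min_families=2):
    """Classify a solution from its judges' verdicts.

    Assumptions (not from the source): judges sharing the author's family
    are ignored, a family accepts only if all of its judges accept, and at
    least `min_families` independent families must weigh in.
    """
    by_family = defaultdict(list)
    for v in verdicts:
        if v.judge_family != author_family:
            by_family[v.judge_family].append(v.holds)

    if len(by_family) < min_families:
        return "unconfirmed"   # not enough independent families yet

    family_accepts = [all(votes) for votes in by_family.values()]
    if all(family_accepts):
        return "credited"      # families independently agree it holds
    if not any(family_accepts):
        return "rejected"      # families independently agree it fails
    return "disputed"          # cross-family disagreement, itself a finding

if __name__ == "__main__":
    verdicts = [Verdict("family_b", True), Verdict("family_c", False)]
    print(confirm("family_a", verdicts))
```

Keeping "disputed" as its own outcome, instead of folding it into "rejected", preserves the disagreements that mark where physics reasoning is least settled.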