Background
PhysBench is a roughly 10,000-item benchmark of interleaved video, image and text questions that tests whether a vision-language model (VLM) can reason about the real-world physics governing everyday objects and scenes. It covers four domains (object properties, object relationships, scene understanding and future-state dynamics), split into 19 fine-grained tasks such as mass comparison, collision outcomes and fluid behaviour.
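For concreteness, a single item can be pictured as a multiple-choice physics question attached to visual context. The sketch below is purely illustrative; the field names, domain labels and example values are assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhysBenchItem:
    """Illustrative (unofficial) shape of one benchmark item."""
    item_id: str
    domain: str             # e.g. "property", "relationship", "scene", "dynamics"
    task: str               # one of the 19 fine-grained tasks, e.g. "mass_comparison"
    video_paths: List[str]  # zero or more video clips giving visual context
    image_paths: List[str]  # zero or more still images
    question: str           # natural-language physics question
    choices: List[str]      # multiple-choice options
    answer: str             # ground-truth choice label, e.g. "B"


example = PhysBenchItem(
    item_id="prop_0001",
    domain="property",
    task="mass_comparison",
    video_paths=[],
    image_paths=["images/prop_0001.jpg"],
    question="Which object on the table is heavier?",
    choices=["A. The ceramic mug", "B. The cast-iron pan"],
    answer="B",
)
```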
State of play
• Human reference accuracy: 95.87%
• Frontier AI as of Dec 2024 (InternVL2.5-38B): 51.94%
• No new record has been reported in the first half of 2025; the field remains ≈ 44 percentage points below human performance.
Why reaching human-level performance on PhysBench would be a major milestone:
• Physics-consistent video generation – A model that masters all four PhysBench domains should be able to create long-form videos, ads or even feature films in which liquids pour, cloth folds and shadows move as they would in the real world, eliminating the "physics mistakes" that plague today's AI-generated video.
PhysBench is thus a litmus test for whether next-generation multimodal models can move from "smart autocomplete" to physically grounded intelligence, a prerequisite for everything from autonomous robots to Hollywood-quality generative cinema.
Resolution Criteria
This market resolves to the year bracket in which a fully automated AI system first achieves an average accuracy of 95% or higher ("human-level") on the PhysBench ALL metric; an illustrative sketch of the threshold check appears after the criteria below.
Verification – Must be confirmed by a peer-reviewed paper, an arXiv preprint, or an independent leaderboard entry (e.g. LM-Eval Harness, PapersWithCode).
Autonomy – The model must solve the benchmark without human‑written answers or manual tool use; fine‑tuned tool calling is allowed if self‑directed by the AI.
Compute resources – Unrestricted.
If no model reaches 95% by 31 Dec 2041, the market resolves to "Not Applicable."
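For clarity, here is a minimal sketch of the threshold check referenced above, assuming the ALL metric is a plain micro-average of per-item correctness across every task; the official PhysBench evaluation script may aggregate differently, and the function names and record format below are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def physbench_all_accuracy(results: List[Tuple[str, str, str]]) -> Tuple[float, Dict[str, float]]:
    """Compute overall ("ALL") and per-task accuracy from (task, predicted, gold) records."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, pred, gold in results:
        total[task] += 1
        if pred.strip().upper() == gold.strip().upper():
            correct[task] += 1
    n = sum(total.values())
    all_acc = sum(correct.values()) / n if n else 0.0
    per_task = {t: correct[t] / total[t] for t in total}
    return all_acc, per_task


def clears_resolution_bar(all_acc: float, threshold: float = 0.95) -> bool:
    """True if a run meets the market's 95% ("human-level") criterion."""
    return all_acc >= threshold


if __name__ == "__main__":
    demo = [("mass_comparison", "B", "B"), ("collision", "A", "C"), ("fluid", "D", "D")]
    acc, by_task = physbench_all_accuracy(demo)
    print(f"ALL accuracy: {acc:.2%} | resolves: {clears_resolution_bar(acc)}")
```

A verified leaderboard or paper result would still be required; this snippet only illustrates the arithmetic behind the 95% bar.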