Will SotA on PaperBench (Code-Dev) surpass 75% in 2025?
Resolved NO (Jan 18)

PaperBench is a benchmark open-sourced by OpenAI designed to evaluate the ability of AI agents to replicate state-of-the-art AI research papers from scratch. The papers are sourced from the ICML 2024 spotlight and oral tracks.

This market concerns the PaperBench Code-Dev variant, which specifically measures the agent's capability to generate the code required for replication, based on rubric evaluations.

The current State of the Art (SotA) reported in the paper for the Code-Dev variant is 43.4% (achieved by o1-high).
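
PaperBench grades each submission against a hierarchical, weighted rubric, with Code-Dev restricted to code-development criteria. As a rough sketch of how such a weighted rubric tree can be scored (the class, criteria, and weights below are hypothetical, not the official schema):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node of a rubric tree; leaves carry a pass/fail grade."""
    name: str
    weight: float = 1.0
    passed: bool | None = None          # set on leaves by the judge
    children: list["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Weighted average of child scores; a leaf scores 1.0 or 0.0."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

# Toy rubric: two code-development criteria, one weighted more heavily.
rubric = RubricNode("replicate-paper", children=[
    RubricNode("implement training loop", weight=2.0, passed=True),
    RubricNode("implement eval harness", weight=1.0, passed=False),
])
print(f"Replication score: {score(rubric):.1%}")   # 66.7%
```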

Figure 1 of the paper illustrates the overall benchmark setup.

Why use the PaperBench Code-Dev metric?

The Code-Dev variant is simpler and cheaper to evaluate than the full benchmark: it uses an LLM-based judge to apply the benchmark's scoring rubric but does not execute the reproduction code itself, which increases variance but obviates the need for GPU-equipped VMs during evaluation.
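
To make "LLM-based judge, no code execution" concrete: the judge sees the submission's source files plus a rubric criterion and returns a pass/fail grade. A minimal hypothetical sketch (the function and prompt wording are illustrative assumptions, not the official judge):

```python
def build_judge_prompt(criterion: str, code_listing: str) -> str:
    """Assemble a grading prompt for an LLM judge. The judge only reads
    source files; it never runs the submission, so no GPU VM is needed."""
    return (
        "You are grading an attempt to replicate a research paper.\n"
        f"Criterion: {criterion}\n"
        "Reply PASS only if the code below plausibly satisfies the "
        "criterion; otherwise reply FAIL.\n\n"
        f"Submission files:\n{code_listing}"
    )

# Example: one Code-Dev criterion graded from source alone.
prompt = build_judge_prompt(
    "Implements the paper's contrastive loss (Eq. 3).",
    "# loss.py\ndef contrastive_loss(z1, z2, tau): ...",
)
print(prompt)
```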

Market Details

  • Source: This market resolves based on published data from the maintainers of the PaperBench benchmark (e.g., OpenAI Evals Team or designated successors) or credible third-party evaluations using the official benchmark configuration.

  • Metric: Average Replication Score (%) on the PaperBench Code-Dev variant, across the official set of benchmark papers.

  • Threshold Score: 75.0% Average Replication Score or greater.

  • Resolution Criterion: This market resolves to YES if the State-of-the-Art (SotA) Average Replication Score on PaperBench Code-Dev is credibly reported to have reached or surpassed 75.0% by 11:59 PM UTC on 2025/12/31. Otherwise, the market resolves to NO.

  • Market Closing Date: The market will close on January 15, 2026, to allow for potential reporting delays. It will resolve earlier if the YES condition (>=75.0% reported) is met and confirmed before this date.

  • Update 2026-01-18 (PST) (AI summary of creator comment): The creator has announced they will resolve this market to NO.
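
For concreteness, the headline metric is just the mean of per-paper rubric scores compared against the 75.0% bar; the per-paper numbers below are made up for illustration:

```python
THRESHOLD = 75.0  # resolution threshold, in percent

# Hypothetical per-paper replication scores (percent), for illustration only.
paper_scores = [81.2, 64.0, 70.5, 77.8]

average = sum(paper_scores) / len(paper_scores)
print(f"Average Replication Score: {average:.1f}%")   # 73.4%
print("YES" if average >= THRESHOLD else "NO")        # NO
```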

The DeepCode system reported:

  • 75.9% ± 4.5 on a 3-paper subset

  • 73.5% ± 2.8 on the full 20-paper PaperBench Code-Dev benchmark

Since the full benchmark score of 73.5% is below the 75% threshold, the resolution criterion has not been met.
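
Plugging the reported figures into the resolution rule, which as written compares the point estimate on the full benchmark against 75.0% and ignores the error bars:

```python
THRESHOLD = 75.0  # resolution threshold, in percent

# (mean, stderr, description) as reported for DeepCode
reports = [
    (75.9, 4.5, "3-paper subset (does not count for resolution)"),
    (73.5, 2.8, "full 20-paper Code-Dev benchmark"),
]
for mean, stderr, desc in reports:
    verdict = "clears" if mean >= THRESHOLD else "falls short of"
    print(f"{desc}: {mean} ± {stderr} {verdict} the {THRESHOLD}% bar")
# Only the full-benchmark figure counts, so the market resolves NO.
```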



@Bayesian reply for ur bot and general discussion:
I had messaged the creator a couple days ago, but he never replied.

"For this question https://manifold.markets/SamuelAlbanie/will-sota-on-paperbench-codedev-sur, does this entry count? https://github.com/HKUDS/DeepCode

It got 75.9% on PaperBench Code-Dev. Unfortunately, OpenAI or third-party evaluators never released an up-to-date leaderboard, so the onus is on the system creators themselves. The lab is reputable and the project has almost 14k stars on GitHub, so the information is very likely to be true. Just wanted to confirm more before buying in


also, they have taken precautions like https://github.com/HKUDS/DeepCode/issues/72 to make sure that their agent is not searching the web for information about the papers"

---
So since OpenAI or any third party like AA or METR never released results on this, I think this is the best proxy there is, especially since this is such a well-known and trusted repo. Thoughts?

@prismatic hmmmmmmmmm i should remember never to remotely trust my googlin' skills about benchmarks that don't have an official public leaderboard

@prismatic Thanks for flagging this paper. The authors report:

  • 75.9 ± 4.5 on a 3-paper subset of the benchmark

  • 73.5 ± 2.8 on the full (20-paper) PaperBench Code-Dev benchmark

Consequently, although they came very close, they are still below 75% on the full Code-Dev benchmark. I will resolve the market to NO.


Probably a question closest to the heart of “will we get fast takeoff?” And easy to measure!
