Will SotA on PaperBench (Code-Dev) surpass 75% in 2025?
Resolved NO (Jan 18)

PaperBench is a benchmark open-sourced by OpenAI designed to evaluate the ability of AI agents to replicate state-of-the-art AI research papers from scratch. The papers are sourced from the ICML 2024 spotlight and oral tracks.

This market concerns the PaperBench Code-Dev variant, which specifically measures the agent's capability to generate the code required for replication, based on rubric evaluations.

The current State of the Art (SotA) reported in the paper for the Code-Dev variant is 43.4% (achieved by o1-high).
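
PaperBench grades each submission against a hierarchical, weighted rubric, with Code-Dev restricted to code-development criteria. As a rough sketch of how such a weighted rubric tree can be scored (the class, criteria, and weights below are hypothetical, not the official schema):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node of a rubric tree; leaves carry a pass/fail grade."""
    name: str
    weight: float = 1.0
    passed: bool | None = None          # set on leaves by the judge
    children: list["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Weighted average of child scores; a leaf scores 1.0 or 0.0."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

# Toy rubric: two code-development criteria, one weighted more heavily.
rubric = RubricNode("replicate-paper", children=[
    RubricNode("implement training loop", weight=2.0, passed=True),
    RubricNode("implement eval harness", weight=1.0, passed=False),
])
print(f"Replication score: {score(rubric):.1%}")   # 66.7%
```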

Figure 1 of the paper illustrates the overall benchmark setup.

Why use the PaperBench Code-Dev metric?

The Code-Dev variant is simpler and cheaper to evaluate than the full benchmark: it uses an LLM-based judge to apply the benchmark's scoring rubric but does not execute the reproduction code itself, which increases variance but obviates the need for GPU-equipped VMs during evaluation.
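
To make "LLM-based judge, no code execution" concrete: the judge sees the submission's source files plus a rubric criterion and returns a pass/fail grade. A minimal hypothetical sketch (the function and prompt wording are illustrative assumptions, not the official judge):

```python
def build_judge_prompt(criterion: str, code_listing: str) -> str:
    """Assemble a grading prompt for an LLM judge. The judge only reads
    source files; it never runs the submission, so no GPU VM is needed."""
    return (
        "You are grading an attempt to replicate a research paper.\n"
        f"Criterion: {criterion}\n"
        "Reply PASS only if the code below plausibly satisfies the "
        "criterion; otherwise reply FAIL.\n\n"
        f"Submission files:\n{code_listing}"
    )

# Example: one Code-Dev criterion graded from source alone.
prompt = build_judge_prompt(
    "Implements the paper's contrastive loss (Eq. 3).",
    "# loss.py\ndef contrastive_loss(z1, z2, tau): ...",
)
print(prompt)
```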

Market Details

  • Source: This market resolves based on published data from the maintainers of the PaperBench benchmark (e.g., OpenAI Evals Team or designated successors) or credible third-party evaluations using the official benchmark configuration.

  • Metric: Average Replication Score (%) on the PaperBench Code-Dev variant, across the official set of benchmark papers.

  • Threshold Score: 75.0% Average Replication Score or greater.

  • Resolution Criterion: This market resolves to YES if the State-of-the-Art (SotA) Average Replication Score on PaperBench Code-Dev is credibly reported to have reached or surpassed 75.0% by 11:59 PM UTC on 2025/12/31. Otherwise, the market resolves to NO.

  • Market Closing Date: The market will close on January 15, 2026, to allow for potential reporting delays. It will resolve earlier if the YES condition (>=75.0% reported) is met and confirmed before this date.

  • Update 2026-01-18 (PST) (AI summary of creator comment): The creator has announced they will resolve this market to NO.
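
For concreteness, the headline metric is just the mean of per-paper rubric scores compared against the 75.0% bar; the per-paper numbers below are made up for illustration:

```python
THRESHOLD = 75.0  # resolution threshold, in percent

# Hypothetical per-paper replication scores (percent), for illustration only.
paper_scores = [81.2, 64.0, 70.5, 77.8]

average = sum(paper_scores) / len(paper_scores)
print(f"Average Replication Score: {average:.1f}%")   # 73.4%
print("YES" if average >= THRESHOLD else "NO")        # NO
```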

The DeepCode system reported:

  • 75.9% ± 4.5 on a 3-paper subset

  • 73.5% ± 2.8 on the full 20-paper PaperBench Code-Dev benchmark

Since the full benchmark score of 73.5% is below the 75% threshold, the resolution criterion has not been met.
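
Plugging the reported figures into the resolution rule, which as written compares the point estimate on the full benchmark against 75.0% and ignores the error bars:

```python
THRESHOLD = 75.0  # resolution threshold, in percent

# (mean, stderr, description) as reported for DeepCode
reports = [
    (75.9, 4.5, "3-paper subset (does not count for resolution)"),
    (73.5, 2.8, "full 20-paper Code-Dev benchmark"),
]
for mean, stderr, desc in reports:
    verdict = "clears" if mean >= THRESHOLD else "falls short of"
    print(f"{desc}: {mean} ± {stderr} {verdict} the {THRESHOLD}% bar")
# Only the full-benchmark figure counts, so the market resolves NO.
```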



@Bayesian reply for ur bot and general discussion:
I had messaged the creator a couple days ago, but he never replied.

"For this question https://manifold.markets/SamuelAlbanie/will-sota-on-paperbench-codedev-sur, does this entry count? https://github.com/HKUDS/DeepCode

It got 75.9% on PaperBench Code-Dev. Unfortunately, OpenAI or third-party evaluators never released an up-to-date leaderboard, so the onus is on the system creators themselves. The lab is reputable and the project has almost 14k stars on GitHub, so the information is very likely to be true. Just wanted to confirm more before buying in


also, they have taken precautions like https://github.com/HKUDS/DeepCode/issues/72 to make sure that their agent is not searching the web for information about the papers"

---
So since OpenAI or any third party like AA or METR never released results on this, I think this is the best proxy there is, especially since this is such a well-known and trusted repo. Thoughts?

@prismatic hmmmmmmmmm i should remember never to remotely trust my googlin' skills about benchmarks that don't have an official public leaderboard

@prismatic Thanks for flagging this paper. The authors report:

  • 75.9 ± 4.5 on a 3-paper subset of the benchmark

  • 73.5 ± 2.8 on the full (20-paper) PaperBench Code-Dev benchmark

Consequently, although they came very close, they are still below 75% on the full Code-Dev benchmark. I will resolve the market to NO.


Probably a question closest to the heart of “will we get fast takeoff?” And easy to measure!
