
Algorithmic progress can be measured by reduction in cost to achieve equivalent performance. SWE-bench-lite is a popular benchmark for measuring scaffolded-LLM SWE capabilities.
By what factor will the cost of SWE-bench-lite SoTA drop between mid-2024 and mid-2025? Mid-2024 SoTA is 43% costing $2,700 (per the devs), so this question will resolve Yes on the answer which most tightly bounds the reduction in cost to achieve 43% on July 1, 2025.
E.g. if in June 2025, 43% on SWE-lite costs $500 then that'd be a 5.4x reduction and the question would resolve (2) "<10x".
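The worked example above can be sketched in a few lines of Python. This is only an illustration of the "tightest bound" rule: the function names are made up here, and the list of answer thresholds is hypothetical. Only "<10x" and "<250x" are named on this page; the other values are placeholders.

```python
def reduction_factor(baseline_cost: float, new_cost: float) -> float:
    """How many times cheaper the new cost is than the baseline."""
    return baseline_cost / new_cost


def tightest_bucket(factor: float, upper_bounds: list[int]) -> str:
    """Return the smallest '<Nx' answer that still bounds the factor."""
    for bound in sorted(upper_bounds):
        if factor < bound:
            return f"<{bound}x"
    # Factor exceeds every listed bound.
    return f">={max(upper_bounds)}x"


# HYPOTHETICAL thresholds: only 10 and 250 appear in the source text.
ASSUMED_BOUNDS = [2, 10, 50, 250]

factor = reduction_factor(2700, 500)  # $2,700 baseline vs. the $500 example
print(round(factor, 1), tightest_bucket(factor, ASSUMED_BOUNDS))
```

With the example's numbers the factor falls between the first two assumed thresholds, so the "<10x" answer is the tightest one that holds.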
Update 2025-08-31 (PST) (AI summary of creator comment):
- Revised baseline (mid-2024): Uses Alibaba's Lingma Agent with Claude 3.5 Sonnet at 38% costing about $2.18/problem, replacing the previously stated $2,700 for 43% on SWE-bench-lite.
- 2025 cost will be inferred via proxies (not strictly “43% on SWE-bench-lite on July 1, 2025”): e.g., GPT-5 Nano at about $0.01/instance on full SWE-bench, and model price changes like Qwen3 a3b coder ($0.08 per million output tokens) vs. Sonnet 3.5 ($15 per million output tokens).
- Estimated reduction is ~50x–250x; absent compelling contrary evidence, the market will be resolved to "<250x" soon.
This resolution is a bit more speculative than I would have liked, but here's my current guess:
https://www.swebench.com/ shows mid-2024 SoTA was Alibaba's Lingma Agent at 33%, but looking at their paper they spent $2.18 per problem to achieve 38% with 3.5 Sonnet. So that's our mid-2024 baseline cost.
Now it gets a bit trickier for 2025, but a ballpark (outside our time horizon) is that GPT-5 Nano solves 3X% of the (probably harder) complete SWE-bench at $0.01 per instance. That's a roughly 200x reduction.
I believe qwen3 a3b coder was released after the cutoff and costs $0.08 per million output tokens, whereas 3.5 Sonnet cost $15, again roughly a 200x drop. I'm not clear on what the exact scaffold+model SoTA was in June 2025, but I think that gives us strong evidence that this was around 200x.
So I'm very unsure about the exact value, but my 80% CI is roughly (50x, 250x). Unless anyone provides compelling evidence against this, I'll resolve to <250x next week.
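For anyone who wants to check the arithmetic behind the two proxies above, here is a rough sketch. The variable names are illustrative, and the inputs are just the figures quoted in this comment (per-problem cost from the Lingma paper, per-instance cost for GPT-5 Nano, and prices per million output tokens). Neither ratio is a like-for-like measurement of "43% on SWE-bench-lite".

```python
# Proxy 1: cost per solved-benchmark instance (figures quoted in the comment).
baseline_cost_per_problem = 2.18      # Lingma Agent + 3.5 Sonnet, mid-2024
gpt5_nano_cost_per_instance = 0.01    # GPT-5 Nano on full SWE-bench

# Proxy 2: price per million output tokens (figures quoted in the comment).
sonnet_35_price = 15.00
qwen3_a3b_coder_price = 0.08

task_cost_ratio = baseline_cost_per_problem / gpt5_nano_cost_per_instance
token_price_ratio = sonnet_35_price / qwen3_a3b_coder_price

# The stated 80% interval for the true reduction factor.
ci_low, ci_high = 50, 250

for name, ratio in [("task cost", task_cost_ratio),
                    ("token price", token_price_ratio)]:
    inside = ci_low < ratio < ci_high
    print(f"{name}: ~{ratio:.0f}x, inside ({ci_low}x, {ci_high}x): {inside}")
```

Both ratios land close to 200x and inside the (50x, 250x) interval, which is consistent with resolving to "<250x".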