MANIFOLD
Will an LLM agent complete >50% of the lab tasks on the Factorio Learning Environment benchmark in 2025?
Resolved YES on Jan 1

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.

https://jackhopkins.github.io/factorio-learning-environment/

As of the time of this market's creation, Claude 3.5 Sonnet tops the leaderboard with a success rate of 21.9%. This market resolves YES if the leaderboard displays an entry with >50% success before the end of 2025, NO otherwise. Claims of higher success rates won't count for resolution unless they're displayed on the leaderboard. Autonomous systems that are not based on an LLM agent framework also won't count for resolution.

@creator can I get clarification on how literally you are interpreting your terms?

https://jackhopkins.github.io/factorio-learning-environment/leaderboard/ appears to no longer be updated.

https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html

shows that the 50% benchmark has been cleared. Thoughts?

@MRME Good catch. I'll follow the spirit of the question and say that the results shown on https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html count even though the leaderboard link hasn't been updated.

@cherrvak awesome. Thanks. Please resolve as yes!
