Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.
https://jackhopkins.github.io/factorio-learning-environment/
As of the time of this market's creation, Claude 3.5 Sonnet tops the leaderboard with a success rate of 21.9%. This market resolves YES if the leaderboard displays an entry with >50% success before the end of 2025, NO otherwise. Claims of higher success rates won't count for resolution unless they're displayed on the leaderboard. Autonomous systems that are not based on an LLM agent framework also won't count for resolution.
Update 2025-12-31 (PST) (AI summary of creator comment): The creator has clarified that results shown on https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html will count for resolution purposes, even though the main leaderboard link has not been updated. The market will follow the spirit of the question rather than requiring the specific leaderboard page mentioned in the original description to be updated.
🏅 Top traders
| # | Total profit |
|---|---|
| 1 | Ṁ91 |
| 2 | Ṁ83 |
| 3 | Ṁ23 |
@creator can I get clarification on how literally you are interpreting your terms?
https://jackhopkins.github.io/factorio-learning-environment/leaderboard/ appears to be no longer updated.
https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html
shows that the 50% benchmark has been cleared. So thoughts?
@MRME Good catch. I'll follow the spirit of the question and say that the results shown on https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html count even though the leaderboard link hasn't been updated.