Show HN: Factorio Learning Environment – Agents Build Factories

jackhopkins.github.io

656 points by noddybear 20 hours ago

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.

Agents interact with FLE through a REPL pattern: 1. They observe the world (seeing the output of their last action) 2. Generate Python code to perform their next action 3. Receive detailed feedback (including exceptions and stdout)

We provide two main evaluation settings: - Lab-play: 24 structured tasks with fixed resources - Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need: - Factorio (version 1.1.110) - Docker - Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

vessenes 17 hours ago

OK, You’ve permanently nerd-baited me, and I wish to apply for a job at the Anthropic Factorio lab immediately.

I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have recently released Qwen 2.5 VLM which seems to be quite strong for its size.

You harp on this lack of spatial ability a fair amount, which - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?

Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.

P.s. seems like MCP enabling the python library is a natural must-do so that all tool-enabled LLMs everywhere can play factorio.

martbakler 16 hours ago

Currently it's a text-only modality environment but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance on the off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because the VLMs currently aren't good at spatial reasoning in high-detailed images, likely this would improve significantly with finetuning
Good point with MCP as well given it has been blowing up lately, we'll look into that!
- vessenes 16 hours ago
  
  That makes sense and it’s really interesting - it is a challenging visual test for sure; thousands of entities, either multi tier visual representations (screen, map, overview map) or a GIANT high res image. I hereby propose FLE-V a subset benchmark for visual models where they just turn a factorio image into a proper FLE description. And maybe the overview and map images as well.
  - kridsdale1 16 hours ago
    
    Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.
    
    dismalpedigree 10 hours ago
    
    Not to mention the increased productivity of everyone not wasting their time in factorio (myself included) because the optimal solution is known.
    
    lukan 2 hours ago
    
    Not wasted time, you were doing research it seems.
    
    vessenes 15 hours ago
    
    Well I better get training!
- grayhatter 15 hours ago
  
  > As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because [...]
  I think you just described a research paper that would advance sota. Less describing why, but how. (Assuming it's not just, wy finetuned the model and it worked perfectly)
  - martbakler 14 hours ago
    
    Sounds almost like a visual "needle in a haystack" type of work, that could be quite interesting!
    
    pyinstallwoes 7 hours ago
    
    Where’s Waldo test for vlm
jillyboel 16 hours ago

Why would screenshots be necessary if a textual description of the factory state is both easier to interpret and less prone to confusion? The game is played on a grid, so converting the game state to ascii ought to be trivial.
- martbakler 15 hours ago
  
  It actually is engineering wise quite trivial but the underlying question is which modality is the best to elicit spatial reasoning capabilities from the current general models. We tried (very anecdotally) a couple of months ago to get an agent to reason over a couple of ascii representations of factories and the results weren't very promising. It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens
  The question is what is the most efficient and high-quality representation we could use to improve that
  - ajcp 12 hours ago
    
    Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.
    In my experience the current generation of models are very poor at spatial reasoning even when given accurate coordinate based location assignments of each object. But I suspect when a model can build the whole relationship of all objects by being given those spatial relationships in a vector they will be much better.
    
    martbakler 10 hours ago
    
    We did discuss this at some point but didn't end up trying it out. I think it's quite an interesting avenue and worth a shot, my intuition also says that the spatial capabilities will improve if the model has more access to relative info and doesn't need to infer it from absolute coordinates
    
    pyinstallwoes 7 hours ago
    
    Given vector space on text is more of a spatial space of semantic distance then spatial distance of geometric objects intuitively feel of a different nature due to the fact that words are not at all likely to be represented in similar ratios of distances.
    I think a tokenization of ratios between perceived boundaries might help. But, I’m just shooting in the dark.
    
    ajcp 7 hours ago
    
    You're conflating the use of vectors to only mean how they relate to semantic meaning. As vectors are just spatial relationships, in the case of objects in Factorio we could provide the vectors for every single object as to how they relate to every single other object in literal 2D space. This would essentially provide the LLM a complete relationship mapping, since it is not able to do it by "seeing" a picture or by providing it with absolute coordinates.
  - groby_b 9 hours ago
    
    > It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens
    That'd be actually interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world model capabilities, which would be even more interesting)
    Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
    Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
- vessenes 16 hours ago
  
  Trivial as in only engineering work, sure. But it’s a lottt of tokens. Long context models do a number of things to get all that working context in; some of those things elide details / compress / have token segments that are harder to reason about. When a burner inserter at a location takes up like 50-100 tokens, and you want it to reason about 100 of them, this is still a pretty challenging task for any LLM.
  - jillyboel 16 hours ago
    
    Ah, I don't know much about multi modal models but I wonder what they'd think of pixel art representing the factory where each pixel is a point on the grid and each color is a specific entity, perhaps ignoring things such as bots flying about. Probably easier to comprehend than an actual screenshot?
    
    noddybear 15 hours ago
    
    The thing is that when you create a dense ASCII representation, any gain you might make from the spatial relationships is lost by: a) the tokeniser not working on characters alone (remember strawberrry), and b) the increased number of 'dead' tokens encoding not very much.
    Our sparse encoding seems to confuse the models less - even though it certainly isn't perfect.
    
    kridsdale1 16 hours ago
    
    I mean at some point you compress the board state down to Dwarf Fortress with an extended ASCII representation for each grid-state (maybe 2 bytes each?)
    
    vessenes 15 hours ago
    
    Lots of questions here - you need item, orientation, info about pipes (2 directions) , belts (3 or 4 colors x2 directions). Do you wish Circuits?

scottmsul 15 hours ago

There was a HN post here not too long ago about a team that used reinforcement learning to train an agent to beat pokemon red. They mentioned how they had to tweak the cost function to give small rewards for exploring and big rewards for completing "essential tasks" like beating gyms.

I wonder if this same approach could be used here in factorio? Using the pokemon red analogy the main "essential tasks" in Factorio are setting up automation for new items and new science packs. I think a good reward function could involve small rewards functions for production rates of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack.

Telling a Factorio agent to just "make a big factory" is like telling a pokemon red agent to just "beat the game", it has to be broken down into smaller steps with a very carefully tuned reward function.

Thinking about this is really making me want to jump into this project!

scottmsul 15 hours ago

Also I should add, being a Factorio veteran with 2-3k hours in this game, I think the goal of making the "largest possible factory" is too vague and not the right metric. When Factorio players make large megabases, they don't go for "size" per se, but rather science research per minute. The metric you should be telling the agents is SPM, not "largest" base!
- csense 10 hours ago
  
  Agree, "largest" base has some pathologies.
  Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.
  This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.
- soulbadguy 10 hours ago
  
  ahhh another factorio addict :) Curious, how long was your first play through (assuming in v1.x lanching the first rocket)
martbakler 10 hours ago

This is interesting, one of our findings was that the Claude was capable of essential tasks & simple automation (i.e iron gear wheel factory in lab-play) but didn't even try to do it during the "build the biggest factory" game episodes. So the models can do these essential tasks but when given a general goal, i.e "complete the game", they don't have a good level of long-term planning to even try to attempt them. Often they just did un-coordinated small-scale constructs without attempting to scale up existing factories
That was also one of our goals, to find out how do the models act when given a very vague and general objective
noddybear 15 hours ago

In FLE, you have access to milestones representing the first time a new entity was created, but coming up with a stratification of rewards for different degrees of automation would be really interesting. Join us!
mclau156 12 hours ago

The same approach could be used in life

noosphr 7 hours ago

>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.

While I appreciate the effort and creativity that went into this there are a lot of much simpler dynamic benchmarks that can let you saturate the planning capabilities of non-reasoning models.

Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.

Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:

    | Model            | Path Length |
    |------------------+-------------|
    | Claude Sonnet3.5 |          10 |
    | GPT-4o           |           7 |
    | GPT-4o-mini      |           4 |
    | Deepseek-v3      |           6 |
    | Gemini-2-Flash   |  Not tested |
    | Llama3.3-70B-Ins |           4 |

owenpalmer 13 hours ago

> All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement

It makes sense why LLMs are bad with spatial reasoning. Not a lot of training data for it. I wonder what additional reasoning abilities will emerge when spatial reasoning is solved.

wordpad 6 hours ago

How is there not a lot of special data?
Isnt it literally infinite via even the simplest simulator?
You could generate an unlimited training set just by implementing tik tac toe on an unbound grid, for example, in like 10 lines of code.
- owenpalmer 2 hours ago
  
  Synthetic data will play I big role, yes. There's other challenges though, like how verbal descriptions of objects would affect their spatial behavior. Building a generalized simulator that combines those modalities is hard.
  In this particular case with Factorio, I suspect generating the synthetic data would be easier, since the rules of the environment are relatively simple and well defined, with quantifiable outcomes.

spieswl 18 hours ago

Fantastic idea.

It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea, I assume most Factorio players that keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.

I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control intensive and has a high chance for even pro players to screw it up when attempting to do so. It also doesn't seemingly give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.

Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.

noddybear 17 hours ago

One thing we've been talking about is creating tasks that are a bit more 'tower defence', where biters are released every X steps / seconds. The idea would be to test agents in building a military-industrial complex. One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.
- aftbit 9 hours ago
  
  >One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
  This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.
- spieswl 15 hours ago
  
  Love the suggestion, I'll clone it down and start poking around.
  I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?
- tomrod 12 hours ago
  
  So something like PvZ might work, right?
- robotresearcher 16 hours ago
  
  If (1) is a special case of (2), maybe you’d only need (2)?
  - noddybear 15 hours ago
    
    True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).

gglon 14 hours ago

I was thinking, to build a large, efficient factory autonomously, one could use LLM as a high level agent that is using specialized tools. The overall strategy would perhaps look like following:

1. create a (intermittent) goal for a resource production

2. create a factory graph with calculated number of machines and number of resources required to transport between them. This would be done by using linear programming (factorio calculator)

3. somehow map the resulting graph to a hardware description language. Such that each entity would be mapped to unique logic component. And each transport lane would be mapped to a unique wire (most difficult)

4. compile to 2d FPGA layout using all the VLSI algos like partitioning, routing (hdl compiler)

5. map the resulting plan back to a concrete factorio design

jkhdigital 6 hours ago

This is exactly what I’ve been thinking as I see LLMs being applied to all these complex problem domains. Humans did not conquer the world because our intelligence can solve every problem, we did it by using our intelligence to (1) break down complex problems into small, manageable pieces and (2) designing tools and machines that were exceptionally good at efficiently solving those subproblems.
The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)
The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk) which is horrendously inefficient.
Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).

p10jkle 19 hours ago

Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.

Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?

noirscape 18 hours ago

Very unlikely that you'll see mass-use of LLMs as opponents. Enemy AI in most games doesn't need the level of complexity that machine learning demands. (Ignoring computational costs for a second.)
The main goal of an enemy AI isn't to be the hardest thing in the world, it's to provide an interesting challenge for the player to overcome. It's not necessarily difficult to make a hypercompetent AI in most games, but that also wouldn't make it very interesting to play against. Most games have finite states of logic, just large enough to the point where a human would have trouble finding every solution to it (although humans tend to be very good at pushing on the edges of these states to find ways around them).
Even in games where the amount of state is much higher than usual, you rarely want a super AI; nobody likes playing against an aimbot in an FPS for example.
Factorio is an outlier because unlike regular games, the true condition for a "victory" is almost entirely up to the player. You can make a rocket in non-DLC Factorio (the games victory condition) without building any factory at all beyond the most basic structures for stuff you can't handcraft. It'd be extremely slow, but it's an option. That's why the benchmark for this sort of thing is more efficiency than it is "can this work".
- PetitPrince 18 hours ago
  
  As an opponent that would be indeed unfun, but as a sparring partner / coach in a competitive game (fighting game? Rts? Moba? Puzzle game?) that would be useful.
- fragmede 12 hours ago
  
  Civilization (VII just released) is famous for having the harder difficulties be harder because the AI cheats. If the game was harder because the AI was smarter instead of it cheating, it would be worth it to players to upgrade!
noddybear 19 hours ago

Hey - yes, I think this is definitely possible, as you don't need any training compute for it to work. Its super easy to plug-and-play different models into new games, once an API is made available.
Models struggle in 2 main areas. The first is spatial reasoning: often the models make off-by-one errors which they find it hard to recover from (as factories are very sensitive to these mistakes - like in programming). The second is in long-term planning, i.e figuring out what to do strategically, before making tactical subgoals.
The difficulty scales in lab-play generally in proportion to the depth of the production chains. If an item requires several factory segments first, this makes it a lot more challenging. I think this is related to planning though, as the models tend to get down 'into the weeds' of fixing minor issues - rather than coming up with a master plan first.
- pyinstallwoes 19 hours ago
  
  Have you tried specific prompting like writing a mermaid diagram that forces the model to contextual use long term horizon tasks ?
  - noddybear 18 hours ago
    
    Yes we tried that - as well as a few other visual DSLs for spatial reasoning. They didn't seem to help much, i.e there were no failure modes that this approach solved compared to the simpler approach. As ARC-AGI results showed - there don't seem to be many 'free lunch' solutions to this without actually training.
jkhdigital 5 hours ago

Why LLM? Isn’t this what AlphaZero is good at? There are many more kinds of useful ML models than LLMs!
posterman 19 hours ago

"claude plays pokemon" shows that it struggles with mount moon (as did four year old me)

myrmidon 18 hours ago

Fascinating. Would have loved to see more pictures of the bigger factories-- or is the zig-zag belt into plastic production currently the best result?

I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.

I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.

Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?

noddybear 18 hours ago

I have some pictures of bigger factories - but they tend to be filled with artefacts and general nonsense. I'll dig them out and add them to the appendix. The zig-zag into plastic production was the best 'lab' result, as its pretty clear what the agent is doing.
Yes, the agents can consistently produce economic growth in game - but we don't really see a take off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.
Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.
All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.

Imnimo 15 hours ago

Another category of "Lab Play" task I'd be interested in seeing is balancer design. Even small balancers can be quite complicated (https://factorioprints.com/view/-NopheiSZZ7d8VitIQv9), and it would be interesting to see how models do at designing and troubleshooting them.

fragmede 13 hours ago

someone approached that problem with a more traditional SAT solver
https://github.com/R-O-C-K-E-T/Factorio-SAT

mNovak 13 hours ago

Is there a human-play benchmark (even informally) for this style of interface? Not saying it's necessary or even relevant, I'm just curious to know what programmatic Factorio feels like -- I imagine spatial reasoning around text prompts would be fairly challenging for human players to navigate as well.

sonofhans 12 hours ago

Human benchmarks for Factorio are speed runners — rushing to launch the first rocket. The current record is just over 4 hours for one player, and 90 minutes for a team. You can see just from that that a multi-tasking LLM has room to outperform humans.
- janzer 9 hours ago
  
  The current 4h12m hour record is for 100% (where you have to get every single achievement in the game, in the one run), any% (where you just need to launch a rocket) is under 2 hours (1h42 for the latest factorio v2.x, 1h18 for v1.x). There are a few other differences between the categories regarding map selection and blueprint use as well.
  Records and specific rules for all categories can be found at https://www.speedrun.com/factorio
- goriv 9 hours ago
  
  I think he is talking about a human using the programatic API the LLMs are using to play the game. I think that would be a whole lot slower than normal playthrough

infogulch 15 hours ago

Interesting to see only a handful of complex scenarios. I've always suspected ML game agents need hundreds of tiny puzzles with hundreds of variations each to learn game mechanics properly. Like:

    The factory is not powered, place the missing power pole(s)
    The factory is missing items, place the missing belt(s)
    Craft and place these 200 assembly machines
    The assembly machine is not running for some reason, fix it
    The factory production is too low, double it
    Get to this other point in the factory as fast as possible
    Fix the brownout
    All of the above with and without bots

Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used.

I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity where more complex scenarios are presented after it scores sufficiently high on lower complexity scenarios.

noddybear 15 hours ago

I think generating the scenarios as you suggest (in text) is easy, but creating correct factory game states to start from is a lot harder. AFAIK it reduces into the same manual task of designing an init state and a task to complete.
- infogulch 15 hours ago
  
  Yes each scenario will need someone to design it, but you can get a lot of mileage out of each. E.g. consider the "place the missing power pole" scenario: manually build a factory with a few dozen machines connected to a couple steam engines with 20 power poles, then you can generate 400 playable puzzles/scenarios by deleting 1-2 power poles from the working starting point. Humans would find all of these to be equivalent, but I think agents need the explicit variation to learn the lesson properly.
martbakler 14 hours ago

We are thinking of something like this (a curriculum approach) for further training. The reason why we didn't want to do this for current work, where the emphasis is on evaluations, is that the "difficulty level" of different tasks is quite subjective and hence we would need to make arbitrary decisions that could affect the evals (i.e which tasks would follow which scenarios, how to ensure sufficient coverage across all difficulty levels etc)
- infogulch 14 hours ago
  
  "a curriculum approach" is a nice way to put it!
  > the difficulty level of different tasks is subjective
  That makes sense. I wonder if difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. it preforms better at scenario T if it trains scenario A first, but training scenario first B doesn't help with T. Then infer A < T, and B ? T.

barrystaes 17 hours ago

I have long dreamt of automating Factorio in the way that HDL and a PCB router works: just specify the ingredients and it produces a Factorio Blueprint.

First MVP stupid designs, then optimized routing, and eventually usable ingame where it connects with provided in/outputs.

Would be more fun to develop than to play obviously..

I liked the nilhouse mega base with that factory-train-blocks blueprints, its basically Factorio DUPLO.

noddybear 16 hours ago

Someone created this Terraform provider for Factorio a few years ago: https://registry.terraform.io/providers/efokschaner/factorio...
I think this could be a good starting point for what you describe! This stuff is always more fun to develop than play. Since started working on this project, I can't bring myself to play the core game myself...
delichon 16 hours ago

I've wondered if automating Factorio would free me of the compulsion to play it.
- noddybear 16 hours ago
  
  It certainly did with me.

moconnor 19 hours ago

Incredible idea and execution, very interesting results. Genuinely: what a time to be alive!

noddybear 19 hours ago

Thank you! Very much a labour of love. The next step for us is to try and build a paperclip maximiser in FLE.
- uncertainrhymes 18 hours ago
  
  Given that Factorio has logic gates and people have built various programs (including Doom, iirc) how long will it take before someone runs an LLM inside the game?
  - noddybear 18 hours ago
    
    I'm reminded of Hitchhikers Guide to the Galaxy, where the whole planet is one big computer. I would expect that any model directly implemented in Factorio would take up most of the game-world.

kevmo314 18 hours ago

Does it provide screenshots of the game state? I, too, would struggle to play the game pretty effectively if I could not visually see the game.

noddybear 18 hours ago

Agents don't have access to screenshots, as we are purely evaluating text-only models. All reasoning is conducted over object representations of the game (with positions etc).
I have anecdotally tried using screenshots to help models debug their factories, but without training a custom CNN/ViT on the Factorio UI, the visual outputs miss critical things (e.g gaps in transport belts).
That said, we have demonstrated via unit tests that the API is technically sufficient to progress to a rocket launch alone. We have been able to complete most lab tasks using the API ourselves so the humans still have a hefty lead here! The ones that we didn't do are the late-game lab tasks, which would have taken significant time and which frontier models are far from being able to complete.
- vessenes 17 hours ago
  
  Noddy, what’s “fair game” for this benchmark? e.g. do you wish to provide frontier models with a text goal, tooling info, and leave it at that? Or do you wish to have agent architectures compete? It seems to me like tiering the goal setting, layout and implementation are all separate tasks that would benefit from different agents.
martbakler 18 hours ago

Just to jump in here as one of the authors
We designed the API to be as spatially descriptive as possible (include x-y coordinates and neighbors in game state descriptions) and the agents have tools to aid them in carrying out actions which would benefit from vision (i.e find buildable areas on the map with different sizes, placing entities next to other entities etc).
As Jack said, we completed most of the lab tasks manually ourselves and while it took us a lot longer compared to having vision, the tasks were still doable and the human performance is significantly higher than current agents. We are thinking of supporting vision for future evals but from a small number of tests we ran, current models got even more confused as the number of entities on the map grows quite quickly. This is likely due to VLMs being notoriously bad at visual reasoning on images with lots of detail and in a game where one misplaced entity in a large factory breaks everything, the errors start to compound
- Hammershaft 17 hours ago
  
  Someone below mentioned the ASCII interface for dwarf fortress as being ideal for this, and I wonder if that kind of representation with a legend might produce spatially better results. The drawback I see is that elements can be layered on a tile in Factorio, or have properties that are not visually obvious in ASCII, so the llm would need to be able to introspect on the map.
  - noddybear 17 hours ago
    
    I think your intuition is correct about the amount of information that needs to be encoded into an ASCII char. You could potentially use unicode to pack more more into each char, e.g direction, type, status etc. Or make each representation available on-demand, i.e 'show me the direction of all inserters in a 10 tile radius'.
    
    vessenes 17 hours ago
    
    Well we learned last month on HN that you can encode arbitrary data into Unicode; anecdotally, o3-mini-high at least could decode it if given instructions.
    I wonder what a quick way to calculate how many Unicode characters you’d need is.. I guess every entity + four orientations. Underground belts and pipes seem tough. But I guess you could just add an encoding showing if the square has an underground pipe or encoding.
    I propose this would work. I think I’ll give it a try today.. I’d love dwarf fortress factorio. That said, the encode/decode phase seems like a lot of tokens for a model that’s not trained to understand the Unicode ‘map’. Seems like you’d want to fine tune something at least. Maybe a layout model.
    
    noddybear 16 hours ago
    
    Checkout the 'ObserveAll' tool in the repo - its deprecated now, but it pipes all the raw entities on the map back to the agent. You could procedurally convert it to unicode format given a pre-defined codebook (which you give to the agent) before letting the agent observe and reason over it.

iliketrains 15 hours ago

This is awesome! I like the idea of abstracting the factory building with a code-like structure. I wonder if supplemental 2D image (mini-map style) as an input to the policy would help with the spatial reasoning?

I work on a similar factory game (Captain of Industry) and I have always wanted an agent that can play the game for testing and balancing reasons. However, pixels-to-mouse-actions RL policy (similar to Deep Mind's StarCraft agent) always seemed like a very hard and inefficient approach. Using code-like API seems so much better! I might try to find some time to port this framework to COI :) Thanks for sharing!

noddybear 15 hours ago

Regarding the 2d image - the issue is that these frontier models don't tend to support supplemental image inputs, and the ones that do aren't sufficiently well trained on (high precision) Factorio visuals to add that much information.
- iliketrains 13 hours ago
  
  I see, integrating image inputs can be very challenging in this case as the models work with text input. I was not even thinking about the full isometric image, but just some simple 2D map where each pixel can be color-coded based on the entity type. I guess the problem is that these maps would look like nothing the models were trained on, so as you say, it might not provide any value.
  The reason I was suggesting this is that I worked in robotics making RL policies, and supplying image data (be it maps, lidar scans, etc.) was a common practice. But our networks were custom made to ingest these data and trained from scratch, which is quite different from this approach.
  - martbakler 10 hours ago
    
    Indeed I think the trade-off here is the more "pure factorio" types of images we give to the agents, the more likely it is that they've seen it during training (from google etc), however the signal-to-noise ratio is low and hence the current models get confused as the map complexity (amount of entities) and level of detail grows. If we start to create custom images, we can reduce the unneeded noise, but then risk giving something completely OOD to the agent (unless we train a visual encoder) and the performance also tanks
noddybear 15 hours ago

Thats really cool! Please shoot me an email if you get around to doing this, always happy to assist however I can.

devit 14 hours ago

Seems like it might be more effective to use the LLMs to write a program that plays Factorio rather than having them pick the next action given a game state.

Also in general I think the issue with Factorio is that you can just find an "optimal" factory design and build order and just follow it every time; perhaps starting with a suboptimal building layout already present and restrictions like being unable to change them or build others of the same type could help.

noddybear 14 hours ago

This is exactly how FLE works, the agent writes a program that executes its policy.
I think you bring up a good point, we could create tasks where the goal is to optimise a static factory, starting from a kernel of functionality like 'steam engine power supply' etc.
- devit 10 hours ago
  
  But it seems like it's being used to generate short snippets that in the examples seem to be equivalent to command lists as opposed to generating a full program that actually plays the whole game by itself.
  The model could also then be fed back the results of running the program and iteratively change it as needed.
  I.e. prompt first with "Write a program that can play Factorio automatically given an interface <INTERFACE SPECIFICATION> and a set of goals in <GOAL FORMAT>, and produces text output that can help determine whether the program is working correctly and whether tasks are performed efficiently and goals are reached as fast as possible"
  And then with "the program was run and produced this text output: <TEXT OUTPUT> Determine any possible bugs, avenues of improvements or missing output information and modify the program accordingly, printing the new version".
  And iterate until there doesn't seem to be an improvement anymore.

jharohit 19 hours ago

Everytime a paper like this comes out, I always have 1 question - How do they control the game using the LLMs? How does the control-feedback loop work? WHat tools, software and APIs they use to do it on Mac or Windows?

collinvandyck76 19 hours ago

OP made the framework available https://github.com/JackHopkins/factorio-learning-environment
- noddybear 19 hours ago
  
  So the core insight was that we can take over the Factorio console remotely using RCON over TCP. From this, we implemented a server-side library of tools that run inside the game. We then implemented a client-side Python library that can invoke these tools - resulting in a Python API for the game. A nice side effect is that creating new tools is really easy, and they can be hot-loaded into running game servers (unlike the traditional Factorio modding approach).
  - jharohit 6 hours ago
    
    this is cool! How could one extend that to a broader set of games? E.g another one where you can run larger simulations on behaviour are procedural games like No Man's Sky
- jharohit 6 hours ago
  
  its just for this game - prev I have seen python bots extended to GTA V or Counter Strike or other games. So was wondering if broader set of tools are available?

cmgriffing 10 hours ago

Their first key insight is interesting. It says that coding ability predicts model performance in the game. I wonder if it also predicts performance of human players in some way?

onehair 15 hours ago

Hi Jack, just reached 85/88 achievements in Space Age. Seeing an article about computer science and Factorio at this stage is either the nicest romantic gesture or very cruel intended to keep me playing this beautiful game forever.

mritchie712 18 hours ago

have you tried sonnet 3.7 yet? guessing these aren't cheap evals to run.

leaderboard: https://jackhopkins.github.io/factorio-learning-environment/...

noddybear 18 hours ago

Not yet, but starting the runs for 3.7 later today! The cost for running all the evals (across all models) was about $10k.
Simply giving the agents access to the tool descriptions and API schema is like 20k tokens from the outset.
It would be really cool to use retrieval techniques to reduce this burden. I suspect that this will also outright improve the performance of all models - which becomes worse as the context scales.
- stared 8 hours ago
  
  Since Claude 3.5 Sonnet is that good, I am curious how fares Claude 3.5 Haiku.
  For programming-like tasks, I expect similar-ish distribution that in programming, see e.g. https://web.lmarena.ai/leaderboard
- 0xmason 18 hours ago
  
  Please add the results of the eval to your leaderboard / github! Looking forward to it.
  GPT 4.5 seems prohibitively expensive here though lol.

danso 18 hours ago

Tangentially: been wondering when we’d ever see the breakthroughs in LLMs trickle down to making better adversarial game AIs. Haven’t tried Civ 7 b/c of its terrible reviews, but I’d happily buy in if there were AIs that were more human-like and varied in their scheming

htrp 16 hours ago

too expensive... you'd have to pay a recurring monthly sub for the game content.
Inworld's been doing this but haven't seen what they've done recently. https://inworld.ai/blog/inworld-stardew-valley-ai
HPsquared 18 hours ago

LLMs could bring the characters to life on the diplomacy screen. Not sure if Civ is the right game for it, though.
- pferde 16 hours ago
  
  There is a guy on Twitch that is playing a Crusader Kings 3 game where (some of the) NPC characters are "played" by chatgpt, in that he gave instructions to it regarding what traits the character has, and has conversations in English with it. And somehow translates the results into the game, probably using some sort of a mod.
  I don't remember the details, as I've only watched for a few minutes and found the whole thing boring, but he seems to have been going at it for a few weeks now, so it's probably working on some level.

sturza 19 hours ago

> 1. Coding skill predicts performance

> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.

svilen_dobrev 12 hours ago

some time ago i mentored a team in Ireland, and one mid-age guy was switching professions - from construction... and he approached programming (entry level python mostly) as puzzle solving. Match the pieces in (sufficiently) proper places.
Coding seems very close to puzzle solving in this regard.

zoogeny 15 hours ago

This reminds me of how I'll judge LLM ability when it can compete in a 1v1 dark souls battle. Even Factorio is pretty slow compared to what is possible in modern games.

chompychop 14 hours ago

It's edging pretty close to this - multimodal LLMs can already beat early bosses in soulslike games: https://arxiv.org/html/2409.12889v2

mentalgear 19 hours ago

> [LLMs] yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis

This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.

noddybear 19 hours ago

We once saw GPT4o spend something like 100 repeated interactions trying the action known not to work (before snapping out of it). My intuition here is that this is a result of target fixation - the more repetitions of something it does, the more likely it is to keep repeating it, because it occupies more of the context.
- throitallaway 11 hours ago
  
  Gemini Pro does this constantly.
  * It'll output a broken script * I tell it what's wrong and how to fix it * It tells me I'm absolutely right and that it will correct it * It outputs a script with the exact same brokenness

bocklund 13 hours ago

Is there an additional arXiv paper? The abstract referenced at the bottom refers to a completely different paper.

noddybear 13 hours ago

We’re still waiting for arXiv, but the pdf is at the top.

ainiriand 12 hours ago

Wait are you basically telling me that now I can play factorio with code?

throitallaway 11 hours ago

The mod system uses lua, so that was always possible.
https://lua-api.factorio.com/stable/

alexop 19 hours ago

its funny how video games are the hardest benchmark that humanity has for ai

WJW 18 hours ago

They're not the hardest problems we have, they are just very nice benchmark tools because by definition they already run on a computer and you can fairly easily interface an AI with them.
There's probably also a distorting factor in that all the AI research into stock market and military applications probably doesn't get published, so it seems like video game AIs are a much larger percentage of research than it actually is.
quchen 19 hours ago

A video game is a very well-defined problem, and usually comes with simple metrics for success – health, time, or in Factorio’s case, ultimately science per minute (or per minute played, for AIs?). Real world problems are much harder to define, they are embedded in a very complex ecosystem, and it’s not clear at all what to optimize for.
throitallaway 11 hours ago

DeepMind went from playing Pong to protein folding in a short number of years. There are much harder things for AI to do than playing video games. Also see: self driving cars.
Hammershaft 17 hours ago

I'd love to see a Baba is You or Stephen's Sausage Roll llm environment to gauge spatial reasoning. Stephen's Sausage Roll in particular could be very interesting because the mechanics are incredibly simple but challenging.
lucianbr 18 hours ago

It is "hardest" in a context of the AI actually having a chance.
There's no problem asking AI for the blueprints to a working faster-than-light spaceship, only we already know the AI will fail, and the way it fails provides no useful information.

cgannett 18 hours ago

I wonder if anyone has done something similar with Dwarf Fortress

noddybear 17 hours ago

Not seen one yet, but the ASCII representation of the game would be ideal for an LLM benchmark.
- TeMPOraL 11 hours ago
  
  +/- tokenization mishaps (c.f. strawberry).
cgannett 18 hours ago

Or Mindustry

sc68cal 18 hours ago

The diagonal belts and diagonal pipes are especially cursed.

philipwhiuk 18 hours ago

Don't worry - people have done it without LLMs already: https://www.reddit.com/r/factorio/comments/p5clik/just_diago...
- sc68cal 17 hours ago
  
  oh i'm aware. Trupen did a big video on it I think

leetbulb 19 hours ago

Very cool project. Lovely diagrams.

noddybear 17 hours ago

Thank you. Spent too many hours procrastinating by pushing-pixels for those diagrams!

artemonster 19 hours ago

Diagonal belts are signs of evil

Python3267 17 hours ago

The factory must grow.

johnisgood 13 hours ago

Claude wins.

zelias 19 hours ago

Fantastic! Now I can sit back and watch the factory grow itself!

More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas of creativity for solutioning.

loveparade 19 hours ago

"put the right signals into my train network"

Not even humans can pass this benchmark.

quchen 19 hours ago

I beg to differ! But it takes a while for it to become intuitive. »Chain in, rail out« gets you 90% there though.
- Aardwolf 18 hours ago
  
  In both Factorio and Satisfactory I end up using long (and I mean, multi-kilometers-long) belts all the time and never trains.
  Something about the track building being clunky, or I don't know what really the underlying thing that's making me prefer the simplicity of items-moving-and-splitting-and-merging-on-belts is
  - enragedcacti 17 hours ago
    
    You should give them a shot again with 2.0, they made a ton of improvements to the track building:
    https://factorio.com/blog/post/fff-377, https://factorio.com/blog/post/fff-389, https://factorio.com/blog/post/fff-403

andbberger 13 hours ago

stream this on twitch

keeganpoppen 19 hours ago

this is an absolutely fascinating project-- wow! i am going to have to fire up Factorio again and try it out! the implications of what the experience of playing games is like in this new LLM era / world order are fascinating.

Aardwolf 18 hours ago

I'd love to finally see interesting (non-predictable) AI opponents in games like StarCraft and Age of Mythology!
Since a good AI is too likely to beat humans due to high APM, don't limit their intelligence, instead limit their APM...
- noddybear 18 hours ago
  
  Yes! Better strategy would be great - especially for grand strategy games (like Paradox EU4 etc). Even more for games with aspects of diplomacy...

idiotsecant 12 hours ago

...And that was how a bunch of very enthusiastic HN factorio fans inadvertantly started the exponentially expanding intelligence that consumed the universe to make more science cubes.

deadbabe 9 hours ago

Am I the only one who doesn’t find the results promising?

This is a ton of compute power and complexity for what is basically a shitty AI. It has no practical purpose. Better AIs have been built with less, why don’t people appreciate them? Or do we just take them for granted?

tux3 19 hours ago

For the real frontier benchmark, at the edge of what humans will put up with, install Pyanodon's mod. The scale of it puts a real strain on your organizational skills. Overbuilding, underbuilding, or bad planning can all cause significant pain down the line as the factory risks becoming an unmanageable, tangled mess with no sane capacity for expansion. It's a real test of executive function and organization.

For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.

HPsquared 18 hours ago

I've given up on that a few times. This time will be different, I say.
wegfawefgawefg 19 hours ago

and i thought gleba was hard

WJW 18 hours ago

Very cool and also pretty expected results tbh. Some thoughts:

Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.

It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.

I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.

noddybear 18 hours ago

I definitely agree that planning is essential to perform well in Factorio - my hope is that we can create agents in FLE that can better front-load the planning part, as well as create utility functions for future use - such that as the agent progresses, it can do more and more in each program / step. For example, it could create a function called 'resolve_resource_dependencies', which would enable it to backfill missing resources in order to proceed.
LLMs tend to build themselves into corners here quite often. Basically, if they break the topology (e.g enclose their factory in pipes) they struggle to reason over it and correct it. My basic view on this is that there exists some set of functions/data-structures that they can design in FLE, which will give them a better view over their factory to enable scaling (if the models take a step back to consider it).
We currently do track SPM, but decided against making that our main metric, as it zeroes out in the early stages. We use 'production score' instead, which is a more generalised metric that just captures total production (multiplied by an item-price).
There was a cool paper that came out a few years ago using meta-heuristics to do this, (https://arxiv.org/abs/2102.04871), but I reckon the combinatorial complexity of large factories makes it challenging to solve beyond trivial factories.
Its worth noting that agents in FLE can write their own libraries etc, so a dominant strategy could be for an LLM agent to implement a solver in Python to do the heavy lifting. This is quite far from current capabilities though.
- WJW 18 hours ago
  
  An agent writing its own library to interface with a good solver like Z3 (or even writing some basic planning algorithms itself) seems like the epitome of a "costly long term investment that does nothing for the short term". The only thing I can see overcoming such problems are deep search trees, but AFAIK that is not how LLMs work at all.
  - noddybear 17 hours ago
    
    I experimented a bit using deep search trees to find better Factorio trajectories (MCTS), which worked somewhat well. Unfortunately, it's very computationally expensive, and probably only makes sense in a training context (i.e gathering trajectories to train a model in a supervised setting).
mrighele 17 hours ago

> the game is about eventually building thousands of the new item.
I disagree that you need significant amount of thinking ahead. At the beginning spaghetti belt is fine, as you have little resources and you don't have the luxury of overbuilding. Once you start getting "bigger" and into more complex designs you can just leave what you already built how it is and build the new stuff somewhere else.
By the time you need to produce thousands of pieces of an item you can probably prepare a blueprint that builds the whole factory in a click.
My approach to factor.io is built on phasesw
1: build ad hoc infrastructure for the specific material that I need, close to the raw resources
2: prepare blueprints for specific resources, so that if I need more of something I can just build an extra factory. I make the blueprints so that I can compose them, like input belts on one side and output belt on the other. such "factories" are almost self contained, as in they get only a subset of materials (plates, plastic and stuff that involves liquids) and produce all the intermediate materials. This leaves some optimizations on the table, but simplify the logistic. Use trains to fetch resources from far.
3: compose the blueprints of the previous step to make "megafactories" with stations included. While at step 2 input and output of the factories are belts, at this step the input/output are train stations for specific material (with proper names, so I can add a new factory and trains will start delivering materials right away)
Of course my approach is not the only possible and probably not even efficient. I play for fun, with no care for the time it takes, as long as the time spent is enjoyable.
- bombcar 17 hours ago
  
  You can certainly build with a main bus, and segmented factories doing what they do in perfect Nilaus city blocks. It's quite like perfectly designed and planned code; though you run the risk of it becoming just a blueprint plopping game.
  But it (for me at least) is so much more fun building the spaghetti and making things work, refactoring as you go, and expanding organically.
Philpax 17 hours ago

> It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set.
I'm not sure we can conclude the game's not in their training set - a lot has been written about Factorio on the internet, and a lot has been recorded; the newer multimodal LLMs have trained on YouTube, especially Gemini.
- noddybear 15 hours ago
  
  The models have some understanding of the game and the initial build orders that they should adopt. What they don't remember is: 1) How many resources each item costs to make 2) Spatial gotchas such as a belt requiring an inserter to load/offload from.
kortilla 16 hours ago

A bus is absolutely not needed. I played through many early games and got to the launch without building a main bus.
accurrent 18 hours ago

Whats very interesting is if we could use LLMs to generate GOFAI methods. Its often not at all obvious how to do so. Than being said its still hard to express goals in terms of natural language and resources to LLMs. I;ve been trying different things and none seems to work for me to say hey this is a step improvement. Its also hard to come up with a dataset for these use cases.
- noddybear 18 hours ago
  
  FLE agents technically can implement their own Python libraries to leverage GOFAI to do the heavy lifting. None has actually attempted this yet though. It would be interesting to see if this can be achieved just by modifying the manual given to the agents to bias in favour of this approach.
  - accurrent 18 hours ago
    
    That does sound interesting. I might attempt it. Thanks for this benchmark, I totally could use it for my PhD (I started with GOFAI, but have hit a dead end. My advisor is suggesting pivoting into using LLMs to call my GoFAI framework.
    
    noddybear 18 hours ago
    
    Feel free to create an issue in the repo - am totally happy to help however I am able! I think that the only change you'll have to make is to expose your GoFAI framework in the 'Namespace' object which the agents have access to (for them to call it directly). Alternatively you could design a new tool which takes in game objects and generates a solution / typed object output.
eterm 18 hours ago

> Building a main bus vs spaghetti belts is one of the obvious examples here
I'm an anti-bus extremist. ( I've even considered regsitering BanTheBus.com and doing an over-the-top static anti-bus website. ), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )
Busses are the opposite of good factorio factories. They undo a lot of the benefits of a healthy just-in-time (JIT) manufacturing, by encouraging massive amounts of buffer (belt-buffer).
They also encourage people to anti-learn fundamental principles. You often people do "starter-busses" with 4 lanes of iron plates, but only fed by one actual belt worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.
They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.
Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you dont' have enough input, then build more of that input. Then repeat as necessary.
There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overprodction" is stopped at source, because assemblers don't produce if they have nowhere to output into.
The biggest classic beginner mistakes in factorio are:
- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )
- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.
- Looking for guides and discovering "The Main Bus".
That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.
Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.
- noddybear 18 hours ago
  
  While I agree that busses incentivise over-production, my only argument in their favour stems from SWE design best-practices of maximising cohesion and minimising coupling. The benefits of a bus is that it makes it easier to create factory districts that can be scaled independently of everything else. I suppose the question is how independent/integrated should these districts be? Should they only create ore? Or create utility-science-packs?
- vessenes 17 hours ago
  
  Lots of ways to play. That said, there is a significant amount of tooling to support the style of “stamping” working units down. In fact, I’d argue it’s a key part of the game in that you start with small working units, and the game helps you scale them up to midsize (say on the map view you can still see the parts) and then again to large scale (you can only see blocks on the map). This is why the blueprint tool allows offsetting and grid refinement.
  So, I almost never build a large main bus style base, but it’s a fun (and helpful) part of the game’s design and tooling to allow you to create very large working systems out of a variety of component scales — “buses” enable this, and make it MUCH easier to implement.
- zelias 17 hours ago
  
  I dunno, main bus strikes me as an efficient way to feed all the mall assemblers in the early game before I can scale up to a train based base.
  After the early game, totally agreed, ban the bus.
  - eterm 17 hours ago
    
    Early game, I don't use trains either, I just spaghetti yellow belt around! I even run long belt lines to the first ring of mining outposts. It usually only costs ~500-1000 belt to do so, which sounds like a lot but really isn't much.
    Advantage of yellow belt:
    - Don't consume power - Don't produce pollution ( don't attract bugs ) - Are cheap. Rails themselves aren't too bad, costing only slighty more per tile than belts. But rails need stations, signals, chests, inserters for loading / unloading. Often combinators too. All add up to much higher investment, and critically in the early game, add up to more pollution produced before the pay-off of improved resources incoming. - Don't need lots of belt for the stations. A typical loading/unloading station can often have so much belt for "efficient" (fast) loading/unloading that you could have belt half your way to where you're going just laying that belt in a straight line.
    If you're going further than the first ring of extra resources, then trains are amazing, but there are more than enough resources several times over in the first ring of resources to get a rocket launched. The first expansion ring tends to have 500k-1Mil per patch, and have several patches, so there's no need to go miles out pre-rocket.
    It takes surprisingly little raw resources to actually launch a rocket. Someone did the maths once and calculated completing the game as needing a minimum of ~500k Iron ore. At 15/s, a single yellow belt can deliver that in under 10 hours, way below the time of a typical playthrough. This technically means a single smelting line is all you need to actually complete the game still in a reasonable time. Of course, trying to do so from a single lane would be extremely painful, and need a lot of attention to preventing over-production of intermediates, especially when it's much easier to make a few smelting lanes. I'm not recommending that, but I am recommending just slapping down new lanes and production wherever you feel like it and whenever you need, rather than pre-planning 4-lanes of plates that aren't actually useful of efficient.
    Unless you mean your "main bus" is just 1x copper and 2x iron lane, in which case fair enough, but when I attack the concept of the bus, that's not what I'm railing against. What I'm railing against is the design pattern where people put down 4 lots of 4-wide lanes to bus far more iron, copper and intermediates than will ever be needed to actually launch a rocket.
    Busses aren't efficient by any metric, other than minimising personality, creativity and thinking.
    With a bus you just follow the same template done before, it'll get you to the end, but it teaches bad habits and isn't efficient. It isn't quick either, you'll spend a long time putting down lanes and splitters and undergrounds. All of which need producing.
    For late-game (post-rocket), then either a bus design or train / city-block designs can work well. I prefer trains, but large busses have their place for mega-bases.
    But for pre-rocket, or anyone starting out, spaghetti is absolutely the way to go. It'll also better teach you via your own mistakes.
- tavavex 16 hours ago
  
  I've played other factory-building games, but not Factorio, so I'm not familiar with the bus-building paradigm. I feel like you're saying that buses would incentivize bad practices, but at the same time I don't see what would make them inherently bad. Whenever I saw screenshots of Factorio, I thought that buses were more of a logistics tool, a way to cable-manage the delivery of stuff from one place to another. Is this wrong? I feel like, if you have more consumers than producers (and end up having to rely on buffering), then you've got a big problem regardless of whether you have a bus or not - a sufficiently long belt from an ore deposit etc could replicate the big-buffer problem in the same way. I don't think I'd use buses, I like a bit of chaos, but still, I'm not sure if they're that bad.
  - eterm 15 hours ago
    
    Buffer is bad precisely because it increases the lag between having a gap in provision, and that gap being obvious to the player.
    With no buffering, as soon as your demand for steel is greater than than production of steel, then the bottleneck is immediate and obvious. The solution is also immediate and obvious: Build more steel.
    Buffering, in particular belt buffering, and in particular busses, contribute to mask this issue. There can be a great delay between increasing consumption above production, and so the root cause can be very hidden. It may also be that the ultimate root cause is that steel production is low because it's limited on how much iron ore it gets. If everything is bussed, then it can be hours before resource constraints are hit, by which point it's very hard to see what's happened to cause the shortage, and also by which time the factory may have expanded further.
    It also constributes to see-saw production, where-by a shortage in one area causes a pause, which allievates the root cause shortage for a while by backing up other production. The longer the lag between cause and effect, the greater the banding effect, further masking the root cause.
    A bus also encourages bottom-up, which further encourages massive over-consumption of base resources. If you start building green chip production and bussing it, the bus may will fill and buffer, making it look like you've got plenty of green chips. In turn, as the green chip production stops, it'll look like you've got plenty of iron plates. You'll build all your malls and other production, satisfied as you build each that they run fine and are not over-consuming.
    Only when later everything starts to run at once you realise that the stuff down the end of the line is getting scant resources, as previously each part was running in isolation before serious amounts were required.
    In contrast, a top-down approach involves building the final result first, then at each step building what's needed to feed it. This ensures that there is always enough provision, and everything can be placed to minimise buffer to reduce lag and improve feedback time on problems. It also reduces pollution since any item on a belt represents inventory for which you've paid a pollution cost but not got any final results from yet.
    The spaghetti approach can lead to "under-utilised" buildings, such as smelting array that ends up only needing to supply 0.3 of a belt. But in factorio space is almost endless, and there's little to no cost to idle buildings. The power drain of idle assemblers, particular the bare (no module) level 2 buildings you'll likely be building before end-game, is extremely low.
    For late game post-rocket, this changes of course. With beacons and level 3 assemblers with modules, the idle draw is significant, and you may want to optimise ratios and look to eliminate how many assemblers you run idle. ( That said, power is almost non-issue in 2.0 with nuclear power being much easier to run efficiently than previously, so the large solar fields aren't really needed anymore. )
    Busses have a strong visual appeal, but unlike "cable management", there's no airflow to consider in factorio. A messy spaghetti base isn't inherently inefficient. It doesn't affect productivity to just run short belts all over.
    The visual temptation of the mega-bus is clearly alluring, it looks good on youtube video guides.
    
    tavavex 15 hours ago
    
    That makes sense. I guess I just had a different approach when I played the other games. The way I organized in other factory games is that the considerations of input and output were things that I thought of upfront - I never eyeballed and then tried to estimate the production speed based on how fast my resources were drained. I might be overplanning, or maybe Factorio encourages a far more chaotic approach, but I always treated factories as black boxes that take X/s of certain items and outputted X/s results. Knowing precisely how many items per second I have on any individual belt is the most essential piece of knowledge to me, so I never relied on buffering and always made sure to build consumer factories that never overwhelmed producer factories. This means that the visual indication of the buffer draining would only signal some building mistake to me, rather than a design mistake.
- infogulch 16 hours ago
  
  I agree with everything you said, but I'd like to defend the main bus as a learning phase.
  The main bus is for production capacity discovery. New players have to discover the production tree somehow and browsing recipe cards with a calculator before starting is not the way, it has to be hands-on and interactive. Dividing up your factory into (arbitrarily defined) intermediates production and final product production is helpful to cut through the complexity. Of course, intermediate production will be severely underbuilt and logistics capacity will never keep up with consumption of all of the final product builds, but it will be enough to run a couple builds at once and test new ones.
  Once new players have a grasp of the scope of the production tree, with tangible examples of sub-factories, then they can begin to consider end-to-end production capacity.
- sdwr 16 hours ago
  
  The most challenging part of factorio is routing inputs and outputs in 2D. If you build naively, intersections between ingredient belts choke the life out of your factory (3 utilities problem). Bus guarantees building space with easy access to all ingredients.
  Sure, if you already know what you're doing, you can plan ahead and leave space for the right things. For everyone else, it simplifies the game to a manageable level.
- lupusreal 14 hours ago
  
  I delete my bus as soon as I get bots (in favor of a train base feeding a bot mall), but I've found that a small and not overly strict bus is the fastest way, for me, to get bots unlocked.
- aithrowawaycomm 17 hours ago
  
  I think the point of “The Main Bus” in guides is that it’s easy for newer players, and thereby takes a lot of complexity off the table when you still haven’t really figured out how petroleum works, or keep falling behind the biters because you underestimated red bullets demand for steel, etc etc etc. Eventually you figure out trains; until then a main bus is an idiot-resistant way to carry resources across the entire base.
lupusreal 18 hours ago

I'm not convinced that factorio requires planning ahead for computer players. For human players it certainly does, because tearing up your factory and rebuilding to fix shortsighted designs has a steep time/labor cost. Even for human players though, this cost becomes mostly a psychological obstacle once you get construction bots.
- malfist 18 hours ago
  
  The biggest cost in factorio by far is the human time cost of setting up logistics, building factories and mining outposts.
  To an LLM those probably aren't even costs
  - asah 18 hours ago
    
    seems like we just need to add that 'cost' to the agents...
    
    noddybear 18 hours ago
    
    We do actually track this. Each action an agent takes consumes 'ticks', which is how long it would take a human player to do it. For example, moving 10 tiles to the right takes something like 300 ticks (5 seconds).
    However, our results only loosely indicate that smarter models use fewer ticks. I have an intuition that we will get more signal for tick-efficiency as models improve.
- aithrowawaycomm 17 hours ago
  
  Long-term planning is necessary if you have biters enabled: typically you need to secure territory/resources and invest in defenses before the resources run low and while the biters are still manageable. Otherwise things can get badly out of hand.
  Edit: IMO the biggest difference between Satisfactory and Factorio is that Satisfactory has no crises. If a Satisfactory base shuts down it is annoying, but you can dig another miner / / build another plant / etc, entirely at your leisure. But in Factorio, a shutdown is an emergency with a ticking clock.
  - martbakler 17 hours ago
    
    We were thinking of creating a minigame resembling a "tower-defense" setting, where waves of bugs get released and the agent needs to create appropriate defenses. It would be interesting to see if agents are capable of defending the base and how much resources would they put towards defenses in a normal game where enemies are enabled

Starlord2048 15 hours ago

[flagged]

andai 15 hours ago

Fascinating. I was thinking how the factory should be communicated to the model, and represented "internally". Images aren't the right solution (very high bandwidth for no real benefit). An ASCII grid of the game's tiles (more likely, a small chunk of it) is orders of magnitude better, but you still don't need to simulate every tile in a conveyor. It's just a line, right? So the whole thing is actually a graph!
That compresses nicely into text, I imagine.
I'd like to hear more details about your symbolic approach!
- nostrademons 13 hours ago
  
  Probably the memory model of the game itself is the best representation. The devs have already spent a significant amount of development cycles optimizing this down to a minimal compressed form - belt runs, for example, are one entity regardless of how long they are. The LLM is then effectively modeling the degrees of freedom of the game simulation and picking code paths within them.
- HideousKojima 15 hours ago
  
  >An ASCII grid of the game's tiles (more likely, a small chunk of it) is orders of magnitude better, but you still don't need to simulate every tile in a conveyor. It's just a line, right? So the whole thing is actually a graph!
  Until you accidentally feed a different material into your belt and need to clean it up
noddybear 15 hours ago

This is really interesting, do you have a repo or anything describing the approach? I would be particularly interested in trying your approach in FLE to see how it affects layout design. How are you performing the spatial reasoning?
mlsu 14 hours ago

Yes!
The way I think of it is this. Yes, the LLM is a "general reasoner." However, it's locked in a box, where the only way in and out is through the tokenizer.
So there's this huge breadth of concepts and meanings that cannot be fully described by words (things like, spatial reasoning, smells, visual relationships, cause/effect physical relationships etc). The list of things that can't be described by words is long. The model would be capable of generalizing on those, it would optimize to capture those. But it can't, because the only thing that can fit through the front door is tokens.
It's a huge and fundamental limitation. I think Yann Lecunn has been talking about this for years now and I'm inclined to agree with him. This limitation is somewhat obscured by the fact that we humans can relate to all of these untokenizable things -- using tokens! So I can describe what the smell of coffee is in words and you can immediately reconstruct that based on my description, even though the actual smell of coffee is not encoded in the tokens of what I'm saying at all.

philipwhiuk 18 hours ago

It's great to see that LLMs too, struggle with oil production.

noddybear 17 hours ago

The pipe placement seems to confuse the models especially - as you can't place pipes carrying different fluids next to each other.

dkural 16 hours ago

So the battle of AGI will be won on the playing-fields of Factorio and StarCraft..