How to stabilize your runs in MARL
In my previous post, I introduced an environment for studying emergent cooperation in multi-agent systems. While the basic mechanics were functional, the training runs were severely unstable. This post documents how I fixed that.
Characterizing the instability
The instability manifested across two dimensions. First, runs with identical hyperparameters produced wildly divergent outcomes:

Second, a hyperparameter sweep revealed training collapse (note: perf is a normalized version of score; in these early experiments it could exceed 1 due to an environment bug I later fixed):

This level of variance made it impossible to distinguish genuine algorithmic improvements from random noise. Before adding environmental complexity, I needed to isolate and fix these sources of instability.
Solutions
Here is the list of things that helped:
- Rigorous logging validation
- Isolate non-determinism
- Randomize initial state distribution
- Sweep your rewards
Solution 1: Rigorous logging validation
Because you won’t be rendering thousands of experiments at different checkpoints, your logs are everything. If your logging is bugged, or reports a metric that isn’t actually what you care about, you’re doomed to lose hours of work. So make sure your logs work.
I learned this the hard way. My initial logging measured agent survival time, but the implementation had subtle bugs in how it aggregated data across vectorized environments. Some edge cases (like agents surviving the entire episode) were never logged. Others were double-counted due to incorrect episode boundary handling.
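As a concrete example, here is a minimal sketch of the kind of cross-check that would have caught those bugs, assuming a Gymnasium-style vectorized environment; `vec_env` and `policy` stand in for your own objects, and the manual tally is meant to be compared against whatever your logger reports for the same rollout.

```python
import numpy as np

def validate_episode_logging(vec_env, policy, num_steps=10_000):
    """Cross-check logged episode counts against a manual tally.

    Assumes a Gymnasium-style vectorized env whose step() returns
    (obs, rewards, terminated, truncated, infos). `policy` is any
    callable mapping observations to actions.
    """
    obs, _ = vec_env.reset(seed=0)
    num_envs = vec_env.num_envs
    episode_len = np.zeros(num_envs, dtype=np.int64)
    manual_episode_count = 0
    manual_lengths = []

    for _ in range(num_steps):
        actions = policy(obs)
        obs, rewards, terminated, truncated, infos = vec_env.step(actions)
        episode_len += 1
        # Count truncations too, so full-episode survivors are not dropped.
        done = np.logical_or(terminated, truncated)
        for i in np.flatnonzero(done):
            manual_episode_count += 1
            manual_lengths.append(episode_len[i])
            episode_len[i] = 0  # reset exactly once per episode boundary

    print(f"manual episodes: {manual_episode_count}, "
          f"mean survival: {np.mean(manual_lengths):.1f} steps")
```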
Solution 2: Isolate non-determinism
Multi-agent RL environments accumulate sources of randomness: policy stochasticity, environment dynamics, GPU operations, floating-point arithmetic. Left unchecked, these make debugging nearly impossible.
The problem: you might have subtle bugs that only surface after hundreds of millions of steps. At that point, attributing failure to the algorithm versus the environment becomes guesswork.
My approach: aggressively seed everything and start simple. One environment, CPU execution, deterministic policy. Verify that trajectories are reproducible. Only then scale up. This establishes a ground truth you can return to when things break.
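Below is a minimal sketch of that workflow, assuming a single Gymnasium-style environment and a PyTorch stack; `env` and `policy` are placeholders for your own objects, and the determinism flags trade speed for reproducibility while debugging.

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    """Pin every source of randomness under my control."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def rollout(env, policy, seed: int = 0, num_steps: int = 500):
    """Collect one trajectory with everything seeded, for comparison."""
    seed_everything(seed)
    obs, _ = env.reset(seed=seed)
    trajectory = []
    for _ in range(num_steps):
        action = policy(obs)  # deterministic policy while debugging
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((np.copy(obs), float(reward)))
        if terminated or truncated:
            obs, _ = env.reset(seed=seed)
    return trajectory

# Two rollouts with the same seed should match exactly before scaling up:
# t1, t2 = rollout(env, policy), rollout(env, policy)
# assert all(np.array_equal(o1, o2) and r1 == r2
#            for (o1, r1), (o2, r2) in zip(t1, t2))
```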
Solution 3: Randomize initial state distribution
This one is pretty straightforward, but it really helps agents see more diverse trajectories early on and learn a more stable policy. A sketch of what I mean follows.
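This is only a minimal sketch of the idea; the grid representation and the flat-index encoding are assumptions standing in for whatever state your environment actually uses.

```python
import numpy as np

def reset_with_random_spawns(grid_size: int, num_agents: int,
                             rng: np.random.Generator) -> np.ndarray:
    """Sample distinct random spawn cells instead of a fixed starting layout."""
    cells = rng.choice(grid_size * grid_size, size=num_agents, replace=False)
    # Convert flat cell indices back to (row, col) coordinates.
    return np.stack([cells // grid_size, cells % grid_size], axis=1)
```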
Solution 4: Sweep your rewards
At first I was against changing the reward structure: I really wanted the agents to solve the task with minimal signal (death = -1). It turns out bad rewards can really wreck your runs, and there is nothing wrong with sweeping the rewards as long as you keep your main metric fixed.
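Here is a minimal sketch of such a sweep as a random search over reward scales, with perf left untouched as the fixed metric. The search ranges and the `train_and_evaluate` entry point are assumptions; the component names match the table below.

```python
import numpy as np

# Hypothetical search ranges for each reward component.
SEARCH_SPACE = {
    "held_food_reward_scale": (0.0, 0.5),
    "hp_reward_scale": (0.0, 0.5),
    "reward_collect": (0.0, 2.0),
    "reward_death_scale": (0.0, 0.5),
    "reward_eat": (0.0, 2.0),
    "reward_steal": (-1.0, 0.0),
    "timestep_reward": (-0.01, 0.0),
}

def sample_reward_config(rng: np.random.Generator) -> dict:
    """Draw one reward configuration uniformly from the search space."""
    return {k: float(rng.uniform(lo, hi)) for k, (lo, hi) in SEARCH_SPACE.items()}

def run_sweep(num_trials: int, seed: int = 0):
    """Random search over reward shaping; perf stays the fixed metric."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(num_trials):
        config = sample_reward_config(rng)
        perf = train_and_evaluate(config)  # assumed training entry point
        results.append((config, perf))
    return results
```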
Here is an example of a stable reward configuration that consistently solves the environment in under 25M steps:
| Reward Component | Description | Value |
|---|---|---|
| held_food_reward_scale | Inventory holding reward | 0.26 |
| hp_reward_scale | Current health reward (per timestep) | 0.07 |
| reward_collect | Food collection reward | 1.0 |
| reward_death_scale | Normalized survival time reward | 0.14 |
| reward_eat | Consumption reward | 0.63 |
| reward_steal | Stealing penalty | -0.46 |
| timestep_reward | Living penalty (per timestep) | -0.001 |
Reward-performance correlation
Analyzing reward-performance correlations across the sweep revealed clear patterns. Each dot is a single experiment with different reward parameters. Discouraging theft seems essential:

Interestingly, dense HP-based rewards provided minimal benefit, possibly because the signal is too noisy:

Perhaps more obviously, rewarding collection and consumption behaviors proved essential:

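For completeness, here is a minimal sketch of how these correlations can be computed from the sweep results, assuming the `(config, perf)` pairs produced by the sweep sketch above.

```python
import numpy as np

def reward_perf_correlations(results):
    """Pearson correlation between each reward component and final perf."""
    perfs = np.array([perf for _, perf in results])
    correlations = {}
    for key in results[0][0]:
        values = np.array([config[key] for config, _ in results])
        correlations[key] = float(np.corrcoef(values, perfs)[0, 1])
    # Sort from most positively to most negatively correlated component.
    return dict(sorted(correlations.items(), key=lambda kv: kv[1], reverse=True))
```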
Results: Establishing a stable baseline
The original training curves demonstrated severe instability:

Now, agents consistently reach high performance without sudden collapse, across multiple independent runs:

Demonstration and next steps
The video below shows a solved rollout. Agents spawn in the house region (grey tiles, bottom-left) and learn to survive by collecting & eating food:
Now that I have stable runs, I can move on to increasing the environment complexity and make things even more interesting.