The Perfect Battle with Reinforcement Learning

Firefighter and arsonist agents are trained to control fire in a virtual environment. The arsonist agents patrol the environment to keep the spot fires away from the firefighters. Arsonist agents can freeze firefighter agents for a few seconds to stop them from extinguishing the spot fires. Conversely, the firefighter agents patrol the environment to extinguish the fires, and they can also freeze arsonist agents. Both opposing groups learned through Reinforcement Learning to optimize their actions. An agent is rewarded when the action it takes helps accomplish the objective of its group (Firefighter or Arsonist) and punished when it does not.


The agent observes the environment in a certain state and executes an action. Based on that action, the agent receives a positive or negative reward. When the state occurs again, the agent adjusts its action according to the reward received before. The process is repeated until the agent is able to act in an optimized way. The arsonist agents' behavior is measured by the number of fires they are able to keep “alive”. The firefighter agents' behavior is measured by the number of fires extinguished.
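The observe, act, reward, adjust cycle described above can be illustrated with a very small sketch. It is not the project's code: the project trains a PPO model inside Unity, while this example uses a plain lookup table on a made-up environment. The state numbering, the two action names and the reward rule are all assumptions chosen only to make the loop visible.

```python
import random

# Hypothetical toy setup: 3 states and 2 actions. This is NOT the Unity scene.
ACTIONS = ["patrol", "extinguish"]


def reward(state, action):
    # Assumed reward rule for illustration: state 2 means "a fire is nearby",
    # so extinguishing is rewarded there and punished elsewhere.
    if state == 2:
        return 1.0 if action == "extinguish" else -1.0
    return 0.1 if action == "patrol" else -0.1


def train(episodes=500, lr=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # One value estimate per (state, action) pair, all starting at zero.
    q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.randrange(3)                      # observe a state
        if rng.random() < epsilon:                    # sometimes explore
            action = rng.choice(ACTIONS)
        else:                                         # otherwise use what was learned
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        r = reward(state, action)                     # receive the reward
        # Move the estimate toward the reward, so the next visit to this
        # state is guided by what happened this time.
        q[(state, action)] += lr * (r - q[(state, action)])
    return q


def greedy_action(q, state):
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_table = train()
print([greedy_action(q_table, s) for s in range(3)])
```

After enough repetitions the table prefers "extinguish" only in the state where it was rewarded, which is the same kind of adjustment the agents make, at a much larger scale, with a neural network.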

Platform and technology:

The environment and agents were simulated in Unity 3D, with the Unity ML-Agents toolkit plugin connecting TensorFlow and Python for the reinforcement learning process. The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source plugin that enables game scenes to work as environments for training intelligent agents using reinforcement learning, imitation learning, neuroevolution and other machine learning techniques.

The above diagram shows how the data flows from the environment through the agent to the TensorFlow model and back to the agent.


For the project, two different components were implemented in two programming languages: C# to move the agents and interact with the environment, and Python with a PPO (Proximal Policy Optimization) model implemented in TensorFlow to run the training sessions. The following graphs show the 14 tests used as a benchmark. The results show random actions and no learning for both agent groups.

The cumulative reward shows no learning for the different agents: no improvement is observed during the benchmark process.

The entropy graph shows that the amount of random actions taken by the agents did not decrease during the benchmark process: the agents kept acting randomly.
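To make the link between entropy and random behavior concrete, here is a minimal sketch of the entropy of a discrete action distribution. The four-action probability lists are invented for illustration and are not values from the benchmark. How ML-Agents computes the logged entropy internally is not shown here.

```python
import math


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


# Hypothetical distributions over four actions.
uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])    # every action equally likely
confident = policy_entropy([0.97, 0.01, 0.01, 0.01])  # one action clearly preferred

print(uniform > confident)
```

A policy that picks every action with equal probability has the highest possible entropy. An entropy curve that stays flat and high therefore indicates that the agents never came to prefer some actions over others.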


The reinforcement learning technique used was Proximal Policy Optimization (PPO). This technique is well known for striking a balance between ease of implementation, sample complexity and ease of tuning. At each step in the training process, PPO updates the policy to minimize a cost function while keeping the new policy close to the previous one.
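As a rough illustration of that cost function, the sketch below implements the clipped surrogate loss from the PPO paper in plain Python. It is not the TensorFlow code used in the project. The function name, the sample values and the clipping range of 0.2 are assumptions made for the example. The source does not state which hyperparameters were used.

```python
def ppo_clipped_loss(ratios, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over the samples.

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: how much better each action was than expected
    clip_eps:   assumed clipping range, not taken from the project
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the two estimates, so large policy
        # changes earn no extra credit.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(ratios)


# A good action whose probability grew by 50% is only credited up to +20%.
print(ppo_clipped_loss([1.5], [1.0]))
```

The clipping is what keeps each update small: once the probability of an action has moved more than the allowed range away from the old policy, pushing it further no longer lowers the loss.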

Each pair of curves represents a different training session. The blue (Firefighters) and brown (Arsonists) curves represent training session number 9; the cyan (Firefighters) and magenta (Arsonists) curves represent training session number 10.

The graph above shows a clear improvement compared with the benchmark. The reward improved from negative values in the first training steps to values between 6.5 and 8.0. Both agent groups also learned at a similar speed, which means each group was able to learn from the opposing group. Both groups converged to similar values across the different training sessions.

For the full technical report or further information, use this link: Mail to Carlos


In short, both groups learned that it is easier to kill each other than to fight for control of the fire.

The reason is that the neural network rewards results over the effort to control the fire. It is cheap and fast to "kill" the opponent. During the training process, the agents obtain rewards faster by "killing" an adversarial agent than by securing a spot fire. This policy is reinforced in the neural network and shapes the agents' behavior.

A training session example

