Reinforcement Learning with Raw Actions and Observations in PySC2

Steven Brown · Published in ITNEXT · 8 min read · Sep 15, 2019


In my previous tutorial I introduced a recent addition to PySC2 known as raw observations and raw actions. Now we can take that knowledge and attempt to teach our bot how to play using reinforcement learning.

Applying reinforcement learning to the complete game is extremely complex and takes a lot of time and computational power, as DeepMind have shown us with AlphaStar. We can make things easier for ourselves by significantly reducing the complexity of the game.

The first step to reducing complexity is to limit what the bot can do, so we will create a Terran bot that can build a single supply depot and a single barracks (both at fixed locations), train marines, and attack a predetermined location.

The second step to reducing complexity is that we will have the bot compete against an opponent with the same limitations. We do this because using the game’s AI opponents leads to a much larger set of units to track and there are some things like cloaked units that we just won’t be able to deal with.

Thirdly, the opponent will simply perform actions at random. This gives us a good baseline for checking that our bot is actually learning and not just getting lucky.

Finally we will remove the fog of war so that there is no hidden information and the bot’s world is more predictable and easier to respond to.

Let’s get into it.

1. Create the Base Agent

Since both of our bots will essentially operate in the same way, we will create a common agent class with code that both the learning agent and the random agent will use.

First we need to import a few libraries:
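The exact imports depend on your setup, but a block along these lines covers everything used below (NumPy and pandas for the Q table later on, absl for the entry point):

```python
import random

import numpy as np
import pandas as pd
from absl import app

from pysc2.agents import base_agent
from pysc2.env import run_loop, sc2_env
from pysc2.lib import actions, features, units
```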

Next we create our agent class:
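A minimal sketch of that class, building on PySC2's BaseAgent (the base_top_left attribute is filled in later, once we know where we spawned):

```python
class Agent(base_agent.BaseAgent):
    """Behaviour shared by the learning agent and the random agent."""

    def __init__(self):
        super(Agent, self).__init__()
        # Set on the first step of each game, once we know our spawn corner.
        self.base_top_left = None
```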

Let’s define which actions the agents can perform:
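One way to express this is a class-level tuple on Agent, where each name matches a method we will add shortly (the names themselves are just my convention):

```python
    # Every high-level action either agent is allowed to take.
    actions = ("do_nothing",
               "harvest_minerals",
               "build_supply_depot",
               "build_barracks",
               "train_marine",
               "attack")
```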

Notice that we do not allow the agents to rebuild their command centre. You could definitely do this, but I chose to leave it out for simplicity, as the agent is generally only seconds away from being defeated once it loses its command centre.

There are a few utility methods we will need, first a few methods to collect units from the observation:
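Roughly, these just filter the raw unit list by unit type and alliance; the method names here are my own shorthand:

```python
    def get_my_units_by_type(self, obs, unit_type):
        # All of our own units of the given type.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.alliance == features.PlayerRelative.SELF]

    def get_enemy_units_by_type(self, obs, unit_type):
        # All enemy units of the given type.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.alliance == features.PlayerRelative.ENEMY]

    def get_my_completed_units_by_type(self, obs, unit_type):
        # Only our buildings of the given type that have finished construction.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.build_progress == 100
                and unit.alliance == features.PlayerRelative.SELF]

    def get_enemy_completed_units_by_type(self, obs, unit_type):
        # Only completed enemy buildings of the given type.
        return [unit for unit in obs.observation.raw_units
                if unit.unit_type == unit_type
                and unit.build_progress == 100
                and unit.alliance == features.PlayerRelative.ENEMY]
```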

We will also add a method to calculate the distances between a list of units and a specified point:
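Something like this, using NumPy to get the Euclidean distance from every unit in the list to the point:

```python
    def get_distances(self, obs, units, xy):
        # Euclidean distance from each unit to the point (x, y).
        units_xy = [(unit.x, unit.y) for unit in units]
        return np.linalg.norm(np.array(units_xy) - np.array(xy), axis=1)
```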

These methods will be used by our action methods.

2. Create the Action Methods

We need to add a method for each action defined in the list we created earlier. Each one of these methods will receive the observation from each step so that it can act independently. You will see later that this significantly simplifies the structure of our agents.

Let’s start with the simple “no op” action that does nothing:
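This one is a single line wrapping the raw no_op function:

```python
    def do_nothing(self, obs):
        return actions.RAW_FUNCTIONS.no_op()
```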

Next we will create the method that will send an idle SCV back to a mineral patch:
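A sketch of that method; the list of mineral field types comes from PySC2's units.Neutral enum:

```python
    def harvest_minerals(self, obs):
        scvs = self.get_my_units_by_type(obs, units.Terran.SCV)
        idle_scvs = [scv for scv in scvs if scv.order_length == 0]
        if len(idle_scvs) > 0:
            mineral_patches = [unit for unit in obs.observation.raw_units
                               if unit.unit_type in [
                                   units.Neutral.BattleStationMineralField,
                                   units.Neutral.BattleStationMineralField750,
                                   units.Neutral.LabMineralField,
                                   units.Neutral.LabMineralField750,
                                   units.Neutral.MineralField,
                                   units.Neutral.MineralField750,
                                   units.Neutral.PurifierMineralField,
                                   units.Neutral.PurifierMineralField750,
                                   units.Neutral.PurifierRichMineralField,
                                   units.Neutral.PurifierRichMineralField750,
                                   units.Neutral.RichMineralField,
                                   units.Neutral.RichMineralField750
                               ]]
            scv = random.choice(idle_scvs)
            # Send the idle SCV to the mineral patch closest to it.
            distances = self.get_distances(obs, mineral_patches, (scv.x, scv.y))
            mineral_patch = mineral_patches[np.argmin(distances)]
            return actions.RAW_FUNCTIONS.Harvest_Gather_unit(
                "now", scv.tag, mineral_patch.tag)
        return actions.RAW_FUNCTIONS.no_op()
```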

As you can see, there are a lot of mineral patch unit types! What we do here is find the mineral patch that is closest to the chosen SCV. First we calculate the distances from each mineral patch to the SCV using our get_distances method, and then we find the closest distance using np.argmin.

The Harvest_Gather_unit raw action is quite neat in that it accepts the unit tag of a worker and the unit tag of a resource (mineral patch or vespene geyser). As a result there is no risk of a mis-click like you may get with regular actions.

Now we create the method to build a supply depot:
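Something along these lines; the (22, 26) and (35, 42) coordinates are placeholders I've chosen for the top-left and bottom-right spawns, so adjust them to taste:

```python
    def build_supply_depot(self, obs):
        supply_depots = self.get_my_units_by_type(obs, units.Terran.SupplyDepot)
        scvs = self.get_my_units_by_type(obs, units.Terran.SCV)
        if (len(supply_depots) == 0 and len(scvs) > 0
                and obs.observation.player.minerals >= 100):
            # Placeholder build locations for each spawn corner of Simple64.
            supply_depot_xy = (22, 26) if self.base_top_left else (35, 42)
            # Use the SCV that is already closest to the build location.
            distances = self.get_distances(obs, scvs, supply_depot_xy)
            scv = scvs[np.argmin(distances)]
            return actions.RAW_FUNCTIONS.Build_SupplyDepot_pt(
                "now", scv.tag, supply_depot_xy)
        return actions.RAW_FUNCTIONS.no_op()
```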

Similar to what we did earlier with SCVs and mineral patches, we find the SCV that is closest to the build location and then instruct it to build there. It's possible for the agent to call this method multiple times while an SCV is travelling to the build location, so we don't want to choose SCVs at random; otherwise multiple SCVs would be pulled from the mineral line.

Unlike regular actions, raw actions do not crash if the action cannot be performed, but I like to perform my own checks so that error notifications do not appear in the game. I just have a need to keep things tidy.

The supply depot locations here are for the Simple64 map we will be using, and are just locations I chose that seemed balanced for each base location. The self.base_top_left value will be explained later.

Next we create the method to build a barracks:
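Again, the coordinates here are just placeholders of mine:

```python
    def build_barracks(self, obs):
        completed_supply_depots = self.get_my_completed_units_by_type(
            obs, units.Terran.SupplyDepot)
        barrackses = self.get_my_units_by_type(obs, units.Terran.Barracks)
        scvs = self.get_my_units_by_type(obs, units.Terran.SCV)
        if (len(completed_supply_depots) > 0 and len(barrackses) == 0
                and len(scvs) > 0
                and obs.observation.player.minerals >= 150):
            barracks_xy = (22, 21) if self.base_top_left else (35, 45)
            distances = self.get_distances(obs, scvs, barracks_xy)
            scv = scvs[np.argmin(distances)]
            return actions.RAW_FUNCTIONS.Build_Barracks_pt(
                "now", scv.tag, barracks_xy)
        return actions.RAW_FUNCTIONS.no_op()
```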

This is much the same as the supply depot method above.

Once we have a barracks we can train marines:
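Roughly as follows, checking minerals, free supply and the barracks queue before issuing the order:

```python
    def train_marine(self, obs):
        completed_barrackses = self.get_my_completed_units_by_type(
            obs, units.Terran.Barracks)
        free_supply = (obs.observation.player.food_cap -
                       obs.observation.player.food_used)
        if (len(completed_barrackses) > 0 and free_supply > 0
                and obs.observation.player.minerals >= 100):
            barracks = completed_barrackses[0]
            # A standard barracks can only queue five units at a time.
            if barracks.order_length < 5:
                return actions.RAW_FUNCTIONS.Train_Marine_quick(
                    "now", barracks.tag)
        return actions.RAW_FUNCTIONS.no_op()
```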

One of the cool features of raw observations is that we can see how many marines have been queued at the barracks by using barracks.order_length which we know is limited to 5 for a standard barracks.

Lastly, the attack method:
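A sketch, with the attack coordinates once again being placeholders for the two Simple64 base locations:

```python
    def attack(self, obs):
        marines = self.get_my_units_by_type(obs, units.Terran.Marine)
        if len(marines) > 0:
            # Placeholder coordinates for the opposing base on Simple64.
            attack_xy = (38, 44) if self.base_top_left else (19, 23)
            # Order the marine furthest from the target.
            distances = self.get_distances(obs, marines, attack_xy)
            marine = marines[np.argmax(distances)]
            # Scatter the attack point a little so the marines spread out
            # and eventually find every enemy building.
            x_offset = random.randint(-4, 4)
            y_offset = random.randint(-4, 4)
            return actions.RAW_FUNCTIONS.Attack_pt(
                "now", marine.tag,
                (attack_xy[0] + x_offset, attack_xy[1] + y_offset))
        return actions.RAW_FUNCTIONS.no_op()
```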

There are a few tricks to this method. First of all, we find the marine furthest from the attack location. This is similar to what we did previously with SCVs, except that we use np.argmax to find the furthest distance.

When we have chosen a marine we actually choose a location at random around the predetermined attack coordinates. We do this to ensure that our units explore the enemy base location fully and destroy all buildings.

3. Create the Random Agent

We are almost ready to test our agent, but first we should create the agent that performs actions at random. This starts by adding a method to the base agent that will determine the base location at the start of each game:
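One simple way to do this is to override step on the base Agent and, on the first observation of a game, check which side of the map our command centre is on (the midpoint of 32 assumes a 64x64 raw resolution):

```python
    def step(self, obs):
        super(Agent, self).step(obs)
        if obs.first():
            command_center = self.get_my_units_by_type(
                obs, units.Terran.CommandCenter)[0]
            # Anything left of the middle means we spawned in the top-left.
            self.base_top_left = (command_center.x < 32)
```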

Although our base agent is quite complex, our random agent is actually very simple:
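It just picks one of the action names and calls the matching method:

```python
class RandomAgent(Agent):

    def step(self, obs):
        super(RandomAgent, self).step(obs)
        action = random.choice(self.actions)
        # Turn the action name into the method of the same name and call it.
        return getattr(self, action)(obs)
```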

We choose an action at random here from our predefined list, and then we use Python’s getattr which essentially converts the action name into a method call, and passes in the observation as an argument.

Now let’s add the code to run our agent:
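A sketch of the entry point, assuming the raw interface settings from the previous tutorial (raw_resolution=64 is my own choice; the rest matches what is described below):

```python
def main(unused_argv):
    agent1 = RandomAgent()
    agent2 = RandomAgent()
    try:
        with sc2_env.SC2Env(
                map_name="Simple64",
                players=[sc2_env.Agent(sc2_env.Race.terran),
                         sc2_env.Agent(sc2_env.Race.terran)],
                agent_interface_format=features.AgentInterfaceFormat(
                    action_space=actions.ActionSpace.RAW,
                    use_raw_units=True,
                    raw_resolution=64),
                step_mul=48,
                disable_fog=True,
        ) as env:
            run_loop.run_loop([agent1, agent2], env, max_episodes=1000)
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    app.run(main)
```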

You can see we have disabled the fog of war using disable_fog and we have also chosen a fairly large step multiplier of 48, which makes the games run much faster and doesn’t really impact the outcome for such simple agents.

We also run the agents for 1,000 games, which is generally more than enough for the smart agent to learn how to win with around a 95% success rate.

Run your code and you should see the two bots battling each other.

That’s great, but we really want one of the bots to learn how to win, right?

4. Create the Q Table

One of the simplest forms of reinforcement learning is the Q table. It is essentially a spreadsheet of all the states the game has been in and how good or bad each action is within each state. The bot updates the values of each action depending on whether it wins or loses, and over time it builds a fairly good strategy for a variety of scenarios.

I have modified a version of Morvan Zhou’s code.

First create the QLearningTable class:
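The class just holds the action names, the learning parameters, and a pandas DataFrame that acts as the table itself (I've used a learning rate of 0.1 here so it matches the worked example below):

```python
class QLearningTable:

    def __init__(self, actions, learning_rate=0.1, reward_decay=0.9):
        self.actions = actions
        self.learning_rate = learning_rate
        self.reward_decay = reward_decay
        # One row per state we have seen, one column per action.
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)
```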

Now we will add the method to choose which action the bot should perform:
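Roughly as follows, in the epsilon-greedy style of Morvan Zhou's original:

```python
    def choose_action(self, observation, e_greedy=0.9):
        self.check_state_exist(observation)
        if np.random.uniform() < e_greedy:
            # Exploit: take the best known action for this state,
            # breaking ties at random.
            state_action = self.q_table.loc[observation, :]
            action = np.random.choice(
                state_action[state_action == np.max(state_action)].index)
        else:
            # Explore: take any action at random.
            action = np.random.choice(self.actions)
        return action
```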

The e_greedy parameter here determines how often the bot should choose a random action instead of the best action. The value of 0.9 means it will choose the best action 90% of the time, and a random action 10% of the time.

In order to choose the best action it first retrieves the value of each action for the current state, then chooses the highest-valued action. If multiple actions share the same highest value, it will choose one of those actions at random.

Now we need to add the method that allows the bot to learn:
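A sketch of the standard Q-learning update (the 'terminal' marker is how I flag the end of a game, so there is no "next state" to look ahead to):

```python
    def learn(self, s, a, r, s_):
        self.check_state_exist(s)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            self.check_state_exist(s_)
            # Reward plus the discounted value of the best action
            # available in the state we landed in.
            q_target = r + self.reward_decay * self.q_table.loc[s_, :].max()
        else:
            # At the end of a game there is no future value, only the reward.
            q_target = r
        # Nudge the stored value towards the target by the learning rate.
        self.q_table.loc[s, a] += self.learning_rate * (q_target - q_predict)
```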

This is where the magic of Q tables happens. The parameter s refers to the previous state, a is the action that was performed in that state, r is the reward that was received after taking the action, and s_ is the state the bot landed in after taking the action.

First, q_predict looks up the value currently stored for taking that action in that state. Let's pretend the value was 0.1.

Next we determine the maximum possible value across all actions in the current state, discount it by the decay rate (0.9), and add the reward we received. As an example, if the current state's maximum action value is 0.5 and the reward we received was 0, then q_target will be 0.45, since we multiply 0.5 by the decay rate of 0.9 and add the reward of 0.

Finally we take the difference between the target value and the previous value (e.g. 0.45 - 0.1 = 0.35) and multiply it by the learning rate (e.g. 0.35 * 0.1 = 0.035). We then add this to the previous action value (e.g. 0.1 + 0.035 = 0.135) and store it back in the Q table.

The result of all this is that the action's value will either increase or decrease a little depending on the state we end up in, which makes the action either more or less likely to be chosen if we ever find ourselves in the previous state again.

You may have noticed both of these methods use another method we haven’t added yet, so let’s do that:
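Something like this (I'm adding the new row via .loc rather than the older DataFrame.append, which recent pandas versions no longer support):

```python
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # First time we've seen this state: give every action a value of 0.
            self.q_table.loc[state] = [0] * len(self.actions)
```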

All this method does is check to see if the state is in the Q table already, and if not it will add it with a value of 0 for all possible actions.

Does your brain hurt yet? Sorry. OK let’s get back to our agent.

5. Create the Smart Agent

Start by creating the class:
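The class itself is just another subclass of our base Agent:

```python
class SmartAgent(Agent):
    """The learning agent: same actions as RandomAgent, driven by a Q table."""
```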

Then we want to create an instance of the Q table when the agent is created.
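For example:

```python
    def __init__(self):
        super(SmartAgent, self).__init__()
        # One Q table, keyed by the action names defined on the base Agent.
        self.qtable = QLearningTable(self.actions)
        self.previous_state = None
        self.previous_action = None
```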

In each step of the game, we need to choose an action to perform. In order to choose the action we need to know the current state. Let’s create a method to track a simplified version of the game state:
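A sketch of such a method; the exact set of counts is up to you, but counting our own units and the enemy's, plus a few affordability flags, is enough for this bot:

```python
    def get_state(self, obs):
        command_centers = self.get_my_units_by_type(obs, units.Terran.CommandCenter)
        scvs = self.get_my_units_by_type(obs, units.Terran.SCV)
        idle_scvs = [scv for scv in scvs if scv.order_length == 0]
        supply_depots = self.get_my_units_by_type(obs, units.Terran.SupplyDepot)
        completed_supply_depots = self.get_my_completed_units_by_type(
            obs, units.Terran.SupplyDepot)
        barrackses = self.get_my_units_by_type(obs, units.Terran.Barracks)
        completed_barrackses = self.get_my_completed_units_by_type(
            obs, units.Terran.Barracks)
        marines = self.get_my_units_by_type(obs, units.Terran.Marine)

        queued_marines = (completed_barrackses[0].order_length
                          if len(completed_barrackses) > 0 else 0)

        free_supply = (obs.observation.player.food_cap -
                       obs.observation.player.food_used)
        # Booleans instead of a raw mineral count keep the state space small.
        can_afford_supply_depot = obs.observation.player.minerals >= 100
        can_afford_barracks = obs.observation.player.minerals >= 150
        can_afford_marine = obs.observation.player.minerals >= 100

        enemy_command_centers = self.get_enemy_units_by_type(
            obs, units.Terran.CommandCenter)
        enemy_scvs = self.get_enemy_units_by_type(obs, units.Terran.SCV)
        enemy_supply_depots = self.get_enemy_units_by_type(
            obs, units.Terran.SupplyDepot)
        enemy_barrackses = self.get_enemy_units_by_type(obs, units.Terran.Barracks)
        enemy_marines = self.get_enemy_units_by_type(obs, units.Terran.Marine)

        # A tuple of counts and booleans is easy to use as a Q table key.
        return (len(command_centers),
                len(scvs),
                len(idle_scvs),
                len(supply_depots),
                len(completed_supply_depots),
                len(barrackses),
                len(completed_barrackses),
                len(marines),
                queued_marines,
                free_supply,
                can_afford_supply_depot,
                can_afford_barracks,
                can_afford_marine,
                len(enemy_command_centers),
                len(enemy_scvs),
                len(enemy_supply_depots),
                len(enemy_barrackses),
                len(enemy_marines))
```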

That seems like a lot of code, but really we're just keeping track of our units and the enemy units. Since we disabled the fog of war and the raw units list shows us every unit on the map, we can see all of the enemy units with 100% accuracy.

Instead of tracking our exact mineral count, we reduce whether or not we can afford different units to a simple boolean value. This shrinks our state space considerably without really losing any useful information.

Now in our step, we can use the state to choose an action:
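Since the Q table is keyed by strings, we convert the state tuple before looking it up:

```python
    def step(self, obs):
        super(SmartAgent, self).step(obs)
        # The Q table needs a hashable, comparable key, so use the string form.
        state = str(self.get_state(obs))
        action = self.qtable.choose_action(state)
        return getattr(self, action)(obs)
```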

At this point our agent will choose an action at random, but it never learns, so let’s add the final piece:
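The step method grows to something like this, feeding the previous state, the previous action, the reward and the new state into the Q table before acting:

```python
    def step(self, obs):
        super(SmartAgent, self).step(obs)
        state = str(self.get_state(obs))
        action = self.qtable.choose_action(state)
        if self.previous_action is not None:
            # obs.reward is +1 for a win, -1 for a loss and 0 otherwise.
            self.qtable.learn(self.previous_state,
                              self.previous_action,
                              obs.reward,
                              'terminal' if obs.last() else state)
        self.previous_state = state
        self.previous_action = action
        return getattr(self, action)(obs)
```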

If we have seen a state and performed an action, we have now landed in a new state and received a reward, so we can feed this into our Q table.

If we don’t reset the previous_state and previous_action it could teach our agent incorrectly at the start of each game, so let’s reset the values when we start a new game:
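PySC2 calls reset on each agent at the start of every episode, so that is a convenient place to clear them:

```python
    def reset(self):
        super(SmartAgent, self).reset()
        self.previous_state = None
        self.previous_action = None
```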

Finally, change our first agent to be the smart agent:
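Only the agent construction in main needs to change:

```python
def main(unused_argv):
    agent1 = SmartAgent()
    agent2 = RandomAgent()
    # ... the rest of main stays exactly as before.
```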

That’s it! If you run the code you will see that eventually the smart agent starts to win almost every game.

The full code for this tutorial is available here.

If you like this tutorial, please support me on Patreon. Also please join me on Discord, or follow me on Twitch, Medium, GitHub, Twitter and YouTube.
