Atari Pong
By Luke Miller
This is a short post to describe my practical introduction to Reinforcement Learning (RL), where I trained a simple agent to play the classic Atari game Pong via a Deep Q-Network.
In English, this means we teach a novice computer to play the classic paddle game by allowing it to observe what happens when it performs various movements at different times and stages of gameplay (against the same, fairly strong opponent). Then, after making a sequence of movement choices, our agent either gets a point (reward of +1) or loses one (reward of -1). After a lot of trial and error, the agent will have observed enough situations to learn what is a good move to make at a given moment in the game.
Although my agent does not win its games, I have managed to train a fairly competent player that comes close to winning!
Please be aware that I do not claim that this work is at all original - I simply attempted to complete one of the homework assignments from Stanford University’s CS234 Reinforcement Learning course (2020). This exact topic has also been covered in one of Andrej Karpathy’s great blog posts. However, the code that completes the assignment is my own.
What’s really incredible is the sheer number of replays the agent needs to learn from in order to become a competent player (i.e. millions). Take a look at the videos below to see my player’s performance at different stages of its training.
Technical Detail
The implemented approach uses the impressively flexible Deep Q-Network (DQN)¹. It takes the greyscale image pixels (downsampled to 80 × 80) as its sensory input, and uses the standard Atari action space of six possible actions. The reward for an action is +1 for winning a point, -1 for losing one, or 0 if the point hasn’t ended yet. An episode is a whole match, concluding when either player reaches 21 points.
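For illustration, the kind of preprocessing that turns a raw emulator frame into that 80 × 80 greyscale input might look like the sketch below (the crop offsets and scaling are assumptions based on the common Pong recipe, not necessarily exactly what the assignment code does):

```python
import numpy as np

def preprocess(frame):
    """Illustrative preprocessing: crop the 210x160x3 Atari frame,
    downsample by a factor of 2 and keep a single channel, giving
    an 80x80 greyscale-style input for the network.
    (Crop offsets follow the common Pong recipe - an assumption.)"""
    frame = frame[35:195]          # crop away the scoreboard and borders
    frame = frame[::2, ::2, 0]     # downsample to 80x80, keep one channel
    return frame.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
```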
We use OpenAI’s Gym as the simulation engine for Pong.
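To give a flavour of the environment, the basic interaction loop (using the classic, pre-Gymnasium Gym API that was current in 2020) looks roughly like this - the environment name and the random policy are just for illustration:

```python
import gym

env = gym.make("Pong-v0")  # classic Gym Atari environment
obs = env.reset()

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random action, just to show the loop
    obs, reward, done, info = env.step(action)  # reward is +1, -1 or 0 on each step
    total_reward += reward                      # the episode total is the final score margin

print("Match finished, total reward:", total_reward)
```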
Much of the actual coding challenge in this exercise was about filling in the skeleton code, such as the architecture of the DQN, which takes the observed state as its input and passes it through a series of convolutional layers before outputting the approximated Q-values for each possible action (see this file).
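The shape of the network is roughly the following Keras-flavoured sketch - the layer sizes follow the Nature DQN paper and the stacked-frame input shape is an assumption, so this is an illustration of the idea rather than a copy of my implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(num_actions=6, input_shape=(80, 80, 4)):
    """Convolutional Q-network: maps a stack of preprocessed frames
    to one approximated Q-value per action."""
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 8, strides=4, activation="relu"),
        layers.Conv2D(64, 4, strides=2, activation="relu"),
        layers.Conv2D(64, 3, strides=1, activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_actions),  # linear output layer: Q-values for each action
    ])
```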
There was an extra challenge in trying to get this code to run in TensorFlow 2.0, as the original code was written in v1 - a lot of backwards compatibility issues had to be resolved (including the disabling of eager execution).
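For anyone attempting the same, a minimal illustration of the kind of compatibility shim involved (disabling eager execution via the `tf.compat.v1` module) is shown below - it is not a full list of the tweaks that were needed:

```python
import tensorflow.compat.v1 as tf

# Run TF1-style graph/session code on a TensorFlow 2.x install by
# reverting to lazy graph building (eager execution off).
tf.disable_eager_execution()

# Sessions and placeholders then behave as they did in v1, e.g.:
# sess = tf.Session()
# states = tf.placeholder(tf.uint8, shape=(None, 80, 80, 4))
```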
The training was run on a GPU via a Google Colab notebook, which is free to use but has a limited lease time (about 5 hours, though this varies with demand). I adjusted the learning-rate schedule so that the rate decreases more slowly.
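The schedule itself is just a linear interpolation, so slowing the decay amounts to stretching it over more steps. A hypothetical sketch (the specific values are illustrative, not the assignment’s defaults):

```python
def linear_schedule(step, begin=1e-4, end=5e-5, decay_steps=2_000_000):
    """Linearly interpolate the learning rate from `begin` to `end`
    over `decay_steps` training steps, then hold it at `end`.
    Raising `decay_steps` makes the rate decrease more slowly."""
    frac = min(step / decay_steps, 1.0)
    return begin + frac * (end - begin)
```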
Some of the log output from the training process is shown below, clearly showing a strong improvement in the agent’s average reward over the epochs:
2020-09-24 11:09:59,538:INFO: Average reward: -20.98 +/- 0.02
2020-09-24 11:34:05,078:INFO: Average reward: -16.74 +/- 0.26
2020-09-24 12:01:07,032:INFO: Average reward: -15.28 +/- 0.36
2020-09-24 12:29:26,189:INFO: Average reward: -13.26 +/- 0.54
2020-09-24 12:57:58,539:INFO: Average reward: -12.60 +/- 0.45
2020-09-24 13:26:51,798:INFO: Average reward: -9.96 +/- 0.76
2020-09-24 13:55:03,551:INFO: Average reward: -10.36 +/- 0.47
2020-09-24 14:22:26,403:INFO: Average reward: -8.32 +/- 0.65
2020-09-24 14:49:23,826:INFO: Average reward: -7.72 +/- 0.72
2020-09-24 15:16:30,806:INFO: Average reward: -3.38 +/- 0.80
2020-09-24 15:43:34,545:INFO: Average reward: -3.66 +/- 0.81
2020-09-24 16:11:01,012:INFO: Average reward: -4.06 +/- 0.87
2020-09-24 16:38:22,830:INFO: Average reward: -2.66 +/- 0.90
By the 12th epoch, the agent is losing on average by only 2.66 points. You can see the extent of this improvement in the videos shown in the next section.
Results
The first video below shows a match played at quite an early stage in the player’s learning journey, i.e. after just 250,000 training episodes (matches). The subpar playing standard is evident - they lose quite badly, although they still manage to nick a couple of cracking points!
Contrast that with their performance in a match played once the agent has learnt from 3 million episodes. The agent has clearly become quite an adept player, keeping the game to within two points.
That said, they do still lose most of their matches! An obvious next step would be to play around with the learning-rate schedule and other hyperparameters to see if we can get them to win more often than not.
If you like this project, please hit the ⭐ button on the GitHub page!
1. Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).