**Chapter 16 – Reinforcement Learning**

This notebook contains all the sample code and solutions to the exercises in chapter 16.

# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import numpy.random as rnd
import os

# to make this notebook's output stable across runs
rnd.seed(42)

# To plot pretty figures and animations
%matplotlib nbagg
import matplotlib
import matplotlib.animation as animation
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "rl"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# Introduction to OpenAI gym

In this notebook we will be using [OpenAI gym](https://gym.openai.com/), a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning *agents* to interact with. Let's start by importing `gym`:

In [2]:
import gym

Next we will load the MsPacman environment, version 0.

In [3]:
env = gym.make('MsPacman-v0')

Let's initialize the environment by calling its `reset()` method. This returns an observation:

In [4]:
obs = env.reset()

Observations vary depending on the environment. In this case it is an RGB image represented as a 3D NumPy array of shape [height, width, channels] (with 3 channels: Red, Green and Blue). In other environments the observation may be a different kind of object, as we will see later.

In [5]:
obs.shape
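
NumPy image arrays are indexed rows first, so the first dimension is the height and the second is the width. The short sketch below illustrates this with a blank stand-in frame (the `frame` array is hypothetical, it is not produced by gym, and its size is chosen to match a 210x160 Atari screen):

```python
import numpy as np

# Hypothetical stand-in for an Atari frame: 210 rows (height),
# 160 columns (width) and 3 color channels (red, green, blue).
frame = np.zeros((210, 160, 3), dtype=np.uint8)

height, width, channels = frame.shape

# Index order is [row, column, channel]: paint the top-left pixel pure red.
frame[0, 0] = (255, 0, 0)

# Slicing the first axis selects rows, so this is the top half of the image.
top_half = frame[:height // 2]
print(top_half.shape)
```

The same indexing applies to the `obs` array returned by the environment.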

An environment can be visualized by calling its `render()` method, and you can pick the rendering mode (the rendering options depend on the environment). In this example we will set `mode="rgb_array"` to get an image of the environment as a NumPy array:

In [6]:
img = env.render(mode="rgb_array")

Let's plot this image:

In [7]:
plt.figure(figsize=(5,4))
plt.imshow(img)
plt.axis("off")
plt.show()

Welcome back to the 1980s! :)

In this environment, the rendered image is simply equal to the observation (but in many environments this is not the case):

In [8]:
(img == obs).all()

Let's create a little helper function to plot an environment:

In [9]:
def plot_environment(env, figsize=(5,4)):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    plt.figure(figsize=figsize)
    img = env.render(mode="rgb_array")
    plt.imshow(img)
    plt.axis("off")
    plt.show()

Let's see how to interact with an environment. Your agent will need to select an action from an "action space" (the set of possible actions). Let's see what this environment's action space looks like:

In [10]:
env.action_space

`Discrete(9)` means that the possible actions are the integers 0 through 8, which represent the 9 possible positions of the joystick (0=center, 1=up, 2=right, 3=left, 4=down, 5=upper-right, 6=upper-left, 7=lower-right, 8=lower-left).
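
To keep the code readable, you may want to give these integers names. Here is a minimal sketch; `ACTION_NAMES` and `action_name()` are hypothetical helpers written for this notebook (they are not part of gym), and the mapping simply copies the list above:

```python
# Joystick positions in the order listed above (index = action number).
ACTION_NAMES = ["center", "up", "right", "left", "down",
                "upper-right", "upper-left", "lower-right", "lower-left"]

def action_name(action):
    """Return a human-readable label for an action number (0 through 8)."""
    return ACTION_NAMES[action]

print(action_name(3), action_name(8))
```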

Next we need to tell the environment which action to play, and it will compute the next step of the game. Let's go left for 110 steps, then lower left for 40 steps:

In [11]:
env.reset()
for step in range(110):
    env.step(3)  # left
for step in range(40):
    env.step(8)  # lower-left

Where are we now?

In [12]:
plot_environment(env)

The `step()` function actually returns several important objects:

In [13]:
obs, reward, done, info = env.step(0)

The observation tells the agent what the environment looks like, as discussed earlier. This is a 210x160 RGB image:

In [14]:
obs.shape

The environment also tells the agent how much reward it got during the last step:

In [15]:
reward

When the game is over, the environment returns `done=True`:

In [16]:
done

Finally, `info` is an environment-specific dictionary that can provide some extra information about the internal state of the environment. This is useful for debugging, but your agent should not use this information for learning (it would be cheating).

In [17]:
info
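
The four values returned by `step()` are all an agent needs to run a full episode. The sketch below shows the resulting interaction loop on a tiny stand-in environment, so it runs without gym. `ToyEnv` and `run_episode()` are hypothetical names, and the reward rule (1.0 for playing action 1) is invented purely for illustration:

```python
import numpy as np

class ToyEnv(object):
    """Hypothetical stand-in mimicking the reset()/step() interface used above.

    The agent earns a reward of 1.0 each time it plays action 1, and the
    episode ends after `max_steps` steps.
    """
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.array([self.t], dtype=np.float32)  # initial observation

    def step(self, action):
        self.t += 1
        obs = np.array([self.t], dtype=np.float32)
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.max_steps
        info = {"steps_left": self.max_steps - self.t}  # for debugging only
        return obs, reward, done, info

def run_episode(env, policy):
    """Play one episode with the given policy and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        if done:
            return total_reward

total = run_episode(ToyEnv(max_steps=5), policy=lambda obs: 1)
print(total)
```

Note that the loop never reads `info`: it only uses the observation, the reward and the `done` flag.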

Let's play one full game (with 3 lives), by moving in random directions for 10 steps at a time, recording each frame:

In [18]:
frames = []

n_max_iterations = 1000
n_change_steps = 10

obs = env.reset()
for iteration in range(n_max_iterations):
    img = env.render(mode="rgb_array")
    frames.append(img)
    if iteration % n_change_steps == 0:
        action = env.action_space.sample()  # play randomly
    obs, reward, done, info = env.step(action)
    if done:
        break

Now show the animation (it's a bit jittery within Jupyter):

In [20]:
def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

In [21]:
video = plot_animation(frames)
plt.show()

Once you have finished playing with an environment, you should close it to free up resources:

In [22]:
env.close()

To code our first learning agent, we will be using a simpler environment: the Cart-Pole. 

# A simple environment: the Cart-Pole

The Cart-Pole is a very simple environment composed of a cart that can move left or right, and a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.

In [23]:
env = gym.make("CartPole-v0")

In [24]:
obs = env.reset()

In [25]:
obs

The observation is a 1D NumPy array composed of 4 floats: they represent the cart's horizontal position, its velocity, the angle of the pole (0 = vertical), and the angular velocity. Let's render the environment... unfortunately we need to fix an annoying rendering issue first.

## Fixing the rendering issue

Some environments (including the Cart-Pole) require access to your display and open a separate window, even if you specify the `rgb_array` mode. In general you can safely ignore that window. However, if Jupyter is running on a headless server (i.e., without a screen), rendering will raise an exception. One way to avoid this is to install a fake X server like Xvfb. You can then start Jupyter using the `xvfb-run` command:

 $ xvfb-run -s "-screen 0 1400x900x24" jupyter notebook

This does not seem to be possible using Binder, so unfortunately we cannot use OpenAI gym's rendering function there: we need to define our own.

In [31]:
from PIL import Image, ImageDraw

try:
    from pyglet.gl import gl_info
    openai_cart_pole_rendering = True   # no problem, let's use OpenAI gym's rendering function
except ImportError:
    openai_cart_pole_rendering = False  # probably running on binder, let's use our own rendering function

def render_cart_pole(env, obs):
    if openai_cart_pole_rendering:
        # use OpenAI gym's rendering function
        return env.render(mode="rgb_array")
    else:
        # basic rendering for the cart pole environment if OpenAI can't render it
        img_w = 100
        img_h = 50
        cart_w = 20
        pole_len = 30
        x_width = 2
        max_ang = 0.2
        bg_col = (255, 255, 255)
        cart_col = 0x000000 # Blue Green Red
        pole_col = 0x0000FF # Blue Green Red

        pos, vel, ang, ang_vel = obs
        img = Image.new('RGB', (img_w, img_h), bg_col)
        draw = ImageDraw.Draw(img)
        cart_x = pos * img_w // x_width + img_w // x_width
        cart_y = img_h * 95 // 100
        top_pole_x = cart_x + pole_len * np.sin(ang)
        top_pole_y = cart_y - pole_len * np.cos(ang)
        pole_col = int(np.minimum(np.abs(ang / max_ang), 1) * pole_col)
        draw.line((cart_x, cart_y, top_pole_x, top_pole_y), fill=pole_col) # draw pole
        draw.line((cart_x - cart_w // 2, cart_y, cart_x + cart_w // 2, cart_y), fill=cart_col) # draw cart
        return np.array(img)

def plot_cart_pole(env, obs):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    img = render_cart_pole(env, obs)
    plt.imshow(img)
    plt.axis("off")
    plt.show()

In [32]:
plot_cart_pole(env, obs)

Now let's look at the action space:

In [33]:
env.action_space

Yep, just two possible actions: accelerate towards the left or towards the right. Let's push the cart left until the pole falls:

In [35]:
obs = env.reset()
while True:
    obs, reward, done, info = env.step(0)
    if done:
        break

In [36]:
plot_cart_pole(env, obs)

Notice that the game is over when the pole tilts too much, not when it actually falls. Now let's reset the environment and push the cart to the right instead:

In [37]:
obs = env.reset()
while True:
    obs, reward, done, info = env.step(1)
    if done:
        break

In [38]:
plot_cart_pole(env, obs)

Looks like it's doing what we're telling it to do. Now how can we make the pole remain upright? We will need to define a _policy_ for that. This is the strategy that the agent will use to select an action at each step. It can use all the past actions and observations to decide what to do.

# A simple hard-coded policy

Let's hard-code a simple strategy: if the pole is tilting to the left, then push the cart to the left, and _vice versa_. Let's see if that works:

In [39]:
frames = []

n_max_iterations = 1000
n_change_steps = 10

obs = env.reset()
for iteration in range(n_max_iterations):
    img = render_cart_pole(env, obs)
    frames.append(img)

    # hard-coded policy
    position, velocity, angle, angular_velocity = obs
    if angle < 0:
        action = 0
    else:
        action = 1

    obs, reward, done, info = env.step(action)
    if done:
        break

In [40]:
video = plot_animation(frames)
plt.show()

Nope, the system is unstable and after just a few wobbles, the pole ends up too tilted: game over. We will need to be smarter than that!

# Policy Gradients

Let's create a neural network that will take the observations as inputs, and output the action to take. More precisely, it will output a probability for each action, and we will sample an action based on those probabilities. For example, if it says that the probability of pushing left should be 70%, and the probability of pushing right should be 30%, then we will pick a random number between 0 and 1: if it is lower than 0.7 we will push left, or else we will push right. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. Suppose you go to the same restaurant every week, and the first time you really enjoyed the Caesar salad. You could order the same thing every week and be guaranteed to enjoy your meal, but you may be missing out on another great dish. Once in a while, you should try out something new.
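
The sampling step described above can be sketched in a few lines of NumPy. The `sample_action()` helper is hypothetical (it is not part of gym or TensorFlow), and the action numbers assume the Cart-Pole convention used earlier (0 = push left, 1 = push right):

```python
import numpy as np

def sample_action(p_left, rng):
    """Return 0 (push left) with probability p_left, or else 1 (push right)."""
    # rng.rand() draws a float uniformly from [0, 1)
    return 0 if rng.rand() < p_left else 1

rng = np.random.RandomState(42)
actions = [sample_action(0.7, rng) for _ in range(10000)]

# The fraction of "left" actions should be close to 0.7
fraction_left = actions.count(0) / float(len(actions))
print(fraction_left)
```

Because the action is sampled rather than always set to the most probable one, the agent still pushes right about 30% of the time in this example, which is what lets it explore.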

In [41]:
import tensorflow as tf

from tensorflow.contrib.layers import fully_connected

n_inputs = 4   # the four values of an observation
n_hidden = 4
n_outputs = 2  # one output per action: a softmax over a single output would always be 1

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
y = tf.placeholder(tf.float32, shape=[None, n_outputs])  # target probability of each action
hidden = fully_connected(X, n_hidden, activation_fn=tf.nn.elu)
logits = fully_connected(hidden, n_outputs, activation_fn=None)
outputs = tf.nn.softmax(logits)  # estimated probability of each action
# keyword arguments avoid mixing up the labels and the logits
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cross_entropy)

Now to train this network we will need to feed the input batches `X` and the targets `y`. The inputs are easy enough: these will be the observations.

_Note_: in this particular environment, the past actions and observations can safely be ignored, since you can observe the environment's full state. If there were some hidden state then you may need to consider all past actions and observations in order to try to infer the hidden state of the environment. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is if the observations are noisy: you may want to use the past few observations to estimate the most likely current state. Our problem is thus as simple as can be: the current observation is noise-free and contains the environment's full state.
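
To make the two examples in this note concrete, here is a minimal sketch of how past observations could be used. Both helpers are hypothetical, and the time step `dt = 0.02` seconds is an assumption about the simulation (check your environment before relying on it):

```python
import numpy as np

def estimate_velocity(prev_position, position, dt):
    """Finite-difference estimate of the velocity from two successive positions."""
    return (position - prev_position) / dt

def smooth(observations, window=3):
    """Average the last `window` observations to reduce observation noise."""
    recent = np.asarray(observations[-window:], dtype=np.float64)
    return recent.mean(axis=0)

dt = 0.02  # assumed duration of one simulation step, in seconds
velocity = estimate_velocity(prev_position=0.10, position=0.12, dt=dt)
print(velocity)

noisy_positions = [[0.0], [1.0], [2.0], [3.0]]
print(smooth(noisy_positions, window=3))
```

A plain moving average is only the simplest possible filter: it lags behind a fast-moving state, so it trades noise for delay.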

But what about the labels? How can we tell what the target probabilities should be? One option is to let this policy network play the game, say, 100 times, then rank the games according to the total reward they get. The actions taken during the best games were good, on average, so they should be made a bit more likely, while the actions taken during the worst games were bad, on average, so they should be made less likely. Of course, perhaps the policy network made a few good moves during a very bad game, and unfortunately these good moves will be made less likely. That's OK: if we repeat the process many times, after a while the good moves should on average get more and more likely, and the bad moves should get less and less likely. A good basketball player sometimes plays in a really bad team: this obviously damages his reputation, but if he plays a sufficient number of games, overall his reputation should correspond to his talent.
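
The ranking step could look something like the sketch below. It only illustrates the idea of separating good games from bad ones; it is not a full training algorithm. The `label_episodes()` helper and the `top_fraction` cut-off are assumptions made for this example:

```python
import numpy as np

def label_episodes(total_rewards, top_fraction=0.5):
    """Return +1 for the games in the best `top_fraction`, and -1 for the others."""
    total_rewards = np.asarray(total_rewards, dtype=np.float64)
    n_best = max(1, int(round(len(total_rewards) * top_fraction)))
    ranking = np.argsort(-total_rewards)  # game indices, best game first
    labels = -np.ones(len(total_rewards), dtype=np.int64)
    labels[ranking[:n_best]] = 1
    return labels

# Total reward of four games: the 2nd and 4th games are the best half
labels = label_episodes([12.0, 45.0, 9.0, 30.0])
print(labels)
```

Every action taken during a game labeled +1 would then be made a bit more likely, and every action taken during a game labeled -1 a bit less likely.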

In [42]:
obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward)
    if done:
        break

**Work in progress – more content coming soon...**

# Exercise solutions

Coming soon...