Add LunarLander-v2 Policy Gradients exercise solution

main
Aurélien Geron 2021-03-20 10:04:52 +13:00
parent 8cf72920b9
commit cfd0837f5c
1 changed files with 331 additions and 1 deletions

@@ -2820,7 +2820,337 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"TODO" "## 8.\n",
"_Exercise: Use policy gradients to solve OpenAI Gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`python3 -m pip install -U gym[box2d]`)._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by creating a LunarLander-v2 environment:"
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make(\"LunarLander-v2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inputs are 8-dimensional:"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [],
"source": [
"env.observation_space"
]
},
{
"cell_type": "code",
"execution_count": 242,
"metadata": {},
"outputs": [],
"source": [
"env.seed(42)\n",
"obs = env.reset()\n",
"obs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py), we can see that these each 8D observation (x, y, h, v, a, w, l, r) correspond to:\n",
"* x,y: the coordinates of the spaceship. It starts at a random location near (0, 1.4) and must land near the target at (0, 0).\n",
"* h,v: the horizontal and vertical speed of the spaceship. It starts with a small random speed.\n",
"* a,w: the spaceship's angle and angular velocity.\n",
"* l,r: whether the left or right leg touches the ground (1.0) or not (0.0)."
]
},
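{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this easier to read, here is a quick sanity check that prints the observation we got from `env.reset()` above with these labels (the `obs_labels` names below are just for illustration, they are not part of the Gym API):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obs_labels = [\"x\", \"y\", \"h\", \"v\", \"a\", \"w\", \"l\", \"r\"]  # illustrative names only\n",
"for label, value in zip(obs_labels, obs):\n",
"    print(\"{}: {:+.3f}\".format(label, value))"
]
},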
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The action space is discrete, with 4 possible actions:"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {},
"outputs": [],
"source": [
"env.action_space"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the [LunarLander-v2's description](https://gym.openai.com/envs/LunarLander-v2/), these actions are:\n",
"* do nothing\n",
"* fire left orientation engine\n",
"* fire main engine\n",
"* fire right orientation engine"
]
},
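{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of this mapping, we can sample a random action index and print the corresponding name (the `action_names` list below is ours, purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"action_names = [\"do nothing\", \"fire left engine\", \"fire main engine\", \"fire right engine\"]  # for illustration\n",
"random_action = env.action_space.sample()  # an integer in [0, 3]\n",
"print(random_action, \"=>\", action_names[random_action])"
]
},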
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a simple policy network with 4 output neurons (one per possible action):"
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {},
"outputs": [],
"source": [
"keras.backend.clear_session()\n",
"np.random.seed(42)\n",
"tf.random.set_seed(42)\n",
"\n",
"n_inputs = env.observation_space.shape[0]\n",
"n_outputs = env.action_space.n\n",
"\n",
"model = keras.models.Sequential([\n",
" keras.layers.Dense(32, activation=\"relu\", input_shape=[n_inputs]),\n",
" keras.layers.Dense(32, activation=\"relu\"),\n",
" keras.layers.Dense(n_outputs, activation=\"softmax\"),\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we're using the softmax activation function in the output layer, instead of the sigmoid activation function \n",
"like we did for the CartPole-v1 environment. This is because we only had two possible actions for the CartPole-v1 environment, so a binary classification model worked fine. However, since we now how more than two possible actions, we need a multiclass classification model."
]
},
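{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (reusing the `obs` we got from `env.reset()` above), the untrained model should already output 4 action probabilities that sum to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"probas = model(obs[np.newaxis])  # shape (1, 4): one probability per action\n",
"print(probas.numpy().round(3), \"sum =\", probas.numpy().sum())"
]
},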
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's reuse the `play_one_step()` and `play_multiple_episodes()` functions we defined for the CartPole-v1 Policy Gradient code above, but we'll just tweak the `play_one_step()` function to account for the fact that the model is now a multiclass classification model rather than a binary classification model. We'll also tweak the `play_multiple_episodes()` function to call our tweaked `play_one_step()` function rather than the original one, and we add a big penalty if the spaceship does not land (or crash) before a maximum number of steps."
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [],
"source": [
"def lander_play_one_step(env, obs, model, loss_fn):\n",
" with tf.GradientTape() as tape:\n",
" probas = model(obs[np.newaxis])\n",
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
" action = tf.random.categorical(logits, num_samples=1)\n",
" loss = tf.reduce_mean(loss_fn(action, probas))\n",
" grads = tape.gradient(loss, model.trainable_variables)\n",
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
" return obs, reward, done, grads\n",
"\n",
"def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):\n",
" all_rewards = []\n",
" all_grads = []\n",
" for episode in range(n_episodes):\n",
" current_rewards = []\n",
" current_grads = []\n",
" obs = env.reset()\n",
" for step in range(n_max_steps):\n",
" obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)\n",
" current_rewards.append(reward)\n",
" current_grads.append(grads)\n",
" if done:\n",
" break\n",
" all_rewards.append(current_rewards)\n",
" all_grads.append(current_grads)\n",
" return all_rewards, all_grads"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll keep exactly the same `discount_rewards()` and `discount_and_normalize_rewards()` functions as earlier:"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"def discount_rewards(rewards, discount_rate):\n",
" discounted = np.array(rewards)\n",
" for step in range(len(rewards) - 2, -1, -1):\n",
" discounted[step] += discounted[step + 1] * discount_rate\n",
" return discounted\n",
"\n",
"def discount_and_normalize_rewards(all_rewards, discount_rate):\n",
" all_discounted_rewards = [discount_rewards(rewards, discount_rate)\n",
" for rewards in all_rewards]\n",
" flat_rewards = np.concatenate(all_discounted_rewards)\n",
" reward_mean = flat_rewards.mean()\n",
" reward_std = flat_rewards.std()\n",
" return [(discounted_rewards - reward_mean) / reward_std\n",
" for discounted_rewards in all_discounted_rewards]"
]
},
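{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick sanity check, discounting the rewards [10, 0, -50] with a discount rate of 0.8 should give [-22, -40, -50]:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"discount_rewards([10, 0, -50], discount_rate=0.8)  # expected: array([-22, -40, -50])"
]
},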
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define some hyperparameters:"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [],
"source": [
"n_iterations = 200\n",
"n_episodes_per_update = 16\n",
"n_max_steps = 1000\n",
"discount_rate = 0.99"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, since the model is a multiclass classification model, we must use the categorical cross-entropy rather than the binary cross-entropy. Moreover, since the `lander_play_one_step()` function sets the targets as class indices rather than class probabilities, we must use the `sparse_categorical_crossentropy()` loss function:"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {},
"outputs": [],
"source": [
"optimizer = keras.optimizers.Nadam(lr=0.005)\n",
"loss_fn = keras.losses.sparse_categorical_crossentropy"
]
},
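{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration (with made-up probabilities): given the 4 action probabilities output by the model and the sampled action's index as the target, this loss is just the negative log of the probability the model assigned to that action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"example_probas = [[0.1, 0.2, 0.6, 0.1]]  # hypothetical model output for one step\n",
"example_action = [2]  # index of the sampled action\n",
"loss_fn(example_action, example_probas).numpy()  # ~= -log(0.6) ~= 0.51"
]
},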
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're ready to train the model. Let's go!"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [],
"source": [
"env.seed(42)\n",
"\n",
"mean_rewards = []\n",
"\n",
"for iteration in range(n_iterations):\n",
" all_rewards, all_grads = lander_play_multiple_episodes(\n",
" env, n_episodes_per_update, n_max_steps, model, loss_fn)\n",
" mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update\n",
" print(\"\\rIteration: {}/{}, mean reward: {:.1f} \".format(\n",
" iteration + 1, n_iterations, mean_reward), end=\"\")\n",
" mean_rewards.append(mean_reward)\n",
" all_final_rewards = discount_and_normalize_rewards(all_rewards,\n",
" discount_rate)\n",
" all_mean_grads = []\n",
" for var_index in range(len(model.trainable_variables)):\n",
" mean_grads = tf.reduce_mean(\n",
" [final_reward * all_grads[episode_index][step][var_index]\n",
" for episode_index, final_rewards in enumerate(all_final_rewards)\n",
" for step, final_reward in enumerate(final_rewards)], axis=0)\n",
" all_mean_grads.append(mean_grads)\n",
" optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the learning curve:"
]
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"plt.plot(mean_rewards)\n",
"plt.xlabel(\"Episode\")\n",
"plt.ylabel(\"Mean reward\")\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look at the result!"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {},
"outputs": [],
"source": [
"def lander_render_policy_net(model, n_max_steps=500, seed=42):\n",
" frames = []\n",
" env = gym.make(\"LunarLander-v2\")\n",
" env.seed(seed)\n",
" tf.random.set_seed(seed)\n",
" np.random.seed(seed)\n",
" obs = env.reset()\n",
" for step in range(n_max_steps):\n",
" frames.append(env.render(mode=\"rgb_array\"))\n",
" probas = model(obs[np.newaxis])\n",
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
" action = tf.random.categorical(logits, num_samples=1)\n",
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
" if done:\n",
" break\n",
" env.close()\n",
" return frames"
]
},
{
"cell_type": "code",
"execution_count": 264,
"metadata": {},
"outputs": [],
"source": [
"frames = lander_render_policy_net(model, seed=42)\n",
"plot_animation(frames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty good. You can try training it for longer and/or tweaking the hyperparameters to see if you can get it to go over 200."
] ]
}, },
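{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a rough measure of how well the agent is doing, LunarLander-v2 is generally considered solved when the mean total reward reaches 200 over 100 consecutive episodes. The `evaluate_lander()` helper below is just a minimal sketch (it is not part of the book's code): it plays a few episodes, sampling actions from the policy without computing any gradients, and returns the mean total reward:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_lander(model, n_episodes=10, n_max_steps=1000, seed=42):\n",
"    env = gym.make(\"LunarLander-v2\")\n",
"    env.seed(seed)\n",
"    totals = []\n",
"    for episode in range(n_episodes):\n",
"        obs = env.reset()\n",
"        total_reward = 0.0\n",
"        for step in range(n_max_steps):\n",
"            probas = model(obs[np.newaxis])\n",
"            logits = tf.math.log(probas + keras.backend.epsilon())\n",
"            action = tf.random.categorical(logits, num_samples=1)\n",
"            obs, reward, done, info = env.step(action[0, 0].numpy())\n",
"            total_reward += reward\n",
"            if done:\n",
"                break\n",
"        totals.append(total_reward)\n",
"    env.close()\n",
"    return np.mean(totals)\n",
"\n",
"evaluate_lander(model)"
]
},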
{