From cfd0837f5c4985a9dffe755b600e13f70f3edfd6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Sat, 20 Mar 2021 10:04:52 +1300 Subject: [PATCH] Add LunarLander-v2 Policy Gradients exercise solution --- 18_reinforcement_learning.ipynb | 332 +++++++++++++++++++++++++++++++- 1 file changed, 331 insertions(+), 1 deletion(-) diff --git a/18_reinforcement_learning.ipynb b/18_reinforcement_learning.ipynb index 6aba902..a67103b 100644 --- a/18_reinforcement_learning.ipynb +++ b/18_reinforcement_learning.ipynb @@ -2820,7 +2820,337 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "TODO" + "## 8.\n", + "_Exercise: Use policy gradients to solve OpenAI Gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`python3 -m pip install -U gym[box2d]`)._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start by creating a LunarLander-v2 environment:" + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "metadata": {}, + "outputs": [], + "source": [ + "env = gym.make(\"LunarLander-v2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The inputs are 8-dimensional:" + ] + }, + { + "cell_type": "code", + "execution_count": 241, + "metadata": {}, + "outputs": [], + "source": [ + "env.observation_space" + ] + }, + { + "cell_type": "code", + "execution_count": 242, + "metadata": {}, + "outputs": [], + "source": [ + "env.seed(42)\n", + "obs = env.reset()\n", + "obs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py), we can see that the components of each 8D observation (x, y, h, v, a, w, l, r) correspond to:\n", + "* x,y: the coordinates of the spaceship. It starts at a random location near (0, 1.4) and must land near the target at (0, 0).\n", + "* h,v: the horizontal and vertical speed of the spaceship. It starts with a small random speed.\n", + "* a,w: the spaceship's angle and angular velocity.\n", + "* l,r: whether the left or right leg touches the ground (1.0) or not (0.0)."
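+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make this concrete, here is a small, purely illustrative snippet (the labels simply mirror the list above) that prints each component of the observation returned by `env.reset()` next to its name:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "obs_names = [\"x\", \"y\", \"h\", \"v\", \"a\", \"w\", \"l\", \"r\"]  # illustrative labels from the list above\n", + "for name, value in zip(obs_names, obs):\n", + "    print(\"{} = {:.4f}\".format(name, value))"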
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The action space is discrete, with 4 possible actions:" + ] + }, + { + "cell_type": "code", + "execution_count": 243, + "metadata": {}, + "outputs": [], + "source": [ + "env.action_space" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at [LunarLander-v2's description](https://gym.openai.com/envs/LunarLander-v2/), these actions are:\n", + "* do nothing\n", + "* fire left orientation engine\n", + "* fire main engine\n", + "* fire right orientation engine" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's create a simple policy network with 4 output neurons (one per possible action):" + ] + }, + { + "cell_type": "code", + "execution_count": 244, + "metadata": {}, + "outputs": [], + "source": [ + "keras.backend.clear_session()\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "n_inputs = env.observation_space.shape[0]\n", + "n_outputs = env.action_space.n\n", + "\n", + "model = keras.models.Sequential([\n", + "    keras.layers.Dense(32, activation=\"relu\", input_shape=[n_inputs]),\n", + "    keras.layers.Dense(32, activation=\"relu\"),\n", + "    keras.layers.Dense(n_outputs, activation=\"softmax\"),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we're using the softmax activation function in the output layer, instead of the sigmoid activation function we used for the CartPole-v1 environment. This is because we only had two possible actions in CartPole-v1, so a binary classification model worked fine. However, since we now have more than two possible actions, we need a multiclass classification model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's reuse the `play_one_step()` and `play_multiple_episodes()` functions we defined for the CartPole-v1 Policy Gradient code above. We just need to tweak `play_one_step()` to account for the fact that the model is now a multiclass classification model rather than a binary classification model: instead of deriving the action from a single probability, we sample an action index from the model's output distribution using `tf.random.categorical()` (a minimal illustration of this sampling step follows, then the full functions). We'll also tweak `play_multiple_episodes()` so that it calls the tweaked `play_one_step()` function rather than the original one. One optional refinement, not implemented here, would be to add a big penalty whenever the spaceship has neither landed nor crashed within `n_max_steps` steps."
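+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is that sampling step in isolation, using the untrained model and the observation from earlier. This is purely illustrative (it is essentially what the tweaked function does internally), and note that running it consumes a draw from TensorFlow's random generator, so the exact numbers later in the notebook may differ slightly from a run without this cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "probas = model(obs[np.newaxis])                         # predicted probabilities, shape (1, 4)\n", + "logits = tf.math.log(probas + keras.backend.epsilon())  # tf.random.categorical() expects log-probabilities\n", + "action = tf.random.categorical(logits, num_samples=1)   # sample one action index\n", + "action[0, 0].numpy()                                    # an integer between 0 and 3"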
+ ] + }, + { + "cell_type": "code", + "execution_count": 245, + "metadata": {}, + "outputs": [], + "source": [ + "def lander_play_one_step(env, obs, model, loss_fn):\n", + "    with tf.GradientTape() as tape:\n", + "        probas = model(obs[np.newaxis])\n", + "        logits = tf.math.log(probas + keras.backend.epsilon())\n", + "        action = tf.random.categorical(logits, num_samples=1)\n", + "        loss = tf.reduce_mean(loss_fn(action, probas))\n", + "    grads = tape.gradient(loss, model.trainable_variables)\n", + "    obs, reward, done, info = env.step(action[0, 0].numpy())\n", + "    return obs, reward, done, grads\n", + "\n", + "def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):\n", + "    all_rewards = []\n", + "    all_grads = []\n", + "    for episode in range(n_episodes):\n", + "        current_rewards = []\n", + "        current_grads = []\n", + "        obs = env.reset()\n", + "        for step in range(n_max_steps):\n", + "            obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)\n", + "            current_rewards.append(reward)\n", + "            current_grads.append(grads)\n", + "            if done:\n", + "                break\n", + "        all_rewards.append(current_rewards)\n", + "        all_grads.append(current_grads)\n", + "    return all_rewards, all_grads" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll keep exactly the same `discount_rewards()` and `discount_and_normalize_rewards()` functions as earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": 246, + "metadata": {}, + "outputs": [], + "source": [ + "def discount_rewards(rewards, discount_rate):\n", + "    discounted = np.array(rewards)\n", + "    for step in range(len(rewards) - 2, -1, -1):\n", + "        discounted[step] += discounted[step + 1] * discount_rate\n", + "    return discounted\n", + "\n", + "def discount_and_normalize_rewards(all_rewards, discount_rate):\n", + "    all_discounted_rewards = [discount_rewards(rewards, discount_rate)\n", + "                              for rewards in all_rewards]\n", + "    flat_rewards = np.concatenate(all_discounted_rewards)\n", + "    reward_mean = flat_rewards.mean()\n", + "    reward_std = flat_rewards.std()\n", + "    return [(discounted_rewards - reward_mean) / reward_std\n", + "            for discounted_rewards in all_discounted_rewards]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's define some hyperparameters:" + ] + }, + { + "cell_type": "code", + "execution_count": 247, + "metadata": {}, + "outputs": [], + "source": [ + "n_iterations = 200\n", + "n_episodes_per_update = 16\n", + "n_max_steps = 1000\n", + "discount_rate = 0.99" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, since the model is a multiclass classification model, we must use categorical cross-entropy rather than binary cross-entropy. Moreover, since the `lander_play_one_step()` function uses class indices as targets rather than class probabilities, we must use the `sparse_categorical_crossentropy()` loss function:" + ] + }, + { + "cell_type": "code", + "execution_count": 248, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = keras.optimizers.Nadam(learning_rate=0.005)\n", + "loss_fn = keras.losses.sparse_categorical_crossentropy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're ready to train the model. Let's go!"
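+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just before we do, here is a quick, purely illustrative check of the point above: given a target class index k and predicted probabilities p, `sparse_categorical_crossentropy()` simply returns -log(p[k]):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "probas = tf.constant([[0.2, 0.5, 0.2, 0.1]])  # made-up predicted probabilities for the 4 actions\n", + "target = tf.constant([1])                     # made-up target class index\n", + "loss = keras.losses.sparse_categorical_crossentropy(target, probas)\n", + "loss.numpy(), -np.log(0.5)                    # both values should be about 0.693"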
+ ] + }, + { + "cell_type": "code", + "execution_count": 249, + "metadata": {}, + "outputs": [], + "source": [ + "env.seed(42)\n", + "\n", + "mean_rewards = []\n", + "\n", + "for iteration in range(n_iterations):\n", + "    all_rewards, all_grads = lander_play_multiple_episodes(\n", + "        env, n_episodes_per_update, n_max_steps, model, loss_fn)\n", + "    mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update\n", + "    print(\"\\rIteration: {}/{}, mean reward: {:.1f} \".format(\n", + "        iteration + 1, n_iterations, mean_reward), end=\"\")\n", + "    mean_rewards.append(mean_reward)\n", + "    all_final_rewards = discount_and_normalize_rewards(all_rewards,\n", + "                                                       discount_rate)\n", + "    all_mean_grads = []\n", + "    for var_index in range(len(model.trainable_variables)):\n", + "        mean_grads = tf.reduce_mean(\n", + "            [final_reward * all_grads[episode_index][step][var_index]\n", + "             for episode_index, final_rewards in enumerate(all_final_rewards)\n", + "             for step, final_reward in enumerate(final_rewards)], axis=0)\n", + "        all_mean_grads.append(mean_grads)\n", + "    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's look at the learning curve:" + ] + }, + { + "cell_type": "code", + "execution_count": 250, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.plot(mean_rewards)\n", + "plt.xlabel(\"Iteration\")\n", + "plt.ylabel(\"Mean reward\")\n", + "plt.grid()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's look at the result!" + ] + }, + { + "cell_type": "code", + "execution_count": 257, + "metadata": {}, + "outputs": [], + "source": [ + "def lander_render_policy_net(model, n_max_steps=500, seed=42):\n", + "    frames = []\n", + "    env = gym.make(\"LunarLander-v2\")\n", + "    env.seed(seed)\n", + "    tf.random.set_seed(seed)\n", + "    np.random.seed(seed)\n", + "    obs = env.reset()\n", + "    for step in range(n_max_steps):\n", + "        frames.append(env.render(mode=\"rgb_array\"))\n", + "        probas = model(obs[np.newaxis])\n", + "        logits = tf.math.log(probas + keras.backend.epsilon())\n", + "        action = tf.random.categorical(logits, num_samples=1)\n", + "        obs, reward, done, info = env.step(action[0, 0].numpy())\n", + "        if done:\n", + "            break\n", + "    env.close()\n", + "    return frames" + ] + }, + { + "cell_type": "code", + "execution_count": 264, + "metadata": {}, + "outputs": [], + "source": [ + "frames = lander_render_policy_net(model, seed=42)\n", + "plot_animation(frames)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's pretty good. You can try training it for longer and/or tweaking the hyperparameters to see if you can get the mean reward above 200, which is the score at which this environment is considered solved." ] }, {