Add LunarLander-v2 Policy Gradients exercise solution

main
Aurélien Geron 2021-03-20 10:04:52 +13:00
parent 8cf72920b9
commit cfd0837f5c
1 changed files with 331 additions and 1 deletions

@@ -2820,7 +2820,337 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"TODO" "## 8.\n",
"_Exercise: Use policy gradients to solve OpenAI Gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`python3 -m pip install -U gym[box2d]`)._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by creating a LunarLander-v2 environment:"
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make(\"LunarLander-v2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inputs are 8-dimensional:"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [],
"source": [
"env.observation_space"
]
},
{
"cell_type": "code",
"execution_count": 242,
"metadata": {},
"outputs": [],
"source": [
"env.seed(42)\n",
"obs = env.reset()\n",
"obs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py), we can see that these each 8D observation (x, y, h, v, a, w, l, r) correspond to:\n",
"* x,y: the coordinates of the spaceship. It starts at a random location near (0, 1.4) and must land near the target at (0, 0).\n",
"* h,v: the horizontal and vertical speed of the spaceship. It starts with a small random speed.\n",
"* a,w: the spaceship's angle and angular velocity.\n",
"* l,r: whether the left or right leg touches the ground (1.0) or not (0.0)."
]
},
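{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this easier to read, here is a quick sanity check that prints the observation we got from `env.reset()` above with these labels (the `obs_labels` names below are just for illustration, they are not part of the Gym API):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obs_labels = [\"x\", \"y\", \"h\", \"v\", \"a\", \"w\", \"l\", \"r\"]  # illustrative names only\n",
"for label, value in zip(obs_labels, obs):\n",
"    print(\"{}: {:+.3f}\".format(label, value))"
]
},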
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The action space is discrete, with 4 possible actions:"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {},
"outputs": [],
"source": [
"env.action_space"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the [LunarLander-v2's description](https://gym.openai.com/envs/LunarLander-v2/), these actions are:\n",
"* do nothing\n",
"* fire left orientation engine\n",
"* fire main engine\n",
"* fire right orientation engine"
]
},
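{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of this mapping, we can sample a random action index and print the corresponding name (the `action_names` list below is ours, purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"action_names = [\"do nothing\", \"fire left engine\", \"fire main engine\", \"fire right engine\"]  # for illustration\n",
"random_action = env.action_space.sample()  # an integer in [0, 3]\n",
"print(random_action, \"=>\", action_names[random_action])"
]
},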
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a simple policy network with 4 output neurons (one per possible action):"
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {},
"outputs": [],
"source": [
"keras.backend.clear_session()\n",
"np.random.seed(42)\n",
"tf.random.set_seed(42)\n",
"\n",
"n_inputs = env.observation_space.shape[0]\n",
"n_outputs = env.action_space.n\n",
"\n",
"model = keras.models.Sequential([\n",
" keras.layers.Dense(32, activation=\"relu\", input_shape=[n_inputs]),\n",
" keras.layers.Dense(32, activation=\"relu\"),\n",
" keras.layers.Dense(n_outputs, activation=\"softmax\"),\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we're using the softmax activation function in the output layer, instead of the sigmoid activation function \n",
"like we did for the CartPole-v1 environment. This is because we only had two possible actions for the CartPole-v1 environment, so a binary classification model worked fine. However, since we now how more than two possible actions, we need a multiclass classification model."
]
},
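{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (reusing the `obs` we got from `env.reset()` above), the untrained model should already output 4 action probabilities that sum to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"probas = model(obs[np.newaxis])  # shape (1, 4): one probability per action\n",
"print(probas.numpy().round(3), \"sum =\", probas.numpy().sum())"
]
},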
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's reuse the `play_one_step()` and `play_multiple_episodes()` functions we defined for the CartPole-v1 Policy Gradient code above, but we'll just tweak the `play_one_step()` function to account for the fact that the model is now a multiclass classification model rather than a binary classification model. We'll also tweak the `play_multiple_episodes()` function to call our tweaked `play_one_step()` function rather than the original one, and we add a big penalty if the spaceship does not land (or crash) before a maximum number of steps."
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [],
"source": [
"def lander_play_one_step(env, obs, model, loss_fn):\n",
" with tf.GradientTape() as tape:\n",
" probas = model(obs[np.newaxis])\n",
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
" action = tf.random.categorical(logits, num_samples=1)\n",
" loss = tf.reduce_mean(loss_fn(action, probas))\n",
" grads = tape.gradient(loss, model.trainable_variables)\n",
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
" return obs, reward, done, grads\n",
"\n",
"def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):\n",
" all_rewards = []\n",
" all_grads = []\n",
" for episode in range(n_episodes):\n",
" current_rewards = []\n",
" current_grads = []\n",
" obs = env.reset()\n",
" for step in range(n_max_steps):\n",
" obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)\n",
" current_rewards.append(reward)\n",
" current_grads.append(grads)\n",
" if done:\n",
" break\n",
" all_rewards.append(current_rewards)\n",
" all_grads.append(current_grads)\n",
" return all_rewards, all_grads"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll keep exactly the same `discount_rewards()` and `discount_and_normalize_rewards()` functions as earlier:"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"def discount_rewards(rewards, discount_rate):\n",
" discounted = np.array(rewards)\n",
" for step in range(len(rewards) - 2, -1, -1):\n",
" discounted[step] += discounted[step + 1] * discount_rate\n",
" return discounted\n",
"\n",
"def discount_and_normalize_rewards(all_rewards, discount_rate):\n",
" all_discounted_rewards = [discount_rewards(rewards, discount_rate)\n",
" for rewards in all_rewards]\n",
" flat_rewards = np.concatenate(all_discounted_rewards)\n",
" reward_mean = flat_rewards.mean()\n",
" reward_std = flat_rewards.std()\n",
" return [(discounted_rewards - reward_mean) / reward_std\n",
" for discounted_rewards in all_discounted_rewards]"
]
},
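{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick sanity check, discounting the rewards [10, 0, -50] with a discount rate of 0.8 should give [-22, -40, -50]:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"discount_rewards([10, 0, -50], discount_rate=0.8)  # expected: array([-22, -40, -50])"
]
},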
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define some hyperparameters:"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [],
"source": [
"n_iterations = 200\n",
"n_episodes_per_update = 16\n",
"n_max_steps = 1000\n",
"discount_rate = 0.99"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, since the model is a multiclass classification model, we must use the categorical cross-entropy rather than the binary cross-entropy. Moreover, since the `lander_play_one_step()` function sets the targets as class indices rather than class probabilities, we must use the `sparse_categorical_crossentropy()` loss function:"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {},
"outputs": [],
"source": [
"optimizer = keras.optimizers.Nadam(lr=0.005)\n",
"loss_fn = keras.losses.sparse_categorical_crossentropy"
]
},
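{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration (with made-up probabilities): given the 4 action probabilities output by the model and the sampled action's index as the target, this loss is just the negative log of the probability the model assigned to that action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"example_probas = [[0.1, 0.2, 0.6, 0.1]]  # hypothetical model output for one step\n",
"example_action = [2]  # index of the sampled action\n",
"loss_fn(example_action, example_probas).numpy()  # ~= -log(0.6) ~= 0.51"
]
},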
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're ready to train the model. Let's go!"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [],
"source": [
"env.seed(42)\n",
"\n",
"mean_rewards = []\n",
"\n",
"for iteration in range(n_iterations):\n",
" all_rewards, all_grads = lander_play_multiple_episodes(\n",
" env, n_episodes_per_update, n_max_steps, model, loss_fn)\n",
" mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update\n",
" print(\"\\rIteration: {}/{}, mean reward: {:.1f} \".format(\n",
" iteration + 1, n_iterations, mean_reward), end=\"\")\n",
" mean_rewards.append(mean_reward)\n",
" all_final_rewards = discount_and_normalize_rewards(all_rewards,\n",
" discount_rate)\n",
" all_mean_grads = []\n",
" for var_index in range(len(model.trainable_variables)):\n",
" mean_grads = tf.reduce_mean(\n",
" [final_reward * all_grads[episode_index][step][var_index]\n",
" for episode_index, final_rewards in enumerate(all_final_rewards)\n",
" for step, final_reward in enumerate(final_rewards)], axis=0)\n",
" all_mean_grads.append(mean_grads)\n",
" optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the learning curve:"
]
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"plt.plot(mean_rewards)\n",
"plt.xlabel(\"Episode\")\n",
"plt.ylabel(\"Mean reward\")\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look at the result!"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {},
"outputs": [],
"source": [
"def lander_render_policy_net(model, n_max_steps=500, seed=42):\n",
" frames = []\n",
" env = gym.make(\"LunarLander-v2\")\n",
" env.seed(seed)\n",
" tf.random.set_seed(seed)\n",
" np.random.seed(seed)\n",
" obs = env.reset()\n",
" for step in range(n_max_steps):\n",
" frames.append(env.render(mode=\"rgb_array\"))\n",
" probas = model(obs[np.newaxis])\n",
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
" action = tf.random.categorical(logits, num_samples=1)\n",
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
" if done:\n",
" break\n",
" env.close()\n",
" return frames"
]
},
{
"cell_type": "code",
"execution_count": 264,
"metadata": {},
"outputs": [],
"source": [
"frames = lander_render_policy_net(model, seed=42)\n",
"plot_animation(frames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty good. You can try training it for longer and/or tweaking the hyperparameters to see if you can get it to go over 200."
] ]
}, },
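{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a rough measure of how well the agent is doing, LunarLander-v2 is generally considered solved when the mean total reward reaches 200 over 100 consecutive episodes. The `evaluate_lander()` helper below is just a minimal sketch (it is not part of the book's code): it plays a few episodes, sampling actions from the policy without computing any gradients, and returns the mean total reward:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_lander(model, n_episodes=10, n_max_steps=1000, seed=42):\n",
"    env = gym.make(\"LunarLander-v2\")\n",
"    env.seed(seed)\n",
"    totals = []\n",
"    for episode in range(n_episodes):\n",
"        obs = env.reset()\n",
"        total_reward = 0.0\n",
"        for step in range(n_max_steps):\n",
"            probas = model(obs[np.newaxis])\n",
"            logits = tf.math.log(probas + keras.backend.epsilon())\n",
"            action = tf.random.categorical(logits, num_samples=1)\n",
"            obs, reward, done, info = env.step(action[0, 0].numpy())\n",
"            total_reward += reward\n",
"            if done:\n",
"                break\n",
"        totals.append(total_reward)\n",
"    env.close()\n",
"    return np.mean(totals)\n",
"\n",
"evaluate_lander(model)"
]
},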
{