From cfd0837f5c4985a9dffe755b600e13f70f3edfd6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Sat, 20 Mar 2021 10:04:52 +1300 Subject: [PATCH] Add LunarLander-v2 Policy Gradients exercise solution --- 18_reinforcement_learning.ipynb | 332 +++++++++++++++++++++++++++++++- 1 file changed, 331 insertions(+), 1 deletion(-) diff --git a/18_reinforcement_learning.ipynb b/18_reinforcement_learning.ipynb index 6aba902..a67103b 100644 --- a/18_reinforcement_learning.ipynb +++ b/18_reinforcement_learning.ipynb @@ -2820,7 +2820,337 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "TODO" + "## 8.\n", + "_Exercise: Use policy gradients to solve OpenAI Gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`python3 -m pip install -U gym[box2d]`)._" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's start by creating a LunarLander-v2 environment:" + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "metadata": {}, + "outputs": [], + "source": [ + "env = gym.make(\"LunarLander-v2\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The inputs are 8-dimensional:" + ] + }, + { + "cell_type": "code", + "execution_count": 241, + "metadata": {}, + "outputs": [], + "source": [ + "env.observation_space" + ] + }, + { + "cell_type": "code", + "execution_count": 242, + "metadata": {}, + "outputs": [], + "source": [ + "env.seed(42)\n", + "obs = env.reset()\n", + "obs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py), we can see that the components of each 8D observation (x, y, h, v, a, w, l, r) correspond to:\n", + "* x,y: the coordinates of the spaceship. It starts at a random location near (0, 1.4) and must land near the target at (0, 0).\n", + "* h,v: the horizontal and vertical speed of the spaceship. It starts with a small random speed.\n", + "* a,w: the spaceship's angle and angular velocity.\n", + "* l,r: whether the left or right leg touches the ground (1.0) or not (0.0)."
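+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make this concrete, here is a small, purely illustrative snippet (the labels simply mirror the list above) that prints each component of the observation returned by `env.reset()` next to its name:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "obs_names = [\"x\", \"y\", \"h\", \"v\", \"a\", \"w\", \"l\", \"r\"]  # illustrative labels from the list above\n", + "for name, value in zip(obs_names, obs):\n", + "    print(\"{} = {:.4f}\".format(name, value))"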
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The action space is discrete, with 4 possible actions:" + ] + }, + { + "cell_type": "code", + "execution_count": 243, + "metadata": {}, + "outputs": [], + "source": [ + "env.action_space" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at [LunarLander-v2's description](https://gym.openai.com/envs/LunarLander-v2/), these actions are:\n", + "* do nothing\n", + "* fire left orientation engine\n", + "* fire main engine\n", + "* fire right orientation engine" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's create a simple policy network with 4 output neurons (one per possible action):" + ] + }, + { + "cell_type": "code", + "execution_count": 244, + "metadata": {}, + "outputs": [], + "source": [ + "keras.backend.clear_session()\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "n_inputs = env.observation_space.shape[0]\n", + "n_outputs = env.action_space.n\n", + "\n", + "model = keras.models.Sequential([\n", + "    keras.layers.Dense(32, activation=\"relu\", input_shape=[n_inputs]),\n", + "    keras.layers.Dense(32, activation=\"relu\"),\n", + "    keras.layers.Dense(n_outputs, activation=\"softmax\"),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we're using the softmax activation function in the output layer, instead of the sigmoid activation function we used for the CartPole-v1 environment. This is because we only had two possible actions in CartPole-v1, so a binary classification model worked fine. However, since we now have more than two possible actions, we need a multiclass classification model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's reuse the `play_one_step()` and `play_multiple_episodes()` functions we defined for the CartPole-v1 Policy Gradient code above. We just need to tweak `play_one_step()` to account for the fact that the model is now a multiclass classification model rather than a binary classification model: instead of deriving the action from a single probability, we sample an action index from the model's output distribution using `tf.random.categorical()` (a minimal illustration of this sampling step follows, then the full functions). We'll also tweak `play_multiple_episodes()` so that it calls the tweaked `play_one_step()` function rather than the original one. One optional refinement, not implemented here, would be to add a big penalty whenever the spaceship has neither landed nor crashed within `n_max_steps` steps."
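+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is that sampling step in isolation, using the untrained model and the observation from earlier. This is purely illustrative (it is essentially what the tweaked function does internally), and note that running it consumes a draw from TensorFlow's random generator, so the exact numbers later in the notebook may differ slightly from a run without this cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "probas = model(obs[np.newaxis])                         # predicted probabilities, shape (1, 4)\n", + "logits = tf.math.log(probas + keras.backend.epsilon())  # tf.random.categorical() expects log-probabilities\n", + "action = tf.random.categorical(logits, num_samples=1)   # sample one action index\n", + "action[0, 0].numpy()                                    # an integer between 0 and 3"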
+ ] + }, + { + "cell_type": "code", + "execution_count": 245, + "metadata": {}, + "outputs": [], + "source": [ + "def lander_play_one_step(env, obs, model, loss_fn):\n", + "    with tf.GradientTape() as tape:\n", + "        probas = model(obs[np.newaxis])\n", + "        logits = tf.math.log(probas + keras.backend.epsilon())\n", + "        action = tf.random.categorical(logits, num_samples=1)\n", + "        loss = tf.reduce_mean(loss_fn(action, probas))\n", + "    grads = tape.gradient(loss, model.trainable_variables)\n", + "    obs, reward, done, info = env.step(action[0, 0].numpy())\n", + "    return obs, reward, done, grads\n", + "\n", + "def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):\n", + "    all_rewards = []\n", + "    all_grads = []\n", + "    for episode in range(n_episodes):\n", + "        current_rewards = []\n", + "        current_grads = []\n", + "        obs = env.reset()\n", + "        for step in range(n_max_steps):\n", + "            obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)\n", + "            current_rewards.append(reward)\n", + "            current_grads.append(grads)\n", + "            if done:\n", + "                break\n", + "        all_rewards.append(current_rewards)\n", + "        all_grads.append(current_grads)\n", + "    return all_rewards, all_grads" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll keep exactly the same `discount_rewards()` and `discount_and_normalize_rewards()` functions as earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": 246, + "metadata": {}, + "outputs": [], + "source": [ + "def discount_rewards(rewards, discount_rate):\n", + "    discounted = np.array(rewards)\n", + "    for step in range(len(rewards) - 2, -1, -1):\n", + "        discounted[step] += discounted[step + 1] * discount_rate\n", + "    return discounted\n", + "\n", + "def discount_and_normalize_rewards(all_rewards, discount_rate):\n", + "    all_discounted_rewards = [discount_rewards(rewards, discount_rate)\n", + "                              for rewards in all_rewards]\n", + "    flat_rewards = np.concatenate(all_discounted_rewards)\n", + "    reward_mean = flat_rewards.mean()\n", + "    reward_std = flat_rewards.std()\n", + "    return [(discounted_rewards - reward_mean) / reward_std\n", + "            for discounted_rewards in all_discounted_rewards]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's define some hyperparameters:" + ] + }, + { + "cell_type": "code", + "execution_count": 247, + "metadata": {}, + "outputs": [], + "source": [ + "n_iterations = 200\n", + "n_episodes_per_update = 16\n", + "n_max_steps = 1000\n", + "discount_rate = 0.99" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, since the model is a multiclass classification model, we must use categorical cross-entropy rather than binary cross-entropy. Moreover, since the `lander_play_one_step()` function uses class indices as targets rather than class probabilities, we must use the `sparse_categorical_crossentropy()` loss function:" + ] + }, + { + "cell_type": "code", + "execution_count": 248, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = keras.optimizers.Nadam(learning_rate=0.005)\n", + "loss_fn = keras.losses.sparse_categorical_crossentropy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're ready to train the model. Let's go!"
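+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just before we do, here is a quick, purely illustrative check of the point above: given a target class index k and predicted probabilities p, `sparse_categorical_crossentropy()` simply returns -log(p[k]):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "probas = tf.constant([[0.2, 0.5, 0.2, 0.1]])  # made-up predicted probabilities for the 4 actions\n", + "target = tf.constant([1])                     # made-up target class index\n", + "loss = keras.losses.sparse_categorical_crossentropy(target, probas)\n", + "loss.numpy(), -np.log(0.5)                    # both values should be about 0.693"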
+ ] + }, + { + "cell_type": "code", + "execution_count": 249, + "metadata": {}, + "outputs": [], + "source": [ + "env.seed(42)\n", + "\n", + "mean_rewards = []\n", + "\n", + "for iteration in range(n_iterations):\n", + "    all_rewards, all_grads = lander_play_multiple_episodes(\n", + "        env, n_episodes_per_update, n_max_steps, model, loss_fn)\n", + "    mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update\n", + "    print(\"\\rIteration: {}/{}, mean reward: {:.1f} \".format(\n", + "        iteration + 1, n_iterations, mean_reward), end=\"\")\n", + "    mean_rewards.append(mean_reward)\n", + "    all_final_rewards = discount_and_normalize_rewards(all_rewards,\n", + "                                                       discount_rate)\n", + "    all_mean_grads = []\n", + "    for var_index in range(len(model.trainable_variables)):\n", + "        mean_grads = tf.reduce_mean(\n", + "            [final_reward * all_grads[episode_index][step][var_index]\n", + "             for episode_index, final_rewards in enumerate(all_final_rewards)\n", + "             for step, final_reward in enumerate(final_rewards)], axis=0)\n", + "        all_mean_grads.append(mean_grads)\n", + "    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's look at the learning curve:" + ] + }, + { + "cell_type": "code", + "execution_count": 250, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "plt.plot(mean_rewards)\n", + "plt.xlabel(\"Iteration\")\n", + "plt.ylabel(\"Mean reward\")\n", + "plt.grid()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's look at the result!" + ] + }, + { + "cell_type": "code", + "execution_count": 257, + "metadata": {}, + "outputs": [], + "source": [ + "def lander_render_policy_net(model, n_max_steps=500, seed=42):\n", + "    frames = []\n", + "    env = gym.make(\"LunarLander-v2\")\n", + "    env.seed(seed)\n", + "    tf.random.set_seed(seed)\n", + "    np.random.seed(seed)\n", + "    obs = env.reset()\n", + "    for step in range(n_max_steps):\n", + "        frames.append(env.render(mode=\"rgb_array\"))\n", + "        probas = model(obs[np.newaxis])\n", + "        logits = tf.math.log(probas + keras.backend.epsilon())\n", + "        action = tf.random.categorical(logits, num_samples=1)\n", + "        obs, reward, done, info = env.step(action[0, 0].numpy())\n", + "        if done:\n", + "            break\n", + "    env.close()\n", + "    return frames" + ] + }, + { + "cell_type": "code", + "execution_count": 264, + "metadata": {}, + "outputs": [], + "source": [ + "frames = lander_render_policy_net(model, seed=42)\n", + "plot_animation(frames)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's pretty good. You can try training it for longer and/or tweaking the hyperparameters to see if you can get the mean reward above 200, which is the score at which this environment is considered solved." ] }, {