Add LunarLander-v2 Policy Gradients exercise solution
parent
8cf72920b9
commit
cfd0837f5c
|
@ -2820,7 +2820,337 @@
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"TODO"
|
"## 8.\n",
|
||||||
|
"_Exercise: Use policy gradients to solve OpenAI Gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`python3 -m pip install -U gym[box2d]`)._"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Let's start by creating a LunarLander-v2 environment:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 240,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"env = gym.make(\"LunarLander-v2\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"The inputs are 8-dimensional:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 241,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"env.observation_space"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 242,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"env.seed(42)\n",
|
||||||
|
"obs = env.reset()\n",
|
||||||
|
"obs"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"From the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py), we can see that each 8D observation (x, y, h, v, a, w, l, r) corresponds to:\n",
|
||||||
|
"* x,y: the coordinates of the spaceship. It starts at a random location near (0, 1.4) and must land near the target at (0, 0).\n",
|
||||||
|
"* h,v: the horizontal and vertical speed of the spaceship. It starts with a small random speed.\n",
|
||||||
|
"* a,w: the spaceship's angle and angular velocity.\n",
|
||||||
|
"* l,r: whether the left or right leg touches the ground (1.0) or not (0.0)."
|
||||||
|
]
|
||||||
|
},
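To make the layout of an observation concrete, here is a minimal sketch that gives each of the 8 components a name. The `LanderObs` type, the `describe()` helper, and the sample values are hypothetical (they are not part of Gym or of the notebook), and the sample is made up rather than real environment output.

```python
from collections import namedtuple

# Field names follow the (x, y, h, v, a, w, l, r) notation used above.
# LanderObs and describe() are illustrative helpers, not part of Gym.
LanderObs = namedtuple("LanderObs", ["x", "y", "h", "v", "a", "w", "l", "r"])

def describe(obs):
    """Turn a raw 8D observation into a dictionary with readable keys."""
    o = LanderObs(*obs)
    return {
        "position": (o.x, o.y),
        "velocity": (o.h, o.v),
        "angle": o.a,
        "angular_velocity": o.w,
        "left_leg_down": o.l == 1.0,
        "right_leg_down": o.r == 1.0,
    }

# Made-up observation for illustration only (not real environment output).
sample = [0.0, 1.4, 0.1, -0.2, 0.0, 0.0, 0.0, 0.0]
info = describe(sample)
print(info)
```

The same helper should work on the `obs` array returned by `env.reset()`, since unpacking only requires an iterable of 8 values.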
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"The action space is discrete, with 4 possible actions:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 243,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"env.action_space"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"According to the [LunarLander-v2 description](https://gym.openai.com/envs/LunarLander-v2/), these actions are:\n",
|
||||||
|
"* do nothing\n",
|
||||||
|
"* fire left orientation engine\n",
|
||||||
|
"* fire main engine\n",
|
||||||
|
"* fire right orientation engine"
|
||||||
|
]
|
||||||
|
},
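For readability when inspecting sampled actions, the four action indices can be given names. This is only a sketch: the `LanderAction` enum is hypothetical, and it assumes the indices 0 to 3 follow the order of the list above.

```python
from enum import IntEnum

class LanderAction(IntEnum):
    """Hypothetical names for the 4 discrete actions.

    Assumption: indices follow the order in which the actions are listed above.
    """
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3

# An IntEnum member behaves like a plain int, so it can be compared
# directly with the integer index produced by a policy.
sampled_index = 2
print(LanderAction(sampled_index).name)
```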
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Let's create a simple policy network with 4 output neurons (one per possible action):"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 244,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"keras.backend.clear_session()\n",
|
||||||
|
"np.random.seed(42)\n",
|
||||||
|
"tf.random.set_seed(42)\n",
|
||||||
|
"\n",
|
||||||
|
"n_inputs = env.observation_space.shape[0]\n",
|
||||||
|
"n_outputs = env.action_space.n\n",
|
||||||
|
"\n",
|
||||||
|
"model = keras.models.Sequential([\n",
|
||||||
|
" keras.layers.Dense(32, activation=\"relu\", input_shape=[n_inputs]),\n",
|
||||||
|
" keras.layers.Dense(32, activation=\"relu\"),\n",
|
||||||
|
" keras.layers.Dense(n_outputs, activation=\"softmax\"),\n",
|
||||||
|
"])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Note that we're using the softmax activation function in the output layer, instead of the sigmoid activation function \n",
|
||||||
|
"like we did for the CartPole-v1 environment. This is because we only had two possible actions for the CartPole-v1 environment, so a binary classification model worked fine. However, since we now have more than two possible actions, we need a multiclass classification model."
|
||||||
|
]
|
||||||
|
},
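The link between the two activation functions can be checked numerically. The sketch below uses plain Python rather than Keras, and the logit values are arbitrary: with two classes, softmax gives the same probability as a sigmoid applied to the difference between the two logits, while with four classes it produces a full probability distribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two classes: softmax([z, 0])[0] equals sigmoid(z).
two_class = softmax([1.5, 0.0])
p_sigmoid = sigmoid(1.5)

# Four classes (as in LunarLander-v2): one probability per action, summing to 1.
four_class = softmax([2.0, 1.0, 0.5, -1.0])
print(two_class[0], p_sigmoid, sum(four_class))
```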
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Next, let's reuse the `play_one_step()` and `play_multiple_episodes()` functions we defined for the CartPole-v1 Policy Gradient code above, with two tweaks. The `play_one_step()` function must account for the fact that the model is now a multiclass classification model rather than a binary classification model, and the `play_multiple_episodes()` function must call the tweaked `play_one_step()` function rather than the original one. Each episode is also cut off after `n_max_steps` steps if the spaceship has neither landed nor crashed by then."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 245,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def lander_play_one_step(env, obs, model, loss_fn):\n",
|
||||||
|
" with tf.GradientTape() as tape:\n",
|
||||||
|
" probas = model(obs[np.newaxis])\n",
|
||||||
|
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
|
||||||
|
" action = tf.random.categorical(logits, num_samples=1)\n",
|
||||||
|
" loss = tf.reduce_mean(loss_fn(action, probas))\n",
|
||||||
|
" grads = tape.gradient(loss, model.trainable_variables)\n",
|
||||||
|
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
|
||||||
|
" return obs, reward, done, grads\n",
|
||||||
|
"\n",
|
||||||
|
"def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):\n",
|
||||||
|
" all_rewards = []\n",
|
||||||
|
" all_grads = []\n",
|
||||||
|
" for episode in range(n_episodes):\n",
|
||||||
|
" current_rewards = []\n",
|
||||||
|
" current_grads = []\n",
|
||||||
|
" obs = env.reset()\n",
|
||||||
|
" for step in range(n_max_steps):\n",
|
||||||
|
" obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)\n",
|
||||||
|
" current_rewards.append(reward)\n",
|
||||||
|
" current_grads.append(grads)\n",
|
||||||
|
" if done:\n",
|
||||||
|
" break\n",
|
||||||
|
" all_rewards.append(current_rewards)\n",
|
||||||
|
" all_grads.append(current_grads)\n",
|
||||||
|
" return all_rewards, all_grads"
|
||||||
|
]
|
||||||
|
},
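The reason `lander_play_one_step()` takes the log of the probabilities is that `tf.random.categorical()` expects log-probabilities (logits), not probabilities. The sketch below imitates that sampling step in plain Python to show that sampling from `log(probas + epsilon)` recovers the original distribution. The `sample_from_logits()` helper and the probability values are made up for illustration, and `epsilon` is set to `1e-7` on the assumption that this matches the default of `keras.backend.epsilon()`.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Sample one class index with probability proportional to exp(logit)."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

probas = [0.1, 0.2, 0.5, 0.2]   # made-up policy output for one observation
epsilon = 1e-7                  # assumed default of keras.backend.epsilon()
logits = [math.log(p + epsilon) for p in probas]

rng = random.Random(42)
n_samples = 20000
counts = [0] * len(probas)
for _ in range(n_samples):
    counts[sample_from_logits(logits, rng)] += 1

# The empirical frequencies should be close to the original probabilities.
frequencies = [c / n_samples for c in counts]
print(frequencies)
```

The small epsilon only guards against `log(0)` when an action's probability is exactly zero; it barely changes the sampled distribution.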
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"We'll keep exactly the same `discount_rewards()` and `discount_and_normalize_rewards()` functions as earlier:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 246,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def discount_rewards(rewards, discount_rate):\n",
|
||||||
|
" discounted = np.array(rewards)\n",
|
||||||
|
" for step in range(len(rewards) - 2, -1, -1):\n",
|
||||||
|
" discounted[step] += discounted[step + 1] * discount_rate\n",
|
||||||
|
" return discounted\n",
|
||||||
|
"\n",
|
||||||
|
"def discount_and_normalize_rewards(all_rewards, discount_rate):\n",
|
||||||
|
" all_discounted_rewards = [discount_rewards(rewards, discount_rate)\n",
|
||||||
|
" for rewards in all_rewards]\n",
|
||||||
|
" flat_rewards = np.concatenate(all_discounted_rewards)\n",
|
||||||
|
" reward_mean = flat_rewards.mean()\n",
|
||||||
|
" reward_std = flat_rewards.std()\n",
|
||||||
|
" return [(discounted_rewards - reward_mean) / reward_std\n",
|
||||||
|
" for discounted_rewards in all_discounted_rewards]"
|
||||||
|
]
|
||||||
|
},
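As a quick sanity check of the discounting logic, here is a pure-Python sketch of the same two steps on a tiny example. The `discount_rewards_list()` and `normalize()` names are hypothetical stand-ins for the NumPy functions above, and the rewards are made up. Unlike the NumPy version, this sketch converts the rewards to floats up front, since an integer array cannot hold the fractional discounted values.

```python
def discount_rewards_list(rewards, discount_rate):
    """Walk backwards so each step accumulates the discounted future rewards."""
    discounted = [float(r) for r in rewards]
    for step in range(len(discounted) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def normalize(all_discounted):
    """Normalize using the mean and std computed over all episodes together."""
    flat = [r for rewards in all_discounted for r in rewards]
    mean = sum(flat) / len(flat)
    std = (sum((r - mean) ** 2 for r in flat) / len(flat)) ** 0.5
    return [[(r - mean) / std for r in rewards] for rewards in all_discounted]

# With rewards [10, 0, -50] and a discount rate of 0.8:
#   step 2: -50
#   step 1: 0 + 0.8 * (-50) = -40
#   step 0: 10 + 0.8 * (-40) = -22
example = discount_rewards_list([10, 0, -50], 0.8)
print(example)

# Two made-up episodes of different lengths, normalized together.
normalized = normalize([example, discount_rewards_list([10, 20], 0.8)])
print(normalized)
```

After normalization, actions with above-average discounted returns get positive weights and the others get negative weights.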
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Now let's define some hyperparameters:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 247,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"n_iterations = 200\n",
|
||||||
|
"n_episodes_per_update = 16\n",
|
||||||
|
"n_max_steps = 1000\n",
|
||||||
|
"discount_rate = 0.99"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Again, since the model is a multiclass classification model, we must use the categorical cross-entropy rather than the binary cross-entropy. Moreover, since the `lander_play_one_step()` function sets the targets as class indices rather than class probabilities, we must use the `sparse_categorical_crossentropy()` loss function:"
|
||||||
|
]
|
||||||
|
},
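The difference between the two losses is only in how the target is encoded. The following plain-Python sketch (not the Keras implementation, and with made-up probabilities) shows that the sparse variant with a class index gives the same value as the regular categorical cross-entropy with the matching one-hot target: in both cases the loss is the negative log-probability of the chosen action.

```python
import math

def categorical_crossentropy(target_probas, predicted_probas):
    """Cross-entropy with the target given as a probability vector."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probas, predicted_probas) if t > 0)

def sparse_categorical_crossentropy(target_index, predicted_probas):
    """Cross-entropy with the target given as a class index."""
    return -math.log(predicted_probas[target_index])

predicted = [0.1, 0.2, 0.6, 0.1]  # made-up policy output
action = 2                        # sampled action, used as the target class
one_hot = [1.0 if i == action else 0.0 for i in range(len(predicted))]

sparse_loss = sparse_categorical_crossentropy(action, predicted)
dense_loss = categorical_crossentropy(one_hot, predicted)
print(sparse_loss, dense_loss)
```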
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 248,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"optimizer = keras.optimizers.Nadam(learning_rate=0.005)\n",
|
||||||
|
"loss_fn = keras.losses.sparse_categorical_crossentropy"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"We're ready to train the model. Let's go!"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 249,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"env.seed(42)\n",
|
||||||
|
"\n",
|
||||||
|
"mean_rewards = []\n",
|
||||||
|
"\n",
|
||||||
|
"for iteration in range(n_iterations):\n",
|
||||||
|
" all_rewards, all_grads = lander_play_multiple_episodes(\n",
|
||||||
|
" env, n_episodes_per_update, n_max_steps, model, loss_fn)\n",
|
||||||
|
" mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update\n",
|
||||||
|
" print(\"\\rIteration: {}/{}, mean reward: {:.1f} \".format(\n",
|
||||||
|
" iteration + 1, n_iterations, mean_reward), end=\"\")\n",
|
||||||
|
" mean_rewards.append(mean_reward)\n",
|
||||||
|
" all_final_rewards = discount_and_normalize_rewards(all_rewards,\n",
|
||||||
|
" discount_rate)\n",
|
||||||
|
" all_mean_grads = []\n",
|
||||||
|
" for var_index in range(len(model.trainable_variables)):\n",
|
||||||
|
" mean_grads = tf.reduce_mean(\n",
|
||||||
|
" [final_reward * all_grads[episode_index][step][var_index]\n",
|
||||||
|
" for episode_index, final_rewards in enumerate(all_final_rewards)\n",
|
||||||
|
" for step, final_reward in enumerate(final_rewards)], axis=0)\n",
|
||||||
|
" all_mean_grads.append(mean_grads)\n",
|
||||||
|
" optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))"
|
||||||
|
]
|
||||||
|
},
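The core of the update step is a weighted average: each step's gradient is multiplied by that step's normalized discounted reward, and the results are averaged over all steps of all episodes. The toy sketch below reproduces that computation with scalar "gradients" in plain Python. The `weighted_mean_gradient()` helper and all numbers are made up for illustration; the real loop does the same thing once per trainable variable, with tensors instead of floats.

```python
def weighted_mean_gradient(all_grads, all_final_rewards):
    """Average of reward * gradient over every step of every episode."""
    weighted = [final_reward * grad
                for grads, final_rewards in zip(all_grads, all_final_rewards)
                for grad, final_reward in zip(grads, final_rewards)]
    return sum(weighted) / len(weighted)

# Two made-up episodes: one with 2 steps, one with 1 step.
all_grads = [[1.0, -2.0], [0.5]]
all_final_rewards = [[1.0, -1.0], [2.0]]

# Weighted terms: 1.0 * 1.0, (-1.0) * (-2.0), 2.0 * 0.5, so the mean is 4 / 3.
mean_grad = weighted_mean_gradient(all_grads, all_final_rewards)
print(mean_grad)
```

Note how the negative weight flips the sign of the second gradient: gradient descent then makes that action less likely instead of more likely.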
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Let's look at the learning curve:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 250,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"\n",
|
||||||
|
"plt.plot(mean_rewards)\n",
|
||||||
|
"plt.xlabel(\"Iteration\")\n",
|
||||||
|
"plt.ylabel(\"Mean reward\")\n",
|
||||||
|
"plt.grid()\n",
|
||||||
|
"plt.show()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Now let's look at the result!"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 257,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def lander_render_policy_net(model, n_max_steps=500, seed=42):\n",
|
||||||
|
" frames = []\n",
|
||||||
|
" env = gym.make(\"LunarLander-v2\")\n",
|
||||||
|
" env.seed(seed)\n",
|
||||||
|
" tf.random.set_seed(seed)\n",
|
||||||
|
" np.random.seed(seed)\n",
|
||||||
|
" obs = env.reset()\n",
|
||||||
|
" for step in range(n_max_steps):\n",
|
||||||
|
" frames.append(env.render(mode=\"rgb_array\"))\n",
|
||||||
|
" probas = model(obs[np.newaxis])\n",
|
||||||
|
" logits = tf.math.log(probas + keras.backend.epsilon())\n",
|
||||||
|
" action = tf.random.categorical(logits, num_samples=1)\n",
|
||||||
|
" obs, reward, done, info = env.step(action[0, 0].numpy())\n",
|
||||||
|
" if done:\n",
|
||||||
|
" break\n",
|
||||||
|
" env.close()\n",
|
||||||
|
" return frames"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 264,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"frames = lander_render_policy_net(model, seed=42)\n",
|
||||||
|
"plot_animation(frames)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"That's pretty good. You can try training it for longer and/or tweaking the hyperparameters to see if you can get the mean reward to go over 200."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
|