Finish exercise solution for chapter 9, and ensure sync between notebook and book for chapter 2

main
Aurélien Geron 2017-05-28 18:14:49 +02:00
parent ea44f0c794
commit 168ad47702
2 changed files with 873 additions and 238 deletions

File diff suppressed because it is too large


@@ -2810,7 +2810,7 @@
},
"outputs": [],
"source": [
"n_epochs = 500\n",
"n_epochs = 1000\n",
"batch_size = 50\n",
"n_batches = int(np.ceil(m / batch_size))\n",
"\n",
@@ -2936,14 +2936,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, that looks pretty bad, doesn't it? But let's not forget that the Logistic Regression model has a linear decision boundary, so this is actually close to the best we can do with this model (unless we add more features, such as ${x_1}^2$, ${x_2}^2$ and $x_1 x_2$)."
"Well, that looks pretty bad, doesn't it? But let's not forget that the Logistic Regression model has a linear decision boundary, so this is actually close to the best we can do with this model (unless we add more features, as we will show in a second)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's just add all the bells and whistles, as listed in the exercise:\n",
"Now let's start over, but this time we will add all the bells and whistles, as listed in the exercise:\n",
"* Define the graph within a `logistic_regression()` function that can be reused easily.\n",
"* Save checkpoints using a `Saver` at regular intervals during training, and save the final model at the end of training.\n",
"* Restore the last checkpoint upon startup if training was interrupted.\n",
@@ -2956,7 +2956,353 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Coming soon**"
"Before we start, we will add 4 more features to the inputs: ${x_1}^2$, ${x_2}^2$, ${x_1}^3$ and ${x_2}^3$. This was not part of the exercise, but it will demonstrate how adding features can improve the model. We will do this manually, but you could also add them using `sklearn.preprocessing.PolynomialFeatures` (a sketch of that alternative is shown a few cells below)."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_train_enhanced = np.c_[X_train,\n",
" np.square(X_train[:, 1]),\n",
" np.square(X_train[:, 2]),\n",
" X_train[:, 1] ** 3,\n",
" X_train[:, 2] ** 3]\n",
"X_test_enhanced = np.c_[X_test,\n",
" np.square(X_test[:, 1]),\n",
" np.square(X_test[:, 2]),\n",
" X_test[:, 1] ** 3,\n",
" X_test[:, 2] ** 3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what the \"enhanced\" training set looks like:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train_enhanced[:5]"
]
},
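{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (not part of the original solution), here is a sketch of the `sklearn.preprocessing.PolynomialFeatures` alternative mentioned above. It assumes, as in this notebook, that column 0 of `X_train` is the bias term. Note that `PolynomialFeatures` also produces interaction terms such as $x_1 x_2$, so the resulting feature set is slightly richer than the one we just built manually:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"poly = PolynomialFeatures(degree=3, include_bias=False)\n",
"# keep the bias column, and expand the two original features up to degree 3\n",
"X_train_poly = np.c_[X_train[:, :1], poly.fit_transform(X_train[:, 1:3])]\n",
"X_test_poly = np.c_[X_test[:, :1], poly.transform(X_test[:, 1:3])]\n",
"X_train_poly.shape  # 1 bias column + 9 polynomial features"
]
},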
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, next let's reset the default graph:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tf.reset_default_graph()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define the `logistic_regression()` function to create the graph. We will leave out the definition of the inputs `X` and the targets `y`. We could include them here, but leaving them out will make it easier to use this function in a wide range of use cases (e.g. perhaps we will want to add some preprocessing steps for the inputs before we feed them to the Logistic Regression model)."
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def logistic_regression(X, y, initializer=None, seed=42, learning_rate=0.01):\n",
" n_inputs_including_bias = int(X.get_shape()[1])\n",
" with tf.name_scope(\"logistic_regression\"):\n",
" with tf.name_scope(\"model\"):\n",
" if initializer is None:\n",
" initializer = tf.random_uniform([n_inputs_including_bias, 1], -1.0, 1.0, seed=seed)\n",
" theta = tf.Variable(initializer, name=\"theta\")\n",
" logits = tf.matmul(X, theta, name=\"logits\")\n",
" y_proba = tf.sigmoid(logits)\n",
" with tf.name_scope(\"train\"):\n",
" loss = tf.losses.log_loss(y, y_proba, scope=\"loss\")\n",
" optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
" training_op = optimizer.minimize(loss)\n",
" loss_summary = tf.summary.scalar('log_loss', loss)\n",
" with tf.name_scope(\"init\"):\n",
" init = tf.global_variables_initializer()\n",
" with tf.name_scope(\"save\"):\n",
" saver = tf.train.Saver()\n",
" return y_proba, loss, training_op, loss_summary, init, saver"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a little function that generates the name of the log directory where we will save the summaries for TensorBoard:"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"def log_dir(prefix=\"\"):\n",
" now = datetime.utcnow().strftime(\"%Y%m%d%H%M%S\")\n",
" root_logdir = \"tf_logs\"\n",
" if prefix:\n",
" prefix += \"-\"\n",
" name = prefix + \"run-\" + now\n",
" return \"{}/{}/\".format(root_logdir, name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's create the graph using the `logistic_regression()` function. We will also create the `FileWriter` that will save the summaries to the log directory for TensorBoard:"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_inputs = 2 + 4\n",
"logdir = log_dir(\"logreg\")\n",
"\n",
"X = tf.placeholder(tf.float32, shape=(None, n_inputs + 1), name=\"X\")\n",
"y = tf.placeholder(tf.float32, shape=(None, 1), name=\"y\")\n",
"\n",
"y_proba, loss, training_op, loss_summary, init, saver = logistic_regression(X, y)\n",
"\n",
"file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At last we can train the model! We will start by checking whether a previous training session was interrupted, and if so, we will load the checkpoint and continue training from the epoch number we saved. In this example we just save the epoch number to a separate file, but in chapter 11 we will see how to store the training step directly as part of the model, using a non-trainable variable called `global_step` that we pass to the optimizer's `minimize()` method (a minimal sketch of this pattern is shown right after the training cell below).\n",
"\n",
"You can try interrupting training to verify that it does indeed restore the last checkpoint when you start it again."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_epochs = 10001\n",
"batch_size = 50\n",
"n_batches = int(np.ceil(m / batch_size))\n",
"\n",
"checkpoint_path = \"/tmp/my_logreg_model.ckpt\"\n",
"checkpoint_epoch_path = checkpoint_path + \".epoch\"\n",
"final_model_path = \"./my_logreg_model\"\n",
"\n",
"with tf.Session() as sess:\n",
" if os.path.isfile(checkpoint_epoch_path):\n",
" # if the checkpoint file exists, restore the model and load the epoch number\n",
" with open(checkpoint_epoch_path, \"rb\") as f:\n",
" start_epoch = int(f.read())\n",
" print(\"Training was interrupted. Continuing at epoch\", start_epoch)\n",
" saver.restore(sess, checkpoint_path)\n",
" else:\n",
" start_epoch = 0\n",
" sess.run(init)\n",
"\n",
" for epoch in range(start_epoch, n_epochs):\n",
" for batch_index in range(n_batches):\n",
" X_batch, y_batch = random_batch(X_train_enhanced, y_train, batch_size)\n",
" sess.run(training_op, feed_dict={X: X_batch, y: y_batch})\n",
" loss_val, summary_str = sess.run([loss, loss_summary], feed_dict={X: X_test_enhanced, y: y_test})\n",
" file_writer.add_summary(summary_str, epoch)\n",
" if epoch % 500 == 0:\n",
" print(\"Epoch:\", epoch, \"\\tLoss:\", loss_val)\n",
" saver.save(sess, checkpoint_path)\n",
" with open(checkpoint_epoch_path, \"wb\") as f:\n",
" f.write(b\"%d\" % (epoch + 1))\n",
"\n",
" saver.save(sess, final_model_path)\n",
" y_proba_val = y_proba.eval(feed_dict={X: X_test_enhanced, y: y_test})\n",
" os.remove(checkpoint_epoch_path)"
]
},
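{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief aside, here is a minimal sketch (not part of the exercise solution) of the `global_step` pattern mentioned above, built in a separate toy graph so it does not interfere with the model we just trained. Since the step counter is a regular (non-trainable) variable, a `Saver` would checkpoint it along with the model parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"demo_graph = tf.Graph()\n",
"with demo_graph.as_default():\n",
"    theta_demo = tf.Variable(0.0, name=\"theta_demo\")\n",
"    loss_demo = tf.square(theta_demo - 3.0, name=\"loss_demo\")  # toy loss\n",
"    global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n",
"    training_op_demo = tf.train.GradientDescentOptimizer(0.1).minimize(\n",
"        loss_demo, global_step=global_step)  # incremented at each training step\n",
"    init_demo = tf.global_variables_initializer()\n",
"\n",
"with tf.Session(graph=demo_graph) as sess:\n",
"    sess.run(init_demo)\n",
"    for iteration in range(5):\n",
"        sess.run(training_op_demo)\n",
"    print(global_step.eval())  # prints 5"
]
},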
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, we can make predictions by simply classifying as positive all the instances whose estimated probability is greater than or equal to 0.5:"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = (y_proba_val >= 0.5)"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"precision_score(y_test, y_pred)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"recall_score(y_test, y_pred)"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"y_pred_idx = y_pred.reshape(-1) # a 1D array rather than a column vector\n",
"plt.plot(X_test[y_pred_idx, 1], X_test[y_pred_idx, 2], 'go', label=\"Positive\")\n",
"plt.plot(X_test[~y_pred_idx, 1], X_test[~y_pred_idx, 2], 'r^', label=\"Negative\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that's much, much better! Apparently the new features really helped a lot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try starting the TensorBoard server, find the latest run, and look at the learning curve (i.e., how the loss evaluated on the test set evolves as a function of the epoch number):\n",
"\n",
"```\n",
"$ tensorboard --logdir=tf_logs\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can play around with the hyperparameters (e.g. the `batch_size` or the `learning_rate`) and run training again and again, comparing the learning curves. You can even automate this process by implementing grid search or randomized search. Below is a simple implementation of a randomized search on both the batch size and the learning rate. For the sake of simplicity, the checkpoint mechanism was removed."
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from scipy.stats import reciprocal\n",
"\n",
"n_search_iterations = 10\n",
"\n",
"for search_iteration in range(n_search_iterations):\n",
" batch_size = np.random.randint(1, 100)\n",
" learning_rate = reciprocal(0.0001, 0.1).rvs()\n",
"\n",
" n_inputs = 2 + 4\n",
" logdir = log_dir(\"logreg\")\n",
" \n",
" print(\"Iteration\", search_iteration)\n",
" print(\" logdir:\", logdir)\n",
" print(\" batch size:\", batch_size)\n",
" print(\" learning_rate:\", learning_rate)\n",
" print(\" training: \", end=\"\")\n",
"\n",
" tf.reset_default_graph()\n",
"\n",
" X = tf.placeholder(tf.float32, shape=(None, n_inputs + 1), name=\"X\")\n",
" y = tf.placeholder(tf.float32, shape=(None, 1), name=\"y\")\n",
"\n",
" y_proba, loss, training_op, loss_summary, init, saver = logistic_regression(\n",
" X, y, learning_rate=learning_rate)\n",
"\n",
" file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())\n",
"\n",
" n_epochs = 10001\n",
" n_batches = int(np.ceil(m / batch_size))\n",
"\n",
" final_model_path = \"./my_logreg_model_%d\" % search_iteration\n",
"\n",
" with tf.Session() as sess:\n",
" sess.run(init)\n",
"\n",
" for epoch in range(n_epochs):\n",
" for batch_index in range(n_batches):\n",
" X_batch, y_batch = random_batch(X_train_enhanced, y_train, batch_size)\n",
" sess.run(training_op, feed_dict={X: X_batch, y: y_batch})\n",
" loss_val, summary_str = sess.run([loss, loss_summary], feed_dict={X: X_test_enhanced, y: y_test})\n",
" file_writer.add_summary(summary_str, epoch)\n",
" if epoch % 500 == 0:\n",
" print(\".\", end=\"\")\n",
"\n",
" saver.save(sess, final_model_path)\n",
"\n",
" print()\n",
" y_proba_val = y_proba.eval(feed_dict={X: X_test_enhanced, y: y_test})\n",
" y_pred = (y_proba_val >= 0.5)\n",
" \n",
" print(\" precision:\", precision_score(y_test, y_pred))\n",
" print(\" recall:\", recall_score(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `reciprocal()` function from SciPy's `stats` module returns a log-uniform distribution: it is commonly used when you have no idea of the optimal scale of a hyperparameter, because it samples every order of magnitude within the given range with equal probability. See the exercise solutions for chapter 2 for more details, and the short illustration below."
]
},
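{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance (this cell is just an illustration, not part of the original solution), drawing a few samples shows that they are spread fairly evenly across orders of magnitude:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# sample 10 learning rates from the log-uniform distribution over [0.0001, 0.1]\n",
"reciprocal(0.0001, 0.1).rvs(10, random_state=42)"
]
},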
{