Add exercise solutions for chapter 08

2017-06-26 00:09:23 +02:00 · 2017-06-26 00:09:23 +02:00 · b11721e1e5
parent 6947a43d0d
commit b11721e1e5
1 changed files with 533 additions and 24 deletions
--- a/08_dimensionality_reduction.ipynb
+++ b/08_dimensionality_reduction.ipynb
@ -77,7 +77,9 @@
  {
   "cell_type": "code",
   "execution_count": 2,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.random.seed(4)\n",
@ -116,7 +118,9 @@
  {
   "cell_type": "code",
   "execution_count": 4,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "m, n = X.shape\n",
@ -137,7 +141,9 @@
  {
   "cell_type": "code",
   "execution_count": 6,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "W2 = V.T[:, :2]\n",
@ -172,7 +178,9 @@
  {
   "cell_type": "code",
   "execution_count": 8,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
@ -277,7 +285,9 @@
  {
   "cell_type": "code",
   "execution_count": 15,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X3D_inv_using_svd = X2D_using_svd.dot(V[:2, :])"
@ -440,7 +450,9 @@
  {
   "cell_type": "code",
   "execution_count": 23,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "axes = [-1.8, 1.8, -1.3, 1.3, -1.0, 1.0]\n",
@ -786,7 +798,9 @@
  {
   "cell_type": "code",
   "execution_count": 33,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca = PCA()\n",
@ -807,7 +821,9 @@
  {
   "cell_type": "code",
   "execution_count": 35,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca = PCA(n_components=0.95)\n",
@ -835,7 +851,9 @@
  {
   "cell_type": "code",
   "execution_count": 38,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca = PCA(n_components = 154)\n",
@ -1004,7 +1022,9 @@
  {
   "cell_type": "code",
   "execution_count": 48,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "filename = \"my_mnist.data\"\n",
@ -1024,7 +1044,9 @@
  {
   "cell_type": "code",
   "execution_count": 49,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "del X_mm"
@ -1053,7 +1075,9 @@
  {
   "cell_type": "code",
   "execution_count": 51,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rnd_pca = PCA(n_components=154, svd_solver=\"randomized\", random_state=42)\n",
@ -1313,7 +1337,9 @@
  {
   "cell_type": "code",
   "execution_count": 62,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rbf_pca = KernelPCA(n_components = 2, kernel=\"rbf\", gamma=0.0433,\n",
@ -1354,7 +1380,9 @@
  {
   "cell_type": "code",
   "execution_count": 65,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.manifold import LocallyLinearEmbedding\n",
@ -1390,7 +1418,9 @@
  {
   "cell_type": "code",
   "execution_count": 67,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.manifold import MDS\n",
@ -1402,7 +1432,9 @@
  {
   "cell_type": "code",
   "execution_count": 68,
-   "metadata": {},
+   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.manifold import Isomap\n",
@ -1741,7 +1773,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Nice! Reducing dimensionality led to a ×4 speedup. :)  Let's the model's accuracy:"
+    "Nice! Reducing dimensionality led to a 4× speedup. :)  Let's the model's accuracy:"
   ]
  },
  {
@ -1758,7 +1790,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "A very slight drop in performance, which might be a reasonable price to pay for a ×4 speedup, depending on the application."
+    "A very slight drop in performance, which might be a reasonable price to pay for a 4× speedup, depending on the application."
   ]
  },
  {
@ -1779,7 +1811,484 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Coming soon."
+    "*Exercise: Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by loading the MNIST dataset (again):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_mldata\n",
    "\n",
    "mnist = fetch_mldata('MNIST original')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dimensionality reduction on the full 60,000 images takes a very long time, so let's only do this on a random subset of 10,000 images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(42)\n",
    "\n",
    "m = 10000\n",
    "idx = np.random.permutation(60000)[:m]\n",
    "\n",
    "X = mnist['data'][idx]\n",
    "y = mnist['target'][idx]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's use t-SNE to reduce dimensionality down to 2D so we can plot the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "\n",
    "tsne = TSNE(n_components=2, random_state=42)\n",
    "X_reduced = tsne.fit_transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's use Matplotlib's `scatter()` function to plot a scatterplot, using a different color for each digit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(13,10))\n",
    "plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=\"jet\")\n",
    "plt.axis('off')\n",
    "plt.colorbar()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Isn't this just beautiful? :) This plot tells us which numbers are easily distinguishable from the others (e.g., 0s, 6s, and most 8s are rather well separated clusters), and it also tells us which numbers are often hard to distinguish (e.g., 4s and 9s, 5s and 3s, and so on)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's focus on digits 3 and 5, which seem to overlap a lot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(9,9))\n",
    "cmap = matplotlib.cm.get_cmap(\"jet\")\n",
    "for digit in (2, 3, 5):\n",
    "    plt.scatter(X_reduced[y == digit, 0], X_reduced[y == digit, 1], c=cmap(digit / 9))\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see if we can produce a nicer image by running t-SNE on these 3 digits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "idx = (y == 2) | (y == 3) | (y == 5) \n",
    "X_subset = X[idx]\n",
    "y_subset = y[idx]\n",
    "\n",
    "tsne_subset = TSNE(n_components=2, random_state=42)\n",
    "X_subset_reduced = tsne_subset.fit_transform(X_subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(9,9))\n",
    "for digit in (2, 3, 5):\n",
    "    plt.scatter(X_subset_reduced[y_subset == digit, 0], X_subset_reduced[y_subset == digit, 1], c=cmap(digit / 9))\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better, now the clusters have far less overlap. But some 3s are all over the place. Plus, there are two distinct clusters of 2s, and also two distinct clusters of 5s. It would be nice if we could visualize a few digits from each cluster, to understand why this is the case. Let's do that now. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "*Exercise: Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples ([plot_lle_digits](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from matplotlib.offsetbox import AnnotationBbox, OffsetImage\n",
    "\n",
    "def plot_digits(X, y, min_distance=0.05, images=None, figsize=(13, 10)):\n",
    "    # Let's scale the input features so that they range from 0 to 1\n",
    "    X_normalized = MinMaxScaler().fit_transform(X)\n",
    "    # Now we create the list of coordinates of the digits plotted so far.\n",
    "    # We pretend that one is already plotted far away at the start, to\n",
    "    # avoid `if` statements in the loop below\n",
    "    neighbors = np.array([[10., 10.]])\n",
    "    # The rest should be self-explanatory\n",
    "    plt.figure(figsize=figsize)\n",
    "    cmap = matplotlib.cm.get_cmap(\"jet\")\n",
    "    digits = np.unique(y)\n",
    "    for digit in digits:\n",
    "        plt.scatter(X_normalized[y == digit, 0], X_normalized[y == digit, 1], c=cmap(digit / 9))\n",
    "    plt.axis(\"off\")\n",
    "    ax = plt.gcf().gca()  # get current axes in current figure\n",
    "    for index, image_coord in enumerate(X_normalized):\n",
    "        closest_distance = np.linalg.norm(np.array(neighbors) - image_coord, axis=1).min()\n",
    "        if closest_distance > min_distance:\n",
    "            neighbors = np.r_[neighbors, [image_coord]]\n",
    "            if images is None:\n",
    "                plt.text(image_coord[0], image_coord[1], str(int(y[index])),\n",
    "                         color=cmap(y[index] / 9), fontdict={\"weight\": \"bold\", \"size\": 16})\n",
    "            else:\n",
    "                image = images[index].reshape(28, 28)\n",
    "                imagebox = AnnotationBbox(OffsetImage(image, cmap=\"binary\"), image_coord)\n",
    "                ax.add_artist(imagebox)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try it! First let's just write colored digits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_digits(X_reduced, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well that's okay, but not that beautiful. Let's try with the digit images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_digits(X_reduced, y, images=X, figsize=(35, 25))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_digits(X_subset_reduced, y_subset, images=X_subset, figsize=(22, 22))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Exercise: Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with PCA. We will also time how long it takes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "import time\n",
    "\n",
    "t0 = time.time()\n",
    "X_pca_reduced = PCA(n_components=2, random_state=42).fit_transform(X)\n",
    "t1 = time.time()\n",
    "print(\"PCA took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_pca_reduced, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wow, PCA is blazingly fast! But although we do see a few clusters, there's way too much overlap. Let's try LLE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import LocallyLinearEmbedding\n",
    "\n",
    "t0 = time.time()\n",
    "X_lle_reduced = LocallyLinearEmbedding(n_components=2, random_state=42).fit_transform(X)\n",
    "t1 = time.time()\n",
    "print(\"LLE took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_lle_reduced, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That took a while, and the result does not look too good. Let's see what happens if we apply PCA first, preserving 95% of the variance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "pca_lle = Pipeline([\n",
    "    (\"pca\", PCA(n_components=0.95, random_state=42)),\n",
    "    (\"lle\", LocallyLinearEmbedding(n_components=2, random_state=42)),\n",
    "])\n",
    "t0 = time.time()\n",
    "X_pca_lle_reduced = pca_lle.fit_transform(X)\n",
    "t1 = time.time()\n",
    "print(\"PCA+LLE took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_pca_lle_reduced, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is more or less the same, but this time it was almost 4× faster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try MDS. It's much too long if we run it on 10,000 instances, so let's just try 2,000 for now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import MDS\n",
    "\n",
    "m = 2000\n",
    "t0 = time.time()\n",
    "X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X[:m])\n",
    "t1 = time.time()\n",
    "print(\"MDS took {:.1f}s (on just 2,000 MNIST images instead of 10,000).\".format(t1 - t0))\n",
    "plot_digits(X_mds_reduced, y[:m])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Meh. This does not look great, all clusters overlap too much. Let's try with PCA first, perhaps it will be faster?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "pca_mds = Pipeline([\n",
    "    (\"pca\", PCA(n_components=0.95, random_state=42)),\n",
    "    (\"mds\", MDS(n_components=2, random_state=42)),\n",
    "])\n",
    "t0 = time.time()\n",
    "X_pca_mds_reduced = pca_mds.fit_transform(X[:2000])\n",
    "t1 = time.time()\n",
    "print(\"PCA+MDS took {:.1f}s (on 2,000 MNIST images).\".format(t1 - t0))\n",
    "plot_digits(X_pca_mds_reduced, y[:2000])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Same result, and no speedup: PCA did not help (or hurt)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try LDA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
    "\n",
    "t0 = time.time()\n",
    "X_lda_reduced = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)\n",
    "t1 = time.time()\n",
    "print(\"LDA took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_lda_reduced, y, figsize=(12,12))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This one is very fast, and it looks nice at first, until you realize that several clusters overlap severely."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, it's pretty clear that t-SNE won this little competition, wouldn't you agree? We did not time it, so let's do that now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "\n",
    "t0 = time.time()\n",
    "X_tsne_reduced = TSNE(n_components=2, random_state=42).fit_transform(X)\n",
    "t1 = time.time()\n",
    "print(\"t-SNE took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_tsne_reduced, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's twice slower than LLE, but still much faster than MDS, and the result looks great. Let's see if a bit of PCA can speed it up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca_tsne = Pipeline([\n",
    "    (\"pca\", PCA(n_components=0.95, random_state=42)),\n",
    "    (\"tsne\", TSNE(n_components=2, random_state=42)),\n",
    "])\n",
    "t0 = time.time()\n",
    "X_pca_tsne_reduced = pca_tsne.fit_transform(X)\n",
    "t1 = time.time()\n",
    "print(\"PCA+t-SNE took {:.1f}s.\".format(t1 - t0))\n",
    "plot_digits(X_pca_tsne_reduced, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes, PCA roughly gave us a 25% speedup, without damaging the result. We have a winner!"
   ]
  },
  {
@ -1794,21 +2303,21 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 2",
   "language": "python",
-   "name": "python3"
+   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
-    "version": 3
+    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
+   "pygments_lexer": "ipython2",
-   "version": "3.5.3"
+   "version": "2.7.12"
  },
  "nav_menu": {
   "height": "352px",