{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chapter 7 Dimensionality Reduction**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_This notebook contains all the sample code and solutions to the exercises in chapter 7._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/ageron/handson-ml2/blob/master/08_dimensionality_reduction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml2/blob/master/08_dimensionality_reduction.ipynb\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" /></a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This project requires Python 3.8 or above:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"assert sys.version_info >= (3, 8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It also requires Scikit-Learn ≥ 1.0.1:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"\n",
"assert sklearn.__version__ >= \"1.0.1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we did in previous chapters, let's define the default font sizes to make the figures prettier:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib as mpl\n",
"\n",
"mpl.rc('font', size=12)\n",
"mpl.rc('axes', labelsize=14, titlesize=14)\n",
"mpl.rc('legend', fontsize=14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's create the `images/dim_reduction` folder (if it doesn't already exist), and define the `save_fig()` function which is used through this notebook to save the figures in high-res for the book:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"IMAGES_PATH = Path() / \"images\" / \"dim_reduction\"\n",
"IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n",
"\n",
"def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
" path = IMAGES_PATH / f\"{fig_id}.{fig_extension}\"\n",
" if tight_layout:\n",
" plt.tight_layout()\n",
" plt.savefig(path, format=fig_extension, dpi=resolution)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter starts with several figures to explain the concepts of PCA and Manifold Learning. Below is the code to generate these figures. You can skip directly to the [Principal Components](#Principal-Components) section below if you want."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a small 3D dataset:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# not in the book we've generated plenty of datasets before with similar code\n",
"\n",
"import numpy as np\n",
"\n",
"np.random.seed(42)\n",
"m = 60\n",
"angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi\n",
"X = np.zeros((m, 3))\n",
"X[:, 0] = np.cos(angles)\n",
"X[:, 1] = np.sin(angles) * 0.5\n",
"X += 0.28 * np.random.randn(m, 3)\n",
"X = rotate_3d(X, -np.pi / 4, np.pi / 30, -np.pi / 20)\n",
"X += [0.2, 0, 0.2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the 3D dataset, with the projection plane."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 72\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=2)\n",
"X2D = pca.fit_transform(X) # dataset reduced to 2D\n",
"X3D_inv = pca.inverse_transform(X2D) # 3D position of the projected samples\n",
"X_centered = X - X.mean(axis=0)\n",
"U, s, Vt = np.linalg.svd(X_centered)\n",
"\n",
"axes = [-1.4, 1.4, -1.4, 1.4, -1.1, 1.1]\n",
"x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 10),\n",
" np.linspace(axes[2], axes[3], 10))\n",
"w1, w2 = np.linalg.solve(Vt[:2, :2], Vt[:2, 2]) # projection plane coefs\n",
"z = w1 * (x1 - pca.mean_[0]) + w2 * (x2 - pca.mean_[1]) - pca.mean_[2] # plane\n",
"X3D_above = X[X[:, 2] >= X3D_inv[:, 2]] # samples above plane\n",
"X3D_below = X[X[:, 2] < X3D_inv[:, 2]] # samples below plane\n",
"\n",
"fig = plt.figure(figsize=(9, 9))\n",
"ax = fig.add_subplot(111, projection=\"3d\")\n",
"\n",
"# plot samples and projection lines below plane first\n",
"ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], \"ro\", alpha=0.3)\n",
"for i in range(m):\n",
" if X[i, 2] < X3D_inv[i, 2]:\n",
" ax.plot([X[i][0], X3D_inv[i][0]],\n",
" [X[i][1], X3D_inv[i][1]],\n",
" [X[i][2], X3D_inv[i][2]], \":\", color=\"#F88\")\n",
"\n",
"ax.plot_surface(x1, x2, z, alpha=0.1, color=\"b\") # projection plane\n",
"ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b+\") # projected samples\n",
"ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b.\")\n",
"\n",
"# now plot projection lines and samples above plane\n",
"for i in range(m):\n",
" if X[i, 2] >= X3D_inv[i, 2]:\n",
" ax.plot([X[i][0], X3D_inv[i][0]],\n",
" [X[i][1], X3D_inv[i][1]],\n",
" [X[i][2], X3D_inv[i][2]], \"r--\")\n",
"\n",
"ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], \"ro\")\n",
"\n",
"def set_xyz_axes(ax, axes):\n",
" ax.xaxis.set_rotate_label(False)\n",
" ax.yaxis.set_rotate_label(False)\n",
" ax.zaxis.set_rotate_label(False)\n",
" ax.set_xlabel(\"$x_1$\", labelpad=8, rotation=0)\n",
" ax.set_ylabel(\"$x_2$\", labelpad=8, rotation=0)\n",
" ax.set_zlabel(\"$x_3$\", labelpad=8, rotation=0)\n",
" ax.set_xlim(axes[0:2])\n",
" ax.set_ylim(axes[2:4])\n",
" ax.set_zlim(axes[4:6])\n",
"\n",
"set_xyz_axes(ax, axes)\n",
"ax.set_zticks([-1, -0.5, 0, 0.5, 1])\n",
"\n",
"save_fig(\"dataset_3d_plot\", tight_layout=False)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 73\n",
"\n",
"fig = plt.figure()\n",
"ax = fig.add_subplot(1, 1, 1, aspect='equal')\n",
"ax.plot(X2D[:, 0], X2D[:, 1], \"b+\")\n",
"ax.plot(X2D[:, 0], X2D[:, 1], \"b.\")\n",
"ax.plot([0], [0], \"bo\")\n",
"ax.arrow(0, 0, 1, 0, head_width=0.05, length_includes_head=True,\n",
" head_length=0.1, fc='b', ec='b', linewidth=4)\n",
"ax.arrow(0, 0, 0, 1, head_width=0.05, length_includes_head=True,\n",
" head_length=0.1, fc='b', ec='b', linewidth=1)\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_yticks([-0.5, 0, 0.5, 1])\n",
"ax.set_ylabel(\"$z_2$\", rotation=0)\n",
"ax.set_axisbelow(True)\n",
"ax.grid(True)\n",
"save_fig(\"dataset_2d_plot\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_swiss_roll\n",
"\n",
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 74\n",
"\n",
"from matplotlib.colors import ListedColormap\n",
"\n",
"darker_hot = ListedColormap(plt.cm.hot(np.linspace(0, 0.8, 256)))\n",
"\n",
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"ax.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap=darker_hot)\n",
"ax.view_init(10, -70)\n",
"set_xyz_axes(ax, axes)\n",
"save_fig(\"swiss_roll_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves plots for Figure 75\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"\n",
"plt.subplot(121)\n",
"plt.scatter(X_swiss[:, 0], X_swiss[:, 1], c=t, cmap=darker_hot)\n",
"plt.axis(axes[:4])\n",
"plt.xlabel(\"$x_1$\")\n",
"plt.ylabel(\"$x_2$\", labelpad=10, rotation=0)\n",
"plt.grid(True)\n",
"\n",
"plt.subplot(122)\n",
"plt.scatter(t, X_swiss[:, 1], c=t, cmap=darker_hot)\n",
"plt.axis([4, 14.8, axes[2], axes[3]])\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.grid(True)\n",
"\n",
"save_fig(\"squished_swiss_roll_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves plots for Figure 76\n",
" \n",
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"x2s = np.linspace(axes[2], axes[3], 10)\n",
"x3s = np.linspace(axes[4], axes[5], 10)\n",
"x2, x3 = np.meshgrid(x2s, x3s)\n",
"\n",
"positive_class = X_swiss[:, 0] > 5\n",
"X_pos = X_swiss[positive_class]\n",
"X_neg = X_swiss[~positive_class]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = plt.subplot(1, 1, 1, projection='3d')\n",
"ax.view_init(10, -70)\n",
"ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], \"y^\")\n",
"ax.plot_wireframe(5, x2, x3, alpha=0.5)\n",
"ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], \"gs\")\n",
"set_xyz_axes(ax, axes)\n",
"save_fig(\"manifold_decision_boundary_plot1\")\n",
"plt.show()\n",
"\n",
"fig = plt.figure(figsize=(5, 4))\n",
"ax = plt.subplot(1, 1, 1)\n",
"ax.plot(t[positive_class], X_swiss[positive_class, 1], \"gs\")\n",
"ax.plot(t[~positive_class], X_swiss[~positive_class, 1], \"y^\")\n",
"ax.axis([4, 15, axes[2], axes[3]])\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_ylabel(\"$z_2$\", rotation=0, labelpad=8)\n",
"ax.grid(True)\n",
"save_fig(\"manifold_decision_boundary_plot2\")\n",
"plt.show()\n",
"\n",
"positive_class = 2 * (t[:] - 4) > X_swiss[:, 1]\n",
"X_pos = X_swiss[positive_class]\n",
"X_neg = X_swiss[~positive_class]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = plt.subplot(1, 1, 1, projection='3d')\n",
"ax.view_init(10, -70)\n",
"ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], \"y^\")\n",
"ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], \"gs\")\n",
"ax.xaxis.set_rotate_label(False)\n",
"ax.yaxis.set_rotate_label(False)\n",
"ax.zaxis.set_rotate_label(False)\n",
"ax.set_xlabel(\"$x_1$\", rotation=0)\n",
"ax.set_ylabel(\"$x_2$\", rotation=0)\n",
"ax.set_zlabel(\"$x_3$\", rotation=0)\n",
"ax.set_xlim(axes[0:2])\n",
"ax.set_ylim(axes[2:4])\n",
"ax.set_zlim(axes[4:6])\n",
"save_fig(\"manifold_decision_boundary_plot3\")\n",
"plt.show()\n",
"\n",
"fig = plt.figure(figsize=(5, 4))\n",
"ax = plt.subplot(1, 1, 1)\n",
"ax.plot(t[positive_class], X_swiss[positive_class, 1], \"gs\")\n",
"ax.plot(t[~positive_class], X_swiss[~positive_class, 1], \"y^\")\n",
"ax.plot([4, 15], [0, 22], \"b-\", linewidth=2)\n",
"ax.axis([4, 15, axes[2], axes[3]])\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_ylabel(\"$z_2$\", rotation=0, labelpad=8)\n",
"ax.grid(True)\n",
"save_fig(\"manifold_decision_boundary_plot4\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 77\n",
"\n",
"angle = np.pi / 5\n",
"stretch = 5\n",
"m = 200\n",
"\n",
"np.random.seed(3)\n",
"X_line = np.random.randn(m, 2) / 10\n",
"X_line = X_line @ np.array([[stretch, 0], [0, 1]]) # stretch\n",
"X_line = X_line @ [[np.cos(angle), np.sin(angle)],\n",
" [np.sin(angle), np.cos(angle)]] # rotate\n",
"\n",
"u1 = np.array([np.cos(angle), np.sin(angle)])\n",
"u2 = np.array([np.cos(angle - 2 * np.pi / 6), np.sin(angle - 2 * np.pi / 6)])\n",
"u3 = np.array([np.cos(angle - np.pi / 2), np.sin(angle - np.pi / 2)])\n",
"\n",
"X_proj1 = X_line @ u1.reshape(-1, 1)\n",
"X_proj2 = X_line @ u2.reshape(-1, 1)\n",
"X_proj3 = X_line @ u3.reshape(-1, 1)\n",
"\n",
"plt.figure(figsize=(8, 4))\n",
"plt.subplot2grid((3, 2), (0, 0), rowspan=3)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u1[1] / u1[0], 1.4 * u1[1] / u1[0]], \"k-\",\n",
" linewidth=2)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u2[1] / u2[0], 1.4 * u2[1] / u2[0]], \"k--\",\n",
" linewidth=2)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u3[1] / u3[0], 1.4 * u3[1] / u3[0]], \"k:\",\n",
" linewidth=2)\n",
"plt.plot(X_line[:, 0], X_line[:, 1], \"ro\", alpha=0.5)\n",
"plt.arrow(0, 0, u1[0], u1[1], head_width=0.1, linewidth=4, alpha=0.9,\n",
" length_includes_head=True, head_length=0.1, fc=\"b\", ec=\"b\", zorder=10)\n",
"plt.arrow(0, 0, u3[0], u3[1], head_width=0.1, linewidth=1, alpha=0.9,\n",
" length_includes_head=True, head_length=0.1, fc=\"b\", ec=\"b\", zorder=10)\n",
"plt.text(u1[0] + 0.1, u1[1] - 0.05, r\"$\\mathbf{c_1}$\",\n",
" color=\"blue\", fontsize=14)\n",
"plt.text(u3[0] + 0.1, u3[1], r\"$\\mathbf{c_2}$\",\n",
" color=\"blue\", fontsize=14)\n",
"plt.xlabel(\"$x_1$\")\n",
"plt.ylabel(\"$x_2$\", rotation=0)\n",
"plt.axis([-1.4, 1.4, -1.4, 1.4])\n",
"plt.grid()\n",
"\n",
"plt.subplot2grid((3, 2), (0, 1))\n",
"plt.plot([-2, 2], [0, 0], \"k-\", linewidth=2)\n",
"plt.plot(X_proj1[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.gca().get_xaxis().set_ticklabels([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.grid()\n",
"\n",
"plt.subplot2grid((3, 2), (1, 1))\n",
"plt.plot([-2, 2], [0, 0], \"k--\", linewidth=2)\n",
"plt.plot(X_proj2[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.gca().get_xaxis().set_ticklabels([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.grid()\n",
"\n",
"plt.subplot2grid((3, 2), (2, 1))\n",
"plt.plot([-2, 2], [0, 0], \"k:\", linewidth=2)\n",
"plt.plot(X_proj3[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.grid()\n",
"\n",
"save_fig(\"pca_best_projection_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Principal Components"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# X = [...] # the small 3D dataset was created ealier in this notebook\n",
"X_centered = X - X.mean(axis=0)\n",
"U, s, Vt = np.linalg.svd(X_centered)\n",
"c1 = Vt[0]\n",
"c2 = Vt[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: in principle, the SVD factorization algorithm returns three matrices, **U**, **Σ** and **V**, such that **X** = **UΣV**<sup>⊺</sup>, where **U** is an _m_ × _m_ matrix, **Σ** is an _m_ × _n_ matrix, and **V** is an _n_ × _n_ matrix. But the `svd()` function returns **U**, **s** and **V**<sup>⊺</sup> instead. **s** is the vector containing all the values on the main diagonal of the top _n_ rows of **Σ**. Since **Σ** is full of zeros elsewhere, your can easily reconstruct it from **s**, like this:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code shows how to construct Σ from s\n",
"m, n = X.shape\n",
"Σ = np.zeros_like(X_centered)\n",
"Σ[:n, :n] = np.diag(s)\n",
"assert np.allclose(X_centered, U @ Σ @ Vt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Projecting Down to d Dimensions"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"W2 = Vt[:2].T\n",
"X2D = X_centered @ W2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Scikit-Learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With Scikit-Learn, PCA is really trivial. It even takes care of mean centering for you:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=2)\n",
"X2D = pca.fit_transform(X)"
]
},
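{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not in the book: as a quick sanity check, the result should match the manual projection we computed earlier with SVD, up to the sign of each axis (principal component directions are only defined up to sign):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# not in the book – sanity check: signs may flip, so compare absolute values\n",
"np.allclose(np.abs(X2D), np.abs(X_centered @ Vt[:2].T))"
]
},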
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"pca.components_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recover the 3D points projected on the plane (PCA 2D subspace)."
]
},
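{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X3D_inv = pca.inverse_transform(X2D) # 3D points on the projection plane"
]
},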
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explained Variance Ratio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look at the explained variance ratio:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"pca.explained_variance_ratio_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first dimension explains about 76% of the variance, while the second explains about 15%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By projecting down to 2D, we lost about 9% of the variance:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"1 - pca.explained_variance_ratio_.sum() # not in the book"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Choosing the Right Number of Dimensions"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_openml\n",
"\n",
"mnist = fetch_openml('mnist_784', as_frame=False)\n",
"X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]\n",
"X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]\n",
"\n",
"pca = PCA()\n",
"pca.fit(X_train)\n",
"cumsum = np.cumsum(pca.explained_variance_ratio_)\n",
"d = np.argmax(cumsum >= 0.95) + 1 # d equals 154"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"d"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(n_components=0.95)\n",
"X_reduced = pca.fit_transform(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"pca.n_components_"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"pca.explained_variance_ratio_.sum() # not in the book"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 88. Explained variance as a function of the number of dimensions:**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(6,4))\n",
"plt.plot(cumsum, linewidth=3)\n",
"plt.axis([0, 400, 0, 1])\n",
"plt.xlabel(\"Dimensions\")\n",
"plt.ylabel(\"Explained Variance\")\n",
"plt.plot([d, d], [0, 0.95], \"k:\")\n",
"plt.plot([0, d], [0.95, 0.95], \"k:\")\n",
"plt.plot(d, 0.95, \"ko\")\n",
"plt.annotate(\"Elbow\", xy=(65, 0.85), xytext=(70, 0.7),\n",
" arrowprops=dict(arrowstyle=\"->\"))\n",
"plt.grid(True)\n",
"save_fig(\"explained_variance_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"clf = make_pipeline(PCA(random_state=42),\n",
" RandomForestClassifier(random_state=42))\n",
"param_distrib = {\n",
" \"pca__n_components\": np.arange(10, 80),\n",
" \"randomforestclassifier__n_estimators\": np.arange(50, 500)\n",
"}\n",
"rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3,\n",
" random_state=42)\n",
"rnd_search.fit(X_train[:1000], y_train[:1000])"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"print(rnd_search.best_params_)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"clf = make_pipeline(PCA(random_state=42), SGDClassifier())\n",
"param_grid = {\"pca__n_components\": np.arange(10, 80)}\n",
"grid_search = GridSearchCV(clf, param_grid, cv=3)\n",
"grid_search.fit(X_train[:1000], y_train[:1000])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PCA for Compression"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(0.95)\n",
"X_reduced = pca.fit_transform(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"X_recovered = pca.inverse_transform(X_reduced)"
]
},
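{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not in the book: a quick sketch of how to measure the reconstruction error, i.e., the mean squared distance between each original instance and its reconstruction:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# not in the book – mean squared distance between the original instances\n",
"# and their reconstructions\n",
"np.mean(np.sum((X_train - X_recovered) ** 2, axis=1))"
]
},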
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 79\n",
"\n",
"plt.figure(figsize=(7, 4))\n",
"for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):\n",
" plt.subplot(1, 2, idx + 1)\n",
" plt.title([\"Original\", \"Compressed\"][idx])\n",
" for row in range(5):\n",
" for col in range(5):\n",
" plt.imshow(X[row * 5 + col].reshape(28, 28), cmap=\"binary\",\n",
" vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))\n",
" plt.axis([0, 5, 0, 5])\n",
" plt.axis(\"off\")\n",
"\n",
"save_fig(\"mnist_compression_plot\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Randomized PCA"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"rnd_pca = PCA(n_components=154, svd_solver=\"randomized\", random_state=42)\n",
"X_reduced = rnd_pca.fit_transform(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Incremental PCA"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import IncrementalPCA\n",
"\n",
"n_batches = 100\n",
"inc_pca = IncrementalPCA(n_components=154)\n",
"for X_batch in np.array_split(X_train, n_batches):\n",
" inc_pca.partial_fit(X_batch)\n",
"\n",
"X_reduced = inc_pca.transform(X_train)"
]
},
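{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not in the book: as a sanity check, the incremental solution should be very close to regular PCA's. For example, the mean it computed incrementally should match the training set mean (the components themselves may differ in sign between solvers):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# not in the book – the incrementally computed mean should match the\n",
"# training set mean\n",
"np.allclose(inc_pca.mean_, X_train.mean(axis=0))"
]
},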
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using `memmap()`:**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the `memmap()` structure, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"filename = \"my_mnist.mmap\"\n",
"X_mmap = np.memmap(filename, dtype='float32', mode='write', shape=X_train.shape)\n",
"X_mmap[:] = X_train # could be a loop instead, saving the data chunk by chunk\n",
"X_mmap.flush()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, another program would load the data and use it for training:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"X_mmap = np.memmap(filename, dtype=\"float32\", mode=\"readonly\").reshape(-1, 784)\n",
"batch_size = X_mmap.shape[0] // n_batches\n",
"inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)\n",
"inc_pca.fit(X_mmap)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Projection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: this sections will use close to 2.5 GB of RAM. If your computer runs out of memory, just reduce _m_ and _n_:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.random_projection import johnson_lindenstrauss_min_dim\n",
"\n",
"m, ε = 5_000, 0.1\n",
"d = johnson_lindenstrauss_min_dim(m, eps=ε)\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"# not in the book show the equation computed by johnson_lindenstrauss_min_dim\n",
"d = int(4 * np.log(m) / (ε ** 2 / 2 - ε ** 3 / 3))\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"n = 20_000\n",
"np.random.seed(42)\n",
"P = np.random.randn(d, n) / np.sqrt(d) # std dev = square root of variance\n",
"\n",
"X = np.random.randn(m, n) # generate a fake dataset\n",
"X_reduced = X @ P.T"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.random_projection import GaussianRandomProjection\n",
"\n",
"gaussian_rnd_proj = GaussianRandomProjection(eps=ε, random_state=42)\n",
"X_reduced = gaussian_rnd_proj.fit_transform(X) # same result as above"
]
},
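{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not in the book: since we used the same random seed, the transformer's random matrix should match the `P` we built manually (this check assumes the same seed yields the same Gaussian draws):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# not in the book – assumes random_state=42 reproduces the manual matrix P\n",
"np.allclose(gaussian_rnd_proj.components_, P)"
]
},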
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell may take several minutes to run:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"components_pinv = np.linalg.pinv(gaussian_rnd_proj.components_)\n",
"X_recovered = X_reduced @ components_pinv.T"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# not in the book performance comparison between Gaussian and Sparse RP\n",
"\n",
"from sklearn.random_projection import SparseRandomProjection\n",
"\n",
"print(\"GaussianRandomProjection fit\")\n",
"%timeit GaussianRandomProjection(random_state=42).fit(X)\n",
"print(\"SparseRandomProjection fit\")\n",
"%timeit SparseRandomProjection(random_state=42).fit(X)\n",
"\n",
"gaussian_rnd_proj = GaussianRandomProjection(random_state=42).fit(X)\n",
"sparse_rnd_proj = SparseRandomProjection(random_state=42).fit(X)\n",
"print(\"GaussianRandomProjection transform\")\n",
"%timeit gaussian_rnd_proj.transform(X)\n",
"print(\"SparseRandomProjection transform\")\n",
"%timeit sparse_rnd_proj.transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# LLE"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_swiss_roll\n",
"from sklearn.manifold import LocallyLinearEmbedding\n",
"\n",
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)\n",
"lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)\n",
"X_unrolled = lle.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 710\n",
"\n",
"plt.title(\"Unrolled swiss roll using LLE\")\n",
"plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1],\n",
" c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$z_2$\", rotation=0)\n",
"plt.axis([-0.055, 0.060, -0.070, 0.090])\n",
"plt.grid(True)\n",
"\n",
"save_fig(\"lle_unrolling_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# not in the book shows how well correlated z1 is to t: LLE worked fine\n",
"plt.title(\"$z_1$ vs $t$\")\n",
"plt.scatter(X_unrolled[:, 0], t, c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$t$\", rotation=0)\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import MDS\n",
"\n",
"mds = MDS(n_components=2, random_state=42)\n",
"X_reduced_mds = mds.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import Isomap\n",
"\n",
"isomap = Isomap(n_components=2)\n",
"X_reduced_isomap = isomap.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import TSNE\n",
"\n",
"tsne = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
"X_reduced_tsne = tsne.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 711\n",
"\n",
"titles = [\"MDS\", \"Isomap\", \"t-SNE\"]\n",
"\n",
"plt.figure(figsize=(11,4))\n",
"\n",
"for subplot, title, X_reduced in zip((131, 132, 133), titles,\n",
" (X_reduced_mds, X_reduced_isomap, X_reduced_tsne)):\n",
" plt.subplot(subplot)\n",
" plt.title(title)\n",
" plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=darker_hot)\n",
" plt.xlabel(\"$z_1$\")\n",
" if subplot == 131:\n",
" plt.ylabel(\"$z_2$\", rotation=0)\n",
" plt.grid(True)\n",
"\n",
"save_fig(\"other_dim_reduction_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extra Material Kernel PCA"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import KernelPCA\n",
"\n",
"rbf_pca = KernelPCA(n_components=2, kernel=\"rbf\", gamma=0.04, random_state=42)\n",
"X_reduced = rbf_pca.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"lin_pca = KernelPCA(kernel=\"linear\")\n",
"rbf_pca = KernelPCA(kernel=\"rbf\", gamma=0.002)\n",
"sig_pca = KernelPCA(kernel=\"sigmoid\", gamma=0.002, coef0=1)\n",
"\n",
"kernel_pcas = ((lin_pca, \"Linear kernel\"),\n",
" (rbf_pca, rf\"RBF kernel, $\\gamma={rbf_pca.gamma}$\"),\n",
" (sig_pca, rf\"Sigmoid kernel, $\\gamma={sig_pca.gamma}, r={sig_pca.coef0}$\"))\n",
"\n",
"plt.figure(figsize=(11, 3.5))\n",
"for idx, (kpca, title) in enumerate(kernel_pcas):\n",
" kpca.n_components = 2\n",
" kpca.random_state = 42\n",
" X_reduced = kpca.fit_transform(X_swiss)\n",
"\n",
" plt.subplot(1, 3, idx + 1)\n",
" plt.title(title)\n",
" plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=darker_hot)\n",
" plt.xlabel(\"$z_1$\")\n",
" if idx == 0:\n",
" plt.ylabel(\"$z_2$\", rotation=0)\n",
" plt.grid()\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise solutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. to 8."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See appendix A."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing)._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MNIST dataset was loaded earlier."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"X_train = mnist.data[:60000]\n",
"y_train = mnist.target[:60000]\n",
"\n",
"X_test = mnist.data[60000:]\n",
"y_test = mnist.target[60000:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set._"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"%time rnd_clf.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"y_pred = rnd_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%._"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=0.95)\n",
"X_train_reduced = pca.fit_transform(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?_"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"rnd_clf_with_pca = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"%time rnd_clf_with_pca.fit(X_train_reduced, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh no! Training is actually about twice slower now! How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training time: it depends on the dataset, the model and the training algorithm. See figure 8-6 (the `manifold_decision_boundary_plot*` plots above). If you try `SGDClassifier` instead of `RandomForestClassifier`, you will find that training time is reduced by a factor of 3 when using PCA. Actually, we will do this in a second, but first let's check the precision of the new random forest classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Next evaluate the classifier on the test set: how does it compare to the previous classifier?_"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"X_test_reduced = pca.transform(X_test)\n",
"\n",
"y_pred = rnd_clf_with_pca.predict(X_test_reduced)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is common for performance to drop slightly when reducing dimensionality, because we do lose some potentially useful signal in the process. However, the performance drop is rather severe in this case. So PCA really did not help: it slowed down training *and* reduced performance. 😭"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Try again with an `SGDClassifier`. How much does PCA help now?_"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"sgd_clf = SGDClassifier(random_state=42)\n",
"%time sgd_clf.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"y_pred = sgd_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, so the `SGDClassifier` takes much longer to train on this dataset than the `RandomForestClassifier`, plus it performs worse on the test set. But that's not what we are interested in right now, we want to see how much PCA can help `SGDClassifier`. Let's train it using the reduced dataset:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"sgd_clf_with_pca = SGDClassifier(random_state=42)\n",
"%time sgd_clf_with_pca.fit(X_train_reduced, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice! Reducing dimensionality led to roughly 5× speedup. :) Let's check the model's accuracy:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"y_pred = sgd_clf_with_pca.predict(X_test_reduced)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! PCA not only gave us a 5× speed boost, it also improved performance slightly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So there you have it: PCA can give you a formidable speedup, and if you're lucky a performance boost... but it's really not guaranteed: it depends on the model and the dataset!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Use t-SNE to reduce the first 5,000 images of the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's limit ourselves to the first 5,000 images of the MNIST training set, to speed things up a lot."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"X_sample, y_sample = X_train[:5000], y_train[:5000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use t-SNE to reduce dimensionality down to 2D so we can plot the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import TSNE\n",
"\n",
"tsne = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
"%time X_reduced = tsne.fit_transform(X_sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use Matplotlib's `scatter()` function to plot a scatterplot, using a different color for each digit:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(13, 10))\n",
"plt.scatter(X_reduced[:, 0], X_reduced[:, 1],\n",
" c=y_sample.astype(np.int8), cmap=\"jet\", alpha=0.5)\n",
"plt.axis('off')\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Isn't this just beautiful? :) Most digits are nicely separated from the others, even though t-SNE wasn't given the targets: it just identified clusters of similar images. But there is still a bit of overlap. For example, the 3s and the 5s overlap a lot (on the right side of the plot), and so do the 4s and the 9s (in the top-right corner)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's focus on just the digits 4 and 9:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(9, 9))\n",
"cmap = plt.cm.jet\n",
"for digit in ('4', '9'):\n",
" plt.scatter(X_reduced[y_sample == digit, 0], X_reduced[y_sample == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
"plt.axis('off')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can produce a nicer image by running t-SNE on just these 2 digits:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"idx = (y_sample == '4') | (y_sample == '9')\n",
"X_subset = X_sample[idx]\n",
"y_subset = y_sample[idx]\n",
"\n",
"tsne_subset = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
"X_subset_reduced = tsne_subset.fit_transform(X_subset)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(9, 9))\n",
"for digit in ('4', '9'):\n",
" plt.scatter(X_subset_reduced[y_subset == digit, 0],\n",
" X_subset_reduced[y_subset == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
"plt.axis('off')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's much better, although there's still a bit of overlap. Perhaps some 4s really do look like 9s, and vice versa. It would be nice if we could visualize a few digits from each region of this plot, to understand what's going on. In fact, let's do that now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples ([plot_lle_digits](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"from matplotlib.offsetbox import AnnotationBbox, OffsetImage\n",
"\n",
"def plot_digits(X, y, min_distance=0.04, images=None, figsize=(13, 10)):\n",
" # Let's scale the input features so that they range from 0 to 1\n",
" X_normalized = MinMaxScaler().fit_transform(X)\n",
" # Now we create the list of coordinates of the digits plotted so far.\n",
" # We pretend that one is already plotted far away at the start, to\n",
" # avoid `if` statements in the loop below\n",
" neighbors = np.array([[10., 10.]])\n",
" # The rest should be self-explanatory\n",
" plt.figure(figsize=figsize)\n",
" cmap = plt.cm.jet\n",
" digits = np.unique(y)\n",
" for digit in digits:\n",
" plt.scatter(X_normalized[y == digit, 0], X_normalized[y == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
" plt.axis(\"off\")\n",
" ax = plt.gca() # get current axes\n",
" for index, image_coord in enumerate(X_normalized):\n",
" closest_distance = np.linalg.norm(neighbors - image_coord, axis=1).min()\n",
" if closest_distance > min_distance:\n",
" neighbors = np.r_[neighbors, [image_coord]]\n",
" if images is None:\n",
" plt.text(image_coord[0], image_coord[1], str(int(y[index])),\n",
" color=cmap(float(y[index]) / 9),\n",
" fontdict={\"weight\": \"bold\", \"size\": 16})\n",
" else:\n",
" image = images[index].reshape(28, 28)\n",
" imagebox = AnnotationBbox(OffsetImage(image, cmap=\"binary\"),\n",
" image_coord)\n",
" ax.add_artist(imagebox)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it! First let's show colored digits (not images), for all 5,000 images:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_reduced, y_sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well that's okay, but not that beautiful. Let's try with the digit images:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_reduced, y_sample, images=X_sample, figsize=(35, 25))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's nicer! Now let's focus on just the 3s and the 5s:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_subset_reduced, y_subset, images=X_subset, figsize=(22, 22))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how similar-looking 4s are grouped together. For example, the 4s get more and more inclined as they approach the top of the figure. The inclined 9s are also closer to the top. Some 4s really do look like 9s, and vice versa."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with PCA. We will also time how long it takes:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(n_components=2, random_state=42)\n",
"%time X_pca_reduced = pca.fit_transform(X_sample)\n",
"plot_digits(X_pca_reduced, y_sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, PCA is blazingly fast! But although we do see a few clusters, there's way too much overlap. Let's try LLE:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"lle = LocallyLinearEmbedding(n_components=2, random_state=42)\n",
"%time X_lle_reduced = lle.fit_transform(X_sample)\n",
"plot_digits(X_lle_reduced, y_sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That took more time, and yet the result does not look good at all. Let's see what happens if we apply PCA first, preserving 95% of the variance:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"pca_lle = make_pipeline(PCA(n_components=0.95),\n",
" LocallyLinearEmbedding(n_components=2, random_state=42))\n",
"\n",
"%time X_pca_lle_reduced = pca_lle.fit_transform(X_sample)\n",
"plot_digits(X_pca_lle_reduced, y_sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is more or less as bad, but this time training was a bit faster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try MDS:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"%time X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X_sample)\n",
"plot_digits(X_mds_reduced, y_sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Meh. This does not look great, all clusters overlap too much. Let's try with PCA first, perhaps it will be faster?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"pca_mds = make_pipeline(PCA(n_components=0.95, random_state=42),\n",
" MDS(n_components=2, random_state=42))\n",
"\n",
"%time X_pca_mds_reduced = pca_mds.fit_transform(X_sample)\n",
"plot_digits(X_pca_mds_reduced, y_sample)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Same result, and not faster: PCA did not help in this case."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try LDA now:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
"\n",
"lda = LinearDiscriminantAnalysis(n_components=2)\n",
"%time X_lda_reduced = lda.fit_transform(X_sample, y_sample)\n",
"plot_digits(X_lda_reduced, y_sample, figsize=(12,12))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This one is very fast, and it looks nice at first, until you realize that several clusters overlap severely."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, it's pretty clear that t-SNE won this little competition, wouldn't you agree?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that's all for today, I hope you enjoyed this chapter!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}