handson-ml/08_dimensionality_reduction...

1884 lines
57 KiB
Plaintext
Raw Normal View History

2019-01-15 05:36:29 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chapter 8 Dimensionality Reduction**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_This notebook contains all the sample code and solutions to the exercises in chapter 8._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/ageron/handson-ml3/blob/main/08_dimensionality_reduction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
" </td>\n",
2021-05-25 21:31:19 +02:00
" <td>\n",
" <a target=\"_blank\" href=\"https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml3/blob/main/08_dimensionality_reduction.ipynb\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" /></a>\n",
2021-05-25 21:31:19 +02:00
" </td>\n",
"</table>"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
2019-01-15 05:36:29 +01:00
"source": [
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This project requires Python 3.8 or above:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
2019-01-15 05:36:29 +01:00
"\n",
"assert sys.version_info >= (3, 8)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It also requires Scikit-Learn ≥ 1.0.1:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 2,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
2019-01-15 05:36:29 +01:00
"\n",
"assert sklearn.__version__ >= \"1.0.1\""
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we did in previous chapters, let's define the default font sizes to make the figures prettier:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 3,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.rc('font', size=14)\n",
"plt.rc('axes', labelsize=14, titlesize=14)\n",
"plt.rc('legend', fontsize=14)\n",
"plt.rc('xtick',labelsize=10)\n",
"plt.rc('ytick',labelsize=10)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's create the `images/dim_reduction` folder (if it doesn't already exist), and define the `save_fig()` function which is used through this notebook to save the figures in high-res for the book:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 4,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
2019-01-15 05:36:29 +01:00
"\n",
"IMAGES_PATH = Path() / \"images\" / \"dim_reduction\"\n",
"IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n",
"\n",
"def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
" path = IMAGES_PATH / f\"{fig_id}.{fig_extension}\"\n",
" if tight_layout:\n",
" plt.tight_layout()\n",
" plt.savefig(path, format=fig_extension, dpi=resolution)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter starts with several figures to explain the concepts of PCA and Manifold Learning. Below is the code to generate these figures. You can skip directly to the [Principal Components](#Principal-Components) section below if you want."
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate a small 3D dataset. It's an oval shape, rotated in 3D space, with points distributed unevenly, and with quite a lot of noise:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 5,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book\n",
"\n",
"import numpy as np\n",
"from scipy.spatial.transform import Rotation\n",
"\n",
"m = 60\n",
"X = np.zeros((m, 3)) # initialize 3D dataset\n",
"np.random.seed(42)\n",
"angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi # uneven distribution\n",
"X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles) * 0.5 # oval\n",
"X += 0.28 * np.random.randn(m, 3) # add more noise\n",
"X = Rotation.from_rotvec([np.pi / 29, -np.pi / 20, np.pi / 4]).apply(X)\n",
"X += [0.2, 0, 0.2] # shift a bit"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the 3D dataset, with the projection plane."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
2019-01-15 05:36:29 +01:00
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 82\n",
2019-01-15 05:36:29 +01:00
"\n",
"import matplotlib.pyplot as plt\n",
2019-01-15 05:36:29 +01:00
"from mpl_toolkits.mplot3d import Axes3D\n",
"from sklearn.decomposition import PCA\n",
2019-01-15 05:36:29 +01:00
"\n",
"pca = PCA(n_components=2)\n",
"X2D = pca.fit_transform(X) # dataset reduced to 2D\n",
"X3D_inv = pca.inverse_transform(X2D) # 3D position of the projected samples\n",
"X_centered = X - X.mean(axis=0)\n",
"U, s, Vt = np.linalg.svd(X_centered)\n",
2019-01-15 05:36:29 +01:00
"\n",
"axes = [-1.4, 1.4, -1.4, 1.4, -1.1, 1.1]\n",
"x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 10),\n",
" np.linspace(axes[2], axes[3], 10))\n",
"w1, w2 = np.linalg.solve(Vt[:2, :2], Vt[:2, 2]) # projection plane coefs\n",
"z = w1 * (x1 - pca.mean_[0]) + w2 * (x2 - pca.mean_[1]) - pca.mean_[2] # plane\n",
"X3D_above = X[X[:, 2] >= X3D_inv[:, 2]] # samples above plane\n",
"X3D_below = X[X[:, 2] < X3D_inv[:, 2]] # samples below plane\n",
2019-01-15 05:36:29 +01:00
"\n",
"fig = plt.figure(figsize=(9, 9))\n",
"ax = fig.add_subplot(111, projection=\"3d\")\n",
2019-01-15 05:36:29 +01:00
"\n",
"# plot samples and projection lines below plane first\n",
"ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], \"ro\", alpha=0.3)\n",
2019-01-15 05:36:29 +01:00
"for i in range(m):\n",
" if X[i, 2] < X3D_inv[i, 2]:\n",
" ax.plot([X[i][0], X3D_inv[i][0]],\n",
" [X[i][1], X3D_inv[i][1]],\n",
" [X[i][2], X3D_inv[i][2]], \":\", color=\"#F88\")\n",
2019-01-15 05:36:29 +01:00
"\n",
"ax.plot_surface(x1, x2, z, alpha=0.1, color=\"b\") # projection plane\n",
"ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b+\") # projected samples\n",
"ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], \"b.\")\n",
2019-01-15 05:36:29 +01:00
"\n",
"# now plot projection lines and samples above plane\n",
"for i in range(m):\n",
" if X[i, 2] >= X3D_inv[i, 2]:\n",
" ax.plot([X[i][0], X3D_inv[i][0]],\n",
" [X[i][1], X3D_inv[i][1]],\n",
" [X[i][2], X3D_inv[i][2]], \"r--\")\n",
"\n",
"ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], \"ro\")\n",
"\n",
"def set_xyz_axes(ax, axes):\n",
" ax.xaxis.set_rotate_label(False)\n",
" ax.yaxis.set_rotate_label(False)\n",
" ax.zaxis.set_rotate_label(False)\n",
" ax.set_xlabel(\"$x_1$\", labelpad=8, rotation=0)\n",
" ax.set_ylabel(\"$x_2$\", labelpad=8, rotation=0)\n",
" ax.set_zlabel(\"$x_3$\", labelpad=8, rotation=0)\n",
" ax.set_xlim(axes[0:2])\n",
" ax.set_ylim(axes[2:4])\n",
" ax.set_zlim(axes[4:6])\n",
"\n",
"set_xyz_axes(ax, axes)\n",
"ax.set_zticks([-1, -0.5, 0, 0.5, 1])\n",
"\n",
"save_fig(\"dataset_3d_plot\", tight_layout=False)\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 83\n",
"\n",
2019-01-15 05:36:29 +01:00
"fig = plt.figure()\n",
"ax = fig.add_subplot(1, 1, 1, aspect='equal')\n",
"ax.plot(X2D[:, 0], X2D[:, 1], \"b+\")\n",
"ax.plot(X2D[:, 0], X2D[:, 1], \"b.\")\n",
"ax.plot([0], [0], \"bo\")\n",
"ax.arrow(0, 0, 1, 0, head_width=0.05, length_includes_head=True,\n",
" head_length=0.1, fc='b', ec='b', linewidth=4)\n",
"ax.arrow(0, 0, 0, 1, head_width=0.05, length_includes_head=True,\n",
" head_length=0.1, fc='b', ec='b', linewidth=1)\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_yticks([-0.5, 0, 0.5, 1])\n",
"ax.set_ylabel(\"$z_2$\", rotation=0)\n",
"ax.set_axisbelow(True)\n",
2019-01-15 05:36:29 +01:00
"ax.grid(True)\n",
"save_fig(\"dataset_2d_plot\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_swiss_roll\n",
"\n",
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 9,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 84\n",
"\n",
"from matplotlib.colors import ListedColormap\n",
"\n",
"darker_hot = ListedColormap(plt.cm.hot(np.linspace(0, 0.8, 256)))\n",
"\n",
2019-01-15 05:36:29 +01:00
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"ax.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2], c=t, cmap=darker_hot)\n",
2019-01-15 05:36:29 +01:00
"ax.view_init(10, -70)\n",
"set_xyz_axes(ax, axes)\n",
2019-01-15 05:36:29 +01:00
"save_fig(\"swiss_roll_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 10,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves plots for Figure 85\n",
2021-11-21 06:04:07 +01:00
"\n",
"plt.figure(figsize=(10, 4))\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.subplot(121)\n",
"plt.scatter(X_swiss[:, 0], X_swiss[:, 1], c=t, cmap=darker_hot)\n",
2019-01-15 05:36:29 +01:00
"plt.axis(axes[:4])\n",
"plt.xlabel(\"$x_1$\")\n",
"plt.ylabel(\"$x_2$\", labelpad=10, rotation=0)\n",
2019-01-15 05:36:29 +01:00
"plt.grid(True)\n",
"\n",
"plt.subplot(122)\n",
"plt.scatter(t, X_swiss[:, 1], c=t, cmap=darker_hot)\n",
"plt.axis([4, 14.8, axes[2], axes[3]])\n",
"plt.xlabel(\"$z_1$\")\n",
2019-01-15 05:36:29 +01:00
"plt.grid(True)\n",
"\n",
"save_fig(\"squished_swiss_roll_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 11,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves plots for Figure 86\n",
2021-11-21 06:04:07 +01:00
" \n",
2019-01-15 05:36:29 +01:00
"axes = [-11.5, 14, -2, 23, -12, 15]\n",
"x2s = np.linspace(axes[2], axes[3], 10)\n",
"x3s = np.linspace(axes[4], axes[5], 10)\n",
"x2, x3 = np.meshgrid(x2s, x3s)\n",
"\n",
"positive_class = X_swiss[:, 0] > 5\n",
"X_pos = X_swiss[positive_class]\n",
"X_neg = X_swiss[~positive_class]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = plt.subplot(1, 1, 1, projection='3d')\n",
2019-01-15 05:36:29 +01:00
"ax.view_init(10, -70)\n",
"ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], \"y^\")\n",
"ax.plot_wireframe(5, x2, x3, alpha=0.5)\n",
"ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], \"gs\")\n",
"set_xyz_axes(ax, axes)\n",
2019-01-15 05:36:29 +01:00
"save_fig(\"manifold_decision_boundary_plot1\")\n",
"plt.show()\n",
"\n",
"fig = plt.figure(figsize=(5, 4))\n",
"ax = plt.subplot(1, 1, 1)\n",
"ax.plot(t[positive_class], X_swiss[positive_class, 1], \"gs\")\n",
"ax.plot(t[~positive_class], X_swiss[~positive_class, 1], \"y^\")\n",
"ax.axis([4, 15, axes[2], axes[3]])\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_ylabel(\"$z_2$\", rotation=0, labelpad=8)\n",
"ax.grid(True)\n",
2019-01-15 05:36:29 +01:00
"save_fig(\"manifold_decision_boundary_plot2\")\n",
"plt.show()\n",
"\n",
"positive_class = 2 * (t[:] - 4) > X_swiss[:, 1]\n",
"X_pos = X_swiss[positive_class]\n",
"X_neg = X_swiss[~positive_class]\n",
"\n",
"fig = plt.figure(figsize=(6, 5))\n",
"ax = plt.subplot(1, 1, 1, projection='3d')\n",
2019-01-15 05:36:29 +01:00
"ax.view_init(10, -70)\n",
"ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], \"y^\")\n",
"ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], \"gs\")\n",
"ax.xaxis.set_rotate_label(False)\n",
"ax.yaxis.set_rotate_label(False)\n",
"ax.zaxis.set_rotate_label(False)\n",
"ax.set_xlabel(\"$x_1$\", rotation=0)\n",
"ax.set_ylabel(\"$x_2$\", rotation=0)\n",
"ax.set_zlabel(\"$x_3$\", rotation=0)\n",
2019-01-15 05:36:29 +01:00
"ax.set_xlim(axes[0:2])\n",
"ax.set_ylim(axes[2:4])\n",
"ax.set_zlim(axes[4:6])\n",
"save_fig(\"manifold_decision_boundary_plot3\")\n",
"plt.show()\n",
"\n",
"fig = plt.figure(figsize=(5, 4))\n",
"ax = plt.subplot(1, 1, 1)\n",
"ax.plot(t[positive_class], X_swiss[positive_class, 1], \"gs\")\n",
"ax.plot(t[~positive_class], X_swiss[~positive_class, 1], \"y^\")\n",
"ax.plot([4, 15], [0, 22], \"b-\", linewidth=2)\n",
"ax.axis([4, 15, axes[2], axes[3]])\n",
"ax.set_xlabel(\"$z_1$\")\n",
"ax.set_ylabel(\"$z_2$\", rotation=0, labelpad=8)\n",
"ax.grid(True)\n",
2019-01-15 05:36:29 +01:00
"save_fig(\"manifold_decision_boundary_plot4\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 12,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this code generates and saves Figure 87\n",
"\n",
2019-01-15 05:36:29 +01:00
"angle = np.pi / 5\n",
"stretch = 5\n",
"m = 200\n",
"\n",
"np.random.seed(3)\n",
"X_line = np.random.randn(m, 2) / 10\n",
"X_line = X_line @ np.array([[stretch, 0], [0, 1]]) # stretch\n",
"X_line = X_line @ [[np.cos(angle), np.sin(angle)],\n",
" [np.sin(angle), np.cos(angle)]] # rotate\n",
2019-01-15 05:36:29 +01:00
"\n",
"u1 = np.array([np.cos(angle), np.sin(angle)])\n",
"u2 = np.array([np.cos(angle - 2 * np.pi / 6), np.sin(angle - 2 * np.pi / 6)])\n",
"u3 = np.array([np.cos(angle - np.pi / 2), np.sin(angle - np.pi / 2)])\n",
"\n",
"X_proj1 = X_line @ u1.reshape(-1, 1)\n",
"X_proj2 = X_line @ u2.reshape(-1, 1)\n",
"X_proj3 = X_line @ u3.reshape(-1, 1)\n",
"\n",
"plt.figure(figsize=(8, 4))\n",
"plt.subplot2grid((3, 2), (0, 0), rowspan=3)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u1[1] / u1[0], 1.4 * u1[1] / u1[0]], \"k-\",\n",
" linewidth=2)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u2[1] / u2[0], 1.4 * u2[1] / u2[0]], \"k--\",\n",
" linewidth=2)\n",
"plt.plot([-1.4, 1.4], [-1.4 * u3[1] / u3[0], 1.4 * u3[1] / u3[0]], \"k:\",\n",
" linewidth=2)\n",
"plt.plot(X_line[:, 0], X_line[:, 1], \"ro\", alpha=0.5)\n",
"plt.arrow(0, 0, u1[0], u1[1], head_width=0.1, linewidth=4, alpha=0.9,\n",
" length_includes_head=True, head_length=0.1, fc=\"b\", ec=\"b\", zorder=10)\n",
"plt.arrow(0, 0, u3[0], u3[1], head_width=0.1, linewidth=1, alpha=0.9,\n",
" length_includes_head=True, head_length=0.1, fc=\"b\", ec=\"b\", zorder=10)\n",
"plt.text(u1[0] + 0.1, u1[1] - 0.05, r\"$\\mathbf{c_1}$\", color=\"blue\")\n",
"plt.text(u3[0] + 0.1, u3[1], r\"$\\mathbf{c_2}$\", color=\"blue\")\n",
"plt.xlabel(\"$x_1$\")\n",
"plt.ylabel(\"$x_2$\", rotation=0)\n",
2019-01-15 05:36:29 +01:00
"plt.axis([-1.4, 1.4, -1.4, 1.4])\n",
"plt.grid()\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.subplot2grid((3, 2), (0, 1))\n",
"plt.plot([-2, 2], [0, 0], \"k-\", linewidth=2)\n",
"plt.plot(X_proj1[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
2019-01-15 05:36:29 +01:00
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.gca().get_xaxis().set_ticklabels([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.grid()\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.subplot2grid((3, 2), (1, 1))\n",
"plt.plot([-2, 2], [0, 0], \"k--\", linewidth=2)\n",
"plt.plot(X_proj2[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
2019-01-15 05:36:29 +01:00
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.gca().get_xaxis().set_ticklabels([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.grid()\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.subplot2grid((3, 2), (2, 1))\n",
2019-01-15 05:36:29 +01:00
"plt.plot([-2, 2], [0, 0], \"k:\", linewidth=2)\n",
"plt.plot(X_proj3[:, 0], np.zeros(m), \"ro\", alpha=0.3)\n",
2019-01-15 05:36:29 +01:00
"plt.gca().get_yaxis().set_ticks([])\n",
"plt.axis([-2, 2, -1, 1])\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.grid()\n",
2019-01-15 05:36:29 +01:00
"\n",
2019-05-06 07:15:01 +02:00
"save_fig(\"pca_best_projection_plot\")\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Principal Components"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 13,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
2019-01-15 05:36:29 +01:00
"\n",
"# X = [...] # the small 3D dataset was created ealier in this notebook\n",
"X_centered = X - X.mean(axis=0)\n",
"U, s, Vt = np.linalg.svd(X_centered)\n",
"c1 = Vt[0]\n",
"c2 = Vt[1]"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Note: in principle, the SVD factorization algorithm returns three matrices, **U**, **Σ** and **V**, such that **X** = **UΣV**<sup>⊺</sup>, where **U** is an _m_ × _m_ matrix, **Σ** is an _m_ × _n_ matrix, and **V** is an _n_ × _n_ matrix. But the `svd()` function returns **U**, **s** and **V**<sup>⊺</sup> instead. **s** is the vector containing all the values on the main diagonal of the top _n_ rows of **Σ**. Since **Σ** is full of zeros elsewhere, your can easily reconstruct it from **s**, like this:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 14,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
2021-11-21 06:04:07 +01:00
"# not in the book this code shows how to construct Σ from s\n",
"m, n = X.shape\n",
"Σ = np.zeros_like(X_centered)\n",
"Σ[:n, :n] = np.diag(s)\n",
"assert np.allclose(X_centered, U @ Σ @ Vt)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Projecting Down to d Dimensions"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"W2 = Vt[:2].T\n",
"X2D = X_centered @ W2"
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"## Using Scikit-Learn"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"With Scikit-Learn, PCA is really trivial. It even takes care of mean centering for you:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 16,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=2)\n",
"X2D = pca.fit_transform(X)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"pca.components_"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Recover the 3D points projected on the plane (PCA 2D subspace)."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explained Variance Ratio"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Now let's look at the explained variance ratio:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 18,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca.explained_variance_ratio_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"The first dimension explains about 76% of the variance, while the second explains about 15%."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By projecting down to 2D, we lost about 9% of the variance:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 19,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"1 - pca.explained_variance_ratio_.sum() # not in the book"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Choosing the Right Number of Dimensions"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
2019-01-15 05:36:29 +01:00
"source": [
"from sklearn.datasets import fetch_openml\n",
2019-01-15 05:36:29 +01:00
"\n",
"mnist = fetch_openml('mnist_784', as_frame=False)\n",
"X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]\n",
"X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]\n",
2019-01-15 05:36:29 +01:00
"\n",
"pca = PCA()\n",
"pca.fit(X_train)\n",
"cumsum = np.cumsum(pca.explained_variance_ratio_)\n",
"d = np.argmax(cumsum >= 0.95) + 1 # d equals 154"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 21,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"d"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "code",
"execution_count": 22,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(n_components=0.95)\n",
"X_reduced = pca.fit_transform(X_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 23,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca.n_components_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 24,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca.explained_variance_ratio_.sum() # not in the book"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 88. Explained variance as a function of the number of dimensions:**"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 25,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(6,4))\n",
"plt.plot(cumsum, linewidth=3)\n",
"plt.axis([0, 400, 0, 1])\n",
"plt.xlabel(\"Dimensions\")\n",
"plt.ylabel(\"Explained Variance\")\n",
"plt.plot([d, d], [0, 0.95], \"k:\")\n",
"plt.plot([0, d], [0.95, 0.95], \"k:\")\n",
"plt.plot(d, 0.95, \"ko\")\n",
"plt.annotate(\"Elbow\", xy=(65, 0.85), xytext=(70, 0.7),\n",
" arrowprops=dict(arrowstyle=\"->\"))\n",
"plt.grid(True)\n",
"save_fig(\"explained_variance_plot\")\n",
"plt.show()"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 26,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
2019-01-15 05:36:29 +01:00
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"clf = make_pipeline(PCA(random_state=42),\n",
" RandomForestClassifier(random_state=42))\n",
"param_distrib = {\n",
" \"pca__n_components\": np.arange(10, 80),\n",
" \"randomforestclassifier__n_estimators\": np.arange(50, 500)\n",
"}\n",
"rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3,\n",
" random_state=42)\n",
"rnd_search.fit(X_train[:1000], y_train[:1000])"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 27,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
2019-01-15 05:36:29 +01:00
"source": [
"print(rnd_search.best_params_)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 28,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.model_selection import GridSearchCV\n",
2019-01-15 05:36:29 +01:00
"\n",
"clf = make_pipeline(PCA(random_state=42), SGDClassifier())\n",
"param_grid = {\"pca__n_components\": np.arange(10, 80)}\n",
"grid_search = GridSearchCV(clf, param_grid, cv=3)\n",
"grid_search.fit(X_train[:1000], y_train[:1000])"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 29,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"grid_search.best_params_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PCA for Compression"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 30,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(0.95)\n",
"X_reduced = pca.fit_transform(X_train, y_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 31,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
2019-01-15 05:36:29 +01:00
"source": [
"X_recovered = pca.inverse_transform(X_reduced)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 32,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 89\n",
2019-01-15 05:36:29 +01:00
"\n",
"plt.figure(figsize=(7, 4))\n",
"for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):\n",
" plt.subplot(1, 2, idx + 1)\n",
" plt.title([\"Original\", \"Compressed\"][idx])\n",
" for row in range(5):\n",
" for col in range(5):\n",
" plt.imshow(X[row * 5 + col].reshape(28, 28), cmap=\"binary\",\n",
" vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))\n",
" plt.axis([0, 5, 0, 5])\n",
" plt.axis(\"off\")\n",
2019-01-15 05:36:29 +01:00
"\n",
"save_fig(\"mnist_compression_plot\")"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Randomized PCA"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 33,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"rnd_pca = PCA(n_components=154, svd_solver=\"randomized\", random_state=42)\n",
"X_reduced = rnd_pca.fit_transform(X_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Incremental PCA"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
2019-01-15 05:36:29 +01:00
"outputs": [],
"source": [
"from sklearn.decomposition import IncrementalPCA\n",
"\n",
"n_batches = 100\n",
"inc_pca = IncrementalPCA(n_components=154)\n",
"for X_batch in np.array_split(X_train, n_batches):\n",
" inc_pca.partial_fit(X_batch)\n",
2019-01-15 05:36:29 +01:00
"\n",
"X_reduced = inc_pca.transform(X_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using `memmap()`:**"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Let's create the `memmap()` structure, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 35,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"filename = \"my_mnist.mmap\"\n",
"X_mmap = np.memmap(filename, dtype='float32', mode='write', shape=X_train.shape)\n",
"X_mmap[:] = X_train # could be a loop instead, saving the data chunk by chunk\n",
"X_mmap.flush()"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, another program would load the data and use it for training:"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "code",
"execution_count": 36,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"X_mmap = np.memmap(filename, dtype=\"float32\", mode=\"readonly\").reshape(-1, 784)\n",
"batch_size = X_mmap.shape[0] // n_batches\n",
"inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)\n",
"inc_pca.fit(X_mmap)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Projection"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"**Warning**: this sections will use close to 2.5 GB of RAM. If your computer runs out of memory, just reduce _m_ and _n_:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 37,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.random_projection import johnson_lindenstrauss_min_dim\n",
2019-01-15 05:36:29 +01:00
"\n",
"m, ε = 5_000, 0.1\n",
"d = johnson_lindenstrauss_min_dim(m, eps=ε)\n",
"d"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
2021-11-21 06:04:07 +01:00
"# not in the book show the equation computed by johnson_lindenstrauss_min_dim\n",
"d = int(4 * np.log(m) / (ε ** 2 / 2 - ε ** 3 / 3))\n",
"d"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "code",
"execution_count": 39,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"n = 20_000\n",
"np.random.seed(42)\n",
"P = np.random.randn(d, n) / np.sqrt(d) # std dev = square root of variance\n",
2019-01-15 05:36:29 +01:00
"\n",
"X = np.random.randn(m, n) # generate a fake dataset\n",
"X_reduced = X @ P.T"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 40,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.random_projection import GaussianRandomProjection\n",
"\n",
"gaussian_rnd_proj = GaussianRandomProjection(eps=ε, random_state=42)\n",
"X_reduced = gaussian_rnd_proj.fit_transform(X) # same result as above"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell may take several minutes to run:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 41,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"components_pinv = np.linalg.pinv(gaussian_rnd_proj.components_)\n",
"X_recovered = X_reduced @ components_pinv.T"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 42,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
2021-11-21 06:04:07 +01:00
"# not in the book performance comparison between Gaussian and Sparse RP\n",
"\n",
"from sklearn.random_projection import SparseRandomProjection\n",
"\n",
"print(\"GaussianRandomProjection fit\")\n",
"%timeit GaussianRandomProjection(random_state=42).fit(X)\n",
"print(\"SparseRandomProjection fit\")\n",
"%timeit SparseRandomProjection(random_state=42).fit(X)\n",
2019-01-15 05:36:29 +01:00
"\n",
"gaussian_rnd_proj = GaussianRandomProjection(random_state=42).fit(X)\n",
"sparse_rnd_proj = SparseRandomProjection(random_state=42).fit(X)\n",
"print(\"GaussianRandomProjection transform\")\n",
"%timeit gaussian_rnd_proj.transform(X)\n",
"print(\"SparseRandomProjection transform\")\n",
"%timeit sparse_rnd_proj.transform(X)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# LLE"
]
},
{
"cell_type": "code",
"execution_count": 43,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_swiss_roll\n",
2019-01-15 05:36:29 +01:00
"from sklearn.manifold import LocallyLinearEmbedding\n",
"\n",
"X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)\n",
2019-01-15 05:36:29 +01:00
"lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)\n",
"X_unrolled = lle.fit_transform(X_swiss)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 44,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 810\n",
2021-11-21 06:04:07 +01:00
"\n",
"plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1],\n",
" c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$z_2$\", rotation=0)\n",
"plt.axis([-0.055, 0.060, -0.070, 0.090])\n",
2019-01-15 05:36:29 +01:00
"plt.grid(True)\n",
"\n",
"save_fig(\"lle_unrolling_plot\")\n",
"plt.title(\"Unrolled swiss roll using LLE\")\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 45,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
2019-01-15 05:36:29 +01:00
"source": [
2021-11-21 06:04:07 +01:00
"# not in the book shows how well correlated z1 is to t: LLE worked fine\n",
"plt.title(\"$z_1$ vs $t$\")\n",
"plt.scatter(X_unrolled[:, 0], t, c=t, cmap=darker_hot)\n",
"plt.xlabel(\"$z_1$\")\n",
"plt.ylabel(\"$t$\", rotation=0)\n",
"plt.grid(True)\n",
"plt.show()"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 46,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import MDS\n",
"\n",
"mds = MDS(n_components=2, random_state=42)\n",
"X_reduced_mds = mds.fit_transform(X_swiss)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 47,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import Isomap\n",
"\n",
"isomap = Isomap(n_components=2)\n",
"X_reduced_isomap = isomap.fit_transform(X_swiss)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 48,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import TSNE\n",
"\n",
"tsne = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
"X_reduced_tsne = tsne.fit_transform(X_swiss)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 49,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"# not in the book this cell generates and saves Figure 811\n",
2021-11-21 06:04:07 +01:00
"\n",
2019-01-15 05:36:29 +01:00
"titles = [\"MDS\", \"Isomap\", \"t-SNE\"]\n",
"\n",
"plt.figure(figsize=(11,4))\n",
"\n",
"for subplot, title, X_reduced in zip((131, 132, 133), titles,\n",
" (X_reduced_mds, X_reduced_isomap, X_reduced_tsne)):\n",
" plt.subplot(subplot)\n",
" plt.title(title)\n",
" plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=darker_hot)\n",
" plt.xlabel(\"$z_1$\")\n",
2019-01-15 05:36:29 +01:00
" if subplot == 131:\n",
" plt.ylabel(\"$z_2$\", rotation=0)\n",
2019-01-15 05:36:29 +01:00
" plt.grid(True)\n",
"\n",
"save_fig(\"other_dim_reduction_plot\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extra Material Kernel PCA"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import KernelPCA\n",
"\n",
"rbf_pca = KernelPCA(n_components=2, kernel=\"rbf\", gamma=0.04, random_state=42)\n",
"X_reduced = rbf_pca.fit_transform(X_swiss)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"lin_pca = KernelPCA(kernel=\"linear\")\n",
"rbf_pca = KernelPCA(kernel=\"rbf\", gamma=0.002)\n",
"sig_pca = KernelPCA(kernel=\"sigmoid\", gamma=0.002, coef0=1)\n",
"\n",
"kernel_pcas = ((lin_pca, \"Linear kernel\"),\n",
" (rbf_pca, rf\"RBF kernel, $\\gamma={rbf_pca.gamma}$\"),\n",
" (sig_pca, rf\"Sigmoid kernel, $\\gamma={sig_pca.gamma}, r={sig_pca.coef0}$\"))\n",
"\n",
"plt.figure(figsize=(11, 3.5))\n",
"for idx, (kpca, title) in enumerate(kernel_pcas):\n",
" kpca.n_components = 2\n",
" kpca.random_state = 42\n",
" X_reduced = kpca.fit_transform(X_swiss)\n",
"\n",
" plt.subplot(1, 3, idx + 1)\n",
" plt.title(title)\n",
" plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=darker_hot)\n",
" plt.xlabel(\"$z_1$\")\n",
" if idx == 0:\n",
" plt.ylabel(\"$z_2$\", rotation=0)\n",
" plt.grid()\n",
"\n",
"plt.show()"
]
},
2019-01-15 05:36:29 +01:00
{
"cell_type": "markdown",
2020-04-06 09:13:12 +02:00
"metadata": {},
2019-01-15 05:36:29 +01:00
"source": [
"# Exercise solutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. to 8."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The main motivations for dimensionality reduction are:\n",
" * To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)\n",
" * To visualize the data and gain insights on the most important features\n",
" * To save space (compression)\n",
" \n",
" The main drawbacks are:\n",
" * Some information is lost, possibly degrading the performance of subsequent training algorithms.\n",
" * It can be computationally intensive.\n",
" * It adds some complexity to your Machine Learning pipelines.\n",
" * Transformed features are often hard to interpret.\n",
"2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally far from one another, increasing the risk of overfitting and making it very difficult to identify patterns without having plenty of training data.\n",
"3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as T-SNE) do not.\n",
"4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in the Swiss roll dataset—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.\n",
"5. That's a trick question: it depends on the dataset. Let's look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset's intrinsic dimensionality.\n",
"6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Random Projection is great for very high-dimensional datasets.\n",
"7. Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.\n",
"8. It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA or Random Projection to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield roughly the same performance as using LLE only, but in a fraction of the time."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2020-04-06 09:13:12 +02:00
"metadata": {},
2019-01-15 05:36:29 +01:00
"source": [
"## 9."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing)._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MNIST dataset was loaded earlier."
]
},
{
"cell_type": "code",
"execution_count": 52,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"X_train = mnist.data[:60000]\n",
"y_train = mnist.target[:60000]\n",
2019-01-15 05:36:29 +01:00
"\n",
"X_test = mnist.data[60000:]\n",
"y_test = mnist.target[60000:]"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 53,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 54,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"%time rnd_clf.fit(X_train, y_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 55,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"y_pred = rnd_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 56,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA(n_components=0.95)\n",
"X_train_reduced = pca.fit_transform(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 57,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"rnd_clf_with_pca = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"%time rnd_clf_with_pca.fit(X_train_reduced, y_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh no! Training is actually about twice slower now! How can that be? Well, as we saw in this chapter, dimensionality reduction does not always lead to faster training time: it depends on the dataset, the model and the training algorithm. See figure 8-6 (the `manifold_decision_boundary_plot*` plots above). If you try `SGDClassifier` instead of `RandomForestClassifier`, you will find that training time is reduced by a factor of 3 when using PCA. Actually, we will do this in a second, but first let's check the precision of the new random forest classifier."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Next evaluate the classifier on the test set: how does it compare to the previous classifier?_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 58,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"X_test_reduced = pca.transform(X_test)\n",
"\n",
"y_pred = rnd_clf_with_pca.predict(X_test_reduced)\n",
2019-01-15 05:36:29 +01:00
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is common for performance to drop slightly when reducing dimensionality, because we do lose some potentially useful signal in the process. However, the performance drop is rather severe in this case. So PCA really did not help: it slowed down training *and* reduced performance. 😭"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Exercise: _Try again with an `SGDClassifier`. How much does PCA help now?_"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 59,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"sgd_clf = SGDClassifier(random_state=42)\n",
"%time sgd_clf.fit(X_train, y_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 60,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"y_pred = sgd_clf.predict(X_test)\n",
2019-01-15 05:36:29 +01:00
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, so the `SGDClassifier` takes much longer to train on this dataset than the `RandomForestClassifier`, plus it performs worse on the test set. But that's not what we are interested in right now, we want to see how much PCA can help `SGDClassifier`. Let's train it using the reduced dataset:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 61,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"sgd_clf_with_pca = SGDClassifier(random_state=42)\n",
"%time sgd_clf_with_pca.fit(X_train_reduced, y_train)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice! Reducing dimensionality led to roughly 5× speedup. :) Let's check the model's accuracy:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 62,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"y_pred = sgd_clf_with_pca.predict(X_test_reduced)\n",
2019-01-15 05:36:29 +01:00
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! PCA not only gave us a 5× speed boost, it also improved performance slightly."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So there you have it: PCA can give you a formidable speedup, and if you're lucky a performance boost... but it's really not guaranteed: it depends on the model and the dataset!"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Use t-SNE to reduce the first 5,000 images of the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's limit ourselves to the first 5,000 images of the MNIST training set, to speed things up a lot."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 63,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"X_sample, y_sample = X_train[:5000], y_train[:5000]"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use t-SNE to reduce dimensionality down to 2D so we can plot the dataset:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 64,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import TSNE\n",
"\n",
"tsne = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
"%time X_reduced = tsne.fit_transform(X_sample)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use Matplotlib's `scatter()` function to plot a scatterplot, using a different color for each digit:"
]
},
{
"cell_type": "code",
"execution_count": 65,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(13, 10))\n",
"plt.scatter(X_reduced[:, 0], X_reduced[:, 1],\n",
" c=y_sample.astype(np.int8), cmap=\"jet\", alpha=0.5)\n",
2019-01-15 05:36:29 +01:00
"plt.axis('off')\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Isn't this just beautiful? :) Most digits are nicely separated from the others, even though t-SNE wasn't given the targets: it just identified clusters of similar images. But there is still a bit of overlap. For example, the 3s and the 5s overlap a lot (on the right side of the plot), and so do the 4s and the 9s (in the top-right corner)."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's focus on just the digits 4 and 9:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 66,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(9, 9))\n",
"cmap = plt.cm.jet\n",
"for digit in ('4', '9'):\n",
" plt.scatter(X_reduced[y_sample == digit, 0], X_reduced[y_sample == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
2019-01-15 05:36:29 +01:00
"plt.axis('off')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can produce a nicer image by running t-SNE on just these 2 digits:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 67,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"idx = (y_sample == '4') | (y_sample == '9')\n",
"X_subset = X_sample[idx]\n",
"y_subset = y_sample[idx]\n",
2019-01-15 05:36:29 +01:00
"\n",
"tsne_subset = TSNE(n_components=2, init=\"random\", learning_rate=\"auto\",\n",
" random_state=42)\n",
2019-01-15 05:36:29 +01:00
"X_subset_reduced = tsne_subset.fit_transform(X_subset)"
]
},
{
"cell_type": "code",
"execution_count": 68,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(9, 9))\n",
"for digit in ('4', '9'):\n",
" plt.scatter(X_subset_reduced[y_subset == digit, 0],\n",
" X_subset_reduced[y_subset == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
2019-01-15 05:36:29 +01:00
"plt.axis('off')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's much better, although there's still a bit of overlap. Perhaps some 4s really do look like 9s, and vice versa. It would be nice if we could visualize a few digits from each region of this plot, to understand what's going on. In fact, let's do that now."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2020-04-06 09:13:12 +02:00
"metadata": {},
2019-01-15 05:36:29 +01:00
"source": [
"Exercise: _Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples ([plot_lle_digits](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
]
},
{
"cell_type": "code",
"execution_count": 69,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"from matplotlib.offsetbox import AnnotationBbox, OffsetImage\n",
"\n",
"def plot_digits(X, y, min_distance=0.04, images=None, figsize=(13, 10)):\n",
2019-01-15 05:36:29 +01:00
" # Let's scale the input features so that they range from 0 to 1\n",
" X_normalized = MinMaxScaler().fit_transform(X)\n",
" # Now we create the list of coordinates of the digits plotted so far.\n",
" # We pretend that one is already plotted far away at the start, to\n",
" # avoid `if` statements in the loop below\n",
" neighbors = np.array([[10., 10.]])\n",
" # The rest should be self-explanatory\n",
" plt.figure(figsize=figsize)\n",
" cmap = plt.cm.jet\n",
2019-01-15 05:36:29 +01:00
" digits = np.unique(y)\n",
" for digit in digits:\n",
" plt.scatter(X_normalized[y == digit, 0], X_normalized[y == digit, 1],\n",
" c=[cmap(float(digit) / 9)], alpha=0.5)\n",
2019-01-15 05:36:29 +01:00
" plt.axis(\"off\")\n",
" ax = plt.gca() # get current axes\n",
2019-01-15 05:36:29 +01:00
" for index, image_coord in enumerate(X_normalized):\n",
" closest_distance = np.linalg.norm(neighbors - image_coord, axis=1).min()\n",
2019-01-15 05:36:29 +01:00
" if closest_distance > min_distance:\n",
" neighbors = np.r_[neighbors, [image_coord]]\n",
" if images is None:\n",
" plt.text(image_coord[0], image_coord[1], str(int(y[index])),\n",
" color=cmap(float(y[index]) / 9),\n",
" fontdict={\"weight\": \"bold\", \"size\": 16})\n",
2019-01-15 05:36:29 +01:00
" else:\n",
" image = images[index].reshape(28, 28)\n",
" imagebox = AnnotationBbox(OffsetImage(image, cmap=\"binary\"),\n",
" image_coord)\n",
2019-01-15 05:36:29 +01:00
" ax.add_artist(imagebox)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it! First let's show colored digits (not images), for all 5,000 images:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 70,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_reduced, y_sample)"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well that's okay, but not that beautiful. Let's try with the digit images:"
]
},
{
"cell_type": "code",
"execution_count": 71,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_reduced, y_sample, images=X_sample, figsize=(35, 25))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's nicer! Now let's focus on just the 3s and the 5s:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 72,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"plot_digits(X_subset_reduced, y_subset, images=X_subset, figsize=(22, 22))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how similar-looking 4s are grouped together. For example, the 4s get more and more inclined as they approach the top of the figure. The inclined 9s are also closer to the top. Some 4s really do look like 9s, and vice versa."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations._"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with PCA. We will also time how long it takes:"
]
},
{
"cell_type": "code",
"execution_count": 73,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(n_components=2, random_state=42)\n",
"%time X_pca_reduced = pca.fit_transform(X_sample)\n",
"plot_digits(X_pca_reduced, y_sample)\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, PCA is blazingly fast! But although we do see a few clusters, there's way too much overlap. Let's try LLE:"
]
},
{
"cell_type": "code",
"execution_count": 74,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"lle = LocallyLinearEmbedding(n_components=2, random_state=42)\n",
"%time X_lle_reduced = lle.fit_transform(X_sample)\n",
"plot_digits(X_lle_reduced, y_sample)\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That took more time, and yet the result does not look good at all. Let's see what happens if we apply PCA first, preserving 95% of the variance:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 75,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca_lle = make_pipeline(PCA(n_components=0.95),\n",
" LocallyLinearEmbedding(n_components=2, random_state=42))\n",
2019-01-15 05:36:29 +01:00
"\n",
"%time X_pca_lle_reduced = pca_lle.fit_transform(X_sample)\n",
"plot_digits(X_pca_lle_reduced, y_sample)\n",
"plt.show()"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is more or less as bad, but this time training was a bit faster."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try MDS:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 76,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"%time X_mds_reduced = MDS(n_components=2, random_state=42).fit_transform(X_sample)\n",
"plot_digits(X_mds_reduced, y_sample)\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Meh. This does not look great, all clusters overlap too much. Let's try with PCA first, perhaps it will be faster?"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 77,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"pca_mds = make_pipeline(PCA(n_components=0.95, random_state=42),\n",
" MDS(n_components=2, random_state=42))\n",
2019-01-15 05:36:29 +01:00
"\n",
"%time X_pca_mds_reduced = pca_mds.fit_transform(X_sample)\n",
"plot_digits(X_pca_mds_reduced, y_sample)\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Same result, and not faster: PCA did not help in this case."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try LDA now:"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": 78,
2019-01-15 05:36:29 +01:00
"metadata": {},
"outputs": [],
"source": [
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
2019-01-15 05:36:29 +01:00
"\n",
"lda = LinearDiscriminantAnalysis(n_components=2)\n",
"%time X_lda_reduced = lda.fit_transform(X_sample, y_sample)\n",
"plot_digits(X_lda_reduced, y_sample, figsize=(12, 12))\n",
2019-01-15 05:36:29 +01:00
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This one is very fast, and it looks nice at first, until you realize that several clusters overlap severely."
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
2019-01-15 05:36:29 +01:00
"metadata": {},
"source": [
"Well, it's pretty clear that t-SNE won this little competition, wouldn't you agree?"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And that's all for today, I hope you enjoyed this chapter!"
2019-01-15 05:36:29 +01:00
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
2019-01-15 05:36:29 +01:00
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
2021-10-17 03:27:34 +02:00
"version": "3.8.12"
2019-01-15 05:36:29 +01:00
}
},
"nbformat": 4,
2020-04-06 09:13:12 +02:00
"nbformat_minor": 4
2019-01-15 05:36:29 +01:00
}