From 07b8560867ba3d6cb5af0bec164d0c6ae9634d97 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Mon, 22 Nov 2021 21:36:00 +1300 Subject: [PATCH] Big update of notebook 8 for the 3rd edition --- 08_unsupervised_learning.ipynb | 2167 ++++++++++++-------------------- 1 file changed, 773 insertions(+), 1394 deletions(-) diff --git a/08_unsupervised_learning.ipynb b/08_unsupervised_learning.ipynb index d94fece..dcc51ef 100644 --- a/08_unsupervised_learning.ipynb +++ b/08_unsupervised_learning.ipynb @@ -30,7 +30,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ "# Setup" ] }, @@ -39,7 +41,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures." + "This project requires Python 3.8 or above:" ] }, { "cell_type": "code", @@ -48,27 +50,64 @@ "metadata": {}, "outputs": [], "source": [ - "# Python ≥3.8 is required\n", "import sys\n", - "assert sys.version_info >= (3, 8)\n", "\n", - "# Common imports\n", - "import numpy as np\n", + "assert sys.version_info >= (3, 8)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It also requires Scikit-Learn ≥ 1.0.1:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import sklearn\n", + "\n", + "assert sklearn.__version__ >= \"1.0.1\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we did in previous chapters, let's define the default font sizes to make the figures prettier:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib as mpl\n", + "\n", + "mpl.rc('font', size=12)\n", + "mpl.rc('axes', labelsize=14, titlesize=14)\n", + "mpl.rc('legend', fontsize=14)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And let's create the `images/unsupervised_learning` folder (if it doesn't already exist), and define the `save_fig()` function which is used throughout this notebook to save the figures in high-res for the book:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ "from pathlib import Path\n", "\n", - "# Scikit-Learn ≥1.0 is required\n", - "import sklearn\n", - "assert sklearn.__version__ >= \"1.0\"\n", - "\n", - "# To plot pretty figures\n", - "%matplotlib inline\n", - "import matplotlib as mpl\n", - "import matplotlib.pyplot as plt\n", - "mpl.rc('axes', labelsize=14)\n", - "mpl.rc('xtick', labelsize=12)\n", - "mpl.rc('ytick', labelsize=12)\n", - "\n", - "# Where to save the figures\n", "IMAGES_PATH = Path() / "images" / "unsupervised_learning"\n", "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n", "\n", @@ -79,6 +118,13 @@ " plt.savefig(path, format=fig_extension, dpi=resolution)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that the setup is done, let's dive into clustering!"
+ ] + }, { "cell_type": "markdown", "metadata": {}, @@ -95,45 +141,37 @@ }, { "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import load_iris" - ] - }, - { - "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–1\n", + "\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.datasets import load_iris\n", + "\n", "data = load_iris()\n", "X = data.data\n", "y = data.target\n", - "data.target_names" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ + "data.target_names\n", + "\n", "plt.figure(figsize=(9, 3.5))\n", "\n", "plt.subplot(121)\n", "plt.plot(X[y==0, 2], X[y==0, 3], \"yo\", label=\"Iris setosa\")\n", "plt.plot(X[y==1, 2], X[y==1, 3], \"bs\", label=\"Iris versicolor\")\n", "plt.plot(X[y==2, 2], X[y==2, 3], \"g^\", label=\"Iris virginica\")\n", - "plt.xlabel(\"Petal length\", fontsize=14)\n", - "plt.ylabel(\"Petal width\", fontsize=14)\n", - "plt.legend(fontsize=12)\n", + "plt.xlabel(\"Petal length\")\n", + "plt.ylabel(\"Petal width\")\n", + "plt.grid()\n", + "plt.legend()\n", "\n", "plt.subplot(122)\n", "plt.scatter(X[:, 2], X[:, 3], c=\"k\", marker=\".\")\n", - "plt.xlabel(\"Petal length\", fontsize=14)\n", + "plt.xlabel(\"Petal length\")\n", "plt.tick_params(labelleft=False)\n", + "plt.gca().set_axisbelow(True)\n", + "plt.grid()\n", "\n", "save_fig(\"classification_vs_clustering_plot\")\n", "plt.show()" @@ -143,16 +181,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A Gaussian mixture model (explained below) can actually separate these clusters pretty well (using all 4 features: petal length & width, and sepal length & width)." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.mixture import GaussianMixture" + "**Note**: the next cell shows how a Gaussian mixture model (explained later in this chapter) can actually separate these clusters pretty well using all 4 features: petal length & width, and sepal length & width. This code maps each cluster to a class. Instead of hard coding the mapping, the code picks the most common class for each cluster using the `scipy.stats.mode()` function:" ] }, { @@ -161,14 +190,36 @@ "metadata": {}, "outputs": [], "source": [ - "y_pred = GaussianMixture(n_components=3, random_state=42).fit(X).predict(X)" + "# not in the book\n", + "\n", + "import numpy as np\n", + "from scipy import stats\n", + "from sklearn.mixture import GaussianMixture\n", + "\n", + "y_pred = GaussianMixture(n_components=3, random_state=42).fit(X).predict(X)\n", + "\n", + "mapping = {}\n", + "for class_id in np.unique(y):\n", + " mode, _ = stats.mode(y_pred[y==class_id])\n", + " mapping[mode[0]] = class_id\n", + "\n", + "y_pred = np.array([mapping[cluster_id] for cluster_id in y_pred])\n", + "\n", + "plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], \"yo\", label=\"Cluster 1\")\n", + "plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], \"bs\", label=\"Cluster 2\")\n", + "plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], \"g^\", label=\"Cluster 3\")\n", + "plt.xlabel(\"Petal length\")\n", + "plt.ylabel(\"Petal width\")\n", + "plt.legend(loc=\"upper left\")\n", + "plt.grid()\n", + "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's map each cluster to a class. 
Instead of hard coding the mapping (as is done in the book, for simplicity), we will pick the most common class for each cluster (using the `scipy.stats.mode()` function):" + "What's the ratio of iris plants we assigned to the right cluster?" ] }, { "cell_type": "code", @@ -177,63 +228,7 @@ "metadata": {}, "outputs": [], "source": [ - "from scipy import stats\n", - "\n", - "mapping = {}\n", - "for class_id in np.unique(y):\n", - " mode, _ = stats.mode(y_pred[y==class_id])\n", - " mapping[mode[0]] = class_id\n", - "\n", - "mapping" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "y_pred = np.array([mapping[cluster_id] for cluster_id in y_pred])" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "plt.plot(X[y_pred==0, 2], X[y_pred==0, 3], \"yo\", label=\"Cluster 1\")\n", - "plt.plot(X[y_pred==1, 2], X[y_pred==1, 3], \"bs\", label=\"Cluster 2\")\n", - "plt.plot(X[y_pred==2, 2], X[y_pred==2, 3], \"g^\", label=\"Cluster 3\")\n", - "plt.xlabel(\"Petal length\", fontsize=14)\n", - "plt.ylabel(\"Petal width\", fontsize=14)\n", - "plt.legend(loc=\"upper left\", fontsize=12)\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "np.sum(y_pred==y)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "np.sum(y_pred==y) / len(y_pred)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: the results in this notebook may differ slightly from the book. This is because algorithms can sometimes be tweaked a bit between Scikit-Learn versions." + "(y_pred==y).sum() / len(y_pred)" ] }, @@ -247,41 +242,35 @@ "cell_type": "markdown", "metadata": {}, "source": [ "## K-Means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's start by generating some blobs:" + "**Fit and predict**" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's train a K-Means clusterer on a dataset of blobs. 
It will try to find each blob's center and assign each instance to the closest blob:" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ - "from sklearn.datasets import make_blobs" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "blob_centers = np.array(\n", - " [[ 0.2, 2.3],\n", - " [-1.5 , 2.3],\n", - " [-2.8, 1.8],\n", - " [-2.8, 2.8],\n", - " [-2.8, 1.3]])\n", - "blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "X, y = make_blobs(n_samples=2000, centers=blob_centers,\n", - " cluster_std=blob_std, random_state=7)" + "from sklearn.cluster import KMeans\n", + "from sklearn.datasets import make_blobs\n", + "\n", + "# not in the book – the exact arguments of make_blobs() are not important\n", + "blob_centers = np.array([[ 0.2, 2.3], [-1.5 , 2.3], [-2.8, 1.8],\n", + " [-2.8, 2.8], [-2.8, 1.3]])\n", + "blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])\n", + "X, y = make_blobs(n_samples=2000, centers=blob_centers, cluster_std=blob_std,\n", + " random_state=7)\n", + "\n", + "k = 5\n", + "kmeans = KMeans(n_clusters=k, random_state=42)\n", + "y_pred = kmeans.fit_predict(X)" ] }, { @@ -293,62 +282,25 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–2\n", + "\n", "def plot_clusters(X, y=None):\n", " plt.scatter(X[:, 0], X[:, 1], c=y, s=1)\n", - " plt.xlabel(\"$x_1$\", fontsize=14)\n", - " plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ + " plt.xlabel(\"$x_1$\")\n", + " plt.ylabel(\"$x_2$\", rotation=0)\n", + "\n", "plt.figure(figsize=(8, 4))\n", "plot_clusters(X)\n", + "plt.gca().set_axisbelow(True)\n", + "plt.grid()\n", "save_fig(\"blobs_plot\")\n", "plt.show()" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Fit and predict**" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's train a K-Means clusterer on this dataset. It will try to find each blob's center and assign each instance to the closest blob:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.cluster import KMeans" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "k = 5\n", - "kmeans = KMeans(n_clusters=k, random_state=42)\n", - "y_pred = kmeans.fit_predict(X)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -358,7 +310,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -367,7 +319,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -383,7 +335,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -394,12 +346,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that the `KMeans` instance preserves the labels of the instances it was trained on. 
Somewhat confusingly, in this context, the _label_ of an instance is the index of the cluster that instance gets assigned to:" + "Note that the `KMeans` instance preserves the labels of the instances it was trained on. Somewhat confusingly, in this context, the _label_ of an instance is the index of the cluster that instance gets assigned to (they are not targets, they are predictions):" ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -415,10 +367,12 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ + "import numpy as np\n", + "\n", "X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])\n", "kmeans.predict(X_new)" ] @@ -439,10 +393,12 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–3\n", + "\n", "def plot_data(X):\n", " plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)\n", "\n", @@ -474,21 +430,14 @@ " plot_centroids(clusterer.cluster_centers_)\n", "\n", " if show_xlabels:\n", - " plt.xlabel(\"$x_1$\", fontsize=14)\n", + " plt.xlabel(\"$x_1$\")\n", " else:\n", " plt.tick_params(labelbottom=False)\n", " if show_ylabels:\n", - " plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n", + " plt.ylabel(\"$x_2$\", rotation=0)\n", " else:\n", - " plt.tick_params(labelleft=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [], - "source": [ + " plt.tick_params(labelleft=False)\n", + "\n", "plt.figure(figsize=(8, 4))\n", "plot_decision_boundaries(kmeans, X)\n", "save_fig(\"voronoi_plot\")\n", @@ -513,16 +462,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Rather than arbitrarily choosing the closest cluster for each instance, which is called _hard clustering_, it might be better measure the distance of each instance to all 5 centroids. This is what the `transform()` method does:" + "Rather than arbitrarily choosing the closest cluster for each instance, which is called _hard clustering_, it might be better to measure the distance of each instance to all 5 centroids. This is what the `transform()` method does:" ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ - "kmeans.transform(X_new)" + "kmeans.transform(X_new).round(2)" ] }, { @@ -534,11 +483,13 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ - "np.linalg.norm(np.tile(X_new, (1, k)).reshape(-1, k, 2) - kmeans.cluster_centers_, axis=2)" + "# not in the book\n", + "np.linalg.norm(np.tile(X_new, (1, k)).reshape(-1, k, 2)\n", + " - kmeans.cluster_centers_, axis=2).round(2)" ] }, { @@ -553,7 +504,7 @@ "metadata": {}, "source": [ "The K-Means algorithm is one of the fastest clustering algorithms, and also one of the simplest:\n", - "* First initialize $k$ centroids randomly: $k$ distinct instances are chosen randomly from the dataset and the centroids are placed at their locations.\n", + "* First initialize $k$ centroids randomly: e.g., $k$ distinct instances are chosen randomly from the dataset and the centroids are placed at their locations.\n", "* Repeat until convergence (i.e., until the centroids stop moving):\n", " * Assign each instance to the closest centroid.\n", " * Update the centroids to be the mean of the instances that are assigned to them." 
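Since the loop above is the whole algorithm, it is easy to check it against a from-scratch implementation. Here is a minimal NumPy sketch (not in the book and not part of this patch; `lloyd_kmeans` is our own illustrative helper — unlike Scikit-Learn's `KMeans` it does not handle empty clusters or multiple initializations):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=300, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: place the k centroids on k distinct random instances
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each instance with its closest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its instances
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# e.g., on the blobs array X defined earlier:
# centroids, labels = lloyd_kmeans(X, k=5)
```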
@@ -563,7 +514,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The `KMeans` class applies an optimized algorithm by default. To get the original K-Means algorithm (for educational purposes only), you must set `init=\"random\"`, `n_init=1`and `algorithm=\"full\"`. These hyperparameters will be explained below." + "The `KMeans` class uses an optimized initialization technique by default. To get the original K-Means algorithm (for educational purposes only), you must set `init=\"random\"` and `n_init=1`. More on this later in this chapter." ] }, { @@ -575,53 +526,44 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ - "kmeans_iter1 = KMeans(n_clusters=5, init=\"random\", n_init=1,\n", - " algorithm=\"full\", max_iter=1, random_state=0)\n", - "kmeans_iter2 = KMeans(n_clusters=5, init=\"random\", n_init=1,\n", - " algorithm=\"full\", max_iter=2, random_state=0)\n", - "kmeans_iter3 = KMeans(n_clusters=5, init=\"random\", n_init=1,\n", - " algorithm=\"full\", max_iter=3, random_state=0)\n", + "# not in the book – this code generates and saves Figure 8–4\n", + "\n", + "kmeans_iter1 = KMeans(n_clusters=5, init=\"random\", n_init=1, max_iter=1,\n", + " random_state=5)\n", + "kmeans_iter2 = KMeans(n_clusters=5, init=\"random\", n_init=1, max_iter=2,\n", + " random_state=5)\n", + "kmeans_iter3 = KMeans(n_clusters=5, init=\"random\", n_init=1, max_iter=3,\n", + " random_state=5)\n", "kmeans_iter1.fit(X)\n", "kmeans_iter2.fit(X)\n", - "kmeans_iter3.fit(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And let's plot this:" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [], - "source": [ + "kmeans_iter3.fit(X)\n", + "\n", "plt.figure(figsize=(10, 8))\n", "\n", "plt.subplot(321)\n", "plot_data(X)\n", "plot_centroids(kmeans_iter1.cluster_centers_, circle_color='r', cross_color='w')\n", - "plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n", + "plt.ylabel(\"$x_2$\", rotation=0)\n", "plt.tick_params(labelbottom=False)\n", - "plt.title(\"Update the centroids (initially randomly)\", fontsize=14)\n", + "plt.title(\"Update the centroids (initially randomly)\")\n", "\n", "plt.subplot(322)\n", - "plot_decision_boundaries(kmeans_iter1, X, show_xlabels=False, show_ylabels=False)\n", - "plt.title(\"Label the instances\", fontsize=14)\n", + "plot_decision_boundaries(kmeans_iter1, X, show_xlabels=False,\n", + " show_ylabels=False)\n", + "plt.title(\"Label the instances\")\n", "\n", "plt.subplot(323)\n", - "plot_decision_boundaries(kmeans_iter1, X, show_centroids=False, show_xlabels=False)\n", + "plot_decision_boundaries(kmeans_iter1, X, show_centroids=False,\n", + " show_xlabels=False)\n", "plot_centroids(kmeans_iter2.cluster_centers_)\n", "\n", "plt.subplot(324)\n", - "plot_decision_boundaries(kmeans_iter2, X, show_xlabels=False, show_ylabels=False)\n", + "plot_decision_boundaries(kmeans_iter2, X, show_xlabels=False,\n", + " show_ylabels=False)\n", "\n", "plt.subplot(325)\n", "plot_decision_boundaries(kmeans_iter2, X, show_centroids=False)\n", @@ -652,11 +594,14 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ - "def plot_clusterer_comparison(clusterer1, clusterer2, X, title1=None, title2=None):\n", + "# not in the book – this code generates and saves Figure 8–5\n", + "\n", + "def plot_clusterer_comparison(clusterer1, clusterer2, X, title1=None,\n", + " title2=None):\n", " 
clusterer1.fit(X)\n", " clusterer2.fit(X)\n", "\n", @@ -665,30 +610,44 @@ " plt.subplot(121)\n", " plot_decision_boundaries(clusterer1, X)\n", " if title1:\n", - " plt.title(title1, fontsize=14)\n", + " plt.title(title1)\n", "\n", " plt.subplot(122)\n", " plot_decision_boundaries(clusterer2, X, show_ylabels=False)\n", " if title2:\n", - " plt.title(title2, fontsize=14)" + " plt.title(title2)\n", + "\n", + "kmeans_rnd_init1 = KMeans(n_clusters=5, init=\"random\", n_init=1, random_state=2)\n", + "kmeans_rnd_init2 = KMeans(n_clusters=5, init=\"random\", n_init=1, random_state=9)\n", + "\n", + "plot_clusterer_comparison(kmeans_rnd_init1, kmeans_rnd_init2, X,\n", + " \"Solution 1\",\n", + " \"Solution 2 (with a different random init)\")\n", + "\n", + "save_fig(\"kmeans_variability_plot\")\n", + "plt.show()" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ - "kmeans_rnd_init1 = KMeans(n_clusters=5, init=\"random\", n_init=1,\n", - " algorithm=\"full\", random_state=2)\n", - "kmeans_rnd_init2 = KMeans(n_clusters=5, init=\"random\", n_init=1,\n", - " algorithm=\"full\", random_state=5)\n", - "\n", - "plot_clusterer_comparison(kmeans_rnd_init1, kmeans_rnd_init2, X,\n", - " \"Solution 1\", \"Solution 2 (with a different random init)\")\n", - "\n", - "save_fig(\"kmeans_variability_plot\")\n", - "plt.show()" + "good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])\n", + "kmeans = KMeans(n_clusters=5, init=good_init, n_init=1, random_state=42)\n", + "kmeans.fit(X)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "# not in the book\n", + "plt.figure(figsize=(8, 4))\n", + "plot_decision_boundaries(kmeans, X)" ] }, { @@ -707,13 +666,31 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "kmeans.inertia_" ] }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans_rnd_init1.inertia_ # not in the book" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans_rnd_init2.inertia_ # not in the book" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -723,12 +700,13 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ + "# not in the book\n", "X_dist = kmeans.transform(X)\n", - "np.sum(X_dist[np.arange(len(X_dist)), kmeans.labels_]**2)" + "(X_dist[np.arange(len(X_dist)), kmeans.labels_] ** 2).sum()" ] }, { @@ -740,7 +718,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -758,32 +736,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So one approach to solve the variability issue is to simply run the K-Means algorithm multiple times with different random initializations, and select the solution that minimizes the inertia. 
For example, here are the inertias of the two \"bad\" models shown in the previous figure:" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [], - "source": [ - "kmeans_rnd_init1.inertia_" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [], - "source": [ - "kmeans_rnd_init2.inertia_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As you can see, they have a higher inertia than the first \"good\" model we trained, which means they are probably worse." + "So one approach to solve the variability issue is to simply run the K-Means algorithm multiple times with different random initializations, and select the solution that minimizes the inertia." ] }, { @@ -795,12 +748,13 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ + "# not in the book\n", "kmeans_rnd_10_inits = KMeans(n_clusters=5, init=\"random\", n_init=10,\n", - " algorithm=\"full\", random_state=2)\n", + " random_state=2)\n", "kmeans_rnd_10_inits.fit(X)" ] }, @@ -813,15 +767,25 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ + "# not in the book\n", "plt.figure(figsize=(8, 4))\n", "plot_decision_boundaries(kmeans_rnd_10_inits, X)\n", "plt.show()" ] }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans_rnd_10_inits.inertia_" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -853,27 +817,6 @@ "To set the initialization to K-Means++, simply set `init=\"k-means++\"` (this is actually the default):" ] }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [], - "source": [ - "KMeans()" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])\n", - "kmeans = KMeans(n_clusters=5, init=good_init, n_init=1, random_state=42)\n", - "kmeans.fit(X)\n", - "kmeans.inertia_" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -885,41 +828,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The K-Means algorithm can be significantly accelerated by avoiding many unnecessary distance calculations: this is achieved by exploiting the triangle inequality (given three points A, B and C, the distance AC is always such that AC ≤ AB + BC) and by keeping track of lower and upper bounds for distances between instances and centroids (see this [2003 paper](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) by Charles Elkan for more details)." + "The K-Means algorithm can sometimes be accelerated by avoiding many unnecessary distance calculations: this is achieved by exploiting the triangle inequality (given three points A, B and C, the distance AC is always such that AC ≤ AB + BC) and by keeping track of lower and upper bounds for distances between instances and centroids (see this [2003 paper](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) by Charles Elkan for more details)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "To use Elkan's variant of K-Means, just set `algorithm=\"elkan\"`. Note that it does not support sparse data, so by default, Scikit-Learn uses `\"elkan\"` for dense data, and `\"full\"` (the regular K-Means algorithm) for sparse data." 
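The triangle inequality trick can be illustrated numerically. This quick sketch (not in the book, ours for illustration only) verifies the lower bound that Elkan's algorithm relies on to skip distance computations:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=2)             # an instance
c1, c2 = rng.normal(size=(2, 2))   # two centroids

d_x_c1 = np.linalg.norm(x - c1)    # distance already computed
d_c1_c2 = np.linalg.norm(c1 - c2)  # centroid-to-centroid distance

# Triangle inequality: d(x, c2) >= d(c1, c2) - d(x, c1). So whenever
# d(x, c1) <= d(c1, c2) / 2, centroid c1 is guaranteed to be at least
# as close as c2, and d(x, c2) never needs to be computed at all.
assert np.linalg.norm(x - c2) >= d_c1_c2 - d_x_c1
```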
- ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [], - "source": [ - "%timeit -n 50 KMeans(algorithm=\"elkan\", random_state=42).fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "%timeit -n 50 KMeans(algorithm=\"full\", random_state=42).fit(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There's no big difference in this case, as the dataset is fairly small." + "For Elkan's variant of K-Means, use `algorithm=\"elkan\"`. For regular KMeans, use `algorithm=\"full\"`. The default is `\"auto\"`, which uses the full algorithm since Scikit-Learn 1.1 (it used Elkan's algorithm before that)." ] }, { @@ -938,26 +854,19 @@ }, { "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.cluster import MiniBatchKMeans" - ] - }, - { - "cell_type": "code", - "execution_count": 44, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ + "from sklearn.cluster import MiniBatchKMeans\n", + "\n", "minibatch_kmeans = MiniBatchKMeans(n_clusters=5, random_state=42)\n", "minibatch_kmeans.fit(X)" ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ @@ -968,226 +877,128 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If the dataset does not fit in memory, the simplest option is to use the `memmap` class, just like we did for incremental PCA in the previous chapter. First let's load MNIST:" + "**Using `MiniBatchKMeans` along with `memmap`** (not in the book)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Warning:** since Scikit-Learn 0.24, `fetch_openml()` returns a Pandas `DataFrame` by default. To avoid this and keep the same code as in the book, we use `as_frame=False`." + "If the dataset does not fit in memory, the simplest option is to use the `memmap` class, just like we did for incremental PCA in the previous chapter. 
First let's load MNIST:" ] }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "from sklearn.datasets import fetch_openml\n", "\n", - "mnist = fetch_openml('mnist_784', version=1, as_frame=False)\n", - "mnist.target = mnist.target.astype(np.int64)" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "X_train, X_test, y_train, y_test = train_test_split(\n", - " mnist[\"data\"], mnist[\"target\"], random_state=42)" + "mnist = fetch_openml('mnist_784', as_frame=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Next, let's write it to a `memmap`:" + "Let's split the dataset:" ] }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ - "filename = \"my_mnist.data\"\n", - "X_mm = np.memmap(filename, dtype='float32', mode='write', shape=X_train.shape)\n", - "X_mm[:] = X_train" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [], - "source": [ - "minibatch_kmeans = MiniBatchKMeans(n_clusters=10, batch_size=10, random_state=42)\n", - "minibatch_kmeans.fit(X_mm)" + "X_train, y_train = mnist.data[:60000], mnist.target[:60000]\n", + "X_test, y_test = mnist.data[60000:], mnist.target[60000:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "If your data is so large that you cannot use `memmap`, things get more complicated. Let's start by writing a function to load the next batch (in real life, you would load the data from disk):" + "Next, let's write the training set to a `memmap`:" ] }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 34, "metadata": {}, "outputs": [], "source": [ - "def load_next_batch(batch_size):\n", - " return X[np.random.choice(len(X), batch_size, replace=False)]" + "filename = \"my_mnist.mmap\"\n", + "X_memmap = np.memmap(filename, dtype='float32', mode='write',\n", + " shape=X_train.shape)\n", + "X_memmap[:] = X_train\n", + "X_memmap.flush()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.cluster import MiniBatchKMeans\n", + "\n", + "minibatch_kmeans = MiniBatchKMeans(n_clusters=10, batch_size=10,\n", + " random_state=42)\n", + "minibatch_kmeans.fit(X_memmap)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now we can train the model by feeding it one batch at a time. 
We also need to implement multiple initializations and keep the model with the lowest inertia:" + "Let's plot the inertia ratio and the training time ratio between Mini-batch K-Means and regular K-Means:" ] }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ - "np.random.seed(42)" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [], - "source": [ - "k = 5\n", - "n_init = 10\n", - "n_iterations = 100\n", - "batch_size = 100\n", - "init_size = 500 # more data for K-Means++ initialization\n", - "evaluate_on_last_n_iters = 10\n", + "# not in the book – this code generates and saves Figure 8–6\n", "\n", - "best_kmeans = None\n", + "from timeit import timeit\n", "\n", - "for init in range(n_init):\n", - " minibatch_kmeans = MiniBatchKMeans(n_clusters=k, init_size=init_size)\n", - " X_init = load_next_batch(init_size)\n", - " minibatch_kmeans.partial_fit(X_init)\n", - "\n", - " minibatch_kmeans.sum_inertia_ = 0\n", - " for iteration in range(n_iterations):\n", - " X_batch = load_next_batch(batch_size)\n", - " minibatch_kmeans.partial_fit(X_batch)\n", - " if iteration >= n_iterations - evaluate_on_last_n_iters:\n", - " minibatch_kmeans.sum_inertia_ += minibatch_kmeans.inertia_\n", - "\n", - " if (best_kmeans is None or\n", - " minibatch_kmeans.sum_inertia_ < best_kmeans.sum_inertia_):\n", - " best_kmeans = minibatch_kmeans" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [], - "source": [ - "best_kmeans.score(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Mini-batch K-Means is much faster than regular K-Means:" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [], - "source": [ - "%timeit KMeans(n_clusters=5, random_state=42).fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [], - "source": [ - "%timeit MiniBatchKMeans(n_clusters=5, random_state=42).fit(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That's *much* faster! However, its performance is often lower (higher inertia), and it keeps degrading as _k_ increases. 
Let's plot the inertia ratio and the training time ratio between Mini-batch K-Means and regular K-Means:" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [], - "source": [ - "from timeit import timeit" - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": {}, - "outputs": [], - "source": [ - "times = np.empty((100, 2))\n", - "inertias = np.empty((100, 2))\n", - "for k in range(1, 101):\n", - " kmeans_ = KMeans(n_clusters=k, random_state=42)\n", + "max_k = 100\n", + "times = np.empty((max_k, 2))\n", + "inertias = np.empty((max_k, 2))\n", + "for k in range(1, max_k + 1):\n", + " kmeans_ = KMeans(n_clusters=k, algorithm=\"full\", random_state=42)\n", " minibatch_kmeans = MiniBatchKMeans(n_clusters=k, random_state=42)\n", - " print(\"\\r{}/{}\".format(k, 100), end=\"\")\n", - " times[k-1, 0] = timeit(\"kmeans_.fit(X)\", number=10, globals=globals())\n", - " times[k-1, 1] = timeit(\"minibatch_kmeans.fit(X)\", number=10, globals=globals())\n", - " inertias[k-1, 0] = kmeans_.inertia_\n", - " inertias[k-1, 1] = minibatch_kmeans.inertia_" - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "metadata": {}, - "outputs": [], - "source": [ + " print(f\"\\r{k}/{max_k}\", end=\"\") # \\r returns to the start of line\n", + " times[k - 1, 0] = timeit(\"kmeans_.fit(X)\", number=10, globals=globals())\n", + " times[k - 1, 1] = timeit(\"minibatch_kmeans.fit(X)\", number=10,\n", + " globals=globals())\n", + " inertias[k - 1, 0] = kmeans_.inertia_\n", + " inertias[k - 1, 1] = minibatch_kmeans.inertia_\n", + "\n", "plt.figure(figsize=(10,4))\n", "\n", "plt.subplot(121)\n", - "plt.plot(range(1, 101), inertias[:, 0], \"r--\", label=\"K-Means\")\n", - "plt.plot(range(1, 101), inertias[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n", - "plt.xlabel(\"$k$\", fontsize=16)\n", - "plt.title(\"Inertia\", fontsize=14)\n", - "plt.legend(fontsize=14)\n", - "plt.axis([1, 100, 0, 100])\n", + "plt.plot(range(1, max_k + 1), inertias[:, 0], \"r--\", label=\"K-Means\")\n", + "plt.plot(range(1, max_k + 1), inertias[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n", + "plt.xlabel(\"$k$\")\n", + "plt.title(\"Inertia\")\n", + "plt.legend()\n", + "plt.axis([1, max_k, 0, 100])\n", + "plt.grid()\n", "\n", "plt.subplot(122)\n", - "plt.plot(range(1, 101), times[:, 0], \"r--\", label=\"K-Means\")\n", - "plt.plot(range(1, 101), times[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n", - "plt.xlabel(\"$k$\", fontsize=16)\n", - "plt.title(\"Training time (seconds)\", fontsize=14)\n", - "plt.axis([1, 100, 0, 6])\n", + "plt.plot(range(1, max_k + 1), times[:, 0], \"r--\", label=\"K-Means\")\n", + "plt.plot(range(1, max_k + 1), times[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n", + "plt.xlabel(\"$k$\")\n", + "plt.title(\"Training time (seconds)\")\n", + "plt.axis([1, max_k, 0, 4])\n", + "plt.grid()\n", "\n", - "save_fig(\"minibatch_kmeans_vs_kmeans\")\n", + "save_fig(\"minibatch_kmeans_vs_kmeans_plot\")\n", "plt.show()" ] }, @@ -1207,10 +1018,12 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 37, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–7\n", + "\n", "kmeans_k3 = KMeans(n_clusters=3, random_state=42)\n", "kmeans_k8 = KMeans(n_clusters=8, random_state=42)\n", "\n", @@ -1228,7 +1041,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 38, "metadata": {}, "outputs": [], "source": [ @@ -1237,7 +1050,7 @@ }, { "cell_type": "code", - "execution_count": 61, + 
"execution_count": 39, "metadata": {}, "outputs": [], "source": [ @@ -1253,33 +1066,25 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 40, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–8\n", + "\n", "kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)\n", " for k in range(1, 10)]\n", - "inertias = [model.inertia_ for model in kmeans_per_k]" - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [], - "source": [ + "inertias = [model.inertia_ for model in kmeans_per_k]\n", + "\n", "plt.figure(figsize=(8, 3.5))\n", "plt.plot(range(1, 10), inertias, \"bo-\")\n", - "plt.xlabel(\"$k$\", fontsize=14)\n", - "plt.ylabel(\"Inertia\", fontsize=14)\n", - "plt.annotate('Elbow',\n", - " xy=(4, inertias[3]),\n", - " xytext=(0.55, 0.55),\n", - " textcoords='figure fraction',\n", - " fontsize=16,\n", - " arrowprops=dict(facecolor='black', shrink=0.1)\n", - " )\n", + "plt.xlabel(\"$k$\")\n", + "plt.ylabel(\"Inertia\")\n", + "plt.annotate(\"\", xy=(4, inertias[3]), xytext=(4.45, 650),\n", + " arrowprops=dict(facecolor='black', shrink=0.1))\n", + "plt.text(4.5, 650, \"Elbow\", fontsize=14, horizontalalignment=\"center\")\n", "plt.axis([1, 8.5, 0, 1300])\n", + "plt.grid()\n", "save_fig(\"inertia_vs_k_plot\")\n", "plt.show()" ] @@ -1293,11 +1098,12 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": 41, "metadata": {}, "outputs": [], "source": [ - "plot_decision_boundaries(kmeans_per_k[4-1], X)\n", + "# not in the book\n", + "plot_decision_boundaries(kmeans_per_k[4 - 1], X)\n", "plt.show()" ] }, @@ -1305,7 +1111,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another approach is to look at the _silhouette score_, which is the mean _silhouette coefficient_ over all the instances. An instance's silhouette coefficient is equal to $(b - a)/\\max(a, b)$ where $a$ is the mean distance to the other instances in the same cluster (it is the _mean intra-cluster distance_), and $b$ is the _mean nearest-cluster distance_, that is the mean distance to the instances of the next closest cluster (defined as the one that minimizes $b$, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, while a coefficient close to 0 means that it is close to a cluster boundary, and finally a coefficient close to -1 means that the instance may have been assigned to the wrong cluster." + "Another approach is to look at the _silhouette score_, which is the mean _silhouette coefficient_ over all the instances. An instance's silhouette coefficient is equal to (_b_ - _a_) / max(_a_, _b_) where _a_ is the mean distance to the other instances in the same cluster (it is the _mean intra-cluster distance_), and _b_ is the _mean nearest-cluster distance_, that is the mean distance to the instances of the next closest cluster (defined as the one that minimizes _b_, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, while a coefficient close to 0 means that it is close to a cluster boundary, and finally a coefficient close to -1 means that the instance may have been assigned to the wrong cluster." 
] }, { @@ -1317,7 +1123,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 42, "metadata": {}, "outputs": [], "source": [ @@ -1326,7 +1132,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 43, "metadata": {}, "outputs": [], "source": [ @@ -1335,25 +1141,21 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 44, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–9\n", + "\n", "silhouette_scores = [silhouette_score(X, model.labels_)\n", - " for model in kmeans_per_k[1:]]" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [], - "source": [ + " for model in kmeans_per_k[1:]]\n", + "\n", "plt.figure(figsize=(8, 3))\n", "plt.plot(range(2, 10), silhouette_scores, \"bo-\")\n", - "plt.xlabel(\"$k$\", fontsize=14)\n", - "plt.ylabel(\"Silhouette score\", fontsize=14)\n", + "plt.xlabel(\"$k$\")\n", + "plt.ylabel(\"Silhouette score\")\n", "plt.axis([1.8, 8.5, 0.55, 0.7])\n", + "plt.grid()\n", "save_fig(\"silhouette_score_vs_k_plot\")\n", "plt.show()" ] @@ -1374,10 +1176,12 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–10\n", + "\n", "from sklearn.metrics import silhouette_samples\n", "from matplotlib.ticker import FixedLocator, FixedFormatter\n", "\n", @@ -1396,7 +1200,7 @@ " coeffs = silhouette_coefficients[y_pred == i]\n", " coeffs.sort()\n", "\n", - " color = mpl.cm.Spectral(i / k)\n", + " color = plt.cm.Spectral(i / k)\n", " plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,\n", " facecolor=color, edgecolor=color, alpha=0.7)\n", " ticks.append(pos + len(coeffs) // 2)\n", @@ -1414,7 +1218,7 @@ " plt.tick_params(labelbottom=False)\n", "\n", " plt.axvline(x=silhouette_scores[k - 2], color=\"red\", linestyle=\"--\")\n", - " plt.title(\"$k={}$\".format(k), fontsize=16)\n", + " plt.title(f\"$k={k}$\")\n", "\n", "save_fig(\"silhouette_analysis_plot\")\n", "plt.show()" @@ -1434,56 +1238,44 @@ "## Limits of K-Means" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's generate a more difficult dataset, with elongated blobs and varying densities, and show that K-Means struggles to cluster it correctly:" + ] + }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–11\n", + "\n", "X1, y1 = make_blobs(n_samples=1000, centers=((4, -4), (0, 0)), random_state=42)\n", "X1 = X1.dot(np.array([[0.374, 0.95], [0.732, 0.598]]))\n", "X2, y2 = make_blobs(n_samples=250, centers=1, random_state=42)\n", "X2 = X2 + [6, -8]\n", "X = np.r_[X1, X2]\n", - "y = np.r_[y1, y2]" - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [], - "source": [ - "plot_clusters(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": {}, - "outputs": [], - "source": [ - "kmeans_good = KMeans(n_clusters=3, init=np.array([[-1.5, 2.5], [0.5, 0], [4, 0]]), n_init=1, random_state=42)\n", + "y = np.r_[y1, y2]\n", + "\n", + "kmeans_good = KMeans(n_clusters=3,\n", + " init=np.array([[-1.5, 2.5], [0.5, 0], [4, 0]]),\n", + " n_init=1, random_state=42)\n", "kmeans_bad = KMeans(n_clusters=3, random_state=42)\n", "kmeans_good.fit(X)\n", - "kmeans_bad.fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, 
- "outputs": [], - "source": [ + "kmeans_bad.fit(X)\n", + "\n", "plt.figure(figsize=(10, 3.2))\n", "\n", "plt.subplot(121)\n", "plot_decision_boundaries(kmeans_good, X)\n", - "plt.title(\"Inertia = {:.1f}\".format(kmeans_good.inertia_), fontsize=14)\n", + "plt.title(f\"Inertia = {kmeans_good.inertia_:.1f}\")\n", "\n", "plt.subplot(122)\n", "plot_decision_boundaries(kmeans_bad, X, show_ylabels=False)\n", - "plt.title(\"Inertia = {:.1f}\".format(kmeans_bad.inertia_), fontsize=14)\n", + "plt.title(f\"Inertia = {kmeans_bad.inertia_:.1f}\")\n", "\n", "save_fig(\"bad_kmeans_plot\")\n", "plt.show()" @@ -1497,34 +1289,44 @@ ] }, { - "cell_type": "code", - "execution_count": 74, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# Download the ladybug image\n", - "root = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", - "filename = \"ladybug.png\"\n", - "if not (images_path / filename).is_file():\n", - " print(\"Downloading\", filename)\n", - " url = root + \"images/unsupervised_learning/\" + filename\n", - " urllib.request.urlretrieve(url, os.path.join(images_path, filename))" + "Download the ladybug image:" ] }, { "cell_type": "code", - "execution_count": 75, + "execution_count": 47, "metadata": {}, "outputs": [], "source": [ - "from matplotlib.image import imread\n", - "image = imread(os.path.join(images_path, filename))\n", + "# not in the book\n", + "\n", + "root = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", + "filename = \"ladybug.png\"\n", + "filepath = IMAGES_PATH / filename\n", + "if not filepath.is_file():\n", + " print(\"Downloading\", filename)\n", + " url = f\"{root}/images/unsupervised_learning/{filename}\"\n", + " urllib.request.urlretrieve(url, filepath)" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "import PIL\n", + "\n", + "image = np.asarray(PIL.Image.open(filepath))\n", "image.shape" ] }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 49, "metadata": {}, "outputs": [], "source": [ @@ -1536,249 +1338,37 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 50, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–12\n", + "\n", "segmented_imgs = []\n", "n_colors = (10, 8, 6, 4, 2)\n", "for n_clusters in n_colors:\n", " kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)\n", " segmented_img = kmeans.cluster_centers_[kmeans.labels_]\n", - " segmented_imgs.append(segmented_img.reshape(image.shape))" - ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": {}, - "outputs": [], - "source": [ + " segmented_imgs.append(segmented_img.reshape(image.shape))\n", + "\n", "plt.figure(figsize=(10,5))\n", "plt.subplots_adjust(wspace=0.05, hspace=0.1)\n", "\n", - "plt.subplot(231)\n", + "plt.subplot(2, 3, 1)\n", "plt.imshow(image)\n", "plt.title(\"Original image\")\n", "plt.axis('off')\n", "\n", "for idx, n_clusters in enumerate(n_colors):\n", - " plt.subplot(232 + idx)\n", - " plt.imshow(segmented_imgs[idx])\n", - " plt.title(\"{} colors\".format(n_clusters))\n", + " plt.subplot(2, 3, 2 + idx)\n", + " plt.imshow(segmented_imgs[idx] / 255)\n", + " plt.title(f\"{n_clusters} colors\")\n", " plt.axis('off')\n", "\n", "save_fig('image_segmentation_diagram', tight_layout=False)\n", "plt.show()" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Using Clustering for Preprocessing" - ] - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "Let's tackle the _digits dataset_ which is a simple MNIST-like dataset containing 1,797 grayscale 8×8 images representing digits 0 to 9." - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import load_digits" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": {}, - "outputs": [], - "source": [ - "X_digits, y_digits = load_digits(return_X_y=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's split it into a training set and a test set:" - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "metadata": {}, - "outputs": [], - "source": [ - "X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's fit a Logistic Regression model and evaluate it on the test set:" - ] - }, - { - "cell_type": "code", - "execution_count": 83, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.linear_model import LogisticRegression" - ] - }, - { - "cell_type": "code", - "execution_count": 84, - "metadata": {}, - "outputs": [], - "source": [ - "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n", - "log_reg.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 85, - "metadata": {}, - "outputs": [], - "source": [ - "log_reg_score = log_reg.score(X_test, y_test)\n", - "log_reg_score" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay, that's our baseline: 96.89% accuracy. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to the 50 clusters, then apply a logistic regression model:" - ] - }, - { - "cell_type": "code", - "execution_count": 86, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.pipeline import Pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": 87, - "metadata": {}, - "outputs": [], - "source": [ - "pipeline = Pipeline([\n", - " (\"kmeans\", KMeans(n_clusters=50, random_state=42)),\n", - " (\"log_reg\", LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)),\n", - "])\n", - "pipeline.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 88, - "metadata": {}, - "outputs": [], - "source": [ - "pipeline_score = pipeline.score(X_test, y_test)\n", - "pipeline_score" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "How much did the error rate drop?" - ] - }, - { - "cell_type": "code", - "execution_count": 89, - "metadata": {}, - "outputs": [], - "source": [ - "1 - (1 - pipeline_score) / (1 - log_reg_score)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "How about that? We reduced the error rate by over 35%! But we chose the number of clusters $k$ completely arbitrarily, we can surely do better. 
Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for $k$ is much simpler than earlier: there's no need to perform silhouette analysis or minimize the inertia, the best value of $k$ is simply the one that results in the best classification performance." - ] - }, - { - "cell_type": "code", - "execution_count": 90, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import GridSearchCV" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Warning**: the following cell may take close to 20 minutes to run, or more depending on your hardware." - ] - }, - { - "cell_type": "code", - "execution_count": 91, - "metadata": {}, - "outputs": [], - "source": [ - "param_grid = dict(kmeans__n_clusters=range(2, 100))\n", - "grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)\n", - "grid_clf.fit(X_train, y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's see what the best number of clusters is:" - ] - }, - { - "cell_type": "code", - "execution_count": 92, - "metadata": {}, - "outputs": [], - "source": [ - "grid_clf.best_params_" - ] - }, - { - "cell_type": "code", - "execution_count": 93, - "metadata": {}, - "outputs": [], - "source": [ - "grid_clf.score(X_test, y_test)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1790,7 +1380,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another use case for clustering is in semi-supervised learning, when we have plenty of unlabeled instances and very few labeled instances." + "Another use case for clustering is semi-supervised learning, when we have plenty of unlabeled instances and very few labeled instances." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's tackle the _digits dataset_ which is a simple MNIST-like dataset containing 1,797 grayscale 8×8 images representing digits 0 to 9." 
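To get a feel for the data, here is a quick peek at the first few images (not in the book; this sketch loads the dataset itself so it can run standalone):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)
plt.figure(figsize=(8, 1.5))
for idx in range(10):
    plt.subplot(1, 10, idx + 1)
    plt.imshow(X_digits[idx].reshape(8, 8), cmap="binary")
    plt.title(y_digits[idx])  # the true label
    plt.axis("off")
plt.show()
```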
+ ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import load_digits\n", + "\n", + "X_digits, y_digits = load_digits(return_X_y=True)\n", + "X_train, y_train = X_digits[:1400], y_digits[:1400]\n", + "X_test, y_test = X_digits[1400:], y_digits[1400:]" ] }, { @@ -1802,24 +1412,38 @@ }, { "cell_type": "code", - "execution_count": 94, + "execution_count": 52, "metadata": {}, "outputs": [], "source": [ - "n_labeled = 50" + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "n_labeled = 50\n", + "log_reg = LogisticRegression(max_iter=10_000)\n", + "log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])" ] }, { "cell_type": "code", - "execution_count": 95, + "execution_count": 53, "metadata": {}, "outputs": [], "source": [ - "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)\n", - "log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])\n", "log_reg.score(X_test, y_test)" ] }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [], + "source": [ + "# not in the book – measure the accuracy when we use the whole training set\n", + "log_reg_full = LogisticRegression(max_iter=10_000)\n", + "log_reg_full.fit(X_train, y_train)\n", + "log_reg_full.score(X_test, y_test)" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1829,22 +1453,14 @@ }, { "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [], - "source": [ - "k = 50" - ] - }, - { - "cell_type": "code", - "execution_count": 97, + "execution_count": 55, "metadata": {}, "outputs": [], "source": [ + "k = 50\n", "kmeans = KMeans(n_clusters=k, random_state=42)\n", "X_digits_dist = kmeans.fit_transform(X_train)\n", - "representative_digit_idx = np.argmin(X_digits_dist, axis=0)\n", + "representative_digit_idx = X_digits_dist.argmin(axis=0)\n", "X_representative_digits = X_train[representative_digit_idx]" ] }, @@ -1857,14 +1473,17 @@ }, { "cell_type": "code", - "execution_count": 98, + "execution_count": 56, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this cell generates and saves Figure 8–13\n", + "\n", "plt.figure(figsize=(8, 2))\n", "for index, X_representative_digit in enumerate(X_representative_digits):\n", " plt.subplot(k // 10, 10, index + 1)\n", - " plt.imshow(X_representative_digit.reshape(8, 8), cmap=\"binary\", interpolation=\"bilinear\")\n", + " plt.imshow(X_representative_digit.reshape(8, 8), cmap=\"binary\",\n", + " interpolation=\"bilinear\")\n", " plt.axis('off')\n", "\n", "save_fig(\"representative_images_diagram\", tight_layout=False)\n", @@ -1873,25 +1492,17 @@ }, { "cell_type": "code", - "execution_count": 99, - "metadata": {}, - "outputs": [], - "source": [ - "y_train[representative_digit_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 100, + "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "y_representative_digits = np.array([\n", - " 0, 1, 3, 2, 7, 6, 4, 6, 9, 5,\n", - " 1, 2, 9, 5, 2, 7, 8, 1, 8, 6,\n", - " 3, 2, 5, 4, 5, 4, 0, 3, 2, 6,\n", - " 1, 7, 7, 9, 1, 8, 6, 5, 4, 8,\n", - " 5, 3, 3, 6, 7, 9, 7, 8, 4, 9])" + " 1, 3, 6, 0, 7, 9, 2, 4, 8, 9,\n", + " 5, 4, 7, 1, 2, 6, 1, 2, 5, 1,\n", + " 4, 1, 3, 3, 8, 8, 2, 5, 6, 9,\n", + " 1, 4, 0, 6, 8, 3, 4, 6, 7, 2,\n", + " 4, 1, 0, 7, 5, 1, 9, 9, 3, 7\n", + "])" ] }, { @@ -1903,11 +1514,11 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 58, "metadata": {}, "outputs": [], "source": [ - "log_reg = 
LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n", + "log_reg = LogisticRegression(max_iter=10_000)\n", "log_reg.fit(X_representative_digits, y_representative_digits)\n", "log_reg.score(X_test, y_test)" ] @@ -1916,7 +1527,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Wow! We jumped from 83.3% accuracy to 91.3%, although we are still only training the model on 50 instances. Since it's often costly and painful to label instances, especially when it has to be done manually by experts, it's a good idea to make them label representative instances rather than just random instances." + "Wow! We jumped from 74.8% accuracy to 84.9%, although we are still only training the model on 50 instances. Since it's often costly and painful to label instances, especially when it has to be done manually by experts, it's a good idea to make them label representative instances rather than just random instances." ] }, { @@ -1928,28 +1539,28 @@ }, { "cell_type": "code", - "execution_count": 102, + "execution_count": 59, "metadata": {}, "outputs": [], "source": [ - "y_train_propagated = np.empty(len(X_train), dtype=np.int32)\n", + "y_train_propagated = np.empty(len(X_train), dtype=np.int64)\n", "for i in range(k):\n", - " y_train_propagated[kmeans.labels_==i] = y_representative_digits[i]" + " y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]" ] }, { "cell_type": "code", - "execution_count": 103, + "execution_count": 60, "metadata": {}, "outputs": [], "source": [ - "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n", + "log_reg = LogisticRegression(max_iter=10_000)\n", "log_reg.fit(X_train, y_train_propagated)" ] }, { "cell_type": "code", - "execution_count": 104, + "execution_count": 61, "metadata": {}, "outputs": [], "source": [ @@ -1960,16 +1571,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We got a tiny little accuracy boost. Better than nothing, but we should probably have propagated the labels only to the instances closest to the centroid, because by propagating to the full cluster, we have certainly included some outliers. Let's only propagate the labels to the 75th percentile closest to the centroid:" + "We got another significant accuracy boost! 
Let's see if we can do even better by ignoring the 1% of instances that are farthest from their cluster center; this should eliminate some outliers:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 105,
+   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
-    "percentile_closest = 75\n",
+    "percentile_closest = 99\n",
    "\n",
    "X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]\n",
    "for i in range(k):\n",
@@ -1977,15 +1588,8 @@
    "    cluster_dist = X_cluster_dist[in_cluster]\n",
    "    cutoff_distance = np.percentile(cluster_dist, percentile_closest)\n",
    "    above_cutoff = (X_cluster_dist > cutoff_distance)\n",
-    "    X_cluster_dist[in_cluster & above_cutoff] = -1"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 106,
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "    X_cluster_dist[in_cluster & above_cutoff] = -1\n",
+    "\n",
    "partially_propagated = (X_cluster_dist != -1)\n",
    "X_train_partially_propagated = X_train[partially_propagated]\n",
    "y_train_partially_propagated = y_train_propagated[partially_propagated]"
@@ -1993,20 +1597,12 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 107,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n",
-    "log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 108,
+   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
+    "log_reg = LogisticRegression(max_iter=10_000)\n",
+    "log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)\n",
    "log_reg.score(X_test, y_test)"
   ]
  },
@@ -2014,23 +1610,23 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "A bit better. With just 50 labeled instances (just 5 examples per class on average!), we got 92.7% performance, which is getting closer to the performance of logistic regression on the fully labeled _digits_ dataset (which was 96.9%)."
+    "Wow, another accuracy boost! We have even slightly surpassed the performance we got by training on the fully labeled training set!"
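,
+    "\n",
+    "\n",
+    "**Bonus (not in the book)**: label propagation can also be done with Scikit-Learn's built-in `LabelSpreading` class (in the `sklearn.semi_supervised` package), which spreads the 50 labels along a k-nearest-neighbors similarity graph instead of using the K-Means clusters. The following is just a sketch, and the hyperparameter values are reasonable guesses rather than tuned values:\n",
+    "\n",
+    "```python\n",
+    "# a sketch only – assumes X_train, representative_digit_idx and\n",
+    "# y_representative_digits are unchanged from the cells above\n",
+    "from sklearn.semi_supervised import LabelSpreading\n",
+    "\n",
+    "y_semi = np.full(len(X_train), -1)  # -1 marks an unlabeled instance\n",
+    "y_semi[representative_digit_idx] = y_representative_digits\n",
+    "label_spreading = LabelSpreading(kernel=\"knn\", n_neighbors=10)\n",
+    "label_spreading.fit(X_train, y_semi)\n",
+    "\n",
+    "log_reg_ls = LogisticRegression(max_iter=10_000)\n",
+    "log_reg_ls.fit(X_train, label_spreading.transduction_)\n",
+    "log_reg_ls.score(X_test, y_test)\n",
+    "```\n",
+    "\n",
+    "Either way, the takeaway is the same: when labels are scarce, propagating them carefully buys a lot of accuracy."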
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This is because the propagated labels are actually pretty good: their accuracy is close to 96%:" + "Our propagated labels are actually pretty good: their accuracy is about 97.6%:" ] }, { "cell_type": "code", - "execution_count": 109, + "execution_count": 64, "metadata": {}, "outputs": [], "source": [ - "np.mean(y_train_partially_propagated == y_train[partially_propagated])" + "(y_train_partially_propagated == y_train[partially_propagated]).mean()" ] }, { @@ -2051,44 +1647,21 @@ }, { "cell_type": "code", - "execution_count": 110, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.datasets import make_moons" - ] - }, - { - "cell_type": "code", - "execution_count": 111, - "metadata": {}, - "outputs": [], - "source": [ - "X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)" - ] - }, - { - "cell_type": "code", - "execution_count": 112, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.cluster import DBSCAN" - ] - }, - { - "cell_type": "code", - "execution_count": 113, + "execution_count": 65, "metadata": {}, "outputs": [], "source": [ + "from sklearn.cluster import DBSCAN\n", + "from sklearn.datasets import make_moons\n", + "\n", + "X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)\n", "dbscan = DBSCAN(eps=0.05, min_samples=5)\n", "dbscan.fit(X)" ] }, { "cell_type": "code", - "execution_count": 114, + "execution_count": 66, "metadata": {}, "outputs": [], "source": [ @@ -2097,16 +1670,7 @@ }, { "cell_type": "code", - "execution_count": 115, - "metadata": {}, - "outputs": [], - "source": [ - "len(dbscan.core_sample_indices_)" - ] - }, - { - "cell_type": "code", - "execution_count": 116, + "execution_count": 67, "metadata": {}, "outputs": [], "source": [ @@ -2115,38 +1679,21 @@ }, { "cell_type": "code", - "execution_count": 117, + "execution_count": 68, "metadata": {}, "outputs": [], "source": [ - "dbscan.components_[:3]" + "dbscan.components_" ] }, { "cell_type": "code", - "execution_count": 118, - "metadata": {}, - "outputs": [], - "source": [ - "np.unique(dbscan.labels_)" - ] - }, - { - "cell_type": "code", - "execution_count": 119, - "metadata": {}, - "outputs": [], - "source": [ - "dbscan2 = DBSCAN(eps=0.2)\n", - "dbscan2.fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 120, + "execution_count": 69, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this cell generates and saves Figure 8–14\n", + "\n", "def plot_dbscan(dbscan, X, size, show_xlabels=True, show_ylabels=True):\n", " core_mask = np.zeros_like(dbscan.labels_, dtype=bool)\n", " core_mask[dbscan.core_sample_indices_] = True\n", @@ -2159,27 +1706,27 @@ " \n", " plt.scatter(cores[:, 0], cores[:, 1],\n", " c=dbscan.labels_[core_mask], marker='o', s=size, cmap=\"Paired\")\n", - " plt.scatter(cores[:, 0], cores[:, 1], marker='*', s=20, c=dbscan.labels_[core_mask])\n", + " plt.scatter(cores[:, 0], cores[:, 1], marker='*', s=20,\n", + " c=dbscan.labels_[core_mask])\n", " plt.scatter(anomalies[:, 0], anomalies[:, 1],\n", " c=\"r\", marker=\"x\", s=100)\n", - " plt.scatter(non_cores[:, 0], non_cores[:, 1], c=dbscan.labels_[non_core_mask], marker=\".\")\n", + " plt.scatter(non_cores[:, 0], non_cores[:, 1],\n", + " c=dbscan.labels_[non_core_mask], marker=\".\")\n", " if show_xlabels:\n", - " plt.xlabel(\"$x_1$\", fontsize=14)\n", + " plt.xlabel(\"$x_1$\")\n", " else:\n", " plt.tick_params(labelbottom=False)\n", " if show_ylabels:\n", - " plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n", + " 
plt.ylabel(\"$x_2$\", rotation=0)\n", " else:\n", " plt.tick_params(labelleft=False)\n", - " plt.title(\"eps={:.2f}, min_samples={}\".format(dbscan.eps, dbscan.min_samples), fontsize=14)" - ] - }, - { - "cell_type": "code", - "execution_count": 121, - "metadata": {}, - "outputs": [], - "source": [ + " plt.title(f\"eps={dbscan.eps:.2f}, min_samples={dbscan.min_samples}\")\n", + " plt.grid()\n", + " plt.gca().set_axisbelow(True)\n", + "\n", + "dbscan2 = DBSCAN(eps=0.2)\n", + "dbscan2.fit(X)\n", + "\n", "plt.figure(figsize=(9, 3.2))\n", "\n", "plt.subplot(121)\n", @@ -2189,40 +1736,33 @@ "plot_dbscan(dbscan2, X, size=600, show_ylabels=False)\n", "\n", "save_fig(\"dbscan_plot\")\n", - "plt.show()\n" + "plt.show()" ] }, { "cell_type": "code", - "execution_count": 122, + "execution_count": 70, "metadata": {}, "outputs": [], "source": [ - "dbscan = dbscan2" + "dbscan = dbscan2 # not in the book – the text says we now use eps=0.2" ] }, { "cell_type": "code", - "execution_count": 123, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.neighbors import KNeighborsClassifier" - ] - }, - { - "cell_type": "code", - "execution_count": 124, + "execution_count": 71, "metadata": {}, "outputs": [], "source": [ + "from sklearn.neighbors import KNeighborsClassifier\n", + "\n", "knn = KNeighborsClassifier(n_neighbors=50)\n", "knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])" ] }, { "cell_type": "code", - "execution_count": 125, + "execution_count": 72, "metadata": {}, "outputs": [], "source": [ @@ -2232,7 +1772,7 @@ }, { "cell_type": "code", - "execution_count": 126, + "execution_count": 73, "metadata": {}, "outputs": [], "source": [ @@ -2241,10 +1781,12 @@ }, { "cell_type": "code", - "execution_count": 127, + "execution_count": 74, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this cell generates and saves Figure 8–15\n", + "\n", "plt.figure(figsize=(6, 3))\n", "plot_decision_boundaries(knn, X, show_centroids=False)\n", "plt.scatter(X_new[:, 0], X_new[:, 1], c=\"b\", marker=\"+\", s=200, zorder=10)\n", @@ -2254,7 +1796,7 @@ }, { "cell_type": "code", - "execution_count": 128, + "execution_count": 75, "metadata": {}, "outputs": [], "source": [ @@ -2271,6 +1813,13 @@ "## Other Clustering Algorithms" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The code in this section is bonus material, not in the book." 
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -2280,7 +1829,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 129,
+   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2289,7 +1838,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 130,
+   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2299,7 +1848,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 131,
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sc1.affinity_matrix_.round(2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2309,38 +1867,31 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 132,
+   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
-    "np.percentile(sc1.affinity_matrix_, 95)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 133,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def plot_spectral_clustering(sc, X, size, alpha, show_xlabels=True, show_ylabels=True):\n",
-    "    plt.scatter(X[:, 0], X[:, 1], marker='o', s=size, c='gray', cmap=\"Paired\", alpha=alpha)\n",
+    "def plot_spectral_clustering(sc, X, size, alpha, show_xlabels=True,\n",
+    "                             show_ylabels=True):\n",
+    "    plt.scatter(X[:, 0], X[:, 1], marker='o', s=size, c='gray', cmap=\"Paired\",\n",
+    "                alpha=alpha)\n",
    "    plt.scatter(X[:, 0], X[:, 1], marker='o', s=30, c='w')\n",
    "    plt.scatter(X[:, 0], X[:, 1], marker='.', s=10, c=sc.labels_, cmap=\"Paired\")\n",
    "    \n",
    "    if show_xlabels:\n",
-    "        plt.xlabel(\"$x_1$\", fontsize=14)\n",
+    "        plt.xlabel(\"$x_1$\")\n",
    "    else:\n",
    "        plt.tick_params(labelbottom=False)\n",
    "    if show_ylabels:\n",
-    "        plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n",
+    "        plt.ylabel(\"$x_2$\", rotation=0)\n",
    "    else:\n",
    "        plt.tick_params(labelleft=False)\n",
-    "    plt.title(\"RBF gamma={}\".format(sc.gamma), fontsize=14)"
+    "    plt.title(f\"RBF gamma={sc.gamma}\")"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 134,
+   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2352,7 +1903,7 @@
    "plt.subplot(122)\n",
    "plot_spectral_clustering(sc2, X, size=4000, alpha=0.01, show_ylabels=False)\n",
    "\n",
-    "plt.show()\n"
+    "plt.show()"
   ]
  },
@@ -2364,7 +1915,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 135,
+   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2373,7 +1924,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 136,
+   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2383,7 +1934,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 137,
+   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2394,7 +1945,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 138,
+   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2403,7 +1954,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 139,
+   "execution_count": 86,
   "metadata": {
    "scrolled": true
   },
@@ -2419,9 +1970,16 @@
    "# Gaussian Mixtures"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's generate the same dataset as earlier with three ellipsoids (the one K-Means had trouble with):"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 140,
+   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2442,7 +2000,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 141,
+   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2451,7 +2009,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 142,
+   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2468,7 +2026,7 @@
  },
  {
"cell_type": "code", - "execution_count": 143, + "execution_count": 90, "metadata": {}, "outputs": [], "source": [ @@ -2477,7 +2035,7 @@ }, { "cell_type": "code", - "execution_count": 144, + "execution_count": 91, "metadata": {}, "outputs": [], "source": [ @@ -2486,7 +2044,7 @@ }, { "cell_type": "code", - "execution_count": 145, + "execution_count": 92, "metadata": {}, "outputs": [], "source": [ @@ -2502,7 +2060,7 @@ }, { "cell_type": "code", - "execution_count": 146, + "execution_count": 93, "metadata": {}, "outputs": [], "source": [ @@ -2518,7 +2076,7 @@ }, { "cell_type": "code", - "execution_count": 147, + "execution_count": 94, "metadata": {}, "outputs": [], "source": [ @@ -2534,7 +2092,7 @@ }, { "cell_type": "code", - "execution_count": 148, + "execution_count": 95, "metadata": {}, "outputs": [], "source": [ @@ -2543,11 +2101,11 @@ }, { "cell_type": "code", - "execution_count": 149, + "execution_count": 96, "metadata": {}, "outputs": [], "source": [ - "gm.predict_proba(X)" + "gm.predict_proba(X).round(3)" ] }, { @@ -2559,7 +2117,7 @@ }, { "cell_type": "code", - "execution_count": 150, + "execution_count": 97, "metadata": {}, "outputs": [], "source": [ @@ -2569,7 +2127,7 @@ }, { "cell_type": "code", - "execution_count": 151, + "execution_count": 98, "metadata": {}, "outputs": [], "source": [ @@ -2592,11 +2150,11 @@ }, { "cell_type": "code", - "execution_count": 152, + "execution_count": 99, "metadata": {}, "outputs": [], "source": [ - "gm.score_samples(X)" + "gm.score_samples(X).round(2)" ] }, { @@ -2608,10 +2166,12 @@ }, { "cell_type": "code", - "execution_count": 153, + "execution_count": 100, "metadata": {}, "outputs": [], "source": [ + "# not in the book – bonus material\n", + "\n", "resolution = 100\n", "grid = np.arange(-10, 10, 1 / resolution)\n", "xx, yy = np.meshgrid(grid, grid)\n", @@ -2631,10 +2191,12 @@ }, { "cell_type": "code", - "execution_count": 154, + "execution_count": 101, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this cells generates and saves Figure 8–16\n", + "\n", "from matplotlib.colors import LogNorm\n", "\n", "def plot_gaussian_mixture(clusterer, X, resolution=1000, show_ylabels=True):\n", @@ -2661,19 +2223,12 @@ " plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)\n", " plot_centroids(clusterer.means_, clusterer.weights_)\n", "\n", - " plt.xlabel(\"$x_1$\", fontsize=14)\n", + " plt.xlabel(\"$x_1$\")\n", " if show_ylabels:\n", - " plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n", + " plt.ylabel(\"$x_2$\", rotation=0)\n", " else:\n", - " plt.tick_params(labelleft=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 155, - "metadata": {}, - "outputs": [], - "source": [ + " plt.tick_params(labelleft=False)\n", + "\n", "plt.figure(figsize=(8, 4))\n", "\n", "plot_gaussian_mixture(gm, X)\n", @@ -2687,52 +2242,44 @@ "metadata": {}, "source": [ "You can impose constraints on the covariance matrices that the algorithm looks for by setting the `covariance_type` hyperparameter:\n", - "* `\"full\"` (default): no constraint, all clusters can take on any ellipsoidal shape of any size.\n", - "* `\"tied\"`: all clusters must have the same shape, which can be any ellipsoid (i.e., they all share the same covariance matrix).\n", "* `\"spherical\"`: all clusters must be spherical, but they can have different diameters (i.e., different variances).\n", - "* `\"diag\"`: clusters can take on any ellipsoidal shape of any size, but the ellipsoid's axes must be parallel to the axes (i.e., the covariance matrices must be diagonal)." 
+ "* `\"diag\"`: clusters can take on any ellipsoidal shape of any size, but the ellipsoid's axes must be parallel to the axes (i.e., the covariance matrices must be diagonal).\n", + "* `\"tied\"`: all clusters must have the same shape, which can be any ellipsoid (i.e., they all share the same covariance matrix).\n", + "* `\"full\"` (default): no constraint, all clusters can take on any ellipsoidal shape of any size." ] }, { "cell_type": "code", - "execution_count": 156, + "execution_count": 102, "metadata": {}, "outputs": [], "source": [ - "gm_full = GaussianMixture(n_components=3, n_init=10, covariance_type=\"full\", random_state=42)\n", - "gm_tied = GaussianMixture(n_components=3, n_init=10, covariance_type=\"tied\", random_state=42)\n", - "gm_spherical = GaussianMixture(n_components=3, n_init=10, covariance_type=\"spherical\", random_state=42)\n", - "gm_diag = GaussianMixture(n_components=3, n_init=10, covariance_type=\"diag\", random_state=42)\n", + "# not in the book – this code generates and saves Figure 8–17\n", + "\n", + "gm_full = GaussianMixture(n_components=3, n_init=10,\n", + " covariance_type=\"full\", random_state=42)\n", + "gm_tied = GaussianMixture(n_components=3, n_init=10,\n", + " covariance_type=\"tied\", random_state=42)\n", + "gm_spherical = GaussianMixture(n_components=3, n_init=10,\n", + " covariance_type=\"spherical\", random_state=42)\n", + "gm_diag = GaussianMixture(n_components=3, n_init=10,\n", + " covariance_type=\"diag\", random_state=42)\n", "gm_full.fit(X)\n", "gm_tied.fit(X)\n", "gm_spherical.fit(X)\n", - "gm_diag.fit(X)" - ] - }, - { - "cell_type": "code", - "execution_count": 157, - "metadata": {}, - "outputs": [], - "source": [ + "gm_diag.fit(X)\n", + "\n", "def compare_gaussian_mixtures(gm1, gm2, X):\n", " plt.figure(figsize=(9, 4))\n", "\n", " plt.subplot(121)\n", " plot_gaussian_mixture(gm1, X)\n", - " plt.title('covariance_type=\"{}\"'.format(gm1.covariance_type), fontsize=14)\n", + " plt.title(f'covariance_type=\"{gm1.covariance_type}\"')\n", "\n", " plt.subplot(122)\n", " plot_gaussian_mixture(gm2, X, show_ylabels=False)\n", - " plt.title('covariance_type=\"{}\"'.format(gm2.covariance_type), fontsize=14)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 158, - "metadata": {}, - "outputs": [], - "source": [ + " plt.title(f'covariance_type=\"{gm2.covariance_type}\"')\n", + "\n", "compare_gaussian_mixtures(gm_tied, gm_spherical, X)\n", "\n", "save_fig(\"covariance_type_plot\")\n", @@ -2741,10 +2288,11 @@ }, { "cell_type": "code", - "execution_count": 159, + "execution_count": 103, "metadata": {}, "outputs": [], "source": [ + "# not in the book – comparing covariance_type=\"full\" and covariance_type=\"diag\"\n", "compare_gaussian_mixtures(gm_full, gm_diag, X)\n", "plt.tight_layout()\n", "plt.show()" @@ -2761,26 +2309,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Gaussian Mixtures can be used for _anomaly detection_: instances located in low-density regions can be considered anomalies. You must define what density threshold you want to use. For example, in a manufacturing company that tries to detect defective products, the ratio of defective products is usually well-known. Say it is equal to 4%, then you can set the density threshold to be the value that results in having 4% of the instances located in areas below that threshold density:" + "Gaussian Mixtures can be used for _anomaly detection_: instances located in low-density regions can be considered anomalies. You must define what density threshold you want to use. 
For example, in a manufacturing company that tries to detect defective products, the ratio of defective products is usually well-known. Say it is equal to 2%, then you can set the density threshold to be the value that results in having 2% of the instances located in areas below that threshold density:" ] }, { "cell_type": "code", - "execution_count": 160, + "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "densities = gm.score_samples(X)\n", - "density_threshold = np.percentile(densities, 4)\n", + "density_threshold = np.percentile(densities, 2)\n", "anomalies = X[densities < density_threshold]" ] }, { "cell_type": "code", - "execution_count": 161, + "execution_count": 105, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this code generates and saves Figure 8–18\n", + "\n", "plt.figure(figsize=(8, 4))\n", "\n", "plot_gaussian_mixture(gm, X)\n", @@ -2817,7 +2367,88 @@ }, { "cell_type": "code", - "execution_count": 162, + "execution_count": 106, + "metadata": {}, + "outputs": [], + "source": [ + "# not in the book – this cell generates and saves Figure 8–19\n", + "\n", + "from scipy.stats import norm\n", + "\n", + "x_val = 2.5\n", + "std_val = 1.3\n", + "x_range = [-6, 4]\n", + "x_proba_range = [-2, 2]\n", + "stds_range = [1, 2]\n", + "\n", + "xs = np.linspace(x_range[0], x_range[1], 501)\n", + "stds = np.linspace(stds_range[0], stds_range[1], 501)\n", + "Xs, Stds = np.meshgrid(xs, stds)\n", + "Z = 2 * norm.pdf(Xs - 1.0, 0, Stds) + norm.pdf(Xs + 4.0, 0, Stds)\n", + "Z = Z / Z.sum(axis=1)[:,np.newaxis] / (xs[1] - xs[0])\n", + "\n", + "x_example_idx = (xs >= x_val).argmax() # index of the first value >= x_val\n", + "max_idx = Z[:, x_example_idx].argmax()\n", + "max_val = Z[:, x_example_idx].max()\n", + "s_example_idx = (stds >= std_val).argmax()\n", + "x_range_min_idx = (xs >= x_proba_range[0]).argmax()\n", + "x_range_max_idx = (xs >= x_proba_range[1]).argmax()\n", + "log_max_idx = np.log(Z[:, x_example_idx]).argmax()\n", + "log_max_val = np.log(Z[:, x_example_idx]).max()\n", + "\n", + "plt.figure(figsize=(8, 4.5))\n", + "\n", + "plt.subplot(2, 2, 1)\n", + "plt.contourf(Xs, Stds, Z, cmap=\"GnBu\")\n", + "plt.plot([-6, 4], [std_val, std_val], \"k-\", linewidth=2)\n", + "plt.plot([x_val, x_val], [1, 2], \"b-\", linewidth=2)\n", + "plt.ylabel(r\"$\\theta$\", rotation=0, labelpad=10)\n", + "plt.title(r\"Model $f(x; \\theta)$\")\n", + "\n", + "plt.subplot(2, 2, 2)\n", + "plt.plot(stds, Z[:, x_example_idx], \"b-\")\n", + "plt.plot(stds[max_idx], max_val, \"r.\")\n", + "plt.plot([stds[max_idx], stds[max_idx]], [0, max_val], \"r:\")\n", + "plt.plot([0, stds[max_idx]], [max_val, max_val], \"r:\")\n", + "plt.text(stds[max_idx]+ 0.01, 0.081, r\"$\\hat{\\theta}$\", fontsize=14)\n", + "plt.text(stds[max_idx]+ 0.01, max_val - 0.005, r\"$Max$\", fontsize=14)\n", + "plt.text(1.01, max_val - 0.008, r\"$\\hat{\\mathcal{L}}$\", fontsize=14)\n", + "plt.ylabel(r\"$\\mathcal{L}$\", rotation=0, labelpad=10)\n", + "plt.title(fr\"$\\mathcal{{L}}(\\theta|x={x_val}) = f(x={x_val}; \\theta)$\")\n", + "plt.grid()\n", + "plt.axis([1, 2, 0.08, 0.12])\n", + "\n", + "plt.subplot(2, 2, 3)\n", + "plt.plot(xs, Z[s_example_idx], \"k-\")\n", + "plt.fill_between(xs[x_range_min_idx:x_range_max_idx+1],\n", + " Z[s_example_idx, x_range_min_idx:x_range_max_idx+1], alpha=0.2)\n", + "plt.xlabel(r\"$x$\")\n", + "plt.ylabel(\"PDF\")\n", + "plt.title(fr\"PDF $f(x; \\theta={std_val})$\")\n", + "plt.grid()\n", + "plt.axis([-6, 4, 0, 0.25])\n", + "\n", + "plt.subplot(2, 2, 4)\n", + "plt.plot(stds, 
np.log(Z[:, x_example_idx]), \"b-\")\n", + "plt.plot(stds[log_max_idx], log_max_val, \"r.\")\n", + "plt.plot([stds[log_max_idx], stds[log_max_idx]], [-5, log_max_val], \"r:\")\n", + "plt.plot([0, stds[log_max_idx]], [log_max_val, log_max_val], \"r:\")\n", + "plt.text(stds[log_max_idx]+ 0.01, log_max_val - 0.05, r\"$Max$\", fontsize=14)\n", + "plt.text(stds[log_max_idx]+ 0.01, -2.49, r\"$\\hat{\\theta}$\", fontsize=14)\n", + "plt.text(1.01, log_max_val - 0.07, r\"$\\log \\, \\hat{\\mathcal{L}}$\", fontsize=14)\n", + "plt.xlabel(r\"$\\theta$\")\n", + "plt.ylabel(r\"$\\log\\mathcal{L}$\", rotation=0, labelpad=10)\n", + "plt.title(fr\"$\\log \\, \\mathcal{{L}}(\\theta|x={x_val})$\")\n", + "plt.grid()\n", + "plt.axis([1, 2, -2.5, -2.1])\n", + "\n", + "save_fig(\"likelihood_function_plot\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 107, "metadata": {}, "outputs": [], "source": [ @@ -2826,7 +2457,7 @@ }, { "cell_type": "code", - "execution_count": 163, + "execution_count": 108, "metadata": {}, "outputs": [], "source": [ @@ -2842,10 +2473,11 @@ }, { "cell_type": "code", - "execution_count": 164, + "execution_count": 109, "metadata": {}, "outputs": [], "source": [ + "# not in the book – bonus material\n", "n_clusters = 3\n", "n_dims = 2\n", "n_params_for_weights = n_clusters - 1\n", @@ -2854,25 +2486,10 @@ "n_params = n_params_for_weights + n_params_for_means + n_params_for_covariance\n", "max_log_likelihood = gm.score(X) * len(X) # log(L^)\n", "bic = np.log(len(X)) * n_params - 2 * max_log_likelihood\n", - "aic = 2 * n_params - 2 * max_log_likelihood" - ] - }, - { - "cell_type": "code", - "execution_count": 165, - "metadata": {}, - "outputs": [], - "source": [ - "bic, aic" - ] - }, - { - "cell_type": "code", - "execution_count": 166, - "metadata": {}, - "outputs": [], - "source": [ - "n_params" + "aic = 2 * n_params - 2 * max_log_likelihood\n", + "print(f\"bic = {bic}\")\n", + "print(f\"aic = {aic}\")\n", + "print(f\"n_params = {n_params}\")" ] }, { @@ -2891,92 +2508,32 @@ }, { "cell_type": "code", - "execution_count": 167, + "execution_count": 110, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this cell generates and saves Figure 8–20\n", + "\n", "gms_per_k = [GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X)\n", - " for k in range(1, 11)]" - ] - }, - { - "cell_type": "code", - "execution_count": 168, - "metadata": {}, - "outputs": [], - "source": [ + " for k in range(1, 11)]\n", "bics = [model.bic(X) for model in gms_per_k]\n", - "aics = [model.aic(X) for model in gms_per_k]" - ] - }, - { - "cell_type": "code", - "execution_count": 169, - "metadata": {}, - "outputs": [], - "source": [ + "aics = [model.aic(X) for model in gms_per_k]\n", + "\n", "plt.figure(figsize=(8, 3))\n", "plt.plot(range(1, 11), bics, \"bo-\", label=\"BIC\")\n", "plt.plot(range(1, 11), aics, \"go--\", label=\"AIC\")\n", - "plt.xlabel(\"$k$\", fontsize=14)\n", - "plt.ylabel(\"Information Criterion\", fontsize=14)\n", - "plt.axis([1, 9.5, np.min(aics) - 50, np.max(aics) + 50])\n", - "plt.annotate('Minimum',\n", - " xy=(3, bics[2]),\n", - " xytext=(0.35, 0.6),\n", - " textcoords='figure fraction',\n", - " fontsize=14,\n", - " arrowprops=dict(facecolor='black', shrink=0.1)\n", - " )\n", + "plt.xlabel(\"$k$\")\n", + "plt.ylabel(\"Information Criterion\")\n", + "plt.axis([1, 9.5, min(aics) - 50, max(aics) + 50])\n", + "plt.annotate(\"\", xy=(3, bics[2]), xytext=(3.4, 8650),\n", + " arrowprops=dict(facecolor='black', shrink=0.1))\n", + "plt.text(3.5, 8660, 
\"Minimum\", fontsize=14, horizontalalignment=\"center\")\n", "plt.legend()\n", + "plt.grid()\n", "save_fig(\"aic_bic_vs_k_plot\")\n", "plt.show()" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's search for best combination of values for both the number of clusters and the `covariance_type` hyperparameter:" - ] - }, - { - "cell_type": "code", - "execution_count": 170, - "metadata": {}, - "outputs": [], - "source": [ - "min_bic = np.infty\n", - "\n", - "for k in range(1, 11):\n", - " for covariance_type in (\"full\", \"tied\", \"spherical\", \"diag\"):\n", - " bic = GaussianMixture(n_components=k, n_init=10,\n", - " covariance_type=covariance_type,\n", - " random_state=42).fit(X).bic(X)\n", - " if bic < min_bic:\n", - " min_bic = bic\n", - " best_k = k\n", - " best_covariance_type = covariance_type" - ] - }, - { - "cell_type": "code", - "execution_count": 171, - "metadata": {}, - "outputs": [], - "source": [ - "best_k" - ] - }, - { - "cell_type": "code", - "execution_count": 172, - "metadata": {}, - "outputs": [], - "source": [ - "best_covariance_type" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -2993,45 +2550,31 @@ }, { "cell_type": "code", - "execution_count": 173, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.mixture import BayesianGaussianMixture" - ] - }, - { - "cell_type": "code", - "execution_count": 174, + "execution_count": 111, "metadata": {}, "outputs": [], "source": [ + "from sklearn.mixture import BayesianGaussianMixture\n", + "\n", "bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)\n", - "bgm.fit(X)" + "bgm.fit(X)\n", + "bgm.weights_.round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The algorithm automatically detected that only 3 components are needed:" + "The algorithm automatically detected that only 3 components are needed!" 
] }, { "cell_type": "code", - "execution_count": 175, - "metadata": {}, - "outputs": [], - "source": [ - "np.round(bgm.weights_, 2)" - ] - }, - { - "cell_type": "code", - "execution_count": 176, + "execution_count": 112, "metadata": {}, "outputs": [], "source": [ + "# not in the book – this figure is almost identical to Figure 8–16\n", "plt.figure(figsize=(8, 5))\n", "plot_gaussian_mixture(bgm, X)\n", "plt.show()" @@ -3039,97 +2582,24 @@ }, { "cell_type": "code", - "execution_count": 177, + "execution_count": 113, "metadata": {}, "outputs": [], "source": [ - "bgm_low = BayesianGaussianMixture(n_components=10, max_iter=1000, n_init=1,\n", - " weight_concentration_prior=0.01, random_state=42)\n", - "bgm_high = BayesianGaussianMixture(n_components=10, max_iter=1000, n_init=1,\n", - " weight_concentration_prior=10000, random_state=42)\n", - "nn = 73\n", - "bgm_low.fit(X[:nn])\n", - "bgm_high.fit(X[:nn])" - ] - }, - { - "cell_type": "code", - "execution_count": 178, - "metadata": {}, - "outputs": [], - "source": [ - "np.round(bgm_low.weights_, 2)" - ] - }, - { - "cell_type": "code", - "execution_count": 179, - "metadata": {}, - "outputs": [], - "source": [ - "np.round(bgm_high.weights_, 2)" - ] - }, - { - "cell_type": "code", - "execution_count": 180, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(9, 4))\n", + "# not in the book – this cell generates and saves Figure 8–21\n", "\n", - "plt.subplot(121)\n", - "plot_gaussian_mixture(bgm_low, X[:nn])\n", - "plt.title(\"weight_concentration_prior = 0.01\", fontsize=14)\n", + "X_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=42)\n", "\n", - "plt.subplot(122)\n", - "plot_gaussian_mixture(bgm_high, X[:nn], show_ylabels=False)\n", - "plt.title(\"weight_concentration_prior = 10000\", fontsize=14)\n", - "\n", - "save_fig(\"mixture_concentration_prior_plot\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: the fact that you see only 3 regions in the right plot although there are 4 centroids is not a bug. The weight of the top-right cluster is much larger than the weight of the lower-right cluster, so the probability that any given point in this region belongs to the top right cluster is greater than the probability that it belongs to the lower-right cluster." - ] - }, - { - "cell_type": "code", - "execution_count": 181, - "metadata": {}, - "outputs": [], - "source": [ - "X_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=42)" - ] - }, - { - "cell_type": "code", - "execution_count": 182, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ "bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)\n", - "bgm.fit(X_moons)" - ] - }, - { - "cell_type": "code", - "execution_count": 183, - "metadata": {}, - "outputs": [], - "source": [ + "bgm.fit(X_moons)\n", + "\n", "plt.figure(figsize=(9, 3.2))\n", "\n", "plt.subplot(121)\n", "plot_data(X_moons)\n", - "plt.xlabel(\"$x_1$\", fontsize=14)\n", - "plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)\n", + "plt.xlabel(\"$x_1$\")\n", + "plt.ylabel(\"$x_2$\", rotation=0)\n", + "plt.grid()\n", "\n", "plt.subplot(122)\n", "plot_gaussian_mixture(bgm, X_moons, show_ylabels=False)\n", @@ -3145,101 +2615,6 @@ "Oops, not great... instead of detecting 2 moon-shaped clusters, the algorithm detected 8 ellipsoidal clusters. However, the density plot does not look too bad, so it might be usable for anomaly detection." 
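,
+    "\n",
+    "\n",
+    "For instance (a quick sketch, not in the book), you could flag the lowest-density 1% of the moons data using the fitted model's `score_samples()` method, just like we did earlier:\n",
+    "\n",
+    "```python\n",
+    "# a sketch only – density-based anomaly detection on the moons dataset\n",
+    "densities_moons = bgm.score_samples(X_moons)\n",
+    "threshold = np.percentile(densities_moons, 1)\n",
+    "anomalies_moons = X_moons[densities_moons < threshold]\n",
+    "len(anomalies_moons)  # roughly 1% of the 1,000 instances\n",
+    "```"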
] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Likelihood Function**" - ] - }, - { - "cell_type": "code", - "execution_count": 184, - "metadata": {}, - "outputs": [], - "source": [ - "from scipy.stats import norm" - ] - }, - { - "cell_type": "code", - "execution_count": 185, - "metadata": {}, - "outputs": [], - "source": [ - "xx = np.linspace(-6, 4, 101)\n", - "ss = np.linspace(1, 2, 101)\n", - "XX, SS = np.meshgrid(xx, ss)\n", - "ZZ = 2 * norm.pdf(XX - 1.0, 0, SS) + norm.pdf(XX + 4.0, 0, SS)\n", - "ZZ = ZZ / ZZ.sum(axis=1)[:,np.newaxis] / (xx[1] - xx[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 186, - "metadata": {}, - "outputs": [], - "source": [ - "from matplotlib.patches import Polygon\n", - "\n", - "plt.figure(figsize=(8, 4.5))\n", - "\n", - "x_idx = 85\n", - "s_idx = 30\n", - "\n", - "plt.subplot(221)\n", - "plt.contourf(XX, SS, ZZ, cmap=\"GnBu\")\n", - "plt.plot([-6, 4], [ss[s_idx], ss[s_idx]], \"k-\", linewidth=2)\n", - "plt.plot([xx[x_idx], xx[x_idx]], [1, 2], \"b-\", linewidth=2)\n", - "plt.xlabel(r\"$x$\")\n", - "plt.ylabel(r\"$\\theta$\", fontsize=14, rotation=0)\n", - "plt.title(r\"Model $f(x; \\theta)$\", fontsize=14)\n", - "\n", - "plt.subplot(222)\n", - "plt.plot(ss, ZZ[:, x_idx], \"b-\")\n", - "max_idx = np.argmax(ZZ[:, x_idx])\n", - "max_val = np.max(ZZ[:, x_idx])\n", - "plt.plot(ss[max_idx], max_val, \"r.\")\n", - "plt.plot([ss[max_idx], ss[max_idx]], [0, max_val], \"r:\")\n", - "plt.plot([0, ss[max_idx]], [max_val, max_val], \"r:\")\n", - "plt.text(1.01, max_val + 0.005, r\"$\\hat{L}$\", fontsize=14)\n", - "plt.text(ss[max_idx]+ 0.01, 0.055, r\"$\\hat{\\theta}$\", fontsize=14)\n", - "plt.text(ss[max_idx]+ 0.01, max_val - 0.012, r\"$Max$\", fontsize=12)\n", - "plt.axis([1, 2, 0.05, 0.15])\n", - "plt.xlabel(r\"$\\theta$\", fontsize=14)\n", - "plt.grid(True)\n", - "plt.text(1.99, 0.135, r\"$=f(x=2.5; \\theta)$\", fontsize=14, ha=\"right\")\n", - "plt.title(r\"Likelihood function $\\mathcal{L}(\\theta|x=2.5)$\", fontsize=14)\n", - "\n", - "plt.subplot(223)\n", - "plt.plot(xx, ZZ[s_idx], \"k-\")\n", - "plt.axis([-6, 4, 0, 0.25])\n", - "plt.xlabel(r\"$x$\", fontsize=14)\n", - "plt.grid(True)\n", - "plt.title(r\"PDF $f(x; \\theta=1.3)$\", fontsize=14)\n", - "verts = [(xx[41], 0)] + list(zip(xx[41:81], ZZ[s_idx, 41:81])) + [(xx[80], 0)]\n", - "poly = Polygon(verts, facecolor='0.9', edgecolor='0.5')\n", - "plt.gca().add_patch(poly)\n", - "\n", - "plt.subplot(224)\n", - "plt.plot(ss, np.log(ZZ[:, x_idx]), \"b-\")\n", - "max_idx = np.argmax(np.log(ZZ[:, x_idx]))\n", - "max_val = np.max(np.log(ZZ[:, x_idx]))\n", - "plt.plot(ss[max_idx], max_val, \"r.\")\n", - "plt.plot([ss[max_idx], ss[max_idx]], [-5, max_val], \"r:\")\n", - "plt.plot([0, ss[max_idx]], [max_val, max_val], \"r:\")\n", - "plt.axis([1, 2, -2.4, -2])\n", - "plt.xlabel(r\"$\\theta$\", fontsize=14)\n", - "plt.text(ss[max_idx]+ 0.01, max_val - 0.05, r\"$Max$\", fontsize=12)\n", - "plt.text(ss[max_idx]+ 0.01, -2.39, r\"$\\hat{\\theta}$\", fontsize=14)\n", - "plt.text(1.01, max_val + 0.02, r\"$\\log \\, \\hat{L}$\", fontsize=14)\n", - "plt.grid(True)\n", - "plt.title(r\"$\\log \\, \\mathcal{L}(\\theta|x=2.5)$\", fontsize=14)\n", - "\n", - "save_fig(\"likelihood_function_plot\")\n", - "plt.show()" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -3277,7 +2652,7 @@ }, { "cell_type": "code", - "execution_count": 187, + "execution_count": 114, "metadata": {}, "outputs": [], "source": [ @@ -3288,7 +2663,7 @@ }, { "cell_type": "code", - "execution_count": 188, + "execution_count": 
115, "metadata": {}, "outputs": [], "source": [ @@ -3297,7 +2672,7 @@ }, { "cell_type": "code", - "execution_count": 189, + "execution_count": 116, "metadata": {}, "outputs": [], "source": [ @@ -3313,14 +2688,15 @@ }, { "cell_type": "code", - "execution_count": 190, + "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "strat_split = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)\n", - "train_valid_idx, test_idx = next(strat_split.split(olivetti.data, olivetti.target))\n", + "train_valid_idx, test_idx = next(strat_split.split(olivetti.data,\n", + " olivetti.target))\n", "X_train_valid = olivetti.data[train_valid_idx]\n", "y_train_valid = olivetti.target[train_valid_idx]\n", "X_test = olivetti.data[test_idx]\n", @@ -3336,7 +2712,7 @@ }, { "cell_type": "code", - "execution_count": 191, + "execution_count": 118, "metadata": {}, "outputs": [], "source": [ @@ -3354,7 +2730,7 @@ }, { "cell_type": "code", - "execution_count": 192, + "execution_count": 119, "metadata": {}, "outputs": [], "source": [ @@ -3377,7 +2753,7 @@ }, { "cell_type": "code", - "execution_count": 193, + "execution_count": 120, "metadata": {}, "outputs": [], "source": [ @@ -3386,14 +2762,15 @@ "k_range = range(5, 150, 5)\n", "kmeans_per_k = []\n", "for k in k_range:\n", - " print(\"k={}\".format(k))\n", - " kmeans = KMeans(n_clusters=k, random_state=42).fit(X_train_pca)\n", + " print(f\"k={k}\")\n", + " kmeans = KMeans(n_clusters=k, random_state=42)\n", + " kmeans.fit(X_train_pca)\n", " kmeans_per_k.append(kmeans)" ] }, { "cell_type": "code", - "execution_count": 194, + "execution_count": 121, "metadata": {}, "outputs": [], "source": [ @@ -3407,15 +2784,16 @@ "\n", "plt.figure(figsize=(8, 3))\n", "plt.plot(k_range, silhouette_scores, \"bo-\")\n", - "plt.xlabel(\"$k$\", fontsize=14)\n", - "plt.ylabel(\"Silhouette score\", fontsize=14)\n", + "plt.xlabel(\"$k$\")\n", + "plt.ylabel(\"Silhouette score\")\n", "plt.plot(best_k, best_score, \"rs\")\n", + "plt.grid()\n", "plt.show()" ] }, { "cell_type": "code", - "execution_count": 195, + "execution_count": 122, "metadata": {}, "outputs": [], "source": [ @@ -3431,7 +2809,7 @@ }, { "cell_type": "code", - "execution_count": 196, + "execution_count": 123, "metadata": {}, "outputs": [], "source": [ @@ -3440,9 +2818,10 @@ "\n", "plt.figure(figsize=(8, 3.5))\n", "plt.plot(k_range, inertias, \"bo-\")\n", - "plt.xlabel(\"$k$\", fontsize=14)\n", - "plt.ylabel(\"Inertia\", fontsize=14)\n", + "plt.xlabel(\"$k$\")\n", + "plt.ylabel(\"Inertia\")\n", "plt.plot(best_k, best_inertia, \"rs\")\n", + "plt.grid()\n", "plt.show()" ] }, @@ -3455,7 +2834,7 @@ }, { "cell_type": "code", - "execution_count": 197, + "execution_count": 124, "metadata": {}, "outputs": [], "source": [ @@ -3471,7 +2850,7 @@ }, { "cell_type": "code", - "execution_count": 198, + "execution_count": 125, "metadata": {}, "outputs": [], "source": [ @@ -3519,7 +2898,7 @@ }, { "cell_type": "code", - "execution_count": 199, + "execution_count": 126, "metadata": {}, "outputs": [], "source": [ @@ -3539,7 +2918,7 @@ }, { "cell_type": "code", - "execution_count": 200, + "execution_count": 127, "metadata": {}, "outputs": [], "source": [ @@ -3576,17 +2955,17 @@ }, { "cell_type": "code", - "execution_count": 201, + "execution_count": 128, "metadata": {}, "outputs": [], "source": [ - "from sklearn.pipeline import Pipeline\n", + "from sklearn.pipeline import make_pipeline\n", "\n", "for n_clusters in k_range:\n", - " pipeline = Pipeline([\n", - " 
(\"kmeans\", KMeans(n_clusters=n_clusters, random_state=42)),\n", - " (\"forest_clf\", RandomForestClassifier(n_estimators=150, random_state=42))\n", - " ])\n", + " pipeline = make_pipeline(\n", + " KMeans(n_clusters=n_clusters, random_state=42),\n", + " RandomForestClassifier(n_estimators=150, random_state=42)\n", + " )\n", " pipeline.fit(X_train_pca, y_train)\n", " print(n_clusters, pipeline.score(X_valid_pca, y_valid))" ] @@ -3607,7 +2986,7 @@ }, { "cell_type": "code", - "execution_count": 202, + "execution_count": 129, "metadata": {}, "outputs": [], "source": [ @@ -3618,7 +2997,7 @@ }, { "cell_type": "code", - "execution_count": 203, + "execution_count": 130, "metadata": {}, "outputs": [], "source": [ @@ -3650,7 +3029,7 @@ }, { "cell_type": "code", - "execution_count": 204, + "execution_count": 131, "metadata": {}, "outputs": [], "source": [ @@ -3669,7 +3048,7 @@ }, { "cell_type": "code", - "execution_count": 205, + "execution_count": 132, "metadata": {}, "outputs": [], "source": [ @@ -3680,7 +3059,7 @@ }, { "cell_type": "code", - "execution_count": 206, + "execution_count": 133, "metadata": {}, "outputs": [], "source": [ @@ -3696,7 +3075,7 @@ }, { "cell_type": "code", - "execution_count": 207, + "execution_count": 134, "metadata": {}, "outputs": [], "source": [ @@ -3723,7 +3102,7 @@ }, { "cell_type": "code", - "execution_count": 208, + "execution_count": 135, "metadata": {}, "outputs": [], "source": [ @@ -3732,7 +3111,7 @@ }, { "cell_type": "code", - "execution_count": 209, + "execution_count": 136, "metadata": {}, "outputs": [], "source": [ @@ -3748,7 +3127,7 @@ }, { "cell_type": "code", - "execution_count": 210, + "execution_count": 137, "metadata": {}, "outputs": [], "source": [ @@ -3778,16 +3157,16 @@ }, { "cell_type": "code", - "execution_count": 211, + "execution_count": 138, "metadata": {}, "outputs": [], "source": [ - "X_train_pca" + "X_train_pca.round(2)" ] }, { "cell_type": "code", - "execution_count": 212, + "execution_count": 139, "metadata": {}, "outputs": [], "source": [ @@ -3800,7 +3179,7 @@ }, { "cell_type": "code", - "execution_count": 213, + "execution_count": 140, "metadata": {}, "outputs": [], "source": [ @@ -3809,7 +3188,7 @@ }, { "cell_type": "code", - "execution_count": 214, + "execution_count": 141, "metadata": {}, "outputs": [], "source": [ @@ -3818,7 +3197,7 @@ }, { "cell_type": "code", - "execution_count": 215, + "execution_count": 142, "metadata": {}, "outputs": [], "source": [ @@ -3827,7 +3206,7 @@ }, { "cell_type": "code", - "execution_count": 216, + "execution_count": 143, "metadata": {}, "outputs": [], "source": [ @@ -3845,7 +3224,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" },