From 295d9a1353a3d4dcafdc7e3591832a2b6894e79b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Sun, 26 Jan 2020 19:16:11 +1300 Subject: [PATCH] Add solutions to chapter 9 code exercises --- 09_unsupervised_learning.ipynb | 618 ++++++++++++++++++++++++++++++++- 1 file changed, 610 insertions(+), 8 deletions(-) diff --git a/09_unsupervised_learning.ipynb b/09_unsupervised_learning.ipynb index 6b72618..05ef094 100644 --- a/09_unsupervised_learning.ipynb +++ b/09_unsupervised_learning.ipynb @@ -1687,6 +1687,13 @@ "grid_clf.fit(X_train, y_train)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see what the best number of clusters is:" + ] + }, { "cell_type": "code", "execution_count": 90, @@ -1707,13 +1714,6 @@ "grid_clf.score(X_test, y_test)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The performance improved most with $k=99$, so 99 it is." - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1963,7 +1963,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You could now do a few iterations of _active learning_:\n", + "You could now do a few iterations of *active learning*:\n", "1. Manually label the instances that the classifier is least sure about, if possible by picking them in distinct clusters.\n", "2. Train a new model with these additional labels." ] @@ -3166,6 +3166,608 @@ "plt.show()" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise solutions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. to 9." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "See Appendix A." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10. 
Cluster the Olivetti Faces Dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: The classic Olivetti faces dataset contains 400 grayscale 64 × 64–pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model that can predict which person is represented in each picture. Load the dataset using the `sklearn.datasets.fetch_olivetti_faces()` function.*" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import fetch_olivetti_faces\n", + "\n", + "olivetti = fetch_olivetti_faces()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "print(olivetti.DESCR)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "olivetti.target" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Then split it into a training set, a validation set, and a test set (note that the dataset is already scaled between 0 and 1). 
Since the dataset is quite small, you probably want to use stratified sampling to ensure that there are the same number of images per person in each set.*" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import StratifiedShuffleSplit\n", + "\n", + "strat_split = StratifiedShuffleSplit(n_splits=1, test_size=40, random_state=42)\n", + "train_valid_idx, test_idx = next(strat_split.split(olivetti.data, olivetti.target))\n", + "X_train_valid = olivetti.data[train_valid_idx]\n", + "y_train_valid = olivetti.target[train_valid_idx]\n", + "X_test = olivetti.data[test_idx]\n", + "y_test = olivetti.target[test_idx]\n", + "\n", + "strat_split = StratifiedShuffleSplit(n_splits=1, test_size=80, random_state=43)\n", + "train_idx, valid_idx = next(strat_split.split(X_train_valid, y_train_valid))\n", + "X_train = X_train_valid[train_idx]\n", + "y_train = y_train_valid[train_idx]\n", + "X_valid = X_train_valid[valid_idx]\n", + "y_valid = y_train_valid[valid_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "print(X_train.shape, y_train.shape)\n", + "print(X_valid.shape, y_valid.shape)\n", + "print(X_test.shape, y_test.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To speed things up, we'll reduce the data's dimensionality using PCA:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "pca = PCA(0.99)\n", + "X_train_pca = pca.fit_transform(X_train)\n", + "X_valid_pca = pca.transform(X_valid)\n", + "X_test_pca = pca.transform(X_test)\n", + "\n", + "pca.n_components_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Next, cluster the images using K-Means, and ensure that you have a good number of clusters (using one of the techniques discussed in 
this chapter).*" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "k_range = range(5, 150, 5)\n", + "kmeans_per_k = []\n", + "for k in k_range:\n", + " print(\"k={}\".format(k))\n", + " kmeans = KMeans(n_clusters=k, random_state=42).fit(X_train_pca)\n", + " kmeans_per_k.append(kmeans)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import silhouette_score\n", + "\n", + "silhouette_scores = [silhouette_score(X_train_pca, model.labels_)\n", + " for model in kmeans_per_k]\n", + "best_index = np.argmax(silhouette_scores)\n", + "best_k = k_range[best_index]\n", + "best_score = silhouette_scores[best_index]\n", + "\n", + "plt.figure(figsize=(8, 3))\n", + "plt.plot(k_range, silhouette_scores, \"bo-\")\n", + "plt.xlabel(\"$k$\", fontsize=14)\n", + "plt.ylabel(\"Silhouette score\", fontsize=14)\n", + "plt.plot(best_k, best_score, \"rs\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "best_k" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It looks like the best number of clusters is quite high, at 120. You might have expected it to be 40, since there are 40 different people on the pictures. However, the same person may look quite different on different pictures (e.g., with or without glasses, or simply shifted left or right)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "inertias = [model.inertia_ for model in kmeans_per_k]\n", + "best_inertia = inertias[best_index]\n", + "\n", + "plt.figure(figsize=(8, 3.5))\n", + "plt.plot(k_range, inertias, \"bo-\")\n", + "plt.xlabel(\"$k$\", fontsize=14)\n", + "plt.ylabel(\"Inertia\", fontsize=14)\n", + "plt.plot(best_k, best_inertia, \"rs\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The optimal number of clusters is not clear on this inertia diagram, as there is no obvious elbow, so let's stick with k=120." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "best_model = kmeans_per_k[best_index]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Visualize the clusters: do you see similar faces in each cluster?*" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_faces(faces, labels, n_cols=5):\n", + " n_rows = (len(faces) - 1) // n_cols + 1\n", + " plt.figure(figsize=(n_cols, n_rows * 1.1))\n", + " for index, (face, label) in enumerate(zip(faces, labels)):\n", + " plt.subplot(n_rows, n_cols, index + 1)\n", + " plt.imshow(face.reshape(64, 64), cmap=\"gray\")\n", + " plt.axis(\"off\")\n", + " plt.title(label)\n", + " plt.show()\n", + "\n", + "for cluster_id in np.unique(best_model.labels_):\n", + " print(\"Cluster\", cluster_id)\n", + " in_cluster = best_model.labels_==cluster_id\n", + " faces = X_train[in_cluster].reshape(-1, 64, 64)\n", + " labels = y_train[in_cluster]\n", + " plot_faces(faces, labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "About 2 out of 3 clusters are useful: that is, they contain at least 2 pictures, all of the same person. 
However, the remaining clusters either contain one or more intruders or hold just a single picture.\n", + "\n", + "Clustering images this way may be too imprecise to be directly useful when training a model (as we will see below), but it can be tremendously useful when labeling images in a new dataset: it will usually make labeling much faster." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11. Using Clustering as Preprocessing for Classification" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Continuing with the Olivetti faces dataset, train a classifier to predict which person is represented in each picture, and evaluate it on the validation set.*" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "clf = RandomForestClassifier(n_estimators=150, random_state=42)\n", + "clf.fit(X_train_pca, y_train)\n", + "clf.score(X_valid_pca, y_valid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Next, use K-Means as a dimensionality reduction tool, and train a classifier on the reduced set.*" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "X_train_reduced = best_model.transform(X_train_pca)\n", + "X_valid_reduced = best_model.transform(X_valid_pca)\n", + "X_test_reduced = best_model.transform(X_test_pca)\n", + "\n", + "clf = RandomForestClassifier(n_estimators=150, random_state=42)\n", + "clf.fit(X_train_reduced, y_train)\n", + " \n", + "clf.score(X_valid_reduced, y_valid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Yikes! That's not better at all! Let's see if tuning the number of clusters helps." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Search for the number of clusters that allows the classifier to get the best performance: what performance can you reach?*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could use a `GridSearchCV` like we did earlier in this notebook, but since we already have a validation set, we don't need K-fold cross-validation, and we're only exploring a single hyperparameter, so it's simpler to just run a loop manually:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.pipeline import Pipeline\n", + "\n", + "for n_clusters in k_range:\n", + " pipeline = Pipeline([\n", + " (\"kmeans\", KMeans(n_clusters=n_clusters, random_state=n_clusters)),\n", + " (\"forest_clf\", RandomForestClassifier(n_estimators=150, random_state=42))\n", + " ])\n", + " pipeline.fit(X_train_pca, y_train)\n", + " print(n_clusters, pipeline.score(X_valid_pca, y_valid))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Oh well, even by tuning the number of clusters, we never get beyond 80% accuracy. Looks like the distances to the cluster centroids are not as informative as the original images." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: What if you append the features from the reduced set to the original features (again, searching for the best number of clusters)?*" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "X_train_extended = np.c_[X_train_pca, X_train_reduced]\n", + "X_valid_extended = np.c_[X_valid_pca, X_valid_reduced]\n", + "X_test_extended = np.c_[X_test_pca, X_test_reduced]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "clf = RandomForestClassifier(n_estimators=150, random_state=42)\n", + "clf.fit(X_train_extended, y_train)\n", + "clf.score(X_valid_extended, y_valid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's a bit better, but still worse than without the cluster features. The clusters are not useful to directly train a classifier in this case (but they can still help when labeling new training instances)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12. A Gaussian Mixture Model for the Olivetti Faces Dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Train a Gaussian mixture model on the Olivetti faces dataset. 
To speed up the algorithm, you should probably reduce the dataset's dimensionality (e.g., use PCA, preserving 99% of the variance).*" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.mixture import GaussianMixture\n", + "\n", + "gm = GaussianMixture(n_components=40, random_state=42)\n", + "y_pred = gm.fit_predict(X_train_pca)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Use the model to generate some new faces (using the `sample()` method), and visualize them (if you used PCA, you will need to use its `inverse_transform()` method).*" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "n_gen_faces = 20\n", + "gen_faces_reduced, y_gen_faces = gm.sample(n_samples=n_gen_faces)\n", + "gen_faces = pca.inverse_transform(gen_faces_reduced)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "plot_faces(gen_faces, y_gen_faces)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Try to modify some images (e.g., rotate, flip, darken) and see if the model can detect the anomalies (i.e., compare the output of the `score_samples()` method for normal images and for anomalies).*" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "n_rotated = 4\n", + "rotated = np.transpose(X_train[:n_rotated].reshape(-1, 64, 64), axes=[0, 2, 1])\n", + "rotated = rotated.reshape(-1, 64*64)\n", + "y_rotated = y_train[:n_rotated]\n", + "\n", + "n_flipped = 3\n", + "flipped = X_train[:n_flipped].reshape(-1, 64, 64)[:, ::-1]\n", + "flipped = flipped.reshape(-1, 64*64)\n", + "y_flipped = y_train[:n_flipped]\n", + "\n", + "n_darkened = 3\n", + "darkened = X_train[:n_darkened].copy()\n", + "darkened[:, 1:-1] *= 0.3\n", + "darkened = darkened.reshape(-1, 64*64)\n", + 
"y_darkened = y_train[:n_darkened]\n", + "\n", + "X_bad_faces = np.r_[rotated, flipped, darkened]\n", + "y_bad = np.concatenate([y_rotated, y_flipped, y_darkened])\n", + "\n", + "plot_faces(X_bad_faces, y_bad)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "X_bad_faces_pca = pca.transform(X_bad_faces)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "gm.score_samples(X_bad_faces_pca)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The bad faces are all considered highly unlikely by the Gaussian Mixture model. Compare this to the scores of some training instances:" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "gm.score_samples(X_train_pca[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 13. Using Dimensionality Reduction Techniques for Anomaly Detection" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Exercise: Some dimensionality reduction techniques can also be used for anomaly detection. For example, take the Olivetti faces dataset and reduce it with PCA, preserving 99% of the variance. Then compute the reconstruction error for each image. Next, take some of the modified images you built in the previous exercise, and look at their reconstruction error: notice how much larger the reconstruction error is. 
If you plot a reconstructed image, you will see why: it tries to reconstruct a normal face.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We already reduced the dataset using PCA earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "X_train_pca" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "def reconstruction_errors(pca, X):\n", + " X_pca = pca.transform(X)\n", + " X_reconstructed = pca.inverse_transform(X_pca)\n", + " mse = np.square(X_reconstructed - X).mean(axis=-1)\n", + " return mse" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "reconstruction_errors(pca, X_train).mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "reconstruction_errors(pca, X_bad_faces).mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "plot_faces(X_bad_faces, y_bad)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "X_bad_faces_reconstructed = pca.inverse_transform(X_bad_faces_pca)\n", + "plot_faces(X_bad_faces_reconstructed, y_bad)" + ] + }, { "cell_type": "code", "execution_count": null,