diff --git a/09_unsupervised_learning.ipynb b/09_unsupervised_learning.ipynb
index 368f90d..121a06d 100644
--- a/09_unsupervised_learning.ipynb
+++ b/09_unsupervised_learning.ipynb
@@ -6,7 +6,7 @@
    "source": [
     "**Chapter 9 – Unsupervised Learning**\n",
     "\n",
-    "_This notebook contains all the sample code and solutions to the exercises in chapter 9._"
+    "_This notebook contains all the sample code in chapter 9._"
    ]
   },
   {
@@ -925,13 +925,10 @@
    "outputs": [],
    "source": [
     "from six.moves import urllib\n",
-    "try:\n",
-    "    from sklearn.datasets import fetch_openml\n",
-    "    mnist = fetch_openml('mnist_784', version=1)\n",
-    "    mnist.target = mnist.target.astype(np.int64)\n",
-    "except ImportError:\n",
-    "    from sklearn.datasets import fetch_mldata\n",
-    "    mnist = fetch_mldata('MNIST original')"
+    "from sklearn.datasets import fetch_openml\n",
+    "\n",
+    "mnist = fetch_openml('mnist_784', version=1)\n",
+    "mnist.target = mnist.target.astype(np.int64)"
    ]
   },
   {
@@ -1100,12 +1097,12 @@
     "times = np.empty((100, 2))\n",
     "inertias = np.empty((100, 2))\n",
     "for k in range(1, 101):\n",
-    "    kmeans = KMeans(n_clusters=k, random_state=42)\n",
+    "    kmeans_ = KMeans(n_clusters=k, random_state=42)\n",
     "    minibatch_kmeans = MiniBatchKMeans(n_clusters=k, random_state=42)\n",
     "    print(\"\\r{}/{}\".format(k, 100), end=\"\")\n",
-    "    times[k-1, 0] = timeit(\"kmeans.fit(X)\", number=10, globals=globals())\n",
+    "    times[k-1, 0] = timeit(\"kmeans_.fit(X)\", number=10, globals=globals())\n",
     "    times[k-1, 1] = timeit(\"minibatch_kmeans.fit(X)\", number=10, globals=globals())\n",
-    "    inertias[k-1, 0] = kmeans.inertia_\n",
+    "    inertias[k-1, 0] = kmeans_.inertia_\n",
     "    inertias[k-1, 1] = minibatch_kmeans.inertia_"
    ]
   },
@@ -1121,7 +1118,6 @@
     "plt.plot(range(1, 101), inertias[:, 0], \"r--\", label=\"K-Means\")\n",
     "plt.plot(range(1, 101), inertias[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n",
     "plt.xlabel(\"$k$\", fontsize=16)\n",
-    "#plt.ylabel(\"Inertia\", fontsize=14)\n",
     "plt.title(\"Inertia\", fontsize=14)\n",
     "plt.legend(fontsize=14)\n",
     "plt.axis([1, 100, 0, 100])\n",
@@ -1130,10 +1126,8 @@
     "plt.plot(range(1, 101), times[:, 0], \"r--\", label=\"K-Means\")\n",
     "plt.plot(range(1, 101), times[:, 1], \"b.-\", label=\"Mini-batch K-Means\")\n",
     "plt.xlabel(\"$k$\", fontsize=16)\n",
-    "#plt.ylabel(\"Training time (seconds)\", fontsize=14)\n",
     "plt.title(\"Training time (seconds)\", fontsize=14)\n",
     "plt.axis([1, 100, 0, 6])\n",
-    "#plt.legend(fontsize=14)\n",
     "\n",
     "save_fig(\"minibatch_kmeans_vs_kmeans\")\n",
     "plt.show()"
@@ -1579,7 +1573,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)\n",
+    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n",
     "log_reg.fit(X_train, y_train)"
    ]
   },
@@ -1596,7 +1590,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Okay, that's our baseline: 96.7% accuracy. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to the 50 clusters, then apply a logistic regression model:"
+    "Okay, that's our baseline: 96.89% accuracy. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to the 50 clusters, then apply a logistic regression model:"
    ]
   },
   {
@@ -1616,7 +1610,7 @@
    "source": [
     "pipeline = Pipeline([\n",
     "    (\"kmeans\", KMeans(n_clusters=50, random_state=42)),\n",
-    "    (\"log_reg\", LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)),\n",
+    "    (\"log_reg\", LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)),\n",
     "])\n",
     "pipeline.fit(X_train, y_train)"
    ]
@@ -1636,14 +1630,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "1 - (1 - 0.9822222) / (1 - 0.9666666)"
+    "1 - (1 - 0.977777) / (1 - 0.968888)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "How about that? We almost divided the error rate by a factor of 2! But we chose the number of clusters $k$ completely arbitrarily, we can surely do better. Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for $k$ is much simpler than earlier: there's no need to perform silhouette analysis or minimize the inertia, the best value of $k$ is simply the one that results in the best classification performance."
+    "How about that? We reduced the error rate by over 28%! But we chose the number of clusters $k$ completely arbitrarily, we can surely do better. Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for $k$ is much simpler than earlier: there's no need to perform silhouette analysis or minimize the inertia, the best value of $k$ is simply the one that results in the best classification performance."
    ]
   },
   {
@@ -1678,7 +1672,9 @@
   {
    "cell_type": "code",
    "execution_count": 90,
-   "metadata": {},
+   "metadata": {
+    "scrolled": false
+   },
    "outputs": [],
    "source": [
     "grid_clf.score(X_test, y_test)"
@@ -1688,7 +1684,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The performance is slightly improved when $k=90$, so 90 it is."
+    "The performance improved most with $k=99$, so 99 it is."
    ]
   },
   {
@@ -1810,7 +1806,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)\n",
+    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n",
     "log_reg.fit(X_representative_digits, y_representative_digits)\n",
     "log_reg.score(X_test, y_test)"
    ]
@@ -1819,7 +1815,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Wow! We jumped from 82.7% accuracy to 92.4%, although we are still only training the model on 50 instances. Since it's often costly and painful to label instances, especially when it has to be done manually by experts, it's a good idea to make them label representative instances rather than just random instances."
+    "Wow! We jumped from 83.3% accuracy to 92.2%, although we are still only training the model on 50 instances. Since it's often costly and painful to label instances, especially when it has to be done manually by experts, it's a good idea to make them label representative instances rather than just random instances."
    ]
   },
   {
@@ -1846,7 +1842,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)\n",
+    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n",
     "log_reg.fit(X_train, y_train_propagated)"
    ]
   },
@@ -1900,7 +1896,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", random_state=42)\n",
+    "log_reg = LogisticRegression(multi_class=\"ovr\", solver=\"lbfgs\", max_iter=5000, random_state=42)\n",
     "log_reg.fit(X_train_partially_propagated, y_train_partially_propagated)"
    ]
   },
@@ -1917,7 +1913,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Nice! With just 50 labeled instances (just 5 examples per class on average!), we got 94.2% performance, which is pretty close to the performance of logistic regression on the fully labeled _digits_ dataset (which was 96.7%)."
+    "Nice! With just 50 labeled instances (just 5 examples per class on average!), we got 94% performance, which is pretty close to the performance of logistic regression on the fully labeled _digits_ dataset (which was 96.9%)."
    ]
   },
   {
@@ -3143,22 +3139,6 @@
     "plt.show()"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
-   "source": [
-    "# Exercise solutions"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "TODO"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
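
Note on the error-rate figures changed in this diff: the updated markdown cell says the K-Means pipeline "reduced the error rate by over 28%", while the old text said the error rate was almost halved. Both follow from the expression in the adjacent code cell, 1 - (1 - new accuracy) / (1 - baseline accuracy). The following is only a minimal sketch for checking that arithmetic; the helper name `relative_error_reduction` is hypothetical and does not appear in the notebook, and the accuracy values are copied from the diff rather than recomputed.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction by which the error rate drops relative to the baseline.

    Hypothetical helper (not part of the notebook); it mirrors the
    notebook expression 1 - (1 - new) / (1 - baseline).
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return 1.0 - new_error / baseline_error


# Accuracy values as they appear in the removed (-) and added (+) cells.
old_reduction = relative_error_reduction(0.9666666, 0.9822222)
new_reduction = relative_error_reduction(0.968888, 0.977777)

print(f"old cell: {old_reduction:.1%} fewer errors")
print(f"new cell: {new_reduction:.1%} fewer errors")
```

The old values give a reduction of a little under one half, which matches "almost divided the error rate by a factor of 2"; the new values give a reduction between 28% and 29%, which matches the revised wording.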