From 7fe396ef9264d9e01f07050b5094c2d3cf82f1e2 Mon Sep 17 00:00:00 2001
From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com>
Date: Tue, 24 May 2022 21:37:17 +0900
Subject: [PATCH 1/6] Fix figure numbering and correct typos

---
 04_training_linear_models.ipynb | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/04_training_linear_models.ipynb b/04_training_linear_models.ipynb
index a9f7329..1216721 100644
--- a/04_training_linear_models.ipynb
+++ b/04_training_linear_models.ipynb
@@ -947,7 +947,7 @@
     "train_errors = -train_scores.mean(axis=1)\n",
     "valid_errors = -valid_scores.mean(axis=1)\n",
     "\n",
-    "plt.figure(figsize=(6, 4))  # extra code – not need, just formatting\n",
+    "plt.figure(figsize=(6, 4))  # extra code – not needed, just formatting\n",
     "plt.plot(train_sizes, train_errors, \"r-+\", linewidth=2, label=\"train\")\n",
     "plt.plot(train_sizes, valid_errors, \"b-\", linewidth=3, label=\"valid\")\n",
     "\n",
@@ -1124,11 +1124,11 @@
    "source": [
     "# extra code – this cell generates and saves Figure 4–17\n",
     "\n",
-    "def plot_model(model_class, polynomial, alphas, **model_kargs):\n",
+    "def plot_model(model_class, polynomial, alphas, **model_kwargs):\n",
     "    plt.plot(X, y, \"b.\", linewidth=3)\n",
     "    for alpha, style in zip(alphas, (\"b:\", \"g--\", \"r-\")):\n",
     "        if alpha > 0:\n",
-    "            model = model_class(alpha, **model_kargs)\n",
+    "            model = model_class(alpha, **model_kwargs)\n",
     "        else:\n",
     "            model = LinearRegression()\n",
     "        if polynomial:\n",
@@ -1875,7 +1875,7 @@
     "plt.plot([decision_boundary, decision_boundary], [0, 1], \"k:\", linewidth=2,\n",
     "         label=\"Decision boundary\")\n",
     "\n",
-    "# extra code – this section beautifies and saves Figure 4–21\n",
+    "# extra code – this section beautifies and saves Figure 4–23\n",
     "plt.arrow(x=decision_boundary, y=0.08, dx=-0.3, dy=0,\n",
     "          head_width=0.05, head_length=0.1, fc=\"b\", ec=\"b\")\n",
     "plt.arrow(x=decision_boundary, y=0.92, dx=0.3, dy=0,\n",
@@ -1951,7 +1951,7 @@
     }
    ],
    "source": [
-    "# extra code – this cell generates and saves Figure 4–22\n",
+    "# extra code – this cell generates and saves Figure 4–24\n",
     "\n",
     "X = iris.data[[\"petal length (cm)\", \"petal width (cm)\"]].values\n",
     "y = iris.target_names[iris.target] == 'virginica'\n",
@@ -2083,7 +2083,7 @@
     }
    ],
    "source": [
-    "# extra code – this cell generates and saves Figure 4–23\n",
+    "# extra code – this cell generates and saves Figure 4–25\n",
     "\n",
     "from matplotlib.colors import ListedColormap\n",
     "\n",
@@ -2195,7 +2195,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The easiest option to split the dataset into a training set, a validation set and a test set would be to use Scikit-Learn's `train_test_split()` function, but again, we want to did this manually:"
+    "The easiest option to split the dataset into a training set, a validation set and a test set would be to use Scikit-Learn's `train_test_split()` function, but again, we want to do it manually:"
    ]
   },
   {
@@ -2227,7 +2227,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all classes except for the target class which will have a probability of 1.0 (in other words, the vector of class probabilities for any given instance is a one-hot vector). Let's write a small function to convert the vector of class indices into a matrix containing a one-hot vector for each instance. To understand this code, you need to know that `np.diag(np.ones(n))` creates an n×n matrix full of 0s except for 1s on the main diagonal. Moreover, if `a` in a NumPy array, then `a[[1, 3, 2]]` returns an array with 3 rows equal to `a[1]`, `a[3]` and `a[2]` (this is [advanced NumPy indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing))."
+    "The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all classes except for the target class which will have a probability of 1.0 (in other words, the vector of class probabilities for any given instance is a one-hot vector). Let's write a small function to convert the vector of class indices into a matrix containing a one-hot vector for each instance. To understand this code, you need to know that `np.diag(np.ones(n))` creates an n×n matrix full of 0s except for 1s on the main diagonal. Moreover, if `a` is a NumPy array, then `a[[1, 3, 2]]` returns an array with 3 rows equal to `a[1]`, `a[3]` and `a[2]` (this is [advanced NumPy indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing))."
    ]
   },
   {
@@ -2662,7 +2662,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Oh well, still no change in validation acccuracy, but at least early training shortened training a bit."
+    "Oh well, still no change in validation accuracy, but at least early stopping shortened training a bit."
    ]
   },
   {

From f32ce273d2b3a15b14ce023106f864a31dac71f2 Mon Sep 17 00:00:00 2001
From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com>
Date: Tue, 31 May 2022 16:51:50 +0900
Subject: [PATCH 2/6] Fix typos and remove unused args in plt.grid()

---
 05_support_vector_machines.ipynb | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/05_support_vector_machines.ipynb b/05_support_vector_machines.ipynb
index cd23551..ab8c079 100644
--- a/05_support_vector_machines.ipynb
+++ b/05_support_vector_machines.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "_This notebook is an extra chapter on Support Vector Machines. It also includes exercises and their solutions at the end._"
+    "_This notebook contains all the sample code and solutions to the exercises in chapter 5._"
    ]
   },
   {
@@ -540,7 +540,7 @@
     "plt.figure(figsize=(10, 3))\n",
     "\n",
     "plt.subplot(121)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.plot(X1D[:, 0][y==0], np.zeros(4), \"bs\")\n",
     "plt.plot(X1D[:, 0][y==1], np.zeros(5), \"g^\")\n",
@@ -549,7 +549,7 @@
     "plt.axis([-4.5, 4.5, -0.2, 0.2])\n",
     "\n",
     "plt.subplot(122)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.axvline(x=0, color='k')\n",
     "plt.plot(X2D[:, 0][y==0], X2D[:, 1][y==0], \"bs\")\n",
@@ -624,7 +624,7 @@
     "    plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"bs\")\n",
     "    plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"g^\")\n",
     "    plt.axis(axes)\n",
-    "    plt.grid(True, which='both')\n",
+    "    plt.grid(True)\n",
     "    plt.xlabel(\"$x_1$\")\n",
     "    plt.ylabel(\"$x_2$\", rotation=0)\n",
     "\n",
@@ -766,7 +766,7 @@
     "plt.figure(figsize=(10.5, 4))\n",
     "\n",
     "plt.subplot(121)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.scatter(x=[-2, 1], y=[0, 0], s=150, alpha=0.5, c=\"red\")\n",
     "plt.plot(X1D[:, 0][yk==0], np.zeros(4), \"bs\")\n",
@@ -789,7 +789,7 @@
     "plt.axis([-4.5, 4.5, -0.1, 1.1])\n",
     "\n",
     "plt.subplot(122)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.axvline(x=0, color='k')\n",
     "plt.plot(XK[:, 0][yk==0], XK[:, 1][yk==0], \"bs\")\n",
@@ -1185,7 +1185,7 @@
     "        axs, (hinge_pos, hinge_pos ** 2), (hinge_neg, hinge_neg ** 2), titles):\n",
     "    ax.plot(s, loss_pos, \"g-\", linewidth=2, zorder=10, label=\"$t=1$\")\n",
     "    ax.plot(s, loss_neg, \"r--\", linewidth=2, zorder=10, label=\"$t=-1$\")\n",
-    "    ax.grid(True, which='both')\n",
+    "    ax.grid(True)\n",
     "    ax.axhline(y=0, color='k')\n",
     "    ax.axvline(x=0, color='k')\n",
     "    ax.set_xlabel(r\"$s = \\mathbf{w}^\\intercal \\mathbf{x} + b$\")\n",
@@ -1250,10 +1250,9 @@
     "        w = np.random.randn(X.shape[1], 1)  # n feature weights\n",
     "        b = 0\n",
     "\n",
-    "        m = len(X)\n",
     "        t = np.array(y, dtype=np.float64).reshape(-1, 1) * 2 - 1\n",
     "        X_t = X * t\n",
-    "        self.Js=[]\n",
+    "        self.Js = []\n",
     "\n",
     "        # Training\n",
     "        for epoch in range(self.n_epochs):\n",
@@ -1492,7 +1491,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "1. The fundamental idea behind Support Vector Machines is to fit the widest possible \"street\" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.\n",
+    "1. The fundamental idea behind Support Vector Machines is to fit the widest possible \"street\" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes of the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.\n",
     "2. After training an SVM, a _support vector_ is any instance located on the \"street\" (see the previous answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is _not_ a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions with a kernelized SVM only involves the support vectors, not the whole training set.\n",
     "3. SVMs try to fit the largest possible \"street\" between the classes (see the first answer), so if the training set is not scaled, the SVM will tend to neglect small features (see Figure 5–2).\n",
     "4. You can use the `decision_function()` method to get confidence scores. These scores represent the distance between the instance and the decision boundary. However, they cannot be directly converted into an estimation of the class probability. If you set `probability=True` when creating an `SVC`, then at the end of training it will use 5-fold cross-validation to generate out-of-sample scores for the training samples, and it will train a `LogisticRegression` model to map these scores to estimated probabilities. The `predict_proba()` and `predict_log_proba()` methods will then be available.\n",
@@ -2249,7 +2248,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This tuned kernelized SVM performs better than the `LinearSVC` model, but we get a lower score on the test set than we measured using cross-validation. This is quite common: since we did so much hyperparameter tuning, we ended up slightly overfitting the cross-validation test sets. It's tempting to tweak the hyperparameters a bit more until we get a better result on the test set, but we this would probably not help, as we would just start overfitting the test set. Anyway, this score is not bad at all, so let's stop here."
+    "This tuned kernelized SVM performs better than the `LinearSVC` model, but we get a lower score on the test set than we measured using cross-validation. This is quite common: since we did so much hyperparameter tuning, we ended up slightly overfitting the cross-validation test sets. It's tempting to tweak the hyperparameters a bit more until we get a better result on the test set, but this would probably not help, as we would just start overfitting the test set. Anyway, this score is not bad at all, so let's stop here."
    ]
   },
   {
@@ -2309,7 +2308,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Don't forget to scale the data:"
+    "Don't forget to scale the data!"
    ]
   },
   {

From 1cdc1b7178775d383e808b2213ac40d79a4ecb1a Mon Sep 17 00:00:00 2001
From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com>
Date: Tue, 31 May 2022 17:04:55 +0900
Subject: [PATCH 3/6] Correct preposition

---
 05_support_vector_machines.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/05_support_vector_machines.ipynb b/05_support_vector_machines.ipynb
index ab8c079..ef01ded 100644
--- a/05_support_vector_machines.ipynb
+++ b/05_support_vector_machines.ipynb
@@ -1491,7 +1491,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "1. The fundamental idea behind Support Vector Machines is to fit the widest possible \"street\" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes of the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.\n",
+    "1. The fundamental idea behind Support Vector Machines is to fit the widest possible \"street\" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.\n",
     "2. After training an SVM, a _support vector_ is any instance located on the \"street\" (see the previous answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is _not_ a support vector (i.e., is off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions with a kernelized SVM only involves the support vectors, not the whole training set.\n",
     "3. SVMs try to fit the largest possible \"street\" between the classes (see the first answer), so if the training set is not scaled, the SVM will tend to neglect small features (see Figure 5–2).\n",
     "4. You can use the `decision_function()` method to get confidence scores. These scores represent the distance between the instance and the decision boundary. However, they cannot be directly converted into an estimation of the class probability. If you set `probability=True` when creating an `SVC`, then at the end of training it will use 5-fold cross-validation to generate out-of-sample scores for the training samples, and it will train a `LogisticRegression` model to map these scores to estimated probabilities. The `predict_proba()` and `predict_log_proba()` methods will then be available.\n",

From 2c61add90fcf65abf69b2ebe102c3411d5c5bc76 Mon Sep 17 00:00:00 2001
From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com>
Date: Wed, 1 Jun 2022 18:15:06 +0900
Subject: [PATCH 4/6] Fix Big O notation and log base in ex. 5 solution

---
 06_decision_trees.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/06_decision_trees.ipynb b/06_decision_trees.ipynb
index 15d2bbd..32b07ba 100644
--- a/06_decision_trees.ipynb
+++ b/06_decision_trees.ipynb
@@ -1692,7 +1692,7 @@
     "2. A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 – (1/5)² – (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 – (1/2)² – (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so its overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's Gini impurity.\n",
     "3. If a Decision Tree is overfitting the training set, it may be a good idea to decrease `max_depth`, since this will constrain the model, regularizing it.\n",
     "4. Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.\n",
-    "5. The computational complexity of training a Decision Tree is 𝓞(_n_ × _m_ log(_m_)). So if you multiply the training set size by 10, the training time will be multiplied by _K_ = (_n_ × 10 _m_ × log(10 _m_)) / (_n_ × _m_ × log(_m_)) = 10 × log(10 _m_) / log(_m_). If _m_ = 10<sup>6</sup>, then _K_ ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.\n",
+    "5. The computational complexity of training a Decision Tree is _O_(_n_ × _m_ log₂(_m_)). So if you multiply the training set size by 10, the training time will be multiplied by _K_ = (_n_ × 10 _m_ × log₂(10 _m_)) / (_n_ × _m_ × log₂(_m_)) = 10 × log₂(10 _m_) / log₂(_m_). If _m_ = 10<sup>6</sup>, then _K_ ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.\n",
     "6. If the number of features doubles, then the training time will also roughly double."
    ]
   },

From cbffd5eb928769ee1556eeef90199725c64fe6df Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= <ageron@users.noreply.github.com>
Date: Thu, 2 Jun 2022 10:04:35 +1200
Subject: [PATCH 5/6] Thanks to Victor Khaustov

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6fbff2a..0de1bd0 100644
--- a/README.md
+++ b/README.md
@@ -76,4 +76,4 @@ See [INSTALL.md](INSTALL.md)
 See [INSTALL.md](INSTALL.md)
 
 ## Contributors
-I would like to thank everyone [who contributed to this project](https://github.com/ageron/handson-ml3/graphs/contributors), either by providing useful feedback, filing issues or submitting Pull Requests. Special thanks go to Haesun Park and Ian Beauregard who reviewed every notebook and submitted many PRs, including help on some of the exercise solutions. Thanks as well to Steven Bunkley and Ziembla who created the `docker` directory, and to github user SuperYorio who helped on some exercise solutions.
+I would like to thank everyone [who contributed to this project](https://github.com/ageron/handson-ml3/graphs/contributors), either by providing useful feedback, filing issues or submitting Pull Requests. Special thanks go to Haesun Park and Ian Beauregard who reviewed every notebook and submitted many PRs, including help on some of the exercise solutions. Thanks as well to Steven Bunkley and Ziembla who created the `docker` directory, and to GitHub user SuperYorio who helped on some exercise solutions. And last but not least, thanks a lot to Victor Khaustov who submitted plenty of excellent PRs, fixing many errors.

From 38b40643d8f564dfd0d41d39155de2776a8dbd70 Mon Sep 17 00:00:00 2001
From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com>
Date: Tue, 14 Jun 2022 14:47:11 +0900
Subject: [PATCH 6/6] Improve comments and fix typo

---
 08_dimensionality_reduction.ipynb | 35 ++++++++++---------------------
 1 file changed, 11 insertions(+), 24 deletions(-)

diff --git a/08_dimensionality_reduction.ipynb b/08_dimensionality_reduction.ipynb
index 06656e1..e555b80 100644
--- a/08_dimensionality_reduction.ipynb
+++ b/08_dimensionality_reduction.ipynb
@@ -193,7 +193,6 @@
     "# extra code – this cell generates and saves Figure 8–2\n",
     "\n",
     "import matplotlib.pyplot as plt\n",
-    "from mpl_toolkits.mplot3d import Axes3D\n",
     "from sklearn.decomposition import PCA\n",
     "\n",
     "pca = PCA(n_components=2)\n",
@@ -601,7 +600,7 @@
    "source": [
     "import numpy as np\n",
     "\n",
-    "# X = [...]  # the small 3D dataset was created ealier in this notebook\n",
+    "# X = [...]  # the small 3D dataset was created earlier in this notebook\n",
     "X_centered = X - X.mean(axis=0)\n",
     "U, s, Vt = np.linalg.svd(X_centered)\n",
     "c1 = Vt[0]\n",
@@ -692,13 +691,6 @@
     "pca.components_"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Recover the 3D points projected on the plane (PCA 2D subspace)."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -862,13 +854,6 @@
     "pca.explained_variance_ratio_.sum()  # extra code"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Code to generate Figure 8–8. Explained variance as a function of the number of dimensions:**"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 25,
@@ -888,6 +873,8 @@
     }
    ],
    "source": [
+    "# extra code – this cell generates and saves Figure 8–8\n",
+    "\n",
     "plt.figure(figsize=(6, 4))\n",
     "plt.plot(cumsum, linewidth=3)\n",
     "plt.axis([0, 400, 0, 1])\n",
@@ -1125,14 +1112,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Using `memmap()`:**"
+    "**Using NumPy's `memmap` class – a memory-map to an array stored in a binary file on disk.**"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's create the `memmap()` structure, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
+    "Let's create the `memmap` instance, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
    ]
   },
   {
@@ -1267,7 +1254,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell may take several minutes to run:"
+    "**Warning**, the following cell may take several minutes to run:"
    ]
   },
   {
@@ -1568,7 +1555,7 @@
     "    * It adds some complexity to your Machine Learning pipelines.\n",
     "    * Transformed features are often hard to interpret.\n",
     "2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally far from one another, increasing the risk of overfitting and making it very difficult to identify patterns without having plenty of training data.\n",
-    "3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as T-SNE) do not.\n",
+    "3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE) do not.\n",
     "4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in the Swiss roll dataset—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.\n",
     "5. That's a trick question: it depends on the dataset. Let's look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset's intrinsic dimensionality.\n",
     "6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Random Projection is great for very high-dimensional datasets.\n",
@@ -2109,14 +2096,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Exercise: _Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
+    "Exercise: _Alternatively, you can replace each dot in the scatterplot with the corresponding instance’s class (a digit from 0 to 9), or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples ([plot_lle_digits](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
+    "Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired by one of Scikit-Learn's excellent examples ([plot_lle_digits](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
    ]
   },
   {
@@ -2400,7 +2387,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
+    "**Warning**, the following cell will take about 10-30 minutes to run, depending on your hardware:"
    ]
   },
   {
@@ -2446,7 +2433,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
+    "**Warning**, the following cell will take about 10-30 minutes to run, depending on your hardware:"
    ]
   },
   {