diff --git a/04_training_linear_models.ipynb b/04_training_linear_models.ipynb
index a9f7329..1216721 100644
--- a/04_training_linear_models.ipynb
+++ b/04_training_linear_models.ipynb
@@ -947,7 +947,7 @@
     "train_errors = -train_scores.mean(axis=1)\n",
     "valid_errors = -valid_scores.mean(axis=1)\n",
     "\n",
-    "plt.figure(figsize=(6, 4))  # extra code – not need, just formatting\n",
+    "plt.figure(figsize=(6, 4))  # extra code – not needed, just formatting\n",
     "plt.plot(train_sizes, train_errors, \"r-+\", linewidth=2, label=\"train\")\n",
     "plt.plot(train_sizes, valid_errors, \"b-\", linewidth=3, label=\"valid\")\n",
     "\n",
@@ -1124,11 +1124,11 @@
    "source": [
     "# extra code – this cell generates and saves Figure 4–17\n",
     "\n",
-    "def plot_model(model_class, polynomial, alphas, **model_kargs):\n",
+    "def plot_model(model_class, polynomial, alphas, **model_kwargs):\n",
     "    plt.plot(X, y, \"b.\", linewidth=3)\n",
     "    for alpha, style in zip(alphas, (\"b:\", \"g--\", \"r-\")):\n",
     "        if alpha > 0:\n",
-    "            model = model_class(alpha, **model_kargs)\n",
+    "            model = model_class(alpha, **model_kwargs)\n",
     "        else:\n",
     "            model = LinearRegression()\n",
     "        if polynomial:\n",
@@ -1875,7 +1875,7 @@
     "plt.plot([decision_boundary, decision_boundary], [0, 1], \"k:\", linewidth=2,\n",
     "         label=\"Decision boundary\")\n",
     "\n",
-    "# extra code – this section beautifies and saves Figure 4–21\n",
+    "# extra code – this section beautifies and saves Figure 4–23\n",
     "plt.arrow(x=decision_boundary, y=0.08, dx=-0.3, dy=0,\n",
     "          head_width=0.05, head_length=0.1, fc=\"b\", ec=\"b\")\n",
     "plt.arrow(x=decision_boundary, y=0.92, dx=0.3, dy=0,\n",
@@ -1951,7 +1951,7 @@
     }
    ],
    "source": [
-    "# extra code – this cell generates and saves Figure 4–22\n",
+    "# extra code – this cell generates and saves Figure 4–24\n",
     "\n",
     "X = iris.data[[\"petal length (cm)\", \"petal width (cm)\"]].values\n",
     "y = iris.target_names[iris.target] == 'virginica'\n",
@@ -2083,7 +2083,7 @@
     }
    ],
    "source": [
-    "# extra code – this cell generates and saves Figure 4–23\n",
+    "# extra code – this cell generates and saves Figure 4–25\n",
     "\n",
     "from matplotlib.colors import ListedColormap\n",
     "\n",
@@ -2195,7 +2195,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The easiest option to split the dataset into a training set, a validation set and a test set would be to use Scikit-Learn's `train_test_split()` function, but again, we want to did this manually:"
+    "The easiest option to split the dataset into a training set, a validation set and a test set would be to use Scikit-Learn's `train_test_split()` function, but again, we want to do it manually:"
    ]
   },
   {
@@ -2227,7 +2227,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all classes except for the target class which will have a probability of 1.0 (in other words, the vector of class probabilities for any given instance is a one-hot vector). Let's write a small function to convert the vector of class indices into a matrix containing a one-hot vector for each instance. To understand this code, you need to know that `np.diag(np.ones(n))` creates an n×n matrix full of 0s except for 1s on the main diagonal. Moreover, if `a` in a NumPy array, then `a[[1, 3, 2]]` returns an array with 3 rows equal to `a[1]`, `a[3]` and `a[2]` (this is [advanced NumPy indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#advanced-indexing))."
+    "The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all classes except for the target class which will have a probability of 1.0 (in other words, the vector of class probabilities for any given instance is a one-hot vector). Let's write a small function to convert the vector of class indices into a matrix containing a one-hot vector for each instance. To understand this code, you need to know that `np.diag(np.ones(n))` creates an n×n matrix full of 0s except for 1s on the main diagonal. Moreover, if `a` is a NumPy array, then `a[[1, 3, 2]]` returns an array with 3 rows equal to `a[1]`, `a[3]` and `a[2]` (this is [advanced NumPy indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing))."
    ]
   },
   {
@@ -2662,7 +2662,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Oh well, still no change in validation acccuracy, but at least early training shortened training a bit."
+    "Oh well, still no change in validation accuracy, but at least early stopping shortened training a bit."
    ]
   },
   {
diff --git a/05_support_vector_machines.ipynb b/05_support_vector_machines.ipynb
index cd23551..ef01ded 100644
--- a/05_support_vector_machines.ipynb
+++ b/05_support_vector_machines.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "_This notebook is an extra chapter on Support Vector Machines. It also includes exercises and their solutions at the end._"
+    "_This notebook contains all the sample code and solutions to the exercises in chapter 5._"
    ]
   },
   {
@@ -540,7 +540,7 @@
     "plt.figure(figsize=(10, 3))\n",
     "\n",
     "plt.subplot(121)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.plot(X1D[:, 0][y==0], np.zeros(4), \"bs\")\n",
     "plt.plot(X1D[:, 0][y==1], np.zeros(5), \"g^\")\n",
@@ -549,7 +549,7 @@
     "plt.axis([-4.5, 4.5, -0.2, 0.2])\n",
     "\n",
     "plt.subplot(122)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.axvline(x=0, color='k')\n",
     "plt.plot(X2D[:, 0][y==0], X2D[:, 1][y==0], \"bs\")\n",
@@ -624,7 +624,7 @@
     "    plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"bs\")\n",
     "    plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"g^\")\n",
     "    plt.axis(axes)\n",
-    "    plt.grid(True, which='both')\n",
+    "    plt.grid(True)\n",
     "    plt.xlabel(\"$x_1$\")\n",
     "    plt.ylabel(\"$x_2$\", rotation=0)\n",
     "\n",
@@ -766,7 +766,7 @@
     "plt.figure(figsize=(10.5, 4))\n",
     "\n",
     "plt.subplot(121)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.scatter(x=[-2, 1], y=[0, 0], s=150, alpha=0.5, c=\"red\")\n",
     "plt.plot(X1D[:, 0][yk==0], np.zeros(4), \"bs\")\n",
@@ -789,7 +789,7 @@
     "plt.axis([-4.5, 4.5, -0.1, 1.1])\n",
     "\n",
     "plt.subplot(122)\n",
-    "plt.grid(True, which='both')\n",
+    "plt.grid(True)\n",
     "plt.axhline(y=0, color='k')\n",
     "plt.axvline(x=0, color='k')\n",
     "plt.plot(XK[:, 0][yk==0], XK[:, 1][yk==0], \"bs\")\n",
@@ -1185,7 +1185,7 @@
     "        axs, (hinge_pos, hinge_pos ** 2), (hinge_neg, hinge_neg ** 2), titles):\n",
     "    ax.plot(s, loss_pos, \"g-\", linewidth=2, zorder=10, label=\"$t=1$\")\n",
     "    ax.plot(s, loss_neg, \"r--\", linewidth=2, zorder=10, label=\"$t=-1$\")\n",
-    "    ax.grid(True, which='both')\n",
+    "    ax.grid(True)\n",
     "    ax.axhline(y=0, color='k')\n",
     "    ax.axvline(x=0, color='k')\n",
     "    ax.set_xlabel(r\"$s = \\mathbf{w}^\\intercal \\mathbf{x} + b$\")\n",
@@ -1250,10 +1250,9 @@
     "        w = np.random.randn(X.shape[1], 1)  # n feature weights\n",
     "        b = 0\n",
     "\n",
-    "        m = len(X)\n",
     "        t = np.array(y, dtype=np.float64).reshape(-1, 1) * 2 - 1\n",
     "        X_t = X * t\n",
-    "        self.Js=[]\n",
+    "        self.Js = []\n",
     "\n",
     "        # Training\n",
     "        for epoch in range(self.n_epochs):\n",
@@ -2249,7 +2248,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This tuned kernelized SVM performs better than the `LinearSVC` model, but we get a lower score on the test set than we measured using cross-validation. This is quite common: since we did so much hyperparameter tuning, we ended up slightly overfitting the cross-validation test sets. It's tempting to tweak the hyperparameters a bit more until we get a better result on the test set, but we this would probably not help, as we would just start overfitting the test set. Anyway, this score is not bad at all, so let's stop here."
+    "This tuned kernelized SVM performs better than the `LinearSVC` model, but we get a lower score on the test set than we measured using cross-validation. This is quite common: since we did so much hyperparameter tuning, we ended up slightly overfitting the cross-validation test sets. It's tempting to tweak the hyperparameters a bit more until we get a better result on the test set, but this would probably not help, as we would just start overfitting the test set. Anyway, this score is not bad at all, so let's stop here."
    ]
   },
   {
@@ -2309,7 +2308,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Don't forget to scale the data:"
+    "Don't forget to scale the data!"
    ]
   },
   {
diff --git a/06_decision_trees.ipynb b/06_decision_trees.ipynb
index 15d2bbd..32b07ba 100644
--- a/06_decision_trees.ipynb
+++ b/06_decision_trees.ipynb
@@ -1692,7 +1692,7 @@
     "2. A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 – (1/5)² – (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 – (1/2)² – (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so its overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's Gini impurity.\n",
     "3. If a Decision Tree is overfitting the training set, it may be a good idea to decrease `max_depth`, since this will constrain the model, regularizing it.\n",
     "4. Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.\n",
-    "5. The computational complexity of training a Decision Tree is 𝓞(_n_ × _m_ log(_m_)). So if you multiply the training set size by 10, the training time will be multiplied by _K_ = (_n_ × 10 _m_ × log(10 _m_)) / (_n_ × _m_ × log(_m_)) = 10 × log(10 _m_) / log(_m_). If _m_ = 10<sup>6</sup>, then _K_ ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.\n",
+    "5. The computational complexity of training a Decision Tree is _O_(_n_ × _m_ log₂(_m_)). So if you multiply the training set size by 10, the training time will be multiplied by _K_ = (_n_ × 10 _m_ × log₂(10 _m_)) / (_n_ × _m_ × log₂(_m_)) = 10 × log₂(10 _m_) / log₂(_m_). If _m_ = 10<sup>6</sup>, then _K_ ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.\n",
     "6. If the number of features doubles, then the training time will also roughly double."
    ]
   },
diff --git a/08_dimensionality_reduction.ipynb b/08_dimensionality_reduction.ipynb
index 06656e1..e555b80 100644
--- a/08_dimensionality_reduction.ipynb
+++ b/08_dimensionality_reduction.ipynb
@@ -193,7 +193,6 @@
     "# extra code – this cell generates and saves Figure 8–2\n",
     "\n",
     "import matplotlib.pyplot as plt\n",
-    "from mpl_toolkits.mplot3d import Axes3D\n",
     "from sklearn.decomposition import PCA\n",
     "\n",
     "pca = PCA(n_components=2)\n",
@@ -601,7 +600,7 @@
    "source": [
     "import numpy as np\n",
     "\n",
-    "# X = [...]  # the small 3D dataset was created ealier in this notebook\n",
+    "# X = [...]  # the small 3D dataset was created earlier in this notebook\n",
     "X_centered = X - X.mean(axis=0)\n",
     "U, s, Vt = np.linalg.svd(X_centered)\n",
     "c1 = Vt[0]\n",
@@ -692,13 +691,6 @@
     "pca.components_"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Recover the 3D points projected on the plane (PCA 2D subspace)."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -862,13 +854,6 @@
     "pca.explained_variance_ratio_.sum()  # extra code"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Code to generate Figure 8–8. Explained variance as a function of the number of dimensions:**"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 25,
@@ -888,6 +873,8 @@
     }
    ],
    "source": [
+    "# extra code – this cell generates and saves Figure 8–8\n",
+    "\n",
     "plt.figure(figsize=(6, 4))\n",
     "plt.plot(cumsum, linewidth=3)\n",
     "plt.axis([0, 400, 0, 1])\n",
@@ -1125,14 +1112,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Using `memmap()`:**"
+    "**Using NumPy's `memmap` class – a memory-map to an array stored in a binary file on disk.**"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's create the `memmap()` structure, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
+    "Let's create the `memmap` instance, copy the MNIST training set into it, and call `flush()` which ensures that any data still in cache is saved to disk. This would typically be done by a first program:"
    ]
   },
   {
@@ -1267,7 +1254,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell may take several minutes to run:"
+    "**Warning**, the following cell may take several minutes to run:"
    ]
   },
   {
@@ -1568,7 +1555,7 @@
     "    * It adds some complexity to your Machine Learning pipelines.\n",
     "    * Transformed features are often hard to interpret.\n",
     "2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally far from one another, increasing the risk of overfitting and making it very difficult to identify patterns without having plenty of training data.\n",
-    "3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as T-SNE) do not.\n",
+    "3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE) do not.\n",
     "4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in the Swiss roll dataset—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.\n",
     "5. That's a trick question: it depends on the dataset. Let's look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset's intrinsic dimensionality.\n",
     "6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Random Projection is great for very high-dimensional datasets.\n",
@@ -2109,14 +2096,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Exercise: _Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
+    "Exercise: _Alternatively, you can replace each dot in the scatterplot with the corresponding instance’s class (a digit from 0 to 9), or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits._"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired from one of Scikit-Learn's excellent examples ([plot_lle_digits](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
+    "Let's create a `plot_digits()` function that will draw a scatterplot (similar to the above scatterplots) plus write colored digits, with a minimum distance guaranteed between these digits. If the digit images are provided, they are plotted instead. This implementation was inspired by one of Scikit-Learn's excellent examples ([plot_lle_digits](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), based on a different digit dataset)."
    ]
   },
   {
@@ -2400,7 +2387,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
+    "**Warning**, the following cell will take about 10 to 30 minutes to run, depending on your hardware:"
    ]
   },
   {
@@ -2446,7 +2433,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: the following cell will take about 10 minutes to run, depending on your hardware:"
+    "**Warning**, the following cell will take about 10 to 30 minutes to run, depending on your hardware:"
    ]
   },
   {
diff --git a/README.md b/README.md
index 6fbff2a..0de1bd0 100644
--- a/README.md
+++ b/README.md
@@ -76,4 +76,4 @@ See [INSTALL.md](INSTALL.md)
 See [INSTALL.md](INSTALL.md)
 
 ## Contributors
-I would like to thank everyone [who contributed to this project](https://github.com/ageron/handson-ml3/graphs/contributors), either by providing useful feedback, filing issues or submitting Pull Requests. Special thanks go to Haesun Park and Ian Beauregard who reviewed every notebook and submitted many PRs, including help on some of the exercise solutions. Thanks as well to Steven Bunkley and Ziembla who created the `docker` directory, and to github user SuperYorio who helped on some exercise solutions.
+I would like to thank everyone [who contributed to this project](https://github.com/ageron/handson-ml3/graphs/contributors), either by providing useful feedback, filing issues or submitting Pull Requests. Special thanks go to Haesun Park and Ian Beauregard who reviewed every notebook and submitted many PRs, including help on some of the exercise solutions. Thanks as well to Steven Bunkley and Ziembla who created the `docker` directory, and to GitHub user SuperYorio who helped on some exercise solutions. And last but not least, thanks a lot to Victor Khaustov who submitted plenty of excellent PRs, fixing many errors.