Clarify a few messages

2021-10-28 16:02:31 +13:00
commit 06aa1f1dfb
parent b75811f999
1 changed file with 26 additions and 9 deletions
--- a/02_end_to_end_machine_learning_project.ipynb
+++ b/02_end_to_end_machine_learning_project.ipynb
@@ -177,6 +177,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "# Not in the book\n",
    "import matplotlib as mpl\n",
    "\n",
    "mpl.rc('font', size=12)\n",
@@ -197,6 +198,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "# Not in the book\n",
+    "\n",
    "# Where to save the figures\n",
    "IMAGES_PATH = Path() / \"images\" / \"end_to_end_project\"\n",
    "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n",
@@ -1997,6 +2000,13 @@
    "pd.Series(lin_rmses).describe()"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning:** the following cell may take a few minutes to run:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 129,
@@ -2011,13 +2021,6 @@
    "                                scoring=\"neg_root_mean_squared_error\", cv=10)"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Warning:** the following cell may take a few minutes to run:"
-   ]
-  },
  {
   "cell_type": "code",
   "execution_count": 130,
@@ -2031,7 +2034,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Measure the RMSE on the training set:"
+    "Let's compare this RMSE measured using cross-validation (the \"validation error\") with the RMSE measured on the training set (the \"training error\"):"
   ]
  },
  {
@@ -2047,6 +2050,13 @@
    "forest_rmse"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The training error is much lower than the validation error, which usually means that the model has overfit the training set. Another possible explanation may be that there's a mismatch between the training data and the validation data, but it's not the case here, since both came from the same dataset that we shuffled and split in two parts."
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -2645,6 +2655,13 @@
    "_Try replacing the `GridSearchCV` with a `RandomizedSearchCV`._"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning:** the following cell will take several minutes to run. You can specify `verbose=2` when creating the `RandomizedSearchCV` if you want to see the training details."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 155,
@@ -3016,7 +3033,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Oh well... at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both?"
+    "Oh well... at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both? And maybe training on the full training set would help as well."
   ]
  },
  {