From 06aa1f1dfb85b879697ac96c92bc06e1151ac3a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien=20Geron?=
Date: Thu, 28 Oct 2021 16:02:31 +1300
Subject: [PATCH] Clarify a few messages

---
 02_end_to_end_machine_learning_project.ipynb | 35 +++++++++++++++-----
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/02_end_to_end_machine_learning_project.ipynb b/02_end_to_end_machine_learning_project.ipynb
index 0e231ee..16eb1bd 100644
--- a/02_end_to_end_machine_learning_project.ipynb
+++ b/02_end_to_end_machine_learning_project.ipynb
@@ -177,6 +177,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Not in the book\n",
     "import matplotlib as mpl\n",
     "\n",
     "mpl.rc('font', size=12)\n",
@@ -197,6 +198,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Not in the book\n",
+    "\n",
     "# Where to save the figures\n",
     "IMAGES_PATH = Path() / \"images\" / \"end_to_end_project\"\n",
     "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n",
@@ -1997,6 +2000,13 @@
     "pd.Series(lin_rmses).describe()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning:** the following cell may take a few minutes to run:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 129,
@@ -2011,13 +2021,6 @@
     " scoring=\"neg_root_mean_squared_error\", cv=10)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Warning:** the following cell may take a few minutes to run:"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 130,
@@ -2031,7 +2034,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Measure the RMSE on the training set:"
+    "Let's compare this RMSE measured using cross-validation (the \"validation error\") with the RMSE measured on the training set (the \"training error\"):"
    ]
   },
   {
@@ -2047,6 +2050,13 @@
     "forest_rmse"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The training error is much lower than the validation error, which usually means that the model has overfit the training set. Another possible explanation may be that there's a mismatch between the training data and the validation data, but it's not the case here, since both came from the same dataset that we shuffled and split in two parts."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -2645,6 +2655,13 @@
     "_Try replacing the `GridSearchCV` with a `RandomizedSearchCV`._"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning:** the following cell will take several minutes to run. You can specify `verbose=2` when creating the `RandomizedSearchCV` if you want to see the training details."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 155,
@@ -3016,7 +3033,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Oh well... at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both?"
+    "Oh well... at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both? And maybe training on the full training set would help as well."
    ]
   },
   {