From 11b984eb922197a5423d67c88c220f130bcacb98 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Sun, 18 Jun 2017 13:52:10 +0200 Subject: [PATCH] Make fetch_housing_data() work on Windows --- 02_end_to_end_machine_learning_project.ipynb | 952 ++++--------------- 1 file changed, 174 insertions(+), 778 deletions(-) diff --git a/02_end_to_end_machine_learning_project.ipynb b/02_end_to_end_machine_learning_project.ipynb index 895a484..d6c133e 100644 --- a/02_end_to_end_machine_learning_project.ipynb +++ b/02_end_to_end_machine_learning_project.ipynb @@ -2,10 +2,7 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "**Chapter 2 – End-to-end Machine Learning project**\n", "\n", @@ -16,30 +13,21 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "**Note**: You may find little differences between the code outputs in the book and in these Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I added notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures:" ] @@ -47,11 +35,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "# To support both python 2 and python 3\n", @@ -86,10 +70,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Get the data" ] @@ -97,11 +78,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -109,8 +86,8 @@ "from six.moves import urllib\n", "\n", "DOWNLOAD_ROOT = \"https://raw.githubusercontent.com/ageron/handson-ml/master/\"\n", - "HOUSING_PATH = \"datasets/housing\"\n", - "HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + \"/housing.tgz\"\n", + "HOUSING_PATH = os.path.join(\"datasets\", \"housing\")\n", + "HOUSING_URL = DOWNLOAD_ROOT + \"datasets/housing/housing.tgz\"\n", "\n", "def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):\n", " if not os.path.isdir(housing_path):\n", @@ -125,11 +102,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "fetch_housing_data()" @@ -139,9 +112,7 @@ "cell_type": "code", "execution_count": 4, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -155,11 +126,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing = load_housing_data()\n", @@ -169,11 +136,7 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.info()" @@ -182,11 +145,7 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing[\"ocean_proximity\"].value_counts()" @@ -195,11 +154,7 @@ { "cell_type": "code", "execution_count": 8, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.describe()" @@ -208,11 +163,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", @@ -226,9 +177,7 @@ "cell_type": "code", "execution_count": 10, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -239,11 +188,7 @@ { "cell_type": "code", "execution_count": 11, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", @@ -260,11 +205,7 @@ { "cell_type": "code", "execution_count": 12, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "train_set, test_set = split_train_test(housing, 0.2)\n", @@ -274,11 +215,7 @@ { "cell_type": "code", "execution_count": 13, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "import hashlib\n", @@ -296,9 +233,7 @@ "cell_type": "code", "execution_count": 14, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -310,11 +245,7 @@ { "cell_type": "code", "execution_count": 15, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_with_id = housing.reset_index() # adds an `index` column\n", @@ -324,11 +255,7 @@ { "cell_type": "code", "execution_count": 16, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_with_id[\"id\"] = housing[\"longitude\"] * 1000 + housing[\"latitude\"]\n", @@ -338,11 +265,7 @@ { "cell_type": "code", "execution_count": 17, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "test_set.head()" @@ -351,11 +274,7 @@ { "cell_type": "code", "execution_count": 18, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", @@ -366,11 +285,7 @@ { "cell_type": "code", "execution_count": 19, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "test_set.head()" @@ -379,11 +294,7 @@ { "cell_type": "code", "execution_count": 20, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing[\"median_income\"].hist()" @@ -392,11 +303,7 @@ { "cell_type": "code", "execution_count": 21, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "# Divide by 1.5 to limit the number of income categories\n", @@ -408,11 +315,7 @@ { "cell_type": "code", "execution_count": 22, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].value_counts()" @@ -421,11 +324,7 @@ { "cell_type": "code", "execution_count": 23, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].hist()" @@ -434,11 +333,7 @@ { "cell_type": "code", "execution_count": 24, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", @@ -452,11 +347,7 @@ { "cell_type": "code", "execution_count": 25, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].value_counts() / len(housing)" @@ -465,11 +356,7 @@ { "cell_type": "code", "execution_count": 26, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "def income_cat_proportions(data):\n", @@ -489,11 +376,7 @@ { "cell_type": "code", "execution_count": 27, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "compare_props" @@ -502,11 +385,7 @@ { "cell_type": "code", "execution_count": 28, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "for set_ in (strat_train_set, strat_test_set):\n", @@ -515,10 +394,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Discover and visualize the data to gain insights" ] @@ -527,9 +403,7 @@ "cell_type": "code", "execution_count": 29, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -539,11 +413,7 @@ { "cell_type": "code", "execution_count": 30, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\")\n", @@ -553,11 +423,7 @@ { "cell_type": "code", "execution_count": 31, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.1)\n", @@ -566,10 +432,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). This is a temporary fix (see: https://github.com/pandas-dev/pandas/issues/10611). Thanks to Wilmer Arellano for pointing it out." ] @@ -577,11 +440,7 @@ { "cell_type": "code", "execution_count": 32, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.4,\n", @@ -595,11 +454,7 @@ { "cell_type": "code", "execution_count": 33, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "import matplotlib.image as mpimg\n", @@ -628,9 +483,7 @@ "cell_type": "code", "execution_count": 34, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -640,11 +493,7 @@ { "cell_type": "code", "execution_count": 35, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "corr_matrix[\"median_house_value\"].sort_values(ascending=False)" @@ -653,11 +502,7 @@ { "cell_type": "code", "execution_count": 36, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from pandas.tools.plotting import scatter_matrix\n", @@ -671,11 +516,7 @@ { "cell_type": "code", "execution_count": 37, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"median_income\", y=\"median_house_value\",\n", @@ -688,9 +529,7 @@ "cell_type": "code", "execution_count": 38, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -701,10 +540,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Note: there was a bug in the previous cell, in the definition of the `rooms_per_household` attribute. This explains why the correlation value below differs slightly from the value in the book (unless you are reading the latest version)." ] @@ -712,11 +548,7 @@ { "cell_type": "code", "execution_count": 39, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "corr_matrix = housing.corr()\n", @@ -726,11 +558,7 @@ { "cell_type": "code", "execution_count": 40, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"rooms_per_household\", y=\"median_house_value\",\n", @@ -742,11 +570,7 @@ { "cell_type": "code", "execution_count": 41, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.describe()" @@ -754,10 +578,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Prepare the data for Machine Learning algorithms" ] @@ -766,9 +587,7 @@ "cell_type": "code", "execution_count": 42, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -779,11 +598,7 @@ { "cell_type": "code", "execution_count": 43, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()\n", @@ -793,11 +608,7 @@ { "cell_type": "code", "execution_count": 44, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows.dropna(subset=[\"total_bedrooms\"]) # option 1" @@ -806,11 +617,7 @@ { "cell_type": "code", "execution_count": 45, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows.drop(\"total_bedrooms\", axis=1) # option 2" @@ -819,11 +626,7 @@ { "cell_type": "code", "execution_count": 46, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "median = housing[\"total_bedrooms\"].median()\n", @@ -834,11 +637,7 @@ { "cell_type": "code", "execution_count": 47, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import Imputer\n", @@ -856,11 +655,7 @@ { "cell_type": "code", "execution_count": 48, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_num = housing.drop(\"ocean_proximity\", axis=1)" @@ -869,11 +664,7 @@ { "cell_type": "code", "execution_count": 49, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "imputer.fit(housing_num)" @@ -882,11 +673,7 @@ { "cell_type": "code", "execution_count": 50, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "imputer.statistics_" @@ -902,11 +689,7 @@ { "cell_type": "code", "execution_count": 51, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_num.median().values" @@ -923,9 +706,7 @@ "cell_type": "code", "execution_count": 52, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -935,11 +716,7 @@ { "cell_type": "code", "execution_count": 53, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns,\n", @@ -949,11 +726,7 @@ { "cell_type": "code", "execution_count": 54, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_tr.loc[sample_incomplete_rows.index.values]" @@ -962,11 +735,7 @@ { "cell_type": "code", "execution_count": 55, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "imputer.strategy" @@ -975,11 +744,7 @@ { "cell_type": "code", "execution_count": 56, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns)\n", @@ -989,11 +754,7 @@ { "cell_type": "code", "execution_count": 57, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", @@ -1007,11 +768,7 @@ { "cell_type": "code", "execution_count": 58, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "print(encoder.classes_)" @@ -1020,11 +777,7 @@ { "cell_type": "code", "execution_count": 59, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", @@ -1037,11 +790,7 @@ { "cell_type": "code", "execution_count": 60, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_cat_1hot.toarray()" @@ -1050,11 +799,7 @@ { "cell_type": "code", "execution_count": 61, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelBinarizer\n", @@ -1067,11 +812,7 @@ { "cell_type": "code", "execution_count": 62, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", @@ -1101,11 +842,7 @@ { "cell_type": "code", "execution_count": 63, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+[\"rooms_per_household\", \"population_per_household\"])\n", @@ -1115,11 +852,7 @@ { "cell_type": "code", "execution_count": 64, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", @@ -1137,11 +870,7 @@ { "cell_type": "code", "execution_count": 65, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_num_tr" @@ -1151,9 +880,7 @@ "cell_type": "code", "execution_count": 66, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -1174,9 +901,7 @@ "cell_type": "code", "execution_count": 67, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -1199,11 +924,7 @@ { "cell_type": "code", "execution_count": 68, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import FeatureUnion\n", @@ -1217,11 +938,7 @@ { "cell_type": "code", "execution_count": 69, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_prepared = full_pipeline.fit_transform(housing)\n", @@ -1231,11 +948,7 @@ { "cell_type": "code", "execution_count": 70, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_prepared.shape" @@ -1243,10 +956,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Select and train a model " ] @@ -1254,11 +964,7 @@ { "cell_type": "code", "execution_count": 71, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", @@ -1270,11 +976,7 @@ { "cell_type": "code", "execution_count": 72, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "# let's try the full pipeline on a few training instances\n", @@ -1295,11 +997,7 @@ { "cell_type": "code", "execution_count": 73, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "print(\"Labels:\", list(some_labels))" @@ -1308,11 +1006,7 @@ { "cell_type": "code", "execution_count": 74, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "some_data_prepared" @@ -1321,11 +1015,7 @@ { "cell_type": "code", "execution_count": 75, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", @@ -1339,11 +1029,7 @@ { "cell_type": "code", "execution_count": 76, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", @@ -1355,11 +1041,7 @@ { "cell_type": "code", "execution_count": 77, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", @@ -1371,11 +1053,7 @@ { "cell_type": "code", "execution_count": 78, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_predictions = tree_reg.predict(housing_prepared)\n", @@ -1386,10 +1064,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Fine-tune your model" ] @@ -1397,11 +1072,7 @@ { "cell_type": "code", "execution_count": 79, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", @@ -1414,11 +1085,7 @@ { "cell_type": "code", "execution_count": 80, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "def display_scores(scores):\n", @@ -1432,11 +1099,7 @@ { "cell_type": "code", "execution_count": 81, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,\n", @@ -1448,11 +1111,7 @@ { "cell_type": "code", "execution_count": 82, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", @@ -1464,11 +1123,7 @@ { "cell_type": "code", "execution_count": 83, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_predictions = forest_reg.predict(housing_prepared)\n", @@ -1480,11 +1135,7 @@ { "cell_type": "code", "execution_count": 84, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", @@ -1498,11 +1149,7 @@ { "cell_type": "code", "execution_count": 85, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring=\"neg_mean_squared_error\", cv=10)\n", @@ -1512,11 +1159,7 @@ { "cell_type": "code", "execution_count": 86, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVR\n", @@ -1532,11 +1175,7 @@ { "cell_type": "code", "execution_count": 87, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", @@ -1565,11 +1204,7 @@ { "cell_type": "code", "execution_count": 88, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" @@ -1578,11 +1213,7 @@ { "cell_type": "code", "execution_count": 89, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "grid_search.best_estimator_" @@ -1598,11 +1229,7 @@ { "cell_type": "code", "execution_count": 90, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "cvres = grid_search.cv_results_\n", @@ -1613,11 +1240,7 @@ { "cell_type": "code", "execution_count": 91, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(grid_search.cv_results_)" @@ -1626,11 +1249,7 @@ { "cell_type": "code", "execution_count": 92, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", @@ -1650,11 +1269,7 @@ { "cell_type": "code", "execution_count": 93, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "cvres = rnd_search.cv_results_\n", @@ -1665,11 +1280,7 @@ { "cell_type": "code", "execution_count": 94, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "feature_importances = grid_search.best_estimator_.feature_importances_\n", @@ -1679,11 +1290,7 @@ { "cell_type": "code", "execution_count": 95, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "extra_attribs = [\"rooms_per_hhold\", \"pop_per_hhold\", \"bedrooms_per_room\"]\n", @@ -1696,9 +1303,7 @@ "cell_type": "code", "execution_count": 96, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -1717,11 +1322,7 @@ { "cell_type": "code", "execution_count": 97, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "final_rmse" @@ -1729,20 +1330,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Extra material" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Label Binarizer hack\n", "`LabelBinarizer`'s `fit_transform()` method only accepts one parameter `y` (because it was meant for labels, not predictors), so it does not work in a pipeline where the final estimator is a supervised estimator because in this case its `fit()` method takes two parameters `X` and `y`.\n", @@ -1753,11 +1348,7 @@ { "cell_type": "code", "execution_count": 98, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "class SupervisionFriendlyLabelBinarizer(LabelBinarizer):\n", @@ -1779,10 +1370,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Model persistence using joblib" ] @@ -1791,9 +1379,7 @@ "cell_type": "code", "execution_count": 99, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -1804,9 +1390,7 @@ "cell_type": "code", "execution_count": 100, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -1818,10 +1402,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Example SciPy distributions for `RandomizedSearchCV`" ] @@ -1829,11 +1410,7 @@ { "cell_type": "code", "execution_count": 101, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from scipy.stats import geom, expon\n", @@ -1847,20 +1424,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Exercise solutions" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## 1." ] @@ -1868,9 +1439,7 @@ { "cell_type": "markdown", "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "source": [ "Question: Try a Support Vector Machine regressor (`sklearn.svm.SVR`), with various hyperparameters such as `kernel=\"linear\"` (with various values for the `C` hyperparameter) or `kernel=\"rbf\"` (with various values for the `C` and `gamma` hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best `SVR` predictor perform?" @@ -1879,11 +1448,7 @@ { "cell_type": "code", "execution_count": 102, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", @@ -1901,10 +1466,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 5-fold cross validation):" ] @@ -1912,11 +1474,7 @@ { "cell_type": "code", "execution_count": 103, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "negative_mse = grid_search.best_score_\n", @@ -1926,10 +1484,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "That's much worse than the `RandomForestRegressor`. Let's check the best hyperparameters found:" ] @@ -1937,11 +1492,7 @@ { "cell_type": "code", "execution_count": 104, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" @@ -1949,30 +1500,21 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The linear kernel seems better than the RBF kernel. Notice that the value of `C` is the maximum tested value. When this happens you definitely want to launch the grid search again with higher values for `C` (removing the smallest values), because it is likely that higher values of `C` will be better." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## 2." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Question: Try replacing `GridSearchCV` with `RandomizedSearchCV`." ] @@ -1980,11 +1522,7 @@ { "cell_type": "code", "execution_count": 105, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", @@ -2009,10 +1547,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 5-fold cross validation):" ] @@ -2020,11 +1555,7 @@ { "cell_type": "code", "execution_count": 106, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "negative_mse = rnd_search.best_score_\n", @@ -2034,10 +1565,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Now this is much closer to the performance of the `RandomForestRegressor` (but not quite there yet). Let's check the best hyperparameters found:" ] @@ -2045,11 +1573,7 @@ { "cell_type": "code", "execution_count": 107, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "rnd_search.best_params_" @@ -2057,20 +1581,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "This time the search found a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's look at the exponential distribution we used, with `scale=1.0`. Note that some samples are much larger or smaller than 1.0, but when you look at the log of the distribution, you can see that most values are actually concentrated roughly in the range of exp(-2) to exp(+2), which is about 0.1 to 7.4." ] @@ -2078,11 +1596,7 @@ { "cell_type": "code", "execution_count": 108, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "expon_distrib = expon(scale=1.)\n", @@ -2099,10 +1613,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The distribution we used for `C` looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. This distribution is useful when you don't have a clue of what the target scale is:" ] @@ -2110,11 +1621,7 @@ { "cell_type": "code", "execution_count": 109, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "reciprocal_distrib = reciprocal(20, 200000)\n", @@ -2131,30 +1638,21 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## 3." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Question: Try adding a transformer in the preparation pipeline to select only the most important attributes." ] @@ -2163,9 +1661,7 @@ "cell_type": "code", "execution_count": 110, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -2187,20 +1683,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a `RandomForestRegressor`). You may be tempted to compute them directly in the `TopFeatureSelector`'s `fit()` method, however this would likely slow down grid/randomized search since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache)." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's define the number of top features we want to keep:" ] @@ -2209,9 +1699,7 @@ "cell_type": "code", "execution_count": 111, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -2220,10 +1708,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Now let's look for the indices of the top k features:" ] @@ -2231,11 +1716,7 @@ { "cell_type": "code", "execution_count": 112, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "top_k_feature_indices = indices_of_top_k(feature_importances, k)\n", @@ -2245,11 +1726,7 @@ { "cell_type": "code", "execution_count": 113, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "np.array(attributes)[top_k_feature_indices]" @@ -2257,10 +1734,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's double check that these are indeed the top k features:" ] @@ -2268,11 +1742,7 @@ { "cell_type": "code", "execution_count": 114, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "sorted(zip(feature_importances, attributes), reverse=True)[:k]" @@ -2280,10 +1750,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Looking good... Now let's create a new pipeline that runs the previously defined preparation pipeline, and adds top k feature selection:" ] @@ -2291,11 +1758,7 @@ { "cell_type": "code", "execution_count": 115, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "preparation_and_feature_selection_pipeline = Pipeline([\n", @@ -2308,9 +1771,7 @@ "cell_type": "code", "execution_count": 116, "metadata": { - "collapsed": true, - "deletable": true, - "editable": true + "collapsed": true }, "outputs": [], "source": [ @@ -2319,10 +1780,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's look at the features of the first 3 instances:" ] @@ -2330,11 +1788,7 @@ { "cell_type": "code", "execution_count": 117, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_prepared_top_k_features[0:3]" @@ -2342,10 +1796,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Now let's double check that these are indeed the top k features:" ] @@ -2353,11 +1804,7 @@ { "cell_type": "code", "execution_count": 118, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing_prepared[0:3, top_k_feature_indices]" @@ -2365,30 +1812,21 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Works great! :)" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## 4." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Question: Try creating a single pipeline that does the full data preparation plus the final prediction." ] @@ -2396,11 +1834,7 @@ { "cell_type": "code", "execution_count": 119, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "prepare_select_and_predict_pipeline = Pipeline([\n", @@ -2413,11 +1847,7 @@ { "cell_type": "code", "execution_count": 120, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "prepare_select_and_predict_pipeline.fit(housing, housing_labels)" @@ -2425,10 +1855,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's try the full pipeline on a few instances:" ] @@ -2436,11 +1863,7 @@ { "cell_type": "code", "execution_count": 121, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "some_data = housing.iloc[:4]\n", @@ -2452,30 +1875,21 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Well, the full pipeline seems to work fine. Of course, the predictions are not fantastic: they would be better if we used the best `RandomForestRegressor` that we found earlier, rather than the best `SVR`." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## 5." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Question: Automatically explore some preparation options using `GridSearchCV`." ] @@ -2483,11 +1897,7 @@ { "cell_type": "code", "execution_count": 122, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "param_grid = [\n", @@ -2503,11 +1913,7 @@ { "cell_type": "code", "execution_count": 123, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "grid_search_prep.best_params_" @@ -2515,10 +1921,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Great! It seems that we had the right imputer stragegy (mean), and apparently only the top 7 features are useful (out of 9), the last 2 seem to just add some noise." ] @@ -2526,11 +1929,7 @@ { "cell_type": "code", "execution_count": 124, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "housing.shape" @@ -2538,10 +1937,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Congratulations! You already know quite a lot about Machine Learning. :)" ] @@ -2580,5 +1976,5 @@ } }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 1 }