{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Chapter 2 – End-to-end Machine Learning project**\n", "\n", "*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*\n", "\n", "*This notebook contains all the sample code and solutions to the exercises in chapter 2.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Python ≥3.5 is required\n", "import sys\n", "assert sys.version_info >= (3, 5)\n", "\n", "# Scikit-Learn ≥0.20 is required\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.20\"\n", "\n", "# Common imports\n", "import numpy as np\n", "import os\n", "\n", "# To plot pretty figures\n", "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "mpl.rc('axes', labelsize=14)\n", "mpl.rc('xtick', labelsize=12)\n", "mpl.rc('ytick', labelsize=12)\n", "\n", "# Where to save the figures\n", "PROJECT_ROOT_DIR = \".\"\n", "CHAPTER_ID = \"end_to_end_project\"\n", "IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n", "os.makedirs(IMAGES_PATH, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", "    path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n", "    print(\"Saving figure\", fig_id)\n", "    if tight_layout:\n", "        plt.tight_layout()\n", "    plt.savefig(path, format=fig_extension, dpi=resolution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tarfile\n", "import urllib.request\n", "\n", "DOWNLOAD_ROOT = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", "HOUSING_PATH = os.path.join(\"datasets\", \"housing\")\n", "HOUSING_URL = DOWNLOAD_ROOT + \"datasets/housing/housing.tgz\"\n", "\n", "def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):\n", "    if not os.path.isdir(housing_path):\n", "        os.makedirs(housing_path)\n", "    tgz_path = os.path.join(housing_path, \"housing.tgz\")\n", "    urllib.request.urlretrieve(housing_url, tgz_path)\n", "    housing_tgz = tarfile.open(tgz_path)\n", "    housing_tgz.extractall(path=housing_path)\n", "    housing_tgz.close()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "fetch_housing_data()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def load_housing_data(housing_path=HOUSING_PATH):\n", "    csv_path = os.path.join(housing_path, \"housing.csv\")\n", "    return pd.read_csv(csv_path)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "housing = load_housing_data()\n", "housing.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "housing.info()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "housing[\"ocean_proximity\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "housing.describe()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "housing.hist(bins=50, figsize=(20,15))\n", "save_fig(\"attribute_histogram_plots\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# to make this notebook's output identical at every run\n", "np.random.seed(42)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# For illustration only. 
Sklearn has train_test_split()\n", "def split_train_test(data, test_ratio):\n", " shuffled_indices = np.random.permutation(len(data))\n", " test_set_size = int(len(data) * test_ratio)\n", " test_indices = shuffled_indices[:test_set_size]\n", " train_indices = shuffled_indices[test_set_size:]\n", " return data.iloc[train_indices], data.iloc[test_indices]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train_set, test_set = split_train_test(housing, 0.2)\n", "len(train_set)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "len(test_set)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from zlib import crc32\n", "\n", "def test_set_check(identifier, test_ratio):\n", " return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32\n", "\n", "def split_train_test_by_id(data, test_ratio, id_column):\n", " ids = data[id_column]\n", " in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))\n", " return data.loc[~in_test_set], data.loc[in_test_set]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The implementation of `test_set_check()` above works fine in both Python 2 and Python 3. 
In earlier releases, the following implementation was proposed, which supported any hash function, but was much slower and did not support Python 2:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import hashlib\n", "\n", "def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n", " return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want an implementation that supports any hash function and is compatible with both Python 2 and Python 3, here is one:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n", " return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "housing_with_id = housing.reset_index() # adds an `index` column\n", "train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, \"index\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "housing_with_id[\"id\"] = housing[\"longitude\"] * 1000 + housing[\"latitude\"]\n", "train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, \"id\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "test_set.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "test_set.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "housing[\"median_income\"].hist()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, 
"outputs": [], "source": [ "housing[\"income_cat\"] = pd.cut(housing[\"median_income\"],\n", " bins=[0., 1.5, 3.0, 4.5, 6., np.inf],\n", " labels=[1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].hist()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n", "for train_index, test_index in split.split(housing, housing[\"income_cat\"]):\n", " strat_train_set = housing.loc[train_index]\n", " strat_test_set = housing.loc[test_index]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "strat_test_set[\"income_cat\"].value_counts() / len(strat_test_set)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].value_counts() / len(housing)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def income_cat_proportions(data):\n", " return data[\"income_cat\"].value_counts() / len(data)\n", "\n", "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)\n", "\n", "compare_props = pd.DataFrame({\n", " \"Overall\": income_cat_proportions(housing),\n", " \"Stratified\": income_cat_proportions(strat_test_set),\n", " \"Random\": income_cat_proportions(test_set),\n", "}).sort_index()\n", "compare_props[\"Rand. %error\"] = 100 * compare_props[\"Random\"] / compare_props[\"Overall\"] - 100\n", "compare_props[\"Strat. 
%error\"] = 100 * compare_props[\"Stratified\"] / compare_props[\"Overall\"] - 100" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "compare_props" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "for set_ in (strat_train_set, strat_test_set):\n", " set_.drop(\"income_cat\", axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Discover and visualize the data to gain insights" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "housing = strat_train_set.copy()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\")\n", "save_fig(\"bad_visualization_plot\")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.1)\n", "save_fig(\"better_visualization_plot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). This is a temporary fix (see: https://github.com/pandas-dev/pandas/issues/10611 ). Thanks to Wilmer Arellano for pointing it out." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.4,\n", " s=housing[\"population\"]/100, label=\"population\", figsize=(10,7),\n", " c=\"median_house_value\", cmap=plt.get_cmap(\"jet\"), colorbar=True,\n", " sharex=False)\n", "plt.legend()\n", "save_fig(\"housing_prices_scatterplot\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Download the California image\n", "images_path = os.path.join(PROJECT_ROOT_DIR, \"images\", \"end_to_end_project\")\n", "os.makedirs(images_path, exist_ok=True)\n", "DOWNLOAD_ROOT = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", "filename = \"california.png\"\n", "print(\"Downloading\", filename)\n", "url = DOWNLOAD_ROOT + \"images/end_to_end_project/\" + filename\n", "urllib.request.urlretrieve(url, os.path.join(images_path, filename))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import matplotlib.image as mpimg\n", "california_img=mpimg.imread(os.path.join(images_path, filename))\n", "ax = housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", figsize=(10,7),\n", " s=housing['population']/100, label=\"Population\",\n", " c=\"median_house_value\", cmap=plt.get_cmap(\"jet\"),\n", " colorbar=False, alpha=0.4)\n", "plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,\n", " cmap=plt.get_cmap(\"jet\"))\n", "plt.ylabel(\"Latitude\", fontsize=14)\n", "plt.xlabel(\"Longitude\", fontsize=14)\n", "\n", "prices = housing[\"median_house_value\"]\n", "tick_values = np.linspace(prices.min(), prices.max(), 11)\n", "cbar = plt.colorbar(ticks=tick_values/prices.max())\n", "cbar.ax.set_yticklabels([\"$%dk\"%(round(v/1000)) for v in tick_values], fontsize=14)\n", "cbar.set_label('Median House Value', fontsize=16)\n", "\n", "plt.legend(fontsize=16)\n", 
"save_fig(\"california_housing_prices_plot\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "corr_matrix = housing.corr()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "corr_matrix[\"median_house_value\"].sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas\n", "from pandas.plotting import scatter_matrix\n", "\n", "attributes = [\"median_house_value\", \"median_income\", \"total_rooms\",\n", " \"housing_median_age\"]\n", "scatter_matrix(housing[attributes], figsize=(12, 8))\n", "save_fig(\"scatter_matrix_plot\")" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"median_income\", y=\"median_house_value\",\n", " alpha=0.1)\n", "plt.axis([0, 16, 0, 550000])\n", "save_fig(\"income_vs_house_value_scatterplot\")" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "housing[\"rooms_per_household\"] = housing[\"total_rooms\"]/housing[\"households\"]\n", "housing[\"bedrooms_per_room\"] = housing[\"total_bedrooms\"]/housing[\"total_rooms\"]\n", "housing[\"population_per_household\"]=housing[\"population\"]/housing[\"households\"]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "corr_matrix = housing.corr()\n", "corr_matrix[\"median_house_value\"].sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"rooms_per_household\", y=\"median_house_value\",\n", " alpha=0.2)\n", "plt.axis([0, 5, 0, 520000])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "housing.describe()" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "# Prepare the data for Machine Learning algorithms" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "housing = strat_train_set.drop(\"median_house_value\", axis=1) # drop labels for training set\n", "housing_labels = strat_train_set[\"median_house_value\"].copy()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()\n", "sample_incomplete_rows" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows.dropna(subset=[\"total_bedrooms\"]) # option 1" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows.drop(\"total_bedrooms\", axis=1) # option 2" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "median = housing[\"total_bedrooms\"].median()\n", "sample_incomplete_rows[\"total_bedrooms\"].fillna(median, inplace=True) # option 3" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "sample_incomplete_rows" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer\n", "imputer = SimpleImputer(strategy=\"median\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove the text attribute because median can only be calculated on numerical attributes:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "housing_num = housing.drop(\"ocean_proximity\", axis=1)\n", "# alternatively: housing_num = housing.select_dtypes(include=[np.number])" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "imputer.fit(housing_num)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, 
"outputs": [], "source": [ "imputer.statistics_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that this is the same as manually computing the median of each attribute:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "housing_num.median().values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the training set:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "X = imputer.transform(housing_num)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns,\n", " index=housing.index)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "housing_tr.loc[sample_incomplete_rows.index.values]" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "imputer.strategy" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns,\n", " index=housing_num.index)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "housing_tr.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's preprocess the categorical input feature, `ocean_proximity`:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "housing_cat = housing[[\"ocean_proximity\"]]\n", "housing_cat.head(10)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder\n", "\n", "ordinal_encoder = OrdinalEncoder()\n", "housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)\n", "housing_cat_encoded[:10]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "ordinal_encoder.categories_" ] }, { 
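"cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of what the ordinal encoding above does, here is the same kind of encoder applied to a small made-up column (the values and the names `toy_cat`, `toy_encoder` and `toy_codes` are assumptions for illustration, not part of the housing data). With its default settings, the encoder sorts the distinct categories and replaces each value with its position in that sorted list:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.preprocessing import OrdinalEncoder\n", "\n", "# Hypothetical toy column, for illustration only (not the housing data)\n", "toy_cat = pd.DataFrame({\"ocean_proximity\": [\"INLAND\", \"NEAR BAY\", \"INLAND\", \"ISLAND\"]})\n", "\n", "toy_encoder = OrdinalEncoder()\n", "toy_codes = toy_encoder.fit_transform(toy_cat)\n", "\n", "# Each code is the position of the row's category in toy_encoder.categories_[0]\n", "print(toy_encoder.categories_[0])\n", "print(toy_codes.ravel())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the codes are ordered numbers, a model may treat categories with nearby codes as more similar than distant ones, which is one reason to consider a one-hot encoding instead." ] }, {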
"cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "\n", "cat_encoder = OneHotEncoder()\n", "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n", "housing_cat_1hot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `OneHotEncoder` class returns a SciPy sparse matrix, but we can convert it to a dense array if needed by calling the `toarray()` method:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "housing_cat_1hot.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can set `sparse=False` when creating the `OneHotEncoder` (this argument was renamed `sparse_output` in Scikit-Learn 1.2):" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "cat_encoder = OneHotEncoder(sparse=False)\n", "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n", "housing_cat_1hot" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "cat_encoder.categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a custom transformer to add extra attributes:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "# column indices\n", "rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6\n", "\n", "class CombinedAttributesAdder(BaseEstimator, TransformerMixin):\n", "    def __init__(self, add_bedrooms_per_room=True): # no *args or **kwargs\n", "        self.add_bedrooms_per_room = add_bedrooms_per_room\n", "    def fit(self, X, y=None):\n", "        return self  # nothing else to do\n", "    def transform(self, X):\n", "        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]\n", "        population_per_household = X[:, population_ix] / X[:, households_ix]\n", "        if self.add_bedrooms_per_room:\n", "            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]\n", "            return np.c_[X, rooms_per_household, population_per_household,\n", "                         bedrooms_per_room]\n", "        else:\n", "            return np.c_[X, rooms_per_household, population_per_household]\n", "\n", "attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)\n", "housing_extra_attribs = attr_adder.transform(housing.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that I hard-coded the indices (3, 4, 5, 6) for concision and clarity in the book, but it would be much cleaner to get them dynamically, like this:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "col_names = \"total_rooms\", \"total_bedrooms\", \"population\", \"households\"\n", "rooms_ix, bedrooms_ix, population_ix, households_ix = [\n", "    housing.columns.get_loc(c) for c in col_names] # get the column indices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, `housing_extra_attribs` is a NumPy array, so we've lost the column names (unfortunately, that's a problem with Scikit-Learn). 
To recover a `DataFrame`, you could run this:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "housing_extra_attribs = pd.DataFrame(\n", " housing_extra_attribs,\n", " columns=list(housing.columns)+[\"rooms_per_household\", \"population_per_household\"],\n", " index=housing.index)\n", "housing_extra_attribs.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's build a pipeline for preprocessing the numerical attributes:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "num_pipeline = Pipeline([\n", " ('imputer', SimpleImputer(strategy=\"median\")),\n", " ('attribs_adder', CombinedAttributesAdder()),\n", " ('std_scaler', StandardScaler()),\n", " ])\n", "\n", "housing_num_tr = num_pipeline.fit_transform(housing_num)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "housing_num_tr" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\n", "\n", "num_attribs = list(housing_num)\n", "cat_attribs = [\"ocean_proximity\"]\n", "\n", "full_pipeline = ColumnTransformer([\n", " (\"num\", num_pipeline, num_attribs),\n", " (\"cat\", OneHotEncoder(), cat_attribs),\n", " ])\n", "\n", "housing_prepared = full_pipeline.fit_transform(housing)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "housing_prepared" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "housing_prepared.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is the old solution based on a `DataFrameSelector` transformer (to just select a subset of the Pandas `DataFrame` columns), and a `FeatureUnion`:" ] }, { "cell_type": "code", 
"execution_count": 78, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "# Create a class to select numerical or categorical columns\n", "class OldDataFrameSelector(BaseEstimator, TransformerMixin):\n", "    def __init__(self, attribute_names):\n", "        self.attribute_names = attribute_names\n", "    def fit(self, X, y=None):\n", "        return self\n", "    def transform(self, X):\n", "        return X[self.attribute_names].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "num_attribs = list(housing_num)\n", "cat_attribs = [\"ocean_proximity\"]\n", "\n", "old_num_pipeline = Pipeline([\n", "    ('selector', OldDataFrameSelector(num_attribs)),\n", "    ('imputer', SimpleImputer(strategy=\"median\")),\n", "    ('attribs_adder', CombinedAttributesAdder()),\n", "    ('std_scaler', StandardScaler()),\n", "])\n", "\n", "old_cat_pipeline = Pipeline([\n", "    ('selector', OldDataFrameSelector(cat_attribs)),\n", "    ('cat_encoder', OneHotEncoder(sparse=False)),\n", "])" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import FeatureUnion\n", "\n", "old_full_pipeline = FeatureUnion(transformer_list=[\n", "    (\"num_pipeline\", old_num_pipeline),\n", "    (\"cat_pipeline\", old_cat_pipeline),\n", "])" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "old_housing_prepared = old_full_pipeline.fit_transform(housing)\n", "old_housing_prepared" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is the same as with the `ColumnTransformer`:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "np.allclose(housing_prepared, old_housing_prepared)" ] }, { 
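"cell_type": "markdown", "metadata": {}, "source": [ "To see how a `ColumnTransformer` such as `full_pipeline` above routes columns, here is a minimal sketch on a made-up DataFrame (the names `toy`, `toy_pipeline` and `toy_prepared`, and all the values, are assumptions for illustration). The numerical columns go through a scaler, the categorical column goes through a one-hot encoder, and the two outputs are concatenated side by side:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from scipy import sparse\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "# Hypothetical toy data, for illustration only (not the housing data)\n", "toy = pd.DataFrame({\n", "    \"median_income\": [2.0, 4.0, 6.0],\n", "    \"households\": [100.0, 300.0, 200.0],\n", "    \"ocean_proximity\": [\"INLAND\", \"NEAR BAY\", \"INLAND\"],\n", "})\n", "\n", "toy_pipeline = ColumnTransformer([\n", "    (\"num\", StandardScaler(), [\"median_income\", \"households\"]),\n", "    (\"cat\", OneHotEncoder(), [\"ocean_proximity\"]),\n", "])\n", "\n", "toy_prepared = toy_pipeline.fit_transform(toy)\n", "# The stacked output can be sparse or dense, so densify it if needed\n", "if sparse.issparse(toy_prepared):\n", "    toy_prepared = toy_prepared.toarray()\n", "\n", "# One row per toy district; scaled numerical columns come first, then the one-hot columns\n", "print(toy_prepared.shape)\n", "print(toy_prepared)" ] }, {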
"cell_type": "markdown", "metadata": {}, "source": [ "# Select and train a model " ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "# let's try the full preprocessing pipeline on a few training instances\n", "some_data = housing.iloc[:5]\n", "some_labels = housing_labels.iloc[:5]\n", "some_data_prepared = full_pipeline.transform(some_data)\n", "\n", "print(\"Predictions:\", lin_reg.predict(some_data_prepared))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare against the actual values:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "print(\"Labels:\", list(some_labels))" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "some_data_prepared" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "housing_predictions = lin_reg.predict(housing_prepared)\n", "lin_mse = mean_squared_error(housing_labels, housing_predictions)\n", "lin_rmse = np.sqrt(lin_mse)\n", "lin_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: since Scikit-Learn 0.22, you can get the RMSE directly by calling the `mean_squared_error()` function with `squared=False`." 
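] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of what the RMSE computation above amounts to, the following cell uses made-up labels and predictions (`toy_labels` and `toy_predictions` are assumptions for illustration) and checks the `np.sqrt(mean_squared_error(...))` form against the definition, the square root of the mean squared difference. It deliberately avoids the `squared` argument, so it should not depend on which Scikit-Learn release offers that option:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Hypothetical labels and predictions, for illustration only\n", "toy_labels = np.array([100.0, 200.0, 300.0])\n", "toy_predictions = np.array([110.0, 190.0, 330.0])\n", "\n", "toy_mse = mean_squared_error(toy_labels, toy_predictions)\n", "toy_rmse = np.sqrt(toy_mse)\n", "\n", "# Same quantity computed directly from the definition\n", "manual_rmse = np.sqrt(np.mean((toy_predictions - toy_labels) ** 2))\n", "print(toy_rmse, manual_rmse)"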
] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "lin_mae = mean_absolute_error(housing_labels, housing_predictions)\n", "lin_mae" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree_reg = DecisionTreeRegressor(random_state=42)\n", "tree_reg.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "housing_predictions = tree_reg.predict(housing_prepared)\n", "tree_mse = mean_squared_error(housing_labels, housing_predictions)\n", "tree_rmse = np.sqrt(tree_mse)\n", "tree_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tune your model" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "scores = cross_val_score(tree_reg, housing_prepared, housing_labels,\n", " scoring=\"neg_mean_squared_error\", cv=10)\n", "tree_rmse_scores = np.sqrt(-scores)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "def display_scores(scores):\n", " print(\"Scores:\", scores)\n", " print(\"Mean:\", scores.mean())\n", " print(\"Standard deviation:\", scores.std())\n", "\n", "display_scores(tree_rmse_scores)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,\n", " scoring=\"neg_mean_squared_error\", cv=10)\n", "lin_rmse_scores = np.sqrt(-lin_scores)\n", "display_scores(lin_rmse_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: we specify `n_estimators=100` to be future-proof since the default value is going to change to 100 in Scikit-Learn 0.22 (for simplicity, this is not shown in the book)." 
] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)\n", "forest_reg.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "housing_predictions = forest_reg.predict(housing_prepared)\n", "forest_mse = mean_squared_error(housing_labels, housing_predictions)\n", "forest_rmse = np.sqrt(forest_mse)\n", "forest_rmse" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,\n", " scoring=\"neg_mean_squared_error\", cv=10)\n", "forest_rmse_scores = np.sqrt(-forest_scores)\n", "display_scores(forest_rmse_scores)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring=\"neg_mean_squared_error\", cv=10)\n", "pd.Series(np.sqrt(-scores)).describe()" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVR\n", "\n", "svm_reg = SVR(kernel=\"linear\")\n", "svm_reg.fit(housing_prepared, housing_labels)\n", "housing_predictions = svm_reg.predict(housing_prepared)\n", "svm_mse = mean_squared_error(housing_labels, housing_predictions)\n", "svm_rmse = np.sqrt(svm_mse)\n", "svm_rmse" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "param_grid = [\n", " # try 12 (3×4) combinations of hyperparameters\n", " {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},\n", " # then try 6 (2×3) combinations with bootstrap set as False\n", " {'bootstrap': [False], 'n_estimators': [3, 
10], 'max_features': [2, 3, 4]},\n", " ]\n", "\n", "forest_reg = RandomForestRegressor(random_state=42)\n", "# train across 5 folds, that's a total of (12+6)*5=90 rounds of training \n", "grid_search = GridSearchCV(forest_reg, param_grid, cv=5,\n", " scoring='neg_mean_squared_error',\n", " return_train_score=True)\n", "grid_search.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best hyperparameter combination found:" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "grid_search.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the score of each hyperparameter combination tested during the grid search:" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "cvres = grid_search.cv_results_\n", "for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n", " print(np.sqrt(-mean_score), params)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(grid_search.cv_results_)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import randint\n", "\n", "param_distribs = {\n", " 'n_estimators': randint(low=1, high=200),\n", " 'max_features': randint(low=1, high=8),\n", " }\n", "\n", "forest_reg = RandomForestRegressor(random_state=42)\n", "rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,\n", " n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)\n", "rnd_search.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "cvres = rnd_search.cv_results_\n", "for 
mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n", " print(np.sqrt(-mean_score), params)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "feature_importances = grid_search.best_estimator_.feature_importances_\n", "feature_importances" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "extra_attribs = [\"rooms_per_hhold\", \"pop_per_hhold\", \"bedrooms_per_room\"]\n", "#cat_encoder = cat_pipeline.named_steps[\"cat_encoder\"] # old solution\n", "cat_encoder = full_pipeline.named_transformers_[\"cat\"]\n", "cat_one_hot_attribs = list(cat_encoder.categories_[0])\n", "attributes = num_attribs + extra_attribs + cat_one_hot_attribs\n", "sorted(zip(feature_importances, attributes), reverse=True)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "final_model = grid_search.best_estimator_\n", "\n", "X_test = strat_test_set.drop(\"median_house_value\", axis=1)\n", "y_test = strat_test_set[\"median_house_value\"].copy()\n", "\n", "X_test_prepared = full_pipeline.transform(X_test)\n", "final_predictions = final_model.predict(X_test_prepared)\n", "\n", "final_mse = mean_squared_error(y_test, final_predictions)\n", "final_rmse = np.sqrt(final_mse)" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "final_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute a 95% confidence interval for the test RMSE:" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "\n", "confidence = 0.95\n", "squared_errors = (final_predictions - y_test) ** 2\n", "np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,\n", " loc=squared_errors.mean(),\n", " scale=stats.sem(squared_errors)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could compute the interval manually like 
this:" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "m = len(squared_errors)\n", "mean = squared_errors.mean()\n", "tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)\n", "tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)\n", "np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we could use z-scores rather than t-scores:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "zscore = stats.norm.ppf((1 + confidence) / 2)\n", "zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n", "np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra material" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A full pipeline with both preparation and prediction" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "full_pipeline_with_predictor = Pipeline([\n", " (\"preparation\", full_pipeline),\n", " (\"linear\", LinearRegression())\n", " ])\n", "\n", "full_pipeline_with_predictor.fit(housing, housing_labels)\n", "full_pipeline_with_predictor.predict(some_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model persistence using joblib" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "my_model = full_pipeline_with_predictor" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "import joblib\n", "joblib.dump(my_model, \"my_model.pkl\") # save the fitted pipeline to disk\n", "#...\n", "my_model_loaded = joblib.load(\"my_model.pkl\") # load it back later" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example SciPy distributions for `RandomizedSearchCV`" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import geom, expon\n", 
"geom_distrib=geom(0.5).rvs(10000, random_state=42)\n", "expon_distrib=expon(scale=1).rvs(10000, random_state=42)\n", "plt.hist(geom_distrib, bins=50)\n", "plt.show()\n", "plt.hist(expon_distrib, bins=50)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Try a Support Vector Machine regressor (`sklearn.svm.SVR`), with various hyperparameters such as `kernel=\"linear\"` (with various values for the `C` hyperparameter) or `kernel=\"rbf\"` (with various values for the `C` and `gamma` hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best `SVR` predictor perform?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: the following cell may take close to 30 minutes to run, or more depending on your hardware." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "param_grid = [\n", " {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},\n", " {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],\n", " 'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},\n", " ]\n", "\n", "svm_reg = SVR()\n", "grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)\n", "grid_search.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 5-fold cross validation):" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "negative_mse = grid_search.best_score_\n", "rmse = np.sqrt(-negative_mse)\n", "rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's much worse than the `RandomForestRegressor`. 
Let's check the best hyperparameters found:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear kernel seems better than the RBF kernel. Notice that the value of `C` is the maximum tested value. When this happens you definitely want to launch the grid search again with higher values for `C` (removing the smallest values), because it is likely that higher values of `C` will be better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Try replacing `GridSearchCV` with `RandomizedSearchCV`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: the following cell may take close to 45 minutes to run, or more depending on your hardware." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import expon, reciprocal\n", "\n", "# see https://docs.scipy.org/doc/scipy/reference/stats.html\n", "# for `expon()` and `reciprocal()` documentation and more probability distribution functions.\n", "\n", "# Note: gamma is ignored when kernel is \"linear\"\n", "param_distribs = {\n", " 'kernel': ['linear', 'rbf'],\n", " 'C': reciprocal(20, 200000),\n", " 'gamma': expon(scale=1.0),\n", " }\n", "\n", "svm_reg = SVR()\n", "rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,\n", " n_iter=50, cv=5, scoring='neg_mean_squared_error',\n", " verbose=2, random_state=42)\n", "rnd_search.fit(housing_prepared, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 5-fold cross validation):" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "negative_mse = 
rnd_search.best_score_\n", "rmse = np.sqrt(-negative_mse)\n", "rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now this is much closer to the performance of the `RandomForestRegressor` (but not quite there yet). Let's check the best hyperparameters found:" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "rnd_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the search found a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the exponential distribution we used, with `scale=1.0`. Note that some samples are much larger or smaller than 1.0, but when you look at the log of the distribution, you can see that most values are actually concentrated roughly in the range of exp(-2) to exp(+2), which is about 0.1 to 7.4." ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "expon_distrib = expon(scale=1.)\n", "samples = expon_distrib.rvs(10000, random_state=42)\n", "plt.figure(figsize=(10, 4))\n", "plt.subplot(121)\n", "plt.title(\"Exponential distribution (scale=1.0)\")\n", "plt.hist(samples, bins=50)\n", "plt.subplot(122)\n", "plt.title(\"Log of this distribution\")\n", "plt.hist(np.log(samples), bins=50)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distribution we used for `C` looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. 
This distribution is useful when you don't have a clue of what the target scale is:" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "reciprocal_distrib = reciprocal(20, 200000)\n", "samples = reciprocal_distrib.rvs(10000, random_state=42)\n", "plt.figure(figsize=(10, 4))\n", "plt.subplot(121)\n", "plt.title(\"Reciprocal distribution (range 20 to 200000)\")\n", "plt.hist(samples, bins=50)\n", "plt.subplot(122)\n", "plt.title(\"Log of this distribution\")\n", "plt.hist(np.log(samples), bins=50)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Try adding a transformer in the preparation pipeline to select only the most important attributes." 
] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "\n", "def indices_of_top_k(arr, k):\n", "    # indices of the k largest values, returned in ascending index order\n", "    return np.sort(np.argpartition(np.array(arr), -k)[-k:])\n", "\n", "class TopFeatureSelector(BaseEstimator, TransformerMixin):\n", "    def __init__(self, feature_importances, k):\n", "        self.feature_importances = feature_importances\n", "        self.k = k\n", "    def fit(self, X, y=None):\n", "        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)\n", "        return self\n", "    def transform(self, X):\n", "        # keep only the columns of the k most important features\n", "        return X[:, self.feature_indices_]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a `RandomForestRegressor`). You may be tempted to compute them directly in the `TopFeatureSelector`'s `fit()` method; however, this would likely slow down grid/randomized search since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define the number of top features we want to keep:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "k = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look for the indices of the top k features:" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "top_k_feature_indices = indices_of_top_k(feature_importances, k)\n", "top_k_feature_indices" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "np.array(attributes)[top_k_feature_indices]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's double check that these are indeed the top k features:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "sorted(zip(feature_importances, attributes), reverse=True)[:k]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking good... 
Now let's create a new pipeline that runs the previously defined preparation pipeline, and adds top k feature selection:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [], "source": [ "preparation_and_feature_selection_pipeline = Pipeline([\n", " ('preparation', full_pipeline),\n", " ('feature_selection', TopFeatureSelector(feature_importances, k))\n", "])" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the features of the first 3 instances:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "housing_prepared_top_k_features[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's double check that these are indeed the top k features:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "housing_prepared[0:3, top_k_feature_indices]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Works great! :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Try creating a single pipeline that does the full data preparation plus the final prediction." 
] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "prepare_select_and_predict_pipeline = Pipeline([\n", " ('preparation', full_pipeline),\n", " ('feature_selection', TopFeatureSelector(feature_importances, k)),\n", " ('svm_reg', SVR(**rnd_search.best_params_))\n", "])" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "prepare_select_and_predict_pipeline.fit(housing, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the full pipeline on a few instances:" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "some_data = housing.iloc[:4]\n", "some_labels = housing_labels.iloc[:4]\n", "\n", "print(\"Predictions:\\t\", prepare_select_and_predict_pipeline.predict(some_data))\n", "print(\"Labels:\\t\\t\", list(some_labels))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, the full pipeline seems to work fine. Of course, the predictions are not fantastic: they would be better if we used the best `RandomForestRegressor` that we found earlier, rather than the best `SVR`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Automatically explore some preparation options using `GridSearchCV`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: the following cell may take close to 45 minutes to run, or more depending on your hardware." 
] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "param_grid = [{\n", " 'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n", " 'feature_selection__k': list(range(1, len(feature_importances) + 1))\n", "}]\n", "\n", "grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,\n", " scoring='neg_mean_squared_error', verbose=2)\n", "grid_search_prep.fit(housing, housing_labels)" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "grid_search_prep.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best imputer strategy is `most_frequent` and apparently almost all features are useful (15 out of 16). The last one (`ISLAND`) seems to just add some noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You already know quite a lot about Machine Learning. :)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "nav_menu": { "height": "279px", "width": "309px" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }