{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Chapter 2 – End-to-end Machine Learning project**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*This notebook contains all the sample code and solutions to the exercices in chapter 2.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", "
\n", " \"Open\n", " \n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "print(\"Welcome to Machine Learning!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project requires Python 3.8 or above:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "assert sys.version_info >= (3, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also requires Scikit-Learn ≥ 1.0.1:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "\n", "assert sklearn.__version__ >= \"1.0.1\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the Data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import tarfile\n", "import urllib.request\n", "\n", "import pandas as pd\n", "\n", "def load_housing_data():\n", " housing_path = Path() / \"datasets\" / \"housing\"\n", " if not (housing_path / \"housing.csv\").is_file():\n", " housing_path.mkdir(parents=True, exist_ok=True)\n", " root = \"https://raw.githubusercontent.com/ageron/handson-ml3/main/\"\n", " url = root + \"datasets/housing/housing.tgz\"\n", " tgz_path = housing_path / \"housing.tgz\"\n", " urllib.request.urlretrieve(url, tgz_path)\n", " with tarfile.open(tgz_path) as housing_tgz:\n", " housing_tgz.extractall(path=housing_path)\n", " return pd.read_csv(housing_path / \"housing.csv\")\n", "\n", "housing = load_housing_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Take a Quick Look at the Data Structure" ] }, { "cell_type": "code", 
"execution_count": 5, "metadata": {}, "outputs": [], "source": [ "housing.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "housing.info()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "housing[\"ocean_proximity\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "housing.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is not shown either in the book. It creates the `images/end_to_end_project` folder (if it doesn't already exist), and it defines the `save_fig()` function which is used through this notebook to save the figures in high-res for the book." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# extra code – code to save the figures as high-res PNGs for the book\n", "\n", "IMAGES_PATH = Path() / \"images\" / \"end_to_end_project\"\n", "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", " path = IMAGES_PATH / f\"{fig_id}.{fig_extension}\"\n", " if tight_layout:\n", " plt.tight_layout()\n", " plt.savefig(path, format=fig_extension, dpi=resolution)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# extra code – the next 5 lines define the default font sizes\n", "plt.rc('font', size=14)\n", "plt.rc('axes', labelsize=14, titlesize=14)\n", "plt.rc('legend', fontsize=14)\n", "plt.rc('xtick', labelsize=10)\n", "plt.rc('ytick', labelsize=10)\n", "\n", "housing.hist(bins=50, figsize=(12, 8))\n", "save_fig(\"attribute_histogram_plots\") # extra code\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a Test Set" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import 
numpy as np\n", "\n", "def shuffle_and_split_data(data, test_ratio):\n", "    shuffled_indices = np.random.permutation(len(data))\n", "    test_set_size = int(len(data) * test_ratio)\n", "    test_indices = shuffled_indices[:test_set_size]\n", "    train_indices = shuffled_indices[test_set_size:]\n", "    return data.iloc[train_indices], data.iloc[test_indices]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train_set, test_set = shuffle_and_split_data(housing, 0.2)\n", "len(train_set)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "len(test_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To ensure that this notebook's outputs remain the same every time we run it, we need to set the random seed:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sadly, this won't guarantee that this notebook will output exactly the same results as in the book, since there are other possible sources of variation. The most important is the fact that algorithms get tweaked over time when libraries evolve. So please tolerate some minor differences: hopefully, most of the outputs should be the same, or at least in the right ballpark." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: another source of randomness is the order of Python sets: it is based on Python's `hash()` function, which is randomly \"salted\" when Python starts up (this started in Python 3.3, to prevent some denial-of-service attacks). To remove this randomness, the solution is to set the `PYTHONHASHSEED` environment variable to `\"0\"` _before_ Python even starts up. Nothing will happen if you do it after that. Luckily, if you're running this notebook on Colab, the variable is already set for you." 
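] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of the note above (not from the book; the helper name `string_hash_in_child` is made up for this illustration), the next cell starts child Python processes with `PYTHONHASHSEED` already set in their environment and checks that `hash()` of a string is then the same in both processes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extra code – illustrates the PYTHONHASHSEED note (a sketch, not from the book)\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "def string_hash_in_child(seed):\n", "    # hypothetical helper: the variable must be set before the child starts\n", "    env = dict(os.environ, PYTHONHASHSEED=seed)\n", "    result = subprocess.run(\n", "        [sys.executable, \"-c\", \"print(hash('ocean_proximity'))\"],\n", "        env=env, capture_output=True, text=True, check=True)\n", "    return int(result.stdout)\n", "\n", "# two separate processes started with the same seed agree on the hash\n", "string_hash_in_child(\"0\") == string_hash_in_child(\"0\")" 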
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from zlib import crc32\n", "\n", "def is_id_in_test_set(identifier, test_ratio):\n", " return crc32(np.int64(identifier)) < test_ratio * 2**32\n", "\n", "def split_data_with_id_hash(data, test_ratio, id_column):\n", " ids = data[id_column]\n", " in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))\n", " return data.loc[~in_test_set], data.loc[in_test_set]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "housing_with_id = housing.reset_index() # adds an `index` column\n", "train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, \"index\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "housing_with_id[\"id\"] = housing[\"longitude\"] * 1000 + housing[\"latitude\"]\n", "train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, \"id\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "test_set[\"total_bedrooms\"].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the probability that a random sample of 1,000 people contains less than 48.5% female or more than 53.5% female when the population's female ratio is 51.1%, we use the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution). The `cdf()` method of the binomial distribution gives us the probability that the number of females will be equal or less than the given value." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# extra code – shows how to compute the 10.7% proba of getting a bad sample\n", "\n", "from scipy.stats import binom\n", "\n", "sample_size = 1000\n", "ratio_female = 0.511\n", "proba_too_small = binom(sample_size, ratio_female).cdf(485 - 1)\n", "proba_too_large = 1 - binom(sample_size, ratio_female).cdf(535)\n", "print(proba_too_small + proba_too_large)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you prefer simulations over maths, here's how you could get roughly the same result:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# extra code – shows another way to estimate the probability of bad sample\n", "\n", "np.random.seed(42)\n", "\n", "samples = (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)\n", "((samples < 485) | (samples > 535)).mean()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"] = pd.cut(housing[\"median_income\"],\n", " bins=[0., 1.5, 3.0, 4.5, 6., np.inf],\n", " labels=[1, 2, 3, 4, 5])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "housing[\"income_cat\"].value_counts().sort_index().plot.bar(rot=0, grid=True)\n", "plt.xlabel(\"Income category\")\n", "plt.ylabel(\"Number of districts\")\n", "save_fig(\"housing_income_cat_bar_plot\") # extra code\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedShuffleSplit\n", "\n", "splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)\n", "strat_splits = []\n", "for train_index, test_index in splitter.split(housing, housing[\"income_cat\"]):\n", " strat_train_set_n = housing.loc[train_index]\n", " strat_test_set_n = housing.loc[test_index]\n", " strat_splits.append([strat_train_set_n, 
strat_test_set_n])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "strat_train_set, strat_test_set = strat_splits[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's much shorter to get a single stratified split:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "strat_train_set, strat_test_set = train_test_split(\n", "    housing, test_size=0.2, stratify=housing[\"income_cat\"], random_state=42)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "strat_test_set[\"income_cat\"].value_counts() / len(strat_test_set)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# extra code – computes the data for Figure 2–10\n", "\n", "def income_cat_proportions(data):\n", "    return data[\"income_cat\"].value_counts() / len(data)\n", "\n", "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)\n", "\n", "compare_props = pd.DataFrame({\n", "    \"Overall %\": income_cat_proportions(housing),\n", "    \"Stratified %\": income_cat_proportions(strat_test_set),\n", "    \"Random %\": income_cat_proportions(test_set),\n", "}).sort_index()\n", "compare_props.index.name = \"Income Category\"\n", "compare_props[\"Strat. Error %\"] = (compare_props[\"Stratified %\"] /\n", "                                   compare_props[\"Overall %\"] - 1)\n", "compare_props[\"Rand. 
Error %\"] = (compare_props[\"Random %\"] /\n", "                                  compare_props[\"Overall %\"] - 1)\n", "(compare_props * 100).round(2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "for set_ in (strat_train_set, strat_test_set):\n", "    set_.drop(\"income_cat\", axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Discover and Visualize the Data to Gain Insights" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "housing = strat_train_set.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing Geographical Data" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", grid=True)\n", "save_fig(\"bad_visualization_plot\")  # extra code\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", grid=True, alpha=0.2)\n", "save_fig(\"better_visualization_plot\")  # extra code\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", grid=True,\n", "             s=housing[\"population\"] / 100, label=\"population\",\n", "             c=\"median_house_value\", cmap=\"jet\", colorbar=True,\n", "             legend=True, sharex=False, figsize=(10, 7))\n", "save_fig(\"housing_prices_scatterplot\")  # extra code\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument `sharex=False` fixes a display bug: without it, the x-axis values and label are not displayed (see: https://github.com/pandas-dev/pandas/issues/10611)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell generates the first figure in the chapter (this code is not in the book). 
It's just a beautified version of the previous figure, with an image of California added in the background, nicer label names and no grid." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# extra code – this cell generates the first figure in the chapter\n", "\n", "# Download the California image\n", "filename = \"california.png\"\n", "if not (IMAGES_PATH / filename).is_file():\n", "    root = \"https://raw.githubusercontent.com/ageron/handson-ml3/main/\"\n", "    url = root + \"images/end_to_end_project/\" + filename\n", "    print(\"Downloading\", filename)\n", "    urllib.request.urlretrieve(url, IMAGES_PATH / filename)\n", "\n", "housing_renamed = housing.rename(columns={\n", "    \"latitude\": \"Latitude\", \"longitude\": \"Longitude\",\n", "    \"population\": \"Population\",\n", "    \"median_house_value\": \"Median house value (ᴜsᴅ)\"})\n", "housing_renamed.plot(\n", "    kind=\"scatter\", x=\"Longitude\", y=\"Latitude\",\n", "    s=housing_renamed[\"Population\"] / 100, label=\"Population\",\n", "    c=\"Median house value (ᴜsᴅ)\", cmap=\"jet\", colorbar=True,\n", "    legend=True, sharex=False, figsize=(10, 7))\n", "\n", "california_img = plt.imread(IMAGES_PATH / filename)\n", "axis = -124.55, -113.95, 32.45, 42.05\n", "plt.axis(axis)\n", "plt.imshow(california_img, extent=axis)\n", "\n", "save_fig(\"california_housing_prices_plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking for Correlations" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# numeric columns only: `ocean_proximity` is text, and recent Pandas\n", "# versions raise an error if it is passed to corr()\n", "corr_matrix = housing.select_dtypes(include=[np.number]).corr()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "corr_matrix[\"median_house_value\"].sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from pandas.plotting import scatter_matrix\n", "\n", "attributes = [\"median_house_value\", \"median_income\", \"total_rooms\",\n", 
" \"housing_median_age\"]\n", "scatter_matrix(housing[attributes], figsize=(12, 8))\n", "save_fig(\"scatter_matrix_plot\") # extra code\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "housing.plot(kind=\"scatter\", x=\"median_income\", y=\"median_house_value\",\n", " alpha=0.1, grid=True)\n", "save_fig(\"income_vs_house_value_scatterplot\") # extra code\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experimenting with Attribute Combinations" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "housing[\"rooms_per_house\"] = housing[\"total_rooms\"] / housing[\"households\"]\n", "housing[\"bedrooms_ratio\"] = housing[\"total_bedrooms\"] / housing[\"total_rooms\"]\n", "housing[\"people_per_house\"] = housing[\"population\"] / housing[\"households\"]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "corr_matrix = housing.corr()\n", "corr_matrix[\"median_house_value\"].sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prepare the Data for Machine Learning Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's revert to the original training set and separate the target (note that `strat_train_set.drop()` creates a copy of `strat_train_set` without the column, it doesn't actually modify `strat_train_set` itself, unless you pass `inplace=True`):" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "housing = strat_train_set.drop(\"median_house_value\", axis=1)\n", "housing_labels = strat_train_set[\"median_house_value\"].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the book 3 options are listed to handle the NaN values:\n", "\n", "```python\n", 
"housing.dropna(subset=[\"total_bedrooms\"], inplace=True) # option 1\n", "\n", "housing.drop(\"total_bedrooms\", axis=1) # option 2\n", "\n", "median = housing[\"total_bedrooms\"].median() # option 3\n", "housing[\"total_bedrooms\"].fillna(median, inplace=True)\n", "```\n", "\n", "For each option, we'll create a copy of `housing` and work on that copy to avoid breaking `housing`. We'll also show the output of each option, but filtering on the rows that originally contained a NaN value." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "null_rows_idx = housing.isnull().any(axis=1)\n", "housing.loc[null_rows_idx].head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "housing_option1 = housing.copy()\n", "\n", "housing_option1.dropna(subset=[\"total_bedrooms\"], inplace=True) # option 1\n", "\n", "housing_option1.loc[null_rows_idx].head()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "housing_option2 = housing.copy()\n", "\n", "housing_option2.drop(\"total_bedrooms\", axis=1, inplace=True) # option 2\n", "\n", "housing_option2.loc[null_rows_idx].head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "housing_option3 = housing.copy()\n", "\n", "median = housing[\"total_bedrooms\"].median()\n", "housing_option3[\"total_bedrooms\"].fillna(median, inplace=True) # option 3\n", "\n", "housing_option3.loc[null_rows_idx].head()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer\n", "\n", "imputer = SimpleImputer(strategy=\"median\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Separating out the numerical attributes to use the `\"median\"` strategy (as it cannot be calculated on text attributes like `ocean_proximity`):" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, 
"outputs": [], "source": [ "housing_num = housing.select_dtypes(include=[np.number])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "imputer.fit(housing_num)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "imputer.statistics_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that this is the same as manually computing the median of each attribute:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "housing_num.median().values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the training set:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "X = imputer.transform(housing_num)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "imputer.feature_names_in_" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns,\n", " index=housing_num.index)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "housing_tr.loc[null_rows_idx].head()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "imputer.strategy" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "housing_tr = pd.DataFrame(X, columns=housing_num.columns,\n", " index=housing_num.index)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "housing_tr.loc[null_rows_idx].head() # not shown in the book" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "#from sklearn import set_config\n", "#\n", "# set_config(pandas_in_out=True) # not available yet – see SLEP014" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's drop some 
outliers:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import IsolationForest\n", "\n", "isolation_forest = IsolationForest(random_state=42)\n", "outlier_pred = isolation_forest.fit_predict(X)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "outlier_pred" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "#housing = housing.iloc[outlier_pred == 1]\n", "#housing_labels = housing_labels.iloc[outlier_pred == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Text and Categorical Attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's preprocess the categorical input feature, `ocean_proximity`:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "housing_cat = housing[[\"ocean_proximity\"]]\n", "housing_cat.head(8)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder\n", "\n", "ordinal_encoder = OrdinalEncoder()\n", "housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "housing_cat_encoded[:8]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "ordinal_encoder.categories_" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "\n", "cat_encoder = OneHotEncoder()\n", "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "housing_cat_1hot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the `OneHotEncoder` class returns a sparse array, but we can convert it to a 
dense array if needed by calling the `toarray()` method:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "housing_cat_1hot.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can set `sparse=False` when creating the `OneHotEncoder`:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "cat_encoder = OneHotEncoder(sparse=False)\n", "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n", "housing_cat_1hot" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "cat_encoder.categories_" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "df_test = pd.DataFrame({\"ocean_proximity\": [\"INLAND\", \"NEAR BAY\"]})\n", "pd.get_dummies(df_test)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "cat_encoder.transform(df_test)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "df_test_unknown = pd.DataFrame({\"ocean_proximity\": [\"<2H OCEAN\", \"ISLAND\"]})\n", "pd.get_dummies(df_test_unknown)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "cat_encoder.handle_unknown = \"ignore\"\n", "cat_encoder.transform(df_test_unknown)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "cat_encoder.feature_names_in_" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "cat_encoder.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "df_output = pd.DataFrame(cat_encoder.transform(df_test_unknown),\n", " columns=cat_encoder.get_feature_names_out(),\n", " index=df_test_unknown.index)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "df_output" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Scaling" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "min_max_scaler = MinMaxScaler(feature_range=(-1, 1))\n", "housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "std_scaler = StandardScaler()\n", "housing_num_std_scaled = std_scaler.fit_transform(housing_num)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "# extra code – this cell generates Figure 2–17\n", "fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)\n", "housing[\"population\"].hist(ax=axs[0], bins=50)\n", "housing[\"population\"].apply(np.log).hist(ax=axs[1], bins=50)\n", "axs[0].set_xlabel(\"Population\")\n", "axs[1].set_xlabel(\"Log of population\")\n", "axs[0].set_ylabel(\"Number of districts\")\n", "save_fig(\"long_tail_plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we replace each value with its percentile?" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "# extra code – just shows that we get a uniform distribution\n", "percentiles = [np.percentile(housing[\"median_income\"], p)\n", " for p in range(1, 100)]\n", "flattened_median_income = pd.cut(housing[\"median_income\"],\n", " bins=[-np.inf] + percentiles + [np.inf],\n", " labels=range(1, 100 + 1))\n", "flattened_median_income.hist(bins=50)\n", "plt.xlabel(\"Median income percentile\")\n", "plt.ylabel(\"Number of districts\")\n", "plt.show()\n", "# Note: incomes below the 1st percentile are labeled 1, and incomes above the\n", "# 99th percentile are labeled 100. This is why the distribution below ranges\n", "# from 1 to 100 (not 0 to 100)." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics.pairwise import rbf_kernel\n", "\n", "age_simil_35 = rbf_kernel(housing[[\"housing_median_age\"]], [[35]], gamma=0.1)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "# extra code – this cell generates Figure 2–18\n", "\n", "ages = np.linspace(housing[\"housing_median_age\"].min(),\n", " housing[\"housing_median_age\"].max(),\n", " 500).reshape(-1, 1)\n", "gamma1 = 0.1\n", "gamma2 = 0.03\n", "rbf1 = rbf_kernel(ages, [[35]], gamma=gamma1)\n", "rbf2 = rbf_kernel(ages, [[35]], gamma=gamma2)\n", "\n", "fig, ax1 = plt.subplots()\n", "\n", "ax1.set_xlabel(\"Housing median age\")\n", "ax1.set_ylabel(\"Number of districts\")\n", "ax1.hist(housing[\"housing_median_age\"], bins=50)\n", "\n", "ax2 = ax1.twinx() # create a twin axis that shares the same x-axis\n", "color = \"blue\"\n", "ax2.plot(ages, rbf1, color=color, label=\"gamma = 0.10\")\n", "ax2.plot(ages, rbf2, color=color, label=\"gamma = 0.03\", linestyle=\"--\")\n", "ax2.tick_params(axis='y', labelcolor=color)\n", "ax2.set_ylabel(\"Age similarity\", color=color)\n", "\n", "plt.legend(loc=\"upper left\", labelcolor=color)\n", "save_fig(\"age_similarity_plot\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "target_scaler = StandardScaler()\n", "scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())\n", "\n", "model = LinearRegression()\n", "model.fit(housing[[\"median_income\"]], scaled_labels)\n", "some_new_data = housing[[\"median_income\"]].iloc[:5] # pretend this is new data\n", "\n", "scaled_predictions = model.predict(some_new_data)\n", "predictions = target_scaler.inverse_transform(scaled_predictions)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "predictions" ] 
}, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import TransformedTargetRegressor\n", "\n", "model = TransformedTargetRegressor(LinearRegression(),\n", " transformer=StandardScaler())\n", "model.fit(housing[[\"median_income\"]], housing_labels)\n", "predictions = model.predict(some_new_data)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom Transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create simple transformers:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import FunctionTransformer\n", "\n", "log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)\n", "log_pop = log_transformer.transform(housing[[\"population\"]])" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "rbf_transformer = FunctionTransformer(rbf_kernel,\n", " kw_args=dict(Y=[[35.]], gamma=0.1))\n", "age_simil_35 = rbf_transformer.transform(housing[[\"housing_median_age\"]])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "age_simil_35" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "sf_coords = 37.7749, -122.41\n", "sf_transformer = FunctionTransformer(rbf_kernel,\n", " kw_args=dict(Y=[sf_coords], gamma=0.1))\n", "sf_simil = sf_transformer.transform(housing[[\"latitude\", \"longitude\"]])" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "sf_simil" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])\n", "ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))" ] }, { "cell_type": "code", 
"execution_count": 95, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.utils.validation import check_array, check_is_fitted\n", "\n", "class StandardScalerClone(BaseEstimator, TransformerMixin):\n", " def __init__(self, with_mean=True): # no *args or **kwargs!\n", " self.with_mean = with_mean\n", "\n", " def fit(self, X, y=None): # y is required even though we don't use it\n", " X = check_array(X) # checks that X is an array with finite float values\n", " self.mean_ = X.mean(axis=0)\n", " self.scale_ = X.std(axis=0)\n", " self.n_features_in_ = X.shape[1] # every estimator stores this in fit()\n", " return self # always return self!\n", "\n", " def transform(self, X):\n", " check_is_fitted(self) # looks for learned attributes (with trailing _)\n", " X = check_array(X)\n", " assert self.n_features_in_ == X.shape[1]\n", " if self.with_mean:\n", " X = X - self.mean_\n", " return X / self.scale_" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "class ClusterSimilarity(BaseEstimator, TransformerMixin):\n", " def __init__(self, n_clusters=10, gamma=1.0, random_state=None):\n", " self.n_clusters = n_clusters\n", " self.gamma = gamma\n", " self.random_state = random_state\n", "\n", " def fit(self, X, y=None, sample_weight=None):\n", " self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)\n", " self.kmeans_.fit(X, sample_weight=sample_weight)\n", " return self # always return self!\n", "\n", " def transform(self, X):\n", " return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)\n", " \n", " def get_feature_names_out(self, names=None):\n", " return [f\"Cluster {i} similarity\" for i in range(self.n_clusters)]" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)\n", 
"similarities = cluster_simil.fit_transform(housing[[\"latitude\", \"longitude\"]],\n", " sample_weight=housing_labels)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "similarities[:3].round(2)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# extra code – this cell generates Figure 2–19\n", "\n", "housing_renamed = housing.rename(columns={\n", " \"latitude\": \"Latitude\", \"longitude\": \"Longitude\",\n", " \"population\": \"Population\",\n", " \"median_house_value\": \"Median house value (ᴜsᴅ)\"})\n", "housing_renamed[\"Max cluster similarity\"] = similarities.max(axis=1)\n", "\n", "housing_renamed.plot(kind=\"scatter\", x=\"Longitude\", y=\"Latitude\", grid=True,\n", " s=housing_renamed[\"Population\"] / 100, label=\"Population\",\n", " c=\"Max cluster similarity\",\n", " cmap=\"jet\", colorbar=True,\n", " legend=True, sharex=False, figsize=(10, 7))\n", "plt.plot(cluster_simil.kmeans_.cluster_centers_[:, 1],\n", " cluster_simil.kmeans_.cluster_centers_[:, 0],\n", " linestyle=\"\", color=\"black\", marker=\"X\", markersize=20,\n", " label=\"Cluster centers\")\n", "plt.legend(loc=\"upper right\")\n", "save_fig(\"district_cluster_plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformation Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's build a pipeline to preprocess the numerical attributes:" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "num_pipeline = Pipeline([\n", " (\"impute\", SimpleImputer(strategy=\"median\")),\n", " (\"standardize\", StandardScaler()),\n", "])" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "num_pipeline = make_pipeline(SimpleImputer(strategy=\"median\"), 
StandardScaler())" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "from sklearn import set_config\n", "\n", "set_config(display='diagram')\n", "\n", "num_pipeline" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "housing_num_prepared = num_pipeline.fit_transform(housing_num)\n", "housing_num_prepared[:2].round(2)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "def monkey_patch_get_signature_names_out():\n", " \"\"\"Monkey patch some classes which did not handle get_feature_names_out()\n", " correctly in 1.0.0.\"\"\"\n", " from inspect import Signature, signature, Parameter\n", " import pandas as pd\n", " from sklearn.impute import SimpleImputer\n", " from sklearn.pipeline import make_pipeline, Pipeline\n", " from sklearn.preprocessing import FunctionTransformer, StandardScaler\n", "\n", " default_get_feature_names_out = StandardScaler.get_feature_names_out\n", "\n", " if not hasattr(SimpleImputer, \"get_feature_names_out\"):\n", " print(\"Monkey-patching SimpleImputer.get_feature_names_out()\")\n", " SimpleImputer.get_feature_names_out = default_get_feature_names_out\n", "\n", " if not hasattr(FunctionTransformer, \"get_feature_names_out\"):\n", " print(\"Monkey-patching FunctionTransformer.get_feature_names_out()\")\n", " orig_init = FunctionTransformer.__init__\n", " orig_sig = signature(orig_init)\n", "\n", " def __init__(*args, feature_names_out=None, **kwargs):\n", " orig_sig.bind(*args, **kwargs)\n", " orig_init(*args, **kwargs)\n", " args[0].feature_names_out = feature_names_out\n", "\n", " __init__.__signature__ = Signature(\n", " list(signature(orig_init).parameters.values()) + [\n", " Parameter(\"feature_names_out\", Parameter.KEYWORD_ONLY)])\n", "\n", " def get_feature_names_out(self, names=None):\n", " if self.feature_names_out is None:\n", " return default_get_feature_names_out(self, names)\n", " elif 
callable(self.feature_names_out):\n", " return self.feature_names_out(names)\n", " else:\n", " return self.feature_names_out\n", "\n", " FunctionTransformer.__init__ = __init__\n", " FunctionTransformer.get_feature_names_out = get_feature_names_out\n", "\n", "monkey_patch_get_signature_names_out()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "df_housing_num_prepared = pd.DataFrame(\n", " housing_num_prepared, columns=num_pipeline.get_feature_names_out(),\n", " index=housing_num.index)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "df_housing_num_prepared.head(2) # extra code" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "num_pipeline.steps" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "num_pipeline[1]" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "num_pipeline[:-1]" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "num_pipeline.named_steps[\"simpleimputer\"]" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "num_pipeline.set_params(simpleimputer__strategy=\"median\")" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\n", "\n", "num_attribs = [\"longitude\", \"latitude\", \"housing_median_age\", \"total_rooms\",\n", " \"total_bedrooms\", \"population\", \"households\", \"median_income\"]\n", "cat_attribs = [\"ocean_proximity\"]\n", "\n", "cat_pipeline = make_pipeline(\n", " SimpleImputer(strategy=\"most_frequent\"),\n", " OneHotEncoder(handle_unknown=\"ignore\"))\n", "\n", "preprocessing = ColumnTransformer([\n", " (\"num\", num_pipeline, num_attribs),\n", " (\"cat\", cat_pipeline, cat_attribs),\n", "])" ] }, { "cell_type": "code", 
"execution_count": 113, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import make_column_selector, make_column_transformer\n", "\n", "preprocessing = make_column_transformer(\n", " (num_pipeline, make_column_selector(dtype_include=np.number)),\n", " (cat_pipeline, make_column_selector(dtype_include=np.object)),\n", ")" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "housing_prepared = preprocessing.fit_transform(housing)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "# extra code – shows that we can get a DataFrame out if we want\n", "housing_prepared_fr = pd.DataFrame(\n", " housing_prepared,\n", " columns=preprocessing.get_feature_names_out(),\n", " index=housing.index)\n", "housing_prepared_fr.head(2)" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "def column_ratio(X):\n", " return X[:, [0]] / X[:, [1]]\n", "\n", "def ratio_pipeline(name=None):\n", " return make_pipeline(\n", " SimpleImputer(strategy=\"median\"),\n", " FunctionTransformer(column_ratio,\n", " feature_names_out=[name]),\n", " StandardScaler())\n", "\n", "log_pipeline = make_pipeline(SimpleImputer(strategy=\"median\"),\n", " FunctionTransformer(np.log),\n", " StandardScaler())\n", "cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)\n", "default_num_pipeline = make_pipeline(SimpleImputer(strategy=\"median\"),\n", " StandardScaler())\n", "preprocessing = ColumnTransformer([\n", " (\"bedrooms_ratio\", ratio_pipeline(\"bedrooms_ratio\"),\n", " [\"total_bedrooms\", \"total_rooms\"]),\n", " (\"rooms_per_house\", ratio_pipeline(\"rooms_per_house\"),\n", " [\"total_rooms\", \"households\"]),\n", " (\"people_per_house\", ratio_pipeline(\"people_per_house\"),\n", " [\"population\", \"households\"]),\n", " (\"log\", log_pipeline, [\"total_bedrooms\", \"total_rooms\",\n", " \"population\", \"households\", 
\"median_income\"]),\n", " (\"geo\", cluster_simil, [\"latitude\", \"longitude\"]),\n", " (\"cat\", cat_pipeline, make_column_selector(dtype_include=np.object)),\n", " ],\n", " remainder=default_num_pipeline) # one column remaining: housing_median_age" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "housing_prepared = preprocessing.fit_transform(housing)\n", "housing_prepared.shape" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "preprocessing.get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Select and Train a Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Evaluating on the Training Set" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lin_reg = make_pipeline(preprocessing, LinearRegression())\n", "lin_reg.fit(housing, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the full preprocessing pipeline on a few training instances:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "housing_predictions = lin_reg.predict(housing)\n", "housing_predictions[:5].round(-2) # -2 = rounded to the nearest hundred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare against the actual values:" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "housing_labels.iloc[:5].values" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "# extra code – computes the error ratios discussed in the book\n", "error_ratios = housing_predictions[:5].round(-2) / housing_labels.iloc[:5].values - 1\n", "print(\", \".join([f\"{100 * ratio:.1f}%\" for ratio in error_ratios]))" ] }, { "cell_type": "code", "execution_count": 123, "metadata": 
{}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "lin_rmse = mean_squared_error(housing_labels, housing_predictions,\n", " squared=False)\n", "lin_rmse" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))\n", "tree_reg.fit(housing, housing_labels)" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "housing_predictions = tree_reg.predict(housing)\n", "tree_rmse = mean_squared_error(housing_labels, housing_predictions,\n", " squared=False)\n", "tree_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better Evaluation Using Cross-Validation" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,\n", " scoring=\"neg_root_mean_squared_error\", cv=10)" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "pd.Series(tree_rmses).describe()" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "# extra code – computes the error stats for the linear model\n", "lin_rmses = -cross_val_score(lin_reg, housing, housing_labels,\n", " scoring=\"neg_root_mean_squared_error\", cv=10)\n", "pd.Series(lin_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning:** the following cell may take a few minutes to run:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "forest_reg = make_pipeline(preprocessing,\n", " RandomForestRegressor(random_state=42))\n", "forest_rmses = -cross_val_score(forest_reg, housing, 
housing_labels,\n", " scoring=\"neg_root_mean_squared_error\", cv=10)" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [], "source": [ "pd.Series(forest_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare this RMSE measured using cross-validation (the \"validation error\") with the RMSE measured on the training set (the \"training error\"):" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "forest_reg.fit(housing, housing_labels)\n", "housing_predictions = forest_reg.predict(housing)\n", "forest_rmse = mean_squared_error(housing_labels, housing_predictions,\n", " squared=False)\n", "forest_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training error is much lower than the validation error, which usually means that the model has overfit the training set. Another possible explanation may be that there's a mismatch between the training data and the validation data, but it's not the case here, since both came from the same dataset that we shuffled and split in two parts." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-Tune Your Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grid Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning:** the following cell may take a few minutes to run:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "full_pipeline = Pipeline([\n", " (\"preprocessing\", preprocessing),\n", " (\"random_forest\", RandomForestRegressor(random_state=42)),\n", "])\n", "param_grid = [\n", " {'preprocessing__geo__n_clusters': [5, 8, 10],\n", " 'random_forest__max_features': [4, 6, 8]},\n", " {'preprocessing__geo__n_clusters': [10, 15],\n", " 'random_forest__max_features': [6, 8, 10]},\n", "]\n", "grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,\n", " scoring='neg_root_mean_squared_error')\n", "grid_search.fit(housing, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get the full list of hyperparameters available for tuning by looking at `full_pipeline.get_params().keys()`:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "# extra code – shows part of the output of get_params().keys()\n", "print(str(full_pipeline.get_params().keys())[:1000] + \"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best hyperparameter combination found:" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "grid_search.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the score of each hyperparameter combination tested during the grid search:" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "cv_res = 
pd.DataFrame(grid_search.cv_results_)\n", "cv_res.sort_values(by=\"mean_test_score\", ascending=False, inplace=True)\n", "\n", "# extra code – these few lines of code just make the DataFrame look nicer\n", "cv_res = cv_res[[\"param_preprocessing__geo__n_clusters\",\n", " \"param_random_forest__max_features\", \"split0_test_score\",\n", " \"split1_test_score\", \"split2_test_score\", \"mean_test_score\"]]\n", "score_cols = [\"split0\", \"split1\", \"split2\", \"mean_test_rmse\"]\n", "cv_res.columns = [\"n_clusters\", \"max_features\"] + score_cols\n", "cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)\n", "\n", "cv_res.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Randomized Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning:** the following cell may take a few minutes to run:" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "from sklearn.experimental import enable_halving_search_cv\n", "from sklearn.model_selection import HalvingRandomSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try 10 random combinations of hyperparameters (`n_iter` × `cv` = 30 training runs):" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import randint\n", "\n", "param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),\n", " 'random_forest__max_features': randint(low=2, high=20)}\n", "\n", "rnd_search = RandomizedSearchCV(\n", " full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,\n", " scoring='neg_root_mean_squared_error', random_state=42)\n", "\n", "rnd_search.fit(housing, housing_labels)" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "# extra code – displays the random search results\n", "cv_res = pd.DataFrame(rnd_search.cv_results_)\n", 
"cv_res.sort_values(by=\"mean_test_score\", ascending=False, inplace=True)\n", "cv_res = cv_res[[\"param_preprocessing__geo__n_clusters\",\n", " \"param_random_forest__max_features\", \"split0_test_score\",\n", " \"split1_test_score\", \"split2_test_score\", \"mean_test_score\"]]\n", "cv_res.columns = [\"n_clusters\", \"max_features\"] + score_cols\n", "cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)\n", "cv_res.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Bonus section: how to choose the sampling distribution for a hyperparameter**\n", "\n", "* `scipy.stats.randint(a, b+1)`: for hyperparameters with _discrete_ values that range from a to b, and all values in that range seem equally likely.\n", "* `scipy.stats.uniform(a, b)`: this is very similar, but for _continuous_ hyperparameters.\n", "* `scipy.stats.geom(1 / scale)`: for discrete values, when you want to sample roughly in a given scale. E.g., with scale=1000 most samples will be in this ballpark, but ~10% of all samples will be <100 and ~10% will be >2300.\n", "* `scipy.stats.expon(scale)`: this is the continuous equivalent of `geom`. Just set `scale` to the most likely value.\n", "* `scipy.stats.reciprocal(a, b)`: when you have almost no idea what the optimal hyperparameter value's scale is. 
If you set a=0.01 and b=100, then you're just as likely to sample a value between 0.01 and 0.1 as a value between 10 and 100.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are plots of the probability mass functions (for discrete variables), and probability density functions (for continuous variables) for `randint()`, `uniform()`, `geom()` and `expon()`:" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "tags": [] }, "outputs": [], "source": [ "# extra code – plots a few distributions you can use in randomized search\n", "\n", "from scipy.stats import randint, uniform, geom, expon\n", "\n", "xs1 = np.arange(0, 7 + 1)\n", "randint_distrib = randint(0, 7 + 1).pmf(xs1)\n", "\n", "xs2 = np.linspace(0, 7, 500)\n", "uniform_distrib = uniform(0, 7).pdf(xs2)\n", "\n", "xs3 = np.arange(0, 7 + 1)\n", "geom_distrib = geom(0.5).pmf(xs3)\n", "\n", "xs4 = np.linspace(0, 7, 500)\n", "expon_distrib = expon(scale=1).pdf(xs4)\n", "\n", "plt.figure(figsize=(12, 7))\n", "\n", "plt.subplot(2, 2, 1)\n", "plt.bar(xs1, randint_distrib, label=\"scipy.randint(0, 7 + 1)\")\n", "plt.ylabel(\"Probability\")\n", "plt.legend()\n", "plt.axis([-1, 8, 0, 0.2])\n", "\n", "plt.subplot(2, 2, 2)\n", "plt.fill_between(xs2, uniform_distrib, label=\"scipy.uniform(0, 7)\")\n", "plt.ylabel(\"PDF\")\n", "plt.legend()\n", "plt.axis([-1, 8, 0, 0.2])\n", "\n", "plt.subplot(2, 2, 3)\n", "plt.bar(xs3, geom_distrib, label=\"scipy.geom(0.5)\")\n", "plt.xlabel(\"Hyperparameter value\")\n", "plt.ylabel(\"Probability\")\n", "plt.legend()\n", "plt.axis([0, 7, 0, 1])\n", "\n", "plt.subplot(2, 2, 4)\n", "plt.fill_between(xs4, expon_distrib, label=\"scipy.expon(scale=1)\")\n", "plt.xlabel(\"Hyperparameter value\")\n", "plt.ylabel(\"PDF\")\n", "plt.legend()\n", "plt.axis([0, 7, 0, 1])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the PDFs for `expon()` and `reciprocal()` (left column), as well as the PDFs of log(X) (right column). 
The right column shows the distribution of hyperparameter _scales_. You can see that `expon()` favors hyperparameters with roughly the desired scale, with a longer tail towards the smaller scales. But `reciprocal()` does not favor any scale: they are all equally likely:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "tags": [] }, "outputs": [], "source": [ "# extra code – shows the difference between expon and reciprocal\n", "\n", "from scipy.stats import reciprocal\n", "\n", "xs1 = np.linspace(0, 7, 500)\n", "expon_distrib = expon(scale=1).pdf(xs1)\n", "\n", "log_xs2 = np.linspace(-5, 3, 500)\n", "log_expon_distrib = np.exp(log_xs2 - np.exp(log_xs2))\n", "\n", "xs3 = np.linspace(0.001, 1000, 500)\n", "reciprocal_distrib = reciprocal(0.001, 1000).pdf(xs3)\n", "\n", "log_xs4 = np.linspace(np.log(0.001), np.log(1000), 500)\n", "# uniform(loc, scale) spans [loc, loc + scale], so scale is the range's width\n", "log_reciprocal_distrib = uniform(np.log(0.001),\n", " np.log(1000) - np.log(0.001)).pdf(log_xs4)\n", "\n", "plt.figure(figsize=(12, 7))\n", "\n", "plt.subplot(2, 2, 1)\n", "plt.fill_between(xs1, expon_distrib,\n", " label=\"scipy.expon(scale=1)\")\n", "plt.ylabel(\"PDF\")\n", "plt.legend()\n", "plt.axis([0, 7, 0, 1])\n", "\n", "plt.subplot(2, 2, 2)\n", "plt.fill_between(log_xs2, log_expon_distrib,\n", " label=\"log(X) with X ~ expon\")\n", "plt.legend()\n", "plt.axis([-5, 3, 0, 1])\n", "\n", "plt.subplot(2, 2, 3)\n", "plt.fill_between(xs3, reciprocal_distrib,\n", " label=\"scipy.reciprocal(0.001, 1000)\")\n", "plt.xlabel(\"Hyperparameter value\")\n", "plt.ylabel(\"PDF\")\n", "plt.legend()\n", "plt.axis([0.001, 1000, 0, 0.005])\n", "\n", "plt.subplot(2, 2, 4)\n", "plt.fill_between(log_xs4, log_reciprocal_distrib,\n", " label=\"log(X) with X ~ reciprocal\")\n", "plt.xlabel(\"Log of hyperparameter value\")\n", "plt.legend()\n", "plt.axis([-8, 8, 0, 0.2])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze the Best Models and Their Errors" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, 
"outputs": [], "source": [ "final_model = rnd_search.best_estimator_ # includes preprocessing\n", "feature_importances = final_model[\"random_forest\"].feature_importances_\n", "feature_importances.round(2)" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "sorted(zip(feature_importances,\n", " final_model[\"preprocessing\"].get_feature_names_out()),\n", " reverse=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Your System on the Test Set" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "X_test = strat_test_set.drop(\"median_house_value\", axis=1)\n", "y_test = strat_test_set[\"median_house_value\"].copy()\n", "\n", "final_predictions = final_model.predict(X_test)\n", "\n", "final_rmse = mean_squared_error(y_test, final_predictions, squared=False)\n", "print(final_rmse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute a 95% confidence interval for the test RMSE:" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "from scipy import stats\n", "\n", "confidence = 0.95\n", "squared_errors = (final_predictions - y_test) ** 2\n", "np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,\n", " loc=squared_errors.mean(),\n", " scale=stats.sem(squared_errors)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could compute the interval manually like this:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [], "source": [ "# extra code – shows how to compute a confidence interval for the RMSE\n", "m = len(squared_errors)\n", "mean = squared_errors.mean()\n", "tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)\n", "tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)\n", "np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we could use a z-scores 
rather than t-scores—since the test set is not too small, it won't make a big difference:" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [], "source": [ "# extra code – computes a confidence interval again using z-score\n", "zscore = stats.norm.ppf((1 + confidence) / 2)\n", "zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n", "np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model persistence using joblib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the final model:" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "import joblib\n", "\n", "joblib.dump(final_model, \"my_california_housing_model.pkl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can deploy this model to production. For example, the following code could be a script that would run in production:" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "import joblib\n", "\n", "# extra code – excluded for conciseness\n", "from sklearn.cluster import KMeans\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.metrics.pairwise import rbf_kernel\n", "\n", "def column_ratio(X):\n", " return X[:, [0]] / X[:, [1]]\n", "\n", "#class ClusterSimilarity(BaseEstimator, TransformerMixin):\n", "# [...]\n", "\n", "final_model_reloaded = joblib.load(\"my_california_housing_model.pkl\")\n", "\n", "new_data = housing.iloc[:5] # pretend these are new districts\n", "predictions = final_model_reloaded.predict(new_data)" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also works with pickle, but joblib is more efficient." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Try a Support Vector Machine regressor (`sklearn.svm.SVR`) with various hyperparameters, such as `kernel=\"linear\"` (with various values for the `C` hyperparameter) or `kernel=\"rbf\"` (with various values for the `C` and `gamma` hyperparameters). Note that SVMs don't scale well to large datasets, so you should probably train your model on just the first 5,000 instances of the training set and use only 3-fold cross-validation, or else it will take hours. Don't worry about what the hyperparameters mean for now (see the SVM notebook if you're interested). How does the best `SVR` predictor perform?_" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.svm import SVR\n", "\n", "param_grid = [\n", " {'svr__kernel': ['linear'], 'svr__C': [10., 30., 100., 300., 1000.,\n", " 3000., 10000., 30000.0]},\n", " {'svr__kernel': ['rbf'], 'svr__C': [1.0, 3.0, 10., 30., 100., 300.,\n", " 1000.0],\n", " 'svr__gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},\n", " ]\n", "\n", "svr_pipeline = Pipeline([(\"preprocessing\", preprocessing), (\"svr\", SVR())])\n", "grid_search = GridSearchCV(svr_pipeline, param_grid, cv=3,\n", " scoring='neg_root_mean_squared_error')\n", "grid_search.fit(housing.iloc[:5000], housing_labels.iloc[:5000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 3-fold cross validation):" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "svr_grid_search_rmse = -grid_search.best_score_\n", "svr_grid_search_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's much worse than the `RandomForestRegressor` (but 
to be fair, we trained the model on much less data). Let's check the best hyperparameters found:" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "grid_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear kernel seems better than the RBF kernel. Notice that the value of `C` is the maximum tested value. When this happens you definitely want to launch the grid search again with higher values for `C` (removing the smallest values), because it is likely that higher values of `C` will be better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Try replacing the `GridSearchCV` with a `RandomizedSearchCV`._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning:** the following cell will take several minutes to run. You can specify `verbose=2` when creating the `RandomizedSearchCV` if you want to see the training details." 
] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import expon, reciprocal\n", "\n", "# see https://docs.scipy.org/doc/scipy/reference/stats.html\n", "# for `expon()` and `reciprocal()` documentation and more probability distribution functions.\n", "\n", "# Note: gamma is ignored when kernel is \"linear\"\n", "param_distribs = {\n", " 'svr__kernel': ['linear', 'rbf'],\n", " 'svr__C': reciprocal(20, 200_000),\n", " 'svr__gamma': expon(scale=1.0),\n", " }\n", "\n", "rnd_search = RandomizedSearchCV(svr_pipeline,\n", " param_distributions=param_distribs,\n", " n_iter=50, cv=3,\n", " scoring='neg_root_mean_squared_error',\n", " random_state=42)\n", "rnd_search.fit(housing.iloc[:5000], housing_labels.iloc[:5000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best model achieves the following score (evaluated using 3-fold cross validation):" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "svr_rnd_search_rmse = -rnd_search.best_score_\n", "svr_rnd_search_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that's really much better, but still far from the `RandomForestRegressor`'s performance. Let's check the best hyperparameters found:" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "rnd_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the search found a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we used the `expon()` distribution for `gamma`, with a scale of 1, so `RandomizedSearchCV` mostly searched for values roughly of that scale: about 80% of the samples were between 0.1 and 2.3 (roughly 10% were smaller and 10% were larger):" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "\n", "s = expon(scale=1).rvs(100_000) # get 100,000 samples\n", "((s > 0.105) & (s < 2.29)).sum() / 100_000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used the `reciprocal()` distribution for `C`, meaning we did not have a clue what the optimal scale of `C` was before running the random search. It explored the range from 20 to 200 just as much as the range from 2,000 to 20,000 or from 20,000 to 200,000." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Try adding a `SelectFromModel` transformer in the preparation pipeline to select only the most important attributes._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a new pipeline that runs the previously defined preparation pipeline, and adds a `SelectFromModel` transformer based on a `RandomForestRegressor` before the final regressor:" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "\n", "selector_pipeline = Pipeline([\n", " ('preprocessing', preprocessing),\n", " ('selector', SelectFromModel(RandomForestRegressor(random_state=42),\n", " threshold=0.005)), # min feature importance\n", " ('svr', SVR(C=rnd_search.best_params_[\"svr__C\"],\n", " gamma=rnd_search.best_params_[\"svr__gamma\"],\n", " kernel=rnd_search.best_params_[\"svr__kernel\"])),\n", "])" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ 
"selector_rmses = -cross_val_score(selector_pipeline,\n", " housing.iloc[:5000],\n", " housing_labels.iloc[:5000],\n", " scoring=\"neg_root_mean_squared_error\",\n", " cv=3)\n", "pd.Series(selector_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh well, feature selection does not seem to help. But maybe that's just because the threshold we used was not optimal. Perhaps try tuning it using random search or grid search?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Try creating a custom transformer that trains a k-Nearest Neighbors regressor (`sklearn.neighbors.KNeighborsRegressor`) in its `fit()` method, and outputs the model's predictions in its `transform()` method. Then add this feature to the preprocessing pipeline, using latitude and longitude as the inputs to this transformer. This will add a feature in the model that corresponds to the housing median price of the nearest districts._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than restrict ourselves to k-Nearest Neighbors regressors, let's create a transform that accepts any regressor. For this, we can extend the `MetaEstimatorMixin` and have a required `estimator` argument in the constructor. The `fit()` method must work on a clone of this estimator, and it must also save `feature_names_in_`. The `MetaEstimatorMixin` will ensure that `estimator` is listed as a required parameters, and it will update `get_params()` and `set_params()` to make the estimator's hyperparameters available for tuning. 
Lastly, we create a `get_feature_names_out()` method: the output column name is the estimator's class name in lowercase, followed by `_prediction_` and the output index." ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.base import MetaEstimatorMixin, clone\n", "\n", "class FeatureFromRegressor(MetaEstimatorMixin, BaseEstimator, TransformerMixin):\n", "    def __init__(self, estimator):\n", "        self.estimator = estimator\n", "\n", "    def fit(self, X, y=None):\n", "        estimator_ = clone(self.estimator)\n", "        estimator_.fit(X, y)\n", "        self.estimator_ = estimator_\n", "        self.n_features_in_ = self.estimator_.n_features_in_\n", "        # read the names from the fitted clone, not the unfitted original\n", "        if hasattr(self.estimator_, \"feature_names_in_\"):\n", "            self.feature_names_in_ = self.estimator_.feature_names_in_\n", "        return self # always return self!\n", "\n", "    def transform(self, X):\n", "        check_is_fitted(self)\n", "        predictions = self.estimator_.predict(X)\n", "        if predictions.ndim == 1:\n", "            predictions = predictions.reshape(-1, 1)\n", "        return predictions\n", "\n", "    def get_feature_names_out(self, names=None):\n", "        check_is_fitted(self)\n", "        n_outputs = getattr(self.estimator_, \"n_outputs_\", 1)\n", "        estimator_class_name = self.estimator_.__class__.__name__\n", "        estimator_short_name = estimator_class_name.lower().replace(\"_\", \"\")\n", "        return [f\"{estimator_short_name}_prediction_{i}\"\n", "                for i in range(n_outputs)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's ensure it complies with Scikit-Learn's API:" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils.estimator_checks import check_estimator\n", "\n", "check_estimator(FeatureFromRegressor(KNeighborsRegressor()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good! 
Now let's test it:" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ "knn_reg = KNeighborsRegressor(n_neighbors=3, weights=\"distance\")\n", "knn_transformer = FeatureFromRegressor(knn_reg)\n", "geo_features = housing[[\"latitude\", \"longitude\"]]\n", "knn_transformer.fit_transform(geo_features, housing_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And what does its output feature name look like?" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "knn_transformer.get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now let's include this transformer in our preprocessing pipeline:" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import clone\n", "\n", "transformers = [(name, clone(transformer), columns)\n", " for name, transformer, columns in preprocessing.transformers]\n", "geo_index = [name for name, _, _ in transformers].index(\"geo\")\n", "transformers[geo_index] = (\"geo\", knn_transformer, [\"latitude\", \"longitude\"])\n", "\n", "new_geo_preprocessing = ColumnTransformer(transformers)" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "new_geo_pipeline = Pipeline([\n", " ('preprocessing', new_geo_preprocessing),\n", " ('svr', SVR(C=rnd_search.best_params_[\"svr__C\"],\n", " gamma=rnd_search.best_params_[\"svr__gamma\"],\n", " kernel=rnd_search.best_params_[\"svr__kernel\"])),\n", "])" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [], "source": [ "new_pipe_rmses = -cross_val_score(new_geo_pipeline,\n", " housing.iloc[:5000],\n", " housing_labels.iloc[:5000],\n", " scoring=\"neg_root_mean_squared_error\",\n", " cv=3)\n", "pd.Series(new_pipe_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes, that's terrible! 
Apparently the cluster similarity features were much better. But perhaps we should tune the `KNeighborsRegressor`'s hyperparameters? That's what the next exercise is about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Automatically explore some preparation options using `RandomizedSearchCV`._" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [], "source": [ "param_distribs = {\n", " \"preprocessing__geo__estimator__n_neighbors\": range(1, 30),\n", " \"preprocessing__geo__estimator__weights\": [\"distance\", \"uniform\"],\n", " \"svr__C\": reciprocal(20, 200_000),\n", " \"svr__gamma\": expon(scale=1.0),\n", "}\n", "\n", "new_geo_rnd_search = RandomizedSearchCV(new_geo_pipeline,\n", " param_distributions=param_distribs,\n", " n_iter=50,\n", " cv=3,\n", " scoring='neg_root_mean_squared_error',\n", " random_state=42)\n", "new_geo_rnd_search.fit(housing.iloc[:5000], housing_labels.iloc[:5000])" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [], "source": [ "new_geo_rnd_search_rmse = -new_geo_rnd_search.best_score_\n", "new_geo_rnd_search_rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh well... at least we tried! It looks like the cluster similarity features are definitely better than the KNN feature. But perhaps you could try having both? And maybe training on the full training set would help as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Try to implement the `StandardScalerClone` class again from scratch, then add support for the `inverse_transform()` method: executing `scaler.inverse_transform(scaler.fit_transform(X))` should return an array very close to `X`. Then add support for feature names: set `feature_names_in_` in the `fit()` method if the input is a DataFrame. 
This attribute should be a NumPy array of column names. Lastly, implement the `get_feature_names_out()` method: it should have one optional `input_features=None` argument. If passed, the method should check that its length matches `n_features_in_`, and it should match `feature_names_in_` if it is defined, then `input_features` should be returned. If `input_features` is `None`, then the method should return `feature_names_in_` if it is defined or `np.array([\"x0\", \"x1\", ...])` with length `n_features_in_` otherwise._" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [], "source": [ "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.utils.validation import check_array, check_is_fitted\n", "\n", "class StandardScalerClone(BaseEstimator, TransformerMixin):\n", "    def __init__(self, with_mean=True): # no *args or **kwargs!\n", "        self.with_mean = with_mean\n", "\n", "    def fit(self, X, y=None): # y is required even though we don't use it\n", "        X_orig = X\n", "        X = check_array(X) # checks that X is an array with finite float values\n", "        self.mean_ = X.mean(axis=0)\n", "        self.scale_ = X.std(axis=0)\n", "        self.n_features_in_ = X.shape[1] # every estimator stores this in fit()\n", "        if hasattr(X_orig, \"columns\"):\n", "            self.feature_names_in_ = np.array(X_orig.columns, dtype=object)\n", "        return self # always return self!\n", "\n", "    def transform(self, X):\n", "        check_is_fitted(self) # looks for learned attributes (with trailing _)\n", "        X = check_array(X)\n", "        if self.n_features_in_ != X.shape[1]:\n", "            raise ValueError(\"Unexpected number of features\")\n", "        if self.with_mean:\n", "            X = X - self.mean_\n", "        return X / self.scale_\n", "\n", "    def inverse_transform(self, X):\n", "        check_is_fitted(self)\n", "        X = check_array(X)\n", "        if self.n_features_in_ != X.shape[1]:\n", "            raise ValueError(\"Unexpected number of features\")\n", "        X = X * self.scale_\n", "        return X + self.mean_ if self.with_mean else X\n", 
"\n", "    def get_feature_names_out(self, input_features=None):\n", "        if input_features is None:\n", "            return getattr(self, \"feature_names_in_\",\n", "                           [f\"x{i}\" for i in range(self.n_features_in_)])\n", "        else:\n", "            if len(input_features) != self.n_features_in_:\n", "                raise ValueError(\"Invalid number of features\")\n", "            if hasattr(self, \"feature_names_in_\") and not np.all(\n", "                self.feature_names_in_ == input_features\n", "            ):\n", "                raise ValueError(\"input_features ≠ feature_names_in_\")\n", "            return input_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test our custom transformer:" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils.estimator_checks import check_estimator\n", "\n", "check_estimator(StandardScalerClone())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No errors, that's a great start: we respect the Scikit-Learn API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's ensure the transformation works as expected:" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "X = np.random.rand(1000, 3)\n", "\n", "scaler = StandardScalerClone()\n", "X_scaled = scaler.fit_transform(X)\n", "\n", "assert np.allclose(X_scaled, (X - X.mean(axis=0)) / X.std(axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about setting `with_mean=False`?" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScalerClone(with_mean=False)\n", "X_scaled_uncentered = scaler.fit_transform(X)\n", "\n", "assert np.allclose(X_scaled_uncentered, X / X.std(axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And does the inverse work?" 
] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScalerClone()\n", "X_back = scaler.inverse_transform(scaler.fit_transform(X))\n", "assert np.allclose(X, X_back)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the feature names out?" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "assert np.all(scaler.get_feature_names_out() == [\"x0\", \"x1\", \"x2\"])\n", "assert np.all(scaler.get_feature_names_out([\"a\", \"b\", \"c\"]) == [\"a\", \"b\", \"c\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And if we fit a DataFrame, are the feature names in and out OK?" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\"a\": np.random.rand(100), \"b\": np.random.rand(100)})\n", "scaler = StandardScalerClone()\n", "X_scaled = scaler.fit_transform(df)\n", "\n", "assert np.all(scaler.feature_names_in_ == [\"a\", \"b\"])\n", "assert np.all(scaler.get_feature_names_out() == [\"a\", \"b\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All good! That's all for today! 😀" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You already know quite a lot about Machine Learning. :)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "nav_menu": { "height": "279px", "width": "309px" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }