{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Chapter 6 – Ensemble Learning and Random Forests**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_This notebook contains all the sample code and solutions to the exercises in chapter 6._" ] },
{ "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project requires Python 3.8 or above:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "assert sys.version_info >= (3, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also requires Scikit-Learn ≥ 1.0.1:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "\n", "assert sklearn.__version__ >= \"1.0.1\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did in previous chapters, let's define the default font sizes to make the figures prettier:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import matplotlib as mpl\n", "\n", "mpl.rc('font', size=12)\n", "mpl.rc('axes', labelsize=14, titlesize=14)\n", "mpl.rc('legend', fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's create the `images/ensembles` folder (if it doesn't already exist), and define the `save_fig()` function, which is used throughout this notebook to save the figures in high resolution for the book:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "IMAGES_PATH = Path() / \"images\" / \"ensembles\"\n", "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n", "\n", "def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n", "    path = IMAGES_PATH / f\"{fig_id}.{fig_extension}\"\n", "    if tight_layout:\n", "        plt.tight_layout()\n", "    plt.savefig(path, format=fig_extension, dpi=resolution)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Voting Classifiers" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–3\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "heads_proba = 0.51\n", "np.random.seed(42)\n", "coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)\n", "cumulative_heads = coin_tosses.cumsum(axis=0)\n", "cumulative_heads_ratio = cumulative_heads / np.arange(1, 10001).reshape(-1, 1)\n", "\n", "plt.figure(figsize=(8,3.5))\n", "plt.plot(cumulative_heads_ratio)\n", "plt.plot([0, 10000], [0.51, 0.51], \"k--\", linewidth=2, label=\"51%\")\n", "plt.plot([0, 10000], [0.5, 0.5], \"k-\", label=\"50%\")\n", "plt.xlabel(\"Number of coin tosses\")\n", "plt.ylabel(\"Heads ratio\")\n", "plt.legend(loc=\"lower right\")\n", "plt.axis([0, 10000, 0.42, 0.58])\n", "plt.grid()\n", "save_fig(\"law_of_large_numbers_plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's build a voting classifier:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_moons\n", "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVC\n", "\n", "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n", "\n", "voting_clf = VotingClassifier(\n", "    estimators=[\n", "        ('lr', LogisticRegression(random_state=42)),\n", "        ('rf', RandomForestClassifier(random_state=42)),\n", "        ('svc', SVC(random_state=42))\n", "    ]\n", ")\n", "voting_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "for name, clf in voting_clf.named_estimators_.items():\n", "    print(name, \"=\", clf.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "voting_clf.predict(X_test[:1])" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "voting_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's use soft voting:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "voting_clf.voting = \"soft\"\n", "voting_clf.named_estimators[\"svc\"].probability = True\n", "voting_clf.fit(X_train, y_train)\n", "voting_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bagging and Pasting\n", "## Bagging and Pasting in Scikit-Learn" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import BaggingClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n", "                            max_samples=100, random_state=42)\n", "bag_clf.fit(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–5\n", "\n", "def plot_decision_boundary(clf, X, y, alpha=1.0):\n", "    axes = [-1.5, 2.4, -1, 1.5]\n", "    x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),\n", "                         np.linspace(axes[2], axes[3], 100))\n", "    X_new = np.c_[x1.ravel(), x2.ravel()]\n", "    y_pred = clf.predict(X_new).reshape(x1.shape)\n", "\n", "    plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')\n", "    plt.contour(x1, x2, y_pred, cmap=\"Greys\", alpha=0.8 * alpha)\n", "    colors = [\"#78785c\", \"#c47b27\"]\n", "    markers = (\"o\", \"^\")\n", "    for idx in (0, 1):\n", "        plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],\n", "                 color=colors[idx], marker=markers[idx], linestyle=\"none\")\n", "    plt.axis(axes)\n", "    plt.xlabel(r\"$x_1$\")\n", "    plt.ylabel(r\"$x_2$\", rotation=0)\n", "\n", "tree_clf = DecisionTreeClassifier(random_state=42)\n", "tree_clf.fit(X_train, y_train)\n", "\n", "fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)\n", "plt.sca(axes[0])\n", "plot_decision_boundary(tree_clf, X_train, y_train)\n", "plt.title(\"Decision Tree\")\n", "plt.sca(axes[1])\n", "plot_decision_boundary(bag_clf, X_train, y_train)\n", "plt.title(\"Decision Trees with Bagging\")\n", "plt.ylabel(\"\")\n", "save_fig(\"decision_tree_without_and_with_bagging_plot\")\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Out-of-Bag evaluation" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n", "                            oob_score=True, random_state=42)\n", "bag_clf.fit(X_train, y_train)\n", "bag_clf.oob_score_" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "bag_clf.oob_decision_function_[:3] # probas for the first 3 instances" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "y_pred = bag_clf.predict(X_test)\n", "accuracy_score(y_test, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you randomly draw one instance from a dataset of size _m_, each instance in the dataset has probability 1/_m_ of getting picked, and therefore it has a probability 1 – 1/_m_ of _not_ getting picked. If you draw _m_ instances with replacement, all draws are independent and therefore each instance has a probability (1 – 1/_m_)<sup>_m_</sup> of _not_ getting picked. Now let's use the fact that exp(_x_) is equal to the limit of (1 + _x_/_m_)<sup>_m_</sup> as _m_ approaches infinity. Setting _x_ = –1, if _m_ is large, the ratio of out-of-bag instances will be about exp(–1) ≈ 0.37. So roughly 63% (1 – 0.37) will be sampled."
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# not in the book – this code shows how to compute the 63% proba\n", "print(1 - (1 - 1 / 1000) ** 1000)\n", "print(1 - np.exp(-1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Forests" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,\n", " n_jobs=-1, random_state=42)\n", "rnd_clf.fit(X_train, y_train)\n", "y_pred_rf = rnd_clf.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Random Forest is equivalent to a bag of decision trees:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "bag_clf = BaggingClassifier(\n", " DecisionTreeClassifier(max_features=\"sqrt\", max_leaf_nodes=16),\n", " n_estimators=500, n_jobs=-1, random_state=42)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# not in the book – this code verifies that the predictions are identical\n", "bag_clf.fit(X_train, y_train)\n", "y_pred_bag = bag_clf.predict(X_test)\n", "np.all(y_pred_bag == y_pred_rf) # same predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Importance" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris(as_frame=True)\n", "rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)\n", "rnd_clf.fit(iris.data, iris.target)\n", "for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):\n", " print(round(score, 2), name)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–6\n", "\n", "from sklearn.datasets 
import fetch_openml\n", "\n", "X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False)\n", "\n", "rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)\n", "rnd_clf.fit(X_mnist, y_mnist)\n", "\n", "heatmap_image = rnd_clf.feature_importances_.reshape(28, 28)\n", "plt.imshow(heatmap_image, cmap=\"hot\")\n", "cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),\n", "                           rnd_clf.feature_importances_.max()])\n", "cbar.ax.set_yticklabels(['Not important', 'Very important'], fontsize=14)\n", "plt.axis(\"off\")\n", "save_fig(\"mnist_feature_importance_plot\")\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Boosting\n", "## AdaBoost" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–8\n", "\n", "m = len(X_train)\n", "\n", "fig, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n", "for subplot, learning_rate in ((0, 1), (1, 0.5)):\n", "    sample_weights = np.ones(m) / m\n", "    plt.sca(axes[subplot])\n", "    for i in range(5):\n", "        svm_clf = SVC(C=0.2, gamma=0.6, random_state=42)\n", "        svm_clf.fit(X_train, y_train, sample_weight=sample_weights * m)\n", "        y_pred = svm_clf.predict(X_train)\n", "\n", "        error_weights = sample_weights[y_pred != y_train].sum()\n", "        r = error_weights / sample_weights.sum() # equation 7-1\n", "        alpha = learning_rate * np.log((1 - r) / r) # equation 7-2\n", "        sample_weights[y_pred != y_train] *= np.exp(alpha) # equation 7-3\n", "        sample_weights /= sample_weights.sum() # normalization step\n", "\n", "        plot_decision_boundary(svm_clf, X_train, y_train, alpha=0.4)\n", "        plt.title(\"learning_rate = {}\".format(learning_rate))\n", "    if subplot == 0:\n", "        plt.text(-0.75, -0.95, \"1\", fontsize=16)\n", "        plt.text(-1.05, -0.95, \"2\", fontsize=16)\n", "        plt.text(1.0, -0.95, \"3\", fontsize=16)\n", "        plt.text(-1.45, -0.5, \"4\", fontsize=16)\n", "        plt.text(1.36, -0.95, \"5\", fontsize=16)\n", "    else:\n", "        plt.ylabel(\"\")\n", "\n", "save_fig(\"boosting_plot\")\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "ada_clf = AdaBoostClassifier(\n", "    DecisionTreeClassifier(max_depth=1), n_estimators=30,\n", "    learning_rate=0.5, random_state=42)\n", "ada_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# not in the book – in case you're curious to see what the decision boundary\n", "# looks like for the AdaBoost classifier\n", "plot_decision_boundary(ada_clf, X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient Boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a simple quadratic dataset and fit a `DecisionTreeRegressor` to it:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.tree import DecisionTreeRegressor\n", "\n", "np.random.seed(42)\n", "X = np.random.rand(100, 1) - 0.5\n", "y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100) # y = 3x² + Gaussian noise\n", "\n", "tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)\n", "tree_reg1.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's train a second decision tree regressor on the residual errors made by the first tree, then a third one on the residual errors of the second:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "y2 = y - tree_reg1.predict(X)\n", "tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)\n", "tree_reg2.fit(X, y2)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "y3 = y2 - tree_reg2.predict(X)\n", "tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)\n", "tree_reg3.fit(X, y3)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "X_new = np.array([[-0.4], [0.], 
[0.5]])\n", "sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–9\n", "\n", "def plot_predictions(regressors, X, y, axes, style,\n", "                     label=None, data_style=\"b.\", data_label=None):\n", "    x1 = np.linspace(axes[0], axes[1], 500)\n", "    y_pred = sum(regressor.predict(x1.reshape(-1, 1))\n", "                 for regressor in regressors)\n", "    plt.plot(X[:, 0], y, data_style, label=data_label)\n", "    plt.plot(x1, y_pred, style, linewidth=2, label=label)\n", "    if label or data_label:\n", "        plt.legend(loc=\"upper center\")\n", "    plt.axis(axes)\n", "\n", "plt.figure(figsize=(11,11))\n", "\n", "plt.subplot(3, 2, 1)\n", "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style=\"g-\",\n", "                 label=\"$h_1(x_1)$\", data_label=\"Training set\")\n", "plt.ylabel(\"$y$ \", rotation=0)\n", "plt.title(\"Residuals and tree predictions\")\n", "\n", "plt.subplot(3, 2, 2)\n", "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style=\"r-\",\n", "                 label=\"$h(x_1) = h_1(x_1)$\", data_label=\"Training set\")\n", "plt.title(\"Ensemble predictions\")\n", "\n", "plt.subplot(3, 2, 3)\n", "plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.4, 0.6], style=\"g-\",\n", "                 label=\"$h_2(x_1)$\", data_style=\"k+\",\n", "                 data_label=\"Residuals: $y - h_1(x_1)$\")\n", "plt.ylabel(\"$y$ \", rotation=0)\n", "\n", "plt.subplot(3, 2, 4)\n", "plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.2, 0.8],\n", "                 style=\"r-\", label=\"$h(x_1) = h_1(x_1) + h_2(x_1)$\")\n", "\n", "plt.subplot(3, 2, 5)\n", "plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.4, 0.6], style=\"g-\",\n", "                 label=\"$h_3(x_1)$\", data_style=\"k+\",\n", "                 data_label=\"Residuals: $y - h_1(x_1) - h_2(x_1)$\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$y$ \", rotation=0)\n", "\n", "plt.subplot(3, 2, 6)\n", 
"plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y,\n", " axes=[-0.5, 0.5, -0.2, 0.8], style=\"r-\",\n", " label=\"$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$\")\n", "plt.xlabel(\"$x_1$\")\n", "\n", "save_fig(\"gradient_boosting_plot\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try a gradient boosting regressor:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,\n", " learning_rate=1.0, random_state=42)\n", "gbrt.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "gbrt_best = GradientBoostingRegressor(\n", " max_depth=2, learning_rate=0.05, n_estimators=500,\n", " n_iter_no_change=10, random_state=42)\n", "gbrt_best.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "gbrt_best.n_estimators_" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# not in the book – this cell generates and saves Figure 6–10\n", "\n", "fix, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n", "\n", "plt.sca(axes[0])\n", "plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style=\"r-\",\n", " label=\"Ensemble predictions\")\n", "plt.title(f\"learning_rate={gbrt.learning_rate}, \"\n", " f\"n_estimators={gbrt.n_estimators_}\")\n", "plt.xlabel(\"$x_1$\")\n", "plt.ylabel(\"$y$\", rotation=0)\n", "\n", "plt.sca(axes[1])\n", "plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style=\"r-\")\n", "plt.title(f\"learning_rate={gbrt_best.learning_rate}, \"\n", " f\"n_estimators={gbrt_best.n_estimators_}\")\n", "plt.xlabel(\"$x_1$\")\n", "\n", "save_fig(\"gbrt_learning_rate_plot\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# not in 
the book – at least not in this chapter, it's presented in chapter 2\n", "\n", "import tarfile\n", "import urllib.request\n", "\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "\n", "def load_housing_data():\n", "    housing_path = Path() / \"datasets\" / \"housing\"\n", "    if not (housing_path / \"housing.csv\").is_file():\n", "        housing_path.mkdir(parents=True, exist_ok=True)\n", "        root = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", "        url = root + \"datasets/housing/housing.tgz\"\n", "        tgz_path = housing_path / \"housing.tgz\"\n", "        urllib.request.urlretrieve(url, tgz_path)\n", "        with tarfile.open(tgz_path) as housing_tgz:\n", "            housing_tgz.extractall(path=housing_path)\n", "    return pd.read_csv(housing_path / \"housing.csv\")\n", "\n", "housing = load_housing_data()\n", "\n", "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)\n", "housing_labels = train_set[\"median_house_value\"]\n", "housing = train_set.drop(\"median_house_value\", axis=1)" ] },
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.compose import make_column_transformer\n", "from sklearn.ensemble import HistGradientBoostingRegressor\n", "from sklearn.preprocessing import OrdinalEncoder\n", "\n", "hgb_reg = make_pipeline(\n", "    make_column_transformer((OrdinalEncoder(), [\"ocean_proximity\"]),\n", "                            remainder=\"passthrough\"),\n", "    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)\n", ")\n", "hgb_reg.fit(housing, housing_labels)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# not in the book – evaluate the RMSE stats for the hgb_reg model\n", "\n", "from sklearn.model_selection import cross_val_score\n", "\n", "hgb_rmses = -cross_val_score(hgb_reg, housing, housing_labels,\n", "                             scoring=\"neg_root_mean_squared_error\", cv=10)\n", 
"pd.Series(hgb_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import StackingClassifier\n", "\n", "stacking_clf = StackingClassifier(\n", " estimators=[\n", " ('lr', LogisticRegression(random_state=42)),\n", " ('rf', RandomForestClassifier(random_state=42)),\n", " ('svc', SVC(probability=True, random_state=42))\n", " ],\n", " final_estimator=RandomForestClassifier(random_state=43),\n", " cv=5 # number of cross-validation folds\n", ")\n", "stacking_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "stacking_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. to 7." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See Appendix A." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Voting Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing)._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The MNIST dataset was loaded earlier. The dataset is already split into a training set (the first 60,000 instances) and a test set (the last 10,000 instances), and the training set is already shuffled. 
So all we need to do is to take the first 50,000 instances for the new training set, the next 10,000 for the validation set, and the last 10,000 for the test set:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = X_mnist[:50_000], y_mnist[:50_000]\n", "X_valid, y_valid = X_mnist[50_000:60_000], y_mnist[50_000:60_000]\n", "X_test, y_test = X_mnist[60_000:], y_mnist[60_000:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM._" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.neural_network import MLPClassifier" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)\n", "extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)\n", "svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)\n", "mlp_clf = MLPClassifier(random_state=42)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]\n", "for estimator in estimators:\n", " print(\"Training the\", estimator)\n", " estimator.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "[estimator.score(X_valid, y_valid) for estimator in estimators]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear SVM is far outperformed by the other classifiers. However, let's keep it for now since it may improve the voting classifier's performance." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Next, try to combine \\[the classifiers\\] into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier._" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import VotingClassifier" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "named_estimators = [\n", " (\"random_forest_clf\", random_forest_clf),\n", " (\"extra_trees_clf\", extra_trees_clf),\n", " (\"svm_clf\", svm_clf),\n", " (\"mlp_clf\", mlp_clf),\n", "]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "voting_clf = VotingClassifier(named_estimators)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "voting_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "voting_clf.score(X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `VotingClassifier` made a clone of each classifier, and it trained the clones using class indices as the labels, not the original class names. Therefore, to evaluate these clones we need to provide class indices as well. 
To convert the classes to class indices, we can use a `LabelEncoder`:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "encoder = LabelEncoder()\n", "y_valid_encoded = encoder.fit_transform(y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, in the case of MNIST, it's simpler to just convert the class names to integers, since the digits match the class ids:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "y_valid_encoded = y_valid.astype(np.int64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the classifier clones:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "[estimator.score(X_valid, y_valid_encoded)\n", " for estimator in voting_clf.estimators_]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's remove the SVM to see if performance improves. 
It is possible to remove an estimator by setting it to `\"drop\"` using `set_params()` like this:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "voting_clf.set_params(svm_clf=\"drop\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This updated the list of estimators:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "voting_clf.estimators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, it did not update the list of _trained_ estimators:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "voting_clf.estimators_" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "voting_clf.named_estimators_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can either fit the `VotingClassifier` again, or just remove the SVM from the list of trained estimators, both in `estimators_` and `named_estimators_`:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "svm_clf_trained = voting_clf.named_estimators_.pop(\"svm_clf\")\n", "voting_clf.estimators_.remove(svm_clf_trained)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's evaluate the `VotingClassifier` again:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "voting_clf.score(X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A bit better! The SVM was hurting performance. Now let's try using a soft voting classifier. 
We do not actually need to retrain the classifier; we can just set `voting` to `\"soft\"`:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "voting_clf.voting = \"soft\"" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "voting_clf.score(X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nope, hard voting wins in this case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Once you have found \\[an ensemble that performs better than the individual predictors\\], try it on the test set. How much better does it perform compared to the individual classifiers?_" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "voting_clf.voting = \"hard\"\n", "voting_clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "[estimator.score(X_test, y_test.astype(np.int64))\n", " for estimator in voting_clf.estimators_]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The voting classifier reduced the error rate of the best model from about 3% to 2.7%, which means about 10% fewer errors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Stacking Ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. 
Train a classifier on this new training set._" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=object)\n", "\n", "for index, estimator in enumerate(estimators):\n", "    X_valid_predictions[:, index] = estimator.predict(X_valid)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "X_valid_predictions" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True,\n", "                                            random_state=42)\n", "rnd_forest_blender.fit(X_valid_predictions, y_valid)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "rnd_forest_blender.oob_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could fine-tune this blender or try other types of blenders (e.g., an `MLPClassifier`), then select the best one using cross-validation, as always." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. 
How does it compare to the voting classifier you trained earlier?_" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=object)\n", "\n", "for index, estimator in enumerate(estimators):\n", "    X_test_predictions[:, index] = estimator.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "y_pred = rnd_forest_blender.predict(X_test_predictions)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This stacking ensemble does not perform as well as the voting classifier we trained earlier, and it's even very slightly worse than the best individual classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise: _Now try again using a `StackingClassifier` instead: do you get better performance? If so, why?_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `StackingClassifier` uses K-Fold cross-validation, we don't need a separate validation set, so let's join the training set and the validation set into a bigger training set:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "X_train_full, y_train_full = X_mnist[:60_000], y_mnist[:60_000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create and train the stacking classifier on the full training set:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: the following cell will take quite a while to run (15-30 minutes depending on your hardware), as it uses K-Fold cross-validation with 5 folds by default. 
It will train the 4 classifiers 5 times each on 80% of the full training set to make the predictions, plus one last time each on the full training set, and lastly it will train the final model on the predictions. That's a total of 25 models to train!" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "stack_clf = StackingClassifier(named_estimators,\n", "                               final_estimator=rnd_forest_blender)\n", "stack_clf.fit(X_train_full, y_train_full)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "stack_clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `StackingClassifier` significantly outperforms the custom stacking implementation we tried earlier! This is mainly for two reasons:\n", "\n", "* Since we could reclaim the validation set, the `StackingClassifier` was trained on a larger dataset.\n", "* It used `predict_proba()` if available, or else `decision_function()` if available, or else `predict()`. This gave the blender much more nuanced inputs to work with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's all for today. Congratulations on finishing the chapter and the exercises!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "nav_menu": { "height": "252px", "width": "333px" }, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }