From 93676a4f23822a796f00a408afec2cdb83f78b85 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aur=C3=A9lien=20Geron?= Date: Wed, 10 Nov 2021 18:00:46 +1300 Subject: [PATCH] Big update to notebook 06 for 3rd edition --- 06_ensemble_learning_and_random_forests.ipynb | 1260 ++++++++--------- 1 file changed, 606 insertions(+), 654 deletions(-) diff --git a/06_ensemble_learning_and_random_forests.ipynb b/06_ensemble_learning_and_random_forests.ipynb index 73a7af5..c0a5c41 100644 --- a/06_ensemble_learning_and_random_forests.ipynb +++ b/06_ensemble_learning_and_random_forests.ipynb @@ -30,7 +30,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [] + }, "source": [ "# Setup" ] @@ -39,7 +41,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let's import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures." + "This project requires Python 3.8 or above:" ] }, { @@ -48,30 +50,64 @@ "metadata": {}, "outputs": [], "source": [ - "# Python ≥3.8 is required\n", "import sys\n", - "assert sys.version_info >= (3, 8)\n", "\n", - "# Scikit-Learn ≥1.0 is required\n", + "assert sys.version_info >= (3, 8)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It also requires Scikit-Learn ≥ 1.0.1:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ "import sklearn\n", - "assert sklearn.__version__ >= \"1.0\"\n", "\n", - "# Common imports\n", - "import numpy as np\n", + "assert sklearn.__version__ >= \"1.0.1\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we did in previous chapters, let's define the default font sizes to make the figures prettier:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib as mpl\n", + "\n", + "mpl.rc('font', size=12)\n", + "mpl.rc('axes', labelsize=14, titlesize=14)\n", + "mpl.rc('legend', fontsize=14)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And let's create the `images/ensembles` folder (if it doesn't already exist), and define the `save_fig()` function which is used through this notebook to save the figures in high-res for the book:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ "from pathlib import Path\n", "\n", - "# to make this notebook's output stable across runs\n", - "np.random.seed(42)\n", - "\n", - "# To plot pretty figures\n", - "%matplotlib inline\n", - "import matplotlib as mpl\n", - "import matplotlib.pyplot as plt\n", - "mpl.rc('axes', labelsize=14)\n", - "mpl.rc('xtick', labelsize=12)\n", - "mpl.rc('ytick', labelsize=12)\n", - "\n", - "# Where to save the figures\n", "IMAGES_PATH = Path() / \"images\" / \"ensembles\"\n", "IMAGES_PATH.mkdir(parents=True, exist_ok=True)\n", "\n", @@ -89,74 +125,11 @@ "# Voting Classifiers" ] }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "heads_proba = 0.51\n", - "coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)\n", - "cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–3. 
The law of large numbers:**" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(8,3.5))\n", - "plt.plot(cumulative_heads_ratio)\n", - "plt.plot([0, 10000], [0.51, 0.51], \"k--\", linewidth=2, label=\"51%\")\n", - "plt.plot([0, 10000], [0.5, 0.5], \"k-\", label=\"50%\")\n", - "plt.xlabel(\"Number of coin tosses\")\n", - "plt.ylabel(\"Heads ratio\")\n", - "plt.legend(loc=\"lower right\")\n", - "plt.axis([0, 10000, 0.42, 0.58])\n", - "save_fig(\"law_of_large_numbers_plot\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's use the moons dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "from sklearn.datasets import make_moons\n", - "\n", - "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: to be future-proof, we set `solver=\"lbfgs\"`, `n_estimators=100`, and `gamma=\"scale\"` since these will be the default values in upcoming Scikit-Learn versions." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Code examples:**" + "**Code to generate Figure 6–3. The law of large numbers:**" ] }, { @@ -165,18 +138,35 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.ensemble import RandomForestClassifier\n", - "from sklearn.ensemble import VotingClassifier\n", - "from sklearn.linear_model import LogisticRegression\n", - "from sklearn.svm import SVC\n", + "# not in the book\n", "\n", - "log_clf = LogisticRegression(solver=\"lbfgs\", random_state=42)\n", - "rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)\n", - "svm_clf = SVC(gamma=\"scale\", random_state=42)\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", "\n", - "voting_clf = VotingClassifier(\n", - " estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n", - " voting='hard')" + "heads_proba = 0.51\n", + "np.random.seed(42)\n", + "coin_tosses = (np.random.rand(10000, 10) < heads_proba).astype(np.int32)\n", + "cumulative_heads = coin_tosses.cumsum(axis=0)\n", + "cumulative_heads_ratio = cumulative_heads / np.arange(1, 10001).reshape(-1, 1)\n", + "\n", + "plt.figure(figsize=(8,3.5))\n", + "plt.plot(cumulative_heads_ratio)\n", + "plt.plot([0, 10000], [0.51, 0.51], \"k--\", linewidth=2, label=\"51%\")\n", + "plt.plot([0, 10000], [0.5, 0.5], \"k-\", label=\"50%\")\n", + "plt.xlabel(\"Number of coin tosses\")\n", + "plt.ylabel(\"Heads ratio\")\n", + "plt.legend(loc=\"lower right\")\n", + "plt.axis([0, 10000, 0.42, 0.58])\n", + "plt.grid()\n", + "save_fig(\"law_of_large_numbers_plot\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's build a voting classifier:" ] }, { @@ -185,6 +175,22 @@ "metadata": {}, "outputs": [], "source": [ + "from sklearn.datasets import make_moons\n", + "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.svm import SVC\n", + "\n", + "X, y = make_moons(n_samples=500, noise=0.30, random_state=42)\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n", + "\n", + "voting_clf = 
VotingClassifier(\n", + " estimators=[\n", + " ('lr', LogisticRegression(random_state=42)),\n", + " ('rf', RandomForestClassifier(random_state=42)),\n", + " ('svc', SVC(random_state=42))\n", + " ]\n", + ")\n", "voting_clf.fit(X_train, y_train)" ] }, @@ -194,26 +200,8 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.metrics import accuracy_score\n", - "\n", - "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n", - " clf.fit(X_train, y_train)\n", - " y_pred = clf.predict(X_test)\n", - " print(clf.__class__.__name__, accuracy_score(y_test, y_pred))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note**: the results in this notebook may differ slightly from the book, as Scikit-Learn algorithms sometimes get tweaked." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Soft voting:" + "for name, clf in voting_clf.named_estimators_.items():\n", + " print(name, \"=\", clf.score(X_test, y_test))" ] }, { @@ -222,14 +210,7 @@ "metadata": {}, "outputs": [], "source": [ - "log_clf = LogisticRegression(solver=\"lbfgs\", random_state=42)\n", - "rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)\n", - "svm_clf = SVC(gamma=\"scale\", probability=True, random_state=42)\n", - "\n", - "voting_clf = VotingClassifier(\n", - " estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n", - " voting='soft')\n", - "voting_clf.fit(X_train, y_train)" + "voting_clf.predict(X_test[:1])" ] }, { @@ -238,12 +219,35 @@ "metadata": {}, "outputs": [], "source": [ - "from sklearn.metrics import accuracy_score\n", - "\n", - "for clf in (log_clf, rnd_clf, svm_clf, voting_clf):\n", - " clf.fit(X_train, y_train)\n", - " y_pred = clf.predict(X_test)\n", - " print(clf.__class__.__name__, accuracy_score(y_test, y_pred))" + "[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "voting_clf.score(X_test, y_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's use soft voting:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "voting_clf.voting = \"soft\"\n", + "voting_clf.named_estimators[\"svc\"].probability = True\n", + "voting_clf.fit(X_train, y_train)\n", + "voting_clf.score(X_test, y_test)" ] }, { @@ -256,47 +260,23 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import BaggingClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", - "bag_clf = BaggingClassifier(\n", - " DecisionTreeClassifier(), n_estimators=500,\n", - " max_samples=100, bootstrap=True, random_state=42)\n", - "bag_clf.fit(X_train, y_train)\n", - "y_pred = bag_clf.predict(X_test)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.metrics import accuracy_score\n", - "print(accuracy_score(y_test, y_pred))" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "tree_clf = DecisionTreeClassifier(random_state=42)\n", - "tree_clf.fit(X_train, y_train)\n", - "y_pred_tree = tree_clf.predict(X_test)\n", - "print(accuracy_score(y_test, y_pred_tree))" + "bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n", + " max_samples=100, random_state=42)\n", + "bag_clf.fit(X_train, y_train)" ] }, { 
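 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "Not in the book: as a quick check before plotting, we can compute the bagging ensemble's accuracy on the test set (this assumes the `bag_clf` fitted in the cell above and the moons `X_test`/`y_test` split):"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "# not in the book: quick sanity check of the bagging ensemble's test accuracy\n",
 "from sklearn.metrics import accuracy_score\n",
 "\n",
 "accuracy_score(y_test, bag_clf.predict(X_test))"
 ]
 },
 {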
"cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–5. A single Decision Tree (left) versus a bagging ensemble of 500 trees (right):**" + "**Code to generate Figure 6–5. A single Decision Tree (left) versus a bagging ensemble of 500 trees (right):**" ] }, { @@ -305,41 +285,43 @@ "metadata": {}, "outputs": [], "source": [ - "from matplotlib.colors import ListedColormap\n", + "# not in the book\n", "\n", - "def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):\n", - " x1s = np.linspace(axes[0], axes[1], 100)\n", - " x2s = np.linspace(axes[2], axes[3], 100)\n", - " x1, x2 = np.meshgrid(x1s, x2s)\n", + "def plot_decision_boundary(clf, X, y, alpha=1.0):\n", + " axes=[-1.5, 2.4, -1, 1.5]\n", + " x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),\n", + " np.linspace(axes[2], axes[3], 100))\n", " X_new = np.c_[x1.ravel(), x2.ravel()]\n", " y_pred = clf.predict(X_new).reshape(x1.shape)\n", - " custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])\n", - " plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)\n", - " if contour:\n", - " custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])\n", - " plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)\n", - " plt.plot(X[:, 0][y==0], X[:, 1][y==0], \"yo\", alpha=alpha)\n", - " plt.plot(X[:, 0][y==1], X[:, 1][y==1], \"bs\", alpha=alpha)\n", + " \n", + " plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')\n", + " plt.contour(x1, x2, y_pred, cmap=\"Greys\", alpha=0.8 * alpha)\n", + " colors = [\"#78785c\", \"#c47b27\"]\n", + " markers = (\"o\", \"^\")\n", + " for idx in (0, 1):\n", + " plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],\n", + " color=colors[idx], marker=markers[idx], linestyle=\"none\")\n", " plt.axis(axes)\n", - " plt.xlabel(r\"$x_1$\", fontsize=18)\n", - " plt.ylabel(r\"$x_2$\", fontsize=18, rotation=0)" + " plt.xlabel(r\"$x_1$\")\n", + " plt.ylabel(r\"$x_2$\", rotation=0)" ] }, { "cell_type": "code", "execution_count": 14, - "metadata": { - "scrolled": true - }, + "metadata": {}, "outputs": [], "source": [ - "fix, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n", + "tree_clf = DecisionTreeClassifier(random_state=42)\n", + "tree_clf.fit(X_train, y_train)\n", + "\n", + "fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)\n", "plt.sca(axes[0])\n", - "plot_decision_boundary(tree_clf, X, y)\n", - "plt.title(\"Decision Tree\", fontsize=14)\n", + "plot_decision_boundary(tree_clf, X_train, y_train)\n", + "plt.title(\"Decision Tree\")\n", "plt.sca(axes[1])\n", - "plot_decision_boundary(bag_clf, X, y)\n", - "plt.title(\"Decision Trees with Bagging\", fontsize=14)\n", + "plot_decision_boundary(bag_clf, X_train, y_train)\n", + "plt.title(\"Decision Trees with Bagging\")\n", "plt.ylabel(\"\")\n", "save_fig(\"decision_tree_without_and_with_bagging_plot\")\n", "plt.show()" @@ -358,9 +340,8 @@ "metadata": {}, "outputs": [], "source": [ - "bag_clf = BaggingClassifier(\n", - " DecisionTreeClassifier(), n_estimators=500,\n", - " bootstrap=True, oob_score=True, random_state=40)\n", + "bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,\n", + " oob_score=True, random_state=42)\n", "bag_clf.fit(X_train, y_train)\n", "bag_clf.oob_score_" ] @@ -371,7 +352,7 @@ "metadata": {}, "outputs": [], "source": [ - "bag_clf.oob_decision_function_" + "bag_clf.oob_decision_function_[:3] # probas for the first 3 instances" ] }, { @@ -383,10 +364,29 @@ "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", 
+ "\n", "y_pred = bag_clf.predict(X_test)\n", "accuracy_score(y_test, y_pred)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you randomly draw one instance from a dataset of size _m_, each instance in the dataset obviously has probability 1/_m_ of getting picked, and therefore it has a probability 1 – 1/_m_ of _not_ getting picked. If you draw _m_ instances with replacement, all draws are independent and therefore each instance has a probability (1 – 1/_m_)_m_ of _not_ getting picked. Now let's use the fact that exp(_x_) is equal to the limit of (1 + _x_/_m_)_m_ as _m_ approaches infinity. So if _m_ is large, the ratio of out-of-bag instances will be about exp(–1) ≈ 0.37. So roughly 63% (1 – 0.37) will be sampled." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# not in the book\n", + "print(1 - (1 - 1 / 1000) ** 1000)\n", + "print(1 - np.exp(-1))" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -396,15 +396,15 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", - "rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)\n", + "rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,\n", + " n_jobs=-1, random_state=42)\n", "rnd_clf.fit(X_train, y_train)\n", - "\n", "y_pred_rf = rnd_clf.predict(X_test)" ] }, @@ -417,23 +417,13 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "bag_clf = BaggingClassifier(\n", " DecisionTreeClassifier(max_features=\"sqrt\", max_leaf_nodes=16),\n", - " n_estimators=500, random_state=42)" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [], - "source": [ - "bag_clf.fit(X_train, y_train)\n", - "y_pred = bag_clf.predict(X_test)" + " n_estimators=500, n_jobs=-1, random_state=42)" ] }, { @@ -442,7 +432,10 @@ "metadata": {}, "outputs": [], "source": [ - "np.sum(y_pred == y_pred_rf) / len(y_pred) # very similar predictions" + "# not in the book\n", + "bag_clf.fit(X_train, y_train)\n", + "y_pred_bag = bag_clf.predict(X_test)\n", + "np.all(y_pred_bag == y_pred_rf) # same predictions" ] }, { @@ -459,11 +452,19 @@ "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", - "iris = load_iris()\n", + "\n", + "iris = load_iris(as_frame=True)\n", "rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)\n", - "rnd_clf.fit(iris[\"data\"], iris[\"target\"])\n", - "for name, score in zip(iris[\"feature_names\"], rnd_clf.feature_importances_):\n", - " print(name, score)" + "rnd_clf.fit(iris.data, iris.target)\n", + "for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):\n", + " print(round(score, 2), name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Code to generate Figure 6–6. MNIST pixel importance (according to a Random Forest classifier):**" ] }, { @@ -472,93 +473,21 @@ "metadata": {}, "outputs": [], "source": [ - "rnd_clf.feature_importances_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following figure overlays the decision boundaries of 15 decision trees. 
As you can see, even though each decision tree is imperfect, the ensemble defines a pretty good decision boundary:" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(6, 4))\n", + "# not in the book\n", "\n", - "for i in range(15):\n", - " tree_clf = DecisionTreeClassifier(max_leaf_nodes=16, random_state=42 + i)\n", - " indices_with_replacement = np.random.randint(0, len(X_train), len(X_train))\n", - " tree_clf.fit(X_train[indices_with_replacement], y_train[indices_with_replacement])\n", - " plot_decision_boundary(tree_clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.02, contour=False)\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Code to generate Figure 7–6. MNIST pixel importance (according to a Random Forest classifier):**" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Warning:** since Scikit-Learn 0.24, `fetch_openml()` returns a Pandas `DataFrame` by default. To avoid this and keep the same code as in the book, we use `as_frame=False`." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [], - "source": [ "from sklearn.datasets import fetch_openml\n", "\n", - "mnist = fetch_openml('mnist_784', version=1, as_frame=False)\n", - "mnist.target = mnist.target.astype(np.uint8)" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ + "X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False)\n", + "\n", "rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)\n", - "rnd_clf.fit(mnist[\"data\"], mnist[\"target\"])" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "def plot_digit(data):\n", - " image = data.reshape(28, 28)\n", - " plt.imshow(image, cmap = mpl.cm.hot,\n", - " interpolation=\"nearest\")\n", - " plt.axis(\"off\")" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [], - "source": [ - "plot_digit(rnd_clf.feature_importances_)\n", - "\n", - "cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])\n", - "cbar.ax.set_yticklabels(['Not important', 'Very important'])\n", + "rnd_clf.fit(X_mnist, y_mnist)\n", "\n", + "heatmap_image = rnd_clf.feature_importances_.reshape(28, 28)\n", + "plt.imshow(heatmap_image, cmap=\"hot\")\n", + "cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),\n", + " rnd_clf.feature_importances_.max()])\n", + "cbar.ax.set_yticklabels(['Not important', 'Very important'], fontsize=14)\n", + "plt.axis(\"off\")\n", "save_fig(\"mnist_feature_importance_plot\")\n", "plt.show()" ] @@ -571,42 +500,21 @@ "## AdaBoost" ] }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.ensemble import AdaBoostClassifier\n", - "\n", - "ada_clf = AdaBoostClassifier(\n", - " DecisionTreeClassifier(max_depth=1), n_estimators=200,\n", - " algorithm=\"SAMME.R\", learning_rate=0.5, random_state=42)\n", - "ada_clf.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [], - "source": [ - "plot_decision_boundary(ada_clf, X, y)" - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–8. Decision boundaries of consecutive predictors:**" + "**Code to generate Figure 6–8. 
Decision boundaries of consecutive predictors:**" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ + "# not in the book\n", + "\n", "m = len(X_train)\n", "\n", "fix, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n", @@ -614,23 +522,24 @@ " sample_weights = np.ones(m) / m\n", " plt.sca(axes[subplot])\n", " for i in range(5):\n", - " svm_clf = SVC(kernel=\"rbf\", C=0.2, gamma=0.6, random_state=42)\n", + " svm_clf = SVC(C=0.2, gamma=0.6, random_state=42)\n", " svm_clf.fit(X_train, y_train, sample_weight=sample_weights * m)\n", " y_pred = svm_clf.predict(X_train)\n", "\n", - " r = sample_weights[y_pred != y_train].sum() / sample_weights.sum() # equation 7-1\n", - " alpha = learning_rate * np.log((1 - r) / r) # equation 7-2\n", - " sample_weights[y_pred != y_train] *= np.exp(alpha) # equation 7-3\n", - " sample_weights /= sample_weights.sum() # normalization step\n", + " error_weights = sample_weights[y_pred != y_train].sum()\n", + " r = error_weights / sample_weights.sum() # equation 7-1\n", + " alpha = learning_rate * np.log((1 - r) / r) # equation 7-2\n", + " sample_weights[y_pred != y_train] *= np.exp(alpha) # equation 7-3\n", + " sample_weights /= sample_weights.sum() # normalization step\n", "\n", - " plot_decision_boundary(svm_clf, X, y, alpha=0.2)\n", - " plt.title(\"learning_rate = {}\".format(learning_rate), fontsize=16)\n", + " plot_decision_boundary(svm_clf, X_train, y_train, alpha=0.4)\n", + " plt.title(\"learning_rate = {}\".format(learning_rate))\n", " if subplot == 0:\n", - " plt.text(-0.75, -0.95, \"1\", fontsize=14)\n", - " plt.text(-1.05, -0.95, \"2\", fontsize=14)\n", - " plt.text(1.0, -0.95, \"3\", fontsize=14)\n", - " plt.text(-1.45, -0.5, \"4\", fontsize=14)\n", - " plt.text(1.36, -0.95, \"5\", fontsize=14)\n", + " plt.text(-0.75, -0.95, \"1\", fontsize=16)\n", + " plt.text(-1.05, -0.95, \"2\", fontsize=16)\n", + " plt.text(1.0, -0.95, \"3\", fontsize=16)\n", + " plt.text(-1.45, -0.5, \"4\", fontsize=16)\n", + " plt.text(1.36, -0.95, \"5\", fontsize=16)\n", " else:\n", " plt.ylabel(\"\")\n", "\n", @@ -638,6 +547,29 @@ "plt.show()" ] }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.ensemble import AdaBoostClassifier\n", + "\n", + "ada_clf = AdaBoostClassifier(\n", + " DecisionTreeClassifier(max_depth=1), n_estimators=30,\n", + " learning_rate=0.5, random_state=42)\n", + "ada_clf.fit(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "plot_decision_boundary(ada_clf, X_train, y_train) # not in the book" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -649,18 +581,24 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let create a simple quadratic dataset:" + "Let's create a simple quadratic dataset:" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ + "import numpy as np\n", + "from sklearn.tree import DecisionTreeRegressor\n", + "\n", "np.random.seed(42)\n", "X = np.random.rand(100, 1) - 0.5\n", - "y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)" + "y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100) # y = 3x² + Gaussian noise\n", + "\n", + "tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)\n", + "tree_reg1.fit(X, y)" ] }, { @@ -672,123 +610,97 @@ }, { "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [], - "source": [ - 
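 "# fit a first regressor on the noisy quadratic training set\n",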
"from sklearn.tree import DecisionTreeRegressor\n", - "\n", - "tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)\n", - "tree_reg1.fit(X, y)" - ] - }, - { - "cell_type": "code", - "execution_count": 34, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "y2 = y - tree_reg1.predict(X)\n", - "tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)\n", + "tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)\n", "tree_reg2.fit(X, y2)" ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "y3 = y2 - tree_reg2.predict(X)\n", - "tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)\n", + "tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)\n", "tree_reg3.fit(X, y3)" ] }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ - "X_new = np.array([[0.8]])" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [], - "source": [ - "y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [], - "source": [ - "y_pred" + "X_new = np.array([[-0.4], [0.], [0.5]])\n", + "sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–9. In this depiction of Gradient Boosting, the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictor’s residuals; the right column shows the resulting ensemble’s predictions:**" + "**Code to generate Figure 6–9. 
In this depiction of Gradient Boosting, the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictor’s residuals; the right column shows the resulting ensemble’s predictions:**" ] }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ - "def plot_predictions(regressors, X, y, axes, label=None, style=\"r-\", data_style=\"b.\", data_label=None):\n", + "# not in the book\n", + "\n", + "def plot_predictions(regressors, X, y, axes, style,\n", + " label=None, data_style=\"b.\", data_label=None):\n", " x1 = np.linspace(axes[0], axes[1], 500)\n", - " y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)\n", + " y_pred = sum(regressor.predict(x1.reshape(-1, 1))\n", + " for regressor in regressors)\n", " plt.plot(X[:, 0], y, data_style, label=data_label)\n", " plt.plot(x1, y_pred, style, linewidth=2, label=label)\n", " if label or data_label:\n", - " plt.legend(loc=\"upper center\", fontsize=16)\n", - " plt.axis(axes)" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ + " plt.legend(loc=\"upper center\")\n", + " plt.axis(axes)\n", + "\n", "plt.figure(figsize=(11,11))\n", "\n", - "plt.subplot(321)\n", - "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label=\"$h_1(x_1)$\", style=\"g-\", data_label=\"Training set\")\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", - "plt.title(\"Residuals and tree predictions\", fontsize=16)\n", + "plt.subplot(3, 2, 1)\n", + "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style=\"g-\",\n", + " label=\"$h_1(x_1)$\", data_label=\"Training set\")\n", + "plt.ylabel(\"$y$ \", rotation=0)\n", + "plt.title(\"Residuals and tree predictions\")\n", "\n", - "plt.subplot(322)\n", - "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label=\"$h(x_1) = h_1(x_1)$\", data_label=\"Training set\")\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", - "plt.title(\"Ensemble predictions\", fontsize=16)\n", + "plt.subplot(3, 2, 2)\n", + "plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style=\"r-\",\n", + " label=\"$h(x_1) = h_1(x_1)$\", data_label=\"Training set\")\n", + "plt.title(\"Ensemble predictions\")\n", "\n", - "plt.subplot(323)\n", - "plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label=\"$h_2(x_1)$\", style=\"g-\", data_style=\"k+\", data_label=\"Residuals\")\n", - "plt.ylabel(\"$y - h_1(x_1)$\", fontsize=16)\n", + "plt.subplot(3, 2, 3)\n", + "plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.4, 0.6], style=\"g-\",\n", + " label=\"$h_2(x_1)$\", data_style=\"k+\",\n", + " data_label=\"Residuals: $y - h_1(x_1)$\")\n", + "plt.ylabel(\"$y$ \", rotation=0)\n", "\n", - "plt.subplot(324)\n", - "plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label=\"$h(x_1) = h_1(x_1) + h_2(x_1)$\")\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", + "plt.subplot(3, 2, 4)\n", + "plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.2, 0.8],\n", + " style=\"r-\", label=\"$h(x_1) = h_1(x_1) + h_2(x_1)$\")\n", "\n", - "plt.subplot(325)\n", - "plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label=\"$h_3(x_1)$\", style=\"g-\", data_style=\"k+\")\n", - "plt.ylabel(\"$y - h_1(x_1) - h_2(x_1)$\", fontsize=16)\n", - "plt.xlabel(\"$x_1$\", fontsize=16)\n", + "plt.subplot(3, 2, 5)\n", + 
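 "# bottom left: the third tree, fit on the residuals left by the first two trees\n",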
"plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.4, 0.6], style=\"g-\",\n", + " label=\"$h_3(x_1)$\", data_style=\"k+\",\n", + " data_label=\"Residuals: $y - h_1(x_1) - h_2(x_1)$\")\n", + "plt.xlabel(\"$x_1$\")\n", + "plt.ylabel(\"$y$ \", rotation=0)\n", "\n", - "plt.subplot(326)\n", - "plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label=\"$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$\")\n", - "plt.xlabel(\"$x_1$\", fontsize=16)\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", + "plt.subplot(3, 2, 6)\n", + "plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y,\n", + " axes=[-0.5, 0.5, -0.2, 0.8], style=\"r-\",\n", + " label=\"$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$\")\n", + "plt.xlabel(\"$x_1$\")\n", "\n", "save_fig(\"gradient_boosting_plot\")\n", "plt.show()" @@ -803,244 +715,174 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", - "gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)\n", + "gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,\n", + " learning_rate=1.0, random_state=42)\n", "gbrt.fit(X, y)" ] }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "gbrt_best = GradientBoostingRegressor(\n", + " max_depth=2, learning_rate=0.05, n_estimators=500,\n", + " n_iter_no_change=10, random_state=42)\n", + "gbrt_best.fit(X, y)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "gbrt_best.n_estimators_" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–10. GBRT ensembles with not enough predictors (left) and too many (right):**" + "**Code to generate Figure 6–10. 
GBRT ensembles with not enough predictors (left) and too many (right):**" ] }, { "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [], - "source": [ - "gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)\n", - "gbrt_slow.fit(X, y)" - ] - }, - { - "cell_type": "code", - "execution_count": 43, + "execution_count": 35, "metadata": {}, "outputs": [], "source": [ + "# not in the book\n", + "\n", "fix, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n", "\n", "plt.sca(axes[0])\n", - "plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label=\"Ensemble predictions\")\n", - "plt.title(\"learning_rate={}, n_estimators={}\".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)\n", - "plt.xlabel(\"$x_1$\", fontsize=16)\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", + "plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style=\"r-\",\n", + " label=\"Ensemble predictions\")\n", + "plt.title(f\"learning_rate={gbrt.learning_rate}, \"\n", + " f\"n_estimators={gbrt.n_estimators_}\")\n", + "plt.xlabel(\"$x_1$\")\n", + "plt.ylabel(\"$y$\", rotation=0)\n", "\n", "plt.sca(axes[1])\n", - "plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])\n", - "plt.title(\"learning_rate={}, n_estimators={}\".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)\n", - "plt.xlabel(\"$x_1$\", fontsize=16)\n", + "plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style=\"r-\")\n", + "plt.title(f\"learning_rate={gbrt_best.learning_rate}, \"\n", + " f\"n_estimators={gbrt_best.n_estimators_}\")\n", + "plt.xlabel(\"$x_1$\")\n", "\n", "save_fig(\"gbrt_learning_rate_plot\")\n", "plt.show()" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Gradient Boosting with Early stopping:**" - ] - }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ - "import numpy as np\n", + "# not in the book (at least, not in this chapter: it's presented in chapter 2)\n", + "\n", + "import tarfile\n", + "import urllib.request\n", + "\n", + "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import mean_squared_error\n", "\n", - "X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)\n", + "def load_housing_data():\n", + " housing_path = Path() / \"datasets\" / \"housing\"\n", + " if not (housing_path / \"housing.csv\").is_file():\n", + " housing_path.mkdir(parents=True, exist_ok=True)\n", + " root = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n", + " url = root + \"datasets/housing/housing.tgz\"\n", + " tgz_path = housing_path / \"housing.tgz\"\n", + " urllib.request.urlretrieve(url, tgz_path)\n", + " with tarfile.open(tgz_path) as housing_tgz:\n", + " housing_tgz.extractall(path=housing_path)\n", + " return pd.read_csv(housing_path / \"housing.csv\")\n", "\n", - "gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)\n", - "gbrt.fit(X_train, y_train)\n", + "housing = load_housing_data()\n", "\n", - "errors = [mean_squared_error(y_val, y_pred)\n", - " for y_pred in gbrt.staged_predict(X_val)]\n", - "bst_n_estimators = np.argmin(errors) + 1\n", + "train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)\n", + "housing_labels = train_set[\"median_house_value\"]\n", + "housing = train_set.drop(\"median_house_value\", axis=1)" + ] + }, + { + "cell_type": "code", + 
"execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.compose import make_column_transformer\n", + "from sklearn.ensemble import HistGradientBoostingRegressor\n", + "from sklearn.preprocessing import OrdinalEncoder \n", "\n", - "gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)\n", - "gbrt_best.fit(X_train, y_train)" + "hgb_reg = make_pipeline(\n", + " make_column_transformer((OrdinalEncoder(), [\"ocean_proximity\"]),\n", + " remainder=\"passthrough\"),\n", + " HistGradientBoostingRegressor(categorical_features=[0], random_state=42)\n", + ")\n", + "hgb_reg.fit(housing, housing_labels)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [], + "source": [ + "# not in the book\n", + "\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "hgb_rmses = -cross_val_score(hgb_reg, housing, housing_labels,\n", + " scoring=\"neg_root_mean_squared_error\", cv=10)\n", + "pd.Series(hgb_rmses).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Code to generate Figure 7–11. Tuning the number of trees using early stopping:**" + "# Stacking" ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 39, "metadata": {}, "outputs": [], "source": [ - "min_error = np.min(errors)" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(10, 4))\n", + "from sklearn.ensemble import StackingClassifier\n", "\n", - "plt.subplot(121)\n", - "plt.plot(errors, \"b.-\")\n", - "plt.plot([bst_n_estimators, bst_n_estimators], [0, min_error], \"k--\")\n", - "plt.plot([0, 120], [min_error, min_error], \"k--\")\n", - "plt.plot(bst_n_estimators, min_error, \"ko\")\n", - "plt.text(bst_n_estimators, min_error*1.2, \"Minimum\", ha=\"center\", fontsize=14)\n", - "plt.axis([0, 120, 0, 0.01])\n", - "plt.xlabel(\"Number of trees\")\n", - "plt.ylabel(\"Error\", fontsize=16)\n", - "plt.title(\"Validation error\", fontsize=14)\n", - "\n", - "plt.subplot(122)\n", - "plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8])\n", - "plt.title(\"Best model (%d trees)\" % bst_n_estimators, fontsize=14)\n", - "plt.ylabel(\"$y$\", fontsize=16, rotation=0)\n", - "plt.xlabel(\"$x_1$\", fontsize=16)\n", - "\n", - "save_fig(\"early_stopping_gbrt_plot\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Early stopping with some patience (interrupts training only after there's no improvement for 5 epochs):" + "stacking_clf = StackingClassifier(\n", + " estimators=[\n", + " ('lr', LogisticRegression(random_state=42)),\n", + " ('rf', RandomForestClassifier(random_state=42)),\n", + " ('svc', SVC(probability=True, random_state=42))\n", + " ],\n", + " final_estimator=RandomForestClassifier(random_state=43),\n", + " cv=5 # number of cross-validation folds\n", + ")\n", + "stacking_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 40, "metadata": {}, "outputs": [], "source": [ - "gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)\n", - "\n", - "min_val_error = float(\"inf\")\n", - "error_going_up = 0\n", - "for n_estimators in range(1, 120):\n", - " gbrt.n_estimators = n_estimators\n", - " gbrt.fit(X_train, y_train)\n", - " y_pred = gbrt.predict(X_val)\n", - " val_error = mean_squared_error(y_val, y_pred)\n", - " if val_error 
< min_val_error:\n", - " min_val_error = val_error\n", - " error_going_up = 0\n", - " else:\n", - " error_going_up += 1\n", - " if error_going_up == 5:\n", - " break # early stopping" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [], - "source": [ - "print(gbrt.n_estimators)" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Minimum validation MSE:\", min_val_error)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Using XGBoost:**" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [], - "source": [ - "try:\n", - " import xgboost\n", - "except ImportError as ex:\n", - " print(\"Error: the xgboost library is not installed.\")\n", - " xgboost = None" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [], - "source": [ - "if xgboost is not None: # not shown in the book\n", - " xgb_reg = xgboost.XGBRegressor(random_state=42)\n", - " xgb_reg.fit(X_train, y_train)\n", - " y_pred = xgb_reg.predict(X_val)\n", - " val_error = mean_squared_error(y_val, y_pred) # Not shown\n", - " print(\"Validation MSE:\", val_error) # Not shown" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [], - "source": [ - "if xgboost is not None: # not shown in the book\n", - " xgb_reg.fit(X_train, y_train,\n", - " eval_set=[(X_val, y_val)], early_stopping_rounds=2)\n", - " y_pred = xgb_reg.predict(X_val)\n", - " val_error = mean_squared_error(y_val, y_pred) # Not shown\n", - " print(\"Validation MSE:\", val_error) # Not shown" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [], - "source": [ - "%timeit xgboost.XGBRegressor().fit(X_train, y_train) if xgboost is not None else None" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [], - "source": [ - "%timeit GradientBoostingRegressor().fit(X_train, y_train)" + "stacking_clf.score(X_test, y_test)" ] }, { @@ -1082,28 +924,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The MNIST dataset was loaded earlier." + "The MNIST dataset was loaded earlier. The dataset is already split into a training set (the first 60,000 instances) and a test set (the last 10,000 instances), and the training set is already shuffled. 
So all we need to do is to take the first 50,000 instances for the new training set, the next 10,000 for the validation set, and the last 10,000 for the test set:" ] }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 41, "metadata": {}, "outputs": [], "source": [ - "from sklearn.model_selection import train_test_split" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [], - "source": [ - "X_train_val, X_test, y_train_val, y_test = train_test_split(\n", - " mnist.data, mnist.target, test_size=10000, random_state=42)\n", - "X_train, X_val, y_train, y_val = train_test_split(\n", - " X_train_val, y_train_val, test_size=10000, random_state=42)" + "X_train, y_train = X_mnist[:50_000], y_mnist[:50_000]\n", + "X_valid, y_valid = X_mnist[50_000:60_000], y_mnist[50_000:60_000]\n", + "X_test, y_test = X_mnist[60_000:], y_mnist[60_000:]" ] }, { @@ -1115,18 +947,18 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 42, "metadata": {}, "outputs": [], "source": [ - "from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.neural_network import MLPClassifier" ] }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 43, "metadata": {}, "outputs": [], "source": [ @@ -1138,7 +970,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 44, "metadata": {}, "outputs": [], "source": [ @@ -1150,11 +982,11 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ - "[estimator.score(X_val, y_val) for estimator in estimators]" + "[estimator.score(X_valid, y_valid) for estimator in estimators]" ] }, { @@ -1168,12 +1000,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Exercise: _Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier._" + "Exercise: _Next, try to combine \\[the classifiers\\] into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier._" ] }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ @@ -1182,7 +1014,7 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 47, "metadata": {}, "outputs": [], "source": [ @@ -1196,7 +1028,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": 48, "metadata": {}, "outputs": [], "source": [ @@ -1205,7 +1037,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": 49, "metadata": {}, "outputs": [], "source": [ @@ -1214,36 +1046,79 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 50, "metadata": {}, "outputs": [], "source": [ - "voting_clf.score(X_val, y_val)" - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [], - "source": [ - "[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]" + "voting_clf.score(X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to `None` using `set_params()` like this:" + "The `VotingClassifier` made a clone of each classifier, and it trained the clones using class indices as the labels, not the original class names. 
Therefore, to evaluate these clones we need to provide class indices as well. To convert the classes to class indices, we can use a `LabelEncoder`:" ] }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 51, "metadata": {}, "outputs": [], "source": [ - "voting_clf.set_params(svm_clf=None)" + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "encoder = LabelEncoder()\n", + "y_valid_encoded = encoder.fit_transform(y_valid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, in the case of MNIST, it's simpler to just convert the class names to integers, since the digits match the class ids:" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "y_valid_encoded = y_valid.astype(np.int64)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's evaluate the classifier clones:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [], + "source": [ + "[estimator.score(X_valid, y_valid_encoded)\n", + " for estimator in voting_clf.estimators_]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to `\"drop\"` using `set_params()` like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [], + "source": [ + "voting_clf.set_params(svm_clf=\"drop\")" ] }, { @@ -1255,7 +1130,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 55, "metadata": {}, "outputs": [], "source": [ @@ -1271,27 +1146,37 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "voting_clf.estimators_" ] }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [], + "source": [ + "voting_clf.named_estimators_" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "So we can either fit the `VotingClassifier` again, or just remove the SVM from the list of trained estimators:" + "So we can either fit the `VotingClassifier` again, or just remove the SVM from the list of trained estimators, both in `estimators_` and `named_estimators_`:" ] }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 58, "metadata": {}, "outputs": [], "source": [ - "del voting_clf.estimators_[2]" + "svm_clf_trained = voting_clf.named_estimators_.pop(\"svm_clf\")\n", + "voting_clf.estimators_.remove(svm_clf_trained)" ] }, { @@ -1303,11 +1188,11 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": 59, "metadata": {}, "outputs": [], "source": [ - "voting_clf.score(X_val, y_val)" + "voting_clf.score(X_valid, y_valid)" ] }, { @@ -1319,7 +1204,7 @@ }, { "cell_type": "code", - "execution_count": 72, + "execution_count": 60, "metadata": {}, "outputs": [], "source": [ @@ -1328,11 +1213,11 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 61, "metadata": {}, "outputs": [], "source": [ - "voting_clf.score(X_val, y_val)" + "voting_clf.score(X_valid, y_valid)" ] }, { @@ -1346,12 +1231,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "_Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?_" + "_Once you have found \\[an ensemble that performs better than the individual predictors\\], try it on the test set. 
How much better does it perform compared to the individual classifiers?_" ] }, { "cell_type": "code", - "execution_count": 74, + "execution_count": 62, "metadata": {}, "outputs": [], "source": [ @@ -1361,18 +1246,19 @@ }, { "cell_type": "code", - "execution_count": 75, + "execution_count": 63, "metadata": {}, "outputs": [], "source": [ - "[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]" + "[estimator.score(X_test, y_test.astype(np.int64))\n", + " for estimator in voting_clf.estimators_]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The voting classifier only very slightly reduced the error rate of the best model in this case." + "The voting classifier reduced the error rate of the best model from about 3% to 2.7%, which means 10% less errors." ] }, { @@ -1391,38 +1277,39 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 64, "metadata": {}, "outputs": [], "source": [ - "X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)\n", + "X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=np.object)\n", "\n", "for index, estimator in enumerate(estimators):\n", - " X_val_predictions[:, index] = estimator.predict(X_val)" + " X_valid_predictions[:, index] = estimator.predict(X_valid)" ] }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 65, "metadata": {}, "outputs": [], "source": [ - "X_val_predictions" + "X_valid_predictions" ] }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 66, "metadata": {}, "outputs": [], "source": [ - "rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)\n", - "rnd_forest_blender.fit(X_val_predictions, y_val)" + "rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True,\n", + " random_state=42)\n", + "rnd_forest_blender.fit(X_valid_predictions, y_valid)" ] }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 67, "metadata": {}, "outputs": [], "source": [ @@ -1445,11 +1332,11 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 68, "metadata": {}, "outputs": [], "source": [ - "X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)\n", + "X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.object)\n", "\n", "for index, estimator in enumerate(estimators):\n", " X_test_predictions[:, index] = estimator.predict(X_test)" @@ -1457,7 +1344,7 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 69, "metadata": {}, "outputs": [], "source": [ @@ -1466,16 +1353,7 @@ }, { "cell_type": "code", - "execution_count": 82, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.metrics import accuracy_score" - ] - }, - { - "cell_type": "code", - "execution_count": 83, + "execution_count": 70, "metadata": {}, "outputs": [], "source": [ @@ -1486,7 +1364,81 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This stacking ensemble does not perform as well as the voting classifier we trained earlier, it's not quite as good as the best individual classifier." + "This stacking ensemble does not perform as well as the voting classifier we trained earlier, and it's even very slightly worse than the best individual classifier." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exercise: _Now try again using a `StackingClassifier` instead: do you get better performance? 
If so, why?_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since `StackingClassifier` uses K-Fold cross-validation, we don't need a separate validation set, so let's join the training set and the validation set into a bigger training set:" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "X_train_full, y_train_full = X_mnist[:60_000], y_mnist[:60_000]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's create and train the stacking classifier on the full training set:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Warning**: the following cell will take quite a while to run (15-30 minutes depending on your hardware), as it uses K-Fold validation with 5 folds by default. It will train the 4 classifiers 5 times each on 80% of the full training set to make the predictions, plus one last time each on the full training set, and lastly it will train the final model on the predictions. That's a total of 25 models to train!" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [], + "source": [ + "stack_clf = StackingClassifier(named_estimators,\n", + " final_estimator=rnd_forest_blender)\n", + "stack_clf.fit(X_train_full, y_train_full)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [], + "source": [ + "stack_clf.score(X_test, y_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `StackingClassifier` significantly outperforms the custom stacking implementation we tried earlier! This is for mainly two reasons:\n", + "\n", + "* Since we could reclaim the validation set, the `StackingClassifier` was trained on a larger dataset.\n", + "* It used `predict_proba()` if available, or else `decision_function()` if available, or else `predict()`. This gave the blender much more nuanced inputs to work with." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And that's all for today, congratulations on finishing the chapter and the exercises!" ] }, { @@ -1499,7 +1451,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" },