Update notebooks 1 to 8 to latest library versions (in particular Scikit-Learn 0.20)

main
Aurélien Geron 2018-12-21 10:18:31 +08:00
parent dc16446c5f
commit b54ee1b608
8 changed files with 694 additions and 586 deletions

View File

@ -64,7 +64,7 @@
"\n",
"# Ignore useless warnings (see SciPy issue #5998)\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", module=\"scipy\", message=\"^internal gelsd\")"
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")"
]
},
{
@ -407,7 +407,7 @@
"source": [
"cyprus_gdp_per_capita = gdp_per_capita.loc[\"Cyprus\"][\"GDP per capita\"]\n",
"print(cyprus_gdp_per_capita)\n",
"cyprus_predicted_life_satisfaction = lin1.predict(cyprus_gdp_per_capita)[0][0]\n",
"cyprus_predicted_life_satisfaction = lin1.predict([[cyprus_gdp_per_capita]])[0][0]\n",
"cyprus_predicted_life_satisfaction"
]
},
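
Note: the change above is needed because Scikit-Learn estimators expect a 2D array of shape (n_samples, n_features), so a bare scalar can no longer be passed to predict(). A minimal sketch of the rule (the model and numbers here are illustrative, not the notebook's actual data):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # X must be 2D: one row per sample, one column per feature
    X = np.array([[10000.0], [30000.0], [50000.0]])   # e.g. GDP per capita
    y = np.array([5.0, 6.0, 7.0])                     # e.g. life satisfaction
    lin_reg = LinearRegression().fit(X, y)

    # Wrap a single value as one sample with one feature:
    lin_reg.predict([[22587.0]])                      # -> array([...])
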
@ -719,7 +719,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
"version": "3.6.6"
},
"nav_menu": {},
"toc": {

View File

@ -661,15 +661,25 @@
"sample_incomplete_rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: Since Scikit-Learn 0.20, the `sklearn.preprocessing.Imputer` class was replaced by the `sklearn.impute.SimpleImputer` class."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import Imputer\n",
"try:\n",
" from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+\n",
"except ImportError:\n",
" from sklearn.preprocessing import Imputer as SimpleImputer\n",
"\n",
"imputer = Imputer(strategy=\"median\")"
"imputer = SimpleImputer(strategy=\"median\")"
]
},
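
Note: a quick sketch of the new class in use, on a made-up DataFrame (not the housing data). SimpleImputer behaves like any transformer: fit() learns the per-column medians, transform() fills the holes:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer   # Scikit-Learn 0.20+

    df = pd.DataFrame({"rooms": [3.0, np.nan, 5.0],
                       "age":   [15.0, 20.0, np.nan]})
    imputer = SimpleImputer(strategy="median")
    filled = imputer.fit_transform(df)   # NaNs replaced by column medians
    imputer.statistics_                  # learned medians: array([ 4. , 17.5])
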
{
@ -798,7 +808,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. However, the `OrdinalEncoder` class that is planned to be introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines (introduced later in this notebook). For now, we will import it from `future_encoders.py`, but once it is available you can import it directly from `sklearn.preprocessing`."
"**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. However, the `OrdinalEncoder` class that was introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines (introduced later in this notebook). If you are using an older version of Scikit-Learn (<0.20), then you can import it from `future_encoders.py` instead."
]
},
{
@ -807,7 +817,10 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import OrdinalEncoder"
"try:\n",
" from sklearn.preprocessing import OrdinalEncoder\n",
"except ImportError:\n",
" from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20"
]
},
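
Note: a minimal sketch of what OrdinalEncoder does once imported (toy categories standing in for the housing data): categories are sorted, then mapped to integer codes:

    from sklearn.preprocessing import OrdinalEncoder  # or future_encoders.py if < 0.20

    ordinal_encoder = OrdinalEncoder()
    toy_cat = [["INLAND"], ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"]]
    ordinal_encoder.fit_transform(toy_cat).ravel()    # array([1., 2., 1., 0.])
    ordinal_encoder.categories_   # [array(['<1H OCEAN', 'INLAND', 'NEAR BAY'], dtype=object)]
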
{
@ -834,7 +847,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelBinarizer` or `CategoricalEncoder` classes to convert each categorical value to a one-hot vector. It is now preferable to use the `OneHotEncoder` class. Right now it can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will also handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)). So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:"
"**Warning**: earlier versions of the book used the `LabelBinarizer` or `CategoricalEncoder` classes to convert each categorical value to a one-hot vector. It is now preferable to use the `OneHotEncoder` class. Since Scikit-Learn 0.20 it can handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)), not just integer categorical inputs. If you are using an older version of Scikit-Learn, you can import the new version from `future_encoders.py`:"
]
},
{
@ -843,7 +856,11 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import OneHotEncoder\n",
"try:\n",
" from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20\n",
" from sklearn.preprocessing import OneHotEncoder\n",
"except ImportError:\n",
" from future_encoders import OneHotEncoder # Scikit-Learn < 0.20\n",
"\n",
"cat_encoder = OneHotEncoder()\n",
"housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n",
@ -959,7 +976,7 @@
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"num_pipeline = Pipeline([\n",
" ('imputer', Imputer(strategy=\"median\")),\n",
" ('imputer', SimpleImputer(strategy=\"median\")),\n",
" ('attribs_adder', CombinedAttributesAdder()),\n",
" ('std_scaler', StandardScaler()),\n",
" ])\n",
@ -980,7 +997,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book applied different transformations to different columns using a solution based on a `DataFrameSelector` transformer and a `FeatureUnion` (see below). It is now preferable to use the `ColumnTransformer` class that will be introduced in Scikit-Learn 0.20. For now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.compose` instead:"
"**Warning**: earlier versions of the book applied different transformations to different columns using a solution based on a `DataFrameSelector` transformer and a `FeatureUnion` (see below). It is now preferable to use the `ColumnTransformer` class that was introduced in Scikit-Learn 0.20. If you are using an older version of Scikit-Learn, you can import it from `future_encoders.py`:"
]
},
{
@ -989,7 +1006,10 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import ColumnTransformer"
"try:\n",
" from sklearn.compose import ColumnTransformer\n",
"except ImportError:\n",
" from future_encoders import ColumnTransformer # Scikit-Learn < 0.20"
]
},
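
Note: a minimal sketch of how ColumnTransformer ties the numeric and categorical pipelines together (the column names and values below are made up to stand in for the housing data):

    import pandas as pd
    from sklearn.compose import ColumnTransformer     # Scikit-Learn 0.20+
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({"median_income":   [3.1, 5.2, 2.7],
                       "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

    num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median")),
                             ("std_scaler", StandardScaler())])

    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, ["median_income"]),        # numeric columns
        ("cat", OneHotEncoder(), ["ocean_proximity"]),   # categorical columns
    ])
    prepared = full_pipeline.fit_transform(df)   # 1 scaled + 2 one-hot columns
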
{
@ -1070,7 +1090,7 @@
"\n",
"old_num_pipeline = Pipeline([\n",
" ('selector', OldDataFrameSelector(num_attribs)),\n",
" ('imputer', Imputer(strategy=\"median\")),\n",
" ('imputer', SimpleImputer(strategy=\"median\")),\n",
" ('attribs_adder', CombinedAttributesAdder()),\n",
" ('std_scaler', StandardScaler()),\n",
" ])\n",
@ -1275,6 +1295,13 @@
"display_scores(lin_rmse_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: we specify `n_estimators=10` to avoid a warning about the fact that the default value is going to change to 100 in Scikit-Learn 0.22."
]
},
{
"cell_type": "code",
"execution_count": 91,
@ -1283,7 +1310,7 @@
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)\n",
"forest_reg.fit(housing_prepared, housing_labels)"
]
},
@ -2114,10 +2141,10 @@
"metadata": {},
"outputs": [],
"source": [
"param_grid = [\n",
" {'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n",
" 'feature_selection__k': list(range(1, len(feature_importances) + 1))}\n",
"]\n",
"param_grid = [{\n",
" 'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n",
" 'feature_selection__k': list(range(1, len(feature_importances) + 1))\n",
"}]\n",
"\n",
"grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,\n",
" scoring='neg_mean_squared_error', verbose=2, n_jobs=4)\n",
@ -2164,7 +2191,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.6"
},
"nav_menu": {
"height": "279px",

File diff suppressed because it is too large

View File

@ -65,7 +65,7 @@
"\n",
"# Ignore useless warnings (see SciPy issue #5998)\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", module=\"scipy\", message=\"^internal gelsd\")"
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")"
]
},
{
@ -384,7 +384,7 @@
"outputs": [],
"source": [
"from sklearn.linear_model import SGDRegressor\n",
"sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1, random_state=42)\n",
"sgd_reg = SGDRegressor(max_iter=50, tol=-np.infty, penalty=None, eta0=0.1, random_state=42)\n",
"sgd_reg.fit(X, y.ravel())"
]
},
@ -727,7 +727,7 @@
"metadata": {},
"outputs": [],
"source": [
"sgd_reg = SGDRegressor(max_iter=5, penalty=\"l2\", random_state=42)\n",
"sgd_reg = SGDRegressor(max_iter=50, tol=-np.infty, penalty=\"l2\", random_state=42)\n",
"sgd_reg.fit(X, y.ravel())\n",
"sgd_reg.predict([[1.5]])"
]
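
Note: my reading of the tol=-np.infty additions (the commit message does not spell this out): Scikit-Learn 0.20 gave SGDRegressor a tol-based stopping criterion, and a tolerance of minus infinity can never be satisfied, so every run performs all max_iter epochs, matching the pre-0.20 behavior assumed by the book. A sketch under that assumption:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(42)
    X = 2 * rng.rand(100, 1)
    y = 4 + 3 * X.ravel() + rng.randn(100)

    early = SGDRegressor(max_iter=50, tol=1e-3, random_state=42).fit(X, y)
    full = SGDRegressor(max_iter=50, tol=-np.infty, random_state=42).fit(X, y)
    early.n_iter_, full.n_iter_   # early stopping may halt before 50; full runs all 50
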
@ -810,6 +810,7 @@
"X_val_poly_scaled = poly_scaler.transform(X_val)\n",
"\n",
"sgd_reg = SGDRegressor(max_iter=1,\n",
" tol=-np.infty,\n",
" penalty=None,\n",
" eta0=0.0005,\n",
" warm_start=True,\n",
@ -854,7 +855,7 @@
"outputs": [],
"source": [
"from sklearn.base import clone\n",
"sgd_reg = SGDRegressor(max_iter=1, warm_start=True, penalty=None,\n",
"sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True, penalty=None,\n",
" learning_rate=\"constant\", eta0=0.0005, random_state=42)\n",
"\n",
"minimum_val_error = float(\"inf\")\n",
@ -1043,7 +1044,7 @@
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"log_reg = LogisticRegression(random_state=42)\n",
"log_reg = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
"log_reg.fit(X, y)"
]
},
@ -1123,7 +1124,7 @@
"X = iris[\"data\"][:, (2, 3)] # petal length, petal width\n",
"y = (iris[\"target\"] == 2).astype(np.int)\n",
"\n",
"log_reg = LogisticRegression(C=10**10, random_state=42)\n",
"log_reg = LogisticRegression(solver=\"liblinear\", C=10**10, random_state=42)\n",
"log_reg.fit(X, y)\n",
"\n",
"x0, x1 = np.meshgrid(\n",
@ -1742,7 +1743,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.6"
},
"nav_menu": {},
"toc": {

View File

@ -774,6 +774,13 @@
"y = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(m, 1)/10).ravel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the default value of `gamma` will change from `'auto'` to `'scale'` in version 0.22 to better account for unscaled features. To preserve the same results as in the book, we explicitly set it to `'auto'`, but you should probably just use the default in your own code."
]
},
{
"cell_type": "code",
"execution_count": 27,
@ -782,7 +789,7 @@
"source": [
"from sklearn.svm import SVR\n",
"\n",
"svm_poly_reg = SVR(kernel=\"poly\", degree=2, C=100, epsilon=0.1)\n",
"svm_poly_reg = SVR(kernel=\"poly\", degree=2, C=100, epsilon=0.1, gamma=\"auto\")\n",
"svm_poly_reg.fit(X, y)"
]
},
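
Note: for reference, the two gamma settings correspond to the following formulas (per the Scikit-Learn documentation; the array below is a made-up stand-in for the training inputs):

    import numpy as np

    X = 2 * np.random.RandomState(42).rand(50, 1) - 1

    n_features = X.shape[1]
    gamma_auto = 1.0 / n_features                  # 'auto': the old default, used here
    gamma_scale = 1.0 / (n_features * X.var())     # 'scale': the default from 0.22 on
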
@ -794,8 +801,8 @@
"source": [
"from sklearn.svm import SVR\n",
"\n",
"svm_poly_reg1 = SVR(kernel=\"poly\", degree=2, C=100, epsilon=0.1)\n",
"svm_poly_reg2 = SVR(kernel=\"poly\", degree=2, C=0.01, epsilon=0.1)\n",
"svm_poly_reg1 = SVR(kernel=\"poly\", degree=2, C=100, epsilon=0.1, gamma=\"auto\")\n",
"svm_poly_reg2 = SVR(kernel=\"poly\", degree=2, C=0.01, epsilon=0.1, gamma=\"auto\")\n",
"svm_poly_reg1.fit(X, y)\n",
"svm_poly_reg2.fit(X, y)"
]
@ -876,7 +883,7 @@
"ax1 = fig.add_subplot(111, projection='3d')\n",
"plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])\n",
"\n",
"save_fig(\"iris_3D_plot\")\n",
"#save_fig(\"iris_3D_plot\")\n",
"plt.show()"
]
},
@ -1165,7 +1172,7 @@
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"sgd_clf = SGDClassifier(loss=\"hinge\", alpha = 0.017, max_iter = 50, random_state=42)\n",
"sgd_clf = SGDClassifier(loss=\"hinge\", alpha = 0.017, max_iter = 50, tol=-np.infty, random_state=42)\n",
"sgd_clf.fit(X, y.ravel())\n",
"\n",
"m = len(X)\n",
@ -1265,7 +1272,7 @@
"lin_clf = LinearSVC(loss=\"hinge\", C=C, random_state=42)\n",
"svm_clf = SVC(kernel=\"linear\", C=C)\n",
"sgd_clf = SGDClassifier(loss=\"hinge\", learning_rate=\"constant\", eta0=0.001, alpha=alpha,\n",
" max_iter=100000, random_state=42)\n",
" max_iter=100000, tol=-np.infty, random_state=42)\n",
"\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
@ -1354,9 +1361,13 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_mldata\n",
"try:\n",
" from sklearn.datasets import fetch_openml\n",
" mnist = fetch_openml('mnist_784', version=1, cache=True)\n",
"except ImportError:\n",
" from sklearn.datasets import fetch_mldata\n",
" mnist = fetch_mldata('MNIST original')\n",
"\n",
"mnist = fetch_mldata(\"MNIST original\")\n",
"X = mnist[\"data\"]\n",
"y = mnist[\"target\"]\n",
"\n",
@ -1425,7 +1436,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, 82% accuracy on MNIST is a really bad performance. This linear model is certainly too simple for MNIST, but perhaps we just needed to scale the data first:"
"Wow, 86% accuracy on MNIST is a really bad performance. This linear model is certainly too simple for MNIST, but perhaps we just needed to scale the data first:"
]
},
{
@ -1474,7 +1485,7 @@
"metadata": {},
"outputs": [],
"source": [
"svm_clf = SVC(decision_function_shape=\"ovr\")\n",
"svm_clf = SVC(decision_function_shape=\"ovr\", gamma=\"auto\")\n",
"svm_clf.fit(X_train_scaled[:10000], y_train[:10000])"
]
},
@ -1505,7 +1516,7 @@
"from scipy.stats import reciprocal, uniform\n",
"\n",
"param_distributions = {\"gamma\": reciprocal(0.001, 0.1), \"C\": uniform(1, 10)}\n",
"rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2)\n",
"rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)\n",
"rnd_search_cv.fit(X_train_scaled[:1000], y_train[:1000])"
]
},
@ -1536,7 +1547,7 @@
},
{
"cell_type": "code",
"execution_count": 59,
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
@ -1545,7 +1556,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
@ -1562,7 +1573,7 @@
},
{
"cell_type": "code",
"execution_count": 61,
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
@ -1600,7 +1611,7 @@
},
{
"cell_type": "code",
"execution_count": 62,
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
@ -1620,7 +1631,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
@ -1638,7 +1649,7 @@
},
{
"cell_type": "code",
"execution_count": 64,
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
@ -1658,7 +1669,7 @@
},
{
"cell_type": "code",
"execution_count": 65,
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
@ -1677,7 +1688,7 @@
},
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
@ -1697,7 +1708,7 @@
},
{
"cell_type": "code",
"execution_count": 67,
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
@ -1713,7 +1724,7 @@
},
{
"cell_type": "code",
"execution_count": 68,
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
@ -1722,13 +1733,13 @@
"from scipy.stats import reciprocal, uniform\n",
"\n",
"param_distributions = {\"gamma\": reciprocal(0.001, 0.1), \"C\": uniform(1, 10)}\n",
"rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, verbose=2, random_state=42)\n",
"rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, verbose=2, cv=3, random_state=42)\n",
"rnd_search_cv.fit(X_train_scaled, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
@ -1744,7 +1755,7 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
@ -1762,7 +1773,7 @@
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
@ -1771,6 +1782,26 @@
"np.sqrt(mse)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"cmap = matplotlib.cm.get_cmap(\"jet\")"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_openml\n",
"mnist = fetch_openml(\"mnist_784\", version=1)\n",
"print(mnist.data.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -1795,7 +1826,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
"version": "3.6.6"
},
"nav_menu": {},
"toc": {

View File

@ -531,7 +531,7 @@
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}\n",
"grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, verbose=1)\n",
"grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, verbose=1, cv=3)\n",
"\n",
"grid_search_cv.fit(X_train, y_train)"
]
@ -710,7 +710,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
"version": "3.6.6"
},
"nav_menu": {
"height": "309px",

View File

@ -115,6 +115,13 @@
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: In Scikit-Learn 0.20, some hyperparameters (`solver`, `n_estimators`, `gamma`, etc.) start issuing warnings about the fact that their default value will change in Scikit-Learn 0.22. To avoid these warnings and ensure that this notebooks keeps producing the same outputs as in the book, I set the hyperparameters to their old default value. In your own code, you can simply rely on the latest default values instead."
]
},
{
"cell_type": "code",
"execution_count": 5,
@ -126,9 +133,9 @@
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.svm import SVC\n",
"\n",
"log_clf = LogisticRegression(random_state=42)\n",
"rnd_clf = RandomForestClassifier(random_state=42)\n",
"svm_clf = SVC(random_state=42)\n",
"log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
"rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
"svm_clf = SVC(gamma=\"auto\", random_state=42)\n",
"\n",
"voting_clf = VotingClassifier(\n",
" estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
@ -164,9 +171,9 @@
"metadata": {},
"outputs": [],
"source": [
"log_clf = LogisticRegression(random_state=42)\n",
"rnd_clf = RandomForestClassifier(random_state=42)\n",
"svm_clf = SVC(probability=True, random_state=42)\n",
"log_clf = LogisticRegression(solver=\"liblinear\", random_state=42)\n",
"rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
"svm_clf = SVC(gamma=\"auto\", probability=True, random_state=42)\n",
"\n",
"voting_clf = VotingClassifier(\n",
" estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],\n",
@ -420,8 +427,13 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_mldata\n",
"mnist = fetch_mldata('MNIST original')"
"try:\n",
" from sklearn.datasets import fetch_openml\n",
" mnist = fetch_openml('mnist_784', version=1)\n",
" mnist.target = mnist.target.astype(np.int64)\n",
"except ImportError:\n",
" from sklearn.datasets import fetch_mldata\n",
" mnist = fetch_mldata('MNIST original')"
]
},
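
Note: the astype(np.int64) line above is needed because fetch_openml returns the MNIST labels as strings, whereas fetch_mldata returned integers. A quick sketch of the difference (the printed values are indicative):

    import numpy as np
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml('mnist_784', version=1)
    mnist.target[:3]                               # e.g. array(['5', '0', '4'], dtype=object)
    mnist.target = mnist.target.astype(np.int64)   # cast to match fetch_mldata's output
    mnist.target[:3]                               # array([5, 0, 4])
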
{
@ -430,7 +442,7 @@
"metadata": {},
"outputs": [],
"source": [
"rnd_clf = RandomForestClassifier(random_state=42)\n",
"rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
"rnd_clf.fit(mnist[\"data\"], mnist[\"target\"])"
]
},
@ -505,7 +517,7 @@
" sample_weights = np.ones(m)\n",
" plt.subplot(subplot)\n",
" for i in range(5):\n",
" svm_clf = SVC(kernel=\"rbf\", C=0.05, random_state=42)\n",
" svm_clf = SVC(kernel=\"rbf\", C=0.05, gamma=\"auto\", random_state=42)\n",
" svm_clf.fit(X_train, y_train, sample_weight=sample_weights)\n",
" y_pred = svm_clf.predict(X_train)\n",
" sample_weights[y_pred != y_train] *= (1 + learning_rate)\n",
@ -911,36 +923,25 @@
"Exercise: _Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing)._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MNIST dataset was loaded earlier."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_mldata"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"mnist = fetch_mldata('MNIST original')"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 58,
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
@ -959,7 +960,7 @@
},
{
"cell_type": "code",
"execution_count": 59,
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
@ -970,19 +971,19 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"random_forest_clf = RandomForestClassifier(random_state=42)\n",
"extra_trees_clf = ExtraTreesClassifier(random_state=42)\n",
"random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)\n",
"extra_trees_clf = ExtraTreesClassifier(n_estimators=10, random_state=42)\n",
"svm_clf = LinearSVC(random_state=42)\n",
"mlp_clf = MLPClassifier(random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
@ -994,7 +995,7 @@
},
{
"cell_type": "code",
"execution_count": 62,
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
@ -1017,7 +1018,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
@ -1026,7 +1027,7 @@
},
{
"cell_type": "code",
"execution_count": 64,
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
@ -1040,7 +1041,7 @@
},
{
"cell_type": "code",
"execution_count": 65,
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
@ -1049,7 +1050,7 @@
},
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
@ -1058,7 +1059,7 @@
},
{
"cell_type": "code",
"execution_count": 67,
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
@ -1067,7 +1068,7 @@
},
{
"cell_type": "code",
"execution_count": 68,
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
@ -1083,7 +1084,7 @@
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
@ -1099,16 +1100,7 @@
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"voting_clf.estimators"
]
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
@ -1124,7 +1116,7 @@
},
{
"cell_type": "code",
"execution_count": 72,
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
@ -1140,7 +1132,7 @@
},
{
"cell_type": "code",
"execution_count": 73,
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
@ -1156,7 +1148,7 @@
},
{
"cell_type": "code",
"execution_count": 74,
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
@ -1167,12 +1159,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Much better! The SVM was hurting performance. Now let's try using a soft voting classifier. We do not actually need to retrain the classifier, we can just set `voting` to `\"soft\"`:"
"A bit better! The SVM was hurting performance. Now let's try using a soft voting classifier. We do not actually need to retrain the classifier, we can just set `voting` to `\"soft\"`:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
@ -1181,7 +1173,7 @@
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
@ -1204,7 +1196,7 @@
},
{
"cell_type": "code",
"execution_count": 77,
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
@ -1213,7 +1205,7 @@
},
{
"cell_type": "code",
"execution_count": 78,
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
@ -1224,7 +1216,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The voting classifier reduced the error rate from about 4.9% for our best model (the `MLPClassifier`) to just 3.5%. That's about 28% less errors, not bad!"
"The voting classifier reduced the error rate from about 4.0% for our best model (the `MLPClassifier`) to just 3.1%. That's about 22.5% less errors, not bad!"
]
},
{
@ -1243,7 +1235,7 @@
},
{
"cell_type": "code",
"execution_count": 79,
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
@ -1255,7 +1247,7 @@
},
{
"cell_type": "code",
"execution_count": 80,
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
@ -1264,7 +1256,7 @@
},
{
"cell_type": "code",
"execution_count": 81,
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
@ -1274,7 +1266,7 @@
},
{
"cell_type": "code",
"execution_count": 82,
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
@ -1297,7 +1289,7 @@
},
{
"cell_type": "code",
"execution_count": 83,
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
@ -1309,7 +1301,7 @@
},
{
"cell_type": "code",
"execution_count": 84,
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
@ -1318,7 +1310,7 @@
},
{
"cell_type": "code",
"execution_count": 85,
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
@ -1327,7 +1319,7 @@
},
{
"cell_type": "code",
"execution_count": 86,
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
@ -1338,15 +1330,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This stacking ensemble does not perform as well as the soft voting classifier we trained earlier, but it still beats all the individual classifiers."
"This stacking ensemble does not perform as well as the soft voting classifier we trained earlier, it's just as good as the best individual classifier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -1365,7 +1350,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
"version": "3.6.6"
},
"nav_menu": {
"height": "252px",

File diff suppressed because it is too large