Set OneHotEncoder's handle_unknown='ignore' to avoid warnings

main
Aurélien Geron 2021-10-11 20:51:34 +13:00
parent 4488c80cf0
commit 1b16a81fe5
1 changed file with 9 additions and 0 deletions

@@ -2291,12 +2291,21 @@
"**Warning**: the following cell may take close to 45 minutes to run, or more depending on your hardware."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** In the code below, I've set the `OneHotEncoder`'s `handle_unknown` hyperparameter to `'ignore'` to avoid warnings during training. Without this, the `OneHotEncoder` would use its default, `handle_unknown='error'`, and raise an error whenever it is asked to transform data containing a category it did not see when it was fitted. With the default, `GridSearchCV` would hit such errors on every fold whose training set is missing a category that appears in that fold's test set. This is likely to happen since there's only one sample in the `'ISLAND'` category, and it may end up in the test set in some of the folds. `GridSearchCV` would then just drop those folds, and it's best to avoid that."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"full_pipeline.named_transformers_[\"cat\"].handle_unknown = 'ignore'\n",
"\n",
"param_grid = [{\n",
"    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n",
"    'feature_selection__k': list(range(1, len(feature_importances) + 1))\n",