Clarify why we are using OrdinalEncoder and OneHotEncoder

2018-05-07 20:17:30 +02:00
commit 77d3d4838d
parent 46f547daeb
1 changed file with 5 additions and 6 deletions
--- a/02_end_to_end_machine_learning_project.ipynb
+++ b/02_end_to_end_machine_learning_project.ipynb
@@ -798,7 +798,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method instead of the `OrdinalEncoder` class (available since Scikit-Learn 0.20). It is preferable to use the `OrdinalEncoder` class, since it is designed for input features (instead of labels) and it plays well with pipelines, as we will see later in this notebook. Similarly, earlier version of the book used the `LabelBinarizer` class or the `CategoricalEncoder` class for one-hot encoding (which we will look at shortly), but since Scikit-Learn 0.20 it is preferable to use the `OneHotEncoder` class. If you are using an older version of Scikit-Learn, please consider upgrading (in case you want to stick to an old version of Scikit-Learn, the new `OrdinalEncoder` and `OneHotEncoder` classes are provided in the `future_encoders.py` file)."
+    "**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. The `OrdinalEncoder` class that is planned to be introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines, as we will see later in this notebook. For now, we will import it from `future_encoders.py`, but when it is available you can change `future_encoders` to `sklearn.preprocessing`."
   ]
  },
  {
@@ -807,10 +807,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "try:\n",
-    "    from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder\n",
-    "except ImportError:\n",
-    "    from future_encoders import OrdinalEncoder, OneHotEncoder"
+    "from future_encoders import OrdinalEncoder"
   ]
  },
  {
@@ -837,7 +834,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can convert each categorical value to a one-hot vector using a `OneHotEncoder`. Prior to Scikit-Learn 0.20, this class could only handle integer categorical inputs. Now it can also handle string categorical inputs:"
+    "We can convert each categorical value to a one-hot vector using a `OneHotEncoder`. Right now this class can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will handle string categorical inputs. So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:"
   ]
  },
  {
@@ -846,6 +843,8 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "from future_encoders import OneHotEncoder\n",
+    "\n",
    "cat_encoder = OneHotEncoder()\n",
    "housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n",
    "housing_cat_1hot"
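
Note on the two encodings this commit distinguishes: the sketch below illustrates, in plain Python, what "encode string categorical attributes as integers" and "convert each categorical value to a one-hot vector" mean. It is not the Scikit-Learn or `future_encoders.py` implementation. The helper names `ordinal_encode` and `one_hot_encode` and the sample category strings are hypothetical, and sorting the categories is an assumption made only to keep the output deterministic.

```python
# Illustrative only: hypothetical helpers, not the OrdinalEncoder/OneHotEncoder API.

def ordinal_encode(values):
    """Map each distinct string to an integer index (categories sorted by assumption)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, [index[v] for v in values]


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with a single 1 at its category index."""
    categories, codes = ordinal_encode(values)
    rows = [[1 if j == code else 0 for j in range(len(categories))]
            for code in codes]
    return categories, rows


# Hypothetical sample of a string categorical column.
sample = ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]

categories, codes = ordinal_encode(sample)
_, one_hot = one_hot_encode(sample)

print(categories)  # ['INLAND', 'ISLAND', 'NEAR BAY']
print(codes)       # [2, 0, 2, 1]
print(one_hot)     # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

The integer codes imply an ordering between categories, while the one-hot rows do not, which is why the notebook uses a separate encoder for each purpose.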