From 771dccaca4d8c5cd1c41783df1f19a47c124052e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Aur=C3=A9lien=20Geron?=
Date: Mon, 7 May 2018 21:09:08 +0200
Subject: [PATCH] Clarify future encoders in Scikit-Learn 0.20

---
 02_end_to_end_machine_learning_project.ipynb |  4 +-
 03_classification.ipynb                      | 43 +++++++++-----------
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/02_end_to_end_machine_learning_project.ipynb b/02_end_to_end_machine_learning_project.ipynb
index 3545638..b82b519 100644
--- a/02_end_to_end_machine_learning_project.ipynb
+++ b/02_end_to_end_machine_learning_project.ipynb
@@ -798,7 +798,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. The `OrdinalEncoder` class that is planned to be introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines, as we will see later in this notebook. For now, we will import it from `future_encoders.py`, but when it is available you can change `future_encoders` to `sklearn.preprocessing`."
+    "**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. However, the `OrdinalEncoder` class that is planned to be introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines (introduced later in this notebook). For now, we will import it from `future_encoders.py`, but once it is available you can import it directly from `sklearn.preprocessing`."
    ]
   },
   {
@@ -834,7 +834,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can convert each categorical value to a one-hot vector using a `OneHotEncoder`. Right now this class can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will handle string categorical inputs. So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:"
+    "**Warning**: earlier versions of the book used the `LabelBinarizer` or `CategoricalEncoder` classes to convert each categorical value to a one-hot vector. It is now preferable to use the `OneHotEncoder` class. Right now it can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will also handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)). So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:"
    ]
   },
   {
diff --git a/03_classification.ipynb b/03_classification.ipynb
index 6cfb7f6..5a262b5 100644
--- a/03_classification.ipynb
+++ b/03_classification.ipynb
@@ -1513,25 +1513,6 @@
     "The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton."
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The `OneHotEncoder` class will allow us to convert categorical attributes to one-hot vectors. Since Scikit-Learn 0.20, this class can handle string categorical attributes, which is what we need. In case you are using an older version of Scikit-Learn, we get the latest version of this class from `future_encoders.py`."
- ] - }, - { - "cell_type": "code", - "execution_count": 110, - "metadata": {}, - "outputs": [], - "source": [ - "try:\n", - " from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder\n", - "except:\n", - " from future_encoders import OrdinalEncoder, OneHotEncoder" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1541,7 +1522,7 @@ }, { "cell_type": "code", - "execution_count": 111, + "execution_count": 110, "metadata": {}, "outputs": [], "source": [ @@ -1567,7 +1548,7 @@ }, { "cell_type": "code", - "execution_count": 112, + "execution_count": 111, "metadata": {}, "outputs": [], "source": [ @@ -1584,7 +1565,7 @@ }, { "cell_type": "code", - "execution_count": 113, + "execution_count": 112, "metadata": {}, "outputs": [], "source": [ @@ -1600,7 +1581,7 @@ }, { "cell_type": "code", - "execution_count": 114, + "execution_count": 113, "metadata": {}, "outputs": [], "source": [ @@ -1614,6 +1595,22 @@ " return X.fillna(self.most_frequent_)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can convert each categorical value to a one-hot vector using a `OneHotEncoder`. Right now this class can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will also handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)). So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:" + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "metadata": {}, + "outputs": [], + "source": [ + "from future_encoders import OneHotEncoder" + ] + }, { "cell_type": "markdown", "metadata": {},