Do not use LabelEncoder and LabelBinarizer, use factorize() and CategoricalEncoder instead.
parent 236cb24e0b
commit 7629334e9b

@@ -782,18 +782,28 @@
"housing_tr.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's preprocess the categorical input feature, `ocean_proximity`:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"encoder = LabelEncoder()\n",
"housing_cat = housing[\"ocean_proximity\"]\n",
"housing_cat_encoded = encoder.fit_transform(housing_cat)\n",
"housing_cat_encoded"
"housing_cat.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use Pandas' `factorize()` method to convert this string categorical feature to an integer categorical feature, which will be easier for Machine Learning algorithms to handle:"
]
},
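A quick illustration of what `factorize()` returns, on a toy Series (a minimal sketch, not part of the commit; the values are made up):

```python
import pandas as pd

# factorize() maps each distinct value to an integer code, numbered in
# order of first appearance, and also returns the Index of categories.
cat = pd.Series(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"])
codes, categories = cat.factorize()
codes       # array([0, 1, 0, 2])
categories  # Index(['INLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype='object')
```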
{
@@ -802,7 +812,8 @@
"metadata": {},
"outputs": [],
"source": [
"print(encoder.classes_)"
"housing_cat_encoded, housing_categories = housing_cat.factorize()\n",
"housing_cat_encoded[:10]"
]
},
{
@@ -810,6 +821,29 @@
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"housing_categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelEncoder` class instead of Pandas' `factorize()` method. This was incorrect: indeed, as its name suggests, the `LabelEncoder` class was designed for labels, not for input features. The code worked because we were handling a single categorical input feature, but it would break if you passed multiple categorical input features."
]
},
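To see concretely why `LabelEncoder` breaks on multiple input features (a hedged sketch, not part of the commit): its `fit_transform()` accepts a single 1D array of labels `y`, so a 2D array of several categorical columns is rejected:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit_transform(["INLAND", "NEAR BAY", "INLAND"])  # fine: one 1D column
# enc.fit_transform([["INLAND", "a"], ["NEAR BAY", "b"]])
# -> raises ValueError ("bad input shape"): only 1D label arrays are accepted
```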
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert each categorical value to a one-hot vector using a `OneHotEncoder`:"
]
},
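The body of the next cell is cut off by the diff. Since `OneHotEncoder` at this time only handled integer categories, the cell presumably feeds it the integer codes from `factorize()`, reshaped to a 2D column (a sketch under that assumption):

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# fit_transform() expects a 2D array, hence the reshape; the result is a
# SciPy sparse matrix that stores only the nonzero entries.
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
```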
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
@@ -819,12 +853,10 @@
]
},
{
"cell_type": "code",
"execution_count": 60,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"housing_cat_1hot.toarray()"
"The `OneHotEncoder` returns a sparse array by default, but we can convert it to a dense array if needed:"
]
},
{
@@ -833,11 +865,14 @@
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import LabelBinarizer\n",
"\n",
"encoder = LabelBinarizer()\n",
"housing_cat_1hot = encoder.fit_transform(housing_cat)\n",
"housing_cat_1hot"
"housing_cat_1hot.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelBinarizer` class at this point. Again, this was incorrect: just like the `LabelEncoder` class, the `LabelBinarizer` class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming `CategoricalEncoder` class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from [Pull Request #9151](https://github.com/scikit-learn/scikit-learn/pull/9151))."
]
},
{
@@ -847,6 +882,273 @@
"collapsed": true
},
"outputs": [],
"source": [
"# Definition of the CategoricalEncoder class, copied from PR #9151.\n",
"# Just run this cell, or copy it to your code, do not try to understand it (yet).\n",
"\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.utils import check_array\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from scipy import sparse\n",
"\n",
"class CategoricalEncoder(BaseEstimator, TransformerMixin):\n",
"    \"\"\"Encode categorical features as a numeric array.\n",
"    The input to this transformer should be a matrix of integers or strings,\n",
"    denoting the values taken on by categorical (discrete) features.\n",
"    The features can be encoded using a one-hot aka one-of-K scheme\n",
"    (``encoding='onehot'``, the default) or converted to ordinal integers\n",
"    (``encoding='ordinal'``).\n",
"    This encoding is needed for feeding categorical data to many scikit-learn\n",
"    estimators, notably linear models and SVMs with the standard kernels.\n",
"    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.\n",
"    Parameters\n",
"    ----------\n",
"    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'\n",
"        The type of encoding to use (default is 'onehot'):\n",
"        - 'onehot': encode the features using a one-hot aka one-of-K scheme\n",
"          (or also called 'dummy' encoding). This creates a binary column for\n",
"          each category and returns a sparse matrix.\n",
"        - 'onehot-dense': the same as 'onehot' but returns a dense array\n",
"          instead of a sparse matrix.\n",
"        - 'ordinal': encode the features as ordinal integers. This results in\n",
"          a single column of integers (0 to n_categories - 1) per feature.\n",
"    categories : 'auto' or a list of lists/arrays of values.\n",
"        Categories (unique values) per feature:\n",
"        - 'auto' : Determine categories automatically from the training data.\n",
"        - list : ``categories[i]`` holds the categories expected in the ith\n",
"          column. The passed categories are sorted before encoding the data\n",
"          (used categories can be found in the ``categories_`` attribute).\n",
"    dtype : number type, default np.float64\n",
"        Desired dtype of output.\n",
"    handle_unknown : 'error' (default) or 'ignore'\n",
"        Whether to raise an error or ignore if an unknown categorical feature is\n",
"        present during transform (default is to raise). When this parameter\n",
"        is set to 'ignore' and an unknown category is encountered during\n",
"        transform, the resulting one-hot encoded columns for this feature\n",
"        will be all zeros.\n",
"        Ignoring unknown categories is not supported for\n",
"        ``encoding='ordinal'``.\n",
"    Attributes\n",
"    ----------\n",
"    categories_ : list of arrays\n",
"        The categories of each feature determined during fitting. When\n",
"        categories were specified manually, this holds the sorted categories\n",
"        (in order corresponding with output of `transform`).\n",
"    Examples\n",
"    --------\n",
"    Given a dataset with three features and two samples, we let the encoder\n",
"    find the maximum value per feature and transform the data to a binary\n",
"    one-hot encoding.\n",
"    >>> from sklearn.preprocessing import CategoricalEncoder\n",
"    >>> enc = CategoricalEncoder(handle_unknown='ignore')\n",
"    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])\n",
"    ... # doctest: +ELLIPSIS\n",
"    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,\n",
"              encoding='onehot', handle_unknown='ignore')\n",
"    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()\n",
"    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],\n",
"           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])\n",
"    See also\n",
"    --------\n",
"    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of\n",
"        integer ordinal features. The ``OneHotEncoder`` assumes that input\n",
"        features take on values in the range ``[0, max(feature)]`` instead of\n",
"        using the unique values.\n",
"    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of\n",
"        dictionary items (also handles string-valued features).\n",
"    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot\n",
"        encoding of dictionary items or strings.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,\n",
"                 handle_unknown='error'):\n",
"        self.encoding = encoding\n",
"        self.categories = categories\n",
"        self.dtype = dtype\n",
"        self.handle_unknown = handle_unknown\n",
"\n",
"    def fit(self, X, y=None):\n",
"        \"\"\"Fit the CategoricalEncoder to X.\n",
"        Parameters\n",
"        ----------\n",
"        X : array-like, shape [n_samples, n_features]\n",
"            The data to determine the categories of each feature.\n",
"        Returns\n",
"        -------\n",
"        self\n",
"        \"\"\"\n",
"\n",
"        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:\n",
"            template = (\"encoding should be either 'onehot', 'onehot-dense' \"\n",
"                        \"or 'ordinal', got %s\")\n",
"            raise ValueError(template % self.encoding)\n",
"\n",
"        if self.handle_unknown not in ['error', 'ignore']:\n",
"            template = (\"handle_unknown should be either 'error' or \"\n",
"                        \"'ignore', got %s\")\n",
"            raise ValueError(template % self.handle_unknown)\n",
"\n",
"        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':\n",
"            raise ValueError(\"handle_unknown='ignore' is not supported for\"\n",
"                             \" encoding='ordinal'\")\n",
"\n",
"        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)\n",
"        n_samples, n_features = X.shape\n",
"\n",
"        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]\n",
"\n",
"        for i in range(n_features):\n",
"            le = self._label_encoders_[i]\n",
"            Xi = X[:, i]\n",
"            if self.categories == 'auto':\n",
"                le.fit(Xi)\n",
"            else:\n",
"                valid_mask = np.in1d(Xi, self.categories[i])\n",
"                if not np.all(valid_mask):\n",
"                    if self.handle_unknown == 'error':\n",
"                        diff = np.unique(Xi[~valid_mask])\n",
"                        msg = (\"Found unknown categories {0} in column {1}\"\n",
"                               \" during fit\").format(diff, i)\n",
"                        raise ValueError(msg)\n",
"                le.classes_ = np.array(np.sort(self.categories[i]))\n",
"\n",
"        self.categories_ = [le.classes_ for le in self._label_encoders_]\n",
"\n",
"        return self\n",
"\n",
"    def transform(self, X):\n",
"        \"\"\"Transform X using one-hot encoding.\n",
"        Parameters\n",
"        ----------\n",
"        X : array-like, shape [n_samples, n_features]\n",
"            The data to encode.\n",
"        Returns\n",
"        -------\n",
"        X_out : sparse matrix or a 2-d array\n",
"            Transformed input.\n",
"        \"\"\"\n",
"        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)\n",
"        n_samples, n_features = X.shape\n",
"        X_int = np.zeros_like(X, dtype=np.int)\n",
"        X_mask = np.ones_like(X, dtype=np.bool)\n",
"\n",
"        for i in range(n_features):\n",
"            valid_mask = np.in1d(X[:, i], self.categories_[i])\n",
"\n",
"            if not np.all(valid_mask):\n",
"                if self.handle_unknown == 'error':\n",
"                    diff = np.unique(X[~valid_mask, i])\n",
"                    msg = (\"Found unknown categories {0} in column {1}\"\n",
"                           \" during transform\").format(diff, i)\n",
"                    raise ValueError(msg)\n",
"                else:\n",
"                    # Set the problematic rows to an acceptable value and\n",
"                    # continue. The rows are marked in `X_mask` and will be\n",
"                    # removed later.\n",
"                    X_mask[:, i] = valid_mask\n",
"                    X[:, i][~valid_mask] = self.categories_[i][0]\n",
"            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])\n",
"\n",
"        if self.encoding == 'ordinal':\n",
"            return X_int.astype(self.dtype, copy=False)\n",
"\n",
"        mask = X_mask.ravel()\n",
"        n_values = [cats.shape[0] for cats in self.categories_]\n",
"        n_values = np.array([0] + n_values)\n",
"        indices = np.cumsum(n_values)\n",
"\n",
"        column_indices = (X_int + indices[:-1]).ravel()[mask]\n",
"        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),\n",
"                                n_features)[mask]\n",
"        data = np.ones(n_samples * n_features)[mask]\n",
"\n",
"        out = sparse.csc_matrix((data, (row_indices, column_indices)),\n",
"                                shape=(n_samples, indices[-1]),\n",
"                                dtype=self.dtype).tocsr()\n",
"        if self.encoding == 'onehot-dense':\n",
"            return out.toarray()\n",
"        else:\n",
"            return out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CategoricalEncoder` expects a 2D array containing one or more categorical input features. We need to reshape `housing_cat` to a 2D array:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"#from sklearn.preprocessing import CategoricalEncoder # in future versions of Scikit-Learn\n",
"\n",
"cat_encoder = CategoricalEncoder()\n",
"housing_cat_reshaped = housing_cat.values.reshape(-1, 1)\n",
"housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)\n",
"housing_cat_1hot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default encoding is one-hot, and it returns a sparse array. You can use `toarray()` to get a dense array:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"housing_cat_1hot.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can specify the encoding to be `\"onehot-dense\"` to get a dense matrix rather than a sparse matrix:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"cat_encoder = CategoricalEncoder(encoding=\"onehot-dense\")\n",
"housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)\n",
"housing_cat_1hot"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"cat_encoder.categories_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a custom transformer to add extra attributes:"
]
},
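The code cell below is cut off by the diff right after its imports. A minimal sketch of such a transformer, assuming the book's `CombinedAttributesAdder` design and its hard-coded column indices:

```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Column indices in the housing values array (assumed from the book's setup).
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```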
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"\n",
@@ -874,7 +1176,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
@@ -882,9 +1184,16 @@
"housing_extra_attribs.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's build a pipeline for preprocessing the numerical attributes:"
]
},
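The numerical pipeline cell is likewise truncated. Assuming the book's usual steps (median imputation with scikit-learn 0.19's `Imputer`, the attribute adder above, then standard scaling), it plausibly reads:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```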
{
"cell_type": "code",
"execution_count": 64,
"execution_count": 69,
"metadata": {
"collapsed": true
},
@@ -904,16 +1213,23 @@
},
{
"cell_type": "code",
"execution_count": 65,
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"housing_num_tr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And a transformer to just select a subset of the Pandas DataFrame columns:"
]
},
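The selector cell is also truncated; the `DataFrameSelector` used by the pipelines below (sketched here from the book's design) simply picks the named columns and returns their values as a NumPy array, so downstream scikit-learn transformers see plain arrays:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Drop the DataFrame wrapper, keep only the selected columns' values.
        return X[self.attribute_names].values
```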
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 71,
"metadata": {
"collapsed": true
},
@@ -936,28 +1252,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Important note**: the `LabelEncoder` and `LabelBinarizer` classes were designed for preprocessing labels, not input features, so their `fit()` and `fit_transform()` methods only accept one parameter `y` instead of two parameters `X` and `y`. The proper way to convert categorical input features to one-hot vectors should be to use the `OneHotEncoder` class, but unfortunately it does not work with string categories, only integer categories (people are working on it, see [Pull Request 7327](https://github.com/scikit-learn/scikit-learn/pull/7327)). In the meantime, one workaround was to use the `LabelBinarizer` class, as shown in the book. Unfortunately, since Scikit-Learn 0.19.0, pipelines now expect each estimator to have a `fit()` or `fit_transform()` method with two parameters `X` and `y`, so the code shown in the book won't work if you are using Scikit-Learn 0.19.0 (and possibly later as well). A temporary workaround (until PR 7327 is finished and you can use a `OneHotEncoder`) is to create a small wrapper class around the `LabelBinarizer` class, to fix its `fit_transform()` method, like this:"
"Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class PipelineFriendlyLabelBinarizer(LabelBinarizer):\n",
"    def fit_transform(self, X, y=None):\n",
"        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": true
},
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"num_attribs = list(housing_num)\n",
@@ -972,13 +1273,13 @@
"\n",
"cat_pipeline = Pipeline([\n",
"        ('selector', DataFrameSelector(cat_attribs)),\n",
"        ('label_binarizer', PipelineFriendlyLabelBinarizer()),\n",
"        ('cat_encoder', CategoricalEncoder(encoding=\"onehot-dense\")),\n",
"    ])"
]
},
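The hunk above ends inside the combined-pipeline cell. Given the `num_attribs` and `cat_pipeline` lines visible in this diff, the two pipelines are presumably joined with a `FeatureUnion` along these lines (names assumed from the book):

```python
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
housing_prepared = full_pipeline.fit_transform(housing)
```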
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 73,
"metadata": {
"collapsed": true
},
@@ -994,7 +1295,7 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
@@ -1004,7 +1305,7 @@
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
@@ -1020,7 +1321,7 @@
},
{
"cell_type": "code",
"execution_count": 72,
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
@@ -1032,7 +1333,7 @@
},
{
"cell_type": "code",
"execution_count": 73,
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
@@ -1053,7 +1354,7 @@
},
{
"cell_type": "code",
"execution_count": 74,
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
@@ -1062,7 +1363,7 @@
},
{
"cell_type": "code",
"execution_count": 75,
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
@@ -1071,7 +1372,7 @@
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
@@ -1085,7 +1386,7 @@
},
{
"cell_type": "code",
"execution_count": 77,
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
@@ -1097,7 +1398,7 @@
},
{
"cell_type": "code",
"execution_count": 78,
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
@@ -1109,7 +1410,7 @@
},
{
"cell_type": "code",
"execution_count": 79,
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
@@ -1128,7 +1429,7 @@
},
{
"cell_type": "code",
"execution_count": 80,
"execution_count": 84,
"metadata": {
"collapsed": true
},
@@ -1143,7 +1444,7 @@
},
{
"cell_type": "code",
"execution_count": 81,
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
@@ -1157,7 +1458,7 @@
},
{
"cell_type": "code",
"execution_count": 82,
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
@@ -1169,7 +1470,7 @@
},
{
"cell_type": "code",
"execution_count": 83,
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
@@ -1181,7 +1482,7 @@
},
{
"cell_type": "code",
"execution_count": 84,
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
@@ -1193,7 +1494,7 @@
},
{
"cell_type": "code",
"execution_count": 85,
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
@@ -1207,7 +1508,7 @@
},
{
"cell_type": "code",
"execution_count": 86,
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
@@ -1217,7 +1518,7 @@
},
{
"cell_type": "code",
"execution_count": 87,
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
@@ -1233,7 +1534,7 @@
},
{
"cell_type": "code",
"execution_count": 88,
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
@@ -1262,7 +1563,7 @@
},
{
"cell_type": "code",
"execution_count": 89,
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
@@ -1271,7 +1572,7 @@
},
{
"cell_type": "code",
"execution_count": 90,
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
@@ -1287,7 +1588,7 @@
},
{
"cell_type": "code",
"execution_count": 91,
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
@@ -1298,7 +1599,7 @@
},
{
"cell_type": "code",
"execution_count": 92,
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
@@ -1307,7 +1608,7 @@
},
{
"cell_type": "code",
"execution_count": 93,
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
@@ -1327,7 +1628,7 @@
},
{
"cell_type": "code",
"execution_count": 94,
"execution_count": 98,
"metadata": {},
"outputs": [],
"source": [
@@ -1338,7 +1639,7 @@
},
{
"cell_type": "code",
"execution_count": 95,
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
@@ -1348,19 +1649,20 @@
},
{
"cell_type": "code",
"execution_count": 96,
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"extra_attribs = [\"rooms_per_hhold\", \"pop_per_hhold\", \"bedrooms_per_room\"]\n",
"cat_one_hot_attribs = list(encoder.classes_)\n",
"cat_encoder = cat_pipeline.named_steps[\"cat_encoder\"]\n",
"cat_one_hot_attribs = list(cat_encoder.categories_[0])\n",
"attributes = num_attribs + extra_attribs + cat_one_hot_attribs\n",
"sorted(zip(feature_importances, attributes), reverse=True)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"execution_count": 101,
"metadata": {
"collapsed": true
},
@@ -1380,7 +1682,7 @@
},
{
"cell_type": "code",
"execution_count": 98,
"execution_count": 102,
"metadata": {},
"outputs": [],
"source": [
@@ -1403,7 +1705,7 @@
},
{
"cell_type": "code",
"execution_count": 99,
"execution_count": 103,
"metadata": {},
"outputs": [],
"source": [
@@ -1425,7 +1727,7 @@
},
{
"cell_type": "code",
"execution_count": 100,
"execution_count": 104,
"metadata": {
"collapsed": true
},
@@ -1436,7 +1738,7 @@
},
{
"cell_type": "code",
"execution_count": 101,
"execution_count": 105,
"metadata": {
"collapsed": true
},
@@ -1457,7 +1759,7 @@
},
{
"cell_type": "code",
"execution_count": 102,
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
@@ -1495,7 +1797,7 @@
},
{
"cell_type": "code",
"execution_count": 103,
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
@@ -1521,7 +1823,7 @@
},
{
"cell_type": "code",
"execution_count": 104,
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
@@ -1539,7 +1841,7 @@
},
{
"cell_type": "code",
"execution_count": 105,
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
@@ -1569,7 +1871,7 @@
},
{
"cell_type": "code",
"execution_count": 106,
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
@@ -1602,7 +1904,7 @@
},
{
"cell_type": "code",
"execution_count": 107,
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
@@ -1620,7 +1922,7 @@
},
{
"cell_type": "code",
"execution_count": 108,
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
@@ -1643,7 +1945,7 @@
},
{
"cell_type": "code",
"execution_count": 109,
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
@@ -1668,7 +1970,7 @@
},
{
"cell_type": "code",
"execution_count": 110,
"execution_count": 114,
"metadata": {},
"outputs": [],
"source": [
@@ -1707,7 +2009,7 @@
},
{
"cell_type": "code",
"execution_count": 111,
"execution_count": 115,
"metadata": {
"collapsed": true
},
@@ -1745,7 +2047,7 @@
},
{
"cell_type": "code",
"execution_count": 112,
"execution_count": 116,
"metadata": {
"collapsed": true
},
@@ -1763,7 +2065,7 @@
},
{
"cell_type": "code",
"execution_count": 113,
"execution_count": 117,
"metadata": {},
"outputs": [],
"source": [
@@ -1773,7 +2075,7 @@
},
{
"cell_type": "code",
"execution_count": 114,
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
@@ -1789,7 +2091,7 @@
},
{
"cell_type": "code",
"execution_count": 115,
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
@@ -1805,7 +2107,7 @@
},
{
"cell_type": "code",
"execution_count": 116,
"execution_count": 120,
"metadata": {
"collapsed": true
},
@@ -1819,7 +2121,7 @@
},
{
"cell_type": "code",
"execution_count": 117,
"execution_count": 121,
"metadata": {
"collapsed": true
},
@@ -1837,7 +2139,7 @@
},
{
"cell_type": "code",
"execution_count": 118,
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
@@ -1853,7 +2155,7 @@
},
{
"cell_type": "code",
"execution_count": 119,
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
@@ -1883,7 +2185,7 @@
},
{
"cell_type": "code",
"execution_count": 120,
"execution_count": 124,
"metadata": {
"collapsed": true
},
@@ -1898,7 +2200,7 @@
},
{
"cell_type": "code",
"execution_count": 121,
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
@@ -1914,7 +2216,7 @@
},
{
"cell_type": "code",
"execution_count": 122,
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
@@ -1948,7 +2250,7 @@
},
{
"cell_type": "code",
"execution_count": 123,
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
@@ -1964,7 +2266,7 @@
},
{
"cell_type": "code",
"execution_count": 124,
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
@@ -1980,7 +2282,7 @@
},
{
"cell_type": "code",
"execution_count": 125,
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [