Use 'np.random' rather than 'import numpy.random as rnd', and add random_state to make the notebook's output constant

main
Aurélien Geron 2017-06-06 15:15:20 +02:00
parent 692b674196
commit 3e87055ed9
1 changed file with 103 additions and 36 deletions


@@ -55,11 +55,10 @@
 "\n",
 "# Common imports\n",
 "import numpy as np\n",
-"import numpy.random as rnd\n",
 "import os\n",
 "\n",
 "# to make this notebook's output stable across runs\n",
-"rnd.seed(42)\n",
+"np.random.seed(42)\n",
 "\n",
 "# To plot pretty figures\n",
 "%matplotlib inline\n",
@@ -348,8 +347,8 @@
 },
 "outputs": [],
 "source": [
-"rnd.seed(6)\n",
-"Xs = rnd.rand(100, 2) - 0.5\n",
+"np.random.seed(6)\n",
+"Xs = np.random.rand(100, 2) - 0.5\n",
 "ys = (Xs[:, 0] > 0).astype(np.float32) * 2\n",
 "\n",
 "angle = np.pi / 4\n",
@@ -385,23 +384,27 @@
 "cell_type": "code",
 "execution_count": 13,
 "metadata": {
-"collapsed": true
+"collapsed": true,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
 "# Quadratic training set + noise\n",
-"rnd.seed(42)\n",
+"np.random.seed(42)\n",
 "m = 200\n",
-"X = rnd.rand(m, 1)\n",
+"X = np.random.rand(m, 1)\n",
 "y = 4 * (X - 0.5) ** 2\n",
-"y = y + rnd.randn(m, 1) / 10"
+"y = y + np.random.randn(m, 1) / 10"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": 14,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -545,7 +548,10 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "See appendix A."
 ]
@@ -563,21 +569,30 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "_Exercise: train and fine-tune a Decision Tree for the moons dataset._"
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "a. Generate a moons dataset using `make_moons(n_samples=10000, noise=0.4)`."
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "Adding `random_state=42` to make this notebook's output constant:"
 ]
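The source of the following cell (execution count 18) is cut off by the diff context; given the instructions above it presumably amounts to something like this (the variable names are assumptions):

    from sklearn.datasets import make_moons

    # random_state=42 keeps the generated dataset identical across runs
    X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)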
@@ -586,7 +601,9 @@
 "cell_type": "code",
 "execution_count": 18,
 "metadata": {
-"collapsed": true
+"collapsed": true,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -597,7 +614,10 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "b. Split it into a training set and a test set using `train_test_split()`."
 ]
@@ -606,18 +626,23 @@
 "cell_type": "code",
 "execution_count": 19,
 "metadata": {
-"collapsed": true
+"collapsed": true,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
 "from sklearn.model_selection import train_test_split\n",
 "\n",
-"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
+"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "c. Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`."
 ]
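With 10,000 samples and `test_size=0.2`, the split leaves 8,000 training and 2,000 test instances:

    X_train.shape, X_test.shape  # ((8000, 2), (2000, 2))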
@@ -626,14 +651,16 @@
 "cell_type": "code",
 "execution_count": 20,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
 "from sklearn.model_selection import GridSearchCV\n",
 "\n",
 "params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}\n",
-"grid_search_cv = GridSearchCV(DecisionTreeClassifier(), params, n_jobs=-1, verbose=1)\n",
+"grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, verbose=1)\n",
 "\n",
 "grid_search_cv.fit(X_train, y_train)"
 ]
@@ -642,7 +669,9 @@
 "cell_type": "code",
 "execution_count": 21,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -651,14 +680,20 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy."
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "By default, `GridSearchCV` trains the best model found on the whole training set (you can change this by setting `refit=False`), so we don't need to do it again. We can simply evaluate the model's accuracy:"
 ]
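Cell 22's source is also elided; evaluating the refit model on the test set might be sketched as (variable names assumed):

    from sklearn.metrics import accuracy_score

    y_pred = grid_search_cv.predict(X_test)
    accuracy_score(y_test, y_pred)  # should land in the 85% to 87% range mentioned above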
@@ -667,7 +702,9 @@
 "cell_type": "code",
 "execution_count": 22,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -679,21 +716,30 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "## 8."
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "_Exercise: Grow a forest._"
 ]
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's `ShuffleSplit` class for this."
 ]
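The next cell's source is not shown in this diff; a hedged sketch of step (a), where n_splits and train_size follow the exercise text and everything else is an assumption:

    from sklearn.model_selection import ShuffleSplit

    mini_sets = []
    # 1,000 random subsets of 100 training instances each
    rs = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)
    for mini_train_index, _ in rs.split(X_train):
        mini_sets.append((X_train[mini_train_index], y_train[mini_train_index]))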
@@ -702,7 +748,9 @@
 "cell_type": "code",
 "execution_count": 23,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -722,7 +770,10 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy."
 ]
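Step (b)'s cell is likewise elided; a sketch under the assumption that each subset trains a clone of the grid search's best estimator:

    from sklearn.base import clone
    from sklearn.metrics import accuracy_score
    import numpy as np

    forest = [clone(grid_search_cv.best_estimator_) for _ in range(1000)]

    accuracy_scores = []
    for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
        tree.fit(X_mini_train, y_mini_train)
        y_pred = tree.predict(X_test)
        accuracy_scores.append(accuracy_score(y_test, y_pred))

    np.mean(accuracy_scores)  # around 0.80, as the exercise predicts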
@@ -731,7 +782,9 @@
 "cell_type": "code",
 "execution_count": 24,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -752,7 +805,10 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This gives you _majority-vote predictions_ over the test set."
 ]
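The majority-vote cells (25 and 26) are elided as well; with SciPy's mode() the step might be sketched like this (the Y_pred matrix of shape [1000, n_test] is an assumption):

    import numpy as np
    from scipy.stats import mode

    # Stack each tree's test-set predictions into one [1000, n_test] matrix
    Y_pred = np.array([tree.predict(X_test) for tree in forest])

    # Most frequent prediction per test instance = the majority vote
    y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)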
@@ -761,7 +817,9 @@
 "cell_type": "code",
 "execution_count": 25,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -775,7 +833,9 @@
 "cell_type": "code",
 "execution_count": 26,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -786,7 +846,10 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
+"metadata": {
+"deletable": true,
+"editable": true
+},
 "source": [
 "d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!"
 ]
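The final evaluation (cell 27, also elided) reduces to one accuracy_score call; the reshape is an assumption needed because mode() may return a 2-D array:

    from sklearn.metrics import accuracy_score

    # Majority-vote accuracy; typically about 0.5 to 1.5% above the single tree's score
    accuracy_score(y_test, y_pred_majority_votes.reshape(-1))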
@@ -795,7 +858,9 @@
 "cell_type": "code",
 "execution_count": 27,
 "metadata": {
-"collapsed": false
+"collapsed": false,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": [
@@ -806,7 +871,9 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"collapsed": true
+"collapsed": true,
+"deletable": true,
+"editable": true
 },
 "outputs": [],
 "source": []