{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Chapter 11 – Training Deep Neural Networks**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_This notebook contains all the sample code and solutions to the exercises in chapter 11._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Setup"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0-preview."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Python ≥3.5 is required\n",
|
||
"import sys\n",
|
||
"assert sys.version_info >= (3, 5)\n",
|
||
"\n",
|
||
"# Scikit-Learn ≥0.20 is required\n",
|
||
"import sklearn\n",
|
||
"assert sklearn.__version__ >= \"0.20\"\n",
|
||
"\n",
|
||
"# TensorFlow ≥2.0-preview is required\n",
|
||
"import tensorflow as tf\n",
|
||
"from tensorflow import keras\n",
|
||
"assert tf.__version__ >= \"2.0\"\n",
|
||
"\n",
|
||
"# Common imports\n",
|
||
"import numpy as np\n",
|
||
"import os\n",
|
||
"\n",
|
||
"# to make this notebook's output stable across runs\n",
|
||
"np.random.seed(42)\n",
|
||
"\n",
|
||
"# To plot pretty figures\n",
|
||
"%matplotlib inline\n",
|
||
"import matplotlib as mpl\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"mpl.rc('axes', labelsize=14)\n",
|
||
"mpl.rc('xtick', labelsize=12)\n",
|
||
"mpl.rc('ytick', labelsize=12)\n",
|
||
"\n",
|
||
"# Where to save the figures\n",
|
||
"PROJECT_ROOT_DIR = \".\"\n",
|
||
"CHAPTER_ID = \"deep\"\n",
|
||
"IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n",
|
||
"os.makedirs(IMAGES_PATH, exist_ok=True)\n",
|
||
"\n",
"def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
"    path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n",
"    print(\"Saving figure\", fig_id)\n",
"    if tight_layout:\n",
"        plt.tight_layout()\n",
"    plt.savefig(path, format=fig_extension, dpi=resolution)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Vanishing/Exploding Gradients Problem"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
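"def logit(z):\n",
"    # Despite its name, this computes the logistic (sigmoid) function,\n",
"    # i.e. the inverse of the logit; the name is kept because the plot below uses it\n",
"    return 1 / (1 + np.exp(-z))"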
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"z = np.linspace(-5, 5, 200)\n",
|
||
"\n",
|
||
"plt.plot([-5, 5], [0, 0], 'k-')\n",
|
||
"plt.plot([-5, 5], [1, 1], 'k--')\n",
|
||
"plt.plot([0, 0], [-0.2, 1.2], 'k-')\n",
|
||
"plt.plot([-5, 5], [-3/4, 7/4], 'g--')\n",
|
||
"plt.plot(z, logit(z), \"b-\", linewidth=2)\n",
|
||
"props = dict(facecolor='black', shrink=0.1)\n",
|
||
"plt.annotate('Saturating', xytext=(3.5, 0.7), xy=(5, 1), arrowprops=props, fontsize=14, ha=\"center\")\n",
|
||
"plt.annotate('Saturating', xytext=(-3.5, 0.3), xy=(-5, 0), arrowprops=props, fontsize=14, ha=\"center\")\n",
|
||
"plt.annotate('Linear', xytext=(2, 0.2), xy=(0, 0.5), arrowprops=props, fontsize=14, ha=\"center\")\n",
|
||
"plt.grid(True)\n",
|
||
"plt.title(\"Sigmoid activation function\", fontsize=14)\n",
|
||
"plt.axis([-5, 5, -0.2, 1.2])\n",
|
||
"\n",
|
||
"save_fig(\"sigmoid_saturation_plot\")\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Xavier and He Initialization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"[name for name in dir(keras.initializers) if not name.startswith(\"_\")]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"keras.layers.Dense(10, activation=\"relu\", kernel_initializer=\"he_normal\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',\n",
|
||
" distribution='uniform')\n",
|
||
"keras.layers.Dense(10, activation=\"relu\", kernel_initializer=init)"
|
||
]
|
||
},
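The initializers used above differ mainly in the variance they target for the weights. The sketch below is a minimal NumPy illustration, assuming the usual formulas (Glorot: 1 / fan_avg, He: 2 / fan_in, LeCun: 1 / fan_in) and a plain normal distribution. `init_stddev` is a hypothetical helper written for this example, not a Keras function, and Keras's own `he_normal` draws from a truncated normal, so its samples will not match this exactly.

```python
import numpy as np

def init_stddev(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of a normal weight initializer (hypothetical helper)."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1 / fan_avg,  # Glorot/Xavier: balances forward and backward signal
        "he": 2 / fan_in,       # He: usually paired with ReLU and its variants
        "lecun": 1 / fan_in,    # LeCun: the one SELU relies on
    }
    return np.sqrt(variances[scheme])

rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100
W = rng.normal(scale=init_stddev(fan_in, fan_out, "he"), size=(fan_in, fan_out))
print(W.std(), np.sqrt(2 / fan_in))  # the empirical std should be close to the target
```

The `VarianceScaling` call above follows the same idea: `scale=2.` with `mode='fan_avg'` targets a variance of 2 / fan_avg, sampled from a uniform rather than a normal distribution.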
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Nonsaturating Activation Functions"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Leaky ReLU"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def leaky_relu(z, alpha=0.01):\n",
|
||
" return np.maximum(alpha*z, z)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(z, leaky_relu(z, 0.05), \"b-\", linewidth=2)\n",
|
||
"plt.plot([-5, 5], [0, 0], 'k-')\n",
|
||
"plt.plot([0, 0], [-0.5, 4.2], 'k-')\n",
|
||
"plt.grid(True)\n",
|
||
"props = dict(facecolor='black', shrink=0.1)\n",
|
||
"plt.annotate('Leak', xytext=(-3.5, 0.5), xy=(-5, -0.2), arrowprops=props, fontsize=14, ha=\"center\")\n",
|
||
"plt.title(\"Leaky ReLU activation function\", fontsize=14)\n",
|
||
"plt.axis([-5, 5, -0.5, 4.2])\n",
|
||
"\n",
|
||
"save_fig(\"leaky_relu_plot\")\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"[m for m in dir(keras.activations) if not m.startswith(\"_\")]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"[m for m in dir(keras.layers) if \"relu\" in m.lower()]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"leaky_relu = keras.layers.LeakyReLU(alpha=0.2)\n",
|
||
"layer = keras.layers.Dense(10, activation=leaky_relu)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"layer.activation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Let's train a neural network on Fashion MNIST using the Leaky ReLU:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()\n",
|
||
"X_train_full = X_train_full / 255.0\n",
|
||
"X_test = X_test / 255.0\n",
|
||
"X_valid, X_train = X_train_full[:5000], X_train_full[5000:]\n",
|
||
"y_valid, y_train = y_train_full[:5000], y_train_full[5000:]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=leaky_relu),\n",
|
||
" keras.layers.Dense(100, activation=leaky_relu),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train, y_train, epochs=10,\n",
|
||
" validation_data=(X_valid, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### ELU"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def elu(z, alpha=1):\n",
|
||
" return np.where(z < 0, alpha * (np.exp(z) - 1), z)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(z, elu(z), \"b-\", linewidth=2)\n",
|
||
"plt.plot([-5, 5], [0, 0], 'k-')\n",
|
||
"plt.plot([-5, 5], [-1, -1], 'k--')\n",
|
||
"plt.plot([0, 0], [-2.2, 3.2], 'k-')\n",
|
||
"plt.grid(True)\n",
|
||
"plt.title(r\"ELU activation function ($\\alpha=1$)\", fontsize=14)\n",
|
||
"plt.axis([-5, 5, -2.2, 3.2])\n",
|
||
"\n",
|
||
"save_fig(\"elu_plot\")\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Implementing ELU in TensorFlow is trivial: just specify the activation function when building each layer:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"keras.layers.Dense(10, activation=\"elu\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### SELU"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"This activation function was proposed in this [great paper](https://arxiv.org/pdf/1706.02515.pdf) by Günter Klambauer, Thomas Unterthiner and Andreas Mayr, published in June 2017. During training, a neural network composed exclusively of a stack of dense layers using the SELU activation function and LeCun initialization will self-normalize: the output of each layer will tend to preserve the same mean and variance during training, which solves the vanishing/exploding gradients problem. As a result, this activation function outperforms the other activation functions very significantly for such neural nets, so you should really try it out. Unfortunately, the self-normalizing property of the SELU activation function is easily broken: you cannot use ℓ<sub>1</sub> or ℓ<sub>2</sub> regularization, regular dropout, max-norm, skip connections or other non-sequential topologies (so recurrent neural networks won't self-normalize). However, in practice it works quite well with sequential CNNs. If you break self-normalization, SELU will not necessarily outperform other activation functions."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from scipy.special import erfc\n",
|
||
"\n",
|
||
"# alpha and scale to self normalize with mean 0 and standard deviation 1\n",
|
||
"# (see equation 14 in the paper):\n",
|
||
"alpha_0_1 = -np.sqrt(2 / np.pi) / (erfc(1/np.sqrt(2)) * np.exp(1/2) - 1)\n",
|
||
"scale_0_1 = (1 - erfc(1 / np.sqrt(2)) * np.sqrt(np.e)) * np.sqrt(2 * np.pi) * (2 * erfc(np.sqrt(2))*np.e**2 + np.pi*erfc(1/np.sqrt(2))**2*np.e - 2*(2+np.pi)*erfc(1/np.sqrt(2))*np.sqrt(np.e)+np.pi+2)**(-1/2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def selu(z, scale=scale_0_1, alpha=alpha_0_1):\n",
|
||
" return scale * elu(z, alpha)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(z, selu(z), \"b-\", linewidth=2)\n",
|
||
"plt.plot([-5, 5], [0, 0], 'k-')\n",
|
||
"plt.plot([-5, 5], [-1.758, -1.758], 'k--')\n",
|
||
"plt.plot([0, 0], [-2.2, 3.2], 'k-')\n",
|
||
"plt.grid(True)\n",
|
||
"plt.title(\"SELU activation function\", fontsize=14)\n",
|
||
"plt.axis([-5, 5, -2.2, 3.2])\n",
|
||
"\n",
|
||
"save_fig(\"selu_plot\")\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"By default, the SELU hyperparameters (`scale` and `alpha`) are tuned in such a way that the mean output of each neuron remains close to 0, and the standard deviation remains close to 1 (assuming the inputs are standardized with mean 0 and standard deviation 1 too). Using this activation function, even a 1,000-layer deep neural network preserves roughly mean 0 and standard deviation 1 across all layers, avoiding the exploding/vanishing gradients problem:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.random.seed(42)\n",
|
||
"Z = np.random.normal(size=(500, 100)) # standardized inputs\n",
"for layer in range(1000):\n",
"    W = np.random.normal(size=(100, 100), scale=np.sqrt(1 / 100)) # LeCun initialization\n",
"    Z = selu(np.dot(Z, W))\n",
"    means = np.mean(Z, axis=0).mean()\n",
"    stds = np.std(Z, axis=0).mean()\n",
"    if layer % 100 == 0:\n",
"        print(\"Layer {}: mean {:.2f}, std deviation {:.2f}\".format(layer, means, stds))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Using SELU is easy:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"keras.layers.Dense(10, activation=\"selu\",\n",
|
||
" kernel_initializer=\"lecun_normal\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Let's create a neural net for Fashion MNIST with 100 hidden layers, using the SELU activation function:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.random.seed(42)\n",
|
||
"tf.random.set_seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential()\n",
|
||
"model.add(keras.layers.Flatten(input_shape=[28, 28]))\n",
|
||
"model.add(keras.layers.Dense(300, activation=\"selu\",\n",
|
||
" kernel_initializer=\"lecun_normal\"))\n",
|
||
"for layer in range(99):\n",
|
||
" model.add(keras.layers.Dense(100, activation=\"selu\",\n",
|
||
" kernel_initializer=\"lecun_normal\"))\n",
|
||
"model.add(keras.layers.Dense(10, activation=\"softmax\"))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now let's train it. Do not forget to scale the inputs to mean 0 and standard deviation 1:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"pixel_means = X_train.mean(axis=0, keepdims=True)\n",
|
||
"pixel_stds = X_train.std(axis=0, keepdims=True)\n",
|
||
"X_train_scaled = (X_train - pixel_means) / pixel_stds\n",
|
||
"X_valid_scaled = (X_valid - pixel_means) / pixel_stds\n",
|
||
"X_test_scaled = (X_test - pixel_means) / pixel_stds"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train_scaled, y_train, epochs=5,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now look at what happens if we try to use the ReLU activation function instead:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.random.seed(42)\n",
|
||
"tf.random.set_seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential()\n",
|
||
"model.add(keras.layers.Flatten(input_shape=[28, 28]))\n",
|
||
"model.add(keras.layers.Dense(300, activation=\"relu\", kernel_initializer=\"he_normal\"))\n",
|
||
"for layer in range(99):\n",
|
||
" model.add(keras.layers.Dense(100, activation=\"relu\", kernel_initializer=\"he_normal\"))\n",
|
||
"model.add(keras.layers.Dense(10, activation=\"softmax\"))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train_scaled, y_train, epochs=5,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Not great at all: we suffered from the vanishing/exploding gradients problem."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Batch Normalization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.BatchNormalization(),\n",
|
||
" keras.layers.Dense(300, activation=\"relu\"),\n",
|
||
" keras.layers.BatchNormalization(),\n",
|
||
" keras.layers.Dense(100, activation=\"relu\"),\n",
|
||
" keras.layers.BatchNormalization(),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.summary()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"bn1 = model.layers[1]\n",
|
||
"[(var.name, var.trainable) for var in bn1.variables]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"bn1.updates"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train, y_train, epochs=10,\n",
|
||
" validation_data=(X_valid, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Sometimes applying BN before the activation function works better (there's a debate on this topic). Moreover, the layer before a `BatchNormalization` layer does not need bias terms: the `BatchNormalization` layer has its own offset parameters, so the extra biases would be a waste of parameters. You can set `use_bias=False` when creating those layers:"
|
||
]
|
||
},
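To see why the bias is redundant, here is a minimal NumPy sketch. It assumes training-mode normalization over the batch axis and leaves out the learned scale and offset; `batch_norm` is a hypothetical helper for this illustration, not the Keras layer. Subtracting the batch mean cancels any constant bias added by the previous layer.

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    # Normalize each feature over the batch axis (no learned scale/offset here)
    return (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 5))   # a batch of 32 instances with 5 features
W = rng.normal(size=(5, 3))    # weights of a dense layer with 3 units
b = rng.normal(size=3)         # bias terms of that layer

with_bias = batch_norm(X @ W + b)
without_bias = batch_norm(X @ W)
print(np.allclose(with_bias, without_bias))  # True
```

The bias shifts every instance in the batch by the same amount, so it changes the batch mean but not the variance, and the normalized outputs are identical.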
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.BatchNormalization(),\n",
|
||
" keras.layers.Dense(300, use_bias=False),\n",
|
||
" keras.layers.BatchNormalization(),\n",
|
||
" keras.layers.Activation(\"relu\"),\n",
|
||
" keras.layers.Dense(100, use_bias=False),\n",
"    keras.layers.BatchNormalization(),\n",
"    keras.layers.Activation(\"relu\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train, y_train, epochs=10,\n",
|
||
" validation_data=(X_valid, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Gradient Clipping"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"All Keras optimizers accept `clipnorm` or `clipvalue` arguments:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"optimizer = keras.optimizers.SGD(clipvalue=1.0)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"optimizer = keras.optimizers.SGD(clipnorm=1.0)"
|
||
]
|
||
},
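The two arguments behave differently. The following NumPy sketch is only an illustration, assuming the clipping is applied to a single gradient tensor; `clip_by_value` and `clip_by_norm` are hypothetical helpers, not the Keras implementation. Clipping by value caps each component independently and can change the gradient's direction, while clipping by norm rescales the whole vector and keeps its direction.

```python
import numpy as np

def clip_by_value(grad, clipvalue):
    # Cap every component to the range [-clipvalue, clipvalue]
    return np.clip(grad, -clipvalue, clipvalue)

def clip_by_norm(grad, clipnorm):
    # Rescale the whole gradient if its L2 norm exceeds clipnorm
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)  # only the second component is capped
by_norm = clip_by_norm(grad, 1.0)    # both components shrink by the same factor
print(by_value)
print(by_norm, np.linalg.norm(by_norm))
```

With `clipvalue`, the clipped gradient points almost along the diagonal even though the original pointed almost entirely along the second axis; with `clipnorm`, the orientation is unchanged and only the step size shrinks.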
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Reusing Pretrained Layers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Reusing a Keras model"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Let's split the Fashion MNIST training set in two:\n",
|
||
"* `X_train_A`: all images of all items except for sandals and shirts (classes 5 and 6).\n",
|
||
"* `X_train_B`: a much smaller training set of just the first 200 images of sandals or shirts.\n",
|
||
"\n",
|
||
"The validation set and the test set are also split this way, but without restricting the number of images.\n",
|
||
"\n",
|
||
"We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). However, since we are using `Dense` layers, only patterns that occur at the same location can be reused (in contrast, convolutional layers will transfer much better, since learned patterns can be detected anywhere on the image, as we will see in the CNN chapter)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def split_dataset(X, y):\n",
|
||
" y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts\n",
|
||
" y_A = y[~y_5_or_6]\n",
|
||
" y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7\n",
|
||
" y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?\n",
|
||
" return ((X[~y_5_or_6], y_A),\n",
|
||
" (X[y_5_or_6], y_B))\n",
|
||
"\n",
|
||
"(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)\n",
|
||
"(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)\n",
|
||
"(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)\n",
|
||
"X_train_B = X_train_B[:200]\n",
|
||
"y_train_B = y_train_B[:200]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"X_train_A.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"X_train_B.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_train_A[:30]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_train_B[:30]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_A = keras.models.Sequential()\n",
|
||
"model_A.add(keras.layers.Flatten(input_shape=[28, 28]))\n",
|
||
"for n_hidden in (300, 100, 50, 50, 50):\n",
|
||
" model_A.add(keras.layers.Dense(n_hidden, activation=\"selu\"))\n",
|
||
"model_A.add(keras.layers.Dense(8, activation=\"softmax\"))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_A.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model_A.fit(X_train_A, y_train_A, epochs=20,\n",
|
||
" validation_data=(X_valid_A, y_valid_A))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 54,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_A.save(\"my_model_A.h5\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 55,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_B = keras.models.Sequential()\n",
|
||
"model_B.add(keras.layers.Flatten(input_shape=[28, 28]))\n",
|
||
"for n_hidden in (300, 100, 50, 50, 50):\n",
|
||
" model_B.add(keras.layers.Dense(n_hidden, activation=\"selu\"))\n",
|
||
"model_B.add(keras.layers.Dense(1, activation=\"sigmoid\"))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 56,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_B.compile(loss=\"binary_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 57,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model_B.fit(X_train_B, y_train_B, epochs=20,\n",
|
||
" validation_data=(X_valid_B, y_valid_B))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 58,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"model_B.summary()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 59,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_A = keras.models.load_model(\"my_model_A.h5\")\n",
|
||
"model_B_on_A = keras.models.Sequential(model_A.layers[:-1])\n",
|
||
"model_B_on_A.add(keras.layers.Dense(1, activation=\"sigmoid\"))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_A_clone = keras.models.clone_model(model_A)\n",
|
||
"model_A_clone.set_weights(model_A.get_weights())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 61,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"for layer in model_B_on_A.layers[:-1]:\n",
|
||
" layer.trainable = False\n",
|
||
"\n",
|
||
"model_B_on_A.compile(loss=\"binary_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 62,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,\n",
|
||
" validation_data=(X_valid_B, y_valid_B))\n",
|
||
"\n",
|
||
"for layer in model_B_on_A.layers[:-1]:\n",
|
||
" layer.trainable = True\n",
|
||
"\n",
|
||
"model_B_on_A.compile(loss=\"binary_crossentropy\", optimizer=\"sgd\",\n",
|
||
" metrics=[\"accuracy\"])\n",
|
||
"history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,\n",
|
||
" validation_data=(X_valid_B, y_valid_B))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"So, what's the final verdict?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 63,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_B.evaluate(X_test_B, y_test_B)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 64,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model_B_on_A.evaluate(X_test_B, y_test_B)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Great! We got quite a bit of transfer: the error rate dropped by a factor of almost 4!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 65,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"(100 - 97.05) / (100 - 99.25)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Faster Optimizers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Momentum optimization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 66,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Nesterov Accelerated Gradient"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 67,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## AdaGrad"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 68,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.Adagrad(learning_rate=0.001)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## RMSProp"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 69,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Adam Optimization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 70,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Adamax Optimization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 71,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Nadam Optimization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 72,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Learning Rate Scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Power Scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"```lr = lr0 / (1 + steps / s)**c```\n",
|
||
"* Keras uses `c=1` and `s = 1 / decay`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 73,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 74,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 75,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"n_epochs = 25\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 76,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"learning_rate = 0.01\n",
|
||
"decay = 1e-4\n",
|
||
"batch_size = 32\n",
|
||
"n_steps_per_epoch = len(X_train) // batch_size\n",
|
||
"epochs = np.arange(n_epochs)\n",
|
||
"lrs = learning_rate / (1 + decay * epochs * n_steps_per_epoch)\n",
|
||
"\n",
|
||
"plt.plot(epochs, lrs, \"o-\")\n",
|
||
"plt.axis([0, n_epochs - 1, 0, 0.01])\n",
|
||
"plt.xlabel(\"Epoch\")\n",
|
||
"plt.ylabel(\"Learning Rate\")\n",
|
||
"plt.title(\"Power Scheduling\", fontsize=14)\n",
|
||
"plt.grid(True)\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Exponential Scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"```lr = lr0 * 0.1**(epoch / s)```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 77,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def exponential_decay_fn(epoch):\n",
|
||
" return 0.01 * 0.1**(epoch / 20)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 78,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def exponential_decay(lr0, s):\n",
|
||
" def exponential_decay_fn(epoch):\n",
|
||
" return lr0 * 0.1**(epoch / s)\n",
|
||
" return exponential_decay_fn\n",
|
||
"\n",
|
||
"exponential_decay_fn = exponential_decay(lr0=0.01, s=20)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 79,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 25"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 80,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid),\n",
|
||
" callbacks=[lr_scheduler])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 81,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(history.epoch, history.history[\"lr\"], \"o-\")\n",
|
||
"plt.axis([0, n_epochs - 1, 0, 0.011])\n",
|
||
"plt.xlabel(\"Epoch\")\n",
|
||
"plt.ylabel(\"Learning Rate\")\n",
|
||
"plt.title(\"Exponential Scheduling\", fontsize=14)\n",
|
||
"plt.grid(True)\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The schedule function can take the current learning rate as a second argument:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 82,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def exponential_decay_fn(epoch, lr):\n",
|
||
" return lr * 0.1**(1 / 20)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"If you want to update the learning rate at each iteration rather than at each epoch, you must write your own callback class:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 83,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"K = keras.backend\n",
|
||
"\n",
|
||
"class ExponentialDecay(keras.callbacks.Callback):\n",
|
||
" def __init__(self, s=40000):\n",
|
||
" super().__init__()\n",
|
||
" self.s = s\n",
|
||
"\n",
|
||
" def on_batch_begin(self, batch, logs=None):\n",
|
||
" # Note: the `batch` argument is reset at each epoch\n",
|
||
" lr = K.get_value(self.model.optimizer.lr)\n",
|
||
" K.set_value(self.model.optimizer.lr, lr * 0.1**(1 / s))\n",
|
||
"\n",
|
||
" def on_epoch_end(self, epoch, logs=None):\n",
|
||
" logs = logs or {}\n",
|
||
" logs['lr'] = K.get_value(self.model.optimizer.lr)\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"lr0 = 0.01\n",
|
||
"optimizer = keras.optimizers.Nadam(lr=lr0)\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 25\n",
|
||
"\n",
|
||
"s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)\n",
|
||
"exp_decay = ExponentialDecay(s)\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid),\n",
|
||
" callbacks=[exp_decay])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 84,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"n_steps = n_epochs * len(X_train) // 32\n",
|
||
"steps = np.arange(n_steps)\n",
|
||
"lrs = lr0 * 0.1**(steps / s)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 85,
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(steps, lrs, \"-\", linewidth=2)\n",
|
||
"plt.axis([0, n_steps - 1, 0, lr0 * 1.1])\n",
|
||
"plt.xlabel(\"Batch\")\n",
|
||
"plt.ylabel(\"Learning Rate\")\n",
|
||
"plt.title(\"Exponential Scheduling (per batch)\", fontsize=14)\n",
|
||
"plt.grid(True)\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Piecewise Constant Scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 86,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def piecewise_constant_fn(epoch):\n",
|
||
" if epoch < 5:\n",
|
||
" return 0.01\n",
|
||
" elif epoch < 15:\n",
|
||
" return 0.005\n",
|
||
" else:\n",
|
||
" return 0.001"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 87,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def piecewise_constant(boundaries, values):\n",
|
||
" boundaries = np.array([0] + boundaries)\n",
|
||
" values = np.array(values)\n",
|
||
" def piecewise_constant_fn(epoch):\n",
|
||
" return values[np.argmax(boundaries > epoch) - 1]\n",
|
||
" return piecewise_constant_fn\n",
|
||
"\n",
|
||
"piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 88,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 25\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid),\n",
|
||
" callbacks=[lr_scheduler])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 89,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(history.epoch, [piecewise_constant_fn(epoch) for epoch in history.epoch], \"o-\")\n",
|
||
"plt.axis([0, n_epochs - 1, 0, 0.011])\n",
|
||
"plt.xlabel(\"Epoch\")\n",
|
||
"plt.ylabel(\"Learning Rate\")\n",
|
||
"plt.title(\"Piecewise Constant Scheduling\", fontsize=14)\n",
|
||
"plt.grid(True)\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Performance Scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 90,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 91,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"optimizer = keras.optimizers.SGD(lr=0.02, momentum=0.9)\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 25\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid),\n",
|
||
" callbacks=[lr_scheduler])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 92,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"plt.plot(history.epoch, history.history[\"lr\"], \"bo-\")\n",
|
||
"plt.xlabel(\"Epoch\")\n",
|
||
"plt.ylabel(\"Learning Rate\", color='b')\n",
|
||
"plt.tick_params('y', colors='b')\n",
|
||
"plt.gca().set_xlim(0, n_epochs - 1)\n",
|
||
"plt.grid(True)\n",
|
||
"\n",
|
||
"ax2 = plt.gca().twinx()\n",
|
||
"ax2.plot(history.epoch, history.history[\"val_loss\"], \"r^-\")\n",
|
||
"ax2.set_ylabel('Validation Loss', color='r')\n",
|
||
"ax2.tick_params('y', colors='r')\n",
|
||
"\n",
|
||
"plt.title(\"Reduce LR on Plateau\", fontsize=14)\n",
|
||
"plt.show()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### tf.keras schedulers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 93,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)\n",
|
||
"learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)\n",
|
||
"optimizer = keras.optimizers.SGD(learning_rate)\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 25\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"For piecewise constant scheduling, try this:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 94,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"learning_rate = keras.optimizers.schedules.PiecewiseConstantDecay(\n",
|
||
" boundaries=[5. * n_steps_per_epoch, 15. * n_steps_per_epoch],\n",
|
||
" values=[0.01, 0.005, 0.001])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 1Cycle scheduling"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 95,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"K = keras.backend\n",
|
||
"\n",
|
||
"class ExponentialLearningRate(keras.callbacks.Callback):\n",
|
||
" def __init__(self, factor):\n",
|
||
" self.factor = factor\n",
|
||
" self.rates = []\n",
|
||
" self.losses = []\n",
|
||
" def on_batch_end(self, batch, logs):\n",
|
||
" self.rates.append(K.get_value(self.model.optimizer.lr))\n",
|
||
" self.losses.append(logs[\"loss\"])\n",
|
||
" K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)\n",
|
||
"\n",
|
||
"def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):\n",
|
||
" init_weights = model.get_weights()\n",
|
||
" iterations = len(X) // batch_size * epochs\n",
|
||
" factor = np.exp(np.log(max_rate / min_rate) / iterations)\n",
|
||
" init_lr = K.get_value(model.optimizer.lr)\n",
|
||
" K.set_value(model.optimizer.lr, min_rate)\n",
|
||
" exp_lr = ExponentialLearningRate(factor)\n",
|
||
" history = model.fit(X, y, epochs=epochs, batch_size=batch_size,\n",
|
||
" callbacks=[exp_lr])\n",
|
||
" K.set_value(model.optimizer.lr, init_lr)\n",
|
||
" model.set_weights(init_weights)\n",
|
||
" return exp_lr.rates, exp_lr.losses\n",
|
||
"\n",
|
||
"def plot_lr_vs_loss(rates, losses):\n",
|
||
" plt.plot(rates, losses)\n",
|
||
" plt.gca().set_xscale('log')\n",
|
||
" plt.hlines(min(losses), min(rates), max(rates))\n",
|
||
" plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 2])\n",
|
||
" plt.xlabel(\"Learning rate\")\n",
|
||
" plt.ylabel(\"Loss\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 96,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"sgd\", metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 97,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"batch_size = 128\n",
|
||
"rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)\n",
|
||
"plot_lr_vs_loss(rates, losses)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 98,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"class OneCycleScheduler(keras.callbacks.Callback):\n",
|
||
" def __init__(self, iterations, max_rate, start_rate=None,\n",
|
||
" last_iterations=None, last_rate=None):\n",
|
||
" self.iterations = iterations\n",
|
||
" self.max_rate = max_rate\n",
|
||
" self.start_rate = start_rate or max_rate / 10\n",
|
||
" self.last_iterations = last_iterations or iterations // 10 + 1\n",
|
||
" self.half_iteration = (iterations - self.last_iterations) // 2\n",
|
||
" self.last_rate = last_rate or self.start_rate / 1000\n",
|
||
" self.iteration = 0\n",
|
||
" def _interpolate(self, iter1, iter2, rate1, rate2):\n",
|
||
" return ((rate2 - rate1) * (iter2 - self.iteration)\n",
|
||
" / (iter2 - iter1) + rate1)\n",
|
||
" def on_batch_begin(self, batch, logs):\n",
|
||
" if self.iteration < self.half_iteration:\n",
|
||
" rate = self._interpolate(0, self.half_iteration, self.start_rate, self.max_rate)\n",
|
||
" elif self.iteration < 2 * self.half_iteration:\n",
|
||
" rate = self._interpolate(self.half_iteration, 2 * self.half_iteration,\n",
|
||
" self.max_rate, self.start_rate)\n",
|
||
" else:\n",
|
||
" rate = self._interpolate(2 * self.half_iteration, self.iterations,\n",
|
||
" self.start_rate, self.last_rate)\n",
|
||
" rate = max(rate, self.last_rate)\n",
|
||
" self.iteration += 1\n",
|
||
" K.set_value(self.model.optimizer.lr, rate)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 99,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"n_epochs = 25\n",
|
||
"onecycle = OneCycleScheduler(len(X_train) // batch_size * n_epochs, max_rate=0.05)\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,\n",
|
||
" validation_data=(X_valid_scaled, y_valid),\n",
|
||
" callbacks=[onecycle])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Avoiding Overfitting Through Regularization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## $\\ell_1$ and $\\ell_2$ regularization"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 100,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"layer = keras.layers.Dense(100, activation=\"elu\",\n",
|
||
" kernel_initializer=\"he_normal\",\n",
|
||
" kernel_regularizer=keras.regularizers.l2(0.01))\n",
|
||
"# or l1(0.1) for ℓ1 regularization with a factor or 0.1\n",
|
||
"# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 101,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dense(300, activation=\"elu\",\n",
|
||
" kernel_initializer=\"he_normal\",\n",
|
||
" kernel_regularizer=keras.regularizers.l2(0.01)),\n",
|
||
" keras.layers.Dense(100, activation=\"elu\",\n",
|
||
" kernel_initializer=\"he_normal\",\n",
|
||
" kernel_regularizer=keras.regularizers.l2(0.01)),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\",\n",
|
||
" kernel_regularizer=keras.regularizers.l2(0.01))\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 2\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 102,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from functools import partial\n",
|
||
"\n",
|
||
"RegularizedDense = partial(keras.layers.Dense,\n",
|
||
" activation=\"elu\",\n",
|
||
" kernel_initializer=\"he_normal\",\n",
|
||
" kernel_regularizer=keras.regularizers.l2(0.01))\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" RegularizedDense(300),\n",
|
||
" RegularizedDense(100),\n",
|
||
" RegularizedDense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 2\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Dropout"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 103,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.Dropout(rate=0.2),\n",
|
||
" keras.layers.Dense(300, activation=\"elu\", kernel_initializer=\"he_normal\"),\n",
|
||
" keras.layers.Dropout(rate=0.2),\n",
|
||
" keras.layers.Dense(100, activation=\"elu\", kernel_initializer=\"he_normal\"),\n",
|
||
" keras.layers.Dropout(rate=0.2),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 2\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Alpha Dropout"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 104,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 105,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" keras.layers.AlphaDropout(rate=0.2),\n",
|
||
" keras.layers.Dense(300, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.AlphaDropout(rate=0.2),\n",
|
||
" keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\"),\n",
|
||
" keras.layers.AlphaDropout(rate=0.2),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 20\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 106,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.evaluate(X_test_scaled, y_test)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 107,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"model.evaluate(X_train_scaled, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 108,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"history = model.fit(X_train_scaled, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## MC Dropout"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 109,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 110,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_probas = np.stack([model(X_test_scaled, training=True)\n",
|
||
" for sample in range(100)])\n",
|
||
"y_proba = y_probas.mean(axis=0)\n",
|
||
"y_std = y_probas.std(axis=0)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 111,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.round(model.predict(X_test_scaled[:1]), 2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 112,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.round(y_probas[:, :1], 2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 113,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.round(y_proba[:1], 2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 114,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_std = y_probas.std(axis=0)\n",
|
||
"np.round(y_std[:1], 2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 115,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y_pred = np.argmax(y_proba, axis=1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 116,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"accuracy = np.sum(y_pred == y_test) / len(y_test)\n",
|
||
"accuracy"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 117,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"class MCDropout(keras.layers.Dropout):\n",
|
||
" def call(self, inputs):\n",
|
||
" return super().call(inputs, training=True)\n",
|
||
"\n",
|
||
"class MCAlphaDropout(keras.layers.AlphaDropout):\n",
|
||
" def call(self, inputs):\n",
|
||
" return super().call(inputs, training=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 118,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"tf.random.set_seed(42)\n",
|
||
"np.random.seed(42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 119,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"mc_model = keras.models.Sequential([\n",
|
||
" MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer\n",
|
||
" for layer in model.layers\n",
|
||
"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 120,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"mc_model.summary()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 121,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)\n",
|
||
"mc_model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=optimizer, metrics=[\"accuracy\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 122,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"mc_model.set_weights(model.get_weights())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we can use the model with MC Dropout:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 123,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Max norm"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 124,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"layer = keras.layers.Dense(100, activation=\"selu\", kernel_initializer=\"lecun_normal\",\n",
|
||
" kernel_constraint=keras.constraints.max_norm(1.))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 125,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"MaxNormDense = partial(keras.layers.Dense,\n",
|
||
" activation=\"selu\", kernel_initializer=\"lecun_normal\",\n",
|
||
" kernel_constraint=keras.constraints.max_norm(1.))\n",
|
||
"\n",
|
||
"model = keras.models.Sequential([\n",
|
||
" keras.layers.Flatten(input_shape=[28, 28]),\n",
|
||
" MaxNormDense(300),\n",
|
||
" MaxNormDense(100),\n",
|
||
" keras.layers.Dense(10, activation=\"softmax\")\n",
|
||
"])\n",
|
||
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
|
||
"n_epochs = 2\n",
|
||
"history = model.fit(X_train_scaled, y_train, epochs=n_epochs,\n",
|
||
" validation_data=(X_valid_scaled, y_valid))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true
|
||
},
|
||
"source": [
|
||
"# Exercises"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 1. to 7."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"See appendix A."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 8. Deep Learning"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 8.1."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 8.2."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 8.3."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: Tune the hyperparameters using cross-validation and see what precision you can achieve._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 8.4."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 8.5."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"collapsed": true
|
||
},
|
||
"source": [
|
||
"## 9. Transfer learning"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 9.1."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 9.2."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 9.3."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: try caching the frozen layers, and train the model again: how much faster is it now?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 9.4."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: try again reusing just four hidden layers instead of five. Can you achieve a higher precision?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 9.5."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"_Exercise: now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?_"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 10. Pretraining on an auxiliary task"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 10.1.\n",
|
||
"_Exercise: start by building two DNNs (let's call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add one more hidden layer with 10 units on top of both DNNs. You should use the `keras.layers.concatenate()` function to concatenate the outputs of both DNNs, then feed the result to the hidden layer. Finally, add an output layer with a single neuron using the logistic activation function._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 10.2.\n",
|
||
"_Exercise: split the MNIST training set into two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be pairs of images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 10.3.\n",
|
||
"_Exercise: train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 10.4.\n",
|
||
"_Exercise: now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class._"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.6.8"
|
||
},
|
||
"nav_menu": {
|
||
"height": "360px",
|
||
"width": "416px"
|
||
},
|
||
"toc": {
|
||
"navigate_menu": true,
|
||
"number_sections": true,
|
||
"sideBar": true,
|
||
"threshold": 6,
|
||
"toc_cell": false,
|
||
"toc_section_display": "block",
|
||
"toc_window_display": false
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 1
|
||
}
|