From 0e0483d5ee791ab7f45134fc24f0739a71a956fb Mon Sep 17 00:00:00 2001
From: Aurélien Geron
Date: Sun, 27 May 2018 21:24:24 +0200
Subject: [PATCH] Add missing training: True for dropout, and point to tf.nn.selu, fixes #228

---
 11_deep_learning.ipynb | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/11_deep_learning.ipynb b/11_deep_learning.ipynb
index 50d7228..46abb3b 100644
--- a/11_deep_learning.ipynb
+++ b/11_deep_learning.ipynb
@@ -462,7 +462,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This activation function was proposed in this [great paper](https://arxiv.org/pdf/1706.02515.pdf) by Günter Klambauer, Thomas Unterthiner and Andreas Mayr, published in June 2017 (I will definitely add it to the book). It outperforms the other activation functions very significantly for deep neural networks, so you should really try it out."
+    "This activation function was proposed in this [great paper](https://arxiv.org/pdf/1706.02515.pdf) by Günter Klambauer, Thomas Unterthiner and Andreas Mayr, published in June 2017 (I will definitely add it to the book). During training, a neural network composed of a stack of dense layers using the SELU activation function will self-normalize: the output of each layer will tend to preserve the same mean and variance, which solves the vanishing/exploding gradients problem. As a result, this activation function significantly outperforms the other activation functions for such neural nets, so you should really try it out."
    ]
   },
   {
@@ -499,7 +499,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "With this activation function, even a 100 layer deep neural network preserves roughly mean 0 and standard deviation 1 across all layers, avoiding the exploding/vanishing gradients problem:"
+    "By default, the SELU hyperparameters (`scale` and `alpha`) are tuned in such a way that the mean remains close to 0 and the standard deviation remains close to 1 (assuming the inputs are standardized with mean 0 and standard deviation 1 too). Using this activation function, even a 100-layer deep neural network preserves roughly mean 0 and standard deviation 1 across all layers, avoiding the exploding/vanishing gradients problem:"
    ]
   },
   {
@@ -524,7 +524,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here's a TensorFlow implementation (there will almost certainly be a `tf.nn.selu()` function in future TensorFlow versions):"
+    "The `tf.nn.selu()` function was added in TensorFlow 1.4. For earlier versions, you can use the following implementation:"
    ]
   },
   {
@@ -543,7 +543,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "SELUs can also be combined with dropout, check out [this implementation](https://github.com/bioinf-jku/SNNs/blob/master/selu.py) by the Institute of Bioinformatics, Johannes Kepler University Linz."
+    "However, the SELU activation function cannot be used along with regular Dropout (this would cancel its self-normalizing property). Fortunately, there is a Dropout variant called Alpha Dropout, proposed in the same paper. It is available as `tf.contrib.nn.alpha_dropout()` since TF 1.4 (or check out [this implementation](https://github.com/bioinf-jku/SNNs/blob/master/selu.py) by the Institute of Bioinformatics, Johannes Kepler University Linz)."
    ]
   },
   {
@@ -2330,7 +2330,7 @@
     "    init.run()\n",
     "    for epoch in range(n_epochs):\n",
     "        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):\n",
-    "            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})\n",
+    "            sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})\n",
     "        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})\n",
     "        print(epoch, \"Validation accuracy:\", accuracy_val)\n",
     "\n",
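
The implementation cell that the "tf.nn.selu() was added in TensorFlow 1.4" note refers to is not part of this diff. For context, a minimal sketch of such a pre-1.4 implementation looks like this (the scale and alpha constants are the ones derived in the Klambauer et al. paper):

import tensorflow as tf

# Sketch of a SELU activation for TensorFlow < 1.4 (from 1.4 on, use tf.nn.selu()).
# scale and alpha are the self-normalizing constants from Klambauer et al. (2017).
def selu(z,
         scale=1.0507009873554805,
         alpha=1.6732632423543772):
    # tf.nn.elu(z) equals exp(z) - 1 for z < 0, so the negative branch is alpha * (exp(z) - 1)
    return scale * tf.where(z >= 0.0, z, alpha * tf.nn.elu(z))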
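The 100-layer claim in the second modified cell can be checked with a quick NumPy simulation. This is a sketch under two assumptions the SELU paper makes: the inputs are standardized, and the weights use LeCun-style initialization (variance 1/fan-in):

import numpy as np

def selu_np(z, scale=1.0507009873554805, alpha=1.6732632423543772):
    return scale * np.where(z < 0.0, alpha * (np.exp(z) - 1.0), z)

np.random.seed(42)
Z = np.random.normal(size=(500, 100))             # 500 standardized inputs, 100 features
for layer in range(100):
    W = np.random.normal(size=(100, 100),
                         scale=np.sqrt(1 / 100))  # LeCun initialization: variance 1/fan_in
    Z = selu_np(Z.dot(W))
    if layer % 10 == 0:
        # mean stays close to 0 and std close to 1 across all 100 layers
        print("Layer {}: mean {:.2f}, std {:.2f}".format(layer, Z.mean(), Z.std()))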
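To pair SELU with Alpha Dropout as the last modified markdown cell suggests, one option on TF >= 1.4 is to gate tf.contrib.nn.alpha_dropout() with the same kind of training placeholder used for regular dropout. The layer size and keep probability below are illustrative placeholders, not values from the notebook:

import tensorflow as tf

# Sketch: Alpha Dropout preserves SELU's self-normalizing property, unlike regular
# dropout. Note that tf.contrib.nn.alpha_dropout() takes a keep probability (not a
# dropout rate) and should only be active during training.
training = tf.placeholder_with_default(False, shape=(), name="training")
keep_prob = 0.9   # illustrative value

X = tf.placeholder(tf.float32, shape=(None, 28 * 28), name="X")
X_drop = tf.cond(training,
                 lambda: tf.contrib.nn.alpha_dropout(X, keep_prob),
                 lambda: X)
hidden1 = tf.layers.dense(X_drop, 300, activation=tf.nn.selu,
                          kernel_initializer=tf.variance_scaling_initializer())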
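Finally, the training: True added to the feed_dict in the last hunk only has an effect if the graph was built with a training placeholder wired into the dropout layers, roughly as in the sketch below (illustrative names and sizes; at evaluation time training keeps its default value of False, so dropout is switched off):

import tensorflow as tf

# Sketch of the dropout wiring that the feed_dict fix relies on: tf.layers.dropout()
# is a no-op unless its `training` argument evaluates to True.
n_inputs, n_hidden, n_outputs = 28 * 28, 300, 10   # illustrative sizes
dropout_rate = 0.5

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None,), name="y")
training = tf.placeholder_with_default(False, shape=(), name="training")

X_drop = tf.layers.dropout(X, dropout_rate, training=training)
hidden1 = tf.layers.dense(X_drop, n_hidden, activation=tf.nn.relu, name="hidden1")
hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
logits = tf.layers.dense(hidden1_drop, n_outputs, name="outputs")

# During training, dropout must be switched on explicitly:
#     sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
# which is exactly what the patched line above does.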