From 326fa9ca8989e700c17ce0b5a3095ff8e6b940b8 Mon Sep 17 00:00:00 2001
From: dave <davidcotton@users.noreply.github.com>
Date: Tue, 21 Apr 2020 11:45:59 +1000
Subject: [PATCH] Fix typo in 'Gradient Descent, revisited' section

---
 math_differential_calculus.ipynb | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/math_differential_calculus.ipynb b/math_differential_calculus.ipynb
index b9bfb92..210ff97 100644
--- a/math_differential_calculus.ipynb
+++ b/math_differential_calculus.ipynb
@@ -1191,7 +1191,7 @@
     "\n",
     "In Deep Learning, the letter $\\mathbf{x}$ is generally used to represent the input data. When you _use_ a neural network to make predictions, you feed the neural network the inputs $\\mathbf{x}$, and you get back a prediction $\\hat{y} = f(\\mathbf{x})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by writing $\\hat{y} = f_\\mathbf{w}(\\mathbf{x})$, where $\\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but treats them as constants.\n",
     "\n",
-    "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{w}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
+    "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{y}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
    ]
   },
   {
@@ -1781,6 +1781,15 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.7.6"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
   }
  },
  "nbformat": 4,