Merge pull request #151 from davidcotton/master

Fix typo in 'Gradient Descent, revisited' section
2020-04-21 18:02:57 +12:00 · 2020-04-21 18:02:57 +12:00 · 2a7a849d72
commit 2a7a849d72
parent ab97bf435b 326fa9ca89
1 changed files with 10 additions and 1 deletions
--- a/math_differential_calculus.ipynb
+++ b/math_differential_calculus.ipynb
@ -1191,7 +1191,7 @@
    "\n",
    "In Deep Learning, the letter $\\mathbf{x}$ is generally used to represent the input data. When you _use_ a neural network to make predictions, you feed the neural network the inputs $\\mathbf{x}$, and you get back a prediction $\\hat{y} = f(\\mathbf{x})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by writing $\\hat{y} = f_\\mathbf{w}(\\mathbf{x})$, where $\\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but treats them as constants.\n",
    "\n",
-    "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{w}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
+    "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{y}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
   ]
  },
  {
@ -1781,6 +1781,15 @@
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
  }
 },
 "nbformat": 4,