"Perfect, but this requires some mathematical work. It is not too hard in this case, but for a deep neural network, it is pratically impossible to compute the derivatives this way. So let's look at various ways to automate this!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Numeric differentiation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we compute an approxiation of the gradients using the equation: $\\dfrac{\\partial f}{\\partial x} = \\displaystyle{\\lim_{\\epsilon \\to 0}}\\dfrac{f(x+\\epsilon, y) - f(x, y)}{\\epsilon}$ (and there is a similar definition for $\\dfrac{\\partial f}{\\partial y}$)."
"The good news is that it is pretty easy to compute the Hessians. First let's create functions that compute the first order partial derivatives (also called Jacobians):"
"So everything works well, but the result is approximate, and computing the gradients of a function with regards to $n$ variables requires calling that function $n$ times. In deep neural nets, there are often thousands of parameters to tweak using gradient descent (which requires computing the gradients of the loss function with regards to each of these parameters), so this approach would be much too slow."
]
},
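{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of how such a numeric `gradients()` helper could be written (the exact function used in this notebook may differ; the example function `f` below is assumed purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def gradients(func, vars_list, eps=1e-4):\n",
"    # Approximate each partial derivative by nudging one variable at a time\n",
"    base_value = func(*vars_list)\n",
"    partial_derivatives = []\n",
"    for idx in range(len(vars_list)):\n",
"        tweaked_vars = list(vars_list)\n",
"        tweaked_vars[idx] += eps\n",
"        partial_derivatives.append((func(*tweaked_vars) - base_value) / eps)\n",
"    return partial_derivatives\n",
"\n",
"def f(x, y):\n",
"    return x**2*y + y + 2  # example function, assumed for illustration\n",
"\n",
"gradients(f, [3, 4])  # roughly [24.0, 10.0], i.e. [df/dx, df/dy] at (3, 4)"
]
},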
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing a Toy Computation Graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than this numerical approach, let's implement some symbolic autodiff techniques. For this, we will need to define classes to represent constants, variables and operations."
"The autodiff methods we will present below are all based on the *chain rule*."
]
},
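{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what such a computation graph could look like (the class names `Const`, `Var`, `Add` and `Mul` below are just one possible design, not necessarily the exact one used in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Const:\n",
"    def __init__(self, value):\n",
"        self.value = value\n",
"    def evaluate(self):\n",
"        return self.value\n",
"\n",
"class Var:\n",
"    def __init__(self, name, init_value=0):\n",
"        self.name = name\n",
"        self.value = init_value\n",
"    def evaluate(self):\n",
"        return self.value\n",
"\n",
"class BinaryOperator:\n",
"    def __init__(self, a, b):\n",
"        self.a, self.b = a, b\n",
"\n",
"class Add(BinaryOperator):\n",
"    def evaluate(self):\n",
"        return self.a.evaluate() + self.b.evaluate()\n",
"\n",
"class Mul(BinaryOperator):\n",
"    def evaluate(self):\n",
"        return self.a.evaluate() * self.b.evaluate()\n",
"\n",
"# Build the graph for f(x, y) = x^2*y + y + 2 and evaluate it at x=3, y=4\n",
"x = Var(\"x\", init_value=3)\n",
"y = Var(\"y\", init_value=4)\n",
"f = Add(Mul(Mul(x, x), y), Add(y, Const(2)))\n",
"f.evaluate()  # 3^2 * 4 + 4 + 2 = 42"
]
},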
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we have two functions $u$ and $v$, and we apply them sequentially to some input $x$, and we get the result $z$. So we have $z = v(u(x))$, which we can rewrite as $z = v(s)$ and $s = u(x)$. Now we can apply the chain rule to get the partial derivative of the output $z$ with regards to the input $x$:\n",
"In forward mode autodiff, the algorithm computes these terms \"forward\" (i.e., in the same order as the computations required to compute the output $z$), that is from left to right: first $\\dfrac{\\partial s_1}{\\partial x}$, then $\\dfrac{\\partial s_2}{\\partial s_1}$, and so on. In reverse mode autodiff, the algorithm computes these terms \"backwards\", from right to left: first $\\dfrac{\\partial z}{\\partial s_n}$, then $\\dfrac{\\partial s_n}{\\partial s_{n-1}}$, and so on.\n",
"\n",
"For example, suppose you want to compute the derivative of the function $z(x)=\\sin(x^2)$ at x=3, using forward mode autodiff. The algorithm would first compute the partial derivative $\\dfrac{\\partial s_1}{\\partial x}=\\dfrac{\\partial x^2}{\\partial x}=2x=6$. Next, it would compute $\\dfrac{\\partial z}{\\partial x}=\\dfrac{\\partial s_1}{\\partial x}\\cdot\\dfrac{\\partial z}{\\partial s_1}= 6 \\cdot \\dfrac{\\partial \\sin(s_1)}{\\partial s_1}=6 \\cdot \\cos(s_1)=6 \\cdot \\cos(3^2)\\approx-5.46$."
]
},
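{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is a small sketch that compares this chain-rule result with a finite-difference approximation (just an illustration, independent of the helpers defined elsewhere in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def z(x):\n",
"    return np.sin(x**2)\n",
"\n",
"x, eps = 3.0, 1e-6\n",
"print(2 * x * np.cos(x**2))       # chain rule: 2x*cos(x^2), about -5.467\n",
"print((z(x + eps) - z(x)) / eps)  # finite difference, about -5.467"
]
},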
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's verify this result using the `gradients()` function defined earlier:"
"Look good. Now let's do the same thing using reverse mode autodiff. This time the algorithm would start from the right hand side so it would compute $\\dfrac{\\partial z}{\\partial s_1} = \\dfrac{\\partial \\sin(s_1)}{\\partial s_1}=\\cos(s_1)=\\cos(3^2)\\approx -0.91$. Next it would compute $\\dfrac{\\partial z}{\\partial x}=\\dfrac{\\partial s_1}{\\partial x}\\cdot\\dfrac{\\partial z}{\\partial s_1} \\approx \\dfrac{\\partial s_1}{\\partial x} \\cdot -0.91 = \\dfrac{\\partial x^2}{\\partial x} \\cdot -0.91=2x \\cdot -0.91 = 6\\cdot-0.91=-5.46$."
"Of course both approaches give the same result (except for rounding errors), and with a single input and output they involve the same number of computations. But when there are several inputs or outputs, they can have very different performance. Indeed, if there are many inputs, the right-most terms will be needed to compute the partial derivatives with regards to each input, so it is a good idea to compute these right-most terms first. That means using reverse-mode autodiff. This way, the right-most terms can be computed just once and used to compute all the partial derivatives. Conversely, if there are many outputs, forward-mode is generally preferable because the left-most terms can be computed just once to compute the partial derivatives of the different outputs. In Deep Learning, there are typically thousands of model parameters, meaning there are lots of inputs, but few outputs. In fact, there is generally just one output during training: the loss. This is why reverse mode autodiff is used in TensorFlow and all major Deep Learning libraries."
"There's one additional complexity in reverse mode autodiff: the value of $s_i$ is generally required when computing $\\dfrac{\\partial s_{i+1}}{\\partial s_i}$, and computing $s_i$ requires first computing $s_{i-1}$, which requires computing $s_{i-2}$, and so on. So basically, a first pass forward through the network is required to compute $s_1$, $s_2$, $s_3$, $\\dots$, $s_{n-1}$ and $s_n$, and then the algorithm can compute the partial derivatives from right to left. Storing all the intermediate values $s_i$ in RAM is sometimes a problem, especially when handling images, and when using GPUs which often have limited RAM: to limit this problem, one can reduce the number of layers in the neural network, or configure TensorFlow to make it swap these values from GPU RAM to CPU RAM. Another approach is to only cache every other intermediate value, $s_1$, $s_3$, $s_5$, $\\dots$, $s_{n-4}$, $s_{n-2}$ and $s_n$. This means that when the algorithm computes the partial derivatives, if an intermediate value $s_i$ is missing, it will need to recompute it based on the previous intermediate value $s_{i-1}$. This trades off CPU for RAM (if you are interested, check out [this paper](https://pdfs.semanticscholar.org/f61e/9fd5a4878e1493f7a6b03774a61c17b7e9a4.pdf))."
"Since the output of the `gradient()` method is fully symbolic, we are not limited to the first order derivatives, we can also compute second order derivatives, and so on:"
"Note that the result is now exact, not an approximation (up to the limit of the machine's float precision, of course)."
]
},
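{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two passes concrete, here is a minimal hand-written sketch of reverse mode autodiff for $z(x)=\\sin(x^2)$ (just the two-step chain spelled out in code, not the graph-based machinery discussed in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"x = 3.0\n",
"\n",
"# Forward pass: compute and store the intermediate values\n",
"s1 = x**2           # s1 = x^2\n",
"z = np.sin(s1)      # z = sin(s1)\n",
"\n",
"# Reverse pass: propagate the gradient from the output back to the input\n",
"dz_ds1 = np.cos(s1)     # dz/ds1 = cos(s1), about -0.91\n",
"dz_dx = dz_ds1 * 2 * x  # dz/dx = dz/ds1 * ds1/dx, about -5.47\n",
"dz_dx"
]
},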
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Forward mode autodiff using dual numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A nice way to apply forward mode autodiff is to use [dual numbers](https://en.wikipedia.org/wiki/Dual_number). In short, a dual number $z$ has the form $z = a + b\\epsilon$, where $a$ and $b$ are real numbers, and $\\epsilon$ is an infinitesimal number, positive but smaller than all real numbers, and such that $\\epsilon^2=0$.\n",
"It can be shown that $f(x + \\epsilon) = f(x) + \\dfrac{\\partial f}{\\partial x}\\epsilon$, so simply by computing $f(x + \\epsilon)$ we get both the value of $f(x)$ and the partial derivative of $f$ with regards to $x$. "
]
},
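{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, taking $f(x) = x^2 + 2x$ (an arbitrary function, chosen here just for illustration):\n",
"\n",
"$f(3+\\epsilon) = (3+\\epsilon)^2 + 2(3+\\epsilon) = 9 + 6\\epsilon + \\epsilon^2 + 6 + 2\\epsilon = 15 + 8\\epsilon$\n",
"\n",
"so $f(3) = 15$ and $\\dfrac{\\partial f}{\\partial x}(3) = 8$, which matches the analytical derivative $2x + 2$ evaluated at $x = 3$."
]
},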
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dual numbers have their own arithmetic rules, which are generally quite natural. For example:\n",
"Let's create a class to represent dual numbers, and implement a few operations (addition and multiplication). You can try adding some more if you want."
"Again, in this implementation the outputs are just numbers, not symbolic expressions, so we are limited to first order derivatives. However, we could have made the `backpropagate()` methods return symbolic expressions rather than values (e.g., return `Add(2,3)` rather than 5). This would make it possible to compute second order gradients (and beyond). This is what TensorFlow does, as do all the major libraries that implement autodiff."
"WARNING:tensorflow:Calling GradientTape.gradient on a persistent tape inside its context is significantly less efficient than calling it outside the context (it causes the gradient ops to be recorded on the tape, leading to increased CPU and memory usage). Only call GradientTape.gradient inside the context if you actually want to trace the gradient in order to compute higher order derivatives.\n"
"Note that when we compute the derivative of a tensor with regards to a variable that it does not depend on, instead of returning 0.0, the `gradient()` function returns `None`."