"Calculus is the study of continuous change. It has two major subfields: *differential calculus*, which studies the rate of change of functions, and *integral calculus*, which studies the area under the curve. In this notebook, we will discuss the former.\n",
"\n",
"*Differential calculus is at the core of Deep Learning, so it is important to understand what derivatives and gradients are, how they are used in Deep Learning, and understand what their limitations are.*\n",
"\n",
"**Note:** the code in this notebook is only used to create figures and animations. You do not need to understand how it works (although I did my best to make it clear, in case you are interested)."
"As you probably know, the slope of a (non-vertical) straight line can be calculated by taking any two points $\\mathrm{A}$ and $\\mathrm{B}$ on the line, and computing the \"rise over run\":\n",
"Obviously, the slope varies: on the left (i.e., when $x<0$), the slope is negative (i.e., when we move from left to right, the curve goes down), while on the right (i.e., when $x>0$) the slope is positive (i.e., when we move from left to right, the curve goes up). At the point $x=0$, the slope is equal to 0 (i.e., the curve is locally flat). The fact that the slope is 0 when we reach a minimum (or indeed a maximum) is crucially important, and we will come back to it later."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4qCXg9nQSp6S"
},
"source": [
"How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a point $\\mathrm{A}$, we can do this by taking another point $\\mathrm{B}$ on the curve, not too far away, and then computing the slope between these two points:\n"
"As you can see, when point $\\mathrm{B}$ is very close to point $\\mathrm{A}$, the $(\\mathrm{AB})$ line becomes almost indistinguishable from the curve itself (at least locally around point $\\mathrm{A}$). The $(\\mathrm{AB})$ line gets closer and closer to the **tangent** line to the curve at point $\\mathrm{A}$: this is the best linear approximation of the curve at point $\\mathrm{A}$.\n",
"So it makes sense to define the slope of the curve at point $\\mathrm{A}$ as the slope that the $\\mathrm{(AB)}$ line approaches when $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$. This slope is called the **derivative** of the function $f$ at $x=x_\\mathrm{A}$. For example, the derivative of the function $f(x)=x^2$ at $x=x_\\mathrm{A}$ is equal to $2x_\\mathrm{A}$ (we will see how to get this result shortly), so on the graph above, since the point $\\mathrm{A}$ is located at $x_\\mathrm{A}=-1$, the tangent line to the curve at that point has a slope of $-2$."
"No matter how much you zoom in on the origin (the point at $x=0, y=0$), the curve will always look like a V. The slope is -1 for any $x < 0$, and it is +1 for any $x > 0$, but **at $x = 0$, the slope is undefined**, since it is not possible to approximate the curve $y=|x|$ locally around the origin using a straight line, no matter how much you zoom in on that point.\n",
"The function $f(x)=|x|$ is said to be **non-differentiable** at $x=0$: its derivative is undefined at $x=0$. This means that the curve $y=|x|$ has an undefined slope at that point. However, the function $f(x)=|x|$ is **differentiable** at all other points.\n",
"In order for a function $f(x)$ to be differentiable at some point $x_\\mathrm{A}$, the slope of the $(\\mathrm{AB})$ line must approach a single finite value as $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$.\n",
"* First, the function must of course be **defined** at $x_\\mathrm{A}$. As a counterexample, the function $f(x)=\\dfrac{1}{x}$ is undefined at $x_\\mathrm{A}=0$, so it is not differentiable at that point.\n",
"* The function must also be **continuous** at $x_\\mathrm{A}$, meaning that as $x_\\mathrm{B}$ gets infinitely close to $x_\\mathrm{A}$, $f(x_\\mathrm{B})$ must also get infinitely close to $f(x_\\mathrm{A})$. As a counterexample, $f(x)=\\begin{cases}-1 \\text{ if }x < 0\\\\+1 \\text{ if }x \\geq 0\\end{cases}$ is not continuous at $x_\\mathrm{A}=0$, even though it is defined at that point: indeed, when you approach it from the negative side, it does not approach infinitely close to $f(0)=+1$. Therefore, it is not continuous at that point, and thus not differentiable either.\n",
"* The function must not have a **breaking point** at $x_\\mathrm{A}$, meaning that the slope that the $(\\mathrm{AB})$ line approaches as $\\mathrm{B}$ approaches $\\mathrm{A}$ must be the same whether $\\mathrm{B}$ approaches from the left side or from the right side. We already saw a counterexample with $f(x)=|x|$, which is both defined and continuous at $x_\\mathrm{A}=0$, but which has a breaking point at $x_\\mathrm{A}=0$: the slope of the curve $y=|x|$ is -1 on the left, and +1 on the right.\n",
"* The curve $y=f(x)$ must not be **vertical** at point $\\mathrm{A}$. One counterexample is $f(x)=\\sqrt[3]{x}$, the cubic root of $x$: the curve is vertical at the origin, so the function is not differentiable at $x_\\mathrm{A}=0$, as you can see in the following animation:"
"Don't be scared, this is simpler than it looks! You may recognize the _rise over run_ equation $\\dfrac{y_\\mathrm{B} - y_\\mathrm{A}}{x_\\mathrm{B} - x_\\mathrm{A}}$ that we discussed earlier. That's just the slope of the $\\mathrm{(AB)}$ line. And the notation $\\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim$ means that we are making $x_\\mathrm{B}$ approach infinitely close to $x_\\mathrm{A}$. So in plain English, $f'(x_\\mathrm{A})$ is the value that the slope of the $\\mathrm{(AB)}$ line approaches when $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$. This is just a formal way of saying exactly the same thing as earlier."
"Let's look at a concrete example. Let's see if we can determine what the slope of the $y=x^2$ curve is, at any point $\\mathrm{A}$ (try to understand each line, I promise it's not that hard):\n",
"& = \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim(x_\\mathrm{B} + x_\\mathrm{A})\\quad && \\text{since the two } (x_\\mathrm{B} - x_\\mathrm{A}) \\text{ cancel out}\\\\\n",
"& = \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim x_\\mathrm{B} \\, + \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim x_\\mathrm{A}\\quad && \\text{since the limit of a sum is the sum of the limits}\\\\\n",
"That's it! We just proved that the slope of $y = x^2$ at any point $\\mathrm{A}$ is $f'(x_\\mathrm{A}) = 2x_\\mathrm{A}$. What we have done is called **differentiation**: finding the derivative of a function."
"Note that we used a couple of important properties of limits. Here are the main properties you need to know to work with derivatives:\n",
"\n",
"* $\\underset{x \\to k}\\lim c = c \\quad$ if $c$ is some constant value that does not depend on $x$, then the limit is just $c$.\n",
"* $\\underset{x \\to k}\\lim x = k \\quad$ if $x$ approaches some value $k$, then the limit is $k$.\n",
"* $\\underset{x \\to k}\\lim\\,\\left[f(x) + g(x)\\right] = \\underset{x \\to k}\\lim f(x) + \\underset{x \\to k}\\lim g(x) \\quad$ the limit of a sum is the sum of the limits\n",
"* $\\underset{x \\to k}\\lim\\,\\left[f(x) \\times g(x)\\right] = \\underset{x \\to k}\\lim f(x) \\times \\underset{x \\to k}\\lim g(x) \\quad$ the limit of a product is the product of the limits\n"
"**Important note:** in Deep Learning, differentiation is almost always performed automatically by the framework you are using (such as TensorFlow or PyTorch). This is called auto-diff, and I did [another notebook](https://github.com/ageron/handson-ml3/blob/main/extra_autodiff.ipynb) on that topic. However, you should still make sure you have a good understanding of derivatives, or else they will come and bite you one day, for example when you use a square root in your cost function without realizing that its derivative approaches infinity when $x$ approaches 0 (tip: you should use $\\sqrt{x+\\epsilon}$ instead, where $\\epsilon$ is some small constant, such as $10^{-4}$)."
"You will often find a slightly different (but equivalent) definition of the derivative. Let's derive it from the previous definition. First, let's define $\\epsilon = x_\\mathrm{B} - x_\\mathrm{A}$. Next, note that $\\epsilon$ will approach 0 as $x_\\mathrm{B}$ approaches $x_\\mathrm{A}$. Lastly, note that $x_\\mathrm{B} = x_\\mathrm{A} + \\epsilon$. With that, we can reformulate the definition above like so:\n",
"Okay! Now let's use this new definition to find the derivative of $f(x) = x^2$ at any point $x$, and (hopefully) we should find the same result as above (except using $x$ instead of $x_\\mathrm{A}$):\n",
"& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{2x\\epsilon + \\epsilon^2}{\\epsilon}\\quad && \\text{since the two } {x}^2 \\text{ cancel out}\\\\\n",
"& = \\underset{\\epsilon \\to 0}\\lim \\, (2x + \\epsilon)\\quad && \\text{since } 2x\\epsilon \\text{ and } \\epsilon^2 \\text{ can both be divided by } \\epsilon\\\\\n",
"This notation is also handy when a function is not named. For example $\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^2]$ refers to the derivative of the function $x \\mapsto x^2$.\n",
"Moreover, when people talk about the function $f(x)$, they sometimes leave out \"$(x)$\", and they just talk about the function $f$. When this is the case, the notation of the derivative is also simpler:\n",
"Let's use the equation $f'(x) = 2x$ to plot the tangent to the $y=x^2$ curve at various values of $x$ (you can click on the play button under the graphs to play the animation):"
"**Note:** consider the tangent line to the curve $y=f(x)$ at some point $\\mathrm{A}$. What is its equation? Well, since the tangent is a straight line, its equation must look like:\n",
"where $\\alpha$ is the slope of the line, and $\\beta$ is the offset (i.e., the $y$ coordinate of the point at which the line crosses the vertical axis). We already know that the slope of the tangent line at point $\\mathrm{A}$ is the derivative of $f(x)$ at that point, so:\n",
"But what about the offset $\\beta$? Well we also know that the tangent line touches the curve at point $\\mathrm{A}$, so we know that $\\alpha x_\\mathrm{A} + \\beta = f(x_\\mathrm{A})$. So:\n",
"One very important rule is that **the derivative of a sum is the sum of the derivatives**. More precisely, if we define $f(x) = g(x) + h(x)$, then $f'(x) = g'(x) + h'(x)$. This is quite easy to prove:\n",
"& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon} + \\underset{\\epsilon \\to 0}\\lim\\dfrac{h(x+\\epsilon) - h(x)}{\\epsilon} && \\quad \\text{since the limit of a sum is the sum of the limits}\\\\\n",
"& = g'(x) + h'(x) && \\quad \\text{using the definitions of }g'(x) \\text{ and } h'(x)\n",
"Let's try differentiating a simple function using the above rules: we will find the derivative of $f(x)=x^3+\\cos(x)$. Using the rule for the derivative of sums, we find that $f'(x)=\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^3] + \\dfrac{\\mathrm{d}}{\\mathrm{d}x}[\\cos(x)]$. Using the rule for the derivative of powers and for the $\\cos$ function, we find that $f'(x) = 3x^2 - \\sin(x)$."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "n6HwqWcADMVk"
},
"source": [
"---\n",
"\n",
"Let's try a harder example: let's find the derivative of $f(x) = \\sin(2 x^2) + 1$. First, let's define $u(x)=\\sin(x) + 1$ and $v(x) = 2x^2$. Using the rule for sums, we find that $u'(x)=\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[sin(x)] + \\dfrac{\\mathrm{d}}{\\mathrm{d}x}[1]$. Since the derivative of the $\\sin$ function is $\\cos$, and the derivative of constants is 0, we find that $u'(x)=\\cos(x)$. Next, using the product rule, we find that $v'(x)=2\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^2] + \\dfrac{\\mathrm{d}}{\\mathrm{d}x}[2]\\,x^2$. Since the derivative of a constant is 0, the second term cancels out. And since the power rule tells us that the derivative of $x^2$ is $2x$, we find that $v'(x)=4x$. Lastly, using the chain rule, since $f(x)=u(v(x))$, we find that $f'(x)=u'(v(x))\\,v'(x)=\\cos(2x^2)\\,4x$.\n",
"The chain rule is easier to remember using Leibniz's notation:\n",
"\n",
"If $f(x)=g(h(x))$ and $y=h(x)$, then: $\\dfrac{\\mathrm{d}f}{\\mathrm{d}x} = \\dfrac{\\mathrm{d}f}{\\mathrm{d}y} \\dfrac{\\mathrm{d}y}{\\mathrm{d}x}$\n",
"\n",
"Indeed, $\\dfrac{\\mathrm{d}f}{\\mathrm{d}y} = f'(y) = f'(h(x))$ and $\\dfrac{\\mathrm{d}y}{\\mathrm{d}x}=h'(x)$.\n",
"\n",
"It is possible to chain many functions. For example, if $f(x)=g(h(i(x)))$, and we define $y=i(x)$ and $z=h(y)$, then $\\dfrac{\\mathrm{d}f}{\\mathrm{d}x} = \\dfrac{\\mathrm{d}f}{\\mathrm{d}z} \\dfrac{\\mathrm{d}z}{\\mathrm{d}y} \\dfrac{\\mathrm{d}y}{\\mathrm{d}x}$. Using Lagrange's notation, we get $f'(x)=g'(z)\\,h'(y)\\,i'(x)=g'(h(i(x)))\\,h'(i(x))\\,i'(x)$\n",
"\n",
"The chain rule is crucial in Deep Learning, as a neural network is basically as a long composition of functions. For example, a 3-layer dense neural network corresponds to the following function: $f(\\mathbf{x})=\\operatorname{Dense}_3(\\operatorname{Dense}_2(\\operatorname{Dense}_1(\\mathbf{x})))$ (in this example, $\\operatorname{Dense}_3$ is the output layer).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JvAsOt0yAypb"
},
"source": [
"# Derivatives and optimization\n",
"\n",
"When trying to optimize a function $f(x)$, we look for the values of $x$ that minimize (or maximize) the function.\n",
"\n",
"It is important to note that when a function reaches a minimum or maximum, assuming it is differentiable at that point, the derivative will necessarily be equal to 0. For example, you can check the above animation, and notice that whenever the function $f$ (in the upper graph) reaches a maximum or minimum, then the derivative $f'$ (in the lower graph) is equal to 0.\n",
"\n",
"So one way to optimize a function is to differentiate it and analytically find all the values for which the derivative is 0, then determine which of these values optimize the function (if any). For example, consider the function $f(x)=\\dfrac{1}{4}x^4 - x^2 + \\dfrac{1}{2}$. Using the derivative rules (specifically, the sum rule, the product rule, the power rule and the constant rule), we find that $f'(x)=x^3 - 2x$. We look for the values of $x$ for which $f'(x)=0$, so $x^3-2x=0$, and therefore $x(x^2-2)=0$. So $x=0$, or $x=\\sqrt2$ or $x=-\\sqrt2$. As you can see on the following graph of $f(x)$, these 3 values correspond to local extrema. Two global minima $f\\left(\\sqrt2\\right)=f\\left(-\\sqrt2\\right)=-\\dfrac{1}{2}$ and one local maximum $f(0)=\\dfrac{1}{2}$.\n"
"If a function has a local extremum at a point $x_\\mathrm{A}$ and is differentiable at that point, then $f'(x_\\mathrm{A})=0$. However, the reverse is not always true. For example, consider $f(x)=x^3$. Its derivative is $f'(x)=x^2$, which is equal to 0 at $x_\\mathrm{A}=0$. Yet, this point is _not_ an extremum, as you can see on the following diagram. It's just a single point where the slope is 0."
"So in short, you can optimize a function by analytically working out the points at which the derivative is 0, and then investigating only these points. It's a beautifully elegant solution, but it requires a lot of work, and it's not always easy, or even possible. For neural networks, it's practically impossible."
"Another option to optimize a function is to perform **Gradient Descent** (we will consider minimizing the function, but the process would be almost identical if we tried to maximize a function instead): start at a random point $x_0$, then use the function's derivative to determine the slope at that point, and move a little bit in the downwards direction, then repeat the process until you reach a local minimum, and cross your fingers in the hope that this happens to be the global minimum.\n",
"\n",
"At each iteration, the step size is proportional to the slope, so the process naturally slows down as it approaches a local minimum. Each step is also proportional to the learning rate: a parameter of the Gradient Descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a **hyperparameter**).\n",
"\n",
"Here is an animation of this process on the function $f(x)=\\dfrac{1}{4}x^4 - x^2 + \\dfrac{1}{2}$:"
"In this example, we started with $x_0 = \\dfrac{1}{4}$, so Gradient Descent \"rolled down\" towards the minimum value at $x = \\sqrt2$. But if we had started at $x_0 = -\\dfrac{1}{4}$, it would have gone towards $-\\sqrt2$. This illustrates the fact that the initial value is important: depending on $x_0$, the algorithm may converge to a global minimum (hurray!) or to a poor local minimum (boo!) or stay stuck on a plateau, such as a horizontal inflection point (boo!)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "48eMS_1gJYai"
},
"source": [
"There are many variants of the Gradient Descent algorithm, discussed in Chapter 11 of the book. These are the ones we care about in Deep Learning. They all rely on the derivative of the cost function with regards to the model parameters (we will discuss functions with multiple parameters later in this notebook)."
"What happens if we try to differentiate the function $f'(x)$? Well, we get the so-called second order derivative, noted $f''(x)$, or $\\dfrac{\\mathrm{d}^2f}{\\mathrm{d}x^2}$. If we repeat the process by differentiating $f''(x)$, we get the third-order derivative $f'''(x)$, or $\\dfrac{\\mathrm{d}^3f}{\\mathrm{d}x^3}$. And we could go on to get higher order derivatives.\n",
"What's the intuition behind second order derivatives? Well, since the (first order) derivative represents the instantaneous rate of change of $f$ at each point, the second order derivative represents the instantaneous rate of change of the rate of change itself, in other words, you can think of it as the **acceleration** of the curve: if $f''(x) < 0$, then the curve is accelerating \"downwards\", if $f''(x) > 0$ then the curve is accelerating \"upwards\", and if $f''(x) = 0$, then the curve is locally a straight line. Note that a curve could be going upwards (i.e., $f'(x)>0$) but also be accelerating downwards (i.e., $f''(x) < 0$): for example, imagine the path of a stone thrown upwards, as it is being slowed down by gravity (which constantly accelerates the stone downwards).\n",
"\n",
"Deep Learning generally only uses first order derivatives, but you will sometimes run into some optimization algorithms or cost functions based on second order derivatives."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TwrWcqj7Ybyk"
},
"source": [
"# Partial derivatives\n",
"\n",
"Up to now, we have only considered functions with a single variable $x$. What happens when there are multiple variables? For example, let's start with a simple function with 2 variables: $f(x,y)=\\sin(xy)$. If we plot this function, using $z=f(x,y)$, we get the following 3D graph. I also plotted some point $\\mathrm{A}$ on the surface, along with two lines I will describe shortly."
"If you were to stand on this surface at point $\\mathrm{A}$ and walk along the $x$ axis towards the right (increasing $x$), your path would go down quite steeply (along the dashed blue line). The slope along this axis would be negative. However, if you were to walk along the $y$ axis, towards the back (increasing $y$), then your path would almost be flat (along the solid red line), at least locally: the slope along that axis, at point $\\mathrm{A}$, would be very slightly positive.\n",
"\n",
"As you can see, a single number is no longer sufficient to describe the slope of the function at a given point. We need one slope for the $x$ axis, and one slope for the $y$ axis. One slope for each variable. To find the slope along the $x$ axis, called the **partial derivative of $f$ with regards to $x$**, and noted $\\dfrac{\\partial f}{\\partial x}$ (with curly $\\partial$), we can differentiate $f(x,y)$ with regards to $x$ while treating all other variables (in this case just $y$) as constants:\n",
"If you use the derivative rules listed earlier (in this example you would just need the product rule and the chain rule), making sure to treat $y$ as a constant, then you will find:\n",
"We now have equations to compute the slope along the $x$ axis and along the $y$ axis. But what about the other directions? If you were standing on the surface at point $\\mathrm{A}$, you could decide to walk in any direction you choose, not just along the $x$ or $y$ axes. What would the slope be then? Shouldn't we compute the slope along every possible direction?\n",
"\n",
"Well, it can be shown that if all the partial derivatives are defined and continuous in a neighborhood around point $\\mathrm{A}$, then the function $f$ is **totally differentiable** at that point, meaning that it can be locally approximated by a plane $P_\\mathrm{A}$ (the tangent plane to the surface at point $\\mathrm{A}$). In this case, having just the partial derivatives along each axis ($x$ and $y$ in our case) is sufficient to perfectly characterize that plane. Its equation is:\n",
"In Deep Learning, we will generally be dealing with well-behaved functions that are totally differentiable at any point where all the partial derivatives are defined, but you should know that some functions are not that nice. For example, consider the function:\n",
"\n",
"$h(x,y)=\\begin{cases}0 \\text { if } x=0 \\text{ or } y=0\\\\1 \\text { otherwise}\\end{cases}$\n",
"\n",
"At the origin (i.e., at $(x,y)=(0,0)$), the partial derivatives of the function $h$ with respect to $x$ and $y$ are both perfectly defined: they are equal to 0. Yet the function can clearly not be approximated by a plane at that point. It is not totally differentiable at that point (but it is totally differentiable at any point off the axes).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "VS0xnTE_Ym4c"
},
"source": [
"# Gradients\n",
"\n",
"So far we have considered only functions with a single variable $x$, or with 2 variables, $x$ and $y$, but the previous paragraph also applies to functions with more variables. So let's consider a function $f$ with $n$ variables: $f(x_1, x_2, \\dots, x_n)$. For convenience, we will define a vector $\\mathbf{x}$ whose components are these variables:\n",
"\n",
"$\\mathbf{x}=\\begin{pmatrix}\n",
"x_1\\\\\n",
"x_2\\\\\n",
"\\vdots\\\\\n",
"x_n\n",
"\\end{pmatrix}$ \n",
"\n",
"Now $f(\\mathbf{x})$ is easier to write than $f(x_1, x_2, \\dots, x_n)$.\n",
"\n",
"The gradient of the function $f(\\mathbf{x})$ at some point $\\mathbf{x}_\\mathrm{A}$ is the vector whose components are all the partial derivatives of the function at that point. It is noted $\\nabla f(\\mathbf{x}_\\mathrm{A})$, or sometimes $\\nabla_{\\mathbf{x}_\\mathrm{A}}f$:\n",
"Assuming the function is totally differentiable at the point $\\mathbf{x}_\\mathbf{A}$, then the surface it describes can be approximated by a plane at that point (as discussed in the previous section), and the gradient vector is the one that points towards the steepest slope on that plane."
"In Deep Learning, the Gradient Descent algorithm we discussed earlier is based on gradients instead of derivatives (hence its name). It works in much the same way, but using vectors instead of scalars: simply start with a random vector $\\mathbf{x}_0$, then compute the gradient of $f$ at that point, and perform a small step in the opposite direction, then repeat until convergence. More precisely, at each step $t$, compute $\\mathbf{x}_t = \\mathbf{x}_{t-1} - \\eta \\nabla f(\\mathbf{x}_{t-1})$. The constant $\\eta$ is the learning rate, typically a small value such as $10^{-3}$. In practice, we generally use more efficient variants of this algorithm, but the general idea remains the same.\n",
"\n",
"In Deep Learning, the letter $\\mathbf{x}$ is generally used to represent the input data. When you _use_ a neural network to make predictions, you feed the neural network the inputs $\\mathbf{x}$, and you get back a prediction $\\hat{y} = f(\\mathbf{x})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by writing $\\hat{y} = f_\\mathbf{w}(\\mathbf{x})$, where $\\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but treats them as constants.\n",
"However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{y}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
"Until now we have only considered functions that output a scalar, but it is possible to output vectors instead. For example, a classification neural network typically outputs one probability for each class, so if there are $m$ classes, the neural network will output an $d$-dimensional vector for each input.\n",
"\n",
"In Deep Learning we generally only need to differentiate the loss function, which almost always outputs a single scalar number. But suppose for a second that you want to differentiate a function $\\mathbf{f}(\\mathbf{x})$ which outputs $d$-dimensional vectors. The good news is that you can treat each _output_ dimension independently of the others. This will give you a partial derivative for each input dimension and each output dimension. If you put them all in a single matrix, with one column per input dimension and one row per output dimension, you get the so-called **Jacobian matrix**.\n",
"The partial derivatives themselves are often called the **Jacobians**. It's just the first order partial derivatives of the function $\\mathbf{f}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fAx8-JfDgtVY"
},
"source": [
"# Hessians\n",
"\n",
"Let's come back to a function $f(\\mathbf{x})$ which takes an $n$-dimensional vector as input and outputs a scalar. If you determine the equation of the partial derivative of $f$ with regards to $x_i$ (the $i^\\text{th}$ component of $\\mathbf{x}$), you will get a new function of $\\mathbf{x}$: $\\dfrac{\\partial f}{\\partial x_i}$. You can then compute the partial derivative of this function with regards to $x_j$ (the $j^\\text{th}$ component of $\\mathbf{x}$). The result is a partial derivative of a partial derivative: in other words, it is a **second order partial derivatives**, also called a **Hessian**. It is noted $\\mathbf{x}$: $\\dfrac{\\partial^2 f}{\\partial x_jx_i}$. If $i\\neq j$ then it is called a **mixed second order partial derivative**.\n",
"Let's look at an example: $f(x, y)=\\sin(xy)$. As we showed earlier, the first order partial derivatives of $f$ are: $\\dfrac{\\partial f}{\\partial x}=y\\cos(xy)$ and $\\dfrac{\\partial f}{\\partial y}=x\\cos(xy)$. So we can now compute all the Hessians (using the derivative rules we discussed earlier):\n",
"Note that $\\dfrac{\\partial^2 f}{\\partial x\\,\\partial y} = \\dfrac{\\partial^2 f}{\\partial y\\,\\partial x}$. This is the case whenever all the partial derivatives are defined and continuous in a neighborhood around the point at which we differentiate.\n",
"\n",
"The matrix containing all the Hessians is called the **Hessian matrix**:\n",
"There are great optimization algorithms which take advantage of the Hessians, but in practice Deep Learning almost never uses them. Indeed, if a function has $n$ variables, there are $n^2$ Hessians: since neural networks typically have several millions of parameters, the number of Hessians would exceed thousands of billions. Even if we had the necessary amount of RAM, the computations would be prohibitively slow."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IOpOUgSyWhou"
},
"source": [
"## A few proofs\n",
"\n",
"Let's finish by proving all the derivative rules we listed earlier. You don't have to go through all these proofs to be a good Deep Learning practitioner, but it may help you get a deeper understanding of derivatives."
"& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{g(x+\\epsilon)h(x+\\epsilon) - g(x)h(x+\\epsilon)}{\\epsilon} + \\underset{\\epsilon \\to 0}\\lim\\dfrac{g(x)h(x + \\epsilon) - g(x)h(x)}{\\epsilon} && \\quad \\text{since the limit of a sum is the sum of the limits}\\\\\n",
"& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, g(x)\\underset{\\epsilon \\to 0}\\lim{\\dfrac{h(x + \\epsilon) - h(x)}{\\epsilon}} && \\quad \\text{taking } g(x) \\text{ out of the limit since it does not depend on }\\epsilon\\\\\n",
"& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, g(x)h'(x) && \\quad \\text{using the definition of h'(x)}\\\\\n",
"& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}\\right]}\\underset{\\epsilon \\to 0}\\lim{h(x+\\epsilon)} + g(x)h'(x) && \\quad \\text{since the limit of a product is the product of the limits}\\\\\n",
"& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{h(x+\\epsilon)-h(x)}{\\epsilon}\\right]} \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(h(x+\\epsilon)) - g(h(x))}{h(x+\\epsilon)-h(x)}\\right]} && \\quad \\text{the limit of a product is the product of the limits}\\\\\n",
"& = h'(x) \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(h(x+\\epsilon)) - g(h(x))}{h(x+\\epsilon)-h(x)}\\right]} && \\quad \\text{using the definition of }h'(x)\\\\\n",
"There are several equivalent definitions of the number $e$. One of them states that $e$ is the unique positive number for which $\\underset{\\epsilon \\to 0}\\lim{\\dfrac{e^\\epsilon - 1}{\\epsilon}}=1$. We will use this in this proof:\n",
"& = \\underset{\\epsilon \\to 0}\\lim{e^x} \\, \\underset{\\epsilon \\to 0}\\lim{\\dfrac{e^\\epsilon - 1}{\\epsilon}} && \\quad \\text{the limit of a product is the product of the limits}\\\\\n",
"& = \\dfrac{1}{x}\\underset{u \\to 0}\\lim{\\left[\\ln\\left((1 + u)^{1/u}\\right)\\right]} && \\quad \\text{taking }\\dfrac{1}{x} \\text{ out since it does not depend on }\\epsilon\\\\\n",
"& = \\dfrac{1}{x}\\ln\\left(\\underset{u \\to 0}\\lim{(1 + u)^{1/u}}\\right) && \\quad \\text{taking }\\ln\\text{ out since it is a continuous function}\\\\\n",
"Let's define $g(x)=e^x$ and $h(x)=\\ln(x^r)$. Since $a = e^{\\ln(a)}$, we can rewrite $f$ as $f(x)=g(h(x))$, which allows us to use the chain rule:\n",
"\n",
"$f'(x) = h'(x)g'(h(x))$\n",
"\n",
"We know the derivative of the exponential: $g'(x)=e^x$. We also know the derivative of the natural logarithm: $\\ln'(x)=\\dfrac{1}{x}$ so $h'(x)=\\dfrac{r}{x}$. Therefore:\n",
"For this proof we will first need to prove that $\\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(\\theta)}{\\theta}=1$. One way to do that is to consider the following diagram:\n",
"Assuming $0 < \\theta < \\dfrac{\\pi}{2}$, the area of the blue triangle (area $\\mathrm{A}$) is equal to its height ($\\sin(\\theta)$), times its base ($\\cos(\\theta)$), divided by 2. So $\\mathrm{A} = \\dfrac{1}{2}\\sin(\\theta)\\cos(\\theta)$.\n",
"\n",
"The unit circle has an area of $\\pi$, so the circular sector (in the shape of a pizza slice) has an area of A + B = $\\pi\\dfrac{\\theta}{2\\pi} = \\dfrac{\\theta}{2}$.\n",
"\n",
"Next, the large triangle (A + B + C) has an area equal to its height ($\\tan(\\theta)$) multiplied by its base (1) divided by 2, so A + B + C = $\\dfrac{\\tan(\\theta)}{2}$.\n",
"\n",
"When $0 < \\theta < \\dfrac{\\pi}{2}$, we have $\\mathrm{A} < \\mathrm{A} + \\mathrm{B} < \\mathrm{A} + \\mathrm{B} + \\mathrm{C}$, therefore:\n",
"We can multiply all the terms by 2 to get rid of the $\\dfrac{1}{2}$ factors. We can also divide by $\\sin(\\theta)$, which is stricly positive (assuming $0 < \\theta < \\dfrac{\\pi}{2}$), so the inequalities still hold:\n",
"Since all these terms are strictly positive when $0 < \\theta < \\dfrac{\\pi}{2}$, we can take their inverse and change the direction of the inequalities:\n",
"Now since $\\sin(-\\theta)=-\\sin(\\theta)$, we see that $\\dfrac{\\sin(-\\theta)}{-\\theta}=\\dfrac{\\sin(\\theta)}{\\theta}$. Moreover, $\\cos(-\\theta)=\\cos(\\theta)$, and therefore $\\dfrac{1}{\\cos(-\\theta)}=\\dfrac{1}{\\cos(\\theta)}$. Replacing the terms in the inequalities (1), we get:\n",
"assuming $-\\dfrac{\\theta}{2} < \\theta < \\dfrac{\\pi}{2}$ and $\\theta \\neq 0$\n",
"<hr />\n",
"\n",
"Since $\\cos$ is a continuous function, $\\underset{\\theta \\to 0}\\lim\\cos(\\theta)=\\cos(0)=1$. Similarly, $\\underset{\\theta \\to 0}\\lim\\dfrac{1}{cos(\\theta)}=\\dfrac{1}{\\cos(0)}=1$.\n",
"\n",
"Since the inequalities (2) tell us that $\\dfrac{\\sin(\\theta)}{\\theta}$ is squeezed between $\\dfrac{1}{cos(\\theta)}$ and $\\cos(\\theta)$ when $\\theta$ is close to 0, and since both of these approach 1 when $\\theta$ approaches 0, we can use the **squeeze theorem** (also called the **sandwich theorem**) to conclude that $\\dfrac{\\sin(\\theta)}{\\theta}$ must also approach 1 when $\\theta$ approaches 0.\n",
"Now the second thing we need to prove before we can tackle the derivative of the $\\sin$ function is the fact that $\\underset{\\theta \\to 0}\\lim\\dfrac{\\cos(\\theta) - 1}{\\theta}=0$. Here we go:\n",
"& = \\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(\\theta)}{\\theta}\\dfrac{\\sin(\\theta)}{\\cos(\\theta) + 1} && \\quad \\text{ just rearranging the terms}\\\\\n",
"& = \\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(\\theta)}{\\theta} \\, \\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(\\theta)}{\\cos(\\theta) + 1} && \\quad \\text{ since the limit of a product is the product of the limits}\\\\\n",
"& = \\underset{\\theta \\to 0}\\lim\\dfrac{\\cos(x)\\sin(\\theta)}{\\theta} + \\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(x)\\cos(\\theta) - \\sin(x)}{\\theta} && \\quad \\text{since the limit of a sum is the sum of the limits}\\\\\n",
"& = \\cos(x)\\underset{\\theta \\to 0}\\lim\\dfrac{\\sin(\\theta)}{\\theta} + \\sin(x)\\underset{\\theta \\to 0}\\lim\\dfrac{\\cos(\\theta) - 1}{\\theta} && \\quad \\text{bringing out } \\cos(x) \\text{ and } \\sin(x) \\text{ since they don't depend on }\\theta\\\\\n",