diff --git a/math_differential_calculus.ipynb b/math_differential_calculus.ipynb index bf92085..697f50d 100644 --- a/math_differential_calculus.ipynb +++ b/math_differential_calculus.ipynb @@ -177,7 +177,7 @@ "plt.plot([0, 0], [0, 3], \"k--\")\n", "plt.arrow(-1.4, 2.5, 0.5, -1.3, head_width=0.1)\n", "plt.arrow(0.85, 1.05, 0.5, 1.3, head_width=0.1)\n", - "show([-2.1, 2.1, 0, 2.8], title=\"Slope of $f(x) = x^2$\")" + "show([-2.1, 2.1, 0, 2.8], title=\"Slope of the curve $y = x^2$\")" ] }, { @@ -251,7 +251,7 @@ " B_text = ax.text(x_B + text_offset_B, y_B, \"B\", fontsize=14)\n", "\n", " # plot the grid and axis labels\n", - " title = \"Slope of $f(x) = {}$ at the point $x = {}$\".format(f_str, x_A)\n", + " title = r\"Slope of the curve $y = {}$ at $x_\\mathrm{{A}} = {}$\".format(f_str, x_A)\n", " show(axis or [-2.1, 2.1, 0, 2.8], title=title)\n", "\n", " def update_graph(i):\n", @@ -283,9 +283,9 @@ "id": "GAV2or0qutJX" }, "source": [ - "As you can see, when point $\\mathrm{B}$ is very close to point $\\mathrm{A}$, the $(\\mathrm{AB})$ line becomes almost indistinguishable from the curve itself (at least locally around point $\\mathrm{A}$). The $(\\mathrm{AB})$ line gets closer and closer to the **tangent** line to the curve at point $\\mathrm{A}$: this is the best linear approximation of the function at point $\\mathrm{A}$.\n", + "As you can see, when point $\\mathrm{B}$ is very close to point $\\mathrm{A}$, the $(\\mathrm{AB})$ line becomes almost indistinguishable from the curve itself (at least locally around point $\\mathrm{A}$). The $(\\mathrm{AB})$ line gets closer and closer to the **tangent** line to the curve at point $\\mathrm{A}$: this is the best linear approximation of the curve at point $\\mathrm{A}$.\n", "\n", - "So it makes sense to define the slope of the curve at point $\\mathrm{A}$ as the slope that the $\\mathrm{(AB)}$ line approaches when $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$. 
This is called the **derivative** of the function at that point. For example, the derivative of $x^2$ at the point $x_\\mathrm{A}$ is $2x_\\mathrm{A}$ (we will see how to get this result shortly). So the tangent line to the $x^2$ curve at point $x_A=-1$ has a slope of $-2$." + "So it makes sense to define the slope of the curve at point $\\mathrm{A}$ as the slope that the $\\mathrm{(AB)}$ line approaches when $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$. This slope is called the **derivative** of the function $f$ at $x=x_\\mathrm{A}$. For example, the derivative of the function $f(x)=x^2$ at $x=x_\\mathrm{A}$ is equal to $2x_\\mathrm{A}$ (we will see how to get this result shortly), so on the graph above, since the point $\\mathrm{A}$ is located at $x_\\mathrm{A}=-1$, the tangent line to the curve at that point has a slope of $-2$." ] }, { @@ -326,18 +326,18 @@ "id": "l9dTigohF0BM" }, "source": [ - "No matter how much you zoom in on the point $x=0$, you will always see a curve that looks like a V. The slope is -1 for any $x < 0$, and it is +1 for any $x > 0$, but at the point $x = 0$ itself, the slope is undefined, since it is not possible to approximate $|x|$ locally around point $x=0$ using a straight line, no matter how much you zoom in on that point.\n", + "No matter how much you zoom in on the origin (the point at $x=0, y=0$), the curve will always look like a V. The slope is -1 for any $x < 0$, and it is +1 for any $x > 0$, but **at $x = 0$, the slope is undefined**, since it is not possible to approximate the curve $y=|x|$ locally around the origin using a straight line, no matter how much you zoom in on that point.\n", "\n", - "The function $f(x)=|x|$ is said to be **non-differentiable** at the point $x=0$, meaning that its slope (i.e., its derivative) is undefined at that point. However, it is **differentiable** at all other points.\n", + "The function $f(x)=|x|$ is said to be **non-differentiable** at $x=0$: its derivative is undefined at $x=0$. 
This means that the curve $y=|x|$ has an undefined slope at that point. However, the function $f(x)=|x|$ is **differentiable** at all other points.\n", "\n", - "In order for a function $f$ to be differentiable at some point $x_\\mathrm{A}$, the slope of the $(\\mathrm{AB})$ line must approach a single finite value as $x_\\mathrm{B}$ gets infinitely close to $x_\\mathrm{A}$.\n", + "In order for a function $f(x)$ to be differentiable at some point $x_\\mathrm{A}$, the slope of the $(\\mathrm{AB})$ line must approach a single finite value as $\\mathrm{B}$ gets infinitely close to $\\mathrm{A}$.\n", "\n", "This implies several constraints:\n", "\n", - "* First, the function must of course be **defined** at point $x_\\mathrm{A}$. As a counterexample, the function $f(x)=\\dfrac{1}{x}$ is undefined at point $x_\\mathrm{A}=0$, so it is not differentiable at that point.\n", - "* The function must also be **continuous** at point $x_\\mathrm{A}$, meaning that as $x_\\mathrm{B}$ gets infinitely close to $x_\\mathrm{A}$, $f(x_\\mathrm{B})$ must also get infinitely close to $f(x_\\mathrm{A})$. As a counterexample, $f(x)=\\begin{cases}-1 \\text{ if }x < 0\\\\+1 \\text{ if }x \\geq 0\\end{cases}$ is not continuous at point $x_\\mathrm{A}=0$, even though it is defined at that point: indeed, when you approach it from the negative side, it does not approach infinitely close to $f(0)=+1$. Therefore, it is not continuous at that point, and thus not differentiable either.\n", - "* The curve must not have a **breaking point** at point $x_\\mathrm{A}$, meaning that the slope that the $\\mathrm{AB}$ line approaches as $x_\\mathrm{B}$ approaches $x_\\mathrm{A}$ must be the same whether $x_\\mathrm{B}$ approaches from the left side or from the right side. 
We already saw a counterexample with $f(x)=|x|$, which is both defined and continuous at $x_\\mathrm{A}=0$, but which has a breaking point at $x_\\mathrm{A}=0$: the slope on the left is -1 while the slope on the right is +1.\n", - "* The curve must not be **vertical** at point $x_\\mathrm{A}$. One counterexample is $f(x)=\\sqrt[3]{x}$, the cubic root of $x$: as $x_\\mathrm{B}$ approaches $x_\\mathrm{A}=0$, the slope of the $(\\mathrm{AB})$ line becomes infinite (i.e., the line becomes vertical), so the function is not differentiable at that point, as you can see in the following animation:" + "* First, the function must of course be **defined** at $x_\\mathrm{A}$. As a counterexample, the function $f(x)=\\dfrac{1}{x}$ is undefined at $x_\\mathrm{A}=0$, so it is not differentiable at that point.\n", + "* The function must also be **continuous** at $x_\\mathrm{A}$, meaning that as $x_\\mathrm{B}$ gets infinitely close to $x_\\mathrm{A}$, $f(x_\\mathrm{B})$ must also get infinitely close to $f(x_\\mathrm{A})$. As a counterexample, $f(x)=\\begin{cases}-1 \\text{ if }x < 0\\\\+1 \\text{ if }x \\geq 0\\end{cases}$ is not continuous at $x_\\mathrm{A}=0$, even though it is defined at that point: indeed, when you approach it from the negative side, it does not approach infinitely close to $f(0)=+1$. Therefore, it is not continuous at that point, and thus not differentiable either.\n", + "* The function must not have a **breaking point** at $x_\\mathrm{A}$, meaning that the slope that the $(\\mathrm{AB})$ line approaches as $\\mathrm{B}$ approaches $\\mathrm{A}$ must be the same whether $\\mathrm{B}$ approaches from the left side or from the right side. We already saw a counterexample with $f(x)=|x|$, which is both defined and continuous at $x_\\mathrm{A}=0$, but which has a breaking point at $x_\\mathrm{A}=0$: the slope of the curve $y=|x|$ is -1 on the left, and +1 on the right.\n", + "* The curve $y=f(x)$ must not be **vertical** at point $\\mathrm{A}$. 
One counterexample is $f(x)=\\sqrt[3]{x}$, the cube root of $x$: the curve is vertical at the origin, so the function is not differentiable at $x_\\mathrm{A}=0$, as you can see in the following animation:" ] }, { @@ -399,7 +399,7 @@ "source": [ "
\n", "\n", - "The **derivative** of a function $f(x)$ at the point $x = x_\\mathrm{A}$ is noted $f'(x_\\mathrm{A})$, and it is defined as:\n", + "The **derivative** of a function $f(x)$ at $x = x_\\mathrm{A}$ is noted $f'(x_\\mathrm{A})$, and it is defined as:\n", "\n", "$f'(x_\\mathrm{A}) = \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim\\dfrac{f(x_\\mathrm{B}) - f(x_\\mathrm{A})}{x_\\mathrm{B} - x_\\mathrm{A}}$\n", "\n", @@ -433,7 +433,7 @@ "id": "1Hab-C8p8GPw" }, "source": [ - "Let's look at a concrete example. Let's see if we can determine what the slope of the $x^2$ curve is, at any point $\\mathrm{A}$ (try to understand each line, I promise it's not that hard):\n", + "Let's look at a concrete example. Let's see if we can determine what the slope of the $y=x^2$ curve is, at any point $\\mathrm{A}$ (try to understand each line, I promise it's not that hard):\n", "\n", "$\n", "\\begin{split}\n", @@ -448,7 +448,7 @@ "\\end{split}\n", "$\n", "\n", - "That's it! We just proved that the slope of $f(x) = x^2$ at any point $\\mathrm{A}$ is $f'(x_\\mathrm{A}) = 2x_\\mathrm{A}$. What we have done is called **differentiation**: finding the derivative of a function." + "That's it! We just proved that the slope of $y = x^2$ at any point $\\mathrm{A}$ is $f'(x_\\mathrm{A}) = 2x_\\mathrm{A}$. What we have done is called **differentiation**: finding the derivative of a function." ] }, { @@ -548,7 +548,7 @@ "\n", "$f'(x) = \\dfrac{\\mathrm{d}f(x)}{\\mathrm{d}x} = \\dfrac{\\mathrm{d}}{\\mathrm{d}x}f(x)$\n", "\n", - "This notation is also handy when a function is not named. For example $\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^2]$ refers to the derivative of the function $x^2$.\n", + "This notation is also handy when a function is not named. 
For example $\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^2]$ refers to the derivative of the function $x \\mapsto x^2$.\n", "\n", "Moreover, when people talk about the function $f(x)$, they sometimes leave out \"$(x)$\", and they just talk about the function $f$. When this is the case, the notation of the derivative is also simpler:\n", "\n", @@ -576,7 +576,7 @@ "id": "hLxiC5r4Xk3N" }, "source": [ - "Let's use the equation $f'(x) = 2x$ to plot the tangent to the $x^2$ curve at various values of $x$ (you can click on the play button under the graphs to play the animation):" + "Let's use the equation $f'(x) = 2x$ to plot the tangent to the $y=x^2$ curve at various values of $x$ (you can click on the play button under the graphs to play the animation):" ] }, { @@ -619,9 +619,9 @@ " point_A2, = ax2.plot(0, 0, \"bo\")\n", "\n", " show([-2.1, 2.1, 0, 2.8], ax=ax1, ylabel=\"$f(x)$\",\n", - " title=r\"$f(x)=\" + f_str + \"$ and the tangent at $x=x_\\mathrm{A}$\")\n", + " title=r\"$y=f(x)=\" + f_str + \"$ and the tangent at $x=x_\\mathrm{A}$\")\n", " show([-2.1, 2.1, -4.2, 4.2], ax=ax2, ylabel=\"$f'(x)$\",\n", - " title=r\"Slope of the tangent at $x=x_\\mathrm{A}$\")\n", + " title=r\"$y=f'(x)$ and the slope of the tangent at $x=x_\\mathrm{A}$\")\n", "\n", " def update_graph(i):\n", " x = 1.5 * np.sin(2 * np.pi * i / n_frames)\n", @@ -648,7 +648,7 @@ "def fp(x):\n", " return 2*x\n", "\n", - "animate_tangent(lambda x: x**2, lambda x: 2*x, \"x^2\")" + "animate_tangent(f, fp, \"x^2\")" ] }, { @@ -660,25 +660,25 @@ "source": [ "
\n", "\n", - "**Note:** consider the tangent line to the function $f(x)$ at some point $\\mathrm{A}$. What is its equation? Well, since the tangent is a straight line, its equation must look like:\n", + "**Note:** consider the tangent line to the curve $y=f(x)$ at some point $\\mathrm{A}$. What is its equation? Well, since the tangent is a straight line, its equation must look like:\n", "\n", - "$t(x) = \\alpha x + \\beta$\n", + "$y = \\alpha x + \\beta$\n", "\n", "where $\\alpha$ is the slope of the line, and $\\beta$ is the offset (i.e., the $y$ coordinate of the point at which the line crosses the vertical axis). We already know that the slope of the tangent line at point $\\mathrm{A}$ is the derivative of $f(x)$ at that point, so:\n", "\n", "$\\alpha = f'(x_\\mathrm{A})$\n", "\n", - "But what about the offset $\\beta$? Well we also know that the tangent line touches the curve at point $\\mathrm{A}$, so we know that $t(x_\\mathrm{A})=f(x_\\mathrm{A})$. Therefore, $\\alpha x_\\mathrm{A} + \\beta = f(x_\\mathrm{A})$, and finally:\n", + "But what about the offset $\\beta$? Well we also know that the tangent line touches the curve at point $\\mathrm{A}$, so we know that $\\alpha x_\\mathrm{A} + \\beta = f(x_\\mathrm{A})$. 
So:\n", "\n", "$\\beta = f(x_\\mathrm{A}) - f'(x_\\mathrm{A})x_\\mathrm{A}$\n", "\n", - "So we get the following equation for the tangent of $f$ at point $x_\\mathrm{A}$:\n", + "So we get the following equation for the tangent:\n", "\n", - "$t_{\\mathrm{A}}(x) = f(x_\\mathrm{A}) + f'(x_\\mathrm{A})(x - x_\\mathrm{A})$\n", + "$y = f(x_\\mathrm{A}) + f'(x_\\mathrm{A})(x - x_\\mathrm{A})$\n", "\n", - "If we apply this to the function $f(x)=x^2$, we get the following equation:\n", + "For example, the tangent to the $y=x^2$ curve is given by:\n", "\n", - "$t_{\\mathrm{A}}(x) = {x_\\mathrm{A}}^2 + 2x_\\mathrm{A}(x - x_\\mathrm{A}) = 2x_\\mathrm{A}x - x_\\mathrm{A}^2$\n", + "$y = {x_\\mathrm{A}}^2 + 2x_\\mathrm{A}(x - x_\\mathrm{A}) = 2x_\\mathrm{A}x - x_\\mathrm{A}^2$\n", "
" ] }, @@ -759,7 +759,7 @@ "\n", "Let's try a harder example: let's find the derivative of $f(x) = \\sin(2 x^2) + 1$. First, let's define $u(x)=\\sin(x) + 1$ and $v(x) = 2x^2$. Using the rule for sums, we find that $u'(x)=\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[sin(x)] + \\dfrac{\\mathrm{d}}{\\mathrm{d}x}[1]$. Since the derivative of the $\\sin$ function is $\\cos$, and the derivative of constants is 0, we find that $u'(x)=\\cos(x)$. Next, using the product rule, we find that $v'(x)=2\\dfrac{\\mathrm{d}}{\\mathrm{d}x}[x^2] + \\dfrac{\\mathrm{d}}{\\mathrm{d}x}[2]\\,x^2$. Since the derivative of a constant is 0, the second term cancels out. And since the power rule tells us that the derivative of $x^2$ is $2x$, we find that $v'(x)=4x$. Lastly, using the chain rule, since $f(x)=u(v(x))$, we find that $f'(x)=u'(v(x))\\,v'(x)=\\cos(2x^2)\\,4x$.\n", "\n", - "Let's plot $f$ followed by $f'$, and let's use $f'(x_\\mathbf{A})$ to find the slope of the tangent at $x=x_\\mathbf{A}$:\n" + "Let's plot $f$ followed by $f'$, and let's use $f'(x_\\mathbf{A})$ to find the slope of the tangent at some point $\\mathbf{A}$:\n" ] }, { @@ -842,7 +842,7 @@ " fontsize=14, horizontalalignment=\"center\")\n", "plt.text(np.sqrt(2), 0.1, r\"$\\sqrt{2}$\",\n", " fontsize=14, horizontalalignment=\"center\")\n", - "show(axis=[-2.1, 2.1, -1.4, 1.4], title=r\"$f(x)=\\dfrac{1}{4}x^4 - x^2 + 5$\")" + "show(axis=[-2.1, 2.1, -1.4, 1.4], title=r\"$y=f(x)=\\dfrac{1}{4}x^4 - x^2 + 5$\")" ] }, { @@ -887,7 +887,7 @@ "id": "NyDyBVnUFlUl" }, "source": [ - "So in short, you can optimize a function by analytically working out the points at which the derivative is 0, and then investigating only these points. It's a beautifully elegant solution, but it requires a lot of work, and it's not always easy, or even possible." + "So in short, you can optimize a function by analytically working out the points at which the derivative is 0, and then investigating only these points. 
It's a beautifully elegant solution, but it requires a lot of work, and it's not always easy, or even possible. For neural networks, it's practically impossible." ] }, { @@ -945,9 +945,9 @@ " point_A2, = ax2.plot(0, 0, \"bo\")\n", "\n", " show([-2.1, 2.1, -1.4, 1.4], ax=ax1, ylabel=\"$f(x)$\",\n", - " title=r\"$f(x)=\" + f_str + \"$ and the tangent at $x=x_\\mathrm{A}$\")\n", + " title=r\"$y=f(x)=\" + f_str + \"$ and the tangent at $x=x_\\mathrm{A}$\")\n", " show([-2.1, 2.1, -4.2, 4.2], ax=ax2, ylabel=\"$f'(x)$\",\n", - " title=r\"Slope of the tangent at $x=x_\\mathrm{A}$\")\n", + " title=r\"$y=f'(x)$ and the slope of the tangent at $x=x_\\mathrm{A}$\")\n", "\n", " xs = []\n", " x = x_0\n", @@ -1094,7 +1094,7 @@ "def df_dy(x, y):\n", " return x * np.cos(x * y)\n", "\n", - "ax = plot_3d(f, r\"$f(x) = \\sin(xy)$\")\n", + "ax = plot_3d(f, r\"$z = f(x, y) = \\sin(xy)$\")\n", "plot_tangents(ax, 0.1, -1, f, df_dx, df_dy)\n", "\n", "plt.show()" @@ -1129,7 +1129,7 @@ "\n", "Well, it can be shown that if all the partial derivatives are defined and continuous in a neighborhood around point $\\mathrm{A}$, then the function $f$ is **totally differentiable** at that point, meaning that it can be locally approximated by a plane $P_\\mathrm{A}$ (the tangent plane to the surface at point $\\mathrm{A}$). In this case, having just the partial derivatives along each axis ($x$ and $y$ in our case) is sufficient to perfectly characterize that plane. 
Its equation is:\n", "\n", - "$P_\\mathrm{A}(x,y) = f(x_\\mathrm{A},y_\\mathrm{A}) + (x - x_\\mathrm{A})\\dfrac{\\partial f}{\\partial x}(x_\\mathrm{A},y_\\mathrm{A}) + (y - y_\\mathrm{A})\\dfrac{\\partial f}{\\partial y}(x_\\mathrm{A},y_\\mathrm{A})$\n", + "$z = f(x_\\mathrm{A},y_\\mathrm{A}) + (x - x_\\mathrm{A})\\dfrac{\\partial f}{\\partial x}(x_\\mathrm{A},y_\\mathrm{A}) + (y - y_\\mathrm{A})\\dfrac{\\partial f}{\\partial y}(x_\\mathrm{A},y_\\mathrm{A})$\n", "\n", "In Deep Learning, we will generally be dealing with well-behaved functions that are totally differentiable at any point where all the partial derivatives are defined, but you should know that some functions are not that nice. For example, consider the function:\n", "\n", @@ -1175,7 +1175,7 @@ "id": "u2YNS1ZqsKeg" }, "source": [ - "Assuming the function is totally differentiable at the point $x_\\mathbf{A}$, it can be approximated by a plane at that point (as discussed in the previous section), and the gradient vector is the one that points towards the steepest slope on that plane." + "Assuming the function is totally differentiable at the point $\\mathbf{x}_\\mathbf{A}$, then the surface it describes can be approximated by a plane at that point (as discussed in the previous section), and the gradient vector is the one that points towards the steepest slope on that plane." ] }, { @@ -1185,6 +1185,8 @@ "id": "GF8nLfs08iuR" }, "source": [ + "## Gradient Descent, revisited\n", + "\n", "In Deep Learning, the Gradient Descent algorithm we discussed earlier is based on gradients instead of derivatives (hence its name). It works in much the same way, but using vectors instead of scalars: simply start with a random vector $\\mathbf{x}_0$, then compute the gradient of $f$ at that point, and perform a small step in the opposite direction, then repeat until convergence. More precisely, at each step $t$, compute $\\mathbf{x}_t = \\mathbf{x}_{t-1} - \\eta \\nabla f(\\mathbf{x}_{t-1})$. 
The constant $\\eta$ is the learning rate, typically a small value such as $10^{-3}$. In practice, we generally use more efficient variants of this algorithm, but the general idea remains the same.\n", "\n", "In Deep Learning, the letter $\\mathbf{x}$ is generally used to represent the input data. When you _use_ a neural network to make predictions, you feed the neural network the inputs $\\mathbf{x}$, and you get back a prediction $\\hat{y} = f(\\mathbf{x})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by writing $\\hat{y} = f_\\mathbf{w}(\\mathbf{x})$, where $\\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but treats them as constants.\n",