From 42525c610cecf835ccab452c106d353d63a0912c Mon Sep 17 00:00:00 2001 From: Victor Khaustov <3192677+vi3itor@users.noreply.github.com> Date: Fri, 20 May 2022 13:17:31 +0900 Subject: [PATCH 1/2] Update differential calculus tutorial --- math_differential_calculus.ipynb | 55 +++++++++++++++----------------- 1 file changed, 26 insertions(+), 29 deletions(-) diff --git a/math_differential_calculus.ipynb b/math_differential_calculus.ipynb index d332649..9f2983e 100644 --- a/math_differential_calculus.ipynb +++ b/math_differential_calculus.ipynb @@ -49,7 +49,6 @@ "outputs": [], "source": [ "#@title\n", - "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", @@ -167,7 +166,7 @@ "id": "gcb7eqkmGGXf" }, "source": [ - "But what if you want to know the slope of something else than a straight line? For example, let's consider the curve defined by $y = f(x) = x^2$:" + "But what if you want to know the slope of something other than a straight line? For example, let's consider the curve defined by $y = f(x) = x^2$:" ] }, { @@ -226,7 +225,7 @@ "id": "4qCXg9nQSp6S" }, "source": [ - "How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a point $\\mathrm{A}$, we can do this by taking another point $\\mathrm{B}$ on the curve, not too far away, and then computing the slope between these two points:\n" + "How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a point $\\mathrm{A}$. We can do this by taking another point $\\mathrm{B}$ on the curve, not too far away, and then computing the slope between these two points:\n" ] }, { @@ -963,7 +962,7 @@ "source": [ "# Differentiability\n", "\n", - "Note that some functions are not quite as well-behaved as $x^2$: for example, consider the function $f(x)=|x|$, the absolute value of $x$:" + "Note that some functions are not quite as well-behaved as $x^2$. For example, consider the function $f(x)=|x|$, the absolute value of $x$:" ] }, { @@ -2231,7 +2230,7 @@ "& = \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim x_\\mathrm{B} \\, + \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim x_\\mathrm{A}\\quad && \\text{since the limit of a sum is the sum of the limits}\\\\\n", "& = x_\\mathrm{A} \\, + \\underset{x_\\mathrm{B} \\to x_\\mathrm{A}}\\lim x_\\mathrm{A} \\quad && \\text{since } x_\\mathrm{B}\\text{ approaches } x_\\mathrm{A} \\\\\n", "& = x_\\mathrm{A} + x_\\mathrm{A} \\quad && \\text{since } x_\\mathrm{A} \\text{ remains constant when } x_\\mathrm{B}\\text{ approaches } x_\\mathrm{A} \\\\\n", - "& = 2 x_\\mathrm{A}\n", + "& = 2x_\\mathrm{A} &&\n", "\\end{align*}\n", "$\n", "\n", @@ -2307,7 +2306,7 @@ "& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{{x}^2 + 2x\\epsilon + \\epsilon^2 - {x}^2}{\\epsilon}\\quad && \\text{since } (x + \\epsilon)^2 = {x}^2 + 2x\\epsilon + \\epsilon^2\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{2x\\epsilon + \\epsilon^2}{\\epsilon}\\quad && \\text{since the two } {x}^2 \\text{ cancel out}\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim \\, (2x + \\epsilon)\\quad && \\text{since } 2x\\epsilon \\text{ and } \\epsilon^2 \\text{ can both be divided by } \\epsilon\\\\\n", - "& = 2 x\n", + "& = 2x &&\n", "\\end{align*}\n", "$\n", "\n", @@ -2343,7 +2342,7 @@ "\n", "The $f'$ notation is Lagrange's notation, while $\\dfrac{\\mathrm{d}f}{\\mathrm{d}x}$ is Leibniz's notation.\n", "\n", - "There are also other less common notations, such as Newton's notation $\\dot y$ (assuming $y = f(x)$) or Euler's notation $\\mathrm{D}f$." + "There are other less common notations, such as Newton's notation $\\dot y$ (assuming $y = f(x)$) or Euler's notation $\\mathrm{D}f$." ] }, { @@ -4164,7 +4163,7 @@ "\n", "It is possible to chain many functions. For example, if $f(x)=g(h(i(x)))$, and we define $y=i(x)$ and $z=h(y)$, then $\\dfrac{\\mathrm{d}f}{\\mathrm{d}x} = \\dfrac{\\mathrm{d}f}{\\mathrm{d}z} \\dfrac{\\mathrm{d}z}{\\mathrm{d}y} \\dfrac{\\mathrm{d}y}{\\mathrm{d}x}$. Using Lagrange's notation, we get $f'(x)=g'(z)\\,h'(y)\\,i'(x)=g'(h(i(x)))\\,h'(i(x))\\,i'(x)$\n", "\n", - "The chain rule is crucial in Deep Learning, as a neural network is basically as a long composition of functions. For example, a 3-layer dense neural network corresponds to the following function: $f(\\mathbf{x})=\\operatorname{Dense}_3(\\operatorname{Dense}_2(\\operatorname{Dense}_1(\\mathbf{x})))$ (in this example, $\\operatorname{Dense}_3$ is the output layer).\n" + "The chain rule is crucial in Deep Learning, as a neural network is basically a long composition of functions. For example, a 3-layer dense neural network corresponds to the following function: $f(\\mathbf{x})=\\operatorname{Dense}_3(\\operatorname{Dense}_2(\\operatorname{Dense}_1(\\mathbf{x})))$ (in this example, $\\operatorname{Dense}_3$ is the output layer).\n" ] }, { @@ -4296,7 +4295,7 @@ "\n", "At each iteration, the step size is proportional to the slope, so the process naturally slows down as it approaches a local minimum. Each step is also proportional to the learning rate: a parameter of the Gradient Descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a **hyperparameter**).\n", "\n", - "Here is an animation of this process on the function $f(x)=\\dfrac{1}{4}x^4 - x^2 + \\dfrac{1}{2}$:" + "Here is an animation of this process for the function $f(x)=\\dfrac{1}{4}x^4 - x^2 + \\dfrac{1}{2}$:" ] }, { @@ -5253,8 +5252,6 @@ ], "source": [ "#@title\n", - "from mpl_toolkits.mplot3d import Axes3D\n", - "\n", "def plot_3d(f, title):\n", " fig = plt.figure(figsize=(8, 5))\n", " ax = fig.add_subplot(111, projection='3d')\n", @@ -5367,7 +5364,7 @@ "$\\nabla f(\\mathbf{x}_\\mathrm{A}) = \\begin{pmatrix}\n", "\\dfrac{\\partial f}{\\partial x_1}(\\mathbf{x}_\\mathrm{A})\\\\\n", "\\dfrac{\\partial f}{\\partial x_2}(\\mathbf{x}_\\mathrm{A})\\\\\n", - "\\vdots\\\\\\\n", + "\\vdots\\\\\n", "\\dfrac{\\partial f}{\\partial x_n}(\\mathbf{x}_\\mathrm{A})\\\\\n", "\\end{pmatrix}$" ] @@ -5407,7 +5404,7 @@ "source": [ "# Jacobians\n", "\n", - "Until now we have only considered functions that output a scalar, but it is possible to output vectors instead. For example, a classification neural network typically outputs one probability for each class, so if there are $m$ classes, the neural network will output an $d$-dimensional vector for each input.\n", + "Until now, we have only considered functions that output a scalar, but it is possible to output vectors instead. For example, a classification neural network typically outputs one probability for each class, so if there are $m$ classes, the neural network will output a $d$-dimensional vector for each input.\n", "\n", "In Deep Learning we generally only need to differentiate the loss function, which almost always outputs a single scalar number. But suppose for a second that you want to differentiate a function $\\mathbf{f}(\\mathbf{x})$ which outputs $d$-dimensional vectors. The good news is that you can treat each _output_ dimension independently of the others. This will give you a partial derivative for each input dimension and each output dimension. If you put them all in a single matrix, with one column per input dimension and one row per output dimension, you get the so-called **Jacobian matrix**.\n", "\n", @@ -5532,7 +5529,7 @@ "& = \\underset{\\epsilon \\to 0}\\lim\\dfrac{g(x+\\epsilon)h(x+\\epsilon) - g(x)h(x+\\epsilon)}{\\epsilon} + \\underset{\\epsilon \\to 0}\\lim\\dfrac{g(x)h(x + \\epsilon) - g(x)h(x)}{\\epsilon} && \\quad \\text{since the limit of a sum is the sum of the limits}\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, \\underset{\\epsilon \\to 0}\\lim{\\left[g(x)\\dfrac{h(x + \\epsilon) - h(x)}{\\epsilon}\\right]} && \\quad \\text{factorizing }h(x+\\epsilon) \\text{ and } g(x)\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, g(x)\\underset{\\epsilon \\to 0}\\lim{\\dfrac{h(x + \\epsilon) - h(x)}{\\epsilon}} && \\quad \\text{taking } g(x) \\text{ out of the limit since it does not depend on }\\epsilon\\\\\n", - "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, g(x)h'(x) && \\quad \\text{using the definition of h'(x)}\\\\\n", + "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}h(x+\\epsilon)\\right]} \\,+\\, g(x)h'(x) && \\quad \\text{using the definition of }h'(x)\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}\\right]}\\underset{\\epsilon \\to 0}\\lim{h(x+\\epsilon)} + g(x)h'(x) && \\quad \\text{since the limit of a product is the product of the limits}\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{g(x+\\epsilon) - g(x)}{\\epsilon}\\right]}h(x) + h(x)g'(x) && \\quad \\text{since } h(x) \\text{ is continuous}\\\\\n", "& = g'(x)h(x) + g(x)h'(x) && \\quad \\text{using the definition of }g'(x)\n", @@ -5620,7 +5617,7 @@ "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{1}{\\epsilon} \\, \\ln\\left(1 + \\dfrac{\\epsilon}{x}\\right)\\right]} && \\quad \\text{just moving things around a bit}\\\\\n", "& = \\underset{\\epsilon \\to 0}\\lim{\\left[\\dfrac{1}{xu} \\, \\ln\\left(1 + u\\right)\\right]} && \\quad \\text{defining }u=\\dfrac{\\epsilon}{x} \\text{ and thus } \\epsilon=xu\\\\\n", "& = \\underset{u \\to 0}\\lim{\\left[\\dfrac{1}{xu} \\, \\ln\\left(1 + u\\right)\\right]} && \\quad \\text{replacing } \\underset{\\epsilon \\to 0}\\lim \\text{ with } \\underset{u \\to 0}\\lim \\text{ since }\\underset{\\epsilon \\to 0}\\lim u=0\\\\\n", - "& = \\underset{u \\to 0}\\lim{\\left[\\dfrac{1}{x} \\, \\ln\\left((1 + u)^{1/u}\\right)\\right]} && \\quad \\text{since }a\\ln(b)=\\ln(a^b)\\\\\n", + "& = \\underset{u \\to 0}\\lim{\\left[\\dfrac{1}{x} \\, \\ln\\left((1 + u)^{1/u}\\right)\\right]} && \\quad \\text{since }a\\ln(b)=\\ln(b^a)\\\\\n", "& = \\dfrac{1}{x}\\underset{u \\to 0}\\lim{\\left[\\ln\\left((1 + u)^{1/u}\\right)\\right]} && \\quad \\text{taking }\\dfrac{1}{x} \\text{ out since it does not depend on }\\epsilon\\\\\n", "& = \\dfrac{1}{x}\\ln\\left(\\underset{u \\to 0}\\lim{(1 + u)^{1/u}}\\right) && \\quad \\text{taking }\\ln\\text{ out since it is a continuous function}\\\\\n", "& = \\dfrac{1}{x}\\ln(e) && \\quad \\text{since }e=\\underset{u \\to 0}\\lim{(1 + u)^{1/u}}\\\\\n", @@ -5644,9 +5641,9 @@ "\n", "We know the derivative of the exponential: $g'(x)=e^x$. We also know the derivative of the natural logarithm: $\\ln'(x)=\\dfrac{1}{x}$ so $h'(x)=\\dfrac{r}{x}$. Therefore:\n", "\n", - "$f'(x) = \\dfrac{r}{x}\\exp\\left({\\ln(x^r)}\\right)$\n", + "$f'(x) = \\dfrac{r}{x} e^{\\ln(x^r)}$\n", "\n", - "Since $a = \\exp(\\ln(a))$, this equation simplifies to:\n", + "Since $e^{\\ln(a)} = a$, this equation simplifies to:\n", "\n", "$f'(x) = \\dfrac{r}{x} x^r$\n", "\n", @@ -5657,7 +5654,7 @@ "Note that the power rule works for any $r \\neq 0$, including negative numbers and real numbers. For example:\n", "\n", "* if $f(x) = \\dfrac{1}{x} = x^{-1}$, then $f'(x)=-x^{-2}=-\\dfrac{1}{x^2}$.\n", - "* if $f(x) = \\sqrt(x) = x^{1/2}$, then $f'(x)=\\dfrac{1}{2}x^{-1/2}=\\dfrac{1}{2\\sqrt{x}}$" + "* if $f(x) = \\sqrt{x} = x^{1/2}$, then $f'(x)=\\dfrac{1}{2}x^{-1/2}=\\dfrac{1}{2\\sqrt{x}}$" ] }, { @@ -5800,17 +5797,17 @@ "source": [ "The circle is the unit circle (radius=1).\n", "\n", - "Assuming $0 < \\theta < \\dfrac{\\pi}{2}$, the area of the blue triangle (area $\\mathrm{A}$) is equal to its height ($\\sin(\\theta)$), times its base ($\\cos(\\theta)$), divided by 2. So $\\mathrm{A} = \\dfrac{1}{2}\\sin(\\theta)\\cos(\\theta)$.\n", + "Assuming $0 < \\theta < \\dfrac{\\pi}{2}$, the area of the blue triangle (area $\\mathrm{A}$) is equal to its height ($\\sin(\\theta)$) times its base ($\\cos(\\theta)$) divided by 2. So $\\mathrm{A} = \\dfrac{1}{2}\\sin(\\theta)\\cos(\\theta)$.\n", "\n", "The unit circle has an area of $\\pi$, so the circular sector (in the shape of a pizza slice) has an area of A + B = $\\pi\\dfrac{\\theta}{2\\pi} = \\dfrac{\\theta}{2}$.\n", "\n", - "Next, the large triangle (A + B + C) has an area equal to its height ($\\tan(\\theta)$) multiplied by its base (1) divided by 2, so A + B + C = $\\dfrac{\\tan(\\theta)}{2}$.\n", + "Next, the large triangle (A + B + C) has an area equal to its height ($\\tan(\\theta)$) multiplied by its base (of length 1) divided by 2, so A + B + C = $\\dfrac{\\tan(\\theta)}{2}$.\n", "\n", "When $0 < \\theta < \\dfrac{\\pi}{2}$, we have $\\mathrm{A} < \\mathrm{A} + \\mathrm{B} < \\mathrm{A} + \\mathrm{B} + \\mathrm{C}$, therefore:\n", "\n", "$\\dfrac{1}{2}\\sin(\\theta)\\cos(\\theta) < \\dfrac{\\theta}{2} < \\dfrac{\\tan(\\theta)}{2}$\n", "\n", - "We can multiply all the terms by 2 to get rid of the $\\dfrac{1}{2}$ factors. We can also divide by $\\sin(\\theta)$, which is stricly positive (assuming $0 < \\theta < \\dfrac{\\pi}{2}$), so the inequalities still hold:\n", + "We can multiply all the terms by 2 to get rid of the $\\dfrac{1}{2}$ factors. We can also divide by $\\sin(\\theta)$, which is strictly positive (assuming $0 < \\theta < \\dfrac{\\pi}{2}$), so the inequalities still hold:\n", "\n", "$cos(\\theta) < \\dfrac{\\theta}{\\sin(\\theta)} < \\dfrac{\\tan(\\theta)}{\\sin(\\theta)}$\n", "\n", @@ -5843,7 +5840,7 @@ "\n", "$\\dfrac{1}{cos(\\theta)} > \\dfrac{\\sin(\\theta)}{\\theta} > \\cos(\\theta)$\n", "\n", - "assuming $-\\dfrac{\\theta}{2} < \\theta < \\dfrac{\\pi}{2}$ and $\\theta \\neq 0$\n", + "assuming $-\\dfrac{\\pi}{2} < \\theta < \\dfrac{\\pi}{2}$ and $\\theta \\neq 0$\n", "