Automatic Differentiation Explained
How Deep Learning libraries handle gradients
How do you compute derivatives of a function?
By hand, you’d derive it analytically and then write a function to compute it. Or you can write a function that approximates it numerically.
```python
def f(x):
    return x*x                    # function

def df(x):
    return 2*x                    # analytical derivative

def df_approx(x, f, e):
    return (f(x+e) - f(x)) / e    # numerical approximation
This can get quite cumbersome for complex functions, and you can’t possibly write out the analytical form of the derivative of a deep neural network. The approximation approach is not very friendly when we have a lot of independent variables. This is where automatic differentiation comes in. The basic idea is that you tell each basic operation (addition, multiplication, etc.) how to compute its own derivative. You can then compute the derivative of any complex function using the chain rule, since any complex function is a composition of simpler ones.
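As a toy illustration of this idea (a minimal sketch of my own, not taken from any particular library), each primitive can be paired with its own derivative rule, and a helper can apply the chain rule to compose them:

```python
# Each primitive is represented as a pair (f, df): the map and its derivative rule.
square = (lambda x: x * x, lambda x: 2 * x)
double = (lambda x: 2 * x, lambda x: 2)

def compose(outer, inner):
    """Chain rule: (v o u)'(x) = v'(u(x)) * u'(x)."""
    v, dv = outer
    u, du = inner
    return (lambda x: v(u(x)),
            lambda x: dv(u(x)) * du(x))

z, dz = compose(square, double)   # z(x) = (2x)^2 = 4x^2
print(z(3), dz(3))                # 36, and dz/dx = 8x -> 24
```

The point is that nobody ever wrote down the derivative of `z` itself; it falls out of the two primitive rules and the chain rule.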
There are two approaches to auto-diff: forward and backward. In forward diff, you compute derivatives in order from the leaf nodes to the root node, whereas in backward diff, you compute the derivative of the root node first and work down the branches.
z(x) = v(u(x))
forward diff  --> z'(x) = u'(x) * v'(u(x))
backward diff --> z'(x) = v'(u(x)) * u'(x)
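The two orderings multiply the same local derivatives and therefore agree; they differ only in whether you start from the input side or the output side. A small numerical check (my own example, using u(x) = x² and v(t) = sin t) makes this concrete:

```python
import math

# z(x) = v(u(x)) with u(x) = x**2 and v(t) = sin(t)
u,  du = lambda x: x * x,        lambda x: 2 * x
v,  dv = lambda t: math.sin(t),  lambda t: math.cos(t)

x = 1.5
# Forward diff: start at the input and push the derivative toward the root.
forward = du(x)
forward = dv(u(x)) * forward
# Backward diff: start at the root and pull the derivative toward the input.
backward = dv(u(x))
backward = backward * du(x)

print(forward, backward)  # same value, different order of accumulation
```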
Consider the following example:
z = x*x + x*y + 2. The computational graph is given below.
Forward Diff
Every node provides its parent node with its own derivative with respect to the input variables. The parent node combines these (according to its own operation) and sends the result up to its parent.
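This forward pass can be sketched in code. The classes below are a minimal version of my own, assuming each node exposes a `derivative(wrt)` method; the actual classes in the notebook may differ:

```python
class Var:
    """Leaf node: an input variable."""
    def __init__(self, value):
        self.value = value
    def derivative(self, wrt):
        return 1.0 if self is wrt else 0.0

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.value = a.value + b.value
    def derivative(self, wrt):
        # d(a+b)/dx = da/dx + db/dx
        return self.a.derivative(wrt) + self.b.derivative(wrt)

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.value = a.value * b.value
    def derivative(self, wrt):
        # product rule: d(a*b)/dx = (da/dx)*b + a*(db/dx)
        return (self.a.derivative(wrt) * self.b.value
                + self.a.value * self.b.derivative(wrt))

x, y = Var(3.0), Var(2.0)
z = Add(Add(Mul(x, x), Mul(x, y)), Var(2.0))  # z = x*x + x*y + 2
print(z.value)          # 17.0
print(z.derivative(x))  # dz/dx = 2x + y = 8.0
print(z.derivative(y))  # dz/dy = x = 3.0
```

Notice that each call to `derivative` walks the whole graph from the leaves up, once per input variable we differentiate with respect to.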
Backward Diff / Backpropagation
In this case, the root node provides each of its child nodes with its derivative with respect to that child.
Please read through the following code carefully and try to understand what the
backprop() function of each class is doing. Verify it against the figures above; in doing so you will get a rough understanding of what’s going on. Unless you plan to go deeper into this subject, that level of understanding is fair for deep learning practitioners.
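Since the notebook’s code is not reproduced here, the following is a minimal sketch of what such classes might look like (names and structure are my own assumption): each node’s `backprop(grad)` receives the gradient of the root with respect to itself and passes it on to each child, scaled by the local derivative.

```python
class Var:
    """Leaf node: an input variable with an accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.gradient = 0.0
    def backprop(self, grad):
        self.gradient += grad   # accumulate over all paths to this variable

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.value = a.value + b.value
    def backprop(self, grad):
        # d(a+b)/da = 1 and d(a+b)/db = 1
        self.a.backprop(grad)
        self.b.backprop(grad)

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.value = a.value * b.value
    def backprop(self, grad):
        # d(a*b)/da = b and d(a*b)/db = a
        self.a.backprop(grad * self.b.value)
        self.b.backprop(grad * self.a.value)

x, y = Var(3.0), Var(2.0)
z = Add(Add(Mul(x, x), Mul(x, y)), Var(2.0))  # z = x*x + x*y + 2
z.backprop(1.0)     # seed with dz/dz = 1
print(x.gradient)   # 8.0
print(y.gradient)   # 3.0
```

One call at the root fills in the gradient of every variable at once.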
Backward over Forward Diff
In deep neural networks, the number of inputs is usually much larger than the number of outputs. If we were to use forward diff, we would need to call
graph.derivative() on every input variable. Moreover, if op1 and op2 are two nodes in the computational graph, then
graph.derivative() will recompute all derivatives above their common ancestor twice, once for op1 and once for op2. With backpropagation, we call
graph.backprop() once and can then access the
gradient of every variable. One disadvantage of backpropagation is that some node values need to be recomputed, which is not the case in forward diff. This can be solved by caching the computed values and reusing them during backpropagation.
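To make the cost asymmetry concrete, here is a hypothetical counting experiment of my own: for z = x1 + x2 + … + x8, forward diff re-traverses the entire graph once per input variable, while a single backprop() visits each node exactly once.

```python
class Var:
    def __init__(self, value):
        self.value = value
        self.gradient = 0.0
    def derivative(self, wrt, counter):
        counter[0] += 1
        return 1.0 if self is wrt else 0.0
    def backprop(self, grad, counter):
        counter[0] += 1
        self.gradient += grad

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.value = a.value + b.value
    def derivative(self, wrt, counter):
        counter[0] += 1
        return self.a.derivative(wrt, counter) + self.b.derivative(wrt, counter)
    def backprop(self, grad, counter):
        counter[0] += 1
        self.a.backprop(grad, counter)
        self.b.backprop(grad, counter)

inputs = [Var(1.0) for _ in range(8)]
root = inputs[0]
for v in inputs[1:]:
    root = Add(root, v)              # z = x1 + x2 + ... + x8  (15 nodes total)

fwd = [0]                            # node-visit counter for forward diff
grads_fwd = [root.derivative(x, fwd) for x in inputs]  # one traversal per input

bwd = [0]                            # node-visit counter for backprop
root.backprop(1.0, bwd)              # a single traversal fills every gradient

print(fwd[0], bwd[0])                # 120 visits vs 15 visits
```

With 8 inputs the forward approach visits 8× as many nodes, and the gap grows linearly with the number of inputs, which is why reverse mode dominates in deep learning.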
Based on this notebook.