Gradient descent makes a machine learnable, provided a suitable objective can be applied. The parameters that constitute the machine can be adjusted by gradient descent. But this magic only operates when the calculations involving those parameters are differentiable. The reason is simple: only then can the 'gradient' be obtained through differentiation, so that 'descent' can be applied to it. This means that if the whole system is differentiable, the whole system can be adjusted.
Differentiable programming is built around the same idea of a 'fully differentiable system'. Not a single module or a few parts of the system, but all modules can be adjusted at once if they are connected in a differentiable way. This paradigm is what we call 'end-to-end', and everything about it rests on a single algorithmic root: gradient descent.
In gradient descent, the gradient measures how much each parameter contributes to the system's output. If parameter A's gradient is 0.7, it means that a small change in A moves the system's output by a factor of 0.7. Once we know the gradient, we can adjust A so that the system does better. If we can compute the gradients for all parameters, we can 'track' the contribution of each one and adjust them all. Sounds simple.
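To make this concrete, here is a minimal sketch in Python using PyTorch's autograd (the post does not commit to any framework, so the library choice and the toy numbers below are my own assumptions): a single parameter plays the role of A, we read off its gradient, and we take one descent step.

import torch

a = torch.tensor(2.0, requires_grad=True)  # the parameter 'A'
x = torch.tensor(3.0)                      # some fixed input
loss = (a * x - 5.0) ** 2                  # a toy differentiable objective
loss.backward()                            # autograd computes d(loss)/d(a)
print(a.grad)                              # how much 'a' contributed to the loss
with torch.no_grad():
    a -= 0.1 * a.grad                      # one gradient-descent step on 'a'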
But building a fully differentiable end-to-end system is not that easy; we run into obstacles, because many existing systems contain lots of non-differentiable operations. One of them is indexing. Say a list A contains three numbers (1, 2, 3) in a row. If we want the second element, we access it with the index '1' (if indices start at '0'). But is this process differentiable? No. Why not?
A 'differentiable operation' is one that can be decomposed into basic arithmetic calculations that are connected to each other and continuous. Continuity here simply means that the variables participating in those calculations are real numbers. Indexing, however, works on integers, not real numbers. We can mitigate this 'integer problem' with a smooth rounding approximation such as the function below; this way we can embed an integer-valued step in the middle of a chain of real-valued calculations.
x - sin(2πx) / (2π)
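A quick numeric check of this function (my own sketch, not from the original text): it passes through every integer, its derivative 1 - cos(2πx) drops to zero exactly at the integers, and in between it stays close to ordinary rounding, so it behaves like a smooth staircase.

import math

def soft_round(x):
    # smooth staircase: equals x at every integer and has zero slope there
    return x - math.sin(2 * math.pi * x) / (2 * math.pi)

def soft_round_grad(x):
    # derivative of the function above: 1 - cos(2*pi*x)
    return 1 - math.cos(2 * math.pi * x)

for x in [0.9, 1.0, 1.1, 1.5]:
    print(x, round(soft_round(x), 3), round(soft_round_grad(x), 3))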
But the main problem of indexing remains: there is no differentiable connection between the index and the indexed outcome. Consider the case where we feed our neural network N1 a random input and it outputs the index '4'. The indexed outcome from the array A is '5' (A[4] = 5), that '5' becomes the input to the next neural network N2, and the final output is '6'. In this chain, every calculation is differentiable except the step that produces A[4] = 5. Why is A[4] = 5 not differentiable? Because A[4] = 5 is a definition, a fixed convention. This indexing step is just a preliminary of the whole program: it is rigid, not a 'variable', not learnable, and it contains no arithmetic calculation at all. It is just a MAPPING.
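The break is easy to see in code. In the sketch below (my own illustration; the toy array and the stand-in networks are assumptions), the gradient stops at the integer lookup, so nothing upstream of it receives a learning signal.

import torch
import torch.nn as nn

A = torch.tensor([1., 2., 3., 4., 5.])       # the lookup table
N1 = nn.Linear(3, 1)                         # stand-in for the first network
N2 = nn.Linear(1, 1)                         # stand-in for the second network

x = torch.randn(3)
idx = int(N1(x).round().clamp(0, 4).item())  # .item() leaves the autograd graph
value = A[idx]                               # plain lookup: no gradient w.r.t. idx
out = N2(value.unsqueeze(0))
out.sum().backward()
print(N2.weight.grad)                        # populated: N2 sits after the break
print(N1.weight.grad)                        # None: no path back through A[idx]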
Fortunately, a widely known trick for making indexing differentiable exists: build an additional neural network and train it to copy the MAPPING. Neural networks are differentiable universal function approximators, and in theory they can MEMORIZE the entire MAPPING. Before the main training, we prepare such a pre-trained MAPPING network with near-perfect accuracy (this is easy to achieve, because the network only has to memorize the index-to-outcome pairs). Then, if we substitute the differentiable copied-MAPPING network for the non-differentiable MAPPING and freeze it, we finally get a fully differentiable program with end-to-end training enabled.
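Here is a minimal sketch of that substitution (my own illustration; the tiny table, the network size, and the training loop are all assumptions): first overfit a small network on the index-to-value pairs, then freeze it and use it in place of the lookup so that gradients can flow straight through.

import torch
import torch.nn as nn

A = torch.tensor([1., 2., 3., 4., 5.])              # the original MAPPING

# 1) Train a small network to memorize index -> A[index].
mapper = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-2)
idxs = torch.arange(5, dtype=torch.float32).unsqueeze(1)
vals = A.unsqueeze(1)
for _ in range(2000):
    opt.zero_grad()
    loss = ((mapper(idxs) - vals) ** 2).mean()
    loss.backward()
    opt.step()

# 2) Freeze the copied MAPPING and splice it into the pipeline.
for p in mapper.parameters():
    p.requires_grad_(False)

N1 = nn.Linear(3, 1)                                 # produces a (soft) index
N2 = nn.Linear(1, 1)                                 # consumes the looked-up value
x = torch.randn(3)
out = N2(mapper(N1(x)))                              # differentiable end to end
out.sum().backward()
print(N1.weight.grad)                                # now receives a gradient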