Anyone who implements neural networks will tell you to numerically check your gradient calculations. It is very easy to introduce small errors into the back-propagation equations that leave the network looking like it works, only not quite as well as it should. A simple numerical check confirms that your gradients are actually correct.
This tutorial goes through the process of numerically computing the derivatives for a simple neural network with one hidden layer. Exactly the same process is used for checking the derivatives of e.g. LSTMs, which are much more complicated.
The starting point is the error measure you are using for your neural network. If you are using mean squared error, then the quantity we want to minimise is \(E = \dfrac{1}{N}\sum^N_{n=1} \sum^K_{k=1} (y_{nk} - x_{nk})^2 \), where \(y_{nk}\) is the desired label and \(x_{nk}\) is the network output for the \(n\)-th example. \(N\) is the minibatch size; if you are just putting in a single input at a time then \(N\) is one. This single number \(E\) tells us how good our network is at predicting the output: the lower the value of \(E\), the better. Our aim is to adjust the weights so that \(E\) gets smaller.
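As a concrete reference, here is a minimal numpy sketch of this error computation (the shapes are my assumption: rows index the \(N\) minibatch examples and columns the \(K\) outputs):

```python
import numpy as np

def mse_error(y, x):
    """Mean squared error E over a minibatch.
    y: desired labels, shape (N, K); x: network outputs, shape (N, K)."""
    N = y.shape[0]
    return np.sum((y - x) ** 2) / N
```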
For a neural net with 1 hidden layer we have 2 sets of weights and 2 sets of biases. The output is computed as follows:
\( a = \sigma (i W_1 + b_1) \)
\( x = \sigma (a W_2 + b_2) \)
Here \(i\) is our input vector, \(a\) is the hidden layer activation, \(x\) is the network output, and \(\sigma\) is the sigmoid function.
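A forward pass matching these two equations might look like the following sketch (the row-vector convention, array shapes, and function names are my own assumptions, not taken from nn_2layer.py):

```python
def sigmoid(z):
    # element-wise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def forward(i, W1, b1, W2, b2):
    """Forward pass of the single-hidden-layer network.
    i: inputs, shape (N, D); W1: (D, H); b1: (H,); W2: (H, K); b2: (K,)."""
    a = sigmoid(i @ W1 + b1)   # hidden layer activation
    x = sigmoid(a @ W2 + b2)   # network output
    return x
```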
The way to numerically check the gradient is to pick one of the weights, e.g. element (1,1) of \(W_1\), and to add and subtract a small number \(\epsilon\) to/from it, e.g. \(\epsilon = 0.0001\). This gives us \(W_1^+\) and \(W_1^-\) (note that only element (1,1) is changed; all the other weights stay the same for now). We then compute \(x^+\) and \(x^-\) using the slightly modified weights, and from those \(E^+\) and \(E^-\). The numerical estimate of the gradient of \(E\) with respect to that weight is then \((E^+ - E^-)/(2\epsilon)\).
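Using the helpers sketched above, the check for a single element of \(W_1\) is only a few lines (again a sketch with assumed names, not the code from nn_2layer.py):

```python
def numerical_grad_element(i, y, W1, b1, W2, b2, row=0, col=0, eps=1e-4):
    """Central-difference estimate of dE/dW1[row, col]."""
    W1_plus, W1_minus = W1.copy(), W1.copy()
    W1_plus[row, col] += eps    # W1^+ : only this one element is perturbed
    W1_minus[row, col] -= eps   # W1^-
    E_plus = mse_error(y, forward(i, W1_plus, b1, W2, b2))
    E_minus = mse_error(y, forward(i, W1_minus, b1, W2, b2))
    return (E_plus - E_minus) / (2 * eps)
```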
Now that we have the derivative of \(E\) with respect to weight (1,1) of \(W_1\), we have to do the same for all the other weights and biases. The procedure is identical: we just add/subtract the small number to/from a different parameter each time. The resulting matrix of derivatives should match the gradient calculated by back-propagation very closely (not bit-for-bit exactly, since the finite-difference estimate carries a small truncation error). This Python code: nn_2layer.py implements both back-propagation and the numerical check for a simple single hidden layer nn.
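Looping that single-element check over a whole weight matrix, and comparing against the back-propagated gradient with a relative error, could look like this (a sketch only; nn_2layer.py may organize the comparison differently):

```python
def numerical_grad_W1(i, y, W1, b1, W2, b2, eps=1e-4):
    """Numerical gradient of E w.r.t. every element of W1.
    The same loop is repeated for W2, b1, and b2."""
    grad = np.zeros_like(W1)
    for r in range(W1.shape[0]):
        for c in range(W1.shape[1]):
            grad[r, c] = numerical_grad_element(i, y, W1, b1, W2, b2, r, c, eps)
    return grad

# Compare with the back-propagated gradient, e.g.:
# rel_err = np.abs(grad_num - grad_bp) / (np.abs(grad_num) + np.abs(grad_bp) + 1e-12)
```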